Pandas 3.0 String Type Compatibility Breaking HDMF Data Ingestion #1384

Users commonly build HDMF containers (Units, Electrodes, DynamicTables) from pandas DataFrames or numpy arrays. In pandas 2.x, the dedicated string array types (`StringArray` and `ArrowStringArray`) were opt-in features, but pandas 3.0 made PyArrow-backed strings (`ArrowStringArray`) the default for string columns. This breaks HDMF's `VectorData` creation with a `TypeError` when users pass the underlying array of a string column (for example via `Series.values`), because that array now has the new default string type.

```python
import pandas as pd
from hdmf.common import VectorData

df = pd.DataFrame({'animal': ['cat', 'dog', 'bird']})  # Uses ArrowStringArray in pandas 3.0
vector_data = VectorData(name='animal', description='names', data=df['animal'].values)
# TypeError: VectorData.__init__: incorrect type for 'data' (got 'ArrowStringArray', ...)
```

This error occurs because HDMF's type validation in `src/hdmf/utils.py` accepts only `np.ndarray`, `list`, `tuple`, `h5py.Dataset`, and optionally `ZarrArray`; the pandas string array types (`StringArray` and `ArrowStringArray`) are not among them. The issue technically already existed for pandas 2.0 users who explicitly requested `dtype='string'` or `dtype='string[pyarrow]'`, but it was far less pressing because those types were opt-in. With pandas 3.0, **every DataFrame with strings** (such as one loaded from a CSV file with pandas) now uses `ArrowStringArray` by default, which makes this a critical compatibility issue for all users who process string data with pandas and HDMF.
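The mismatch can be checked without HDMF installed. The sketch below uses a simplified stand-in for the accepted-types check: the tuple `ACCEPTED_TYPES` and the helper `is_accepted` are hypothetical names, cover only the built-in and NumPy types from the list above, and are not HDMF's actual validation code. It opts in to `dtype='string'` so that it behaves the same way on pandas 2.x and 3.0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the accepted types (h5py.Dataset and ZarrArray omitted)
ACCEPTED_TYPES = (np.ndarray, list, tuple)


def is_accepted(data):
    """Return True if `data` is an instance of one of the stand-in accepted types."""
    return isinstance(data, ACCEPTED_TYPES)


# Explicitly request the dedicated string dtype so the result does not depend
# on the pandas version's default for string columns
series = pd.Series(['cat', 'dog', 'bird'], dtype='string')

raw = series.values            # a pandas extension array, not an np.ndarray
converted = series.to_numpy()  # a plain NumPy array

print(type(raw).__name__, is_accepted(raw))
print(type(converted).__name__, is_accepted(converted))
```

Until HDMF handles these types itself, passing `series.to_numpy()` (or `series.tolist()`) instead of `series.values` should sidestep the check.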

I propose accepting pandas string array types (`StringArray` and `ArrowStringArray`) in HDMF's type validation and automatically converting them to numpy arrays (via `.to_numpy()`) with a `UserWarning` to inform users. This balances user experience (works out of the box) with explicitness (users are notified and can pre-convert to avoid the warning). 
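A minimal sketch of what the proposed conversion could look like, assuming it runs before the existing type check. The function name `coerce_pandas_strings` and the warning text are illustrative only and do not correspond to existing HDMF code; missing values are not given any special handling here.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.arrays import ArrowStringArray, StringArray


def coerce_pandas_strings(data):
    """Convert pandas string arrays to NumPy arrays with a warning.

    Hypothetical helper: any other input is returned unchanged.
    """
    if isinstance(data, (StringArray, ArrowStringArray)):
        warnings.warn(
            f"Converting pandas {type(data).__name__} to a numpy array; "
            "call .to_numpy() beforehand to avoid this warning.",
            UserWarning,
            stacklevel=2,
        )
        return data.to_numpy()
    return data


# Usage: the string array is converted and a UserWarning is recorded
strings = pd.array(['cat', 'dog', 'bird'], dtype='string')
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = coerce_pandas_strings(strings)

print(type(result).__name__, [str(w.category.__name__) for w in caught])
```

Users who pre-convert with `.to_numpy()` never reach the `isinstance` branch, so they see no warning.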

**Pandas 3.0 Release Notes**: https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#pyarrow-backed-strings-as-the-default-string-type

<details>
<summary>Click to expand: Full reproduction script</summary>

```python
#!/usr/bin/env python3
# /// script
# dependencies = [
#   "pandas>=3.0.0",
#   "hdmf",
#   "numpy",
# ]
# ///
"""
Minimal script to reproduce the Pandas 3.0 ArrowStringArray issue with HDMF.

To run with uv:
    uv run reproduce_pandas3_issue.py

Expected behavior: Script should fail with TypeError about ArrowStringArray
"""

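import pandas as pd
from hdmf.common import VectorData

# Verify we're using pandas 3.0+. Compare the major version as an integer:
# comparing version strings lexicographically would misorder e.g. '10.0.0'.
pandas_major = int(pd.__version__.split('.')[0])
assert pandas_major >= 3, f"This script requires pandas 3.0+, found {pd.__version__}"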

# Create a simple DataFrame with string data
# In pandas 3.0+, this automatically uses ArrowStringArray
df = pd.DataFrame({
    'animal': ['cat', 'dog', 'bird'],
    'sound': ['meow', 'woof', 'chirp']
})

# This will FAIL with pandas 3.0+ because .values returns ArrowStringArray
# which HDMF doesn't recognize as a valid type
vector_data = VectorData(
    name='animal',
    description='Animal names',
    data=df['animal'].values  # Returns ArrowStringArray in pandas 3.0+
)

print("SUCCESS: VectorData created (this means the issue is fixed!)")
```

Save this as `reproduce_pandas3_issue.py` and run it with `uv run reproduce_pandas3_issue.py`.

</details>

As a further complication that can be discussed in a separate issue, nullable integers (`IntegerArray`) and nullable booleans (`BooleanArray`) currently fail in a similar way to string types. Their behavior with `.to_numpy()` is more complicated, however, and may warrant raising an error instead of converting automatically. When missing values are present, `.to_numpy()` converts nullable integers to `float64` (losing type information) and nullable booleans to `object` dtype with `pd.NA` values that may not serialize correctly to HDF5.
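The following sketch illustrates why nullable arrays are harder to convert silently. It prints the dtypes produced by the default `.to_numpy()` call without asserting them, since the default may depend on the installed pandas version, and then shows an explicit conversion in which the lossy step is spelled out by the caller. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ints = pd.array([1, 2, None], dtype="Int64")
bools = pd.array([True, False, None], dtype="boolean")

# Default conversion: inspect the resulting dtypes on the installed pandas version
print(ints.to_numpy().dtype, bools.to_numpy().dtype)

# Explicit conversion: the caller chooses float64 and NaN for missing entries,
# so the loss of the integer type is a visible decision rather than a side effect
ints_as_float = ints.to_numpy(dtype="float64", na_value=np.nan)

# A check like this could decide between converting and raising an error
has_missing = bool(ints.isna().any())
print(ints_as_float, has_missing)
```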


