Pandas 3.0 String Type Compatibility Breaking HDMF Data Ingestion #1384

@h-mayorquin

Users commonly convert data to HDMF containers (Units, Electrodes, DynamicTables) from pandas DataFrames or numpy arrays. Dedicated pandas string types (StringArray and ArrowStringArray) were already available as opt-in features in pandas 2.x and earlier, but pandas 3.0 infers a dedicated string dtype by default for all string columns, backed by PyArrow (ArrowStringArray) when PyArrow is installed. This breaks HDMF's VectorData creation with a TypeError when users pass string data taken from a pandas Series with the new default string type.

import pandas as pd
from hdmf.common import VectorData

df = pd.DataFrame({'animal': ['cat', 'dog', 'bird']})  # Uses ArrowStringArray in pandas 3.0
vector_data = VectorData(name='animal', description='names', data=df['animal'].values)
# TypeError: VectorData.__init__: incorrect type for 'data' (got 'ArrowStringArray', ...)

This error occurs because HDMF's type validation in src/hdmf/utils.py only accepts np.ndarray, list, tuple, h5py.Dataset, and optionally ZarrArray, but not pandas string array types (StringArray and ArrowStringArray). The issue technically already existed for pandas 2.0 users who explicitly used dtype='string' or dtype='string[pyarrow]', but it was far less concerning because those types were opt-in. With pandas 3.0, every DataFrame with string columns (such as one loaded from a CSV file with pandas) uses the new string type by default, which is an ArrowStringArray when PyArrow is installed. This makes it a critical compatibility issue affecting all users who process string data with pandas and HDMF.
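Until the validation changes, callers can sidestep the error by handing HDMF a plain NumPy array. The snippet below is a minimal sketch of that workaround using only pandas and NumPy. It assumes that an object-dtype ndarray of Python strings is acceptable to VectorData, since np.ndarray is in the accepted list above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "dog", "bird"]})

# What .values returns depends on the pandas version and on whether
# PyArrow is installed, so inspect it rather than assume.
raw = df["animal"].values
print(type(raw).__name__)

# Requesting an object-dtype ndarray explicitly yields a type that is
# already in HDMF's accepted list (np.ndarray).
as_numpy = df["animal"].to_numpy(dtype=object)
print(type(as_numpy).__name__, as_numpy.dtype)
```

Passing `as_numpy` instead of `df['animal'].values` as the `data` argument avoids the extension array entirely.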

I propose accepting pandas string array types (StringArray and ArrowStringArray) in HDMF's type validation and automatically converting them to numpy arrays (via .to_numpy()) with a UserWarning to inform users. This balances user experience (works out of the box) with explicitness (users are notified and can pre-convert to avoid the warning).
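The proposed behavior could look roughly like the sketch below. The helper name `coerce_pandas_strings` and the warning text are hypothetical and are not part of HDMF. The sketch only illustrates the convert-and-warn idea and does not show where it would hook into HDMF's validation.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.arrays import ArrowStringArray, StringArray


def coerce_pandas_strings(data):
    """Hypothetical helper: convert pandas string arrays to NumPy, emitting a UserWarning."""
    if isinstance(data, (StringArray, ArrowStringArray)):
        warnings.warn(
            f"Converting pandas {type(data).__name__} to a numpy array; "
            "call .to_numpy() beforehand to avoid this warning.",
            UserWarning,
            stacklevel=2,
        )
        # Missing values, if any, are carried over unchanged; handling
        # them is outside the scope of this sketch.
        return data.to_numpy(dtype=object)
    # Anything else is passed through untouched for the usual validation.
    return data
```

Users who convert ahead of time never reach the `isinstance` branch, so they see no warning.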

Pandas 3.0 Release Notes: https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#pyarrow-backed-strings-as-the-default-string-type

Full reproduction script:
#!/usr/bin/env python3
# /// script
# dependencies = [
#   "pandas>=3.0.0",
#   "hdmf",
#   "pyarrow",
#   "numpy",
# ]
# ///
"""
Minimal script to reproduce the Pandas 3.0 ArrowStringArray issue with HDMF.

To run with uv:
    uv run reproduce_pandas3_issue.py

Expected behavior: Script should fail with TypeError about ArrowStringArray
"""

import pandas as pd
import numpy as np
from hdmf.common import VectorData

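# Verify we're using pandas 3.0+ (compare the major version as an integer;
# a plain string comparison would misorder versions such as '10.0.0')
pandas_major = int(pd.__version__.split('.')[0])
assert pandas_major >= 3, f"This script requires pandas 3.0+, found {pd.__version__}"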

# Create a simple DataFrame with string data
# In pandas 3.0+, this automatically uses ArrowStringArray
df = pd.DataFrame({
    'animal': ['cat', 'dog', 'bird'],
    'sound': ['meow', 'woof', 'chirp']
})

# This will FAIL with pandas 3.0+ because .values returns ArrowStringArray
# which HDMF doesn't recognize as a valid type
vector_data = VectorData(
    name='animal',
    description='Animal names',
    data=df['animal'].values  # Returns ArrowStringArray in pandas 3.0+
)

print("SUCCESS: VectorData created (this means the issue is fixed!)")

Save this as reproduce_pandas3_issue.py and run with uv run reproduce_pandas3_issue.py

As a further complication that can be discussed in a separate issue, nullable integers (IntegerArray) and nullable booleans (BooleanArray) currently fail in a similar way to string types, but their behavior with .to_numpy() is more complicated and may warrant raising an error instead of automatic conversion. For nullable integers that contain missing values, .to_numpy() converts to float64 (losing type information), and for nullable booleans that contain missing values it converts to object dtype with pd.NA values that may not serialize correctly to HDF5.
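One possible shape for the "raise instead of convert" option is sketched below. The helper name `to_numpy_strict` is hypothetical, and refusing any array with missing values is only an assumption about what a stricter policy might do.

```python
import numpy as np
import pandas as pd
from pandas.arrays import BooleanArray, IntegerArray


def to_numpy_strict(arr):
    """Hypothetical helper: convert nullable ints/bools only when nothing is missing."""
    if not isinstance(arr, (IntegerArray, BooleanArray)):
        raise TypeError(f"Expected IntegerArray or BooleanArray, got {type(arr).__name__}")
    if arr.isna().any():
        raise ValueError(
            "Array contains missing values; convert explicitly "
            "(choosing a dtype and fill value) before passing it on."
        )
    # With no missing values, the underlying NumPy dtype (e.g. int64 or
    # bool) can be kept without any loss of type information.
    return arr.to_numpy(dtype=arr.dtype.numpy_dtype)
```

This leaves the decision of how to represent missing entries to the caller, rather than silently changing the dtype.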
