Description
Users commonly convert data to HDMF containers (Units, Electrodes, DynamicTables) from pandas DataFrames or numpy arrays. Since pandas 2.0, new string types (StringArray and ArrowStringArray) were introduced as opt-in features, but pandas 3.0 made PyArrow-backed strings (ArrowStringArray) the default for all string columns. This breaks HDMF's VectorData creation with a TypeError when users pass pandas Series with the new default string type.
import pandas as pd
from hdmf.common import VectorData
df = pd.DataFrame({'animal': ['cat', 'dog', 'bird']}) # Uses ArrowStringArray in pandas 3.0
vector_data = VectorData(name='animal', description='names', data=df['animal'].values)
# TypeError: VectorData.__init__: incorrect type for 'data' (got 'ArrowStringArray', ...)
This error occurs because HDMF's type validation in src/hdmf/utils.py only accepts np.ndarray, list, tuple, h5py.Dataset, and optionally ZarrArray, but not pandas string array types (StringArray and ArrowStringArray). While this issue technically existed for pandas 2.0 users who explicitly used dtype='string' or dtype='string[pyarrow]', it was far less concerning because those types were opt-in. With pandas 3.0, every DataFrame with strings (such as one loaded from CSV with pandas) now uses ArrowStringArray by default, making this a critical compatibility issue affecting all users who process string data with pandas and HDMF.
I propose accepting pandas string array types (StringArray and ArrowStringArray) in HDMF's type validation and automatically converting them to numpy arrays (via .to_numpy()) with a UserWarning to inform users. This balances user experience (works out of the box) with explicitness (users are notified and can pre-convert to avoid the warning).
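A minimal sketch of what this conversion could look like, not HDMF's actual implementation. The helper name `coerce_pandas_string_array` is hypothetical, and detecting the types by class name (rather than importing them from pandas) is an assumption made only to keep the sketch dependency-free:

```python
import warnings

# Class names of the pandas string extension arrays discussed in this issue.
_PANDAS_STRING_ARRAY_NAMES = {"StringArray", "ArrowStringArray"}


def coerce_pandas_string_array(data):
    """Return `data` unchanged unless it looks like a pandas string array.

    Hypothetical helper, not part of HDMF. Detection walks the class
    hierarchy by name so that subclasses are also matched.
    """
    type_names = {cls.__name__ for cls in type(data).__mro__}
    if type_names & _PANDAS_STRING_ARRAY_NAMES and hasattr(data, "to_numpy"):
        warnings.warn(
            f"Converting pandas {type(data).__name__} to a numpy array; "
            "call .to_numpy() before passing the data to avoid this warning.",
            UserWarning,
            stacklevel=2,
        )
        return data.to_numpy()
    # Lists, tuples, numpy arrays, etc. pass through untouched.
    return data
```

With a helper along these lines, `coerce_pandas_string_array(df['animal'].values)` would emit one UserWarning and return a numpy array, while `coerce_pandas_string_array(df['animal'].to_numpy())` would pass through silently.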
Pandas 3.0 Release Notes: https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#pyarrow-backed-strings-as-the-default-string-type
Full reproduction script:
#!/usr/bin/env python3
# /// script
# dependencies = [
# "pandas>=3.0.0",
# "hdmf",
# "numpy",
# ]
# ///
"""
Minimal script to reproduce the Pandas 3.0 ArrowStringArray issue with HDMF.
To run with uv:
uv run reproduce_pandas3_issue.py
Expected behavior: Script should fail with TypeError about ArrowStringArray
"""
import pandas as pd
import numpy as np
from hdmf.common import VectorData
# Verify we're using pandas 3.0+
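# Compare the major version numerically; comparing version strings is lexicographic and unreliable
assert int(pd.__version__.split('.')[0]) >= 3, f"This script requires pandas 3.0+, found {pd.__version__}"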
# Create a simple DataFrame with string data
# In pandas 3.0+, this automatically uses ArrowStringArray
df = pd.DataFrame({
    'animal': ['cat', 'dog', 'bird'],
    'sound': ['meow', 'woof', 'chirp']
})
# This will FAIL with pandas 3.0+ because .values returns ArrowStringArray
# which HDMF doesn't recognize as a valid type
vector_data = VectorData(
    name='animal',
    description='Animal names',
    data=df['animal'].values  # Returns ArrowStringArray in pandas 3.0+
)
print("SUCCESS: VectorData created (this means the issue is fixed!)")
Save this as reproduce_pandas3_issue.py and run it with uv run reproduce_pandas3_issue.py.
As a further complication that can be discussed in a separate issue, nullable integers (IntegerArray) and nullable booleans (BooleanArray) currently fail in a similar way to string types, but their behavior with .to_numpy() is more complicated and may warrant raising an error instead of automatic conversion. For nullable integers, .to_numpy() converts to float64 (losing type information), and for nullable booleans it converts to object dtype with pd.NA values that may not serialize correctly to HDF5.
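If raising an error turns out to be the preferred behavior for these types, one possible shape is sketched below. This is an illustration only: the helper name `reject_nullable_array` and the error wording are hypothetical, and matching on class names is an assumption made to keep the sketch free of a pandas import:

```python
# Class names of the pandas nullable arrays mentioned above.
_NULLABLE_ARRAY_NAMES = {"IntegerArray", "BooleanArray"}


def reject_nullable_array(data):
    """Raise TypeError for pandas nullable integer/boolean arrays.

    Hypothetical helper, not part of HDMF. Anything else is returned as-is.
    """
    type_names = {cls.__name__ for cls in type(data).__mro__}
    matched = sorted(type_names & _NULLABLE_ARRAY_NAMES)
    if matched:
        raise TypeError(
            f"pandas {matched[0]} is not converted automatically because "
            ".to_numpy() may change its dtype; convert it explicitly, "
            "choosing a dtype and a fill value for missing entries."
        )
    return data
```

Raising here leaves the choice of dtype and missing-value handling to the user instead of silently picking one.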