This pull request adds support for handling list[T] type annotations in schema conversion and validation utilities, allowing fields annotated as lists of supported scalar types (like list[str], list[int], etc.) to be mapped to Arrow list types. It also introduces comprehensive tests for this new functionality and improves error messages for unsupported types.
Schema type annotation enhancements:
Added support in _annotation_to_arrow for mapping list[T] type annotations (where T is a supported scalar type) to Arrow list types. Bare list and list[dict] are explicitly disallowed and raise informative errors. (src/amplify_db_utils/schema.py)
Improved error messages for unsupported types to suggest using typed lists (e.g., list[str], list[int]). (src/amplify_db_utils/schema.py)
Testing improvements:
Added new tests to verify correct mapping of list[str], list[int], and list[float] fields, including round-trip validation and handling of empty lists. (tests/test_schema.py)
Added tests to ensure that bare list and list[dict] annotations raise appropriate errors with clear messages. (tests/test_schema.py)
Added a test to confirm that list[T] columns are not serialized as JSON strings but as native lists. (tests/test_schema.py)
New problem: registry can't deserialize list[T] column types
The earlier commit on this branch added list[T] support at the conversion layer, so list columns could be defined and written. But the schema registry can't read them back.
The registry persists each table's schema to a JSON sidecar ({root}/_registry/tables.json), storing every field's type as a plain string via str(pa_type) and reconstructing it on load with a hand-maintained lookup table (_arrow_type_from_str) that only covers scalar + timestamp types. A list<float> column serializes fine to "list<item: float>", but on load there's no case for it, so it raises:
Registration is in-memory at create_table time, so the writing process never hit this. But SchemaRegistry.load() runs in the store constructor, so any fresh store opened on an existing dataset that has a list column crashes on construction — i.e. every separate reader/analysis process. The same gap applies to other nested types (struct, large_list, map).
The fix: serialize the schema via Arrow IPC
Instead of stringifying types one by one, we now serialize the whole schema using PyArrow's built-in IPC serialization and store it as a base64 string in a new schema_ipc field:
IPC ("Inter-Process Communication") is Apache Arrow's canonical serialization format for shipping Arrow data and schemas between processes or to disk. It is built into PyArrow (schema.serialize() / pa.ipc.read_schema) and round-trips every Arrow type losslessly, including nested types, nullability, and field metadata. That also helps maintainability: new column types no longer require extending a hand-maintained lookup table.
This change adds a new schema_ipc field, but backwards compatibility is otherwise maintained.
Sidecar signature change and why it's backwards compatible
Existing sidecars written before this change keep loading unchanged: they only contain scalar types, which the retained legacy parser still handles. The legacy _arrow_type_to_str/_from_str and _schema_from_json are deliberately kept for that reason. Old registries self-migrate to include schema_ipc the next time the registry is saved; pure readers keep working via the fallback without any rewrite.
Tests added (tests/test_schema_evolution.py)
test_list_column_persists_across_instances — full lifecycle: create → write → reopen new store → read a list[float] column; the direct regression test for the crash above.
test_list_column_idempotent_after_reload — re-registering a list-column schema after reload is a no-op (proves the round-tripped schema compares equal).
test_registry_roundtrips_list_type — unit-level save()→load() for list<float32>.
test_registry_roundtrips_nested_types — same for large_list + nested struct, confirming the IPC path generalizes beyond lists.
test_registry_loads_legacy_schema_fields_only — a schema_fields-only sidecar (string / int64 / timestamp-with-tz) still loads, locking in backwards compatibility.
Spoke with Joe, we are a-go to break backwards compatibility, provided that I make a migration script which I think ought to be pretty easy.
Part of that discussion was changing schema_fields so that it (a) is still included for human legibility but (b) is just a simple list of column names (dropping dtypes and nullability).
Looks good to me!