support list[T] columns #1

Open
joefutrelle wants to merge 5 commits into main from list_types

@joefutrelle

Collaborator

This pull request adds support for handling list[T] type annotations in schema conversion and validation utilities, allowing fields annotated as lists of supported scalar types (like list[str], list[int], etc.) to be mapped to Arrow list types. It also introduces comprehensive tests for this new functionality and improves error messages for unsupported types.

Schema type annotation enhancements:

  • Added support in _annotation_to_arrow for mapping list[T] type annotations (where T is a supported scalar type) to Arrow list types. Bare list and list[dict] are explicitly disallowed and raise informative errors. (src/amplify_db_utils/schema.py)
  • Improved error messages for unsupported types to suggest using typed lists (e.g., list[str], list[int]). (src/amplify_db_utils/schema.py)

Testing improvements:

  • Added new tests to verify correct mapping of list[str], list[int], and list[float] fields, including round-trip validation and handling of empty lists. (tests/test_schema.py)
  • Added tests to ensure that bare list and list[dict] annotations raise appropriate errors with clear messages. (tests/test_schema.py)
  • Added a test to confirm that list[T] columns are not serialized as JSON strings but as native lists. (tests/test_schema.py)

Comment thread src/amplify_db_utils/schema.py
Comment thread tests/test_schema.py
@sbatchelder

New problem: registry can't deserialize list[T] column types

The earlier commit on this branch added list[T] support at the conversion layer, so list columns could be defined and written. But the schema registry can't read them back.

The registry persists each table's schema to a JSON sidecar ({root}/_registry/tables.json), storing every field's type as a plain string via str(pa_type) and reconstructing it on load with a hand-maintained lookup table (_arrow_type_from_str) that only covers scalar + timestamp types. A list<float> column serializes fine to "list<item: float>", but on load there's no case for it, so it raises:

ValueError: Cannot deserialize PyArrow type: 'list<item: float>'

Registration is in-memory at create_table time, so the writing process never hit this. But SchemaRegistry.load() runs in the store constructor, so any fresh store opened on an existing dataset that has a list column crashes on construction — i.e. every separate reader/analysis process. The same gap applies to other nested types (struct, large_list, map).

The fix: serialize the schema via Arrow IPC

Instead of stringifying types one by one, we now serialize the whole schema using PyArrow's built-in IPC serialization and store it as a base64 string in a new schema_ipc field:

def _schema_to_ipc_b64(schema):  return base64.b64encode(schema.serialize().to_pybytes()).decode("ascii")
def _schema_from_ipc_b64(s):     return pa.ipc.read_schema(pa.py_buffer(base64.b64decode(s)))

IPC ("Inter-Process Communication") is Apache Arrow's own canonical serialization format for shipping Arrow data and schemas between processes or to disk. It's built into PyArrow (schema.serialize() / pa.ipc.read_schema) and round-trips every Arrow type losslessly, including nested types, nullability, and field metadata. That means new column types no longer require extending a hand-maintained lookup table.

This change adds a new schema_ipc field; backwards compatibility is otherwise maintained.

Sidecar signature change and why it's backwards compatible

save() now writes an extra key per table. Before:

{ "t": { "schema_fields": [...], "partition_by": [...] } }

After:

{ "t": { "schema_ipc": "<base64>", "schema_fields": [...], "partition_by": [...] } }
  • schema_ipc is the authoritative representation read back on load.
  • schema_fields is retained as a human-readable mirror (so the sidecar stays legible) and as the back-compat read path.

load() prefers schema_ipc, and falls back to parsing schema_fields when it's absent:

if entry.get("schema_ipc"):
    schema = _schema_from_ipc_b64(entry["schema_ipc"])
else:
    schema = _schema_from_json(entry["schema_fields"])

So existing sidecars written before this change keep loading unchanged — they only contain scalar types, which the retained legacy parser still handles. The legacy _arrow_type_to_str/_from_str + _schema_from_json are deliberately kept for that reason. Old registries self-migrate to include schema_ipc the next time the registry is saved; pure readers keep working via the fallback without any rewrite.

Tests added (tests/test_schema_evolution.py)

  • test_list_column_persists_across_instances — full lifecycle: create → write → reopen new store → read a list[float] column; the direct regression test for the crash above.
  • test_list_column_idempotent_after_reload — re-registering a list-column schema after reload is a no-op (proves the round-tripped schema compares equal).
  • test_registry_roundtrips_list_type: unit-level save() → load() for list<float32>.
  • test_registry_roundtrips_nested_types — same for large_list + nested struct, confirming the IPC path generalizes beyond lists.
  • test_registry_loads_legacy_schema_fields_only — a schema_fields-only sidecar (string / int64 / timestamp-with-tz) still loads, locking in backwards compatibility.

@sbatchelder sbatchelder requested a review from johnwaalsh June 30, 2026 19:17
Comment thread src/amplify_db_utils/registry.py Outdated
Comment thread src/amplify_db_utils/registry.py Outdated
Comment thread tests/test_schema_evolution.py
@sbatchelder

Spoke with Joe: we are a-go to break backwards compatibility, provided that I write a migration script, which I think ought to be pretty easy.

Part of that discussion was changing schema_fields so that it (a) is still included for human legibility but (b) is just a simple list of column names (dropping dtypes and nullability).

@sbatchelder sbatchelder requested a review from johnwaalsh July 1, 2026 22:57

@johnwaalsh johnwaalsh left a comment

Member

Looks good to me!
