support list[T] columns by joefutrelle · Pull Request #1 · WHOIGit/amplify-db-utils

joefutrelle · 2026-03-16T12:44:21Z

This pull request adds support for handling list[T] type annotations in schema conversion and validation utilities, allowing fields annotated as lists of supported scalar types (like list[str], list[int], etc.) to be mapped to Arrow list types. It also introduces comprehensive tests for this new functionality and improves error messages for unsupported types.

Schema type annotation enhancements:

Added support in _annotation_to_arrow for mapping list[T] type annotations (where T is a supported scalar type) to Arrow list types. Bare list and list[dict] are explicitly disallowed and raise informative errors. (src/amplify_db_utils/schema.py)
Improved error messages for unsupported types to suggest using typed lists (e.g., list[str], list[int]). (src/amplify_db_utils/schema.py)

Testing improvements:

Added new tests to verify correct mapping of list[str], list[int], and list[float] fields, including round-trip validation and handling of empty lists. (tests/test_schema.py)
Added tests to ensure that bare list and list[dict] annotations raise appropriate errors with clear messages. (tests/test_schema.py)
Added a test to confirm that list[T] columns are not serialized as JSON strings but as native lists. (tests/test_schema.py)

sbatchelder · 2026-06-30T19:17:02Z

New problem: registry can't deserialize `list[T]` column types

The earlier commit on this branch added list[T] support at the conversion layer, so list columns could be defined and written. But the schema registry can't read them back.

The registry persists each table's schema to a JSON sidecar ({root}/_registry/tables.json), storing every field's type as a plain string via str(pa_type) and reconstructing it on load with a hand-maintained lookup table (_arrow_type_from_str) that only covers scalar + timestamp types. A list<float> column serializes fine to "list<item: float>", but on load there's no case for it, so it raises:

ValueError: Cannot deserialize PyArrow type: 'list<item: float>'

Registration is in-memory at create_table time, so the writing process never hit this. But SchemaRegistry.load() runs in the store constructor, so any fresh store opened on an existing dataset that has a list column crashes on construction — i.e. every separate reader/analysis process. The same gap applies to other nested types (struct, large_list, map).

The fix: serialize the schema via Arrow IPC

Instead of stringifying types one by one, we now serialize the whole schema using PyArrow's built-in IPC serialization and store it as a base64 string in a new schema_ipc field:

def _schema_to_ipc_b64(schema):  return base64.b64encode(schema.serialize().to_pybytes()).decode("ascii")
def _schema_from_ipc_b64(s):     return pa.ipc.read_schema(pa.py_buffer(base64.b64decode(s)))

IPC ("Inter-Process Communication") is Apache Arrow's own canonical serialization format for shipping Arrow data/schemas between processes or to disk. It's a built-in PyArrow feature (schema.serialize() / pa.ipc.read_schema). It round-trips every Arrow type losslessly, including nested types, nullability, and field metadata, so supporting a new type no longer requires adding a case to a hand-maintained lookup table.

This change adds a new schema_ipc field; backwards compatibility is otherwise maintained.

Sidecar signature change and why it's backwards compatible

save() now writes an extra key per table. Before:

{ "t": { "schema_fields": [...], "partition_by": [...] } }

After:

{ "t": { "schema_ipc": "<base64>", "schema_fields": [...], "partition_by": [...] } }

schema_ipc is the authoritative representation read back on load.
schema_fields is retained as a human-readable mirror (so the sidecar stays legible) and as the back-compat read path.

load() prefers schema_ipc, and falls back to parsing schema_fields when it's absent:

if entry.get("schema_ipc"):
    schema = _schema_from_ipc_b64(entry["schema_ipc"])
else:
    schema = _schema_from_json(entry["schema_fields"])

So existing sidecars written before this change keep loading unchanged — they only contain scalar types, which the retained legacy parser still handles. The legacy _arrow_type_to_str/_from_str + _schema_from_json are deliberately kept for that reason. Old registries self-migrate to include schema_ipc the next time the registry is saved; pure readers keep working via the fallback without any rewrite.

Tests added (`tests/test_schema_evolution.py`)

test_list_column_persists_across_instances — full lifecycle: create → write → reopen new store → read a list[float] column; the direct regression test for the crash above.
test_list_column_idempotent_after_reload — re-registering a list-column schema after reload is a no-op (proves the round-tripped schema compares equal).
test_registry_roundtrips_list_type — unit-level save()→load() for list<float32>.
test_registry_roundtrips_nested_types — same for large_list + nested struct, confirming the IPC path generalizes beyond lists.
test_registry_loads_legacy_schema_fields_only — a schema_fields-only sidecar (string / int64 / timestamp-with-tz) still loads, locking in backwards compatibility.

sbatchelder · 2026-07-01T21:19:03Z

Spoke with Joe: we are a-go to break backwards compatibility, provided that I write a migration script, which I think ought to be pretty easy.

Part of that discussion was changing schema_fields so that it (a) is still included for human legibility but (b) is just a simple list of column names (dropping dtypes and nullability).

johnwaalsh

Looks good to me!

support list[T] columns

900a863

joefutrelle self-assigned this Mar 16, 2026

joefutrelle requested review from johnwaalsh, sbatchelder and shravani-whoi March 16, 2026 13:08

johnwaalsh reviewed Mar 17, 2026

Comment thread src/amplify_db_utils/schema.py

Comment thread tests/test_schema.py

serialize schema fields with ipc. enables errorless deserializing

2255ae7

sbatchelder requested a review from johnwaalsh June 30, 2026 19:17

johnwaalsh reviewed Jul 1, 2026

Comment thread src/amplify_db_utils/registry.py Outdated

Comment thread src/amplify_db_utils/registry.py Outdated

Comment thread tests/test_schema_evolution.py

dev pytz for tests

9c0e884

sbatchelder added 2 commits July 1, 2026 18:54

ipc only - break backwards compatibility

c702be4

migration script

35626ba

sbatchelder requested a review from johnwaalsh July 1, 2026 22:57

sbatchelder approved these changes Jul 2, 2026

johnwaalsh approved these changes Jul 2, 2026

joefutrelle wants to merge 5 commits into main from list_types
