Proposed Change
Add a spec-level mechanism for Iceberg tables to declare "virtual fields" - typed metadata about known field paths inside semi-structured columns (VARIANT). Virtual fields let engines resolve types, push down predicates, and transparently redirect queries to extracted physical columns, all without requiring users to manage schema evolution manually.
The Problem
Iceberg v3 adds the VARIANT type, giving semi-structured data first-class storage. But once documents land in a VARIANT column, engines are blind: no type information, no predicate pushdown, no statistics. Every field access is a runtime parse-and-guess.
Meanwhile, in every semi-structured workload, a small set of field paths accounts for a majority of query access. Those hot fields deserve the same treatment as physical columns.
The Idea
Add a virtual fields array to Iceberg table metadata. Each entry declares a known field path inside a VARIANT column, its type, and (optionally) a pointer to a physical column that materializes it.
"virtual-fields": [
  {
    "source-column-id": 2,
    "field-path": "user.email",
    "field-id": 100,
    "type": "string",
    "extracted-column-id": 3
  },
  {
    "source-column-id": 2,
    "field-path": "user.age",
    "field-id": 101,
    "type": "int"
  }
]
That's it. Field path, type, and an optional redirect to a physical column.
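As a rough sketch of how a tool might consume this metadata (the JSON layout follows the example above; the lookup structure and field names are illustrative, not part of any released Iceberg API):

```python
import json

# Hypothetical reader for the proposed "virtual-fields" metadata array.
# Entries are keyed by (source-column-id, field-path) so an engine can
# resolve a path expression like doc.user.age with a single dict lookup.
metadata = json.loads("""
{
  "virtual-fields": [
    {"source-column-id": 2, "field-path": "user.email",
     "field-id": 100, "type": "string", "extracted-column-id": 3},
    {"source-column-id": 2, "field-path": "user.age",
     "field-id": 101, "type": "int"}
  ]
}
""")

lookup = {
    (vf["source-column-id"], vf["field-path"]): vf
    for vf in metadata["virtual-fields"]
}

vf = lookup[(2, "user.age")]
print(vf["type"])                     # declared type: "int"
print(vf.get("extracted-column-id"))  # absent: field not yet extracted
```

Note that `extracted-column-id` is optional per entry, so consumers should treat its absence as "read from the VARIANT column with the declared type".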
What This Enables
Type consistency across engines. The table declares user.age is an int once. Spark, Trino, and Flink all read the same metadata. No more divergent type inference or manual CAST wrappers.
Transparent column extraction. When a hot field is promoted to a physical column, set extracted-column-id. Engines automatically rewrite queries to use the physical column (e.g., predicate pushdown, statistics, bloom filters) with zero query changes.
Progressive optimization. Day 1: store any JSON, all fields are virtual. Day 30: hot fields get extracted to physical columns automatically. Day 90: new fields appear in documents, get discovered, and the cycle continues. No manual schema management.
Catalog introspection. Tools can enumerate every known field path inside VARIANT columns, so auto-complete, data catalogs, and governance tooling can work without sampling.
How Engines Use It
- Query references doc.user.age > 25
- Engine checks virtual-fields for source-column-id=2, field-path=user.age
- Found with extracted-column-id? Read the physical column directly (fast path)
- Found without extraction? Use the declared type for casting (correct path)
- Not found? Fall back to current VARIANT behavior (no regression)
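The three-way decision above can be sketched as a small resolution function (a minimal sketch; the function name, return shape, and lookup layout are illustrative assumptions, not a defined engine API):

```python
def resolve(lookup, source_column_id, field_path):
    """Return an access plan for one VARIANT field reference.

    Implements the three cases: extracted physical column (fast path),
    declared type only (correct cast), or full VARIANT fallback.
    """
    vf = lookup.get((source_column_id, field_path))
    if vf is None:
        return ("variant-fallback", None)  # not found: current behavior, no regression
    if "extracted-column-id" in vf:
        return ("physical-column", vf["extracted-column-id"])  # fast path
    return ("cast", vf["type"])  # declared type, no extraction yet

# Lookup built from the virtual-fields metadata example.
lookup = {
    (2, "user.email"): {"type": "string", "extracted-column-id": 3},
    (2, "user.age"): {"type": "int"},
}

print(resolve(lookup, 2, "user.email"))  # ('physical-column', 3)
print(resolve(lookup, 2, "user.age"))    # ('cast', 'int')
print(resolve(lookup, 2, "user.name"))   # ('variant-fallback', None)
```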
Engines that don't understand virtual fields ignore them entirely. Pure backward compatibility.
Scope
- Targets Iceberg format version 3 (VARIANT type required)
- Adds a virtual-fields array to the table metadata JSON
- Virtual fields evolve like schema fields: add, widen type, drop, extract
- Changes produce new metadata versions (time-travel auditable)
- Optional discovery metadata in table properties (advisory, not required)
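"Widen type" here could follow Iceberg's existing schema-evolution promotion rules (e.g. int to long, float to double). A sketch of such a check (the promotion set mirrors Iceberg's primitive promotions; the function itself is a hypothetical illustration):

```python
# Widenings Iceberg schema evolution already permits for primitive types;
# the proposal suggests virtual fields would evolve under the same rules.
ALLOWED_WIDENINGS = {
    ("int", "long"),
    ("float", "double"),
}

def can_evolve(old_type: str, new_type: str) -> bool:
    """True if a virtual field's type change is a legal widening."""
    return old_type == new_type or (old_type, new_type) in ALLOWED_WIDENINGS

print(can_evolve("int", "long"))  # True
print(can_evolve("long", "int"))  # False: narrowing is never allowed
```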
Prior Art
- Snowflake: Automatically extracts frequently-accessed VARIANT paths to columnar storage
- Elasticsearch: Dynamic mapping discovers field types from documents at ingestion
- Iceberg schema evolution: ID-based column tracking, type widening rules. Virtual fields will follow the same patterns
What This Is Not
- Not computed/derived columns (no SQL expressions, just field paths and types)
- Not a query engine feature (spec-level metadata, engine-independent)
- Not a new file format (no data file changes, metadata only)
Proposal document
https://docs.google.com/document/d/1rxX7MgkByVjY-fCXRY9xTlAxo7rVbMMs2V3S8NWCdoA/edit?usp=sharing
Specifications