Proposed Change
Add a spec-level mechanism for Iceberg tables to declare "virtual fields" - typed metadata about known field paths inside semi-structured columns (VARIANT). Virtual fields let engines resolve types, push down predicates, and transparently redirect queries to extracted physical columns, all without requiring users to manage schema evolution manually.
The Problem
Iceberg v3 adds the VARIANT type, giving semi-structured data first-class storage. But once documents land in a VARIANT column, engines are blind: no type information, no predicate pushdown, no statistics. Every field access is a runtime parse-and-guess.
Meanwhile, in every semi-structured workload, a small set of field paths accounts for a majority of query access. Those hot fields deserve the same treatment as physical columns.
The Idea
Add a virtual fields array to Iceberg table metadata. Each entry declares a known field path inside a VARIANT column, its type, and (optionally) a pointer to a physical column that materializes it.
"virtual-fields": [
  {
    "source-column-id": 2,
    "field-path": "user.email",
    "field-id": 100,
    "type": "string",
    "extracted-column-id": 3
  },
  {
    "source-column-id": 2,
    "field-path": "user.age",
    "field-id": 101,
    "type": "int"
  }
]
That's it. Field path, type, and an optional redirect to a physical column.
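As a rough sketch of how a tool might consume this metadata (the JSON layout follows the example above; the lookup structure and field names are illustrative, not part of any released Iceberg API):

```python
import json

# Hypothetical reader for the proposed "virtual-fields" metadata array.
# Entries are keyed by (source-column-id, field-path) so an engine can
# resolve a path expression like doc.user.age with a single dict lookup.
metadata = json.loads("""
{
  "virtual-fields": [
    {"source-column-id": 2, "field-path": "user.email",
     "field-id": 100, "type": "string", "extracted-column-id": 3},
    {"source-column-id": 2, "field-path": "user.age",
     "field-id": 101, "type": "int"}
  ]
}
""")

lookup = {
    (vf["source-column-id"], vf["field-path"]): vf
    for vf in metadata["virtual-fields"]
}

vf = lookup[(2, "user.age")]
print(vf["type"])                     # declared type: "int"
print(vf.get("extracted-column-id"))  # absent: field not yet extracted
```

Note that `extracted-column-id` is optional per entry, so consumers should treat its absence as "read from the VARIANT column with the declared type".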
What This Enables
Type consistency across engines. The table declares user.age is an int once. Spark, Trino, and Flink all read the same metadata. No more divergent type inference or manual CAST wrappers.
Transparent column extraction. When a hot field is promoted to a physical column, set extracted-column-id. Engines automatically rewrite queries to use the physical column (e.g., predicate pushdown, statistics, bloom filters) with zero query changes.
Progressive optimization. Day 1: store any JSON, all fields are virtual. Day 30: hot fields get extracted to physical columns automatically. Day 90: new fields appear in documents, get discovered, and the cycle continues. No manual schema management.
Catalog introspection. Tools can enumerate every known field path inside VARIANT columns, so auto-complete, data catalogs, and governance tooling can work without sampling.
How Engines Use It
- Query references doc.user.age > 25
- Engine checks virtual-fields for source-column-id=2, field-path=user.age
- Found with extracted-column-id? Read the physical column directly (fast path)
- Found without extraction? Use the declared type for casting (correct path)
- Not found? Fall back to current VARIANT behavior (no regression)
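The three-way decision above can be sketched as a small resolution function (a minimal sketch; the function name, return shape, and lookup layout are illustrative assumptions, not a defined engine API):

```python
def resolve(lookup, source_column_id, field_path):
    """Return an access plan for one VARIANT field reference.

    Implements the three cases: extracted physical column (fast path),
    declared type only (correct cast), or full VARIANT fallback.
    """
    vf = lookup.get((source_column_id, field_path))
    if vf is None:
        return ("variant-fallback", None)  # not found: current behavior, no regression
    if "extracted-column-id" in vf:
        return ("physical-column", vf["extracted-column-id"])  # fast path
    return ("cast", vf["type"])  # declared type, no extraction yet

# Lookup built from the virtual-fields metadata example.
lookup = {
    (2, "user.email"): {"type": "string", "extracted-column-id": 3},
    (2, "user.age"): {"type": "int"},
}

print(resolve(lookup, 2, "user.email"))  # ('physical-column', 3)
print(resolve(lookup, 2, "user.age"))    # ('cast', 'int')
print(resolve(lookup, 2, "user.name"))   # ('variant-fallback', None)
```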
Engines that don't understand virtual fields ignore them entirely. Pure backward compatibility.
Scope
- Targets Iceberg format version 3 (VARIANT type required)
- Adds a virtual-fields array to the table metadata JSON
- Virtual fields evolve like schema fields: add, widen type, drop, extract
- Changes produce new metadata versions (time-travel auditable)
- Optional discovery metadata in table properties (advisory, not required)
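"Widen type" here could follow Iceberg's existing schema-evolution promotion rules (e.g. int to long, float to double). A sketch of such a check (the promotion set mirrors Iceberg's primitive promotions; the function itself is a hypothetical illustration):

```python
# Widenings Iceberg schema evolution already permits for primitive types;
# the proposal suggests virtual fields would evolve under the same rules.
ALLOWED_WIDENINGS = {
    ("int", "long"),
    ("float", "double"),
}

def can_evolve(old_type: str, new_type: str) -> bool:
    """True if a virtual field's type change is a legal widening."""
    return old_type == new_type or (old_type, new_type) in ALLOWED_WIDENINGS

print(can_evolve("int", "long"))  # True
print(can_evolve("long", "int"))  # False: narrowing is never allowed
```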
Prior Art
- Snowflake: Automatically extracts frequently-accessed VARIANT paths to columnar storage
- Elasticsearch: Dynamic mapping discovers field types from documents at ingestion
- Iceberg schema evolution: ID-based column tracking, type widening rules. Virtual fields will follow the same patterns
What This Is Not
- Not computed/derived columns (no SQL expressions, just field paths and types)
- Not a query engine feature (spec-level metadata, engine-independent)
- Not a new file format (no data file changes, metadata only)
Proposal document
https://docs.google.com/document/d/1rxX7MgkByVjY-fCXRY9xTlAxo7rVbMMs2V3S8NWCdoA/edit?usp=sharing
Specifications