
ALTER COLLECTION ADD COLUMN zombifies existing rows — schema bumped without data migration #48

@hollanf

Description


ALTER COLLECTION … ADD COLUMN … on a TYPE DOCUMENT STRICT collection updates the catalog schema (schema.columns.push(...), schema.version += 1) but never touches existing row data. Pre-ALTER rows were written with the old column set; post-ALTER reads run against the new schema. The result depends on how the reader handles version drift, but in practice reads of the old rows return "null-everywhere" row objects (columns populated in the row are invisible because the schema-vs-data offset is wrong) or silently drop the new column from results.

No error is raised. No warning is logged. The only known recovery is to DROP COLLECTION and recreate — which loses all data.
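
For intuition, here is a hypothetical illustration (not NodeDB's actual row codec) of why decoding old bytes against the new schema goes wrong. In the benign case the new column just reads as null; with an offset-table codec that derives slot positions from the schema's column count, every slot shifts and even populated columns read back as null:

// Hypothetical illustration only -- not NodeDB's actual row codec.
struct Schema {
    columns: Vec<&'static str>,
}

fn decode<'a>(row: &[Option<&'a str>], schema: &Schema) -> Vec<(&'static str, Option<&'a str>)> {
    // Positional decode: column i of the schema maps to slot i of the row.
    schema
        .columns
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, row.get(i).copied().flatten()))
        .collect()
}

fn main() {
    let v1_row = vec![Some("a"), Some("alice")]; // serialized before the ALTER
    let v2 = Schema { columns: vec!["id", "name", "note"] }; // post-ALTER schema
    println!("{:?}", decode(&v1_row, &v2));
    // [("id", Some("a")), ("name", Some("alice")), ("note", None)]
}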

Current code

nodedb/src/control/server/pgwire/ddl/collection/alter/add_column.rs:15-103:

pub async fn alter_table_add_column(
    state: &SharedState,
    identity: &AuthenticatedIdentity,
    parts: &[&str],
    sql: &str,
) -> PgWireResult<Vec<Response>> {
    ...
    // Validate: new column must be nullable or have a default.
    if !column.nullable && column.default.is_none() {
        return Err(sqlstate_error("42601", "… must have a DEFAULT"));
    }

    if let Some(catalog) = state.credentials.catalog() {
        match catalog.get_collection(tenant_id.as_u32(), &table_name) {
            Ok(Some(coll)) if coll.is_active => {
                if coll.collection_type.is_strict()
                    && let Some(config_json) = &coll.timeseries_config
                    && let Ok(mut schema) =
                        sonic_rs::from_str::<nodedb_types::columnar::StrictSchema>(config_json)
                {
                    if schema.columns.iter().any(|c| c.name == column.name) {
                        return Err(sqlstate_error("42P07", "column already exists"));
                    }
                    schema.columns.push(column);
                    schema.version = schema.version.saturating_add(1);

                    let mut updated = coll;
                    updated.collection_type = nodedb_types::CollectionType::strict(schema.clone());
                    updated.timeseries_config = sonic_rs::to_string(&schema).ok();
                    let entry = crate::control::catalog_entry::CatalogEntry::PutCollection(
                        Box::new(updated.clone()),
                    );
                    let log_index =
                        crate::control::metadata_proposer::propose_catalog_entry(state, &entry)
                            .map_err(|e| sqlstate_error("XX000", &e.to_string()))?;
                    if log_index == 0 {
                        catalog.put_collection(&updated)
                            .map_err(|e| sqlstate_error("XX000", &e.to_string()))?;
                    }
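                    // (reporter note) catalog updated and replicated; nothing in
                    // this function touches the existing row data -- no backfill
                    // of the DEFAULT, no rewrite under the new schema version.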
                }
            }
            _ => { return Err(sqlstate_error("42P01", "collection does not exist")); }
        }
    }
    ...
    Ok(vec![Response::Execution(Tag::new("ALTER TABLE"))])
}

The function:

  1. Validates the new column (nullable, or has a DEFAULT).
  2. Bumps schema.columns + schema.version in the catalog.
  3. Proposes the new catalog entry via Raft.
  4. Returns success.

There is no iteration over existing rows, no backfill of the DEFAULT, no rewrite of the on-disk row format, and no compatibility shim in the reader that knows to return the DEFAULT for rows serialized under an older schema version. schema.version is bumped (line 66) but I could not find any reader path that reads schema.version and reconciles against a row's written-under version — the row format appears to assume the catalog schema always matches the bytes.
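
For reference, the missing shim is small. A sketch of the shape it could take, using hypothetical names (written_under_version and ColumnValue are not existing NodeDB types; StrictSchema is) and assuming columns are append-only and positional, which is what schema.columns.push(...) implies:

// Sketch only -- written_under_version and ColumnValue are hypothetical.
fn reconcile_row(
    schema: &nodedb_types::columnar::StrictSchema, // current catalog schema
    written_under_version: u32,   // version stamped on the row at write time
    mut values: Vec<ColumnValue>, // values exactly as serialized
) -> Vec<ColumnValue> {
    if written_under_version < schema.version {
        // Columns appended after this row was written are absent from its
        // bytes; substitute each one's DEFAULT (or NULL for nullable ones).
        for col in schema.columns.iter().skip(values.len()) {
            values.push(col.default.clone().unwrap_or(ColumnValue::Null));
        }
    }
    values
}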

Postgres ALTER TABLE … ADD COLUMN … DEFAULT … either rewrites the table to backfill the value (volatile defaults), or, for constant defaults since PostgreSQL 11, records the precomputed default in pg_attribute (atthasmissing / attmissingval) and substitutes it at read time for rows serialized before the ALTER. NodeDB does neither.
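
For comparison, the metadata-only path is directly observable in Postgres 11+; the precomputed default is stored per attribute and old rows are never rewritten:

-- PostgreSQL 11+: ADD COLUMN with a constant DEFAULT is metadata-only.
ALTER TABLE t ADD COLUMN note text DEFAULT 'n/a';

SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 't'::regclass AND attname = 'note';
--  attname | atthasmissing | attmissingval
-- ---------+---------------+---------------
--  note    | t             | {n/a}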

Why this matters

  1. The standard additive migration pattern (ALTER COLLECTION t ADD COLUMN new_col …) is the documented way to evolve a schema. Users running it against a collection that already holds data get silently corrupted reads, and nothing tells them until they notice every SELECT returning null-everywhere rows.
  2. Appears connected to the broader class of "STRICT doc SELECT returns null-everywhere rows" symptoms already reported (pgwire SELECT on TYPE DOCUMENT STRICT silently returns null/empty columns via extended protocol (data loss) #44) — but that ticket is about a pgwire extended-protocol divergence, whereas this one is a schema-evolution primitive that breaks reads through all protocols (pgwire simple query, HTTP /query, native client).
  3. The only recovery surfaced downstream is DROP COLLECTION + CREATE COLLECTION + re-ingest — data loss.
  4. schema.version = schema.version.saturating_add(1) at line 66 suggests the data model expected versioning to be implemented, but the read path was never wired up; this is a half-built feature that in its current state silently loses data.
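
Either half could be completed. The read-side shim is sketched above; the write-side alternative is to backfill during the ALTER itself, before the catalog entry is proposed. Roughly (scan_rows, read_row, and rewrite_row are hypothetical, not existing NodeDB storage APIs):

// Hypothetical backfill pass -- scan_rows/read_row/rewrite_row do not
// exist in NodeDB; this only shows the shape of the missing step.
fn backfill_new_column(
    storage: &mut Storage,
    collection_id: u32,
    new_version: u32,
    default_value: ColumnValue,
) -> Result<(), StorageError> {
    for row_id in storage.scan_rows(collection_id)? {
        let mut values = storage.read_row(collection_id, row_id)?;
        // Pre-ALTER rows are one column short: append the DEFAULT and
        // rewrite the row stamped with the new schema version.
        values.push(default_value.clone());
        storage.rewrite_row(collection_id, row_id, new_version, &values)?;
    }
    Ok(())
}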

Repro

CREATE COLLECTION t TYPE DOCUMENT STRICT (
  id STRING PRIMARY KEY,
  name STRING
);
INSERT INTO t (id, name) VALUES ('a', 'alice');

SELECT id, name FROM t;
-- expected: [{ id: 'a', name: 'alice' }]
-- observed: [{ id: 'a', name: 'alice' }]   ✓

ALTER COLLECTION t ADD COLUMN note STRING DEFAULT 'n/a';

SELECT id, name, note FROM t;
-- expected: [{ id: 'a', name: 'alice', note: 'n/a' }]
-- observed (known bad): [{ id: null, name: null, note: null }]
--                    or [{}] (row returned but every column absent)
