Update schema fork by LLDay · Pull Request #9 · tarantool/iceberg-rust

LLDay · 2026-04-09T11:46:57Z

What changes are included in this PR?

Cherry-picked commits related to the update_schema functionality, with minor changes that allow to explicitly specify the field ID.

Are these changes tested?

Original commits have tests.

kryvashek · 2026-04-13T08:19:24Z

+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.


Seems a bit wrong in a part "Licensed to the Apache Software Foundation (ASF)". Moreover, do we agree on using Apache license here or should we chose another one?

kryvashek · 2026-04-13T08:22:36Z

+// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
+const DEFAULT_ID: i32 = 0;


Haven't seen further yet, but using Option<i32> with None as a default for added columns seems more convenient at first glance.

kryvashek · 2026-04-13T08:37:58Z

+    }
+
+    /// Return a copy with an updated doc string.
+    pub fn with_doc(mut self, doc: impl ToString) -> Self {


Please keep in mind that ToString works via Display implementation for a related type, including the String itself, meaning passed value of that type will be de-facto cloned (and not with a simple soviet memcpy call). If it is expected for argument doc to be mostly of type String, usage of Into<String> is preffered.

This is also applicable to name argument of optional and required methods and parent argument of with_parent method. And also to name argument of UpdateSchemaAction::delete_column.

kryvashek · 2026-04-13T08:41:28Z

+        self
+    }
+
+    fn to_nested_field(&self) -> NestedFieldRef {


Wouldn't impl From<AddColumn> for NestedFieldRef be more convenient? Or at least Into, and probably with &AddColumn?

kryvashek · 2026-04-13T08:47:58Z

+/// ```
+pub struct UpdateSchemaAction {
+    additions: Vec<AddColumn>,
+    deletes: Vec<String>,


Suggested change

deletes: Vec<String>,

deletions: Vec<String>,

kryvashek · 2026-04-13T08:49:59Z

+    // --- Root-level additions ---
+
+    /// Add a column to the table schema.
+    ///
+    /// To add a root-level column, leave `AddColumn::parent` as `None`.
+    /// For nested additions, set a parent path (for example via [`AddColumn::with_parent`]).


--- Root-level additions ---

To add a root-level column, leave AddColumn::parent as None.
For nested additions, set a parent path ...

The first of these lines contradicts the last one.

kryvashek · 2026-04-13T11:41:30Z

+                    .and_then(|field| {
+                        match base_schema
+                            .identifier_field_ids()
+                            .find(|id| *id == field.id)


Use field_by_id, that would be more efficient.

kryvashek · 2026-04-13T11:46:13Z

+            let pending_field = add.to_nested_field();
+
+            // Check that name does not contain ".".
+            if pending_field.name.contains('.') {
+                return Err(Error::new(
+                    ErrorKind::PreconditionFailed,
+                    format!(
+                        "Cannot add column with ambiguous name: {}. Use `AddColumn::with_parent` to add a column to a nested struct.",
+                        pending_field.name
+                    ),
+                ));
+            }
+
+            // Required columns without an initial default need allow_incompatible_changes.
+            if pending_field.required && pending_field.initial_default.is_none() {
+                return Err(Error::new(
+                    ErrorKind::PreconditionFailed,
+                    format!(
+                        "Incompatible change: cannot add required column without an initial default: {}",
+                        pending_field.name
+                    ),
+                ));
+            }
+
+            let parent_id = match &add.parent {
+                None => {
+                    // Root-level: check name conflict against root-level fields.
+                    if let Some(existing) = base_schema.field_by_name(&pending_field.name)
+                        && !delete_ids.contains(&existing.id)
+                    {
+                        return Err(Error::new(
+                            ErrorKind::PreconditionFailed,
+                            format!(
+                                "Cannot add column, name already exists: {}",
+                                pending_field.name
+                            ),
+                        ));
+                    }
+                    TABLE_ROOT_ID
+                }
+                Some(parent_path) => {
+                    // Nested: resolve parent, check name conflict within parent struct.
+                    let (parent_id, parent_struct) =
+                        resolve_parent_target(base_schema, parent_path)?;
+
+                    if parent_struct.fields().iter().any(|f| {
+                        f.name == pending_field.name
+                            && !delete_ids.contains(&f.id)
+                            && !delete_ids.contains(&parent_id)
+                    }) {
+                        return Err(Error::new(
+                            ErrorKind::PreconditionFailed,
+                            format!(
+                                "Cannot add column, name already exists in '{}': {}",
+                                parent_path, pending_field.name
+                            ),
+                        ));
+                    }
+
+                    parent_id
+                }
+            };
+
+            // Assign fresh IDs immediately, preserving insertion order.
+            let field = if self.auto_assign_ids && pending_field.id == DEFAULT_ID {
+                assign_fresh_ids(&pending_field, &mut last_column_id)
+            } else {
+                pending_field
+            };
+
+            additions_by_parent
+                .entry(parent_id)
+                .or_default()
+                .push(field);


This is quite a long method. Have you considered separating some parts of it into a couple of standalone functions? I.e. this loop's body.

kryvashek · 2026-04-13T11:48:26Z

+            if pending_field.name.contains('.') {
+                return Err(Error::new(
+                    ErrorKind::PreconditionFailed,
+                    format!(
+                        "Cannot add column with ambiguous name: {}. Use `AddColumn::with_parent` to add a column to a nested struct.",
+                        pending_field.name
+                    ),
+                ));
+            }


Woludn't that be better to support "parent.child" syntax for making nested fields?

kryvashek · 2026-04-13T12:53:04Z

+                    if parent_struct.fields().iter().any(|f| {
+                        f.name == pending_field.name
+                            && !delete_ids.contains(&f.id)
+                            && !delete_ids.contains(&parent_id)


StructType has methods field_by_id and field_by_name.

Also condition part delete_ids.contains(&parent_id) doesn't depend on loop variables and should be done before that loop.

And also... Suppose we already have struct/map/list column with id A and its child column with id B named "my_column". Then we delete the whole column A (that also deletes all of its children, right?). And then we add a new column named "my_column", setting its parent to A. When we get to this point of code, the condition above won't shoot:

Column A has a child with the same name "my_column" as the one being added => f.name == pending_field.name is true.

delete_ids doesn't contain B cause we've never added it there since it isn't deleted directly => !delete_ids.contains(&f.id) is true.

But delete_ids contains A due to it's planned deletion => !delete_ids.contains(&parent_id) is false.

So, the condition never shoots, and we're going to add a column with parent A although there will be no such parent when we finish.
Am I right? I suggest adding test(s) for such case(s).
(I know this condition tests another case, I'm just saying there seems to be at least one erroneous case we don't check.)

kryvashek reviewed Apr 13, 2026

View reviewed changes

LLDay mentioned this pull request Apr 13, 2026

feat(schema): Add update_schema action to enable table schema updates apache/iceberg-rust#2120

Open

2 tasks

LLDay force-pushed the dsemenov/update-schema-fork branch from 52f8000 to 1c37d2e Compare April 13, 2026 13:44

tomighita and others added 9 commits April 15, 2026 10:05

add update_schema

bd1761d

Cargo format update

2b5284d

Fix clippy!

0c6bf71

Add write_default to new fields

a5000f8

Fix new FileIO api

b38db22

Add integration test storage factory

87ce8d9

add extra delete check for parent

10bcd0f

Refactor AddColumn API

e73d5c8

Add id field to the builder

71c3da1

LLDay force-pushed the dsemenov/update-schema-fork branch from 1c37d2e to 71c3da1 Compare April 15, 2026 07:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update schema fork#9

Update schema fork#9
LLDay wants to merge 9 commits intodf-52.3.0-release-bindingfrom
dsemenov/update-schema-fork

LLDay commented Apr 9, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026 •

edited

Loading

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026

Uh oh!

kryvashek Apr 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
		const DEFAULT_ID: i32 = 0;

Conversation

LLDay commented Apr 9, 2026

What changes are included in this PR?

Are these changes tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kryvashek Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kryvashek Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kryvashek Apr 13, 2026 •

edited

Loading

kryvashek Apr 13, 2026 •

edited

Loading