Skip to content

Update schema fork#9

Open
LLDay wants to merge 9 commits intodf-52.3.0-release-bindingfrom
dsemenov/update-schema-fork
Open

Update schema fork#9
LLDay wants to merge 9 commits intodf-52.3.0-release-bindingfrom
dsemenov/update-schema-fork

Conversation

@LLDay
Copy link
Copy Markdown

@LLDay LLDay commented Apr 9, 2026

What changes are included in this PR?

Cherry-picked commits related to the update_schema functionality, with minor changes that allow to explicitly specify the field ID.

Are these changes tested?

Original commits have tests.

Comment on lines +1 to +16
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit wrong in a part "Licensed to the Apache Software Foundation (ASF)". Moreover, do we agree on using Apache license here or should we chose another one?

Comment on lines +33 to +34
// Default ID for a new column. This will be re-assigned to a fresh ID at commit time.
const DEFAULT_ID: i32 = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen further yet, but using Option<i32> with None as a default for added columns seems more convenient at first glance.

}

/// Return a copy with an updated doc string.
pub fn with_doc(mut self, doc: impl ToString) -> Self {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep in mind that ToString works via Display implementation for a related type, including the String itself, meaning passed value of that type will be de-facto cloned (and not with a simple soviet memcpy call). If it is expected for argument doc to be mostly of type String, usage of Into<String> is preffered.

This is also applicable to name argument of optional and required methods and parent argument of with_parent method. And also to name argument of UpdateSchemaAction::delete_column.

self
}

fn to_nested_field(&self) -> NestedFieldRef {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't impl From<AddColumn> for NestedFieldRef be more convenient? Or at least Into, and probably with &AddColumn?

/// ```
pub struct UpdateSchemaAction {
additions: Vec<AddColumn>,
deletes: Vec<String>,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deletes: Vec<String>,
deletions: Vec<String>,

Comment on lines +147 to +152
// --- Root-level additions ---

/// Add a column to the table schema.
///
/// To add a root-level column, leave `AddColumn::parent` as `None`.
/// For nested additions, set a parent path (for example via [`AddColumn::with_parent`]).
Copy link
Copy Markdown
Member

@kryvashek kryvashek Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--- Root-level additions ---

To add a root-level column, leave AddColumn::parent as None.
For nested additions, set a parent path ...

The first of these lines contradicts the last one.

.and_then(|field| {
match base_schema
.identifier_field_ids()
.find(|id| *id == field.id)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use field_by_id, that would be more efficient.

Comment on lines +398 to +471
let pending_field = add.to_nested_field();

// Check that name does not contain ".".
if pending_field.name.contains('.') {
return Err(Error::new(
ErrorKind::PreconditionFailed,
format!(
"Cannot add column with ambiguous name: {}. Use `AddColumn::with_parent` to add a column to a nested struct.",
pending_field.name
),
));
}

// Required columns without an initial default need allow_incompatible_changes.
if pending_field.required && pending_field.initial_default.is_none() {
return Err(Error::new(
ErrorKind::PreconditionFailed,
format!(
"Incompatible change: cannot add required column without an initial default: {}",
pending_field.name
),
));
}

let parent_id = match &add.parent {
None => {
// Root-level: check name conflict against root-level fields.
if let Some(existing) = base_schema.field_by_name(&pending_field.name)
&& !delete_ids.contains(&existing.id)
{
return Err(Error::new(
ErrorKind::PreconditionFailed,
format!(
"Cannot add column, name already exists: {}",
pending_field.name
),
));
}
TABLE_ROOT_ID
}
Some(parent_path) => {
// Nested: resolve parent, check name conflict within parent struct.
let (parent_id, parent_struct) =
resolve_parent_target(base_schema, parent_path)?;

if parent_struct.fields().iter().any(|f| {
f.name == pending_field.name
&& !delete_ids.contains(&f.id)
&& !delete_ids.contains(&parent_id)
}) {
return Err(Error::new(
ErrorKind::PreconditionFailed,
format!(
"Cannot add column, name already exists in '{}': {}",
parent_path, pending_field.name
),
));
}

parent_id
}
};

// Assign fresh IDs immediately, preserving insertion order.
let field = if self.auto_assign_ids && pending_field.id == DEFAULT_ID {
assign_fresh_ids(&pending_field, &mut last_column_id)
} else {
pending_field
};

additions_by_parent
.entry(parent_id)
.or_default()
.push(field);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite a long method. Have you considered separating some parts of it into a couple of standalone functions? I.e. this loop's body.

Comment on lines +401 to +409
if pending_field.name.contains('.') {
return Err(Error::new(
ErrorKind::PreconditionFailed,
format!(
"Cannot add column with ambiguous name: {}. Use `AddColumn::with_parent` to add a column to a nested struct.",
pending_field.name
),
));
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woludn't that be better to support "parent.child" syntax for making nested fields?

Comment on lines +443 to +446
if parent_struct.fields().iter().any(|f| {
f.name == pending_field.name
&& !delete_ids.contains(&f.id)
&& !delete_ids.contains(&parent_id)
Copy link
Copy Markdown
Member

@kryvashek kryvashek Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructType has methods field_by_id and field_by_name.

Also condition part delete_ids.contains(&parent_id) doesn't depend on loop variables and should be done before that loop.

And also... Suppose we already have struct/map/list column with id A and its child column with id B named "my_column". Then we delete the whole column A (that also deletes all of its children, right?). And then we add a new column named "my_column", setting its parent to A. When we get to this point of code, the condition above won't shoot:

  1. Column A has a child with the same name "my_column" as the one being added => f.name == pending_field.name is true.
  2. delete_ids doesn't contain B cause we've never added it there since it isn't deleted directly => !delete_ids.contains(&f.id) is true.
  3. But delete_ids contains A due to it's planned deletion => !delete_ids.contains(&parent_id) is false.

So, the condition never shoots, and we're going to add a column with parent A although there will be no such parent when we finish.
Am I right? I suggest adding test(s) for such case(s).
(I know this condition tests another case, I'm just saying there seems to be at least one erroneous case we don't check.)

@LLDay LLDay force-pushed the dsemenov/update-schema-fork branch from 1c37d2e to 71c3da1 Compare April 15, 2026 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants