Conversation
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. |
There was a problem hiding this comment.
Seems a bit wrong in a part "Licensed to the Apache Software Foundation (ASF)". Moreover, do we agree on using Apache license here or should we chose another one?
| // Default ID for a new column. This will be re-assigned to a fresh ID at commit time. | ||
| const DEFAULT_ID: i32 = 0; |
There was a problem hiding this comment.
Haven't seen further yet, but using Option<i32> with None as a default for added columns seems more convenient at first glance.
| } | ||
|
|
||
| /// Return a copy with an updated doc string. | ||
| pub fn with_doc(mut self, doc: impl ToString) -> Self { |
There was a problem hiding this comment.
Please keep in mind that ToString works via Display implementation for a related type, including the String itself, meaning passed value of that type will be de-facto cloned (and not with a simple soviet memcpy call). If it is expected for argument doc to be mostly of type String, usage of Into<String> is preffered.
This is also applicable to name argument of optional and required methods and parent argument of with_parent method. And also to name argument of UpdateSchemaAction::delete_column.
| self | ||
| } | ||
|
|
||
| fn to_nested_field(&self) -> NestedFieldRef { |
There was a problem hiding this comment.
Wouldn't impl From<AddColumn> for NestedFieldRef be more convenient? Or at least Into, and probably with &AddColumn?
| /// ``` | ||
| pub struct UpdateSchemaAction { | ||
| additions: Vec<AddColumn>, | ||
| deletes: Vec<String>, |
There was a problem hiding this comment.
| deletes: Vec<String>, | |
| deletions: Vec<String>, |
| // --- Root-level additions --- | ||
|
|
||
| /// Add a column to the table schema. | ||
| /// | ||
| /// To add a root-level column, leave `AddColumn::parent` as `None`. | ||
| /// For nested additions, set a parent path (for example via [`AddColumn::with_parent`]). |
There was a problem hiding this comment.
--- Root-level additions ---
To add a root-level column, leave
AddColumn::parentasNone.
For nested additions, set a parent path ...
The first of these lines contradicts the last one.
| .and_then(|field| { | ||
| match base_schema | ||
| .identifier_field_ids() | ||
| .find(|id| *id == field.id) |
There was a problem hiding this comment.
Use field_by_id, that would be more efficient.
| let pending_field = add.to_nested_field(); | ||
|
|
||
| // Check that name does not contain ".". | ||
| if pending_field.name.contains('.') { | ||
| return Err(Error::new( | ||
| ErrorKind::PreconditionFailed, | ||
| format!( | ||
| "Cannot add column with ambiguous name: {}. Use `AddColumn::with_parent` to add a column to a nested struct.", | ||
| pending_field.name | ||
| ), | ||
| )); | ||
| } | ||
|
|
||
| // Required columns without an initial default need allow_incompatible_changes. | ||
| if pending_field.required && pending_field.initial_default.is_none() { | ||
| return Err(Error::new( | ||
| ErrorKind::PreconditionFailed, | ||
| format!( | ||
| "Incompatible change: cannot add required column without an initial default: {}", | ||
| pending_field.name | ||
| ), | ||
| )); | ||
| } | ||
|
|
||
| let parent_id = match &add.parent { | ||
| None => { | ||
| // Root-level: check name conflict against root-level fields. | ||
| if let Some(existing) = base_schema.field_by_name(&pending_field.name) | ||
| && !delete_ids.contains(&existing.id) | ||
| { | ||
| return Err(Error::new( | ||
| ErrorKind::PreconditionFailed, | ||
| format!( | ||
| "Cannot add column, name already exists: {}", | ||
| pending_field.name | ||
| ), | ||
| )); | ||
| } | ||
| TABLE_ROOT_ID | ||
| } | ||
| Some(parent_path) => { | ||
| // Nested: resolve parent, check name conflict within parent struct. | ||
| let (parent_id, parent_struct) = | ||
| resolve_parent_target(base_schema, parent_path)?; | ||
|
|
||
| if parent_struct.fields().iter().any(|f| { | ||
| f.name == pending_field.name | ||
| && !delete_ids.contains(&f.id) | ||
| && !delete_ids.contains(&parent_id) | ||
| }) { | ||
| return Err(Error::new( | ||
| ErrorKind::PreconditionFailed, | ||
| format!( | ||
| "Cannot add column, name already exists in '{}': {}", | ||
| parent_path, pending_field.name | ||
| ), | ||
| )); | ||
| } | ||
|
|
||
| parent_id | ||
| } | ||
| }; | ||
|
|
||
| // Assign fresh IDs immediately, preserving insertion order. | ||
| let field = if self.auto_assign_ids && pending_field.id == DEFAULT_ID { | ||
| assign_fresh_ids(&pending_field, &mut last_column_id) | ||
| } else { | ||
| pending_field | ||
| }; | ||
|
|
||
| additions_by_parent | ||
| .entry(parent_id) | ||
| .or_default() | ||
| .push(field); |
There was a problem hiding this comment.
This is quite a long method. Have you considered separating some parts of it into a couple of standalone functions? I.e. this loop's body.
| if pending_field.name.contains('.') { | ||
| return Err(Error::new( | ||
| ErrorKind::PreconditionFailed, | ||
| format!( | ||
| "Cannot add column with ambiguous name: {}. Use `AddColumn::with_parent` to add a column to a nested struct.", | ||
| pending_field.name | ||
| ), | ||
| )); | ||
| } |
There was a problem hiding this comment.
Woludn't that be better to support "parent.child" syntax for making nested fields?
| if parent_struct.fields().iter().any(|f| { | ||
| f.name == pending_field.name | ||
| && !delete_ids.contains(&f.id) | ||
| && !delete_ids.contains(&parent_id) |
There was a problem hiding this comment.
StructType has methods field_by_id and field_by_name.
Also condition part delete_ids.contains(&parent_id) doesn't depend on loop variables and should be done before that loop.
And also... Suppose we already have struct/map/list column with id A and its child column with id B named "my_column". Then we delete the whole column A (that also deletes all of its children, right?). And then we add a new column named "my_column", setting its parent to A. When we get to this point of code, the condition above won't shoot:
- Column A has a child with the same name "my_column" as the one being added =>
f.name == pending_field.nameistrue. delete_idsdoesn't contain B cause we've never added it there since it isn't deleted directly =>!delete_ids.contains(&f.id)istrue.- But
delete_idscontains A due to it's planned deletion =>!delete_ids.contains(&parent_id)isfalse.
So, the condition never shoots, and we're going to add a column with parent A although there will be no such parent when we finish.
Am I right? I suggest adding test(s) for such case(s).
(I know this condition tests another case, I'm just saying there seems to be at least one erroneous case we don't check.)
52f8000 to
1c37d2e
Compare
1c37d2e to
71c3da1
Compare
What changes are included in this PR?
Cherry-picked commits related to the
update_schemafunctionality, with minor changes that allow to explicitly specify the field ID.Are these changes tested?
Original commits have tests.