WT-16565 Refactor removal of shared metadata to use queue#13115
WT-16565 Refactor removal of shared metadata to use queue#13115
Conversation
pmacko86
left a comment
There was a problem hiding this comment.
Almost LGTM. I found one race condition, but as I explained in the comment, it should not be hard to fix if we choose the easiest option. All my other comments are relatively minor and mostly suggest better naming. Thanks.
src/include/connection.h
Outdated
|
|
||
| TAILQ_ENTRY(__wt_disagg_update_metadata) q; /* Linked list of entries. */ | ||
| /* Metadata type operation. */ | ||
| u_int metadata_op; |
There was a problem hiding this comment.
Change the type of this to WT_SHARED_METADATA_OPS (and move the definition of WT_SHARED_METADATA_OPS above this struct definition).
src/include/connection.h
Outdated
| SHARED_METADATA_UPDATE = 0, | ||
| SHARED_METADATA_CREATE, | ||
| SHARED_METADATA_REMOVE | ||
| } WT_SHARED_METADATA_OPS; |
There was a problem hiding this comment.
As mentioned in my earlier comment, it would be better to use this as a type. In that case, it would be better to rename it to WT_SHARED_METADATA_OP.
src/include/connection.h
Outdated
| * Identify the shared metadata operations inside the shared metadata queue. | ||
| */ | ||
| typedef enum { | ||
| SHARED_METADATA_UPDATE = 0, |
There was a problem hiding this comment.
Also, I think it would be better to use the WT_ prefix for these, such as WT_SHARED_METADATA_UPDATE?
src/include/connection.h
Outdated
| TAILQ_HEAD(__wt_disagg_update_metadata_qh, __wt_disagg_update_metadata) update_metadata_qh; | ||
| WT_SPINLOCK update_metadata_lock; | ||
| TAILQ_HEAD(__wt_disagg_shared_metadata_qh, __wt_disagg_metadata_op) shared_metadata_qh; | ||
| WT_SPINLOCK shared_metadata_lock; |
There was a problem hiding this comment.
The new name of this lock implies that we are locking the entire shared metadata table (analogously to the schema lock), so please rename this field accordingly. Maybe shared_metadata_queue_lock? The name might feel a bit long, but I think here clarity trumps conciseness.
src/conn/conn_layered.c
Outdated
| */ | ||
| static void | ||
| __disagg_update_metadata_clear(WT_SESSION_IMPL *session) | ||
| __disagg_shared_metadata_clear(WT_SESSION_IMPL *session) |
There was a problem hiding this comment.
Same here – the new name implies clearing all of shared metadata. Instead, we are just clearing the queue, so you might want to name it something like __disagg_shared_metadata_queue_clear instead.
src/conn/conn_layered.c
Outdated
|
|
||
| /* | ||
| * __disagg_update_metadata_free -- | ||
| * __disagg_shared_metadata_free -- |
There was a problem hiding this comment.
Same here – need a clearer name, e.g., __disagg_shared_metadata_op_free.
src/conn/conn_layered.c
Outdated
| "operation to shared " | ||
| "metadata table at next checkpoint:", | ||
| table_name, stable_uri); | ||
| table_name, stable_uri, metadata_op == SHARED_METADATA_UPDATE ? "UPDATE" : "DELETE"); |
There was a problem hiding this comment.
This needs a case for CREATE. As this is already the second case of us needing to convert this to a string, it would be useful to have a helper function to do that. As the helper function would be returning just a static string constant, I think you can make it return a const char * type, which would make it easier to use.
src/conn/conn_layered.c
Outdated
| */ | ||
| int | ||
| __wt_disagg_update_metadata_process(WT_SESSION_IMPL *session) | ||
| __wt_disagg_shared_metadata_process(WT_SESSION_IMPL *session) |
There was a problem hiding this comment.
For consistency, I'd rename this to __wt_disagg_shared_metadata_queue_process, even though the risk of misunderstanding this function from its name is less.
src/conn/conn_layered.c
Outdated
| if (entry->deferred) { | ||
| __wt_verbose_debug2(session, WT_VERB_DISAGGREGATED_STORAGE, | ||
| "Defer metadata %s operation for table \"%s\"", entry->table_name, | ||
| entry->metadata_op == SHARED_METADATA_REMOVE ? "create" : "update"); |
There was a problem hiding this comment.
Same here – after you add the helper function to convert metadata_op to string, also use it here.
| * metadata table for that checkpoint. We defer these metadata operations to the next checkpoint | ||
| * to keep the checkpoints metadata and table state consistent. | ||
| */ | ||
| WT_ACQUIRE_READ_WITH_BARRIER(ckpt_running, conn->txn_global.checkpoint_running); |
There was a problem hiding this comment.
I am not sure if this is correct, because conn->txn_global.checkpoint_running is set to true before the checkpoint transaction's snapshot is acquired. We need to prevent the following scenario for create:
- Thread 1:
conn->txn_global.checkpoint_running = true - Thread 2: The new table gets created.
- Thread 2: The
_mdb_catalogis updated accordingly and committed. - Thread 1: The checkpoint transaction acquires its snapshot.
In this case, the update to_mdb_catalogwould be included in the checkpoint, but the new table would not, which would lead to metadata inconsistency.
If we use another field that gets set after the transaction snapshot is acquired, we would risk a similar metadata inconsistency problem with drop.
I think we have the following options, in order of increasing difficulty:
- Do not defer create, since it's always okay for the create to happen before the
_mdb_catalogupdate. (But still keep the separate enum value for create, as it's likely we'd like to use it in the future.) - Use two different mechanism to determine whether we are deferring, depending on whether we are doing remove vs. create. This would be kind of ugly, but it's an option.
- Do this exactly based on whether the transaction ID of the create/drop metadata update is included in the transaction snapshot. This would take careful thought to ensure we do this correctly and really understand all relevant corner cases, because each local metadata table update is done with a different transaction.
I think option 1 is probably what we'd like to do for now, as it's simplest, and as I think it should be the easiest to reason through its correctness.
There was a problem hiding this comment.
Sure I went down option 1.
After WT-16458 refactors how we update the shared metadata table, so that we put all updates in a queue and then process them during a checkpoint. I have re-used the queue to also queue removes. Here are some technical changes I have made:
update_metadata_qhtoshared_metadata_qh.There are few changes, I would like for reviewers to review: