Skip to content

WT-16565 Refactor removal of shared metadata to use queue#13115

Merged
jiechenbo merged 48 commits intodevelopfrom
wt-16565-refactor-remove-shared-queue
Feb 25, 2026
Merged

WT-16565 Refactor removal of shared metadata to use queue#13115
jiechenbo merged 48 commits intodevelopfrom
wt-16565-refactor-remove-shared-queue

Conversation

@jiechenbo
Copy link
Contributor

@jiechenbo jiechenbo commented Jan 30, 2026

After WT-16458 refactors how we update the shared metadata table, so that we put all updates in a queue and then process them during a checkpoint. I have re-used the queue to also queue removes. Here are some technical changes I have made:

  • Renamed queue and related variables from update_metadata_qh to shared_metadata_qh .
  • Added WT_SHARED_METADATA_OPS struct to the metadata queue entry.
  • Update metadata queue process to process removes as well
  • Add defer variable to shared metadata queue that will defer metadata operations to the checkpoint after if a concurrent checkpoint is happening.

There are few changes, I would like for reviewers to review:

  • The queue will have contain both insert/removes entries. This also means that we could have one entry performing creation of same table and then we directly do a delete after.
  • We have a very similar for __disagg_remove_shared_metadata and __disagg_update_shared_metadata. Ideally I would like to combine them.

@jiechenbo jiechenbo requested a review from a team as a code owner January 30, 2026 03:05
@jiechenbo jiechenbo requested a review from mana2-bot January 30, 2026 03:05
Copy link
Contributor

@korteland korteland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jiechenbo jiechenbo requested a review from korteland February 19, 2026 03:09
Copy link
Contributor

@pmacko86 pmacko86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost LGTM. I found one race condition, but as I explained in the comment, it should not be hard to fix if we choose the easiest option. All my other comments are relatively minor and mostly suggest better naming. Thanks.


TAILQ_ENTRY(__wt_disagg_update_metadata) q; /* Linked list of entries. */
/* Metadata type operation. */
u_int metadata_op;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the type of this to WT_SHARED_METADATA_OPS (and move the definition of WT_SHARED_METADATA_OPS above this struct definition).

SHARED_METADATA_UPDATE = 0,
SHARED_METADATA_CREATE,
SHARED_METADATA_REMOVE
} WT_SHARED_METADATA_OPS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in my earlier comment, it would be better to use this as a type. In that case, it would be better to rename it to WT_SHARED_METADATA_OP.

* Identify the shared metadata operations inside the shared metadata queue.
*/
typedef enum {
SHARED_METADATA_UPDATE = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it would be better to use the WT_ prefix for these, such as WT_SHARED_METADATA_UPDATE?

TAILQ_HEAD(__wt_disagg_update_metadata_qh, __wt_disagg_update_metadata) update_metadata_qh;
WT_SPINLOCK update_metadata_lock;
TAILQ_HEAD(__wt_disagg_shared_metadata_qh, __wt_disagg_metadata_op) shared_metadata_qh;
WT_SPINLOCK shared_metadata_lock;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new name of this lock implies that we are locking the entire shared metadata table (analogously to the schema lock), so please rename this field accordingly. Maybe shared_metadata_queue_lock? The name might feel a bit long, but I think here clarity trumps conciseness.

*/
static void
__disagg_update_metadata_clear(WT_SESSION_IMPL *session)
__disagg_shared_metadata_clear(WT_SESSION_IMPL *session)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here – the new name implies clearing all of shared metadata. Instead, we are just clearing the queue, so you might want to name it something like __disagg_shared_metadata_queue_clear instead.


/*
* __disagg_update_metadata_free --
* __disagg_shared_metadata_free --
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here – need a clearer name, e.g., __disagg_shared_metadata_op_free.

"operation to shared "
"metadata table at next checkpoint:",
table_name, stable_uri);
table_name, stable_uri, metadata_op == SHARED_METADATA_UPDATE ? "UPDATE" : "DELETE");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a case for CREATE. As this is already the second case of us needing to convert this to a string, it would be useful to have a helper function to do that. As the helper function would be returning just a static string constant, I think you can make it return a const char * type, which would make it easier to use.

*/
int
__wt_disagg_update_metadata_process(WT_SESSION_IMPL *session)
__wt_disagg_shared_metadata_process(WT_SESSION_IMPL *session)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, I'd rename this to __wt_disagg_shared_metadata_queue_process, even though the risk of misunderstanding this function from its name is less.

if (entry->deferred) {
__wt_verbose_debug2(session, WT_VERB_DISAGGREGATED_STORAGE,
"Defer metadata %s operation for table \"%s\"", entry->table_name,
entry->metadata_op == SHARED_METADATA_REMOVE ? "create" : "update");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here – after you add the helper function to convert metadata_op to string, also use it here.

* metadata table for that checkpoint. We defer these metadata operations to the next checkpoint
* to keep the checkpoints metadata and table state consistent.
*/
WT_ACQUIRE_READ_WITH_BARRIER(ckpt_running, conn->txn_global.checkpoint_running);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is correct, because conn->txn_global.checkpoint_running is set to true before the checkpoint transaction's snapshot is acquired. We need to prevent the following scenario for create:

  1. Thread 1: conn->txn_global.checkpoint_running = true
  2. Thread 2: The new table gets created.
  3. Thread 2: The _mdb_catalog is updated accordingly and committed.
  4. Thread 1: The checkpoint transaction acquires its snapshot.
    In this case, the update to _mdb_catalog would be included in the checkpoint, but the new table would not, which would lead to metadata inconsistency.

If we use another field that gets set after the transaction snapshot is acquired, we would risk a similar metadata inconsistency problem with drop.

I think we have the following options, in order of increasing difficulty:

  1. Do not defer create, since it's always okay for the create to happen before the _mdb_catalog update. (But still keep the separate enum value for create, as it's likely we'd like to use it in the future.)
  2. Use two different mechanism to determine whether we are deferring, depending on whether we are doing remove vs. create. This would be kind of ugly, but it's an option.
  3. Do this exactly based on whether the transaction ID of the create/drop metadata update is included in the transaction snapshot. This would take careful thought to ensure we do this correctly and really understand all relevant corner cases, because each local metadata table update is done with a different transaction.

I think option 1 is probably what we'd like to do for now, as it's simplest, and as I think it should be the easiest to reason through its correctness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I went down option 1.

@jiechenbo jiechenbo requested a review from pmacko86 February 23, 2026 23:54
Copy link
Contributor

@pmacko86 pmacko86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Copy link
Contributor

@korteland korteland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jiechenbo jiechenbo added this pull request to the merge queue Feb 25, 2026
Merged via the queue into develop with commit 185f84f Feb 25, 2026
14 checks passed
@jiechenbo jiechenbo deleted the wt-16565-refactor-remove-shared-queue branch February 25, 2026 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants