Skip to content

fix(nodemeta): preserve admin-configured MaxMvmNum across cubelet restarts#348

Open
chenggui53 wants to merge 1 commit into
TencentCloud:masterfrom
chenggui53:fix-max-mvm-num-clean
Open

fix(nodemeta): preserve admin-configured MaxMvmNum across cubelet restarts#348
chenggui53 wants to merge 1 commit into
TencentCloud:masterfrom
chenggui53:fix-max-mvm-num-clean

Conversation

@chenggui53
Copy link
Copy Markdown
Contributor

When a cubelet registers with the scheduler, RegisterNode unconditionally overwrites MaxMvmNum from the request. This means any admin-configured limit (set directly in the database) is lost when the cubelet restarts, because the cubelet calculates its own value from available memory.

The fix preserves the existing MaxMvmNum when the incoming request doesn't specify one (<=0), allowing scheduler-side limits to survive cubelet restarts.

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 25, 2026

Claude finished @chenggui53's task in 49s —— View job


Code Review: fix(nodemeta): preserve admin-configured MaxMvmNum across cubelet restarts

Overview

This PR addresses a real problem: when a cubelet restarts and re-registers with the scheduler, RegisterNode was unconditionally overwriting MaxMvmNum with whatever the cubelet sent. Since the cubelet calculates this value from available memory, any admin-configured limit set directly in the database would be lost.

Changed files:

  • CubeMaster/pkg/nodemeta/service.go (+17/-2)
  • CubeMaster/pkg/nodemeta/service_maxmvm_test.go (+70/-0)

The Fix

The logic in RegisterNode (service.go:133-139) is correct and handles all relevant edge cases:

Scenario existing.MaxMvmNum req.MaxMvmNum Result
New node registration nil any Uses req value
Existing, admin set limit > 0 <= 0 Preserves existing
Existing, cubelet sent a value > 0 > 0 Uses req (intentional overwrite)
Existing, no existing limit 0 any Uses req value

The <= 0 guard on req.MaxMvmNum correctly treats both 0 (omitted/zero-value in JSON) and negative values as "no value specified."

Tests

Four test cases in service_maxmvm_test.go cover the logic:

  • TestGetNodeReturnsNilForMissinggetNode nil-safety
  • TestGetNodeReturnsExistinggetNode returns correct values
  • TestPreserveExistingMaxMvmNum — preservation path
  • TestOverwriteWhenRequestHasValue — override path
  • TestUseRequestWhenNoExisting — first registration path

Potential Concerns

1. Tests test inline logic rather than RegisterNode itself
The tests duplicate the preservation logic inline rather than calling the actual RegisterNode function. This means a refactoring that breaks the logic in RegisterNode wouldn't necessarily break these tests, since they have their own copy. I understand the constraint — RegisterNode requires a gorm.DB connection — but consider either:

  • Extracting the decision into a small, exported helper that can be unit-tested directly, or
  • Adding a higher-level integration test that exercises the DB-backed path.

2. No negative MaxMvmNum test
The condition checks req.MaxMvmNum <= 0, but all tests use reqMaxMvmNum = int64(0). A test with a negative value (e.g., int64(-1)) would confirm that edge case also triggers preservation. Minor issue — worth adding a quick extra test case.

3. Minute race window between getNode and ensureNode
In RegisterNode, getNode (read-locked at line 137) and ensureNode/the write-back (locked at line 167-180) are not atomically guarded. Between the two calls, another goroutine could theoretically modify the existing node's MaxMvmNum in the snapshot map. This is a pre-existing pattern in the codebase and unlikely to manifest given the low frequency of concurrent registrations for the same nodeID, so it's not a blocker — just worth noting.

Verdict

The fix is minimal, well-scoped, and correct. The core logic properly distinguishes "admin has set a value" from "cubelet is sending its default" and preserves the former. The code is clean and the comment at line 133 explains the "why" clearly. Ship it with the minor test coverage improvement noted above.


…tarts

When a cubelet registers with the scheduler, RegisterNode unconditionally
overwrites MaxMvmNum from the request. This means any admin-configured
limit (set directly in the database) is lost when the cubelet restarts,
because the cubelet calculates its own value from available memory.

The fix preserves the existing MaxMvmNum when the incoming request
doesn't specify one (<=0), allowing scheduler-side limits to survive
cubelet restarts.

Extract the decision logic into a standalone resolveMaxMvmNum helper so
it can be unit-tested directly without duplicating the condition inline.
Add table-driven tests covering zero, negative, positive, nil-existing,
and zero-existing edge cases.

Signed-off-by: chenggui53 <chenggui53@gmail.com>
@chenggui53 chenggui53 force-pushed the fix-max-mvm-num-clean branch from 67fa4ed to d60a9a0 Compare May 25, 2026 13:39
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 25, 2026

Claude finished @chenggui53's task in 1m 0s —— View job


Code Review: Updated PR (synchronize event)

Task Checklist

  • Analyze previous review feedback
  • Read updated source files
  • Verify all review suggestions were addressed
  • Confirm test coverage

Previous Review Feedback — Status

Feedback Status
Extract decision logic into testable helper ✅ Addressed — resolveMaxMvmNum() extracted at service.go:266
Add negative MaxMvmNum test case ✅ Addressed — "preserve existing when request is negative" at test line 40-44
Race window between getNode and ensureNode ⚠️ Noted as pre-existing, not a blocker

Updated Review

The PR has been updated and all previous review feedback has been addressed:

1. The resolveMaxMvmNum helper (service.go:266-271) correctly encapsulates the decision logic and is now directly unit-testable. The conditions are ordered to short-circuit when there's no existing node or the existing node has no limit set — only when preservation is actually needed does the function return the existing value.

2. The test file (service_maxmvm_test.go) now has:

  • TestGetNodeReturnsNilForMissing — verifies getNode nil-safety for non-existent nodes
  • TestGetNodeReturnsExisting — verifies getNode returns stored values
  • TestResolveMaxMvmNum — table-driven tests with 5 cases covering:
    • Preserve when request is zero
    • Preserve when request is negative
    • Override when request has a value
    • Use request when no existing node
    • Use request when existing has no limit (MaxMvmNum=0)

All test cases call resolveMaxMvmNum() directly, so they actually test the production code (not a copy).

3. Integration with RegisterNode (service.go:133-173): The helper is called once at line 133 and the result (effectiveMaxMvmNum) is used consistently for both the DB upsert (line 147) and the in-memory snapshot update (line 173). No divergence risk.

Verdict

This PR is clean and complete. All review suggestions have been addressed. The fix is minimal, the test coverage is thorough, and the resolveMaxMvmNum pattern is easy to follow. Ready to merge.

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 26, 2026

Claude finished @chenggui53's task in 1m 26s —— View job


Review: fix(nodemeta): preserve admin-configured MaxMvmNum across cubelet restarts

Step Status
Analyze event context ✅ Complete
Read changed files ✅ Complete
Verify previous review feedback addressed ✅ Complete
Post review results ✅ Complete

Previous Review Feedback — Status

Feedback from prior review Status
Extract decision logic into testable helper ✅ Addressed — resolveMaxMvmNum() at service.go:266
Add negative MaxMvmNum test case ✅ Addressed — "preserve existing when request is negative" at test line 40-44
Race window between getNode and ensureNode ⚠️ Pre-existing pattern, not a blocker

All previous review feedback has been addressed. The nodemeta logic is solid.


Nodemeta Changes — Verified Correct

The core fix in RegisterNode (service.go:133) is correct:

effectiveMaxMvmNum := resolveMaxMvmNum(global.getNode(req.NodeID), req.MaxMvmNum)

The resolveMaxMvmNum helper (service.go:266-271) cleanly handles all edge cases:

Scenario existing.MaxMvmNum req.MaxMvmNum Result
New node (no existing) nil any Uses req value
Existing, admin set limit > 0 <= 0 Preserves existing
Existing, cubelet sent value > 0 > 0 Uses req
Existing, no limit set 0 any Uses req value

The result (effectiveMaxMvmNum) is consistently applied to both the DB upsert (line 147) and the in-memory snapshot (line 173) — no divergence risk.

Tests (service_maxmvm_test.go) are thorough with 5 test cases including the negative-request edge case requested in the previous review. The tests call resolveMaxMvmNum() directly, so they test real production code.


Build Changes — New Since Last Review

Two additional files changed (separate from the core fix):

1. Makefile (lines 101-111):

  • Builder targets changed from make build && cp build/... to go build -o /workspace/_output/bin/... directly
  • Added go clean -cache before each Go build target
  • go mod download added for dependency resolution

2. deploy/one-click/build-release-bundle-builder.sh (line 41-42):

  • Added go clean -cache before Go builds

Observations

a) builder-image target always rebuilds (Makefile line 53): The BUILDER_FORCE_REBUILD skip logic was removed — every invocation now unconditionally builds the Docker image. The commit message only mentions clearing the Go build cache, so this may be unintentional or unrelated. If images change infrequently, restoring the skip-if-exists guard (but keeping the forced rebuild opt-in via BUILDER_FORCE_REBUILD=1) would speed up local dev and CI. If this was deliberate for CI reproducibility, the commit message should note it.

b) Redundant go clean -cache across targets: Each target (cubemaster, cubelet, network-agent) runs go clean -cache inside its own container — but since each is a separate builder-run invocation, each starts with a fresh Docker container anyway, so the cumulative effect is correct per build. Not a concern, just noting.

c) make proto preserved: Good — the protobuf generation step is still included before go build, so generated code stays current.


Verdict

The nodemeta fix is minimal, well-scoped, correct, and thoroughly tested. The build changes are reasonable — just note the forced image rebuild point (a) above in case it was unintentional. Ready to merge.


@chenggui53 chenggui53 force-pushed the fix-max-mvm-num-clean branch from 200c437 to d60a9a0 Compare May 26, 2026 02:24
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 26, 2026

Claude finished @chenggui53's task in 3m 32s —— View job


Review: fix(nodemeta): preserve admin-configured MaxMvmNum across cubelet restarts

Step Status
Read changed files
Trace callers and data flow
Verify previous review feedback addressed
Assess fix against described problem

Previous Review Feedback — Status

Feedback from prior reviews Status
Extract decision logic into testable helper (resolveMaxMvmNum) ✅ Done at service.go:266
Add negative MaxMvmNum test case ✅ Done at test line 40-44
Race window between getNode and ensureNode ⚠️ Pre-existing, acceptable

All prior review feedback has been addressed. The getNode extraction (service.go:256-259) is clean, the resolveMaxMvmNum helper (service.go:266-271) is well-factored, and tests call resolveMaxMvmNum() directly (testing real production code).


The Fix — What It Does

func resolveMaxMvmNum(existing *NodeSnapshot, reqMaxMvmNum int64) int64 {
    if existing != nil && existing.MaxMvmNum > 0 && reqMaxMvmNum <= 0 {
        return existing.MaxMvmNum
    }
    return reqMaxMvmNum
}

When RegisterNode receives a request where max_mvm_num is zero or negative, and the node already exists in the snapshot map with a positive value, the existing value is preserved. The result (effectiveMaxMvmNum) is applied consistently to both the DB upsert (line 147) and the in-memory snapshot (line 173).

Tests — 5 table-driven cases

Name existing.MaxMvmNum req Result
preserve existing when request is zero 500 0 500
preserve existing when request is negative 500 -1 500
use request when request has value 500 300 300
use request when no existing node nil 200 200
use request when existing has no limit 0 200 200

Thorough. Code and tests are clean, well-structured, and match the stated implementation.


Concern: Fix doesn't match the described problem

The PR description says:

"Any admin-configured limit (set directly in the database) is lost when the cubelet restarts, because the cubelet calculates its own value from available memory."

The fix preserves MaxMvmNum only when req.MaxMvmNum <= 0. But tracing the cubelet's code:

The cubelet always sends max_mvm_num > 0, so the guard reqMaxMvmNum <= 0 will never trigger for a cubelet originating request. The described scenario — a cubelet restart overwriting an admin-set DB value — is not prevented by this fix.

Concrete scenario where this breaks:

Step Action MaxMvmNum
1 Cubelet registers, calculates from memory 500 (DB)
2 Admin sets limit directly in DB 300 (DB)
3 Scheduler restarts, reload() loads DB value 300 (memory)
4 Cubelet restarts, re-registers with resolveHostMaxMvmNum() sends 500
5 resolveMaxMvmNum(300, 500) → 500 > 0 → returns 500 Admin's 300 lost

The <= 0 check is a defensive guard for cases where the request genuinely omits the field — but that doesn't match the cubelet restart scenario described in the PR.

Options to address this

A. Extend the condition to always preserve a positive existing value (simplest):

func resolveMaxMvmNum(existing *NodeSnapshot, reqMaxMvmNum int64) int64 {
    if existing != nil && existing.MaxMvmNum > 0 {
        return existing.MaxMvmNum
    }
    return reqMaxMvmNum
}

This is a behavioral change: the cubelet's auto-calculated value would never overwrite an admin-set limit. If the cubelet legitimately needs to increase or decrease the limit (e.g., memory changes), it would require admin action or a workflow where the cubelet first clears the existing limit.

B. Keep the fix as-is, but update the cubelet to send max_mvm_num = 0 on re-registration (two-phase change):

  • This PR adds the scheduler guard (already done)
  • A follow-up PR modifies the cubelet to omit max_mvm_num or send 0 when re-registering after a restart
  • The PR description should be updated to reflect this two-phase plan

C. Correct and narrow the PR description to accurately describe the guard as "defensive — preserve existing MaxMvmNum when the request doesn't specify one" rather than "fix cubelet restart overwrite."


Minor: RegisterNodeRequest JSON round-trip

The RegisterNodeRequest struct in both the schedulers nodemeta/service.go:46-60 and the cubelet's masterclient/client.go:27-40 use json:"max_mvm_num,omitempty". With the omitempty tag, 0 is omitted during serialization. This means:

  • If a future cubelet sends MaxMvmNum = 0, the field is absent from JSON
  • On the scheduler side, deserialization yields req.MaxMvmNum = 0 (Go zero value)
  • The <= 0 guard catches this

This is correct behavior, but worth noting it works because of the omitempty interaction on the client side.


Verdict

The code is clean, well-tested, and the extracted helper is well-factored. However, the fix does not prevent the problem described in the PR — the cubelet always sends a positive max_mvm_num, so the <= 0 guard never triggers for cubelet registrations.

Recommended action: Either extend the preservation condition (option A) or clarify the scope of the fix and plan the cubelet-side change (option B).

@fslongjin
Copy link
Copy Markdown
Member

Thanks for the PR! I have a question about the usage scenario.

How do you currently set the cubelet's MaxMvmNum in CubeMaster — do you modify the database directly? We'd like to understand the workflow better to evaluate whether the fix covers the right case, especially given that resolveHostMaxMvmNum in the cubelet always returns a positive value (see the analysis in the review above).

If you're setting it via direct DB edits, we should also consider whether a proper API/UI for this configuration would be a better long-term approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants