Conversation
There was a problem hiding this comment.
Pull request overview
This pull request implements support for Slurm dynamic topology by enabling ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.
Changes:
- Added new CRD fields
EphemeralNodesandEphemeralTopologyWaitTimeoutto NodeSet specification - Implemented TopologyDistributionReconciler to manage OpenKruise ResourceDistribution for syncing topology ConfigMaps to namespaces with ephemeral NodeSets
- Added Python wait_for_topology.py script that waits for and formats topology data from ConfigMap before starting slurmd
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| api/v1alpha1/nodeset_types.go | Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to NodeSetSpec |
| api/v1alpha1/zz_generated.deepcopy.go | Generated deep copy methods for new EphemeralNodes pointer field |
| cmd/main.go | Registered TopologyDistributionReconciler and added kruisev1alpha1 scheme |
| config/crd/bases/slurm.nebius.ai_nodesets.yaml | Updated NodeSet CRD with new ephemeral nodes fields and validation |
| config/rbac/clustercontroller/role.yaml | Added RBAC permissions for resourcedistributions resource |
| helm/soperator/crds/slurmcluster-crd.yaml | Updated SlurmCluster CRD with new ephemeral nodes fields |
| helm/soperator/templates/manager-rbac.yaml | Added resourcedistributions to Kruise resources in RBAC |
| helm/soperator/values.yaml | Enabled ResourceDistributionGate feature gate for Kruise |
| helm/soperator-crds/templates/slurmcluster-crd.yaml | Updated CRD template with ephemeral nodes configuration |
| helm/soperator-fluxcd/values.yaml | Enabled ResourceDistributionGate feature gate |
| images/worker/slurmd.dockerfile | Added wait_for_topology.py script to image and made it executable |
| images/worker/slurmd_entrypoint.sh | Added build_dynamic_conf function and ephemeral nodes mode that sources topology env file |
| images/worker/wait_for_topology.py | New Python script that waits for topology data in ConfigMap and writes formatted env file |
| images/worker/wait_for_topology_test.py | Comprehensive unit tests for topology wait script |
| internal/consts/container.go | Added ContainerNameWaitForTopology constant |
| internal/consts/volume.go | Added volume and mount path constants for topology ConfigMap and env file |
| internal/controller/topologyconfcontroller/resourcedistribution_controller.go | New controller that manages ResourceDistribution for topology ConfigMap syncing |
| internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go | Unit tests for TopologyDistributionReconciler |
| internal/render/worker/container.go | Added RenderContainerWaitForTopology init container and topology env variables to slurmd container |
| internal/render/worker/statefulset.go | Added topology volumes and init container for ephemeral nodes |
| internal/values/slurm_nodeset.go | Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to SlurmNodeSet |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
afd3440 to
bf57f8d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
64ead49 to
6558232
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
internal/controller/topologyconfcontroller/workertopology_controller.go:393
- The InitializeTopologyConf function now always includes a "dummy" node in the topology configuration, even when there are actual worker nodes. This could cause issues with Slurm as it will try to manage a non-existent "dummy" node. The logic should be:
- If there are actual nodes (after adding real nodes), don't add the dummy node
- Only add the dummy node when there are no actual nodes to prevent an empty Nodes list
The condition at line 388 if len(nodes) == 0 will never be true after line 374 adds the dummy node, making this check ineffective.
func InitializeTopologyConf(asts *kruisev1b1.StatefulSetList) string {
if asts == nil {
return ""
}
var nodes []string
switchName := "SwitchName=unknown"
dummyNodeName := "dummy"
nodes = append(nodes, dummyNodeName)
if len(asts.Items) > 0 {
for _, sts := range asts.Items {
if sts.Spec.Replicas == nil || *sts.Spec.Replicas <= 0 {
continue
}
for i := 0; i < int(*sts.Spec.Replicas); i++ {
nodes = append(nodes, sts.Name+"-"+strconv.Itoa(i))
}
}
}
if len(nodes) == 0 {
return switchName
}
return switchName + " Nodes=" + strings.Join(nodes, ",")
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6558232 to
2ccb9d1
Compare
8fc8429 to
2b01a38
Compare
4621cf2 to
def2a79
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 56 out of 56 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
544d002 to
37f4a74
Compare
…on='' Comment=''" and cluster.HasEphemeralNodes() removed in the cm
382ad2f to
87cba5c
Compare
Problem
This pull request implements support for Slurm dynamic topology and ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.
Testing
manually
Release Notes
Fixed