Skip to content

SCHED-848: add ephemeral nodes#2101

Open
Uburro wants to merge 15 commits intomainfrom
uburro/SCHED-848/0
Open

SCHED-848: add ephemeral nodes#2101
Uburro wants to merge 15 commits intomainfrom
uburro/SCHED-848/0

Conversation

@Uburro
Copy link
Collaborator

@Uburro Uburro commented Jan 28, 2026

Problem

This pull request implements support for Slurm dynamic topology and ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.

Testing

manually

Release Notes

  • Added new CRD fields EphemeralNodes and EphemeralTopologyWaitTimeout to NodeSet specification
  • Added new NodeSetPowerState CRD for per-NodeSet power-state tracking of ephemeral nodes.
  • Added new powermanager component plus Slurm power scripts.
  • Support for ephemeral-node settings in Helm values and CRD print columns.
  • Worker lifecycle/bootstrap flow was reworked.
  • Slurm topology and worker/service rendering were updated.
  • Helm/RBAC/manifests were updated to deploy and authorize the new power-management flow.

Fixed

  • NodeSet status/phase update behavior during reconciliation.
  • Handling of stale/removed nodes and UNDRAIN update flow for ephemeral nodes

@Uburro Uburro added go Pull requests that update Go code feature labels Jan 28, 2026
@Uburro Uburro requested a review from Copilot January 28, 2026 15:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements support for Slurm dynamic topology by enabling ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.

Changes:

  • Added new CRD fields EphemeralNodes and EphemeralTopologyWaitTimeout to NodeSet specification
  • Implemented TopologyDistributionReconciler to manage OpenKruise ResourceDistribution for syncing topology ConfigMaps to namespaces with ephemeral NodeSets
  • Added Python wait_for_topology.py script that waits for and formats topology data from ConfigMap before starting slurmd

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
api/v1alpha1/nodeset_types.go Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to NodeSetSpec
api/v1alpha1/zz_generated.deepcopy.go Generated deep copy methods for new EphemeralNodes pointer field
cmd/main.go Registered TopologyDistributionReconciler and added kruisev1alpha1 scheme
config/crd/bases/slurm.nebius.ai_nodesets.yaml Updated NodeSet CRD with new ephemeral nodes fields and validation
config/rbac/clustercontroller/role.yaml Added RBAC permissions for resourcedistributions resource
helm/soperator/crds/slurmcluster-crd.yaml Updated SlurmCluster CRD with new ephemeral nodes fields
helm/soperator/templates/manager-rbac.yaml Added resourcedistributions to Kruise resources in RBAC
helm/soperator/values.yaml Enabled ResourceDistributionGate feature gate for Kruise
helm/soperator-crds/templates/slurmcluster-crd.yaml Updated CRD template with ephemeral nodes configuration
helm/soperator-fluxcd/values.yaml Enabled ResourceDistributionGate feature gate
images/worker/slurmd.dockerfile Added wait_for_topology.py script to image and made it executable
images/worker/slurmd_entrypoint.sh Added build_dynamic_conf function and ephemeral nodes mode that sources topology env file
images/worker/wait_for_topology.py New Python script that waits for topology data in ConfigMap and writes formatted env file
images/worker/wait_for_topology_test.py Comprehensive unit tests for topology wait script
internal/consts/container.go Added ContainerNameWaitForTopology constant
internal/consts/volume.go Added volume and mount path constants for topology ConfigMap and env file
internal/controller/topologyconfcontroller/resourcedistribution_controller.go New controller that manages ResourceDistribution for topology ConfigMap syncing
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go Unit tests for TopologyDistributionReconciler
internal/render/worker/container.go Added RenderContainerWaitForTopology init container and topology env variables to slurmd container
internal/render/worker/statefulset.go Added topology volumes and init container for ephemeral nodes
internal/values/slurm_nodeset.go Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to SlurmNodeSet

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch 2 times, most recently from afd3440 to bf57f8d Compare January 29, 2026 15:06
@Uburro Uburro requested a review from Copilot January 29, 2026 15:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch 3 times, most recently from 64ead49 to 6558232 Compare February 3, 2026 09:36
@Uburro Uburro requested a review from Copilot February 3, 2026 09:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/controller/topologyconfcontroller/workertopology_controller.go:393

  • The InitializeTopologyConf function now always includes a "dummy" node in the topology configuration, even when there are actual worker nodes. This could cause issues with Slurm as it will try to manage a non-existent "dummy" node. The logic should be:
  • If there are actual nodes (after adding real nodes), don't add the dummy node
  • Only add the dummy node when there are no actual nodes to prevent an empty Nodes list

The condition at line 388 if len(nodes) == 0 will never be true after line 374 adds the dummy node, making this check ineffective.

func InitializeTopologyConf(asts *kruisev1b1.StatefulSetList) string {
	if asts == nil {
		return ""
	}

	var nodes []string
	switchName := "SwitchName=unknown"
	dummyNodeName := "dummy"
	nodes = append(nodes, dummyNodeName)

	if len(asts.Items) > 0 {
		for _, sts := range asts.Items {
			if sts.Spec.Replicas == nil || *sts.Spec.Replicas <= 0 {
				continue
			}

			for i := 0; i < int(*sts.Spec.Replicas); i++ {
				nodes = append(nodes, sts.Name+"-"+strconv.Itoa(i))
			}
		}
	}

	if len(nodes) == 0 {
		return switchName
	}

	return switchName + " Nodes=" + strings.Join(nodes, ",")
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch from 6558232 to 2ccb9d1 Compare February 3, 2026 17:12
@Uburro Uburro changed the title SCHED-848: switch to Slurm dynamic topology SCHED-848: add ephemeral nodes Feb 4, 2026
@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch 8 times, most recently from 8fc8429 to 2b01a38 Compare February 6, 2026 17:22
@Uburro Uburro marked this pull request as ready for review February 6, 2026 17:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 56 out of 56 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch 9 times, most recently from 544d002 to 37f4a74 Compare February 13, 2026 13:47
@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch from 382ad2f to 87cba5c Compare February 17, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments