Skip to content

Add grouped KWOK controller mode#1103

Merged
wonderyl merged 23 commits intov2from
xinwei/v2-kwok
Mar 28, 2026
Merged

Add grouped KWOK controller mode#1103
wonderyl merged 23 commits intov2from
xinwei/v2-kwok

Conversation

@xinWeiWei24
Copy link
Copy Markdown
Collaborator

  • Add grouped mode to KWOK: split nodes across multiple kwok-controller
    deployments for large-scale simulations (--group, --nodes-per-controller)
  • Add per-group controller deployment template and parameterized config

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a “grouped” KWOK mode to support large simulations by splitting virtual nodes across multiple kwok-controller Deployments, along with parameterized configuration/templates and small pipeline wiring updates.

Changes:

  • Introduces grouped mode (--group, --nodes-per-controller) with per-group controller Deployments and controller readiness waiting.
  • Parameterizes KWOK ConfigMap rendering (parallelism, CIDR, lease duration, optional nodeIP/QPS/burst).
  • Updates KCL steps to pass extra args through validation and optionally set nodepool labels.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
modules/python/kwok/kwok.py Implements grouped controller lifecycle, config templating, readiness wait, and new CLI flags.
modules/python/kwok/config/kwok-controller-group-deploy.yaml Adds per-group controller Deployment template used in grouped mode.
modules/python/kwok/config/kwok-config.yaml Converts fixed KWOK config values into template placeholders.
modules/python/tests/kwok/test_kwok_node.py Updates integration test to mock controller readiness lookup.
kcl/lib/steps/k8s/kwok.k Passes extraArgs to the validate invocation as well as create.
kcl/lib/steps/azure/create_node_pool.k Adds optional --labels support for AKS nodepool creation.
kcl/example_pipeline/kwok-node.yaml Removes trailing whitespace in a taint key line.

except Exception as exc: # pylint: disable=broad-exception-caught
self.fail(f"Node creation failed: {exc}")

@patch("kwok.kwok.Node.apply_kwok_manifests")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally you want to mock the part that's a dependency, e.g. k8s_client, which actually need a cluster. Then you can test more of your code.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I have updated the tests to mock the external dependency k8s_client

- delete unnecessary flags
- simplify controller sharding

def _prepare_controller_deployment_base(self, kwok_release, enable_metrics):
"""Bootstrap KWOK, copy the upstream controller deployment, then remove it."""
self._bootstrap_kwok(kwok_release, enable_metrics)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bootstrap_kwok should move outside of _prepare_controller_deployment_base() rename the function to something the reflect its actually usage. The name _prepare_controller_deployment_base let people think it's only related to the controller deployment, not other things.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I moved _bootstrap_kwok to create() and renamed the helper so its name reflects the real behavior. The helper now only handles copying/removing the upstream controller deployment.

"""Create a kwok-controller deployment scoped to a controller shard."""
deployment = self._build_controller_deployment(base_deployment, controller_index)
deployment_name = deployment.metadata.name
controller_selector = f"kwok-controller-group={controller_index}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not read controller selector from the deployment?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selector is not read from the upstream Deployment because it is empty there.
kwok-controller-group= is intentionally introduced by us as a custom sharding selector, and it must match the node label we assign for each shard.
So here the selector is derived from controller_index as the source of truth, rather than inherited from upstream

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you already set it in _build_controller_deployment(), right? if you change the code in _build_controller_deployment to set a different selector, this line of code will be deprecated.

@wonderyl wonderyl merged commit aa8b3ac into v2 Mar 28, 2026
1 of 2 checks passed
@wonderyl wonderyl deleted the xinwei/v2-kwok branch March 28, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants