Conversation
There was a problem hiding this comment.
Pull request overview
Adds a “grouped” KWOK mode to support large simulations by splitting virtual nodes across multiple kwok-controller Deployments, along with parameterized configuration/templates and small pipeline wiring updates.
Changes:
- Introduces grouped mode (
--group,--nodes-per-controller) with per-group controller Deployments and controller readiness waiting. - Parameterizes KWOK ConfigMap rendering (parallelism, CIDR, lease duration, optional nodeIP/QPS/burst).
- Updates KCL steps to pass extra args through validation and optionally set nodepool labels.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| modules/python/kwok/kwok.py | Implements grouped controller lifecycle, config templating, readiness wait, and new CLI flags. |
| modules/python/kwok/config/kwok-controller-group-deploy.yaml | Adds per-group controller Deployment template used in grouped mode. |
| modules/python/kwok/config/kwok-config.yaml | Converts fixed KWOK config values into template placeholders. |
| modules/python/tests/kwok/test_kwok_node.py | Updates integration test to mock controller readiness lookup. |
| kcl/lib/steps/k8s/kwok.k | Passes extraArgs to the validate invocation as well as create. |
| kcl/lib/steps/azure/create_node_pool.k | Adds optional --labels support for AKS nodepool creation. |
| kcl/example_pipeline/kwok-node.yaml | Removes trailing whitespace in a taint key line. |
869a2bc to
215777a
Compare
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
215777a to
1dd2f81
Compare
| except Exception as exc: # pylint: disable=broad-exception-caught | ||
| self.fail(f"Node creation failed: {exc}") | ||
|
|
||
| @patch("kwok.kwok.Node.apply_kwok_manifests") |
There was a problem hiding this comment.
generally you want to mock the part that's a dependency, e.g. k8s_client, which actually need a cluster. Then you can test more of your code.
There was a problem hiding this comment.
Agreed. I have updated the tests to mock the external dependency k8s_client
- delete unnecessary flags - simplify controller sharding
modules/python/kwok/kwok.py
Outdated
|
|
||
| def _prepare_controller_deployment_base(self, kwok_release, enable_metrics): | ||
| """Bootstrap KWOK, copy the upstream controller deployment, then remove it.""" | ||
| self._bootstrap_kwok(kwok_release, enable_metrics) |
There was a problem hiding this comment.
_bootstrap_kwok should move outside of _prepare_controller_deployment_base() rename the function to something the reflect its actually usage. The name _prepare_controller_deployment_base let people think it's only related to the controller deployment, not other things.
There was a problem hiding this comment.
Good point! I moved _bootstrap_kwok to create() and renamed the helper so its name reflects the real behavior. The helper now only handles copying/removing the upstream controller deployment.
modules/python/kwok/kwok.py
Outdated
| """Create a kwok-controller deployment scoped to a controller shard.""" | ||
| deployment = self._build_controller_deployment(base_deployment, controller_index) | ||
| deployment_name = deployment.metadata.name | ||
| controller_selector = f"kwok-controller-group={controller_index}" |
There was a problem hiding this comment.
why not read controller selector from the deployment?
There was a problem hiding this comment.
The selector is not read from the upstream Deployment because it is empty there.
kwok-controller-group= is intentionally introduced by us as a custom sharding selector, and it must match the node label we assign for each shard.
So here the selector is derived from controller_index as the source of truth, rather than inherited from upstream
There was a problem hiding this comment.
But you already set it in _build_controller_deployment(), right? if you change the code in _build_controller_deployment to set a different selector, this line of code will be deprecated.
deployments for large-scale simulations (--group, --nodes-per-controller)