Add trainium, inferentia, and efa parameters to @kubernetes decorator #3086
emattia wants to merge 7 commits into
Greptile Summary
This PR adds
Confidence Score: 5/5
Safe to merge. The changes are additive, default to None, and follow the exact same patterns as the existing GPU implementation across all runtimes.
All new parameters default to None and are gated before any resource or toleration is emitted. Validation (positive integer, gpu/trainium mutual exclusion) runs at step_init time. The inferentia-to-trainium aliasing is correctly handled before CLI serialization, and None values are safely skipped by the CLI arg builder. The implementation is consistent across kubernetes_job, kubernetes_jobsets, Argo, and Airflow paths.
No files require special attention.
Reviews (3): Last reviewed commit: "Make @kubernetes inferentia-trainium ali..."
…efa=N) parameter on @kubernetes. When efa is set, the pod requests N vpc.amazonaws.com/efa resources, advertised by the AWS EFA k8s device plugin on EFA-enabled nodes. Plumbed through to argo and airflow runtimes consistently with how trainium= is.
saikonen
left a comment
No issues with the changes themselves, but a question on the overall UX. Is the main goal of this feature added convenience? From what I can tell, everything already seems achievable with @kubernetes(tolerations=), if I'm not mistaken.
It also seems a bit of a departure for the @kubernetes decorator to implement provider-specific attributes.
Right, the tolerations is what needs to happen, the |
# Validate mutually exclusive: gpu and trainium cannot both be set.
if (
    self.attributes["trainium"] is not None
    and self.attributes["gpu"] is not None
Should this check for greater than zero instead? trainium together with gpu=0 should work, right?
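One way to read the reviewer's suggestion is a check that only fires when both accelerators are actually requested with a positive count. The sketch below is illustrative and is not the PR's code: `KubernetesException` here is a local stand-in class, and `_requested` and `check_gpu_trainium_exclusive` are hypothetical helper names.

```python
class KubernetesException(Exception):
    """Local stand-in for the decorator's exception type (hypothetical)."""


def _requested(value):
    # None, 0 and "0" all count as "not requested".
    return value is not None and int(value) > 0


def check_gpu_trainium_exclusive(attributes):
    # Raise only when both accelerators are requested with a positive count.
    if _requested(attributes.get("trainium")) and _requested(attributes.get("gpu")):
        raise KubernetesException(
            "gpu and trainium cannot both be set on the same step."
        )


check_gpu_trainium_exclusive({"trainium": 2, "gpu": 0})     # accepted
check_gpu_trainium_exclusive({"trainium": 2, "gpu": None})  # accepted
```

With this shape, `gpu=0` behaves the same as leaving `gpu` unset, which is the case the comment asks about.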
PR Type
Summary
Mirror @batch's AWS-accelerator surface on @kubernetes:
- @kubernetes(trainium=N) requests N AWS Trainium / Inferentia Neuron devices (aws.amazon.com/neuron k8s resource).
- @kubernetes(inferentia=N) is an alias for trainium, mirroring @batch(inferentia=N) for API consistency.
- @kubernetes(efa=N) requests N AWS Elastic Fabric Adapter network interfaces (vpc.amazonaws.com/efa k8s resource).
Plumbed through kubernetes_job, kubernetes_jobsets, kubernetes_cli, and the argo / airflow runtimes consistently with how the existing gpu parameter is handled.
Issue
No tracking issue. Supersedes the original PR scope of just trainium.
Brings the @kubernetes path to parity with @batch for AWS Neuron and EFA workloads, unblocking customers who run their own EKS clusters and want first-class Neuron/EFA support without writing raw pod specs.
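The mapping from decorator parameters to Kubernetes resource names described in the summary can be sketched as below. This is a minimal illustration, not the PR's implementation: the two resource names come from the summary, while `accelerator_limits` is a hypothetical helper and the choice to emit string quantities is an assumption.

```python
# Resource names as given in the PR summary.
NEURON_RESOURCE = "aws.amazon.com/neuron"
EFA_RESOURCE = "vpc.amazonaws.com/efa"


def accelerator_limits(trainium=None, efa=None):
    """Return extra container resource limits for the requested AWS devices."""
    limits = {}
    # Emission is gated on non-None values, so existing steps are unaffected.
    if trainium is not None:
        limits[NEURON_RESOURCE] = str(trainium)
    if efa is not None:
        limits[EFA_RESOURCE] = str(efa)
    return limits


print(accelerator_limits(trainium=16, efa=8))
# {'aws.amazon.com/neuron': '16', 'vpc.amazonaws.com/efa': '8'}
```

When neither parameter is set, the helper returns an empty dict, so nothing extra is added to the pod spec.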
Reproduction
Runtime: kubernetes (EKS with AWS Neuron and EFA device plugins installed; nodes labeled with the relevant accelerator).
Commands to run:
Where evidence shows up: task pod spec (kubectl describe pod) and NCCL debug log inside the running container.
Before (master)
(also for inferentia, efa)
After (this PR)
Root Cause
Not a bug fix; this is a net-new feature. The underlying Kubernetes resources (aws.amazon.com/neuron, vpc.amazonaws.com/efa) are advertised by the respective AWS device plugins; @kubernetes had no decorator-level surface to request them. @batch already exposed trainium, inferentia, and efa. This PR brings @kubernetes to parity.
Why This Fix Is Correct
- @batch's API surface exactly.
- inferentia collapses into trainium at step_init and is popped before any runtime translation, the same shape as batch_decorator.py:175-211, only with trainium as canonical (since on K8s the underlying resource name is aws.amazon.com/neuron and we surface what users running on Trainium hardware naturally type first).
- gpu and trainium are enforced as mutually exclusive (matching @batch's convention).
- earlier in this branch; efa follows the same pattern.
Failure Modes Considered
- gpu/gpu_vendor are unaffected: new attributes default to None and resource-limit emission is gated on non-None values.
- inferentia and trainium raises a clear error in step_init (mirrors @batch). The error for specifying both gpu and trainium was already enforced.
- inferentia is popped from self.attributes after collapsing into trainium, so the runtime CLI / argo / airflow translation only ever sees the canonical key.
- kubernetes_job, kubernetes_jobsets, argo, and airflow consistently with how trainium was already plumbed.
- efa value is validated as a positive integer (mirrors the trainium and tmpfs_size validation patterns in the same file).
Tests
EFA device plugins. Pod spec contains the right resource limits;
NCCL via aws-ofi-nccl selects EFA as the network backend.
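The step_init handling described under Why This Fix Is Correct and Failure Modes Considered (alias collapse, positive-integer validation, and a CLI translation that skips None) can be sketched roughly as follows. The function names, error messages, and option format are all hypothetical; this is an illustration of the described behaviour, not the code in the PR.

```python
def resolve_neuron_attributes(attributes):
    """Collapse inferentia into trainium and validate counts (illustrative)."""
    attrs = dict(attributes)  # work on a copy
    inferentia = attrs.pop("inferentia", None)  # alias is removed here
    if inferentia is not None:
        if attrs.get("trainium") is not None:
            raise ValueError("Specify only one of trainium or inferentia.")
        attrs["trainium"] = inferentia
    for key in ("trainium", "efa"):
        value = attrs.get(key)
        if value is None:
            continue
        # Accept ints or digit strings; reject zero, negatives, non-integers.
        if not str(value).isdecimal() or int(value) <= 0:
            raise ValueError("%s must be a positive integer, got %r" % (key, value))
        attrs[key] = int(value)
    return attrs


def build_cli_options(attrs):
    """Turn resolved attributes into CLI tokens, skipping unset values."""
    args = []
    for key in sorted(attrs):
        if attrs[key] is not None:
            args.extend(["--%s" % key, str(attrs[key])])
    return args


resolved = resolve_neuron_attributes({"inferentia": 4, "trainium": None, "efa": None})
print(build_cli_options(resolved))  # ['--trainium', '4']
```

Because the alias is popped before translation, downstream code in this sketch never sees an inferentia key, only the canonical trainium one.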
Non-Goals
- @batch (already has these parameters).
- --inferentia CLI flag: inferentia is purely a decorator-time convenience that resolves to trainium before any CLI invocation, mirroring @batch's CLI which only exposes the canonical name (--inferentia for batch since inferentia is canonical there; --trainium for k8s since trainium is canonical here).
- (FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA). Users set those via @environment for now; auto-injection is a separate ergonomics PR.
- should target; that's a cluster-side concern (instance allowlist
AI Tool Usage
selection, Karpenter EFA NIC layout prior art, and drafting this
PR description). All generated code reviewed, understood, and
tested end-to-end on a live Outerbounds cluster.