doc: Add Kubeflow tech review snapshot #2180
Anything left to do here from the Kubeflow side? 🤗
@christian-heusel: thanks for creating the PR. @brandtkeller and I have yet to review this, as we are currently busy with the adopter interviews. We will get to it soon and give you feedback :)
In the process of graduating Kubeflow as an official CNCF project it is common practice to add a snapshot of the general technical review document to the toc repo.
Ref: cncf#1861
Ref: cncf#2117
Fixes: kubeflow/community#964
Signed-off-by: Christian Heusel <christian@heusel.eu>
brandtkeller
left a comment
Day 0 review - Day1/2 WIP
| For more information, check ROADMAP for each Kubeflow Project: | ||
| - [Kubeflow Spark Operator](https://github.com/kubeflow/spark-operator/blob/master/ROADMAP.md) |
not blocking - worth noting that a number of these are dated by year and in those cases appear to be out-of-date.
| - [Kubeflow Notebooks](https://github.com/kubeflow/notebooks/blob/main/ROADMAP.md) | ||
| Community-wide changes are proposed as [Kubeflow Enhancement proposals (KEPs)](https://github.com/kubeflow/community/tree/master/proposals) | ||
| in the `kubeflow/community` repository or in the [Kubeflow sub-projects KEPs](https://github.com/kubeflow/trainer/tree/master/docs/proposals). |
| #### Explain which use cases have been identified as unsupported by the project | ||
| As Kubeflow is composed of multiple projects, each working group makes its own determinations as t |
typo - perhaps:
| As Kubeflow is composed of multiple projects, each working group makes its own determinations as t | |
| As Kubeflow is composed of multiple projects, each working group makes its own determinations as to |
| - The projects are deployed in any Kubernetes (each release will specify tested versions), | ||
| regardless of the underlying infrastructure, independently through Kubernetes manifests leveraging | ||
| Kustomize and/or Helm Charts. However, the project doesn’t provide an implementation to be deployed | ||
| on infrastructure besides Kubernetes. - We do not officially enforce a deployment method or distribution. |
Is this last bullet meant to be included in the surrounding bullet?
| - Kubeflow doesn’t provide a GitOps implementation, however Kubeflow manifests can be integrated | ||
| into a GitOps solution. For example, Platform Engineers can create an ArgoCD Application (CRD) | ||
| to install and configure Kubeflow projects. by providing Kubeflow individual project manifests, |
| to install and configure Kubeflow projects. by providing Kubeflow individual project manifests, | |
| to install and configure Kubeflow projects by providing Kubeflow individual project manifests, |
| - [Kubeflow 2025 Survey](https://docs.google.com/forms/d/11cSe5vmGLrGekJISHBMfjVh_97WFGuhcvGnd0l5aNLg/edit#responses) | ||
| - [2025:UX designers supporting Model Registry conducted a series of user sessions to understand preferred interaction patterns (link)](https://docs.google.com/forms/d/11cSe5vmGLrGekJISHBMfjVh_97WFGuhcvGnd0l5aNLg/edit#responses) |
not blocking - These two links reference the survey itself and I cannot see the responses.
| #### Describe the user experience (UX) and user interface (UI) of the project | ||
| Kubeflow user experience in each project is a collection of projects, the user experience for the |
"Kubeflow user experience in each project is a collection of projects"
Is this accurate?
| <div style="text-align: center;"> | ||
| <img | ||
| src="https://raw.githubusercontent.com/kubeflow/sdk/main/docs/images/persona_diagram.svg" |
| end users and vendors. However, we aim to provide a strong foundation through reference architectures | ||
| similar things from which to build on. | ||
| #### Describe the project’s High Availability requirements |
I don't think this reference is accurate anymore - but regardless this may be an area of improvement for future iterations. I imagine there are more specific requirements to consider for each project and high availability than simply adjusting replicas?
brandtkeller
left a comment
More feedback - consider none of this blocking
| #### How can this project be enabled or disabled in a live cluster? Please describe any downtime required of the control plane or nodes | ||
| Users can set the replica count to 0 in the Kubeflow projects deployment. Existing AI workloads |
Seems functional but maybe not a pragmatic choice for handling "disabling" - Does this offer any benefits over more declarative abstractions (why not remove on disable versus scaling to zero?)
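To make the trade-off in this thread concrete, here is a minimal sketch of what "disabling by scaling to zero" amounts to on a standard `apps/v1` Deployment. The helper names and the example Deployment are hypothetical and not taken from Kubeflow; the merge logic is a simplified JSON merge patch applied locally to a dict, not a call to a cluster.

```python
import copy
import json


def scale_patch(replicas: int) -> dict:
    """Build a merge-patch body that sets spec.replicas on a Deployment."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"spec": {"replicas": replicas}}


def apply_merge_patch(obj: dict, patch: dict) -> dict:
    """Apply a simplified JSON merge patch to a copy of obj.

    None deletes a key, nested dicts are merged recursively,
    and any other value replaces the existing one.
    """
    result = copy.deepcopy(obj)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_merge_patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical Deployment, trimmed to the fields that matter here.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-controller", "namespace": "kubeflow"},
    "spec": {"replicas": 2},
}

# Only spec.replicas changes; the object itself stays in the cluster.
disabled = apply_merge_patch(deployment, scale_patch(0))
print(json.dumps(scale_patch(0)))
```

The same body could be sent with `kubectl patch deployment <name> --type merge -p '{"spec":{"replicas":0}}'`. Scaling to zero leaves the Deployment and everything installed alongside it in place, whereas removing the manifests deletes those objects, which is the distinction the question above is getting at.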
| #### Explain how upgrades and rollbacks were tested and how the upgrade->downgrade->upgrade path was tested | ||
| Currently, it’s being manually tested by users, but automated tests are work in progress. |
References would be great.
| #### Describe the increase in resource usage in any components as a result of enabling this project, to include CPU, Memory, Storage, Throughput | ||
| Resources requirements for Kubeflow projects [are set here](https://github.com/kubeflow/manifests/pull/3091#issuecomment-3016609243). |
The table highlights the maximum usage (as I understood it), whereas the comment also captured actual use. It may be worth distinguishing between the minimum resource usage to expect and the maximum, for planning on both ends.
| #### Describe how the project surfaces project resource requirements for adopters to monitor cloud and infrastructure costs, e.g. FinOps That must happen on the Kubernetes namespace level | ||
| Users are recommended to use third-party tools like Kubecost to measure cloud and infrastructure |
Is this an implicit or explicit recommendation? (I did not see a docs reference.) I think it's fine to defer this to other processes and metrics.
| #### How can an operator determine if the project is in use by workloads | ||
| - Check the Pods in `kubeflow-profil`e labeled namespaces. |
| - Check the Pods in `kubeflow-profil`e labeled namespaces. | |
| - Check the Pods in `kubeflow-profile` labeled namespaces. |
| - Check the CRDs in user’s namespaces | ||
| - Check the Kubeflow Dashboard resources. | ||
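The first check in the list above could be scripted roughly as follows. This is a minimal sketch, assuming the inputs are the parsed JSON from `kubectl get namespaces -o json` and `kubectl get pods --all-namespaces -o json`, and assuming profile namespaces carry the label `app.kubernetes.io/part-of=kubeflow-profile`. The label key, the function names, and the sample data are assumptions to verify against a real installation.

```python
from collections import Counter

# Assumed label; verify against the namespaces in your own cluster.
PROFILE_LABEL = ("app.kubernetes.io/part-of", "kubeflow-profile")


def profile_namespaces(namespace_list: dict) -> set:
    """Return the names of namespaces that carry the assumed profile label."""
    key, value = PROFILE_LABEL
    names = set()
    for ns in namespace_list.get("items", []):
        labels = ns.get("metadata", {}).get("labels") or {}
        if labels.get(key) == value:
            names.add(ns["metadata"]["name"])
    return names


def running_pods_per_profile(namespace_list: dict, pod_list: dict) -> Counter:
    """Count Running pods in each profile-labeled namespace."""
    wanted = profile_namespaces(namespace_list)
    counts = Counter()
    for pod in pod_list.get("items", []):
        ns = pod.get("metadata", {}).get("namespace")
        phase = pod.get("status", {}).get("phase")
        if ns in wanted and phase == "Running":
            counts[ns] += 1
    return counts


# Hypothetical sample data shaped like `kubectl ... -o json` output.
namespaces = {"items": [
    {"metadata": {"name": "team-a",
                  "labels": {"app.kubernetes.io/part-of": "kubeflow-profile"}}},
    {"metadata": {"name": "kube-system", "labels": {}}},
]}
pods = {"items": [
    {"metadata": {"namespace": "team-a", "name": "nb-0"},
     "status": {"phase": "Running"}},
    {"metadata": {"namespace": "team-a", "name": "job-1"},
     "status": {"phase": "Succeeded"}},
    {"metadata": {"namespace": "kube-system", "name": "dns"},
     "status": {"phase": "Running"}},
]}

usage = running_pods_per_profile(namespaces, pods)
print(dict(usage))
```

A non-empty result suggests active workloads in profile namespaces; an empty one does not rule out usage, since the other two checks (custom resources and Dashboard resources) are not covered by this sketch.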
| #### How can someone using this project know that it is working for his instance |
| #### How can someone using this project know that it is working for his instance | |
| #### How can someone using this project know that it is working for their instance |
brandtkeller
left a comment
Realized the document is missing a section.
| Each Kubeflow project handles failure modes differently beyond native Kubernetes fault tolerance. | ||
| Many of them are configured at the application level in user code. | ||
| ### Security |
| ### Security | |
| ### Compliance | |
| * What steps does the project take to ensure that all third-party code and components have correct and complete attribution and license notices? | |
| * Describe how the project ensures alignment with CNCF [recommendations](https://github.com/cncf/foundation/blob/main/policies-guidance/recommendations-for-attribution.md) for attribution notices. | |
| <!--Note that each question describes a use case covered by the referenced policy document.--> | |
| * How are notices managed for third-party code incorporated directly into the project's source files? | |
| * How are notices retained for unmodified third-party components included within the project's repository? | |
| * How are notices for all dependencies obtained at build time included in the project's distributed build artifacts (e.g. compiled binaries, container images)? | |
| ### Security |
The compliance section is missing from this document.
DD for k8gb Sandbox → Incubation (cncf#1472).
Primary DD: @TheFoxAtWork and @ricardorocha
Adopter interviews: @angellk and @kgamanji
- Tech review: Satisfactory (Kashif Khan, TAG Infrastructure, 30-Jan-2026)
- Governance review: Satisfactory (joshgav, 21-Jan-2026)
- Security: Self-assessment complete, OpenSSF passing
- Adopter verification: 3 interviews across 3 orgs, 3 geographies (financial services x2, managed services x1). 220+ clusters combined.
- Must-fix resolved: API group renamed to k8gb.io/v1beta1 (cncf#2180)
Ref: cncf#1472
Signed-off-by: Karena Angell <karena.angell@gmail.com>
In the process of graduating Kubeflow as an official CNCF project it is common practice to add a snapshot of the general technical review document to the toc repo.
Ref: #1861
Ref: #2117
Fixes: kubeflow/community#964
cc @kfaseela @andreyvelich
Note: The document was added as a direct copy of https://github.com/kubeflow/community/blob/master/KUBEFLOW-GENERAL-TECHNICAL-REVIEW.md. If any modifications are desired before the addition here, just let me know!