Conversation
| cluster. | ||
| ``` | ||
| apiVersion: v1 |
Can we add these as actual files in a directory with a README.md file explaining how to deploy it? Then the user could just clone the repo and deploy it from their commandline.
grac3gao left a comment
This is a short-term mitigation that cannot be relied on in the long run. The device plugin change introduced in this doc will be reverted by the add-on manager during an upgrade. If the GPU nodes are used together with autoscaling, newly created nodes may not include this mitigation.
Ideally, we would like a more stable and simpler solution to this problem (e.g. users would only need to edit a ConfigMap to customize the XIDs they care about).
It would be better to explain this situation in the doc and state clearly that this is a short-term mitigation.
We would like that, yes: just edit/create a ConfigMap. For now, we can modify the
Also, XID 79 should IMO be there by default, since it's not a user error: https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
Context: we get |
This makes the XID error mitigation public.