Fix --node-filter for deploy, destroy, reconfigure, and SR-SIM nodes #3041
bayars wants to merge 12 commits into srl-labs:main
Conversation
I found another MR/issue created yesterday and checked whether they conflict or fix the same thing. Relationship to #2524 / PR #3038: PR #3038 addresses a separate issue (#2524), while this PR fixes a different problem (#2106).
```go
var allContainers []clabruntime.GenericContainer
cpmIdx := -1

for _, comp := range n.Cfg.Components {
	containerName := n.calcComponentName(n.GetContainerName(), comp.Slot)
	cnts, err := n.Runtime.ListContainers(ctx, []*clabtypes.GenericFilter{
		{
			FilterType: "name",
			Match:      containerName,
		},
	})
	if err != nil {
		return nil, err
	}
	// Remember where the CPM container landed in the combined list.
	if comp.Slot == cpmSlot && len(cnts) > 0 {
		cpmIdx = len(allContainers)
	}
	allContainers = append(allContainers, cnts...)
}
```
@sacckth These are example clab commands with these changes. General testing with the srl topology:
Some of the corner cases with the distributed lab:
I realized one weird case: if I delete the CPM, this code leaves the line card containers orphaned. Assume the fully deployed example distributed topology: destroying the CPM leaves the line cards running without their network namespace. I am also not sure whether the CRI even allows that; it seems to work, but it is weird. I tested it, and this is a real bug. Destroying only the CPM in manual distributed mode silently orphans the dependent containers. They appear running but have broken networking and cannot be restarted by Docker. The only recovery is to destroy them individually or via clab destroy --cleanup. I am adding one more commit to fix this, and I will add the robot tests too.
sacckth left a comment:
- There is a missing file
- Clab files should go under tests/13-srsim
- Consider decreasing the size of the deployments to a bare minimum to avoid issues with runners.
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/srl02/srl02.clab.yml
```
@bayars please do not use the lab-examples dir, as it might change in the future; let's create a copy of this clab under the test directory. Also note that the SR-SIM image needs to be exactly the same as in the other clab files present in the test directory.
@sacckth Sorry for the late answer.
I copy-pasted the srl lab and updated the SR-SIM image to registry.srlinux.dev/pub/nokia_srsim:25.10.R2 to match the other test clab files.
```robot
*** Test Cases ***
Deploy full distributed lab
    [Documentation]    Deploy the full manual distributed lab with 8 containers (4 per system).
```
I would try to deploy as few nodes as possible, because the git runners might not handle larger deployments very gracefully. For instance, we don't need to test that srsim11 is there, as we cover that in another test. Consider a maximum of 3 linecards per container (CPM-A|B, IOM-1 or CPM-A|IOM-1|IOM-2).
I reduced it to a single system with only 3 containers (CPM-A, CPM-B, IOM-1).
I removed srsim11 entirely; the multi-node case is already covered by the components test (10). This still covers all the important cases: auto-expand on CPM-A destroy, no-expand on CPM-B destroy, and line-card-only destroy/redeploy.
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/sr-sim/test-cpm-destroy.clab.yaml
```
I don't see this lab committed, and as in the other example, please add these labs to the tests/13-srsim folder instead.
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/sr-sim/lab-distributed.clab.yaml
```
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/sr-sim/lab-distributed-components.clab.yaml
```
Is there a way we could maybe add these tests to 11?
Yes, they fit together, since both test the manual distributed topology with node-filter. I merged all test cases into 11 and deleted 12. The topology is the same (CPM-A, CPM-B, IOM-1), so there is no reason to keep them separate.
Thanks for flagging #3055. I hadn't seen it before; I've looked at it now. I think it's related but tackles different problems. #3055 adds new CLI commands to pause/resume individual nodes. This PR fixes the existing --node-filter flag on deploy/destroy/reconfigure; specifically, it handles SR-SIM's CPM network namespace management under filtering. Where they overlap is namespace dependency tracking: #3055 introduces namespaceShareProviders() for its lifecycle validation, and this PR adds expandFilterWithNSDependents() for filter expansion. Both build a map of which nodes share which namespace, so it would make sense to share a common helper once both PRs land. I don't think merging the PRs themselves makes sense; they solve different user problems and touch different code paths. But I'm happy to rebase on top of #3055 if it lands first and reuse its namespaceShareProviders() for the filter expansion logic.
This is intended behavior. We have a sorting function so that the netns is not owned by the CPM containers where possible. This is to allow failover: if the CPM owns the netns and you kill it, everything goes down along with it. In most cases with a distributed container setup, LC slot 1 will own the netns, which means you can kill both CPMs, bring them back, and everything should function well.
fwiw, this was on purpose as well. I think it makes sense for inspect to show only the deployed nodes (whereby a single srsim container represents the whole node/collection of containers). Otherwise the inspect table would contain a bunch of entries for nodes with no IP address, which is not very useful to the user (just my 2c).
This is actually not true. My PR fully docker stops/starts/restarts the nodes; nothing is paused. To do that, it parks the interfaces in a parking namespace so that the veths stay intact across a start.
While running the tests, I realized another bug with --node-filter.
Thanks for explaining the design. That makes sense now: the LC owning the netns allows CPM failover without tearing down the whole namespace. The current changes do not touch the sort order or netns ownership, but my test assertions were backwards (checking CPM-A for the management IP instead of the LC). I've corrected those (not pushed yet).
Yes, that makes sense for this one too; the logical node in the default inspect table is cleaner. I'll revert the GetContainers() change. The node-filter destroy/deploy functionality doesn't depend on it.
@FloSch62 Thank you for the correction, and sorry for mischaracterizing it as pause/resume when it's actually a full docker stop/start/restart with interface parking to preserve the veths. I still think these PRs are complementary rather than overlapping (except for the namespaceShareProviders() functionality). It's up to you how to proceed on merging.
Let's park this until #3055 is merged. We can rebase and review afterwards.
Summary
This PR fixes several critical issues with the --node-filter flag that caused it to break labs when used with deploy, destroy, and reconfigure commands. It also improves container discovery for SR-SIM distributed nodes.
Context: Last week, a team member misconfigured a specific router, breaking accessibility to it. When they requested to redeploy just that SR-SIM router, the --node-filter flag didn't work properly, forcing me to find workarounds under time pressure.
Issues: #2106 #2524
Problem
When using `--node-filter` to operate on a subset of nodes in a lab, several things went wrong:

- `destroy` found no topology files and silently did nothing.
- `verifyContainersUniqueness()` detected the existing lab and refused to deploy filtered nodes into it, erroring with "the lab has already been deployed".
- `--reconfigure` removed `TopologyLabDir()` (the whole `clab-<lab>/` directory), destroying config and state for nodes not in the filter.
- `destroy()` also removed `/etc/hosts` entries, SSH config, and the management network, breaking the still-running nodes.
- `destroyLabDirs` had no guard: although the CLI prevents `--cleanup` with `--node-filter`, the `destroyLabDirs` function itself had no protection and could wipe the entire lab directory if called programmatically.
- `GetContainers()` only returned the CPM: for distributed SROS nodes (which spawn multiple containers per topology node: CPM-A, CPM-B, line cards), `GetContainers()` only queried the CPM-A container. This meant `clab inspect` showed only one container per distributed node, and container discovery was incomplete.

The changes fix most of the `--node-filter` issues that I faced last week. Let me know if I need to change something.