Fix --node-filter for deploy, destroy, reconfigure, and SR-SIM nodes #3041

Open
bayars wants to merge 12 commits into srl-labs:main from bayars:node-filter-fix-improvements

Conversation

@bayars
Contributor

@bayars bayars commented Feb 7, 2026

Summary

This PR fixes several critical issues with the --node-filter flag that caused it to break labs when used with deploy, destroy, and reconfigure commands. It also improves container discovery for SR-SIM distributed nodes.

Context: Last week, a team member misconfigured a specific router, breaking accessibility to it. When they requested to redeploy just that SR-SIM router, the --node-filter flag didn't work properly, forcing me to find workarounds under time pressure.

Issues: #2106 #2524

Problem

When using --node-filter to operate on a subset of nodes in a lab, several things went wrong:

  1. Destroy couldn't find containers: When containers were already removed or lacked containerlab labels, destroy found no topology files and silently did nothing.
  2. Deploy rejected partial deployments: verifyContainersUniqueness() detected the existing lab and refused to deploy filtered nodes into it, erroring with "the lab has already been deployed".
  3. Reconfigure wiped the entire lab directory: --reconfigure removed TopologyLabDir() (the whole clab-<lab>/ directory), destroying config and state for nodes not in the filter.
  4. Destroy cleaned up shared resources: After removing filtered nodes, destroy() also removed /etc/hosts entries, SSH config, and the management network — breaking the still-running nodes.
  5. destroyLabDirs had no guard: Although the CLI prevents --cleanup with --node-filter, the destroyLabDirs function itself had no protection and could wipe the entire lab directory if called programmatically.
  6. SR-SIM GetContainers() only returned CPM: For distributed SROS nodes (which spawn multiple containers per topology node — CPM-A, CPM-B, line cards), GetContainers() only queried the CPM-A container. This meant clab inspect only showed one container per distributed node, and container discovery was incomplete.

These changes fix most of the --node-filter issues I faced last week. Let me know if I need to change anything.

@bayars bayars marked this pull request as draft February 7, 2026 22:17
@bayars
Contributor Author

bayars commented Feb 7, 2026

I found another MR/issue created yesterday. I asked AI to check whether they conflict or fix the same thing.

Relationship to #2524 / PR #3038

PR #3038 addresses a separate issue (#2524): allowing clab destroy --name <lab> to work without a topology file.

This PR fixes a different problem (#2106): --node-filter is broken for kinD/SROS nodes when a topology file is provided. The two PRs have zero file overlap. #3038 touches cmd/destroy.go, cmd/options.go, core/options_clab.go, while this PR changes core/destroy.go, core/deploy.go, core/config.go, nodes/sros/sros.go. They complement each other and do not conflict.

@bayars bayars marked this pull request as ready for review February 7, 2026 22:19
Comment thread nodes/sros/sros.go Outdated
Comment on lines +1351 to +1368
var allContainers []clabruntime.GenericContainer
cpmIdx := -1

for _, comp := range n.Cfg.Components {
	containerName := n.calcComponentName(n.GetContainerName(), comp.Slot)
	cnts, err := n.Runtime.ListContainers(ctx, []*clabtypes.GenericFilter{
		{
			FilterType: "name",
			Match:      containerName,
		},
	})
	if err != nil {
		return nil, err
	}
	if comp.Slot == cpmSlot && len(cnts) > 0 {
		cpmIdx = len(allContainers)
	}
	allContainers = append(allContainers, cnts...)
Member


@sacckth @FloSch62 can you check this out? Thanks

@bayars
Contributor Author

bayars commented Feb 10, 2026

@sacckth These are example clab commands with these changes.

General testing with the srl topology:

  1. clab deploy -t lab-examples/srl02/srl02.clab.yml : Assume the full srl lab is deployed.

  2. clab destroy -t lab-examples/srl02/srl02.clab.yml --node-filter srl1 :

    • Expect the srl1 node to be removed from the deployed lab
    • srl1's links are cleared on the still-deployed srl2 node
    • clab-srl02/ directory preserved
  3. clab deploy -t lab-examples/srl02/srl02.clab.yml --node-filter srl1:

    • Expect srl1 to be deployed correctly
    • Links recreated correctly
  4. clab deploy -t lab-examples/srl02/srl02.clab.yml --node-filter srl1 --reconfigure :

    • srl1 destroyed and redeployed correctly
    • only the srl1 node directory is removed and recreated
  5. clab deploy -t lab-examples/sr-sim/lab-distributed-components.clab.yaml: sr-sim lab deployed fully.

  6. clab destroy -t lab-examples/sr-sim/lab-distributed-components.clab.yaml --node-filter srsim10:

    • ALL 4 srsim10 containers removed (srsim10-a, srsim10-b, srsim10-1, srsim10-2)
    • GetContainers() finds all components, not just CPM
    • srsim11's 4 containers still running
    • Lab dir preserved, mgmt network preserved
  7. clab deploy -t lab-examples/sr-sim/lab-distributed-components.clab.yaml --node-filter srsim10:

    • srsim10-a, srsim10-b, srsim10-1, srsim10-2 all created
    • CPM-A gets mgmt IP, line cards share its network namespace
    • Links between srsim10 and srsim11 reconnected
  8. clab deploy -t lab-examples/sr-sim/lab-distributed-components.clab.yaml --reconfigure --node-filter srsim11:

    • srsim11's 4 containers destroyed
    • Only srsim11's node directory removed (not the lab dir)
    • srsim10 untouched
    • srsim11 redeployed fresh

Some of the corner cases with the distributed lab:

  1. clab destroy -t lab-examples/sr-sim/lab-distributed.clab.yaml --node-filter srsim10-1,srsim10-2,srsim11-1,srsim11-2:
    • All 4 line card containers removed
    • 4 CPM containers still running (srsim10-a, srsim10-b, srsim11-a, srsim11-b)
    • Lab dir and mgmt network preserved
  2. clab deploy -t lab-examples/sr-sim/lab-distributed.clab.yaml --node-filter srsim10-a,srsim10-b,srsim10-1,srsim10-2:
    • All 4 srsim10 containers created
    • srsim10-b, srsim10-1, srsim10-2 use network-mode: container:srsim10-a
    • No links to srsim11 (not deployed)

@bayars
Contributor Author

bayars commented Feb 10, 2026

I realized a weird case: if I delete the CPM, this code leaves the line card containers orphaned.

Assume fully deployed example distributed topology:

  1. clab destroy -t lab-examples/sr-sim/lab-distributed.clab.yaml --node-filter srsim10-a:
    • Only the srsim10-a container deleted
    • Line cards (srsim10-1, srsim10-2) still deployed
    • srsim10-b still deployed
    • Topology works but the node is incomplete (no forwarding plane)

That leaves the line cards without their network namespace. I am also not sure the CRI allows that. It seems to work, but it is weird.

I tested, and this is a real bug. Destroying only the CPM in manual distributed mode silently orphans the dependent containers. They appear running but have broken networking and cannot be restarted by Docker. The only recovery is to destroy them individually or via clab destroy --cleanup.

I am adding one more commit to fix this. Also, I will add the robot tests too.

@sacckth sacckth marked this pull request as draft February 10, 2026 17:43
@sacckth sacckth self-requested a review February 12, 2026 20:51
Contributor

@sacckth sacckth left a comment


  1. There is a missing file
  2. Clab files should go under tests/13-srsim
  3. Consider decreasing the size of the deployments to a bare minimum to avoid issues with the runners.

Comment thread tests/01-smoke/12-node-filter-srl.robot Outdated


*** Variables ***
${lab-file} ${EXECDIR}/lab-examples/srl02/srl02.clab.yml
Contributor

@sacckth sacckth Feb 13, 2026


@bayars please do not use the lab-examples dir, as it might change in the future; let's create a copy of this clab under the test directory. Also note that the SR-SIM image needs to be exactly the same as in the other clab files present in the test directory.

Contributor Author


@sacckth Sorry for the late answer.

I copy-pasted the srl lab. The SR-SIM image was updated to registry.srlinux.dev/pub/nokia_srsim:25.10.R2 to match the other test clab files.


*** Test Cases ***
Deploy full distributed lab
[Documentation] Deploy the full manual distributed lab with 8 containers (4 per system).
Contributor

@sacckth sacckth Feb 13, 2026


I would try to deploy as few nodes as possible, because the git runners might not handle larger deployments very gracefully. For instance, we don't need to test that srsim11 is there, as we cover this in another test. Consider a maximum of 3 line cards per container (CPM-A|B, IOM-1 or CPM-A|IOM-1|IOM-2).

Contributor Author

@bayars bayars Feb 14, 2026


I reduced it to a single system with only 3 containers (CPM-A, CPM-B, IOM-1).

I removed srsim11 entirely; the multi-node case is already covered by the components test (10). This still covers all the important cases: auto-expand on CPM-A destroy, no-expand on CPM-B destroy, and line-card-only destroy/redeploy.



*** Variables ***
${lab-file} ${EXECDIR}/lab-examples/sr-sim/test-cpm-destroy.clab.yaml
Contributor


I don't see this lab committed, and as in the other example, please add these labs under the tests/13-srsim folder instead.



*** Variables ***
${lab-file} ${EXECDIR}/lab-examples/sr-sim/lab-distributed.clab.yaml
Contributor


copy lab to test dir

Contributor Author


done



*** Variables ***
${lab-file} ${EXECDIR}/lab-examples/sr-sim/lab-distributed-components.clab.yaml
Contributor


copy lab to test dir

Contributor Author


done

Contributor


is there a way we can maybe add these tests on 11?

Contributor Author


Yes, they fit together since both test the manual distributed topology with node-filter. I merged all test cases into 11 and deleted 12. The topology is the same (CPM-A, CPM-B, IOM-1), so there is no reason to keep them separate.

@sacckth
Contributor

sacckth commented Feb 13, 2026

@bayars I just noted #3055 . This seems very similar to what you are trying to do here... Can you maybe try that PR and see if we can merge those two things together?

@bayars
Contributor Author

bayars commented Feb 14, 2026

@bayars I just noted #3055 . This seems very similar to what you are trying to do here... Can you maybe try that PR and see if we can merge those two things together?

Thanks for flagging #3055. I hadn't seen it before; I've now looked at it. I think it's related but tackles a different problem:

The #3055 adds new CLI commands to pause/resume individual nodes. It's about lifecycle management of running nodes without tearing them down.

This PR fixes the existing --node-filter flag on deploy/destroy/reconfigure. In particular, it handles SR-SIM's CPM network namespace management together with the filtering.

Where they overlap is the namespace dependency tracking. #3055 introduces namespaceShareProviders() for its lifecycle validation, and this PR adds expandFilterWithNSDependents() for filter expansion. Both build a map of which nodes share which namespace. It would make sense to share a common helper for that once both PRs land.

I don't think merging the PRs themselves makes sense. They solve different user problems and touch different code paths. But I'm happy to rebase on top of #3055 if it lands first and reuse their namespaceShareProviders() for the filter expansion logic.

I'll leave the decision in your hands. @sacckth @hellt

@bayars bayars requested review from hellt and sacckth February 14, 2026 13:24
@kaelemc
Member

kaelemc commented Feb 14, 2026

That makes the line cards exist without network namespace. I am also not sure if the CRI allows that too. It seems working but weird.

I tested, and this is a real bug. Destroying only the CPM in manual distributed mode silently orphans the dependent containers. They appear running but have broken networking and cannot be restarted by Docker. The only recovery is to destroy them individually or via clab destroy --cleanup.

This is intended behavior. We have a sorting function so that the netns is not owned by the CPM containers where possible.

This is to allow failover: if the CPM owns the netns and you kill it, everything goes along with it.

In most cases with a distributed container setup, LC slot 1 will own the netns; this means you can kill both CPMs, bring them back, and all should function well.

@kaelemc
Member

kaelemc commented Feb 14, 2026

SR-SIM GetContainers() only returned CPM: For distributed SROS nodes (which spawn multiple containers per topology node — CPM-A, CPM-B, line cards), GetContainers() only queried the CPM-A container. This meant clab inspect only showed one container per distributed node, and container discovery was incomplete.

fwiw, this was on purpose as well. using inspect with --details or --all/-a flag should give you the intended output of all containers, of which you can then do your filtering with jq etc.

I think it makes sense as you show only the deployed nodes (whereby a single srsim container is representing the whole node/collection of containers).

Otherwise in that inspect table you will get a bunch of entries of nodes with no IP address, and it's not very useful to the user (just my 2c).

@FloSch62
Member

FloSch62 commented Feb 14, 2026

@bayars

The #3055 adds new CLI commands to pause/resume individual nodes. It's about lifecycle management of running nodes without tearing them down.

This is actually not true. My PR does a full docker stop/start/restart of the nodes; it's not paused. For that, it parks the interfaces in a parking namespace, so that on a start the veths stay intact.

@bayars
Contributor Author

bayars commented Feb 14, 2026

While running the tests, I realized another bug with --node-filter.

That makes the line cards exist without network namespace. I am also not sure if the CRI allows that too. It seems working but weird.
I tested, and this is a real bug. Destroying only the CPM in manual distributed mode silently orphans the dependent containers. They appear running but have broken networking and cannot be restarted by Docker. The only recovery is to destroy them individually or via clab destroy --cleanup.

This is intended behavior. We have a sorting function so that the netns is not owned by the CPM containers where possible.

This is to allow fail over as if the CPM owns the netns, and you kill it; everything along with it will go.

In most cases with a distributed container setup, LC slot 1 will own the netns, this means you can kill both CPMs, and bring them back and all should function well.

Thanks for explaining the design. That makes sense now: the LC owning the netns allows CPM failover without tearing down the whole namespace.

The current changes do not touch the sort order or netns ownership. But my test assertions were backwards (checking CPM-A for the management IP instead of the LC). I've corrected those (not pushed yet).

fwiw, this was on purpose as well. using inspect with --details or --all/-a flag should give you the intended output of all containers, of which you can then do your filtering with jq etc.

I think it makes sense as you show only the deployed nodes (whereby a single srsim container is representing the whole node/collection of containers).

Otherwise in that inspect table you will get a bunch of entries of nodes with no IP address, and it's not very useful to the user (just my 2c).

Yes, that actually makes sense too. The logical node in the default inspect table is cleaner. I'll revert the GetContainers() change; the node-filter destroy/deploy functionality doesn't depend on it.

@bayars
Contributor Author

bayars commented Feb 14, 2026

@bayars

The #3055 adds new CLI commands to pause/resume individual nodes. It's about lifecycle management of running nodes without tearing them down.

This is actually not true. My PR fully docker stops/start/restarts the nodes, its not paused. For that it parks the interfaces in a parking namespace, so that on a start the veths stays intact.

@FloSch62 Thank you for the correction, and sorry for mischaracterizing it as pause/resume when it's actually a full docker stop/start/restart with interface parking to preserve the veths.

But I still think these PRs are complementary rather than overlapping (except for the namespaceShareProviders() functionality).

I'll leave the merge decision to you.

@bayars bayars marked this pull request as ready for review February 14, 2026 16:50
@sacckth
Contributor

sacckth commented Feb 16, 2026

Let's park this until #3055 is merged. We can rebase and review afterwards.
