Fix --node-filter for deploy, destroy, reconfigure, and SR-SIM nodes #3041
bayars wants to merge 12 commits into srl-labs:main
Conversation
I found another MR/issue created yesterday and checked whether they conflict or fix the same thing. Relationship to #2524 / PR #3038: PR #3038 addresses a separate issue (#2524), while this PR fixes a different problem (#2106).
```go
var allContainers []clabruntime.GenericContainer
cpmIdx := -1

for _, comp := range n.Cfg.Components {
	containerName := n.calcComponentName(n.GetContainerName(), comp.Slot)
	cnts, err := n.Runtime.ListContainers(ctx, []*clabtypes.GenericFilter{
		{
			FilterType: "name",
			Match:      containerName,
		},
	})
	if err != nil {
		return nil, err
	}
	// Remember where the CPM container landed in the combined list.
	if comp.Slot == cpmSlot && len(cnts) > 0 {
		cpmIdx = len(allContainers)
	}
	allContainers = append(allContainers, cnts...)
}
```
@sacckth These are example clab commands with these changes. General testing with the srl topology:
Some of the corner cases with the distributed lab:
I realized one weird case: if I delete the CPM, this code leaves the line card containers orphaned. Assume the fully deployed example distributed topology: destroying the CPM leaves the line cards running without their network namespace. I am also not sure whether the CRI even allows that; it seems to work, but it is weird. I tested it, and this is a real bug. Destroying only the CPM in manual distributed mode silently orphans the dependent containers. They appear running but have broken networking and cannot be restarted by Docker. The only recovery is to destroy them individually or via clab destroy --cleanup. I am adding one more commit to fix this, and I will add the robot tests too.
sacckth left a comment:
- There is a missing file
- Clab files should go under tests/13-srsim
- Consider decreasing the size of the deployments to a bare minimum to avoid issues with runners.
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/srl02/srl02.clab.yml
```
@bayars please do not use the lab-examples dir, as it might change in the future; let's create a copy of this clab under the test directory. Also note that the SR-SIM image needs to be exactly the same as in the other clab files present in the test directory.
@sacckth Sorry for the late answer.
I copy-pasted the srl lab and updated the SR-SIM image to registry.srlinux.dev/pub/nokia_srsim:25.10.R2 to match the other test clab files.
```robot
*** Test Cases ***
Deploy full distributed lab
    [Documentation]    Deploy the full manual distributed lab with 8 containers (4 per system).
```
I would try to deploy as few nodes as possible, because the git runners might not handle larger deployments very gracefully. For instance, we don't need to test that srsim11 is there, as we cover that in another test. Consider a maximum of 3 linecards per container (CPM-A|B, IOM-1 or CPM-A|IOM-1|IOM-2).
I reduced it to a single system with only 3 containers (CPM-A, CPM-B, IOM-1).
I removed srsim11 entirely; the multi-node case is already covered by the components test (10). This still covers all the important cases: auto-expand on CPM-A destroy, no-expand on CPM-B destroy, and line-card-only destroy/redeploy.
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/sr-sim/test-cpm-destroy.clab.yaml
```
I don't see this lab committed, and as in the other example, please add these labs to the tests/13-srsim folder instead.
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/sr-sim/lab-distributed.clab.yaml
```
```robot
*** Variables ***
${lab-file}    ${EXECDIR}/lab-examples/sr-sim/lab-distributed-components.clab.yaml
```
Is there a way we could maybe add these tests to 11?
Yes, they fit together, since both test the manual distributed topology with node-filter. I merged all test cases into 11 and deleted 12. The topology is the same (CPM-A, CPM-B, IOM-1), so there is no reason to keep them separate.
Thanks for flagging #3055. I hadn't seen it before; I've looked at it now. I think it's related but tackles different problems. #3055 adds new CLI commands to pause/resume individual nodes. This PR fixes the existing --node-filter flag on deploy/destroy/reconfigure; specifically, it handles SR-SIM's CPM network namespace management under filtering. Where they overlap is namespace dependency tracking: #3055 introduces namespaceShareProviders() for its lifecycle validation, and this PR adds expandFilterWithNSDependents() for filter expansion. Both build a map of which nodes share which namespace, so it would make sense to share a common helper once both PRs land. I don't think merging the PRs themselves makes sense; they solve different user problems and touch different code paths. But I'm happy to rebase on top of #3055 if it lands first and reuse its namespaceShareProviders() for the filter expansion logic.
This is intended behavior. We have a sorting function so that the netns is not owned by the CPM containers where possible. This is to allow failover: if the CPM owns the netns and you kill it, everything goes down along with it. In most cases with a distributed container setup, LC slot 1 will own the netns, which means you can kill both CPMs, bring them back, and everything should function well.
fwiw, this was on purpose as well. I think it makes sense for inspect to show only the deployed nodes (whereby a single srsim container represents the whole node/collection of containers). Otherwise the inspect table would contain a bunch of entries for nodes with no IP address, which is not very useful to the user (just my 2c).
This is actually not true. My PR fully docker stops/starts/restarts the nodes; nothing is paused. To do that, it parks the interfaces in a parking namespace so that the veths stay intact across a start.
While running the tests, I realized another bug with --node-filter.
Thanks for explaining the design. That makes sense now: the LC owning the netns allows CPM failover without tearing down the whole namespace. The current changes do not touch the sort order or netns ownership, but my test assertions were backwards (checking CPM-A for the management IP instead of the LC). I've corrected those (not pushed yet).
Yes, that makes sense for this one too; the logical node in the default inspect table is cleaner. I'll revert the GetContainers() change. The node-filter destroy/deploy functionality doesn't depend on it.
@FloSch62 Thank you for the correction, and sorry for mischaracterizing it as pause/resume when it's actually a full docker stop/start/restart with interface parking to preserve the veths. I still think these PRs are complementary rather than overlapping (except for the namespaceShareProviders() functionality). It's up to you how to proceed on merging.
Let's park this until #3055 is merged. We can rebase and review afterwards.
Summary
This PR fixes several critical issues with the --node-filter flag that caused it to break labs when used with deploy, destroy, and reconfigure commands. It also improves container discovery for SR-SIM distributed nodes.
Context: Last week, a team member misconfigured a specific router, breaking accessibility to it. When they requested to redeploy just that SR-SIM router, the --node-filter flag didn't work properly, forcing me to find workarounds under time pressure.
Issues: #2106 #2524
Problem
When using `--node-filter` to operate on a subset of nodes in a lab, several things went wrong:

- `destroy` found no topology files and silently did nothing.
- `verifyContainersUniqueness()` detected the existing lab and refused to deploy filtered nodes into it, erroring with "the lab has already been deployed".
- `--reconfigure` removed `TopologyLabDir()` (the whole `clab-<lab>/` directory), destroying config and state for nodes not in the filter.
- `destroy()` also removed `/etc/hosts` entries, SSH config, and the management network, breaking the still-running nodes.
- `destroyLabDirs` had no guard: although the CLI prevents `--cleanup` with `--node-filter`, the `destroyLabDirs` function itself had no protection and could wipe the entire lab directory if called programmatically.
- `GetContainers()` only returned the CPM: for distributed SROS nodes (which spawn multiple containers per topology node: CPM-A, CPM-B, line cards), `GetContainers()` only queried the CPM-A container. This meant `clab inspect` showed only one container per distributed node, and container discovery was incomplete.

The changes fix most of the `--node-filter` issues that I faced last week. Let me know if I need to change something.