nd_maintenance_mode#284
Open
allenrobel wants to merge 12 commits into
Open
Conversation
bff5d88 to
921294d
Compare
277a298 to
fb922d2
Compare
fb922d2 to
0bb6a39
Compare
Wraps POST /api/v1/manage/fabrics/{fabric_name}/switchActions/changeSystemMode
to set switches in a fabric to `normal` or `maintenance` mode. The endpoint is
action-shaped (one POST mutates a batch of switches with a single mode), and
the module adapts it to NDStateMachine's CRUD-shaped driver via an
identifier_strategy="singleton" model and a config-dict-to-list wrap in main().
Idempotency reads `additionalData.intendedSystemMode` per switch and only POSTs
switches whose intent differs from the request. Any switch currently in
`migration` mode triggers a hard-fail with the remediation instruction to run
'Recalculate and Deploy' in the ND UI before retrying. The 207 response body is
inspected per-item; any failed entry surfaces as a RuntimeError naming the
offending switches by IP.
State support: `merged` only. `gathered` will be added when the underlying
NDStateMachine/NDOutput support lands; query_all already produces the per-switch
snapshot it will need.
TODO(4.2.1) markers flag two ND wire/schema divergences observed on ND 4.2.1.10:
the GET switch response nests systemMode under `additionalData` rather than at
the top level the OpenAPI schema documents, and the aggregator
`additionalData.systemMode` returns "inconsistent" (a value not in the OpenAPI
enum) between an intent change and a deploy.
- New endpoint module manage_fabrics_switch_actions.py with
ChangeSystemModeEndpointParams and EpManageFabricsSwitchActionsChangeSystemModePost
- EpManageSwitchesGet added to manage_switches.py for the per-switch GET
used by the query_all fan-out
- Pydantic feature model MaintenanceModeModel + MaintenanceModeSwitchModel
with custom per-switch get_diff
- MaintenanceModeOrchestrator inheriting NDBaseInterfaceOrchestrator for
FabricContext plumbing and switch IP resolution
- 43 unit tests with 93-97% coverage on the new modules
- Integration target with YAML-anchored single-switch, multi-switch,
partial-mismatch, negative, and check_mode scenarios; verified live on
ND 4.2.1.10 / SITE1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified against the live ND OpenAPI spec at /api/v1/manage/openAPISpec (Nexus Dashboard Manage v1 1.1.411): the wire shape we read in `_fetch_switch_modes` matches the spec exactly. The `switchData` schema declares `additionalData` as a $ref to `additionalType` (a `oneOf` discriminated by `usage`), and the non-ACI variant `additionalSwitchData` documents `systemMode`, `intendedSystemMode`, and `discoveredSystemMode` with enums that include "inconsistent", "waiting", and "notApplicable". The earlier TODO marker was based on a stale snapshot of the spec served by the local OpenAPI MCP server (1.1.209). There is no workaround here — we are reading the fields exactly as documented. Replaced the marker with a plain comment explaining the design choice (compare against `intendedSystemMode`, not the `systemMode` aggregator) so future maintainers understand the idempotency rationale without thinking we owe a removal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0bb6a39 to
504d495
Compare
There was a problem hiding this comment.
Pull request overview
Adds cisco.nd.nd_maintenance_mode, a new Ansible module for setting switch system mode to normal or maintenance through the Nexus Dashboard manage API.
Changes:
- Adds the maintenance mode module, singleton model, orchestrator, and new endpoint definitions.
- Adds per-switch query/idempotency handling and 207 per-item failure parsing.
- Adds unit and integration tests for endpoint/model/orchestrator behavior and merged-state scenarios.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
plugins/modules/nd_maintenance_mode.py |
Adds the module entry point, documentation, examples, argspec wiring, and state-machine integration. |
plugins/module_utils/orchestrators/maintenance_mode.py |
Adds orchestration for querying switch modes and posting changeSystemMode actions. |
plugins/module_utils/models/maintenance_mode/maintenance_mode.py |
Adds singleton maintenance mode models and argument spec. |
plugins/module_utils/models/maintenance_mode/__init__.py |
Adds package initialization for maintenance mode models. |
plugins/module_utils/endpoints/v1/manage/manage_fabrics_switch_actions.py |
Adds the changeSystemMode switch action endpoint model and query params. |
plugins/module_utils/endpoints/v1/manage/manage_switches.py |
Adds a per-switch GET endpoint used for maintenance mode snapshots. |
tests/unit/module_utils/endpoints/test_endpoints_api_v1_manage_fabrics_switch_actions.py |
Adds unit tests for the new switch action endpoint. |
tests/unit/module_utils/models/test_maintenance_mode_model.py |
Adds model/unit tests for singleton behavior, payloads, diffing, and argspec shape. |
tests/unit/module_utils/orchestrators/test_maintenance_mode.py |
Adds orchestrator unit tests for query, update, failure handling, and delegation paths. |
tests/unit/module_utils/fixtures/fixture_data/test_maintenance_mode.json |
Adds fixtures for maintenance mode orchestrator tests. |
tests/integration/targets/nd_maintenance_mode/vars/main.yaml |
Adds integration test variables for fabric and switch IPs. |
tests/integration/targets/nd_maintenance_mode/tasks/main.yaml |
Adds integration test entry point and setup/teardown orchestration. |
tests/integration/targets/nd_maintenance_mode/tasks/setup.yaml |
Adds setup task to normalize switch mode before tests. |
tests/integration/targets/nd_maintenance_mode/tasks/merged.yaml |
Adds merged-state integration scenarios, idempotency checks, negatives, and check mode tests. |
tests/integration/targets/nd_maintenance_mode/tasks/teardown.yaml |
Adds teardown task to restore test switches to normal intent. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
NDStateMachine invokes orchestrator.query_all() during init and reuses the same orchestrator instance for update(). The previous update() implementation called self.query_all() a second time to "re-read the live snapshot", which re-fanned out one GET per switch -- doubling per-run request count. Stash the snapshot on the instance (_snapshot_modes) at the end of query_all() and read it in update(). A direct call to update() without query_all() first (e.g. outside NDStateMachine) falls back to a one-shot query. Also drop the redundant validate_prerequisites() call at the top of update(): FabricContext caches the summary across calls, so it was already a no-op on the wire, but skipping the call keeps the intent of "one fabric pre-flight per module run" explicit. Adds test_maintenance_mode_00240 which proves update() does not re-issue per-switch GETs after query_all() has already populated the cache.
Previously query_all looped over every user-supplied switch_ip, calling
_fetch_switch_modes -> _resolve_switch_id (via FabricContext, which itself
issues a /switches list GET on first access) -> GET /switches/{id}. For N
user switches that's 2+N controller round-trips just to read intent.
Verified on ND 4.2.1 lab (SITE1, 2026-05-28) that
GET /api/v1/manage/fabrics/{F}/switches already returns
additionalData.intendedSystemMode and additionalData.discoveredSystemMode
per row, so one bulk call replaces the entire fan-out.
Changes:
- query_all() now issues a single EpManageSwitchesListGet and parses each
row's switchId + additionalData modes through a new _parse_switch_rows
helper. Both the IP->switchId map and the IP->intendedSystemMode map are
cached on the orchestrator instance.
- update() reads switchIds from the cached map instead of calling
_resolve_switch_id, which would trigger FabricContext to issue its own
/switches list GET (a second one).
- _fetch_switch_modes and the per-switch EpManageSwitchesGet import are
removed; class-level query_all_endpoint / query_one_endpoint repointed
to EpManageSwitchesListGet.
- Missing-IP error message kept compatible with the prior FabricContext
wording so existing test assertions hold.
Tests + fixtures rewritten to consume the new sequence (one summary GET +
one bulk /switches GET, plus a POST when update changes anything). Fixture
data captured verbatim from the lab.
Net result: for N user switches, controller round-trips drop from 2+N to
1 (summary, cached by FabricContext) + 1 (bulk /switches) + 0 or 1 (POST).
On a 32-switch fabric that is 64 GETs eliminated per playbook task.
EpManageSwitchesGet.path and EpManageFabricsSwitchActionsChangeSystemModePost.path joined fabric_name (and switch_id) into the URL via BasePath.path()'s plain str-join, with no urllib quoting. Sibling endpoints in the same files (_EpManageSwitchesBase.path and EpManageSwitchActionsDeploy.path) already wrap fabric_name in quote(..., safe=''), so the two new endpoints were silently dropping the invariant. ND 4.2 fabric names today are constrained to characters that survive a raw join, so this hasn't bitten anyone, but a future fabric name with a reserved URL character (space, '/', '+', '%') would build a malformed path that 404s or, worse, silently routes to a different resource. Aligns the two new endpoints with the pre-existing quoting convention. Adds test_change_system_mode_endpoint_00070 asserting that a fabric name containing a space and a '/' is percent-encoded in the path.
The MaintenanceModeModel defaults deploy and blocking to False (because argspec defaults bool params). The orchestrator previously pushed those defaults onto the endpoint's ChangeSystemModeEndpointParams unconditionally, and EndpointQueryParams.to_query_string serializes via model_dump(exclude_none=True) -- which excludes None but NOT False. Result: every request URL ended with `?deploy=false&blocking=false` even when the user opted into neither. Functionally harmless on ND 4.2 today, but inconsistent with the endpoint params' own design (verified by test_change_system_mode_endpoint_00050[neither], which asserts both params unset produces an empty query string). It also matters if a future ND release ever differentiates explicit `deploy=false` from omission. Fix: assign to endpoint_params.deploy / .blocking only when the model value is truthy. ticket_id is already Optional[str]=None so the existing unconditional assignment is fine -- None is correctly excluded. Adds test_maintenance_mode_00250 asserting the POST URL has no query string at all when config supplies only mode + switches.
…allowlist _raise_on_207_failures previously treated any item with status != "success" as a per-switch failure. ND switchAction APIs occasionally return rows with no status field, an empty status, or progress/info tokens like "pending" or "warning"; the old code would surface those as RuntimeError even when the actual deploy succeeded on the affected switch, causing spurious task failures and forcing a re-POST on the next run. Switch to a denylist: raise only when status is one of failed / failure / error (case-insensitive). Anything else -- success, missing, empty, pending, warning, future info tokens -- is tolerated. Expanded test_maintenance_mode_00400 from 6 to 13 parametrized cases covering the new semantics: explicit failure tokens raise; missing, empty, pending, and warning do not.
When config.deploy=true and config.blocking=true, ND holds the HTTP response open for the entire deploy (several minutes per switch). The ansible-connection daemon's persistent_command_timeout default is 30s and will reap the socket mid-request, surfacing as a confusing ConnectionError ... socket path does not exist. Add a notes paragraph in DOCUMENTATION explaining the failure mode and the recommended mitigation: raise ansible_command_timeout in inventory [nd:vars] AND set a generous module-level timeout via module_defaults. Add an EXAMPLES item demonstrating the play-level module_defaults pattern with timeout: 1800. Also replace the en-dash in the existing examples comment with an ASCII '--' to keep the file fully ASCII (per the team's no-smart-quotes convention).
akinross
requested changes
May 29, 2026
Ansible's `required=True` only enforces key presence, not non-empty content. A user passing `switches: []` together with a target `mode` previously fell into the orchestrator's empty-snapshot defensive branch and produced a silent no-op against the changeSystemMode action endpoint (which itself requires `switchIds` with minItems=1). Add a Pydantic `model_validator` on `MaintenanceModeModel` that rejects this combination. The check is gated on `mode is not None`, so snapshots built by `query_all` (which leave `mode` unset) and other defensive no-op paths still construct. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fallout from the new MaintenanceModeModel validator: this test constructed `MaintenanceModeModel(mode="normal")` without switches because it only cares about delete() raising NotImplementedError. The model now rejects that combination, so add a placeholder switches entry and a one-line note explaining why. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`_parse_switch_rows` is typed `list[Any]` and skips non-dict items, but
the caller previously fed it `response.get("switches") or []`, which
left a non-list value (string, dict, scalar) intact. A string would
iterate character-by-character (benign but wasteful); a non-iterable
scalar would TypeError inside the helper and surface as a generic
`query_all failed:` message.
Replace the `or []` pattern with a list-type guard at the boundary so
malformed responses collapse to `[]` and the user gets the standard
"No switch found with fabricManagementIp ..." error instead of an
opaque crash. Adds test_maintenance_mode_00140 covering an integer-
shaped `switches` value via a new fixture key.
Addresses akinross's review comment on PR #284.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
akinross
approved these changes
May 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related Issue(s)
Proposed Changes
Add
cisco.nd.nd_maintenance_mode, a new Ansible module that sets the system mode (normalormaintenance) of switches in a fabric viaPOST /api/v1/manage/fabrics/{fabric_name}/switchActions/changeSystemMode.The endpoint is action-shaped — one POST mutates a batch of switches with a single mode. The module adapts that to
NDStateMachine's CRUD-shaped driver via anidentifier_strategy="singleton"model and aconfig-dict-to-list wrap inmain(). Idempotency readsadditionalData.intendedSystemModeper switch (from one bulkGET /fabrics/{f}/switches) and only POSTs switches whose intent differs from the request. Any switch currently inmigrationmode triggers a hard-fail with a remediation message asking the operator to run 'Recalculate and Deploy' in the ND UI before retrying. The 207 response body is inspected per-item; entries whosestatusisfailed,failure, orerror(case-insensitive) surface as aRuntimeErrornaming the offending switches by IP. Other tokens (success, missing, empty,pending,warning) are tolerated.State support is
mergedonly for now. The collection-widequerystate is retired;gatheredwill be added once the underlyingNDStateMachine/NDOutputframework support lands (the orchestrator'squery_allalready produces the per-switch snapshot it will need).Two
TODO(4.2.1)workaround markers flag observed ND wire/schema divergences:GET /fabrics/{f}/switches/{sid}nestssystemModeunderadditionalData, not at the top level the OpenAPI schema documents.additionalData.systemModereturns"inconsistent"(a value not in the OpenAPI enum) between an intent change and a deploy, so idempotency must compare againstintendedSystemMode, notsystemMode.New / modified files:
plugins/modules/nd_maintenance_mode.py— module entry point (DOCUMENTATION / EXAMPLES / argspec /main()). DOCUMENTATION includes a notes paragraph and a play-levelmodule_defaultsexample covering theansible_command_timeout/persistent_command_timeoutinteraction with blocking deploys.plugins/module_utils/orchestrators/maintenance_mode.py—MaintenanceModeOrchestratorinheritingNDBaseInterfaceOrchestratorforFabricContextplumbing.query_all()issues one bulkEpManageSwitchesListGetand caches both the IP→intendedSystemMode map and the IP→switchId map on the instance;update()reuses both, so each module run does one fabric-summary GET, one switches GET, and (when changes are needed) one POST.plugins/module_utils/models/maintenance_mode/{__init__.py, maintenance_mode.py}— singleton model + per-switchget_diff.plugins/module_utils/endpoints/v1/manage/manage_fabrics_switch_actions.py— new endpoint module forchangeSystemMode. Path is URL-encoded; query string omitsdeployandblockingwhen the user did not opt into them.plugins/module_utils/endpoints/v1/manage/manage_switches.py— addedEpManageSwitchesGet(generic per-switch GET, retained as collection infrastructure thoughnd_maintenance_modeitself reads via the bulkEpManageSwitchesListGet). URL encoding on both new paths hardened to match the existingEpManageSwitchActionsDeployconvention.nd_maintenance_modewith YAML-anchored single-switch, multi-switch, partial-mismatch, negative, andcheck_modescenarios.Test Notes
source env && python -m pytest tests/unit/module_utils/— 554/554 pass, zero regressions. Coverage: orchestrator 93%, model 97%, endpoint 97%.black,isort,pylint,mypyall clean againstpyproject.toml.pylint9.97/10 on tests, 9.78/10 on production.yamllintclean on integration target.intendedSystemMode=normalfor all test switches after the run. Re-verified green after the follow-up commits below.switch_ip, negative fabric, check_mode) all green prior to writing the integration target.radiusauth domain on this lab is effectively read-only on SITE1/SITE2; live tests use-e ansible_httpapi_login_domain=local. Documented in the integration target'smain.yaml.Refinements since PR opened
Six follow-up commits land behavior/perf refinements on top of the initial implementation:
e65143d— Cache thequery_all()snapshot on the orchestrator and skip the redundant re-query inupdate(). Avoids a second per-run snapshot pass whenNDStateMachinealready populated one during init.d9d09fa— Replace the per-switch GET fan-out inquery_all()with one bulkEpManageSwitchesListGetcall (verified on ND 4.2.1 lab: the list response carriesadditionalData.intendedSystemModeanddiscoveredSystemModeper row). Per-run controller round-trips drop from2 + Nto2 + 1; on a 32-switch fabric that is 64 GETs eliminated per task.3dde916— URL-encodefabric_name(andswitch_id) inEpManageSwitchesGet.pathandEpManageFabricsSwitchActionsChangeSystemModePost.path. Matches the existingquote(safe="")convention used by sibling endpoints; protects against future fabric names containing reserved URL characters.8374b4b— Only push truthydeploy/blockingonto the endpoint's query params. Previously the bool defaults (False) were assigned unconditionally and serialized as?deploy=false&blocking=falseon every request. Functionally harmless on ND today but inconsistent with the endpoint's ownto_query_stringdesign.f77bd42— 207 parser uses an explicit failure denylist (failed/failure/error, case-insensitive) rather than treating anything notsuccessas a failure. ND's switchAction APIs occasionally return rows with missing status or progress tokens likepending/warning; the old code would surface those as spurious task failures.0efba32— Document theblocking=truedeploy interaction withpersistent_command_timeout. New notes paragraph and a play-levelmodule_defaultsexample showing the recommendedtimeout: 1800+ inventoryansible_command_timeoutpattern.Cisco Nexus Dashboard Version
4.2.1
Related ND API Resource Category
Checklist
🤖 Generated with Claude Code