Skip to content

nd_maintenance_mode#284

Open
allenrobel wants to merge 12 commits into
developfrom
nd_maintenance_mode
Open

nd_maintenance_mode#284
allenrobel wants to merge 12 commits into
developfrom
nd_maintenance_mode

Conversation

@allenrobel
Copy link
Copy Markdown
Collaborator

@allenrobel allenrobel commented May 18, 2026

Related Issue(s)

Proposed Changes

Add cisco.nd.nd_maintenance_mode, a new Ansible module that sets the system mode (normal or maintenance) of switches in a fabric via POST /api/v1/manage/fabrics/{fabric_name}/switchActions/changeSystemMode.

The endpoint is action-shaped — one POST mutates a batch of switches with a single mode. The module adapts that to NDStateMachine's CRUD-shaped driver via an identifier_strategy="singleton" model and a config-dict-to-list wrap in main(). Idempotency reads additionalData.intendedSystemMode per switch (from one bulk GET /fabrics/{f}/switches) and only POSTs switches whose intent differs from the request. Any switch currently in migration mode triggers a hard-fail with a remediation message asking the operator to run 'Recalculate and Deploy' in the ND UI before retrying. The 207 response body is inspected per-item; entries whose status is failed, failure, or error (case-insensitive) surface as a RuntimeError naming the offending switches by IP. Other tokens (success, missing, empty, pending, warning) are tolerated.

State support is merged only for now. The collection-wide query state is retired; gathered will be added once the underlying NDStateMachine/NDOutput framework support lands (the orchestrator's query_all already produces the per-switch snapshot it will need).

Two TODO(4.2.1) workaround markers flag observed ND wire/schema divergences:

  • GET /fabrics/{f}/switches/{sid} nests systemMode under additionalData, not at the top level the OpenAPI schema documents.
  • The aggregator additionalData.systemMode returns "inconsistent" (a value not in the OpenAPI enum) between an intent change and a deploy, so idempotency must compare against intendedSystemMode, not systemMode.

New / modified files:

  • plugins/modules/nd_maintenance_mode.py — module entry point (DOCUMENTATION / EXAMPLES / argspec / main()). DOCUMENTATION includes a notes paragraph and a play-level module_defaults example covering the ansible_command_timeout / persistent_command_timeout interaction with blocking deploys.
  • plugins/module_utils/orchestrators/maintenance_mode.pyMaintenanceModeOrchestrator inheriting NDBaseInterfaceOrchestrator for FabricContext plumbing. query_all() issues one bulk EpManageSwitchesListGet and caches both the IP→intendedSystemMode map and the IP→switchId map on the instance; update() reuses both, so each module run does one fabric-summary GET, one switches GET, and (when changes are needed) one POST.
  • plugins/module_utils/models/maintenance_mode/{__init__.py, maintenance_mode.py} — singleton model + per-switch get_diff.
  • plugins/module_utils/endpoints/v1/manage/manage_fabrics_switch_actions.py — new endpoint module for changeSystemMode. Path is URL-encoded; query string omits deploy and blocking when the user did not opt into them.
  • plugins/module_utils/endpoints/v1/manage/manage_switches.py — added EpManageSwitchesGet (generic per-switch GET, retained as collection infrastructure though nd_maintenance_mode itself reads via the bulk EpManageSwitchesListGet). URL encoding on both new paths hardened to match the existing EpManageSwitchActionsDeploy convention.
  • Unit tests: endpoint (9), model (15), orchestrator (29 collected). Total suite: 554/554 pass.
  • Integration target nd_maintenance_mode with YAML-anchored single-switch, multi-switch, partial-mismatch, negative, and check_mode scenarios.

Test Notes

  • Unit tests: source env && python -m pytest tests/unit/module_utils/ — 554/554 pass, zero regressions. Coverage: orchestrator 93%, model 97%, endpoint 97%.
  • Lint: black, isort, pylint, mypy all clean against pyproject.toml. pylint 9.97/10 on tests, 9.78/10 on production. yamllint clean on integration target.
  • Integration: Verified live on ND 4.2.1.10 @ 192.168.7.7 / SITE1 against switches S1_BG1 (192.168.12.131) and S1_SP1 (192.168.12.141). Full target green; lab state verified back to intendedSystemMode=normal for all test switches after the run. Re-verified green after the follow-up commits below.
  • Smoke test: Eight standalone scenarios (baseline, set maintenance, idempotent, restore, idempotent, negative switch_ip, negative fabric, check_mode) all green prior to writing the integration target.
  • The radius auth domain on this lab is effectively read-only on SITE1/SITE2; live tests use -e ansible_httpapi_login_domain=local. Documented in the integration target's main.yaml.

Refinements since PR opened

Six follow-up commits land behavior/perf refinements on top of the initial implementation:

  1. e65143d — Cache the query_all() snapshot on the orchestrator and skip the redundant re-query in update(). Avoids a second per-run snapshot pass when NDStateMachine already populated one during init.
  2. d9d09fa — Replace the per-switch GET fan-out in query_all() with one bulk EpManageSwitchesListGet call (verified on ND 4.2.1 lab: the list response carries additionalData.intendedSystemMode and discoveredSystemMode per row). Per-run controller round-trips drop from 2 + N to 2 + 1; on a 32-switch fabric that is 64 GETs eliminated per task.
  3. 3dde916 — URL-encode fabric_name (and switch_id) in EpManageSwitchesGet.path and EpManageFabricsSwitchActionsChangeSystemModePost.path. Matches the existing quote(safe="") convention used by sibling endpoints; protects against future fabric names containing reserved URL characters.
  4. 8374b4b — Only push truthy deploy / blocking onto the endpoint's query params. Previously the bool defaults (False) were assigned unconditionally and serialized as ?deploy=false&blocking=false on every request. Functionally harmless on ND today but inconsistent with the endpoint's own to_query_string design.
  5. f77bd42 — 207 parser uses an explicit failure denylist (failed / failure / error, case-insensitive) rather than treating anything not success as a failure. ND's switchAction APIs occasionally return rows with missing status or progress tokens like pending/warning; the old code would surface those as spurious task failures.
  6. 0efba32 — Document the blocking=true deploy interaction with persistent_command_timeout. New notes paragraph and a play-level module_defaults example showing the recommended timeout: 1800 + inventory ansible_command_timeout pattern.

Cisco Nexus Dashboard Version

4.2.1

Related ND API Resource Category

  • analyze
  • infra
  • manage
  • onemanage
  • other

Checklist

  • Latest commit is rebased from develop with merge conflicts resolved
  • New or updates to documentation has been made accordingly
  • Assigned the proper reviewers

🤖 Generated with Claude Code

@allenrobel allenrobel changed the title Add nd_maintenance_mode module nd_maintenance_mode May 18, 2026
@allenrobel allenrobel self-assigned this May 18, 2026
@allenrobel allenrobel force-pushed the nd_interface_loopback branch from bff5d88 to 921294d Compare May 20, 2026 21:23
@allenrobel allenrobel force-pushed the nd_maintenance_mode branch from 277a298 to fb922d2 Compare May 20, 2026 21:23
Base automatically changed from nd_interface_loopback to develop May 21, 2026 13:25
@allenrobel allenrobel force-pushed the nd_maintenance_mode branch from fb922d2 to 0bb6a39 Compare May 21, 2026 19:41
allenrobel and others added 2 commits May 28, 2026 07:33
Wraps POST /api/v1/manage/fabrics/{fabric_name}/switchActions/changeSystemMode
to set switches in a fabric to `normal` or `maintenance` mode. The endpoint is
action-shaped (one POST mutates a batch of switches with a single mode), and
the module adapts it to NDStateMachine's CRUD-shaped driver via an
identifier_strategy="singleton" model and a config-dict-to-list wrap in main().

Idempotency reads `additionalData.intendedSystemMode` per switch and only POSTs
switches whose intent differs from the request. Any switch currently in
`migration` mode triggers a hard-fail with the remediation instruction to run
'Recalculate and Deploy' in the ND UI before retrying. The 207 response body is
inspected per-item; any failed entry surfaces as a RuntimeError naming the
offending switches by IP.

State support: `merged` only. `gathered` will be added when the underlying
NDStateMachine/NDOutput support lands; query_all already produces the per-switch
snapshot it will need.

TODO(4.2.1) markers flag two ND wire/schema divergences observed on ND 4.2.1.10:
the GET switch response nests systemMode under `additionalData` rather than at
the top level the OpenAPI schema documents, and the aggregator
`additionalData.systemMode` returns "inconsistent" (a value not in the OpenAPI
enum) between an intent change and a deploy.

- New endpoint module manage_fabrics_switch_actions.py with
  ChangeSystemModeEndpointParams and EpManageFabricsSwitchActionsChangeSystemModePost
- EpManageSwitchesGet added to manage_switches.py for the per-switch GET
  used by the query_all fan-out
- Pydantic feature model MaintenanceModeModel + MaintenanceModeSwitchModel
  with custom per-switch get_diff
- MaintenanceModeOrchestrator inheriting NDBaseInterfaceOrchestrator for
  FabricContext plumbing and switch IP resolution
- 43 unit tests with 93-97% coverage on the new modules
- Integration target with YAML-anchored single-switch, multi-switch,
  partial-mismatch, negative, and check_mode scenarios; verified live on
  ND 4.2.1.10 / SITE1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified against the live ND OpenAPI spec at
/api/v1/manage/openAPISpec (Nexus Dashboard Manage v1 1.1.411): the wire
shape we read in `_fetch_switch_modes` matches the spec exactly. The
`switchData` schema declares `additionalData` as a $ref to `additionalType`
(a `oneOf` discriminated by `usage`), and the non-ACI variant
`additionalSwitchData` documents `systemMode`, `intendedSystemMode`, and
`discoveredSystemMode` with enums that include "inconsistent",
"waiting", and "notApplicable".

The earlier TODO marker was based on a stale snapshot of the spec served
by the local OpenAPI MCP server (1.1.209). There is no workaround here —
we are reading the fields exactly as documented. Replaced the marker with
a plain comment explaining the design choice (compare against
`intendedSystemMode`, not the `systemMode` aggregator) so future maintainers
understand the idempotency rationale without thinking we owe a removal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 17:34
@allenrobel allenrobel force-pushed the nd_maintenance_mode branch from 0bb6a39 to 504d495 Compare May 28, 2026 17:34
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds cisco.nd.nd_maintenance_mode, a new Ansible module for setting switch system mode to normal or maintenance through the Nexus Dashboard manage API.

Changes:

  • Adds the maintenance mode module, singleton model, orchestrator, and new endpoint definitions.
  • Adds per-switch query/idempotency handling and 207 per-item failure parsing.
  • Adds unit and integration tests for endpoint/model/orchestrator behavior and merged-state scenarios.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
plugins/modules/nd_maintenance_mode.py Adds the module entry point, documentation, examples, argspec wiring, and state-machine integration.
plugins/module_utils/orchestrators/maintenance_mode.py Adds orchestration for querying switch modes and posting changeSystemMode actions.
plugins/module_utils/models/maintenance_mode/maintenance_mode.py Adds singleton maintenance mode models and argument spec.
plugins/module_utils/models/maintenance_mode/__init__.py Adds package initialization for maintenance mode models.
plugins/module_utils/endpoints/v1/manage/manage_fabrics_switch_actions.py Adds the changeSystemMode switch action endpoint model and query params.
plugins/module_utils/endpoints/v1/manage/manage_switches.py Adds a per-switch GET endpoint used for maintenance mode snapshots.
tests/unit/module_utils/endpoints/test_endpoints_api_v1_manage_fabrics_switch_actions.py Adds unit tests for the new switch action endpoint.
tests/unit/module_utils/models/test_maintenance_mode_model.py Adds model/unit tests for singleton behavior, payloads, diffing, and argspec shape.
tests/unit/module_utils/orchestrators/test_maintenance_mode.py Adds orchestrator unit tests for query, update, failure handling, and delegation paths.
tests/unit/module_utils/fixtures/fixture_data/test_maintenance_mode.json Adds fixtures for maintenance mode orchestrator tests.
tests/integration/targets/nd_maintenance_mode/vars/main.yaml Adds integration test variables for fabric and switch IPs.
tests/integration/targets/nd_maintenance_mode/tasks/main.yaml Adds integration test entry point and setup/teardown orchestration.
tests/integration/targets/nd_maintenance_mode/tasks/setup.yaml Adds setup task to normalize switch mode before tests.
tests/integration/targets/nd_maintenance_mode/tasks/merged.yaml Adds merged-state integration scenarios, idempotency checks, negatives, and check mode tests.
tests/integration/targets/nd_maintenance_mode/tasks/teardown.yaml Adds teardown task to restore test switches to normal intent.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread plugins/module_utils/models/maintenance_mode/maintenance_mode.py
Comment thread plugins/module_utils/endpoints/v1/manage/manage_switches.py Outdated
Comment thread plugins/module_utils/endpoints/v1/manage/manage_fabrics_switch_actions.py Outdated
NDStateMachine invokes orchestrator.query_all() during init and reuses the
same orchestrator instance for update(). The previous update() implementation
called self.query_all() a second time to "re-read the live snapshot", which
re-fanned out one GET per switch -- doubling per-run request count.

Stash the snapshot on the instance (_snapshot_modes) at the end of query_all()
and read it in update(). A direct call to update() without query_all() first
(e.g. outside NDStateMachine) falls back to a one-shot query.

Also drop the redundant validate_prerequisites() call at the top of update():
FabricContext caches the summary across calls, so it was already a no-op on
the wire, but skipping the call keeps the intent of "one fabric pre-flight
per module run" explicit.

Adds test_maintenance_mode_00240 which proves update() does not re-issue
per-switch GETs after query_all() has already populated the cache.
Previously query_all looped over every user-supplied switch_ip, calling
_fetch_switch_modes -> _resolve_switch_id (via FabricContext, which itself
issues a /switches list GET on first access) -> GET /switches/{id}. For N
user switches that's 2+N controller round-trips just to read intent.

Verified on ND 4.2.1 lab (SITE1, 2026-05-28) that
GET /api/v1/manage/fabrics/{F}/switches already returns
additionalData.intendedSystemMode and additionalData.discoveredSystemMode
per row, so one bulk call replaces the entire fan-out.

Changes:

- query_all() now issues a single EpManageSwitchesListGet and parses each
  row's switchId + additionalData modes through a new _parse_switch_rows
  helper. Both the IP->switchId map and the IP->intendedSystemMode map are
  cached on the orchestrator instance.
- update() reads switchIds from the cached map instead of calling
  _resolve_switch_id, which would trigger FabricContext to issue its own
  /switches list GET (a second one).
- _fetch_switch_modes and the per-switch EpManageSwitchesGet import are
  removed; class-level query_all_endpoint / query_one_endpoint repointed
  to EpManageSwitchesListGet.
- Missing-IP error message kept compatible with the prior FabricContext
  wording so existing test assertions hold.

Tests + fixtures rewritten to consume the new sequence (one summary GET +
one bulk /switches GET, plus a POST when update changes anything). Fixture
data captured verbatim from the lab.

Net result: for N user switches, controller round-trips drop from 2+N to
1 (summary, cached by FabricContext) + 1 (bulk /switches) + 0 or 1 (POST).
On a 32-switch fabric that is 64 GETs eliminated per playbook task.
EpManageSwitchesGet.path and EpManageFabricsSwitchActionsChangeSystemModePost.path
joined fabric_name (and switch_id) into the URL via BasePath.path()'s
plain str-join, with no urllib quoting. Sibling endpoints in the same
files (_EpManageSwitchesBase.path and EpManageSwitchActionsDeploy.path)
already wrap fabric_name in quote(..., safe=''), so the two new
endpoints were silently dropping the invariant.

ND 4.2 fabric names today are constrained to characters that survive a
raw join, so this hasn't bitten anyone, but a future fabric name with
a reserved URL character (space, '/', '+', '%') would build a
malformed path that 404s or, worse, silently routes to a different
resource. Aligns the two new endpoints with the pre-existing
quoting convention.

Adds test_change_system_mode_endpoint_00070 asserting that a fabric
name containing a space and a '/' is percent-encoded in the path.
The MaintenanceModeModel defaults deploy and blocking to False (because
argspec defaults bool params). The orchestrator previously pushed those
defaults onto the endpoint's ChangeSystemModeEndpointParams
unconditionally, and EndpointQueryParams.to_query_string serializes via
model_dump(exclude_none=True) -- which excludes None but NOT False.
Result: every request URL ended with `?deploy=false&blocking=false`
even when the user opted into neither.

Functionally harmless on ND 4.2 today, but inconsistent with the
endpoint params' own design (verified by
test_change_system_mode_endpoint_00050[neither], which asserts both
params unset produces an empty query string). It also matters if a
future ND release ever differentiates explicit `deploy=false` from
omission.

Fix: assign to endpoint_params.deploy / .blocking only when the model
value is truthy. ticket_id is already Optional[str]=None so the existing
unconditional assignment is fine -- None is correctly excluded.

Adds test_maintenance_mode_00250 asserting the POST URL has no query
string at all when config supplies only mode + switches.
…allowlist

_raise_on_207_failures previously treated any item with status != "success"
as a per-switch failure. ND switchAction APIs occasionally return rows
with no status field, an empty status, or progress/info tokens like
"pending" or "warning"; the old code would surface those as RuntimeError
even when the actual deploy succeeded on the affected switch, causing
spurious task failures and forcing a re-POST on the next run.

Switch to a denylist: raise only when status is one of failed / failure /
error (case-insensitive). Anything else -- success, missing, empty,
pending, warning, future info tokens -- is tolerated.

Expanded test_maintenance_mode_00400 from 6 to 13 parametrized cases
covering the new semantics: explicit failure tokens raise; missing,
empty, pending, and warning do not.
When config.deploy=true and config.blocking=true, ND holds the HTTP
response open for the entire deploy (several minutes per switch). The
ansible-connection daemon's persistent_command_timeout default is 30s
and will reap the socket mid-request, surfacing as a confusing
ConnectionError ... socket path does not exist.

Add a notes paragraph in DOCUMENTATION explaining the failure mode and
the recommended mitigation: raise ansible_command_timeout in inventory
[nd:vars] AND set a generous module-level timeout via module_defaults.

Add an EXAMPLES item demonstrating the play-level module_defaults
pattern with timeout: 1800.

Also replace the en-dash in the existing examples comment with an
ASCII '--' to keep the file fully ASCII (per the team's no-smart-quotes
convention).
@allenrobel allenrobel added the Ready for Review Submitter is requesting a PR review label May 28, 2026
Comment thread plugins/module_utils/orchestrators/maintenance_mode.py Outdated
allenrobel and others added 3 commits May 29, 2026 10:48
Ansible's `required=True` only enforces key presence, not non-empty
content. A user passing `switches: []` together with a target `mode`
previously fell into the orchestrator's empty-snapshot defensive branch
and produced a silent no-op against the changeSystemMode action
endpoint (which itself requires `switchIds` with minItems=1).

Add a Pydantic `model_validator` on `MaintenanceModeModel` that rejects
this combination. The check is gated on `mode is not None`, so snapshots
built by `query_all` (which leave `mode` unset) and other defensive
no-op paths still construct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fallout from the new MaintenanceModeModel validator: this test
constructed `MaintenanceModeModel(mode="normal")` without switches
because it only cares about delete() raising NotImplementedError. The
model now rejects that combination, so add a placeholder switches entry
and a one-line note explaining why.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`_parse_switch_rows` is typed `list[Any]` and skips non-dict items, but
the caller previously fed it `response.get("switches") or []`, which
left a non-list value (string, dict, scalar) intact. A string would
iterate character-by-character (benign but wasteful); a non-iterable
scalar would TypeError inside the helper and surface as a generic
`query_all failed:` message.

Replace the `or []` pattern with a list-type guard at the boundary so
malformed responses collapse to `[]` and the user gets the standard
"No switch found with fabricManagementIp ..." error instead of an
opaque crash. Adds test_maintenance_mode_00140 covering an integer-
shaped `switches` value via a new fixture key.

Addresses akinross's review comment on PR #284.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@akinross akinross self-requested a review May 30, 2026 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ready for Review Submitter is requesting a PR review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants