Cluster Management Repository

This repository manages a small Linux cluster using Rex (Perl), with a security-hardened workflow for mesh SSH — no node is special; every node can reach every other node via both password and key-based SSH.

Core design goals:

Keep machine inventory in YAML files under cmdb/ (networks and optional MAC data only)
Keep secrets (usernames, passwords, key material) in an encrypted SQLite database under createdatabase/
Use bootstrap mode only for first-time access and key deployment
Use key-based SSH for all normal operations (operational mode)
All mutating operations produce timestamped audit log entries

AI Disclaimer: This project started as a manual Rex file configuration, and relied heavily on AI additions (mostly via Claude) to refine security measures. In the process, I found that the AI would hallucinate DISA-STIG rules, generate bloatware and at some point even switched to Python (lolz) for tasks. Still the code generation for what I wanted to do (whether the code is actually doing it is a separate issue), was way faster than if I had written the code myself. Manual editing and review is far from complete at the time of this writing (April 12th 2026), so caveat emptor. If you decide to enter the chamber of AI horrors, would very much appreciate feedback or PRs. Having the ability to manage small networks e.g. in homelabs or small research laboratories in a secure manner is a valuable task,and the value of the task justifies the AI bootstrap (IMHO).

What the Code Does

system_db_setup/generate_cmdb.pl

Reads users_db.csv (hostname, username, password) and networks_db.csv (network, hostname, IP) and emits one cmdb/<hostname>.yml per machine. The YAML files contain only network IPs and optional MAC addresses — credentials are never written to YAML [SEC-CRED-ENCRYPT]. Run this once when adding new machines.

createdatabase/init_cluster_db.pl

Initialises the AES-256-GCM encrypted SQLite database (cluster_keys.db) and imports credentials from users_db.csv and optionally public keys from pubkeys/*.pub. The database schema stores encrypted usernames, passwords, SSH public keys, host keys, and key passphrases with full rotation tracking. All operations are audit-logged.

createdatabase/ClusterDB.pm

Pure-Perl encrypted database module. Provides:

store_key / get_key / list_keys / delete_key — SSH public key management
store_credential / get_credential / list_credentials / delete_credential — credential management
store_key_passphrase / get_key_passphrase — SSH key passphrase management
Private _encrypt / _decrypt using AES-256-GCM (FIPS 140-2) with a per-record IV
Private _derive_key using PBKDF2-SHA256 at 600,000 iterations (NIST SP 800-132)
Per-operation audit logging to a flat log file [SEC-AUDIT-LOG]
Permission enforcement: 0600 on database file, 0400 verified on keyfile

ClusterSSHHelpers.pm (repo root)

Shared helper module used by the Rexfile (and available to other Rexfiles). Provides:

ssh_opts / ssh_opts_bootstrap — SSH option builders for operational and bootstrap modes
audit_log — timestamped audit logging [SEC-AUDIT-LOG]
detect_admin — identify the admin machine by matching local IPs to CMDB (assumes the machine executing the script is the admin node)
tcp_pinger — create a Net::Ping TCP prober on port 22
resolve_targets — resolve --group/--machine parameters into a target list
validate_name — input validation against shell injection
cluster_db / get_credential — lazy-connected encrypted database access
ensure_agent_loaded — load the admin cluster key into ssh-agent via SSH_ASKPASS
bootstrap_via_askpass / bootstrap_run_or_warn / bootstrap_capturex — run commands with password via SSH_ASKPASS + 0400 tempfile
generate_load_cluster_key_script — generate the node-local key loader script
ensure_io_interface / cleanup_io_interface / get_mac_for_ip — IO::Interface helpers for MAC discovery (prefers apt-get GPG-signed packages on remote nodes for DISA-STIG compliance; falls back to cpan temp install only on admin/perlbrew)
And other utilities (command_exists, kh_key, run_or_warn, shell_quote, generate_passphrase, bootstrap_remote_mktemp, all_networks)

Loaded via use FindBin qw($Bin); use lib "$Bin/.."; use ClusterSSHHelpers; and configured with ClusterSSHHelpers::init(...) at startup.

cluster_ssh_setup/Rexfile

Main Rex task file. Defines 16 tasks:

Task	Description
`show_cmdb`	Display CMDB network/machine inventory (passwords always redacted)
`list_groups`	List defined inventory groups and members
`show_keys`	View SSH public key metadata (or full key for one machine)
`show_credentials`	View credential metadata (passwords always redacted)
`check_network`	TCP-probe node reachability on a named network
`refresh_known_hosts`	Rescan and update admin known_hosts for reachable nodes on a network
`nodes_in_network`	List nodes and reachability for one or all networks
`networks_for_node`	List networks and reachability for one or all machines
`show_known_hosts`	Display known_hosts content on each reachable node
`rm_dup_hosts`	Remove duplicate known_hosts entries cluster-wide
`setup_ssh_keys`	9-phase key distribution to build full mesh (bootstrap required)
`remove_ssh_keys`	8-phase key teardown, inverse of setup (bootstrap + confirm required)
`configure_sudoers_scope`	Install scoped sudoers drop-in on reachable non-admin nodes (bootstrap required)
`shutdown`	Remote shutdown of all non-admin nodes on a network
`obtain_mac`	Discover and store MAC addresses in CMDB YAMLs
`validate_ssh_mesh`	Probe all pairwise SSH connections (password + key) across a group

Both tasks accept --machine=<name> (target a single node) or --group=<name> (default: all). If both are given, --group takes precedence.

setup_ssh_keys — 9-phase workflow

Validates required local commands (ssh, scp, ssh-keygen, ssh-copy-id, sshpass, ssh-keyscan, openssl)
Generates the admin ECDSA P-521 keypair (~/.ssh/id_ecdsa_p521_cluster) with a random passphrase stored in the encrypted DB
TCP-probes all group/machine members on port 22 (UP / DOWN / MISSING)
Updates admin ~/.ssh/known_hosts with fresh FIPS-approved host keys for all reachable nodes
Generates a unique ECDSA P-521 keypair on each remote node (private key never leaves the node); authorizes admin public key via ssh-copy-id; deploys a node-local encrypted passphrase store (~/.ssh/.cluster_node_keyfile, ~/.ssh/.cluster_node_pp, ~/.ssh/load_cluster_key.pl) so the node can unlock its own key autonomously without contacting the admin (enables Node→Node and Node→Admin SSH after bootstrap)
Cross-authorizes all node keys on admin and peers (full mesh: node→admin, admin→node, node→node)
Propagates admin and all peer host keys to remote ~/.ssh/known_hosts (no TOFU prompts anywhere)
Deploys ~/.ssh/config with IdentityFile ~/.ssh/id_ecdsa_p521_cluster and StrictHostKeyChecking yes on admin and all nodes
Updates /etc/hosts on admin and all reachable nodes with {network}{machine} aliases

remove_ssh_keys — 8-phase teardown

Reverses all setup_ssh_keys changes: removes keypairs, the three node-local passphrase-store files (~/.ssh/.cluster_node_keyfile, ~/.ssh/.cluster_node_pp, ~/.ssh/load_cluster_key.pl), authorized_keys entries, known_hosts entries, ssh/config blocks, and /etc/hosts aliases. Requires --bootstrap=1 AND --confirm=1.

Repository Layout

ClusterSSHHelpers.pm: shared helper module (SSH options, audit, bootstrap, IO::Interface)
cluster_ssh_setup/
- Rexfile: main task automation (16 tasks, security-hardened)
cmdb/
- One YAML per machine
- Contains only non-secret inventory data (network IPs, optional MAC addresses)
createdatabase/
- ClusterDB.pm: encrypted database module (AES-256-GCM)
- init_cluster_db.pl: database initialization and import script
- pubkeys/: staging area for SSH public key files to import into the encrypted DB. Place <machine>.pub files here before running init_cluster_db.pl; they are imported automatically. The directory is empty by default and is not required for normal CSV-driven workflows.
- (git-ignored) cluster_keys.db — encrypted SQLite database
- (git-ignored) .cluster_db.keyfile — master encryption key
example_hpc_deployments/
- README.md : Examples of provisioning of software for HPC pipelines (currently empty)
system_db_setup/
- (git-ignored) users_db.csv: input credentials for encrypted DB import
- (git-ignored) networks_db.csv: network inventory input
- generate_cmdb.pl: builds non-secret YAML inventory
logs/
- (git-ignored) audit.log: timestamped record of all mutating operations

Security Model

Credentials and keys are encrypted at rest in createdatabase/cluster_keys.db
Encryption and KDF are configured for strong defaults:
- AES-256-GCM with per-record random IV (FIPS 140-2, NIST SP 800-38D)
- PBKDF2-SHA256 with 600,000 iterations (NIST SP 800-132)
Database keyfile should be owner-read-only:
- chmod 400 createdatabase/.cluster_db.keyfile
Plaintext credentials are not stored in cmdb/*.yml
Node-local passphrase store (deployed by Phase 5 of setup_ssh_keys):
- ~/.ssh/.cluster_node_keyfile (0400) — unique per-node AES key
- ~/.ssh/.cluster_node_pp (0400) — SSH key passphrase encrypted with AES-256-CBC PBKDF2-SHA256 600k iter
- ~/.ssh/load_cluster_key.pl (0500) — pure-Perl helper that decrypts and loads the key into ssh-agent
- Enables every node to unlock its own cluster key autonomously (Node→Admin, Node→Node SSH without admin)
All bootstrap password operations use SSH_ASKPASS + a 0400 tempfile; the SSHPASS environment variable is never set

Security Practice Labels

The following security-practice labels are used throughout the codebase to tag the controls each section implements. These are project-defined labels (not external rule IDs) that describe the actual security property enforced.

Label	Description	Implementation
SEC-SSH-NOPASSWD	Disable SSH password auth	`PasswordAuthentication=no` in all operational SSH option sets
SEC-SSH-CONFIG-PERMS	SSH config file permissions	`~/.ssh/config` enforced to 0600
SEC-KEY-PASSPHRASE	SSH private key passphrase	All keypairs (admin + per-node) generated with random 44-char passphrase; passphrases stored in encrypted DB; `ssh-agent` used to cache unlocked key
SEC-STRICT-HOSTKEY	StrictHostKeyChecking	`StrictHostKeyChecking=yes` in operational mode and enforced in the deployed `~/.ssh/config` block; relaxed only in explicit bootstrap mode with `--bootstrap=1`
SEC-CRED-ENCRYPT	No plaintext credentials on disk or in process table	Credentials encrypted at rest (AES-256-GCM); bootstrap passwords passed via `SSH_ASKPASS` + owner-read-only (0400) tempfile — `SSHPASS` env var is never set; node passphrases encrypted at rest (AES-256-CBC PBKDF2-SHA256 600k iter) and staged via non-descriptive 0400 tempfiles with 8-char random suffixes (`mktemp`), never in shell args
SEC-CRED-REDACT	Passwords not displayed to terminal	Passwords always shown as `(redacted - SEC-CRED-REDACT)` in all display tasks
SEC-FIPS-ALGO	FIPS-approved key algorithms	ECDSA P-521 for key generation; `ssh-keyscan` restricted to `ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521`
SEC-AUDIT-LOG	Audit logging	All mutating operations write a timestamped ISO-8601 entry to `logs/audit.log`; log file enforced to 0600
FIPS-140	Approved cipher and KDF	AES-256-GCM (NIST SP 800-38D); PBKDF2-SHA256 at 600,000 iterations (NIST SP 800-132)

Key assumptions

Admin node = executing machine: detect_admin identifies the admin by matching the local machine's IP addresses (via hostname -I) against the CMDB network entries. The machine running the Rex tasks is always treated as the admin node.

sudoers prerequisites for hardened systems

Operational mode requires password-free sudo for exactly two commands on each node. Add to /etc/sudoers on each node:

<cluster-user> ALL=(root) NOPASSWD: /usr/bin/tee /etc/hosts
<cluster-user> ALL=(root) NOPASSWD: /usr/sbin/shutdown

No other NOPASSWD rules are required.

The shutdown task defaults to sudo -n and expects the scoped NOPASSWD rule above. It also supports an explicit bootstrap-only fallback:

rex shutdown --network=<name> --bootstrap=1

In bootstrap mode, Rex uses the encrypted DB password for that node to authenticate SSH and to feed sudo -S non-interactively. This is intended for first-time setup or repair only; normal operations should use sudo -n with the scoped sudoers rule.

You can install the scoped sudoers policy from Rex in one bootstrap pass:

rex configure_sudoers_scope --network=<name> --bootstrap=1

Then use operational shutdown without passwords:

rex shutdown --network=<name>

Known Security Vulnerabilities and Pitfalls

The items below are accepted risks or require operational mitigations.

MEDIUM

1. CSV parsing does not fully validate fields — PARTIALLY FIXED users_db.csv and networks_db.csv are parsed with split /\s*,\s*/, $_, 3. Hostnames and network names are validated against /^[a-zA-Z0-9._-]+$/. IPs are validated against a dotted-decimal regex. Residual risk: passwords containing commas are still silently truncated; switching to Text::CSV would resolve this fully.

2. Bootstrap TOFU — ssh-keyscan output not fingerprint-verified — OPEN During setup_ssh_keys phase 4, host keys are collected via ssh-keyscan and installed directly into ~/.ssh/known_hosts without comparing against a pre-shared fingerprint list. This is inherent to the bootstrap model. Mitigation: run bootstrap only on an isolated, trusted network segment; optionally supply a pre-provisioned known_hosts file.

3. Password mesh test uses StrictHostKeyChecking=no — OPEN The inner command that tests password-based SSH for non-admin source nodes uses -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null. This is appropriate for a connectivity test but means the password test does not validate host identity — a MITM would pass the password test. The result could give false assurance that password auth is "working" to the correct destination.

LOW

4. Silent or next on critical phased operations — OPEN run_or_warn logs a warning to stderr and returns 0, but callers in setup_ssh_keys phase 5 use or next to skip a node on failure. The audit log does not record a per-phase failure for skipped nodes. A partially-bootstrapped node would appear to be configured but would fail operational access.

5. In-memory key material is not zeroed — OPEN The passphrase and _enc_key fields of the ClusterDB object and the state $db singleton in cluster_db persist in Perl memory until garbage-collected. Not a FIPS mandate for scripting platforms but relevant for high-assurance environments.

Data You Must Provide

You need to provide two CSV files in system_db_setup/.

users_db.csv

Format: Hostname, Username, Password

Example: node-alpha,clusteradmin,ExamplePass!234 node-beta,clusteradmin,ExamplePass!234 node-gamma,clusteradmin,ExamplePass!234

networks_db.csv

Format: Network, Hostname, IP

Example: labnet,node-alpha,10.20.30.11 labnet,node-beta,10.20.30.12 labnet,node-gamma,10.20.30.13

You can define multiple networks by adding more rows per host with different network names.

Full Setup Flow

Run from repository root.

Generate CMDB YAML files (non-secret inventory only)

perl system_db_setup/generate_cmdb.pl

Create encryption keyfile

openssl rand -base64 32 > createdatabase/.cluster_db.keyfile chmod 400 createdatabase/.cluster_db.keyfile

Initialize encrypted DB and import keys/credentials

perl createdatabase/init_cluster_db.pl
--keyfile createdatabase/.cluster_db.keyfile
--users-csv system_db_setup/users_db.csv

Notes:

All flags have sensible defaults when run from createdatabase/, but explicit paths are needed when running from the repository root
The script always imports .pub files from createdatabase/pubkeys/ by default; pass --import <dir> only to override that directory
Credentials from users_db.csv are encrypted and stored in DB

Verify database content (without exposing passwords)

cd cluster_ssh_setup CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)" rex show_credentials CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)" rex show_keys

Rex Tasks

From cluster_ssh_setup/:

rex show_cmdb
- Display CMDB machine and network inventory
rex list_groups
- Show defined Rex groups and members
rex check_network --network=
- Check reachability on port 22
rex show_known_hosts --network=
- Show known_hosts entries on reachable nodes
rex rm_dup_hosts --network=
- Remove duplicate known_hosts entries
rex setup_ssh_keys --network= --bootstrap=1
- First-time bootstrap: 9-phase key distribution and full mesh setup
rex remove_ssh_keys --network= --bootstrap=1 --confirm=1
- Tear down all cluster keys, authorized_keys, known_hosts, config blocks, /etc/hosts aliases
rex configure_sudoers_scope --network= --bootstrap=1
- Install scoped sudoers drop-in (/etc/sudoers.d/cluster_mgmt) for least-privilege operational tasks
rex validate_ssh_mesh --network=
- Test all pairwise SSH connections (password + key) in the group
rex shutdown --network=
- Shutdown reachable non-admin nodes

Shutdown paths:

Operational path (preferred): rex shutdown --network=<name>
- Uses sudo -n and requires scoped NOPASSWD sudoers entries
Bootstrap fallback path: rex shutdown --network=<name> --bootstrap=1
- Tries sudo -n first; if rejected, uses passworded sudo -S via encrypted DB credentials
rex nodes_in_network
- Show nodes grouped by network with reachability
rex networks_for_node
- Show networks per machine with reachability
rex obtain_mac
- Discover and persist MAC addresses into CMDB YAMLs
rex show_keys
- List encrypted key metadata or show one machine key
rex show_credentials
- List credential metadata or show one machine username (password always redacted)

Example: 3 Fake Machines End-to-End

This example uses fake hostnames, fake IP addresses, and fake credentials.

A) Provide input files

system_db_setup/users_db.csv:

Hostname, Username, Password node-alpha,clusteradmin,FakeP@ssw0rd-A1 node-beta,clusteradmin,FakeP@ssw0rd-B2 node-gamma,clusteradmin,FakeP@ssw0rd-C3

system_db_setup/networks_db.csv:

Network, Hostname, IP labnet,node-alpha,10.20.30.11 labnet,node-beta,10.20.30.12 labnet,node-gamma,10.20.30.13

B) Build inventory and secret DB

perl system_db_setup/generate_cmdb.pl openssl rand -base64 32 > createdatabase/.cluster_db.keyfile chmod 400 createdatabase/.cluster_db.keyfile perl createdatabase/init_cluster_db.pl
--keyfile createdatabase/.cluster_db.keyfile
--users-csv system_db_setup/users_db.csv

Note: .pub files in createdatabase/pubkeys/ are imported automatically. Pass --import <dir> only to override that default directory.

C) First-time connection checks

cd cluster_ssh_setup rex check_network --network=labnet

If nodes are reachable, continue with bootstrap key deployment.

D) First-time key distribution (bootstrap)

CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)"
rex setup_ssh_keys --network=labnet --bootstrap=1

E) Validate mesh connectivity

CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)"
rex validate_ssh_mesh --network=labnet

F) Verify ongoing state

rex show_known_hosts --network=labnet rex rm_dup_hosts --network=labnet rex nodes_in_network --network=labnet

Adding New Nodes

There are two paths to add a new node to an existing cluster, depending on whether the node's SSH public key is already known.

Path 1: CSV-driven (most common)

Use this when you have SSH access to the new node (via password) but its cluster keypair has not been generated yet. The setup_ssh_keys task will generate the keypair on the node and capture its passphrase.

1) Add rows to the CSV files

Append the new node to system_db_setup/users_db.csv:

node-delta,clusteradmin,FakeP@ssw0rd-D4

Append one row per network to system_db_setup/networks_db.csv:

labnet,node-delta,10.20.30.14

2) Regenerate CMDB YAMLs

perl system_db_setup/generate_cmdb.pl

This creates cmdb/node-delta.yml with its network IPs. Existing YAML files are regenerated with current CSV data; any previously stored MAC addresses are preserved.

3) Import the new credential into the encrypted DB

perl createdatabase/init_cluster_db.pl \
  --keyfile createdatabase/.cluster_db.keyfile \
  --users-csv system_db_setup/users_db.csv

The DB uses upsert logic — existing credentials are updated (with rotated_at timestamp), new ones are inserted. No data is lost.

4) Run setup (no --rekey needed)

cd cluster_ssh_setup
CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)" \
rex setup_ssh_keys --network=labnet --bootstrap=1

This is fully incremental:

Phase 5 generates a keypair only on node-delta (existing nodes already have keys and are skipped)
Phase 5 collects public keys from all reachable nodes (new and existing)
Phase 6 adds node-delta's key to every existing node's authorized_keys, and adds all existing keys to node-delta's authorized_keys; keys already present are not duplicated
Phases 7–9 (known_hosts, ssh/config, /etc/hosts) are all idempotent — they skip entries that already exist

To target only the new node (avoids probing all nodes):

CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)" \
rex setup_ssh_keys --network=labnet --bootstrap=1 --machine=node-delta

Note: with --machine, only node-delta receives peer keys. After this, run the full command (without --machine) once so that existing nodes pick up node-delta's key in their authorized_keys and known_hosts.

Path 2: Pre-staged public key via `pubkeys/` (two-stage)

Use this when you already have the node's SSH public key (e.g. exported from another system, or copied manually) and want to import it into the encrypted DB before running bootstrap.

The createdatabase/pubkeys/ directory

This directory is a staging area for SSH public key files. Each file must:

Be named <machine-name>.pub (e.g. node-delta.pub)
Contain a standard SSH public key line: <key-type> <base64-key> [comment]

The directory is empty by default. Place .pub files here before running init_cluster_db.pl; they are imported automatically.

1) Place the public key file

cp /path/to/node-delta-key.pub createdatabase/pubkeys/node-delta.pub

2) Add rows to the CSV files (same as Path 1, steps 1–2)

# Append to users_db.csv and networks_db.csv, then:
perl system_db_setup/generate_cmdb.pl

3) Import everything into the encrypted DB

perl createdatabase/init_cluster_db.pl \
  --keyfile createdatabase/.cluster_db.keyfile \
  --users-csv system_db_setup/users_db.csv

This imports the .pub file from pubkeys/ (storing the key encrypted with its fingerprint) and the credential from the CSV, in a single run.

To import .pub files from a different directory, pass --import <dir>:

perl createdatabase/init_cluster_db.pl \
  --keyfile createdatabase/.cluster_db.keyfile \
  --import /path/to/other/pubkeys

4) Run setup (same as Path 1, step 4)

The key imported via pubkeys/ is stored in the DB for auditing and display (rex show_keys). The setup_ssh_keys task still generates the on-node keypair during bootstrap — the pre-imported key does not replace that step. Pre-staging is useful when you want key metadata in the DB before the node is reachable.

When to use `--rekey=1`

The --rekey flag is not needed for adding new nodes. It is only needed when:

A node's keypair was generated before the encrypted DB was available, so its passphrase was never captured
You want to force key rotation on existing nodes (generates a fresh keypair and captures the new passphrase)

# Rekey a single node:
CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)" \
rex setup_ssh_keys --network=labnet --bootstrap=1 --rekey=1 --machine=node-beta

# Rekey all nodes:
CLUSTER_DB_KEY="$(cat ../createdatabase/.cluster_db.keyfile)" \
rex setup_ssh_keys --network=labnet --bootstrap=1 --rekey=1

--rekey=1 removes the existing keypair and passphrase store files on the target node(s), then regenerates everything from scratch.

Operational Notes

Keep createdatabase/.cluster_db.keyfile off version control and in a secured, backed-up location
Rotate credentials by updating users_db.csv and re-running init_cluster_db.pl
Re-bootstrap only when needed (new nodes, key rotation, trust reset)
For strict environments, ensure sudoers is configured as described in the prerequisites section
The keyfile can also be passed via CLUSTER_DB_KEY env var instead of the keyfile path

Blast Radius Analysis

This section analyses what an attacker gains — and what remains protected — if a single cluster node is compromised. Two scenarios are considered: compromise of a non-admin (worker) node, and compromise of the admin node.

Definitions

Term	Meaning
Admin node	The machine running Rex tasks. Holds the encrypted database, the master keyfile, and the admin keypair.
Worker node	Any non-admin node in the cluster . Holds only its own keypair and the local passphrase store.
Mesh credential	The credentials (username + password) for each node, stored encrypted in `cluster_keys.db` on the admin.
Node-local passphrase store	Three files on each worker: `.cluster_node_keyfile` (0400), `.cluster_node_pp` (0400), `load_cluster_key.pl` (0500). Together they allow the node to decrypt its own SSH key passphrase.

Scenario 1: Compromised Non-Admin (Worker) Node

Attacker gains access to one worker node (e.g. via local privilege escalation, physical access, or an application-level exploit).

What the attacker obtains

Asset	Location	Protection	Attacker access
Node's SSH private key	`~/.ssh/id_ecdsa_p521_cluster`	0400, passphrase-protected	Has the file; can extract passphrase from local store (see below)
Node-local passphrase store	`~/.ssh/.cluster_node_keyfile` + `~/.ssh/.cluster_node_pp`	0400, AES-256-CBC PBKDF2-SHA256 600k iter	Both files are on the same node — attacker can decrypt the passphrase using `openssl enc -d` with the keyfile
Node's `authorized_keys`	`~/.ssh/authorized_keys`	0600	Public keys of all peers + admin (public key material only — no secrets)
Node's `known_hosts`	`~/.ssh/known_hosts`	0600	Host key fingerprints for all cluster nodes
Node's `~/.ssh/config`	`~/.ssh/config`	0600	Cluster `IdentityFile` and `StrictHostKeyChecking` settings

Blast radius from a compromised worker

Impact	Scope	Details
Full SSH access as cluster user to every other node	All nodes on the same network	The compromised node's key is in every peer's `authorized_keys`. The attacker can unlock the private key using the local passphrase store and reach any peer node — this is the primary blast radius.
SSH access to the admin node	Admin node	The compromised node's key is also in the admin's `authorized_keys`. The attacker reaches the admin as the cluster user.
known_hosts intelligence	Informational	IP addresses and host key fingerprints for all cluster nodes are exposed, aiding lateral movement.
Network knowledge	Informational	`~/.ssh/config` and `/etc/hosts` entries reveal all cluster node aliases and IPs.

What the attacker does NOT obtain from a worker

Asset	Why it is safe
Other nodes' private keys	Private keys are generated on-device and never leave their origin node.
Other nodes' passphrases	Each node's passphrase store uses a unique keyfile; compromising one node's keyfile cannot decrypt another's.
Encrypted database (`cluster_keys.db`)	Exists only on the admin node.
Master keyfile (`.cluster_db.keyfile`)	Exists only on the admin node.
Plaintext passwords	Passwords are never stored on worker nodes. They are used only during bootstrap (via SSH_ASKPASS tempfile) and are not persisted.
Admin's private key	Only the admin's public key is deployed to workers (via `ssh-copy-id`). The private key stays on the admin.

Mitigation and containment

Revoke the compromised node's key from all peers: Run remove_ssh_keys with --machine=<compromised-node>, then re-run setup_ssh_keys --rekey=1 --machine=<compromised-node> after the node is re-imaged or remediated.
Rotate all peer keys as a precaution: If the attacker had time for lateral movement, rekey all nodes with setup_ssh_keys --rekey=1 after forensic assessment.
Audit logs/audit.log: Check for unexpected setup_ssh_keys or remove_ssh_keys activity.
Network-level isolation: The blast radius is limited to nodes on the same network (e.g. WirA). An attacker on a WirA-only node cannot reach WirB-only nodes unless the compromised user account has keys for both networks.

Scenario 2: Compromised Admin Node

Attacker gains access to the admin node — the single point of control for the cluster.

What the attacker obtains

Asset	Location	Protection	Attacker access
Admin's SSH private key	`~/.ssh/id_ecdsa_p521_cluster`	0400, passphrase-protected	Has the file; passphrase is in the encrypted DB (see next row)
Encrypted database	`createdatabase/cluster_keys.db`	AES-256-GCM, PBKDF2-SHA256 600k iter	Accessible if the attacker also obtains the keyfile
Master keyfile	`createdatabase/.cluster_db.keyfile`	0400	On the same machine — attacker can read it with the cluster user's privileges
All passwords (decryptable)	In `cluster_keys.db`	AES-256-GCM	With keyfile + DB, the attacker decrypts every node's username and password
All SSH key passphrases (decryptable)	In `cluster_keys.db`	AES-256-GCM	With keyfile + DB, the attacker decrypts every node's SSH key passphrase — even for nodes the attacker has never touched
All public keys + fingerprints	In `cluster_keys.db`	AES-256-GCM	Full inventory of all cluster SSH public keys
Admin's `authorized_keys`	`~/.ssh/authorized_keys`	0600	Public keys of all worker nodes
Admin's `known_hosts`	`~/.ssh/known_hosts`	0600	Host key fingerprints for all cluster nodes
CMDB YAML files	`cmdb/*.yml`	Plaintext	All node IPs and MAC addresses across all networks
Audit log	`logs/audit.log`	0600	Full history of all mutating operations — reveals cluster topology and key rotation history
Source code	Repository root	Plaintext	The Rexfile, helpers, and DB module — reveals security mechanisms and aids targeted attacks

Blast radius from a compromised admin

Impact	Scope	Details
Full SSH access to every node on every network	All nodes, all networks	The admin's key is in every node's `authorized_keys`. The attacker can SSH to any node as the cluster user without needing passwords.
Password-based SSH access to every node	All nodes, all networks	Decrypted passwords enable `sshpass`-based access even if key-based auth is revoked.
Impersonation of any node	Any node-to-node path	The attacker has every node's SSH key passphrase. While the private keys are on the nodes (not the admin), the attacker can SSH to a node using the admin key, decrypt the node's passphrase store, and use the node's identity for further lateral movement.
Complete cluster topology disclosure	Informational	CMDB YAMLs reveal all IPs, networks, and MAC addresses. The audit log reveals the full operational history.
Ability to re-bootstrap the entire cluster	Destructive	The attacker can run `remove_ssh_keys` + `setup_ssh_keys` to rotate all keys to attacker-controlled values, permanently locking out the legitimate admin.
Encrypted DB exfiltration	Persistent	The keyfile + DB together enable offline decryption of all secrets even after the admin is recovered. The attacker retains access to passwords and passphrases indefinitely unless they are all rotated.

What the attacker does NOT obtain directly from the admin

Asset	Why
Worker nodes' private keys	Generated on-device, never transmitted to the admin. However, the attacker can reach each node via SSH and extract them indirectly.

Mitigation and containment

Rotate everything: All node passwords, all SSH keypairs (--rekey=1), the master keyfile, and the database must be regenerated from a trusted machine.
Re-image the admin node: Assume rootkit persistence on the compromised admin.
Rotate node-local passphrase stores: After rekeying, the old passphrase stores on worker nodes are replaced. However, if the attacker accessed worker nodes during the breach window, those nodes should also be re-imaged.
Change system-level passwords: The passwords in users_db.csv are typically the OS login passwords for the cluster user. These must be changed on every node.
Review audit logs: Check logs/audit.log and system auth logs on all nodes for unauthorized access during the breach window.

Summary: Comparative Blast Radius

Dimension	Worker compromise	Admin compromise
Nodes reachable via SSH	All peers on the same network + admin	All nodes on all networks
Passwords exposed	None	All (decryptable from DB)
Key passphrases exposed	Only the compromised node's	All (decryptable from DB)
Other nodes' private keys	Not directly accessible	Not directly — but reachable via SSH to each node
Lateral movement difficulty	Immediate (key is pre-authorized)	Immediate (key is pre-authorized everywhere)
Recovery effort	Revoke one key, rekey one node	Full cluster re-bootstrap, password rotation, potential re-imaging
Fundamental limitation	Inherent to full-mesh SSH — any authorized key reaches all peers	Single point of trust — DB + keyfile co-located on one machine

Architectural Observations

Full mesh = full blast radius per network: By design, every node's key is authorized on every other node. Compromising any one node grants immediate SSH access to all peers. This is the accepted trade-off of a mesh topology. The blast radius is bounded by network — a node with only WirA access cannot reach WirB-only nodes.
The node-local passphrase store is security theater against root: The .cluster_node_keyfile and .cluster_node_pp files are both owned by the same user on the same machine. A root-level attacker trivially decrypts the passphrase. The store's purpose is convenience (autonomous key loading) — not defense against local privilege escalation.
Admin node is a single point of catastrophic failure: The encrypted database, keyfile, credentials, and Rex automation are all co-located. Compromising the admin is equivalent to compromising the entire cluster. Mitigations include: hardware security modules (HSM) for the keyfile, MFA for admin access, network-level isolation of the admin node, and offline/encrypted backups of the keyfile separate from the database.
Passwords persist after bootstrap: Even after key-based SSH is established, the passwords remain in the encrypted DB. If the admin is compromised, these passwords enable password-based access even after all SSH keys are revoked. Consider deleting credentials from the DB after successful bootstrap ($db->delete_credential(machine => $name)) and relying exclusively on key-based auth.

TODO

Security Improvements

[HIGH] Hardware security keys: Deploy a FIDO2/PIV hardware token (e.g. YubiKey 5 series) for the admin identity. This has two distinct roles in this codebase:

Role 1 — Protect the master keyfile (replaces .cluster_db.keyfile on disk) Currently the keyfile is a 0400 plaintext file on the admin filesystem. Any process running as the cluster user can read it, and an attacker who compromises the admin gains permanent offline decryption capability for cluster_keys.db (all passwords, all key passphrases). Replacing the keyfile with a hardware-backed secret eliminates this: the decryption key is stored in the token's secure element, operations are performed on-chip (private key never exported), and the token requires a PIN and/or physical touch to authorize each use. Implementation: use the token's PIV slot (ykman piv or openssl with a PKCS#11 engine via p11-kit) to wrap the DB encryption key. Add a --keyfile-cmd option to ClusterSSHHelpers::init() that executes an external command instead of reading a plaintext file. This directly implements the [HIGH] Separate keyfile from encrypted DB item in the Blast-Radius Reduction section.

Role 2 — SSH authentication with hardware-resident keys (admin identity) Replace or supplement ~/.ssh/id_ecdsa_p521_cluster on the admin with a FIDO2 resident key (ssh-keygen -t ecdsa-sk -O resident). The private key half lives in the token; login requires both possession of the token and a PIN/touch gesture. A compromised admin filesystem yields only the key handle, which is useless without the physical token. OpenSSH >= 8.2 supports this natively; no agent changes are needed beyond ssh-add -K to load resident keys.

What this accomplishes

Property	Without hardware key	With hardware key
Admin filesystem compromise → DB decryptable offline	Yes	No — token required for each decrypt
Stolen admin SSH private key → cluster access	Yes — passphrase is in the DB on the same disk	No — key half is on token; PIN/touch required
Unattended `rex` invocation	Yes — ssh-agent + plaintext keyfile	Requires token present + PIN; intentional friction
Audit trail for key use	Software log only	Hardware attestation certificate + software log

How it extends the current setup The existing ensure_agent_loaded and bootstrap_via_askpass helpers already isolate credentials via SSH_ASKPASS tempfiles ([SEC-CRED-ENCRYPT]). Hardware keys slot in at the admin-identity layer only — no changes to node-side key management, bootstrap SSH flows, or the configure_sudoers_scope/setup_ssh_keys task structure. The --keyfile-cmd option is a drop-in replacement for the $cluster_db_keyfile path in ClusterSSHHelpers::init().

Blast-radius reduction

Eliminates the offline DB decryption attack (the catastrophic path in Scenario 2 of the Blast Radius Analysis): attacker has the ciphertext but cannot decrypt without the physical token.
Eliminates stolen admin key → immediate cluster access without also stealing the physical token.
Bounds admin compromise to the window of active token presence rather than making it permanent.
Does not reduce the blast radius of a worker compromise — worker keys are software keys; that is a separate per-network key isolation problem.

How it enables other TODO items

[HIGH] Separate keyfile from encrypted DB (Blast-Radius Reduction section): This IS the implementation of that item. Hardware token = keyfile off the admin filesystem with on-chip private key.
[HIGH] Purge passwords from DB after bootstrap: Hardware-key assurance makes exclusive key-based auth credible; once the DB is unreachable without the token, the residual risk of leaving passwords in the DB is bounded and the incentive to purge them increases.
[HIGH] Disable password auth cluster-wide: Confidence that the admin identity is hardware-bound makes it safe to remove the password fallback path entirely — the only remaining bootstrap path is the explicit, audited configure_sudoers_scope flow.
[MEDIUM] Intrusion-detection tripwires: The token's attestation certificate (ykman piv info) provides a cryptographic root-of-trust anchor for key manifests. Without hardware attestation the manifest itself can be forged by an attacker with filesystem access.
[LOW] Enforce key expiry: The token serial number and certificate notAfter field give a hardware-anchored expiry signal that is harder to tamper with than a created_at timestamp in the DB.
[LOW] Zero sensitive strings in memory: Hardware keys remove one of the two high-value secrets held in memory (the DB decryption key); the admin SSH key passphrase also moves off-memory (on-chip). Node-side passphrases stored in the DB remain the primary residual concern for in-memory zeroing.

[MEDIUM] Comma-in-password CSV limitation: users_db.csv is parsed with split /\s*,\s*/, $_, 3. Passwords containing commas are silently truncated. Switch to Text::CSV (or Text::CSV_XS) to support RFC 4180 quoting in both generate_cmdb.pl and init_cluster_db.pl.
[LOW] Add fingerprint pre-verification for bootstrap keyscan: Provide an optional --known-fingerprints=<file> flag to setup_ssh_keys. If provided, compare ssh-keyscan output fingerprints against the file before installing into known_hosts. Document bootstrap TOFU risk and recommend running only on isolated networks.
[LOW] Add audit log entry for skipped nodes in setup_ssh_keys phase 5: Replace bare or next with a small block that calls audit_log('setup_node_skipped', machine => $node->{machine}, reason => 'cmd_failed') so partially-bootstrapped nodes are visible in the audit trail.
[LOW] Zero sensitive strings in memory: In ClusterDB::new and get_passphrase, consider overwriting passphrase variables before scope exit (e.g., substr($pass, 0) = "\0" x length($pass)). Relevant only for high-assurance environments; Perl does not guarantee memory zeroing on deallocation.
Make $ssh_config_block a module-level constant: The SSH config block string is defined inside setup_ssh_keys and then duplicated in the removal regex inside remove_ssh_keys. Define it once at the top of the file as a constant; use it in both tasks (and simplify the removal regex to match the constant).
Extract _atomic_write($path, $content_or_lines, $mode): The pattern of (1) write to File::Temp, (2) chmod the temp file, (3) rename to target, (4) handle rename failure by unlinking temp, appears in 8+ places across the Rexfile. A helper would reduce each site to one call.
Replace inline remote Perl one-liners with heredoc temp scripts: The Perl one-liners embedded as shell-quoted strings in remove_ssh_keys phases 3, 6, and 7 are hard to read and impossible to test in isolation. SCP a temp Perl script to the remote node and execute it with perl <script>, then delete the script. This makes the logic readable and auditable.
Consolidate CSV parsing into a shared utility: Both generate_cmdb.pl and init_cluster_db.pl parse CSV files with nearly identical split /\s*,\s*/, $_, 3 loops. Switching to Text::CSV with a shared helper would eliminate duplication and fix the comma-in-password edge case in both places simultaneously.

Blast-Radius Reduction (derived from Blast Radius Analysis)

[HIGH] Separate keyfile from encrypted DB — break admin single-point-of-failure: Move .cluster_db.keyfile off the admin node entirely (e.g. USB hardware token, HSM, or a separate vault host). The admin should fetch the key at task-run time via a challenge (PIN, MFA) and hold it only in memory. This prevents an attacker who compromises the admin filesystem from decrypting the DB offline. Implementation: add a --keyfile-cmd option to ClusterSSHHelpers::init() that executes an external command (e.g. gpg --decrypt keyfile.gpg) instead of reading a plaintext file.
[HIGH] Purge passwords from DB after successful bootstrap: Once key-based SSH is verified (via validate_ssh_mesh), delete plaintext credentials from cluster_keys.db ($db->delete_credential(machine => $name)). Passwords remaining in the DB after bootstrap serve no operational purpose but become a catastrophic asset if the admin is compromised. Add a --purge-credentials flag to validate_ssh_mesh that deletes credentials for all nodes that pass both key-based and password-based validation, with audit logging.
[HIGH] Disable password authentication cluster-wide after bootstrap: After successful key distribution, deploy PasswordAuthentication no to each node's /etc/ssh/sshd_config (or a drop-in file) and reload sshd. This eliminates password-based lateral movement even if the admin DB is exfiltrated. Add a new Rex task harden_sshd that deploys the configuration and verifies key-based access still works before committing. Requires a sudoers entry for sshd reload.
[MEDIUM] Restrict admin authorized_keys to worker-to-admin traffic only: Currently every worker node's key is authorized on the admin with no restrictions. Add from="<ip-list>" options to each key in the admin's authorized_keys, limiting each worker key to connections originating from that worker's known IPs. This prevents a stolen worker key from being used from an arbitrary host to reach the admin.
[MEDIUM] Add command= restrictions to cross-node authorized_keys: Not all nodes need arbitrary shell access to each other. For nodes that only need to perform specific operations (e.g. file transfer, job submission), prepend command="/usr/bin/rsync --server ..." or command="/usr/local/bin/allowed-cmd" to the relevant authorized_keys entries. This limits what an attacker can do even after obtaining a peer's key.
[MEDIUM] Implement per-network key isolation: Currently setup_ssh_keys deploys the same keypair identity across all networks. Generate a separate keypair per network per node (e.g. id_ecdsa_p521_cluster_WirA, id_ecdsa_p521_cluster_WirB). Cross-authorize only within the same network. This ensures that compromising a node on WirA does not grant access to WirB-only nodes even if the same user account is used, hardening the network boundary observed in the blast radius analysis.
[MEDIUM] Add intrusion-detection tripwires for key material: Deploy a cron job or systemd timer on each node that periodically checks: (1) ~/.ssh/authorized_keys has not been modified outside of Rex (compare SHA-256 against a signed manifest), (2) ~/.ssh/id_ecdsa_p521_cluster.pub fingerprint matches the encrypted DB record, (3) ~/.ssh/config has not been tampered with. Report anomalies via syslog or a webhook. Add a Rex task deploy_tripwire that installs and configures the monitor.
[MEDIUM] Rate-limit and alert on failed SSH attempts cluster-wide: Configure fail2ban or equivalent on every node to jail IPs after repeated SSH failures. Add a Rex task deploy_fail2ban that installs the package (via apt, DISA-STIG-compliant), deploys a cluster-specific jail config targeting the cluster user, and verifies the service is active. This slows brute-force lateral movement from a compromised node.
[LOW] Enforce key expiry and automated rotation: Add a --max-key-age=<days> parameter to validate_ssh_mesh. For each node whose key is older than the threshold (tracked via created_at in the DB), emit a warning or automatically trigger --rekey=1. This limits the window of exposure if a key is silently compromised.
[LOW] Encrypt CMDB YAML files at rest: CMDB YAMLs contain IP addresses and MAC addresses — network topology that aids targeted attacks. Encrypt them with the same AES-256-GCM scheme used by ClusterDB, decrypting only at Rex runtime. Alternatively, move network data into cluster_keys.db alongside credentials so all sensitive data shares a single encryption boundary.
[LOW] Add remote audit log forwarding: Currently logs/audit.log exists only on the admin node. An attacker who compromises the admin can tamper with or delete it. Forward audit entries to a remote syslog server or append-only log service in real-time so that a tamper-evident copy survives admin compromise.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
cluster_ssh_setup		cluster_ssh_setup
createdatabase		createdatabase
example_hpc_deployments		example_hpc_deployments
system_db_setup		system_db_setup
.gitignore		.gitignore
ClusterSSHHelpers.pm		ClusterSSHHelpers.pm
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Cluster Management Repository

What the Code Does

system_db_setup/generate_cmdb.pl

createdatabase/init_cluster_db.pl

createdatabase/ClusterDB.pm

ClusterSSHHelpers.pm (repo root)

cluster_ssh_setup/Rexfile

setup_ssh_keys — 9-phase workflow

remove_ssh_keys — 8-phase teardown

Repository Layout

Security Model

Security Practice Labels

Key assumptions

sudoers prerequisites for hardened systems

Known Security Vulnerabilities and Pitfalls

MEDIUM

LOW

Data You Must Provide

Full Setup Flow

Rex Tasks

Example: 3 Fake Machines End-to-End

A) Provide input files

B) Build inventory and secret DB

C) First-time connection checks

D) First-time key distribution (bootstrap)

E) Validate mesh connectivity

F) Verify ongoing state

Adding New Nodes

Path 1: CSV-driven (most common)

Path 2: Pre-staged public key via pubkeys/ (two-stage)

When to use --rekey=1

Operational Notes

Blast Radius Analysis

Definitions

Scenario 1: Compromised Non-Admin (Worker) Node

What the attacker obtains

Blast radius from a compromised worker

What the attacker does NOT obtain from a worker

Mitigation and containment

Scenario 2: Compromised Admin Node

What the attacker obtains

Blast radius from a compromised admin

What the attacker does NOT obtain directly from the admin

Mitigation and containment

Summary: Comparative Blast Radius

Architectural Observations

TODO

Security Improvements

Blast-Radius Reduction (derived from Blast Radius Analysis)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Path 2: Pre-staged public key via `pubkeys/` (two-stage)

When to use `--rekey=1`

Packages