feat: solo provisioner daemon core #688
Draft
leninmehedy wants to merge 75 commits into
- Poll interval: 30s → 900s (15 min) per HIP implementation-guide
- Event reasons: corrected to HIP-defined names (SoakStarted, SoakCheck, CriterionMet, FleetThresholdReached, DecommissionTriggered, DecommissionCompleted)
- Four concrete soak criteria wired (SoakDuration48h real; 3 stubs with TODO)
- Fleet threshold via flag file /opt/solo/weaver/migration/fleet-threshold-reached
- SoakCheck carries context fields (soak_hours, uploader_backlog_cleared, etc.)
- MigrationMonitorConfig.FleetThresholdPath field added
- NodeID threaded from DaemonConfig into NewMigrationMonitor
- Unit test names updated; UAT steps reference correct event sequence
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…mands (#623) Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…626) Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…637) Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…uring installation Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…ware install workflow (closes #661) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…ation alerting (closes #662)
- New internal/daemon/monitor.go: MonitorRunner interface (Run + Name) and supervisedMonitor() with capped exponential back-off (5s->5m, x2, resets after 60s stable run).
- MonitorDegraded: emits Error log at every supervisedDegradedThreshold (5) consecutive crashes without a stable run; fires again at 10, 15, ... so ops keeps seeing the alert until recovery. Counter resets after a stable run.
- 6 unit tests: restart-after-crash, no-restart-on-cancel, back-off reset, degraded event fired, degraded counter resets after stable run, back-off cap.
- consensus.UpgradeMonitor + MigrationMonitor: Name() methods added.
Wiring into daemon.Run() deferred to S3 (#663).
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Replace the per-monitor errgroup.Go calls with a single componentSupervisor goroutine that runs each MonitorRunner via supervisedMonitor. A crashing monitor now restarts independently with capped back-off without cancelling the shared errgroup context or taking down server.Start.
NewFromConfig now respects ConsensusNodeComponentConfig.Enabled and the per-monitor toggles.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
#664)
Introduce a layered probe architecture so each monitor declares its own startup prerequisites independently:
- probes.Probe: leaf interface defined in probes/ to avoid import cycles
- probes.KubeRBACProbe: verifies K8s RBAC for a group/resource/verb set
- probes.DiskPermissionProbe: checks declared mode bits on a path
- probes.DiskOwnershipProbe: verifies owner user, group, and/or mode bits
- probes.DiskWriteTestProbe: verifies actual process-level write access
- daemon.ComponentProbe: component-boundary interface (CompositeProbe only)
- daemon.CompositeProbe: fans out to leaf probes concurrently; nestable
- daemon.ProbableMonitor: monitors declare prerequisites via RequiredProbe()
- buildComponentProbe: auto-assembles CompositeProbe from enabled monitors; disabling a monitor automatically removes its prerequisites
UpgradeMonitor.RequiredProbe() returns a KubeRBACProbe for networkupgradeexecutes.
StatusTracker records per-monitor state (running/backoff:<dur>/stopped) updated by supervisedMonitor on each transition. GET /status returns per-component/per-monitor state JSON.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…loses #665) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Rename flat daemon flags to per-component prefixed names:
--node-id → --cn-node-id
--orbit → --cn-orbit
--upgrade-dir → --cn-upgrade-dir
Add new flags:
--components comma-separated list of components to enable
(choices: "consensus-node", "block-node"); at least one required
--bn-orbit block-node orbit namespace (wired in S7 #667)
Interactive install first asks which components to enable via confirm
prompts, then presents per-component input fields. Selecting block-node
surfaces a clear "not yet available" error until S7. To add or remove
components after install, uninstall and reinstall with updated --components.
Closes #666
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
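One plausible way to validate the `--components` value is sketched below. The function name `parseComponents` and the error wording are hypothetical; only the two accepted names and the at-least-one rule come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// knownComponents lists the choices named in the commit message.
var knownComponents = map[string]bool{
	"consensus-node": true,
	"block-node":     true,
}

// parseComponents splits a comma-separated --components value, rejects
// unknown names, drops duplicates, and requires at least one entry.
func parseComponents(raw string) ([]string, error) {
	var out []string
	seen := map[string]bool{}
	for _, part := range strings.Split(raw, ",") {
		name := strings.TrimSpace(part)
		if name == "" {
			continue
		}
		if !knownComponents[name] {
			return nil, fmt.Errorf("unknown component %q", name)
		}
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	if len(out) == 0 {
		return nil, errors.New("at least one component is required")
	}
	return out, nil
}

func main() {
	fmt.Println(parseComponents("consensus-node, block-node"))
	fmt.Println(parseComponents(""))
}
```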
Add BlockNodeComponentConfig (Enabled, Kubeconfig, Orbit, Monitors.Upgrade) to DaemonComponents and config_v1.go. Wire a blockNodeUpgradeMonitor stub in NewFromConfig that logs "not yet implemented" and blocks on ctx. Fix kubeconfig preflight in Run() to be nil-safe for BN-only configs. Update DaemonConfig.Validate() to require at least one component (CN or BN). Lift the "not yet available" prompt guard and wire --bn-orbit through RunDaemonInstallPrompts and applyFlagOverrides.
Closes #667
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
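The at-least-one-component rule and the nil-safe kubeconfig preflight might be shaped like this. The struct layout is a simplified assumption (one shared `ComponentConfig` type with pointer fields), and the `enabled` and `kubeconfigs` helpers are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ComponentConfig is a simplified stand-in for the per-component config types.
type ComponentConfig struct {
	Enabled    bool
	Kubeconfig string
	Orbit      string
}

// DaemonComponents uses pointers so an absent component is representable as nil.
type DaemonComponents struct {
	ConsensusNode *ComponentConfig
	BlockNode     *ComponentConfig
}

type DaemonConfig struct {
	Components DaemonComponents
}

// enabled is nil-safe: an absent component counts as disabled.
func enabled(c *ComponentConfig) bool { return c != nil && c.Enabled }

// Validate requires at least one enabled component (CN or BN).
func (d DaemonConfig) Validate() error {
	if !enabled(d.Components.ConsensusNode) && !enabled(d.Components.BlockNode) {
		return errors.New("at least one component (consensus node or block node) must be enabled")
	}
	return nil
}

// kubeconfigs collects the paths to preflight, skipping absent or disabled
// components so a BN-only config never dereferences a nil CN section.
func (d DaemonConfig) kubeconfigs() []string {
	var paths []string
	for _, c := range []*ComponentConfig{d.Components.ConsensusNode, d.Components.BlockNode} {
		if enabled(c) && c.Kubeconfig != "" {
			paths = append(paths, c.Kubeconfig)
		}
	}
	return paths
}

func main() {
	bnOnly := DaemonConfig{Components: DaemonComponents{
		BlockNode: &ComponentConfig{Enabled: true, Kubeconfig: "/etc/solo/bn-kubeconfig"},
	}}
	fmt.Println(bnOnly.Validate(), bnOnly.kubeconfigs())
	fmt.Println(DaemonConfig{}.Validate())
}
```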
Add docs/dev/daemon-architecture.md covering goroutine map, component model, config schema versioning, supervised monitor parameters, HTTP control plane, install workflow steps, and on-disk file paths.
Add docs/dev/daemon-testing-guide.md with 19 step-by-step test cases covering install/uninstall, flag overrides, --from-config, HTTP endpoints, soak idempotency, supervised restart, schema migration, block-node stub, systemd Restart=always, idempotent re-install, and TC-19 (startup probe failure: missing/misconfigured upgrade_dir; operator diagnoses from journalctl and recovers via reinstall). Includes a CN prerequisite section documenting that the upgrade staging directory must exist, be owned hedera:hedera with correct permissions, and that the daemon user must be in the hedera group before install.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…component-scoped routes
Routes renamed from /migration/consensus/... to /<component>/<monitor>/... pattern. ServerOptions replaces flat NewServer params; ComponentHandler interface lets each component register its own route sub-tree without touching NewServer. ConsensusNodeHandler owns all /consensus_node/ routes. ConsensusMigrationStatusResponse (GET /consensus_node/migration/status) returns combined monitor health + soak state.
Fixes pre-existing macOS Unix socket path length failure in server tests.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…ilures
- Add IsConfigMalformed() helper to internal/daemon/errors.go
- Attach ErrPropertyResolution to ErrConfigMalformed when an existing daemon.yaml fails validation; the doctor layer renders Option A/B/C steps
- Attach ErrPropertyResolution to the missing-components error in Case 3 (no daemon.yaml, non-interactive) with concrete --components examples
- Fix BlockNodeMonitors field: Upgrade → TrafficShaper in install.go
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…through wrappers
root.go wraps every cobra error with errorx.IllegalState. Previously this swallowed any ErrPropertyResolution attached to inner errors (e.g. the configMalformedResolution hints on daemon.config.malformed). findResolution now walks the full cause chain via errorx.Cast/Cause until it finds an attached resolution or exhausts the chain.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
… poll interval Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Closes #686 Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…onfig
The daemon binary's --config flag was routed through config.Initialize() which decoded the file as models.Config (log/proxy settings). Operators passing a daemon.yaml via --config got a confusing "invalid keys: components, schema_version" error. --config now sets the daemon config path; initConfig always uses defaults for log/proxy so the CLI config format is never attempted.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
When the upgrade-monitor's K8s watch is stuck in a backoff loop (e.g. RBAC revoked), /status previously reported "state":"running", a false-healthy signal visible only in journalctl.
Changes:
- Add `core.StatusError` (reason, message, resolution, since) used for both connectivity failures and disk-prerequisite probe failures
- Add `core.ConnectivityMonitor` interface; UpgradeMonitor implements it with an atomic connectivityErr that is set on any list/watch error and cleared after a successful listAndSeed (recovery within one cycle)
- `statusSnapshot()` overlays ConnectivityError() onto tracker state so a goroutine in a retry loop shows "degraded" instead of "running"
- Add `probes.TaggedProbe` that wraps any Probe and attaches reason + resolution as errorx properties (ErrPropertyReason, ErrPropertyResolution); UpgradeMonitor.RequiredProbe() wraps each disk probe with TaggedProbe
- `runComponentProbes` builds core.StatusError from errorx properties, preserving the first-failure timestamp across retry rounds
- `CheckDaemonComponentPrerequisites` now surfaces both probe errors and degraded monitors with reason/resolution/since; daemon service check exits non-zero on either class
- Update TC-25 to verify degraded-state output and service check exit code
Closes #687
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Introduce a log/slog seam in the daemon ahead of the pkg/daemonkit extraction (#499). Bump automa-saga/logx to v0.5.0 and install its zerolog-backed slog.Handler via slog.SetDefault in the daemon bootstrap, so slog records reach the same journald + rotating-file sinks logx configures. Convert the 8 daemon-kernel logx call sites (5 in core/monitor.go, 3 in server.go) to slog, preserving the same reason keys, messages, levels, and field names. CLI/workflows stay on logx.
Closes #691
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Convert the 7 fmt.Errorf sites in consensus_migration_client.go to the errorx standard (forbidigo lint). Network calls use ExternalError, request marshal/build use InternalError, response decode uses IllegalFormat, and remote error statuses use ExternalError.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…ed to support auto download Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…nto pkg/daemonkit
Lift the daemon supervision loop and Unix-socket control plane out of internal/daemon/core and internal/daemon into a new in-repo public package pkg/daemonkit, depending only on stdlib, errgroup, and log/slog (no internal/, cmd/, or k8s.io/). The concrete StatusResponse payload and StatusFn impl stay in internal/daemon; the server takes an injected StatusFn func() any. Rewire consensus/blocknode/daemon to import daemonkit; alias probes.Probe to daemonkit.Probe; delete the now-empty core package and moved files.
Closes #692
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…kg/models coupling
Moves the generic probe framework (disk probes, tagged probe) into
pkg/daemonkit and replaces TaggedProbe's errorx-property coupling with a
kit-native ProbeError{Reason,Resolution,Message,Err} carrying the fields
as plain struct values. The shared models.ErrPropertyReason/Resolution
registry is left untouched so doctor-layer extraction across the CLI and
workflows keeps matching. internal/daemon/probes is trimmed to the
k8s-dependent KubeRBACProbe; runComponentProbes reads ProbeError fields
directly via errors.As.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Add a "daemonkit — reusable daemon foundation" section covering what pkg/daemonkit provides (SupervisedMonitor/back-off, StatusTracker/MonitorState/StatusError, the Unix-socket Server, sd_notify, ComponentHandler, and the Probe framework incl. CompositeProbe/TaggedProbe/ProbeError/disk probes), the dependency philosophy (stdlib + errorx + errgroup only; slog as the logging seam bridged to logx; k8s.io/client-go deliberately excluded), the pkg/models property-registry boundary and why TaggedProbe uses ProbeError struct fields, and the planned standalone-module extraction.
Reconcile the rest of the doc with the post-extraction reality: rewrite the package-layout tree, update core.* symbol references to daemonkit.*, correct the TaggedProbe description, distinguish the two import boundaries, and update the testing key-files and command.
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…re notes Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Description
A long-running, self-healing control plane that sits on every node, watches the cluster for operator intent, and executes upgrades and migration soaks autonomously — without ever taking itself down.
Features
Please review the daemon architecture guide: docs/daemon/daemon-architecture.md
Related Issues