12 changes: 8 additions & 4 deletions BACKLOG.md
@@ -3,7 +3,7 @@
> **Purpose**: Single source of truth for **pending** work in the OpenWatch Go rebuild (repo root).
> Completed work is removed from this file; provenance lives in the commit history + `SESSION_LOG.md`.

**Last Updated**: 2026-06-16
**Last Updated**: 2026-06-20
**Active Tree**: repo root (Go backend `cmd/`+`internal/`, React/TypeScript `frontend/`)
**Frozen Tree**: the legacy Python/FastAPI backend was archived out of the repo on 2026-06-05 to `~/hanalyx/OWAR/openwatch-python/` (see CLAUDE.md)

@@ -31,9 +31,15 @@
| Users tab | P1 | Partial | Same shape — reads `intelligenceStateQuery.data.users` |
| Audit log tab | P2 | Stub | Needs host-scoped `audit_events` API hook |
| Activity tab | P1 | Stub | **Where "View all" on the Recent activity card lands today.** Needs full-feed renderer with cursor pagination + source/severity filters on the unified `/api/v1/activity?host_id=X` endpoint |
| Remediation tab | P2 | Not started (scoping required) | Host-mutating fixes (apply + rollback). The last scan-plan piece; plan + the five decisions in `docs/engineering/scan_remaining_work.md` |
| Terminal tab | P3 | Stub | Browser-based SSH terminal. Web terminal lib + SSH-WS bridge needed |

> **Remediation tab — shipped (v0.2.0-rc.11).** Free-core single-rule apply +
> rollback from the host Remediation tab; concurrent fixes serialize per host;
> status updates live over SSE; free-core requests auto-approve. Landed via
> #601 (execute/rollback + governance), #606 (conditional approval, "A-keep"),
> #607 (serialize + live status). Licensed bulk/automated remediation remains a
> follow-on track (see `docs/engineering/scan_remaining_work.md`).
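
The auto-approve behaviour noted above can be illustrated with a minimal Go sketch. It is not the real `remediation.Service.Request()` code: the `Request` struct, the `newRequest` helper, and the review-note text are hypothetical, and the event list for the licensed path is an assumption. Only the status names (`approved`, `pending_approval`), the NULL `reviewed_by`, and the two free-core events come from the session log.

```go
package main

import "fmt"

// Status values as named in the session log; the type itself is hypothetical.
type Status string

const (
	StatusApproved        Status = "approved"
	StatusPendingApproval Status = "pending_approval"
)

// Request is a hypothetical, trimmed-down remediation request row.
type Request struct {
	HostID     string
	RuleID     string
	Status     Status
	ReviewedBy *string // nil models a NULL reviewed_by on auto-approval
	ReviewNote string
}

// newRequest picks the initial status and the events to emit.
// requiresApproval=false models the free-core single-rule path.
func newRequest(hostID, ruleID string, requiresApproval bool) (Request, []string) {
	r := Request{HostID: hostID, RuleID: ruleID}
	if requiresApproval {
		// Licensed bulk/auto track: wait for a reviewer.
		// Emitting only "requested" here is an assumption.
		r.Status = StatusPendingApproval
		return r, []string{"remediation.requested"}
	}
	// Free-core: insert as approved with no reviewer, but record a note.
	r.Status = StatusApproved
	r.ReviewNote = "auto-approved: free-core single-rule request" // hypothetical wording
	return r, []string{"remediation.requested", "remediation.approved"}
}

func main() {
	free, events := newRequest("host-1", "rule-1", false)
	fmt.Println(free.Status, events)

	licensed, events := newRequest("host-1", "rule-1", true)
	fmt.Println(licensed.Status, events)
}
```

Deciding the initial status at insert time, rather than approving in a second step, is what avoids a lone operator having to review their own request.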

---

## Activity Feed Follow-ups
@@ -132,8 +138,6 @@ Dependabot major bumps closed (skipped) 2026-06-16, with the reason + revisit pa

| Item | Priority | Notes |
|------|----------|-------|
| Raise specter coverage gate to 100% (all tiers) | P2 | `specter.yaml` currently gates `tier1: 100 / tier2: 80 / tier3: 50` under `strictness: threshold`. Goal: tier2 + tier3 → 100. **Not a config flip** — it is gated on first backfilling real AC tests for every currently-sub-100% spec; raising the threshold before the tests exist would red-wall every PR. Plan: run a full `go test -json` + vitest JUnit ingest, read `specter coverage` for the true gaps, then write the missing-AC tests spec by spec (real tests, not annotation-only), then bump the thresholds. |
| CI gate speed: per-package DB isolation to drop `-p 1` | P2 | After PR #567 (single race+json pass + golangci cache, ~23min→~12-14min) the Go test step still runs `-p 1` (serial packages) because DB-touching tests share one Postgres and contaminate each other in parallel. Give each parallel test binary its own namespace (per-package `search_path`/schema, or a uniquely-named DB created in `TestMain`) so the suite can run `-p N`. Highest remaining ceiling (could ~halve test wall-clock); real refactor of every DB-touching package's setup — the exact cross-package contamination already seen under default parallelism. |
| CI gate speed: split the monolith into parallel jobs | P3 | The `Quality + security gates` job runs lint/vuln/test/frontend sequentially (wall-clock = sum). Split into concurrent jobs (lint+vet+vuln vs. test+coverage vs. frontend) so wall-clock = max. **Needs a branch-protection change**: only one job is the required `Quality + security gates` check today, so splitting means updating the required-checks list in the GitHub UI (operator action, not code). Prep the workflow, then flip required checks. |

### Flakes
80 changes: 80 additions & 0 deletions SESSION_LOG.md
@@ -6,6 +6,86 @@ and their provenance lives here + in the commit history.

---

## 2026-06-20 — Opus 4.8 (1M context)

**Done** (all merged to `main`; cut + published **v0.2.0-rc.11** "Eyrie"):

- **Free-core single-rule remediation governance + UX.** Three landed PRs on
top of the execute/rollback base (#601):
- **#606 — conditional approval ("A-keep").** `remediation.Service.Request()`
gained a `requiresApproval bool`. Free-core single-rule requests now INSERT
as `approved` (auto-approved, `reviewed_by` NULL, review note recorded) and
emit `remediation.requested` + `remediation.approved`; the licensed
bulk/auto track still inserts `pending_approval`. This removes the
one-operator self-review deadlock (you could request a fix but, under the
self-review block, never approve your own request). ADR:
`docs/engineering/remediation_governance_adr.md`.
- **#607 — per-host serialization + live status.** A second fix on a busy
host no longer fails: the worker pre-checks `HostHasExecuting` and, on
`kensa.ErrHostBusy`, calls `RevertToApproved` + requeues with a 3s backoff
via a new job-queue **delayed-visibility** column (`available_at`, migration
`0039`) and `EnqueueAfter`. The Remediation tab + compliance score now
refresh live over a new `remediation.completed` SSE topic (no manual
refresh). New ACs: `system-job-queue/AC-13`, `api-remediation/AC-08`,
`frontend-live-events/AC-09` (+ AC-01 → 6 topics).
- **#605 — 401 (not 403) for anonymous callers.** `denyPermission` branches
on `id.IsAnonymous` → 401 `auth.required` (+ `WWW-Authenticate: Bearer`);
authenticated-but-unauthorized stays 403 `authz.permission_denied`. The SPA
redirects to login on 401, so an **expired session** now surfaces as a clean
re-login instead of the "Failed to load remediation requests" dead-end the
user hit live. Reframed `system-rbac/AC-09`; new error code `auth.required`.
- **#604 — governance docs + RBAC drift-lock.** Remediation-approval ADR +
role matrix (who can request vs approve a fix/exception, self-review rule);
new `system-rbac` spec (C-08/AC-17 governance matrix) with
`TestGovernanceRoleMatrix` so the role/permission map can't silently drift.
- **#608/#609 — Kensa v0.5.2 + tag rc.11.** Bumped the bundled engine v0.5.1 →
**v0.5.2** (PATCH; frozen `api/`, 539 rules). v0.5.2 fixes a `config_value`
delimiter bug so `" "` matches any whitespace incl. TAB — corrects false FAILs
on TAB-delimited rules (RHEL `login.defs`). **Verified live**: rebuilt the dev
instance on v0.5.2 + repointed `OPENWATCH_KENSA_RULES_DIR` to the v0.5.2
corpus; after a re-scan, `login.defs` flipped FAIL → pass.
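
The busy-host requeue described for #607 can be sketched roughly as follows. This is an in-memory illustration only, not the project's job queue: the `job` and `queue` types, the `handle` function, and the `busy` map are hypothetical stand-ins for the Postgres-backed queue, the worker, and the `HostHasExecuting` pre-check. Only the `available_at` idea, the `EnqueueAfter` name, and the 3s backoff come from the log entry above.

```go
package main

import (
	"fmt"
	"time"
)

// job is a hypothetical queue row; AvailableAt mirrors the available_at column.
type job struct {
	RequestID   string
	HostID      string
	AvailableAt time.Time
}

// queue is an in-memory stand-in for the database-backed job queue.
type queue struct{ jobs []job }

// EnqueueAfter stores the job so it becomes visible only once delay has elapsed.
func (q *queue) EnqueueAfter(j job, now time.Time, delay time.Duration) {
	j.AvailableAt = now.Add(delay)
	q.jobs = append(q.jobs, j)
}

// Dequeue removes and returns the first job whose AvailableAt is not in the future.
func (q *queue) Dequeue(now time.Time) (job, bool) {
	for i, j := range q.jobs {
		if !j.AvailableAt.After(now) {
			q.jobs = append(q.jobs[:i], q.jobs[i+1:]...)
			return j, true
		}
	}
	return job{}, false
}

// busyBackoff is the 3s requeue delay mentioned in the log.
const busyBackoff = 3 * time.Second

// handle starts a job unless its host already has a fix executing,
// in which case the job is requeued with a delay instead of failing.
func handle(q *queue, busy map[string]bool, j job, now time.Time) string {
	if busy[j.HostID] {
		q.EnqueueAfter(j, now, busyBackoff)
		return "requeued"
	}
	busy[j.HostID] = true
	return "executing"
}

func main() {
	now := time.Date(2026, 6, 20, 12, 0, 0, 0, time.UTC)
	q := &queue{}
	busy := map[string]bool{}

	fmt.Println(handle(q, busy, job{RequestID: "a", HostID: "h1"}, now)) // first fix runs
	fmt.Println(handle(q, busy, job{RequestID: "b", HostID: "h1"}, now)) // second backs off

	_, ok := q.Dequeue(now.Add(2 * time.Second)) // still hidden before the backoff ends
	fmt.Println(ok)

	delete(busy, "h1") // first fix finished
	j, ok := q.Dequeue(now.Add(3 * time.Second))
	fmt.Println(j.RequestID, ok)
}
```

Passing `now` explicitly keeps the sketch deterministic; a real worker would presumably compare `available_at` against the database clock in its dequeue query.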

**Release mechanics — bundled to beat the rebase cascade.** `main` branch
protection has `strict = true` (require up-to-date) + a single required check,
so merging 5 changelog-touching PRs one-by-one would force 4 serial ~7-min gate
re-runs. #604 merged alone (no CHANGELOG); the other four were `--no-ff` merged
into one `release/v0.2.0-rc.11` branch, CHANGELOG reconciled into a single
`[0.2.0-rc.11]` section, opened as **#609**, one green gate, squash-merged;
#605–#608 closed as folded. Tagged `v0.2.0-rc.11` → release workflow green
(build + SBOM + sign + publish): signed RPM/DEB (amd64 + arm64) + kensa-rules
0.5.2 + per-artifact CycloneDX SBOMs + `SHA256SUMS.asc`, marked pre-release.
GPG keys (`GPG_PRIVATE_KEY`/`GPG_PASSPHRASE`) are configured, so the F4
fail-closed did not trip.

**Docs swept this session:** CLAUDE.md (Last Updated, remediation row →
Complete, scanning-status note → rc.11, spec count 108 → **110**), BACKLOG.md
(removed the done Remediation-tab, specter-100%-all-tiers, and `-p 1`→`-p 4`
rows), `docs/engineering/scan_remaining_work.md` (Phase 7 first-slice shipped
banner). specter.yaml now gates **all tiers at 100**.

**Next:**

- Phase 5 bulk-scan endpoint (small, no host mutation) and the **licensed**
bulk/auto remediation track remain — see `scan_remaining_work.md`.
- Email alert **dispatch** (the channel CRUD shipped; firing alerts through
channels by type + per-user prefs is the remaining half).
- Stage 3 GA fleet-verification gate still pending per
`docs/runbooks/RELEASING.md` before a non-rc GA tag.

**Notes / gotchas:**

- The dev instance is a **manually-launched** `dist/openwatch serve` (gnome-
terminal, logs to `.dev/openwatch.log`), NOT a systemd service. Restart by
SIGTERM + Python relaunch, sourcing env from `/proc/<pid>/environ` so the DB
password is never printed. A restart left a **stale duplicate serve** once
(old PID lingered past the 5s wait after releasing `:8443`); confirm a single
process via `ss -ltnp :8443` after redeploying.
- The `login.defs` correction is **fleet-wide**: every host re-evaluates on its
next scan (adaptive scheduler or manual), so compliance scores will tick up
across the fleet over the next cycles — expected, noted in the rc.11 changelog.
- Running deployed test build is commit `bd3ddfc1` (rc.11, kensa v0.5.2 + all
three remediation features) — matches `main` content; no redeploy needed.

## 2026-06-16 — Opus 4.8 (1M context)

**Done** (all merged to `main`):
19 changes: 17 additions & 2 deletions docs/engineering/scan_remaining_work.md
@@ -6,12 +6,27 @@
> all shipped in **v0.2.0-rc.6**). This file holds the **forward-looking
> remainder** — the two items that are not yet built.
>
> **Status: 7 of 8 phases complete.** What is left:
> **UPDATE (2026-06-20, v0.2.0-rc.11): Phase 7 first-slice remediation has
> SHIPPED free-core.** Per-rule manual apply + snapshot/rollback from the host
> Remediation tab is built and live: the `remediation` service over the Kensa
> transport, the `remediation_requests` + `remediation_transactions` logbook,
> the worker-driven `approved → executing → executed | rolled_back` lifecycle,
> per-host serialization (busy fixes back off + requeue), and live SSE status
> (`remediation.completed`). Landed via #601 (execute/rollback + governance),
> #606 (conditional approval — free-core single-rule **auto-approves**, "A-keep"
> ADR), #607 (serialize + live status). Specs: `api-remediation`,
> `frontend-remediation-tab`, `system-rbac`. **What remains of Phase 7 is the
> licensed track** — bulk/sequenced and auto/policy-driven remediation — which
> keeps the approval-required lifecycle and the design notes below. The five
> decisions and "likely shape" below remain the reference for that track.
>
> **Status: 7 of 8 phases complete** (Phase 7 first-slice now shipped; licensed
> bulk/auto remediation + the Phase 5 bulk-scan tail remain). What is left:
>
> | Item | Size | Touches live hosts? |
> |------|------|---------------------|
> | Phase 5 tail — bulk scan endpoint | small | no |
> | Phase 7 — remediation | large (own track) | **yes** |
> | Phase 7 — licensed bulk/auto remediation | large (own track) | **yes** |
>
> **GA scope decision (2026-06-05): remediation ships as a BETA feature in the
> GA release.** It is in-scope for GA but explicitly labelled *beta* — surfaced