STELLAR personal assistant#12904
Conversation
✅ Deploy Preview for kubestellarconsole ready! Built without sensitive environment variables.
To edit notification comments on pull requests, go to your Netlify project configuration.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not yet been approved by anyone. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
👋 Hey @xonas1101 — thanks for opening this PR!
This is an automated message.
Signed-off-by: xonas1101 <aarushsingh1305@gmail.com>
if err != nil {
    return 0, err
}
return int32(parsed), nil
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Stellar v3 — Hive Handoff Spec
0. The Vision
Stellar is the Jarvis of KubeStellar. Not a chatbot, not a dashboard widget. A junior engineer.
The mental model: a 24/7 junior on-call who reads every event, files the routine stuff, restarts the obvious crash loops, and pings the senior (the user) only when it's actually stuck — with a written brief, not a raw stack trace. This spec is about closing the gap between "demo-grade autonomous solve" (what's shipped) and "production-grade junior engineer" (the vision).
1. What's Already Built (state of
| Behavior | Where enforced |
|---|---|
| Critical events auto-solve without manual click | ProcessEvent → autoTriggerSolve |
| Solve status visible on the card itself, not only modal | EventCard progress bar + attempt badge |
| Activity log is first-person ("Stellar tried…") | pkg/api/handlers/stellar_solver.go activity rows |
| Activity log is clickable → matching event modal | StellarActivityPanel.onOpenEvent |
| Toasts on every page, not only Stellar page | StellarProvider in App.tsx |
| User's chosen provider drives Stellar | resolveProviderForUser |
| Stale escalations don't poison new events | 10-min terminal staleness window + cooldown only blocks running |
| Modal narration matches card badge | solveAttemptCount pill + workload narration |
These are the regression boundaries. A Hive worker that breaks any of these has broken the demo.
2. Architectural Invariants (don't break these)
These are not preferences. Each was paid for in pain during the v2 build.
2.1 One SSE connection per tab
StellarProvider is the only consumer. Components subscribe via useStellar(). Adding a second EventSource will reintroduce the "toasts don't fire on /clusters" bug. If a feature seems to need its own stream — extend the existing event envelope kinds instead.
2.2 Critical events never go through autoExecuteAction
That path was for the legacy deterministic ladder and races with autoTriggerSolve. Critical events route only through autoTriggerSolve, which internally decides deterministic-vs-mission. Reintroducing the dual path was the root cause of "every card escalated despite log showing auto-fixed."
2.3 Cooldown blocks only running solves
Terminal solves are history, not locks. A new BackOff event 12 minutes after an escalation should get a fresh solve, not inherit the escalation. TERMINAL_SOLVE_STALENESS_MS enforces the same on the frontend.
2.4 Provider resolution is per-user
STELLAR_DEFAULT_PROVIDER is a fallback, not a ceiling. If a user has set Anthropic in the navbar, Stellar must use Anthropic — even for background observation. The old "global registry default" pattern caused the "why is Stellar using Ollama when I picked Anthropic" bug. Exception: §3.2 introduces a deliberate Ollama path for cheap scanning. That is per-feature, not per-user.
2.5 Workload-aware status everywhere
Two BackOff cards for the same pod must share status. workloadKeyForNotification is the canonical key. New status surfaces (mobile, e-mail digest, anything) must use the same keying or they will lie.
2.6 No raw hex colors, no magic numbers, no .join() on possibly-undefined
See CLAUDE.md. Failing CI on these will waste a Hive cycle.
2.7 Demo mode must work for every new surface
Every endpoint and every card has a demo path. Stellar v3 features (cost dashboard, planner output, memory) must seed plausible demo data so the hosted site (console.kubestellar.io) and offline pitch demos still work.
3. Next-Level Work — Prioritized
The Hive should pick these up in order. Each section is a self-contained issue with goal, why it matters, shipping definition, and suggested wedge (the smallest first PR that proves it works).
3.1 Token economics: hourly batched scans + on-demand retry (P0 — biggest cost lever)
Goal. Replace the current "LLM call per event" pattern with a scheduled batch that runs once per hour and a manual retry the user can fire from the UI.
Why. Today every observer tick calls the user's provider. A real cluster fires thousands of events per hour. At Anthropic rates that's tens of dollars/day per user — unshippable. Most events are noise; batching them lets the LLM see the shape of an hour at once (which is also cheaper and a better signal than seeing each event in isolation).
Shipping definition.
- Scheduler. New goroutine in pkg/stellar/observer/observer.go with a configurable interval (default 1h, env STELLAR_SCAN_INTERVAL). On tick:
  - Pull every unprocessed event since the last scan (new stellar_events_scanned_at column or per-user cursor).
  - Group by (cluster, namespace, workload) and collapse duplicates: keep count + first/last seen + a representative message.
  - Build a single prompt: "Here are 47 grouped events from the last hour. Triage: which are noise, which are recurring, which need an autonomous solve? Return JSON."
  - Make one call per user (not per event) to the user's chosen provider (§2.4).
  - For each "needs solve" entry, kick autoTriggerSolve as today.
- Manual retry. New endpoint POST /api/stellar/scan/now (auth: any logged-in user, target: self). UI button in StellarHeader: "Rescan now". Resets the scan cursor forward (so we don't re-process old events) and runs a scan immediately. Shows a spinner via existing SSE activity events ("rescan_started", "rescan_complete").
- Critical bypass. Severity critical still triggers an immediate solve — we don't make on-call wait an hour to learn the cluster is on fire. Implement as a fast pre-check in ProcessEvent that skips the batch queue for critical.
- Cost log. Every LLM call writes to stellar_provider_usage (input tokens, output tokens, provider, model, $-estimate). Surface on a new "Stellar cost" tab in settings.
Suggested wedge. PR 1: add the scan goroutine, batched prompt, and POST /scan/now — feature-flagged with STELLAR_BATCHED_SCAN=true, no UI yet. PR 2: usage log + cost panel. PR 3: flip the flag on and remove the per-event LLM path.
Don't break. Critical-event auto-solve latency must remain sub-5-second. Demo mode must seed a plausible cost panel.
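A minimal sketch of the grouping step in the scheduler above, in Go. The ScannedEvent and GroupedEvent types and the groupForScan name are assumptions for illustration; only the (cluster, namespace, workload) key and the count / first-seen / last-seen / representative-message fields come from this section.

```go
package observer

import "time"

// ScannedEvent and GroupedEvent are illustrative stand-ins, not existing types.
type ScannedEvent struct {
	Cluster, Namespace, Workload string
	Message                      string
	SeenAt                       time.Time
}

type GroupedEvent struct {
	Cluster, Namespace, Workload string
	Count                        int
	FirstSeen, LastSeen          time.Time
	Representative               string // one message standing in for the group
}

// groupForScan collapses an hour of raw events into one row per workload;
// the single batched triage prompt is built from these rows.
func groupForScan(events []ScannedEvent) map[[3]string]*GroupedEvent {
	groups := make(map[[3]string]*GroupedEvent)
	for _, e := range events {
		key := [3]string{e.Cluster, e.Namespace, e.Workload}
		g, ok := groups[key]
		if !ok {
			g = &GroupedEvent{
				Cluster: e.Cluster, Namespace: e.Namespace, Workload: e.Workload,
				FirstSeen: e.SeenAt, Representative: e.Message,
			}
			groups[key] = g
		}
		g.Count++
		if e.SeenAt.Before(g.FirstSeen) {
			g.FirstSeen = e.SeenAt
		}
		if e.SeenAt.After(g.LastSeen) {
			g.LastSeen = e.SeenAt
		}
	}
	return groups
}
```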
3.2 Cheap-local scanner: Ollama for watchers and triage (P0 — paired with 3.1)
Goal. The scanner (the "is this noise?" filter) runs on Ollama, locally. The solver (the "now actually fix this") runs on the user's chosen cloud provider.
Why. Triage is a tiny-context, high-frequency task — perfect for a 7B local model. Solving needs a frontier model for tool use, multi-step planning, and structured output. Splitting them by skill collapses cost by ~50× without hurting quality, because the local model is just deciding "interesting / not interesting," not generating fix plans.
Shipping definition.
- Provider routing. New helper ResolveScannerProvider(ctx, userID) in pkg/stellar/providers/registry.go — always returns Ollama when reachable; only falls back to the user's provider when Ollama is unhealthy. Decoupled from resolveProviderForUser, which continues to govern the solver/chat paths.
- Health check. Ping http://localhost:11434/api/tags at startup and on a 5-min heartbeat. Cache the result. If unhealthy → degrade gracefully to the user's provider for scanning and surface a soft warning in the activity log ("Scanner fell back to Anthropic — local Ollama unreachable").
- Setup affordance. When Ollama is unreachable and the user is on the Stellar settings page, show a one-liner: "Run ollama pull llama3.2:3b to enable free local scanning. Stellar will fall back to your cloud provider in the meantime."
- Wire to §3.1. The batched scanner uses ResolveScannerProvider. The solver (the autoTriggerSolve Phase 3b mission) uses resolveProviderForUser. Two providers, two purposes, no confusion.
- Model contract. Scanner prompt MUST be tested against llama3.2:3b and qwen2.5:3b. Both must produce parseable JSON for the triage payload defined in §3.1. If a model fails the contract, log it and fall back; don't ship a scanner prompt that only works on Claude.
Suggested wedge. PR 1: ResolveScannerProvider + health check + log line, scanner still uses cloud (no behavior change). PR 2: flip scanner to Ollama, add fallback path. PR 3: settings page affordance.
Don't break. §2.4 — solver must still use the user's chosen provider. This is only for scanning.
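A hedged sketch of the health check described above. The scannerHealth type is illustrative, not the actual registry code; only the /api/tags endpoint, the 5-minute heartbeat, and the graceful-fallback behavior come from this section.

```go
package providers

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// scannerHealth caches whether local Ollama (default http://localhost:11434)
// answered the last probe, so ResolveScannerProvider can decide without blocking.
type scannerHealth struct {
	ollamaURL string
	healthy   atomic.Bool
}

// probe marks Ollama healthy only when GET /api/tags answers 200 within 2s.
func (s *scannerHealth) probe(ctx context.Context) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, s.ollamaURL+"/api/tags", nil)
	if err != nil {
		s.healthy.Store(false)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		s.healthy.Store(false)
		return
	}
	defer resp.Body.Close()
	s.healthy.Store(resp.StatusCode == http.StatusOK)
}

// heartbeat re-probes every 5 minutes until ctx is cancelled; when the cached
// state is unhealthy, the scanner falls back to the user's cloud provider.
func (s *scannerHealth) heartbeat(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.probe(ctx)
		}
	}
}
```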
3.3 Memory: "Stellar remembers what fixed this last time" (P1)
Goal. Stellar should recognize "we've seen this exact problem before" and either apply the past fix immediately (when confidence is high) or show the past fix as the top recommendation (when not).
Why. Half of operations is pattern recognition. A junior who never remembered yesterday's incident is a bad junior. This is also the most pitch-worthy "feels like Jarvis" feature — "Stellar fixed this exact crash loop on Tuesday; reapplying that fix" lands in a demo.
Shipping definition.
- Incident table. New stellar_incidents table: id, user_id, signature (hash of cluster+namespace+workload+reason+top-message-tokens), first_seen, last_seen, count, resolution_status, winning_action, winning_action_payload, notes (LLM-written postmortem). One row per kind of problem.
- Signature builder. Deterministic hash function. Same crash loop on api-server two days apart = same signature. Cluster-local by default; an env flag enables cross-user signatures for KubeStellar Enterprise.
- Pre-solve lookup. Before Phase 2 of autoTriggerSolve, query the incident table by signature. If a row exists with winning_action and count >= 2:
  - Jump directly to the winning action ("Stellar remembers this — applying RestartDeployment, which worked the last 3 times").
  - Skip the LLM round-trip for root-cause. Massive cost + latency win.
- Postmortem. On solve resolved, write a 1-paragraph postmortem to incidents.notes via the user's provider. On escalated, write what was tried and why it failed.
- UI. New "Memory" section on the Stellar page — a 2-column grid: known incident signatures + their winning fixes. Each row clickable → full postmortem in a modal.
Suggested wedge. PR 1: schema + signature builder + write-on-resolve, no read path yet. PR 2: pre-solve lookup behind STELLAR_USE_MEMORY=true. PR 3: UI. PR 4: cross-user signatures behind a separate env flag.
Don't break. Memory must be per-user by default — sharing fix history across orgs is a privacy bomb.
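One possible shape for the signature builder, as a sketch. The normalization rule (lowercase, first 8 message tokens) is an assumption this spec leaves open; only the hashed fields come from the incident-table definition above.

```go
package memory

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// IncidentSignature hashes the stable parts of an event so the same crash loop
// two days apart maps to the same stellar_incidents row. Only the leading
// message tokens are used so volatile details (timestamps, pod-name hashes)
// don't split one incident into many signatures.
func IncidentSignature(cluster, namespace, workload, reason, message string) string {
	tokens := strings.Fields(strings.ToLower(message))
	if len(tokens) > 8 {
		tokens = tokens[:8]
	}
	parts := []string{
		strings.ToLower(cluster),
		strings.ToLower(namespace),
		strings.ToLower(workload),
		strings.ToLower(reason),
		strings.Join(tokens, " "),
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "|")))
	return hex.EncodeToString(sum[:])
}
```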
3.4 Approval ladder: bumping repeat fixes from auto → ask (P1)
Goal. If Stellar has restarted the same workload 3× in 24h and the issue keeps coming back, stop restarting on autopilot and require the user's approval the next time. After approval, count resets.
Why. Restart-as-cure-all is a footgun — it hides real bugs and burns through pod-startup budgets. A junior who keeps restarting the same crashing service without escalating is a bad junior.
Shipping definition.
- Reuse the existing stellar_actions.bumped_at column.
- In autoTriggerSolve Phase 3a, before dispatching: count successful auto-restarts of this workload in the last 24h via GetExecutionsByDedupeSince. If >= 3, don't dispatch; instead create a pending action requiring approval, set bumped_at = now, broadcast action_bumped SSE.
- UI: the pending action card already exists — extend it to show "Stellar restarted this 3× already in 24h and the issue came back. Approve another restart, or click 'Try AI mission' for deeper diagnosis."
- After user approval, the next 24h window resets for that workload.
Suggested wedge. Single PR — backend count + bumped-action path + frontend copy update. ~200 LOC.
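A hedged sketch of the pre-dispatch check. GetExecutionsByDedupeSince is named above, but its signature, the Execution fields, and the store interface here are assumptions for illustration.

```go
package solver

import (
	"context"
	"time"
)

type Execution struct {
	Action    string
	Succeeded bool
}

type executionStore interface {
	GetExecutionsByDedupeSince(ctx context.Context, workloadKey string, since time.Time) ([]Execution, error)
}

// shouldBumpToApproval reports whether Stellar has already auto-restarted this
// workload three or more times in the last 24h, meaning the next restart must
// become a pending action awaiting user approval instead of auto-dispatching.
func shouldBumpToApproval(ctx context.Context, store executionStore, workloadKey string) (bool, error) {
	since := time.Now().Add(-24 * time.Hour)
	execs, err := store.GetExecutionsByDedupeSince(ctx, workloadKey, since)
	if err != nil {
		return false, err
	}
	restarts := 0
	for _, e := range execs {
		if e.Action == "RestartDeployment" && e.Succeeded {
			restarts++
		}
	}
	return restarts >= 3, nil
}
```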
3.5 Cost dashboard + per-feature LLM budgets (P1)
Goal. Show the user a real-time meter of "how much has Stellar cost me today?" Allow per-feature budgets (e.g., "$2/day for scanning, $10/day for solving").
Why. Cost anxiety is the #1 blocker to autonomous AI adoption in operations. Surfacing it as a knob the user controls = trust. Hidden cost = uninstalled product.
Shipping definition.
- stellar_provider_usage table from §3.1.
- New /api/stellar/cost endpoint: today / week / month rollups + per-feature breakdown.
- Settings page card: budget sliders per feature (scanner / solver / chat). Soft caps (warn + keep working). Hard caps (refuse + log).
- Activity log row when a hard cap fires: "Paused scanning — daily budget reached. Resets at midnight UTC."
Suggested wedge. PR 1: usage logging. PR 2: rollup endpoint + read-only dashboard. PR 3: budgets.
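A sketch of what a hard-cap check could look like. The SumUsageTodayUSD helper and its signature are hypothetical; only the per-feature daily budget idea and the stellar_provider_usage source table come from this section.

```go
package budget

import "context"

type usageStore interface {
	// SumUsageTodayUSD is a hypothetical rollup over stellar_provider_usage.
	SumUsageTodayUSD(ctx context.Context, userID, feature string) (float64, error)
}

// allowCall reports whether another LLM call for this feature still fits under
// the user's daily hard cap; a soft cap would warn and keep working instead.
func allowCall(ctx context.Context, store usageStore, userID, feature string, hardCapUSD float64) (bool, error) {
	spent, err := store.SumUsageTodayUSD(ctx, userID, feature)
	if err != nil {
		return false, err
	}
	return spent < hardCapUSD, nil
}
```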
3.6 Real RCA: log pulls before root-cause phase (P2)
Goal. Today Phase 2 (root_cause) is a deterministic string from the event reason. Make it real: pull the last 100 lines of the affected pod's logs via the kc-agent bridge and feed them to the LLM along with the event.
Why. "Container exits non-zero on startup" is true and useless. "Container exits non-zero on startup; logs show panic: nil pointer dereference at handler.go:42" is true and actionable.
Shipping definition.
- New helper pkg/stellar/diagnose/logs.go: given (cluster, namespace, pod), calls the existing kc-agent kubectl logs route, tails last 100 lines + previous container.
- In autoTriggerSolve between Phase 1 and Phase 2, pull logs if the event is pod-related. Pass them to the LLM with the event in a single prompt.
- Persist the log snippet on the stellar_solves row (truncated to 4KB) so the modal can show it in "Stellar's attempts" → expandable.
- Token budget: if logs > 4KB, send first 2KB + last 2KB with a [... truncated ...] marker.
Suggested wedge. PR 1: log fetcher + persistence, no UI change. PR 2: feed to LLM. PR 3: surface in modal.
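The token-budget rule reduces to a few lines; a sketch, assuming a plain byte-length limit rather than true token counting:

```go
package diagnose

// truncateLogs keeps the head and the tail of the log so both the startup
// banner and the final panic survive the 4KB budget.
func truncateLogs(logs string, limit int) string {
	if len(logs) <= limit {
		return logs
	}
	half := limit / 2
	return logs[:half] + "\n[... truncated ...]\n" + logs[len(logs)-half:]
}
```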
3.7 Multi-step planner: "diagnose → propose → apply → verify" (P2)
Goal. Promote Phase 3b from "trigger a generic mission" to a Stellar-specific 4-step plan rendered live in the UI.
Why. The mission system is great for ad-hoc requests but generic. A Stellar-owned plan UI lets us narrate ("step 2 of 4: applying the proposed kubectl rollout restart"), surface intermediate state, and give the user a one-click abort.
Shipping definition.
- New stellar_plans table linked to a solve.
- Plan shape: ordered list of Step{kind, description, status, started_at, ended_at, output}. Kinds: diagnose, propose, apply, verify, report.
- New SSE event plan_step mirroring solve_progress but at step granularity.
- UI: when a solve has a plan, EventModal renders an inline checklist with live status icons. Abort button on running steps.
Suggested wedge. PR 1: schema + write-side instrumentation (Phase 3b emits steps). PR 2: SSE wiring + read endpoints. PR 3: UI.
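One possible Go shape for a plan step; the JSON field names mirror the Step{...} list above, while the type names and the status values are assumptions.

```go
package plan

import "time"

type StepKind string

const (
	StepDiagnose StepKind = "diagnose"
	StepPropose  StepKind = "propose"
	StepApply    StepKind = "apply"
	StepVerify   StepKind = "verify"
	StepReport   StepKind = "report"
)

// Step is one entry in a stellar_plans row; plan_step SSE events carry the
// same shape so the EventModal checklist can update live.
type Step struct {
	Kind        StepKind   `json:"kind"`
	Description string     `json:"description"`
	Status      string     `json:"status"` // e.g. pending | running | done | failed (assumed values)
	StartedAt   *time.Time `json:"started_at,omitempty"`
	EndedAt     *time.Time `json:"ended_at,omitempty"`
	Output      string     `json:"output,omitempty"`
}
```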
3.8 Cross-cluster correlation (P2)
Goal. If five different clusters all start firing FailedMount on PVCs in the same 60-second window, that's a storage incident, not five independent events. Stellar should detect, group, and report it as one.
Why. Multi-cluster is KubeStellar's whole pitch. Stellar needs to use that signal.
Shipping definition.
- Backend correlation window: sliding 60-second buckets keyed by (reason, k8sObjectKind). Threshold >= 3 clusters flips it into a correlated incident.
- New SSE event correlated_incident with the list of affected clusters.
- Toast: "⚠ Storage incident — 5 clusters reporting FailedMount in last 60s. Inspect →"
- Single modal aggregating all five events.
Suggested wedge. Backend-only PR with feature flag. UI follows once detection is calibrated against real noise.
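A hedged sketch of the detection logic. Only the (reason, k8sObjectKind) key, the 60-second window, and the >= 3 clusters threshold come from this section; the in-memory layout and names are illustrative.

```go
package correlate

import "time"

type key struct {
	Reason string
	Kind   string // k8s object kind, e.g. PersistentVolumeClaim
}

// Correlator tracks which clusters reported a given (reason, kind) inside a
// sliding window and flags a correlated incident when enough clusters overlap.
type Correlator struct {
	Window    time.Duration // 60 * time.Second per the spec
	Threshold int           // 3 clusters per the spec
	seen      map[key]map[string]time.Time
}

// Observe records one event and returns the clusters currently inside the
// window plus whether the threshold has been crossed.
func (c *Correlator) Observe(now time.Time, reason, kind, cluster string) ([]string, bool) {
	if c.seen == nil {
		c.seen = map[key]map[string]time.Time{}
	}
	k := key{Reason: reason, Kind: kind}
	if c.seen[k] == nil {
		c.seen[k] = map[string]time.Time{}
	}
	c.seen[k][cluster] = now

	var clusters []string
	for cl, t := range c.seen[k] {
		if now.Sub(t) > c.Window {
			delete(c.seen[k], cl) // expire clusters that fell out of the window
			continue
		}
		clusters = append(clusters, cl)
	}
	return clusters, len(clusters) >= c.Threshold
}
```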
3.9 The "Stellar digest" — daily 8am email + activity feed (P3)
Goal. A 5-bullet daily summary: "Yesterday I noticed 1,847 events, filtered 1,839 as noise, auto-resolved 6, escalated 2 (here they are)." Surfaced as a card on dashboard login + optionally e-mailed.
Why. This is the "I went on vacation; tell me what happened" feature. Single biggest reason operators reopen the console after a long weekend.
Shipping definition.
- Cron at 08:00 user-local. Rolls up last 24h of stellar_activity and stellar_solves.
- LLM call (uses scanner provider — cheap, batched) to write a 5-bullet narrative.
- Persisted as a digest notification.
- Optional e-mail via existing notification channel (already in the codebase for OAuth flows).
Suggested wedge. Single PR — cron + rollup + persisted notification. E-mail is its own follow-up.
3.10 Voice + ambient mode (P3 — moonshot)
Goal. Spoken update on a hotkey (Cmd+Shift+S): "Hey Stellar, what's the cluster doing?" → 10-second TTS reply from the activity log.
Why. The Jarvis test. If we ship voice, we ship the vision.
Shipping definition.
- Browser SpeechSynthesis API for output. Hotkey binding in Layout.tsx.
- New endpoint GET /api/stellar/spoken-update returns a 2-sentence summary (uses scanner provider).
- No voice input in v1 — push-to-speak only. Voice in is a privacy and accuracy can-of-worms.
Suggested wedge. Hotkey + spoken-update endpoint + SpeechSynthesis. One PR. Treat as polish, not a core feature.
4. Operational requirements
4.1 Telemetry
- Every Stellar code path must increment a Prometheus counter (stellar_solves_total, stellar_scans_total, stellar_provider_calls_total{provider, feature}).
- Histograms for solve latency and per-call LLM latency.
- Activity-log rows are the user-facing telemetry; Prometheus is the operator-facing telemetry. Both must be written.
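A minimal sketch of the counters from the first bullet using prometheus/client_golang. The metric names come from the list above; the histogram name and the Go variable names are assumptions.

```go
package telemetry

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	SolvesTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "stellar_solves_total",
		Help: "End-to-end solve attempts started by Stellar.",
	})
	ScansTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "stellar_scans_total",
		Help: "Batched scans executed by the observer.",
	})
	ProviderCallsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "stellar_provider_calls_total",
		Help: "LLM calls made by Stellar, by provider and feature.",
	}, []string{"provider", "feature"})
	SolveLatencySeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "stellar_solve_duration_seconds", // name is an assumption, not spec'd
		Help:    "Wall-clock latency of a full solve.",
		Buckets: prometheus.DefBuckets,
	})
)
```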
4.2 Feature flags
Every section above ships behind an env var (STELLAR_BATCHED_SCAN, STELLAR_USE_MEMORY, STELLAR_OLLAMA_SCANNER, STELLAR_BUDGETS_ENFORCED). Default to off. Demo recordings need a stable surface; we don't want a half-shipped planner breaking the pitch the day a feature lands.
4.3 Test coverage
- Backend: every new SQL method gets a table-driven test in pkg/store/sqlite_test.go.
- Frontend: every new card or modal section gets a visual-regression spec under web/e2e/visual/ per CLAUDE.md. Baselines committed.
- Integration: at least one Playwright spec per feature that drives it end-to-end in demo mode.
4.4 Documentation
- Every new env flag, every new endpoint, every new SSE kind: add to this spec under §1 (state) before merging.
- The STELLAR.md top-level file is the user-facing intro; docs/stellar/HIVE-SPEC.md (this file) is the engineering reference. Keep them in lockstep.
5. Anti-goals (don't build these)
- A chatbot UI. The chat panel that exists is a control surface, not the product. Don't promote it. The product is the autonomous loop, the log, and the toasts.
- A "settings page for the LLM." Hidden complexity. The provider knob in the navbar is enough. Don't add temperature sliders, system prompt editors, etc.
- A second SSE channel. One stream. Extend the envelope.
- Mocking the LLM in tests. Tests run in demo mode with deterministic seed data. Mocking the LLM creates the same drift the user's seen elsewhere; demo-mode fixtures are the contract.
- Cross-tenant memory sharing without a flag. §3.3 caveat. Per-user by default forever.
- Deterministic action ladders for non-restart cases. Phase 3a's safeAutoActions is intentionally a single-entry whitelist. The temptation to add ScaleDeployment or DeletePod will be strong. Resist; route those through Phase 3b's mission system where the LLM can reason about whether they're safe in context. The deterministic path exists only because "rollout restart" is empirically never the wrong thing to try first.
6. Glossary
| Term | Meaning |
|---|---|
| Solve | One end-to-end attempt by Stellar to handle one event. A row in stellar_solves. Goes from running to a terminal state (resolved or escalated). |
| Workload key | (cluster, namespace, workload) — the canonical identity for status-sharing across events. |
| Scanner | The cheap pass that decides "noise or signal?" §3.2 says this runs on Ollama. |
| Solver | The expensive pass that decides "what do we do about it?" Runs on the user's chosen provider. |
| Mission | KubeStellar's existing autonomous agent task. Stellar triggers a mission for Phase 3b. |
| Phase 3a / 3b | Deterministic restart (3a) vs AI mission (3b). The autonomous loop picks one or the other based on safeAutoActions. |
| Activity | First-person log entries Stellar writes about what it did. Distinct from notifications (what the cluster did). |
| Digest | A periodic LLM-written summary (§3.9). |
7. The single-question test
For every PR a Hive worker opens against this codebase, answer one question:
"Does this make Stellar feel more like a junior engineer who works in the background, or more like a dashboard widget the user has to click?"
If the answer is "widget," send it back. The whole point is the engineer.
Thanks for the substantial update — I do see two blocking issues to address in addition to the current rebase conflict.
1) Rebase is required
tide is failing because the PR is not mergeable (merge conflicts with main).
Please rebase and push:
git fetch upstream
git checkout stellar-ai-pa
git rebase upstream/main
# resolve conflicts
git push --force-with-lease
2) Fix concurrent SSE writer usage
In stellar.go, Stream() starts go h.pushCatchUpSummary(..., w, ...) while the main stream loop also writes to w.
*bufio.Writer is not goroutine-safe, so this can corrupt/interleave SSE frames.
Please serialize writes (e.g., route catch-up events through the same stream loop/channel, or guard all writes with a mutex).
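For illustration, one hedged way to do that with a mutex; the type and method names below are placeholders, not the actual stellar.go code. Routing catch-up events through the existing stream loop's channel would avoid the lock entirely and may be the cleaner fix.

```go
package handlers

import (
	"bufio"
	"fmt"
	"sync"
)

// sseWriter wraps the shared *bufio.Writer so the stream loop and the
// catch-up goroutine can never interleave frames.
type sseWriter struct {
	mu sync.Mutex
	w  *bufio.Writer
}

// send writes one complete SSE frame and flushes while holding the lock.
func (s *sseWriter) send(event, data string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, err := fmt.Fprintf(s.w, "event: %s\ndata: %s\n\n", event, data); err != nil {
		return err
	}
	return s.w.Flush()
}
```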
3) Fix user isolation in catch-up notifications
pushCatchUpSummary is per-user, but it calls GetNotificationsSince, and that DB query is not filtered by user_id.
This can leak other users’ notifications into a user’s catch-up summary.
Please add user scoping (e.g., GetNotificationsSince(ctx, userID, since) and WHERE user_id = ? AND created_at >= ?), then update call sites.
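A hedged sketch of that scoping, assuming a plain database/sql store; beyond the user_id/created_at filter asked for above, the Notification fields and column names are placeholders.

```go
package store

import (
	"context"
	"database/sql"
	"time"
)

type Notification struct {
	ID        string
	Title     string
	CreatedAt time.Time
}

type Store struct{ db *sql.DB }

// GetNotificationsSince returns only the calling user's notifications, so the
// per-user catch-up summary cannot leak another user's rows.
func (s *Store) GetNotificationsSince(ctx context.Context, userID string, since time.Time) ([]Notification, error) {
	rows, err := s.db.QueryContext(ctx,
		`SELECT id, title, created_at
		   FROM notifications
		  WHERE user_id = ? AND created_at >= ?
		  ORDER BY created_at`, userID, since)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []Notification
	for rows.Next() {
		var n Notification
		if err := rows.Scan(&n.ID, &n.Title, &n.CreatedAt); err != nil {
			return nil, err
		}
		out = append(out, n)
	}
	return out, rows.Err()
}
```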
Once rebased and these two fixes are in, this should be in much better shape for merge.
@kubestellar-hive[bot]: changing LGTM is restricted to collaborators.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@kubestellar-hive[bot]: Take this PR up and perform all the required changes using the Spec provided.
needs-rebase — per the work list, this branch should be rebased onto current main before further review/merge.
This PR is clearly the implementation for issue #13799 (STELLAR assistant feature request). The scope is broad: new store/API/agent code plus a large new frontend surface, so the merge risk is high until the branch is current.
Technical feedback on the current diff:
- The async narration goroutine in pkg/api/handlers/stellar.go uses the request context after the handler returns. That context is likely to be cancelled immediately, which can abort the provider call and the notification update/broadcast (see the sketch after this list).
- Because this lands persistence, SSE, and UI together, please make sure the new STELLAR flows are covered by tests around startup/migration, demo mode, and reconnect/error paths.
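A hedged sketch of one fix for the first bullet: detach the narration goroutine from the request context so cancelling the request doesn't abort the provider call. The handler shape and names are illustrative, and context.WithoutCancel needs Go 1.21+.

```go
package handlers

import (
	"context"
	"time"
)

type StellarHandler struct{ /* provider client, store, broadcaster ... */ }

func (h *StellarHandler) narrate(ctx context.Context, solveID string) { /* elided */ }

// startNarrationAsync keeps the request's values but drops its cancellation,
// then bounds the background narration with its own timeout. Production code
// would use the repo's safego wrapper rather than a bare goroutine.
func (h *StellarHandler) startNarrationAsync(reqCtx context.Context, solveID string) {
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 2*time.Minute)
	go func() {
		defer cancel()
		h.narrate(ctx, solveID) // provider call + notification update/broadcast
	}()
}
```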
Hi @xonas1101 — this PR needs a rebase on the latest main.
Hey @xonas1101 — this PR currently has merge conflicts with main. Could you rebase on the latest main branch to resolve them?
git fetch origin main
git rebase origin/main
# resolve any conflicts, then force-push
git push --force-with-lease
Happy to re-review once the conflicts are resolved. Thanks!
Hey @xonas1101 👋 — this PR has the needs-rebase label and was last committed today (May 15), so it looks like you're actively working on it. Just a friendly nudge: when you're ready, please rebase on the latest main to clear the label:
git fetch origin main
git rebase origin/main
git push --force-with-lease
Let us know if you need any help or guidance on the STELLAR integration. Looking forward to seeing this progress! 🚀
🔒 Security Review — sec-check pass (2026-05-15)
Overall: B- — Several issues must be resolved before merge.
🔴 HIGH — SSRF via unvalidated BaseURL
User-supplied BaseURL from provider config flows directly into http.NewRequestWithContext() without any URL validation, DNS resolution check, or private IP blocklist. This allows:
- Internal network scanning (http://169.254.169.254, http://10.x.x.x)
- Cloud metadata endpoint access
Fix: Use the existing ValidateBaseURL() from main (added in PR #13869). After rebase, ensure all provider HTTP clients call that validator before making requests.
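Not the actual ValidateBaseURL() from main, but for illustration, a guard of this kind typically resolves the host and rejects loopback, link-local (including 169.254.169.254), and private ranges before any request is made:

```go
package providers

import (
	"fmt"
	"net"
	"net/url"
)

// checkBaseURL is an illustrative stand-in for the real validator.
func checkBaseURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return fmt.Errorf("invalid base URL")
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return fmt.Errorf("cannot resolve base URL host")
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return fmt.Errorf("base URL resolves to a disallowed address")
		}
	}
	return nil
}
```

The real validator's policy may differ (local Ollama, for one, lives on loopback), and a production version also needs to pin the resolved IP to defend against DNS rebinding.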
🟡 MEDIUM — Error leakage to clients (17+ instances)
Raw err.Error() is returned in JSON responses:
"provider resolution failed: " + err.Error()
"digest generation failed: " + err.Error()
"AI provider error: " + err.Error()Internal errors (stack traces, file paths, driver messages) leak to the client.
Fix: Return generic messages to clients; log the full error server-side with slog.Error.
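For illustration, the pattern could look like the helper below, in the fiber + slog style of the flagged snippets; the helper name and wiring are assumptions.

```go
package handlers

import (
	"log/slog"

	"github.com/gofiber/fiber/v2"
)

// internalError logs the real error server-side and returns only a generic
// message to the client, so provider/driver details never leak.
func internalError(c *fiber.Ctx, err error) error {
	slog.Error("stellar handler error", "path", c.Path(), "error", err)
	return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{
		"error": "AI processing failed",
	})
}
```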
🟡 MEDIUM — Bare goroutines (5 instances)
go func() without safego.GoWith(). One has inline panic recovery, but the repo convention (pkg/safego/safego.go) should be used uniformly.
Fix: Replace go func() with safego.GoWith(ctx, func() { ... }).
🟡 MEDIUM — Bare JSON.parse in EventSource handlers (~6 unguarded)
~18 JSON.parse() calls in SSE handlers; ~6 appear outside try-catch. A malformed SSE payload will crash the EventSource listener.
Fix: Wrap all JSON.parse(e.data) in try-catch.
Note: This PR has needs-rebase — the SSRF fix from main will be available after rebase but must be explicitly wired into the new provider HTTP calls.
👋 @xonas1101 — This PR has merge conflicts with main.
Hi @xonas1101 — this PR has the needs-rebase label and can't be merged as-is. Could you rebase on the latest main branch? Thanks!
test comment
Hi @xonas1101 — thanks for the STELLAR personal assistant work. This PR is flagged needs-rebase. Could you rebase onto current main? If this PR is no longer actively maintained, let us know and we can close it — shipped components on main may already cover parts of this work. No rush, just keeping the PR queue healthy. 🙏
🔒 Security Review — sec-check pass (2026-05-16T03:31:38Z)
Issues Found
HIGH — Error leakage to clients (5 instances)
Raw err.Error() returned in JSON responses exposes internal details (stack info, file paths, provider names) to the client:
"LLM call failed: " + genErr.Error()"provider resolution failed: " + err.Error()"digest generation failed: " + err.Error()"AI provider error: " + err.Error()- Direct
err.Error()in bad-request handler
Fix: Use generic messages for 5xx responses ("AI processing failed"). Log the real error server-side with slog.Error().
MEDIUM — Bare goroutines (5 instances)
Lines use go func() instead of safego.Go() / safego.GoWith():
- dispatch goroutine (line ~3396)
- two goroutines around line ~5079/5102
- action goroutine (line ~7531)
- cluster iteration goroutine (line ~8246)
Fix: Replace with safego.GoWith("descriptive-name", func() { ... }) to get panic recovery and structured logging.
MEDIUM — Bare JSON.parse (~3 instances without try/catch)
Some EventSource/WebSocket message handlers parse e.data without wrapping in try/catch. A malformed server message would crash the component.
Fix: Wrap every JSON.parse(e.data) in try/catch (repo convention since PR #13929).
Clean
- ✅ No SSRF vectors (proxy uses existing validated cardProxyClient)
- ✅ No shell injection
- ✅ No XSS / dangerouslySetInnerHTML
- ✅ External links use rel="noopener noreferrer"
Please address the HIGH and MEDIUM findings before merge.
@xonas1101 please rebase this
🔒 Security Review — sec-check pass
PR Status: needs-rebase, XXL — reviewing security patterns only.
Findings (3 issues to fix before merge)
1. ERROR LEAKAGE TO CLIENTS (HIGH × 6+)
Multiple handler responses send raw err.Error() to clients:
c.Status(500).JSON(fiber.Map{"error": "provider resolution failed: " + err.Error()})
c.Status(500).JSON(fiber.Map{"error": "digest generation failed: " + err.Error()})
c.Status(500).JSON(fiber.Map{"error": "AI provider error: " + err.Error()})
c.Status(400).JSON(fiber.Map{"error": err.Error()})
Internal error messages (stack traces, connection strings, provider details) must NOT reach clients. Return generic messages; log the real error with slog.Error.
2. JSON INJECTION via fmt.Sprintf (MEDIUM)
Detail: fmt.Sprintf(`{"confirmToken":"%s"}`, req.ConfirmToken)
User-controlled confirmToken is interpolated into a JSON string without escaping. A value containing " breaks the JSON structure. Use json.Marshal instead.
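A hedged sketch of that fix; the payload shape mirrors the flagged snippet and the helper name is hypothetical.

```go
package handlers

import "encoding/json"

// confirmPayload marshals the token instead of interpolating it; json.Marshal
// escapes quotes and backslashes, so user input can't break out of the JSON
// structure the way fmt.Sprintf interpolation can.
func confirmPayload(confirmToken string) ([]byte, error) {
	return json.Marshal(struct {
		ConfirmToken string `json:"confirmToken"`
	}{ConfirmToken: confirmToken})
}
```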
3. BARE GOROUTINES (MEDIUM × 5)
5 go func() calls without safego.GoWith or bounded concurrency. The repo convention (see pkg/safego/safego.go) requires all production goroutines to use the safe wrapper for panic recovery and metrics. At minimum, ensure these are bounded (semaphore or WaitGroup with limit).
Notes
- safeAutoActions allowlist restricted to RestartDeployment only — acceptable risk.
- API key handling uses encrypt/decrypt + masking — looks correct.
- No SSRF or command injection vectors found.
- No hardcoded secrets.
Please address findings 1–3 before requesting merge.