Conversation


@derekmisler derekmisler commented Feb 3, 2026

Summary

Adds an automated nightly codebase scanner that uses a multi-agent architecture to detect code quality issues and automatically create GitHub issues for findings.

Key features:

  • Multi-agent system with 5 specialized agents: orchestrator (Claude Sonnet), security analyzer (OpenAI o3-mini), bug detector (Gemini Flash), documentation checker (Claude
    Haiku), and issue reporter (Claude Haiku)
  • Persistent memory via a SQLite database cached between runs; the scanner learns from previous scans to avoid repeating false positives
  • Automated issue creation using gh CLI with duplicate detection and a strict 2-issue-per-run limit
  • Multi-provider support leveraging different models' strengths (reasoning models for security, fast models for bugs)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Root Orchestrator                        │
│                   (claude-sonnet-4-5)                       │
│  • Loads memory from previous scans                         │
│  • Delegates to sub-agents                                  │
│  • Filters/prioritizes findings                             │
│  • Adds CATEGORY field before forwarding to reporter        │
└─────────────────────────────────────────────────────────────┘
          │              │                │              │
          ▼              ▼                ▼              ▼
    ┌──────────┐  ┌──────────┐  ┌───────────────┐  ┌──────────┐
    │ Security │  │   Bugs   │  │ Documentation │  │ Reporter │
    │ (o3-mini)│  │ (gemini) │  │    (haiku)    │  │ (haiku)  │
    └──────────┘  └──────────┘  └───────────────┘  └──────────┘
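
In cagent-style YAML, the wiring above might look roughly like this (a hypothetical sketch only; the exact cagent schema may differ, and instructions are abbreviated):

```yaml
agents:
  root:
    model: anthropic/claude-sonnet-4-5
    description: Orchestrates the nightly scan
    instruction: |
      Load memories from previous scans, delegate to sub-agents,
      filter and prioritize findings, add a CATEGORY field,
      then forward to the reporter.
    sub_agents: [security, bugs, documentation, reporter]
  security:
    model: openai/o3-mini
    instruction: Look for vulnerabilities. Output NO_ISSUES if none found.
  bugs:
    model: google/gemini-2.5-flash
    instruction: Look for logic errors and resource leaks.
  documentation:
    model: anthropic/claude-haiku
    instruction: Check for documentation gaps.
  reporter:
    model: anthropic/claude-haiku
    instruction: Create GitHub issues via the gh CLI, max 2 per run.
```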

Files Changed

  • .github/agents/nightly-scanner.yaml - Multi-agent configuration with instructions, toolsets, and permissions
  • .github/workflows/nightly-scan.yml - GitHub Actions workflow (runs daily at 6am UTC, supports dry-run mode)
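
The trigger wiring in the workflow file presumably looks something like this (a sketch; exact input names and defaults may differ):

```yaml
# Sketch of .github/workflows/nightly-scan.yml triggers; details may differ.
on:
  schedule:
    - cron: "0 6 * * *"        # daily at 6am UTC
  workflow_dispatch:
    inputs:
      dry-run:
        description: Report findings without creating issues
        type: boolean
        default: false
```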

Test plan

  • Trigger workflow manually with dry-run: true to verify agents run without creating issues
  • Verify the scanner memory database is created and cached
  • Trigger without dry-run: true on a branch with a known issue to verify issue creation
  • Verify duplicate detection prevents recreating the same issue

@derekmisler derekmisler self-assigned this Feb 3, 2026
@derekmisler derekmisler marked this pull request as ready for review February 3, 2026 03:13
@derekmisler derekmisler requested a review from a team as a code owner February 3, 2026 03:13
@krissetto

It might be worth trying out some alloy models in this and other actions; they could give the agents a deeper perspective and better overall results.

Commits

Adds a read-only agent that scans the codebase daily and creates
GitHub issues for security vulnerabilities, bugs, and documentation gaps.

- Runs daily at 6am UTC (or manual trigger)
- Creates max 2 issues per run to avoid flooding
- Deduplicates against existing open issues with 'nightly-scan' label
- Dry-run mode for testing
- Strong anti-hallucination rules (read-only, must verify files exist)
- Root agent (claude-sonnet) orchestrates scan across sub-agents
- Security sub-agent (claude-opus) for vulnerability detection
- Bugs sub-agent (claude-sonnet) for logic errors and resource leaks
- Documentation sub-agent (claude-haiku) for doc gap detection
- Add GitHub Actions cache for persistent scanner memory
- Memory stores skip patterns, context, and feedback across runs
- Each sub-agent has strict grounding rules to prevent hallucinations
Agent improvements:
- Documentation agent reads all markdown files before analysis
- All sub-agents explicitly accept empty results as valid outcome
- Added read_multiple_files tool to documentation agent

Workflow improvements:
- Pin actions/cache to v4.2.0 SHA for security
- Use static cache key (matches cagent-action pattern)
- Replace custom JSON memory with cagent's SQLite memory toolset
- Agent uses get_memories/add_memory tools instead of file I/O
- Memory path resolves to .github/agents/scanner-memory.db
- Removes manual JSON initialization step
Sub-agents now return findings in simple text format:
  FILE: path/to/file.go
  LINE: 123
  SEVERITY: high
  ...

Benefits over JSON:
- More natural for LLM output
- Less prone to formatting errors
- Matches cagent-action PR review pattern

Root agent still outputs JSON for workflow parsing.
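
The root agent's JSON envelope might look something like this (the shape is illustrative, not taken from the actual config; file path and title are hypothetical):

```json
{
  "status": "ok",
  "findings": [
    {
      "file": "pkg/scanner/scan.go",
      "line": 123,
      "severity": "high",
      "category": "bug",
      "title": "Unclosed file handle in scan loop"
    }
  ]
}
```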
Move issue creation from workflow to agent:
- New `reporter` sub-agent uses `gh` CLI to create issues
- Checks for duplicates before creating
- Selects appropriate labels based on category
- Workflow reduced from 175 lines to 55 lines

Dry-run mode now passed as prompt to agent.
Model assignments:
- Security: openai/o3-mini (reasoning model for subtle vulnerabilities)
- Bugs: google/gemini-2.5-flash (fast, good at Go code analysis)
- Documentation: anthropic/claude-haiku (sufficient for simpler task)
- Orchestrator: anthropic/claude-sonnet (coordination)
- Reporter: anthropic/claude-haiku (formatting + gh commands)

Workflow now passes all three API keys:
- ANTHROPIC_API_KEY
- OPENAI_API_KEY
- GOOGLE_API_KEY (mapped from GEMINI_API_KEY secret)
Fixes from code review:

1. Restrict shell permissions
   - Changed `gh *` to `gh issue list *` and `gh issue create *`
   - Principle of least privilege

2. CATEGORY field mismatch
   - Root agent now adds CATEGORY when forwarding to reporter
   - Added explicit "Forwarding to reporter" section with example

3. Inconsistent array/NO_ISSUES terminology
   - Changed "Return an empty array `[]`" to "Output `NO_ISSUES`"
   - Consistent across all three analysis agents

4. Documentation trigger clarity
   - Changed to "ONLY run if BOTH security AND bugs returned `NO_ISSUES`"
   - Unambiguous trigger condition

5. Better duplicate detection
   - Changed from downloading 100 issues to `gh issue list --search`
   - Searches by file path in issue body

6. Sub-agent failure handling
   - Added explicit error handling strategy
   - Log errors and continue with other agents
   - Report partial results if some agents fail
   - Added FAILED status to reporter output
@derekmisler derekmisler force-pushed the nightly-issue-scanner branch from 0d1c6d7 to 23e46fb Compare February 3, 2026 16:55
@derekmisler derekmisler marked this pull request as draft February 3, 2026 17:08
@derekmisler

/review


@github-actions github-actions bot left a comment


Review Summary

Found several issues in the new nightly scanner configuration: a critical cache-key bug that prevents memory persistence after the first run, overly broad shell command patterns, and reliance on LLM instruction-following for critical error handling.

Key Issues:

  • 🔴 HIGH: Cache key prevents memory updates after first run
  • 🟡 MEDIUM: Shell permission wildcards too broad
  • 🟡 MEDIUM: Potential command injection via file paths
  • 🟡 MEDIUM: Error handling relies on LLM instructions only
  • 🟢 LOW: Issue creation limit not enforced programmatically
  • 🟢 LOW: Timeout may be insufficient for large repos
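
The HIGH cache-key finding follows from how actions/cache works: a cache entry is immutable once saved, so a static key restores the first run's database forever and never saves updates. The usual fix is a unique key per run plus a restore-keys prefix fallback (a sketch; key names are illustrative):

```yaml
- uses: actions/cache@v4.2.0   # pin to the full commit SHA in the real workflow
  with:
    path: .github/agents/scanner-memory.db
    key: scanner-memory-${{ github.run_id }}
    restore-keys: |
      scanner-memory-
```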

- type: shell

permissions:
allow:

Overly permissive shell command wildcards

The permissions shell:cmd=gh issue list * and shell:cmd=gh issue create * are overly broad and allow the agent to execute these GitHub CLI commands with ANY arguments. While the GITHUB_TOKEN permissions provide some protection by limiting repository access, the wildcard patterns still allow potentially problematic commands like:

  • Exfiltration attempts via --repo flags to other accessible repositories
  • Resource exhaustion via extreme --limit values
  • Creation of issues with arbitrary/malicious content, labels, or assignees

Fix: Use more restrictive patterns that only allow the specific flag combinations needed:

permissions:
  allow:
    - shell:cmd=gh issue list --state * --label * --search *
    - shell:cmd=gh issue create --title * --label * --body *

This maintains functionality while reducing the attack surface.



Shell command injection vulnerability in duplicate detection

The instruction demonstrates constructing a gh command with file paths embedded in the search query:

gh issue list --label automated --state open --search "in:body \"path/to/file.go\" \"line 123\""

If a malicious file is named with shell metacharacters (e.g., file$(whoami).go, file;rm -rf.go, or files with backticks), these could be interpreted by the shell when the LLM-generated command executes.

While line 430 instructs the agent to "quote file paths and line numbers", this is merely a natural-language instruction to an LLM, not enforced, code-level protection. LLMs can make mistakes with complex escaping requirements.

Fix: Use structured approaches that avoid shell interpolation:

  • Use environment variables for dynamic values
  • Use gh CLI with --json output and programmatic filtering
  • Implement explicit sanitization/escaping before command construction
  • Add validation to reject files with suspicious characters before processing
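
As a concrete example of the last point, a workflow step could reject suspicious paths with an allowlist check before they ever reach an LLM-constructed command (a minimal sketch; the function name and the allowed character set are illustrative):

```shell
# Allow only conservative path characters; reject anything the shell could
# interpret (quotes, backticks, $(), semicolons, spaces, ...).
is_safe_path() {
  case "$1" in
    "") return 1 ;;                   # empty path: reject
    *[!A-Za-z0-9._/-]*) return 1 ;;   # any character outside the allowlist: reject
    *) return 0 ;;
  esac
}

is_safe_path "pkg/scanner/main.go" && echo "accepted"   # prints "accepted"
is_safe_path 'file$(whoami).go' || echo "rejected"      # prints "rejected"
```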


## Your workflow

1. Call `get_memories` to understand what to skip from previous runs

Error handling relies on LLM instruction-following

Lines 52-75 define error handling logic purely through natural language instructions: "If a sub-agent fails (timeout, API error), log the error and continue with other agents".

This is not actual error-handling code; it is a set of text instructions that relies on the LLM correctly detecting errors, formatting them properly, and continuing execution. If a sub-agent API call fails, the orchestrator agent might:

  1. Fail to detect the error
  2. Hallucinate success
  3. Skip remaining agents despite instructions
  4. Produce malformed output

For an automated nightly scanner that creates GitHub issues, this could lead to silent failures, missed findings, or incorrect issue creation.

Recommendation: Add system-level error boundaries, retry logic, or at minimum schema-validated output requirements rather than relying solely on LLM instruction-following for critical error handling.
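
For instance, a schema check in the workflow would catch malformed orchestrator output regardless of what the LLM did (a sketch; the file name and schema are illustrative, and `jq` is assumed to be available on the runner):

```shell
# Create a sample output file for this sketch; in the real workflow the
# file would come from the agent step.
printf '{"status":"ok","findings":[]}' > scan-output.json

# Fail the job unless the output parses as JSON with a findings array.
if jq -e 'has("findings") and (.findings | type == "array")' scan-output.json >/dev/null; then
  echo "schema ok"
else
  echo "::error::agent output failed schema validation"
  exit 1
fi
```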

SEVERITY: critical|high|medium
TITLE: Brief description
CODE: exact code snippet
PROBLEM: explanation

Issue creation limit not enforced programmatically

Line 346 contains only an LLM instruction ("ENFORCE: Process at most 2 findings") without programmatic enforcement. While the PR description claims a "strict 2-issue-per-run limit," the implementation relies entirely on the LLM following instructions correctly.

LLMs can fail to follow instructions due to prompt injection, model drift, context confusion, or simple misunderstanding, potentially leading to runaway issue creation.

Recommendation: Implement a hard limit at the system level:

  • Add a counter/validation layer before issue creation
  • Configure the agent runtime to enforce hard limits on tool usage
  • Add workflow-level validation of created issues

Note: Severity is low because duplicate detection may catch excess issues, and impact is limited to issue spam.
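
A system-level cap along the lines of the first recommendation might look like this (a hypothetical sketch; variable names and the finding files are illustrative):

```shell
# Hard cap enforced in shell, outside the LLM.
MAX_ISSUES=2
created=0
for finding_file in finding-1.txt finding-2.txt finding-3.txt; do
  if [ "$created" -ge "$MAX_ISSUES" ]; then
    echo "cap reached, skipping $finding_file"
    continue
  fi
  # gh issue create ... would go here; omitted in this sketch
  created=$((created + 1))
done
echo "created=$created"   # prints "created=2"
```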

prompt: ${{ inputs.dry-run && 'DRY RUN MODE: Do not create any issues. Just report what you would create.' || '' }}
anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
google-api-key: ${{ secrets.GEMINI_API_KEY }}

Timeout may be insufficient for large repositories

The timeout of 600 seconds (10 minutes) controls the entire nightly scan workflow execution. While this may be sufficient for small-to-medium repositories, it is likely insufficient for large repositories due to:

  1. Multi-agent architecture with 5 specialized agents
  2. Security analyzer uses OpenAI o3-mini (reasoning model with 30-60+ second response times)
  3. Directory tree traversal and file I/O for large codebases
  4. SQLite database operations
  5. Multiple API calls to different providers

If scans consistently timeout, no issues will ever be reported.

Recommendation: Increase timeout to 30-60 minutes, or make it configurable based on repository size, especially given the use of reasoning models.
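
If the limit lives at the job level, the change is a one-liner (illustrative; the job name is hypothetical, and the 600s value might instead be an action input with its own name):

```yaml
jobs:
  nightly-scan:
    runs-on: ubuntu-latest
    timeout-minutes: 45   # up from the ~10 minutes reviewed here
```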
