Automated nightly codebase scanner #1573
base: main
Conversation
Might be good to try out some alloy models in this and other actions; it could give the agents deeper perspective and better overall results.
Adds a read-only agent that scans the codebase daily and creates GitHub issues for security vulnerabilities, bugs, and documentation gaps.

- Runs daily at 6am UTC (or manual trigger)
- Creates at most 2 issues per run to avoid flooding
- Deduplicates against existing open issues with the 'nightly-scan' label
- Dry-run mode for testing
- Strong anti-hallucination rules (read-only, must verify files exist)
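A minimal sketch of the trigger block this describes (the `dry-run` input name matches the `inputs.dry-run` reference later in the workflow; the description and default are assumptions):

```yaml
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 6am UTC
  workflow_dispatch:
    inputs:
      dry-run:
        description: "Report findings without creating issues"
        type: boolean
        default: false
```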
- Root agent (claude-sonnet) orchestrates the scan across sub-agents
- Security sub-agent (claude-opus) for vulnerability detection
- Bugs sub-agent (claude-sonnet) for logic errors and resource leaks
- Documentation sub-agent (claude-haiku) for doc gap detection
- Add GitHub Actions cache for persistent scanner memory
- Memory stores skip patterns, context, and feedback across runs
- Each sub-agent has strict grounding rules to prevent hallucinations
Agent improvements:
- Documentation agent reads all markdown files before analysis
- All sub-agents explicitly accept empty results as a valid outcome
- Added read_multiple_files tool to the documentation agent

Workflow improvements:
- Pin actions/cache to the v4.2.0 SHA for security
- Use a static cache key (matches the cagent-action pattern)
- Replace custom JSON memory with cagent's SQLite memory toolset
- Agent uses get_memories/add_memory tools instead of file I/O
- Memory path resolves to .github/agents/scanner-memory.db
- Removes the manual JSON initialization step
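A hedged sketch of what that toolset declaration might look like (the exact cagent schema keys are an assumption; the tool names and database path come from the commit message):

```yaml
toolsets:
  - type: memory                             # exposes get_memories/add_memory
    path: .github/agents/scanner-memory.db   # persisted across runs via the Actions cache
```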
Sub-agents now return findings in a simple text format:

```
FILE: path/to/file.go
LINE: 123
SEVERITY: high
...
```

Benefits over JSON:
- More natural for LLM output
- Less prone to formatting errors
- Matches the cagent-action PR review pattern

The root agent still outputs JSON for workflow parsing.
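For illustration, a complete finding in this format might look as follows (the file path, line, and content are hypothetical; the field names are those used in the agent config):

```
FILE: pkg/server/handler.go
LINE: 42
SEVERITY: high
TITLE: HTTP response body never closed
CODE: resp, err := client.Get(url)
PROBLEM: The response body is not closed on the success path, leaking connections.
```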
Move issue creation from the workflow to the agent:
- New `reporter` sub-agent uses the `gh` CLI to create issues
- Checks for duplicates before creating
- Selects appropriate labels based on category
- Workflow reduced from 175 lines to 55 lines

Dry-run mode is now passed as a prompt to the agent.
Model assignments:
- Security: openai/o3-mini (reasoning model for subtle vulnerabilities)
- Bugs: google/gemini-2.5-flash (fast, good at Go code analysis)
- Documentation: anthropic/claude-haiku (sufficient for the simpler task)
- Orchestrator: anthropic/claude-sonnet (coordination)
- Reporter: anthropic/claude-haiku (formatting + gh commands)

The workflow now passes all three API keys:
- ANTHROPIC_API_KEY
- OPENAI_API_KEY
- GOOGLE_API_KEY (mapped from the GEMINI_API_KEY secret)
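For illustration, the per-agent wiring might look roughly like this in the agent config (the agent names and schema are assumptions; the provider/model strings come from the commit message):

```yaml
agents:
  root:
    model: anthropic/claude-sonnet   # orchestrator
  security:
    model: openai/o3-mini            # reasoning model for subtle vulnerabilities
  bugs:
    model: google/gemini-2.5-flash   # fast Go code analysis
  documentation:
    model: anthropic/claude-haiku    # simpler task
  reporter:
    model: anthropic/claude-haiku    # formatting + gh commands
```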
Fixes from code review:
1. Restrict shell permissions
   - Changed `gh *` to `gh issue list *` and `gh issue create *`
   - Principle of least privilege
2. CATEGORY field mismatch
   - Root agent now adds CATEGORY when forwarding to the reporter
   - Added an explicit "Forwarding to reporter" section with an example
3. Inconsistent array/NO_ISSUES terminology
   - Changed "Return an empty array `[]`" to "Output `NO_ISSUES`"
   - Consistent across all three analysis agents
4. Documentation trigger clarity
   - Changed to "ONLY run if BOTH security AND bugs returned `NO_ISSUES`"
   - Unambiguous trigger condition
5. Better duplicate detection
   - Changed from downloading 100 issues to `gh issue list --search`
   - Searches by file path in the issue body
6. Sub-agent failure handling
   - Added an explicit error handling strategy
   - Log errors and continue with other agents
   - Report partial results if some agents fail
   - Added a FAILED status to the reporter output
0d1c6d7 to 23e46fb
/review
Review Summary
Found several issues in the new nightly scanner configuration, including a critical cache key bug that will prevent memory persistence from working after the first run, security concerns with command patterns, and reliance on LLM instruction-following for critical error handling.
Key Issues:
- 🔴 HIGH: Cache key prevents memory updates after first run (see the sketch after this list)
- 🟡 MEDIUM: Shell permission wildcards too broad
- 🟡 MEDIUM: Potential command injection via file paths
- 🟡 MEDIUM: Error handling relies on LLM instructions only
- 🟢 LOW: Issue creation limit not enforced programmatically
- 🟢 LOW: Timeout may be insufficient for large repos
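The HIGH cache-key issue stems from `actions/cache` only saving the cache when the key misses: a static key hits on every run after the first, so the updated memory database is never re-saved. A minimal sketch of the usual fix (the database path comes from the commit history; `@v4` stands in for the pinned SHA):

```yaml
- uses: actions/cache@v4
  with:
    path: .github/agents/scanner-memory.db
    key: scanner-memory-${{ github.run_id }}   # unique per run, so the cache is saved again
    restore-keys: |
      scanner-memory-                           # restore the newest previous run's memory
```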
```yaml
- type: shell
  permissions:
    allow:
```
Overly permissive shell command wildcards
The permissions `shell:cmd=gh issue list *` and `shell:cmd=gh issue create *` are overly broad and allow the agent to execute these GitHub CLI commands with ANY arguments. While the GITHUB_TOKEN permissions provide some protection by limiting repository access, the wildcard patterns still allow potentially problematic commands like:
- Exfiltration attempts via `--repo` flags to other accessible repositories
- Resource exhaustion via extreme `--limit` values
- Creation of issues with arbitrary/malicious content, labels, or assignees
Fix: Use more restrictive patterns that only allow the specific flag combinations needed:
```yaml
permissions:
  allow:
    - shell:cmd=gh issue list --state * --label * --search *
    - shell:cmd=gh issue create --title * --label * --body *
```

This maintains functionality while reducing the attack surface.
```
## Workflow
```
Shell command injection vulnerability in duplicate detection
The instruction demonstrates constructing a `gh` command with file paths embedded in the search query:

```sh
gh issue list --label automated --state open --search "in:body \"path/to/file.go\" \"line 123\""
```

If a malicious file is named with shell metacharacters (e.g., `file$(whoami).go`, `file;rm -rf.go`, or files with backticks), these could be interpreted by the shell when the LLM-generated command executes.
While line 430 instructs to "quote file paths and line numbers", this is merely a natural language instruction to an LLM, not enforced code-level protection. LLMs can make mistakes with complex escaping requirements.
Fix: Use structured approaches that avoid shell interpolation:
- Use environment variables for dynamic values
- Use the `gh` CLI with `--json` output and programmatic filtering
- Implement explicit sanitization/escaping before command construction
- Add validation to reject files with suspicious characters before processing
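For illustration, a minimal sketch of the environment-variable plus `--json` approach (the label and path reuse the example above; the `jq` filtering is an assumption about how the output would be consumed):

```sh
# Pass the path as data rather than interpolating it into the command string
FILE_PATH='path/to/file.go'
gh issue list --label automated --state open --json number,body \
  | jq --arg path "$FILE_PATH" '[.[] | select(.body | contains($path)) | .number]'
```

The shell never re-parses the file path, and `jq` compares it as an opaque string, so metacharacters in file names stay inert.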
```
## Your workflow

1. Call `get_memories` to understand what to skip from previous runs
```
Error handling relies on LLM instruction-following
Lines 52-75 define error handling logic purely through natural language instructions: "If a sub-agent fails (timeout, API error), log the error and continue with other agents".
This is not actual error handling code - it's text instructions that rely on the LLM correctly detecting errors, formatting them properly, and continuing execution. If a sub-agent API call fails, the orchestrator agent might:
- Fail to detect the error
- Hallucinate success
- Skip remaining agents despite instructions
- Produce malformed output
For an automated nightly scanner that creates GitHub issues, this could lead to silent failures, missed findings, or incorrect issue creation.
Recommendation: Add system-level error boundaries, retry logic, or at minimum schema-validated output requirements rather than relying solely on LLM instruction-following for critical error handling.
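As a concrete (hypothetical) example of such a boundary, the workflow could validate the orchestrator's JSON output before any issues are created; the `scan-results.json` file name is an assumption:

```yaml
- name: Validate scanner output
  run: |
    # Fail loudly on missing or malformed output instead of trusting LLM formatting
    if ! jq empty scan-results.json 2>/dev/null; then
      echo "::error::Nightly scanner produced missing or malformed output"
      exit 1
    fi
```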
```
SEVERITY: critical|high|medium
TITLE: Brief description
CODE: exact code snippet
PROBLEM: explanation
```
Issue creation limit not enforced programmatically
Line 346 contains only an LLM instruction ("ENFORCE: Process at most 2 findings") without programmatic enforcement. While the PR description claims a "strict 2-issue-per-run limit," the implementation relies entirely on the LLM following instructions correctly.
LLMs can fail to follow instructions due to prompt injection, model drift, context confusion, or simple misunderstanding, potentially leading to runaway issue creation.
Recommendation: Implement a hard limit at the system level:
- Add a counter/validation layer before issue creation
- Configure the agent runtime to enforce hard limits on tool usage
- Add workflow-level validation of created issues
Note: Severity is low because duplicate detection may catch excess issues, and impact is limited to issue spam.
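A workflow-level backstop might look like the following sketch (the `nightly-scan` label comes from the PR description; the step placement and token wiring are assumptions):

```yaml
- name: Enforce issue cap
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # Count issues the scanner created today and fail the run if over the limit
    created=$(gh issue list --label nightly-scan \
      --search "created:>=$(date -u +%Y-%m-%d)" --json number --jq 'length')
    if [ "$created" -gt 2 ]; then
      echo "::error::Scanner created $created issues (limit is 2)"
      exit 1
    fi
```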
```yaml
prompt: ${{ inputs.dry-run && 'DRY RUN MODE: Do not create any issues. Just report what you would create.' || '' }}
anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
google-api-key: ${{ secrets.GEMINI_API_KEY }}
```
Timeout may be insufficient for large repositories
The timeout of 600 seconds (10 minutes) controls the entire nightly scan workflow execution. While this may be sufficient for small-to-medium repositories, it is likely insufficient for large repositories due to:
- Multi-agent architecture with 5 specialized agents
- Security analyzer uses OpenAI o3-mini (reasoning model with 30-60+ second response times)
- Directory tree traversal and file I/O for large codebases
- SQLite database operations
- Multiple API calls to different providers
If scans consistently timeout, no issues will ever be reported.
Recommendation: Increase timeout to 30-60 minutes, or make it configurable based on repository size, especially given the use of reasoning models.
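One hedged way to make the limit configurable (the input name and default are assumptions):

```yaml
on:
  workflow_dispatch:
    inputs:
      timeout-minutes:
        description: "Job timeout in minutes"
        default: "30"

jobs:
  scan:
    timeout-minutes: ${{ fromJSON(inputs.timeout-minutes || '30') }}
```

Scheduled runs fall back to the 30-minute default, since `inputs` is empty for `schedule` events.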
Summary
Adds an automated nightly codebase scanner that uses a multi-agent architecture to detect code quality issues and automatically create GitHub issues for findings.
Key features:
- Five specialized agents: orchestrator (Claude Sonnet), security analyzer (OpenAI o3-mini), bug analyzer (Gemini 2.5 Flash), documentation analyzer (Claude Haiku), and issue reporter (Claude Haiku)
- Issue creation via the `gh` CLI with duplicate detection and a strict 2-issue-per-run limit

Architecture
Files Changed
- `.github/agents/nightly-scanner.yaml` - Multi-agent configuration with instructions, toolsets, and permissions
- `.github/workflows/nightly-scan.yml` - GitHub Actions workflow (runs daily at 6am UTC, supports dry-run mode)

Test plan
- Run with `dry-run: true` to verify agents run without creating issues
- Run with `dry-run: false` on a branch with a known issue to verify issue creation