An LLM-based agent that autonomously applies software analysis tools to target projects. Given an analysis tool (e.g. Clang, Infer, KLEE) and a target repository, the agent sets up the environment, installs the tool, prepares the project, and drives the analysis to completion — retrying and learning across attempts.
python -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python ≥ 3.10 and (optionally) Docker for isolated execution.
# Set your LLM API key
export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY, etc.
# Run on a single tool/target pair
python -m analysis_agent.main \
--tool-name "clang" \
--tool-url "https://github.com/llvm/llvm-project" \
--target-name "curl" \
--target-url "https://github.com/curl/curl"
# Interactive wizard
python -m analysis_agent.main --interactive
# Batch from a JSONL file
python -m analysis_agent.main --instances-file instances.jsonl
# Parallel execution
python -m analysis_agent.run_parallel \
--instances-file instances.jsonl \
--workers 4

Instance file format (JSONL):
{"tool_name": "clang", "tool_url": "...", "target_name": "curl", "target_url": "..."}
{"tool_name": "infer", "tool_url": "...", "target_name": "checkstyle", "target_url": "..."}

For large benchmark experiments, the recommended approach is a custom orchestrator with live monitoring and an emergency kill switch. Key design principles:
- Global concurrency cap + per-tool sub-limit for disk-hungry tools (fuzzers, call-graph analyzers)
- Resource-based launch pause (RAM/swap/disk thresholds)
- Background thread monitoring Docker container writable-layer size (`docker inspect --size`)
- Per-instance hard timeout; status file for monitor integration
See docs/experiment_management_guide.md for the full design and reference implementation.
The included instances.json covers 35 combinations:
| Category | Tools | Targets |
|---|---|---|
| C/C++ analysis | AFLplusplus, Clang, cFlow, KLEE | curl, fastfetch, ImageMagick, masscan, radare2 |
| Java analysis | Infer, jvm-tools, WALA | checkstyle, closer-compiler, jmh, saxon-he, tika |
The augmented_doc/ directory contains LLM-generated runbooks for every tool/target pair. New pairs can be added with generate_brief_doc.py.
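The count of 35 follows from pairing every tool with every target inside each category (4 × 5 + 3 × 5). The sketch below reproduces that pairing and emits records with the JSONL keys shown earlier. It assumes the bundled file uses the same keys; the `"..."` URLs are placeholders carried over from the format example, and the name casing is copied from the table rather than from the actual file.

```python
import itertools
import json

# Names come from the table above; URLs are left as "..." placeholders.
CATEGORIES = {
    "C/C++ analysis": (
        ["AFLplusplus", "Clang", "cFlow", "KLEE"],
        ["curl", "fastfetch", "ImageMagick", "masscan", "radare2"],
    ),
    "Java analysis": (
        ["Infer", "jvm-tools", "WALA"],
        ["checkstyle", "closer-compiler", "jmh", "saxon-he", "tika"],
    ),
}


def build_instances(categories):
    """Pair every tool with every target inside each category."""
    instances = []
    for tools, targets in categories.values():
        for tool, target in itertools.product(tools, targets):
            instances.append(
                {
                    "tool_name": tool,
                    "tool_url": "...",
                    "target_name": target,
                    "target_url": "...",
                }
            )
    return instances


instances = build_instances(CATEGORIES)
jsonl_text = "\n".join(json.dumps(item) for item in instances)
print(len(instances))  # 4*5 + 3*5 = 35
```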
- Multi-stage workflow: `docker_setup` → `tool_setup` → `project_setup` → `analysis_run`
- Multi-attempt retry with cross-attempt learning
- Exit attempt: on full failure, generates a recovery Dockerfile and replay script
- Execution environments: local or Docker
- LLM-agnostic: uses LiteLLM — works with OpenAI, Anthropic, Deepseek, Gemini, and any compatible provider
- Experiment manager: queue-based batch runner with metrics tracking
- Replay web interface: browser-based viewer for inspecting and replaying past agent runs
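To make the staged workflow and cross-attempt retry from the list above concrete, here is a small sketch. Only the four stage names come from this README. The `AttemptResult` type, the stage-function signature, and the idea of passing "lessons" as plain strings are hypothetical and do not describe the agent's real internals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

STAGES = ["docker_setup", "tool_setup", "project_setup", "analysis_run"]

# Hypothetical stage signature: takes lessons so far, returns (ok, note).
StageFn = Callable[[List[str]], Tuple[bool, str]]


@dataclass
class AttemptResult:
    succeeded: bool
    failed_stage: str = ""
    lessons: List[str] = field(default_factory=list)


def run_attempt(stage_fns: Dict[str, StageFn], lessons: List[str]) -> AttemptResult:
    """Run the stages in order; stop at the first failure and record a lesson."""
    for stage in STAGES:
        ok, note = stage_fns[stage](lessons)
        if not ok:
            return AttemptResult(False, stage, lessons + [f"{stage}: {note}"])
    return AttemptResult(True, lessons=list(lessons))


def run_with_retries(stage_fns: Dict[str, StageFn], max_attempts: int = 3):
    """Retry whole attempts, carrying lessons from failed attempts forward."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    lessons: List[str] = []
    result = AttemptResult(False)
    for attempt in range(1, max_attempts + 1):
        result = run_attempt(stage_fns, lessons)
        if result.succeeded:
            return attempt, result
        lessons = result.lessons
    return max_attempts, result
```

Each stage function receives the lessons accumulated so far, which is one simple way a later attempt could avoid repeating an earlier mistake.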
We evaluate AnalysisAgent against three baselines — RAG-Agent, Mini-SWE-Agent, and ExecutionAgent — across 35 benchmark tasks and 4 LLM backends, with 3 repetitions per configuration (560 total runs). All agents share per-task limits of 120 cycles, $2 API cost, and 5 h wall-clock time.
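The shared per-task limits can be expressed as a small budget check. The three default values come from the sentence above. The `Budget` class, its method name, and the choice to treat reaching a limit (rather than exceeding it) as exhaustion are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_cycles: int = 120            # agent cycles per task
    max_cost_usd: float = 2.0        # API spend per task
    max_seconds: float = 5 * 3600    # wall-clock time per task (5 h)

    def exhausted(self, cycles, cost_usd, started_at, now=None):
        """Return the name of the first limit reached, or None if within budget."""
        now = time.monotonic() if now is None else now
        if cycles >= self.max_cycles:
            return "cycles"
        if cost_usd >= self.max_cost_usd:
            return "cost"
        if now - started_at >= self.max_seconds:
            return "time"
        return None
```

An agent loop would call `exhausted(...)` once per cycle, passing a start time taken from `time.monotonic()`, and stop as soon as it returns a limit name.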
| Agent | Avg. Verified Success |
|---|---|
| RAG-Agent | 10% |
| Mini-SWE-Agent | 37% |
| ExecutionAgent | 57% |
| AnalysisAgent | 79% |
AnalysisAgent achieves 94% verified success with both Gemini-3-Flash and DeepSeek-V3.2 (33/35 tasks), demonstrating that purpose-built scaffolding reliably handles multi-step tool installation, project building, and analysis evidence production.
| Agent | GPT-5-nano | GPT-5-mini | DeepSeek-V3.2 | Gemini-3-Flash |
|---|---|---|---|---|
| RAG-Agent | 9% | 6% | 3% | 23% |
| Mini-SWE-Agent | 9% | 20% | 57% | 63% |
| ExecutionAgent | 40% | 54% | 57% | 77% |
| AnalysisAgent | 54% | 75% | 94% | 94% |
Verified success rates are from manual review. Self-validated rates (mean ± std over n = 3 runs) are available in the paper.
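The averages in the first table are consistent with an unweighted mean of the four per-model rates. The check below assumes each backend contributes equally and that the per-model columns are in the order GPT-5-nano, GPT-5-mini, DeepSeek-V3.2, Gemini-3-Flash.

```python
from statistics import mean

# Per-model verified success rates (%), copied from the table above.
PER_MODEL = {
    "RAG-Agent": [9, 6, 3, 23],
    "Mini-SWE-Agent": [9, 20, 57, 63],
    "ExecutionAgent": [40, 54, 57, 77],
    "AnalysisAgent": [54, 75, 94, 94],
}

averages = {agent: round(mean(rates)) for agent, rates in PER_MODEL.items()}
print(averages)
# {'RAG-Agent': 10, 'Mini-SWE-Agent': 37, 'ExecutionAgent': 57, 'AnalysisAgent': 79}
```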
AnalysisAgent's advantage is large, consistent across all LLM backends, and statistically significant (Fisher's exact test, Holm-Bonferroni corrected):
| Comparison | Odds Ratio (95% CI) | Cohen's h | p (adjusted) |
|---|---|---|---|
| vs. RAG-Agent | 34.5 [17.3, 68.5] | 1.55 | < 0.001 |
| vs. Mini-SWE-Agent | 8.1 [4.0, 16.2] | 0.92 | < 0.001 |
| vs. ExecutionAgent | 2.7 [1.6, 4.6] | 0.45 | < 0.001 |
The Cochran-Mantel-Haenszel test confirms the advantage holds across all LLM backends (p < 10⁻⁴ for all comparisons).
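For readers unfamiliar with the effect-size and correction terms above, the sketch below gives their textbook definitions in plain Python. It is not the paper's analysis script: the run-level counts are not included here, so the functions are shown on their own, and Fisher's exact test and the Cochran-Mantel-Haenszel test are left out.

```python
import math


def odds_ratio(success_a, fail_a, success_b, fail_b):
    """Sample odds ratio of a 2x2 table; undefined if fail_a or success_b is 0."""
    return (success_a * fail_b) / (fail_a * success_b)


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

As a rough guide, Cohen suggested reading h near 0.2 as small, 0.5 as medium, and 0.8 as large.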
| Tool | Avg. Success | Ecosystem |
|---|---|---|
| cflow | 64% | C/C++ |
| CSA (Clang Static Analyzer) | 55% | C/C++ |
| AFL++ | 52% | C/C++ |
| KLEE | 40% | C/C++ |
| Infer | 35% | Java |
| SJK (jvm-tools) | 35% | Java |
| WALA | 31% | Java |
Java tasks account for 62% of all failures across agents, reflecting the complexity of Java toolchains (classpaths, bytecode generation, JVM attachment) and heavyweight whole-program analyses.
- Stronger models are cheaper overall: weaker models (GPT-5-nano) require more cycles and wall-clock time despite lower per-token prices.
- Failed runs are expensive: 2.8x more cycles, 4.1x longer duration, and 1.3x higher cost than successful runs.
- Best throughput: Gemini-3-Flash achieves the lowest mean time per task (36 min) while tied for the highest success rate.
Analysis of 182 failing trajectories reveals distinct failure profiles per agent:
| Failure Mode | RAG-Agent | Mini-SWE-Agent | ExecutionAgent | AnalysisAgent |
|---|---|---|---|---|
| Docker/Build Failure | 60% | 11% | 20% | 7% |
| Analysis Tool Misuse | 7% | 44% | 23% | 50% |
| Malformed LLM Output | 20% | — | 50% | — |
| Budget/Time Exhausted | — | 19% | — | 29% |
| Incorrect Analysis Result | 11% | 48% | — | — |
AnalysisAgent has largely solved environment setup (only 7% Docker/build failures) and primarily fails during tool invocation — a qualitatively harder problem that represents the next frontier.
Key environment variables:
| Variable | Default | Description |
|---|---|---|
| `EXEC_AGENT_MODEL` | `gpt-5-nano` | LLM model to use |
| `EXEC_AGENT_IMAGE` | (unset) | Docker image; unset = local execution |
| `EXEC_AGENT_MODE` | `auto` | `auto` (runs continuously) or `step` (pause each cycle) |
| `EXEC_AGENT_LLM_USAGE_JSONL` | (unset) | Path to log per-call LLM usage |
See docs/AGENT_LAUNCH_GUIDE.md for the full CLI reference.
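As an illustration of how these variables and their defaults fit together, the following sketch reads them into a small settings object. The variable names and defaults come from the table. The `AgentEnv` class, the `load_agent_env` helper, the validation of the mode value, and the choice to treat an empty string as unset are assumptions, not the project's actual loading code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentEnv:
    model: str
    image: Optional[str]      # None means local execution
    mode: str                 # "auto" or "step"
    usage_log: Optional[str]  # None means no per-call usage log


def load_agent_env(env: Mapping[str, str] = os.environ) -> AgentEnv:
    """Read the documented variables, applying the documented defaults."""
    mode = env.get("EXEC_AGENT_MODE", "auto")
    if mode not in ("auto", "step"):
        raise ValueError(f"EXEC_AGENT_MODE must be 'auto' or 'step', got {mode!r}")
    return AgentEnv(
        model=env.get("EXEC_AGENT_MODEL", "gpt-5-nano"),
        image=env.get("EXEC_AGENT_IMAGE") or None,   # assumption: "" counts as unset
        mode=mode,
        usage_log=env.get("EXEC_AGENT_LLM_USAGE_JSONL") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.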
Inspect and replay past agent runs in a browser:
python -m analysis_agent.replay_web --port 8080
# Then open http://localhost:8080

Options: `--host`, `--port`, `--logs-dir /path/to/logs`.
pip install -e ".[dev]"
pytest tests/

- docs/AGENT_LAUNCH_GUIDE.md: full CLI and programmatic API reference
- docs/experiment_management_guide.md: orchestrator design for large-scale runs
- docs/experiment_management_guide_concise.md: concise summary of the orchestration architecture
If you use this code in your research, please cite:
@article{TODO,
title = {TODO},
author = {TODO},
year = {TODO},
}

MIT