Analysis Agent

An LLM-based agent that autonomously applies software analysis tools to target projects. Given an analysis tool (e.g. Clang, Infer, KLEE) and a target repository, the agent sets up the environment, installs the tool, prepares the project, and drives the analysis to completion — retrying and learning across attempts.

Installation

python -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python ≥ 3.10 and (optionally) Docker for isolated execution.

Quick Start

# Set your LLM API key
export OPENAI_API_KEY="sk-..."   # or ANTHROPIC_API_KEY, etc.

# Run on a single tool/target pair
python -m analysis_agent.main \
  --tool-name   "clang" \
  --tool-url    "https://github.com/llvm/llvm-project" \
  --target-name "curl" \
  --target-url  "https://github.com/curl/curl"

# Interactive wizard
python -m analysis_agent.main --interactive

Batch and Parallel Execution

# Batch from a JSONL file
python -m analysis_agent.main --instances-file instances.jsonl

# Parallel execution
python -m analysis_agent.run_parallel \
  --instances-file instances.jsonl \
  --workers 4

Instance file format (JSONL):

{"tool_name": "clang", "tool_url": "...", "target_name": "curl", "target_url": "..."}
{"tool_name": "infer", "tool_url": "...", "target_name": "checkstyle", "target_url": "..."}
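Before a long batch run it can help to sanity-check the instance file. The sketch below is only an illustration: `load_instances` and `to_cli_args` are hypothetical helpers, not part of the package. It assumes nothing beyond the four field names shown above and the single-pair flags from the Quick Start.

```python
import json
from pathlib import Path

# Field names taken from the instance file format shown above.
REQUIRED_FIELDS = ("tool_name", "tool_url", "target_name", "target_url")


def load_instances(path):
    """Parse a JSONL file and return a list of validated instance dicts."""
    instances = []
    lines = Path(path).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            raise ValueError(f"line {line_no}: missing {', '.join(missing)}")
        instances.append(record)
    return instances


def to_cli_args(instance):
    """Map one instance onto the single-pair flags from the Quick Start."""
    return [
        "python", "-m", "analysis_agent.main",
        "--tool-name", instance["tool_name"],
        "--tool-url", instance["tool_url"],
        "--target-name", instance["target_name"],
        "--target-url", instance["target_url"],
    ]
```

Catching a missing field here is cheaper than discovering it after a batch has already started.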

Large-Scale Orchestrated Runs

For large benchmark experiments, the recommended approach is a custom orchestrator with live monitoring and an emergency kill switch. Key design principles:

Global concurrency cap + per-tool sub-limit for disk-hungry tools (fuzzers, call-graph analyzers)
Resource-based launch pause (RAM/swap/disk thresholds)
Background thread monitoring Docker container writable-layer size (docker inspect --size)
Per-instance hard timeout; status file for monitor integration

See docs/experiment_management_guide.md for the full design and reference implementation.
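As a rough illustration of the first two principles (a global cap with a per-tool sub-limit, plus a resource-based launch pause), here is a minimal sketch. It is not the reference implementation from the guide. Every name and threshold in it is an assumption chosen for the example, and it leaves out container-size monitoring and hard timeouts.

```python
import shutil
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# All names and thresholds below are illustrative assumptions.
GLOBAL_CAP = 4                         # global concurrency cap
HEAVY_TOOL_CAP = 1                     # sub-limit for disk-hungry tools
HEAVY_TOOLS = {"AFLplusplus", "WALA"}  # example fuzzer / call-graph analyzer
MIN_FREE_DISK = 0.10                   # pause launches below 10% free disk

heavy_slots = threading.BoundedSemaphore(HEAVY_TOOL_CAP)


def disk_ok(path="/", min_free=MIN_FREE_DISK):
    """Return True while the free-disk fraction stays above the threshold."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free


def run_instance(instance, runner):
    """Run one instance, holding a heavy-tool slot if its tool needs one."""
    heavy = instance["tool_name"] in HEAVY_TOOLS
    if heavy:
        heavy_slots.acquire()
    try:
        return runner(instance)
    finally:
        if heavy:
            heavy_slots.release()


def run_all(instances, runner, can_launch=disk_ok, poll_seconds=30):
    """Submit instances under the global cap, pausing while resources are low."""
    with ThreadPoolExecutor(max_workers=GLOBAL_CAP) as pool:
        futures = []
        for instance in instances:
            while not can_launch():
                time.sleep(poll_seconds)  # resource-based launch pause
            futures.append(pool.submit(run_instance, instance, runner))
        return [future.result() for future in futures]
```

In this sketch a heavy instance that is waiting for its slot still occupies a pool worker. A production orchestrator would more likely keep heavy instances in their own queue.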

Supported Tools and Targets

The included instances.json covers 35 combinations:

Category	Tools	Targets
C/C++ analysis	AFLplusplus, Clang, cFlow, KLEE	curl, fastfetch, ImageMagick, masscan, radare2
Java analysis	Infer, jvm-tools, WALA	checkstyle, closer-compiler, jmh, saxon-he, tika
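
The count of 35 is consistent with pairing every tool with every target inside its own category (4 × 5 + 3 × 5). The sketch below checks that arithmetic. It assumes this within-category pairing, which the table suggests but does not state, and copies the names verbatim from the table.

```python
from itertools import product

# Tool and target names copied verbatim from the table above.
CATEGORIES = {
    "C/C++ analysis": (
        ["AFLplusplus", "Clang", "cFlow", "KLEE"],
        ["curl", "fastfetch", "ImageMagick", "masscan", "radare2"],
    ),
    "Java analysis": (
        ["Infer", "jvm-tools", "WALA"],
        ["checkstyle", "closer-compiler", "jmh", "saxon-he", "tika"],
    ),
}


def enumerate_pairs(categories=CATEGORIES):
    """Return every (tool, target) pair, pairing only within a category."""
    pairs = []
    for tools, targets in categories.values():
        pairs.extend(product(tools, targets))
    return pairs


pairs = enumerate_pairs()
print(len(pairs))  # 4*5 + 3*5 = 35
```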

The augmented_doc/ directory contains LLM-generated runbooks for every tool/target pair. New pairs can be added with generate_brief_doc.py.

Key Features

Multi-stage workflow: docker_setup → tool_setup → project_setup → analysis_run
Multi-attempt retry with cross-attempt learning
Exit attempt: on full failure, generates a recovery Dockerfile and replay script
Execution environments: local or Docker
LLM-agnostic: uses LiteLLM, so it works with OpenAI, Anthropic, DeepSeek, Gemini, and any other LiteLLM-compatible provider
Experiment manager: queue-based batch runner with metrics tracking
Replay web interface: browser-based viewer for inspecting and replaying past agent runs

Evaluation Results

We evaluate AnalysisAgent against three baselines (RAG-Agent, Mini-SWE-Agent, and ExecutionAgent) across 35 benchmark tasks and 4 LLM backends, with 3 repetitions per configuration (4 agents × 35 tasks × 4 backends = 560 configurations). All agents share per-task limits of 120 cycles, $2 API cost, and 5 h wall-clock time.

Headline Numbers

Agent	Avg. Verified Success
RAG-Agent	10%
Mini-SWE-Agent	37%
ExecutionAgent	57%
AnalysisAgent	79%

AnalysisAgent achieves 94% verified success with both Gemini-3-Flash and DeepSeek-V3.2 (33/35 tasks), demonstrating that purpose-built scaffolding reliably handles multi-step tool installation, project building, and analysis evidence production.

Success Rates by Agent and LLM Backend

Agent	GPT-5-nano	GPT-5-mini	DeepSeek-V3.2	Gemini-3-Flash
RAG-Agent	9%	6%	3%	23%
Mini-SWE-Agent	9%	20%	57%	63%
ExecutionAgent	40%	54%	57%	77%
AnalysisAgent	54%	75%	94%	94%

Verified success rates come from manual review. Self-validated rates (mean ± std over n = 3 runs) are available in the paper.
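
The headline averages are consistent with an unweighted mean of each agent's four per-backend rates. The sketch below reproduces that arithmetic from the table above. Treating the headline figure as a simple mean is an assumption, since the aggregation method is not stated here.

```python
# Verified success rates (%) copied from the per-backend table above.
BACKENDS = ("GPT-5-nano", "GPT-5-mini", "DeepSeek-V3.2", "Gemini-3-Flash")
RATES = {
    "RAG-Agent": (9, 6, 3, 23),
    "Mini-SWE-Agent": (9, 20, 57, 63),
    "ExecutionAgent": (40, 54, 57, 77),
    "AnalysisAgent": (54, 75, 94, 94),
}


def mean_rate(agent):
    """Unweighted mean of one agent's success rate over the four backends."""
    rates = RATES[agent]
    return sum(rates) / len(rates)


for agent in RATES:
    print(f"{agent}: {mean_rate(agent):.2f}%")
# RAG-Agent: 10.25%
# Mini-SWE-Agent: 37.25%
# ExecutionAgent: 57.00%
# AnalysisAgent: 79.25%
```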

Statistical Significance

AnalysisAgent's advantage is large, consistent across all LLM backends, and statistically significant (Fisher's exact test, Holm-Bonferroni corrected):

Comparison	Odds Ratio (95% CI)	Cohen's h	p_adj
vs. RAG-Agent	34.5 [17.3, 68.5]	1.55	< 0.001
vs. Mini-SWE-Agent	8.1 [4.0, 16.2]	0.92	< 0.001
vs. ExecutionAgent	2.7 [1.6, 4.6]	0.45	< 0.001

The Cochran-Mantel-Haenszel test confirms the advantage holds across all LLM backends (p < 10^-4 for all comparisons).
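
For readers unfamiliar with the quantities in the table, here is a small standard-library sketch of two of them: Cohen's h for a pair of proportions, and the Holm-Bonferroni adjustment for a set of p-values. It is not the evaluation script. The reported values were presumably computed from raw counts, so feeding in the rounded headline rates may not reproduce every table entry exactly.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transformation)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Headline rates for AnalysisAgent (79%) and RAG-Agent (10%).
print(round(cohens_h(0.79, 0.10), 2))  # 1.55
```

By Cohen's conventional thresholds, h around 0.2 is small, 0.5 medium, and 0.8 large.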

Tool and Ecosystem Patterns

Tool	Avg. Success	Ecosystem
cflow	64%	C/C++
CSA (Clang Static Analyzer)	55%	C/C++
AFL++	52%	C/C++
KLEE	40%	C/C++
Infer	35%	Java
SJK (jvm-tools)	35%	Java
WALA	31%	Java

Java tasks account for 62% of all failures across agents, reflecting the complexity of Java toolchains (classpaths, bytecode generation, JVM attachment) and heavyweight whole-program analyses.

Efficiency Highlights

Stronger models are cheaper overall: weaker models (GPT-5-nano) require more cycles and wall-clock time despite lower per-token prices.
Failed runs are expensive: 2.8x more cycles, 4.1x longer duration, and 1.3x higher cost than successful runs.
Best throughput: Gemini-3-Flash achieves the lowest mean time per task (36 min) while tied for the highest success rate.

Failure Taxonomy

Analysis of 182 failing trajectories reveals distinct failure profiles per agent:

Failure Mode	RAG-Agent	Mini-SWE-Agent	ExecutionAgent	AnalysisAgent
Docker/Build Failure	60%	11%	20%	7%
Analysis Tool Misuse	7%	44%	23%	50%
Malformed LLM Output	20%	—	50%	—
Budget/Time Exhausted	—	19%	—	29%
Incorrect Analysis Result	11%	48%	—	—

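AnalysisAgent has largely solved environment setup (Docker/build failures make up only 7% of its failing runs) and fails mainly during tool invocation, where analysis tool misuse accounts for 50% of its failures. Tool invocation is a qualitatively harder problem and the main target for further work.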

Configuration

Key environment variables:

Variable	Default	Description
`EXEC_AGENT_MODEL`	`gpt-5-nano`	LLM model to use
`EXEC_AGENT_IMAGE`	(unset)	Docker image; unset = local execution
`EXEC_AGENT_MODE`	`auto`	`auto` (runs continuously) or `step` (pauses at each cycle)
`EXEC_AGENT_LLM_USAGE_JSONL`	(unset)	Path to log per-call LLM usage

See docs/AGENT_LAUNCH_GUIDE.md for the full CLI reference.

Replay Web Interface

Inspect and replay past agent runs in a browser:

python -m analysis_agent.replay_web --port 8080
# Then open http://localhost:8080

Options: --host, --port, --logs-dir /path/to/logs.

Running Tests

pip install -e ".[dev]"
pytest tests/

Documentation

docs/AGENT_LAUNCH_GUIDE.md — full CLI and programmatic API reference
docs/experiment_management_guide.md — orchestrator design for large-scale runs
docs/experiment_management_guide_concise.md — concise summary of the orchestration architecture

Citation

If you use this code in your research, please cite:

@article{TODO,
  title   = {TODO},
  author  = {TODO},
  year    = {TODO},
}

License

MIT
