sola-st/software-analysis-agent

Analysis Agent

An LLM-based agent that autonomously applies software analysis tools to target projects. Given an analysis tool (e.g. Clang, Infer, KLEE) and a target repository, the agent sets up the environment, installs the tool, prepares the project, and drives the analysis to completion — retrying and learning across attempts.

Installation

python -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python ≥ 3.10 and (optionally) Docker for isolated execution.

Quick Start

# Set your LLM API key
export OPENAI_API_KEY="sk-..."   # or ANTHROPIC_API_KEY, etc.

# Run on a single tool/target pair
python -m analysis_agent.main \
  --tool-name   "clang" \
  --tool-url    "https://github.com/llvm/llvm-project" \
  --target-name "curl" \
  --target-url  "https://github.com/curl/curl"

# Interactive wizard
python -m analysis_agent.main --interactive

Batch and Parallel Execution

# Batch from a JSONL file
python -m analysis_agent.main --instances-file instances.jsonl

# Parallel execution
python -m analysis_agent.run_parallel \
  --instances-file instances.jsonl \
  --workers 4

Instance file format (JSONL):

{"tool_name": "clang", "tool_url": "...", "target_name": "curl", "target_url": "..."}
{"tool_name": "infer", "tool_url": "...", "target_name": "checkstyle", "target_url": "..."}
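A malformed line in the instance file is easier to catch before a batch starts than halfway through it. The following is a minimal sketch of a pre-flight check, assuming only the four keys shown above are required; `load_instances` and `write_instances` are hypothetical helpers for illustration, not part of the `analysis_agent` package.

```python
import json
from pathlib import Path

# The four keys shown in the instance file format above.
REQUIRED_KEYS = ("tool_name", "tool_url", "target_name", "target_url")


def load_instances(path):
    """Parse a JSONL file and check that every record has the required keys."""
    instances = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"{path}:{lineno}: missing keys {missing}")
        instances.append(record)
    return instances


def write_instances(path, instances):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in instances:
            fh.write(json.dumps(record) + "\n")
```

Calling `load_instances("instances.jsonl")` before launching a batch raises a `ValueError` naming the offending line if a record is incomplete.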

Large-Scale Orchestrated Runs

For large benchmark experiments, the recommended approach is a custom orchestrator with live monitoring and an emergency kill switch. Key design principles:

  • Global concurrency cap + per-tool sub-limit for disk-hungry tools (fuzzers, call-graph analyzers)
  • Resource-based launch pause (RAM/swap/disk thresholds)
  • Background thread monitoring Docker container writable-layer size (docker inspect --size)
  • Per-instance hard timeout; status file for monitor integration

See docs/experiment_management_guide.md for the full design and reference implementation.
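The first two design principles can be sketched with standard-library primitives. This is an illustration under stated assumptions, not the reference implementation from the guide: the limits, the `HEAVY_TOOLS` set, and the helper names (`has_free_disk`, `wait_for_resources`, `run_with_limits`) are all hypothetical.

```python
import shutil
import threading
import time

GLOBAL_LIMIT = 4        # assumed global concurrency cap
HEAVY_TOOL_LIMIT = 1    # assumed sub-limit for disk-hungry tools
HEAVY_TOOLS = {"AFLplusplus", "WALA"}  # assumed classification, for illustration

global_slots = threading.BoundedSemaphore(GLOBAL_LIMIT)
heavy_slots = threading.BoundedSemaphore(HEAVY_TOOL_LIMIT)


def has_free_disk(path="/", min_free_gb=20.0):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb


def wait_for_resources(check, poll_seconds=30.0, max_wait_seconds=None):
    """Pause launches until `check()` passes; return False if the wait times out."""
    waited = 0.0
    while not check():
        if max_wait_seconds is not None and waited >= max_wait_seconds:
            return False
        time.sleep(poll_seconds)
        waited += poll_seconds
    return True


def run_with_limits(instance, run_instance, resource_check=None):
    """Run one instance under the global cap and, if needed, the heavy-tool cap."""
    if resource_check is not None:
        wait_for_resources(resource_check)
    heavy = instance["tool_name"] in HEAVY_TOOLS
    # Take the sub-limit first so queued heavy jobs do not occupy global slots.
    if heavy:
        heavy_slots.acquire()
    try:
        with global_slots:
            return run_instance(instance)
    finally:
        if heavy:
            heavy_slots.release()
```

A caller would typically submit `run_with_limits(instance, run_instance, has_free_disk)` to a thread pool. The per-instance hard timeout and the container-size monitor are left out here to keep the sketch short.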

Supported Tools and Targets

The included instances.json covers 35 combinations:

| Category | Tools | Targets |
|---|---|---|
| C/C++ analysis | AFLplusplus, Clang, cFlow, KLEE | curl, fastfetch, ImageMagick, masscan, radare2 |
| Java analysis | Infer, jvm-tools, WALA | checkstyle, closure-compiler, jmh, saxon-he, tika |

The augmented_doc/ directory contains LLM-generated runbooks for every tool/target pair. New pairs can be added with generate_brief_doc.py.

Key Features

  • Multi-stage workflow: docker_setup → tool_setup → project_setup → analysis_run
  • Multi-attempt retry with cross-attempt learning
  • Exit attempt: on full failure, generates a recovery Dockerfile and replay script
  • Execution environments: local or Docker
  • LLM-agnostic: uses LiteLLM — works with OpenAI, Anthropic, Deepseek, Gemini, and any compatible provider
  • Experiment manager: queue-based batch runner with metrics tracking
  • Replay web interface: browser-based viewer for inspecting and replaying past agent runs
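The interaction between the staged workflow and the multi-attempt retry can be sketched as a small control loop. This is a simplified illustration, not the agent's actual code: it assumes the four stages are `docker_setup`, `tool_setup`, `project_setup`, and `analysis_run`, run in that order, and the `run_stage` callback and the shape of the "lessons" list are hypothetical.

```python
# Stage names taken from the feature list; the order is assumed to be sequential.
STAGES = ("docker_setup", "tool_setup", "project_setup", "analysis_run")


def run_workflow(run_stage, max_attempts=3):
    """Run all stages in order, restarting from scratch after a failed stage.

    `run_stage(stage, attempt, lessons)` is a hypothetical callback returning
    `(ok, note)`. Notes from failed stages are carried into later attempts,
    which is one simple way to model cross-attempt learning.
    """
    lessons = []
    for attempt in range(1, max_attempts + 1):
        for stage in STAGES:
            ok, note = run_stage(stage, attempt, tuple(lessons))
            if not ok:
                lessons.append(f"attempt {attempt}, {stage}: {note}")
                break  # abandon this attempt and start the next one
        else:
            # The inner loop finished without a break: every stage succeeded.
            return {"success": True, "attempts": attempt, "lessons": lessons}
    return {"success": False, "attempts": max_attempts, "lessons": lessons}
```

In this sketch a failed result after the last attempt is the point where an exit attempt (recovery Dockerfile and replay script) would be triggered.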

Evaluation Results

We evaluate AnalysisAgent against three baselines — RAG-Agent, Mini-SWE-Agent, and ExecutionAgent — across 35 benchmark tasks and 4 LLM backends, with 3 repetitions per configuration (560 total runs). All agents share per-task limits of 120 cycles, $2 API cost, and 5 h wall-clock time.

Headline Numbers

| Agent | Avg. Verified Success |
|---|---|
| RAG-Agent | 10% |
| Mini-SWE-Agent | 37% |
| ExecutionAgent | 57% |
| AnalysisAgent | 79% |

AnalysisAgent achieves 94% verified success with both Gemini-3-Flash and DeepSeek-V3.2 (33/35 tasks), demonstrating that purpose-built scaffolding reliably handles multi-step tool installation, project building, and analysis evidence production.

Success Rates by Agent and LLM Backend

| Agent | GPT-5-nano | GPT-5-mini | DeepSeek-V3.2 | Gemini-3-Flash |
|---|---|---|---|---|
| RAG-Agent | 9% | 6% | 3% | 23% |
| Mini-SWE-Agent | 9% | 20% | 57% | 63% |
| ExecutionAgent | 40% | 54% | 57% | 77% |
| AnalysisAgent | 54% | 75% | 94% | 94% |

Verified success rates come from manual review. Self-validated rates (mean ± std over n=3 runs) are available in the paper.
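The headline averages appear consistent with an unweighted mean of the four per-backend rates. A quick check, assuming equal weighting across backends (an assumption, since the averaging method is not stated here):

```python
# Per-backend verified success rates (percent), in the order
# GPT-5-nano, GPT-5-mini, DeepSeek-V3.2, Gemini-3-Flash.
RATES = {
    "RAG-Agent": [9, 6, 3, 23],
    "Mini-SWE-Agent": [9, 20, 57, 63],
    "ExecutionAgent": [40, 54, 57, 77],
    "AnalysisAgent": [54, 75, 94, 94],
}


def unweighted_mean(values):
    """Plain arithmetic mean; assumes every backend counts equally."""
    return sum(values) / len(values)


averages = {agent: round(unweighted_mean(r)) for agent, r in RATES.items()}
print(averages)

# 33 of 35 tasks, expressed as a rounded percentage.
print(round(100 * 33 / 35))
```

The rounded means come out to 10, 37, 57, and 79, matching the headline numbers, and 33/35 rounds to 94%.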

Statistical Significance

AnalysisAgent's advantage is large, consistent across all LLM backends, and statistically significant (Fisher's exact test, Holm-Bonferroni corrected):

| Comparison | Odds Ratio (95% CI) | Cohen's h | p_adj |
|---|---|---|---|
| vs. RAG-Agent | 34.5 [17.3, 68.5] | 1.55 | < 0.001 |
| vs. Mini-SWE-Agent | 8.1 [4.0, 16.2] | 0.92 | < 0.001 |
| vs. ExecutionAgent | 2.7 [1.6, 4.6] | 0.45 | < 0.001 |

The Cochran-Mantel-Haenszel test confirms that the advantage holds across all LLM backends (p < 10^-4 for all comparisons).
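For readers unfamiliar with the effect sizes in the table, the sketch below shows the textbook definitions of the odds ratio, Cohen's h, and the Holm-Bonferroni adjustment. It is not the analysis script behind these results: it takes the rounded headline percentages as input, so its output only approximates the table, and it computes neither Fisher's exact test nor the confidence intervals.

```python
import math


def odds_ratio(p1, p2):
    """Ratio of the odds of success under p1 to the odds under p2."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is scaled by (m - k), then made monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Rounded headline rates: AnalysisAgent 79% vs. RAG-Agent 10%.
print(round(cohens_h(0.79, 0.10), 2))
print(round(odds_ratio(0.79, 0.10), 1))
```

With these rounded inputs Cohen's h comes out close to the tabulated 1.55, while the odds ratio is slightly below the tabulated 34.5, which is presumably computed from exact counts.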

Tool and Ecosystem Patterns

| Tool | Avg. Success | Ecosystem |
|---|---|---|
| cflow | 64% | C/C++ |
| CSA (Clang Static Analyzer) | 55% | C/C++ |
| AFL++ | 52% | C/C++ |
| KLEE | 40% | C/C++ |
| Infer | 35% | Java |
| SJK (jvm-tools) | 35% | Java |
| WALA | 31% | Java |

Java tasks account for 62% of all failures across agents, reflecting the complexity of Java toolchains (classpaths, bytecode generation, JVM attachment) and heavyweight whole-program analyses.

Efficiency Highlights

  • Stronger models are cheaper overall: weaker models (GPT-5-nano) require more cycles and wall-clock time despite lower per-token prices.
  • Failed runs are expensive: 2.8x more cycles, 4.1x longer duration, and 1.3x higher cost than successful runs.
  • Best throughput: Gemini-3-Flash achieves the lowest mean time per task (36 min) while tied for the highest success rate.

Failure Taxonomy

Analysis of 182 failing trajectories reveals distinct failure profiles per agent:

Failure Mode RAG-Agent Mini-SWE-Agent ExecutionAgent AnalysisAgent
Docker/Build Failure 60% 11% 20% 7%
Analysis Tool Misuse 7% 44% 23% 50%
Malformed LLM Output 20% 50%
Budget/Time Exhausted 19% 29%
Incorrect Analysis Result 11% 48%

AnalysisAgent has largely solved environment setup (only 7% Docker/build failures) and primarily fails during tool invocation — a qualitatively harder problem that represents the next frontier.


Configuration

Key environment variables:

| Variable | Default | Description |
|---|---|---|
| EXEC_AGENT_MODEL | gpt-5-nano | LLM model to use |
| EXEC_AGENT_IMAGE | (unset) | Docker image; unset = local execution |
| EXEC_AGENT_MODE | auto | auto (runs continuously) or step (pauses each cycle) |
| EXEC_AGENT_LLM_USAGE_JSONL | (unset) | Path to log per-call LLM usage |

See docs/AGENT_LAUNCH_GUIDE.md for the full CLI reference.
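When wrapping the agent in your own scripts, it can help to resolve these variables in one place. The sketch below mirrors the defaults listed above. `AgentConfig` and `load_config` are hypothetical names for illustration and may not match how the package reads its configuration internally.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentConfig:
    model: str                 # EXEC_AGENT_MODEL
    image: Optional[str]       # EXEC_AGENT_IMAGE; None means local execution
    mode: str                  # EXEC_AGENT_MODE: "auto" or "step"
    usage_log: Optional[str]   # EXEC_AGENT_LLM_USAGE_JSONL

    @property
    def use_docker(self):
        return self.image is not None


def load_config(env=None):
    """Build a config from environment variables, applying the listed defaults."""
    env = os.environ if env is None else env
    mode = env.get("EXEC_AGENT_MODE", "auto")
    if mode not in ("auto", "step"):
        raise ValueError(f"EXEC_AGENT_MODE must be 'auto' or 'step', got {mode!r}")
    return AgentConfig(
        model=env.get("EXEC_AGENT_MODEL", "gpt-5-nano"),
        image=env.get("EXEC_AGENT_IMAGE") or None,  # empty string counts as unset
        mode=mode,
        usage_log=env.get("EXEC_AGENT_LLM_USAGE_JSONL") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests, for example `load_config({"EXEC_AGENT_MODE": "step"})`.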

Replay Web Interface

Inspect and replay past agent runs in a browser:

python -m analysis_agent.replay_web --port 8080
# Then open http://localhost:8080

Options: --host, --port, --logs-dir /path/to/logs.

Running Tests

pip install -e ".[dev]"
pytest tests/

Citation

If you use this code in your research, please cite:

@article{TODO,
  title   = {TODO},
  author  = {TODO},
  year    = {TODO},
}

License

MIT
