
ArgusAudit

Automated Claim Verification for Policy and Compliance

When querying dense SLAs, vendor agreements, or internal financial policies, a single rule often contains multiple strict conditions. Standard RAG pipelines find evidence for one condition, ignore the others, and synthesize a confident paragraph that appears to satisfy them all.

ArgusAudit solves this liability in regulated environments through an agentic verification pipeline that treats compliance as a classification problem. Every query is decomposed into discrete verifiable claims, each graded independently against the source material. Three claims in means three separate verdicts out.

AI Analysis Demo

---

Example

Input: "All production S3 buckets must use AES-256 encryption, rotate keys every 90 days, and replicate data to the 'us-west-2' region."

| Status | Claim | Evidence | Source |
|---|---|---|---|
| SUPPORTED | AES-256 encryption required | "All production data must use AES-256..." | InfoSec Std v4.1 (p.12) |
| REFUTED | Keys rotate every 90 days | Policy mandates annual rotation only | Key Mgmt Policy (p.4) |
| UNKNOWN | Replicate to us-west-2 | Policy mentions offsite backup, no region specified | No context found |

Three claims in, three separate verdicts out. A standard RAG pipeline returns one confident paragraph that passes all three.


How It Works

ArgusAudit runs a multi-step verification pipeline where each stage is isolated, gated, and independently observable — decomposition, retrieval, grading, and guardrail validation never share state.

Claim Decomposition

A query like "AES-256 encryption and 90-day key rotation" becomes a single vector search whose embedding averages the meaning of the full sentence. Strong matches on "encryption" drown out weak matches on "rotation", and the LLM fills the gap with confident text.

ArgusAudit splits the query before retrieval. Each claim is an isolated unit with its own retrieval pass and its own verdict. Nothing bleeds into anything else.
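The split can be sketched in a few lines of Python. The conjunction-splitting heuristic below stands in for the LLM decomposition step, and the `Claim` shape is an assumption, not the project's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """One independently verifiable unit, carrying its own retrieval results."""
    text: str
    snippets: list = field(default_factory=list)

def decompose(query: str) -> list[Claim]:
    # Stand-in for the LLM decomposition step: split on conjunctions.
    # The real pipeline would prompt a model to emit discrete claims.
    parts = [p.strip(" .") for p in query.replace(", and ", ", ").split(", ")]
    return [Claim(text=p) for p in parts if p]

claims = decompose(
    "All production S3 buckets must use AES-256 encryption, "
    "rotate keys every 90 days, and replicate data to the 'us-west-2' region."
)
for c in claims:
    print(c.text)  # three isolated claims, each retrieved and graded alone
```

Each `Claim` then gets its own retrieval pass, so a strong hit for one condition cannot mask a miss for another.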

Blind Evidence Grading

The model grades evidence without seeing source metadata. Retrieved chunks are passed as anonymous snippets indexed by integer. The LLM returns a verdict tied to a snippet ID. The orchestration layer resolves that ID back to the document name and page number after grading is complete.

The model cannot fabricate a citation because it never had one to work from.
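A minimal sketch of the anonymize-then-resolve flow, assuming chunk dicts with `text`, `doc`, and `page` fields (the pipeline's actual field names may differ):

```python
def anonymize(chunks):
    """Strip source metadata before the chunks reach the model."""
    snippets = {i: c["text"] for i, c in enumerate(chunks)}
    sources = {i: (c["doc"], c["page"]) for i, c in enumerate(chunks)}
    return snippets, sources  # model sees snippets; orchestrator keeps sources

def resolve(verdict, sources):
    """Re-attach document name and page number after grading is complete."""
    doc, page = sources[verdict["snippet_id"]]
    return {**verdict, "source": f"{doc} (p.{page})"}

chunks = [
    {"text": "All production data must use AES-256...", "doc": "InfoSec Std v4.1", "page": 12},
    {"text": "Keys shall be rotated annually.", "doc": "Key Mgmt Policy", "page": 4},
]
snippets, sources = anonymize(chunks)
# The LLM is shown only `snippets` and returns a verdict tied to an integer ID:
verdict = {"claim": "AES-256 encryption required", "status": "SUPPORTED", "snippet_id": 0}
print(resolve(verdict, sources)["source"])
```

Because the citation is looked up by the orchestrator rather than generated by the model, a verdict can only ever point at a snippet that was actually retrieved.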

Hybrid Retrieval

Retrieval combines dense and sparse signals using Reciprocal Rank Fusion:

  • BAAI/bge-small-en for semantic intent matching
  • BM25 for exact keyword hits on clause numbers, defined terms, and specific identifiers
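Reciprocal Rank Fusion merges the two ranked lists without needing their scores to be comparable: each document earns 1 / (k + rank) from every list it appears in. A sketch using the standard formulation with k = 60 (the document IDs and rankings are made up):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # semantic (bge-small-en) order
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 order
print(rrf([dense, sparse]))
```

Documents ranked well by both signals (here `doc_a`) float to the top, while a keyword-only hit like `doc_d` still survives into the fused list.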

Guardrail

Before any response is streamed, a final structured pass validates that every claim in the output is grounded in a retrieved snippet. Output is rejected at the schema level if a claim lacks an evidence reference. The system does not release ungrounded findings.
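The gate can be sketched as a plain validation function; this is a stand-in for the Pydantic-schema check described above, and the field names and the rule that UNKNOWN verdicts abstain without a citation are assumptions:

```python
class UngroundedFinding(ValueError):
    """Raised when a verdict is not backed by a retrieved snippet."""

def gate(findings):
    """Schema-level guardrail: SUPPORTED/REFUTED verdicts must cite a
    snippet; UNKNOWN is an explicit abstention and carries no citation."""
    for f in findings:
        if f["status"] in ("SUPPORTED", "REFUTED") and f.get("snippet_id") is None:
            raise UngroundedFinding(f"claim {f['claim']!r} has no evidence reference")
    return findings

ok = gate([{"claim": "AES-256 required", "status": "SUPPORTED", "snippet_id": 0}])
try:
    gate([{"claim": "90-day rotation", "status": "REFUTED", "snippet_id": None}])
except UngroundedFinding as e:
    print("rejected:", e)
```

Rejecting at this layer means an ungrounded finding fails loudly before streaming starts, rather than being caught (or missed) by a reviewer downstream.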


Command Modes

Claim Verification Mode

@c runs the full verification pipeline: decomposition, hybrid retrieval, blind grading, report generation, and guardrail check.


@c Verify that all vendors must hold ISO 27001 certification.

AI Analysis Demo

Retrieval Mode

@r runs a direct vector search against the index. Useful for fast document lookups that do not require structured grading.


@r What is the deductible for general liability?

AI Analysis Demo

Architecture

ArgusAudit Pipeline

The system uses an agentic LangGraph workflow designed for high reliability. The agent follows a strict, self-monitoring process:

  • Evidence-Gated Generation: The agent must successfully find and score relevant evidence before it can formulate a response.

  • Self-Correction: If the system fails to find evidence or exhausts its execution limits (budget), it automatically adjusts its approach.

  • Output Validation: All final responses are cross-checked against the retrieved evidence to guarantee accuracy before release.
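The gated loop above can be sketched as a plain function. This is illustrative only: the node names, retry policy, and toy retriever are assumptions, not the actual LangGraph graph:

```python
def run_pipeline(claim, retrieve, grade, budget=3):
    """Toy evidence-gated loop: retry retrieval until evidence is found or
    the budget is exhausted, then either grade or abstain explicitly."""
    for attempt in range(budget):
        evidence = retrieve(claim, attempt)
        if evidence:
            return grade(claim, evidence)        # generation gated on evidence
    return {"claim": claim, "status": "UNKNOWN"}  # self-correction: abstain

# Toy retriever that only succeeds on a broadened second pass.
def retrieve(claim, attempt):
    return ["snippet"] if attempt >= 1 else []

def grade(claim, evidence):
    return {"claim": claim, "status": "SUPPORTED"}

print(run_pipeline("AES-256 required", retrieve, grade))
```

The key property is that no branch reaches `grade` without evidence in hand, and exhausting the budget produces an explicit UNKNOWN rather than a fabricated answer.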


Capabilities

Manual Review Controls

The interface is built for human sign-off, not passive consumption. A Contradiction Heatmap cross-references documents against claims in real time, surfacing conflicting evidence before it becomes a finding. Reviewers can override any verdict directly, generating a timestamped audit log that records every human intervention.

Unverifiable findings trigger a prominent callout flagging the claim for manual review rather than silently passing it.

Budget Circuit Breakers

Unconstrained agent loops burn compute and return late. Graph execution is gated by hard limits enforced via a custom BudgetTrackerCallback:

| Limit | Default |
|---|---|
| Max LLM calls | 10 |
| Max tokens | 15,000 |
| Max wall time | 90 seconds |

When a ceiling is hit, the system returns a partial verification with a clear status rather than timing out or throwing a 500.
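A minimal standalone sketch of such a tracker; the real BudgetTrackerCallback hooks into LangGraph's callback machinery, so the interface here is an assumption:

```python
import time

class BudgetExceeded(Exception):
    """Signals that a hard ceiling was hit; caller returns a partial result."""

class BudgetTracker:
    """Hard ceilings on LLM calls, tokens, and wall time."""
    def __init__(self, max_calls=10, max_tokens=15_000, max_seconds=90):
        self.max_calls, self.max_tokens, self.max_seconds = max_calls, max_tokens, max_seconds
        self.calls = self.tokens = 0
        self.start = time.monotonic()

    def record(self, tokens):
        """Call once per LLM invocation; raises when any ceiling is crossed."""
        self.calls += 1
        self.tokens += tokens
        if self.calls > self.max_calls:
            raise BudgetExceeded("max LLM calls")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("max tokens")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded("max wall time")

tracker = BudgetTracker(max_calls=2, max_tokens=100)
tracker.record(40)
tracker.record(40)
try:
    tracker.record(40)  # third call trips the call ceiling
except BudgetExceeded as e:
    print("partial verification:", e)
```

Catching `BudgetExceeded` at the graph boundary is what turns a runaway loop into a partial verification with a clear status instead of a timeout.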

Closed-Loop Telemetry

Every output collects a thumbs-up or thumbs-down. Reviewers who override a finding submit a reason code: hallucinated_citation, missing_evidence, or incorrect_verdict. A metrics script aggregates these logs into four tracked signals:

  • Verdict Accuracy
  • Citation Correctness
  • Abstention Quality
  • Human Override Rate

This structured feedback loop generates the high-quality dataset required to identify weaknesses, systematically enhance the agentic flows, and eventually enable autonomous self-learning.
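The aggregation can be sketched as a fold over the review log; the field names and the mapping from reason codes to signals below are illustrative assumptions, not the project's metrics script:

```python
from collections import Counter

def aggregate(reviews):
    """Fold reviewer feedback into the four tracked signals."""
    total = len(reviews)
    overrides = [r for r in reviews if r.get("override")]
    codes = Counter(r["reason"] for r in overrides)
    return {
        "verdict_accuracy": 1 - codes["incorrect_verdict"] / total,
        "citation_correctness": 1 - codes["hallucinated_citation"] / total,
        "abstention_quality": 1 - codes["missing_evidence"] / total,
        "human_override_rate": len(overrides) / total,
    }

reviews = [
    {"override": False},
    {"override": True, "reason": "incorrect_verdict"},
    {"override": True, "reason": "hallucinated_citation"},
    {"override": False},
]
print(aggregate(reviews))
```

Because every override carries a machine-readable reason code rather than free text, the same log doubles as labeled training data for the flows it measures.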


Stack

| Layer | Technology | Why |
|---|---|---|
| Orchestration | LangGraph | State isolation per claim, budget gating across nodes |
| Vector Store | Qdrant | Page-level filtering to bind chunks to source metadata |
| Validation | Pydantic v2 | Strict schemas for grading input and audit log output |
| Embeddings | BAAI/bge-small-en | Strong MTEB retrieval performance at small model size |
| API | FastAPI | Async SSE for concurrent audit streams |

Quick Start

Requires Docker, Python 3.10+, Node.js 18+.

Single command (Windows)

start_all.bat

Starts Qdrant, the FastAPI backend, and the React frontend concurrently. To stop all instances cleanly:

stop_all.bat

Manual startup

# 1. Vector store
docker-compose up -d

# 2. Backend
call .venv\Scripts\activate.bat
uvicorn backend.api.main:app --reload --host 0.0.0.0 --port 8000

# 3. Frontend
cd frontend-react
npm install && npm run dev

About

A RAG verification pipeline for regulated environments that decomposes complex policies into independently graded, hallucination-free claims.
