Automated Claim Verification for Policy and Compliance
When querying dense SLAs, vendor agreements, or internal financial policies, a single rule often contains multiple strict conditions. A standard RAG pipeline finds evidence for one, ignores the others, and synthesizes a confident paragraph claiming all of them are satisfied.
ArgusAudit removes this liability in regulated environments with an agentic verification pipeline that treats compliance as a classification problem. Every query is decomposed into discrete, verifiable claims, each graded independently against the source material. Three claims in means three separate verdicts out.
---

Input: "All production S3 buckets must use AES-256 encryption, rotate keys every 90 days, and replicate data to the 'us-west-2' region."
| Status | Claim | Evidence | Source |
|---|---|---|---|
| SUPPORTED | AES-256 encryption required | "All production data must use AES-256..." | InfoSec Std v4.1 (p.12) |
| REFUTED | Keys rotate every 90 days | Policy mandates annual rotation only | Key Mgmt Policy (p.4) |
| UNKNOWN | Replicate to us-west-2 | Policy mentions offsite backup, no region specified | No context found |
Three claims in, three different verdicts out. A standard RAG pipeline returns a single confident paragraph that passes all three.
ArgusAudit runs a multi-step verification pipeline where each stage is isolated, gated, and independently observable — decomposition, retrieval, grading, and guardrail validation never share state.
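A minimal sketch of how that staged layout might be wired in LangGraph; the state schema, node names, and stub bodies are illustrative assumptions, not ArgusAudit's actual code:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AuditState(TypedDict, total=False):
    query: str
    claims: list[str]           # written by decompose, read downstream
    evidence: dict[str, list]   # claim -> retrieved snippets
    verdicts: dict[str, str]    # claim -> SUPPORTED / REFUTED / UNKNOWN

# Stub nodes: each stage reads only the keys it needs and writes only its own.
def decompose(state: AuditState) -> dict:
    return {"claims": [state["query"]]}                    # real node calls an LLM

def retrieve(state: AuditState) -> dict:
    return {"evidence": {c: [] for c in state["claims"]}}  # real node queries Qdrant + BM25

def grade(state: AuditState) -> dict:
    return {"verdicts": {c: "UNKNOWN" for c in state["claims"]}}

def guardrail(state: AuditState) -> dict:
    missing = [c for c in state["claims"] if c not in state["verdicts"]]
    if missing:
        raise ValueError(f"ungraded claims: {missing}")
    return {}

graph = StateGraph(AuditState)
for name, fn in [("decompose", decompose), ("retrieve", retrieve),
                 ("grade", grade), ("guardrail", guardrail)]:
    graph.add_node(name, fn)
graph.add_edge(START, "decompose")
graph.add_edge("decompose", "retrieve")
graph.add_edge("retrieve", "grade")
graph.add_edge("grade", "guardrail")
graph.add_edge("guardrail", END)
pipeline = graph.compile()
```

Because every stage communicates only through typed state keys, each node can be logged, replayed, and tested in isolation.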
In a standard pipeline, a query like "AES-256 encryption and 90-day key rotation" becomes a single vector search that averages the meaning of the full sentence. Strong matches on "encryption" drown out weak matches on "rotation". The LLM fills the gap.
ArgusAudit splits the query before retrieval. Each claim is an isolated unit with its own retrieval pass and its own verdict. Nothing bleeds into anything else.
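One way to implement that split is a structured-output call that forces the model to return a list of atomic claims; the model name here is an assumption, since the README does not pin one:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class ClaimList(BaseModel):
    claims: list[str] = Field(description="Atomic, independently verifiable claims")

# Illustrative model choice; swap in whatever the deployment actually uses.
decomposer = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(ClaimList)

def split_claims(query: str) -> list[str]:
    prompt = ("Split this compliance statement into atomic claims that can each "
              f"be verified in isolation:\n{query}")
    return decomposer.invoke(prompt).claims
```

Each returned claim then gets its own retrieval pass and its own verdict.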
The model grades evidence without seeing source metadata. Retrieved chunks are passed as anonymous snippets indexed by integer. The LLM returns a verdict tied to a snippet ID. The orchestration layer resolves that ID back to the document name and page number after grading is complete.
The model cannot fabricate a citation because it never had one to work from.
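A sketch of what blind grading could look like under those constraints; the chunk dictionary keys (`text`, `doc`, `page`) and the grading prompt are assumptions:

```python
from typing import Literal
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class BlindVerdict(BaseModel):
    verdict: Literal["SUPPORTED", "REFUTED", "UNKNOWN"]
    snippet_id: int | None = None   # integer index, never a document name

grader = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(BlindVerdict)

def grade_blind(claim: str, chunks: list[dict]) -> dict:
    # The model sees numbered text only -- no titles, filenames, or page numbers.
    numbered = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks))
    result = grader.invoke(
        f"Claim: {claim}\nEvidence snippets:\n{numbered}\n"
        "Grade the claim and cite the id of the snippet you relied on."
    )
    # The orchestration layer resolves the id to real metadata after grading.
    source = None
    if result.snippet_id is not None and 0 <= result.snippet_id < len(chunks):
        c = chunks[result.snippet_id]
        source = f"{c['doc']} (p.{c['page']})"
    return {"claim": claim, "verdict": result.verdict, "source": source}
```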
Retrieval combines dense and sparse signals using Reciprocal Rank Fusion:
- BAAI/bge-small-en for semantic intent matching
- BM25 for exact keyword hits on clause numbers, defined terms, and specific identifiers
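Reciprocal Rank Fusion itself is a few lines; a self-contained sketch using the conventional k = 60 smoothing constant, where the input rankings would come from the dense index and BM25:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Usage: rrf_fuse([dense_hit_ids, bm25_hit_ids])
# A chunk ranked highly by either signal surfaces near the top of the fused list.
```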
Before any response is streamed, a final structured pass validates that every claim in the output is grounded in a retrieved snippet. Output is rejected at the schema level if a claim lacks an evidence reference. The system does not release ungrounded findings.
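A minimal Pydantic v2 sketch of such a schema-level gate; the field names are assumptions:

```python
from typing import Literal
from pydantic import BaseModel, ValidationInfo, field_validator

class Finding(BaseModel):
    claim: str
    verdict: Literal["SUPPORTED", "REFUTED", "UNKNOWN"]
    snippet_id: int | None = None

    @field_validator("snippet_id")
    @classmethod
    def must_be_grounded(cls, v: int | None, info: ValidationInfo) -> int | None:
        # Any non-UNKNOWN verdict without an evidence reference fails validation,
        # so an ungrounded finding can never be serialized into the response.
        if info.data.get("verdict") != "UNKNOWN" and v is None:
            raise ValueError("finding lacks an evidence reference")
        return v
```

Rejection happens before streaming starts, so the client never sees a finding that later turns out to be ungrounded.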
`@c` runs the full verification pipeline: decomposition, hybrid retrieval, blind grading, report generation, and guardrail check.

`@c Verify that all vendors must hold ISO 27001 certification.`

`@r` runs a direct vector search against the index. Useful for fast document lookups that do not require structured grading.

`@r What is the deductible for general liability?`
The system uses an agentic LangGraph workflow designed for high reliability. The agent follows a strict, self-monitoring process:
- Evidence-Gated Generation: The agent must successfully find and score relevant evidence before it can formulate a response.
- Self-Correction: If the system fails to find evidence or exhausts its execution budget, it automatically adjusts its approach (a routing sketch follows this list).
- Output Validation: All final responses are cross-checked against the retrieved evidence to guarantee accuracy before release.
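Continuing the graph sketch above, self-correction can be expressed as a conditional edge; the routing rule and the `reformulate` / `partial_report` node names are assumptions:

```python
MAX_LLM_CALLS = 10  # mirrors the budget table below

def route_after_grading(state: dict) -> str:
    verdicts = state.get("verdicts", {})
    if verdicts and all(v != "UNKNOWN" for v in verdicts.values()):
        return "guardrail"                    # evidence gate passed
    if state.get("llm_calls", 0) < MAX_LLM_CALLS:
        return "reformulate"                  # retry with an adjusted query
    return "partial_report"                   # budget exhausted: abstain

# Replaces the fixed grade -> guardrail edge:
# graph.add_conditional_edges("grade", route_after_grading)
```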
The interface is built for human sign-off, not passive consumption. A Contradiction Heatmap cross-references documents against claims in real time, surfacing conflicting evidence before it becomes a finding. Reviewers can override any verdict directly, generating a timestamped audit log that records every human intervention.
Unverifiable findings trigger a prominent callout flagging the claim for manual review rather than silently passing it.
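A plausible shape for one such timestamped audit-log entry, reusing the reason codes described below; every field here is an assumption about the record layout:

```python
from datetime import datetime, timezone
from typing import Literal
from pydantic import BaseModel, Field

Verdict = Literal["SUPPORTED", "REFUTED", "UNKNOWN"]

class OverrideRecord(BaseModel):
    claim: str
    original_verdict: Verdict
    new_verdict: Verdict
    reviewer: str
    reason_code: Literal["hallucinated_citation", "missing_evidence", "incorrect_verdict"]
    timestamp: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
```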
Unconstrained agent loops burn compute and return late. Graph execution is gated by hard limits enforced via a custom `BudgetTrackerCallback`:
| Limit | Default |
|---|---|
| Max LLM calls | 10 |
| Max tokens | 15,000 |
| Max wall time | 90 seconds |
When a ceiling is hit, the system returns a partial verification with a clear status rather than timing out or throwing a 500.
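The README names `BudgetTrackerCallback` but not its internals; a plausible reconstruction as a LangChain callback handler (`raise_error = True` makes the exception abort the run instead of being logged):

```python
import time
from langchain_core.callbacks import BaseCallbackHandler

class BudgetExceeded(RuntimeError):
    pass

class BudgetTrackerCallback(BaseCallbackHandler):
    raise_error = True  # propagate BudgetExceeded instead of swallowing it

    def __init__(self, max_calls: int = 10, max_tokens: int = 15_000,
                 max_seconds: float = 90.0):
        self.max_calls, self.max_tokens, self.max_seconds = max_calls, max_tokens, max_seconds
        self.calls = 0
        self.tokens = 0
        self.started = time.monotonic()

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.calls += 1
        if self.calls > self.max_calls:
            raise BudgetExceeded("max LLM calls exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("max wall time exceeded")

    def on_llm_end(self, response, **kwargs):
        usage = (response.llm_output or {}).get("token_usage", {})
        self.tokens += usage.get("total_tokens", 0)
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("max tokens exceeded")
```

The orchestration layer catches `BudgetExceeded` and emits the partial verification described above.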
Every output collects a thumbs-up or thumbs-down. Reviewers who override a finding submit a reason code: `hallucinated_citation`, `missing_evidence`, or `incorrect_verdict`. A metrics script aggregates these logs into four tracked signals:
- Verdict Accuracy
- Citation Correctness
- Abstention Quality
- Human Override Rate
This structured feedback loop generates the high-quality dataset required to identify weaknesses, systematically enhance the agentic flows, and eventually enable autonomous self-learning.
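A sketch of what that metrics script might look like; the one-JSON-object-per-line log format is an assumption:

```python
import json
from collections import Counter

def aggregate(log_path: str) -> dict:
    # Assumed record shape: {"verdict_correct": bool, "citation_correct": bool,
    #   "abstained_correctly": bool, "overridden": bool, "reason": str | null}
    with open(log_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    n = max(len(rows), 1)
    return {
        "verdict_accuracy": sum(r["verdict_correct"] for r in rows) / n,
        "citation_correctness": sum(r["citation_correct"] for r in rows) / n,
        "abstention_quality": sum(r["abstained_correctly"] for r in rows) / n,
        "human_override_rate": sum(r["overridden"] for r in rows) / n,
        "override_reasons": dict(Counter(r["reason"] for r in rows if r["overridden"])),
    }
```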
| Layer | Technology | Why |
|---|---|---|
| Orchestration | LangGraph | State isolation per claim, budget gating across nodes |
| Vector Store | Qdrant | Page-level filtering to bind chunks to source metadata |
| Validation | Pydantic v2 | Strict schemas for grading input and audit log output |
| Embeddings | BAAI/bge-small-en | Strong MTEB retrieval performance at small model size |
| API | FastAPI | Async SSE for concurrent audit streams |
Requires Docker, Python 3.10+, Node.js 18+.
`start_all.bat` starts Qdrant, the FastAPI backend, and the React frontend concurrently. To stop all instances cleanly, run `stop_all.bat`.

To start each service manually:

```
# 1. Vector store
docker-compose up -d

# 2. Backend
call .venv\Scripts\activate.bat
uvicorn backend.api.main:app --reload --host 0.0.0.0 --port 8000

# 3. Frontend
cd frontend-react
npm install && npm run dev
```



