Prism — Grounded Evaluation for AI Agents

An open-source Streamlit app for domain experts to build evidence-based evaluation systems for AI agents — no ML expertise required.

Based on the methodology from "Why Grounded Theory for Reliable AI Agents".

Pipeline

Upload traces → Curate golden set → Open Coding → IAA → Rubric → LLM Judge calibration
     1               1                  2           3      4             5

Step	What you do
1 · Curator	Upload agent I/O traces (JSONL/CSV), select your golden evaluation set
2 · Annotate	Assign failure codes to traces (open coding). Multiple annotators supported.
3 · IAA	Measure inter-annotator agreement (Cohen's κ, Krippendorff's α). Flag low-agreement codes.
4 · Rubric	Group codes into evaluation criteria with observable, scoreable scales. Export to JSON.
5 · Judge	Run an LLM-as-Judge on your golden traces and measure calibration vs human annotations.
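As a rough illustration of what Step 3 measures, the sketch below computes Cohen's κ for two annotators who coded the same traces. It is a minimal standard-library version written for this README, not Prism's own implementation; the function name `cohens_kappa` and the failure codes in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical failure codes assigned to six traces by two annotators.
annotator_1 = ["hallucination", "ok", "ok", "wrong_tool", "ok", "hallucination"]
annotator_2 = ["hallucination", "ok", "wrong_tool", "wrong_tool", "ok", "ok"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 4 of 6 traces (observed agreement 0.667) while chance agreement is 13/36, so κ comes out at 11/23, roughly 0.478. With more than two annotators or missing labels, Krippendorff's α is the usual choice and needs a different computation.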

Quick start

git clone https://github.com/balasvce2017/PRISM.git
cd PRISM
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
streamlit run app.py

Open http://localhost:8501 in your browser.

Input format

JSONL (one object per line):

{"query": "What is x²?", "response": "The derivative is 2x.", "subject": "calculus"}
{"query": "Solve 3x+5=20", "response": "x = 5"}

CSV: must have query and response columns. All other columns become metadata.
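To make the two formats concrete, here is a minimal sketch of how such files could be parsed into a common shape. It is an illustration only, not Prism's actual loader: the helper names (`load_jsonl`, `load_csv`) and the `metadata` key in the resulting dictionaries are assumptions.

```python
import csv
import io
import json

REQUIRED = ("query", "response")


def _to_trace(row):
    """Split one record into the required fields plus a metadata dict."""
    missing = [key for key in REQUIRED if key not in row]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    # Everything that is not query/response is kept as metadata.
    metadata = {k: v for k, v in row.items() if k not in REQUIRED}
    return {"query": row["query"], "response": row["response"], "metadata": metadata}


def load_jsonl(text):
    """Parse JSONL text: one JSON object per non-blank line."""
    return [_to_trace(json.loads(line)) for line in text.splitlines() if line.strip()]


def load_csv(text):
    """Parse CSV text whose header row names the columns."""
    return [_to_trace(row) for row in csv.DictReader(io.StringIO(text))]


jsonl_text = (
    '{"query": "What is x²?", "response": "The derivative is 2x.", "subject": "calculus"}\n'
    '{"query": "Solve 3x+5=20", "response": "x = 5"}\n'
)
csv_text = "query,response,subject\nSolve 3x+5=20,x = 5,algebra\n"

jsonl_traces = load_jsonl(jsonl_text)
csv_traces = load_csv(csv_text)
print(jsonl_traces[0]["metadata"], csv_traces[0]["metadata"])
```

A record without `query` or `response` raises a `ValueError` in this sketch, which mirrors the requirement that both columns be present.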

Supported LLM providers

Provider	Credential required
Anthropic	API key
OpenAI	API key
Amazon Bedrock	AWS access key, secret key, region
Azure OpenAI	Endpoint, API key, deployment name

Calibration target

Deploy the LLM judge once its agreement with human annotators reaches κ ≥ 0.70 on every criterion.

κ range	Interpretation
≥ 0.80	Excellent — ready for production
0.70–0.79	Good — acceptable for most use cases
0.60–0.69	Fair — revisit rubric definition
< 0.60	Poor — return to open coding
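The bands above can be expressed as a small helper, shown here as a sketch. The function names and the criterion names in the example are hypothetical; only the thresholds come from the table.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the interpretation bands in the table above."""
    if kappa >= 0.80:
        return "Excellent"
    if kappa >= 0.70:
        return "Good"
    if kappa >= 0.60:
        return "Fair"
    return "Poor"


def ready_to_deploy(kappa_by_criterion, threshold=0.70):
    """True only if every criterion meets the calibration target."""
    return bool(kappa_by_criterion) and all(
        kappa >= threshold for kappa in kappa_by_criterion.values()
    )


# Hypothetical per-criterion judge-vs-human kappa values.
scores = {"factual_accuracy": 0.83, "tool_use": 0.66}

for criterion, kappa in scores.items():
    print(f"{criterion}: {kappa:.2f} ({interpret_kappa(kappa)})")
print("deploy:", ready_to_deploy(scores))
```

In this example `tool_use` falls in the Fair band, so the judge is not yet ready to deploy even though `factual_accuracy` is Excellent.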

Data

All data is stored locally in prism.db (SQLite). No data is sent anywhere except to your chosen LLM provider when you run the judge (Step 5). Your API keys are never persisted to disk.

Development

pytest tests/ -v                          # run smoke tests
ruff check core/ tests/ app.py pages/    # lint

CI runs both commands on Python 3.9 and 3.11 for every push and pull request.

Contributing

PRs welcome — see CONTRIBUTING.md for setup instructions, code style, and the PR checklist.

License

MIT
