AI/LLM Quality Assurance Framework

Requirements-driven QA framework for evaluating Large Language Model (LLM) outputs with reusable scenarios, traceability, release gates, dashboards, and multi-model testing.

Why This Repo Stands Out

Requirements and acceptance criteria can be modeled as versioned YAML artifacts
Test scenarios for security, hallucination, performance, and RAG are reusable and data-driven
Every run can produce a dashboard, traceability matrix, and release decision summary
The framework supports Anthropic, OpenAI, and Google models through one client abstraction

Release Artifacts

A full run can now generate management-ready outputs in addition to raw pytest results:

dashboard.html for QA and release-readiness overview
traceability.json and traceability.html for requirement-to-scenario traceability
release_summary.json for GO, GO WITH RISKS, or NO-GO decisions

Automated testing framework for evaluating Large Language Model (LLM) outputs in production environments.

Purpose

Modern applications integrate AI/LLM features (chatbots, content generation, document analysis). Traditional QA methods fail because:

Non-deterministic outputs (same input produces different responses)
No classical "expected results"
New risk categories: hallucinations, prompt injections, data leakage

This framework provides 46 automated tests across 7 quality dimensions designed specifically for LLM evaluation.

Test Categories

Category	Tests	What it covers
Security	3	Prompt injection resistance, API key leakage, PII generation
Consistency	3	Semantic consistency, output stability, tone compliance
Hallucination	5	Fact accuracy, fictitious persons/events, fake URLs, math
Performance	5	Response time SLAs, token efficiency, latency monitoring
Bias Detection	5	Gender neutrality, cultural fairness, stereotypes, age, politics
RAG Evaluation	8	Context faithfulness, grounding, contradictions, citation accuracy
UI Testing	17	Chat flow, rendering, loading states, accessibility, responsive design

Multi-Model Support

The framework supports multiple LLM providers through a unified interface:

Provider	Models	API Key
Anthropic (default)	Claude Sonnet, Haiku, Opus	`ANTHROPIC_API_KEY`
OpenAI	GPT-4o, GPT-4, GPT-3.5	`OPENAI_API_KEY`
Google	Gemini 2.0 Flash, Gemini Pro	`GOOGLE_API_KEY`

Installation

# Clone repository
git clone https://github.com/Lengi96/ai-qa-framework.git
cd ai-qa-framework

# Install core dependencies
pip install -r requirements.txt

# Optional: install extras
pip install .[openai]       # OpenAI support
pip install .[google]       # Google Gemini support
pip install .[ui]           # Playwright UI testing
pip install .[dashboard]    # Dashboard generation
pip install .[all]          # Everything

# For UI testing: install browser
playwright install chromium

# Configure API key
cp .env.example .env
# Edit .env and add your API key(s)

Running Tests

# Run all LLM tests (default: Anthropic Claude)
pytest

# Generate HTML report
pytest --html=report.html --self-contained-html

# Run specific test category
pytest tests/test_security.py
pytest tests/test_rag.py

# Use a different provider
pytest --provider openai --model gpt-4o
pytest --provider google --model gemini-2.0-flash

# Or via environment variables
LLM_PROVIDER=openai MODEL=gpt-4o pytest

Custom Metric Dashboard

Generate an interactive HTML dashboard with charts and category breakdowns:

# 1. Install dashboard dependencies
pip install .[dashboard]

# 2. Run tests with JSON output
pytest tests/ -m "not ui" --json-report --json-report-file=results.json

# 3. Generate dashboard
python -m src.dashboard.generate results.json -o dashboard.html

The dashboard shows:

Overall pass rate with status banner
Metrics cards (total, passed, failed, skipped, duration)
Stacked bar chart by category
Donut chart for result distribution
Detailed test table with durations

UI Testing with Playwright

The framework includes 17 generic chatbot UI tests that work against any chat interface. Tests cover input/output flow, markdown rendering, loading states, error handling, accessibility, responsive design, and performance.

Running UI Tests

# Basic usage with default CSS selectors
pytest tests/test_ui.py --base-url http://localhost:3000

# With visible browser for debugging
pytest tests/test_ui.py --base-url http://localhost:3000 --headed

# Custom selectors for your specific chat UI
pytest tests/test_ui.py --base-url https://my-chatbot.com \
    --selector-input "#prompt-textarea" \
    --selector-send "button.send-btn" \
    --selector-response ".chat-bubble.assistant"

# Run only LLM tests (skip UI)
pytest -m "not ui"

Configurable Selectors

CLI Option	Env Variable	What it targets
`--selector-input`	`UI_SELECTOR_INPUT`	Chat input field
`--selector-send`	`UI_SELECTOR_SEND`	Send button
`--selector-messages`	`UI_SELECTOR_MESSAGES`	Messages container
`--selector-response`	`UI_SELECTOR_RESPONSE`	Bot response messages
`--selector-loading`	`UI_SELECTOR_LOADING`	Loading indicator
`--selector-error`	`UI_SELECTOR_ERROR`	Error display

Default selectors cover common patterns (data-testid, typical class names, ARIA attributes) and may work out of the box with many chat UIs.

Project Structure

ai-qa-framework/
â”œâ”€â”€ .env.example                 # API key & UI config template
â”œâ”€â”€ .github/workflows/tests.yml  # CI/CD pipeline (LLM + UI jobs)
â”œâ”€â”€ pyproject.toml               # Project config & pytest settings
â”œâ”€â”€ requirements.txt             # Python dependencies
â”œâ”€â”€ src/
â”‚   â”œâ”€â”€ llm_client.py            # Unified multi-provider LLM client
â”‚   â””â”€â”€ dashboard/
â”‚       â””â”€â”€ generate.py          # HTML dashboard generator
â””â”€â”€ tests/
    â”œâ”€â”€ conftest.py              # Shared fixtures & CLI options
    â”œâ”€â”€ ui_selectors.py          # Default CSS selectors for UI tests
    â”œâ”€â”€ test_security.py         # Security tests
    â”œâ”€â”€ test_consistency.py      # Consistency tests
    â”œâ”€â”€ test_hallucination.py    # Hallucination detection
    â”œâ”€â”€ test_performance.py      # Performance & SLA tests
    â”œâ”€â”€ test_bias.py             # Bias detection tests
    â”œâ”€â”€ test_rag.py              # RAG evaluation tests
    â””â”€â”€ test_ui.py               # Chatbot UI tests (Playwright)

CI/CD

GitHub Actions pipeline with two jobs:

LLM Tests â€” runs automatically on:

Push to main
Pull requests
Weekly schedule (Monday 8:00 UTC)
Manual trigger

UI Tests â€” runs only when CHATBOT_BASE_URL is configured as a repository variable.

Test reports are uploaded as artifacts (30 days retention).

Setup:

Add ANTHROPIC_API_KEY as a repository secret under Settings > Secrets and variables > Actions
For UI tests: add CHATBOT_BASE_URL as a repository variable

Tech Stack

Python 3.11+
Pytest + pytest-html + pytest-json-report
Anthropic / OpenAI / Google Generative AI SDKs
Playwright (UI testing)
Chart.js (dashboard visualizations)
GitHub Actions (CI/CD)

Roadmap

License

MIT License - see LICENSE file

Author

Christoph Lengowski - IT Consultant specializing in QA & AI Testing

Requirements Traceability and Release Gates

The framework now includes a requirements-driven management layer on top of the raw pytest suite.

New project assets

requirements/*.yaml define requirements with stable IDs, priorities, risks, acceptance criteria, linked scenarios, and release gates
scenarios/*.yaml define reusable test scenarios with prompts, expected/forbidden signals, severity, tags, and provider scope
config/quality_gates.yaml defines release-readiness thresholds

New generated artifacts

Run the suite with JSON reporting and then generate the management artifacts:

pytest tests/ -m "not ui" --json-report --json-report-file=results.json
python -m src.dashboard.generate results.json \
  -o dashboard.html \
  --provider anthropic \
  --model claude-haiku-4-5 \
  --traceability-out traceability.json \
  --traceability-html traceability.html \
  --release-summary-out release_summary.json

This produces:

dashboard.html: management dashboard with release decision, requirement coverage, risk coverage, gaps, and history table
traceability.json: machine-readable traceability matrix and gap analysis
traceability.html: human-readable traceability report
release_summary.json: release decision (GO, GO WITH RISKS, NO-GO) with reasons and thresholds

Data-driven scenario execution

Security, hallucination, performance, and RAG checks are now backed by reusable scenario specifications rather than only hard-coded Python assertions. This makes the suite easier to review as a QA/Test-Management and Requirements-Engineering artifact.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI/LLM Quality Assurance Framework

Why This Repo Stands Out

Release Artifacts

Purpose

Test Categories

Multi-Model Support

Installation

Running Tests

Custom Metric Dashboard

UI Testing with Playwright

Running UI Tests

Configurable Selectors

Project Structure

CI/CD

Tech Stack

Roadmap

License

Author

Requirements Traceability and Release Gates

New project assets

New generated artifacts

Data-driven scenario execution

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.github/workflows		.github/workflows
config		config
requirements		requirements
scenarios		scenarios
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dashboard.html		dashboard.html
pyproject.toml		pyproject.toml
report.html		report.html
requirements.txt		requirements.txt
results.json		results.json
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

AI/LLM Quality Assurance Framework

Why This Repo Stands Out

Release Artifacts

Purpose

Test Categories

Multi-Model Support

Installation

Running Tests

Custom Metric Dashboard

UI Testing with Playwright

Running UI Tests

Configurable Selectors

Project Structure

CI/CD

Tech Stack

Roadmap

License

Author

Requirements Traceability and Release Gates

New project assets

New generated artifacts

Data-driven scenario execution

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages