Skip to content

Lengi96/ai-qa-framework

Repository files navigation

AI/LLM Quality Assurance Framework

Latest Release QA Management Workflow Python

Requirements-driven QA framework for evaluating Large Language Model (LLM) outputs with reusable scenarios, traceability, release gates, dashboards, and multi-model testing.

Why This Repo Stands Out

  • Requirements and acceptance criteria can be modeled as versioned YAML artifacts
  • Test scenarios for security, hallucination, performance, and RAG are reusable and data-driven
  • Every run can produce a dashboard, traceability matrix, and release decision summary
  • The framework supports Anthropic, OpenAI, and Google models through one client abstraction

Release Artifacts

A full run can now generate management-ready outputs in addition to raw pytest results:

  • dashboard.html for QA and release-readiness overview
  • traceability.json and traceability.html for requirement-to-scenario traceability
  • release_summary.json for GO, GO WITH RISKS, or NO-GO decisions

Automated testing framework for evaluating Large Language Model (LLM) outputs in production environments.

Purpose

Modern applications integrate AI/LLM features (chatbots, content generation, document analysis). Traditional QA methods fail because:

  • Non-deterministic outputs (same input produces different responses)
  • No classical "expected results"
  • New risk categories: hallucinations, prompt injections, data leakage

This framework provides 46 automated tests across 7 quality dimensions designed specifically for LLM evaluation.

Test Categories

Category Tests What it covers
Security 3 Prompt injection resistance, API key leakage, PII generation
Consistency 3 Semantic consistency, output stability, tone compliance
Hallucination 5 Fact accuracy, fictitious persons/events, fake URLs, math
Performance 5 Response time SLAs, token efficiency, latency monitoring
Bias Detection 5 Gender neutrality, cultural fairness, stereotypes, age, politics
RAG Evaluation 8 Context faithfulness, grounding, contradictions, citation accuracy
UI Testing 17 Chat flow, rendering, loading states, accessibility, responsive design

Multi-Model Support

The framework supports multiple LLM providers through a unified interface:

Provider Models API Key
Anthropic (default) Claude Sonnet, Haiku, Opus ANTHROPIC_API_KEY
OpenAI GPT-4o, GPT-4, GPT-3.5 OPENAI_API_KEY
Google Gemini 2.0 Flash, Gemini Pro GOOGLE_API_KEY

Installation

# Clone repository
git clone https://github.com/Lengi96/ai-qa-framework.git
cd ai-qa-framework

# Install core dependencies
pip install -r requirements.txt

# Optional: install extras
pip install .[openai]       # OpenAI support
pip install .[google]       # Google Gemini support
pip install .[ui]           # Playwright UI testing
pip install .[dashboard]    # Dashboard generation
pip install .[all]          # Everything

# For UI testing: install browser
playwright install chromium

# Configure API key
cp .env.example .env
# Edit .env and add your API key(s)

Running Tests

# Run all LLM tests (default: Anthropic Claude)
pytest

# Generate HTML report
pytest --html=report.html --self-contained-html

# Run specific test category
pytest tests/test_security.py
pytest tests/test_rag.py

# Use a different provider
pytest --provider openai --model gpt-4o
pytest --provider google --model gemini-2.0-flash

# Or via environment variables
LLM_PROVIDER=openai MODEL=gpt-4o pytest

Custom Metric Dashboard

Generate an interactive HTML dashboard with charts and category breakdowns:

# 1. Install dashboard dependencies
pip install .[dashboard]

# 2. Run tests with JSON output
pytest tests/ -m "not ui" --json-report --json-report-file=results.json

# 3. Generate dashboard
python -m src.dashboard.generate results.json -o dashboard.html

The dashboard shows:

  • Overall pass rate with status banner
  • Metrics cards (total, passed, failed, skipped, duration)
  • Stacked bar chart by category
  • Donut chart for result distribution
  • Detailed test table with durations

UI Testing with Playwright

The framework includes 17 generic chatbot UI tests that work against any chat interface. Tests cover input/output flow, markdown rendering, loading states, error handling, accessibility, responsive design, and performance.

Running UI Tests

# Basic usage with default CSS selectors
pytest tests/test_ui.py --base-url http://localhost:3000

# With visible browser for debugging
pytest tests/test_ui.py --base-url http://localhost:3000 --headed

# Custom selectors for your specific chat UI
pytest tests/test_ui.py --base-url https://my-chatbot.com \
    --selector-input "#prompt-textarea" \
    --selector-send "button.send-btn" \
    --selector-response ".chat-bubble.assistant"

# Run only LLM tests (skip UI)
pytest -m "not ui"

Configurable Selectors

CLI Option Env Variable What it targets
--selector-input UI_SELECTOR_INPUT Chat input field
--selector-send UI_SELECTOR_SEND Send button
--selector-messages UI_SELECTOR_MESSAGES Messages container
--selector-response UI_SELECTOR_RESPONSE Bot response messages
--selector-loading UI_SELECTOR_LOADING Loading indicator
--selector-error UI_SELECTOR_ERROR Error display

Default selectors cover common patterns (data-testid, typical class names, ARIA attributes) and may work out of the box with many chat UIs.

Project Structure

ai-qa-framework/
├── .env.example                 # API key & UI config template
├── .github/workflows/tests.yml  # CI/CD pipeline (LLM + UI jobs)
├── pyproject.toml               # Project config & pytest settings
├── requirements.txt             # Python dependencies
├── src/
│   ├── llm_client.py            # Unified multi-provider LLM client
│   └── dashboard/
│       └── generate.py          # HTML dashboard generator
└── tests/
    ├── conftest.py              # Shared fixtures & CLI options
    ├── ui_selectors.py          # Default CSS selectors for UI tests
    ├── test_security.py         # Security tests
    ├── test_consistency.py      # Consistency tests
    ├── test_hallucination.py    # Hallucination detection
    ├── test_performance.py      # Performance & SLA tests
    ├── test_bias.py             # Bias detection tests
    ├── test_rag.py              # RAG evaluation tests
    └── test_ui.py               # Chatbot UI tests (Playwright)

CI/CD

GitHub Actions pipeline with two jobs:

LLM Tests — runs automatically on:

  • Push to main
  • Pull requests
  • Weekly schedule (Monday 8:00 UTC)
  • Manual trigger

UI Tests — runs only when CHATBOT_BASE_URL is configured as a repository variable.

Test reports are uploaded as artifacts (30 days retention).

Setup:

  • Add ANTHROPIC_API_KEY as a repository secret under Settings > Secrets and variables > Actions
  • For UI tests: add CHATBOT_BASE_URL as a repository variable

Tech Stack

  • Python 3.11+
  • Pytest + pytest-html + pytest-json-report
  • Anthropic / OpenAI / Google Generative AI SDKs
  • Playwright (UI testing)
  • Chart.js (dashboard visualizations)
  • GitHub Actions (CI/CD)

Roadmap

  • Security testing
  • Consistency testing
  • Hallucination detection
  • Performance testing
  • Bias detection
  • Multi-model support (Claude, GPT, Gemini)
  • CI/CD pipeline
  • UI testing integration (Playwright)
  • Custom metric dashboards
  • RAG evaluation tests

License

MIT License - see LICENSE file

Author

Christoph Lengowski - IT Consultant specializing in QA & AI Testing

Requirements Traceability and Release Gates

The framework now includes a requirements-driven management layer on top of the raw pytest suite.

New project assets

  • requirements/*.yaml define requirements with stable IDs, priorities, risks, acceptance criteria, linked scenarios, and release gates
  • scenarios/*.yaml define reusable test scenarios with prompts, expected/forbidden signals, severity, tags, and provider scope
  • config/quality_gates.yaml defines release-readiness thresholds

New generated artifacts

Run the suite with JSON reporting and then generate the management artifacts:

pytest tests/ -m "not ui" --json-report --json-report-file=results.json
python -m src.dashboard.generate results.json \
  -o dashboard.html \
  --provider anthropic \
  --model claude-haiku-4-5 \
  --traceability-out traceability.json \
  --traceability-html traceability.html \
  --release-summary-out release_summary.json

This produces:

  • dashboard.html: management dashboard with release decision, requirement coverage, risk coverage, gaps, and history table
  • traceability.json: machine-readable traceability matrix and gap analysis
  • traceability.html: human-readable traceability report
  • release_summary.json: release decision (GO, GO WITH RISKS, NO-GO) with reasons and thresholds

Data-driven scenario execution

Security, hallucination, performance, and RAG checks are now backed by reusable scenario specifications rather than only hard-coded Python assertions. This makes the suite easier to review as a QA/Test-Management and Requirements-Engineering artifact.

About

Requirements-driven AI/LLM QA framework with traceability, release gates, RAG evaluation, and multi-model testing

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors