Advanced Data Quality Testing Platform with AI-Powered Synthetic Data Generation
DataForgeTest is a comprehensive solution for automating data quality testing in Big Data environments. It combines Large Language Models (LLMs) with advanced data processing capabilities to generate synthetic datasets, validate data accuracy, and create intelligent testing workflows for scalable data systems.
- 🤖 AI-Powered Synthetic Data Generation - Create realistic datasets using LLMs with 14+ data types
- 📊 Data Accuracy Validation - Compare and correct datasets using GOLD reference standards
- 💬 Intelligent RAG Support System - Chat with documentation using retrieval-augmented generation
- ⚡ PySpark Code Generation - Automated generation of data quality validation scripts
- 🔍 Advanced PySpark Generator - Upload datasets for automatic schema detection and intelligent code generation
- 🌐 Modern Web Interface - React-based frontend with responsive design and dark theme
- 🔧 RESTful API Architecture - Modular Flask backend with comprehensive error handling
- Quick Start
- Features Overview
- API Reference
- Configuration
- Development
- Testing
- Documentation
- Contributing
- Python 3.8+ with pip
- Node.js 16+ with npm
- Git for version control
🎯 First Time Setup
scripts\dev\setup.bat
- Checks Python version
- Installs all dependencies
- Starts services with health-check polling
- Opens browser automatically
⚡ Daily Development
scripts\dev\start.bat
- Quick service startup
- No dependency checks
- Optimized for development
- Clone Repository
git clone https://github.com/Icar0S/DataForgeTest.git
cd DataForgeTest
- Backend Setup
# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# Configure environment (copy from .env.example)
cp .env.example .env
# Edit .env with your configuration
- Frontend Setup
cd frontend
npm install
- Start Services
Terminal 1 (Backend):
cd src
python api.py
Terminal 2 (Frontend):
cd frontend
npm start
🌐 Access Points:
- Frontend: http://localhost:3000
- Backend API: http://localhost:5000
- API Documentation: http://localhost:5000/docs
Interactive chat interface for generating PySpark data quality validation code through natural language conversations.
Key Features:
- Full-screen responsive chat interface (mobile → desktop)
- Auto-scrolling message area with loading indicators
- Multi-line message input with keyboard shortcuts
- Real-time streaming responses via EventSource
- Accessible design with focus management and ARIA labels
Keyboard Shortcuts:
- Enter - Send message
- Shift + Enter - New line in message
- Tab - Navigate interactive elements
API Integration:
- Uses streaming endpoint:
GET /api/rag/chat?message=<query>
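A minimal Python client for the streaming endpoint might look like this. The `data:` framing follows the standard Server-Sent Events wire format; the exact payload shape of this API is an assumption, so treat this as a sketch:

```python
from typing import Iterable, Iterator

def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line from a Server-Sent Events stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield line[len("data:"):].strip()

# Hypothetical usage against the endpoint above (requires the `requests` package):
# resp = requests.get("http://localhost:5000/api/rag/chat",
#                     params={"message": "How do I validate schemas?"}, stream=True)
# for token in iter_sse_data(resp.iter_lines(decode_unicode=True)):
#     print(token, end="", flush=True)
```

In the browser the frontend uses `EventSource` for the same purpose; the parser above is only needed for non-browser clients.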
NEW! Upload datasets for automatic schema detection, DSL generation, and intelligent PySpark code creation.
Workflow:
- Upload Dataset - Support for CSV, JSON, JSONL, Parquet (max 100MB)
- Auto-Detection - Automatic schema inference, statistics, and data type detection
- Review & Edit - Interactive metadata editing with column-level controls
- DSL Generation - Automatic creation of Data Specification Language
- Code Generation - Production-ready PySpark code with validations
Features:
- Intelligent Schema Inference - Automatic detection of types, nullability, uniqueness
- Column Statistics - Min/max, null ratio, unique ratio, sample values
- Auto-Generated Validations - Not-null, uniqueness, range, format checks
- Interactive Editing - Mark columns as required/unique, set custom validations
- Code Export - Copy to clipboard or download as .py file
- Google Colab Ready - Generated code works out-of-the-box in Colab
Supported Formats:
- CSV (with auto-detection of delimiter, encoding, header)
- JSON / JSONL
- Parquet
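The auto-detection step above can be sketched with pandas. This is a simplified illustration of schema inference, not the project's actual implementation:

```python
import pandas as pd

def infer_schema(df):
    """Infer per-column metadata: dtype, nullability, null/unique ratios, samples."""
    columns = []
    n = len(df)
    for name in df.columns:
        col = df[name]
        columns.append({
            "name": name,
            "dtype": str(col.dtype),
            "nullable": bool(col.isna().any()),
            "null_ratio": round(float(col.isna().mean()), 4),
            "unique_ratio": round(col.nunique(dropna=True) / n, 4) if n else 0.0,
            "samples": col.dropna().head(3).tolist(),
        })
    return columns

df = pd.DataFrame({"id": [1, 2, 3], "city": ["Fortaleza", None, "Recife"]})
schema = infer_schema(df)
```

Metadata like this is what the Review & Edit step then exposes for column-level editing.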
Screenshots:
New dropdown menu with "Generate Advanced PySpark Code" option
Step 1: Upload and configure dataset
LLM-powered synthetic data generation supporting realistic datasets with customizable schemas and multiple output formats.
Supported Data Types:
- Primitives: string, integer, float, boolean
- Date/Time: date, datetime with custom ranges
- Semantic: email, phone, address, product_name, price, uuid
- Advanced: category (with weights), custom_pattern (regex)
Output Formats: CSV, XLSX, JSON, Parquet
Capabilities:
- Up to 1,000,000 rows per dataset
- Up to 50 columns per schema
- Preview mode (50 rows) for quick validation
- Batch processing with progress tracking
- Type validation and coercion
- Uniqueness constraints
- Null value control (percentage-based)
- Locale-aware generation (default: pt_BR)
Performance Benchmarks:
- Small datasets (< 1K rows): 5-15 seconds
- Medium datasets (1K-50K rows): 1-5 minutes
- Large datasets (50K-1M rows): 10-60 minutes
Example Schema:
{
"schema": {
"columns": [
{"name": "product_id", "type": "uuid", "options": {"unique": true}},
{"name": "name", "type": "product_name", "options": {}},
{"name": "price", "type": "price", "options": {"min": 10, "max": 9999, "decimals": 2}},
{"name": "category", "type": "category", "options": {
"categories": ["Electronics", "Clothing", "Food", "Books"]
}}
]
},
"rows": 1000,
"fileType": "csv"
}
Compare and correct datasets using GOLD reference standards with automated normalization and intelligent difference detection.
Key Features:
- Multi-format support: CSV, XLSX, Parquet uploads
- Smart column mapping: Define key columns (identifiers) and value columns (data to compare)
- Automatic normalization:
- Column names → snake_case
- Key normalization (trim, lowercase, accent/punctuation removal)
- Numeric coercion (comma→decimal, thousands separator removal)
- Duplicate handling:
- GOLD: Error on duplicates
- TARGET: Configurable policies (keep_last, sum, average)
- Tolerance-based comparison: Define numeric comparison tolerance
- Comprehensive reporting:
- Accuracy metrics with precision/recall
- Paginated difference tables
- Downloadable reports (CSV, JSON)
Workflow:
- Upload GOLD dataset (trusted reference)
- Upload TARGET dataset (data to validate)
- Map key columns (identifiers) and value columns
- Configure normalization and tolerance options
- Execute comparison and download corrected dataset
Supported Files: Up to 50MB, 2M rows, CSV/XLSX/Parquet formats
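The key normalization, numeric coercion, and tolerance comparison described above can be sketched with pandas. This is an illustrative simplification, not the project's implementation (punctuation removal and duplicate policies are omitted):

```python
import unicodedata
import pandas as pd

def normalize_key(value):
    """Trim, lowercase, and strip accents (punctuation removal omitted here)."""
    value = str(value).strip().lower()
    return "".join(c for c in unicodedata.normalize("NFKD", value)
                   if not unicodedata.combining(c))

def coerce_numeric(value):
    """European format: drop thousands separators, use comma as decimal mark."""
    return float(str(value).replace(".", "").replace(",", "."))

def compare(gold, target, key, value, tolerance=0.01):
    """Join on normalized keys and flag values that differ beyond the tolerance."""
    g = gold.assign(_k=gold[key].map(normalize_key))
    t = target.assign(_k=target[key].map(normalize_key))
    merged = g.merge(t, on="_k", suffixes=("_gold", "_target"))
    diff = (merged[f"{value}_gold"] - merged[f"{value}_target"]).abs()
    merged["match"] = diff <= tolerance
    return merged

gold = pd.DataFrame({"sku": ["A1", "B2"], "price": [10.0, 20.0]})
target = pd.DataFrame({"sku": [" a1 ", "B2"], "price": [10.005, 21.0]})
result = compare(gold, target, key="sku", value="price")
```

Here " a1 " matches "A1" after normalization, and only the B2 row exceeds the 0.01 tolerance.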
Single dataset cleaning and validation with automated data quality improvements.
Key Features:
- Multi-format support: CSV, XLSX, XLS, Parquet uploads (up to 50MB)
- Automated cleaning operations:
- Remove empty columns (100% null)
- Normalize headers (lowercase, accents, special chars)
- Trim strings and remove invisible characters
- Coerce numeric values (handle US/European formats)
- Parse dates with best-effort approach
- Optional: Remove duplicate rows
- Chunked processing: Handles large files efficiently (CSV/Parquet)
- Real-time progress: Live updates with phase tracking
- Comprehensive reporting:
- Row and column counts before/after
- Changes per operation type
- Null value reduction per column
- 50-row preview of cleaned data
- Multiple downloads: CSV (always) + original format when supported
Workflow:
- Upload dataset via drag-drop or file selector
- Review detected columns and sample data
- Configure cleaning options (checkboxes)
- Click "Generate GOLD" to process
- View detailed cleaning report
- Download cleaned dataset(s)
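The cleaning operations can be sketched with pandas. This illustrates the documented steps, not the actual GOLD pipeline (numeric coercion and date parsing are omitted for brevity):

```python
import unicodedata
import pandas as pd

def normalize_header(name):
    """Lowercase, strip accents, replace non-alphanumeric runs with underscores."""
    name = "".join(c for c in unicodedata.normalize("NFKD", name.lower())
                   if not unicodedata.combining(c))
    return "_".join("".join(c if c.isalnum() else " " for c in name).split())

def clean(df, drop_duplicates=False):
    """Apply the documented steps: drop empty columns, fix headers, trim strings."""
    df = df.dropna(axis=1, how="all")                       # remove 100%-null columns
    df.columns = [normalize_header(c) for c in df.columns]  # normalize headers
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()                       # trim string values
    if drop_duplicates:
        df = df.drop_duplicates()                           # optional dedup
    return df

raw = pd.DataFrame({"Preço Médio": ["  10 ", "20"], "Empty": [None, None]})
cleaned = clean(raw)
```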
Access: Home → Data Accuracy (dropdown) → Test Dataset GOLD
Documentation: GOLD Feature Guide
Comprehensive data quality analysis with automated metrics and visual dashboard for single dataset evaluation.
Key Features:
- Multi-format support: CSV, XLSX, XLS, Parquet uploads (up to 50MB)
- Four core quality dimensions:
- Completeness: Missing values and data availability (30% weight)
- Uniqueness: Duplicate detection and record uniqueness (20% weight)
- Validity: Data type and format correctness (30% weight)
- Consistency: Format uniformity and pattern compliance (20% weight)
- Overall Quality Score: Weighted average (0-100%) with color-coded rating
- Visual Dashboard:
- Large quality score display
- Four metric cards with icons and percentages
- Dataset information panel (rows, columns, memory)
- Actionable recommendations panel
- Intelligent Recommendations:
- High severity: Critical quality issues (e.g., completeness < 90%)
- Medium severity: Important improvements (e.g., duplicates, consistency < 80%)
- Low severity: Minor optimizations
Metrics Breakdown:
- Completeness: Percentage of non-null cells (overall + per column)
- Uniqueness: Percentage of unique rows and duplicate count
- Validity: Detection of invalid values (infinity, empty strings, type mismatches)
- Consistency: Format uniformity based on data type patterns
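The overall score is a weighted average of the four dimensions. A simplified sketch using the weights listed above (the validity and consistency computations here are placeholders):

```python
import pandas as pd

# Dimension weights from this README: completeness 30%, uniqueness 20%,
# validity 30%, consistency 20%.
WEIGHTS = {"completeness": 0.30, "uniqueness": 0.20, "validity": 0.30, "consistency": 0.20}

def quality_metrics(df):
    """Simplified versions of the four quality dimensions, each scored 0-100."""
    completeness = 100.0 * (1 - df.isna().sum().sum() / max(df.size, 1))
    uniqueness = 100.0 * len(df.drop_duplicates()) / max(len(df), 1)
    # Placeholders: the real checks inspect type mismatches, infinities,
    # empty strings, and per-type format patterns.
    validity = consistency = 100.0
    metrics = {"completeness": completeness, "uniqueness": uniqueness,
               "validity": validity, "consistency": consistency}
    metrics["overall"] = sum(metrics[k] * w for k, w in WEIGHTS.items())
    return metrics

report = quality_metrics(pd.DataFrame({"a": [1, 1, None, 4]}))
```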
Workflow:
- Upload dataset via drag-drop or file selector
- System automatically analyzes all quality dimensions
- View comprehensive quality dashboard
- Review recommendations sorted by severity
- Take action based on insights
Access: Home → Data Accuracy (dropdown) → Dataset Metrics
Documentation: Metrics Feature Guide
AI-powered documentation chat system with retrieval-augmented generation for contextual support and guidance.
Features:
- Smart document search: Keyword-based retrieval with relevance scoring
- Contextual responses: Structured answers based on query type
- Streaming chat interface: Real-time response generation
- Document management: Upload, index, and manage documentation
- Flexible LLM support:
- Ollama (Default): Open-source LLMs running locally - NO COSTS 🎉
- Anthropic Claude: Cloud-based LLM - requires API credits
Quick Test:
# Test RAG system functionality
python tests/test_rag_integration.py
python tests/test_rag_api.py
# Full system diagnostics
python tests/test_rag_diagnostics.py
LLM Configuration:
Option 1: Ollama (Recommended - Free & Local)
- Install Ollama: ollama.com/download
- Pull a model:
ollama pull qwen2.5-coder:7b
- Configure .env:
LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5-coder:7b
OLLAMA_BASE_URL=http://localhost:11434
- Restart backend
Option 2: Anthropic Claude (Requires Credits)
- Get API key from console.anthropic.com
- Configure .env:
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-haiku-20240307
LLM_API_KEY=your-api-key-here
- Restart backend
📖 Full Setup Guide: docs/OLLAMA_SETUP.md
Current Status: ✅ Ollama support with qwen2.5-coder:7b model
DataForgeTest follows a modern microservices architecture with clear separation between frontend and backend components.
🎨 Frontend (React + TypeScript)
- Modern responsive UI with dark theme
- Real-time streaming chat interfaces
- File upload with progress tracking
- Interactive data visualization
- Accessibility-focused design
🔧 Backend (Python + Flask)
- Modular blueprint architecture
- RESTful API with comprehensive error handling
- LLM integration (Ollama for local open-source LLMs or Anthropic Claude)
- Multi-format file processing
- Automated data validation pipelines
📁 Project Structure
DataForgeTest/
├── frontend/ # React application
│ ├── src/components/ # UI components
│ ├── src/pages/ # Application pages
│ └── public/ # Static assets
├── src/ # Python backend
│ ├── api.py # Main Flask application
│ ├── chatbot/ # PySpark code generation
│ ├── synthetic/ # Synthetic data generation
│ ├── accuracy/ # Data accuracy validation
│ ├── gold/ # GOLD dataset testing
│ └── rag/ # RAG support system
├── docs/ # Comprehensive documentation
├── tests/ # Test suites
└── storage/ # Data storage
Main Application
- GET / - Health check and system status
- POST /ask - Process chatbot requests and generate PySpark code
Synthetic Data Generation
- POST /api/synth/preview - Generate dataset preview (max 100 rows)
- POST /api/synth/generate - Generate full synthetic dataset
- GET /api/synth/download/:session/:file - Download generated files
- GET /api/synth/health - Service health check
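Before calling the preview or generate endpoints, a client can mirror the documented limits locally. A hedged sketch (limit values taken from this README; the server's actual validation may differ):

```python
MAX_ROWS = 1_000_000   # documented per-dataset row limit
MAX_COLS = 50          # documented per-schema column limit

def validate_synth_request(payload):
    """Client-side checks mirroring the documented limits before calling
    POST /api/synth/preview or POST /api/synth/generate (illustrative only)."""
    errors = []
    cols = payload.get("schema", {}).get("columns", [])
    if not cols:
        errors.append("schema.columns must not be empty")
    if len(cols) > MAX_COLS:
        errors.append(f"too many columns (max {MAX_COLS})")
    if not 1 <= payload.get("rows", 0) <= MAX_ROWS:
        errors.append(f"rows must be between 1 and {MAX_ROWS}")
    if payload.get("fileType") not in {"csv", "xlsx", "json", "parquet"}:
        errors.append("unsupported fileType")
    return errors

payload = {"schema": {"columns": [{"name": "product_id", "type": "uuid"}]},
           "rows": 1000, "fileType": "csv"}
```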
Data Accuracy Validation
- POST /api/accuracy/upload?role=gold|target - Upload datasets
- POST /api/accuracy/compare-correct - Compare and correct datasets
- GET /api/accuracy/download/:session/:file - Download results
- GET /api/accuracy/health - Service health check
GOLD Dataset Testing
- POST /api/gold/upload - Upload single dataset for cleaning
- POST /api/gold/clean - Start cleaning process with options
- GET /api/gold/status?sessionId=... - Get processing status and progress
- GET /api/gold/report?sessionId=... - Get detailed cleaning report
- GET /api/gold/download/:session/:file - Download cleaned dataset
- GET /api/gold/health - Service health check
RAG Support System
- POST /api/rag/chat - Send chat messages
- GET /api/rag/chat?message=query - Streaming chat (EventSource)
- POST /api/rag/search - Search documentation
- POST /api/rag/upload - Upload documentation
- GET /api/rag/sources - List indexed documents
- DELETE /api/rag/sources/:id - Remove documents
- GET /api/rag/health - RAG system health
Create .env file (copy from .env.example):
# LLM Configuration
# Choose provider: 'ollama' (default, free) or 'anthropic' (requires credits)
LLM_PROVIDER=ollama
# Ollama Configuration (for open-source LLMs)
LLM_MODEL=qwen2.5-coder:7b
OLLAMA_BASE_URL=http://localhost:11434
# Anthropic Configuration (optional, only if LLM_PROVIDER=anthropic)
# LLM_API_KEY=your-anthropic-api-key-here
# LLM_MODEL=claude-3-haiku-20240307
# RAG System
VECTOR_STORE_PATH=./storage/vectorstore
CHUNK_SIZE=512
CHUNK_OVERLAP=50
TOP_K=4
MAX_UPLOAD_MB=10
# Synthetic Data Generation
SYNTH_STORAGE_PATH=./storage/synth
SYNTH_MAX_ROWS=1000000
SYNTH_REQUEST_TIMEOUT=300
SYNTH_MAX_MEM_MB=2048
# Data Accuracy
ACCURACY_STORAGE_PATH=./storage/accuracy
ACCURACY_MAX_UPLOAD_MB=50
ACCURACY_MAX_ROWS=2000000
ACCURACY_REQUEST_TIMEOUT=120
# GOLD Dataset Testing
GOLD_STORAGE_PATH=./storage/gold
GOLD_ALLOWED_FILE_TYPES=.csv,.xlsx,.xls,.parquet
MAX_ROWS_WARN=500000
GOLD_REQUEST_TIMEOUT=300
# Flask Settings
FLASK_ENV=development
FLASK_DEBUG=True
The PySpark code generator supports comprehensive data quality validation:
Schema Validation
- Column name and type verification
- Header presence validation
Data Integrity
- not_null: Ensure no missing values
- uniqueness: Guarantee unique values/combinations
Value Constraints
- format: Validate data formats (dates, emails, etc.)
- range: Check numeric ranges
- in_set: Validate against predefined values
- regex: Pattern matching validation
- value_distribution: Frequency analysis
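The generated PySpark code applies checks in these categories. For illustration, here are pandas equivalents that run without a Spark cluster (the actual generated scripts use PySpark DataFrame operations):

```python
import pandas as pd

def check_not_null(df, col):
    """not_null: count missing values in a column."""
    return int(df[col].isna().sum())

def check_unique(df, cols):
    """uniqueness: count duplicated key combinations."""
    return int(df.duplicated(subset=cols).sum())

def check_range(df, col, lo, hi):
    """range: count values outside [lo, hi]."""
    return int((~df[col].between(lo, hi)).sum())

def check_regex(df, col, pattern):
    """regex/format: count values not matching a pattern (e.g. emails)."""
    return int((~df[col].astype(str).str.fullmatch(pattern)).sum())

df = pd.DataFrame({"id": [1, 2, 2], "price": [10, 9999, -5],
                   "email": ["a@b.com", "bad", "c@d.org"]})
violations = {
    "price_not_null": check_not_null(df, "price"),
    "id_unique": check_unique(df, ["id"]),
    "price_range": check_range(df, "price", 0, 9999),
    "email_format": check_regex(df, "email", r"[^@]+@[^@]+\.[^@]+"),
}
```

Each check returns a violation count; a zero count means the rule passes.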
# Clone and setup
git clone https://github.com/Icar0S/DataForgeTest.git
cd DataForgeTest
# Backend development
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements-dev.txt
# Frontend development
cd frontend
npm install
npm start
DataForgeTest/
├── src/ # Backend Flask API and modules
├── frontend/ # React frontend
├── tests/ # Automated tests (backend + frontend)
├── utilities/ # RAG debug and maintenance scripts
├── docs/ # Technical documentation
│ ├── archive/ # Historical documents
│ └── assets/ # Images and diagrams
├── scripts/ # Automation scripts
│ ├── dev/ # Development startup scripts
│ └── deploy/ # Deployment scripts
├── docs_to_import/ # Documents for RAG knowledge base
├── .github/ # CI/CD workflows
├── docker-compose.yml # Local Docker stack
├── Dockerfile # Container definition
├── render.yaml # Render.com deployment config
├── pyproject.toml # Python project and tool configuration
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development and test dependencies
├── CHANGELOG.md # Version history
├── CONTRIBUTING.md # Contribution guide
└── run_integration_tests.bat # Local test pipeline
# Python linting
pylint src/**/*.py
# Frontend testing
cd frontend
npm test
# Type checking
npm run type-check
Modular Blueprint Design:
- api.py - Main Flask application with CORS support
- synthetic/ - LLM-powered synthetic data generation
- accuracy/ - Dataset comparison and correction
- rag/ - RAG documentation system (Simple + Full implementations)
- chatbot/ - PySpark code generation from natural language
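The blueprint wiring can be sketched as follows. Module and route names here are illustrative, not the project's exact code:

```python
from flask import Blueprint, Flask, jsonify

# Hypothetical blueprint standing in for a feature module such as synthetic/.
synth_bp = Blueprint("synth", __name__, url_prefix="/api/synth")

@synth_bp.route("/health")
def synth_health():
    # Each module exposes its own health endpoint, as listed in the API reference.
    return jsonify({"status": "ok", "service": "synthetic"})

def create_app():
    """Register each feature module as its own blueprint on the main app."""
    app = Flask(__name__)
    app.register_blueprint(synth_bp)
    return app

app = create_app()
```

Keeping each feature behind its own blueprint lets modules be tested and versioned independently.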
Key Design Patterns:
- Configuration via environment variables
- Comprehensive error handling with structured responses
- Session-based file management
- Streaming responses for real-time feedback
- Modular validators and processors
# Run all tests
python -m pytest tests/ -v
# Specific test categories
python tests/test_rag_integration.py # RAG system tests
python tests/test_rag_api.py # API endpoint tests
python tests/test_rag_diagnostics.py # System diagnostics
# Data accuracy tests (31 tests total)
python -m pytest tests/test_accuracy*.py -v # All accuracy tests
python tests/backend/unit/test_accuracy_backend.py -v # Backend unit tests
python tests/backend/integration/test_accuracy_integration.py -v # Basic integration tests
python tests/backend/integration/test_accuracy_integration_robust.py -v # Robust integration tests
# Frontend tests
cd frontend
npm test
# Test RAG improvements
python test_improved_rag.py
# Test synthetic data generation
python test_csv_download.py
# Test system connectivity
python tests/test_connectivity.py
Backend Components:
- ✅ RAG system (14/14 tests passing, 100% success rate)
- ✅ API endpoints (all endpoints functional)
- ✅ Synthetic data generation (CSV download working)
- ✅ Data accuracy validation (31/31 tests passing, 100% success rate)
- Backend unit tests: 9 tests (normalization, comparison, tolerance)
- Basic integration: 4 tests (upload/compare/download workflows)
- Robust integration: 18 tests (multi-format, multi-column, edge cases, security)
- ✅ Error handling and edge cases
Data Accuracy Test Coverage:
- File formats: CSV, XLSX, Parquet ✓
- Multi-column keys and values ✓
- Normalization options (accents, punctuation, case) ✓
- Numeric formats (European/US) ✓
- Duplicate policies (keep_last, sum, mean) ✓
- Large datasets (1000+ rows) ✓
- Security (path traversal, access control) ✓
- Edge cases (empty data, special characters, missing columns) ✓
Frontend Components:
- ✅ React component rendering
- ✅ Chat interface functionality
- ✅ File upload workflows
- ✅ API integration points
- ✅ Responsive design
Core Documentation:
- docs/BACKEND_STRUCTURE.md - Backend architecture guide
- docs/IMPLEMENTATION_SUMMARY.md - Complete feature overview
- docs/ACCEPTANCE_CHECKLIST.md - QA validation checklist
- docs/TROUBLESHOOTING.md - Common issues and solutions
Feature-Specific:
- docs/SYNTHETIC_DATASET_FEATURE.md - Synthetic data generation
- docs/RAG_TROUBLESHOOTING.md - RAG system guide
- docs/RAG_TEST_RESULTS.md - Test results analysis
Testing Documentation:
- tests/README_TESTS.md - Test suite overview
- docs/RAG_QUICK_REFERENCE.md - Quick reference guide
Interactive API documentation available at:
- Swagger UI: http://localhost:5000/docs (when running)
- Postman Collection: Available in
/docs/api/
Data Engineering Teams:
- Generate realistic test datasets for pipeline validation
- Automate data quality rule creation
- Compare production vs. staging datasets
- Create comprehensive PySpark validation scripts
QA/Testing Teams:
- Generate edge case datasets for testing
- Validate data accuracy across environments
- Create automated data quality checks
- Document and track data quality metrics
Data Scientists:
- Create synthetic datasets for model training
- Validate data preprocessing pipelines
- Compare dataset versions and track changes
- Generate documentation-aware support queries
The frontend is already deployed on Vercel at:
- Production URL: https://data-forge-test.vercel.app/
The easiest way to deploy the backend is using Docker:
# Build the Docker image
docker build -t dataforgetest-backend .
# Run the container
docker run -d \
--name dataforgetest-backend \
-p 5000:5000 \
-e LLM_API_KEY=your-anthropic-api-key \
-v $(pwd)/storage:/app/storage \
dataforgetest-backend
For easier management and local development:
# 1. Copy environment file
cp .env.example .env
# 2. Edit .env with your configuration
# Set your LLM_API_KEY and other settings
# 3. Start services
docker-compose up -d --build
# 4. View logs
docker-compose logs -f backend
# 5. Stop services
docker-compose down
# Install dependencies
pip install -r requirements.txt
# Run with gunicorn (production)
cd src
gunicorn -w 4 -b 0.0.0.0:5000 --timeout 120 api:app
# Or run with Flask (development)
python api.py
The backend can be deployed to various cloud platforms:
- Render.com - Automatically detects Dockerfile
- Railway.app - One-click Docker deployment
- Google Cloud Run - Serverless container deployment
- AWS ECS - Enterprise container orchestration
- DigitalOcean App Platform - Simple container hosting
📖 Detailed deployment guide: See docs/DEPLOYMENT.md for step-by-step instructions for each platform.
Required environment variables for production:
# LLM Configuration
LLM_API_KEY=your-anthropic-api-key-here
LLM_MODEL=claude-3-haiku-20240307
# Flask Settings
FLASK_ENV=production
FLASK_DEBUG=False
See .env.example for complete configuration options.
After deployment, verify the backend is running:
curl https://your-backend-url.com/
# Expected: {"status": "Backend is running", "message": "Data Quality Chatbot API"}
After deploying the backend, update the frontend environment variable on Vercel:
- Go to Vercel Dashboard → Your Project → Settings → Environment Variables
- Add:
REACT_APP_API_URL=https://your-backend-url.com
- Redeploy the frontend
Or for local development, update frontend/package.json:
{
"proxy": "https://your-backend-url.com"
}
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create feature branch:
git checkout -b feature/amazing-feature
- Run tests:
python -m pytest tests/ -v
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open Pull Request
- Follow PEP 8 for Python code
- Use ESLint/Prettier for JavaScript/React
- Add tests for new features
- Update documentation
- Ensure all tests pass
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this project in your research, please cite:
@mastersthesis{santos2025dataqualitychatbot,
author = {Santos, Icaro},
title = {DataForgeTest Big Data Quality Testing: Sistema Inteligente para Validação de Dados},
school = {Universidade Estadual do Ceará},
year = {2025},
type = {Dissertação de Mestrado},
url = {https://github.com/Icar0S/DataForgeTest}
}
- Anthropic Claude for LLM capabilities
- Apache Spark for big data processing
- React Community for frontend framework
- Flask Community for backend framework
Built with ❤️ for the Data Engineering Community

