An AI chatbot for the Office of the Data Protection Commissioner (ODPC) Kenya, powered by Retrieval-Augmented Generation (RAG). The system automatically crawls the official ODPC website, indexes its content, and answers queries about Kenyan data protection laws and regulations with citations to the source pages.
ODPC Shield Bot bridges the gap between complex legal documentation and public understanding. By combining web crawling, semantic search, and large language models, it delivers precise answers with verifiable sources, making data protection information accessible to everyone.
- Automated Knowledge Base: Continuously scrapes and indexes ODPC website content in clean Markdown format
- Intelligent Search: Semantic retrieval using ChromaDB vector database for context-aware responses
- Conversational Memory: PostgreSQL-backed session management maintains context across conversations
- Source Attribution: Every response includes citations with direct links to source documents
- Production Ready: Fully containerized with Docker, optimized for performance and reliability
- RESTful API: Built on FastAPI for high-performance, asynchronous request handling
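The retrieval and source-attribution features above can be illustrated with a toy example. This is a minimal sketch, not the project's code: the real system uses ChromaDB with BAAI/bge-small-en-v1.5 embeddings, while here the `Chunk` class, the three-dimensional hand-made vectors, and the `example.org` URLs are all hypothetical stand-ins. It ranks indexed passages by cosine similarity to a query vector and builds a de-duplicated citation list in the same `title`/`url` shape the API returns.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """One indexed passage plus the page it came from (hypothetical structure)."""
    title: str
    url: str
    text: str
    vector: list[float]  # hand-made stand-in for a real embedding


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], index: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]


def sources(chunks: list[Chunk]) -> list[dict]:
    """Build the citation list, de-duplicated by URL, in ranking order."""
    seen, out = set(), []
    for c in chunks:
        if c.url not in seen:
            seen.add(c.url)
            out.append({"title": c.title, "url": c.url})
    return out


# Toy index; vectors and example.org URLs are made up for illustration.
INDEX = [
    Chunk("Rights of Data Subjects", "https://example.org/rights", "...", [0.9, 0.1, 0.0]),
    Chunk("Registration of Controllers", "https://example.org/registration", "...", [0.1, 0.9, 0.0]),
    Chunk("Complaints Procedure", "https://example.org/complaints", "...", [0.0, 0.2, 0.9]),
]

top = retrieve([1.0, 0.2, 0.0], INDEX, k=2)
print(sources(top))
```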
| Component | Technology |
|---|---|
| LLM | Llama 3.1 via Groq API |
| Embeddings | BAAI/bge-small-en-v1.5 (HuggingFace) |
| Vector Store | ChromaDB |
| Database | PostgreSQL 15 |
| Backend | Python 3.11, FastAPI, SQLAlchemy |
| Web Scraping | BeautifulSoup4, Requests |
| Infrastructure | Docker, Docker Compose |
- Docker and Docker Compose installed
- Groq API key (available from the Groq console)

1. Clone the repository

       git clone https://github.com/peter-kiilu/ODPC-RAG.git
       cd ODPC-RAG

2. Configure environment variables

   Create a .env file in the project root:

       # Required
       GROQ_API_KEY=gsk_your_actual_api_key_here

       # Database Configuration (optional - defaults provided)
       POSTGRES_USER=user
       POSTGRES_PASSWORD=password
       POSTGRES_DB=odpc_chatdb

       # Crawler Control (optional)
       SKIP_CRAWL=false

3. Launch the application

       docker compose up --build -d

   The API will be available at http://localhost:8032

4. Verify deployment
   - Health Check: http://localhost:8032/health
   - Interactive API Documentation: http://localhost:8032/docs

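
The verification step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `is_healthy` and `build_chat_request` are hypothetical, and it assumes that `/health` answers with HTTP 200 when the service is up, which this README does not state explicitly. The request body uses the `message` and optional `session_id` fields of the `/chat` endpoint.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8032"  # address given in the launch step


def build_chat_request(message: str, session_id: Optional[str] = None,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /chat request; session_id is sent only when provided."""
    payload = {"message": message}
    if session_id is not None:
        payload["session_id"] = session_id
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False


if __name__ == "__main__":
    if is_healthy():
        req = build_chat_request("What are the rights of a data subject under the DPA?")
        with urllib.request.urlopen(req, timeout=60) as resp:
            print(json.load(resp))
    else:
        print("Service is not reachable yet")
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.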
Start or continue a conversation with the bot.
POST /chat
Content-Type: application/json
{
"message": "What are the rights of a data subject under the DPA?",
"session_id": "optional-session-uuid"
}

Response:
{
"response": "Under the Data Protection Act, data subjects have several rights...",
"sources": [
{
"title": "Rights of Data Subjects",
"url": "https://www.odpc.go.ke/data-subject-rights"
}
],
"session_id": "generated-or-provided-uuid"
}

Get the complete conversation history for a session.
GET /chat/history/{session_id}

Delete all messages for a specific session.
DELETE /chat/history/{session_id}

ODPC-RAG/
├── crawler/ # Web scraping engine
│ ├── crawler.py # Main crawler logic
│ └── crawler_state.json # Crawl progress tracker
├── rag_bot/ # Core RAG application
│ ├── api.py # FastAPI endpoints
│ ├── chat.py # RAG engine & prompt engineering
│ ├── database.py # PostgreSQL models & connections
│ ├── db_init.py # Database initialization
│ ├── vector_store.py # ChromaDB management
│ └── chroma_db/ # Vector database storage (volume)
├── data/ # Scraped content storage (volume)
│ ├── markdown/ # Cleaned markdown files
│ └── documents/ # Original HTML documents
├── docker-compose.yml # Service orchestration
├── Dockerfile # Multi-stage optimized build
├── entrypoint.sh # Container startup script
├── requirements.txt # Python dependencies
└── .env # Environment configuration
The crawler runs automatically on container startup, indexing the ODPC website.
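
How a crawl-progress file and the `SKIP_CRAWL` variable might interact can be sketched as follows. This is an illustration under stated assumptions, not the logic in crawler/crawler.py: the `{"visited": [...]}` schema, the function names, and the case-insensitive handling of `SKIP_CRAWL` are all hypothetical.

```python
import json
import tempfile
from collections.abc import Mapping
from pathlib import Path


def should_skip_crawl(env: Mapping[str, str]) -> bool:
    """Treat SKIP_CRAWL=true (any letter case) as a request to skip crawling."""
    return env.get("SKIP_CRAWL", "false").strip().lower() == "true"


def load_visited(state_path: Path) -> set[str]:
    """Read previously visited URLs; a missing or unreadable file means a fresh crawl."""
    try:
        state = json.loads(state_path.read_text(encoding="utf-8"))
        return set(state.get("visited", []))  # "visited" is an assumed key
    except (FileNotFoundError, json.JSONDecodeError, AttributeError):
        return set()


def save_visited(state_path: Path, visited: set[str]) -> None:
    """Persist progress so a restart can resume instead of re-crawling."""
    state_path.write_text(json.dumps({"visited": sorted(visited)}), encoding="utf-8")


def pending_urls(candidates: list[str], visited: set[str]) -> list[str]:
    """Keep only URLs that have not been crawled yet, preserving order."""
    return [u for u in candidates if u not in visited]


# Demonstration against a temporary state file.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "crawler_state.json"
    save_visited(state_file, {"https://www.odpc.go.ke/"})
    todo = pending_urls(
        ["https://www.odpc.go.ke/", "https://www.odpc.go.ke/data-subject-rights"],
        load_visited(state_file),
    )
    print(todo)
```

Under this scheme, deleting the state file makes `load_visited` return an empty set, so every URL is treated as pending again.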
Skip crawling (useful for faster restarts with existing data):
SKIP_CRAWL=true docker compose up -d

Force a fresh crawl from scratch:
# Remove the crawl state
rm data/crawler_state.json
# Ensure crawling is enabled
SKIP_CRAWL=false docker compose up --build -d

Connect to the database:
docker exec -it odpc-postgres psql -U user -d odpc_chatdb

Inspect tables:
\dt -- List all tables
SELECT * FROM chat_sessions; -- View sessions
SELECT * FROM chat_messages; -- View message history

Complete reset (removes all data):
docker compose down -v
docker compose up --build -d

The system includes several optimizations for production use:
- Model Caching: HuggingFace embeddings cached in persistent volume (saves 30-40s per restart)
- Multi-stage Docker Build: Separates build and runtime dependencies for smaller images
- Connection Pooling: Efficient database connection management via SQLAlchemy
- Async Architecture: Non-blocking I/O for concurrent request handling
- Health Checks: Automatic container monitoring and restart policies
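
The benefit of the async architecture mentioned above can be shown with a small standard-library example. This is a toy sketch, not the FastAPI application itself: `answer` is a hypothetical handler whose `asyncio.sleep` stands in for slow I/O such as an LLM call or a database query. Because each handler yields to the event loop while waiting, three requests complete in roughly the time of one.

```python
import asyncio
import time


async def answer(question: str, delay: float = 0.2) -> str:
    """Stand-in for a request handler that awaits slow I/O."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return f"answered: {question}"


async def handle_concurrently(questions: list[str]) -> list[str]:
    """Run all handlers at once; results come back in input order."""
    return await asyncio.gather(*(answer(q) for q in questions))


start = time.perf_counter()
results = asyncio.run(handle_concurrently(["q1", "q2", "q3"]))
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the same three calls would take about 0.6 seconds; with `asyncio.gather` the waits overlap.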
All services:
docker compose logs -f

Specific service:
docker compose logs -f odpc-rag
docker compose logs -f postgres

Slow startup on first run:
- The embedding model (~133MB) downloads on first launch
- Subsequent starts use the cached model (near-instant)
Crawler fails:
- Check network connectivity to odpc.go.ke
- If the crawler state file may be corrupted, delete it so the next crawl starts fresh:
rm data/crawler_state.json
Database connection errors:
- Ensure PostgreSQL is healthy: docker compose ps
- Check that the credentials in .env match those in docker-compose.yml
Out of memory:
- Increase Docker memory limit in Docker Desktop settings
- Consider using a smaller embedding model
# Inside the container
docker exec -it odpc-rag bash
pytest tests/

Rebuild and restart the containers:
docker compose down
docker compose up --build -d

Modify crawler/crawler.py to include additional URLs or domains.
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Office of the Data Protection Commissioner Kenya for maintaining comprehensive data protection resources
- The open-source community for the incredible tools that power this project
- Groq for providing fast LLM inference infrastructure
For issues, questions, or feature requests, please open an issue on GitHub or contact the maintainers.