This repository contains the source code, data, and evaluation results for the thesis "Retrieval Optimization for RAG-based Chatbots". The project implements and evaluates multiple retrieval optimization approaches.
- /chroma_db/
  The vector databases used by the approaches are not included directly in this repository due to file size limitations.
  Instead, the full Chroma vector database used in this project is available for download via Zenodo:
  https://doi.org/10.5281/zenodo.15666607
  - Approach 01 and Approach 02 share the same Chroma database.
  - Approach 03 has a separate database due to the parent-child chunk linking.
- /data/
  Contains the PDF-based directive documents (Weisungen) used as the knowledge base for each approach.
- /eval_dataset/
  Contains the manually curated evaluation datasets in both JSON and CSV formats.
- /local_datastore/
  Contains the chunk storage for non-vectorized retrieval:
  - /sparse_datastore/ stores the BM25 chunks.
  - /parent_store/ stores the parent-child chunks for Approach 03.
- /test/
  Contains the evaluation scripts used to execute and compare the different retrieval approaches.
  This folder also includes:
  - Log files that contain the overall pass rates for each approach.
  - The /result/ subfolder, which contains all evaluation results in both JSON and CSV format, including a combined comparison table summarizing all approaches.
- /rag_demo.py/, /rag_demo_approach_02.py/, /rag_demo_approach_03.py/
  The implementation folders for each of the three retrieval approaches. These folders are located directly in the root directory.
- config.py
  Configuration file located in the root directory. It manages folder paths and parameters for all approaches and evaluation scripts.
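Because config.py centralizes the folder paths, its role can be illustrated with a minimal sketch. This is a hypothetical illustration and not the actual file from this repository: the dictionary keys and both function names are assumptions, and it further assumes that /sparse_datastore/ and /parent_store/ are subfolders of /local_datastore/. Only the folder names themselves are taken from the listing above.

```python
from pathlib import Path

# Folder names as listed in this README. The keys are illustrative, and the
# nesting of the two datastore folders under local_datastore is an assumption.
FOLDER_NAMES = {
    "chroma_db": "chroma_db",
    "data": "data",
    "eval_dataset": "eval_dataset",
    "sparse_datastore": "local_datastore/sparse_datastore",
    "parent_store": "local_datastore/parent_store",
    "test_results": "test/result",
}


def build_paths(root: Path) -> dict[str, Path]:
    """Resolve every known folder relative to the repository root."""
    return {key: root / rel for key, rel in FOLDER_NAMES.items()}


def missing_directories(root: Path) -> list[str]:
    """Return the sorted keys of folders that do not exist under root."""
    paths = build_paths(root)
    return sorted(key for key, path in paths.items() if not path.is_dir())


if __name__ == "__main__":
    repo_root = Path.cwd()  # assumes the script is started from the repository root
    for key in missing_directories(repo_root):
        print(f"missing: {key} (expected at {build_paths(repo_root)[key]})")
```

Under these assumptions, a fresh clone would be expected to report chroma_db as missing until the Zenodo archive has been downloaded and extracted into the repository root.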
- Python 3.11.6
- See requirements.txt for all required dependencies.
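As a rough way to confirm that a local environment matches these requirements, the following sketch compares the running interpreter against Python 3.11 and reports pinned packages that are missing or installed at a different version. It assumes the requirements file uses plain `name==version` lines, which may not hold for the actual file; the function names are illustrative and not part of the repository.

```python
import sys
from importlib import metadata
from pathlib import Path


def python_matches(major: int = 3, minor: int = 11) -> bool:
    """Check that the interpreter has the expected major.minor version."""
    return sys.version_info[:2] == (major, minor)


def unmet_requirements(requirements_file: Path) -> list[str]:
    """List pinned requirements that are missing or installed at another version.

    Assumption: each relevant line has the form `name==version`. Lines with
    options, extras, environment markers, or other specifiers are skipped.
    """
    unmet = []
    for raw in requirements_file.read_text(encoding="utf-8").splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line or line.startswith("-") or "[" in line or ";" in line:
            continue
        name, _, wanted = line.partition("==")
        name, wanted = name.strip(), wanted.strip()
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            unmet.append(f"{name}=={wanted} (not installed)")
            continue
        if installed != wanted:
            unmet.append(f"{name}=={wanted} (found {installed})")
    return unmet


if __name__ == "__main__":
    print("Python 3.11:", python_matches())
    req = Path("requirements.txt")
    if req.is_file():
        print("\n".join(unmet_requirements(req)) or "all pinned requirements satisfied")
```

The check is purely informational: it installs nothing and only reads package metadata from the current environment.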
All credentials, API keys, and sensitive configuration values have been removed prior to submission.
- This code is provided for documentation purposes only.
- Execution is not required for thesis evaluation.