# Awesome RAG Study

A curated collection of papers, frameworks, tools, and resources on Retrieval-Augmented Generation (RAG).

Maintained for students of the Text Mining and Data Visualization course as a starting point for thesis research.

## What is RAG?

Retrieval-Augmented Generation is a technique that enhances Large Language Models (LLMs) by grounding their responses in external knowledge retrieved at inference time, reducing hallucinations and enabling domain-specific answers without fine-tuning.
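The whole pipeline fits in a few steps: retrieve relevant chunks, inline them into the prompt, and generate. The sketch below illustrates the flow with a toy word-overlap retriever standing in for a real embedding model and vector store; the function names and corpus are illustrative, not from any particular library.

```python
# Minimal RAG loop: retrieve evidence, then ground the prompt in it.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Inline the retrieved evidence so the model answers from it."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "RAG grounds LLM answers in retrieved documents.",
    "BM25 is a sparse keyword-based retrieval method.",
    "Paris is the capital of France.",
]
query = "What does RAG ground answers in?"
prompt = build_prompt(query, retrieve(query, corpus))
# `prompt` would then be sent to an LLM in place of the bare question.
```

In a real system the retriever is an embedding model plus a vector index, and the final call goes to an LLM API; everything else stays structurally the same.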


## Contents

- Foundational Papers
- Survey Papers
- Advanced Techniques
- Chunking and Indexing
- Retrieval Strategies
- Reranking
- Query Transformation
- Evaluation
- Frameworks and Libraries
- Vector Databases
- Embedding Models
- Tutorials and Guides
- Videos and Talks
- Datasets and Benchmarks
- Contributing
- License

## Chunking and Indexing

- **Unstructured** - Pre-processing library for parsing PDFs, HTML, and Word documents into clean chunks.
- **Semantic Chunking** - Splitting documents based on semantic similarity rather than fixed token windows.
- **Hierarchical Indexing** - Using summaries at different granularity levels (document → section → paragraph) to improve retrieval precision.
- **Parent-Child Chunking** - Retrieve small chunks for precision, but pass the parent (larger) chunk to the LLM for context.
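Parent-child chunking can be sketched in a few lines: index small child chunks for precise matching, but hand the enclosing parent back to the LLM. The window size and the word-overlap matcher below are illustrative stand-ins for a real splitter and retriever.

```python
# Parent-child chunking sketch: match on small children, return the parent.
def make_parent_child_index(documents: list[str], child_size: int = 5) -> dict[str, str]:
    """Map each child chunk (a window of words) back to its parent document."""
    child_to_parent = {}
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), child_size):
            child = " ".join(words[i:i + child_size])
            child_to_parent[child] = doc
    return child_to_parent

def retrieve_parent(query: str, index: dict[str, str]) -> str:
    """Find the child with the highest word overlap, return its parent."""
    q = set(query.lower().split())
    best_child = max(index, key=lambda c: len(q & set(c.lower().split())))
    return index[best_child]
```

The point of the pattern: the small child wins the similarity match, but the LLM sees the parent's surrounding sentences, so answers are both precise and well-contextualized.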

## Retrieval Strategies

| Strategy | Description |
| --- | --- |
| Dense Retrieval | Encode queries and documents into vector embeddings; retrieve by cosine similarity. |
| Sparse Retrieval (BM25) | Traditional keyword-based retrieval. Still competitive and often used as a baseline. |
| Hybrid Search | Combine dense and sparse retrieval (e.g., via Reciprocal Rank Fusion). Often outperforms either alone. |
| Multi-Query Retrieval | Generate multiple query variations with an LLM, retrieve for each, then merge results. |
| HyDE | Hypothetical Document Embeddings: generate a hypothetical answer first, then use it as the retrieval query. |
| Contextual Retrieval | Anthropic's approach: prepend chunk-specific context before embedding to reduce retrieval failures. |
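Reciprocal Rank Fusion, the usual glue for hybrid search, needs no tuning and no score normalization: each document scores the sum of 1 / (k + rank) over every ranking it appears in. A minimal sketch (k = 60 is the constant commonly used in practice):

```python
# Reciprocal Rank Fusion: merge rankings by summing 1 / (k + rank) per doc.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # e.g., from a vector index
sparse = ["d3", "d1", "d4"]  # e.g., from BM25
fused = rrf([dense, sparse])
```

Because only ranks matter, RRF sidesteps the incompatible score scales of dense and sparse retrievers, which is why it is the default fusion method in several frameworks.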

## Reranking

- **Cohere Rerank** - Cross-encoder reranking API.
- **ColBERT** - Late-interaction model for efficient and effective reranking.
- **bge-reranker** - Open-source cross-encoder reranker by BAAI.
- **RankLLM** - Using LLMs themselves as rerankers via listwise prompting.
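Listwise LLM reranking boils down to two string-handling steps: number the candidates in a prompt, then parse the model's ordering back into passages. The sketch below stubs out the LLM call; the prompt wording and the `[2] > [1] > [3]` answer format are illustrative, not the exact format any specific tool uses.

```python
import re

# Listwise reranking sketch: prompt construction and response parsing.
def build_listwise_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (f"Rank the passages by relevance to the query.\n"
            f"Query: {query}\n{numbered}\n"
            f"Answer with identifiers only, e.g. [2] > [1] > [3].")

def parse_ranking(response: str, passages: list[str]) -> list[str]:
    """Turn a '[2] > [3] > [1]'-style response back into an ordered list."""
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", response)]
    return [passages[i] for i in order if 0 <= i < len(passages)]

passages = ["about cats", "about RAG pipelines", "about weather"]
# Pretend the LLM answered with this ordering:
reranked = parse_ranking("[2] > [3] > [1]", passages)
```

Cross-encoder rerankers instead score each (query, passage) pair independently; the listwise approach lets the model compare candidates against each other, at the cost of one larger prompt.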

## Query Transformation

- **Query Rewriting** - Use an LLM to reformulate the user query for better retrieval.
- **Step-Back Prompting** - Ask a more abstract question first to retrieve broader context.
- **Query Decomposition** - Break complex questions into sub-questions, retrieve for each, then synthesize.
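The decomposition pattern can be sketched as: split the question, retrieve per sub-question, and pool the evidence. The decomposer below is a crude stub (splitting on "and") standing in for an LLM call, and the retriever is passed in as a callable; both are illustrative.

```python
# Query decomposition sketch with a stubbed decomposer.
def decompose(question: str) -> list[str]:
    """Stub: a real system would prompt an LLM for sub-questions."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [f"{p}?" for p in parts if p]

def retrieve_for_all(subquestions, retriever):
    """Retrieve per sub-question; deduplicate while preserving order."""
    seen, merged = set(), []
    for sq in subquestions:
        for doc in retriever(sq):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

The merged evidence is then handed to the LLM in one synthesis prompt, which is what lets multi-hop questions be answered from documents that no single query would retrieve together.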

## Evaluation

| Framework | Description |
| --- | --- |
| RAGAS | Reference-free evaluation framework. Metrics: faithfulness, answer relevancy, context precision/recall. |
| TruLens | Evaluation and tracking for LLM apps, including RAG-specific metrics. |
| DeepEval | Unit-testing framework for LLM outputs with RAG-aware metrics. |
| ARES | Automated RAG Evaluation System; uses LLM judges with statistical confidence. |
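The retrieval-side metrics these frameworks report are worth computing by hand at least once. Below is precision/recall at k from labeled relevance judgments; note that the frameworks above often approximate relevance with an LLM judge rather than gold labels, so this is the underlying definition, not any framework's implementation.

```python
# Retrieval metrics at cutoff k, given gold relevance labels.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```

High precision with low recall suggests the index is missing documents; the reverse suggests the ranker needs work. Separating the two is the main reason to evaluate retrieval before evaluating generation.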

## Frameworks and Libraries

| Framework | Language | Description |
| --- | --- | --- |
| LangChain | Python/JS | The most widely adopted framework for building RAG pipelines. Large ecosystem, many integrations. |
| LlamaIndex | Python | Data framework designed specifically for RAG. Strong focus on indexing and retrieval abstractions. |
| Haystack | Python | Production-ready NLP framework by deepset. Pipeline-based architecture. |
| RAGFlow | Python | Open-source RAG engine with deep document understanding and chunk visualization. |
| Verba | Python | Open-source RAG chatbot powered by Weaviate. Good for quick prototyping. |
| Cognita | Python | Open-source modular RAG framework for production use. |

## Vector Databases

| Database | Type | Notes |
| --- | --- | --- |
| Chroma | Embedded | Lightweight, easy to start with. Good for prototyping and smaller projects. |
| Weaviate | Self-hosted / Cloud | Supports hybrid search natively. GraphQL API. |
| Qdrant | Self-hosted / Cloud | Written in Rust. Excellent filtering and payload support. |
| Milvus | Self-hosted / Cloud | Highly scalable. Used in many production deployments. |
| Pinecone | Cloud-only | Fully managed. Simple API. Popular in industry. |
| FAISS | Library | Meta's similarity-search library. Not a database, but extremely fast for local use. |
| pgvector | PostgreSQL extension | Adds vector search to an existing PostgreSQL database. Great if you already use Postgres. |
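At their core, all of these systems answer the same query: given a vector, return the k stored vectors nearest by cosine similarity. A brute-force version fits in a few lines; real databases add approximate-nearest-neighbor indexes and quantization on top so the same query scales to millions of vectors.

```python
import math

# Brute-force nearest-neighbor search, the operation a vector DB accelerates.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda vid: cosine(query_vec, vectors[vid]),
                    reverse=True)
    return ranked[:k]
```

Brute force is exact and, via optimized libraries like FAISS's flat indexes, fast enough for corpora up to roughly the million-vector range on one machine; beyond that the managed and self-hosted options above earn their keep.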

## Embedding Models

| Model | Provider | Notes |
| --- | --- | --- |
| text-embedding-3-small/large | OpenAI | Strong general-purpose embeddings. The large variant has 3072 dimensions. |
| Cohere Embed v3 | Cohere | Supports multiple input types (search_document, search_query). |
| BGE | BAAI (open-source) | Top-performing open-source embeddings. Available in multiple sizes. |
| E5-Mistral | Open-source | LLM-based embedding model. Strong performance on the MTEB benchmark. |
| Nomic Embed | Open-source | Long-context (8192 tokens), fully open source with open training data. |
| Jina Embeddings | Open-source | Multilingual, supports 8192-token context. Good for non-English corpora. |

Tip: check the MTEB Leaderboard for up-to-date embedding model benchmarks.


## Tutorials and Guides

## Videos and Talks

## Datasets and Benchmarks

| Dataset/Benchmark | Description |
| --- | --- |
| Natural Questions (NQ) | Google's open-domain QA dataset. Standard benchmark for retrieval systems. |
| HotpotQA | Multi-hop QA requiring reasoning over multiple documents. |
| MS MARCO | Large-scale passage-retrieval and QA benchmark. |
| BEIR | Heterogeneous benchmark for zero-shot evaluation of retrieval models across diverse tasks. |
| RGB (Retrieval-Augmented Generation Benchmark) | Designed specifically to evaluate RAG systems on noise robustness, negative rejection, information integration, and counterfactual robustness. |

## Contributing

Contributions are welcome! This is a collaborative resource for students and researchers.

Please follow these guidelines:

1. Fork this repository.
2. Add your resource in the appropriate section.
3. Follow the existing format: `**[Resource Name](link)** - Brief description.`
4. Ensure all links work and resources are relevant to RAG.
5. Submit a pull request.

## License

This work is dedicated to the public domain under CC0 1.0 Universal.
