A curated collection of papers, frameworks, tools, and resources on Retrieval-Augmented Generation (RAG).
Maintained for students of the Text Mining and Data Visualization course as a starting point for thesis research.
Retrieval-Augmented Generation is a technique that enhances Large Language Models (LLMs) by grounding their responses in external knowledge retrieved at inference time, reducing hallucinations and enabling domain-specific answers without fine-tuning.
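
The retrieve-then-generate loop can be sketched in a few lines. This toy version uses bag-of-words counts and cosine similarity as a stand-in for a real dense encoder, and stops at prompt construction rather than calling an LLM; all names are illustrative:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank the corpus by similarity to the query, return the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    """The retrieved passages are injected into the LLM prompt."""
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}"

corpus = [
    "DPR encodes passages with a BERT encoder.",
    "BM25 scores documents by term frequency.",
    "The Transformer uses self-attention.",
]
prompt = build_prompt("How does DPR encode passages?",
                      retrieve("How does DPR encode passages?", corpus))
```

The resulting prompt would then be sent to an LLM; everything the sections below cover (chunking, retrieval strategies, reranking, evaluation) refines one stage of this loop.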
## Contents

- Foundational Papers
- Survey Papers
- Advanced Techniques
- Frameworks and Libraries
- Vector Databases
- Embedding Models
- Tutorials and Guides
- Videos and Talks
- Datasets and Benchmarks
- Contributing
- License

## Foundational Papers

- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) - The original RAG paper by Lewis et al. (Meta AI). Introduces the RAG architecture combining a pre-trained seq2seq model with a dense retriever (DPR).
- Dense Passage Retrieval for Open-Domain Question Answering (2020) - DPR — the dense retrieval method that underpins many RAG systems.
- REALM: Retrieval-Augmented Language Model Pre-Training (2020) - Pre-trains a language model jointly with a knowledge retriever.
- Attention Is All You Need (2017) - The Transformer architecture — foundational to all modern LLMs used in RAG.

## Survey Papers

- Retrieval-Augmented Generation for Large Language Models: A Survey (2023) - Comprehensive survey covering Naive RAG, Advanced RAG, and Modular RAG paradigms. Excellent starting point.
- A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models (2024) - Covers the evolution of RA-LLMs, taxonomies, and training strategies.
- Seven Failure Points When Engineering a RAG System (2024) - Practical guide to what can go wrong in RAG pipelines — highly recommended for thesis work.
## Advanced Techniques

### Chunking and Pre-processing

- Unstructured - Pre-processing library for parsing PDFs, HTML, and Word documents into clean chunks.
- Semantic Chunking - Splitting documents based on semantic similarity rather than fixed token windows.
- Hierarchical Indexing - Using summaries at different granularity levels (document → section → paragraph) to improve retrieval precision.
- Parent-Child Chunking - Retrieve small chunks for precision, but pass the parent (larger) chunk to the LLM for context.
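
Parent-child chunking reduces to a small index mapping each child chunk back to its parent. A minimal sketch, using fixed word windows for the children and token overlap as a stand-in for embedding similarity (both are simplifications):

```python
def parent_child_index(docs: dict[str, str], child_size: int = 5):
    """Split each parent doc into small word-window children.
    The index maps child text -> parent id, so retrieval can be precise
    while the LLM still sees the full parent context."""
    index = []
    for parent_id, text in docs.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            child = " ".join(words[i:i + child_size])
            index.append((child, parent_id))
    return index

def retrieve_parent(query: str, index, docs: dict[str, str]) -> str:
    """Match the query against small children, return the large parent."""
    q = set(query.lower().split())
    best_child, parent_id = max(
        index, key=lambda c: len(q & set(c[0].lower().split())))
    return docs[parent_id]  # pass the larger parent chunk to the LLM
```

LangChain's `ParentDocumentRetriever` and LlamaIndex's auto-merging retriever implement this pattern with real embeddings.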
### Retrieval Strategies

| Strategy | Description |
|---|---|
| Dense Retrieval | Encode queries and documents into vector embeddings, retrieve by cosine similarity. |
| Sparse Retrieval (BM25) | Traditional keyword-based retrieval. Still competitive and often used as a baseline. |
| Hybrid Search | Combine dense + sparse retrieval (e.g., via Reciprocal Rank Fusion). Often outperforms either alone. |
| Multi-Query Retrieval | Generate multiple query variations with an LLM and retrieve for each, then merge results. |
| HyDE | Hypothetical Document Embeddings - Generate a hypothetical answer first, then use it as the retrieval query. |
| Contextual Retrieval | Anthropic's approach - Prepend chunk-specific context before embedding to reduce retrieval failures. |
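
The hybrid-search fusion step in the table above is often done with Reciprocal Rank Fusion, which needs only the two ranked lists, not their scores. A self-contained sketch (`k = 60` is the constant used in the original RRF paper and a common default):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. one dense, one BM25) into one.
    Each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # ranked by embedding similarity
sparse = ["d1", "d4", "d3"]  # ranked by BM25
fused = reciprocal_rank_fusion([dense, sparse])  # d1 wins: high in both lists
```

Because RRF only uses ranks, it sidesteps the problem of normalizing cosine similarities against BM25 scores, which live on incompatible scales.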
### Reranking

- Cohere Rerank - Cross-encoder reranking API.
- ColBERT - Late interaction model for efficient and effective reranking.
- bge-reranker - Open-source cross-encoder reranker by BAAI.
- RankLLM - Using LLMs themselves as rerankers via listwise prompting.
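
Listwise reranking in the RankLLM style sends the LLM numbered passages and asks for an ordering such as `[2] > [3] > [1]`; the fragile part is parsing the reply. A sketch where the prompt wording and parsing heuristics are illustrative and the actual LLM call is omitted:

```python
import re

def listwise_prompt(query: str, passages: list[str]) -> str:
    """Build a listwise ranking prompt over numbered passages."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (f"Rank the passages by relevance to the query.\n"
            f"Query: {query}\n{numbered}\n"
            f"Answer with an ordering like [2] > [1] > [3].")

def parse_ordering(reply: str, passages: list[str]) -> list[str]:
    """Turn an LLM reply such as '[2] > [3] > [1]' back into passages,
    dropping out-of-range or repeated indices defensively."""
    seen, order = set(), []
    for m in re.findall(r"\[(\d+)\]", reply):
        i = int(m) - 1
        if 0 <= i < len(passages) and i not in seen:
            seen.add(i)
            order.append(passages[i])
    # Append anything the model forgot, in original order.
    order += [p for j, p in enumerate(passages) if j not in seen]
    return order
```

The defensive parsing matters in practice: LLM rerankers occasionally repeat, skip, or invent indices, and a pipeline should degrade gracefully rather than crash.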
### Query Transformation

- Query Rewriting - Use an LLM to reformulate the user query for better retrieval.
- Step-Back Prompting - Ask a more abstract question first to retrieve broader context.
- Query Decomposition - Break complex questions into sub-questions, retrieve for each, then synthesize.
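
Query decomposition is mostly a thin orchestration layer: produce sub-questions, retrieve for each, then synthesize. A sketch with the LLM steps stubbed out; the rule-based splitter and the `retrieve`/`synthesize` parameters are placeholders for real LLM and retriever calls:

```python
def decompose(question: str, llm=None) -> list[str]:
    """In a real pipeline an LLM produces the sub-questions; here a
    trivial rule-based stub splits on ' and ' for illustration."""
    if llm is not None:
        return llm(f"Break into sub-questions: {question}").splitlines()
    return [q.strip().rstrip("?") + "?" for q in question.split(" and ")]

def answer_by_decomposition(question: str, retrieve, synthesize) -> str:
    """Retrieve per sub-question, then hand everything to a synthesis step."""
    subs = decompose(question)
    contexts = {sub: retrieve(sub) for sub in subs}  # retrieve per sub-question
    return synthesize(question, contexts)            # final LLM synthesis
```

The payoff is that each sub-question is simple enough to retrieve for directly, which multi-hop questions (see HotpotQA below) otherwise defeat.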
### Evaluation Frameworks

| Framework | Description |
|---|---|
| RAGAS | Reference-free evaluation framework. Metrics: faithfulness, answer relevancy, context precision/recall. |
| TruLens | Evaluation and tracking for LLM apps, including RAG-specific metrics. |
| DeepEval | Unit testing framework for LLM outputs with RAG-aware metrics. |
| ARES | Automated RAG Evaluation System — uses LLM judges with statistical confidence. |
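
To make the table's metrics concrete, here are label-based versions of context precision and recall. Note this is not how RAGAS computes them: RAGAS is reference-free and uses LLM judges, whereas this sketch assumes gold relevance labels purely to show what the metrics measure:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)
```

Precision penalizes padding the context window with noise; recall penalizes missing the evidence the answer needs. Most retrieval tuning is a trade-off between the two.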
## Frameworks and Libraries

| Framework | Language | Description |
|---|---|---|
| LangChain | Python/JS | The most widely adopted framework for building RAG pipelines. Large ecosystem, many integrations. |
| LlamaIndex | Python | Data framework specifically designed for RAG. Strong focus on indexing and retrieval abstractions. |
| Haystack | Python | Production-ready NLP framework by deepset. Pipeline-based architecture. |
| RAGFlow | Python | Open-source RAG engine with deep document understanding and chunk visualization. |
| Verba | Python | Open-source RAG chatbot powered by Weaviate. Good for quick prototyping. |
| Cognita | Python | Open-source modular RAG framework for production use. |
## Vector Databases

| Database | Type | Notes |
|---|---|---|
| Chroma | Embedded | Lightweight, easy to start with. Good for prototyping and smaller projects. |
| Weaviate | Self-hosted / Cloud | Supports hybrid search natively. GraphQL API. |
| Qdrant | Self-hosted / Cloud | Written in Rust. Excellent filtering and payload support. |
| Milvus | Self-hosted / Cloud | Highly scalable. Used in many production deployments. |
| Pinecone | Cloud-only | Fully managed. Simple API. Popular in industry. |
| FAISS | Library | Meta's similarity search library. Not a database, but extremely fast for local use. |
| pgvector | PostgreSQL Extension | Add vector search to your existing PostgreSQL database. Great if you already use Postgres. |
## Embedding Models

| Model | Provider | Notes |
|---|---|---|
| text-embedding-3-small/large | OpenAI | Strong general-purpose embeddings. large variant has 3072 dimensions. |
| Cohere Embed v3 | Cohere | Supports multiple input types (search_document, search_query). |
| BGE (BAAI) | Open-source | Top-performing open-source embeddings. Available in multiple sizes. |
| E5-Mistral | Open-source | LLM-based embedding model. Strong performance on MTEB benchmark. |
| Nomic Embed | Open-source | Long-context (8192 tokens), fully open-source with open training data. |
| Jina Embeddings | Open-source | Multilingual, supports 8192 token context. Good for non-English corpora. |
Tip: Check the MTEB (Massive Text Embedding Benchmark) Leaderboard for up-to-date embedding model benchmarks.
## Tutorials and Guides

- RAG From Scratch (LangChain) - Series of notebooks covering RAG concepts from basics to advanced patterns.
- Building RAG Applications with LlamaIndex - Official LlamaIndex documentation and conceptual guide.
- Pinecone RAG Learning Center - Well-written introduction to RAG with practical examples.
- Contextual Retrieval Guide - Practical improvements to standard RAG with contextual embeddings and BM25.
## Videos and Talks

- Full Stack RAG App Tutorial (freeCodeCamp) - Video walkthrough of building a complete RAG application.
- But what is RAG? (3Blue1Brown-style explainer) - Visual, intuitive explanation of how RAG works.
- RAG is Dead? Long Live RAG! (Keynote) - Discussion on the future of RAG vs. long-context models.
- Building Production RAG (AI Engineer Summit) - Practical lessons from deploying RAG at scale.
- Advanced RAG Techniques (DeepLearning.AI) - Short course by Andrew Ng's platform.
## Datasets and Benchmarks

| Dataset/Benchmark | Description |
|---|---|
| Natural Questions (NQ) | Google's open-domain QA dataset. Standard benchmark for retrieval systems. |
| HotpotQA | Multi-hop QA requiring reasoning over multiple documents. |
| MS MARCO | Large-scale passage retrieval and QA benchmark. |
| BEIR | Heterogeneous benchmark for zero-shot evaluation of retrieval models across diverse tasks. |
| RGB (Retrieval-Augmented Generation Benchmark) | Specifically designed to evaluate RAG systems on noise robustness, negative rejection, information integration, and counterfactual robustness. |
## Contributing

Contributions are welcome! This is a collaborative resource for students and researchers.
Please follow these guidelines:
- Fork this repository
- Add your resource in the appropriate section
- Follow the existing format: `**[Resource Name](link)** - Brief description.`
- Ensure all links are working and resources are relevant to RAG
- Submit a pull request
## License

This work is dedicated to the public domain under CC0 1.0 Universal.
