Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant information from external data sources to improve the accuracy and specificity of their generated responses.
- Deploy an LLM on vLLM, such as Llama 3.1 8B (the one I used here).
- Install python3 and venv:
sudo apt install python3 python-is-python3 python3.10-venv
- Create a folder howtos/ and place your markdown files inside it.
- Create the Python script ragdemo.py and configure it with your LLM endpoint details.
- Create a Python venv:
python -m venv demovenv
- Activate the venv:
source demovenv/bin/activate
- Install dependencies:
pip install langchain langchain-community langchain-chroma langchain-openai \
  unstructured sentence-transformers markdown
- Run the script:
python ragdemo.py
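Before running the script, it can help to confirm that the endpoint is reachable and to see which model name it serves, since the script's model name has to match exactly. The following is a minimal sketch using only the standard library. It assumes the server exposes the OpenAI-style /models route under the same base URL used in ragdemo.py; the helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: same base URL as VLLM_API_BASE in ragdemo.py.
VLLM_API_BASE = "http://localhost:8000/v1"


def model_ids_from_response(payload: dict) -> list[str]:
    """Extract the model ids from an OpenAI-style /models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(api_base: str = VLLM_API_BASE, api_key: str = "") -> list[str]:
    """Ask the server which model names it serves (needs a running server)."""
    request = urllib.request.Request(f"{api_base}/models")
    if api_key:
        # Only send the header when a key is actually configured.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return model_ids_from_response(json.load(response))
```

With the server running, calling list_served_models() should return the names that are valid values for LOCAL_LLM_MODEL_NAME. If the call raises a connection error, the endpoint details need another look before running the demo.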
Full ragdemo.py script:
import os
import glob
import shutil

from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# --- Configuration ---
DOCUMENT_PATH = "./howtos"
CHROMA_DB_PATH = "./chroma_db"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"

# --- Local LLM Configuration (vLLM) ---
VLLM_API_BASE = "http://localhost:8000/v1"
VLLM_API_KEY = "YOUR_API_KEY"  # Replace or leave as dummy if not needed
LOCAL_LLM_MODEL_NAME = "llama-3.1-8b-instruct"  # Set this to the exact model name vLLM is serving

# --- Load and Split Documents ---
print(f"Loading documents from folder: {DOCUMENT_PATH}")
markdown_files = glob.glob(os.path.join(DOCUMENT_PATH, "*.md"))
if not markdown_files:
    print(f"No markdown files found in {DOCUMENT_PATH}.")
    exit()

docs = []
for file_path in markdown_files:
    print(f"Loading {file_path}...")
    loader = UnstructuredMarkdownLoader(file_path)
    try:
        docs.extend(loader.load())
    except Exception as e:
        print(f"Error loading {file_path}: {e}")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(splits)} chunks.")

# --- Create Embeddings and Index (Vector Database) ---
print(f"Creating embeddings using model: {EMBEDDING_MODEL}")
embeddings = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)

print(f"Creating vector database in {CHROMA_DB_PATH}")
# Rebuild the index from scratch on every run.
if os.path.exists(CHROMA_DB_PATH):
    shutil.rmtree(CHROMA_DB_PATH)
    print("Removed existing Chroma DB.")

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=CHROMA_DB_PATH
)

# --- Set up the Retriever ---
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# --- Set up the LLM ---
print(f"Initializing local LLM (vLLM): {LOCAL_LLM_MODEL_NAME} at {VLLM_API_BASE}...")
llm = ChatOpenAI(
    base_url=VLLM_API_BASE,
    api_key=VLLM_API_KEY,
    model=LOCAL_LLM_MODEL_NAME,
    temperature=0
)

# --- Create the RAG Chain ---
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# --- Query the System ---
print("\n--- RAG Demo Ready ---")
print("Enter your questions based on the markdown files in the folder.")
print("Type 'quit' to exit.")

while True:
    query = input("\nYour question: ")
    if query.lower() == 'quit':
        break
    try:
        result = qa_chain.invoke({"query": query})
        print("\n--- Answer ---")
        print(result['result'])
        if 'source_documents' in result:
            print("\n--- Sources ---")
            for i, doc in enumerate(result['source_documents']):
                print(f"Source {i+1}: {doc.metadata.get('source', 'N/A')}")
            print("--------------")
    except Exception as e:
        print(f"An error occurred: {e}")
        print("Please ensure your vLLM server is running and the endpoint/model name are correct.")

print("Exiting demo.")
- Orchestration Framework: LangChain (specifically the RetrievalQA chain) to manage the flow between components.
- Document Loading: UnstructuredMarkdownLoader from langchain-community, which relies on the unstructured and markdown Python libraries to read and parse the markdown files.
- Document Splitting: RecursiveCharacterTextSplitter from langchain to break down the documents into smaller, manageable chunks.
- Embedding Model: A local Sentence Transformer model (all-MiniLM-L6-v2) that converts text chunks into numerical vectors, accessed via SentenceTransformerEmbeddings from langchain-community, which requires the sentence-transformers Python library.
- Vector Database: ChromaDB, integrated through langchain_community.vectorstores.Chroma, used to store the document chunks and their embeddings for efficient similarity search.
- Large Language Model (LLM): Llama 3.1 8B, running locally and served by vLLM.
- LLM Interface: ChatOpenAI from langchain-openai, configured to communicate with your vLLM instance using its OpenAI-compatible API.
- File Discovery: Python's built-in glob module to find all markdown files in a specified directory.
- Environment Management (Recommended): Python's built-in venv module for creating isolated project environments.
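To make the "similarity search" step of the vector database concrete, here is a toy sketch in plain Python. It is not how ChromaDB is implemented: the three-dimensional vectors and the sample texts are invented for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and cosine similarity is used only as an example metric, since the metric Chroma actually applies depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k indexed chunks most similar to the query."""
    scored = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Hypothetical chunks with made-up 3-dimensional "embeddings".
index = [
    ("Install vLLM with pip", [0.9, 0.1, 0.0]),
    ("Activate the venv", [0.1, 0.9, 0.0]),
    ("Start the vLLM server", [0.8, 0.2, 0.1]),
]

print(top_k([1.0, 0.0, 0.0], index, k=2))
```

The k argument plays the same role as k in the retriever's search_kwargs: it caps how many chunks are handed to the LLM.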
Quick points about how RAG works and how to get the best results:
RAG helps an LLM by first retrieving the passages in your documents that are most relevant to the question (here, via similarity search in a vector DB). It passes those passages to the LLM along with your question, so the answer is more accurate and grounded in your data.
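As a rough illustration of what "giving the info to the LLM along with your question" means, the sketch below joins the retrieved chunks into a single prompt, which is the general idea behind the "stuff" chain type. The template wording and the function name are hypothetical; this is not the prompt LangChain actually sends.

```python
def build_stuffed_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt string."""
    # All chunks are "stuffed" into a single context section, separated by blank lines.
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks for a sample question.
prompt = build_stuffed_prompt(
    "How do I activate the venv?",
    ["Create a venv with python -m venv demovenv.",
     "Activate it with source demovenv/bin/activate."],
)
print(prompt)
```

Because every retrieved chunk ends up inside the same prompt, the prompt grows with both the number of chunks and their size.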
LLMs have a limit on how much text they can read at once: the "context window." Your question, the retrieved document chunks, and system instructions must all fit, along with room for the generated answer. If there's too much text, the input is truncated or the request is rejected, depending on the server, leading to incomplete or missing answers.
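A back-of-the-envelope check can show whether a given setup is likely to fit. The sketch below assumes roughly 4 characters per token, which is only a heuristic (real counts depend on the model's tokenizer), and the instruction and answer allowances are made-up defaults. It relies on chunk_size being measured in characters, which is the default for RecursiveCharacterTextSplitter.

```python
import math

# Assumption: about 4 characters per token. This is a rough heuristic only.
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int) -> int:
    """Very rough token estimate from a character count."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def estimated_request_tokens(
    k: int,
    chunk_size: int,
    question_chars: int,
    instruction_tokens: int = 200,  # hypothetical allowance for prompt instructions
    answer_tokens: int = 512,       # hypothetical room reserved for the answer
) -> int:
    """Estimate the total tokens for k chunks of chunk_size characters plus overhead."""
    retrieved = estimate_tokens(k * chunk_size)
    question = estimate_tokens(question_chars)
    return retrieved + question + instruction_tokens + answer_tokens


def fits_context(context_window: int, **request) -> bool:
    """True if the estimated request stays within the context window."""
    return estimated_request_tokens(**request) <= context_window


# The demo's settings: k=2 chunks of up to 500 characters, and a 100-character question.
needed = estimated_request_tokens(k=2, chunk_size=500, question_chars=100)
print(needed, fits_context(8192, k=2, chunk_size=500, question_chars=100))
```

With the demo's small chunks the estimate is far below typical context windows; the check becomes more useful once chunk_size or k is raised.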
To avoid cut-off answers and fit within the context window:
- Make document chunks smaller (chunk_size, chunk_overlap).
- Retrieve fewer chunks (k in search_kwargs).
- Use an LLM with a larger context window if possible.
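To see how chunk_size and chunk_overlap interact, here is a simplified sliding-window splitter. It is a sketch only: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, which this toy version ignores, and the function name is invented for the example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# A stand-in 1200-character document.
sample = "".join(str(i % 10) for i in range(1200))

for size, overlap in [(500, 100), (250, 50)]:
    pieces = split_with_overlap(sample, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(pieces)} chunks")
```

Smaller chunks mean more pieces in the index but less text per retrieved chunk, so the same k puts less text into the prompt.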
How specific your question is affects what info the system finds. Specific questions get focused, relevant chunks. General questions might get broader info, which can make the LLM's answer less precise.