Document Loaders read raw data from various sources (text files, PDFs, Markdown, etc.) and convert it into the standard LangChain Document format for downstream text splitting, vectorization, and retrieval.
Each Document Loader has its own parameters, but they all expose a unified .load interface.
The sections below introduce three commonly used loaders: TextLoader, PyPDFLoader, and UnstructuredMarkdownLoader.
For more loaders, refer to the LangChain Document loaders documentation.
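The idea of a unified load interface can be sketched without installing anything. The snippet below is only an illustration: `Document` is a stand-in dataclass that mirrors the two fields of the LangChain Document (`page_content` and `metadata`), and `SupportsLoad`, `InMemoryLoader`, and `preview` are hypothetical names introduced here, not part of LangChain. It shows that code written against `load()` does not need to know which loader it receives.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Document:
    """Stand-in with the same two fields as the LangChain Document."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class SupportsLoad(Protocol):
    """Any object that exposes load() and returns a list of documents."""
    def load(self) -> list[Document]: ...


class InMemoryLoader:
    """Hypothetical loader used only for this demo (not a LangChain class)."""
    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def load(self) -> list[Document]:
        return [Document(page_content=self.text, metadata={"source": self.source})]


def preview(loader: SupportsLoad, width: int = 40) -> list[str]:
    """Return one summary line per loaded document, whatever the loader type."""
    return [
        f"{doc.metadata.get('source', '?')}: {doc.page_content[:width]}"
        for doc in loader.load()
    ]


lines = preview(
    InMemoryLoader(
        "Artificial Intelligence (AI) is a branch of computer science...",
        "demo.txt",
    )
)
print(lines)
```

A real loader such as TextLoader should be usable with `preview` in the same way, since it also exposes `load()` and returns objects with `page_content` and `metadata`.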
TextLoader is included in the langchain-community package. If it is not installed, use the following command:
pip install langchain-community
- Create a TextLoader object
import tempfile
from langchain_community.document_loaders import TextLoader
# Write text content to a temporary file and load it
text_content = "Artificial Intelligence (AI) is a branch of computer science..."
# delete=False keeps the file on disk after the with block closes it
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w", encoding="utf-8") as tmp_file:
    tmp_file.write(text_content)
# Create a TextLoader instance, specifying the temporary file path and encoding
text_loader = TextLoader(tmp_file.name, encoding="utf-8")
- Construct a LangchainKnowledge object using this text_loader object
rag = LangchainKnowledge(
...,
document_loader=text_loader,
...,
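)
PyPDFLoader is also provided by langchain-community and additionally requires the pypdf package. Install it with the following command:
pip install -qU pypdf
- Create a PyPDFLoader object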
import os
from langchain_community.document_loaders import PyPDFLoader
# Get the PDF file path from environment variables
pdf_path = os.getenv("DOCUMENT_PDF_PATH", "/path/to/your/file.pdf")
loader = PyPDFLoader(pdf_path)
- Construct a LangchainKnowledge object using this loader object
rag = LangchainKnowledge(
...,
document_loader=loader,
...,
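)
UnstructuredMarkdownLoader requires the langchain-community and unstructured packages. Install them with the following command:
pip install -qU langchain_community unstructured
- Create an UnstructuredMarkdownLoader object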
import tempfile
from langchain_community.document_loaders import UnstructuredMarkdownLoader
# Write Markdown content to a temporary file and load it
md_content = "# Introduction to Artificial Intelligence\n\nArtificial Intelligence is a branch of computer science..."
# delete=False keeps the file on disk after the with block closes it
with tempfile.NamedTemporaryFile(delete=False, suffix=".md", mode="w", encoding="utf-8") as tmp_file:
    tmp_file.write(md_content)
# mode="single" treats the entire file as a single Document; strategy="fast" uses the fast parsing strategy
loader = UnstructuredMarkdownLoader(tmp_file.name, mode="single", strategy="fast")
- Construct a LangchainKnowledge object using this loader object
rag = LangchainKnowledge(
...,
document_loader=loader,
...,
)
For a complete example, see /examples/knowledge_with_documentloader/run_agent.py.
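The TextLoader and UnstructuredMarkdownLoader examples above create their temporary files with delete=False, so the files stay on disk until something removes them. One possible way to tidy up is sketched below using only the standard library. The helper name `temp_text_file` is hypothetical, and a plain file read stands in for the loader call so the snippet runs without LangChain installed.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temp_text_file(content: str, suffix: str = ".txt") -> Iterator[str]:
    """Yield the path of a temporary UTF-8 file and delete it afterwards."""
    with tempfile.NamedTemporaryFile(
        delete=False, suffix=suffix, mode="w", encoding="utf-8"
    ) as tmp_file:
        tmp_file.write(content)
    try:
        yield tmp_file.name
    finally:
        # Remove the file once the caller has finished with it
        os.unlink(tmp_file.name)


with temp_text_file("# Title\n\nBody text", suffix=".md") as path:
    # A loader would be created and loaded here, for example
    # TextLoader(path, encoding="utf-8").load(); a plain read stands in for it.
    with open(path, encoding="utf-8") as handle:
        loaded_text = handle.read()
    existed_inside = os.path.exists(path)

exists_after = os.path.exists(path)
print(existed_inside, exists_after)
```

This only works if the documents are fully loaded inside the with block. If the loader is handed to LangchainKnowledge and read later, the file presumably has to exist until that loading has finished, so the cleanup would need to happen after that point.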