Document Loaders

Document Loaders read raw data from various sources (text files, PDFs, Markdown, etc.) and convert it into the standard LangChain Document format for downstream text splitting, vectorization, and retrieval.

Each Document Loader has its own parameters, but they all expose a unified .load interface.
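The value of a shared .load interface is that downstream code never needs to know which loader produced a document. The sketch below illustrates the idea using only the standard library. Doc, WholeTextLoader, LineLoader, and load_all are hypothetical stand-ins invented for this illustration; they are not LangChain classes.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class Loader(Protocol):
    """Anything with a no-argument load() that returns a list of documents."""
    def load(self) -> list[Doc]: ...


class WholeTextLoader:
    """Hypothetical loader: returns the whole text as a single document."""
    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def load(self) -> list[Doc]:
        return [Doc(self.text, {"source": self.source})]


class LineLoader:
    """Hypothetical loader: returns one document per non-blank line."""
    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def load(self) -> list[Doc]:
        return [
            Doc(line, {"source": self.source, "line": number})
            for number, line in enumerate(self.text.splitlines(), start=1)
            if line.strip()
        ]


def load_all(loaders: Iterable[Loader]) -> list[Doc]:
    """Call load() on each loader, regardless of its constructor parameters."""
    docs: list[Doc] = []
    for loader in loaders:
        docs.extend(loader.load())
    return docs


docs = load_all([
    WholeTextLoader("alpha", source="a"),
    LineLoader("beta\n\ngamma", source="b"),
])
for doc in docs:
    print(doc.metadata, doc.page_content)
```

Each loader takes its own constructor arguments, but load_all only relies on load(), which mirrors how a loader is handed to a consumer without further configuration.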

Below is an introduction to some commonly used components:

TextLoader
PyPDFLoader
UnstructuredMarkdownLoader

For more components, refer to LangChain Document loaders.

TextLoader

Install Dependencies

TextLoader is included in the langchain-community package. If it is not installed, use the following command:

pip install langchain-community

Usage

Create a TextLoader object

import tempfile
from langchain_community.document_loaders import TextLoader

# Write text content to a temporary file and load it
text_content = "Artificial Intelligence (AI) is a branch of computer science..."
tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w", encoding="utf-8")
tmp_file.write(text_content)
tmp_file.flush()
tmp_file.close()

# Create a TextLoader instance, specifying the temporary file path and encoding
text_loader = TextLoader(tmp_file.name, encoding="utf-8")

Construct a LangchainKnowledge object using this text_loader object

rag = LangchainKnowledge(
    ...,
    document_loader=text_loader,
    ...,
)

Reference

langchain_community.document_loaders.text.TextLoader

PyPDFLoader

Install Dependencies

pip install -qU langchain-community pypdf

Usage

Create a PyPDFLoader object

import os
from langchain_community.document_loaders import PyPDFLoader

# Get the PDF file path from environment variables
pdf_path = os.getenv("DOCUMENT_PDF_PATH", "/path/to/your/file.pdf")
loader = PyPDFLoader(pdf_path)

Construct a LangchainKnowledge object using this loader object

rag = LangchainKnowledge(
    ...,
    document_loader=loader,
    ...,
)

Reference

How to load PDFs

UnstructuredMarkdownLoader

Install Dependencies

pip install -qU langchain-community unstructured

Usage

Create an UnstructuredMarkdownLoader object

import tempfile
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Write Markdown content to a temporary file and load it
md_content = "# Introduction to Artificial Intelligence\n\nArtificial Intelligence is a branch of computer science..."
tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".md", mode="w", encoding="utf-8")
tmp_file.write(md_content)
tmp_file.flush()
tmp_file.close()

# mode="single" treats the entire file as a single Document; strategy="fast" uses the fast parsing strategy
loader = UnstructuredMarkdownLoader(tmp_file.name, mode="single", strategy="fast")

Construct a LangchainKnowledge object using this loader object

rag = LangchainKnowledge(
    ...,
    document_loader=loader,
    ...,
)

Reference

UnstructuredMarkdownLoader

Full Example

For a complete example, see /examples/knowledge_with_documentloader/run_agent.py.
