Document Loaders

Document Loaders read raw data from various sources (text files, PDFs, Markdown, etc.) and convert it into the standard LangChain Document format for downstream text splitting, vectorization, and retrieval.

Each Document Loader has its own parameters, but they all expose a unified .load interface.
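Because every loader exposes `.load`, downstream code can be written against that one method. The sketch below illustrates the idea without requiring LangChain to be installed. `StubDocument` and `InMemoryLoader` are hypothetical stand-ins, not LangChain classes; they only mimic the `page_content` and `metadata` attributes of a LangChain Document. `summarize` is an illustrative helper.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in mimicking a LangChain Document's two main attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class InMemoryLoader:
    """Hypothetical loader: any object whose .load() returns a list of documents."""

    def __init__(self, texts):
        self.texts = texts

    def load(self):
        return [
            StubDocument(page_content=text, metadata={"source": f"memory:{i}"})
            for i, text in enumerate(self.texts)
        ]


def summarize(loader):
    """Call .load() on any loader and report (source, character count) per document."""
    return [(doc.metadata.get("source"), len(doc.page_content)) for doc in loader.load()]


summary = summarize(InMemoryLoader(["hello", "document loaders"]))
print(summary)
```

A real loader such as TextLoader could be passed to `summarize` in the same way, since the helper relies only on `.load()`, `page_content`, and `metadata`.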

The sections below introduce some commonly used loaders.

For other loaders, refer to the LangChain Document loaders documentation.

TextLoader

Install Dependencies

TextLoader is included in the langchain-community package. If it is not installed, use the following command:

pip install langchain-community

Usage

  1. Create a TextLoader object
import tempfile
from langchain_community.document_loaders import TextLoader

# Write text content to a temporary file; delete=False keeps the file on disk after it is closed
text_content = "Artificial Intelligence (AI) is a branch of computer science..."
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w", encoding="utf-8") as tmp_file:
    tmp_file.write(text_content)

# Create a TextLoader instance, specifying the temporary file path and encoding
text_loader = TextLoader(tmp_file.name, encoding="utf-8")
  2. Construct a LangchainKnowledge object using this text_loader object
rag = LangchainKnowledge(
    ...,
    document_loader=text_loader,
    ...,
)
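Because the temporary file is created with `delete=False`, it stays on disk after the script finishes. The following is a minimal sketch of one way to clean it up, using only the standard library. The helper name `write_temp_text` is an assumption for illustration, and the file should only be removed once the loader has finished reading it.

```python
import os
import tempfile


def write_temp_text(content, suffix=".txt"):
    """Write content to a named temporary file and return its path (hypothetical helper)."""
    with tempfile.NamedTemporaryFile(
        delete=False, suffix=suffix, mode="w", encoding="utf-8"
    ) as tmp_file:
        tmp_file.write(content)
    return tmp_file.name


path = write_temp_text("Artificial Intelligence (AI) is a branch of computer science...")
try:
    # A loader would be created and used here, e.g. TextLoader(path, encoding="utf-8").
    # Reading the file directly stands in for that step in this sketch.
    with open(path, encoding="utf-8") as handle:
        loaded_text = handle.read()
finally:
    # Remove the file once nothing needs to read it any more.
    os.remove(path)

file_removed = not os.path.exists(path)
```

If the loader reads the file lazily at a later point, the cleanup step would need to move after that point.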

Reference

PyPDFLoader

Install Dependencies

pip install -qU pypdf

Usage

  1. Create a PyPDFLoader object
import os
from langchain_community.document_loaders import PyPDFLoader

# Get the PDF file path from environment variables
pdf_path = os.getenv("DOCUMENT_PDF_PATH", "/path/to/your/file.pdf")
loader = PyPDFLoader(pdf_path)
  2. Construct a LangchainKnowledge object using this loader object
rag = LangchainKnowledge(
    ...,
    document_loader=loader,
    ...,
)
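The fallback value `/path/to/your/file.pdf` above is a placeholder, so a loader built from it will fail when it tries to open the file. As a hedged sketch, the path can be checked before the loader is constructed. `resolve_pdf_path` is a hypothetical helper written for this example; it uses only the standard library and the `DOCUMENT_PDF_PATH` variable from the snippet above.

```python
import os
from pathlib import Path


def resolve_pdf_path(env_var="DOCUMENT_PDF_PATH"):
    """Return the PDF path from the environment, or raise if it is unset or missing."""
    raw = os.getenv(env_var)
    if not raw:
        raise ValueError(f"Environment variable {env_var} is not set")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    return str(path)


try:
    pdf_path = resolve_pdf_path()
    status = "ok"
except (ValueError, FileNotFoundError) as exc:
    # Report the problem early instead of failing inside the loader.
    pdf_path = None
    status = f"error: {exc}"
print(status)
```

When `status` is `"ok"`, `pdf_path` can be passed to `PyPDFLoader(pdf_path)` as in step 1.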

Reference

UnstructuredMarkdownLoader

Install Dependencies

pip install -qU langchain-community unstructured

Usage

  1. Create an UnstructuredMarkdownLoader object
import tempfile
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Write Markdown content to a temporary file; delete=False keeps the file on disk after it is closed
md_content = "# Introduction to Artificial Intelligence\n\nArtificial Intelligence is a branch of computer science..."
with tempfile.NamedTemporaryFile(delete=False, suffix=".md", mode="w", encoding="utf-8") as tmp_file:
    tmp_file.write(md_content)

# mode="single" treats the entire file as a single Document; strategy="fast" uses the fast parsing strategy
loader = UnstructuredMarkdownLoader(tmp_file.name, mode="single", strategy="fast")
  2. Construct a LangchainKnowledge object using this loader object
rag = LangchainKnowledge(
    ...,
    document_loader=loader,
    ...,
)

Reference

Full Example

For a complete example, see /examples/knowledge_with_documentloader/run_agent.py.