Intelligent legal document chunking for RAG pipelines.
General-purpose chunkers treat legal text like generic prose. On contracts and terms & conditions, this produces five specific failure modes that degrade RAG retrieval quality.
Clause fragmentation. A 512-token window splits a limitation of liability clause from its qualifying proviso. "The Seller shall not be liable..." lands in one chunk while "...except in the case of fraud or wilful misconduct" lands in the next. Any query about liability scope retrieves an incomplete and potentially misleading answer.
Orphaned cross-references. A chunk containing "subject to the restrictions set out in Clause 8.2" has no connection to Clause 8.2's content. The retriever cannot follow the reference, so the LLM reasons from an incomplete picture.
Lost defined terms. A chunk uses "Material Adverse Effect" without access to its negotiated 200-word definition from Section 1. The LLM substitutes a generic definition rather than the contract-specific one — a silent hallucination.
Destroyed hierarchy. Section 7.2(a)(iii) becomes a floating text fragment with no indication it belongs to Article VII — Indemnification. Retrieval cannot distinguish operative provisions from boilerplate.
Cross-document contamination. Without document-level metadata on every chunk, retrievers pull clauses from the wrong contract. All NDAs look structurally similar; retrieval mismatch follows.
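Clause fragmentation is easy to reproduce. The sketch below is illustrative only and does not use lexichunk: `fixed_window_chunks` is a hypothetical helper, and it counts characters rather than tokens to keep the example small. It shows a naive fixed-size window placing a limitation of liability and its proviso in different chunks.

```python
def fixed_window_chunks(text: str, window: int) -> list[str]:
    """Cut text into consecutive windows of `window` characters,
    ignoring clause and sentence boundaries (hypothetical helper)."""
    return [text[i:i + window] for i in range(0, len(text), window)]


clause = (
    "The Seller shall not be liable for any indirect or consequential loss, "
    "except in the case of fraud or wilful misconduct."
)

chunks = fixed_window_chunks(clause, window=70)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))

# Locate the chunk holding the limitation and the chunk holding the proviso.
limitation_chunk = next(i for i, c in enumerate(chunks) if "shall not be liable" in c)
proviso_chunk = next(i for i, c in enumerate(chunks) if "fraud" in c)
print("split across chunks:", limitation_chunk != proviso_chunk)
```

A retriever that returns only the first chunk would present the limitation without its exception, which is the incomplete answer described above.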
pip install lexichunk

Optional framework integrations:

pip install lexichunk[langchain] # LangChain TextSplitter integration
pip install lexichunk[llama-index] # LlamaIndex NodeParser integration
pip install lexichunk[all] # Both integrations

from lexichunk import LegalChunker

chunker = LegalChunker(
    jurisdiction="uk",            # "uk", "us", or "eu"
    doc_type="contract",          # or "terms_conditions"
    max_chunk_size=512,           # tokens (approximate, 1 token ~= 4 chars)
    min_chunk_size=64,            # merge clauses smaller than this
    include_definitions=True,     # attach relevant definitions to each chunk
    include_context_header=True,  # Contextual Retrieval pattern headers
)

chunks = chunker.chunk(contract_text)

for chunk in chunks:
    print(chunk.content)
    print(chunk.clause_type)            # e.g. ClauseType.INDEMNIFICATION
    print(chunk.hierarchy_path)         # "Article VII > Section 7.2 > (a)"
    print(chunk.cross_references)       # [CrossReference(raw_text="Section 2.1", ...)]
    print(chunk.defined_terms_used)     # ["Material Adverse Effect", "Losses"]
    print(chunk.defined_terms_context)  # {"Material Adverse Effect": "means any event..."}
    print(chunk.context_header)         # "[Document: Service Agreement] [Section: ...]"

Every call to chunker.chunk() returns a list[LegalChunk]. Each LegalChunk is a typed dataclass:
| Field | Type | Description |
|---|---|---|
| content | str | The chunk text. |
| index | int | Zero-based position among all chunks from this document. |
| hierarchy | HierarchyNode | Clause position: level, identifier, title, parent. |
| hierarchy_path | str | Human-readable path, e.g. "Article VII > Section 7.2 > (a)". |
| document_section | DocumentSection | High-level section: PREAMBLE, DEFINITIONS, OPERATIVE, SCHEDULES, SIGNATURES. |
| clause_type | ClauseType | Classified type: INDEMNIFICATION, CONFIDENTIALITY, TERMINATION, ACCEPTABLE_USE, USER_RESTRICTIONS, ACCOUNT_SECURITY, etc. (27 types). |
| jurisdiction | Jurisdiction | UK, US, or EU. |
| cross_references | list[CrossReference] | Every detected reference to another clause. Each has raw_text, target_identifier, and target_chunk_index (resolved after chunking where possible). |
| defined_terms_used | list[str] | Capitalised defined terms found in this chunk's text. |
| defined_terms_context | dict[str, str] | Maps each used defined term to its full contract-specific definition. |
| context_header | str | Prepend this to content before embedding (Contextual Retrieval pattern). Example: "[Document: Service Agreement] [Section: Article VII — Indemnification > Section 7.2(a)] [Type: Indemnification] [Jurisdiction: US]". |
| document_id | str \| None | Propagated document identifier, set via LegalChunker(document_id=...). |
| char_start | int | Start character offset in the source text. |
| char_end | int | End character offset in the source text. |
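To make the role of context_header and document_id concrete, here is a minimal sketch of how downstream code might consume them. It does not import lexichunk: `ChunkSketch`, `embedding_text`, and `chunks_for_document` are hypothetical names, the class mirrors only three of the fields above, and the header strings are shortened for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for three LegalChunk fields; not the library's class."""
    content: str
    context_header: str
    document_id: Optional[str] = None


def embedding_text(chunk: ChunkSketch) -> str:
    """Prepend the context header to the content before embedding."""
    return f"{chunk.context_header}\n\n{chunk.content}"


def chunks_for_document(chunks: list[ChunkSketch], document_id: str) -> list[ChunkSketch]:
    """Keep only the chunks that belong to one document."""
    return [c for c in chunks if c.document_id == document_id]


chunks = [
    ChunkSketch(
        content="The Supplier shall indemnify the Customer against all Losses.",
        context_header="[Document: Service Agreement] [Type: Indemnification]",
        document_id="msa-001",
    ),
    ChunkSketch(
        content="Each party shall keep the Confidential Information secret.",
        context_header="[Document: NDA] [Type: Confidentiality]",
        document_id="nda-007",
    ),
]

# Scope retrieval to one contract, then build the strings to embed.
scoped = chunks_for_document(chunks, "msa-001")
texts_to_embed = [embedding_text(c) for c in scoped]
print(texts_to_embed[0])
```

Filtering on document_id before or during retrieval is one way to address the cross-document contamination problem described earlier; whether to filter in the vector store or in application code depends on the store being used.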
| Jurisdiction | Document Types |
|---|---|
| United Kingdom | Commercial contracts (service agreements, supply agreements, employment contracts, shareholder agreements), terms and conditions |
| United States | Contracts (MSAs, NDAs, SaaS terms, employment agreements, service agreements), terms of service, privacy policies |
Pass doc_type="contract" or doc_type="terms_conditions" to the chunker.
lexichunk applies jurisdiction-specific structural rules. The three built-in jurisdictions differ in numbering, header style, and cross-reference language.
| Feature | UK Convention | US Convention | EU Directives |
|---|---|---|---|
| Top-level grouping | Clause (flat numbering) | Article (Roman numerals) | Chapter (Roman) / Article (Arabic) |
| Numbering | 1, 1.1, 1.1.1, (a), (i) | Article I, Section 1.01, (a), (i) | Chapter I, Article 1, 1., (a) |
| Headers | Sentence case, minimal | ALL CAPS common | Mixed case |
| Defined terms location | "Definitions" clause | "Article I — Definitions" | "Article 4 — Definitions" |
| Schedules/Exhibits | "Schedule 1" | "Exhibit A" or "Schedule 1" | "Annex I" |
| Boilerplate heading | "General" | "Miscellaneous" | "Final Provisions" |
| Cross-reference style | "Clause 5.2" or "paragraph (a)" | "Section 5.2" or "Section 5.2(a)" | "Article 6(1)(a)" |
Select the jurisdiction at construction time with jurisdiction="uk", jurisdiction="us", or jurisdiction="eu". Custom jurisdictions can be registered via register_jurisdiction().
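The cross-reference styles in the table can be approximated with one regular expression per jurisdiction. The patterns below are a rough sketch written for this example; they are assumptions, not the patterns lexichunk ships, and `CROSS_REF_PATTERNS` and `find_cross_references` are hypothetical names.

```python
import re

# Illustrative patterns only (assumed, not taken from the library).
CROSS_REF_PATTERNS = {
    # "Clause 5.2" or "paragraph (a)"
    "uk": re.compile(r"\b(?:Clause\s+\d+(?:\.\d+)*|paragraph\s+\([a-z]+\))"),
    # "Section 5.2" or "Section 5.2(a)"
    "us": re.compile(r"\bSection\s+\d+(?:\.\d+)*(?:\([a-z]+\))*"),
    # "Article 6(1)(a)"
    "eu": re.compile(r"\bArticle\s+\d+(?:\(\w+\))*"),
}


def find_cross_references(text: str, jurisdiction: str) -> list[str]:
    """Return the raw text of every reference matching the jurisdiction's style."""
    return CROSS_REF_PATTERNS[jurisdiction].findall(text)


uk_refs = find_cross_references(
    "subject to the restrictions set out in Clause 8.2 and paragraph (a)", "uk"
)
us_refs = find_cross_references("as defined in Section 5.2(a) of this Agreement", "us")
eu_refs = find_cross_references("processing pursuant to Article 6(1)(a)", "eu")
print(uk_refs, us_refs, eu_refs)
```

Real contracts use many more variants (ranges, "Clauses 5 and 6", schedule references), so a production pattern set would need to be considerably broader than this.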
Requires pip install lexichunk[langchain].
from lexichunk.integrations.langchain import LegalTextSplitter
splitter = LegalTextSplitter(
    jurisdiction="uk",
    doc_type="contract",
    max_chunk_size=512,
)

# Returns List[langchain_core.documents.Document]
documents = splitter.split_text(contract_text)

# Split multiple documents at once
documents = splitter.create_documents([text_1, text_2, text_3])

# Rich metadata is preserved on every Document
for doc in documents:
    print(doc.page_content)
    print(doc.metadata["clause_type"])     # e.g. "confidentiality"
    print(doc.metadata["hierarchy_path"])  # e.g. "7 > 7.1"
    print(doc.metadata["defined_terms_used"])
    print(doc.metadata["context_header"])
    print(doc.metadata["cross_references"])  # list of dicts

# Contextual Retrieval: prepend context_header before embedding
texts_to_embed = [
    doc.metadata["context_header"] + "\n\n" + doc.page_content
    for doc in documents
]

Requires pip install lexichunk[llama-index].
from llama_index.core.schema import Document
from lexichunk.integrations.llama_index import LegalNodeParser
parser = LegalNodeParser(
    jurisdiction="us",
    doc_type="contract",
    max_chunk_size=512,
)

# Parse from plain text
nodes = parser.get_nodes_from_text(contract_text)

# Parse from LlamaIndex Document objects
llama_docs = [Document(text=contract_text)]
nodes = parser.get_nodes_from_documents(llama_docs)

# Rich metadata is on every TextNode
for node in nodes:
    print(node.text)
    print(node.metadata["clause_type"])
    print(node.metadata["hierarchy_path"])
    print(node.metadata["defined_terms_used"])

# Build a VectorStoreIndex from the nodes
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What are the indemnification obligations?")

lexichunk runs a seven-stage pipeline on every document:
Raw Text → sanitize (BOM, CRLF, NFC)
   |
   v
1. Structure Parser       Detect clauses, numbering, hierarchy (UK/US/EU)
   |
2. Clause Chunker         Split at clause boundaries, merge/split on size
   |                      (fallback: sentence-level splitting if no structure)
   |
3. Cross-ref Detection    Detect references (first pass, unresolved)
   |
4. Clause Classifier      Keyword scoring → 27 clause types + confidence
   |
5. Context Enricher       Generate Contextual Retrieval headers
   |
6. Term Extractor         Extract defined terms, attach to chunks
   |
7. Cross-ref Resolution   Resolve target_chunk_index (second pass)
   |
   v
List[LegalChunk]
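The sanitize step at the top of the diagram names three operations: BOM, CRLF, and NFC. A minimal stdlib sketch of what such a step could look like follows. The function name `sanitize` and the exact behaviour are assumptions made for illustration, not the library's code.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Assumed behaviour: strip a leading BOM, convert CRLF to LF, apply NFC."""
    # Drop a leading byte-order mark if one is present.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Convert Windows line endings to LF so line-based patterns behave uniformly.
    text = text.replace("\r\n", "\n")
    # Compose base characters and combining marks into single code points.
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent (U+0301) becomes a single "é" (U+00E9).
raw = "\ufeff1. Definitions\r\n1.1 \"Cafe\u0301\" means the premises.\r\n"
clean = sanitize(raw)
print(repr(clean))
```

Normalising before parsing matters because character offsets such as char_start and char_end are only meaningful against one consistent version of the text.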
Structure Parser uses jurisdiction-specific regex patterns (UK, US, EU) to detect clause boundaries and build a HierarchyNode tree. Falls back to sentence-level splitting for documents with no detected structure.
Clause Chunker splits at detected boundaries. Merges undersized clauses with their siblings; splits oversized clauses at sentence boundaries.
Cross-ref Detection & Resolution runs in two passes: first detects references, then resolves target_chunk_index after all chunks are created.
Clause Classifier scores each chunk against 27 clause types using keyword signals with phrase-length weighting and position-aware bonuses.
Term Extractor scans the definitions section for patterns like "[Term]" means, 'the Company' means, hereinafter, and inline parenthetical definitions. Attaches relevant terms to each chunk.
Context Enricher generates a header string for each chunk following the Contextual Retrieval pattern.
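As a rough illustration of the Term Extractor stage, the sketch below handles only the simplest pattern mentioned above, `"Term" means ...`, and then attaches matching definitions to a chunk by substring lookup. The regular expression and the names `DEFINITION_RE`, `extract_defined_terms`, and `terms_used` are assumptions for this example; the library's own patterns cover more forms.

```python
import re

# Matches: "Term" means <definition up to the next full stop or semicolon>.
# Illustrative only; hereinafter and parenthetical definitions are not handled.
DEFINITION_RE = re.compile(r'"(?P<term>[A-Z][^"]*)"\s+means\s+(?P<definition>[^.;]+)')


def extract_defined_terms(definitions_text: str) -> dict[str, str]:
    """Map each defined term to its definition text."""
    return {
        m["term"]: "means " + m["definition"].strip()
        for m in DEFINITION_RE.finditer(definitions_text)
    }


def terms_used(chunk_text: str, terms: dict[str, str]) -> dict[str, str]:
    """Select the definitions whose term appears verbatim in the chunk."""
    return {term: definition for term, definition in terms.items() if term in chunk_text}


definitions = (
    '"Losses" means all damages, costs and expenses. '
    '"Material Adverse Effect" means any event that materially harms the Business.'
)
terms = extract_defined_terms(definitions)
context = terms_used("The Seller shall indemnify the Buyer against all Losses.", terms)
print(context)
```

Plain substring matching is a simplification: it would also match a term embedded in a longer word, so a stricter implementation would likely use word boundaries.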
Zero mandatory dependencies: the core uses only the Python standard library, including re.
# Extract all defined terms without chunking
terms: dict[str, DefinedTerm] = chunker.get_defined_terms(text)
for name, term in terms.items():
    print(f"{term.term} (defined in {term.source_clause}): {term.definition[:80]}...")

# Inspect parsed structure before chunking
nodes: list[HierarchyNode] = chunker.parse_structure(text)
for node in nodes:
    print(f"{' ' * node.level}{node.identifier}: {node.title}")

Issues and pull requests are welcome. Please open an issue before submitting large changes.
git clone https://github.com/emmcygn/lexichunk
cd lexichunk
pip install -e ".[dev]"
pytest

MIT. See LICENSE.