lexichunk

Intelligent legal document chunking for RAG pipelines.

The Problem

General-purpose chunkers treat legal text like generic prose. On contracts and terms & conditions, this produces five specific failure modes that degrade RAG retrieval quality.

Clause fragmentation. A 512-token window splits a limitation of liability clause from its qualifying proviso. "The Seller shall not be liable..." lands in one chunk while "...except in the case of fraud or wilful misconduct" lands in the next. Any query about liability scope retrieves an incomplete and potentially misleading answer.
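The failure is easy to reproduce with a naive splitter. The sketch below is an illustration only, not lexichunk's algorithm: `fixed_window_chunks` is a hypothetical helper, and the window is measured in characters as a stand-in for tokens.

```python
# Illustration only: a naive fixed-width splitter, not lexichunk's algorithm.
def fixed_window_chunks(text: str, window: int) -> list[str]:
    """Cut text into consecutive slices of `window` characters."""
    return [text[i:i + window] for i in range(0, len(text), window)]


clause = (
    "The Seller shall not be liable for any indirect or consequential loss, "
    "except in the case of fraud or wilful misconduct."
)

naive = fixed_window_chunks(clause, window=60)

# The liability statement and its proviso end up in different chunks.
liability_chunk = next(c for c in naive if "shall not be liable" in c)
proviso_chunk = next(c for c in naive if "except" in c)

print(repr(liability_chunk))
print(repr(proviso_chunk))
```

A retriever that returns only `liability_chunk` never sees the fraud carve-out.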

Orphaned cross-references. A chunk containing "subject to the restrictions set out in Clause 8.2" has no connection to Clause 8.2's content. The retriever cannot follow the reference, so the LLM reasons from an incomplete picture.

Lost defined terms. A chunk uses "Material Adverse Effect" without access to its negotiated 200-word definition from Section 1. The LLM substitutes a generic definition rather than the contract-specific one — a silent hallucination.

Destroyed hierarchy. Section 7.2(a)(iii) becomes a floating text fragment with no indication it belongs to Article VII — Indemnification. Retrieval cannot distinguish operative provisions from boilerplate.

Cross-document contamination. Without document-level metadata on every chunk, retrievers pull clauses from the wrong contract. All NDAs look structurally similar; retrieval mismatch follows.
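Once every chunk carries a document identifier, the retriever can filter on it. A minimal sketch, assuming a generic vector-store hit: `RetrievedChunk` and `filter_by_document` are hypothetical names, not lexichunk types.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a vector-store hit; not a lexichunk type."""
    document_id: str
    content: str
    score: float


def filter_by_document(hits: list, document_id: str) -> list:
    """Keep only hits from the requested document, best score first."""
    same_doc = [h for h in hits if h.document_id == document_id]
    return sorted(same_doc, key=lambda h: h.score, reverse=True)


hits = [
    RetrievedChunk("nda-acme", "Confidential Information means...", 0.91),
    RetrievedChunk("nda-globex", "Confidential Information means...", 0.93),
    RetrievedChunk("nda-acme", "The Recipient shall not disclose...", 0.88),
]

# The near-identical clause from the other NDA scores highest but is dropped.
acme_only = filter_by_document(hits, "nda-acme")
```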

Installation

pip install lexichunk

Optional framework integrations:

pip install "lexichunk[langchain]"      # LangChain TextSplitter integration
pip install "lexichunk[llama-index]"    # LlamaIndex NodeParser integration
pip install "lexichunk[all]"            # Both integrations

Quick Start

from lexichunk import LegalChunker

chunker = LegalChunker(
    jurisdiction="uk",          # or "us" / "eu"
    doc_type="contract",        # or "terms_conditions"
    max_chunk_size=512,         # tokens (approximate, 1 token ~= 4 chars)
    min_chunk_size=64,          # merge clauses smaller than this
    include_definitions=True,   # attach relevant definitions to each chunk
    include_context_header=True # Contextual Retrieval pattern headers
)

chunks = chunker.chunk(contract_text)

for chunk in chunks:
    print(chunk.content)
    print(chunk.clause_type)            # ClauseType.INDEMNIFICATION
    print(chunk.hierarchy_path)         # "Article VII > Section 7.2 > (a)"
    print(chunk.cross_references)       # [CrossReference(raw_text="Section 2.1", ...)]
    print(chunk.defined_terms_used)     # ["Material Adverse Effect", "Losses"]
    print(chunk.defined_terms_context)  # {"Material Adverse Effect": "means any event..."}
    print(chunk.context_header)         # "[Document: Service Agreement] [Section: ...]"

Output

Every call to chunker.chunk() returns a list[LegalChunk]. Each LegalChunk is a typed dataclass:

| Field | Type | Description |
|---|---|---|
| `content` | `str` | The chunk text. |
| `index` | `int` | Zero-based position among all chunks from this document. |
| `hierarchy` | `HierarchyNode` | Clause position: `level`, `identifier`, `title`, `parent`. |
| `hierarchy_path` | `str` | Human-readable path, e.g. `"Article VII > Section 7.2 > (a)"`. |
| `document_section` | `DocumentSection` | High-level section: `PREAMBLE`, `DEFINITIONS`, `OPERATIVE`, `SCHEDULES`, `SIGNATURES`. |
| `clause_type` | `ClauseType` | Classified type: `INDEMNIFICATION`, `CONFIDENTIALITY`, `TERMINATION`, `ACCEPTABLE_USE`, `USER_RESTRICTIONS`, `ACCOUNT_SECURITY`, etc. (27 types). |
| `jurisdiction` | `Jurisdiction` | `UK`, `US`, or `EU`. |
| `cross_references` | `list[CrossReference]` | Every detected reference to another clause. Each has `raw_text`, `target_identifier`, and `target_chunk_index` (resolved after chunking where possible). |
| `defined_terms_used` | `list[str]` | Capitalised defined terms found in this chunk's text. |
| `defined_terms_context` | `dict[str, str]` | Maps each used defined term to its full contract-specific definition. |
| `context_header` | `str` | Prepend this to `content` before embedding (Contextual Retrieval pattern). Example: `"[Document: Service Agreement] [Section: Article VII — Indemnification > Section 7.2(a)] [Type: Indemnification] [Jurisdiction: US]"`. |
| `document_id` | `str \| None` | Propagated document identifier, set via `LegalChunker(document_id=...)`. |
| `char_start` | `int` | Start character offset in the source text. |
| `char_end` | `int` | End character offset in the source text. |
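One way to use `cross_references` at query time is to pull each resolved target into the prompt alongside the retrieved chunk. The sketch below is an assumption-laden illustration: `Ref` and `Chunk` are stand-in dataclasses that mirror only the documented field names, `expand_with_references` is a hypothetical helper, and the chunk list is assumed to be ordered by `index`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ref:
    """Stand-in mirroring the documented CrossReference fields."""
    raw_text: str
    target_identifier: str
    target_chunk_index: Optional[int] = None  # None when unresolved


@dataclass
class Chunk:
    """Stand-in mirroring a few documented LegalChunk fields."""
    index: int
    content: str
    context_header: str = ""
    cross_references: list = field(default_factory=list)


def expand_with_references(hit: Chunk, all_chunks: list) -> str:
    """Build prompt context: the hit plus every resolved referenced chunk."""
    parts = [f"{hit.context_header}\n{hit.content}".strip()]
    seen = {hit.index}
    for ref in hit.cross_references:
        target = ref.target_chunk_index
        # Skip unresolved, duplicate, or out-of-range targets.
        if target is None or target in seen or not 0 <= target < len(all_chunks):
            continue
        seen.add(target)
        parts.append(f"[Referenced: {ref.raw_text}]\n{all_chunks[target].content}")
    return "\n\n".join(parts)


chunks = [
    Chunk(0, "Payment is subject to the restrictions set out in Clause 8.2.",
          cross_references=[Ref("Clause 8.2", "8.2", target_chunk_index=1)]),
    Chunk(1, "8.2 The Buyer may withhold payment for defective goods."),
]
context = expand_with_references(chunks[0], chunks)
```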

Supported Document Types

| Jurisdiction | Document Types |
|---|---|
| United Kingdom | Commercial contracts (service agreements, supply agreements, employment contracts, shareholder agreements), terms and conditions |
| United States | Contracts (MSAs, NDAs, SaaS terms, employment agreements, service agreements), terms of service, privacy policies |

Pass doc_type="contract" or doc_type="terms_conditions" to the chunker.

Jurisdiction Differences

lexichunk applies jurisdiction-specific structural rules. The three built-in jurisdictions differ in numbering, header style, and cross-reference language.

| Feature | UK Convention | US Convention | EU Directives |
|---|---|---|---|
| Top-level grouping | Clause (flat numbering) | Article (Roman numerals) | Chapter (Roman) / Article (Arabic) |
| Numbering | `1`, `1.1`, `1.1.1`, `(a)`, `(i)` | `Article I`, `Section 1.01`, `(a)`, `(i)` | `Chapter I`, `Article 1`, `1.`, `(a)` |
| Headers | Sentence case, minimal | ALL CAPS common | Mixed case |
| Defined terms location | "Definitions" clause | "Article I — Definitions" | "Article 4 — Definitions" |
| Schedules/Exhibits | "Schedule 1" | "Exhibit A" or "Schedule 1" | "Annex I" |
| Boilerplate heading | "General" | "Miscellaneous" | "Final Provisions" |
| Cross-reference style | "Clause 5.2" or "paragraph (a)" | "Section 5.2" or "Section 5.2(a)" | "Article 6(1)(a)" |

Select the jurisdiction at construction time with jurisdiction="uk", jurisdiction="us", or jurisdiction="eu". Custom jurisdictions can be registered via register_jurisdiction().
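To make the cross-reference row concrete, here is a rough sketch of what per-jurisdiction patterns could look like. These regexes are hypothetical, written only from the examples in the table above; they are not lexichunk's built-in patterns and they ignore forms such as "paragraph (a)".

```python
import re

# Hypothetical patterns derived from the table's examples only.
CROSS_REF_PATTERNS = {
    "uk": re.compile(r"\b[Cc]lause\s+\d+(?:\.\d+)*(?:\([a-z]+\))*"),
    "us": re.compile(r"\bSection\s+\d+(?:\.\d+)*(?:\([a-z]+\))*"),
    "eu": re.compile(r"\bArticle\s+\d+(?:\((?:\d+|[a-z]+)\))*"),
}


def find_cross_references(text: str, jurisdiction: str) -> list[str]:
    """Return every reference string matched by the jurisdiction's pattern."""
    return CROSS_REF_PATTERNS[jurisdiction].findall(text)


uk_refs = find_cross_references("subject to Clause 5.2 and clause 8.1(a)", "uk")
us_refs = find_cross_references("as set out in Section 5.2(a) and Section 1.01", "us")
eu_refs = find_cross_references("pursuant to Article 6(1)(a),", "eu")
```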

LangChain Integration

Requires pip install "lexichunk[langchain]".

from lexichunk.integrations.langchain import LegalTextSplitter

splitter = LegalTextSplitter(
    jurisdiction="uk",
    doc_type="contract",
    max_chunk_size=512,
)

# Returns List[langchain_core.documents.Document]
documents = splitter.split_text(contract_text)

# Split multiple documents at once
documents = splitter.create_documents([text_1, text_2, text_3])

# Rich metadata is preserved on every Document
for doc in documents:
    print(doc.page_content)
    print(doc.metadata["clause_type"])       # e.g. "confidentiality"
    print(doc.metadata["hierarchy_path"])    # e.g. "7 > 7.1"
    print(doc.metadata["defined_terms_used"])
    print(doc.metadata["context_header"])
    print(doc.metadata["cross_references"])  # list of dicts

# Contextual Retrieval: prepend context_header before embedding
texts_to_embed = [
    doc.metadata["context_header"] + "\n\n" + doc.page_content
    for doc in documents
]

LlamaIndex Integration

Requires pip install "lexichunk[llama-index]".

from llama_index.core.schema import Document
from lexichunk.integrations.llama_index import LegalNodeParser

parser = LegalNodeParser(
    jurisdiction="us",
    doc_type="contract",
    max_chunk_size=512,
)

# Parse from plain text
nodes = parser.get_nodes_from_text(contract_text)

# Parse from LlamaIndex Document objects
llama_docs = [Document(text=contract_text)]
nodes = parser.get_nodes_from_documents(llama_docs)

# Rich metadata is on every TextNode
for node in nodes:
    print(node.text)
    print(node.metadata["clause_type"])
    print(node.metadata["hierarchy_path"])
    print(node.metadata["defined_terms_used"])

# Build a VectorStoreIndex from the nodes
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What are the indemnification obligations?")

Architecture

lexichunk runs a seven-stage pipeline on every document:

Raw Text → sanitize (BOM, CRLF, NFC)
    |
    v
1. Structure Parser     Detect clauses, numbering, hierarchy (UK/US/EU)
    |
2. Clause Chunker       Split at clause boundaries, merge/split on size
    |                   (fallback: sentence-level splitting if no structure)
    |
3. Cross-ref Detection  Detect references (first pass, unresolved)
    |
4. Clause Classifier    Keyword scoring → 27 clause types + confidence
    |
5. Context Enricher     Generate Contextual Retrieval headers
    |
6. Term Extractor       Extract defined terms, attach to chunks
    |
7. Cross-ref Resolution Resolve target_chunk_index (second pass)
    |
    v
  List[LegalChunk]

Structure Parser uses jurisdiction-specific regex patterns (UK, US, EU) to detect clause boundaries and build a HierarchyNode tree. When no structure is detected, chunking falls back to sentence-level splitting.
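As a rough picture of regex-driven structure detection, the sketch below derives a hierarchy path for UK-style numbering (`7.`, `7.2`, `(a)`). The patterns and the `parse_paths` helper are hypothetical simplifications, not lexichunk's parser; real documents need far more cases (Roman numerals, headings, false positives such as lines that start with a year).

```python
import re

# Hypothetical UK-style patterns: "7.", "7.2", "(a)".
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)$")
LETTERED = re.compile(r"^\(([a-z])\)\s+(.*)$")


def parse_paths(text: str) -> list[tuple[str, str]]:
    """Return (hierarchy_path, clause_text) for each recognised clause line."""
    results = []
    stack: list[str] = []  # numbered ancestors of the current position
    for raw in text.splitlines():
        line = raw.strip()
        numbered = NUMBERED.match(line)
        if numbered:
            parts = numbered.group(1).split(".")
            # "7.2" yields ancestors ["7", "7.2"].
            stack = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
            results.append((" > ".join(stack), numbered.group(2)))
            continue
        lettered = LETTERED.match(line)
        if lettered and stack:
            path = " > ".join(stack + [f"({lettered.group(1)})"])
            results.append((path, lettered.group(2)))
    return results


sample = """7. Indemnification
7.2 The Supplier shall indemnify the Customer against:
(a) any breach of this agreement; and
(b) any negligent act."""

paths = [path for path, _ in parse_paths(sample)]
```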

Clause Chunker splits at detected boundaries. Merges undersized clauses with their siblings; splits oversized clauses at sentence boundaries.
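The merge/split behaviour can be sketched as two passes over the clause list. This is a simplified illustration under stated assumptions: `estimate_tokens` applies the 1 token ~= 4 characters rule of thumb, `pack` is a hypothetical helper, and undersized pieces are merged with the next neighbour rather than with true hierarchy siblings.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough size estimate: 1 token ~= 4 characters."""
    return max(1, len(text) // 4)


def split_sentences(text: str) -> list[str]:
    """Split after '.' or ';' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.;])\s+", text.strip()) if s]


def pack(clauses: list[str], max_tokens: int, min_tokens: int) -> list[str]:
    # Pass 1: split oversized clauses at sentence boundaries.
    pieces: list[str] = []
    for clause in clauses:
        if estimate_tokens(clause) <= max_tokens:
            pieces.append(clause)
            continue
        current = ""
        for sentence in split_sentences(clause):
            candidate = f"{current} {sentence}".strip()
            if current and estimate_tokens(candidate) > max_tokens:
                pieces.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            pieces.append(current)

    # Pass 2: merge an undersized piece with the piece that follows it.
    merged: list[str] = []
    for piece in pieces:
        if merged:
            combined = f"{merged[-1]} {piece}"
            if (estimate_tokens(merged[-1]) < min_tokens
                    and estimate_tokens(combined) <= max_tokens):
                merged[-1] = combined
                continue
        merged.append(piece)
    return merged


clauses = [
    "1.1 Term.",
    "1.2 This agreement starts on the Effective Date.",
    "2.1 The Supplier shall deliver the Goods on time. "
    "The Customer shall pay within thirty days. "
    "Late payments accrue interest daily.",
]
chunks = pack(clauses, max_tokens=20, min_tokens=5)
```

A single sentence longer than `max_tokens` is left whole in this sketch; a production chunker would need a further fallback for that case.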

Cross-ref Detection & Resolution runs in two passes: the first detects references; the second resolves target_chunk_index once all chunks have been created.
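Resolution has to wait because a reference may point at a clause that has not been chunked yet. A minimal sketch of the two passes, assuming each chunk begins with its clause number: the `REF` and `LEADING_ID` patterns and the dict layout are illustrative, reusing only the documented field names.

```python
import re

REF = re.compile(r"\b(?:Clause|Section)\s+(\d+(?:\.\d+)*)")
LEADING_ID = re.compile(r"^\s*(\d+(?:\.\d+)*)\b")


def detect(chunks: list[str]) -> list[list[dict]]:
    """First pass: record references per chunk, targets left unresolved."""
    return [
        [
            {
                "raw_text": m.group(0),
                "target_identifier": m.group(1),
                "target_chunk_index": None,
            }
            for m in REF.finditer(text)
        ]
        for text in chunks
    ]


def resolve(chunks: list[str], refs: list[list[dict]]) -> list[list[dict]]:
    """Second pass: map each identifier to the chunk that starts with it."""
    index_by_id: dict[str, int] = {}
    for i, text in enumerate(chunks):
        m = LEADING_ID.match(text)
        if m:
            index_by_id.setdefault(m.group(1), i)
    for chunk_refs in refs:
        for ref in chunk_refs:
            # Stays None when the target clause is not in this document.
            ref["target_chunk_index"] = index_by_id.get(ref["target_identifier"])
    return refs


chunks = [
    "5.1 Payment is subject to the restrictions set out in Clause 8.2.",
    "8.2 The Buyer may withhold payment.",
    "9.1 See Section 12.4 for notices.",
]
refs = resolve(chunks, detect(chunks))
```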

Clause Classifier scores each chunk against 27 clause types using keyword signals with phrase-length weighting and position-aware bonuses.
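A toy version of keyword scoring shows how the two weighting ideas interact. Everything here is an assumption made for illustration: the keyword lists, the three labels, the `"unknown"` fallback, the 60-character "near the start" window, and the confidence formula are not taken from lexichunk.

```python
# Hypothetical keyword lists; lexichunk's real signals are not shown here.
KEYWORDS = {
    "indemnification": ["indemnify", "hold harmless", "indemnification"],
    "confidentiality": ["confidential information", "disclose", "confidential"],
    "termination": ["terminate", "termination", "notice of termination"],
}


def classify(text: str, heading_chars: int = 60) -> tuple[str, float]:
    """Return (best_label, confidence), confidence = best share of all scores."""
    lowered = text.lower()
    scores: dict[str, float] = {}
    for label, phrases in KEYWORDS.items():
        score = 0.0
        for phrase in phrases:
            position = lowered.find(phrase)
            if position == -1:
                continue
            weight = float(len(phrase.split()))  # longer phrases count more
            if position < heading_chars:
                weight *= 2  # bonus for hits near the start of the chunk
            score += weight
        scores[label] = score
    total = sum(scores.values())
    if total == 0:
        return ("unknown", 0.0)
    best = max(scores, key=lambda name: scores[name])
    return (best, scores[best] / total)


sample = (
    "7.2 Indemnification. The Supplier shall indemnify and hold harmless "
    "the Customer, and may terminate this clause on notice."
)
label, confidence = classify(sample)
```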

Term Extractor scans the definitions section for patterns like "[Term]" means, 'the Company' means, hereinafter, and inline parenthetical definitions. Attaches relevant terms to each chunk.
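The simplest of these patterns, a quoted term followed by "means", can be sketched in a few lines. The regex and both helpers below are hypothetical and handle only one definition per line; "hereinafter" and parenthetical definitions are not covered.

```python
import re

# Matches lines such as: "Losses" means all liabilities, costs and expenses;
DEFINITION = re.compile(
    r'^\s*["“]([^"”]+)["”]\s+(?:means|shall mean)\s+(.+?)[.;]?\s*$'
)


def extract_definitions(definitions_section: str) -> dict[str, str]:
    """Map each quoted term to the text that follows 'means'."""
    terms: dict[str, str] = {}
    for line in definitions_section.splitlines():
        m = DEFINITION.match(line)
        if m:
            terms[m.group(1)] = m.group(2)
    return terms


def attach_terms(chunk_text: str, terms: dict[str, str]) -> dict[str, str]:
    """Return definitions for every defined term that appears in the chunk."""
    return {
        term: definition
        for term, definition in terms.items()
        if re.search(rf"\b{re.escape(term)}\b", chunk_text)
    }


section = (
    '"Losses" means all liabilities, costs and expenses;\n'
    '"Business" means the business of the Seller.\n'
    "This line defines nothing."
)
terms = extract_definitions(section)
context = attach_terms("The Buyer shall bear all Losses.", terms)
```

The matching is case-sensitive on purpose: "Losses" is a defined term, "losses" is ordinary prose.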

Context Enricher generates a header string for each chunk following the Contextual Retrieval pattern.
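The header format shown in the Output section can be approximated with plain string formatting. `build_context_header` is a hypothetical helper, not lexichunk's enricher; skipping empty fields is an assumption of this sketch.

```python
def build_context_header(document_title: str, hierarchy_path: str,
                         clause_type: str, jurisdiction: str) -> str:
    """Join the non-empty fields into bracketed '[Label: value]' segments."""
    fields = [
        ("Document", document_title),
        ("Section", hierarchy_path),
        ("Type", clause_type),
        ("Jurisdiction", jurisdiction),
    ]
    return " ".join(f"[{label}: {value}]" for label, value in fields if value)


header = build_context_header(
    "Service Agreement", "Article VII > Section 7.2(a)", "Indemnification", "US"
)

# Contextual Retrieval: embed the header together with the chunk text.
text_to_embed = header + "\n\n" + "The Supplier shall indemnify the Customer..."
```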

Zero mandatory dependencies: the core uses only the Python standard library (chiefly re).

Additional API

# Extract all defined terms without chunking (returns dict[str, DefinedTerm])
terms = chunker.get_defined_terms(text)
for name, term in terms.items():
    print(f"{term.term} (defined in {term.source_clause}): {term.definition[:80]}...")

# Inspect parsed structure before chunking (returns list[HierarchyNode])
nodes = chunker.parse_structure(text)
for node in nodes:
    print(f"{'  ' * node.level}{node.identifier}: {node.title}")

Contributing

Issues and pull requests are welcome. Please open an issue before submitting large changes.

git clone https://github.com/emmcygn/lexichunk
cd lexichunk
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
