lexichunk

Intelligent legal document chunking for RAG pipelines.

Python 3.10+ · License: MIT


The Problem

General-purpose chunkers treat legal text like generic prose. On contracts and terms & conditions, this produces five specific failure modes that degrade RAG retrieval quality.

Clause fragmentation. A 512-token window splits a limitation of liability clause from its qualifying proviso. "The Seller shall not be liable..." lands in one chunk while "...except in the case of fraud or wilful misconduct" lands in the next. Any query about liability scope retrieves an incomplete and potentially misleading answer.
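
A minimal sketch of this failure, assuming a naive chunker that cuts every N characters as a stand-in for a fixed token window. The helper fixed_window_chunks and the sample clause are hypothetical and not part of lexichunk:

```python
def fixed_window_chunks(text: str, window_chars: int) -> list[str]:
    """Cut text every window_chars characters, ignoring clause boundaries."""
    return [text[i:i + window_chars] for i in range(0, len(text), window_chars)]


# Sample clause written for this example: a rule followed by its proviso.
clause = (
    "The Seller shall not be liable for any indirect or consequential loss, "
    "except in the case of fraud or wilful misconduct."
)

# The window happens to end just before the proviso, so the rule and its
# exception land in different chunks.
naive_chunks = fixed_window_chunks(clause, window_chars=70)
for position, piece in enumerate(naive_chunks):
    print(position, repr(piece))
```

A retriever that returns only the first piece would present the exclusion of liability without its exception.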

Orphaned cross-references. A chunk containing "subject to the restrictions set out in Clause 8.2" has no connection to Clause 8.2's content. The retriever cannot follow the reference, so the LLM reasons from an incomplete picture.

Lost defined terms. A chunk uses "Material Adverse Effect" without access to its negotiated 200-word definition from Section 1. The LLM substitutes a generic definition rather than the contract-specific one — a silent hallucination.

Destroyed hierarchy. Section 7.2(a)(iii) becomes a floating text fragment with no indication it belongs to Article VII — Indemnification. Retrieval cannot distinguish operative provisions from boilerplate.

Cross-document contamination. Without document-level metadata on every chunk, retrievers pull clauses from the wrong contract. All NDAs look structurally similar; retrieval mismatch follows.


Installation

pip install lexichunk

Optional framework integrations:

pip install lexichunk[langchain]      # LangChain TextSplitter integration
pip install lexichunk[llama-index]    # LlamaIndex NodeParser integration
pip install lexichunk[all]            # Both integrations

Quick Start

from lexichunk import LegalChunker

chunker = LegalChunker(
    jurisdiction="uk",          # or "us" / "eu"
    doc_type="contract",        # or "terms_conditions"
    max_chunk_size=512,         # tokens (approximate, 1 token ~= 4 chars)
    min_chunk_size=64,          # merge clauses smaller than this
    include_definitions=True,   # attach relevant definitions to each chunk
    include_context_header=True # Contextual Retrieval pattern headers
)

chunks = chunker.chunk(contract_text)

for chunk in chunks:
    print(chunk.content)
    print(chunk.clause_type)            # ClauseType.INDEMNIFICATION
    print(chunk.hierarchy_path)         # "Article VII > Section 7.2 > (a)"
    print(chunk.cross_references)       # [CrossReference(raw_text="Section 2.1", ...)]
    print(chunk.defined_terms_used)     # ["Material Adverse Effect", "Losses"]
    print(chunk.defined_terms_context)  # {"Material Adverse Effect": "means any event..."}
    print(chunk.context_header)         # "[Document: Service Agreement] [Section: ...]"

Output

Every call to chunker.chunk() returns a list[LegalChunk]. Each LegalChunk is a typed dataclass:

| Field | Type | Description |
| --- | --- | --- |
| content | str | The chunk text. |
| index | int | Zero-based position among all chunks from this document. |
| hierarchy | HierarchyNode | Clause position: level, identifier, title, parent. |
| hierarchy_path | str | Human-readable path, e.g. "Article VII > Section 7.2 > (a)". |
| document_section | DocumentSection | High-level section: PREAMBLE, DEFINITIONS, OPERATIVE, SCHEDULES, SIGNATURES. |
| clause_type | ClauseType | Classified type: INDEMNIFICATION, CONFIDENTIALITY, TERMINATION, ACCEPTABLE_USE, USER_RESTRICTIONS, ACCOUNT_SECURITY, etc. (27 types). |
| jurisdiction | Jurisdiction | UK, US, or EU. |
| cross_references | list[CrossReference] | Every detected reference to another clause. Each has raw_text, target_identifier, and target_chunk_index (resolved after chunking where possible). |
| defined_terms_used | list[str] | Capitalised defined terms found in this chunk's text. |
| defined_terms_context | dict[str, str] | Maps each used defined term to its full contract-specific definition. |
| context_header | str | Prepend this to content before embedding (Contextual Retrieval pattern). Example: "[Document: Service Agreement] [Section: Article VII — Indemnification > Section 7.2(a)] [Type: Indemnification] [Jurisdiction: US]". |
| document_id | str \| None | Propagated document identifier — set via LegalChunker(document_id=...). |
| char_start | int | Start character offset in the source text. |
| char_end | int | End character offset in the source text. |
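
As an illustration of how these fields might be consumed at query time, the sketch below assembles prompt context for one retrieved chunk by appending its attached definitions and the text of any resolved cross-reference targets. The dataclasses are stand-ins that mirror only a few of the documented fields so the example runs without lexichunk installed. build_prompt_context is a hypothetical helper, and it assumes the chunk list is ordered by index.

```python
from dataclasses import dataclass, field
from typing import Optional


# Stand-ins mirroring a subset of the documented fields (not the real classes).
@dataclass
class CrossReference:
    raw_text: str
    target_identifier: str
    target_chunk_index: Optional[int] = None


@dataclass
class LegalChunk:
    content: str
    index: int
    context_header: str = ""
    cross_references: list[CrossReference] = field(default_factory=list)
    defined_terms_context: dict[str, str] = field(default_factory=dict)


def build_prompt_context(hit: LegalChunk, all_chunks: list[LegalChunk]) -> str:
    """Join a retrieved chunk with its definitions and referenced clauses."""
    parts = [hit.context_header, hit.content]
    for term, definition in hit.defined_terms_context.items():
        parts.append(f'Definition: "{term}" {definition}')
    seen = {hit.index}
    for ref in hit.cross_references:
        target = ref.target_chunk_index
        if target is None or target in seen:
            continue  # unresolved reference, or clause already included
        seen.add(target)
        parts.append(f"Referenced ({ref.raw_text}): {all_chunks[target].content}")
    return "\n\n".join(part for part in parts if part)


sample_chunks = [
    LegalChunk(content="8.2 The Buyer shall not assign this Agreement.", index=0),
    LegalChunk(
        content="9.1 Subject to Clause 8.2, the Buyer may subcontract.",
        index=1,
        context_header="[Section: 9 > 9.1]",
        cross_references=[CrossReference("Clause 8.2", "8.2", 0)],
    ),
]
prompt_context = build_prompt_context(sample_chunks[1], sample_chunks)
print(prompt_context)
```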

Supported Document Types

| Jurisdiction | Document Types |
| --- | --- |
| United Kingdom | Commercial contracts (service agreements, supply agreements, employment contracts, shareholder agreements), terms and conditions |
| United States | Contracts (MSAs, NDAs, SaaS terms, employment agreements, service agreements), terms of service, privacy policies |

Pass doc_type="contract" or doc_type="terms_conditions" to the chunker.


Jurisdiction Differences

lexichunk applies jurisdiction-specific structural rules. The three built-in jurisdictions differ in numbering, header style, and cross-reference language.

| Feature | UK Convention | US Convention | EU Directives |
| --- | --- | --- | --- |
| Top-level grouping | Clause (flat numbering) | Article (Roman numerals) | Chapter (Roman) / Article (Arabic) |
| Numbering | 1, 1.1, 1.1.1, (a), (i) | Article I, Section 1.01, (a), (i) | Chapter I, Article 1, 1., (a) |
| Headers | Sentence case, minimal | ALL CAPS common | Mixed case |
| Defined terms location | "Definitions" clause | "Article I — Definitions" | "Article 4 — Definitions" |
| Schedules/Exhibits | "Schedule 1" | "Exhibit A" or "Schedule 1" | "Annex I" |
| Boilerplate heading | "General" | "Miscellaneous" | "Final Provisions" |
| Cross-reference style | "Clause 5.2" or "paragraph (a)" | "Section 5.2" or "Section 5.2(a)" | "Article 6(1)(a)" |

Select the jurisdiction at construction time with jurisdiction="uk", jurisdiction="us", or jurisdiction="eu". Custom jurisdictions can be registered via register_jurisdiction().
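
To make the cross-reference conventions concrete, here is a rough sketch with one regex per jurisdiction. These patterns are illustrative guesses written for this example and are not lexichunk's internal patterns. find_cross_references is a hypothetical helper.

```python
import re

# One illustrative pattern per jurisdiction (assumed, not the library's own).
CROSS_REF_PATTERNS = {
    # "Clause 5.2" or "paragraph (a)"
    "uk": re.compile(r"\b(?:Clause|clause)\s+\d+(?:\.\d+)*|\bparagraph\s+\([a-z]+\)"),
    # "Section 5.2" or "Section 5.2(a)"
    "us": re.compile(r"\bSection\s+\d+(?:\.\d+)*(?:\([a-z]+\))*"),
    # "Article 6(1)(a)"
    "eu": re.compile(r"\bArticle\s+\d+(?:\(\w+\))*"),
}


def find_cross_references(text: str, jurisdiction: str) -> list[str]:
    """Return the raw text of every reference found for the given jurisdiction."""
    pattern = CROSS_REF_PATTERNS[jurisdiction]
    return [match.group(0) for match in pattern.finditer(text)]


uk_refs = find_cross_references("subject to Clause 5.2 and paragraph (a)", "uk")
us_refs = find_cross_references("as described in Section 5.2(a) and Section 9", "us")
eu_refs = find_cross_references("processing under Article 6(1)(a) is lawful", "eu")
print(uk_refs, us_refs, eu_refs)
```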


LangChain Integration

Requires pip install lexichunk[langchain].

from lexichunk.integrations.langchain import LegalTextSplitter

splitter = LegalTextSplitter(
    jurisdiction="uk",
    doc_type="contract",
    max_chunk_size=512,
)

# Returns List[langchain_core.documents.Document]
documents = splitter.split_text(contract_text)

# Split multiple documents at once
documents = splitter.create_documents([text_1, text_2, text_3])

# Rich metadata is preserved on every Document
for doc in documents:
    print(doc.page_content)
    print(doc.metadata["clause_type"])       # e.g. "confidentiality"
    print(doc.metadata["hierarchy_path"])    # e.g. "7 > 7.1"
    print(doc.metadata["defined_terms_used"])
    print(doc.metadata["context_header"])
    print(doc.metadata["cross_references"])  # list of dicts

# Contextual Retrieval: prepend context_header before embedding
texts_to_embed = [
    doc.metadata["context_header"] + "\n\n" + doc.page_content
    for doc in documents
]

LlamaIndex Integration

Requires pip install lexichunk[llama-index].

from llama_index.core.schema import Document
from lexichunk.integrations.llama_index import LegalNodeParser

parser = LegalNodeParser(
    jurisdiction="us",
    doc_type="contract",
    max_chunk_size=512,
)

# Parse from plain text
nodes = parser.get_nodes_from_text(contract_text)

# Parse from LlamaIndex Document objects
llama_docs = [Document(text=contract_text)]
nodes = parser.get_nodes_from_documents(llama_docs)

# Rich metadata is on every TextNode
for node in nodes:
    print(node.text)
    print(node.metadata["clause_type"])
    print(node.metadata["hierarchy_path"])
    print(node.metadata["defined_terms_used"])

# Build a VectorStoreIndex from the nodes
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What are the indemnification obligations?")

Architecture

lexichunk runs a seven-stage pipeline on every document:

Raw Text → sanitize (BOM, CRLF, NFC)
    |
    v
1. Structure Parser     Detect clauses, numbering, hierarchy (UK/US/EU)
    |
2. Clause Chunker       Split at clause boundaries, merge/split on size
    |                   (fallback: sentence-level splitting if no structure)
    |
3. Cross-ref Detection  Detect references (first pass, unresolved)
    |
4. Clause Classifier    Keyword scoring → 27 clause types + confidence
    |
5. Context Enricher     Generate Contextual Retrieval headers
    |
6. Term Extractor       Extract defined terms, attach to chunks
    |
7. Cross-ref Resolution Resolve target_chunk_index (second pass)
    |
    v
  List[LegalChunk]
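
The sanitize step at the top of the diagram can be sketched with the standard library alone. This is an assumption about what "BOM, CRLF, NFC" covers, not the library's actual code. In particular, converting a lone CR to LF is a choice made for this example.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Strip a leading BOM, normalise line endings to LF, and apply NFC."""
    if text.startswith("\ufeff"):
        text = text[1:]
    # CRLF first, then any stray CR (the second step is an assumption).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent becomes a single code point.
raw = "\ufeff1. Definitions\r\nCafe\u0301 means the premises.\r\n"
clean = sanitize(raw)
print(repr(clean))
```

Normalising before parsing keeps the char_start and char_end offsets consistent with the text the later stages actually see.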

Structure Parser uses jurisdiction-specific regex patterns (UK, US, EU) to detect clause boundaries and build a HierarchyNode tree. Falls back to sentence-level splitting for documents with no detected structure.
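
A simplified sketch of this idea for UK-style dotted numbering follows. The regex, the Node stand-in, and both helper functions are assumptions made for illustration. The real patterns are not shown in this README, and a pattern this loose would also match body lines that merely start with a number.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed heading shape: dotted number at line start, optional dot, then a title.
UK_CLAUSE = re.compile(r"^(?P<num>\d+(?:\.\d+)*)\.?[ \t]+(?P<title>.+)$", re.MULTILINE)


@dataclass
class Node:  # stand-in for the documented HierarchyNode fields
    identifier: str
    title: str
    level: int
    parent: Optional[str]


def parse_uk_structure(text: str) -> list[Node]:
    """Detect numbered clauses and derive level and parent from the numbering."""
    nodes = []
    for match in UK_CLAUSE.finditer(text):
        parts = match.group("num").split(".")
        nodes.append(Node(
            identifier=match.group("num"),
            title=match.group("title").strip(),
            level=len(parts) - 1,
            parent=".".join(parts[:-1]) or None,
        ))
    return nodes


def hierarchy_path(identifier: str) -> str:
    """Expand "7.1.1" into "7 > 7.1 > 7.1.1"."""
    parts = identifier.split(".")
    return " > ".join(".".join(parts[: i + 1]) for i in range(len(parts)))


sample = (
    "7. Indemnification\n"
    "7.1 The Supplier shall indemnify the Customer.\n"
    "7.1.1 This includes reasonable legal costs.\n"
    "8. Confidentiality\n"
)
parsed = parse_uk_structure(sample)
for node in parsed:
    print("  " * node.level + hierarchy_path(node.identifier))
```

An empty result from a parser like this is the natural signal for the sentence-level fallback.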

Clause Chunker splits at detected boundaries. Merges undersized clauses with their siblings; splits oversized clauses at sentence boundaries.
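
The merge and split rules can be sketched as two passes over a list of clause strings. This is a simplification under stated assumptions: it uses the 1 token ~= 4 chars heuristic from the Quick Start, merges an undersized piece into the preceding piece rather than reasoning about siblings in the hierarchy, and splits sentences on a naive punctuation regex. size_chunks and approx_tokens are hypothetical names.

```python
import re


def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # heuristic: 1 token ~= 4 chars


def size_chunks(clauses: list[str], max_tokens: int, min_tokens: int) -> list[str]:
    # Pass 1: split oversized clauses at sentence boundaries.
    pieces: list[str] = []
    for clause in clauses:
        if approx_tokens(clause) <= max_tokens:
            pieces.append(clause)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.;])\s+", clause):
            candidate = f"{current} {sentence}".strip()
            if current and approx_tokens(candidate) > max_tokens:
                pieces.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            pieces.append(current)
    # Pass 2: merge an undersized piece into its predecessor when the result fits.
    merged: list[str] = []
    for piece in pieces:
        fits = merged and approx_tokens(merged[-1] + " " + piece) <= max_tokens
        if fits and approx_tokens(piece) < min_tokens:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


clause_a = "1.1 The Supplier shall deliver the Goods to the Buyer."
clause_b = "1.2 Time is of the essence."
clause_c = (
    "2.1 The Buyer shall pay the Price within thirty days. "
    "Late payments accrue interest at four per cent. "
    "Interest is calculated daily."
)
sized = size_chunks([clause_a, clause_b, clause_c], max_tokens=30, min_tokens=8)
for piece in sized:
    print(approx_tokens(piece), piece)
```

In this run the short clause 1.2 is merged into 1.1, while clause 2.1 is split at a sentence boundary. Its last sentence stays on its own because merging it back would exceed the maximum.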

Cross-ref Detection & Resolution runs in two passes: first detects references, then resolves target_chunk_index after all chunks are created.
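
A minimal sketch of why two passes are needed: a reference can point forward or backward, so target indices are only known once every chunk exists. Ref, Chunk, the regex, and both functions are stand-ins invented for this example. The real CrossReference class and patterns may differ.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ref:
    raw_text: str
    target_identifier: str
    target_chunk_index: Optional[int] = None  # filled in by the second pass


@dataclass
class Chunk:
    identifier: str
    content: str
    refs: list[Ref] = field(default_factory=list)


REF_PATTERN = re.compile(r"\b(?:Clause|Section)\s+(\d+(?:\.\d+)*)")


def detect(chunks: list[Chunk]) -> None:
    """First pass: record references, leaving target_chunk_index unresolved."""
    for chunk in chunks:
        chunk.refs = [Ref(m.group(0), m.group(1)) for m in REF_PATTERN.finditer(chunk.content)]


def resolve(chunks: list[Chunk]) -> None:
    """Second pass: map each target identifier to a chunk position, if present."""
    index_by_identifier = {chunk.identifier: i for i, chunk in enumerate(chunks)}
    for chunk in chunks:
        for ref in chunk.refs:
            ref.target_chunk_index = index_by_identifier.get(ref.target_identifier)


doc_chunks = [
    Chunk("8.1", "8.1 The Buyer may assign with consent."),
    Chunk("8.2", "8.2 No assignment to competitors."),
    Chunk("9.1", "9.1 Subject to Clause 8.2 and Clause 12.4, the Buyer may subcontract."),
]
detect(doc_chunks)
resolve(doc_chunks)
for ref in doc_chunks[2].refs:
    print(ref.raw_text, "->", ref.target_chunk_index)
```

The reference to Clause 12.4 stays unresolved (None) because no chunk carries that identifier, which matches the "resolved after chunking where possible" wording in the Output table.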

Clause Classifier scores each chunk against 27 clause types using keyword signals with phrase-length weighting and position-aware bonuses.
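
The scoring idea can be sketched as follows. The keyword table, the weighting by word count, the position bonus, and the "unknown" label are all assumptions chosen for illustration. The library's actual signals and weights are not documented here.

```python
# Hypothetical keyword table covering three of the clause types.
KEYWORDS = {
    "indemnification": ["indemnify", "hold harmless", "indemnify and hold harmless"],
    "confidentiality": ["confidential information", "non-disclosure", "disclose"],
    "termination": ["terminate", "notice of termination", "termination for cause"],
}


def classify(text: str, position: float = 0.5) -> tuple[str, float]:
    """Return (clause_type, confidence); position runs from 0.0 (start) to 1.0 (end)."""
    lowered = text.lower()
    scores = {}
    for clause_type, phrases in KEYWORDS.items():
        # Phrase-length weighting: a longer phrase is more specific, so each
        # hit contributes its word count.
        scores[clause_type] = float(sum(len(p.split()) for p in phrases if p in lowered))
    # Position-aware bonus (assumed): termination clauses tend to appear late.
    if position > 0.7 and scores["termination"] > 0:
        scores["termination"] += 1.0
    total = sum(scores.values())
    if total == 0:
        return "unknown", 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


label, confidence = classify(
    "The Supplier shall indemnify and hold harmless the Customer "
    "and may terminate on breach."
)
print(label, round(confidence, 3))
```

Here the confidence is simply the winning score's share of all scores, which is one of several reasonable ways to derive it.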

Term Extractor scans the definitions section for patterns like "[Term]" means, 'the Company' means, hereinafter, and inline parenthetical definitions. Attaches relevant terms to each chunk.
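
A rough sketch of the "means" pattern and of attaching terms to a chunk. The regex is an assumption that handles only straight quotes and stops each definition at the first full stop. It does not cover hereinafter or parenthetical forms. extract_definitions and attach_terms are hypothetical helpers.

```python
import re

# Matches "Term" means ... and 'the Term' means ... (straight quotes only).
MEANS_PATTERN = re.compile(
    r"""["'](?:the\s+)?(?P<term>[A-Z][\w ]*?)["']\s+(?P<definition>means\b[^.]*\.)"""
)


def extract_definitions(definitions_section: str) -> dict[str, str]:
    """Map each defined term to its definition text."""
    return {
        match.group("term"): match.group("definition")
        for match in MEANS_PATTERN.finditer(definitions_section)
    }


def attach_terms(chunk_text: str, definitions: dict[str, str]) -> dict[str, str]:
    """Keep only the definitions whose term appears in the chunk."""
    return {
        term: definition
        for term, definition in definitions.items()
        if re.search(rf"\b{re.escape(term)}\b", chunk_text)
    }


definitions_section = (
    '"Material Adverse Effect" means any event that materially impairs the Business. '
    "'the Company' means Acme Widgets Limited."
)
definitions = extract_definitions(definitions_section)
attached = attach_terms("No Material Adverse Effect has occurred since signing.", definitions)
print(attached)
```

The result of attach_terms plays the role of defined_terms_context for a chunk, and its keys correspond to defined_terms_used.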

Context Enricher generates a header string for each chunk following the Contextual Retrieval pattern.
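
A minimal sketch of header generation, modelled on the bracketed example format shown for context_header in the Output section. build_context_header and its parameters are hypothetical. The library derives these values from the chunk's own metadata.

```python
def build_context_header(
    document_title: str,
    hierarchy_path: str,
    clause_type: str,
    jurisdiction: str,
) -> str:
    """Render chunk metadata as a bracketed header, skipping empty values."""
    parts = [
        ("Document", document_title),
        ("Section", hierarchy_path),
        ("Type", clause_type.replace("_", " ").title()),
        ("Jurisdiction", jurisdiction.upper()),
    ]
    return " ".join(f"[{label}: {value}]" for label, value in parts if value)


header = build_context_header(
    document_title="Service Agreement",
    hierarchy_path="Article VII > Section 7.2 > (a)",
    clause_type="indemnification",
    jurisdiction="us",
)
content = "(a) The Supplier shall indemnify the Customer against all Losses."

# Contextual Retrieval: embed the header together with the chunk text.
text_to_embed = header + "\n\n" + content
print(text_to_embed)
```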

Zero mandatory dependencies: the core relies only on the Python standard library, including re.


Additional API

# Extract all defined terms without chunking
terms: dict[str, DefinedTerm] = chunker.get_defined_terms(text)
for name, term in terms.items():
    print(f"{term.term} (defined in {term.source_clause}): {term.definition[:80]}...")

# Inspect parsed structure before chunking
nodes: list[HierarchyNode] = chunker.parse_structure(text)
for node in nodes:
    print(f"{'  ' * node.level}{node.identifier}: {node.title}")

Contributing

Issues and pull requests are welcome. Please open an issue before submitting large changes.

git clone https://github.com/emmcygn/lexichunk
cd lexichunk
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
