feat: add support for long-context documents#179

Open
eurekayuan wants to merge 4 commits into main from feature/long-context

Conversation

@eurekayuan

Summary

Several stages embedded the whole document in a single prompt and hit DataDesigner's 512K (MAX_RENDERED_LEN) render cap, failing outright on long inputs. Every such stage is now windowed: each chunked generator renders its own per-window prompt and calls the model directly, bypassing the cap. Stages keep a single-call fast path when the rendered prompt already fits, so short-document behavior is unchanged.

Per-stage windowing

  • Detection (chunked_detection.py, new): Overlapping fixed-size character windows; each window is a raw text slice sent to the detector. Per-window offsets are rebased to global, boundary-touching spans are dropped, and overlaps are resolved (resolve_overlaps).
  • Validation (chunked_validation.py): Not a text window — batches candidate entities (≤100 per call), each with a ±500-character excerpt. Calls run in parallel across the validator pool with round-robin + failover. Decisions are merged per row; the row is dropped only if every pool member fails.
  • Augmentation (chunked_augmentation.py): Overlapping character windows over tagged text plus seed JSON. A window dynamically shrinks if its rendered prompt exceeds the cap. Outputs are unioned and deduped by (value, label).
  • Latent (chunked_latent.py): Same mechanism as augmentation (rewrite mode only); deduped by (label, value).
  • Substitution map (chunked_replace.py): Abutting newline-aligned windows, no overlap. Each chunk carries the accumulated replacement map and a rolling summary, proposing replacements only for new entities so mappings stay consistent across chunks.
  • Rewrite generation (chunked_rewrite.py): Abutting newline-aligned windows, no overlap. Runs sequentially, passing a continuity preamble and rolling summary between chunks; rewritten parts are stitched.
  • Final judge (chunked_final_judge.py, new): Splits original and rewritten text into N positionally-paired slices, scores each, and aggregates per-dimension by minimum. Rubric scales are embedded in the prompt with structured output. Replaces the non-windowed LLMJudgeColumnConfig.
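
The detection bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in chunked_detection.py: `Span`, `plan_windows`, and `rebase_and_filter` are hypothetical names, and the rule for what counts as a boundary-touching span is an assumption. The overlap-resolution step (resolve_overlaps) is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def plan_windows(text_len: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Return overlapping [start, end) character windows that cover the text."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, text_len)
        windows.append((start, end))
        if end >= text_len:
            return windows
        start += stride


def rebase_and_filter(local_spans, window, text_len):
    """Shift window-local spans to global offsets.

    Spans that touch an interior window edge are dropped (assumed rule): they
    may be truncated, and a neighbouring window should see them whole as long
    as the overlap is longer than the entity.
    """
    w_start, w_end = window
    kept = []
    for span in local_spans:
        touches_left = span.start == 0 and w_start > 0
        touches_right = w_start + span.end == w_end and w_end < text_len
        if touches_left or touches_right:
            continue
        kept.append(Span(w_start + span.start, w_start + span.end, span.label))
    return kept
```

Edges that coincide with the start or end of the document are not treated as boundaries here, since nothing can be cut off there.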

Parallel processing

  • Stateless stages (detection, validation, augmentation, latent, judge) dispatch windows in parallel (bounded ThreadPoolExecutor; the per-alias rate limit still governs real in-flight calls) and merge afterward.
  • Stateful stages (substitution-map, rewrite generation) stay sequential to thread the map / rolling summary across seams for consistency.
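
As a rough sketch of the two dispatch modes described above (the helper names `run_stateless` and `run_stateful` and the toy `assign` callback are hypothetical, and the real generators also handle rate limiting and merging):

```python
from concurrent.futures import ThreadPoolExecutor


def run_stateless(windows, call, max_workers=4):
    """Dispatch independent windows in parallel; results come back in window order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call, windows))


def run_stateful(windows, call, initial_state):
    """Process windows one at a time, threading state from each call into the next."""
    state = initial_state
    outputs = []
    for window in windows:
        output, state = call(window, state)
        outputs.append(output)
    return outputs, state


def assign(window, mapping):
    """Toy stand-in for a substitution-map call: reuse earlier replacements."""
    mapping = dict(mapping)
    for name in window.split():
        mapping.setdefault(name, f"ENTITY_{len(mapping)}")
    return [mapping[name] for name in window.split()], mapping


if __name__ == "__main__":
    print(run_stateless(["alice bob", "carol"], lambda w: len(w.split())))
    print(run_stateful(["alice bob", "bob carol"], assign, {}))
```

The stateful loop is what keeps "bob" mapped to the same replacement in both windows; running those windows in parallel would lose that guarantee.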

Window sizing

  • detection_window_max_render_chars (default 128 KiB, clamped ≤ NDD's render cap) is the single knob; it is threaded into detection, augmentation, latent, substitute-map, rewrite, and judge.
  • detection_window_safety_margin_chars (8K) leaves headroom for prompt scaffolding; detection_window_overlap_chars (1K) sets the overlap for the overlapping stages; a 4K floor prevents pathological shrinking.
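
One way the three settings could interact, as a sketch only. The function names are hypothetical, "K" is assumed to mean 1024, and both the subtraction of the safety margin and the halving strategy are assumptions rather than a description of the actual implementation:

```python
KIB = 1024


def effective_window_chars(
    max_render_chars=128 * KIB,
    render_cap=512 * KIB,
    safety_margin_chars=8 * KIB,
    floor_chars=4 * KIB,
):
    """Clamp the knob to the render cap, reserve the margin, never go below the floor."""
    budget = min(max_render_chars, render_cap) - safety_margin_chars
    return max(budget, floor_chars)


def shrink_window(text, start, size, render, cap, floor_chars=4 * KIB):
    """Halve the window until its rendered prompt fits the cap or the floor is hit."""
    while size > floor_chars and len(render(text[start:start + size])) > cap:
        size = max(floor_chars, size // 2)
    return size
```

With the defaults this leaves 120 KiB of document text per window, and a knob set above the render cap is silently clamped to it.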

Fault tolerance & failure tracking

  • Augmentation, latent, and the final judge are resilient to a single bad window: a window whose call fails is logged and skipped rather than dropping the whole record. Skipped-window counts are surfaced in trace_dataframe (COL_AUGMENTATION_FAILED_WINDOWS / COL_LATENT_FAILED_WINDOWS); the judge degrades to defaults if all windows fail.
  • Detection windows re-raise (the record fails) to preserve detection completeness, and validation relies on pool failover.
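
The skip-and-count behavior, together with the judge's minimum aggregation, might look roughly like this. `run_tolerant` and `aggregate_min` are hypothetical names, and the dimension names and default scores in the usage lines are made up for illustration:

```python
import logging

logger = logging.getLogger("windowed")


def run_tolerant(windows, call):
    """Return (outputs, failed_count); a failing window is logged and skipped."""
    outputs, failed = [], 0
    for idx, window in enumerate(windows):
        try:
            outputs.append(call(window))
        except Exception:
            logger.warning("window %d failed; skipping", idx, exc_info=True)
            failed += 1
    return outputs, failed


def aggregate_min(scores, defaults):
    """Per-dimension minimum across window scores; defaults if no window succeeded."""
    if not scores:
        return dict(defaults)
    return {dim: min(score[dim] for score in scores) for dim in defaults}


def fake_judge(window):
    """Stand-in for a model call that fails on one window."""
    if window == "bad":
        raise RuntimeError("transient model error")
    return window


if __name__ == "__main__":
    windows = [{"privacy": 5, "fidelity": 4}, "bad", {"privacy": 3, "fidelity": 5}]
    scores, failed = run_tolerant(windows, fake_judge)
    print(aggregate_min(scores, {"privacy": 1, "fidelity": 1}), failed)
```

Taking the minimum means one weak slice determines the record's score, which is the conservative choice for a privacy-oriented judge.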

Observability

Per-window debug logging across all chunked stages: window ranges/sizes, rendered length vs cap, shrink events, rolling-summary contents, and per-stage entity/replacement/window counts.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

@eurekayuan eurekayuan requested review from a team as code owners June 3, 2026 18:42
@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@eurekayuan eurekayuan changed the title from "Add support for handling long-context docs" to "feat/long-context" Jun 3, 2026
@eurekayuan eurekayuan changed the title from "feat/long-context" to "feat: add support for long-context documents" Jun 3, 2026
@eurekayuan

Author

I have read the DCO document and I hereby sign the DCO.

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Greptile Summary

This PR adds windowed long-context support to every LLM-calling stage of the anonymizer pipeline (detection, augmentation, latent, validation, substitute-map, rewrite, judge) so documents exceeding DataDesigner's 512K MAX_RENDERED_LEN render cap are processed in chunks rather than failing outright. Short-document behavior is preserved via a fast path that falls through to a single call when the rendered prompt already fits.

  • New chunked generators replace the previous single-call LLMStructuredColumnConfig / LLMTextColumnConfig wiring with custom column generators that call the facade directly, bypassing the render cap.
  • Window sizing is controlled by three new Detect config fields with a model-validator that rejects overlap/window combinations that would stall iteration; these settings are threaded through every stage via the public Anonymizer interface.
  • Fault tolerance is tiered: stateless stages (augmentation, latent, judge) swallow per-window errors and degrade gracefully; stateful sequential stages (substitute-map, rewrite) propagate errors; detection preserves completeness by re-raising.

Confidence Score: 4/5

The change is broadly safe to merge; the fast path preserves existing behaviour for documents that fit the cap, and the new windowed paths are well-structured with per-window fault tolerance.

The core windowing mechanics are correct and well-tested. Two quality gaps exist in the substitute-map chunked path: entities straddling newline boundaries are queued for replacement in a window whose tagged text does not highlight them, and the full replacement map is forwarded to every chunk unfiltered. Neither gap causes outright failure on typical PII inputs, but they are real edge-case correctness concerns.

src/anonymizer/engine/replace/chunked_replace.py — both the boundary-entity tagging gap and the unfiltered replacement-map overhead deserve a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/engine/replace/chunked_replace.py | New file: windowed substitute-map generation with rolling summary. Entities spanning chunk boundaries are included in the entity list for the window where they start but omitted from that window's tagged text, so the LLM is asked to replace entities it cannot see in context. |
| src/anonymizer/engine/detection/chunked_detection.py | New file: overlapping-window seed detection with offset rebasing, boundary-span dropping, and resolve_overlaps deduplication. Logic is clean and well-tested. |
| src/anonymizer/engine/detection/chunked_augmentation.py | New file: overlapping-window augmentation with dynamic window shrinking and parallel dispatch. Graceful per-window failure handling is correct. |
| src/anonymizer/engine/rewrite/chunked_rewrite.py | New file: sequential boundary-aligned rewrite generation with rolling-summary continuity. Previously flagged issues exist but are already in review threads. |
| src/anonymizer/engine/rewrite/chunked_final_judge.py | New file: parallel windowed judge that scores paired original/rewrite slices and aggregates by minimum per dimension. Logic is sound. |
| src/anonymizer/engine/rewrite/chunked_steps.py | New file: generic sequential windowed metadata step used by domain classification, sensitivity disposition, and meaning-unit extraction. |
| src/anonymizer/config/anonymizer_config.py | Adds three new Detect config fields with a model_validator that rejects pathological overlap/window ratios. Config defaults are sensibly clamped to NDD's render cap. |
| src/anonymizer/interface/anonymizer.py | Threads the three new window-sizing config values from Detect config into every workflow call site. Wiring is consistent across detect, replace, and rewrite paths. |

Reviews (2): Last reviewed commit: "Address review: thread window sizing, lo..."

_clip(summary),
)

stitched = "\n".join(part for part in rewritten_parts if part)

Contributor

P2 Chunk boundaries are aligned to newlines by iter_boundary_windows, so each tagged[start:end] slice already ends with "\n". When the LLM mirrors that structure in its output (natural for paragraph-aware models), every rewritten_chunk also ends with "\n", and "\n".join(...) then inserts a second newline, producing a blank line at every chunk boundary in the final anonymized document. Joining with "" is sufficient because the delimiter is already part of each chunk.

Suggested change
- stitched = "\n".join(part for part in rewritten_parts if part)
+ stitched = "".join(part for part in rewritten_parts if part)
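
A small reproduction of the doubled newline, assuming chunks are cut just after a newline and that the model echoes each chunk's trailing newline. `boundary_windows` is a hypothetical stand-in for iter_boundary_windows, whose real signature is not shown in this thread, and the rewrite step is replaced by the identity:

```python
def boundary_windows(text, max_chars):
    """Split text into abutting slices that end on a newline where possible."""
    windows, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind("\n", start, end)
            if cut != -1:
                end = cut + 1  # keep the newline inside this slice
        windows.append(text[start:end])
        start = end
    return windows


if __name__ == "__main__":
    text = "alpha\nbeta\ngamma\n"
    parts = boundary_windows(text, max_chars=8)  # identity "rewrite" for the demo
    print(repr("\n".join(parts)))  # blank line at each seam
    print(repr("".join(parts)))    # round-trips the original text
```

Because the slices abut and each keeps its own delimiter, concatenating them with an empty separator reproduces the input exactly.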

Comment on lines +172 to +181
# Fast path: the full single-call rewrite prompt fits under the cap.
single_rendered = _render_chunk_prompt(template=params.single_call_prompt_template, chunk_row=row, summary="")
if len(single_rendered) <= cap:
    logger.debug("rewrite: single-call fast path (rendered=%d chars <= cap=%d)", len(single_rendered), cap)
    text = _rewrite_chunk(
        facade=facade,
        prompt=_compile_template(params.single_call_prompt_template).render(**row),
        system_prompt=params.system_prompt,
        purpose="rewrite-generation",
    )

Contributor

P2 The fast path measures single_rendered as _render_chunk_prompt(..., summary=""), which prepends the ~270-char continuity preamble, but then the actual LLM call omits that preamble. This means a document whose body-only prompt falls in (cap - 270, cap] chars will be routed into the chunked path unnecessarily. Measure with just the body to match what is actually sent.

Suggested change
- # Fast path: the full single-call rewrite prompt fits under the cap.
- single_rendered = _render_chunk_prompt(template=params.single_call_prompt_template, chunk_row=row, summary="")
+ # Fast path: measure body-only prompt (no continuity preamble) since that is what is sent.
+ single_rendered = _compile_template(params.single_call_prompt_template).render(**row)
  if len(single_rendered) <= cap:
      logger.debug("rewrite: single-call fast path (rendered=%d chars <= cap=%d)", len(single_rendered), cap)
      text = _rewrite_chunk(
          facade=facade,
-         prompt=_compile_template(params.single_call_prompt_template).render(**row),
+         prompt=single_rendered,
          system_prompt=params.system_prompt,
          purpose="rewrite-generation",
      )

Comment on lines 27 to 29
)
from anonymizer.engine.ndd.model_loader import resolve_model_alias
from anonymizer.engine.prompt_utils import substitute_placeholders

Contributor

P2 Private symbol imported across module boundary. _compile_template is module-private (underscore-prefixed) in chunked_steps.py. Importing it here creates a hidden coupling: if the function is renamed or inlined, qa_generation.py breaks without any clear contract. Consider exposing it as a public helper in chunked_steps.py or defining a local copy with its own lru_cache in this module.

Comment on lines +372 to +384
    row[COL_QUALITY_QA] = _generate(full_rendered, "quality-qa-generation").model_dump()
    return row

units = json.loads(row.get(COL_MEANING_UNITS_SERIALIZED) or "[]")
base_len = len(compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: "[]"}))
batches = _batch_units_by_size(units, base_len, max_render_chars - safety_margin_chars)
items: list[dict[str, Any]] = []
for batch_idx, batch in enumerate(batches):
    rendered = compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: json.dumps(batch, ensure_ascii=False)})
    out = _generate(rendered, f"quality-qa-generation-batch-{batch_idx}")
    for item in out.items:
        items.append({**item.model_dump(mode="json"), "id": len(items) + 1})
row[COL_QUALITY_QA] = QualityQAPairsSchema.model_validate({"items": items}).model_dump()

Contributor

P2 The fast path stores the result via .model_dump() (no mode="json"), while every other windowed generator in this PR consistently uses .model_dump(mode="json"). Without mode="json", Pydantic returns native Python objects rather than JSON-serializable equivalents, which can cause downstream serialization failures. The batched path has the same inconsistency.

Suggested change
-     row[COL_QUALITY_QA] = _generate(full_rendered, "quality-qa-generation").model_dump()
+     row[COL_QUALITY_QA] = _generate(full_rendered, "quality-qa-generation").model_dump(mode="json")
      return row
  units = json.loads(row.get(COL_MEANING_UNITS_SERIALIZED) or "[]")
  base_len = len(compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: "[]"}))
  batches = _batch_units_by_size(units, base_len, max_render_chars - safety_margin_chars)
  items: list[dict[str, Any]] = []
  for batch_idx, batch in enumerate(batches):
      rendered = compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: json.dumps(batch, ensure_ascii=False)})
      out = _generate(rendered, f"quality-qa-generation-batch-{batch_idx}")
      for item in out.items:
          items.append({**item.model_dump(mode="json"), "id": len(items) + 1})
- row[COL_QUALITY_QA] = QualityQAPairsSchema.model_validate({"items": items}).model_dump()
+ row[COL_QUALITY_QA] = QualityQAPairsSchema.model_validate({"items": items}).model_dump(mode="json")

Comment on lines +27 to +30
_DEFAULT_MAX_RENDER_CHARS: int = _DetectConfig.model_fields["detection_window_max_render_chars"].default
_DEFAULT_SAFETY_MARGIN_CHARS: int = _DetectConfig.model_fields["detection_window_safety_margin_chars"].default


Contributor

P2 Unguarded index on potentially empty list. _first_output calls outputs[0] without checking length. In run_windowed_step with first_only=True, if iter_boundary_windows returns an empty list, outputs is empty and this raises IndexError. The fast path makes this unreachable today, but a defensive guard would make the failure mode explicit.

@andreatgretel andreatgretel left a comment

Collaborator

Thanks for taking this on. This is a substantial first PR, and the overall direction makes sense: split long records into bounded windows, carry forward the state needed for consistency, and keep the replacement map explicit.

I left a few comments on edge cases I think are worth tightening before merge. The main themes are:

  • thread the user-supplied window sizing through every windowed stage
  • make per-window failures local where possible, instead of dropping the whole record
  • validate overlap settings early so a bad config cannot explode into thousands of model calls
  • avoid silently accepting empty rewrite chunks as successful output

The tests and docs coverage are in good shape, and I think the feature is close. These changes should make it more reliable on real long documents.

),
*self._qa_wf.columns(selected_models=selected_models),
*self._rewrite_gen_wf.columns(
window_max_render_chars=window_max_render_chars,

Collaborator

this only threads the user-supplied window cap into rewrite generation. domain classification, sensitivity disposition, QA generation, and final judge still build their window params from module defaults, so a user who lowers Detect.detection_window_max_render_chars still gets ~128k prompts in those stages. Could pass the same kwargs through those columns() calls and _run_final_judge too?

if params.first_only:
    windows = windows[:1]
outputs = []
for start, end in windows:

Collaborator

Claude Code caught this one: once this takes the windowed path, a single transient model error or a chunk that legitimately has no meaning units can drop the whole record. Could wrap each window call, skip/log failed windows, and handle the all-failed case explicitly?

"prompt scaffolding and tags when sizing augmentation/latent windows."
),
)
detection_window_overlap_chars: int = Field(

Collaborator

suggestion: can we validate that detection_window_overlap_chars is smaller than the effective window size? Right now overlap == window is accepted and the planners advance one character at a time. My smoke test turned a 20k-char row into 16,001 windows.
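
To illustrate the blow-up, here is a sketch that counts windows for a given stride. It assumes the planner advances by max(1, window - overlap); the 4,000-character window is a guess chosen because it reproduces the 16,001 figure, since the actual smoke-test settings are not stated. `count_windows` and `validate_overlap` are hypothetical helpers, not the project's validator:

```python
def count_windows(text_len, window, overlap):
    """Number of windows needed when each step advances by window - overlap."""
    stride = max(1, window - overlap)  # assumed guard against a zero stride
    if text_len <= window:
        return 1
    return -(-(text_len - window) // stride) + 1  # ceiling division


def validate_overlap(window, overlap):
    """Reject settings that would make the planner crawl one character at a time."""
    if overlap < 0 or overlap >= window:
        raise ValueError(
            f"overlap ({overlap}) must be non-negative and smaller than the window ({window})"
        )


if __name__ == "__main__":
    print(count_windows(20_000, 4_000, 4_000))  # overlap == window
    print(count_windows(20_000, 4_000, 1_000))  # a 1K overlap
```

Under these assumptions the same row needs 7 windows with a 1K overlap, so rejecting the configuration up front is much cheaper than discovering it through model calls.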

_clip(summary),
)

stitched = "\n".join(part for part in rewritten_parts if part)

Collaborator

separate from the newline-stitching comment already here: filtering with if part also hides an empty rewrite chunk. If one window returns {"rewritten_text": ""}, that section disappears with no failed-window count or review signal. Maybe count/flag empty chunks instead of treating them as successful output?
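
A sketch of what counting empty chunks could look like. `stitch` is a hypothetical helper; it treats whitespace-only output as empty and joins with an empty delimiter on the assumption that each chunk keeps its own trailing newline:

```python
def stitch(rewritten_parts):
    """Join rewritten chunks and report which chunk indices came back empty."""
    empty = [i for i, part in enumerate(rewritten_parts) if not part.strip()]
    stitched = "".join(part for part in rewritten_parts if part)
    return stitched, empty


if __name__ == "__main__":
    text, empty = stitch(["first section\n", "", "third section\n"])
    print(repr(text), empty)
```

The returned index list could feed a failed-window count or a review flag, so a dropped section no longer looks like a successful rewrite.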

…ate overlap, flag empty rewrite chunks

Signed-off-by: eurekayuan <zhuoweny@nvidia.com>
2 participants