
Chunk-wise Data Synthesis

English | 简体中文

A test-covered implementation of chunk-wise long-text synthesis with two parallel pipelines inspired by Kimi-K2:

  1. ChunkWiseRephrasePipeline: faithful chunk-wise autoregressive rephrasing.
  2. ChunkWiseGenerationPipeline: plan-driven chunk-wise autoregressive long-form generation.

Features

  • Hierarchical no-overlap chunk splitting with overlap-aware stitching.
  • Autoregressive generation with rolling prefix windows.
  • Parallel workflows for rephrase and pure generation.
  • Rephrase retries with pluggable fidelity verification.
  • Generation section retries with issue-targeted repair prompts.
  • Optional prompt compression for long-context section generation.
  • Plan + state based long-form generation with consistency pass guard.
  • Built-in quality checks for coverage, terminology, repetition, drift, and required entities.
  • OpenAI-compatible backend with environment-based configuration.
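The core splitting and stitching mechanics can be sketched roughly like this. This is an illustrative simplification, not the repository's chunking.py or pipelines/base.py; the helper names here are hypothetical:

```python
def split_chunks(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def rolling_prefix(generated_tokens, window):
    """Keep only the most recent `window` tokens as conditioning context."""
    return generated_tokens[-window:]


def stitch(previous, new_chunk, max_overlap):
    """Overlap-aware stitching: drop a prefix of `new_chunk` that repeats
    the tail of `previous`, checking longer overlaps first."""
    for k in range(min(max_overlap, len(previous), len(new_chunk)), 0, -1):
        if previous[-k:] == new_chunk[:k]:
            return previous + new_chunk[k:]
    return previous + new_chunk
```

Each chunk is generated conditioned on a rolling prefix of the output so far, then stitched onto the running result so that any duplicated boundary text is collapsed once.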

Refactored Architecture

The repository now follows explicit domain boundaries:

  • pipelines/: orchestration only (rephrase.py, generation.py, shared helpers in base.py).
  • prompts/: prompt rendering only (rephrase.py, generation.py, shared language helpers in base.py).
  • quality/: quality and fidelity checks (fidelity.py, generation.py, shared text/token helpers in base.py).
  • backends/: provider adapters (openai.py).
  • core/: stable grouped API exports (protocols.py, types.py, config.py).
  • Top-level domain modules remain focused (chunking.py, generation_state.py, generation_types.py, model.py).

Legacy wrapper modules were removed and should no longer be imported: pipeline.py, prompting.py, fidelity.py, openai_backend.py, generation_pipeline.py, generation_prompting.py, generation_quality.py, tokenizer.py.

Project Layout

src/
  __init__.py             # unified package-level public exports
  chunking.py             # chunk split and overlap logic
  generation_state.py     # generation state table update logic
  generation_types.py     # generation dataclasses and result types
  model.py                # model request/task protocols and adapters
  pipelines/
    __init__.py
    rephrase.py           # chunk-wise rephrase orchestration + PipelineConfig
    generation.py         # chunk-wise long-form generation orchestration
    base.py               # overlap detection and stitching
  prompts/
    __init__.py
    rephrase.py           # RewriteRequest + rephrase prompt rendering
    generation.py         # plan/section/repair/consistency prompt rendering
    base.py               # shared prompt language helpers
  quality/
    __init__.py
    fidelity.py           # fidelity verifier contracts and implementations
    generation.py         # generation quality checkers and consistency guard
    base.py               # shared token/text matching helpers
  backends/
    __init__.py
    openai.py             # OpenAI-compatible backend and configs
  core/
    __init__.py
    protocols.py          # Tokenizer/LLMModel/RewriteModel/FidelityVerifier
    types.py              # LLMRequest, RewriteRequest, GenerationPlan, SectionSpec
    config.py             # PipelineConfig, GenerationConfig, OpenAIBackendConfig
  tokenization/
    __init__.py           # tokenizer contracts and helpers
tests/
  test_*.py               # deterministic unittest coverage + refactor compatibility tests
scripts/
  run_live_openai_pipeline.py             # live rephrase runner
  run_live_openai_generation_pipeline.py  # live generation runner
  run_generation_ab_baseline.py           # one-shot vs chunk-wise baseline evaluation

Setup

This project uses uv for environment and dependency management.

uv sync

Run Tests

Run full offline test suite:

uv run python -m unittest discover -s tests -v

Run one module during iteration:

uv run python -m unittest tests.test_generation_pipeline -v

Validate refactor-era API boundaries and exports:

PYTHONPATH=src:tests uv run python -m unittest \
  tests.test_package_entrypoint \
  tests.test_core_api_compat \
  tests.test_pipelines_api -v

Live Rephrase Run

export LLM_API_KEY=your_key_here
uv run python scripts/run_live_openai_pipeline.py \
  --input tests/data/live_rephrase_input.txt \
  --output tests/data/rephrase_output.txt

Live Generation Run

export LLM_API_KEY=your_key_here
uv run python scripts/run_live_openai_generation_pipeline.py \
  --topic "Chunk-wise autoregressive long-form generation" \
  --objective "Create long-context training text" \
  --target-tokens 1800 \
  --audience "ML engineers" \
  --tone "neutral technical" \
  --output tests/data/generation_output.txt

You can also pass a manual plan JSON:

uv run python scripts/run_live_openai_generation_pipeline.py \
  --manual-plan-path tests/data/manual_plan.json \
  --output tests/data/generation_output.txt
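The manual plan file is plain JSON. As a rough sketch of what such a file could contain, the snippet below serializes a plan whose field names are inferred from the GenerationPlan dataclass shown under Minimal API Usage; verify the exact on-disk schema against the script before relying on it:

```python
import json

# Hypothetical manual plan; field names mirror the GenerationPlan
# dataclass used elsewhere in this README.
plan = {
    "topic": "Chunk-wise generation",
    "objective": "Teach the method",
    "audience": "ML engineers",
    "tone": "neutral technical",
    "target_total_length": 300,
    "sections": [
        {
            "title": "Intro",
            "key_points": ["global anchor controls structure"],
            "required_entities": ["global anchor"],
            "constraints": [],
            "target_length": 120,
        }
    ],
    "terminology_preferences": {"global anchor": "global anchor"},
    "narrative_voice": "third-person",
    "do_not_include": [],
}

serialized = json.dumps(plan, ensure_ascii=False, indent=2)
```

Write `serialized` to a file such as tests/data/manual_plan.json and pass its path via --manual-plan-path.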

Profile-based quick switch (default is coherence_first):

uv run python scripts/run_live_openai_generation_pipeline.py \
  --topic "Chunk-wise autoregressive long-form generation" \
  --objective "Create long-context training text" \
  --profile cost_first \
  --output tests/data/generation_output_cost_first.txt

Live Integration Test (Opt-in)

The live integration test makes a real API request and is disabled by default:

export LLM_API_KEY=your_key_here
export RUN_LIVE_LLM_TESTS=1
uv run python -m unittest tests.test_openai_backend_live -v

A/B Baseline Evaluation (One-shot vs Chunk-wise)

Use the fixed cases file to build a reproducible baseline report:

export LLM_API_KEY=your_key_here
uv run python scripts/run_generation_ab_baseline.py \
  --cases tests/fixtures/generation_eval_cases.json \
  --output-dir tests/data/ab_eval_reports \
  --prompt-language en

Outputs:

  • ab_baseline_report.json: machine-readable aggregate + per-case details
  • ab_baseline_report.md: human-readable summary + manual scoring table
  • <case_id>.json: per-case raw outputs and metrics

Public Import Entry Points

Recommended grouped imports:

  • from pipelines import ChunkWiseRephrasePipeline, ChunkWiseGenerationPipeline, PipelineConfig
  • from prompts import RewriteRequest, render_rewrite_prompt, render_plan_prompt
  • from quality import FidelityVerifier, CompositeFidelityVerifier, NumericFactChecker
  • from backends import OpenAIBackendConfig, OpenAILLMModel, OpenAIRewriteModel
  • from core.protocols import Tokenizer, LLMModel, RewriteModel, FidelityVerifier
  • from core.types import LLMRequest, RewriteRequest, GenerationPlan, SectionSpec
  • from core.config import PipelineConfig, GenerationConfig, OpenAIBackendConfig

A compatibility entrypoint is also exposed at the src package level:

  • from src import ChunkWiseRephrasePipeline, PipelineConfig, RewriteRequest, WhitespaceTokenizer
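The pluggable fidelity contract can be illustrated with a toy verifier. This is only a sketch of the idea behind numeric fact checking; the repository's FidelityVerifier protocol and NumericFactChecker in quality/fidelity.py may define a different interface and stricter rules:

```python
import re


class SimpleNumericChecker:
    """Toy fidelity check: a rewrite passes only if it preserves every
    number that appears in the source chunk. Illustrative only."""

    _number = re.compile(r"\d+(?:\.\d+)?")

    def verify(self, source: str, rewrite: str) -> bool:
        source_numbers = set(self._number.findall(source))
        rewrite_numbers = set(self._number.findall(rewrite))
        # Pass when every source number also appears in the rewrite.
        return source_numbers <= rewrite_numbers


checker = SimpleNumericChecker()
```

A verifier like this is what the rephrase retry loop consults: when verification fails, the chunk is re-requested rather than accepted.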

Minimal API Usage

Rephrase pipeline

from core.config import PipelineConfig
from core.types import RewriteRequest
from pipelines import ChunkWiseRephrasePipeline
from tokenization import WhitespaceTokenizer


class EchoRewriteModel:
    def rewrite(self, request: RewriteRequest) -> str:
        return request.current_chunk


pipeline = ChunkWiseRephrasePipeline(
    model=EchoRewriteModel(),
    tokenizer=WhitespaceTokenizer(),
    config=PipelineConfig(
        chunk_size=256,
        length_mode="token",
        prefix_window_tokens=1024,
        max_stitch_overlap_tokens=64,
    ),
)

rewritten = pipeline.run("Your long document here.", style_instruction="Rewrite for clarity.")
print(rewritten)

Generation pipeline (manual plan)

from core.config import GenerationConfig
from core.types import GenerationPlan, LLMRequest, SectionSpec
from pipelines import ChunkWiseGenerationPipeline
from tokenization import WhitespaceTokenizer


class StubLLM:
    def generate(self, request: LLMRequest) -> str:
        if request.task == "section_generation":
            return "Section body with required entities and key points."
        if request.task == "consistency_pass":
            return "Section body with required entities and key points."
        raise ValueError("manual plan run should not call plan_generation")


plan = GenerationPlan(
    topic="Chunk-wise generation",
    objective="Teach the method",
    audience="ML engineers",
    tone="neutral technical",
    target_total_length=300,
    sections=[
        SectionSpec(
            title="Intro",
            key_points=["global anchor controls structure"],
            required_entities=["global anchor"],
            constraints=[],
            target_length=120,
        )
    ],
    terminology_preferences={"global anchor": "global anchor"},
    narrative_voice="third-person",
    do_not_include=[],
)

pipeline = ChunkWiseGenerationPipeline(
    model=StubLLM(),
    tokenizer=WhitespaceTokenizer(),
    config=GenerationConfig(prefix_window_tokens=800),
)

result = pipeline.run(manual_plan=plan)
print(result.final_text)
print(result.qc_report.coverage_missing)

Configuration

Environment variables:

  • LLM_API_KEY (required): API key.
  • LLM_MODEL (optional): override model ID.
  • LLM_BASE_URL (optional): override provider base URL.

Current defaults in src/backends/openai.py:

  • DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"
  • DEFAULT_MODEL = "stepfun/step-3.5-flash:free"
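Resolution of these settings can be sketched as follows. This is a minimal illustration using the variable names and defaults documented above, not the actual code in src/backends/openai.py:

```python
import os


def resolve_backend_settings(env=os.environ):
    """Resolve backend settings from the environment, falling back to the
    documented defaults when an optional variable is unset."""
    if "LLM_API_KEY" not in env:
        raise RuntimeError("LLM_API_KEY is required")
    return {
        "api_key": env["LLM_API_KEY"],
        "model": env.get("LLM_MODEL", "stepfun/step-3.5-flash:free"),
        "base_url": env.get("LLM_BASE_URL", "https://openrouter.ai/api/v1"),
    }
```

Passing a mapping instead of reading os.environ directly keeps the lookup testable without mutating the process environment.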

Live rephrase script flags (scripts/run_live_openai_pipeline.py):

  • --chunk-size
  • --length-mode (auto / token / char)
  • --prefix-window-tokens
  • --style
  • --prompt-language (en / zh)
  • --model
  • --base-url
  • --temperature
  • --top-p
  • --max-new-tokens
  • --verbose

Live generation script flags (scripts/run_live_openai_generation_pipeline.py):

  • --topic
  • --objective
  • --target-tokens
  • --audience
  • --tone
  • --prompt-language (en / zh)
  • --manual-plan-path
  • --profile (coherence_first / cost_first)
  • --prompt-compression (on / off): overrides the profile setting
  • --section-retry-strategy (off / balanced / aggressive): overrides the profile setting
  • --consistency-pass (on / off): overrides the profile setting
  • --consistency-guard (on / off): overrides the profile setting
  • --prefix-window-tokens
  • --disable-consistency-pass (deprecated alias for --consistency-pass off)
  • --enable-reasoning
  • --model
  • --base-url
  • --temperature
  • --top-p
  • --max-new-tokens
  • --verbose

Troubleshooting

  • Error contains "not a valid model ID": set a provider-valid model, for example export LLM_MODEL=your_valid_model_id.
  • Missing API key error: make sure LLM_API_KEY is exported in the current shell.

About

A Python implementation of chunk-wise long-text synthesis. Includes faithful chunk-wise rephrasing and plan-driven long-form generation with overlap-aware chunking, rolling prefix windows, and pluggable quality/fidelity checks. Works with OpenAI-compatible APIs via environment-based configuration.
