
Chunk-wise Data Synthesis

English | 简体中文

A test-covered implementation of chunk-wise long-text synthesis with two parallel pipelines inspired by Kimi-K2:

  1. ChunkWiseRephrasePipeline: faithful chunk-wise autoregressive rephrasing.
  2. ChunkWiseGenerationPipeline: plan-driven chunk-wise autoregressive long-form generation.

Features

  • Hierarchical no-overlap chunk splitting with overlap-aware stitching.
  • Autoregressive generation with rolling prefix windows.
  • Parallel workflows for rephrase and pure generation.
  • Rephrase retries with pluggable fidelity verification.
  • Generation section retries with issue-targeted repair prompts.
  • Optional prompt compression for long-context section generation.
  • Plan + state based long-form generation with consistency pass guard.
  • Built-in quality checks for coverage, terminology, repetition, drift, and required entities.
  • OpenAI-compatible backend with environment-based configuration.
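The core splitting and stitching mechanics can be sketched roughly like this. This is an illustrative simplification, not the repository's chunking.py or pipelines/base.py; the helper names here are hypothetical:

```python
def split_chunks(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def rolling_prefix(generated_tokens, window):
    """Keep only the most recent `window` tokens as conditioning context."""
    return generated_tokens[-window:]


def stitch(previous, new_chunk, max_overlap):
    """Overlap-aware stitching: drop a prefix of `new_chunk` that repeats
    the tail of `previous`, checking longer overlaps first."""
    for k in range(min(max_overlap, len(previous), len(new_chunk)), 0, -1):
        if previous[-k:] == new_chunk[:k]:
            return previous + new_chunk[k:]
    return previous + new_chunk
```

Each chunk is generated conditioned on a rolling prefix of the output so far, then stitched onto the running result so that any duplicated boundary text is collapsed once.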

Refactored Architecture

The repository now follows explicit domain boundaries:

  • pipelines/: orchestration only (rephrase.py, generation.py, shared helpers in base.py).
  • prompts/: prompt rendering only (rephrase.py, generation.py, shared language helpers in base.py).
  • quality/: quality and fidelity checks (fidelity.py, generation.py, shared text/token helpers in base.py).
  • backends/: provider adapters (openai.py).
  • core/: stable grouped API exports (protocols.py, types.py, config.py).
  • Top-level domain modules remain focused (chunking.py, generation_state.py, generation_types.py, model.py).

Legacy wrapper modules were removed and should no longer be imported: pipeline.py, prompting.py, fidelity.py, openai_backend.py, generation_pipeline.py, generation_prompting.py, generation_quality.py, tokenizer.py.

Project Layout

src/
  __init__.py             # unified package-level public exports
  chunking.py             # chunk split and overlap logic
  generation_state.py     # generation state table update logic
  generation_types.py     # generation dataclasses and result types
  model.py                # model request/task protocols and adapters
  pipelines/
    __init__.py
    rephrase.py           # chunk-wise rephrase orchestration + PipelineConfig
    generation.py         # chunk-wise long-form generation orchestration
    base.py               # overlap detection and stitching
  prompts/
    __init__.py
    rephrase.py           # RewriteRequest + rephrase prompt rendering
    generation.py         # plan/section/repair/consistency prompt rendering
    base.py               # shared prompt language helpers
  quality/
    __init__.py
    fidelity.py           # fidelity verifier contracts and implementations
    generation.py         # generation quality checkers and consistency guard
    base.py               # shared token/text matching helpers
  backends/
    __init__.py
    openai.py             # OpenAI-compatible backend and configs
  core/
    __init__.py
    protocols.py          # Tokenizer/LLMModel/RewriteModel/FidelityVerifier
    types.py              # LLMRequest, RewriteRequest, GenerationPlan, SectionSpec
    config.py             # PipelineConfig, GenerationConfig, OpenAIBackendConfig
  tokenization/
    __init__.py           # tokenizer contracts and helpers
tests/
  test_*.py               # deterministic unittest coverage + refactor compatibility tests
scripts/
  run_live_openai_pipeline.py             # live rephrase runner
  run_live_openai_generation_pipeline.py  # live generation runner
  run_generation_ab_baseline.py           # one-shot vs chunk-wise baseline evaluation

Setup

This project uses uv for environment and dependency management.

uv sync

Run Tests

Run full offline test suite:

uv run python -m unittest discover -s tests -v

Run one module during iteration:

uv run python -m unittest tests.test_generation_pipeline -v

Validate refactor-era API boundaries and exports:

PYTHONPATH=src:tests uv run python -m unittest \
  tests.test_package_entrypoint \
  tests.test_core_api_compat \
  tests.test_pipelines_api -v

Live Rephrase Run

export LLM_API_KEY=your_key_here
uv run python scripts/run_live_openai_pipeline.py \
  --input tests/data/live_rephrase_input.txt \
  --output tests/data/rephrase_output.txt

Live Generation Run

export LLM_API_KEY=your_key_here
uv run python scripts/run_live_openai_generation_pipeline.py \
  --topic "Chunk-wise autoregressive long-form generation" \
  --objective "Create long-context training text" \
  --target-tokens 1800 \
  --audience "ML engineers" \
  --tone "neutral technical" \
  --output tests/data/generation_output.txt

You can also pass a manual plan JSON:

uv run python scripts/run_live_openai_generation_pipeline.py \
  --manual-plan-path tests/data/manual_plan.json \
  --output tests/data/generation_output.txt
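The manual plan file is plain JSON. As a rough sketch of what such a file could contain, the snippet below serializes a plan whose field names are inferred from the GenerationPlan dataclass shown under Minimal API Usage; verify the exact on-disk schema against the script before relying on it:

```python
import json

# Hypothetical manual plan; field names mirror the GenerationPlan
# dataclass used elsewhere in this README.
plan = {
    "topic": "Chunk-wise generation",
    "objective": "Teach the method",
    "audience": "ML engineers",
    "tone": "neutral technical",
    "target_total_length": 300,
    "sections": [
        {
            "title": "Intro",
            "key_points": ["global anchor controls structure"],
            "required_entities": ["global anchor"],
            "constraints": [],
            "target_length": 120,
        }
    ],
    "terminology_preferences": {"global anchor": "global anchor"},
    "narrative_voice": "third-person",
    "do_not_include": [],
}

serialized = json.dumps(plan, ensure_ascii=False, indent=2)
```

Write `serialized` to a file such as tests/data/manual_plan.json and pass its path via --manual-plan-path.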

Profile-based quick switch (default is coherence_first):

uv run python scripts/run_live_openai_generation_pipeline.py \
  --topic "Chunk-wise autoregressive long-form generation" \
  --objective "Create long-context training text" \
  --profile cost_first \
  --output tests/data/generation_output_cost_first.txt

Live Integration Test (Opt-in)

The live integration test makes a real API request and is disabled by default:

export LLM_API_KEY=your_key_here
export RUN_LIVE_LLM_TESTS=1
uv run python -m unittest tests.test_openai_backend_live -v

A/B Baseline Evaluation (One-shot vs Chunk-wise)

Use the fixed cases file to build a reproducible baseline report:

export LLM_API_KEY=your_key_here
uv run python scripts/run_generation_ab_baseline.py \
  --cases tests/fixtures/generation_eval_cases.json \
  --output-dir tests/data/ab_eval_reports \
  --prompt-language en

Outputs:

  • ab_baseline_report.json: machine-readable aggregate + per-case details
  • ab_baseline_report.md: human-readable summary + manual scoring table
  • <case_id>.json: per-case raw outputs and metrics

Public Import Entry Points

Recommended grouped imports:

  • from pipelines import ChunkWiseRephrasePipeline, ChunkWiseGenerationPipeline, PipelineConfig
  • from prompts import RewriteRequest, render_rewrite_prompt, render_plan_prompt
  • from quality import FidelityVerifier, CompositeFidelityVerifier, NumericFactChecker
  • from backends import OpenAIBackendConfig, OpenAILLMModel, OpenAIRewriteModel
  • from core.protocols import Tokenizer, LLMModel, RewriteModel, FidelityVerifier
  • from core.types import LLMRequest, RewriteRequest, GenerationPlan, SectionSpec
  • from core.config import PipelineConfig, GenerationConfig, OpenAIBackendConfig

A compatibility entrypoint is also exposed at the src package level:

  • from src import ChunkWiseRephrasePipeline, PipelineConfig, RewriteRequest, WhitespaceTokenizer
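The pluggable fidelity contract can be illustrated with a toy verifier. This is only a sketch of the idea behind numeric fact checking; the repository's FidelityVerifier protocol and NumericFactChecker in quality/fidelity.py may define a different interface and stricter rules:

```python
import re


class SimpleNumericChecker:
    """Toy fidelity check: a rewrite passes only if it preserves every
    number that appears in the source chunk. Illustrative only."""

    _number = re.compile(r"\d+(?:\.\d+)?")

    def verify(self, source: str, rewrite: str) -> bool:
        source_numbers = set(self._number.findall(source))
        rewrite_numbers = set(self._number.findall(rewrite))
        # Pass when every source number also appears in the rewrite.
        return source_numbers <= rewrite_numbers


checker = SimpleNumericChecker()
```

A verifier like this is what the rephrase retry loop consults: when verification fails, the chunk is re-requested rather than accepted.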

Minimal API Usage

Rephrase pipeline

from core.config import PipelineConfig
from core.types import RewriteRequest
from pipelines import ChunkWiseRephrasePipeline
from tokenization import WhitespaceTokenizer


class EchoRewriteModel:
    def rewrite(self, request: RewriteRequest) -> str:
        return request.current_chunk


pipeline = ChunkWiseRephrasePipeline(
    model=EchoRewriteModel(),
    tokenizer=WhitespaceTokenizer(),
    config=PipelineConfig(
        chunk_size=256,
        length_mode="token",
        prefix_window_tokens=1024,
        max_stitch_overlap_tokens=64,
    ),
)

rewritten = pipeline.run("Your long document here.", style_instruction="Rewrite for clarity.")
print(rewritten)

Generation pipeline (manual plan)

from core.config import GenerationConfig
from core.types import GenerationPlan, LLMRequest, SectionSpec
from pipelines import ChunkWiseGenerationPipeline
from tokenization import WhitespaceTokenizer


class StubLLM:
    def generate(self, request: LLMRequest) -> str:
        if request.task == "section_generation":
            return "Section body with required entities and key points."
        if request.task == "consistency_pass":
            return "Section body with required entities and key points."
        raise ValueError("manual plan run should not call plan_generation")


plan = GenerationPlan(
    topic="Chunk-wise generation",
    objective="Teach the method",
    audience="ML engineers",
    tone="neutral technical",
    target_total_length=300,
    sections=[
        SectionSpec(
            title="Intro",
            key_points=["global anchor controls structure"],
            required_entities=["global anchor"],
            constraints=[],
            target_length=120,
        )
    ],
    terminology_preferences={"global anchor": "global anchor"},
    narrative_voice="third-person",
    do_not_include=[],
)

pipeline = ChunkWiseGenerationPipeline(
    model=StubLLM(),
    tokenizer=WhitespaceTokenizer(),
    config=GenerationConfig(prefix_window_tokens=800),
)

result = pipeline.run(manual_plan=plan)
print(result.final_text)
print(result.qc_report.coverage_missing)

Configuration

Environment variables:

  • LLM_API_KEY (required): API key.
  • LLM_MODEL (optional): override model ID.
  • LLM_BASE_URL (optional): override provider base URL.

Current defaults in src/backends/openai.py:

  • DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"
  • DEFAULT_MODEL = "stepfun/step-3.5-flash:free"
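Resolution of these settings can be sketched as follows. This is a minimal illustration using the variable names and defaults documented above, not the actual code in src/backends/openai.py:

```python
import os


def resolve_backend_settings(env=os.environ):
    """Resolve backend settings from the environment, falling back to the
    documented defaults when an optional variable is unset."""
    if "LLM_API_KEY" not in env:
        raise RuntimeError("LLM_API_KEY is required")
    return {
        "api_key": env["LLM_API_KEY"],
        "model": env.get("LLM_MODEL", "stepfun/step-3.5-flash:free"),
        "base_url": env.get("LLM_BASE_URL", "https://openrouter.ai/api/v1"),
    }
```

Passing a mapping instead of reading os.environ directly keeps the lookup testable without mutating the process environment.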

Live rephrase script flags (scripts/run_live_openai_pipeline.py):

  • --chunk-size
  • --length-mode (auto / token / char)
  • --prefix-window-tokens
  • --style
  • --prompt-language (en / zh)
  • --model
  • --base-url
  • --temperature
  • --top-p
  • --max-new-tokens
  • --verbose

Live generation script flags (scripts/run_live_openai_generation_pipeline.py):

  • --topic
  • --objective
  • --target-tokens
  • --audience
  • --tone
  • --prompt-language (en / zh)
  • --manual-plan-path
  • --profile (coherence_first / cost_first)
  • --prompt-compression (on / off): overrides the profile setting
  • --section-retry-strategy (off / balanced / aggressive): overrides the profile setting
  • --consistency-pass (on / off): overrides the profile setting
  • --consistency-guard (on / off): overrides the profile setting
  • --prefix-window-tokens
  • --disable-consistency-pass (deprecated alias for --consistency-pass off)
  • --enable-reasoning
  • --model
  • --base-url
  • --temperature
  • --top-p
  • --max-new-tokens
  • --verbose

Troubleshooting

  • Error contains "not a valid model ID": set a provider-valid model, for example export LLM_MODEL=your_valid_model_id.
  • Missing API key error: make sure LLM_API_KEY is exported in the current shell.

About

A Python implementation of chunk-wise long-text synthesis. Includes faithful chunk-wise rephrasing and plan-driven long-form generation with overlap-aware chunking, rolling prefix windows, and pluggable quality/fidelity checks. Works with OpenAI-compatible APIs via environment-based configuration.
