A document alignment pipeline for building parallel corpora from multilingual document pairs. It takes documents in two languages (e.g., an English book and its Finnish translation), aligns them at both paragraph and sentence level using semantic embeddings and Dynamic Time Warping, and outputs aligned text chunks suitable for training machine translation models.
The pipeline performs hierarchical alignment in two stages, followed by chunk building:
- Coarse alignment (paragraph-level): Embeds all paragraphs using a Sentence Transformers model (e.g., LaBSE), computes a cosine distance matrix, and applies Dynamic Time Warping (DTW) to find the optimal (or, in approx mode, near-optimal) paragraph-to-paragraph mapping. This handles merged/split paragraphs across translations by supporting many-to-one and one-to-many mappings. Paragraph groups with low similarity are filtered out before proceeding.
- Fine alignment (sentence-level): Within each aligned paragraph group, sentences are split (via spaCy), embedded, and aligned using a greedy fuzzy span matching algorithm that supports 1-to-1, 1-to-2, and 2-to-1 sentence alignments.
- Chunk building: Aligned sentence pairs are grouped into output chunks of up to 2500 words per language side. Chunks with low average alignment scores can be filtered out.
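The fine-alignment step can be illustrated with a small sketch. It is not the code in fine_align_spans.py: the function names, the use of summed embeddings to represent a two-sentence span, and the min_score default are all assumptions made for illustration, and plain Python lists stand in for embedding arrays.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def span_vector(vectors, indices):
    """Represent a span by the sum of its sentence embeddings (an assumption)."""
    return [sum(vals) for vals in zip(*(vectors[i] for i in indices))]


def greedy_span_align(src, tgt, min_score=0.5):
    """Greedily pick non-overlapping 1-to-1, 1-to-2 and 2-to-1 spans by score."""
    candidates = []
    for i in range(len(src)):
        for j in range(len(tgt)):
            spans = [((i,), (j,))]                     # 1-to-1
            if j + 1 < len(tgt):
                spans.append(((i,), (j, j + 1)))       # 1-to-2
            if i + 1 < len(src):
                spans.append(((i, i + 1), (j,)))       # 2-to-1
            for s_idx, t_idx in spans:
                score = cosine(span_vector(src, s_idx), span_vector(tgt, t_idx))
                candidates.append((score, s_idx, t_idx))

    # Best-scoring candidates first; stop once scores fall below the floor.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_src, used_tgt, picked = set(), set(), []
    for score, s_idx, t_idx in candidates:
        if score < min_score:
            break
        if used_src.isdisjoint(s_idx) and used_tgt.isdisjoint(t_idx):
            picked.append((s_idx, t_idx, score))
            used_src.update(s_idx)
            used_tgt.update(t_idx)
    return sorted(picked, key=lambda p: p[0][0])
```

With two source sentences and three target sentences where the second source sentence covers the last two target sentences, this sketch returns a 1-to-1 span followed by a 1-to-2 span. A real implementation would likely also restrict candidates to a search window around the diagonal, which this sketch omits.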
Input JSONL document pairs
|
v
Paragraph embedding (LaBSE / Sentence Transformers)
|
v
DTW paragraph alignment (approx or exact)
|
v
Path collapsing (group into paragraph ranges)
|
v
Paragraph similarity filtering (drop non-parallel groups)
|
v
Sentence splitting (spaCy) + sentence embedding
|
v
Greedy span matching (sentence-level alignment)
|
v
Chunk building (word-limit aware) + score filtering
|
v
Output: aligned parallel chunks (JSONL)
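The DTW and path-collapsing steps in the diagram can be sketched in plain Python. This is a textbook exact DTW over a precomputed cost matrix (for example, cosine distances between paragraph embeddings), not the code in coarse_align.py or collapse_path.py. The function names and the grouping rule are assumptions made for illustration.

```python
def dtw_path(cost):
    """Exact DTW over a cost matrix (list of rows). Returns (total_cost, path)."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev

    # Backtrack from the bottom-right corner to recover the warping path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (acc[i - 1][j], i - 1, j),          # advance source only
            (acc[i][j - 1], i, j - 1),          # advance target only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return acc[n][m], path


def collapse_path(path):
    """Group path steps into (source_indices, target_indices) ranges.

    A step that repeats the previous source or target index extends the
    current group; a diagonal step starts a new group.
    """
    groups = []
    for i, j in path:
        if groups and (i == groups[-1][0][-1] or j == groups[-1][1][-1]):
            src, tgt = groups[-1]
            if src[-1] != i:
                src.append(i)
            if tgt[-1] != j:
                tgt.append(j)
        else:
            groups.append(([i], [j]))
    return groups
```

For a 4-by-3 cost matrix in which source paragraphs 1 and 2 both sit close to target paragraph 1, the collapsed path contains the group ([1, 2], [1]), which corresponds to a two-to-one paragraph mapping. The exact version costs O(N*M) time and memory, which is why an approximate mode is useful for long documents.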
MT-aligner-pipeline/
├── run_pipeline.py # CLI entry point
├── launcher.sh # SLURM job submission script
├── stats.py # Corpus statistics tool
├── requirements.txt
├── configs/
│   └── default.yaml # Pipeline configuration
├── src/
│   ├── schema.py # Pydantic data models
│   ├── input_handling.py # Input format handling
│   ├── align/
│   │   ├── coarse_align.py # Paragraph-level DTW alignment
│   │   ├── fine_align_spans.py # Sentence-level span alignment
│   │   ├── filters.py # Alignment validation filters
│   │   └── thresholds.py # Data mode presets (parallel/comparable)
│   ├── encode/
│   │   ├── embed_paragraphs.py # Paragraph embedding
│   │   └── embed_sentences.py # Sentence embedding
│   ├── pipeline/
│   │   ├── runner.py # Pipeline orchestrator (multiprocessing)
│   │   ├── document_loader.py # Document I/O and embedding cache
│   │   ├── aligner.py # High-level alignment coordinator
│   │   └── output_builder.py # Result formatting and saving
│   ├── postprocess/
│   │   └── rebuild.py # Chunk building from alignments
│   └── utils/
│       ├── io.py # JSON/embedding I/O
│       ├── sentence_split.py # spaCy sentence tokenization
│       ├── collapse_path.py # DTW path simplification
│       └── debug.py
└── data/
    ├── raw/ # Input document pairs (JSONL)
    ├── preprocessed/ # Cached embeddings
    └── aligned/ # Output aligned chunks
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m spacy download xx_ent_wiki_sm
Edit configs/default.yaml:
# Output directory (created if missing)
output: "data/aligned"
# Embedding model (any Sentence Transformers model)
model: "labse"
model_path: "/path/to/local/model" # optional, for offline use
# DTW mode: "approx" (fast, O(N)) or "exact" (accurate, O(N*M))
dtw: "approx"
# Input
input: "data/raw"
input_format: "document_pair"
# Input type: "paragraph" (default) or "sentence"
# Controls how aligned blocks are joined in the output: paragraphs are
# separated by double newlines, sentences are joined with spaces.
input_type: "paragraph"
# Data mode: "parallel" or "comparable"
# See "Data modes" section below for details.
data_mode: "parallel"
Run the pipeline locally:
python run_pipeline.py --config configs/default.yaml
To run on a SLURM cluster, edit launcher.sh to match your cluster setup, then:
sbatch launcher.sh
The pipeline expects input as subdirectories under the input path, each containing exactly two JSONL files (one per language):
data/raw/
└── 1399_49487/
    ├── en.jsonl
    └── fi.jsonl
Each JSONL file contains one line per paragraph (or per sentence if using input_type: "sentence"):
{"doc_id": "anna_karenina.en", "para_id": 0, "lang": "en", "text": "Chapter 1"}
{"doc_id": "anna_karenina.en", "para_id": 1, "lang": "en", "text": "Happy families are all alike..."}
When working with data that has been pre-segmented at sentence level, each JSONL entry is a single sentence. Set input_type: "sentence" in the config so that the pipeline joins aligned blocks with spaces rather than paragraph breaks, producing continuous text in the output.
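Before running the pipeline it can help to check that a pair directory matches this layout. The following is a minimal sketch, not the logic in input_handling.py or document_loader.py: the function names and the choice to key the result by file stem are assumptions, and the required keys are taken from the example records above.

```python
import json
from pathlib import Path

# Keys shown in the example records; treated as required here (an assumption).
REQUIRED_KEYS = {"doc_id", "para_id", "lang", "text"}


def load_jsonl(path):
    """Read one JSONL file and check that each record has the expected keys."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - set(record)
            if missing:
                raise ValueError(f"{path}:{line_no} is missing {sorted(missing)}")
            records.append(record)
    return records


def load_document_pair(pair_dir):
    """Load a pair directory that must contain exactly two JSONL files."""
    files = sorted(Path(pair_dir).glob("*.jsonl"))
    if len(files) != 2:
        raise ValueError(f"{pair_dir}: expected 2 JSONL files, found {len(files)}")
    # Key each document by file stem, e.g. "en" and "fi".
    return {path.stem: load_jsonl(path) for path in files}
```

Calling load_document_pair("data/raw/1399_49487") on the layout above would return a dictionary with the keys "en" and "fi", each holding the records in file order.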
The pipeline supports two data modes, controlled by data_mode in the config:
- parallel: For known-parallel document pairs (e.g., published translations). Uses lenient filtering thresholds. Paragraph groups are still filtered at a low similarity threshold (0.30) to catch structural elements such as footnotes, tables of contents, or image captions that may appear in one language but not the other.
- comparable: For document pairs that may share a topic but are not necessarily direct translations. Uses stricter filtering at every stage: paragraph-level similarity filtering (0.45), tighter sentence alignment windows, higher match thresholds, and chunk-level score filtering. This mode discards non-parallel portions rather than force-aligning them.
Each mode sets sensible defaults for all filtering thresholds. Individual thresholds can be overridden in the config:
data_mode: "comparable"
thresholds:
para_sim_threshold: 0.50 # stricter paragraph filtering
  min_chunk_score: 0.70 # stricter chunk filtering
Available threshold keys:
| Key | Description |
|---|---|
| para_sim_threshold | Minimum cosine similarity for paragraph groups |
| sent_window | Sentence alignment search window size |
| sent_baseline_threshold | Minimum similarity for baseline sentence matches |
| sent_neighbour_margin | Relaxation margin for neighbor expansion |
| sent_fuzzy_threshold | Threshold for 1-to-2 and 2-to-1 span matching |
| sent_min_span_score | Floor for greedy span selection |
| min_chunk_score | Minimum average score per output chunk |
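The way a mode preset and per-key overrides combine can be sketched as a dictionary merge. This is an illustration rather than the contents of thresholds.py: the names PRESETS and resolve_thresholds are hypothetical, and only the two paragraph thresholds stated above (0.30 and 0.45) are filled in, since the other preset values are not given here.

```python
# Keys from the table above.
KNOWN_KEYS = {
    "para_sim_threshold",
    "sent_window",
    "sent_baseline_threshold",
    "sent_neighbour_margin",
    "sent_fuzzy_threshold",
    "sent_min_span_score",
    "min_chunk_score",
}

# Hypothetical preset table; only the documented values are included.
PRESETS = {
    "parallel": {"para_sim_threshold": 0.30},
    "comparable": {"para_sim_threshold": 0.45},
}


def resolve_thresholds(config):
    """Start from the preset for config['data_mode'], then apply overrides."""
    mode = config["data_mode"]
    if mode not in PRESETS:
        raise ValueError(f"unknown data_mode: {mode!r}")
    overrides = config.get("thresholds") or {}
    unknown = set(overrides) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unknown threshold keys: {sorted(unknown)}")
    resolved = dict(PRESETS[mode])  # copy so the preset itself is not mutated
    resolved.update(overrides)
    return resolved
```

Rejecting unknown keys is a design choice of this sketch: a misspelled key in the config would otherwise be silently ignored and the preset value used instead.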
For each document pair, the pipeline produces:
{name}_chunks.jsonl: Aligned parallel text chunks (the main output):
{
  "paragraph_pairs": [[[0], [0]], [[1, 2], [1]]],
  "src_text": "Chapter 1\n\nHappy families are all alike...",
  "tgt_text": "Luku 1\n\nOnnelliset perheet ovat kaikki samanlaisia...",
  "src_words": 2450,
  "tgt_words": 2380,
  "avg_score": 0.8234
}
{name}_meta.jsonl: Alignment metadata (DTW cost, DTW path, sentence alignments, input paths).
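Records in the chunk format could be assembled roughly as follows. This is a sketch of word-limit-aware grouping under stated assumptions, not the code in rebuild.py: words are counted by whitespace splitting, a chunk is closed as soon as adding the next aligned block would exceed the limit on either side, and the paragraph_pairs field is left out for brevity. The function names are hypothetical.

```python
def make_chunk(blocks, joiner):
    """Build one output record from a list of (src_text, tgt_text, score)."""
    src_text = joiner.join(block[0] for block in blocks)
    tgt_text = joiner.join(block[1] for block in blocks)
    return {
        "src_text": src_text,
        "tgt_text": tgt_text,
        "src_words": len(src_text.split()),
        "tgt_words": len(tgt_text.split()),
        "avg_score": round(sum(block[2] for block in blocks) / len(blocks), 4),
    }


def build_chunks(blocks, max_words=2500, min_chunk_score=0.0, joiner="\n\n"):
    """Group aligned blocks into chunks of at most max_words per side."""
    groups, current = [], []
    src_count = tgt_count = 0
    for src, tgt, score in blocks:
        src_n, tgt_n = len(src.split()), len(tgt.split())
        # Close the current chunk if this block would push either side over.
        if current and (src_count + src_n > max_words or tgt_count + tgt_n > max_words):
            groups.append(current)
            current, src_count, tgt_count = [], 0, 0
        current.append((src, tgt, score))
        src_count += src_n
        tgt_count += tgt_n
    if current:
        groups.append(current)
    chunks = [make_chunk(group, joiner) for group in groups]
    # Drop chunks whose average alignment score is below the floor.
    return [chunk for chunk in chunks if chunk["avg_score"] >= min_chunk_score]
```

Passing joiner=" " would mirror the input_type: "sentence" behaviour of joining aligned blocks with spaces. Note that a single block longer than max_words still becomes its own oversized chunk in this sketch.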
The pipeline supports parallel processing via Python multiprocessing (spawn mode). By default it uses up to 8 workers, bounded by available CPUs. Each worker loads its own copy of the embedding model, so GPU memory is the main constraint. Sequential mode is available and recommended for single-GPU setups — set use_multiprocessing=False in the runner.
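The worker-count rule and the sequential fallback might look like the sketch below. It is not the code in runner.py: align_pair is a placeholder for the real per-document work, and capping the pool at the number of jobs is an extra assumption on top of the "up to 8 workers, bounded by available CPUs" rule.

```python
import multiprocessing as mp
import os

MAX_WORKERS = 8  # default cap described above


def pick_worker_count(n_jobs, max_workers=MAX_WORKERS):
    """Use at most max_workers, bounded by CPU count and number of jobs."""
    cpus = os.cpu_count() or 1
    return max(1, min(max_workers, cpus, n_jobs))


def align_pair(pair_dir):
    """Placeholder for aligning one document pair (hypothetical)."""
    return pair_dir


def run_all(pair_dirs, use_multiprocessing=True):
    """Process every pair, either sequentially or in a spawn-mode pool."""
    if not use_multiprocessing or len(pair_dirs) < 2:
        return [align_pair(pair_dir) for pair_dir in pair_dirs]
    ctx = mp.get_context("spawn")  # each worker starts a fresh interpreter
    with ctx.Pool(processes=pick_worker_count(len(pair_dirs))) as pool:
        return pool.map(align_pair, pair_dirs)
```

Because spawn mode re-imports the main module in every worker, a script that calls run_all with multiprocessing enabled should do so under an if __name__ == "__main__": guard. The fresh interpreter per worker is also why each worker ends up loading its own copy of the embedding model.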
- sentence-transformers: Multilingual embedding models
- fastdtw: Fast approximate DTW
- dtaidistance: Exact DTW
- spaCy + xx_ent_wiki_sm: Multilingual sentence splitting
- GlotLID: Language identification model, licensed under Apache 2.0 with notices (see THIRD_PARTY_LICENSES/GlotLID_LICENSE)
- NumPy, PyYAML, tqdm