Refactor FlatTN pipeline, remove legacy code, and refresh documentation#5
Open
binbinsh wants to merge 3 commits into thuhcsi:master from
Conversation
Pull request overview
This PR refactors the repo to keep only the current FlatTN implementation path (no backward compatibility), and updates documentation/scripts to reflect the official train/eval/predict workflow.
Changes:
- Removes legacy V0/V1/FastNLP-based codepaths and consolidates the active implementation under flattn/ and scripts/.
- Adds new CLI entrypoints for training, evaluation, and prediction, plus a reproducible train.sh.
- Refreshes README with updated metrics, model download instructions, and I/O format examples.
Reviewed changes
Copilot reviewed 34 out of 40 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| utils.py | Removes legacy utility module. |
| preprocess.py | Removes legacy preprocessing script. |
| paths.py | Removes legacy path constants. |
| load_data.py | Removes legacy FastNLP dataset loader. |
| gpu_utils.py | Removes legacy GPU utility code. |
| fastNLP_module.py | Removes legacy FastNLP embedding/module code. |
| V0/, V1/ | Removes legacy model/training pipelines. |
| flattn/__init__.py | Exposes the new package surface for the current pipeline. |
| flattn/data.py | Adds BMES reader, dataset wrapper, and collate function for the new pipeline. |
| flattn/lattice.py | Adds lexicon/rule lattice construction utilities. |
| flattn/model.py | Implements the FlatTN model with BERT embedding + Transformer + CRF. |
| flattn/training.py | Adds training loop and seqeval-based evaluation. |
| flattn/modules.py | Refactors module utilities and removes FastNLP dependency points. |
| flattn/module_utils.py | Introduces local replacements for legacy utilities (dropout/misc helpers). |
| flattn/rules.py | Adds rule/Trie utilities used by the lattice builder (with legacy stubs). |
| scripts/train_flattn.py | New training CLI aligned with the intended workflow. |
| scripts/evaluate_flattn.py | New evaluation CLI. |
| scripts/predict_flattn.py | New prediction CLI producing JSONL outputs. |
| scripts/__init__.py | Marks scripts as a package. |
| train.sh | Adds a reproducible train + eval shell entrypoint using uv. |
| requirements.txt | Introduces minimal runtime dependencies for the new pipeline. |
| dataset/processed/stat.py | Refactors dataset stats script into a cleaner CLI tool. |
| README.md | Updates docs to match the new structure and released checkpoint workflow. |
| CLAUDE.md | Adds repo tooling guidance (uv). |
| .gitignore | Adds ignores for .venv, caches, and __pycache__. |
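The rule/Trie utilities in flattn/rules.py and the lattice builder in flattn/lattice.py are only summarized above, not shown in this diff. As a rough illustration of the Trie-based lexicon matching such a lattice builder typically relies on (the class and method names below are hypothetical, not the repo's actual API):

```python
class Trie:
    """Minimal lexicon Trie: insert words, then enumerate every lexicon
    match (start, end, word) inside a sentence for lattice construction."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        # Walk/extend the tree one character at a time.
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def matches(self, sentence):
        # For every start position, follow the Trie as far as it goes
        # and record each complete word passed along the way.
        hits = []
        for start in range(len(sentence)):
            node = self
            for end in range(start, len(sentence)):
                node = node.children.get(sentence[end])
                if node is None:
                    break
                if node.is_word:
                    hits.append((start, end, sentence[start:end + 1]))
        return hits

trie = Trie()
for w in ["百分", "百分之", "十"]:
    trie.insert(w)
hits = trie.matches("百分之十")
# → [(0, 1, '百分'), (0, 2, '百分之'), (3, 3, '十')]
```

Each (start, end, word) span would then become a lattice node alongside the character nodes, which is the usual flat-lattice setup.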
Comments suppressed due to low confidence (2)
flattn/modules.py:12
seq_len_to_mask() sets max_len = seq_len.max() (a 0-d tensor) and passes it to torch.arange(max_len, ...), which will raise a TypeError on recent PyTorch (expects an int). Convert to a Python int (e.g., int(seq_len.max().item())) and ensure max_len is an int even when provided as a tensor.
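A minimal sketch of the suggested fix (illustrative only; the actual signature and semantics in flattn/modules.py may differ):

```python
import torch

def seq_len_to_mask(seq_len, max_len=None):
    # seq_len.max() is a 0-d tensor; torch.arange expects a plain int,
    # so coerce it (and any caller-supplied max_len) to a Python int.
    if max_len is None:
        max_len = int(seq_len.max().item())
    else:
        max_len = int(max_len)
    # Broadcast (max_len,) against (batch, 1) to get a (batch, max_len)
    # boolean mask: True where the position is inside the sequence.
    positions = torch.arange(max_len, device=seq_len.device)
    return positions.unsqueeze(0) < seq_len.unsqueeze(1)

mask = seq_len_to_mask(torch.tensor([2, 4, 3]))
# mask has shape (3, 4); row 0 is [True, True, False, False]
```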
flattn/modules.py:105
Four_Pos_Fusion_Embedding.forward() calls .to(dev) on self.pe_ss/pe_se/... every forward pass. This creates a new tensor each time (extra overhead) and can break expected parameter/buffer device semantics (especially if these are nn.Parameters). Prefer registering these as buffers/parameters on the module and relying on model.to(device) (or move them once in an explicit sync step) rather than per-forward .to() copies.
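A sketch of the buffer-registration pattern being suggested; the class name follows the comment above, but the table contents and shapes are placeholders, not the repo's real embedding:

```python
import torch
import torch.nn as nn

class Four_Pos_Fusion_Embedding(nn.Module):
    """Positional tables registered as buffers, so model.to(device)
    moves them once instead of per-forward .to(dev) copies."""

    def __init__(self, max_len, d_model):
        super().__init__()
        pe = torch.randn(max_len, d_model)  # placeholder positional table
        # register_buffer ties the tensors to the module's device/dtype
        # state without making them trainable parameters.
        self.register_buffer("pe_ss", pe.clone())
        self.register_buffer("pe_se", pe.clone())

    def forward(self, pos_ss, pos_se):
        # Buffers are already on the right device after model.to(device),
        # so no per-forward .to() is needed here.
        return self.pe_ss[pos_ss] + self.pe_se[pos_se]

emb = Four_Pos_Fusion_Embedding(16, 8)
out = emb(torch.tensor([0, 1]), torch.tensor([2, 3]))
# out has shape (2, 8)
```

If the tables should be trainable, wrap them in nn.Parameter instead; either way, device movement is then handled by model.to(device).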
Summary
@JohnsonTsing This PR cleans up the repository and keeps only the current FlatTN implementation path, while updating docs to match the official workflow.
What Changed
1) Codebase structure cleanup
- flattn/ (model/data/lattice/training core)
- scripts/ (train/evaluate/predict CLIs)
- flattn/modules.py, flattn/rules.py, flattn/module_utils.py: internal imports now resolve through flattn.* only.

2) Removed legacy/compatibility code
3) Training/inference entrypoint updates
- New training entrypoint: scripts/train_flattn.py; train.sh updated accordingly.

4) README refresh
Breaking Changes
- scripts/train_flattn.py
- scripts/evaluate_flattn.py
- scripts/predict_flattn.py

Validation
- uv run python -m py_compile $(rg --files -g '*.py')
- uv run python scripts/train_flattn.py --help
- uv run python scripts/predict_flattn.py --help