Refactor FlatTN pipeline, remove legacy code, and refresh documentation#5
Open
binbinsh wants to merge 3 commits into thuhcsi:master from
Conversation
Pull request overview
This PR refactors the repo to keep only the current FlatTN implementation path (no backward compatibility), and updates documentation/scripts to reflect the official train/eval/predict workflow.
Changes:
- Removes legacy V0/V1/FastNLP-based codepaths and consolidates the active implementation under flattn/ and scripts/.
- Adds new CLI entrypoints for training, evaluation, and prediction, plus a reproducible train.sh.
- Refreshes README with updated metrics, model download instructions, and I/O format examples.
Reviewed changes
Copilot reviewed 34 out of 40 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| utils.py | Removes legacy utility module. |
| preprocess.py | Removes legacy preprocessing script. |
| paths.py | Removes legacy path constants. |
| load_data.py | Removes legacy FastNLP dataset loader. |
| gpu_utils.py | Removes legacy GPU utility code. |
| fastNLP_module.py | Removes legacy FastNLP embedding/module code. |
| V0/, V1/ | Removes legacy model/training pipelines. |
| flattn/__init__.py | Exposes the new package surface for the current pipeline. |
| flattn/data.py | Adds BMES reader, dataset wrapper, and collate function for the new pipeline. |
| flattn/lattice.py | Adds lexicon/rule lattice construction utilities. |
| flattn/model.py | Implements the FlatTN model with BERT embedding + Transformer + CRF. |
| flattn/training.py | Adds training loop and seqeval-based evaluation. |
| flattn/modules.py | Refactors module utilities and removes FastNLP dependency points. |
| flattn/module_utils.py | Introduces local replacements for legacy utilities (dropout/misc helpers). |
| flattn/rules.py | Adds rule/Trie utilities used by the lattice builder (with legacy stubs). |
| scripts/train_flattn.py | New training CLI aligned with the intended workflow. |
| scripts/evaluate_flattn.py | New evaluation CLI. |
| scripts/predict_flattn.py | New prediction CLI producing JSONL outputs. |
| scripts/__init__.py | Marks scripts as a package. |
| train.sh | Adds a reproducible train + eval shell entrypoint using uv. |
| requirements.txt | Introduces minimal runtime dependencies for the new pipeline. |
| dataset/processed/stat.py | Refactors dataset stats script into a cleaner CLI tool. |
| README.md | Updates docs to match the new structure and released checkpoint workflow. |
| CLAUDE.md | Adds repo tooling guidance (uv). |
| .gitignore | Adds ignores for .venv, caches, and __pycache__. |
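The rule/Trie utilities in flattn/rules.py and the lattice builder in flattn/lattice.py are only summarized above, not shown in this diff. As a rough illustration of the Trie-based lexicon matching such a lattice builder typically relies on (the class and method names below are hypothetical, not the repo's actual API):

```python
class Trie:
    """Minimal lexicon Trie: insert words, then enumerate every lexicon
    match (start, end, word) inside a sentence for lattice construction."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        # Walk/extend the tree one character at a time.
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def matches(self, sentence):
        # For every start position, follow the Trie as far as it goes
        # and record each complete word passed along the way.
        hits = []
        for start in range(len(sentence)):
            node = self
            for end in range(start, len(sentence)):
                node = node.children.get(sentence[end])
                if node is None:
                    break
                if node.is_word:
                    hits.append((start, end, sentence[start:end + 1]))
        return hits

trie = Trie()
for w in ["百分", "百分之", "十"]:
    trie.insert(w)
hits = trie.matches("百分之十")
# → [(0, 1, '百分'), (0, 2, '百分之'), (3, 3, '十')]
```

Each (start, end, word) span would then become a lattice node alongside the character nodes, which is the usual flat-lattice setup.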
Comments suppressed due to low confidence (2)
flattn/modules.py:12
seq_len_to_mask() sets max_len = seq_len.max() (a 0-d tensor) and passes it to torch.arange(max_len, ...), which will raise a TypeError on recent PyTorch (expects an int). Convert to a Python int (e.g., int(seq_len.max().item())) and ensure max_len is an int even when provided as a tensor.
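A minimal sketch of the suggested fix (illustrative only; the actual signature and semantics in flattn/modules.py may differ):

```python
import torch

def seq_len_to_mask(seq_len, max_len=None):
    # seq_len.max() is a 0-d tensor; torch.arange expects a plain int,
    # so coerce it (and any caller-supplied max_len) to a Python int.
    if max_len is None:
        max_len = int(seq_len.max().item())
    else:
        max_len = int(max_len)
    # Broadcast (max_len,) against (batch, 1) to get a (batch, max_len)
    # boolean mask: True where the position is inside the sequence.
    positions = torch.arange(max_len, device=seq_len.device)
    return positions.unsqueeze(0) < seq_len.unsqueeze(1)

mask = seq_len_to_mask(torch.tensor([2, 4, 3]))
# mask has shape (3, 4); row 0 is [True, True, False, False]
```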
flattn/modules.py:105
Four_Pos_Fusion_Embedding.forward() calls .to(dev) on self.pe_ss/pe_se/... every forward pass. This creates a new tensor each time (extra overhead) and can break expected parameter/buffer device semantics (especially if these are nn.Parameters). Prefer registering these as buffers/parameters on the module and relying on model.to(device) (or move them once in an explicit sync step) rather than per-forward .to() copies.
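A sketch of the buffer-registration pattern being suggested; the class name follows the comment above, but the table contents and shapes are placeholders, not the repo's real embedding:

```python
import torch
import torch.nn as nn

class Four_Pos_Fusion_Embedding(nn.Module):
    """Positional tables registered as buffers, so model.to(device)
    moves them once instead of per-forward .to(dev) copies."""

    def __init__(self, max_len, d_model):
        super().__init__()
        pe = torch.randn(max_len, d_model)  # placeholder positional table
        # register_buffer ties the tensors to the module's device/dtype
        # state without making them trainable parameters.
        self.register_buffer("pe_ss", pe.clone())
        self.register_buffer("pe_se", pe.clone())

    def forward(self, pos_ss, pos_se):
        # Buffers are already on the right device after model.to(device),
        # so no per-forward .to() is needed here.
        return self.pe_ss[pos_ss] + self.pe_se[pos_se]

emb = Four_Pos_Fusion_Embedding(16, 8)
out = emb(torch.tensor([0, 1]), torch.tensor([2, 3]))
# out has shape (2, 8)
```

If the tables should be trainable, wrap them in nn.Parameter instead; either way, device movement is then handled by model.to(device).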
Summary
@JohnsonTsing This PR cleans up the repository and keeps only the current FlatTN implementation path, while updating docs to match the official workflow.
What Changed
1) Codebase structure cleanup
- flattn/ (model/data/lattice/training core)
- scripts/ (train/evaluate/predict CLIs)
- flattn/modules.py, flattn/rules.py, flattn/module_utils.py: internal imports now resolve through flattn.* only.

2) Removed legacy/compatibility code
3) Training/inference entrypoint updates
- New training entrypoint: scripts/train_flattn.py; train.sh updated accordingly.

4) README refresh
Breaking Changes
- scripts/train_flattn.py
- scripts/evaluate_flattn.py
- scripts/predict_flattn.py

Validation
- uv run python -m py_compile $(rg --files -g '*.py')
- uv run python scripts/train_flattn.py --help
- uv run python scripts/predict_flattn.py --help