Refactor FlatTN pipeline, remove legacy code, and refresh documentation #5

Open

binbinsh wants to merge 3 commits into thuhcsi:master from binbinsh:master
Conversation


@binbinsh binbinsh commented Feb 9, 2026

Summary

@JohnsonTsing This PR cleans up the repository, keeping only the current FlatTN implementation path, and updates the docs to match the official workflow.

What Changed

1) Codebase structure cleanup

  • Consolidated active code under:
    • flattn/ (model/data/lattice/training core)
    • scripts/ (train/evaluate/predict CLIs)
  • Migrated runtime dependencies into package-local modules:
    • flattn/modules.py
    • flattn/rules.py
    • flattn/module_utils.py
  • Updated imports to use flattn.* only.

2) Removed legacy/compatibility code

  • Removed historical/unused modules and entrypoints (including V0/V1 paths and old root-level compatibility modules).
  • Removed backward-compat entrypoint and legacy utility files.
  • Kept only the current official training/eval/inference flow.

3) Training/inference entrypoint updates

  • Standardized training entry to:
    • scripts/train_flattn.py
  • Updated train.sh accordingly.
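
The shape of such a standardized CLI entrypoint can be sketched with argparse. The flag names below (`--epochs`, `--device`) are hypothetical placeholders, not the actual options exposed by scripts/train_flattn.py:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI sketch; the real scripts/train_flattn.py
    # defines its own flags and defaults.
    parser = argparse.ArgumentParser(
        prog="train_flattn.py",
        description="Train a FlatTN model.",
    )
    parser.add_argument("--epochs", type=int, default=1,
                        help="number of training epochs")
    parser.add_argument("--device", default="cpu",
                        help="torch device string, e.g. cpu or cuda:0")
    return parser
```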

4) README refresh

Breaking Changes

  • No backward compatibility is preserved.
  • Old legacy paths/scripts/modules are removed.
  • Use:
    • scripts/train_flattn.py
    • scripts/evaluate_flattn.py
    • scripts/predict_flattn.py

Validation

  • uv run python -m py_compile $(rg --files -g '*.py')
  • uv run python scripts/train_flattn.py --help
  • uv run python scripts/predict_flattn.py --help
  • Smoke test:
    • ran a minimal CPU training run (1 epoch, tiny subset) successfully
    • used generated checkpoint for prediction successfully
    • verified JSONL output format and field consistency
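
The JSONL field-consistency check can be sketched in a few lines. The required field names below are hypothetical placeholders, not the actual schema emitted by scripts/predict_flattn.py:

```python
import json

# Hypothetical field names; the real prediction output defines its own schema.
REQUIRED_FIELDS = {"text", "labels"}

def check_jsonl(lines):
    """Parse JSONL records and verify each carries the required fields."""
    records = [json.loads(line) for line in lines if line.strip()]
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"record {i} missing fields: {sorted(missing)}")
    return records
```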

Copilot AI review requested due to automatic review settings February 9, 2026 09:56

Copilot AI left a comment


Pull request overview

This PR refactors the repo to keep only the current FlatTN implementation path (no backward compatibility), and updates documentation/scripts to reflect the official train/eval/predict workflow.

Changes:

  • Removes legacy V0/V1/FastNLP-based codepaths and consolidates the active implementation under flattn/ and scripts/.
  • Adds new CLI entrypoints for training, evaluation, and prediction, plus a reproducible train.sh.
  • Refreshes README with updated metrics, model download instructions, and I/O format examples.

Reviewed changes

Copilot reviewed 34 out of 40 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| utils.py | Removes legacy utility module. |
| preprocess.py | Removes legacy preprocessing script. |
| paths.py | Removes legacy path constants. |
| load_data.py | Removes legacy FastNLP dataset loader. |
| gpu_utils.py | Removes legacy GPU utility code. |
| fastNLP_module.py | Removes legacy FastNLP embedding/module code. |
| V0/, V1/ | Removes legacy model/training pipelines. |
| flattn/__init__.py | Exposes the new package surface for the current pipeline. |
| flattn/data.py | Adds BMES reader, dataset wrapper, and collate function for the new pipeline. |
| flattn/lattice.py | Adds lexicon/rule lattice construction utilities. |
| flattn/model.py | Implements the FlatTN model with BERT embedding + Transformer + CRF. |
| flattn/training.py | Adds training loop and seqeval-based evaluation. |
| flattn/modules.py | Refactors module utilities and removes FastNLP dependency points. |
| flattn/module_utils.py | Introduces local replacements for legacy utilities (dropout/misc helpers). |
| flattn/rules.py | Adds rule/Trie utilities used by the lattice builder (with legacy stubs). |
| scripts/train_flattn.py | New training CLI aligned with the intended workflow. |
| scripts/evaluate_flattn.py | New evaluation CLI. |
| scripts/predict_flattn.py | New prediction CLI producing JSONL outputs. |
| scripts/__init__.py | Marks scripts as a package. |
| train.sh | Adds a reproducible train + eval shell entrypoint using uv. |
| requirements.txt | Introduces minimal runtime dependencies for the new pipeline. |
| dataset/processed/stat.py | Refactors dataset stats script into a cleaner CLI tool. |
| README.md | Updates docs to match the new structure and released checkpoint workflow. |
| CLAUDE.md | Adds repo tooling guidance (uv). |
| .gitignore | Adds ignores for .venv, caches, and __pycache__. |
Comments suppressed due to low confidence (2)

flattn/modules.py:12

  • seq_len_to_mask() sets max_len = seq_len.max() (a 0-d tensor) and passes it to torch.arange(max_len, ...), which raises a TypeError on recent PyTorch (it expects an int). Convert to a Python int (e.g., int(seq_len.max().item())) and ensure max_len is an int even when provided as a tensor.

flattn/modules.py:105

  • Four_Pos_Fusion_Embedding.forward() calls .to(dev) on self.pe_ss/pe_se/... on every forward pass. This creates a new tensor each time (extra overhead) and can break expected parameter/buffer device semantics (especially if these are nn.Parameters). Prefer registering these as buffers/parameters on the module and relying on model.to(device) (or moving them once in an explicit sync step) rather than per-forward .to() copies.
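
For the first issue, the torch-side fix is converting the tensor max to a plain int (int(seq_len.max().item())) before it reaches torch.arange. A minimal pure-Python sketch of the masking logic, illustrating the same int conversion without a torch dependency (the function name matches flattn/modules.py, but the body is illustrative, not the repo's code):

```python
def seq_len_to_mask(seq_len, max_len=None):
    """Build a boolean padding mask from per-sequence lengths.

    In the torch version, max_len must be a plain Python int before it
    is used as a range bound (e.g. int(seq_len.max().item())); passing
    a 0-d tensor to torch.arange raises a TypeError on recent PyTorch.
    """
    if max_len is None:
        # The key fix: a plain int, not a 0-d tensor.
        max_len = int(max(seq_len))
    return [[pos < length for pos in range(max_len)] for length in seq_len]
```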


Comment threads:

  • scripts/evaluate_flattn.py
  • README.md (outdated)
  • flattn/rules.py (6 threads)
  • flattn/training.py
