This is one of the implementations of the following paper:
@inproceedings{omelianchuk-etal-2020-gector,
title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
author = "Omelianchuk, Kostiantyn and
Atrasevych, Vitaliy and
Chernodub, Artem and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
month = jul,
year = "2020",
address = "Seattle, WA, USA → Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.bea-1.16",
doi = "10.18653/v1/2020.bea-1.16",
pages = "163--170"
}
- Official: grammarly/gector
- Without AllenNLP
- Trained checkpoints can be downloaded from Hub
- Distributed training
- 😔 Does not support probabilistic ensemble
- cofe-ai/fast-gector
- Use Accelerate for distributed training
Confirmed that it works on Python 3.11.0.
pip install git+https://github.com/gotutiyan/gector
# Download the verb dictionary in advance
mkdir data
cd data
wget https://github.com/grammarly/gector/raw/master/data/verb-form-vocab.txt

- Code: MIT license
- Trained models on Hugging Face Hub: for non-commercial use only.
- This implementation supports both our models and the official models.
- I published trained weights on Hugging Face Hub. Please refer to the Performances section for the results obtained.
- Note that this implementation does not support probabilistic ensembling. See Ensemble.
gector-predict \
--input <raw text file> \
--restore_dir gotutiyan/gector-roberta-base-5k \
--out <path to output file>

Or, to use as API:

from transformers import AutoTokenizer
from gector import GECToR, predict, load_verb_dict
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = 'gotutiyan/gector-roberta-base-5k'
model = GECToR.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
encode, decode = load_verb_dict('data/verb-form-vocab.txt')
srcs = [
'This is a correct sentence.',
'This are a wrong sentences'
]
corrected = predict(
model, tokenizer, srcs,
encode, decode,
keep_confidence=0.0,
min_error_prob=0.0,
n_iteration=5,
batch_size=2,
)
print(corrected)

- Please set `--from_official` and related options starting with `--official.`.
- `data/output_vocabulary` can be found in the official grammarly/gector repository.
# An example of using the official BERT model.
# Download the official model.
wget https://grammarly-nlp-data-public.s3.amazonaws.com/gector/bert_0_gectorv2.th
# Predict with the official model.
python predict.py \
--input <raw text file> \
--restore bert_0_gectorv2.th \
--out out.txt \
--from_official \
--official.vocab_path data/output_vocabulary \
--official.transformer_model bert-base-cased \
--official.special_tokens_fix 0 \
--official.max_length 80
Examples for other official models:
- RoBERTa
wget https://grammarly-nlp-data-public.s3.amazonaws.com/gector/roberta_1_gectorv2.th
python predict.py \
--input <raw text file> \
--restore roberta_1_gectorv2.th \
--out out.txt \
--from_official \
--official.vocab_path data/output_vocabulary \
--official.transformer_model roberta-base \
--official.special_tokens_fix 1
- XLNet
wget https://grammarly-nlp-data-public.s3.amazonaws.com/gector/xlnet_0_gectorv2.th
python predict.py \
--input <raw text file> \
--restore xlnet_0_gectorv2.th \
--out out.txt \
--from_official \
--official.vocab_path data/output_vocabulary \
--official.transformer_model xlnet-base-cased \
--official.special_tokens_fix 0
- GECToR-2024 (RoBERTa large) [Omelianchuk+ 24]
wget https://grammarly-nlp-data-public.s3.amazonaws.com/GECToR-2024/gector-2024-roberta-large.th
python predict.py \
--input <raw text file> \
--restore gector-2024-roberta-large.th \
--out out.txt \
--from_official \
--official.vocab_path data/output_vocabulary \
--official.transformer_model roberta-large \
--official.special_tokens_fix 1
- Use `GECToR.from_official_pretrained()` instead of `GECToR.from_pretrained()`.
from transformers import AutoTokenizer
from gector import GECToR, predict, load_verb_dict
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GECToR.from_official_pretrained(
'bert_0_gectorv2.th',
special_tokens_fix=0,
transformer_model='bert-base-cased',
vocab_path='data/output_vocabulary',
max_length=80
).to(device)
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
encode, decode = load_verb_dict('data/verb-form-vocab.txt')
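
The officially pre-trained weights loaded this way can then be used with `predict()` exactly as in the API example above, for instance:

```python
# Reuses the model, tokenizer, encode, and decode objects created above.
srcs = ['This are a wrong sentences']
corrected = predict(
    model, tokenizer, srcs,
    encode, decode,
    keep_confidence=0.0,
    min_error_prob=0.0,
    n_iteration=5,
    batch_size=1,
)
print(corrected)
```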
I performed experiments using this implementation. The trained models can also be obtained from Hugging Face Hub.
The details of the experimental settings are as follows:
- All models below are trained on all of stages 1, 2, and 3.
- The common training config is the following:
{
"restore_vocab_official": "data/output_vocabulary/",
"max_len": 80,
"n_epochs": 10,
"p_dropout": 0.0,
"lr": 1e-05,
"cold_lr": 0.001,
"accumulation": 1,
"label_smoothing": 0.0,
"num_warmup_steps": 500,
"lr_scheduler_type": "constant"
}

For stage 1,
{
"batch_size": 256,
"n_cold_epochs": 2
}

For stage 2,
{
"batch_size": 128,
"n_cold_epochs": 2
}

For stage 3,
{
"batch_size": 128,
"n_cold_epochs": 0
}

| Stage | Train Datasets (# sents.) | Validation Dataset (# sents.) |
|---|---|---|
| 1 | PIE-synthetic (8,865,347, a1 split of this) | BEA19-dev (i.e. W&I+LOCNESS-dev, 4,382) |
| 2 | BEA19-train: FCE-train + W&I+LOCNESS-train + Lang-8 + NUCLE, without src=trg pairs (561,290) | BEA19-dev |
| 3 | W&I+LOCNESS-train (34,304) | BEA19-dev |
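
Putting the configs and datasets above together, the three stages could be run with train.py (its options are described in the training section below), chaining the stages via --restore_dir. This is an illustrative sketch rather than the exact commands used: dataset paths are placeholders, and the --train_file/--valid_file inputs are outputs of the preprocessing step described later.

```bash
# Stage 1: start from the official tag vocabulary (add --model_id to change the encoder).
accelerate launch train.py \
    --train_file data/stage1_train.txt \
    --valid_file data/bea19_dev.txt \
    --save_dir outputs/stage1 \
    --restore_vocab_official data/output_vocabulary/ \
    --max_len 80 --n_epochs 10 --batch_size 256 --n_cold_epochs 2

# Stage 2: restore weights and tag vocabulary from the stage-1 checkpoint.
accelerate launch train.py \
    --train_file data/stage2_train.txt \
    --valid_file data/bea19_dev.txt \
    --save_dir outputs/stage2 \
    --restore_dir outputs/stage1/best \
    --max_len 80 --n_epochs 10 --batch_size 128 --n_cold_epochs 2

# Stage 3: restore from stage 2; no cold epochs.
accelerate launch train.py \
    --train_file data/stage3_train.txt \
    --valid_file data/bea19_dev.txt \
    --save_dir outputs/stage3 \
    --restore_dir outputs/stage2/best \
    --max_len 80 --n_epochs 10 --batch_size 128 --n_cold_epochs 0
```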
- Note that the number of epochs for stage 1 is smaller than the official setting (20 epochs). The reasons are (1) the results were competitive with those in the paper even at 10 epochs, and (2) I did not want to occupy my laboratory's computational resources for longer than necessary.
- The tag vocabulary is the same as the official one.
- I trained with three different seeds (10, 11, 12) for each model and used the one with the best performance.
- Furthermore, I tweaked the keep confidence and the sentence-level minimum error probability threshold (from 0 to 0.9 in steps of 0.1) for each best model.
- Finally, the checkpoint with the highest F0.5 on BEA19-dev is used.
- The number of iterations is 5.
- Used ERRANT for the BEA19-dev evaluation. Note that I re-extracted the edits of the official M2 reference with ERRANT.
- Used CodaLab for the BEA19-test evaluation.
- Used M2 Scorer for the CoNLL14 evaluation.
The slightly lower results for BEA19-dev in [Tarnavskyi+ 2022] are probably due to not re-extracting the reference M2.
| Model | Confidence | Threshold | BEA19-dev (P/R/F0.5) | CoNLL14 (P/R/F0.5) | BEA19-test (P/R/F0.5) |
|---|---|---|---|---|---|
| BERT [Omelianchuk+ 2020] | | | | 72.1/42.0/63.0 | 71.5/55.7/67.6 |
| RoBERTa [Omelianchuk+ 2020] | | | | 73.9/41.5/64.0 | 77.2/55.1/71.5 |
| XLNet [Omelianchuk+ 2020] | | | 66.0/33.8/55.5 | 77.5/40.1/65.3 | 79.2/53.9/72.4 |
| DeBERTa [Tarnavskyi+ 2022] (Table 3) | | | 64.2/31.8/53.8 | | |
| gotutiyan/gector-bert-base-cased-5k | 0.4 | 0.5 | 67.0/32.2/55.1 | 73.8/36.2/61.17 | 77.3/50.9/70.0 |
| gotutiyan/gector-roberta-base-5k | 0.3 | 0.6 | 67.0/36.9/57.6 | 73.4/40.7/63.2 | 77.2/54.4/71.2 |
| gotutiyan/gector-xlnet-base-cased-5k | 0.0 | 0.6 | 67.1/35.9/57.2 | 74.0/40.5/63.5 | 77.4/54.7/71.5 |
| gotutiyan/gector-deberta-base-5k | 0.3 | 0.6 | 67.9/36.3/57.8 | 75.2/40.5/64.2 | 77.8/55.4/72.0 |
| Model | Confidence | Threshold | BEA19-dev (P/R/F0.5) | CoNLL14 (P/R/F0.5) | BEA19-test (P/R/F0.5) |
|---|---|---|---|---|---|
| RoBERTa [Tarnavskyi+ 2022] | | | 65.7/33.8/55.3 | | 80.7/53.3/73.2 |
| XLNet [Tarnavskyi+ 2022] | | | 64.2/35.1/55.1 | | |
| DeBERTa [Tarnavskyi+ 2022] | | | 66.3/32.7/55.0 | | |
| DeBERTa (basetag) [Mesham+ 2023] | | | 68.1/38.1/58.8 | | 77.8/56.7/72.4 |
| gotutiyan/gector-bert-large-cased-5k | 0.5 | 0.0 | 66.7/34.4/56.1 | 75.9/39.1/63.9 | 77.5/52.4/70.7 |
| gotutiyan/gector-roberta-large-5k | 0.0 | 0.6 | 68.8/38.8/59.6 | 75.4/40.9/64.5 | 79.0/56.2/73.1 |
| gotutiyan/gector-xlnet-large-cased-5k | 0.0 | 0.6 | 69.1/36.8/58.8 | 75.9/41.7/65.2 | 79.1/55.8/73.0 |
| gotutiyan/gector-deberta-large-5k | 0.0 | 0.6 | 69.3/39.5/60.3 | 78.2/43.2/67.3 | 79.2/58.0/73.8 |
| Model | BEA19-dev (P/R/F0.5) | CoNLL14 (P/R/F0.5) | BEA19-test (P/R/F0.5) | Note |
|---|---|---|---|---|
| BERT(base) + RoBERTa(base) + XLNet(base) [Omelianchuk+ 2020] | | 78.2/41.5/66.5 | 78.9/58.2/73.6 | |
| gotutiyan/gector-bert-base-cased-5k + gotutiyan/gector-roberta-base-5k + gotutiyan/gector-xlnet-base-cased-5k | 72.1/33.8/58.7 | 79.0/37.7/64.8 | 82.8/52.7/74.3 | The ensemble method is different from Omelianchuk+ 2020. |
| RoBERTa(large, 10k) + XLNet(large, 5k) + DeBERTa(large, 10k) [Tarnavskyi+ 2022] | | | 84.4/54.4/76.0 | |
| gotutiyan/gector-roberta-large-5k + gotutiyan/gector-xlnet-large-cased-5k + gotutiyan/gector-deberta-large-5k | 73.9/37.5/61.9 | 80.7/40.9/67.6 | 84.1/56.0/76.4 | |
Use the official preprocessing code, e.g.:
mkdir utils
cd utils
wget https://github.com/grammarly/gector/raw/master/utils/preprocess_data.py
wget https://raw.githubusercontent.com/grammarly/gector/master/utils/helpers.py
cd ..
python utils/preprocess_data.py \
-s <raw source file path> \
-t <raw target file path> \
-o <output path>

train.py uses Accelerate. Please configure your environment with `accelerate config` in advance.
accelerate launch train.py \
--train_file <preprocess output of train> \
--valid_file <preprocess output of validation> \
--save_dir outputs/sample

Other options of train.py:
| Option | Default | Note |
|---|---|---|
| --model_id | bert-base-cased | Specify a BERT-like model. I confirmed that `bert-*`, `roberta-*`, `microsoft/deberta-*`, and `xlnet-*` models work. |
| --batch_size | 16 | |
| --delimeter | `SEPL\|\|\|SEPR` | The delimiter of the preprocessed file. |
| --additional_delimeter | `SEPL__SEPR` | Another delimiter used to separate multiple tags for one word. |
| --restore_dir | None | For training from a specified checkpoint. Both weights and tag vocabulary will be loaded. |
| --restore_vocab | None | To train with an existing tag vocabulary. Please specify a config.json for this option. Note that weights are not loaded. |
| --restore_vocab_official | None | To use an existing tag vocabulary in the official format. Please specify a path like path/to/data/output_vocabulary/. |
| --max_len | 128 | Maximum length of input (subword-level length) |
| --n_max_labels | 5000 | The number of tag types. |
| --n_epochs | 10 | The number of epochs. |
| --n_cold_epochs | 2 | The number of epochs to train only the classifier layer. |
| --lr | 1e-5 | The learning rate after cold steps. |
| --cold_lr | 1e-3 | The learning rate during cold steps. |
| --p_dropout | 0.0 | The dropout rate of label projection layers. |
| --accumulation | 1 | The number of gradient accumulation steps. |
| --seed | 10 | seed |
| --label_smoothing | 0.0 | The label smoothing of the CrossEntropyLoss. |
| --num_warmup_steps | 500 | The number of warmup for learning rate scheduler. |
| --lr_scheduler_type | constant | Specify the learning rate scheduler type. |
NOTE: For those who are familiar with the official implementation:
- `--tag_strategy` is not available; it is forced to keep_one.
- `--skip_correct` is not available. Please remove identical source-target pairs from your training data in advance (see the sketch after this list).
- `--patience` is not available since this implementation does not employ early stopping.
- `--special_token_fix` is not available since this code always adds a $START token to the vocabulary.
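
For reference, if the source and target files are aligned line by line, identical pairs can be dropped with standard Unix tools before preprocessing. This is a minimal sketch with placeholder file names; it assumes sentences contain no tab characters:

```bash
# Keep only sentence pairs where source and target differ.
paste src.txt trg.txt | awk -F'\t' '$1 != $2' > filtered.tsv
cut -f1 filtered.tsv > src.filtered.txt
cut -f2 filtered.tsv > trg.filtered.txt
```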
The best and last checkpoints are saved. The format is:
outputs/sample
├── best
│ ├── added_tokens.json
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ ├── tokenizer.json
│ └── vocab.json
├── last
│ ├── ... (The same as best/)
└── log.json
Usage is the same as described in the Usage section above. You can specify the best/ or last/ directory as --restore_dir.
CLI
gector-predict \
--input <raw text file> \
--restore_dir outputs/sample/best \
--out <path to output file>

Other options of predict.py:
| Option | Default | Note |
|---|---|---|
| --n_iteration | 5 | The number of iterations. |
| --batch_size | 128 | Batch size. |
| --keep_confidence | 0.0 | A bias for the $KEEP label. |
| --min_error_prob | 0.0 | A sentence-level minimum error probability threshold. |
| --verb_file | data/verb-form-vocab.txt | Assumes that you have already downloaded this file during installation. |
| --visualize | None | Output visualization results to a specified file. |
Or, to use as API,
from transformers import AutoTokenizer
from gector import GECToR
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
path = 'outputs/sample/best'
model = GECToR.from_pretrained(path).to(device)
tokenizer = AutoTokenizer.from_pretrained(path)

You can use the --visualize option to output a visualization of the iterative inference. It is helpful for qualitative analyses.
For example,
echo 'A ten years old boy go school' > demo.txt
gector-predict \
--restore_dir gotutiyan/gector-roberta-base-5k \
--input demo.txt \
--visualize visualize.txt

visualize.txt will show:
=== Line 0 ===
== Iteration 0 ==
|$START |A |ten |years |old |boy |go |school |
|$KEEP |$KEEP |$APPEND_- |$TRANSFORM_AGREEMENT_SINGULAR |$KEEP |$KEEP |$TRANSFORM_VERB_VB_VBZ |$KEEP |
== Iteration 1 ==
|$START |A |ten |- |year |old |boy |goes |school |
|$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |$APPEND_to |$KEEP |
== Iteration 2 ==
|$START |A |ten |- |year |old |boy |goes |to |school |
|$KEEP |$KEEP |$KEEP |$KEEP |$APPEND_- |$KEEP |$KEEP |$KEEP |$KEEP |$KEEP |
A ten - year - old boy goes to school
To tweak the two inference parameters, please use predict_tweak.py.
The following example tweaks both parameters over {0, 0.1, 0.2, ..., 0.9}. kc is the keep confidence and mep is the minimum error probability threshold.
gector-predict-tweak \
--input <raw text file> \
--restore_dir outputs/sample/best \
--kc_min 0 \
--kc_max 1 \
--mep_min 0 \
--mep_max 1 \
--step 0.1

This script creates <--restore_dir>/outputs/tweak_outputs/ and saves each output in it.
models/sample/best/outputs/tweak_outputs/
├── kc0.0_mep0.0.txt
├── kc0.0_mep0.1.txt
├── kc0.0_mep0.2.txt
...
After that, you can determine the best parameters by:
RESTORE_DIR=outputs/sample/best/
for kc in `seq 0 0.1 0.9` ; do
for mep in `seq 0 0.1 0.9` ; do
# Run evaluation scripts for $RESTORE_DIR/outputs/tweak_outputs/kc${kc}_mep${mep}.txt
done
done
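
As noted in the experimental settings, ERRANT was used for the BEA19-dev evaluation. One possible way to fill in the evaluation step inside the loop above is sketched below; file names are placeholders, and it assumes ERRANT is installed and the reference M2 has been re-extracted with ERRANT:

```bash
# Hypothetical paths: bea19-dev.src is the raw source file and bea19-dev.errant.m2 is
# the reference M2 re-extracted with ERRANT.
HYP=$RESTORE_DIR/outputs/tweak_outputs/kc${kc}_mep${mep}.txt
errant_parallel -orig bea19-dev.src -cor $HYP -out hyp_kc${kc}_mep${mep}.m2
errant_compare -hyp hyp_kc${kc}_mep${mep}.m2 -ref bea19-dev.errant.m2
```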
- This implementation does not support probabilistic ensemble inference. Please use the majority voting ensemble [Tarnavskyi+ 2022] instead.

wget https://github.com/MaksTarnavskyi/gector-large/raw/master/ensemble.py
python ensemble.py \
--source_file <source> \
--target_files <hyp1> <hyp2> ... \
--output_file <out>