Code review - Chart-to-Text: Generating Natural Language Explanations for Charts by Adapting the Transformer Model
Original repository: Chart2Text by Jason Obeid
Code for Chart-to-Text: Generating Natural Language Explanations for Charts by Adapting the Transformer Model.
The dataset is stored in the dataset/ directory and is split into train, valid, and test sets in the data/ directory.
Data is split up into directories:
- dataset/captions/ contains the cleaned chart captions and dataset/captions_old/ contains the uncleaned chart captions
- dataset/titles/ contains the cleaned chart titles and dataset/titles_old/ contains the uncleaned chart titles
- dataset/data/ contains the chart data
- dataset/multiColumn/ contains the chart data for charts with more than two columns, i.e. grouped bar charts and multi-line charts
Chart type breakdown:
| | Line | Bar | Total |
|---|---|---|---|
| Simple | 3564 | 3199 | 6763 |
| Complex | 902 | 640 | 1542 |
| Total | 4466 | 3839 | 8305 |
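The totals in the breakdown can be cross-checked with a few lines of Python. The dictionary below only restates the counts from the table; the variable names are illustrative and not part of the repository.

```python
# Chart counts copied from the breakdown table above.
counts = {
    "Simple": {"Line": 3564, "Bar": 3199},
    "Complex": {"Line": 902, "Bar": 640},
}

# Row totals (per complexity) and column totals (per chart type).
row_totals = {kind: sum(by_type.values()) for kind, by_type in counts.items()}
col_totals = {
    chart: sum(by_type[chart] for by_type in counts.values())
    for chart in ("Line", "Bar")
}
grand_total = sum(row_totals.values())

print(row_totals)   # {'Simple': 6763, 'Complex': 1542}
print(col_totals)   # {'Line': 4466, 'Bar': 3839}
print(grand_total)  # 8305
```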
Images are available separately at https://github.com/JasonObeid/Chart2TextImages due to their large size (~1 GB).
Clean the text within the chart titles and summaries:
cd utils/
python refactorTitles.py
cd utils/
python refactorCaptions.py
cd etc/
python templatePreprocess.py
- Converts data tables into a sequence of records (taken as input by the model): data/*split*/trainData.txt
- Cleans summary tokens and substitutes any possible tokens with data variables (e.g., 2018 -> templateValue[0][0]): data/*split*/trainSummary.txt
- Cleans the title tokens: data/*split*/trainTitle.txt
- Labels the occurrences of records mentioned within the summary: data/*split*/trainDataLabel.txt
- Labels the summary tokens which match a record: data/*split*/trainSummaryLabel.txt
- Saves the gold summaries: data/*split*/testOriginalSummary.txt
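To make the first two preprocessing outputs more concrete, here is a minimal sketch of how a table could be flattened into records and how summary tokens could be swapped for data variables. It is not the code in templatePreprocess.py: the record layout `header|value|axis|chart_type`, the helper names `table_to_records` and `templatize`, and the reading of `templateValue[i][j]` as column `i`, row `j` are all assumptions made for illustration. Only the `2018 -> templateValue[0][0]` example comes from the text above.

```python
from typing import Dict, List


def table_to_records(columns: Dict[str, List[str]], chart_type: str) -> List[str]:
    """Flatten a table into one record per cell (record layout is assumed)."""
    records = []
    for col_idx, (header, values) in enumerate(columns.items()):
        axis = "x" if col_idx == 0 else "y"  # assumption: first column is the x axis
        for value in values:
            records.append(f"{header}|{value}|{axis}|{chart_type}")
    return records


def templatize(summary_tokens: List[str], columns: Dict[str, List[str]]) -> List[str]:
    """Replace tokens that exactly match a cell with templateValue[col][row]."""
    lookup = {}
    for col_idx, values in enumerate(columns.values()):
        for row_idx, value in enumerate(values):
            # Keep the first position seen if a value occurs more than once.
            lookup.setdefault(value, f"templateValue[{col_idx}][{row_idx}]")
    return [lookup.get(token, token) for token in summary_tokens]


columns = {"Year": ["2018", "2019"], "Revenue": ["10", "12"]}
records = table_to_records(columns, "line_chart")
template = templatize("In 2018 revenue was 10".split(), columns)

print(records[0])          # Year|2018|x|line_chart
print(" ".join(template))  # In templateValue[0][0] revenue was templateValue[1][0]
```

Exact string matching is the simplest possible rule; the real script may normalise numbers or match multi-token values, which this sketch does not attempt.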
cd etc
python extract_vocab.py --table ../data/valid/validData.txt --summary ../data/valid/validSummary.txt
python extract_vocab.py --table ../data/test/testData.txt --summary ../data/test/testSummary.txt
python extract_vocab.py --table ../data/train/trainData.txt --summary ../data/train/trainSummary.txt
It will generate vocabulary files for each of them:
- data/*split*/trainData.txt_vocab
- data/*split*/trainSummary.txt_vocab
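As a rough illustration of what a vocabulary extraction step does, the sketch below counts whitespace-separated tokens in a text file and writes them out by frequency. The function names and the `token count` line layout are assumptions; the actual format written by extract_vocab.py may differ.

```python
import tempfile
from collections import Counter
from pathlib import Path


def build_vocab(text_path: str) -> Counter:
    """Count whitespace-separated tokens across all lines of a text file."""
    counts = Counter()
    with open(text_path, encoding="utf-8") as handle:
        for line in handle:
            counts.update(line.split())
    return counts


def write_vocab(counts: Counter, vocab_path: str) -> None:
    """Write 'token count' lines, most frequent first (layout is assumed)."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    with open(vocab_path, "w", encoding="utf-8") as handle:
        for token, freq in ordered:
            handle.write(f"{token} {freq}\n")


# Small demonstration on a throwaway file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("a b a\nb a c\n", encoding="utf-8")
    vocab = build_vocab(str(sample))
    write_vocab(vocab, str(sample) + "_vocab")
    vocab_text = Path(str(sample) + "_vocab").read_text(encoding="utf-8")

print(vocab_text)
```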
cd ../model
python preprocess_table_data.py --table ../data/valid/validData.txt --table_vocab ../data/valid/validData.txt_vocab --table_label ../data/valid/validDataLabel.txt
python preprocess_table_data.py --table ../data/test/testData.txt --table_vocab ../data/test/testData.txt_vocab --table_label ../data/test/testDataLabel.txt
python preprocess_table_data.py --table ../data/train/trainData.txt --table_vocab ../data/train/trainData.txt_vocab --table_label ../data/train/trainDataLabel.txt
python preprocess_summary_data.py --summary ../data/valid/validSummary.txt --summary_vocab ../data/valid/validSummary.txt_vocab --summary_label ../data/valid/validSummaryLabel.txt
python preprocess_summary_data.py --summary ../data/test/testSummary.txt --summary_vocab ../data/test/testSummary.txt_vocab --summary_label ../data/test/testSummaryLabel.txt
python preprocess_summary_data.py --summary ../data/train/trainSummary.txt --summary_vocab ../data/train/trainSummary.txt_vocab --summary_label ../data/train/trainSummaryLabel.txt
Outputs the training data:
- Data Records: data/*split*/trainData.txt.pth
- Summaries: data/*split*/trainSummary.txt.pth
Note: if you get a dictionary assertion error, then delete the old .pth files in data subfolders and try again
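The cleanup in the note can be done by hand, or with a small helper such as the hypothetical `remove_stale_pth` below. It is a convenience sketch, not part of the repository, and it defaults to a dry run so nothing is deleted until you ask for it.

```python
from pathlib import Path
from typing import List


def remove_stale_pth(data_root: str, dry_run: bool = True) -> List[Path]:
    """Find every .pth file below data_root; delete them unless dry_run is set."""
    root = Path(data_root)
    if not root.is_dir():
        return []
    stale = sorted(root.rglob("*.pth"))
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale


# Dry run first: list what would be removed under data/ without touching it.
for path in remove_stale_pth("data"):
    print("would delete", path)
```

Calling `remove_stale_pth("data", dry_run=False)` performs the deletion, after which the preprocessing commands can be rerun.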
python model/train.py \
--model_path "experiments" \
--exp_name "chart2text" \
--exp_id "run1" \
--train_cs_table_path data/train/trainData.txt.pth \
--train_sm_table_path data/train/trainData.txt.pth \
--train_sm_summary_path data/train/trainSummary.txt.pth \
--valid_table_path data/valid/validData.txt.pth \
--valid_summary_path data/valid/validSummary.txt.pth \
--cs_step True \
--lambda_cs "1" \
--sm_step True \
--lambda_sm "1" \
--label_smoothing 0.05 \
--sm_step_with_cc_loss False \
--sm_step_with_cs_proba False \
--share_inout_emb True \
--share_srctgt_emb False \
--emb_dim 512 \
--enc_n_layers 1 \
--dec_n_layers 6 \
--dropout 0.1 \
--save_periodic 40 \
--batch_size 6 \
--beam_size 4 \
--epoch_size 1000 \
--max_epoch 81 \
--eval_bleu True \
--sinusoidal_embeddings True \
--encoder_positional_emb True \
--gelu_activation True \
--validation_metrics valid_mt_bleu
Pretrained models can be downloaded from the links below:
python model/summarize.py --model_path aug17-80.pth --table_path data/test/testData.txt --output_path results/aug17/templateOutput-p80.txt --title_path data/test/testTitle.txt --beam_size 4 --batch_size 8
Substitute any predicted data variables:
cd etc/
python summaryComparison.py
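The substitution step can be pictured as the reverse of the templating done during preprocessing. The sketch below is an illustration only, not the logic of summaryComparison.py: it assumes placeholders have the form `templateValue[i][j]` (the form shown earlier) and that `i` indexes a column and `j` a row. The name `fill_template` is hypothetical.

```python
import re
from typing import List

# Matches placeholders such as templateValue[0][1].
PLACEHOLDER = re.compile(r"templateValue\[(\d+)\]\[(\d+)\]")


def fill_template(summary: str, columns: List[List[str]]) -> str:
    """Replace each templateValue[i][j] with columns[i][j]."""
    def lookup(match: "re.Match[str]") -> str:
        col, row = int(match.group(1)), int(match.group(2))
        try:
            return columns[col][row]
        except IndexError:
            # Leave placeholders untouched when the model predicts an
            # index that does not exist in this chart's table.
            return match.group(0)

    return PLACEHOLDER.sub(lookup, summary)


columns = [["2018", "2019"], ["10", "12"]]
filled = fill_template("In templateValue[0][0] revenue was templateValue[1][0]", columns)
print(filled)  # In 2018 revenue was 10
```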
cd studyOutcome/
python automatedEvaluation.py
The BLEU evaluation script can be obtained from Moses:
perl model/src/evaluation/multi-bleu.perl data/test/testOriginalSummary.txt < results/aug17/generated-p80.txt
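For readers who want to see what a corpus-level BLEU score measures, here is a simplified Python sketch: clipped n-gram precision up to 4-grams, combined by geometric mean and scaled by a brevity penalty, with one reference per line and whitespace tokenisation. It is meant as an explanation, not a replacement for multi-bleu.perl, and its scores should not be assumed to match the Moses script exactly. The function names are illustrative.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus BLEU on a 0-100 scale, single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no overlap at all
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


print(corpus_bleu(["the chart shows a steady rise"], ["the chart shows a steady rise"]))  # 100.0
```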