This project studies adversarial attacks on an authorship identification model in various ways.
Introduction slides can be accessed here.
An adversarial attack aims to confuse a classifier by modifying a small part of the input data, so that the modification is barely noticeable to a human but successfully tricks the classifier. In our work, we test how well a selected, existing Natural Language Processing (NLP) model tolerates adversarial attacks; the model we chose is an authorship identification classifier.
We implemented five different attacking approaches. The attack can be separated into two phases:
- Phase 1: Confusion. In phase one, our goal is to confuse the classifier, so each modified article should be identified as any other author's creation except the original author's.
- Phase 2: Assignment. In phase two, our goal is to make every article be identified as a specific author's work.
The baseline (opponent) model can be found in this GitHub repo.
In the main branch, our most successful approaches are stored in the folders `phase 1` and `phase 2` for the two phases respectively. They use the following methods:
- Phase 1: choose candidates by POS tagging with the help of the `nltk` library, and replace them with misspellings.
- Phase 2: use an enhanced TF-IDF to filter out candidates, and use a Genetic Algorithm (GA) to optimize the result.
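As a rough illustration of the phase 1 idea, the sketch below replaces pre-selected candidate words with misspelled variants. This is a minimal stdlib-only sketch: in the actual pipeline the candidates come from `nltk` POS tagging, whereas here the candidate set is passed in directly and the letter-swap rule in `misspell` is a hypothetical stand-in for the repo's misspelling rules.

```python
def misspell(word: str) -> str:
    """Introduce a small misspelling by swapping two adjacent letters
    near the middle of the word (a hypothetical stand-in for the
    repo's actual misspelling rules)."""
    if len(word) < 4:
        return word
    i = len(word) // 2
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def attack_phase1(text: str, candidates: set) -> str:
    """Replace candidate words (chosen via POS tagging in the actual
    pipeline) with misspelled variants, leaving other words intact."""
    return " ".join(
        misspell(tok) if tok.lower() in candidates else tok
        for tok in text.split()
    )
```

For example, `attack_phase1("I believe you", {"believe"})` yields `"I beleive you"`: a change that a human reader glosses over but that alters the token the classifier sees.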
The other three branches represent three different attempts, with the following naming rule:
For the attacking models in the main branch, we built those models on Kaggle. Information about our Kaggle environment, along with some brief guides, is at https://www.kaggle.com/code/sheridanm551/how-to-use-lstm-classifier-model, and its dataset is at https://www.kaggle.com/datasets/sheridanm551/fast-using-lstm-model-steps.
Using the environment and datasets we've built on Kaggle, both models should be able to run by executing the `.ipynb` file.
As for the other branches, the files in those branches should be able to build the model under a Python environment. Details about each approach are listed below.
- The two files, `train.csv` and `test.csv`, come from the model that is our attack target.
- From the TF-IDF part, we can get the important words for the 20th author, who has the most important words. `import_words_20.csv` contains the important words for the 20th author.
- The file `model.pt` is the model we want to attack.
- `change_tfidf()` can be used to replace words in a text with the words in `import_words_20.csv`.
- `model_evaluate()` can be used to evaluate the model we attack; that is, we use `model_evaluate()` to find out which author is identified for a given text. The input of this function is a dataframe, like the following image.
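A minimal sketch of what a `change_tfidf()`-style substitution could look like. This is a hypothetical stand-in, not the repo's implementation: the important words are passed in as a list rather than read from `import_words_20.csv`, and the replaceable set stands in for the candidates chosen by the TF-IDF filter.

```python
from itertools import cycle

def change_tfidf(text: str, important_words: list, replaceable: set) -> str:
    """Swap each replaceable word for one of the target author's
    high-TF-IDF words, cycling through the list in order."""
    pool = cycle(important_words)
    return " ".join(
        next(pool) if tok.lower() in replaceable else tok
        for tok in text.split()
    )
```

For example, `change_tfidf("the cat sat", ["moon"], {"cat"})` returns `"the moon sat"`.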
- `similarity_cal()` can be used to calculate the similarity between two given texts.
- `generate_initial()` can be used to generate at most 5 modified texts from 1 original text. These modified texts can serve as the initial generation of the genetic algorithm.
- Generating the initial generation for the 500 texts is costly, so we store the initial generation in a CSV file.
- `fitness()` provides the fitness of a text.
- `crossover()` generates a child text from two parent texts.
- `mutation()` generates a mutated text from a given text.
- `selection()` selects the top n texts from a list of given texts.
- Use the above 4 functions and `generate_initial()` to build the genetic algorithm.
- Use `similarity_cal()` and `model_evaluate()` to get the similarity and the identified authors of the final generation of the genetic algorithm.
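The loop those functions form can be sketched as below. Everything here is a toy stand-in under stated assumptions: the repo's `fitness()` queries the attacked classifier and its operators work on word sequences, whereas this sketch evolves plain strings toward a character-overlap score, purely to show how the four operators compose.

```python
import random

def fitness(text, target="moonlight"):
    # Toy fitness: character overlap with a fixed string, a stand-in
    # for classifier confidence toward the target author.
    return len(set(text) & set(target))

def crossover(a, b):
    # Child takes the first half of one parent, rest of the other.
    cut = len(a) // 2
    return a[:cut] + b[cut:]

def mutation(text, rng):
    # Replace one random character.
    i = rng.randrange(len(text))
    return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz ") + text[i + 1:]

def selection(population, n):
    # Keep the n fittest individuals.
    return sorted(population, key=fitness, reverse=True)[:n]

def genetic_algorithm(initial, generations=20, keep=2, seed=0):
    rng = random.Random(seed)
    pop = list(initial)
    for _ in range(generations):
        parents = selection(pop, keep)
        children = [
            mutation(crossover(rng.choice(parents), rng.choice(parents)), rng)
            for _ in range(len(pop) - keep)
        ]
        pop = parents + children  # elitism: best parents always survive
    return selection(pop, 1)[0]
```

Because the top parents are carried over unchanged each generation, the best fitness in the population never decreases.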
- `compute_idf` calculates the original (normal) IDF.
- `compute_idf_for_author` calculates the IDF_d in the improved-IDF part.
- `compute_author_frequent_word_level` calculates the IDF_c in the improved-IDF part.
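For the normal IDF, a minimal sketch of the standard formula `idf(w) = log(N / df(w))` is shown below. This is an illustration of the textbook definition only; the improved IDF_d and IDF_c variants are not reproduced here, since their exact formulas are specific to this branch's notebook.

```python
import math
from collections import Counter

def compute_idf(docs):
    """Standard IDF over tokenized documents:
    idf(w) = log(N / df(w)), where df(w) is the number of
    documents containing w and N is the number of documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    return {w: math.log(n / df[w]) for w in df}
```

A word appearing in every document gets IDF 0, so only author-distinctive words receive weight.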
- The baseline model is stored in `opponent_model/baseline.pt`.
- Some datasets specific to this branch are stored in the `used_dataset` folder.
- The preprocessing part can be skipped. Start from the cell before "Utilizing Datafields" to use our provided dataset directly.
- `AuthorClassifier()` defines the baseline model; we use it to retrieve the parameters of the baseline model.
- `train_classifier()` and `evaluate_classifier()` are for training and inference with the model, respectively.
- `create_replacement()` is the function that finds candidates to be replaced. In this branch, we use gradients in the model to select candidates. For other approaches, please check out the other branches.
- After `attack_iterator` is generated, `evaluate_classifier()` can be executed to get the heatmap of the classification result.
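The gradient-based candidate selection in `create_replacement()` depends on the PyTorch model, so it is not reproduced here. As a self-contained illustration of the underlying idea, the sketch below ranks token positions by a deletion-based saliency score against an arbitrary scoring function; both the function name and the scoring callback are hypothetical stand-ins for the gradient-magnitude ranking the branch computes with the model.

```python
def rank_candidates(tokens, score, k=3):
    """Rank token positions by how much removing each token changes
    the score -- a stdlib stand-in for gradient-based saliency.
    `score` maps a token list to a number (e.g. a class confidence)."""
    base = score(tokens)
    saliency = [
        (abs(base - score(tokens[:i] + tokens[i + 1:])), i)
        for i in range(len(tokens))
    ]
    # Highest-impact positions first; return the top k indices.
    return [i for _, i in sorted(saliency, reverse=True)[:k]]
```

The positions returned are the replacement candidates: the tokens the classifier leans on most, and therefore the ones most worth perturbing.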
- The baseline model is stored in `opponent_model/baseline.pt`.
- Some datasets specific to this branch are stored in the `used_dataset` folder. Some shared datasets are located under the same folder in the `grad/Glove` branch.
- `BERT.ipynb` is for fine-tuning the BERT model for Masked Language Modeling (MLM).
- After the BERT model is generated, `Articel_level_grad_BERT.ipynb` can be executed.
- The preprocessing part can be skipped. Start from the cell before "Utilizing Datafields" to use our provided dataset directly.
- `evaluate_classifier()` is used to calculate the attacking result on the original classifier. This generates the final heatmap.
- The baseline model is stored in `opponent_model/baseline.pt`.
- Some datasets specific to this branch are stored in the `used_dataset` folder. Some shared datasets are located under the same folder in the `grad/Glove` branch.
- For `TF-IDF.ipynb`, the guidelines mentioned above can be referenced.
- The preprocessing part can be skipped. Start from the cell before "Utilizing Datafields" to use our provided dataset directly.
- `evaluate_classifier()` is used to calculate the attacking result on the original classifier. This generates the final heatmap.
All attacking results are included in each branch for each approach. The result should be a heatmap that looks like the picture below:
This is the graphical attacking result for the phase 1 attack. Check out the files in each folder for more detailed information.
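The data behind such a heatmap is just a confusion matrix over (true author, predicted author) pairs. A minimal sketch of the tally (the function name is illustrative, not from the repo):

```python
from collections import Counter

def confusion_counts(true_authors, predicted_authors):
    """Count (true, predicted) author pairs. For a phase 1 attack,
    success shows up as mass moving off the diagonal; for phase 2,
    as mass concentrating in the target author's column."""
    return Counter(zip(true_authors, predicted_authors))
```

Plotting these counts as a grid (e.g. with `matplotlib`'s `imshow`) gives the heatmaps stored in each branch.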
Thanks to our co-authors W.-P. Lin, T.-Y. Liu, Z.-W. Hong, H.-L. You, and T.-Y. Hsieh.
