Language analysis of the Brexit debate using Terraform and AWS EC2 instances with Spark.
The Spark scripts count the number of occurrences of a token in the articles of the various newspapers, normalizing the value by the total number of tokens. In addition, two models are trained with logistic regression: one to distinguish whether an article is for or against Brexit, the other to distinguish whether an article talks about Brexit or not.
All the data we used was obtained with Python scripts in the brexit-news repository, which provides a JSON file with a list of articles for each newspaper.
- Install
- Usage
- Development
- License
- Contacts
- Developers
You must have the following packages installed on the system:
Then clone the repo:
```
git clone https://github.com/cedoor/brexit-lang.git
cd brexit-lang
```

Inside the main directory (brexit-lang):
- Run `aws configure` to save your credentials in the local `~/.aws/credentials` file. The command interactively asks for some parameters; it is important to set the following: AWS Access Key ID and AWS Secret Access Key. If you have an AWS Educate student account, you should find your credentials on the Vocareum page by clicking on Account Details.
- Set your AWS parameters in the `terraform.tfvars` file within the `brexit-lang` folder:
  - `vpc_security_group_id` (must be changed): your AWS security group ID, contained in the security group section of the EC2 service page;
  - `ec2_instance_count`: the number of cluster nodes (default: `2`);
  - `ec2_instance_type`: the type of the node instances (default: `t2.small`).
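For example, a minimal `terraform.tfvars` could look like the following (the security group ID below is a placeholder; use your own):

```hcl
vpc_security_group_id = "sg-0123456789abcdef0"
ec2_instance_count    = 2
ec2_instance_type     = "t2.small"
```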
The security group must contain the right inbound rules to enable user access with ssh. For example:
| Type | Protocol | Port range | Source |
|---|---|---|---|
| All traffic | All | All | 0.0.0.0/0 * |
\* Pay attention to security: this is just an example!
For the code to run correctly, the other parameters must not be changed. For instance, do not change the following:

- `region`: the AWS region (value: `us-east-1`);
- `ec2_ami`: the Amazon Machine Image (value: `ami-07ebfd5b3428b6f4d`, Ubuntu Server 18.04 LTS);
- `key_name`: the name of the identity pem file (value: `amazon.pem`).
To create EC2 instances run the following commands:
```
terraform init
terraform apply
```

When the `terraform apply` command ends and prints the cluster instance information, set the DNS names of the created cluster instances in the `EC2_HOSTS` environment variable as explained below.
It is now necessary to create a file called `.env` and set the following environment variables as described below:
```
EC2_HOSTS="ec2-0-0-0-0.compute-1.amazonaws.com ec2-0-0-0-1.compute-1.amazonaws.com"
IDENTITY_FILE_PATH="/home/pippo/.ssh/amazon.pem"
DATA_PATH="/home/pippo/Projects/BrexitLang/brexit-news/data"
LEAVER_NEWSPAPER_FILES="daily_star.json the_telegraph.json the_sun.json"
REMAIN_NEWSPAPER_FILES="indipendent.json the_guardian.json daily_mirror.json"
NEUTRAL_NEWSPAPER_FILE="the_new_york_times.json"
KEY_TOKENS="but although seem appear suggest suppose think sometimes often usually likelihood assumption possibility likely unlikely conceivable conceivably probable probably roughly sort they could would we"
```
where:

- `EC2_HOSTS` is a list of AWS EC2 host URLs (the cluster node URLs) obtained with the `terraform apply` command. The first one must be the master. You can also find the created instances on the AWS page;
- `IDENTITY_FILE_PATH` is the path of the AWS pem file. You can create it in the key pairs section of the AWS EC2 page. It is important to call this file `amazon.pem` and to run `chmod 600 amazon.pem` to give it the right permissions;
- `DATA_PATH` is the directory path of the JSON data with the newspaper articles (it should only contain the files listed below);
- `LEAVER_NEWSPAPER_FILES` is a list of JSON data files of leaver newspapers;
- `REMAIN_NEWSPAPER_FILES` is a list of JSON data files of remain newspapers;
- `NEUTRAL_NEWSPAPER_FILE` is the JSON data file of a neutral newspaper (one that does not mention Brexit);
- `KEY_TOKENS` is a list of words (tokens) to analyze.
All the data files have to be in the following format:

```
{"title": "article title", "url": "article url", "timestamp": 1540252800000, "content": "article body"}
{"title": "article title", "url": "article url", "timestamp": 1540228613000, "content": "article body"}
{"title": "article title", "url": "article url", "timestamp": 1522188456900, "content": "article body"}
```

After the creation of the instances, check again that the paths in the `.env` file are correct and set up all the nodes with the following command:
```
bash scripts/setup_instances.sh .env
```

This command installs Java, the Python dependencies, Hadoop and Spark. It then sets up the cluster and uploads the data to HDFS. At this point you can run the analysis or classification script.
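Since each data file is newline-delimited JSON, a quick sanity check before uploading can be done with a short script like this (a minimal sketch; `validate_articles` is an illustrative helper, and the field names are the ones shown in the format above):

```python
import json

# Fields that every article record is expected to contain.
REQUIRED_FIELDS = {"title", "url", "timestamp", "content"}

def validate_articles(lines):
    """Parse newline-delimited JSON articles and check the required fields."""
    articles = []
    for n, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {n}: missing fields {missing}")
        articles.append(record)
    return articles

sample = [
    '{"title": "t", "url": "u", "timestamp": 1540252800000, "content": "c"}',
]
print(len(validate_articles(sample)))  # number of valid records
```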
The analysis script takes the `KEY_TOKENS` values and all the newspapers, and counts the number of occurrences of each token in the articles of each newspaper, normalizing the value by the total number of tokens of the newspaper.
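The counting logic can be sketched in plain Python as follows (a simplified illustration only; the actual script runs distributed on Spark, and the function name and tokenizer here are assumptions):

```python
import re

def normalized_counts(articles, key_tokens):
    """Count occurrences of each key token across a newspaper's articles,
    normalized by the total number of tokens in that newspaper."""
    tokens = []
    for article in articles:
        # Naive lowercase word tokenizer, for illustration only.
        tokens.extend(re.findall(r"[a-z]+", article["content"].lower()))
    total = len(tokens) or 1
    return {t: tokens.count(t) / total for t in key_tokens}

articles = [{"content": "They think Brexit could happen, but it seems unlikely."}]
print(normalized_counts(articles, ["but", "think", "unlikely"]))
```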
You can run the script with the following command:
```
bash scripts/start_analysis.sh .env
```

The analysis results will be saved in the local `~/Downloads` folder as a JSON file called `analysis_results.json`.
As of April 2020, the results show no significant differences in the number of tokens found in each newspaper. These results are therefore not conclusive for determining whether an article is for or against Brexit.
The execution times of the algorithm are indicated at the end of the execution.
In the classification script, two models are trained with logistic regression: one to distinguish whether an article is for or against Brexit, the other to distinguish whether an article talks about Brexit or not. The first model is trained on all the Brexit newspapers except the last one, which is used to create an additional separate test set. The second model uses all the Brexit newspapers except the last one, plus the neutral newspaper. The script saves the accuracies of the two models in the results file.
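The idea behind the first model can be sketched with a tiny bag-of-words logistic regression trained by gradient descent (an illustrative stand-in only; the actual script uses Spark, and the vocabulary, texts and hyperparameters below are made up):

```python
import math
import re

def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(w) for w in vocab]

def train_logreg(X, y, lr=0.5, epochs=200):
    """Plain per-sample gradient descent for logistic regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))       # sigmoid
            err = p - yi                     # gradient of log loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

vocab = ["leave", "remain"]
texts = ["we must leave now", "leave leave leave",
         "better to remain", "remain in the eu"]
labels = [1, 1, 0, 0]  # 1 = leaver article, 0 = remain article
X = [featurize(t, vocab) for t in texts]
w, b = train_logreg(X, labels)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, labels)) / len(labels)
print("training accuracy:", accuracy)
```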
You can run the script with the following command:
```
bash scripts/start_classification.sh .env
```

The classification results will be saved in the local `~/Downloads` folder as a JSON file called `classification_results.json`.
As of April 2020, the estimated accuracy shows that with these data it is not possible to effectively classify articles taken from newspapers other than the ones used by the test set.
The execution times of the algorithm are indicated at the end of the execution.
Finally, to destroy the instances on AWS run the following command:

```
terraform destroy
```

At this point, if you want, you can run another analysis with a new cluster.
Unfortunately there is still no Terraform command to stop or start instances. However, it is possible to use the aws-cli with the following commands:

```
aws ec2 stop-instances --region us-east-1 --instance-ids <ids>
aws ec2 start-instances --region us-east-1 --instance-ids <ids>
```
Use this commit message format (Angular style):

```
[<type>] <subject>
<BLANK LINE>
<body>
```

where `type` must be one of the following:

- feat: A new feature
- fix: A bug fix
- docs: Documentation only changes
- style: Changes that do not affect the meaning of the code
- refactor: A code change that neither fixes a bug nor adds a feature
- test: Adding missing or correcting existing tests
- chore: Changes to the build process or auxiliary tools and libraries such as documentation generation
- update: Update of the library version or of the dependencies
and the body should include the motivation for the change and contrast it with the previous behavior (do not add a body if the commit is trivial).
- Use the imperative, present tense: "change" not "changed" nor "changes".
- Don't capitalize first letter.
- No dot (.) at the end.
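For example, a commit message following these rules might look like this (the subject and body are made up):

```
[docs] add cluster setup instructions

explain how to configure terraform.tfvars and the .env file
```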
- There is a master branch, used only for release.
- There is a dev branch, used to merge all the sub dev branches.
- Avoid long descriptive names for long-lived branches.
- No CamelCase.
- Use grouping tokens (words) at the beginning of your branch names (in a similar way to the `type` of a commit).
- Define and use short lead tokens to differentiate branches in a way that is meaningful to your workflow.
- Use slashes to separate parts of your branch names.
- Remove branch after merge if it is not important.
Examples:
```
git checkout -b docs/README
git checkout -b test/one-function
git checkout -b feat/side-bar
git checkout -b style/header
```
- See LICENSE file.
- E-mail : me@cedoor.dev
- Github : @cedoor
- Website : https://cedoor.dev
- E-mail : e.ipodda@gmail.com
- Github : @epilurzu