TwitterTrendData Repository

A PySpark- and NLP-powered project that analyzes historical Twitter data to find trending topics, ranked by word frequency and by the slope of that frequency over time.

Requirements

Python 3.9
MongoDB
PySpark
PyMongo
spaCy
Docker
AWS S3

Overview

This repository contains an ETL pipeline designed for Twitter trend analysis. The pipeline consists of three main components:

1. Extract to MongoDB:

The script extracts data from a data source file (twitter-sample.json or twitter-sample-2.json) and loads it into a MongoDB collection.

Usage:

```
python 1_extract_to_mongo.py <filename> [-verbose]
```
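
As a rough illustration of the extract step, the sketch below reads tweets and inserts them in batches. It is not the code in 1_extract_to_mongo.py. It assumes the sample files hold one JSON object per line, and the names `read_tweets`, `load_tweets`, and `batch_size` are hypothetical. The `collection` argument can be any object with an `insert_many` method, such as a PyMongo collection.

```python
import json
from typing import Any, Iterable, Iterator


def read_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed tweet per non-empty line, skipping malformed JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: bad lines are skipped rather than aborting the load


def load_tweets(tweets: Iterable[dict], collection: Any, batch_size: int = 1000) -> int:
    """Insert tweets in batches via collection.insert_many; return how many were sent."""
    batch, total = [], 0
    for tweet in tweets:
        batch.append(tweet)
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With PyMongo, `collection` could be obtained as `MongoClient(uri)[database][collection_name]`, and the lines could come from an open file handle, for example `load_tweets(read_tweets(open("twitter-sample.json")), collection)`. Batching keeps memory use bounded for large input files.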

2. Calculate Trend with PySpark:

This script uses PySpark for Twitter trend analysis. It applies NLP to extract nouns from tweet text, calculates trending topics over a specified time window, and writes the results to a CSV file.

Usage:

```
python 2_calculate_trend.py <database> <collection> [-verbose]
```
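
The README does not say exactly how frequency and slope are combined, so the following is only a sketch of one plausible reading, written in plain Python rather than PySpark to keep it short. It assumes noun extraction has already happened (for example with spaCy) and that each record is a `(window_index, nouns)` pair. The least-squares slope and the names `count_by_window`, `slope`, and `top_trends` are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def count_by_window(records: Iterable[Tuple[int, List[str]]]) -> Dict[str, Counter]:
    """Map each word to a Counter of {window_index: occurrences}."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for window, words in records:
        for word in words:
            counts[word.lower()][window] += 1
    return counts


def slope(window_counts: Counter, windows: List[int]) -> float:
    """Least-squares slope of a word's count across the given windows."""
    n = len(windows)
    if n < 2:
        return 0.0
    ys = [window_counts.get(w, 0) for w in windows]  # missing windows count as 0
    mean_x = sum(windows) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in windows)
    if var_x == 0:
        return 0.0
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(windows, ys))
    return cov_xy / var_x


def top_trends(records: Iterable[Tuple[int, List[str]]],
               windows: List[int], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k words whose frequency rises fastest over the windows."""
    counts = count_by_window(records)
    scored = [(word, slope(c, windows)) for word, c in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A positive slope means the word is mentioned more often in later windows, a slope near zero means steady usage, and a negative slope means interest is fading. In a PySpark job the same per-word, per-window counts would typically come from a grouped aggregation before the slope is computed.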

3. Load to S3:

The script uploads CSV files (output from trend analysis) to an AWS S3 bucket for storage and further analysis.

Usage:
```
python 3_load_to_S3.py
```
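
For orientation, the upload step might look roughly like the sketch below. It is not the contents of 3_load_to_S3.py. The function name `upload_csv_files`, the `prefix` parameter, and the bucket name are hypothetical. The `client` argument is expected to behave like a boto3 S3 client, whose `upload_file(Filename, Bucket, Key)` method is the only call used.

```python
from pathlib import Path
from typing import Any, List


def upload_csv_files(client: Any, output_dir: str, bucket: str, prefix: str = "") -> List[str]:
    """Upload every *.csv file under output_dir and return the S3 keys written."""
    keys = []
    root = Path(output_dir)
    # rglob also picks up part files that Spark writes inside an output directory
    for path in sorted(root.rglob("*.csv")):
        key = prefix + path.relative_to(root).as_posix()
        client.upload_file(str(path), bucket, key)  # (Filename, Bucket, Key)
        keys.append(key)
    return keys
```

With boto3 installed and AWS credentials configured, a call could look like `upload_csv_files(boto3.client("s3"), "twitter_trend_output", "my-bucket", prefix="trends/")`, where the bucket and prefix are placeholders. Passing the client in as an argument makes the function easy to exercise with a stub in tests.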

Docker

Each script has its own Dockerfile (Dockerfile.1_extract_to_mongo, Dockerfile.2_calculate_trend, Dockerfile.3_load_to_S3) so it can be built and deployed as a separate container. The docker-compose.yml file orchestrates the containers for all three scripts.

Configuration

Configuration details such as MongoDB server information are stored in the config.py and config_hidden.ini files.
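
The layout of config_hidden.ini is not documented here, so the sketch below only shows one way such a file could be read with the standard library. The `[mongodb]` section, the `host`, `port`, and `database` keys, their fallback values, and the function name `load_mongo_settings` are all assumptions, not the repository's actual schema.

```python
import configparser
from typing import Dict


def load_mongo_settings(ini_path: str, section: str = "mongodb") -> Dict[str, object]:
    """Read MongoDB connection settings from an INI file (hypothetical layout)."""
    parser = configparser.ConfigParser()
    if not parser.read(ini_path):  # read() returns the list of files it could parse
        raise FileNotFoundError(f"config file not found: {ini_path}")
    if not parser.has_section(section):
        raise KeyError(f"missing [{section}] section in {ini_path}")
    return {
        "host": parser.get(section, "host", fallback="localhost"),
        "port": parser.getint(section, "port", fallback=27017),
        "database": parser.get(section, "database", fallback="twitter"),
    }
```

Keeping secrets such as hostnames and credentials in an INI file that is excluded from version control, and reading them through one helper, means the three scripts do not need to hard-code connection details.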
