SparkProject

RBAD Course Project: What's Hot & What's New - Discover the Trends of Programming Languages & Front-End Technologies

Created by Dayou Du, Hao Chen

A live demo can be found here

Prerequisites

  • Spark-2.2.0

  • Wget

  • Gzip

  • sbt (on DUMBO, simply module load sbt)

MONGODB IS REQUIRED IF CONNECTING THE SPARK AND UI MODULES

  • The Spark program feeds the UI module through MongoDB.

  • Although the transferred data is quite small (a few MB), MongoDB is an elegant and automatic way to pass the data, instead of passing raw text files around.

  • So, if you are connecting Spark + UI, please set up a MongoDB instance and set the corresponding configuration (details are described in Usage below).

  • It is also fine to run the Spark part ONLY. You can specify the HDFS directory for the outputs (details are described below). If the corresponding configurations are not set, the program will write its results into a TestOut folder on HDFS.
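
For reference, here is a minimal sketch of how the output sink could be selected from the submitted configuration. The property names spark.MONGO_URI and spark.HDFS_OUT and the default TestOut path come from this README; the collection name, the result schema, and the mongo-spark connector call are illustrative assumptions, not the project's exact code.

// Sketch only: pick the output sink from the --conf values described below.
import org.apache.spark.sql.{DataFrame, SparkSession}

object OutputSink {
  def write(spark: SparkSession, scores: DataFrame): Unit = {
    val mongoUri = spark.sparkContext.getConf.getOption("spark.MONGO_URI")
    val hdfsOut  = spark.sparkContext.getConf.getOption("spark.HDFS_OUT")

    (mongoUri, hdfsOut) match {
      case (Some(uri), _) =>
        // Feed the UI module through MongoDB (assumes the mongo-spark
        // connector is on the classpath; the collection name is illustrative).
        scores.write
          .format("com.mongodb.spark.sql.DefaultSource")
          .option("uri", uri)
          .option("collection", "scores")
          .mode("overwrite")
          .save()
      case (None, Some(dir)) =>
        // Spark-only run: write the results to the requested HDFS directory.
        scores.write.mode("overwrite").json(dir)
      case _ =>
        // Neither config set: fall back to the default TestOut folder.
        scores.write.mode("overwrite").json("hdfs:///user/dd2645/SparkProject/TestOut")
    }
  }
}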

Usage

The project consists of three parts:

1. Data collection: ${PROJECTROOT}/dataCollect

This part contains the scripts to collect raw data.

These scripts will download the raw data and put them onto HDFS.

Warning

The raw data is extremely LARGE; downloading and copying to HDFS takes about 48 hours in total.

How to run?

Simply run these scripts in a Unix-like shell.

2. StackOverflow and GitHub analysis: ${PROJECTROOT}/src/main/scala

This part contains the analysis code for the StackOverflow and GitHub data.

StackOverflow Part

  1. scoreStackOverflow.scala:

    Reads the StackOverflow data, extracts the tags, filters them against the language list and technology list, then aggregates the counts and converts them into scores.
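
As a rough illustration of that pipeline (not the project's actual code), here is a minimal Spark sketch, assuming the posts have already been parsed into a DataFrame with a Tags column in the usual <tag1><tag2> form and that the language list comes from Common.scala; the column names and the simple count-based score are assumptions.

// Sketch only: extract tags, filter by the language list, aggregate into a score.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

def scoreStackOverflow(spark: SparkSession, posts: DataFrame, languages: Seq[String]): DataFrame = {
  import spark.implicits._
  posts
    .select(explode(split(regexp_replace($"Tags", "[<>]", " "), "\\s+")).as("tag"))
    .filter(length($"tag") > 0)                      // drop empty strings left over from splitting
    .filter($"tag".isin(languages: _*))              // keep only tags on the language list
    .groupBy("tag")
    .count()
    .withColumn("score", $"count".cast("double"))    // convert raw counts into a score
}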

GitHub Part

  1. loadGitHubData.scala

    Reads the GitHub Events Timeline data, extracts the fields we are interested in, and filters using the event-type list.

    Reads the GitHub repo-language dataset.

  2. parseRepoLang.scala

    Selects the major programming language for each repo and filters out the dummy repos.

  3. parseEvents.scala

    For each event type, tags the events using the repo-language dataset, then aggregates and converts the counts into scores (see the sketch after this list).

  4. scoreGitHub.scala

    Entry point for the GitHub part; calls the functions above.
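
A rough sketch of the tagging and aggregation idea behind parseEvents.scala, assuming the cleaned events carry repo and type columns and that the repo-language dataset provides one major language per repo; the column names and the per-event-type weights are illustrative assumptions, not the project's real values.

// Sketch only: tag each event with its repo's major language, then aggregate to a score.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

def scoreEvents(spark: SparkSession, events: DataFrame, repoLang: DataFrame): DataFrame = {
  import spark.implicits._
  // Example weights for a few event types; the real list lives in the project code.
  val weights = Map("WatchEvent" -> 1.0, "ForkEvent" -> 2.0, "PushEvent" -> 0.5)
  val weightExpr = weights.foldLeft(lit(0.0)) {
    case (expr, (eventType, w)) => when($"type" === eventType, lit(w)).otherwise(expr)
  }
  events
    .join(repoLang, Seq("repo"))                 // tag each event with its repo's major language
    .withColumn("weight", weightExpr)
    .groupBy("language")
    .agg(sum($"weight").as("score"))             // aggregate weighted events into a score
}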

Common Code

  1. Common.scala

    Provides the common helper functions to score a language/technology.

    Provides the language list, technology list, and the corresponding tag lists.

  2. SparkApp.scala

    Entry point of the whole program. Gets the score lists from GitHub and StackOverflow, calculates the combined score, and then writes it to disk/MongoDB.
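
A minimal sketch of that combination step, assuming each part produces a per-language score DataFrame with the column names shown below and an illustrative 50/50 weighting; the real weights and column names are defined in the project code.

// Sketch only: merge the GitHub and StackOverflow score lists into a combined score.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def combineScores(githubScores: DataFrame, stackOverflowScores: DataFrame): DataFrame = {
  // Assumes columns (language, ghScore) and (language, soScore) respectively.
  githubScores
    .join(stackOverflowScores, Seq("language"), "outer")
    .na.fill(0.0, Seq("ghScore", "soScore"))                  // missing on one side counts as 0
    .withColumn("combinedScore", col("ghScore") * 0.5 + col("soScore") * 0.5)
    .select("language", "ghScore", "soScore", "combinedScore")
}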

(For Graders) Where are the inputs?

  1. GitHub Events (Cleaned):

     hdfs:///user/dd2645/SparkProject/CleanedEvents/*
    

Note 1: To produce this cleaned dataset from the raw data, check and uncomment the corresponding code in scoreGitHub.scala. The program will then run from the raw data, located at:

hdfs:///user/dd2645/github_raw/after2015/*

Note 2 (if you choose to run from the raw data): It is strongly suggested that you first save the cleaned data and then run the remaining parts. The cleaning procedure reads and parses 3.3 TB of JSON files and takes a REALLY LONG TIME - about 20 hours.

  2. GitHub repo-language dataset:

     hdfs:///user/dd2645/github_repo_language/github.json
    
  3. StackOverflow Posts:

     hdfs:///user/hc2416/FinalProject/Posts.xml
    

How to run?

  1. Check that the package versions in ${PROJECTROOT}/build.sbt are correct (should be fine on DUMBO).

  2. Check that sbt.version in ${PROJECTROOT}/project/build.properties is correct (should be fine on DUMBO).

  3. Go to ${PROJECTROOT} and make sure you have sbt available (module load sbt on DUMBO).

  4. Type sbt assembly, which will generate a .jar package under ${PROJECTROOT}/target/scala-2.11/.

  5. Submit the job:

If you would like to connect to our UI module, please set spark.MONGO_URI, for example:

spark2-submit --conf "spark.MONGO_URI=mongodb://{username}:{passwd}@{serverIP}:{portNum}/{dbname}" \
              --conf "spark.network.timeout=1200s" \
              --conf "spark.dynamicAllocation.maxExecutors=200" \
              --conf "spark.ui.port=10101" \
              --conf "spark.executor.memory=4g" \
              --conf "spark.driver.memory=6g" \
              ./target/scala-2.11/PotatoFinalProject-assembly-1.0.jar

If you would like to run the Spark part ONLY, just omit spark.MONGO_URI. Instead, please set the spark.HDFS_OUT configuration, for example:

spark2-submit --conf "spark.HDFS_OUT=hdfs:///{your-desired-output-directory}" \
              --conf "spark.network.timeout=1200s" \
              --conf "spark.dynamicAllocation.maxExecutors=200" \
              --conf "spark.ui.port=10101" \
              --conf "spark.executor.memory=4g" \
              --conf "spark.driver.memory=6g" \
              ./target/scala-2.11/PotatoFinalProject-assembly-1.0.jar

If both configurations are unset, the program will automatically write to a default HDFS directory (to which you may not have write permission :P):

hdfs:///user/dd2645/SparkProject/TestOut

IMPORTANT NOTES

  • The program is pretty large and will run for 30~60 minutes with the commands above.

  • spark.network.timeout and spark.executor.memory are important here and should be set at least as high as in the example above. If you occupy fewer than 200 executors, you may need to increase the memory per executor correspondingly; otherwise the program will fail to cache the data.

  • spark.dynamicAllocation.maxExecutors is set because, by default, Spark would spawn as many executors as possible, and we do NOT want to occupy all the CPU resources (which would get the job killed by the admin :P).

3. Web UI part: ${PROJECTROOT}/Spark-UI

This part is developed as a git submodule.

Please refer to the README.md in that folder.
