You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Created an S3 bucket to upload "survey_results_public.csv" file so EMR can access it for data processing.
Locally issued Linux commands (Amazon Linux 2) to the EMR cluster's master node by connecting to an Elastic Compute Cloud (EC2) instance using Secure Shell (SSH) connection.
Overview
Created an EMR cluster with cluster launch mode and an initial S3 bucket was created automatically to store logs.
Software Configuration
emr-5.36.0
Spark: Spark 2.4.8 on Hadoop 2.10.1 YARN and Zeppelin 0.10.0
Hardware Configuration
m5.xlarge
Number of instances: 3
Security and access
EC2 key pair (used Amazon EC2 to create a RSA key pair)
Set-up a new S3 bucket to upload the file "survey_results_public.csv" so EMR can access it for data processing.
Inserted a new folder within the same S3 bucket called "data-source" that contains the CSV file.
Created a Spark application in a Python file called "main.py" for Spark storage job to process data.
frompyspark.sqlimportSparkSessionfrompyspark.sql.functionsimportcolS3_DATA_SOURCE_PATH='s3://stackoverflow-123456/data-source/survey_results_public.csv'S3_DATA_OUTPUT_PATH='s3://stackoverflow-123456/data-output'defmain ():
spark=SparkSession.builder.appName('StackoverflowApp').getOrCreate()
all_data=spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
print('Total number of records in the source data: %s'%all_data.count())
selected_data=all_data.where((col('Country') =='United States') & (col('WorkWeekHrs') >45))
print('The number of engineers who work more than 45 hours a week in the US is: %s'%selected_data.count())
selected_data.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH)
print('Selected data was was successfully saved to S3: %s'%S3_DATA_OUTPUT_PATH)
if__name__=='__main__':
main()
Opened port 22 to SSH into the EMR cluster using IP address and ran spark-submit command for "main.py" data processing.
Result
A new folder called "data-output" with parquet files was created in the same S3 bucket, after executing commands written in "main.py".
Language & Tools
Python
SQL
Spark (PySpark)
AWS (EMR, EC2, S3)
Bash (Amazon Linux 2)
About
Code for creating a Spark application written in Python and Big Data Processing with Spark (PySpark) and AWS (EMR)