Welcome to my #60DaysOfLearning2025 challenge!
This repository is a personal learning log where I document my daily progress as I explore and upskill in various areas of technology.
June 1, 2025
To stay consistent with learning and build hands-on skills in areas like:
- Big Data (Hadoop, HDFS, Hive, PySpark etc.)
- Data Engineering
- Linux & Shell Scripting
- SQL
- Cloud Fundamentals (AWS/GCP)
- Tools & Frameworks I encounter during my journey
Each day I will:
- Practice or learn something new
- Document the learnings in a markdown file (e.g., Day01.md)
- Push relevant screenshots, notes, or code snippets
- Share highlights on Twitter using hashtags like #60DaysOfLearning2025
I will update this section weekly with my learning summary.
| Day | Topic | Summary |
|---|---|---|
| 01 | Setup Hadoop on VM | Installed Hadoop & Java 11, configured environment variables |
| 02 | HDFS Basics | Ran basic HDFS commands, understood roles of NameNode and DataNode |
| 03 | Hadoop Architecture | Studied YARN, HDFS structure and node responsibilities |
| 04 | Practicing HDFS | Created/uploaded/viewed files in HDFS, practiced put, cat, ls |
| 05 | Practicing HDFS | Created & viewed sample.txt with HDFS cmds: put, ls, mkdir, get, cat, tail |
| 06 | Read an article | A Comparative Evaluation of Apache Hadoop and Apache Spark |
| 07 | Read multiple articles | On Big Data concepts and analytics |
| 08 | Watched an HDFS tutorial on YouTube | To strengthen my knowledge of HDFS before jumping into MapReduce |
| 09 | Learned about MapReduce | Read about MapReduce from YouTube tutorials and implemented several of its use cases in my VM terminal |
| 10 | Read about Hive | Read about Hive, its architecture, and how it works |
| 11 | Installed Hive | Installed Apache Hive on my VM today as I dive deeper into the Big Data ecosystem |
| 12 | NameNode not working | Tried to fix a NameNode error but couldn't solve it today |
| 13 | Read an article | Read an article on a case study of Hive using Hadoop |
| 14 | Read SQL basics | Basic terms, plus table and column naming rules |
| 15 | Read about SQL keys | Read about foreign, unique, and primary keys in detail and made notes in Notepad |
| 16 | Read about SQL CHECK and DEFAULT constraints | Learned about the DEFAULT and CHECK constraints in SQL and made notes in Notepad |
| 17 | Practiced SQL queries | Revised and used some SQL functions I hadn't used in a while |
| 18 | Practiced SQL queries | Practiced easy-to-medium SQL queries |
| 19 | Read a research paper | Read a paper on expanding SQL learning beyond RDBMS to Big Data systems like MapReduce, NoSQL & NewSQL. Essential shift for modern data insights. |
| 20 | SQL quizzes | Practiced SQL quizzes through the Learn SQL app |
| 21 | Revised SQL functions | Revised aggregate, string, numeric, and date functions today |
| 22 | Fixed HDFS issue | Spent the day fixing an HDFS issue caused by inconsistent NameNode/DataNode directories. Solved it by reformatting NameNode and clearing stale PID files |
| 23 | Practiced queries in Hive | Created a Hive database lspp23, a sales table, inserted rows, and wrote SQL queries to view sales, calculate revenue, analyze orders by region, and filter data by conditions like product type and quantity. |
| 24 | Installed Spark and ran first queries | Installed Apache Spark 3.5.6 and ran my first RDD queries using sc.parallelize() and groupBy. Learned how Spark handles distributed data and transformations across nodes. |
| 25 | Practiced SQL queries | Practiced medium-to-hard SQL queries and revised my notes from previous days |
| 26 | Learned about PySpark | Learned about PySpark in detail from the DataCamp course 'Introduction to PySpark DataFrames' |
| 27 | Learned SQL window functions | Learned about window functions; a bit harder than other SQL topics, but an informative read |
| 28 | Hard-level SQL queries | Did some hard-level SQL questions today, focusing on advanced CASE WHEN, subqueries, and grouping logic. Feeling more confident with daily practice. |
| 29 | Data Warehouse | Studied Data Warehousing today: its need, components, types, characteristics, advantages, and disadvantages. Learned how it helps store huge data centrally for analysis and better decision-making. |
| 30 | Hard level SQL queries | Practiced hard SQL today with CASE WHEN, MOD for even-odd, JOINs, GROUP BY, HAVING, LAG window functions, ROUND, and ORDER BY custom sorting. These are levelling up my SQL skills. |
| 31 | Data Warehousing Concepts | Studied ETL, dimensional modelling, star & snowflake schemas, OLAP cubes, data marts, data integration, and governance to build strong fundamentals for data engineering. |
| 32 | Big Data Analytics | Downloaded Nepal daily climate dataset from Kaggle. Connected Spark to HDFS and YARN, explored dataset schema and columns like Temp_2m, MaxTemp_2m, MinTemp_2m, and WindSpeed_10m. Planned to upload to HDFS and perform analytics using Spark DataFrames tomorrow. |
| 33 | Big Data Project Setup | Set up PySpark environment for daily climate data analytics. Connected HDFS and YARN, created project directories in HDFS, uploaded the dataset, and practiced reading data with Spark for upcoming analysis. Faced pandas installation issues, will continue with pandas-based visualisation tomorrow. |
| 34 | Pandas Basics & Dataset Exploration | Faced pandas installation issues in PySpark, so switched to Jupyter Notebook for visualization. Loaded Nepal climate dataset, renamed columns, and explored it using pandas functions for structure, info, and initial analysis. |
| 35 | Visualised climate dataset using pandas + matplotlib + seaborn | Created histograms for temp and pressure, boxplots by district, and correlation heatmaps to analyse relationships between features. Strengthening my data analysis and EDA skills. |
| 36 | Climate Data Visualization | Performed advanced visualizations on Nepal climate data today. Converted date to datetime, extracted year, plotted average temperature trends over years, analysed district-wise temperatures post-1990, and explored wind speed distribution. Strengthening my data analysis and visualization skills with pandas, matplotlib, and seaborn. |
| 37 | Nepal Climate Data Analysis Project | Performed max temperature distribution analysis and calculated average humidity per district, identifying top and bottom humid regions for insights. |
| 38 | Nepal Climate Data Analysis Project | Implemented linear regression to predict precipitation, evaluated model (MSE: 25.6, R²: 0.29), visualized actual vs predicted values, and analyzed feature coefficients for insights. |
| 39 | Superstore Sales Data Project Analysis | Completed EDA, box plots, and styled tables on Superstore Sales dataset (2nd project) using pandas and Jupyter Notebook for insightful data visualization and analysis. |
| 40 | Superstore Sales Data Project Visualizations | Created 5 visualizations (Monthly Trend, Category Distribution, Region-Segment Sales, Day vs. Sales, Heatmap) (2nd project) using pandas and seaborn in Jupyter Notebook, analyzing sales patterns and trends. |
| 41 | Superstore Sales Data Project Visualizations | Added a bar chart of average sales by ship mode and a correlation heatmap (2nd project) using pandas and seaborn in Jupyter Notebook, exploring sales patterns and relationships. |
| 42 | Superstore Sales Data Project Visualizations | Added a horizontal bar chart of sales by sub-category, sales distribution by state, and a line plot of sales over time by category (2nd project) using pandas and matplotlib in Jupyter Notebook. |
| 43 | Superstore Sales Data Project Prediction using Linear Regression | Enhanced Superstore Sales prediction model using Order Month, Order Year, and encoded Category, Sub-Category, Region, Ship Mode, Segment. Analyzed scatter plot visualization to assess model fit. |
| 44 | Superstore Sales Data Project Prediction using Linear Regression and XGBoost | Enhanced sales prediction model with Linear Regression (MSE: 1873.5) and XGBoost (MSE: 14918.1) using new features like discount, competitor_price, and price_elasticity. Fixed randomness for stable results, visualized with a y=mx+c line, and identified two outliers. |
| 45 | Studied the 'XGBoost vs Linear Regression' blog | Studied the blog at https://medium.com/@heyamit10/xgboost-vs-linear-regression-a-practical-guide-aa09a68af12b to deepen my understanding. Learning how complexity vs. simplicity impacts predictions! |
| 46 | Medicines Information Dataset (MID) Exploration in Jupyter (Third Project) | Started third project with Kaggle's MID dataset. Spent significant time selecting dataset, cleaned Therapeutic_class_counts.xlsx by dropping empty columns (Unnamed: 3, Unnamed: 4). Plan to explore MID dataset and perform deeper analysis from tomorrow. |
| 47 | Medicines Information Dataset (MID) Exploration in Jupyter (Third Project) | Continued third project with Kaggle's MID dataset. Cleaned MID by filling missing values (ProductIntroduction, HowToUse, HowWorks, Chemical_Class, Action_Class) with 'Not Specified'. Performed EDA: counted unique values in Therapeutic_Class, Chemical_Class, Habit_Forming, Action_Class. Visualized HowToUse with WordCloud and text length histogram, and top 10 therapeutic classes by Size with bar plot. Plan to explore more visualizations tomorrow. |
| 48 | Medicines Information Dataset (MID) Exploration in Jupyter (Third Project) | Continued third project with Kaggle's MID dataset. Created SideEffect WordCloud to identify common side effects, SideEffect text length histogram to analyze description complexity, and top 10 Chemical_Class bar plot to explore chemical compositions. Skipped ProductUses WordCloud due to messy text (e.g., null, HTML artifacts) and Habit_Forming pie chart (binary Yes/No). |
| 49 | Medicines Information Dataset (MID) Exploration in Jupyter (Third Project) | Continued third project with Kaggle's MID dataset. Completed a scatter plot for Size vs Ratio from the counts dataset to analyze relationships and a bar plot for top 10 Action_Class values (excluding Not Specified) to explore drug mechanisms. |
| 50 | Medicines Information Dataset (MID) Exploration in Jupyter (Third Project) | Continued third project with Kaggle's MID dataset. Completed a line plot for cumulative Therapeutic_Class sizes to track medicine accumulation and a violin plot for SideEffect text lengths to analyze distribution density. |
| 51 | Medicines Information Dataset (MID) Exploration in Jupyter (Third Project) | Continued third project with Kaggle's MID dataset. Completed a boxen plot for HowToUse text lengths to analyze distribution, finalizing all visualizations and analysis. Transitioning to model building starting tomorrow. |
| 52 | Medicines Information Dataset (MID) Prediction in Jupyter (Third Project) | Continued third project with Kaggle's MID dataset. Conducted model performance analysis for Logistic Regression, achieving ~98.2% accuracy. Noted strong performance on major classes (e.g., ANTI INFECTIVES, CARDIAC) and weaker results on minor classes (e.g., ANTI NEOPLASTIC, OTHERS) due to limited data. Misclassifications occurred between similar classes (e.g., GASTRO INTESTINA vs GASTRO INTESTINAL) |
| 53 | Medicines Information Dataset (MID) | Created GitHub repository for the MID Therapeutic Class Prediction project. Uploaded images to Loading and analysis of data/, Visualization Images/, and Multi-Class Text Classification using Logistic Regression/ folders, documenting data exploration, charts (e.g., confusion matrix), and model training steps for the Logistic Regression model (~98.2% accuracy). |
| 54 | Created a GitHub repo for Sales Prediction Analysis | Created a GitHub repo for Sales Prediction Analysis with Linear Regression and XGBoost, and uploaded the notebook and images of data analysis, visualizations, and model training. |
| 55 | Updated GitHub repository for Nepal Climate Analysis Project | This GitHub repo is for the Nepal Climate Data Analytics Project, my first project of this challenge. I worked with HDFS, PySpark, and Jupyter Notebook for further analysis, visualization, and modeling. |
| 56 | SQL Practice | Solved a few basic SQL questions on HackerRank, learned some theory, and did some quizzes on SQLZoo. |
| 57 | Learning SQL theory | Learned about keys, normalization, and denormalization, and practiced some quizzes |
| 58 | SQL Practice with Northwind Database | Completed all easy, medium, and hard-level SQL questions using the Northwind dataset from sql-practice.com |
| 59 | Created a new GitHub repo — Data-Engineer-Interview-Preparation | To document everything I’m learning and practicing in SQL, ETL, and data modeling. |
| 60 | SQL Practice | Did medium- and hard-level SQL questions from sql-practice.com |
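As a quick illustration of the MapReduce model from Day 09, here is a minimal pure-Python sketch of a word count: map emits (word, 1) pairs, shuffle groups them by key, and reduce sums the counts. This is a conceptual sketch of the programming model only (the sample lines are made up), not Hadoop's Java or streaming API.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["hello hadoop", "hello spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'hello': 2, 'hadoop': 1, 'spark': 1}
```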
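The LAG window function from Days 27 and 30 can be tried without any server using Python's built-in sqlite3 module (SQLite 3.25+ supports window functions). The sales table and its values here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100), (2, 150), (3, 120)])

# LAG looks back one row in the given ordering, so month-over-month
# revenue change comes out of a single query
rows = conn.execute("""
    SELECT month,
           revenue,
           revenue - LAG(revenue) OVER (ORDER BY month) AS change
    FROM sales
    ORDER BY month
""").fetchall()
print(rows)  # [(1, 100, None), (2, 150, 50), (3, 120, -30)]
```

The first row's change is NULL (None in Python) because LAG has no previous row to look back to.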
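The linear-regression work on Days 38, 43, and 44 boils down to fitting y ≈ m·x + c and scoring it with R². A minimal sketch of the closed-form fit in plain Python (the four data points are made up, not the climate or Superstore data):

```python
# Ordinary least squares for y ≈ m*x + c, plus the R² score
# used to evaluate the precipitation and sales models
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 4.0, 6.2, 7.9]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form slope and intercept
m = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
    / sum((xi - x_mean) ** 2 for xi in x)
c = y_mean - m * x_mean

# R² = 1 - (residual sum of squares / total sum of squares)
pred = [m * xi + c for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
print(f"slope={m:.2f}, intercept={c:.2f}, R2={r2:.3f}")  # slope=1.96, intercept=0.15, R2=0.998
```

An R² near 1 means the line explains most of the variance; the Day 38 precipitation model's R² of 0.29 shows how much harder real data is than a toy example.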
- Ubuntu 24.04 (VMware)
- Hadoop 3.3.6
- Git & GitHub
- Jupyter Notebook
- YouTube (for video references)
Feel free to reach out or follow along:
- 🐦 Twitter: https://x.com/itspriibhatta
- 💼 LinkedIn: https://www.linkedin.com/in/priyanka-bhatta/
Let’s grow and stay consistent 🚀
#60DaysOfLearning2025 ✨