Comprehensive portfolio of data science projects completed during the start2impact Master program, showcasing skills in SQL, Python, data visualization, machine learning, and statistical analysis.
This repository contains five data science projects completed during the start2impact Data Science Master program. Each project demonstrates proficiency in a different area of data science, from foundational programming and SQL to machine learning and visualization. Four projects scored 500/500 and the final project scored 1400/1500.
The portfolio spans the complete data science workflow: data acquisition, cleaning, exploratory analysis, statistical modeling, machine learning, and insight communication. Projects address real-world problems including environmental sustainability, education analytics, travel safety, wine classification, and automated file management.
SQL - Travel Safety Analysis
Score: 500/500
PostgreSQL project analyzing U.S. Department of State data to compare travel warnings with actual American tourist fatality rates (2009-2016). Reveals significant discrepancies between perceived and actual danger.
Key Technologies: PostgreSQL, Complex Joins, Statistical Analysis
Key Findings:
- Mexico, Mali, and Israel receive the most warnings, but Thailand, Pakistan, and the Philippines have higher per capita death rates
- Strong correlation between warnings and deaths overall, but notable exceptions
- Per capita analysis provides more accurate risk assessment than absolute counts
Skills: Complex SQL joins (INNER/LEFT), aggregate functions, CASE statements, per capita calculations, COALESCE for null handling
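The per capita comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual queries: it uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs without a database server, and the table names (`visitors`, `warnings`, `deaths`), column names, and all figures are hypothetical.

```python
import sqlite3

# Hypothetical schema and made-up numbers, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visitors (country TEXT PRIMARY KEY, us_visitors INTEGER);
    CREATE TABLE warnings (country TEXT, issued_year INTEGER);
    CREATE TABLE deaths   (country TEXT, death_year INTEGER);

    INSERT INTO visitors VALUES ('A', 1000000), ('B', 50000), ('C', 200000);
    INSERT INTO warnings VALUES ('A', 2010), ('A', 2012), ('A', 2015), ('B', 2011);
    INSERT INTO deaths   VALUES ('A', 2010), ('A', 2013),
                                ('B', 2012), ('B', 2014), ('B', 2016);
""")

# LEFT JOINs keep countries with no warnings or deaths; COALESCE turns the
# resulting NULL counts into 0; CASE labels each country.
QUERY = """
SELECT v.country,
       COALESCE(w.n_warnings, 0) AS n_warnings,
       COALESCE(d.n_deaths, 0)   AS n_deaths,
       ROUND(COALESCE(d.n_deaths, 0) * 100000.0 / v.us_visitors, 2)
           AS deaths_per_100k,
       CASE WHEN COALESCE(w.n_warnings, 0) = 0 THEN 'no warning'
            ELSE 'warned' END AS warning_status
FROM visitors AS v
LEFT JOIN (SELECT country, COUNT(*) AS n_warnings
           FROM warnings GROUP BY country) AS w ON w.country = v.country
LEFT JOIN (SELECT country, COUNT(*) AS n_deaths
           FROM deaths GROUP BY country) AS d ON d.country = v.country
ORDER BY deaths_per_100k DESC;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Counting warnings and deaths in separate subqueries before joining avoids inflating the counts, which would happen if both one-to-many tables were joined to `visitors` directly. In this toy data, country B has fewer warnings than country A but a much higher rate per 100,000 visitors, which is the kind of discrepancy the per capita view is meant to expose.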
Python & NumPy - File Organizer
Score: 500/500
Automated file management system that organizes files by type (audio, documents, images) into categorized subfolders with CSV tracking. Implements both batch processing (Jupyter) and CLI interfaces.
Key Technologies: Python, os, shutil, csv, argparse
Features:
- Automatic categorization by file extension
- Dynamic folder creation
- CSV logging with file metadata (name, type, size)
- Incremental updates without overwriting
- Command-line interface for selective processing
Skills: File I/O operations, CSV handling, CLI argument parsing, automated directory management, error handling
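The features listed above could be combined roughly as in the sketch below. It is not the project's code: the extension mapping, the function name `organize`, and the log file name `recap.csv` are assumptions made for the example, and only the standard-library modules named above are used.

```python
import csv
import os
import shutil

# Hypothetical extension-to-category mapping; the project's real mapping may differ.
CATEGORIES = {
    ".mp3": "audio", ".wav": "audio",
    ".txt": "docs", ".pdf": "docs", ".odt": "docs",
    ".png": "images", ".jpg": "images", ".jpeg": "images",
}


def organize(folder, log_name="recap.csv"):
    """Move recognised files in `folder` into category subfolders and log them."""
    log_path = os.path.join(folder, log_name)
    write_header = not os.path.exists(log_path)
    moved = []

    # Append mode keeps earlier log rows, so repeated runs update incrementally.
    with open(log_path, "a", newline="") as log_file:
        writer = csv.writer(log_file)
        if write_header:
            writer.writerow(["name", "type", "size_bytes"])

        for entry in sorted(os.listdir(folder)):
            source = os.path.join(folder, entry)
            name, ext = os.path.splitext(entry)
            category = CATEGORIES.get(ext.lower())
            if not os.path.isfile(source) or category is None:
                continue  # skip subfolders, the log itself, and unknown types

            target_dir = os.path.join(folder, category)
            os.makedirs(target_dir, exist_ok=True)  # create the subfolder on demand
            size = os.path.getsize(source)
            shutil.move(source, os.path.join(target_dir, entry))
            writer.writerow([name, category, size])
            moved.append((name, category, size))

    return moved
```

Calling `organize("files")` a second time moves only files added since the previous run and appends their rows to the existing log rather than rewriting it.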
Data Visualization - Food Production and Environmental Impact
Score: 500/500
Comprehensive analysis of global food and feed production's environmental impact (1961-2013), exploring relationships between agriculture, population growth, and climate change. Integrates multiple datasets with interactive visualizations.
Key Technologies: Python, Pandas, NumPy, Seaborn, Matplotlib, Plotly, country-converter, raceplotly
Datasets: FAO production (245+ countries), environmental impact, population, temperature
Key Insights:
- Cereals dominated until 1995, then surpassed by vegetables
- Beef, lamb, dairy generate highest emissions
- Strong correlation between agricultural expansion and temperature increase
Skills: Data cleaning/transformation, multi-dataset integration, EDA, interactive visualization, time series analysis, geographical mapping
Machine Learning - Wine Classification
Score: 500/500
Random Forest classifier predicting Italian wine types from 13 chemical features. Demonstrates hyperparameter tuning with GridSearchCV, reaching ~99% validation accuracy.
Key Technologies: scikit-learn, Random Forest, GridSearchCV, Matplotlib
Dataset: 178 wine samples, 3 classes, 13 chemical features
Model Performance:
- Best parameters: max_depth=None, max_features=1, n_estimators=50
- Validation accuracy: ~99.2%
- 10-fold cross-validation
Skills: Multi-class classification, ensemble methods, hyperparameter optimization, model evaluation (accuracy, precision, recall, F1), decision tree visualization
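The tuning setup described above can be approximated with the sketch below. It assumes that scikit-learn's bundled `load_wine` data (178 samples, 13 features, 3 classes) matches the project's dataset; the parameter grid, the train/test split, and the random seeds are illustrative choices, not the project's settings, so the scores it prints will not necessarily match the figures reported above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# scikit-learn's bundled wine data: 178 samples, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid; the values searched in the project are not listed here.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
    "max_features": [1, "sqrt"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,               # 10-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 3))
print("Held-out accuracy:", round(search.score(X_test, y_test), 3))
```

Keeping a held-out test split separate from the cross-validated search gives an accuracy estimate that was not used to pick the hyperparameters.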
Final Project - Student Grade Prediction
Score: 1400/1500
Capstone project applying multiple ML regression models (Random Forest, Decision Tree, KNN, SVR) to predict student final grades in Portuguese language courses. Analyzes 33 features across 649 students to identify success factors.
Key Technologies: scikit-learn, RandomizedSearchCV, Pandas, NumPy, Seaborn, Matplotlib
Dataset: 649 students, 33 features (demographic, family, academic, social, health)
Key Findings:
- Previous grades (G1, G2) strongest predictors of final grade (G3)
- Study time, past failures, parent education significantly impact performance
- Social factors (alcohol, relationships) show moderate negative correlation
Skills: Regression modeling, hyperparameter tuning, feature importance analysis, EDA, multi-model comparison, stratified sampling
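A multi-model comparison of the kind described above might look like the following sketch. It does not use the real student data: the arrays are synthetic stand-ins in which two columns play the role of the earlier grades (G1, G2) and the rest are noise, and the model settings are assumptions chosen for brevity. Hyperparameter search is left out to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the student data: two earlier grades plus noise features.
rng = np.random.default_rng(0)
n_students = 300
g1 = rng.integers(0, 21, size=n_students).astype(float)
g2 = np.clip(g1 + rng.normal(0, 1.5, n_students), 0, 20)
noise_features = rng.normal(size=(n_students, 3))
X = np.column_stack([g1, g2, noise_features])
y = np.clip(0.2 * g1 + 0.8 * g2 + rng.normal(0, 1.0, n_students), 0, 20)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    # Distance- and kernel-based models are scaled first.
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVR": make_pipeline(StandardScaler(), SVR(C=10.0)),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, r2 in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {r2:.3f}")

# Feature importances from the forest; columns 0 and 1 are the earlier grades.
forest = models["Random Forest"].fit(X, y)
print("Importances:", forest.feature_importances_.round(3))
```

Because the synthetic target is built from the two grade columns, the forest's importances concentrate on them by construction; on the real data the same call is one way to check which features carry the signal.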
- Python 3.7+ - Primary programming language
- SQL (PostgreSQL) - Database queries and analysis
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- csv - CSV file handling
- Matplotlib - Static plotting
- Seaborn - Statistical visualization
- Plotly - Interactive charts
- raceplotly - Animated race plots
- scikit-learn - ML algorithms and evaluation
  - Random Forest, Decision Trees, KNN, SVR
  - GridSearchCV, RandomizedSearchCV
  - StandardScaler, train_test_split
- Jupyter Notebook - Interactive development
- argparse - CLI argument parsing
- os, shutil - File system operations
- country-converter - Geographic data standardization
# Python 3.7 or higher
python --version
# PostgreSQL (for psql_travel project)
psql --version

git clone https://github.com/ulpati/s2i_datascience.git
cd s2i_datascience

# All projects
pip install jupyter numpy pandas matplotlib seaborn scikit-learn
# data_vis project
pip install plotly country-converter raceplotly nbformat
# No additional packages needed for file_organizer (uses standard library)

# Navigate to project folder
cd data_vis # or machine_learning, final_project, file_organizer
# Start Jupyter Notebook
jupyter notebook

data-science-portfolio/
├── README.md                    # This file
├── psql_travel/                 # SQL travel safety analysis
│   ├── README.md                # Project documentation
│   ├── psql_travel.sql          # SQL queries
│   ├── psql_travel.pdf          # Report
│   └── csv_files/               # Data files
├── file_organizer/              # Python file management
│   ├── README.md                # Project documentation
│   ├── fileorganizer.ipynb      # Batch processing notebook
│   ├── addfile.py               # CLI script
│   └── files/                   # Target directory
├── data_vis/                    # Data visualization project
│   ├── README.md                # Project documentation
│   ├── data_vis.ipynb           # Main analysis
│   └── csv/                     # Datasets (FAO, Food, Population, Temperature)
├── machine_learning/            # Wine classification
│   ├── README.md                # Project documentation
│   └── machine_learning.ipynb   # Model training and evaluation
└── final_project/               # Student grade prediction
    ├── readme.md                # Project documentation
    ├── final_project.ipynb      # Comprehensive ML analysis
    └── student-por.csv          # Student dataset
- Data Acquisition - CSV loading, dataset integration, API usage
- Data Cleaning - Missing value handling, duplicate removal, standardization
- Exploratory Analysis - Statistical summaries, correlation analysis, pattern identification
- Feature Engineering - Data transformation, encoding, scaling
- Modeling - Algorithm selection, training, validation
- Evaluation - Metrics selection, performance assessment, model comparison
- Communication - Visualization, reporting, insight delivery
- SQL Proficiency - Complex joins, aggregations, subqueries, CTEs, window functions
- Python Programming - Object-oriented design, functional programming, error handling
- Statistical Analysis - Hypothesis testing, correlation, distribution analysis
- Machine Learning - Classification, regression, ensemble methods, hyperparameter tuning
- Data Visualization - Static plots, interactive charts, dashboards, geographical maps
- Version Control - Git, GitHub, documentation best practices
- Environmental Science - Food production sustainability, climate impact
- Education Analytics - Student performance prediction, success factor identification
- Risk Assessment - Travel safety analysis, statistical risk quantification
- Chemistry - Wine classification based on chemical composition
- Software Engineering - CLI tools, automated workflows, modular design
- Problem Solving - Breaking complex problems into manageable tasks
- Critical Thinking - Questioning assumptions, validating results
- Communication - Clear documentation, visual storytelling
- Attention to Detail - Code quality, edge case handling, thorough testing
All projects were developed for the start2impact Data Science Master Program.
Institution: start2impact University
Program: Data Science Master
Period: 2023-2024
Scores: 500/500 (four projects), 1400/1500 (final project)
Data Sources:
- U.S. Department of State (Travel Safety Data)
- UN Food and Agriculture Organization (FAO)
- UCI Machine Learning Repository (Wine, Student Performance datasets)
- Kaggle (Environmental impact, temperature, population datasets)
Special Thanks:
- start2impact mentors and instructors
- Open-source data science community
- Dataset providers and maintainers