An end-to-end data science pipeline that predicts whether a SpaceX Falcon 9 first-stage booster will land successfully, a key factor in launch cost because recovered boosters can be reused.
This repository contains the complete portfolio framework for the IBM Applied Data Science Capstone curriculum. The project spans data ingestion via web scraping and REST APIs, relational database management using SQL, interactive geospatial visualization, web-app dashboard deployment, and hyperparameter-tuned machine learning classification algorithms.
- Data Collection & Extraction: Gathering launch logs using the SpaceX REST API and scraping historical Wikipedia tables using BeautifulSoup4.
- Data Wrangling & Processing: Handling null values, engineering categorical features with One-Hot Encoding, and flattening raw payloads.
- Exploratory Data Analysis (EDA): Exploring the data with SQL queries and visualization plots.
- Geospatial Mapping: Isolating launch pad coordinates, safety distances, and landing failure/success metrics using interactive map overlays.
- Interactive Dashboard App: Deploying a live analytics control panel featuring structural charts and reactive filter selectors.
  Charts: Launch Site Success Proportions; Payload Mass vs. Success Correlation.
- Predictive Modeling (ML): Training, tuning, and benchmarking four classification algorithms to select the best landing predictor.
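
The wrangling steps listed above (flattening raw payloads, handling null values, one-hot encoding) can be sketched with pandas. This is a minimal illustration, not the notebooks' actual code: the records, the field names (`payload.mass_kg`, `payload.orbit`, `landing.success`), and the choice of mean imputation are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical launch records shaped like a nested API response.
# Field names and values are invented for illustration only.
raw_launches = [
    {"flight_number": 1, "payload": {"mass_kg": 500.0, "orbit": "LEO"}, "landing": {"success": False}},
    {"flight_number": 2, "payload": {"mass_kg": None, "orbit": "GTO"}, "landing": {"success": True}},
    {"flight_number": 3, "payload": {"mass_kg": 3100.0, "orbit": "LEO"}, "landing": {"success": True}},
]

# Flatten nested dictionaries into dotted column names such as "payload.mass_kg".
df = pd.json_normalize(raw_launches)

# Coerce payload mass to a numeric dtype so missing entries become NaN.
df["payload.mass_kg"] = pd.to_numeric(df["payload.mass_kg"])

# Assumed strategy: replace missing payload mass with the column mean.
df["payload.mass_kg"] = df["payload.mass_kg"].fillna(df["payload.mass_kg"].mean())

# One-hot encode the categorical orbit column into orbit_<name> indicator columns.
features = pd.get_dummies(df, columns=["payload.orbit"], prefix="orbit", dtype=int)

print(features)
```

Mean imputation is only one option; the appropriate strategy depends on how the missing values arise in the real data.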
## 📁 Repository Structure
The project is organized into modular notebooks and scripts tracking each phase of the data science lifecycle:
* **`data/`**: Directory containing the project's visual assets (scatter plots and pie charts).
* **`01_Data-Collection-API.ipynb`**: Data gathering using SpaceX API requests.
* **`02_Webscraping.ipynb`**: Web scraping historical launch data using BeautifulSoup.
* **`03_Data_Wrangling.ipynb`**: Data cleaning, handling null values, and initial feature engineering.
* **`04-EDA-With-SQL.ipynb`**: Exploratory Data Analysis using SQL queries to discover operational trends.
* **`05_EDA_Data_Visualization.ipynb`**: Exploratory Data Analysis using Python visual analytics (Matplotlib and Seaborn).
* **`06_Launch_Site_Location.ipynb`**: Interactive geospatial mapping using Folium.
* **`07_Dashapp.py`**: A fully functional, interactive Plotly Dash web dashboard application.
* **`08_Machine_Learning_Predictions.ipynb`**: Machine learning classification model training, hyperparameter tuning, and evaluation.
* **`requirements.txt`**: List of required Python packages and environment dependencies.
* **`LICENSE`**: MIT License.
- Python 3 - Underlying programming runtime.
- Scikit-Learn - Machine learning classification models & GridSearchCV tuning.
- Plotly Dash - Dynamic data application framework environment.
- Folium - Interactive HTML geospatial map visualization layers.
- Pandas / NumPy - Matrix manipulations and structured data processing pipelines.
- BeautifulSoup4 / Requests - Web scraping tools and REST API parsing pipelines.
Install the project's dependencies from the requirements file:

`pip install -r requirements.txt`

- Clone this repository to your local system and enter its directory:

  `git clone https://github.com/usmanali9999/Applied-Data-Science-Capstone.git`

  `cd Applied-Data-Science-Capstone`

- Start the interactive workspace environment:

  `jupyter notebook`

- Run the development notebooks in sequence (`01_Data-Collection-API.ipynb` through `08_Machine_Learning_Predictions.ipynb`) to replicate the data insights pipeline.
- Launch the live dashboard application locally:

  `python 07_Dashapp.py`

The table below details the optimal tuning parameters and prediction accuracies across all classification models tested in this lab. Each model was optimized using GridSearchCV and evaluated on identical train and test splits.
| Classification Model | Best Hyperparameters Found | Training Accuracy | Test Dataset Accuracy |
|---|---|---|---|
| Decision Tree | `{'criterion': 'gini', 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}` | 87.50% | 88.89% |
| K-Nearest Neighbors (KNN) | `{'algorithm': 'auto', 'n_neighbors': 10, 'p': 1}` | 84.82% | 83.33% |
| Support Vector Machine (SVM) | `{'C': 1.0, 'gamma': 0.0316, 'kernel': 'sigmoid'}` | 84.82% | 83.33% |
| Logistic Regression | `{'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}` | 84.64% | 83.33% |
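
The tuning procedure described above (GridSearchCV applied to each model on shared train and test splits) could look roughly like the following sketch. The synthetic data from `make_classification`, the 80/20 split, `cv=5`, the random seeds, and the reduced parameter grids are all assumptions for illustration; only two of the four models are shown to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered launch features (assumption).
X, y = make_classification(n_samples=90, n_features=8, random_state=0)

# One shared split so every model is evaluated on the same held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Reduced, illustrative parameter grids (not the grids used in the notebook).
candidates = {
    "Logistic Regression": (
        LogisticRegression(),
        {"C": [0.01, 0.1, 1], "penalty": ["l2"], "solver": ["lbfgs"]},
    ),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8], "min_samples_split": [2, 5]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)  # cv=5 is an assumed fold count
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,  # mean cross-validated accuracy
        "test_accuracy": search.score(X_test, y_test),  # accuracy on the held-out split
    }

for name, row in results.items():
    print(name, row)
```

In this sketch `cv_accuracy` is the mean cross-validated score of the best parameter set, while `test_accuracy` comes from the refit best estimator scored on the held-out split.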
To select the best model, an evaluation loop compared each tuned estimator on the held-out test split.
- The best performing model is: `DecisionTreeClassifier`
- Highest Test Accuracy Score: 88.89%
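
As a small illustration of that comparison step, the sketch below picks the top scorer from the test accuracies reported in the results table and lists any models that tie below it. The dictionary layout and variable names are assumptions; the notebook's own loop may be structured differently.

```python
# Test accuracies copied from the results table in this README.
test_accuracy = {
    "Decision Tree": 0.8889,
    "K-Nearest Neighbors (KNN)": 0.8333,
    "Support Vector Machine (SVM)": 0.8333,
    "Logistic Regression": 0.8333,
}

# Highest score and every model that reaches it (handles ties at the top).
best_score = max(test_accuracy.values())
best_models = [name for name, score in test_accuracy.items() if score == best_score]

# Models that share the next-highest score.
remaining = {name: score for name, score in test_accuracy.items() if score < best_score}
runner_up_score = max(remaining.values())
runner_ups = [name for name, score in remaining.items() if score == runner_up_score]

print(f"Best: {best_models} at {best_score:.2%}")
print(f"Tied behind it: {runner_ups} at {runner_up_score:.2%}")
```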
While Logistic Regression, SVM, and KNN all reached the same test accuracy of 83.33%, the Decision Tree fit the classification boundaries of the standardized SpaceX payload and orbit features better. This suggests that tree-structured partitions were more effective at isolating the feature combinations associated with a successful Falcon 9 first-stage landing.
Note: Logistic Regression, SVM, and KNN tied on test accuracy, which suggests that their performance is driven largely by the engineered features rather than by the choice of algorithm.
Distributed under the MIT License. See LICENSE for more details.

