This project applies machine learning techniques learned in class to a synthetic predictive maintenance dataset. The aim is to predict machine failures based on various features.
The dataset consists of 10,000 data points with 14 features; the main columns are listed below (a brief pandas inspection sketch follows the list):
- UID: Unique identifier
- productID: Product quality variant (L, M, or H) followed by a variant-specific serial number
- air temperature [K]: Generated as a random walk, normalized to a standard deviation of 2 K around 300 K
- process temperature [K]: Generated as a random walk normalized to a standard deviation of 1 K, added to the air temperature plus 10 K
- rotational speed [rpm]: Calculated from a power of 2860 W, overlaid with normally distributed noise
- torque [Nm]: Normally distributed around 40 Nm with σ = 10 Nm and no negative values
- tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear, respectively
- Failure: Indicates whether a machine failure occurred
- Failure Type: Specific type of failure
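A minimal inspection sketch is shown below; the file name `predictive_maintenance.csv` and the exact column names are assumptions based on the list above, so adjust them to the actual file:

```python
import pandas as pd

# Load the dataset (filename is an assumption; adjust to the actual path).
df = pd.read_csv("predictive_maintenance.csv")

# Basic shape and schema checks: 10,000 rows are expected.
print(df.shape)
print(df.dtypes)

# Class balance of the failure label (column name assumed to be "Failure").
print(df["Failure"].value_counts(normalize=True))

# Breakdown of the specific failure types.
print(df["Failure Type"].value_counts())
```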
The project workflow consists of the following stages:
- Data Preprocessing: Clean and prepare the dataset for analysis.
- Exploratory Data Analysis (EDA): Understand the dataset and identify important features.
- Feature Engineering: Create new features or modify existing ones to improve model performance.
- Modeling: Build and evaluate machine learning models to predict machine failure.
- Evaluation: Assess model performance using appropriate metrics.
Data preprocessing covers the following steps (a short code sketch follows this list):
- Loading the dataset: Using pandas to load and explore the dataset.
- Handling missing values: Checking for and imputing or dropping missing values.
- Data normalization/scaling: Standardizing features to have a mean of 0 and standard deviation of 1 using StandardScaler from scikit-learn.
- Encoding categorical variables: Converting categorical features to numerical values using one-hot encoding.
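A minimal preprocessing sketch; the file name, column names, and the split into numeric vs. categorical columns are assumptions based on the feature list, not the exact code in final.ipynb:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("predictive_maintenance.csv")  # assumed filename

# Check for missing values per column and drop any incomplete rows.
print(df.isnull().sum())
df = df.dropna()

# Standardize the numeric sensor features to mean 0 and standard deviation 1.
# (In the notebook the scaler would normally be fit on the training split only.)
numeric_cols = [
    "air temperature [K]", "process temperature [K]",
    "rotational speed [rpm]", "torque [Nm]", "tool wear [min]",
]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# One-hot encode the product quality variant (L/M/H) taken from productID.
df["quality"] = df["productID"].str[0]
df = pd.get_dummies(df, columns=["quality"], prefix="quality")
```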
Exploratory data analysis includes the following, illustrated by the sketch after this list:
- Visualizations: Using matplotlib and seaborn for plotting distributions, correlations, and feature relationships.
- Statistical summaries: Generating descriptive statistics to understand data distribution.
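A small EDA sketch under the same naming assumptions; the specific plots are illustrative and the notebook may use different ones:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("predictive_maintenance.csv")  # assumed filename

# Descriptive statistics for all numeric columns.
print(df.describe())

# Distribution of torque, split by the failure label (column names assumed).
sns.histplot(data=df, x="torque [Nm]", hue="Failure", bins=40)
plt.show()

# Correlation heatmap over the numeric features.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```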
Feature engineering involves the following, with an illustrative sketch after the list:
- Creating new features: Deriving new features from existing ones to capture more information.
- Feature selection: Identifying important features using correlation matrices or feature importance from models like RandomForest.
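An illustrative feature-engineering sketch; the two derived features (temperature difference and mechanical power) and all column names are assumptions, not necessarily what final.ipynb uses:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("predictive_maintenance.csv")  # assumed filename

# Two illustrative derived features: temperature difference and
# mechanical power = torque * angular velocity (rpm converted to rad/s).
df["temp_diff [K]"] = df["process temperature [K]"] - df["air temperature [K]"]
df["power [W]"] = df["torque [Nm]"] * df["rotational speed [rpm]"] * 2 * np.pi / 60

# Rank features by importance with a quick Random Forest fit.
feature_cols = [
    "air temperature [K]", "process temperature [K]", "rotational speed [rpm]",
    "torque [Nm]", "tool wear [min]", "temp_diff [K]", "power [W]",
]
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(df[feature_cols], df["Failure"])
importances = pd.Series(model.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```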
Modeling covers the following steps; a brief sketch follows the list:
- Splitting the dataset: Dividing the data into training and testing sets.
- Model selection: Trying different models such as Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, and Support Vector Machines (SVM).
- Hyperparameter tuning: Using techniques like GridSearchCV to find the best parameters for models.
- Cross-validation: Evaluating model performance using k-fold cross-validation to ensure robustness.
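A compact modeling sketch, assuming the `Failure` column is a binary 0/1 label and the feature columns are named as listed above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

df = pd.read_csv("predictive_maintenance.csv")  # assumed filename
feature_cols = [
    "air temperature [K]", "process temperature [K]", "rotational speed [rpm]",
    "torque [Nm]", "tool wear [min]",
]
X, y = df[feature_cols], df["Failure"]

# Stratified split keeps the (rare) failure class proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline comparison of two candidate models via 5-fold cross-validation.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)):
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    print(type(model).__name__, scores.mean())

# Hyperparameter tuning for the Random Forest with GridSearchCV.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```

A stratified split and the F1 score are used here because failures make up only a small minority of the 10,000 rows, so plain accuracy can be misleading.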
Evaluation and model export include the following, with a sketch after the list:
- Performance metrics: Using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC to assess model performance.
- Feature importance: Interpreting which features are most influential in the predictions.
- Confusion matrix: Analyzing the confusion matrix to understand the types of errors the model makes.
- Saving the model: Exporting the trained model using joblib or pickle for deployment.
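An evaluation and export sketch under the same assumptions (binary 0/1 `Failure` label, assumed file and column names):

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("predictive_maintenance.csv")  # assumed filename
feature_cols = [
    "air temperature [K]", "process temperature [K]", "rotational speed [rpm]",
    "torque [Nm]", "tool wear [min]",
]
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["Failure"], test_size=0.2,
    stratify=df["Failure"], random_state=42,
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, and F1 per class, plus the raw confusion matrix.
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# ROC-AUC uses the predicted probability of the positive (failure) class.
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Persist the trained model for later reuse or deployment.
joblib.dump(model, "failure_model.joblib")
```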
The repository contains the following files:
- final.ipynb: Main Jupyter notebook containing the project code, including data preprocessing, EDA, modeling, and evaluation.
- requirements.txt: List of dependencies needed to run the project.
Install the required libraries using the provided requirements.txt file:
pip install -r requirements.txt