mwirigijustice101-pythonist/codespaces-blank

🍷 Wine Quality Prediction Model


A high-performance machine learning model for predicting wine quality using data streaming for efficient memory management. Combines supervised learning techniques including SVM, XGBoost, and Logistic Regression for robust predictions.


🎯 Project Highlights

  • ✨ Memory-Efficient: Streams data using HuggingFace Datasets instead of loading the entire dataset into memory
  • 🚀 Multi-Model Ensemble: Combines SVM, XGBoost, and Logistic Regression for superior accuracy
  • 🔬 Supervised Learning: Leverages advanced feature scaling and model optimization techniques
  • 📊 Real-Time Analysis: Visualize predictions with matplotlib and seaborn
  • ⚡ Fast & Scalable: Handles large datasets without consuming excessive storage


📊 Quick Stats

| Metric | Details |
|---|---|
| Models Used | SVM, XGBoost, Logistic Regression |
| Dataset Size | 6,400+ samples (via HuggingFace streaming) |
| Input Features | 11 physicochemical properties |
| Target Variable | Wine Quality (Classification) |
| Data Source | HuggingFace - mnemoraorg/wine-quality-6k4 |
| Memory Approach | Streaming (Zero-Copy) |

🔬 Technical Overview

Supervised Learning Approach

This project implements a supervised learning pipeline that:

  1. Streams data from HuggingFace Datasets to minimize memory consumption
  2. Preprocesses features using MinMaxScaler for normalization
  3. Splits data into training and testing sets
  4. Trains multiple models (SVM, XGBoost, Logistic Regression)
  5. Evaluates performance using comprehensive metrics
  6. Visualizes results with matplotlib and seaborn

Architecture

Data Streaming (HuggingFace)
         ↓
Data Exploration & Analysis
         ↓
Feature Scaling (MinMaxScaler)
         ↓
Train/Test Split (80/20)
         ↓
Model Training:
├── SVM (Support Vector Machine)
├── XGBoost Classifier
└── Logistic Regression
         ↓
Model Evaluation & Metrics
         ↓
Visualization & Results
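
The flow above can be sketched end to end. This is a minimal sketch under stated assumptions, not the contents of wine_quality.py: synthetic data from make_classification stands in for the streamed wine table, XGBoost is left out to keep the example dependency-light, and the function name compare_models is illustrative. It also fits the scaler inside a Pipeline after the split, which differs from the order in the diagram and keeps test-set statistics out of the scaler.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def compare_models(X, y, seed=42):
    """Return {model name: test accuracy} for an 80/20 stratified split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "svm": SVC(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        # The pipeline fits MinMaxScaler on the training rows only.
        pipeline = make_pipeline(MinMaxScaler(), model)
        pipeline.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(X_test))
    return scores


# Synthetic stand-in with 11 features, like the wine table (assumption).
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=6, n_classes=3, random_state=0
)
scores = compare_models(X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

An XGBClassifier could be added to the models dictionary in the same way once its labels are encoded as 0..n-1.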

📁 Project Structure

wine-quality-prediction/
│
├── wine_quality.py           # Main model implementation
├── README.md                 # This file
└── requirements.txt          # Project dependencies

File Descriptions

wine_quality.py

  • Streams wine quality dataset from HuggingFace
  • Implements data preprocessing pipeline
  • Trains multiple classification models
  • Generates performance metrics and visualizations

🛠️ Technologies & Libraries

numpy              - Numerical computing
pandas             - Data manipulation and analysis
matplotlib         - Visualization
seaborn            - Statistical data visualization
scikit-learn       - Machine learning algorithms
xgboost            - Gradient boosting classifier
datasets           - HuggingFace Datasets (streaming)

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Internet connection (for dataset streaming)

Step 1: Clone the Repository

git clone https://github.com/mwirigijustice101-pythonist/codespaces-blank.git
cd codespaces-blank

Step 2: Create Virtual Environment

# Using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n wine-quality python=3.9
conda activate wine-quality

Step 3: Install Dependencies

pip install -r requirements.txt

📖 Usage

Quick Start

# Run the wine quality prediction model
python wine_quality.py

What the Script Does

  1. Streams Dataset from HuggingFace

    from datasets import load_dataset
    dataset = load_dataset("mnemoraorg/wine-quality-6k4", split="train", streaming=True)

  2. Previews Data (first 3 examples)

    for row in dataset.take(3):
        print(row)

  3. Converts to Pandas DataFrame

    # A streamed dataset is an iterable of row dicts, so materialize the rows first
    import pandas as pd
    df = pd.DataFrame(list(dataset))
    print(df.head())

  4. Preprocesses Features

    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

  5. Trains Multiple Models

    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    svm_model = SVC()
    xgb_model = XGBClassifier()
    lr_model = LogisticRegression()

  6. Evaluates & Compares Results

    from sklearn import metrics
    accuracy = metrics.accuracy_score(y_test, predictions)

💡 Key Features

1. Data Streaming 🌊

Loads data efficiently from HuggingFace without consuming local storage:

from datasets import load_dataset
dataset = load_dataset("mnemoraorg/wine-quality-6k4", split="train", streaming=True)

Benefits:

  • ✅ Memory efficient
  • ✅ Scalable to large datasets
  • ✅ No local storage required
  • ✅ Real-time data access
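
A streamed dataset yields one row dictionary at a time, so the rows have to be collected before scikit-learn can use them. The helper below is a minimal sketch of that step. The name rows_to_xy, the target column name "quality", and the assumption that every feature column is numeric are illustrative and not taken from wine_quality.py.

```python
from itertools import islice

import numpy as np


def rows_to_xy(rows, target="quality", limit=None):
    """Collect an iterable of row dicts into a feature matrix and label vector.

    `rows` can be any iterable of dicts, for example a streamed dataset.
    `target` is the assumed name of the label column.
    """
    features, labels, columns = [], [], None
    for row in islice(rows, limit):
        if columns is None:
            # Fix the feature order from the first row.
            columns = [name for name in row if name != target]
        features.append([row[name] for name in columns])
        labels.append(row[target])
    return np.asarray(features, dtype=float), np.asarray(labels), columns


# Hand-written rows standing in for the stream (values are made up).
demo_rows = [
    {"alcohol": 9.4, "pH": 3.51, "quality": 5},
    {"alcohol": 9.8, "pH": 3.20, "quality": 5},
    {"alcohol": 11.2, "pH": 3.16, "quality": 6},
]
X, y, columns = rows_to_xy(demo_rows, limit=2)
print(X.shape, y.tolist(), columns)
```

Passing the streamed dataset as rows, with a limit, reads only as many rows as requested.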

2. Multiple Classification Models 🤖

Support Vector Machine (SVM)

  • Non-linear classification
  • Well suited to high-dimensional data
  • Robust against outliers

XGBoost Classifier

  • Gradient boosting implementation
  • Fast training and inference
  • Handles non-linear relationships

Logistic Regression

  • Baseline model for comparison
  • Interpretable results
  • Fast training

3. Feature Scaling 📊

Normalizes all features to [0, 1] range:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
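
MinMaxScaler maps each column with (x - min) / (max - min), using the minimum and maximum it saw during fit. The sketch below uses made-up values to show the formula next to the scikit-learn call, and fits on a training array only so that the test array is scaled with the training statistics. The array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy alcohol and pH columns (values are made up).
X_train = np.array([[9.0, 3.0], [11.0, 3.4], [13.0, 3.2]])
X_test = np.array([[12.0, 3.1], [14.0, 3.5]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min and max per column
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

# The same transform written out: (x - min) / (max - min)
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)
manual = (X_test - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

Test values that fall outside the training range land outside [0, 1], which is expected when the scaler is fitted on training data alone.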

4. Comprehensive Evaluation 📈

  • Accuracy score
  • Precision, Recall, F1-Score
  • Confusion matrix
  • ROC-AUC score
  • Classification reports
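
For a multi-class target, several of these metrics need an averaging choice. The following sketch uses made-up labels and probabilities for three classes. The variable names are illustrative, and macro averaging and one-vs-rest ROC-AUC are assumptions rather than the settings used in wine_quality.py.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)

# Made-up labels and class probabilities for three quality classes.
y_test = np.array([0, 0, 1, 1, 2, 2])
predictions = np.array([0, 1, 1, 1, 2, 0])
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.2, 0.3],
])

accuracy = accuracy_score(y_test, predictions)
macro_f1 = f1_score(y_test, predictions, average="macro")
cm = confusion_matrix(y_test, predictions)
# Multi-class ROC-AUC needs per-class probabilities and an averaging scheme.
auc = roc_auc_score(y_test, probabilities, multi_class="ovr")

print(classification_report(y_test, predictions))
```

With a real model the probabilities would come from predict_proba, which SVC only provides when constructed with probability=True.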

5. Data Visualization 📉

import matplotlib.pyplot as plt
import seaborn as sb

# Visualize model performance
plt.figure(figsize=(10, 6))
# ... plotting code ...
plt.show()
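
One plot that fits this kind of project is a confusion-matrix heatmap. The sketch below is an illustration and not the plotting code from wine_quality.py: the matrix values and the class labels low, medium and high are made up.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb

# Made-up confusion matrix for three quality classes.
cm = np.array([[50, 8, 2], [10, 60, 5], [1, 9, 40]])
labels = ["low", "medium", "high"]

fig, ax = plt.subplots(figsize=(10, 6))
sb.heatmap(cm, annot=True, fmt="d", cmap="Blues",
           xticklabels=labels, yticklabels=labels, ax=ax)
ax.set_xlabel("Predicted quality")
ax.set_ylabel("Actual quality")
ax.set_title("Confusion matrix")
fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view the result.
```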

🔍 Model Comparison

| Model | Strengths | Use Case |
|---|---|---|
| SVM | Excellent for binary/multiclass, handles non-linear patterns | When you need robust, accurate boundaries |
| XGBoost | Fast, handles imbalanced data, feature importance | Production-grade predictions |
| Logistic Regression | Interpretable, baseline model, probability outputs | When model explainability matters |

📊 Dataset Information

Source

🔗 HuggingFace Datasets: mnemoraorg/wine-quality-6k4

Features (11 Physicochemical Properties)

  1. Fixed Acidity - tartaric acid concentration (g/dm³)
  2. Volatile Acidity - acetic acid concentration (g/dm³)
  3. Citric Acid - citric acid concentration (g/dm³)
  4. Residual Sugar - sugar remaining after fermentation (g/dm³)
  5. Chlorides - sodium chloride concentration (g/dm³)
  6. Free Sulfur Dioxide - molecular SO₂ (mg/dm³)
  7. Total Sulfur Dioxide - bound and free SO₂ (mg/dm³)
  8. Density - wine density (g/cm³)
  9. pH - acidity level
  10. Sulphates - potassium sulphate concentration (g/dm³)
  11. Alcohol - alcohol content (% vol)

Target

Quality Score - Discrete values from 0-10 (classification problem)
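
Because the scores are discrete class labels, and recent XGBoost releases expect classifier labels to be the integers 0..n-1, it can help to encode the scores before training and decode predictions afterwards. This is a minimal sketch with made-up scores. Whether wine_quality.py performs this step is not stated here, so treat it as an assumption.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up quality scores; real wine scores tend to cover only part of 0-10.
quality = np.array([5, 6, 5, 7, 3, 8, 6])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(quality)        # maps sorted scores to 0..n-1
restored = encoder.inverse_transform(y_encoded)   # back to the original scores

print(encoder.classes_)  # the distinct scores, sorted
print(y_encoded)
```

Predictions from a model trained on y_encoded can be passed through inverse_transform to report them on the original quality scale.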


📦 Requirements

Create a requirements.txt file with:

numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
seaborn>=0.11.0
scikit-learn>=1.0.0
xgboost>=1.5.0
datasets>=2.0.0
scipy>=1.7.0

Install All Dependencies

pip install -r requirements.txt

🎓 How to Extend This Project

1. Add More Models

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rf_model = RandomForestClassifier(n_estimators=100)
gb_model = GradientBoostingClassifier()

2. Implement Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for the SVM regularization strength and kernel
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)

3. Add Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")

4. Create Prediction Function

def predict_wine_quality(features):
    """
    Predict wine quality from features
    Args: features (array-like) - 11 physicochemical properties
    Returns: prediction (int) - predicted quality score
    """
    X_scaled = scaler.transform([features])
    return model.predict(X_scaled)[0]

5. Save Trained Models

import pickle

# Save model
with open('wine_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load model
with open('wine_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
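
A saved model only reproduces its predictions if new inputs are scaled the same way as the training data. One option, sketched below with a tiny made-up dataset, is to wrap the scaler and the model in a scikit-learn Pipeline and pickle that single object. The variable names are illustrative and this is not necessarily how wine_quality.py stores its models.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up training set: two features, two classes.
X = np.array([[9.0, 3.0], [10.0, 3.1], [12.5, 3.4], [13.0, 3.5]])
y = np.array([0, 0, 1, 1])

pipeline = make_pipeline(MinMaxScaler(), LogisticRegression())
pipeline.fit(X, y)

# Serialize scaler and model together, then restore them.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

sample = np.array([[12.8, 3.45]])
print(restored.predict(sample))
```

Pickle files can execute code when loaded, so only unpickle files from a trusted source.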

🎯 Best Practices Implemented

  • ✅ Supervised Learning: Properly labeled training data
  • ✅ Data Streaming: Memory-efficient HuggingFace Datasets integration
  • ✅ Feature Scaling: MinMaxScaler for normalized features
  • ✅ Train/Test Split: 80/20 split for unbiased evaluation
  • ✅ Multiple Models: Ensemble approach for robustness
  • ✅ Warnings Suppressed: Clean console output
  • ✅ Modular Code: Easy to extend and maintain


📈 Performance Metrics

The script evaluates models using:

  • Accuracy: Overall correctness
  • Precision: Fraction of predicted positives that are actually positive
  • Recall: Fraction of actual positives that are correctly identified
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed classification breakdown
  • ROC-AUC Score: Area under receiver operating characteristic curve
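
As a worked example of the definitions above, the following sketch computes the metrics by hand from made-up counts for a single class treated as positive. The counts are illustrative and not results from this project.

```python
# Made-up counts for one class treated as "positive".
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```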

🐛 Troubleshooting

Issue: Dataset Download Takes Long

Solution:

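  • With streaming=True, rows are fetched over the network on every run instead of being cached locally, so loading time depends on the connection
  • To cache the dataset locally for faster repeat runs, call load_dataset without streaming=True
  • Ensure stable internet connection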

Issue: Memory Issues

Solution:

  • Data streaming already minimizes memory usage
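  • If issues persist, process data in smaller batches (dataset.batch needs a recent datasets release; dataset.iter(batch_size=...) is an alternative):
    batch_size = 1000
    for batch in dataset.batch(batch_size):
        ...  # Process batch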
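
Batch processing also works for the scaling step, since MinMaxScaler can be updated incrementally with partial_fit. The sketch below uses a plain generator as a stand-in for the stream. The helper name batched and the generated values are illustrative.

```python
from itertools import islice

import numpy as np
from sklearn.preprocessing import MinMaxScaler


def batched(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Made-up feature rows standing in for a stream.
stream = ([float(i), float(i % 7)] for i in range(25))

scaler = MinMaxScaler()
for chunk in batched(stream, batch_size=10):
    # partial_fit updates the running min and max without holding all rows
    scaler.partial_fit(np.asarray(chunk))

print(scaler.data_min_, scaler.data_max_)
```

Only one chunk is held in memory at a time, at the cost of a second pass over the data to apply the fitted scaler.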

Issue: Import Errors

Solution:

  • Verify all dependencies installed: pip install -r requirements.txt
  • Check Python version: python --version (should be 3.8+)

Issue: Model Training is Slow

Solution:

  • Reduce dataset size for initial testing
  • Use n_jobs=-1 for parallel processing where the estimator supports it (XGBClassifier does; SVC has no n_jobs parameter):
    XGBClassifier(n_jobs=-1)

🤝 Contributing

Contributions are welcome! Here's how:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make changes and test thoroughly
  4. Commit: git commit -m "Add my feature"
  5. Push: git push origin feature/my-feature
  6. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


📚 References

  1. Cortez, P., Cerdeira, A., Alves, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

  2. HuggingFace Datasets Documentation

  3. scikit-learn Supervised Learning

  4. XGBoost Documentation


📞 Support & Contact


🚀 Ready to predict wine quality? Clone and run!

git clone https://github.com/mwirigijustice101-pythonist/codespaces-blank.git
cd codespaces-blank
pip install -r requirements.txt
python wine_quality.py

⭐ If you found this helpful, please star the repository! ⭐

Made with ❤️ by mwirigijustice101-pythonist

Last Updated: May 11, 2026
