A high-performance machine learning model for predicting wine quality using data streaming for efficient memory management. Combines supervised learning techniques including SVM, XGBoost, and Logistic Regression for robust predictions.
- Memory-Efficient: Streams data using HuggingFace Datasets instead of loading the entire dataset into memory
- Multi-Model Ensemble: Combines SVM, XGBoost, and Logistic Regression for superior accuracy
- Supervised Learning: Leverages advanced feature scaling and model optimization techniques
- Real-Time Analysis: Visualize predictions with matplotlib and seaborn
- Fast & Scalable: Handles large datasets without consuming excessive storage
| Metric | Details |
|---|---|
| Models Used | SVM, XGBoost, Logistic Regression |
| Dataset Size | 6,400+ samples (via HuggingFace streaming) |
| Input Features | 11 physicochemical properties |
| Target Variable | Wine Quality (Classification) |
| Data Source | HuggingFace - mnemoraorg/wine-quality-6k4 |
| Memory Approach | Streaming (Zero-Copy) |
This project implements a supervised learning pipeline that:
- Streams data from HuggingFace Datasets to minimize memory consumption
- Preprocesses features using MinMaxScaler for normalization
- Splits data into training and testing sets
- Trains multiple models (SVM, XGBoost, Logistic Regression)
- Evaluates performance using comprehensive metrics
- Visualizes results with matplotlib and seaborn
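The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the code in wine_quality.py: synthetic data from `make_classification` stands in for the streamed wine data, only Logistic Regression is trained, and the scaler is fitted on the training split only (a common way to keep test-set statistics out of preprocessing), which differs from a scale-then-split order.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine data: 11 numeric features, 3 classes (assumption).
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3, random_state=0
)

# 80/20 split, stratified so each class keeps its proportion in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train one model and evaluate it on the held-out split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.3f}")
```

With the real data, `X` would hold the 11 physicochemical columns and `y` the quality column.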
Data Streaming (HuggingFace)
↓
Data Exploration & Analysis
↓
Feature Scaling (MinMaxScaler)
↓
Train/Test Split (80/20)
↓
Model Training:
├── SVM (Support Vector Machine)
├── XGBoost Classifier
└── Logistic Regression
↓
Model Evaluation & Metrics
↓
Visualization & Results
wine-quality-prediction/
│
├── wine_quality.py      # Main model implementation
├── README.md            # This file
└── requirements.txt     # Project dependencies
wine_quality.py
- Streams wine quality dataset from HuggingFace
- Implements data preprocessing pipeline
- Trains multiple classification models
- Generates performance metrics and visualizations
numpy - Numerical computing
pandas - Data manipulation and analysis
matplotlib - Visualization
seaborn - Statistical data visualization
scikit-learn - Machine learning algorithms
xgboost - Gradient boosting classifier
datasets - HuggingFace Datasets (streaming)
- Python 3.8 or higher
- pip package manager
- Internet connection (for dataset streaming)
    git clone https://github.com/mwirigijustice101-pythonist/codespaces-blank.git
    cd codespaces-blank

    # Using venv
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

    # Or using conda
    conda create -n wine-quality python=3.9
    conda activate wine-quality

    pip install -r requirements.txt

    # Run the wine quality prediction model
    python wine_quality.py

- Streams Dataset from HuggingFace

      dataset = load_dataset("mnemoraorg/wine-quality-6k4", split="train", streaming=True)

- Previews Data (first 3 examples)

      for row in dataset.take(3):
          print(row)

- Converts to Pandas DataFrame (a streamed dataset is an iterable of row dicts, so the rows are collected first)

      df = pd.DataFrame(list(dataset))
      print(df.head())

- Preprocesses Features

      scaler = MinMaxScaler()
      X_scaled = scaler.fit_transform(X)

- Trains Multiple Models

      svm_model = SVC()
      xgb_model = XGBClassifier()
      lr_model = LogisticRegression()

- Evaluates & Compares Results

      from sklearn import metrics
      accuracy = metrics.accuracy_score(y_test, predictions)
Loads data efficiently from HuggingFace without consuming local storage:
    from datasets import load_dataset
    dataset = load_dataset("mnemoraorg/wine-quality-6k4", split="train", streaming=True)

Benefits:
- Memory efficient
- Scalable to large datasets
- No local storage required
- Real-time data access
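A streamed dataset yields one dict per row, so turning part of it into a DataFrame means collecting rows explicitly. The sketch below shows one way to do that with a row limit. The helper name `rows_to_frame`, the generator `fake_stream`, and its column names are hypothetical stand-ins so the example runs without network access.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional

import pandas as pd


def rows_to_frame(rows: Iterable[Dict[str, Any]], limit: Optional[int] = None) -> pd.DataFrame:
    """Collect an iterable of row dicts into a DataFrame, reading at most `limit` rows."""
    if limit is not None:
        rows = islice(rows, limit)
    return pd.DataFrame(list(rows))


def fake_stream():
    """Stand-in for a streamed dataset: yields one dict per row (placeholder columns)."""
    for i in range(5):
        yield {"alcohol": 9.0 + i, "pH": 3.0 + 0.1 * i, "quality": 5 + (i % 2)}


# Only the first three rows are pulled from the stream.
preview = rows_to_frame(fake_stream(), limit=3)
print(preview)
```

With the real data, the call would presumably look like `rows_to_frame(dataset, limit=1000)`, where `dataset` is the object returned by `load_dataset(..., streaming=True)`.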
Support Vector Machine (SVM)
- Non-linear classification through kernel functions (for example RBF)
- Works well with high-dimensional data
- Soft margin (the C parameter) limits the influence of individual outliers
XGBoost Classifier
- Gradient boosting implementation
- Fast training and inference
- Handles non-linear relationships
Logistic Regression
- Baseline model for comparison
- Interpretable results
- Fast training
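To compare the three model families on equal terms, each one can be wrapped in the same scaling pipeline and scored on the same held-out split. The sketch below assumes synthetic data, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it needs only scikit-learn. It is an illustration, not the comparison code from wine_quality.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 11 features (assumption).
X, y = make_classification(n_samples=400, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Gradient boosting (stand-in for XGBoost)": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit every candidate inside the same MinMaxScaler pipeline and record test accuracy.
test_accuracy = {}
for name, estimator in candidates.items():
    pipeline = make_pipeline(MinMaxScaler(), estimator)
    pipeline.fit(X_train, y_train)
    test_accuracy[name] = pipeline.score(X_test, y_test)

# Print the models from best to worst accuracy.
for name, acc in sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```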
Normalizes all features to [0, 1] range:
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

- Accuracy score
- Precision, Recall, F1-Score
- Confusion matrix
- ROC-AUC score
- Classification reports
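The metrics listed above map onto functions in `sklearn.metrics`. The sketch below computes them for a tiny three-class example. The labels and probabilities are hand-made for illustration, and macro averaging and one-vs-rest ROC-AUC are assumed choices for the multiclass case rather than settings taken from the script.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Hand-made true labels and predicted class probabilities (each row sums to 1).
y_true = np.array([0, 0, 1, 1, 2, 2])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.2, 0.3],
])
y_pred = proba.argmax(axis=1)  # predicted class = most probable class

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, proba, multi_class="ovr")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"ROC-AUC (one-vs-rest)={auc:.3f}")
print(cm)
print(classification_report(y_true, y_pred, zero_division=0))
```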
    import matplotlib.pyplot as plt
    import seaborn as sb

    # Visualize model performance
    plt.figure(figsize=(10, 6))
    # ... plotting code ...
    plt.show()

| Model | Strengths | Use Case |
|---|---|---|
| SVM | Excellent for binary/multiclass, handles non-linear patterns | When you need robust, accurate boundaries |
| XGBoost | Fast, handles imbalanced data, feature importance | Production-grade predictions |
| Logistic Regression | Interpretable, baseline model, probability outputs | When model explainability matters |
HuggingFace Datasets: mnemoraorg/wine-quality-6k4
- Fixed Acidity - tartaric acid concentration (g/dm³)
- Volatile Acidity - acetic acid concentration (g/dm³)
- Citric Acid - citric acid concentration (g/dm³)
- Residual Sugar - sugar remaining after fermentation (g/dm³)
- Chlorides - sodium chloride concentration (g/dm³)
- Free Sulfur Dioxide - molecular SO₂ (mg/dm³)
- Total Sulfur Dioxide - bound and free SO₂ (mg/dm³)
- Density - wine density (g/cm³)
- pH - acidity level
- Sulphates - potassium sulphate concentration (g/dm³)
- Alcohol - alcohol content (% vol)
Quality Score - Discrete values from 0-10 (classification problem)
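Quality scores that actually occur in the data do not necessarily start at 0 or form a gap-free range. Some classifiers (recent XGBoost releases are a commonly cited case) expect class labels to be consecutive integers starting at 0, so a label-encoding step may be needed. The sketch below shows that step with scikit-learn's `LabelEncoder`. The sample scores are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up quality scores; the real column comes from the dataset.
quality = np.array([5, 6, 5, 7, 3, 8, 6])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(quality)  # maps sorted unique scores to 0..n_classes-1

# Predictions made on encoded labels can be mapped back to quality scores.
restored = encoder.inverse_transform(y_encoded)

print("classes:", encoder.classes_)
print("encoded:", y_encoded)
print("restored:", restored)
```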
Create a requirements.txt file with:
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
seaborn>=0.11.0
scikit-learn>=1.0.0
xgboost>=1.5.0
datasets>=2.0.0
scipy>=1.7.0
    pip install -r requirements.txt

Additional models:

    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    rf_model = RandomForestClassifier(n_estimators=100)
    gb_model = GradientBoostingClassifier()

Hyperparameter tuning:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)

Cross-validation:

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(model, X, y, cv=5)
    print(f"Cross-validation scores: {scores}")

Prediction helper:

    def predict_wine_quality(features):
        """
        Predict wine quality from features
        Args: features (array-like) - 11 physicochemical properties
        Returns: prediction (int) - predicted quality score
        """
        X_scaled = scaler.transform([features])
        return model.predict(X_scaled)[0]

Saving and loading the model:

    import pickle

    # Save model
    with open('wine_model.pkl', 'wb') as f:
        pickle.dump(model, f)

    # Load model
    with open('wine_model.pkl', 'rb') as f:
        loaded_model = pickle.load(f)

- Supervised Learning: Properly labeled training data
- Data Streaming: Memory-efficient HuggingFace Datasets integration
- Feature Scaling: MinMaxScaler for normalized features
- Train/Test Split: 80/20 split for unbiased evaluation
- Multiple Models: Ensemble approach for robustness
- Warnings Suppressed: Clean console output
- Modular Code: Easy to extend and maintain
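Because predict_wine_quality needs the fitted scaler as well as the model, it can help to pickle both together so that a reloaded model always sees features scaled the same way. The following is a sketch under assumptions: synthetic data, Logistic Regression as the model, and a hypothetical file name `wine_bundle_demo.pkl` written to a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with 11 features (assumption).
X, y = make_classification(n_samples=200, n_features=11, n_informative=6, random_state=0)
scaler = MinMaxScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = Path(tmp_dir) / "wine_bundle_demo.pkl"  # hypothetical file name

    # Save the scaler and the model in one file.
    with open(bundle_path, "wb") as f:
        pickle.dump({"scaler": scaler, "model": model}, f)

    # Load both back in one step.
    with open(bundle_path, "rb") as f:
        bundle = pickle.load(f)


def predict_wine_quality(features, bundle=bundle):
    """Scale one sample with the stored scaler, then predict with the stored model."""
    X_scaled = bundle["scaler"].transform([features])
    return bundle["model"].predict(X_scaled)[0]


print(predict_wine_quality(X[0]))
```

As with any pickle file, a bundle like this should only be loaded from a trusted source.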
The script evaluates models using:
- Accuracy: Overall correctness
- Precision: Share of predicted positives that are actually positive
- Recall: Share of actual positives that are correctly predicted (true positive rate)
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed classification breakdown
- ROC-AUC Score: Area under receiver operating characteristic curve
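As a worked example of how precision, recall, and F1 relate, the sketch below computes them from raw counts for a single class. The helper name `precision_recall_f1` and the counts are illustrative only.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# 40 correct positive predictions, 10 false alarms, 20 missed positives.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here precision is 40/50, recall is 40/60, and F1 equals 2·TP / (2·TP + FP + FN) = 80/110.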
Solution:
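- With streaming=True, rows are fetched over the network on every run, so loading speed depends on the connection
- To cache the dataset locally, load it without streaming=True; the first run then downloads the data and subsequent runs load from cache
- Ensure stable internet connection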
Solution:
- Data streaming already minimizes memory usage
- If issues persist, process data in smaller batches:
    batch_size = 1000
    for batch in dataset.batch(batch_size):
        pass  # Process batch
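If the installed datasets version does not offer a batch method on streamed datasets, the same effect can be had with a small generator that works on any iterable. The sketch below is one such fallback. `iter_batches` is a hypothetical helper name, and `range(2500)` stands in for the stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` items without loading everything at once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is held in memory at a time; the last batch may be smaller.
sizes = [len(batch) for batch in iter_batches(range(2500), batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```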
Solution:
- Verify all dependencies are installed: pip install -r requirements.txt
- Check the Python version (should be 3.8+): python --version
Solution:
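- Reduce dataset size for initial testing
- Use n_jobs=-1 for parallel processing where the estimator supports it, e.g. XGBClassifier(n_jobs=-1) or GridSearchCV(SVC(kernel='rbf'), param_grid, n_jobs=-1); SVC itself has no n_jobs parameter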
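Since the parallelism sits in the search wrapper and not in SVC, a grid search is one place where n_jobs=-1 can speed up SVM training runs. The sketch below assumes synthetic data and a hypothetical helper `tune_svc`. The parameter grid mirrors the one used elsewhere in this README.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def tune_svc(X, y, n_jobs=-1):
    """Grid-search SVC hyperparameters; n_jobs=-1 runs the cross-validation fits on all cores."""
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=n_jobs)
    search.fit(X, y)
    return search


# Synthetic stand-in data with 11 features (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=11, n_informative=6, random_state=0
)
```

A call such as `search = tune_svc(X_demo, y_demo)` then exposes `search.best_params_` and `search.best_score_`.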
Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Make changes and test thoroughly
- Commit: git commit -m "Add my feature"
- Push: git push origin feature/my-feature
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset Source: HuggingFace Datasets - mnemoraorg/wine-quality-6k4
- Original Data: UCI Machine Learning Repository
- Libraries: scikit-learn, XGBoost, pandas, numpy communities
- Research: [Cortez et al., 2009] - Wine Quality Dataset
- Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.
- GitHub Issues: Report issues here
- Author: @mwirigijustice101-pythonist
    git clone https://github.com/mwirigijustice101-pythonist/codespaces-blank.git
    cd codespaces-blank
    pip install -r requirements.txt
    python wine_quality.py

If you found this helpful, please star the repository!
Made with ❤️ by mwirigijustice101-pythonist
Last Updated: May 11, 2026