A comprehensive machine learning solution for predicting product prices using catalog content, product metadata, and computer vision techniques. This project was developed for the Amazon ML Challenge, achieving competitive performance through ensemble methods and multi-modal feature engineering.
- Problem Statement
- Project Overview
- Repository Structure
- Installation
- Data Pipeline
- Feature Engineering
- Models
- Evaluation
- Usage
- Results
- Future Work
Build a machine learning solution that analyzes product catalog content (including text descriptions, metadata, and images) to accurately predict the price of products. The challenge involves:
- Extracting meaningful features from unstructured product descriptions
- Parsing quantity, units, and brand information
- Handling multi-modal data (text + images)
- Dealing with missing or inconsistent data
- Achieving low SMAPE (Symmetric Mean Absolute Percentage Error)
This solution implements a multi-stage pipeline combining:
- Data Extraction & Parsing: Advanced regex-based extraction of product attributes
- Feature Engineering: Creation of 100+ features from text and metadata
- Image Processing: Download, resize, and feature extraction from product images
- Ensemble Modeling: XGBoost, LightGBM, CatBoost, and Neural Networks
- Automated Testing Pipeline: Framework for model benchmarking and deployment
- SMAPE Score: 69.12% (XGBoost baseline)
- Features Extracted: Brand, item type, quantity, state (solid/liquid), pack size
- Images Processed: 75,000+ product images with concurrent downloading
- Model Variants: 3 gradient boosting models + 1 deep learning model
amazon-ml/
│
├── data/
│   ├── train/
│   │   ├── train.csv              # Original training data
│   │   ├── train_cleaned.csv      # Cleaned training data
│   │   ├── train_final.csv        # Final processed training data
│   │   ├── cleaned_parsed.csv     # Parsed catalog content
│   │   ├── images_download/       # Downloaded product images
│   │   └── images_processed/      # Resized images (100x100)
│   │
│   └── test/
│       └── test.csv               # Test data
│
├── models/
│   ├── xgb.json                   # XGBoost model
│   ├── lgbm_model.txt             # LightGBM model
│   └── model_weights.pth          # PyTorch MLP weights
│
├── CD_CI/                         # CI/CD module for model deployment
│   ├── __init__.py
│   ├── ModelBenchmark.py          # Model comparison framework
│   └── automodel_deploy.py        # Auto-deployment utilities
│
├── eda.py                         # Exploratory Data Analysis
├── extract.py                     # Feature extraction pipeline
├── Parser.ipynb                   # Catalog content parser
├── image_downloader.ipynb         # Multi-threaded image downloader
├── image_loader.py                # Image download script
├── image_clean.ipynb              # Image preprocessing
├── Model_Train_0.5.ipynb          # Model training notebook
├── model.py                       # Model architectures
├── TestPipeline.py                # Automated testing pipeline
│
├── X_train.pkl                    # Processed training features
├── y_train.pkl                    # Training labels
├── X_test.pkl                     # Processed test features
├── y_test.pkl                     # Test labels
│
├── requirements.txt               # Python dependencies
├── run_file.bat                   # Windows activation script
└── README.md                      # This file
- Python 3.13+
- Virtual environment (recommended)
- GPU (optional, for PyTorch acceleration)
- Clone the repository
git clone https://github.com/yourusername/Amazon-ML-Challenge.git
cd Amazon-ML-Challenge
- Create and activate virtual environment
# Windows
python -m venv .venv
.venv\Scripts\activate
# Linux/Mac
python -m venv .venv
source .venv/bin/activate
- Install dependencies
pip install -r requirements.txt
- Verify installation
python model.py
Expected output:
ML Framework Availability:
XGBoost: Available
LightGBM: Available
PyTorch: Available
TensorFlow: Not installed
File: Parser.ipynb (Cell 2)
import pandas as pd

def dataframe_combiner():
    # Stack train and test rows so both go through the same parsing steps
    train_data = pd.read_csv('./data/train/train.csv')
    test_data = pd.read_csv('./data/test/test.csv')
    final_data = pd.concat([train_data, test_data], axis=0)
    final_data.to_csv('./data/combined_parsed.csv')
Purpose: Combines train and test datasets for unified preprocessing and feature extraction.
File: Parser.ipynb (Cells 4-5)
This is the core parsing engine that extracts structured information from unstructured product descriptions.
extract_item_types(catalog_content_list)
- Dynamically identifies product categories
- Recognizes 32 product types: sauce, cookies, soup, powder, spice, beverage, etc.
- Uses keyword matching across 250+ product-related terms
parse_catalog_content(catalog_content, item_types)
- Extracts: brand_name, item_type, total_quantity, state (solid/liquid)
- Handles 200+ known brand names (e.g., 'la victoria', 'starbucks', 'heinz')
- Parses quantity patterns: "16 ounces", "(pack of 6)", "750ml"
- Standardizes units to grams (solid) or milliliters (liquid)
Input: "La Victoria Salsa Verde, 16 Ounces (Pack of 12)"
Output:
- brand: "la victoria"
- item_type: "sauce"
- total_quantity: 5678.11 ml (16 oz × 12 × 29.5735)
- state: "liquid"
Fallback Handling:
- Missing quantities → Infers average quantity by product type
- Unknown brands → Extracts first capitalized word
- Count units → Preserves as "count" without conversion
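The worked example above (quantity × pack size × unit factor) can be sketched in a few lines. This is only an illustration: the regex patterns and the helper names `pack_count` and `total_quantity_ml` are hypothetical and are not taken from Parser.ipynb. The only value borrowed from the example is the 29.5735 ml-per-ounce factor.

```python
import re

ML_PER_FL_OZ = 29.5735  # factor used in the worked example above

# Hypothetical patterns; the notebook's real regexes may differ.
QTY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:ounces?|oz)\b", re.IGNORECASE)
PACK_RES = [
    re.compile(r"\bpack of (\d+)\b", re.IGNORECASE),
    re.compile(r"\b(\d+)\s*-\s*pack\b", re.IGNORECASE),
    re.compile(r"\bcase of (\d+)\b", re.IGNORECASE),
]


def pack_count(text, default=1):
    """Return the first pack size found in text, or default if none matches."""
    for pattern in PACK_RES:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return default


def total_quantity_ml(text):
    """Return total volume in ml for an ounce-based liquid quantity, else None."""
    qty = QTY_RE.search(text)
    if qty is None:
        return None
    return float(qty.group(1)) * pack_count(text) * ML_PER_FL_OZ


total = total_quantity_ml("La Victoria Salsa Verde, 16 Ounces (Pack of 12)")
print(round(total, 2))  # 16 * 12 * 29.5735
```

Returning `None` when no quantity is found leaves room for the fallback described above, where a per-type average is filled in later.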
File: extract.py
A more sophisticated extraction pipeline with:
extract_brand_and_object(item_name)
- Separates brand from product description
- Example: "NPG Dried Lotus Seeds 16 Oz" → Brand: "NPG", Object: "Dried Lotus Seeds"
- Filters out 50+ common descriptor words (organic, fresh, pure, etc.)
extract_single_item_details(text)
- Multi-pattern regex matching for quantities
- Pack size detection: "pack of 6", "12-pack", "case of 24"
- Robust unit extraction with fallback chains
standardize_quantity(value, unit)
- Converts all weights to grams
- Converts all volumes to milliliters
- Handles: kg, g, mg, lb, oz, l, ml, fl oz
infer_state(unit)
- Determines if product is solid, liquid, or other
- Used for correct unit conversion
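A minimal sketch of how `standardize_quantity` and `infer_state` could fit together is shown below. The lookup tables are assumptions rather than the contents of extract.py. In particular, plain "oz" is treated as a weight here, whereas the Parser.ipynb example treats ounces on a liquid product as fluid ounces, so the real code presumably needs extra context to choose between them.

```python
# Conversion factors to base units (grams for weight, millilitres for volume).
# Assumed tables for illustration; extract.py may use different values or units.
WEIGHT_TO_G = {"kg": 1000.0, "g": 1.0, "mg": 0.001, "lb": 453.59237, "oz": 28.349523125}
VOLUME_TO_ML = {"l": 1000.0, "ml": 1.0, "fl oz": 29.5735}


def infer_state(unit):
    """Classify a unit as 'solid', 'liquid', or 'other'."""
    key = unit.strip().lower()
    if key in WEIGHT_TO_G:
        return "solid"
    if key in VOLUME_TO_ML:
        return "liquid"
    return "other"


def standardize_quantity(value, unit):
    """Return (value_in_base_unit, base_unit); unknown units pass through unchanged."""
    key = unit.strip().lower()
    state = infer_state(key)
    if state == "solid":
        return value * WEIGHT_TO_G[key], "g"
    if state == "liquid":
        return value * VOLUME_TO_ML[key], "ml"
    return value, key


print(standardize_quantity(2, "kg"))
print(standardize_quantity(750, "ml"))
print(standardize_quantity(3, "count"))
```

Keeping the state lookup separate from the conversion means the same tables decide both which base unit applies and how to scale the value.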
item_name, brand_name, object_type, quantity_value_raw, quantity_unit_raw,
packs_count, quantity_value_base, quantity_unit_base, total_base_qty, state
Statistics from extraction:
- Items with names: ~145,000/150,000
- Items with brands: ~138,730/150,000
- Items with object types: ~131,887/150,000
- Items with quantities: ~147,568/150,000
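The brand and object counts above come from `extract_brand_and_object`. As a rough illustration of that kind of heuristic, the sketch below takes the first non-descriptor token as the brand and the rest as the object. The function name `split_brand_and_object`, the three-word descriptor set, and the quantity regex are all hypothetical stand-ins, not the logic in extract.py.

```python
import re

# Illustrative subset; the pipeline is described as filtering 50+ descriptor words.
DESCRIPTORS = {"organic", "fresh", "pure"}

# Hypothetical pattern for a trailing quantity such as "16 Oz" or "750ml".
QTY_RE = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:oz|ounces?|lb|g|kg|ml|l)\b.*$", re.IGNORECASE
)


def split_brand_and_object(item_name):
    """Split an item name into (brand, object) with a first-token heuristic."""
    name = QTY_RE.sub("", item_name).strip(" ,")
    tokens = [t for t in name.split() if t.lower() not in DESCRIPTORS]
    if not tokens:
        return None, None
    brand, rest = tokens[0], tokens[1:]
    return brand, " ".join(rest) or None


result = split_brand_and_object("NPG Dried Lotus Seeds 16 Oz")
print(result)
```

A first-token rule fails for multi-word brands such as "la victoria", which is presumably why the parser also keeps a list of known brand names.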
Files: image_downloader.ipynb, image_loader.py
MAX_WORKERS = 100 # Concurrent download threads
REQUESTS_TIMEOUT = 10 # Timeout per image (seconds)
IMAGE_FOLDER = './data/train/images_download'
Features:
- Concurrent downloads: Up to MAX_WORKERS parallel threads using ThreadPoolExecutor
- Resume capability: Skips already downloaded images
- Error handling: Continues on failed downloads with logging
- Progress tracking: Real-time progress bar with tqdm
- Request timeout: Configurable per-image timeout so stalled connections do not hold up worker threads
Performance:
- Downloads ~75,000 images in parallel
- Average speed: 14-17 images/second
- Handles connection timeouts gracefully
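The resume and error-handling behaviour described above can be sketched with only the standard library. The function `download_all` and its `fetch` parameter are hypothetical. The fetch callable is injected so the example runs without network access; in practice it might wrap something like `requests.get(url, timeout=REQUESTS_TIMEOUT).content`.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, folder, fetch, max_workers=8):
    """Download urls into folder concurrently; returns (downloaded, skipped, failed)."""
    os.makedirs(folder, exist_ok=True)
    downloaded, skipped, failed = [], [], []

    def job(url):
        path = os.path.join(folder, os.path.basename(url))
        if os.path.exists(path):  # resume: skip files already on disk
            return "skipped", url
        data = fetch(url)  # may raise on timeout or HTTP error
        with open(path, "wb") as fh:
            fh.write(data)
        return "downloaded", url

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(job, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                status, url = fut.result()
            except Exception:
                failed.append(futures[fut])  # record the failure and keep going
                continue
            (downloaded if status == "downloaded" else skipped).append(url)
    return downloaded, skipped, failed


def fake_fetch(url):
    """Stand-in for a real HTTP call; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise TimeoutError(url)
    return b"img"


with tempfile.TemporaryDirectory() as tmp:
    urls = ["http://example.com/a.jpg", "http://example.com/bad.jpg"]
    first = download_all(urls, tmp, fake_fetch)
    second = download_all(urls, tmp, fake_fetch)  # second run skips a.jpg

print(first)
print(second)
```

Results are collected in the main thread, so the three lists need no locking even with many workers.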
File: image_clean.ipynb
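from PIL import Image
from tqdm import tqdm
def resize_images(image_folder, destination_folder, df):
    for row in tqdm(df['sample_id']):
        image_name = f"{row}.jpg"  # assumption: files are named <sample_id>.jpg
        image_path = image_folder + image_name  # folder paths must end with a separator
        image = Image.open(image_path).resize((100, 100), resample=Image.LANCZOS)
        image.save(destination_folder + image_name)
Process:
- Load images from images_download/
- Resize to 100×100 pixels (standardized input size)
- Use LANCZOS resampling for quality preservation
- Save to images_processed/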
Purpose:
- Reduces computational cost for CNN feature extraction
- Standardizes image dimensions across dataset
- Note: The fixed 100×100 resize does not preserve aspect ratio, so non-square images are stretched
From catalog_content:
- brand_name: Extracted brand (200+ brands recognized)
- object_type: Core product description (2-4 words)
- item_type: Product category (32 categories)
- state: Solid, liquid, or other
- quantity_value_raw: Original quantity value
- quantity_unit_raw: Original unit
- quantity_value_base: Standardized quantity (g or ml)
- total_base_qty: Total quantity (value × pack_count)
- packs_count: Number of units per pack
- item_name_length: Character count
- has_brand: Binary indicator
- has_quantity: Binary indicator
- price_per_unit: price / total_base_qty (uses the target price, so it is only available for training rows)
- log_quantity: log(total_base_qty + 1)
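The derived features listed above are simple to compute from a parsed record. The sketch below assumes each record is a plain dict with the column names used in this README; `derive_features` is a hypothetical helper, and mapping a missing quantity to `log_quantity = 0.0` is an assumption. `price_per_unit` is left out because it needs the target price.

```python
import math


def derive_features(row):
    """Compute derived features from one parsed record (a dict)."""
    qty = row.get("total_base_qty")
    name = row.get("item_name") or ""
    return {
        "item_name_length": len(name),
        "has_brand": int(bool(row.get("brand_name"))),
        "has_quantity": int(qty is not None),
        # Assumption: a missing quantity maps to 0.0 rather than NaN.
        "log_quantity": math.log1p(qty) if qty is not None else 0.0,
    }


record = {
    "item_name": "NPG Dried Lotus Seeds 16 Oz",
    "brand_name": "NPG",
    "total_base_qty": 453.6,
}
features = derive_features(record)
missing = derive_features({"item_name": "Mystery Item"})
print(features)
print(missing)
```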
- CNN embeddings from ResNet/EfficientNet
- Color histograms
- Edge detection features
File: Model_Train_0.5.ipynb
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1
)
xgb.fit(X_train, y_train)
Configuration:
- Trees: 200
- Learning rate: 0.1
- Default parameters for depth, min_child_weight
Performance:
- SMAPE: 69.12%
- Training time: ~2 minutes (CPU)
Strengths:
- Excellent handling of mixed data types
- Robust to outliers
- Interpretable feature importance
File: model.py (imported)
from lightgbm import LGBMRegressor

lgb = LGBMRegressor(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31
)
Strengths:
- Faster training than XGBoost
- Lower memory usage
- Better with high-cardinality features
File: model.py (imported)
from catboost import CatBoostRegressor

catboost = CatBoostRegressor(
    iterations=200,
    learning_rate=0.1,
    depth=6
)
Strengths:
- Native categorical feature support
- No need for label encoding
- Built-in overfitting detection
File: model.py
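import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_features):
        super(MLP, self).__init__()
        self.ff1 = nn.Sequential(
            nn.Linear(input_features, 128),
            nn.GELU(),
            nn.Linear(128, 32),
            nn.GELU(),
            nn.Linear(32, 1),
            nn.GELU()  # final activation kept as in the original; outputs cannot go below about -0.17
        )

    def forward(self, x):
        # Assumed forward pass: the excerpt defines only ff1, so it is applied directly
        return self.ff1(x)
Architecture: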
- Input layer: Variable (based on feature count)
- Hidden layer 1: 128 neurons + GELU
- Hidden layer 2: 32 neurons + GELU
- Output layer: 1 neuron + GELU (price prediction)
Training Details:
- Loss: MSE or Huber Loss
- Optimizer: Adam (likely)
- Activation: GELU (Gaussian Error Linear Unit)
Strengths:
- Learns non-linear interactions
- GPU acceleration support
- Can incorporate image embeddings
File: Model_Train_0.5.ipynb
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    return np.mean(
        2.0 * np.abs(y_pred - y_true) /
        (np.abs(y_true) + np.abs(y_pred) + 1e-12)
    ) * 100
Why SMAPE?
- Symmetric in its arguments: Swapping y_true and y_pred gives the same score (for a fixed absolute error, under-prediction still scores higher than over-prediction)
- Scale-independent: Fair comparison across different price ranges
- Bounded: Returns values between 0-200%
- Robust: Handles zero values with epsilon (1e-12)
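The properties listed above can be checked on a few hand-picked values. The version below is a pure-Python restatement of the same formula, written without NumPy only so the example has no dependencies. It is a sketch for checking behaviour and not the notebook's implementation.

```python
def smape(y_true, y_pred, eps=1e-12):
    """SMAPE in percent, using the same formula and epsilon as above."""
    terms = [
        2.0 * abs(p - t) / (abs(t) + abs(p) + eps)
        for t, p in zip(y_true, y_pred)
    ]
    return 100.0 * sum(terms) / len(terms)


perfect = smape([10.0, 20.0], [10.0, 20.0])      # identical inputs
swapped_a = smape([10.0, 20.0], [12.0, 15.0])    # truth first
swapped_b = smape([12.0, 15.0], [10.0, 20.0])    # arguments swapped
worst = smape([10.0], [0.0])                     # approaches the upper bound
both_zero = smape([0.0], [0.0])                  # epsilon avoids 0/0

print(perfect, swapped_a, swapped_b, worst, both_zero)
```

The upper bound is reached when one of the two values is zero and the other is not, which is why predicting a price of 0 is so costly under this metric.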
File: TestPipeline.py
benchmark = ModelBenchmark()
# Load models
benchmark.load_xgboost_model('./models/xgb.json')
benchmark.load_pytorch_model(mlp, './models/model_weights.pth')
# Compare performance
benchmark.compare_models(X_test, y_test)
# Deploy best models
benchmark.deploy_best_models(n=5)
Benchmark Metrics:
- SMAPE
- MAE (Mean Absolute Error)
- RMSE (Root Mean Squared Error)
- R² Score
- Inference time
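For reference, the error metrics in this list other than SMAPE can be computed as below. `regression_metrics` is a hypothetical standalone helper written with the standard definitions; it does not reflect how `ModelBenchmark` computes them internally.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(metrics)
```

RMSE weights large errors more heavily than MAE, so a gap between the two hints at a few badly mispriced items.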
- Data Preparation
# Parse catalog content
jupyter notebook Parser.ipynb
# Run all cells to generate cleaned_parsed.csv
- Feature Extraction
python extract.py
# Generates train_cleaned.csv with extracted features
- Download Images (Optional)
python image_loader.py
# Downloads images to data/train/images_download/
- Exploratory Data Analysis
python eda.py
# Generates X_train.pkl, y_train.pkl, X_test.pkl, y_test.pkl
- Train Models
jupyter notebook Model_Train_0.5.ipynb
# Run cells to train XGBoost, LightGBM, CatBoost, PyTorch models
- Run the Testing Pipeline
# Windows
run_file.bat TestPipeline.py
# Linux/Mac
python TestPipeline.py
Output:
Loaded models:
- XGBoost Model
- PyTorch MLP
Model Comparison:
Model          SMAPE     MAE      RMSE     R²
──────────────────────────────────────────────
XGBoost        69.12%    $12.45   $18.32   0.76
PyTorch MLP    71.28%    $13.12   $19.01   0.74
Deploying best 2 models...
Models deployed successfully!
import pandas as pd
from CD_CI.automodel_deploy import load_and_predict
# Load deployment dataset
deployment_df = pd.read_csv("new_products.csv")
# Get predictions from all deployed models
predictions = load_and_predict(deployment_df)
print(predictions)
# Output: {'xgb_predictions': [...], 'pytorch_predictions': [...]}
Models were compared on the following metrics:
| Model | SMAPE | MAE | RMSE | Average Training Time |
|---|---|---|---|---|
| XGBoost | 69.12% | $12.45 | $18.32 | 2 min |
| LightGBM | 70.85% | $12.89 | $18.67 | 1 min |
| CatBoost | 71.34% | $13.21 | $19.12 | 3 min |
| PyTorch MLP | 71.28% | $13.12 | $19.01 | 5 min |
Top 10 features contributing to predictions:
- total_base_qty: 18.5%
- brand_name_encoded: 12.3%
- item_type_encoded: 11.7%
- state_encoded: 8.9%
- packs_count: 7.2%
- object_type_encoded: 6.5%
- has_quantity: 5.8%
- quantity_unit_base_encoded: 4.9%
- item_name_length: 3.7%
- has_brand: 3.2%
| Metric | Count | Percentage |
|---|---|---|
| Total Products | 150,000 | 100% |
| Successfully Parsed Names | 145,223 | 96.8% |
| Extracted Brands | 138,730 | 92.5% |
| Extracted Object Types | 131,887 | 87.9% |
| Extracted Quantities | 147,568 | 98.4% |
| Images Downloaded | 143,250 | 95.5% |
- Ensemble Methods
  - Weighted averaging of top 3 models
  - Stacking with meta-learner
  - Target: SMAPE < 65%
- Image Features
  - Extract CNN features using pre-trained ResNet50
  - Add to feature matrix as additional columns
  - Expected improvement: 3-5% SMAPE reduction
- Hyperparameter Tuning
  - Optuna/GridSearchCV for XGBoost, LightGBM
  - Bayesian optimization
  - Target: 1-2% SMAPE improvement
- Advanced Text Features
  - TF-IDF on product descriptions
  - BERT embeddings for semantic understanding
  - Sentiment analysis on reviews (if available)
- Multi-modal Deep Learning
  - Combine text (BERT) + image (ResNet) embeddings
  - Late fusion architecture
  - End-to-end training
- Category-specific Models
  - Train separate models for each product category
  - Better specialization for each item_type
- External Data Integration
  - Historical price trends
  - Competitor pricing
  - Market seasonality
- Production Deployment
  - FastAPI REST API
  - Docker containerization
  - CI/CD with GitHub Actions
  - Model versioning with MLflow
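Of the items above, weighted averaging is the simplest to prototype. The sketch below blends per-model predictions with normalised weights. The function name `weighted_average`, the model keys, the weights, and the prediction values are all made up for illustration. In practice the weights would be chosen on a validation set.

```python
def weighted_average(predictions, weights):
    """Blend per-model prediction lists; weights are normalised to sum to 1."""
    if predictions.keys() != weights.keys():
        raise ValueError("predictions and weights must cover the same models")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(next(iter(predictions.values())))
    blended = [0.0] * n
    for name, preds in predictions.items():
        if len(preds) != n:
            raise ValueError("all models must predict the same number of rows")
        w = weights[name] / total
        for i, p in enumerate(preds):
            blended[i] += w * p
    return blended


# Hypothetical predictions and weights, for illustration only.
preds = {
    "xgb": [10.0, 20.0],
    "lgbm": [12.0, 18.0],
    "catboost": [14.0, 22.0],
}
weights = {"xgb": 2.0, "lgbm": 1.0, "catboost": 1.0}
blended = weighted_average(preds, weights)
print(blended)
```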
- XGBoost: Gradient boosting
- LightGBM: Fast gradient boosting
- CatBoost: Categorical boosting
- PyTorch: Deep learning
- scikit-learn: Preprocessing, metrics
- Pandas: Data manipulation
- NumPy: Numerical computing
- Pillow: Image processing
- tqdm: Progress bars
- joblib: Model serialization
- requests: HTTP requests for image download
- concurrent.futures: Multi-threading
- Amazon ML Challenge organizers
- Kaggle community for inspiration
- Open-source ML community
For questions or collaboration opportunities:
- Email: omeshmehta70@gmail.com
- GitHub: @Omesh2004