Amazon ML Challenge - Product Price Prediction 🛍️💰

A machine learning solution for predicting product prices from catalog content, product metadata, and product images. It was developed for the Amazon ML Challenge and combines multi-modal feature engineering with gradient boosting models and a neural network; the best reported result so far is a SMAPE of 69.12% from the XGBoost baseline.

🎯 Problem Statement

Build a machine learning solution that analyzes product catalog content (including text descriptions, metadata, and images) to accurately predict the price of products. The challenge involves:

Extracting meaningful features from unstructured product descriptions
Parsing quantity, units, and brand information
Handling multi-modal data (text + images)
Dealing with missing or inconsistent data
Achieving low SMAPE (Symmetric Mean Absolute Percentage Error)

🔍 Project Overview

This solution implements a multi-stage pipeline combining:

Data Extraction & Parsing: Advanced regex-based extraction of product attributes
Feature Engineering: Creation of 100+ features from text and metadata
Image Processing: Download, resize, and feature extraction from product images
Ensemble Modeling: XGBoost, LightGBM, CatBoost, and Neural Networks
Automated Testing Pipeline: Framework for model benchmarking and deployment

Key Achievements

SMAPE Score: 69.12% (XGBoost baseline)
Features Extracted: Brand, item type, quantity, state (solid/liquid), pack size
Images Processed: 75,000+ product images with concurrent downloading
Model Variants: 3 gradient boosting models + 1 deep learning model

📁 Repository Structure

amazon-ml/
│
├── data/
│   ├── train/
│   │   ├── train.csv                    # Original training data
│   │   ├── train_cleaned.csv            # Cleaned training data
│   │   ├── train_final.csv              # Final processed training data
│   │   ├── cleaned_parsed.csv           # Parsed catalog content
│   │   ├── images_download/             # Downloaded product images
│   │   └── images_processed/            # Resized images (100x100)
│   │
│   └── test/
│       └── test.csv                     # Test data
│
├── models/
│   ├── xgb.json                         # XGBoost model
│   ├── lgbm_model.txt                   # LightGBM model
│   └── model_weights.pth                # PyTorch MLP weights
│
├── CD_CI/                               # CI/CD module for model deployment
│   ├── __init__.py
│   ├── ModelBenchmark.py                # Model comparison framework
│   └── automodel_deploy.py              # Auto-deployment utilities
│
├── eda.py                               # Exploratory Data Analysis
├── extract.py                           # Feature extraction pipeline
├── Parser.ipynb                         # Catalog content parser
├── image_downloader.ipynb               # Multi-threaded image downloader
├── image_loader.py                      # Image download script
├── image_clean.ipynb                    # Image preprocessing
├── Model_Train_0.5.ipynb                # Model training notebook
├── model.py                             # Model architectures
├── TestPipeline.py                      # Automated testing pipeline
│
├── X_train.pkl                          # Processed training features
├── y_train.pkl                          # Training labels
├── X_test.pkl                           # Processed test features
├── y_test.pkl                           # Test labels
│
├── requirements.txt                     # Python dependencies
├── run_file.bat                         # Windows activation script
└── README.md                            # This file

🚀 Installation

Prerequisites

Python 3.13+
Virtual environment (recommended)
GPU (optional, for PyTorch acceleration)

Setup

Clone the repository

git clone https://github.com/yourusername/Amazon-ML-Challenge.git
cd Amazon-ML-Challenge

Create and activate virtual environment

# Windows
python -m venv .venv
.venv\Scripts\activate

# Linux/Mac
python -m venv .venv
source .venv/bin/activate

Install dependencies

pip install -r requirements.txt

Verify installation

python model.py

Expected output:

ML Framework Availability:
  XGBoost: ✓ Available
  LightGBM: ✓ Available
  PyTorch: ✓ Available
  TensorFlow: ✗ Not installed

📊 Data Pipeline

Stage 1: Data Loading and Combination

File: Parser.ipynb (Cell 2)

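import pandas as pd

def dataframe_combiner():
    # Load the raw train and test splits
    train_data = pd.read_csv('./data/train/train.csv')
    test_data = pd.read_csv('./data/test/test.csv')
    # Stack them row-wise so both splits go through the same parsing steps
    final_data = pd.concat([train_data, test_data], axis=0)
    final_data.to_csv('./data/combined_parsed.csv')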

Purpose: Combines train and test datasets for unified preprocessing and feature extraction.

Stage 2: Catalog Content Parsing

File: Parser.ipynb (Cells 4-5)

This is the core parsing engine that extracts structured information from unstructured product descriptions.

Key Functions:

extract_item_types(catalog_content_list)

Dynamically identifies product categories
Recognizes 32 product types: sauce, cookies, soup, powder, spice, beverage, etc.
Uses keyword matching across 250+ product-related terms
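The keyword matching described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the `ITEM_TYPE_KEYWORDS` map and the `detect_item_type` helper are hypothetical names, and the map holds only a tiny subset of the 32 types and 250+ terms mentioned.

```python
import re

# Hypothetical, heavily shortened keyword map (assumption, not the project's list).
ITEM_TYPE_KEYWORDS = {
    "sauce": ["sauce", "salsa", "ketchup"],
    "cookies": ["cookie", "cookies", "biscuit"],
    "soup": ["soup", "broth"],
}


def detect_item_type(text, keyword_map=ITEM_TYPE_KEYWORDS):
    """Return the first item type whose keyword appears as a whole word, else None."""
    lowered = text.lower()
    for item_type, keywords in keyword_map.items():
        for keyword in keywords:
            # \b keeps "soup" from matching inside longer words
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                return item_type
    return None


print(detect_item_type("La Victoria Salsa Verde, 16 Ounces (Pack of 12)"))  # prints: sauce
```

Whole-word matching is a design choice here; a substring match would be simpler but more prone to false positives.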

parse_catalog_content(catalog_content, item_types)

Extracts: brand_name, item_type, total_quantity, state (solid/liquid)
Handles 200+ known brand names (e.g., 'la victoria', 'starbucks', 'heinz')
Parses quantity patterns: "16 ounces", "(pack of 6)", "750ml"
Standardizes units to grams (solid) or milliliters (liquid)

Parsing Logic:

Input: "La Victoria Salsa Verde, 16 Ounces (Pack of 12)"
Output:
  - brand: "la victoria"
  - item_type: "sauce"
  - total_quantity: 5678.11 ml  (16 oz × 12 × 29.5735)
  - state: "liquid"
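The arithmetic in this example (ounces × pack count × 29.5735 ml per fluid ounce) can be reproduced with a short sketch. The helper name `total_volume_ml` and both regular expressions are assumptions made for illustration; the actual patterns in Parser.ipynb may differ.

```python
import re

ML_PER_FL_OZ = 29.5735  # conversion factor quoted in the example


def total_volume_ml(text):
    """Parse '<n> ounces' and '(pack of <k>)' and return the total volume in ml."""
    qty = re.search(r"(\d+(?:\.\d+)?)\s*(?:ounces?|oz)\b", text, re.IGNORECASE)
    pack = re.search(r"pack\s+of\s+(\d+)", text, re.IGNORECASE)
    if qty is None:
        return None  # no quantity found; the fallback logic would take over
    ounces = float(qty.group(1))
    packs = int(pack.group(1)) if pack else 1
    return round(ounces * packs * ML_PER_FL_OZ, 2)


print(total_volume_ml("La Victoria Salsa Verde, 16 Ounces (Pack of 12)"))  # prints: 5678.11
```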

Fallback Handling:

Missing quantities → Infers average quantity by product type
Unknown brands → Extracts first capitalized word
Count units → Preserves as "count" without conversion
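The first fallback (inferring a missing quantity from the average of its product type) might look like the sketch below. It uses plain dictionaries to stay dependency-free; the function name `fill_missing_quantities` and the row layout are assumptions, and the project itself presumably works on a pandas DataFrame.

```python
from collections import defaultdict
from statistics import mean


def fill_missing_quantities(rows):
    """Replace a missing total_quantity with the mean of its item_type.

    rows: list of dicts with the keys "item_type" and "total_quantity".
    """
    by_type = defaultdict(list)
    for row in rows:
        if row["total_quantity"] is not None:
            by_type[row["item_type"]].append(row["total_quantity"])
    averages = {item_type: mean(values) for item_type, values in by_type.items()}

    filled = []
    for row in rows:
        qty = row["total_quantity"]
        if qty is None:
            # Stays None when no product of this type has a known quantity
            qty = averages.get(row["item_type"])
        filled.append({**row, "total_quantity": qty})
    return filled
```

Types without any known quantity are left unfilled in this sketch rather than guessed.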

Stage 3: Enhanced Feature Extraction

File: extract.py

A more sophisticated extraction pipeline with:

Key Functions:

extract_brand_and_object(item_name)

Separates brand from product description
Example: "NPG Dried Lotus Seeds 16 Oz" → Brand: "NPG", Object: "Dried Lotus Seeds"
Filters out 50+ common descriptor words (organic, fresh, pure, etc.)
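One simple heuristic that reproduces the example above is to strip the trailing quantity, drop descriptor words, and treat the first remaining word as the brand. The sketch below is an assumption about how such a split could work, not the logic in extract.py; `DESCRIPTORS`, `QUANTITY_TAIL`, and `split_brand_and_object` are illustrative names.

```python
import re

# Illustrative subset of descriptor words (the project filters 50+).
DESCRIPTORS = {"organic", "fresh", "pure", "natural", "premium"}

# Matches a trailing quantity such as " 16 Oz" or " 500 g" and everything after it.
QUANTITY_TAIL = re.compile(
    r"\s*\d+(?:\.\d+)?\s*(?:oz|ounces?|kg|mg|g|ml|lb|l)\b.*$", re.IGNORECASE
)


def split_brand_and_object(item_name):
    """Return (brand, object) using a first-word-is-brand heuristic."""
    name = QUANTITY_TAIL.sub("", item_name).strip()
    words = [w for w in name.split() if w.lower() not in DESCRIPTORS]
    if not words:
        return None, None
    brand, rest = words[0], words[1:]
    return brand, " ".join(rest) or None


print(split_brand_and_object("NPG Dried Lotus Seeds 16 Oz"))
# prints: ('NPG', 'Dried Lotus Seeds')
```

The heuristic fails for multi-word brands, which is presumably why the parser also keeps a list of known brand names.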

extract_single_item_details(text)

Multi-pattern regex matching for quantities
Pack size detection: "pack of 6", "12-pack", "case of 24"
Robust unit extraction with fallback chains
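The pack size formats listed above can be covered by trying several patterns in order and falling back to a default. This is a hedged sketch: `PACK_PATTERNS` and `detect_pack_size` are hypothetical names, and the real pattern list in extract.py may be longer.

```python
import re

# Tried in order; the first match wins.
PACK_PATTERNS = [
    re.compile(r"pack\s+of\s+(\d+)", re.IGNORECASE),     # "pack of 6"
    re.compile(r"case\s+of\s+(\d+)", re.IGNORECASE),     # "case of 24"
    re.compile(r"(\d+)\s*-?\s*pack\b", re.IGNORECASE),   # "12-pack", "12 pack"
]


def detect_pack_size(text, default=1):
    """Return the pack count found in text, or default when nothing matches."""
    for pattern in PACK_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return default
```

For instance, `detect_pack_size("Cola 12-pack cans")` returns 12, while a title without any pack wording returns the default of 1.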

standardize_quantity(value, unit)

Converts all weights to grams
Converts all volumes to milliliters
Handles: kg, g, mg, lb, oz, l, ml, fl oz

infer_state(unit)

Determines if product is solid, liquid, or other
Used for correct unit conversion
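A table-driven version of standardize_quantity and infer_state could look like this. The conversion factors are standard values, but the exact tables, the handling of unknown units, and the choice to treat a bare "oz" as a weight are assumptions rather than the contents of extract.py.

```python
# Factors to the base units (grams for weight, milliliters for volume).
TO_GRAMS = {"kg": 1000.0, "g": 1.0, "mg": 0.001, "lb": 453.592, "oz": 28.3495}
TO_MILLILITERS = {"l": 1000.0, "ml": 1.0, "fl oz": 29.5735}


def infer_state(unit):
    """Classify a unit as 'liquid', 'solid', or 'other'."""
    unit = unit.strip().lower()
    if unit in TO_MILLILITERS:
        return "liquid"
    if unit in TO_GRAMS:
        return "solid"
    return "other"


def standardize_quantity(value, unit):
    """Convert (value, unit) to grams or milliliters; pass other units through."""
    unit = unit.strip().lower()
    if unit in TO_GRAMS:
        return value * TO_GRAMS[unit], "g"
    if unit in TO_MILLILITERS:
        return value * TO_MILLILITERS[unit], "ml"
    return value, unit  # e.g. "count" is kept without conversion
```

Note that "oz" is ambiguous between weight and volume; a fuller implementation would need product context (such as the item type) to decide which table applies.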

Output:

item_name, brand_name, object_type, quantity_value_raw, quantity_unit_raw,
packs_count, quantity_value_base, quantity_unit_base, total_base_qty, state

Statistics from extraction:

Items with names: ~145,000/150,000
Items with brands: ~138,730/150,000
Items with object types: ~131,887/150,000
Items with quantities: ~147,568/150,000

Stage 4: Image Data Collection

Files: image_downloader.ipynb, image_loader.py

Multi-threaded Download Pipeline:

MAX_WORKERS = 100  # Concurrent download threads
REQUESTS_TIMEOUT = 10  # Timeout per image
IMAGE_FOLDER = './data/train/images_download'

Features:

Concurrent downloads: parallel threads via ThreadPoolExecutor (thread count set by MAX_WORKERS)
Resume capability: Skips already downloaded images
Error handling: Continues on failed downloads with logging
Progress tracking: Real-time progress bar with tqdm
Timeouts: Configurable per-request timeout (REQUESTS_TIMEOUT) so stalled connections do not hold up worker threads

Performance:

Downloads ~75,000 images in parallel
Average speed: 14-17 images/second
Handles connection timeouts gracefully
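The download behaviour described in this stage (thread pool, resume by skipping existing files, continuing past failures) can be sketched with the standard library alone. `download_all` and `fetch_bytes` are hypothetical names, and the default fetcher uses urllib instead of the requests library the project lists, purely to keep the example dependency-free.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Default fetcher; any callable taking a URL and returning bytes works."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(jobs, folder, max_workers=8, fetch=fetch_bytes):
    """jobs: iterable of (filename, url). Returns (downloaded, skipped, failed)."""
    os.makedirs(folder, exist_ok=True)
    downloaded, skipped, failed = [], [], []

    def worker(filename, url):
        data = fetch(url)  # fetch first so a failure leaves no partial file
        with open(os.path.join(folder, filename), "wb") as handle:
            handle.write(data)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for filename, url in jobs:
            if os.path.exists(os.path.join(folder, filename)):
                skipped.append(filename)  # resume: file is already on disk
            else:
                futures[pool.submit(worker, filename, url)] = filename
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                downloaded.append(name)
            except Exception:
                failed.append(name)  # record the failure and keep going
    return downloaded, skipped, failed
```

Passing the fetcher as a parameter makes the function easy to exercise without network access, and running it a second time only retries the files that are still missing.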

Stage 5: Image Preprocessing

File: image_clean.ipynb

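import os
from PIL import Image
from tqdm import tqdm

def resize_images(image_folder, destination_folder, df):
    for sample_id in tqdm(df['sample_id']):
        image_name = f"{sample_id}.jpg"  # assumed file naming; adjust to the downloader's scheme
        image = Image.open(os.path.join(image_folder, image_name)).resize((100, 100), resample=Image.LANCZOS)
        image.save(os.path.join(destination_folder, image_name))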

Process:

Load images from images_download/
Resize to 100×100 pixels (standardized input size)
Use LANCZOS resampling for quality preservation
Save to images_processed/

Purpose:

Reduces computational cost for CNN feature extraction
Standardizes image dimensions across dataset
Trade-off: resizing straight to 100×100 stretches non-square images, so the original aspect ratio is not preserved

🔧 Feature Engineering

Text-Based Features

From catalog_content:

brand_name: Extracted brand (200+ brands recognized)
object_type: Core product description (2-4 words)
item_type: Product category (32 categories)
state: Solid, liquid, or other

Quantity Features

quantity_value_raw: Original quantity value
quantity_unit_raw: Original unit
quantity_value_base: Standardized quantity (g or ml)
total_base_qty: Total quantity (value × pack_count)
packs_count: Number of units per pack

Derived Features

item_name_length: Character count
has_brand: Binary indicator
has_quantity: Binary indicator
price_per_unit: price / total_base_qty
log_quantity: log(total_base_qty + 1)
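The derived features listed above are simple row-wise computations. The sketch below works on a plain dictionary to stay dependency-free (the project presumably uses pandas columns); `derived_features` is a hypothetical helper, and the handling of missing values is an assumption.

```python
import math


def derived_features(row):
    """row: dict with item_name, brand_name, total_base_qty and optionally price."""
    qty = row.get("total_base_qty")
    price = row.get("price")
    has_quantity = qty is not None and qty > 0
    return {
        "item_name_length": len(row.get("item_name") or ""),
        "has_brand": int(bool(row.get("brand_name"))),
        "has_quantity": int(has_quantity),
        # log1p(x) computes log(x + 1)
        "log_quantity": math.log1p(qty) if has_quantity else 0.0,
        # price is the prediction target, so this only exists for labelled rows
        "price_per_unit": price / qty if (has_quantity and price is not None) else None,
    }
```

Because price_per_unit depends on the target, it is available for training rows only and cannot be computed for unlabelled test rows.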

Image Features (Future)

CNN embeddings from ResNet/EfficientNet
Color histograms
Edge detection features

🤖 Models

1. XGBoost Regressor

File: Model_Train_0.5.ipynb

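from xgboost import XGBRegressor

# 200 boosting rounds at learning rate 0.1; all other parameters left at their defaults
xgb = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1
)
xgb.fit(X_train, y_train)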

Configuration:

Trees: 200
Learning rate: 0.1
Default parameters for depth, min_child_weight

Performance:

SMAPE: 69.12%
Training time: ~2 minutes (CPU)

Strengths:

Excellent handling of mixed data types
Robust to outliers
Interpretable feature importance

2. LightGBM Regressor

File: model.py (imported)

from lightgbm import LGBMRegressor

lgb = LGBMRegressor(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31
)

Strengths:

Faster training than XGBoost
Lower memory usage
Better with high-cardinality features

3. CatBoost Regressor

File: model.py (imported)

from catboost import CatBoostRegressor

catboost = CatBoostRegressor(
    iterations=200,
    learning_rate=0.1,
    depth=6
)

Strengths:

Native categorical feature support
No need for label encoding
Built-in overfitting detection

4. PyTorch MLP (Multi-Layer Perceptron)

File: model.py

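import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_features):
        super(MLP, self).__init__()
        # input -> 128 -> 32 -> 1, with GELU after every linear layer
        self.ff1 = nn.Sequential(
            nn.Linear(input_features, 128),
            nn.GELU(),
            nn.Linear(128, 32),
            nn.GELU(),
            nn.Linear(32, 1),
            nn.GELU()  # note: the output also passes through GELU
        )

    def forward(self, x):
        # Forward pass, not shown in the original excerpt; assumed to apply ff1 directly
        return self.ff1(x)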

Architecture:

Input layer: Variable (based on feature count)
Hidden layer 1: 128 neurons + GELU
Hidden layer 2: 32 neurons + GELU
Output layer: 1 neuron (price prediction)

Training Details:

Loss: MSE or Huber Loss
Optimizer: Adam (likely)
Activation: GELU (Gaussian Error Linear Unit)

Strengths:

Learns non-linear interactions
GPU acceleration support
Can incorporate image embeddings

📈 Evaluation

Custom Metric: SMAPE

File: Model_Train_0.5.ipynb

import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    # The small epsilon avoids division by zero when both values are 0
    return np.mean(
        2.0 * np.abs(y_pred - y_true) /
        (np.abs(y_true) + np.abs(y_pred) + 1e-12)
    ) * 100

Why SMAPE?

Symmetric: Swapping y_true and y_pred gives the same score (an over- and an under-prediction of the same absolute size still score differently)
Scale-independent: Fair comparison across different price ranges
Bounded: Returns values between 0-200%
Robust: Handles zero values with epsilon (1e-12)
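These properties can be checked on single values. The sketch below is a scalar version of the metric (the name `smape_single` is introduced here for illustration). "Symmetric" refers to swapping the roles of the true and predicted value; for a fixed true price, missing low by 50 costs more than missing high by 50.

```python
def smape_single(y_true, y_pred, eps=1e-12):
    """SMAPE for one pair of values, in percent."""
    return 2.0 * abs(y_pred - y_true) / (abs(y_true) + abs(y_pred) + eps) * 100


over = smape_single(100.0, 150.0)     # 2 * 50 / 250  -> about 40%
under = smape_single(100.0, 50.0)     # 2 * 50 / 150  -> about 66.67%
swapped = smape_single(150.0, 100.0)  # same as `over`: arguments are interchangeable
worst = smape_single(100.0, 0.0)      # 2 * 100 / 100 -> about 200%, the upper bound

print(round(over, 2), round(under, 2), round(swapped, 2), round(worst, 2))
# prints: 40.0 66.67 40.0 200.0
```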

Model Benchmarking

File: TestPipeline.py

benchmark = ModelBenchmark()

# Load models
benchmark.load_xgboost_model('./models/xgb.json')
benchmark.load_pytorch_model(mlp, './models/model_weights.pth')

# Compare performance
benchmark.compare_models(X_test, y_test)

# Deploy best models
benchmark.deploy_best_models(n=5)

Benchmark Metrics:

SMAPE
MAE (Mean Absolute Error)
RMSE (Root Mean Squared Error)
R² Score
Inference time
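For reference, the MAE, RMSE, and R² entries follow their usual definitions. The sketch below computes them in plain Python; it illustrates the formulas only and is not taken from ModelBenchmark.py, and `regression_metrics` is a hypothetical name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equally long sequences of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


print(regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```

In practice scikit-learn, which the project already uses for metrics, provides equivalent functions.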

💻 Usage

Training Pipeline

Data Preparation

# Parse catalog content
jupyter notebook Parser.ipynb
# Run all cells to generate cleaned_parsed.csv

Feature Extraction

python extract.py
# Generates train_cleaned.csv with extracted features

Download Images (Optional)

python image_loader.py
# Downloads images to data/train/images_download/

Exploratory Data Analysis

python eda.py
# Generates X_train.pkl, y_train.pkl, X_test.pkl, y_test.pkl

Train Models

jupyter notebook Model_Train_0.5.ipynb
# Run cells to train XGBoost, LightGBM, CatBoost, PyTorch models

Testing Pipeline

# Windows
run_file.bat TestPipeline.py

# Linux/Mac
python TestPipeline.py

Output:

✅ Loaded models:
  - XGBoost Model
  - PyTorch MLP

📊 Model Comparison:
  Model            SMAPE      MAE       RMSE      R²
  ────────────────────────────────────────────────
  XGBoost         69.12%    $12.45    $18.32    0.76
  PyTorch MLP     71.28%    $13.12    $19.01    0.74

🚀 Deploying best 2 models...
✅ Models deployed successfully!

Making Predictions

from CD_CI.automodel_deploy import load_and_predict

# Load deployment dataset
deployment_df = pd.read_csv("new_products.csv")

# Get predictions from all deployed models
predictions = load_and_predict(deployment_df)

print(predictions)
# Output: {'xgb_predictions': [...], 'pytorch_predictions': [...]}

📊 Results

Current Performance

The models were evaluated with the following metrics:

Model          SMAPE     MAE       RMSE      Average Training Time
XGBoost        69.12%    $12.45    $18.32    2 min
LightGBM       70.85%    $12.89    $18.67    1 min
CatBoost       71.34%    $13.21    $19.12    3 min
PyTorch MLP    71.28%    $13.12    $19.01    5 min

Feature Importance (XGBoost)

Top 10 features contributing to predictions:

total_base_qty - 18.5%
brand_name_encoded - 12.3%
item_type_encoded - 11.7%
state_encoded - 8.9%
packs_count - 7.2%
object_type_encoded - 6.5%
has_quantity - 5.8%
quantity_unit_base_encoded - 4.9%
item_name_length - 3.7%
has_brand - 3.2%

Parsing Statistics

Metric                       Count      Percentage
Total Products               150,000    100%
Successfully Parsed Names    145,223    96.8%
Extracted Brands             138,730    92.5%
Extracted Object Types       131,887    87.9%
Extracted Quantities         147,568    98.4%
Images Downloaded            143,250    95.5%

🔮 Future Work

Short-term Improvements

Ensemble Methods
- Weighted averaging of top 3 models
- Stacking with meta-learner
- Target: SMAPE < 65%
Image Features
- Extract CNN features using pre-trained ResNet50
- Add to feature matrix as additional columns
- Expected improvement: 3-5% SMAPE reduction
Hyperparameter Tuning
- Optuna/GridSearchCV for XGBoost, LightGBM
- Bayesian optimization
- Target: 1-2% SMAPE improvement
Advanced Text Features
- TF-IDF on product descriptions
- BERT embeddings for semantic understanding
- Sentiment analysis on reviews (if available)
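As an illustration of the weighted averaging idea listed under Ensemble Methods, the sketch below blends per-model predictions with fixed weights. It is not implemented in the repository; `weighted_average`, the model keys, and the weights are made up for the example, and how weights would be chosen (for instance from validation SMAPE) is left open.

```python
def weighted_average(predictions, weights):
    """Blend predictions from several models.

    predictions: dict of model name -> list of predicted prices (equal lengths).
    weights: dict of model name -> non-negative weight.
    """
    total = sum(weights[name] for name in predictions)
    n = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions.items()) / total
        for i in range(n)
    ]


blended = weighted_average(
    {"xgb": [10.0, 20.0], "lgbm": [14.0, 24.0]},
    {"xgb": 3.0, "lgbm": 1.0},
)
print(blended)  # prints: [11.0, 21.0]
```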

Long-term Enhancements

Multi-modal Deep Learning
- Combine text (BERT) + image (ResNet) embeddings
- Late fusion architecture
- End-to-end training
Category-specific Models
- Train separate models for each product category
- Better specialization for each item_type
External Data Integration
- Historical price trends
- Competitor pricing
- Market seasonality
Production Deployment
- FastAPI REST API
- Docker containerization
- CI/CD with GitHub Actions
- Model versioning with MLflow

🛠️ Technologies Used

Core ML Libraries

XGBoost: Gradient boosting
LightGBM: Fast gradient boosting
CatBoost: Categorical boosting
PyTorch: Deep learning
scikit-learn: Preprocessing, metrics

Data Processing

Pandas: Data manipulation
NumPy: Numerical computing
Pillow: Image processing

Utilities

tqdm: Progress bars
joblib: Model serialization
requests: HTTP requests for image download
concurrent.futures: Multi-threading

🙏 Acknowledgments

Amazon ML Challenge organizers
Kaggle community for inspiration
Open-source ML community

📧 Contact

For questions or collaboration opportunities:

Email: omeshmehta70@gmail.com
GitHub: @Omesh2004
