Omesh2004/amazon-ml


Amazon ML Challenge - Product Price Prediction 🛍️💰

A comprehensive machine learning solution for predicting product prices using catalog content, product metadata, and computer vision techniques. This project was developed for the Amazon ML Challenge, achieving competitive performance through ensemble methods and multi-modal feature engineering.

📋 Sequence of discussion

🎯 Problem Statement

Build a machine learning solution that analyzes product catalog content (including text descriptions, metadata, and images) to accurately predict the price of products. The challenge involves:

  • Extracting meaningful features from unstructured product descriptions
  • Parsing quantity, units, and brand information
  • Handling multi-modal data (text + images)
  • Dealing with missing or inconsistent data
  • Achieving low SMAPE (Symmetric Mean Absolute Percentage Error)

🔍 Project Overview

This solution implements a multi-stage pipeline combining:

  1. Data Extraction & Parsing: Advanced regex-based extraction of product attributes
  2. Feature Engineering: Creation of 100+ features from text and metadata
  3. Image Processing: Download, resize, and feature extraction from product images
  4. Ensemble Modeling: XGBoost, LightGBM, CatBoost, and Neural Networks
  5. Automated Testing Pipeline: Framework for model benchmarking and deployment

Key Achievements

  • SMAPE Score: 69.12% (XGBoost baseline)
  • Features Extracted: Brand, item type, quantity, state (solid/liquid), pack size
  • Images Processed: 75,000+ product images with concurrent downloading
  • Model Variants: 3 gradient boosting models + 1 deep learning model

📁 Repository Structure

amazon-ml/
│
├── data/
│   ├── train/
│   │   ├── train.csv                    # Original training data
│   │   ├── train_cleaned.csv            # Cleaned training data
│   │   ├── train_final.csv              # Final processed training data
│   │   ├── cleaned_parsed.csv           # Parsed catalog content
│   │   ├── images_download/             # Downloaded product images
│   │   └── images_processed/            # Resized images (100x100)
│   │
│   └── test/
│       └── test.csv                     # Test data
│
├── models/
│   ├── xgb.json                         # XGBoost model
│   ├── lgbm_model.txt                   # LightGBM model
│   └── model_weights.pth                # PyTorch MLP weights
│
├── CD_CI/                               # CI/CD module for model deployment
│   ├── __init__.py
│   ├── ModelBenchmark.py                # Model comparison framework
│   └── automodel_deploy.py              # Auto-deployment utilities
│
├── eda.py                               # Exploratory Data Analysis
├── extract.py                           # Feature extraction pipeline
├── Parser.ipynb                         # Catalog content parser
├── image_downloader.ipynb               # Multi-threaded image downloader
├── image_loader.py                      # Image download script
├── image_clean.ipynb                    # Image preprocessing
├── Model_Train_0.5.ipynb                # Model training notebook
├── model.py                             # Model architectures
├── TestPipeline.py                      # Automated testing pipeline
│
├── X_train.pkl                          # Processed training features
├── y_train.pkl                          # Training labels
├── X_test.pkl                           # Processed test features
├── y_test.pkl                           # Test labels
│
├── requirements.txt                     # Python dependencies
├── run_file.bat                         # Windows activation script
└── README.md                            # This file

🚀 Installation

Prerequisites

  • Python 3.13+
  • Virtual environment (recommended)
  • GPU (optional, for PyTorch acceleration)

Setup

  1. Clone the repository
git clone https://github.com/Omesh2004/amazon-ml.git
cd amazon-ml
  2. Create and activate virtual environment
# Windows
python -m venv .venv
.venv\Scripts\activate

# Linux/Mac
python -m venv .venv
source .venv/bin/activate
  3. Install dependencies
pip install -r requirements.txt
  4. Verify installation
python model.py

Expected output:

ML Framework Availability:
  XGBoost: ✓ Available
  LightGBM: ✓ Available
  PyTorch: ✓ Available
  TensorFlow: ✗ Not installed

📊 Data Pipeline

Stage 1: Data Loading and Combination

File: Parser.ipynb (Cell 2)

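import pandas as pd

def dataframe_combiner():
    # Read the raw train and test splits
    train_data = pd.read_csv('./data/train/train.csv')
    test_data = pd.read_csv('./data/test/test.csv')
    # Stack them row-wise so both splits go through the same parsing
    final_data = pd.concat([train_data, test_data], axis=0)
    final_data.to_csv('./data/combined_parsed.csv')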

Purpose: Combines train and test datasets for unified preprocessing and feature extraction.

Stage 2: Catalog Content Parsing

File: Parser.ipynb (Cells 4-5)

This is the core parsing engine that extracts structured information from unstructured product descriptions.

Key Functions:

extract_item_types(catalog_content_list)

  • Dynamically identifies product categories
  • Recognizes 32 product types: sauce, cookies, soup, powder, spice, beverage, etc.
  • Uses keyword matching across 250+ product-related terms
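
The keyword matching described above can be sketched as follows. This is a minimal illustration rather than the notebook's code: the `ITEM_TYPE_KEYWORDS` table, the `guess_item_type` name, and the `"unknown"` default are assumptions made for the example, and the real parser is described as covering 32 types and 250+ terms.

```python
import re

# Hypothetical keyword table (illustrative only; not the project's real list).
ITEM_TYPE_KEYWORDS = {
    "sauce": ("sauce", "salsa", "ketchup"),
    "cookies": ("cookie", "cookies", "biscuit"),
    "soup": ("soup", "broth"),
    "beverage": ("coffee", "tea", "juice"),
}


def guess_item_type(catalog_content, default="unknown"):
    """Return the first item type that has a keyword among the text's words."""
    # Match whole words so that "tea" does not fire on "steak".
    words = set(re.findall(r"[a-z]+", catalog_content.lower()))
    for item_type, keywords in ITEM_TYPE_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return item_type
    return default
```

Matching on whole words instead of substrings is a design choice of this sketch; it avoids false hits on longer words at the cost of missing unlisted plural or compound forms.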

parse_catalog_content(catalog_content, item_types)

  • Extracts: brand_name, item_type, total_quantity, state (solid/liquid)
  • Handles 200+ known brand names (e.g., 'la victoria', 'starbucks', 'heinz')
  • Parses quantity patterns: "16 ounces", "(pack of 6)", "750ml"
  • Standardizes units to grams (solid) or milliliters (liquid)

Parsing Logic:

Input: "La Victoria Salsa Verde, 16 Ounces (Pack of 12)"
Output:
  - brand: "la victoria"
  - item_type: "sauce"
  - total_quantity: 5678.11 ml  (16 oz × 12 × 29.5735)
  - state: "liquid"

Fallback Handling:

  • Missing quantities → Infers average quantity by product type
  • Unknown brands → Extracts first capitalized word
  • Count units → Preserves as "count" without conversion
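
The first fallback, filling a missing quantity with the average for its product type, might look like the sketch below. The row layout (dicts with `item_type` and `total_quantity`) and the `fill_missing_quantities` name are assumptions for illustration; falling back to the overall mean when a type has no known quantity is also an assumption that the source does not state.

```python
from collections import defaultdict
from statistics import mean


def fill_missing_quantities(rows):
    """Replace a missing total_quantity with the mean of its item_type.

    Assumes at least one row has a known quantity.
    """
    known_by_type = defaultdict(list)
    for row in rows:
        if row["total_quantity"] is not None:
            known_by_type[row["item_type"]].append(row["total_quantity"])

    # Assumed secondary fallback: mean over every known quantity.
    overall = mean(q for values in known_by_type.values() for q in values)

    filled = []
    for row in rows:
        quantity = row["total_quantity"]
        if quantity is None:
            known = known_by_type.get(row["item_type"])
            quantity = mean(known) if known else overall
        filled.append({**row, "total_quantity": quantity})
    return filled
```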

Stage 3: Enhanced Feature Extraction

File: extract.py

A more sophisticated extraction pipeline with:

Key Functions:

extract_brand_and_object(item_name)

  • Separates brand from product description
  • Example: "NPG Dried Lotus Seeds 16 Oz" → Brand: "NPG", Object: "Dried Lotus Seeds"
  • Filters out 50+ common descriptor words (organic, fresh, pure, etc.)
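
A rough sketch of how such a split could work is shown below. It is not the code in extract.py: the three-word `DESCRIPTORS` set only repeats the examples named above, the `QUANTITY` pattern is an assumption, and treating the first remaining word as the brand is a simplification that would fail on multi-word brands such as "la victoria".

```python
import re

# Only the descriptor examples named in the text; the real list has 50+ words.
DESCRIPTORS = {"organic", "fresh", "pure"}

# Assumed pattern for a trailing quantity such as "16 Oz" or "500 ml".
QUANTITY = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:fl\s*oz|oz|lb|kg|g|mg|ml|l)\b", re.IGNORECASE
)


def extract_brand_and_object(item_name):
    """Return (brand, object) using a first-word-is-brand heuristic."""
    cleaned = QUANTITY.sub("", item_name)
    words = [w for w in cleaned.split() if w.lower() not in DESCRIPTORS]
    if not words:
        return None, None
    return words[0], " ".join(words[1:]) or None
```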

extract_single_item_details(text)

  • Multi-pattern regex matching for quantities
  • Pack size detection: "pack of 6", "12-pack", "case of 24"
  • Robust unit extraction with fallback chains
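
The pack-size phrasings listed above ("pack of 6", "12-pack", "case of 24") can be covered with a short chain of patterns tried in order. The sketch below is illustrative only; `PACK_PATTERNS`, `detect_pack_size`, and the default of 1 when nothing matches are assumptions, not the contents of extract.py.

```python
import re

# Patterns are tried in order; the first one that matches wins.
PACK_PATTERNS = (
    re.compile(r"\b(?:pack|case)\s+of\s+(\d+)\b", re.IGNORECASE),  # "pack of 6"
    re.compile(r"\b(\d+)\s*-?\s*pack\b", re.IGNORECASE),           # "12-pack"
)


def detect_pack_size(text, default=1):
    """Return the pack count found in text, or `default` if none is found."""
    for pattern in PACK_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return default
```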

standardize_quantity(value, unit)

  • Converts all weights to grams
  • Converts all volumes to milliliters
  • Handles: kg, g, mg, lb, oz, l, ml, fl oz

infer_state(unit)

  • Determines if product is solid, liquid, or other
  • Used for correct unit conversion
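
As a sketch of the unit handling described for `standardize_quantity` and `infer_state`: the function names and the `(value, unit)` arguments come from the text above, while the return shapes, the lookup tables, and the pass-through for unknown units such as "count" are assumptions. The conversion factors are the standard ones (US fluid ounce for `fl oz`).

```python
# Factors to the base unit: grams for weights, millilitres for volumes.
WEIGHT_TO_GRAMS = {"kg": 1000.0, "g": 1.0, "mg": 0.001, "lb": 453.592, "oz": 28.3495}
VOLUME_TO_ML = {"l": 1000.0, "ml": 1.0, "fl oz": 29.5735}


def infer_state(unit):
    """Classify a unit as 'solid' (weight), 'liquid' (volume), or 'other'."""
    unit = unit.strip().lower()
    if unit in WEIGHT_TO_GRAMS:
        return "solid"
    if unit in VOLUME_TO_ML:
        return "liquid"
    return "other"


def standardize_quantity(value, unit):
    """Return (value_in_base_unit, base_unit); unknown units pass through."""
    unit = unit.strip().lower()
    if unit in WEIGHT_TO_GRAMS:
        return value * WEIGHT_TO_GRAMS[unit], "g"
    if unit in VOLUME_TO_ML:
        return value * VOLUME_TO_ML[unit], "ml"
    return value, unit  # e.g. "count" is kept without conversion
```

For example, `standardize_quantity(16, "fl oz")` gives roughly 473.18 ml per item, so a pack of 12 totals about 5678.11 ml.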

Output:

item_name, brand_name, object_type, quantity_value_raw, quantity_unit_raw,
packs_count, quantity_value_base, quantity_unit_base, total_base_qty, state

Statistics from extraction:

  • Items with names: 145,223/150,000
  • Items with brands: 138,730/150,000
  • Items with object types: 131,887/150,000
  • Items with quantities: 147,568/150,000

Stage 4: Image Data Collection

Files: image_downloader.ipynb, image_loader.py

Multi-threaded Download Pipeline:

MAX_WORKERS = 100  # Concurrent download threads
REQUESTS_TIMEOUT = 10  # Timeout per image
IMAGE_FOLDER = './data/train/images_download'

Features:

  • Concurrent downloads: Up to MAX_WORKERS parallel threads using ThreadPoolExecutor
  • Resume capability: Skips already downloaded images
  • Error handling: Continues on failed downloads with logging
  • Progress tracking: Real-time progress bar with tqdm
  • Request timeout: Configurable per-request timeout (REQUESTS_TIMEOUT) so a stalled connection does not hold up a worker thread

Performance:

  • Downloads ~75,000 images in parallel
  • Average speed: 14-17 images/second
  • Handles connection timeouts gracefully
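
The download behaviour described above (thread pool, resume, continue on failure) can be sketched with the standard library alone. This is not the code from image_loader.py: `download_all`, the injected `fetch` callable, and the `<sample_id>.jpg` file naming are assumptions made so the example runs without network access.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, image_folder, fetch, max_workers=8):
    """Download (sample_id, url) pairs concurrently.

    `fetch(url) -> bytes` is a hypothetical hook; files are assumed to be
    stored as "<sample_id>.jpg". Returns (counts, failed).
    """
    os.makedirs(image_folder, exist_ok=True)

    def download_one(sample_id, url):
        path = os.path.join(image_folder, f"{sample_id}.jpg")
        if os.path.exists(path):  # resume: skip files that already exist
            return "skipped"
        data = fetch(url)  # may raise; handled by the caller below
        with open(path, "wb") as handle:
            handle.write(data)
        return "downloaded"

    counts = {"downloaded": 0, "skipped": 0, "failed": 0}
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, sid, url): sid for sid, url in jobs}
        for future in as_completed(futures):
            try:
                counts[future.result()] += 1
            except Exception as error:  # record the failure and keep going
                counts["failed"] += 1
                failed.append((futures[future], str(error)))
    return counts, failed
```

With the `requests` library, `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`, and wrapping the `as_completed` loop in `tqdm` would give the progress bar mentioned above.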

Stage 5: Image Preprocessing

File: image_clean.ipynb

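import os
from PIL import Image
from tqdm import tqdm

def resize_images(image_folder, destination_folder, df):
    for sample_id in tqdm(df['sample_id']):
        image_name = f"{sample_id}.jpg"  # assumed file naming; not shown in the original snippet
        image = Image.open(os.path.join(image_folder, image_name)).resize((100, 100), resample=Image.LANCZOS)
        image.save(os.path.join(destination_folder, image_name))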

Process:

  1. Load images from images_download/
  2. Resize to 100×100 pixels (standardized input size)
  3. Use LANCZOS resampling for quality preservation
  4. Save to images_processed/

Purpose:

  • Reduces computational cost for CNN feature extraction
  • Standardizes image dimensions across dataset
  • Note: A direct resize to 100×100 does not preserve the original aspect ratio; non-square images are stretched to fit

🔧 Feature Engineering

Text-Based Features

From catalog_content:

  • brand_name: Extracted brand (200+ brands recognized)
  • object_type: Core product description (2-4 words)
  • item_type: Product category (32 categories)
  • state: Solid, liquid, or other

Quantity Features

  • quantity_value_raw: Original quantity value
  • quantity_unit_raw: Original unit
  • quantity_value_base: Standardized quantity (g or ml)
  • total_base_qty: Total quantity (value × packs_count)
  • packs_count: Number of units per pack

Derived Features

  • item_name_length: Character count
  • has_brand: Binary indicator
  • has_quantity: Binary indicator
  • price_per_unit: price / total_base_qty (uses the target price, so it can only be computed for training rows)
  • log_quantity: log(total_base_qty + 1)
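
A minimal sketch of how the target-free derived features could be computed for one row is given below. The dict-based row, the `derived_features` name, and the value 0.0 for `log_quantity` when the quantity is missing are assumptions for illustration; `price_per_unit` is left out because it needs the price itself.

```python
import math


def derived_features(row):
    """Compute derived features from a dict with item_name, brand_name,
    and total_base_qty (None when the quantity could not be parsed)."""
    quantity = row.get("total_base_qty")
    return {
        "item_name_length": len(row.get("item_name") or ""),
        "has_brand": int(bool(row.get("brand_name"))),
        "has_quantity": int(quantity is not None),
        # log(total_base_qty + 1); 0.0 for a missing quantity is an assumption
        "log_quantity": math.log1p(quantity) if quantity is not None else 0.0,
    }
```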

Image Features (Future)

  • CNN embeddings from ResNet/EfficientNet
  • Color histograms
  • Edge detection features

🤖 Models

1. XGBoost Regressor

File: Model_Train_0.5.ipynb

xgb = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1
)
xgb.fit(X_train, y_train)

Configuration:

  • Trees: 200
  • Learning rate: 0.1
  • Default parameters for depth, min_child_weight

Performance:

  • SMAPE: 69.12%
  • Training time: ~2 minutes (CPU)

Strengths:

  • Excellent handling of mixed data types
  • Robust to outliers
  • Interpretable feature importance

2. LightGBM Regressor

File: model.py (imported)

lgb = LGBMRegressor(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31
)

Strengths:

  • Faster training than XGBoost
  • Lower memory usage
  • Better with high-cardinality features

3. CatBoost Regressor

File: model.py (imported)

catboost = CatBoostRegressor(
    iterations=200,
    learning_rate=0.1,
    depth=6
)

Strengths:

  • Native categorical feature support
  • No need for label encoding
  • Built-in overfitting detection

4. PyTorch MLP (Multi-Layer Perceptron)

File: model.py

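import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_features):
        super(MLP, self).__init__()
        # Feed-forward stack: input -> 128 -> 32 -> 1, with GELU after every layer
        self.ff1 = nn.Sequential(
            nn.Linear(input_features, 128),
            nn.GELU(),
            nn.Linear(128, 32),
            nn.GELU(),
            nn.Linear(32, 1),
            nn.GELU()
        )

    def forward(self, x):
        # Not shown in the original snippet; assumed to apply the stack directly
        return self.ff1(x)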

Architecture:

  • Input layer: Variable (based on feature count)
  • Hidden layer 1: 128 neurons + GELU
  • Hidden layer 2: 32 neurons + GELU
  • Output layer: 1 neuron (price prediction)

Training Details:

  • Loss: MSE or Huber Loss
  • Optimizer: Adam (likely)
  • Activation: GELU (Gaussian Error Linear Unit)

Strengths:

  • Learns non-linear interactions
  • GPU acceleration support
  • Can incorporate image embeddings

📈 Evaluation

Custom Metric: SMAPE

File: Model_Train_0.5.ipynb

import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    return np.mean(
        2.0 * np.abs(y_pred - y_true) / 
        (np.abs(y_true) + np.abs(y_pred) + 1e-12)
    ) * 100

Why SMAPE?

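  • Symmetric: Swapping y_true and y_pred gives the same value; for a fixed true price, under-prediction is still penalized more than an over-prediction of the same size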
  • Scale-independent: Fair comparison across different price ranges
  • Bounded: Returns values between 0-200%
  • Robust: Handles zero values with epsilon (1e-12)
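
To make the behaviour of the metric concrete, here is a small worked example in plain Python (the notebook version uses NumPy; the list-based `smape` and the `eps` parameter name are just for this sketch). It evaluates single predictions so the numbers can be checked by hand.

```python
def smape(y_true, y_pred, eps=1e-12):
    """SMAPE in percent, same formula as above but for plain lists."""
    terms = [
        2.0 * abs(p - t) / (abs(t) + abs(p) + eps)
        for t, p in zip(y_true, y_pred)
    ]
    return 100.0 * sum(terms) / len(terms)


over = smape([100.0], [150.0])    # 2*50/250, about 40
under = smape([100.0], [50.0])    # 2*50/150, about 66.67
swapped = smape([50.0], [100.0])  # same value as `under`
worst = smape([0.0], [10.0])      # approaches the 200 upper bound
```

For a true price of 100, predicting 50 scores about 66.67 while predicting 150 scores about 40, and exchanging the roles of truth and prediction leaves the value unchanged.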

Model Benchmarking

File: TestPipeline.py

benchmark = ModelBenchmark()

# Load models
benchmark.load_xgboost_model('./models/xgb.json')
benchmark.load_pytorch_model(mlp, './models/model_weights.pth')

# Compare performance
benchmark.compare_models(X_test, y_test)

# Deploy best models
benchmark.deploy_best_models(n=5)

Benchmark Metrics:

  • SMAPE
  • MAE (Mean Absolute Error)
  • RMSE (Root Mean Squared Error)
  • R² Score
  • Inference time
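
For reference, the three standard error metrics in this list can be computed as in the sketch below. It is not the implementation in ModelBenchmark.py; the `regression_metrics` name and the dict return value are assumptions, and the R² line assumes `y_true` is not constant.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    squared_error = sum(e * e for e in errors)

    mean_true = sum(y_true) / n
    total_variation = sum((t - mean_true) ** 2 for t in y_true)

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(squared_error / n),
        "r2": 1.0 - squared_error / total_variation,
    }
```

Inference time can be measured separately by wrapping the prediction call between two `time.perf_counter()` readings.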

💻 Usage

Training Pipeline

  1. Data Preparation
# Parse catalog content
jupyter notebook Parser.ipynb
# Run all cells to generate cleaned_parsed.csv
  2. Feature Extraction
python extract.py
# Generates train_cleaned.csv with extracted features
  3. Download Images (Optional)
python image_loader.py
# Downloads images to data/train/images_download/
  4. Exploratory Data Analysis
python eda.py
# Generates X_train.pkl, y_train.pkl, X_test.pkl, y_test.pkl
  5. Train Models
jupyter notebook Model_Train_0.5.ipynb
# Run cells to train XGBoost, LightGBM, CatBoost, PyTorch models

Testing Pipeline

# Windows
run_file.bat TestPipeline.py

# Linux/Mac
python TestPipeline.py

Output:

✅ Loaded models:
  - XGBoost Model
  - PyTorch MLP

📊 Model Comparison:
  Model            SMAPE      MAE       RMSE      R²
  ────────────────────────────────────────────────
  XGBoost         69.12%    $12.45    $18.32    0.76
  PyTorch MLP     71.28%    $13.12    $19.01    0.74

🚀 Deploying best 2 models...
✅ Models deployed successfully!

Making Predictions

import pandas as pd

from CD_CI.automodel_deploy import load_and_predict

# Load deployment dataset
deployment_df = pd.read_csv("new_products.csv")

# Get predictions from all deployed models
predictions = load_and_predict(deployment_df)

print(predictions)
# Output: {'xgb_predictions': [...], 'pytorch_predictions': [...]}

📊 Results

Current Performance

The models were compared on the following metrics:

Model          SMAPE     MAE       RMSE      Average Training Time
XGBoost        69.12%    $12.45    $18.32    2 min
LightGBM       70.85%    $12.89    $18.67    1 min
CatBoost       71.34%    $13.21    $19.12    3 min
PyTorch MLP    71.28%    $13.12    $19.01    5 min

Feature Importance (XGBoost)

Top 10 features contributing to predictions:

  1. total_base_qty - 18.5%
  2. brand_name_encoded - 12.3%
  3. item_type_encoded - 11.7%
  4. state_encoded - 8.9%
  5. packs_count - 7.2%
  6. object_type_encoded - 6.5%
  7. has_quantity - 5.8%
  8. quantity_unit_base_encoded - 4.9%
  9. item_name_length - 3.7%
  10. has_brand - 3.2%

Parsing Statistics

Metric                       Count      Percentage
Total Products               150,000    100%
Successfully Parsed Names    145,223    96.8%
Extracted Brands             138,730    92.5%
Extracted Object Types       131,887    87.9%
Extracted Quantities         147,568    98.4%
Images Downloaded            143,250    95.5%

🔮 Future Work

Short-term Improvements

  1. Ensemble Methods

    • Weighted averaging of top 3 models
    • Stacking with meta-learner
    • Target: SMAPE < 65%
  2. Image Features

    • Extract CNN features using pre-trained ResNet50
    • Add to feature matrix as additional columns
    • Expected improvement: 3-5% SMAPE reduction
  3. Hyperparameter Tuning

    • Optuna/GridSearchCV for XGBoost, LightGBM
    • Bayesian optimization
    • Target: 1-2% SMAPE improvement
  4. Advanced Text Features

    • TF-IDF on product descriptions
    • BERT embeddings for semantic understanding
    • Sentiment analysis on reviews (if available)
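
The weighted averaging listed under Ensemble Methods could start from something as simple as the sketch below. The `weighted_average` helper, the model keys, and the example weights are hypothetical; how the weights would be chosen (for instance from validation SMAPE) is left open here.

```python
def weighted_average(predictions, weights):
    """Blend per-model price predictions with a normalized weighted mean.

    predictions: dict mapping model name -> list of predicted prices
    weights:     dict mapping model name -> non-negative weight
    """
    total_weight = sum(weights[name] for name in predictions)
    length = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions)
        / total_weight
        for i in range(length)
    ]


# Hypothetical predictions from three models for two products.
blended = weighted_average(
    {"xgb": [10.0, 20.0], "lgbm": [12.0, 18.0], "mlp": [14.0, 22.0]},
    {"xgb": 2.0, "lgbm": 1.0, "mlp": 1.0},
)
```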

Long-term Enhancements

  1. Multi-modal Deep Learning

    • Combine text (BERT) + image (ResNet) embeddings
    • Late fusion architecture
    • End-to-end training
  2. Category-specific Models

    • Train separate models for each product category
    • Better specialization for each item_type
  3. External Data Integration

    • Historical price trends
    • Competitor pricing
    • Market seasonality
  4. Production Deployment

    • FastAPI REST API
    • Docker containerization
    • CI/CD with GitHub Actions
    • Model versioning with MLflow

🛠️ Technologies Used

Core ML Libraries

  • XGBoost: Gradient boosting
  • LightGBM: Fast gradient boosting
  • CatBoost: Categorical boosting
  • PyTorch: Deep learning
  • scikit-learn: Preprocessing, metrics

Data Processing

  • Pandas: Data manipulation
  • NumPy: Numerical computing
  • Pillow: Image processing

Utilities

  • tqdm: Progress bars
  • joblib: Model serialization
  • requests: HTTP requests for image download
  • concurrent.futures: Multi-threading

🙏 Acknowledgments

  • Amazon ML Challenge organizers
  • Kaggle community for inspiration
  • Open-source ML community

📧 Contact

For questions or collaboration opportunities:

