ShopSight Prototype

A prototype e-commerce analytics system that provides product insights including historical sales trends, demand forecasts, customer segmentation, and actionable business recommendations.

Quick Start

Prerequisites

  • Python 3.8+
  • OpenAI API key (for LLM features, optional but recommended)

Installation

# Clone the repository
git clone git@github.com:stevezkw1998/shop-sight-prototype.git
cd shop-sight-prototype

# Install dependencies
pip install -r requirements.txt

Run the Demo

# Basic usage - will prompt for product search
python examples/shop_insight.py

Environment Setup

For LLM features, set your API key:

export OPENAI_API_KEY="your-api-key-here"

Thought Process & Priorities

What I Prioritized

  1. Database (Use Case) Learning (Prerequisite)

    • Database schema + sample data → a Markdown file of database information and insights
    • Reason: this gives the LLM the context it needs to understand the use case
  2. End-to-End Core Flow (Must Have)

    • Product search → Historical sales visualization + Forecasted demand + Likely customer segments
  3. LLM Integration (Must Have)

    • Natural language insights generation
    • Enhanced forecasting with LLM data analytics and business insights
    • Makes insights accessible to non-technical users
  4. Real Data Over Mocking (Should Have)

    • The H&M dataset is rich enough to support real analysis
    • Real data builds credibility and shows actual capability
  5. Comprehensive Analytics (Nice to Have)

    • Actionable Suggestions
    • All computable from existing data, so why not include them?
  6. Polished Frontend (Should Have)

Why This Approach

  • Speed: Terminal UI is fastest to build, focuses on functionality over polish
  • Credibility: Real data demonstrates actual capability, not just mockups
  • Simplicity: Statistical forecasting is fast, explainable, and sufficient for a prototype
  • Extensibility: LLM integration shows how to enhance with context-aware intelligence

Assumptions

  • Data Access: S3 bucket s3://kumo-public-datasets/hm_with_images/ is publicly accessible (anonymous read)
  • Terminal UI: Terminal-based interface is acceptable for prototype demonstration
  • Forecasting: Simple statistical methods are sufficient for a prototype; the optional LLM-enhanced forecast adds product context and learned business insights on top of the data analysis
  • Time Horizon: 4-week forecast horizon is reasonable for demonstration

What's Real vs. What's Mocked

✅ All Real (No Mocking)

Feature | Implementation | Data Source
Product Search | Real SQL queries via DuckDB | S3 Parquet files
Historical Sales | Real transaction aggregation | transactions table
Customer Segments | Real customer data joins | customers + transactions tables
Price Trends | Real price analysis | transactions.price field
Sales Channels | Real channel distribution | transactions.sales_channel_id
Customer Loyalty | Real repeat purchase analysis | transactions.customer_id
LLM Insights | Real API calls | OpenAI/LiteLLM

⚠️ Simplified (Not Mocked, But Not Production-Grade)

Feature | Implementation | Why Simplified
Forecasting | Statistical weighted average + trend by default; optional LLM-enhanced forecast using product context | Fast, explainable, sufficient for demo. Production would use Predictive AI.
UI | Terminal-based charts (plotext) | Fast to build. Production would be a web dashboard.
Customer Segmentation | Basic demographics (age, membership, preferences) | Real data, but production would use RFM analysis or clustering.

Key Insight: The dataset is rich enough that we didn't need to mock anything. All insights are based on real data, just using simpler methods than production systems.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    User Query                                │
│              "Nike running shoes"                            │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              Product Search (Real)                           │
│         Query articles table via DuckDB                      │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│         Historical Sales Analysis (Real)                     │
│  • Load transactions                                        │
│  • Aggregate by week                                         │
│  • Generate charts                                           │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│         Enhanced Analytics (Real + Simple)                   │
│  ┌──────────────────────────────────────────┐              │
│  │ Forecast (Time Series / LLM)              │              │
│  │ • Statistical: weighted avg + trend       │              │
│  │ • Optional: LLM with product context       │              │
│  └──────────────────────────────────────────┘              │
│  ┌──────────────────────────────────────────┐              │
│  │ Customer Segments (Real Data)             │              │
│  │ • Join transactions + customers          │              │
│  │ • Age, membership, preferences           │              │
│  └──────────────────────────────────────────┘              │
│  ┌──────────────────────────────────────────┐              │
│  │ Additional Insights (Real Data)           │              │
│  │ • Price trends, channels, loyalty         │              │
│  └──────────────────────────────────────────┘              │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│         LLM Synthesis (Real)                                 │
│  • Schema context + product attributes                       │
│    (from Database Learning: schema + sample data analysis)   │
│  • Combine all insights into natural language                │
│  • Actionable recommendations                                │
└─────────────────────────────────────────────────────────────┘
                     ▲
                     │
┌────────────────────┴────────────────────────────────────────┐
│  Database Learning (Prerequisites)                          │
│  • Analyze schema + sample data                             │
│  • Generate: docs/database_schema.md                        │
│  • Provides: field meanings, relationships & insights       │
└─────────────────────────────────────────────────────────────┘

Features

Core Flow (End-to-End Working)

  • ✅ Product search by name
  • ✅ Historical sales visualization (weekly aggregation)
  • ✅ Terminal charts (units sold, revenue trends)
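
To make the weekly aggregation step concrete, here is a minimal sketch in plain Python. It is an illustration only, not the project's DuckDB query: the `(purchase_date, price)` row shape, the Monday week start, and the function name `aggregate_weekly` are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_weekly(transactions):
    """Sum units and revenue per week, keyed by the Monday that starts the week.

    `transactions` is an iterable of (purchase_date, price) tuples, one per
    unit sold. This row shape is an assumption made for illustration.
    """
    weekly = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for purchase_date, price in transactions:
        # weekday() is 0 for Monday, so this snaps every date to its Monday.
        week_start = purchase_date - timedelta(days=purchase_date.weekday())
        weekly[week_start]["units"] += 1
        weekly[week_start]["revenue"] += price
    # Return weeks in chronological order so they can be charted directly.
    return dict(sorted(weekly.items()))


# Hypothetical sample rows: two sales in one week, one in the next.
sample = [
    (date(2020, 9, 9), 19.99),
    (date(2020, 9, 7), 19.99),
    (date(2020, 9, 14), 24.50),
]
weekly_sales = aggregate_weekly(sample)
for week_start, totals in weekly_sales.items():
    print(week_start, totals["units"], round(totals["revenue"], 2))
```

The per-week `units` and `revenue` series are the kind of input a terminal chart or a forecast would consume.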

Enhanced Analytics

  • ✅ Demand forecast (next 4 weeks)
    • Statistical method (default): weighted average + trend
    • LLM method (optional): context-aware with product attributes
  • ✅ Customer segmentation
    • Age distribution, membership status, preferences
    • Active vs. inactive customers
    • First-time vs. repeat buyers
  • ✅ Additional insights
    • Price trends over time
    • Sales channel distribution (online vs. in-store)
    • Customer loyalty metrics
    • Product lifecycle stage
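
The statistical forecast is described only as "weighted average + trend", so the following is one possible reading rather than the repository's actual code. It assumes linearly increasing weights over the most recent weeks, a trend equal to the average week-over-week change in that window, and forecasts clamped at zero; the function name, `window`, and `horizon` parameters are hypothetical.

```python
def forecast_demand(weekly_units, horizon=4, window=8):
    """Forecast the next `horizon` weeks from a list of weekly unit counts.

    Level: weighted average of the last `window` weeks, with more recent
    weeks weighted more heavily (weights 1, 2, ..., n).
    Trend: average week-over-week change across the same window.
    """
    recent = list(weekly_units)[-window:]
    if not recent:
        return [0.0] * horizon
    n = len(recent)
    weights = range(1, n + 1)
    level = sum(w * u for w, u in zip(weights, recent)) / sum(weights)
    trend = (recent[-1] - recent[0]) / (n - 1) if n > 1 else 0.0
    # Demand cannot be negative, so clamp each projected value at zero.
    return [max(0.0, level + trend * step) for step in range(1, horizon + 1)]


print(forecast_demand([1, 2, 3, 4]))  # rising series
print(forecast_demand([4, 3, 2, 1]))  # falling series, clamped at zero
```

A method this simple is easy to explain to a non-technical reader, which matches the stated reason for choosing it; it does not model seasonality or produce confidence intervals.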

LLM Integration

  • ✅ Natural language insights generation
    • Combines all analytics into readable summary
    • Includes database schema context for better understanding
    • Provides actionable business recommendations
  • ✅ LLM-enhanced forecasting (optional)
    • Considers product type, department, seasonality
    • Uses product attributes for context-aware predictions

Unimplemented Features & How to Build Them

1. Web Dashboard

Gap: Currently terminal-based UI
Approach:

  • Build React/Next.js frontend
  • Use Plotly or D3.js for interactive charts
  • Create REST API wrapper around existing Python logic
  • Add real-time updates via WebSockets

2. Advanced Forecasting Models

Gap: Using simple statistical methods
Approach:

  • Use Kumo AI's Predictive AI for production-grade forecasting
  • Implement LLM-based AI judge to evaluate forecast accuracy
  • Apply self-improved prompts to iteratively enhance prediction quality
  • Add seasonal decomposition and confidence intervals
  • Consider external factors (promotions, holidays)

3. Advanced Customer Segmentation

Gap: Basic demographics only
Approach:

  • Implement RFM (Recency, Frequency, Monetary) analysis
  • Use clustering algorithms (K-means, DBSCAN)
  • Build predictive scoring models
  • Create customer personas
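
As a starting point for the RFM idea above, the sketch below computes the three raw metrics per customer. The `(customer_id, purchase_date, price)` row shape, the `as_of` reference date, and the function name are assumptions for illustration; bucketing the metrics into scores or clusters would be a later step.

```python
from datetime import date


def rfm_table(transactions, as_of):
    """Compute Recency, Frequency, and Monetary values per customer.

    `transactions` is an iterable of (customer_id, purchase_date, price)
    tuples (an assumed shape). Recency is measured in days before `as_of`.
    """
    stats = {}
    for customer_id, purchase_date, price in transactions:
        entry = stats.setdefault(
            customer_id,
            {"last": purchase_date, "frequency": 0, "monetary": 0.0},
        )
        entry["last"] = max(entry["last"], purchase_date)
        entry["frequency"] += 1
        entry["monetary"] += price
    return {
        customer_id: {
            "recency_days": (as_of - entry["last"]).days,
            "frequency": entry["frequency"],
            "monetary": entry["monetary"],
        }
        for customer_id, entry in stats.items()
    }


# Hypothetical sample: one repeat buyer and one first-time buyer.
rows = [
    ("c1", date(2020, 9, 1), 20.0),
    ("c1", date(2020, 9, 15), 30.0),
    ("c2", date(2020, 8, 1), 10.0),
]
rfm = rfm_table(rows, as_of=date(2020, 9, 22))
print(rfm)
```

Customers with `frequency` greater than 1 are the repeat buyers, so the same table also supports a first-time vs. repeat split.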

4. Product Comparison

Gap: Single product analysis only
Approach:

  • Extend search to support multiple products
  • Create side-by-side comparison views
  • Add relative performance metrics
  • Enable "similar products" recommendations

5. Natural Language Search

Gap: Keyword-based search only
Approach:

  • Use LLM to parse natural language queries
  • Convert to structured SQL queries
  • Support complex queries ("products popular with young customers")
  • Add query suggestions and autocomplete

6. Text to SQL Integration

Gap: Prototype function exists (database/client.py::text_to_sql) but not integrated into main workflow
Approach: Integrate Text to SQL into product search and analytics pipeline. Text to SQL significantly enhances query flexibility and reduces repetitive SQL design work.
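
One way to approach the integration is to pair a schema-aware prompt with a guard that accepts only read-only statements before anything reaches the database. The sketch below is hypothetical and is not the existing `text_to_sql` function: the prompt wording, both function names, and the keyword list are assumptions, and the guard is a conservative heuristic rather than a SQL parser.

```python
import re

# Keywords that indicate a statement is not a plain read-only query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|copy|attach)\b",
    re.IGNORECASE,
)


def build_text_to_sql_prompt(question, schema_markdown):
    """Build an LLM prompt from a user question and schema documentation."""
    return (
        "You translate questions into a single DuckDB SELECT statement.\n"
        "Use only the tables and columns described below.\n\n"
        f"Schema:\n{schema_markdown}\n\n"
        f"Question: {question}\nSQL:"
    )


def is_safe_select(sql):
    """Return True only for a single statement that starts with SELECT or WITH.

    Deliberately strict: a query that merely mentions a forbidden keyword
    inside a string literal is rejected too.
    """
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # more than one statement
    if not re.match(r"(?is)^(select|with)\b", statement):
        return False
    return FORBIDDEN.search(statement) is None


prompt = build_text_to_sql_prompt(
    "products popular with young customers",
    "articles(article_id, prod_name)",  # placeholder schema text
)
print(is_safe_select("SELECT prod_name FROM articles LIMIT 5;"))
print(is_safe_select("DROP TABLE articles"))
```

The schema text passed to the prompt could come from docs/database_schema.md, which already exists for giving the LLM database context.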

7. Real-Time Updates

Gap: Static analysis based on historical data
Approach:

  • Set up data pipeline (Kafka, Airflow)
  • Implement incremental data loading
  • Add caching layer (Redis)
  • Create scheduled refresh jobs

Project Structure

shop-sight-prototype/
├── core/
│   └── llm.py              # LLM service wrapper
├── database/
│   └── client.py           # DuckDB + S3 client
├── examples/
│   ├── shop_insight.py     # Main demo script (full features)
│   └── product_sales_trend.py  # Simpler version
├── docs/
│   └── database_schema.md  # Database documentation
├── prompts/
│   └── base.py             # LLM prompt templates
└── requirements.txt        # Dependencies

Technical Choices

  • Python: Fast development, rich data science ecosystem
  • DuckDB: Direct S3 Parquet reading, no local storage needed (🚀blazing-fast🚀)
  • LiteLLM: Unified LLM interface, easy to switch providers
  • Terminal UI: Fastest to build, focuses on functionality
  • Statistical Forecasting: Simple, explainable, sufficient for demo

Demo Tips

  1. Try different products: Search for "dress", "jacket", "shoes" to see varied results
  2. Compare methods: Run with --llm-forecast and again with the default to see the difference
  3. Check logs: All sessions are logged to history/shop_insight/ by default
  4. Use known IDs: If you know an article_id, use --article-id to skip search

License

This is a prototype for a take-home exercise.
