The system is designed for medium and large e-commerce platforms handling millions of users and tens of thousands of products. It addresses three key business challenges:
- Generate personalized product recommendations based on user demographic profiles
- Analyze customer conversion paths and identify purchase abandonment points
- Monitor trends in real time, enabling immediate marketing responses
Complete analysis results, visualizations, and interactive dashboards are located in ecommerce_analytics.ipynb.
Below are a few of the many charts generated during the analysis:
The system uses three data sources:
- User activity from December 2019 (9.2 GB, 67,542,878 observations)
  - 62,986,067 views
  - 3,394,763 add-to-cart events
  - 1,162,048 purchases
- Product catalog (205,230 products across 1,162 categories)
- User profiles (5,000,000 records, 370 MB)
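As a quick consistency check, the three event counts listed for the activity log add up to the reported total, and the share of each event type follows directly. The sketch below uses only the figures given above; the variable names are illustrative and not taken from the project code.

```python
# Event counts as listed in the data-source description.
event_counts = {
    "view": 62_986_067,
    "add_to_cart": 3_394_763,
    "purchase": 1_162_048,
}
total_events = 67_542_878

# The per-type counts should add up to the reported number of observations.
counts_match = sum(event_counts.values()) == total_events

# Share of each event type in the full log, as a percentage.
shares = {name: 100 * n / total_events for name, n in event_counts.items()}

print("counts match total:", counts_match)
for name, pct in shares.items():
    print(f"{name:12s} {pct:6.2f}%")
```

Views dominate the log, which is worth keeping in mind when reading any per-event-type metric.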
The system is based on the Lambda architecture with three layers:
- Speed layer:
  - Apache Kafka as streaming buffer
  - Spark Streaming for real-time analysis
    - Five parallel analyses: active users, top 10 products, purchase metrics, event distribution, top 10 brands
- Batch layer:
  - Apache Airflow for workflow orchestration
  - HDFS as distributed file system
  - Apache Hive for metadata management
  - Apache Spark for batch analytics:
    - Customer conversion path analysis
    - Demographic recommendations
    - Brand performance analysis
- Serving layer:
  - Apache HBase for pre-computed results with millisecond latency
Main components deployed in the environment:
- Apache Hadoop 3.3.6 (HDFS)
- Apache Spark 3.5.3 (analytics engine)
- Apache Kafka 3.6.1 (event bus)
- Apache Hive 3.1.3 (data warehouse)
- Apache HBase 2.6.4-hadoop3 (serving layer)
- Apache Airflow 2.8.1 (orchestrator)
- Java 17 (OpenJDK 17.0.10)
The system is deployed on Google Cloud Platform:
- Region: Europe Central 2 (Warsaw)
- Instance: bigdata-vm, type n1-standard-4
- Operating System: Ubuntu 22.04.5 LTS
- Resources: 4 vCPU (Intel Xeon 2.20 GHz), 14 GB RAM, 194 GB disk
The system implements eight main DAGs:
- realtime_ecommerce_stream (@daily) - Simulates the user event stream, from CSV ingestion to Kafka publication
- kafka_to_hdfs_archiver (@hourly) - Archives data from Kafka to HDFS with time-based partitioning
- realtime_analytics_manager - Orchestrates Spark Streaming tasks
- funnel_analysis_daily - Daily customer conversion path analysis
- demographic_recommendations_daily - Daily demographic-based recommendation generation
- brand_performance_daily - Weekly brand efficiency analysis
- products_batch_pipeline - Batch processing of product catalog data
- users_batch_pipeline - Batch processing of user profiles
- Time mapping of events from December 2019 to December 2025/January 2026
- Kafka publication maintaining natural pace (~15 events/second)
- User ID-based partitioning ensuring event order preservation
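The timestamp mapping and user-based partitioning described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's simulator: a fixed offset is assumed for the time mapping, `REPLAY_START` and `NUM_PARTITIONS` are hypothetical values, and CRC32 stands in for the hash used by a real Kafka producer.

```python
import zlib
from datetime import datetime

# Assumed anchors: first timestamp in the source data and a hypothetical replay start.
SOURCE_START = datetime(2019, 12, 1)
REPLAY_START = datetime(2025, 12, 15)  # hypothetical, chosen for illustration
OFFSET = REPLAY_START - SOURCE_START

NUM_PARTITIONS = 3           # hypothetical partition count
SEND_INTERVAL = 1 / 15       # seconds between events at ~15 events/second


def remap_timestamp(event_time: datetime) -> datetime:
    """Shift a December 2019 timestamp into the replay period by a fixed offset."""
    return event_time + OFFSET


def partition_for(user_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user ID to a partition deterministically.

    All events of one user land in the same partition, so their relative
    order is preserved. CRC32 is a stand-in hash for this sketch.
    """
    return zlib.crc32(str(user_id).encode("utf-8")) % num_partitions


original = datetime(2019, 12, 20, 10, 0)
print(original, "->", remap_timestamp(original))
print("user 42 -> partition", partition_for(42))
```

With a real producer, the same ordering guarantee is usually obtained by sending the user ID as the message key and letting the client pick the partition.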
- Five-minute sliding time windows
- Results written to HBase for fast access
- Monitoring active users, popular products, and purchase metrics
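To make the five-minute sliding window concrete, here is a small pure-Python sketch that counts distinct active users per window. The slide interval of one minute is an assumption (the text only gives the window length), and the function names are illustrative; the actual system performs this step in Spark Streaming.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
SLIDE = timedelta(minutes=1)  # assumed; only the window length is stated
EPOCH = datetime(1970, 1, 1)


def window_starts(ts: datetime):
    """Yield the start of every sliding window [start, start + WINDOW) containing ts."""
    # Latest window start at or before ts, aligned to the slide interval.
    start = ts - ((ts - EPOCH) % SLIDE)
    while start > ts - WINDOW:
        yield start
        start -= SLIDE


def active_users_per_window(events):
    """Count distinct users per window from (timestamp, user_id) pairs."""
    users = defaultdict(set)
    for ts, user_id in events:
        for start in window_starts(ts):
            users[start].add(user_id)
    return {start: len(ids) for start, ids in users.items()}


sample = [
    (datetime(2025, 12, 30, 12, 0, 30), 1),
    (datetime(2025, 12, 30, 12, 3, 0), 2),
    (datetime(2025, 12, 30, 12, 3, 10), 1),
]
counts = active_users_per_window(sample)
for start in sorted(counts):
    print(start.time(), counts[start])
```

Each event contributes to every window that overlaps it, which is why a single event appears in five windows with these settings.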
- Batch processing after 250,000 events
- Parquet format with Snappy compression
- Hierarchical partitioning:
/raw/events/year=YYYY/month=MM/day=DD/hour=HH
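The partition layout above can be derived from an event timestamp as in the following sketch. The helper name `partition_path` is hypothetical; zero-padded month, day, and hour values are assumed from the MM/DD/HH pattern.

```python
from collections import defaultdict
from datetime import datetime

BASE_PATH = "/raw/events"


def partition_path(event_time: datetime, base: str = BASE_PATH) -> str:
    """Build the year/month/day/hour partition directory for a timestamp."""
    return (
        f"{base}/year={event_time:%Y}/month={event_time:%m}"
        f"/day={event_time:%d}/hour={event_time:%H}"
    )


def group_by_partition(events):
    """Group (timestamp, payload) pairs by their target partition directory."""
    groups = defaultdict(list)
    for ts, payload in events:
        groups[partition_path(ts)].append(payload)
    return dict(groups)


print(partition_path(datetime(2025, 12, 30, 14, 5)))
```

Grouping a batch this way before writing keeps each output file inside a single hourly directory, which lets later queries prune partitions by time.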
- Identification of user paths: view → cart → purchase
- Drop-off rate calculation by category and brand
- Average time between conversion stages
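The funnel metrics listed above can be illustrated with a small pure-Python sketch. It is not the project's Spark job: the event labels `view`, `cart`, and `purchase`, the tuple layout, and the definition of drop-off as one minus the share of users reaching the next stage are assumptions made for this example.

```python
from datetime import datetime
from statistics import mean

STAGES = ("view", "cart", "purchase")  # assumed event labels


def funnel_summary(events):
    """Summarize a view -> cart -> purchase funnel.

    events: iterable of (user_id, event_type, timestamp) tuples.
    """
    # Earliest time each user reached each stage, respecting stage order.
    reached = {}
    for user_id, event_type, ts in sorted(events, key=lambda e: e[2]):
        if event_type not in STAGES:
            continue
        stages = reached.setdefault(user_id, {})
        idx = STAGES.index(event_type)
        # A stage counts only if the previous stage was already reached.
        if event_type not in stages and (idx == 0 or STAGES[idx - 1] in stages):
            stages[event_type] = ts

    users_at = [sum(1 for s in reached.values() if st in s) for st in STAGES]
    summary = {}
    for i in range(len(STAGES) - 1):
        src, dst = STAGES[i], STAGES[i + 1]
        gaps = [(s[dst] - s[src]).total_seconds() for s in reached.values() if dst in s]
        summary[f"{src}->{dst}"] = {
            "drop_off_rate": 1 - users_at[i + 1] / users_at[i] if users_at[i] else None,
            "avg_seconds": mean(gaps) if gaps else None,
        }
    return summary


sample = [
    (1, "view", datetime(2025, 12, 30, 12, 0)),
    (1, "cart", datetime(2025, 12, 30, 12, 2)),
    (1, "purchase", datetime(2025, 12, 30, 12, 5)),
    (2, "view", datetime(2025, 12, 30, 12, 0)),
    (2, "cart", datetime(2025, 12, 30, 12, 4)),
    (3, "view", datetime(2025, 12, 30, 12, 0)),
    (4, "view", datetime(2025, 12, 30, 12, 0)),
]
result = funnel_summary(sample)
print(result)
```

Breaking the figures down by category or brand would amount to running the same summary once per group of events.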
- User segmentation by age (5 groups: 18-24, 25-34, 35-44, 45-54, 55+) and region
- Recommendation score = views + 2×add_to_cart + 5×purchases
- Top 10 products for each segment
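The scoring and ranking steps above can be sketched as follows. The score weights and age groups come from the list; the input row layout, the function names, and the handling of ages below 18 are assumptions, and the rows are expected to be pre-aggregated to one row per segment and product.

```python
from collections import defaultdict

# The four bounded age groups; 55 and above falls into "55+".
AGE_GROUPS = [(18, 24, "18-24"), (25, 34, "25-34"), (35, 44, "35-44"), (45, 54, "45-54")]


def age_group(age: int) -> str:
    """Map an age to one of the five segments; ages below 18 are not defined here."""
    if age >= 55:
        return "55+"
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    raise ValueError(f"no age group defined for age {age}")


def score(views: int, add_to_cart: int, purchases: int) -> int:
    """Recommendation score = views + 2 x add_to_cart + 5 x purchases."""
    return views + 2 * add_to_cart + 5 * purchases


def top_products(rows, n: int = 10):
    """Return the n best-scoring product IDs per (age_group, region) segment.

    rows: iterable of (age_group, region, product_id, views, add_to_cart, purchases).
    """
    by_segment = defaultdict(list)
    for group, region, product_id, views, carts, purchases in rows:
        by_segment[(group, region)].append((score(views, carts, purchases), product_id))
    return {
        segment: [pid for _, pid in sorted(items, key=lambda x: -x[0])[:n]]
        for segment, items in by_segment.items()
    }


rows = [
    ("25-34", "PL", "A", 10, 0, 0),  # score 10
    ("25-34", "PL", "B", 1, 1, 2),   # score 13
]
print(top_products(rows, n=1))
```

Because a purchase is weighted five times as heavily as a view, a product with few views but several purchases can outrank a heavily viewed one, as product B does here.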
- Monitoring of the top 100 brands by revenue
- Metrics: unique users, total views, purchases, revenue, conversion rate
- Analysis by age group
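The brand metrics listed above could be computed along these lines. This is a sketch only: the event tuple layout is hypothetical, and the conversion rate is assumed to mean purchases divided by views, which the text does not define.

```python
from collections import defaultdict


def brand_metrics(events, top_n: int = 100):
    """Aggregate per-brand metrics and return the top_n brands by revenue.

    events: iterable of (brand, user_id, event_type, price) tuples.
    """
    stats = defaultdict(lambda: {"users": set(), "views": 0, "purchases": 0, "revenue": 0.0})
    for brand, user_id, event_type, price in events:
        s = stats[brand]
        s["users"].add(user_id)
        if event_type == "view":
            s["views"] += 1
        elif event_type == "purchase":
            s["purchases"] += 1
            s["revenue"] += price

    rows = [
        {
            "brand": brand,
            "unique_users": len(s["users"]),
            "total_views": s["views"],
            "purchases": s["purchases"],
            "revenue": s["revenue"],
            # Assumed definition: purchases per view.
            "conversion_rate": s["purchases"] / s["views"] if s["views"] else None,
        }
        for brand, s in stats.items()
    ]
    rows.sort(key=lambda r: r["revenue"], reverse=True)
    return rows[:top_n]


sample = [
    ("acme", 1, "view", 10.0),
    ("acme", 1, "purchase", 10.0),
    ("acme", 2, "view", 10.0),
    ("zen", 3, "view", 5.0),
]
report = brand_metrics(sample)
print(report)
```

The per-age-group breakdown would add the age segment to the grouping key; it is left out here to keep the sketch short.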
Following a botnet attack, the firewall configuration was tightened:
- Restricted access to internal ports (Kafka 9092, Zookeeper 2181, HBase Thrift 9090)
- Public access only to web UIs: Airflow (8080), HDFS (9870), YARN (8088)
- SSH tunneling for internal service access
Status as of December 30, 2025:
- HDFS Archive: 224.9 MB of event data
- HBase realtime_stats: 67,114 metric rows
- Stability: System running stably for 9+ days without interruption
- Throughput: Simulator processing ~15 events/second for 14+ hours
E-CommerceBigDataSystem/
├── ecommerce_analytics.ipynb   # Analysis results and visualizations
├── preprocessing.ipynb         # Data preprocessing
├── airflow_dags/               # Airflow DAGs
├── analytics_scripts/          # Analytics scripts
├── data/                       # Input data
├── scripts/                    # Helper scripts
├── sql/                        # SQL and Spark queries
└── README.md                   # This file
Project developed by KGW Gawron team - January 7, 2026


