Real-time Bot Detection System for Web Server Logs
WebBotShield is a distributed streaming system that monitors web server logs in real-time to detect bot traffic using Apache Spark Structured Streaming, Kafka, and Elasticsearch. The system analyzes temporal patterns (burstiness), HTTP error ratios, and User-Agent characteristics to distinguish between human and automated traffic.
Nginx Logs → Fluentd → Kafka → Spark Streaming → Elasticsearch → Kibana
                                      ↓
                            Bot Detection Engine
                      (Temporal + Error + UA Analysis)
- Fluentd: Collects and parses Nginx access logs and forwards them to Kafka (an example record shape follows this list)
- Apache Kafka: Message broker for streaming log data
- Spark Structured Streaming: Real-time processing and bot detection engine
- Elasticsearch: Storage and indexing for logs and detection results
- Kibana: Visualization and dashboards for monitoring
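The exact record schema on the Kafka topic is defined by the Fluentd parsing rules in fluent.conf. As a purely illustrative sketch (field names here are assumptions, not the project's actual schema), each message carries one parsed access-log entry along these lines:

# Hypothetical shape of one parsed log record published to Kafka.
# Field names are illustrative; the real schema comes from fluent.conf.
record = {
    "remote_addr": "203.0.113.42",        # client IP, used for per-IP aggregation
    "timestamp": "2024-01-01T12:00:00Z",  # event time, used for windowing
    "method": "GET",
    "path": "/wp-admin",
    "status": 404,                        # feeds the 4xx/5xx error ratio
    "user_agent": "curl/8.4.0",           # feeds the User-Agent analysis
}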
- Burstiness Detection
  - Monitors request rate per IP in sliding 1-minute time windows (see the aggregation sketch after this list)
  - Flags IPs exceeding a configurable threshold (default: 100 req/min)
- Error Ratio Analysis
  - Calculates 4xx/5xx error ratios per IP
  - Identifies suspicious patterns (default threshold: 50% error rate)
- User-Agent Analysis
  - Detects missing/empty User-Agents
  - Pattern matching against known bot signatures (curl, wget, scrapy, etc.)
  - Analyzes User-Agent diversity
- Weighted Bot Scoring
  - Combines multiple factors: burstiness (40%) + errors (30%) + UA (30%)
  - Threshold-based classification (default: 0.7 = likely bot)
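The per-IP window aggregation behind these features maps naturally onto Spark Structured Streaming. The following is a minimal sketch, assuming the parsed stream exposes remote_addr, status, user_agent, and an event-time timestamp column (these names are assumptions; the actual columns are defined in bot_detector.py):

from pyspark.sql import functions as F

def per_ip_window_features(parsed_logs):
    """Aggregate per-IP request counts and ratios over sliding 1-minute windows."""
    return (
        parsed_logs
        .withWatermark("timestamp", "2 minutes")
        .groupBy(F.window("timestamp", "1 minute", "30 seconds"), "remote_addr")
        .agg(
            F.count("*").alias("request_count"),                              # burstiness signal
            F.avg((F.col("status") >= 400).cast("double")).alias("error_ratio"),  # 4xx/5xx share
            F.avg(
                (F.col("user_agent").isNull()
                 | F.col("user_agent").rlike("(?i)curl|wget|scrapy")).cast("double")
            ).alias("suspicious_ua_ratio"),                                   # UA heuristics
        )
    )

Each resulting row carries the raw counts and ratios that the weighted scoring step turns into a bot score.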
- Docker & Docker Compose
- Python 3.8+ (for log generator)
- At least 8GB RAM recommended
- 10GB free disk space
cd webshield

# Start all services
cd docker
docker-compose up -d
# Check service health
docker-compose ps

Services will be available at:
- Kafka: localhost:9092
- Spark Master UI: http://localhost:8080
- Elasticsearch: http://localhost:9200
- Kibana: http://localhost:5601
# Wait for Elasticsearch to be ready (30-60 seconds)
bash elasticsearch/setup_indices.sh

Or manually:
curl -X PUT "http://localhost:9200/webshield-logs" -H 'Content-Type: application/json' -d @elasticsearch/mappings.json
curl -X PUT "http://localhost:9200/webshield-detections" -H 'Content-Type: application/json' -d @elasticsearch/mappings.json
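If you script this step, you can poll the cluster health endpoint instead of sleeping for a fixed interval. A minimal standard-library sketch (endpoint and timing are reasonable defaults, not project settings):

import json
import time
import urllib.request

def wait_for_elasticsearch(url="http://localhost:9200/_cluster/health", timeout=120):
    """Poll Elasticsearch until the cluster reports yellow or green status."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                health = json.load(resp)
                if health.get("status") in ("yellow", "green"):
                    return True
        except OSError:
            pass  # Elasticsearch not accepting connections yet
        time.sleep(5)
    return False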
# Install Python dependencies
pip install -r spark/requirements.txt
# Generate continuous logs (5 min with mixed traffic)
python scripts/log_generator.py --mode continuous --duration 300 --normal-rate 5 --bot-rate 2

Options:
- --duration: Seconds to generate logs (default: 300)
- --normal-rate: Normal user requests/second (default: 5)
- --bot-rate: Bot requests/second (default: 2)
- --burst-probability: Probability of burst attacks (default: 0.05)
# Submit Spark job to cluster
docker exec -it webshield-spark-master bash
# Inside container
cd /opt/spark-apps/app
pip install -r ../requirements.txt
spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0 \
--master spark://spark-master:7077 \
bot_detector.py

The application will start processing logs and display bot detections in the console.
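For orientation, the core of the read path looks roughly like the sketch below: the spark-sql-kafka connector from the --packages list consumes the nginx-logs topic on the internal broker listener (kafka:29092, the address used elsewhere in this README). Column handling, offsets, and the console sink are simplifications of what bot_detector.py actually does:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("webshield-bot-detector").getOrCreate()

# Read raw access-log records from Kafka; each message value is a JSON string.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:29092")
    .option("subscribe", "nginx-logs")
    .option("startingOffsets", "latest")
    .load()
)

logs = raw.selectExpr("CAST(value AS STRING) AS json_value")

# Echo the stream to the console; the real job parses the JSON, scores it,
# and writes detections to Elasticsearch instead.
query = logs.writeStream.format("console").outputMode("append").start()
query.awaitTermination()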
- Open Kibana: http://localhost:5601
- Go to Management → Stack Management → Index Patterns
- Create index patterns:
  - webshield-logs*
  - webshield-detections*
- Navigate to Discover to explore logs and detections
- Create visualizations and dashboards
Edit config/settings.yaml to customize:
detection:
  burst_threshold: 100           # Requests/min for burst detection
  error_ratio_threshold: 0.5     # Error ratio threshold
  bot_score_threshold: 0.7       # Bot classification threshold
  window_duration: "1 minute"    # Analysis window size
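The streaming application can read these values at startup. A minimal loading sketch, assuming PyYAML is available in the Spark container (whether it is listed in spark/requirements.txt is an assumption):

import yaml  # PyYAML; assumed to be available where the job runs

# Load detection thresholds from the central configuration file.
with open("config/settings.yaml") as f:
    settings = yaml.safe_load(f)

detection = settings["detection"]
burst_threshold = detection["burst_threshold"]              # e.g. 100
error_ratio_threshold = detection["error_ratio_threshold"]  # e.g. 0.5
bot_score_threshold = detection["bot_score_threshold"]      # e.g. 0.7
window_duration = detection["window_duration"]              # e.g. "1 minute"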
webshield/
├── docker/
│ └── docker-compose.yml # Service orchestration
├── fluentd/
│ ├── Dockerfile # Fluentd with Kafka plugin
│ └── fluent.conf # Log parsing configuration
├── spark/
│ ├── app/
│ │ └── bot_detector.py # Main Spark streaming application
│ └── requirements.txt # Python dependencies
├── elasticsearch/
│ ├── mappings.json # Index templates
│ └── setup_indices.sh # Index creation script
├── kibana/
│ └── dashboards/ # Pre-built dashboards (optional)
├── logs/
│ └── nginx/ # Nginx log directory
├── scripts/
│ └── log_generator.py # Test log generator
├── config/
│ └── settings.yaml # Centralized configuration
└── README.md
For each IP address in a time window, the system calculates:
- Burst Score (0-1):
  - 1.0 if requests >= burst_threshold
  - Linear scaling between threshold/2 and threshold
  - 0.0 if below threshold/2
- Error Score (0-1):
  - Ratio of 4xx/5xx errors to total requests
  - Normalized against error_ratio_threshold
- User-Agent Score (0-1):
  - Ratio of requests with suspicious/missing User-Agents
- Final Bot Score:
  - bot_score = (burst_score × 0.4) + (error_score × 0.3) + (ua_score × 0.3)
- Classification (see the scoring sketch after this list):
  - bot_score >= 0.7 → classified as bot
  - Detection reasons logged for investigation
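Taken together, the scoring reduces to a small function. The sketch below mirrors the weights and default thresholds listed above; the actual implementation in bot_detector.py works on Spark columns rather than plain Python floats, and the exact normalization of the error score is an assumption:

def bot_score(request_count, error_ratio, suspicious_ua_ratio,
              burst_threshold=100, error_ratio_threshold=0.5):
    """Combine burstiness, error, and User-Agent signals into one weighted score."""
    # Burst score: 0 below threshold/2, 1 at or above threshold, linear in between.
    if request_count >= burst_threshold:
        burst = 1.0
    elif request_count < burst_threshold / 2:
        burst = 0.0
    else:
        burst = (request_count - burst_threshold / 2) / (burst_threshold / 2)

    # Error score: error ratio normalized against the configured threshold, capped at 1.
    error = min(error_ratio / error_ratio_threshold, 1.0)

    # UA score: share of requests with missing or known-bot User-Agents.
    ua = suspicious_ua_ratio

    return 0.4 * burst + 0.3 * error + 0.3 * ua

# Example: an IP sending 120 req/min with 60% errors and only curl-like UAs.
score = bot_score(120, 0.6, 1.0)   # 1.0
is_bot = score >= 0.7              # True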
Monitor streaming job health: http://localhost:8080
- Processing rate and latency
- Input/output records
- Batch processing time
# Check index health
curl http://localhost:9200/_cat/indices?v
# Query detections
curl http://localhost:9200/webshield-detections/_search?pretty
# Get bot alerts
curl -X GET "http://localhost:9200/webshield-detections/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"term": { "is_bot": true }
}
}'

# Check Fluentd logs
docker logs webshield-fluentd
# Verify Kafka topic
docker exec -it webshield-kafka kafka-topics --bootstrap-server localhost:9092 --list
docker exec -it webshield-kafka kafka-console-consumer --bootstrap-server localhost:9092 --topic nginx-logs --from-beginning --max-messages 5

# Check Spark worker logs
docker logs webshield-spark-worker
# Verify Kafka connectivity from Spark
docker exec -it webshield-spark-master bash
telnet kafka 29092

# Check ES health
curl http://localhost:9200/_cluster/health?pretty
# Check indices
curl http://localhost:9200/_cat/indices?v

Edit docker/docker-compose.yml:
spark-worker:
  environment:
    - SPARK_WORKER_MEMORY=4G   # Increase memory
    - SPARK_WORKER_CORES=4     # Increase cores

Edit config/settings.yaml:
spark:
  streaming:
    trigger_interval: "10 seconds"    # Faster processing
    max_offsets_per_trigger: 50000    # Higher throughput
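In Spark Structured Streaming these two settings typically correspond to the Kafka source option maxOffsetsPerTrigger and a processing-time trigger on the sink; exactly how bot_detector.py reads them from settings.yaml is an assumption, but the underlying API calls look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("webshield-tuning-demo").getOrCreate()

# maxOffsetsPerTrigger caps how many Kafka records each micro-batch reads;
# the processing-time trigger controls how often micro-batches run.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:29092")
    .option("subscribe", "nginx-logs")
    .option("maxOffsetsPerTrigger", 50000)
    .load()
)

query = (
    raw.selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start()
)
query.awaitTermination()

Raising maxOffsetsPerTrigger increases per-batch throughput at the cost of larger batches; shortening the trigger interval lowers end-to-end latency but adds scheduling overhead.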
- Machine Learning model integration for advanced detection
- Feature engineering for ML (session duration, path sequences, etc.)
- Alerting system (email, Slack, webhooks)
- Historical data analysis and model training pipeline
- IP reputation integration
- Behavioral profiling and anomaly detection
- Dashboard templates for Kibana
# Unit tests (to be added)
python -m pytest tests/
# Integration tests
python scripts/log_generator.py --mode batch

Edit spark/app/bot_detector.py:
from pyspark.sql.functions import col, when  # required imports

def custom_detection(self, df):
    """Add your custom detection logic."""
    # Example: flag requests to common admin paths as suspicious.
    df = df.withColumn(
        "suspicious_pattern",
        when(col("path").rlike("/admin|/wp-admin"), 1).otherwise(0),
    )
    return df
return df