Skip to content

Annvalentina13/Distributed-Log-Processing-System

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

11 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Distributed Log Processing System

A Python-based distributed log processing system built with PySpark and Streamlit. This system processes log files at scale, performs analytics, generates reports, and provides an interactive dashboard for visualization.

🎯 Features

  • Distributed Processing: Uses PySpark for scalable log processing
  • Comprehensive Analytics: Error analysis, trends, IP/service breakdowns
  • Alert System: Configurable alerts based on thresholds
  • Interactive Dashboard: Modern Streamlit-based visualization
  • Automated Reporting: CSV and JSON report generation
  • Production Ready: Clean, modular, scalable architecture

πŸ“‚ Project Structure

distributed-log-system/
β”‚
β”œβ”€β”€ data/
β”‚   └── raw_logs/                 # Input CSV log files
β”‚
β”œβ”€β”€ reports/
β”‚   β”œβ”€β”€ csv/                      # CSV reports
β”‚   β”œβ”€β”€ json/                     # JSON reports
β”‚   └── alerts.log                # Alert history
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ spark/
β”‚   β”‚   β”œβ”€β”€ spark_session.py      # Spark session management
β”‚   β”‚   β”œβ”€β”€ ingest_logs.py        # Log ingestion
β”‚   β”‚   β”œβ”€β”€ parse_logs.py         # Log parsing & normalization
β”‚   β”‚   β”œβ”€β”€ analytics.py          # Core analytics
β”‚   β”‚   β”œβ”€β”€ alerts.py             # Alert system
β”‚   β”‚   └── export_reports.py     # Report generation
β”‚   β”‚
β”‚   β”œβ”€β”€ dashboard/
β”‚   β”‚   └── app.py                # Streamlit dashboard
β”‚   β”‚
β”‚   └── main.py                   # Main processing pipeline
β”‚
β”œβ”€β”€ config/
β”‚   └── config.yaml               # Configuration file
β”‚
β”œβ”€β”€ requirements.txt
└── README.md

πŸš€ Setup

Prerequisites

  • Python 3.8 or higher
  • Java 8 or higher (required for PySpark)

Installation

  1. Clone or navigate to the project directory

  2. Create a virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Verify Java installation

    java -version

πŸ“Š Input Data Format

Place your log CSV files in the data/raw_logs/ directory. The CSV files should contain at least the following columns:

  • timestamp: Timestamp of the log entry (various formats supported)
  • log_level: Log level (INFO, WARN, ERROR, DEBUG)
  • message: Log message content

Optional columns:

  • ip or IP: IP address
  • service: Service name
  • endpoint: Endpoint path

Example CSV Format

timestamp,log_level,message,ip,service
2024-01-15 10:30:00,INFO,Request processed successfully,192.168.1.100,api-service
2024-01-15 10:31:00,ERROR,Connection timeout exception,192.168.1.101,api-service
2024-01-15 10:32:00,WARN,High response time detected,192.168.1.102,web-service

πŸ”§ Configuration

Edit config/config.yaml to customize:

  • Spark settings: Memory, executor settings
  • Paths: Input/output directories
  • Alert thresholds: Error rate, error count, critical errors
  • Analytics: Top N errors, time windows
  • Dashboard: Auto-refresh settings

πŸƒ Running the System

1. Process Logs with PySpark

Run the main processing pipeline:

python src/main.py

Or run individual modules:

# Ingest logs
python src/spark/ingest_logs.py

# Parse logs
python src/spark/parse_logs.py

# Run analytics
python src/spark/analytics.py

# Check alerts
python src/spark/alerts.py

# Export reports
python src/spark/export_reports.py

2. Launch Dashboard

Start the Streamlit dashboard:

streamlit run src/dashboard/app.py

The dashboard will open in your browser at http://localhost:8501

πŸ“ˆ Dashboard Features

The interactive dashboard provides:

  • Key Metrics: Total logs, errors, error rate, active alerts
  • Charts:
    • Errors over time (line chart)
    • Top error types (bar chart)
    • Errors per IP (bar chart)
    • Errors per service (bar chart)
  • Filters: Date range, log level selection
  • Alert Panel: Recent alerts with timestamps
  • Detailed Tables: Top errors, trends, severity breakdown
  • Auto-refresh: Automatic data refresh capability

πŸ“‹ Generated Reports

After processing, reports are available in:

  • CSV Reports (reports/csv/):

    • errors_by_type.csv: Errors grouped by type
    • errors_by_hour.csv: Errors by hour
    • top_errors.csv: Top N most frequent errors
    • errors_per_ip.csv: Errors grouped by IP
    • errors_per_service.csv: Errors grouped by service
  • JSON Reports (reports/json/):

    • error_trends.json: Error trends over time
    • errors_by_day.json: Errors by day
    • errors_by_severity.json: Errors by severity level
  • Alert Log (reports/alerts.log): History of all alerts

πŸ”” Alert System

The system automatically checks for:

  1. High Error Rate: Exceeds configured threshold (default: 10%)
  2. High Error Count: Total errors exceed threshold (default: 100)
  3. Critical Errors: Critical error count exceeds threshold (default: 5)

Alerts are:

  • Printed to console
  • Logged to reports/alerts.log
  • Displayed in the dashboard

πŸ› οΈ Development

Code Structure

  • Modular Design: Each module has a specific responsibility
  • PySpark Only: All data processing uses PySpark (no Pandas for transformations)
  • Configurable: Settings in YAML configuration
  • Logging: Comprehensive logging throughout
  • Error Handling: Robust error handling and validation

Adding New Analytics

  1. Add function to src/spark/analytics.py
  2. Call it in run_all_analytics()
  3. Export results in export_reports.py
  4. Visualize in dashboard app.py

Customizing Alerts

Edit config/config.yaml to adjust:

  • alerts.error_rate_threshold
  • alerts.error_count_threshold
  • alerts.critical_error_count

πŸš€ Deployment

Local Execution

Currently configured for local[*] execution. To run on a cluster:

  1. Update config/config.yaml:

    spark:
      master: "spark://your-cluster:7077"
  2. Ensure Spark cluster is accessible

  3. Run the same commands

Cloud Deployment

The code is structured to be cloud-ready. For AWS EMR, Databricks, or other platforms:

  1. Package the code
  2. Update Spark master URL
  3. Ensure data paths are accessible (S3, HDFS, etc.)
  4. Deploy and run

πŸ“ Notes

  • No Pandas for Processing: All analytics use PySpark DataFrames
  • Caching: Intermediate results are cached for performance
  • Partitioning: Spark optimizations are applied automatically
  • Scalability: Designed to handle large-scale log processing

πŸ› Troubleshooting

Java Not Found

  • Install Java 8+ and set JAVA_HOME environment variable

Spark Session Errors

  • Check Java installation
  • Verify Spark dependencies are installed
  • Check configuration file paths

Dashboard Not Loading Data

  • Ensure Spark processing has been run first
  • Check that reports are generated in reports/ directory
  • Verify file paths in configuration

Import Errors

  • Ensure you're running from the project root
  • Check that all dependencies are installed
  • Verify Python path includes project directory

πŸ“„ License

This project is provided as-is for educational and development purposes.

πŸ‘₯ Contributing

Feel free to extend this system with:

  • Additional analytics
  • New visualization types
  • Enhanced alerting
  • Performance optimizations

Built with PySpark & Streamlit πŸš€

About

Modern distributed apps produce huge logs across Linux, Spark, and applications. Manual inspection for errors, trends, and root causes is slow and unreliable. A centralized automated system is required to ingest, parse, analyze, and visualize logs in real time to improve reliability and faster decisions.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors