🚀 Prometheus + Grafana Monitoring Stack with Alerting & Slack Integration

📌 Overview

This project demonstrates a complete end-to-end Monitoring and Observability stack built using Prometheus, Grafana, and Slack integration. It showcases how infrastructure metrics and application metrics can be collected, queried using PromQL, visualized through dashboards, and converted into actionable alerts.

The system monitors:

🖥️ Server health and availability
🧠 Memory utilization
💽 Disk usage
⚙️ CPU load
🌐 Application HTTP request metrics
🚨 Real-time alert notifications via Slack

The implementation reflects a real-world DevOps monitoring workflow: metrics are scraped from exporters, stored in Prometheus, and visualized in Grafana, and alerts are delivered to collaboration platforms such as Slack.

This project simulates a production-style monitoring setup suitable for cloud-based Linux servers and modern application environments.

🏗️ 1. System Architecture

The architecture follows a distributed model, with each service running separately to avoid bottlenecks and allow scaling. Data flows from agents on the Application Server to the central stores: Prometheus for metrics and Loki for logs.

📡 Infrastructure Port Mapping

Service	Port	Description
Grafana	3000	Unified UI for Dashboards & Alerts
Prometheus	9090	Time-series Metrics Database
Loki	3100	High-efficiency Log Aggregation
Node Exporter	9100	Hardware & OS Metric Exposure
Titan App	5000	Python Flask Web Application
Grafana Alloy	12345	Unified Observability Agent

🛠️ 2. The Observability Stack

Metrics Management: Prometheus uses a "Pull" (Scraping) model to collect time-series data from exposed endpoints.
Log Management: Grafana Loki serves as a cost-effective log aggregation system, receiving logs pushed from agents.
Visualization: Grafana acts as the unified frontend, connecting to Prometheus and Loki to create interactive dashboards.
Unified Agent: Grafana Alloy handles metrics, logs, and traces in a single binary, replacing legacy agents like Promtail.
Infrastructure: Hosted on AWS EC2 instances running Ubuntu 24.04, using the t2.micro instance type for development.
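
To make the pull model concrete, here is a minimal sketch of the two sides of a scrape, using only the Python standard library. One function renders a counter in the Prometheus text exposition format (what an exporter serves at its metrics endpoint), and the other parses that text back, roughly as a scraper would. The `endpoint` label and both function names are illustrative assumptions, not code from this project, and the parser only handles simple lines without timestamps.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


def parse_samples(text):
    """Parse exposition text into {series: float}, skipping comment lines."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes "series value" with no trailing timestamp.
        series, value = line.rsplit(" ", 1)
        result[series] = float(value)
    return result


if __name__ == "__main__":
    body = render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"endpoint": "/payment"}, 3)],  # hypothetical label and value
    )
    print(body, end="")
    print(parse_samples(body))
```

In a real deployment the rendering side is normally handled by a client library or an exporter such as Node Exporter, and Prometheus itself does the parsing on each scrape.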

⚙️ 3. Automated Deployment

The environment is provisioned using automated Bash scripts to ensure consistency and minimize configuration errors.

Grafana Setup: grafana-setup.sh
Prometheus Setup: prometheus-setup.sh
Loki Setup: lokisetup.sh
Application Node: webnode_setup.sh

🔍 4. Data Exploration & PromQL

This project leverages PromQL (Prometheus Query Language) to extract actionable insights from raw telemetry.

Instant Vectors: up == 1
Range Vectors: up[5m]
Resource Arithmetic: Memory & disk percentage calculations
Rate Analysis: rate(http_requests_total[1m])
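
The "Resource Arithmetic" queries are not spelled out in this README. The sketch below shows what such expressions might look like, assuming the default Node Exporter metric names (`node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`, `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`). These are assumptions and not necessarily the exact queries used in the project. The helper `used_percent` is a hypothetical name that mirrors the same arithmetic in plain Python.

```python
# Hypothetical PromQL expressions, assuming default Node Exporter metric names.
MEMORY_USED_PERCENT = (
    "100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)"
)
ROOT_DISK_USED_PERCENT = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}'
    ' / node_filesystem_size_bytes{mountpoint="/"})'
)


def used_percent(total_bytes, available_bytes):
    """Same arithmetic as the expressions above, for a single sample."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (1.0 - available_bytes / total_bytes)


if __name__ == "__main__":
    # Example: 8 GB total with 2 GB available.
    print(used_percent(8_000_000_000, 2_000_000_000))
```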

📈 5. Dashboards & Visualizations

Time Series Panels: CPU & traffic trends
Gauge Panels: Memory & disk visualization
Dynamic Variables: Interactive endpoint filtering

🚨 6. Alerting & Incident Management

Slack Contact Point: #alerts-prods webhook integration
Threshold Rules: Root disk usage > 65%
Notification Policies: Managed repeat intervals
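
In this project the alert is evaluated and delivered by Grafana through its Slack contact point, so no custom code is required. Purely as an illustration of what that delivery amounts to, the sketch below applies the 65% threshold and builds the JSON body that a Slack incoming webhook accepts (a POST with a `text` field). The function names, the message wording, and the webhook URL passed in by the caller are all assumptions.

```python
import json
import urllib.request

DISK_THRESHOLD_PERCENT = 65.0  # threshold taken from the rule above


def should_alert(used_percent, threshold=DISK_THRESHOLD_PERCENT):
    """Fire only when usage is strictly above the threshold."""
    return used_percent > threshold


def build_slack_payload(instance, used_percent):
    """Build the JSON-serializable body for a Slack incoming webhook."""
    return {
        "text": (
            f"Root disk usage on {instance} is {used_percent:.1f}% "
            f"(threshold {DISK_THRESHOLD_PERCENT:.0f}%)"
        )
    }


def send_to_slack(webhook_url, payload, timeout=5):
    """POST the payload to the webhook; returns the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    usage = 71.3  # hypothetical reading
    if should_alert(usage):
        # Printing instead of sending; pass a real webhook URL to send_to_slack.
        print(json.dumps(build_slack_payload("app-server", usage)))
```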

📷 7. Project Gallery (37-Step Workflow)

🏗️ Phase 1: Infrastructure & Service Deployment

Snapshot 1: Grafana Installation Automation

Executing grafana-setup.sh to automate Grafana installation and service enablement.

Snapshot 2: Initial Grafana UI Access

Accessing Grafana on port 3000 for the first time after successful deployment.

Snapshot 3: Grafana Admin Login & Dashboard Home

Successful administrative login confirming Grafana service health.

Snapshot 4: Prometheus Systemd Deployment

Running prometheus-setup.sh to install Prometheus as a managed system service.

Snapshot 5: Prometheus Self-Scraping Verification

Validating that Prometheus is scraping its internal metrics successfully.

Snapshot 6: EC2 Instance Overview

Displaying provisioned Prometheus and Grafana EC2 instances.

Snapshot 7: Loki Service Status Confirmation

Terminal confirmation that the Loki service is active and running.

Snapshot 8: Loki API Readiness Check

Verification that the Loki API endpoint is responding as "ready" on port 3100.
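
A readiness check like this one can also be scripted. The sketch below builds the health URL for a service and interprets a Loki response. The paths used are the health endpoints these tools expose (`/ready` for Loki, `/api/health` for Grafana, `/-/healthy` for Prometheus). The function names and the host value are hypothetical, and the port is passed in by the caller.

```python
from urllib.request import urlopen

# Health-check paths per service.
HEALTH_PATHS = {
    "grafana": "/api/health",
    "prometheus": "/-/healthy",
    "loki": "/ready",
}


def health_url(service, host, port):
    """Build the health-check URL for one of the known services."""
    return f"http://{host}:{port}{HEALTH_PATHS[service]}"


def loki_is_ready(status_code, body):
    """Loki answers HTTP 200 with the text 'ready' once it can serve requests."""
    return status_code == 200 and body.strip() == "ready"


def check_loki(host, port=3100, timeout=5):
    """Fetch Loki's readiness endpoint; non-200 answers raise urllib.error.HTTPError."""
    with urlopen(health_url("loki", host, port), timeout=timeout) as response:
        return loki_is_ready(response.status, response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical private IP; replace with the Loki host.
    print(health_url("loki", "10.0.0.5", 3100))
```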

Snapshot 9: Application Node Automation

Executing webnode_setup.sh to configure the Titan application server.

📡 Phase 2: Application Monitoring & Data Flow

Snapshot 10: Titan Application Live

Python Flask Titan application running successfully on port 5000.

Snapshot 11: Payment Endpoint Validation

Testing the /payment endpoint to simulate application traffic.

Snapshot 12: Application Metrics Exposure

Viewing internal metrics exposed by the Flask application.

Snapshot 13: Node Exporter Metrics Endpoint

System-level hardware and OS metrics exposed on port 9100.

Snapshot 14: Active Load Simulation

Using top to verify active load generation scripts consuming CPU.

Snapshot 15: Log File Generation

Confirming application logs are being generated in /var/log/titan/.

🔍 Phase 3: PromQL Mastery & Target Discovery

Snapshot 16: Healthy Scrape Targets

Prometheus UI showing all configured scrape targets as healthy.

Snapshot 17: Node Status Check (up Metric)

Executing up query to validate instance availability.
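
The same `up` query can be run outside the UI through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming the caller supplies the server's base URL. The helper names are hypothetical, and the sample response only mimics the shape of an instant-vector result.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def instances_up(response):
    """Map each instance label to True/False from an instant-query result for `up`."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = {}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        _timestamp, value = series["value"]  # value arrives as a string
        result[instance] = float(value) == 1.0
    return result


def query(base_url, expr, timeout=5):
    """Run the query against a live server and return the decoded JSON."""
    with urlopen(build_query_url(base_url, expr), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hand-written sample in the shape of an instant-vector response.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"instance": "app:9100", "job": "node"},
                 "value": [1700000000.0, "1"]},
            ],
        },
    }
    print(instances_up(sample))
```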

Snapshot 18: Historical Query Using Range Vector

Retrieving historical status using up[5m].

Snapshot 19: Counting Active Targets

Using count(up) to calculate total monitored targets.

Snapshot 20: Scalar Conversion

Applying scalar() to convert a single-element instant vector into a scalar value.

Snapshot 21: Total System Memory Query

Querying total system memory in bytes.

Snapshot 22: Used Memory Calculation

Calculating used memory through metric subtraction.

Snapshot 23: Root Filesystem Utilization

Computing the percentage of the root filesystem that is still available.

Snapshot 24: Filtering Active Instances

Filtering active nodes using up == 1.

Snapshot 25: High Load Detection

Identifying nodes experiencing high load averages.

Snapshot 26: Combined Condition Filtering

Finding nodes that are both operational and under load.

Snapshot 27: Offset Query for Historical Memory

Using offset 5m to analyze previous memory states.

Snapshot 28: HTTP Request Counter

Tracking http_requests_total metric for Titan application.

Snapshot 29: Request Rate Analysis

Calculating per-second request rate using rate().
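
For intuition about what a per-second rate over a counter means, here is a simplified sketch in Python. It sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the elapsed time. Prometheus's `rate()` additionally extrapolates to the edges of the range window, which this sketch deliberately omits, so results can differ slightly. The function name and sample data are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _timestamp, value in samples[1:]:
        if value < previous:
            # Counter reset: assume it restarted from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # Five samples, 15 s apart, growing by 15 requests each time.
    window = [(0, 10), (15, 25), (30, 40), (45, 55), (60, 70)]
    print(simple_rate(window))
```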

Snapshot 30: Metric Aggregation by Labels

Summing metrics grouped by specific job labels.

🚨 Phase 4: Visualization & Alerting Integration

Snapshot 31: Prometheus Data Source Connection

Connecting Prometheus to Grafana using private IP configuration.

Snapshot 32: Grafana Explore Live Query

Testing real-time metric queries in Grafana Explore.

Snapshot 33: Slack Webhook Configuration

Establishing Slack webhook integration for alert notifications.

Snapshot 34: Time Series Panel Creation

Building performance monitoring panels using Time Series visualization.

Snapshot 35: Bar Chart Visualization

Using bar charts to compare categorical infrastructure metrics.

Snapshot 36: Memory Gauge Panel

Creating a Gauge panel to visualize available system memory.

Snapshot 37: Alert Threshold Configuration

Defining color thresholds and alert triggers for critical metrics.

🧪 8. Testing & Simulation

System Load: load.sh CPU spike testing
Traffic Simulation: curl-based endpoint stress
Log Generation: Fake logs in /var/log/titan/
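
The project's own test scripts are written in Bash and are not reproduced here. As a rough illustration of the log-generation step, the sketch below produces reproducible fake log lines and appends them to a file in a chosen directory. The log format, level names, file name, and function names are all assumptions.

```python
import os
import random
from datetime import datetime, timedelta, timezone

LEVELS = ["INFO", "WARNING", "ERROR"]  # hypothetical log levels
ENDPOINTS = ["/", "/payment"]          # "/payment" is the endpoint tested above


def generate_log_lines(count, seed=0, start=None):
    """Return `count` fake log lines, one second apart, reproducible per seed."""
    rng = random.Random(seed)
    start = start or datetime(2024, 1, 1, tzinfo=timezone.utc)
    lines = []
    for offset in range(count):
        timestamp = (start + timedelta(seconds=offset)).isoformat()
        level = rng.choice(LEVELS)
        endpoint = rng.choice(ENDPOINTS)
        lines.append(f"{timestamp} {level} titan request endpoint={endpoint}")
    return lines


def write_logs(directory, lines, filename="titan.log"):
    """Append lines to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path


if __name__ == "__main__":
    # Writing under /var/log/titan/ usually needs elevated permissions,
    # so this demo only prints the lines.
    for line in generate_log_lines(3):
        print(line)
```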

⚖️ License

This project is licensed under the MIT License.

Updated and maintained by Khawalid Mehmood.
