This project demonstrates a complete end-to-end Monitoring and Observability stack built using Prometheus, Grafana, and Slack integration. It showcases how infrastructure metrics and application metrics can be collected, queried using PromQL, visualized through dashboards, and converted into actionable alerts.
The system monitors:
- 🖥️ Server health and availability
- 🧠 Memory utilization
- 💽 Disk usage
- ⚙️ CPU load
- 🌐 Application HTTP request metrics
- 🚨 Real-time alert notifications via Slack
The implementation reflects a real-world DevOps monitoring workflow: metrics are scraped from exporters, stored in Prometheus, and visualized in Grafana, while alerts are delivered instantly to collaboration platforms.
This project simulates a production-style monitoring setup suitable for cloud-based Linux servers and modern application environments.
The architecture follows a distributed model to prevent bottlenecks and ensure scalability. Data flows from managed agents on the Application Server to centralized databases for metrics and logs.
| Service | Port | Description |
|---|---|---|
| Grafana | 3000 | Unified UI for Dashboards & Alerts |
| Prometheus | 9090 | Time-series Metrics Database |
| Loki | 3100 | High-efficiency Log Aggregation |
| Node Exporter | 9100 | Hardware & OS Metric Exposure |
| Titan App | 5000 | Python Flask Web Application |
| Grafana Alloy | 12345 | Unified Observability Agent |
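The service table above maps directly onto Prometheus scrape targets. A sketch of what the corresponding `prometheus.yml` might look like is shown below; the job names and hostnames are illustrative assumptions, not taken from the project's actual configuration:

```yaml
# Hypothetical scrape configuration matching the ports above.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]     # Prometheus self-scrape
  - job_name: "node_exporter"
    static_configs:
      - targets: ["app-server:9100"]    # hardware & OS metrics
  - job_name: "titan_app"
    static_configs:
      - targets: ["app-server:5000"]    # Flask application metrics
```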
- Metrics Management: Prometheus uses a "Pull" (scraping) model to collect time-series data from exposed endpoints.
- Log Management: Grafana Loki serves as a cost-effective log aggregation system, receiving logs pushed from agents.
- Visualization: Grafana acts as the unified frontend, connecting to Prometheus and Loki to create interactive dashboards.
- Unified Agent: Grafana Alloy handles metrics, logs, and traces in a single binary, replacing legacy agents such as Promtail.
- Infrastructure: Hosted on AWS EC2 (Ubuntu 24.04) instances, using t2.micro instance types for development.
The environment is provisioned using automated Bash scripts to ensure consistency and minimize configuration errors.
- Grafana Setup: grafana-setup.sh
- Prometheus Setup: prometheus-setup.sh
- Loki Setup: lokisetup.sh
- Application Node: webnode_setup.sh
This project leverages PromQL (Prometheus Query Language) to extract actionable insights from raw telemetry.
- Instant Vectors: up == 1
- Range Vectors: up[5m]
- Resource Arithmetic: Memory & disk percentage calculations
- Rate Analysis: rate(http_requests_total[1m])
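The query patterns listed above look like this in practice. The `up` and `http_requests_total` metric names come from the project itself; the exact expressions are representative examples:

```promql
# Instant vector: only instances currently reporting healthy
up == 1

# Range vector: the last 5 minutes of samples for the same series
up[5m]

# Count how many targets are being monitored
count(up)

# Per-second HTTP request rate over a 1-minute window
rate(http_requests_total[1m])
```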
- Time Series Panels: CPU & traffic trends
- Gauge Panels: Memory & disk visualization
- Dynamic Variables: Interactive endpoint filtering
- Slack Contact Point: #alerts-prods webhook integration
- Threshold Rules: Root disk usage > 65%
- Notification Policies: Managed repeat intervals
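The project configures the disk threshold through Grafana's alerting UI; an equivalent Prometheus-style rule file, as a sketch, would look like the following (group and alert names are illustrative; the metric names are standard node_exporter series):

```yaml
groups:
  - name: disk-alerts
    rules:
      - alert: RootDiskUsageHigh
        # Fires when used space on / exceeds 65% of capacity
        expr: |
          100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
                     / node_filesystem_size_bytes{mountpoint="/"}) > 65
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Root disk usage above 65% on {{ $labels.instance }}"
```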
Executing grafana-setup.sh to automate Grafana installation and service enablement.
Accessing Grafana on port 3000 for the first time after successful deployment.
Successful administrative login confirming Grafana service health.
Running prometheus-setup.sh to install Prometheus as a managed system service.
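Installing Prometheus "as a managed system service" typically means the setup script drops a systemd unit along these lines; the paths and the `prometheus` user are assumptions about what the script creates, not its verified contents:

```ini
[Unit]
Description=Prometheus Time-Series Database
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
```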
Validating that Prometheus is scraping its internal metrics successfully.
Displaying provisioned Prometheus and Grafana EC2 instances.
Terminal confirmation that the Loki service is active and running.
Verification that the Loki API endpoint is responding as "ready" on port 3100.
Executing webnode_setup.sh to configure the Titan application server.
Python Flask Titan application running successfully on port 5000.
Testing the /payment endpoint to simulate application traffic.
Viewing internal metrics exposed by the Flask application.
System-level hardware and OS metrics exposed on port 9100.
Using top to verify active load generation scripts consuming CPU.
Confirming application logs are being generated in /var/log/titan/.
Prometheus UI showing all configured scrape targets as healthy.
Executing the up query to validate instance availability.
Retrieving historical status using up[5m].
Using count(up) to calculate total monitored targets.
Applying scalar() to convert vector results into numerical output.
Querying total system memory in bytes.
Calculating used memory through metric subtraction.
Computing available root filesystem percentage.
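The memory and filesystem calculations above combine standard node_exporter series with simple arithmetic; representative expressions (metric names are real node_exporter metrics, the exact form is a sketch):

```promql
# Available root filesystem, as a percentage of total
100 * node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"}

# Used memory as a percentage of total, via metric subtraction
100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    / node_memory_MemTotal_bytes
```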
Filtering active nodes using up == 1.
Identifying nodes experiencing high load averages.
Finding nodes that are both operational and under load.
Using offset 5m to analyze previous memory states.
Tracking http_requests_total metric for Titan application.
Calculating per-second request rate using rate().
Summing metrics grouped by specific job labels.
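rate() turns a monotonically increasing counter into a per-second rate. A simplified Python model of that calculation (ignoring Prometheus's range extrapolation, and with only naive counter-reset handling) clarifies what the query returns:

```python
def simple_rate(samples):
    """Approximate PromQL rate() over a list of (timestamp, value)
    counter samples: total increase divided by elapsed seconds.
    Real rate() also extrapolates to the window boundaries."""
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    increase = v1 - v0
    if increase < 0:          # naive counter-reset handling
        increase = v1
    return increase / (t1 - t0)

# 120 requests over 60 seconds -> 2 requests/second
print(simple_rate([(0, 100), (30, 160), (60, 220)]))  # -> 2.0
```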
Connecting Prometheus to Grafana using private IP configuration.
Testing real-time metric queries in Grafana Explore.
Establishing Slack webhook integration for alert notifications.
Building performance monitoring panels using Time Series visualization.
Using bar charts to compare categorical infrastructure metrics.
Creating a Gauge panel to visualize available system memory.
Defining color thresholds and alert triggers for critical metrics.
- System Load: load.sh CPU spike testing
- Traffic Simulation: curl-based endpoint stress
- Log Generation: Fake logs in /var/log/titan/
This project is licensed under the MIT License.
Updated and maintained by Khawalid Mehmood.
