
Metriflow: Web Analytics ETL and Reporting Platform

Metriflow is a high-throughput web analytics data platform designed to ingest large volumes of raw event data, process it through a multi-stage asynchronous aggregation pipeline, and expose secure reporting APIs for analytical consumption.

The system simulates external analytics sources such as Google Analytics (GA) and PageSpeed Insights (PSI), but the core focus of the project is backend architecture, throughput optimization, fault-tolerant processing, and scalable reporting.


Project Overview

The platform processes analytics records through a staged pipeline:

Raw → Staged → Page → Intervals → Daily → Monthly → Yearly

Data ingestion and processing are deliberately decoupled through RabbitMQ. Multiple background workers execute the aggregation stages sequentially while maintaining consistency in the presence of retries, duplicates, and late-arriving data.

The system has been benchmarked at 26 million records processed in under 3 minutes (13M GA + 13M PSI processed in parallel).

This was achieved through:

  • producer-consumer channels
  • streaming JSON deserialization
  • batched bulk inserts using SqlBulkCopy
  • minimized write contention on ingestion tables
  • staged aggregation using stored procedures

Key Features

High-Throughput Data Ingestion

The ingestion layer reads structured analytics data from mock external sources and streams records without loading entire files into memory.

Key implementation details:

  • streaming JSON deserialization for low memory pressure
  • producer-consumer channels for controlled parallelism
  • SqlBulkCopy persistence in batches of 250,000 records
  • parallel ingestion of the GA and PSI datasets

This ingestion path is optimized for write throughput rather than ORM convenience.
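
A minimal sketch of this pattern is shown below, assuming illustrative type, column, and table names rather than Metriflow's actual code: a producer stream-deserializes records with JsonSerializer.DeserializeAsyncEnumerable, pushes them into a bounded channel, and a consumer flushes batches of 250,000 rows with SqlBulkCopy.

```csharp
// Illustrative sketch only: record shape and table name are assumptions, not Metriflow's actual code.
using System;
using System.Data;
using System.IO;
using System.Text.Json;
using System.Threading.Channels;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public record GaEvent(string PageUrl, DateTime OccurredAt, int Sessions);

public static class IngestionSketch
{
    private const int BatchSize = 250_000;

    public static async Task IngestAsync(string jsonPath, string connectionString)
    {
        var channel = Channel.CreateBounded<GaEvent>(500_000);

        // Producer: stream-deserialize the file without loading it fully into memory.
        var producer = Task.Run(async () =>
        {
            await using var stream = File.OpenRead(jsonPath);
            await foreach (var e in JsonSerializer.DeserializeAsyncEnumerable<GaEvent>(stream))
            {
                if (e is not null) await channel.Writer.WriteAsync(e);
            }
            channel.Writer.Complete();
        });

        // Consumer: accumulate a batch, then bulk-insert it into the raw table.
        var table = new DataTable();
        table.Columns.Add("PageUrl", typeof(string));
        table.Columns.Add("OccurredAt", typeof(DateTime));
        table.Columns.Add("Sessions", typeof(int));

        await foreach (var e in channel.Reader.ReadAllAsync())
        {
            table.Rows.Add(e.PageUrl, e.OccurredAt, e.Sessions);
            if (table.Rows.Count >= BatchSize) await FlushAsync(table, connectionString);
        }
        if (table.Rows.Count > 0) await FlushAsync(table, connectionString);
        await producer;
    }

    private static async Task FlushAsync(DataTable table, string connectionString)
    {
        await using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync();
        using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.RawGaEvents" };
        await bulk.WriteToServerAsync(table);
        table.Clear();
    }
}
```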


Queue-Based ETL Pipeline

RabbitMQ acts as the decoupling boundary between ingestion and downstream processing.

The system runs 7 background worker services, each responsible for a stage in the aggregation pipeline.

Processing chain:

  • Raw ingestion.
  • Staging normalization.
  • Page-level aggregation.
  • Interval aggregation.
  • Daily aggregation.
  • Monthly aggregation.
  • Yearly aggregation.

This separation makes the pipeline easier to scale, retry, and reason about operationally.
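
The sketch below shows roughly what one stage looks like as a .NET worker, using the RabbitMQ.Client 6.x style API; queue names and the aggregation call are illustrative assumptions, not the project's actual identifiers.

```csharp
// Sketch of a single pipeline stage; queue names and aggregation call are assumptions.
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

public sealed class DailyAggregationWorker : BackgroundService
{
    private IConnection? _connection;
    private IModel? _channel;

    protected override Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var factory = new ConnectionFactory { HostName = "rabbitmq" };
        _connection = factory.CreateConnection();
        var channel = _connection.CreateModel();
        _channel = channel;

        channel.QueueDeclare("interval-completed", durable: true, exclusive: false, autoDelete: false);
        channel.QueueDeclare("daily-completed", durable: true, exclusive: false, autoDelete: false);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (_, ea) =>
        {
            var dateKey = Encoding.UTF8.GetString(ea.Body.ToArray());

            // Run this stage's aggregation, acknowledge the message, then notify
            // the next stage that the day is ready for monthly roll-up.
            RunDailyAggregation(dateKey);
            channel.BasicAck(ea.DeliveryTag, multiple: false);
            channel.BasicPublish(exchange: "", routingKey: "daily-completed",
                                 basicProperties: null, body: ea.Body);
        };

        channel.BasicConsume("interval-completed", autoAck: false, consumer);
        return Task.CompletedTask;
    }

    private static void RunDailyAggregation(string dateKey)
    {
        // Placeholder for the stored-procedure execution described elsewhere in this README.
    }

    public override void Dispose()
    {
        _channel?.Dispose();
        _connection?.Dispose();
        base.Dispose();
    }
}
```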


Multi-Level Aggregation Strategy

Aggregations are precomputed incrementally across bounded reporting dimensions. Pre-aggregated tables support fast reporting across common time slices:

  • page-level metrics.
  • interval summaries.
  • daily summaries.
  • monthly summaries.
  • yearly summaries.

For queries that do not fit these fixed aggregation boundaries, Redis is used to support arbitrary date-range lookups without creating unbounded combinations of precomputed tables.
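
One plausible shape of that lookup is sketched below with StackExchange.Redis; the key format and SQL fallback loader are assumptions. An arbitrary range is answered by summing cached per-day aggregates, falling back to the precomputed daily table only for missing keys.

```csharp
// Sketch only: key format and fallback loader are illustrative assumptions.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed class DateRangeReportService
{
    private readonly IDatabase _redis;
    private readonly Func<DateOnly, Task<long>> _loadDailyViewsFromSql;

    public DateRangeReportService(IConnectionMultiplexer mux, Func<DateOnly, Task<long>> loadDailyViewsFromSql)
    {
        _redis = mux.GetDatabase();
        _loadDailyViewsFromSql = loadDailyViewsFromSql;
    }

    public async Task<long> TotalViewsAsync(DateOnly from, DateOnly to)
    {
        long total = 0;
        for (var day = from; day <= to; day = day.AddDays(1))
        {
            var key = $"daily:views:{day:yyyy-MM-dd}";
            var cached = await _redis.StringGetAsync(key);
            if (cached.HasValue)
            {
                total += (long)cached;
            }
            else
            {
                // Cache miss: read the precomputed daily aggregate from SQL Server and cache it.
                var value = await _loadDailyViewsFromSql(day);
                await _redis.StringSetAsync(key, value, TimeSpan.FromHours(24));
                total += value;
            }
        }
        return total;
    }
}
```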


Late-Arriving and Duplicate Data Handling

The pipeline is built to maintain consistency when data arrives out of order or overlaps previously processed ranges.

It does this by:

  • detecting existing aggregates.
  • identifying upstream changes.
  • recomputing affected downstream aggregates.
  • updating partial results instead of blindly appending.

This ensures analytical correctness across the entire aggregation chain.
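
As a rough illustration of the update-instead-of-append idea (table and column names are assumptions), a stage can merge a late-arriving partial into the existing aggregate row and then re-queue the affected period for downstream recomputation.

```csharp
// Sketch: merges a late-arriving partial into an existing daily aggregate
// instead of appending a duplicate row. Names are illustrative assumptions.
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class LateDataSketch
{
    private const string MergeSql = @"
MERGE dbo.DailyAggregates AS target
USING (VALUES (@PageId, @Day, @Views)) AS src (PageId, [Day], Views)
    ON target.PageId = src.PageId AND target.[Day] = src.[Day]
WHEN MATCHED THEN
    UPDATE SET target.Views = target.Views + src.Views
WHEN NOT MATCHED THEN
    INSERT (PageId, [Day], Views) VALUES (src.PageId, src.[Day], src.Views);";

    public static async Task ApplyLatePartialAsync(
        SqlConnection conn, int pageId, DateTime day, long views)
    {
        await using var cmd = new SqlCommand(MergeSql, conn);
        cmd.Parameters.AddWithValue("@PageId", pageId);
        cmd.Parameters.AddWithValue("@Day", day.Date);
        cmd.Parameters.AddWithValue("@Views", views);
        await cmd.ExecuteNonQueryAsync();

        // The affected day would then be re-queued so monthly and yearly
        // aggregates downstream are recomputed from the corrected daily value.
    }
}
```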


Secure Reporting API

The reporting layer exposes REST endpoints for querying aggregated analytics.

Security features include:

  • JWT authentication.
  • ASP.NET Core Identity.
  • Rate limiting for endpoint protection.
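
A minimal Program.cs sketch of how these pieces are typically wired in ASP.NET Core (.NET 8); issuer, signing key, limits, and policy names are placeholder assumptions, not the project's actual configuration.

```csharp
// Illustrative wiring only; configuration keys and limits are assumptions.
using System;
using System.Text;
using Microsoft.AspNetCore.Authentication.JwtBearer;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.IdentityModel.Tokens;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();
builder.Services.AddAuthorization();

builder.Services
    .AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.TokenValidationParameters = new TokenValidationParameters
        {
            ValidateIssuer = true,
            ValidIssuer = builder.Configuration["Jwt:Issuer"],
            ValidateAudience = false,
            IssuerSigningKey = new SymmetricSecurityKey(
                Encoding.UTF8.GetBytes(builder.Configuration["Jwt:Key"]!))
        };
    });

// Fixed-window rate limiting to protect reporting endpoints.
builder.Services.AddRateLimiter(options =>
    options.AddFixedWindowLimiter("reports", o =>
    {
        o.PermitLimit = 100;
        o.Window = TimeSpan.FromMinutes(1);
    }));

var app = builder.Build();
app.UseAuthentication();
app.UseAuthorization();
app.UseRateLimiter();
app.MapControllers().RequireRateLimiting("reports");
app.Run();
```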

Load Testing at Scale

A companion load-testing tool, Hyper JSON Generator, was built specifically for this system (see the Hyper JSON Generator code). It produces millions of synthetic analytics records using:

  • Utf8JsonWriter
  • CPU-cache-efficient structs

The tool exists to validate ingestion throughput and memory stability under large synthetic workloads.
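
A rough sketch of the Utf8JsonWriter approach is shown below; the field names and record shape are assumptions. Records are written directly to a stream as UTF-8, so millions of rows can be generated without materializing intermediate strings or object graphs.

```csharp
// Sketch of high-volume JSON generation with Utf8JsonWriter; field names are assumptions.
using System;
using System.IO;
using System.Text.Json;

// A small struct keeps each record in a contiguous, cache-friendly block of memory.
public readonly record struct GaRecord(int PageId, DateTime OccurredAt, int Sessions);

public static class HyperJsonSketch
{
    public static void Generate(string path, int count)
    {
        using var stream = File.Create(path);
        using var writer = new Utf8JsonWriter(stream);
        var rng = new Random(42);

        writer.WriteStartArray();
        for (var i = 0; i < count; i++)
        {
            var record = new GaRecord(rng.Next(1, 10_000),
                                      DateTime.UtcNow.AddSeconds(-rng.Next(0, 86_400)),
                                      rng.Next(1, 50));
            writer.WriteStartObject();
            writer.WriteNumber("pageId", record.PageId);
            writer.WriteString("occurredAt", record.OccurredAt);
            writer.WriteNumber("sessions", record.Sessions);
            writer.WriteEndObject();

            // Periodically flush buffered UTF-8 bytes to keep memory usage flat.
            if (i % 100_000 == 0) writer.Flush();
        }
        writer.WriteEndArray();
        writer.Flush();
    }
}
```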


Containerized Development Environment

The full platform is containerized with Docker Compose. Services include:

  • API
  • Worker services
  • SQL Server
  • RabbitMQ
  • Redis

This ensures reproducible local development and predictable service orchestration.


Why Both ADO.NET and EF Core?

A major architectural change in the current version is the deliberate use of both ADO.NET and Entity Framework Core. This split is neither accidental nor transitional.

ADO.NET is used for write-heavy, throughput-critical workloads where raw performance matters most:

  • High-volume ingestion
  • Batch writes via SqlBulkCopy
  • Execution of aggregation stored procedures
  • Explicit transaction handling
  • Retry handling around transient failures

For these workloads, the abstraction cost of an ORM becomes material.
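
A hedged sketch of what that looks like in practice follows; the procedure name, parameter, and retried error numbers are assumptions. The aggregation stored procedure runs over raw ADO.NET inside an explicit transaction, with a simple retry around transient SQL errors.

```csharp
// Sketch: stored-procedure execution with explicit transaction and transient-error retry.
// Procedure name and retried error numbers are illustrative assumptions.
using System;
using System.Data;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class AggregationRunnerSketch
{
    public static async Task RunDailyAggregationAsync(string connectionString, DateTime day)
    {
        const int maxAttempts = 3;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await using var conn = new SqlConnection(connectionString);
                await conn.OpenAsync();
                await using var tx = (SqlTransaction)await conn.BeginTransactionAsync();

                await using var cmd = new SqlCommand("dbo.usp_AggregateDaily", conn, tx)
                {
                    CommandType = CommandType.StoredProcedure
                };
                cmd.Parameters.AddWithValue("@Day", day.Date);

                await cmd.ExecuteNonQueryAsync();
                await tx.CommitAsync();
                return;
            }
            catch (SqlException ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                // Back off briefly and retry; the transaction is rolled back with the connection.
                await Task.Delay(TimeSpan.FromSeconds(attempt));
            }
        }
    }

    // Commonly cited transient SQL Server error numbers (deadlock, timeout, throttling).
    private static bool IsTransient(SqlException ex) =>
        ex.Number is 1205 or -2 or 40501 or 49918;
}
```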

EF Core is used for read-heavy API query paths, where maintainability and developer productivity matter more than raw write throughput.

EF Core is used for:

  • Reporting queries
  • API read models
  • Identity/authentication persistence
  • Ordinary relational access patterns
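
For the read side, a typical reporting query might look like this sketch; the entity, DbContext, and DTO names are assumptions rather than the project's actual types.

```csharp
// Sketch of an EF Core read path; entity, context, and DTO names are illustrative assumptions.
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class DailyAggregate
{
    public int Id { get; set; }
    public DateTime Day { get; set; }
    public long Views { get; set; }
    public double AvgLoadTimeMs { get; set; }
}

public class MetriflowDbContext : DbContext
{
    public MetriflowDbContext(DbContextOptions<MetriflowDbContext> options) : base(options) { }
    public DbSet<DailyAggregate> DailyAggregates => Set<DailyAggregate>();
}

public record DailySummaryDto(DateTime Day, long Views, double AvgLoadTimeMs);

public class ReportsService
{
    private readonly MetriflowDbContext _db;
    public ReportsService(MetriflowDbContext db) => _db = db;

    public Task<DailySummaryDto[]> GetDailySummariesAsync(DateTime from, DateTime to) =>
        _db.DailyAggregates
            .AsNoTracking()                              // read-only reporting query
            .Where(d => d.Day >= from && d.Day <= to)
            .OrderBy(d => d.Day)
            .Select(d => new DailySummaryDto(d.Day, d.Views, d.AvgLoadTimeMs))
            .ToArrayAsync();
}
```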


Architectural Rationale

The project intentionally uses the right abstraction for the workload.

| Workload Type | Technology | Reason |
|---|---|---|
| High-throughput ingestion | ADO.NET | Lowest overhead, bulk insert support |
| Complex aggregation execution | ADO.NET | Stored procedures, transaction control |
| Reporting reads | EF Core | Faster development, cleaner query composition |
| Identity/auth | EF Core | Standard ASP.NET Core integration |

This hybrid data-access model is one of the core technical decisions in the current architecture.


Technology Stack

| Component | Technology | Role |
|---|---|---|
| Backend API | ASP.NET Core (.NET 8) | Secure reporting API |
| Worker Services | .NET Worker Services | Ingestion and staged aggregation |
| Database | SQL Server | Persistent analytical storage |
| High-throughput data access | ADO.NET | Bulk ingestion, stored procedure execution |
| ORM | Entity Framework Core | Read-heavy query paths and identity |
| Message Broker | RabbitMQ | Asynchronous decoupled processing |
| Cache | Redis | Arbitrary date-range reporting |
| Authentication | ASP.NET Core Identity + JWT | API security |
| Testing | xUnit, Moq | Aggregation and business logic validation |
| CI/CD | GitHub Actions | Build and validation pipeline |
| Runtime | Docker Compose | Local orchestration |

Architecture Flow

1. Ingestion.

Analytics records from GA and PSI are streamed and inserted into raw ingestion tables using batched bulk operations.

2. Queue Dispatch.

RabbitMQ decouples ingestion from downstream aggregation. Each stage consumes completed work from the previous stage.

3. Sequential Aggregation Pipeline.

Worker services execute staged transformations: Raw → Staged → Page → Intervals → Daily → Monthly → Yearly. Aggregation logic is implemented primarily in stored procedures.

4. Reporting API.

The API reads pre-aggregated data using EF Core. Redis supplements requests for non-standard date ranges.


Database Strategy

Raw ingestion tables

Raw tables are optimized for writes.

Design decisions include:

  • No clustered indexes on high-ingestion raw tables
  • Minimized write contention during bulk insert operations

Analytical tables

Aggregated reporting tables use:

  • Columnstore indexes

This provides:

  • compression benefits

  • improved scan speed

  • better analytical query performance

Aggregation execution

Aggregation logic is encapsulated in stored procedures with:

  • ACID transactions.

  • Retry handling for transient failures.

Database Schema Summary

The exact schema may evolve, but the current model is conceptually organized around these layers:

| Table Group | Purpose |
|---|---|
| Raw tables | High-throughput ingestion of source records |
| Staging tables | Normalization and source merging |
| Page aggregates | Page-level correlation |
| Interval aggregates | Interval reporting |
| Daily aggregates | Daily reporting |
| Monthly aggregates | Monthly reporting |
| Yearly aggregates | Yearly reporting |
| Identity tables | Users, authentication, authorization |

Setup and Running the Application

Prerequisites

Install:

  • Docker
  • Docker Compose

Start the Platform

From the repository root:

docker compose up --build -d

This will start:

  • API
  • worker services
  • SQL Server
  • RabbitMQ
  • Redis

Allow several moments for containers to initialize and for the initial pipeline execution to complete.


Accessing the API

Swagger should be available once the API container is ready. Typical URL: http://localhost:8080/swagger


Authentication

Reporting endpoints are protected by JWT. Typical flow:

  1. Register a user.
  2. Log in.
  3. Copy the returned JWT.
  4. Authorize through Swagger using: Bearer <token>

Example Reporting Capabilities

Depending on the current API surface, reporting endpoints provide access to:

  • overall summaries
  • page-level analytics
  • bounded date-range reports
  • arbitrary date-range analytics (with Redis support when pre-aggregates are insufficient)
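
As a hedged example of the authenticated flow, the sketch below calls a reporting endpoint with the JWT attached; the route and query parameters are assumptions, so check Swagger for the actual API surface.

```csharp
// Sketch of calling a protected reporting endpoint; the route is an assumption.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ReportingClientSketch
{
    public static async Task<string> GetDailyReportAsync(string jwt)
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8080") };
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", jwt);

        // Route and query parameters are illustrative; see Swagger for the real paths.
        var response = await http.GetAsync("/api/reports/daily?from=2025-01-01&to=2025-01-31");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```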


Testing

The project includes unit tests covering core aggregation and business rules. Run tests with:

dotnet test
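
A small xUnit sketch of the kind of aggregation rule these tests cover; the aggregator below is a minimal in-memory stand-in, not the project's actual implementation.

```csharp
// Sketch only: DailyAggregator is a minimal stand-in so the test is self-contained.
using System;
using System.Collections.Generic;
using Xunit;

public class DailyAggregator
{
    private readonly Dictionary<DateTime, long> _totals = new();

    public void Apply(DateTime day, long views) =>
        _totals[day.Date] = _totals.TryGetValue(day.Date, out var current) ? current + views : views;

    public long TotalViews(DateTime day) => _totals.TryGetValue(day.Date, out var v) ? v : 0;
}

public class DailyAggregatorTests
{
    [Fact]
    public void LatePartial_IsMergedIntoExistingTotal_NotAppended()
    {
        var aggregator = new DailyAggregator();
        var day = new DateTime(2025, 1, 15);

        aggregator.Apply(day, views: 100);
        aggregator.Apply(day, views: 25);   // late-arriving partial for the same day

        Assert.Equal(125, aggregator.TotalViews(day));
    }
}
```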


CI/CD

GitHub Actions is configured for automated validation. Typical pipeline responsibilities include:

  • restore
  • build
  • test execution

Current Engineering Focus

The current version of Metriflow is primarily a backend systems engineering project focused on:

  • Large-scale ingestion

  • Asynchronous pipeline design

  • Data consistency under reprocessing

  • Hybrid data-access strategy

  • Analytical query optimization

It is intentionally optimized around throughput, correctness, and operational scalability, rather than simple CRUD application design.

About

Designed and implemented a backend analytics system that collects website metrics from multiple sources, processes data asynchronously, and exposes secure aggregated reports via a REST API.
