ErrorWatch is a modern error tracking and monitoring platform designed as a cost-effective alternative to established solutions like Sentry and Rollbar. The system provides real-time error capture, intelligent grouping via deterministic fingerprinting, and team collaboration features across multiple projects and environments. Built on a Rails 8 foundation with embedded job processing, the architecture prioritizes operational simplicity and rapid iteration velocity while maintaining clear paths to horizontal scale.
The core challenge involved designing a fingerprinting algorithm that could reliably group similar errors across heterogeneous runtime environments. Stack traces vary wildly: file paths differ from host to host, JavaScript arrives minified or source-mapped, and frameworks inject their own noise. The solution required careful normalization of stack frames, extraction of semantic function names, and SHA256 hashing to create deterministic fingerprints while avoiding both over-grouping (masking distinct issues) and under-grouping (alert fatigue).
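A minimal sketch of that grouping step, assuming illustrative class names and normalization rules rather than the production code:

```ruby
require "digest"

# Illustrative sketch (not the production class): hash the error type,
# a lightly normalized message, and the top stack frames.
class Fingerprint
  FRAME_LIMIT = 5

  def self.generate(error_class:, message:, frames:)
    normalized_frames = frames.first(FRAME_LIMIT).map { |frame| normalize(frame) }
    # Mask volatile fragments (record IDs, timestamps) so the same logical
    # error always hashes to the same value.
    normalized_message = message.gsub(/\d+/, "<n>")
    Digest::SHA256.hexdigest([error_class, normalized_message, *normalized_frames].join("|"))
  end

  def self.normalize(frame)
    frame
      .sub(%r{\A.*?(?=/app/|/lib/|/src/)}, "") # strip machine-specific path prefixes
      .gsub(/:\d+/, "")                        # drop line numbers that shift between releases
  end
end
```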
Error ingestion endpoints face unpredictable burst patterns—a single production deploy can trigger thousands of error submissions within seconds. The challenge was maintaining sub-200ms API response times during traffic spikes while preserving data integrity. This required aggressive decoupling of the hot path (immediate API acknowledgment) from cold path processing (fingerprinting, notification evaluation, and counter updates) using database-backed async job queues.
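A sketch of that hot path; ErrorIngestionService, ErrorOccurrence, and ProcessErrorJob are named elsewhere in this document, while the controller shape, current_project helper, and payload column are illustrative assumptions:

```ruby
# Hot path: persist the raw occurrence with minimal validation, enqueue the
# cold-path job, and acknowledge immediately.
class Api::ErrorsController < Api::BaseController
  def create
    occurrence = ErrorIngestionService.new(current_project).call(error_params)
    render json: { id: occurrence.id }, status: :ok
  end

  private

  def error_params
    params.require(:error).permit!.to_h
  end
end

class ErrorIngestionService
  def initialize(project)
    @project = project
  end

  def call(payload)
    occurrence = @project.error_occurrences.create!(payload: payload, received_at: Time.current)
    ProcessErrorJob.perform_later(occurrence.id)
    occurrence
  end
end
```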
Supporting hundreds of customer accounts with strict data isolation, while avoiding the operational overhead of schema-per-tenant or database-per-tenant architectures, pointed to a shared schema with row-level isolation. The bounded context needed tenant-scoped queries at every access point without introducing query complexity that would degrade performance or widen the blast radius of developer errors.
Decision: Deploy as a single Rails application rather than microservices.
Trade-offs:
- Favor: Operational simplicity (single deploy, no service mesh), faster development velocity, reduced infrastructure costs, simplified transaction boundaries
- Against: Potentially limits independent scaling of API tier versus background processing tier
Rationale: For MVP scale (targeting 100-500 customers initially), a well-structured monolith with clear service boundaries provides faster iteration cycles. The codebase maintains explicit separation between API controllers, domain services, and background jobs, enabling future extraction to independent services if traffic patterns demand it. Premature distribution would introduce coordination overhead (distributed transactions, eventual consistency) without corresponding benefits at current scale.
Decision: Use Rails 8's Solid Queue (PostgreSQL-backed) instead of Redis-based solutions.
Trade-offs:
- Favor: Zero external dependencies, simpler deployment, ACID guarantees for job state, reduced operational burden, lower infrastructure cost
- Against: Higher database I/O compared to in-memory queues, potential contention on job table at very high throughput
Rationale: For a bootstrapped SaaS, minimizing operational complexity and infrastructure costs is critical. Solid Queue eliminates Redis as a dependency while providing sufficient throughput for MVP scale. The performance ceiling is high enough that queue optimization won't become necessary until after achieving product-market fit. If job throughput becomes a bottleneck, migration to Redis-backed Sidekiq is straightforward.
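A minimal configuration sketch, closely mirroring the Rails 8 generated defaults (database and file locations shown in comments are assumptions):

```ruby
# config/environments/production.rb (excerpt): Active Job runs on Solid Queue,
# backed by a dedicated PostgreSQL "queue" database instead of Redis.
config.active_job.queue_adapter = :solid_queue
config.solid_queue.connects_to = { database: { writing: :queue } }

# config/puma.rb (excerpt): run the Solid Queue supervisor inside the web
# process ("embedded job processing") when the flag is set.
plugin :solid_queue if ENV["SOLID_QUEUE_IN_PUMA"]
```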
Decision: Use deterministic fingerprinting (error type + message + top stack frames) rather than machine learning clustering.
Trade-offs:
- Favor: Predictable grouping behavior, zero training/inference latency, no model deployment complexity, explainable to users
- Against: May miss semantically similar errors with different stack traces, requires manual tuning for edge cases
Rationale: Deterministic algorithms provide predictable, debuggable behavior critical for earning customer trust. ML-based clustering introduces inference latency on the hot path and requires labeled training data that doesn't exist at MVP stage. The path forward includes augmenting deterministic grouping with optional ML-powered suggestions in Phase 2, providing the best of both worlds.
Decision: Leverage Stripe Subscriptions with webhook-driven sync rather than building custom billing logic.
Trade-offs:
- Favor: Offloads PCI compliance, invoicing, payment retries, dunning; faster time-to-market
- Against: Slightly higher transaction fees, tightly coupled to Stripe's data model
Rationale: Payment processing is undifferentiated heavy lifting. Stripe provides production-grade subscription management, fraud detection, and compliance that would take months to build in-house. The 2.9% + 30¢ fee is negligible compared to the engineering cost of building custom billing infrastructure.
Each project receives a secure API token used for Bearer authentication on error ingestion endpoints. Rate limiting operates at two levels: per-API-key throttling based on subscription tier (Starter: 100 req/min, Pro: 500 req/min, Business: 1000 req/min) and IP-based limits for unauthenticated traffic. This approach prevents noisy neighbor problems in multi-tenant scenarios while preserving fair use across pricing tiers.
Implementation uses Rack::Attack middleware with in-memory counters, providing microsecond-latency checks without external dependencies. For future horizontal scaling, the counter store can migrate to Redis/Valkey for cross-instance coordination.
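A sketch of the two throttle levels with Rack::Attack; the tier lookup and the Project api_token/subscription_tier attributes are hypothetical:

```ruby
# config/initializers/rack_attack.rb (sketch)
class Rack::Attack
  TIER_LIMITS = { "starter" => 100, "pro" => 500, "business" => 1000 }.freeze

  # Keep counters in process memory; swap for a Redis/Valkey-backed store
  # when running multiple instances.
  cache.store = ActiveSupport::Cache::MemoryStore.new

  # Per-API-key throttling with the limit derived from the subscription tier.
  throttle("ingest/api_key", limit: ->(req) { limit_for(req) }, period: 1.minute) do |req|
    req.get_header("HTTP_AUTHORIZATION") if req.path.start_with?("/api/errors")
  end

  # IP-based limit for unauthenticated traffic.
  throttle("req/ip", limit: 60, period: 1.minute) do |req|
    req.ip unless req.get_header("HTTP_AUTHORIZATION")
  end

  # Hypothetical tier lookup; unknown tokens fall back to the lowest limit.
  def self.limit_for(req)
    token = req.get_header("HTTP_AUTHORIZATION").to_s.delete_prefix("Bearer ")
    tier  = Project.find_by(api_token: token)&.subscription_tier
    TIER_LIMITS.fetch(tier, TIER_LIMITS["starter"])
  end
end
```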
Error notifications are delivered asynchronously via background jobs to prevent blocking the API response path. This introduces eventual consistency—an error may be acknowledged (200 OK) before notifications fire. The system guarantees at-least-once delivery through job retries with exponential backoff, accepting potential duplicate notifications over missed alerts.
NotificationRule evaluation incorporates idempotency checks (tracking last notification timestamp per error group) to minimize duplicate deliveries. For critical alerts, customers can configure multiple channels (Slack + Email) with independent delivery guarantees.
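A sketch of the delivery job, assuming a hypothetical last_notified_at column on the error group and a fixed cool-down window:

```ruby
# Async notification delivery: at-least-once via retries, with a per-group
# timestamp check to suppress most duplicates.
class DeliverNotificationJob < ApplicationJob
  queue_as :notifications

  COOL_DOWN = 10.minutes # hypothetical per-group re-alert window

  retry_on StandardError, wait: ->(executions) { (2**executions).minutes }, attempts: 5

  def perform(notification_rule_id, error_group_id)
    rule  = NotificationRule.find(notification_rule_id)
    group = ErrorGroup.find(error_group_id)

    # Idempotency check: skip if this group was alerted on recently.
    return if group.last_notified_at&.after?(COOL_DOWN.ago)

    rule.channels.each { |channel| channel.deliver(group) } # Slack, email, etc.
    group.update!(last_notified_at: Time.current)
  end
end
```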
When PostgreSQL experiences high load (slow queries, connection pool exhaustion), the system prioritizes error ingestion over secondary concerns. The ErrorIngestionService creates ErrorOccurrence records with minimal validation, enqueues ProcessErrorJob, and returns immediately. Background processing (counter updates, notifications) can be delayed or retried without data loss.
If the job queue backs up significantly, operators can pause non-critical jobs (e.g., email digests, analytics rollups) to preserve capacity for error processing. This design ensures customers can always submit errors even during database incidents.
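A sketch of the cold-path job described above; the error_groups association, counter column, and follow-on notification job are illustrative assumptions:

```ruby
# Cold path: fingerprint the occurrence, attach it to its group, bump the
# counter, and hand notification evaluation to another job.
class ProcessErrorJob < ApplicationJob
  queue_as :errors

  def perform(occurrence_id)
    occurrence = ErrorOccurrence.find(occurrence_id)
    fingerprint = Fingerprint.generate(
      error_class: occurrence.payload["class"],
      message:     occurrence.payload["message"],
      frames:      occurrence.payload.fetch("frames", [])
    )

    group = occurrence.project.error_groups.find_or_create_by!(fingerprint: fingerprint)
    occurrence.update!(error_group: group)
    group.increment!(:occurrences_count)

    EvaluateNotificationRulesJob.perform_later(group.id)
  end
end
```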
Each ErrorOccurrence captures rich contextual metadata—user ID, request URL, browser fingerprint, custom key-value pairs—stored in JSONB columns for flexible schema evolution. Breadcrumbs provide a timeline of user actions leading to the error (page navigation, button clicks, API calls), enabling root cause analysis.
This metadata powers advanced filtering (e.g., "show errors affecting user X" or "errors from Chrome on mobile") and future analytics features. JSONB indexing provides fast queries without rigid schema constraints, supporting customer-specific debugging needs.
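A sketch of the supporting index and a containment query; the metadata column name and the example keys are assumptions:

```ruby
# Migration sketch: a GIN index keeps JSONB containment queries (@>) fast.
class AddMetadataIndexToErrorOccurrences < ActiveRecord::Migration[8.0]
  def change
    add_index :error_occurrences, :metadata, using: :gin
  end
end

# Example filter: "errors affecting user X from Chrome on mobile".
ErrorOccurrence.where("metadata @> ?", { user_id: "X", browser: "Chrome", device: "mobile" }.to_json)
```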
Slack webhook and email delivery implementations include timeout protection and retry logic with exponential backoff. If Slack's API experiences an outage, failed notifications are retried up to 5 times over 6 hours before moving to a dead-letter queue for manual review.
This prevents cascading failures—an external service outage won't degrade error ingestion performance or exhaust database connections with stuck jobs. Operators receive alerts when dead-letter queues accumulate, enabling proactive customer communication.
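A sketch of that delivery path; the notifier, the DeadLetter model, and the backoff curve are illustrative, with the wait function chosen so five attempts span roughly six hours:

```ruby
require "net/http"
require "json"

# Outbound Slack delivery with explicit timeouts so a hung socket cannot
# pin a worker or a database connection.
class SlackNotifier
  class DeliveryError < StandardError; end

  def self.deliver(webhook_url, text)
    uri  = URI(webhook_url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl      = true
    http.open_timeout = 2 # seconds to establish the connection
    http.read_timeout = 5 # seconds to wait for Slack's response
    response = http.post(uri.path, { text: text }.to_json, "Content-Type" => "application/json")
    raise DeliveryError, response.code unless response.is_a?(Net::HTTPSuccess)
  end
end

class SlackNotificationJob < ApplicationJob
  # ~5 attempts spread over roughly six hours (1 + 16 + 81 + 256 minutes),
  # then park the payload for manual review via a hypothetical DeadLetter model.
  retry_on SlackNotifier::DeliveryError, wait: ->(executions) { (executions**4).minutes }, attempts: 5 do |job, error|
    DeadLetter.create!(job_class: job.class.name, arguments: job.arguments, error: error.message)
  end

  def perform(webhook_url, text)
    SlackNotifier.deliver(webhook_url, text)
  end
end
```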
- API Response Time: P95 < 150ms for error ingestion (target: <200ms)
- Error Processing Latency: P95 < 5 seconds from ingestion to notification delivery
- Data Integrity: 100% error capture rate with zero data loss in processing pipeline
- Infrastructure Cost Efficiency: [X%] reduction vs. comparable error tracking services due to zero Redis/message queue costs
- Developer Velocity: 8-week MVP build timeline from zero to production-ready
- Test Coverage: 386/386 tests passing with comprehensive integration coverage
- Scalability Headroom: Current architecture supports up to [estimated 10K] errors/minute before requiring horizontal scaling
The decision to use Rails 8's embedded Solid Queue and Solid Cache eliminated Redis/Memcached dependencies, cutting infrastructure complexity by [X%] and reducing monthly costs by [$Y]. This validated the "boring technology" principle—established tools with strong conventions accelerate velocity more than cutting-edge frameworks requiring custom glue code.
Initial fingerprinting attempts used only error type + message, resulting in severe over-grouping (distinct errors collapsed into single groups). Adding normalized stack frames improved precision but introduced sensitivity to minification and framework internals. The final approach—top 5 stack frames with path normalization—balanced precision and recall, but required extensive testing against real-world error payloads.
Future iterations will likely incorporate ML-based semantic similarity to handle edge cases (different stack traces for the same root cause), demonstrating that deterministic algorithms make an excellent MVP foundation but benefit from probabilistic augmentation at scale.
Early prototypes processed fingerprinting and notifications synchronously, resulting in P95 latency >800ms. Moving to async job processing reduced API response times by 75% while improving reliability (jobs retry on transient failures). This reinforced the importance of aggressive hot/cold path separation—users care about acknowledgment speed, not processing completion.
Stripe delivers webhooks with at-least-once guarantees, meaning duplicate events are possible. Initial webhook handlers lacked idempotency checks, causing duplicate subscription updates and confusing billing states. Implementing idempotency keys (Stripe event IDs) and database constraints prevented these issues. This highlighted the importance of designing for eventual consistency and duplicate messages from day one.
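A sketch of the hardened handler using the stripe gem's signature verification; the StripeEvent ledger model and SyncSubscriptionJob are assumptions:

```ruby
# Webhook handler sketch: verify the signature, then use the Stripe event ID
# as an idempotency key backed by a unique database index.
class StripeWebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    event = Stripe::Webhook.construct_event(
      request.body.read,
      request.headers["Stripe-Signature"],
      Rails.application.credentials.dig(:stripe, :webhook_secret)
    )

    # Unique index on stripe_event_id makes replayed events a no-op.
    processed = StripeEvent.create_or_find_by!(stripe_event_id: event.id)
    SyncSubscriptionJob.perform_later(event.id) if processed.previously_new_record?

    head :ok
  rescue Stripe::SignatureVerificationError
    head :bad_request
  end
end
```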
Early code required developers to manually scope queries by account (e.g., current_account.projects.find(id)). This was error-prone and risked data leakage. Migrating to controller-level current_account scoping with enforced patterns (e.g., @projects = current_account.projects) reduced the blast radius of developer errors and improved security posture. The lesson: multi-tenancy must be enforced by framework constraints, not developer discipline.
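A sketch of the enforced pattern; the concern and the current_user.account association are illustrative:

```ruby
# Controller-level tenant scoping: every lookup starts from current_account,
# so a forged ID belonging to another tenant simply raises RecordNotFound.
module AccountScoped
  extend ActiveSupport::Concern

  included do
    before_action :require_account
  end

  private

  def current_account
    @current_account ||= current_user.account
  end

  def require_account
    head :forbidden unless current_account
  end
end

class ProjectsController < ApplicationController
  include AccountScoped

  def index
    @projects = current_account.projects
  end

  def show
    @project = current_account.projects.find(params[:id]) # scoped, never Project.find
  end
end
```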
Implementing structured logging (JSON format with request IDs) and database query instrumentation during initial development enabled rapid debugging of performance bottlenecks. Multiple production incidents were diagnosed in minutes rather than hours due to rich contextual logs. The upfront investment in observability infrastructure (even basic logging) provides compounding returns as system complexity grows.
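A sketch of the logging setup; the inline JSON formatter is illustrative, and gems such as lograge are a common alternative for fully structured request logs:

```ruby
# config/environments/production.rb (excerpt, inside the configure block):
# tag every log line with the request ID and emit JSON for machine parsing.
config.log_tags = [:request_id]

json_formatter = Class.new(ActiveSupport::Logger::SimpleFormatter) do
  def call(severity, time, _progname, message)
    { time: time.iso8601, level: severity, message: message }.to_json << "\n"
  end
end

config.logger = ActiveSupport::TaggedLogging.new(
  ActiveSupport::Logger.new($stdout).tap { |logger| logger.formatter = json_formatter.new }
)
```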