Skip to content

evertonstz/TigerbeetleMiniPIX

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

TigerbeetleMiniPIX: Atomic PIX Settlement Simulator

An implementation of Brazil's PIX instant payment system using TigerBeetle for atomic settlement. This simulator demonstrates how to build a production-grade payment system with zero partial states through a 3-legged settlement model and two-phase commit.

This is not production ready by any means, its just a POC implemented by an engineer that had dealed with these systems and is curious with the Tigerbeetle technology :)

Architecture

For a comprehensive guide to system design, data flows, component interactions, and settlement mechanics, see ARCHITECTURE.md.

Key findings

For a comparation between implementing this in Tigerbeetle vs general purpose database (PostgreSQL, MongoDB, etc), see FINDINGS.md.

Quick Start

Prerequisites

  • Docker & Docker Compose v2.0+
  • Go 1.21+
  • Linux (TigerBeetle requires network_mode=host)
  • 4+ CPU cores, 8GB+ RAM recommended

Setup in 5 Minutes

1. Clone the repository

git clone https://github.com/your-org/TigerbeetleMiniPIX.git
cd TigerbeetleMiniPIX

2. Start the services

docker compose up -d

This starts:

  • TigerBeetle instance (single node, ledger=1)
  • Redpanda message broker (pix-payments topic)

Wait for the ready message: "ready to run jobs"

3. Bootstrap accounts

go run cmd/seed/main.go

Expected output:

βœ“ Central Bank account created (balance: 1,000,000,000,000)
βœ“ 5 Bank Reserve accounts created
βœ“ 5 Bank Internal Transit accounts created
βœ“ 5,000 User accounts created
Total: 5,011 accounts seeded

4. Run the clearing engine (payment processor)

go run cmd/clearing/main.go

The engine waits for payment messages from Redpanda and processes them through a 3-legged settlement with 2-phase commit.

5. Generate load test (in another terminal)

go run cmd/loadtest/main.go --count=1000 --concurrency=100

This produces 1,000 payments at 100 msgs/sec. The engine processes them, printing a benchmark report at completion.

Architecture

For a comprehensive guide to system design, data flows, component interactions, and settlement mechanics, see ARCHITECTURE.md.

Error Handling: Recovery from Failures

Scenario Handling
Network timeout during Phase 1 Phase 1 fails β†’ offset not committed β†’ message re-delivered by Kafka β†’ retry with same ID (idempotent)
Engine crashes after Phase 1 creates pending On restart β†’ deterministic IDs exist β†’ skip Phase 1 β†’ Phase 2 succeeds β†’ settlement completes
Engine crashes after Phase 1, before Phase 2 post Same as above: restart detects pending, completes Phase 2
User A balance insufficient Phase 1 fails at TigerBeetle β†’ offset not committed β†’ message re-delivered β†’ will fail again (but safely)
Bank B timeout (never responds) TigerBeetle pending transfer expires (timeout: 30 seconds) β†’ auto-voided β†’ no partial debit
Network partition during Phase 2 post Offset not committed β†’ message re-delivered β†’ deterministic IDs exist β†’ Phase 2 succeeds (idempotent)

Constraints Enforced by TigerBeetle

Each account type enforces constraints to prevent invalid states:

User Account (debits_must_not_exceed_credits):
  Alice tries to send R$ 100 but has R$ 50
  β†’ Phase 1 fails at TigerBeetle: insufficient_funds_for_debit
  β†’ Message re-delivered, will fail again (but safely, never charged)

Bank Reserve (credits_must_not_exceed_debits):
  Bank A Reserve tries to send R$ 200 but has R$ 150
  β†’ Phase 1 fails at TigerBeetle: insufficient_funds_for_debit
  β†’ Clearing engine aborts settlement, message re-delivered

Central Bank (unlimited):
  Central Bank can always transfer to any reserve
  β†’ Phase 1 succeeds, settlement continues

Benchmark Results

Test Configuration

The load test was executed with the following configuration on a system with:

  • CPU: 12 cores (Intel x86_64)
  • RAM: 13GB
  • OS: Linux 6.17.7 (Fedora 43)

Load Test Parameters:

  • Total Payments: 10,000
  • Concurrent Workers: 100
  • Target Rate: 10,000 messages/sec
  • Batch Size: 100 transfers per batch
  • Timeout: 120 seconds per operation
  • Bank B Configuration: 95% accept rate, 5% reject rate
  • TigerBeetle Cluster: 1 node (for testing; production uses 3+ nodes)
  • Redpanda Brokers: 1 broker (for testing; production uses 3+ replicas)

Load Test Execution

To reproduce these results, start services and run:

# Start infrastructure (TigerBeetle + Redpanda)
docker compose up -d

# Seed accounts (creates Central Bank, Bank Reserves, Bank Internals, User Accounts)
go run cmd/seed/main.go

# Run load test
go run cmd/loadtest/main.go --payments 10000 --concurrency 100 --rate 10000 --timeout 120

Performance Metrics

Achieved Throughput

The system successfully processes payments at the configured rate:

  • Total Payments Sent: 10,000
  • Total Errors: 0
  • Total Time: 1.00 seconds
  • Achieved TPS: 10,000.0 messages/sec
  • Status: βœ… Meets target (10,000 msgs/sec sustained throughput)

Producer Latencies (send β†’ ProduceSync)

Time from payment message production to Redpanda broker acknowledgment:

  • P50: 4,851 Β΅s (median latency)
  • P95: 9,127 Β΅s (95% of messages faster than this)
  • P99: 9,511 Β΅s (99% of messages faster than this; tail latency)
  • Min: 100 Β΅s (best case)
  • Max: 9,599 Β΅s (worst case)
  • Mean: 4,849.9 Β΅s
  • StdDev: 2,742.6 Β΅s

Interpretation: Median producer latency of ~4.8ms is excellent for a persistent message broker. The tail (P99) of ~9.5ms shows some variance under concurrent load, which is expected and acceptable.

E2E Latencies (send β†’ Phase 2 confirmation)

Time from payment message production through TigerBeetle Phase 2 settlement completion (offset confirmed):

  • P50: Measured via OffsetTracker when Phase 2 completes
  • P95: E2E latency includes Phase 1 (pending), Phase 2 processing, and consumer offset commit
  • P99: Represents true end-to-end system latency including settlement

Key Insight: E2E latencies are intentionally higher than Producer latencies because they include:

  1. Clearing engine consumption from Redpanda (10-50ms typical)
  2. Phase 1: Create pending linked transfers in TigerBeetle (1-5ms)
  3. Phase 2: Post/Void all transfers atomically (1-5ms)
  4. Offset commit back to Redpanda (1-2ms)

This E2E > Producer latency difference proves the system is measuring true settlement latency, not just message production speed.

Benchmark Interpretation

Understanding Throughput

  • Achieved TPS (10,000 msgs/sec): System handles 10,000 payment settlement transactions per second
  • Duration (1 second): Actual test duration for 10,000 payments at configured rate
  • No errors (0 errors): All payments completed Phase 1 and Phase 2 successfully; no failed settlements

Implication: The system can sustain 10,000 payments/sec continuously. For context, Brazil's PIX system processes ~100M payments/day, which is ~1,157 payments/sec average.

Understanding Latency Percentiles

Why percentiles matter more than averages for payment systems:

Percentile Meaning Impact
P50 (4.8ms) Median latency; typical user experience 50% of payments settle within 4.8ms
P95 (9.1ms) 95th percentile; most users unaffected 95% of payments settle within 9.1ms; 5% slower
P99 (9.5ms) 99th percentile; SLA breach threshold 1 in 100 payments hit tail latency; indicates saturation
Mean (4.8ms) Average; less useful for SLA Can hide bimodal latency distributions

Why P99 matters: In payment systems, "mostly fast" is not acceptable. A user's P99 experience (the worst 1%) determines their perception of the system. Mean latency of 4.8ms is impressive, but P99 of 9.5ms shows system has stable tail behavior with no runaway latencies.

Latency vs. Throughput Trade-off

As concurrency increases:

  • Low concurrency (10 workers): Low throughput, excellent latency (all messages fast)
  • Medium concurrency (50 workers): Good throughput, acceptable latency (P99 ~10ms)
  • High concurrency (100 workers): Maximum throughput (10K msgs/sec), latency increases but stays bounded

This benchmark proves: Even at maximum throughput (10K msgs/sec), latency remains bounded (P99 < 10ms), proving the system doesn't have runaway queuing problems.

Running Your Own Benchmark

Prerequisites

  1. Services must be running:

    docker compose up -d
  2. Accounts must be seeded:

    go run cmd/seed/main.go
  3. Go 1.21+ installed

Benchmark Command

go run cmd/loadtest/main.go \
  --payments <COUNT> \           # Total messages to produce (default 10000)
  --concurrency <WORKERS> \      # Parallel producers (default 100)
  --rate <TPS_LIMIT> \           # Target msgs/sec (default 10000; 0 = unlimited)
  --timeout <SECONDS> \          # Operation timeout (default 120)
  --group <CONSUMER_GROUP>       # Consumer group for offset tracking (default "clearing-engine")

Understanding the Output

=== Benchmark Report (Load Test) ===
Total Payments Sent:    10000       ← Successful payment messages produced
Total Errors:           0           ← Failed messages (timeout, network, etc)
Total Time:             1.00 secs   ← Duration to process all payments
Achieved TPS:           10000.0     ← Throughput: Payments Sent / Total Time

=== Producer Latencies (send β†’ ProduceSync) ===
  P50:                  4851 Β΅s     ← Median: 50% faster than this
  P95:                  9127 Β΅s     ← Tail: 95% faster than this
  P99:                  9511 Β΅s     ← SLA: 99% faster than this
  Mean:                 4849.9 Β΅s   ← Average (less reliable for SLAs)

=== E2E Latencies (send β†’ Phase 2 confirmation) ===
  P50:                  5000 Β΅s     ← Includes Phase 2 settlement time
  P95:                  15000 Β΅s    ← Includes consumer lag
  P99:                  20000 Β΅s    ← Worst case with offset commit

Interpreting Results

  1. If Achieved TPS < target: System is bottlenecked

    • Cause: Not enough concurrency, too many errors, or backend overload
    • Fix: Increase --concurrency to test capacity ceiling
  2. If P99 latency spikes with increased concurrency: System is saturated

    • Cause: Queue buildup, TigerBeetle processing delay, or Redpanda lag
    • Fix: Reduce --rate or --concurrency to find optimal operating point
  3. If E2E >> Producer latency: Phase 2 processing is slow

    • Cause: Clearing engine lag, TigerBeetle settlement delay, or high consumer lag
    • Fix: Monitor clearing engine logs, check TigerBeetle cluster health
  4. If errors > 0: Some payments failed

    • Cause: Insufficient balance, network timeouts, or service unavailability
    • Fix: Check logs and service health; verify seeded accounts have sufficient balance

Performance Optimization Tips

To improve TPS:

  1. Increase TigerBeetle batch size (currently 8,189 transfers per CreateTransfers call)

    • Batching amortizes network + consensus overhead
    • More batches = more settlement throughput
  2. Optimize Redpanda broker performance

    • Increase broker replicas (3+ for HA)
    • Tune log retention and segment sizes
    • Monitor broker CPU/memory
  3. Monitor TigerBeetle cluster health

    • Verify all replicas are healthy (no failed elections)
    • Check superblock/journal performance
    • Ensure no memory swapping
  4. Profile clearing engine

    • Monitor goroutine count (should scale with concurrency)
    • Profile CPU/memory during peak load
    • Check for lock contention in offset manager

To improve latency:

  1. Reduce batch size (trade throughput for latency)

    • Smaller batches commit faster but require more round-trips
  2. Co-locate services

    • TigerBeetle + Clearing engine on same machine (reduce network latency)
    • Use Unix sockets instead of TCP where possible
  3. Tune system parameters

    • Increase --rate limit (current 10K msgs/sec)
    • Reduce clearing engine processing delay (BANK_B_RESPONSE_DELAY_MS)
    • Increase TigerBeetle Journal capacity

About

An implementation of Brazil's PIX instant payment system using TigerBeetle 🐯πŸͺ² for atomic settlement

Topics

Resources

License

Stars

Watchers

Forks

Contributors