Simulcra is an interactive distributed-systems simulator with live topology editing, failure modeling, and observability export.
You can:
- Build and edit service topologies on a canvas.
- Run a tick-based simulation in real time.
- Watch throughput, latency, queue depth, health, and failure events.
- Persist run data and topology to Databricks.
- Export node-level metrics to Prometheus (scrape or Pushgateway mode).
Simulcra is split into two runtime layers:
- `frontend/`: React + Vite app that contains the simulation engine and visual UI.
- `server/`: Express + TypeScript ingestion API for Databricks persistence and Prometheus endpoints.
High-level data flow:
- Frontend simulation runs locally in-browser.
- On each snapshot callback, the frontend posts a normalized payload to the server.
- Server deduplicates/filters snapshots and asynchronously writes accepted data to Databricks.
- Server also exposes Prometheus metrics (`/metrics/prometheus`) and optionally pushes to Pushgateway.
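The snapshot hand-off can be sketched as a small payload builder. The field names below are assumptions for illustration, not the actual wire format:

```typescript
// Hypothetical normalized snapshot payload (field names are illustrative).
interface SnapshotPayload {
  runId: string;
  tick: number;
  services: { nodeId: string; throughput: number; latencyMs: number; queueDepth: number }[];
}

// Build the payload the frontend would POST on each snapshot callback.
function buildSnapshotPayload(
  runId: string,
  tick: number,
  raw: Map<string, { throughput: number; latencyMs: number; queueDepth: number }>
): SnapshotPayload {
  const services: SnapshotPayload["services"] = [];
  raw.forEach((m, nodeId) => services.push({ nodeId, ...m }));
  return { runId, tick, services };
}
```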
- `frontend/src/app/`: UI and topology editor
- `frontend/src/simulation/`: simulation engine, node models, routing, metrics logic
- `server/routes/`: API route handlers
- `server/services/`: Databricks, Prometheus, run state, etc.
- `scripts/run-prometheus.sh`: local Prometheus launcher helper
- `PROMETHEUS_SETUP.md`: Prometheus setup notes
- Drag-and-drop nodes
- Connect services with directed edges
- Select/edit node config
- Pan across large topologies
- Start/Pause/Resume/Step controls
Current starter system includes:
- Event Producer
- Priority Producer (throughput 2)
- API Gateway
- Kafka Cluster
- Consumer Group
- Load Balancer
- Worker Pool A
- Worker Pool B
- Redis Cache
- Circuit Breaker
- PostgreSQL (sink)
- DLQ
Notes:
- Cache hits now terminate locally (they do not forward), so they reduce downstream load.
- DLQ routing requires explicit edges.
- Normal routing avoids DLQ edges; DLQ is treated as failure path.
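The routing rules above can be sketched as an edge filter. The types and names here are illustrative, not the engine's actual API:

```typescript
// Illustrative edge shape; the engine's real model may differ.
interface Edge { to: string; isDlq: boolean }

// Normal traffic avoids DLQ edges; failed messages may only take explicit
// DLQ edges; cache hits terminate locally and forward nowhere.
function selectDownstream(edges: Edge[], outcome: "ok" | "failed" | "cache_hit"): Edge[] {
  if (outcome === "cache_hit") return [];
  if (outcome === "failed") return edges.filter((e) => e.isDlq);
  return edges.filter((e) => !e.isDlq);
}
```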
Each node has explicit simulation logic in frontend/src/simulation/nodes/*.ts, with class-level comments describing behavior.
Examples:
- API gateway: throughput budget + timeout drops
- Kafka: partition queues + per-partition drain rates
- Worker pools: replica-scaled processing + probabilistic errors
- Circuit breaker: closed/open/half-open state machine
- DLQ: terminal sink that accumulates unresolved failures
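The circuit breaker's closed/open/half-open state machine can be sketched roughly like this; the thresholds, tick-based cooldown, and method names are illustrative, not the actual `frontend/src/simulation/nodes/*.ts` implementation:

```typescript
// Minimal closed/open/half-open breaker sketch (thresholds are assumptions).
class CircuitBreaker {
  private state: "closed" | "open" | "half_open" = "closed";
  private failures = 0;
  private openedAtTick = 0;

  constructor(private failureThreshold = 3, private cooldownTicks = 10) {}

  // Whether a message is allowed through at the given tick.
  allow(tick: number): boolean {
    if (this.state === "open" && tick - this.openedAtTick >= this.cooldownTicks) {
      this.state = "half_open"; // let the next message probe the downstream
    }
    return this.state !== "open";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  onFailure(tick: number): void {
    this.failures++;
    // A half-open probe failure reopens immediately; otherwise trip on threshold.
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtTick = tick;
      this.failures = 0;
    }
  }

  current(): string { return this.state; }
}
```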
Server persists:
- Per-service metrics snapshots
- Run-level snapshots
- Run lifecycle (start/end)
- Run topology JSON (`run_topologies`)
This enables post-run architecture-aware analysis (not just symptom metrics).
Server supports:
- `scrape` mode (`GET /metrics/prometheus`)
- `pushgateway` mode
- `both`
Prometheus output includes granular node metrics (identity, throughput, latency, queue depth, error/health, status encoding, backpressure estimate), plus run-level aggregates.
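As a rough sketch of the exposition format (the actual metric and label names the server emits may differ):

```typescript
// Illustrative node-metric row; mirrors the granular metrics described above.
interface NodeMetrics { runId: string; node: string; throughput: number; latencyMs: number; queueDepth: number }

// Render rows into Prometheus text exposition format.
function toExposition(rows: NodeMetrics[]): string {
  const lines: string[] = ["# TYPE simulcra_node_throughput gauge"];
  for (const r of rows) {
    const labels = `{run_id="${r.runId}",node="${r.node}"}`;
    lines.push(`simulcra_node_throughput${labels} ${r.throughput}`);
    lines.push(`simulcra_node_latency_ms${labels} ${r.latencyMs}`);
    lines.push(`simulcra_node_queue_depth${labels} ${r.queueDepth}`);
  }
  return lines.join("\n") + "\n";
}
```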
Frontend requirements:
- Node.js 20+
Commands:
```sh
cd /Users/arjunsakthi/Simulcra/frontend
npm install
npm run dev
```

Frontend runs on the Vite default (http://localhost:5173).
Server requirements:
- Node.js 20+
- Databricks SQL Warehouse credentials
Commands:
```sh
cd /Users/arjunsakthi/Simulcra/server
cp .env.example .env
npm install
npm run dev
```

Server default: http://localhost:3001
Important env vars:
- `DATABRICKS_HOST`
- `DATABRICKS_TOKEN`
- `DATABRICKS_WAREHOUSE_ID` (or `DATABRICKS_HTTP_PATH`)
- `DATABRICKS_CATALOG` / `DATABRICKS_SCHEMA` (optional)
- `PROMETHEUS_ENABLED`
- `PROMETHEUS_MODE` (`pushgateway` | `scrape` | `both`)
- `PUSHGATEWAY_URL` (required for push modes)
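A hypothetical validation pass mirroring the constraints listed above (the error messages and function are illustrative, not part of the server):

```typescript
// Env vars arrive as an untyped string map (like process.env).
interface EnvLike { [key: string]: string | undefined }

// Return a list of configuration problems; empty means the env looks valid.
function validateEnv(env: EnvLike): string[] {
  const errors: string[] = [];
  if (!env.DATABRICKS_HOST) errors.push("DATABRICKS_HOST is required");
  if (!env.DATABRICKS_TOKEN) errors.push("DATABRICKS_TOKEN is required");
  if (!env.DATABRICKS_WAREHOUSE_ID && !env.DATABRICKS_HTTP_PATH) {
    errors.push("Set DATABRICKS_WAREHOUSE_ID or DATABRICKS_HTTP_PATH");
  }
  const mode = env.PROMETHEUS_MODE;
  if (mode && !["pushgateway", "scrape", "both"].includes(mode)) {
    errors.push(`Unknown PROMETHEUS_MODE: ${mode}`);
  }
  if ((mode === "pushgateway" || mode === "both") && !env.PUSHGATEWAY_URL) {
    errors.push("PUSHGATEWAY_URL is required for push modes");
  }
  return errors;
}
```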
Starts a run record and persists topology.
Expected body:
- `runId`
- `topologyName`
- `nodeCount`
- `topology` (nodes + edges)
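The expected body could be typed and guarded roughly like this (the nested node/edge fields are assumptions for illustration):

```typescript
// Hypothetical shape of the run-start body.
interface RunStartBody {
  runId: string;
  topologyName: string;
  nodeCount: number;
  topology: { nodes: { id: string }[]; edges: { from: string; to: string }[] };
}

// Minimal runtime guard for an incoming request body.
function isRunStartBody(b: any): b is RunStartBody {
  return typeof b?.runId === "string"
    && typeof b?.topologyName === "string"
    && typeof b?.nodeCount === "number"
    && Array.isArray(b?.topology?.nodes)
    && Array.isArray(b?.topology?.edges);
}
```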
Marks run as ended.
Accepts simulation snapshot payload, deduplicates, writes to Databricks, and updates Prometheus state.
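One way to sketch the dedup step, assuming snapshots are keyed by run and tick (the server's actual keying may differ):

```typescript
// Last accepted tick per run (in-memory; illustrative, not the server's store).
const lastAcceptedTick = new Map<string, number>();

// Accept a snapshot only if its tick advances past the last accepted one.
function shouldAccept(runId: string, tick: number): boolean {
  const last = lastAcceptedTick.get(runId);
  if (last !== undefined && tick <= last) return false; // duplicate or stale
  lastAcceptedTick.set(runId, tick);
  return true;
}
```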
Prometheus exposition output (when scrape mode enabled).
Connectivity/feature status including Databricks and Prometheus mode.
Configured targets:
- `simulcra.snapshot_service_metrics`
- `simulcra.snapshots`
- `simulcra.runs`
- `simulcra.run_topologies`
To clear persisted simulation data from Databricks:
```sh
cd /Users/arjunsakthi/Simulcra/server
npm run clear:data
```

This clears:
- `snapshot_service_metrics`
- `snapshots`
- `runs`
- `run_topologies`
If using local binary bundle:
```sh
cd /Users/arjunsakthi/Simulcra
./scripts/run-prometheus.sh
```

Prometheus UI: http://localhost:9090
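If you point your own Prometheus at the server instead, a minimal scrape config might look like this (assumes the server's default port 3001 and scrape mode enabled):

```yaml
scrape_configs:
  - job_name: "simulcra"
    scrape_interval: 5s
    metrics_path: /metrics/prometheus
    static_configs:
      - targets: ["localhost:3001"]
```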
See also:
/Users/arjunsakthi/Simulcra/PROMETHEUS_SETUP.md
- Simulation is tick-based (`DEFAULT_TICK_MS = 100` in the engine).
- `throughputPerSec` translates into per-tick processing budgets with carry.
- Failures can emit `message_error` / `message_dropped` events.
- Engine-level failure handling can route failed messages to DLQ only via explicit DLQ downstream edges.
- Load balancer currently adapts only to crash/recover state (routing to non-crashed targets); it does not yet dynamically weight by observed error rate or latency.
- Some service models are intentionally simplified and probabilistic.
- Metrics cardinality grows with `run_id` + `node` labels; tune the labeling strategy if scaling to many concurrent runs.
- Frontend chunk size is currently large in the production build output.
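The budget-with-carry mechanism noted above (`throughputPerSec` mapped onto `DEFAULT_TICK_MS` ticks) can be sketched as follows; `makeBudgeter` is a hypothetical name, not the engine's:

```typescript
const DEFAULT_TICK_MS = 100; // tick length stated in the engine

// Returns a function that yields each tick's integer processing budget,
// carrying the fractional remainder so the long-run rate stays exact.
// E.g. throughputPerSec = 25 gives 2.5 messages/tick, realized as 2, 3, 2, 3, ...
function makeBudgeter(throughputPerSec: number, tickMs = DEFAULT_TICK_MS) {
  let carry = 0;
  return (): number => {
    carry += (throughputPerSec * tickMs) / 1000;
    const budget = Math.floor(carry);
    carry -= budget;
    return budget;
  };
}
```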
- Dynamic LB adaptation using downstream rolling error/latency windows.
- Retry/backoff policy nodes to model safer failure recovery.
- Rich run comparison UI backed by Databricks history.
- Optional topology linting (detect anti-patterns before run start).
- Dashboard presets for Prometheus/Grafana.
Frontend build check:

```sh
cd /Users/arjunsakthi/Simulcra/frontend
npm run build
```

Server type check:

```sh
cd /Users/arjunsakthi/Simulcra/server
npm run check
```