Request-safe scale-to-zero for Kubernetes and AI agent workloads.
LeanNodes lets platform teams hibernate idle application and agent workloads, then wake the full dependency graph on the first real HTTP request. The gateway holds the request, LeanNodes starts every required workload in DAG order, waits for Kubernetes readiness, and then releases traffic to the real service.
No SDK. No application code changes. No cron pre-warm guessing.
Browser / test client
|
v
Hold-capable gateway
Istio, Envoy Gateway, ingress-nginx, Traefik, HAProxy, APISIX, Kong, ...
|
| ext_authz / forward-auth / synchronous plugin callout
v
LeanNodes orchestrator
|
| match host + path -> claim cold-start -> execute DAG
v
Kubernetes workloads
Deployment, StatefulSet, Job prep steps, External probes
|
| ready endpoints observed
v
Gateway resumes the original request -> user receives the backend response
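The "match host + path" step in the diagram above can be pictured with a small Go sketch. The `Binding` type, the `matchFlow` function, and the longest-prefix-wins rule are all assumptions made for illustration; they are not taken from the LeanNodes codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// Binding is a hypothetical host/path binding; the real LeanNodes type may differ.
type Binding struct {
	Host       string
	PathPrefix string
	FlowID     string
}

// matchFlow returns the flow bound to the request host whose path prefix is the
// longest one matching the request path. Longest-prefix-wins is an assumption.
func matchFlow(bindings []Binding, host, path string) (string, bool) {
	best := -1
	flow := ""
	for _, b := range bindings {
		if !strings.EqualFold(b.Host, host) {
			continue
		}
		if !strings.HasPrefix(path, b.PathPrefix) {
			continue
		}
		if len(b.PathPrefix) > best {
			best = len(b.PathPrefix)
			flow = b.FlowID
		}
	}
	return flow, best >= 0
}

func main() {
	bindings := []Binding{
		{Host: "shop.example.com", PathPrefix: "/", FlowID: "storefront"},
		{Host: "shop.example.com", PathPrefix: "/api/", FlowID: "shop-api"},
	}

	flow, ok := matchFlow(bindings, "shop.example.com", "/api/cart")
	fmt.Println(flow, ok) // shop-api true

	// A request that matches no binding is not managed and would pass through.
	_, ok = matchFlow(bindings, "other.example.com", "/")
	fmt.Println(ok) // false
}
```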
The UI is designed as an operator cockpit: start with storage/gateway health, build and version flows visually, prove drained cold-starts, and govern access from the same surface.
Kubernetes platforms and AI agent systems often sit idle while still keeping full application stacks, model-serving helpers, queues, tools, and databases warm. Teams usually choose between two bad options:
- Keep everything warm and pay for idle CPU, memory, and nodes.
- Scale workloads down manually and accept broken first requests.
LeanNodes exists for the third path: scale to zero when idle, then make the
first real request wait until the application is actually ready. A successful
cold start means the user gets the real backend response: not "no healthy upstream", not an empty 503, and not a dashboard statistic claiming success
while the browser failed.
LeanNodes is built for request-safe workload cost optimization: scale idle systems to zero, wake them only when traffic proves they are needed, and avoid broken first requests while the dependency graph comes online. For cold-start-driven agentic workflows, the same request-hold orchestration model can be used anywhere the operator intentionally accepts startup latency.
LeanNodes treats "supported gateway" as a hard functional contract:
- The flow is drained before the test begins.
- A user hits the real URL through the real gateway.
- The gateway calls LeanNodes before proxying.
- LeanNodes warms every node in the flow DAG.
- The gateway resumes the same request only after the target service is ready.
- The client receives the actual backend response.
If the response is "no healthy upstream", an empty gateway 503, or a backend
race, that integration is not considered complete.
- Request-hold cold starts: LeanNodes works at the ingress/gateway layer so the first user request can be held while the application comes online.
- Graph-aware wake-up: flows are DAGs, not independent service toggles. Databases, caches, workers, migrations, APIs, and external probes can be ordered explicitly.
- Gateway-agnostic driver model: each gateway driver translates the same flow binding into that gateway's native artifact.
- Storage abstraction: application logic talks to a Store interface. DynamoDB and Postgres can back the same product behavior.
- Enterprise control plane: built-in authentication, users, groups, roles, per-flow ACLs, audit log, drafts, versions, test sessions, metrics, and notification routing.
- Conformance bias: gateway work is proved with drained four-node flows and real request paths, not warm-only curls.
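To make the graph-aware wake-up idea concrete, here is a minimal Go sketch that groups a flow's nodes into start-up waves with Kahn's algorithm. The `levels` function, the map-based graph shape, and the node names are hypothetical; the source does not describe how the LeanNodes executor is implemented.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// levels groups nodes into start-up waves. deps maps each node to the nodes it
// depends on. Every node in a wave has all of its dependencies in earlier waves.
func levels(deps map[string][]string) ([][]string, error) {
	indeg := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for node, ds := range deps {
		indeg[node] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("node %q depends on unknown node %q", node, d)
			}
			dependents[d] = append(dependents[d], node)
		}
	}

	// The first wave is every node with no dependencies.
	var current []string
	for node, d := range indeg {
		if d == 0 {
			current = append(current, node)
		}
	}

	var out [][]string
	done := 0
	for len(current) > 0 {
		sort.Strings(current) // deterministic order within a wave
		out = append(out, current)
		done += len(current)
		var next []string
		for _, node := range current {
			for _, m := range dependents[node] {
				indeg[m]--
				if indeg[m] == 0 {
					next = append(next, m)
				}
			}
		}
		current = next
	}
	if done != len(deps) {
		return nil, errors.New("flow graph contains a cycle")
	}
	return out, nil
}

func main() {
	// A four-node flow: the API needs the cache and a finished migration,
	// and the migration needs the database.
	waves, err := levels(map[string][]string{
		"postgres": nil,
		"redis":    nil,
		"migrate":  {"postgres"},
		"api":      {"migrate", "redis"},
	})
	fmt.Println(waves, err) // [[postgres redis] [migrate] [api]] <nil>
}
```

The same pass doubles as validation in this sketch: an unknown dependency or a cycle returns an error instead of an ordering.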
LeanNodes is more than "scale down and scale up". It is a workflow control plane for request-gated workload recovery:
- Self-healing cold-start recovery: failed or degraded flows can be retried from the next request, node health is tracked, retry backoff is recorded, and recovery events are surfaced through live status, metrics, audit, and notifications.
- Failure-aware rollback: when a DAG partially warms and then fails, LeanNodes can scale successful levels back down by default or leave them warm for debugging, depending on the flow policy.
- Version-managed flow changes: every committed flow version is immutable, readable, comparable, and activatable. Operators can roll forward, roll back, or branch from an older version without losing history.
- Safe draft workflow: edits happen in mutable drafts with auto-save, editor locking, stale-base detection, discard, commit notes, and explicit activation. Frequent edits do not spam the immutable version history.
- Pre-activation validation: cycles, missing nodes, overlapping host/path bindings, gateway capability limits, and invalid workload references are rejected before a flow can affect traffic.
- Operational recovery controls: Warm, Drain, Pin, Disable, Enable, and test sessions let teams recover or validate an environment without changing application code.
- Audit-ready governance: users, groups, roles, per-flow ACLs, personal access tokens, settings redaction, notification routes, and hash-chained audit events are part of the product surface.
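The failure-aware rollback behavior in the list above can be sketched as follows. This is an illustration only: `warm`, the `start` and `stop` callbacks, and the boolean rollback policy are assumed names, and the real executor presumably deals with concurrency and readiness waits that are left out here.

```go
package main

import (
	"errors"
	"fmt"
)

// errStartFailed is a sentinel used by the example start callback.
var errStartFailed = errors.New("start failed")

// warm starts each level in order. If a node fails and rollback is true, every
// node started so far is stopped in reverse order; otherwise the partially
// warmed stack is left up for debugging. start and stop are hypothetical hooks.
func warm(levels [][]string, start func(string) error, stop func(string), rollback bool) error {
	var started []string
	for _, level := range levels {
		for _, node := range level {
			if err := start(node); err != nil {
				if rollback {
					for i := len(started) - 1; i >= 0; i-- {
						stop(started[i])
					}
				}
				return fmt.Errorf("warm %s: %w", node, err)
			}
			started = append(started, node)
		}
	}
	return nil
}

func main() {
	levels := [][]string{{"postgres", "redis"}, {"migrate"}, {"api"}}

	// Simulate a failure in the last level.
	start := func(node string) error {
		if node == "api" {
			return errStartFailed
		}
		return nil
	}
	var stopped []string
	stop := func(node string) { stopped = append(stopped, node) }

	err := warm(levels, start, stop, true)
	fmt.Println(err)     // warm api: start failed
	fmt.Println(stopped) // [migrate redis postgres]
}
```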
- Host/path flow bindings.
- Sliding idle window with automatic drain.
- Manual Warm, Drain, Pin, Disable, and Enable actions.
- fastHoldSec per flow: the synchronous gateway wait budget before LeanNodes returns a controlled warm-up response while the DAG continues.
- Readiness waits for Kubernetes workload state, EndpointSlices, and optional HTTP/TCP probes.
- Shared workload reference counting so one flow does not drain a service still needed by another flow.
- Failure rollback policy with the option to leave partially warmed stacks up for debugging.
- Re-trigger-on-demand behavior: a failed or drained flow can recover on the next real request through the gateway.
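The fastHoldSec budget in the list above behaves like a bounded wait. The Go sketch below shows one way to express it with a timer and a channel; `hold`, `holdResult`, and the three outcome strings are assumptions for illustration, not the product's actual types or responses.

```go
package main

import (
	"fmt"
	"time"
)

// holdResult describes what the gateway callout would do next (hypothetical).
type holdResult string

const (
	released holdResult = "release request to backend"
	warming  holdResult = "return controlled warm-up response"
	failed   holdResult = "report warm-up failure"
)

// hold waits for the DAG outcome on ready for at most fastHold. If the budget
// runs out first, the caller answers with a warm-up response while the DAG
// keeps running in the background.
func hold(ready <-chan error, fastHold time.Duration) holdResult {
	timer := time.NewTimer(fastHold)
	defer timer.Stop()
	select {
	case err := <-ready:
		if err != nil {
			return failed
		}
		return released
	case <-timer.C:
		return warming
	}
}

func main() {
	// The buffered channel lets the warm-up goroutine finish even if nobody
	// is waiting at that moment.
	ready := make(chan error, 1)
	go func() {
		time.Sleep(50 * time.Millisecond) // stand-in for the DAG warm-up
		ready <- nil
	}()

	fmt.Println(hold(ready, 10*time.Millisecond)) // budget too short: warm-up response
	fmt.Println(hold(ready, time.Second))         // a later request is released once ready
}
```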
- Browser-based React flow editor.
- Draft model with locking, auto-save, discard, stale-base detection, commit, activate, and branch flows.
- Read-only version pages and diff/compare flows.
- DAG validation: cycles, overlapping bindings, invalid node references, and gateway capability limits are rejected before activation.
- Test sessions for validating a version before it becomes the active flow.
- Commit notes, activation history, rollback/branch flows, and immutable snapshots make flow changes reviewable instead of ad hoc UI mutations.
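As a rough picture of the overlapping-binding check mentioned in the list above, the Go sketch below flags bindings from different flows that resolve to the same host and path. The `Binding` type, `findOverlaps`, and this narrow definition of overlap (identical host and path after normalization) are assumptions; the real validator may apply stricter rules, for example around nested prefixes.

```go
package main

import (
	"fmt"
	"strings"
)

// Binding is a hypothetical host/path binding owned by one flow.
type Binding struct {
	Host   string
	Path   string
	FlowID string
}

// bindingKey normalizes a binding so that host case and trailing slashes
// do not hide a collision.
func bindingKey(b Binding) string {
	return strings.ToLower(b.Host) + "|" + strings.TrimRight(b.Path, "/")
}

// findOverlaps returns one message per binding that collides with a binding
// already claimed by a different flow.
func findOverlaps(bindings []Binding) []string {
	owner := make(map[string]string)
	var problems []string
	for _, b := range bindings {
		k := bindingKey(b)
		if prev, ok := owner[k]; ok && prev != b.FlowID {
			problems = append(problems,
				fmt.Sprintf("%s%s is bound by both %s and %s", b.Host, b.Path, prev, b.FlowID))
			continue
		}
		owner[k] = b.FlowID
	}
	return problems
}

func main() {
	problems := findOverlaps([]Binding{
		{Host: "shop.example.com", Path: "/api/", FlowID: "shop-api"},
		{Host: "shop.example.com", Path: "/", FlowID: "storefront"},
		{Host: "Shop.Example.com", Path: "/api", FlowID: "legacy-api"},
	})
	for _, p := range problems {
		fmt.Println(p)
	}
}
```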
LeanNodes supports these flow node kinds:
| Kind | What LeanNodes Does |
|---|---|
| Deployment | Scales 0 <-> target replicas and waits for readiness. |
| StatefulSet | Scales 0 <-> target replicas and waits for readiness. |
| Job | Clones/runs a prep job and waits for completion. Useful for migrations, cache warmers, seeders. |
| External | Runs HTTP/TCP probes for dependencies LeanNodes does not scale, such as managed Redis, RDS, or third-party APIs. |
Direct supported drivers currently present in the codebase:
| Gateway | Hold Mechanism |
|---|---|
| Istio | Envoy ext_authz gRPC via AuthorizationPolicy CUSTOM. |
| ingress-nginx | auth_request / forward-auth with LeanNodes Lua pre-auth bridge and service-upstream. |
| Traefik | ForwardAuth Middleware. |
| Envoy Gateway | SecurityPolicy extAuth gRPC plus route/backend policies. |
| Contour | ExtensionService authorization. |
| Emissary / Ambassador | AuthService ext_authz. |
| kgateway | GatewayExtension + TrafficPolicy. |
| HAProxy Ingress | LeanNodes frontend Lua bridge before backend selection. |
| APISIX | ApisixGlobalRule forward-auth plugin. |
| Kong | LeanNodes custom access-phase plugin and KongPlugin resources. |
| Caddy Gateway | Managed Caddyfile with forward_auth before reverse_proxy. |
| Skipper | webhook(...) filter through zalando.org/skipper-filter. |
Documented/plugin or customer-managed integrations include Tyk, WSO2 Choreo Connect, Gravitee, KrakenD, Apigee, F5 BIG-IP CIS, Citrix ADC / NetScaler, NSX Advanced Load Balancer / Avi, Wallarm, Cilium Gateway custom paths, and cloud/CDN edge topologies.
Cloud LBs and CDNs such as AWS ALB, GCE/GKE Ingress, Azure Application Gateway, OCI native ingress, Cloudflare, Fastly, and Akamai are treated as edge routers, not LeanNodes hold points. The supported topology is:
Cloud LB / CDN -> hold-capable in-cluster gateway -> LeanNodes -> service
The Settings page detects installed ingress/gateway technologies and explains whether LeanNodes has a configured driver, what configuration is missing, and which topology is required.
LeanNodes state lives behind the Store interface.
| Backend | Use Case |
|---|---|
| DynamoDB | AWS default, single-table layout, PAY_PER_REQUEST, TTL, PITR support, zero-egress VPC endpoint path. |
| Postgres | Cloud-neutral and customer-managed multi-replica backend. Uses transactions, row locks, JSONB payloads, and expiry timestamps while preserving the Store contract. |
Stored families include flows, versions, drafts, locks, cold-start claims, leader leases, executions, audit chain, users, invitations, groups, roles, per-flow ACLs, settings, and personal access tokens.
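To illustrate why cold-start claims sit behind a storage interface, here is a small Go sketch of a claim with an expiry timestamp, backed by an in-memory map. `ClaimStore`, `ClaimColdStart`, and `memStore` are hypothetical names; the actual Store interface and its DynamoDB and Postgres implementations are not shown in this document.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ClaimStore is a hypothetical slice of a Store-style interface: only the
// cold-start claim is modeled here.
type ClaimStore interface {
	// ClaimColdStart reports whether owner holds the claim for flowID after
	// the call. Times are Unix seconds.
	ClaimColdStart(flowID, owner string, nowUnix, ttlSec int64) bool
}

type claim struct {
	owner   string
	expires int64
}

// memStore is an in-memory stand-in for a real backend.
type memStore struct {
	mu     sync.Mutex
	claims map[string]claim
}

func newMemStore() *memStore {
	return &memStore{claims: make(map[string]claim)}
}

func (s *memStore) ClaimColdStart(flowID, owner string, nowUnix, ttlSec int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	// An unexpired claim held by someone else blocks the caller.
	if c, ok := s.claims[flowID]; ok && nowUnix < c.expires && c.owner != owner {
		return false
	}
	s.claims[flowID] = claim{owner: owner, expires: nowUnix + ttlSec}
	return true
}

func main() {
	var store ClaimStore = newMemStore()
	now := time.Now().Unix()

	fmt.Println(store.ClaimColdStart("checkout", "replica-1", now, 30))    // true
	fmt.Println(store.ClaimColdStart("checkout", "replica-2", now+5, 30))  // false: still claimed
	fmt.Println(store.ClaimColdStart("checkout", "replica-2", now+31, 30)) // true: claim expired
}
```

With a claim like this, only one orchestrator replica would execute the DAG for a flow while other replicas wait on the outcome; whether LeanNodes works exactly this way is an assumption.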
- Password login with invitation and reset flows.
- OIDC browser login and group-to-role mapping.
- SAML browser flow support when configured.
- Session-cookie auth for the UI.
- Personal access tokens for CLI/API automation.
- Built-in roles plus custom roles.
- Groups and per-flow ACLs.
- Zero-access default for newly provisioned SSO users.
- Audit log with hash chaining.
- Secret redaction for settings and notification destinations.
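Hash chaining, as listed for the audit log above, means each event stores the hash of its predecessor so that later edits become detectable. The Go sketch below shows the general technique with SHA-256; the `Event` fields and the hashing layout are assumptions, not the LeanNodes audit format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a hypothetical audit record.
type Event struct {
	Actor    string
	Action   string
	PrevHash string
	Hash     string
}

// hashEvent derives an event hash from the previous hash and the event fields.
func hashEvent(prev, actor, action string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + actor + "\n" + action))
	return hex.EncodeToString(sum[:])
}

// appendEvent links a new event to the tail of the chain.
func appendEvent(chain []Event, actor, action string) []Event {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, Event{
		Actor:    actor,
		Action:   action,
		PrevHash: prev,
		Hash:     hashEvent(prev, actor, action),
	})
}

// verify returns the index of the first event that breaks the chain, or -1.
func verify(chain []Event) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(prev, e.Actor, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Event
	chain = appendEvent(chain, "alice", "flow.activate")
	chain = appendEvent(chain, "bob", "flow.drain")
	chain = appendEvent(chain, "alice", "acl.update")
	fmt.Println(verify(chain)) // -1: chain is intact

	chain[1].Action = "flow.delete" // tamper with the middle event
	fmt.Println(verify(chain))      // 1: first broken link
}
```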
LeanNodes notifications are short operational signals, not chat spam.
- Slack routes with Block Kit-style content.
- Microsoft Teams routes with compact cards.
- Custom webhook routes with Send Test.
- Multiple destinations per channel type.
- Event filters per route.
- Secret/header redaction and retained write-only credentials.
- Platform-specific custom webhook formats for Google Chat, Webex, Mattermost, Rocket.Chat, Zulip, PagerDuty, Opsgenie, Grafana OnCall, ServiceNow, Jira Service Management, Datadog, Splunk HEC, New Relic, Elastic, Sumo Logic, n8n, Zapier, Workato, Mulesoft, and customer routers.
- Prometheus metrics and alert rules.
- Flow execution history.
- Live status and node-level health recovery.
- Audit log for lifecycle and access changes.
- Savings attribution model: idleHours * (flow CPU requests / node CPU capacity) * average node hourly cost.
- Dashboard rollups for flow state, cold-start outcomes, and storage health.
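A worked instance of the savings attribution formula above, written as a short Go function. The function name, the choice of CPU cores and dollars as units, and the input numbers are illustrative assumptions, not measured data.

```go
package main

import "fmt"

// estimateSavings applies the attribution formula:
// idleHours * (flow CPU requests / node CPU capacity) * average node hourly cost.
// CPU values are assumed to be in cores and cost in dollars per node-hour.
func estimateSavings(idleHours, flowCPURequests, nodeCPUCapacity, nodeHourlyCost float64) float64 {
	return idleHours * (flowCPURequests / nodeCPUCapacity) * nodeHourlyCost
}

func main() {
	// Hypothetical flow: idle 400 hours in a month, requests 2 cores on
	// 8-core nodes that cost $0.40 per hour.
	// 400 * (2 / 8) * 0.40 = 40
	fmt.Printf("%.2f\n", estimateSavings(400, 2, 8, 0.40)) // prints 40.00
}
```

The formula attributes a share of node cost proportional to the flow's CPU requests, so it is an estimate of avoidable spend rather than a billing figure.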
The current measured benchmark is documented in docs/slo.md.
Headline results from the recorded E14 run:
| Scenario | Measured Result |
|---|---|
| Unmatched/pass-through traffic | 115k+ RPS at c=100 with p99 around 2.3ms. |
| Warm-path managed Check | About 7.5k RPS per orchestrator replica with low millisecond p99 up to c=100. |
| Management list with 50 flows and ACL filtering | About 1.7k RPS with p99 under 200ms at c=100. |
| Errors in benchmark run | Zero errors across 3.6M+ requests. |
These numbers were produced in a local kind + dynamodb-local environment and should be treated as a defensible baseline, not a universal production SLO.
Kubernetes cluster
  +------------------------------+
  | Hold-capable gateway         |
  | ext_authz / forward-auth     |
  +---------------+--------------+
                  |
                  v
  +------------------------------+
  | LeanNodes orchestrator       |
  | - flow matcher               |
  | - DAG executor               |
  | - readiness watcher          |
  | - gateway driver registry    |
  | - auth/RBAC/audit/settings   |
  +----------+-----------+-------+
             |           |
             |           v
             |      Store backend
             |      DynamoDB or Postgres
             |
             v
       Kubernetes API
       scale subresource, Jobs,
       EndpointSlices, gateway CRDs
The orchestrator is a single Go binary. The web UI is a static React SPA served by the same product surface. Helm deploys the control plane, services, RBAC, network policy, metrics resources, and storage configuration.
Use this path to try LeanNodes without a cloud account.
make kind-up
make istio-up
make ddb-local-up
make run
Then open the UI at http://localhost:8080. The local admin seed is printed by
the orchestrator at startup.
Detailed guide: docs/installation/local-dev.md
- Kubernetes cluster.
- A hold-capable ingress/gateway from the supported list.
- kubectl and Helm.
- Store backend:
  - DynamoDB table access for AWS installs, or
  - Postgres DSN stored in a Kubernetes Secret.
- Permissions for LeanNodes to read flow-related Kubernetes resources, scale managed workloads, create prep Jobs, and write the selected gateway artifacts.
DynamoDB is the default backend.
helm upgrade --install leannodes deploy/helm/leannodes \
--namespace leannodes --create-namespace \
--set image.repository=<registry>/leannodes \
--set image.tag=<tag> \
--set storage.backend=dynamodb \
--set storage.dynamodb.region=us-east-1 \
--set storage.dynamodb.tableName=leannodes-state \
--set serviceAccount.roleArn=arn:aws:iam::<account>:role/leannodes-irsa \
--set ingress.host=leannodes.example.com
AWS install guide: docs/installation/eks.md
Create a Secret containing the DSN:
kubectl -n leannodes create secret generic leannodes-postgres \
--from-literal=dsn='postgres://user:password@postgres.example.com:5432/leannodes?sslmode=require'
Install with Postgres enabled:
helm upgrade --install leannodes deploy/helm/leannodes \
--namespace leannodes --create-namespace \
--set image.repository=<registry>/leannodes \
--set image.tag=<tag> \
--set storage.backend=postgres \
--set storage.postgres.existingSecretName=leannodes-postgres \
--set storage.postgres.existingSecretKey=dsn \
--set ingress.host=leannodes.example.com
The chart validates that either DynamoDB or Postgres has the required settings.
For externally managed Postgres Secrets, set
storage.postgres.existingSecretChecksum when you want Secret rotation to roll
the orchestrator pods.
Start with docs/install.md, then choose the guide matching
your gateway:
- docs/installation/istio-new.md
- docs/installation/istio-existing.md
- docs/installation/ingress-nginx.md
- docs/installation/traefik.md
- docs/installation/haproxy-ingress.md
- docs/installation/apisix.md
- docs/installation/kong.md
- docs/installation/caddy.md
- docs/installation/skipper.md
- docs/installation/contour.md
- docs/installation/emissary.md
- docs/installation/envoy-gateway.md
- docs/installation/kgateway.md
- docs/installation/aws-alb.md
- docs/installation/cilium-gateway.md
Each gateway proof should be run from a drained flow. Warm-only tests are not valid cold-start proof.
- Log in with password credentials or SSO.
- Open Settings and verify storage, gateway reachability, and detected ingress/gateway technologies.
- Create a flow with one or more host/path bindings.
- Add workload nodes and edges in the flow editor.
- Commit the draft.
- Activate the version.
- Drain the flow so all managed services scale to zero.
- Hit the real user URL through the selected gateway.
- Confirm the first request waits and returns the actual backend response.
- Use execution history, live status, metrics, audit, and notifications for operational review.
Hands-on tutorial: docs/tutorials/first-flow.md
Common environment variables:
| Variable | Purpose |
|---|---|
| LEANNODES_STORAGE_BACKEND | dynamodb or postgres. |
| LEANNODES_DDB_TABLE | DynamoDB table name. |
| LEANNODES_POSTGRES_DSN / DATABASE_URL | Postgres DSN. |
| LEANNODES_FORWARDAUTH_INTERNAL_URL | In-cluster /check URL for forward-auth gateways. |
| LEANNODES_GRPC_PORT | ext_authz gRPC port. |
| LEANNODES_AUTH_MODE | Authentication mode. |
| LEANNODES_AUTH_COOKIE_SECRET | Session signing secret. |
| LEANNODES_SUPER_ADMIN_EMAIL | Initial admin bootstrap email. |
| LEANNODES_SUPER_ADMIN_PASSWORD | Initial admin bootstrap password. |
| LEANNODES_PUBLIC_BASE_URL | Used for notification links and UI-generated URLs. |
Full deployment reference: docs/install.md
Core checks:
go test ./...
npm --prefix web run typecheck
npm --prefix web run build
Storage checks:
go test -tags=integration -count=1 ./internal/store/ddb
LEANNODES_POSTGRES_TEST_DSN='postgres://...' go test ./internal/store/postgres -count=1 -v
node scripts/functional-test-postgres-multi-replica.mjs
Gateway conformance examples:
API=http://localhost:18080 scripts/gateway-conformance-ingressnginx-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-haproxy-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-traefik-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-apisix-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-kong-kind.sh
See docs/testing.md for the complete test map.
- docs/getting-started.md: choose the right install path.
- docs/concepts.md: flows, bindings, nodes, state, storage, RBAC.
- docs/install.md: all installation guides.
- docs/tutorials/first-flow.md: first working flow.
- docs/notifications.md: Slack, Teams, custom webhooks, event catalog.
- docs/slo.md: SLOs and measured capacity.
- docs/runbook.md: day-2 operations.
- docs/troubleshooting.md: symptom-based fixes.
- docs/backup-restore.md: DynamoDB and Postgres backup strategy.
- ARCHITECTURE.MD: architecture narrative.
- DESIGN.md: design decisions and implementation notes.
- PRODUCT.md: product scope and tradeoffs.
Near-term product directions:
- More native gateway validation where stable external-auth APIs exist.
- GitOps-friendly Flow CRD and controller.
- Deeper backup/restore automation.
- Accessibility and chaos-testing hardening.
- LLM-assisted operator workflows: flow diagnosis, gateway setup review, savings recommendations, failed warm-up summarization, and runbook-guided remediation suggestions. These should stay advisory and auditable; LeanNodes should not let an LLM mutate cluster state without explicit operator approval.
- It does not replace HPA, KEDA, or cluster autoscalers.
- It does not perform canary routing, traffic shaping, or service mesh policy management beyond the hold artifact it owns.
- It does not intercept request bodies for normal cold-start decisions.
- It does not make cloud LBs or CDNs into hold-capable gateways; put a proven in-cluster gateway behind them.
- It is not intended for latency-critical traffic paths where startup delay is unacceptable.
Contributions should include functional proof, not just unit coverage. Gateway work should include a drained-flow conformance path that demonstrates the first request waits and returns the real backend body.
Useful entry points:
- internal/gateway/: gateway driver implementations.
- internal/store/: Store interface and backend implementations.
- internal/api/: HTTP API surface.
- internal/notify/: notification routing and platform adapters.
- web/: React UI.
- scripts/: functional and conformance tests.
- deploy/helm/leannodes/: Helm chart.
Before opening a PR, run the relevant backend, UI, and functional checks and include the exact commands in the PR description.
LeanNodes is fair-code software under the Sustainable Use License 1.0.
Enterprises may use and modify LeanNodes for their own internal business operations. You may also use it for non-commercial or personal purposes.
You may not sell, resell, lease, rent, offer, host, provide as a managed service, provide as SaaS, include in a paid product, or commercially exploit LeanNodes or a modified version of LeanNodes without a separate written commercial license from the maintainers.
This is intentionally not an OSI open-source license; it is a fair-code source-available license.


















