LeanNodes

Request-safe scale-to-zero for Kubernetes and AI agent workloads.

LeanNodes lets platform teams hibernate idle application and agent workloads, then wake the full dependency graph on the first real HTTP request. The gateway holds the request, LeanNodes starts every required workload in DAG order, waits for Kubernetes readiness, and then releases traffic to the real service.

No SDK. No application code changes. No cron pre-warm guessing.

Browser / test client
        |
        v
Hold-capable gateway
Istio, Envoy Gateway, ingress-nginx, Traefik, HAProxy, APISIX, Kong, ...
        |
        | ext_authz / forward-auth / synchronous plugin callout
        v
LeanNodes orchestrator
        |
        | match host + path -> claim cold-start -> execute DAG
        v
Kubernetes workloads
Deployment, StatefulSet, Job prep steps, External probes
        |
        | ready endpoints observed
        v
Gateway resumes the original request -> user receives the backend response

Product Tour

The UI is designed as an operator cockpit: start with storage/gateway health, build and version flows visually, prove drained cold-starts, and govern access from the same surface.

Password login and token fallback.
Dashboard: savings, flow state, recent cold-starts.
Light/dark operator UI with slower theme transitions.
Flow inventory with import/export and bulk operations.
Flow detail: hold windows, bindings, live DAG.
Live cold-start stream while workloads recover.
Health, versions, per-flow ACLs, and audit trail.
Immutable versions with activate and diff actions.
Version diff: field changes plus DAG comparison.
Visual editor for DAGs, bindings, gateway, timing.
Draft workflow: auto-save, commit notes, activation.
Workload discovery with HPA/KEDA conflict guidance.
Routing lookup: which flow owns this URL?
Settings status: cluster, storage, gateway detection.
User directory with invitations and auth methods.
Built-in and custom roles for least-privilege ops.
Personal access tokens for CLI and automation.
OIDC setup with group-to-role mapping.
Slack routes, event filters, and Send Test.

Why LeanNodes Exists

Kubernetes platforms and AI agent systems often sit idle while still keeping full application stacks, model-serving helpers, queues, tools, and databases warm. Teams usually choose between two bad options:

Keep everything warm and pay for idle CPU, memory, and nodes.
Scale workloads down manually and accept broken first requests.

LeanNodes exists for the third path: scale to zero when idle, then make the first real request wait until the application is actually ready. A successful cold start means the user gets the real backend response: not "no healthy upstream", not an empty 503, and not a dashboard statistic claiming success while the browser failed.

LeanNodes is built for request-safe workload cost optimization: scale idle systems to zero, wake them only when traffic proves they are needed, and avoid broken first requests while the dependency graph comes online. For cold-start-driven agentic workflows, the same request-hold orchestration model can be used anywhere the operator intentionally accepts startup latency.

The Product Promise

LeanNodes treats "supported gateway" as a hard functional contract:

The flow is drained before the test begins.
A user hits the real URL through the real gateway.
The gateway calls LeanNodes before proxying.
LeanNodes warms every node in the flow DAG.
The gateway resumes the same request only after the target service is ready.
The client receives the actual backend response.

If the response is "no healthy upstream", an empty gateway 503, or the result of a backend race, that integration is not considered complete.
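
The pass/fail rule above can be expressed as a small check on what the client actually received. The sketch below is illustrative only: `CheckColdStartResponse` is a hypothetical helper, not part of the LeanNodes codebase, and treating every other 5xx status as a backend race is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// CheckColdStartResponse returns nil when the client-observed response looks
// like a real backend answer, and an error for the failure modes listed above.
// Treating every 5xx as a failure is an assumption made for this sketch.
func CheckColdStartResponse(status int, body string) error {
	trimmed := strings.TrimSpace(body)
	switch {
	case strings.Contains(strings.ToLower(trimmed), "no healthy upstream"):
		return errors.New("gateway answered with no healthy upstream")
	case status == 503 && trimmed == "":
		return errors.New("gateway answered with an empty 503")
	case status >= 500:
		return fmt.Errorf("status %d suggests a backend race during warm-up", status)
	}
	return nil
}

func main() {
	fmt.Println(CheckColdStartResponse(200, "hello from the backend"))
	fmt.Println(CheckColdStartResponse(503, ""))
}
```

A check like this only has meaning when the flow was drained first and the request went through the real gateway; run against a warm stack, it proves nothing about cold starts.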

What Makes LeanNodes Different

Request-hold cold starts: LeanNodes works at the ingress/gateway layer so the first user request can be held while the application comes online.
Graph-aware wake-up: flows are DAGs, not independent service toggles. Databases, caches, workers, migrations, APIs, and external probes can be ordered explicitly.
Gateway-agnostic driver model: each gateway driver translates the same flow binding into that gateway's native artifact.
Storage abstraction: application logic talks to a Store interface. DynamoDB and Postgres can back the same product behavior.
Enterprise control plane: built-in authentication, users, groups, roles, per-flow ACLs, audit log, drafts, versions, test sessions, metrics, and notification routing.
Conformance bias: gateway work is proved with drained four-node flows and real request paths, not warm-only curls.
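
To make the graph-aware wake-up idea concrete, the sketch below groups a dependency graph into start-up levels with Kahn's algorithm, so that a node is only started after everything it depends on. The `Graph` type, the `Levels` function, and the node names are hypothetical and chosen for illustration; they are not taken from the LeanNodes codebase.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Graph maps a node name to the nodes it depends on (hypothetical shape).
type Graph map[string][]string

// Levels groups nodes into start-up levels: level 0 has no dependencies,
// level 1 depends only on level 0, and so on. It returns an error when a
// dependency is unknown or when the graph contains a cycle.
func Levels(g Graph) ([][]string, error) {
	remaining := make(map[string]int, len(g)) // unmet dependency count per node
	dependents := make(map[string][]string)   // reverse edges: dependency -> dependents
	for node, deps := range g {
		remaining[node] = len(deps)
		for _, dep := range deps {
			if _, ok := g[dep]; !ok {
				return nil, fmt.Errorf("node %q depends on unknown node %q", node, dep)
			}
			dependents[dep] = append(dependents[dep], node)
		}
	}

	var current []string
	for node, n := range remaining {
		if n == 0 {
			current = append(current, node)
		}
	}

	var levels [][]string
	placed := 0
	for len(current) > 0 {
		sort.Strings(current) // deterministic output for readability
		levels = append(levels, current)
		placed += len(current)
		var next []string
		for _, node := range current {
			for _, d := range dependents[node] {
				remaining[d]--
				if remaining[d] == 0 {
					next = append(next, d)
				}
			}
		}
		current = next
	}
	if placed != len(g) {
		return nil, errors.New("dependency graph contains a cycle")
	}
	return levels, nil
}

func main() {
	levels, err := Levels(Graph{
		"postgres": nil,
		"redis":    nil,
		"api":      {"postgres", "redis"},
		"worker":   {"api"},
	})
	fmt.Println(levels, err)
}
```

Nodes within one level have no ordering constraint between them, so an executor could start them concurrently and wait for the whole level to become ready before moving on.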

Feature Overview

Core Platform Capabilities

LeanNodes is more than "scale down and scale up". It is a workflow control plane for request-gated workload recovery:

Self-healing cold-start recovery: failed or degraded flows can be retried from the next request, node health is tracked, retry backoff is recorded, and recovery events are surfaced through live status, metrics, audit, and notifications.
Failure-aware rollback: when a DAG partially warms and then fails, LeanNodes can scale successful levels back down by default or leave them warm for debugging, depending on the flow policy.
Version-managed flow changes: every committed flow version is immutable, readable, comparable, and activatable. Operators can roll forward, roll back, or branch from an older version without losing history.
Safe draft workflow: edits happen in mutable drafts with auto-save, editor locking, stale-base detection, discard, commit notes, and explicit activation. Frequent edits do not spam the immutable version history.
Pre-activation validation: cycles, missing nodes, overlapping host/path bindings, gateway capability limits, and invalid workload references are rejected before a flow can affect traffic.
Operational recovery controls: Warm, Drain, Pin, Disable, Enable, and test sessions let teams recover or validate an environment without changing application code.
Audit-ready governance: users, groups, roles, per-flow ACLs, personal access tokens, settings redaction, notification routes, and hash-chained audit events are part of the product surface.

Request-Safe Cold Start

Host/path flow bindings.
Sliding idle window with automatic drain.
Manual Warm, Drain, Pin, Disable, and Enable actions.
fastHoldSec per flow: the synchronous gateway wait budget before LeanNodes returns a controlled warm-up response while the DAG continues.
Readiness waits for Kubernetes workload state, EndpointSlices, and optional HTTP/TCP probes.
Shared workload reference counting so one flow does not drain a service still needed by another flow.
Failure rollback policy with the option to leave partially warmed stacks up for debugging.
Re-trigger-on-demand behavior: a failed or drained flow can recover on the next real request through the gateway.
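
The fastHoldSec behavior can be pictured as a bounded wait: the gateway call blocks until the flow is ready or the budget expires, while the warm-up keeps running either way. The following is a minimal sketch under that reading; `Hold`, `HoldResult`, and the channel-based readiness signal are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"time"
)

// HoldResult tells the gateway-facing handler what to answer.
type HoldResult string

const (
	Ready     HoldResult = "ready"      // release the held request to the backend
	WarmingUp HoldResult = "warming-up" // return a controlled warm-up response
)

// Hold waits for the readiness signal, but only for the fastHold budget.
// The warm-up runs elsewhere and is not cancelled when the budget expires.
func Hold(fastHold time.Duration, ready <-chan struct{}) HoldResult {
	timer := time.NewTimer(fastHold)
	defer timer.Stop()
	select {
	case <-ready:
		return Ready
	case <-timer.C:
		return WarmingUp
	}
}

func main() {
	ready := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond) // stand-in for the DAG warm-up
		close(ready)
	}()

	// The first request exceeds the 10ms budget while the warm-up is running.
	fmt.Println(Hold(10*time.Millisecond, ready))

	// Once the warm-up has finished, a later request is released immediately.
	<-ready
	fmt.Println(Hold(10*time.Millisecond, ready))
}
```

Keeping the warm-up independent of the request that triggered it is what allows a retry, or the next real request, to find the flow already ready.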

Visual Flow Management

Browser-based React flow editor.
Draft model with locking, auto-save, discard, stale-base detection, commit, activate, and branch flows.
Read-only version pages and diff/compare flows.
DAG validation: cycles, overlapping bindings, invalid node references, and gateway capability limits are rejected before activation.
Test sessions for validating a version before it becomes the active flow.
Commit notes, activation history, rollback/branch flows, and immutable snapshots make flow changes reviewable instead of ad hoc UI mutations.
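
As one illustration of pre-activation validation, the sketch below detects overlapping host/path bindings between flows. It assumes a binding is a host plus a path prefix and that two bindings overlap when the hosts match and one prefix contains the other on a `/` boundary; the real matching rules may differ, and `Binding` and `FindOverlaps` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Binding is a hypothetical host + path-prefix pair owned by a flow.
type Binding struct {
	Flow, Host, PathPrefix string
}

// hasPathPrefix reports whether path p lies under prefix on a "/" boundary,
// so "/api/v1" is under "/api" but "/apiv2" is not.
func hasPathPrefix(p, prefix string) bool {
	p = strings.TrimSuffix(p, "/")
	prefix = strings.TrimSuffix(prefix, "/")
	return p == prefix || strings.HasPrefix(p, prefix+"/")
}

// overlaps reports whether two bindings could match the same request.
func overlaps(a, b Binding) bool {
	if !strings.EqualFold(a.Host, b.Host) {
		return false
	}
	return hasPathPrefix(a.PathPrefix, b.PathPrefix) || hasPathPrefix(b.PathPrefix, a.PathPrefix)
}

// FindOverlaps returns one message per pair of bindings from different flows
// that could claim the same URL.
func FindOverlaps(bindings []Binding) []string {
	var problems []string
	for i := 0; i < len(bindings); i++ {
		for j := i + 1; j < len(bindings); j++ {
			a, b := bindings[i], bindings[j]
			if a.Flow != b.Flow && overlaps(a, b) {
				problems = append(problems, fmt.Sprintf(
					"%s (%s%s) overlaps %s (%s%s)",
					a.Flow, a.Host, a.PathPrefix, b.Flow, b.Host, b.PathPrefix))
			}
		}
	}
	return problems
}

func main() {
	for _, p := range FindOverlaps([]Binding{
		{"shop", "shop.example.com", "/api"},
		{"admin", "shop.example.com", "/api/admin"},
		{"docs", "shop.example.com", "/apiv2"},
	}) {
		fmt.Println(p)
	}
}
```

Rejecting such pairs before activation means the routing lookup always resolves a URL to a single owning flow.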

Workload Model

LeanNodes supports these flow node kinds:

Kind	What LeanNodes Does
`Deployment`	Scales 0 <-> target replicas and waits for readiness.
`StatefulSet`	Scales 0 <-> target replicas and waits for readiness.
`Job`	Clones/runs a prep job and waits for completion. Useful for migrations, cache warmers, seeders.
`External`	Runs HTTP/TCP probes for dependencies LeanNodes does not scale, such as managed Redis, RDS, or third-party APIs.

Gateway Coverage

Directly supported drivers currently present in the codebase:

Gateway	Hold Mechanism
Istio	Envoy `ext_authz` gRPC via `AuthorizationPolicy` CUSTOM.
ingress-nginx	`auth_request` / forward-auth with LeanNodes Lua pre-auth bridge and `service-upstream`.
Traefik	ForwardAuth Middleware.
Envoy Gateway	`SecurityPolicy` extAuth gRPC plus route/backend policies.
Contour	`ExtensionService` authorization.
Emissary / Ambassador	`AuthService` ext_authz.
kgateway	`GatewayExtension` + `TrafficPolicy`.
HAProxy Ingress	LeanNodes frontend Lua bridge before backend selection.
APISIX	`ApisixGlobalRule` forward-auth plugin.
Kong	LeanNodes custom access-phase plugin and `KongPlugin` resources.
Caddy Gateway	Managed Caddyfile with `forward_auth` before `reverse_proxy`.
Skipper	`webhook(...)` filter through `zalando.org/skipper-filter`.

Documented, plugin-based, or customer-managed integrations include Tyk, WSO2 Choreo Connect, Gravitee, KrakenD, Apigee, F5 BIG-IP CIS, Citrix ADC / NetScaler, NSX Advanced Load Balancer / Avi, Wallarm, Cilium Gateway custom paths, and cloud/CDN edge topologies.

Cloud LBs and CDNs such as AWS ALB, GCE/GKE Ingress, Azure Application Gateway, OCI native ingress, Cloudflare, Fastly, and Akamai are treated as edge routers, not LeanNodes hold points. The supported topology is:

Cloud LB / CDN -> hold-capable in-cluster gateway -> LeanNodes -> service

The Settings page detects installed ingress/gateway technologies and explains whether LeanNodes has a configured driver, what configuration is missing, and which topology is required.

Storage and Multi-Replica Operation

LeanNodes state lives behind the Store interface.

Backend	Use Case
DynamoDB	AWS default, single-table layout, PAY_PER_REQUEST, TTL, PITR support, zero-egress VPC endpoint path.
Postgres	Cloud-neutral and customer-managed multi-replica backend. Uses transactions, row locks, JSONB payloads, and expiry timestamps while preserving the Store contract.

Stored families include flows, versions, drafts, locks, cold-start claims, leader leases, executions, audit chain, users, invitations, groups, roles, per-flow ACLs, settings, and personal access tokens.
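
In a multi-replica setup, a cold-start claim exists so that only one orchestrator replica drives the warm-up for a given flow. The sketch below shows that first-writer-wins idea with an in-memory map and a mutex. It is not the Store interface: `ClaimStore` and `TryClaim` are hypothetical, and a real backend would presumably rely on its own primitives (for example a conditional write or a row lock) rather than process-local state.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// claim records which replica owns a flow's warm-up and until when.
type claim struct {
	owner   string
	expires time.Time
}

// ClaimStore is a hypothetical in-memory stand-in for a cold-start claim table.
type ClaimStore struct {
	mu     sync.Mutex
	claims map[string]claim
}

// NewClaimStore returns an empty store.
func NewClaimStore() *ClaimStore {
	return &ClaimStore{claims: make(map[string]claim)}
}

// TryClaim makes owner responsible for warming flowID for the given ttl.
// It returns true for the first caller, and again for that same owner, until
// the claim expires; other owners get false while the claim is live.
func (s *ClaimStore) TryClaim(flowID, owner string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	if c, ok := s.claims[flowID]; ok && now.Before(c.expires) {
		return c.owner == owner
	}
	s.claims[flowID] = claim{owner: owner, expires: now.Add(ttl)}
	return true
}

func main() {
	s := NewClaimStore()
	fmt.Println(s.TryClaim("shop", "replica-a", time.Minute)) // first claimant wins
	fmt.Println(s.TryClaim("shop", "replica-b", time.Minute)) // second replica backs off
}
```

The expiry matters as much as the claim itself: if the owning replica dies mid-warm-up, another replica can take over once the claim lapses.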

Authentication, Authorization, and Governance

Password login with invitation and reset flows.
OIDC browser login and group-to-role mapping.
SAML browser flow support when configured.
Session-cookie auth for the UI.
Personal access tokens for CLI/API automation.
Built-in roles plus custom roles.
Groups and per-flow ACLs.
Zero-access default for newly provisioned SSO users.
Audit log with hash chaining.
Secret redaction for settings and notification destinations.
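
Hash chaining means each audit event carries a hash that covers the previous event's hash, so editing or removing an earlier entry invalidates everything after it. The sketch below demonstrates the general technique with SHA-256; the `AuditEvent` fields and the hashing layout are assumptions for illustration and do not describe the format LeanNodes actually stores.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditEvent is a hypothetical audit record linked to its predecessor.
type AuditEvent struct {
	Actor, Action string
	PrevHash      string
	Hash          string
}

// hashEvent derives an event hash from the previous hash and the event fields.
// The newline-separated layout is a simplification chosen for this sketch.
func hashEvent(prev, actor, action string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + actor + "\n" + action))
	return hex.EncodeToString(sum[:])
}

// Append adds a new event whose hash covers the hash of the last event.
func Append(chain []AuditEvent, actor, action string) []AuditEvent {
	prev := ""
	if n := len(chain); n > 0 {
		prev = chain[n-1].Hash
	}
	return append(chain, AuditEvent{
		Actor: actor, Action: action,
		PrevHash: prev, Hash: hashEvent(prev, actor, action),
	})
}

// Verify returns the index of the first event that breaks the chain,
// or -1 when the whole chain is intact.
func Verify(chain []AuditEvent) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(prev, e.Actor, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []AuditEvent
	chain = Append(chain, "alice", "flow.activate")
	chain = Append(chain, "bob", "flow.drain")
	chain = Append(chain, "alice", "acl.update")
	fmt.Println(Verify(chain))

	chain[1].Action = "flow.warm" // simulate tampering with a stored event
	fmt.Println(Verify(chain))
}
```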

Notifications and Enterprise Integrations

LeanNodes notifications are short operational signals, not chat spam.

Slack routes with Block Kit-style content.
Microsoft Teams routes with compact cards.
Custom webhook routes with Send Test.
Multiple destinations per channel type.
Event filters per route.
Secret/header redaction and retained write-only credentials.
Platform-specific custom webhook formats for Google Chat, Webex, Mattermost, Rocket.Chat, Zulip, PagerDuty, Opsgenie, Grafana OnCall, ServiceNow, Jira Service Management, Datadog, Splunk HEC, New Relic, Elastic, Sumo Logic, n8n, Zapier, Workato, Mulesoft, and customer routers.

See docs/notifications.md.

Observability and FinOps

Prometheus metrics and alert rules.
Flow execution history.
Live status and node-level health recovery.
Audit log for lifecycle and access changes.
Savings attribution model: idleHours * (flow CPU requests / node CPU capacity) * average node hourly cost.
Dashboard rollups for flow state, cold-start outcomes, and storage health.
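
The savings attribution formula above is simple enough to work through by hand. The sketch below applies it to made-up numbers: a flow requesting 2 CPUs, idle for 100 hours, on nodes with 8 CPUs that cost 0.40 per hour. The `Savings` function and all input values are illustrative assumptions, not measured data.

```go
package main

import "fmt"

// Savings applies the attribution formula from the list above:
// idleHours * (flow CPU requests / node CPU capacity) * average node hourly cost.
// A non-positive node capacity yields 0 to avoid dividing by zero.
func Savings(idleHours, flowCPURequests, nodeCPUCapacity, nodeHourlyCost float64) float64 {
	if nodeCPUCapacity <= 0 {
		return 0
	}
	return idleHours * (flowCPURequests / nodeCPUCapacity) * nodeHourlyCost
}

func main() {
	// 100 idle hours * (2 / 8) of a node * 0.40 per node-hour = 10.00
	fmt.Printf("%.2f\n", Savings(100, 2, 8, 0.40))
}
```

The model attributes a CPU-proportional share of node cost to the flow, so it is an estimate: money is only saved in practice if the freed capacity lets the cluster run fewer or smaller nodes.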

Measured Capacity

The current measured benchmark is documented in docs/slo.md. Headline results from the recorded E14 run:

Scenario	Measured Result
Unmatched/pass-through traffic	115k+ RPS at c=100 with p99 around 2.3ms.
Warm-path managed Check	About 7.5k RPS per orchestrator replica with low millisecond p99 up to c=100.
Management list with 50 flows and ACL filtering	About 1.7k RPS with p99 under 200ms at c=100.
Errors in benchmark run	Zero errors across 3.6M+ requests.

These numbers were produced in a local kind + dynamodb-local environment and should be treated as a defensible baseline, not a universal production SLO.

Architecture

                 Kubernetes cluster

        +------------------------------+
        | Hold-capable gateway         |
        | ext_authz / forward-auth     |
        +---------------+--------------+
                        |
                        v
        +------------------------------+
        | LeanNodes orchestrator       |
        | - flow matcher               |
        | - DAG executor               |
        | - readiness watcher          |
        | - gateway driver registry    |
        | - auth/RBAC/audit/settings   |
        +----------+-----------+-------+
                   |           |
                   |           v
                   |     Store backend
                   |     DynamoDB or Postgres
                   |
                   v
          Kubernetes API
          scale subresource, Jobs,
          EndpointSlices, gateway CRDs

The orchestrator is a single Go binary. The web UI is a static React SPA served by the same product surface. Helm deploys the control plane, services, RBAC, network policy, metrics resources, and storage configuration.

Quick Start: Local Kind

Use this path to try LeanNodes without a cloud account.

make kind-up
make istio-up
make ddb-local-up
make run

Then open the UI at http://localhost:8080. The local admin seed is printed by the orchestrator at startup.

Detailed guide: docs/installation/local-dev.md

Helm Installation

Prerequisites

Kubernetes cluster.
A hold-capable ingress/gateway from the supported list.
kubectl and Helm.
Store backend:
- DynamoDB table access for AWS installs, or
- Postgres DSN stored in a Kubernetes Secret.
Permissions for LeanNodes to read flow-related Kubernetes resources, scale managed workloads, create prep Jobs, and write the selected gateway artifacts.

DynamoDB Backend

DynamoDB is the default backend.

helm upgrade --install leannodes deploy/helm/leannodes \
  --namespace leannodes --create-namespace \
  --set image.repository=<registry>/leannodes \
  --set image.tag=<tag> \
  --set storage.backend=dynamodb \
  --set storage.dynamodb.region=us-east-1 \
  --set storage.dynamodb.tableName=leannodes-state \
  --set serviceAccount.roleArn=arn:aws:iam::<account>:role/leannodes-irsa \
  --set ingress.host=leannodes.example.com

AWS install guide: docs/installation/eks.md

Postgres Backend

Create a Secret containing the DSN:

kubectl -n leannodes create secret generic leannodes-postgres \
  --from-literal=dsn='postgres://user:password@postgres.example.com:5432/leannodes?sslmode=require'

Install with Postgres enabled:

helm upgrade --install leannodes deploy/helm/leannodes \
  --namespace leannodes --create-namespace \
  --set image.repository=<registry>/leannodes \
  --set image.tag=<tag> \
  --set storage.backend=postgres \
  --set storage.postgres.existingSecretName=leannodes-postgres \
  --set storage.postgres.existingSecretKey=dsn \
  --set ingress.host=leannodes.example.com

The chart validates that either DynamoDB or Postgres has the required settings. For externally managed Postgres Secrets, set storage.postgres.existingSecretChecksum when you want Secret rotation to roll the orchestrator pods.

Configure a Gateway

Start with docs/install.md, then choose the guide matching your gateway.

Each gateway proof should be run from a drained flow. Warm-only tests are not valid cold-start proof.

Using LeanNodes

Log in with password credentials or SSO.
Open Settings and verify storage, gateway reachability, and detected ingress/gateway technologies.
Create a flow with one or more host/path bindings.
Add workload nodes and edges in the flow editor.
Commit the draft.
Activate the version.
Drain the flow so all managed services scale to zero.
Hit the real user URL through the selected gateway.
Confirm the first request waits and returns the actual backend response.
Use execution history, live status, metrics, audit, and notifications for operational review.

Hands-on tutorial: docs/tutorials/first-flow.md

Configuration Highlights

Common environment variables:

Variable	Purpose
`LEANNODES_STORAGE_BACKEND`	`dynamodb` or `postgres`.
`LEANNODES_DDB_TABLE`	DynamoDB table name.
`LEANNODES_POSTGRES_DSN` / `DATABASE_URL`	Postgres DSN.
`LEANNODES_FORWARDAUTH_INTERNAL_URL`	In-cluster `/check` URL for forward-auth gateways.
`LEANNODES_GRPC_PORT`	ext_authz gRPC port.
`LEANNODES_AUTH_MODE`	Authentication mode.
`LEANNODES_AUTH_COOKIE_SECRET`	Session signing secret.
`LEANNODES_SUPER_ADMIN_EMAIL`	Initial admin bootstrap email.
`LEANNODES_SUPER_ADMIN_PASSWORD`	Initial admin bootstrap password.
`LEANNODES_PUBLIC_BASE_URL`	Used for notification links and UI-generated URLs.

Full deployment reference: docs/install.md

Testing and Conformance

Core checks:

go test ./...
npm --prefix web run typecheck
npm --prefix web run build

Storage checks:

go test -tags=integration -count=1 ./internal/store/ddb
LEANNODES_POSTGRES_TEST_DSN='postgres://...' go test ./internal/store/postgres -count=1 -v
node scripts/functional-test-postgres-multi-replica.mjs

Gateway conformance examples:

API=http://localhost:18080 scripts/gateway-conformance-ingressnginx-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-haproxy-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-traefik-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-apisix-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-kong-kind.sh

See docs/testing.md for the complete test map.

Documentation Map

docs/getting-started.md: choose the right install path.
docs/concepts.md: flows, bindings, nodes, state, storage, RBAC.
docs/install.md: all installation guides.
docs/tutorials/first-flow.md: first working flow.
docs/notifications.md: Slack, Teams, custom webhooks, event catalog.
docs/slo.md: SLOs and measured capacity.
docs/runbook.md: day-2 operations.
docs/troubleshooting.md: symptom-based fixes.
docs/backup-restore.md: DynamoDB and Postgres backup strategy.
ARCHITECTURE.MD: architecture narrative.
DESIGN.md: design decisions and implementation notes.
PRODUCT.md: product scope and tradeoffs.

Roadmap

Near-term product directions:

More native gateway validation where stable external-auth APIs exist.
GitOps-friendly Flow CRD and controller.
Deeper backup/restore automation.
Accessibility and chaos-testing hardening.
LLM-assisted operator workflows: flow diagnosis, gateway setup review, savings recommendations, failed warm-up summarization, and runbook-guided remediation suggestions. These should stay advisory and auditable; LeanNodes should not let an LLM mutate cluster state without explicit operator approval.

What LeanNodes Does Not Do

It does not replace HPA, KEDA, or cluster autoscalers.
It does not perform canary routing, traffic shaping, or service mesh policy management beyond the hold artifact it owns.
It does not intercept request bodies for normal cold-start decisions.
It does not make cloud LBs or CDNs into hold-capable gateways; put a proven in-cluster gateway behind them.
It is not intended for latency-critical traffic paths where startup delay is unacceptable.

Contributing

Contributions should include functional proof, not just unit coverage. Gateway work should include a drained-flow conformance path that demonstrates the first request waits and returns the real backend body.

Useful entry points:

internal/gateway/: gateway driver implementations.
internal/store/: Store interface and backend implementations.
internal/api/: HTTP API surface.
internal/notify/: notification routing and platform adapters.
web/: React UI.
scripts/: functional and conformance tests.
deploy/helm/leannodes/: Helm chart.

Before opening a PR, run the relevant backend, UI, and functional checks and include the exact commands in the PR description.

License

LeanNodes is fair-code software under the Sustainable Use License 1.0.

Enterprises may use and modify LeanNodes for their own internal business operations. You may also use it for non-commercial or personal purposes.

You may not sell, resell, lease, rent, offer, host, provide as a managed service, provide as SaaS, include in a paid product, or commercially exploit LeanNodes or a modified version of LeanNodes without a separate written commercial license from the maintainers.

This is intentionally not an OSI open-source license; it is a fair-code source-available license.
