trickstercache · jranson · Jun 1, 2026 · May 27, 2026 · May 27, 2026 · May 27, 2026
@@ -445,7 +445,9 @@ A backend will report one of three possible health states to its ALBs: `unavaila
 
 Each ALB has a configurable `healthy_floor` value, which is the threshold for determining which pool members are included in the healthy pool, based on their instantaneous health state. The `healthy_floor` represents the minimum acceptable health state value for inclusion in the healthy pool. The default `healthy_floor` value is `0`, meaning Backends in a state `>= 0` (`unknown` and `available`) are included in the healthy pool. Setting `healthy_floor: 1` would include only `available` Backends, while a value of `-1` will include all backends in the configured pool, including those marked as `unavailable`.
 
-Backends that do not have a [health check interval](./health#example+health+check+configuration+for+use+in+alb) configured will remain in a permanent state of `unknown`. Backends will also be in an `unknown` state from the time Trickster starts until the first of any configured automated health check is completed. Note that if an ALB is configured with `healthy_floor: 1`, any pool members that are not configured with an automated health check interval will never be included in the ALB's healthy pool, as their state is permanently `0`.
+Backends that do not have a [health check interval](./health#example+health+check+configuration+for+use+in+alb) configured will remain in a permanent state of `unknown`. Backends will also be in an `unknown` state from the time Trickster starts until the first configured automated health check completes. A pool member in a permanent `unknown` state can never reach `available`, so a `healthy_floor: 1` ALB whose members lack health checks would have an empty pool and return `502` for every request. To avoid that, Trickster resets such an ALB's effective floor to `0` at startup, emits a warning naming the ALB and the un-probed members, and sets the `trickster_alb_pool_floor_reset{backend_name}` gauge to `1`. Configure a health check interval on those members if you want `healthy_floor: 1` to apply.
+
+Setting `healthy_floor` below `0` admits members that the probe has confirmed `unavailable`, not just members in the transient `unknown` state. If your goal is to keep traffic flowing during the cold-start window before the first probes complete, lower the pool members' `recovery_threshold` so they transition out of `unknown` faster instead of lowering the floor. When `healthy_floor < 0`, Trickster emits a startup warning and sets the `trickster_alb_pool_admits_failing{backend_name}` gauge to `1`.
 
 ### Example ALB Configuration Routing Only To Known Healthy Backends
 

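The `healthy_floor` comparison described in the alb.md change above can be sketched in Go. This is a minimal illustration, not Trickster's implementation: the `State`, `Member`, and `HealthyPool` names are hypothetical, and only the numeric state values (`-1`, `0`, `1`) and the `>=` comparison are taken from the documentation.

```go
package main

import "fmt"

// State mirrors the three health states from the docs. The numeric values
// follow the documentation; the type and constant names are hypothetical.
type State int

const (
	Unavailable State = -1
	Unknown     State = 0
	Available   State = 1
)

// Member is a hypothetical pool member with its instantaneous health state.
type Member struct {
	Name  string
	State State
}

// HealthyPool returns the members whose state is at or above the floor.
func HealthyPool(pool []Member, floor int) []Member {
	var out []Member
	for _, m := range pool {
		if int(m.State) >= floor {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	pool := []Member{
		{Name: "prom-a", State: Available},
		{Name: "prom-b", State: Unknown},
		{Name: "prom-c", State: Unavailable},
	}
	for _, floor := range []int{1, 0, -1} {
		var names []string
		for _, m := range HealthyPool(pool, floor) {
			names = append(names, m.Name)
		}
		fmt.Printf("healthy_floor=%d admits %v\n", floor, names)
	}
}
```

With this sample pool, a floor of `1` admits only the `available` member, the default `0` also admits the `unknown` member, and `-1` admits all three, including the member confirmed `unavailable`.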
@@ -102,6 +102,8 @@ backends:
       recovery_threshold: 3 # backend is healthy after 3 consecutive successes
 ```
 
+The Prometheus default probe is `/api/v1/query?query=up`. Some multi-tenant Prometheus gateways reject an unbounded `up` with `400 bad_data: "too many series found"`, which keeps the member out of any ALB pool it belongs to. Override `healthcheck.query` with a bounded expression the backend accepts (for example `query=vector(1)`) when probing such backends.
+
 ## Other Ways to Monitor Health
 
 In addition to the out-of-the-box health checks to determine up-or-down status, you may want to setup alarms and thresholds based on the metrics instrumented by Trickster. See [metrics.md](metrics.md) for collecting performance metrics about Trickster.
@@ -91,6 +91,14 @@ The following metrics are available for polling with any Trickster configuration
     * `operation` - the name of the operation being performed (read, write, etc.)
     * `status` - the result of the operation being performed
 
+* `trickster_alb_pool_admits_failing` (Gauge) - 1 when an ALB pool's `healthy_floor` admits members in the `unavailable` state, 0 otherwise. See [alb.md](./alb.md#health-based-backend-selection) for the recommended floor.
+  * labels:
+    * `backend_name` - the name of the configured ALB backend
+
+* `trickster_alb_pool_floor_reset` (Gauge) - 1 when an ALB pool's `healthy_floor` was reset to 0 at startup because pool members have no health check and could never reach the configured floor, 0 otherwise. See [alb.md](./alb.md#health-based-backend-selection).
+  * labels:
+    * `backend_name` - the name of the configured ALB backend
+
 ---
 
 The following metrics are available only for Caches Types whose object lifecycle Trickster manages internally (Memory, Filesystem and bbolt):

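One way to act on the two new gauges is to scan the metrics endpoint's text output for them, much as the integration tests in this change do with string matching. The sketch below is hypothetical: `GaugeFor` is not part of Trickster, the sample exposition text is made up for illustration, and it assumes each sample line ends with the value (no trailing timestamp).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// GaugeFor scans Prometheus text-exposition output for the named metric with
// the given backend_name label and returns its value. It is a hypothetical
// helper and assumes the value is the last space-separated field on the line.
func GaugeFor(exposition, metric, backend string) (float64, bool) {
	label := fmt.Sprintf("backend_name=%q", backend)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, metric+"{") || !strings.Contains(line, label) {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Illustrative sample only; real output contains many more series.
	sample := `# TYPE trickster_alb_pool_admits_failing gauge
trickster_alb_pool_admits_failing{backend_name="alb-admits-failing"} 1
trickster_alb_pool_admits_failing{backend_name="alb-excludes-failing"} 0
# TYPE trickster_alb_pool_floor_reset gauge
trickster_alb_pool_floor_reset{backend_name="alb-floor1"} 1
`
	if v, ok := GaugeFor(sample, "trickster_alb_pool_floor_reset", "alb-floor1"); ok && v == 1 {
		fmt.Println("alb-floor1: healthy_floor was reset; add health checks to its pool members")
	}
	if v, ok := GaugeFor(sample, "trickster_alb_pool_admits_failing", "alb-admits-failing"); ok && v == 1 {
		fmt.Println("alb-admits-failing: floor admits unavailable members")
	}
}
```

In practice an alerting rule on the gauge is likely the simpler route; the string scan is shown only because it needs no client library.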
@@ -0,0 +1,210 @@
+/*
+ * Copyright 2018 The Trickster Authors
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package integration
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"net/url"
+	"os"
+	"path/filepath"
+	"strings"
+	"sync/atomic"
+	"testing"
+	"time"
+
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+	"github.com/trickstercache/trickster/v2/integration/internal/portutil"
+)
+
+// TestALBHealthyFloorAdmitsFailingMetric verifies the warning surface for an
+// ALB whose healthy_floor admits Failing members. An operator who lowered
+// healthy_floor below 0 to keep traffic flowing during the Initializing
+// startup window also admits members the probe has confirmed broken; the
+// `trickster_alb_pool_admits_failing` gauge surfaces that misconfiguration.
+func TestALBHealthyFloorAdmitsFailingMetric(t *testing.T) {
+	healthy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Content-Type", "application/json")
+		w.WriteHeader(http.StatusOK)
+		fmt.Fprint(w, `{"status":"success","data":{"version":"2.0"}}`)
+	}))
+	t.Cleanup(healthy.Close)
+	broken := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusInternalServerError)
+	}))
+	t.Cleanup(broken.Close)
+
+	ports, release := portutil.Reserve(t, 3)
+	frontPort, metricsPort, mgmtPort := ports[0], ports[1], ports[2]
+
+	yaml := fmt.Sprintf(albTestdata(t, "alb_missing_hc/floor_warn.yaml.tmpl"),
+		frontPort, metricsPort, mgmtPort, healthy.URL, broken.URL)
+	cfgPath := filepath.Join(t.TempDir(), "trickster.yaml")
+	require.NoError(t, os.WriteFile(cfgPath, []byte(yaml), 0644))
+
+	ctx, cancel := context.WithCancel(context.Background())
+	t.Cleanup(cancel)
+	release()
+	go startTrickster(t, ctx, expectedStartError{}, "-config", cfgPath)
+
+	metricsAddr := fmt.Sprintf("127.0.0.1:%d", metricsPort)
+	waitForTrickster(t, metricsAddr)
+
+	require.EventuallyWithT(t, func(collect *assert.CollectT) {
+		lines := checkTricksterMetrics(t, metricsAddr)
+		var admits, excludes string
+		for _, l := range lines {
+			if strings.HasPrefix(l, "trickster_alb_pool_admits_failing{") {
+				if strings.Contains(l, `backend_name="alb-admits-failing"`) {
+					admits = l
+				}
+				if strings.Contains(l, `backend_name="alb-excludes-failing"`) {
+					excludes = l
+				}
+			}
+		}
+		assert.True(collect, strings.HasSuffix(admits, " 1"),
+			"alb-admits-failing must report 1: %q", admits)
+		assert.True(collect, strings.HasSuffix(excludes, " 0"),
+			"alb-excludes-failing must report 0: %q", excludes)
+	}, 5*time.Second, 200*time.Millisecond)
+}
+
+// TestALBHealthyFloorResetWhenMemberHasNoHealthcheck covers #1015: an ALB with
+// healthy_floor: 1 whose only member has no health check. The member is stuck
+// Unchecked and would be permanently excluded (empty pool -> 502). Trickster
+// resets the effective floor to 0, sets trickster_alb_pool_floor_reset, and
+// keeps serving 200.
+func TestALBHealthyFloorResetWhenMemberHasNoHealthcheck(t *testing.T) {
+	vector := albTestdata(t, "alb_unavail/healthy.json")
+	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Content-Type", "application/json")
+		w.WriteHeader(http.StatusOK)
+		fmt.Fprint(w, vector)
+	}))
+	t.Cleanup(origin.Close)
+
+	ports, release := portutil.Reserve(t, 3)
+	frontPort, metricsPort, mgmtPort := ports[0], ports[1], ports[2]
+	yaml := fmt.Sprintf(albTestdata(t, "alb_missing_hc/floor_reset.yaml.tmpl"),
+		frontPort, metricsPort, mgmtPort, origin.URL)
+	cfgPath := filepath.Join(t.TempDir(), "trickster.yaml")
+	require.NoError(t, os.WriteFile(cfgPath, []byte(yaml), 0644))
+
+	ctx, cancel := context.WithCancel(context.Background())
+	t.Cleanup(cancel)
+	release()
+	go startTrickster(t, ctx, expectedStartError{}, "-config", cfgPath)
+
+	metricsAddr := fmt.Sprintf("127.0.0.1:%d", metricsPort)
+	waitForTrickster(t, metricsAddr)
+
+	require.EventuallyWithT(t, func(collect *assert.CollectT) {
+		var line string
+		for _, l := range checkTricksterMetrics(t, metricsAddr) {
+			if strings.HasPrefix(l, "trickster_alb_pool_floor_reset{") &&
+				strings.Contains(l, `backend_name="alb-floor1"`) {
+				line = l
+			}
+		}
+		assert.True(collect, strings.HasSuffix(line, " 1"),
+			"alb-floor1 floor-reset gauge must be 1: %q", line)
+	}, 5*time.Second, 200*time.Millisecond)
+
+	// member admitted under the reset floor -> 200, not an empty-pool 502
+	frontAddr := fmt.Sprintf("127.0.0.1:%d", frontPort)
+	resp, err := http.Get(fmt.Sprintf("http://%s/alb-floor1/api/v1/query?query=up", frontAddr))
+	require.NoError(t, err)
+	defer resp.Body.Close()
+	require.Equal(t, http.StatusOK, resp.StatusCode)
+}
+
+// TestALBPoolDegradeWarnsInResponse covers thinker0's silent single-member
+// degrade: a 2-member TSM pool where one member's probe fails drops to one live
+// member. TSM still serves 200 from the survivor, but the response must carry a
+// `warnings` entry so the caller knows the merge collapsed to a single shard.
+func TestALBPoolDegradeWarnsInResponse(t *testing.T) {
+	vector := albTestdata(t, "alb_unavail/healthy.json")
+	ok := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Content-Type", "application/json")
+		w.WriteHeader(http.StatusOK)
+		fmt.Fprint(w, vector)
+	}))
+	t.Cleanup(ok.Close)
+	var badData atomic.Int64
+	bad := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		if r.URL.Path != "/api/v1/status/buildinfo" {
+			badData.Add(1)
+		}
+		w.WriteHeader(http.StatusInternalServerError)
+	}))
+	t.Cleanup(bad.Close)
+
+	ports, release := portutil.Reserve(t, 3)
+	frontPort, metricsPort, mgmtPort := ports[0], ports[1], ports[2]
+	yaml := fmt.Sprintf(albTestdata(t, "alb_missing_hc/degrade.yaml.tmpl"),
+		frontPort, metricsPort, mgmtPort, ok.URL, bad.URL)
+	cfgPath := filepath.Join(t.TempDir(), "trickster.yaml")
+	require.NoError(t, os.WriteFile(cfgPath, []byte(yaml), 0644))
+
+	ctx, cancel := context.WithCancel(context.Background())
+	t.Cleanup(cancel)
+	release()
+	go startTrickster(t, ctx, expectedStartError{}, "-config", cfgPath)
+
+	frontAddr := fmt.Sprintf("127.0.0.1:%d", frontPort)
+	metricsAddr := fmt.Sprintf("127.0.0.1:%d", metricsPort)
+	waitForTrickster(t, metricsAddr)
+
+	var n atomic.Int64
+	require.EventuallyWithT(t, func(collect *assert.CollectT) {
+		// unique query per attempt so the cache can't mask the live merge
+		q := fmt.Sprintf("up + 0*%d", n.Add(1))
+		u := fmt.Sprintf("http://%s/alb-degrade/api/v1/query?query=%s", frontAddr, url.QueryEscape(q))
+		resp, err := http.Get(u)
+		if !assert.NoError(collect, err) {
+			return
+		}
+		defer resp.Body.Close()
+		b, _ := io.ReadAll(resp.Body)
+		if !assert.Equal(collect, http.StatusOK, resp.StatusCode, "body: %s", b) {
+			return
+		}
+		var pr struct {
+			Status   string   `json:"status"`
+			Warnings []string `json:"warnings"`
+		}
+		if !assert.NoError(collect, json.Unmarshal(b, &pr)) {
+			return
+		}
+		var found bool
+		for _, wn := range pr.Warnings {
+			if strings.Contains(wn, "1 of 2 pool members") {
+				found = true
+			}
+		}
+		assert.True(collect, found,
+			"expected degrade warning in response warnings: %v", pr.Warnings)
+	}, 6*time.Second, 200*time.Millisecond)
+
+	require.Zero(t, badData.Load(), "failing member must not receive data requests")
+}
@@ -21,7 +21,7 @@ import (
 	"testing"
 )
 
-//go:embed testdata/alb_cache testdata/alb_tsm_correctness testdata/alb_response_headers testdata/alb_nested testdata/alb_per_path testdata/alb_unavail testdata/alb_request_headers testdata/alb_compose
+//go:embed testdata/alb_cache testdata/alb_tsm_correctness testdata/alb_response_headers testdata/alb_nested testdata/alb_per_path testdata/alb_unavail testdata/alb_missing_hc testdata/alb_request_headers testdata/alb_compose
 var albTestdataFS embed.FS
 
 func albTestdata(t testing.TB, name string) string {

@@ -0,0 +1,43 @@
+
+frontend:
+  listen_port: %d
+metrics:
+  listen_port: %d
+mgmt:
+  listen_port: %d
+logging:
+  log_level: warn
+caches:
+  mem1:
+    provider: memory
+backends:
+  prom-ok:
+    provider: prometheus
+    origin_url: %s
+    cache_name: mem1
+    healthcheck:
+      path: /api/v1/status/buildinfo
+      query: ""
+      interval: 100ms
+      timeout: 500ms
+      failure_threshold: 1
+      recovery_threshold: 1
+  prom-bad:
+    provider: prometheus
+    origin_url: %s
+    cache_name: mem1
+    healthcheck:
+      path: /api/v1/status/buildinfo
+      query: ""
+      interval: 100ms
+      timeout: 500ms
+      failure_threshold: 1
+      recovery_threshold: 1
+  alb-degrade:
+    provider: alb
+    alb:
+      mechanism: tsm
+      output_format: prometheus
+      pool:
+        - prom-ok
+        - prom-bad
@@ -0,0 +1,25 @@
+
+frontend:
+  listen_port: %d
+metrics:
+  listen_port: %d
+mgmt:
+  listen_port: %d
+logging:
+  log_level: warn
+caches:
+  mem1:
+    provider: memory
+backends:
+  prom-noprobe:
+    provider: prometheus
+    origin_url: %s
+    cache_name: mem1
+  alb-floor1:
+    provider: alb
+    alb:
+      mechanism: tsm
+      output_format: prometheus
+      healthy_floor: 1
+      pool:
+        - prom-noprobe
@@ -0,0 +1,52 @@
+
+frontend:
+  listen_port: %d
+metrics:
+  listen_port: %d
+mgmt:
+  listen_port: %d
+logging:
+  log_level: warn
+caches:
+  mem1:
+    provider: memory
+backends:
+  prom-healthy:
+    provider: prometheus
+    origin_url: %s
+    cache_name: mem1
+    healthcheck:
+      path: /api/v1/status/buildinfo
+      query: ""
+      interval: 100ms
+      timeout: 500ms
+      failure_threshold: 1
+      recovery_threshold: 1
+  prom-broken:
+    provider: prometheus
+    origin_url: %s
+    cache_name: mem1
+    healthcheck:
+      path: /api/v1/status/buildinfo
+      query: ""
+      interval: 100ms
+      timeout: 500ms
+      failure_threshold: 1
+      recovery_threshold: 1
+  alb-admits-failing:
+    provider: alb
+    alb:
+      mechanism: tsm
+      output_format: prometheus
+      healthy_floor: -1
+      pool:
+        - prom-healthy
+        - prom-broken
+  alb-excludes-failing:
+    provider: alb
+    alb:
+      mechanism: tsm
+      output_format: prometheus
+      pool:
+        - prom-healthy
+        - prom-broken