Merged
4 changes: 3 additions & 1 deletion docs/alb.md
@@ -445,7 +445,9 @@ A backend will report one of three possible health states to its ALBs: `unavaila

Each ALB has a configurable `healthy_floor` value, which is the threshold for determining which pool members are included in the healthy pool, based on their instantaneous health state. The `healthy_floor` represents the minimum acceptable health state value for inclusion in the healthy pool. The default `healthy_floor` value is `0`, meaning Backends in a state `>= 0` (`unknown` and `available`) are included in the healthy pool. Setting `healthy_floor: 1` would include only `available` Backends, while a value of `-1` will include all backends in the configured pool, including those marked as `unavailable`.
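
The floor comparison described above can be sketched in Go. This is an illustration only, not Trickster's implementation: the `HealthState` type, the `Member` struct, and `healthyPool` are hypothetical names, and the numeric value `-1` for `unavailable` is inferred from the floor values in the text.

```go
package main

import "fmt"

// HealthState is a hypothetical numeric health state. The values follow the
// text: unknown is 0, available is 1; -1 for unavailable is an assumption.
type HealthState int

const (
	Unavailable HealthState = -1
	Unknown     HealthState = 0
	Available   HealthState = 1
)

// Member is a hypothetical pool member with its instantaneous health state.
type Member struct {
	Name  string
	State HealthState
}

// healthyPool returns the names of members whose state is at or above floor.
func healthyPool(pool []Member, floor HealthState) []string {
	var out []string
	for _, m := range pool {
		if m.State >= floor {
			out = append(out, m.Name)
		}
	}
	return out
}

func main() {
	pool := []Member{
		{Name: "a", State: Available},
		{Name: "b", State: Unknown},
		{Name: "c", State: Unavailable},
	}
	// Print the healthy pool for each floor value discussed in the text.
	for _, floor := range []HealthState{-1, 0, 1} {
		fmt.Println("floor", floor, "->", healthyPool(pool, floor))
	}
}
```

With the default floor of `0`, members `a` and `b` would be selected; raising the floor to `1` leaves only `a`, and lowering it to `-1` admits all three.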

Backends that do not have a [health check interval](./health#example+health+check+configuration+for+use+in+alb) configured will remain in a permanent state of `unknown`. Backends will also be in an `unknown` state from the time Trickster starts until the first of any configured automated health check is completed. Note that if an ALB is configured with `healthy_floor: 1`, any pool members that are not configured with an automated health check interval will never be included in the ALB's healthy pool, as their state is permanently `0`.
Backends that do not have a [health check interval](./health#example+health+check+configuration+for+use+in+alb) configured will remain in a permanent state of `unknown`. Backends will also be in an `unknown` state from the time Trickster starts until their first configured automated health check completes. A pool member in a permanent `unknown` state can never reach `available`, so a `healthy_floor: 1` ALB whose members lack health checks would have an empty pool and return `502` for every request. To avoid that, Trickster resets such an ALB's effective floor to `0` at startup, emits a warning naming the ALB and the un-probed members, and sets the `trickster_alb_pool_floor_reset{backend_name}` gauge to `1`. Configure a health check interval on those members if you want `healthy_floor: 1` to apply.

Setting `healthy_floor` below `0` admits members the probe has confirmed `unavailable`, not just members in the transient `unknown` state. If your goal is to keep traffic flowing during the cold-start window before the first probes complete, lower the pool members' `recovery_threshold` so they transition out of `unknown` faster -- don't lower the floor. When `healthy_floor < 0` Trickster emits a startup warning and sets the `trickster_alb_pool_admits_failing{backend_name}` gauge to `1`.
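
The two startup checks described above can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not Trickster's code: `member`, `floorDecision`, and `evaluateFloor` are hypothetical names, and the sketch assumes the reset triggers when any pool member lacks a health check, which the text does not spell out.

```go
package main

import "fmt"

// member is a hypothetical pool member; hasHealthCheck reports whether a
// health check interval is configured for it.
type member struct {
	name           string
	hasHealthCheck bool
}

// floorDecision is a hypothetical summary of the startup evaluation.
type floorDecision struct {
	effectiveFloor int
	floorReset     bool     // would drive trickster_alb_pool_floor_reset
	admitsFailing  bool     // would drive trickster_alb_pool_admits_failing
	unprobed       []string // members to name in the startup warning
}

// evaluateFloor applies the two checks: a floor below 0 admits unavailable
// members, and a floor of 1 is reset to 0 when un-probed members exist
// (assumption: one un-probed member is enough to trigger the reset).
func evaluateFloor(configured int, pool []member) floorDecision {
	d := floorDecision{effectiveFloor: configured, admitsFailing: configured < 0}
	if configured < 1 {
		return d
	}
	for _, m := range pool {
		if !m.hasHealthCheck {
			d.unprobed = append(d.unprobed, m.name)
		}
	}
	if len(d.unprobed) > 0 {
		d.effectiveFloor = 0
		d.floorReset = true
	}
	return d
}

func main() {
	fmt.Printf("%+v\n", evaluateFloor(1, []member{{name: "prom-noprobe"}}))
	fmt.Printf("%+v\n", evaluateFloor(-1, []member{{name: "prom-ok", hasHealthCheck: true}}))
}
```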

### Example ALB Configuration Routing Only To Known Healthy Backends

2 changes: 2 additions & 0 deletions docs/health.md
@@ -102,6 +102,8 @@ backends:
recovery_threshold: 3 # backend is healthy after 3 consecutive successes
```

The Prometheus default probe is `/api/v1/query?query=up`. Some multi-tenant Prometheus gateways reject an unbounded `up` with `400 bad_data: "too many series found"`, which keeps the member out of any ALB pool it belongs to. Override `healthcheck.query` with a bounded expression the backend accepts (for example `query=vector(1)`) when probing such backends.
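
Before changing the configuration, it can help to check by hand which expression a gateway accepts. The following Go sketch builds the instant-query URL and reports the HTTP status for each candidate expression. The helper names (`probeURL`, `probeStatus`, `fakeGateway`) are hypothetical, and the `httptest` server is a stand-in that imitates a gateway rejecting an unbounded `up`; point `probeStatus` at a real origin URL to test an actual backend.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// probeURL builds a Prometheus instant-query URL for expr against origin.
func probeURL(origin, expr string) (string, error) {
	u, err := url.Parse(origin)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	u.RawQuery = url.Values{"query": {expr}}.Encode()
	return u.String(), nil
}

// probeStatus issues a GET for expr and returns the HTTP status code.
func probeStatus(origin, expr string) (int, error) {
	u, err := probeURL(origin, expr)
	if err != nil {
		return 0, err
	}
	resp, err := http.Get(u)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// fakeGateway is a stand-in for a multi-tenant gateway: it rejects the
// unbounded "up" query with 400 and accepts anything else.
func fakeGateway() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("query") == "up" {
			http.Error(w, "too many series found", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	gw := fakeGateway()
	defer gw.Close()
	for _, expr := range []string{"up", "vector(1)"} {
		code, err := probeStatus(gw.URL, expr)
		fmt.Println(expr, code, err)
	}
}
```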

## Other Ways to Monitor Health

In addition to the out-of-the-box health checks to determine up-or-down status, you may want to set up alarms and thresholds based on the metrics instrumented by Trickster. See [metrics.md](metrics.md) for collecting performance metrics about Trickster.
8 changes: 8 additions & 0 deletions docs/metrics.md
@@ -91,6 +91,14 @@ The following metrics are available for polling with any Trickster configuration
    * `operation` - the name of the operation being performed (read, write, etc.)
    * `status` - the result of the operation being performed

* `trickster_alb_pool_admits_failing` (Gauge) - 1 when an ALB pool's `healthy_floor` admits members in the `unavailable` state, 0 otherwise. See [alb.md](./alb.md#health-based-backend-selection) for the recommended floor.
  * labels:
    * `backend_name` - the name of the configured ALB backend

* `trickster_alb_pool_floor_reset` (Gauge) - 1 when an ALB pool's `healthy_floor` was reset to 0 at startup because pool members have no health check and could never reach the configured floor, 0 otherwise. See [alb.md](./alb.md#health-based-backend-selection).
  * labels:
    * `backend_name` - the name of the configured ALB backend
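
A consumer that scrapes the metrics endpoint can pick out either gauge by metric name and `backend_name` label. The sketch below uses simple string matching on text-exposition lines; `gaugeValue` is a hypothetical helper, the sample lines are made up for illustration, and the parsing assumes each line ends with the value (no trailing timestamp).

```go
package main

import (
	"fmt"
	"strings"
)

// gaugeValue scans Prometheus text-exposition lines for the given metric with
// a matching backend_name label and returns the value field as a string.
func gaugeValue(lines []string, metric, backend string) (string, bool) {
	label := fmt.Sprintf("backend_name=%q", backend)
	for _, l := range lines {
		if !strings.HasPrefix(l, metric+"{") || !strings.Contains(l, label) {
			continue
		}
		// The value is the last space-separated field on the line.
		if i := strings.LastIndex(l, " "); i >= 0 {
			return l[i+1:], true
		}
	}
	return "", false
}

func main() {
	// Illustrative sample lines, not captured from a running Trickster.
	lines := []string{
		`trickster_alb_pool_admits_failing{backend_name="alb-floor1"} 0`,
		`trickster_alb_pool_floor_reset{backend_name="alb-floor1"} 1`,
	}
	v, ok := gaugeValue(lines, "trickster_alb_pool_floor_reset", "alb-floor1")
	fmt.Println(v, ok)
}
```

For production alerting, a full exposition-format parser is a safer choice than string matching, since label order and additional labels can vary.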

---

The following metrics are available only for Cache Types whose object lifecycle Trickster manages internally (Memory, Filesystem and bbolt):
210 changes: 210 additions & 0 deletions integration/alb_missing_healthcheck_test.go
@@ -0,0 +1,210 @@
/*
* Copyright 2018 The Trickster Authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package integration

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
	"path/filepath"
	"strings"
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"github.com/trickstercache/trickster/v2/integration/internal/portutil"
)

// TestALBHealthyFloorAdmitsFailingMetric verifies the warning surface for an
// ALB whose healthy_floor admits Failing members. An operator who lowered
// healthy_floor below 0 to keep traffic flowing during the Initializing
// startup window also admits members the probe has confirmed broken; the
// `trickster_alb_pool_admits_failing` gauge surfaces that misconfiguration.
func TestALBHealthyFloorAdmitsFailingMetric(t *testing.T) {
	healthy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, `{"status":"success","data":{"version":"2.0"}}`)
	}))
	t.Cleanup(healthy.Close)
	broken := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusInternalServerError)
	}))
	t.Cleanup(broken.Close)

	ports, release := portutil.Reserve(t, 3)
	frontPort, metricsPort, mgmtPort := ports[0], ports[1], ports[2]

	yaml := fmt.Sprintf(albTestdata(t, "alb_missing_hc/floor_warn.yaml.tmpl"),
		frontPort, metricsPort, mgmtPort, healthy.URL, broken.URL)
	cfgPath := filepath.Join(t.TempDir(), "trickster.yaml")
	require.NoError(t, os.WriteFile(cfgPath, []byte(yaml), 0644))

	ctx, cancel := context.WithCancel(context.Background())
	t.Cleanup(cancel)
	release()
	go startTrickster(t, ctx, expectedStartError{}, "-config", cfgPath)

	metricsAddr := fmt.Sprintf("127.0.0.1:%d", metricsPort)
	waitForTrickster(t, metricsAddr)

	require.EventuallyWithT(t, func(collect *assert.CollectT) {
		lines := checkTricksterMetrics(t, metricsAddr)
		var admits, excludes string
		for _, l := range lines {
			if strings.HasPrefix(l, "trickster_alb_pool_admits_failing{") {
				if strings.Contains(l, `backend_name="alb-admits-failing"`) {
					admits = l
				}
				if strings.Contains(l, `backend_name="alb-excludes-failing"`) {
					excludes = l
				}
			}
		}
		assert.True(collect, strings.HasSuffix(admits, " 1"),
			"alb-admits-failing must report 1: %q", admits)
		assert.True(collect, strings.HasSuffix(excludes, " 0"),
			"alb-excludes-failing must report 0: %q", excludes)
	}, 5*time.Second, 200*time.Millisecond)
}

// TestALBHealthyFloorResetWhenMemberHasNoHealthcheck covers #1015: an ALB with
// healthy_floor: 1 whose only member has no health check. The member is stuck
// Unchecked and would be permanently excluded (empty pool -> 502). Trickster
// resets the effective floor to 0, sets trickster_alb_pool_floor_reset, and
// keeps serving 200.
func TestALBHealthyFloorResetWhenMemberHasNoHealthcheck(t *testing.T) {
	vector := albTestdata(t, "alb_unavail/healthy.json")
	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, vector)
	}))
	t.Cleanup(origin.Close)

	ports, release := portutil.Reserve(t, 3)
	frontPort, metricsPort, mgmtPort := ports[0], ports[1], ports[2]
	yaml := fmt.Sprintf(albTestdata(t, "alb_missing_hc/floor_reset.yaml.tmpl"),
		frontPort, metricsPort, mgmtPort, origin.URL)
	cfgPath := filepath.Join(t.TempDir(), "trickster.yaml")
	require.NoError(t, os.WriteFile(cfgPath, []byte(yaml), 0644))

	ctx, cancel := context.WithCancel(context.Background())
	t.Cleanup(cancel)
	release()
	go startTrickster(t, ctx, expectedStartError{}, "-config", cfgPath)

	metricsAddr := fmt.Sprintf("127.0.0.1:%d", metricsPort)
	waitForTrickster(t, metricsAddr)

	require.EventuallyWithT(t, func(collect *assert.CollectT) {
		var line string
		for _, l := range checkTricksterMetrics(t, metricsAddr) {
			if strings.HasPrefix(l, "trickster_alb_pool_floor_reset{") &&
				strings.Contains(l, `backend_name="alb-floor1"`) {
				line = l
			}
		}
		assert.True(collect, strings.HasSuffix(line, " 1"),
			"alb-floor1 floor-reset gauge must be 1: %q", line)
	}, 5*time.Second, 200*time.Millisecond)

	// member admitted under the reset floor -> 200, not an empty-pool 502
	frontAddr := fmt.Sprintf("127.0.0.1:%d", frontPort)
	resp, err := http.Get(fmt.Sprintf("http://%s/alb-floor1/api/v1/query?query=up", frontAddr))
	require.NoError(t, err)
	defer resp.Body.Close()
	require.Equal(t, http.StatusOK, resp.StatusCode)
}

// TestALBPoolDegradeWarnsInResponse covers thinker0's silent single-member
// degrade: a 2-member TSM pool where one member's probe fails drops to one live
// member. TSM still serves 200 from the survivor, but the response must carry a
// `warnings` entry so the caller knows the merge collapsed to a single shard.
func TestALBPoolDegradeWarnsInResponse(t *testing.T) {
	vector := albTestdata(t, "alb_unavail/healthy.json")
	ok := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, vector)
	}))
	t.Cleanup(ok.Close)
	var badData atomic.Int64
	bad := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1/status/buildinfo" {
			badData.Add(1)
		}
		w.WriteHeader(http.StatusInternalServerError)
	}))
	t.Cleanup(bad.Close)

	ports, release := portutil.Reserve(t, 3)
	frontPort, metricsPort, mgmtPort := ports[0], ports[1], ports[2]
	yaml := fmt.Sprintf(albTestdata(t, "alb_missing_hc/degrade.yaml.tmpl"),
		frontPort, metricsPort, mgmtPort, ok.URL, bad.URL)
	cfgPath := filepath.Join(t.TempDir(), "trickster.yaml")
	require.NoError(t, os.WriteFile(cfgPath, []byte(yaml), 0644))

	ctx, cancel := context.WithCancel(context.Background())
	t.Cleanup(cancel)
	release()
	go startTrickster(t, ctx, expectedStartError{}, "-config", cfgPath)

	frontAddr := fmt.Sprintf("127.0.0.1:%d", frontPort)
	metricsAddr := fmt.Sprintf("127.0.0.1:%d", metricsPort)
	waitForTrickster(t, metricsAddr)

	var n atomic.Int64
	require.EventuallyWithT(t, func(collect *assert.CollectT) {
		// unique query per attempt so the cache can't mask the live merge
		q := fmt.Sprintf("up + 0*%d", n.Add(1))
		u := fmt.Sprintf("http://%s/alb-degrade/api/v1/query?query=%s", frontAddr, url.QueryEscape(q))
		resp, err := http.Get(u)
		if !assert.NoError(collect, err) {
			return
		}
		defer resp.Body.Close()
		b, _ := io.ReadAll(resp.Body)
		if !assert.Equal(collect, http.StatusOK, resp.StatusCode, "body: %s", b) {
			return
		}
		var pr struct {
			Status   string   `json:"status"`
			Warnings []string `json:"warnings"`
		}
		if !assert.NoError(collect, json.Unmarshal(b, &pr)) {
			return
		}
		var found bool
		for _, wn := range pr.Warnings {
			if strings.Contains(wn, "1 of 2 pool members") {
				found = true
			}
		}
		assert.True(collect, found,
			"expected degrade warning in response warnings: %v", pr.Warnings)
	}, 6*time.Second, 200*time.Millisecond)

	require.Zero(t, badData.Load(), "failing member must not receive data requests")
}
2 changes: 1 addition & 1 deletion integration/alb_testdata_test.go
@@ -21,7 +21,7 @@ import (
"testing"
)

//go:embed testdata/alb_cache testdata/alb_tsm_correctness testdata/alb_response_headers testdata/alb_nested testdata/alb_per_path testdata/alb_unavail testdata/alb_request_headers testdata/alb_compose
//go:embed testdata/alb_cache testdata/alb_tsm_correctness testdata/alb_response_headers testdata/alb_nested testdata/alb_per_path testdata/alb_unavail testdata/alb_missing_hc testdata/alb_request_headers testdata/alb_compose
var albTestdataFS embed.FS

func albTestdata(t testing.TB, name string) string {
43 changes: 43 additions & 0 deletions integration/testdata/alb_missing_hc/degrade.yaml.tmpl
@@ -0,0 +1,43 @@

frontend:
  listen_port: %d
metrics:
  listen_port: %d
mgmt:
  listen_port: %d
logging:
  log_level: warn
caches:
  mem1:
    provider: memory
backends:
  prom-ok:
    provider: prometheus
    origin_url: %s
    cache_name: mem1
    healthcheck:
      path: /api/v1/status/buildinfo
      query: ""
      interval: 100ms
      timeout: 500ms
      failure_threshold: 1
      recovery_threshold: 1
  prom-bad:
    provider: prometheus
    origin_url: %s
    cache_name: mem1
    healthcheck:
      path: /api/v1/status/buildinfo
      query: ""
      interval: 100ms
      timeout: 500ms
      failure_threshold: 1
      recovery_threshold: 1
  alb-degrade:
    provider: alb
    alb:
      mechanism: tsm
      output_format: prometheus
      pool:
        - prom-ok
        - prom-bad
25 changes: 25 additions & 0 deletions integration/testdata/alb_missing_hc/floor_reset.yaml.tmpl
@@ -0,0 +1,25 @@

frontend:
  listen_port: %d
metrics:
  listen_port: %d
mgmt:
  listen_port: %d
logging:
  log_level: warn
caches:
  mem1:
    provider: memory
backends:
  prom-noprobe:
    provider: prometheus
    origin_url: %s
    cache_name: mem1
  alb-floor1:
    provider: alb
    alb:
      mechanism: tsm
      output_format: prometheus
      healthy_floor: 1
      pool:
        - prom-noprobe
52 changes: 52 additions & 0 deletions integration/testdata/alb_missing_hc/floor_warn.yaml.tmpl
@@ -0,0 +1,52 @@

frontend:
  listen_port: %d
metrics:
  listen_port: %d
mgmt:
  listen_port: %d
logging:
  log_level: warn
caches:
  mem1:
    provider: memory
backends:
  prom-healthy:
    provider: prometheus
    origin_url: %s
    cache_name: mem1
    healthcheck:
      path: /api/v1/status/buildinfo
      query: ""
      interval: 100ms
      timeout: 500ms
      failure_threshold: 1
      recovery_threshold: 1
  prom-broken:
    provider: prometheus
    origin_url: %s
    cache_name: mem1
    healthcheck:
      path: /api/v1/status/buildinfo
      query: ""
      interval: 100ms
      timeout: 500ms
      failure_threshold: 1
      recovery_threshold: 1
  alb-admits-failing:
    provider: alb
    alb:
      mechanism: tsm
      output_format: prometheus
      healthy_floor: -1
      pool:
        - prom-healthy
        - prom-broken
  alb-excludes-failing:
    provider: alb
    alb:
      mechanism: tsm
      output_format: prometheus
      pool:
        - prom-healthy
        - prom-broken