103 changes: 103 additions & 0 deletions astronomy-demo/README.md
@@ -0,0 +1,103 @@
### Astronomy Shop Resilience Demo Kit

This kit demonstrates an end-to-end incident on the OpenTelemetry Astronomy Shop: establish baseline SLOs, induce failure with LitmusChaos, detect it via Grafana + SigNoz, triage in Calmo, and remediate on Kubernetes with Kustomize.

#### Components
- k6 steady traffic to checkout
- Grafana alerting (p95 latency, error-rate) with webhook to Calmo
- LitmusChaos experiments: pod-network-latency (cart → datastore), pod-cpu-hog (cart)
- Kustomize remediation overlay to scale and resource-bump `cartservice`
- Optional misconfig scenario: wrong image to trigger ImagePullBackOff/CrashLoopBackOff

#### Prerequisites
- Kubernetes cluster (GKE recommended) with the OpenTelemetry Demo deployed (namespace `otel-demo` assumed)
- SigNoz or Prometheus-compatible metrics endpoint connected to Grafana
- Grafana v9+ with provisioning enabled
- LitmusChaos installed and a target ServiceAccount with permissions in `otel-demo`
- Calmo ingestion endpoint URL and optional API key
- kubectl, kustomize, and k6 installed locally

#### Environment
Export these before running:

```bash
export ASTRONOMY_NS=otel-demo
# In-cluster URL; when running k6 from your workstation, port-forward instead:
#   kubectl port-forward -n "$ASTRONOMY_NS" svc/frontend 8080:8080
#   export FRONTEND_BASE_URL="http://localhost:8080"
export FRONTEND_BASE_URL="http://frontend.${ASTRONOMY_NS}.svc.cluster.local:8080"
export CALMO_WEBHOOK_URL="https://ingest.getcalmo.com/webhook/<your-source>"
export CALMO_WEBHOOK_SECRET="<optional-shared-secret>"
# Grafana: set your Prometheus/SigNoz datasource UID (from Grafana > Connections > Data sources)
export GRAFANA_PROM_DS_UID="prometheus"
```
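Before continuing, a quick sanity check that the variables are set (a convenience snippet, not part of the kit):

```bash
# Verify the required variables are exported (CALMO_WEBHOOK_SECRET is optional)
check_env() {
  missing=0
  for v in ASTRONOMY_NS FRONTEND_BASE_URL CALMO_WEBHOOK_URL GRAFANA_PROM_DS_UID; do
    if [ -z "$(printenv "$v")" ]; then
      echo "missing: $v" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "environment OK"
}
check_env || echo "set the variables above before continuing"
```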

### 1) Baseline: generate steady checkout traffic

```bash
k6 run ./k6/checkout.js \
  -e BASE_URL="$FRONTEND_BASE_URL" \
  -e CHECKOUT_RATE_PER_SEC=3 \
  -e TEST_DURATION="10m"
```

Verify in Grafana/SigNoz that traces and metrics look healthy; note the baseline p95 latency and error rate.

### 2) Arm alerts and route to Calmo
Provision the Grafana contact point and alert rules via ConfigMaps/volumes, or by copying the files from `grafana/provisioning/alerting/` into Grafana's provisioning directory. Grafana expands `${VAR}` environment variables in provisioning files, so make sure `$GRAFANA_PROM_DS_UID` and the Calmo webhook variables are set in Grafana's environment.
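If Grafana runs in-cluster, one option (a sketch; the ConfigMap name and Grafana namespace are illustrative) is to mount the files as a ConfigMap at Grafana's default alerting provisioning path:

```yaml
# Created with:
#   kubectl create configmap grafana-alerting-provisioning \
#     --from-file=grafana/provisioning/alerting/ -n <grafana-namespace>
# Pod-spec fragment for the Grafana Deployment:
volumes:
  - name: alerting-provisioning
    configMap:
      name: grafana-alerting-provisioning
containers:
  - name: grafana
    volumeMounts:
      - name: alerting-provisioning
        mountPath: /etc/grafana/provisioning/alerting
```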

Files:
- `grafana/provisioning/alerting/contact-points.yaml`
- `grafana/provisioning/alerting/rules.yaml`

These configure:
- Alert A: checkout p95 latency > 2s for 5m
- Alert B: checkout error-rate > 3% for 5m
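The error-rate rule (see `grafana/provisioning/alerting/rules.yaml`) divides the 5xx rate by the total request rate clamped below at 1, so idle traffic yields 0 instead of a divide-by-zero. A quick sketch of that arithmetic in plain awk (illustrative, not part of the kit):

```bash
# ratio = errors / max(total, 1), mirroring "($A / clamp_min($B, 1)) > 0.03"
ratio() { awk -v e="$1" -v t="$2" 'BEGIN { d = (t < 1) ? 1 : t; printf "%.4f\n", e / d }'; }
ratio 4 100   # 0.0400 -> above the 3% threshold, alert fires
ratio 0 0     # 0.0000 -> zero traffic divides by 1, no alert
```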

### 3) Inject failures with LitmusChaos

Set the app label and namespace in the engines if needed. Apply experiments and engines:

```bash
kubectl apply -n litmus -f ./litmus/experiments/pod-network-latency.yaml
kubectl apply -n litmus -f ./litmus/experiments/pod-cpu-hog.yaml

kubectl apply -n "$ASTRONOMY_NS" -f ./litmus/engines/cart-network-latency.yaml
kubectl apply -n "$ASTRONOMY_NS" -f ./litmus/engines/cart-cpu-hog.yaml
```

Observe increased `cartservice` span latency and possible CPU throttling or restarts. The Grafana alerts should fire within 5–7 minutes (the `for: 5m` window plus the evaluation interval); Calmo receives the webhooks and correlates them with Kubernetes events and recent deploys.

### 4) Remediate with Kustomize overlay

Apply the remediation overlay to scale and resource-bump `cartservice`:

```bash
kubectl kustomize ./kustomize/overlays/cart-remediation | kubectl apply -n "$ASTRONOMY_NS" -f -
```

Validate SLOs recover, then roll back chaos:

```bash
kubectl delete -n "$ASTRONOMY_NS" -f ./litmus/engines/cart-network-latency.yaml || true
kubectl delete -n "$ASTRONOMY_NS" -f ./litmus/engines/cart-cpu-hog.yaml || true
```

### 5) Optional simple scenarios

- Bad image: apply `kubernetes/misconfig/cart-bad-image.yaml` to point `cartservice` at a bad image (a non-existent tag produces ImagePullBackOff; a broken entrypoint produces CrashLoopBackOff), then revert.
- NodeSelector misplacement: add a strict `nodeSelector` to `cartservice` to schedule onto non-matching nodes and observe Pending pods.
- pod-cpu-hog: run only the CPU hog engine.
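For the nodeSelector scenario, a minimal strategic-merge patch sketch (the label key is hypothetical and should match no node):

```yaml
# Schedules cartservice onto nodes that do not exist, leaving its pods Pending.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cartservice
spec:
  template:
    spec:
      nodeSelector:
        demo/nonexistent-pool: "true"
```

Remove the selector (or re-apply the base manifests) to recover.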

### 6) Grafana → Calmo webhook payload

The Grafana webhook contact point sends a JSON body including `title`, `state`, and an `alerts` array whose entries carry `labels` and `startsAt` (`evalMatches` only appears with legacy dashboard alerting). Calmo can enrich the alert with SLO metadata and correlate it with other signals.
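For reference, a trimmed example of what a unified-alerting webhook body can look like (values are illustrative):

```json
{
  "title": "[FIRING:1] checkout p95 latency > 2s",
  "state": "alerting",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "checkout p95 latency > 2s",
        "service": "checkout",
        "slo": "p95-latency"
      },
      "startsAt": "2024-05-01T12:03:00Z"
    }
  ]
}
```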

### Clean-up

```bash
# Revert the remediation overlay by re-applying your base cartservice manifests;
# a rollout restart alone does NOT undo the replica/resource changes.
kubectl rollout restart deploy/cartservice -n "$ASTRONOMY_NS"

# Delete engines (experiments can remain installed in litmus namespace)
kubectl delete -n "$ASTRONOMY_NS" -f ./litmus/engines/cart-network-latency.yaml || true
kubectl delete -n "$ASTRONOMY_NS" -f ./litmus/engines/cart-cpu-hog.yaml || true
```

17 changes: 17 additions & 0 deletions astronomy-demo/grafana/provisioning/alerting/contact-points.yaml
@@ -0,0 +1,17 @@
apiVersion: 1
contactPoints:
  - orgId: 1
    name: calmo-webhook
    receivers:
      - uid: calmo-webhook-receiver
        type: webhook
        settings:
          url: ${CALMO_WEBHOOK_URL}
          httpMethod: POST
          sendResolved: true
          username: ""
          password: ""
          maxAlerts: 0
        secureFields:
          password: ${CALMO_WEBHOOK_SECRET}
        disableResolveMessage: false
73 changes: 73 additions & 0 deletions astronomy-demo/grafana/provisioning/alerting/rules.yaml
@@ -0,0 +1,73 @@
apiVersion: 1
groups:
  - orgId: 1
    name: astronomy-shop-slo
    interval: 1m
    rules:
      - uid: checkout-p95-latency
        title: checkout p95 latency > 2s
        condition: C
        for: 5m
        labels:
          service: checkout
          slo: p95-latency
          env: staging
        annotations:
          runbook_url: https://git/ops/runbooks/checkout-latency
        data:
          - refId: A
            datasourceUid: ${GRAFANA_PROM_DS_UID}
            model:
              interval: ""
              intervalFactor: 2
              legendFormat: p95
              refId: A
              expr: |
                histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
              range: true
              datasource: {uid: ${GRAFANA_PROM_DS_UID}}
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              refId: B
              expression: 2
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              refId: C
              expression: "$A > $B"
      - uid: checkout-error-rate
        title: checkout error-rate > 3%
        condition: C
        for: 5m
        labels:
          service: checkout
          slo: error-rate
          env: staging
        annotations:
          runbook_url: https://git/ops/runbooks/checkout-latency
        data:
          - refId: A
            datasourceUid: ${GRAFANA_PROM_DS_UID}
            model:
              refId: A
              expr: |
                sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
              range: true
              datasource: {uid: ${GRAFANA_PROM_DS_UID}}
          - refId: B
            datasourceUid: ${GRAFANA_PROM_DS_UID}
            model:
              refId: B
              expr: |
                sum(rate(http_requests_total{service="checkout"}[5m]))
              range: true
              datasource: {uid: ${GRAFANA_PROM_DS_UID}}
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              refId: C
              expression: "($A / clamp_min($B, 1)) > 0.03"
55 changes: 55 additions & 0 deletions astronomy-demo/k6/checkout.js
@@ -0,0 +1,55 @@
import http from 'k6/http';
import { check, sleep } from 'k6';

// Env vars
const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';
const RATE = Number(__ENV.CHECKOUT_RATE_PER_SEC || 2); // requests per second
const DURATION = __ENV.TEST_DURATION || '10m';

export const options = {
  scenarios: {
    steady_checkout: {
      executor: 'constant-arrival-rate',
      rate: RATE,
      timeUnit: '1s',
      duration: DURATION,
      preAllocatedVUs: Math.max(10, RATE * 2),
      maxVUs: Math.max(50, RATE * 4),
      tags: { service: 'checkout', route: '/api/checkout' },
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<2000'],
    http_req_failed: ['rate<0.03'],
  },
};

export default function () {
  // Minimal flow: add-to-cart then checkout endpoint.
  // Adjust endpoints to match your frontend/cart routes.
  const headers = { 'Content-Type': 'application/json' };

  // Add to cart
  const addRes = http.post(
    `${BASE_URL}/api/cart`,
    JSON.stringify({ productId: 'extreme-astronomy-binoculars', quantity: 1 }),
    { headers },
  );
  check(addRes, {
    'add-to-cart status is 2xx': (r) => r.status >= 200 && r.status < 300,
  });

  // Checkout
  const payload = {
    email: 'demo@example.com',
    address: {
      street: '1 Space Way', city: 'Andromeda', state: 'OT', zip: '424242', country: 'US',
    },
    creditCard: {
      number: '4111111111111111', ccv: '737', expMonth: 12, expYear: 2030,
    },
  };
  const res = http.post(`${BASE_URL}/api/checkout`, JSON.stringify(payload), { headers });
  check(res, {
    'checkout status is 2xx/3xx': (r) => r.status >= 200 && r.status < 400,
  });

  sleep(0.5);
}

@@ -0,0 +1,14 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# NOTE: patches only apply to resources rendered by this kustomization.
# Point `resources:` at your cartservice base manifests; otherwise
# `kubectl kustomize` emits nothing and the patches are never applied.
resources: []

patches:
  - target:
      kind: Deployment
      name: cartservice
    path: patch-replicas.yaml
  - target:
      kind: Deployment
      name: cartservice
    path: patch-resources.yaml
@@ -0,0 +1,6 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cartservice
spec:
  replicas: 4
@@ -0,0 +1,16 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cartservice
spec:
  template:
    spec:
      containers:
        - name: server
          resources:
            requests:
              cpu: "300m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
22 changes: 22 additions & 0 deletions astronomy-demo/litmus/engines/cart-cpu-hog.yaml
@@ -0,0 +1,22 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cart-cpu-hog
  namespace: otel-demo
spec:
  engineState: active
  appinfo:
    appns: otel-demo
    applabel: "app=cartservice"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '600' # seconds
            - name: CPU_CORES
              value: '1'
            - name: PODS_AFFECTED_PERC
              value: '100'
24 changes: 24 additions & 0 deletions astronomy-demo/litmus/engines/cart-network-latency.yaml
@@ -0,0 +1,24 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cart-network-latency
  namespace: otel-demo
spec:
  engineState: active
  appinfo:
    appns: otel-demo
    applabel: "app=cartservice"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '600' # seconds
            - name: NETWORK_LATENCY
              value: '400' # milliseconds
            - name: JITTER
              value: '0'
            - name: PODS_AFFECTED_PERC
              value: '100'
34 changes: 34 additions & 0 deletions astronomy-demo/litmus/experiments/pod-cpu-hog.yaml
@@ -0,0 +1,34 @@
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-cpu-hog
  labels:
    litmuschaos.io/name: pod-cpu-hog
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: ['']
        resources: ['pods', 'pods/log']
        verbs: ['create', 'list', 'get', 'update', 'patch', 'delete']
      - apiGroups: ['']
        resources: ['events']
        verbs: ['create', 'list', 'get', 'update', 'patch']
      - apiGroups: ['apps']
        resources: ['deployments']
        verbs: ['list', 'get']
    image: litmuschaos/go-runner:latest
    imagePullPolicy: IfNotPresent
    args:
      - -c
      - ./experiments -name pod-cpu-hog
    command: ["/bin/bash"]
    env:
      - name: TOTAL_CHAOS_DURATION
        value: '600'
      - name: CPU_CORES
        value: '1'
      - name: PODS_AFFECTED_PERC
        value: '100'
      - name: SEQUENCE
        value: parallel