fix(adguard/genmachine): multi-replica CrashLoopBackOff — NFS sessions.db locking + HA redesign

## Problem

AdGuard Home on genmachine runs **3 replicas sharing a single NFS RWX PVC** (`pvc-adguard-data`, `nfs-csi-retain`). Two of the three pods are permanently in `CrashLoopBackOff`:

```
[fatal] initializing auth module: creating session storage: timeout
session_storage: opening db filename=/opt/adguardhome/work/data/sessions.db err=timeout
```

### Root cause

All 3 pods mount the **same PVC** at identical subPaths:

| Mount | SubPath | Purpose |
|-------|---------|---------|
| `/opt/adguardhome/work` | `work` | Runtime state — **sessions.db, query log, stats** |
| `/opt/adguardhome/conf` | `conf` | Config file (AdGuardHome.yaml) |

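`sessions.db` uses **bbolt (BoltDB)**, which takes an exclusive OS file lock (`flock()`) on the database file when opening it and holds that lock for as long as the database stays open. All three pods open the same file over NFS, so only the first pod to acquire the lock starts. The other two wait for the lock, hit the open timeout shown above, and crash on every restart.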

**Consequence for HA**: `externalTrafficPolicy: Local` is correctly set (source IP preservation), but with that policy MetalLB only announces the DNS VIP from a node that has a ready local endpoint. Only one pod is ever ready, so if that pod or its node fails there is no healthy pod for the VIP to move to. The cluster effectively has **zero DNS failover** despite 3 replicas being declared.

```
kubectl get pods -n adguard
adguard-adguard-home-xxx   0/1  CrashLoopBackOff  4475   talos-1
adguard-adguard-home-xxx   0/1  Error             4480   talos-2
adguard-adguard-home-xxx   1/1  Running           0      talos-3   ← only survivor
```

---

## Options Considered

### Option A — Stateless replicas (emptyDir) ✅ Recommended

Set `persistence.enabled: false` in genmachine values. The rm3l chart falls back to an **emptyDir** per pod for the work directory. The `bootstrapEnabled: true` mechanism already writes the full config from the Helm values Secret into each pod at init time — so config is fully GitOps-driven.

- Each pod owns its own isolated `emptyDir` → no locking
- 3 replicas all healthy simultaneously
- `externalTrafficPolicy: Local` works as designed: one pod per node, MetalLB VIP failover works
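- **Trade-off**: query log history and statistics are lost whenever a pod is deleted or rescheduled (an `emptyDir` survives container restarts inside the same pod, but not pod replacement). This is acceptable for homelab DNS: the important state is the config, which is already in Git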
- No PVC at all → fully stateless, no storage dependency
- volsync backup becomes obsolete

### Option B — StatefulSet with per-pod PVC + adguardhome-sync

Each pod gets its own PVC (`volumeClaimTemplates`). [`bakito/adguardhome-sync`](https://github.com/bakito/adguardhome-sync) syncs config from pod-0 (primary) to replicas via the AdGuard Home admin API.

- Preserves per-pod query log history
- adguardhome-sync runs as a sidecar or separate Deployment
- **Complexity**: need to manage primary/replica concept, initial setup wizard on each replica, sync scheduling
- adguardhome-sync has no Helm chart and no Kubernetes-native support
- The rm3l chart has a `statefulset.yaml` template (`deploymentType: StatefulSet`) but it is not paired with adguardhome-sync out of the box
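
For illustration, the per-pod PVC part of this option could look roughly like the plain Kubernetes manifest below. This is a minimal sketch, not the output of the rm3l chart: the resource names, labels, headless Service name, and storage size are assumptions, and the image tag is left unpinned. Only the mount paths, subPaths, and storage class come from the current setup.

```yaml
# Hypothetical sketch: names, labels, and sizes are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: adguard-home
  namespace: adguard
spec:
  serviceName: adguard-home-headless   # assumed headless Service, defined separately
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: adguard-home
  template:
    metadata:
      labels:
        app.kubernetes.io/name: adguard-home
    spec:
      containers:
        - name: adguard-home
          image: adguard/adguardhome   # tag intentionally not pinned in this sketch
          volumeMounts:
            - name: data
              mountPath: /opt/adguardhome/work
              subPath: work
            - name: data
              mountPath: /opt/adguardhome/conf
              subPath: conf
  # One PVC per pod (data-adguard-home-0, -1, -2), so no two pods
  # ever open the same sessions.db.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: nfs-csi-retain   # class of the existing PVC
        resources:
          requests:
            storage: 1Gi   # assumed size
```

Each replica then holds the bbolt lock on its own file, which removes the contention but leaves config synchronisation to adguardhome-sync.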

### Option C — DaemonSet with hostNetwork

One pod per node using `hostNetwork: true`, listening on the node's physical IP. Eliminates all shared storage entirely.

- Source IP preservation is native (no DNAT)
- No LoadBalancer service needed for DNS — clients point to node IPs directly
- **Problem**: ties AdGuard to specific node IPs; incompatible with the current MetalLB VIP (`192.168.1.200`) used by both clusters
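
A minimal sketch of what this option implies, again as a plain manifest with assumed names and labels rather than anything generated by the chart:

```yaml
# Hypothetical sketch: names and labels are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: adguard-home
  namespace: adguard
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: adguard-home
  template:
    metadata:
      labels:
        app.kubernetes.io/name: adguard-home
    spec:
      hostNetwork: true                     # bind directly on the node's IP
      dnsPolicy: ClusterFirstWithHostNet    # keep cluster DNS resolution for the pod itself
      containers:
        - name: adguard-home
          image: adguard/adguardhome        # tag intentionally not pinned in this sketch
          ports:
            - name: dns-udp
              containerPort: 53
              protocol: UDP
            - name: dns-tcp
              containerPort: 53
              protocol: TCP
          volumeMounts:
            - name: work
              mountPath: /opt/adguardhome/work
      volumes:
        - name: work
          emptyDir: {}                      # per-pod scratch space, no shared storage
```

Port 53 must be free on every node for this to start, and clients would need all node IPs configured as resolvers.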

### Option D — Single replica (give up on HA)

`replicaCount: 1` with a PodDisruptionBudget. MetalLB VIP failover still works (the pod moves to a different node on failure).

- Simple but no simultaneous redundancy
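- DNS outage while the pod is rescheduled: a few seconds after a pod crash, but potentially minutes after a node failure, because pods on an unreachable node are only evicted once the default 300-second toleration expires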
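
The PodDisruptionBudget mentioned above could be sketched as follows. The name and label selector are assumptions and must match the labels the chart actually applies to the pod.

```yaml
# Hypothetical sketch: name and selector labels are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: adguard-home
  namespace: adguard
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: adguard-home
```

With a single replica, `minAvailable: 1` blocks every voluntary eviction, so a `kubectl drain` of the hosting node will not complete until the PDB is relaxed or the pod is deleted manually. It protects against accidental disruption but does nothing for involuntary failures.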

---

## Proposed Fix (Option A)

In `gitops/manifests/adguard/genmachine/genmachine-values.yaml`:

```yaml
adguard-home:
  replicaCount: 3

  # Disable PVC — each pod gets its own emptyDir for work + conf
  persistence:
    enabled: false

  # bootstrapEnabled copies AdGuardHome.yaml from the Secret at init time
  # if no config file exists yet — safe for emptyDir (always fresh at start)
  bootstrapEnabled: true

  # externalTrafficPolicy: Local preserved for source IP
  services:
    dns:
      externalTrafficPolicy: Local
      ...
```

Remove / clean up:
- `gitops/manifests/adguard/genmachine/templates/pvc.yaml`
- `gitops/manifests/adguard/genmachine/templates/volsync-backup.yaml` (no PVC to back up)

The `bootstrapConfig` block already contains the complete configuration (DNS upstreams, rewrites, filters, users, etc.) — this becomes the single source of truth in Git.

---

## Networking Context

```
LAN client (DNS query)
    │
    ▼ UDP/TCP 53
MetalLB L2 VIP 192.168.1.200
    │  externalTrafficPolicy: Local
    ▼
Node running pod (talos-1 / talos-2 / talos-3)
    │  source IP preserved
    ▼
AdGuard Home pod (emptyDir, independent)
```

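With 3 healthy pods (one per node) and `externalTrafficPolicy: Local`, every node is eligible to hold the VIP. In L2 mode MetalLB announces the VIP from one node at a time, so a single pod serves all DNS traffic while the others stand by. If that node fails, the VIP migrates to another node with a ready pod within seconds. Source IP is preserved for query logging.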
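
The "one pod per node" placement is assumed here rather than enforced by the values shown earlier. One way to enforce it is a topology spread constraint in the pod spec, sketched below. The label is an assumption, and whether (and under which values key) the rm3l chart passes this field through has not been verified.

```yaml
# Pod spec fragment (hypothetical label; placement in chart values not verified).
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread across nodes
    whenUnsatisfiable: DoNotSchedule      # never stack two replicas on one node
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: adguard-home
```

With 3 replicas on 3 nodes this gives every node a local endpoint, which is what `externalTrafficPolicy: Local` needs for the VIP to be able to land on any of them.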

---

## References

- bbolt concurrent access limitation: https://github.com/etcd-io/bbolt/issues/98
- adguardhome-sync: https://github.com/bakito/adguardhome-sync
- rm3l chart `deploymentType` and `persistence`: https://helm-charts.rm3l.org
- MetalLB L2 + `externalTrafficPolicy: Local`: https://metallb.io/usage/#layer-2


