fix(adguard/genmachine): multi-replica CrashLoopBackOff — NFS sessions.db locking + HA redesign #1708

@ixxeL2097

Problem

AdGuard Home on genmachine runs 3 replicas sharing a single NFS RWX PVC (pvc-adguard-data, nfs-csi-retain). Two of the three pods are permanently in CrashLoopBackOff:

[fatal] initializing auth module: creating session storage: timeout
session_storage: opening db filename=/opt/adguardhome/work/data/sessions.db err=timeout

Root cause

All 3 pods mount the same PVC at identical subPaths:

Mount                   SubPath   Purpose
/opt/adguardhome/work   work      Runtime state: sessions.db, query log, stats
/opt/adguardhome/conf   conf      Config file (AdGuardHome.yaml)

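sessions.db is a bbolt (BoltDB) database. On open, bbolt takes an exclusive file lock (flock()) on the database file and keeps retrying until its open timeout expires, at which point it returns the "timeout" error shown above. Because all three pods open the same file over NFS, the lock held by the first pod blocks the other two: only the first pod to acquire it starts, and the remaining two crash on every restart.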

Consequence for HA: externalTrafficPolicy: Local is correctly set (source IP preservation), but with that policy MetalLB can only announce the DNS VIP from a node that hosts a ready pod. Since two of the three pods are crashing, talos-3 is the only eligible node, so there is nowhere for the VIP to move. The cluster effectively has zero DNS failover despite 3 replicas being declared.

kubectl get pods -n adguard
adguard-adguard-home-xxx   0/1  CrashLoopBackOff  4475   talos-1
adguard-adguard-home-xxx   0/1  Error             4480   talos-2
adguard-adguard-home-xxx   1/1  Running           0      talos-3   ← only survivor

Options Considered

Option A — Stateless replicas (emptyDir) ✅ Recommended

Set persistence.enabled: false in genmachine values. The rm3l chart falls back to an emptyDir per pod for the work directory. The bootstrapEnabled: true mechanism already writes the full config from the Helm values Secret into each pod at init time — so config is fully GitOps-driven.

  • Each pod owns its own isolated emptyDir → no locking
  • 3 replicas all healthy simultaneously
  • externalTrafficPolicy: Local works as designed: with the replicas spread one per node, every node has a ready local endpoint and is eligible to hold the VIP, so MetalLB failover works
  • Trade-off: query log history and statistics reset on pod restart (acceptable for homelab DNS — the important state is the config, already in Git)
  • No PVC at all → fully stateless, no storage dependency
  • volsync backup becomes obsolete
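
A replica count of 3 does not by itself guarantee one pod per node, which the failover argument above relies on. A minimal sketch of a pod-spec fragment that would enforce the spread, using the core Kubernetes topologySpreadConstraints field. Whether the rm3l chart exposes this field, and under which values key, is an assumption to verify; the label in the selector is hypothetical and must match the labels the chart actually puts on the pods.

```yaml
# Pod spec fragment (Kubernetes core API), not a verified rm3l chart value.
topologySpreadConstraints:
  - maxSkew: 1                           # node pod counts may differ by at most 1
    topologyKey: kubernetes.io/hostname  # spread across nodes
    whenUnsatisfiable: DoNotSchedule     # leave a pod Pending rather than stack two on one node
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: adguard-home   # hypothetical label
```

With 3 replicas on 3 nodes this yields one pod per node. DoNotSchedule is the strict choice; ScheduleAnyway would trade the guarantee for schedulability when a node is down.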

Option B — StatefulSet with per-pod PVC + adguardhome-sync

Each pod gets its own PVC (volumeClaimTemplates). bakito/adguardhome-sync syncs config from pod-0 (primary) to replicas via the AdGuard Home admin API.

  • Preserves per-pod query log history
  • adguardhome-sync runs as a sidecar or separate Deployment
  • Complexity: need to manage primary/replica concept, initial setup wizard on each replica, sync scheduling
  • adguardhome-sync has no Helm chart and no Kubernetes-native support
  • The rm3l chart has a statefulset.yaml template (deploymentType: StatefulSet) but it is not paired with adguardhome-sync out of the box
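
For comparison, a rough sketch of what the per-pod PVC layout could look like as a raw StatefulSet. This is not the rm3l chart's rendered output: the resource names, labels, headless Service, and storage size are hypothetical placeholders, and adguardhome-sync is deliberately left out.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: adguard-home                 # hypothetical name
  namespace: adguard
spec:
  serviceName: adguard-home          # hypothetical headless Service
  replicas: 3
  selector:
    matchLabels:
      app: adguard-home              # hypothetical label
  template:
    metadata:
      labels:
        app: adguard-home
    spec:
      containers:
        - name: adguard-home
          image: adguard/adguardhome # pin a tag in practice
          volumeMounts:
            - name: data
              mountPath: /opt/adguardhome/work
              subPath: work
            - name: data
              mountPath: /opt/adguardhome/conf
              subPath: conf
  # One PVC per pod (data-adguard-home-0, -1, -2), so no two pods share sessions.db.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: nfs-csi-retain   # class named in this issue; assumed to support RWO
        resources:
          requests:
            storage: 1Gi                   # placeholder size
```

The locking problem disappears because each replica opens its own sessions.db, but the config in each conf directory then diverges unless something like adguardhome-sync keeps the replicas aligned.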

Option C — DaemonSet with hostNetwork

One pod per node using hostNetwork: true, listening on the node's physical IP. Eliminates all shared storage entirely.

  • Source IP preservation is native (no DNAT)
  • No LoadBalancer service needed for DNS — clients point to node IPs directly
  • Problem: ties AdGuard to specific node IPs; incompatible with the current MetalLB VIP (192.168.1.200) used by both clusters
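
A minimal sketch of the DaemonSet variant, assuming hypothetical names and labels and an emptyDir for the work directory. It only illustrates the shape of the option; config bootstrapping is omitted.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: adguard-home                 # hypothetical name
  namespace: adguard
spec:
  selector:
    matchLabels:
      app: adguard-home              # hypothetical label
  template:
    metadata:
      labels:
        app: adguard-home
    spec:
      hostNetwork: true                      # bind directly on the node's IP
      dnsPolicy: ClusterFirstWithHostNet     # keep cluster DNS resolution inside the pod
      containers:
        - name: adguard-home
          image: adguard/adguardhome         # pin a tag in practice
          ports:
            - name: dns-udp
              containerPort: 53
              protocol: UDP
            - name: dns-tcp
              containerPort: 53
              protocol: TCP
          volumeMounts:
            - name: work
              mountPath: /opt/adguardhome/work
      volumes:
        - name: work
          emptyDir: {}                       # per-pod state, nothing shared
```

Clients would then need each node IP listed as a resolver, which is exactly the coupling that makes this option incompatible with the single VIP.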

Option D — Single replica (give up on HA)

replicaCount: 1 with a PodDisruptionBudget. MetalLB VIP failover still works (the pod moves to a different node on failure).

  • Simple but no simultaneous redundancy
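  • DNS outage while the pod is rescheduled: a few seconds after a pod crash, but potentially minutes after a node failure, since by default Kubernetes waits 5 minutes before evicting pods from an unreachable node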
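
A sketch of the PodDisruptionBudget this option mentions, with a hypothetical name and label selector:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: adguard-home                 # hypothetical name
  namespace: adguard
spec:
  minAvailable: 1                    # with replicaCount: 1, this forbids evicting the only pod
  selector:
    matchLabels:
      app: adguard-home              # hypothetical label; must match the pod labels
```

One design caveat: with a single replica, minAvailable: 1 blocks voluntary evictions such as node drains until the budget is relaxed or the pod is deleted manually, while maxUnavailable: 1 would allow them and therefore protect nothing. A PDB also has no effect on involuntary disruptions such as a node crash.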

Proposed Fix (Option A)

In gitops/manifests/adguard/genmachine/genmachine-values.yaml:

adguard-home:
  replicaCount: 3

  # Disable PVC — each pod gets its own emptyDir for work + conf
  persistence:
    enabled: false

  # bootstrapEnabled copies AdGuardHome.yaml from the Secret at init time
  # if no config file exists yet — safe for emptyDir (always fresh at start)
  bootstrapEnabled: true

  # externalTrafficPolicy: Local preserved for source IP
  services:
    dns:
      externalTrafficPolicy: Local
      ...

Remove / clean up:

  • gitops/manifests/adguard/genmachine/templates/pvc.yaml
  • gitops/manifests/adguard/genmachine/templates/volsync-backup.yaml (no PVC to back up)

The bootstrapConfig block already contains the complete configuration (DNS upstreams, rewrites, filters, users, etc.) — this becomes the single source of truth in Git.


Networking Context

LAN client (DNS query)
    │
    ▼ UDP/TCP 53
MetalLB L2 VIP 192.168.1.200
    │  externalTrafficPolicy: Local
    ▼
Node running pod (talos-1 / talos-2 / talos-3)
    │  source IP preserved
    ▼
AdGuard Home pod (emptyDir, independent)

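With 3 healthy pods (one per node) and externalTrafficPolicy: Local, every node is eligible to announce the VIP. In L2 mode MetalLB announces it from one node at a time, so DNS traffic is served by the pod on that node. If that node fails, the VIP migrates to another node with a ready pod within seconds. Source IP is preserved for query logging.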
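
For reference, a hedged sketch of the kind of Service this flow assumes. The chart renders its own Service from the services.dns values, so the name and selector here are hypothetical, and the loadBalancerIPs annotation assumes a MetalLB version that supports it (0.13 or later).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: adguard-home-dns             # hypothetical name
  namespace: adguard
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.1.200   # VIP from this issue
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local       # only nodes with a ready local pod receive traffic; no SNAT
  selector:
    app: adguard-home                # hypothetical label
  ports:
    - name: dns-udp
      port: 53
      targetPort: 53
      protocol: UDP
    - name: dns-tcp
      port: 53
      targetPort: 53
      protocol: TCP
```

Serving UDP and TCP from one LoadBalancer Service relies on mixed-protocol support in the cluster's Kubernetes version; older clusters would need two Services sharing the IP.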

