Skip to content

Build hal0-test-template CT 200 + pct-clone smoke harness for fresh-install testing #407

@thinmintdev

Description

@thinmintdev

What to build

A reusable Proxmox LXC template (CT 200, "hal0-test-template") and a smoke driver script that clones it, runs `install.sh`, exercises a few hal0 subcommands, runs the uninstall path, and destroys the clone. Each smoke run starts from byte-identical state.

Why this exists

LXC 105 (`hal0`) is the live dev environment — full hardware passthrough, editable install, real models. It is the wrong target for "what does `curl … | bash` feel like for a new user?" testing because:

  • Polluted with dev state (registry, slot toml, lemonade config, MCP installs, models).
  • Can't easily reset between runs.
  • Mixing dev + install/uninstall iteration loses both signals.

We need an ephemeral target that exercises the tarball install path with the same hardware-access recipe as 105 so the iGPU / NPU show up to the installer. LXCs share the host kernel and gate `/dev/*` access via cgroup rules — both 105 and the test CT can hold cgroup rights to the iGPU / NPU simultaneously, with a brief contention window only when both try to load a model. For install + uninstall flow testing (no model warming required) there is no contention.

Golden template (CT 200)

Built once. Carries:

  • Vanilla distro (Debian 12 cloud or CachyOS — pick whichever matches LXC 105's base; verify by inspecting 105.conf)
  • The strix-halo passthrough recipe from `220.conf` (privileged, apparmor unconfined, `dev0`–`dev3` cgroup lines from `feedback / strix-halo-lxc-passthrough` memory)
  • `/mnt/ai-models` NFS bind-mount (read-only, from pve) so install doesn't have to re-download base models
  • Standard `halo` user with sudo + ssh key
  • Pre-installed: `curl`, `ca-certificates`, `jq`, `git`, `rsync`, `htop`
  • Networking, sshd, locale, timezone
  • No hal0 artifacts at all — that's the whole point

Once the CT is provisioned and verified, mark as template: `pct template 200`.

First-boot provisioning

Choose one of:

  • Real cloud-init — `pct set 200 --cicustom user=local:snippets/hal0-test.yaml` (works on Debian / Ubuntu cloud-init-enabled templates)
  • Proxmox hookscript — native LXC mechanism, runs at pre-start / post-start. More reliable, more Proxmox-idiomatic.

Provisioning content (translate to whichever mechanism):

```yaml
users:

  • name: halo
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys: [<paste from ~/.ssh/id_ed25519.pub>]
    mounts:
  • [10.0.1.110:/mnt/ai-models, /mnt/ai-models, nfs, ro, 0, 0]
    packages: [curl, ca-certificates, jq, git, rsync, htop]
    runcmd:
  • apt update && apt -y upgrade
  • mkdir -p /mnt/ai-models && mount /mnt/ai-models
  • systemctl enable --now ssh
  • echo "READY" > /tmp/cloud-init-done
    ```

Clone-smoke harness

`scripts/fresh-test-ct.sh` — pure bash. Usage:

```
fresh-test-ct.sh [--vmid 999] [--keep] [--version ]

1. pct clone 200

2. pct start

3. wait for /tmp/cloud-init-done (timeout 90s)

4. ssh halo@ 'curl -fsSL https://hal0.dev/install.sh | bash'

5. ssh halo@ 'hal0 status && hal0 --version && curl localhost:8080/v1/health'

6. ssh halo@ 'hal0 uninstall --yes'

7. assert clean post-uninstall state (no /opt/hal0, no /usr/lib/hal0/current, no /etc/hal0, no /var/lib/hal0, no systemd unit)

8. pct destroy --force (unless --keep)

9. emit a single JSON line to stdout: {clone_id, install_ok, smoke_ok, uninstall_ok, residue, elapsed_s}

```

Integrate the JSON-line output with the existing δ-tier harness at `tests/harness/` (memory: `hal0_test_harness`) so install / uninstall coverage gets aggregated alongside slot lifecycle coverage. New row type, same emitter.

Acceptance criteria

  • CT 200 created with the 105-equivalent passthrough recipe; `pct config 200` matches the `220.conf` pattern (privileged, apparmor unconfined, dev0–dev3 cgroup lines)
  • CT 200 marked as template (`pct template 200`)
  • Provisioning mechanism (cloud-init or hookscript) chosen and committed to `packaging/proxmox/hal0-test-template/`
  • First-boot leaves `/tmp/cloud-init-done`; sshd accepts the `halo` user with the provisioned key
  • `scripts/fresh-test-ct.sh` implements the clone → install → smoke → uninstall → destroy cycle and emits the JSON line
  • Uninstall residue assertion: zero hal0 artifacts left on disk, zero systemd units, zero open ports — fail loudly on residue
  • Harness integration: `make harness-install` (or similar) runs the script end-to-end, emits to `tests/harness/findings.jsonl`
  • Docs at `docs/internal/install-test-harness.md` covering how to bump the golden template (apt updates, base image refresh) and the contention caveat with LXC 105
  • One successful run from scratch on pve recorded in the PR description (clone_id, total elapsed time, JSON line)
  • Sub-task: verify pve SSH key is current (the stale `known_hosts` entry for `10.0.1.110` should be refreshed against the post-kernel-migration host — this issue is the natural place to fix it)
  • Sub-task: verify live NAS state matches the assumptions in `pve_ai_models_storage_separation` memory before mounting `/mnt/ai-models` (it should be writable on pve, exported read-only via NFS, with the symlinked HF cache at `/mnt/ai-models/huggingface`)

Blocked by

None — fully AFK from start. Unblocks #406 (install-mode reconciliation) for end-to-end verification of the tarball mode.

Notes

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestready-for-agentPRD is fully scoped and ready for an AFK agent to pick upv0.4v0.4 scope — UI polish + install-mode reconciliation + Bootstrapped

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions