[main] Fix(ci): support Ubuntu Noble stemcell in create-bosh-lite#3790

Draft
prkalle wants to merge 1 commit into
cloudfoundry:mainfrom
prkalle:update/bosh-lite-stemcell_main

@prkalle prkalle commented Jun 11, 2026

Contributor

Description of the Change

The create-bosh-lite workflow builds the BOSH Lite environment that backs our CLI integration tests. It broke when cf-deployment switched its default stemcell from ubuntu-jammy to ubuntu-noble (cloudfoundry/cf-deployment#1224). Ubuntu Noble (24.04) is a fundamentally different runtime for BOSH Lite warden containers — they now boot under systemd as PID 1 with cgroup v2 — and that cascaded into several distinct failures. This PR fixes each one:

  • Warden agents never connected (every VM timed out). Noble containers only boot systemd (and therefore the BOSH agent) when the warden CPI sets start_containers_with_systemd: true. That flag first ships in bbl ≥ 9.0.41, so the workflow now requires the BBL_CLI_VERSION repository variable to be ≥ 9.0.41 (we run 9.0.45). This is the core fix — without it, every compilation and instance-group VM failed with Timed out pinging VM ... after N seconds.

  • External DNS resolution was broken inside containers (app staging failed). On Noble the bosh-dns config lives under the bosh-dns-systemd addon with disable_recursors: true, so diego-cells couldn't resolve buildpacks.cloudfoundry.org and every app push failed in the buildpack-compile phase (server misbehaving). A new bosh-dns-noble-bosh-lite.yml ops file sets disable_recursors: false and the 169.254.169.254 recursor on that addon, applied to the DNS runtime-config.

  • App Envoy sidecars crashed on start. Envoy aborted with inotify_fd_ >= 0 (exit 134) — the director host's fs.inotify.max_user_instances (128) was exhausted by systemd in every warden container. director-inotify.yml raises it (1024/524288) via an os-conf sysctl job on the director VM (the host where inotify limits bind, not the diego-cell).

  • The single director VM ran out of memory. On bosh-lite the whole deployment runs as warden containers on one VM, and with a full systemd instance in every container the previous 32 GB of memory was overcommitted: instance-group agents (the database group, then the UAAs) failed to boot. The director VM is bumped to n2-standard-16 (64 GB).
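
To make the inotify item above easier to verify by hand, here is a minimal sketch of a check that could be run on the director VM. It is not part of this PR. The function names are hypothetical, and it assumes that the 1024 quoted above is the new fs.inotify.max_user_instances value. It reads the limit from the standard Linux procfs path.

```shell
# Hypothetical helper, not part of this PR: report whether the host's
# fs.inotify.max_user_instances is at least a required value.

# inotify_limit NAME: print the current value of fs.inotify.NAME from procfs.
inotify_limit() {
  cat "/proc/sys/fs/inotify/$1"
}

# limit_ok CURRENT REQUIRED: succeed when CURRENT >= REQUIRED (integers).
limit_ok() {
  [ "$1" -ge "$2" ]
}

# report_inotify_instances REQUIRED [CURRENT]
# CURRENT is read from procfs unless passed explicitly (useful for dry runs).
report_inotify_instances() {
  local required="$1"
  local current="${2:-$(inotify_limit max_user_instances)}"
  if limit_ok "$current" "$required"; then
    echo "ok: max_user_instances=$current (>= $required)"
  else
    echo "low: max_user_instances=$current (< $required)"
    return 1
  fi
}
```

Running `report_inotify_instances 1024` on the director VM (not inside a diego-cell container) would show whether the raised limit is in effect. Passing a second argument, as in `report_inotify_instances 1024 128`, simulates the old default without touching the host.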

Why Is This PR Valuable?

The CLI integration-test environment is currently un-buildable on the latest cf-deployment, which blocks the CI that validates the CLI against a real Cloud Foundry. This restores that pipeline on Ubuntu Noble — the new default — so the team can keep testing against current CF releases.

Applicable Issues

Related context: cloudfoundry/cf-deployment#1224 (Noble stemcell validation), cloudfoundry/cf-deployment#1222 (pipeline switch to Postgres)

How Urgent Is The Change?

Moderately urgent — the bosh-lite CI environment cannot be created on the current cf-deployment default, so the CLI integration tests that depend on it are effectively blocked until this lands.

Other Relevant Parties

Who else is affected by the change?

@prkalle prkalle changed the title Fix(ci): support Ubuntu Noble stemcell in create-bosh-lite [main] Fix(ci): support Ubuntu Noble stemcell in create-bosh-lite Jun 11, 2026

prkalle commented Jun 11, 2026

Contributor Author

The Tests: Integration workflow uses pull_request_target, so it runs create-bosh-lite.yml from the base branch, not this PR. The automated check therefore exercises main's workflow and will fail until this PR is merged.

I validated the changes with a manual workflow_dispatch run on this branch (which uses the branch's workflow). Successful run: https://github.com/cloudfoundry/cli/actions/runs/27443213680/job/81129588803
(The client-credential integration tests failed in that run. They were already failing before this change and will be fixed in a separate PR.)

@prkalle prkalle marked this pull request as draft June 12, 2026 00:34
The cf-deployment default stemcell moved from ubuntu-jammy to ubuntu-noble
(cloudfoundry/cf-deployment#1224), which broke the create-bosh-lite workflow.
Three Noble-specific problems plus supporting fixes:

- Warden agents wouldn't connect: Noble warden containers boot under systemd,
  which requires bbl >= 9.0.41 (warden_cpi start_containers_with_systemd:true).
  Provided via the BBL_CLI_VERSION repository variable (>= 9.0.41; set to 9.0.45).

- External DNS broken inside containers: the Noble bosh-dns config lives under
  the "bosh-dns-systemd" addon with disable_recursors:true, so diego-cells
  couldn't resolve buildpacks.cloudfoundry.org and app staging failed
  ("server misbehaving"). bosh-dns-noble-bosh-lite.yml enables recursion with
  the 169.254.169.254 recursor, applied to the dns runtime-config.

- App Envoy sidecars crashed on start ("inotify_fd_ >= 0", exit 134): with
  systemd in every warden container the director host's
  fs.inotify.max_user_instances (128) was exhausted. director-inotify.yml adds
  an os-conf sysctl job (1024 / 524288) to the director via
  create-director-override.sh. inotify limits bind at the host root user
  namespace, so this is set on the director VM, not the diego-cell.

- Increase the bosh-lite director VM to n2-standard-16 (64 GB): the whole
  deployment runs as warden containers on one VM, and the previous 32 GB of
  memory was overcommitted.

- Fix the failure-cleanup step: `bbl down` was passed
  --gcp-service-account-key=key.json (no such file is created), so it parsed the
  literal string as JSON and failed, leaving orphaned infrastructure on any
  failed run. Authenticate via BBL_GCP_SERVICE_ACCOUNT_KEY, like `bbl up`.

Requires the BBL_CLI_VERSION repository variable to be >= 9.0.41.
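
The version requirement above could be enforced with a pre-flight check before `bbl up`. The following is a minimal sketch only, not something this commit adds. The function names are hypothetical, and it assumes a `sort` that supports `-V` (GNU coreutils) and a plain dotted version string in BBL_CLI_VERSION.

```shell
# Hypothetical pre-flight check, not part of this commit.
MIN_BBL_VERSION="9.0.41"  # minimum bbl version stated in this commit message

# version_ge A B: succeed when version A >= version B.
# Sorts both versions ascending with `sort -V`; A >= B exactly when B sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_bbl_version V: print a verdict; return non-zero if V is too old.
check_bbl_version() {
  local v="${1#v}"  # tolerate a leading "v", e.g. "v9.0.45"
  if version_ge "$v" "$MIN_BBL_VERSION"; then
    echo "ok: bbl $v >= $MIN_BBL_VERSION"
  else
    echo "error: bbl $v < $MIN_BBL_VERSION" >&2
    return 1
  fi
}
```

A workflow step could call `check_bbl_version "$BBL_CLI_VERSION"` and stop early on a non-zero status, instead of failing later with agent timeouts. A version-aware sort is used because a plain string comparison would order 9.0.5 after 9.0.41.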

Signed-off-by: Prem Kumar Kalle <prem.kalle@broadcom.com>
@prkalle prkalle force-pushed the update/bosh-lite-stemcell_main branch from f94f05b to fcfade8 Compare June 13, 2026 01:45