feat: add CI/GPU runner infrastructure migrated from FlowMesh_dev #15
Qruixuan wants to merge 17 commits into
Conversation
Migrate the following changes from mlsys-io/FlowMesh_dev (ci/gpu-runner-setup-v2):
- .github/workflows/unit-tests.yml: switch install to --all-extras, add cuda runner label [self-hosted, cuda], pin action SHAs with uv version 0.11.8, add permissions/concurrency blocks
- src/worker/docker/Dockerfile.cpu: rename SUPERVISOR_GRPC_TARGET -> GUARDIAN_GRPC_TARGET, switch shared copy to granular (shared/__init__.py + shared/all + shared/host_worker + shared/guardian_worker), drop source/url OCI labels
- src/worker/docker/Dockerfile.cuda: same GUARDIAN rename + granular shared copy, drop source/url OCI labels
- src/worker/docker/Dockerfile.ssh.cpu: drop source/url OCI labels
- src/worker/docker/Dockerfile.ssh.gpu: drop source/url OCI labels
- src/worker/docker/README.md: rename SUPERVISOR_GRPC_TARGET -> GUARDIAN_GRPC_TARGET, update TLS section (guardian naming)

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
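A minimal sketch of what the hardened workflow described in the first bullet might look like; the job layout, install command, and the placeholder SHA are assumptions, not the actual file contents:

```yaml
# Hypothetical shape of the unit-tests.yml hardening; only the patterns
# (pinned action, permissions/concurrency, cuda runner label) come from
# the commit message — concrete values are placeholders.
permissions:
  contents: read

concurrency:
  group: unit-tests-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-tests:
    runs-on: [self-hosted, cuda]
    steps:
      - uses: actions/checkout@<full-commit-sha>  # pin to a SHA, not a mutable tag
        with:
          persist-credentials: false
      - name: Install dependencies
        run: uv sync --all-extras
```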
…om FlowMesh_dev

Migrates changes from mlsys-io/FlowMesh_dev (main) that were not yet present in FlowMesh:
- Dockerfile.cuda: rename SUPERVISOR_GRPC_TARGET → GUARDIAN_GRPC_TARGET; replace broad `COPY src/shared` with granular copies of shared/__init__.py, shared/all, shared/host_worker, shared/guardian_worker; drop extra org.opencontainers.image.source/url LABEL lines
- Dockerfile.ssh.cpu: drop org.opencontainers.image.source/url LABELs
- Dockerfile.ssh.gpu: drop org.opencontainers.image.source/url LABELs
- src/worker/docker/README.md: rename SUPERVISOR_GRPC_TARGET → GUARDIAN_GRPC_TARGET, generate_server_tls_certs.sh → generate_guardian_tls_certs.sh, SERVER_GRPC_TLS_CA_B64 → GUARDIAN_GRPC_TLS_CA_B64

templates/n8n/dag_inference.json and the CI workflows are already in sync (identical SHAs / FlowMesh has newer hardened versions).

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Previous agent incorrectly changed SUPERVISOR_GRPC_TARGET to GUARDIAN_GRPC_TARGET and altered COPY paths/labels. This reverts those files to their correct state. Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Dockerfile.cuda: install requirements.gpu.txt in addition to requirements.txt, and add build-time verification that torch/transformers are importable.

transformers_executor.py: capture the import error message in _HF_IMPORT_ERROR, split PreTrainedModel into a separate fallback import block, and add a _require_transformers() helper called from both prepare() and run().

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrate ci.compose.yml, ci.worker.gpu.yml, ci.ports.fixed.yml, ci.worker_config.yaml, and ci.gpu_worker_config.yaml from FlowMesh_dev. Adapted: guardian service → supervisor, src/guardian/ → src/server/, /etc/guardian/ → /etc/supervisor/, env var names GUARDIAN_* → SUPERVISOR_*. Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrate .github/workflows/ci.yml (integration + GPU smoke jobs) and scripts/ci/setup-runner.md from FlowMesh_dev. Adapted: guardian→supervisor service names throughout; repo URL updated to mlsys-io/FlowMesh. Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
scripts/ci/run_local.sh: migrate from FlowMesh_dev, adapted guardian→supervisor throughout (service exec, compose override, health checks, log references).

templates: fix output.destination from http to local in conditional_echo_test.yaml and ssh_noninteractive.yaml; use the dev version of echo_three_node_graph.yaml.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
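For context, the destination fix might look like this excerpt; everything around the `output` block is an assumption about the template schema:

```yaml
# conditional_echo_test.yaml excerpt (assumed shape)
output:
  destination: local   # was: http
```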
FlowMesh has no separate host/guardian/postgres services. A single
src/server/Dockerfile exposes both HTTP API (8000) and gRPC supervisor
(50051). Updated ci.compose.yml, ci.worker.gpu.yml, ci.ports.fixed.yml:
- server service built from src/server/Dockerfile
- redis only (no postgres)
- WORKER_DOCKER_NETWORK uses ${COMPOSE_PROJECT_NAME}_ci-net interpolation
- SERVER_HOST=server so spawned workers get SUPERVISOR_GRPC_TARGET=server:50051
Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
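A plausible sketch of the resulting ci.compose.yml; the service keys beyond those named in this commit (redis image tag, network layout) are assumptions:

```yaml
# Single-server CI stack (sketch): server + redis only, no postgres.
services:
  server:
    build:
      context: .
      dockerfile: src/server/Dockerfile
    environment:
      SERVER_HOST: server   # spawned workers get SUPERVISOR_GRPC_TARGET=server:50051
      WORKER_DOCKER_NETWORK: ${COMPOSE_PROJECT_NAME}_ci-net
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # DooD: the adapter spawns workers
    networks: [ci-net]
  redis:
    image: redis:7-alpine   # image tag is an assumption
    networks: [ci-net]
networks:
  ci-net: {}   # compose names this ${COMPOSE_PROJECT_NAME}_ci-net automatically
```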
Key changes:
- Single "Wait for server" health check (port 8000) instead of separate host + supervisor
- Worker registration check uses docker compose exec -T server (no exposed port needed)
- E2E tests use http://server:8000 (internal compose network name)
- Destroy workers via server API on port 8000
- COMPOSE_PROJECT_NAME exported so ${COMPOSE_PROJECT_NAME}_ci-net interpolation works
- run_local.sh: dc() wrapper exports COMPOSE_PROJECT_NAME; single server port block
in compose override; step numbering adjusted (no separate supervisor confirm step)
Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
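The single health check plus exec-based registration check might look like this sketch in ci.yml; the endpoint paths, retry budget, and in-container command are assumptions:

```yaml
- name: Wait for server
  run: |
    for _ in $(seq 1 60); do
      curl -fsS http://localhost:8000/health >/dev/null 2>&1 && exit 0
      sleep 2
    done
    echo "server never became healthy on :8000" >&2
    exit 1
- name: Check worker registration
  run: |
    # no exposed port needed: query the API from inside the server container
    docker compose -p "$COMPOSE_PROJECT_NAME" exec -T server \
      python -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/workers').read())"
```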
…s and exposed ports

Workers are spawned by the Docker adapter with network_mode: host (see supervisor/adapters/docker.py _start()). They connect to the gRPC supervisor at localhost:50051 and download results via FLOWMESH_BASE_URL. Four bugs in the previous CI setup:
1. The WORKER_DOCKER_NETWORK env var doesn't exist in FlowMesh — removed.
2. FLOWMESH_BASE_URL was "http://server:8000", but workers on the host network can't resolve "server"; changed to "http://localhost:8000".
3. The CI workflow never exposed ports 8000/50051 on the host, so workers (network_mode: host) couldn't reach the server container at all; added ci.ports.fixed.yml to both build steps.
4. run_local.sh used a dynamic HTTP port, but FLOWMESH_BASE_URL in the compose file is a static value set before start; changed to a fixed 127.0.0.1:8000:8000 so workers can always reach http://localhost:8000.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
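Based on this description, ci.ports.fixed.yml is plausibly just a port overlay like the following (the bind address is an assumption); it would be merged into the stack with an extra `-f docker/ci.ports.fixed.yml` on the compose command:

```yaml
# Fixed host ports so workers running with network_mode: host can reach
# the server at localhost.
services:
  server:
    ports:
      - "127.0.0.1:8000:8000"    # HTTP API (FLOWMESH_BASE_URL=http://localhost:8000)
      - "127.0.0.1:50051:50051"  # gRPC supervisor
```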
Migrated from the FlowMesh_dev ci/gpu-runner-setup-v2 branch unchanged:
- Submits a workflow YAML to a live server and polls until DONE/FAILED
- Skips automatically when FLOWMESH_HOST_URL is unset (safe for unit test runs)
- Handles n8n JSON and native YAML formats
- Skips (not fails) when the executor package is unavailable on the worker

Used by run_local.sh (step 7) and .github/workflows/ci.yml E2E steps.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
The server's _VolumeInitializer runs busybox:1.36.1 to chown the named Docker volume to UID 10001, but if busybox isn't cached it fails silently and marks the volume as initialized anyway — so all subsequent workers also get PermissionError writing to /var/lib/flowmesh-results.

Fix: set RESULTS_DIR to an absolute host path. The docker adapter skips _VolumeInitializer for absolute paths (see _ensure_volume_access). Workers receive a bind mount of a pre-created host dir with chmod 777, which UID 10001 (appuser) can write to without any chown step.
- ci.compose.yml: RESULTS_DIR=/tmp/flowmesh-ci-results
- ci.yml: mkdir + chmod 777 before 'docker compose up' in both jobs, rm -rf in teardown
- run_local.sh: per-PID dir /tmp/flowmesh-ci-results-$PROJECT, overridden in the compose overlay; cleaned up in teardown

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
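The ci.yml wiring might look like this sketch (step names and the compose invocation are assumptions; the paths and chmod come from the commit message):

```yaml
- name: Prepare results dir
  run: |
    mkdir -p /tmp/flowmesh-ci-results
    chmod 777 /tmp/flowmesh-ci-results   # writable by worker UID 10001, no chown needed
- name: Start stack
  run: docker compose -f docker/ci.compose.yml up -d --build
# ... test steps ...
- name: Teardown
  if: always()
  run: rm -rf /tmp/flowmesh-ci-results
```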
docker volume prune -f (in pre-clean) deleted the named volume flowmesh_server_hf_cache between runs, forcing TinyLlama to be re-downloaded every time (~50s) and causing the 300s vLLM test to time out by a few seconds.

Fix: set HF_CACHE_DIR to the host's ~/.cache/huggingface so workers receive a bind mount of an absolute path. _ensure_volume_access skips _VolumeInitializer for absolute paths; models downloaded on the first run persist for every subsequent run on the same machine.
- ci.compose.yml: pass HF_CACHE_DIR through from the compose env
- run_local.sh: resolve _HF_CACHE_DIR (host ~/.cache/huggingface), mkdir+chmod 777, inject into the compose override
- ci.yml: set HF_CACHE_DIR=$HOME/.cache/huggingface in the project-name step; mkdir+chmod 777 in the setup step; pass to the docker compose env

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
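A sketch of the pass-through this commit describes (compose excerpt; the keys are assumptions, and note the next commit reverts this bind mount in favor of a named volume):

```yaml
services:
  server:
    environment:
      # host path, e.g. /home/runner/.cache/huggingface, set by ci.yml / run_local.sh
      HF_CACHE_DIR: ${HF_CACHE_DIR}
```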
…M timeout

The HF_CACHE_DIR bind mount was reverted — using the named Docker volume flowmesh_server_hf_cache (identical to FlowMesh_dev) avoids accumulating model weights on the host between CI runs; docker volume prune cleans it up.

The timeout issue is fixed by bumping the GPU E2E timeout: cold start (model download ~50s + load ~53s + compile ~17s + CUDA graphs) takes ~250s, leaving only ~50s for inference at the old 300s limit.
- run_local.sh: GPU default timeout 300 → 600s
- ci.yml: inference_vllm_tiny E2E_TIMEOUT_SEC 300 → 600s

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
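The ci.yml side of the timeout bump might look like this (the env var name and test id are from this PR; the step name and run command are assumptions):

```yaml
- name: GPU E2E (vLLM tiny)
  env:
    E2E_TIMEOUT_SEC: "600"   # was 300; cold start alone takes ~250 s
  run: uv run pytest tests/integration/test_e2e.py -k inference_vllm_tiny
```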
…json

meta-llama/Llama-3.2-1B-Instruct requires HF_TOKEN; use the non-gated Qwen/Qwen2.5-0.5B-Instruct instead, matching FlowMesh_dev's fix.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
- Pin actions/checkout and actions/upload-artifact to commit SHAs
- Add persist-credentials: false to all checkout steps
- Add top-level permissions: contents: read
- Move github.workspace and github.run_id out of run: blocks into step-level env: to eliminate template-expansion injection warnings

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
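A minimal sketch of the env: move for one step, assuming a hypothetical log-collection step; the pattern is to expand GitHub expressions in `env:` and reference them as shell variables so untrusted values never hit template expansion inside `run:`:

```yaml
# Step name and script are assumptions; only the env: pattern is from the commit.
- name: Collect logs
  env:
    WORKSPACE: ${{ github.workspace }}   # expanded by the runner, not the shell
    RUN_ID: ${{ github.run_id }}
  run: |
    tar -czf "ci-logs-${RUN_ID}.tar.gz" -C "${WORKSPACE}" logs/
```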
Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
@kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?
I think with DinD, malicious workflows can still escape into the host. We need a better solution for this before the release.
One alternative that allows Docker inside a container without privileges is sysbox. But sysbox does not have stable GPU support. Regarding the malicious workflow, we can make this e2e CI a daily workflow on main. Thus, we only need to ensure that the workflows in main are secure. Does that help?
Purpose
Migrate the CI/GPU runner infrastructure from `mlsys-io/FlowMesh_dev` (`ci/gpu-runner-setup-v2`) and adapt it to FlowMesh's single-server architecture (no separate host/guardian split).

Changes
- `docker/ci.compose.yml` — isolated per-run Compose stack; the server mounts the Docker socket to spawn workers via DooD; workers use `network_mode: host` to reach `localhost:8000/50051`
- `docker/ci.ports.fixed.yml` — fixed-port overlay (8000, 50051) for CI and local runs
- `docker/ci.worker.gpu.yml` — GPU worker overlay wiring (CUDA devices, `HF_TOKEN` pass-through); see the sketch after this list
- `docker/ci.gpu_worker_config.yaml` — supervisor worker config for the GPU CI worker
- `scripts/ci/run_local.sh` — mirrors the full GitHub Actions pipeline locally for pre-push validation; supports `--gpu`, `--task-yaml`, `--timeout`, `--no-build`, `--keep`
- `.github/workflows/ci.yml` — integration (CPU echo) + GPU smoke tests (vLLM, HF Transformers, LoRA SFT, DAG inference, SSH, conditional echo); actions pinned to commit SHAs, `permissions: contents: read`, template-expansion injections moved to step-level `env:`
- `tests/integration/test_e2e.py` — pytest E2E test that submits a workflow and polls until DONE; gracefully SKIPs on unavailable executors (`max_attempts_exceeded` + log pattern match)
- `templates/n8n/dag_inference.json` — replace gated `meta-llama/Llama-3.2-1B-Instruct` with open `Qwen/Qwen2.5-0.5B-Instruct` to avoid `GatedRepoError` when `HF_TOKEN` is absent
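A plausible shape for `docker/ci.worker.gpu.yml`; the Compose GPU-reservation syntax is standard, but the exact keys this PR uses are assumptions:

```yaml
# GPU overlay sketch: CUDA device reservation plus HF_TOKEN pass-through.
services:
  server:
    environment:
      HF_TOKEN: ${HF_TOKEN:-}   # pass-through for gated models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```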
Design

Workers are spawned by the server's Docker adapter (DooD via `/var/run/docker.sock`) with `network_mode: host`, so they reach the server at `localhost:8000` and `localhost:50051` directly. The pytest runner executes in a temporary `python:3.11-slim` container joined to the compose network, accessing the server via the service name `http://server:8000`. Each CI run is isolated by a unique compose project name (`ci-$RUN_ID`).

The GPU workflow uses a content-hash-tagged builder image (`flowmesh-builder:<hash>`) to cache the heavy CUDA dependency layer across runs, rebuilding only when `Dockerfile.cuda.builder` or `requirements.gpu.txt` change.
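The content-hash tagging could be realized with a step pair like this sketch; the file paths and step names are assumptions:

```yaml
- name: Compute builder tag
  id: builder
  run: |
    # the tag derives only from the inputs that should trigger a rebuild
    hash=$(cat src/worker/docker/Dockerfile.cuda.builder requirements.gpu.txt | sha256sum | cut -c1-12)
    echo "tag=flowmesh-builder:$hash" >> "$GITHUB_OUTPUT"
- name: Build builder image if missing
  env:
    TAG: ${{ steps.builder.outputs.tag }}
  run: |
    docker image inspect "$TAG" >/dev/null 2>&1 || \
      docker build -f src/worker/docker/Dockerfile.cuda.builder -t "$TAG" .
```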
Test Plan
Test Result
Pre-submission Checklist
- I have run `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- Dependencies are in sync (`uv sync --all-extras --frozen`).
- Breaking changes are marked `[BREAKING]` and described migration steps above.