feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage#54
Draft
kaiitunnz wants to merge 15 commits into
Draft
feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage#54kaiitunnz wants to merge 15 commits into
kaiitunnz wants to merge 15 commits into
Conversation
e374b55 to
b80f72b
Compare
The parent server opens Redis-over-TLS connections during lifespan
startup, which initialises OpenSSL state. OpenSSL is not fork-safe —
inheriting random-pool state and internal locks via `fork()` can
deadlock the child the first time it calls `ssl.SSLContext.__new__`
(observed as an intermittent "Supervisor child did not register a
node within 30s" handshake timeout, with the child stuck in
`ssl.create_default_context` under `redis-py`'s SSL connect path).
Use `mp.get_context("spawn")` for the supervisor `Process` and for
the IPC queues in `create_task_channel`. Spawn execs a fresh
interpreter so the child gets clean OpenSSL state. The IPC primitives
must be created from the same context — mixing fork-context
`SemLock`s with a spawn-context process raises a RuntimeError.
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Newly disclosed CVE in diffusers 0.36.0, fixed in 0.38.0. Bumping is blocked by the same `safetensors>=0.8.0rc0` pre-release requirement that already gates GHSA-j7w6-vpvq-j3gm and GHSA-98h9-4798-4q5v; adding to the existing diffusers row block in the workflow and the advisory table. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`mp.get_context("spawn")` returns a singleton instantiated at
`multiprocessing` import time, so `@functools.cache` on a getter was
redundant and the "lazily constructed" framing was misleading. A
module-level constant is simpler, honest about what it is, and reads
naturally at call sites.
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Newly disclosed CVE in starlette 0.52.1, fixed in 1.0.1. Bumping is blocked by `gradio==5.50` (transitive via `vllm-omni==0.18`), which caps `starlette<1.0` — same chain that gates the existing gradio / vllm-omni CVE ignores. Add to the worker-GPU pip-audit invocation (where the failure surfaced) and document the row in the advisory table. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The previous commit added the ignore only to the worker-GPU step because that's where the CVE first surfaced. The lock then resolved starlette to a different (still <1.0.1) version on the server side, exposing the same advisory in the server pip-audit step. Same blocker (gradio 5.50 caps starlette<1.0 via vllm-omni 0.18 — already in the docs table). Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
FLOWMESH_PLUGIN_DIR is deliberately read-only — it's the code mount — so plugins that need persistence (a SQLite ACL, a cache file) currently have to ask each operator to add a custom bind via docker-compose.override.yml, which is brittle. Add a second mount at /app/plugin-data driven by FLOWMESH_PLUGIN_DATA_DIR. A path value (the default, `./plugin-data`) is a host bind, auto-created on stack up; a bare name is an external Docker volume of that name, which the operator precreates and owns the lifecycle of. The CLI discriminates path vs bare name in apply_plugin_data_env and routes the value to either the service mount directly (path mode) or an internal FLOWMESH_PLUGIN_DATA_VOLUME parameter consumed by a parameterized external volume declaration (volume mode). Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The default external volume name was the bare alias `flowmesh_plugin_data`, which would collide across multiple stacks sharing a host (FLOWMESH_STACK_SUFFIX exists exactly to scope these names per-stack). Match the convention used by the other named volumes in the same file. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`apply_plugin_data_env` was hooked into `_stack()._load`, which runs on every stack subcommand (`up`, `down`, `ps`, `logs`, `clean`, ...) via DockerComposeStack.run, so `ensure_dir(resolved)` materialized ./plugin-data in the operator's cwd even on read-only commands. FLOWMESH_PLUGIN_DIR avoids this by routing its mkdir through `ensure_deploy_paths`, which only runs when `to_deploy=True`. Split the helper: env-var routing (path resolution + alias re-route) stays in `apply_plugin_data_env` so compose substitution always has the right values; the `ensure_dir` call moves into `ensure_deploy_paths`, gated on `FLOWMESH_PLUGIN_DATA_VOLUME` being unset (volume mode is operator-managed, no host dir to create). Also drops the redundant "~/" prefix from _PLUGIN_DATA_PATH_PREFIXES (the trailing "~" already subsumes it) and adds five unit tests covering the empty / relative / absolute / tilde / bare-name branches. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
When the host restarts past a ResourceRegistrar's TTL, persistent plugins were silently dropping grants on long-lived workers and workflows. The reconcile sweep replaces TTL pruning with a stronger guarantee: on boot, enumerate every live resource (workflows + their tasks, nodes, workers) and hand the full batch to each registrar's `refresh`, then call `purge_stale` once per registrar. Registrars that persist state use the pair to drop records the server no longer tracks. - `auth.security.refresh_resources(refs, logger)` and `purge_stale_resources(logger)` fan out to every `ResourceRegistrar`. - `_lifespan` runs `_reconcile_resources()` once after plugins load and the system principal resolves. Worker / workflow enumeration is gated on `IS_ROOT_NODE`; nodes are always present. - docs/PLUGINS.md notes the new sweep on the `ResourceRegistrar` bullet. Requires lumid-hooks 0.2.0 for the new Protocol methods. The `tool.uv.sources` entry pointing at `../lumid.hooks` is temporary — drop it (and refresh the pinned requirements files) once 0.2.0 is on PyPI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Review findings on the startup-reconcile work: - Move `_reconcile_resources()` after `SUPERVISOR.start(...)` so this node is in `NODE_REGISTRY` by the time the sweep enumerates it. Previously the local node was systematically missing from the refresh batch. - `refresh_resources` now returns the set of registrar names whose `refresh` raised, and `purge_stale_resources(..., skip=...)` excludes them. A registrar that fails mid-sweep no longer wipes rows it never marked live; other registrars still complete normally. - `docs/PLUGINS.md` locates the sweep relative to plugin install / supervisor handshake and documents the host-side guard. - Style/ops: `TODO(lumid-hooks-0.2.0):` prefix on the `tool.uv.sources` comment so future cleanup surfaces in grep; `uv.lock` revision restored to 3 (was downgraded by an older local uv). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Match lumid-hooks 0.2.0's single-method Protocol. The fan-out helper in `auth.security` becomes one call; the failed-set plumbing in `_lifespan` goes away because `reconcile` is atomic on the plugin side — a raised exception leaves the registrar's store unchanged, so the host can log and move on. - `auth.security.reconcile_resources(resources, logger)` replaces `refresh_resources` / `purge_stale_resources`. Each registrar's `reconcile` is wrapped in try/except + `logger.exception`; the sweep continues with the next registrar on failure. - `_lifespan` calls `reconcile_resources(refs, logger)` once and drops the failed-set threading. - Tests trade the old refresh/purge dispatch tests for one happy-path fan-out test and one that asserts a failing registrar is isolated while the rest still run. - docs/PLUGINS.md reflects the single-call shape and the atomic-or-rollback semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
b80f72b to
3c6fcd7
Compare
lumid-hooks 0.2.0 (with the new `ResourceRegistrar.reconcile`) is released on PyPI, so the editable path override from `[tool.uv.sources]` is dropped. flowmesh-hook's pin moves to >=0.2.0 since it re-exports the breaking-changed Protocol surface. Server requirements regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`_reconcile_resources` walked `NODE_REGISTRY` and `WORKER_REGISTRY` but read `.node_id` / `.worker_id` off the records — those attributes don't exist on the `Node` / `Worker` Pydantic models, both of which expose `.id`. The mismatch crashed the FastAPI lifespan during the startup reconcile sweep, taking the whole server down on every boot. Caught while running the lumid plugin e2e plan against a freshly built `FLOWMESH_VERSION=permission` image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
3c6fcd7 to
e4a7b74
Compare
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
FLOWMESH_PLUGIN_DIRis deliberately read-only — it's the code mount — so plugins that need writable persistence (a SQLite ACL, a cache file, an ownership table) currently have to ask each operator to add a custom bind viadocker-compose.override.yml. That's brittle and easy to forget; the immediate use case islumid-flowmesh-plugin'sResourceRegistrar/PermissionCheckerSQLite ACL.Add a second mount at
/app/plugin-datadriven byFLOWMESH_PLUGIN_DATA_DIR. Three modes, single env var:./plugin-data,/var/lib/foo) — host bind, auto-created onstack up. Relative paths resolve against the operator's cwd (viaSTACK_PATH_KEYS).my_external_vol) — external Docker volume of that name; operator precreates withdocker volume create. CLI routes the value through an internalFLOWMESH_PLUGIN_DATA_VOLUMEparameter consumed by a parameterizedexternal: truevolume declaration.The discrimination (path vs bare name) lives in
apply_plugin_data_envincli/stack/.../utils.py— same rule docker compose uses for short-syntax mount sources.Folds in the pre-existing one-line docs fix to
hook/README.mdfrom the original branch (HookBindingsis the Protocol;BaseBindingsis the dataclass) since it's already on the branch.Changes
cli/stack/.../assets/compose.yml— add the/app/plugin-datamount on the server service and a matchingexternal: truevolume declaration parameterized byFLOWMESH_PLUGIN_DATA_VOLUME.cli/stack/.../env_schema.py+cli/stack/.../assets/.env.example(regenerated) — declareFLOWMESH_PLUGIN_DATA_DIRin the "External Plugins" section.cli/stack/.../utils.py—apply_plugin_data_env(base_dir)discriminates the env value into path vs volume-name mode and routes accordingly. Called from_stack()'s_loadhook instack.py.docs/ENV.md+docs/PLUGINS.md— document the three modes and that plugin code stays read-only while plugin data is writable.hook/README.md— pre-existing one-lineHookBindingsvsBaseBindingscorrection (from the original branch).Test Plan
Manual verification via
docker compose -f cli/stack/.../assets/compose.yml configacross all three modes:type: bindto<cwd>/plugin-data.type: bindto the operator-supplied path.type: volume, top-level declarationexternal: true,name:set to the operator-supplied value.Live end-to-end with
flowmesh stack up: not run — left for the matchinglumid-flowmesh-pluginPR's e2e test plan, which exercises this feature against a real workload.Test Result
docker compose configacross the three modes: produces the expectedbind/volumeshapes as enumerated above.Pre-submission Checklist