Skip to content

feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage#54

Draft
kaiitunnz wants to merge 15 commits into
mainfrom
kaiitunnz/feat/plugin-data-dir
Draft

feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage#54
kaiitunnz wants to merge 15 commits into
mainfrom
kaiitunnz/feat/plugin-data-dir

Conversation

@kaiitunnz
Copy link
Copy Markdown
Collaborator

@kaiitunnz kaiitunnz commented May 20, 2026

Purpose

FLOWMESH_PLUGIN_DIR is deliberately read-only — it's the code mount — so plugins that need writable persistence (a SQLite ACL, a cache file, an ownership table) currently have to ask each operator to add a custom bind via docker-compose.override.yml. That's brittle and easy to forget; the immediate use case is lumid-flowmesh-plugin's ResourceRegistrar/PermissionChecker SQLite ACL.

Add a second mount at /app/plugin-data driven by FLOWMESH_PLUGIN_DATA_DIR. Three modes, single env var:

  • Default / path (./plugin-data, /var/lib/foo) — host bind, auto-created on stack up. Relative paths resolve against the operator's cwd (via STACK_PATH_KEYS).
  • Bare name (my_external_vol) — external Docker volume of that name; operator precreates with docker volume create. CLI routes the value through an internal FLOWMESH_PLUGIN_DATA_VOLUME parameter consumed by a parameterized external: true volume declaration.

The discrimination (path vs bare name) lives in apply_plugin_data_env in cli/stack/.../utils.py — same rule docker compose uses for short-syntax mount sources.

Folds in the pre-existing one-line docs fix to hook/README.md from the original branch (HookBindings is the Protocol; BaseBindings is the dataclass) since it's already on the branch.

Changes

  • cli/stack/.../assets/compose.yml — add the /app/plugin-data mount on the server service and a matching external: true volume declaration parameterized by FLOWMESH_PLUGIN_DATA_VOLUME.
  • cli/stack/.../env_schema.py + cli/stack/.../assets/.env.example (regenerated) — declare FLOWMESH_PLUGIN_DATA_DIR in the "External Plugins" section.
  • cli/stack/.../utils.pyapply_plugin_data_env(base_dir) discriminates the env value into path vs volume-name mode and routes accordingly. Called from _stack()'s _load hook in stack.py.
  • docs/ENV.md + docs/PLUGINS.md — document the three modes and that plugin code stays read-only while plugin data is writable.
  • hook/README.md — pre-existing one-line HookBindings vs BaseBindings correction (from the original branch).

Test Plan

$ uv run pre-commit run --all-files
$ uv run scripts/dev/check_env_examples.py
$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py

Manual verification via docker compose -f cli/stack/.../assets/compose.yml config across all three modes:

  • Empty / path default → service mount is a type: bind to <cwd>/plugin-data.
  • Custom path → type: bind to the operator-supplied path.
  • Bare name → type: volume, top-level declaration external: true, name: set to the operator-supplied value.

Live end-to-end with flowmesh stack up: not run — left for the matching lumid-flowmesh-plugin PR's e2e test plan, which exercises this feature against a real workload.

Test Result

$ uv run pre-commit run --all-files
# All checks passed

$ uv run scripts/dev/check_env_examples.py
# Clean (no diff after `--write`)

$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
# All 982 tests passed

docker compose config across the three modes: produces the expected bind / volume shapes as enumerated above.


Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run `pre-commit run --all-files` and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that `uv run pytest tests/` passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (`uv sync --all-packages --group ci --frozen`).
  • If this is a breaking change, I have prefixed the PR title with `[BREAKING]` and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

@kaiitunnz kaiitunnz changed the title feat(stack): add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage feat: add FLOWMESH_PLUGIN_DATA_DIR for plugin writable storage May 20, 2026
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/plugin-data-dir branch 2 times, most recently from e374b55 to b80f72b Compare May 22, 2026 11:22
kaiitunnz and others added 12 commits May 22, 2026 21:17
The parent server opens Redis-over-TLS connections during lifespan
startup, which initialises OpenSSL state. OpenSSL is not fork-safe —
inheriting random-pool state and internal locks via `fork()` can
deadlock the child the first time it calls `ssl.SSLContext.__new__`
(observed as an intermittent "Supervisor child did not register a
node within 30s" handshake timeout, with the child stuck in
`ssl.create_default_context` under `redis-py`'s SSL connect path).

Use `mp.get_context("spawn")` for the supervisor `Process` and for
the IPC queues in `create_task_channel`. Spawn execs a fresh
interpreter so the child gets clean OpenSSL state. The IPC primitives
must be created from the same context — mixing fork-context
`SemLock`s with a spawn-context process raises a RuntimeError.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Newly disclosed CVE in diffusers 0.36.0, fixed in 0.38.0. Bumping is
blocked by the same `safetensors>=0.8.0rc0` pre-release requirement
that already gates GHSA-j7w6-vpvq-j3gm and GHSA-98h9-4798-4q5v;
adding to the existing diffusers row block in the workflow and the
advisory table.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`mp.get_context("spawn")` returns a singleton instantiated at
`multiprocessing` import time, so `@functools.cache` on a getter was
redundant and the "lazily constructed" framing was misleading. A
module-level constant is simpler, honest about what it is, and reads
naturally at call sites.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Newly disclosed CVE in starlette 0.52.1, fixed in 1.0.1. Bumping is
blocked by `gradio==5.50` (transitive via `vllm-omni==0.18`), which
caps `starlette<1.0` — same chain that gates the existing gradio /
vllm-omni CVE ignores. Add to the worker-GPU pip-audit invocation
(where the failure surfaced) and document the row in the advisory
table.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The previous commit added the ignore only to the worker-GPU step
because that's where the CVE first surfaced. The lock then resolved
starlette to a different (still <1.0.1) version on the server side,
exposing the same advisory in the server pip-audit step. Same blocker
(gradio 5.50 caps starlette<1.0 via vllm-omni 0.18 — already in the
docs table).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
FLOWMESH_PLUGIN_DIR is deliberately read-only — it's the code mount —
so plugins that need persistence (a SQLite ACL, a cache file) currently
have to ask each operator to add a custom bind via
docker-compose.override.yml, which is brittle.

Add a second mount at /app/plugin-data driven by FLOWMESH_PLUGIN_DATA_DIR.
A path value (the default, `./plugin-data`) is a host bind, auto-created
on stack up; a bare name is an external Docker volume of that name,
which the operator precreates and owns the lifecycle of. The CLI
discriminates path vs bare name in apply_plugin_data_env and routes the
value to either the service mount directly (path mode) or an internal
FLOWMESH_PLUGIN_DATA_VOLUME parameter consumed by a parameterized
external volume declaration (volume mode).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The default external volume name was the bare alias
`flowmesh_plugin_data`, which would collide across multiple stacks
sharing a host (FLOWMESH_STACK_SUFFIX exists exactly to scope these
names per-stack). Match the convention used by the other named volumes
in the same file.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`apply_plugin_data_env` was hooked into `_stack()._load`, which runs on
every stack subcommand (`up`, `down`, `ps`, `logs`, `clean`, ...) via
DockerComposeStack.run, so `ensure_dir(resolved)` materialized
./plugin-data in the operator's cwd even on read-only commands.
FLOWMESH_PLUGIN_DIR avoids this by routing its mkdir through
`ensure_deploy_paths`, which only runs when `to_deploy=True`.

Split the helper: env-var routing (path resolution + alias re-route)
stays in `apply_plugin_data_env` so compose substitution always has
the right values; the `ensure_dir` call moves into
`ensure_deploy_paths`, gated on `FLOWMESH_PLUGIN_DATA_VOLUME` being
unset (volume mode is operator-managed, no host dir to create).

Also drops the redundant "~/" prefix from _PLUGIN_DATA_PATH_PREFIXES
(the trailing "~" already subsumes it) and adds five unit tests
covering the empty / relative / absolute / tilde / bare-name branches.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
When the host restarts past a ResourceRegistrar's TTL, persistent
plugins were silently dropping grants on long-lived workers and
workflows. The reconcile sweep replaces TTL pruning with a stronger
guarantee: on boot, enumerate every live resource (workflows + their
tasks, nodes, workers) and hand the full batch to each registrar's
`refresh`, then call `purge_stale` once per registrar. Registrars
that persist state use the pair to drop records the server no longer
tracks.

- `auth.security.refresh_resources(refs, logger)` and
  `purge_stale_resources(logger)` fan out to every
  `ResourceRegistrar`.
- `_lifespan` runs `_reconcile_resources()` once after plugins load
  and the system principal resolves. Worker / workflow enumeration is
  gated on `IS_ROOT_NODE`; nodes are always present.
- docs/PLUGINS.md notes the new sweep on the `ResourceRegistrar`
  bullet.

Requires lumid-hooks 0.2.0 for the new Protocol methods. The
`tool.uv.sources` entry pointing at `../lumid.hooks` is temporary —
drop it (and refresh the pinned requirements files) once 0.2.0 is on
PyPI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Review findings on the startup-reconcile work:

- Move `_reconcile_resources()` after `SUPERVISOR.start(...)` so this
  node is in `NODE_REGISTRY` by the time the sweep enumerates it.
  Previously the local node was systematically missing from the refresh
  batch.
- `refresh_resources` now returns the set of registrar names whose
  `refresh` raised, and `purge_stale_resources(..., skip=...)` excludes
  them. A registrar that fails mid-sweep no longer wipes rows it never
  marked live; other registrars still complete normally.
- `docs/PLUGINS.md` locates the sweep relative to plugin install /
  supervisor handshake and documents the host-side guard.
- Style/ops: `TODO(lumid-hooks-0.2.0):` prefix on the `tool.uv.sources`
  comment so future cleanup surfaces in grep; `uv.lock` revision
  restored to 3 (was downgraded by an older local uv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Match lumid-hooks 0.2.0's single-method Protocol. The fan-out helper
in `auth.security` becomes one call; the failed-set plumbing in
`_lifespan` goes away because `reconcile` is atomic on the plugin
side — a raised exception leaves the registrar's store unchanged, so
the host can log and move on.

- `auth.security.reconcile_resources(resources, logger)` replaces
  `refresh_resources` / `purge_stale_resources`. Each registrar's
  `reconcile` is wrapped in try/except + `logger.exception`; the sweep
  continues with the next registrar on failure.
- `_lifespan` calls `reconcile_resources(refs, logger)` once and
  drops the failed-set threading.
- Tests trade the old refresh/purge dispatch tests for one happy-path
  fan-out test and one that asserts a failing registrar is isolated
  while the rest still run.
- docs/PLUGINS.md reflects the single-call shape and the
  atomic-or-rollback semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/plugin-data-dir branch from b80f72b to 3c6fcd7 Compare May 22, 2026 14:26
kaiitunnz and others added 2 commits May 22, 2026 22:44
lumid-hooks 0.2.0 (with the new `ResourceRegistrar.reconcile`) is
released on PyPI, so the editable path override from
`[tool.uv.sources]` is dropped. flowmesh-hook's pin moves to >=0.2.0
since it re-exports the breaking-changed Protocol surface. Server
requirements regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`_reconcile_resources` walked `NODE_REGISTRY` and `WORKER_REGISTRY`
but read `.node_id` / `.worker_id` off the records — those attributes
don't exist on the `Node` / `Worker` Pydantic models, both of which
expose `.id`. The mismatch crashed the FastAPI lifespan during the
startup reconcile sweep, taking the whole server down on every boot.

Caught while running the lumid plugin e2e plan against a freshly
built `FLOWMESH_VERSION=permission` image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/plugin-data-dir branch from 3c6fcd7 to e4a7b74 Compare May 22, 2026 14:45
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant