Cut import flowfile time ~2.4x by deferring heavy imports #563

Open
Edwardvaneechoud wants to merge 5 commits into main from claude/flowfile-import-optimization-rirha6
Edwardvaneechoud (Owner) commented Jul 2, 2026

Summary

Makes the Python package much faster to import. import flowfile_frame drops from ~2.2s on main to ~0.93s (the earlier improvement/flowfile-smaller-to-import work, rebased in here, had gotten it to ~1.48s).

This PR contains that original commit rebased onto latest main, plus a second round of import-weight reduction.

Changes

Rebase of improvement/flowfile-smaller-to-import onto main

  • FlowfileHTTPException (fastapi-free) now composes with main's new validate_connection, catalog object-storage storage_options, and RBAC changes.
  • DB init/migration stays deferred to first DB access (ensure_db_initialized) instead of running on import flowfile_core.

FastAPI regression fix

  • New catalog notebook code had pulled fastapi back onto the frame import path via fileExplorer/funcs.py; its path-security helpers now raise FlowfileHTTPException (catch sites and tests updated).
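
For readers unfamiliar with the pattern, here is a rough sketch of a framework-free HTTP-style exception raised from a path-security helper. The constructor fields (`status_code`, `detail`) are an assumption modeled on fastapi's `HTTPException`; `resolve_inside` and `to_error_response` are hypothetical names, not functions from this repository.

```python
import tempfile
from pathlib import Path


class FlowfileHTTPException(Exception):
    """Framework-free error carrying an HTTP status (fields are assumed)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def resolve_inside(root: str, user_path: str) -> Path:
    """Hypothetical path-security helper: reject paths that escape `root`."""
    base = Path(root).resolve()
    target = (base / user_path).resolve()
    if not target.is_relative_to(base):
        raise FlowfileHTTPException(403, "Access denied")
    return target


def to_error_response(exc: FlowfileHTTPException) -> tuple[int, dict[str, str]]:
    """What a server-side handler could turn into a JSON error response."""
    return exc.status_code, {"detail": exc.detail}


try:
    resolve_inside(tempfile.gettempdir(), "../outside.txt")
except FlowfileHTTPException as exc:
    print(to_error_response(exc))
```

Because the helper depends only on the standard library, importing it does not pull fastapi; the web layer can translate the exception into a response in one registered handler.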

Newly deferred off the import path

  • pyarrow (~95ms): utils/arrow_reader, flow_data_engine, flow_graph, flow_node/models+state, catalog/delta_utils, kafka consumer
  • openpyxl + numpy (~160ms): read_excel_tables now loads only on the excel read/schema paths
  • requests + websockets (~90ms): subprocess_operations (lazy module proxy — every use is a worker round-trip), streaming, sample_users
  • catalog service/schema stack (~150ms of pydantic model building): flowfile_core.catalog resolves CatalogService via PEP 562 __getattr__; flow_graph and flowfile_frame.catalog_reference import it at call sites
  • docker + httpx: kernel package no longer imports manager.py eagerly
  • yaml: flow save/open, notebook store
  • confluent_kafka: shared.kafka package init is now lazy; flow_graph imports kafka helpers inside add_kafka_source
  • cryptography (Fernet/HKDF), passlib (unused settings.PWD_CONTEXT, still resolvable lazily), fastexcel, legacy .flowfile migration schemas
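
As an illustration of the "lazy module proxy" idea mentioned for requests, the sketch below defers a module import until the first attribute access. It is not the `_LazyRequests` implementation from this PR; the `LazyModule` class is hypothetical, and the stdlib `colorsys` module stands in for a heavy third-party dependency.

```python
import importlib
import sys
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    def __getattr__(self, attr: str) -> Any:
        # Only reached for names not found on the proxy itself,
        # so _name and _module never recurse into here.
        return getattr(self._load(), attr)


sys.modules.pop("colorsys", None)  # make the demo deterministic
colorsys = LazyModule("colorsys")  # stand-in for a heavy dependency

print("colorsys" in sys.modules)  # nothing imported yet
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)  # first use triggers the import
print("colorsys" in sys.modules, hsv)
```

Call sites keep writing `module.function(...)` unchanged, which is why a proxy suits modules that are used in many places but only on rarely taken paths.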

Test hardening

  • test_lazy_imports.py banned-module list grows from 8 to 17 so regressions fail CI.
  • Tests updated for the new exception type and lazy patch targets.
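
A banned-module check of this kind can be sketched as follows. This is an assumed shape, not the contents of test_lazy_imports.py: `imported_banned_modules` is a hypothetical helper, and the stdlib modules `csv` and `wave` stand in for the package under test and the banned list.

```python
import json
import subprocess
import sys


def imported_banned_modules(package: str, banned: tuple[str, ...]) -> list[str]:
    """Import `package` in a fresh interpreter and report banned modules it loaded."""
    code = (
        "import importlib, json, sys\n"
        f"importlib.import_module({package!r})\n"
        f"banned = {list(banned)!r}\n"
        "print(json.dumps([m for m in banned if m in sys.modules]))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last line in case the package prints something on import.
    return json.loads(result.stdout.strip().splitlines()[-1])


# Demo with stdlib modules: importing csv loads csv but not wave.
print(imported_banned_modules("csv", ("csv", "wave")))
```

Running the import in a subprocess matters: in the test process itself, other tests may already have imported the banned modules, which would make the check fail for unrelated reasons.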

Verification

  • Full flowfile_core suite: 4745 passed (remaining failures are pre-existing/environment-specific: one test assumes a non-root home dir, three project-route tests are ordering flakes that pass in isolation and on the base commit).
  • frame (599), worker+shared (369), scheduler (13), CLI (21) suites pass; make check_stubs in sync; ruff matches main's baseline.
  • Smoke-tested every deferred path end-to-end: collect, YAML save/open, secret encrypt/decrypt roundtrip, arrow reads, kernel/catalog lazy resolution, and the server app builds with all 352 routes.

Remaining import cost (structural, not addressed)

polars (~125ms), sqlalchemy (~135ms, model declarations), the input_schema pydantic build (~115ms), and third-party plugin inits (pl_fuzzy_frame_match etc.).

🤖 Generated with Claude Code

https://claude.ai/code/session_01NH1z1i8P9uXhY8KAL8QvN2

Edwardvaneechoud and others added 3 commits July 1, 2026 21:48
Follow-up to the initial import-weight work, after rebasing it onto main:

- fileExplorer/funcs.py raises FlowfileHTTPException instead of fastapi's
  HTTPException (new catalog notebook code had pulled fastapi back onto
  the frame import path); catch sites and tests updated.
- Defer pyarrow (arrow_reader, flow_data_engine, flow_graph, flow_node
  models/state, delta_utils), openpyxl+numpy (read_excel_tables via
  create/funcs and flow_graph), requests + websockets (subprocess
  operations, sample_users), yaml (flow_graph save, io_flowfile,
  notebook_store), docker + httpx (kernel manager via kernel package
  and execution), confluent_kafka (shared.kafka package init and
  flow_graph), cryptography (secret_manager, auth/secrets), passlib
  (settings.PWD_CONTEXT, now lazy and unused in-repo), fastexcel and
  the legacy pickle schema map.
- Resolve CatalogService lazily: catalog package __getattr__ plus local
  imports in flow_graph and flowfile_frame catalog_reference, keeping
  the catalog schema/serializer pydantic build (~150ms) off the frame
  import path.
- Extend the test_lazy_imports contract with the newly banned modules.

import flowfile_frame: ~2.2s on main, ~1.5s after the first pass,
~0.93s now (python 3.11, warm cache).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NH1z1i8P9uXhY8KAL8QvN2

flow_graph no longer holds CatalogService as a module global (it resolves
lazily from flowfile_core.catalog), so patch the package attribute instead —
same fix as test_catalog_flow_graph in the previous commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NH1z1i8P9uXhY8KAL8QvN2
@Edwardvaneechoud Edwardvaneechoud changed the title from "Refactor code generator into mixins + add project sync, notebooks, LSP" to "Cut import flowfile time ~2.4x by deferring heavy imports" Jul 2, 2026
@Edwardvaneechoud Edwardvaneechoud changed the base branch from improvement/flowfile-smaller-to-import to main July 2, 2026 06:44
netlify Bot commented Jul 2, 2026

Deploy Preview for flowfile-wasm canceled.

Name Link
🔨 Latest commit 0d75c91
🔍 Latest deploy log https://app.netlify.com/projects/flowfile-wasm/deploys/6a468abe073b74000886a57d

claude Bot commented Jul 2, 2026

Claude finished @Edwardvaneechoud's task in 4m 12s


Review of PR #563 — Cut import flowfile time ~2.4x by deferring heavy imports

Todo list

  • Read repo CLAUDE.md and package-level CLAUDE.md files
  • Review diff (git diff against the pre-merge main tip, 1c54924, which matches the PR's file list exactly)
  • Check correctness (no .collect() in core, worker memory ownership, secret owner-keying)
  • Check conventions (Polars, Pydantic v2, FastAPI, lean comments)
  • Check security (path validation, exception handling, secrets)
  • Check tests & migrations (lazy import test, stubs)
  • Post final review summary

Summary

This is a careful, well-scoped import-time optimization. I traced every deferred import site (pyarrow, yaml, requests/websockets, catalog service, kernel manager, kafka, cryptography/passlib, deltalake/boto3/gcsfs) and the mechanics are sound:

  • The new FlowfileHTTPException (flowfile_core/exceptions.py) is registered via @app.exception_handler in main.py:134, so all the call sites that swapped fastapi.HTTPException for FlowfileHTTPException (flow_graph.py, fileExplorer/funcs.py, secret_manager.py) still produce the same JSON error response through FastAPI's normal exception-handling path.
  • flowfile_core.catalog's PEP 562 __getattr__ for CatalogService correctly composes with unittest.mock.patch("flowfile_core.catalog.CatalogService") — patch sets a real attribute that shadows __getattr__, and the tests were updated accordingly (test_catalog_flow_graph.py, test_catalog_cloud_virtual.py).
  • Modules using bare (unquoted) type hints against now-deferred imports (pyarrow, CatalogService, etc.) all have from __future__ import annotations in place, so the annotations stay lazily-evaluated strings — no runtime NameErrors (checked flow_data_engine.py, flow_node/models.py, kernel/execution.py, delta_utils.py, subprocess_operations.py).
  • The _LazyRequests module-proxy in subprocess_operations.py and the ensure_db_initialized() double-checked-locking guard in database/connection.py are both correct — no thread-safety issues given import caching and GIL-protected attribute writes.
  • Places that call SessionLocal() directly instead of get_db/get_db_context (ai/audit.py, ai/metrics.py, ai/diff_routes.py, catalog/migrate_parquet_to_delta.py) were all correctly given their own ensure_db_initialized() call, since they bypass the guard that's now built into get_db/get_db_context.
  • Secret owner-keying (derive_user_key, the $ffsec$1$<user_id>$ embedding in secret_manager.py) is untouched — only the Fernet import became lazy.
  • test_lazy_imports.py is a solid regression guard: subprocess-isolated, checks both flowfile and flowfile_frame, and additionally asserts the catalog DB file isn't created as an import side effect.
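
The PEP 562 and `mock.patch` interaction noted above can be reproduced with a small self-contained sketch. The module `demo_catalog`, the class `RealCatalogService`, and `use_catalog` are hypothetical stand-ins; only the attribute name `CatalogService` mirrors the PR.

```python
import sys
import types
from unittest import mock


class RealCatalogService:
    """Stand-in for an expensive-to-import service class."""

    def ping(self) -> str:
        return "real"


def _module_getattr(name: str):
    """PEP 562 hook: called only when normal module lookup fails."""
    if name == "CatalogService":
        return RealCatalogService  # a real package would import it here
    raise AttributeError(f"module 'demo_catalog' has no attribute {name!r}")


# Build an in-memory module so the sketch needs no files on disk.
pkg = types.ModuleType("demo_catalog")
pkg.__getattr__ = _module_getattr
sys.modules["demo_catalog"] = pkg


def use_catalog() -> str:
    import demo_catalog  # call-site import instead of a module global

    return demo_catalog.CatalogService().ping()


print(use_catalog())  # real
with mock.patch("demo_catalog.CatalogService") as fake:
    # patch sets a real attribute, which shadows the __getattr__ hook
    fake.return_value.ping.return_value = "mocked"
    print(use_catalog())  # mocked
print(use_catalog())  # real
```

This only works because `use_catalog` looks the name up on the package at call time. A module that bound `CatalogService` as its own global at import would need to be patched there instead, which is the situation the updated tests moved away from.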

One documentation gap worth fixing

flowfile_core/CLAUDE.md still describes the old behavior this PR removes:

  • Line 33: "Migrations auto-run on import: importing flowfile_core.database.init_db runs database/migration.py:run_startup_migration at module level..."
  • Line 53 (Gotchas): "flowfile_core/__init__.py runs validate_setup() then init_db() on import — importing the package has side effects (DB file creation, Alembic schema migration, default-user seeding)."

Both are now false: flowfile_core/flowfile_core/__init__.py no longer calls init_db(), and initialization is deferred to ensure_db_initialized() at first DB access or explicit server startup (database/connection.py). Since this file is a primary onboarding reference for both humans and future agents working in this package, it should be updated in the same PR to reflect the new deferred-init contract. Otherwise the next person (or agent) touching flowfile_core will rely on stale, actively misleading guidance about import side effects.

No correctness, security, or worker/core boundary issues found in the diff. Nice reduction in import time, and the lazy-import regression test is a good addition to prevent backsliding.

claude Bot commented Jul 2, 2026

Claude finished @Edwardvaneechoud's task in 1m 48s


Review of PR #563 — Cut import flowfile time ~2.4x by deferring heavy imports

Todo list

  • Read repo CLAUDE.md and package-level CLAUDE.md files
  • Check whether the previously flagged flowfile_core/CLAUDE.md doc gap was fixed
  • Review new diff since last review (git diff origin/main...HEAD)
  • Check correctness, conventions, security, tests
  • Post final review summary

Summary

This "synchronize" run was triggered by a merge of main into the branch (bringing in #564, the unrelated worker deadlock fix) — I confirmed via git diff 3857ace..HEAD (the merged-in main tip) that the PR's own diff is byte-for-byte the same 51 files / +584/-170 as at the last review. No new PR-specific commits landed, so my prior review findings still stand and no new issues were introduced.

Outstanding item from the previous review, still not addressed:

flowfile_core/CLAUDE.md still documents the old on-import DB-init behavior that this PR removes:

  • Line 33: "Migrations auto-run on import: importing flowfile_core.database.init_db runs database/migration.py:run_startup_migration at module level..."
  • Line 53 (Gotchas): "flowfile_core/__init__.py runs validate_setup() then init_db() on import — importing the package has side effects (DB file creation, Alembic schema migration, default-user seeding)."

I re-verified this is now false: flowfile_core/flowfile_core/__init__.py calls only validate_setup(), not init_db(); DB init/migration is deferred to ensure_db_initialized() in flowfile_core/flowfile_core/database/connection.py, which runs lazily at first DB access or explicit server startup. Since this is the primary onboarding reference for the package, worth fixing before merge so it doesn't mislead the next contributor (human or agent).

No new correctness, security, or worker/core boundary issues found. All prior findings (FastAPI-free exception handling, CatalogService PEP 562 lazy resolution + mock-patch compatibility, from __future__ import annotations on modules with deferred-import type hints, ensure_db_initialized() correctness, secret owner-keying untouched) remain valid on this diff.
