Skip to content

Enhance Trainer API with HMR, early-stop, and cleanup improvements#101

Open
k-taro56 wants to merge 7 commits intomainfrom
eng-615
Open

Enhance Trainer API with HMR, early-stop, and cleanup improvements#101
k-taro56 wants to merge 7 commits intomainfrom
eng-615

Conversation

@k-taro56
Copy link
Copy Markdown
Contributor

@k-taro56 k-taro56 commented May 2, 2026

This pull request migrates the Arkor build and development workflow from esbuild to Rolldown, introduces a robust hot module replacement (HMR) system for the Studio dev server, and refines the cleanup and shutdown logic for long-lived resources. Documentation has been updated to explain the new HMR behavior in both English and Japanese, and dependency management has been adjusted accordingly. These changes streamline the developer experience, enable live code updates during training runs, and improve resource handling during shutdown.

Build system migration and HMR integration:

  • Migrated the build system from esbuild to Rolldown in arkor build, ensuring bundles target the current Node version and keeping external dependencies unbundled. (packages/arkor/package.json, packages/arkor/src/cli/commands/build.ts, packages/arkor/src/core/rolldownConfig) [1] [2] [3]
  • Added a persistent Rolldown watcher for src/arkor/ in dev mode, enabling HMR for trainer code. The watcher notifies the SPA via SSE (/api/dev/events), and the dev server can hot-swap callbacks or gracefully early-stop training runs based on config changes. (packages/arkor/src/cli/commands/dev.ts, packages/arkor/src/studio/hmr.ts, packages/arkor/src/studio/trainRegistry.ts) [1] [2] [3] [4] [5] [6]

Cleanup and shutdown improvements:

  • Introduced a reusable registerCleanupHook utility to handle resource cleanup on process exit and signals, ensuring proper teardown order (e.g., HMR watcher before token file removal). (packages/arkor/src/cli/cleanupHooks.ts, packages/arkor/src/cli/commands/dev.ts) [1] [2]

Documentation updates:

  • Updated English and Japanese docs to describe the new HMR workflow, explaining how live code changes are handled during training and how the system differentiates between callback-only and config changes. (docs/concepts/studio.mdx, docs/ja/concepts/studio.mdx, AGENTS.md) [1] [2] [3]
  • Clarified the use of Rolldown in the build process and updated package policy documentation accordingly. (AGENTS.md)

Dependency and test adjustments:

  • Removed esbuild from dependencies and added Rolldown. Updated test comments to reference Rolldown instead of esbuild. (packages/arkor/package.json, packages/arkor/src/cli/commands/start.test.ts) [1] [2]

Security and internal API notes:

  • Documented the security-sensitive allow-list for SSE token query parameters and emphasized that HMR/early-stop internals remain behind Symbol.for brands, not exposed on the public SDK surface. (AGENTS.md) [1] [2]

k-taro56 added 7 commits May 2, 2026 21:27
- Integrated Rolldown for hot module replacement (HMR) in `arkor dev`, allowing real-time updates to the training interface without page refresh.
- Implemented `requestEarlyStop` and `replaceCallbacks` methods in the Trainer API to facilitate graceful stopping of training jobs and dynamic callback updates during execution.
- Updated documentation to reflect new features and usage patterns for improved developer guidance.
- Adjusted cleanup logic for better resource management during development sessions.
…l shutdown

- Introduced `replaceCallbacks` method in the Trainer API to allow dynamic updates of lifecycle callbacks during training runs.
- Enhanced signal handling for graceful early stopping, ensuring in-flight checkpoints are preserved during HMR rebuilds.
- Added `registerCleanupHook` for streamlined resource management on process exit, improving cleanup logic across development commands.
- Updated documentation to reflect new features and usage patterns for better developer guidance.
…internal callback swapping

- Removed the public `replaceCallbacks` method from the Trainer interface to prevent exposure of the hot-swapping functionality.
- Introduced an internal mechanism for callback swapping using a `Symbol.for`-keyed brand, allowing for dynamic updates during training runs without affecting the public API.
- Updated signal handling to ensure seamless integration with the new callback swapping logic, enhancing the hot module replacement (HMR) experience.
- Revised documentation to reflect changes in the Trainer API and clarify the internal callback management process.
…lement internal early-stop handling

- Removed the public `requestEarlyStop` method from the Trainer interface to prevent exposure of the early-stop functionality.
- Introduced an internal mechanism for early stopping using a `Symbol.for`-keyed brand, allowing for graceful stopping after the next checkpoint without affecting the public API.
- Updated signal handling to ensure seamless integration with the new early-stop logic, enhancing the hot module replacement (HMR) experience.
- Revised documentation to reflect changes in the Trainer API and clarify the internal early-stop management process.
…llback replacement

- Updated the Trainer API to remove public exposure of `requestEarlyStop` and `replaceCallbacks` methods, enhancing encapsulation.
- Implemented internal mechanisms for early stopping and callback swapping using `Symbol.for`-keyed brands, ensuring seamless integration during training runs.
- Revised signal handling to improve the hot module replacement (HMR) experience and maintain clean resource management.
- Updated documentation to reflect these changes and clarify the internal management processes for developers.
- Replaced esbuild with Rolldown for building the project, ensuring external dependencies are resolved correctly.
- Implemented a cleanup hook system to manage resource disposal on process termination signals.
- Enhanced the development experience with hot module replacement (HMR) capabilities, allowing for seamless updates during training runs.
- Updated documentation to reflect changes in the development loop and HMR behavior.
- Added tests for new signal handling and configuration hashing functionality.
Copilot AI review requested due to automatic review settings May 2, 2026 17:02
@mintlify
Copy link
Copy Markdown

mintlify Bot commented May 2, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
arkor-92aeef0e 🟢 Ready View Preview May 2, 2026, 5:03 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

import { ensureProjectState } from "../core/projectState";
import { readState } from "../core/state";
import { readManifestSummary } from "./manifest";
import { readManifestSummary, summariseBuiltManifest } from "./manifest";
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8abc594369

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +421 to +423
const hotSwapTargets = activeTrains.notifyCallbackReload(nextHash);
const restartTargets =
activeTrains.requestEarlyStopOnMismatch(nextHash);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Dispatch train signals only once per rebuild

notifyCallbackReload/requestEarlyStopOnMismatch run inside each /api/dev/events subscriber callback, so one rebuild sends signals once per connected SSE client (and again on reconnect when createHmrCoordinator.subscribe replays lastEvent). In the common case of two open Studio tabs, the same child can receive two SIGTERMs for one config-changing rebuild; the second signal triggers the runner’s forced-exit path (exit(143)) instead of checkpoint-preserving early stop. Move signal dispatch out of per-subscriber delivery (or dedupe per rebuild hash) so each child is signaled at most once per rebuild event.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the local Studio dev experience by migrating the trainer bundling pipeline from esbuild to Rolldown, adding an HMR/SSE channel to propagate rebuild events to the Studio SPA, and introducing signal-based “hot-swap callbacks vs early-stop + restart” behavior for in-flight training subprocesses.

Changes:

  • Replace esbuild-based trainer bundling with Rolldown (one-shot build + watch/HMR).
  • Add /api/dev/events SSE stream + SPA client wiring to refresh manifest and coordinate restarts/hot-swaps.
  • Introduce internal trainer inspection/callback replacement/early-stop hooks (via Symbol.for brands), plus cleanup utilities and updated docs/tests.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pnpm-lock.yaml Removes esbuild and adds rolldown to the lockfile graph.
packages/arkor/package.json Swaps dependency from esbuild to rolldown.
packages/arkor/src/core/rolldownConfig.ts Centralizes Rolldown input options + entry/outDir resolution and Node target derivation.
packages/arkor/src/cli/commands/build.ts Implements arkor build via Rolldown bundling instead of esbuild.
packages/arkor/src/studio/hmr.ts Adds lazy Rolldown watcher coordinator that emits ready/rebuild/error events.
packages/arkor/src/studio/hmr.test.ts Tests HMR coordinator behavior (ready/rebuild/error/replay/dispose).
packages/arkor/src/studio/manifest.ts Adds configHash to manifest summary and splits “summarise built manifest” vs “build + summarise”.
packages/arkor/src/studio/trainRegistry.ts Adds per-child registry + policy for SIGUSR2 hot-swap vs SIGTERM early-stop.
packages/arkor/src/studio/trainRegistry.test.ts Tests TrainRegistry signaling decisions and error tolerance.
packages/arkor/src/studio/server.ts Adds /api/dev/events SSE route and integrates TrainRegistry into rebuild handling.
packages/arkor/src/studio/server.test.ts Adds tests for /api/dev/events token rules, loopback guard, and SSE framing.
packages/arkor/src/core/configHash.ts Adds stable hashing of JobConfig for “callbacks-only vs restart” decisions.
packages/arkor/src/core/configHash.test.ts Tests determinism and sensitivity of hashJobConfig.
packages/arkor/src/core/trainerInspection.ts Introduces internal inspection + callback replacement + early-stop brands via Symbol.for.
packages/arkor/src/core/trainer.ts Implements callback hot-swap support + early-stop latch; attaches internal brands.
packages/arkor/src/core/trainer.test.ts Adds coverage for early-stop and callback hot-swap behavior via internal brands.
packages/arkor/src/core/runnerSignals.ts Adds signal handlers: SIGTERM graceful early-stop + SIGUSR2 callback reload.
packages/arkor/src/core/runnerSignals.test.ts Tests SIGTERM/SIGUSR2 handler behaviors with branded trainers.
packages/arkor/src/core/runner.ts Installs/removes new signal handlers around trainer.start()/trainer.wait().
packages/arkor/src/core/runner.test.ts Adds a test for SIGTERM early-stop behavior in the runner.
packages/arkor/src/cli/cleanupHooks.ts Adds reusable cleanup-hook registration for exit/signals.
packages/arkor/src/cli/commands/dev.ts Wires HMR coordinator into Studio server and refactors shutdown cleanup via cleanup hooks.
packages/arkor/src/cli/commands/start.test.ts Updates test comment to refer to Rolldown instead of esbuild.
packages/studio-app/src/lib/api.ts Adds typed DevEvent + openDevEvents() for SSE notifications.
packages/studio-app/src/components/RunTraining.tsx SPA listens to dev SSE events, refreshes manifest, and coordinates restart/hot-swap UI state.
docs/concepts/studio.mdx Documents HMR workflow, including hash-based hot-swap vs restart behavior (EN).
docs/ja/concepts/studio.mdx Documents HMR workflow (JA).
AGENTS.md Updates repository policy/docs for new SSE allow-list, HMR internals, and Rolldown build pipeline.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

const [log, setLog] = useState("");
const [manifest, setManifest] = useState<ManifestResult | null>(null);
const [hmrStatus, setHmrStatus] = useState<
"idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped"
import { ensureProjectState } from "../core/projectState";
import { readState } from "../core/state";
import { readManifestSummary } from "./manifest";
import { readManifestSummary, summariseBuiltManifest } from "./manifest";
Comment on lines +190 to 196
function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void {
// Registered before the studio-token cleanup so it runs first on
// shutdown — Node fires signal handlers in registration order, and we
// want the watcher to release file handles before the outermost
// process.exit.
registerCleanupHook({ cleanup: () => hmr.dispose() });
}
Comment on lines +97 to +113
requestEarlyStopOnMismatch(
nextConfigHash: string | null,
): RestartTarget[] {
const targets: RestartTarget[] = [];
for (const [pid, entry] of this.entries) {
if (
nextConfigHash === null ||
entry.configHash === null ||
entry.configHash !== nextConfigHash
) {
try {
entry.child.kill("SIGTERM");
} catch {
// child already exited; close handler will clean up.
}
targets.push({ pid, trainFile: entry.trainFile });
}
Comment on lines 275 to +284
await callbacks.onCheckpoint?.(ctx);
// Early-stop latch: a checkpoint just landed, so the in-flight work
// is durable. Cancel the cloud job and end `wait()` cleanly.
if (earlyStopRequested && earlyStopDeferred) {
await trainer.cancel();
if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
earlyStopDeferred.resolve();
earlyStopDeferred = null;
return { terminal: true, artifacts: terminalResult?.artifacts ?? [] };
}
Comment on lines +129 to +137
function startWatcher(): void {
if (watcher || disposed) return;
if (!existsSync(resolved.entry)) {
broadcast({
type: "error",
message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
});
return;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants