chore: add monorepo workspace skeleton (no behavior change) #89

Open: nicklamonov wants to merge 3 commits into master from chore/monorepo-skeleton

nicklamonov (Collaborator) commented May 29, 2026

Summary

This is PR 1 of the monorepo migration tracked in epic #90. The migration folds the URL-to-Markdown actor (currently the standalone apify/page-scraper repo) into this repo as a sibling actor that shares the scraping engine in-process instead of over HTTP.

This PR adds the workspace scaffolding only: RAG Web Browser's source layout, build, and runtime behavior are unchanged.

Tracking: #90

What changes

  • package.json — adds "private": true, "packageManager": "pnpm@10.33.4" + a devEngines.packageManager block, and lerna + turbo as devDependencies. Workspaces are declared in pnpm-workspace.yaml, not the npm workspaces field. patch-package stays in dependencies (see Tooling note).
  • pnpm-workspace.yaml — workspace globs (packages/*, packages/actors/*), nodeLinker: hoisted, and onlyBuiltDependencies (esbuild, playwright). Mirrors apify/actor-scraper.
  • pnpm-lock.yaml — replaces package-lock.json.
  • lerna.json — independent per-package versioning, conventional commits, GitHub releases, npmClient: pnpm. Mirrors apify/actor-scraper.
  • turbo.json — standard build/test/lint/clean task graph (dependsOn: ["^build"], outputs dist/**).
  • tsconfig.base.json — shared base config that future workspace packages will extend. RAG's own tsconfig.json is untouched.
  • packages/.gitkeep — placeholder for the (currently empty) workspace dir.
  • .gitignore — ignore .turbo cache.
  • .github/workflows/checks.yml — pin Node to 22 (was 'latest', which now resolves to Node 26) and switch the package manager from npm to pnpm (install via apify/actions/pnpm-install, build/lint/test via pnpm), plus concurrency + cancel-in-progress. Fixes a CI hang: Playwright 1.46.0's installer only supports Node 18/20/22, and on Node 26 its post-download unzip step stalls silently until the run is cancelled. Node 22 also matches the production base image (apify/actor-node-playwright-firefox:22-*).
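
For orientation, the pnpm-workspace.yaml described in the list above might look roughly like the sketch below. It is reconstructed from the bullet text only (the two globs, nodeLinker: hoisted, and the two onlyBuiltDependencies entries); the quoting, ordering, and comments are assumptions, and the file in the diff is authoritative.

```yaml
# pnpm-workspace.yaml (illustrative sketch reconstructed from the PR description)

# Workspace globs: shared packages plus one directory per actor
packages:
  - 'packages/*'
  - 'packages/actors/*'

# Flat node_modules layout without symlinks, similar to what npm produces
nodeLinker: hoisted

# pnpm 10 does not run dependency install scripts unless they are allowed here
onlyBuiltDependencies:
  - esbuild
  - playwright
```

Because pnpm reads workspace membership from this file, package.json needs no npm-style workspaces field.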

What is not changed

  • RAG's src/, .actor/, Dockerfile, tsconfig.json, build scripts, tests.
  • The current npm run build / npm run start:dev / apify push flows.
  • The production actor still builds with npm inside Docker (the Apify base image is npm-based, same as apify/actor-scraper); pnpm is only the dev/CI/workspace layer.

Verification

  • pnpm install completes (1119 packages); patch-package applies the playwright-core@1.46.0 patch under both pnpm and npm.
  • pnpm run build (tsc) and pnpm run lint succeed.
  • turbo run build runs cleanly with "0 packages in scope" — expected since no workspace packages exist yet.
  • 9 of 11 tests pass locally (10 of 11 when run single-threaded). The remaining failures are a browser-launch timeout / vitest file-parallelism flake on the dev machine (the Playwright crawler test passes in isolation). They are identical to the pre-migration npm baseline, so they are not caused by pnpm.
  • Production parity: a full docker build of .actor/Dockerfile succeeds, and the Firefox playwright-core patch is confirmed present in the final image (applied by patch-package's postinstall during the image's npm install).
  • Workflow: actionlint passes clean; an act dry-run resolves the full job graph, including the apify/actions/pnpm-install composite (which runs pnpm install and caches by pnpm-lock.yaml hash).
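
As a rough picture of the workflow shape validated above, a checks.yml along these lines would pin Node, install through the composite action, and run the three tasks via pnpm. This is a sketch under assumptions, not the file from the diff: the workflow name, trigger, job id, action version tags, the @main ref on apify/actions/pnpm-install, and the build/lint/test script names are all hypothetical.

```yaml
# .github/workflows/checks.yml (illustrative sketch; names, triggers, and refs are assumptions)
name: Checks

on:
  pull_request:

# Cancel a superseded run of this workflow on the same ref
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build-lint-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Pinned instead of 'latest' so the runner stays inside the
      # Node range that Playwright 1.46.0's installer supports
      - uses: actions/setup-node@v4
        with:
          node-version: 22

      # Composite action named in the PR; the ref and the lack of inputs are assumed
      - uses: apify/actions/pnpm-install@main

      - run: pnpm run build
      - run: pnpm run lint
      - run: pnpm run test
```

Pinning the major version rather than 'latest' trades automatic upgrades for reproducibility, which is the drift this PR is fixing.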

Upcoming PRs (tracked in #90)

  • PR 2 — relocate RAG's src/ and .actor/ into packages/actors/rag-web-browser/.
  • PR 3 — extract the shared scraping engine into packages/scraping-engine/.
  • PR 4 — add packages/actors/url-to-markdown/ consuming the engine.
  • PR 5 — switch CI to a matrix push job for both actors.
  • PR 6 — point the new url-to-markdown actor's source at this repo.
  • PR 7 — deprecate the old standalone apify/page-scraper repo + actor.

Tooling note

Matches apify/actor-scraper's stack — pnpm workspaces + Lerna (independent versioning) + Turbo. (An earlier draft of this plan said npm "to mirror the reference"; that was a misread — the reference monorepo is on pnpm [pnpm-lock.yaml, packageManager: pnpm@10.33.4], so this PR uses pnpm.)

The patch on playwright-core is intentionally kept on patch-package rather than migrated to pnpm's native patchedDependencies: the production actor image builds with npm, which does not understand pnpm patches — a native migration silently dropped the Firefox patch from the prod image. patch-package's postinstall runs under both npm (Docker) and pnpm (dev/CI), so both paths apply the patch.

🤖 Generated with Claude Code

This is PR 1 of the planned migration to host the URL-to-Markdown
actor (formerly apify/page-scraper) as a sibling actor in this repo.

Adds the workspace scaffolding only: RAG Web Browser's source layout,
build, and runtime behavior are unchanged in this PR. Subsequent PRs
will:

  - PR 2: relocate RAG's src/ and .actor/ into packages/actors/rag-web-browser/
  - PR 3: extract the shared scraping engine into packages/scraping-engine/
  - PR 4: add packages/actors/url-to-markdown/ consuming the engine
  - PR 5: switch CI to a matrix push for both actors

What this PR changes:

  - package.json: add "private": true, "workspaces": ["packages/*",
    "packages/actors/*"], "packageManager": "npm@10.9.2", and lerna +
    turbo as devDependencies.
  - lerna.json: independent versioning, conventional commits, github
    releases (matching apify/actor-scraper's setup).
  - turbo.json: build / test / lint / clean tasks with the standard
    dependsOn:["^build"] graph and dist/** outputs.
  - tsconfig.base.json: shared base config (extends @apify/tsconfig)
    that future workspace packages will extend. RAG's own tsconfig.json
    is unchanged.
  - packages/.gitkeep: placeholder so the empty workspace dir is tracked.
  - .gitignore: ignore .turbo cache.

Verification:

  - npm install completes (1159 packages, patch-package runs).
  - npm run build (tsc) succeeds.
  - npx turbo run build runs cleanly with "0 packages in scope" (as
    expected — no workspace packages exist yet).
  - Non-Playwright tests pass (9/11). The 2 Playwright tests fail
    locally only because Playwright browsers aren't installed; this is
    independent of the workspace changes.

Tooling note: matches apify/actor-scraper's stack exactly — npm
workspaces + Lerna (independent versioning) + Turbo. The earlier draft
plan referenced pnpm; npm is the right call to mirror the reference
monorepo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions bot added this to the "141st sprint - Tooling team" milestone on May 29, 2026
github-actions bot added the t-tooling label (Issues with this label are in the ownership of the tooling team.) on May 29, 2026
nicklamonov added the adhoc label (Ad-hoc unplanned task added during the sprint.) on May 29, 2026
nicklamonov removed this from the "141st sprint - Tooling team" milestone on May 29, 2026
github-actions bot added this to the "141st sprint - Tooling team" milestone on May 29, 2026
nicklamonov force-pushed the chore/monorepo-skeleton branch from 8fc4ad0 to 00e585c on May 29, 2026 12:25

The previous `node-version: 'latest'` resolved to Node 26.2.0 on
current ubuntu-latest runners. Playwright 1.46.0's installer was
released August 2024 and only officially supports Node 18 / 20 / 22 —
on Node 26 its post-download `unzip` step hangs silently with no
progress output, causing the CI step to time out.

Pinning to Node 22:
- Inside Playwright 1.46.0's supported matrix
- Still a supported Node LTS line
- Matches the production base image (apify/actor-node-playwright-firefox:22-*)

Master's last successful CI run on 2026-05-01 happened to land on a
Node version that worked with Playwright; the implicit `'latest'`
pointer rolled over to Node 26 since then. This pin fixes that drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nicklamonov marked this pull request as ready for review on May 29, 2026 13:16

Mirror the reference monorepo (apify/actor-scraper), which uses
pnpm + Lerna + Turbo — not npm. The earlier "corrected to npm to mirror
the reference" note was based on a misread; the reference is on pnpm.

- package.json: drop the npm `workspaces` field (pnpm reads
  pnpm-workspace.yaml), set packageManager to pnpm@10.33.4, add the
  devEngines block.
- pnpm-workspace.yaml: workspace globs + nodeLinker: hoisted +
  onlyBuiltDependencies (esbuild, playwright), matching actor-scraper.
- Regenerate the lockfile (package-lock.json -> pnpm-lock.yaml).
- lerna.json: npmClient: pnpm.
- checks.yml: install via apify/actions/pnpm-install, run via pnpm,
  add concurrency/cancel-in-progress. Node stays pinned to 22.

patch-package is intentionally kept (not migrated to pnpm's native
patchedDependencies): the production actor image builds with npm, which
does not understand pnpm patches, so a native migration silently dropped
the playwright-core Firefox patch from the prod image. patch-package's
postinstall runs under both npm and pnpm.

Verified: pnpm build/lint/test green; test results match the npm baseline
(9/11 local, browser-dependent failures unchanged); full docker build
succeeds and the Firefox patch is present in the final image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Labels

adhoc: Ad-hoc unplanned task added during the sprint.
t-tooling: Issues with this label are in the ownership of the tooling team.
