feat: expose Download objects on PlaywrightCrawlingContext#3596
feat: expose Download objects on PlaywrightCrawlingContext#3596nikitachapovskii-dev wants to merge 4 commits intomasterfrom
Conversation
| * Playwright intercepts downloads before they complete, so the objects are available | ||
| * as soon as the browser starts the download — including inside `errorHandler` when | ||
| * `page.goto` throws `"Download is starting"`. | ||
| * |
There was a problem hiding this comment.
I'd refrain from mentioning internals of Apify's Actors here - Crawlee is a standalone project.
| * // stream to storage... | ||
| * } | ||
| * } | ||
| * }, |
There was a problem hiding this comment.
This won't happen outside of WCC
| * > files this may be a concern; prefer re-enqueueing the URL to a streaming downloader | ||
| * > when file size is unpredictable. |
There was a problem hiding this comment.
🤔 does PlaywrightCrawler store auto-downloaded files by default now?
There was a problem hiding this comment.
double triple-checked. Playwright itself saves downloads to a temp dir on disk (that's why download.path() and download.createReadStream() exist in the first place.
This is Playwright's own behavior, set byacceptDownloads wjich defaults to true in Playwright, and we don't override it.
crawlee/packages/browser-pool/src/playwright/playwright-plugin.ts
Lines 122 to 126 in 1d2528b
crawlee/packages/browser-pool/src/playwright/playwright-controller.ts
Lines 51 to 56 in 1d2528b
https://playwright.dev/docs/api/class-browser#browser-new-context
barjin
left a comment
There was a problem hiding this comment.
lgtm, thank you @nikitachapovskii-dev ! I just have one quick point ⬇️
| * }, | ||
| * ``` | ||
| */ | ||
| downloads: Download[]; |
There was a problem hiding this comment.
Maybe it's just me, but all the other utils in this interface are functions.
Any chance we can make this a (list|get)(File)?Downloads method to make the API consistent? Since this array can grow, it makes more sense to me too:
async requestHandler ({ page, listDownloads }) {
await listDownloads() // empty array
await page.click('download');
await listDownloads() // 1 download
}There was a problem hiding this comment.
Agreed on the consistency point. I went with synchronous return (rather than Promise<Download[]>) because the operation is purely a state read so no I/O, no waiting. Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent, it's a one-line change.
While at it, I also moved the listener wiring + getDownloads registration into _enhanceCrawlingContextWithPageInfo (one closure for the array, listener and getter). Side benefit: this also fixes a latent race where download events fired during navigation would .push() into a not-yet-initialized context.downloads field.
Also added an integration test mirroring the before click → empty / after click → 1 download.
There was a problem hiding this comment.
Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent; it's a one-line change.
I have a different point - if we ever decide to make some async action here (write stats somewhere, upload something to KVS, etc.), we'd need to make a breaking API change. Having this async from the get-go makes this, imo, easier in the long run.
The move to _enhanceCrawlingContextWithPageInfo could be a little confusing (since the rest of the utils is somewhere else), but you have a point, we might miss some downloads due to a race condition. I'm personally fine with the move 👍
When page.goto throws "Download is starting", Playwright has already captured the file — but the Download object was inaccessible to user-land code. The only workaround was re-downloading via HTTP, which breaks on sites requiring a
browser session (e.g. ECAS-gated resources on Eur-Lex).
This PR adds a downloads: Download[] array to PlaywrightCrawlingContext, populated via page.on('download', ...)
registered in _enhanceCrawlingContextWithPageInfo — before navigation, so downloads triggered by page.goto are always
captured.
Closes #3583