Skip to content

feat: expose Download objects on PlaywrightCrawlingContext#3596

Open
nikitachapovskii-dev wants to merge 4 commits intomasterfrom
feat/playwright-context-downloads
Open

feat: expose Download objects on PlaywrightCrawlingContext#3596
nikitachapovskii-dev wants to merge 4 commits intomasterfrom
feat/playwright-context-downloads

Conversation

@nikitachapovskii-dev
Copy link
Copy Markdown
Contributor

When page.goto throws "Download is starting", Playwright has already captured the file — but the Download object was inaccessible to user-land code. The only workaround was re-downloading via HTTP, which breaks on sites requiring a
browser session (e.g. ECAS-gated resources on Eur-Lex).

This PR adds a downloads: Download[] array to PlaywrightCrawlingContext, populated via page.on('download', ...)
registered in _enhanceCrawlingContextWithPageInfo — before navigation, so downloads triggered by page.goto are always
captured.

Closes #3583

@github-actions github-actions Bot added this to the 139th sprint - Tooling team milestone Apr 23, 2026
@github-actions github-actions Bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Apr 23, 2026
Comment on lines +1070 to +1073
* Playwright intercepts downloads before they complete, so the objects are available
* as soon as the browser starts the download — including inside `errorHandler` when
* `page.goto` throws `"Download is starting"`.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd refrain from mentioning internals of Apify's Actors here - Crawlee is a standalone project.

* // stream to storage...
* }
* }
* },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't happen outside of WCC

Comment on lines +1075 to +1076
* > files this may be a concern; prefer re-enqueueing the URL to a streaming downloader
* > when file size is unpredictable.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 does PlaywrightCrawler store auto-downloaded files by default now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double triple-checked. Playwright itself saves downloads to a temp dir on disk (that's why download.path() and download.createReadStream() exist in the first place.
This is Playwright's own behavior, set byacceptDownloads wjich defaults to true in Playwright, and we don't override it.

const browserContext = await this.library
.launchPersistentContext(userDataDir, launchOptions)
.catch((error) => {
return this._throwOnFailedLaunch(launchContext, error);
});

if (this.launchContext.useIncognitoPages) {
// Each page requires to have all the context options applied
contextOptions = {
...this.launchContext.launchOptions,
...contextOptions,
};

https://playwright.dev/docs/api/class-browser#browser-new-context

Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you @nikitachapovskii-dev ! I just have one quick point ⬇️

Comment thread packages/playwright-crawler/src/internals/utils/playwright-utils.ts Outdated
* },
* ```
*/
downloads: Download[];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just me, but all the other utils in this interface are functions.

Any chance we can make this a (list|get)(File)?Downloads method to make the API consistent? Since this array can grow, it makes more sense to me too:

async requestHandler ({ page, listDownloads }) {
    await listDownloads() // empty array
    await page.click('download');
    await listDownloads() // 1 download
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the consistency point. I went with synchronous return (rather than Promise<Download[]>) because the operation is purely a state read so no I/O, no waiting. Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent, it's a one-line change.

While at it, I also moved the listener wiring + getDownloads registration into _enhanceCrawlingContextWithPageInfo (one closure for the array, listener and getter). Side benefit: this also fixes a latent race where download events fired during navigation would .push() into a not-yet-initialized context.downloads field.

Also added an integration test mirroring the before click → empty / after click → 1 download.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent; it's a one-line change.

I have a different point - if we ever decide to make some async action here (write stats somewhere, upload something to KVS, etc.), we'd need to make a breaking API change. Having this async from the get-go makes this, imo, easier in the long run.

The move to _enhanceCrawlingContextWithPageInfo could be a little confusing (since the rest of the utils is somewhere else), but you have a point, we might miss some downloads due to a race condition. I'm personally fine with the move 👍

@github-actions github-actions Bot added the tested Temporary label used only programatically for some analytics. label Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PlaywrightCrawler: expose Download object in errorHandler / add dedicated downloadHandler hook

4 participants