Feat/modernize langchain integration crawl tools#31

Draft
daveomri wants to merge 78 commits into feat/modernize-langchain-integration from feat/modernize-langchain-integration-crawl-tools

@daveomri
Collaborator

Summary

Third PR on top of feat/modernize-langchain-integration. It builds on the [native components PR](https://github.com/apify/langchain-apify/tree/feat/modernize-langchain-integration-native-components) and adds the Search & Crawling Actor tools layer: four new BaseTool subclasses wrapping search, maps, video, and e-commerce Actors. An upcoming PR will fold this together with social-media tools and documentation onto [feat/modernize-langchain-integration](https://github.com/apify/langchain-apify/tree/feat/modernize-langchain-integration) before merging to main.

New code: ~426 lines
Tests: ~337 lines

Note on scope: This PR is intentionally scoped to the four Search & Crawling Actor tools called out in US-4 (RAG Web Browser, Google Maps, YouTube, E-commerce). Social-media Actor tools and the integration documentation will land as follow-up PRs.


  • ApifyRAGWebBrowserTool ([_actor_tools.py](langchain_apify/_actor_tools.py))
    • Wraps apify/rag-web-browser. Returns JSON with run metadata (run_id / status / dataset_id / timestamps) and items (crawled-page dicts). Distinct from ApifySearchRetriever (which returns LangChain Document objects); this tool returns JSON for agent tool-calling.
  • ApifyGoogleMapsTool ([_actor_tools.py](langchain_apify/_actor_tools.py))
    • Wraps compass/crawler-google-places. Required query, optional max_results (default 10) and language (ISO code). Returns JSON with run metadata and items (place dicts).
  • ApifyYouTubeScraperTool ([_actor_tools.py](langchain_apify/_actor_tools.py))
    • Wraps streamers/youtube-scraper. Required search_query, optional search_type: Literal['search', 'video', 'channel'] (default search), max_results (default 10). Tight Literal at the LLM boundary, loose str + runtime ValueError at the _client.py boundary so direct callers get the same protection.
  • ApifyEcommerceScraperTool ([_actor_tools.py](langchain_apify/_actor_tools.py))
    • Wraps apify/e-commerce-scraping-tool. Required url, optional max_results (default 20). Bare-URL design intentionally keeps the LLM-facing surface minimal; selector hints can be added later if real users hit empty-result issues.
  • APIFY_SEARCH_TOOLS convenience list
    • New list[type[BaseTool]] exported alongside APIFY_CORE_TOOLS and APIFY_ACTOR_TOOLS for selective agent binding: [ApifyRAGWebBrowserTool, ApifyGoogleMapsTool, ApifyYouTubeScraperTool, ApifyEcommerceScraperTool].
  • ApifyToolsClient additions ([_client.py](langchain_apify/_client.py))
    • Three new methods (google_maps_search, youtube_scrape, ecommerce_scrape) and one rename + signature change: rag_web_search → rag_web_browser_search, now returning (run, items) like the other helpers so the tool layer can build _run_meta(run). All four reuse the existing run_actor_and_get_items plumbing, so transport-error wrapping and _check_run_status come for free.
  • ApifySearchRetriever ([retrievers.py](langchain_apify/retrievers.py))
    • Single call site updated to consume the new tuple return (_, items = self._client.rag_web_browser_search(...)); behaviour and Document shape are unchanged.
  • Backward compatible
    • No changes to the public API of any pre-existing class. ApifyActorsTool / ApifyDatasetLoader / ApifyWrapper are untouched. The rag_web_search rename is internal: only the retriever consumed it, and that is updated in-tree.
  • Tests
    • 35 new unit tests covering: input-mapping per helper (asserts Actor ID + run_input keys), youtube_scrape enum validation, happy-path JSON shape per tool, parametrized _TOOL_INVOCATIONS battery covering RuntimeError → ToolException, empty-dataset, handle_tool_error=True swallow, missing-token, plus inheritance / metadata / APIFY_SEARCH_TOOLS membership. Existing test_retrievers.py tests rewired for the new tuple-return helper.
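
The "tight Literal at the LLM boundary, loose str + runtime ValueError at the client boundary" split described for ApifyYouTubeScraperTool can be sketched roughly as below. This is a minimal, stdlib-only illustration and not the code in this PR: the `SearchType` alias, the `build_youtube_run_input` helper, and the run-input key names are all hypothetical. In the tool's input schema the same alias would presumably annotate the `search_type` field so the LLM only ever sees the three allowed values.

```python
from typing import Any, Literal, get_args

# Hypothetical alias: the PR lists these three values but does not name the type.
SearchType = Literal["search", "video", "channel"]


def build_youtube_run_input(
    search_query: str,
    search_type: str = "search",  # loose str on purpose: direct callers may pass anything
    max_results: int = 10,
) -> dict[str, Any]:
    """Client-boundary sketch: validate the loose str against the Literal at runtime."""
    allowed = get_args(SearchType)  # ("search", "video", "channel")
    if search_type not in allowed:
        raise ValueError(
            f"search_type must be one of {allowed}, got {search_type!r}"
        )
    # Key names are illustrative only, not the Actor's real input schema.
    return {
        "query": search_query,
        "type": search_type,
        "limit": max_results,
    }
```

Deriving the allowed values from the Literal with `get_args` keeps the two boundaries from drifting apart: adding a value to the alias updates both the schema and the runtime check.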

Review strategy

Suggested reading order:

  1. [_client.py](langchain_apify/_client.py): the three new ApifyToolsClient methods plus the rag_web_search → rag_web_browser_search rename — each follows the same pattern as the native-components methods.
  2. [_actor_tools.py](langchain_apify/_actor_tools.py): the four new _ApifyGenericTool subclasses and their input schemas (homogeneous; once one clicks the rest read fast).
  3. [retrievers.py](langchain_apify/retrievers.py): tiny diff for the tuple-unpack at the rag_web_browser_search call site.
  4. __init__.py: new APIFY_SEARCH_TOOLS list and __all__ additions.
  5. Tests last: grouped by the module they cover.
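
For reviewers, the (run, items) tuple-return pattern touched in steps 1 to 3 can be pictured with the stand-in sketch below. It is an assumption-laden illustration, not the PR's implementation: `run_meta`, `fake_search`, `tool_output`, and `retriever_items` are hypothetical names, the run-dict keys are assumed, and the top-level `"run"` / `"items"` JSON keys are a guess at the shape described in the summary.

```python
import json
from typing import Any


def run_meta(run: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for _run_meta: keep only the metadata fields the PR names."""
    return {
        "run_id": run.get("id"),
        "status": run.get("status"),
        "dataset_id": run.get("defaultDatasetId"),
        "started_at": run.get("startedAt"),
        "finished_at": run.get("finishedAt"),
    }


def fake_search(query: str) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Stand-in for a client helper that returns (run, items) instead of items alone."""
    run = {"id": "run-1", "status": "SUCCEEDED", "defaultDatasetId": "ds-1"}
    items = [{"url": "https://example.com", "text": f"result for {query}"}]
    return run, items


def tool_output(query: str) -> str:
    """Tool layer: uses both halves of the tuple to build the JSON string."""
    run, items = fake_search(query)
    return json.dumps({"run": run_meta(run), "items": items})


def retriever_items(query: str) -> list[dict[str, Any]]:
    """Retriever layer: discards the run and keeps the items, as in `_, items = ...`."""
    _, items = fake_search(query)
    return items
```

Returning the run alongside the items lets the tool layer attach metadata without a second API call, while callers that only need items pay one extra `_` in the unpack.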

Merge strategy

This PR targets feat/modernize-langchain-integration, not main. It depends on the [native components PR](https://github.com/apify/langchain-apify/tree/feat/modernize-langchain-integration-native-components) being merged first — _actor_tools.py extends the file introduced there and consumes _run_meta / _ApifyGenericTool from tools.py. Once native components is merged into the integration branch, this PR will be rebased and opened for review. Social-media tools and docs will follow as separate PRs on the same integration branch before the final merge to main.

daveomri added 30 commits April 20, 2026 16:12
…and maintainability; update test cases for better formatting and error handling
daveomri added 30 commits April 28, 2026 10:40
…eat/modernize-langchain-integration-native-components
…eat/modernize-langchain-integration-native-components
…ser_search, google_maps_search, youtube_scrape, ecommerce_scrape)