feat: modernize langchain integration native components #29

Draft
daveomri wants to merge 44 commits into feat/modernize-langchain-integration from feat/modernize-langchain-integration-native-components

@daveomri
Collaborator

Summary

Second PR on top of feat/modernize-langchain-integration. It builds on the core tools PR and adds the LangChain-native components layer: two Actor-specific tools, a search retriever, and a crawl loader. Upcoming PRs will add social media tools and documentation to feat/modernize-langchain-integration before everything is merged to main.

New code: ~690 lines - Tests: ~545 lines

Note on scope: This PR is intentionally scoped to the four native LangChain components (two BaseTool subclasses for search and crawl, one BaseRetriever, one BaseLoader). Social-media and scraping Actor tools, docs, and example notebooks will land in follow-up PRs.


  • ApifyGoogleSearchTool (_actor_tools.py)
    • Wraps apify/google-search-scraper behind a simplified, LLM-friendly interface. Returns a JSON array of {title, url, description} objects. Inherits _ApifyGenericTool's safety clamping and handle_tool_error=True.
  • ApifyWebCrawlerTool (_actor_tools.py)
    • Wraps apify/website-content-crawler. Returns a JSON array of {url, title, content (markdown)} objects with configurable max_crawl_pages, max_crawl_depth, and crawler_type. Reuses _clamp_timeout / _clamp_items for safety ceilings.
  • ApifySearchRetriever (retrievers.py)
    • New BaseRetriever backed by apify/rag-web-browser. Provides both _get_relevant_documents (sync) and _aget_relevant_documents (async) via ApifyClient / ApifyClientAsync. Yields Document objects with source and title metadata, ready to drop into any LangChain RAG pipeline. Actor-run logs are suppressed via logger=None.
  • ApifyCrawlLoader (document_loaders.py)
    • New BaseLoader that wraps ApifyToolsClient.crawl_website and maps each crawled page to a Document with source, title, and crawl_depth metadata. Supports both load() and lazy_load().
  • APIFY_ACTOR_TOOLS convenience list
    • New list[type] exported alongside APIFY_CORE_TOOLS for selective agent binding: [ApifyGoogleSearchTool, ApifyWebCrawlerTool].
  • ApifyToolsClient additions (_client.py)
    • Three new methods powering the native components: google_search, crawl_website, and rag_web_search. All reuse the existing run_actor_and_get_items + _list_items_or_raise plumbing, so timeout/memory/dataset error handling is consistent with core tools.
  • Backward compatible
    • No changes to the public API of any pre-existing class.
  • Tests
    • New unit tests for every new component: test_actor_tools.py (~184 lines), test_retrievers.py (~224 lines, sync + async), expanded test_document_loaders.py (+139 lines covering ApifyCrawlLoader), and test_client.py (+151 lines for the three new client methods). Error scenarios covered: missing token, Actor run failure, network error, empty / missing-metadata results, markdown vs. text fallback, async path.
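
To make the clamping and the item-to-Document mapping described above concrete, here is a minimal, stdlib-only sketch. It is not the code in this PR: `Doc` is a stand-in for LangChain's Document, `MAX_ITEMS_CEILING`, `clamp_items`, and `crawl_items_to_docs` are hypothetical names, and the raw item shape (`url`, `markdown`, `text`, `metadata.title`, `crawl.depth`) is an assumption about the crawler output. Skipping pages with no content is also a choice made for this sketch only.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical safety ceiling; the real value lives in the PR, not here.
MAX_ITEMS_CEILING = 100


@dataclass
class Doc:
    """Stdlib stand-in for a LangChain Document (illustration only)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def clamp_items(requested: int, ceiling: int = MAX_ITEMS_CEILING) -> int:
    """Keep a caller-supplied limit inside the range [1, ceiling]."""
    return max(1, min(requested, ceiling))


def crawl_items_to_docs(items: list[dict[str, Any]], max_items: int = 10) -> list[Doc]:
    """Map raw crawl items to Doc objects, preferring markdown over plain text."""
    docs: list[Doc] = []
    for item in items[: clamp_items(max_items)]:
        # Markdown first, plain text as the fallback.
        content = item.get("markdown") or item.get("text") or ""
        if not content:
            continue  # assumption for this sketch: empty pages are dropped
        # Tolerate items whose nested metadata is missing or None.
        page_meta = item.get("metadata") or {}
        crawl_meta = item.get("crawl") or {}
        docs.append(
            Doc(
                page_content=content,
                metadata={
                    "source": item.get("url", ""),
                    "title": page_meta.get("title", ""),
                    "crawl_depth": crawl_meta.get("depth"),
                },
            )
        )
    return docs
```

The point of the sketch is the ordering: clamp the requested limit first, then map, so an oversized request cannot inflate the result.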

Review strategy

Suggested reading order:

  1. _client.py: the three new ApifyToolsClient methods (google_search, crawl_website, rag_web_search); each follows the same pattern as the core-tools methods
  2. _actor_tools.py: the two _ApifyGenericTool subclasses (they share the same structure, so once one makes sense the other reads quickly)
  3. retrievers.py: ApifySearchRetriever with its sync/async pair and the shared _items_to_documents helper
  4. document_loaders.py: the new ApifyCrawlLoader alongside the existing ApifyDatasetLoader
  5. Tests last: grouped by the module they cover
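
For reviewers of step 3, the shape to look for is a sync/async pair that differs only in how items are fetched and shares one mapping helper. The sketch below illustrates that pattern under stated assumptions: `SketchSearchRetriever`, `fetch`, and `afetch` are hypothetical names, plain dicts stand in for Document objects, and the item shape (`markdown`, `text`, `metadata.url`, `metadata.title`) is assumed rather than taken from the PR.

```python
import asyncio
from typing import Any, Awaitable, Callable

Item = dict[str, Any]


def items_to_documents(items: list[Item]) -> list[dict[str, Any]]:
    """Shared mapping step used by both the sync and the async path."""
    docs = []
    for item in items:
        meta = item.get("metadata") or {}
        docs.append(
            {
                "page_content": item.get("markdown") or item.get("text") or "",
                "metadata": {
                    "source": meta.get("url", ""),
                    "title": meta.get("title", ""),
                },
            }
        )
    return docs


class SketchSearchRetriever:
    """Illustrative retriever: fetching is injected, mapping is shared."""

    def __init__(
        self,
        fetch: Callable[[str], list[Item]],
        afetch: Callable[[str], Awaitable[list[Item]]],
    ) -> None:
        self._fetch = fetch
        self._afetch = afetch

    def get_relevant_documents(self, query: str) -> list[dict[str, Any]]:
        # Sync path: blocking fetch, then the shared helper.
        return items_to_documents(self._fetch(query))

    async def aget_relevant_documents(self, query: str) -> list[dict[str, Any]]:
        # Async path: awaitable fetch, then the same helper.
        return items_to_documents(await self._afetch(query))
```

Because both paths go through the same helper, a test can assert that they return identical documents for identical items, which is one way to keep the two from drifting apart.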

Merge strategy

This PR targets feat/modernize-langchain-integration, not main. It depends on the core tools PR being merged first, because _actor_tools.py subclasses _ApifyGenericTool and the loader relies on ApifyToolsClient. Once the core tools PR is merged into the integration branch, this PR will be rebased and opened for review. Social-media tools and docs will follow as separate PRs on the same integration branch before the final merge to main.

daveomri added 30 commits April 20, 2026 16:12
…and maintability; update test cases for better formatting and error handling
@daveomri daveomri self-assigned this Apr 24, 2026
@daveomri daveomri changed the title Feat: modernize langchain integration native components feat: modernize langchain integration native components Apr 24, 2026