feat(composio): migrate GitHub provider to memory_tree pipeline (#2885)#2889
Conversation
…humansai#2885) Second of the four-PR follow-up sequence in tinyhumansai#2885 (after Notion in tinyhumansai#2887). GitHub issue/PR sync now writes through `ingest_pipeline::ingest_document` → `mem_tree_chunks` instead of `MemoryClient::store_skill_sync` → UnifiedMemory `memory_docs`, so `memory.search` / `tree.read_chunk` / `tree.browse` / the agent recall path / summary trees all see GitHub items again. Mirrors the Notion shape: - `providers/github/ingest.rs` (NEW): `github_source_id`, `parse_updated_at`, `render_issue_body`, `extract_html_url`, `ingest_issue_into_memory_tree`. - Re-ingest cleanup via `delete_chunks_by_source` before each ingest, so the pipeline's content-blind `already_ingested` gate doesn't drop edited issues (same pattern as vault sync tinyhumansai#2720 and Notion tinyhumansai#2887). - `source_id = github:{connection_id}:{issue_id}` — stable across re-syncs and across connections. Works for both numeric internal IDs and the `owner/repo#number` slug fallback from `extract_issue_id`. - `source_ref` carries the canonical GitHub `html_url` so audit trails are one-click navigable from any downstream UI. - `provider.rs`: dropped `persist_single_item` + `doc_id`, threaded `updated_at` through to the new ingest fn, kept the existing composite-key dedup logic (`{issue_id}@{updated_at}`) intact. Tests (lib): 6 in `ingest::tests`, all green. - `github_source_id_is_stable_and_namespaced` — pins the contract the `delete_chunks_by_source` cleanup path relies on. - `parse_updated_at_handles_valid_and_invalid_inputs` — falls through to `Utc::now()` on bad input instead of failing the ingest. - `render_issue_body_includes_title_header_and_pretty_json` — pins the chunked-content shape; without it the retrieval body loses labels / assignees / state / comment count. - `extract_html_url_handles_both_envelope_shapes` — handles direct and `data.html_url` Composio envelope variants. - `ingest_issue_writes_to_memory_tree` — the tinyhumansai#2885 regression: asserts `count_chunks` rises and the source_id is registered in `mem_tree_ingested_sources`. - `re_ingesting_edited_issue_replaces_prior_chunks` — confirms the delete-first guard works; chunk count after re-ingest stays steady (±1 chunker rounding) instead of doubling. `persist_single_item` / `store_skill_sync` are intentionally left in place — they're still used by the remaining unmigrated providers (Linear, ClickUp) and the wider UnifiedMemory removal lives under the senamakel tinyhumansai#2585 follow-up. They become removable once Linear and ClickUp follow (also tracked in tinyhumansai#2885). Refs tinyhumansai#2885.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a GitHub-specific ingest module that converts single GitHub issue/PR payloads into canonical memory-tree DocumentInput, conditionally deletes prior chunks for idempotent re-ingest, calls ingest_pipeline::ingest_document, and updates the GitHub provider to route issue persistence through this pipeline. ChangesGitHub Ingest Pipeline
Sequence DiagramsequenceDiagram
participant Provider as GitHub Provider
participant Ingest as ingest_issue_into_memory_tree
participant Cleanup as delete_chunks_by_source
participant Pipeline as ingest_pipeline::ingest_document
participant Storage as Memory Tree Storage
Provider->>Ingest: issue payload + metadata
Ingest->>Cleanup: spawn_blocking task (is_source_ingested -> delete_chunks_by_source)
Cleanup->>Storage: delete old chunks for source_id
Cleanup-->>Ingest: deletion result
Ingest->>Ingest: build DocumentInput (title, body, modified_at, source_ref, tags)
Ingest->>Pipeline: ingest_document with DocumentInput
Pipeline->>Storage: write new chunks
Storage-->>Pipeline: chunks_written count
Pipeline-->>Ingest: success result
Ingest-->>Provider: chunks_written count
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly Related Issues
Possibly Related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Gate `delete_chunks_by_source` behind `is_source_ingested`. The delete path uses a `source_kind = ?1` scan + Rust-side source-id filter (`store::delete_chunks_by_source_filter`), so a first-time ingest of a never-seen issue would scan every Document-kind chunk just to find zero matches. `is_source_ingested` is an indexed PK lookup against `mem_tree_ingested_sources`, so the common fresh-issue case becomes one cheap `COUNT(*)` and we only pay the scan cost on actual re-ingests of edited issues. Same `spawn_blocking` hop, single combined closure — the gate runs inside the existing blocking task. The 6 ingest tests still pass: the regression test exercises the false → no-delete path (fresh issue); the re-ingest test exercises the true → delete path (edited issue). Same fix carried forward from the parallel Notion PR (tinyhumansai#2887), where CodeRabbit caught the issue first. Linear and ClickUp PRs will land with the gate baked in from the start. Refs tinyhumansai#2885.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/memory_sync/composio/providers/github/ingest.rs (1)
191-193: ⚡ Quick winPreserve the error chain with the alternate formatter.
{err}renders only the outermost message; the provider then logs this witherror = %e(Display), so any underlying cause (DB/embedding failure) is lost. The rest of the codebase uses:#for anyhow errors (e.g.,provider.rslines 90/234). Bake the chain into the message:♻️ Use the alternate (chain) formatter
- Err(err) => Err(anyhow::anyhow!( - "ingest_document failed for {source_id}: {err}" - )), + Err(err) => Err(anyhow::anyhow!( + "ingest_document failed for {source_id}: {err:#}" + )),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory_sync/composio/providers/github/ingest.rs` around lines 191 - 193, The error construction in the ingest_document path currently formats the error with Display (`{err}`) which drops the error chain; change the Err arm that builds the anyhow error so it uses the alternate (chain) formatter for `err` (i.e., use `{:#}`) when composing the message for `ingest_document failed for {source_id}` so underlying causes are preserved; update the Err(...) expression in ingest.rs where `err` and `source_id` are referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory_sync/composio/providers/github/provider.rs`:
- Around line 305-333: The ingest failure branch for
ingest_issue_into_memory_tree can cause lost edits because newest_updated is
bumped and the outer cursor is advanced even when an item failed to re-ingest;
declare a boolean flag (e.g., let mut had_ingest_failures = false) alongside
newest_updated and accumulators, set had_ingest_failures = true inside the Err
arm where you currently log the warning, and then only call
state.advance_cursor(&new_cursor) (or log and hold the cursor) when
had_ingest_failures is false; keep using state.mark_synced and synced_ids to
skip already-synced items on the next pass.
---
Nitpick comments:
In `@src/openhuman/memory_sync/composio/providers/github/ingest.rs`:
- Around line 191-193: The error construction in the ingest_document path
currently formats the error with Display (`{err}`) which drops the error chain;
change the Err arm that builds the anyhow error so it uses the alternate (chain)
formatter for `err` (i.e., use `{:#}`) when composing the message for
`ingest_document failed for {source_id}` so underlying causes are preserved;
update the Err(...) expression in ingest.rs where `err` and `source_id` are
referenced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 82cb18f9-fdbb-489b-b97f-ce021706fa7b
📒 Files selected for processing (3)
src/openhuman/memory_sync/composio/providers/github/ingest.rssrc/openhuman/memory_sync/composio/providers/github/mod.rssrc/openhuman/memory_sync/composio/providers/github/provider.rs
graycyrus
left a comment
There was a problem hiding this comment.
@justinhsu1477 hey! the code looks solid to me, but CI is still pending on a few checks (Windows secrets ACL, E2E macOS/Windows/Linux, Tauri build, Rust Core Coverage, macOS smoke). once those are green I'll come back and approve this.
one thing worth addressing independent of CI:
ingest.rs uses log::debug! while the surrounding codebase uses tracing
provider.rs logs with tracing::warn!(issue_id = %issue_id, error = %e, "...") — structured key=value fields, queryable by any tracing subscriber. ingest.rs uses log::debug!("... connection_id={} issue_id={} removed_chunks={}", ...) — positional format strings with no structured fields. Both work if the tracing-log bridge is configured, but the structured fields get lost, which defeats the point of having them.
Swapping the two log::debug! calls in ingest_issue_into_memory_tree to use tracing::debug! with the same field syntax as provider.rs keeps the pattern consistent.
Also echoing CodeRabbit's cursor-advancement finding — that one's worth fixing before merge. The rest of their nitpick on {err:#} is valid too.
Otherwise the migration is clean: github:{connection_id}:{issue_id} source_id is stable, the delete-first re-ingest guard mirrors #2887 and #2720 correctly, and the regression test suite pins all the invariants that matter.
…chain Two related correctness fixes flagged by CodeRabbit on the parallel GitHub PR (tinyhumansai#2889) — same shape exists here in Notion, so applying the fix here proactively rather than waiting for a second round. ## 1. Cursor advances on ingest failure → data-loss window `provider.rs` updated `newest_edited_time` for every result before attempting ingest, then advanced the persistent cursor unconditionally at the end of the loop. Items whose ingest call returned `Err` were logged + skipped (no `mark_synced`) — but the next sync queried `last_edited_time > {cursor}`, so the failed item was never re-fetched. Pre-tinyhumansai#2885 this was just "stale revision": the legacy `persist_single_item` path wasn't delete-first, so a failed write left old chunks alone. Post-tinyhumansai#2885 it's worse — `ingest_page_into_memory_tree` calls `delete_chunks_by_source` *before* the (failing) write, so an *edited* page that fails to re-ingest is left with neither old nor new chunks in `mem_tree_chunks` until its next edit. Fix: track `had_ingest_failures: bool`, set on every `Err`, and gate the `advance_cursor` at the bottom of the sync loop. Already-synced items are cheaply skipped via `state.is_synced` on the re-fetch next pass, so the cost of holding is just one extra search-API page on the failure range. A retry-only-failed-items fix (separate `newest_successful_updated` cursor) would be tighter but adds state and edge cases; the simple gate matches CodeRabbit's exact suggestion and is provably safe (cursor only ever advances when every item in the pass succeeded). ## 2. `{err}` drops anyhow context chain `ingest.rs` wrapped the `ingest_pipeline::ingest_document` failure as `anyhow!("ingest_document failed for {source_id}: {err}")`, where `{err}` is Display-only and strips the chain. `provider.rs` then logs the wrapped error with `tracing::warn!(error = %e)` (also Display), so the underlying DB / embedding / persist cause was lost. Fix: swap to `{err:#}` (alternate formatter) so the chain is baked into the wrapping anyhow's Display impl. Matches the convention `provider.rs` already uses on its own `ctx.execute(...).map_err(|e| ... "{e:#}")` call sites (lines 92, 207). ## Tests 27/27 still pass in `memory_sync::composio::providers::notion::*`. The provider's `sync()` is hard to unit-test without mocking `ComposioContext` heavily — the invariant is pinned by the rustdoc on `had_ingest_failures` and the inline gate comment. The ingest-side tests (regression + re-ingest) cover the ingest function's contract; the cursor logic lives entirely in `provider.rs`. Refs CodeRabbit review on tinyhumansai#2889.
…chain CodeRabbit caught two correctness issues on this PR. Same shape exists on the parallel Notion PR (tinyhumansai#2887) — fix applied there in parallel. ## 1. Cursor advances on ingest failure → data-loss window `provider.rs` updated `newest_updated` for every result before attempting ingest, then advanced the persistent cursor unconditionally at the end of the sync loop. Items whose ingest call returned `Err` were logged + skipped (no `mark_synced`) — but the next sync queried `updated:>{cursor}`, so the failed item was never re-fetched. Pre-tinyhumansai#2885 this was just "stale revision": the legacy `persist_single_item` path wasn't delete-first, so a failed write left old chunks alone. Post-tinyhumansai#2885 it's worse — `ingest_issue_into_memory_tree` calls `delete_chunks_by_source` *before* the (failing) write, so an *edited* issue that fails to re-ingest is left with neither old nor new chunks in `mem_tree_chunks` until its next edit. Fix: track `had_ingest_failures: bool`, set on every `Err`, and gate the `advance_cursor` at the bottom of the sync loop. Already-synced items are cheaply skipped via `state.is_synced` on the re-fetch next pass, so the cost of holding is just one extra search-API page on the failure range. `set_last_sync_at_ms` still fires — it's a heartbeat, not a fetch-window boundary. A retry-only-failed-items fix (separate `newest_successful_updated` cursor) would be tighter but adds state and edge cases; the simple gate matches CodeRabbit's exact suggestion and is provably safe (cursor only ever advances when every item in the pass succeeded). ## 2. `{err}` drops anyhow context chain `ingest.rs` wrapped the `ingest_pipeline::ingest_document` failure as `anyhow!("ingest_document failed for {source_id}: {err}")`, where `{err}` is Display-only and strips the chain. `provider.rs` then logs the wrapped error with `tracing::warn!(error = %e)` (also Display), so the underlying DB / embedding / persist cause was lost. Fix: swap to `{err:#}` (alternate formatter) so the chain is baked into the wrapping anyhow's Display impl. Matches the convention `provider.rs` already uses on its own `ctx.execute(...).map_err(...)` sites (lines 90, 234, 406). ## Tests 48/48 still pass in `memory_sync::composio::providers::github::*`. The provider's `sync()` is hard to unit-test without mocking `ComposioContext` heavily — the invariant is pinned by the rustdoc on `had_ingest_failures` and the inline gate comment. The ingest-side tests (regression + re-ingest) cover the ingest function's contract; the cursor logic lives entirely in `provider.rs`. Refs CodeRabbit review on tinyhumansai#2889, tinyhumansai#2885.
|
Pushed `9764b31f` addressing both CodeRabbit findings + responding to @graycyrus's nits. What's fixed in this push
I also carried both fixes back to #2887 (Notion) — same shape there, fixed in `0c15fd27` rather than wait for a second round. On the `log::debug!` vs `tracing::debug!` nit — I'd like to push back politely@graycyrus, before I swap that one I want to double-check the convention. Walking the existing ingest layer:
…and `provider.rs` files across providers use `tracing::*` with structured fields. So the codebase already has a two-layer split: provider.rs is the orchestration layer (tracing, structured), ingest.rs / sync.rs are plumbing (log, positional). My `github/ingest.rs` is intentionally matching the Slack / Gmail / vault shape — swapping to `tracing` here would make it the only ingest module using tracing. Happy to convert if you want me to, but I think the right call is either (a) leave as-is to match the established pattern, or (b) file a follow-up issue to unify all ingest modules onto `tracing` in one sweep — both keep the layer convention consistent. Let me know which you prefer. Tests48/48 still pass in `memory_sync::composio::providers::github::*`. The cursor invariant is documented in the new rustdoc on `had_ingest_failures`; the existing 6 ingest tests cover the ingest function's contract. |
graycyrus
left a comment
There was a problem hiding this comment.
@justinhsu1477 the fixes in the third commit look correct — cursor hold on ingest failure is the right call (holds the full range rather than introducing a separate failure-tracking cursor, which would add state and edge cases), and {err:#} properly bakes the anyhow chain into the Display output so tracing::warn!(error = %e) actually surfaces the underlying cause. CodeRabbit's two findings are fully addressed.
CI is still pending on several checks (Build Tauri App, Rust Core Coverage, E2E Appium runs, Windows secrets ACL). Once those are green I'll approve this.
One holdover from my first review that the third commit didn't touch: log::debug! at ingest.rs lines 171 and 180 still uses the log crate with positional format args instead of tracing::debug! with structured key=value fields. Everything else in this module and in provider.rs uses tracing. Small thing, but please fix it before I approve — it's easy to miss in metrics/log aggregation pipelines that filter by tracing span fields.
… ingest Per @graycyrus review on tinyhumansai#2889: the two `log::debug!` calls in `ingest_issue_into_memory_tree` used positional format strings, while `provider.rs` in the same directory logs via `tracing::*` with structured key=value fields. Positional log args are flattened to an opaque message by the tracing-log bridge, so log-aggregation pipelines that filter on span fields (connection_id, issue_id, …) lose them. Swap both calls to `tracing::debug!` with named fields, matching the provider.rs shape exactly: - re-ingest cleanup: connection_id / issue_id / removed_chunks - issue persisted: connection_id / issue_id / chunks_written / already_ingested Scope note: the older ingest modules (slack, gmail, vault::sync) still use `log::*` — those are pre-existing and out of scope here. The new memory-tree ingest modules (this + notion in tinyhumansai#2887) now consistently use tracing alongside their provider.rs. A follow-up sweep to unify the legacy modules can be filed separately if desired. Refs tinyhumansai#2889.
… ingest Carries the @graycyrus review note from the parallel GitHub PR (tinyhumansai#2889) back to Notion so the two new memory-tree ingest modules stay consistent. The two `log::debug!` calls in `ingest_page_into_memory_tree` used positional format strings; provider.rs in the same directory logs via `tracing::*` with structured key=value fields, which log-aggregation pipelines can filter on (the tracing-log bridge flattens positional args to an opaque message and loses them). Swap both to `tracing::debug!` with named fields matching provider.rs: - re-ingest cleanup: connection_id / page_id / removed_chunks - page persisted: connection_id / page_id / chunks_written / already_ingested Scope note: the older ingest modules (slack, gmail, vault::sync) still use `log::*` — pre-existing, out of scope. A follow-up sweep can unify them separately if desired. Refs tinyhumansai#2887, tinyhumansai#2889.
|
Done — pushed Both tracing::debug!(
connection_id = %connection_id,
issue_id = %issue_id,
removed_chunks = removed,
"[composio:github] ingest: re-ingest cleanup"
);…and the issue-persisted log now carries I also carried the same swap to #2887 (Notion) in On the broader codebase: 48/48 tests still green. Should be clear for approval now — thanks for the careful review on both findings. |
graycyrus
left a comment
There was a problem hiding this comment.
@justinhsu1477 the tracing fix in 7d27b5d addresses the last open thread — both debug sites are now using tracing::debug! with structured fields as requested. All prior feedback resolved.
Code is clean. Waiting on CI (Build Tauri App, Rust Core Tests + Coverage, E2E suites, Frontend Coverage are all still pending). Will approve once those are green.
Summary
Second of the four-PR follow-up sequence in #2885 (after Notion in #2887). The GitHub Composio provider now writes through
ingest_pipeline::ingest_document→mem_tree_chunksinstead ofMemoryClient::store_skill_sync→ UnifiedMemorymemory_docs, somemory.search/tree.read_chunk/tree.browse/ the agent recall path / summary trees all see GitHub issues and PRs again.Same shape as the Notion migration in #2887, applied to
providers/github/:providers/github/ingest.rs—github_source_id,parse_updated_at,render_issue_body,extract_html_url,ingest_issue_into_memory_tree. Pretty-JSON markdown body retains labels/assignees/state/comment count for retrieval signal.delete_chunks_by_sourcebefore eachingest_documentcall. Without this, the pipeline's content-blindalready_ingestedgate would silently drop edited issues — the same trap fixed in vault sync fix(vault): sync writes to memory_tree, not legacy UnifiedMemory (#2705) #2720 and Notion feat(composio): migrate Notion provider to memory_tree pipeline (#2885) #2887.source_id = github:{connection_id}:{issue_id}— stable across re-syncs and across connections. Works for both numeric internal IDs and theowner/repo#numberslug fallback fromextract_issue_id.source_refcarries the canonical GitHubhtml_url— audit trails one-click navigable from any downstream UI.provider.rs— droppedpersist_single_item+doc_id, threadedupdated_atthrough to the new ingest fn, kept the existing composite-key dedup logic ({issue_id}@{updated_at}) intact so the cursor-boundary fast-path still works.persist_single_item/store_skill_syncintentionally left in place — they're still used by the remaining unmigrated providers (Linear, ClickUp) and the wider UnifiedMemory removal lives under the #2585 follow-up. Becomes removable once Linear and ClickUp follow (also tracked in #2885).Test plan
Lib tests (all green locally):
cargo test --lib -- memory_sync::composio::providers::github::ingest→ 6/6 passgithub_source_id_is_stable_and_namespaced— pins the contract the cleanup path relies onparse_updated_at_handles_valid_and_invalid_inputs— falls through toUtc::now()on bad inputrender_issue_body_includes_title_header_and_pretty_json— pins the chunked-content shapeextract_html_url_handles_both_envelope_shapes— handles direct +data.html_urlComposio envelopesingest_issue_writes_to_memory_tree— the feat(memory-sync): migrate Notion/ClickUp/Linear/GitHub Composio providers off UnifiedMemory #2885 regression:count_chunksrises and the source_id lands inmem_tree_ingested_sourcesre_ingesting_edited_issue_replaces_prior_chunks— delete-first guard works; chunk count stays steady (±1 chunker rounding) instead of doublingcargo test --lib -- memory_sync::composio::providers::github→ 48/48 pass (no other github-namespaced test broke)cargo check --libcleancargo fmt --checkcleancargo test --tests --no-run— integration tests still compileRefs #2885. Follows #2887 (Notion). Linear and ClickUp PRs come next.
Summary by CodeRabbit
New Features
Bug Fixes
Tests