
staging <- dev #137

Merged
ducnmm merged 23 commits into staging from dev
May 7, 2026

Conversation

ducnmm (Collaborator) commented May 7, 2026

staging <- dev

hien-p and others added 22 commits (May 4, 2026 11:33), including:

* fix(sdk): ENG-1725 `recallManual` broken after LOW-24
* ENG-1409: Add benchmark scripts for Walrus, sidecar and recall latency
* …nd-bulk-remember
* ENG-1406 + ENG-1408: Async remember pipeline and bulk remember
* …ization
* ENG-1405: Optimize recall with LRU blob cache and batched SEAL decrypt
Lifts the practical /api/remember ceiling from ~8 KiB (text-embedding-3-small
context window) to 1 MiB by summarizing the plaintext via gpt-4o-mini before
embedding. The full original text is still SEAL-encrypted and stored on
Walrus — only the embedding input is summarized, so recall returns the
unmodified plaintext.

How it works
------------
text ≤ 8 KiB    →  embed(text) || encrypt(text)              [unchanged]
text > 8 KiB    →  summarize(text) → embed(summary) || encrypt(text)
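The routing above can be sketched as follows. This is a minimal illustration, not the server's actual code: `summarize`, `embed`, and `encrypt` are hypothetical stand-ins for the real pipeline stages.

```typescript
type Embedding = number[];

const EMBED_DIRECT_LIMIT_BYTES = 8 * 1024; // ~8 KiB embedder context ceiling

async function prepareRemember(
  text: string,
  summarize: (t: string) => Promise<string>,
  embed: (t: string) => Promise<Embedding>,
  encrypt: (t: string) => Promise<Uint8Array>,
): Promise<{ embedding: Embedding; ciphertext: Uint8Array }> {
  const bytes = Buffer.byteLength(text, "utf8");
  // ≤ 8 KiB: embed the text directly; > 8 KiB: embed a summary of it.
  const embedInput =
    bytes <= EMBED_DIRECT_LIMIT_BYTES ? text : await summarize(text);
  // The full original text is always what gets encrypted, so recall
  // returns the unmodified plaintext.
  const [embedding, ciphertext] = await Promise.all([
    embed(embedInput),
    encrypt(text),
  ]);
  return { embedding, ciphertext };
}
```

The key invariant is that summarization only ever changes the embedding input, never the stored bytes.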

Map-reduce summarization handles arbitrarily large input within a single
embedder call:

  1. split into ≤ 64 KiB chunks
  2. summarize each chunk in parallel (gpt-4o-mini, bounded concurrency)
  3. reduce the chunk summaries into one final summary (≤ ~500 words)
  4. embed the final summary

Falls back to the direct embed path when OPENAI_API_KEY is unset (mock /
dev mode), so this doesn't introduce a hard dependency on OpenAI.
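The four map-reduce steps can be sketched like this. `summarizeChunk` and `reduceSummaries` stand in for the gpt-4o-mini calls; the real code also bounds concurrency and would split on byte boundaries rather than characters.

```typescript
const CHUNK_SIZE = 64 * 1024; // step 1: split into ≤ 64 KiB chunks

function splitIntoChunks(text: string, size = CHUNK_SIZE): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function mapReduceSummarize(
  text: string,
  summarizeChunk: (chunk: string) => Promise<string>,
  reduceSummaries: (partials: string[]) => Promise<string>,
): Promise<string> {
  const chunks = splitIntoChunks(text);
  // Step 2: summarize each chunk in parallel.
  const partials = await Promise.all(chunks.map((c) => summarizeChunk(c)));
  // Step 3: reduce the chunk summaries into one final summary (≤ ~500 words).
  // Step 4 (embedding the result) happens in the caller.
  return partials.length === 1 ? partials[0] : reduceSummaries(partials);
}
```

Because the reduce step runs over already-small partial summaries, input size is bounded only by how many chunk summaries fit in one reduce call.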

Boundary handling
-----------------
* MAX_REMEMBER_TEXT_BYTES = 1 MiB enforced inside the remember handler.
* MAX_ANALYZE_TEXT_BYTES = 64 KiB enforced inside the analyze handler
  (analyze does a single LLM call with no chunking, so it has a tighter
  cap independent of remember).
* Auth middleware caps protected JSON bodies at 2 MiB —
  PROTECTED_BODY_LIMIT_BYTES — covering both single 1 MiB remember
  payloads and bulk-remember batches.
* Sidecar /seal/encrypt cap raised to 2 MiB to accept the SEAL request
  for the full original. The previous global app.use(json({limit:256kb}))
  in scripts/sidecar-server.ts was masking per-route overrides on
  /seal/decrypt-batch and /walrus/upload — those are now per-route only,
  using named JSON_LIMIT_* constants.
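The per-route pattern can be sketched as a named-constant lookup. Only the /seal/encrypt value (2 MiB) is stated in this PR; the other values and the helper itself are illustrative assumptions, not the sidecar's actual code.

```typescript
const JSON_LIMIT_SEAL_ENCRYPT = 2 * 1024 * 1024;       // from the PR
const JSON_LIMIT_SEAL_DECRYPT_BATCH = 2 * 1024 * 1024; // assumed for illustration
const JSON_LIMIT_WALRUS_UPLOAD = 2 * 1024 * 1024;      // assumed for illustration

// Each route carries its own named limit; with no global json() parser left,
// a small global limit can no longer mask a per-route override.
const ROUTE_JSON_LIMITS: Record<string, number> = {
  "/seal/encrypt": JSON_LIMIT_SEAL_ENCRYPT,
  "/seal/decrypt-batch": JSON_LIMIT_SEAL_DECRYPT_BATCH,
  "/walrus/upload": JSON_LIMIT_WALRUS_UPLOAD,
};

function jsonLimitFor(route: string): number {
  const limit = ROUTE_JSON_LIMITS[route];
  if (limit === undefined) {
    throw new Error(`no JSON limit configured for ${route}`);
  }
  return limit;
}
```

Failing loudly on an unconfigured route is the point: every body limit is an explicit, named decision rather than an inherited default.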

Integration with PR #121 (async remember)
------------------------------------------
After the rebase onto dev, /api/remember is async (returns 202 + job_id).
Summarization runs inside spawn_prepare_remember_job and the bulk variant
spawn_prepare_bulk_remember_job, before the embed/encrypt fan-out. The
encrypt fork still uses the original text bytes — only the embedding
input is summarized.

Tests
-----
* services/server/tests/e2e_test.py — adds three parametric size cases:
  64 KiB (asserts 200 + summarize log), 512 KiB (asserts 200), and
  MAX_REMEMBER_TEXT_BYTES + 1 (asserts 400). Mirrors the Rust constant
  to catch drift.
* 135/135 unit tests pass, including new assertions on
  MAX_ANALYZE_TEXT_BYTES, PROTECTED_BODY_LIMIT_BYTES, and the bench-
  bypass default.

Benchmark harness
-----------------
services/server/scripts/bench-remember-sizes.ts drives 14 hand-curated
fixtures (Wikipedia prose, Project Gutenberg / Journey to the West for
CJK, science-dense prose, structured JSON, mixed markdown+code) at sizes
4 KiB → 1 MiB through the full async lifecycle (POST → poll → recall),
asserting 202 / job done / 400 boundary as appropriate.
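The POST → poll → recall lifecycle the harness drives looks roughly like this. Apart from /api/remember, the endpoint paths and response field names are assumptions for illustration; `api` stands in for the harness's HTTP client.

```typescript
interface Api {
  post(path: string, body: unknown): Promise<{ status: number; body: any }>;
  get(path: string): Promise<{ status: number; body: any }>;
}

async function rememberAndRecall(
  api: Api,
  text: string,
  pollMs = 1000,
  maxPolls = 120,
) {
  const res = await api.post("/api/remember", { text });
  if (res.status === 400) return { status: 400 }; // over the 1 MiB boundary
  if (res.status !== 202) throw new Error(`unexpected status ${res.status}`);
  for (let i = 0; i < maxPolls; i++) {
    const job = await api.get(`/api/jobs/${res.body.job_id}`); // path assumed
    if (job.body.state === "done") {
      const recall = await api.get(`/api/recall/${job.body.memory_id}`); // path assumed
      return { status: 202, recall: recall.body };
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  throw new Error("job did not reach done in time");
}
```

Each fixture run is one of three outcomes: a 400 at the boundary, a completed job with a successful recall, or a timeout failure.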

Bench results against testnet (RATE_LIMIT_DISABLED=1):

  Size      Worker       Recall    Status
  4 KiB     ~25 s        ~3 s      done
  64 KiB    ~22-26 s     ~1-3 s    done
  256 KiB   ~36-42 s     ~1-3 s    done
  512 KiB   ~42 s        ~1 s      done
  1 MiB     ~60 s        ~3 s      done
  1 MiB+1   —            —         400 ✅

14/14 fixtures pass end-to-end.

Benchmark-only escape hatch
---------------------------
The async remember pipeline turns one user request into 1 POST + N status
polls + 1 recall, which exceeds the 30-weighted-req/min per-delegate-key
budget on the second fixture. Adds RATE_LIMIT_DISABLED=1 (default off,
asserted in tests, loud tracing::warn at startup, surfaced in /config)
that skips request-rate buckets only — storage quota and auth still
apply. Intended for localhost benchmarks.
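The bypass semantics can be sketched as follows: the env flag skips only the request-rate buckets, while quota (and auth) checks still run. The function names and shapes here are illustrative, not the actual rate_limit.rs API.

```typescript
function benchBypassEnabled(env: Record<string, string | undefined>): boolean {
  return env.RATE_LIMIT_DISABLED === "1"; // default off
}

function checkRequest(
  env: Record<string, string | undefined>,
  rateBucketOk: () => boolean,
  quotaOk: () => boolean,
): { allowed: boolean; reason?: string } {
  // Storage quota applies unconditionally, bypass or not.
  if (!quotaOk()) return { allowed: false, reason: "quota" };
  // Only the request-rate buckets are skipped under the bench bypass.
  if (!benchBypassEnabled(env) && !rateBucketOk()) {
    return { allowed: false, reason: "rate" };
  }
  return { allowed: true };
}
```

Keeping quota enforcement outside the bypass is what makes the flag safe to leave compiled in: even a misconfigured deployment can't use it to exceed storage limits.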

Files changed
-------------
services/server/src/routes.rs       — summarize_for_embedding + map-reduce,
                                       MAX_ANALYZE_TEXT_BYTES guard,
                                       summarize wired into async/bulk paths
services/server/src/auth.rs         — PROTECTED_BODY_LIMIT_BYTES = 2 MiB
services/server/src/main.rs         — DefaultBodyLimit + bench-bypass startup warn
services/server/src/rate_limit.rs   — bench_bypass_enabled flag + bypass
services/server/src/types.rs        — ConfigResponse.rate_limit_disabled
services/server/scripts/sidecar-server.ts
                                     — per-route json() limits with named constants
services/server/scripts/bench-remember-sizes.ts
                                     — async-aware harness with poll loop
services/server/scripts/bench-fixtures.json
                                     — 14 hand-curated realistic fixtures
services/server/tests/e2e_test.py   — parametric size cases

Co-authored-by: ducnmm <mauduckiengiang@gmail.com>
hungtranphamminh self-requested a review May 7, 2026 09:05
ducnmm temporarily deployed to benchmark-dev May 7, 2026 09:18 (GitHub Actions)
railway-app (bot) temporarily deployed to MemWal / dev May 7, 2026 09:18
ducnmm merged commit 1fc103a into staging May 7, 2026
21 of 22 checks passed

3 participants