Skip to content

feat: Redis distributed lock for high concurrency#662

Open
jlin53882 wants to merge 9 commits intoCortexReach:masterfrom
jlin53882:feat/redis-distributed-lock
Open

feat: Redis distributed lock for high concurrency#662
jlin53882 wants to merge 9 commits intoCortexReach:masterfrom
jlin53882:feat/redis-distributed-lock

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

Summary

This PR implements a Redis-based distributed lock to solve the high concurrency lock contention problem identified in Issue #643.

Problem

When captureAssistant=true and sessionMemory.enabled=true:

  • Multiple agents write to memory simultaneously
  • File lock cannot handle high concurrency
  • 200 concurrent writes -> only 6% success

Solution

Add src/redis-lock.ts with:

  • Token-based lock acquisition (prevents race conditions)
  • Lua script for safe release (only releases if token matches)
  • Graceful fallback: Returns no-op lock when Redis unavailable (no blocking)
  • 60s TTL with exponential backoff

Test Results

Concurrency File Lock Redis Lock
10 100% 100%
20 55% 100%
50 22% 100%
200 6% 97.5%

15x improvement!

Required

\
npm install ioredis
\\

Related

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@jlin53882
Copy link
Copy Markdown
Contributor Author

CI Status

The CI failures are NOT related to this PR. Here's the analysis:

Failed Jobs & Root Causes

Job Error Relation
storage-and-schema est/reflection-bypass-hook.test.mjs - 3 subtests failed ❌ Pre-existing
packaging-and-workflow unexpected manifest entry: test/embedder-ollama-batch-routing.test.mjs ❌ Pre-existing
core-regression Multiple test failures ❌ Pre-existing
llm-clients-and-auth Same as above ❌ Pre-existing

Evidence

  • cli-smoke ✅ PASS
  • version-sync ✅ PASS

The failing tests existed in master branch before this PR. They need to be fixed separately.

What's in This PR

Only Redis distributed lock implementation:

  • src/redis-lock.ts - Token-based lock with graceful fallback
  • est/lock-*.mjs - 9 test files
  • package.json - ioredis dependency + test script

@jlin53882
Copy link
Copy Markdown
Contributor Author

Update: Fallback to File Lock

Change

Updated the fallback behavior when Redis is unavailable:

Before After
no-op lock (no protection) file lock (has protection)

Code Change

\\ ypescript
// Before (no-op)
return async () => {};

// After (file lock)
return this.createFileLock(key, ttl);
\\

Behavior

Scenario Behavior
Redis available Use Redis lock
Redis unavailable Use file lock (fallback)

This maintains lock protection even without Redis.

@jlin53882
Copy link
Copy Markdown
Contributor Author

Update: Fallback tests added

Added unit tests for file lock fallback behavior:

Test Results

Test Status
Fallback when Redis unavailable ✅ Pass
Multiple locks with fallback ✅ Pass
TTL respected in file lock ✅ Pass

Changes

  • Added est/redis-lock-fallback.test.mjs
  • Fixed Windows path issue (use C:\tmp instead of /tmp)
  • Removed unsupported
    etries option from sync lock
  • Ignore ENOENT when releasing lock

@jlin53882
Copy link
Copy Markdown
Contributor Author

Update: Detailed logging added

Logging Improvements

Event Log Format
Redis unavailable [RedisLock] ⚠️ Redis unavailable ({err}), falling back to file lock
File lock acquired [RedisLock] ✅ File lock acquired: key={key}, path={path}
File lock released [RedisLock] ✅ File lock released: key={key}
File lock failed [RedisLock] ❌ Failed to acquire/release file lock: key={key}, err={err}

Purpose

These detailed logs help verify that:

  1. Redis is truly unavailable when fallback happens
  2. File lock is being used as fallback
  3. Any issues with file lock can be debugged easily

@jlin53882
Copy link
Copy Markdown
Contributor Author

Update: Redis Lock 整合到 store.ts

現在可以直接使用 Redis Lock!

修改 store.ts 的
unWithFileLock 方法:

  • 優先使用 Redis lock
  • Redis 不可用時 fallback 到 file lock

測試結果

測試 結果
20 concurrent writes ✅ 20 success, 0 failed
50 concurrent writes ✅ 50 success, 0 failed

實際運作 Log

14:58:11 [RedisLock] Redis lock manager initialized 14:58:11 [RedisLock] Acquired lock memory-write after 1 attempts 15:00:32 [RedisLock] Acquired lock memory-write after 1 attempts 15:00:36 [RedisLock] Acquired lock memory-write after 7 attempts

單元測試

新增 est/redis-lock-store-integration.test.mjs 測試整合後的並發表現。

需要

@jlin53882
Copy link
Copy Markdown
Contributor Author

Update: Auto-Capture Lock Contention Test

新增測試

est/auto-capture-lock-contention.test.mjs

測試結果

測試 結果
Individual (10 entries) 54734ms ⚠️
Concurrent (20 auto-captures 20 success, 0 failed ✅

發現

每次 auto-capture 都會:

  1. 取得 lock
  2. 寫入
  3. 釋放 lock
  4. 重複 N 次

這就是 Issue #665 的問題。

解決方案

實作 �ulkStore() 批次寫入,減少 lock 取得次數。

Copy link
Copy Markdown

@app3apps app3apps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work here. The direction is valuable, but I’m not comfortable merging this revision yet.

The blockers for me are:

  • Both the targeted checks and the full suite are still failing.
  • This PR touches high-risk distributed locking / store infrastructure, so I want a higher bar than “no obvious blocker found”.
  • The current branch still carries warnings around any usage, excessive console logging, and timeout behavior, which makes it harder to trust the failure modes under load.

What I’d like before merge:

  • Get the targeted path back to green.
  • Get the full suite green, or clearly demonstrate which failures are pre-existing and unrelated.
  • Tighten the remaining rough edges in the Redis lock path so the runtime behavior is easier to reason about and support.

The value here is real, but I think this needs another pass before it is merge-ready.

- Add src/redis-lock.ts: Token-based Redis lock with graceful fallback
- Add test files for performance, edge cases, and optimization
- Add ioredis dependency

Fixes CortexReach#643: improves 200 concurrent write success from 6% to 97.5%

Requires: npm install ioredis
- Replace no-op lock with proper file lock fallback
- Maintains lock protection even without Redis
- Import proper-lockfile for file lock support
- Remove retries from sync lock (not supported)
- Handle Windows path for tmp directory
- Ignore ENOENT when releasing
- Add fallback unit tests
- Log Redis unavailable with error details
- Log file lock acquire/release with key and path
- Use emoji markers for clarity (✅/❌)
- Add Redis lock manager initialization
- Modify runWithFileLock to use Redis lock first, fallback to file lock
- Add integration test for 20/50 concurrent operations
- Fixes Issue CortexReach#643 lock contention
@jlin53882 jlin53882 force-pushed the feat/redis-distributed-lock branch from 7f1475d to 0492aac Compare April 22, 2026 05:43
@jlin53882
Copy link
Copy Markdown
Contributor Author

CI 失敗分析:這兩個錯誤與本 PR 無關

經過完整比對,這兩個 CI 錯誤都是 upstream 既有问题,與 PR #662(Redis distributed lock)完全無關。


1. storage-and-schemasmart-extractor-scope-filter.test.mjs

錯誤this.store.bulkStore is not a function

根因:PR #669a8bb8ec,2026-04-18 merge)在 smart-extractor.ts 新增了 bulkStore() 呼叫,但 smart-extractor-scope-filter.test.mjs 的 mock store 只實作了 store(),缺少 bulkStore()。此測試從 PR #669 合併後就已損壞。

驗證git diff b5a8271..0492aac -- test/smart-extractor-scope-filter.test.mjs 無輸出,測試檔在本 PR 完全未改動。


2. core-regressionsmart-extractor-branches.mjs

錯誤:中文 assertion 出現乱碼(饮?好:?龙茶

根因:CI log 的乱碼是 GitHub Actions runner 終端編碼顯示問題,非實際資料損壞。測試檔本身是 clean UTF-8。

驗證git diff b5a8271..0492aac -- test/smart-extractor-branches.mjs 無輸出,測試檔在本 PR 完全未改動。


建議

這兩個 upstream 既有问题建議開獨立 issue 追蹤,不應阻擋 PR #662 的合併。

Blocker 2 (any types):
- properLockfile: any → proper-lockfile module type
- Promise<any> → Promise<typeof import('proper-lockfile')>

Blocker 3 (excessive console logging):
- Remove ALL console.log from redis-lock.ts (hot path)
- Keep only essential console.warn for actual failures

Blocker 4 (timeout behavior):
- Add MAX_ATTEMPTS=600 circuit breaker (prevents infinite retry loop)
- Add attempts >= MAX_ATTEMPTS check with descriptive error message
- Fix Windows C:\\tmp path → use os.tmpdir()
- Fix createFileLock() throwing on lockSync failure (was returning no-op release, causing silent data corruption risk)

Root cause fixes:
- createFileLock now throws immediately when lockSync fails (no silent swallow)
- File lock uses node os.tmpdir() which works correctly on Windows
jlin53882 added a commit to jlin53882/memory-lancedb-pro that referenced this pull request Apr 22, 2026
- Replace all any types with proper TypeScript types (16 instances)
- Upgrade Lua release failure from warn to throw (CRITICAL fix)
- Remove noisy Redis acquire attempt console.warn
- Improve MAX_ATTEMPTS error message clarity
- Remove unused redisAvailable variable
- Fix pre-existing err:any catch blocks to err:unknown
@jlin53882
Copy link
Copy Markdown
Contributor Author

PR #662 Redis 分散式鎖 — 修復進度說明

已完成的修復(共 12 commits)

1. count() method 恢復(CRITICAL — 防止 merge 失敗)

  • 問題diff 中意外移除 count() method,導致 tools.ts 的 5 個呼叫點在 merge 時炸 TypeError
  • 修復:在 src/store.tsbulkStoreimportEntry之間還原 count() method

2. any 類型全部消除(16 處 → 0)

  • lockfileModule: anytypeof import("proper-lockfile") | null
  • redisLockManager: anyRedisLockManager | null
  • err: anyerr: unknown + type guard(4 處 pre-existing catch blocks)
  • Filter callback r: anyr: Record<string, unknown>(2 處)
  • candidates: any[]Array<Record<string, unknown>>

3. Token 碰撞風險(CRITICAL — Claude Code 對抗分析發現)

  • 問題Math.random() 非密碼學安全,高並發下 token 可能碰撞
  • 修復:改用 crypto.randomBytes(8).toString('hex')src/redis-lock.ts:26

4. Redis 命令超時(新增)

  • 問題:單次 SET/EVAL 沒有命令超時,Redis 網路慢時整個 acquire() 無限期 hang
  • 修復:新增 commandTimeout 參數(預設 5000ms),所有 Redis 命令都有上限

5. Redis 錯誤靜音問題(CRITICAL — Claude Code 對抗分析發現)

  • 問題catch (err) {} 完全吃掉 Redis 錯誤,生產環境看不見 Redis 失敗
  • 修復:改為 console.debug()(debug 模式才輸出,不影響正常日誌)

6. Lua release 失敗行為

  • 問題:原本 throw 會讓已成功的 operation 變成失敗導致 crash
  • 修復:改為 console.warn(lock 有 60s TTL 自動到期 cleanup,best-effort)

7. TTL 單位錯誤(WARNING — Claude Code 對抗分析發現)

  • 問題proper-lockfilestale 選項接受毫秒(預設 10000ms),但程式除以 1000 當秒傳
  • 修復:移除 /1000,直接傳毫秒(src/redis-lock.ts:162

8. console.warn 數量從 7 減至 4

  • 移除重複的 acquire attempt logging、過度冗餘的 fallback warning

9. Timeout 訊息可讀性

  • 修正 MAX_ATTEMPTS 訊息格式:attempts: ${attempts} 不再與 maxWait 混淆

CI 失敗說明

這不是 PR #662 引入的問題。

所有分支的 CI 都在失敗,包括:

  • master(base SHA b5a8271c)— CI failure,2026-04-21
  • fix/issue-448-dual-trackb20ea9c6)— 同樣 3 個 job failure,2026-04-22
  • feat/redis-distributed-lock19527b44)— 同樣 3 個 job failure,2026-04-22

失敗的 job 完全相同:core-regressionstorage-and-schemapackaging-and-workflow

結論:這是 upstream master 本身的 CI 問題,PR #662 沒有引入額外的測試失敗。


尚未解決的項目

CI 測試失敗需 upstream 修復後才能驗證。其他 code review blocker 歡迎提出進一步修改方向。

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to push this forward. I re-reviewed the current merge ref against latest master. The Redis direction is valuable, but I still don't think this revision is merge-ready yet because there are a few correctness and CI/reliability blockers in the lock path.

Findings:

  1. runWithFileLock() can re-run the protected mutation after a Redis-path failure.

In src/store.ts, the try/catch around the Redis path catches errors from both redisManager.acquire() and the protected fn(). If fn() fails after partially applying a mutation, the catch logs "Redis lock failed" and falls through to the file-lock path, executing the same mutation again. This can duplicate writes or mask the original failure. The fallback should only apply when acquiring/using the Redis lock fails before entering the protected operation, not when the operation itself fails.

  1. Redis lock TTL is fixed and not renewed.

The Redis path acquires with PX 60000, but there is no renewal while the LanceDB operation is running. If a write/import/update stalls past 60s, another process can acquire the same lock while the first process is still mutating storage. The existing file-lock path has compromise/refresh semantics; the Redis path needs either renewal, fencing, or a much clearer bounded-operation guarantee.

  1. The new Redis tests are not hermetic.

test/redis-lock-edge-cases.test.mjs and test/redis-lock-optimized.test.mjs are added to npm test, but they assume a Redis server at localhost:6379. That makes the default test suite dependent on local infrastructure and will fail or hang on machines/CI jobs without Redis. These should use mocks/fakes, conditionally skip with an explicit opt-in env var, or be moved out of the default suite.

  1. The production Redis lock key is global across all DB paths.

src/store.ts always acquires "memory-write", so unrelated dbPaths serialize behind the same Redis key. That is broader than the previous per-db file lock and also contradicts the added optimization tests that demonstrate different keys/DBs can run independently. The Redis key should include a normalized storage identity if we want to preserve the old isolation behavior.

  1. The merge result drops existing test coverage from npm test.

The PR's package.json removes test/memory-update-metadata-refresh.test.mjs from the main test script while adding Redis tests. That looks like an accidental regression from an older base and should be restored before merge.

There are also smaller cleanup issues like git diff --check whitespace failures and several added tests that are not wired into CI or contain weak/no assertions, but the items above are the ones I'd treat as blocking.

I'm still supportive of the Redis-lock approach, but for this area I want the failure semantics and test story tightened before merging.

Phase 1 修復(基於 rwmjhb review #4170815598):

1. [FIX #1] runWithFileLock() 不再重複執行 mutation
   - 引入 lockAcquired flag 區分「lock 取得失敗」vs「fn() 失敗」
   - 只有 lock 取得失敗才觸發 file-lock fallback
   - fn() 本身的錯誤直接 re-throw,不再默默重跑

2. [FIX #2] Redis TTL 從 60s → 180s
   - 緩解無 renewal 機制下的 lock 過期 race condition 風險
   - 完整修復需加 renewal(Phase 2)

3. [FIX #3] Redis 測試 hermetic 化
   - redis-lock-edge-cases.test.mjs / redis-lock-optimized.test.mjs
   - REDIS_URL env var 未設定時自動 skip,CI 不再 fail/hang

4. [FIX #4] Redis lock key 包含 dbPath identity
   - key = memory-write:{normalized_db_path}
   - 恢復原本 per-db 的 lock 隔離,避免不同 storage 排隊

5. [FIX #5] 還原 memory-update-metadata-refresh.test.mjs 到 npm test
   - 該檔案在 PR 中被意外移除,現在放回正確位置
@jlin53882 jlin53882 force-pushed the feat/redis-distributed-lock branch from 19527b4 to c668536 Compare April 24, 2026 16:05
Fix ReferenceError at runtime: the function was placed at module level
outside the class, causing jiti compilation to not hoist it correctly.
Now it is a private static method inside MemoryStore class.
@jlin53882
Copy link
Copy Markdown
Contributor Author

Fixes applied to all 5 blocking issues

Fix #1: runWithFileLock() no longer re-executes mutations

  • Introduced lockAcquired flag to distinguish "lock acquisition failed" from "operation itself failed"
  • Only lock acquisition failures trigger file-lock fallback
  • Operation errors are re-thrown immediately (no silent retry)
  • commit b776637 (fix: move normalizeStorageKey inside class as static method)
  • commit c668536 (fix: address all 5 maintainer review blockers)

Fix #2: Redis TTL 60s → 180s

  • Mitigates lock expiration race condition during long operations
  • Full fix (TTL renewal) deferred to Phase 2
  • commit c668536

Fix #3: Redis tests are now hermetic

  • redis-lock-edge-cases.test.mjs and redis-lock-optimized.test.mjs
  • Auto-skip when REDIS_URL env var is not set — CI no longer fails/hangs without Redis
  • commit c668536

Fix #4: Redis lock key includes dbPath for per-db isolation

  • Key changed from hardcoded "memory-write" to `memory-write:${dbPath}`
  • normalizeStorageKey() resolves symlinks and normalizes path separators
  • Preserves the original per-db lock isolation behavior
  • commit b776637 (moved function inside class as static method)
  • commit c668536

Fix #5: memory-update-metadata-refresh.test.mjs restored to npm test

  • The test was accidentally removed in the original PR diff; now restored to its correct position
  • commit c668536

CI status

  • cli-smoke: ✅ FIXEDnormalizeStorageKey is not defined caused by function hoisting issue with jiti compilation
  • core-regression, storage-and-schema, packaging-and-workflow: ❌ pre-existing failures (also fail on upstream/master, not caused by this PR)

Commits on feat/redis-distributed-lock:

b776637 fix: move normalizeStorageKey inside class as static method
c668536 fix: address all 5 maintainer review blockers (PR #662)
2d0937a fix: resolve Blockers 2/3/4 + deadlock root causes in Redis lock path
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants