fix: add LanceDB row-count validation after extraction to prevent poison state

## Problem Statement

### Current state: write results are not verified after extraction

The final stage of the extraction pipeline:

```typescript
// src/smart-extractor.ts — extractAndPersist()
if (createEntries.length > 0) {
  await this.store.bulkStore(createEntries);  // ← writes only; never checks whether the write actually succeeded
}

return stats;  // ← the returned stats.created is the expected write count, not the actual one
```

```typescript
// src/store.ts — bulkStore() 底層
async bulkStore(entries: MemoryEntry[]): Promise<void> {
  await this.ensureInitialized();
  await this.table.add(entries.map(toRow));
  // ← no return value is compared, and the number of rows actually written is never verified
}
```

`store.ts` already has a `count()` method that returns the real LanceDB row count:

```typescript
// src/store.ts:512
async count(): Promise<number> {
  await this.ensureInitialized();
  return await this.table.countRows();  // ← returns the real row count
}
```

But it is **never called** after extraction completes.

### When things go wrong

**Scenario A: OpenClaw is forcibly shut down**

```
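Extraction pipeline starts:
1. LLM extract → 5 candidate memories
2. Per-candidate processing → 3 createEntries
3. bulkStore(3 entries) → after 2 are written...
4. OpenClaw is killed (user shutdown / system restart / network disconnect)
5. Only 2 are actually written, but stats returns { created: 3, ... }

On next startup:
→ Nothing knows that there should be 3 entries but only 2 exist
→ The user assumes all 3 were remembered, but the third is gone for good
→ No error message, no retry, no repair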
```

**Scenario B: LanceDB write fails without throwing**

```
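bulkStore() internally calls this.table.add(entries)
Under certain conditions (disk full, permissions, file locks) LanceDB may perform a partial write
add() itself does not return the number of rows successfully written
No error thrown → the extraction pipeline assumes everything succeeded
→ stats.created = 3, but the DB actually holds only 1
→ phantom state: "I remembered it" when in fact it was never stored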
```

**Scenario C: compactor runs concurrently (race condition)**

```
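Extraction is writing 10 entries
memory-compactor runs deletions at the same time (based on decay / thresholds)
The compactor deletes 3 old memories
The final bulkStore only manages to write 7 new memories
stats.created = 10, but the actual net increase = 7 - 3 = 4
→ A gap of 6 (10 expected, 4 observed). Where did those 6 go? Nothing records it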
```
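
The arithmetic in scenario C can be reproduced with a small sketch. The helper and field names below are hypothetical and not part of the codebase; the sketch only shows what a before/after row-count comparison would observe when writes fail and deletes happen at the same time.

```typescript
// Hypothetical helper, for illustration only: models what a before/after
// row-count comparison observes during scenario C.
interface CountObservation {
  expected: number;     // stats.created as reported by the extractor
  netIncrease: number;  // countAfter - countBefore
  apparentLoss: number; // expected - netIncrease
}

function observeCountDelta(
  expected: number,
  actuallyWritten: number,
  deletedConcurrently: number,
): CountObservation {
  // Concurrent deletes shrink the observed delta just like failed writes do.
  const netIncrease = actuallyWritten - deletedConcurrently;
  return { expected, netIncrease, apparentLoss: expected - netIncrease };
}

// Scenario C: 10 expected, 7 written, 3 old rows deleted by the compactor.
const scenarioC = observeCountDelta(10, 7, 3);
console.log(scenarioC);
```

Of the 6 apparently missing rows, 3 are writes that failed and 3 are old memories removed by the compactor. The row count alone cannot tell the two apart.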

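### Sense of scale

| Scenario | Likelihood | Impact |
|----------|------------|--------|
| Normal extraction that runs to completion | Very low | None |
| OpenClaw killed (long extraction session) | Medium (system issues / network) | 1-3 memories lost |
| Disk full | Low | All memories lost, with no error shown |
| Compactor race condition | Low but cumulative | Memory count slowly drains |

This does not happen on every run. It is a **cumulative data gap**: a little is lost each time, and the user only notices much later that a particular memory has gone missing.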

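### Prior art: how claude-context handles it

Zilliz Cloud's MCP implementation ran into the same problem (Issue #295). When their snapshot and the Milvus collection count fell out of sync, the client treated `{ indexedFiles: 0, totalChunks: 0, status: 'completed' }` as "indexing finished" even though the collection was actually empty. Triggering a force reindex then deleted the real data and wrote 0 again, creating an infinite loop.

Their fix: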

```typescript
// claude-context handlers.ts — validateLegacyZeroEntries()
const realRowCount = await vdb.getCollectionRowCount(collectionName);

if (realRowCount === -1) {
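    // Cannot be determined (network error, etc.); skip and do nothing
    return;
}
if (realRowCount > 0) {
    // Real data exists, but the snapshot may have recorded 0
    // → overwrite the snapshot with the real count (heal)
    snapshotManager.setCodebaseIndexed(codebasePath, { indexedFiles: realRowCount, ... });
}
if (realRowCount === 0 && snapshot.status === 'completed') {
    // Dangerous: the snapshot says completed, but the real database is empty
    // → this is a phantom entry; remove it so the next run is forced to start over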
    snapshotManager.removeCodebaseCompletely(codebasePath);
}
```
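
The three branches above can be restated as a pure decision function, which makes the rule easier to test in isolation. This is a sketch only: `SnapshotAction` and `decideSnapshotAction` are hypothetical names and do not come from claude-context.

```typescript
// Hypothetical restatement of the branches above; not claude-context code.
type SnapshotAction = "skip" | "heal" | "remove" | "none";

function decideSnapshotAction(
  realRowCount: number,
  snapshotStatus: string,
): SnapshotAction {
  // -1 is the sentinel for "row count could not be determined".
  if (realRowCount === -1) return "skip";
  // Real rows exist: trust the database and overwrite the snapshot.
  if (realRowCount > 0) return "heal";
  // Snapshot claims completion but the collection is empty: phantom entry.
  if (realRowCount === 0 && snapshotStatus === "completed") return "remove";
  return "none";
}

console.log(decideSnapshotAction(0, "completed"));
```

The important property is that an unknown count never triggers a destructive action; only a confirmed zero does.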

---

## Proposed Solution

### Core logic: validate after extraction

After `extractAndPersist()` finishes, add a validation check:

```typescript
// src/smart-extractor.ts - extractAndPersist() (modified)

async extractAndPersist(...): Promise<ExtractionStats> {
  const countBefore = await this.store.count();  // ← row count before extraction

  // ... existing extraction logic ...

  let validation: ExtractionValidation | null = null;

  if (createEntries.length > 0) {
    await this.store.bulkStore(createEntries);

    // New: validation
    const countAfter = await this.store.count();
    const actualCreated = countAfter - countBefore;
    
    if (actualCreated !== stats.created) {
      validation = {
        expected: stats.created,
        actual: actualCreated,
        discrepancy: stats.created - actualCreated,
      };
      
      this.log(
        `memory-pro: extraction validation mismatch: ` +
        `expected=${validation.expected} actual=${validation.actual} ` +
        `lost=${validation.discrepancy}`
      );
      
      if (this.config.onExtractionValidationFailed) {
        await this.config.onExtractionValidationFailed(validation, {
          entries: createEntries,
          sessionKey,
          targetScope,
        });
      }
      
      stats.validationMismatch = validation.discrepancy;
    }
  }
  
  return stats;
}
```
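
To see the check fire end to end, here is a minimal self-contained sketch. It makes several assumptions: `FlakyStore` is a hypothetical in-memory stand-in for the real store that silently drops entries past a capacity, `storeAndValidate` is an illustrative helper name, and the `ExtractionValidation` shape is inferred from the three fields used in the proposal above.

```typescript
// Shape inferred from the proposal above (expected / actual / discrepancy).
interface ExtractionValidation {
  expected: number;
  actual: number;
  discrepancy: number;
}

// Hypothetical in-memory store: keeps at most `capacity` rows and silently
// drops the rest, imitating a partial write that throws no error.
class FlakyStore {
  private rows: string[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  async bulkStore(entries: string[]): Promise<void> {
    const room = Math.max(0, this.capacity - this.rows.length);
    this.rows.push(...entries.slice(0, room));
  }

  async count(): Promise<number> {
    return this.rows.length;
  }
}

// Count before, write, count after, then compare the delta to the batch size.
async function storeAndValidate(
  store: FlakyStore,
  entries: string[],
): Promise<ExtractionValidation | null> {
  const countBefore = await store.count();
  await store.bulkStore(entries);
  const actual = (await store.count()) - countBefore;
  if (actual === entries.length) return null; // counts match, nothing to report
  return { expected: entries.length, actual, discrepancy: entries.length - actual };
}

// A store that can only hold 2 rows receives a batch of 3.
storeAndValidate(new FlakyStore(2), ["a", "b", "c"]).then((v) => console.log(v));
```

Note that this sketch has a single writer. With a concurrent compactor, as in scenario C, the same delta would also include deleted rows.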

### Config extension

```typescript
// New optional field on SmartExtractorConfig
export interface SmartExtractorConfig {
  // ... existing fields ...
  
  /** Called when extraction write count doesn't match expected count. */
  onExtractionValidationFailed?: (
    validation: ExtractionValidation,
    context: { entries: StoreEntry[]; sessionKey: string; targetScope: string }
  ) => Promise<void> | void;
}
```
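
A caller might wire the callback up roughly as follows. This is a sketch under assumptions: the local types are stand-ins so the example compiles on its own, the `text` field on `StoreEntry` is a placeholder because its real shape is not shown in this issue, and `pendingRechecks` is a hypothetical caller-side queue.

```typescript
// Local stand-in types; the real StoreEntry shape is not shown in this issue.
interface StoreEntry {
  text: string; // placeholder field
}
interface ExtractionValidation {
  expected: number;
  actual: number;
  discrepancy: number;
}
interface ValidationContext {
  entries: StoreEntry[];
  sessionKey: string;
  targetScope: string;
}

// Hypothetical caller-side queue of batches that need a follow-up check.
const pendingRechecks: ValidationContext[] = [];

const config = {
  onExtractionValidationFailed(
    validation: ExtractionValidation,
    context: ValidationContext,
  ): void {
    // Queue only when rows are missing; a surplus is treated as a warning.
    if (validation.discrepancy > 0) {
      pendingRechecks.push(context);
    }
  },
};

// Simulate the extractor reporting 3 expected rows but only 2 written.
config.onExtractionValidationFailed(
  { expected: 3, actual: 2, discrepancy: 1 },
  {
    entries: [{ text: "a" }, { text: "b" }, { text: "c" }],
    sessionKey: "session-1",
    targetScope: "global",
  },
);
console.log(pendingRechecks.length);
```

Keeping the retry decision in the caller matches the intent of the proposal: the extractor reports the mismatch, and the caller chooses what to do with it.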

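### Error classification and handling strategy

| Case | Likelihood | Handling |
|------|------------|----------|
| Actual writes > expected | Very low (concurrent extraction or compactor) | Log as a warning, not an error |
| Actual writes < expected | Medium (kill / partial write) | Log an error and invoke the callback; the caller can trigger a retry |
| Actual writes = expected | Normal | No action |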
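
The table maps onto a small classifier. The names `ValidationOutcome` and `classifyValidation` are hypothetical; this is one possible way to encode the three rows, not code from the repository.

```typescript
// Hypothetical encoding of the three rows in the table above.
type ValidationOutcome = "ok" | "warning" | "error";

function classifyValidation(expected: number, actual: number): ValidationOutcome {
  if (actual === expected) return "ok";
  // More rows than expected: likely concurrent activity, so warn only.
  // Fewer rows than expected: possible data loss, so report an error.
  return actual > expected ? "warning" : "error";
}

console.log(classifyValidation(3, 2));
```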

### Why not retry directly

```
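Retrying immediately after a validation failure is not recommended, because:
1. The failure may be caused by a duplicate ID rather than by a write that could not go through
2. Re-running the LLM extraction is expensive (API call + time)
3. External callers (such as auto-capture) can decide for themselves whether to retry
4. The purpose of validation is only to confirm that the data was actually written, not to repair it automatically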
```

---

## Impact

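| Dimension | Improvement |
|-----------|-------------|
| Data integrity | Write gaps are detected promptly; no more silent data loss |
| Debuggability | A validation mismatch produces a callback and a log entry, so problems can be traced |
| System confidence | Users know that "extracted N memories" means N were really written |
| Compactor race | Discrepancies show up in the log, so cumulative problems can be discovered |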

---

## Questions for Maintainers

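1. Is the design direction of the `onExtractionValidationFailed` callback reasonable, or is there a better API design?
2. Should this validation be enabled by default, or does it need a config switch?
3. `bulkStore` currently returns `void`. Is it worth changing it to return `number` (the actual number of rows written)?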

---

## References

- Zilliz Cloud Issue #295 fix (snapshot validation concept)
- `src/store.ts:512`: the `count()` method already exists and can be used as-is

