Skip to content

Flagship correctness fixes, real escalation metric, lean install#17

Open
ChunkyTortoise wants to merge 1 commit into
mainfrom
claude/improve-repo-quality-DXamB
Open

Flagship correctness fixes, real escalation metric, lean install#17
ChunkyTortoise wants to merge 1 commit into
mainfrom
claude/improve-repo-quality-DXamB

Conversation

@ChunkyTortoise
Copy link
Copy Markdown
Owner

@ChunkyTortoise ChunkyTortoise commented May 18, 2026

Summary

A focused, conflict-free, code-only change (single commit rebased onto current main):

  • reranker.py — stop mutating the caller's SearchResult objects in place (this corrupted the original retrieval scores still used upstream in agentic_rag); return model_copy'd results so rerank is pure w.r.t. its inputs.
  • model_ab_test.py — the standard error was the Welch (unpooled) SE of the mean difference but was named pooled_se. Renamed se_diff; math unchanged.
  • agentic_rag.py — dedup used chunk_id or doc_id, which collapses distinct chunks when chunk_id is a falsy-but-valid value (""). Now chunk_id is not None, extracted to a _dedup_key helper.
  • metrics.pyescalation_rate was a hardcoded 0.12 (also surfaced in the Quality Monitor dashboard). Now computed from ExtractedRecord.needs_review via a single count(*) FILTER (WHERE ...) query.
  • Lean installextract-msg (Outlook .msg) pulls red-black-tree-mod, which needs a C toolchain and broke pip install on clean environments. Made opt-in via requirements-msg.txt; the code already degrades gracefully (HAS_EXTRACT_MSG).

Scope note

This branch originally also carried a doc-honesty pass and a self-verifying repo_stats.py/CI mechanism. While it was in progress, main independently shipped a more extensive parallel pass (hiring narrative, metric-provenance labelling, offline-replay eval gate). That work is now redundant, so it was dropped rather than force a conflicted merge — leaving only the uniquely-valuable, non-conflicting code fixes. Verified that none of main's recent work is reverted.

Validation

  • 1,104 unit tests pass on top of current main (includes main's own new tests), 0 failures, ruff clean.
  • The new count(*) FILTER aggregate compiles to valid PostgreSQL.
  • Clean-venv pip install -r requirements_ci.txt exits 0.
  • Integration suite (Postgres/Redis) not run in the authoring env — covered in CI.

Test plan

  • CI green: full pytest incl. integration (test_business_metrics.py now asserts the computed rate), ruff, eval-gate
  • Sanity-check /metrics/business and /metrics/quality-trend return a real computed escalation_rate
  • Confirm .eml extraction still works with extract-msg absent; .msg returns the graceful error

https://claude.ai/code/session_016hjhwGkKkA7duB6cjzfQ2y

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fcc4ee4a05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

def _dedup_key(r: SearchResult) -> str:
# Fall back to doc_id only when chunk_id is genuinely absent (None);
# a falsy-but-valid chunk_id ("") must not collapse distinct chunks.
return r.chunk_id if r.chunk_id is not None else r.doc_id
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve doc identity for empty chunk IDs

When a tool returns chunk_id="" (which the new comment explicitly treats as a valid falsy chunk id), this key becomes the same empty string for every result, so _merge_results/_deduplicate will drop all but the first such result even when they come from different doc_ids. Use a composite key that still includes doc_id when chunk_id is present but empty, or normalize empty strings to absent if they are not actually valid.

Useful? React with 👍 / 👎.

Rebased onto current main as a clean, conflict-free, code-only change.
The branch's earlier doc-honesty and self-verifying-stats work was
dropped: main independently shipped a more extensive parallel pass
(hiring narrative + metric provenance + offline-replay eval gate), so
that work is now redundant. Only the uniquely-valuable, non-conflicting
changes remain:

- reranker: return model_copy'd results instead of mutating the caller's
  SearchResult objects in place (corrupted upstream retrieval scores).
- model_ab_test: rename pooled_se -> se_diff; it is the Welch (unpooled)
  SE of the mean difference. Math unchanged.
- agentic_rag: dedup via _dedup_key using `is not None` so a falsy-but-
  valid chunk_id ("") no longer collapses distinct chunks.
- metrics: escalation_rate was a hardcoded 0.12; now computed from
  ExtractedRecord.needs_review in a single count(*) FILTER (...) query.
- Outlook .msg support made opt-in (requirements-msg.txt): extract-msg
  pulls red-black-tree-mod, which needs a C toolchain and broke
  `pip install` on clean envs. Code already degrades gracefully.
- test_reranker uses real SearchResult (required by the no-mutation fix).

Verified: 1,104 unit tests pass on top of main, ruff clean, the FILTER
aggregate compiles to valid PostgreSQL.

https://claude.ai/code/session_016hjhwGkKkA7duB6cjzfQ2y
@ChunkyTortoise ChunkyTortoise force-pushed the claude/improve-repo-quality-DXamB branch from fcc4ee4 to 2541de1 Compare May 18, 2026 03:49
@ChunkyTortoise ChunkyTortoise changed the title Repo quality: honest metrics, flagship correctness fixes, lean install Flagship correctness fixes, real escalation metric, lean install May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants