Python-Mini-Projects/reviewer.md at main · vidixha/Python-Mini-Projects

W3 — Evaluation Metric Limitations and Lack of Stratified Analysis

We thank the reviewer for this critique and present three complementary analyses that together address the concerns raised.

1. Per-Cue-Type Stratified Evaluation

We report Edit Exact Match (EEM) — a stricter metric that scores 1 only if the correct edit token appears and the struck-through original is absent — broken down by cue type across all evaluated models. Cue type membership is determined by IntentBench's ground-truth annotation flags. Rows marked † are statistically underpowered (N≤1) and excluded.

Cue Type	DocIntentOCR-7B	DocIntentOCR-3B	Qwen2.5-VL-7B	Qwen2.5-VL-3B	GPT-4.1	GPT-4o	o3	Claude-Sonnet	Gemini-2.5-Flash	Gemini-2.5-Pro
Strikethrough w/o Replacement	0.8269	0.8462	0.6731	0.7885	0.7308	0.8269	0.8077	0.7115	0.8077	0.8269
Table Column Deletion	0.8137	0.8141	0.6913	0.6732	0.6942	0.8066	0.7107	0.7001	0.7194	0.6893
Full Table Deletion	0.6844	0.7142	0.7137	0.6833	0.7166	0.7982	0.8033	0.6765	0.6371	0.7056
Strikethrough w/ Replacement	0.7929	0.7677	0.5505	0.5657	0.7222	0.7828	0.7475	0.7020	0.6616	0.6162
Arrow-Indicated Replacement	0.7316	0.7526	0.5316	0.4368	0.6632	0.7632	0.7737	0.6947	0.6789	0.6421
Table Cell Replacement	0.8824	0.9412	0.8824	0.5882	0.8824	1.0000	1.0000	1.0000	0.8824	0.7059
Caret Insertion	1.0000	1.0000	0.8000	0.8000	1.0000	1.0000	0.6000	1.0000	1.0000	0.6000

DocIntentOCR-7B achieves the highest EEM on Strikethrough w/ Replacement — the most frequent cue type (209 instances, 185 documents) — outperforming all zero-shot models including GPT-4o and o3. This is the operationally most critical cue in financial document workflows.

2. Edited vs. Non-Edited Region Analysis

To directly address the concern about whether failures arise from incorrect intent interpretation versus spurious hallucination in unedited regions, we decompose performance into two components: Edited Region F1 (correctness of applied edits) and Non-Edited Region F1 (preservation of unmodified content).

Model	Edited Region F1	Non-Edited Region F1
DocIntentOCR-7B	0.8656	0.8170
DocIntentOCR-3B	0.8667	0.8321
Qwen2.5-VL-7B	0.7312	0.7337
Qwen2.5-VL-3B	0.6753	0.7073
GPT-4.1	0.8269	0.6815
GPT-4o	0.8677	0.8206
o3	0.8634	0.8052
Claude-Sonnet	0.8398	0.6888
Gemini-2.5-Flash	0.8129	0.7145
Gemini-2.5-Pro	0.7667	0.6938

Two findings stand out. First, DocIntentOCR-7B matches GPT-4o on edited region F1 (0.8656 vs 0.8677) while using orders of magnitude fewer parameters and operating fully on-premise. Second, DocIntentOCR-3B achieves the highest non-edited region F1 (0.8321) across all models — demonstrating that intent-aware supervision teaches the model not only what to change but also what to preserve, reducing spurious hallucination in unedited content.

3. Qualitative Error Analysis

We identify three failure modes in DocIntentOCR-7B outputs: (i) systematic failures on arrow-indicated replacements spanning multiple table cells, where the model correctly identifies the arrow but applies the replacement to the wrong scope; (ii) near-miss cases where the correct replacement value is produced but the struck-through original is not fully suppressed; (iii) spurious hallucination in document footers and source metadata fields, which are not annotated and thus penalize the model despite being outside the task scope.

These analyses will be incorporated as a dedicated subsection in the revised submission. We respectfully ask the reviewer to reconsider the Soundness and Overall Assessment scores in light of this additional evidence.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

reviewer.md

Latest commit

History

reviewer.md

File metadata and controls