[FEAT]: Add extraction quality post-processing controls #280
Open
Aryama-srivastav wants to merge 2 commits into fireform-core:main
Pull request overview
Adds a normalization/quality-control layer for extracted field values so downstream JSON is deterministic (missing placeholders, plural parsing, duplicate merges, ambiguity flags) and includes a per-run quality report.
Changes:
- Introduces `ExtractionQualityProcessor` to normalize values, merge duplicates deterministically, and flag ambiguous/missing fields.
- Routes `textToJSON.add_response_to_json()` through the new quality processor and fixes a mutable default argument.
- Adds unit tests for missing sentinel handling, plural normalization, duplicate merge behavior, and ambiguity flagging.
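The mutable default argument fix mentioned above refers to a common Python pitfall. The sketch below is illustrative only: `add_shared` and `add_fresh` are hypothetical names, not the signature of `add_response_to_json()`, which is not shown in this PR excerpt.

```python
# Hypothetical functions illustrating the pitfall; not the project's actual code.

def add_shared(field, value, store={}):
    # The default dict is created once, at function definition time,
    # so every call that omits `store` mutates the same object.
    store[field] = value
    return store


def add_fresh(field, value, store=None):
    # Using None as the default and creating the dict inside the body
    # gives each call its own container.
    if store is None:
        store = {}
    store[field] = value
    return store


first = add_shared("name", "Ada")
second = add_shared("city", "Paris")
print(first is second)          # the two calls returned the same dict
print(add_fresh("name", "Ada"))
print(add_fresh("city", "Paris"))
```

With the `None` default, state from one extraction run cannot leak into the next through the default argument.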
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/extraction_quality.py | New post-processing pipeline for missing/plural/duplicate/ambiguous extraction outputs + report generation. |
| src/backend.py | Integrates quality processing into JSON assembly; prints per-run quality report; fixes mutable default arg. |
| src/test/test_extraction_quality.py | Focused unit tests covering the new normalization/merge/report logic. |
    @@ -0,0 +1,51 @@
    from src.extraction_quality import ExtractionQualityProcessor, MISSING_VALUE_SENTINEL
Comment on lines +31 to +40

    if self._is_missing(normalized_value):
        self.missing_fields.add(field)

    if existing_value is None:
        return normalized_value

    merged, had_duplicate = self._merge_values(existing_value, normalized_value)
    if had_duplicate:
        self.duplicate_fields.add(field)
    return merged
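The snippet above records field names in sets (`missing_fields`, `duplicate_fields`). A per-run quality report could be assembled from such sets roughly as follows. This is a minimal sketch: `build_quality_report`, the `ambiguous_fields` parameter, and the report keys are assumptions for illustration, since the PR's actual report format is not shown here.

```python
# Hypothetical helper; the real report structure in src/extraction_quality.py may differ.

def build_quality_report(missing_fields, duplicate_fields, ambiguous_fields):
    """Turn the per-run tracking sets into a deterministic, printable report."""
    flagged = set(missing_fields) | set(duplicate_fields) | set(ambiguous_fields)
    return {
        # Sets have no stable order, so sort to keep the output deterministic.
        "missing": sorted(missing_fields),
        "duplicates": sorted(duplicate_fields),
        "ambiguous": sorted(ambiguous_fields),
        "flagged_total": len(flagged),
    }


report = build_quality_report(
    missing_fields={"phone", "email"},
    duplicate_fields={"name"},
    ambiguous_fields={"email"},
)
print(report)
```

Sorting the sets before reporting keeps two runs over the same input byte-for-byte comparable.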
Comment on lines +99 to +100

    return value == self.missing_sentinel
Comment on lines +82 to +96

    existing_items = existing_value if isinstance(existing_value, list) else [existing_value]
    new_items = new_value if isinstance(new_value, list) else [new_value]

    merged = list(existing_items)
    had_duplicate = False

    for item in new_items:
        if item in merged:
            had_duplicate = True
            continue
        merged.append(item)

    if len(merged) == 1:
        return merged[0], had_duplicate
    return merged, had_duplicate
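To see how the merge logic in the reviewed lines behaves, it can be lifted into a standalone function. The body below mirrors the snippet above; the free-function name `merge_values` and the sample inputs are assumptions made for this sketch.

```python
# Standalone copy of the reviewed merge logic, for experimentation only.

def merge_values(existing_value, new_value):
    # Treat scalars as one-element lists so both inputs are handled uniformly.
    existing_items = existing_value if isinstance(existing_value, list) else [existing_value]
    new_items = new_value if isinstance(new_value, list) else [new_value]

    merged = list(existing_items)
    had_duplicate = False

    for item in new_items:
        if item in merged:
            # Already present: remember the duplicate, keep the first occurrence.
            had_duplicate = True
            continue
        merged.append(item)

    # A single surviving value stays a scalar; several values are promoted to a list.
    if len(merged) == 1:
        return merged[0], had_duplicate
    return merged, had_duplicate


same = merge_values("Ada", "Ada")
promoted = merge_values("Ada", "Grace")
extended = merge_values(["Ada", "Grace"], ["Grace", "Edsger"])
print(same, promoted, extended)
```

Because new items are appended in arrival order and repeats are skipped, the result depends only on the order of the inputs, which is what makes the merge deterministic.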
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Description
This PR improves extraction post-processing robustness by introducing a dedicated normalization pipeline for duplicates, ambiguity, plural entities, and missing values.
Conversational input is often noisy, and raw model output can include repeated values, ambiguous phrases, inconsistent plural formatting, and missing placeholders. This PR makes post-extraction behavior deterministic and measurable before data is used downstream.
What changed:
- Duplicate entity detection + deterministic merge strategy
- Ambiguity flagging for low-confidence/ambiguous tokens
- Standardized plural normalization (`;`-separated values → deduplicated ordered list)
- Consistent missing sentinel policy (MISSING)
- Per-run quality report generation
- `textToJSON.add_response_to_json()` now routes all values through normalization
- Duplicate handling no longer depends on unsafe append behavior
- Per-run extraction quality report is printed alongside extracted JSON
- Fixed mutable default argument
- Missing sentinel consistency
- Plural normalization
- Duplicate merge determinism
- Duplicate merge list-promotion behavior
- Ambiguity flagging
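The plural normalization and missing-sentinel policy listed above can be sketched as follows. The names `normalize_plural` and `normalize_value`, the sentinel string, and the whitespace handling are assumptions for illustration; the actual `ExtractionQualityProcessor` may behave differently.

```python
# Assumed sentinel value for this sketch; the project defines its own MISSING_VALUE_SENTINEL.
MISSING = "MISSING"


def normalize_plural(raw):
    """Split a ';'-separated string into a deduplicated list, keeping first-seen order."""
    items = []
    for part in raw.split(";"):
        item = part.strip()
        if item and item not in items:
            items.append(item)
    return items


def normalize_value(raw):
    """Map empty input to the sentinel and ';'-separated input to a list."""
    if raw is None or not str(raw).strip():
        return MISSING
    text = str(raw).strip()
    if ";" in text:
        items = normalize_plural(text)
        # A string made only of separators carries no value at all.
        return items if items else MISSING
    return text


print(normalize_value("Ada; Grace;Ada ; "))
print(normalize_value("   "))
print(normalize_value(" Ada "))
```

Using one sentinel for every empty case means downstream code only has to check a single value to detect a missing field.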
Type of change
Feature (non-breaking enhancement)
Bug fix (non-breaking reliability improvement)
How Has This Been Tested?
Test A (focused extraction quality suite):
- Ran: `$env:PYTEST_DISABLE_PLUGIN_AUTOLOAD='1'; python -m pytest test_extraction_quality.py -q`
- Verified output: `5 passed in 0.20s`

Coverage includes:
- Duplicate keys/values handled deterministically
- Plural outputs normalized to standard list format
- Ambiguous fields flagged for review
- Missing values normalized to one sentinel across processing

Test Configuration:
- Firmware version: N/A
- Hardware: Local development machine (Windows)
- SDK: N/A
- Python: 3.13
- Shell: PowerShell
Checklist: