[FEAT]: Add extraction quality post-processing controls #280

Open
Aryama-srivastav wants to merge 2 commits into fireform-core:main from Aryama-srivastav:feature/extraction-quality-controls

Conversation

@Aryama-srivastav
Description

This PR improves extraction post-processing robustness by introducing a dedicated normalization pipeline for duplicates, ambiguity, plural entities, and missing values.

Conversational input is often noisy, and raw model output can include repeated values, ambiguous phrases, inconsistent plural formatting, and missing placeholders. This PR makes post-extraction behavior deterministic and measurable before data is used downstream.

What changed:

  • Added ExtractionQualityProcessor in extraction_quality.py

      • Duplicate entity detection + deterministic merge strategy
      • Ambiguity flagging for low-confidence/ambiguous tokens
      • Standardized plural normalization (;-separated values → deduplicated ordered list)
      • Consistent missing sentinel policy (MISSING)
      • Per-run quality report generation

  • Integrated quality processing into backend.py

      • textToJSON.add_response_to_json() now routes all values through normalization
      • Duplicate handling no longer depends on unsafe append behavior
      • Per-run extraction quality report is printed alongside extracted JSON
      • Fixed mutable default argument

  • Added tests in test_extraction_quality.py

      • Missing sentinel consistency
      • Plural normalization
      • Duplicate merge determinism
      • Duplicate merge list-promotion behavior
      • Ambiguity flagging
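
For context on the "Fixed mutable default argument" item above, here is a generic sketch of that class of bug and the usual fix. It is not the code from backend.py: the function names `add_buggy` and `add_fixed` are hypothetical, and the real signature touched by this PR is not shown on this page.

```python
def add_buggy(value, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares and mutates the same list.
    bucket.append(value)
    return bucket


def add_fixed(value, bucket=None):
    # Use None as the default and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(value)
    return bucket


first = add_buggy("a")
second = add_buggy("b")
print(first is second)   # True: both names point at the shared default list
print(second)            # ['a', 'b']

print(add_fixed("a"))    # ['a']
print(add_fixed("b"))    # ['b']
```

With the `None` default, state can no longer leak between runs through the default argument, which is the property a per-run quality report depends on.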

Type of change

Feature (non-breaking enhancement)
Bug fix (non-breaking reliability improvement)

How Has This Been Tested?

Test A (focused extraction quality suite):

Ran:
$env:PYTEST_DISABLE_PLUGIN_AUTOLOAD='1'; python -m pytest test_extraction_quality.py -q

Verified output:
5 passed in 0.20s

Coverage includes:

Duplicate keys/values handled deterministically
Plural outputs normalized to standard list format
Ambiguous fields flagged for review
Missing values normalized to one sentinel across processing
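
To make the plural and missing-value items above concrete, here is a minimal sketch of what such normalization could look like. It is an illustration under stated assumptions, not the implementation in extraction_quality.py: `normalize_plural` is a hypothetical helper, and the literal string "MISSING" is assumed from the sentinel policy named in the description.

```python
MISSING = "MISSING"  # assumed sentinel value, taken from the PR description


def normalize_plural(raw):
    """Turn a ';'-separated string into an ordered, deduplicated list.

    Hypothetical helper: returns the MISSING sentinel when no usable
    value is present, otherwise a list in first-seen order.
    """
    if raw is None or not raw.strip():
        return MISSING

    seen = set()
    items = []
    for part in raw.split(";"):
        item = part.strip()
        # Skip empty fragments and values already collected.
        if item and item not in seen:
            seen.add(item)
            items.append(item)

    return items if items else MISSING


print(normalize_plural("Engine 1; Engine 2; Engine 1"))  # ['Engine 1', 'Engine 2']
print(normalize_plural("   "))                           # MISSING
print(normalize_plural(" ; ; "))                         # MISSING
```

Keeping first-seen order (rather than sorting or using a bare set) is what makes the output deterministic for the same input.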

Test Configuration:

Firmware version: N/A
Hardware: Local development machine (Windows)
SDK: N/A
Python: 3.13
Shell: PowerShell

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Copilot AI review requested due to automatic review settings March 17, 2026 17:02
Copilot AI left a comment

Pull request overview

Adds a normalization/quality-control layer for extracted field values so downstream JSON is deterministic (missing placeholders, plural parsing, duplicate merges, ambiguity flags) and includes a per-run quality report.

Changes:

  • Introduces ExtractionQualityProcessor to normalize values, merge duplicates deterministically, and flag ambiguous/missing fields.
  • Routes textToJSON.add_response_to_json() through the new quality processor and fixes a mutable default argument.
  • Adds unit tests for missing sentinel handling, plural normalization, duplicate merge behavior, and ambiguity flagging.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File | Description
src/extraction_quality.py | New post-processing pipeline for missing/plural/duplicate/ambiguous extraction outputs + report generation.
src/backend.py | Integrates quality processing into JSON assembly; prints per-run quality report; fixes mutable default arg.
src/test/test_extraction_quality.py | Focused unit tests covering the new normalization/merge/report logic.

@@ -0,0 +1,51 @@
from src.extraction_quality import ExtractionQualityProcessor, MISSING_VALUE_SENTINEL
Comment on lines +31 to +40
if self._is_missing(normalized_value):
    self.missing_fields.add(field)

if existing_value is None:
    return normalized_value

merged, had_duplicate = self._merge_values(existing_value, normalized_value)
if had_duplicate:
    self.duplicate_fields.add(field)
return merged
Comment on lines +99 to +100
return value == self.missing_sentinel

Comment on lines +82 to +96
existing_items = existing_value if isinstance(existing_value, list) else [existing_value]
new_items = new_value if isinstance(new_value, list) else [new_value]

merged = list(existing_items)
had_duplicate = False

for item in new_items:
    if item in merged:
        had_duplicate = True
        continue
    merged.append(item)

if len(merged) == 1:
    return merged[0], had_duplicate
return merged, had_duplicate
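
The merge logic quoted above can be exercised on its own. The sketch below wraps the same statements in a standalone function so the behavior is easy to check. The free-function form and the name `merge_values` are assumptions for illustration, since in the PR this is a method on the processor, and the sample values are made up.

```python
def merge_values(existing_value, new_value):
    # Treat scalars as one-element lists so both inputs merge the same way.
    existing_items = existing_value if isinstance(existing_value, list) else [existing_value]
    new_items = new_value if isinstance(new_value, list) else [new_value]

    merged = list(existing_items)  # copy, so the caller's list is not mutated
    had_duplicate = False

    for item in new_items:
        if item in merged:
            had_duplicate = True
            continue
        merged.append(item)

    # A single surviving value is returned as a scalar, not a list.
    if len(merged) == 1:
        return merged[0], had_duplicate
    return merged, had_duplicate


print(merge_values("Engine 1", "Engine 1"))      # ('Engine 1', True)
print(merge_values("Engine 1", "Engine 2"))      # (['Engine 1', 'Engine 2'], False)
print(merge_values(["a", "b"], ["b", "c"]))      # (['a', 'b', 'c'], True)
```

One consequence worth noting: because of the `len(merged) == 1` branch, a one-element list passed in comes back as a scalar, so `merge_values(["a"], "a")` returns `("a", True)`.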
@Aryama-srivastav changed the title from "feat: add extraction quality post-processing controls" to "[FEAT]: Add extraction quality post-processing controls" on Mar 17, 2026
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>