[FEAT]: Add extraction quality post-processing controls by Aryama-srivastav · Pull Request #280 · fireform-core/FireForm

Aryama-srivastav · 2026-03-17T17:02:33Z

Description

This PR improves extraction post-processing robustness by introducing a dedicated normalization pipeline for duplicates, ambiguity, plural entities, and missing values.

Conversational input is often noisy, and raw model output can include repeated values, ambiguous phrases, inconsistent plural formatting, and missing placeholders. This PR makes post-extraction behavior deterministic and measurable before data is used downstream.

What changed:

Added ExtractionQualityProcessor in extraction_quality.py

Duplicate entity detection + deterministic merge strategy
Ambiguity flagging for low-confidence/ambiguous tokens
Standardized plural normalization (;-separated values → deduplicated ordered list)
Consistent missing sentinel policy (MISSING)
Per-run quality report generation
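
A minimal sketch of the plural normalization and missing-sentinel policy listed above. The function name `normalize_value`, the sentinel string `"MISSING"`, and the placeholder set are assumptions for illustration; the actual code in extraction_quality.py may differ.

```python
from typing import Optional, Union

# Assumed sentinel value; the PR exposes MISSING_VALUE_SENTINEL, whose value is not shown here.
MISSING = "MISSING"

# Hypothetical placeholder strings treated as "no value".
_PLACEHOLDERS = {"", "n/a", "none", "null"}


def normalize_value(raw: Optional[str]) -> Union[str, list]:
    """Map raw model output to MISSING, a single string, or an ordered deduplicated list."""
    if raw is None or raw.strip().lower() in _PLACEHOLDERS:
        return MISSING
    # Split on ';', trim whitespace, and drop empty parts.
    parts = [p.strip() for p in raw.split(";") if p.strip()]
    # dict.fromkeys keeps first-seen order while removing repeats.
    unique = list(dict.fromkeys(parts))
    if not unique:
        return MISSING
    return unique[0] if len(unique) == 1 else unique


print(normalize_value("Alice; Bob; Alice"))
print(normalize_value("N/A"))
```

Keeping first-seen order is what makes the output deterministic for a given input string.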

Integrated quality processing into backend.py

textToJSON.add_response_to_json() now routes all values through normalization
Duplicate handling no longer depends on unsafe append behavior
Per-run extraction quality report is printed alongside extracted JSON
Fixed mutable default argument
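
For context on the mutable default argument fix: a default such as `data={}` is evaluated once, when the function is defined, so every call that relies on the default shares the same object. The sketch below shows the pitfall and the usual `None` idiom with hypothetical `BuggyCollector` and `Collector` classes; it is not the code from backend.py.

```python
from typing import Optional


class BuggyCollector:
    def __init__(self, data: dict = {}):  # one dict shared by all default-constructed instances
        self.data = data


class Collector:
    def __init__(self, data: Optional[dict] = None):
        # A fresh dict is created per instance when no argument is given.
        self.data = {} if data is None else data


a, b = BuggyCollector(), BuggyCollector()
a.data["name"] = "Alice"
leaked = "name" in b.data  # state written through `a` is visible through `b`

c, d = Collector(), Collector()
c.data["name"] = "Alice"
isolated = "name" not in d.data  # each instance owns its own dict
```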

Added tests in test_extraction_quality.py

Missing sentinel consistency
Plural normalization
Duplicate merge determinism
Duplicate merge list-promotion behavior
Ambiguity flagging
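
As a rough illustration of what an ambiguity-flagging test of the kind listed above might check, here is a sketch. The marker list and the `is_ambiguous` helper are hypothetical; the tokens and heuristics the PR actually uses are not shown in this conversation.

```python
import re

# Hypothetical hedging markers, chosen only for this example.
AMBIGUOUS_MARKERS = ("maybe", "possibly", "probably", "not sure", "or")


def is_ambiguous(value: str) -> bool:
    """Return True when the value contains a hedging marker or a question mark."""
    lowered = value.lower()
    if "?" in lowered:
        return True
    # Word-boundary matching so that "or" does not fire inside "Jordan".
    return any(re.search(rf"\b{re.escape(m)}\b", lowered) for m in AMBIGUOUS_MARKERS)


def test_ambiguity_flagging():
    assert is_ambiguous("maybe John or Jon")
    assert is_ambiguous("Engine 4?")
    assert not is_ambiguous("Jordan Ortiz")


test_ambiguity_flagging()
```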

Type of change

Feature (non-breaking enhancement)
Bug fix (non-breaking reliability improvement)

How Has This Been Tested?

Test A (focused extraction quality suite):

Ran:
$env:PYTEST_DISABLE_PLUGIN_AUTOLOAD='1'; python -m pytest test_extraction_quality.py -q

Verified output:
5 passed in 0.20s

Coverage includes:

Duplicate keys/values handled deterministically
Plural outputs normalized to standard list format
Ambiguous fields flagged for review
Missing values normalized to one sentinel across processing

Test Configuration:

Firmware version: N/A
Hardware: Local development machine (Windows)
SDK: N/A
Python: 3.13
Shell: PowerShell

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

Copilot

Pull request overview

Adds a normalization/quality-control layer for extracted field values so downstream JSON is deterministic (missing placeholders, plural parsing, duplicate merges, ambiguity flags) and includes a per-run quality report.

Changes:

Introduces ExtractionQualityProcessor to normalize values, merge duplicates deterministically, and flag ambiguous/missing fields.
Routes textToJSON.add_response_to_json() through the new quality processor and fixes a mutable default argument.
Adds unit tests for missing sentinel handling, plural normalization, duplicate merge behavior, and ambiguity flagging.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/extraction_quality.py` | New post-processing pipeline for missing/plural/duplicate/ambiguous extraction outputs + report generation. |
| `src/backend.py` | Integrates quality processing into JSON assembly; prints per-run quality report; fixes mutable default arg. |
| `src/test/test_extraction_quality.py` | Focused unit tests covering the new normalization/merge/report logic. |


src/backend.py

src/test/test_extraction_quality.py

@@ -0,0 +1,51 @@
+from src.extraction_quality import ExtractionQualityProcessor, MISSING_VALUE_SENTINEL


src/extraction_quality.py

+        if self._is_missing(normalized_value):
+            self.missing_fields.add(field)
+
+        if existing_value is None:
+            return normalized_value
+
+        merged, had_duplicate = self._merge_values(existing_value, normalized_value)
+        if had_duplicate:
+            self.duplicate_fields.add(field)
+        return merged
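
To show how the bookkeeping in the fragment above could feed a per-run report, here is a simplified sketch. The class name `QualityTracker`, the method names, the report keys, and the `"MISSING"` sentinel value are assumptions; list-valued inputs are left out to keep the example short.

```python
MISSING = "MISSING"  # assumed sentinel value


class QualityTracker:
    """Hypothetical stand-in for the processor's missing/duplicate bookkeeping."""

    def __init__(self):
        self.missing_fields = set()
        self.duplicate_fields = set()

    def process(self, field, new_value, existing_value=None):
        if new_value == MISSING:
            self.missing_fields.add(field)
        if existing_value is None:
            return new_value
        if new_value == existing_value:
            # Same value seen twice: keep one copy and record the duplicate.
            self.duplicate_fields.add(field)
            return existing_value
        # Different scalar values: promote to a list, existing value first.
        return [existing_value, new_value]

    def report(self):
        # Sorted lists keep the report stable from run to run.
        return {
            "missing": sorted(self.missing_fields),
            "duplicates": sorted(self.duplicate_fields),
        }


tracker = QualityTracker()
result = {}
for field, value in [("name", "Alice"), ("unit", MISSING), ("name", "Alice")]:
    result[field] = tracker.process(field, value, result.get(field))

print(result)
print(tracker.report())
```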


src/extraction_quality.py

+        return value == self.missing_sentinel
+


src/extraction_quality.py

+        existing_items = existing_value if isinstance(existing_value, list) else [existing_value]
+        new_items = new_value if isinstance(new_value, list) else [new_value]
+
+        merged = list(existing_items)
+        had_duplicate = False
+
+        for item in new_items:
+            if item in merged:
+                had_duplicate = True
+                continue
+            merged.append(item)
+
+        if len(merged) == 1:
+            return merged[0], had_duplicate
+        return merged, had_duplicate
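
The merge fragment above can be exercised on its own. The sketch below lifts the same logic into a module-level function; the name `merge_values`, its signature without `self`, and the type hints are adaptations for illustration.

```python
from typing import Any, Tuple


def merge_values(existing_value: Any, new_value: Any) -> Tuple[Any, bool]:
    """Merge two values into an ordered, deduplicated result and report whether a repeat was seen."""
    existing_items = existing_value if isinstance(existing_value, list) else [existing_value]
    new_items = new_value if isinstance(new_value, list) else [new_value]

    merged = list(existing_items)  # copy so the caller's list is not mutated
    had_duplicate = False

    for item in new_items:
        if item in merged:
            had_duplicate = True
            continue
        merged.append(item)

    # A single surviving item is returned as a scalar, not a one-element list.
    if len(merged) == 1:
        return merged[0], had_duplicate
    return merged, had_duplicate


print(merge_values("Alice", "Alice"))
print(merge_values("Alice", "Bob"))
print(merge_values(["Alice", "Bob"], ["Bob", "Carol"]))
```

Because existing items always come first and new items are appended in arrival order, the result depends only on the order in which values are seen. One consequence of the final branch is that a one-element list such as `["Alice"]` collapses to the scalar `"Alice"` after a merge.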


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

feat: add extraction quality post-processing controls

ac2deec

Copilot AI review requested due to automatic review settings March 17, 2026 17:02

Copilot started reviewing on behalf of Aryama-srivastav March 17, 2026 17:03

Copilot AI reviewed Mar 17, 2026


Aryama-srivastav changed the title ~~feat: add extraction quality post-processing controls~~ [FEAT]: Add extraction quality post-processing controls Mar 17, 2026

Potential fix for pull request finding

db294c1

Aryama-srivastav wants to merge 2 commits into fireform-core:main from Aryama-srivastav:feature/extraction-quality-controls

2 participants
