Models topological sort + new counting sharding #5

Merged
Telsho merged 4 commits into main from develop
Apr 10, 2026
@Telsho
Owner

@Telsho Telsho commented Apr 10, 2026

No description provided.

Telsho added 4 commits April 10, 2026 17:55
The Sharded/Parallel Counting feature solves the problem of monolithic, complex prompts during the "Counting" phase of data extraction.

The pipeline now natively supports passing a `list[str]` of specialized prompts for a single step. The system executes these prompts concurrently (as "shards"), reaches consensus on each shard individually, and then merges and deduplicates the final results.
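A minimal sketch of that flow, not the code from this PR: `run_shard`, `entity_key`, and `count_sharded` are hypothetical names, the shard body is a placeholder for the real model call plus per-shard consensus, and SHA-256 over sorted-key JSON is just one possible reading of "sorted JSON hashing".

```python
import asyncio
import hashlib
import json
from typing import Union


async def run_shard(prompt: str) -> list[dict]:
    """Hypothetical stand-in for one counting shard (model call plus consensus)."""
    await asyncio.sleep(0)  # placeholder for the real async model call
    # Fake result: every shard reports one shared entity plus one of its own.
    return [{"type": "shared", "name": "Acme"}, {"name": prompt, "type": "own"}]


def entity_key(entity: dict) -> str:
    """Hash the JSON form with sorted keys so that key order does not matter."""
    canonical = json.dumps(entity, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


async def count_sharded(prompts: Union[str, list[str]]) -> list[dict]:
    """Run one shard per prompt concurrently, then merge and deduplicate."""
    if isinstance(prompts, str):
        prompts = [prompts]  # a single prompt is treated as a single shard
    shard_results = await asyncio.gather(*(run_shard(p) for p in prompts))

    seen: set[str] = set()
    merged: list[dict] = []
    for result in shard_results:
        for entity in result:
            key = entity_key(entity)
            if key not in seen:  # keep the first occurrence only
                seen.add(key)
                merged.append(entity)
    return merged
```

Calling `asyncio.run(count_sharded(["people", "companies"]))` in this sketch returns the shared entity once, followed by one entity per shard. `asyncio.gather` preserves input order, so the merged list is deterministic for a given prompt order.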

- Serialization Utils: Allow `resolve_step_param` to return `list[str]`.
- Direct Execution: Updated `count_entities` to detect lists, spawn shards concurrently via `asyncio.gather`, and merge/deduplicate via sorted JSON hashing.
- Batch API Submission: Dynamic request generation tags sharded requests with `_shard_X` and flattens them for unified submission.
- Batch API Fulfillment: Uses regex to group results into shard buckets, reaches per-shard consensus, and deduplicates the merged output. Also fixed a bug where the `BatchProcessResult` model lacked the `errors` field.
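
The batch tagging and regex grouping described above could look roughly like the following. This is an illustrative sketch only: the `custom_id` field, the exact `<base>_shard_<n>` layout, and the helper names `tag_requests` and `group_by_shard` are assumptions, not identifiers taken from the PR.

```python
import re
from collections import defaultdict

# Assumed tag layout: "<base id>_shard_<index>"
SHARD_TAG = re.compile(r"^(?P<base>.+)_shard_(?P<index>\d+)$")


def tag_requests(base_id: str, prompts: list[str]) -> list[dict]:
    """Flatten one sharded step into individually tagged batch requests."""
    return [
        {"custom_id": f"{base_id}_shard_{i}", "prompt": prompt}
        for i, prompt in enumerate(prompts)
    ]


def group_by_shard(results: dict[str, list]) -> dict[str, dict[int, list]]:
    """Bucket batch results by base id and shard index.

    Ids without a shard tag are treated as a single shard with index 0.
    """
    buckets: dict = defaultdict(lambda: defaultdict(list))
    for custom_id, payload in results.items():
        match = SHARD_TAG.match(custom_id)
        if match:
            base, index = match.group("base"), int(match.group("index"))
        else:
            base, index = custom_id, 0
        buckets[base][index].extend(payload)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {base: dict(shards) for base, shards in buckets.items()}
```

Each bucket can then go through consensus on its own before the per-shard outputs are merged and deduplicated.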
@Telsho Telsho merged commit 194aede into main Apr 10, 2026
3 checks passed
@codecov-commenter

Codecov Report

❌ Patch coverage is 91.33858% with 11 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/extrai/core/schema_inspector.py | 83.72% | 7 Missing ⚠️ |
| src/extrai/core/batch/batch_processor.py | 91.30% | 4 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/extrai/core/batch/batch_submitter.py | 73.41% <100.00%> (+1.63%) ⬆️ |
| src/extrai/core/counting_consensus.py | 18.03% <100.00%> (ø) |
| src/extrai/core/entity_counter.py | 88.15% <100.00%> (+4.52%) ⬆️ |
| src/extrai/core/workflow_orchestrator.py | 78.46% <ø> (ø) |
| src/extrai/utils/serialization_utils.py | 78.26% <100.00%> (+5.53%) ⬆️ |
| src/extrai/core/batch/batch_processor.py | 58.85% <91.30%> (+8.55%) ⬆️ |
| src/extrai/core/schema_inspector.py | 97.65% <83.72%> (-2.35%) ⬇️ |

... and 1 file with indirect coverage changes
