fix: synthetic dataset generation for ≥200 rows#54

Merged
Icar0S merged 2 commits into main from copilot/investigate-dataset-generation-issue
Mar 11, 2026

Copilot AI commented Mar 11, 2026

Gemini (free tier) reliably generates up to 100 rows per call, but for larger requests it returns only ~71–76% of the requested rows. That falls short of the 80% acceptance threshold on every attempt, so all retries are exhausted. A second bug made the exhausted-retry path (reached when no exception is raised) return [] instead of data, producing the log line "Batch 1 complete: 0/200 total rows".

Changes

src/synthetic/generator.py

  • Auto sub-batching: requests above _LLM_MAX_ROWS_PER_CALL = 100 are split into sequential sub-batches of ≤100 rows. A 200-row request now makes 2 reliable calls instead of 3 failing ones, reducing quota burn.
  • Fixed empty-return bug: tracks best_rows across retry attempts; when the retry loop exhausts without an exception, fills from the best partial LLM result + mock data rather than returning [].
# Before — silent data loss
return [], logs  # hit when all retries got < 80% rows with no exception

# After — always returns requested count
if best_rows:
    records = self._coerce_types(best_rows, schema)
    ...
    if len(records) < num_rows:
        records.extend(self._generate_mock_data(schema, num_rows - len(records)))
    return records[:num_rows], logs
return self._generate_mock_data(schema, num_rows), logs
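The auto sub-batching described above can be pictured with a minimal sketch. This is not the code from `generator.py`: the helper name `plan_sub_batches` and the module-level `MAX_ROWS_PER_CALL` are hypothetical stand-ins, and the sketch only assumes that the limit is 100 rows per call, as stated for `_LLM_MAX_ROWS_PER_CALL`.

```python
from typing import List

# Assumed to mirror the PR's _LLM_MAX_ROWS_PER_CALL = 100 (name here is hypothetical).
MAX_ROWS_PER_CALL = 100


def plan_sub_batches(num_rows: int, max_rows: int = MAX_ROWS_PER_CALL) -> List[int]:
    """Return the row count of each sequential sub-batch for a request."""
    if num_rows <= 0:
        return []
    # Full-size sub-batches first, then one smaller batch for any remainder.
    full_batches, remainder = divmod(num_rows, max_rows)
    sizes = [max_rows] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    print(plan_sub_batches(200))  # two calls of 100 rows each
    print(plan_sub_batches(100))  # a single call at the threshold
    print(plan_sub_batches(250))  # two full calls plus one of 50 rows
```

Under these assumptions a 200-row request maps to two calls of 100 rows, and a request at or below the limit stays a single call.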

tests/backend/api/test_synthetic_backend.py

  • Added TestGenerateBatchLargeRowHandling with 6 tests covering: sub-batch triggering, single-call at threshold, mock-fill fallback, no-empty-return guarantee, and best-partial-rows retention.
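As an illustration of the behaviors these tests target, here is a self-contained sketch, not the actual contents of `TestGenerateBatchLargeRowHandling`. The function `generate_with_fallback`, the `mock_rows` helper, the retry count of 3, and the row shape are all assumptions made for the example; only the 80% threshold and the "best partial result plus mock fill" idea come from the description above.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def mock_rows(count: int) -> List[Row]:
    """Hypothetical stand-in for _generate_mock_data; mock rows use id -1."""
    return [{"id": -1} for _ in range(count)]


def generate_with_fallback(
    call_llm: Callable[[int], List[Row]],
    num_rows: int,
    max_retries: int = 3,    # assumed retry budget
    min_ratio: float = 0.8,  # acceptance threshold taken from the PR comment
) -> List[Row]:
    """Retry the LLM call, keep the best partial result, pad with mock rows."""
    best_rows: List[Row] = []
    for _ in range(max_retries):
        rows = call_llm(num_rows)
        if len(rows) > len(best_rows):
            best_rows = rows
        if len(rows) >= min_ratio * num_rows:
            break  # good enough, stop retrying
    records = best_rows[:num_rows]
    # Never return fewer rows than requested: fill the gap with mock data.
    records.extend(mock_rows(num_rows - len(records)))
    return records


def test_best_partial_rows_are_kept_and_padded() -> None:
    sizes = iter([71, 76, 73])  # every attempt stays below 80 of 100 rows

    def fake_llm(_requested: int) -> List[Row]:
        return [{"id": i} for i in range(next(sizes))]

    result = generate_with_fallback(fake_llm, 100)
    assert len(result) == 100
    assert sum(1 for row in result if row["id"] >= 0) == 76


def test_no_empty_return_when_llm_returns_nothing() -> None:
    result = generate_with_fallback(lambda _requested: [], 50)
    assert len(result) == 50


if __name__ == "__main__":
    test_best_partial_rows_are_kept_and_padded()
    test_no_empty_return_when_llm_returns_nothing()
```

The fake LLM callables keep the tests deterministic and free of network calls, which is one plausible way to exercise the mock-fill and no-empty-return paths.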

vercel bot commented Mar 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| data-forge-test | Ready | Preview, Comment | Mar 11, 2026 3:48am |

…n bug

Co-authored-by: Icar0S <39846852+Icar0S@users.noreply.github.com>
Copilot AI changed the title [WIP] Investigate dataset generation issue for 200 rows or more Fix synthetic dataset generation for ≥200 rows Mar 11, 2026

github-actions bot commented Mar 11, 2026

🧪 PR Test Summary

| Suite | Status |
| --- | --- |
| Backend — Unit | ✅ success |
| Backend — API | ✅ success |
| Backend — Security | ✅ success |
| Frontend — Coverage | ✅ success |

Coverage thresholds: Statements 80% · Branches 70% · Functions 75% · Lines 80%

📦 View artifacts

@github-actions

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| --- | --- | --- | --- | --- |
| 1168 | 983 | 84% | 80% | 🟢 |

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: bb66b01 by action🐍

@Icar0S Icar0S changed the title Fix synthetic dataset generation for ≥200 rows fix: synthetic dataset generation for ≥200 rows Mar 11, 2026
@Icar0S Icar0S marked this pull request as ready for review March 11, 2026 23:03
@Icar0S Icar0S merged commit c176a2a into main Mar 11, 2026
10 of 11 checks passed

2 participants