feat: enhance migration resilience with improved error handling and …#3

Open
jamesdelbarco wants to merge 2 commits into braintrustdata:main from jamesdelbarco:resiliency-enhancements

jamesdelbarco commented Jan 30, 2026

Summary

Enhanced migration resilience with improved error handling and date filtering capabilities.

Changes

🔄 Error Handling Improvements

  • Extended rate limit retry logic: 429 responses are now retried up to 30 times (~30 min total) instead of 5, with backoff and 10-50% jitter to prevent thundering-herd retries
  • Added 413 to retryable errors: Payload Too Large errors now work with bisect logic to automatically split oversized batches
  • Graceful oversized item handling: New skip_single_413 parameter allows migration to continue when individual items are too large, tracking them separately
  • Improved retry metrics: Track skipped oversized items in migration state
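
For reviewers, here is a minimal sketch of the kind of 429 retry loop described above. It is not the code in braintrust_migrate/client.py: `backoff_delay`, `call_with_429_retry`, the `send` callable, and the base/cap values are all hypothetical, chosen only so that 30 attempts add up to roughly half an hour.

```python
from __future__ import annotations

import random
import time

MAX_429_ATTEMPTS = 30   # from the PR description
BASE_DELAY_S = 1.0      # assumption: first retry waits about one second
MAX_DELAY_S = 60.0      # assumption: cap so 30 attempts total roughly 30 minutes


def backoff_delay(attempt: int) -> float:
    """Delay before retrying after failed attempt `attempt` (1-based).

    Capped exponential backoff, stretched by 10-50% random jitter so that
    many clients hitting the limit together do not retry in lockstep.
    """
    base = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    return base * (1.0 + random.uniform(0.10, 0.50))


def call_with_429_retry(send, sleep=time.sleep, max_attempts=MAX_429_ATTEMPTS):
    """Call send() -> (status, body) until the status is not 429.

    Returns the last (status, body) seen, which is still a 429 if every
    attempt was rate limited.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            sleep(backoff_delay(attempt))
    return status, body
```

Passing `sleep` as a parameter keeps the loop testable without real waits.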

📅 Date Filtering Enhancements

  • Inclusive --created-before: Changed from exclusive (<) to inclusive (<=) for more intuitive date range behavior
  • End-of-day semantics: Date-only values (YYYY-MM-DD) for --created-before now represent end-of-day (23:59:59.999999 UTC)
  • Separate canonicalization: Date-only values are canonicalized to start-of-day for created_after and to end-of-day for created_before
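
The date semantics above can be illustrated with a short sketch. The function names are hypothetical and are not necessarily those in braintrust_migrate/config.py; treating timezone-naive timestamps as UTC is also an assumption.

```python
from __future__ import annotations

from datetime import date, datetime, time, timezone


def _canonicalize(value: str, *, end_of_day: bool) -> datetime:
    value = value.strip()
    if len(value) == 10:  # date-only input, YYYY-MM-DD
        day = date.fromisoformat(value)
        # time.max is 23:59:59.999999, time.min is 00:00:00
        clock = time.max if end_of_day else time.min
        return datetime.combine(day, clock, tzinfo=timezone.utc)
    parsed = datetime.fromisoformat(value)
    # Assumption: a timestamp without an offset is interpreted as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def canonicalize_created_after(value: str) -> datetime:
    """Date-only values mean the start of that day (UTC)."""
    return _canonicalize(value, end_of_day=False)


def canonicalize_created_before(value: str) -> datetime:
    """Date-only values mean the end of that day (UTC)."""
    return _canonicalize(value, end_of_day=True)


def in_range(created: datetime, after: datetime, before: datetime) -> bool:
    """Both bounds are inclusive, matching the <= change for created_before."""
    return after <= created <= before
```

With these semantics, `--created-before 2026-01-29` includes everything created on January 29 (UTC), which the previous exclusive comparison would not.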

🐛 Data Quality Fixes

  • Non-root span tags: Strip tags field from non-root spans to comply with Braintrust API validation requirements
  • Improved BTQL queries: Better query construction for paginated fetching with date filters
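
As an illustration of the tags fix, the sketch below drops `tags` from any span that is not a root span. It assumes a root span is one whose `span_id` equals its `root_span_id`; the helper name is hypothetical and the actual check in braintrust_migrate/resources/logs.py may differ.

```python
from __future__ import annotations

from typing import Any


def strip_tags_from_non_root(event: dict[str, Any]) -> dict[str, Any]:
    """Return the event unchanged for root spans, without `tags` otherwise.

    Assumption: a span is a root span when span_id == root_span_id.
    The input dict is never mutated.
    """
    is_root = event.get("span_id") == event.get("root_span_id")
    if is_root or "tags" not in event:
        return event
    return {key: value for key, value in event.items() if key != "tags"}
```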

✅ Testing

  • Added comprehensive test suite for BTQL sorted fetching with date filters
  • Added dedicated tests for created_before canonicalization logic
  • 355+ new test lines covering edge cases

Files Changed

  • braintrust_migrate/client.py - Enhanced retry logic with extended 429 handling
  • braintrust_migrate/config.py - Separate date canonicalization functions
  • braintrust_migrate/insert_bisect.py - Skip oversized items option
  • braintrust_migrate/streaming_utils.py - Track skipped oversized items
  • braintrust_migrate/resources/logs.py - Strip tags from non-root spans
  • tests/unit/test_logs_btql_sorted_fetch.py - New comprehensive test suite

Impact

  • Resilience: Migration jobs now handle rate limits and large payloads more gracefully
  • Usability: Date range filtering is more intuitive with inclusive end dates
  • Reliability: Oversized items no longer block entire migrations

james.delbarco added 2 commits January 29, 2026 15:07
…filtering

  Add comprehensive resiliency improvements to handle edge cases during migration:

  - Add --created-before flag for inclusive end-date filtering
  - Extend rate limit (429) retry attempts to 30 with enhanced jitter (10-50%)
    to prevent thundering herd and allow recovery from sustained rate limiting
  - Make 413 (Payload Too Large) errors retryable with automatic batch bisection
    and optional oversized item skipping
  - Fix BTQL created_before filter from exclusive (<) to inclusive (<=)
  - Remove tags from non-root spans per API constraints
  - Add skipped_oversize tracking to migration state

  These changes significantly improve migration success rates for large datasets
  and high-traffic scenarios.
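
To make the 413 handling in the commit message concrete, here is a rough sketch of batch bisection with optional skipping. Only the `skip_single_413` name comes from this PR; `PayloadTooLarge`, `insert_with_bisect`, and the `insert` callable are hypothetical stand-ins, not the contents of braintrust_migrate/insert_bisect.py.

```python
from __future__ import annotations

from typing import Any, Callable


class PayloadTooLarge(Exception):
    """Hypothetical stand-in for an HTTP 413 response."""


def insert_with_bisect(
    items: list[Any],
    insert: Callable[[list[Any]], None],
    *,
    skip_single_413: bool = False,
    skipped: list[Any] | None = None,
) -> int:
    """Insert items, halving the batch whenever the server rejects it as too large.

    Returns the number of items inserted. A single item that is still too
    large is re-raised, unless skip_single_413 is set, in which case it is
    recorded in `skipped` (if provided) and the migration continues.
    """
    if not items:
        return 0
    try:
        insert(items)
        return len(items)
    except PayloadTooLarge:
        if len(items) == 1:
            if not skip_single_413:
                raise
            if skipped is not None:
                skipped.append(items[0])
            return 0
        mid = len(items) // 2
        left = insert_with_bisect(
            items[:mid], insert, skip_single_413=skip_single_413, skipped=skipped
        )
        right = insert_with_bisect(
            items[mid:], insert, skip_single_413=skip_single_413, skipped=skipped
        )
        return left + right
```

The length of `skipped` is the sort of value a `skipped_oversize` counter in the migration state could record.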