
feat(smoke-test-source): Add incremental sync with configurable state checkpoints#1016

Open
Aaron ("AJ") Steers (aaronsteers) wants to merge 6 commits into
mainfrom
devin/1776987918-smoke-test-incremental

Conversation

@aaronsteers Aaron ("AJ") Steers (aaronsteers) commented Apr 23, 2026

Summary

Adds incremental sync support to SourceSmokeTest (in airbyte/cli/smoke_test_source/) with force-emitted state checkpoints, per the requirements:

  1. Cursor coalesce: starting cursor is resolved in priority order state cursor → start_date config → Jan 1 of the current UTC year.
  2. Force-emitted state checkpoints, any/all of:
    • after every batch_size records (paired with batch_count to bound total records = batch_size * batch_count),
    • on each calendar boundary from partition_by: day | week | month (UTC),
    • always once at end-of-stream (deduped when a trigger fires on the final record).
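The coalesce order above can be sketched as follows (the helper name is hypothetical; the real resolution lives in source.py):

```python
from datetime import datetime, timezone

def resolve_cursor_start(state_cursor, start_date):
    # Hypothetical sketch of the coalesce order described above:
    # state cursor -> start_date config -> Jan 1 of the current UTC year.
    if state_cursor is not None:
        return state_cursor
    if start_date is not None:
        return start_date
    return datetime(datetime.now(timezone.utc).year, 1, 1, tzinfo=timezone.utc)
```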

Scope is opt-in per scenario: a new incremental_batch_stream scenario advertises SyncMode.incremental with updated_at as its source-defined cursor. All existing full-refresh scenarios (basic_types, timestamp_types, large_batch_stream, etc.) are untouched. All new spec fields are optional — no breaking changes for existing catalogs/configs.

New spec fields (all optional):

  • start_date — ISO-8601 date or datetime; lower bound for incremental cursor.
  • batch_size — int ≥ 1; emit STATE every N records.
  • batch_count — int ≥ 1; paired with batch_size to bound record emission.
  • partition_by — day | week | month; emit STATE at UTC calendar boundaries.
  • cursor_step — seconds (number) or ISO-8601 duration (e.g. PT1H, P1D); defaults to 1 s.
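A sketch of how the two accepted cursor_step shapes might normalize to a timedelta (the function name and exact grammar are assumptions, not the connector's code; note it rejects the empty duration PT, matching a later commit in this PR):

```python
import re
from datetime import timedelta

def parse_cursor_step(value):
    # Hypothetical sketch: accept plain seconds (number) or a simple
    # ISO-8601 duration such as PT1H or P1D. Not a full ISO-8601 parser.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return timedelta(seconds=value)
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?", value)
    if m is None or not any(m.groups()):  # also rejects the empty duration "PT"
        raise ValueError(f"Unparseable cursor_step: {value!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
```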

Behavioral notes:

  • Records in the incremental scenario carry updated_at strictly greater than the resolved cursor start, so resuming from a prior STATE never re-emits a record.
  • When the incremental scenario is selected in full_refresh mode, records are emitted with the same generator (deterministic) but no STATE messages.
  • check() rejects invalid start_date, negative/zero batch_size/batch_count, unknown partition_by, and unparseable cursor_step before the sync starts.
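A rough sketch of those check()-time rejections (helper name and messages are assumptions; the real code also rejects batch_count without batch_size, per a later commit in this thread):

```python
def validate_incremental_config(config: dict) -> list[str]:
    # Hypothetical sketch of the pre-sync checks; returns error strings.
    errors = []
    for key in ("batch_size", "batch_count"):
        value = config.get(key)
        # bool is an int subclass, so batch_size=True must be rejected too.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, int) or value < 1
        ):
            errors.append(f"{key} must be an integer >= 1")
    if config.get("partition_by") not in (None, "day", "week", "month"):
        errors.append("partition_by must be one of: day, week, month")
    if config.get("batch_count") is not None and config.get("batch_size") is None:
        errors.append("batch_count requires batch_size")
    return errors
```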

Covered by 47 unit tests in tests/unit_tests/test_smoke_test_source.py — coalesce order, trigger logic, dedup, full-refresh non-regression, and config validation. Local ruff check, ruff format --check, pyrefly check, and full unit-test suite all pass.

A separate follow-up PR will bump source-smoke-test's Docker image tag in airbytehq/airbyte once this lands.

Review & Testing Checklist for Human

  • Confirm the cursor coalesce order (state → start_date → Jan 1 UTC) matches the intended semantics.
  • Confirm opt-in scope (only incremental_batch_stream is incremental; large_batch_stream stays full-refresh-only).
  • Eyeball the new spec fields for naming/UX (start_date, batch_size, batch_count, partition_by, cursor_step).
  • Try a manual sync against a destination: run once without state, read back the final STATE, then rerun passing that state and verify no duplicates.
  • Try a partition_by=day run with cursor_step=PT6H to see STATE messages at day boundaries.

Notes

  • All new logic lives in airbyte/cli/smoke_test_source/{source.py,_scenarios.py}, following the existing layout for this source. Happy to move shared helpers out to a public module in a follow-up if preferred.
  • Answered open questions from the plan with: state > start_date > Jan 1 (confirmed); opt-in per scenario; large_batch_stream unchanged; cursor field = updated_at; no extra triggers for v1.

Link to Devin session: https://app.devin.ai/sessions/aed4dcdceb4e444c98be17b2bfcd2602
Requested by: Aaron ("AJ") Steers (@aaronsteers)

Summary by CodeRabbit

  • New Features

    • Incremental sync with automatic cursor tracking, ordered record events, periodic state checkpoints, and stable ID derivation
    • Configurable incremental behavior: start date, cursor step, batch size, batch count, and partitioning (day/week/month, ISO-week aware)
    • New predefined incremental batch-stream scenario for testing; full-refresh scenarios remain unaffected
  • Tests

    • Comprehensive tests covering incremental event generation, cursor/partition behavior, config validation, state checkpointing, and end-to-end read semantics

… checkpoints

- Adds a new `incremental_batch_stream` scenario that advertises
  `SyncMode.incremental` with `updated_at` as its source-defined cursor.
- Adds `start_date`, `batch_size`, `batch_count`, `partition_by`
  (day/week/month), and `cursor_step` to the connector spec. All fields
  are optional and additive; existing full-refresh scenarios are unchanged.
- Cursor coalesce order: state cursor > `start_date` > Jan 1 of the
  current UTC year.
- Emits STATE after every `batch_size` records, at each partition
  boundary, and once at end-of-stream (dedup when triggers coincide).
- Stops after `batch_size * batch_count` records when both are set.
- `check()` rejects invalid `start_date`, `batch_size`,
  `batch_count`, `partition_by`, and `cursor_step`.
- Adds 47 unit tests covering cursor coalesce, trigger logic, dedup,
  full-refresh non-regression, and config validation.
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1776987918-smoke-test-incremental' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1776987918-smoke-test-incremental'

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /uv-lock - Updates uv.lock file
  • /test-pr - Runs tests with the updated PyAirbyte
  • /prerelease - Builds and publishes a prerelease version to PyPI
📚 Show Repo Guidance


Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.


@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@devin-ai-integration[bot] has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 42 minutes and 15 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 42 minutes and 15 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f1f9a1a0-cc56-4e9e-bf55-46d3839e01fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2986b0c and 69f5ce1.

📒 Files selected for processing (3)
  • airbyte/cli/smoke_test_source/_scenarios.py
  • airbyte/cli/smoke_test_source/source.py
  • tests/unit_tests/test_smoke_test_source.py
📝 Walkthrough

Walkthrough

Adds incremental sync to the smoke-test source: scenario metadata now supports incremental descriptors and cursor fields; a new incremental event generator with partitioning and checkpointing; connector config + validation for cursor controls; discovery/read now emit RECORD and per-stream STATE for incremental scenarios; tests added.

Changes

  • Scenario metadata & generator — airbyte/cli/smoke_test_source/_scenarios.py: adds incremental scenario metadata, cursor_field, a new incremental_batch_stream scenario, a PartitionGrain alias, _partition_bucket, and a public iter_incremental_scenario_events() that emits ordered RECORDs with interleaved STATE checkpoints (batch, partition, stop-rule precedence).
  • Source behavior & config — airbyte/cli/smoke_test_source/source.py: extends SourceSmokeTest to advertise incremental streams (SyncMode.incremental, source_defined_cursor, default_cursor_field), adds config keys (start_date, cursor_step, batch_size, batch_count, partition_by) with validation, derives per-stream cursor_start from state/config/default, and drives RECORD+STATE emission via the incremental generator while preserving full-refresh behavior.
  • Unit tests — tests/unit_tests/test_smoke_test_source.py: new comprehensive tests for spec exposure, discovery flags, cursor parsing helpers, default/start-state resolution, _partition_bucket, iter_incremental_scenario_events triggers/deduplication, end-to-end read() incremental behaviors, and check() validation cases.
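The partition key described above might look like this (a sketch; the real _partition_bucket signature may differ). The ISO-week case matters because the ISO year can differ from ts.year near year boundaries:

```python
from datetime import datetime, timezone

def partition_bucket(ts: datetime, grain: str) -> tuple:
    # Hypothetical sketch of a per-grain bucket key; two records in
    # different buckets are separated by a partition boundary.
    if grain == "day":
        return (ts.year, ts.month, ts.day)
    if grain == "week":
        iso = ts.isocalendar()
        return (iso[0], iso[1])  # ISO year + ISO week, not ts.year
    if grain == "month":
        return (ts.year, ts.month)
    raise ValueError(f"Unknown partition grain: {grain!r}")
```

For example, 2025-12-29 falls in ISO week 1 of 2026, so its week bucket is (2026, 1) even though ts.year is 2025.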

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant Source as SourceSmokeTest
    participant Generator as iter_incremental_scenario_events
    participant StateStore as State Store

    Caller->>Source: read(config, catalog, state)
    Source->>Source: determine stream type (incremental vs full-refresh)
    alt incremental
        Source->>StateStore: fetch per-stream state (if present)
        StateStore-->>Source: cursor_start or none
        Source->>Generator: start(scenario, cursor_start, cursor_step, batch_size, partition_by)
        loop emit events
            Generator->>Source: RECORD (with cursor, id)
            alt checkpoint triggered (batch / partition / stop)
                Generator->>Source: STATE (updated cursor)
                Source->>StateStore: persist/update stream state
            end
            Source->>Caller: yield RECORD/STATE
        end
        Generator->>Source: final STATE (if not already emitted)
        Source->>Caller: yield final STATE
    else full-refresh
        Source->>Generator: produce full-refresh records
        Source->>Caller: yield RECORDs only
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Want me to highlight specific tests or helper functions (cursor parsing, partitioning, state-deduplication) for a focused review, wdyt?

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 41.67%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title directly and clearly describes the main change: adding incremental sync with configurable state checkpoints to the smoke-test source.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.



coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented Apr 24, 2026

PyTest Results (Fast Tests Only, No Creds)

394 tests  +51   394 ✅ +51   5m 44s ⏱️ -48s
  1 suites ± 0     0 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 69f5ce1. ± Comparison against base commit 82bf1e4.

♻️ This comment has been updated with latest results.

- Tighten _state_cursor_for_stream namespace check to be symmetric so
  None-namespace callers only match None-namespace state.
- Remove None from partition_by enum; the type array handles null.
- Reject empty ISO-8601 duration (e.g. 'PT') in _parse_cursor_step.
- Add batch_size=True and batch_count=True cases to validation tests.
- Use getattr instead of __dict__ to read state cursor in tests.
- Make year-boundary tests tolerate Dec 31 -> Jan 1 transition.
- Add ISO-week comment explaining iso.year vs ts.year.
- Expand _emit_incremental_as_full_refresh docstring.
coderabbitai[bot]

This comment was marked as resolved.

- _parse_start_date: wrap date.fromisoformat ValueError symmetrically
  so 10-char invalid dates (e.g. '2024-13-45') surface the same nice
  'Invalid start_date: ...' message as other malformed inputs.
- _validate_incremental_config: reject batch_count without batch_size,
  since iter_incremental_scenario_events silently ignores batch_count
  unless batch_size is also set.
- _emit_incremental_as_full_refresh: document that updated_at values
  shift across year boundaries when start_date is not configured.
- Add tests for batch_count-without-batch_size and invalid 10-char
  start_date.
@github-actions

github-actions Bot commented Apr 24, 2026

PyTest Results (Full)

464 tests  +51   446 ✅ +51   27m 17s ⏱️ + 1m 2s
  1 suites ± 0    18 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 69f5ce1. ± Comparison against base commit 82bf1e4.

♻️ This comment has been updated with latest results.

devin-ai-integration Bot added a commit to airbytehq/airbyte that referenced this pull request Apr 24, 2026
The 0.2.0 wrapper still works with PyAirbyte >= 0.41.0 — the image
build pulls the latest PyAirbyte at image-build time, so once
airbytehq/PyAirbyte#1016 ships the new scenario will be available.
Leaving the floor at 0.41.0 keeps CI resolvable until then.
@aaronsteers Aaron ("AJ") Steers (aaronsteers) marked this pull request as ready for review April 24, 2026 05:32
Copilot AI review requested due to automatic review settings April 24, 2026 05:32

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/unit_tests/test_smoke_test_source.py (2)

311-335: Small nit on tuple(expected) — redundant since expected is already a tuple, wdyt?

For the day/month params, expected is a literal tuple; for week, datetime(...).isocalendar()[:2] also returns a plain tuple (slicing a named tuple yields tuple). So tuple(expected) is a no-op wrap. You can drop it for a hair more clarity, but it's harmless — up to you.

♻️ Optional tweak
 def test_partition_bucket(ts, grain, expected):
-    assert _partition_bucket(ts, grain) == tuple(expected)
+    assert _partition_bucket(ts, grain) == expected

94-95: Tiny nit: getattr without a default reads like defensive code that isn't defending anything — intentional, wdyt?

getattr(obj, "updated_at") with no default is equivalent to obj.state.stream.stream_state.updated_at and will still raise AttributeError if missing. If the intent is "fail loud when the cursor attribute is absent", plain attribute access communicates that a bit more clearly. If the intent was to tolerate absence on AirbyteStateBlob (which is a dynamic/extra=allow pydantic model), adding a default (e.g. None) would make that explicit. Totally optional though — happy to leave it as-is if you prefer.

♻️ Optional tweak
 def _state_cursor(msg) -> str:
-    return getattr(msg.state.stream.stream_state, "updated_at")
+    return msg.state.stream.stream_state.updated_at

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 75a62930-73a7-4810-b811-399844cdd218

📥 Commits

Reviewing files that changed from the base of the PR and between 0ca9576 and 2927bf9.

📒 Files selected for processing (2)
  • airbyte/cli/smoke_test_source/source.py
  • tests/unit_tests/test_smoke_test_source.py
✅ Files skipped from review due to trivial changes (1)
  • airbyte/cli/smoke_test_source/source.py


Copilot AI left a comment


Pull request overview

Adds opt-in incremental sync support to SourceSmokeTest with configurable cursor start coalescing and forced per-stream state checkpoint emission.

Changes:

  • Introduces a new incremental scenario (incremental_batch_stream) with updated_at as the source-defined cursor and a synthetic incremental event generator.
  • Adds config parsing + validation for start_date, cursor_step, batch_size, batch_count, and partition_by, and emits STREAM state messages during incremental reads.
  • Adds a comprehensive unit test suite covering cursor coalescing, checkpoint triggers/deduping, end-to-end read(), and check() validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • airbyte/cli/smoke_test_source/source.py — implements incremental config parsing/validation, cursor coalescing, and STATE emission in read() for incremental-enabled scenarios.
  • airbyte/cli/smoke_test_source/_scenarios.py — adds the incremental scenario definition and the incremental record/state event iterator with size/partition/end triggers.
  • tests/unit_tests/test_smoke_test_source.py — adds unit tests validating incremental behavior, config validation, and non-regression for full-refresh scenarios.


Comment thread airbyte/cli/smoke_test_source/source.py Outdated
"start_date": {
"type": ["string", "null"],
"title": "Start Date",
"format": "date-time",

Copilot AI Apr 24, 2026


start_date is documented/parsed as accepting either an ISO-8601 date (YYYY-MM-DD) or datetime, but the connector spec declares format: date-time. Some platforms/UIs treat format as a validator or render a datetime-only picker, which can reject date-only inputs that the implementation explicitly supports. Consider removing the format constraint or switching it to a schema that clearly allows both date and date-time inputs (or, alternatively, restrict parsing to date-time only to match the spec).

Suggested change
"format": "date-time",



Agreed — the format: date-time hint was inconsistent with _parse_start_date, which explicitly accepts both 10-char ISO dates and full ISO-8601 datetimes. Some Airbyte UIs render a datetime-only picker for format: date-time and would reject 2024-06-15.

Fixed in 2986b0c by removing the format constraint and clarifying the description to call out both accepted shapes (YYYY-MM-DD and YYYY-MM-DDTHH:MM:SSZ).


Devin session

Comment thread airbyte/cli/smoke_test_source/_scenarios.py

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.


"partition_by": {
"type": ["string", "null"],
"title": "Partition By",
"enum": ["day", "week", "month"],


🟡 partition_by JSON Schema enum rejects null despite type and default allowing it

The partition_by spec field declares "type": ["string", "null"] and "default": None, but the "enum": ["day", "week", "month"] does not include null. In JSON Schema draft-07, the enum keyword independently restricts allowed values, so a null value passes the type check but fails the enum check. Verified using the jsonschema library (already a project dependency): jsonschema.validate(None, schema) raises ValidationError: None is not one of ['day', 'week', 'month']. This means that users who leave partition_by unset (or explicitly set to null) would fail config validation if the Airbyte platform performs strict JSON Schema validation against the connector spec.

Suggested change
"enum": ["day", "week", "month"],
"enum": ["day", "week", "month", None],

- Fix PK collision on incremental resume: derive `id` from cursor
  timestamp (via a fixed Unix epoch) so the same `updated_at` always maps
  to the same `id` — avoids duplicate primary keys across runs that
  resume from state.
- Drop `format: date-time` from `start_date` spec since the parser
  explicitly accepts both ISO-8601 date and date-time inputs; keeping
  the format hint could cause UIs to reject date-only input.
coderabbitai[bot]

This comment was marked as resolved.

- Flip incremental_batch_stream to high_volume: True so it stays opt-in
  (was being pulled into the default fast set).
- Hoist shared _CATEGORIES constant used by both batch generators.
- Fix partition-trigger docstring wording (no look-ahead; trigger fires
  on boundary between previous and just-yielded record).
- Update _emit_incremental_as_full_refresh docstring to reflect that
  id/value now shift with start_date (only category is independent).
- Update tests to opt into all_slow_streams for the new scenario.
devin-ai-integration[bot]

This comment was marked as resolved.

Add `incremental_batch_stream` to `HIGH_VOLUME_SCENARIO_NAMES` so the
public constant matches the `high_volume: True` flag on the scenario
dict. Caught by Devin Review.
@devin-ai-integration

Ran the connector end-to-end against the new incremental_batch_stream scenario at HEAD 69f5ce1. Evidence below.

Unit tests (51 passed)

$ uv run pytest tests/unit_tests/test_smoke_test_source.py
======================= 51 passed, 13 warnings in 0.17s ========================

Demo 1 — initial run, composed checkpoints

Config: start_date=2026-01-01T00:00:00Z, batch_size=3, batch_count=2, partition_by=day, cursor_step=P1D (no input state)

RECORD  updated_at=2026-01-02T00:00:00Z  id=20455  category=cat_a
RECORD  updated_at=2026-01-03T00:00:00Z  id=20456  category=cat_b
STATE   ← {updated_at: 2026-01-03T00:00:00Z}    # day boundary + batch_size dedup
RECORD  updated_at=2026-01-04T00:00:00Z  id=20457  category=cat_c
STATE   ← {updated_at: 2026-01-04T00:00:00Z}    # day boundary
RECORD  updated_at=2026-01-05T00:00:00Z  id=20458  category=cat_d
STATE   ← {updated_at: 2026-01-05T00:00:00Z}    # day boundary + batch_size dedup
RECORD  updated_at=2026-01-06T00:00:00Z  id=20459  category=cat_e
STATE   ← {updated_at: 2026-01-06T00:00:00Z}
RECORD  updated_at=2026-01-07T00:00:00Z  id=20460  category=cat_a
STATE   ← {updated_at: 2026-01-07T00:00:00Z}    # terminal

Exactly 6 records (= batch_size × batch_count), 5 deduped STATE messages.

Demo 2 — resume from state (coalesce + PK stability)

Same config, input state {updated_at: 2026-01-04T00:00:00Z}:

RECORD  updated_at=2026-01-05T00:00:00Z  id=20458  # IDs match Demo 1 exactly — PKs are resume-safe
RECORD  updated_at=2026-01-06T00:00:00Z  id=20459
...

State cursor wins over start_date (resumes at 01-05, not 01-01). IDs 20458/59/60 are identical to Demo 1 — confirms the stable-id derivation fix for the PK collision bug.
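The ids in the demos are consistent with deriving id as the number of whole cursor steps between a fixed Unix epoch and the record's cursor value — a sketch of the idea (the actual derivation in _scenarios.py may differ in detail):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # fixed reference point

def stable_id(updated_at: datetime, step: timedelta) -> int:
    # Same updated_at + same step -> same id, so a resumed sync can never
    # mint a new primary key for a cursor value it already emitted.
    return int((updated_at - EPOCH) / step)

# Consistent with the demo output above:
assert stable_id(datetime(2026, 1, 5, tzinfo=timezone.utc), timedelta(days=1)) == 20458
assert stable_id(datetime(2026, 1, 20, tzinfo=timezone.utc), timedelta(days=5)) == 4094
```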

Demo 3 — month partition

Config: start_date=2026-01-15, batch_size=10, batch_count=1, partition_by=month, cursor_step=P5D

RECORD  updated_at=2026-01-20T00:00:00Z  id=4094
RECORD  updated_at=2026-01-25T00:00:00Z  id=4095
RECORD  updated_at=2026-01-30T00:00:00Z  id=4096
RECORD  updated_at=2026-02-04T00:00:00Z  id=4097    # Jan→Feb boundary
STATE   ← {updated_at: 2026-02-04T00:00:00Z}
RECORD  updated_at=2026-02-09T00:00:00Z  id=4098
...
RECORD  updated_at=2026-03-01T00:00:00Z  id=4102    # Feb→Mar boundary
STATE   ← {updated_at: 2026-03-01T00:00:00Z}
RECORD  updated_at=2026-03-06T00:00:00Z  id=4103
STATE   ← {updated_at: 2026-03-06T00:00:00Z}    # terminal

Full evidence page: https://app.devin.ai/attachments/c653760d-8f2e-46de-ac19-64876f76d625/evidence_full.png


Devin session
