INCREMENTAL_BY_TIME_RANGE with batch_size 1 inserts zero rows and loses boundary data #5739

@ShubhamKapoor992

Description

Incremental models using INCREMENTAL_BY_TIME_RANGE with batch_size 1 (or any batch size creating single-day intervals) generate incorrect SQL with identical start and end dates, resulting in:

Zero rows being inserted in all batches
Missing boundary rows at exactly midnight timestamps (especially critical for rows timestamped 1 day before model execution)

Environment
SQLMesh Version: all versions (tested on 0.0.1.dev4573)
Database: all vendors (PostgreSQL, MySQL, Oracle, Snowflake, BigQuery, etc.; Db2 testing is still in progress in my dev environment)
Impact: Critical - causes complete data loss in incremental models

Reproduction Steps
Note: These steps work on any database - PostgreSQL, MySQL, Snowflake, etc. The bug is in SQLMesh core, not database-specific.

  1. Create Source Table with Boundary Data

PostgreSQL/MySQL/Snowflake/DuckDB:

CREATE TABLE sales_source (
    id INT,
    customer_name VARCHAR(100),
    amount DECIMAL(10,2),
    updated_at TIMESTAMP
);

Oracle:

CREATE TABLE sales_source (
    id NUMBER,
    customer_name VARCHAR2(100),
    amount NUMBER(10,2),
    updated_at TIMESTAMP
);

BigQuery:

CREATE TABLE sales_source (
    id INT64,
    customer_name STRING,
    amount NUMERIC(10,2),
    updated_at TIMESTAMP
);

Db2 (in my case):

CREATE TABLE sales_source (
    id INT,
    customer_name VARCHAR(100),
    amount DECIMAL(10,2),
    updated_at TIMESTAMP
);

Insert test data (works on all databases):

-- Rows 1-4: March 1st
INSERT INTO sales_source VALUES
(1, 'Customer A', 100.00, '2026-03-01 00:00:00'),
(2, 'Customer B', 200.00, '2026-03-01 12:30:00'),
(3, 'Customer C', 150.00, '2026-03-01 18:45:00'),
(4, 'Customer D', 250.00, '2026-03-01 23:59:59');

-- Rows 5-8: March 2nd (includes boundary row at midnight)
INSERT INTO sales_source VALUES
(5, 'Customer E', 300.00, '2026-03-02 00:00:00'),  -- Boundary row!
(6, 'Customer F', 175.00, '2026-03-02 09:15:00'),
(7, 'Customer G', 225.00, '2026-03-02 15:30:00'),
(8, 'Customer H', 275.00, '2026-03-02 21:45:00');

-- Continue inserting data for more days...
-- Total: 50 rows in my test case (any number works), spanning March 1-18

-- Row 50: CRITICAL BOUNDARY ROW
-- This row is timestamped at midnight on March 18th
-- When you run the model on March 19th, this row from "yesterday at midnight"
-- should be included in the March 18th batch, but the bug causes it to be lost!
INSERT INTO sales_source VALUES
(50, 'Customer Z', 500.00, '2026-03-18 00:00:00');  
-- ⚠️ This is 1 day before model execution (March 19th)
-- ⚠️ Represents data that arrived exactly at midnight
-- ⚠️ This row will be LOST due to the bug!
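For convenience, the per-day test rows above can also be generated programmatically. This is a hypothetical helper (not part of the original reproduction), assuming four rows per day with the first row of each day landing exactly at midnight, which is the boundary case the bug loses:

```python
from datetime import datetime, timedelta

def make_rows(start, days):
    """Generate (id, name, amount, timestamp) test rows, four per day.
    The first row of each day is at exactly midnight -- the boundary case."""
    offsets = [
        timedelta(0),                                      # midnight boundary row
        timedelta(hours=12, minutes=30),
        timedelta(hours=18, minutes=45),
        timedelta(hours=23, minutes=59, seconds=59),
    ]
    rows, rid = [], 1
    for d in range(days):
        day = start + timedelta(days=d)
        for offset in offsets:
            rows.append((rid, f"Customer {rid}", 100.0 + rid, day + offset))
            rid += 1
    return rows

rows = make_rows(datetime(2026, 3, 1), days=2)
print(rows[0][3])  # first row sits exactly at midnight on March 1st
```

Feeding these tuples into parameterized `INSERT` statements reproduces the same boundary layout on any of the databases listed above.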

  2. Create Incremental Model

Works on all databases:

-- models/sales_incremental.sql
MODEL (
  name sales_incremental,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column updated_at,
    batch_size 1  -- Single day batches trigger the bug
  )
);

SELECT
  id,
  customer_name,
  amount,
  updated_at
FROM sales_source
WHERE updated_at >= CAST(@start_ds AS TIMESTAMP) 
  AND updated_at < CAST(@end_ds AS TIMESTAMP);

  3. Run Plan on March 19th
# Running on March 19th, 2026
sqlmesh plan --auto-apply prod

Expected Behavior

Batch 1 (2026-03-01):

WHERE updated_at >= '2026-03-01' AND updated_at < '2026-03-02'
-- Should insert 4 rows

Batch 2 (2026-03-02):

WHERE updated_at >= '2026-03-02' AND updated_at < '2026-03-03'
-- Should insert 4 rows (including row 5 at midnight)

Batch 18 (2026-03-18) - The Critical Batch:

WHERE updated_at >= '2026-03-18' AND updated_at < '2026-03-19'
-- Should insert row 50 at '2026-03-18 00:00:00' ✅
-- This is yesterday's midnight data!

Total: All 50 rows inserted ✅
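As a sanity check, the half-open daily intervals that `batch_size 1` should produce can be sketched in a few lines of Python. This is illustrative only, not SQLMesh's internals; it just shows that with correct `[start, start + 1 day)` ranges, the midnight row falls into exactly one batch:

```python
from datetime import datetime, timedelta

def daily_batches(start, end):
    """Half-open daily intervals [day, day + 1 day) that batch_size 1 should yield."""
    day, batches = start, []
    while day < end:
        batches.append((day, day + timedelta(days=1)))
        day += timedelta(days=1)
    return batches

batches = daily_batches(datetime(2026, 3, 1), datetime(2026, 3, 19))
row_50_ts = datetime(2026, 3, 18)  # the midnight boundary row

# With half-open ranges, the midnight row lands in exactly one batch:
matching = [b for b in batches if b[0] <= row_50_ts < b[1]]
assert matching == [(datetime(2026, 3, 18), datetime(2026, 3, 19))]
```

This is the behavior the "Expected" WHERE clauses above encode: each batch's end is the next batch's start, so every timestamp, including `00:00:00`, is counted exactly once.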

Actual Behavior
All batches generate identical start/end dates:

WHERE updated_at >= '2026-03-01' AND updated_at < '2026-03-01'  -- Impossible!
WHERE updated_at >= '2026-03-02' AND updated_at < '2026-03-02'  -- Impossible!
WHERE updated_at >= '2026-03-18' AND updated_at < '2026-03-18'  -- Impossible!
-- ... all batches have this pattern

Result:

❌ 0 rows inserted in all batches
❌ Row 50 at midnight (1 day before execution) is LOST
❌ All boundary rows at 00:00:00 are excluded

The Critical Loss:

Row 50: updated_at = '2026-03-18 00:00:00'
Model run date: 2026-03-19
Expected: Row 50 should be in the 2026-03-18 batch (yesterday's data)
Actual: Row 50 is LOST because the WHERE clause is impossible

Log Evidence

DEBUG date_dict: start_ds=2026-03-01, end_ds=2026-03-01  ❌ Both identical!

INSERT INTO ... 
WHERE "UPDATED_AT" >= CAST('2026-03-01' AS TIMESTAMP) 
  AND "UPDATED_AT" < CAST('2026-03-01' AS TIMESTAMP)

[1/18] sales_incremental [insert 2026-03-01 - 2026-03-01 (0 rows)]  ❌
[2/18] sales_incremental [insert 2026-03-02 - 2026-03-02 (0 rows)]  ❌
[18/18] sales_incremental [insert 2026-03-18 - 2026-03-18 (0 rows)] ❌ Row 50 lost!

Why Row 50 Is Critical

Row 50 represents a common real-world scenario:

  • Data arrives at exactly midnight (00:00:00)
  • This is 1 day before the model execution
  • In production, this could be:
    • End-of-day batch processing
    • Scheduled data loads
    • Timezone boundary data
    • Daily cutoff timestamps

This bug causes systematic loss of midnight boundary data!

The Problem:

  1. SQLMesh passes millisecond epoch timestamps (e.g., 1772409600000 for 2026-03-02 00:00:00)
  2. is_date(1772409600000) returns False (integer epochs are not recognized as dates)
  3. As a result, 1 day is never added to the end date
  4. make_inclusive() then subtracts 1 microsecond from the wrong date
  5. Both @start_ds and @end_ds end up with the same value

Why Boundary Rows Are Lost:
Even when the dates differ, midnight timestamps (00:00:00) aren't recognized as calendar dates, so the range calculation excludes boundary rows.
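The failure mode described in the steps above can be sketched in a few lines of Python. This is a simplified illustration of the reported logic, not SQLMesh's actual implementation (the `is_date` behavior is taken from the report):

```python
from datetime import datetime, timedelta, timezone

def buggy_end_date(start_ms, end_ms):
    """Simplified sketch of the reported bug (not SQLMesh's real code).
    A millisecond epoch fails the is_date() check, so the end date is
    never advanced by one day before @start_ds / @end_ds are rendered."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)
    looks_like_date = False  # per the report, is_date(1772409600000) -> False
    if looks_like_date:
        end += timedelta(days=1)  # never reached, so start == end
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")

# 1772409600000 ms == 2026-03-02 00:00:00 UTC (value from the report)
start_ds, end_ds = buggy_end_date(1772409600000, 1772409600000)
print(start_ds, end_ds)  # identical dates -> the rendered WHERE clause matches nothing
```

With identical `@start_ds` and `@end_ds`, the rendered predicate `updated_at >= X AND updated_at < X` is unsatisfiable, which matches the zero-row inserts in the logs above.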
