❌ 2 Tests Failed
🔄 Flaky Test Detected — Analysis: The test … ✅ Automatically retrying the workflow
Force-pushed from 87b8245 to 1452607
❌ Test Failure — Analysis: All 3 TestMongoClickhouseSuite tests fail identically across all matrix variants with deterministic errors (UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT and ClickHouse unknown table identifier), indicating a real regression in MongoDB snapshot handling rather than a flaky failure.
Force-pushed from 1452607 to 7654168
🔄 Flaky Test Detected — Analysis: Two MongoDB-ClickHouse e2e tests failed with "UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT", indicating they timed out waiting for snapshot workflow completion — a classic flaky pattern in distributed systems testing. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected — Analysis: All three matrix variants failed with "UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT" in TestMongoClickhouseSuite at the same wall-clock time, indicating a transient timeout waiting for the MongoDB snapshot — not a code regression. ✅ Automatically retrying the workflow
❌ Test Failure — Analysis: TestMongoClickhouseSuite/Test_CDC and Test_Mongo_Can_Resume_After_Delete_Table fail deterministically across all three CI matrix configurations with identical ~31s durations, indicating a real bug rather than a flaky test.
Force-pushed from 7654168 to af08f20
🔄 Flaky Test Detected — Analysis: Only one of three matrix combinations (PG18 + MariaDB 8.0) failed in the e2e integration test suite while identical code passed on PG16 and PG17, with no deterministic assertion failure visible in logs, strongly suggesting a flaky e2e test rather than a real regression. ✅ Automatically retrying the workflow
Force-pushed from af08f20 to f1801ff
🔄 Flaky Test Detected — Analysis: Test teardown race condition: the replication slot was still held by an active background PID when cleanup tried to drop it, causing a false failure unrelated to the test's actual logic. ✅ Automatically retrying the workflow
Force-pushed from a1c4da6 to dae8cf8
🔄 Flaky Test Detected — Analysis: The e2e test … ✅ Automatically retrying the workflow
🔄 Possible Flaky Test — Analysis: The MySQL GTID e2e test suite failed after ~720s with no specific test failure message visible, suggesting a timing/infrastructure issue in the distributed test environment rather than a code regression.
Force-pushed from dae8cf8 to 8eb1be1
// Cap partitions to the timestamp range in seconds so we don't create additional
// empty partitions unnecessarily when docs span fewer seconds than partitions
numPartitions = min(numPartitions, tsRange)
Maybe a stupid question, but why do we need to treat the time part in any special way here? If a ton of records are all inserted in the same second, it seems like they could be partitioned by the bottom 8 bytes just as well, so why not do integer math?
not a stupid question at all...
An ObjectId consists of a 4-byte timestamp, a 5-byte random value unique per process per machine, and a 3-byte incrementing counter.
Given that the timestamp occupies the most significant bytes, you are right that integer math is a strictly better partitioning mechanism, even if all the change events share the same random bytes or the random bytes are heavily skewed.
This will simplify the partitioning logic as well. Thanks for bringing this up.
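To make the integer-math idea concrete, here is a minimal Go sketch, assuming hex-string ObjectIds and hypothetical names (`partitionBounds` is not the PR's actual code): treat each 12-byte id as one big-endian integer and divide the [min, max] range uniformly.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"math/big"
)

// partitionBounds divides the [minID, maxID] ObjectId range into up to n
// contiguous partitions using plain integer math over the full 12 bytes.
// It assumes minID <= maxID; ids are hex strings here for illustration.
func partitionBounds(minID, maxID string, n int64) ([]string, error) {
	lo, err := hex.DecodeString(minID)
	if err != nil {
		return nil, err
	}
	hi, err := hex.DecodeString(maxID)
	if err != nil {
		return nil, err
	}
	loInt := new(big.Int).SetBytes(lo)
	hiInt := new(big.Int).SetBytes(hi)
	span := new(big.Int).Sub(hiInt, loInt)

	// Degenerate range: a single partition covering one id.
	if span.Sign() == 0 {
		return []string{minID, maxID}, nil
	}
	// Cap the partition count to the number of gaps in the range, echoing
	// the min(numPartitions, tsRange) capping from the diff above.
	if span.Cmp(big.NewInt(n)) < 0 {
		n = span.Int64()
	}

	step := new(big.Int).Div(span, big.NewInt(n))
	bounds := make([]string, 0, n+1)
	for i := int64(0); i <= n; i++ {
		b := new(big.Int).Add(loInt, new(big.Int).Mul(step, big.NewInt(i)))
		if i == n {
			b = hiInt // the last bound is exactly the max id
		}
		bounds = append(bounds, hex.EncodeToString(b.FillBytes(make([]byte, 12))))
	}
	return bounds, nil
}

func main() {
	// Only 5 distinct ids fit in this range, so 8 requested partitions
	// are capped down rather than producing empty ones.
	bounds, err := partitionBounds(
		"65f000000000000000000000",
		"65f000000000000000000004",
		8,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(bounds)-1, "partitions:", bounds)
}
```

Because the timestamp occupies the most significant bytes, these integer bounds order documents exactly as a timestamp-based split would, without ever extracting the time part.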
❌ Test Failure — Analysis: The test …
Force-pushed from 6a2dc7a to acefd3c
❌ Test Failure — Analysis: TestMongoClickhouseSuite/Test_Snapshot_Partition_Capped_To_Timestamp_Range consistently fails across all CI matrix configurations with expected 5 records but got 100, indicating a real bug in the MongoDB snapshot partition timestamp-capping logic.
❌ Test Failure — Analysis: TestMongoClickhouseSuite/Test_Snapshot_Partition_Capped_To_Timestamp_Range fails deterministically across all 3 CI matrix variants with expected 5 but got 100 records, indicating a real bug in the snapshot timestamp-range partition capping logic introduced in the parallel-snapshotting PR.
🔄 Flaky Test Detected — Analysis: TestMongoClickhouseSuite/Test_Snapshot_Partition_Capped_To_Timestamp_Range failed due to a race condition where ClickHouse returned "Unknown table expression identifier" during the WaitFor loop (table not yet created), then the snapshot completed with all 100 records instead of the expected 5 within the timestamp range cap. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected — Analysis: Test_Snapshot_Partition_Capped_To_Timestamp_Range fails with a ClickHouse "Unknown table expression identifier" race condition (table queried before creation completes), unrelated to the last Avro-chunking commit, and compounded by a transient port-binding infrastructure failure in the maria matrix. ✅ Automatically retrying the workflow
❌ Test Failure — Analysis: TestMongoClickhouseSuite/Test_Snapshot_Partition_Capped_To_Timestamp_Range fails deterministically across all matrix variants with …
❌ Test Failure — Analysis: The primary failure …
❌ Test Failure — Analysis: TestMongoClickhouseSuite/Test_Snapshot_Partition_Capped_To_Timestamp_Range fails deterministically across all CI matrix builds with a consistent count mismatch (expected 5 rows, got 100), indicating a real regression — likely from the recent "Always chunk Avro on uncompressed bytes" commit breaking MongoDB snapshot partition timestamp-range capping.
Force-pushed from 9afdc6b to 36d16d2
Force-pushed from 36d16d2 to b106d5e
🔄 Flaky Test Detected — Analysis: Two e2e tests failed due to flaky infrastructure issues: a snapshot status timeout in TestGenericBQ/Test_Simple_Flow and a transient catalog DB connection termination (SQLSTATE 57P01 admin_shutdown) in TestPeerFlowE2ETestSuiteMySQL_CH/Test_Extra_CH_Columns, neither of which indicates a real code bug. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected — Analysis: Three e2e tests failed across different matrix jobs due to flaky infrastructure issues: a Temporal server connection timeout (context deadline exceeded), a schema teardown failure, and a CDC record-count race condition — none related to the … ✅ Automatically retrying the workflow
🔄 Flaky Test Detected — Analysis: TestApiPg/TestCancelAddCancel timed out in a WaitFor polling loop waiting for MV error messages, a timing-sensitive operation that passed in both other matrix configurations (PG16 and PG17), indicating a flaky e2e race condition rather than a code regression. ✅ Automatically retrying the workflow
Introduce parallel snapshotting for MongoDB.
The partitioning strategy uses min-max range partitioning: we fetch the smallest and largest ObjectId, extract the timestamps from them, take the timestamp delta, and divide it uniformly to create partitions. This is prone to data skew if the document insertion rate varies significantly over time (e.g. a huge backfill of data at once), but for most production workloads it should provide a reasonable distribution.
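A minimal Go sketch of that scheme, assuming hex-string ObjectIds; `objectIDTimestamp` and `timestampPartitions` are hypothetical names, not the PR's actual code:

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// objectIDTimestamp extracts the 4-byte big-endian Unix timestamp that
// leads every MongoDB ObjectId (represented as a hex string here).
func objectIDTimestamp(id string) (uint32, error) {
	raw, err := hex.DecodeString(id)
	if err != nil {
		return 0, err
	}
	if len(raw) != 12 {
		return 0, fmt.Errorf("expected 12-byte ObjectId, got %d bytes", len(raw))
	}
	return binary.BigEndian.Uint32(raw[:4]), nil
}

// timestampPartitions uniformly divides the [minTS, maxTS] second range,
// capping the partition count to the range so no empty partitions appear.
func timestampPartitions(minID, maxID string, numPartitions uint32) ([][2]uint32, error) {
	minTS, err := objectIDTimestamp(minID)
	if err != nil {
		return nil, err
	}
	maxTS, err := objectIDTimestamp(maxID)
	if err != nil {
		return nil, err
	}
	tsRange := maxTS - minTS + 1 // inclusive range, in seconds
	if numPartitions == 0 || numPartitions > tsRange {
		numPartitions = tsRange
	}
	step := tsRange / numPartitions
	parts := make([][2]uint32, 0, numPartitions)
	for i := uint32(0); i < numPartitions; i++ {
		lo := minTS + i*step
		hi := lo + step - 1
		if i == numPartitions-1 {
			hi = maxTS // the last partition absorbs any remainder
		}
		parts = append(parts, [2]uint32{lo, hi})
	}
	return parts, nil
}

func main() {
	// Min id at t=100s, max id at t=109s: a 10-second range split 4 ways.
	parts, err := timestampPartitions(
		"00000064aaaaaaaaaaaaaaaa",
		"0000006dbbbbbbbbbbbbbbbb",
		4,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(parts)
}
```

With a uniform split like this, a burst of inserts within a few seconds all lands in one partition, which is exactly the skew caveat above.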
Sometimes users supply their own custom `_id` column instead of using the default auto-generated ObjectId. We detect this by querying the smallest and largest `_id` (conveniently, already part of the min-max partitioning strategy). If both are ObjectIDs, this guarantees that every document in the collection has an ObjectId `_id`, since MongoDB orders the `_id` index by BSON type. If one or more non-ObjectID keys are detected, we fall back to full-table partitioning.

For now, this feature is behind a feature flag.
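The fallback decision reduces to a type check on just the two boundary values. A minimal sketch, with `objectID` standing in for the driver's 12-byte ObjectID type (all names here are hypothetical, not the PR's code):

```go
package main

import "fmt"

// objectID stands in for the driver's 12-byte ObjectID type.
type objectID [12]byte

// useRangePartitioning applies the rule above: range-partition only when
// BOTH the smallest and largest _id are ObjectIDs. Since MongoDB orders
// the _id index by BSON type, an ObjectID min and max imply that every
// _id in between is also an ObjectID.
func useRangePartitioning(minID, maxID any) bool {
	_, minIsOID := minID.(objectID)
	_, maxIsOID := maxID.(objectID)
	return minIsOID && maxIsOID
}

func main() {
	var a, b objectID
	fmt.Println(useRangePartitioning(a, b))        // both ObjectIDs: parallel snapshot
	fmt.Println(useRangePartitioning(a, "custom")) // mixed types: full-table fallback
}
```

Only two documents are ever inspected, so the check adds no extra scan over the collection.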
Testing
Added e2e tests to make sure that (1) a collection with ObjectIDs runs parallel snapshots, (2) a collection with non-ObjectID or mixed `_id` types uses full-table partitioning, and (3) an empty collection uses full-table partitioning.