feat: cdc checker by loomts · Pull Request #460 · apecloud/ape-dts

loomts · 2026-01-22T03:40:33Z

Summary

Supported matrix / constraints:
- standalone snapshot check: mysql / pg / mongo
- standalone struct check: mysql / pg only
- inline snapshot check: write sink + mysql / pg / mongo
- inline cdc check: write sink + mysql / pg only, requires parallel_type=rdb_check, pipeline_type=basic, and [resumer] resume_type=from_target|from_db
Breaking config changes:
- inline check no longer accepts [checker].db_type/url/username/password; target now always comes from [sinker]
- parallel_type=rdb_check now requires [checker]
- legacy resumer keys are not just deprecated; they are rejected
Runtime semantics:
- full checker queue applies backpressure instead of dropping check work
- mismatches are logged only and do not stop the main write path
- runtime checker errors are mode-dependent (inline fail-open vs standalone fail-close)
Operational impact:
- inline CDC check now persists durable checker state in the resumer backend, including unresolved rows in apedts_unconsistent_rows
- resume behavior is effectively “recheck unresolved rows first, then continue from the durable checkpoint”
- this may require schema/table create privileges on the resumer store
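
The supported matrix above can be read as a small validation function. The following is a minimal sketch, not the PR's actual config code: `CheckKind`, `DbType`, `CheckSetup`, and `validate` are hypothetical names introduced for illustration, and only the constraints listed in the summary are encoded.

```rust
// Hypothetical types; only the config values (write, rdb_check, basic,
// from_target, from_db) come from the summary above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DbType { Mysql, Pg, Mongo }

#[derive(Debug, Clone, Copy, PartialEq)]
enum CheckKind { StandaloneSnapshot, StandaloneStruct, InlineSnapshot, InlineCdc }

/// Flattened view of the config keys named in the summary (an assumption).
#[derive(Debug, Clone, Copy)]
struct CheckSetup<'a> {
    kind: CheckKind,
    db: DbType,
    sink_type: &'a str,
    parallel_type: &'a str,
    pipeline_type: &'a str,
    resume_type: Option<&'a str>,
}

fn validate(s: &CheckSetup) -> Result<(), String> {
    use CheckKind::*;
    // Struct check and inline cdc check are listed as mysql / pg only.
    if matches!(s.kind, StandaloneStruct | InlineCdc) && s.db == DbType::Mongo {
        return Err(format!("{:?} supports mysql / pg only", s.kind));
    }
    // Both inline modes require a write sink.
    if matches!(s.kind, InlineSnapshot | InlineCdc) && s.sink_type != "write" {
        return Err("inline check requires sink_type=write".to_string());
    }
    // Inline cdc check has three extra requirements.
    if s.kind == InlineCdc {
        if s.parallel_type != "rdb_check" {
            return Err("inline cdc check requires parallel_type=rdb_check".to_string());
        }
        if s.pipeline_type != "basic" {
            return Err("inline cdc check requires pipeline_type=basic".to_string());
        }
        if !matches!(s.resume_type, Some("from_target") | Some("from_db")) {
            return Err("inline cdc check requires resume_type=from_target|from_db".to_string());
        }
    }
    Ok(())
}
```

Returning `Err` rather than panicking keeps the sketch testable; the real task config loader may reject invalid combinations differently.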

Check flow


  source data / change stream
            |
            v
       [Pipeline]
            |
            +-----------------------------+-----------------------------+
            |                                                           |
            | standalone check                                          | inline check
            |                                                           |
            v                                                           v
      [DummySinker]                                      [RealSinker: WRITE first]
            |                                                           |
            +-----------------------------+-----------------------------+
                                          |
                                          v
                                   [CheckedSinker]
                                          |
                                          | enqueue checked rows
                                          | (bounded by queue_size)
                                          v
                              [DataCheckerHandle: queue/worker]
                                          |
                       +------------------+------------------+
                       |                                     |
                       | queue full                          | checker runtime error
                       v                                     v
                 BACKPRESSURE                       mode-dependent handling
            (main path waits here)                 - inline     => FAIL-OPEN
                                                   - standalone => FAIL-CLOSE
                                          |
                                          v
                                   [checker_engine]
                                          |
                      +-------------------+-------------------+-------------------+
                      |                                       |                   |
                      v                                       v                   v
               [MysqlChecker]                          [PgChecker]        [MongoChecker]
                      \                                       |                   /
                       \                                      |                  /
                        +-------------------------------------+-----------------+
                                                              |
                                                              v
                                              compare source vs final target state
                                                              |
                                                              +--> diff / miss
                                                                   |
                                                                   v
            [diff.log / miss.log / sql.log / summary.log / metrics / monitor]
            (mismatch is LOGGED; it does NOT stop the main path)
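
The mode-dependent error handling in the flow above can be sketched as follows. This is an illustration under stated assumptions: `CheckMode`, `CheckerStats`, and `handle_checker_error` are hypothetical names, not identifiers from the PR.

```rust
// Hypothetical names; only the fail-open / fail-close split comes from the diagram.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CheckMode { Inline, Standalone }

#[derive(Debug, Default)]
struct CheckerStats { swallowed_errors: u64 }

/// Decide what a checker runtime error means for the caller.
/// Inline: fail-open, the error is counted and the write path continues.
/// Standalone: fail-close, the error is returned and stops the task.
fn handle_checker_error(
    mode: CheckMode,
    err: String,
    stats: &mut CheckerStats,
) -> Result<(), String> {
    match mode {
        CheckMode::Inline => {
            stats.swallowed_errors += 1;
            eprintln!("checker error ignored in inline mode: {}", err);
            Ok(())
        }
        CheckMode::Standalone => Err(err),
    }
}
```

In standalone mode the check is the whole point of the task, so surfacing the error is the safer default; in inline mode the write path is the primary job.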


  CDC + CHECK PERSISTENCE / RECOVERY
  ==================================

                                [checker_engine]
                                       |
                                       v
                                    [cdc_state]
                                       |
                                       | checkpoint persist
                                       | - clean: checkpoint only
                                       | - dirty: checkpoint + unresolved rows (atomic)
                                       v
                             [CheckerStateStore]
                                       |
            +--------------------------+---------------------------+
            |                                                      |
            v                                                      v
  [last durable CDC checkpoint]                     [unresolved checker rows]
  [resumer].table_full_name                         <same schema>.apedts_unconsistent_rows
  e.g. apecloud_metadata.                           stores rows that still need recheck
       apedts_task_position

  stores the latest recoverable CDC checkpoint 
  aligned with durable checker state
            |                                                      |
            +--------------------------+---------------------------+
                                       |
                                       v
                                     RESUME
                                       |
               +-----------------------+------------------------+
               |                                                |
               v                                                v
  load CDC checkpoint                               load unresolved checker rows
  from apedts_task_position                         from apedts_unconsistent_rows
               \                                                /
                \                                              /
                 +--------------------------------------------+
                                      |
                                      v
                       if unresolved rows exist: RECHECK FIRST
                                      |
                                      v
                      then continue CDC from the durable checkpoint
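
The resume order in the diagram above (recheck unresolved rows first, then continue from the durable checkpoint) can be sketched as a planning function. `CheckerState`, `ResumeStep`, and `plan_resume` are hypothetical names, and rows and checkpoints are simplified to strings.

```rust
// Hypothetical model of the two persisted pieces of checker state.
#[derive(Debug, Clone, PartialEq)]
enum ResumeStep {
    /// Re-compare one unresolved row (identified here by an opaque key).
    Recheck(String),
    /// Continue the change stream from the durable checkpoint.
    ContinueFrom(String),
}

struct CheckerState {
    /// Stands in for the position stored in apedts_task_position.
    checkpoint: Option<String>,
    /// Stands in for the keys stored in apedts_unconsistent_rows.
    unresolved: Vec<String>,
}

/// Build the resume plan: all rechecks come before the CDC continuation.
fn plan_resume(state: &CheckerState) -> Vec<ResumeStep> {
    let mut steps: Vec<ResumeStep> = state
        .unresolved
        .iter()
        .cloned()
        .map(ResumeStep::Recheck)
        .collect();
    if let Some(cp) = &state.checkpoint {
        steps.push(ResumeStep::ContinueFrom(cp.clone()));
    }
    steps
}
```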

caiq1nyu · 2026-03-27T07:46:02Z

dt-connector/src/sinker/checked_sinker.rs

+}
+
+#[async_trait]
+pub trait CheckedSinkTarget: Sinker {


Why is it called checked_sinker 😶? A sinker that has already been checked?

caiq1nyu · 2026-03-27T07:49:39Z

dt-connector/src/sinker/checked_sinker.rs

+impl<S: CheckedSinkTarget + Send> Sinker for CheckedSinker<S> {
+    async fn sink_dml(&mut self, mut data: Vec<RowData>, batch: bool) -> anyhow::Result<()> {
+        self.inner.sink_dml_borrowed(&mut data, batch).await?;
+        self.checker.enqueue_check(data).await?;


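This looks problematic. If the check queue is full when enqueue_check sends, the async task may be suspended. sink_dml has to wait for enqueue_check to finish, so it is blocked too, and the whole CDC pipeline stalls. The goal should be that check logic does not affect the main flow: when sending to the check queue, add fallback handling for a full queue and other errors, so that enqueue_check can always return quickly.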
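
A minimal sketch of the fallback suggested in this comment, assuming a bounded channel with a non-blocking send. It uses `std::sync::mpsc::sync_channel` so the example is self-contained; the PR's queue is async, and `tokio::sync::mpsc::Sender::try_send` offers comparable full/closed outcomes. `CheckQueue`, `EnqueueOutcome`, and the `Vec<u64>` batch type are hypothetical. Note that this trades check coverage for latency, which is the opposite of the backpressure behavior described in the PR summary.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

#[derive(Debug, PartialEq)]
enum EnqueueOutcome { Queued, DroppedQueueFull, DroppedCheckerGone }

/// Hypothetical wrapper around the check queue; a batch is a Vec of row ids.
struct CheckQueue {
    tx: SyncSender<Vec<u64>>,
    dropped_batches: u64,
}

impl CheckQueue {
    fn new(queue_size: usize) -> (Self, Receiver<Vec<u64>>) {
        let (tx, rx) = sync_channel(queue_size);
        (Self { tx, dropped_batches: 0 }, rx)
    }

    /// Never blocks: a full or closed queue drops the batch and counts it.
    fn enqueue_check(&mut self, batch: Vec<u64>) -> EnqueueOutcome {
        match self.tx.try_send(batch) {
            Ok(()) => EnqueueOutcome::Queued,
            Err(TrySendError::Full(_)) => {
                self.dropped_batches += 1;
                EnqueueOutcome::DroppedQueueFull
            }
            Err(TrySendError::Disconnected(_)) => {
                self.dropped_batches += 1;
                EnqueueOutcome::DroppedCheckerGone
            }
        }
    }
}
```

Counting dropped batches keeps the loss visible in metrics, so operators can tell a quiet checker from a saturated one.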

caiq1nyu · 2026-03-27T08:06:48Z

dt-connector/src/checker/checker_engine.rs

+use anyhow::Context;
+use dt_common::meta::pg::pg_value_type::PgValueType;
+use dt_common::monitor::{counter_type::CounterType, task_metrics::TaskMetricsType};
+use std::collections::BTreeSet;


Nit: please keep imports tidy. External crates go first, internal ones below, and internal imports should be merged where possible.

caiq1nyu · 2026-03-27T08:17:25Z

dt-pipeline/src/base_pipeline.rs

            }
+            // cdc+check also needs refreshed table metadata after sink ddl changes the target schema
+            if let Some(checker) = &self.checker {
+                checker.refresh_meta(data.clone()).await?;


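Same issue as in sink_dml. The principle is that checking must not affect the main flow, and checking may sacrifice some accuracy for that. The checker refresh here should return quickly, but the current implementation publishes to an mpsc channel and can block. Also, the result is unwrapped with ?, so a check refresh error propagates up to the main logic.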

caiq1nyu · 2026-03-27T08:18:09Z

dt-pipeline/src/base_pipeline.rs

+        let mut position_persisted_by_checker = false;
+        if !matches!(record_position, Position::None) {
+            if let Some(checker) = &self.checker {
+                checker.record_checkpoint(record_position).await?;


Same here: the design should keep check from affecting the main path. Please go through the flow again with this principle in mind.
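
One way to keep checker calls such as refresh_meta and record_checkpoint from propagating errors with `?` is a small fail-open helper. This is a sketch of the principle only: `fail_open` and the `swallowed` log are hypothetical, and a real implementation would likely write to the task's logger or metrics instead of a `Vec`.

```rust
/// Convert a checker result into an Option so the caller cannot use `?` on it.
/// Errors are recorded with the operation name instead of being returned.
fn fail_open<T, E: std::fmt::Display>(
    op: &str,
    result: Result<T, E>,
    swallowed: &mut Vec<String>,
) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(err) => {
            swallowed.push(format!("{}: {}", op, err));
            None
        }
    }
}
```

The main path would then call something like `fail_open("refresh_meta", result, &mut log)` and continue regardless of the outcome.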

caiq1nyu · 2026-03-27T08:38:17Z

dt-connector/src/checker/cdc_state.rs

+        .iter()
+        .map(|(key, value)| (key.clone(), value.clone()))
+        .collect::<BTreeMap<_, _>>();
+    serde_json::to_string(&serde_json::json!({


Try to avoid JSON serialization here; it may be CPU-intensive. Please also review the other serde_json-related logic here.
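
As one possible alternative to building a JSON value per row, a key can be written directly into a `String` with length prefixes. This is a sketch under assumptions: `encode_key` is a hypothetical helper, column values are simplified to strings, and whether it is actually cheaper than serde_json in this code path would need to be measured.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Encode sorted (column, value) pairs as `len:name=len:value;` segments.
/// Byte-length prefixes keep the encoding unambiguous even if a value
/// contains `=` or `;`.
fn encode_key(cols: &BTreeMap<String, String>) -> String {
    let mut out = String::new();
    for (name, value) in cols {
        // Writing into a String cannot fail, so the result is ignored.
        let _ = write!(out, "{}:{}={}:{};", name.len(), name, value.len(), value);
    }
    out
}
```

Because `BTreeMap` iterates in key order, the same set of columns always produces the same key.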

caiq1nyu · 2026-03-27T08:43:04Z

dt-connector/src/checker/cdc_state.rs

+            ColValue::Enum(v) => Self::Enum(*v),
+            ColValue::Set2(v) => Self::Set2(v.clone()),
+            ColValue::Enum2(v) => Self::Enum2(v.clone()),
+            ColValue::Json(v) => Self::Json(v.clone()),


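It looks like PersistedColValue exists only to make the record readable, so json, mongodoc, and even blob and bit values could be left out. Their content can be large, cloning it is too costly, and binary types such as blob are not readable anyway. It would also be good to cap the number of diff rows on the outside.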

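loomts · 2026-03-27

This is mainly so that the sinker's unconsistent_rows keeps the original data types for the next comparison against the target. Recording everything may be more convenient, but it is indeed bloated. I'll change it to record only the PKs of inconsistent rows, and read the rows from source and target again after a restart.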
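
The approach agreed on here (persist only primary keys, re-read both sides on restart) can be sketched as follows. `Pk`, `Row`, and `recheck_unresolved` are hypothetical, and the source and target are modeled as in-memory maps rather than database reads.

```rust
use std::collections::BTreeMap;

// Hypothetical simplifications: a numeric primary key and string columns.
type Pk = u64;
type Row = BTreeMap<String, String>;

/// Re-read each persisted PK from source and target and return the PKs that
/// are still inconsistent. A row missing on both sides counts as consistent.
fn recheck_unresolved(
    pks: &[Pk],
    source: &BTreeMap<Pk, Row>,
    target: &BTreeMap<Pk, Row>,
) -> Vec<Pk> {
    pks.iter()
        .copied()
        .filter(|pk| source.get(pk) != target.get(pk))
        .collect()
}
```

Only the keys need to be stored durably, which avoids cloning large json or blob values into the checker state.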

caiq1nyu · 2026-03-27T08:50:06Z

docs/en/cdc/sync.md

+- keep `[sinker] sink_type=write`
+- add `[checker]`
+- add `[resumer] resume_type=from_target` or `from_db`
+- use `[parallelizer] parallel_type=rdb_check`


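It looks like enabling check for CDC requires setting parallelizer.parallel_type=rdb_check, but rdb_check is really just a wrapper around rdb_merge. I would prefer a config entry point like this:

[parallelizer]
parallel_type=rdb_merge/serial/other
xxx
[checker]
enable=true
xxx

Then, for example, CDC check is currently supported with rdb_merge and not yet with serial; task_config validation would panic with an invalid-configuration error for unsupported combinations.

Using parallel_type=rdb_check makes it hard to extend support to other parallelizers.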
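
A minimal sketch of the validation this comment proposes, assuming an independent `[checker] enable` switch. `ParallelType`, `CheckerConfig`, and `validate_cdc_check` are hypothetical names, and the sketch returns an error where the comment suggests a panic, leaving that decision to the caller.

```rust
// Hypothetical config model for the proposed layout.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParallelType { RdbMerge, Serial }

struct CheckerConfig { enable: bool }

/// Reject combinations where CDC check is enabled on an unsupported parallelizer.
fn validate_cdc_check(parallel: ParallelType, checker: &CheckerConfig) -> Result<(), String> {
    if !checker.enable {
        // The checker is off, so any parallelizer is acceptable.
        return Ok(());
    }
    match parallel {
        ParallelType::RdbMerge => Ok(()),
        other => Err(format!(
            "[checker] enable=true is not supported with parallel_type={:?}",
            other
        )),
    }
}
```

Supporting another parallelizer later would only mean adding a match arm, without introducing a new parallel_type value.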

loomts marked this pull request as draft January 23, 2026 12:20

loomts force-pushed the feat/cdc-checker branch 2 times, most recently from fd2c9d2 to 5e85cf9 Compare February 9, 2026 13:51

loomts force-pushed the feat/cdc-checker branch from 3a1c850 to 2f1f040 Compare March 5, 2026 10:47

loomts force-pushed the feat/cdc-checker branch 2 times, most recently from fc4df7b to c3e4c41 Compare March 19, 2026 03:07

loomts added 24 commits March 23, 2026 21:42

refactor: split checker and sinker

690a03c

feat: wire cdc check runner

37900c3

refactor: remove checker arc

b3b1256

feat: add s3 check logs

6177854

feat: persist checker state

a3d8683

test: refresh cdc summaries

03e83ca

refactor(check): simplify checker recovery state

208fae7

fix cdc checker flow and simplify dt-tests

365a585

trim checker hot-path clones

4eb255d

refactor: stage checker simplification baseline

850c7fd

refactor: simplify checker state scoping

f14282b

refactor: split task kind and check mode

564187f

fix: wire standalone struct checker

8e90925

refactor: share cdc+check checkpoint state

9abd213

test(task-id): lock check_log default task id

46566c4

Add a canary test that freezes the default task_id for check_log configs so internal enum churn does not silently change external prefixes and orchestration contracts.

refactor(parallelizer): inline merge sink dispatch

dc7b454

Inline the merge sink dispatch into MergeParallelizer::sink_dml and drop the extra helper layer to keep the hot path simpler.

fix(checker): restore critical follow-up fixes

d7edbc0

cleanup some tests

ddc98f1

check cdc resume

e9dd6c3

feat: add refresh_meta to update metadata in checkers

811ad25

simplify check configurations

1865d2a

enhance test configurations and cleanup

1daba05

fix docs and add more tests

2d2b6ee

refactor: update checker methods to public and simplify test configur…

2677e98

…ations

loomts added 11 commits March 23, 2026 23:16

enhance checker configuration and support for CDC checks & cleanup so…

04ec86c

…me deprecated codes

simplify StructCheckerHandle

1958e8e

simplify

efaa6cc

fix lint and simplify

a169fda

simplify and limit sql.log

8a9ab5e

fix docs, config, tests

6e4093a

no fast-fail async check sidecar

59339c8

fix task configuration and error handling & enhance mongo extractor

f7aa057

chore

98b0783

cdc check manage persisted ids

cd43459

cleanup

16425fc

loomts force-pushed the feat/cdc-checker branch from ed7d1cf to 16425fc Compare March 24, 2026 07:29

loomts added 3 commits March 24, 2026 18:15

cleanup

8a2807e

remove some deep copy & add new test cases for cdc resume

72b6b58

enforce parallelizer for CDC tasks & cleanup

23a51b2

loomts marked this pull request as ready for review March 26, 2026 03:46

caiq1nyu reviewed Mar 27, 2026
loomts wants to merge 38 commits into main from feat/cdc-checker

2 participants
