Fix MySQL CDC hang when SSH tunnel dies #4067
Conversation
❌ 2 Tests Failed:
🔄 Flaky Test Detected. Analysis: The test failed due to "context deadline exceeded" when connecting to the Temporal server, a transient infrastructure/network timeout rather than a code logic failure. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: Multiple Snowflake e2e tests failed with "UNEXPECTED STATUS TIMEOUT STATUS_SETUP", indicating transient connectivity/availability issues with the external Snowflake service during flow initialization, unrelated to the partitioned-table column-ordering code change. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: The e2e test ✅ Automatically retrying the workflow
Double-checked the logs in a local e2e test; everything looks as expected. Should be good to go.
🔄 Flaky Test Detected. Analysis: The e2e test suite hit the 900-second timeout deadline (FAIL github.com/PeerDB-io/peerdb/flow/e2e 900.680s), indicating a flaky timeout rather than a deterministic test assertion failure. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: All three failures are timing/resource-related (setup timeouts, teardown failures, wait-for-load timeouts) with no logic assertion errors, occurring across unrelated matrix jobs on the same commit: classic CI flakiness. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: The test ✅ Automatically retrying the workflow
```go
// the keepalive fires, something else closed the connection (driver timeout, etc.)
select {
case <-keepaliveChan:
	t.Log("SSH keepalive detected failure")
```
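The hunk above selects on a channel that the keepalive goroutine signals when a tunnel ping fails. A minimal self-contained sketch of that pattern (`runKeepalive`, `detect`, and the simulated failing ping are hypothetical names for illustration, not PeerDB's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// runKeepalive is a stand-in for an SSH keepalive loop: it pings the
// tunnel every interval and closes failCh on the first failed ping.
func runKeepalive(interval time.Duration, ping func() error, failCh chan<- struct{}) {
	for {
		time.Sleep(interval)
		if err := ping(); err != nil {
			close(failCh)
			return
		}
	}
}

// detect wires a simulated dying tunnel to the keepalive loop and
// reports what the select observes, mirroring the test hunk above.
func detect() string {
	failCh := make(chan struct{})
	pings := 0
	// Simulated tunnel: the third ping fails.
	ping := func() error {
		pings++
		if pings >= 3 {
			return fmt.Errorf("tunnel closed")
		}
		return nil
	}
	go runKeepalive(time.Millisecond, ping, failCh)

	select {
	case <-failCh:
		return "SSH keepalive detected failure"
	case <-time.After(time.Second):
		return "timed out waiting for keepalive"
	}
}

func main() {
	fmt.Println(detect())
}
```

Closing the channel (rather than sending on it) lets any number of waiters observe the failure.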
It looks like the prod keepaliveInterval is set to 15s; is my understanding correct that it would take that long to detect a failure? If so, it may be worth setting a shorter interval for tests.
I noticed that with the introduction of these two tests, the mysql tests now take 120s (previously 60s).
Added t.Parallel(), which brought it down to 30s. Overriding the keepalive interval got hairier than I'd prefer with the current code structure, and 30s is in line with the other suites.
🔄 Flaky Test Detected. Analysis: TestCancelTableAdditionRemoveAddRemove failed due to a catalog connection being terminated (SQLSTATE 57P01/admin_shutdown), likely caused by the concurrently running TestCancelErrorOnPgZeroOids test calling pg_terminate_backend() as part of its error-scenario testing while both tests share the same PostgreSQL catalog under -p 32 parallelism. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestApiPg/TestDoubleClickCancelTableAddition failed due to a PostgreSQL catalog connection being terminated by an admin command (SQLSTATE 57P01) mid-test, a transient infrastructure event rather than a code bug. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: e2e TestApiPg/TestQRep timed out after the catalog PostgreSQL connection was killed mid-test with SQLSTATE 57P01 (admin_shutdown), a transient infrastructure failure caused by resource contention from 32 parallel tests. ✅ Automatically retrying the workflow
When the SSH connection dies during CDC, the keepalive machinery notices, but BinlogSyncer is not aware. To make it aware, close the SSH connection in addition to the MySQL query connection. Add tests for hangs in mystream.GetEvent() and syncer.Close() (both observed in production). Verified that both tests fail without the fix.
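The hang and the fix can be illustrated with net.Pipe: a goroutine blocked in Read (standing in for BinlogSyncer inside mystream.GetEvent()) returns only once its peer connection is closed, which is why closing the SSH tunnel, and not just the MySQL query connection, is what unblocks it. A sketch with hypothetical names, not PeerDB's actual types:

```go
package main

import (
	"fmt"
	"net"
)

// readUntilClosed blocks in Read the way BinlogSyncer does inside
// mystream.GetEvent(), and returns the error that unblocked it.
func readUntilClosed(conn net.Conn) error {
	buf := make([]byte, 1)
	_, err := conn.Read(buf)
	return err
}

// simulateTunnelDeath shows the fix's mechanism: the blocked reader
// stays stuck until the tunnel side of the pipe is actually closed.
func simulateTunnelDeath() error {
	binlogSide, tunnelSide := net.Pipe()
	done := make(chan error, 1)
	go func() { done <- readUntilClosed(binlogSide) }()
	// Keepalive failure path: close the SSH side too, not only the query conn.
	tunnelSide.Close()
	return <-done
}

func main() {
	err := simulateTunnelDeath()
	fmt.Println("blocked reader unblocked with error:", err != nil)
}
```

Without the Close, the reader goroutine would block forever, which is the production hang this PR fixes.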