[DO NOT MERGE] chore: stderr heartbeat + thread dump for deadlock investigation#1027

Draft
Anatolii Yatsuk (tolik0) wants to merge 1 commit into
mainfrom
tolik0/cdk/stderr-heartbeat-thread-dump

@tolik0

⚠️ Reference only — not for merge

This branch is investigation tooling for diagnosing connector stalls and heartbeat timeouts. It runs a daemon thread, so it should not be merged into main — engineers should cherry-pick locally when an investigation needs evidence.

What it adds

  1. Stderr heartbeat (every 30s) — writes STDOUT_HEARTBEAT: t=... msgs=... bytes=... print_blocked=... queue_size=... queue_full=... directly to fd 2, so Kubernetes captures it independently of stdout.
  2. Automatic thread dump on stall — if the message count stays frozen for 3 consecutive intervals (90s), the heartbeat dumps stack traces of every thread via `sys._current_frames()`. Re-dumps every ~5 min during an ongoing stall.
  3. Queue registry — small module so the heartbeat thread can read `queue.qsize()` / `queue.full()` without threading the queue through the API.

Why

The thread dump produces the decisive evidence for classifying stalls against the known patterns:

| Evidence in dump | Pattern |
| --- | --- |
| Main thread in `queue.put()` + queue near max | Full-queue self-deadlock (already fixed in #977) |
| Worker in `ssl.py` / `requests` `read()` | Missing socket idle timeout |
| Main in `queue.get()` + workers in `partition_enqueuer` sleep | Concurrent-generator starvation |

Without this, classifications rely on indirect symptoms; with it, the stack frame names the mechanism directly.
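
To make the mapping concrete, here is a rough sketch of how a dump could be matched against the three patterns. It is not part of this branch: `classify_stall` is a hypothetical helper, it assumes the dump has already been split into one text blob per thread name, and the substring checks are crude heuristics that a real investigation would verify by reading the frames.

```python
from typing import Dict


def classify_stall(stacks: Dict[str, str], queue_near_max: bool) -> str:
    """Map per-thread stack text to one of the known stall patterns.

    stacks: thread name -> traceback text for that thread (assumed format:
    the output of traceback.format_stack joined into one string).
    """
    main = stacks.get("MainThread", "")
    workers = [text for name, text in stacks.items() if name != "MainThread"]

    # Main thread blocked putting onto a queue that is (nearly) full.
    if "queue.py" in main and "in put" in main and queue_near_max:
        return "full-queue self-deadlock"

    # A worker blocked reading from a socket inside ssl.py or requests.
    if any(("ssl.py" in w or "requests" in w) and "in read" in w for w in workers):
        return "missing socket idle timeout"

    # Main waiting on the queue while workers sleep in partition_enqueuer.
    main_waiting = "queue.py" in main and "in get" in main
    workers_sleeping = any("partition_enqueuer" in w and "sleep" in w for w in workers)
    if main_waiting and workers_sleeping:
        return "concurrent-generator starvation"

    return "unclassified"
```

The queue state is passed in separately because it comes from the heartbeat line, not from the stack frames.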

Origin

Mirrors the diagnostic portion of #953. The `ConcurrentMessageRepository` deadlock-prevention diff is omitted because it has already landed in `main` via #977.

How to use locally

```bash
# In a connector workspace, pin this CDK ref:
poetry add 'git+https://github.com/airbytehq/airbyte-python-cdk.git@tolik0/cdk/stderr-heartbeat-thread-dump'

# Then run the connector normally; heartbeat lines and thread dumps land in stderr.
```
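
Once stderr has been captured, the heartbeat lines can be picked out and split into fields. The sketch below assumes each line starts with `STDOUT_HEARTBEAT:` and carries whitespace-separated `key=value` pairs, as in the format quoted above. `parse_heartbeat` is a hypothetical helper, the sample values are invented for illustration, and values are kept as strings because their exact formats are not specified here.

```python
from typing import Dict, Optional

PREFIX = "STDOUT_HEARTBEAT:"


def parse_heartbeat(line: str) -> Optional[Dict[str, str]]:
    """Return the key=value fields of a heartbeat line, or None for other lines."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    fields: Dict[str, str] = {}
    for token in line[len(PREFIX):].split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# Invented sample line; real values and formats may differ.
sample = "STDOUT_HEARTBEAT: msgs=120 queue_size=10000 queue_full=True"
parsed = parse_heartbeat(sample)
```

Comparing `msgs` across successive lines shows when output froze, and `queue_size` / `queue_full` at that moment indicate whether the queue was the bottleneck.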

Adds a daemon thread that writes periodic status to stderr every 30s
(message counts, bytes written, queue size/full state) and dumps all
thread stack traces via sys._current_frames() when the source has been
silent for 90+ seconds.

This is investigation tooling for diagnosing connector stalls and
heartbeat timeouts. Not intended for merge — provides the evidence
needed to classify stalls against known patterns (full-queue self-
deadlock, socket read hang, concurrent-generator starvation).

Mirrors the diagnostic portion of #953, with the ConcurrentMessageRepository
deadlock fix omitted (already on main via #977).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@tolik0/cdk/stderr-heartbeat-thread-dump#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch tolik0/cdk/stderr-heartbeat-thread-dump

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
@github-actions

PyTest Results (Fast)

593 tests  ±0   582 ✅ ±0   3m 35s ⏱️ +4s
  1 suites ±0    10 💤 ±0 
  1 files   ±0     1 ❌ ±0 

For more details on these failures, see this check.

Results for commit b6e1402. ± Comparison against base commit 19a7083.

@github-actions

PyTest Results (Full)

4 069 tests  ±0   4 057 ✅ ±0   11m 2s ⏱️ +12s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit b6e1402. ± Comparison against base commit 19a7083.
