fix: prevent indefinite hangs during TLS operations (#819)#955
fix: prevent indefinite hangs during TLS operations (#819)#955CharlesWong wants to merge 5 commits intoprojectdiscovery:mainfrom
Conversation
…#819) Three independent bugs that together cause tlsx to hang indefinitely on long target lists: 1. ztls tlsHandshakeWithTimeout: the select statement used 'case errChan <- tlsConn.Handshake()' which evaluates Handshake() synchronously before the channel send — the ctx.Done() branch could never fire, making timeouts impossible. Fixed by running Handshake() in a goroutine and closing the connection on context cancellation. 2. ztls EnumerateCiphers: called tlsHandshakeWithTimeout with context.TODO() (no timeout). Replaced with a per-handshake context.WithTimeout using the configured timeout value. 3. tls EnumerateCiphers: called conn.Handshake() directly without any context or timeout. Replaced with conn.HandshakeContext() using a per-handshake timeout context. All three paths can block indefinitely on unresponsive hosts. With high concurrency (300+) and thousands of targets, goroutines accumulate until the process hangs. Fixes projectdiscovery#819
Neo - PR Security ReviewNo security issues found Highlights
Comment |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces direct/blocking TLS handshakes with context-aware handshakes: uses HandshakeContext when available or runs Handshake in a goroutine with a per-call context (optional timeout), cancels contexts and closes connections on timeout to unblock handshake attempts. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant HandshakeCtx as "Handshake Context"
participant Goroutine
participant TLSConn
Caller->>HandshakeCtx: create context (Background or WithTimeout)
Caller->>Goroutine: start Handshake() with TLSConn
Goroutine->>TLSConn: Handshake() (blocking)
alt handshake completes before timeout
TLSConn-->>Goroutine: success (or tls.ErrCertsOnly)
Goroutine-->>Caller: return success
Caller->>HandshakeCtx: cancel()
Caller->>TLSConn: close base connection
else timeout / ctx.Done
HandshakeCtx-->>Caller: ctx.Done()
Caller->>TLSConn: close TLSConn to unblock Handshake
Goroutine-->>Caller: Handshake returns error
Caller->>HandshakeCtx: cancel()
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tlsx/tls/tls.go`:
- Around line 239-245: The new handshake timeout is always created with
context.WithTimeout which makes a zero Timeout behave as an already-expired
context; change the code around conn.HandshakeContext so that if
c.options.Timeout == 0 you use a non-expiring context (e.g. context.Background()
or a cancellable background context) and only call context.WithTimeout when
Timeout > 0, ensuring you only call cancel() when WithTimeout was used; update
the logic that sets handshakeCtx and cancel and keep the existing use of
conn.HandshakeContext, c.options.Timeout, and appending to enumeratedCiphers
intact.
In `@pkg/tlsx/ztls/ztls.go`:
- Around line 260-266: The handshake context creation in the cipher enumeration
loop should guard against a zero timeout to avoid creating an
immediately-expired context; change the block that creates handshakeCtx and
cancel so that if c.options.Timeout > 0 you call
context.WithTimeout(context.Background(),
time.Duration(c.options.Timeout)*time.Second) and assign the returned cancel,
otherwise set handshakeCtx = context.Background() and cancel to a no-op
function, then pass handshakeCtx to c.tlsHandshakeWithTimeout(conn,
handshakeCtx) and still call cancel() after the handshake; update references in
this scope (c.options.Timeout, tlsHandshakeWithTimeout, conn.GetHandshakeLog,
enumeratedCiphers) accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2955b693-7970-480a-a978-891fd5b14554
📒 Files selected for processing (2)
pkg/tlsx/tls/tls.gopkg/tlsx/ztls/ztls.go
Per review feedback: when c.options.Timeout is 0, context.WithTimeout(ctx, 0) creates an already-expired context. Fall back to a 10s default in both tls and ztls cipher enumeration.
Per CodeRabbit review: replace the 'default to 10s fallback' pattern with the idiomatic Go pattern already used in the main connect path (tls.go:109 and ztls.go:118). When Timeout==0, use context.Background() (no expiry) rather than an always-expired or magic-default context. - pkg/tlsx/tls/tls.go: SupportedTLSCiphers cipher enum loop - pkg/tlsx/ztls/ztls.go: SupportedTLSCiphers cipher enum loop
|
Fixed in 57d4d61: replaced the '10s fallback' pattern with the idiomatic Go guard already used in the main connect path — when Timeout==0, handshakeCtx is now context.Background() (no expiry). Applied to both pkg/tlsx/tls/tls.go and pkg/tlsx/ztls/ztls.go. Compiles clean. |
- tlsHandshakeWithTimeout: move ctx to first parameter per Go convention (context.Context should always be the first parameter) - Remove duplicate stale comment line '// tlsHandshakeWithCtx attempts...' that was left alongside the new doc comment
|
Senior dev review follow-up (b936f57): two issues fixed:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tlsx/ztls/ztls.go`:
- Around line 332-357: The tlsHandshakeWithTimeout function can return a false
timeout when Handshake() and ctx deadline become ready simultaneously; update
tlsHandshakeWithTimeout to prefer the handshake result by placing the err :=
<-errChan case before the <-ctx.Done() case and add an atomic boolean (using
sync/atomic) to track whether the handshake is still in progress; set the atomic
flag to “running” before starting the goroutine, clear it when the goroutine
returns (or on successful err read), and in the ctx.Done() branch only close
tlsConn and return the timeout error if the atomic flag still indicates the
handshake is running, ensuring completed handshakes aren’t aborted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1d4c31d5-3889-4cd3-9878-f35355b0a58b
📒 Files selected for processing (1)
pkg/tlsx/ztls/ztls.go
Go's select statement is non-deterministic when multiple cases are ready at the same time. When Handshake() completes at the exact moment the context deadline fires, the original code could randomly take the ctx.Done() branch and return a spurious timeout error even though the handshake succeeded. Fix: reorder cases to check errChan first, and add a non-blocking errChan drain in the ctx.Done() branch so a just-completed handshake is never discarded in favour of a timeout.
|
Fixed in 7194d08. The select fairness issue is real — when both errChan and ctx.Done() are ready simultaneously, Go picks randomly. The fix:
This is semantically correct: if the handshake finished, we should use its result regardless of whether the deadline also fired. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/tlsx/ztls/ztls.go (1)
359-372:⚠️ Potential issue | 🟠 MajorThe timeout branch can still race with a just-finished handshake.
The non-blocking drain only fixes the exact simultaneous-ready case. If
Handshake()returns after Line 368 falls throughdefaultbut before Line 371 closes the connection, this still reports a timeout and tears down a successful handshake. You still need a completion flag/CAS around the close so only genuinely in-flight handshakes are aborted.Possible fix
+import "sync/atomic" + func (c *Client) tlsHandshakeWithTimeout(ctx context.Context, tlsConn *tls.Conn) error { + var state uint32 // 0=running, 1=completed, 2=timed out errChan := make(chan error, 1) go func() { - errChan <- tlsConn.Handshake() + err := tlsConn.Handshake() + atomic.CompareAndSwapUint32(&state, 0, 1) + errChan <- err }() select { case err := <-errChan: if err == tls.ErrCertsOnly { err = nil } return err case <-ctx.Done(): - // Prefer a completed handshake result that arrived at the same time - // as the deadline — select is non-deterministic when both are ready. - select { - case err := <-errChan: - if err == tls.ErrCertsOnly { - err = nil - } - return err - default: - } - // Close the connection to unblock the goroutine stuck in Handshake() - _ = tlsConn.Close() - return errorutil.NewWithTag("ztls", "timeout while attempting handshake") //nolint + if atomic.CompareAndSwapUint32(&state, 0, 2) { + _ = tlsConn.Close() + return errorutil.NewWithTag("ztls", "timeout while attempting handshake").Wrap(ctx.Err()) //nolint + } + err := <-errChan + if err == tls.ErrCertsOnly { + return nil + } + return err } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tlsx/ztls/ztls.go` around lines 359 - 372, The timeout branch can race with a just-finished Handshake; modify the handshake coordination so the goroutine that runs tlsConn.Handshake() sets a shared completion flag (e.g., an atomic.Bool or a boolean updated under a mutex) before sending into errChan, and have the ctx.Done() handler check that same flag with a CAS/read before closing tlsConn and returning the timeout error; reference errChan, tlsConn, Handshake(), ctx.Done(), and errorutil.NewWithTag("ztls", "timeout while attempting handshake") when applying the fix so only truly in-flight handshakes are aborted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/tlsx/ztls/ztls.go`:
- Around line 359-372: The timeout branch can race with a just-finished
Handshake; modify the handshake coordination so the goroutine that runs
tlsConn.Handshake() sets a shared completion flag (e.g., an atomic.Bool or a
boolean updated under a mutex) before sending into errChan, and have the
ctx.Done() handler check that same flag with a CAS/read before closing tlsConn
and returning the timeout error; reference errChan, tlsConn, Handshake(),
ctx.Done(), and errorutil.NewWithTag("ztls", "timeout while attempting
handshake") when applying the fix so only truly in-flight handshakes are
aborted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e5c33ef7-e23a-470a-b1a8-38dcb6491e4a
📒 Files selected for processing (1)
pkg/tlsx/ztls/ztls.go
|
Thanks for the security review @neo-by-projectdiscovery-dev — glad no issues were found. The timeout guard now uses |
|
Acknowledged @neo-by-projectdiscovery-dev[bot]'s review. |
|
Acknowledged @coderabbitai[bot]'s review. |
|
Tip For best results, initiate chat on the files or code changes.
|
Summary
Fixes #819 — tlsx hangs indefinitely on long target lists (~25k+ hosts) with cipher enumeration enabled.
Root Cause Analysis
Three independent bugs that together cause goroutine accumulation until the process deadlocks:
Bug 1:
tlsHandshakeWithTimeoutis fundamentally broken (ztls)The Go spec states that in a
select, all channel operands are evaluated before the select begins. SotlsConn.Handshake()runs to completion (or hangs forever) before the send toerrChaneven starts. Thectx.Done()case exists but can never execute while a handshake is stuck.Bug 2:
context.TODO()in ztls cipher enumerationEven after fixing Bug 1,
context.TODO()provides no deadline — handshakes can still block forever.Bug 3: No timeout in tls (crypto/tls) cipher enumeration
The standard TLS client cipher enumeration had zero timeout protection.
Why This PR
This PR fixes all three timeout paths with the minimum necessary changes, includes detailed root cause analysis, and closes the connection on timeout to prevent goroutine leaks.
Changes
pkg/tlsx/ztls/ztls.gotlsHandshakeWithTimeout: run Handshake in goroutine, close conn on timeoutEnumerateCiphers: replacecontext.TODO()with per-handshake timeout contextpkg/tlsx/tls/tls.goEnumerateCiphers: replace bareHandshake()withHandshakeContext()using timeoutTesting
The bug manifests only after hours with 25k+ targets and high concurrency, making it impractical to reproduce in CI. The fix is verified by code inspection: all three blocking paths now have enforced timeouts.
Fixes #819
Summary by CodeRabbit
Bug Fixes
Chores