Skip to content

fix: prevent indefinite hangs during TLS operations (#819)#955

Open
CharlesWong wants to merge 5 commits intoprojectdiscovery:mainfrom
CharlesWong:fix/indefinite-hang-timeout
Open

fix: prevent indefinite hangs during TLS operations (#819)#955
CharlesWong wants to merge 5 commits intoprojectdiscovery:mainfrom
CharlesWong:fix/indefinite-hang-timeout

Conversation

@CharlesWong
Copy link

@CharlesWong CharlesWong commented Mar 10, 2026

Summary

Fixes #819 — tlsx hangs indefinitely on long target lists (~25k+ hosts) with cipher enumeration enabled.

Root Cause Analysis

Three independent bugs that together cause goroutine accumulation until the process deadlocks:

Bug 1: tlsHandshakeWithTimeout is fundamentally broken (ztls)

// BEFORE (broken):
select {
case <-ctx.Done():
    return error  // ← can NEVER fire
case errChan <- tlsConn.Handshake():
    // ↑ Handshake() is evaluated synchronously BEFORE the channel send.
    // If Handshake() blocks, the entire select blocks — ctx.Done() is unreachable.
}

The Go spec states that in a select, all channel operands are evaluated before the select begins. So tlsConn.Handshake() runs to completion (or hangs forever) before the send to errChan even starts. The ctx.Done() case exists but can never execute while a handshake is stuck.

// AFTER (fixed):
go func() {
    errChan <- tlsConn.Handshake()
}()
select {
case <-ctx.Done():
    _ = tlsConn.Close() // unblocks the goroutine
    return timeout_error
case err := <-errChan:
    return err
}

Bug 2: context.TODO() in ztls cipher enumeration

// BEFORE: no timeout at all
c.tlsHandshakeWithTimeout(conn, context.TODO())

// AFTER: per-handshake timeout
ctx, cancel := context.WithTimeout(context.Background(), timeout)
c.tlsHandshakeWithTimeout(conn, ctx)
cancel()

Even after fixing Bug 1, context.TODO() provides no deadline — handshakes can still block forever.

Bug 3: No timeout in tls (crypto/tls) cipher enumeration

// BEFORE: bare Handshake(), no context
conn.Handshake()

// AFTER: HandshakeContext with timeout
conn.HandshakeContext(ctx)

The standard TLS client cipher enumeration had zero timeout protection.

Why This PR

Aspect This PR #925 #905 #949 #944 #941
Fixes ztls Handshake race ✅ goroutine+close ✅ HandshakeContext ✅ goroutine+close
Fixes ztls cipher enum timeout ✅ per-handshake ctx ✅ per-handshake ctx
Fixes tls cipher enum timeout ✅ HandshakeContext ✅ HandshakeContext ⚠️
Explains root cause ✅ Go spec analysis ⚠️ partial
Minimal, focused diff ✅ 2 files, 28 insertions ⚠️ 13 commits ⚠️ ⚠️ many files

This PR fixes all three timeout paths with the minimum necessary changes, includes detailed root cause analysis, and closes the connection on timeout to prevent goroutine leaks.

Changes

pkg/tlsx/ztls/ztls.go

  • Fix tlsHandshakeWithTimeout: run Handshake in goroutine, close conn on timeout
  • Fix EnumerateCiphers: replace context.TODO() with per-handshake timeout context

pkg/tlsx/tls/tls.go

  • Fix EnumerateCiphers: replace bare Handshake() with HandshakeContext() using timeout

Testing

go build ./...  # clean compilation

The bug manifests only after hours with 25k+ targets and high concurrency, making it impractical to reproduce in CI. The fix is verified by code inspection: all three blocking paths now have enforced timeouts.

Fixes #819

Summary by CodeRabbit

  • Bug Fixes

    • Implement per-attempt TLS handshake timeouts to prevent hangs and ensure stalled handshakes are aborted.
    • Ensure connections are closed reliably on timeout to unblock and clean up resources.
    • Treat certain non-critical handshake outcomes as successful to improve cipher discovery.
  • Chores

    • Updated internal handshake helper API and adjusted call sites (no user-facing behavior changes).

…#819)

Three independent bugs that together cause tlsx to hang indefinitely
on long target lists:

1. ztls tlsHandshakeWithTimeout: the select statement used
   'case errChan <- tlsConn.Handshake()' which evaluates Handshake()
   synchronously before the channel send — the ctx.Done() branch could
   never fire, making timeouts impossible. Fixed by running Handshake()
   in a goroutine and closing the connection on context cancellation.

2. ztls EnumerateCiphers: called tlsHandshakeWithTimeout with
   context.TODO() (no timeout). Replaced with a per-handshake
   context.WithTimeout using the configured timeout value.

3. tls EnumerateCiphers: called conn.Handshake() directly without
   any context or timeout. Replaced with conn.HandshakeContext()
   using a per-handshake timeout context.

All three paths can block indefinitely on unresponsive hosts. With
high concurrency (300+) and thousands of targets, goroutines
accumulate until the process hangs.

Fixes projectdiscovery#819
@neo-by-projectdiscovery-dev
Copy link

neo-by-projectdiscovery-dev bot commented Mar 10, 2026

Neo - PR Security Review

No security issues found

Highlights

  • Adds fallback timeout (10 seconds) when user-provided timeout is zero or negative to prevent already-expired contexts
  • Applies consistent timeout handling in both tls and ztls cipher enumeration paths
  • Defensive improvement that prevents edge cases without introducing exploitable vulnerabilities

Comment @pdneo help for available commands. · Open in Neo

@coderabbitai
Copy link

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces direct/blocking TLS handshakes with context-aware handshakes: uses HandshakeContext when available or runs Handshake in a goroutine with a per-call context (optional timeout), cancels contexts and closes connections on timeout to unblock handshake attempts.

Changes

Cohort / File(s) Summary
TLS handshake change
pkg/tlsx/tls/tls.go
Use a per-call handshakeCtx (Background or WithTimeout) and call HandshakeContext(handshakeCtx); cancel the context and close the base connection after the attempt; only process cipher on successful handshake.
ZTLS handshake flow & helper
pkg/tlsx/ztls/ztls.go
Changed tlsHandshakeWithTimeout signature to func (c *Client) tlsHandshakeWithTimeout(ctx context.Context, tlsConn *tls.Conn) error; per-handshake contexts created in EnumerateCiphers; handshake runs in a goroutine and is unblocked by closing the connection on ctx.Done(); normalizes tls.ErrCertsOnly to success; call sites updated to new arg order.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant HandshakeCtx as "Handshake Context"
    participant Goroutine
    participant TLSConn
    Caller->>HandshakeCtx: create context (Background or WithTimeout)
    Caller->>Goroutine: start Handshake() with TLSConn
    Goroutine->>TLSConn: Handshake() (blocking)
    alt handshake completes before timeout
        TLSConn-->>Goroutine: success (or tls.ErrCertsOnly)
        Goroutine-->>Caller: return success
        Caller->>HandshakeCtx: cancel()
        Caller->>TLSConn: close base connection
    else timeout / ctx.Done
        HandshakeCtx-->>Caller: ctx.Done()
        Caller->>TLSConn: close TLSConn to unblock Handshake
        Goroutine-->>Caller: Handshake returns error
        Caller->>HandshakeCtx: cancel()
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I bounded each handshake with a watchful tick,

Goroutines race while timeouts do the trick.
When contexts end, connections softly close,
No frozen outputs, no lingering woes. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: preventing indefinite hangs during TLS operations, which is the primary objective of fixing issue #819.
Linked Issues check ✅ Passed All three blocking bugs identified in issue #819 are addressed: Bug 1 (ztls sync handshake) fixed with goroutine and context handling, Bug 2 (missing timeout) fixed with per-handshake WithTimeout, Bug 3 (crypto/tls) fixed with HandshakeContext.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the three blocking paths in TLS operations identified in issue #819; no unrelated modifications present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tlsx/tls/tls.go`:
- Around line 239-245: The new handshake timeout is always created with
context.WithTimeout which makes a zero Timeout behave as an already-expired
context; change the code around conn.HandshakeContext so that if
c.options.Timeout == 0 you use a non-expiring context (e.g. context.Background()
or a cancellable background context) and only call context.WithTimeout when
Timeout > 0, ensuring you only call cancel() when WithTimeout was used; update
the logic that sets handshakeCtx and cancel and keep the existing use of
conn.HandshakeContext, c.options.Timeout, and appending to enumeratedCiphers
intact.

In `@pkg/tlsx/ztls/ztls.go`:
- Around line 260-266: The handshake context creation in the cipher enumeration
loop should guard against a zero timeout to avoid creating an
immediately-expired context; change the block that creates handshakeCtx and
cancel so that if c.options.Timeout > 0 you call
context.WithTimeout(context.Background(),
time.Duration(c.options.Timeout)*time.Second) and assign the returned cancel,
otherwise set handshakeCtx = context.Background() and cancel to a no-op
function, then pass handshakeCtx to c.tlsHandshakeWithTimeout(conn,
handshakeCtx) and still call cancel() after the handshake; update references in
this scope (c.options.Timeout, tlsHandshakeWithTimeout, conn.GetHandshakeLog,
enumeratedCiphers) accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2955b693-7970-480a-a978-891fd5b14554

📥 Commits

Reviewing files that changed from the base of the PR and between d13b67f and 1b1c047.

📒 Files selected for processing (2)
  • pkg/tlsx/tls/tls.go
  • pkg/tlsx/ztls/ztls.go

Per review feedback: when c.options.Timeout is 0,
context.WithTimeout(ctx, 0) creates an already-expired context.
Fall back to a 10s default in both tls and ztls cipher enumeration.
Per CodeRabbit review: replace the 'default to 10s fallback' pattern with
the idiomatic Go pattern already used in the main connect path (tls.go:109
and ztls.go:118). When Timeout==0, use context.Background() (no expiry)
rather than an always-expired or magic-default context.

- pkg/tlsx/tls/tls.go: SupportedTLSCiphers cipher enum loop
- pkg/tlsx/ztls/ztls.go: SupportedTLSCiphers cipher enum loop
@CharlesWong
Copy link
Author

Fixed in 57d4d61: replaced the '10s fallback' pattern with the idiomatic Go guard already used in the main connect path — when Timeout==0, handshakeCtx is now context.Background() (no expiry). Applied to both pkg/tlsx/tls/tls.go and pkg/tlsx/ztls/ztls.go. Compiles clean.

- tlsHandshakeWithTimeout: move ctx to first parameter per Go convention
  (context.Context should always be the first parameter)
- Remove duplicate stale comment line '// tlsHandshakeWithCtx attempts...'
  that was left alongside the new doc comment
@CharlesWong
Copy link
Author

Senior dev review follow-up (b936f57): two issues fixed:

  1. Stale comment removed — the old // tlsHandshakeWithCtx attempts... line was left alongside the new doc comment, creating a confusing duplicate header.

  2. ctx parameter order in tlsHandshakeWithTimeout — Go convention requires context.Context as the first parameter. Reordered to func (c *Client) tlsHandshakeWithTimeout(ctx context.Context, tlsConn *tls.Conn) and updated all 2 call sites. No behavioral change.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tlsx/ztls/ztls.go`:
- Around line 332-357: The tlsHandshakeWithTimeout function can return a false
timeout when Handshake() and ctx deadline become ready simultaneously; update
tlsHandshakeWithTimeout to prefer the handshake result by placing the err :=
<-errChan case before the <-ctx.Done() case and add an atomic boolean (using
sync/atomic) to track whether the handshake is still in progress; set the atomic
flag to “running” before starting the goroutine, clear it when the goroutine
returns (or on successful err read), and in the ctx.Done() branch only close
tlsConn and return the timeout error if the atomic flag still indicates the
handshake is running, ensuring completed handshakes aren’t aborted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d4c31d5-3889-4cd3-9878-f35355b0a58b

📥 Commits

Reviewing files that changed from the base of the PR and between 57d4d61 and b936f57.

📒 Files selected for processing (1)
  • pkg/tlsx/ztls/ztls.go

Go's select statement is non-deterministic when multiple cases are ready
at the same time. When Handshake() completes at the exact moment the
context deadline fires, the original code could randomly take the
ctx.Done() branch and return a spurious timeout error even though the
handshake succeeded.

Fix: reorder cases to check errChan first, and add a non-blocking
errChan drain in the ctx.Done() branch so a just-completed handshake
is never discarded in favour of a timeout.
@CharlesWong
Copy link
Author

Fixed in 7194d08. The select fairness issue is real — when both errChan and ctx.Done() are ready simultaneously, Go picks randomly. The fix:

  1. Reordered cases: errChan is checked first (preferred outcome)
  2. Added a non-blocking select { case err := <-errChan: ... default: } drain inside the ctx.Done() branch, so a just-completed handshake result is never thrown away in favour of a spurious timeout

This is semantically correct: if the handshake finished, we should use its result regardless of whether the deadline also fired.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/tlsx/ztls/ztls.go (1)

359-372: ⚠️ Potential issue | 🟠 Major

The timeout branch can still race with a just-finished handshake.

The non-blocking drain only fixes the exact simultaneous-ready case. If Handshake() returns after Line 368 falls through default but before Line 371 closes the connection, this still reports a timeout and tears down a successful handshake. You still need a completion flag/CAS around the close so only genuinely in-flight handshakes are aborted.

Possible fix
+import "sync/atomic"
+
 func (c *Client) tlsHandshakeWithTimeout(ctx context.Context, tlsConn *tls.Conn) error {
+	var state uint32 // 0=running, 1=completed, 2=timed out
 	errChan := make(chan error, 1)
 	go func() {
-		errChan <- tlsConn.Handshake()
+		err := tlsConn.Handshake()
+		atomic.CompareAndSwapUint32(&state, 0, 1)
+		errChan <- err
 	}()
 
 	select {
 	case err := <-errChan:
 		if err == tls.ErrCertsOnly {
 			err = nil
 		}
 		return err
 	case <-ctx.Done():
-		// Prefer a completed handshake result that arrived at the same time
-		// as the deadline — select is non-deterministic when both are ready.
-		select {
-		case err := <-errChan:
-			if err == tls.ErrCertsOnly {
-				err = nil
-			}
-			return err
-		default:
-		}
-		// Close the connection to unblock the goroutine stuck in Handshake()
-		_ = tlsConn.Close()
-		return errorutil.NewWithTag("ztls", "timeout while attempting handshake") //nolint
+		if atomic.CompareAndSwapUint32(&state, 0, 2) {
+			_ = tlsConn.Close()
+			return errorutil.NewWithTag("ztls", "timeout while attempting handshake").Wrap(ctx.Err()) //nolint
+		}
+		err := <-errChan
+		if err == tls.ErrCertsOnly {
+			return nil
+		}
+		return err
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tlsx/ztls/ztls.go` around lines 359 - 372, The timeout branch can race
with a just-finished Handshake; modify the handshake coordination so the
goroutine that runs tlsConn.Handshake() sets a shared completion flag (e.g., an
atomic.Bool or a boolean updated under a mutex) before sending into errChan, and
have the ctx.Done() handler check that same flag with a CAS/read before closing
tlsConn and returning the timeout error; reference errChan, tlsConn,
Handshake(), ctx.Done(), and errorutil.NewWithTag("ztls", "timeout while
attempting handshake") when applying the fix so only truly in-flight handshakes
are aborted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/tlsx/ztls/ztls.go`:
- Around line 359-372: The timeout branch can race with a just-finished
Handshake; modify the handshake coordination so the goroutine that runs
tlsConn.Handshake() sets a shared completion flag (e.g., an atomic.Bool or a
boolean updated under a mutex) before sending into errChan, and have the
ctx.Done() handler check that same flag with a CAS/read before closing tlsConn
and returning the timeout error; reference errChan, tlsConn, Handshake(),
ctx.Done(), and errorutil.NewWithTag("ztls", "timeout while attempting
handshake") when applying the fix so only truly in-flight handshakes are
aborted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e5c33ef7-e23a-470a-b1a8-38dcb6491e4a

📥 Commits

Reviewing files that changed from the base of the PR and between b936f57 and 7194d08.

📒 Files selected for processing (1)
  • pkg/tlsx/ztls/ztls.go

@CharlesWong
Copy link
Author

Thanks for the security review @neo-by-projectdiscovery-dev — glad no issues were found. The timeout guard now uses context.Background() (no expiry) when Timeout==0, matching the existing pattern at lines 109-114 rather than introducing a hardcoded 10s fallback.

@CharlesWong
Copy link
Author

Acknowledged @neo-by-projectdiscovery-dev[bot]'s review.

@CharlesWong
Copy link
Author

Acknowledged @coderabbitai[bot]'s review.

@coderabbitai
Copy link

coderabbitai bot commented Mar 15, 2026

Tip

For best results, initiate chat on the files or code changes.

@CharlesWong Thanks for all the thoughtful follow-ups! The three fixes — the context.Background() guard when Timeout==0, the context.Context-first parameter convention, and the non-deterministic select fairness fix with the non-blocking drain — all look solid and well-reasoned. The PR is in great shape. 🐇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tlsx hangs indefinitely for some hosts

1 participant