[ISSUE] requests.post() and requests.get() calls in oauth.py have no timeout, can hang indefinitely #1338

@cgrierson-smartsheet

Description

requests.post() and requests.get() calls in databricks/sdk/oauth.py do not pass a timeout= parameter. When the OAuth endpoint is unreachable or slow at the moment of token refresh, these calls block indefinitely. The SDK's per-request timeout (_BaseClient._http_timeout_seconds) does not protect against this because the token refresh runs inside session.auth (the header_factory callback), which executes before the request timeout takes effect.

Three call sites are affected:

| Location | Line | Call |
| --- | --- | --- |
| retrieve_token() | 208 | requests.post(token_url, params, auth=auth, headers=headers) |
| get_azure_entra_id_workspace_endpoints() | 521 | requests.get(f"{host}/oidc/oauth2/v2.0/authorize", allow_redirects=False) |
| PATOAuthTokenExchange.refresh() | 889 | requests.post(token_exchange_url, params) |

This is related to but distinct from #1046, which is about the _BaseClient.do() retry timeout for OIDC endpoint discovery. The calls listed above bypass _BaseClient entirely and use the requests library directly with no timeout.

Actual behavior

requests.post() and requests.get() in oauth.py block indefinitely when the remote endpoint is unreachable or slow, because no timeout= parameter is passed. The SDK's per-request timeout (session.request(timeout=60)) does not help because the token refresh runs inside session.auth, before the timeout takes effect.

Expected behavior

requests.post() and requests.get() calls in oauth.py should include a timeout= parameter so that unreachable endpoints cause a timeout exception rather than an indefinite hang.

Impact

We run an MLflow evaluation pipeline on Databricks that makes hundreds of API calls over ~60 minutes. MLflow's get_workspace_client() caches the WorkspaceClient via @lru_cache, so the same credential provider persists for the process lifetime. The M2M OAuth token (TTL = 3600s, server-dictated) expires at ~59m20s (accounting for the SDK's 40s early-expiry buffer in Token.expired). When the next API call triggers a synchronous token refresh and the OAuth endpoint is slow at that moment, requests.post() in retrieve_token() blocks indefinitely, hanging the entire CI pipeline until the job-level timeout kills it.
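The expiry figure above follows from simple arithmetic. A minimal sketch, using only the two numbers stated in this issue (the 3600s TTL and the 40s early-expiry buffer); the variable names are illustrative and are not SDK identifiers:

```python
from datetime import timedelta

# Values taken from the issue text; names are illustrative, not SDK attributes.
token_ttl = timedelta(seconds=3600)          # server-dictated M2M token lifetime
early_expiry_buffer = timedelta(seconds=40)  # buffer the issue attributes to Token.expired

# The token is treated as expired this long after it was issued.
effective_lifetime = token_ttl - early_expiry_buffer
minutes, seconds = divmod(int(effective_lifetime.total_seconds()), 60)

print(f"refresh triggered after ~{minutes}m{seconds:02d}s")  # 59m20s
```

Any pipeline that runs longer than this on a single cached client will hit at least one synchronous refresh.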

Network failure modes that block forever without timeout=

| Failure mode | Phase that blocks | Why it blocks forever |
| --- | --- | --- |
| Firewall DROP (SYN, no reply) | connect() | TCP retransmits SYN until OS gives up (~2-4 min on Linux, longer on macOS) |
| Server stall (connected, no response) | recv() | Connection established, request sent, server never sends response data |
| Proxy / load balancer stall | recv() | Backend unavailable but frontend holds connection open |
| TLS negotiation stall | ssl.do_handshake() | TCP connected but peer never completes TLS handshake |

All four are resolved by adding timeout= to requests.post()/requests.get().

Network failure modes that fail fast (with or without timeout=)

| Failure mode | Why it fails fast |
| --- | --- |
| DNS failure | getaddrinfo() returns error immediately |
| Host unreachable (ICMP) | OS receives ICMP unreachable, connect() returns error |
| Port closed (RST) | Server sends TCP RST, connect() returns ConnectionRefused |
| Server crash after accept | OS sends RST or FIN, recv() returns error or EOF |
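For contrast with the blocking cases, the "Port closed (RST)" row can be observed with the standard library alone. This is a rough sketch rather than part of the attached script: it assumes that a loopback port which was just bound and released has no listener, and the helper name is made up for illustration.

```python
import socket
import time


def pick_closed_port() -> int:
    """Bind to an OS-chosen port, then release it so nothing is listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


port = pick_closed_port()

start = time.monotonic()
try:
    # No timeout is passed here: the error arrives without one.
    socket.create_connection(("127.0.0.1", port)).close()
    outcome = "connected"
except ConnectionRefusedError:
    outcome = "ConnectionRefusedError"
elapsed = time.monotonic() - start

print(f"{outcome} after {elapsed:.3f}s")
```

The exception surfaces without any timeout being configured, which is why these failure modes are not the ones that hang the pipeline.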

Reproduction

The script below reproduces the "server stall" row from the first table above: a local TCP server accepts the connection but never responds, causing requests.post() to block on recv() indefinitely. The timeout= parameter protects against all four blocking failure modes in that table, since it covers both the connect and read phases.

No Databricks credentials are needed: the script demonstrates the defect in requests.post() as called by the SDK, not a specific production failure.

See the attached reproduce_hang.py. Run it with pip install requests && python reproduce_hang.py (about 30 seconds).

reproduce_hang.py

Expected output:

CONFIRMED: requests.post() with no timeout blocked for >10s (killed)
CONFIRMED: session.get(timeout=60) does NOT protect auth callback
CONFIRMED: requests.post(timeout=5) raises ReadTimeout after 5.1s

All three tests pass. Fix: add timeout= to requests.post()/get() in oauth.py.
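For readers without the attachment, the server-stall case can be approximated in a few lines. This is a sketch under stated assumptions and not the attached reproduce_hang.py: the URL path is a placeholder, the 0.5s timeout is chosen only to keep the run short, and the helper name is hypothetical.

```python
import socket
import threading
import time

import requests


def hold_one_connection(listener, held):
    """Accept a single connection and keep it open without ever replying."""
    conn, _ = listener.accept()
    held.append(conn)


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
held = []
threading.Thread(target=hold_one_connection, args=(listener, held), daemon=True).start()

# Placeholder path; nothing on the server side ever parses it.
url = f"http://127.0.0.1:{listener.getsockname()[1]}/token"

start = time.monotonic()
try:
    # Without timeout= this call would wait on the stalled server indefinitely.
    requests.post(url, data={"grant_type": "client_credentials"}, timeout=0.5)
    outcome = "response"
except requests.exceptions.ReadTimeout:
    outcome = "ReadTimeout"
elapsed = time.monotonic() - start

for conn in held:
    conn.close()
listener.close()

print(f"{outcome} after {elapsed:.1f}s")
```

Removing timeout=0.5 from the requests.post() call turns the same script into a hang, which is the behavior reported here.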

Code path of the hang:

client.current_user.me()
→ _BaseClient._perform() → session.request(timeout=60)
→ session.auth → Config.authenticate() → credential_provider.token()
→ Refreshable._blocking_token() [token is EXPIRED]
→ ClientCredentials.refresh() → retrieve_token()
→ requests.post(token_url, params, auth=auth, headers=headers)
   ↑ NO timeout= PARAMETER — BLOCKS FOREVER
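The key step in this path is that requests invokes a callable session.auth while preparing the request, before anything is sent, so the timeout passed to session.get() never applies to it. A minimal sketch of that ordering, assuming a local HTTP server and a slow_auth callable that merely sleeps as a stand-in for a blocking token refresh (neither is SDK code):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class OkHandler(BaseHTTPRequestHandler):
    """Local server that answers every GET immediately."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def slow_auth(prepared_request):
    # Stand-in for a blocking token refresh inside the auth callback.
    time.sleep(1.0)
    prepared_request.headers["Authorization"] = "Bearer dummy"
    return prepared_request


server = HTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment
session.auth = slow_auth

start = time.monotonic()
resp = session.get(url, timeout=0.5)  # timeout is shorter than the auth delay
elapsed = time.monotonic() - start

session.close()
server.shutdown()
server.server_close()

print(f"status={resp.status_code} elapsed={elapsed:.1f}s (timeout was 0.5s)")
```

The request succeeds even though it takes longer than the 0.5s timeout, because the delay happens in the auth callback rather than in the connect or read phase.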

Suggested fix

# retrieve_token(), line 208:
resp = requests.post(token_url, params, auth=auth, headers=headers, timeout=30)

# get_azure_entra_id_workspace_endpoints(), line 521:
res = requests.get(f"{host}/oidc/oauth2/v2.0/authorize", allow_redirects=False, timeout=30)

# PATOAuthTokenExchange.refresh(), line 889:
resp = requests.post(token_exchange_url, params, timeout=30)

Ideally the timeout value would come from Config.http_timeout_seconds (defaulting to the SDK's standard 60s) rather than a hardcoded value.
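One way the configurable variant could look is a small helper that falls back to the default when nothing is configured. This is only a sketch: resolve_timeout and DEFAULT_HTTP_TIMEOUT_SECONDS are hypothetical names, and how the value would be read from Config is left open.

```python
from typing import Optional

# 60s is the SDK default quoted in this issue; the constant name is hypothetical.
DEFAULT_HTTP_TIMEOUT_SECONDS = 60.0


def resolve_timeout(configured: Optional[float]) -> float:
    """Return the configured timeout, or the default when none is set."""
    if configured is None:
        return DEFAULT_HTTP_TIMEOUT_SECONDS
    seconds = float(configured)
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return seconds


# Intended use at a call site (not executed here):
#   requests.post(token_url, params, auth=auth, headers=headers,
#                 timeout=resolve_timeout(http_timeout_seconds))
print(resolve_timeout(None))  # 60.0
print(resolve_timeout(30))    # 30.0
```

Rejecting non-positive values is a design choice made for this sketch, not something the issue specifies.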

Is it a regression?

No — these calls have never had a timeout parameter.

Debug Logs

Not applicable: the process hangs with no output. A faulthandler stack dump shows the thread blocked in ssl.read → urllib3 → requests.post inside retrieve_token().
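A stack dump of this kind can be captured with the standard library's faulthandler module. The sketch below is illustrative and is not the exact procedure used for this report: blocked_call is a stand-in that sleeps instead of making a network call, and the 0.2s delay is chosen only to keep the run short.

```python
import faulthandler
import tempfile
import time


def blocked_call():
    # Stand-in for a call that hangs, e.g. a request with no timeout.
    time.sleep(1.0)


with tempfile.TemporaryFile(mode="w+") as dump_file:
    # If the process is still running after 0.2s, write every thread's stack to dump_file.
    faulthandler.dump_traceback_later(0.2, file=dump_file)
    try:
        blocked_call()
    finally:
        faulthandler.cancel_dump_traceback_later()
    dump_file.seek(0)
    dump = dump_file.read()

print(dump)
```

In a real hang the dump would be written to stderr or a log file, and the innermost frames identify the blocking call.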

Other Information

  • OS: macOS (Darwin 24.6.0)
  • Version: 0.96.0
  • Python: 3.11.13

Additional context

Related: #1046 (non-configurable timeout for _BaseClient.do() in OIDC endpoint discovery — same family of issue, different code path). PR #1085 addresses #1046 but does not cover the requests.post()/requests.get() calls listed above.
