Description
requests.post() and requests.get() calls in databricks/sdk/oauth.py do not pass a timeout= parameter. When the OAuth endpoint is unreachable or slow at the moment of token refresh, these calls block indefinitely. The SDK's per-request timeout (_BaseClient._http_timeout_seconds) does not protect against this because the token refresh runs inside session.auth (the header_factory callback), which executes before the request timeout takes effect.
Three call sites are affected:
| Location | Line | Call |
|---|---|---|
| `retrieve_token()` | 208 | `requests.post(token_url, params, auth=auth, headers=headers)` |
| `get_azure_entra_id_workspace_endpoints()` | 521 | `requests.get(f"{host}/oidc/oauth2/v2.0/authorize", allow_redirects=False)` |
| `PATOAuthTokenExchange.refresh()` | 889 | `requests.post(token_exchange_url, params)` |
This is related to but distinct from #1046, which is about the _BaseClient.do() retry timeout for OIDC endpoint discovery. The calls listed above bypass _BaseClient entirely and use the requests library directly with no timeout.
Actual behavior
requests.post() and requests.get() in oauth.py block indefinitely when the remote endpoint is unreachable or slow, because no timeout= parameter is passed. The SDK's per-request timeout (session.request(timeout=60)) does not help because the token refresh runs inside session.auth, before the timeout takes effect.
Expected behavior
requests.post() and requests.get() calls in oauth.py should include a timeout= parameter so that unreachable endpoints cause a timeout exception rather than an indefinite hang.
Impact
We run an MLflow evaluation pipeline on Databricks that makes hundreds of API calls over ~60 minutes. MLflow's get_workspace_client() caches the WorkspaceClient via @lru_cache, so the same credential provider persists for the process lifetime. The M2M OAuth token (TTL = 3600s, server-dictated) expires at ~59m20s (accounting for the SDK's 40s early-expiry buffer in Token.expired). When the next API call triggers a synchronous token refresh and the OAuth endpoint is slow at that moment, requests.post() in retrieve_token() blocks indefinitely, hanging the entire CI pipeline until the job-level timeout kills it.
Network failure modes that block forever without timeout=
| Failure mode | Phase that blocks | Why it blocks forever |
|---|---|---|
| Firewall DROP (SYN, no reply) | `connect()` | TCP retransmits SYN until the OS gives up (~2-4 min on Linux, longer on macOS) |
| Server stall (connected, no response) | `recv()` | Connection established, request sent, server never sends response data |
| Proxy / load balancer stall | `recv()` | Backend unavailable but frontend holds the connection open |
| TLS negotiation stall | `ssl.do_handshake()` | TCP connected but peer never completes the TLS handshake |
All four are resolved by adding timeout= to requests.post()/requests.get().
Network failure modes that fail fast (with or without timeout=)
| Failure mode | Why it fails fast |
|---|---|
| DNS failure | `getaddrinfo()` returns an error immediately |
| Host unreachable (ICMP) | OS receives ICMP unreachable; `connect()` returns an error |
| Port closed (RST) | Server sends TCP RST; `connect()` raises `ConnectionRefused` |
| Server crash after accept | OS sends RST or FIN; `recv()` returns an error or EOF |
Reproduction
The script below reproduces the "server stall" row from the first table above: a local TCP server accepts the connection but never responds, causing requests.post() to block on recv() indefinitely. The timeout= parameter protects against all four blocking failure modes in that table, since it covers both the connect and read phases.
No Databricks credentials are needed — this demonstrates the defect in requests.post() as called by the SDK, not a specific production failure:
See attached reproduce_hang.py — run with pip install requests && python reproduce_hang.py (~30 seconds, no credentials needed).
Expected output:
```
CONFIRMED: requests.post() with no timeout blocked for >10s (killed)
CONFIRMED: session.get(timeout=60) does NOT protect auth callback
CONFIRMED: requests.post(timeout=5) raises ReadTimeout after 5.1s
```
All three tests pass. Fix: add timeout= to requests.post()/get() in oauth.py.
Code path of the hang:
```
client.current_user.me()
→ _BaseClient._perform() → session.request(timeout=60)
  → session.auth → Config.authenticate() → credential_provider.token()
    → Refreshable._blocking_token()  [token is EXPIRED]
      → ClientCredentials.refresh() → retrieve_token()
        → requests.post(token_url, params, auth=auth, headers=headers)
          ↑ NO timeout= PARAMETER — BLOCKS FOREVER
```
Suggested fix
```python
# retrieve_token(), line 208:
resp = requests.post(token_url, params, auth=auth, headers=headers, timeout=30)

# get_azure_entra_id_workspace_endpoints(), line 521:
res = requests.get(f"{host}/oidc/oauth2/v2.0/authorize", allow_redirects=False, timeout=30)

# PATOAuthTokenExchange.refresh(), line 889:
resp = requests.post(token_exchange_url, params, timeout=30)
```
Ideally the timeout value would come from Config.http_timeout_seconds (defaulting to the SDK's standard 60s) rather than a hardcoded value.
Is it a regression?
No — these calls have never had a timeout parameter.
Debug Logs
Not applicable — the process hangs with no output. A faulthandler stack dump shows the thread blocked in ssl.read → urllib3 → requests.post inside retrieve_token().
Other Information
- OS: macOS (Darwin 24.6.0)
- Version: 0.96.0
- Python: 3.11.13
Additional context
Related: #1046 (non-configurable timeout for _BaseClient.do() in OIDC endpoint discovery — same family of issue, different code path). PR #1085 addresses #1046 but does not cover the requests.post()/requests.get() calls listed above.