Support Python 3.14 free-threaded CUDA wheels by tpn · Pull Request #897 · NVIDIA/numba-cuda

tpn · 2026-06-12T21:27:05Z

Overview

This PR adds initial CPython 3.14 free-threaded support for numba-cuda, including cp314t wheel builds.

The user-visible goal is that a CPython free-threaded process started with PYTHON_GIL=0 can import numba.cuda without CPython re-enabling the GIL. CPython re-enables the GIL when it imports a native extension module that does not explicitly declare no-GIL support, so this PR opts the supported numba-cuda native modules into Py_MOD_GIL_NOT_USED after hardening the shared state they can reach without the GIL.
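As a rough illustration of what such an import gate checks, the sketch below reports the GIL state before and after importing a module. The helper names gil_report and import_gate are hypothetical and are not taken from this PR; sys._is_gil_enabled() is available on CPython 3.13 and later, and the sketch assumes a GIL is present when that probe is missing.

```python
import importlib
import sys
import sysconfig


def gil_report():
    """Summarize the interpreter's GIL state (hypothetical helper)."""
    # Py_GIL_DISABLED is truthy on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; assume a GIL without it.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


def import_gate(module_name):
    """Import a module and report the GIL state before and after."""
    before = gil_report()
    importlib.import_module(module_name)
    after = gil_report()
    # The gate fails only if the GIL was off before the import and on after.
    passed = before["gil_enabled"] or not after["gil_enabled"]
    return before, after, passed
```

On a free-threaded interpreter started with PYTHON_GIL=0, calling import_gate("numba.cuda") would be the analogue of the gate described here; on a regular build the check passes trivially because the GIL was never off.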

What changed

Build and CI paths understand 3.14t / cp314t and propagate Py_GIL_DISABLED=1 when building extension modules under a free-threaded Python.
The shared native module-init helper marks the C/C++ extension modules as no-GIL compatible when the build target is free-threaded CPython.
Native shared state is protected where the GIL previously provided serialization, including dispatcher overload/fallback state, _typeof caches and fingerprinting state, _typeconv compatibility data, and mviewbuf sequence/buffer helpers.
Python-side runtime state now has additional locking around CUDA driver first touch, pending deallocations, dispatcher specialization, and launch-config-sensitive cache/specialization paths.
The ad hoc stress scripts have been converted into repo-native tests under numba_cuda/numba/cuda/tests/stress and supporting focused test locations.
Developer docs now cover free-threaded stress testing and the stress-control environment variables. User docs describe a conda-forge cp314t smoke environment and the PYTHON_GIL=0 import gate.
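The build-side detection can be sketched roughly as follows. This is a minimal illustration and not the PR's setup.py: the function name free_threading_macros is hypothetical, and the returned pairs simply follow the (name, value) format that setuptools accepts for an extension's define_macros.

```python
import sysconfig


def free_threading_macros(config_value=None):
    """Return extra (name, value) macro pairs for a free-threaded build.

    Hypothetical helper: pass config_value explicitly to simulate a build,
    or leave it as None to query the running interpreter.
    """
    if config_value is None:
        config_value = sysconfig.get_config_var("Py_GIL_DISABLED")
    # The config var is truthy on free-threaded builds, 0 or None otherwise.
    if config_value:
        return [("Py_GIL_DISABLED", "1")]
    return []
```

A build script could append the result to each extension's define_macros list, which is one way to make the propagation explicit on platforms where the compiler flags do not already carry it.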

Stress tests and environment variables

The heavier free-threading tests are opt-in because they create many threads and subprocesses. Enable them with:

PYTHON_GIL=0 NUMBA_CUDA_FT_STRESS=1 \
  python -m pytest -q --pyargs numba.cuda.tests.stress

The stress controls are:

NUMBA_CUDA_FT_STRESS: enables the opt-in stress tests.
NUMBA_CUDA_FT_STRESS_SECONDS: duration for timed CUDA stress phases.
NUMBA_CUDA_FT_STRESS_WORKERS: worker-thread count for thread-pool phases.
NUMBA_CUDA_FT_STRESS_PROCESSES: subprocess count for cache-concurrency phases.
NUMBA_CUDA_FT_STRESS_ITERS: iteration count override for loop-based no-CUDA phases.
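A test harness might read these controls along the following lines. This is a sketch under stated assumptions: the StressConfig and load_stress_config names are hypothetical, treating only the value "1" as enabled is an assumption, and None stands for "use the test's built-in default" because the actual defaults are not stated here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StressConfig:
    enabled: bool
    seconds: Optional[float]
    workers: Optional[int]
    processes: Optional[int]
    iters: Optional[int]


def _optional(env, name, convert):
    """Convert a variable if it is set and non-empty, else return None."""
    raw = env.get(name, "").strip()
    return convert(raw) if raw else None


def load_stress_config(env=None):
    """Build a StressConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return StressConfig(
        # Assumption: only the literal value "1" enables the stress tests.
        enabled=env.get("NUMBA_CUDA_FT_STRESS", "").strip() == "1",
        seconds=_optional(env, "NUMBA_CUDA_FT_STRESS_SECONDS", float),
        workers=_optional(env, "NUMBA_CUDA_FT_STRESS_WORKERS", int),
        processes=_optional(env, "NUMBA_CUDA_FT_STRESS_PROCESSES", int),
        iters=_optional(env, "NUMBA_CUDA_FT_STRESS_ITERS", int),
    )
```

Accepting a plain mapping keeps the parsing testable without touching the real process environment.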

The CUDA dispatch stress deliberately avoids a concurrent tight loop of full gc.collect() calls. That pattern timed out in CPython 3.14t free-threaded GC code on many-core systems and is being tracked separately from this PR. The dispatch stress still runs a full collection after workers shut down.
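The shape of that pattern can be sketched with a generic thread pool. The run_dispatch_stress name and its parameters are illustrative only, and the work callable stands in for whatever the real stress phase does.

```python
import gc
from concurrent.futures import ThreadPoolExecutor


def run_dispatch_stress(work, workers=8, iterations=100):
    """Run work(i) from many threads, then collect once at the end."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Workers only do the work itself; none of them calls gc.collect().
        results = list(pool.map(work, range(iterations)))
    # Leaving the with-block waits for the pool to shut down, so this
    # single full collection runs with no stress workers still active.
    unreachable = gc.collect()
    return results, unreachable
```

This keeps one full collection in the test while avoiding many threads requesting full collections in a tight loop.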

Validation

Current branch head: 2953b481ebe2177b5fd9b05a4f8c5d3fbbc9facc.

Local validation on the current head:

git diff --check: passed.
pixi run -e dev ruff check <changed-python-files>: passed.
python -m py_compile <changed-python-files>: passed.
pixi run -e docs build-docs: passed.
YAML parsing plus jq evaluation for the CI matrix filters passed. The third-party jobs continue to choose the non-free-threaded latest Python row, while explicit 3.14t matrix rows remain available where configured.
ci/tools/env-vars build with PY_VER=3.14t emits PYTHON_VERSION_FORMATTED=314t, CIBW_BUILD=cp314t-*, and numba-cuda-python314t-* artifact names.
pixi run -e default python -m pytest -q testing --override-ini "addopts=--benchmark-disable --import-mode=importlib" --pyargs numba.cuda.tests.stress.test_free_threading: 5 skipped in the local default env, which is not a free-threaded interpreter.
RoboRev standard, security, and design passes all found no issues for the final CI readability commit.
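The naming reported for ci/tools/env-vars above can be mirrored in a few lines. This is only a sketch of the mapping, not the tool itself: wheel_build_vars and the ARTIFACT_PREFIX key are hypothetical names, and the artifact suffix is left out because it is not given here.

```python
def wheel_build_vars(py_ver):
    """Map a version such as "3.14t" to the derived build names."""
    # "3.14t" -> "314t": drop the dot, keep any free-threaded "t" suffix.
    formatted = py_ver.replace(".", "")
    return {
        "PYTHON_VERSION_FORMATTED": formatted,
        "CIBW_BUILD": f"cp{formatted}-*",
        # Hypothetical key: only the prefix of the artifact name is modeled.
        "ARTIFACT_PREFIX": f"numba-cuda-python{formatted}-",
    }
```

For PY_VER=3.14t this yields 314t, cp314t-*, and the numba-cuda-python314t- prefix, matching the values listed in the validation notes.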

Many-core Linux validation:

On an 8x B200 system with 224 logical CPU cores, the branch built and installed a cp314t wheel, passed the PYTHON_GIL=0 import gate, and passed the focused free-threading stress phases under Python 3.14t with the GIL disabled.
That B200 run covered concurrent CUDA driver first touch, dispatcher fingerprinting while containers were mutated, memoryview helpers, dispatcher specialization and launch paths, launch-config-sensitive specialization, and a mixed stress workload.
The stress evidence was collected on runtime-code head 7bf53d4b. The current head adds only behavior-preserving CI YAML formatting.

Known limits and follow-up validation

Windows cp314t still needs a real builder pass. The build changes account for Windows needing explicit Py_GIL_DISABLED propagation, but this PR should not be treated as Windows-proven until that run completes.
AArch64 cp314t is in the matrix work, but still needs real CI or hardware validation.
GitHub checks are not currently reported on the PR. NVIDIA runner gating still needs to accept/run the workflows.
Copilot's latest clean review is stale relative to the current head. The current-head request at 2953b481 was quota-blocked, so a fresh Copilot pass is still pending.
The investigation into the CPython full-GC non-progress behavior is intentionally outside this PR.

Notes for reviewers

The highest-value review areas are the native state protection around _dispatcher, _typeof, _typeconv, and mviewbuf, plus the Python-side driver/dispatcher locking and the repo-native stress test coverage. The intent is initial cp314t support for the exercised import, native extension, driver first-touch, dispatcher, cache, and stress paths, not a claim that every possible CUDA workload has been exhaustively proven free-thread-safe.

copy-pr-bot · 2026-06-12T21:27:09Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot · 2026-06-12T21:40:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

tpn · 2026-06-12T21:40:41Z

/ok to test

copy-pr-bot · 2026-06-12T21:40:44Z

/ok to test

@tpn, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Copilot

Pull request overview

This PR adds initial support for CPython 3.14 free-threaded (“3.14t”, cp314t) wheels by opting the project’s native CUDA extension modules into no-GIL mode and hardening shared native state with locks, plus adding focused import/threading tests and CI matrix coverage.

Changes:

Detect free-threaded Python builds during extension builds and enable Py_GIL_DISABLED-gated code paths; mark native modules as Py_MOD_GIL_NOT_USED.
Add/adjust native synchronization around shared caches and dispatcher/type-conversion state to support running without the GIL.
Add tests and CI matrix entries for 3.14t, plus documentation updates describing local validation and expectations.
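The native changes are in C and C++, but the general idea of guarding a shared cache that the GIL used to serialize can be shown with a Python analogue. The SpecializationCache class below is purely illustrative and does not correspond to any class in numba-cuda.

```python
import threading


class SpecializationCache:
    """Toy get-or-build cache guarded by a single lock."""

    def __init__(self, build):
        self._build = build            # called once per missing key
        self._lock = threading.Lock()
        self._entries = {}

    def get(self, key):
        # Lookup and insertion happen under the same lock, so two threads
        # can never both miss on the same key and build it twice.
        with self._lock:
            try:
                return self._entries[key]
            except KeyError:
                value = self._entries[key] = self._build(key)
                return value
```

Holding the lock while building is the simplest correct choice but serializes all builds; a real implementation may prefer finer-grained locking.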

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

File	Description
testing/pytest.ini	Switch pytest import mode to `importlib` to preserve redirected module naming during collection.
setup.py	Detect free-threaded builds and define `Py_GIL_DISABLED` for extension compilation.
numba_cuda/numba/cuda/tests/nocuda/test_typeconv_threading.py	New threading stress test for concurrent type-conversion reads/writes.
numba_cuda/numba/cuda/tests/nocuda/test_import.py	New free-threaded import test ensuring `PYTHON_GIL=0` keeps the GIL disabled after imports.
numba_cuda/numba/cuda/cext/typeconv.hpp	Add mutex to `TypeManager` for thread safety.
numba_cuda/numba/cuda/cext/typeconv.cpp	Protect compatibility map and overload selection with a mutex.
numba_cuda/numba/cuda/cext/mviewbuf.c	Make sequence item reads safer and validate shape/stride lengths for no-GIL operation.
numba_cuda/numba/cuda/cext/_typeof.cpp	Add locking around global caches and stabilize container access under no-GIL.
numba_cuda/numba/cuda/cext/_typeconv.cpp	Mark `_typeconv` module as no-GIL compatible.
numba_cuda/numba/cuda/cext/_pymodule.h	Add shared macros to mark modules as no-GIL compatible.
numba_cuda/numba/cuda/cext/_helpermod.c	Mark `_helperlib` module as no-GIL compatible.
numba_cuda/numba/cuda/cext/_dispatcher.cpp	Add dispatcher locking and adjust reference handling to support no-GIL.
docs/source/user/installation.rst	Document free-threaded Python usage and local environment setup.
ci/test-matrix.yml	Enable initial `3.14t` rows in the CI matrix.
.gitignore	Ignore Windows `.pyd` artifacts.
.github/workflows/ci.yaml	Adjust matrix filtering logic to handle `3.14t` version strings.
.github/workflows/build-wheel.yml	Enable `3.14t` in the build-wheel workflow matrix.

tpn · 2026-06-13T00:10:46Z

/ok to test

copy-pr-bot · 2026-06-13T00:10:49Z

/ok to test

@tpn, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

tpn · 2026-06-13T00:11:39Z

/ok to test fcd0a40

Copilot

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 1 comment.

Driver.ensure_initialized() set is_initialized=True before cuInit() returned, so a concurrent first-touch from another thread could sail past the guard and call a driver API on an uninitialized driver, raising CUDA_ERROR_NOT_INITIALIZED. Under free-threaded CPython this race fires reliably (every thread but the initializer fails); under the GIL it is a narrow window while cuInit releases the GIL.

Serialize initialization with a reentrant lock and only publish is_initialized once cuInit has actually completed. A separate _initializing flag breaks the __getattr__ -> cuInit -> ensure_initialized recursion that runs on the initializing thread. __getattr__ now refuses to lazily bind underscore-prefixed names so touching the init lock can never recurse into driver binding.

Adds a spawned-subprocess regression test that hammers cold-start init from many threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
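The ordering described in this commit message can be sketched with a toy class. LazyDriver is a hypothetical stand-in and not numba-cuda's Driver; init_fn plays the role of cuInit(), and error recording is left out.

```python
import threading


class LazyDriver:
    """Toy lazily initialized driver (not numba-cuda's class)."""

    def __init__(self, init_fn):
        self._init_fn = init_fn          # stands in for cuInit()
        self._lock = threading.RLock()   # reentrant: init may call back in
        self._initializing = False
        self.is_initialized = False

    def ensure_initialized(self):
        if self.is_initialized:          # fast path once published
            return
        with self._lock:
            # Re-check under the lock, and bail out if this same thread
            # re-entered while its own initialization is still running.
            if self.is_initialized or self._initializing:
                return
            self._initializing = True
            try:
                self._init_fn(self)
                # Publish only after initialization has really finished.
                self.is_initialized = True
            finally:
                self._initializing = False
```

Other threads block on the lock until the flag is published, so none of them can observe is_initialized as True before init_fn has returned.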

_PendingDeallocs.add_item()/clear() mutated a shared deque and size counter with no synchronization. Deallocations are driven by weakref finalizers, which run on whatever thread drops the last reference and, under free-threaded CPython, truly concurrently. The "while self._cons: self._cons.popleft()" loop in clear() raced: one thread's truthiness check then another's popleft drained the deque, raising "IndexError: pop from an empty deque" out of a finalizer. Any multithreaded program freeing device memory could hit it.

Guard the pending list, byte accounting, and disable counter with a lock. clear() now atomically takes ownership of the pending entries under the lock and runs the destructors (which call into the CUDA driver) outside it, so deallocation work never blocks other threads queueing frees.

Adds a regression test that hammers add_item/clear from many threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
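The "take ownership under the lock, run destructors outside it" idea from this commit message might look roughly like the following. PendingDeallocs here is a simplified, hypothetical model and not the class from the PR; it omits the disable counter and any size thresholds.

```python
import threading
from collections import deque


class PendingDeallocs:
    """Toy queue of deferred destructors (not numba-cuda's class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cons = deque()
        self._size = 0

    def add_item(self, dtor, handle, size):
        # Queue and byte accounting are updated together under the lock.
        with self._lock:
            self._cons.append((dtor, handle, size))
            self._size += size

    def clear(self):
        # Swap the whole queue out atomically instead of popping in a loop.
        with self._lock:
            pending, self._cons = self._cons, deque()
            self._size = 0
        # Destructors run without the lock, so other threads can keep
        # queueing frees while this thread does the slow work.
        for dtor, handle, _size in pending:
            dtor(handle)
        return len(pending)
```

Because each clear() call owns its own detached deque, no two threads can ever pop from the same container, which removes the check-then-pop race described above.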

Copilot

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated no new comments.

tpn · 2026-06-13T06:51:05Z

I also did a many-core stress pass outside the normal CI matrix for the Python 3.14 free-threaded work.

Validated head: 50421f11

I built the branch into a cp314t wheel, installed that wheel into a Python 3.14 free-threaded environment, and verified the process started with the GIL disabled and kept it disabled after importing numba.cuda and the relevant C extension modules.

The completed run used an 8x B200 system with 224 logical CPU cores. This was a concurrency/correctness stress pass, not a throughput benchmark. It exercised:

concurrent CUDA driver first-touch / initialization
_dispatcher.compute_fingerprint() while containers were being mutated
memoryview extent/buffer helpers
dispatcher specialization and launch paths with 64 worker threads
launch-config-sensitive specialization with 64 worker threads
mixed kitchen-sink workload with 96 worker threads

Final result: the cp314t wheel built and installed successfully, the GIL-disabled import gate passed, and all stress phases completed:

fingerprint stress passed
mviewbuf stress passed
dispatch stress passed
LC-S (launch-config-sensitive) stress passed
kitchen-sink stress passed

I also tried to include GB200 in this pass, but the completed stress evidence here is from B200.

Pretty, prettttttty good.

-- gpt-5.5 xhigh

Claude may now lower the eyebrow by one millimeter.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

tpn · 2026-06-13T06:56:34Z

/ok to test 50421f1

tpn added 2 commits June 12, 2026 13:53

Support free-threaded Python in native extensions

be07770

Enable cp314t wheel builds

798a02e

tpn added 2 commits June 12, 2026 14:33

Use importlib pytest mode for CUDA tests

a124cc4

Ignore Windows extension build outputs

ab22e9e

tpn requested a review from Copilot June 12, 2026 21:40

Copilot started reviewing on behalf of tpn June 12, 2026 21:41

Copilot AI reviewed Jun 12, 2026

Comment thread setup.py

Comment thread numba_cuda/numba/cuda/tests/nocuda/test_import.py

Comment thread numba_cuda/numba/cuda/cext/_pymodule.h Outdated

Comment thread numba_cuda/numba/cuda/cext/_dispatcher.cpp

Comment thread numba_cuda/numba/cuda/cext/mviewbuf.c

Address free-threading review comments

fcd0a40

tpn marked this pull request as ready for review June 12, 2026 22:47

tpn added 2 commits June 12, 2026 17:53

Guard concurrent CUDA specialization

9582e37

Guard launch-config races

62df131

tpn requested a review from Copilot June 13, 2026 01:05

Copilot started reviewing on behalf of tpn June 13, 2026 01:05

Copilot AI reviewed Jun 13, 2026

Comment thread numba_cuda/numba/cuda/cext/mviewbuf.c Outdated

tpn and others added 3 commits June 12, 2026 18:26

Use version hex for Python API guard

8be5c73

tpn requested a review from Copilot June 13, 2026 02:17

Copilot started reviewing on behalf of tpn June 13, 2026 02:17

Copilot AI reviewed Jun 13, 2026

Comment thread numba_cuda/numba/cuda/tests/cudadrv/test_init.py

Report concurrent init join timeouts

3c20ec3

tpn requested a review from Copilot June 13, 2026 02:33

Copilot started reviewing on behalf of tpn June 13, 2026 02:33

Copilot AI reviewed Jun 13, 2026

Comment thread numba_cuda/numba/cuda/cudadrv/driver.py Outdated

Comment thread numba_cuda/numba/cuda/cudadrv/driver.py

Comment thread numba_cuda/numba/cuda/cext/mviewbuf.c

Guard driver init fields independently

abe06cd

tpn requested a review from Copilot June 13, 2026 03:32

Copilot started reviewing on behalf of tpn June 13, 2026 03:32

tpn added 2 commits June 12, 2026 20:33

Lock pending-dealloc read accessors

6886ca8

Move mviewbuf item declaration before statements

484f6fa

Copilot AI reviewed Jun 13, 2026

Comment thread numba_cuda/numba/cuda/cudadrv/driver.py Outdated

tpn requested a review from Copilot June 13, 2026 03:39

Copilot started reviewing on behalf of tpn June 13, 2026 03:39

Copilot AI reviewed Jun 13, 2026

Comment thread numba_cuda/numba/cuda/cext/_pymodule.h Outdated

Comment thread numba_cuda/numba/cuda/cext/_typeof.cpp Outdated

Comment thread numba_cuda/numba/cuda/tests/nocuda/test_typeof_threading.py

Respect Py_GIL_DISABLED macro value

50421f1

tpn requested a review from Copilot June 13, 2026 06:08

Copilot started reviewing on behalf of tpn June 13, 2026 06:09

Copilot AI reviewed Jun 13, 2026

tpn self-assigned this Jun 13, 2026

tpn requested a review from Copilot June 13, 2026 06:52

Copilot AI reviewed Jun 13, 2026

Add free-threading stress regression tests

38cf185

tpn force-pushed the feature/add-314-free-threading-support branch from 49791e8 to 38cf185 June 14, 2026 03:46

tpn added 2 commits June 14, 2026 10:59

Document free-threading stress controls

2816f4f

Bound free-threading dispatch stress hangs

194600e

tpn requested a review from Copilot June 14, 2026 19:29

Copilot AI reviewed Jun 14, 2026

tpn added 3 commits June 14, 2026 13:10

Fix no-launch LCS cache compile

fc4c9c8

Avoid full-GC loop in dispatch stress

7bf53d4

Format CI matrix filters

2953b48

tpn requested a review from Copilot June 15, 2026 00:07

Copilot AI reviewed Jun 15, 2026
