Support Python 3.14 free-threaded CUDA wheels #897
Pull request overview
This PR adds initial support for CPython 3.14 free-threaded (“3.14t”, cp314t) wheels by opting the project’s native CUDA extension modules into no-GIL mode and hardening shared native state with locks, plus adding focused import/threading tests and CI matrix coverage.
Changes:
- Detect free-threaded Python builds during extension builds and enable `Py_GIL_DISABLED`-gated code paths; mark native modules as `Py_MOD_GIL_NOT_USED`.
- Add/adjust native synchronization around shared caches and dispatcher/type-conversion state to support running without the GIL.
- Add tests and CI matrix entries for `3.14t`, plus documentation updates describing local validation and expectations.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| testing/pytest.ini | Switch pytest import mode to importlib to preserve redirected module naming during collection. |
| setup.py | Detect free-threaded builds and define Py_GIL_DISABLED for extension compilation. |
| numba_cuda/numba/cuda/tests/nocuda/test_typeconv_threading.py | New threading stress test for concurrent type-conversion reads/writes. |
| numba_cuda/numba/cuda/tests/nocuda/test_import.py | New free-threaded import test ensuring PYTHON_GIL=0 keeps the GIL disabled after imports. |
| numba_cuda/numba/cuda/cext/typeconv.hpp | Add mutex to TypeManager for thread safety. |
| numba_cuda/numba/cuda/cext/typeconv.cpp | Protect compatibility map and overload selection with a mutex. |
| numba_cuda/numba/cuda/cext/mviewbuf.c | Make sequence item reads safer and validate shape/stride lengths for no-GIL operation. |
| numba_cuda/numba/cuda/cext/_typeof.cpp | Add locking around global caches and stabilize container access under no-GIL. |
| numba_cuda/numba/cuda/cext/_typeconv.cpp | Mark _typeconv module as no-GIL compatible. |
| numba_cuda/numba/cuda/cext/_pymodule.h | Add shared macros to mark modules as no-GIL compatible. |
| numba_cuda/numba/cuda/cext/_helpermod.c | Mark _helperlib module as no-GIL compatible. |
| numba_cuda/numba/cuda/cext/_dispatcher.cpp | Add dispatcher locking and adjust reference handling to support no-GIL. |
| docs/source/user/installation.rst | Document free-threaded Python usage and local environment setup. |
| ci/test-matrix.yml | Enable initial 3.14t rows in the CI matrix. |
| .gitignore | Ignore Windows .pyd artifacts. |
| .github/workflows/ci.yaml | Adjust matrix filtering logic to handle 3.14t version strings. |
| .github/workflows/build-wheel.yml | Enable 3.14t in the build-wheel workflow matrix. |
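
The import gate described in the `test_import.py` row can be approximated with the sketch below. It is not the PR's test. The helper names (`is_free_threaded_build`, `gil_is_enabled`, `check_import_gate`) are hypothetical; the only real interfaces used are `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()`, the latter available from CPython 3.13.

```python
import importlib
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """True when the running interpreter was built free-threaded."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_enabled() -> bool:
    """Report the runtime GIL state; interpreters before 3.13 always have a GIL."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


def check_import_gate(module_name: str) -> str:
    """Import a module and report whether the GIL stayed disabled.

    Returns "skipped" when the process did not start with the GIL disabled,
    "ok" when the GIL is still disabled after the import, and
    "gil-reenabled" when the import caused CPython to turn the GIL back on.
    """
    if not is_free_threaded_build() or gil_is_enabled():
        return "skipped"
    importlib.import_module(module_name)
    return "gil-reenabled" if gil_is_enabled() else "ok"
```

Under a free-threaded interpreter started with `PYTHON_GIL=0`, calling `check_import_gate("numba.cuda")` would be the analogous check; on a regular build it simply reports `"skipped"`.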
/ok to test fcd0a40 |
Driver.ensure_initialized() set is_initialized=True before cuInit() returned, so a concurrent first-touch from another thread could sail past the guard and call a driver API on an uninitialized driver, raising CUDA_ERROR_NOT_INITIALIZED. Under free-threaded CPython this race fires reliably (every thread but the initializer fails); under the GIL it is a narrow window while cuInit releases the GIL.

Serialize initialization with a reentrant lock and only publish is_initialized once cuInit has actually completed. A separate _initializing flag breaks the __getattr__ -> cuInit -> ensure_initialized recursion that runs on the initializing thread. __getattr__ now refuses to lazily bind underscore-prefixed names so touching the init lock can never recurse into driver binding.

Adds a spawned-subprocess regression test that hammers cold-start init from many threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
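
The locking pattern this commit message describes can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the numba-cuda driver code: `LazyDriver`, `fake_cu_init`, and `demo` are hypothetical names, and a short sleep stands in for `cuInit`.

```python
import threading
import time


class LazyDriver:
    """Hypothetical stand-in for a lazily initialized driver wrapper."""

    def __init__(self, init_fn):
        self._init_fn = init_fn            # stands in for cuInit
        self._lock = threading.RLock()     # reentrant: the init thread may re-enter
        self._initializing = False
        self.is_initialized = False
        self.init_calls = 0

    def ensure_initialized(self):
        if self.is_initialized:            # fast path once init has been published
            return
        with self._lock:
            # Another thread finished while we waited, or this is the
            # initializing thread re-entering from inside init_fn.
            if self.is_initialized or self._initializing:
                return
            self._initializing = True
            try:
                self._init_fn(self)
                self.init_calls += 1
                self.is_initialized = True  # publish only after init completed
            finally:
                self._initializing = False

    def __getattr__(self, name):
        # Only reached for names not found normally. Never lazily bind
        # private names, so touching _lock cannot recurse into binding.
        if name.startswith("_"):
            raise AttributeError(name)
        self.ensure_initialized()

        def api(*args):
            if not self.is_initialized:
                raise RuntimeError("driver not initialized")
            return (name, args)

        setattr(self, name, api)
        return api


def demo(n_threads=16):
    def fake_cu_init(driver):
        driver.ensure_initialized()        # re-entrant call returns immediately
        time.sleep(0.01)                   # widen the cold-start window

    driver = LazyDriver(fake_cu_init)
    barrier = threading.Barrier(n_threads)
    results, errors = [], []

    def worker(i):
        barrier.wait()                     # release all threads at once
        try:
            results.append(driver.cuDeviceGet(i))
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return driver.init_calls, results, errors
```

In this sketch every worker either performs the one initialization or waits on the lock until it has finished, so no thread observes `is_initialized` as true before the init function has returned.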
_PendingDeallocs.add_item()/clear() mutated a shared deque and size counter with no synchronization. Deallocations are driven by weakref finalizers, which run on whatever thread drops the last reference and, under free-threaded CPython, truly concurrently. The "while self._cons: self._cons.popleft()" loop in clear() raced: one thread's truthiness check then another's popleft drained the deque, raising "IndexError: pop from an empty deque" out of a finalizer. Any multithreaded program freeing device memory could hit it.

Guard the pending list, byte accounting, and disable counter with a lock. clear() now atomically takes ownership of the pending entries under the lock and runs the destructors (which call into the CUDA driver) outside it, so deallocation work never blocks other threads queueing frees.

Adds a regression test that hammers add_item/clear from many threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
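
The "take ownership under the lock, run destructors outside it" approach from this commit message can be sketched like this. It is an illustrative model only: the class below is a simplified, hypothetical `PendingDeallocs` (the `max_count` threshold is an assumption and the disable counter is not modelled), not the code in the PR.

```python
import threading
from collections import deque


class PendingDeallocs:
    """Simplified, hypothetical model of a pending-deallocation queue."""

    def __init__(self, max_count=8):
        self._lock = threading.Lock()
        self._cons = deque()               # (destructor, handle, size) entries
        self._size = 0                     # bytes currently pending
        self._max_count = max_count        # assumed flush threshold

    def add_item(self, dtor, handle, size=0):
        with self._lock:
            self._cons.append((dtor, handle, size))
            self._size += size
            flush = len(self._cons) > self._max_count
        if flush:                          # flush outside the lock
            self.clear()

    def clear(self):
        # Swap the queue out atomically; afterwards this thread is the
        # sole owner of `pending`, so no other thread can pop from it.
        with self._lock:
            pending, self._cons = self._cons, deque()
            self._size = 0
        for dtor, handle, _size in pending:
            dtor(handle)                   # may be slow; the lock is not held


def demo(n_threads=8, per_thread=200):
    freed = []                             # list.append is atomic in CPython
    queue = PendingDeallocs()

    def worker(tid):
        for i in range(per_thread):
            queue.add_item(freed.append, (tid, i), size=1)
            if i % 7 == 0:
                queue.clear()              # concurrent clears must not race

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    queue.clear()                          # drain whatever is left
    return freed
```

Because each `clear()` call works on its own private batch, an entry is destroyed exactly once and a concurrent caller can never pop from a deque that another thread has just emptied.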
I also did a many-core stress pass outside the normal CI matrix for the Python 3.14 free-threaded work. Validated head: I built the branch into a The completed run used an 8x B200 system with 224 logical CPU cores. This was a concurrency/correctness stress pass, not a throughput benchmark. It exercised:
Final result: the
I also tried to include GB200 in this pass, but the completed stress evidence here is from B200.

Pretty, prettttttty good.

-- gpt-5.5 xhigh

Claude may now lower the eyebrow by one millimeter.
/ok to test 50421f1 |
Overview

This PR adds initial CPython 3.14 free-threaded support for `numba-cuda`, including `cp314t` wheel builds.

The user-visible goal is that a CPython free-threaded process started with `PYTHON_GIL=0` can import `numba.cuda` without CPython re-enabling the GIL. CPython does that re-enable unless native extension modules explicitly declare no-GIL support, so this PR opts the supported `numba-cuda` native modules into `Py_MOD_GIL_NOT_USED` after hardening the shared state they can reach without the GIL.

What changed

- `3.14t`/`cp314t` and propagate `Py_GIL_DISABLED=1` when building extension modules under a free-threaded Python.
- `_typeof` caches and fingerprinting state, `_typeconv` compatibility data, and `mviewbuf` sequence/buffer helpers.
- `numba_cuda/numba/cuda/tests/stress` and supporting focused test locations.
- `cp314t` smoke environment and the `PYTHON_GIL=0` import gate.

Stress tests and environment variables
The heavier free-threading tests are opt-in because they create many threads and subprocesses. Enable them with:
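
A minimal sketch of one way to launch them from Python, assuming that setting `NUMBA_CUDA_FT_STRESS` to `1` counts as enabled (the accepted values are not stated in this description). The helpers `stress_env` and `stress_command` are hypothetical; the variable names and the test module path are the ones that appear in this PR description.

```python
import os
import sys

# Optional tuning knobs, keyed by a short keyword for convenience.
_OPTIONAL = {
    "seconds": "NUMBA_CUDA_FT_STRESS_SECONDS",
    "workers": "NUMBA_CUDA_FT_STRESS_WORKERS",
    "processes": "NUMBA_CUDA_FT_STRESS_PROCESSES",
    "iters": "NUMBA_CUDA_FT_STRESS_ITERS",
}


def stress_env(base=None, **overrides):
    """Return a copy of the environment with the stress tests switched on."""
    env = dict(os.environ if base is None else base)
    env["NUMBA_CUDA_FT_STRESS"] = "1"      # assumption: "1" means enabled
    for key, value in overrides.items():
        if value is not None:
            env[_OPTIONAL[key]] = str(value)
    return env


def stress_command(python=sys.executable):
    """Build a pytest invocation for the free-threading stress module."""
    return [
        python, "-m", "pytest", "-q",
        "--pyargs", "numba.cuda.tests.stress.test_free_threading",
    ]
```

For example, `subprocess.run(stress_command(), env=stress_env(workers=32, seconds=10))` would run the module with two of the controls overridden; on an interpreter that is not free-threaded the tests are expected to skip.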
The stress controls are:
- `NUMBA_CUDA_FT_STRESS`: enables the opt-in stress tests.
- `NUMBA_CUDA_FT_STRESS_SECONDS`: duration for timed CUDA stress phases.
- `NUMBA_CUDA_FT_STRESS_WORKERS`: worker-thread count for thread-pool phases.
- `NUMBA_CUDA_FT_STRESS_PROCESSES`: subprocess count for cache-concurrency phases.
- `NUMBA_CUDA_FT_STRESS_ITERS`: iteration count override for loop-based no-CUDA phases.

The CUDA dispatch stress deliberately avoids a concurrent tight loop of full `gc.collect()` calls. That pattern timed out in CPython 3.14t free-threaded GC code on many-core systems and is being tracked separately from this PR. The dispatch stress still runs a full collection after workers shut down.

Validation
Current branch head: `2953b481ebe2177b5fd9b05a4f8c5d3fbbc9facc`.

Local validation on the current head:

- `git diff --check`: passed.
- `pixi run -e dev ruff check <changed-python-files>`: passed.
- `python -m py_compile <changed-python-files>`: passed.
- `pixi run -e docs build-docs`: passed.
- `jq` evaluation for the CI matrix filters passed. The third-party jobs continue to choose the non-free-threaded latest Python row, while explicit `3.14t` matrix rows remain available where configured.
- `ci/tools/env-vars build` with `PY_VER=3.14t` emits `PYTHON_VERSION_FORMATTED=314t`, `CIBW_BUILD=cp314t-*`, and `numba-cuda-python314t-*` artifact names.
- `pixi run -e default python -m pytest -q testing --override-ini "addopts=--benchmark-disable --import-mode=importlib" --pyargs numba.cuda.tests.stress.test_free_threading`: `5 skipped` in the local default env, which is not a free-threaded interpreter.

Many-core Linux validation:
- `cp314t` wheel, passed the `PYTHON_GIL=0` import gate, and passed the focused free-threading stress phases under Python 3.14t with the GIL disabled.
- `7bf53d4b`. The current head adds only behavior-preserving CI YAML formatting.

Known limits and follow-up validation
- `cp314t` still needs a real builder pass. The build changes account for Windows needing explicit `Py_GIL_DISABLED` propagation, but this PR should not be treated as Windows-proven until that run completes.
- `cp314t` is in the matrix work, but still needs real CI or hardware validation.
- `2953b481` was quota-blocked, so a fresh Copilot pass is still pending.

Notes for reviewers
The highest-value review areas are the native state protection around `_dispatcher`, `_typeof`, `_typeconv`, and `mviewbuf`, plus the Python-side driver/dispatcher locking and the repo-native stress test coverage. The intent is initial `cp314t` support for the exercised import, native extension, driver first-touch, dispatcher, cache, and stress paths, not a claim that every possible CUDA workload has been exhaustively proven free-thread-safe.