Support Python 3.14 free-threaded CUDA wheels #897

Open
tpn wants to merge 24 commits into NVIDIA:main from tpn:feature/add-314-free-threading-support

@tpn

@tpn tpn commented Jun 12, 2026

Contributor

Overview

This PR adds initial CPython 3.14 free-threaded support for numba-cuda, including cp314t wheel builds.

The user-visible goal is that a CPython free-threaded process started with PYTHON_GIL=0 can import numba.cuda without CPython re-enabling the GIL. CPython re-enables the GIL unless native extension modules explicitly declare no-GIL support, so this PR opts the supported numba-cuda native modules into Py_MOD_GIL_NOT_USED after hardening the shared state they can reach without the GIL.
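
The import gate described above can be approximated with a small check. This is a minimal sketch, not the test added by this PR: `gil_status` and `check_import_gate` are hypothetical names, and the sketch assumes `sys._is_gil_enabled()` (available on CPython 3.13 and later) and the `Py_GIL_DISABLED` build configuration variable are the right signals to inspect.

```python
import importlib
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for this interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() is missing before CPython 3.13; in that case
    # the GIL is always on.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())
    return free_threaded_build, gil_enabled


def check_import_gate(module_name):
    """Import a module, then report whether the GIL is still disabled."""
    importlib.import_module(module_name)
    free_threaded_build, gil_enabled = gil_status()
    return free_threaded_build and not gil_enabled
```

In a free-threaded environment with the wheel installed, one might call `check_import_gate("numba.cuda")` from a process started with PYTHON_GIL=0. On a regular (GIL) build the function simply returns False.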

What changed

  • Build and CI paths understand 3.14t / cp314t and propagate Py_GIL_DISABLED=1 when building extension modules under a free-threaded Python.
  • The shared native module-init helper marks the C/C++ extension modules as no-GIL compatible when the build target is free-threaded CPython.
  • Native shared state is protected where the GIL previously provided serialization, including dispatcher overload/fallback state, _typeof caches and fingerprinting state, _typeconv compatibility data, and mviewbuf sequence/buffer helpers.
  • Python-side runtime state now has additional locking around CUDA driver first touch, pending deallocations, dispatcher specialization, and launch-config-sensitive cache/specialization paths.
  • The ad hoc stress scripts have been converted into repo-native tests under numba_cuda/numba/cuda/tests/stress and supporting focused test locations.
  • Developer docs now cover free-threaded stress testing and the stress-control environment variables. User docs describe a conda-forge cp314t smoke environment and the PYTHON_GIL=0 import gate.
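
The build-side propagation of Py_GIL_DISABLED mentioned in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in this PR's setup.py: `free_threaded_define_macros` is a hypothetical helper, and the sketch assumes the free-threaded build is detected through `sysconfig.get_config_var("Py_GIL_DISABLED")`.

```python
import sysconfig


def free_threaded_define_macros(gil_disabled=None):
    """Return extra C preprocessor defines for a free-threaded target.

    gil_disabled is only a test hook for this sketch; when it is None the
    running interpreter's build configuration is consulted.
    """
    if gil_disabled is None:
        gil_disabled = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Each entry is a (name, value) pair, the shape expected by the
    # define_macros argument of setuptools.Extension.
    return [("Py_GIL_DISABLED", "1")] if gil_disabled else []
```

The returned list could be passed as `define_macros=` when constructing each `setuptools.Extension`, so that native code guarded by `#ifdef Py_GIL_DISABLED` is compiled only for free-threaded targets.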

Stress tests and environment variables

The heavier free-threading tests are opt-in because they create many threads and subprocesses. Enable them with:

PYTHON_GIL=0 NUMBA_CUDA_FT_STRESS=1 \
  python -m pytest -q --pyargs numba.cuda.tests.stress

The stress controls are:

  • NUMBA_CUDA_FT_STRESS: enables the opt-in stress tests.
  • NUMBA_CUDA_FT_STRESS_SECONDS: duration for timed CUDA stress phases.
  • NUMBA_CUDA_FT_STRESS_WORKERS: worker-thread count for thread-pool phases.
  • NUMBA_CUDA_FT_STRESS_PROCESSES: subprocess count for cache-concurrency phases.
  • NUMBA_CUDA_FT_STRESS_ITERS: iteration count override for loop-based no-CUDA phases.
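
A test helper could read these controls roughly as shown below. The variable names come from the list above; everything else is an assumption of this sketch, including the `stress_config` helper, the rule that only the value "1" enables the tests, and the default values.

```python
import os


def _env_int(env, name, default):
    """Read an integer variable, falling back to a default when unset."""
    raw = env.get(name)
    return default if raw in (None, "") else int(raw)


def stress_config(environ=None):
    """Collect the stress controls into a dict (defaults are hypothetical)."""
    env = os.environ if environ is None else environ
    prefix = "NUMBA_CUDA_FT_STRESS"
    return {
        # Assumption: the tests are enabled only by the literal value "1".
        "enabled": env.get(prefix, "0") == "1",
        "seconds": _env_int(env, prefix + "_SECONDS", 10),
        "workers": _env_int(env, prefix + "_WORKERS", os.cpu_count() or 1),
        "processes": _env_int(env, prefix + "_PROCESSES", 4),
        # None means "no override": each phase keeps its own iteration count.
        "iters": _env_int(env, prefix + "_ITERS", None),
    }
```

For example, `stress_config({"NUMBA_CUDA_FT_STRESS": "1", "NUMBA_CUDA_FT_STRESS_WORKERS": "64"})` yields an enabled configuration with 64 workers and the remaining values at their placeholder defaults.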

The CUDA dispatch stress deliberately avoids a concurrent tight loop of full gc.collect() calls. That pattern timed out in CPython 3.14t free-threaded GC code on many-core systems and is being tracked separately from this PR. The dispatch stress still runs a full collection after workers shut down.
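
The shape of that phase can be illustrated with a small sketch: workers never call gc.collect() themselves, and a single full collection runs once the pool has shut down. `run_dispatch_stress` and its `work` callback are hypothetical stand-ins; the real stress test launches CUDA kernels rather than calling an arbitrary function.

```python
import gc
from concurrent.futures import ThreadPoolExecutor


def run_dispatch_stress(work, workers=4, iterations=100):
    """Run work(worker_id, i) from many threads, then collect once."""

    def loop(worker_id):
        done = 0
        for i in range(iterations):
            # No gc.collect() here: workers never force a full collection.
            work(worker_id, i)
            done += 1
        return done

    # Leaving the with-block shuts the pool down and joins every worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        totals = list(pool.map(loop, range(workers)))

    # One full collection, after all workers have stopped.
    gc.collect()
    return sum(totals)
```

Calling `run_dispatch_stress(lambda worker_id, i: None, workers=3, iterations=10)` performs 30 no-op calls and returns that count.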

Validation

Current branch head: 2953b481ebe2177b5fd9b05a4f8c5d3fbbc9facc.

Local validation on the current head:

  • git diff --check: passed.
  • pixi run -e dev ruff check <changed-python-files>: passed.
  • python -m py_compile <changed-python-files>: passed.
  • pixi run -e docs build-docs: passed.
  • YAML parsing plus jq evaluation for the CI matrix filters passed. The third-party jobs continue to choose the non-free-threaded latest Python row, while explicit 3.14t matrix rows remain available where configured.
  • ci/tools/env-vars build with PY_VER=3.14t emits PYTHON_VERSION_FORMATTED=314t, CIBW_BUILD=cp314t-*, and numba-cuda-python314t-* artifact names.
  • pixi run -e default python -m pytest -q testing --override-ini "addopts=--benchmark-disable --import-mode=importlib" --pyargs numba.cuda.tests.stress.test_free_threading: 5 skipped in the local default env, which is not a free-threaded interpreter.
  • RoboRev standard, security, and design passes all found no issues for the final CI readability commit.
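
The PY_VER mapping reported for ci/tools/env-vars in the list above can be written out as a small function. This is only an illustration of the expected outputs, not the tool's implementation (which is not shown here); `format_python_version` is a hypothetical name, and the artifact name is returned as a prefix because the rest of it is elided in the description.

```python
def format_python_version(py_ver):
    """Map a version such as "3.14t" to the names quoted in the PR text."""
    # A trailing "t" marks a free-threaded interpreter.
    free_threaded = py_ver.endswith("t")
    base = py_ver[:-1] if free_threaded else py_ver
    major, minor = base.split(".")
    formatted = f"{major}{minor}" + ("t" if free_threaded else "")
    return {
        "PYTHON_VERSION_FORMATTED": formatted,
        "CIBW_BUILD": f"cp{formatted}-*",
        # Only the prefix is known from the description above.
        "artifact_prefix": f"numba-cuda-python{formatted}-",
    }
```

With `"3.14t"` as input, the three values are `314t`, `cp314t-*`, and a prefix of `numba-cuda-python314t-`, matching the validation bullet.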

Many-core Linux validation:

  • On an 8x B200 system with 224 logical CPU cores, the branch built and installed a cp314t wheel, passed the PYTHON_GIL=0 import gate, and passed the focused free-threading stress phases under Python 3.14t with the GIL disabled.
  • That B200 run covered concurrent CUDA driver first touch, dispatcher fingerprinting while containers were mutated, memoryview helpers, dispatcher specialization and launch paths, launch-config-sensitive specialization, and a mixed stress workload.
  • The stress evidence was collected on runtime-code head 7bf53d4b. The current head adds only behavior-preserving CI YAML formatting.

Known limits and follow-up validation

  • Windows cp314t still needs a real builder pass. The build changes account for Windows needing explicit Py_GIL_DISABLED propagation, but this PR should not be treated as Windows-proven until that run completes.
  • AArch64 cp314t is in the matrix work, but still needs real CI or hardware validation.
  • GitHub checks are not currently reported on the PR. NVIDIA runner gating still needs to accept/run the workflows.
  • Copilot's latest clean review is stale relative to the current head. The current-head request at 2953b481 was quota-blocked, so a fresh Copilot pass is still pending.
  • The CPython full-GC lack-of-progress investigation (the concurrent gc.collect() timeout described above) is intentionally out of scope for this PR.

Notes for reviewers

The highest-value review areas are the native state protection around _dispatcher, _typeof, _typeconv, and mviewbuf, plus the Python-side driver/dispatcher locking and the repo-native stress test coverage. The intent is initial cp314t support for the exercised import, native extension, driver first-touch, dispatcher, cache, and stress paths, not a claim that every possible CUDA workload has been exhaustively proven free-thread-safe.

@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tpn tpn requested a review from Copilot June 12, 2026 21:40
@tpn

tpn commented Jun 12, 2026

Contributor Author

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

/ok to test

@tpn, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Copilot AI left a comment

Contributor

Pull request overview

This PR adds initial support for CPython 3.14 free-threaded (“3.14t”, cp314t) wheels by opting the project’s native CUDA extension modules into no-GIL mode and hardening shared native state with locks, plus adding focused import/threading tests and CI matrix coverage.

Changes:

  • Detect free-threaded Python builds during extension builds and enable Py_GIL_DISABLED-gated code paths; mark native modules as Py_MOD_GIL_NOT_USED.
  • Add/adjust native synchronization around shared caches and dispatcher/type-conversion state to support running without the GIL.
  • Add tests and CI matrix entries for 3.14t, plus documentation updates describing local validation and expectations.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| testing/pytest.ini | Switch pytest import mode to importlib to preserve redirected module naming during collection. |
| setup.py | Detect free-threaded builds and define Py_GIL_DISABLED for extension compilation. |
| numba_cuda/numba/cuda/tests/nocuda/test_typeconv_threading.py | New threading stress test for concurrent type-conversion reads/writes. |
| numba_cuda/numba/cuda/tests/nocuda/test_import.py | New free-threaded import test ensuring PYTHON_GIL=0 keeps the GIL disabled after imports. |
| numba_cuda/numba/cuda/cext/typeconv.hpp | Add mutex to TypeManager for thread safety. |
| numba_cuda/numba/cuda/cext/typeconv.cpp | Protect compatibility map and overload selection with a mutex. |
| numba_cuda/numba/cuda/cext/mviewbuf.c | Make sequence item reads safer and validate shape/stride lengths for no-GIL operation. |
| numba_cuda/numba/cuda/cext/_typeof.cpp | Add locking around global caches and stabilize container access under no-GIL. |
| numba_cuda/numba/cuda/cext/_typeconv.cpp | Mark _typeconv module as no-GIL compatible. |
| numba_cuda/numba/cuda/cext/_pymodule.h | Add shared macros to mark modules as no-GIL compatible. |
| numba_cuda/numba/cuda/cext/_helpermod.c | Mark _helperlib module as no-GIL compatible. |
| numba_cuda/numba/cuda/cext/_dispatcher.cpp | Add dispatcher locking and adjust reference handling to support no-GIL. |
| docs/source/user/installation.rst | Document free-threaded Python usage and local environment setup. |
| ci/test-matrix.yml | Enable initial 3.14t rows in the CI matrix. |
| .gitignore | Ignore Windows .pyd artifacts. |
| .github/workflows/ci.yaml | Adjust matrix filtering logic to handle 3.14t version strings. |
| .github/workflows/build-wheel.yml | Enable 3.14t in the build-wheel workflow matrix. |

Comment thread setup.py
Comment thread numba_cuda/numba/cuda/tests/nocuda/test_import.py
Comment thread numba_cuda/numba/cuda/cext/_pymodule.h Outdated
Comment thread numba_cuda/numba/cuda/cext/_dispatcher.cpp
Comment thread numba_cuda/numba/cuda/cext/mviewbuf.c
@tpn tpn marked this pull request as ready for review June 12, 2026 22:47
@tpn

tpn commented Jun 13, 2026

Contributor Author

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Jun 13, 2026

/ok to test

@tpn, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@tpn

tpn commented Jun 13, 2026

Contributor Author

/ok to test fcd0a40

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 1 comment.

Comment thread numba_cuda/numba/cuda/cext/mviewbuf.c Outdated
tpn and others added 3 commits June 12, 2026 18:26
Driver.ensure_initialized() set is_initialized=True before cuInit()
returned, so a concurrent first-touch from another thread could sail past
the guard and call a driver API on an uninitialized driver, raising
CUDA_ERROR_NOT_INITIALIZED. Under free-threaded CPython this race fires
reliably (every thread but the initializer fails); under the GIL it is a
narrow window while cuInit releases the GIL.

Serialize initialization with a reentrant lock and only publish
is_initialized once cuInit has actually completed. A separate _initializing
flag breaks the __getattr__ -> cuInit -> ensure_initialized recursion that
runs on the initializing thread. __getattr__ now refuses to lazily bind
underscore-prefixed names so touching the init lock can never recurse into
driver binding.

Adds a spawned-subprocess regression test that hammers cold-start init from
many threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
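
The initialization pattern described in this commit message can be sketched in isolation. This is a minimal illustration under assumptions, not the numba-cuda driver code: `LazyDriver` and `init_fn` are hypothetical stand-ins for the driver class and the cuInit call, and the `__getattr__` change is not modeled.

```python
import threading


class LazyDriver:
    """Publish is_initialized only after the init call has completed."""

    def __init__(self, init_fn):
        self._init_fn = init_fn          # stand-in for the cuInit() call
        self._lock = threading.RLock()   # reentrant: init may call back in
        self._initializing = False
        self.is_initialized = False

    def ensure_initialized(self):
        # Fast path once initialization has been published.
        if self.is_initialized:
            return
        with self._lock:
            # Another thread finished first, or this thread re-entered
            # from inside init_fn: nothing more to do in either case.
            if self.is_initialized or self._initializing:
                return
            self._initializing = True
            try:
                self._init_fn(self)
                # Publish only after init_fn has returned successfully.
                self.is_initialized = True
            finally:
                self._initializing = False
```

Threads that arrive while initialization is in progress block on the lock instead of passing the guard early. The thread that is already initializing re-enters through the reentrant lock and returns at the `_initializing` check. If `init_fn` raises, `is_initialized` stays False in this sketch, so a later call retries.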
_PendingDeallocs.add_item()/clear() mutated a shared deque and size counter
with no synchronization. Deallocations are driven by weakref finalizers,
which run on whatever thread drops the last reference and, under free-threaded
CPython, truly concurrently. The "while self._cons: self._cons.popleft()" loop
in clear() raced: one thread's truthiness check then another's popleft drained
the deque, raising "IndexError: pop from an empty deque" out of a finalizer.
Any multithreaded program freeing device memory could hit it.

Guard the pending list, byte accounting, and disable counter with a lock.
clear() now atomically takes ownership of the pending entries under the lock
and runs the destructors (which call into the CUDA driver) outside it, so
deallocation work never blocks other threads queueing frees.

Adds a regression test that hammers add_item/clear from many threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
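
The take-ownership-then-run pattern from this commit message looks roughly like the sketch below. It is a simplified illustration, not the actual _PendingDeallocs class: the `PendingDeallocs` name, the tuple layout, and the choice of a reentrant lock are assumptions of this sketch, and the disable counter is left out.

```python
import threading
from collections import deque


class PendingDeallocs:
    """Queue destructors under a lock and run them outside it."""

    def __init__(self):
        # Assumption: a reentrant lock, in case a finalizer fires on the
        # thread that already holds the lock.
        self._lock = threading.RLock()
        self._cons = deque()
        self._size = 0

    def add_item(self, dtor, handle, size=0):
        with self._lock:
            self._cons.append((dtor, handle, size))
            self._size += size

    def clear(self):
        # Take ownership of every pending entry in one step under the lock.
        with self._lock:
            pending, self._cons = self._cons, deque()
            self._size = 0
        # Run destructors without holding the lock, so other threads can
        # keep queueing frees while this work proceeds.
        for dtor, handle, _size in pending:
            dtor(handle)
        return len(pending)
```

Because each `clear()` call swaps in a fresh deque before iterating, no two threads ever drain the same container, which removes the check-then-popleft race described above.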

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.

Comment thread numba_cuda/numba/cuda/tests/cudadrv/test_init.py

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 3 comments.

Comment thread numba_cuda/numba/cuda/cudadrv/driver.py Outdated
Comment thread numba_cuda/numba/cuda/cudadrv/driver.py
Comment thread numba_cuda/numba/cuda/cext/mviewbuf.c

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 1 comment.

Comment thread numba_cuda/numba/cuda/cudadrv/driver.py Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 3 comments.

Comment thread numba_cuda/numba/cuda/cext/_pymodule.h Outdated
Comment thread numba_cuda/numba/cuda/cext/_typeof.cpp Outdated
Comment thread numba_cuda/numba/cuda/tests/nocuda/test_typeof_threading.py

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated no new comments.

@tpn tpn self-assigned this Jun 13, 2026
@tpn

tpn commented Jun 13, 2026

Contributor Author

I also did a many-core stress pass outside the normal CI matrix for the Python 3.14 free-threaded work.

Validated head: 50421f11

I built the branch into a cp314t wheel, installed that wheel into a Python 3.14 free-threaded environment, and verified the process started with the GIL disabled and kept it disabled after importing numba.cuda and the relevant C extension modules.

The completed run used an 8x B200 system with 224 logical CPU cores. This was a concurrency/correctness stress pass, not a throughput benchmark. It exercised:

  • concurrent CUDA driver first-touch / initialization
  • _dispatcher.compute_fingerprint() while containers were being mutated
  • memoryview extent/buffer helpers
  • dispatcher specialization and launch paths with 64 worker threads
  • launch-config-sensitive specialization with 64 worker threads
  • mixed kitchen-sink workload with 96 worker threads

Final result: the cp314t wheel built and installed successfully, the GIL-disabled import gate passed, and all stress phases completed:

  • fingerprint stress passed
  • mviewbuf stress passed
  • dispatch stress passed
  • launch-config-sensitive (LC-S) stress passed
  • kitchen-sink stress passed

I also tried to include GB200 in this pass, but the completed stress evidence here is from B200.

Pretty, prettttttty good.

-- gpt-5.5 xhigh

Claude may now lower the eyebrow by one millimeter.

@tpn tpn requested a review from Copilot June 13, 2026 06:52

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@tpn

tpn commented Jun 13, 2026

Contributor Author

/ok to test 50421f1

@tpn tpn force-pushed the feature/add-314-free-threading-support branch from 49791e8 to 38cf185 Compare June 14, 2026 03:46
@tpn tpn requested a review from Copilot June 14, 2026 19:29

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@tpn tpn requested a review from Copilot June 15, 2026 00:07

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

2 participants