fix(base_dsl): drop ArchMeta alias so Arch.sm_*.value is correct#3248

Open
lingolin128 wants to merge 1 commit into NVIDIA:main from lingolin128:fix

@lingolin128

Summary

Arch.sm_110f.value returned (10, 1, 'f') instead of (11, 0, 'f'). The same class of bug existed on the other CUDA
branch — Arch.sm_101f.value returned (11, 0, 'f') when CUDA ≥ 13. Anything reading .value, .major, or .minor
on these members got the wrong tuple.

Root cause

In python/CuTeDSL/cutlass/base_dsl/arch.py, sm_101* and sm_110* were declared as separate enum members with
different value tuples:

sm_101  = (10, 1, "")
sm_101f = (10, 1, "f")
sm_110  = (11, 0, "")
sm_110f = (11, 0, "f")

A custom ArchMeta(EnumMeta) then tried to alias one set onto the other based on CUDA version via __getattribute__ and
__getitem__:

  • CUDA ≥ 13: sm_101* → sm_110*
  • CUDA < 13: sm_110* → sm_101*

This is fundamentally incompatible with how Enum works. A real enum alias must share the same value tuple as its
canonical member; here the two members had different tuples, and the metaclass only intercepted attribute / subscript
lookup. So Arch.sm_110f got silently rerouted to the sm_101f member object, and .value on that object honestly reported
(10, 1, 'f').

The bug surfaces on both CUDA branches — just on the opposite name on each. The (10, 1, 'f') symptom means CUDA < 13.
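
The mismatch can be reproduced without any metaclass. The sketch below is a minimal model, not the code from arch.py: rerouted_lookup is a hypothetical plain function standing in for ArchMeta's name rerouting, and the enum is reduced to the two members involved. It only illustrates that rerouting a name to a different member cannot change what that member reports as its value.

```python
from enum import Enum


class Arch(Enum):
    # Two independent members with different value tuples, as described above.
    sm_101f = (10, 1, "f")
    sm_110f = (11, 0, "f")


def rerouted_lookup(name: str, cuda_major: int) -> Arch:
    """Hypothetical stand-in for ArchMeta's name rerouting (not the real code)."""
    if cuda_major >= 13:
        reroute = {"sm_101f": "sm_110f"}
    else:
        reroute = {"sm_110f": "sm_101f"}
    return Arch[reroute.get(name, name)]


member = rerouted_lookup("sm_110f", cuda_major=12)
print(member.name, member.value)      # sm_101f (10, 1, 'f')
print(Arch.sm_110f is Arch.sm_101f)   # False: still two distinct members
print([m.name for m in Arch])         # ['sm_101f', 'sm_110f']
```

The caller asked for sm_110f but received the sm_101f object, so .name and .value describe sm_101f, while iteration and direct lookup still see two separate members.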

Fix

Drop ArchMeta entirely and let sm_101* and sm_110* stand as independent enum members, each carrying its correct (major,
minor, suffix) tuple.

The cross-name family relationship (sm_101f is family-of sm_110f but not sm_100f) is already handled inside
Arch.is_family_of via an explicit special case on sm_101a / sm_101f, so no semantics are lost.

Net diff: arch.py loses ~50 lines of metaclass machinery; the rest of the file is unchanged.

Verification

  assert Arch.sm_110f.value == (11, 0, 'f')
  assert Arch.sm_101f.value == (10, 1, 'f')
  assert Arch.sm_110.value  == (11, 0, '')
  assert Arch.sm_101.value  == (10, 1, '')

  assert Arch.from_string('sm_110f').value == (11, 0, 'f')
  assert Arch.from_string('sm_101f').value == (10, 1, 'f')

  assert Arch.sm_101f.is_family_of(Arch.sm_110f) is True
  assert Arch.sm_101f.is_family_of(Arch.sm_100f) is False
  assert Arch.sm_103f.is_family_of(Arch.sm_100f) is True

All pass.

Compatibility notes

  • Existing call sites that already use Arch.sm_110* or Arch.sm_101* (tcgen05/copy.py, tcgen05/mma.py,
    numeric_conversion.py, the MLA decode examples) continue to work.
  • Arch["sm_110f"] / Arch["sm_101f"] both resolve to their own members instead of being version-routed. Callers that
    relied on the implicit rerouting should select the canonical member for their target CUDA version explicitly.

@lingolin128
Author

fix: #3249

@yiwangchunyu

Hi @lingolin128 , thanks for the careful Enum-semantics analysis — the diagnosis that __getattribute__-based aliasing with different _value_ tuples can't be self-consistent is accurate.

Before we land it, some context that wasn't documented in the original code (and probably why this looked like a clear bug):

What ArchMeta was trying to do

We want to use the same Python source against both CUDA 12.9 and CUDA 13.1. sm_101 was renamed to sm_110 after CUDA 13.0. ArchMeta was trying to make this transparent to callers, so that both of these work on either CUDA version without per-call-site handling:

if arch is Arch.sm_110f:                                # same chip on both CUDA
    emit_arch_specific_intrinsic()

target_str = f"sm_{arch.major}{arch.minor}{arch.suffix}"  # toolchain-recognized
target_str = arch.to_string()

The symptom you saw — Arch.sm_110f.value == (10, 1, 'f') on CUDA 12 — is actually the intended outcome of that second pattern: on CUDA 12, the chip you're referring to is what ptxas calls sm_101f, so (10, 1, 'f') is the tuple that produces a ptxas-acceptable string.

That said, you're right that the implementation is broken (fights against _member_map_, iteration, pickle, etc.). So I agree it shouldn't stay as-is.

Concern with deleting it

After this PR, Arch.sm_110f and Arch.sm_101f become independent members on every CUDA, so:

  • if arch is Arch.sm_110f: silently misses the sm_101f case (and vice versa) — every site needs defensive or.
  • Formatting a target string from Arch.sm_110* emits an sm_110 name that the CUDA 12.9 toolchain rejects.
  • Some call sites might silently regress.

A smaller fix

Use real Python Enum aliases gated on CUDA version (same _value_ tuple → automatic alias, no metaclass):

class Arch(Enum):
    sm_100  = (10, 0, "")
    # ...
    if CUDA_VERSION.major >= 13:
        sm_110  = (11, 0, "")
        sm_110a = (11, 0, "a")
        sm_110f = (11, 0, "f")
        sm_101  = sm_110
        sm_101a = sm_110a
        sm_101f = sm_110f
    else:
        sm_101  = (10, 1, "")
        sm_101a = (10, 1, "a")
        sm_101f = (10, 1, "f")
        sm_110  = sm_101
        sm_110a = sm_101a
        sm_110f = sm_101f
CUDA 12.9:  Arch.sm_110f is Arch.sm_101f → True; .value (10,1,'f'); .name 'sm_101f'
CUDA 13.1:  Arch.sm_110f is Arch.sm_101f → True; .value (11,0,'f'); .name 'sm_110f'
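
To check both branches in one interpreter, the pattern can be wrapped in a factory. This is a reduced sketch under assumptions: make_arch and target are hypothetical names introduced for illustration, the enum carries only three members, and the CUDA version is passed in as a plain integer instead of being read from CUDA_VERSION.

```python
from enum import Enum


def make_arch(cuda_major: int):
    """Build a reduced Arch enum for a given CUDA major version (hypothetical helper)."""

    class Arch(Enum):
        sm_100f = (10, 0, "f")
        if cuda_major >= 13:
            sm_110f = (11, 0, "f")
            sm_101f = sm_110f  # same tuple, so Enum treats it as an alias
        else:
            sm_101f = (10, 1, "f")
            sm_110f = sm_101f  # same tuple, so Enum treats it as an alias

        def target(self) -> str:
            # Toolchain-style name built from the canonical member's tuple.
            major, minor, suffix = self.value
            return f"sm_{major}{minor}{suffix}"

    return Arch


for cuda_major in (12, 13):
    Arch = make_arch(cuda_major)
    print(cuda_major, Arch.sm_110f is Arch.sm_101f, Arch.sm_110f.name, Arch.sm_110f.target())
# 12 True sm_101f sm_101f
# 13 True sm_110f sm_110f
```

Inside an Enum class body a name such as sm_110f still refers to the raw tuple, which is why assigning it to a second name yields a true alias. Aliases are also skipped during iteration, so each enum built here iterates over two members, not three.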

This kills ArchMeta and makes .value/.name/identity all self-consistent (your goals), while keeping the single-source / no-OR-at-callers property.

Does this direction match what you had in mind for the fix? I'd like to hear if you see issues with the real-alias pattern, or if there are alternatives you've considered that we should weigh against it. Happy to iterate on the design. Thanks again — your analysis made the right replacement much easier to see.

@lingolin128
Author

lingolin128 commented May 22, 2026

Hi @yiwangchunyu , thanks for the context — the single-source / ptxas-string motivation wasn't obvious from the code. The real-alias pattern is exactly right: it kills ArchMeta, makes .value/.name/identity self-consistent, and keeps the no-OR-at-callers property. Pushed an update that adopts it. Verified on both CUDA branches with a stubbed CUDA_VERSION:

                                 CUDA 12.x        CUDA 13.x
sm_110f is sm_101f               True             True
.value                           (10, 1, 'f')     (11, 0, 'f')
.name                            'sm_101f'        'sm_110f'
f"sm_{m}{n}{s}" / to_string()    ptxas-accepted   ptxas-accepted

if arch is Arch.sm_110f: matches on both CUDAs, toolchain strings stay accepted.

One thing that needs to come along: is_family_of. The existing special case arch.major == 11 and arch.minor >= 0 breaks reflexivity under real aliases — on CUDA 12 the canonical member has .major == 10, so sm_110f.is_family_of(sm_110f) would return False. Since aliases collapse the two names into one member, I rewrote it as identity-based:

if self in [Arch.sm_101a, Arch.sm_101f]:
    return arch in [Arch.sm_101a, Arch.sm_101f]

Holds reflexivity, holds cross-name, still rejects sm_100f as the family root. sm_103f.is_family_of(sm_100f) is unaffected.
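
The reflexivity problem can be shown in isolation. The sketch below assumes the CUDA 12 alias layout and models only the two special-case checks quoted in this comment; value_based_case and identity_based_case are hypothetical helper names, and the rest of is_family_of is not reproduced because it is not shown in this thread.

```python
from enum import Enum


class Arch(Enum):
    # CUDA 12 layout: sm_110* are aliases of sm_101*.
    sm_100f = (10, 0, "f")
    sm_101a = (10, 1, "a")
    sm_101f = (10, 1, "f")
    sm_110a = sm_101a
    sm_110f = sm_101f

    @property
    def major(self) -> int:
        return self.value[0]

    @property
    def minor(self) -> int:
        return self.value[1]


def value_based_case(arch: Arch) -> bool:
    # Modeled on the quoted check; it assumes the family root has major == 11.
    return arch.major == 11 and arch.minor >= 0


def identity_based_case(arch: Arch) -> bool:
    # Compares member objects, so it is independent of which tuple is canonical.
    return arch in [Arch.sm_101a, Arch.sm_101f]


print(value_based_case(Arch.sm_110f))     # False: the canonical tuple is (10, 1, 'f')
print(identity_based_case(Arch.sm_110f))  # True
print(identity_based_case(Arch.sm_100f))  # False
```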

Heads-up on BlackwellArchs(): under real aliases the literal tuple now contains duplicate object refs (sm_101* and sm_110* are the same member), so len(Arch.BlackwellArchs()) no longer matches the number of distinct chips, and set(...) shrinks. I'm not sure whether the original size was load-bearing for any caller, so I didn't pre-emptively dedupe. Let me know if it should be tuple(dict.fromkeys(...)) and I'll add it.
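
For reference, a small sketch of what the duplication and the dict.fromkeys dedupe look like. The enum is cut down to four names and blackwell_archs_raw / blackwell_archs_deduped are hypothetical stand-ins, not the actual BlackwellArchs() body.

```python
from enum import Enum


class Arch(Enum):
    sm_100 = (10, 0, "")
    sm_101 = (10, 1, "")
    sm_110 = sm_101  # alias of sm_101 (CUDA 12 layout)
    sm_103 = (10, 3, "")


def blackwell_archs_raw():
    # Literal tuple that lists both names of the aliased member.
    return (Arch.sm_100, Arch.sm_101, Arch.sm_110, Arch.sm_103)


def blackwell_archs_deduped():
    # dict.fromkeys keeps the first occurrence of each member, in order.
    return tuple(dict.fromkeys(blackwell_archs_raw()))


print(len(blackwell_archs_raw()))                    # 4
print([a.name for a in blackwell_archs_raw()])       # ['sm_100', 'sm_101', 'sm_101', 'sm_103']
print([a.name for a in blackwell_archs_deduped()])   # ['sm_100', 'sm_101', 'sm_103']
```

Unlike set(...), this keeps the declaration order of the tuple.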

@yiwangchunyu

Hi, @lingolin128 , thanks for your update!

  • is_family_of: good catch. The rewrite is necessary. One small tweak I'd suggest to keep the behavior strictly equivalent to the pre-PR version: the semantic intent of this function is "can self use family-specific features of arch", so plain sm_101/sm_110 should still be a valid family root for the a/f variants (consistent with sm_101f.is_family_of(sm_101) → True):
if self in [Arch.sm_101a, Arch.sm_101f]:
    return arch in [Arch.sm_101, Arch.sm_101a, Arch.sm_101f]
  • BlackwellArchs(): both options are fine with me, because we don't rely on the length of the returned list so far. Returning a deduplicated tuple would be cleaner, since under real aliases the duplicate entries also surface as repeated names in print(BlackwellArchs()).

@lingolin128
Author

Hi @yiwangchunyu, thanks for the careful review and valuable suggestions!

  • Fixed the is_family_of logic as you pointed out to keep consistent behavior.
  • Optimized BlackwellArchs() and removed the duplicate entries.
    It now returns a deduplicated tuple with 15 valid arch entries on either CUDA version.
# CUDA 12.x
BlackwellArchs()
-> (Arch.sm_100, Arch.sm_100a, Arch.sm_100f, Arch.sm_101, Arch.sm_101a, Arch.sm_101f, Arch.sm_103, Arch.sm_103a, Arch.sm_103f, Arch.sm_120, Arch.sm_120a, Arch.sm_120f, Arch.sm_121, Arch.sm_121a, Arch.sm_121f)
len(BlackwellArchs())
-> 15

# CUDA 13.x
BlackwellArchs()
-> (Arch.sm_100, Arch.sm_100a, Arch.sm_100f, Arch.sm_110, Arch.sm_110a, Arch.sm_110f, Arch.sm_103, Arch.sm_103a, Arch.sm_103f, Arch.sm_120, Arch.sm_120a, Arch.sm_120f, Arch.sm_121, Arch.sm_121a, Arch.sm_121f)
len(BlackwellArchs())
-> 15

Thanks a lot for the review!

@yiwangchunyu

Thanks for the update, LGTM!

@lingolin128
Author

Hi @Junkai-Wu, please review this PR. Thanks a lot!
