Feat: make TF32 and cuDNN benchmarking opt-in #18
Merged
- TF32 is only useful on newer hardware
- cudnn.benchmark=True can cause a performance hit for smaller volumes while the auto-tuner runs
camilolaiton
approved these changes
Jun 10, 2026
camilolaiton
left a comment
Collaborator
Looks good to me! Super useful! Probably not here in this repo, but we should discuss quantization of our models (yours, Anna's, mine, etc.) and whether it requires any changes here (which I doubt).
Member
Author
Yeah, I totally agree re: quantization.
This PR updates the CUDA backend configuration so that performance-oriented PyTorch backend flags are controlled explicitly by InferenceConfig and default to False.

Previously, GpuWorker always enabled torch.backends.cudnn.benchmark, and use_tf32=True also forced both matmul TF32 and cuDNN TF32 on. This made backend behavior less explicit and could override PyTorch/session defaults. The new behavior sets only the intended backend options directly from config.

Changes
- Changed the InferenceConfig.use_tf32 default from True to False
- Added InferenceConfig.cudnn_benchmark, defaulting to False
- Stopped setting torch.backends.cudnn.allow_tf32, which defaults to True in PyTorch. The user can set it to False in their inference script if desired.
- Replaced --no-tf32 with opt-in --tf32
- Added --cudnn-benchmark

Rationale
TF32 matmul and cuDNN benchmarking can improve throughput in certain cases, but the gain depends on hardware and data size. TF32 is only supported on Ampere and newer GPUs, whereas the default CodeOcean flex instance uses an older T4 (Turing). cuDNN benchmarking can also hurt performance for smaller volumes, where the auto-tuner overhead outweighs any gain.
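For readers who want to see the opt-in flow end to end, here is a minimal sketch of how the config fields, the CLI flags, and the backend settings could fit together. It is not the code from this PR: the InferenceConfig below is a hypothetical two-field stand-in, and build_parser, config_from_args, apply_backend_flags, and tf32_supported are illustrative names. The only PyTorch attributes used are torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.benchmark; torch.backends.cudnn.allow_tf32 is left untouched, as described above.

```python
import argparse
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    """Hypothetical stand-in for the repo's InferenceConfig (two fields only)."""

    use_tf32: bool = False         # opt-in: TF32 matmul
    cudnn_benchmark: bool = False  # opt-in: cuDNN auto-tuner


def build_parser() -> argparse.ArgumentParser:
    """CLI with both flags off unless the user passes them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tf32", action="store_true",
                        help="enable TF32 matmul (Ampere or newer GPUs)")
    parser.add_argument("--cudnn-benchmark", action="store_true",
                        help="enable the cuDNN auto-tuner")
    return parser


def config_from_args(argv=None) -> InferenceConfig:
    """Map parsed CLI flags onto the config object."""
    args = build_parser().parse_args(argv)
    return InferenceConfig(use_tf32=args.tf32,
                           cudnn_benchmark=args.cudnn_benchmark)


def tf32_supported(capability: tuple) -> bool:
    """TF32 needs compute capability 8.0+ (Ampere); a T4 (Turing) reports 7.5."""
    major, _minor = capability
    return major >= 8


def apply_backend_flags(config: InferenceConfig) -> None:
    """Set only the two intended backend options from the config."""
    import torch  # imported lazily so the config/CLI parts run without PyTorch

    torch.backends.cuda.matmul.allow_tf32 = config.use_tf32
    torch.backends.cudnn.benchmark = config.cudnn_benchmark
    # torch.backends.cudnn.allow_tf32 is deliberately not modified here.
```

With this sketch, config_from_args([]) yields a config with both options off, and passing --tf32 --cudnn-benchmark turns both on. A caller could check tf32_supported(torch.cuda.get_device_capability()) before opting in to TF32 on unknown hardware.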