Add Kernel Optimization Template to PromptManager by kaiming-cheng · Pull Request #90 · meta-pytorch/KernelAgent

kaiming-cheng · 2026-01-31T06:42:43Z

This PR adds a new Jinja2 template for bottleneck-guided kernel optimization.

Each optimization round targets exactly one root cause with one recommended fix

kernel_optimization.j2

ROOFLINE ANALYSIS section with SOL%, efficiency, headroom
BOTTLENECK ANALYSIS expects singular bottleneck with root_cause and recommended_fix
PERFORMANCE TARGET targets 10% improvement over current best or Eager baseline
TARGET GPU includes all fields from gpu_specs_database

prompt_manager.py

Added kernel_optimization template
New render_kernel_optimization_prompt() with explicit params: category, summary, reasoning, root_cause, recommended_fix

Example Usage

Generate bottleneck analysis from NCU data

  from kernel_perf_agent.kernel_opt.diagnose_prompt.judger_prompt import (                                                                                                             
      build_bottleneck_prompt,                                                                                                                                                         
      parse_bottleneck_response,                                                                                                                                                       
  )                                                                                                                                                                                    
  from kernel_perf_agent.kernel_opt.roofline.ncu_roofline import RooflineAnalyzer                                                                                                      
                                                                                                                                                                                       
  # Analyze roofline                                                                                                                                                                   
  analyzer = RooflineAnalyzer()                                                                                                                                                        
  roofline = analyzer.analyze(ncu_metrics)                                                                                                                                             
                                                                                                                                                                                       
  # Build prompt for LLM                                                                                                                                                               
  prompt = build_bottleneck_prompt(                                                                                                                                                    
      kernel_code=kernel_src,                                                                                                                                                          
      ncu_metrics=ncu_metrics,                                                                                                                                                         
      roofline=roofline,                                                                                                                                                               
      gpu_specs=gpu_specs,                                                                                                                                                             
      num_bottlenecks=2,                                                                                                                                                               
      num_causes=2,                                                                                                                                                                    
      num_fixes=1,                                                                                                                                                                     
  )                                                                                                                                                                                    
                                                                                                                                                                                       
  # Call LLM and parse response                                                                                                                                                        
  llm_response = call_llm(prompt)                                                                                                                                                      
  bottlenecks = parse_bottleneck_response(llm_response)                                                                                                                                
  # Returns: [BottleneckResult, BottleneckResult]

Render optimization prompt (one cause, one fix)

  from triton_kernel_agent.prompt_manager import PromptManager                                                                                                                         
                                                                                                                                                                      
  prompt_manager = PromptManager()                                                                                                                                                     
                                                                                                                                                                                       
  # Pick first bottleneck, first cause/fix pair                                                                                                                                        
  b = bottlenecks[0]                                                                                                                                                                   
                                                                                                                                                                                       
  prompt = prompt_manager.render_kernel_optimization_prompt(                                                                                                                           
      problem_description="Fused attention kernel",                                                                                                                                    
      kernel_code=current_kernel,                                                                                                                                                      
      gpu_specs=gpu_specs,                                                                                                                                                             
      roofline={                                                                                                                                                                       
          "bottleneck": roofline.bottleneck,                                                                                                                                           
          "compute_sol_pct": roofline.compute_sol_pct,                                                                                                                                 
          "memory_sol_pct": roofline.memory_sol_pct,                                                                                                                                   
          "efficiency_pct": roofline.efficiency_pct,                                                                                                                                   
          "headroom_pct": roofline.headroom_pct,                                                                                                                                       
          "at_roofline": roofline.at_roofline,                                                                                                                                         
          "uses_tensor_cores": roofline.uses_tensor_cores,                                                                                                                             
          "warnings": roofline.warnings,                                                                                                                                               
      },                                                                                                                                                                               
      category=b.category,                                                                                                                                                             
      summary=b.summary,                                                                                                                                                               
      reasoning=b.reasoning,                                                                                                                                                           
      root_cause=b.root_causes[0],                                                                                                                                                     
      recommended_fix=b.recommended_fixes[0],                                                                                                                                          
      pytorch_baseline_ms=1.234,                                                                                                                                                       
      current_best_ms=0.987,  # from previous iteration                                                                                                                                
  )                                                                                                                                                                                    
                                                                                                                                                                                       
  # Call LLM to generate optimized kernel                                                                                                                                              
  optimized_kernel = call_llm(prompt)

Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a streamlined 3-file architecture with clear separation of concerns: Architecture: - benchmark.py (299 lines): Main Benchmark class with simplified API - benchmark_kernel(): Always uses subprocess for crash protection - benchmark_pytorch(): Always uses direct mode for stable code - BenchmarkLockManager: GPU lock management for multi-worker scenarios - timing.py (437 lines): Complete timing infrastructure - Timing: time_with_cuda_events(), time_with_triton_do_bench() - Loading: prepare_pytorch_model(), load_kernel_function() - Stats: compute_timing_stats() with essential metrics (mean/std/min/max) - kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation - Crash protection for potentially buggy kernels - Clean CUDA state between runs - Timeout handling Key improvements: - Eliminated string code generation (was generating Python as strings) - Removed unnecessary statistics (median, p25/p75/p95/p99) - Removed confusing use_subprocess parameter (behavior now deterministic) - Fixed dtype bug causing incorrect speedup measurements - Reduced from 5 files to 3 files with clearer naming - Code reduction: ~1,400 lines → 1,178 lines Simple API: bench = Benchmark(logger, temp_dir, lock, worker_id) pytorch_result = bench.benchmark_pytorch(problem_file) kernel_result = bench.benchmark_kernel(kernel_file, problem_file) speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']

Jack-Khuu

Not in this PR so you don't need to change it, but once everything lands we should consider creating arg dataclasses, so that we aren't passing around 10+ arg to functions

Jack-Khuu · 2026-02-04T21:44:49Z

kernel_perf_agent/kernel_opt/diagnose_prompt/gpu_specs.py

+        ...     print(f"SM Count: {specs['sm_count']}")
+    """
+    if gpu_name in GPU_SPECS_DATABASE:
+        return GPU_SPECS_DATABASE[gpu_name].copy()


Is this just to prevent folks from editing it by accident? If so consider GPU_SPECS_DATABASE = MappingProxyType(GPU_SPECS_DATABASE)

This makes it so that you can read from the DB, but not write. Removes the need for this file

Jack-Khuu · 2026-02-04T21:46:28Z

kernel_perf_agent/kernel_opt/diagnose_prompt/judger_prompt.py

+    root_causes: list[dict[str, Any]] = field(default_factory=list)
+    recommended_fixes: list[dict[str, Any]] = field(default_factory=list)
+
+    def to_dict(self) -> dict[str, Any]:


If we aren't customizing to_dict, have callers use asdict

Kaiming Cheng and others added 30 commits January 15, 2026 11:44

NCU profiling wrapper generation and execution

07a3268

Refactor profiling components and add kernel_perf_util

3c4b124

Refactor profiling components and add kernel_perf_util

11f4e79

Refactor profiling components and add kernel_perf_util

251f419

update directory name and add package in pyproject

b789660

Remove kernel_perf_util directory

4d35d57

move gpu spec.py to future PR and fix import

d871678

Add copyright header

db0c754

fix ruff

cd29759

address previous comments

bbfa6cd

fix ruff

543453a

Introducing benchmarking infra for kernel performance

4febdd6

fix ruff

d92a7b7

fix ruff

2994315

address comments

1378fc3

Diagnose module - prompt constructor

45fec80

Refactors the diagnose_prompt module into a modular architecture

b640cde

fix diff issue

e952123

fix ruff issue

e7ba29a

fix

72ac4d1

fix ruff

e2c599e

Merge branch 'main' into kaiming/opt_component_3

8ab907c

fix gpu_spec based on feedback and remove judger_prompt for future PR

e350802

Remove judger_prompts.py changes from this PR

8541299

Merge branch 'main' into kaiming/opt_component_3

313a84f

Update gpu_specs_database.py

9e608ac

address feedback

f3220e1

ruff fix

4443f33

Merge branch 'main' into kaiming/opt_component_3

b12b138

Kaiming Cheng added 5 commits January 30, 2026 14:26

introduce roofline analyzer

31d0d70

update doc string in init and fix ncu_roofline

3c607b5

introduce judger prompt

1aad0ad

add optimization template

c0bd09c

update prompt manager

56fba36

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 31, 2026

kaiming-cheng changed the title ~~Add Kernel Optimization Template to PromptManager #77Kaiming/opt template~~ Add Kernel Optimization Template to PromptManager Jan 31, 2026

kaiming-cheng requested review from Jack-Khuu and Laurawly January 31, 2026 06:42

Jack-Khuu approved these changes Feb 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Kernel Optimization Template to PromptManager#90

Add Kernel Optimization Template to PromptManager#90
kaiming-cheng wants to merge 35 commits intomainfrom
kaiming/opt_template

kaiming-cheng commented Jan 31, 2026

Uh oh!

Jack-Khuu left a comment

Uh oh!

Jack-Khuu Feb 4, 2026

Uh oh!

Jack-Khuu Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kaiming-cheng commented Jan 31, 2026

kernel_optimization.j2

prompt_manager.py

Example Usage

Uh oh!

Jack-Khuu left a comment

Choose a reason for hiding this comment

Uh oh!

Jack-Khuu Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Jack-Khuu Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants