---
layout: default
title: Learning Path
nav_order: 3
permalink: /docs/learning-path
---

# Learning Path
{: .fs-8 }

Follow the optimization ladder in the order the repository was designed to teach it.
{: .fs-6 .fw-300 }

| Step | Kernel | Why it comes here |
|---|---|---|
| 1 | Naive | Establish the baseline cost model |
| 2 | Tiled | Introduce shared-memory reuse |
| 3 | Bank-Free | Show why shared-memory layout still matters |
| 4 | Double Buffer | Add staging and overlap concepts |
| 5 | Tensor Core | Move to WMMA and mixed-precision hardware |
- Thread/block mapping
- Memory coalescing
- Shared-memory reuse
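
To make the coalescing item concrete, here is a minimal host-side C++ sketch, not code from this repository. It counts how many 128-byte segments one 32-thread warp touches when every thread loads one `float` from a row-major matrix. The 128-byte segment size, the aligned base address, and the function names are simplifying assumptions made for illustration.

```cpp
#include <cstddef>
#include <set>

// Host-side model only (assumption: 128-byte segments, base address aligned).
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSegmentBytes = 128;

// Mapping A: lane -> column. Neighbouring threads read neighbouring floats.
std::size_t segments_lane_to_column(std::size_t row, std::size_t width) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t byte_offset = (row * width + lane) * sizeof(float);
        segments.insert(byte_offset / kSegmentBytes);
    }
    return segments.size();
}

// Mapping B: lane -> row. Neighbouring threads are `width` floats apart.
std::size_t segments_lane_to_row(std::size_t col, std::size_t width) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t byte_offset = (lane * width + col) * sizeof(float);
        segments.insert(byte_offset / kSegmentBytes);
    }
    return segments.size();
}
```

For a 1024-column matrix, mapping A keeps the whole warp inside a single segment, while mapping B spreads it over 32 segments. That gap is the cost the naive and tiled stages are meant to expose.
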
- 32-bank shared-memory behavior
- Why `[32][33]` matters
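
The effect of the extra padding column can be checked with plain arithmetic. The sketch below is a host-side C++ model, not the repository's kernel. It assumes 32 banks of 4-byte words and a warp in which each lane reads the same column of a different row. The function name is hypothetical.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 32;     // assumption: 32 banks, one 4-byte word each
constexpr int kWarpSize = 32;

// Returns the largest number of lanes that hit the same bank when lane i
// reads tile[i][col] from a tile whose rows are `row_stride` floats long.
// 1 means conflict-free; 32 means the access is fully serialized.
int column_read_conflict_degree(int row_stride, int col) {
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        int word = lane * row_stride + col;  // linear word index in the tile
        ++hits[word % kBanks];               // bank that serves this word
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row stride of 32 every lane lands in bank `col % 32`, so the degree is 32. With a stride of 33 the bank becomes `(lane + col) % 32`, which is different for every lane, so the degree drops to 1.
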
- Pipeline thinking
- Tile staging and latency hiding
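
The staging schedule behind double buffering can be sketched on the host. The C++ below is an illustration only: it runs sequentially, so nothing actually overlaps, and the tile size and function name are assumptions rather than values taken from the repository. It shows the ping-pong pattern: a prologue load, then each iteration loads the next tile into the idle buffer while the current buffer is consumed.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // hypothetical tile size for the sketch

float sum_with_double_buffer(const std::vector<float>& data) {
    std::array<std::array<float, kTile>, 2> stage{};  // two staging buffers
    const std::size_t tiles = (data.size() + kTile - 1) / kTile;

    // Copies one tile into the chosen buffer, zero-padding the tail.
    auto load = [&](std::size_t tile, std::size_t buf) {
        for (std::size_t i = 0; i < kTile; ++i) {
            std::size_t idx = tile * kTile + i;
            stage[buf][i] = idx < data.size() ? data[idx] : 0.0f;
        }
    };

    float total = 0.0f;
    if (tiles == 0) return total;

    load(0, 0);  // prologue: fill the first buffer before the loop
    for (std::size_t t = 0; t < tiles; ++t) {
        const std::size_t cur = t % 2;
        if (t + 1 < tiles) load(t + 1, 1 - cur);  // stage the next tile
        for (float v : stage[cur]) total += v;    // compute on the current tile
    }
    return total;
}
```

On a GPU the load and the compute step would be issued so that the memory latency of the next tile hides behind the arithmetic on the current one. The buffer indexing stays the same.
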
- WMMA fragments
- Mixed precision
- Safe fallback behavior for unsupported shapes
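
One way to picture the fallback item is a shape check in the host-side dispatch. The sketch below is hypothetical: the repository's actual rule is not shown on this page, and the names `GemmPath`, `kWmmaTile`, and `choose_gemm_path` are invented for illustration. It assumes the common 16x16x16 WMMA fragment shape and routes any problem whose dimensions are not multiples of 16 to a non-tensor-core kernel.

```cpp
// Hypothetical dispatch rule, assuming 16x16x16 WMMA fragments.
enum class GemmPath { TensorCore, Fallback };

constexpr int kWmmaTile = 16;

GemmPath choose_gemm_path(int m, int n, int k) {
    const bool positive = m > 0 && n > 0 && k > 0;
    const bool aligned = m % kWmmaTile == 0 &&
                         n % kWmmaTile == 0 &&
                         k % kWmmaTile == 0;
    // Anything that does not tile cleanly takes the safe path instead of
    // reading or writing outside the matrices.
    return (positive && aligned) ? GemmPath::TensorCore : GemmPath::Fallback;
}
```

Under this assumed rule a 1024x1024x1024 problem uses the tensor-core path, while a 1000x1024x1024 problem falls back because 1000 is not a multiple of 16.
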
- Build and run the project first
- Read the kernel page for one stage
- Run the benchmark again
- Compare the code with the previous stage
- Move to the next optimization only after the current one is clear
- Make sure your environment follows Getting Started
- Use the Architecture page if you want the repository-level map first
- Keep the Specifications Index nearby if you want the normative requirements