Executive Summary
Adds swapAB support to the CUTLASS blockwise FP8 GEMM for the SM120 architecture, improving performance when the M dimension is small.
The PR body states: 'This PR adds swap AB kernel support for the SM120 (Blackwell) blockwise FP8 scaled GEMM. For decode-phase workloads and small-batch inference, the M dimension of the GEMM is very small. Without swap AB, the tile partitioning along M is highly inefficient', and goes on to note that most threads within a CTA tile are idle. The change aims to improve small-batch inference performance by transposing the problem into D = (B^T @ A^T)^T, so that the small M extent is handled as the N dimension of the swapped GEMM.
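The transposition identity above can be checked with a small sketch. This is plain Python, not the CUTLASS kernel: the helper names (`matmul`, `transpose`, `swap_ab_matmul`, `tile_utilization`) and the tile extents 128 and 16 are hypothetical illustrations, not values taken from the PR. It assumes only that the swapped problem can use a smaller tile extent along the dimension that now carries M.

```python
import math


def matmul(a, b):
    """Naive product of two row-major nested lists: (M x K) @ (K x N)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def swap_ab_matmul(a, b):
    """Compute A @ B as (B^T @ A^T)^T, the identity behind swapAB."""
    return transpose(matmul(transpose(b), transpose(a)))


def tile_utilization(extent, tile):
    """Fraction of tile rows doing useful work when `extent` is padded to tiles."""
    tiles = math.ceil(extent / tile)
    return extent / (tiles * tile)


a = [[1, 2, 3], [4, 5, 6]]                      # M=2, K=3
b = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 1, 0, 2]]  # K=3, N=4

# Both paths produce the same D.
assert swap_ab_matmul(a, b) == matmul(a, b)

# Hypothetical tile extents: M=4 against a 128-wide tile versus a 16-wide tile.
print(tile_utilization(4, 128))  # 0.03125
print(tile_utilization(4, 16))   # 0.25
```

With these illustrative numbers, a decode-sized M of 4 occupies about 3% of a 128-row tile but 25% of a 16-wide one, which is the kind of waste the swap is meant to reduce.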
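Worth a close read for its CUTLASS optimization techniques and swapAB strategy, with particular attention to the trade-offs in the heuristic selection and the template metaprogramming details. Engineers can use this PR as a reference for speeding up GEMMs with a small dimension through transposition.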
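During review, gemini-code-assist[bot] flagged an inaccurate code comment: 'The comment on these lines is incomplete and misleading. It refers to if constexpr, but the code uses a regular if statement.' This could confuse future maintainers, but the discussion did not resolve it further. mgoin approved the PR and praised the optimization: 'Nice work, the changes look solid to me for using swapAB with cutlass.' No other disagreements or in-depth design discussion were found.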
Join the discussion