Executive summary
Fuses QK-norm, 3D mRoPE, and the KV cache write to improve Qwen3-VL decode performance on AMD platforms.
The PR body states: 'Use aiter's fused_qk_norm_mrope_3d_cache_pts_quant_shuffle kernel to replace 4 separate kernel launches (QKV split, QK RMSNorm, 3D mRoPE, KV cache write) with a single HIP kernel on the ROCm decode path.' The motivation is to reduce the number of kernel launches and thereby speed up decoding.
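To make the four-stages-versus-one idea concrete, here is a minimal pure-Python sketch. It is not the aiter kernel or the sglang code: every function name is hypothetical, the tensors are plain lists for a single head, RMSNorm has no learned weight, and a plain 1D rotary embedding stands in for 3D mRoPE. Returning None for v from the fused path is an assumption; the PR discussion only confirms that k is None.

```python
import math

def split_qkv(qkv, q_size, kv_size):
    """Stage 1: slice the packed projection into q, k, v."""
    q = qkv[:q_size]
    k = qkv[q_size:q_size + kv_size]
    v = qkv[q_size + kv_size:q_size + 2 * kv_size]
    return q, k, v

def rms_norm(x, eps=1e-6):
    """Stage 2: RMSNorm without a learned weight (a simplification)."""
    rms = math.sqrt(sum(e * e for e in x) / len(x) + eps)
    return [e / rms for e in x]

def rotate_pairs(x, position, base=10000.0):
    """Stage 3 stand-in: 1D rotary embedding over adjacent pairs.
    The real kernel applies 3D mRoPE; this is only a placeholder."""
    out = []
    for i in range(len(x) // 2):
        theta = position * base ** (-2 * i / len(x))
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out

def prepare_unfused(qkv, q_size, kv_size, position, cache, slot, launches):
    """Four separate steps, each recorded as one hypothetical launch."""
    q, k, v = split_qkv(qkv, q_size, kv_size)
    launches.append("qkv_split")
    q, k = rms_norm(q), rms_norm(k)
    launches.append("qk_rmsnorm")
    q, k = rotate_pairs(q, position), rotate_pairs(k, position)
    launches.append("rope")
    cache[slot] = (k, v)  # Stage 4: KV cache write
    launches.append("kv_cache_write")
    return q, k, v

def prepare_fused(qkv, q_size, kv_size, position, cache, slot, launches):
    """Same math behind a single recorded launch. k and v go straight
    into the cache, so the caller gets None for them (v is an assumption)."""
    launches.append("fused_prepare")
    q, k, v = split_qkv(qkv, q_size, kv_size)
    q = rotate_pairs(rms_norm(q), position)
    k = rotate_pairs(rms_norm(k), position)
    cache[slot] = (k, v)
    return q, None, None
```

Both paths leave the same k and v in the cache and return the same q; the difference in this sketch is only the number of recorded launches and the fact that the fused path does not hand k back to the caller.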
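This PR is worth a close read for the design and implementation details of the fused kernel. Pay particular attention to the logic of the forward_prepare_aiter_fused_mrope function, the robustness of the condition checks, and how the change balances performance against code maintainability. For developers working on AMD platform optimization or kernel fusion, it is a useful case study.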
Reviewer kkHuang-amd initially commented: 'I don't suggest to copy whole attention block processing logic into one function. It will not follow the sglang attention processing logic. forward_prepare -> forward_core. It will not be easily to maintain'. The author, yctseng0211, replied that the code had been refactored to follow the standard pattern, with the fused kernel confined to forward_prepare_fused_mrope. The author also explained in a comment why the guard is needed: 'The guard in Line:274 is needed because there's a downstream k.to(torch.bfloat16) cast in the RL on-policy path, without the guard, the fused prepare would return k=None and that .to() call would crash.' The guard therefore keeps the downstream cast from crashing.
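The guard described in the author's comment can be illustrated with a small sketch. This is an assumption-laden stand-in, not the PR's code: FakeTensor, prepare_standard, prepare_fused, forward_prepare, and the two boolean flags are all hypothetical, and dtypes are plain strings so the example runs without PyTorch.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor; only models .to(dtype)."""
    def __init__(self, dtype):
        self.dtype = dtype

    def to(self, dtype):
        return FakeTensor(dtype)

def prepare_standard():
    # Unfused path: q, k, v are all returned to the caller.
    return FakeTensor("float32"), FakeTensor("float32"), FakeTensor("float32")

def prepare_fused():
    # Fused path: k is already in the KV cache, so it comes back as None
    # (returning None for v as well is an assumption of this sketch).
    return FakeTensor("float32"), None, None

def forward_prepare(use_fused_kernel, rl_on_policy):
    # The guard: take the fused path only when nothing downstream needs k.
    if use_fused_kernel and not rl_on_policy:
        return prepare_fused()
    return prepare_standard()

def forward(use_fused_kernel, rl_on_policy):
    q, k, v = forward_prepare(use_fused_kernel, rl_on_policy)
    if rl_on_policy:
        # Downstream cast; with k=None this would raise AttributeError.
        k = k.to("bfloat16")
    return q, k, v
```

With the guard in place, forward(True, True) falls back to the standard path and the cast succeeds; calling .to() on the k returned by prepare_fused() directly would fail, which is the crash the author describes.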
Join the discussion