#25773 Add fused_rope and for xpu

Original PR · Author: gaopengff · Merged: 2026-06-03 09:41 · Files changed: 1 · Commits: 10 · Comments: 9 · Lines: +31 / -8

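Executive summary

Fused RoPE kernel for XPU improves decode performance

According to the author's review reply, fused_qk_rope launches fewer kernels than rotary_embedding and sizes its vectorized loads/stores from rope_dim. The CUDA version is a JIT kernel that can treat is_neox and rope_dim as constants; the vector-size tuning for XPU lives in a separate sgl-kernel-xpu PR. This PR uses the kernel to speed up rotary embedding on XPU.

Worth a close read for its head_size-based kernel selection and conditional-branch design on XPU.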

Discussion highlights

mingfeima: the logic here is not clear. so where does the performance benefit comes from? inplace?
gaopengff: This fused_qk_rope is a jit kernel in cuda, which means it could use constant value of is_neox, rope_dim to launch kernel. The load/store vectorized size is tuned from rope_dim. Also, it launched fewer kernels compared to rotary_embedding. For xpu version, I have a tuned vector size load/store PR: https://github.com/sgl-project/sgl-kernel-xpu/pull/221.

gemini-code-assist[bot]: Creating q_weight and k_weight tensors of ones on every forward pass introduces unnecessary overhead. These should ideally be pre-allocated as buffers.
gaopengff: Use new method without creating new tensors.

gemini-code-assist[bot]: The forward_xpu method is defined twice in MRotaryEmbedding class.
This PR does not modify mrope.py, so the duplicate definition is not addressed here.

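Implementation breakdown

Add fused_qk_rope_with_cos_sin_cache_inplace to the conditional import block: when the platform is XPU, import it from sgl_kernel.
Rewrite forward_xpu: if head_size is one of [128, 256, 512], take the fused path; otherwise fall back to the existing torch.ops.sgl_kernel.rotary_embedding call.
On the fused path, query and key are viewed as [num_tokens, -1, head_size] and, when rotary_dim differs from head_size, sliced to the first rotary_dim elements before the fused kernel is called in place; the fallback path keeps the original logic.
No accompanying test, configuration, or deployment changes.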
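
The dispatch and the partial-rotary slicing described above can be sketched in plain Python. This is a minimal reference for what a RoPE kernel computes on one head vector, not the sgl_kernel implementation: `uses_fused_path`, `rope_inplace`, and the `base` default are hypothetical names and values chosen for illustration, and the real code reads cos/sin from a precomputed cache instead of calling `math.cos`.

```python
import math
from typing import List

# Head sizes for which the PR takes the fused path (from its branch condition).
SUPPORTED_FUSED_HEAD_SIZES = (128, 256, 512)


def uses_fused_path(head_size: int) -> bool:
    """Mirror the PR's dispatch: fused kernel only for these head sizes."""
    return head_size in SUPPORTED_FUSED_HEAD_SIZES


def rope_inplace(
    head: List[float],
    position: int,
    rotary_dim: int,
    is_neox_style: bool,
    base: float = 10000.0,  # assumed RoPE base, for illustration only
) -> None:
    """Rotate the first rotary_dim elements of one head vector in place.

    Elements at index >= rotary_dim are left untouched (partial rotary).
    """
    half = rotary_dim // 2
    for i in range(half):
        # Standard RoPE frequency for rotation pair i.
        angle = position * base ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        # NeoX style pairs (i, i + half); GPT-J style pairs (2i, 2i + 1).
        a, b = (i, i + half) if is_neox_style else (2 * i, 2 * i + 1)
        x, y = head[a], head[b]
        head[a] = x * cos - y * sin
        head[b] = x * sin + y * cos
```

For example, `rope_inplace(head, position=3, rotary_dim=4, is_neox_style=True)` on an 8-element `head` rotates only indices 0 to 3 and leaves indices 4 to 7 as they were, which is the effect of the `[..., : self.rotary_dim]` slice in the PR.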

File	Module	Status	Importance
`python/sglang/srt/layers/rotary_embedding/base.py`	rotary embedding	modified	6.82

Key symbols

forward_xpu, fused_qk_rope_with_cos_sin_cache_inplace

Key source snippet

python/sglang/srt/layers/rotary_embedding/base.py core-logic

The only modified file: adds use of the fused_qk_rope kernel and rewrites the forward_xpu method.

def forward_xpu(
    self,
    positions: torch.Tensor,
    query: torch.Tensor,
    key: torch.Tensor,
    offsets: Optional[torch.Tensor] = None,
    fused_set_kv_buffer_arg: Optional[FusedSetKVBufferArg] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # fused_set_kv_buffer_arg is not supported by the XPU implementation
    assert (
        fused_set_kv_buffer_arg is None
    ), "fused_set_kv_buffer_arg is not supported for xpu implementation"

    # Apply offsets to positions when provided
    positions = torch.add(positions, offsets) if offsets is not None else positions

    # Make sure the cos/sin cache dtype matches query
    self._match_cos_sin_cache_dtype(query)

    # fused_qk_rope only supports aligned head sizes (128, 256, 512)
    if self.head_size in [128, 256, 512]:
        num_tokens = positions.size(0)
        # View query and key as [num_tokens, -1, head_size] to expose the head dimension
        q_rope = query.view(num_tokens, -1, self.head_size)
        k_rope = key.view(num_tokens, -1, self.head_size)
        # If rotary_dim is smaller than head_size, keep only the first rotary_dim elements
        if self.head_size != self.rotary_dim:
            q_rope = q_rope[..., : self.rotary_dim]
            k_rope = k_rope[..., : self.rotary_dim]
        # Call the fused kernel in place; q_rope and k_rope are views of query and key,
        # so no extra memory is allocated
        fused_qk_rope_with_cos_sin_cache_inplace(
            q_rope,
            k_rope,
            self.cos_sin_cache,
            positions,
            self.rotary_dim,
            self.is_neox_style,
        )
        return query, key
    else:
        # For unsupported head sizes, fall back to the generic rotary_embedding kernel
        return torch.ops.sgl_kernel.rotary_embedding(
            positions,
            query,
            key,
            self.head_size,
            self.cos_sin_cache,
            self.is_neox_style,
        )

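Comment thread highlights

Source of the performance gain (performance)

mingfeima found the forward_xpu branch logic unclear and asked where the performance benefit comes from, and whether it is due to the in-place operation. gaopengff replied that fused_qk_rope is a JIT kernel in CUDA, so it can launch with is_neox and rope_dim as constants; its vectorized load/store size is tuned from rope_dim; and it launches fewer kernels than rotary_embedding. For XPU, the tuned vector size is in a separate PR: https://github.com/sgl-project/sgl-kernel-xpu/pull/221.

Conclusion: the explanation is reasonable; the gain comes mainly from kernel fusion and vectorized loads/stores. · resolved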
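
To make "vector size tuned from rope_dim" concrete, here is a hypothetical sketch of how a load/store width could be derived once rope_dim is a known constant. It is not taken from sgl-kernel or sgl-kernel-xpu: the function name `pick_vector_width` and the 16-byte cap are assumptions for illustration only.

```python
def pick_vector_width(rope_dim: int, elem_bytes: int, max_vec_bytes: int = 16) -> int:
    """Return the number of elements per vectorized load/store.

    Picks the largest power-of-two element count that fits in max_vec_bytes
    (assumed cap) and evenly divides rope_dim, so no load straddles the end
    of the rotated region. elem_bytes is expected to be a power of two.
    """
    width = max(1, max_vec_bytes // elem_bytes)
    while width > 1 and rope_dim % width != 0:
        width //= 2
    return width
```

Under these assumptions, a 2-byte dtype with rope_dim of 128 would use 8 elements per access, while an awkward rope_dim such as 20 would drop to 4. When rope_dim is only known at run time, a kernel has to either branch on it or use the most conservative width.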

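Duplicate forward_xpu definition (design)

gemini-code-assist[bot] pointed out that the MRotaryEmbedding class (in mrope.py) defines forward_xpu twice and suggested removing the duplicate.

Conclusion: this PR does not modify mrope.py, so the issue is not addressed here. · unresolved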

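Temporary tensor allocation and cat overhead (performance)

gemini-code-assist[bot] noted that creating the q_weight/k_weight tensors and calling torch.cat cause extra memory allocations and copies, and suggested pre-allocating or avoiding them. gaopengff replied that the code now uses a new approach that creates no new tensors and no cat.

Conclusion: the author revised the code to drop the temporary tensors and the cat in favor of view and index slicing; the performance concern is settled. · resolved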

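Risks and impact

Only base.py changes, and only the XPU forward_xpu path is affected. The new branch is intended to be functionally equivalent to the existing fallback path, but it has no explicit test coverage: if the fused kernel is incorrect for a particular head_size, the error would be silent. Keeping the fallback path reduces part of that risk.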
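
One way to close the coverage gap noted above is a parity check between the two paths. The sketch below shows the shape of such a check in plain Python; `reference_rope` and `max_abs_diff` are hypothetical helpers, and in a real test the `candidate` callable would wrap the fused XPU path and `reference` the fallback kernel, both operating on tensors.

```python
import math
import random
from typing import Callable, List

RopeFn = Callable[[List[float], int, int, bool], List[float]]


def reference_rope(head, position, rotary_dim, is_neox_style, base=10000.0):
    """Out-of-place reference: return a rotated copy of one head vector."""
    out = list(head)
    half = rotary_dim // 2
    for i in range(half):
        angle = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(angle), math.sin(angle)
        # NeoX style pairs (i, i + half); GPT-J style pairs (2i, 2i + 1).
        a, b = (i, i + half) if is_neox_style else (2 * i, 2 * i + 1)
        out[a] = head[a] * c - head[b] * s
        out[b] = head[a] * s + head[b] * c
    return out


def max_abs_diff(
    candidate: RopeFn,
    reference: RopeFn,
    head_size: int,
    rotary_dim: int,
    is_neox_style: bool,
    trials: int = 8,
    seed: int = 0,
) -> float:
    """Largest element-wise gap between two RoPE implementations on random inputs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        head = [rng.uniform(-1.0, 1.0) for _ in range(head_size)]
        pos = rng.randrange(0, 4096)
        # Each implementation gets its own copy so in-place variants cannot interfere.
        got = candidate(list(head), pos, rotary_dim, is_neox_style)
        want = reference(list(head), pos, rotary_dim, is_neox_style)
        worst = max(worst, max(abs(g - w) for g, w in zip(got, want)))
    return worst
```

A useful sweep would cover head_size in (128, 256, 512) plus at least one size that takes the fallback, rotary_dim both equal to and smaller than head_size, and both values of is_neox_style, asserting that the gap stays within a dtype-appropriate tolerance.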

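Only inference latency on XPU platforms (Intel GPUs and similar) is affected, with an expected improvement in decode performance; other platforms are unaffected. Users benefit without changing their code.

Risk tags: missing test coverage, XPU path only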

Linked issues

No linked issue identified


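Full report

Executive summary

In one sentence: a fused RoPE kernel for XPU improves decode performance.
Recommended action: worth a close read for its head_size-based kernel selection and conditional-branch design on XPU.

Function and motivation

According to the author's review reply, fused_qk_rope launches fewer kernels than rotary_embedding and sizes its vectorized loads/stores from rope_dim, which is the stated reason for using it to speed up rotary embedding on XPU.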

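Implementation breakdown

Add fused_qk_rope_with_cos_sin_cache_inplace to the conditional import block: when the platform is XPU, import it from sgl_kernel.
Rewrite forward_xpu: if head_size is one of [128, 256, 512], take the fused path; otherwise fall back to the existing torch.ops.sgl_kernel.rotary_embedding call.
On the fused path, query and key are viewed as [num_tokens, -1, head_size] and, when rotary_dim differs from head_size, sliced to the first rotary_dim elements before the fused kernel is called in place; the fallback path keeps the original logic.
No accompanying test, configuration, or deployment changes.

Key files:

python/sglang/srt/layers/rotary_embedding/base.py (module: rotary embedding; category: source; type: core-logic; symbols: forward_xpu, fused_qk_rope_with_cos_sin_cache_inplace): the only modified file; adds use of the fused_qk_rope kernel and rewrites the forward_xpu method.

Key symbols: forward_xpu, fused_qk_rope_with_cos_sin_cache_inplace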

Comment thread highlights

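Source of the performance gain (performance): the explanation is reasonable; the gain comes mainly from kernel fusion and vectorized loads/stores.
Duplicate forward_xpu definition (design): this PR does not modify mrope.py, so the issue is not addressed here.
Temporary tensor allocation and cat overhead (performance): the author revised the code to drop the temporary tensors and the cat in favor of view and index slicing; the performance concern is settled.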

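Risks and impact

Risk: only base.py changes, and only the XPU forward_xpu path is affected. The new branch is intended to be functionally equivalent to the existing fallback path, but it has no explicit test coverage: if the fused kernel is incorrect for a particular head_size, the error would be silent. Keeping the fallback path reduces part of that risk.
Impact: only inference latency on XPU platforms (Intel GPUs and similar) is affected, with an expected improvement in decode performance; other platforms are unaffected. Users benefit without changing their code.
Risk tags: missing test coverage, XPU path only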

Related context

No clearly related PRs
