#25399 Add NPU condition for cosine and sine caching

Original PR · Author: ch-wan · Merged: 2026-05-15 20:21 · Files changed: 1 · Commits: 1 · Comments: 1 · Lines changed: +4 / -3

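Executive Summary

Compute the cached cos/sin tensors only on NPU, saving about 230 MB

The PR body states: 'Conditionally compute cached cosine and sine values based on NPU flag. ~230 MB saving.' The goal is to reduce device memory usage on non-NPU devices.

This PR is a performance micro-optimization. The change is small and direct, which made it suitable for a quick merge. The optimization suggested in review is worth revisiting in a later iteration to remove the remaining redundant computation.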

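Discussion Highlights

The only review comment came from gemini-code-assist[bot]. It suggested reusing the already-computed cos and sin tensors inside the NPU block, which would avoid the repeated torch.cos/torch.sin calls and the allocation of the intermediate emb tensor. The suggestion was not adopted, and the PR was merged.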

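Implementation Breakdown

File modified: the _compute_cos_sin_cache method in python/sglang/srt/layers/rotary_embedding/rope_variant.py.
Change: the three lines that previously computed emb, cos_cached_total, and sin_cached_total unconditionally are now wrapped in an if _is_npu: block.
Effect: non-NPU devices no longer compute or store cos_cached_total and sin_cached_total, which saves about 230 MB of device memory.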
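
The size of the skipped tensors follows from their shape. The sketch below is a rough estimator, not a reconstruction of the PR's 230 MB figure: the function name and the example configuration are hypothetical, and it assumes float32 storage and that inv_freq has rotary_dim // 2 entries (the usual RoPE layout), so that emb has rotary_dim columns.

```python
def cached_cos_sin_bytes(num_positions: int, rotary_dim: int,
                         bytes_per_element: int = 4) -> int:
    """Bytes held by cos_cached_total plus sin_cached_total.

    Assumes each tensor has shape (num_positions, rotary_dim), because
    emb = cat((freqs, freqs), dim=-1) doubles the rotary_dim // 2 columns
    of freqs. bytes_per_element=4 corresponds to float32.
    """
    per_tensor = num_positions * rotary_dim * bytes_per_element
    return 2 * per_tensor  # one cos tensor and one sin tensor


# Hypothetical configuration, chosen only for illustration.
example_bytes = cached_cos_sin_bytes(num_positions=131072, rotary_dim=128)
example_mib = example_bytes / 2**20
print(f"{example_mib:.0f} MiB")
```

The actual saving depends on the model's max_position_embeddings, scaling_factor, and rotary dimension, and on how many rotary embedding instances are constructed; the intermediate emb tensor adds a further transient allocation during construction.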

File	Module	Status	Importance
`python/sglang/srt/layers/rotary_embedding/rope_variant.py`	Rotary embedding	modified	5.0

Key Symbols

_compute_cos_sin_cache

Key Source Snippet

python/sglang/srt/layers/rotary_embedding/rope_variant.py core-logic

Core file of the change: _compute_cos_sin_cache is modified to add an NPU check that saves device memory.

def _compute_cos_sin_cache(self) -> torch.Tensor:
    inv_freq = self._compute_inv_freq(self.scaling_factor)
    t = torch.arange(
        self.max_position_embeddings * self.scaling_factor,
        device=self.device,
        dtype=torch.float32,
    )
    freqs = torch.einsum("i,j -> ij", t, inv_freq)
    cos = freqs.cos() * self.mscale
    sin = freqs.sin() * self.mscale
    cache = torch.cat((cos, sin), dim=-1)
    # Compute and cache the full cos/sin tensors only on NPU.
    # Other devices skip this block, saving about 230 MB of device memory.
    if _is_npu:
        emb = torch.cat((freqs, freqs), dim=-1)
        self.cos_cached_total = torch.cos(emb) * self.mscale
        self.sin_cached_total = torch.sin(emb) * self.mscale
    return cache
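
The gating in the snippet above can be exercised without PyTorch. The following is a toy stand-in, not the sglang class: nested Python lists replace tensors, RopeCacheSketch and its is_npu constructor argument are hypothetical (the real method reads a module-level _is_npu), and the frequency values are made up.

```python
import math


class RopeCacheSketch:
    """Toy stand-in for the rotary embedding class; nested lists replace tensors."""

    def __init__(self, freqs, mscale, is_npu):
        self.freqs = freqs      # rows are positions, columns are frequencies
        self.mscale = mscale
        self.is_npu = is_npu    # hypothetical: the source reads a module-level _is_npu
        self.cache = self._compute_cos_sin_cache()

    def _compute_cos_sin_cache(self):
        cos = [[math.cos(f) * self.mscale for f in row] for row in self.freqs]
        sin = [[math.sin(f) * self.mscale for f in row] for row in self.freqs]
        cache = [c + s for c, s in zip(cos, sin)]  # mirrors cat((cos, sin), dim=-1)
        if self.is_npu:
            # Only this branch allocates the full-width cos/sin arrays.
            emb = [row + row for row in self.freqs]  # mirrors cat((freqs, freqs), dim=-1)
            self.cos_cached_total = [[math.cos(f) * self.mscale for f in row] for row in emb]
            self.sin_cached_total = [[math.sin(f) * self.mscale for f in row] for row in emb]
        return cache


freqs = [[0.0, 0.5], [1.0, 1.5]]  # made-up values
on_npu = RopeCacheSketch(freqs, mscale=1.0, is_npu=True)
elsewhere = RopeCacheSketch(freqs, mscale=1.0, is_npu=False)

print(hasattr(on_npu, "cos_cached_total"))     # attribute exists on the NPU path
print(hasattr(elsewhere, "cos_cached_total"))  # attribute is absent otherwise
```

The returned cache is identical on both paths; only the two extra attributes differ.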

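Review Highlights

Reuse the already-computed cos/sin tensors to avoid redundant calls (performance)

gemini-code-assist[bot] suggested reusing the already-computed cos and sin directly inside the NPU block, rather than recomputing cos/sin on emb, to cut redundant trigonometric calls and memory allocation.

Conclusion: the suggestion was not adopted; the PR was merged as originally proposed. · unresolved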
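
The reviewer's point rests on a simple identity: applying cos to a duplicated array gives the same values as duplicating the cos results. The sketch below checks this with plain Python lists in place of tensors; the function names and sample values are hypothetical and this is not the code proposed in the review.

```python
import math


def total_recomputed(freqs, mscale):
    """Merged approach: duplicate freqs, then call cos on the wider array."""
    emb = [row + row for row in freqs]  # mirrors cat((freqs, freqs), dim=-1)
    return [[math.cos(f) * mscale for f in row] for row in emb]


def total_reused(cos):
    """Reviewer's idea: duplicate the cos values that already exist."""
    return [row + row for row in cos]


freqs = [[0.0, 0.5], [1.0, 1.5]]  # made-up values
mscale = 1.2
cos = [[math.cos(f) * mscale for f in row] for row in freqs]

same = total_recomputed(freqs, mscale) == total_reused(cos)
print(same)
```

In PyTorch terms the reused form would presumably be torch.cat((cos, cos), dim=-1), with the same pattern for sin, which skips the second round of trigonometric calls and the emb intermediate.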

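Risks and Impact

Risk is low. The change only adds a conditional, and the NPU path behaves exactly as before. On non-NPU devices, cos_cached_total and sin_cached_total are no longer set, which is safe only if no non-NPU code reads them. It is also worth confirming that _is_npu is initialized in this module; otherwise the method would fail at runtime with an undefined-name error.

The impact is narrow: only the rotary position embedding cache is affected. NPU users see no change, while non-NPU users save about 230 MB of device memory and get better memory efficiency at inference time.

Risk flag: _is_npu may be undefined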
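
Because the snippet assigns cos_cached_total and sin_cached_total only inside the if _is_npu: block, any caller that reads them on another device would hit an AttributeError. One defensive pattern is sketched below; get_total_caches is a hypothetical helper that does not exist in sglang, and SimpleNamespace objects stand in for the rotary embedding instance.

```python
from types import SimpleNamespace


def get_total_caches(rope):
    """Return (cos_cached_total, sin_cached_total) or fail with a clear message."""
    cos_total = getattr(rope, "cos_cached_total", None)
    sin_total = getattr(rope, "sin_cached_total", None)
    if cos_total is None or sin_total is None:
        raise RuntimeError(
            "cos_cached_total / sin_cached_total are only built when _is_npu is true"
        )
    return cos_total, sin_total


# Stand-ins for an instance built on NPU and one built elsewhere.
npu_like = SimpleNamespace(cos_cached_total=[[1.0]], sin_cached_total=[[0.0]])
non_npu_like = SimpleNamespace()

print(get_total_caches(npu_like))
try:
    get_total_caches(non_npu_like)
except RuntimeError as exc:
    print(f"guard triggered: {exc}")
```

A check like this turns a bare AttributeError deep in a forward pass into an explicit message that names the missing precondition.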

Related Issues

No related issue identified

No explicitly linked issue has been detected so far.

Related Context

No clearly related PRs so far.
