Prhub

#27116 Revert "Fix hybrid linear attention misrouting plain-RadixAttention linear layers to the full backend (Ring-2.5-1T)"

原始 PR 作者 ch-wan 合并时间 2026-06-03 14:27 文件变更 2 提交数 4 评论 1 代码增减 +19 / -8

执行摘要

回退 PR #26623,恢复老版路由逻辑

PR #26623 修复了 Bailing/Ring 模型因线性注意力层使用纯 RadixAttention 导致的 CUDA graph 崩溃问题,但该修复可能对 Ling-2.5/2.6 等模型造成回归。回退是为了恢复原有行为,同时通过添加 _is_linear_attention 标记来区分线性注意力层,避免误路由。

建议精读 _is_full_attn 的回退逻辑和 Bailing 模型的标记方式。该 PR 体现了 hybrid attention 路由在模型兼容性与正确性之间的权衡,值得关注后续是否有更统一的方案。

讨论亮点

无 review 评论。

实现拆解

  1. 回退 _is_full_attn 逻辑hybrid_linear_attn_backend.py):将 _is_full_attn 恢复为直接判断 isinstance(layer, RadixLinearAttention) → 线性,isinstance(layer, RadixAttention) → 全注意力,兜底按 layer id 判断。同时移除 Union[RadixAttention, RadixLinearAttention] 类型提示,改为仅 RadixAttention
  2. 添加 _is_linear_attention 标记bailing_moe_linear.py):在 Bailing 的线性注意力层创建 RadixAttention 实例后,设置 self.attn._is_linear_attention = True,供 _is_full_attn 优先判断。
  3. 配套测试调整:通过三次提交更新了 test_ling_2_6_flash.py,适配新的路由逻辑。
文件 模块 状态 重要度
python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py 注意力层 modified 6.73
python/sglang/srt/models/bailing_moe_linear.py 模型定义 modified 5.8

关键符号

HybridLinearAttnBackend._is_full_attn BailingMoELinearAttention.__init__

关键源码片段

python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py core-logic

核心路由方法 `_is_full_attn` 被回退,修改了类型判断逻辑,并添加了 `_is_linear_attention` 属性检查分支。

# python/sglang/srt/layers/attention/hybrid_linear_attn_backend.pyclass HybridLinearAttnBackend(AttentionBackend):
    """Manages a full and linear attention backend"""
​
    def __init__(
        self,
        full_attn_backend: AttentionBackend,
        linear_attn_backend: MambaAttnBackendBase,
        full_attn_layers: list[int],
    ):
        self.full_attn_layers = full_attn_layers
        self.full_attn_backend = full_attn_backend
        self.linear_attn_backend = linear_attn_backend
        self.attn_backend_list = [full_attn_backend, linear_attn_backend]
        self.token_to_kv_pool = full_attn_backend.token_to_kv_pool
        self.req_to_token_pool = full_attn_backend.req_to_token_pool
​
    def _is_full_attn(
        self,
        layer: Optional[RadixAttention],
        layer_id: Optional[int] = None,
    ) -> bool:
        # Explicit linear-attention subclass -> strong linear signal (KDA, GDN,
        # Qwen3-Next, Qwen3.5 main linear layers).
        if isinstance(layer, RadixLinearAttention):
            return False
        # Some hybrid models (Ling-2.5/2.6) wrap their linear layers in plain
        # `RadixAttention` rather than `RadixLinearAttention`. Those wrappers
        # set `_is_linear_attention=True` on the attn module so we can
        # distinguish them from full-attention RadixAttention instances --
        # including MTP/NEXTN draft layers, which are full and must default to
        # the full-attn path.
        if layer is not None and getattr(layer, "_is_linear_attention", False):
            return False
        if isinstance(layer, RadixAttention):
            return True
​
        if layer is not None:
            layer_id = layer.layer_id
        assert layer_id is not None, "either layer or layer_id must be provided"
        return layer_id in self.full_attn_layers
python/sglang/srt/models/bailing_moe_linear.py data-contract

为 Bailing 模型的 RadixAttention 实例添加 `_is_linear_attention` 标记,以配合路由逻辑。

# python/sglang/srt/models/bailing_moe_linear.pyclass BailingMoELinearAttention(nn.Module):
    def __init__(self, config, layer_id, prefix, ...):
        # ... 其他初始化 ...
        self.attn = RadixAttention(
            self.tp_heads,
            self.head_dim,
            self.scaling,
            num_kv_heads=self.tp_kv_heads,
            layer_id=layer_id,
            quant_config=quant_config,
            prefix=f"{prefix}.attn",
        )
        # Marker for HybridLinearAttnBackend._is_full_attn: Bailing wraps
        # linear-attention layers in a plain RadixAttention, so the
        # dispatcher can't tell from the type alone that this is a linear
        # layer (would otherwise default to the full-attn backend, e.g. the
        # same way MTP/NEXTN draft layers are routed).
        self.attn._is_linear_attention = True
        # ... 后续初始化 ...

评论区精华

没有提炼出高价值讨论线程

当前评论区没有形成足够清晰的争议点或结论,后续有更多讨论时会体现在这里。

风险与影响

  1. 回归风险hybrid_linear_attn_backend.py):恢复后的 isinstance(layer, RadixAttention) -> True 可能再次误将纯 RadixAttention 封装的线性注意力层(如 Bailing)路由至全注意力后端,依赖 _is_linear_attention 标记来规避,若其他模型未设置该标记则仍可能崩溃。
  2. 兼容性风险:Ling 模型依赖该回退,但 Qwen3.5 GDN 等其他模型可能未覆盖。
  3. 类型提示收缩layer 参数从 Optional[Union[RadixAttention, RadixLinearAttention]] 改为 Optional[RadixAttention],若传入 RadixLinearAttention 实例会触发类型错误,但 Python 不强制执行,实际运行时需确保调用方正确。

影响范围:所有使用混合线性注意力的模型,特别是 Bailing、Ring、Ling-2.5/2.6 系列。
影响程度:中等。修复了 Ling 模型的潜在回归,但依赖于手动标记 _is_linear_attention,缺少通用性。

核心路径变更 回归风险 缺少测试覆盖

关联 Issue

#26623 Fix hybrid linear attention misrouting plain-RadixAttention linear layers to the full backend (Ring-2.5-1T)

完整报告

参与讨论