#26623 Fix hybrid linear attention misrouting plain-RadixAttention linear layers to the full backend (Ring-2.5-1T)

原始 PR 作者 alisonshao 合并时间 2026-06-03 07:24 文件变更 3 提交数 5 评论 15 代码增减 +10 / -21

执行摘要

修复混合注意力线性层误路由到 full 后端

在 Ring-2.5-1T 等使用 BailingMoE 架构的模型中，线性注意力层（Lightning seg_la）被用普通的 RadixAttention 类包装，而非 RadixLinearAttention。_is_full_attn 方法中的 isinstance(layer, RadixAttention) -> True 分支错误地将这些线性层路由到全注意力（MLA）后端，触发 set_kv_buffer/set_mla_kv_buffer 调用时因 layer_id 不在 full_attn_layers 中而崩溃。

如果希望采用更简洁的路由方案，本 PR 的设计（仅依赖 layer_id）优于基于类型的快捷方式。但由于主线已合并 #26474 hotfix，建议评估是否仍需要本 PR 的清理，或直接在此基础上进一步重构。

讨论亮点

Review 中 ch-wan 指出 _is_full_attn 的类型注解应同时接受 RadixAttention 和 RadixLinearAttention，因为 RadixLinearAttention 并非 RadixAttention 的子类。alisonshao 在 commit 2914ea7287 中采纳了该建议，将类型签章更新为 Optional[Union[RadixAttention, RadixLinearAttention]]。

实现拆解

根因分析：定位到 _is_full_attn 中 isinstance(layer, RadixAttention) -> True 分支为误判来源。该分支在 #23331 引入，对 Qwen3.5 GDN 等使用 RadixLinearAttention 的模型正确，但对 Bailing/Ring 等使用普通 RadixAttention 的线性层失效。
核心修复：删除了整个 isinstance(layer, RadixAttention) 判断块（及之前为 hotfix 添加的 _is_linear_attention 属性检查）。新的实现仅依赖 layer_id in self.full_attn_layers 决定路由。同时将参数类型从 Optional[RadixAttention] 扩展为 Optional[Union[RadixAttention, RadixLinearAttention]] 以匹配实际传入类型。
清理冗余标记：在 bailing_moe_linear.py 中移除 self.attn._is_linear_attention = True 赋值，因为该标记原为配合 hotfix 使用，本 PR 的路由逻辑已不再需要。
测试调整：将 test_ling_2_6_flash.py 从 stage=base-c 改为 suite=nightly-8-gpu-common, nightly=True，使其仅在夜间 CI 运行以节省 GPU 资源，同时仍然覆盖线性/全注意力路由路径。

文件	模块	状态	重要度
`python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`	路由	modified	6.22
`python/sglang/srt/models/bailing_moe_linear.py`	线性模型	modified	5.66
`test/registered/8-gpu-models/test_ling_2_6_flash.py`	测试	modified	4.06

关键符号

HybridLinearAttnBackend._is_full_attn

关键源码片段

python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py core-logic

核心路由方法 `_is_full_attn` 的修复：移除导致误判的 `isinstance(layer, RadixAttention) -> True` 快捷分支，改为完全基于 `layer_id in self.full_attn_layers` 分类。同步更新类型签名。

def _is_full_attn(
    self,
    layer: Optional[Union[RadixAttention, RadixLinearAttention]],
    layer_id: Optional[int] = None,
) -> bool:
    # RadixLinearAttention 是明确的线性注意力层，直接返回 False。
    if isinstance(layer, RadixLinearAttention):
        return False

    # 从 layer 对象获取 layer_id，然后依赖 self.full_attn_layers 集合作判断。
    # 这个路径同时覆盖普通 RadixAttention（全注意力层）和
    # 用于线性层的普通 RadixAttention（如 Bailing/Ring）。
    if layer is not None:
        layer_id = layer.layer_id
    assert layer_id is not None, "either layer or layer_id must be provided"
    return layer_id in self.full_attn_layers

python/sglang/srt/models/bailing_moe_linear.py data-contract

移除 `self.attn._is_linear_attention = True` 标记行，因为新路由方案不再需要此标记。

self.attn = RadixAttention(
    self.tp_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.tp_kv_heads,
    layer_id=layer_id,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
)
# 注：之前这里有一行 `self.attn._is_linear_attention = True`，
# 已被移除，因为 HybridLinearAttnBackend._is_full_attn 现在
# 基于 layer_id 判断，不再需要该标记。

test/registered/8-gpu-models/test_ling_2_6_flash.py test-coverage

将测试从 base-c 改为 nightly 运行，避免在常规 CI 中占用 GPU 资源，同时仍对路由修复提供回归覆盖。

# 将测试从 base-c 调整为 nightly，以节省常规 CI GPU 资源，
# 同时通过 nextn 规格解码路径覆盖路由逻辑。
register_cuda_ci(est_time=600, suite="nightly-8-gpu-common", nightly=True)

class TestLing26Flash(GSM8KMixin, DefaultServerBase):
    model = "inclusionAI/Ling-2.6-flash"
    other_args = [
        "--tp-size", "4",
        "--trust-remote-code",
        "--mamba-scheduler-strategy", "extra_buffer",
        "--mem-fraction-static", "0.75",
        "--max-running-requests", "64",
        "--max-mamba-cache-size", "256",
        "--speculative-algorithm", "NEXTN",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ]

评论区精华

函数签名更新 style

ch-wan 建议将 layer 参数类型扩展为 Optional[Union[RadixAttention, RadixLinearAttention]] 以匹配实际传入的类型。

结论：alisonshao 采纳建议并提交了 commit 2914ea7287。 · 已解决

风险与影响

风险主要集中在路由变更对已有模型的影响：对于原先通过 isinstance 判断为 full 的普通 RadixAttention 全注意力层（如 MTP draft layers），若其 layer_id 未在 full_attn_layers 列表中，将错误地路由到线性后端。但依据设计，full_attn_layers 自模型配置生成，理应包含所有全注意力层。仍需验证 Qwen3.5 GDN、DeepSeek 等模型的 hybrid 配置不会受到影响。此外，本 PR 移除的 _is_linear_attention 标记原为 #26474 hotfix 加入，若后续无意合并此标记，可能导致对 Ling-2.6 hotfix 的依赖。

修复了 Ring-2.5-1T 等模型在 TP8 CUDA 图捕获时的崩溃；对使用 HybridLinearAttnBackend 的所有模型（Qwen3.5、Ling-2.5/2.6 等）均有潜在影响，但路由逻辑更简洁且易于维护。测试从常规 CI 移至夜间，减少了常规 CI GPU 占用，但回归发现需要夜间才能触发。

路由逻辑变更影响已有混合注意力模型与 #26474 存在重叠

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：修复混合注意力线性层误路由到full后端
推荐动作：如果希望采用更简洁的路由方案，本 PR 的设计（仅依赖 layer_id）优于基于类型的快捷方式。但由于主线已合并 #26474 hotfix，建议评估是否仍需要本 PR 的清理，或直接在此基础上进一步重构。

功能与动机

实现拆解

根因分析：定位到 _is_full_attn 中 isinstance(layer, RadixAttention) -> True 分支为误判来源。该分支在 #23331 引入，对 Qwen3.5 GDN 等使用 RadixLinearAttention 的模型正确，但对 Bailing/Ring 等使用普通 RadixAttention 的线性层失效。
核心修复：删除了整个 isinstance(layer, RadixAttention) 判断块（及之前为 hotfix 添加的 _is_linear_attention 属性检查）。新的实现仅依赖 layer_id in self.full_attn_layers 决定路由。同时将参数类型从 Optional[RadixAttention] 扩展为 Optional[Union[RadixAttention, RadixLinearAttention]] 以匹配实际传入类型。
清理冗余标记：在 bailing_moe_linear.py 中移除 self.attn._is_linear_attention = True 赋值，因为该标记原为配合 hotfix 使用，本 PR 的路由逻辑已不再需要。
测试调整：将 test_ling_2_6_flash.py 从 stage=base-c 改为 suite=nightly-8-gpu-common, nightly=True，使其仅在夜间 CI 运行以节省 GPU 资源，同时仍然覆盖线性/全注意力路由路径。

关键文件：

python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py（模块路由；类别 source；类型 core-logic；符号 _is_full_attn, init）: 核心路由方法 _is_full_attn 的修复：移除导致误判的 isinstance(layer, RadixAttention) -> True 快捷分支，改为完全基于 layer_id in self.full_attn_layers 分类。同步更新类型签名。
python/sglang/srt/models/bailing_moe_linear.py（模块线性模型；类别 source；类型 data-contract）: 移除 self.attn._is_linear_attention = True 标记行，因为新路由方案不再需要此标记。
test/registered/8-gpu-models/test_ling_2_6_flash.py（模块测试；类别 test；类型 test-coverage）: 将测试从 base-c 改为 nightly 运行，避免在常规CI中占用GPU资源，同时仍对路由修复提供回归覆盖。

关键符号：HybridLinearAttnBackend._is_full_attn

关键源码片段

`python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`

核心路由方法 _is_full_attn 的修复：移除导致误判的 isinstance(layer, RadixAttention) -> True 快捷分支，改为完全基于 layer_id in self.full_attn_layers 分类。同步更新类型签名。

def _is_full_attn(
    self,
    layer: Optional[Union[RadixAttention, RadixLinearAttention]],
    layer_id: Optional[int] = None,
) -> bool:
    # RadixLinearAttention 是明确的线性注意力层，直接返回 False。
    if isinstance(layer, RadixLinearAttention):
        return False

    # 从 layer 对象获取 layer_id，然后依赖 self.full_attn_layers 集合作判断。
    # 这个路径同时覆盖普通 RadixAttention（全注意力层）和
    # 用于线性层的普通 RadixAttention（如 Bailing/Ring）。
    if layer is not None:
        layer_id = layer.layer_id
    assert layer_id is not None, "either layer or layer_id must be provided"
    return layer_id in self.full_attn_layers

`python/sglang/srt/models/bailing_moe_linear.py`

移除 self.attn._is_linear_attention = True 标记行，因为新路由方案不再需要此标记。

self.attn = RadixAttention(
    self.tp_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.tp_kv_heads,
    layer_id=layer_id,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
)
# 注：之前这里有一行 `self.attn._is_linear_attention = True`，
# 已被移除，因为 HybridLinearAttnBackend._is_full_attn 现在
# 基于 layer_id 判断，不再需要该标记。

`test/registered/8-gpu-models/test_ling_2_6_flash.py`

将测试从 base-c 改为 nightly 运行，避免在常规CI中占用GPU资源，同时仍对路由修复提供回归覆盖。

# 将测试从 base-c 调整为 nightly，以节省常规 CI GPU 资源，
# 同时通过 nextn 规格解码路径覆盖路由逻辑。
register_cuda_ci(est_time=600, suite="nightly-8-gpu-common", nightly=True)

class TestLing26Flash(GSM8KMixin, DefaultServerBase):
    model = "inclusionAI/Ling-2.6-flash"
    other_args = [
        "--tp-size", "4",
        "--trust-remote-code",
        "--mamba-scheduler-strategy", "extra_buffer",
        "--mem-fraction-static", "0.75",
        "--max-running-requests", "64",
        "--max-mamba-cache-size", "256",
        "--speculative-algorithm", "NEXTN",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ]

评论区精华

函数签名更新 (style): alisonshao 采纳建议并提交了 commit 2914ea7287。

风险与影响

风险：风险主要集中在路由变更对已有模型的影响：对于原先通过 isinstance 判断为 full 的普通 RadixAttention 全注意力层（如 MTP draft layers），若其 layer_id 未在 full_attn_layers 列表中，将错误地路由到线性后端。但依据设计，full_attn_layers 自模型配置生成，理应包含所有全注意力层。仍需验证 Qwen3.5 GDN、DeepSeek 等模型的 hybrid 配置不会受到影响。此外，本 PR 移除的 _is_linear_attention 标记原为 #26474 hotfix 加入，若后续无意合并此标记，可能导致对 Ling-2.6 hotfix 的依赖。
影响：修复了 Ring-2.5-1T 等模型在 TP8 CUDA 图捕获时的崩溃；对使用 HybridLinearAttnBackend 的所有模型（Qwen3.5、Ling-2.5/2.6 等）均有潜在影响，但路由逻辑更简洁且易于维护。测试从常规 CI 移至夜间，减少了常规 CI GPU 占用，但回归发现需要夜间才能触发。
风险标记：路由逻辑变更, 影响已有混合注意力模型, 与 #26474 存在重叠

关联脉络

PR #23331 Hybrid attention type dispatch for Qwen3.5 GDN: 本PR修复的是 #23331 引入的回归。该PR添加了基于类型的 isinstance(layer, RadixAttention) -> True 快捷方式，导致线性层被误路由。
PR #26474 [HotFix][Ling 2.6] Fix HybridLinearAttn dispatcher for Ling-2.6: 在主线上的替代修复，通过设置 _is_linear_attention=True 标记修复了同样的问题。本PR在hotfix基础上进一步改进了路由逻辑，并因此移除了该标记。

#26623 Fix hybrid linear attention misrouting plain-RadixAttention linear layers to the full backend (Ring-2.5-1T)

执行摘要

修复混合注意力线性层误路由到 full 后端

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论