执行摘要
- 一句话:回退 PR #26623,恢复老版路由逻辑
- 推荐动作:建议精读
_is_full_attn 的回退逻辑和 Bailing 模型的标记方式。该 PR 体现了 hybrid attention 路由在模型兼容性与正确性之间的权衡,值得关注后续是否有更统一的方案。
功能与动机
PR #26623 修复了 Bailing/Ring 模型因线性注意力层使用纯 RadixAttention 导致的 CUDA graph 崩溃问题,但该修复可能对 Ling-2.5/2.6 等模型造成回归。回退是为了恢复原有行为,同时通过添加 _is_linear_attention 标记来区分线性注意力层,避免误路由。
实现拆解
- 回退
_is_full_attn 逻辑(hybrid_linear_attn_backend.py):将 _is_full_attn 恢复为直接判断 isinstance(layer, RadixLinearAttention) → 线性,isinstance(layer, RadixAttention) → 全注意力,兜底按 layer id 判断。同时移除 Union[RadixAttention, RadixLinearAttention] 类型提示,改为仅 RadixAttention。
- 添加
_is_linear_attention 标记(bailing_moe_linear.py):在 Bailing 的线性注意力层创建 RadixAttention 实例后,设置 self.attn._is_linear_attention = True,供 _is_full_attn 优先判断。
- 配套测试调整:通过三次提交更新了
test_ling_2_6_flash.py,适配新的路由逻辑。
关键文件:
python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py(模块 注意力层;类别 source;类型 core-logic;符号 _is_full_attn): 核心路由方法 _is_full_attn 被回退,修改了类型判断逻辑,并添加了 _is_linear_attention 属性检查分支。
python/sglang/srt/models/bailing_moe_linear.py(模块 模型定义;类别 source;类型 data-contract;符号 BailingMoELinearAttention.init): 为 Bailing 模型的 RadixAttention 实例添加 _is_linear_attention 标记,以配合路由逻辑。
关键符号:HybridLinearAttnBackend._is_full_attn, BailingMoELinearAttention.init
关键源码片段
python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
核心路由方法 _is_full_attn 被回退,修改了类型判断逻辑,并添加了 _is_linear_attention 属性检查分支。
# python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
class HybridLinearAttnBackend(AttentionBackend):
"""Manages a full and linear attention backend"""
def __init__(
self,
full_attn_backend: AttentionBackend,
linear_attn_backend: MambaAttnBackendBase,
full_attn_layers: list[int],
):
self.full_attn_layers = full_attn_layers
self.full_attn_backend = full_attn_backend
self.linear_attn_backend = linear_attn_backend
self.attn_backend_list = [full_attn_backend, linear_attn_backend]
self.token_to_kv_pool = full_attn_backend.token_to_kv_pool
self.req_to_token_pool = full_attn_backend.req_to_token_pool
def _is_full_attn(
self,
layer: Optional[RadixAttention],
layer_id: Optional[int] = None,
) -> bool:
# Explicit linear-attention subclass -> strong linear signal (KDA, GDN,
# Qwen3-Next, Qwen3.5 main linear layers).
if isinstance(layer, RadixLinearAttention):
return False
# Some hybrid models (Ling-2.5/2.6) wrap their linear layers in plain
# `RadixAttention` rather than `RadixLinearAttention`. Those wrappers
# set `_is_linear_attention=True` on the attn module so we can
# distinguish them from full-attention RadixAttention instances --
# including MTP/NEXTN draft layers, which are full and must default to
# the full-attn path.
if layer is not None and getattr(layer, "_is_linear_attention", False):
return False
if isinstance(layer, RadixAttention):
return True
if layer is not None:
layer_id = layer.layer_id
assert layer_id is not None, "either layer or layer_id must be provided"
return layer_id in self.full_attn_layers
python/sglang/srt/models/bailing_moe_linear.py
为 Bailing 模型的 RadixAttention 实例添加 _is_linear_attention 标记,以配合路由逻辑。
# python/sglang/srt/models/bailing_moe_linear.py
class BailingMoELinearAttention(nn.Module):
def __init__(self, config, layer_id, prefix, ...):
# ... 其他初始化 ...
self.attn = RadixAttention(
self.tp_heads,
self.head_dim,
self.scaling,
num_kv_heads=self.tp_kv_heads,
layer_id=layer_id,
quant_config=quant_config,
prefix=f"{prefix}.attn",
)
# Marker for HybridLinearAttnBackend._is_full_attn: Bailing wraps
# linear-attention layers in a plain RadixAttention, so the
# dispatcher can't tell from the type alone that this is a linear
# layer (would otherwise default to the full-attn backend, e.g. the
# same way MTP/NEXTN draft layers are routed).
self.attn._is_linear_attention = True
# ... 后续初始化 ...
评论区精华
无 review 评论。
风险与影响
关联脉络
- PR #26623 Fix hybrid linear attention misrouting plain-RadixAttention linear layers to the full backend (Ring-2.5-1T): 本 PR 回退了 #26623 的修改,该 PR 修复了 Bailing 模型的崩溃问题。
参与讨论