执行摘要

修复 TboAttnBackend 中 HybridLinearAttnBackend 的崩溃

Issue #20109 报告了 TboAttnBackend 与 HybridLinearAttnBackend 不兼容，启动 Qwen3.5 模型并启用 two-batch-overlap 时在 CUDA graph 捕获阶段抛出 TypeError: forward() missing 3 required positional arguments。PR 通过添加 forward 代理和修复 mrope_positions 处理来解决该问题。

该 PR 修复明确，代码简洁，值得合并。建议维护者关注 mrope_positions 的维度兼容性，并考虑补充直接覆盖该路径的测试。

讨论亮点

在 review 中，Qiaolin-Yu 提问为何不把 mrope_positions 放在循环中直接 getattr。llllvvuu 解释因为它的序列维度是 1 而不是 0（即 token 维度不同），需要单独切片。Qiaolin-Yu 表示理解并批准了该方案。

实现拆解

在 tbo_backend.py 中为 TboAttnBackend 添加 forward 方法，直接转发到 self.primary.forward，确保线性注意力后端能正确被调用。
在 two_batch_overlap.py 的 filter_batch 方法中，将 mrope_positions 移出简单的 getattr 循环，单独处理切片：当 mrope_positions 不为 None 时，按 token 维度切片（第二维），否则设为 None。这是因为 mrope_positions 的形状与序列级张量不同，无法直接用 start_seq_index:end_seq_index 切片。

文件	模块	状态	重要度
`python/sglang/srt/batch_overlap/two_batch_overlap.py`	调度器	modified	5.85
`python/sglang/srt/layers/attention/tbo_backend.py`	注意力层	modified	5.36

关键符号

forward filter_batch

关键源码片段

python/sglang/srt/batch_overlap/two_batch_overlap.py core-logic

核心修复文件，处理了 mrope_positions 的切片逻辑，使 TBO 与 HybridLinearAttnBackend 兼容。

# python/sglang/srt/batch_overlap/two_batch_overlap.py 中 filter_batch 方法片段
# 先处理批量复制键（即非位置相关张量）
for key in [
    "forward_mode",
    "is_extend_in_batch",
    "all_extend_in_batch",
    "return_logprob",
    # ... 其他键
    "split_index",
    "orig_seq_lens",
    "return_pooled_hidden_states",
]:
    output_dict[key] = getattr(batch, key)

# mrope_positions 需要单独处理：它按 token 维度索引，而非序列维度
mrope_positions = getattr(batch, "mrope_positions")
if mrope_positions is not None:
    # 假设形状为 (batch, num_tokens, ...)，我们按 start_token_index:end_token_index 切片第二维
    output_dict["mrope_positions"] = mrope_positions[
        :, start_token_index:end_token_index
    ]
else:
    output_dict["mrope_positions"] = None

python/sglang/srt/layers/attention/tbo_backend.py core-logic

添加了 forward 方法，让 TboAttnBackend 能够正确委托调用给 primary 后端。

# python/sglang/srt/layers/attention/tbo_backend.py 中 TboAttnBackend 类
class TboAttnBackend(AttentionBackend):
    # ...
    def get_cuda_graph_seq_len_fill_value(self):
        ans = self.primary.get_cuda_graph_seq_len_fill_value()
        for child in self.children:
            assert ans == child.get_cuda_graph_seq_len_fill_value()
        return ans

    # 新增的 forward 方法：将调用代理给 primary 后端（例如 HybridLinearAttnBackend）
    def forward(self, *args, **kwargs):
        return self.primary.forward(*args, **kwargs)

    def forward_extend(self, *args, **kwargs):
        return self.primary.forward_extend(*args, **kwargs)

    def forward_decode(self, *args, **kwargs):
        return self.primary.forward_decode(*args, **kwargs)
    # ...

评论区精华

mrope_positions 处理位置 设计

Qiaolin-Yu 提问为什么不在循环中直接 getattr mrope_positions。llllvvuu 解释因为它的序列维度是 1 而不是 0（即形状不同），需要单独切片。

结论：Qiaolin-Yu 理解了维度差异，批准了当前方案。 · 已解决

风险与影响

改动较小且集中在特定路径，但 mrope_positions 的切片方式依赖其维度假设（第二维为 token 数），若未来模型改变维度可能导致错误。此外，没有新增测试用例，但 CI 中通过了相关 EP 测试覆盖。

影响使用 --enable-two-batch-overlap 的 Qwen3.5 用户，修复了崩溃问题，使其能正常使用混合线性注意力。对其他模型无影响。

缺少测试覆盖维度假设风险

关联 Issue

#20109 [Bug] [Qwen3.5] TboAttnBackend crashes with linear attention: `forward() missing 3 required positional arguments: 'q', 'k', and 'v'`

完整报告

执行摘要

一句话：修复 TboAttnBackend 中 HybridLinearAttnBackend 的崩溃
推荐动作：该 PR 修复明确，代码简洁，值得合并。建议维护者关注 mrope_positions 的维度兼容性，并考虑补充直接覆盖该路径的测试。

功能与动机

实现拆解

在 tbo_backend.py 中为 TboAttnBackend 添加 forward 方法，直接转发到 self.primary.forward，确保线性注意力后端能正确被调用。
在 two_batch_overlap.py 的 filter_batch 方法中，将 mrope_positions 移出简单的 getattr 循环，单独处理切片：当 mrope_positions 不为 None 时，按 token 维度切片（第二维），否则设为 None。这是因为 mrope_positions 的形状与序列级张量不同，无法直接用 start_seq_index:end_seq_index 切片。

关键文件：

python/sglang/srt/batch_overlap/two_batch_overlap.py（模块调度器；类别 source；类型 core-logic）: 核心修复文件，处理了 mrope_positions 的切片逻辑，使 TBO 与 HybridLinearAttnBackend 兼容。
python/sglang/srt/layers/attention/tbo_backend.py（模块注意力层；类别 source；类型 core-logic；符号 forward）: 添加了 forward 方法，让 TboAttnBackend 能够正确委托调用给 primary 后端。

关键符号：forward, filter_batch

关键源码片段

`python/sglang/srt/batch_overlap/two_batch_overlap.py`

核心修复文件，处理了 mrope_positions 的切片逻辑，使 TBO 与 HybridLinearAttnBackend 兼容。

# python/sglang/srt/batch_overlap/two_batch_overlap.py 中 filter_batch 方法片段
# 先处理批量复制键（即非位置相关张量）
for key in [
    "forward_mode",
    "is_extend_in_batch",
    "all_extend_in_batch",
    "return_logprob",
    # ... 其他键
    "split_index",
    "orig_seq_lens",
    "return_pooled_hidden_states",
]:
    output_dict[key] = getattr(batch, key)

# mrope_positions 需要单独处理：它按 token 维度索引，而非序列维度
mrope_positions = getattr(batch, "mrope_positions")
if mrope_positions is not None:
    # 假设形状为 (batch, num_tokens, ...)，我们按 start_token_index:end_token_index 切片第二维
    output_dict["mrope_positions"] = mrope_positions[
        :, start_token_index:end_token_index
    ]
else:
    output_dict["mrope_positions"] = None

`python/sglang/srt/layers/attention/tbo_backend.py`

添加了 forward 方法，让 TboAttnBackend 能够正确委托调用给 primary 后端。

# python/sglang/srt/layers/attention/tbo_backend.py 中 TboAttnBackend 类
class TboAttnBackend(AttentionBackend):
    # ...
    def get_cuda_graph_seq_len_fill_value(self):
        ans = self.primary.get_cuda_graph_seq_len_fill_value()
        for child in self.children:
            assert ans == child.get_cuda_graph_seq_len_fill_value()
        return ans

    # 新增的 forward 方法：将调用代理给 primary 后端（例如 HybridLinearAttnBackend）
    def forward(self, *args, **kwargs):
        return self.primary.forward(*args, **kwargs)

    def forward_extend(self, *args, **kwargs):
        return self.primary.forward_extend(*args, **kwargs)

    def forward_decode(self, *args, **kwargs):
        return self.primary.forward_decode(*args, **kwargs)
    # ...

评论区精华

mrope_positions 处理位置 (design): Qiaolin-Yu 理解了维度差异，批准了当前方案。

风险与影响

风险：改动较小且集中在特定路径，但 mrope_positions 的切片方式依赖其维度假设（第二维为 token 数），若未来模型改变维度可能导致错误。此外，没有新增测试用例，但 CI 中通过了相关 EP 测试覆盖。
影响：影响使用 --enable-two-batch-overlap 的 Qwen3.5 用户，修复了崩溃问题，使其能正常使用混合线性注意力。对其他模型无影响。
风险标记：缺少测试覆盖, 维度假设风险

关联脉络

暂无明显关联 PR

#20114 fix: support HybridLinearAttnBackend in TboAttnBackend

执行摘要

修复 TboAttnBackend 中 HybridLinearAttnBackend 的崩溃

实现拆解

评论区精华

风险与影响

关联 Issue

完整报告

执行摘要

功能与动机

实现拆解

关键源码片段

`python/sglang/srt/batch_overlap/two_batch_overlap.py`

`python/sglang/srt/layers/attention/tbo_backend.py`

评论区精华

风险与影响

关联脉络

参与讨论