#26380 [core] WAR barrier for overlap schedule buffer writes, without fwd occupancy cost

原始 PR 作者 hnyls2002 合并时间 2026-05-27 14:58 文件变更 5 提交数 13 评论 4 代码增减 +42 / -10

执行摘要

修复重叠调度数据竞争，解除前向占用损失

重叠调度中，schedule_stream 写入 req_to_token_pool、full_to_swa_index_mapping 等共享 GPU 缓冲区，而前向仍在读取，造成写后读竞争。在 DeepSeek V4 解耦模式下表现为 GSM8K 精度波动（~0.87），需通过屏障修复同时避免前向占用退化（之前方案 -4%）。

核心逻辑改动精炼，设计巧妙（用私有流避免屏障串行化），值得深入阅读。但 AMD 和 GB 回归表明跨平台适配不充分，建议后续添加数据竞争测试和跨平台性能基准后再逐步推广。

讨论亮点

AMD 回归报告（@bingxche）：PR 合入后 AMD MI35x 8-GPU 出现 test_deepseek_r1_mxfp4_8gpu 超时、test_priority_scheduling 断言失败、DSv4 非 MTP 性能下降 ~20%。作者 @hnyls2002 分析均为 HIP 特有，CUDA 无影响。
GB200/GB300 性能回归（@nvpohanh）：在 GPT-OSS、Kimi、Llama3、DeepSeek、Nemotron 等模型上观察到性能回退，作者尚未回复。

实现拆解

WAR 屏障：在 Scheduler.event_loop_overlap、Scheduler.event_loop_overlap_disagg_decode、Scheduler.event_loop_overlap_disagg_prefill 三个重叠循环顶部插入 schedule_stream.wait_stream(forward_stream)，确保本轮调度等待前向读完。
解除屏障串行化：屏障将 schedule_stream 链到前向结束，两个主机同步若留在 schedule_stream 会阻塞到前向完成。
- H2D：EagleDraftInput.prepare_for_decode 中 cur_kv_lens_cpu.to(device=batch.device) 改为 non_blocking=True，避免隐式同步。
- D2H：OverlapBypassBuffer.resolve_seq_lens_cpu 中 seq_lens.cpu() 阻塞拷贝改为：在私有 fwd_prepare_d2h_stream 上等 publish_ready 事件后，以 non_blocking 复制到 pinned 内存，主机通过 req_pool_indices_cpu 索引选出。
新增 pinned 缓冲与流：OverlapBypassBuffer.__init__ 中分配 new_seq_lens_cpu_pinned（pin_memory）和 fwd_prepare_d2h_stream，仅在 CUDA/HIP 上生效。

文件	模块	状态	重要度
`python/sglang/srt/managers/overlap_utils.py`	重叠缓冲	modified	6.35
`python/sglang/srt/speculative/eagle_info_v2.py`	推测解码	modified	5.82
`python/sglang/srt/managers/scheduler.py`	调度器	modified	5.05
`python/sglang/srt/disaggregation/decode.py`	解耦解码	modified	4.7
`python/sglang/srt/disaggregation/prefill.py`	解耦预填充	modified	4.7

关键符号

Scheduler.event_loop_overlap Scheduler.event_loop_overlap_disagg_decode Scheduler.event_loop_overlap_disagg_prefill OverlapBypassBuffer.__init__ OverlapBypassBuffer.resolve_seq_lens_cpu EagleDraftInput.prepare_for_decode

关键源码片段

python/sglang/srt/managers/overlap_utils.py core-logic

核心修改：添加 pinned 缓冲和私有 D2H 流，将 seq_lens_cpu 同步移出 schedule_stream，是关键设计实现。

def resolve_seq_lens_cpu(self, batch: ScheduleBatch) -> None:
    # Lazy pull from new_seq_lens_buf for spec_v2 (accept_lens not known to
    # schedule). Write into both CPU and GPU so SB.seq_lens stays a faithful
    # seq_lens_cpu mirror.
    fi = batch.spec_info.future_indices if batch.spec_info is not None else None
    if fi is None:
        return
    if self.publish_ready is not None:
        self.publish_ready.wait()
    batch.seq_lens = self.new_seq_lens_buf[fi]

    if self.fwd_prepare_d2h_stream is None or self.publish_ready is None:
        # bootstrap / non-CUDA path: blocking CPU copy on schedule stream
        batch.seq_lens_cpu = batch.seq_lens.cpu()
        batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
        return

    # Offload D2H to private stream: wait for publish event,
    # then copy to pinned memory non_blocking.
    self.fwd_prepare_d2h_stream.wait_event(self.publish_ready)
    with torch.get_device_module(self.device).stream(self.fwd_prepare_d2h_stream):
        self.new_seq_lens_cpu_pinned.copy_(self.new_seq_lens_buf, non_blocking=True)
    self.fwd_prepare_d2h_stream.synchronize()
    # host-select via req_pool_indices_cpu (precomputed host mirror)
    batch.seq_lens_cpu = self.new_seq_lens_cpu_pinned[batch.req_pool_indices_cpu]
    batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())

评论区精华

AMD 平台回归问题 正确性

PR 合入后 AMD MI35x 出现三种失败：test_deepseek_r1_mxfp4_8gpu 超时、test_priority_scheduling 断言失败、DSv4 非 MTP 性能下降 ~20%。作者确认 HIP 特有，CUDA 无影响。

结论：作者表示会调查并建议临时回退。 · unresolved

GB200/GB300 性能回归 性能

nvpohanh 报告在多个模型上观察到性能回退，作者尚未确认根本原因。

结论：待进一步分析，可能与非阻塞拷贝或流竞争有关。 · unresolved

风险与影响

AMD/HIP 兼容性：私有流与 non_blocking 拷贝在 HIP 上行为可能不同，已观察到三种失败模式，需针对性修复。
性能回归风险：GB200/GB300 出现系统性性能下降，可能与屏障串行化或流资源竞争有关。
缺少直接测试：变更未配套数据竞争或性能回归的自动化测试，回归需靠手动验证。
流资源影响：新增 fwd_prepare_d2h_stream 在设备流数量紧张时可能竞争。

影响所有使用重叠调度（event_loop_overlap）的场景，包括普通解碼和 DeepSeek V4 解耦路径。正面：修复数据竞争，提高精度稳定性。负面：在 AMD 和部分 NVIDIA 平台上引入性能回退。团队需关注跨平台一致性测试和性能基准。

核心路径变更跨平台兼容风险 AMD 平台回归 GB200/GB300 性能回归缺少直接测试

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：修复重叠调度数据竞争，解除前向占用损失
推荐动作：核心逻辑改动精炼，设计巧妙（用私有流避免屏障串行化），值得深入阅读。但 AMD 和 GB 回归表明跨平台适配不充分，建议后续添加数据竞争测试和跨平台性能基准后再逐步推广。

功能与动机

实现拆解

WAR 屏障：在 Scheduler.event_loop_overlap、Scheduler.event_loop_overlap_disagg_decode、Scheduler.event_loop_overlap_disagg_prefill 三个重叠循环顶部插入 schedule_stream.wait_stream(forward_stream)，确保本轮调度等待前向读完。
解除屏障串行化：屏障将 schedule_stream 链到前向结束，两个主机同步若留在 schedule_stream 会阻塞到前向完成。
- H2D：EagleDraftInput.prepare_for_decode 中 cur_kv_lens_cpu.to(device=batch.device) 改为 non_blocking=True，避免隐式同步。
- D2H：OverlapBypassBuffer.resolve_seq_lens_cpu 中 seq_lens.cpu() 阻塞拷贝改为：在私有 fwd_prepare_d2h_stream 上等 publish_ready 事件后，以 non_blocking 复制到 pinned 内存，主机通过 req_pool_indices_cpu 索引选出。
新增 pinned 缓冲与流：OverlapBypassBuffer.__init__ 中分配 new_seq_lens_cpu_pinned（pin_memory）和 fwd_prepare_d2h_stream，仅在 CUDA/HIP 上生效。

关键文件：

python/sglang/srt/managers/overlap_utils.py（模块重叠缓冲；类别 source；类型 core-logic；符号 init, resolve_seq_lens_cpu）: 核心修改：添加 pinned 缓冲和私有 D2H 流，将 seq_lens_cpu 同步移出 schedule_stream，是关键设计实现。
python/sglang/srt/speculative/eagle_info_v2.py（模块推测解码；类别 source；类型 core-logic；符号 prepare_for_decode）: H2D 转为 non_blocking，避免屏障串行化后主机阻塞，是性能恢复的关键辅助修改。
python/sglang/srt/managers/scheduler.py（模块调度器；类别 source；类型 core-logic；符号 event_loop_overlap）: 在主重叠循环入口添加 WAR 屏障，是数据竞争修复的核心触发点。
python/sglang/srt/disaggregation/decode.py（模块解耦解码；类别 source；类型 core-logic；符号 event_loop_overlap_disagg_decode）: 在解耦解码重叠循环添加 WAR 屏障，覆盖 disagg decode 路径。
python/sglang/srt/disaggregation/prefill.py（模块解耦预填充；类别 source；类型 core-logic；符号 event_loop_overlap_disagg_prefill）: 在解耦预填充重叠循环添加 WAR 屏障，覆盖 disagg prefill 路径。

关键符号：Scheduler.event_loop_overlap, Scheduler.event_loop_overlap_disagg_decode, Scheduler.event_loop_overlap_disagg_prefill, OverlapBypassBuffer.init, OverlapBypassBuffer.resolve_seq_lens_cpu, EagleDraftInput.prepare_for_decode

关键源码片段

`python/sglang/srt/managers/overlap_utils.py`

核心修改：添加 pinned 缓冲和私有 D2H 流，将 seq_lens_cpu 同步移出 schedule_stream，是关键设计实现。

def resolve_seq_lens_cpu(self, batch: ScheduleBatch) -> None:
    # Lazy pull from new_seq_lens_buf for spec_v2 (accept_lens not known to
    # schedule). Write into both CPU and GPU so SB.seq_lens stays a faithful
    # seq_lens_cpu mirror.
    fi = batch.spec_info.future_indices if batch.spec_info is not None else None
    if fi is None:
        return
    if self.publish_ready is not None:
        self.publish_ready.wait()
    batch.seq_lens = self.new_seq_lens_buf[fi]

    if self.fwd_prepare_d2h_stream is None or self.publish_ready is None:
        # bootstrap / non-CUDA path: blocking CPU copy on schedule stream
        batch.seq_lens_cpu = batch.seq_lens.cpu()
        batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
        return

    # Offload D2H to private stream: wait for publish event,
    # then copy to pinned memory non_blocking.
    self.fwd_prepare_d2h_stream.wait_event(self.publish_ready)
    with torch.get_device_module(self.device).stream(self.fwd_prepare_d2h_stream):
        self.new_seq_lens_cpu_pinned.copy_(self.new_seq_lens_buf, non_blocking=True)
    self.fwd_prepare_d2h_stream.synchronize()
    # host-select via req_pool_indices_cpu (precomputed host mirror)
    batch.seq_lens_cpu = self.new_seq_lens_cpu_pinned[batch.req_pool_indices_cpu]
    batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())

评论区精华

AMD 回归报告（@bingxche）：PR 合入后 AMD MI35x 8-GPU 出现 test_deepseek_r1_mxfp4_8gpu 超时、test_priority_scheduling 断言失败、DSv4 非 MTP 性能下降 ~20%。作者 @hnyls2002 分析均为 HIP 特有，CUDA 无影响。
GB200/GB300 性能回归（@nvpohanh）：在 GPT-OSS、Kimi、Llama3、DeepSeek、Nemotron 等模型上观察到性能回退，作者尚未回复。
AMD 平台回归问题 (correctness): 作者表示会调查并建议临时回退。
GB200/GB300 性能回归 (performance): 待进一步分析，可能与非阻塞拷贝或流竞争有关。

风险与影响

风险：
- AMD/HIP 兼容性：私有流与 non_blocking 拷贝在 HIP 上行为可能不同，已观察到三种失败模式，需针对性修复。
- 性能回归风险：GB200/GB300 出现系统性性能下降，可能与屏障串行化或流资源竞争有关。
- 缺少直接测试：变更未配套数据竞争或性能回归的自动化测试，回归需靠手动验证。
- 流资源影响：新增 fwd_prepare_d2h_stream 在设备流数量紧张时可能竞争。
- 影响：影响所有使用重叠调度（event_loop_overlap）的场景，包括普通解碼和 DeepSeek V4 解耦路径。正面：修复数据竞争，提高精度稳定性。负面：在 AMD 和部分 NVIDIA 平台上引入性能回退。团队需关注跨平台一致性测试和性能基准。
- 风险标记：核心路径变更, 跨平台兼容风险, AMD平台回归, GB200/GB300性能回归, 缺少直接测试

关联脉络

PR #26376 bidirectional wait_stream barrier: 被本 PR 替代，其对称屏障导致 -4% 前向占用
PR #26425 req_pool_indices_cpu host mirror: 本 PR 依赖其引入 req_pool_indices_cpu，实现主机侧索引选择

#26380 [core] WAR barrier for overlap schedule buffer writes, without fwd occupancy cost

执行摘要

修复重叠调度数据竞争，解除前向占用损失

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论