Prhub

#26463 refresh resolve_seq_lens_cpu comments

原始 PR 作者 hnyls2002 合并时间 2026-05-27 15:51 文件变更 1 提交数 1 评论 0 代码增减 +9 / -6

执行摘要

仅更新注释和文档,无行为变更

作为之前 PR 的后续跟进,澄清 seq_lens_cpu D2H 在独立流上执行的原因(避免阻塞调度流),并记录未来可能的优化方向。详见 PR body: 'Comment-only follow-up: clarify why the seq_lens_cpu D2H runs on a standalone stream...'

虽然是纯注释变更,但其中的设计解释(为什么 D2H 用独立流)以及 FIXME(统一索引)值得关注,反映了架构决策和未来演进方向。

讨论亮点

该 PR 没有 review 评论。

实现拆解

  1. 核心源码路径
    - python/sglang/srt/managers/overlap_utils.py(core-logic):源码主路径;+9/-6;以 core-logic 为主
文件 模块 状态 重要度
python/sglang/srt/managers/overlap_utils.py 调度器 modified 4.81

关键符号

resolve_seq_lens_cpu publish

关键源码片段

python/sglang/srt/managers/overlap_utils.py documentation

唯一修改的文件,更新了解析 seq_lens_cpu 和 publish 方法的注释,并添加了 FIXME。

def resolve_seq_lens_cpu(self, batch: ScheduleBatch) -> None:
    # seq_lens_cpu may be needed on the host for kernel-launch prep (some backends).
    # Run this D2H on a standalone stream to avoid chain-blocking forward_n ->
    # prepare_{n+1}: a sync on the schedule stream would inherit its WAR barrier and
    # stall the host until forward_n ends.
    fi = batch.spec_info.future_indices if batch.spec_info is not None else None
    if fi is None:
        return
    if self.publish_ready is not None:
        self.publish_ready.wait()
    batch.seq_lens = self.new_seq_lens_buf[fi]
​
    if self.fwd_prepare_d2h_stream is None or self.publish_ready is None:
        batch.seq_lens_cpu = batch.seq_lens.cpu() # bootstrap / non-CUDA
        batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
        return
​
    # Mechanism: don't sync the schedule stream; gate a private stream on the
    # publish event and copy into the static pinned buffer.
    self.fwd_prepare_d2h_stream.wait_event(self.publish_ready)
    with torch.get_device_module(self.device).stream(self.fwd_prepare_d2h_stream):
        self.new_seq_lens_cpu_pinned.copy_(self.new_seq_lens_buf, non_blocking=True)
    self.fwd_prepare_d2h_stream.synchronize()
​
    # FIXME: fi == batch.req_pool_indices; unify future_indices and req_pool_indices.
    batch.seq_lens_cpu = self.new_seq_lens_cpu_pinned[batch.req_pool_indices_cpu]
    batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
​
​
def publish(self, future_indices: torch.Tensor, new_seq_lens: torch.Tensor) -> None:
    indices = future_indices
    if indices.shape[0] == 0:
        return # DP idle
    self.new_seq_lens_buf[indices] = new_seq_lens.to(self.new_seq_lens_buf.dtype)
    # Only spec_v2 needs the event; it gates the seq_lens D2H on the private stream.
    if self.spec_algo.is_some():
        if self.publish_ready is None:
            self.publish_ready = torch.get_device_module(self.device).Event()
        self.publish_ready.record()

评论区精华

没有提炼出高价值讨论线程

当前评论区没有形成足够清晰的争议点或结论,后续有更多讨论时会体现在这里。

风险与影响

无任何风险,因为纯注释修改,不影响运行行为。

对用户和系统无影响。仅改善了代码可读性和可维护性。

纯注释变更

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接,后续同步到相关引用后会出现在这里。

完整报告

参与讨论