执行摘要

仅更新注释和文档，无行为变更

作为之前 PR 的后续跟进，澄清 seq_lens_cpu D2H 在独立流上执行的原因（避免阻塞调度流），并记录未来可能的优化方向。详见 PR body: 'Comment-only follow-up: clarify why the seq_lens_cpu D2H runs on a standalone stream...'

虽然是纯注释变更，但其中的设计解释（为什么 D2H 用独立流）以及 FIXME（统一索引）值得关注，反映了架构决策和未来演进方向。

讨论亮点

该 PR 没有 review 评论。

实现拆解

核心源码路径
- python/sglang/srt/managers/overlap_utils.py（core-logic）：源码主路径；+9/-6；以 core-logic 为主

文件	模块	状态	重要度
`python/sglang/srt/managers/overlap_utils.py`	调度器	modified	4.81

关键符号

resolve_seq_lens_cpu publish

关键源码片段

python/sglang/srt/managers/overlap_utils.py documentation

唯一修改的文件，更新了解析 seq_lens_cpu 和 publish 方法的注释，并添加了 FIXME。

def resolve_seq_lens_cpu(self, batch: ScheduleBatch) -> None:
    # seq_lens_cpu may be needed on the host for kernel-launch prep (some backends).
    # Run this D2H on a standalone stream to avoid chain-blocking forward_n ->
    # prepare_{n+1}: a sync on the schedule stream would inherit its WAR barrier and
    # stall the host until forward_n ends.
    fi = batch.spec_info.future_indices if batch.spec_info is not None else None
    if fi is None:
        return
    if self.publish_ready is not None:
        self.publish_ready.wait()
    batch.seq_lens = self.new_seq_lens_buf[fi]

    if self.fwd_prepare_d2h_stream is None or self.publish_ready is None:
        batch.seq_lens_cpu = batch.seq_lens.cpu() # bootstrap / non-CUDA
        batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
        return

    # Mechanism: don't sync the schedule stream; gate a private stream on the
    # publish event and copy into the static pinned buffer.
    self.fwd_prepare_d2h_stream.wait_event(self.publish_ready)
    with torch.get_device_module(self.device).stream(self.fwd_prepare_d2h_stream):
        self.new_seq_lens_cpu_pinned.copy_(self.new_seq_lens_buf, non_blocking=True)
    self.fwd_prepare_d2h_stream.synchronize()

    # FIXME: fi == batch.req_pool_indices; unify future_indices and req_pool_indices.
    batch.seq_lens_cpu = self.new_seq_lens_cpu_pinned[batch.req_pool_indices_cpu]
    batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())


def publish(self, future_indices: torch.Tensor, new_seq_lens: torch.Tensor) -> None:
    indices = future_indices
    if indices.shape[0] == 0:
        return # DP idle
    self.new_seq_lens_buf[indices] = new_seq_lens.to(self.new_seq_lens_buf.dtype)
    # Only spec_v2 needs the event; it gates the seq_lens D2H on the private stream.
    if self.spec_algo.is_some():
        if self.publish_ready is None:
            self.publish_ready = torch.get_device_module(self.device).Event()
        self.publish_ready.record()

评论区精华

没有提炼出高价值讨论线程

当前评论区没有形成足够清晰的争议点或结论，后续有更多讨论时会体现在这里。

风险与影响

无任何风险，因为纯注释修改，不影响运行行为。

对用户和系统无影响。仅改善了代码可读性和可维护性。

纯注释变更

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：仅更新注释和文档，无行为变更
推荐动作：虽然是纯注释变更，但其中的设计解释（为什么 D2H 用独立流）以及 FIXME（统一索引）值得关注，反映了架构决策和未来演进方向。

功能与动机

实现拆解

核心源码路径
- python/sglang/srt/managers/overlap_utils.py（core-logic）：源码主路径；+9/-6；以 core-logic 为主

关键文件：

python/sglang/srt/managers/overlap_utils.py（模块调度器；类别 source；类型 documentation；符号 resolve_seq_lens_cpu, publish）: 唯一修改的文件，更新了解析 seq_lens_cpu 和 publish 方法的注释，并添加了 FIXME。

关键符号：resolve_seq_lens_cpu, publish

关键源码片段

`python/sglang/srt/managers/overlap_utils.py`

唯一修改的文件，更新了解析 seq_lens_cpu 和 publish 方法的注释，并添加了 FIXME。

def resolve_seq_lens_cpu(self, batch: ScheduleBatch) -> None:
    # seq_lens_cpu may be needed on the host for kernel-launch prep (some backends).
    # Run this D2H on a standalone stream to avoid chain-blocking forward_n ->
    # prepare_{n+1}: a sync on the schedule stream would inherit its WAR barrier and
    # stall the host until forward_n ends.
    fi = batch.spec_info.future_indices if batch.spec_info is not None else None
    if fi is None:
        return
    if self.publish_ready is not None:
        self.publish_ready.wait()
    batch.seq_lens = self.new_seq_lens_buf[fi]

    if self.fwd_prepare_d2h_stream is None or self.publish_ready is None:
        batch.seq_lens_cpu = batch.seq_lens.cpu() # bootstrap / non-CUDA
        batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
        return

    # Mechanism: don't sync the schedule stream; gate a private stream on the
    # publish event and copy into the static pinned buffer.
    self.fwd_prepare_d2h_stream.wait_event(self.publish_ready)
    with torch.get_device_module(self.device).stream(self.fwd_prepare_d2h_stream):
        self.new_seq_lens_cpu_pinned.copy_(self.new_seq_lens_buf, non_blocking=True)
    self.fwd_prepare_d2h_stream.synchronize()

    # FIXME: fi == batch.req_pool_indices; unify future_indices and req_pool_indices.
    batch.seq_lens_cpu = self.new_seq_lens_cpu_pinned[batch.req_pool_indices_cpu]
    batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())


def publish(self, future_indices: torch.Tensor, new_seq_lens: torch.Tensor) -> None:
    indices = future_indices
    if indices.shape[0] == 0:
        return # DP idle
    self.new_seq_lens_buf[indices] = new_seq_lens.to(self.new_seq_lens_buf.dtype)
    # Only spec_v2 needs the event; it gates the seq_lens D2H on the private stream.
    if self.spec_algo.is_some():
        if self.publish_ready is None:
            self.publish_ready = torch.get_device_module(self.device).Event()
        self.publish_ready.record()

评论区精华

该 PR 没有 review 评论。

暂无高价值评论线程

风险与影响

风险：无任何风险，因为纯注释修改，不影响运行行为。
影响：对用户和系统无影响。仅改善了代码可读性和可维护性。
风险标记：纯注释变更

关联脉络

PR #26380 [core] WAR barrier for overlap schedule buffer writes, without fwd occupancy cost: 该 PR 引入了原始的 seq_lens_cpu D2H 逻辑，当前 PR 是对其注释的跟进澄清。

#26463 refresh resolve_seq_lens_cpu comments

执行摘要

仅更新注释和文档，无行为变更

实现拆解

评论区精华

没有提炼出高价值讨论线程

风险与影响

关联 Issue

未识别关联 Issue

完整报告

执行摘要

功能与动机

实现拆解

关键源码片段

`python/sglang/srt/managers/overlap_utils.py`

评论区精华

风险与影响

关联脉络

参与讨论