执行摘要
- 一句话:仅更新注释和文档,无行为变更
- 推荐动作:虽然是纯注释变更,但其中的设计解释(为什么 D2H 用独立流)以及 FIXME(统一索引)值得关注,反映了架构决策和未来演进方向。
功能与动机
作为之前 PR 的后续跟进,澄清 seq_lens_cpu D2H 在独立流上执行的原因(避免阻塞调度流),并记录未来可能的优化方向。详见 PR body: 'Comment-only follow-up: clarify why the seq_lens_cpu D2H runs on a standalone stream...'
实现拆解
- 核心源码路径
- python/sglang/srt/managers/overlap_utils.py(core-logic):源码主路径;+9/-6;以 core-logic 为主
关键文件:
python/sglang/srt/managers/overlap_utils.py(模块 调度器;类别 source;类型 documentation;符号 resolve_seq_lens_cpu, publish): 唯一修改的文件,更新了解析 seq_lens_cpu 和 publish 方法的注释,并添加了 FIXME。
关键符号:resolve_seq_lens_cpu, publish
关键源码片段
python/sglang/srt/managers/overlap_utils.py
唯一修改的文件,更新了解析 seq_lens_cpu 和 publish 方法的注释,并添加了 FIXME。
def resolve_seq_lens_cpu(self, batch: ScheduleBatch) -> None:
# seq_lens_cpu may be needed on the host for kernel-launch prep (some backends).
# Run this D2H on a standalone stream to avoid chain-blocking forward_n ->
# prepare_{n+1}: a sync on the schedule stream would inherit its WAR barrier and
# stall the host until forward_n ends.
fi = batch.spec_info.future_indices if batch.spec_info is not None else None
if fi is None:
return
if self.publish_ready is not None:
self.publish_ready.wait()
batch.seq_lens = self.new_seq_lens_buf[fi]
if self.fwd_prepare_d2h_stream is None or self.publish_ready is None:
batch.seq_lens_cpu = batch.seq_lens.cpu() # bootstrap / non-CUDA
batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
return
# Mechanism: don't sync the schedule stream; gate a private stream on the
# publish event and copy into the static pinned buffer.
self.fwd_prepare_d2h_stream.wait_event(self.publish_ready)
with torch.get_device_module(self.device).stream(self.fwd_prepare_d2h_stream):
self.new_seq_lens_cpu_pinned.copy_(self.new_seq_lens_buf, non_blocking=True)
self.fwd_prepare_d2h_stream.synchronize()
# FIXME: fi == batch.req_pool_indices; unify future_indices and req_pool_indices.
batch.seq_lens_cpu = self.new_seq_lens_cpu_pinned[batch.req_pool_indices_cpu]
batch.seq_lens_sum = int(batch.seq_lens_cpu.sum())
def publish(self, future_indices: torch.Tensor, new_seq_lens: torch.Tensor) -> None:
indices = future_indices
if indices.shape[0] == 0:
return # DP idle
self.new_seq_lens_buf[indices] = new_seq_lens.to(self.new_seq_lens_buf.dtype)
# Only spec_v2 needs the event; it gates the seq_lens D2H on the private stream.
if self.spec_algo.is_some():
if self.publish_ready is None:
self.publish_ready = torch.get_device_module(self.device).Event()
self.publish_ready.record()
评论区精华
该 PR 没有 review 评论。
风险与影响
- 风险:无任何风险,因为纯注释修改,不影响运行行为。
- 影响:对用户和系统无影响。仅改善了代码可读性和可维护性。
- 风险标记:纯注释变更
关联脉络
- PR #26380 [core] WAR barrier for overlap schedule buffer writes, without fwd occupancy cost: 该 PR 引入了原始的 seq_lens_cpu D2H 逻辑,当前 PR 是对其注释的跟进澄清。
参与讨论