#23641 Revert "[Intel GPU] Enable pipeline parallelism on XPU"

Author: ShangmingCai | Merged: 2026-04-24 17:36 | Files changed: 1 | Commits: 1 | Comments: 5 | Lines: +41 / -74

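Executive summary

Reverts XPU pipeline parallelism support to fix a CI breakage.

The PR body states that the PD PP CI broke after the original PR #23472 was merged (https://github.com/sgl-project/sglang/actions/runs/24879511440/job/72844149700), and the author opened this revert to check whether the failure goes away. Later comments show CI passing after the revert, which points to the original PR as the source of the instability.

Recommendation: merge this revert immediately to unblock CI. The original author should re-examine the XPU communication logic in PR #23472 (especially the send/recv orchestration) and add automated tests targeting XPU before resubmitting. The return type annotations should also be corrected as the review suggests.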

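Discussion highlights

Inaccurate type annotations (raised by gemini-code-assist[bot])

_pp_commit_send_output_work_and_preprocess_output_tensors: the return type should include Optional, because the function returns None when the microbatch does not exist. After the revert it is written as Tuple[PPProxyTensors, GenerationBatchResult, torch.cuda.Event], which loses the optionality.
_pp_send_recv_and_preprocess_output_tensors: the return type does not match the implementation (it actually returns a 4-tuple but is annotated as a 3-tuple).
The author did not respond to these comments. Because the revert was urgent, the PR was merged as-is and the annotation issues remain.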
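
To make the first review comment concrete, here is a minimal sketch of an Optional-aware return annotation. It does not use the real sglang types: `ProxyTensors`, `BatchResult`, `Event`, `CommitResult`, and `commit_and_preprocess` are hypothetical stand-ins, and the choice to make each tuple element individually optional is an assumption, since the source only says the function returns None when the microbatch is absent.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ProxyTensors:
    """Hypothetical stand-in for PPProxyTensors."""
    payload: str


@dataclass
class BatchResult:
    """Hypothetical stand-in for GenerationBatchResult."""
    batch_id: int


@dataclass
class Event:
    """Hypothetical stand-in for torch.cuda.Event."""
    recorded: bool = True


# Assumption: every element may be None when there is no microbatch to commit.
CommitResult = Tuple[Optional[ProxyTensors], Optional[BatchResult], Optional[Event]]


def commit_and_preprocess(microbatches: Dict[int, str], mb_id: int) -> CommitResult:
    """Return the processed outputs, or Nones if the microbatch does not exist."""
    payload = microbatches.get(mb_id)
    if payload is None:
        # The annotation above tells callers they must handle this case.
        return None, None, None
    return ProxyTensors(payload), BatchResult(mb_id), Event()
```

With an annotation like this, a static type checker can require callers to check for None before using the event, which is the safety the reviewer says was lost.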

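Implementation breakdown

Remove the XPU-specific import: deletes from sglang.srt.utils.common import is_xpu at the top of the file, so the module no longer depends on XPU detection.
Restore hard-coded CUDA calls in place of device-agnostic ones: in the three event loops event_loop_pp, event_loop_pp_disagg_prefill, and event_loop_pp_disagg_decode, self.device_module.current_stream().wait_event() is changed back to torch.cuda.current_stream().wait_event(), dropping the get_device_module() abstraction.
Restore type annotations: the type of last_rank_comm_queue goes from deque[Tuple[torch.Event, ...]] back to deque[Tuple[torch.cuda.Event, ...]]. In the return types of _pp_commit_send_output_work_and_preprocess_output_tensors and _pp_send_recv_and_preprocess_output_tensors, Optional[torch.Event] becomes torch.cuda.Event (note that Optional is not kept here, which weakens type safety).
Restore synchronization in the profile stage: in profile_and_init_predictor, device=self.device goes back to device="cuda", and self.device_module.synchronize() becomes if torch.cuda.is_available(): torch.cuda.synchronize(), removing the device-agnostic synchronization.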
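
The difference between the two styles can be sketched without PyTorch. The example below is only an illustration of the dispatch pattern, not the sglang implementation: `FakeBackend`, `BACKENDS`, `DeviceAgnosticScheduler`, and `CudaOnlyScheduler` are hypothetical names, and the fake backends merely record which one was asked to synchronize.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeBackend:
    """Hypothetical stand-in for a device module such as torch.cuda."""
    name: str
    calls: List[str] = field(default_factory=list)

    def synchronize(self) -> None:
        # Record the call instead of talking to real hardware.
        self.calls.append(f"{self.name}.synchronize")


# Hypothetical registry; the real code resolves an actual device module.
BACKENDS: Dict[str, FakeBackend] = {
    "cuda": FakeBackend("cuda"),
    "xpu": FakeBackend("xpu"),
}


class DeviceAgnosticScheduler:
    """Style used by #23472: resolve the backend once, then call through it."""

    def __init__(self, device: str) -> None:
        self.device = device
        self.device_module = BACKENDS[device]

    def sync(self) -> str:
        self.device_module.synchronize()
        return self.device_module.name


class CudaOnlyScheduler:
    """Style restored by the revert: the backend is always CUDA."""

    def __init__(self, device: str) -> None:
        self.device = device  # kept for reference, but not used for dispatch

    def sync(self) -> str:
        BACKENDS["cuda"].synchronize()
        return "cuda"
```

In this sketch, a scheduler built for "xpu" synchronizes the XPU backend only in the device-agnostic style. The CUDA-only style ignores the configured device, which is why the revert makes pipeline parallelism unavailable on XPU again.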

File	Module	Status	Importance
`python/sglang/srt/managers/scheduler_pp_mixin.py`	Scheduler	modified	7.85

Key symbols

event_loop_pp event_loop_pp_disagg_prefill event_loop_pp_disagg_decode init_pp_loop_state profile_and_init_predictor _pp_commit_send_output_work_and_preprocess_output_tensors _pp_send_recv_and_preprocess_output_tensors

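Key source snippet

python/sglang/srt/managers/scheduler_pp_mixin.py core-logic

Core PP scheduling code. The change withdraws the device-agnostic XPU adaptation and restores explicit CUDA calls, which directly affects the pipeline-parallel execution path.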

    # Key synchronization section of event_loop_pp (final state after the revert)
    if not self.pp_group.is_last_rank:
        if self.cur_batch:
            # Uses the hard-coded CUDA stream instead of resolving it through self.device_module
            torch.cuda.current_stream().wait_event(self.launch_event)
            with torch.profiler.record_function("send_proxy_dict_to_next_stage"):
                self.send_proxy_work = self._pp_send_dict_to_next_stage(
                    result.pp_hidden_states_proxy_tensors.tensors,
                    async_send=True,
                    msg_type="proxy",
                )

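Comment highlights

Inaccurate function return type annotations (correctness)

gemini-code-assist[bot] points out that the return type of _pp_commit_send_output_work_and_preprocess_output_tensors is missing Optional, since the microbatch may be None, and that the return type of _pp_send_recv_and_preprocess_output_tensors does not match the actual return value (a 4-tuple).

Conclusion: the author did not respond, the PR was merged as an urgent revert, and the annotation issues remain unfixed. Status: unresolved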

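Risks and impact

Low risk: the revert returns the code to its state before PR #23472 was merged, which had been running stably. The only risk is that PP becomes unavailable on XPU again, and CI has been verified to pass. The imprecise type annotations do not affect runtime behavior, but they may produce misleading results in static analysis.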
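
The claim that imprecise annotations leave runtime behavior untouched can be checked with a small standalone sketch. `send_recv_stub` is a hypothetical function, not the sglang method: it is annotated as returning a 3-tuple but returns four values, mirroring the mismatch the reviewer describes.

```python
from typing import Tuple, get_args, get_type_hints


def send_recv_stub() -> Tuple[int, int, int]:
    """Hypothetical stub: annotated as a 3-tuple, yet returns four values."""
    # Python does not validate return values against annotations at runtime.
    return 1, 2, 3, 4  # type: ignore[return-value]


result = send_recv_stub()

# Number of elements promised by the annotation versus actually returned.
declared_arity = len(get_args(get_type_hints(send_recv_stub)["return"]))
actual_arity = len(result)
```

The call succeeds even though the declared and actual arities differ. A static type checker would be expected to report the mismatch, which is why it matters for tooling and readers rather than for execution.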

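Scope: only Intel XPU users running with PP >= 2 are affected; CUDA and AMD users are not. Severity: medium, since PP capability on XPU is temporarily rolled back, but CI on the main branch is stable again and the health of the mainline takes priority. Maintenance: re-landing XPU PP support later will need broader CI coverage.

Risk tags: XPU feature rollback, CI fix, core scheduling path change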

Related issue

#23472 [Intel GPU] Enable pipeline parallelism on XPU

Related context

PR #23472 [Intel GPU] Enable pipeline parallelism on XPU: the original PR being reverted. Its merge caused the PD PP CI failure that triggered this revert.
