#18582 Add subprocess liveness monitor to detect scheduler crashes

Author: Simon-Li · Merged: 2026-03-29 15:09 · Files changed: 7 · Commits: 18 · Comments: 16 · Lines changed: +300 / -29

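Executive Summary

Adds a subprocess liveness monitor that detects scheduler crashes and prevents a zombie-service state.

Issue #18421 describes what happens when a scheduler subprocess crashes from a C++-level error such as an NCCL timeout: std::terminate() runs before any Python exception handler, so the main process cannot detect the crash and the server becomes a "zombie service" that accepts requests but cannot process them. The PR body states the goal of adding a monitor that detects such crashes and triggers cleanup, which shortens failure-recovery latency. In its words: 'When a scheduler subprocess crashes at the C++ level, std::terminate() runs before Python exception handlers, leaving the main process unaware. The service continues accepting requests but cannot process them.'

Engineers should read the SubprocessWatchdog class in python/sglang/srt/utils/watchdog.py closely to understand its daemon-thread design, exception handling, and SIGQUIT trigger. The _launch_subprocesses changes in engine.py are worth attention for how monitoring is wired into the process-launch flow without passing state across modules. For readers interested in signal handling, running_phase_sigquit_handler in tokenizer_manager.py shows the design decision to stop the watchdog in a coordinated way so that shutdown does not race with crash detection.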

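Discussion Highlights

The review was led by hnyls2002. The key exchanges were: 1) Design simplicity: hnyls2002 pointed out that scheduler_procs should not be kept in the init result because it is only used while building the watchdog, which led to a refactor that returns a tuple instead of extending SchedulerInitResult (commit 0262ae10). 2) Correctness risk: hnyls2002 commented 'Why discard the reference of watch dog? When the server receives a SIGQUIT, I think you also need to stop the watch dog.' The author fixed this in commit 2bbc0a01, storing the watchdog reference on tokenizer_manager so the SIGQUIT handler can use it. 3) Test conventions: hnyls2002 suggested 'Register with CustomTestcase', and the test file's registration was adjusted afterwards. All discussion points were resolved and the PR was approved.

Implementation Breakdown

The implementation splits into four modules: 1) Monitoring core: watchdog.py gains a SubprocessWatchdog class whose daemon thread polls each subprocess's is_alive() once per second and sends SIGQUIT when it detects an abnormal exit (exitcode != 0). 2) Integration layer: the _launch_subprocesses and _launch_scheduler_processes methods in engine.py now return and store the watchdog instance, and http_server.py and ray/engine.py are updated to pass the watchdog along. 3) Signal handling: running_phase_sigquit_handler in tokenizer_manager.py now stops the watchdog, which prevents false positives during a normal shutdown. 4) Test coverage: a new test_subprocess_watchdog.py contains 6 test cases covering healthy processes, crash detection, multiple processes, and edge cases.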
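To make the monitoring core concrete, here is a minimal sketch of a polling watchdog under stated assumptions. It is not the PR's code: `SketchWatchdog`, `ProcessLike`, `start()`, and the injected `on_crash` callback are hypothetical names chosen for illustration. The PR's `_check_processes` calls `os.kill(os.getpid(), signal.SIGQUIT)` directly; a callback is used here only so the sketch can be exercised without signalling the current process.

```python
import logging
import threading
from typing import Callable, Optional, Protocol, Sequence

logger = logging.getLogger(__name__)


class ProcessLike(Protocol):
    """Hypothetical protocol: the subset of multiprocessing.Process used here."""

    exitcode: Optional[int]

    def is_alive(self) -> bool: ...


class SketchWatchdog:
    """Illustrative stand-in for SubprocessWatchdog (names are assumptions)."""

    def __init__(
        self,
        processes: Sequence[ProcessLike],
        names: Sequence[str],
        on_crash: Callable[[str, Optional[int]], None],
        interval: float = 1.0,
    ) -> None:
        self._processes = list(processes)
        self._names = list(names)
        self._on_crash = on_crash
        self._interval = interval
        self._stop_event = threading.Event()
        # Daemon thread: it never blocks interpreter shutdown.
        self._thread = threading.Thread(target=self._monitor_loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Called on normal shutdown so dying children are not reported as crashes.
        self._stop_event.set()

    def _check_processes(self) -> bool:
        for proc, name in zip(self._processes, self._names):
            if proc.is_alive() or proc.exitcode == 0:
                continue  # still running, or exited cleanly
            logger.error("Subprocess %s crashed with exit code %s", name, proc.exitcode)
            self._on_crash(name, proc.exitcode)
            return True
        return False

    def _monitor_loop(self) -> None:
        # Event.wait() returns False on timeout, True once stop() has been called.
        while not self._stop_event.wait(self._interval):
            try:
                if self._check_processes():
                    return
            except Exception:
                # Log instead of letting the thread die silently.
                logger.exception("Watchdog check failed")
```

Waiting on a `threading.Event` rather than calling `time.sleep()` lets `stop()` interrupt the loop promptly instead of waiting out the polling interval.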

File	Module	Status	Importance
`python/sglang/srt/utils/watchdog.py`	utils	modified	8.0
`python/sglang/srt/entrypoints/engine.py`	entrypoints	modified	7.0
`python/sglang/srt/managers/tokenizer_manager.py`	managers	modified	6.0
`test/registered/unit/utils/test_subprocess_watchdog.py`	test	added	5.0


Key Symbols

SubprocessWatchdog.__init__ SubprocessWatchdog._monitor_loop Engine._launch_subprocesses running_phase_sigquit_handler

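Review Highlights

Watchdog reference storage and SIGQUIT handling · Design

hnyls2002 pointed out that the watchdog reference was discarded in launch_server() (bound to the _ placeholder), so the SIGQUIT handler could not reach the watchdog to stop it, which could cause false positives.

Conclusion: the author fixed this in commit 2bbc0a01 by storing the watchdog reference in tokenizer_manager._subprocess_watchdog, so running_phase_sigquit_handler can call stop() · Resolved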
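The handler-side change can be sketched as follows. Only the `_subprocess_watchdog` attribute, the `running_phase_sigquit_handler` name, and the `stop()` call come from the PR summary; `StubWatchdog`, `StubTokenizerManager`, `SignalHandlerSketch`, and the `cleanup_ran` flag are hypothetical stand-ins, and the real handler's teardown work is replaced by a placeholder.

```python
from typing import Optional


class StubWatchdog:
    """Hypothetical stand-in exposing only stop()."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


class StubTokenizerManager:
    """Hypothetical holder for the watchdog reference."""

    def __init__(self, watchdog: Optional[StubWatchdog] = None) -> None:
        # None models nodes or backends that run without a watchdog.
        self._subprocess_watchdog = watchdog


class SignalHandlerSketch:
    def __init__(self, tokenizer_manager: StubTokenizerManager) -> None:
        self.tokenizer_manager = tokenizer_manager
        self.cleanup_ran = False

    def running_phase_sigquit_handler(self, signum=None, frame=None) -> None:
        # Stop monitoring first: children killed during teardown exit with a
        # non-zero code and would otherwise look like fresh crashes.
        watchdog = self.tokenizer_manager._subprocess_watchdog
        if watchdog is not None:
            watchdog.stop()
        self.cleanup_ran = True  # placeholder for the real teardown logic
```

Keeping the reference on the tokenizer manager means the handler needs no extra argument or global to reach the watchdog.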

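scheduler_procs design refactor · Design

Referring to scheduler_procs, hnyls2002 commented "This should not be kept in the init result. It is only used in the watch dog building steps.", arguing that putting scheduler_procs into SchedulerInitResult would pollute the data structure.

Conclusion: the author refactored in commit 0262ae10 so that _launch_scheduler_processes returns the tuple (SchedulerInitResult, scheduler_procs), which keeps the design simple · Resolved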
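The shape of that refactor can be illustrated with a small sketch. Everything here is hypothetical: `InitResultSketch`, its `scheduler_infos` field, and the string process handles stand in for the real `SchedulerInitResult` and process objects, whose actual fields are not shown in this summary.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class InitResultSketch:
    """Hypothetical init result: long-lived data that callers keep using."""

    scheduler_infos: List[Dict[str, Any]] = field(default_factory=list)


def launch_scheduler_processes_sketch(
    num_schedulers: int,
) -> Tuple[InitResultSketch, List[str]]:
    # Stand-in handles; the real code would hold process objects here.
    procs = [f"scheduler-proc-{i}" for i in range(num_schedulers)]
    result = InitResultSketch(
        scheduler_infos=[{"rank": i} for i in range(num_schedulers)]
    )
    # The process handles travel next to the result, not inside it, because
    # only the watchdog construction step needs them.
    return result, procs
```

A caller would unpack the pair, hand the handles to the watchdog constructor, and keep passing only the result object onward.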

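Test registration convention · Testing

In the test file's patch, hnyls2002 commented 'Register with CustomTestcase', indicating that the tests must be registered according to project conventions.

Conclusion: the test file was later adjusted to register with CustomTestCase so that it integrates correctly with CI · Resolved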

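Risks and Impact

Technical risks: 1) Concurrency: the watchdog polls from a daemon thread, and an exception raised while reading is_alive() or exitcode could be silently swallowed (commit 741252f9 fixed the exception handling). 2) False positives: a normal exit (exitcode == 0) could wrongly trigger SIGQUIT, but the code handles this edge case explicitly. 3) Compatibility: in multi-node deployments the watchdog is None on non-zero-rank nodes, and the Ray backend monitors only the detokenizer (schedulers are actors), so the logic has to stay consistent across these cases. 4) Performance: the default 1-second polling interval adds a small CPU cost, and the interval is configurable. The tests cover the key paths, but stability in large production deployments still needs to be observed.

Scope of impact: 1) Users: system reliability improves noticeably because the "zombie service" window caused by undetected crashes shrinks, and failure recovery time drops from the default 20-second health check to the order of seconds. 2) System: the change adds a lightweight monitoring thread (about one daemon thread per process) with negligible CPU cost, and it reuses the existing SIGQUIT cleanup infrastructure without touching the core scheduling logic. 3) Team: the new code needs maintenance, but it is contained in watchdog.py and follows the existing WatchdogRaw pattern, which lowers the learning cost. The test cases also serve as a template for similar monitoring features. The overall impact is moderate: the PR mainly improves error handling and does not change functionality.

Daemon-thread concurrency · SIGQUIT false-positive risk · Multi-node compatibility

Related Issue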

#18421 [Bug] Scheduler subprocess crash due to NCCL timeout remains undetected, causing zombie service state

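Full Report

Executive Summary

This PR adds a SubprocessWatchdog class that monitors the liveness of SGLang scheduler subprocesses. It targets the "zombie service" problem, in which a C++-level crash (such as an NCCL timeout) goes undetected by the main process. The change covers the core monitoring logic, integration into process launch, signal-handling adjustments, and unit tests. It is a meaningful bug fix that improves reliability, and its reuse of the existing SIGQUIT infrastructure to avoid added architectural complexity deserves attention.

Motivation

Root cause: Issue #18421 describes in detail how a scheduler subprocess that hits a C++ error such as an NCCL timeout triggers std::terminate(), so Python exception handlers never run. The main process keeps running but cannot process requests, which produces the "zombie service" state. Relying on health checks alone leaves a failure window of up to 20 seconds.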
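The failure mode is easy to reproduce in miniature. The sketch below is an illustration, not code from the PR or the issue: it uses `os.abort()` in a child interpreter as a stand-in for a C++ `std::terminate()`, and `subprocess` instead of `multiprocessing` so that it runs the same way regardless of start method. The helper names `run_child` and `demo` are hypothetical. On POSIX systems a child killed by a signal reports a negative return code, the same convention `multiprocessing.Process.exitcode` follows.

```python
import signal
import subprocess
import sys


def run_child(code: str) -> int:
    """Run a Python snippet in a fresh interpreter and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def demo():
    # The try/except never fires: abort() kills the process with SIGABRT
    # before any Python-level handler can run.
    aborted = run_child(
        "import os\n"
        "try:\n"
        "    os.abort()\n"
        "except BaseException:\n"
        "    raise SystemExit(0)\n"
    )
    clean = run_child("pass")
    return aborted, clean


if __name__ == "__main__":
    aborted, clean = demo()
    print("aborted child:", aborted, "(expected on POSIX:", -signal.SIGABRT, ")")
    print("clean child:", clean)
```

Because nothing inside the child gets a chance to report the failure, the parent has to learn about it by inspecting the exit status, which is what a liveness monitor does.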

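Goal of the solution: the PR body states that a monitor is needed to trigger cleanup promptly when a subprocess exits abnormally. In its words: 'When a scheduler subprocess crashes at the C++ level, std::terminate() runs before Python exception handlers, leaving the main process unaware. The service continues accepting requests but cannot process them.' A daemon thread polls proc.is_alive() and sends SIGQUIT when it sees a non-zero exit code, reusing the existing signal-handling infrastructure for recovery.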

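Implementation Breakdown

The changes break down by module as follows:

| Module | Key file | Core change |
|------|----------|----------|
| Monitoring core | watchdog.py | New `SubprocessWatchdog` class: `__init__` takes the process list, the `_monitor_loop` daemon thread polls once per second, and `_check_processes` detects abnormal exits and calls `os.kill(os.getpid(), signal.SIGQUIT)` |
| Integration layer | engine.py | `_launch_subprocesses` adds `subprocess_watchdog` to its returned tuple, and `_launch_scheduler_processes` returns `(SchedulerInitResult, scheduler_procs)` so the watchdog can be built |
| Signal handling | tokenizer_manager.py | `running_phase_sigquit_handler` adds `if self.tokenizer_manager._subprocess_watchdog is not None: self.tokenizer_manager._subprocess_watchdog.stop()` to prevent false positives on normal shutdown |
| Backend adaptation | http_server.py, ray/engine.py | Function signatures adjusted to pass the watchdog along; in the Ray backend `scheduler_procs` is None and only the detokenizer is monitored |
| Test coverage | test_subprocess_watchdog.py | 6 test cases: healthy processes do not trigger, delayed and immediate crash detection, a single crash among multiple processes, empty process list, and normal exit does not trigger SIGQUIT |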
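A few of these cases can be sketched without spawning real processes. The block below is a hedged illustration, not the contents of test_subprocess_watchdog.py: `FakeProc`, the free function `check_processes`, and the test class name are hypothetical, it uses plain `unittest.TestCase` where the project registers with `CustomTestCase`, and it patches `os.kill` so no signal is actually delivered. It assumes a POSIX platform, where `signal.SIGQUIT` exists.

```python
import os
import signal
import unittest
from unittest import mock


class FakeProc:
    """Hypothetical stand-in for multiprocessing.Process."""

    def __init__(self, alive, exitcode):
        self._alive = alive
        self.exitcode = exitcode

    def is_alive(self):
        return self._alive


def check_processes(processes, names) -> bool:
    """Free-function version of the check logic, for illustration only."""
    for proc, name in zip(processes, names):
        if proc.is_alive() or proc.exitcode == 0:
            continue  # still running, or exited cleanly
        os.kill(os.getpid(), signal.SIGQUIT)
        return True
    return False


class WatchdogCheckSketchTest(unittest.TestCase):
    def test_healthy_process_does_not_trigger(self):
        with mock.patch.object(os, "kill") as kill:
            self.assertFalse(check_processes([FakeProc(True, None)], ["s0"]))
            kill.assert_not_called()

    def test_single_crash_among_many_triggers_sigquit(self):
        procs = [FakeProc(True, None), FakeProc(False, -6)]
        with mock.patch.object(os, "kill") as kill:
            self.assertTrue(check_processes(procs, ["s0", "s1"]))
            kill.assert_called_once_with(os.getpid(), signal.SIGQUIT)

    def test_normal_exit_does_not_trigger(self):
        with mock.patch.object(os, "kill") as kill:
            self.assertFalse(check_processes([FakeProc(False, 0)], ["s0"]))
            kill.assert_not_called()

    def test_empty_process_list(self):
        with mock.patch.object(os, "kill") as kill:
            self.assertFalse(check_processes([], []))
            kill.assert_not_called()


if __name__ == "__main__":
    unittest.main()
```

Patching `os.kill` keeps the test runner itself from receiving SIGQUIT, at the cost of not exercising the real signal path.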

Key logic (from watchdog.py):

def _check_processes(self) -> bool:
    for proc, name in zip(self._processes, self._names):
        if proc.is_alive() or proc.exitcode == 0:  # skip running processes and clean exits
            continue
        logger.error(f"Subprocess {name} crashed with exit code {proc.exitcode}. Triggering SIGQUIT...")
        os.kill(os.getpid(), signal.SIGQUIT)
        return True
    return False

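Review Highlights

The review was led by hnyls2002 and focused on correctness and design simplicity:

Storing the watchdog reference: in the http_server.py diff, hnyls2002 commented:

"Why discard the reference of watch dog? When the server receives a SIGQUIT, I think you also need to stop the watch dog."

The author fixed this in commit 2bbc0a01 by storing the watchdog in tokenizer_manager._subprocess_watchdog so the SIGQUIT handler can reach it.

Data structure design: on whether scheduler_procs belongs in SchedulerInitResult, hnyls2002 said:

"This should not be kept in the init result. It is only used in the watch dog building steps."

The author refactored in commit 0262ae10 to return a tuple instead, which avoids polluting the init result data structure.

Test conventions: in the test file's patch, hnyls2002 suggested how the tests should be registered, and a later commit adjusted them to follow project conventions.

All discussion points were resolved in follow-up commits, reflecting the design trade-offs worked out during review.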

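Risks and Impact

Technical risks:

Concurrency and exception handling: an exception raised in the watchdog's daemon thread while calling is_alive() could be silently swallowed. Commit 741252f9 restored the try/except block that logs it.
False triggers: a normal exit (exitcode == 0) must be skipped explicitly. The code handles this edge case, but signal races in production may still need verification.
Multi-backend compatibility: in the Ray backend scheduler_procs is None and only the detokenizer is monitored, and on non-zero-rank nodes the watchdog is None. The logic must stay consistent across these cases with nothing missed.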
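The backend and rank cases listed above can be written down as a small selection function. This is a sketch under stated assumptions, not the PR's integration code: `collect_monitored`, its parameters, and the `scheduler_{i}` / `detokenizer` labels are hypothetical, and it only encodes the behaviour described in this report (no watchdog on non-zero-rank nodes, detokenizer-only monitoring when `scheduler_procs` is None).

```python
from typing import Any, List, Optional, Sequence, Tuple


def collect_monitored(
    node_rank: int,
    scheduler_procs: Optional[Sequence[Any]],
    detokenizer_proc: Optional[Any],
) -> Optional[List[Tuple[str, Any]]]:
    """Return (name, process) pairs to watch, or None when no watchdog is built."""
    if node_rank != 0:
        # Non-zero-rank nodes run without a watchdog in this sketch.
        return None

    monitored: List[Tuple[str, Any]] = []
    # scheduler_procs is None when schedulers are not local subprocesses,
    # for example when they are Ray actors.
    for i, proc in enumerate(scheduler_procs or []):
        monitored.append((f"scheduler_{i}", proc))
    if detokenizer_proc is not None:
        monitored.append(("detokenizer", detokenizer_proc))
    return monitored
```

Callers then need a single `is not None` check before starting or stopping the watchdog, which matches the guard used in the SIGQUIT handler.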

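Impact assessment:

User value: failure-detection latency drops from the 20-second health check to the order of seconds, which reduces downtime and improves the user experience.
System overhead: the change adds a polling daemon thread (1-second interval by default) with slight CPU cost. The design reuses the existing SIGQUIT handling and introduces no new dependencies.
Team maintenance: the code is contained in watchdog.py and follows the existing WatchdogRaw pattern, which makes later extension easier. The test cases serve as a reference template for similar monitoring features.

Related Context

Direct link: this PR directly fixes Issue #18421, which analyzes in detail the root cause (an NCCL timeout leading to C++ std::terminate()) and the resulting "zombie service" impact.
Code evolution: the history shows 18 commits, including the initial implementation, the false-positive fix (normal-exit handling), test adjustments, the reference-storage fix, and the design refactor, reflecting iterative refinement by several collaborators (Simon-Li, hnyls2002, alphabetc1).
Architectural trend: recent PRs have mostly focused on CI, testing, and performance optimization (for example PR #21411, which fuses GDN kernels). This PR continues the reliability work, emphasizing error detection and recovery as the system matures.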
