#43707 [Logs Refactor] Optimize shutdown logs, easier to follow and consistent

原始 PR 作者 yewentao256 合并时间 2026-06-05 02:36 文件变更 7 提交数 9 评论 1 代码增减 +139 / -23

执行摘要

优化关闭日志，统一格式并增加上下文信息

当前 vLLM v1 的关闭日志分散且难以跟踪，正如 PR body 所述：'currently shutting down log is pretty complicated / hard to follow'。此 PR 旨在通过统一前缀、合理级别和更多上下文来改善可观测性。

值得精读，展示了如何系统性地改进大规模分布式系统的关机可观测性。关注的几点：日志前缀约定、级别权衡、上下文的添加时机。

讨论亮点

review 中 @njhill 指出新日志过于冗余，建议将部分信息放到 DEBUG 级别：'the "after" shutdown log above looks way too verbose? ... think we should be more selective and put some thought into what should be logged where especially for INFO messages'。作者 @yewentao256 回应 'updated'，并在后续提交中将部分 INFO 日志降低为 DEBUG（如 'tearing down local resources'、'local resource teardown complete' 等），同时保留了关键流程的 INFO 日志。最终 @markmc 批准。

实现拆解

步骤：

统一日志前缀和组件名称：在所有关闭路径中添加 [shutdown] 标签，并标注组件名（如 EngineCore、Executor、MPClient、APIServer、Scheduler），便于过滤和关联。
调整日志级别：根据 njhill 的 review 建议，将部分信息性日志（如 'tearing down local resources'）降为 DEBUG 级别，将异常情况（如进程未按时退出）升级为 WARNING，平衡详细程度。
增加上下文信息：在关键点记录进程数、超时时间、关闭模式（abort/drain）和请求数，使运维人员能快速判断关闭进度。
添加开始/完成标记：在每个组件的关闭入口和出口添加一对 DEBUG 日志，确保关闭序列可追踪。
涉及文件：vllm/v1/engine/core.py（EngineCore 关闭）、vllm/v1/executor/multiproc_executor.py（Worker 终止流程）、vllm/v1/utils.py（进程管理工具函数）、vllm/v1/engine/core_client.py（MPClient 后台资源清理）、vllm/entrypoints/launcher.py（API 服务器关闭）、vllm/v1/core/sched/scheduler.py（Scheduler 关闭）、vllm/distributed/parallel_state.py（分布式状态清理）。
无测试、配置或 schema 变动。

文件	模块	状态	重要度
`vllm/v1/engine/core.py`	引擎核心	modified	6.17
`vllm/v1/executor/multiproc_executor.py`	执行器	modified	6.01
`vllm/v1/utils.py`	工具函数	modified	5.9
`vllm/v1/engine/core_client.py`	客户端	modified	5.38
`vllm/entrypoints/launcher.py`	启动器	modified	5.22
`vllm/v1/core/sched/scheduler.py`	调度器	modified	4.31
`vllm/distributed/parallel_state.py`	并行状态	modified	4.22

关键符号

EngineCore.shutdown EngineCore._handle_shutdown EngineCoreProc.signal_handler MultiprocExecutor._ensure_worker_termination MultiprocExecutor.shutdown _shutdown_subprocesses shutdown (utils) BackgroundResources.__call__ MPClient.shutdown serve_http.signal_handler serve_http.handle_shutdown Scheduler.shutdown

关键源码片段

vllm/v1/engine/core.py logging

EngineCore 关闭流程的核心日志增强，包括 shutdown 方法和信号处理器的日志统一与级别调整。

def shutdown(self):
    # 标记清理开始，使用 logger.debug_once 避免重复输出
    logger.debug_once("[shutdown] EngineCore: tearing down local resources")
    self.structured_output_manager.clear_backend()
    if self.model_executor:
        self.model_executor.shutdown()
    if self.scheduler:
        self.scheduler.shutdown()
    gc.unfreeze()
    cleanup_dist_env_and_memory()
    # 标记清理完成，便于追踪引擎核心关闭完整序列
    logger.debug_once("[shutdown] EngineCore: local resource teardown complete")

vllm/v1/executor/multiproc_executor.py logging

Worker 进程终止流程的日志改进，增加活动进程数记录、级别区分（INFO 正常、WARNING 异常）以及明确的状态提示。

@staticmethod
def _ensure_worker_termination(worker_procs: list[BaseProcess]):
    active_procs = lambda: [proc for proc in worker_procs if proc.is_alive()]
    initial_count = len(active_procs())
    # 记录等待退出的进程数，便于运维判断关闭进度
    logger.info("[shutdown] Executor: waiting for worker exit count=%d", initial_count)
    if wait_for_termination(active_procs(), 4):
        logger.info_once("[shutdown] Executor: all workers exited gracefully")
        return
    # 正常超时后发 SIGTERM 并警告
    remaining = active_procs()
    logger.warning(
        "[shutdown] Executor: workers still running after grace period; "
        "sending SIGTERM count=%d", len(remaining))
    for p in remaining:
        p.terminate()
    if not wait_for_termination(active_procs(), 4):
        # 最终发 SIGKILL 并警告
        remaining = active_procs()
        logger.warning(
            "[shutdown] Executor: workers still running after SIGTERM; "
            "sending SIGKILL count=%d", len(remaining))
        for p in remaining:
            p.kill()

评论区精华

日志冗余与级别调整 设计

@njhill 指出新日志太冗余，建议更选择性记录，特别是 INFO 级别应更精简。

结论：作者采纳建议，将部分 INFO 日志降为 DEBUG 级别。 · 已解决

风险与影响

本次变更为纯日志优化，无核心逻辑改动。主要风险是日志过多可能影响性能（但 DEBUG 日志仅在设置时输出，影响可控）或级别不合理导致调试信息丢失（已通过 review 平衡）。无安全或兼容性风险。

对用户：关闭时日志更加清晰，方便排查关闭缓慢或异常问题。对系统：无功能变化，性能影响可忽略。对团队：统一的日志格式降低了维护成本，提高了问题定位效率。

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：优化关闭日志，统一格式并增加上下文信息
推荐动作：值得精读，展示了如何系统性地改进大规模分布式系统的关机可观测性。关注的几点：日志前缀约定、级别权衡、上下文的添加时机。

功能与动机

实现拆解

步骤：

统一日志前缀和组件名称：在所有关闭路径中添加 [shutdown] 标签，并标注组件名（如 EngineCore、Executor、MPClient、APIServer、Scheduler），便于过滤和关联。
调整日志级别：根据 njhill 的 review 建议，将部分信息性日志（如 'tearing down local resources'）降为 DEBUG 级别，将异常情况（如进程未按时退出）升级为 WARNING，平衡详细程度。
增加上下文信息：在关键点记录进程数、超时时间、关闭模式（abort/drain）和请求数，使运维人员能快速判断关闭进度。
添加开始/完成标记：在每个组件的关闭入口和出口添加一对 DEBUG 日志，确保关闭序列可追踪。
涉及文件：vllm/v1/engine/core.py（EngineCore 关闭）、vllm/v1/executor/multiproc_executor.py（Worker 终止流程）、vllm/v1/utils.py（进程管理工具函数）、vllm/v1/engine/core_client.py（MPClient 后台资源清理）、vllm/entrypoints/launcher.py（API 服务器关闭）、vllm/v1/core/sched/scheduler.py（Scheduler 关闭）、vllm/distributed/parallel_state.py（分布式状态清理）。
无测试、配置或 schema 变动。

关键文件：

vllm/v1/engine/core.py（模块引擎核心；类别 source；类型 logging；符号 shutdown, _handle_shutdown, signal_handler）: EngineCore 关闭流程的核心日志增强，包括 shutdown 方法和信号处理器的日志统一与级别调整。
vllm/v1/executor/multiproc_executor.py（模块执行器；类别 source；类型 logging；符号 _ensure_worker_termination, shutdown, WorkerProc.run）: Worker 进程终止流程的日志改进，增加活动进程数记录、级别区分（INFO 正常、WARNING 异常）以及明确的状态提示。
vllm/v1/utils.py（模块工具函数；类别 source；类型 logging；符号 _shutdown_subprocesses, shutdown）: 顶层 shutdown 和 _shutdown_subprocesses 工具函数增加了开始、强制终止警告和完成日志，为所有进程管理提供统一可追踪性。
vllm/v1/engine/core_client.py（模块客户端；类别 source；类型 logging；符号 BackgroundResources.call, MPClient.shutdown）: MPClient 后台资源清理过程增加了开始和完成日志，关闭方法增加了超时描述和阶段标记，使客户端关闭过程可追踪。
vllm/entrypoints/launcher.py（模块启动器；类别 source；类型 logging；符号 serve_http.signal_handler, serve_http.handle_shutdown）: API 服务器关闭流程的信号处理器和关闭任务增加了去重和阶段日志，使关闭序列更清晰。
vllm/v1/core/sched/scheduler.py（模块调度器；类别 source；类型 logging；符号 shutdown）: Scheduler 关闭方法增加了开始和完成 DEBUG 日志，同步其他组件的统一风格。
vllm/distributed/parallel_state.py（模块并行状态；类别 source；类型 logging）: 分布式状态清理路径添加了开始和完成日志，保持全局一致性。

关键符号：EngineCore.shutdown, EngineCore._handle_shutdown, EngineCoreProc.signal_handler, MultiprocExecutor._ensure_worker_termination, MultiprocExecutor.shutdown, _shutdown_subprocesses, shutdown (utils), BackgroundResources.call, MPClient.shutdown, serve_http.signal_handler, serve_http.handle_shutdown, Scheduler.shutdown

关键源码片段

`vllm/v1/engine/core.py`

EngineCore 关闭流程的核心日志增强，包括 shutdown 方法和信号处理器的日志统一与级别调整。

def shutdown(self):
    # 标记清理开始，使用 logger.debug_once 避免重复输出
    logger.debug_once("[shutdown] EngineCore: tearing down local resources")
    self.structured_output_manager.clear_backend()
    if self.model_executor:
        self.model_executor.shutdown()
    if self.scheduler:
        self.scheduler.shutdown()
    gc.unfreeze()
    cleanup_dist_env_and_memory()
    # 标记清理完成，便于追踪引擎核心关闭完整序列
    logger.debug_once("[shutdown] EngineCore: local resource teardown complete")

`vllm/v1/executor/multiproc_executor.py`

Worker 进程终止流程的日志改进，增加活动进程数记录、级别区分（INFO 正常、WARNING 异常）以及明确的状态提示。

@staticmethod
def _ensure_worker_termination(worker_procs: list[BaseProcess]):
    active_procs = lambda: [proc for proc in worker_procs if proc.is_alive()]
    initial_count = len(active_procs())
    # 记录等待退出的进程数，便于运维判断关闭进度
    logger.info("[shutdown] Executor: waiting for worker exit count=%d", initial_count)
    if wait_for_termination(active_procs(), 4):
        logger.info_once("[shutdown] Executor: all workers exited gracefully")
        return
    # 正常超时后发 SIGTERM 并警告
    remaining = active_procs()
    logger.warning(
        "[shutdown] Executor: workers still running after grace period; "
        "sending SIGTERM count=%d", len(remaining))
    for p in remaining:
        p.terminate()
    if not wait_for_termination(active_procs(), 4):
        # 最终发 SIGKILL 并警告
        remaining = active_procs()
        logger.warning(
            "[shutdown] Executor: workers still running after SIGTERM; "
            "sending SIGKILL count=%d", len(remaining))
        for p in remaining:
            p.kill()

评论区精华

日志冗余与级别调整 (design): 作者采纳建议，将部分 INFO 日志降为 DEBUG 级别。

风险与影响

风险：本次变更为纯日志优化，无核心逻辑改动。主要风险是日志过多可能影响性能（但 DEBUG 日志仅在设置时输出，影响可控）或级别不合理导致调试信息丢失（已通过 review 平衡）。无安全或兼容性风险。
影响：对用户：关闭时日志更加清晰，方便排查关闭缓慢或异常问题。对系统：无功能变化，性能影响可忽略。对团队：统一的日志格式降低了维护成本，提高了问题定位效率。
风险标记：暂无

关联脉络

暂无明显关联 PR

#43707 [Logs Refactor] Optimize shutdown logs, easier to follow and consistent

执行摘要

优化关闭日志，统一格式并增加上下文信息

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论