#24861 [Utils] Refactor device cache emptying

Original PR · Author: hebiao064 · Merged: 2026-05-10 12:28 · Files changed: 4 · Commits: 5 · Comments: 3 · Lines: +40 / -29

Executive summary

Refactors device cache emptying into a shared, device-agnostic helper function.

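The PR body notes that SGLang has several code paths that empty the PyTorch device allocator cache, alongside flush_cache, which clears internal memory pools such as the KV cache and the Mamba cache. Some scheduler paths still hard-coded torch.cuda.empty_cache(), which made cache emptying CUDA-specific even though SGLang supports multiple backends, including XPU, NPU, and MUSA. This PR keeps the external API unchanged and makes the internal responsibilities clearer: flush_cache clears SGLang memory pools, while empty_device_cache only releases unused cached blocks held by the device allocator.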
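
To make the two responsibilities concrete, here is a minimal toy model. ToyAllocator and ToyKVPool are hypothetical stand-ins invented for illustration, not SGLang or PyTorch classes; the sketch only assumes that a caching allocator keeps freed blocks for reuse and that a memory pool lives inside a long-lived buffer.

```python
class ToyAllocator:
    """Hypothetical caching allocator: freed blocks are kept for reuse."""

    def __init__(self) -> None:
        self.live = set()    # blocks backing live tensors
        self.cached = set()  # freed blocks still held by the allocator

    def alloc(self, block: str) -> None:
        self.cached.discard(block)
        self.live.add(block)

    def free(self, block: str) -> None:
        self.live.remove(block)
        self.cached.add(block)

    def empty_cache(self) -> None:
        # Releases only unused cached blocks; live blocks are untouched.
        self.cached.clear()


class ToyKVPool:
    """Hypothetical memory pool that lives inside one long-lived buffer."""

    def __init__(self, allocator: ToyAllocator) -> None:
        allocator.alloc("kv_buffer")
        self.entries = {}

    def flush(self) -> None:
        # Drops pool entries; the backing buffer stays allocated.
        self.entries.clear()


allocator = ToyAllocator()
pool = ToyKVPool(allocator)
pool.entries["req-1"] = [0, 1, 2]

allocator.alloc("scratch")
allocator.free("scratch")  # now cached, not yet returned to the device

pool.flush()  # plays the role of flush_cache
state_after_flush = (dict(pool.entries), set(allocator.live), set(allocator.cached))

allocator.empty_cache()  # plays the role of empty_device_cache
state_after_empty = (set(allocator.live), set(allocator.cached))
```

In this model, flushing the pool leaves the allocator's cached blocks alone, and emptying the allocator cache leaves the pool's backing buffer alone, which is the separation the PR body describes.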

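Worth a close read. The empty_device_cache implementation in particular shows how to write device-agnostic code with torch.get_device_module(), and the extraction of flush_cache_after_weight_update is a pattern worth reusing wherever similar duplication appears.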

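Discussion highlights

This PR received no review discussion and was approved directly. The commit history shows the author iterating: the helper was introduced first, the docstring was then corrected, and the related test file was finally removed (the fifth commit deleted the cache helper tests), which suggests the author decided along the way not to keep standalone tests for the internal helper.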

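Implementation breakdown

In python/sglang/srt/utils/common.py, add the function empty_device_cache(device_module=None). When no module is passed, it resolves the current device module through torch.get_device_module() and calls that module's empty_cache method if one exists.
In get_available_gpu_memory, replace the torch.*.empty_cache() calls in the CUDA, XPU, NPU, and MUSA branches with empty_device_cache(<the corresponding device module>).
In python/sglang/srt/managers/scheduler_update_weights_mixin.py, extract the method flush_cache_after_weight_update, which centralizes the flush logic (check the flush_cache flag, then call flush_cache) that was repeated in the four methods update_weights_from_disk, update_weights_from_distributed, update_weights_from_tensor, and update_weights_from_ipc.
In python/sglang/srt/managers/scheduler.py, replace torch.cuda.empty_cache() in flush_cache and maybe_sleep with empty_device_cache(self.device_module), and update the docstring to state that flush_cache clears SGLang memory pools, with emptying the device allocator cache as an optional extra step.
Update the comment on the torch_empty_cache field in io_struct.py so that it matches the new semantics.
No new tests were added, on the grounds that behavior is unchanged (the PR author verified syntax with py_compile, and CI passed).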
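
The second step above (one shared call replacing a per-backend torch.*.empty_cache() call) can be sketched as follows. The real get_available_gpu_memory is not shown in this summary, so available_memory_gib, make_stub_backend, and the stubbed mem_get_info are hypothetical simplifications, and the helper is restated without torch so the sketch runs anywhere.

```python
from types import SimpleNamespace


def empty_device_cache(device_module) -> bool:
    """Torch-free restatement of the helper; the module is always passed in."""
    empty_cache = getattr(device_module, "empty_cache", None)
    if empty_cache is None:
        return False
    empty_cache()
    return True


def make_stub_backend(free_bytes: int, total_bytes: int) -> SimpleNamespace:
    """Hypothetical stand-in for a device module such as torch.cuda or torch.xpu."""
    stub = SimpleNamespace(empty_cache_calls=0)

    def empty_cache() -> None:
        stub.empty_cache_calls += 1

    stub.empty_cache = empty_cache
    stub.mem_get_info = lambda: (free_bytes, total_bytes)
    return stub


def available_memory_gib(device: str, backends: dict) -> float:
    """Simplified memory query: every backend shares one cache-emptying call."""
    module = backends[device]
    empty_device_cache(module)  # was: torch.<backend>.empty_cache() in each branch
    free_bytes, _total = module.mem_get_info()
    return free_bytes / (1 << 30)


backends = {
    "cuda": make_stub_backend(8 << 30, 16 << 30),
    "xpu": make_stub_backend(4 << 30, 8 << 30),
}
```

The point of the sketch is only the call-site shape: the branch picks the device module, and the cache-emptying call no longer names a backend.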

File	Module	Status	Importance
`python/sglang/srt/utils/common.py`	Utilities layer	modified	6.92
`python/sglang/srt/managers/scheduler_update_weights_mixin.py`	Weight update	modified	7.2
`python/sglang/srt/managers/scheduler.py`	Scheduler	modified	5.28

Key symbols

empty_device_cache, flush_cache_after_weight_update, flush_cache

Key source snippets

python/sglang/srt/utils/common.py core-logic

Adds the `empty_device_cache` function, the core abstraction of this refactor, which unifies device-specific `empty_cache` calls behind a device-agnostic interface.

# Imports this excerpt relies on.
from typing import Any, Optional

import torch


def empty_device_cache(device_module: Optional[Any] = None) -> bool:
    # Release unused cached blocks from the active device allocator.
    # This does not clear SGLang KV/radix/request caches and does not free live
    # tensors. It only forwards to the backend allocator's empty_cache hook when
    # one is available.

    if device_module is None:
        device_module = torch.get_device_module()

    empty_cache = getattr(device_module, 'empty_cache', None)
    if empty_cache is None:
        return False

    empty_cache()
    return True
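
A short usage sketch of the helper's return contract follows. It assumes nothing beyond the excerpt except the resolve_default parameter, which is a hypothetical seam added here so the default path can be exercised without torch installed; the stub modules are likewise invented for illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def empty_device_cache(
    device_module: Optional[Any] = None,
    resolve_default: Optional[Callable[[], Any]] = None,  # hypothetical test seam
) -> bool:
    if device_module is None:
        if resolve_default is None:
            import torch  # deferred: only needed for the real default path

            resolve_default = torch.get_device_module
        device_module = resolve_default()

    empty_cache = getattr(device_module, "empty_cache", None)
    if empty_cache is None:
        return False

    empty_cache()
    return True


calls = []
with_hook = SimpleNamespace(empty_cache=lambda: calls.append("emptied"))
without_hook = SimpleNamespace()  # a device module that lacks empty_cache

released = empty_device_cache(with_hook)      # explicit module, hook present
skipped = empty_device_cache(without_hook)    # no hook: skipped silently
via_default = empty_device_cache(resolve_default=lambda: with_hook)
```

Because the helper returns a bool instead of raising, a caller that cares whether anything was actually released has to check the return value; ignoring it gives the silent-skip behavior.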

python/sglang/srt/managers/scheduler_update_weights_mixin.py core-logic

Extracts the `flush_cache_after_weight_update` method, removing the duplicated flush logic from the four update paths and improving maintainability.

class SchedulerUpdateWeightsMixin:
    def flush_cache_after_weight_update(self: Scheduler, recv_req) -> None:
        if recv_req.flush_cache:
            flush_cache_success = self.flush_cache(
                empty_cache=recv_req.torch_empty_cache
            )
            assert flush_cache_success, 'Cache flush failed after updating weights'

    def update_weights_from_disk(
        self: Scheduler, recv_req: UpdateWeightFromDiskReqInput
    ):
        # ...
        if tp_success:
            self.flush_cache_after_weight_update(recv_req)
        # ...
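
The extraction can be exercised in isolation with a fake scheduler. FakeScheduler and FakeUpdateReq are hypothetical test doubles, and guarding update_weights_from_tensor with tp_success is an assumption carried over from the update_weights_from_disk excerpt, not something this summary confirms.

```python
from dataclasses import dataclass


@dataclass
class FakeUpdateReq:
    """Hypothetical request carrying only the two fields the helper reads."""

    flush_cache: bool = True
    torch_empty_cache: bool = True


class FakeScheduler:
    """Hypothetical scheduler that records flush_cache calls."""

    def __init__(self, flush_ok: bool = True) -> None:
        self.flush_ok = flush_ok
        self.flush_calls = []

    def flush_cache(self, empty_cache: bool = True) -> bool:
        self.flush_calls.append(empty_cache)
        return self.flush_ok

    def flush_cache_after_weight_update(self, recv_req) -> None:
        if recv_req.flush_cache:
            ok = self.flush_cache(empty_cache=recv_req.torch_empty_cache)
            assert ok, "Cache flush failed after updating weights"

    # Two of the update paths, reduced to the shared tail call.
    def update_weights_from_disk(self, recv_req, tp_success: bool = True) -> None:
        if tp_success:
            self.flush_cache_after_weight_update(recv_req)

    def update_weights_from_tensor(self, recv_req, tp_success: bool = True) -> None:
        if tp_success:
            self.flush_cache_after_weight_update(recv_req)
```

One design point worth noting: the failure check is an assert, and Python removes assert statements when run with the -O flag, so the check only holds in non-optimized runs.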

python/sglang/srt/managers/scheduler.py core-logic

Replaces `torch.cuda.empty_cache()` in `flush_cache` and `maybe_sleep` with `empty_device_cache`, and updates the docstring.

def flush_cache(self, empty_cache: bool = True):
    # Flush memory pools (e.g., KV cache, Mamba cache) and optionally empty device allocator cache.
    if self.is_fully_idle():
        # ...
        if empty_cache:
            empty_device_cache(self.device_module)
        # ...
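
The gating in this excerpt (idle check first, pool flush next, allocator cache last and only on request) can be sketched with a toy scheduler. ToyScheduler, its kv_pool dict, and the choice to return False when the scheduler is busy are assumptions made for illustration; the elided parts of the real method are not reconstructed here.

```python
from types import SimpleNamespace


def empty_device_cache(device_module) -> bool:
    """Torch-free restatement of the helper; the module is always passed in."""
    empty_cache = getattr(device_module, "empty_cache", None)
    if empty_cache is None:
        return False
    empty_cache()
    return True


def make_counting_module() -> SimpleNamespace:
    """Hypothetical device module that counts empty_cache calls."""
    module = SimpleNamespace(calls=0)

    def empty_cache() -> None:
        module.calls += 1

    module.empty_cache = empty_cache
    return module


class ToyScheduler:
    def __init__(self, device_module, idle: bool = True) -> None:
        self.device_module = device_module
        self.idle = idle
        self.kv_pool = {"req-1": "blocks"}  # stand-in for SGLang memory pools

    def is_fully_idle(self) -> bool:
        return self.idle

    def flush_cache(self, empty_cache: bool = True) -> bool:
        if not self.is_fully_idle():
            return False  # assumption: a busy scheduler reports failure
        self.kv_pool.clear()  # always: clear the memory pools
        if empty_cache:
            empty_device_cache(self.device_module)  # optional: allocator cache
        return True
```

With empty_cache=False the pools are still cleared but the device module is never touched, which matches the role of the torch_empty_cache field passed through by the weight-update helper.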

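Comment highlights

No review discussion (other)

The PR was approved directly with no review comments. The author refined it over 5 commits and removed the test file in the end.

Conclusion: uncontested; merged directly. · Resolved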

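Risks and impact

The main risk is an incomplete or inexact replacement. maybe_sleep previously called torch.cuda.empty_cache(); it now goes through empty_device_cache, and whenever no module is passed the helper resolves the device module dynamically, so cache emptying may behave differently on non-CUDA backends (for example, a device module that exposes no empty_cache is skipped silently and the helper returns False). That said, the earlier hard-coded CUDA call never targeted the active backend's allocator in a non-CUDA environment, so the new path is, if anything, safer. A second risk: flush_cache depends on self.device_module, which must be set correctly when the scheduler is initialized. Judging from the code, Scheduler presumably sets device_module during initialization (for example, inferred from ServerArgs). Overall risk is low.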
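
Both risks above (a silently skipped hook and an unset device module) could be surfaced early with an initialization-time check. The sketch below is hypothetical and not part of the PR: check_device_module, supports_empty_cache, and the logger name are invented for illustration.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("device_cache_check")


def supports_empty_cache(device_module) -> bool:
    """True when the module exposes a callable empty_cache hook."""
    return callable(getattr(device_module, "empty_cache", None))


def check_device_module(device_module, name: str) -> bool:
    """Hypothetical init-time check; reports instead of failing silently later."""
    if device_module is None:
        raise ValueError("device_module must be set before flush_cache is used")
    ok = supports_empty_cache(device_module)
    if not ok:
        logger.warning(
            "%s has no empty_cache hook; allocator cache emptying will be skipped",
            name,
        )
    return ok


stub_with_hook = SimpleNamespace(empty_cache=lambda: None)
stub_without_hook = SimpleNamespace()
```

Calling check_device_module(stub_without_hook, "bare") logs one warning and returns False, so the skip is visible once at startup rather than on every flush.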

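Users see no difference and the API is unchanged. At the system level, backend compatibility improves and the hard-coded CUDA dependency is removed: a future backend only needs its device module to implement empty_cache. Maintenance cost for the team goes down. Scope: 4 source files, 40 lines added, 29 lines deleted; the change is focused and its semantics are clear.

Risk tags: core path change, cross-backend compatibility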

Related issues

No related issue identified. No explicitly linked issue has been detected so far.

Related context

No clearly related PRs at this time.
