#40812 Auto-disable expandable_segments around cumem memory pool

Original PR · Author: youkaichao · Merged: 2026-04-27 09:37 · Files changed: 1 · Commits: 4 · Comments: 5 · Lines: +40 / -32

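Executive summary

Automatically disable expandable_segments for compatibility with the cumem memory pool

PyTorch issue 147851 reports that expandable segments are incompatible with CUDAPluggableAllocator + MemPool, which prevents vLLM's sleep mode from being used together with expandable segments. Previously an assert rejected the combination outright, so users had to pick one or the other. This PR instead disables expandable segments automatically and temporarily, so no manual switching is needed.

This PR is worth a close read, especially the try/finally restructuring of use_memory_pool and its exception-safety handling. Two design points stand out: using an environment variable plus a private API to implement a temporary toggle while the framework API is incomplete, and using nested context managers to keep global state consistent.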

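Discussion highlights

Robustness of the detection: gemini-code-assist[bot] pointed out that matching the environment variable string directly ("expandable_segments:True" in conf) is fragile. It misses whitespace and case variants, as well as settings applied at runtime through _set_allocator_settings. The bot suggested using torch.cuda.memory.get_allocator_settings() instead. The author, youkaichao, tested this and found that the API does not exist in PyTorch 2.11, so the PR fell back to parsing the environment variable.
Where self.current_tag is restored: gemini-code-assist[bot] suggested moving self.current_tag = old_tag into the finally block for exception safety. The author accepted this, and it is implemented in the final commit.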
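The fragility of the substring check can be illustrated with a slightly more tolerant parser. This is a minimal sketch, not the PR's code: the function name `expandable_segments_enabled` is hypothetical, and both the set of truthy values (`true`, `1`, case-insensitive) and the "last entry wins" rule are assumptions rather than confirmed PyTorch behavior. It also cannot see settings applied at runtime through `_set_allocator_settings`.

```python
def expandable_segments_enabled(conf: str) -> bool:
    """Best-effort check of a PYTORCH_CUDA_ALLOC_CONF-style string.

    Assumptions (not confirmed against PyTorch): entries are comma-separated
    "key:value" pairs, truthy values are "true"/"1" in any case, and the last
    expandable_segments entry wins.
    """
    enabled = False
    for entry in conf.split(","):
        key, sep, value = entry.partition(":")
        if sep and key.strip() == "expandable_segments":
            enabled = value.strip().lower() in {"true", "1"}
    return enabled


# The naive substring test misses this variant; the tolerant parser does not.
conf = "max_split_size_mb:128, expandable_segments: true"
naive_result = "expandable_segments:True" in conf
tolerant_result = expandable_segments_enabled(conf)
```

Tolerating whitespace and case only narrows the gap the reviewer described; a query API on the allocator itself would still be the more reliable source of truth.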

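Implementation breakdown

Remove the hard assertion in __init__: the assert on expandable_segments:True in CuMemAllocator.__init__ is deleted, so the object can now be constructed normally.
Detect and temporarily disable expandable_segments on entry to use_memory_pool: when the context manager is entered, it parses the PYTORCH_CUDA_ALLOC_CONF environment variable and, if it contains expandable_segments:True, calls torch.cuda.memory._set_allocator_settings("expandable_segments:False") to turn the feature off temporarily.
Restore expandable_segments and current_tag in the finally block: the nested try/finally guarantees that self.current_tag and the expandable segments setting are restored even if the body of the context raises (the setting is restored to True only if it was enabled before).
Exception safety and state consistency: during review it was suggested to move the self.current_tag restoration from the end of the try block into the finally block, so that an exception raised during the yield phase cannot leak state.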
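The save, disable, restore sequence described above can be sketched without PyTorch. This is an illustration under stated assumptions, not the PR's implementation: `FakeAllocatorSettings` and `expandable_segments_disabled` are hypothetical names that stand in for the process-wide allocator setting and for the toggle logic inside `use_memory_pool`.

```python
from contextlib import contextmanager


class FakeAllocatorSettings:
    """Hypothetical stand-in for a process-wide allocator flag (not PyTorch)."""

    def __init__(self, expandable: bool):
        self.expandable = expandable


@contextmanager
def expandable_segments_disabled(settings: FakeAllocatorSettings):
    # Remember the prior state so that only a previously enabled flag is restored.
    was_enabled = settings.expandable
    if was_enabled:
        settings.expandable = False
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises.
        if was_enabled:
            settings.expandable = True


settings = FakeAllocatorSettings(expandable=True)
with expandable_segments_disabled(settings):
    inside = settings.expandable  # flag is off while the pool context is active
after = settings.expandable  # flag is back on afterwards
```

Recording `was_enabled` before touching the flag is what keeps the toggle from switching the feature on for users who never enabled it.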

File	Module	Status	Importance
`vllm/device_allocator/cumem.py`	Allocator	modified	6.84

Key symbols

CuMemAllocator.__init__ CuMemAllocator.use_memory_pool

Key source snippet

vllm/device_allocator/cumem.py core-logic

The core changed file: it removes the hard assertion and implements the temporary disabling and restoring of expandable_segments.

# vllm/device_allocator/cumem.py

class CuMemAllocator:
    # ... other code omitted ...

    def __init__(self):
        # The former assert on expandable_segments:True has been removed
        self.pointer_to_data: dict[int, AllocationData] = {}
        self.current_tag: str = CuMemAllocator.default_tag
        self.allocator_and_pools: dict[str, Any] = {}
        self.python_malloc_callback = self._python_malloc_callback
        self.python_free_callback = self._python_free_callback

    @contextmanager
    def use_memory_pool(self, tag: str | None = None):
        if tag is None:
            tag = CuMemAllocator.default_tag
        assert isinstance(tag, str)

        # Before entering the memory pool context, check for expandable_segments and disable it temporarily
        conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
        expandable_was_enabled = "expandable_segments:True" in conf
        if expandable_was_enabled:
            # Call the PyTorch private API to turn it off temporarily
            torch.cuda.memory._set_allocator_settings("expandable_segments:False")

        old_tag = self.current_tag
        self.current_tag = tag
        try:
            with use_memory_pool_with_allocator(
                self.python_malloc_callback,
                self.python_free_callback
            ) as data:
                # Keep a reference to data to avoid a garbage-collection issue in PyTorch 2.6
                self.allocator_and_pools[tag] = data
                yield
                # Clean up unused allocations ...
                allocations = data[0].snapshot()
                for allocation in allocations:
                    if allocation["allocated_size"] == 0:
                        handle = self._python_free_callback(allocation["address"])
                        unmap_and_release(handle)
        finally:
            # Always restore current_tag and the expandable_segments setting
            self.current_tag = old_tag
            if expandable_was_enabled:
                torch.cuda.memory._set_allocator_settings("expandable_segments:True")

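Review comment highlights

Robustness of expandable_segments detection (correctness)

gemini-code-assist[bot] pointed out that parsing the environment variable string directly is fragile and suggested using `torch.cuda.memory.get_allocator_settings()`, but the author reported that this API does not exist in PyTorch 2.11.

Conclusion: environment-variable parsing is retained, with a known risk of missed detection. · Resolved

self.current_tag restoration should move into the finally block (correctness)

gemini-code-assist[bot] suggested moving `self.current_tag = old_tag` into the `finally` block so that state is restored even when an exception is raised.

Conclusion: accepted by the author. In the final commit, both the `self.current_tag` restoration and the expandable_segments restoration are in the `finally` block. · Resolved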
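The difference the reviewer was pointing at can be shown with two small context managers. This is a hypothetical reconstruction for illustration only: `TagHolder`, `use_tag_unsafe`, `use_tag_safe`, and `tag_after_failure` are invented names, and the "unsafe" variant is an assumed shape of the pre-review code, not a quote from it.

```python
from contextlib import contextmanager


class TagHolder:
    """Hypothetical stand-in for an object holding a current_tag attribute."""

    def __init__(self):
        self.current_tag = "default"

    @contextmanager
    def use_tag_unsafe(self, tag: str):
        old_tag = self.current_tag
        self.current_tag = tag
        yield
        # Skipped when the with-body raises: the exception is re-raised at yield.
        self.current_tag = old_tag

    @contextmanager
    def use_tag_safe(self, tag: str):
        old_tag = self.current_tag
        self.current_tag = tag
        try:
            yield
        finally:
            # Runs whether or not the with-body raises.
            self.current_tag = old_tag


def tag_after_failure(method_name: str) -> str:
    """Raise inside the chosen context manager and report the tag left behind."""
    holder = TagHolder()
    try:
        with getattr(holder, method_name)("weights"):
            raise RuntimeError("simulated failure inside the pool context")
    except RuntimeError:
        pass
    return holder.current_tag


leaked = tag_after_failure("use_tag_unsafe")   # "weights": the old tag was never restored
restored = tag_after_failure("use_tag_safe")   # "default": finally restored it
```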

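Risks and impact

Incomplete environment variable parsing: the code only checks "expandable_segments:True" in conf, whereas PyTorch actually accepts true/True/1 and whitespace variants, so false negatives or false positives are possible.
_set_allocator_settings compatibility: this API is a private method and may change or be removed in future PyTorch versions.
Missing unit test coverage: the PR body mentions that manual testing is required, but no automated test is included, so there is a regression risk.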
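One way to soften the private-API risk is to look the setter up defensively and report whether the toggle was applied. The sketch below is an assumption-laden illustration, not part of the PR: `set_expandable_segments` is a hypothetical helper, and it takes the module as a parameter so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def set_expandable_segments(memory_module, enabled: bool) -> bool:
    """Toggle expandable_segments if the private setter exists.

    memory_module is expected to be torch.cuda.memory in real use; it is a
    parameter here so the sketch can be exercised with a stand-in object.
    Returns False when the setter is missing instead of raising AttributeError.
    """
    setter = getattr(memory_module, "_set_allocator_settings", None)
    if setter is None:
        return False
    setter(f"expandable_segments:{enabled}")
    return True


# Stand-in module that records the settings strings it receives.
recorded = []
fake_memory = SimpleNamespace(_set_allocator_settings=recorded.append)
applied = set_expandable_segments(fake_memory, False)

# A module without the private setter is reported, not crashed on.
missing = set_expandable_segments(SimpleNamespace(), True)
```

In real use the call would be `set_expandable_segments(torch.cuda.memory, False)`. The same stand-in object could also serve as the basis for an automated test of the disable and restore calls, which the PR currently lacks.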

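Scope: all vLLM deployments that use the cumem allocator's sleep/wake mode.
User impact: users can now enable expandable_segments:True and sleep mode at the same time without having to choose between them, which reduces the risk of OOM.
Performance: expandable segments are disabled only while the memory pool context is active, so the effect on overall performance is minimal.
Compatibility: backward compatible; no public API was changed.

Risk flags: missing test coverage, dependence on a PyTorch private API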

Related issue

#147851 expandable_segments does not work for CUDAPluggableAllocator + MemPool

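Related context

PR #11743 [ cumem allocator ] Sleep mode for preemption-free scheduler: introduced the cumem allocator and sleep mode, and is the upstream dependency of this PR.
PR #41268 [UX][Bugfix] Fix OOM by setting PyTorch max_split_size_mb during model loading: another recent PR that deals with PyTorch CUDA memory allocator configuration and OOM, part of the same line of optimization work.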
