#41714 [MM][CG] Profile encoder CUDA graph pool memory

原始 PR 作者 BWAAEEEK 合并时间 2026-06-02 12:27 文件变更 4 提交数 2 评论 35 代码增减 +173 / -92

执行摘要

Profile vision encoder CUDA graph pool memory

最初 PR 旨在为 Step3-VL/StepVL 模型添加 encoder CUDA graph 支持，但经过评审后缩小为仅提供 graph pool memory profiling 基础设施。核心动机是：在 CUDA graph 内存 profiling 阶段能准确估算 encoder 部分的 graph 内存消耗，避免主 pool 碎片化，并为未来多模态模型启用 encoder cudagraph 做准备。讨论中提及基于已合并的 #42224 进行精简，保留 graph pool 控制能力。

此 PR 值得精读，尤其关注 profile_cudagraph_memory 中如何集成 encoder 部分以及 graph pool 的生命周期设计。它展示了在已有的 CUDA graph 框架中扩展新模块的典型模式：通过临时 manager 进行 profile，通过持久 manager 进行 runtime，并利用 graph pool 隔离。对多模态模型开发者和 CUDA graph 维护者有重要参考价值。

讨论亮点

相关讨论主要围绕设计简化和正确性展开：

ZeroDivisionError 风险（gemini-code-assist 提出）：当 min_budget=0 时访问除零，作者添加了正数检查。
代码放置建议（shen-shanshan）：建议将 budget 验证移到更精确的分支中，减少不必要的用户覆盖。
不必要的 try 回退（Isotr0py）：capture 中的 try 回退到 eager 模式是多余的，因为 encoder cudagraph 默认未启用；作者移除了该逻辑。
DP 支持（Isotr0py）：质疑 Step3-VL 为什么禁止 DP 下的 capture；最终 PR 移除了模型特定代码。
patch_pixel_values 耦合（Isotr0py）：认为不应作为 buffer，作者后续在更大 revision 中处理。
stream 拆分（Isotr0py）：询问是否需要独立 stream，作者解释已在外层 graph_capture 中处理。
mock 冗余（Isotr0py）：测试中的 postprocess_encoder_output 覆盖与默认实现重复，作者移除了相关 mock。

实现拆解

EncoderCudaGraphManager 核心改造（vllm/v1/worker/encoder_cudagraph.py）：
- 新增 self.graph_pool 属性记录当前 pool。
- capture() 方法改为接受 graph_pool 参数，将 pool 传递给 torch.cuda.graph；按照 budget 从大到小捕获以优化内存复用。
- 新增 clear() 方法，释放 budget_graphs 和 graph_pool。
- 新增 get_num_graphs_to_capture() 方法返回 len(self.token_budgets)，供外部估算 graph 数量。
GPUModelRunner 集成（vllm/v1/worker/gpu_model_runner.py）：
- 新增 _create_encoder_cudagraph_manager()：检查配置和模型是否支持 encoder cudagraph，若支持则创建临时 EncoderCudaGraphManager。
- 新增 _maybe_init_encoder_cudagraph_manager()：延迟初始化持久化 manager。
- 修改 profile_cudagraph_memory()：创建临时 encoder manager，调用 get_num_graphs_to_capture 获得 graph 数量，加入日志；使用独立的 encoder_profiling_pool 进行 profile 隔离。
导入异常处理放宽（vllm/utils/import_utils.py）：
- _has_module() 将 except ImportError 改为 except Exception，以捕获在特定环境中因原生依赖加载失败而产生的非 ImportError 异常。
测试覆盖（tests/v1/cudagraph/test_encoder_cudagraph.py）：
- 新增 test_num_graphs_to_capture_tracks_budgets 验证 graph 数量与 budgets 一致。
- 新增 test_capture_uses_supplied_graph_pool 验证 captured pool 与传入的 pool 一致。
- 新增 test_clear_releases_graphs_and_pool 验证 clear 后释放所有状态。
- 修改现有 GPU 测试，为 capture() 传入显式 graph_pool。
配置与数据契约：无新增配置，但 encoder_cudagraph_manager 属性在 GPUModelRunner 中被引入，后续模块可基于此属性使用 encoder cudagraph。

文件	模块	状态	重要度
`vllm/v1/worker/gpu_model_runner.py`	模型运行器	modified	8.38
`vllm/v1/worker/encoder_cudagraph.py`	编码器图形	modified	7.06
`tests/v1/cudagraph/test_encoder_cudagraph.py`	图形测试	modified	6.15
`vllm/utils/import_utils.py`	工具函数	modified	4.26

关键符号

EncoderCudaGraphManager.capture EncoderCudaGraphManager.clear EncoderCudaGraphManager.get_num_graphs_to_capture GPUModelRunner._create_encoder_cudagraph_manager GPUModelRunner._maybe_init_encoder_cudagraph_manager _has_module

关键源码片段

vllm/v1/worker/gpu_model_runner.py core-logic

核心集成变更，新增 encoder cudagraph manager 创建和初始化方法，修改 profile_cudagraph_memory 集成 encoder graph profiling。

# 创建临时 encoder cudagraph manager（仅用于 profiling，不保持状态）
@torch.inference_mode()
def _create_encoder_cudagraph_manager(self) -> "EncoderCudaGraphManager | None":
    if not (self.compilation_config.cudagraph_mm_encoder and self.supports_mm_inputs):
        return None
    # 通过 get_model() 解包 CUDAGraphWrapper/UBatchWrapper
    # 因为 @runtime_checkable Protocol isinstance 检查无法通过 __getattr__ 转发
    from vllm.model_executor.models.interfaces import (
        SupportsEncoderCudaGraph, supports_encoder_cudagraph)
    from vllm.v1.worker.encoder_cudagraph import EncoderCudaGraphManager
    raw_model = self.get_model()
    if not supports_encoder_cudagraph(raw_model):
        return None
    return EncoderCudaGraphManager(
        vllm_config=self.vllm_config,
        device=self.device,
        dtype=self.dtype,
        model=cast(SupportsEncoderCudaGraph, raw_model),
    )

# 在 profile_cudagraph_memory 中的关键集成片段
encoder_cudagraph_manager = self._create_encoder_cudagraph_manager()
decoder_graphs = sum(len(descs) for _, descs in capture_descs)
encoder_graphs = (
    encoder_cudagraph_manager.get_num_graphs_to_capture()
    if encoder_cudagraph_manager is not None else 0
)
total_graphs = decoder_graphs + encoder_graphs
# ... 日志中追加 encoder graph 信息
if encoder_graphs > 0:
    graph_groups.append(
        f"ENCODER={encoder_graphs} (largest={encoder_cudagraph_manager.token_budgets[-1]})")
# 创建独立的 encoder profiling pool 防止主 pool 碎片化
encoder_profiling_pool = current_platform.graph_pool_handle()

vllm/v1/worker/encoder_cudagraph.py core-logic

核心逻辑变更，修改 capture、clear、get_num_graphs_to_capture 方法以支持显式 graph_pool。

def clear(self) -> None:
    """释放已捕获的 encoder CUDA graph 和 manager 本地的 pool。"""
    self.budget_graphs.clear()
    self.graph_pool = None

def capture(self, graph_pool: Any):
    """使用给定的 graph_pool 捕获所有 token budget 的 CUDA graph。"""
    self.graph_pool = graph_pool
    # 按 budget 从大到小捕获，有助于内存复用
    for token_budget in sorted(self.token_budgets, reverse=True):
        self._capture_budget_graph(token_budget)
    logger.info(
        "Encoder CUDA graph capture complete. Captured %d budget graphs.",
        len(self.budget_graphs))

def get_num_graphs_to_capture(self) -> int:
    """返回需要捕获的 graph 数量（即 budget 数量）。"""
    return len(self.token_budgets)

def _capture_budget_graph(self, token_budget: int):
    # ... 准备 capture_inputs
    graph = torch.cuda.CUDAGraph()
    # 使用 manager 持有的 graph_pool 进行捕获
    with torch.inference_mode(), torch.cuda.graph(graph, pool=self.graph_pool):
        output = self.model.encoder_cudagraph_forward({**values})
        output_buffer.copy_(output)
    self.budget_graphs[token_budget] = BudgetGraphMetadata(
        token_budget=token_budget,
        max_batch_size=self.max_batch_size,
        max_frames_per_batch=self.max_frames_per_batch,
        graph=graph,
        input_buffers=values,
        output_buffer=output_buffer,
    )

评论区精华

ZeroDivisionError when min_budget is zero 正确性

gemini-code-assist 指出若 `min_budget=0` 会导致 ZeroDivisionError，建议添加安全校验。

结论：作者已添加正数检查，并编写单元测试覆盖。 · 已解决

Remove redundant try fallback in capture 设计

Isotr0py 认为 encoder cudagraph 默认未启用，故捕获失败时无需回退到 eager 模式，建议移除 try。

结论：作者移除了 try 块，只保留正常捕获路径。 · 已解决

Graph pool stream independence 设计

Isotr0py 询问 `_capture_budget_graph` 是否需要独立的 CUDA stream 进行捕获。

结论：作者说明外层 `graph_capture` 上下文已切换至专用 stream，无需额外拆分。 · 已解决

风险与影响

内存估算偏差：profile 阶段使用临时 manager 估算 graph 数量，若实际运行时 budgets 发生变化（如用户自定义），估算可能与实际不符。当前设计要求运行时 budgets 与 profile 时一致，否则需重新 profile。
异常捕获放宽：_has_module 改为捕获所有 Exception，可能掩盖非 import 相关的编程错误，增加调试难度。
pool 生命周期管理：EncoderCudaGraphManager.clear() 仅将 graph_pool 置为 None，不直接释放 GPU 内存；实际 pool 由 graph_pool_handle() 管理，若 handle 被其他对象持有，可能导致泄漏。
缺少多 batch 测试：测试仅限于单 batch 场景（max_batch_size=16 但 GPU 测试使用单图片），未覆盖多图片并发时的 graph 选择与 pool 复用。

用户影响：几乎无直接影响，encoder cudagraph 默认关闭。但改进后的 memory profiling 为未来启用提供了更准确的内存报告。
系统影响：CUDA graph pool 被显式管理，减少主 pool 的碎片化，提升内存效率。
团队影响：新增的 EncoderCudaGraphManager 接口（capture(pool)、clear()、get_num_graphs_to_capture()）成为未来模型实现 encoder cudagraph 的标准契约。

CUDA graph 内存估算偏差异常捕获范围放宽 pool 生命周期管理缺少多 batch 测试

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：Profile vision encoder CUDA graph pool memory
推荐动作：此 PR 值得精读，尤其关注 profile_cudagraph_memory 中如何集成 encoder 部分以及 graph pool 的生命周期设计。它展示了在已有的 CUDA graph 框架中扩展新模块的典型模式：通过临时 manager 进行 profile，通过持久 manager 进行 runtime，并利用 graph pool 隔离。对多模态模型开发者和 CUDA graph 维护者有重要参考价值。

功能与动机

实现拆解

EncoderCudaGraphManager 核心改造（vllm/v1/worker/encoder_cudagraph.py）：
- 新增 self.graph_pool 属性记录当前 pool。
- capture() 方法改为接受 graph_pool 参数，将 pool 传递给 torch.cuda.graph；按照 budget 从大到小捕获以优化内存复用。
- 新增 clear() 方法，释放 budget_graphs 和 graph_pool。
- 新增 get_num_graphs_to_capture() 方法返回 len(self.token_budgets)，供外部估算 graph 数量。
GPUModelRunner 集成（vllm/v1/worker/gpu_model_runner.py）：
- 新增 _create_encoder_cudagraph_manager()：检查配置和模型是否支持 encoder cudagraph，若支持则创建临时 EncoderCudaGraphManager。
- 新增 _maybe_init_encoder_cudagraph_manager()：延迟初始化持久化 manager。
- 修改 profile_cudagraph_memory()：创建临时 encoder manager，调用 get_num_graphs_to_capture 获得 graph 数量，加入日志；使用独立的 encoder_profiling_pool 进行 profile 隔离。
导入异常处理放宽（vllm/utils/import_utils.py）：
- _has_module() 将 except ImportError 改为 except Exception，以捕获在特定环境中因原生依赖加载失败而产生的非 ImportError 异常。
测试覆盖（tests/v1/cudagraph/test_encoder_cudagraph.py）：
- 新增 test_num_graphs_to_capture_tracks_budgets 验证 graph 数量与 budgets 一致。
- 新增 test_capture_uses_supplied_graph_pool 验证 captured pool 与传入的 pool 一致。
- 新增 test_clear_releases_graphs_and_pool 验证 clear 后释放所有状态。
- 修改现有 GPU 测试，为 capture() 传入显式 graph_pool。
配置与数据契约：无新增配置，但 encoder_cudagraph_manager 属性在 GPUModelRunner 中被引入，后续模块可基于此属性使用 encoder cudagraph。

关键文件：

vllm/v1/worker/gpu_model_runner.py（模块模型运行器；类别 source；类型 core-logic；符号 _create_encoder_cudagraph_manager, _maybe_init_encoder_cudagraph_manager）: 核心集成变更，新增 encoder cudagraph manager 创建和初始化方法，修改 profile_cudagraph_memory 集成 encoder graph profiling。
vllm/v1/worker/encoder_cudagraph.py（模块编码器图形；类别 source；类型 core-logic；符号 capture, clear, get_num_graphs_to_capture）: 核心逻辑变更，修改 capture、clear、get_num_graphs_to_capture 方法以支持显式 graph_pool。
tests/v1/cudagraph/test_encoder_cudagraph.py（模块图形测试；类别 test；类型 test-coverage；符号 test_num_graphs_to_capture_tracks_budgets, test_capture_uses_supplied_graph_pool, test_clear_releases_graphs_and_pool）: 新增测试覆盖 capture、clear、get_num_graphs_to_capture 方法，验证正确性。
vllm/utils/import_utils.py（模块工具函数；类别 source；类型 core-logic）: 放宽异常捕获，防止某些环境下的 import 失败被忽略。

关键符号：EncoderCudaGraphManager.capture, EncoderCudaGraphManager.clear, EncoderCudaGraphManager.get_num_graphs_to_capture, GPUModelRunner._create_encoder_cudagraph_manager, GPUModelRunner._maybe_init_encoder_cudagraph_manager, _has_module

关键源码片段

`vllm/v1/worker/gpu_model_runner.py`

核心集成变更，新增 encoder cudagraph manager 创建和初始化方法，修改 profile_cudagraph_memory 集成 encoder graph profiling。

# 创建临时 encoder cudagraph manager（仅用于 profiling，不保持状态）
@torch.inference_mode()
def _create_encoder_cudagraph_manager(self) -> "EncoderCudaGraphManager | None":
    if not (self.compilation_config.cudagraph_mm_encoder and self.supports_mm_inputs):
        return None
    # 通过 get_model() 解包 CUDAGraphWrapper/UBatchWrapper
    # 因为 @runtime_checkable Protocol isinstance 检查无法通过 __getattr__ 转发
    from vllm.model_executor.models.interfaces import (
        SupportsEncoderCudaGraph, supports_encoder_cudagraph)
    from vllm.v1.worker.encoder_cudagraph import EncoderCudaGraphManager
    raw_model = self.get_model()
    if not supports_encoder_cudagraph(raw_model):
        return None
    return EncoderCudaGraphManager(
        vllm_config=self.vllm_config,
        device=self.device,
        dtype=self.dtype,
        model=cast(SupportsEncoderCudaGraph, raw_model),
    )

# 在 profile_cudagraph_memory 中的关键集成片段
encoder_cudagraph_manager = self._create_encoder_cudagraph_manager()
decoder_graphs = sum(len(descs) for _, descs in capture_descs)
encoder_graphs = (
    encoder_cudagraph_manager.get_num_graphs_to_capture()
    if encoder_cudagraph_manager is not None else 0
)
total_graphs = decoder_graphs + encoder_graphs
# ... 日志中追加 encoder graph 信息
if encoder_graphs > 0:
    graph_groups.append(
        f"ENCODER={encoder_graphs} (largest={encoder_cudagraph_manager.token_budgets[-1]})")
# 创建独立的 encoder profiling pool 防止主 pool 碎片化
encoder_profiling_pool = current_platform.graph_pool_handle()

`vllm/v1/worker/encoder_cudagraph.py`

核心逻辑变更，修改 capture、clear、get_num_graphs_to_capture 方法以支持显式 graph_pool。

def clear(self) -> None:
    """释放已捕获的 encoder CUDA graph 和 manager 本地的 pool。"""
    self.budget_graphs.clear()
    self.graph_pool = None

def capture(self, graph_pool: Any):
    """使用给定的 graph_pool 捕获所有 token budget 的 CUDA graph。"""
    self.graph_pool = graph_pool
    # 按 budget 从大到小捕获，有助于内存复用
    for token_budget in sorted(self.token_budgets, reverse=True):
        self._capture_budget_graph(token_budget)
    logger.info(
        "Encoder CUDA graph capture complete. Captured %d budget graphs.",
        len(self.budget_graphs))

def get_num_graphs_to_capture(self) -> int:
    """返回需要捕获的 graph 数量（即 budget 数量）。"""
    return len(self.token_budgets)

def _capture_budget_graph(self, token_budget: int):
    # ... 准备 capture_inputs
    graph = torch.cuda.CUDAGraph()
    # 使用 manager 持有的 graph_pool 进行捕获
    with torch.inference_mode(), torch.cuda.graph(graph, pool=self.graph_pool):
        output = self.model.encoder_cudagraph_forward({**values})
        output_buffer.copy_(output)
    self.budget_graphs[token_budget] = BudgetGraphMetadata(
        token_budget=token_budget,
        max_batch_size=self.max_batch_size,
        max_frames_per_batch=self.max_frames_per_batch,
        graph=graph,
        input_buffers=values,
        output_buffer=output_buffer,
    )

评论区精华

相关讨论主要围绕设计简化和正确性展开：

ZeroDivisionError 风险（gemini-code-assist 提出）：当 min_budget=0 时访问除零，作者添加了正数检查。
代码放置建议（shen-shanshan）：建议将 budget 验证移到更精确的分支中，减少不必要的用户覆盖。
不必要的 try 回退（Isotr0py）：capture 中的 try 回退到 eager 模式是多余的，因为 encoder cudagraph 默认未启用；作者移除了该逻辑。
DP 支持（Isotr0py）：质疑 Step3-VL 为什么禁止 DP 下的 capture；最终 PR 移除了模型特定代码。
patch_pixel_values 耦合（Isotr0py）：认为不应作为 buffer，作者后续在更大 revision 中处理。
stream 拆分（Isotr0py）：询问是否需要独立 stream，作者解释已在外层 graph_capture 中处理。
mock 冗余（Isotr0py）：测试中的 postprocess_encoder_output 覆盖与默认实现重复，作者移除了相关 mock。
ZeroDivisionError when min_budget is zero (correctness): 作者已添加正数检查，并编写单元测试覆盖。
Remove redundant try fallback in capture (design): 作者移除了 try 块，只保留正常捕获路径。
Graph pool stream independence (design): 作者说明外层 graph_capture 上下文已切换至专用 stream，无需额外拆分。

风险与影响

风险：
1. 内存估算偏差：profile 阶段使用临时 manager 估算 graph 数量，若实际运行时 budgets 发生变化（如用户自定义），估算可能与实际不符。当前设计要求运行时 budgets 与 profile 时一致，否则需重新 profile。
2. 异常捕获放宽：_has_module 改为捕获所有 Exception，可能掩盖非 import 相关的编程错误，增加调试难度。
3. pool 生命周期管理：EncoderCudaGraphManager.clear() 仅将 graph_pool 置为 None，不直接释放 GPU 内存；实际 pool 由 graph_pool_handle() 管理，若 handle 被其他对象持有，可能导致泄漏。
4. 缺少多 batch 测试：测试仅限于单 batch 场景（max_batch_size=16 但 GPU 测试使用单图片），未覆盖多图片并发时的 graph 选择与 pool 复用。
影响：
- 用户影响：几乎无直接影响，encoder cudagraph 默认关闭。但改进后的 memory profiling 为未来启用提供了更准确的内存报告。
- 系统影响：CUDA graph pool 被显式管理，减少主 pool 的碎片化，提升内存效率。
- 团队影响：新增的 EncoderCudaGraphManager 接口（capture(pool)、clear()、get_num_graphs_to_capture()）成为未来模型实现 encoder cudagraph 的标准契约。
- 风险标记：CUDA graph 内存估算偏差, 异常捕获范围放宽, pool 生命周期管理, 缺少多 batch 测试

关联脉络

PR #42224 Encoder CUDA graph max-batch/budget improvements: PR#42224 被合并后，作者基于其方向将本 PR 缩减为仅包含 graph pool profiling；讨论中明确要求基于 #42224 进行精简。
PR #41234 [Draft] Encoder CUDA graph interface refactor: PR body 提及若 #41234 先合并则需 rebase，表明与 encoder cudagraph 接口设计相关。

#41714 [MM][CG] Profile encoder CUDA graph pool memory

执行摘要

Profile vision encoder CUDA graph pool memory

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论