#40151 [compile] Skip FX graph deserialization on loading, further reducing warm compile time.

Original PR · Author: zhxchen17 · Merged: 2026-04-23 13:43 · Files changed: 3 · Commits: 1 · Comments: 10 · Lines changed: +98 / -51

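Executive Summary

Skipping FX graph deserialization brings warm compile time down to under 2 seconds.

The motivation is to further reduce warm compile time and so improve model loading and inference efficiency. The PR body cites benchmark data showing warm compile time dropping from several seconds to under one second; for example, DeepSeek-V3.2 goes from 6.05 s to 0.27 s (-95.5%). The author notes that this is a follow-up to PR #38657: it uses the recently added Python execution code as the source, avoiding the overhead of FX graph serialization.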
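The quoted reduction can be checked with a few lines of arithmetic. The sketch below uses only the two DeepSeek-V3.2 timings cited in the PR body; the helper name `percent_change` is hypothetical and not part of the PR.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative means faster)."""
    return (after - before) / before * 100.0


# Timings cited in the PR body for DeepSeek-V3.2, in seconds.
before_s, after_s = 6.05, 0.27

change = percent_change(before_s, after_s)
speedup = before_s / after_s

print(f"change: {change:.1f}%")   # prints "change: -95.5%"
print(f"speedup: {speedup:.1f}x")
```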

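Technical managers and engineers are advised to read this PR closely, focusing on the design decisions behind generate_execution_code_with_name and on the logic that skips deserialization when loading the cache. The changes show how code generation can be used to optimize compile performance.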

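Discussion Highlights

The review centered on three issues:

Backward compatibility risk: gemini-code-assist[bot] pointed out that state["submod_names"] may be missing from old caches, causing a KeyError. zhxchen17 replied that a failed cache load triggers generation of a new cache, so the current behavior was kept and a comment was added to explain it.
Missing import: gemini-code-assist[bot] suggested adding import operator to the generated code to support function calls. The suggestion was not adopted directly in the code; it may have been addressed some other way.
Submodule binding: gemini-code-assist[bot] suggested direct dictionary access instead of .get() for clearer error messages, but zhxchen17 explained that .get() is intentional, because it handles the placeholders for inlined submodules.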

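Implementation Breakdown

Code generation refactor: in vllm/compilation/codegen.py, a new function generate_execution_code_with_name supports inlining torch.fx.GraphModule submodules into the generated Python code, which avoids serializing the whole graph. Key symbols are generate_execution_code_with_name and the inlined_submods list.
Caching and deserialization changes: in vllm/compilation/caching.py, VllmSerializableFunction.__init__ now accepts bytes for the graph_module parameter, and deserialize_compile_artifacts skips graph deserialization and uses the execution code directly. A fake_mode parameter was added to support the fallback path.
Import and dependency updates: in vllm/compilation/backends.py, the import statements were adjusted to import compile_execution_fn and generate_execution_code directly from the codegen module, simplifying the code structure.
Tests: the change includes no dedicated test files, but the PR body provides detailed performance measurements that confirm the warm compile time improvement.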
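The code generation idea can be illustrated with a small, torch-free sketch. This is not vLLM's implementation: `generate_source`, `compile_source`, and the way inlined bodies are represented as expression strings are all assumptions made for illustration. It only shows the general technique of emitting Python source that either calls a submodule by name or pastes its body inline, and then compiling that source with `exec`.

```python
from typing import Any, Callable


def generate_source(fn_name: str, submod_names: list[str], inlined: dict[str, str]) -> str:
    """Emit source for a function that runs each submodule in order.

    `inlined` maps a submodule name to a Python expression over `x` (hypothetical
    representation). Inlined submodules are pasted into the source; all others
    are looked up by name in the `submods` mapping at call time.
    """
    lines = [f"def {fn_name}(x, submods):"]
    for name in submod_names:
        if name in inlined:
            lines.append(f"    x = {inlined[name]}  # inlined: {name}")
        else:
            lines.append(f"    x = submods[{name!r}](x)")
    lines.append("    return x")
    return "\n".join(lines)


def compile_source(source: str, fn_name: str) -> Callable[..., Any]:
    """Compile generated source and return the function it defines."""
    namespace: dict[str, Any] = {}
    exec(compile(source, f"<generated {fn_name}>", "exec"), namespace)
    return namespace[fn_name]


source = generate_source(
    "execute",
    ["submod_0", "submod_1", "submod_2"],
    inlined={"submod_1": "x + 1"},
)
execute = compile_source(source, "execute")

# Only the non-inlined submodules need to be supplied at call time.
result = execute(3, {"submod_0": lambda v: v * 2, "submod_2": lambda v: v - 4})
print(source)
print(result)  # (3 * 2 + 1) - 4 = 3
```

Because the generated text is plain Python, it can be stored as a string and recompiled on load without rebuilding any graph object, which is the property the PR relies on.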

File	Module	Status	Importance
`vllm/compilation/codegen.py`	compilation	modified	7.9
`vllm/compilation/caching.py`	compilation	modified	6.78
`vllm/compilation/backends.py`	compilation	modified	5.31

Key Symbols

generate_execution_code_with_name VllmSerializableFunction.__init__ deserialize_compile_artifacts

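Key Source Snippet

vllm/compilation/caching.py data-contract

The cache deserialization logic was changed so that FX graph deserialization can be skipped and the execution code used directly, reducing load time.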

class VllmSerializableFunction(SerializableCallable):
    def __init__(
        self,
        graph_module: torch.fx.GraphModule | bytes,  # bytes is now accepted, so deserialization can be avoided
        example_inputs: Sequence[Any],
        prefix: str,
        optimized_call: Callable[..., Any],
        is_encoder: bool = False,
        vllm_backend: Any | None = None,
        sym_tensor_indices: list[int] | None = None,
        aot_autograd_config: dict[str, Any] | None = None,
        execution_code: str | None = None,
        submod_names: list[str] | None = None,
    ) -> None:
        self.graph_module = graph_module  # no longer asserted to be a GraphModule; bytes can be passed through directly
        self.example_inputs = example_inputs
        self.prefix = prefix
        self.optimized_call = optimized_call
        self.is_encoder = is_encoder
        self.shape_env = None
        self.vllm_backend = vllm_backend
        self.sym_tensor_indices = sym_tensor_indices
        self.execution_code = execution_code  # stores the generated execution code
        self.submod_names = submod_names
        self._fake_mode: Any | None = None
        # remaining initialization logic ...

    @classmethod
    def deserialize_compile_artifacts(cls, data: bytes) -> "VllmSerializableFunction":
        from torch._guards import TracingContext, tracing
        from torch.fx.experimental.symbolic_shapes import ShapeEnv

        state = pickle.loads(data)
        fake_mode = FakeTensorMode(shape_env=ShapeEnv())

        # Skip deserializing graph_module; load example_inputs directly
        state["example_inputs"] = GraphPickler.loads(state["example_inputs"], fake_mode)
        standalone_compile_artifacts = state.pop("standalone_compile_artifacts", None)
        sym_shape_indices_map = state.pop("sym_shape_indices_map", {})
        returns_tuple_map = state.pop("returns_tuple_map", {})

        # Rebuild the function from the execution code instead of deserializing the whole graph
        if envs.VLLM_USE_MEGA_AOT_ARTIFACT:
            assert standalone_compile_artifacts is not None
            submod_names = state.get("submod_names")
            # reconstruction logic ...
        # Fallback path: deserialize graph_module only when needed
        state["graph_module"] = cls.deserialize_graph_module(state["graph_module"], fake_mode)
        state["graph_module"].recompile()
        # remaining logic ...

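Review Comment Highlights

Backward compatibility risk: missing submod_names key (correctness)

gemini-code-assist[bot] pointed out that accessing state["submod_names"] directly can raise a KeyError on old caches and suggested skipping safely. zhxchen17 replied that a failed cache load triggers generation of a new cache, so the current behavior was kept.

Conclusion: no code change; a comment was added to explain the behavior, relying on the cache regeneration mechanism. · Resolved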
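The reasoning behind this conclusion, that a load failure on an old cache is acceptable because it simply leads to a fresh cache, can be sketched as follows. All names here (`load_state`, `load_or_regenerate`, `fresh_compile`) and the cache layout are hypothetical; the sketch does not reproduce vLLM's actual loading code.

```python
import pickle
from typing import Callable


def load_state(data: bytes) -> dict:
    """Load a cached state; direct indexing raises KeyError on the old layout."""
    state = pickle.loads(data)
    return {
        "execution_code": state["execution_code"],
        "submod_names": state["submod_names"],  # absent in the old layout
    }


def load_or_regenerate(data: bytes, regenerate: Callable[[], dict]) -> tuple[dict, bool]:
    """Return (state, regenerated). Any load failure falls back to a fresh compile."""
    try:
        return load_state(data), False
    except Exception:  # deliberately broad: an unreadable cache is just a cache miss
        return regenerate(), True


def fresh_compile() -> dict:
    # Stand-in for a cold compile that produces a state in the new layout.
    return {"execution_code": "def execute(x): return x", "submod_names": []}


old_cache = pickle.dumps({"execution_code": "def execute(x): return x"})
new_cache = pickle.dumps(fresh_compile())

state_old, regenerated_old = load_or_regenerate(old_cache, fresh_compile)
state_new, regenerated_new = load_or_regenerate(new_cache, fresh_compile)
```

The cost of this approach is one cold compile per stale cache entry; in exchange, the loader needs no version-specific branches.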

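Missing import in generated code: operator module (correctness)

gemini-code-assist[bot] suggested adding import operator to the generated code to support function calls and avoid a NameError.

Conclusion: not adopted directly in the code; it may have been addressed some other way, but the risk still warrants attention. · Unresolved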
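To see why a missing import matters for generated code, consider this small demonstration. The generated source string is invented for illustration and is not taken from the PR. It shows the failure mode the reviewer describes, plus two general ways to avoid it: prepending the import to the source, or supplying the module through the namespace passed to `exec`. Which approach, if any, the PR uses is not stated in the review.

```python
import operator
from typing import Any, Callable

# Hypothetical generated code that refers to `operator` without importing it.
GENERATED = "def execute(xs):\n    return operator.getitem(xs, 0)\n"


def compile_generated(source: str, namespace: dict[str, Any]) -> Callable[..., Any]:
    """Compile `source` in `namespace` and return its `execute` function."""
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["execute"]


# 1) No import anywhere: definition succeeds, but the call raises NameError.
without_import = compile_generated(GENERATED, {})
try:
    without_import([10, 20])
    raised_name_error = False
except NameError:
    raised_name_error = True

# 2) Import emitted as part of the generated source.
with_prelude = compile_generated("import operator\n" + GENERATED, {})

# 3) Module injected through the exec namespace instead.
with_namespace = compile_generated(GENERATED, {"operator": operator})

first_a = with_prelude([10, 20])
first_b = with_namespace([10, 20])
```

Note that the error appears only when the function is called, not when the source is compiled, so a cache can load cleanly and still fail on first execution.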

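Submodule binding: .get() versus direct access (design)

gemini-code-assist[bot] suggested direct dictionary access for clearer error messages, but zhxchen17 explained that .get() is intentional, because it handles the placeholders for inlined submodules.

Conclusion: keep .get() and add a comment stating the intent. · Resolved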
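The trade-off between the two access styles can be shown in a few lines. This sketch assumes that an inlined submodule has no entry in the mapping of compiled callables and is represented by a `None` placeholder; that representation and the helper names are assumptions, not details confirmed by the review.

```python
from typing import Callable, Optional


def bind_submodules(
    submod_names: list[str], compiled: dict[str, Callable]
) -> list[Optional[Callable]]:
    """Lenient binding: names without a compiled callable become None placeholders."""
    return [compiled.get(name) for name in submod_names]


def bind_submodules_strict(
    submod_names: list[str], compiled: dict[str, Callable]
) -> list[Callable]:
    """Strict binding: a missing name fails immediately with KeyError."""
    return [compiled[name] for name in submod_names]


names = ["submod_0", "submod_1", "submod_2"]
# submod_1 is assumed to be inlined into the generated code, so it has no entry.
compiled = {"submod_0": lambda v: v * 2, "submod_2": lambda v: v - 4}

bound = bind_submodules(names, compiled)

try:
    bind_submodules_strict(names, compiled)
    strict_failed = False
except KeyError:
    strict_failed = True
```

Strict access gives the clearer error the reviewer asked for, but it would reject the inlined case outright; lenient access accepts it at the price of deferring any genuine "missing submodule" error to call time.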

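Risks and Impact

Technical risks:

Backward compatibility: old caches may fail to load because the submod_names key is missing, but by design this triggers generation of a new cache, so the risk is manageable.
Runtime errors: if the generated code lacks a required import (such as operator), a NameError may result. The review shows no fix for this, so it warrants attention.
Logic complexity: inlining submodules and skipping deserialization add code complexity and may introduce subtle bugs, especially in edge cases.

Impact:

Users: warm compile time drops sharply, speeding up model startup, which helps most where models are loaded frequently.
System: the change touches the core path of the compilation module and may affect inference for every model that uses vLLM compilation.
Team: the code structure is simpler, but cache compatibility and test coverage need to be ensured.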

Risk tags: backward compatibility risk, missing runtime import, core path change

Related Issues

No related issue identified

No explicitly linked issue has been detected so far.

Full Report

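Key files:

vllm/compilation/codegen.py (module: compilation; category: source; type: core-logic; symbols: generate_execution_code, generate_execution_code_with_name): the core file of this change. It refactors the code generation logic and supports inlining submodules to avoid FX graph serialization.
vllm/compilation/caching.py (module: compilation; category: source; type: data-contract; symbol: VllmSerializableFunction): changes the cache deserialization logic so that FX graph deserialization can be skipped and the execution code used directly, reducing load time.
vllm/compilation/backends.py (module: compilation; category: source; type: dependency-wiring): adjusts the import statements to simplify the code structure and ensure functions are imported correctly from the codegen module.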

Related Context

PR #38657 [compile] Follow-up PR for Python execution code optimization: this PR follows up on PR #38657 and uses the Python execution code it introduced as the source of truth to further reduce compile time.
