#24156 Cache FlashInfer autotune configs

原始 PR 作者 sshleifer 合并时间 2026-05-05 02:00 文件变更 1 提交数 2 评论 10 代码增减 +43 / -5

执行摘要

缓存 FlashInfer 自动调优配置

FlashInfer 的自动调优在每次服务启动时都会运行，对于需要频繁重启或部署多个实例的场景，重复调优造成了不必要的启动延迟和计算开销。通过持久化调优结果，可以显著缩短后续启动时间。

值得精读。该 PR 展示了如何通过简单的缓存机制显著优化模型初始化性能，设计上考虑了多维度缓存键和并发安全，是性能优化的良好范例。

讨论亮点

Review 中主要关注以下三点：

缓存文件命名唯一性：gemini-code-assist[bot] 和 chatgpt-codex-connector[bot] 均指出原始实现仅使用 tp_rank 会导致 PP/DP 多 worker 间文件名冲突，建议加入 pp_rank 和 dp_rank。作者接受并修复。
自动调优上下文冲突：gemini-code-assist[bot] 指出在 autotune(True, cache=...) 内再次调用 autotune()（嵌套上下文）会覆盖外部缓存配置，建议移除 _dummy_run 的 run_ctx 参数。作者采纳。
缓存键鲁棒性：gemini-code-assist[bot] 建议缓存键应包含 dtype，并使用 model_path 替代 served_model_name。作者在第一个 commit 中已使用 model_path 并包含 self.dtype。此外，Fridge003 建议补充 dp_size 和 ep_size，作者在第二个 commit 中已添加。

实现拆解

新增 _flashinfer_autotune_cache_path 方法：在 model_runner.py 中定义，用于计算缓存路径。该方法收集 model_path、dtype、quantization、moe_runner_backend、tp_size、pp_size、dp_size、moe_ep_size 以及 hf_config.__class__.__name__ 作为缓存键的组成部分，并通过 SHA-256 哈希缩短至 16 字符。缓存目录位于 SGLANG_CACHE_DIR/flashinfer/autotune/<flashinfer_version>/<arch>/<cache_key>/，文件名包含 tp_rank、pp_rank、dp_rank，确保不同 rank 的缓存文件独立。
修改 _flashinfer_autotune 方法：将 autotune() 上下文管理器的调用改为 autotune(True, cache=str(cache_path))，并移除了 _dummy_run 中的 run_ctx 参数，避免嵌套上下文冲突。
新增导入和属性：引入了 hashlib、pathlib.Path 模块，并在 __init__ 中显式存储 self.dp_rank（之前未保存）。
缓存键的迭代优化：根据 review 反馈，在第二个 commit 中将 dp_size 和 moe_ep_size 也加入了缓存键，确保 DP/EP 配置不同时缓存隔离。

文件	模块	状态	重要度
`python/sglang/srt/model_executor/model_runner.py`	模型运行器	modified	7.33

关键符号

_flashinfer_autotune_cache_path _flashinfer_autotune

关键源码片段

python/sglang/srt/model_executor/model_runner.py core-logic

核心变更文件，新增 `_flashinfer_autotune_cache_path` 方法并修改 `_flashinfer_autotune` 以支持缓存。

# python/sglang/srt/model_executor/model_runner.py

def _flashinfer_autotune_cache_path(self) -> Path:
    """构建 FlashInfer 自动调优缓存路径。

    缓存键包含模型路径、数据类型、量化方式、MoE 后端、
    并行尺寸（TP/PP/DP/EP）和配置类名，
    确保不同配置之间的缓存隔离。
    文件名包含 TP/PP/DP rank，避免多进程冲突。
    """
    import flashinfer

    major, minor = torch.cuda.get_device_capability(self.device)
    arch = f"sm{major}{minor}"
    flashinfer_version = getattr(flashinfer, "__version__", "unknown")

    server_args = self.server_args
    model_key = "|".join([
        str(server_args.model_path),
        str(self.dtype),
        str(server_args.quantization),
        str(server_args.moe_runner_backend),
        str(self.tp_size),
        str(self.pp_size),
        str(self.dp_size),
        str(self.moe_ep_size),
        str(self.model_config.hf_config.__class__.__name__),
    ])
    cache_key = hashlib.sha256(model_key.encode()).hexdigest()[:16]
    cache_dir = (
        Path(envs.SGLANG_CACHE_DIR.get())
        / "flashinfer"
        / "autotune"
        / flashinfer_version
        / arch
        / cache_key
    )
    cache_dir.mkdir(parents=True, exist_ok=True)
    # 文件名包含 tp_rank、pp_rank、dp_rank，确保不同 rank 的缓存独立
    return cache_dir / f"rank_tp{self.tp_rank}_pp{self.pp_rank}_dp{self.dp_rank or 0}.json"


def _flashinfer_autotune(self):
    """运行 FlashInfer 自动调优，并利用缓存避免重复调优。"""
    from flashinfer.autotuner import autotune

    cache_path = self._flashinfer_autotune_cache_path()
    logger.info("Running FlashInfer autotune with cache: %s", cache_path)

    self.forward_stream.wait_stream(torch.cuda.current_stream())
    with torch.get_device_module(self.device).stream(self.forward_stream):
        # 传入 cache 路径，autotune 将自动加载 / 保存缓存
        with torch.inference_mode(), autotune(True, cache=str(cache_path)):
            # 不再传递 run_ctx，避免嵌套上下文导致缓存失效
            self._dummy_run(batch_size=self.req_to_token_pool.size)
    torch.cuda.current_stream().wait_stream(self.forward_stream)
    logger.info("FlashInfer autotune completed.")

评论区精华

缓存文件名应包含 pp_rank 和 dp_rank 正确性

gemini-code-assist[bot] 和 chatgpt-codex-connector[bot] 指出仅使用 tp_rank 会导致 PP/DP 多 worker 间文件名冲突，建议加入 pp_rank 和 dp_rank。

结论：作者采纳建议，在文件名中添加了 pp_rank 和 dp_rank。 · 已解决

避免嵌套 autotune 上下文导致缓存失效 正确性

gemini-code-assist[bot] 指出 _dummy_run 中调用了 autotune() 作为 run_ctx，这会创建新的默认上下文并覆盖外部指定的 cache 参数。

结论：作者移除了 _dummy_run 的 run_ctx 参数，不再传递 run_ctx。 · 已解决

缓存键应包含 dp_size 和 ep_size 正确性

Fridge003 建议在缓存键中加入 dp_size 和 ep_size，以避免不同 DP/EP 配置使用相同的缓存。

结论：作者在第二个 commit 中添加了 dp_size 和 moe_ep_size 到缓存键。 · 已解决

缓存键应使用 model_path 而非 served_model_name 正确性

gemini-code-assist[bot] 建议使用 model_path 作为主要标识，避免不同模型使用同一 served name 时冲突。

结论：原始实现已使用 model_path，无需修改。 · 已解决

缓存键应包含 dtype 正确性

gemini-code-assist[bot] 指出自动调优结果对数据类型敏感，建议加入 dtype。

结论：原始实现已包含 self.dtype，无需修改。 · 已解决

风险与影响

缓存失效风险：如果 FlashInfer 版本、CUDA 架构或模型配置发生变化，缓存键会自动变化，不会使用过期缓存，因此不存在使用错误配置的风险。
并发写入风险：多进程同时写入同一缓存文件时可能存在竞态条件，但 FlashInfer 内部处理并发写入为“尽力而为”，且文件名已通过 rank 区分，因此风险较低。
磁盘占用：自动调优缓存文件大小通常较小，且按模型配置隔离，不会造成显著的磁盘膨胀。

影响范围：仅影响使用 FlashInfer 后端的 MoE runner 场景。每次服务启动节省数十秒至数分钟（取决于模型规模和硬件）。
影响程度：轻微正面影响。无功能破坏，仅增加磁盘缓存读取/写入，无额外运行时开销。

核心路径变更缺少测试覆盖

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：缓存 FlashInfer 自动调优配置
推荐动作：值得精读。该 PR 展示了如何通过简单的缓存机制显著优化模型初始化性能，设计上考虑了多维度缓存键和并发安全，是性能优化的良好范例。

功能与动机

实现拆解

新增 _flashinfer_autotune_cache_path 方法：在 model_runner.py 中定义，用于计算缓存路径。该方法收集 model_path、dtype、quantization、moe_runner_backend、tp_size、pp_size、dp_size、moe_ep_size 以及 hf_config.__class__.__name__ 作为缓存键的组成部分，并通过 SHA-256 哈希缩短至 16 字符。缓存目录位于 SGLANG_CACHE_DIR/flashinfer/autotune/<flashinfer_version>/<arch>/<cache_key>/，文件名包含 tp_rank、pp_rank、dp_rank，确保不同 rank 的缓存文件独立。
修改 _flashinfer_autotune 方法：将 autotune() 上下文管理器的调用改为 autotune(True, cache=str(cache_path))，并移除了 _dummy_run 中的 run_ctx 参数，避免嵌套上下文冲突。
新增导入和属性：引入了 hashlib、pathlib.Path 模块，并在 __init__ 中显式存储 self.dp_rank（之前未保存）。
缓存键的迭代优化：根据 review 反馈，在第二个 commit 中将 dp_size 和 moe_ep_size 也加入了缓存键，确保 DP/EP 配置不同时缓存隔离。

关键文件：

python/sglang/srt/model_executor/model_runner.py（模块模型运行器；类别 source；类型 core-logic；符号 _flashinfer_autotune_cache_path, _flashinfer_autotune）: 核心变更文件，新增 _flashinfer_autotune_cache_path 方法并修改 _flashinfer_autotune 以支持缓存。

关键符号：_flashinfer_autotune_cache_path, _flashinfer_autotune

关键源码片段

`python/sglang/srt/model_executor/model_runner.py`

核心变更文件，新增 _flashinfer_autotune_cache_path 方法并修改 _flashinfer_autotune 以支持缓存。

# python/sglang/srt/model_executor/model_runner.py

def _flashinfer_autotune_cache_path(self) -> Path:
    """构建 FlashInfer 自动调优缓存路径。

    缓存键包含模型路径、数据类型、量化方式、MoE 后端、
    并行尺寸（TP/PP/DP/EP）和配置类名，
    确保不同配置之间的缓存隔离。
    文件名包含 TP/PP/DP rank，避免多进程冲突。
    """
    import flashinfer

    major, minor = torch.cuda.get_device_capability(self.device)
    arch = f"sm{major}{minor}"
    flashinfer_version = getattr(flashinfer, "__version__", "unknown")

    server_args = self.server_args
    model_key = "|".join([
        str(server_args.model_path),
        str(self.dtype),
        str(server_args.quantization),
        str(server_args.moe_runner_backend),
        str(self.tp_size),
        str(self.pp_size),
        str(self.dp_size),
        str(self.moe_ep_size),
        str(self.model_config.hf_config.__class__.__name__),
    ])
    cache_key = hashlib.sha256(model_key.encode()).hexdigest()[:16]
    cache_dir = (
        Path(envs.SGLANG_CACHE_DIR.get())
        / "flashinfer"
        / "autotune"
        / flashinfer_version
        / arch
        / cache_key
    )
    cache_dir.mkdir(parents=True, exist_ok=True)
    # 文件名包含 tp_rank、pp_rank、dp_rank，确保不同 rank 的缓存独立
    return cache_dir / f"rank_tp{self.tp_rank}_pp{self.pp_rank}_dp{self.dp_rank or 0}.json"


def _flashinfer_autotune(self):
    """运行 FlashInfer 自动调优，并利用缓存避免重复调优。"""
    from flashinfer.autotuner import autotune

    cache_path = self._flashinfer_autotune_cache_path()
    logger.info("Running FlashInfer autotune with cache: %s", cache_path)

    self.forward_stream.wait_stream(torch.cuda.current_stream())
    with torch.get_device_module(self.device).stream(self.forward_stream):
        # 传入 cache 路径，autotune 将自动加载 / 保存缓存
        with torch.inference_mode(), autotune(True, cache=str(cache_path)):
            # 不再传递 run_ctx，避免嵌套上下文导致缓存失效
            self._dummy_run(batch_size=self.req_to_token_pool.size)
    torch.cuda.current_stream().wait_stream(self.forward_stream)
    logger.info("FlashInfer autotune completed.")

评论区精华

Review 中主要关注以下三点：

缓存文件命名唯一性：gemini-code-assist[bot] 和 chatgpt-codex-connector[bot] 均指出原始实现仅使用 tp_rank 会导致 PP/DP 多 worker 间文件名冲突，建议加入 pp_rank 和 dp_rank。作者接受并修复。
自动调优上下文冲突：gemini-code-assist[bot] 指出在 autotune(True, cache=...) 内再次调用 autotune()（嵌套上下文）会覆盖外部缓存配置，建议移除 _dummy_run 的 run_ctx 参数。作者采纳。
缓存键鲁棒性：gemini-code-assist[bot] 建议缓存键应包含 dtype，并使用 model_path 替代 served_model_name。作者在第一个 commit 中已使用 model_path 并包含 self.dtype。此外，Fridge003 建议补充 dp_size 和 ep_size，作者在第二个 commit 中已添加。
- 缓存文件名应包含 pp_rank 和 dp_rank (correctness): 作者采纳建议，在文件名中添加了 pp_rank 和 dp_rank。
- 避免嵌套 autotune 上下文导致缓存失效 (correctness): 作者移除了 _dummy_run 的 run_ctx 参数，不再传递 run_ctx。
- 缓存键应包含 dp_size 和 ep_size (correctness): 作者在第二个 commit 中添加了 dp_size 和 moe_ep_size 到缓存键。
- 缓存键应使用 model_path 而非 served_model_name (correctness): 原始实现已使用 model_path，无需修改。
- 缓存键应包含 dtype (correctness): 原始实现已包含 self.dtype，无需修改。

风险与影响

风险：
1. 缓存失效风险：如果 FlashInfer 版本、CUDA 架构或模型配置发生变化，缓存键会自动变化，不会使用过期缓存，因此不存在使用错误配置的风险。
2. 并发写入风险：多进程同时写入同一缓存文件时可能存在竞态条件，但 FlashInfer 内部处理并发写入为“尽力而为”，且文件名已通过 rank 区分，因此风险较低。
3. 磁盘占用：自动调优缓存文件大小通常较小，且按模型配置隔离，不会造成显著的磁盘膨胀。
  - 影响：影响范围：仅影响使用 FlashInfer 后端的 MoE runner 场景。每次服务启动节省数十秒至数分钟（取决于模型规模和硬件）。
  影响程度：轻微正面影响。无功能破坏，仅增加磁盘缓存读取/写入，无额外运行时开销。
风险标记：核心路径变更, 缺少测试覆盖

关联脉络

PR #24172 Fix flashinfer workspace OOM: 同样涉及 FlashInfer 后端的问题修复，与本 PR 的自动调优功能在同一关注区域。
PR #24350 [tiny] misc cleanups across configs, attention, jit_kernel: 清理与优化相关，其中包含对 base_attn_backend 的微小调整，与本 PR 在 JIT/自动调优方面有共同点。

#24156 Cache FlashInfer autotune configs

执行摘要

缓存 FlashInfer 自动调优配置

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论