#43978 [BugFix] [GDN] Read linear_key_head_dim from hf_text_config for multimodal models

原始 PR 作者 IdoAtadTD 合并时间 2026-06-02 22:17 文件变更 1 提交数 1 评论 2 代码增减 +2 / -2

执行摘要

修复多模态模型 GDN prefill 后端选择 bug

多模态 Qwen3.5 模型（如 Qwen3.5-397B-A17B）的 config.json 中 linear_key_head_dim (128) 位于 hf_text_config 而非 hf_config 下。原代码仅从 hf_config 读取，导致 head_k_dim 为 None，CuteDSL/FlashInfer 后端无法启用。PR body 明确指出："GDN prefill backend selection only read hf_config, so head_k_dim was None and CuteDSL/FlashInfer on Blackwell (SM100) was never enabled"。

此 PR 是典型的数据契约 bugfix，改动极小但影响关键路径，值得快速合入。建议关注类似的多模态配置属性访问模式，并在仓库内建立统一使用 hf_text_config 的惯例。

讨论亮点

Reviewer tomeras91 建议直接全面改用 hf_text_config 而非添加额外的 or 条件，因为 hf_text_config 在多模态模型中包含所需字段，而在纯文本模型中与 hf_config 等价，且已是 Qwen3.5 建模代码中的惯用做法。作者采纳了该建议，最终实现简洁统一。

实现拆解

在 _resolve_gdn_prefill_backend 函数中，将 getattr(vllm_config.model_config.hf_config, "linear_key_head_dim", None) 替换为 getattr(vllm_config.model_config.hf_text_config, "linear_key_head_dim", None)。
在 _log_gdn_backend_decision 函数中，做相同的替换，以确保日志信息准确反映实际读取的配置源。
这两个改动均位于 vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py 文件中，仅修改两行代码，共 2 处变更。

文件	模块	状态	重要度
`vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`	GDN	modified	6.08

关键符号

_resolve_gdn_prefill_backend _log_gdn_backend_decision

关键源码片段

vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py data-contract

唯一修改的文件，包含 GDN prefill 后端选择与日志打印逻辑。将配置读取源从 `hf_config` 改为 `hf_text_config`，修复了多模态模型后端无法正确启用的问题。

def _resolve_gdn_prefill_backend(
    vllm_config: VllmConfig,
) -> tuple[str, Literal["triton", "flashinfer", "cutedsl"]]:
    # ... 前面逻辑不变 ...
    # Fix: Read linear_key_head_dim from hf_text_config instead of hf_config.
    # For multimodal models like Qwen3.5-397B-A17B, this attribute is
    # stored in hf_text_config; for pure text models, hf_text_config
    # equals hf_config, so this change is backward-compatible.
    head_k_dim = getattr(
        vllm_config.model_config.hf_text_config, "linear_key_head_dim", None
    )
    # ... 后续 Blackwell 判断逻辑依赖 head_k_dim == 128 ...


def _log_gdn_backend_decision(
    vllm_config: VllmConfig,
    requested_backend: str,
    active_backend: str,
) -> None:
    """Log the GDN prefill backend choice in the attention-selector style."""
    # Same fix as above for logging accuracy.
    head_k_dim = getattr(
        vllm_config.model_config.hf_text_config, "linear_key_head_dim", None
    )
    # ... 日志打印 ...

评论区精华

使用 hf_text_config 替代 hf_config 设计

Reviewer tomeras91 建议直接全面改用 `hf_text_config` 而非添加额外的 `or` 语句，并指出该方案在多模态和纯文本模型中均适用，且是其他建模代码中的惯例。

结论：作者采纳建议，最终实现为直接替换两个函数中的 `hf_config` 为 `hf_text_config`。 · 已解决

风险与影响

风险极低：仅修改两行代码，将配置读取源从 hf_config 改为 hf_text_config。hf_text_config 在纯文本模型中与 hf_config 相同，因此行为完全向后兼容。改动覆盖了 GDN prefill 后端选择的核心逻辑（约 180 行和约 220 行），但逻辑本身简单明确。

直接影响所有使用 qwen_gdn_linear_attn.py 的 GDN 模型，特别是多模态 Qwen3.5 系列。修复后，这些模型在 Blackwell GPU 上能够正确启用 CuteDSL 或 FlashInfer 后端，从而获得性能提升。纯文本模型不受影响。

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：修复多模态模型 GDN prefill 后端选择 bug
推荐动作：此 PR 是典型的数据契约 bugfix，改动极小但影响关键路径，值得快速合入。建议关注类似的多模态配置属性访问模式，并在仓库内建立统一使用 hf_text_config 的惯例。

功能与动机

实现拆解

在 _resolve_gdn_prefill_backend 函数中，将 getattr(vllm_config.model_config.hf_config, "linear_key_head_dim", None) 替换为 getattr(vllm_config.model_config.hf_text_config, "linear_key_head_dim", None)。
在 _log_gdn_backend_decision 函数中，做相同的替换，以确保日志信息准确反映实际读取的配置源。
这两个改动均位于 vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py 文件中，仅修改两行代码，共 2 处变更。

关键文件：

vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py（模块 GDN；类别 source；类型 data-contract）: 唯一修改的文件，包含 GDN prefill 后端选择与日志打印逻辑。将配置读取源从 hf_config 改为 hf_text_config，修复了多模态模型后端无法正确启用的问题。

关键符号：_resolve_gdn_prefill_backend, _log_gdn_backend_decision

关键源码片段

`vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`

唯一修改的文件，包含 GDN prefill 后端选择与日志打印逻辑。将配置读取源从 hf_config 改为 hf_text_config，修复了多模态模型后端无法正确启用的问题。

def _resolve_gdn_prefill_backend(
    vllm_config: VllmConfig,
) -> tuple[str, Literal["triton", "flashinfer", "cutedsl"]]:
    # ... 前面逻辑不变 ...
    # Fix: Read linear_key_head_dim from hf_text_config instead of hf_config.
    # For multimodal models like Qwen3.5-397B-A17B, this attribute is
    # stored in hf_text_config; for pure text models, hf_text_config
    # equals hf_config, so this change is backward-compatible.
    head_k_dim = getattr(
        vllm_config.model_config.hf_text_config, "linear_key_head_dim", None
    )
    # ... 后续 Blackwell 判断逻辑依赖 head_k_dim == 128 ...


def _log_gdn_backend_decision(
    vllm_config: VllmConfig,
    requested_backend: str,
    active_backend: str,
) -> None:
    """Log the GDN prefill backend choice in the attention-selector style."""
    # Same fix as above for logging accuracy.
    head_k_dim = getattr(
        vllm_config.model_config.hf_text_config, "linear_key_head_dim", None
    )
    # ... 日志打印 ...

评论区精华

使用 hf_text_config 替代 hf_config (design): 作者采纳建议，最终实现为直接替换两个函数中的 hf_config 为 hf_text_config。

风险与影响

风险：风险极低：仅修改两行代码，将配置读取源从 hf_config 改为 hf_text_config。hf_text_config 在纯文本模型中与 hf_config 相同，因此行为完全向后兼容。改动覆盖了 GDN prefill 后端选择的核心逻辑（约 180 行和约 220 行），但逻辑本身简单明确。
影响：直接影响所有使用 qwen_gdn_linear_attn.py 的 GDN 模型，特别是多模态 Qwen3.5 系列。修复后，这些模型在 Blackwell GPU 上能够正确启用 CuteDSL 或 FlashInfer 后端，从而获得性能提升。纯文本模型不受影响。
风险标记：暂无

关联脉络

PR #43273 [insert title]: PR body 提及该 PR 为 CuteDSL GDN prefill on SM100 的相关工作，当前修复使其 gate 条件 head_k_dim == 128 在多模态模型上得以满足。

#43978 [BugFix] [GDN] Read linear_key_head_dim from hf_text_config for multimodal models

执行摘要

修复多模态模型 GDN prefill 后端选择 bug

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论