#42725 [XPU] fix weight scale shape

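Original PR · Author: zufangzhu · Merged: 2026-05-17 16:55 · Files changed: 1 · Commits: 2 · Comments: 1 · Lines changed: +3 / -0

Executive Summary

Fix the XPU FP8 weight_scale tensor shape

The FP8 GEMM kernel on XPU expects weight_scale to use the same layout as weight, but the previous code transposed only weight and left weight_scale untouched, so the two shapes did not match. The PR body, "fix weight scale shape", names the problem directly.

Worth a close read to understand how the XPU FP8 backend processes parameters. Pay attention to the condition mismatch raised in review and consider fixing it in a follow-up PR: move the weight_scale transpose into the same if block as the weight transpose so that the two layouts always stay in sync.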

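Discussion Highlights

gemini-code-assist[bot] pointed out a condition mismatch: the weight_scale transpose always runs, while the weight transpose runs only for a specific layout. If weight is already in [in, out] layout, weight_scale is still transposed, which can produce a wrong shape. In addition, for per-channel quantization weight_scale may be a 1-D tensor, in which case .t() is a no-op. The comment has not been resolved, but the PR was approved and merged.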
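The reviewer's point about 1-D scales can be checked with a shape-only model of `torch.Tensor.t()`. The sketch below does not import torch; `t_shape` is a hypothetical helper that only mimics how `.t()` treats tensor ranks, and the scale shapes are illustrative assumptions rather than values taken from the PR.

```python
def t_shape(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Shape-only stand-in for torch.Tensor.t().

    0-D and 1-D shapes are returned unchanged, 2-D shapes have their
    two dimensions swapped, and higher ranks are rejected.
    """
    if len(shape) <= 1:
        return shape
    if len(shape) == 2:
        return (shape[1], shape[0])
    raise ValueError("t() expects a tensor with at most 2 dimensions")


# Hypothetical scale shapes for a layer with out_features = 4096.
scale_1d = (4096,)    # 1-D scale: the transpose changes nothing
scale_2d = (4096, 1)  # 2-D scale: the transpose swaps the axes

print(t_shape(scale_1d))  # (4096,)
print(t_shape(scale_2d))  # (1, 4096)
```

Only a 2-D scale actually changes shape, so that is the case where an unconditional transpose matters.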

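Implementation Breakdown

In the process_weights_after_loading method of vllm/model_executor/kernels/linear/scaled_mm/xpu.py, the existing logic transposes weight only when it is in [out, in] layout. Two new lines of code were added:

Call .t().contiguous() on layer.weight_scale to transpose it and make it contiguous.
Use replace_parameter to replace the layer.weight_scale parameter.

This currently runs unconditionally and is not tied to the condition that guards the weight transpose.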
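One way to picture the follow-up suggested in review is to let a single predicate decide both transposes. The sketch below works on shapes only, without torch or vLLM; `LayerShapes`, `needs_transpose`, and `process_synced` are hypothetical names introduced for illustration, and only the predicate itself is copied from the merged code.

```python
from dataclasses import dataclass


@dataclass
class LayerShapes:
    """Hypothetical shape-only stand-in for the layer's two parameters."""
    weight: tuple[int, int]
    weight_contiguous: bool
    weight_scale: tuple[int, ...]


def needs_transpose(s: LayerShapes, out_features: int, in_features: int) -> bool:
    # Same predicate as the merged code: weight is still [out, in];
    # for square weights, contiguity is the tie-breaker.
    return s.weight == (out_features, in_features) and (
        in_features != out_features or s.weight_contiguous
    )


def process_synced(s: LayerShapes, out_features: int, in_features: int) -> LayerShapes:
    """Transpose weight and weight_scale together, or leave both alone."""
    if not needs_transpose(s, out_features, in_features):
        return s  # already [in, out]: nothing to do
    scale = s.weight_scale
    if len(scale) == 2:  # 0-D and 1-D scales are unaffected by a transpose
        scale = (scale[1], scale[0])
    return LayerShapes(
        weight=(in_features, out_features),
        weight_contiguous=False,  # the merged code stores a .t() view of weight
        weight_scale=scale,
    )


loaded = LayerShapes(weight=(8, 4), weight_contiguous=True, weight_scale=(8, 1))
once = process_synced(loaded, out_features=8, in_features=4)
twice = process_synced(once, out_features=8, in_features=4)
print(once.weight, once.weight_scale)  # (4, 8) (1, 8)
print(twice == once)                   # True
```

With a shared condition the step is idempotent in this model: a second call finds weight already in [in, out] layout and leaves weight_scale alone as well.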

File	Module	Status	Importance
`vllm/model_executor/kernels/linear/scaled_mm/xpu.py`	Kernel module	modified	5.33

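Key Symbols

process_weights_after_loading

Key Source Snippet

vllm/model_executor/kernels/linear/scaled_mm/xpu.py data-contract

The core changed file. It fixes the XPU FP8 weight_scale shape problem and affects the correctness of FP8 quantized inference.

# xpu.py - XPU FP8 Scaled MM Kernel
# In process_weights_after_loading, the existing weight transpose aligns the weight with the GEMM layout.
# The PR adds a transpose and replacement of weight_scale, but does not tie it to the weight condition.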
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # fp8_gemm_w8a16 expects weight in [in, out] layout.
    # Transpose if weight is still in [out, in] layout.
    # For square matrices, use contiguity as tie-breaker:
    # checkpoint weights are contiguous, .t() views are not.
    weight = layer.weight
    out_features, in_features = self.config.weight_shape

    if weight.shape == (out_features, in_features) and (
        in_features != out_features or weight.is_contiguous()
    ):
        replace_parameter(layer, "weight", weight.data.t())
    # else: already in [in, out] layout — no-op

    # Issue: the weight_scale transpose is not tied to the weight condition.
    # When weight is already in [in, out] layout, weight_scale is still transposed.
    weight_scale = layer.weight_scale.t().contiguous()
    replace_parameter(layer, "weight_scale", weight_scale.data)

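Comment Highlights

weight_scale transpose runs unconditionally (correctness)

gemini-code-assist[bot] pointed out that the weight_scale transpose is unconditional, while the weight transpose runs only for a specific layout. If weight is already in [in, out] layout, weight is not transposed but weight_scale still is, which can produce a wrong shape.

Conclusion: unresolved. The PR was approved and merged without the issue being fixed.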

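Risk and Impact

The main risk is the mismatched condition on the weight_scale transpose: if weight is not transposed (because it is already in [in, out] layout), weight_scale is still transposed, which leads to a shape mismatch. With a 1-D weight_scale (per-channel quantization), .t() has no effect, but a 2-D weight_scale changes shape and may end up wrong. The reviewer raised this risk, and it has not been fixed.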
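To make the risk concrete, here is a shape-only model of the merged behaviour, written without torch. `merged_behaviour` and `scale_fits` are hypothetical helpers, the shapes are illustrative, and the broadcast check is an assumed stand-in for whatever the XPU kernel actually requires.

```python
Shape = tuple[int, ...]


def merged_behaviour(
    weight: Shape,
    weight_contiguous: bool,
    weight_scale: Shape,
    out_features: int,
    in_features: int,
) -> tuple[Shape, Shape]:
    """Model of the merged logic: weight is transposed conditionally,
    weight_scale is transposed every time."""
    if weight == (out_features, in_features) and (
        in_features != out_features or weight_contiguous
    ):
        weight = (weight[1], weight[0])
    if len(weight_scale) == 2:  # .t() leaves 0-D and 1-D shapes unchanged
        weight_scale = (weight_scale[1], weight_scale[0])
    return weight, weight_scale


def scale_fits(weight: Shape, weight_scale: Shape) -> bool:
    """Assumed criterion: a 2-D scale must broadcast against the weight."""
    if len(weight_scale) < 2:
        return True  # not affected by the transpose, so out of scope here
    return all(s in (1, w) for s, w in zip(weight_scale, weight))


cases = {
    "loaded as [out, in]": ((8, 4), True, (8, 1)),
    "already [in, out]": ((4, 8), True, (1, 8)),
    "already [in, out], 1-D scale": ((4, 8), True, (8,)),
}
for name, (w, contiguous, s) in cases.items():
    new_w, new_s = merged_behaviour(w, contiguous, s, out_features=8, in_features=4)
    print(name, new_w, new_s, scale_fits(new_w, new_s))
```

In this model only the second case ends up inconsistent: weight stays at (4, 8) while the scale flips from (1, 8) to (8, 1).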

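The impact is limited to FP8 quantized model inference on the XPU platform. Affected users are vLLM users who run on Intel GPUs with FP8 quantization enabled. The fix aims to keep weight_scale aligned with the layout of weight, so that the GEMM kernel does not crash or return wrong results because of a shape error.

Risk tags: condition mismatch, known unresolved issue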

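Related Issues

No related issue identified

No explicitly linked issue has been detected so far. Linked issues will appear here once the related references are synced.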

Related Context

No clearly related PRs so far
