#42201 [Bugfix] Fix int32 overflow in DeepGEMM SiLU/mul FP8 Triton kernel

原始 PR 作者 Flink-ddd 合并时间 2026-05-12 02:52 文件变更 1 提交数 3 评论 7 代码增减 +3 / -3

执行摘要

修复 DeepGEMM SiLU/mul FP8 内核 int32 溢出

修复 issue #42173：DeepGEMM MoE 预热场景下，SiLU/mul FP8 量化 Triton 内核因 int32 溢出导致 CUDA illegal memory access。例如 DPEP=16、每 rank 36000 token 时，最大偏移量达 18,882,756,607，远超 int32 上限。

建议精读该 PR，了解 Triton kernel 中 int32 溢出的典型模式及修复方式，可作为后续类似问题的参考。

讨论亮点

Review 中 gemini-code-assist[bot] 指出：int64 的转换应作用于乘法前的操作数 pid_m，而非乘法结果 (pid_m * BLOCK_M).to(tl.int64)，以避免中间结果在 int32 下溢出。后续提交采纳了该建议，修正为 pid_m.to(tl.int64) * BLOCK_M。yewentao256 和 zyongye 均 LGTM 并批准。

实现拆解

修改 _silu_mul_quant_fp8_packed_kernel 中的偏移计算：在 fp8_utils.py 第 174 行，将 m_offset = pid_m * BLOCK_M 改为 m_offset = pid_m.to(tl.int64) * BLOCK_M，确保乘法结果为 int64 后参与后续指针运算。
修改 _silu_mul_per_token_group_quant_fp8_colmajor 中的偏移计算：在相同文件第 324-325 行，将 m_offset = pid_m * BLOCK_M 和 n_offset = pid_n * BLOCK_N 分别改为 m_offset = pid_m.to(tl.int64) * BLOCK_M 和 n_offset = pid_n.to(tl.int64) * BLOCK_N，统一提升至 int64。
测试验证：通过单卡最小复现脚本（M=524416, N=4096）验证修复后内核正确执行，输出 shape 与 scale shape 符合预期，无 illegal memory access 错误。

文件	模块	状态	重要度
`vllm/model_executor/layers/quantization/utils/fp8_utils.py`	量化层	modified	6.37

关键符号

_silu_mul_quant_fp8_packed_kernel _silu_mul_per_token_group_quant_fp8_colmajor

关键源码片段

vllm/model_executor/layers/quantization/utils/fp8_utils.py core-logic

包含两个被修复的 Triton kernel：`_silu_mul_quant_fp8_packed_kernel` 和 `_silu_mul_per_token_group_quant_fp8_colmajor`，修复 int32 溢出问题。

# file: vllm/model_executor/layers/quantization/utils/fp8_utils.py
# 修复前：m_offset = pid_m * BLOCK_M → int32 可能溢出
# 修复后：先转 int64 再乘，确保安全

@triton.jit
def _silu_mul_quant_fp8_packed_kernel(...):
    pid_pack = tl.program_id(0)
    pid_m = tl.program_id(1)
    # 关键修复：将 pid_m 转为 tl.int64 后再乘以 BLOCK_M
    m_offset = pid_m.to(tl.int64) * BLOCK_M
    if m_offset >= M:
        return
    # ... 后续指针运算基于 int64 偏移量，安全

@triton.jit
def _silu_mul_per_token_group_quant_fp8_colmajor(...):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # 两个偏移量均先转 int64 再乘，避免乘积溢出
    m_offset = pid_m.to(tl.int64) * BLOCK_M
    n_offset = pid_n.to(tl.int64) * BLOCK_N
    if m_offset >= M:
        return
    # ... 后续指针运算安全

评论区精华

int64 转换时机：应在乘法前还是乘法后？ 正确性

gemini-code-assist[bot] 指出原始写法 `(pid_m * BLOCK_M).to(tl.int64)` 在乘法时仍以 int32 计算，可能先溢出；建议改为 `pid_m.to(tl.int64) * BLOCK_M` 以确保安全。

结论：采纳建议，将转换提前到乘法之前。 · 已解决

风险与影响

低风险。仅在 Triton kernel 内偏移计算路径改动，且仅对超出 int32 范围的超大 M 生效；常规小规模场景无性能影响。但需注意 pid_m.to(tl.int64) 在 Triton 中会引入一次类型转换开销，但微乎其微。

影响范围限于使用 DeepGEMM MoE 且配置大 DPEP 或高 token 数（M > 约 524288）的推理场景。修复前这些场景会崩溃，修复后可正常运行。不影响其他量化路径或小规模模型。

核心路径变更缺少测试覆盖

关联 Issue

#42173 DeepGEMM SiLU/mul FP8 quant Triton kernel overflows int32 addresses for large DPEP warmup shapes

完整报告

执行摘要

一句话：修复 DeepGEMM SiLU/mul FP8 内核 int32 溢出
推荐动作：建议精读该 PR，了解 Triton kernel 中 int32 溢出的典型模式及修复方式，可作为后续类似问题的参考。

功能与动机

实现拆解

修改 _silu_mul_quant_fp8_packed_kernel 中的偏移计算：在 fp8_utils.py 第 174 行，将 m_offset = pid_m * BLOCK_M 改为 m_offset = pid_m.to(tl.int64) * BLOCK_M，确保乘法结果为 int64 后参与后续指针运算。
修改 _silu_mul_per_token_group_quant_fp8_colmajor 中的偏移计算：在相同文件第 324-325 行，将 m_offset = pid_m * BLOCK_M 和 n_offset = pid_n * BLOCK_N 分别改为 m_offset = pid_m.to(tl.int64) * BLOCK_M 和 n_offset = pid_n.to(tl.int64) * BLOCK_N，统一提升至 int64。
测试验证：通过单卡最小复现脚本（M=524416, N=4096）验证修复后内核正确执行，输出 shape 与 scale shape 符合预期，无 illegal memory access 错误。

关键文件：

vllm/model_executor/layers/quantization/utils/fp8_utils.py（模块量化层；类别 source；类型 core-logic；符号 _silu_mul_quant_fp8_packed_kernel, _silu_mul_per_token_group_quant_fp8_colmajor）: 包含两个被修复的 Triton kernel：_silu_mul_quant_fp8_packed_kernel 和 _silu_mul_per_token_group_quant_fp8_colmajor，修复 int32 溢出问题。

关键符号：_silu_mul_quant_fp8_packed_kernel, _silu_mul_per_token_group_quant_fp8_colmajor

关键源码片段

`vllm/model_executor/layers/quantization/utils/fp8_utils.py`

包含两个被修复的 Triton kernel：_silu_mul_quant_fp8_packed_kernel 和 _silu_mul_per_token_group_quant_fp8_colmajor，修复 int32 溢出问题。

# file: vllm/model_executor/layers/quantization/utils/fp8_utils.py
# 修复前：m_offset = pid_m * BLOCK_M → int32 可能溢出
# 修复后：先转 int64 再乘，确保安全

@triton.jit
def _silu_mul_quant_fp8_packed_kernel(...):
    pid_pack = tl.program_id(0)
    pid_m = tl.program_id(1)
    # 关键修复：将 pid_m 转为 tl.int64 后再乘以 BLOCK_M
    m_offset = pid_m.to(tl.int64) * BLOCK_M
    if m_offset >= M:
        return
    # ... 后续指针运算基于 int64 偏移量，安全

@triton.jit
def _silu_mul_per_token_group_quant_fp8_colmajor(...):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # 两个偏移量均先转 int64 再乘，避免乘积溢出
    m_offset = pid_m.to(tl.int64) * BLOCK_M
    n_offset = pid_n.to(tl.int64) * BLOCK_N
    if m_offset >= M:
        return
    # ... 后续指针运算安全

评论区精华

int64 转换时机：应在乘法前还是乘法后？ (correctness): 采纳建议，将转换提前到乘法之前。

风险与影响

风险：低风险。仅在 Triton kernel 内偏移计算路径改动，且仅对超出 int32 范围的超大 M 生效；常规小规模场景无性能影响。但需注意 pid_m.to(tl.int64) 在 Triton 中会引入一次类型转换开销，但微乎其微。
影响：影响范围限于使用 DeepGEMM MoE 且配置大 DPEP 或高 token 数（M > 约 524288）的推理场景。修复前这些场景会崩溃，修复后可正常运行。不影响其他量化路径或小规模模型。
风险标记：核心路径变更, 缺少测试覆盖

关联脉络

PR #40408 [Perf] Batch invariance with Cutlass fp8 support, 28.9% E2E latency improvement: 同样涉及 FP8 量化内核的优化，本 PR 修复了该 PR 引入的 FP8 路径在超大规模下的崩溃问题。
PR #41812 [ROCm][DSv4] implement flash sparse mla with triton kernels: 同为 Triton 内核相关 PR，显示内核开发中需关注数据类型与溢出边界。

#42201 [Bugfix] Fix int32 overflow in DeepGEMM SiLU/mul FP8 Triton kernel

执行摘要

修复 DeepGEMM SiLU/mul FP8 内核 int32 溢出

实现拆解

评论区精华

风险与影响

关联 Issue

完整报告

参与讨论