#39391 fix: clamp NaN/Inf in topk_softmax to prevent duplicate expert IDs

Original PR · Author: jhaotingc · Merged: 2026-04-21 19:04 · Files changed: 2 · Commits: 5 · Comments: 20 · Lines changed: +86 / -2

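Executive Summary

Fixes NaN/Inf handling in the MoE topk_softmax kernels so that duplicate expert IDs are no longer generated under CUDA graphs, which previously led to illegal memory accesses.

This resolves the CUDA illegal memory access crash described in issue #39244. The root cause: during CUDA graph replay, the hidden states of padding tokens degenerate and produce NaN gating logits. topk_softmax then emits duplicate expert IDs, which in turn trigger a bug in the FlashInfer MoE sorting kernel.

This PR is worth a close read for how it handles numerical anomalies and for the design trade-offs it makes around MoE routing and CUDA graph integration.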

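Discussion Highlights

In review, gemini-code-assist[bot] pointed out that the fallback path also needed the fix, and the author confirmed it was already included. ZJY0516 asked whether the clamp was necessary, and the author explained that the fallback path is used when the number of experts is non-standard. tlrmchlsmth suggested using torch.nan_to_num, but the PR opted for a direct clamp to keep overhead low.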

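Implementation Breakdown

Core kernel fix: adds a NaN/Inf clamp to the topkGatingSoftmax warp kernel in csrc/moe/topk_softmax_kernels.cu, setting NaN/Inf values to 0 so that the argmax loop no longer selects expert 0 on every iteration.
Fallback path coverage: applies the same change to the moeSoftmax and moeSigmoid kernels, which handle non-standard expert counts (not a power of 2 or a multiple of 64), so that the fix is complete.
Regression test: adds test_fused_topk_nan_inf_clamp to tests/kernels/moe/test_fused_topk.py, parametrized over several data types, scoring functions, and bad values, to verify the clamp and the uniqueness of expert IDs.
Performance validation: microbenchmarks and end-to-end tests confirm that the fix adds no performance overhead and resolves the crash under high concurrency.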
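
The failure mode and the fix can be illustrated with a minimal pure-Python sketch. This is a simplified model, not the CUDA kernel: `clamp_non_finite`, `softmax`, `iterative_topk`, and the mask value are hypothetical names and choices made for illustration only. The underlying point is that every comparison against NaN is false, so a repeated-argmax loop never moves off index 0.

```python
import math


def clamp_non_finite(logits):
    """Replace NaN/Inf with 0.0; finite values pass through unchanged."""
    return [x if math.isfinite(x) else 0.0 for x in logits]


def softmax(logits):
    """Numerically stable softmax; NaN or Inf inputs turn the whole row into NaN."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def iterative_topk(scores, k):
    """Select k indices by repeated argmax, masking each winner afterwards."""
    scores = list(scores)
    picked = []
    for _ in range(k):
        best_idx, best_val = 0, scores[0]
        for i in range(1, len(scores)):
            # `NaN > x` and `x > NaN` are both False, so with NaN scores
            # best_idx stays at 0 on every pass.
            if scores[i] > best_val:
                best_idx, best_val = i, scores[i]
        picked.append(best_idx)
        scores[best_idx] = -1.0  # mask the winner (probabilities are >= 0)
    return picked


nan_row = [math.nan] * 8

# Without the clamp: expert 0 is chosen every time (duplicate IDs).
print(iterative_topk(softmax(nan_row), 4))                    # [0, 0, 0, 0]

# With the clamp: the row becomes uniform and the IDs are distinct.
print(iterative_topk(softmax(clamp_non_finite(nan_row)), 4))  # [0, 1, 2, 3]
```

In this model an all-Inf row behaves the same way as an all-NaN row, because the softmax computes `inf - inf`, which is NaN.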

File	Module	Status	Importance
`csrc/moe/topk_softmax_kernels.cu`	MoE kernels	modified	4.52
`tests/kernels/moe/test_fused_topk.py`	Fused operator tests	modified	5.7

Key Symbols

topkGatingSoftmax moeSoftmax moeSigmoid test_fused_topk_nan_inf_clamp

Key Source Snippets

tests/kernels/moe/test_fused_topk.py test-coverage

Adds a regression test that verifies the NaN/Inf clamp across different parameter combinations.

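# Regression test: checks the NaN/Inf clamp in the topk_softmax kernels.
# Excerpt only: the pytest parametrization is not shown. The import path,
# num_tokens, and hidden_size below are assumptions added for completeness.
import torch

from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk  # assumed path


def test_fused_topk_nan_inf_clamp(
    num_experts: int,
    topk: int,
    scoring_func: str,
    bad_value: float,  # the bad value may be NaN or Inf
    dtype: torch.dtype,
):
    """
    Simulate the NaN/Inf gating output produced by padding tokens and verify
    that, after clamping, expert IDs are unique and weights are finite.
    """
    num_tokens, hidden_size = 4, 128  # illustrative sizes, not taken from the PR
    hidden_states = torch.randn((num_tokens, hidden_size), dtype=dtype, device="cuda")

    # Build a gating_output in which only some rows contain bad values
    gating_output = torch.randn((num_tokens, num_experts), dtype=dtype, device="cuda")
    gating_output[1:, :] = bad_value  # rows 2 onward are all NaN or Inf

    # Call the patched fused_topk kernel
    topk_weights, topk_ids, _ = fused_topk(
        hidden_states=hidden_states,
        gating_output=gating_output,
        topk=topk,
        renormalize=False,
        scoring_func=scoring_func,
    )

    # Bad-value rows must yield unique expert IDs and finite weights.
    # (The comparison of the normal row against a reference is not shown here.)
    for row in range(1, num_tokens):
        row_ids = topk_ids[row]
        assert row_ids.unique().numel() == topk, f"Row {row} has duplicate expert IDs"
        assert torch.isfinite(topk_weights[row]).all(), f"Row {row} has non-finite weights"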

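Comment Highlights

Fallback path fix (design)

gemini-code-assist[bot] pointed out that the fallback path (moeSoftmax/moeSigmoid) also needed the fix, so that an omission would not leave a similar problem behind.

Conclusion: the author confirmed that the fallback path fix was already included, making the fix complete. · Resolved

Choice of clamp implementation (design)

tlrmchlsmth suggested using torch.nan_to_num instead of a direct clamp, to make the fix less fragile.

Conclusion: the PR chose a direct clamp to keep the performance overhead low, and because a kernel-level fix is more direct. · Resolved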
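
For comparison, the torch.nan_to_num alternative raised in review might look like the following at the Python level. This is a sketch under assumptions: `sanitized_topk` is a hypothetical helper, not code from the PR, and plain `torch.softmax` / `torch.topk` stand in for the fused kernel. Passing `posinf=0.0` and `neginf=0.0` is a choice made here to mirror the clamp-to-0 behavior described above; by default torch.nan_to_num maps infinities to the largest and smallest finite values of the dtype.

```python
import torch


def sanitized_topk(gating_output: torch.Tensor, topk: int):
    """Hypothetical Python-level alternative: scrub the logits, then route."""
    # Map NaN, +Inf, and -Inf to 0.0; finite values are left unchanged.
    clean = torch.nan_to_num(gating_output, nan=0.0, posinf=0.0, neginf=0.0)
    scores = torch.softmax(clean.float(), dim=-1)
    weights, ids = torch.topk(scores, k=topk, dim=-1)
    return weights, ids


# CPU tensors keep the sketch runnable without a GPU.
logits = torch.randn(4, 8)
logits[1, :] = float("nan")   # fully NaN row
logits[2, :] = float("inf")   # fully Inf row
logits[3, 0] = float("-inf")  # a single bad entry in an otherwise normal row

weights, ids = sanitized_topk(logits, topk=2)
```

Because torch.nan_to_num as called here returns a new tensor, this variant adds an extra elementwise pass over the logits before routing, whereas an in-kernel clamp folds the check into work the kernel already does. That is one way to read the overhead argument in the conclusion above.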

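Risks and Impact

Risk is low: the clamp only takes effect when the input is NaN/Inf and leaves normal inputs unchanged, and the performance overhead is negligible. All kernel paths must be covered, however, so that an omission does not reintroduce a similar problem.

The change affects users running MoE models (such as Qwen3.5-397B) with CUDA graphs, especially under high concurrency. With the fix, the CUDA illegal memory access crash no longer occurs, which improves service stability and reliability.

Risk flags: core path change, numerical stability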

Related Issue

#39244 [Bug]: CUDA illegal memory access with FlashInfer MoE FP8 on Qwen3.5-397B (num_tokens > 256)

Full Report

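Key files:

csrc/moe/topk_softmax_kernels.cu (module: MoE kernels; category: source; type: core-logic; symbols: topkGatingSoftmax, moeSoftmax, moeSigmoid): the core kernel file; adds the NaN/Inf clamp logic that prevents duplicate expert IDs from being generated.
tests/kernels/moe/test_fused_topk.py (module: fused operator tests; category: test; type: test-coverage; symbols: test_fused_topk_nan_inf_clamp): adds a regression test that verifies the NaN/Inf clamp across different parameter combinations.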

Related Context

No clearly related PRs at this time.
