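Executive summary

Add no_combine support to cutlass_moe_fp4

The goal is to let FP4 MoE run with TP1 in no_combine mode, which EP dispatch patterns need. The triton path already supports no_combine, but the cutlass path lacked it. The PR body states: "Enables FP4 MoE with TP1 in no-combine mode (needed for EP dispatch patterns)."

Worth a close read. This PR is a typical API-alignment change: it brings the cutlass path in line with the triton path on the no_combine feature. The change is small but its purpose is clear, and it is an important building block for expert parallelism with FP4 MoE.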

Discussion highlights

The PR has no review comments, only CI run commands triggered by the author. There is no discussion to summarize.

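Implementation breakdown

cutlass_moe_fp4 gains a no_combine parameter: in python/sglang/srt/layers/moe/cutlass_moe.py, the function signature adds no_combine: bool = False. At the end of the function, after the second GEMM and the shuffle/reshape, a new conditional branch returns the per-expert output of shape (m, num_topk, hidden_size) directly when no_combine=True, skipping the subsequent weight multiplication and sum.
ModelOptNvFp4FusedMoEMethod passes no_combine through: in python/sglang/srt/layers/quantization/modelopt_quant.py, the call to cutlass_moe_fp4 adds no_combine=moe_runner_config.no_combine, which propagates the upper-level configuration down to the lower-level function.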

| File | Module | Status | Importance score |
| --- | --- | --- | --- |
| `python/sglang/srt/layers/moe/cutlass_moe.py` | MoE layer | modified | 5.56 |
| `python/sglang/srt/layers/quantization/modelopt_quant.py` | Quantization layer | modified | 4.96 |

Key symbols

cutlass_moe_fp4

Key source snippets

python/sglang/srt/layers/moe/cutlass_moe.py core-logic

Core file: adds the no_combine parameter and an early return at the end of the function.

# Signature and key tail logic of cutlass_moe_fp4
# The signature gains a no_combine parameter that defaults to False
def cutlass_moe_fp4(
    a: torch.Tensor,
    ...
    apply_router_weight_on_input: bool = False,
    no_combine: bool = False,  # <-- new parameter; the False default keeps backward compatibility
):
    ...
    # The two GEMMs and the activation have run; c2 holds each expert's output
    # After the shuffle and view below, c2 has shape (m_a, num_topk, params.hidden_size)
    c2 = shuffle_rows(c2, c_map, (m_a * num_topk, params.hidden_size))
    c2 = c2.view(m_a, num_topk, params.hidden_size)
    # no_combine branch: return the per-expert result directly, skipping the weight multiply and sum
    if no_combine:
        return c2.to(out_dtype)
    # Normal path: apply the top-k weights and combine
    if not apply_router_weight_on_input:
        c2 = c2 * topk_weights.view(m_a, num_topk, 1).to(out_dtype)
    return c2.sum(dim=1).to(out_dtype)
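
To make the two return shapes concrete, the following is a minimal sketch of the tail logic on plain Python lists, so it runs without torch or a GPU. The helper name `moe_tail` and its list-based arguments are assumptions made for illustration; they are not part of sglang, and the real function operates on tensors.

```python
from typing import List

Vec = List[float]


def moe_tail(
    per_expert: List[List[Vec]],      # shape (m, num_topk, hidden_size)
    topk_weights: List[List[float]],  # shape (m, num_topk)
    apply_router_weight_on_input: bool = False,
    no_combine: bool = False,
):
    """Toy stand-in for the end of cutlass_moe_fp4 (hypothetical helper)."""
    if no_combine:
        # Early return: unweighted per-expert rows, nothing is summed.
        return per_expert
    combined = []
    for rows, weights in zip(per_expert, topk_weights):
        if apply_router_weight_on_input:
            # Weights were already applied to the input, so do not reapply them.
            weights = [1.0] * len(rows)
        hidden = len(rows[0])
        combined.append(
            [sum(w * row[h] for w, row in zip(weights, rows)) for h in range(hidden)]
        )
    return combined  # shape (m, hidden_size)


# One token, two selected experts, hidden size 3.
per_expert = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
topk_weights = [[0.5, 0.5]]

raw = moe_tail(per_expert, topk_weights, no_combine=True)  # shape (1, 2, 3)
merged = moe_tail(per_expert, topk_weights)                # shape (1, 3)
```

In this toy, `raw` keeps one row per selected expert, while `merged` collapses the top-k axis into a single weighted row per token.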

python/sglang/srt/layers/quantization/modelopt_quant.py data-contract

Caller-side change that passes the no_combine argument through.

# Call site inside ModelOptNvFp4FusedMoEMethod
output = cutlass_moe_fp4(
    a=x,
    ...
    apply_router_weight_on_input=moe_runner_config.apply_router_weight_on_input,
    no_combine=moe_runner_config.no_combine,  # <-- newly forwarded argument
).to(x.dtype)
return StandardCombineInput(hidden_states=output)
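
The forwarding pattern can be sketched in isolation as follows. `RunnerConfigSketch`, `fake_kernel`, and `apply_moe` are hypothetical names introduced only for this illustration; the real configuration object and kernel have more fields and parameters than are modeled here.

```python
from dataclasses import dataclass


@dataclass
class RunnerConfigSketch:
    """Hypothetical stand-in for moe_runner_config; only two fields are modeled."""
    apply_router_weight_on_input: bool = False
    no_combine: bool = False


def fake_kernel(a, apply_router_weight_on_input=False, no_combine=False):
    """Hypothetical stand-in for cutlass_moe_fp4 that only reports which mode ran."""
    return "per-expert" if no_combine else "combined"


def apply_moe(x, config: RunnerConfigSketch):
    # Forward both flags explicitly so the kernel never silently uses its defaults.
    return fake_kernel(
        a=x,
        apply_router_weight_on_input=config.apply_router_weight_on_input,
        no_combine=config.no_combine,
    )


default_mode = apply_moe([1.0], RunnerConfigSketch())
ep_mode = apply_moe([1.0], RunnerConfigSketch(no_combine=True))
```

Because both the config field and the kernel parameter default to False in this sketch, a caller that never sets no_combine gets the combined path.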

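Comment highlights

No high-value discussion threads were identified

The comment section has not yet produced a clear point of contention or conclusion; any later discussion will be reflected here.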

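Risk and impact

Risk is low. The change only adds one boolean parameter that defaults to False, so existing behavior is fully backward compatible. The new control flow is a simple early return and does not touch the normal path. One caveat: if moe_runner_config.no_combine is set incorrectly upstream, the returned shape will not match what the caller expects, which matters most when the caller expects the combined output. Because the parameter already exists on the triton path and is an explicit configuration option, the risk is manageable.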
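
One way a caller could catch a misconfigured flag early is a shape check right after the kernel call. The sketch below is an assumption for illustration, not code from the PR: `check_moe_output_shape` is a hypothetical helper, and the expected shapes are taken from the snippet above, where the no_combine branch returns (m, num_topk, hidden_size) and the summed path returns (m, hidden_size).

```python
from typing import Sequence


def check_moe_output_shape(
    shape: Sequence[int],
    m: int,
    num_topk: int,
    hidden_size: int,
    no_combine: bool,
) -> None:
    """Raise ValueError if the output shape does not match the requested mode."""
    expected = (m, num_topk, hidden_size) if no_combine else (m, hidden_size)
    if tuple(shape) != expected:
        raise ValueError(
            f"MoE output shape {tuple(shape)} does not match {expected} "
            f"for no_combine={no_combine}"
        )


# Both calls pass silently because the shape matches the requested mode.
check_moe_output_shape((4, 2, 8), m=4, num_topk=2, hidden_size=8, no_combine=True)
check_moe_output_shape((4, 8), m=4, num_topk=2, hidden_size=8, no_combine=False)
```

With a tensor output, the first argument could be `tuple(output.shape)`.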

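Impact is limited. Only MoE layers that use FP4 quantization are affected, and behavior changes only when no_combine is enabled. Existing model inference is unaffected because the default behavior is unchanged. The change paves the way for future support of EP dispatch patterns.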

Related issues

No related issue identified

No explicitly linked issue has been detected so far; one will appear here once a reference is synced.

Full report

Implementation breakdown

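Key files:

python/sglang/srt/layers/moe/cutlass_moe.py (module: MoE layer; category: source; type: core-logic; symbol: cutlass_moe_fp4): core file that adds the no_combine parameter and an early return at the end of the function.
python/sglang/srt/layers/quantization/modelopt_quant.py (module: quantization layer; category: source; type: data-contract): caller-side change that passes the no_combine argument through.

Key symbols: cutlass_moe_fp4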

Risk and impact

Risk flags: none

Related context

No clearly related PRs

#25688 Add no_combine support to cutlass_moe_fp4
