#25025 dp: refactor idle batch logic

原始 PR 作者 happierpig 合并时间 2026-05-27 05:22 文件变更 1 提交数 1 评论 8 代码增减 +10 / -2

执行摘要

重构 idle batch 逻辑并修复 dp=1 场景问题

PR body 指出：在 dp=1 且 skip all-gather 的场景下，不需要 idle batch，但仍需 require_attn_tp_gather 来处理 attention-tp-size 填充。原有逻辑将 idle batch 创建与同步信息更新耦合，导致 dp=1 时无法触发 token 填充到 attention tp 倍数。

建议精读该 PR，尤其是讨论中关于 idle batch 与 _update_gather_batch 解耦的设计权衡。虽然改动较小，但涉及对 dp attention 同步机制的理解，对维护多 GPU 并行策略的工程师有参考价值。

讨论亮点

Review 中 ch-wan 与 author happierpig 围绕 need_idle_batch 为 false 时是否仍要调用 _update_gather_batch 进行了深入讨论。ch-wan 指出，对 prebuilt batch 创建 idle batch_to_gather 后无条件调用 _update_gather_batch 可能触发 PD disagg 中非预期的 forward 路径。happierpig 回应称，此举是为了保证 ScheduleBatch.global_num_tokens 不为 None，从而触发 model_runner 中的 token 填充逻辑（padding to multiple of attention tp size），而原有逻辑在 skip_all_gather=true + dp_size=1 时丢失了该填充行为。最终 ch-wan 认可并 approve。

实现拆解

修改 idle batch 判断逻辑：在 python/sglang/srt/managers/scheduler_components/dp_attn.py 中，将原来的 need_idle_batch = skip_all_gather or max(mlp_sync_info.global_num_tokens) > 0 拆分为对 skip_all_gather 和 dp_size 的显式判断：当 skip_all_gather 为 true 时，仅当 dp_size > 1 时才需要 idle batch；否则沿用原来的 max(global_num_tokens) > 0 条件。
解耦 _update_gather_batch 调用：将 _update_gather_batch 从 if need_idle_batch 块中移出，改为在 batch_to_gather is not None 时无条件调用。这样即使不需要 idle batch，也能正确传递 mlp_sync_info 以触发 attention tp 填充。
调整 batch_to_gather 的默认值：将 batch_to_gather = local_batch 提前到 if need_idle_batch 之前，确保无论是否需要 idle batch，batch_to_gather 始终有值。

文件	模块	状态	重要度
`python/sglang/srt/managers/scheduler_components/dp_attn.py`	调度器	modified	6.18

关键符号

prepare_mlp_sync_batch_raw

关键源码片段

python/sglang/srt/managers/scheduler_components/dp_attn.py core-logic

核心变更文件，重构了 idle batch 创建与同步信息更新逻辑，修复 dp=1 场景下的 attention tp 填充问题。

# python/sglang/srt/managers/scheduler_components/dp_attn.py

def prepare_mlp_sync_batch_raw(
    local_batch: Optional[ScheduleBatch],
    dp_size: int,
    attn_tp_size: int,
    ...
) -> Optional[ScheduleBatch]:
    # ... 前面创建 mlp_sync_info 和 all_gather 操作 ...

    # 判断是否需要 idle batch
    if skip_all_gather:
        # dp=1 时不需要 idle batch，但仍需通过 _update_gather_batch 触发 attn tp 填充
        need_idle_batch = dp_size > 1
    else:
        need_idle_batch = max(mlp_sync_info.global_num_tokens) > 0

    # 先设置 batch_to_gather 为 local_batch，后续再视情况替换为 idle batch
    batch_to_gather = local_batch
    if need_idle_batch:
        if local_batch is None:
            batch_to_gather = local_batch = get_idle_batch()
        elif local_batch.forward_mode.is_prebuilt():
            # 对于 prebuilt batch，内部添加一个 idle batch 用于 MLP 同步
            batch_to_gather = local_batch.inner_idle_batch = get_idle_batch()

    # 无论是否需要 idle batch，都调用 _update_gather_batch 以传递同步信息
    # 确保 require_attn_tp_gather 等场景下 token 被正确填充到 attention tp 倍数
    if batch_to_gather is not None:
        _update_gather_batch(
            batch_to_gather, mlp_sync_info, require_mlp_tp_gather, skip_all_gather
        )

    # 后续 metrics 记录与返回 ...

评论区精华

解耦 idle batch 创建与 _update_gather_batch 调用 设计

ch-wan 质疑修改后 prebuilt batch 会无条件创建 idle batch 并调用 _update_gather_batch，担心影响 PD disagg 的 forward 路径。

结论：happierpig 解释修改目的是为了在 dp=1 时仍能触发 token 填充到 attention tp 倍数，且该行为是必要的。ch-wan 最终认可并 approve。 · 已解决

风险与影响

主要风险在于对 prebuilt batch 创建 idle batch 后无条件调用 _update_gather_batch，可能影响 PD disagg 中的 forward 路径（如 decode.py 中 L1642 附近）。需要确认该路径下 batch_to_gather 是否为 idle batch 且不会造成额外副作用。另外，仅修改了一个文件，影响范围有限，但缺少直接相关的测试覆盖。

影响范围局限于 dp=1 + skip_all_gather 场景以及需要 attention tp gather 的情况。对 dp>1 及非 skip_all_gather 场景无行为变化。用户层面，修复了 dp=1 时 token 未正确填充到 attention tp 倍数的问题，可能提升模型推理正确性。

功能路径变更缺少测试覆盖

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：重构 idle batch 逻辑并修复 dp=1 场景问题
推荐动作：建议精读该 PR，尤其是讨论中关于 idle batch 与 _update_gather_batch 解耦的设计权衡。虽然改动较小，但涉及对 dp attention 同步机制的理解，对维护多 GPU 并行策略的工程师有参考价值。

功能与动机

实现拆解

修改 idle batch 判断逻辑：在 python/sglang/srt/managers/scheduler_components/dp_attn.py 中，将原来的 need_idle_batch = skip_all_gather or max(mlp_sync_info.global_num_tokens) > 0 拆分为对 skip_all_gather 和 dp_size 的显式判断：当 skip_all_gather 为 true 时，仅当 dp_size > 1 时才需要 idle batch；否则沿用原来的 max(global_num_tokens) > 0 条件。
解耦 _update_gather_batch 调用：将 _update_gather_batch 从 if need_idle_batch 块中移出，改为在 batch_to_gather is not None 时无条件调用。这样即使不需要 idle batch，也能正确传递 mlp_sync_info 以触发 attention tp 填充。
调整 batch_to_gather 的默认值：将 batch_to_gather = local_batch 提前到 if need_idle_batch 之前，确保无论是否需要 idle batch，batch_to_gather 始终有值。

关键文件：

python/sglang/srt/managers/scheduler_components/dp_attn.py（模块调度器；类别 source；类型 core-logic；符号 prepare_mlp_sync_batch_raw）: 核心变更文件，重构了 idle batch 创建与同步信息更新逻辑，修复 dp=1 场景下的 attention tp 填充问题。

关键符号：prepare_mlp_sync_batch_raw

关键源码片段

`python/sglang/srt/managers/scheduler_components/dp_attn.py`

核心变更文件，重构了 idle batch 创建与同步信息更新逻辑，修复 dp=1 场景下的 attention tp 填充问题。

# python/sglang/srt/managers/scheduler_components/dp_attn.py

def prepare_mlp_sync_batch_raw(
    local_batch: Optional[ScheduleBatch],
    dp_size: int,
    attn_tp_size: int,
    ...
) -> Optional[ScheduleBatch]:
    # ... 前面创建 mlp_sync_info 和 all_gather 操作 ...

    # 判断是否需要 idle batch
    if skip_all_gather:
        # dp=1 时不需要 idle batch，但仍需通过 _update_gather_batch 触发 attn tp 填充
        need_idle_batch = dp_size > 1
    else:
        need_idle_batch = max(mlp_sync_info.global_num_tokens) > 0

    # 先设置 batch_to_gather 为 local_batch，后续再视情况替换为 idle batch
    batch_to_gather = local_batch
    if need_idle_batch:
        if local_batch is None:
            batch_to_gather = local_batch = get_idle_batch()
        elif local_batch.forward_mode.is_prebuilt():
            # 对于 prebuilt batch，内部添加一个 idle batch 用于 MLP 同步
            batch_to_gather = local_batch.inner_idle_batch = get_idle_batch()

    # 无论是否需要 idle batch，都调用 _update_gather_batch 以传递同步信息
    # 确保 require_attn_tp_gather 等场景下 token 被正确填充到 attention tp 倍数
    if batch_to_gather is not None:
        _update_gather_batch(
            batch_to_gather, mlp_sync_info, require_mlp_tp_gather, skip_all_gather
        )

    # 后续 metrics 记录与返回 ...

评论区精华

解耦 idle batch 创建与 _update_gather_batch 调用 (design): happierpig 解释修改目的是为了在 dp=1 时仍能触发 token 填充到 attention tp 倍数，且该行为是必要的。ch-wan 最终认可并 approve。

风险与影响

风险：主要风险在于对 prebuilt batch 创建 idle batch 后无条件调用 _update_gather_batch，可能影响 PD disagg 中的 forward 路径（如 decode.py 中 L1642 附近）。需要确认该路径下 batch_to_gather 是否为 idle batch 且不会造成额外副作用。另外，仅修改了一个文件，影响范围有限，但缺少直接相关的测试覆盖。
影响：影响范围局限于 dp=1 + skip_all_gather 场景以及需要 attention tp gather 的情况。对 dp>1 及非 skip_all_gather 场景无行为变化。用户层面，修复了 dp=1 时 token 未正确填充到 attention tp 倍数的问题，可能提升模型推理正确性。
风险标记：功能路径变更, 缺少测试覆盖

关联脉络

PR #26394 [PD] Fix cross-rank queue divergence by gating metadata readiness before all-reduce: 涉及 PD disagg 中的 forward 路径，与本 PR 讨论中提到的 PD disagg 风险相关。
PR #26148 Skip PP output communication for pure chunked prefill batches: 同为调度器优化，涉及 skip all-gather 和通信优化，与本 PR 场景相关。

#25025 dp: refactor idle batch logic

执行摘要

重构 idle batch 逻辑并修复 dp=1 场景问题

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论