#42387 [CI] Migrate remaining B200 jobs to b200-k8s with test fixes

Author: khluu · Merged: 2026-05-12 17:00 · Files changed: 5 · Commits: 2 · Comments: 4 · Lines changed: +31 / -7

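Executive summary

Migrates the last 3 B200 jobs to the b200-k8s queue

Completes the queue migration of the remaining B200 jobs, unifying the CI infrastructure. The PR body states: "Moves the last 3 device: b200 jobs to b200-k8s, with fixes for pre-existing test failures discovered during validation in #42356". The related PR #42356 had already migrated 4 B200 jobs that passed validation.

This PR is a pure CI infrastructure change with no production code modified. Its importance is low and it does not need a close read. Two follow-ups are worth tracking. First, the incomplete source_file_dependencies raised in review should be filled in by a later PR to avoid a blind spot in regression detection. Second, the DeepSeek MTP test that keeps failing on Blackwell needs further investigation, which may end in a separate bugfix or in disabling the test entirely.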

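Discussion highlights

Reviewer gemini-code-assist[bot] pointed out that source_file_dependencies is incomplete: the new job lm-eval-large-models-b200-ep has no dependency on the model files for Qwen3-Next and similar models, so changes to model code do not trigger this CI job and regressions may slip through. The reviewer suggested following lm-eval-qwen3-5-models-b200 when adding the dependencies. The comment was not resolved in the PR.
khluu reported that the DeepSeek MTP test is still failing: even with the skipif in place, the test fails at runtime. haosdent replied "i see", which suggests the problem may need further investigation.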

Implementation breakdown

Migrate the Spec Decode CI jobs to b200-k8s (.buildkite/test_areas/spec_decode.yaml): the device queue of the two jobs spec-decode-eagle-nightly-b200 and spec-decode-speculators-mtp-nightly-b200 changes from device: b200 to device: b200-k8s. Commands and dependencies are unchanged.
Migrate the LM Eval small-models job and split out the large-models job (.buildkite/test_areas/lm_eval.yaml):
- Change the device queue of lm-eval-small-models-b200 to b200-k8s.
- Add the job lm-eval-large-models-b200-ep with num_devices: 2. Its config list points to the new file models-blackwell-ep.txt and covers the 3 large EP models that previously crashed because a single GPU was not enough.
Adjust the model config files (tests/evals/gsm8k/configs/models-blackwell.txt and the new models-blackwell-ep.txt): the 3 models that need 2 GPUs (Qwen3-Next-80B, Qwen3-Next-FP8, Nemotron-120B) are removed from models-blackwell.txt and placed in the new models-blackwell-ep.txt.
Fix the skip conditions of the Spec Decode tests on Blackwell (tests/v1/e2e/spec_decode/test_spec_decode.py):
- test_eagle_correctness_light: add a skipif covering the DeepSeek Eagle parametrization, because Flash Attention does not support head_dim=192 on Blackwell (SM100/SM110).
- test_mtp_correctness: wrap the DeepSeek MTP parametrization in pytest.param and attach a skipif, because the TRTLLM MoE top_k check fails on Blackwell.
Correct the stated skip reason (second commit): the DeepSeek MTP skip reason changes from "CUDA graph compilation hang" to the accurate "flashinfer TRTLLM MoE routing check failure".
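
The model-list split described above amounts to partitioning models by how many GPUs they need. The following is a minimal sketch of that idea, not the PR's actual tooling: the `REQUIRED_GPUS` table, the `split_by_gpu_count` helper, and the `some-small-model` entry are hypothetical, and only the three 2-GPU model names come from the summary above.

```python
# Hypothetical GPU requirements. Only the three 2-GPU models are named in the
# PR summary; "some-small-model" is a placeholder, not an entry from the PR.
REQUIRED_GPUS = {
    "Qwen3-Next-80B": 2,
    "Qwen3-Next-FP8": 2,
    "Nemotron-120B": 2,
    "some-small-model": 1,
}


def split_by_gpu_count(models, required_gpus, single_gpu_limit=1):
    """Partition models into (fits within the limit, needs more GPUs)."""
    small, large = [], []
    for name in models:
        # Models without an entry are assumed to fit on one GPU.
        needed = required_gpus.get(name, 1)
        (small if needed <= single_gpu_limit else large).append(name)
    return small, large


# blackwell stands in for models-blackwell.txt,
# blackwell_ep for models-blackwell-ep.txt.
blackwell, blackwell_ep = split_by_gpu_count(list(REQUIRED_GPUS), REQUIRED_GPUS)
print(blackwell)
print(blackwell_ep)
```

Keeping the 2-GPU models in their own list lets the single-GPU job stay cheap while the new job requests num_devices: 2 only for the models that need it.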

File	Module	Status	Importance
`.buildkite/test_areas/spec_decode.yaml`	CI config	modified	3.45
`.buildkite/test_areas/lm_eval.yaml`	CI config	modified	4.03
`tests/v1/e2e/spec_decode/test_spec_decode.py`	Spec decode	modified	4.58
`tests/evals/gsm8k/configs/models-blackwell-ep.txt`	Eval config	added	2.47
`tests/evals/gsm8k/configs/models-blackwell.txt`	Eval config	modified	2.03

Key symbols

test_eagle_correctness_light, test_mtp_correctness

Key source snippets

tests/v1/e2e/spec_decode/test_spec_decode.py (test-coverage)

Fixes the skip conditions of the DeepSeek Eagle and MTP tests on Blackwell. This is the only change to test code.

# Excerpt only: single_gpu_only and the elided arguments are not shown here.
import pytest
from vllm.platforms import current_platform

# A skipif decorator is added in front of test_eagle_correctness_light.
# Reason: Flash Attention does not support head_dim=192 on Blackwell (SM100/SM110).
@single_gpu_only
@pytest.mark.skipif(
    current_platform.is_device_capability_family(100),
    reason="DeepSeek head_dim=192 not supported on SM100/SM110 (Blackwell)",
)
@pytest.mark.parametrize(...)
def test_eagle_correctness_light(...):
    ...

# In the parametrization of test_mtp_correctness, the DeepSeek MTP case is
# wrapped in pytest.param so that a skipif mark can be attached to it.
# Reason: the TRTLLM MoE top_k check fails on Blackwell.
pytest.param(
    ("mtp", "ZixiQi/DeepSeek-V3-4layers-MTP-FP8", 1),
    False,
    0.0,
    marks=pytest.mark.skipif(
        current_platform.is_device_capability_family(100),
        reason="DeepSeek MTP: TRTLLM MoE top_k check fails on Blackwell",
    ),
),  # dummy model
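
For readers unfamiliar with per-case marks, here is a minimal self-contained sketch of how `pytest.param` attaches a `skipif` to a single parametrized case while the other cases keep running. `ON_BLACKWELL` is a hypothetical stand-in for `current_platform.is_device_capability_family(100)`, and the model names and test body are placeholders rather than anything from the PR.

```python
import pytest

# Hypothetical stand-in for current_platform.is_device_capability_family(100).
ON_BLACKWELL = True

CASES = [
    # A plain tuple: this case always runs.
    ("model-a", 1),
    # Only this case carries the skip mark.
    pytest.param(
        "model-b",
        1,
        marks=pytest.mark.skipif(
            ON_BLACKWELL, reason="known failure on this platform"
        ),
    ),
]


@pytest.mark.parametrize("model,tp_size", CASES)
def test_case_runs(model, tp_size):
    # Placeholder body; a real test would launch the model here.
    assert tp_size == 1
```

A decorator-level `skipif`, by contrast, skips every parametrized case of the test at once, so the per-case form preserves more coverage when only one model is affected.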

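Comment highlights

New job is missing source_file_dependencies (design)

gemini-code-assist[bot] pointed out that the `source_file_dependencies` of the new `lm-eval-large-models-b200-ep` job is incomplete: it has no dependency on the model files for Qwen3-Next and similar models, so changes to model code do not trigger this CI job.

Conclusion: unresolved. The PR was merged without fixing the missing dependencies, which may need to be added in a follow-up PR. · unresolved

DeepSeek MTP test still fails on Blackwell (correctness)

khluu reported that the DeepSeek MTP test still fails even with the skipif added. haosdent acknowledged the report.

Conclusion: known but unresolved. The skipif was added but may not cover every failing scenario, so further investigation is needed. · unresolved

Claude Code automated comment (other)

claude[bot] posted an automated comment explaining that the repository is configured for manual code review and that '@claude review' triggers a review.

Conclusion: no substantive technical discussion. · resolved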

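Risks and impact

Regression detection blind spot (raised in review): the source_file_dependencies of the new lm-eval-large-models-b200-ep job contains only csrc/ and vllm/model_executor/layers/quantization, with no dependency on the model implementation files for Qwen3-Next, Nemotron, and similar models. Changes to model logic do not trigger the job automatically, so regressions may go undetected.
Reduced coverage from skipped tests: the DeepSeek Eagle and MTP correctness tests are now skipped on Blackwell, although they had been running on the old queue (b200). After the migration, the Blackwell platform no longer covers these cases, and if the underlying problems are fixed later, the tests may not be brought back automatically.
DeepSeek MTP test still failing: khluu reported that the test fails even after the skipif was added. The parametrization is marked as skipped, but other cases may not be covered or the platform detection may be inaccurate. Further investigation is needed.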
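
The first risk above can be illustrated with a small sketch of path-based job triggering. It assumes that a job runs when any changed file starts with one of its listed dependency paths; the real pipeline generator may match paths differently. The `job_is_triggered` helper and both changed-file paths are hypothetical, while the two dependency entries are the ones quoted from the review.

```python
def job_is_triggered(changed_files, source_file_dependencies):
    """Return True if any changed file falls under a listed dependency prefix."""
    return any(
        path.startswith(dep)
        for path in changed_files
        for dep in source_file_dependencies
    )


# Dependencies reported for lm-eval-large-models-b200-ep in the review.
EP_JOB_DEPS = ["csrc/", "vllm/model_executor/layers/quantization"]

# Illustrative paths only; neither is taken from the PR.
model_change = ["vllm/model_executor/models/qwen3_next.py"]
kernel_change = ["csrc/attention/example_kernel.cu"]

print(job_is_triggered(model_change, EP_JOB_DEPS))   # False: model edit is not covered
print(job_is_triggered(kernel_change, EP_JOB_DEPS))  # True: kernel edit is covered
```

Under this assumption, adding the relevant model directory to the dependency list would be enough for model edits to trigger the job.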

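User impact: none. This is a pure CI infrastructure change and does not affect production code.
System impact:
- All B200 CI jobs now use the b200-k8s queue, which simplifies queue management and improves resource utilization.
- One new 2-GPU LM Eval large-models job is added. It increases CI time but covers the EP model tests that were previously missing.
Team impact: CI maintainers benefit from the unified queue configuration. Model developers should be aware of the missing dependencies on the new job.

Risk flags: regression blind spot from missing dependencies, reduced coverage from skipped tests, unresolved test failure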

Related Issue

#42356 [CI] Migrate more B200 jobs to b200-k8s queue

Full report

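Key files:

.buildkite/test_areas/spec_decode.yaml (module: CI config; category: config; type: configuration): moves the device queue of 2 Spec Decode jobs from b200 to b200-k8s. This is the main part of the migration.
.buildkite/test_areas/lm_eval.yaml (module: CI config; category: config; type: configuration): migrates the LM Eval small-models job and adds the EP large-models job. This is the core configuration change.
tests/v1/e2e/spec_decode/test_spec_decode.py (module: Spec decode; category: test; type: test-coverage; symbols: test_eagle_correctness_light, test_mtp_correctness): fixes the skip conditions of the DeepSeek Eagle and MTP tests on Blackwell. This is the only change to test code.
tests/evals/gsm8k/configs/models-blackwell-ep.txt (module: Eval config; category: docs; type: documentation): new config file that holds the large EP models needing 2 GPUs. It is the key to the model split.
tests/evals/gsm8k/configs/models-blackwell.txt (module: Eval config; category: docs; type: documentation): the 3 models needing 2 GPUs are removed from this file, completing the split together with the new file.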

Related context

PR #42356 [CI] Migrate more B200 jobs to b200-k8s queue: the predecessor of this PR, which completed the migration and validation of 4 B200 jobs. This PR migrates the remaining 3 jobs.
