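Executive summary

Split the CPU distributed tests into independent CI steps

The PR body points out that the original CPU-Distributed Tests step ran two vllm serve lifecycles back to back inside a single container with timeout 10m: first PP+TP (about 6 minutes), then DP+TP. As a result, DP+TP was hit by SIGTERM partway through. In the PR's words: "PP+TP alone now consumes ~6m, so DP+TP is SIGTERMed mid-init".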
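The failure mode can be reproduced in miniature. The sketch below is not taken from the PR: it scales minutes down to fractions of a second (hypothetical durations), uses a hypothetical helper named run_shared_budget, and assumes GNU timeout. It shows two phases that each fit within the budget on their own but not when they share it.

```shell
#!/usr/bin/env bash
# Illustrative only: durations are scaled down and hypothetical, not real CI timings.

# Run two phases back to back under ONE shared budget (the old single-step layout).
run_shared_budget() {
    local budget="$1" first="$2" second="$3"
    # GNU timeout sends SIGTERM when the budget expires and then exits with 124.
    timeout "$budget" bash -c \
        "sleep $first; echo 'phase 1 done'; sleep $second; echo 'phase 2 done'"
}

# Shared budget: phase 1 uses most of it, so phase 2 is terminated before it finishes.
shared_status=0
shared_output=$(run_shared_budget 1 0.6 0.7) || shared_status=$?
echo "shared budget: status=$shared_status, output='$shared_output'"

# Split layout: each phase gets the full budget to itself, and both complete.
split_first=0
timeout 1 sleep 0.6 || split_first=$?
split_second=0
timeout 1 sleep 0.7 || split_second=$?
echo "split budgets: first=$split_first, second=$split_second"
```

Status 124 is how GNU timeout reports that it had to stop the command, which is the miniature equivalent of DP+TP being stopped mid-init.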

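Recommendation: merge quickly. The PR fixes a well-defined CI timeout problem, the change is small, and a reviewer has approved it. A close read is not needed.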

Discussion highlights

Only one reviewer (bigPYJ1151) looked at the change, approving it with "Thanks! LGTM :)". There was no other discussion or disagreement.

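Implementation breakdown

Modify .buildkite/hardware_tests/cpu.yaml: the original CPU-Distributed Tests step is split into two steps, CPU-Distributed Tests (PP+TP) and CPU-Distributed Tests (DP+TP). Each runs independently and keeps the 10-minute timeout. The YAML anchor &cpu_distributed_deps is used to share the dependency file list, and the test script itself is added to that list.
Refactor .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh: the two scenarios that used to run sequentially are factored into a run_scenario function, and the script argument MODE (tp_pp, dp_tp, or all) selects which scenario runs. The case statement prints an error and exits (exit 1) on an unknown mode. Each of the two CI steps calls the script with its own mode.
No other files change: the PR touches only the CI configuration and the test script.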

| File | Module | Status | Importance |
| --- | --- | --- | --- |
| `.buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh` | Deployment script | modified | 5.02 |
| `.buildkite/hardware_tests/cpu.yaml` | CI config | modified | 4.44 |

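Key symbols

run_scenario

Key source snippets

.buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh infrastructure

Core refactor of the test script: the two sequentially executed scenarios are wrapped in a run_scenario function, the mode is selected by an argument, and unknown modes fail fast.

# The first argument selects the run mode: tp_pp, dp_tp, or all (default)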
MODE=${1:-all}

# run_scenario: start vllm serve, wait until it is ready, run the benchmark, check for failed requests
run_scenario() {
    local label="$1" result_file="$2"
    shift 2
    echo "--- $label"
    vllm serve meta-llama/Llama-3.2-3B-Instruct "$@" --max-model-len=4096 &
    local server_pid=$!
    timeout 600 bash -c "until curl localhost:8000/v1/models > /dev/null 2>&1; do sleep 1; done" || exit 1
    vllm bench serve \
        --backend vllm \
        --dataset-name random \
        --model meta-llama/Llama-3.2-3B-Instruct \
        --num-prompts 20 \
        --result-dir ./test_results \
        --result-filename "$result_file" \
        --save-result \
        --endpoint /v1/completions
    kill -s SIGTERM "$server_pid"; wait "$server_pid" || true
    if [ "$(jq '.failed' "./test_results/$result_file")" -ne 0 ]; then
        echo "Some requests were failed in $label!"
        exit 1
    fi
}

case "$MODE" in
    tp_pp) run_scenario "PP+TP" tp_pp.json -tp=2 -pp=2 ;;
    dp_tp) run_scenario "DP+TP" dp_tp.json -tp=2 -dp=2 ;;
    all)
        run_scenario "PP+TP" tp_pp.json -tp=2 -pp=2
        run_scenario "DP+TP" dp_tp.json -tp=2 -dp=2
        ;;
    *) echo "ERROR: unknown mode '$MODE' (expected: tp_pp | dp_tp | all)" >&2; exit 1 ;;
esac
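
The mode dispatch can be checked without starting vllm. The following is a dry-run sketch, not part of the PR: run_scenario is replaced by a stub that only prints its arguments, and dispatch is a hypothetical wrapper around the same case statement that returns 1 instead of calling exit, so it can be invoked several times in one shell.

```shell
#!/usr/bin/env bash
# Dry-run sketch: this stub stands in for the real run_scenario and launches nothing.
run_scenario() {
    local label="$1" result_file="$2"
    shift 2
    echo "would run: $label -> $result_file ($*)"
}

# dispatch is a hypothetical name. It mirrors the case statement in the script,
# but returns 1 on an unknown mode instead of exiting the whole shell.
dispatch() {
    local mode="${1:-all}"
    case "$mode" in
        tp_pp) run_scenario "PP+TP" tp_pp.json -tp=2 -pp=2 ;;
        dp_tp) run_scenario "DP+TP" dp_tp.json -tp=2 -dp=2 ;;
        all)
            run_scenario "PP+TP" tp_pp.json -tp=2 -pp=2
            run_scenario "DP+TP" dp_tp.json -tp=2 -dp=2
            ;;
        *) echo "ERROR: unknown mode '$mode' (expected: tp_pp | dp_tp | all)" >&2; return 1 ;;
    esac
}

dispatch tp_pp      # one scenario, as the PP+TP step would request
dispatch dp_tp      # one scenario, as the DP+TP step would request
dispatch            # no argument falls back to "all" and runs both
dispatch typo_mode || echo "unknown mode rejected with status $?"
```

A misspelled mode produces a non-zero status rather than a silent no-op, which is the fail-fast behavior the case statement is meant to provide.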

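Comment highlights

No high-value discussion threads were extracted

The comments so far have not produced a clear point of contention or conclusion; this section will be updated if more discussion follows.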

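Risk and impact

Risk is low. Splitting the CI steps does not touch product code, but the dependency file lists of the two steps must stay in sync, which the YAML anchor already ensures. If a step passes the wrong argument to the script, or a mode name is later renamed, a CI step could end up skipping its test; the script guards against this by failing fast on an unknown mode.

The impact is limited to the CI flow for the CPU distributed tests. The timeout caused by the shared budget goes away, because the DP+TP scenario now gets a full 10 minutes of its own. There is no user-facing impact.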

Related issues

No related issue identified

No explicitly linked issue was detected; one will appear here once a reference is synced.

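Full report

Executive summary

In one sentence: split the CPU distributed tests into independent CI steps.
Recommended action: merge quickly. The PR fixes a well-defined CI timeout problem, the change is small, and a reviewer has approved it. A close read is not needed.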

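Function and motivation

Implementation breakdown

Modify .buildkite/hardware_tests/cpu.yaml: the original CPU-Distributed Tests step is split into two steps, CPU-Distributed Tests (PP+TP) and CPU-Distributed Tests (DP+TP). Each runs independently and keeps the 10-minute timeout. The YAML anchor &cpu_distributed_deps is used to share the dependency file list, and the test script itself is added to that list.
Refactor .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh: the two scenarios that used to run sequentially are factored into a run_scenario function, and the script argument MODE (tp_pp, dp_tp, or all) selects which scenario runs. The case statement prints an error and exits (exit 1) on an unknown mode. Each of the two CI steps calls the script with its own mode.
No other files change: the PR touches only the CI configuration and the test script.

Key files:

.buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh (module: deployment script; category: infra; type: infrastructure): core refactor of the test script. The two sequentially executed scenarios are wrapped in a run_scenario function, the mode is selected by an argument, and unknown modes fail fast.
.buildkite/hardware_tests/cpu.yaml (module: CI config; category: test; type: test-coverage): CI configuration split. The single CPU-Distributed Tests step becomes two independent steps that share the dependency list through a YAML anchor.

Key symbols: run_scenario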

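Comment highlights

Only one reviewer (bigPYJ1151) looked at the change, approving it with "Thanks! LGTM :)". There was no other discussion or disagreement.

No high-value comment threads yet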

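Risk and impact

Risk: low. Splitting the CI steps does not touch product code, but the dependency file lists of the two steps must stay in sync, which the YAML anchor already ensures. If a step passes the wrong argument to the script, or a mode name is later renamed, a CI step could end up skipping its test; the script guards against this by failing fast on an unknown mode.
Impact: limited to the CI flow for the CPU distributed tests. The timeout caused by the shared budget goes away, because the DP+TP scenario now gets a full 10 minutes of its own. There is no user-facing impact.
Risk flags: none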

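Related context

PR #39781 [CI] Uncommented DP+TP test in CPU distributed smoke test: the PR body mentions that the DP+TP block was uncommented in #39781 without adjusting the timeout, which led to the current timeout problem. This PR is the follow-up fix for it.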

#41203 [CI][CPU] Split CPU-Distributed Tests into per-scenario labels
