#37728 Fix Mamba state corruption from referencing stale block table entries

Original PR author: minosfuture · Merged: 2026-03-25 01:30 · Files changed: 3 · Commits: 1 · Comments: 6 · Lines changed: +17 / -0

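Executive summary

Fixes Mamba state corruption by clearing stale block table entries.

Under data parallelism (DP) with full CUDA graphs, when one rank finishes its batch while other ranks are still running, the seq_len = 0 entries generated by dummy_run map to stale Mamba blocks, which corrupts state and produces zero-token-id responses. The PR body states: 'we saw zero-token-id response for a linear attention model. Root cause is due to using stale mamba block, and this is triggered by DP dummy_run.'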
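One plausible reading of this failure mode can be reproduced with a toy model. The sketch below is not vLLM code: the table layout, the `simulate` helper, and the use of block id 0 as a harmless null block are all assumptions made for illustration.

```python
# Toy model of the failure mode. Everything here is illustrative:
# the names, the table layout, and the use of block id 0 as a
# harmless "null" block are assumptions, not vLLM's actual code.

NULL_BLOCK = 0


def simulate(clear_row_on_remove: bool) -> str:
    # One block table row per request slot, one state entry per block id.
    block_table = [[NULL_BLOCK] * 4 for _ in range(2)]
    state = {NULL_BLOCK: "scratch"}

    # Request A runs in slot 0 and owns block 7.
    block_table[0][0] = 7
    state[7] = "state of A"

    # Request A finishes and its slot is released.
    if clear_row_on_remove:
        block_table[0] = [NULL_BLOCK] * 4

    # Block 7 goes back to the pool and is handed to a new request B.
    state[7] = "state of B"

    # A dummy run pads slot 0 with a seq_len of 0 and writes its
    # meaningless state to whichever block the row still points at.
    target = block_table[0][0]
    state[target] = "dummy-run garbage"

    # Report what request B would find in its block afterwards.
    return state[7]


corrupted = simulate(clear_row_on_remove=False)
intact = simulate(clear_row_on_remove=True)
```

With the stale row left in place, the dummy write lands in block 7 and overwrites request B's state; with the row cleared, it lands in the null block and B's state survives.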

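Engineers are advised to read this PR closely, especially the clear_row implementation in block_table.py and the _dummy_run synchronization logic in gpu_model_runner.py, to understand the design trade-offs of state management under DP and CUDA graphs.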

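Discussion highlights

The review mainly discussed how to clear the GPU tensor. heheda12345 asked: 'Do we need to clear the gpu tensor here? Will commit_block_table sync the block_table.np.clear() to gpu?'; minosfuture replied: 'Commit is not called in this dummy run path. Also I think direct write per request should be more efficient.' The final decision was to clear both the CPU and GPU tensors in clear_row, so the update does not depend on a commit and avoids its synchronization overhead.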

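Implementation breakdown

The implementation has three key parts:

In block_table.py, add a clear_row method to BlockTable and MultiGroupBlockTable that zeroes the block table entries (CPU and GPU) of the given row.
In the remove_request method of gpu_input_batch.py, call clear_row to clean up the slot of a finished request.
In _dummy_run of gpu_model_runner.py, add self.input_batch.block_table.commit_block_table(num_reqs_padded) so that the GPU-side block table is up to date and stale data cannot be referenced.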
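The three parts above can be sketched end to end with toy stand-ins. This is a minimal sketch under stated assumptions: plain Python lists play the role of the host and device tensors, and `ToyBlockTable`, `ToyInputBatch`, `add_row`, and the free function `dummy_run` are hypothetical simplifications rather than the real vLLM classes.

```python
# Toy stand-ins only: plain lists play the role of the host and device
# tensors, and every class and method body here is an assumption.


class ToyBlockTable:
    def __init__(self, max_reqs: int, max_blocks: int) -> None:
        self.cpu = [[0] * max_blocks for _ in range(max_reqs)]
        self.gpu = [[0] * max_blocks for _ in range(max_reqs)]
        self.num_blocks_per_row = [0] * max_reqs

    def add_row(self, block_ids: list[int], row_idx: int) -> None:
        # Writes go to the host copy first.
        self.cpu[row_idx][: len(block_ids)] = block_ids
        self.num_blocks_per_row[row_idx] = len(block_ids)

    def commit_block_table(self, num_reqs: int) -> None:
        # Copy the first num_reqs host rows to the device copy.
        for row_idx in range(num_reqs):
            self.gpu[row_idx] = list(self.cpu[row_idx])

    def clear_row(self, row_idx: int) -> None:
        # Part 1: zero the used prefix of the row on both copies.
        num_blocks = self.num_blocks_per_row[row_idx]
        if num_blocks > 0:
            self.cpu[row_idx][:num_blocks] = [0] * num_blocks
            self.gpu[row_idx][:num_blocks] = [0] * num_blocks


class ToyInputBatch:
    def __init__(self, block_table: ToyBlockTable) -> None:
        self.block_table = block_table

    def remove_request(self, row_idx: int) -> None:
        # Part 2: clean the slot as soon as the request is removed.
        self.block_table.clear_row(row_idx)


def dummy_run(batch: ToyInputBatch, num_reqs_padded: int) -> list[list[int]]:
    # Part 3: commit before the device rows are consumed.
    batch.block_table.commit_block_table(num_reqs_padded)
    return batch.block_table.gpu[:num_reqs_padded]


table = ToyBlockTable(max_reqs=2, max_blocks=4)
batch = ToyInputBatch(table)
table.add_row([7, 8], row_idx=0)
table.commit_block_table(num_reqs=1)
batch.remove_request(0)
rows_seen_by_dummy_run = dummy_run(batch, num_reqs_padded=2)
```

In this sketch the padded dummy run sees only zeroed rows after the request is removed, so no row still points at blocks 7 or 8.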

File	Module	Status	Importance
`vllm/v1/worker/block_table.py`	worker/block_table	modified	8.0
`vllm/v1/worker/gpu_input_batch.py`	worker/gpu_input_batch	modified	6.0
`vllm/v1/worker/gpu_model_runner.py`	worker/gpu_model_runner	modified	7.0

Key symbols

clear_row remove_request _dummy_run commit_block_table

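Review highlights

Strategy for clearing the GPU tensor · Design

heheda12345 asked whether the GPU tensor needs to be cleared directly or whether commit_block_table can be relied on to sync it; minosfuture explained that commit is not called on the dummy_run path and that a direct write is more efficient.

Conclusion: clear_row clears both the CPU and GPU tensors, which avoids synchronization overhead and guarantees a timely update. · Resolved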
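The difference between the two options raised in the review can be illustrated with a small comparison. The helpers below are hypothetical and use nested lists in place of tensors; they only show what the device copy holds when no commit runs after the clear.

```python
# Illustrative comparison; both helpers are hypothetical and use nested
# lists in place of the real host and device tensors.


def clear_host_only(host, device, row_idx, num_blocks):
    # Relies on a later commit to propagate the zeros to the device.
    host[row_idx][:num_blocks] = [0] * num_blocks


def clear_host_and_device(host, device, row_idx, num_blocks):
    # Writes the zeros to both copies immediately.
    host[row_idx][:num_blocks] = [0] * num_blocks
    device[row_idx][:num_blocks] = [0] * num_blocks


def make_tables():
    # One request slot whose first two entries are in use.
    return [[7, 8, 0, 0]], [[7, 8, 0, 0]]


# Option A: host-only clear, and no commit happens afterwards.
host_a, device_a = make_tables()
clear_host_only(host_a, device_a, row_idx=0, num_blocks=2)

# Option B: clear both copies directly.
host_b, device_b = make_tables()
clear_host_and_device(host_b, device_b, row_idx=0, num_blocks=2)
```

Under option A the device row still reads `[7, 8, 0, 0]` until something commits the host copy, which is the gap the reviewers were discussing; option B leaves no such window.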

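Risks and impact

Risks include:

Regression risk: clear_row could wrongly clear the block table of an unfinished request, so it must only be called from remove_request; the files involved are block_table.py and gpu_input_batch.py.
Performance risk: writing directly to the GPU tensor adds overhead, which the discussion considered acceptable; this touches the _dummy_run logic in gpu_model_runner.py.
Compatibility risk: the change modifies a core data structure and may affect other models or DP scenarios; test coverage should verify this.

Scope of impact: mainly users running Mamba models with data parallelism and full CUDA graphs; the fix resolves the zero-token-id problem and improves serving reliability. System level: state management in DP mode is improved, reducing erroneous output. Team level: related tests need updating to cover the DP and CUDA graph scenarios.

Core data structure change · GPU synchronization risk · DP-specific bug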

Related issues

No related issue identified

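Full report

Executive summary

This PR fixes a Mamba state corruption problem under data parallelism (DP) with full CUDA graphs, which caused zero-token-id responses. By clearing the block table entries of finished requests and syncing them to the GPU, it resolves the stale-block reference bug triggered by DP dummy_run and improves serving reliability.

Functionality and motivation

The motivation comes from the DP scenario: when one rank finishes its batch while other ranks are still running, the seq_len = 0 entries generated by dummy_run map to stale Mamba blocks, causing state corruption and zero-token-id responses. The PR body states explicitly: "we saw zero-token-id response for a linear attention model. Root cause is due to using stale mamba block, and this is triggered by DP dummy_run."

Implementation breakdown

block_table module (vllm/v1/worker/block_table.py): adds a clear_row method that zeroes the CPU and GPU block table entries of the given row.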

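def clear_row(self, row_idx: int) -> None:
    # Number of block ids currently recorded for this request slot.
    num_blocks = self.num_blocks_per_row[row_idx]
    if num_blocks > 0:
        # Zero the used prefix of the row on the CPU copy and on the GPU copy.
        self.block_table.np[row_idx, :num_blocks] = 0
        self.block_table.gpu[row_idx, :num_blocks] = 0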

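gpu_input_batch module (vllm/v1/worker/gpu_input_batch.py): calls clear_row in the remove_request method so the slot of a finished request is cleaned up promptly.
gpu_model_runner module (vllm/v1/worker/gpu_model_runner.py): adds a commit_block_table call in _dummy_run so that the GPU-side block table is up to date.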
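The summary earlier notes that clear_row was added to MultiGroupBlockTable as well as BlockTable, but only the single-table version is shown. A minimal sketch of how a multi-group wrapper might forward the call follows; the `block_tables` attribute, the `ToyGroupTable` class, and the loop itself are assumptions for illustration, not the PR's actual code.

```python
# Hypothetical sketch: a wrapper that holds one table per cache group
# and forwards clear_row to each of them. Names are assumptions.


class ToyGroupTable:
    def __init__(self, rows: list[list[int]]) -> None:
        self.rows = rows

    def clear_row(self, row_idx: int) -> None:
        # Zero every entry of the row for this group.
        self.rows[row_idx] = [0] * len(self.rows[row_idx])


class ToyMultiGroupBlockTable:
    def __init__(self, block_tables: list[ToyGroupTable]) -> None:
        self.block_tables = block_tables

    def clear_row(self, row_idx: int) -> None:
        # The same request slot must be cleared in every group, otherwise
        # one group could keep a stale entry for the removed request.
        for block_table in self.block_tables:
            block_table.clear_row(row_idx)


attention_group = ToyGroupTable([[3, 4], [5, 6]])
mamba_group = ToyGroupTable([[7], [9]])
multi = ToyMultiGroupBlockTable([attention_group, mamba_group])
multi.clear_row(0)
```

Only row 0 is zeroed in each group; row 1, which belongs to a request that is still running, is left untouched.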

Review highlights

The review focused on the decision to clear the GPU tensor:

heheda12345 asked: "Do we need to clear the gpu tensor here? Will commit_block_table sync the block_table.np.clear() to gpu?"
minosfuture replied: "Commit is not called in this dummy run path. Also I think direct write per request should be more efficient."
The approach finally adopted clears both the CPU and GPU copies to avoid synchronization overhead.

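Risks and impact

Risks: regression risk (clear_row could clear the wrong row), performance risk (GPU write overhead), compatibility risk (affects DP scenarios). File-specific: the change in block_table.py must remain thread-safe; the commit call in gpu_model_runner.py must be coordinated with the _dummy_run logic.
Impact: for users, the zero-token-id problem with Mamba models is resolved; at the system level, DP state management is improved; the team needs to strengthen test coverage for DP and CUDA graphs.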

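Related context

Related to PR 37926 ("Make microbatch optimization (DBO) work with general models"): both touch CUDA graph optimization and state management, which shows the team's ongoing work on the interaction between DP and CUDA graphs.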
