#42289 [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B

原始 PR 作者 qyYue1389 合并时间 2026-05-19 12:22 文件变更 2 提交数 6 评论 20 代码增减 +189 / -12

执行摘要

修复 SimpleCPUOffloadScheduler TOCTOU 竞争导致断言崩溃

参考 issue #39702，作者描述了长期运行的服务器在 CPU KV offloading 启用后，一旦 CPU 缓存填满并开始 LRU evict，update_state_after_alloc 中的断言失败，因为 Phase A 找到的块在 Phase B 之前已被 evict。需要消除这个竞争窗口。

该 PR 修复了关键的 TOCTOU bug，设计清晰（pin + 缓存），并包含高质量回归测试。建议调度器相关开发者精读，理解在异步 offload 路径中保持 block 一致性的手法。对于非该模块的开发者，可略读但值得了解此类竞争模式的修复方式。

讨论亮点

类型注解讨论：gemini-code-assist 建议将 KVCacheBlock 类型注解替换为字符串字面量以避免运行时 NameError。作者回应称项目启用 UP037 lint 规则允许裸类型，且变量注解在函数体内不会运行时求值，因此未采纳。
注释精简与警告日志：zhewenl 提出注释过于冗长，建议精简并移入逻辑代码；同时建议在意外状态（如 num_external_tokens==0 但存在缓存）时添加警告日志。作者采纳并清理了注释、添加入 warning log。
断言校验：zhewenl 建议添加 num_external_tokens % block_size == 0 断言，作者未直接回复但实现中使用了整除和断言。
安全性日志：Dao007forever 指出 if pending is None 分支为意外情况，要求添加日志，作者已添加 warning。

实现拆解

修复分为以下步骤：

数据结构扩展：在 SimpleCPUOffloadScheduler.__init__ 中新增 _pending_cpu_hits 字典（key 为 request_id），用于缓存 Phase A 找到的 CPU 块和命中长度。
Phase A 持久化：在 get_num_new_matched_tokens 中，调用 find_longest_cache_hit 后若命中，立即通过 cpu_block_pool.touch() 增加这些块的引用计数（pin），然后将 (cpu_hit_blocks, hit_length) 存入 _pending_cpu_hits。同时，若该请求之前有未消费的缓存（例如重试场景），先释放旧的 pin。
Phase B 消费：在 update_state_after_alloc 中，优先从 _pending_cpu_hits 弹出缓存。若 num_external_tokens > 0 但无缓存，记录警告并跳过加载；若 num_external_tokens == 0 但存在缓存，释放 stale pin 并记录警告。之后根据缓存构建传输对。
清理辅助函数：新增 _free_pending_cpu_hit 方法，遍历缓存的 CPU 块并调用 cpu_block_pool.free() 减少引用计数，最终释放 pin。在 request_finished 中也调用此方法以处理请求在 Phase B 前被取消/抢占的情况。
回归测试：在 test_scheduler.py 中新增 test_toctou_cpu_hit_evicted_between_phases_no_crash，模拟多请求并发，在 Phase A 和 Phase B 之间填充 CPU 缓存触发 LRU evict，验证未发生断言崩溃且加载事件被正确调度。

文件	模块	状态	重要度
`vllm/v1/simple_kv_offload/manager.py`	调度器	modified	7.64
`tests/v1/simple_kv_offload/test_scheduler.py`	测试	modified	6.46

关键符号

_free_pending_cpu_hit get_num_new_matched_tokens update_state_after_alloc

关键源码片段

vllm/v1/simple_kv_offload/manager.py core-logic

核心修复文件：在 SimpleCPUOffloadScheduler 中新增 _pending_cpu_hits 缓存和 pin 机制，修改 get_num_new_matched_tokens 和 update_state_after_alloc 以消除 TOCTOU 窗口。

def get_num_new_matched_tokens(
    self, request: "Request", num_computed_tokens: int
) -> tuple[int | None, bool]:
    """Return (num_new_tokens, is_async) from consecutive CPU cache hits.

    Pins found CPU blocks via touch() so they survive LRU eviction until
    update_state_after_alloc() consumes them. Any pin from an earlier
    call on the same request (e.g. retry after a failed allocate_slots)
    is dropped first.
    """
    stale = self._pending_cpu_hits.pop(request.request_id, None)
    if stale is not None:
        self._free_pending_cpu_hit(stale)

    # ... 省略无关计算 ...
    cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(
        remaining_hashes, max_hit_len
    )
    if hit_length > 0:
        # 展平所有非空块并 touch，增加引用计数防止 LRU evict
        pin_blocks = [
            blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null
        ]
        self.cpu_block_pool.touch(pin_blocks)
        # 缓存结果供 Phase B 消费，无需二次搜索
        self._pending_cpu_hits[request.request_id] = (
            cpu_hit_blocks,
            hit_length,
        )
        return hit_length, True
    return 0, False

def update_state_after_alloc(
    self,
    request: "Request",
    kv_blocks: "KVCacheBlocks",
    num_external_tokens: int,
) -> None:
    # ... 省略其他逻辑 ...
    # 从缓存中弹出 Phase A 的结果
    pending = self._pending_cpu_hits.pop(req_id, None)
    if num_external_tokens == 0:
        if pending is not None:
            logger.warning(...)
            self._free_pending_cpu_hit(pending)
        return
    if pending is None:
        # 无缓存但需要加载，记录警告并跳过（原行为会导致断言崩溃）
        logger.warning(...)
        return
    cpu_hit_blocks_full, _ = pending
    num_blocks_to_load = num_external_tokens // self.block_size
    assert num_external_tokens % self.block_size == 0
    # 根据 token 预算切片并构建传输对
    # ... 后续构建 load 事件 ...

评论区精华

类型注解使用字符串字面量 设计

gemini-code-assist 建议将 KVCacheBlock 类型改为字符串字面量，避免运行时 NameError。作者解释项目启用 UP037 lint 允许裸类型，且函数内注解不运行时求值，因此未采纳。

结论：未采纳，项目 lint 允许现有风格。 · 已解决

注释精简与添加警告日志 style

zhewenl 指出注释过于详细，建议精简并移入逻辑；同时要求对意外状态添加 warning 日志。Dao007forever 也要求对 pending is None 分支添加日志。

结论：作者采纳，清理注释并增加 warning 日志。 · 已解决

num_external_tokens 对齐校验 正确性

zhewenl 建议添加 num_external_tokens % block_size == 0 断言。

结论：作者已在内部分配逻辑中使用整除和断言，满足要求。 · 已解决

风险与影响

pin 泄漏风险：若请求在 Phase A 后但 Phase B 前被取消或预占，且 request_finished 未正确清理，会导致 CPU 块引用计数泄漏。修复通过 _free_pending_cpu_hit 在 request_finished、重试弹出和 Phase B 消费时均释放，覆盖主要路径。
回退路径影响：当 Phase B 收到 num_external_tokens > 0 但无缓存时，不再断言而是跳过加载并记录警告。这可能导致该次 prefill 无 CPU 块加速，但不会崩溃；后续请求可恢复正常。
性能影响： touch 操作增加了约 O(hit_blocks) 的开销，但仅在命中时执行，且 CPU 块数通常较少（per request），对整体吞吐影响可忽略。
测试覆盖充分：新增回归测试精确模拟竞争条件，原有 13 项 scheduler 测试全部通过，集成测试也通过了。

用户：使用 SimpleCPUOffloadScheduler（通过环境变量或调度策略启用 CPU KV offloading）的用户将不再遇到长时间运行后的断言崩溃，服务稳定性显著提升。
系统：变更局限于 manager.py 和 test_scheduler.py，未修改 API、配置或依赖，对系统其他模块无影响。
团队：维护者需评审新逻辑和测试，但整体代码量小（+84/-12），理解成本低。后续可能需要关注 free_block_queue.popleft_n 的块计数问题（如 issue 中 @v1b3coder 所提），但当前修复已隔离。

核心路径变更 pin 泄漏风险测试覆盖充分

关联 Issue

#39702 [Bug]: SimpleCPUOffloadScheduler crashes with AssertionError: Expected N hit tokens, got 0 (TOCTOU race in update_state_after_alloc)

完整报告

执行摘要

一句话：修复 SimpleCPUOffloadScheduler TOCTOU 竞争导致断言崩溃
推荐动作：该 PR 修复了关键的 TOCTOU bug，设计清晰（pin + 缓存），并包含高质量回归测试。建议调度器相关开发者精读，理解在异步 offload 路径中保持 block 一致性的手法。对于非该模块的开发者，可略读但值得了解此类竞争模式的修复方式。

功能与动机

实现拆解

修复分为以下步骤：

数据结构扩展：在 SimpleCPUOffloadScheduler.__init__ 中新增 _pending_cpu_hits 字典（key 为 request_id），用于缓存 Phase A 找到的 CPU 块和命中长度。
Phase A 持久化：在 get_num_new_matched_tokens 中，调用 find_longest_cache_hit 后若命中，立即通过 cpu_block_pool.touch() 增加这些块的引用计数（pin），然后将 (cpu_hit_blocks, hit_length) 存入 _pending_cpu_hits。同时，若该请求之前有未消费的缓存（例如重试场景），先释放旧的 pin。
Phase B 消费：在 update_state_after_alloc 中，优先从 _pending_cpu_hits 弹出缓存。若 num_external_tokens > 0 但无缓存，记录警告并跳过加载；若 num_external_tokens == 0 但存在缓存，释放 stale pin 并记录警告。之后根据缓存构建传输对。
清理辅助函数：新增 _free_pending_cpu_hit 方法，遍历缓存的 CPU 块并调用 cpu_block_pool.free() 减少引用计数，最终释放 pin。在 request_finished 中也调用此方法以处理请求在 Phase B 前被取消/抢占的情况。
回归测试：在 test_scheduler.py 中新增 test_toctou_cpu_hit_evicted_between_phases_no_crash，模拟多请求并发，在 Phase A 和 Phase B 之间填充 CPU 缓存触发 LRU evict，验证未发生断言崩溃且加载事件被正确调度。

关键文件：

vllm/v1/simple_kv_offload/manager.py（模块调度器；类别 source；类型 core-logic；符号 _free_pending_cpu_hit）: 核心修复文件：在 SimpleCPUOffloadScheduler 中新增 _pending_cpu_hits 缓存和 pin 机制，修改 get_num_new_matched_tokens 和 update_state_after_alloc 以消除 TOCTOU 窗口。
tests/v1/simple_kv_offload/test_scheduler.py（模块测试；类别 test；类型 test-coverage；符号 test_toctou_cpu_hit_evicted_between_phases_no_crash）: 新增回归测试 test_toctou_cpu_hit_evicted_between_phases_no_crash，精确重现 #39702 的竞争条件，验证 fix 有效且无回归。

关键符号：_free_pending_cpu_hit, get_num_new_matched_tokens, update_state_after_alloc

关键源码片段

`vllm/v1/simple_kv_offload/manager.py`

核心修复文件：在 SimpleCPUOffloadScheduler 中新增 _pending_cpu_hits 缓存和 pin 机制，修改 get_num_new_matched_tokens 和 update_state_after_alloc 以消除 TOCTOU 窗口。

def get_num_new_matched_tokens(
    self, request: "Request", num_computed_tokens: int
) -> tuple[int | None, bool]:
    """Return (num_new_tokens, is_async) from consecutive CPU cache hits.

    Pins found CPU blocks via touch() so they survive LRU eviction until
    update_state_after_alloc() consumes them. Any pin from an earlier
    call on the same request (e.g. retry after a failed allocate_slots)
    is dropped first.
    """
    stale = self._pending_cpu_hits.pop(request.request_id, None)
    if stale is not None:
        self._free_pending_cpu_hit(stale)

    # ... 省略无关计算 ...
    cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(
        remaining_hashes, max_hit_len
    )
    if hit_length > 0:
        # 展平所有非空块并 touch，增加引用计数防止 LRU evict
        pin_blocks = [
            blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null
        ]
        self.cpu_block_pool.touch(pin_blocks)
        # 缓存结果供 Phase B 消费，无需二次搜索
        self._pending_cpu_hits[request.request_id] = (
            cpu_hit_blocks,
            hit_length,
        )
        return hit_length, True
    return 0, False

def update_state_after_alloc(
    self,
    request: "Request",
    kv_blocks: "KVCacheBlocks",
    num_external_tokens: int,
) -> None:
    # ... 省略其他逻辑 ...
    # 从缓存中弹出 Phase A 的结果
    pending = self._pending_cpu_hits.pop(req_id, None)
    if num_external_tokens == 0:
        if pending is not None:
            logger.warning(...)
            self._free_pending_cpu_hit(pending)
        return
    if pending is None:
        # 无缓存但需要加载，记录警告并跳过（原行为会导致断言崩溃）
        logger.warning(...)
        return
    cpu_hit_blocks_full, _ = pending
    num_blocks_to_load = num_external_tokens // self.block_size
    assert num_external_tokens % self.block_size == 0
    # 根据 token 预算切片并构建传输对
    # ... 后续构建 load 事件 ...

评论区精华

类型注解讨论：gemini-code-assist 建议将 KVCacheBlock 类型注解替换为字符串字面量以避免运行时 NameError。作者回应称项目启用 UP037 lint 规则允许裸类型，且变量注解在函数体内不会运行时求值，因此未采纳。
注释精简与警告日志：zhewenl 提出注释过于冗长，建议精简并移入逻辑代码；同时建议在意外状态（如 num_external_tokens==0 但存在缓存）时添加警告日志。作者采纳并清理了注释、添加入 warning log。
断言校验：zhewenl 建议添加 num_external_tokens % block_size == 0 断言，作者未直接回复但实现中使用了整除和断言。
安全性日志：Dao007forever 指出 if pending is None 分支为意外情况，要求添加日志，作者已添加 warning。
- 类型注解使用字符串字面量 (design): 未采纳，项目 lint 允许现有风格。
- 注释精简与添加警告日志 (style): 作者采纳，清理注释并增加 warning 日志。
- num_external_tokens 对齐校验 (correctness): 作者已在内部分配逻辑中使用整除和断言，满足要求。

风险与影响

风险：
1. pin 泄漏风险：若请求在 Phase A 后但 Phase B 前被取消或预占，且 request_finished 未正确清理，会导致 CPU 块引用计数泄漏。修复通过 _free_pending_cpu_hit 在 request_finished、重试弹出和 Phase B 消费时均释放，覆盖主要路径。
2. 回退路径影响：当 Phase B 收到 num_external_tokens > 0 但无缓存时，不再断言而是跳过加载并记录警告。这可能导致该次 prefill 无 CPU 块加速，但不会崩溃；后续请求可恢复正常。
3. 性能影响： touch 操作增加了约 O(hit_blocks) 的开销，但仅在命中时执行，且 CPU 块数通常较少（per request），对整体吞吐影响可忽略。
4. 测试覆盖充分：新增回归测试精确模拟竞争条件，原有 13 项 scheduler 测试全部通过，集成测试也通过了。
影响：
- 用户：使用 SimpleCPUOffloadScheduler（通过环境变量或调度策略启用 CPU KV offloading）的用户将不再遇到长时间运行后的断言崩溃，服务稳定性显著提升。
- 系统：变更局限于 manager.py 和 test_scheduler.py，未修改 API、配置或依赖，对系统其他模块无影响。
- 团队：维护者需评审新逻辑和测试，但整体代码量小（+84/-12），理解成本低。后续可能需要关注 free_block_queue.popleft_n 的块计数问题（如 issue 中 @v1b3coder 所提），但当前修复已隔离。
- 风险标记：核心路径变更, pin 泄漏风险, 测试覆盖充分

关联脉络

暂无明显关联 PR

#42289 [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B

执行摘要

修复 SimpleCPUOffloadScheduler TOCTOU 竞争导致断言崩溃

实现拆解

评论区精华

风险与影响

关联 Issue

完整报告

参与讨论