#41064 [Core] Simplify handling of `scheduler_reserve_full_isl` option

原始 PR 作者 njhill 合并时间 2026-05-01 09:10 文件变更 4 提交数 5 评论 2 代码增减 +30 / -58

执行摘要

合并 admission check 到 KV cache 分配，简化调度器

作为 PR#37307 的后续清理，简化 scheduler_reserve_full_isl 的处理，减少调度器中不必要的逻辑。PR 描述：'Minor follow-on to simplify the changes and reduce scheduler-side logic.'

值得精读，展示了将 admission check 与分配逻辑合并以简化接口的设计模式；同时注意时序问题在类似场景中的潜在影响。

讨论亮点

在 review 中，gemini-code-assist[bot] 指出 admission check 的执行时机在 remove_skipped_blocks 之前，可能导致对滑动窗口请求的误判。该问题未被修复，但 PR 仍获得批准。

实现拆解

删除 KVCacheManager.can_fit_full_sequence 方法，将其核心逻辑（检查空闲块是否足够容纳整个序列）移植到 allocate_slots 方法内部。
在 allocate_slots 方法签名中新增布尔参数 full_sequence_must_fit（默认 False）。当该参数为 True 时，在分配前调用 coordinator.get_num_blocks_to_allocate 计算所需块数并与空闲块比较；若不足则返回 None。
简化调度器 schedule 方法：移除原有的 if self.scheduler_reserve_full_isl and not self.kv_cache_manager.can_fit_full_sequence(...) 条件块，改为在调用 allocate_slots 时直接传递 full_sequence_must_fit=self.scheduler_reserve_full_isl。
更新测试：在 test_prefix_caching.py 中，将两个断言从 assert manager.can_fit_full_sequence(req) 和 assert not manager.can_fit_full_sequence(req) 改为 assert manager.allocate_slots(req, block_size, full_sequence_must_fit=True) is not None 和 assert manager.allocate_slots(req, block_size, full_sequence_must_fit=True) is None。在 test_mamba_prefix_cache.py 中，为 mock 函数 fake_allocate_slots_fn 添加 full_sequence_must_fit 参数以匹配新签名。

文件	模块	状态	重要度
`vllm/v1/core/kv_cache_manager.py`	缓存管理器	modified	7.72
`vllm/v1/core/sched/scheduler.py`	调度器	modified	6.25
`tests/v1/core/test_prefix_caching.py`	缓存测试	modified	4.16
`tests/v1/e2e/general/test_mamba_prefix_cache.py`	Mamba 测试	modified	3.42

关键符号

allocate_slots can_fit_full_sequence

关键源码片段

vllm/v1/core/kv_cache_manager.py core-logic

核心变更文件；删除独立方法 `can_fit_full_sequence`，将其逻辑移入 `allocate_slots`，新增 `full_sequence_must_fit` 参数。

def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
    """为请求追加新 tokens 分配 KV cache slot。

    `full_sequence_must_fit` 参数作为 admission check，
    防止 chunked prefill 过度接纳请求。
    """
    # 构建新计算的 block 列表（来自 prefix cache 命中）
    if new_computed_blocks is not None:
        new_computed_block_list = new_computed_blocks.blocks
    else:
        new_computed_block_list = self.empty_kv_cache_blocks.blocks

    # 计算本地已计算 tokens 总数（包括新命中的）
    num_local_computed_tokens = (
        request.num_computed_tokens + num_new_computed_tokens
    )
    # 考虑外部已计算 tokens（例如从 KV connector 获取）
    total_computed_tokens = min(
        num_local_computed_tokens + num_external_computed_tokens,
        self.max_model_len,
    )

    # 如果请求要求完整序列必须能放下，则提前检查容量
    if full_sequence_must_fit:
        full_num_tokens = min(request.num_tokens, self.max_model_len)
        num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
            request_id=request.request_id,
            num_tokens=full_num_tokens,
            new_computed_blocks=new_computed_block_list,
            num_encoder_tokens=num_encoder_tokens,
            total_computed_tokens=total_computed_tokens,
            num_tokens_main_model=full_num_tokens,
            apply_admission_cap=True,
        )
        if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
            return None # 无法容纳，拒绝调度

    # 继续实际的 slot 分配逻辑（原有 logic 不变）
    # ...

评论区精华

时序问题：admission check 在 remove_skipped_blocks 之前导致误判 正确性

The `full_sequence_must_fit` admission check is performed before `self.coordinator.remove_skipped_blocks` is called (currently at line 361). For requests using a sliding window (e.g., resumed requests), this can lead to an inaccurate free block count because blocks that are about to be freed are still considered allocated in the pool. This may cause the admission gate to incorrectly reject requests that would otherwise fit.

结论：未在 PR 中修复，但 PR 被批准合并。 · unresolved

风险与影响

主要风险是时序问题和接口兼容性。时序问题可能导致滑动窗口请求被错误拒绝；接口变更 allocate_slots 新增参数，但当前所有调用者都已适配（仅 scheduler.py 和测试 mock），但未来其他调用者若未传入该参数，则不会启用 check，行为不变。

对用户透明，调度逻辑行为不变；对开发者需注意新增参数，建议未来在调用 allocate_slots 时考虑是否需传入 full_sequence_must_fit。

时序依赖可能误拒滑动窗口请求接口变更影响面

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：合并 admission check 到 KV cache 分配，简化调度器
推荐动作：值得精读，展示了将 admission check 与分配逻辑合并以简化接口的设计模式；同时注意时序问题在类似场景中的潜在影响。

功能与动机

实现拆解

删除 KVCacheManager.can_fit_full_sequence 方法，将其核心逻辑（检查空闲块是否足够容纳整个序列）移植到 allocate_slots 方法内部。
在 allocate_slots 方法签名中新增布尔参数 full_sequence_must_fit（默认 False）。当该参数为 True 时，在分配前调用 coordinator.get_num_blocks_to_allocate 计算所需块数并与空闲块比较；若不足则返回 None。
简化调度器 schedule 方法：移除原有的 if self.scheduler_reserve_full_isl and not self.kv_cache_manager.can_fit_full_sequence(...) 条件块，改为在调用 allocate_slots 时直接传递 full_sequence_must_fit=self.scheduler_reserve_full_isl。
更新测试：在 test_prefix_caching.py 中，将两个断言从 assert manager.can_fit_full_sequence(req) 和 assert not manager.can_fit_full_sequence(req) 改为 assert manager.allocate_slots(req, block_size, full_sequence_must_fit=True) is not None 和 assert manager.allocate_slots(req, block_size, full_sequence_must_fit=True) is None。在 test_mamba_prefix_cache.py 中，为 mock 函数 fake_allocate_slots_fn 添加 full_sequence_must_fit 参数以匹配新签名。

关键文件：

vllm/v1/core/kv_cache_manager.py（模块缓存管理器；类别 source；类型 core-logic；符号 can_fit_full_sequence, allocate_slots）: 核心变更文件；删除独立方法 can_fit_full_sequence，将其逻辑移入 allocate_slots，新增 full_sequence_must_fit 参数。
vllm/v1/core/sched/scheduler.py（模块调度器；类别 source；类型 core-logic）: 调度器简化；移除对 can_fit_full_sequence 的直接调用，改为在 allocate_slots 调用中传入 full_sequence_must_fit 参数。
tests/v1/core/test_prefix_caching.py（模块缓存测试；类别 test；类型 test-coverage）: 单元测试适配；将对 can_fit_full_sequence 的断言改为调用 allocate_slots 并传入 full_sequence_must_fit=True。
tests/v1/e2e/general/test_mamba_prefix_cache.py（模块 Mamba测试；类别 test；类型 test-coverage）: e2e 测试适配；mock 函数 fake_allocate_slots_fn 新增 full_sequence_must_fit 参数以保持签名一致。

关键符号：allocate_slots, can_fit_full_sequence

关键源码片段

`vllm/v1/core/kv_cache_manager.py`

核心变更文件；删除独立方法 can_fit_full_sequence，将其逻辑移入 allocate_slots，新增 full_sequence_must_fit 参数。

def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
    """为请求追加新 tokens 分配 KV cache slot。

    `full_sequence_must_fit` 参数作为 admission check，
    防止 chunked prefill 过度接纳请求。
    """
    # 构建新计算的 block 列表（来自 prefix cache 命中）
    if new_computed_blocks is not None:
        new_computed_block_list = new_computed_blocks.blocks
    else:
        new_computed_block_list = self.empty_kv_cache_blocks.blocks

    # 计算本地已计算 tokens 总数（包括新命中的）
    num_local_computed_tokens = (
        request.num_computed_tokens + num_new_computed_tokens
    )
    # 考虑外部已计算 tokens（例如从 KV connector 获取）
    total_computed_tokens = min(
        num_local_computed_tokens + num_external_computed_tokens,
        self.max_model_len,
    )

    # 如果请求要求完整序列必须能放下，则提前检查容量
    if full_sequence_must_fit:
        full_num_tokens = min(request.num_tokens, self.max_model_len)
        num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
            request_id=request.request_id,
            num_tokens=full_num_tokens,
            new_computed_blocks=new_computed_block_list,
            num_encoder_tokens=num_encoder_tokens,
            total_computed_tokens=total_computed_tokens,
            num_tokens_main_model=full_num_tokens,
            apply_admission_cap=True,
        )
        if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
            return None # 无法容纳，拒绝调度

    # 继续实际的 slot 分配逻辑（原有 logic 不变）
    # ...

评论区精华

时序问题：admission check 在 remove_skipped_blocks 之前导致误判 (correctness): 未在 PR 中修复，但 PR 被批准合并。

风险与影响

风险：主要风险是时序问题和接口兼容性。时序问题可能导致滑动窗口请求被错误拒绝；接口变更 allocate_slots 新增参数，但当前所有调用者都已适配（仅 scheduler.py 和测试 mock），但未来其他调用者若未传入该参数，则不会启用 check，行为不变。
影响：对用户透明，调度逻辑行为不变；对开发者需注意新增参数，建议未来在调用 allocate_slots 时考虑是否需传入 full_sequence_must_fit。
风险标记：时序依赖, 可能误拒滑动窗口请求, 接口变更影响面

关联脉络

PR #37307 [Core] Add scheduler_reserve_full_isl option: 本 PR 是对其的后续简化，将 admission check 内联到 allocate_slots

#41064 [Core] Simplify handling of `scheduler_reserve_full_isl` option

执行摘要

合并 admission check 到 KV cache 分配，简化调度器

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论