#42097 [Bugfix] Fix mismatched kernel-per-logical blocks in NIXL HMA transfer

原始 PR 作者 ZhanqiuHu 合并时间 2026-05-12 21:53 文件变更 2 提交数 3 评论 2 代码增减 +395 / -29

执行摘要

修复异构 TP 下 NIXL HMA 传输中 kernel 块映射不匹配的 bug

在异构TP（如Mamba hybrid模型）中，_logical_to_remote_kernel_block_ids 使用了本地的 physical_per_logical 而不是远程的，导致kernel block ID扩展错误，进而引发静默的精度损坏。此外，逻辑block舍入可能导致本地与远程kernel block数量不一致，需要在传输前进行trimming。

建议精读。该PR展示了在复杂分布式缓存传输中处理异构块大小和前缀缓存约束的典型设计模式，_apply_prefix_caching 的方法抽取和 handshake 防御性检查值得借鉴。测试用例的扩展方式也值得参考。

讨论亮点

复审者 NickLucche 建议添加一个单元测试来展示当前缀缓存启用且 block sizes 不同时逻辑会失败，然后在 handshake 中断言前缀缓存关闭，直到后续加入支持。该建议已被采纳，体现在 test_mismatched_physical_per_logical_fails_with_prefix_caching 测试和 _validate_remote_agent_handshake 中的 RuntimeError。

实现拆解

修复kernel block ID扩展逻辑：在 _read_blocks 和 _read_blocks_for_req 中，从 remote_info 获取远程的 remote_physical_blocks_per_logical 并传递给 _logical_to_remote_kernel_block_ids，替代原先使用的本地 _physical_blocks_per_logical_kv_block。
抽取 _apply_prefix_caching 方法：将 _read_blocks 中原有的内联trim逻辑（从后裁剪remote block IDs与local对齐）抽取为独立方法，并扩展对Mamba hybrid模型的支持。非Mamba模型沿用原end-trim逻辑；Mamba模型因前缀缓存尚不支持异构块大小，改为前后同步裁剪至最小公共kernel block数，避免传输损坏。
添加前缀缓存冲突检测：在 _validate_remote_agent_handshake 中，若本地与远程 physical_blocks_per_logical 不同且启用了 enable_prefix_caching，则抛出 RuntimeError，提示用户禁用前缀缓存。这避免了运行时静默错误。
新增测试覆盖：在 test_nixl_connector_hma.py 中添加了 test_apply_prefix_caching_mamba_hybrid、test_mismatched_physical_per_logical_fails_with_prefix_caching 等测试用例，验证block trimming、handshake拒绝以及kernel block ID扩展的正确性，并完善了参数化测试的输入。

文件	模块	状态	重要度
`vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py`	分布式	modified	7.9
`tests/v1/kv_connector/unit/test_nixl_connector_hma.py`	测试	modified	7.58

关键符号

_apply_prefix_caching _read_blocks _read_blocks_for_req _validate_remote_agent_handshake

关键源码片段

vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py core-logic

核心源码文件，修复了 kernel block ID 映射 bug，新增 `_apply_prefix_caching` 方法统一 block trimming 逻辑，并增强 handshake 校验，是整个 PR 的核心。

def _apply_prefix_caching(
    self,
    local_block_ids: BlockIds,
    remote_block_ids: BlockIds,
    remote_physical_per_logical: int,
) -> tuple[BlockIds, list]:
    """应用前缀缓存，对 local/remote block ID 列表进行裁剪。

    非 Mamba 模型：从末尾裁剪 remote 以匹配 local 数量，跳过已缓存的 prefix。
    Mamba hybrid 模型（前缀缓存暂不支持）：两边都按最小 kernel block 数裁剪，
    解决异构 TP 中逻辑块舍入导致的 kernel 块数差异。
    """
    remote_block_ids = list(remote_block_ids)
    if not self._has_mamba:
        # 非 Mamba：标准前缀缓存裁剪（从后对齐）
        for i, remote_group in enumerate(remote_block_ids):
            num_local_blocks = len(local_block_ids[i])
            assert num_local_blocks <= len(remote_group)
            if num_local_blocks < len(remote_group):
                remote_block_ids[i] = remote_group[-num_local_blocks:]
    else:
        # Mamba hybrid：因前缀缓存不支持异构物理块数，
        # 将两边逻辑块按最小 kernel 块数对齐，避免传输损坏。
        # 注意：Mamba 组的 block 代表完整状态（conv+ssm），不能按 token 裁剪。
        for i, remote_group in enumerate(remote_block_ids):
            num_local_kernel = len(local_block_ids[i]) * self._physical_blocks_per_logical_kv_block
            num_remote_kernel = len(remote_group) * remote_physical_per_logical
            if num_local_kernel != num_remote_kernel:
                # 按最小 kernel 块数对齐，从前面丢弃多余逻辑块
                min_kernel = min(num_local_kernel, num_remote_kernel)
                local_block_ids[i] = local_block_ids[i][- (min_kernel // self._physical_blocks_per_logical_kv_block):]
                remote_block_ids[i] = remote_group[- (min_kernel // remote_physical_per_logical):]
    return local_block_ids, remote_block_ids

tests/v1/kv_connector/unit/test_nixl_connector_hma.py test-coverage

测试文件，新增大量单元测试覆盖所有变更逻辑，包括前缀裁剪、handshake 冲突检测和 kernel 块 ID 扩展，保证修复正确性并防止回归。

@pytest.mark.cpu_test
@pytest.mark.parametrize(
    "group_spec_types,remote_physical_per_logical,"
    "local_physical_per_logical,tp_ratio,remote_block_ids,"
    "expected_remote_block_ids",
    [
        # dense_fa_swa: 远程与本地物理块数相同
        pytest.param(
            ("FullAttentionSpec", "SlidingWindowSpec"),
            2, 2, 1,
            ([0, 1, 2], [3, 4]),
            [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]],
            id="dense_fa_swa",
        ),
        # mamba_fa_ssm: 模拟 Nemotron-3-Nano 的 4p1d 异构 TP
        # remote_physical_per_logical=34, local=66
        pytest.param(
            ("FullAttentionSpec", "MambaSpec"),
            34, 66, -4,
            ([5, 6], [2]),
            [list(range(170, 238)), [2]],
            id="mamba_fa_ssm",
        ),
    ],
)
def test_read_blocks_for_req_expands_remote_ids(
    group_spec_types,
    remote_physical_per_logical,
    local_physical_per_logical,
    tp_ratio,
    remote_block_ids,
    expected_remote_block_ids,
):
    """验证 _read_blocks_for_req 在 HMA 启用时正确扩展远程逻辑块 ID 到 kernel 块 ID。
    热路径应使用 remote_info.remote_physical_blocks_per_logical 进行扩展。
    """
    from unittest.mock import MagicMock
    from vllm.distributed.kv_transfer.kv_connector.v1.nixl.worker import NixlConnectorWorker
    from vllm.v1.kv_cache_interface import FullAttentionSpec, MambaSpec, SlidingWindowSpec

    spec_name_to_type = {
        "FullAttentionSpec": FullAttentionSpec,
        "SlidingWindowSpec": SlidingWindowSpec,
        "MambaSpec": MambaSpec,
    }
    resolved_types = tuple(spec_name_to_type[n] for n in group_spec_types)

    # 使用 __new__ 跳过 __init__ 创建 worker 实例
    worker = object.__new__(NixlConnectorWorker)
    worker._physical_blocks_per_logical_kv_block = local_physical_per_logical

    has_mamba = any(t is MambaSpec for t in resolved_types)
    has_swa = any(t is SlidingWindowSpec for t in resolved_types)
    worker.kv_cache_config = make_kv_cache_config(
        block_size=16, swa_enabled=has_swa, mamba_enabled=has_mamba
    )

    remote_engine_id = "remote-engine"
    worker.transfer_topo = MagicMock()
    worker.transfer_topo.tp_ratio.return_value = tp_ratio
    remote_info = MagicMock()
    remote_info.remote_physical_blocks_per_logical = remote_physical_per_logical
    worker.transfer_topo.get_engine_info.return_value = remote_info
    worker.use_mla = False

    # ...（省略后续 mock 设置和调用断言）

评论区精华

建议添加单元测试展示前缀缓存启用时失败并断言关闭它 测试

NickLucche 提出：如果添加一个单元测试展示 added logic 在前缀缓存启用时失败，然后可以在 handshake 中断言前缀缓存关闭，直到后续加入支持。

结论：已采纳，添加了 test_mismatched_physical_per_logical_fails_with_prefix_caching 并增加了 handshake 检查。 · 已解决

风险与影响

回归风险：_read_blocks 中的block trimming逻辑被替换为 _apply_prefix_caching，非Mamba模型的行为需要验证是否完全一致。已有的单元测试覆盖了典型场景，但端到端测试未包含，可能遗漏边缘情况。
兼容性风险：新增的handshake检查要求异构TP下不能同时启用前缀缓存，可能影响现有依赖此配置的用户。但该组合原已隐式损坏，因此属于合理的 breaking change。
性能风险：新增检查仅在 handshake 时执行一次，开销极小；trimming逻辑与原实现复杂度相当。

影响范围限定于NIXL KV connector的HMA传输路径，主要影响异构TP部署（如P_TP≠D_TP）特别是Mamba hybrid模型（如Qwen3.5）。修复了静默精度损坏，保证了计算结果正确性。对同构TP用户无影响。新增的handshake检查会阻止不支持的配置组合，避免隐藏错误。测试覆盖大幅提升，降低了后续修改的回归风险。

核心路径变更异常路径新增前缀缓存场景限制测试覆盖补全

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：修复异构TP下NIXL HMA传输中kernel块映射不匹配的bug
推荐动作：建议精读。该PR展示了在复杂分布式缓存传输中处理异构块大小和前缀缓存约束的典型设计模式，_apply_prefix_caching 的方法抽取和 handshake 防御性检查值得借鉴。测试用例的扩展方式也值得参考。

功能与动机

实现拆解

修复kernel block ID扩展逻辑：在 _read_blocks 和 _read_blocks_for_req 中，从 remote_info 获取远程的 remote_physical_blocks_per_logical 并传递给 _logical_to_remote_kernel_block_ids，替代原先使用的本地 _physical_blocks_per_logical_kv_block。
抽取 _apply_prefix_caching 方法：将 _read_blocks 中原有的内联trim逻辑（从后裁剪remote block IDs与local对齐）抽取为独立方法，并扩展对Mamba hybrid模型的支持。非Mamba模型沿用原end-trim逻辑；Mamba模型因前缀缓存尚不支持异构块大小，改为前后同步裁剪至最小公共kernel block数，避免传输损坏。
添加前缀缓存冲突检测：在 _validate_remote_agent_handshake 中，若本地与远程 physical_blocks_per_logical 不同且启用了 enable_prefix_caching，则抛出 RuntimeError，提示用户禁用前缀缓存。这避免了运行时静默错误。
新增测试覆盖：在 test_nixl_connector_hma.py 中添加了 test_apply_prefix_caching_mamba_hybrid、test_mismatched_physical_per_logical_fails_with_prefix_caching 等测试用例，验证block trimming、handshake拒绝以及kernel block ID扩展的正确性，并完善了参数化测试的输入。

关键文件：

vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py（模块分布式；类别 source；类型 core-logic；符号 _apply_prefix_caching, _read_blocks, _read_blocks_for_req, _validate_remote_agent_handshake）: 核心源码文件，修复了kernel block ID映射bug，新增 _apply_prefix_caching 方法统一block trimming逻辑，并增强handshake校验，是整个PR的核心。
tests/v1/kv_connector/unit/test_nixl_connector_hma.py（模块测试；类别 test；类型 test-coverage；符号 test_apply_prefix_caching_mamba_hybrid, test_mismatched_physical_per_logical_fails_with_prefix_caching, test_read_blocks_for_req_expands_remote_ids）: 测试文件，新增大量单元测试覆盖所有变更逻辑，包括前缀裁剪、handshake冲突检测和kernel块ID扩展，保证修复正确性并防止回归。

关键符号：_apply_prefix_caching, _read_blocks, _read_blocks_for_req, _validate_remote_agent_handshake

关键源码片段

`vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py`

核心源码文件，修复了kernel block ID映射bug，新增 _apply_prefix_caching 方法统一block trimming逻辑，并增强handshake校验，是整个PR的核心。

def _apply_prefix_caching(
    self,
    local_block_ids: BlockIds,
    remote_block_ids: BlockIds,
    remote_physical_per_logical: int,
) -> tuple[BlockIds, list]:
    """应用前缀缓存，对 local/remote block ID 列表进行裁剪。

    非 Mamba 模型：从末尾裁剪 remote 以匹配 local 数量，跳过已缓存的 prefix。
    Mamba hybrid 模型（前缀缓存暂不支持）：两边都按最小 kernel block 数裁剪，
    解决异构 TP 中逻辑块舍入导致的 kernel 块数差异。
    """
    remote_block_ids = list(remote_block_ids)
    if not self._has_mamba:
        # 非 Mamba：标准前缀缓存裁剪（从后对齐）
        for i, remote_group in enumerate(remote_block_ids):
            num_local_blocks = len(local_block_ids[i])
            assert num_local_blocks <= len(remote_group)
            if num_local_blocks < len(remote_group):
                remote_block_ids[i] = remote_group[-num_local_blocks:]
    else:
        # Mamba hybrid：因前缀缓存不支持异构物理块数，
        # 将两边逻辑块按最小 kernel 块数对齐，避免传输损坏。
        # 注意：Mamba 组的 block 代表完整状态（conv+ssm），不能按 token 裁剪。
        for i, remote_group in enumerate(remote_block_ids):
            num_local_kernel = len(local_block_ids[i]) * self._physical_blocks_per_logical_kv_block
            num_remote_kernel = len(remote_group) * remote_physical_per_logical
            if num_local_kernel != num_remote_kernel:
                # 按最小 kernel 块数对齐，从前面丢弃多余逻辑块
                min_kernel = min(num_local_kernel, num_remote_kernel)
                local_block_ids[i] = local_block_ids[i][- (min_kernel // self._physical_blocks_per_logical_kv_block):]
                remote_block_ids[i] = remote_group[- (min_kernel // remote_physical_per_logical):]
    return local_block_ids, remote_block_ids

`tests/v1/kv_connector/unit/test_nixl_connector_hma.py`

测试文件，新增大量单元测试覆盖所有变更逻辑，包括前缀裁剪、handshake冲突检测和kernel块ID扩展，保证修复正确性并防止回归。

@pytest.mark.cpu_test
@pytest.mark.parametrize(
    "group_spec_types,remote_physical_per_logical,"
    "local_physical_per_logical,tp_ratio,remote_block_ids,"
    "expected_remote_block_ids",
    [
        # dense_fa_swa: 远程与本地物理块数相同
        pytest.param(
            ("FullAttentionSpec", "SlidingWindowSpec"),
            2, 2, 1,
            ([0, 1, 2], [3, 4]),
            [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]],
            id="dense_fa_swa",
        ),
        # mamba_fa_ssm: 模拟 Nemotron-3-Nano 的 4p1d 异构 TP
        # remote_physical_per_logical=34, local=66
        pytest.param(
            ("FullAttentionSpec", "MambaSpec"),
            34, 66, -4,
            ([5, 6], [2]),
            [list(range(170, 238)), [2]],
            id="mamba_fa_ssm",
        ),
    ],
)
def test_read_blocks_for_req_expands_remote_ids(
    group_spec_types,
    remote_physical_per_logical,
    local_physical_per_logical,
    tp_ratio,
    remote_block_ids,
    expected_remote_block_ids,
):
    """验证 _read_blocks_for_req 在 HMA 启用时正确扩展远程逻辑块 ID 到 kernel 块 ID。
    热路径应使用 remote_info.remote_physical_blocks_per_logical 进行扩展。
    """
    from unittest.mock import MagicMock
    from vllm.distributed.kv_transfer.kv_connector.v1.nixl.worker import NixlConnectorWorker
    from vllm.v1.kv_cache_interface import FullAttentionSpec, MambaSpec, SlidingWindowSpec

    spec_name_to_type = {
        "FullAttentionSpec": FullAttentionSpec,
        "SlidingWindowSpec": SlidingWindowSpec,
        "MambaSpec": MambaSpec,
    }
    resolved_types = tuple(spec_name_to_type[n] for n in group_spec_types)

    # 使用 __new__ 跳过 __init__ 创建 worker 实例
    worker = object.__new__(NixlConnectorWorker)
    worker._physical_blocks_per_logical_kv_block = local_physical_per_logical

    has_mamba = any(t is MambaSpec for t in resolved_types)
    has_swa = any(t is SlidingWindowSpec for t in resolved_types)
    worker.kv_cache_config = make_kv_cache_config(
        block_size=16, swa_enabled=has_swa, mamba_enabled=has_mamba
    )

    remote_engine_id = "remote-engine"
    worker.transfer_topo = MagicMock()
    worker.transfer_topo.tp_ratio.return_value = tp_ratio
    remote_info = MagicMock()
    remote_info.remote_physical_blocks_per_logical = remote_physical_per_logical
    worker.transfer_topo.get_engine_info.return_value = remote_info
    worker.use_mla = False

    # ...（省略后续 mock 设置和调用断言）

评论区精华

建议添加单元测试展示前缀缓存启用时失败并断言关闭它 (testing): 已采纳，添加了 test_mismatched_physical_per_logical_fails_with_prefix_caching 并增加了 handshake 检查。

风险与影响

风险：
1. 回归风险：_read_blocks 中的block trimming逻辑被替换为 _apply_prefix_caching，非Mamba模型的行为需要验证是否完全一致。已有的单元测试覆盖了典型场景，但端到端测试未包含，可能遗漏边缘情况。
2. 兼容性风险：新增的handshake检查要求异构TP下不能同时启用前缀缓存，可能影响现有依赖此配置的用户。但该组合原已隐式损坏，因此属于合理的 breaking change。
3. 性能风险：新增检查仅在 handshake 时执行一次，开销极小；trimming逻辑与原实现复杂度相当。
  - 影响：影响范围限定于NIXL KV connector的HMA传输路径，主要影响异构TP部署（如P_TP≠D_TP）特别是Mamba hybrid模型（如Qwen3.5）。修复了静默精度损坏，保证了计算结果正确性。对同构TP用户无影响。新增的handshake检查会阻止不支持的配置组合，避免隐藏错误。测试覆盖大幅提升，降低了后续修改的回归风险。
  - 风险标记：核心路径变更, 异常路径新增, 前缀缓存场景限制, 测试覆盖补全

关联脉络

暂无明显关联 PR

#42097 [Bugfix] Fix mismatched kernel-per-logical blocks in NIXL HMA transfer

执行摘要

修复异构 TP 下 NIXL HMA 传输中 kernel 块映射不匹配的 bug

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论