#43717 [9/n] Migrate attention and cache kernels to torch stable ABI (continued)

Original PR · Author: cleonard530 · Merged: 2026-05-29 12:44 · Files changed: 23 · Commits: 12 · Comments: 36 · Lines changed: +911 / -722

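Executive Summary

Migrate attention and cache kernels to the torch stable ABI

This PR continues the migration of vLLM to the libtorch stable ABI, reducing reliance on PyTorch's unstable internal APIs so that the library stays binary-compatible across PyTorch versions. It follows #38871 and moves the attention and cache kernels onto the stable ABI.

As a significant step in the ongoing ABI migration, this PR merits a close read by core developers. Points of focus: the fix to the concat_mla_q dispatch-type migration, the discussion of the header relocation strategy, and the trade-off around quant_utils.cuh being only partially stable. These patterns will guide later phases.
Other reviewers should check that the build is correct and that the tests provide enough coverage to prevent regressions.
The team is advised to finish migrating quant_utils.cuh in a follow-up PR soon and to consider adding more unit tests for the cache operations.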

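Discussion Highlights

During review, depthfirst-app[bot] pointed out that the dispatch in concat_mla_q had been widened from VLLM_DISPATCH_HALF_TYPES to VLLM_STABLE_DISPATCH_FLOATING_TYPES, which could leave copies incomplete for Half inputs and corrupt data. After discussion, the author reverted it to HALF_TYPES and added a ql_nope.scalar_type() check to enforce type safety.

Harry-Chen suggested moving headers that are referenced only by stable code into the libtorch_stable directory as well, to avoid ../../ paths and to keep ABI-unstable kernels from being added there later. The author moved the relevant headers in this PR and agreed to pursue a more aggressive header migration in a follow-up PR (#44013).

janeyx99 asked whether quant_utils.cuh is already fully ABI-stable. cleonard530 answered that it is not, because it still uses c10::Float8_* rather than the stable headers. Harry-Chen asked whether the ROCm version could be migrated; this was not resolved in this PR.

Harry-Chen also noted that nvfp4_kv_cache_kernels.cu should move into libtorch_stable, and the author subsequently moved it.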

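Implementation Breakdown

Migrate the attention kernels: move csrc/attention/paged_attention_v1.cu, paged_attention_v2.cu, and attention_kernels.cuh to csrc/libtorch_stable/attention/, and update internal references to stable APIs such as torch::stable::Tensor, torch::stable::DeviceGuard, and get_current_cuda_stream(). Add the stable declarations to csrc/libtorch_stable/ops.h.
Migrate the cache kernels: move csrc/cache_kernels.cu, csrc/cache_kernels_fused.cu, and csrc/nvfp4_kv_cache_kernels.cu to csrc/libtorch_stable/, adapt them to the stable API, and add declarations for 12 cache operations (swap_blocks, reshape_and_cache, and others) to csrc/libtorch_stable/ops.h.
Register the stable operations: in csrc/libtorch_stable/torch_bindings.cpp, use STABLE_TORCH_LIBRARY_FRAGMENT and STABLE_TORCH_LIBRARY_IMPL to register the schemas and implementations of the attention and cache operations. Remove the old registrations from csrc/torch_bindings.cpp and adjust its header includes.
Clean up the non-stable declarations: delete the old paged_attention_v1/v2 declarations from csrc/ops.h (they now live in csrc/libtorch_stable/ops.h). Add a <torch/headeronly/core/ScalarType.h> include to csrc/quantization/w8a8/fp8/nvidia/quant_utils.cuh and amd/quant_utils.cuh so that their macros work in a stable build.
Adjust header paths: update the #include paths of the moved files and, following review feedback, also move stable-only headers (such as attention_kernels.cuh) into libtorch_stable.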

File	Module	Status	Importance
`csrc/torch_bindings.cpp`	Unstable bindings	modified	7.18
`csrc/libtorch_stable/torch_bindings.cpp`	Stable bindings	modified	6.92
`csrc/libtorch_stable/ops.h`	Stable interface	modified	6.83
`csrc/ops.h`	Unstable interface	modified	6.01
`csrc/libtorch_stable/cache_kernels.cu`	Cache kernels	renamed	6.02

Key Symbols

paged_attention_v1 paged_attention_v2 swap_blocks reshape_and_cache reshape_and_cache_flash concat_and_cache_mla concat_and_cache_mla_rope_fused convert_fp8 gather_and_maybe_dequant_cache cp_gather_cache cp_gather_and_upconvert_fp8_kv_cache indexer_k_quant_and_cache concat_mla_q

Key Source Snippets

csrc/libtorch_stable/torch_bindings.cpp core-logic

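This file adds the schema definitions and implementation registrations for the attention and cache operations in the stable extension; it is one of the core files of the migration.

// Stable registrations added in csrc/libtorch_stable/torch_bindings.cpp

// 1. Schema definitions for the attention ops (after the earlier fragment)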
STABLE_TORCH_LIBRARY_FRAGMENT(_C, ops) {
  ops.def(
      "paged_attention_v1("
      "    Tensor! out, Tensor query, Tensor key_cache,"
      "    Tensor value_cache, int num_kv_heads, float scale,"
      "    Tensor block_tables, Tensor seq_lens, int block_size,"
      "    int max_seq_len, Tensor? alibi_slopes,"
      "    str kv_cache_dtype, Tensor k_scale, Tensor v_scale,"
      "    int tp_rank, int blocksparse_local_blocks,"
      "    int blocksparse_vert_stride, int blocksparse_block_size,"
      "    int blocksparse_head_sliding_step) -> ()");

  ops.def(
      "paged_attention_v2("
      "    Tensor! out, Tensor! exp_sums, Tensor! max_logits,"
      "    Tensor! tmp_out, Tensor query, Tensor key_cache,"
      "    Tensor value_cache, int num_kv_heads, float scale,"
      "    Tensor block_tables, Tensor seq_lens, int block_size,"
      "    int max_seq_len, Tensor? alibi_slopes,"
      "    str kv_cache_dtype, Tensor k_scale, Tensor v_scale,"
      "    int tp_rank, int blocksparse_local_blocks,"
      "    int blocksparse_vert_stride, int blocksparse_block_size,"
      "    int blocksparse_head_sliding_step) -> ()");
}

// 2. CUDA implementation registration (after the existing CUDA impl block)
STABLE_TORCH_LIBRARY_IMPL(_C, CUDA, ops) {
  // ... existing impls (permute_cols, quantization, GGML, etc.) ...

  ops.impl("paged_attention_v1", TORCH_BOX(&paged_attention_v1));
  ops.impl("paged_attention_v2", TORCH_BOX(&paged_attention_v2));
}

// 3. Stable library definitions for the cache ops
STABLE_TORCH_LIBRARY_FRAGMENT(_C_cache_ops, ops) {
  ops.def("swap_blocks(Tensor src, Tensor! dst, int block_size_in_bytes, Tensor block_mapping) -> ()");
  ops.def("swap_blocks_batch(Tensor src_ptrs, Tensor dst_ptrs, Tensor sizes, bool is_src_access_order_any=False) -> ()");
  ops.def("reshape_and_cache(Tensor key, Tensor value, Tensor! key_cache, Tensor! value_cache, Tensor slot_mapping, str kv_cache_dtype, Tensor k_scale, Tensor v_scale) -> ()");
  ops.def("reshape_and_cache_flash(...) -> ()");
  ops.def("concat_and_cache_mla(...) -> ()");
  ops.def("concat_and_cache_mla_rope_fused(...) -> ()");
  // ... other cache ops ...
}
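
To make the swap_blocks schema above more concrete, the following host-only C++ sketch shows one plausible reading of its arguments. It assumes that block_mapping holds (source block, destination block) index pairs and that each pair moves block_size_in_bytes contiguous bytes. The byte vectors, the BlockMapping alias, and the name swap_blocks_host are stand-ins invented for illustration; this is not the CUDA implementation in cache_kernels.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for the block_mapping tensor:
// each entry is (source block index, destination block index).
using BlockMapping = std::vector<std::pair<std::int64_t, std::int64_t>>;

// Copies whole blocks from src into dst according to block_mapping.
// Assumption: block i starts at byte offset i * block_size_in_bytes.
inline void swap_blocks_host(const std::vector<std::uint8_t>& src,
                             std::vector<std::uint8_t>& dst,
                             std::int64_t block_size_in_bytes,
                             const BlockMapping& block_mapping) {
  assert(block_size_in_bytes > 0);
  const std::size_t bs = static_cast<std::size_t>(block_size_in_bytes);
  for (const auto& entry : block_mapping) {
    if (entry.first < 0 || entry.second < 0) {
      throw std::out_of_range("negative block index");
    }
    const std::size_t src_off = static_cast<std::size_t>(entry.first) * bs;
    const std::size_t dst_off = static_cast<std::size_t>(entry.second) * bs;
    if (src_off + bs > src.size() || dst_off + bs > dst.size()) {
      throw std::out_of_range("block index past the end of the buffer");
    }
    std::memcpy(dst.data() + dst_off, src.data() + src_off, bs);
  }
}
```

Under these assumptions, a six-byte source {1,1,2,2,3,3} with a 2-byte block size and the mapping {(2,0), (0,1)} leaves a zero-initialized destination as {3,3,1,1,0,0}.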

csrc/libtorch_stable/ops.h core-logic

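This header adds the stable-ABI function declarations for the attention and cache operations; it defines the interface of the stable extension.

// Additions at the end of the existing declarations in csrc/libtorch_stable/ops.h

// Stable PagedAttention declarations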
void paged_attention_v1(
    torch::stable::Tensor& out, torch::stable::Tensor& query,
    torch::stable::Tensor& key_cache, torch::stable::Tensor& value_cache,
    int64_t num_kv_heads, double scale, torch::stable::Tensor& block_tables,
    torch::stable::Tensor& seq_lens, int64_t block_size, int64_t max_seq_len,
    const std::optional<torch::stable::Tensor>& alibi_slopes,
    const std::string& kv_cache_dtype, torch::stable::Tensor& k_scale,
    torch::stable::Tensor& v_scale, const int64_t tp_rank,
    const int64_t blocksparse_local_blocks,
    const int64_t blocksparse_vert_stride, const int64_t blocksparse_block_size,
    const int64_t blocksparse_head_sliding_step);

void paged_attention_v2(
    torch::stable::Tensor& out, torch::stable::Tensor& exp_sums,
    torch::stable::Tensor& max_logits, torch::stable::Tensor& tmp_out,
    torch::stable::Tensor& query, torch::stable::Tensor& key_cache,
    torch::stable::Tensor& value_cache, int64_t num_kv_heads, double scale,
    torch::stable::Tensor& block_tables, torch::stable::Tensor& seq_lens,
    int64_t block_size, int64_t max_seq_len,
    const std::optional<torch::stable::Tensor>& alibi_slopes,
    const std::string& kv_cache_dtype, torch::stable::Tensor& k_scale,
    torch::stable::Tensor& v_scale, const int64_t tp_rank,
    const int64_t blocksparse_local_blocks,
    const int64_t blocksparse_vert_stride, const int64_t blocksparse_block_size,
    const int64_t blocksparse_head_sliding_step);

// Stable cache-op declarations
void swap_blocks(torch::stable::Tensor& src, torch::stable::Tensor& dst,
                 int64_t block_size_in_bytes,
                 const torch::stable::Tensor& block_mapping);
void reshape_and_cache(torch::stable::Tensor& key, torch::stable::Tensor& value,
                       torch::stable::Tensor& key_cache,
                       torch::stable::Tensor& value_cache,
                       torch::stable::Tensor& slot_mapping,
                       const std::string& kv_cache_dtype,
                       torch::stable::Tensor& k_scale,
                       torch::stable::Tensor& v_scale);
// Other, similar declarations omitted
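
The declarations above line up with the schema strings in torch_bindings.cpp: schema `int` appears as int64_t, `float` as double, and `str` as std::string. The sketch below is a conceptual mock of why that agreement matters when a typed function is called through a boxed (type-erased) argument stack. Boxed, Stack, call_boxed, and toy_op are hypothetical names made up for this illustration; it does not reproduce how TORCH_BOX is actually implemented and handles no tensor types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical boxed value, one alternative per scalar kind in the schema:
//   schema `int` -> std::int64_t, `float` -> double, `str` -> std::string
using Boxed = std::variant<std::int64_t, double, std::string>;
using Stack = std::vector<Boxed>;

namespace detail {
template <typename R, typename... Args, std::size_t... I>
R unbox_and_call(R (*fn)(Args...), const Stack& stack,
                 std::index_sequence<I...>) {
  // std::get throws std::bad_variant_access when a slot holds a different
  // type than the C++ signature expects.
  return fn(std::get<std::decay_t<Args>>(stack.at(I))...);
}
}  // namespace detail

// Unpacks the stack in order and forwards the values to the typed function.
template <typename R, typename... Args>
R call_boxed(R (*fn)(Args...), const Stack& stack) {
  assert(fn != nullptr);
  if (stack.size() != sizeof...(Args)) {
    throw std::invalid_argument("argument count does not match the signature");
  }
  return detail::unbox_and_call(fn, stack, std::index_sequence_for<Args...>{});
}

// Toy op whose parameters mirror `int num_kv_heads, float scale,
// str kv_cache_dtype` from the schema.
inline double toy_op(std::int64_t num_kv_heads, double scale,
                     const std::string& kv_cache_dtype) {
  return kv_cache_dtype == "auto" ? static_cast<double>(num_kv_heads) * scale
                                  : -1.0;
}
```

Calling call_boxed(&toy_op, stack) with a stack holding an int64_t, a double, and a string succeeds; putting an integer where the signature expects a double fails at unboxing time rather than silently reinterpreting the value.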

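Review Highlights

Risk of widening the concat_mla_q dispatch types (correctness)

The review bot pointed out that the dispatch in concat_mla_q was changed from VLLM_DISPATCH_HALF_TYPES to VLLM_STABLE_DISPATCH_FLOATING_TYPES. Because the kernel uses a fixed 128-byte copy that assumes a 16-bit element type, this could corrupt data for Half inputs. The bot recommended VLLM_STABLE_DISPATCH_HALF_TYPES instead.

Conclusion: the author reverted the dispatch to HALF_TYPES and added a ql_nope.scalar_type() check to enforce type safety. · Resolved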
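
The general class of bug can be shown with a small host-side C++ sketch. It assumes a hypothetical row copier that always moves 128 bytes, sized for 64 two-byte elements; kRowElems, copy_row_fixed, elements_copied, and require_16bit_elements are illustrative names, and std::uint16_t stands in for a 16-bit float. It is not the concat_mla_q kernel and does not reproduce the exact failure the bot described; it only shows what happens when a fixed-size copy meets an element type wider than it was written for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Illustrative constants, not taken from the vLLM kernel source.
constexpr std::size_t kRowElems = 64;    // elements the caller expects per row
constexpr std::size_t kCopyBytes = 128;  // fixed copy size: 64 elements * 2 bytes

// Copies exactly kCopyBytes, whatever element type sits behind the pointers.
inline void copy_row_fixed(const void* src, void* dst) {
  assert(src != nullptr && dst != nullptr);
  std::memcpy(dst, src, kCopyBytes);
}

// Returns how many elements of a kRowElems-long row actually reach dst.
template <typename T>
std::size_t elements_copied() {
  std::vector<T> src(kRowElems, T{1});
  std::vector<T> dst(kRowElems, T{0});
  copy_row_fixed(src.data(), dst.data());
  std::size_t copied = 0;
  for (const T& v : dst) copied += (v == T{1}) ? 1 : 0;
  return copied;
}

// Guard in the spirit of the added scalar_type() check: reject any element
// width for which the fixed-size copy would not cover the whole row.
inline void require_16bit_elements(std::size_t element_size_bytes) {
  if (element_size_bytes * kRowElems != kCopyBytes) {
    throw std::invalid_argument("fixed 128-byte copy requires 16-bit elements");
  }
}
```

With these definitions, elements_copied<std::uint16_t>() returns 64, while elements_copied<float>() returns 32 because only half of the row fits in 128 bytes; require_16bit_elements(sizeof(float)) throws before any such copy could run.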

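Move headers referenced only by stable units into libtorch_stable (design)

Harry-Chen suggested moving every header referenced only by stable code into libtorch_stable, to simplify paths and prevent unstable kernels from being added later. The author moved the relevant headers in this PR and agreed to take this further in #44013.

Conclusion: this PR moved the directly related headers; a more comprehensive migration is being discussed in a follow-up PR. · Resolved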

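ABI-stability status of quant_utils.cuh (question)

janeyx99 asked whether quant_utils.cuh is fully ABI-stable. cleonard530 said no, because it still uses c10::Float8_* rather than the stable headers. Harry-Chen asked whether the ROCm version could be migrated.

Conclusion: the file is not yet fully stable, and no migration plan has been set. · Acknowledged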

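nvfp4_kv_cache_kernels.cu should move into libtorch_stable (design)

Harry-Chen pointed out that nvfp4_kv_cache_kernels.cu should also move into the stable directory for consistency. The author then moved it.

Conclusion: the file has been moved. · Resolved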

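Risks and Impact

Type-dispatch risk: during the migration, the dispatch in concat_mla_q was mistakenly widened to FLOATING_TYPES. That has been fixed, but other similar migrations should be checked for the same kind of type-assumption error.
Partially migrated headers: quant_utils.cuh under csrc/quantization/w8a8/fp8/ still depends on the unstable c10::Float8_*. If stable code includes these headers indirectly, ABI stability may be compromised.
Include-path fragility: the moved files use relative ../ paths to reference files under csrc/, so a later restructuring of the directory layout could introduce compile errors.
Platform-specific code: nvfp4_kv_cache_kernels.cu is only valid on CUDA SM100+ (Blackwell); whether its conditional compilation and error checks are still correct after the move remains to be verified.
Lack of integration tests: the attention tests pass, but no new test coverage was added for the cache operations or the MLA-related functions, so a regression risk remains.

For users: no direct functional impact, but the library's ABI stability improves, letting vLLM run more reliably across PyTorch versions.
For the system: the number of unstable interfaces in _C.abi3.so drops from 99 to 93 and stable-library coverage grows, which should simplify issue triage and version compatibility.
For the team: this lays the groundwork for further migration and establishes a repeatable kernel-migration pattern (headers, op registration, API adaptation) that developers of new kernels can follow.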

Tags: core-path change · data-type safety risk · partially migrated headers · missing test coverage

Related Issues

No related issue identified


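Full Report

Executive Summary

This PR is phase 9 of vLLM's migration to the libtorch stable ABI. It moves paged_attention_v1/v2 and all 12 cache operations from the non-stable _C extension to _C_stable_libtorch. After the migration, the count of unstable operations in _C.abi3.so falls by 6 and the stable library gains one shim, further reducing the library's dependence on PyTorch internal APIs. Review fixed one safety hazard caused by a relaxed dispatch type and set the direction for subsequent header migration.

Functionality and Motivation

This PR is part of the ongoing effort to migrate vLLM to the libtorch stable ABI (continuing #38871). The goal is to reduce reliance on PyTorch's unstable internal APIs and keep the library binary-compatible across PyTorch versions. Specifically, it migrates the core kernels for attention (paged_attention_v1/v2) and cache (12 operations in total, including swap_blocks, reshape_and_cache, and concat_and_cache_mla).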

Implementation Breakdown

Migrate the attention kernels: move paged_attention_v1.cu, paged_attention_v2.cu, and attention_kernels.cuh from csrc/attention/ to csrc/libtorch_stable/attention/, and switch their API calls to the torch stable versions (such as torch::stable::Tensor, torch::stable::DeviceGuard, and get_current_cuda_stream()). Add stable declarations for paged_attention_v1/v2 to csrc/libtorch_stable/ops.h.
Migrate the cache kernels: move cache_kernels.cu, cache_kernels_fused.cu, and nvfp4_kv_cache_kernels.cu from csrc/ to csrc/libtorch_stable/, adapt them to the stable API in the same way, and add declarations for the 12 cache operations to csrc/libtorch_stable/ops.h.
Register the stable operations: in csrc/libtorch_stable/torch_bindings.cpp, define the operation schemas with STABLE_TORCH_LIBRARY_FRAGMENT and register the CUDA implementations with STABLE_TORCH_LIBRARY_IMPL. Remove the old TORCH_LIBRARY_EXPAND registration blocks from csrc/torch_bindings.cpp, and replace #include "cache.h" with <torch/all.h>.
Clean up the non-stable declarations: delete the old paged_attention_v1/v2 declarations from csrc/ops.h. Add a <torch/headeronly/core/ScalarType.h> include to csrc/quantization/w8a8/fp8/nvidia/quant_utils.cuh and amd/quant_utils.cuh so that the DISPATCH_BY_KV_CACHE_DTYPE macro can be used in stable code.
Adjust header paths: update the #include paths of the moved files and, following review feedback, also move stable-only headers (such as attention_kernels.cuh) into libtorch_stable/attention/.
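
As its name suggests, the DISPATCH_BY_KV_CACHE_DTYPE macro mentioned above picks a code path from the runtime kv_cache_dtype string together with the source dtype. The following is a simplified, self-contained sketch of that kind of string-keyed dispatch, written as a plain function rather than a macro. The enum names, the KvCachePath struct, and the accepted strings ("auto", "fp8", "fp8_e4m3", "fp8_e5m2") are assumptions made for this illustration, not a copy of quant_utils.cuh.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the real scalar-type and cache-format enums.
enum class SrcDType { Float, Half, BFloat16 };
enum class KvCacheFormat { Auto, Fp8E4M3, Fp8E5M2 };

// What a dispatch decision boils down to in this sketch: the width of one
// cache element and the storage format the kernel should use.
struct KvCachePath {
  std::size_t cache_elem_bytes;
  KvCacheFormat format;
};

inline KvCachePath select_kv_cache_path(SrcDType src,
                                        const std::string& kv_cache_dtype) {
  if (kv_cache_dtype == "auto") {
    // "auto": the cache stores values in the source dtype unchanged.
    const std::size_t bytes = (src == SrcDType::Float) ? 4 : 2;
    return {bytes, KvCacheFormat::Auto};
  }
  if (kv_cache_dtype == "fp8" || kv_cache_dtype == "fp8_e4m3") {
    return {1, KvCacheFormat::Fp8E4M3};
  }
  if (kv_cache_dtype == "fp8_e5m2") {
    return {1, KvCacheFormat::Fp8E5M2};
  }
  // Unknown strings are rejected instead of falling through to a default.
  throw std::invalid_argument("unsupported kv_cache_dtype: " + kv_cache_dtype);
}

// Small usage example.
inline void example_usage() {
  assert(select_kv_cache_path(SrcDType::Half, "auto").cache_elem_bytes == 2);
  assert(select_kv_cache_path(SrcDType::Float, "fp8").format ==
         KvCacheFormat::Fp8E4M3);
}
```

Rejecting unknown strings explicitly keeps a misspelled kv_cache_dtype from silently selecting the wrong element width, which is the same kind of type assumption that caused the concat_mla_q issue.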

For testing, the existing tests/kernels/attention/test_attention.py and tests/kernels/attention/test_cache.py were run to confirm functional correctness.


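Review Highlights

depthfirst-app[bot]: changing the concat_mla_q dispatch from HALF_TYPES to FLOATING_TYPES would corrupt the output for Half inputs; HALF_TYPES should be used.
cleonard530: fixed; reverted to HALF_TYPES and added a type check.
Harry-Chen: suggested moving the stable-only headers into libtorch_stable as well, and considering a more aggressive overall migration.
cleonard530: this PR moved some of them; the discussion continues in #44013.
janeyx99: is quant_utils.cuh fully stable yet?
cleonard530: not yet, because it uses c10::Float8_* rather than the stable headers.
Harry-Chen: nvfp4_kv_cache_kernels.cu should also move into libtorch_stable.
cleonard530: moved.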

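Risks and Impact

Risks

Type-dispatch risk: the concat_mla_q dispatch type has been fixed, but similar patterns may exist in other migrated code.
Partially migrated headers: quant_utils.cuh still contains unstable dependencies; if stable code pulls it in indirectly, ABI stability is broken.
Include-path fragility: the moved files use relative ../ paths, so a future directory restructuring could break the build.
Platform-specific code: nvfp4_kv_cache is supported only on Blackwell, so its conditional compilation needs to be verified.
Insufficient test coverage: only the existing attention test suite was exercised; the cache operations received no new standalone tests.

Impact

Users: no functional change, but the library is more stable across PyTorch versions.
System: unstable operations in _C.abi3.so drop from 99 to 93.
Team: this migration establishes a reusable pattern that lowers the cost of later migrations.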

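Related Context

This PR continues [#38871] (phase 9), carrying on the attention and cache kernel migration that was left unfinished.
The more thorough header migration discussed in review continues in #44013.
The full ABI migration plan will gradually cover all CUDA kernels, with the end goal of zero unstable operations in _C.abi3.so.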
