执行摘要

修复 fa_skip_kv_cache 在 MLA 下的潜在 bug

PR #21971 引入的 fa_skip_kv_cache 快速路径缺少对 MLA 的防护，导致 MLA 模式下写跳过但读取仍使用 get_key_buffer 返回陈旧数据。虽然尚无公开的 MLA embedding 模型，但该问题可能在未来引发静默错误。

建议合并。变更简洁明了，修复了一个潜在的正确性问题，并附有清晰的注释。不需额外测试，因为该路径目前无实际使用场景。

讨论亮点

无 review 评论。审核者 Qiaolin-Yu 直接批准。

实现拆解

分析问题根源：在 FlashAttentionBackend.__init__ 中发现，fa_skip_kv_cache 标志控制着 forward_extend 中的写跳过和读跳过分支。写跳过同时包裹了 set_kv_buffer（MHA）和 set_mla_kv_buffer（MLA），但读跳过仅位于 if not self.use_mla: 分支内。当 MLA 与 is_embedding=True 等条件同时生效时，写操作被跳过而读操作仍然执行，导致返回过期的缓存数据。
修改标志计算：在 fa_skip_kv_cache 的条件链末尾加入 and not self.use_mla，使得 MLA 后端永远不启用此快速路径，从而保证读写行为一致。
补充注释：在标志计算前添加多行注释，详细说明读写不对称的原因和限制目的，方便后续维护。

文件	模块	状态	重要度
`python/sglang/srt/layers/attention/flashattention_backend.py`	注意力层	modified	5.19

关键源码片段

python/sglang/srt/layers/attention/flashattention_backend.py core-logic

核心变更文件：在 `fa_skip_kv_cache` 标志位中增加 `and not self.use_mla` 保护，确保 MLA 后端不会进入快速路径，避免读写不对称导致的潜在静默错误。

# In embedding mode with no chunked prefill and radix cache disabled,
# skip KV cache write and use flash_attn_varlen_func with raw K/V
# instead of flash_attn_with_kvcache, bypassing paged KV cache entirely.
# Restricted to non-MLA backends: the read-skip elif lives inside the
# `if not self.use_mla:` branch in forward_extend, while the write-skip
# guard wraps both set_kv_buffer and set_mla_kv_buffer. Without this
# gate, MLA + is_embedding would skip the write but still read stale
# cache via get_key_buffer in the absorbed-MLA path.
server_args = model_runner.server_args
self.fa_skip_kv_cache = (
    server_args.is_embedding
    and server_args.chunked_prefill_size == -1
    and server_args.disable_radix_cache
    and not self.use_mla # <-- 新增 : 限制非 MLA 后端
)

评论区精华

没有提炼出高价值讨论线程

当前评论区没有形成足够清晰的争议点或结论，后续有更多讨论时会体现在这里。

风险与影响

风险极低：变更仅增加一个布尔条件，MHA 模型行为不变；MLA 模型在没有此修复时也未曾正确使用该路径，因此回退到原有缓存路径不会引入新问题。

影响范围局限于 FlashAttentionBackend 初始化逻辑。所有非 MLA 模型（MHA、GQA 等）无变化；MLA 模型在极不可能的组合（is_embedding=True + chunked_prefill_size=-1 + disable_radix_cache）下将获得正确行为。无性能影响。

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：修复 fa_skip_kv_cache 在 MLA 下的潜在 bug
推荐动作：建议合并。变更简洁明了，修复了一个潜在的正确性问题，并附有清晰的注释。不需额外测试，因为该路径目前无实际使用场景。

功能与动机

实现拆解

分析问题根源：在 FlashAttentionBackend.__init__ 中发现，fa_skip_kv_cache 标志控制着 forward_extend 中的写跳过和读跳过分支。写跳过同时包裹了 set_kv_buffer（MHA）和 set_mla_kv_buffer（MLA），但读跳过仅位于 if not self.use_mla: 分支内。当 MLA 与 is_embedding=True 等条件同时生效时，写操作被跳过而读操作仍然执行，导致返回过期的缓存数据。
修改标志计算：在 fa_skip_kv_cache 的条件链末尾加入 and not self.use_mla，使得 MLA 后端永远不启用此快速路径，从而保证读写行为一致。
补充注释：在标志计算前添加多行注释，详细说明读写不对称的原因和限制目的，方便后续维护。

关键文件：

python/sglang/srt/layers/attention/flashattention_backend.py（模块注意力层；类别 source；类型 core-logic）: 核心变更文件：在 fa_skip_kv_cache 标志位中增加 and not self.use_mla 保护，确保 MLA 后端不会进入快速路径，避免读写不对称导致的潜在静默错误。

关键符号：未识别

关键源码片段

`python/sglang/srt/layers/attention/flashattention_backend.py`

核心变更文件：在 fa_skip_kv_cache 标志位中增加 and not self.use_mla 保护，确保 MLA 后端不会进入快速路径，避免读写不对称导致的潜在静默错误。

# In embedding mode with no chunked prefill and radix cache disabled,
# skip KV cache write and use flash_attn_varlen_func with raw K/V
# instead of flash_attn_with_kvcache, bypassing paged KV cache entirely.
# Restricted to non-MLA backends: the read-skip elif lives inside the
# `if not self.use_mla:` branch in forward_extend, while the write-skip
# guard wraps both set_kv_buffer and set_mla_kv_buffer. Without this
# gate, MLA + is_embedding would skip the write but still read stale
# cache via get_key_buffer in the absorbed-MLA path.
server_args = model_runner.server_args
self.fa_skip_kv_cache = (
    server_args.is_embedding
    and server_args.chunked_prefill_size == -1
    and server_args.disable_radix_cache
    and not self.use_mla # <-- 新增 : 限制非 MLA 后端
)

评论区精华

无 review 评论。审核者 Qiaolin-Yu 直接批准。

暂无高价值评论线程

风险与影响

风险：风险极低：变更仅增加一个布尔条件，MHA 模型行为不变；MLA 模型在没有此修复时也未曾正确使用该路径，因此回退到原有缓存路径不会引入新问题。
影响：影响范围局限于 FlashAttentionBackend 初始化逻辑。所有非 MLA 模型（MHA、GQA 等）无变化；MLA 模型在极不可能的组合（is_embedding=True + chunked_prefill_size=-1 + disable_radix_cache）下将获得正确行为。无性能影响。
风险标记：暂无

关联脉络

PR #21971 [Performance] fa_skip_kv_cache fast path for embedding models: 引入 fa_skip_kv_cache 标志的原始 PR，本 PR 修复其未考虑 MLA 的问题。

#24097 Restrict fa_skip_kv_cache to non-MLA backends

执行摘要

修复 fa_skip_kv_cache 在 MLA 下的潜在 bug

实现拆解

评论区精华

没有提炼出高价值讨论线程

风险与影响

关联 Issue

未识别关联 Issue

完整报告

执行摘要

功能与动机

实现拆解

关键源码片段

`python/sglang/srt/layers/attention/flashattention_backend.py`

评论区精华

风险与影响

关联脉络

参与讨论