#27118 [Mamba] extra buffer lazy support

Original PR · Author: hanming-lu · Merged: 2026-06-04 03:42 · Files changed: 14 · Commits: 24 · Comments: 5 · Lines: +429 / -94

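Executive Summary

Adds a lazy extra-buffer strategy for Mamba that reduces memory usage.

The existing extra_buffer strategy needs 2 extra buffers to avoid state corruption when a decode track boundary, overlap schedule, and request completion coincide, which is a trade-off. In high-throughput scenarios, keeping 2 buffers resident creates unnecessary memory pressure. extra_buffer_lazy allocates the second slot on demand, requesting it temporarily only at a boundary, which lowers resident memory usage.

Worth a close read, especially for understanding the state-management design of the Mamba scheduler. Focus on the refactor into _handle_finish_state_updated_req, the boundary handling in mamba_lazy_prealloc_at_boundary, and how the is_insert parameter is passed to release_kv_cache; these are the key points for pattern reuse and abstraction.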

Discussion Highlights

The PR produced no substantive review discussion; the only comments are CI re-run triggers from the author.

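Implementation Breakdown

New scheduler strategy string: adds "extra_buffer_lazy" to MAMBA_SCHEDULER_STRATEGY_CHOICES in server_args.py, plus an enable_mamba_extra_buffer_lazy() helper that other modules can query.
Lazy allocation in the memory pool: HybridReqToTokenPool in memory_pool.py gains an enable_mamba_extra_buffer_lazy parameter; the new _alloc_ping_pong_buffer method pre-allocates only 1 slot in lazy mode (the second is set to -1); set_mamba_ping_pong_slot and donate_mamba_ping_pong_slot are added to set and return the temporary slot on demand.
Boundary pre-allocation in the scheduler: the Req class in schedule_batch.py gains a mamba_lazy_is_insert flag; the new mamba_lazy_prealloc_at_boundary() method runs on every prepare_for_decode and checks whether a request is at a track boundary. If so, it tries to allocate the second slot; on failure it evicts one mamba state and retries, and if that also fails it marks the request to skip cache insertion.
Post-processing changes: in batch_result_processor.py, _handle_finished_req is replaced by _handle_finish_state_updated_req (called after update_finish_state); the new function calls _mamba_prefix_cache_update and, when releasing the cache, uses req.mamba_lazy_is_insert to decide whether to actually insert into the radix cache; a new mamba_lazy_post_decode_at_boundary cleans up the temporary slot after decode.
Cache release and insertion control: release_kv_cache in mem_cache/common.py gains an is_insert parameter; mamba_radix_cache.py and unified_cache_components/mamba_component.py use it to decide whether to skip insertion.
Test coverage: adds test/registered/models_e2e/test_qwen3_next_models.py with 4 test classes: TestQwen3NextLazyExtraBuffer, TestQwen3NextLazyExtraBufferLargePage, and two manually run AllocFail variants that simulate allocation failure through an environment variable. Together they verify GSM8K accuracy, KL divergence, and the prefix-cache branch. test/manual/4-gpu-models/test_qwen3_next_models.py is also modified to add a lazy-mode test entry.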
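
The steps above amount to a three-phase slot lifecycle: one slot at admission, a borrowed second slot at a track boundary, and a release after decode. The following is a minimal sketch of that lifecycle with a toy free-list pool. `ToySlotPool`, `ToyReq`, `admit`, `prealloc_at_boundary`, `post_decode_at_boundary`, and `resident_slots` are hypothetical names for illustration, and the choice of which slot is released after decode (the one the next-track index no longer points at) is an assumption, not the PR's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EMPTY = -1  # sentinel for "slot not allocated", mirroring the -1 used in the PR


class ToySlotPool:
    """Hypothetical free-list pool standing in for the real mamba state pool."""

    def __init__(self, size: int) -> None:
        self.free_slots: List[int] = list(range(size))

    def alloc(self) -> Optional[int]:
        # Return a free slot id, or None when the pool is exhausted.
        return self.free_slots.pop() if self.free_slots else None

    def free(self, slot: int) -> None:
        self.free_slots.append(slot)


@dataclass
class ToyReq:
    buf: List[int] = field(default_factory=lambda: [EMPTY, EMPTY])
    next_idx: int = 0


def admit(pool: ToySlotPool, req: ToyReq) -> None:
    """Lazy mode: reserve only one slot when the request is admitted."""
    slot = pool.alloc()
    assert slot is not None, "toy pool exhausted"
    req.buf[0] = slot
    req.next_idx = 0


def prealloc_at_boundary(pool: ToySlotPool, req: ToyReq) -> bool:
    """At a track boundary, borrow a second slot and point next_idx at it."""
    other = 1 - req.next_idx
    slot = pool.alloc()
    if slot is None:
        return False  # caller would skip cache insertion in this case
    req.buf[other] = slot
    req.next_idx = other
    return True


def post_decode_at_boundary(pool: ToySlotPool, req: ToyReq) -> None:
    """After decode, release the slot next_idx no longer points at (assumed)."""
    stale = 1 - req.next_idx
    if req.buf[stale] != EMPTY:
        pool.free(req.buf[stale])
        req.buf[stale] = EMPTY


def resident_slots(req: ToyReq) -> int:
    """Number of slots the request currently holds."""
    return sum(1 for s in req.buf if s != EMPTY)
```

In this toy model a request holds one slot most of the time and two only between the boundary pre-allocation and the post-decode release, which is the behaviour the lazy strategy is designed to achieve.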

File	Module	Status	Importance
`python/sglang/srt/managers/scheduler_components/batch_result_processor.py`	Batch result processor	modified	8.91
`python/sglang/srt/mem_cache/memory_pool.py`	Memory pool	modified	8.47
`python/sglang/srt/managers/schedule_batch.py`	Schedule batch	modified	7.28

Key Symbols

_handle_finish_state_updated_req mamba_lazy_post_decode_at_boundary get_mamba_ping_pong_keep_idx _alloc_ping_pong_buffer set_mamba_ping_pong_slot donate_mamba_ping_pong_slot mamba_lazy_prealloc_at_boundary enable_mamba_extra_buffer_lazy _mamba_check_track_boundary

Key Source Snippets

python/sglang/srt/managers/scheduler_components/batch_result_processor.py core-logic

Core change: replaces `_handle_finished_req` with `_handle_finish_state_updated_req` and adds `_mamba_check_track_boundary` and `mamba_lazy_post_decode_at_boundary`, which together implement the post-processing logic for lazy mode.

# python/sglang/srt/managers/scheduler_components/batch_result_processor.py

def _handle_finish_state_updated_req(
    self,
    req: Req,
    batch: ScheduleBatch,
    result: GenerationBatchResult,
    i: int,
    logits_output: LogitsProcessorOutput,
):
    # Called after update_finish_state, so req.finished() is already valid here;
    # mamba_lazy_post_decode_at_boundary relies on it internally.
    self._mamba_prefix_cache_update(req, batch, result, i)

    # The original _handle_finished_req logic below is unchanged ...
    if (
        self.server_args.disaggregation_decode_enable_offload_kvcache
        and not req.finished()
    ):
        ...
    # When releasing the cache, lazy mode decides whether to insert into the radix cache
    is_insert = (
        req.mamba_lazy_is_insert
        if get_global_server_args().enable_mamba_extra_buffer_lazy()
        else True
    )
    release_kv_cache(req, self.tree_cache, is_insert=is_insert)
    req.time_stats.set_completion_time()


def _mamba_prefix_cache_update(self, req, batch, result, i):
    """Update the Mamba track state; only acts at a ping-pong boundary."""
    if req.mamba_ping_pong_track_buffer is None:
        return

    lazy = get_global_server_args().enable_mamba_extra_buffer_lazy()
    at_boundary, track_seqlen = self._mamba_check_track_boundary(req, batch, result, i)

    if not at_boundary:
        return

    # Non-lazy mode: swap the ping-pong index directly.
    # Lazy mode: leave the index unchanged; post-processing releases the temporary slot.
    if not lazy:
        req.mamba_next_track_idx = (
            batch.req_to_token_pool.get_mamba_ping_pong_other_idx(
                req.mamba_next_track_idx
            )
        )
    else:
        # mamba_lazy_post_decode_at_boundary cleans up the temporary slot later
        pass
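
The insert decision in `_handle_finish_state_updated_req` reduces to a small truth table. A minimal sketch, assuming a hypothetical standalone helper `resolve_is_insert` (it is not part of the PR and only restates the conditional expression shown above):

```python
def resolve_is_insert(lazy_enabled: bool, req_lazy_is_insert: bool) -> bool:
    """Return whether a released request should be inserted into the radix cache.

    Outside lazy mode the per-request flag is ignored and insertion always
    happens; in lazy mode the flag set at the boundary decides.
    """
    return req_lazy_is_insert if lazy_enabled else True


# Enumerate the four cases: (lazy_enabled, per-request flag) -> is_insert
TRUTH_TABLE = {
    (lazy, flag): resolve_is_insert(lazy, flag)
    for lazy in (False, True)
    for flag in (False, True)
}
```

Only the combination of lazy mode enabled and a cleared per-request flag suppresses insertion, so existing strategies keep their previous behaviour.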

python/sglang/srt/mem_cache/memory_pool.py core-logic

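Introduces the memory-pool methods that lazy allocation needs: `_alloc_ping_pong_buffer` (allocates only 1 slot in lazy mode), `set_mamba_ping_pong_slot`, `donate_mamba_ping_pong_slot`, and `get_mamba_ping_pong_keep_idx`, along with a new class variable on `ReqToTokenPool`.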

# python/sglang/srt/mem_cache/memory_pool.py

class ReqToTokenPool:
    enable_mamba_extra_buffer_lazy: bool = False  # class variable marking lazy mode

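    def _alloc_ping_pong_buffer(self, req: "Req"):
        """Allocate the ping-pong buffer for a request.
        Lazy mode allocates only 1 slot and sets the second to -1 (allocated on demand);
        non-lazy mode allocates every slot (usually 2).
        """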
        n = 1 if self.enable_mamba_extra_buffer_lazy else self.mamba_ping_pong_track_buffer_size
        slots = self.mamba_pool.alloc(n)
        assert slots is not None, (
            "Not enough space for mamba ping pong idx, "
            "try to increase --mamba-full-memory-ratio."
        )
        buf = torch.full(
            (self.mamba_ping_pong_track_buffer_size,),
            -1,
            dtype=slots.dtype,
            device=slots.device,
        )
        buf[:n] = slots
        req.mamba_ping_pong_track_buffer = buf
        req.mamba_next_track_idx = 0

    def set_mamba_ping_pong_slot(self, req: "Req", idx: int, value):
        """Set one slot of the ping-pong buffer and sync the device-side mapping."""
        req.mamba_ping_pong_track_buffer[idx] = value
        # Sync the device-side mapping ... (omitted)

    def donate_mamba_ping_pong_slot(self, req: "Req", idx: int):
        """Return one slot of the ping-pong buffer to the memory pool."""
        slot = req.mamba_ping_pong_track_buffer[idx].item()
        if slot != -1:
            self.mamba_pool.free(slot)
            req.mamba_ping_pong_track_buffer[idx] = -1

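    def get_mamba_ping_pong_keep_idx(self, req: "Req") -> int:
        """Return the ping-pong index that holds the current track state.
        In lazy mode the valid state is at next_track_idx (not swapped);
        in non-lazy mode it is at the other index (already swapped).
        """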
        if self.enable_mamba_extra_buffer_lazy:
            return req.mamba_next_track_idx
        return self.get_mamba_ping_pong_other_idx(req.mamba_next_track_idx)
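
To make the `-1` sentinel layout and the keep-index rule concrete, here is a torch-free sketch using plain lists. `make_ping_pong_buffer`, `other_idx`, and `keep_idx` are hypothetical helpers for illustration, and the buffer size of 2 is an assumption taken from the "usually 2" note in the docstring above.

```python
from typing import List

EMPTY = -1  # same sentinel value the snippet above writes with torch.full


def make_ping_pong_buffer(slots: List[int], size: int = 2) -> List[int]:
    """Build a fixed-size buffer, padding unallocated positions with EMPTY."""
    if not 0 < len(slots) <= size:
        raise ValueError("need between 1 and `size` allocated slots")
    return list(slots) + [EMPTY] * (size - len(slots))


def other_idx(idx: int) -> int:
    """The opposite position in a two-slot ping-pong buffer."""
    return 1 - idx


def keep_idx(lazy: bool, next_track_idx: int) -> int:
    """Index of the state to keep, following the lazy/non-lazy rule above."""
    return next_track_idx if lazy else other_idx(next_track_idx)
```

With one allocated slot the buffer is `[slot, -1]`, which is the lazy layout; with two it is fully populated, which is the layout of the original strategy.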

python/sglang/srt/managers/schedule_batch.py core-logic

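Adds the `mamba_lazy_prealloc_at_boundary` method, which allocates the second ping-pong slot on demand during decode preparation, and adds a `mamba_lazy_is_insert` flag to `Req` that controls whether the request is inserted into the radix cache.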

# python/sglang/srt/managers/schedule_batch.py

class ScheduleBatch:
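    def mamba_lazy_prealloc_at_boundary(self, mamba_track_interval: int):
        """Pre-allocate the second ping-pong slot for lazy-mode requests at a track boundary.
        Iterate over all requests; if the current seq_len is a multiple of
        track_interval and the second slot is still unoccupied (value -1),
        try to allocate it. On failure, evict one mamba state and retry; if that
        also fails, set mamba_lazy_is_insert = False to skip cache insertion.
        """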
        pool = self.req_to_token_pool
        for i, req in enumerate(self.reqs):
            buf = req.mamba_ping_pong_track_buffer
            assert buf is not None
            if self.seq_lens_cpu[i].item() % mamba_track_interval != 0:
                continue
            other_idx = 1 - req.mamba_next_track_idx
            if buf[other_idx].item() != -1:
                # Under overlap, the previous round's post-processing has not released it yet; skip
                continue

            if envs.SGLANG_TEST_MAMBA_LAZY_ALLOC_FAIL.get():
                new_slot = None
            else:
                new_slot = pool.mamba_pool.alloc(1)
            if new_slot is None:
                # Evict one mamba state, then retry
                self.tree_cache.evict(EvictParams(num_tokens=0, mamba_num=1))
                new_slot = pool.mamba_pool.alloc(1)
            if new_slot is not None:
                pool.set_mamba_ping_pong_slot(req, other_idx, new_slot[0])
                req.mamba_next_track_idx = other_idx
            else:
                # Allocation failed: mark the request so it is not inserted into the cache
                req.mamba_lazy_is_insert = False
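
The allocate, evict, retry fallback in the method above can be isolated as follows. This is a sketch with a toy pool and a caller-supplied eviction callback; `TinyPool`, `at_track_boundary`, and `alloc_with_evict_retry` are hypothetical names, and the real `tree_cache.evict` call is replaced by a plain function for illustration.

```python
from typing import Callable, List, Optional


class TinyPool:
    """Hypothetical pool that hands out slot ids from a free list."""

    def __init__(self, free_slots: List[int]) -> None:
        self.free_slots = list(free_slots)

    def alloc(self) -> Optional[int]:
        return self.free_slots.pop() if self.free_slots else None


def at_track_boundary(seq_len: int, track_interval: int) -> bool:
    """A request is at a boundary when seq_len is a multiple of the interval."""
    return seq_len % track_interval == 0


def alloc_with_evict_retry(
    pool: TinyPool, evict_one: Callable[[], None]
) -> Optional[int]:
    """Try to allocate; on failure evict once and retry exactly one more time."""
    slot = pool.alloc()
    if slot is None:
        evict_one()          # may or may not free anything
        slot = pool.alloc()  # second and final attempt
    return slot              # None means the caller should skip cache insertion
```

A `None` result corresponds to the branch that sets `mamba_lazy_is_insert = False`: the request keeps running on its single slot and simply does not contribute a cached state.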

Comment Highlights

No high-value discussion threads were identified; the comments so far contain no clear point of contention or conclusion.

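Risks and Impact

State-consistency risk: when allocation fails, lazy mode skips caching, so the state at some boundaries may go uncached, which can affect the hit rate of later requests or the KL divergence. Tests cover the allocation-failure path, but allocation failures may be more frequent in production than in tests.
Compatibility limits: extra_buffer_lazy currently does not support speculative decoding (there is an explicit assertion in server_args.py) and requires overlap schedule to be enabled (otherwise an assertion fails). Future extensions need to keep these limits in mind.
Extra conditional branches add maintenance complexity: batch_result_processor.py branches on enable_mamba_extra_buffer_lazy() in several places, which may interact with other scheduling features (such as PD disaggregation and streaming). Regression risk is concentrated in batch-result processing, which is a core path.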
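
The two compatibility limits can be expressed as a start-up check. The sketch below assumes a hypothetical `ToyServerArgs` dataclass; its field names (`mamba_scheduler_strategy`, `speculative_algorithm`, `disable_overlap_schedule`) are illustrative and may not match the real server_args.py, and only the strategy string "extra_buffer_lazy" is taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyServerArgs:
    # All field names here are assumptions made for this sketch.
    mamba_scheduler_strategy: str = "extra_buffer"
    speculative_algorithm: Optional[str] = None
    disable_overlap_schedule: bool = False

    def enable_mamba_extra_buffer_lazy(self) -> bool:
        return self.mamba_scheduler_strategy == "extra_buffer_lazy"


def validate_lazy_strategy(args: ToyServerArgs) -> None:
    """Raise ValueError when lazy mode is combined with an unsupported option."""
    if not args.enable_mamba_extra_buffer_lazy():
        return  # the limits only apply to the lazy strategy
    if args.speculative_algorithm is not None:
        raise ValueError("extra_buffer_lazy does not support speculative decoding")
    if args.disable_overlap_schedule:
        raise ValueError("extra_buffer_lazy requires overlap schedule")
```

The source describes these checks as assertions; the sketch raises `ValueError` instead so that the failure mode stays visible even when Python runs with assertions disabled.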

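Users: a new opt-in strategy, enabled with --mamba-scheduler-strategy extra_buffer_lazy; existing configurations are unaffected. High-throughput Mamba inference users can cut resident ping-pong buffer memory by roughly half (the peak can still reach 2 slots during on-demand allocation), at the cost of slightly lower cache efficiency when allocation fails.
System: the core scheduling pipeline gains a few branch checks, with negligible overall performance impact. The number of alloc calls on the memory pool may increase (on-demand allocation at boundaries), but only by a limited amount.
Team: two strategies now need tests and documentation, and any new scheduling feature must be adapted to both.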
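
The "roughly half" figure follows from simple slot counting. A back-of-the-envelope sketch, assuming a ping-pong buffer size of 2 and treating the number of requests that sit at a boundary at the same moment as an input; the function name `resident_ping_pong_slots` is hypothetical and no measurements from the PR are implied:

```python
def resident_ping_pong_slots(
    num_reqs: int,
    lazy: bool,
    reqs_at_boundary: int = 0,
    buffer_size: int = 2,
) -> int:
    """Count ping-pong slots held across all running requests.

    Non-lazy mode holds `buffer_size` slots per request at all times.
    Lazy mode holds 1 per request, plus the extra slots borrowed by
    requests that are currently at a track boundary.
    """
    if not 0 <= reqs_at_boundary <= num_reqs:
        raise ValueError("reqs_at_boundary must be between 0 and num_reqs")
    if not lazy:
        return num_reqs * buffer_size
    return num_reqs + reqs_at_boundary * (buffer_size - 1)
```

Under these assumptions, 64 running requests hold 128 slots with the original strategy and 64 with the lazy one when none is at a boundary; the lazy count only climbs back to 128 in the worst case where every request hits a boundary simultaneously.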

Risk tags: compatibility limits, depends on overlap schedule, best-effort semantics

Related Issues

No related issue was identified; no explicit issue link was detected for this PR.


Related Context

No clearly related PRs.
