#26062 [UnifiedRadixTree]: Support L3 HiStorage framework

Original PR · Author: hzh0425 · Merged: 2026-05-26 22:38 · Files changed: 12 · Commits: 12 · Comments: 22 · Lines changed: +1149 / -72

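Executive summary

Adds an L3 hierarchical storage backend framework to UnifiedRadixCache.

The Todo list in the PR body shows that the main driver is support for the L3 storage framework, which is then used to validate the GLM5.1 DSA Model, improve MambaComponent (Qwen Hybrid Linear Model), and later support SWAComponent (DeepSeek-V4 Hybrid SWA and sparse model). The goal is to extend the capacity of the existing HiCache by keeping infrequently used KV/Mamba state in a file backend, so that longer contexts or larger models fit within limited GPU memory.

Reviewers should focus on the prefetch and backup mechanisms in unified_radix_cache.py and on how hybrid_cache_controller.py parses configuration; these two files form the backbone of L3 storage. Also check that host lock is implemented consistently across components, especially for the upcoming SWAComponent support. The test case test_unified_radix_cache_kl_hicache_part2.py is a good reference for integrating a Mamba hybrid model.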

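Discussion highlights

The main review threads were:

CPU tensors used in NCCL AllReduce: gemini-code-assist[bot] pointed out that write_backup_storage and _prefetch_timeout_check_linear_func create 3 tensors on the CPU, which fails when the TP group uses the NCCL backend, and suggested adding device=self.device. The author did not respond explicitly in the comments; since the PR was merged, the issue was presumably fixed or handled another way.
TOML parsing compatibility: the bot noted that tomllib is only available in Python 3.11+ and suggested a tomli fallback for 3.9/3.10. No follow-up change was confirmed.
get_prefix_hash_values method design: the bot noted that the method ignores self and relies on the node argument, and suggested @staticmethod or a refactor. The final code keeps the instance method and adds lru_cache, so the design remains debatable.
last_host_node logic simplification: ispobock asked why the original extra while loop, which searched for the last backed-up node, was removed; hzh0425 explained that under the unified tree best_match_node can be used directly. Both agreed, and a comment was added.
swa_component sync: ispobock asked for the host lock interface of swa_component to be updated as well, and the author confirmed it was done.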

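Implementation breakdown

Extend the UnifiedRadixCache core logic: unified_radix_cache.py adds the get_last_hash_value and get_prefix_hash_values methods for node hash computation and path construction; introduces write_backup_storage, prefetch_from_storage, and related functions for asynchronous write-back and prefetch of data across device, host, and hierarchical storage; and adds the inc_host_lock_ref / dec_host_lock_ref interface to manage host-side lock reference counts.
Enhance HybridCacheController: hybrid_cache_controller.py adds the static method parse_storage_backend_extra_config, which parses storage backend configuration from an inline JSON string or a JSON/TOML/YAML file; adds _init_extra_host_mem_release_queues, append_host_mem_release, and related functions that implement extra host memory release queues; and overrides _start_storage_threads to start the consumer threads for those queues.
Adapt the Mamba and Full components: mamba_component.py and full_component.py change acquire_component_lock and release_component_lock to accept a lock_host parameter that distinguishes device locking from host locking; mamba_component.py also adds transfer construction logic for the BACKUP_STORAGE and PREFETCH stages.
Refactor the test framework: renames GSM8KTwoPassMixin to AccuracyTwoPassMixin and generalizes it to multi-task evaluation such as MMLU; adds the TestUnifiedMambaHiCacheL3 test class (test_unified_radix_cache_kl_hicache_part2.py) to verify the integration of a Mamba model with the file-backed L3 cache; updates hybrid_pool_assembler.py to pass the storage backend configuration parameters to the cache controller.
Supporting changes: adjusts how configuration keys are passed in unified_cache_components/tree_component.py, swa_component.py, and base_prefix_cache.py; removes test cases that are no longer needed.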
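
The extra host memory release queues mentioned above can be pictured as a producer/consumer pair. The following is a minimal sketch, not the PR's implementation: `ReleaseQueue`, `free_fn`, and the `None` shutdown sentinel are hypothetical names introduced for illustration; only the idea of a queue drained by a background thread comes from the description.

```python
import queue
import threading
from typing import Callable, List, Optional


class ReleaseQueue:
    """Toy host-memory release queue drained by one background thread.

    Hypothetical helper: the real controller's queue layout is not shown
    in this report.
    """

    def __init__(self, free_fn: Callable[[List[int]], None]) -> None:
        self._q: "queue.Queue[Optional[List[int]]]" = queue.Queue()
        self._free_fn = free_fn
        self._thread = threading.Thread(target=self._consume, daemon=True)
        self._thread.start()

    def append(self, host_indices: List[int]) -> None:
        # Producer side: hand indices to the consumer without blocking.
        self._q.put(host_indices)

    def _consume(self) -> None:
        # Consumer side: block on the queue and free until the sentinel arrives.
        while True:
            item = self._q.get()
            if item is None:
                break
            self._free_fn(item)

    def close(self) -> None:
        # Send the sentinel and wait for all earlier items to be processed.
        self._q.put(None)
        self._thread.join()


freed: List[int] = []
rq = ReleaseQueue(free_fn=freed.extend)
rq.append([1, 2])
rq.append([3])
rq.close()
print(freed)
```

Because queue.Queue is thread-safe and a single consumer drains it in FIFO order, the producer needs no additional lock in this toy version.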

File	Module	Status	Importance
`python/sglang/srt/mem_cache/unified_radix_cache.py`	Cache layer	modified	8.93
`python/sglang/srt/mem_cache/hybrid_cache/hybrid_cache_controller.py`	Cache control	modified	8.64
`python/sglang/srt/mem_cache/unified_cache_components/mamba_component.py`	Mamba cache	modified	7.32
`test/registered/radix_cache/test_unified_radix_cache_kl_hicache_part2.py`	Integration test	added	7.25

Key symbols

get_last_hash_value, get_prefix_hash_values, inc_host_lock_ref, dec_host_lock_ref, write_backup_storage, prefetch_from_storage, _prefetch_timeout_check_linear_func, _start_storage_threads, parse_storage_backend_extra_config, clear_storage_backend, _init_extra_host_mem_release_queues, append_host_mem_release, acquire_component_lock, release_component_lock, build_hicache_transfers

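Key source snippets

python/sglang/srt/mem_cache/unified_radix_cache.py dependency-wiring

Core file of the change. It adds the hash methods, storage backup, prefetch, and host locking logic, and is the main entry point of the L3 storage framework.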

# Hash-related methods in the UnifiedTreeNode class
# Return the last hash value of the current node (the node's own hash, not the path)
def get_last_hash_value(self) -> Optional[str]:
    # Return None if hash_value is None or an empty list
    if self.hash_value is None or len(self.hash_value) == 0:
        return None
    # Return the last element of the list
    return self.hash_value[-1]

# Return the full hash path from the root to the given node
# @lru_cache is meant to avoid recomputing the path; maxsize=1 keeps only the most recent result
@lru_cache(maxsize=1)
def get_prefix_hash_values(self, node: Optional[UnifiedTreeNode]) -> list[str]:
    # Return an empty list if node is None or has no hash_value
    if node is None or node.hash_value is None:
        return []
    # Recursively get the parent's prefix hashes and append this node's hash_value
    return node.get_prefix_hash_values(node.parent) + node.hash_value

python/sglang/srt/mem_cache/hybrid_cache/hybrid_cache_controller.py entrypoint

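Entry point of the cache controller. It adds configuration parsing, host memory release queues, and storage thread management, unifying storage backend behavior across components (Full, Mamba, SWA).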

# Static method: parse the extra storage backend configuration
# Accepts an inline JSON string or an @-prefixed file path (JSON/TOML/YAML)
@staticmethod
def parse_storage_backend_extra_config(
    storage_backend_extra_config: Optional[str],
) -> tuple[dict, int, float, float, bool]:
    extra_config = {}
    if storage_backend_extra_config:
        if storage_backend_extra_config.startswith("@"):
            # File mode: choose the parser by file extension
            path = storage_backend_extra_config[1:]
            ext = os.path.splitext(path)[1].lower()
            with open(path, "rb" if ext == ".toml" else "r") as f:
                if ext == ".json":
                    extra_config = json.load(f)
                elif ext == ".toml":
                    # Note: tomllib is only available in Python >= 3.11
                    import tomllib as toml_parser
                    extra_config = toml_parser.load(f)
                elif ext in (".yaml", ".yml"):
                    import yaml
                    extra_config = yaml.safe_load(f)
                else:
                    raise ValueError(f"Unsupported config file {path}")
        else:
            extra_config = json.loads(storage_backend_extra_config)

    # Extract the reserved keys; whatever remains is the extra configuration
    prefetch_threshold = extra_config.pop("prefetch_threshold", 256)
    prefetch_timeout_base = extra_config.pop("prefetch_timeout_base", 1)
    prefetch_timeout_per_ki_token = extra_config.pop("prefetch_timeout_per_ki_token", 0.25)
    hicache_storage_pass_prefix_keys = extra_config.pop("hicache_storage_pass_prefix_keys", False)

    # Type validation
    if not isinstance(prefetch_threshold, int):
        raise ValueError(...)
    # ... similar checks omitted

    return (extra_config, prefetch_threshold, float(prefetch_timeout_base),
            float(prefetch_timeout_per_ki_token), hicache_storage_pass_prefix_keys)
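
As a usage illustration, the split between reserved keys and backend-specific keys can be reproduced in a few lines. This is a simplified stand-in for the parsing step above that covers only the inline JSON path; `split_reserved_keys` and the `storage_dir` key are hypothetical names, the defaults are the ones shown in the snippet, and the explicit bool rejection is an extra check that the snippet does not show.

```python
import json
from typing import Any, Dict, Tuple

# Defaults copied from the snippet above.
RESERVED_DEFAULTS: Dict[str, Any] = {
    "prefetch_threshold": 256,
    "prefetch_timeout_base": 1,
    "prefetch_timeout_per_ki_token": 0.25,
    "hicache_storage_pass_prefix_keys": False,
}


def split_reserved_keys(raw: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (reserved, extra) parsed from an inline JSON string."""
    extra = json.loads(raw) if raw else {}
    # Pop each reserved key, falling back to its default when absent.
    reserved = {k: extra.pop(k, d) for k, d in RESERVED_DEFAULTS.items()}
    # bool is a subclass of int in Python, so reject it explicitly.
    threshold = reserved["prefetch_threshold"]
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise ValueError("prefetch_threshold must be int")
    return reserved, extra


reserved, extra = split_reserved_keys(
    '{"prefetch_threshold": 512, "storage_dir": "/tmp/hicache"}'
)
print(reserved["prefetch_threshold"], extra)
```

Keys that are not reserved stay in `extra` untouched, which matches the "whatever remains is the extra configuration" behavior of the snippet.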

python/sglang/srt/mem_cache/unified_cache_components/mamba_component.py dependency-wiring

Implements host locking support for the Mamba component and transfer construction for the new stages; a key part of hybrid model adaptation.

# acquire_component_lock as overridden in MambaComponent
# The new lock_host parameter locks the host copy instead of the device copy
def acquire_component_lock(
    self,
    node: UnifiedTreeNode,
    result: IncLockRefResult,
    lock_host: bool = False,  # new parameter
) -> IncLockRefResult:
    ct = self.component_type
    if node is self.cache.root_node:
        return result
    cd = node.component_data[ct]
    # Pick host_value or value depending on lock_host
    value = cd.host_value if lock_host else cd.value
    if value is None:
        result.skip_lock_node_ids.setdefault(ct, set()).add(node.id)
        return result

    if lock_host:
        # Host lock: remove the node from the host LRU
        if cd.host_lock_ref == 0:
            host_lru = self.cache.host_lru_lists[ct]
            if host_lru.in_list(node):
                host_lru.remove_node(node)
        cd.host_lock_ref += 1
    else:
        # Device lock: adjust the evictable/protected sizes
        if cd.lock_ref == 0:
            vlen = len(value)
            self.cache.component_evictable_size_[ct] -= vlen
            self.cache.component_protected_size_[ct] += vlen
        cd.lock_ref += 1
    return result
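
The host-lock branch implies a matching release path that makes the node evictable again once the count returns to zero. The release side is not shown in this report, so the following is a toy model under that assumption: `ToyHostLock`, its set-based stand-in for the host LRU, and the underflow check are hypothetical and only illustrate the reference-counting invariant.

```python
class ToyHostLock:
    """Toy reference counter with a set standing in for the host LRU."""

    def __init__(self) -> None:
        self.host_lock_ref = 0
        self.evictable = {"node"}  # node starts unlocked, so it is evictable

    def acquire(self) -> None:
        # The first lock removes the node from the evictable set.
        if self.host_lock_ref == 0:
            self.evictable.discard("node")
        self.host_lock_ref += 1

    def release(self) -> None:
        # Guard against unbalanced releases, which would corrupt the count.
        if self.host_lock_ref <= 0:
            raise RuntimeError("release without matching acquire")
        self.host_lock_ref -= 1
        # The last unlock makes the node evictable again.
        if self.host_lock_ref == 0:
            self.evictable.add("node")


lock = ToyHostLock()
lock.acquire()
lock.acquire()
lock.release()
still_protected = "node" not in lock.evictable  # one holder remains
lock.release()
evictable_again = "node" in lock.evictable
print(still_protected, evictable_again)
```

The point of the sketch is the symmetry: every component that implements the acquire side needs a release side that restores exactly what acquire changed.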

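Review comment highlights

CPU tensors used in NCCL AllReduce cause errors (correctness)

gemini-code-assist[bot] identified 3 tensors (`states`, `completed_tokens_tensor`, `qsizes`) that are created on the CPU, which fails when the TP group runs `all_reduce` with the NCCL backend. It suggested adding the `device=self.device` argument.

Conclusion: the author did not respond in the comments, but the PR was eventually merged; the issue may have been fixed in a later commit or worked around another way. · unresolved

TOML parsing unavailable on Python 3.9/3.10 (other)

gemini-code-assist[bot] pointed out that `parse_storage_backend_extra_config` uses `import tomllib`, a module that exists only in Python 3.11+, and suggested `tomli` as a fallback.

Conclusion: no reply or change from the author was seen; the current code still uses `tomllib`. · unresolved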
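
The fallback the bot suggested is commonly written as a guarded import. A minimal sketch, assuming the third-party `tomli` package is installed on Python 3.9/3.10; `load_toml_bytes` is a hypothetical helper and not part of the PR.

```python
import io
import sys

if sys.version_info >= (3, 11):
    import tomllib as toml_parser  # standard library from Python 3.11
else:
    try:
        import tomli as toml_parser  # third-party backport with the same API
    except ImportError:
        toml_parser = None  # TOML support unavailable on this interpreter


def load_toml_bytes(data: bytes) -> dict:
    """Parse TOML from bytes, failing clearly when no parser is present."""
    if toml_parser is None:
        raise RuntimeError("TOML config needs Python >= 3.11 or the tomli package")
    # Both tomllib.load and tomli.load require a binary file object.
    return toml_parser.load(io.BytesIO(data))


if toml_parser is not None:
    print(load_toml_bytes(b"prefetch_threshold = 512\n"))
```

Resolving the parser once at import time keeps the failure mode explicit: users on older interpreters see a clear error only when they actually pass a TOML file.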

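get_prefix_hash_values instance method design (design)

gemini-code-assist[bot] pointed out that `get_prefix_hash_values` is an instance method that ignores `self` and operates on the `node` argument instead, passing `node.parent` in its recursive call, which is unconventional. It suggested switching to `@staticmethod` or refactoring.

Conclusion: the author did not reply; the final code keeps the instance method and adds `@lru_cache`, so the design remains debatable. · unresolved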
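
One way to address the bot's point is a static, iterative walk from the node up to the root. This is a sketch under stated assumptions, not the PR's code: `ToyNode` is a hypothetical stand-in for UnifiedTreeNode with only `parent` and `hash_value`, and the walk stops at the first ancestor without a hash_value, mirroring the base case of the original recursion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for UnifiedTreeNode."""

    hash_value: Optional[List[str]] = None
    parent: Optional["ToyNode"] = None

    @staticmethod
    def get_prefix_hash_values(node: Optional["ToyNode"]) -> List[str]:
        # Collect per-node hash lists from leaf to root, then reverse.
        chunks: List[List[str]] = []
        while node is not None and node.hash_value is not None:
            chunks.append(node.hash_value)
            node = node.parent
        return [h for chunk in reversed(chunks) for h in chunk]


root = ToyNode()  # the root carries no hash_value
a = ToyNode(hash_value=["h0", "h1"], parent=root)
b = ToyNode(hash_value=["h2"], parent=a)
print(ToyNode.get_prefix_hash_values(b))
```

Dropping the cache avoids returning a stale path if a node's hash_value changes later, at the cost of one walk proportional to the tree depth per call.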

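last_host_node traversal logic simplification (design)

ispobock asked why the extra while loop that used to search for the last backed-up node was removed; hzh0425 explained that under the unified tree `best_match_node` can be used directly as `last_host_node`; ispobock then asked whether the same holds for the full attention case; hzh0425 confirmed it and added an explanatory comment.

Conclusion: both agreed that `last_host_node` uses `best_match_node` directly. · resolved

swa_component host lock interface must be updated in sync (design)

While reviewing `inc_host_lock_ref`, ispobock asked that `swa_component` also be updated to support the `lock_host` parameter; otherwise the SWA component would lack host locking under L3 storage.

Conclusion: hzh0425 replied "updated", confirming that the code is in sync. · resolved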

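Risks and impact

CPU tensor AllReduce risk: if the tensors in write_backup_storage and related functions still live on the CPU, running TP with NCCL will hit a runtime error. Although the PR is merged, confirm that the device is set correctly in production.
Python version compatibility risk: on Python 3.9/3.10, the import tomllib in parse_storage_backend_extra_config fails, so TOML configuration parsing is unavailable.
Host lock reference counting: the new inc_host_lock_ref / dec_host_lock_ref logic spans several components (Full, Mamba, SWA). If any component implements it incorrectly, a host copy may be released too early or never released, leading to data corruption or a memory leak.
Prefetch timeout policy: _prefetch_timeout_check_linear_func performs a linear-time timeout check that may become a performance bottleneck under highly concurrent requests, and the timeout parameters need tuning for each workload.
Thread safety: the new extra_host_mem_release_queues and their consumer threads are exposed to race conditions under concurrent access, so the locking must be complete.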
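
For intuition about the tuning knobs in the prefetch timeout item, a linear timeout can be written as a base plus an increment per 1024 tokens. This formula is an assumption inferred from the parameter names prefetch_timeout_base and prefetch_timeout_per_ki_token; the report does not show the body of _prefetch_timeout_check_linear_func, and `linear_timeout_s` and `is_timed_out` are hypothetical names.

```python
def linear_timeout_s(
    num_tokens: int,
    base_s: float = 1.0,           # default of prefetch_timeout_base
    per_ki_token_s: float = 0.25,  # default of prefetch_timeout_per_ki_token
) -> float:
    """Assumed timeout budget: base plus a linear term per 1024 tokens."""
    return base_s + per_ki_token_s * (num_tokens / 1024)


def is_timed_out(elapsed_s: float, num_tokens: int) -> bool:
    # Under this assumption, a prefetch is abandoned once it exceeds its budget.
    return elapsed_s > linear_timeout_s(num_tokens)


print(linear_timeout_s(4096))  # 1.0 + 0.25 * 4 = 2.0
print(is_timed_out(2.5, 4096))
```

With the default values, a 4096-token prefetch would get a 2-second budget under this model; slower storage backends would need a larger base or per-token term.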

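User impact: users who enable --enable-hierarchical-cache and set --hicache-storage-backend=file get L3 storage automatically; the new options (such as prefetch_threshold and prefetch_timeout_base) can be adjusted through --hicache-storage-backend-extra-config.

System impact: adds 3 background threads (for storage I/O and host memory release). In low-GPU-memory environments this can significantly reduce the cache miss penalty, at the cost of extra CPU and memory overhead. Cache policy for hybrid models (including Mamba/SWA) is now managed in one place, so storage logic no longer has to be maintained separately per component.

Team impact: codebase complexity increases, and the new HybridCacheController takes on more responsibilities; the tests cover two hybrid models, GLM5 and Qwen-Next, which gives later extensions a reference baseline.

Risk tags: CPU tensor AllReduce risk, Python 3.9 TOML compatibility, host lock reference-count errors, prefetch timeout policy may become a bottleneck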

Related issues

No related issues identified


Full report

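Key files:

python/sglang/srt/mem_cache/unified_radix_cache.py (module: cache layer; category: source; type: dependency-wiring; symbols: get_last_hash_value, get_prefix_hash_values, inc_host_lock_ref, dec_host_lock_ref): Core file of the change. It adds the hash methods, storage backup, prefetch, and host locking logic, and is the main entry point of the L3 storage framework.
python/sglang/srt/mem_cache/hybrid_cache/hybrid_cache_controller.py (module: cache control; category: source; type: entrypoint; symbols: _start_storage_threads, parse_storage_backend_extra_config, clear_storage_backend, _init_extra_host_mem_release_queues): Entry point of the cache controller. It adds configuration parsing, host memory release queues, and storage thread management, unifying storage backend behavior across components (Full, Mamba, SWA).
python/sglang/srt/mem_cache/unified_cache_components/mamba_component.py (module: Mamba cache; category: source; type: dependency-wiring; symbols: acquire_component_lock, release_component_lock, build_hicache_transfers): Implements host locking support for the Mamba component and transfer construction for the new stages; a key part of hybrid model adaptation.
test/registered/radix_cache/test_unified_radix_cache_kl_hicache_part2.py (module: integration test; category: test; type: test-coverage; symbols: TestUnifiedMambaHiCacheL3, setUpClass, tearDownClass): New end-to-end test of the L3 cache for a Mamba hybrid model; it verifies that the file backend integrates correctly with UnifiedRadixCache.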

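Related context

PR #26301 [HiCache]: Check return code of cudaHostRegister: another PR in the HiCache improvement series. It fixes the return-code check for cudaHostRegister and shares memory management infrastructure with this PR's L3 storage backend.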
