# PR #42889 完整报告

- 仓库：`vllm-project/vllm`
- 标题：[Refactor] Remove dead code
- 合并时间：2026-05-19 04:41
- 原文链接：http://prhub.com.cn/vllm-project/vllm/pull/42889

---

# 执行摘要

- 一句话：删除 CompressedTensors24 及相关死代码
- 推荐动作：该 PR 是典型的死代码清理实践，值得阅读了解如何安全地移除已有废弃机制的代码。对于正在维护大型代码库的开发者，可以借鉴其分步删除逻辑：先确保代码已被条件守卫或 raise 替代，再删除文件并更新所有引用。

# 功能与动机

PR 描述明确说明目的是移除 dead code，具体涉及 CompressedTensors24、KVCacheQuantSchema 和 QuantParamSchema。CompressedTensors24 类中的所有方法都 raise NotImplementedError，表明 Sparse 2:4 模型已经被弃用。KVCacheQuantSchema 和 QuantParamSchema 是用于加载 KV cache 量化参数的 Pydantic 模型，但已经被新机制取代。turboquant/quantizer.py 只包含一个空的 docstring，从未被使用。清理这些无用代码是日常维护的一部分。

# 实现拆解

### 步骤 1: 删除 CompressedTensors24 方案类和相关文件
移除了 `vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py` 文件，该类继承自 `CompressedTensorsScheme` 但所有方法均抛出 `NotImplementedError`，表示不再支持。同时删除了 `schemes/__init__.py` 中的导入和导出条目。

### 步骤 2: 删除 KV cache 量化参数模式
完全移除了 `vllm/model_executor/layers/quantization/schema.py`，其中定义了两个 Pydantic 模型 `KVCacheQuantSchema` 和 `QuantParamSchema`，以及它们的验证器方法。这些模式曾用于从 JSON 加载 KV cache 量化参数，但已被其他机制替代。

### 步骤 3: 修改量化选择逻辑
在 `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py` 中，删除了对 `CompressedTensors24` 的导入，并将原实例化该类的条件分支替换为直接 `raise NotImplementedError`。这保证了在遇到 Sparse 2:4 配置时立即报错，而不是默默地创建一个不可用的方案对象。

### 步骤 4: 删除空文件
移除了 `vllm/model_executor/layers/quantization/turboquant/quantizer.py`，该文件仅包含一个 docstring，声明未来计划用 Triton 内核处理量化，但从未实现任何功能。

关键文件：
- `vllm/model_executor/layers/quantization/schema.py`（模块 量化层；类别 source；类型 deletion；符号 KVCacheQuantSchema, check_is_fp8, check_tp_ranks, check_current_rank）: 完全删除了 KVCacheQuantSchema 和 QuantParamSchema 两个 Pydantic 模型及其验证方法，是本次清理的核心部分。
- `vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py`（模块 量化层；类别 source；类型 deletion；符号 CompressedTensors24, __init__, get_min_capability, create_weights）: 完全删除了 CompressedTensors24 类，该类所有方法均已声明不再支持，是 Sparse 2:4 量化方案的遗留代码。
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py`（模块 量化层；类别 source；类型 control-flow；符号 CompressedTensors24, CompressedTensorsLinearMethod.get_scheme）: 修改了量化方案选择逻辑，删除了 CompressedTensors24 的导入并将原来实例化它的条件分支改为直接抛出 NotImplementedError，是本次清理的必选项。
- `vllm/model_executor/layers/quantization/compressed_tensors/schemes/__init__.py`（模块 量化层；类别 source；类型 import-update；符号 CompressedTensors24）: 更新了导出列表，移除了 CompressedTensors24 的导入和导出。
- `vllm/model_executor/layers/quantization/turboquant/quantizer.py`（模块 量化层；类别 source；类型 deletion）: 删除了仅包含空 docstring 的文件，清理了 TurboQuant 模块的冗余文件。

关键符号：KVCacheQuantSchema.check_is_fp8, KVCacheQuantSchema.check_tp_ranks, KVCacheQuantSchema.check_current_rank, QuantParamSchema.check_model_type, CompressedTensors24.__init__, CompressedTensors24.get_min_capability, CompressedTensors24.create_weights, CompressedTensors24.process_weights_after_loading, CompressedTensors24.apply_weights

## 关键源码片段

### `vllm/model_executor/layers/quantization/schema.py`

完全删除了 KVCacheQuantSchema 和 QuantParamSchema 两个 Pydantic 模型及其验证方法，是本次清理的核心部分。

```python
# 文件 vllm/model_executor/layers/quantization/schema.py（已删除）
# 该文件定义了 KVCacheQuantSchema 和 QuantParamSchema，
# 用于从 JSON 加载 KV cache 量化参数，现已全部移除。

from pydantic import BaseModel, ConfigDict, ValidationInfo, model_validator

class KVCacheQuantSchema(BaseModel):
    dtype: str  # 量化数据类型，如 "float8_e4m3fn"
    # 每个 TP rank 到 {layer_index: scale} 的映射
    scaling_factor: dict[int, dict[int, float]]

    @model_validator(mode="after")
    def check_is_fp8(self) -> "KVCacheQuantSchema":
        # 验证 dtype 是否为 float8_e4m3fn
        assert self.dtype == "float8_e4m3fn", (
            "Loaded scaling factors intended for KV cache dtype = "
            f"{self.dtype} rather than float8_e4m3fn!"
        )
        return self

    @model_validator(mode="after")
    def check_tp_ranks(self, info: ValidationInfo) -> "KVCacheQuantSchema":
        # 验证 scaling_factor 的 key 数等于 TP 大小，
        # 且每个 TP rank 的层数等于总层数
        context = info.context
        if context:
            tp_size = context["tp_size"]
            num_hidden_layers = context["num_hidden_layers"]
            assert len(self.scaling_factor) == tp_size, (
                f"Loaded dictionary has TP size {len(self.scaling_factor)} "
                f"but LLM engine is currently running with TP size {tp_size}."
            )
            for tp_rank, layer_maps in self.scaling_factor.items():
                assert len(layer_maps) == num_hidden_layers, (
                    f"KV cache scales map for TP rank {tp_rank} is malformed. "
                    f"Expected {num_hidden_layers} layers, got "
                    f"{len(layer_maps)}."
                )
            for i in range(tp_size):
                assert i in self.scaling_factor, (
                    f"KV cache scales map for TP rank {i} not found."
                )
        return self

    @model_validator(mode="after")
    def check_current_rank(self, info: ValidationInfo) -> "KVCacheQuantSchema":
        # 验证当前 TP rank 的层索引完整性
        context = info.context
        if context:
            tp_rank = context["tp_rank"]
            num_hidden_layers = context["num_hidden_layers"]
            layer_scales_map = self.scaling_factor[tp_rank]
            for i in range(num_hidden_layers):
                assert i in layer_scales_map, (
                    f"Could not find KV cache scales for layer {i} in "
                    f"TP rank {tp_rank}."
                )
        return self

class QuantParamSchema(BaseModel):
    model_config = ConfigDict(protected_namespaces=())
    model_type: str | None  # 模型类型标识
    kv_cache: KVCacheQuantSchema  # 嵌套的 KV cache 量化参数

    @model_validator(mode="after")
    def check_model_type(self, info: ValidationInfo) -> "QuantParamSchema":
        # 验证 model_type 与上下文匹配
        context = info.context
        if context:
            model_type = context.get("model_type", None)
            if model_type is not None:
                assert model_type == self.model_type, (
                    f"Model type is {model_type} but loaded scaling factors "
                    f"belonging to different model type {self.model_type}!"
                )
        return self

```

### `vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py`

完全删除了 CompressedTensors24 类，该类所有方法均已声明不再支持，是 Sparse 2:4 量化方案的遗留代码。

```python
# 文件 vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py（已删除）
# 该类继承 CompressedTensorsScheme，但所有方法均抛出 NotImplementedError，
# 表明 Sparse 2:4 模型已不再支持。

from collections.abc import Callable
from typing import Any

import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
)
from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
    CompressedTensorsScheme,
)

__all__ = ["CompressedTensors24"]

class CompressedTensors24(CompressedTensorsScheme):
    def __init__(
        self,
        quantized: bool = False,
        weight_quant: QuantizationArgs | None = None,
        input_quant: QuantizationArgs | None = None,
        model_compression_config: dict[str, Any] | None = None,
    ):
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    @classmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def create_weights(
        self,
        layer: torch.nn.Module,
        input_size: int,
        output_partition_sizes: list[int],
        input_size_per_partition: int,
        params_dtype: torch.dtype,
        weight_loader: Callable,
        **kwargs,
    ):
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def apply_weights(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

```

# 评论区精华

该 PR 没有收到任何人工 review 评论，仅有一个自动化 bot 评论重申了删除的内容，随后由 sfeng33 直接批准合并。因此无实质性技术讨论。

- 暂无高价值评论线程

# 风险与影响

- 风险：主要风险在于外部代码可能直接引用了被删除的符号，例如 `from vllm.model_executor.layers.quantization.schema import KVCacheQuantSchema`。但由于这些符号已经被废弃，且在仓库内部已无引用，风险较低。其他变更均为死代码清理，不会影响运行时行为。
- 影响：对用户：几乎无影响，因为删除的代码早已停止使用。极少数直接引用这些内部模块的用户可能需要更新 import。
对系统：代码库更简洁，减少了不必要的文件和导入。
对团队：降低了维护成本，避免了新开发者误用废弃代码。

- 风险标记：低兼容性风险

# 关联脉络

- 暂无明显关联 PR