#42889 [Refactor] Remove dead code

原始 PR 作者 yewentao256 合并时间 2026-05-19 04:41 文件变更 5 提交数 2 评论 0 代码增减 +2 / -168

执行摘要

删除 CompressedTensors24 及相关死代码

PR 描述明确说明目的是移除 dead code，具体涉及 CompressedTensors24、KVCacheQuantSchema 和 QuantParamSchema。CompressedTensors24 类中的所有方法都 raise NotImplementedError，表明 Sparse 2:4 模型已经被弃用。KVCacheQuantSchema 和 QuantParamSchema 是用于加载 KV cache 量化参数的 Pydantic 模型，但已经被新机制取代。turboquant/quantizer.py 只包含一个空的 docstring，从未被使用。清理这些无用代码是日常维护的一部分。

该 PR 是典型的死代码清理实践，值得阅读了解如何安全地移除已有废弃机制的代码。对于正在维护大型代码库的开发者，可以借鉴其分步删除逻辑：先确保代码已被条件守卫或 raise 替代，再删除文件并更新所有引用。

讨论亮点

该 PR 没有收到任何人工 review 评论，仅有一个自动化 bot 评论重申了删除的内容，随后由 sfeng33 直接批准合并。因此无实质性技术讨论。

实现拆解

步骤 1: 删除 CompressedTensors24 方案类和相关文件

移除了 vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py 文件，该类继承自 CompressedTensorsScheme 但所有方法均抛出 NotImplementedError，表示不再支持。同时删除了 schemes/__init__.py 中的导入和导出条目。

步骤 2: 删除 KV cache 量化参数模式

完全移除了 vllm/model_executor/layers/quantization/schema.py，其中定义了两个 Pydantic 模型 KVCacheQuantSchema 和 QuantParamSchema，以及它们的验证器方法。这些模式曾用于从 JSON 加载 KV cache 量化参数，但已被其他机制替代。

步骤 3: 修改量化选择逻辑

在 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py 中，删除了对 CompressedTensors24 的导入，并将原实例化该类的条件分支替换为直接 raise NotImplementedError。这保证了在遇到 Sparse 2:4 配置时立即报错，而不是默默地创建一个不可用的方案对象。

步骤 4: 删除空文件

移除了 vllm/model_executor/layers/quantization/turboquant/quantizer.py，该文件仅包含一个 docstring，声明未来计划用 Triton 内核处理量化，但从未实现任何功能。

文件	模块	状态	重要度
`vllm/model_executor/layers/quantization/schema.py`	量化层	removed	8.33
`vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py`	量化层	removed	7.84
`vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py`	量化层	modified	5.97
`vllm/model_executor/layers/quantization/compressed_tensors/schemes/__init__.py`	量化层	modified	4.48
`vllm/model_executor/layers/quantization/turboquant/quantizer.py`	量化层	removed	4.72

关键符号

KVCacheQuantSchema.check_is_fp8 KVCacheQuantSchema.check_tp_ranks KVCacheQuantSchema.check_current_rank QuantParamSchema.check_model_type CompressedTensors24.__init__ CompressedTensors24.get_min_capability CompressedTensors24.create_weights CompressedTensors24.process_weights_after_loading CompressedTensors24.apply_weights

关键源码片段

vllm/model_executor/layers/quantization/schema.py deletion

完全删除了 KVCacheQuantSchema 和 QuantParamSchema 两个 Pydantic 模型及其验证方法，是本次清理的核心部分。

# 文件 vllm/model_executor/layers/quantization/schema.py（已删除）
# 该文件定义了 KVCacheQuantSchema 和 QuantParamSchema，
# 用于从 JSON 加载 KV cache 量化参数，现已全部移除。

from pydantic import BaseModel, ConfigDict, ValidationInfo, model_validator

class KVCacheQuantSchema(BaseModel):
    dtype: str # 量化数据类型，如 "float8_e4m3fn"
    # 每个 TP rank 到 {layer_index: scale} 的映射
    scaling_factor: dict[int, dict[int, float]]

    @model_validator(mode="after")
    def check_is_fp8(self) -> "KVCacheQuantSchema":
        # 验证 dtype 是否为 float8_e4m3fn
        assert self.dtype == "float8_e4m3fn", (
            "Loaded scaling factors intended for KV cache dtype = "
            f"{self.dtype} rather than float8_e4m3fn!"
        )
        return self

    @model_validator(mode="after")
    def check_tp_ranks(self, info: ValidationInfo) -> "KVCacheQuantSchema":
        # 验证 scaling_factor 的 key 数等于 TP 大小，
        # 且每个 TP rank 的层数等于总层数
        context = info.context
        if context:
            tp_size = context["tp_size"]
            num_hidden_layers = context["num_hidden_layers"]
            assert len(self.scaling_factor) == tp_size, (
                f"Loaded dictionary has TP size {len(self.scaling_factor)} "
                f"but LLM engine is currently running with TP size {tp_size}."
            )
            for tp_rank, layer_maps in self.scaling_factor.items():
                assert len(layer_maps) == num_hidden_layers, (
                    f"KV cache scales map for TP rank {tp_rank} is malformed. "
                    f"Expected {num_hidden_layers} layers, got "
                    f"{len(layer_maps)}."
                )
            for i in range(tp_size):
                assert i in self.scaling_factor, (
                    f"KV cache scales map for TP rank {i} not found."
                )
        return self

    @model_validator(mode="after")
    def check_current_rank(self, info: ValidationInfo) -> "KVCacheQuantSchema":
        # 验证当前 TP rank 的层索引完整性
        context = info.context
        if context:
            tp_rank = context["tp_rank"]
            num_hidden_layers = context["num_hidden_layers"]
            layer_scales_map = self.scaling_factor[tp_rank]
            for i in range(num_hidden_layers):
                assert i in layer_scales_map, (
                    f"Could not find KV cache scales for layer {i} in "
                    f"TP rank {tp_rank}."
                )
        return self

class QuantParamSchema(BaseModel):
    model_config = ConfigDict(protected_namespaces=())
    model_type: str | None # 模型类型标识
    kv_cache: KVCacheQuantSchema # 嵌套的 KV cache 量化参数

    @model_validator(mode="after")
    def check_model_type(self, info: ValidationInfo) -> "QuantParamSchema":
        # 验证 model_type 与上下文匹配
        context = info.context
        if context:
            model_type = context.get("model_type", None)
            if model_type is not None:
                assert model_type == self.model_type, (
                    f"Model type is {model_type} but loaded scaling factors "
                    f"belonging to different model type {self.model_type}!"
                )
        return self

vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py deletion

完全删除了 CompressedTensors24 类，该类所有方法均已声明不再支持，是 Sparse 2:4 量化方案的遗留代码。

# 文件 vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py（已删除）
# 该类继承 CompressedTensorsScheme，但所有方法均抛出 NotImplementedError，
# 表明 Sparse 2:4 模型已不再支持。

from collections.abc import Callable
from typing import Any

import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
)
from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
    CompressedTensorsScheme,
)

__all__ = ["CompressedTensors24"]

class CompressedTensors24(CompressedTensorsScheme):
    def __init__(
        self,
        quantized: bool = False,
        weight_quant: QuantizationArgs | None = None,
        input_quant: QuantizationArgs | None = None,
        model_compression_config: dict[str, Any] | None = None,
    ):
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    @classmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def create_weights(
        self,
        layer: torch.nn.Module,
        input_size: int,
        output_partition_sizes: list[int],
        input_size_per_partition: int,
        params_dtype: torch.dtype,
        weight_loader: Callable,
        **kwargs,
    ):
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def apply_weights(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

评论区精华

没有提炼出高价值讨论线程

当前评论区没有形成足够清晰的争议点或结论，后续有更多讨论时会体现在这里。

风险与影响

主要风险在于外部代码可能直接引用了被删除的符号，例如 from vllm.model_executor.layers.quantization.schema import KVCacheQuantSchema。但由于这些符号已经被废弃，且在仓库内部已无引用，风险较低。其他变更均为死代码清理，不会影响运行时行为。

对用户：几乎无影响，因为删除的代码早已停止使用。极少数直接引用这些内部模块的用户可能需要更新 import。
对系统：代码库更简洁，减少了不必要的文件和导入。
对团队：降低了维护成本，避免了新开发者误用废弃代码。

低兼容性风险

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：删除 CompressedTensors24 及相关死代码
推荐动作：该 PR 是典型的死代码清理实践，值得阅读了解如何安全地移除已有废弃机制的代码。对于正在维护大型代码库的开发者，可以借鉴其分步删除逻辑：先确保代码已被条件守卫或 raise 替代，再删除文件并更新所有引用。

功能与动机

实现拆解

步骤 1: 删除 CompressedTensors24 方案类和相关文件

步骤 2: 删除 KV cache 量化参数模式

步骤 3: 修改量化选择逻辑

步骤 4: 删除空文件

移除了 vllm/model_executor/layers/quantization/turboquant/quantizer.py，该文件仅包含一个 docstring，声明未来计划用 Triton 内核处理量化，但从未实现任何功能。

关键文件：

vllm/model_executor/layers/quantization/schema.py（模块量化层；类别 source；类型 deletion；符号 KVCacheQuantSchema, check_is_fp8, check_tp_ranks, check_current_rank）: 完全删除了 KVCacheQuantSchema 和 QuantParamSchema 两个 Pydantic 模型及其验证方法，是本次清理的核心部分。
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py（模块量化层；类别 source；类型 deletion；符号 CompressedTensors24, init, get_min_capability, create_weights）: 完全删除了 CompressedTensors24 类，该类所有方法均已声明不再支持，是 Sparse 2:4 量化方案的遗留代码。
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py（模块量化层；类别 source；类型 control-flow；符号 CompressedTensors24, CompressedTensorsLinearMethod.get_scheme）: 修改了量化方案选择逻辑，删除了 CompressedTensors24 的导入并将原来实例化它的条件分支改为直接抛出 NotImplementedError，是本次清理的必选项。
vllm/model_executor/layers/quantization/compressed_tensors/schemes/__init__.py（模块量化层；类别 source；类型 import-update；符号 CompressedTensors24）: 更新了导出列表，移除了 CompressedTensors24 的导入和导出。
vllm/model_executor/layers/quantization/turboquant/quantizer.py（模块量化层；类别 source；类型 deletion）: 删除了仅包含空 docstring 的文件，清理了 TurboQuant 模块的冗余文件。

关键符号：KVCacheQuantSchema.check_is_fp8, KVCacheQuantSchema.check_tp_ranks, KVCacheQuantSchema.check_current_rank, QuantParamSchema.check_model_type, CompressedTensors24.init, CompressedTensors24.get_min_capability, CompressedTensors24.create_weights, CompressedTensors24.process_weights_after_loading, CompressedTensors24.apply_weights

关键源码片段

`vllm/model_executor/layers/quantization/schema.py`

完全删除了 KVCacheQuantSchema 和 QuantParamSchema 两个 Pydantic 模型及其验证方法，是本次清理的核心部分。

# 文件 vllm/model_executor/layers/quantization/schema.py（已删除）
# 该文件定义了 KVCacheQuantSchema 和 QuantParamSchema，
# 用于从 JSON 加载 KV cache 量化参数，现已全部移除。

from pydantic import BaseModel, ConfigDict, ValidationInfo, model_validator

class KVCacheQuantSchema(BaseModel):
    dtype: str # 量化数据类型，如 "float8_e4m3fn"
    # 每个 TP rank 到 {layer_index: scale} 的映射
    scaling_factor: dict[int, dict[int, float]]

    @model_validator(mode="after")
    def check_is_fp8(self) -> "KVCacheQuantSchema":
        # 验证 dtype 是否为 float8_e4m3fn
        assert self.dtype == "float8_e4m3fn", (
            "Loaded scaling factors intended for KV cache dtype = "
            f"{self.dtype} rather than float8_e4m3fn!"
        )
        return self

    @model_validator(mode="after")
    def check_tp_ranks(self, info: ValidationInfo) -> "KVCacheQuantSchema":
        # 验证 scaling_factor 的 key 数等于 TP 大小，
        # 且每个 TP rank 的层数等于总层数
        context = info.context
        if context:
            tp_size = context["tp_size"]
            num_hidden_layers = context["num_hidden_layers"]
            assert len(self.scaling_factor) == tp_size, (
                f"Loaded dictionary has TP size {len(self.scaling_factor)} "
                f"but LLM engine is currently running with TP size {tp_size}."
            )
            for tp_rank, layer_maps in self.scaling_factor.items():
                assert len(layer_maps) == num_hidden_layers, (
                    f"KV cache scales map for TP rank {tp_rank} is malformed. "
                    f"Expected {num_hidden_layers} layers, got "
                    f"{len(layer_maps)}."
                )
            for i in range(tp_size):
                assert i in self.scaling_factor, (
                    f"KV cache scales map for TP rank {i} not found."
                )
        return self

    @model_validator(mode="after")
    def check_current_rank(self, info: ValidationInfo) -> "KVCacheQuantSchema":
        # 验证当前 TP rank 的层索引完整性
        context = info.context
        if context:
            tp_rank = context["tp_rank"]
            num_hidden_layers = context["num_hidden_layers"]
            layer_scales_map = self.scaling_factor[tp_rank]
            for i in range(num_hidden_layers):
                assert i in layer_scales_map, (
                    f"Could not find KV cache scales for layer {i} in "
                    f"TP rank {tp_rank}."
                )
        return self

class QuantParamSchema(BaseModel):
    model_config = ConfigDict(protected_namespaces=())
    model_type: str | None # 模型类型标识
    kv_cache: KVCacheQuantSchema # 嵌套的 KV cache 量化参数

    @model_validator(mode="after")
    def check_model_type(self, info: ValidationInfo) -> "QuantParamSchema":
        # 验证 model_type 与上下文匹配
        context = info.context
        if context:
            model_type = context.get("model_type", None)
            if model_type is not None:
                assert model_type == self.model_type, (
                    f"Model type is {model_type} but loaded scaling factors "
                    f"belonging to different model type {self.model_type}!"
                )
        return self

`vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py`

完全删除了 CompressedTensors24 类，该类所有方法均已声明不再支持，是 Sparse 2:4 量化方案的遗留代码。

# 文件 vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py（已删除）
# 该类继承 CompressedTensorsScheme，但所有方法均抛出 NotImplementedError，
# 表明 Sparse 2:4 模型已不再支持。

from collections.abc import Callable
from typing import Any

import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
)
from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
    CompressedTensorsScheme,
)

__all__ = ["CompressedTensors24"]

class CompressedTensors24(CompressedTensorsScheme):
    def __init__(
        self,
        quantized: bool = False,
        weight_quant: QuantizationArgs | None = None,
        input_quant: QuantizationArgs | None = None,
        model_compression_config: dict[str, Any] | None = None,
    ):
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    @classmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def create_weights(
        self,
        layer: torch.nn.Module,
        input_size: int,
        output_partition_sizes: list[int],
        input_size_per_partition: int,
        params_dtype: torch.dtype,
        weight_loader: Callable,
        **kwargs,
    ):
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

    def apply_weights(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        raise NotImplementedError("Sparse24 models are no longer supported by vLLM")

评论区精华

该 PR 没有收到任何人工 review 评论，仅有一个自动化 bot 评论重申了删除的内容，随后由 sfeng33 直接批准合并。因此无实质性技术讨论。

暂无高价值评论线程

风险与影响

风险：主要风险在于外部代码可能直接引用了被删除的符号，例如 from vllm.model_executor.layers.quantization.schema import KVCacheQuantSchema。但由于这些符号已经被废弃，且在仓库内部已无引用，风险较低。其他变更均为死代码清理，不会影响运行时行为。
影响：对用户：几乎无影响，因为删除的代码早已停止使用。极少数直接引用这些内部模块的用户可能需要更新 import。
对系统：代码库更简洁，减少了不必要的文件和导入。
对团队：降低了维护成本，避免了新开发者误用废弃代码。
风险标记：低兼容性风险

关联脉络

暂无明显关联 PR

#42889 [Refactor] Remove dead code

执行摘要

删除 CompressedTensors24 及相关死代码

实现拆解

步骤 1: 删除 CompressedTensors24 方案类和相关文件

步骤 2: 删除 KV cache 量化参数模式

步骤 3: 修改量化选择逻辑

步骤 4: 删除空文件

评论区精华

没有提炼出高价值讨论线程

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论