#25191 [Apple Silicon] [MLX] Auto-detect MLX-format quantization_config dict

原始 PR 作者 jlee5814 合并时间 2026-05-15 00:55 文件变更 2 提交数 1 评论 8 代码增减 +149 / -11

执行摘要

MLX 预量化模型配置自动识别

mlx-community 提供的预量化模型（如 mlx-community/Qwen3-30B-A3B-4bit）的 config.json 中 quantization_config 仅含 group_size 和 bits，缺少 quant_method 键。ModelConfig._verify_quantization 无法识别该形状，抛出 Unknown quantization method 错误。通过实现 override_quantization_method 扩展点，使得注册中心自动匹配并返回 mlx_q4 或 mlx_q8。

该 PR 修复了实际用户报障且设计上恪守了后端功能隔离原则（所有逻辑位于 mlx.py），推荐合并。值得关注的设计决策是：选择复用已有的 override_quantization_method 扩展点而非在 model_config.py 中增加条件判断，保持了架构整洁。

讨论亮点

Review 中 yeahdongcn 指出模块文档字符串描述过时，要求更新以涵盖新增的自动检测功能，开发者已照做。此外，yeahdongcn 建议将新增的测试类导入语句移至文件顶部，开发者同样遵循。讨论简洁无争议，PR 最终获得 Approve。

实现拆解

在 MlxQuantizationConfig 中新增 override_quantization_method 类方法，该方法检查输入是否为包含 group_size 和 bits 的字典，且无 quant_method 键、user_quant 为 None，并根据 bits 值返回 mlx_q4 或 mlx_q8。
在测试文件中新增 TestMlxQuantizationOverride 类，包含 5 个纯逻辑测试，无需 MLX 或 Apple Silicon 依赖，可在所有 CI 平台运行，覆盖 q4/q8 自动检测、非 MLX 配置排除、非字典输入、用户显式量化参数优先等场景。
同步更新模块文档字符串，明确描述模块的两大职责：注册预设名称和自动检测 mlx-community 配置。

文件	模块	状态	重要度
`python/sglang/srt/layers/quantization/mlx.py`	量化配置	modified	7.25
`test/registered/unit/hardware_backend/mlx/test_quantization.py`	测试	modified	6.48

关键符号

override_quantization_method test_mlx_q4_dict_config_autodetect test_mlx_q8_dict_config_autodetect test_non_mlx_dict_not_matched test_non_dict_not_matched test_user_quant_explicit_defers_to_user

关键源码片段

python/sglang/srt/layers/quantization/mlx.py core-logic

核心修改文件：新增 `override_quantization_method` 类方法，实现 mlx-community 量化配置的自动检测。

@classmethod
def override_quantization_method(cls, hf_quant_cfg, user_quant) -> Optional[str]:
    """Auto-detect mlx-community-shape quantization configs.

    mlx-community models ship ``config.json`` with::

        "quantization_config": {"group_size": <int>, "bits": <int>}

    No ``quant_method`` key, no other identifying field. Without this
    override, :meth:`ModelConfig._verify_quantization` cannot match the
    shape to any registered method and raises ``Unknown quantization
    method`` (see #25119). Match it here and return the preset whose
    bit-width agrees, so pre-quantized HF repos load on Apple Silicon
    without the user having to pass ``--quantization`` on the CLI.

    Returns ``None`` for any input that does not look like a bare MLX
    preset: non-dict, dict with an explicit ``quant_method``, missing
    keys, non-integer values, or unsupported bit-width. Also defers to
    any explicit ``--quantization`` CLI choice (``user_quant``) per the
    registry contract: CLI selection takes priority over auto-detect.
    """
    # 用户显式指定了 quantization 参数，尊重用户选择
    if user_quant is not None:
        return None
    # 输入必须是字典
    if not isinstance(hf_quant_cfg, dict):
        return None
    # 如果有 quant_method 键，说明属于其他量化方法，不拦截
    if "quant_method" in hf_quant_cfg:
        return None
    bits = hf_quant_cfg.get("bits")
    group_size = hf_quant_cfg.get("group_size")
    # 确保两个字段都是整数
    if not isinstance(bits, int) or not isinstance(group_size, int):
        return None
    # 根据 bit 宽度返回对应的预设名称
    if bits == 4:
        return "mlx_q4"
    if bits == 8:
        return "mlx_q8"
    return None

test/registered/unit/hardware_backend/mlx/test_quantization.py test-coverage

新增纯逻辑测试类，覆盖 override_quantization_method 的多种场景，无需 MLX/Apple Silicon 依赖，可在所有 CI 平台运行。

class TestMlxQuantizationOverride(unittest.TestCase):
    """Pure-logic tests for MlxQuantizationConfig.override_quantization_method.
    The override is a classmethod over a dict; no mlx / Apple Silicon
    dependency. Runs on every CI platform and guards #25119 from regression.
    """

    def test_mlx_q4_dict_config_autodetect(self):
        """Bare {group_size, bits=4} dict maps to mlx_q4."""
        # 模拟 mlx-community 的 quantization_config
        result = MlxQuantizationConfig.override_quantization_method(
            {"group_size": 64, "bits": 4}, None
        )
        self.assertEqual(result, "mlx_q4")

    def test_mlx_q8_dict_config_autodetect(self):
        """Bare {group_size, bits=8} dict maps to mlx_q8."""
        result = MlxQuantizationConfig.override_quantization_method(
            {"group_size": 32, "bits": 8}, None
        )
        self.assertEqual(result, "mlx_q8")

    def test_non_mlx_dict_not_matched(self):
        """Dicts with an explicit quant_method belong to that method, not ours."""
        # 含有 quant_method 键的配置不应被 MLX 拦截
        self.assertIsNone(
            MlxQuantizationConfig.override_quantization_method(
                {"quant_method": "modelopt", "bits": 4, "group_size": 64}, None
            )
        )
        self.assertIsNone(
            MlxQuantizationConfig.override_quantization_method(
                {"quant_method": "gptq", "bits": 4, "group_size": 128}, None
            )
        )

评论区精华

更新文档字符串 documentation

yeahdongcn 指出模块文档字符串描述过时，需要涵盖新增的自动检测功能。

结论：开发者已更新文档字符串。 · 已解决

移动导入语句 style

yeahdongcn 建议将测试类中新增的导入移到文件顶部。

结论：开发者已将导入移至顶部。 · 已解决

风险与影响

风险极低。变更仅影响配置验证阶段的 override_quantization_method 调用，不改变模型加载或推理路径。新增方法在输入不符合预期时返回 None，不会干扰其他量化方法的检测。所有逻辑封装在 MLX 专属文件中，未改动通用的 model_config.py 或 base_config.py。测试覆盖了边界情况，确保回归安全。

对 Apple Silicon（MLX 后端）用户而言，所有来自 mlx-community 的预量化模型现在均可自动加载，无需手动指定 --quantization。现有显式量化路径（--quantization mlx_q4/q8）完全不受影响。对其他后端（CUDA/ROCm）无影响。测试可在所有 CI 平台上运行，增强回归覆盖。

关联 Issue

#24907 [MLX] Add on-the-fly --quantization mlx_q4 / mlx_q8 for Apple Silicon

#25119 [Bug] [Apple Silicon] [MLX] _verify_quantization rejects mlx-community models with quantization_config dict

完整报告

执行摘要

一句话：MLX 预量化模型配置自动识别
推荐动作：该 PR 修复了实际用户报障且设计上恪守了后端功能隔离原则（所有逻辑位于 mlx.py），推荐合并。值得关注的设计决策是：选择复用已有的 override_quantization_method 扩展点而非在 model_config.py 中增加条件判断，保持了架构整洁。

功能与动机

实现拆解

在 MlxQuantizationConfig 中新增 override_quantization_method 类方法，该方法检查输入是否为包含 group_size 和 bits 的字典，且无 quant_method 键、user_quant 为 None，并根据 bits 值返回 mlx_q4 或 mlx_q8。
在测试文件中新增 TestMlxQuantizationOverride 类，包含 5 个纯逻辑测试，无需 MLX 或 Apple Silicon 依赖，可在所有 CI 平台运行，覆盖 q4/q8 自动检测、非 MLX 配置排除、非字典输入、用户显式量化参数优先等场景。
同步更新模块文档字符串，明确描述模块的两大职责：注册预设名称和自动检测 mlx-community 配置。

关键文件：

python/sglang/srt/layers/quantization/mlx.py（模块量化配置；类别 source；类型 core-logic；符号 override_quantization_method）: 核心修改文件：新增 override_quantization_method 类方法，实现 mlx-community 量化配置的自动检测。
test/registered/unit/hardware_backend/mlx/test_quantization.py（模块测试；类别 test；类型 test-coverage；符号 TestMlxQuantizationOverride, test_mlx_q4_dict_config_autodetect, test_mlx_q8_dict_config_autodetect, test_non_mlx_dict_not_matched）: 新增纯逻辑测试类，覆盖 override_quantization_method 的多种场景，无需 MLX/Apple Silicon 依赖，可在所有 CI 平台运行。

关键符号：override_quantization_method, test_mlx_q4_dict_config_autodetect, test_mlx_q8_dict_config_autodetect, test_non_mlx_dict_not_matched, test_non_dict_not_matched, test_user_quant_explicit_defers_to_user

关键源码片段

`python/sglang/srt/layers/quantization/mlx.py`

核心修改文件：新增 override_quantization_method 类方法，实现 mlx-community 量化配置的自动检测。

@classmethod
def override_quantization_method(cls, hf_quant_cfg, user_quant) -> Optional[str]:
    """Auto-detect mlx-community-shape quantization configs.

    mlx-community models ship ``config.json`` with::

        "quantization_config": {"group_size": <int>, "bits": <int>}

    No ``quant_method`` key, no other identifying field. Without this
    override, :meth:`ModelConfig._verify_quantization` cannot match the
    shape to any registered method and raises ``Unknown quantization
    method`` (see #25119). Match it here and return the preset whose
    bit-width agrees, so pre-quantized HF repos load on Apple Silicon
    without the user having to pass ``--quantization`` on the CLI.

    Returns ``None`` for any input that does not look like a bare MLX
    preset: non-dict, dict with an explicit ``quant_method``, missing
    keys, non-integer values, or unsupported bit-width. Also defers to
    any explicit ``--quantization`` CLI choice (``user_quant``) per the
    registry contract: CLI selection takes priority over auto-detect.
    """
    # 用户显式指定了 quantization 参数，尊重用户选择
    if user_quant is not None:
        return None
    # 输入必须是字典
    if not isinstance(hf_quant_cfg, dict):
        return None
    # 如果有 quant_method 键，说明属于其他量化方法，不拦截
    if "quant_method" in hf_quant_cfg:
        return None
    bits = hf_quant_cfg.get("bits")
    group_size = hf_quant_cfg.get("group_size")
    # 确保两个字段都是整数
    if not isinstance(bits, int) or not isinstance(group_size, int):
        return None
    # 根据 bit 宽度返回对应的预设名称
    if bits == 4:
        return "mlx_q4"
    if bits == 8:
        return "mlx_q8"
    return None

`test/registered/unit/hardware_backend/mlx/test_quantization.py`

新增纯逻辑测试类，覆盖 override_quantization_method 的多种场景，无需 MLX/Apple Silicon 依赖，可在所有 CI 平台运行。

class TestMlxQuantizationOverride(unittest.TestCase):
    """Pure-logic tests for MlxQuantizationConfig.override_quantization_method.
    The override is a classmethod over a dict; no mlx / Apple Silicon
    dependency. Runs on every CI platform and guards #25119 from regression.
    """

    def test_mlx_q4_dict_config_autodetect(self):
        """Bare {group_size, bits=4} dict maps to mlx_q4."""
        # 模拟 mlx-community 的 quantization_config
        result = MlxQuantizationConfig.override_quantization_method(
            {"group_size": 64, "bits": 4}, None
        )
        self.assertEqual(result, "mlx_q4")

    def test_mlx_q8_dict_config_autodetect(self):
        """Bare {group_size, bits=8} dict maps to mlx_q8."""
        result = MlxQuantizationConfig.override_quantization_method(
            {"group_size": 32, "bits": 8}, None
        )
        self.assertEqual(result, "mlx_q8")

    def test_non_mlx_dict_not_matched(self):
        """Dicts with an explicit quant_method belong to that method, not ours."""
        # 含有 quant_method 键的配置不应被 MLX 拦截
        self.assertIsNone(
            MlxQuantizationConfig.override_quantization_method(
                {"quant_method": "modelopt", "bits": 4, "group_size": 64}, None
            )
        )
        self.assertIsNone(
            MlxQuantizationConfig.override_quantization_method(
                {"quant_method": "gptq", "bits": 4, "group_size": 128}, None
            )
        )

评论区精华

更新文档字符串 (documentation): 开发者已更新文档字符串。
移动导入语句 (style): 开发者已将导入移至顶部。

风险与影响

风险：风险极低。变更仅影响配置验证阶段的 override_quantization_method 调用，不改变模型加载或推理路径。新增方法在输入不符合预期时返回 None，不会干扰其他量化方法的检测。所有逻辑封装在 MLX 专属文件中，未改动通用的 model_config.py 或 base_config.py。测试覆盖了边界情况，确保回归安全。
影响：对 Apple Silicon（MLX 后端）用户而言，所有来自 mlx-community 的预量化模型现在均可自动加载，无需手动指定 --quantization。现有显式量化路径（--quantization mlx_q4/q8）完全不受影响。对其他后端（CUDA/ROCm）无影响。测试可在所有 CI 平台上运行，增强回归覆盖。
风险标记：暂无

关联脉络

PR #24907 [MLX] Add on-the-fly --quantization mlx_q4 / mlx_q8 for Apple Silicon: 该 PR 是 #24907 的后续修复，完善了 MLX 量化支持，使预量化模型也能自动加载。

#25191 [Apple Silicon] [MLX] Auto-detect MLX-format quantization_config dict

执行摘要

MLX 预量化模型配置自动识别

实现拆解

评论区精华

风险与影响

关联 Issue

完整报告

参与讨论