#32325 [Model] Add Moondream3 model support(only query and caption skills)

Original PR · Author: sniper35 · Merged: 2026-05-01 10:06 · Files changed: 19 · Commits: 10 · Comments: 78 · Lines changed: +3238 / -9

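Executive summary

Adds the Moondream3 model with support for the Query and Caption skills.

The community requested Moondream3 support in #25215; the model is noted for high throughput and strong multimodal performance. This PR implements the basic Query and Caption skills, which cover most use cases.

Recommended reading, especially the reconstruct_from_crops function and the design of Moondream3Processor, which show how to encapsulate vision preprocessing in the processor and keep the model core lean. Moondream3's prefix-LM implementation and MoE configuration also serve as a reference for similar models.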

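Discussion highlights

Core engine trade-off: DarkLight1337 opposed introducing model hooks and model_extra_output, arguing that they would add code debt. The author, sniper35, ultimately agreed to remove the core changes tied to the Point/Detect skills and keep only Query/Caption, avoiding intrusion into the Engine Core.

grid_size computation and weight mapping errors: gemini-code-assist pointed out that grid_size wrongly used enc_n_layers (it should be crop_size / patch_size) and that the attention weight mapping was wrong (qkv should map to qkv_proj). Both were fixed.

Processor input normalization: DarkLight1337 asked that the model layer not handle multiple input formats and that the processor normalize them instead. The model-layer input handling was simplified to accept only a 5D tensor or a list of 4D tensors.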

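Implementation breakdown

Define the model configuration (vllm/transformers_utils/configs/moondream3.py): creates a composite config class with a Vision sub-config (crop_size, max_crops, etc.) and a Text sub-config (MoE, prefix_attn, special token IDs), and bridges the native naming to standard HF attributes.
Implement the core model architecture (vllm/model_executor/models/moondream3.py):
- Vision Encoder: a Vision Transformer based on overlapping tiling that supports multi-crop input and uses reconstruct_from_crops to rebuild features at the patch level. Internally it uses MMEncoderAttention (bidirectional attention).
- Text Decoder: compatible with vLLM's decoder attention, with MoE, RoPE, and tau scaling. prefix_attn makes the first N positions use bidirectional attention (prefix-LM); the remaining positions are causal.
- Multimodal embedding: combines the visual features with the token embeddings and places them at the prefix positions.
- Answer-token suppression: suppresses the answer token (default ID 3) so that the model does not emit the delimiter.
Develop a custom processor (vllm/transformers_utils/processors/moondream3.py):
- Moondream3Processor uses the separate tokenizer repository moondream/starmie-v1 and contains the preprocessing pipeline: select_tiling computes the tile grid, followed by normalization and BF16 conversion.
- The chat template routes on the text prefix to generate Moondream3-specific prompts (with special tokens such as `<|endoftext|>`, `<image>`, and `<|md_reserved_0|>`).
Register the model with the vLLM framework: adds the two architecture aliases Moondream3ForCausalLM and HfMoondream in registry.py, imports the config in configs/__init__.py, and adds moondream3 to the is_mm_prefix_lm property in model.py.
Write tests and documentation: adds processor unit tests (tests/models/multimodal/processing/test_moondream3.py) and generation tests (tests/models/multimodal/generation/test_moondream3.py, including a TP test), adds supporting helpers in conftest.py and model_utils.py, and updates supported_models.md and multimodal_inputs.md.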
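
The two decoder-side behaviours in the breakdown (a bidirectional prefix followed by causal positions, and suppression of a single token ID) can be illustrated in a few lines. This is a minimal pure-Python sketch, not the vLLM implementation: `prefix_lm_mask` and `suppress_token` are hypothetical helper names, and setting a logit to negative infinity is assumed here as one common way to suppress a token.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key k.

    Positions inside the prefix see the whole prefix (bidirectional);
    every other position sees only itself and earlier positions (causal).
    """
    return [
        [k <= q or (q < prefix_len and k < prefix_len) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def suppress_token(logits: list[float], token_id: int) -> list[float]:
    """Return a copy of logits in which token_id can never be selected."""
    masked = list(logits)
    masked[token_id] = float("-inf")
    return masked


if __name__ == "__main__":
    # Print the mask as a grid: "x" = may attend, "." = blocked.
    for row in prefix_lm_mask(seq_len=5, prefix_len=3):
        print("".join("x" if allowed else "." for allowed in row))
    print(suppress_token([0.1, 0.2, 0.3, 0.9], token_id=3))
```

In the printed grid the first three rows are identical because every prefix position sees the full prefix; only the rows after the prefix form the usual lower triangle.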

File	Module	Status	Importance
`vllm/model_executor/models/moondream3.py`	model layer	added	9.36
`vllm/transformers_utils/processors/moondream3.py`	processor	added	9.08
`vllm/transformers_utils/configs/moondream3.py`	config	added	8.72
`tests/models/multimodal/processing/test_moondream3.py`	tests	added	7.97
`tests/models/multimodal/generation/test_moondream3.py`	tests	added	7.85
`vllm/model_executor/models/registry.py`	registry	modified	5.1
`tests/conftest.py`	test config	modified	5.29
`tests/models/multimodal/generation/vlm_utils/model_utils.py`	test utilities	modified	7.4

Key symbols

reconstruct_from_crops select_tiling Moondream3Processor.__call__ Moondream3Model.forward make_query_prompt make_caption_prompt _encode_vision _normalize_tiling

Key source excerpts

vllm/model_executor/models/moondream3.py data-contract

Core model implementation, containing the Vision Encoder, Text Decoder, multimodal embedding, and forward logic.

import torch


# Rebuild a single feature map from overlapping crops
def reconstruct_from_crops(
    crops: torch.Tensor,
    tiling: tuple[int, int],
    overlap_margin: int,
    patch_size: int = 14,
) -> torch.Tensor:
    """Reconstruct features from overlapping crops.

    Args:
        crops: crop tensor of shape (N, H, W, D), ordered row-major over the grid.
        tiling: number of tiles as (tiling_h, tiling_w).
        overlap_margin: overlap margin, in patches.
        patch_size: size of one patch in units of the H/W axes (default 14,
            i.e. pixels per patch for a SigLIP-style ViT).

    Returns:
        reconstructed: the stitched tensor of shape (output_h, output_w, D).
    """
    tiling_h, tiling_w = tiling
    crop_height, crop_width = crops[0].shape[:2]
    # The margin is applied along the crops' own H/W axes, so crops that are
    # already at patch resolution need patch_size=1.
    margin_pixels = overlap_margin * patch_size

    output_h = (crop_height - 2 * margin_pixels) * tiling_h + 2 * margin_pixels
    output_w = (crop_width - 2 * margin_pixels) * tiling_w + 2 * margin_pixels

    reconstructed = torch.zeros(
        (output_h, output_w, crops[0].shape[2]),
        device=crops[0].device,
        dtype=crops[0].dtype,
    )

    for i, crop in enumerate(crops):
        tile_y = i // tiling_w
        tile_x = i % tiling_w

        # Border tiles keep their outer edge; interior edges drop the overlap
        x_start = 0 if tile_x == 0 else margin_pixels
        x_end = crop_width if tile_x == tiling_w - 1 else crop_width - margin_pixels
        y_start = 0 if tile_y == 0 else margin_pixels
        y_end = crop_height if tile_y == tiling_h - 1 else crop_height - margin_pixels

        out_x = tile_x * (crop_width - 2 * margin_pixels)
        out_y = tile_y * (crop_height - 2 * margin_pixels)

        # Copy the kept region to its position in the reconstructed map
        reconstructed[
            out_y + y_start : out_y + y_end,
            out_x + x_start : out_x + x_end,
        ] = crop[y_start:y_end, x_start:x_end]

    return reconstructed
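
The index arithmetic in reconstruct_from_crops is the same along both axes, so it can be checked in one dimension without torch. The sketch below is a hedged illustration with hypothetical helper names (`output_len`, `kept_spans`); the values 27 (patches per crop side) and 4 (margin) are assumptions chosen for the example, not values confirmed by the PR.

```python
def output_len(crop_len: int, n_tiles: int, margin: int) -> int:
    """Length of the stitched axis: interior strides plus the two outer margins."""
    return (crop_len - 2 * margin) * n_tiles + 2 * margin


def kept_spans(crop_len: int, n_tiles: int, margin: int) -> list[tuple[int, int]]:
    """Destination [start, end) on the stitched axis for each tile's kept region."""
    stride = crop_len - 2 * margin
    spans = []
    for tile in range(n_tiles):
        # First tile keeps its leading margin, last tile keeps its trailing one.
        start = 0 if tile == 0 else margin
        end = crop_len if tile == n_tiles - 1 else crop_len - margin
        spans.append((tile * stride + start, tile * stride + end))
    return spans


if __name__ == "__main__":
    spans = kept_spans(crop_len=27, n_tiles=3, margin=4)
    print(spans, output_len(27, 3, 4))
    # Contiguous spans mean every output index is written exactly once.
    assert all(a[1] == b[0] for a, b in zip(spans, spans[1:]))
```

Because the spans are contiguous and end at output_len, the zero-initialised buffer is fully overwritten and no position receives data from two crops.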

vllm/transformers_utils/processors/moondream3.py dependency-wiring

Custom processor that handles image tiling, normalization, tokenization, and prompt construction.

import math

def select_tiling(
    height: int, width: int, crop_size: int, max_crops: int
) -> tuple[int, int]:
    """Determine the optimal number of tiles to cover an image.

    Computes the tiling grid (h, w) from the image size and the crop size.
    If either image dimension is no larger than crop_size, returns (1, 1).
    Otherwise picks a grid that follows the image aspect ratio, has at least
    ceil(size / crop_size) tiles per side, and stays within max_crops.

    Args:
        height: original image height.
        width: original image width.
        crop_size: side length of each crop (configured default: 378).
        max_crops: maximum number of crops allowed (configured default: 12).

    Returns:
        (tiling_h, tiling_w), the size of the tile grid.
    """
    if height <= crop_size or width <= crop_size:
        return (1, 1)

    # Minimum number of tiles needed per dimension
    min_h = math.ceil(height / crop_size)
    min_w = math.ceil(width / crop_size)

    # If even the minimum exceeds the limit, scale both dimensions down
    if min_h * min_w > max_crops:
        ratio = math.sqrt(max_crops / (min_h * min_w))
        return (max(1, math.floor(min_h * ratio)),
                max(1, math.floor(min_w * ratio)))

    # Otherwise use up to max_crops tiles in the image's aspect ratio
    h_tiles = math.floor(math.sqrt(max_crops * height / width))
    w_tiles = math.floor(math.sqrt(max_crops * width / height))

    h_tiles = max(h_tiles, min_h)
    w_tiles = max(w_tiles, min_w)

    # If the grid is too large, shrink the larger dimension
    if h_tiles * w_tiles > max_crops:
        if w_tiles > h_tiles:
            w_tiles = math.floor(max_crops / h_tiles)
        else:
            h_tiles = math.floor(max_crops / w_tiles)

    return (max(1, h_tiles), max(1, w_tiles))
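
A few worked inputs make the three branches of select_tiling easier to follow. The function is restated compactly from the excerpt so that the snippet runs on its own; the image sizes are illustrative, and 378 / 12 are the defaults quoted in the docstring.

```python
import math


def select_tiling(height: int, width: int, crop_size: int, max_crops: int) -> tuple[int, int]:
    """Compact restatement of the excerpt above (same logic, fewer comments)."""
    if height <= crop_size or width <= crop_size:
        return (1, 1)
    min_h = math.ceil(height / crop_size)
    min_w = math.ceil(width / crop_size)
    if min_h * min_w > max_crops:
        ratio = math.sqrt(max_crops / (min_h * min_w))
        return (max(1, math.floor(min_h * ratio)), max(1, math.floor(min_w * ratio)))
    h_tiles = max(math.floor(math.sqrt(max_crops * height / width)), min_h)
    w_tiles = max(math.floor(math.sqrt(max_crops * width / height)), min_w)
    if h_tiles * w_tiles > max_crops:
        if w_tiles > h_tiles:
            w_tiles = math.floor(max_crops / h_tiles)
        else:
            h_tiles = math.floor(max_crops / w_tiles)
    return (max(1, h_tiles), max(1, w_tiles))


if __name__ == "__main__":
    examples = [
        (300, 500),    # one side fits in a single crop
        (756, 756),    # square image, minimum grid would be 2 x 2
        (800, 1200),   # minimum grid already uses all 12 crops
        (1000, 2000),  # minimum grid (3 x 6) exceeds the limit
    ]
    for height, width in examples:
        print((height, width), "->", select_tiling(height, width, 378, 12))
```

The 756 x 756 case returns (3, 3) rather than the minimal (2, 2), which shows that the function spends the crop budget in the image's aspect ratio instead of minimising the tile count.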

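Review comment highlights

Are the core engine changes (model hooks) necessary? (design)

DarkLight1337 opposed introducing model hooks and model_extra_output, arguing that the added code debt is not justified when only one model uses them. The author, sniper35, ultimately agreed to remove the core changes tied to the Point/Detect skills and keep only Query/Caption.

Conclusion: the intrusive Engine Core modifications were removed; only the non-intrusive model registration remains. · Resolved

grid_size computed from the wrong setting (correctness)

gemini-code-assist pointed out that grid_size wrongly used enc_n_layers (27); it should be crop_size / patch_size.

Conclusion: fixed; the value is now computed dynamically. · Resolved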
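
The corrected computation reduces to one integer division. A minimal sketch, assuming the defaults quoted in this report (crop_size 378, patch_size 14); `patch_grid_size` is a hypothetical helper name, not the identifier used in the PR.

```python
def patch_grid_size(crop_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square crop."""
    if crop_size <= 0 or patch_size <= 0:
        raise ValueError("crop_size and patch_size must be positive")
    if crop_size % patch_size != 0:
        raise ValueError(
            f"crop_size {crop_size} is not a multiple of patch_size {patch_size}"
        )
    return crop_size // patch_size


if __name__ == "__main__":
    # Derived from the vision settings, not from the encoder depth.
    print(patch_grid_size(378, 14))
```

With these defaults the result is 27, the same number as the enc_n_layers value cited in the review, so the original expression would have produced the right grid until either setting changed.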

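Weight name mapping error (correctness)

gemini-code-assist pointed out that attn.qkv. should be mapped to attn.qkv_proj.; without this mapping the weights could not be loaded.

Conclusion: the mapping rule was fixed. · Resolved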
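
The effect of the mapping rule can be shown with plain string handling. This is a hedged sketch: `remap_weight_name` and `find_unmatched` are hypothetical helpers, the parameter names are invented for illustration, and vLLM's actual weight-loading utilities are not used here.

```python
def remap_weight_name(name: str) -> str:
    """Rewrite a checkpoint parameter name to the model-side name."""
    # The trailing dot keeps already-mapped names ("attn.qkv_proj.") untouched.
    return name.replace("attn.qkv.", "attn.qkv_proj.")


def find_unmatched(checkpoint_names: list[str], model_params: set[str]) -> list[str]:
    """Checkpoint names that still have no matching model parameter."""
    return [n for n in checkpoint_names if remap_weight_name(n) not in model_params]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    model_params = {"blocks.0.attn.qkv_proj.weight", "blocks.0.attn.qkv_proj.bias"}
    checkpoint = ["blocks.0.attn.qkv.weight", "blocks.0.attn.qkv.bias"]
    print([remap_weight_name(n) for n in checkpoint])
    print(find_unmatched(checkpoint, model_params))
```

A check like find_unmatched, run before loading, surfaces a wrong rule as a list of names instead of a failure deep inside the loader.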

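Processor input normalization (design)

DarkLight1337 asked that the model layer not handle multiple input formats and that the processor normalize them instead.

Conclusion: the model-layer input handling was simplified to accept only a 5D tensor or a list of 4D tensors; everything else is handled by the processor. · Resolved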
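
The narrowed input contract can be expressed as a small validator. To stay dependency-free, the sketch below works on shapes rather than tensors; `crops_per_image` is a hypothetical helper, and the axis order (batch, crops, channels, height, width) is an assumption for illustration, not taken from the PR.

```python
from typing import Sequence, Union

Shape = tuple[int, ...]


def crops_per_image(pixel_shapes: Union[Shape, Sequence[Shape]]) -> list[int]:
    """Validate one of the two accepted layouts and return crops per image.

    Accepted layouts, described by shape only:
      * one batched 5D tensor: (batch, crops, channels, height, width)
      * a list of 4D tensors:  [(crops_i, channels, height, width), ...]
    """
    if len(pixel_shapes) > 0 and isinstance(pixel_shapes[0], int):
        # A flat tuple of ints describes a single batched tensor.
        if len(pixel_shapes) != 5:
            raise ValueError(f"expected a 5D shape, got {len(pixel_shapes)}D")
        batch, crops = pixel_shapes[0], pixel_shapes[1]
        return [crops] * batch

    counts = []
    for shape in pixel_shapes:
        if len(shape) != 4:
            raise ValueError(f"expected 4D per-image shapes, got {len(shape)}D")
        counts.append(shape[0])
    return counts


if __name__ == "__main__":
    print(crops_per_image((2, 3, 3, 378, 378)))
    print(crops_per_image([(1, 3, 378, 378), (5, 3, 378, 378)]))
```

The list form lets each image carry a different number of crops, which a single batched tensor cannot represent without padding.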

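Introducing the IO Processor plugin (design)

DarkLight1337 and christian-pinto questioned whether the post_process_generate and merge_sampling_params_for_prompt methods were necessary. The author explained that detect/point needed them, but then decided not to implement detect/point and dropped the related plugin changes.

Conclusion: no IO Processor plugin changes were introduced in the end; the existing interface is unchanged. · superseded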

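Risks and impact

New-model stability: this is a first integration. Although it passes the HF alignment tests, real deployments may show unexpected behavior, especially in the tiling path and in prompt-formatting edge cases.
External tokenizer dependency: the tokenizer comes from the separate repository moondream/starmie-v1; if that repository becomes unavailable or changes, the model cannot be loaded.
Performance overhead: image preprocessing involves overlapping tiling and feature reconstruction, so high-resolution images may add latency; the MoE decoder adds extra computation under batching.
Prefix-LM implementation risk: vLLM's prefix-LM mechanism is used by only a few models, and the fixed prefix length (730) may cause misalignment if the configuration changes.
Test coverage: not all tiling combinations and extreme cases are covered.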
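
One way to guard against the prefix-length misalignment is to derive the expected value from the vision settings and compare it with the configured one. The sketch below assumes, without confirmation from the PR, that the prefix is one leading token plus one token per patch of a single crop; this matches 730 for the defaults quoted in this report (378 / 14 = 27 and 27 × 27 + 1 = 730). `expected_prefix_len` and `check_prefix_len` are hypothetical helpers.

```python
def expected_prefix_len(crop_size: int, patch_size: int, leading_tokens: int = 1) -> int:
    """Prefix length under the assumption: leading tokens + one token per patch."""
    grid = crop_size // patch_size
    return leading_tokens + grid * grid


def check_prefix_len(configured: int, crop_size: int, patch_size: int) -> None:
    """Raise if the configured prefix length disagrees with the vision settings."""
    expected = expected_prefix_len(crop_size, patch_size)
    if configured != expected:
        raise ValueError(
            f"prefix length {configured} does not match {expected} "
            f"derived from crop_size={crop_size}, patch_size={patch_size}"
        )


if __name__ == "__main__":
    check_prefix_len(730, crop_size=378, patch_size=14)  # consistent, no error
    print(expected_prefix_len(378, 14))
```

If the assumption about the prefix layout does not hold for a given checkpoint, the formula should be adjusted rather than the check removed.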

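Users: can load the Moondream3 model directly with vllm.LLM and interact multimodally through the Query and Caption prompt formats.
System: adds about 3,200 lines of code (+3238 / -9 per the PR statistics), covering the full model, processor, configuration, and tests. The model uses a separate tokenizer, which requires an extra network fetch. It can use vLLM features such as tensor parallelism, pipeline parallelism, and prefix caching.
Team: must maintain the model's compatibility with the HF config and track upstream updates. Any change to the select_tiling algorithm in the processor has to be mirrored.

Risk tags: external tokenizer dependency, new-model stability, tiling compute overhead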

Related issue

#25215 [Feature]: Could support moondream vlm model?

Full report

Key files:

vllm/model_executor/models/moondream3.py (module: model layer; category: source; type: data-contract; symbols: reconstruct_from_crops, Moondream3VisionMLP, __init__, forward): core model implementation, containing the Vision Encoder, Text Decoder, multimodal embedding, and forward logic.
vllm/transformers_utils/processors/moondream3.py (module: processor; category: source; type: dependency-wiring; symbols: Moondream3ProcessorKwargs, select_tiling, Moondream3Processor, __init__): custom processor that handles image tiling, normalization, tokenization, and prompt construction.
vllm/transformers_utils/configs/moondream3.py (module: config; category: source; type: core-logic; symbols: Moondream3VisionConfig, __init__, Moondream3TextConfig, Moondream3Config): defines the model configuration, including the vision and text sub-configs and the MoE parameter mapping.
tests/models/multimodal/processing/test_moondream3.py (module: tests; category: test; type: test-coverage; symbols: test_processor_creation, test_processor_apply, test_processor_pixel_values, test_processor_image_token_expansion): unit tests for the processor, verifying tiling, placeholder expansion, pixel values, and more.
tests/models/multimodal/generation/test_moondream3.py (module: tests; category: test; type: test-coverage; symbols: make_query_prompt, make_caption_prompt, test_tensor_parallel, llm): end-to-end generation tests that verify the query and caption skills.
vllm/model_executor/models/registry.py (module: registry; category: source; type: data-contract): registers the Moondream3ForCausalLM and HfMoondream architectures so that vLLM can recognize the model.
tests/conftest.py (module: test config; category: test; type: test-coverage): adds the skip_processor_init and tokenizer_name parameters to support the moondream3 tests.
tests/models/multimodal/generation/vlm_utils/model_utils.py (module: test utilities; category: test; type: test-coverage; symbols: moondream3_processor, moondream3_patch_hf_runner, processor, _normalize_tiling): adds moondream3_patch_hf_runner, which patches the HF runner to align with vLLM's behavior.

Key symbols: reconstruct_from_crops, select_tiling, Moondream3Processor.__call__, Moondream3Model.forward, make_query_prompt, make_caption_prompt, _encode_vision, _normalize_tiling

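Review comment highlights

Are the core engine changes (model hooks) necessary? (design): the intrusive Engine Core modifications were removed; only the non-intrusive model registration remains.
grid_size computed from the wrong setting (correctness): fixed; the value is now computed dynamically.
Weight name mapping error (correctness): the mapping rule was fixed.
Processor input normalization (design): the model-layer input handling was simplified to accept only a 5D tensor or a list of 4D tensors; everything else is handled by the processor.
Introducing the IO Processor plugin (design): no IO Processor plugin changes were introduced in the end; the existing interface is unchanged.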

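Related context

PR #32327 Update imports for Moondream3: the PR body mentions that the imports might need updating to follow #32327; they had already been adapted by the time of the merge.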
