#43575 [feat] add GlmgaProcessor specific logits in `glm4_1v.py`

原始 PR 作者 JaredforReal 合并时间 2026-05-29 10:56 文件变更 3 提交数 26 评论 13 代码增减 +346 / -33

执行摘要

新增 GLMGA/GLM-4.6V-Flash 多模态支持

该 PR 旨在扩展 vllm 的多模态模型覆盖范围，增加对 GLMGA 模型的支持，同时保持与 GLM-4.6V-Flash 的兼容性。PR 描述指出：'Add serving support for the GLMGA multimodal model and improve GLM-4.6V-Flash compatibility within the existing Glm4vForConditionalGeneration framework.'

该 PR 实现了必要的新模型支持，但存在若干风险点（除零、类型安全、断言硬失败），建议在后续 PR 中修复。值得关注的决策包括：通过处理器类名探测变体、视频帧偶数填充以符合 HF temporal patch 要求。

讨论亮点

_to_video_metadata 类型安全（gemini-code-assist）：使用 **{k: metadata[k] for k in metadata if k != "do_sample_frames"} 转换字典为 VideoMetadata，若字段不匹配（如 timestamps）会引发 TypeError。
original_fps 除零风险（Copilot）：compute_frames_index_to_sample 中使用 original_fps 作为除数，若 fps 为 0 导致 ZeroDivisionError。
视频元数据与视频列表不同步（Copilot）：_get_direct_path_inputs 中，当某些输入不是二元组时，hf_videos 会添加但 hf_video_metadata 不会，导致长度不一致。
token 预算估算可能错误（Copilot）：get_mm_max_tokens_per_item 中 max_pixels // (temporal_patch_size * patch_size**2 * merge_size**2) 是面积除以面积，可能低估 token 数。
assert 语句硬失败（Copilot）：iter_mm_grid_thw 中使用 assert 检查 embed 范围，生产环境可能引发 AssertionError，且被 -O 移除后行为改变。

实现拆解

集成 GLMGA 处理器探测与视频 token 预算计算：在 glm4_1v.py 中新增 _is_glmga_model 方法，通过子处理器类名检测 GLMGA 变体；get_mm_max_tokens_per_item 利用视频处理器的 longest_edge 动态计算最大视觉 token 数量，并结合帧数限制和时间戳 token 长度估算预算。
添加 GLM-4.6V 专用视频后端：在 vllm/multimodal/video.py 中注册 GLM4_6VVideoBackend，实现 _prepare_source（处理缺失 duration 的场景）、compute_frames_index_to_sample（基于 fps 的帧选择逻辑）和 load_bytes（确保偶数帧数以匹配 HF 的 temporal patch 检查）。
适配直接 HF 处理器调用路径：为 Glm46VProcessor 提供 _get_direct_path_inputs 和 _apply_hf_processor_text_only，避免在 Glm4vBaseProcessor 中走分裂视频路径。
更新测试模型注册表：在 tests/models/registry.py 中将 GLM-4.6V-Flash 添加为 Glm4vForConditionalGeneration 的 extras 条目，确保 CI 覆盖新变体。
调整视频嵌入放置逻辑：修改 iter_mm_grid_thw 使其支持 per-frame 的 embed ranges，从 mm_position.extract_embeds_range() 获取范围而非单一 offset。

文件	模块	状态	重要度
`vllm/model_executor/models/glm4_1v.py`	多模态模型	modified	9.21
`vllm/multimodal/video.py`	多模态处理	modified	8.52
`tests/models/registry.py`	测试注册表	modified	4.45

关键符号

_to_video_metadata get_mm_max_tokens_per_item _is_glmga_model _get_video_second_idx_glmga _get_direct_path_inputs get_video_replacement_glm46v GLM4_6VVideoBackend._prepare_source GLM4_6VVideoBackend.compute_frames_index_to_sample GLM4_6VVideoBackend.load_bytes

关键源码片段

vllm/model_executor/models/glm4_1v.py data-contract

核心变更文件：集成 GLMGA 处理器、视频 token 预算计算、直接 HF 路径适配，新增约 240 行代码

# 在 Glm4vModel 类中新增的方法

def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int] | None:
    processor = self.get_hf_processor()
    # Glm4vProcessor ( 非 GA) 直接返回 None
    if isinstance(processor, Glm4vProcessor):
        return None

    result: dict[str, int] = {}

    if mm_counts.get("image", 0) > 0:
        result["image"] = self.get_max_image_tokens()

    if mm_counts.get("video", 0) > 0:
        video_processor = self.get_video_processor()
        # 从 processor 配置中读取 longest_edge
        max_pixels = video_processor.size["longest_edge"]

        vision_config = self.get_hf_config().vision_config
        temporal_patch_size = vision_config.temporal_patch_size
        patch_size = vision_config.patch_size
        merge_size = vision_config.spatial_merge_size

        # 计算最大视觉 token 数 ( 注意：此计算可能低估实际值 )
        max_vision_tokens = max_pixels // (
            temporal_patch_size * patch_size**2 * merge_size**2
        )

        # GLMGA 支持最多 640 帧
        max_grid_t = 640 // temporal_patch_size

        tokenizer = self.get_tokenizer()
        # 枚举至多 300 种时间戳格式，取最大 token 长度
        max_ts_tokens = max(
            len(tokenizer.encode(f"{t:.1f} seconds", add_special_tokens=False))
            for t in range(min(max_grid_t, 300))
        )

        result["video"] = max_vision_tokens + max_grid_t * (2 + max_ts_tokens) + 2

    return result


def _is_glmga_model(self, processor: object) -> bool:
    """通过 processor 及其子 processor 的类名检测 GLMGA 变体"""
    for attr in ("image_processor", "video_processor"):
        sub = getattr(processor, attr, None)
        if sub and "Glmga" in type(sub).__name__:
            return True
    return False

评论区精华

_to_video_metadata 可能导致 TypeError 正确性

gemini-code-assist 指出：`VideoMetadata` 是固定字段的 dataclass，`_to_video_metadata` 用 `**` 解包包含额外键（如 timestamps）的字典会引发 TypeError。

结论：未在 PR 中得到解决；风险已记录，后续需改用字典或自定义类。 · unresolved

compute_frames_index_to_sample 中 original_fps 除零风险 正确性

Copilot 指出：当 original_fps 为 0 或无效时，计算 duration 和 duration_per_frame 会导致除零错误。

结论：未在 PR 中得到解决；建议在 computed 中增加 guard。 · unresolved

get_mm_max_tokens_per_item 可能低估视觉 token 预算 性能

Copilot 指出：`max_pixels // (temporal_patch_size * patch_size**2 * merge_size**2)` 是面积除以面积，很可能低估 token 数，导致 OOM 或调度错误。

结论：未在 PR 中得到解决；建议基于 grid 维度重新计算。 · unresolved

_get_direct_path_inputs 视频列表与元数据不同步 正确性

Copilot 指出：当某些视频输入不是二元组时，`hf_videos` 添加但 `hf_video_metadata` 不添加，导致长度不一致。

结论：未在 PR 中得到解决；建议同步添加条目或验证输入格式。 · unresolved

iter_mm_grid_thw 中的 assert 硬崩溃 正确性

Copilot 指出：`assert` 语句在生产环境可能因细微的 processor 输出差异而崩溃，且被 `-O` 移除后行为不可预测。

结论：未在 PR 中得到解决；建议改为异常并给出上下文信息。 · unresolved

风险与影响

除零风险：compute_frames_index_to_sample 和 _get_video_second_idx_glmga 未对 original_fps == 0 做防御性检查。
类型错误：_to_video_metadata 假设字典键完全匹配 VideoMetadata 字段，传入额外键（如 timestamps）将崩溃。
索引未同步：_get_direct_path_inputs 中视频列表与元数据列表可能不等长，导致处理错位。
Token 预算低估：get_mm_max_tokens_per_item 的视觉 token 计算公式可能错误，导致 OOM 或调度失败。
断言硬失败：iter_mm_grid_thw 中的 assert 在生产环境面对略微不同的处理器输出时直接崩溃。

为用户提供对 GLMGA 和 GLM-4.6V-Flash 的推理支持；新增 101 行视频后端代码，需在视频加载路径中维护；模型类增加约 240 行逻辑；测试注册表扩展确保 CI 覆盖。影响范围限于新增的多模态变体，不影响已有模型。

除零风险类型安全断言硬失败 token 预算不准确数据同步风险

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：新增 GLMGA/GLM-4.6V-Flash 多模态支持
推荐动作：该 PR 实现了必要的新模型支持，但存在若干风险点（除零、类型安全、断言硬失败），建议在后续 PR 中修复。值得关注的决策包括：通过处理器类名探测变体、视频帧偶数填充以符合 HF temporal patch 要求。

功能与动机

实现拆解

集成 GLMGA 处理器探测与视频 token 预算计算：在 glm4_1v.py 中新增 _is_glmga_model 方法，通过子处理器类名检测 GLMGA 变体；get_mm_max_tokens_per_item 利用视频处理器的 longest_edge 动态计算最大视觉 token 数量，并结合帧数限制和时间戳 token 长度估算预算。
添加 GLM-4.6V 专用视频后端：在 vllm/multimodal/video.py 中注册 GLM4_6VVideoBackend，实现 _prepare_source（处理缺失 duration 的场景）、compute_frames_index_to_sample（基于 fps 的帧选择逻辑）和 load_bytes（确保偶数帧数以匹配 HF 的 temporal patch 检查）。
适配直接 HF 处理器调用路径：为 Glm46VProcessor 提供 _get_direct_path_inputs 和 _apply_hf_processor_text_only，避免在 Glm4vBaseProcessor 中走分裂视频路径。
更新测试模型注册表：在 tests/models/registry.py 中将 GLM-4.6V-Flash 添加为 Glm4vForConditionalGeneration 的 extras 条目，确保 CI 覆盖新变体。
调整视频嵌入放置逻辑：修改 iter_mm_grid_thw 使其支持 per-frame 的 embed ranges，从 mm_position.extract_embeds_range() 获取范围而非单一 offset。

关键文件：

vllm/model_executor/models/glm4_1v.py（模块多模态模型；类别 source；类型 data-contract；符号 _to_video_metadata, get_mm_max_tokens_per_item, _is_glmga_model, _get_video_second_idx_glmga）: 核心变更文件：集成 GLMGA 处理器、视频 token 预算计算、直接 HF 路径适配，新增约 240 行代码
vllm/multimodal/video.py（模块多模态处理；类别 source；类型 core-logic；符号 GLM4_6VVideoBackend, _prepare_source, compute_frames_index_to_sample, load_bytes）: 新增 GLM-4.6V 专用视频后端，包含帧采样逻辑和偶数帧填充，注册新 backend
tests/models/registry.py（模块测试注册表；类别 test；类型 test-coverage）: 测试注册表扩展，将 GLM-4.6V-Flash 添加为 Glm4vForConditionalGeneration 的 extras 条目

关键符号：_to_video_metadata, get_mm_max_tokens_per_item, _is_glmga_model, _get_video_second_idx_glmga, _get_direct_path_inputs, get_video_replacement_glm46v, GLM4_6VVideoBackend._prepare_source, GLM4_6VVideoBackend.compute_frames_index_to_sample, GLM4_6VVideoBackend.load_bytes

关键源码片段

`vllm/model_executor/models/glm4_1v.py`

核心变更文件：集成 GLMGA 处理器、视频 token 预算计算、直接 HF 路径适配，新增约 240 行代码

# 在 Glm4vModel 类中新增的方法

def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int] | None:
    processor = self.get_hf_processor()
    # Glm4vProcessor ( 非 GA) 直接返回 None
    if isinstance(processor, Glm4vProcessor):
        return None

    result: dict[str, int] = {}

    if mm_counts.get("image", 0) > 0:
        result["image"] = self.get_max_image_tokens()

    if mm_counts.get("video", 0) > 0:
        video_processor = self.get_video_processor()
        # 从 processor 配置中读取 longest_edge
        max_pixels = video_processor.size["longest_edge"]

        vision_config = self.get_hf_config().vision_config
        temporal_patch_size = vision_config.temporal_patch_size
        patch_size = vision_config.patch_size
        merge_size = vision_config.spatial_merge_size

        # 计算最大视觉 token 数 ( 注意：此计算可能低估实际值 )
        max_vision_tokens = max_pixels // (
            temporal_patch_size * patch_size**2 * merge_size**2
        )

        # GLMGA 支持最多 640 帧
        max_grid_t = 640 // temporal_patch_size

        tokenizer = self.get_tokenizer()
        # 枚举至多 300 种时间戳格式，取最大 token 长度
        max_ts_tokens = max(
            len(tokenizer.encode(f"{t:.1f} seconds", add_special_tokens=False))
            for t in range(min(max_grid_t, 300))
        )

        result["video"] = max_vision_tokens + max_grid_t * (2 + max_ts_tokens) + 2

    return result


def _is_glmga_model(self, processor: object) -> bool:
    """通过 processor 及其子 processor 的类名检测 GLMGA 变体"""
    for attr in ("image_processor", "video_processor"):
        sub = getattr(processor, attr, None)
        if sub and "Glmga" in type(sub).__name__:
            return True
    return False

评论区精华

_to_video_metadata 类型安全（gemini-code-assist）：使用 **{k: metadata[k] for k in metadata if k != "do_sample_frames"} 转换字典为 VideoMetadata，若字段不匹配（如 timestamps）会引发 TypeError。
original_fps 除零风险（Copilot）：compute_frames_index_to_sample 中使用 original_fps 作为除数，若 fps 为 0 导致 ZeroDivisionError。
视频元数据与视频列表不同步（Copilot）：_get_direct_path_inputs 中，当某些输入不是二元组时，hf_videos 会添加但 hf_video_metadata 不会，导致长度不一致。
token 预算估算可能错误（Copilot）：get_mm_max_tokens_per_item 中 max_pixels // (temporal_patch_size * patch_size**2 * merge_size**2) 是面积除以面积，可能低估 token 数。
assert 语句硬失败（Copilot）：iter_mm_grid_thw 中使用 assert 检查 embed 范围，生产环境可能引发 AssertionError，且被 -O 移除后行为改变。

_to_video_metadata 可能导致 TypeError (correctness): 未在 PR 中得到解决；风险已记录，后续需改用字典或自定义类。
compute_frames_index_to_sample 中 original_fps 除零风险 (correctness): 未在 PR 中得到解决；建议在 computed 中增加 guard。
get_mm_max_tokens_per_item 可能低估视觉 token 预算 (performance): 未在 PR 中得到解决；建议基于 grid 维度重新计算。
_get_direct_path_inputs 视频列表与元数据不同步 (correctness): 未在 PR 中得到解决；建议同步添加条目或验证输入格式。
iter_mm_grid_thw 中的 assert 硬崩溃 (correctness): 未在 PR 中得到解决；建议改为异常并给出上下文信息。

风险与影响

风险：
1. 除零风险：compute_frames_index_to_sample 和 _get_video_second_idx_glmga 未对 original_fps == 0 做防御性检查。
2. 类型错误：_to_video_metadata 假设字典键完全匹配 VideoMetadata 字段，传入额外键（如 timestamps）将崩溃。
3. 索引未同步：_get_direct_path_inputs 中视频列表与元数据列表可能不等长，导致处理错位。
4. Token 预算低估：get_mm_max_tokens_per_item 的视觉 token 计算公式可能错误，导致 OOM 或调度失败。
5. 断言硬失败：iter_mm_grid_thw 中的 assert 在生产环境面对略微不同的处理器输出时直接崩溃。
  - 影响：为用户提供对 GLMGA 和 GLM-4.6V-Flash 的推理支持；新增 101 行视频后端代码，需在视频加载路径中维护；模型类增加约 240 行逻辑；测试注册表扩展确保 CI 覆盖。影响范围限于新增的多模态变体，不影响已有模型。
  - 风险标记：除零风险, 类型安全, 断言硬失败, token 预算不准确, 数据同步风险

关联脉络

暂无明显关联 PR

#43575 [feat] add GlmgaProcessor specific logits in `glm4_1v.py`

执行摘要

新增 GLMGA/GLM-4.6V-Flash 多模态支持

实现拆解

评论区精华

风险与影响

关联 Issue

未识别关联 Issue

完整报告

参与讨论