#27297 [diffusion] Optimize LingBot realtime transport and camera conditioning

Original PR. Author: mickqian; merged: 2026-06-05 16:00; files changed: 7; commits: 5; comments: 5; lines changed: +462 / -119

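Executive summary

Optimizes LingBot realtime transport and camera conditioning, reducing latency by about 10%.

The PR reduces LingBot realtime inference latency, targeting two costs in particular: the repeated computation in camera conditioning and the encoding latency of raw-frame transport. Benchmarks showed significant room for optimization in both, and a CI OOM issue in the tests also needed fixing.

Worth a close read, especially the camera conditioner cache design (how the key is built from source tensor identity, and the conditions under which caching applies) and the transport layer's decision to drop delta-gzip in favor of raw bytes. Test coverage is thorough, and the PR can serve as a model for performance-optimization PRs.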

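Discussion highlights

The PR had no review discussion; the author re-triggered CI several times with the /tag-and-rerun-ci command to get the tests to pass.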

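Implementation breakdown

Camera conditioner scale/shift caching: extracts a compute_scale_shift method in LingBotWorldCamConditioner so that forward can accept precomputed values; adds a _cam_conditioner_scale_shift method to CausalLingBotWorldTransformerBlock that stores the computed result under a forward_batch cache key (comprising data_ptr, shape, stride, dtype, device, and _version). Caching applies only when sequence_shard is enabled and timestep >= 0, which avoids recomputing at every timestep.
Output transport optimization: RawRGBRealtimeOutputAdapter now sends lossless raw RGB as raw bytes by default instead of delta-gzip; when the payload exceeds 64KB, it is split into a msgpack header and separate raw bytes to reduce latency; the _last_raw_rgb_frame and _last_event_id state is removed, _build_transport_payload is simplified, and the delta-gzip branch is deleted.
CI OOM fix: adds _wait_for_server_warmup in realtime_video_api.py so that websocket requests are accepted only after server warmup completes; lingbot_world_causal_denoising.py now clears the lingbot_cam_conditioner entries when the cache is updated.
Tests: test_realtime_output_transport.py is updated to verify the default raw payload and the split-message behavior; test_lingbot_causal_denoising.py is updated to cover cache matching, reuse, and skip conditions.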
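
The identity-keyed cache described in the first item can be illustrated with a minimal, stdlib-only sketch. It is not the PR's code: `FakeTensor`, `source_tensor_key`, and `ScaleShiftCache` are hypothetical names, and `FakeTensor` only mimics the attributes the summary lists (`data_ptr`, `shape`, `stride`, `dtype`, `device`, `_version`). The gating condition follows the summary's wording (sequence_shard enabled and timestep >= 0); the real code may check more.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FakeTensor:
    """Hypothetical stand-in exposing the tensor attributes named in the summary."""
    ptr: int
    shape: tuple
    strides: tuple
    dtype: str = "float32"
    device: str = "cuda:0"
    _version: int = 0  # models the counter that in-place writes bump

    def data_ptr(self) -> int:
        return self.ptr

    def stride(self) -> tuple:
        return self.strides


def source_tensor_key(t: Any) -> tuple:
    # The key identifies the source tensor by where it lives and how it is laid out,
    # plus its version counter, so an in-place write produces a different key.
    return (t.data_ptr(), tuple(t.shape), tuple(t.stride()),
            str(t.dtype), str(t.device), t._version)


class ScaleShiftCache:
    """Hypothetical cache: reuse scale/shift while the source tensor is unchanged."""

    def __init__(self, compute: Callable[[Any], Any]) -> None:
        self._compute = compute
        self._entries: dict = {}

    def get(self, tensor: Any, *, sequence_shard: bool, timestep: int) -> Any:
        if not (sequence_shard and timestep >= 0):
            return self._compute(tensor)  # caching skipped: always recompute
        key = source_tensor_key(tensor)
        if key not in self._entries:
            self._entries[key] = self._compute(tensor)
        return self._entries[key]

    def clear(self) -> None:
        # Mirrors clearing the conditioner entries when the context cache is updated.
        self._entries.clear()
```

One design point worth noting in any such scheme: a memory address can be reused after a tensor is freed, so the key is only trustworthy while the source tensor stays alive or the cache is cleared at well-defined points, which is presumably why the PR clears the entries on cache update.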

File	Module	Status	Importance
`python/sglang/multimodal_gen/runtime/models/dits/lingbot_world.py`	Model layer	modified	8.75
`python/sglang/multimodal_gen/test/unit/realtime/test_lingbot_causal_denoising.py`	Tests	modified	7.47
`python/sglang/multimodal_gen/test/unit/realtime/test_realtime_output_transport.py`	Tests	modified	7.18
`python/sglang/multimodal_gen/runtime/entrypoints/openai/realtime/realtime_output_adapter.py`	Transport layer	modified	7.0
`python/sglang/multimodal_gen/runtime/entrypoints/openai/realtime/realtime_video_api.py`	API entry point	modified	6.21
`python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/lingbot_world/lingbot_world_causal_denoising.py`	Model layer	modified	4.39
`python/sglang/multimodal_gen/test/server/gpu_cases.py`	Tests	modified	3.59

Key symbols

compute_scale_shift, _cam_conditioner_scale_shift, _should_cache_cam_conditioner, _prepare_cam_conditioner_scale_shifts, _pack_frame_batch_header, _build_transport_payload, RawRGBRealtimeOutputAdapter.send, _wait_for_server_warmup

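Key source excerpt

python/sglang/multimodal_gen/runtime/entrypoints/openai/realtime/realtime_output_adapter.py (type: core-logic)

The core transport logic is simplified: delta-gzip is removed, _pack_frame_batch_header is added, and large payloads can be split across messages.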

def _pack_frame_batch_header(header: RealtimeFrameBatchHeader) -> bytes:
    # New function: encodes the header on its own as msgpack; it is sent as the
    # small first message when a large payload is split
    return msgspec.msgpack.encode(header)

def _build_transport_payload(
    transport_frames: list[bytes],
    *,
    content_type: str,
    metadata: dict[str, int | str],
    output_format: str | None,
    transport_quality: int | None,
    preview_max_width: int | None,
    # The reference_frame and event_id parameters were removed
) -> _TransportPayload:
    # Simplified: for raw RGB, concatenate transport_frames directly into raw_payload;
    # delta-gzip is no longer used
    if content_type == RAW_RGB_CONTENT_TYPE and transport_frames:
        raw_payload = b"".join(transport_frames)
        payload_metadata = {
            "raw_size": len(raw_payload),
            "encoding": RAW_LOSSLESS_OUTPUT_FORMAT,
        }
        # ... the former delta-gzip branch has been deleted
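
The header/payload split for large frames can be sketched as follows. This is an illustration under stated assumptions, not the adapter's actual wire format: `pack_frame_batch` and `unpack_frame_batch` are hypothetical helpers, JSON stands in for msgpack to keep the example stdlib-only, and the length prefix and the `payload_follows` field are invented for the sketch. Only the 64KB threshold and the `raw_size` metadata key come from the summary and the excerpt above.

```python
import json

LARGE_PAYLOAD_THRESHOLD = 64 * 1024  # 64KB threshold named in the summary


def pack_frame_batch(header: dict, raw_payload: bytes,
                     threshold: int = LARGE_PAYLOAD_THRESHOLD) -> list:
    """Return the websocket messages to send for one frame batch (hypothetical framing)."""
    split = len(raw_payload) > threshold
    meta = {**header, "raw_size": len(raw_payload), "payload_follows": split}
    header_bytes = json.dumps(meta).encode("utf-8")
    prefix = len(header_bytes).to_bytes(4, "big") + header_bytes
    if split:
        # Large payload: a small header message first, then the raw bytes
        # as their own message, so the pixels are never copied or re-encoded.
        return [prefix, raw_payload]
    # Small payload: one message carrying header and pixels together.
    return [prefix + raw_payload]


def unpack_frame_batch(messages: list) -> tuple:
    """Inverse of pack_frame_batch: recover (header, raw_payload)."""
    first = messages[0]
    header_len = int.from_bytes(first[:4], "big")
    header = json.loads(first[4:4 + header_len])
    if header["payload_follows"]:
        return header, messages[1]
    return header, first[4 + header_len:]
```

Sending the bulk bytes as a separate message trades one extra websocket frame for skipping a serialization pass over the pixel data, which is consistent with the latency motivation given in the summary.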

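Comment highlights

No high-value discussion threads were identified.

The comments so far do not contain a clear point of contention or conclusion; this section will be updated if more discussion follows.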

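Risks and impact

The cache key depends on the tensor's _version counter and data pointer; if a tensor is modified in place without its version being bumped, the cache could return stale values, though the cache behavior is covered by tests.
The default transport changes from delta-gzip to raw bytes, which may increase bandwidth use, but payload build latency drops from ~20ms to ~1ms, so the trade-off is reasonable.
_wait_for_server_warmup blocks before a websocket connection is accepted, which may increase latency for the first request, but avoiding OOM matters more.
The lingbot_cam_conditioner cache is cleared during the context cache update; the logic is correct but would benefit from exception handling.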

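For users: LingBot realtime video inference latency drops by about 10%, a noticeable improvement. For the system: the transport layer is simpler and no longer depends on delta-gzip, which lowers maintenance cost; building the raw payload costs very little, but bandwidth use rises somewhat (still worth watching where the network is the bottleneck). For the team: the PR offers a reference pattern for scenario-specific compute caching and transport protocol optimization.

Risk tags: cache depends on tensor version counter, bandwidth trade-off, new websocket latency point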

Linked issues

No linked issue identified

No explicit issue link has been detected so far; one will appear here once the relevant references are synced.

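Full report

Executive summary

In one sentence: optimizes LingBot realtime transport and camera conditioning, reducing latency by about 10%.
Recommended action: worth a close read, especially the camera conditioner cache design (how the key is built from source tensor identity, and the conditions under which caching applies) and the transport layer's decision to drop delta-gzip in favor of raw bytes. Test coverage is thorough, and the PR can serve as a model for performance-optimization PRs.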

Implementation breakdown

Key files:

python/sglang/multimodal_gen/runtime/models/dits/lingbot_world.py (module: model layer; category: source; type: data-contract; symbols: compute_scale_shift, _cam_conditioner_scale_shift, _should_cache_cam_conditioner, _prepare_cam_conditioner_scale_shifts): the core model file; implements the camera conditioner cache, including new methods such as compute_scale_shift and _cam_conditioner_scale_shift.
python/sglang/multimodal_gen/test/unit/realtime/test_lingbot_causal_denoising.py (module: tests; category: test; type: test-coverage; symbols: test_lingbot_cam_conditioner_scale_shift_matches_forward, test_lingbot_cam_conditioner_cache_reuses_source_tensor, test_lingbot_cam_conditioner_cache_skips_non_sequence_shard, test_lingbot_cam_conditioner_cache_skips_single_ulysses_world): test coverage for the camera conditioner cache, verifying the match, reuse, and skip scenarios.
python/sglang/multimodal_gen/test/unit/realtime/test_realtime_output_transport.py (module: tests; category: test; type: test-coverage; symbols: test_raw_rgb_realtime_output_adapter_uses_lossless_raw_payload_by_default, test_raw_rgb_realtime_output_adapter_offloads_default_lossless_payload_build, test_raw_rgb_realtime_output_adapter_sends_large_payload_separately, _unpack_frame_batch_messages): tests the transport changes: default raw payload, parsing of split messages, and large-payload splitting.
python/sglang/multimodal_gen/runtime/entrypoints/openai/realtime/realtime_output_adapter.py (module: transport layer; category: source; type: core-logic; symbols: _pack_frame_batch_header, _build_transport_payload, RawRGBRealtimeOutputAdapter): simplifies the core transport logic, removes delta-gzip, adds _pack_frame_batch_header, and supports splitting large payloads.
python/sglang/multimodal_gen/runtime/entrypoints/openai/realtime/realtime_video_api.py (module: API entry point; category: source; type: entrypoint; symbols: _wait_for_server_warmup): adds a wait for server warmup at the entry point, fixing the CI OOM.
python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/lingbot_world/lingbot_world_causal_denoising.py (module: model layer; category: source; type: data-contract): clears the camera conditioning cache when the context cache is updated, keeping the two consistent.
python/sglang/multimodal_gen/test/server/gpu_cases.py (module: tests; category: test; type: test-coverage): test configuration adjustments, possibly related to hardware requirements.

Key symbols: compute_scale_shift, _cam_conditioner_scale_shift, _should_cache_cam_conditioner, _prepare_cam_conditioner_scale_shifts, _pack_frame_batch_header, _build_transport_payload, RawRGBRealtimeOutputAdapter.send, _wait_for_server_warmup

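Comment highlights

The PR had no review discussion; the author re-triggered CI several times with the /tag-and-rerun-ci command to get the tests to pass.

No high-value comment threads yet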

Related context

No clearly related PRs yet
