vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Monitoring: enabled · Last sync: 2026-06-13 19:39 · Sync status: idle · Next scheduled: 2026-06-13 20:39

PR list

2026-06-02

#44220 [Perf] use triton moe backend on hopper by default

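Original PR · Author ZJY0516 · Merged 2026-06-02 15:52

Performance optimization · Importance 5.91 · Insight 5.00

Use the Triton MoE backend by default on Hopper

Recommended for merge. Based on actual benchmark data, this PR switches the default MoE backend on Hopper from FlashInfer to Triton; the performance gain is clear and the risk is low. The Hopper-specific optimizations and the benchmarking methodology are worth noting and can be carried over to similar decisions.

performance · kernel · model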

#44267 [Refactor] Unify reasoning + tool-call parsing behind Parser.parse()

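Original PR · Author sfeng33 · Merged 2026-06-02 15:11

Refactor · Importance 8.36 · Insight 6.00

Unify reasoning and tool-call parsing behind Parser.parse()

Worth a close read: a unified parsing entry point is a key step in the frontend architecture refactor and lays the groundwork for supporting more parser combinations later. Pay attention to the author's design decision to "match streaming" and its potential compatibility impact.

refactor · frontend · tool-calling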
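To make the idea of a single parsing entry point concrete, here is a minimal sketch. It is not vLLM's implementation: the `ParsedOutput` type, the `<think>` and `<tool_call>` tag formats, and the body of `parse()` are all assumptions chosen for illustration, and streaming is not modeled.

```python
import json
import re
from dataclasses import dataclass, field


@dataclass
class ParsedOutput:
    """Hypothetical result type: reasoning, visible content, and tool calls."""
    reasoning: str = ""
    content: str = ""
    tool_calls: list = field(default_factory=list)


class Parser:
    """Illustrative unified parser; the tag formats below are assumptions."""

    _THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    _TOOL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

    def parse(self, text: str) -> ParsedOutput:
        # One pass over the raw model output yields all three parts, so
        # callers do not have to chain separate reasoning and tool parsers.
        reasoning = "\n".join(m.strip() for m in self._THINK.findall(text))
        rest = self._THINK.sub("", text)
        tool_calls = [json.loads(m) for m in self._TOOL.findall(rest)]
        content = self._TOOL.sub("", rest).strip()
        return ParsedOutput(reasoning, content, tool_calls)
```

With one entry point, adding a new combination of reasoning format and tool-call format would only mean providing another `parse()` implementation, rather than coordinating two independent parsers at each call site.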

#43991 [Model Runner V2] Use actual batch max_seq_len for attn metadata

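Original PR · Author izhuhaoran · Merged 2026-06-02 14:07

Bug fix · Importance 6.25 · Insight 5.00

Fix incorrect max_seq_len passed to attn metadata in the V2 model runner

Worth a close read, especially to see how the optimization pattern in `DefaultModelState` can be extended to other ModelState implementations, and how the draft max_seq_len is managed dynamically in speculative decoding. The design decisions are clear; the diff is small but affects correctness.

bugfix · v1 · attention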
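The distinction between a configured upper bound and the batch's actual maximum sequence length can be sketched as follows. This is an illustration only: the function name, its parameters, and the way draft tokens are added are assumptions, not code from the PR.

```python
def batch_max_seq_len(seq_lens: list[int], model_max_len: int,
                      extra_draft_tokens: int = 0) -> int:
    """Longest sequence actually present in the batch, capped by the model limit.

    extra_draft_tokens is a hypothetical knob for speculative decoding,
    where each sequence may grow by a few draft tokens in the same step.
    """
    if not seq_lens:
        return 0
    return min(max(seq_lens) + extra_draft_tokens, model_max_len)
```

For a batch with lengths `[17, 512, 64]` and a model limit of 32768, this returns 512 rather than 32768, which is the value the attention metadata would describe for that batch.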

#43990 [Model Runner V2] Support zeroing freshly allocated KV blocks for hybrid + fp8 KVCache

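Original PR · Author izhuhaoran · Merged 2026-06-02 13:56

Bug fix · Importance 7.45 · Insight 6.00

Fix a bug where the V2 model runner did not zero freshly allocated blocks of the hybrid + fp8 KV cache

bugfix · v1 · attention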

#43798 [Bugfix] Convert Gemma4-MM ViT linear layers to vllm native impl

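Original PR · Author Isotr0py · Merged 2026-06-02 12:41

Bug fix · Importance 7.71 · Insight 6.00

Fix compatibility of quantized linear layers in the Gemma4-MM ViT

Worth a close read. The design opts for generic recursive replacement rather than a model-specific patch, reflecting a modular, encapsulated approach. The dtype fix for `BitsAndBytesWeightParameter` is a reusable technique. Keep an eye on the follow-up LoRA accuracy fix.

bugfix · multi-modality · quantization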
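Generic recursive replacement can be illustrated with a toy module tree. Everything here is a stand-in: `Module`, `Linear`, `NativeLinear`, and `replace_modules` are hypothetical names written in plain Python, not PyTorch or vLLM classes, and the sketch says nothing about how the PR handles weights or quantization.

```python
class Module:
    """Toy module base: children are attributes that are themselves Modules."""

    def children(self) -> dict:
        # Returns a fresh dict, so callers may reassign attributes while iterating.
        return {k: v for k, v in vars(self).items() if isinstance(v, Module)}


class Linear(Module):
    def __init__(self, in_features: int, out_features: int) -> None:
        self.in_features = in_features
        self.out_features = out_features


class NativeLinear(Module):
    """Hypothetical replacement layer with the same shape parameters."""

    def __init__(self, in_features: int, out_features: int) -> None:
        self.in_features = in_features
        self.out_features = out_features


class Attention(Module):
    def __init__(self) -> None:
        self.qkv = Linear(8, 24)
        self.out = Linear(8, 8)


class Encoder(Module):
    def __init__(self) -> None:
        self.attn = Attention()
        self.head = Linear(8, 2)


def replace_modules(root: Module, predicate, factory) -> int:
    """Walk the tree and swap every matching child in place; return the count."""
    replaced = 0
    for name, child in root.children().items():
        if predicate(child):
            setattr(root, name, factory(child))
            replaced += 1
        else:
            replaced += replace_modules(child, predicate, factory)
    return replaced
```

Because the walk depends only on a predicate and a factory, the same helper applies to any model whose layers match, which is the appeal of this approach over patching one model's constructor.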

#41714 [MM][CG] Profile encoder CUDA graph pool memory

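Original PR · Author BWAAEEEK · Merged 2026-06-02 12:27

Performance optimization · Importance 8.35 · Insight 6.00

Profile vision encoder CUDA graph pool memory

This PR is worth a close read, particularly how `profile_cudagraph_memory` integrates the encoder part and how the graph pool lifecycle is designed. It shows a typical pattern for extending an existing CUDA graph framework with a new module: a temporary manager for profiling, a persistent manager for runtime, and a graph pool for isolation. It is a valuable reference for multimodal model developers and CUDA graph maintainers.

performance · v1 · multi-modality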
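The profile-with-a-temporary-manager pattern described above can be sketched with toy objects. This does not use CUDA and is not the PR's code: `GraphPool`, `EncoderGraphManager`, `profile_encoder_graph_memory`, and the linear cost model are all assumptions made for illustration.

```python
class GraphPool:
    """Toy stand-in for a CUDA graph memory pool; only tracks reserved bytes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.reserved_bytes = 0


class EncoderGraphManager:
    """Hypothetical manager that 'captures' one graph per batch size into a pool."""

    def __init__(self, pool: GraphPool) -> None:
        self.pool = pool
        self.captured: dict[int, int] = {}

    def capture(self, batch_size: int, bytes_per_item: int) -> None:
        if batch_size in self.captured:
            return  # already captured; no additional memory is reserved
        cost = batch_size * bytes_per_item  # assumed cost model
        self.captured[batch_size] = cost
        self.pool.reserved_bytes += cost


def profile_encoder_graph_memory(batch_sizes, bytes_per_item: int) -> int:
    """Estimate pool usage with a throwaway manager and a throwaway pool."""
    pool = GraphPool("profile")
    manager = EncoderGraphManager(pool)
    for bs in batch_sizes:
        manager.capture(bs, bytes_per_item)
    # Neither object outlives this call, so profiling never touches runtime state.
    return pool.reserved_bytes
```

At runtime, a separate long-lived `EncoderGraphManager` would own its own `GraphPool`, so the estimate from profiling can be budgeted up front while the captured graphs stay isolated from any other pool.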

#43930 [XPU][Bugfix] Fix per_token_group_fp8_quant missing dummy args on XPU

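Original PR · Author chaojun-zhang · Merged 2026-06-02 11:09

Bug fix · Importance 5.72 · Insight 3.00

Fix FP8 quantization on XPU being called with 2 missing arguments

This PR is a necessary bugfix; the change is small and precise and worth merging. After merging, verify that FP8 quantization works correctly on XPU.

bugfix · intel-gpu · quantization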

#42959 [BugFix][kv_offload]: Prevent offloading stale sliding window blocks

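Original PR · Author orozery · Merged 2026-06-02 10:59

Bug fix · Importance 7.65 · Insight 5.00

Fix sliding window blocks becoming stale after a failed offload

Reviewers should focus on the full-traversal logic in `_update_req_states` and its performance impact, and confirm that the design trade-off is reasonable. Running this PR's new test cases in the integration tests for sliding window functionality is also encouraged. The overall fix is sound and worth a close read.

bugfix · v1 · kv-connector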
