执行摘要

修复边界条件，允许总 token 数等于 context_len 的请求

用户需要总 token 数（prompt_length + max_new_tokens）恰好等于 context_len 的请求被允许，与 vLLM 行为和已有自动截断逻辑一致。

可以快速合并。变更简单安全，建议添加单元测试覆盖边界情况以确保未来重构时不会回归。

讨论亮点

无 review 评论。PR 获得两名维护者批准并合并。

实现拆解

修改文件 python/sglang/srt/managers/tokenizer_manager.py 中 _validate_one_request 方法的第 831 行，将条件判断 >= _max_req_len 改为 > _max_req_len。当 max_new_tokens + input_token_num 等于 context_len 时，不再触发拒绝逻辑，直接通过验证。

文件	模块	状态	重要度
`python/sglang/srt/managers/tokenizer_manager.py`	tokenizer_manager	modified	4.32

关键符号

_validate_one_request

关键源码片段

python/sglang/srt/managers/tokenizer_manager.py core-logic

修改了 `_validate_one_request` 方法中总 token 验证的边界条件，从 `>=` 改为 `>`，是本次变更的唯一文件。

# 位于 tokenizer_manager.py 的 _validate_one_request 方法中
# 变更前：>= 导致总 token 数恰好等于 context_len 时被拒绝
# 变更后：> 允许等于 context_len 的请求通过，与自动截断逻辑一致
if (
    self.validate_total_tokens
    and max_new_tokens is not None
    and (max_new_tokens + input_token_num) > _max_req_len # 原为 >=
):
    if self.server_args.allow_auto_truncate:
        logger.warning(
            f"Requested token count ({input_token_num} input + {max_new_tokens} new) "
            f"exceeds the model's context length ({self.context_len} tokens). "
            "Truncating max_new_tokens."
        )
        obj.sampling_params["max_new_tokens"] = max(
            0, _max_req_len - input_token_num
        )
    else:
        total_tokens = max_new_tokens + input_token_num
        error_msg = (
            f"Requested token count exceeds the model's maximum context length "
            f"of {self.context_len} tokens. You requested a total of {total_tokens} "
            f"tokens: {input_token_num} tokens from the input messages and "
            f"{max_new_tokens} tokens for the completion. Please reduce the number "
            f"of tokens in the input messages or the completion to fit within the limit."
        )
        raise ValueError(error_msg)

评论区精华

没有提炼出高价值讨论线程

当前评论区没有形成足够清晰的争议点或结论，后续有更多讨论时会体现在这里。

风险与影响

风险极低。仅放宽边界条件，使行为与自动截断逻辑一致，不会导致超出 context_len 的请求被误放行。由于自动截断逻辑已将 max_new_tokens 截断为 max(0, _max_req_len - input_token_num)，边界情况下 max_new_tokens 会变为 0，但 request 不会被直接拒绝。需要注意的是，若 input_token_num 恰好等于 context_len，max_new_tokens 将被截断为 0，这是合理的。

影响范围小：仅修改一处边界判断，对已有逻辑无破坏性。用户总 token 数恰好等于 context_len 的请求不再被拒绝，提升了接口的友好性。

缺少测试覆盖

关联 Issue

未识别关联 Issue

当前没有检测到明确关联的 Issue 链接，后续同步到相关引用后会出现在这里。

完整报告

执行摘要

一句话：修复边界条件，允许总 token 数等于 context_len 的请求
推荐动作：可以快速合并。变更简单安全，建议添加单元测试覆盖边界情况以确保未来重构时不会回归。

功能与动机

用户需要总 token 数（prompt_length + max_new_tokens）恰好等于 context_len 的请求被允许，与 vLLM 行为和已有自动截断逻辑一致。

实现拆解

修改文件 python/sglang/srt/managers/tokenizer_manager.py 中 _validate_one_request 方法的第 831 行，将条件判断 >= _max_req_len 改为 > _max_req_len。当 max_new_tokens + input_token_num 等于 context_len 时，不再触发拒绝逻辑，直接通过验证。

关键文件：

python/sglang/srt/managers/tokenizer_manager.py（模块 tokenizer_manager；类别 source；类型 core-logic；符号 _validate_one_request）: 修改了 _validate_one_request 方法中总 token 验证的边界条件，从 >= 改为 >，是本次变更的唯一文件。

关键符号：_validate_one_request

关键源码片段

`python/sglang/srt/managers/tokenizer_manager.py`

修改了 _validate_one_request 方法中总 token 验证的边界条件，从 >= 改为 >，是本次变更的唯一文件。

# 位于 tokenizer_manager.py 的 _validate_one_request 方法中
# 变更前：>= 导致总 token 数恰好等于 context_len 时被拒绝
# 变更后：> 允许等于 context_len 的请求通过，与自动截断逻辑一致
if (
    self.validate_total_tokens
    and max_new_tokens is not None
    and (max_new_tokens + input_token_num) > _max_req_len # 原为 >=
):
    if self.server_args.allow_auto_truncate:
        logger.warning(
            f"Requested token count ({input_token_num} input + {max_new_tokens} new) "
            f"exceeds the model's context length ({self.context_len} tokens). "
            "Truncating max_new_tokens."
        )
        obj.sampling_params["max_new_tokens"] = max(
            0, _max_req_len - input_token_num
        )
    else:
        total_tokens = max_new_tokens + input_token_num
        error_msg = (
            f"Requested token count exceeds the model's maximum context length "
            f"of {self.context_len} tokens. You requested a total of {total_tokens} "
            f"tokens: {input_token_num} tokens from the input messages and "
            f"{max_new_tokens} tokens for the completion. Please reduce the number "
            f"of tokens in the input messages or the completion to fit within the limit."
        )
        raise ValueError(error_msg)

评论区精华

无 review 评论。PR 获得两名维护者批准并合并。

暂无高价值评论线程

风险与影响

风险：风险极低。仅放宽边界条件，使行为与自动截断逻辑一致，不会导致超出 context_len 的请求被误放行。由于自动截断逻辑已将 max_new_tokens 截断为 max(0, _max_req_len - input_token_num)，边界情况下 max_new_tokens 会变为 0，但 request 不会被直接拒绝。需要注意的是，若 input_token_num 恰好等于 context_len，max_new_tokens 将被截断为 0，这是合理的。
影响：影响范围小：仅修改一处边界判断，对已有逻辑无破坏性。用户总 token 数恰好等于 context_len 的请求不再被拒绝，提升了接口的友好性。
风险标记：缺少测试覆盖

关联脉络

暂无明显关联 PR

#22546 allow requests with exactly context_len total tokens

执行摘要

修复边界条件，允许总 token 数等于 context_len 的请求

实现拆解

评论区精华

没有提炼出高价值讨论线程

风险与影响

关联 Issue

未识别关联 Issue

完整报告

执行摘要

功能与动机

实现拆解

关键源码片段

`python/sglang/srt/managers/tokenizer_manager.py`

评论区精华

风险与影响

关联脉络

参与讨论