BrowseComp Reproducibility

by pandemo - opened Dec 26, 2025

Hi MiniMax team, happy holidays 🎄❄️🎆, and thank you so much for open-sourcing such impressive models and sharing your research!

I have a question regarding the reproducibility of the BrowseComp benchmarks, specifically the BrowseComp (with Context Management) results.

From the README, you mention:

“When token usage exceeds 30% of the maximum context window, we retain the first AI response, the last five AI responses, and the tool outputs, discarding the remaining content.”

And from the tool_calling_guide, you define the following markers:

]~!b[]~b]system: System message start marker
[e~[: Message end marker
]~b]user: User message start marker
]~b]ai: Assistant message start marker
]~b]tool: Tool result message start marker
<tools>...</tools>: Tool definition area (each tool wrapped with <tool>, content is JSON Schema)
<minimax:tool_call>...</minimax:tool_call>: Tool call area
<think>...</think>: Thinking process marker during generation
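
For reference, this is roughly how I split a rendered prompt back into messages using those markers. It is only a sketch: `split_messages` is my own helper name, and the exact whitespace between a marker and the message body is an assumption on my side, so I strip it.

```python
import re

# Matches "]~b]<role> ... [e~[" using the markers listed above.
# The leading "]~!b[" of the system marker is simply skipped by the search.
MESSAGE_RE = re.compile(r"\]~b\](system|user|ai|tool)(.*?)\[e~\[", re.DOTALL)


def split_messages(raw: str) -> list[dict]:
    """Split a rendered prompt into {"role", "content"} dicts (hypothetical helper)."""
    return [
        {"role": m.group(1), "content": m.group(2).strip()}
        for m in MESSAGE_RE.finditer(raw)
    ]
```

This does not handle a message body that itself contains one of the marker strings.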

Given this, I wanted to confirm whether my understanding of the context management behavior is correct:

When the context threshold is exceeded, the system retains:

the first ]~b]ai message
the last five ]~b]ai messages
all ]~b]tool messages
…while discarding everything else in between.

I also assume that the initial ]~!b[]~b]system message and the original ]~b]user query are retained as well, but I wanted to check whether that assumption is correct.
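
To make my reading concrete, here is a minimal sketch of the rule as I currently understand it. This is my interpretation, not your implementation: `prune_context` and `count_tokens` are hypothetical names, the token counter is left to the caller, and keeping the system message and the first user message is the assumption I am asking about.

```python
from typing import Callable


def prune_context(
    messages: list[dict],
    count_tokens: Callable[[list[dict]], int],  # hypothetical: caller supplies a tokenizer-based counter
    max_context: int,
    threshold: float = 0.30,  # "exceeds 30% of the maximum context window"
    keep_last_ai: int = 5,    # "the last five AI responses"
) -> list[dict]:
    """Prune a list of {"role", "content"} messages once the threshold is exceeded."""
    if count_tokens(messages) <= threshold * max_context:
        return messages  # below the threshold: nothing is discarded

    ai_idx = [i for i, m in enumerate(messages) if m["role"] == "ai"]
    last_ai = ai_idx[-keep_last_ai:] if keep_last_ai > 0 else []
    keep_ai = set(ai_idx[:1]) | set(last_ai)  # first AI response + last N

    # Assumption: only the original (first) user query is retained.
    user_idx = [i for i, m in enumerate(messages) if m["role"] == "user"]
    keep_user = set(user_idx[:1])

    return [
        m
        for i, m in enumerate(messages)
        if m["role"] in ("system", "tool") or i in keep_ai or i in keep_user
    ]
```

If the real component differs from this (for example in what happens to tool outputs whose AI turn was discarded), that is exactly the detail I would like to get right.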

Any further clarification would be very welcome. If there is any chance you could open-source the context-management component used for BrowseComp reproduction (or provide a small standalone snippet for it), that would be deeply appreciated, so that the open-source and research community can better reproduce these impressive BrowseComp results. 🙏

Thanks again for your amazing work and for engaging so openly with the community. Looking forward to learning more and happy holidays once again!⛄
