Browsecomp Reproducibility | 结果复现
Hi MiniMax team, happy holidays 🎄❄️🎆, and thank you so much for open-sourcing such impressive models and sharing your research!
I have a question regarding the reproducibility of the BrowseComp benchmarks, specifically the BrowseComp (with Context Management) results.
From the README, you mention:
“When token usage exceeds 30% of the maximum context window, we retain the first AI response, the last five AI responses, and the tool outputs, discarding the remaining content.”
And from the tool_calling_guide, you define the following markers:
]~!b[]~b]system: System message start marker[e~[: Message end marker]~b]user: User message start marker]~b]ai: Assistant message start marker]~b]tool: Tool result message start marker<tools>...</tools>: Tool definition area (each tool wrapped with<tool>, content is JSON Schema)<minimax:tool_call>...</minimax:tool_call>: Tool call area<think>...</think>: Thinking process marker during generation
Given this, I wanted to confirm whether my understanding of the context management behavior is correct:
When the context threshold is exceeded, the system retains:
- the first
]~b]aimessage - the last five
]~b]aimessages - all
]~b]toolmessages
…while discarding everything else in between.
I also assume that the initial ]~!b[]~b]system message and the original ]~b]user query are retained as well, but I wanted to check whether that assumption is correct.
Further clarification, and if there is any chance you could open-source (or provide a small standalone snippet for) the context-management component used for BrowseComp reproduction, that would be deeply appreciated so the open-source and research community can better reproduce these impressive BrowseComp results.🙏
Thanks again for your amazing work and for engaging so openly with the community. Looking forward to learning more and happy holidays once again!⛄
嗨 MiniMax 团队,节日快乐 🎄❄️🎆,也非常感谢你们开源了如此出色的模型并分享研究成果!
我有一个关于 BrowseComp 基准可复现性的问题,具体是 BrowseComp (with Context Management) 的结果。
在 README 中,你们提到:
“当 token 使用量超过最大 context window 的 30% 时,我们会保留第一条 AI 回复、最后五条 AI 回复以及 tool outputs,并丢弃其余内容。”
另外,在 tool_calling_guide 中,你们定义了以下标记(markers):
]~!b[]~b]system:System message 起始标记[e~[:Message 结束标记]~b]user:User message 起始标记]~b]ai:Assistant message 起始标记]~b]tool:Tool result message 起始标记<tools>...</tools>:Tool definition 区域(每个 tool 用<tool>包裹,内容为 JSON Schema)<minimax:tool_call>...</minimax:tool_call>:Tool call 区域<think>...</think>:生成过程中的 Thinking process 标记
基于以上信息,我想确认一下我对 context management 行为的理解是否正确:
当超过 context threshold 时,系统会保留:
- 第一条
]~b]ai消息 - 最后五条
]~b]ai消息 - 所有
]~b]tool消息
……并丢弃中间的其他所有内容。
我也假设最初的 ]~!b[]~b]system 消息以及最原始的 ]~b]user query 也会被保留,但我想确认一下这个假设是否正确。
另外如果可以进一步澄清,并且如果你们有可能开源(或提供一个小的独立 snippet)用于 BrowseComp 复现的 context-management 组件,那将非常非常感激,这也能帮助开源与研究社区更好地复现这些令人印象深刻的 BrowseComp 结果。🙏
再次感谢你们的杰出工作,以及如此开放地与社区交流。期待了解更多内容,也再次祝节日快乐!⛄