Browsecomp Reproducibility | 结果复现

#6
by pandemo - opened

Hi MiniMax team, happy holidays 🎄❄️🎆, and thank you so much for open-sourcing such impressive models and sharing your research!

I have a question regarding the reproducibility of the BrowseComp benchmarks, specifically the BrowseComp (with Context Management) results.

From the README, you mention:

“When token usage exceeds 30% of the maximum context window, we retain the first AI response, the last five AI responses, and the tool outputs, discarding the remaining content.”

And from the tool_calling_guide, you define the following markers:

]~!b[]~b]system: System message start marker
[e~[: Message end marker
]~b]user: User message start marker
]~b]ai: Assistant message start marker
]~b]tool: Tool result message start marker
<tools>...</tools>: Tool definition area (each tool wrapped with <tool>, content is JSON Schema)
<minimax:tool_call>...</minimax:tool_call>: Tool call area
<think>...</think>: Thinking process marker during generation

Given this, I wanted to confirm whether my understanding of the context management behavior is correct:

When the context threshold is exceeded, the system retains:

  • the first ]~b]ai message
  • the last five ]~b]ai messages
  • all ]~b]tool messages
    …while discarding everything else in between.

I also assume that the initial ]~!b[]~b]system message and the original ]~b]user query are retained as well, but I wanted to check whether that assumption is correct.

Further clarification, and if there is any chance you could open-source (or provide a small standalone snippet for) the context-management component used for BrowseComp reproduction, that would be deeply appreciated so the open-source and research community can better reproduce these impressive BrowseComp results.🙏

Thanks again for your amazing work and for engaging so openly with the community. Looking forward to learning more and happy holidays once again!⛄


嗨 MiniMax 团队,节日快乐 🎄❄️🎆,也非常感谢你们开源了如此出色的模型并分享研究成果!

我有一个关于 BrowseComp 基准可复现性的问题,具体是 BrowseComp (with Context Management) 的结果。

在 README 中,你们提到:

“当 token 使用量超过最大 context window 的 30% 时,我们会保留第一条 AI 回复、最后五条 AI 回复以及 tool outputs,并丢弃其余内容。”

另外,在 tool_calling_guide 中,你们定义了以下标记(markers):

]~!b[]~b]system:System message 起始标记
[e~[:Message 结束标记
]~b]user:User message 起始标记
]~b]ai:Assistant message 起始标记
]~b]tool:Tool result message 起始标记
<tools>...</tools>:Tool definition 区域(每个 tool 用 <tool> 包裹,内容为 JSON Schema)
<minimax:tool_call>...</minimax:tool_call>:Tool call 区域
<think>...</think>:生成过程中的 Thinking process 标记

基于以上信息,我想确认一下我对 context management 行为的理解是否正确:

当超过 context threshold 时,系统会保留:

  • 第一条 ]~b]ai 消息
  • 最后五条 ]~b]ai 消息
  • 所有 ]~b]tool 消息
    ……并丢弃中间的其他所有内容。

我也假设最初的 ]~!b[]~b]system 消息以及最原始的 ]~b]user query 也会被保留,但我想确认一下这个假设是否正确。

另外如果可以进一步澄清,并且如果你们有可能开源(或提供一个小的独立 snippet)用于 BrowseComp 复现的 context-management 组件,那将非常非常感激,这也能帮助开源与研究社区更好地复现这些令人印象深刻的 BrowseComp 结果。🙏

再次感谢你们的杰出工作,以及如此开放地与社区交流。期待了解更多内容,也再次祝节日快乐!⛄

Sign up or log in to comment