
Context Management Reproducibility?

#27
by pandemo - opened

Hi StepFun team, thank you so much for open-sourcing such impressive models and sharing your research!

Just a quick question on the discard-all strategy used for BrowseComp:

When the context length exceeds the threshold and the agent "discards its entire context" (quoting the Step 3.5 Flash paper), does that mean everything accumulated so far (tool calls, reasoning, observations, etc.) is removed, leaving only the system prompt and the initial user message/question?
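To make sure I'm reading the paper correctly, here is a minimal sketch of the behavior I have in mind. The function name, message layout, and threshold handling are all my assumptions, not from the paper:

```python
# Hypothetical sketch of the "discard-all" strategy as I understand it:
# once the accumulated context exceeds a token threshold, everything is
# dropped except the system prompt and the initial user question.

def apply_discard_all(messages, token_count, threshold):
    """Return the context kept for the agent's next step.

    messages: list of dicts like {"role": ..., "content": ...}, where
    messages[0] is assumed to be the system prompt and messages[1] the
    initial user question (an assumption on my part).
    """
    if token_count <= threshold:
        return messages  # under budget: keep the full history
    # Over budget: discard all tool calls, reasoning, and observations,
    # keeping only the system prompt and the initial user message.
    return messages[:2]
```

Is this roughly what happens, or does the agent also keep some summary or scratchpad across the reset?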

Also, is the agent framework used for BrowseComp/HLE evaluation the same as (or similar to) the one in Step-DeepResearch?

Thanks again for your amazing work! 🙏


