Some errors were found in the model evaluation [Important]

#3
by Alicia-Ross - opened

I'm an LLM engineer. I reviewed your model today, but I found some errors and have some questions.

In IMOAnswerBench, Gemini 3 Pro's score is 83.3, not 82.16 (refer to Step-3.5-Flash).

In AIME 2026, Kimi K2.5's score is 92.5, not 90.62 (refer to GLM5).

In HMMT Nov. 2025, GPT-5.2 (xhigh)'s score is 97.1, not 95.83 (refer to GLM5).

In LiveCodeBench v6, DeepSeek-V3.2 should be 83.3, not 82.71; Gemini 3 Pro's score is 90.7, not 88.22; and Claude Opus 4.5's score is 84.8, not 83.70 (refer to Step-3.5-Flash).

Gaia2-Search seems to be missing results for many models, making it impossible to judge the accuracy of the reported metrics.

In τ²-Bench, the scores for DeepSeek-V3.2, Kimi K2.5, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) are 85.2, 85.4, 92.5, 90.7, and 85.5 respectively; your test results are significantly lower than StepFun's (similar conclusions can be drawn from the Step-3.5-Flash and GLM5 reports).

The score for GPT-5.2-thinking in SWE-Bench Verified is not as low as 71.8; it should be 80.0. This is a very serious error (refer to the Kimi K2.5 report).

Information on the other models' ARC-AGI-v2 scores is also scarce.

Discussion is welcome.

Thanks for your detailed review and feedback.
We really appreciate an LLM engineer taking the time to review our eval and note the differences compared to reports from other model builders.
First things first: as a model provider, it is common practice for us to set up our own environment for internal evaluation, since not all benchmarks cover the latest results for every SOTA model.
For models that haven't officially released scores on these specific boards, we used a unified evaluation protocol.
We acknowledge that the reports you mentioned from other model builders may show higher numbers; since we don't have access to their methodology, we would like to share our setup for your reference.

#IMOAnswerBench (Gemini-3-Pro):

  1. Our Self-test: 82.16 (vs. 83.3 in other reports).
  2. Setup: Zero-shot & Chain-of-Thought (CoT), Pass@1. We ran the evaluation 8 times and took the average (see the sketch after this list).
  3. Prompt: “{problem}\nPlease reason step by step, and put your final answer within \boxed{}.”
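For concreteness, the snippet below is a minimal sketch of "Pass@1 averaged over 8 runs" using the prompt above; `generate` and `grade_answer` are hypothetical placeholders, not our actual harness.

```python
# Illustrative sketch of Pass@1 averaged over k independent runs (avg@k).
# `generate` and `grade_answer` are placeholders for the sampling call and
# the boxed-answer checker; they are not part of any real pipeline here.

PROMPT_TEMPLATE = "{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

def avg_at_k(problems, generate, grade_answer, k: int = 8) -> float:
    """Average single-sample accuracy over k independent evaluation runs."""
    run_scores = []
    for _ in range(k):
        correct = 0
        for problem, reference in problems:
            response = generate(PROMPT_TEMPLATE.format(problem=problem))
            correct += int(grade_answer(response, reference))  # 1 if the boxed answer matches
        run_scores.append(correct / len(problems))
    return 100.0 * sum(run_scores) / k  # reported as a percentage
```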

#AIME 2026 (Kimi-K2.5-Thinking):

  1. Our Self-test: 90.62 (vs. 92.5 in GLM5 report).
  2. Setup: Zero-shot & CoT, Pass@1. Evaluated 64 times and averaged, with the same aggregation as above.
  3. Prompt: Same as above.

#HMMT Nov. 2025
Similarly, we reported the GPT-5.2-Thinking (high) result as 95.83, sourced from MathArena, because we could not evaluate GPT-5.2-Thinking (high) on every benchmark ourselves due to service stability issues.

#LiveCodeBench-v6
Our self-test results (DeepSeek-3.2-Thinking 82.71, Gemini-3-Pro 88.22, Claude-Opus-4.5 83.7) differ slightly because of the evaluation environment.
To minimize differences introduced by execution sandboxes, judging logic, and timeout policies, our leaderboard uses a single, consistent self-evaluation pipeline applied uniformly to all models (similar to how Kimi and Qwen report their LCB scores); a minimal sketch of such a sandboxed judging step is shown at the end of this section.
Consequently, our results generally fall within the range of scores reported by others, rather than being consistently lower. For example:

  1. DeepSeek-3.2-Thinking: Our self-test (82.71) is comparable to the official report (83.3) and higher than Qwen3-Max’s report (80.8).
  2. Gemini-3-Pro: Our self-test (88.22) sits between Qwen3-Max’s report (90.7) and Kimi-K2.5-Thinking’s report (87.4).
  3. Claude-Opus-4.5: Our self-test (83.7) sits roughly midway between Qwen3-Max's report (84.8) and Kimi-K2.5-Thinking's report (82.2).

The gaps are small and reflect protocol differences rather than model capability. To reduce confusion, we will explicitly label these results as "Self-evaluated".
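To make the sandbox-and-timeout point concrete, here is a minimal sketch of a single code-judging step with an explicit timeout. It is purely illustrative (the function and paths are made up, not our production pipeline), but it shows how a stricter timeout alone can flip a slow-but-correct solution to a failure.

```python
# Illustrative only: one test-case judging step for a generated solution.
import subprocess

def judge_solution(source_path: str, stdin_data: str, expected_stdout: str,
                   timeout_s: float = 10.0) -> bool:
    """Run a candidate Python solution on one test case and compare stdout."""
    try:
        result = subprocess.run(
            ["python", source_path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # e.g. a 6s vs. 10s policy can shift scores by itself
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()
```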

#Gaia2
We use the community-standard OpenAI function-calling format rather than the original ReAct format. The relevant evaluation configurations and methodology will be submitted to the GAIA2 GitHub repository so the community can run broader, reproducible comparisons.
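For clarity on the format difference, the snippet below shows roughly what a tool declaration looks like in the OpenAI function-calling format; the `web_search` tool and its schema are invented for illustration and are not taken from the GAIA2 repository.

```python
# ReAct exposes tools via free-text "Thought / Action / Action Input" turns
# that the harness must parse; the OpenAI function-call format instead
# declares tools as JSON schemas and receives structured tool calls back.
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",  # hypothetical tool, not a GAIA2 tool name
            "description": "Search the web and return the top results.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query."},
                },
                "required": ["query"],
            },
        },
    }
]
# The model then returns a structured call such as
#   {"name": "web_search", "arguments": "{\"query\": \"...\"}"}
# which the harness executes directly, with no free-text parsing step.
```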

#τ2-Bench
The reason our scores are lower than the StepFun/GLM5 results (e.g., Claude-Opus-4.5 at 92.5 in those reports) is the evaluation framework: the evaluation methods used for different models on τ²-bench are not uniform.

  1. Others: Step-3.5-Flash and GLM5 introduced a repair mechanism for Claude-Opus-4.5 during evaluation, which resulted in higher scores.
  2. Ours: We adopted a scheme fully aligned with Qwen-Max-Thinking and followed the official evaluation framework, without adding any modifications or repair mechanisms to the task content. Since Gemini-3-Pro has not officially released a score on this leaderboard, we used a unified evaluation protocol to ensure a fair comparison across models.
  3. In addition, the scores of the remaining models are taken directly from their official results.

#SWE-Bench Verified:
You mentioned a score of 80.0, which corresponds to GPT-5.2-Thinking (xhigh). However, the model reported in our chart is GPT-5.2-Thinking (high).
According to the official SWE-bench leaderboard, the score for the "high" version is 71.8.

#ARC-AGI-v2
We first reproduced the official scores of Gemini-3-Pro and Claude-Opus-4.5 under our unified evaluation setting. However, since a single-run metric can be unstable, we report the Pass@4 metric for reliability.
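For reference, by Pass@4 we mean the usual unbiased pass@k estimator (here with n samples per task and c of them correct); the snippet below is the standard formulation, not code lifted from our harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with n=4 samples per task, a task with c=1 correct sample gets
# pass@4 = 1.0; the benchmark score is the mean of per-task pass@4 values.
```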

#Conclusion
The gaps between our self-evaluations and some official/third-party reports are small and most likely stem from differences in evaluation protocol rather than model capability.
To avoid confusion, we will add a label marking these results as "self-evaluated," and we are happy to provide logs/configs upon request.
We hope this clarifies the differences you found.

Best regards.

Thanks for your reply.
@Felix2024

  1. We can see that all the comparison reports are based on GPT-5.2 xhigh (Chinese companies: StepFun, Zhipu, Moonshot, Alibaba (Qwen), and MiniMax).
    https://arxiv.org/pdf/2602.10604 (StepFun)
    https://huggingface.co/zai-org/GLM-5 (Zhipu)
    https://huggingface.co/moonshotai/Kimi-K2.5 (Moonshot)
    https://qwen.ai/blog?id=qwen3-max-thinking (Alibaba (Qwen))
    https://huggingface.co/MiniMaxAI/MiniMax-M2.5 (MiniMax)

I originally thought you were comparing against it as well. Why didn't you compare against xhigh, but instead actively chose high, which has slightly weaker reasoning capability? This is strange and differs from the other major Chinese model companies. What is the reason for choosing the weaker OpenAI configuration?

  2. If you chose the lower-tier GPT-5.2 high, why are your reported GPT-5.2 scores 86.3 on IMOAnswerBench and 87.70 on LiveCodeBench v6?
    From other companies' reports (all for GPT-5.2 xhigh):
    IMOAnswerBench: 86.3 (StepFun), 86.3 (Alibaba/Qwen), 86.3 (Zhipu), 86.3 (Moonshot)
    LiveCodeBench v6: 87.70 (StepFun), 87.7 (Alibaba/Qwen)
    Is this a coincidence? I don't believe it is; it's possible your configuration is incorrect.

  3. In τ²-bench, your GPT-5.2 high score is 80.9, but Qwen3-Max also reports 80.9 for GPT-5.2 xhigh (this can be verified from multiple sources; they also use xhigh).
    Is this another coincidence?

  4. You said you used the same configuration as Qwen3-Max, but are your τ² test scores (DeepSeek-V3.2, Claude Opus 4.5, Gemini 3 Pro, GPT-5.2) exactly the same as those in Qwen3-Max's blog?
    Using your framework, are the scores for these four models exactly the same as those reported by Qwen3-Max?
    Can I believe this? Did you really reproduce the scores of these four models perfectly with this framework?
    I also opened an issue about Kimi K2.5's τ²-bench score: https://huggingface.co/moonshotai/Kimi-K2.5/discussions/87

  5. You use your own results for some models and externally reported results for others, but you don't seem to provide a consistent logic for when each is used.
    If you used a unified evaluation protocol, why are some scores exactly the same as those reported by other organizations, while others come from your own testing? This is very strange.
    For example, on IMOAnswerBench, your reported scores for DeepSeek-V3.2, Claude Opus 4.5, and GPT-5.2 are identical to the results in Qwen3-Max's blog and StepFun's technical report (note: StepFun explicitly states that it used GPT-5.2 xhigh), and your reported score for Kimi K2.5 is identical to the score in StepFun's technical report.
    However, your reported score for Gemini 3 Pro is completely different from both Qwen3-Max's and StepFun's.

6. Furthermore, I've discovered a serious problem with your reply. The LiveCodeBench version mentioned in Qwen3-Max's blog is (February 25 - May 25), not the more commonly used LiveCodeBench v6 (August 2024 - May 2025). Which version are you using? Your statement may therefore be completely incorrect.
You consequently need to investigate further why some external models score lower within your framework.

7. This is the method I prefer (https://zhuanlan.zhihu.com/p/2001741987360023159), and I recommend you use it; a small code sketch of the rule follows this item.
    "If another model's official score is higher than our own retest result, we use the official score; otherwise, we use our own (higher) retest result."
Of course, you may have other considerations.
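In code, the rule I'm suggesting is essentially a max over the two numbers; this is only a sketch of the policy, assuming you have both an official score (possibly missing) and your own retest.

```python
def reported_score(official: float | None, self_test: float) -> float:
    """Report whichever of the official score and our own retest is higher;
    fall back to the retest when no official score exists."""
    if official is None:
        return self_test
    return max(official, self_test)
```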
Discussion is welcome.

8. I will not discuss the other leaderboards with smaller score differences (within 3 points) here.

Thanks so much

@Felix2024
1. From the latest version at https://huggingface.co/Qwen/Qwen3.5-397B-A17B, it's clear that your model evaluation has significant problems.
On IMOAnswerBench, AIME26, and LiveCodeBench v6, your GPT-5.2 Thinking (high) results are consistent with Qwen3's reported scores.
However, on HMMT (Nov. 25), your GPT-5.2 Thinking (high) score is 95.83, while Qwen3's is 100.
On τ²-bench, your GPT-5.2 Thinking (high) score is 80.90, while Qwen3's is 87.1.
On SWE-bench Verified, your GPT-5.2 Thinking (high) score is 71.80, while Qwen3's is 80. (Your model had the lowest score out of the six models in total.)
Your evaluation produces a very low GPT-5.2 score.
2. From the latest version at https://huggingface.co/Qwen/Qwen3.5-397B-A17B:
On IMOAnswerBench, Gemini 3 Pro should be 83.3 (not 82.16).
On AIME26, Kimi K2.5-Thinking should be 93.3 (not 90.62), and Gemini 3 Pro should be 90.6 (not 90.0; not much difference).
On HMMT Nov. 25, Claude Opus 4.5 should be 93.3 (not 91.04).
On LiveCodeBench v6, Gemini 3 Pro should be 90.7 (not 88.22), and Claude Opus 4.5 should be 84.8 (not 83.70).
On τ²-bench, Kimi K2.5-Thinking should be 77.0 (not 72.57), and Claude Opus 4.5 should be 91.6 (not 85.70). (Your model doesn't have any advantage on this leaderboard.)
I hope you can investigate why external models consistently score lower in your comparisons, because this result may lead others to question your model's own scores.
