FINAL_Bench

FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction

We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy; none measures whether a model knows when it is wrong.
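
For intuition, a detect-and-correct probe can be structured as a short multi-turn exchange: answer, self-check, revise. The sketch below only illustrates that idea and is not the FINAL Bench protocol; `query_model` is a hypothetical stand-in for any chat-completion client, and the prompts are invented.

```python
# Illustrative detect-and-correct probe (NOT the FINAL Bench protocol).
# query_model is a hypothetical wrapper around any chat-completion API.

def query_model(messages: list[dict]) -> str:
    """Hypothetical LLM call; replace with a real client."""
    raise NotImplementedError

def detect_and_correct_probe(task_prompt: str) -> dict:
    history = [{"role": "user", "content": task_prompt}]
    answer = query_model(history)                 # turn 1: initial answer

    history += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Check your answer for errors. "
                                    "Say whether you might be wrong, and why."},
    ]
    self_check = query_model(history)             # turn 2: error detection

    history += [
        {"role": "assistant", "content": self_check},
        {"role": "user", "content": "If you found a problem, give a corrected final answer."},
    ]
    revision = query_model(history)               # turn 3: error recovery

    return {"answer": answer, "self_check": self_check, "revision": revision}
```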

Dataset: FINAL-Bench/Metacognitive | 100 Tasks | 15 Domains | 8 TICOS Types | Apache 2.0

Leaderboard: FINAL-Bench/Leaderboard

Article: https://huggingface.co/blog/FINAL-Bench/metacognitive

Core Innovation

Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly onto the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.
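
As a rough illustration of that separation, the sketch below scores the two axes independently on records like the ones produced by the probe above. `flags_uncertainty` and `is_correct` are hypothetical graders (a regex, an exact-match check, or an LLM judge); the actual 5-axis rubric is defined in the paper.

```python
# Illustrative MA / ER split, assuming records with "answer", "self_check", "revision".

def score_ma(record: dict, answer_was_wrong: bool, flags_uncertainty) -> float:
    """Metacognitive Accuracy: did the self-check flag a problem exactly when one existed?"""
    flagged = flags_uncertainty(record["self_check"])
    return 1.0 if flagged == answer_was_wrong else 0.0

def score_er(record: dict, is_correct) -> float:
    """Error Recovery: given a wrong initial answer, is the revised answer correct?"""
    if is_correct(record["answer"]):
        return 1.0          # nothing to recover from
    return 1.0 if is_correct(record["revision"]) else 0.0
```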

Three Findings Across 9 SOTA Models

We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:

1. ER Dominance. 94.8% of the MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.

2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct, which is the most dangerous profile from an AI safety perspective.

3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001), as illustrated in the sketch below.
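
The reported numbers come from the paper; the toy snippet below only shows the style of the difficulty analysis on synthetic data. Correlating baseline accuracy (as a proxy for task easiness) with per-task metacognitive gain is an assumed pairing, not the benchmark's actual methodology.

```python
# Synthetic illustration of the difficulty effect; numbers will not match the paper.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
difficulty = rng.uniform(0, 1, size=100)                               # stand-in difficulty scores
baseline   = np.clip(1 - difficulty + rng.normal(0, 0.1, 100), 0, 1)   # accuracy without self-check
metacog    = np.clip(baseline + 0.3 * difficulty, 0, 1)                # accuracy with self-check

gain = metacog - baseline                    # per-task metacognitive gain
r, p = pearsonr(1 - difficulty, gain)        # easiness vs. gain: expect a negative r
print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```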

from datasets import load_dataset

# Load all 100 FINAL Bench tasks (single "train" split)
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
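
Before writing any scoring code, it is worth inspecting the schema, since the actual field names come from the dataset card rather than from this post:

```python
# Peek at the columns and one example record.
print(dataset.column_names)
print(dataset[0])
```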


Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs

FINAL Bench is the first tool to distinguish what an AI truly knows from what it merely pretends to know.