FINAL Bench: The Real Bottleneck to AGI Is Self-Correction
The First Functional Metacognition Benchmark — 9 SOTA Models, 100 Tasks, 3 Devastating Findings
Authors: Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang
Dataset: FINAL-Bench/Metacognitive | 100 Tasks | 15 Domains | Apache 2.0
Why We Built This Benchmark
MMLU has crossed 90%. GPQA is saturating. HumanEval has hit its ceiling. Yet all of these benchmarks share the same blind spot: none of them measures whether an AI can recognize its own errors and correct them.
Cognitive psychology calls this ability metacognition. It is the real dividing line between human experts and novices, and a prerequisite for AGI. When a physician catches a misdiagnosis and changes the treatment plan, when a scientist revises a hypothesis after unexpected results — the essence of these actions is metacognition.
FINAL Bench (Frontier Intelligence Nexus for AGI-Level Verification) is the first benchmark to systematically measure this capability.
What Makes FINAL Bench Different
Where existing benchmarks ask "Did you get the right answer?", FINAL Bench asks "What did you do when you got it wrong?"
Three key differentiators:
First, we separate "saying" from "fixing." Within our five-axis rubric, MA (Metacognitive Accuracy) measures the declarative ability to say "I might be wrong," while ER (Error Recovery) measures the procedural ability to actually detect and correct errors. This separation maps directly onto the monitoring-control model of Nelson & Narens (1990) in cognitive psychology and is a first in AI benchmarking.
Second, every task embeds a hidden cognitive trap. Confirmation bias, anchoring, base-rate neglect — cognitive biases established in psychology are deliberately planted across all 100 tasks. We observe the model's process of "falling into and climbing out of" these traps. Tasks span 15 domains (mathematics, medicine, ethics, philosophy, economics, and more), 8 TICOS metacognitive types, and 3 difficulty grades.
Third, we isolate causal effects through two conditions. Baseline is a single API call with no self-correction prompting. MetaCog applies a three-phase self-correction scaffold (initial reasoning, critical self-review, corrective revision). This follows the same logic as placebo-controlled clinical trials.
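To make the protocol concrete, here is a minimal sketch of the two conditions. The actual FINAL Bench prompts are not reproduced in this post, so the prompt wording and the `call_model` helper are illustrative assumptions standing in for any chat-completion API.

```python
# Minimal sketch of the Baseline and MetaCog conditions. `call_model` is a
# placeholder for any chat-completion API; the prompt wording is illustrative,
# not the benchmark's actual scaffold text.

def call_model(prompt: str) -> str:
    """Placeholder for a single LLM API call."""
    raise NotImplementedError

def run_baseline(task_prompt: str) -> str:
    # Baseline: one call, no self-correction prompting.
    return call_model(task_prompt)

def run_metacog(task_prompt: str) -> str:
    # Phase 1: initial reasoning.
    initial = call_model(task_prompt)
    # Phase 2: critical self-review of the initial answer.
    review = call_model(
        f"{task_prompt}\n\nYour previous answer:\n{initial}\n\n"
        "Critically review this answer: list possible errors, hidden biases, "
        "and unstated assumptions."
    )
    # Phase 3: corrective revision informed by the self-review.
    return call_model(
        f"{task_prompt}\n\nPrevious answer:\n{initial}\n\nSelf-review:\n{review}\n\n"
        "Produce a revised final answer that corrects any errors identified above."
    )
```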
Five-Axis Evaluation Rubric
| Axis | Symbol | Weight | Measurement Target |
|---|---|---|---|
| Process Quality | PQ | 15% | Structured reasoning quality |
| Metacognitive Accuracy | MA | 20% | Confidence calibration, limit awareness (Declarative) |
| Error Recovery | ER | 25% | Error detection & correction (Procedural) |
| Integration Depth | ID | 20% | Multi-perspective integration |
| Final Correctness | FC | 20% | Final answer accuracy |
Scoring uses a tri-model LLM-as-Judge ensemble (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro). Cross-validation with human annotators on a 20-task subset confirmed Cohen's kappa = 0.87.
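To see how the weights combine into a single score, here is a small aggregation sketch. It assumes each judge returns the five axis scores on a 0-1 scale and that per-axis scores are averaged across the three judges before weighting; the paper's exact aggregation may differ.

```python
# Rubric aggregation sketch: average each axis across the judge ensemble,
# then apply the published weights and scale to 0-100.

WEIGHTS = {"PQ": 0.15, "MA": 0.20, "ER": 0.25, "ID": 0.20, "FC": 0.20}

def final_score(judge_scores: list[dict[str, float]]) -> float:
    """judge_scores: one dict of axis scores (0-1) per judge in the ensemble."""
    axis_means = {
        axis: sum(judge[axis] for judge in judge_scores) / len(judge_scores)
        for axis in WEIGHTS
    }
    return 100 * sum(WEIGHTS[axis] * axis_means[axis] for axis in WEIGHTS)

# Example: three judges scoring one task response (illustrative numbers).
judges = [
    {"PQ": 0.80, "MA": 0.70, "ER": 0.30, "ID": 0.65, "FC": 0.75},
    {"PQ": 0.85, "MA": 0.68, "ER": 0.28, "ID": 0.60, "FC": 0.70},
    {"PQ": 0.78, "MA": 0.72, "ER": 0.35, "ID": 0.62, "FC": 0.72},
]
print(round(final_score(judges), 2))  # ~60.83
```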
Results: Three Principal Findings
We evaluated 9 SOTA models across 100 expert-level tasks: GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and four others.
Finding 1. ER Dominance — Self-Correction Is Everything
The MetaCog scaffold produces a mean gain of +14.05 points, and 94.8% of this gain comes from Error Recovery alone.
| Rubric | Contribution | Interpretation |
|---|---|---|
| Error Recovery | 94.8% | Nearly all of the self-correction effect |
| Metacognitive Accuracy | 5.0% | "Saying" ability barely changes |
| Remaining 3 axes | 0.2% | Negligible |
The remaining four axes contribute a combined 5.2%. The bottleneck to AGI-level performance is not knowledge, not reasoning — it is self-correction alone.
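One way to compute such an attribution is sketched below, under the assumption that each axis's contribution is its weighted score change divided by the total weighted change; the paper's exact attribution method may differ.

```python
# Per-axis gain decomposition sketch: share_i = weight_i * delta_i / total
# weighted delta. MA/ER values are the published means; the other three axes
# are held flat purely for illustration.

WEIGHTS = {"PQ": 0.15, "MA": 0.20, "ER": 0.25, "ID": 0.20, "FC": 0.20}

def gain_contributions(baseline: dict[str, float], metacog: dict[str, float]) -> dict[str, float]:
    weighted_deltas = {a: WEIGHTS[a] * (metacog[a] - baseline[a]) for a in WEIGHTS}
    total = sum(weighted_deltas.values())
    return {a: d / total for a, d in weighted_deltas.items()}

baseline = {"PQ": 0.70, "MA": 0.694, "ER": 0.302, "ID": 0.70, "FC": 0.70}
metacog  = {"PQ": 0.70, "MA": 0.729, "ER": 0.835, "ID": 0.70, "FC": 0.70}
for axis, share in gain_contributions(baseline, metacog).items():
    print(f"{axis}: {share:.1%}")  # ER ~95%, MA ~5%, other axes ~0% here
```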
Finding 2. Declarative-Procedural Gap — They Can Say It, But Can't Do It
All 9 models, without exception, satisfy MA > ER at Baseline.
Baseline mean: MA = 0.694 vs. ER = 0.302 (Gap = 0.392)
The ability to verbalize uncertainty is substantial; the ability to actually correct errors is less than half as strong. When the MetaCog scaffold is applied, ER jumps by +0.533 while MA moves only +0.035 under the same conditions, a 15x differential. The two capabilities are effectively independent.
This is the first large-scale quantitative evidence of the declarative-procedural dissociation in AI — a 40-year-old open question in cognitive psychology. The phenomenon mirrors a well-documented clinical pattern: physicians who say "this diagnosis is uncertain" yet do not change the treatment plan. Today, all 9 of the world's leading AI models exhibit this exact behavior.
Finding 3. Difficulty Effect — Harder Problems Benefit Most
Baseline score and MetaCog gain are strongly anticorrelated: Pearson r = -0.777 (p < 0.001). The harder the task, the greater the value of metacognition.
Claude Opus 4.6, which recorded the lowest Baseline, showed the largest gain (Delta = +20.13). Kimi K2.5, with the highest Baseline, showed the smallest gain (Delta = +9.83). Models with higher intrinsic self-correction benefit less from external scaffolding.
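A minimal reproduction sketch of this check using scipy follows; the paired arrays of Baseline and MetaCog scores come from the full evaluation results, which this post only excerpts.

```python
# Difficulty-effect sketch: correlate Baseline performance with the gain the
# MetaCog scaffold provides. Pass the full score arrays from the evaluation
# results (only excerpted in this post).

from scipy.stats import pearsonr

def difficulty_effect(baseline_scores, metacog_scores):
    """Pearson correlation between Baseline score and MetaCog gain."""
    gains = [m - b for b, m in zip(baseline_scores, metacog_scores)]
    r, p = pearsonr(baseline_scores, gains)
    return r, p

# With the complete results this should reproduce r = -0.777, p < 0.001.
```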
Model Leaderboard
Baseline (No Self-Correction)
| Rank | Model | FINAL Score | MA | ER | MA-ER Gap |
|---|---|---|---|---|---|
| 1 | Kimi K2.5 | 68.71 | 0.775 | 0.450 | 0.325 |
| 2 | GPT-5.2 | 62.76 | 0.750 | 0.336 | 0.414 |
| 3 | GLM-5 | 62.50 | 0.750 | 0.284 | 0.466 |
| 9 | Claude Opus 4.6 | 56.04 | 0.708 | 0.267 | 0.442 |
| | Mean | 61.12 | 0.694 | 0.302 | 0.392 |
MetaCog (Self-Correction Applied)
| Rank | Model | FINAL Score | ER | Delta |
|---|---|---|---|---|
| 1 | Kimi K2.5 | 78.54 | 0.908 | +9.83 |
| 2 | Gemini 3 Pro | 77.08 | 0.875 | +17.58 |
| 5 | Claude Opus 4.6 | 76.17 | 0.867 | +20.13 |
| | Mean | 75.17 | 0.835 | +14.05 |
Notable: Claude Opus 4.6 rises from Baseline rank 9 to MetaCog rank 5 — the highest scaffold receptivity of any model. Kimi K2.5 maintains rank 1 in both conditions but shows the smallest gain, consistent with its already-elevated intrinsic ER (0.450).
A Warning for AI Safety
Models with high MA and low ER represent the most dangerous safety profile. They sound humble — "I'm not entirely confident about this" — while failing to self-correct. Users perceive reliability where none exists.
All 9 SOTA models currently match this profile. The MA-ER Gap is the first metric to quantitatively identify this risk.
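As an illustration of how the MA-ER Gap could be turned into an automated screen, here is a small sketch. The threshold values are arbitrary assumptions, not part of FINAL Bench; the MA/ER pairs are taken from the Baseline leaderboard above.

```python
# Screen for the "sounds humble, doesn't self-correct" profile. Thresholds
# are illustrative assumptions, not part of the benchmark; MA/ER values are
# from the Baseline leaderboard above.

def flag_risky_profile(ma: float, er: float,
                       ma_floor: float = 0.6, gap_threshold: float = 0.3) -> bool:
    """Flag models that verbalize uncertainty (high MA) but rarely correct errors (low ER)."""
    return ma >= ma_floor and (ma - er) >= gap_threshold

models = {
    "Kimi K2.5":       (0.775, 0.450),
    "GPT-5.2":         (0.750, 0.336),
    "Claude Opus 4.6": (0.708, 0.267),
}
for name, (ma, er) in models.items():
    status = "at risk" if flag_risky_profile(ma, er) else "ok"
    print(f"{name}: {status} (gap = {ma - er:.3f})")
```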
Using the Dataset
```python
from datasets import load_dataset

dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
print(f"Total tasks: {len(dataset)}")  # 100

task = dataset[0]
print(task["title"])          # Task title
print(task["prompt"][:200])   # Prompt given to the model
print(task["hidden_trap"])    # Embedded cognitive trap
```
Composition: 100 tasks / 15 domains / 8 TICOS types / 3 grades / 12 fields per task
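Beyond inspecting individual tasks, the split can be profiled directly. The sketch below counts tasks per cognitive trap using only the `hidden_trap` field shown above; trap labels are whatever strings the dataset stores in that field.

```python
# Count how often each cognitive trap appears across the 100 tasks.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
trap_counts = Counter(task["hidden_trap"] for task in dataset)
for trap, count in trap_counts.most_common():
    print(f"{trap}: {count}")
```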
Paper and Future Work
FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models
Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang
Currently under review at a leading international AI venue.
Future directions include L2 layer measurement via open-source model logit entropy analysis, expanded multi-judge and human cross-validation, and periodic task pool refresh to prevent contamination.
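As a pointer to what the planned logit-entropy measurement might look like, here is a generic token-entropy sketch for open-source models. It is illustrative only and not part of the released benchmark.

```python
# Token-level predictive entropy from a model's output logits, a candidate
# signal for the planned L2-layer (open-source logit) measurement.

import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size). Returns per-token entropy in nats."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

# High mean entropy over the answer span indicates low model confidence,
# which could then be compared against the MA and ER axes.
```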
AGI without metacognition is like driving with your eyes closed. FINAL Bench is the first tool that can distinguish what an AI truly knows from what it merely pretends to know.