# ToGMAL Improvement Plan: Adaptive Scoring & Evaluation Framework

## Executive Summary

This plan addresses two critical gaps in togmal's current implementation:

1. **Naive weighted averaging fails when retrieved questions have low similarity** to the prompt
2. **Lack of a rigorous evaluation methodology** to measure OOD detection performance

---

## Problem 1: Low-Similarity Scoring Issues

### Current Limitation

Your system uses a simple weighted average of difficulty scores from the k-nearest neighbors, which produces unreliable risk assessments when:

- Maximum similarity < 0.6 (semantically distant matches)
- Retrieved questions span multiple unrelated domains
- The query is truly novel / out-of-distribution

**Example:** "Prove the universe is 10,000 years old" was matched to factual-recall questions about the age of the Earth (similarity ~0.57), resulting in a LOW risk rating despite fitting the "prove a false premise" pattern.
### Solution: Adaptive Uncertainty-Aware Scoring

#### 1. Similarity-Based Confidence Adjustment

Implement a **confidence decay function** that increases risk when similarity is low:

```python
import numpy as np

def compute_adaptive_risk(similarities, difficulties, k=5):
    """
    Adjust the risk score based on retrieval confidence.

    Returns (adjusted_score, risk_level) so downstream code can use either
    the continuous score or the categorical label.
    """
    # Base weighted score
    weights = np.array(similarities) / sum(similarities)
    base_score = np.dot(weights, difficulties)

    # Confidence metrics
    max_sim = max(similarities)
    avg_sim = np.mean(similarities)
    sim_variance = np.var(similarities)

    # Uncertainty penalty - increase risk when:
    #   - maximum similarity is low (< 0.7)
    #   - similarities have high variance (diverse matches)
    #   - average similarity is low
    uncertainty_penalty = 0.0

    # Low maximum similarity threshold
    if max_sim < 0.7:
        uncertainty_penalty += (0.7 - max_sim) * 0.5

    # High variance (retrieved questions are dissimilar to each other)
    if sim_variance > 0.05:
        uncertainty_penalty += min(sim_variance * 2, 0.3)

    # Low average similarity
    if avg_sim < 0.5:
        uncertainty_penalty += (0.5 - avg_sim) * 0.4

    # Adjusted score (higher = more risky)
    adjusted_score = float(np.clip(base_score + uncertainty_penalty, 0.0, 1.0))

    # Map to risk levels
    if adjusted_score < 0.2:
        level = "MINIMAL"
    elif adjusted_score < 0.4:
        level = "LOW"
    elif adjusted_score < 0.6:
        level = "MODERATE"
    elif adjusted_score < 0.8:
        level = "HIGH"
    else:
        level = "CRITICAL"

    return adjusted_score, level
```
**Key Insight:** Research shows that cosine similarity thresholds vary by domain and task. Values of 0.7-0.8 are commonly recommended starting points for "relevant" matches; below 0.6, matches become increasingly unreliable.

#### 2. Multi-Signal Fusion

Combine multiple indicators beyond k-NN similarity alone:

```python
def compute_risk_with_fusion(prompt, knn_results, heuristics):
    """
    Fuse vector similarity with rule-based heuristics.
    """
    # Vector-based score (continuous 0-1, from the adaptive scoring above)
    vector_score, _ = compute_adaptive_risk(
        knn_results['similarities'],
        knn_results['difficulties']
    )

    # Rule-based heuristics (existing togmal patterns)
    heuristic_score = heuristics.evaluate(prompt)

    # Domain classifier (is this math/physics/medical?)
    domain_confidence = classify_domain(prompt)

    # Combine scores with learned weights
    final_score = (
        0.4 * vector_score +
        0.4 * heuristic_score +
        0.2 * domain_uncertainty(domain_confidence)
    )
    return final_score
```
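The fusion sketch above calls `heuristics.evaluate`, `classify_domain`, and `domain_uncertainty`, none of which exist yet. Below is one possible shape for the two domain helpers, assuming a crude keyword-based classifier; the keyword lists and the 0-1 uncertainty mapping are illustrative assumptions, not part of the current codebase.

```python
import re

# Hypothetical keyword lists; a real classifier could be embedding-based instead.
DOMAIN_KEYWORDS = {
    'math': ['prove', 'theorem', 'integral', 'equation'],
    'physics': ['quantum', 'thermodynamics', 'particle', 'relativity'],
    'medical': ['diagnosis', 'symptom', 'dosage', 'patient'],
}

def classify_domain(prompt: str) -> dict:
    """Return a crude per-domain confidence in [0, 1] based on keyword hits."""
    text = prompt.lower()
    hits = {
        domain: sum(bool(re.search(rf"\b{kw}\b", text)) for kw in kws)
        for domain, kws in DOMAIN_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {}  # no recognizable domain signal
    return {domain: count / total for domain, count in hits.items() if count > 0}

def domain_uncertainty(domain_confidence: dict) -> float:
    """Map domain confidence to a risk contribution in [0, 1]; ambiguous or unknown -> higher."""
    if not domain_confidence:
        return 1.0  # no domain signal at all: treat as maximally uncertain
    return 1.0 - max(domain_confidence.values())
```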
#### 3. Threshold Calibration per Domain

Different domains need different thresholds. Implement **domain-specific calibration**:

```python
# Learned from validation data
DOMAIN_THRESHOLDS = {
    'math':    {'low': 0.65, 'moderate': 0.75, 'high': 0.85},
    'physics': {'low': 0.60, 'moderate': 0.70, 'high': 0.80},
    'medical': {'low': 0.70, 'moderate': 0.80, 'high': 0.90},
    'general': {'low': 0.60, 'moderate': 0.70, 'high': 0.80}
}

def get_calibrated_threshold(domain, risk_level):
    return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS['general'])[risk_level]
```
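Rather than hand-setting `DOMAIN_THRESHOLDS`, the values can be fit on validation data. A minimal sketch, assuming a labeled validation DataFrame with hypothetical `domain`, `risk_score`, and binary `is_hard` columns, that picks the per-domain `'high'` cutoff achieving a target recall on hard questions (the lower bands could be fit the same way with lower targets):

```python
import numpy as np
import pandas as pd

def fit_domain_thresholds(val_df: pd.DataFrame, target_recall: float = 0.9) -> dict:
    """For each domain, pick the lowest score threshold that still catches
    `target_recall` of the hard questions in the validation data."""
    thresholds = {}
    for domain, group in val_df.groupby('domain'):
        hard_scores = group.loc[group['is_hard'] == 1, 'risk_score'].values
        if len(hard_scores) == 0:
            continue  # no hard examples for this domain; fall back to 'general'
        # Flagging everything above this quantile keeps ~target_recall of hard items.
        cutoff = np.quantile(hard_scores, 1.0 - target_recall)
        thresholds[domain] = {'high': float(cutoff)}
    return thresholds
```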
---

## Problem 2: Evaluation & Generalization

### Proposed Evaluation Framework: Nested Cross-Validation (Gold Standard)

#### Why Nested CV > Simple Train/Val/Test Split

**Problems with simple splits:**

- A single validation set can be unrepresentative (lucky/unlucky split)
- Repeated "peeking" at the validation set during hyperparameter search causes leakage
- The test set provides only ONE estimate of generalization (high variance)

**Nested CV advantages:**

- **Outer loop**: K-fold CV for an unbiased generalization estimate
- **Inner loop**: Hyperparameter search on each training fold
- **No leakage**: Test folds are never seen during tuning
- **Multiple estimates**: Robust performance across K different test sets

#### Implementation: Nested Cross-Validation
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from typing import Dict, List, Any


class NestedCVEvaluator:
    """
    Nested cross-validation for ToGMAL hyperparameter tuning and evaluation.

    Outer CV: 5-fold stratified CV for the generalization estimate
    Inner CV: 3-fold stratified CV for the hyperparameter search

    This prevents data leakage from "peeking" at validation data during tuning.
    """

    def __init__(
        self,
        benchmark_data,
        outer_folds: int = 5,
        inner_folds: int = 3,
        random_state: int = 42
    ):
        self.data = benchmark_data
        self.outer_folds = outer_folds
        self.inner_folds = inner_folds
        self.random_state = random_state

        # Stratify by (domain, difficulty) to ensure balanced folds
        self.stratify_labels = (
            benchmark_data['domain'].astype(str) + '_' +
            benchmark_data['difficulty_label'].astype(str)
        )

    def run_nested_cv(
        self,
        param_grid: Dict[str, List[Any]],
        scoring_metric: str = 'roc_auc'
    ) -> Dict[str, Any]:
        """
        Run nested cross-validation.

        Args:
            param_grid: Hyperparameters to search (e.g., {'k': [3, 5, 7], 'threshold': [0.6, 0.7]})
            scoring_metric: Metric to optimize (roc_auc, f1, etc.)

        Returns:
            Dictionary with:
            - outer_scores: Generalization performance on each outer fold
            - best_params_per_fold: Optimal hyperparameters found by each inner CV
            - mean_test_score: Average performance across outer folds
            - std_test_score: Standard deviation (uncertainty estimate)
        """
        # Outer CV: for the generalization estimate
        outer_cv = StratifiedKFold(
            n_splits=self.outer_folds,
            shuffle=True,
            random_state=self.random_state
        )

        outer_scores = []
        best_params_per_fold = []

        print("Starting Nested Cross-Validation...")
        print(f"Outer CV: {self.outer_folds} folds")
        print(f"Inner CV: {self.inner_folds} folds")
        print(f"Param grid: {param_grid}")
        print("=" * 80)

        for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(self.data, self.stratify_labels)):
            print(f"\nOuter Fold {fold_idx + 1}/{self.outer_folds}")

            # Split data for this outer fold
            train_data = self.data.iloc[train_idx]
            test_data = self.data.iloc[test_idx]

            # Inner CV: hyperparameter search on training data ONLY
            inner_cv = StratifiedKFold(
                n_splits=self.inner_folds,
                shuffle=True,
                random_state=self.random_state
            )

            # Run grid search on the inner folds
            best_params, best_inner_score = self._inner_grid_search(
                train_data,
                param_grid,
                inner_cv,
                scoring_metric
            )
            print(f"  Inner CV best params: {best_params}")
            print(f"  Inner CV best score: {best_inner_score:.4f}")

            # Build the ToGMAL vector DB with ONLY the training data
            vector_db = self._build_vector_db(train_data)

            # Evaluate on the held-out test fold with the best hyperparameters
            test_score = self._evaluate_on_test_fold(
                vector_db,
                test_data,
                best_params,
                scoring_metric
            )
            print(f"  Outer test score: {test_score:.4f}")

            outer_scores.append(test_score)
            best_params_per_fold.append(best_params)

        # Aggregate results
        mean_score = np.mean(outer_scores)
        std_score = np.std(outer_scores)

        print("\n" + "=" * 80)
        print("Nested CV Results:")
        print(f"  Outer scores: {[f'{s:.4f}' for s in outer_scores]}")
        print(f"  Mean ± Std: {mean_score:.4f} ± {std_score:.4f}")
        print("=" * 80)

        return {
            'outer_scores': outer_scores,
            'mean_test_score': mean_score,
            'std_test_score': std_score,
            'best_params_per_fold': best_params_per_fold,
            'most_common_params': self._find_most_common_params(best_params_per_fold)
        }
    def _inner_grid_search(
        self,
        train_data,
        param_grid: Dict[str, List[Any]],
        inner_cv,
        scoring_metric: str
    ) -> tuple:
        """
        Grid search over hyperparameters using the inner CV folds.

        Returns (best_params, best_score).
        """
        stratify = (
            train_data['domain'].astype(str) + '_' +
            train_data['difficulty_label'].astype(str)
        )

        best_score = -np.inf
        best_params = {}

        # Generate all parameter combinations
        from itertools import product
        param_names = list(param_grid.keys())
        param_values = list(param_grid.values())

        for param_combo in product(*param_values):
            params = dict(zip(param_names, param_combo))

            # Evaluate this parameter combination on the inner folds
            fold_scores = []
            for inner_train_idx, inner_val_idx in inner_cv.split(train_data, stratify):
                inner_train = train_data.iloc[inner_train_idx]
                inner_val = train_data.iloc[inner_val_idx]

                # Build a vector DB from the inner training data
                inner_db = self._build_vector_db(inner_train)

                # Evaluate on the inner validation fold
                score = self._evaluate_on_test_fold(
                    inner_db,
                    inner_val,
                    params,
                    scoring_metric
                )
                fold_scores.append(score)

            avg_score = np.mean(fold_scores)
            if avg_score > best_score:
                best_score = avg_score
                best_params = params

        return best_params, best_score

    def _build_vector_db(self, train_data):
        """Build a vector database from training data."""
        from benchmark_vector_db import BenchmarkVectorDB, BenchmarkQuestion
        from pathlib import Path
        import tempfile

        # Create a temporary DB for this fold
        temp_dir = tempfile.mkdtemp()
        db = BenchmarkVectorDB(
            db_path=Path(temp_dir) / "fold_db",
            embedding_model="all-MiniLM-L6-v2"
        )

        # Convert the dataframe to BenchmarkQuestion objects
        questions = [
            BenchmarkQuestion(
                question_id=row['question_id'],
                source_benchmark=row['source_benchmark'],
                domain=row['domain'],
                question_text=row['question_text'],
                correct_answer=row['correct_answer'],
                success_rate=row['success_rate'],
                difficulty_score=row['difficulty_score'],
                difficulty_label=row['difficulty_label']
            )
            for _, row in train_data.iterrows()
        ]
        db.index_questions(questions)
        return db
    def _evaluate_on_test_fold(
        self,
        vector_db,
        test_data,
        params: Dict[str, Any],
        metric: str
    ) -> float:
        """
        Evaluate ToGMAL on a test fold with the given hyperparameters.

        Args:
            vector_db: Vector database built from training data
            test_data: Held-out test fold
            params: Hyperparameters (e.g., k, similarity_threshold, penalty weights)
            metric: Scoring metric (roc_auc, f1, etc.)
        """
        from sklearn.metrics import roc_auc_score, f1_score

        predictions = []
        ground_truth = []

        for _, row in test_data.iterrows():
            # Query the vector DB with the test question
            result = vector_db.query_similar_questions(
                prompt=row['question_text'],
                k=params.get('k_neighbors', 5)
            )

            # Apply adaptive scoring with the hyperparameters under test
            risk_score = self._compute_adaptive_risk(result, params)
            predictions.append(risk_score)

            # Ground truth: is this question hard? (success_rate < 0.5)
            ground_truth.append(1 if row['success_rate'] < 0.5 else 0)

        # Compute the metric
        if metric == 'roc_auc':
            return roc_auc_score(ground_truth, predictions)
        elif metric == 'f1':
            # Binarize predictions at a 0.5 threshold
            binary_preds = [1 if p > 0.5 else 0 for p in predictions]
            return f1_score(ground_truth, binary_preds)
        else:
            raise ValueError(f"Unknown metric: {metric}")

    def _compute_adaptive_risk(
        self,
        query_result: Dict[str, Any],
        params: Dict[str, Any]
    ) -> float:
        """
        Compute a risk score with adaptive uncertainty penalties.

        Uses the hyperparameters selected by the inner CV search.
        """
        similarities = [q['similarity'] for q in query_result['similar_questions']]
        difficulties = [q['difficulty_score'] for q in query_result['similar_questions']]

        # Base weighted average
        weights = np.array(similarities) / sum(similarities)
        base_score = np.dot(weights, difficulties)

        # Adaptive uncertainty penalties
        max_sim = max(similarities)
        avg_sim = np.mean(similarities)
        sim_variance = np.var(similarities)

        uncertainty_penalty = 0.0

        # Low maximum similarity (threshold is configurable)
        sim_threshold = params.get('similarity_threshold', 0.7)
        if max_sim < sim_threshold:
            uncertainty_penalty += (sim_threshold - max_sim) * params.get('low_sim_penalty', 0.5)

        # High variance penalty
        if sim_variance > 0.05:
            uncertainty_penalty += min(sim_variance * params.get('variance_penalty', 2.0), 0.3)

        # Low average similarity
        if avg_sim < 0.5:
            uncertainty_penalty += (0.5 - avg_sim) * params.get('low_avg_penalty', 0.4)

        # Final score
        adjusted_score = base_score + uncertainty_penalty
        return np.clip(adjusted_score, 0.0, 1.0)

    def _find_most_common_params(self, params_list: List[Dict]) -> Dict:
        """Find the most frequently selected hyperparameters across folds."""
        from collections import Counter

        # For each parameter, find its most common value
        all_param_names = params_list[0].keys()
        most_common = {}
        for param_name in all_param_names:
            values = [p[param_name] for p in params_list]
            most_common[param_name] = Counter(values).most_common(1)[0][0]
        return most_common

# Example usage
if __name__ == "__main__":
    import pandas as pd
    from pathlib import Path
    from benchmark_vector_db import BenchmarkVectorDB

    # Load all benchmark questions
    db = BenchmarkVectorDB(db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db"))
    stats = db.get_statistics()

    # Get all questions as a dataframe (you'll need to implement this)
    all_questions_df = db.get_all_questions_as_dataframe()

    # Define the hyperparameter search grid
    param_grid = {
        'k_neighbors': [3, 5, 7, 10],
        'similarity_threshold': [0.6, 0.7, 0.8],
        'low_sim_penalty': [0.3, 0.5, 0.7],
        'variance_penalty': [1.0, 2.0, 3.0],
        'low_avg_penalty': [0.2, 0.4, 0.6]
    }

    # Run nested CV
    evaluator = NestedCVEvaluator(
        benchmark_data=all_questions_df,
        outer_folds=5,   # 5-fold outer CV
        inner_folds=3    # 3-fold inner CV for the hyperparameter search
    )
    results = evaluator.run_nested_cv(
        param_grid=param_grid,
        scoring_metric='roc_auc'
    )

    print("\nFinal Results:")
    print(f"Generalization Performance: {results['mean_test_score']:.4f} ± {results['std_test_score']:.4f}")
    print(f"Most Common Best Params: {results['most_common_params']}")
```
**Key Advantages:**

- **No leakage**: Each outer test fold is never seen during hyperparameter tuning
- **Robust estimates**: 5 different generalization scores (not just 1)
- **Automatic tuning**: Inner CV finds the best hyperparameters for each fold
- **Confidence intervals**: The standard deviation tells you the uncertainty in performance

#### Phase 2: Define Evaluation Metrics

Use standard **OOD detection metrics** plus **calibration metrics**:

1. **AUROC** (Area Under the ROC Curve)
   - Threshold-independent
   - Measures overall discriminative ability
   - Gold standard for OOD detection
   - Interpretation: probability that a random risky prompt is ranked higher than a random safe prompt
2. **FPR@TPR95** (False Positive Rate at 95% True Positive Rate)
   - How many safe prompts are incorrectly flagged when catching 95% of risky ones
   - Common in safety-critical applications
   - Lower is better (want to minimize false alarms)
3. **AUPR** (Area Under the Precision-Recall Curve)
   - Better for imbalanced datasets
   - Useful when risky prompts are rare
   - Focuses on the positive class (risky prompts)
4. **Expected Calibration Error (ECE)**
   - Are your risk probabilities accurate?
   - If you say 70% risky, is it actually risky 70% of the time?
   - Measures the gap between predicted probabilities and observed frequencies
5. **Brier Score**
   - Measures the accuracy of probabilistic predictions
   - Lower is better
   - Combines discrimination and calibration
```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, brier_score_loss


def compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95):
    """Compute the FPR at the point where TPR reaches the specified threshold."""
    from sklearn.metrics import roc_curve
    fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
    # Find the first index where TPR >= threshold
    idx = np.argmax(tpr >= tpr_threshold)
    return fpr[idx]


def expected_calibration_error(y_true, y_pred_proba, n_bins=10):
    """
    Compute the Expected Calibration Error (ECE).

    Bins predictions into n_bins buckets and measures the gap between the
    predicted probability and the observed frequency in each bin.
    """
    y_true = np.asarray(y_true)
    y_pred_proba = np.asarray(y_pred_proba)

    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # Find predictions in this bin
        in_bin = (y_pred_proba > bin_lower) & (y_pred_proba <= bin_upper)
        prop_in_bin = in_bin.mean()
        if prop_in_bin > 0:
            # Observed frequency in this bin
            accuracy_in_bin = y_true[in_bin].mean()
            # Average predicted probability in this bin
            avg_confidence_in_bin = y_pred_proba[in_bin].mean()
            # Contribution to the ECE
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece


def evaluate_togmal(predictions, ground_truth):
    """
    Comprehensive evaluation of ToGMAL performance.

    Args:
        predictions: Dict with 'risk_score' (continuous 0-1) and 'risk_level' (categorical)
        ground_truth: DataFrame with a 'success_rate' column, or an array of binary labels (0=easy, 1=hard)

    Returns:
        Dictionary with all evaluation metrics
    """
    # Convert ground truth to binary if needed (hard = success_rate < 0.5)
    if hasattr(ground_truth, 'success_rate'):
        y_true = (ground_truth['success_rate'] < 0.5).astype(int).values
    else:
        y_true = np.asarray(ground_truth)

    y_pred_proba = np.asarray(predictions['risk_score'])    # Continuous 0-1
    y_pred_binary = (y_pred_proba > 0.5).astype(int)        # Binarized

    # AUROC
    auroc = roc_auc_score(y_true, y_pred_proba)

    # FPR@TPR95
    fpr_at_95_tpr = compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95)

    # AUPR
    precisions, recalls, _ = precision_recall_curve(y_true, y_pred_proba)
    aupr = auc(recalls, precisions)

    # Calibration error
    ece = expected_calibration_error(y_true, y_pred_proba, n_bins=10)

    # Brier score (lower is better)
    brier = brier_score_loss(y_true, y_pred_proba)

    # Standard classification metrics (for reference)
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    accuracy = accuracy_score(y_true, y_pred_binary)
    f1 = f1_score(y_true, y_pred_binary)
    precision = precision_score(y_true, y_pred_binary)
    recall = recall_score(y_true, y_pred_binary)

    return {
        # Primary OOD detection metrics
        'AUROC': auroc,
        'FPR@TPR95': fpr_at_95_tpr,
        'AUPR': aupr,
        # Calibration metrics
        'ECE': ece,
        'Brier_Score': brier,
        # Standard classification (for reference)
        'Accuracy': accuracy,
        'F1': f1,
        'Precision': precision,
        'Recall': recall
    }


def print_evaluation_report(metrics: dict):
    """Pretty-print the evaluation metrics."""
    print("\n" + "=" * 80)
    print("ToGMAL Evaluation Report")
    print("=" * 80)
    print("\nOOD Detection Performance:")
    print(f"  AUROC:      {metrics['AUROC']:.4f} (higher is better, 0.5=random, 1.0=perfect)")
    print(f"  FPR@TPR95:  {metrics['FPR@TPR95']:.4f} (lower is better, false alarm rate)")
    print(f"  AUPR:       {metrics['AUPR']:.4f} (higher is better)")
    print("\nCalibration:")
    print(f"  ECE:         {metrics['ECE']:.4f} (lower is better, 0=perfect calibration)")
    print(f"  Brier Score: {metrics['Brier_Score']:.4f} (lower is better)")
    print("\nClassification Metrics (for reference):")
    print(f"  Accuracy:  {metrics['Accuracy']:.4f}")
    print(f"  F1 Score:  {metrics['F1']:.4f}")
    print(f"  Precision: {metrics['Precision']:.4f}")
    print(f"  Recall:    {metrics['Recall']:.4f}")
    print("\n" + "=" * 80)
```
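A quick smoke test of the metric helpers on synthetic data; the labels and scores below are random stand-ins purely to exercise the code paths:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y_true = rng.integers(0, 2, size=n)  # fake hard/easy labels
# Fake risk scores that are mildly informative: noisy copies of the labels.
risk_scores = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, size=n), 0, 1)

metrics = evaluate_togmal({'risk_score': risk_scores}, y_true)
print_evaluation_report(metrics)
```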
#### Phase 3: Out-of-Distribution Testing

**Critical:** Test on data that is truly OOD relative to your training benchmarks.

**OOD Test Sets to Create:**

1. **Temporal OOD**: New benchmark questions released after your training data cutoff
2. **Domain Shift**: Categories not in MMLU (e.g., creative writing prompts, coding challenges)
3. **Adversarial**: Hand-crafted examples designed to fool the system
   - "Prove [false scientific claim]"
   - Jailbreak attempts disguised as innocent questions
   - Edge cases from your taxonomy submissions

```python
# Illustrative loaders -- each of these still needs to be implemented.
ood_test_sets = {
    'adversarial_false_premises': load_false_premise_examples(),
    'jailbreaks': load_jailbreak_attempts(),
    'creative_writing': load_writing_prompts(),
    'recent_benchmarks': load_benchmarks_after('2024-01'),
    'user_submissions': load_taxonomy_entries()
}

# Evaluate on each OOD set
for name, test_data in ood_test_sets.items():
    metrics = evaluate_togmal(model.predict(test_data), test_data.labels)
    print(f"{name}: AUROC={metrics['AUROC']:.3f}, FPR@95={metrics['FPR@TPR95']:.3f}")
```
#### Phase 4: Hyperparameter Tuning Protocol

**Use the validation set ONLY** for tuning; never touch the test set until the final evaluation.

```python
# Illustrative protocol -- grid_search_cv and train_togmal are placeholders
# for the tuning and training entry points.

# Parameters to tune
param_grid = {
    'similarity_threshold': [0.5, 0.6, 0.7, 0.8],
    'k_neighbors': [3, 5, 7, 10],
    'uncertainty_penalty_weight': [0.2, 0.4, 0.6],
    'heuristic_weight': [0.3, 0.4, 0.5],
    'vector_weight': [0.3, 0.4, 0.5]
}

# Cross-validation on the validation set
best_params = grid_search_cv(
    togmal_model,
    param_grid,
    val_set,
    metric='AUROC',
    cv=5   # 5-fold CV within the validation set
)

# Train the final model with the best params on train + val
final_model = train_togmal(
    train_set + val_set,
    params=best_params
)

# Evaluate ONCE on the test set
final_metrics = evaluate_togmal(
    final_model.predict(test_set),
    test_set.labels
)
```
---

## Implementation Roadmap

### Phase 1: Adaptive Scoring Implementation (Week 1-2)

- [x] Implement basic vector database with 32K questions
- [ ] Add adaptive uncertainty-aware scoring function
  - [ ] Similarity threshold penalties
  - [ ] Variance penalties for diverse matches
  - [ ] Low average similarity penalties
- [ ] Implement domain-specific threshold calibration
- [ ] Add multi-signal fusion (vector + heuristics)
- [ ] Integrate into `benchmark_vector_db.py::query_similar_questions()`

### Phase 2: Data Export & Preparation (Week 2)

- [ ] Export all 32K questions from ChromaDB to a pandas DataFrame
  - [ ] Add `BenchmarkVectorDB.get_all_questions_as_dataframe()` method
  - [ ] Include all metadata (domain, difficulty, success_rate, etc.)
- [ ] Verify stratification labels (domain × difficulty)
- [ ] Create initial train/val/test split (simple 70/15/15) for the baseline
- [ ] Document dataset statistics per split

### Phase 3: Nested CV Framework (Week 3)

- [ ] Implement `NestedCVEvaluator` class
  - [ ] Outer CV loop (5-fold stratified)
  - [ ] Inner CV loop (3-fold grid search)
  - [ ] Temporary vector DB creation per fold
- [ ] Define hyperparameter search grid
  - `k_neighbors`: [3, 5, 7, 10]
  - `similarity_threshold`: [0.6, 0.7, 0.8]
  - `low_sim_penalty`: [0.3, 0.5, 0.7]
  - `variance_penalty`: [1.0, 2.0, 3.0]
  - `low_avg_penalty`: [0.2, 0.4, 0.6]
- [ ] Implement evaluation metrics (AUROC, FPR@TPR95, ECE)

### Phase 4: Baseline Evaluation (Week 3-4)

- [ ] Run current ToGMAL (naive weighted average) on the simple split
- [ ] Compute baseline metrics:
  - [ ] AUROC on the test set
  - [ ] FPR@TPR95
  - [ ] Expected Calibration Error
  - [ ] Brier Score
- [ ] Analyze failure modes:
  - [ ] Low similarity cases (max_sim < 0.6)
  - [ ] High variance matches
  - [ ] Cross-domain queries
- [ ] Document baseline performance for comparison

### Phase 5: Nested CV Hyperparameter Tuning (Week 4-5)

- [ ] Run full nested CV (5 outer × 3 inner = 15 train-test runs)
- [ ] Track computational cost (time per fold)
- [ ] Collect best hyperparameters per outer fold
- [ ] Identify the most common optimal parameters
- [ ] Compute mean ± std generalization performance

### Phase 6: Final Model Training (Week 5)

- [ ] Train the final model on ALL 32K questions with the best hyperparameters
- [ ] Re-index the full vector database
- [ ] Update `togmal_mcp.py` to use adaptive scoring
- [ ] Deploy to the MCP server and HTTP facade

### Phase 7: OOD Testing (Week 6)

- [ ] Create OOD test sets:
  - [ ] **Adversarial**: Hand-crafted edge cases
    - "Prove [false scientific claim]"
    - Jailbreak attempts disguised as questions
    - Taxonomy submissions from users
  - [ ] **Domain Shift**: Categories not in MMLU
    - Creative writing prompts
    - Code generation tasks
    - Real-world user queries
  - [ ] **Temporal OOD**: New benchmarks (2024+)
    - SimpleQA (if available)
    - Latest MMLU updates
- [ ] Evaluate on each OOD set
- [ ] Analyze degradation vs. in-distribution performance

### Phase 8: Iteration & Documentation (Week 7)

- [ ] Analyze failures on the OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with updated features
- [ ] Generate calibration plots (reliability diagrams; see the sketch after this list)
- [ ] Write technical report:
  - [ ] Methodology (nested CV protocol)
  - [ ] Results (baseline vs. adaptive)
  - [ ] Ablation studies (each penalty component)
  - [ ] OOD generalization analysis
  - [ ] Failure mode documentation
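For the reliability diagrams in Phase 8, a minimal matplotlib sketch; it takes the same binary labels and continuous risk scores that `evaluate_togmal` consumes, and the bin count and output path are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reliability_diagram(y_true, risk_scores, n_bins=10, path="reliability.png"):
    """Plot observed hard-question frequency against mean predicted risk per bin."""
    y_true = np.asarray(y_true)
    risk_scores = np.asarray(risk_scores)

    bins = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.digitize(risk_scores, bins[1:-1])  # assign each score to a bin

    mean_pred, observed = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(risk_scores[mask].mean())
            observed.append(y_true[mask].mean())

    plt.figure(figsize=(5, 5))
    plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
    plt.plot(mean_pred, observed, "o-", label="ToGMAL risk score")
    plt.xlabel("Mean predicted risk")
    plt.ylabel("Observed fraction of hard questions")
    plt.legend()
    plt.tight_layout()
    plt.savefig(path)
```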
---

## Expected Improvements

Based on the OOD detection literature and nested CV best practices:

1. **Adaptive scoring** should improve AUROC by 5-15% on low-similarity cases
   - Baseline: ~0.75 AUROC (naive weighted average)
   - Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV** will give honest performance estimates
   - Simple train/test: a single point estimate (could be lucky/unlucky)
   - Nested CV: mean ± std across 5 folds (robust estimate)
3. **Domain calibration** should reduce false positives by 10-20%
   - Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion** should catch edge cases like "prove a false premise"
   - Combines vector similarity with rule-based heuristics
   - Expected: improved recall on adversarial examples
5. **Calibration improvements**
   - Expected Calibration Error (ECE) < 0.05
   - Better alignment between predicted risk and actual difficulty

---

## Validation Checklist

Before deploying to production:

- ✓ Nested CV completed with no data leakage
- ✓ Hyperparameters tuned on inner CV folds only
- ✓ Generalization performance estimated on outer CV folds
- ✓ OOD sets tested (adversarial, domain-shift, temporal)
- ✓ Calibration error measured and within acceptable range (ECE < 0.1)
- ✓ Failure modes documented with specific examples
- ✓ Ablation studies show each component contributes positively
- ✓ Performance comparison: adaptive > baseline on all metrics
- ✓ Real-world testing with user queries from taxonomy submissions

---

## Key References

1. **Similarity Thresholds**: Cosine similarity of 0.7-0.8 is a commonly recommended starting point for "relevant" matches; lower values are increasingly unreliable
2. **OOD Metrics**: AUROC and FPR@TPR95 are standard; conformal prediction provides probabilistic guarantees
3. **Adaptive Methods**: Uncertainty-aware thresholds outperform fixed thresholds in retrieval tasks
4. **Holdout Validation**: 60-20-20 or 70-15-15 splits are common; stratification by domain/difficulty is essential
5. **Calibration**: Expected Calibration Error (ECE) measures whether predicted probabilities match observed frequencies
6. **Nested CV**: Gold standard for hyperparameter tuning; prevents leakage from repeated validation peeking
7. **Stratified K-Fold**: Maintains class distribution across folds; essential for imbalanced datasets
---

## Quick Start: Immediate Implementation

### Step 1: Add Adaptive Scoring to `benchmark_vector_db.py` (Today)

Replace the naive weighted average in `query_similar_questions()` with adaptive uncertainty-aware scoring:

```python
def query_similar_questions(
    self,
    prompt: str,
    k: int = 5,
    domain_filter: Optional[str] = None,
    # NEW: adaptive scoring parameters
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4
) -> Dict[str, Any]:
    """Find the k most similar benchmark questions with adaptive uncertainty penalties."""
    # ... existing code to query ChromaDB ...

    # Extract similarities and difficulty scores
    similarities = []
    difficulty_scores = []
    success_rates = []

    for i in range(len(results['ids'][0])):
        metadata = results['metadatas'][0][i]
        distance = results['distances'][0][i]

        # Convert L2 distance to cosine similarity
        similarity = max(0, 1 - (distance ** 2) / 2)

        similarities.append(similarity)
        difficulty_scores.append(metadata['difficulty_score'])
        success_rates.append(metadata['success_rate'])

    # IMPROVED: adaptive uncertainty-aware scoring
    weighted_difficulty = self._compute_adaptive_difficulty(
        similarities=similarities,
        difficulty_scores=difficulty_scores,
        similarity_threshold=similarity_threshold,
        low_sim_penalty=low_sim_penalty,
        variance_penalty=variance_penalty,
        low_avg_penalty=low_avg_penalty
    )

    # ... rest of existing code ...


def _compute_adaptive_difficulty(
    self,
    similarities: List[float],
    difficulty_scores: List[float],
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4
) -> float:
    """
    Compute a difficulty score with adaptive uncertainty penalties.

    Key insight: when the retrieved questions have low similarity to the prompt,
    we should INCREASE the risk estimate because we are extrapolating.

    Args:
        similarities: Cosine similarities of the k-NN results
        difficulty_scores: Difficulty scores (1 - success_rate) of the k-NN results
        similarity_threshold: Below this, apply the low-similarity penalty (default: 0.7)
        low_sim_penalty: Weight for the low-similarity penalty (default: 0.5)
        variance_penalty: Weight for the high-variance penalty (default: 2.0)
        low_avg_penalty: Weight for the low-average-similarity penalty (default: 0.4)

    Returns:
        Adjusted difficulty score (0.0 to 1.0, higher = more risky)
    """
    import numpy as np

    # Base weighted average (original approach)
    weights = np.array(similarities) / sum(similarities)
    base_score = np.dot(weights, difficulty_scores)

    # Compute uncertainty indicators
    max_sim = max(similarities)
    avg_sim = np.mean(similarities)
    sim_variance = np.var(similarities)

    # Initialize the uncertainty penalty
    uncertainty_penalty = 0.0

    # Penalty 1: low maximum similarity
    # If the best match is weak, we are likely OOD
    if max_sim < similarity_threshold:
        penalty = (similarity_threshold - max_sim) * low_sim_penalty
        uncertainty_penalty += penalty
        logger.debug(f"Low max similarity penalty: {penalty:.3f} (max_sim={max_sim:.3f})")

    # Penalty 2: high variance in similarities
    # If the k-NN results are very dissimilar to each other, the matches are unreliable
    variance_threshold = 0.05
    if sim_variance > variance_threshold:
        penalty = min(sim_variance * variance_penalty, 0.3)  # Cap at 0.3
        uncertainty_penalty += penalty
        logger.debug(f"High variance penalty: {penalty:.3f} (variance={sim_variance:.3f})")

    # Penalty 3: low average similarity
    # If ALL matches are weak, we are definitely OOD
    avg_threshold = 0.5
    if avg_sim < avg_threshold:
        penalty = (avg_threshold - avg_sim) * low_avg_penalty
        uncertainty_penalty += penalty
        logger.debug(f"Low avg similarity penalty: {penalty:.3f} (avg_sim={avg_sim:.3f})")

    # Final adjusted score
    adjusted_score = base_score + uncertainty_penalty

    # Clip to the [0, 1] range
    adjusted_score = np.clip(adjusted_score, 0.0, 1.0)

    logger.info(
        f"Adaptive scoring: base={base_score:.3f}, penalty={uncertainty_penalty:.3f}, "
        f"adjusted={adjusted_score:.3f}"
    )
    return adjusted_score
```
**Why this helps:**

- **"Prove the universe is 10,000 years old" example**: max_sim=0.57 triggers the low-similarity penalty → risk increases from MODERATE to HIGH
- **Unrelated k-NN matches**: high variance → additional penalty → correctly flagged as uncertain
- **Novel domains**: low average similarity across all matches → strong penalty → CRITICAL risk

### Step 2: Export Database for Evaluation (This Week)

Add a method to export all questions as a DataFrame for nested CV:

```python
def get_all_questions_as_dataframe(self) -> 'pd.DataFrame':
    """
    Export all questions from ChromaDB as a pandas DataFrame.

    Used for train/val/test splitting and nested CV.

    Returns:
        DataFrame with columns:
        question_id, source_benchmark, domain, question_text,
        correct_answer, success_rate, difficulty_score, difficulty_label
    """
    import pandas as pd

    count = self.collection.count()
    logger.info(f"Exporting {count} questions from vector database...")

    # Get all questions from ChromaDB
    all_data = self.collection.get(
        limit=count,
        include=["metadatas", "documents"]
    )

    # Convert to a DataFrame
    rows = []
    for i, qid in enumerate(all_data['ids']):
        metadata = all_data['metadatas'][i]
        rows.append({
            'question_id': qid,
            'question_text': all_data['documents'][i],
            'source_benchmark': metadata['source'],
            'domain': metadata['domain'],
            # Assumes the answer text was stored in the metadata at index time;
            # falls back to an empty string otherwise.
            'correct_answer': metadata.get('correct_answer', ''),
            'success_rate': metadata['success_rate'],
            'difficulty_score': metadata['difficulty_score'],
            'difficulty_label': metadata['difficulty_label'],
            'num_models_tested': metadata.get('num_models', 0)
        })

    df = pd.DataFrame(rows)
    logger.info(f"Exported {len(df)} questions to DataFrame")
    logger.info(f"  Domains: {df['domain'].nunique()}")
    logger.info(f"  Sources: {df['source_benchmark'].nunique()}")
    return df
```
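With the export in place, the simple 70/15/15 baseline split from Phase 2 can be produced directly from the DataFrame. A sketch using scikit-learn's stratified splitting; it assumes the columns exported above, the file names are illustrative, and strata with only a handful of questions may need to be merged before stratifying:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_baseline_split(df: pd.DataFrame, seed: int = 42):
    """Stratified 70/15/15 train/val/test split by (domain, difficulty_label)."""
    strata = df['domain'].astype(str) + '_' + df['difficulty_label'].astype(str)

    train_df, holdout_df = train_test_split(
        df, test_size=0.30, random_state=seed, stratify=strata
    )
    holdout_strata = strata.loc[holdout_df.index]
    val_df, test_df = train_test_split(
        holdout_df, test_size=0.50, random_state=seed, stratify=holdout_strata
    )
    return train_df, val_df, test_df

# df = db.get_all_questions_as_dataframe()
# train_df, val_df, test_df = make_baseline_split(df)
# train_df.to_parquet("splits/train.parquet")
```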
### Step 3: Test Adaptive Scoring Immediately

Create a test script to compare baseline vs. adaptive scoring:

```python
#!/usr/bin/env python3
"""Test adaptive scoring improvements."""

from pathlib import Path
from benchmark_vector_db import BenchmarkVectorDB

# Initialize the database
db = BenchmarkVectorDB(
    db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db")
)

# Test cases that should trigger uncertainty penalties
test_cases = [
    # Low similarity - should get a penalty
    "Prove that the universe is exactly 10,000 years old using thermodynamics",
    # Novel domain - should get a penalty
    "Write a haiku about quantum entanglement in 17th century Japanese",
    # Should match well - no penalty
    "What is the capital of France?",
    # Should match GPQA physics - no penalty
    "Calculate the quantum correction to the partition function for a 3D harmonic oscillator"
]

print("=" * 80)
print("Adaptive Scoring Test")
print("=" * 80)

for prompt in test_cases:
    print(f"\nPrompt: {prompt[:100]}...")
    result = db.query_similar_questions(prompt, k=5)

    print(f"  Max Similarity: {max(q['similarity'] for q in result['similar_questions']):.3f}")
    print(f"  Avg Similarity: {result['avg_similarity']:.3f}")
    print(f"  Weighted Difficulty: {result['weighted_difficulty_score']:.3f}")
    print(f"  Risk Level: {result['risk_level']}")
    print(f"  Top Match: {result['similar_questions'][0]['domain']} - {result['similar_questions'][0]['source']}")
```
---

## Next Steps

1. **Immediate**: Implement the train/val/test split of the benchmark data
2. **This week**: Add similarity-based uncertainty penalties
3. **Next week**: Run validation experiments with different thresholds
4. **End of month**: Complete evaluation on the test set + OOD sets
5. **Ongoing**: Build an adversarial test set from user submissions (see the sketch below)
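For step 5, one possible shape for turning taxonomy submissions into a labeled adversarial test set. The entry fields, category names, and output path here are all hypothetical and should be replaced with the real taxonomy schema.

```python
import json
from pathlib import Path

def build_adversarial_set(entries: list, out_path: str = "ood/adversarial.jsonl") -> int:
    """Convert taxonomy submissions into (prompt, label) records for OOD evaluation.

    Assumes each entry is a dict with 'prompt' and 'category' fields; anything
    categorized as a known failure pattern is labeled hard (1).
    """
    hard_categories = {"false_premise", "jailbreak", "dangerous_advice"}
    records = [
        {"prompt": e["prompt"], "label": 1 if e.get("category") in hard_categories else 0}
        for e in entries
        if e.get("prompt")
    ]
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return len(records)
```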