ToGMAL Improvement Plan: Adaptive Scoring & Evaluation Framework

Executive Summary

This plan addresses two critical gaps in ToGMAL's current implementation:

  1. Naive weighted averaging fails when retrieved questions have low similarity to the prompt
  2. Lack of rigorous evaluation methodology to measure OOD detection performance

Problem 1: Low-Similarity Scoring Issues

Current Limitation

Your system uses a simple weighted average of difficulty scores from k-nearest neighbors, which produces unreliable risk assessments when:

  • Maximum similarity < 0.6 (semantically distant matches)
  • Retrieved questions span multiple unrelated domains
  • Query is truly novel/out-of-distribution

Example: "Prove universe is 10,000 years old" matched to factual recall questions about Earth's age (similarity ~0.57), resulting in LOW risk despite being a "prove false premise" pattern.
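
To make the failure mode concrete, here is a minimal sketch of the current naive scoring (the similarity and difficulty values are assumed for illustration; only the ~0.57 max similarity mirrors the example above). Because the weights are renormalized, uniformly weak matches still produce a low difficulty estimate:

import numpy as np

# Hypothetical k-NN results for the false-premise prompt: all matches are weak,
# and they point to relatively easy factual-recall questions (assumed values)
similarities = [0.57, 0.55, 0.52, 0.50, 0.48]
difficulties = [0.25, 0.30, 0.20, 0.35, 0.28]

# Current approach: similarity-weighted average of neighbor difficulty
weights = np.array(similarities) / sum(similarities)
naive_score = float(np.dot(weights, difficulties))

print(f"Naive weighted difficulty: {naive_score:.2f}")  # ~0.28 -> falls in the LOW band
# The low similarities are never penalized, so the score stays low even though
# the prompt is effectively out-of-distribution for these neighbors.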

Solution: Adaptive Uncertainty-Aware Scoring

1. Similarity-Based Confidence Adjustment

Implement a confidence decay function that increases risk when similarity is low:

import numpy as np

def compute_adaptive_risk(similarities, difficulties, k=5):
    """
    Adjust risk score based on retrieval confidence
    """
    # Base weighted score
    weights = np.array(similarities) / sum(similarities)
    base_score = np.dot(weights, difficulties)
    
    # Confidence metrics
    max_sim = max(similarities)
    avg_sim = np.mean(similarities)
    sim_variance = np.var(similarities)
    
    # Uncertainty penalty - increase risk when:
    # - Max similarity is low (< 0.7)
    # - High variance in similarities (diverse matches)
    # - Average similarity is low
    
    uncertainty_penalty = 0.0
    
    # Low maximum similarity threshold
    if max_sim < 0.7:
        uncertainty_penalty += (0.7 - max_sim) * 0.5
    
    # High variance (retrieved questions are dissimilar to each other)
    if sim_variance > 0.05:
        uncertainty_penalty += min(sim_variance * 2, 0.3)
    
    # Low average similarity
    if avg_sim < 0.5:
        uncertainty_penalty += (0.5 - avg_sim) * 0.4
    
    # Adjusted score (higher = more risky)
    adjusted_score = base_score + uncertainty_penalty
    
    # Map to risk levels
    if adjusted_score < 0.2:
        return "MINIMAL"
    elif adjusted_score < 0.4:
        return "LOW"  
    elif adjusted_score < 0.6:
        return "MODERATE"
    elif adjusted_score < 0.8:
        return "HIGH"
    else:
        return "CRITICAL"

Key Insight: Research shows that cosine similarity thresholds vary by domain and task. Values of 0.7-0.8 are commonly recommended starting points for "relevant" matches; below 0.6, matches become increasingly unreliable.

2. Multi-Signal Fusion

Combine multiple indicators beyond just k-NN similarity:

def compute_risk_with_fusion(prompt, knn_results, heuristics):
    """
    Fuse vector similarity with rule-based heuristics.

    Note: assumes a variant of compute_adaptive_risk() that returns the
    continuous adjusted score (before mapping to risk labels), so the weighted
    combination below stays numeric. classify_domain() and domain_uncertainty()
    are placeholder hooks to be implemented.
    """
    # Vector-based score (continuous 0-1, from k-NN)
    vector_score = compute_adaptive_risk(
        knn_results['similarities'],
        knn_results['difficulties']
    )
    
    # Rule-based heuristics (existing togmal patterns)
    heuristic_score = heuristics.evaluate(prompt)
    
    # Domain classifier (is this math/physics/medical?)
    domain_confidence = classify_domain(prompt)
    
    # Combine scores with learned weights
    final_score = (
        0.4 * vector_score + 
        0.4 * heuristic_score +
        0.2 * domain_uncertainty(domain_confidence)
    )
    
    return final_score

3. Threshold Calibration per Domain

Different domains need different thresholds. Implement domain-specific calibration:

# Learned from validation data
DOMAIN_THRESHOLDS = {
    'math': {'low': 0.65, 'moderate': 0.75, 'high': 0.85},
    'physics': {'low': 0.60, 'moderate': 0.70, 'high': 0.80},
    'medical': {'low': 0.70, 'moderate': 0.80, 'high': 0.90},
    'general': {'low': 0.60, 'moderate': 0.70, 'high': 0.80}
}

def get_calibrated_threshold(domain, risk_level):
    return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS['general'])[risk_level]
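
A short usage sketch of how these calibrated thresholds might be applied (the risk_level_for helper and the 0.72 score are illustrative, not part of the existing codebase):

def risk_level_for(domain: str, adjusted_score: float) -> str:
    """Map an adjusted difficulty score to a risk level using domain-calibrated thresholds."""
    thresholds = DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS['general'])
    if adjusted_score >= thresholds['high']:
        return "HIGH"
    elif adjusted_score >= thresholds['moderate']:
        return "MODERATE"
    elif adjusted_score >= thresholds['low']:
        return "LOW"
    return "MINIMAL"

# The same score crosses different boundaries in different domains
print(risk_level_for('medical', 0.72))  # LOW (medical thresholds are stricter)
print(risk_level_for('physics', 0.72))  # MODERATE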

Problem 2: Evaluation & Generalization

Proposed Evaluation Framework: Nested Cross-Validation (Gold Standard)

Why Nested CV > Simple Train/Val/Test Split

Problem with simple splits:

  • Single validation set can be unrepresentative (lucky/unlucky split)
  • Repeated "peeking" at validation during hyperparameter search causes leakage
  • Test set provides only ONE estimate of generalization (high variance)

Nested CV advantages:

  • Outer loop: K-fold CV for unbiased generalization estimate
  • Inner loop: Hyperparameter search on each training fold
  • No leakage: Test folds never seen during tuning
  • Multiple estimates: Robust performance across K different test sets

Implementation: Nested Cross-Validation

from sklearn.model_selection import StratifiedKFold
import numpy as np
from typing import Dict, List, Any

class NestedCVEvaluator:
    """
    Nested cross-validation for ToGMAL hyperparameter tuning and evaluation.
    
    Outer CV: 5-fold stratified CV for generalization estimate
    Inner CV: 3-fold stratified CV for hyperparameter search
    
    This prevents data leakage from "peeking" at validation during tuning.
    """
    
    def __init__(
        self,
        benchmark_data,
        outer_folds: int = 5,
        inner_folds: int = 3,
        random_state: int = 42
    ):
        self.data = benchmark_data
        self.outer_folds = outer_folds
        self.inner_folds = inner_folds
        self.random_state = random_state
        
        # Stratify by (domain, difficulty) to ensure balanced folds
        self.stratify_labels = (
            benchmark_data['domain'].astype(str) + '_' + 
            benchmark_data['difficulty_label'].astype(str)
        )
    
    def run_nested_cv(
        self,
        param_grid: Dict[str, List[Any]],
        scoring_metric: str = 'roc_auc'
    ) -> Dict[str, Any]:
        """
        Run nested cross-validation.
        
        Args:
            param_grid: Hyperparameters to search (e.g., {'k': [3,5,7], 'threshold': [0.6,0.7]})
            scoring_metric: Metric for optimization (roc_auc, f1, etc.)
        
        Returns:
            Dictionary with:
            - outer_scores: Generalization performance on each outer fold
            - best_params_per_fold: Optimal hyperparameters found in each inner CV
            - mean_test_score: Average performance across outer folds
            - std_test_score: Standard deviation (uncertainty estimate)
        """
        
        # Outer CV: For generalization estimate
        outer_cv = StratifiedKFold(
            n_splits=self.outer_folds,
            shuffle=True,
            random_state=self.random_state
        )
        
        outer_scores = []
        best_params_per_fold = []
        
        print("Starting Nested Cross-Validation...")
        print(f"Outer CV: {self.outer_folds} folds")
        print(f"Inner CV: {self.inner_folds} folds")
        print(f"Param grid: {param_grid}")
        print("="*80)
        
        for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(self.data, self.stratify_labels)):
            print(f"\nOuter Fold {fold_idx + 1}/{self.outer_folds}")
            
            # Split data for this outer fold
            train_data = self.data.iloc[train_idx]
            test_data = self.data.iloc[test_idx]
            
            # Inner CV: Hyperparameter search on training data ONLY
            inner_cv = StratifiedKFold(
                n_splits=self.inner_folds,
                shuffle=True,
                random_state=self.random_state
            )
            
            # Run grid search on inner folds
            best_params, best_inner_score = self._inner_grid_search(
                train_data,
                param_grid,
                inner_cv,
                scoring_metric
            )
            
            print(f"  Inner CV best params: {best_params}")
            print(f"  Inner CV best score: {best_inner_score:.4f}")
            
            # Build ToGMAL vector DB with ONLY training data
            vector_db = self._build_vector_db(train_data)
            
            # Evaluate on held-out test fold with best hyperparameters
            test_score = self._evaluate_on_test_fold(
                vector_db,
                test_data,
                best_params,
                scoring_metric
            )
            
            print(f"  Outer test score: {test_score:.4f}")
            
            outer_scores.append(test_score)
            best_params_per_fold.append(best_params)
        
        # Aggregate results
        mean_score = np.mean(outer_scores)
        std_score = np.std(outer_scores)
        
        print("\n" + "="*80)
        print("Nested CV Results:")
        print(f"  Outer scores: {[f'{s:.4f}' for s in outer_scores]}")
        print(f"  Mean ± Std: {mean_score:.4f} ± {std_score:.4f}")
        print("="*80)
        
        return {
            'outer_scores': outer_scores,
            'mean_test_score': mean_score,
            'std_test_score': std_score,
            'best_params_per_fold': best_params_per_fold,
            'most_common_params': self._find_most_common_params(best_params_per_fold)
        }
    
    def _inner_grid_search(
        self,
        train_data,
        param_grid: Dict[str, List[Any]],
        inner_cv,
        scoring_metric: str
    ) -> tuple:
        """
        Grid search over hyperparameters using inner CV folds.
        Returns (best_params, best_score)
        """
        stratify = (
            train_data['domain'].astype(str) + '_' + 
            train_data['difficulty_label'].astype(str)
        )
        
        best_score = -np.inf
        best_params = {}
        
        # Generate all parameter combinations
        from itertools import product
        param_names = list(param_grid.keys())
        param_values = list(param_grid.values())
        
        for param_combo in product(*param_values):
            params = dict(zip(param_names, param_combo))
            
            # Evaluate this parameter combination on inner folds
            fold_scores = []
            
            for inner_train_idx, inner_val_idx in inner_cv.split(train_data, stratify):
                inner_train = train_data.iloc[inner_train_idx]
                inner_val = train_data.iloc[inner_val_idx]
                
                # Build vector DB with inner training data
                inner_db = self._build_vector_db(inner_train)
                
                # Evaluate on inner validation
                score = self._evaluate_on_test_fold(
                    inner_db,
                    inner_val,
                    params,
                    scoring_metric
                )
                fold_scores.append(score)
            
            avg_score = np.mean(fold_scores)
            
            if avg_score > best_score:
                best_score = avg_score
                best_params = params
        
        return best_params, best_score
    
    def _build_vector_db(self, train_data):
        """Build vector database from training data."""
        from benchmark_vector_db import BenchmarkVectorDB, BenchmarkQuestion
        from pathlib import Path
        import tempfile
        
        # Create temporary DB for this fold
        temp_dir = tempfile.mkdtemp()
        db = BenchmarkVectorDB(
            db_path=Path(temp_dir) / "fold_db",
            embedding_model="all-MiniLM-L6-v2"
        )
        
        # Convert dataframe to BenchmarkQuestion objects
        questions = [
            BenchmarkQuestion(
                question_id=row['question_id'],
                source_benchmark=row['source_benchmark'],
                domain=row['domain'],
                question_text=row['question_text'],
                correct_answer=row['correct_answer'],
                success_rate=row['success_rate'],
                difficulty_score=row['difficulty_score'],
                difficulty_label=row['difficulty_label']
            )
            for _, row in train_data.iterrows()
        ]
        
        db.index_questions(questions)
        return db
    
    def _evaluate_on_test_fold(
        self,
        vector_db,
        test_data,
        params: Dict[str, Any],
        metric: str
    ) -> float:
        """
        Evaluate ToGMAL on test fold with given hyperparameters.
        
        Args:
            vector_db: Vector database built from training data
            test_data: Held-out test fold
            params: Hyperparameters (e.g., k, similarity_threshold, weights)
            metric: Scoring metric (roc_auc, f1, etc.)
        """
        from sklearn.metrics import roc_auc_score, f1_score
        
        predictions = []
        ground_truth = []
        
        for _, row in test_data.iterrows():
            # Query vector DB with test question
            result = vector_db.query_similar_questions(
                prompt=row['question_text'],
                k=params.get('k_neighbors', 5)
            )
            
            # Apply adaptive scoring with hyperparameters
            risk_score = self._compute_adaptive_risk(
                result,
                params
            )
            
            predictions.append(risk_score)
            
            # Ground truth: is this question hard? (success_rate < 0.5)
            ground_truth.append(1 if row['success_rate'] < 0.5 else 0)
        
        # Compute metric
        if metric == 'roc_auc':
            return roc_auc_score(ground_truth, predictions)
        elif metric == 'f1':
            # Binarize predictions at 0.5 threshold
            binary_preds = [1 if p > 0.5 else 0 for p in predictions]
            return f1_score(ground_truth, binary_preds)
        else:
            raise ValueError(f"Unknown metric: {metric}")
    
    def _compute_adaptive_risk(
        self,
        query_result: Dict[str, Any],
        params: Dict[str, Any]
    ) -> float:
        """
        Compute risk score with adaptive uncertainty penalties.
        Uses hyperparameters from inner CV search.
        """
        similarities = [q['similarity'] for q in query_result['similar_questions']]
        difficulties = [q['difficulty_score'] for q in query_result['similar_questions']]
        
        # Base weighted average
        weights = np.array(similarities) / sum(similarities)
        base_score = np.dot(weights, difficulties)
        
        # Adaptive uncertainty penalties
        max_sim = max(similarities)
        avg_sim = np.mean(similarities)
        sim_variance = np.var(similarities)
        
        uncertainty_penalty = 0.0
        
        # Low similarity threshold (configurable)
        sim_threshold = params.get('similarity_threshold', 0.7)
        if max_sim < sim_threshold:
            uncertainty_penalty += (sim_threshold - max_sim) * params.get('low_sim_penalty', 0.5)
        
        # High variance penalty
        if sim_variance > 0.05:
            uncertainty_penalty += min(sim_variance * params.get('variance_penalty', 2.0), 0.3)
        
        # Low average similarity
        if avg_sim < 0.5:
            uncertainty_penalty += (0.5 - avg_sim) * params.get('low_avg_penalty', 0.4)
        
        # Final score
        adjusted_score = base_score + uncertainty_penalty
        
        return np.clip(adjusted_score, 0.0, 1.0)
    
    def _find_most_common_params(self, params_list: List[Dict]) -> Dict:
        """Find the most frequently selected hyperparameters across folds."""
        from collections import Counter
        
        # For each parameter, find the most common value
        all_param_names = params_list[0].keys()
        most_common = {}
        
        for param_name in all_param_names:
            values = [p[param_name] for p in params_list]
            most_common[param_name] = Counter(values).most_common(1)[0][0]
        
        return most_common


# Example usage
if __name__ == "__main__":
    import pandas as pd
    from pathlib import Path
    from benchmark_vector_db import BenchmarkVectorDB
    
    # Load all benchmark questions
    db = BenchmarkVectorDB(db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db"))
    stats = db.get_statistics()
    
    # Get all questions as dataframe (you'll need to implement this)
    all_questions_df = db.get_all_questions_as_dataframe()
    
    # Define hyperparameter search grid
    param_grid = {
        'k_neighbors': [3, 5, 7, 10],
        'similarity_threshold': [0.6, 0.7, 0.8],
        'low_sim_penalty': [0.3, 0.5, 0.7],
        'variance_penalty': [1.0, 2.0, 3.0],
        'low_avg_penalty': [0.2, 0.4, 0.6]
    }
    
    # Run nested CV
    evaluator = NestedCVEvaluator(
        benchmark_data=all_questions_df,
        outer_folds=5,  # 5-fold outer CV
        inner_folds=3   # 3-fold inner CV for hyperparameter search
    )
    
    results = evaluator.run_nested_cv(
        param_grid=param_grid,
        scoring_metric='roc_auc'
    )
    
    print("\nFinal Results:")
    print(f"Generalization Performance: {results['mean_test_score']:.4f} ± {results['std_test_score']:.4f}")
    print(f"Most Common Best Params: {results['most_common_params']}")

Key Advantages:

  • No leakage: Each outer test fold is never seen during hyperparameter tuning
  • Robust estimates: 5 different generalization scores (not just 1)
  • Automatic tuning: Inner CV finds best hyperparameters for each fold
  • Confidence intervals: Standard deviation tells you uncertainty in performance

Phase 2: Define Evaluation Metrics

Use standard OOD detection metrics + calibration metrics:

  1. AUROC (Area Under ROC Curve)

    • Threshold-independent
    • Measures overall discriminative ability
    • Gold standard for OOD detection
    • Interpretation: Probability that a random risky prompt is ranked higher than a random safe prompt
  2. FPR@TPR95 (False Positive Rate at 95% True Positive Rate)

    • How many safe prompts are incorrectly flagged when catching 95% of risky ones
    • Common in safety-critical applications
    • Lower is better (want to minimize false alarms)
  3. AUPR (Area Under Precision-Recall Curve)

    • Better for imbalanced datasets
    • Useful when risky prompts are rare
    • Focuses on positive class (risky prompts)
  4. Expected Calibration Error (ECE)

    • Are your risk probabilities accurate?
    • If you say 70% risky, is it actually 70% risky?
    • Measures gap between predicted probabilities and observed frequencies
  5. Brier Score

    • Measures accuracy of probabilistic predictions
    • Lower is better
    • Combines discrimination and calibration

from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, brier_score_loss
import numpy as np

def compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95):
    """Compute FPR when TPR is at specified threshold."""
    from sklearn.metrics import roc_curve
    
    fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
    
    # Find index where TPR >= threshold
    idx = np.argmax(tpr >= tpr_threshold)
    
    return fpr[idx]

def expected_calibration_error(y_true, y_pred_proba, n_bins=10):
    """
    Compute Expected Calibration Error (ECE).
    
    Bins predictions into n_bins buckets and measures the gap between
    predicted probability and observed frequency in each bin.
    """
    y_true = np.asarray(y_true)
    y_pred_proba = np.asarray(y_pred_proba)
    
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]
    
    ece = 0.0
    
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # Find predictions in this bin
        in_bin = (y_pred_proba > bin_lower) & (y_pred_proba <= bin_upper)
        prop_in_bin = in_bin.mean()
        
        if prop_in_bin > 0:
            # Observed frequency in this bin
            accuracy_in_bin = y_true[in_bin].mean()
            # Average predicted probability in this bin
            avg_confidence_in_bin = y_pred_proba[in_bin].mean()
            
            # Contribution to ECE
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    
    return ece

def evaluate_togmal(predictions, ground_truth):
    """
    Comprehensive evaluation of ToGMAL performance.
    
    Args:
        predictions: Dict with 'risk_score' (continuous 0-1) and 'risk_level' (categorical)
        ground_truth: Array of difficulty scores or binary labels (0=easy, 1=hard)
    
    Returns:
        Dictionary with all evaluation metrics
    """
    # Convert ground truth to binary if needed (hard = 1 if success_rate < 0.5, else 0)
    if hasattr(ground_truth, 'success_rate'):
        y_true = (ground_truth['success_rate'] < 0.5).astype(int)
    else:
        y_true = np.asarray(ground_truth)
    
    y_pred_proba = np.asarray(predictions['risk_score'])  # Continuous 0-1
    y_pred_binary = (y_pred_proba > 0.5).astype(int)  # Binarized
    
    # AUROC
    auroc = roc_auc_score(y_true, y_pred_proba)
    
    # FPR@TPR95
    fpr_at_95_tpr = compute_fpr_at_tpr(y_true, y_pred_proba, tpr_threshold=0.95)
    
    # AUPR
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    aupr = auc(recall, precision)
    
    # Calibration error
    ece = expected_calibration_error(y_true, y_pred_proba, n_bins=10)
    
    # Brier score (lower is better)
    brier = brier_score_loss(y_true, y_pred_proba)
    
    # Standard classification metrics (for reference)
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    
    accuracy = accuracy_score(y_true, y_pred_binary)
    f1 = f1_score(y_true, y_pred_binary)
    precision = precision_score(y_true, y_pred_binary)
    recall = recall_score(y_true, y_pred_binary)
    
    return {
        # Primary OOD detection metrics
        'AUROC': auroc,
        'FPR@TPR95': fpr_at_95_tpr,
        'AUPR': aupr,
        
        # Calibration metrics
        'ECE': ece,
        'Brier_Score': brier,
        
        # Standard classification (for reference)
        'Accuracy': accuracy,
        'F1': f1,
        'Precision': precision,
        'Recall': recall
    }

def print_evaluation_report(metrics: dict):
    """Pretty print evaluation metrics."""
    print("\n" + "="*80)
    print("ToGMAL Evaluation Report")
    print("="*80)
    
    print("\nOOD Detection Performance:")
    print(f"  AUROC:          {metrics['AUROC']:.4f}  (higher is better, 0.5=random, 1.0=perfect)")
    print(f"  FPR@TPR95:      {metrics['FPR@TPR95']:.4f}  (lower is better, false alarm rate)")
    print(f"  AUPR:           {metrics['AUPR']:.4f}  (higher is better)")
    
    print("\nCalibration:")
    print(f"  ECE:            {metrics['ECE']:.4f}  (lower is better, 0=perfect calibration)")
    print(f"  Brier Score:    {metrics['Brier_Score']:.4f}  (lower is better)")
    
    print("\nClassification Metrics (for reference):")
    print(f"  Accuracy:       {metrics['Accuracy']:.4f}")
    print(f"  F1 Score:       {metrics['F1']:.4f}")
    print(f"  Precision:      {metrics['Precision']:.4f}")
    print(f"  Recall:         {metrics['Recall']:.4f}")
    
    print("\n" + "="*80)

Phase 3: Out-of-Distribution Testing

Critical: Test on data that's truly OOD from your training benchmarks.

OOD Test Sets to Create:

  1. Temporal OOD: New benchmark questions released after your training data cutoff
  2. Domain Shift: Categories not in MMLU (e.g., creative writing prompts, coding challenges)
  3. Adversarial: Hand-crafted examples designed to fool the system
    • "Prove [false scientific claim]"
    • Jailbreak attempts disguised as innocent questions
    • Edge cases from your taxonomy submissions

# Sketch: the load_* helpers are placeholders for dataset loaders to be written
ood_test_sets = {
    'adversarial_false_premises': load_false_premise_examples(),
    'jailbreaks': load_jailbreak_attempts(),
    'creative_writing': load_writing_prompts(),
    'recent_benchmarks': load_benchmarks_after('2024-01'),
    'user_submissions': load_taxonomy_entries()
}

# Evaluate on each OOD set
for name, test_data in ood_test_sets.items():
    metrics = evaluate_togmal(model.predict(test_data), test_data.labels)
    print(f"{name}: AUROC={metrics['AUROC']:.3f}, FPR@95={metrics['FPR@TPR95']:.3f}")

Phase 4: Hyperparameter Tuning Protocol

Use validation set ONLY - never touch test set until final evaluation.

# Note: grid_search_cv() and train_togmal() below are conceptual placeholders,
# not existing library functions

# Parameters to tune
param_grid = {
    'similarity_threshold': [0.5, 0.6, 0.7, 0.8],
    'k_neighbors': [3, 5, 7, 10],
    'uncertainty_penalty_weight': [0.2, 0.4, 0.6],
    'heuristic_weight': [0.3, 0.4, 0.5],
    'vector_weight': [0.3, 0.4, 0.5]
}

# Cross-validation on validation set
best_params = grid_search_cv(
    togmal_model,
    param_grid,
    val_set,
    metric='AUROC',
    cv=5  # 5-fold CV within validation set
)

# Train final model with best params on train + val
final_model = train_togmal(
    train_set + val_set,
    params=best_params
)

# Evaluate ONCE on test set
final_metrics = evaluate_togmal(
    final_model.predict(test_set),
    test_set.labels
)

Implementation Roadmap

Phase 1: Adaptive Scoring Implementation (Week 1-2)

  • ✓ Implement basic vector database with 32K questions
  • Add adaptive uncertainty-aware scoring function
    • Similarity threshold penalties
    • Variance penalties for diverse matches
    • Low average similarity penalties
  • Implement domain-specific threshold calibration
  • Add multi-signal fusion (vector + heuristics)
  • Integrate into benchmark_vector_db.py::query_similar_questions()

Phase 2: Data Export & Preparation (Week 2)

  • Export all 32K questions from ChromaDB to pandas DataFrame
    • Add BenchmarkVectorDB.get_all_questions_as_dataframe() method
    • Include all metadata (domain, difficulty, success_rate, etc.)
  • Verify stratification labels (domain × difficulty)
  • Create initial train/val/test split (simple 70/15/15) for baseline (see the sketch after this list)
  • Document dataset statistics per split
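
A minimal sketch of the baseline 70/15/15 split referenced above, assuming the DataFrame produced by get_all_questions_as_dataframe(); two chained train_test_split calls with (domain, difficulty) stratification are one straightforward way to get it:

from sklearn.model_selection import train_test_split

def make_baseline_split(df, seed: int = 42):
    """70/15/15 train/val/test split stratified by (domain, difficulty_label)."""
    strata = df['domain'].astype(str) + '_' + df['difficulty_label'].astype(str)
    
    # Note: strata with fewer than 2 rows must be merged or dropped before stratifying
    # First carve off 30% for validation + test
    train_df, holdout_df = train_test_split(
        df, test_size=0.30, stratify=strata, random_state=seed
    )
    
    # Split the holdout in half: 15% validation, 15% test
    holdout_strata = (
        holdout_df['domain'].astype(str) + '_' + holdout_df['difficulty_label'].astype(str)
    )
    val_df, test_df = train_test_split(
        holdout_df, test_size=0.50, stratify=holdout_strata, random_state=seed
    )
    return train_df, val_df, test_df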

Phase 3: Nested CV Framework (Week 3)

  • Implement NestedCVEvaluator class
    • Outer CV loop (5-fold stratified)
    • Inner CV loop (3-fold grid search)
    • Temporary vector DB creation per fold
  • Define hyperparameter search grid
    • k_neighbors: [3, 5, 7, 10]
    • similarity_threshold: [0.6, 0.7, 0.8]
    • low_sim_penalty: [0.3, 0.5, 0.7]
    • variance_penalty: [1.0, 2.0, 3.0]
    • low_avg_penalty: [0.2, 0.4, 0.6]
  • Implement evaluation metrics (AUROC, FPR@TPR95, ECE)

Phase 4: Baseline Evaluation (Week 3-4)

  • Run current ToGMAL (naive weighted average) on simple split
  • Compute baseline metrics:
    • AUROC on test set
    • FPR@TPR95
    • Expected Calibration Error
    • Brier Score
  • Analyze failure modes:
    • Low similarity cases (max_sim < 0.6)
    • High variance matches
    • Cross-domain queries
  • Document baseline performance for comparison

Phase 5: Nested CV Hyperparameter Tuning (Week 4-5)

  • Run full nested CV (5 outer × 3 inner folds = 15 inner fits per parameter setting, plus 5 outer evaluations)
  • Track computational cost (time per fold)
  • Collect best hyperparameters per outer fold
  • Identify most common optimal parameters
  • Compute mean ± std generalization performance

Phase 6: Final Model Training (Week 5)

  • Train final model on ALL 32K questions with best hyperparameters
  • Re-index full vector database
  • Update togmal_mcp.py to use adaptive scoring
  • Deploy to MCP server and HTTP facade

Phase 7: OOD Testing (Week 6)

  • Create OOD test sets:
    • Adversarial: Hand-crafted edge cases
      • "Prove [false scientific claim]"
      • Jailbreak attempts disguised as questions
      • Taxonomy submissions from users
    • Domain Shift: Categories not in MMLU
      • Creative writing prompts
      • Code generation tasks
      • Real-world user queries
    • Temporal OOD: New benchmarks (2024+)
      • SimpleQA (if available)
      • Latest MMLU updates
  • Evaluate on each OOD set
  • Analyze degradation vs. in-distribution performance

Phase 8: Iteration & Documentation (Week 7)

  • Analyze failures on OOD sets
  • Add new heuristics for missed patterns
  • Re-run nested CV with updated features
  • Generate calibration plots (reliability diagrams; see the sketch after this list)
  • Write technical report:
    • Methodology (nested CV protocol)
    • Results (baseline vs. adaptive)
    • Ablation studies (each penalty component)
    • OOD generalization analysis
    • Failure mode documentation
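
A minimal sketch of the reliability diagram mentioned above, assuming matplotlib and the same (y_true, y_pred_proba) arrays used by expected_calibration_error(); the binning mirrors the ECE computation so the plot and the metric stay consistent:

import numpy as np
import matplotlib.pyplot as plt

def plot_reliability_diagram(y_true, y_pred_proba, n_bins: int = 10, path: str = "reliability.png"):
    """Reliability diagram: observed frequency of hard questions vs. mean predicted risk per bin."""
    y_true = np.asarray(y_true)
    y_pred_proba = np.asarray(y_pred_proba)
    bins = np.linspace(0, 1, n_bins + 1)
    
    bin_conf, bin_acc = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (y_pred_proba > lo) & (y_pred_proba <= hi)
        if in_bin.any():
            bin_conf.append(y_pred_proba[in_bin].mean())  # mean predicted risk in bin
            bin_acc.append(y_true[in_bin].mean())         # observed fraction of hard questions
    
    plt.figure(figsize=(5, 5))
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    plt.plot(bin_conf, bin_acc, 'o-', label='ToGMAL risk scores')
    plt.xlabel('Mean predicted risk (per bin)')
    plt.ylabel('Observed fraction of hard questions')
    plt.title('Reliability Diagram')
    plt.legend()
    plt.savefig(path, dpi=150, bbox_inches='tight')
    plt.close()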

Expected Improvements

Based on OOD detection literature and nested CV best practices:

  1. Adaptive scoring should improve AUROC by 5-15% on low-similarity cases

    • Baseline: ~0.75 AUROC (naive weighted average)
    • Target: ~0.85+ AUROC (adaptive with uncertainty)
  2. Nested CV will give honest performance estimates

    • Simple train/test: Single point estimate (could be lucky/unlucky)
    • Nested CV: Mean ± std across 5 folds (robust estimate)
  3. Domain calibration should reduce false positives by 10-20%

    • Expected: FPR@TPR95 drops from ~0.25 to ~0.15
  4. Multi-signal fusion should catch edge cases like "prove false premise"

    • Combine vector similarity + rule-based heuristics
    • Expected: Improved recall on adversarial examples
  5. Calibration improvements

    • Expected Calibration Error (ECE) < 0.05
    • Better alignment between predicted risk and actual difficulty

Validation Checklist

Before deploying to production:

  • ✓ Nested CV completed with no data leakage
  • ✓ Hyperparameters tuned on inner CV folds only
  • ✓ Generalization performance estimated on outer CV folds
  • ✓ OOD sets tested (adversarial, domain-shift, temporal)
  • ✓ Calibration error measured and within acceptable range (ECE < 0.1)
  • ✓ Failure modes documented with specific examples
  • ✓ Ablation studies show each component contributes positively
  • ✓ Performance comparison: adaptive > baseline on all metrics
  • ✓ Real-world testing with user queries from taxonomy submissions

Key References

  1. Similarity Thresholds: Cosine similarity 0.7-0.8 recommended as starting point for "relevant" matches; lower values increasingly unreliable
  2. OOD Metrics: AUROC, FPR@TPR95 are standard; conformal prediction provides probabilistic guarantees
  3. Adaptive Methods: Uncertainty-aware thresholds outperform fixed thresholds in retrieval tasks
  4. Holdout Validation: 60-20-20 or 70-15-15 splits common; stratification by domain/difficulty essential
  5. Calibration: Expected Calibration Error (ECE) measures if predicted probabilities match observed frequencies
  6. Nested CV: Gold standard for hyperparameter tuning; prevents leakage from repeated validation peeking
  7. Stratified K-Fold: Maintains class distribution across folds; essential for imbalanced datasets

Quick Start: Immediate Implementation

Step 1: Add Adaptive Scoring to benchmark_vector_db.py (Today)

Replace the naive weighted average in query_similar_questions() with adaptive uncertainty-aware scoring:

def query_similar_questions(
    self,
    prompt: str,
    k: int = 5,
    domain_filter: Optional[str] = None,
    # NEW: Adaptive scoring parameters
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4
) -> Dict[str, Any]:
    """Find k most similar benchmark questions with adaptive uncertainty penalties."""
    
    # ... existing code to query ChromaDB ...
    
    # Extract similarities and difficulty scores
    similarities = []
    difficulty_scores = []
    success_rates = []
    
    for i in range(len(results['ids'][0])):
        metadata = results['metadatas'][0][i]
        distance = results['distances'][0][i]
        
        # Convert L2 distance to cosine similarity (valid for unit-normalized embeddings)
        similarity = max(0, 1 - (distance ** 2) / 2)
        
        similarities.append(similarity)
        difficulty_scores.append(metadata['difficulty_score'])
        success_rates.append(metadata['success_rate'])
    
    # IMPROVED: Adaptive uncertainty-aware scoring
    weighted_difficulty = self._compute_adaptive_difficulty(
        similarities=similarities,
        difficulty_scores=difficulty_scores,
        similarity_threshold=similarity_threshold,
        low_sim_penalty=low_sim_penalty,
        variance_penalty=variance_penalty,
        low_avg_penalty=low_avg_penalty
    )
    
    # ... rest of existing code ...

def _compute_adaptive_difficulty(
    self,
    similarities: List[float],
    difficulty_scores: List[float],
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4
) -> float:
    """
    Compute difficulty score with adaptive uncertainty penalties.
    
    Key insight: When retrieved questions have low similarity to the prompt,
    we should INCREASE the risk estimate because we're extrapolating.
    
    Args:
        similarities: Cosine similarities of k-NN results
        difficulty_scores: Difficulty scores (1 - success_rate) of k-NN results
        similarity_threshold: Below this, apply low similarity penalty (default: 0.7)
        low_sim_penalty: Weight for low similarity penalty (default: 0.5)
        variance_penalty: Weight for high variance penalty (default: 2.0)
        low_avg_penalty: Weight for low average similarity penalty (default: 0.4)
    
    Returns:
        Adjusted difficulty score (0.0 to 1.0, higher = more risky)
    """
    import numpy as np
    
    # Base weighted average (original approach)
    weights = np.array(similarities) / sum(similarities)
    base_score = np.dot(weights, difficulty_scores)
    
    # Compute uncertainty indicators
    max_sim = max(similarities)
    avg_sim = np.mean(similarities)
    sim_variance = np.var(similarities)
    
    # Initialize uncertainty penalty
    uncertainty_penalty = 0.0
    
    # Penalty 1: Low maximum similarity
    # If best match is weak, we're likely OOD
    if max_sim < similarity_threshold:
        penalty = (similarity_threshold - max_sim) * low_sim_penalty
        uncertainty_penalty += penalty
        logger.debug(f"Low max similarity penalty: {penalty:.3f} (max_sim={max_sim:.3f})")
    
    # Penalty 2: High variance in similarities
    # If k-NN results are very dissimilar to each other, matches are unreliable
    variance_threshold = 0.05
    if sim_variance > variance_threshold:
        penalty = min(sim_variance * variance_penalty, 0.3)  # Cap at 0.3
        uncertainty_penalty += penalty
        logger.debug(f"High variance penalty: {penalty:.3f} (variance={sim_variance:.3f})")
    
    # Penalty 3: Low average similarity
    # If ALL matches are weak, we're definitely OOD
    avg_threshold = 0.5
    if avg_sim < avg_threshold:
        penalty = (avg_threshold - avg_sim) * low_avg_penalty
        uncertainty_penalty += penalty
        logger.debug(f"Low avg similarity penalty: {penalty:.3f} (avg_sim={avg_sim:.3f})")
    
    # Final adjusted score
    adjusted_score = base_score + uncertainty_penalty
    
    # Clip to [0, 1] range
    adjusted_score = np.clip(adjusted_score, 0.0, 1.0)
    
    logger.info(
        f"Adaptive scoring: base={base_score:.3f}, penalty={uncertainty_penalty:.3f}, "
        f"adjusted={adjusted_score:.3f}"
    )
    
    return adjusted_score

Why this helps:

  • "Prove universe is 10,000 years old" example: max_sim=0.57 triggers the low similarity penalty → risk increases from MODERATE to HIGH (worked through in the sketch below)
  • Unrelated k-NN matches: high variance → additional penalty → correctly flags as uncertain
  • Novel domains: low average similarity across all matches → strong penalty → CRITICAL risk
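
As a rough sanity check on the first bullet (only the 0.57 max similarity comes from the observed example; the base score and average similarity are assumed for the sake of the arithmetic):

# Illustrative numbers: base weighted difficulty in the MODERATE band, weak matches overall
base_score = 0.55                      # MODERATE (< 0.6)
max_sim, avg_sim = 0.57, 0.48          # max_sim observed; avg_sim assumed

penalty  = (0.7 - max_sim) * 0.5       # 0.065  low-max-similarity penalty
penalty += (0.5 - avg_sim) * 0.4       # 0.008  low-average-similarity penalty

adjusted = base_score + penalty        # ~0.62 -> crosses the 0.6 boundary into HIGH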

Step 2: Export Database for Evaluation (This Week)

Add method to export all questions as DataFrame for nested CV:

def get_all_questions_as_dataframe(self) -> 'pd.DataFrame':
    """
    Export all questions from ChromaDB as a pandas DataFrame.
    Used for train/val/test splitting and nested CV.
    
    Returns:
        DataFrame with columns:
        - question_id, source_benchmark, domain, question_text,
        - correct_answer, success_rate, difficulty_score, difficulty_label
    """
    import pandas as pd
    
    count = self.collection.count()
    logger.info(f"Exporting {count} questions from vector database...")
    
    # Get all questions from ChromaDB
    all_data = self.collection.get(
        limit=count,
        include=["metadatas", "documents"]
    )
    
    # Convert to DataFrame
    rows = []
    for i, qid in enumerate(all_data['ids']):
        metadata = all_data['metadatas'][i]
        rows.append({
            'question_id': qid,
            'question_text': all_data['documents'][i],
            'source_benchmark': metadata['source'],
            'domain': metadata['domain'],
            'success_rate': metadata['success_rate'],
            'difficulty_score': metadata['difficulty_score'],
            'difficulty_label': metadata['difficulty_label'],
            'num_models_tested': metadata.get('num_models', 0)
        })
    
    df = pd.DataFrame(rows)
    
    logger.info(f"Exported {len(df)} questions to DataFrame")
    logger.info(f"  Domains: {df['domain'].nunique()}")
    logger.info(f"  Sources: {df['source_benchmark'].nunique()}")
    
    return df

Step 3: Test Adaptive Scoring Immediately

Create a test script to compare baseline vs. adaptive:

#!/usr/bin/env python3
"""Test adaptive scoring improvements."""

from benchmark_vector_db import BenchmarkVectorDB
from pathlib import Path

# Initialize database
db = BenchmarkVectorDB(
    db_path=Path("/Users/hetalksinmaths/togmal/data/benchmark_vector_db")
)

# Test cases that should trigger uncertainty penalties
test_cases = [
    # Low similarity - should get penalty
    "Prove that the universe is exactly 10,000 years old using thermodynamics",
    
    # Novel domain - should get penalty
    "Write a haiku about quantum entanglement in 17th century Japanese",
    
    # Should match well - no penalty
    "What is the capital of France?",
    
    # Should match GPQA physics - no penalty
    "Calculate the quantum correction to the partition function for a 3D harmonic oscillator"
]

print("="*80)
print("Adaptive Scoring Test")
print("="*80)

for prompt in test_cases:
    print(f"\nPrompt: {prompt[:100]}...")
    
    result = db.query_similar_questions(prompt, k=5)
    
    print(f"  Max Similarity: {max(q['similarity'] for q in result['similar_questions']):.3f}")
    print(f"  Avg Similarity: {result['avg_similarity']:.3f}")
    print(f"  Weighted Difficulty: {result['weighted_difficulty_score']:.3f}")
    print(f"  Risk Level: {result['risk_level']}")
    print(f"  Top Match: {result['similar_questions'][0]['domain']} - {result['similar_questions'][0]['source']}")

Next Steps

  1. Immediate: Implement train/val/test split of benchmark data
  2. This week: Add similarity-based uncertainty penalties
  3. Next week: Run validation experiments with different thresholds
  4. End of month: Complete evaluation on test set + OOD sets
  5. Ongoing: Build adversarial test set from user submissions