# ToGMAL Next Steps: Adaptive Scoring & Nested CV

## Updated: 2025-10-21

This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.

---

## 🎯 Immediate Goals (This Week)
### 1. **Implement Adaptive Uncertainty-Aware Scoring**

- **Problem**: The current naive weighted average fails on low-similarity matches
- **Example failure**: "Prove universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk
- **Solution**: Add uncertainty penalties (see the sketch after this list) when:
  - Max similarity < 0.7 (weak best match)
  - High variance in k-NN similarities (diverse, unreliable matches)
  - Low average similarity (all matches are weak)
- **File to modify**: `benchmark_vector_db.py::query_similar_questions()`
- **Expected improvement**: 5-15% AUROC gain on low-similarity cases
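A minimal sketch of the planned penalty logic, assuming the raw weighted-average score and the k-NN similarities are already available; the function name, parameter names, and default values are illustrative assumptions rather than the final `benchmark_vector_db.py` API:

```python
import numpy as np

def compute_adaptive_difficulty(
    base_score: float,            # naive weighted-average difficulty from the k-NN matches
    similarities: np.ndarray,     # cosine similarities of the k nearest benchmark questions
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4,
) -> float:
    """Push the score toward 'uncertain / higher risk' when the k-NN evidence is weak."""
    uncertainty = 0.0
    # Weak best match: even the closest benchmark question is not actually close.
    if similarities.max() < similarity_threshold:
        uncertainty += low_sim_penalty * (similarity_threshold - similarities.max())
    # Diverse, unreliable matches: the neighbours disagree about what the prompt is.
    uncertainty += variance_penalty * float(similarities.var())
    # All matches are weak: the average similarity is low as well.
    if similarities.mean() < similarity_threshold:
        uncertainty += low_avg_penalty * (similarity_threshold - similarities.mean())
    # Blend toward maximum difficulty (1.0) in proportion to the accumulated uncertainty.
    return float(np.clip(base_score + uncertainty * (1.0 - base_score), 0.0, 1.0))
```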
### 2. **Export Database for Evaluation**

- Add a `get_all_questions_as_dataframe()` method to export all 32K questions
- Prepare for train/val/test splitting and nested CV
- **File to modify**: `benchmark_vector_db.py`
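A rough sketch of what that export could look like; it assumes a Chroma-style collection exposing `get(include=["documents", "metadatas"])` and metadata keys named `domain`, `source`, and `difficulty`, none of which are confirmed here:

```python
import pandas as pd

def get_all_questions_as_dataframe(self) -> pd.DataFrame:
    """Dump every stored question plus metadata for offline splitting and nested CV."""
    # Assumes a Chroma-style collection; adapt to the actual vector store API.
    records = self.collection.get(include=["documents", "metadatas"])
    rows = [
        {
            "question": doc,
            "domain": meta.get("domain"),
            "source": meta.get("source"),
            "difficulty": meta.get("difficulty"),
        }
        for doc, meta in zip(records["documents"], records["metadatas"])
    ]
    return pd.DataFrame(rows)
```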
### 3. **Test Adaptive Scoring**

- Create a test script with edge cases
- Compare baseline vs. adaptive scoring on known failure modes
- **New file**: `test_adaptive_scoring.py` (sketched below)
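A possible skeleton for that script; the `BenchmarkVectorDB` constructor, the `adaptive` flag, and the `risk` result key are assumptions about the planned interface rather than existing code:

```python
"""test_adaptive_scoring.py -- compare baseline vs. adaptive scoring on edge cases."""
from benchmark_vector_db import BenchmarkVectorDB  # hypothetical class name

EDGE_CASES = [
    "Prove that the universe is 10,000 years old",  # low-similarity adversarial premise
    "What is the capital of France?",               # easy factual recall (control)
]

def main() -> None:
    db = BenchmarkVectorDB()
    for prompt in EDGE_CASES:
        baseline = db.query_similar_questions(prompt, adaptive=False)  # assumed flag
        adaptive = db.query_similar_questions(prompt, adaptive=True)
        print(f"{prompt!r}: baseline={baseline['risk']}  adaptive={adaptive['risk']}")

if __name__ == "__main__":
    main()
```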
---

## 📊 Evaluation Framework (Next 2-3 Weeks)

### Why Nested Cross-Validation?

**Problems with a simple train/val/test split:**

- A single validation set can be lucky/unlucky (unrepresentative)
- Repeated "peeking" at the validation set during hyperparameter search causes data leakage
- The test set gives only ONE performance estimate (high variance)

**Nested CV advantages:**

- **Outer loop**: 5-fold CV for an unbiased generalization estimate
- **Inner loop**: 3-fold grid search for hyperparameter tuning
- **No leakage**: Test folds are never seen during tuning
- **Robust**: Multiple performance estimates across 5 different test sets
### Hyperparameters to Tune

```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```
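One way the outer/inner structure described above could be wired to this grid, sketched as plain loops; `build_db()` and `evaluate()` stand in for project-specific code, and the `domain`/`difficulty` column names and stratification choice are assumptions:

```python
from itertools import product

import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_cv(questions_df, param_grid, outer_splits=5, inner_splits=3, seed=42):
    """Return mean and std of the chosen metric across the outer test folds."""
    # build_db() and evaluate() are placeholders for project-specific code.
    # Stratify folds by domain × difficulty so every fold covers every stratum.
    strata = questions_df["domain"].astype(str) + "_" + questions_df["difficulty"].astype(str)
    grid = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]
    outer = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=seed)
    outer_scores = []
    for train_idx, test_idx in outer.split(questions_df, strata):
        train_df, train_strata = questions_df.iloc[train_idx], strata.iloc[train_idx]
        inner = StratifiedKFold(n_splits=inner_splits, shuffle=True, random_state=seed)
        best_params, best_score = None, -np.inf
        for params in grid:  # inner grid search, never touches the outer test fold
            fold_scores = []
            for tr, va in inner.split(train_df, train_strata):
                db = build_db(train_df.iloc[tr])                  # temporary vector DB per fold
                fold_scores.append(evaluate(db, train_df.iloc[va], params))
            if np.mean(fold_scores) > best_score:
                best_score, best_params = float(np.mean(fold_scores)), params
        db = build_db(train_df)                                   # refit on the full outer-train split
        outer_scores.append(evaluate(db, questions_df.iloc[test_idx], best_params))
    return float(np.mean(outer_scores)), float(np.std(outer_scores))
```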
### Evaluation Metrics

1. **AUROC** (primary): discriminative ability (0.5 = random, 1.0 = perfect)
2. **FPR@TPR95**: false positive rate when catching 95% of risky prompts
3. **AUPR**: area under the precision-recall curve (well suited to imbalanced data)
4. **Expected Calibration Error (ECE)**: are the predicted probabilities accurate?
5. **Brier Score**: overall probabilistic prediction accuracy
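A sketch of how these five metrics could be computed with scikit-learn and NumPy, where `y_true` is 1 for genuinely hard/risky prompts and `y_score` is the predicted risk in [0, 1]:

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
    roc_curve,
)

def fpr_at_tpr95(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """False positive rate at the first threshold reaching 95% true positive rate."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(fpr[np.argmax(tpr >= 0.95)])

def expected_calibration_error(y_true: np.ndarray, y_score: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between predicted risk and observed frequency per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        mask = (y_score >= lo) & ((y_score < hi) | (i == n_bins - 1))  # last bin includes 1.0
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_score[mask].mean())
    return float(ece)

def compute_metrics(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "aupr": average_precision_score(y_true, y_score),
        "fpr_at_tpr95": fpr_at_tpr95(y_true, y_score),
        "ece": expected_calibration_error(y_true, y_score),
        "brier": brier_score_loss(y_true, y_score),
    }
```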
---

## 🗂️ Implementation Phases

### Phase 1: Adaptive Scoring (This Week)

- [x] 32K vector database with 20 domains, 7 benchmark sources
- [ ] Add `_compute_adaptive_difficulty()` method
- [ ] Integrate uncertainty penalties into scoring
- [ ] Test on known failure cases
- [ ] Update `togmal_mcp.py` to use adaptive scoring
### Phase 2: Data Export & Baseline (Week 2)

- [ ] Add `get_all_questions_as_dataframe()` export method
- [ ] Create a simple 70/15/15 train/val/test split (see the sketch after this list)
- [ ] Run current ToGMAL (baseline) on the test set
- [ ] Compute baseline metrics:
  - AUROC
  - FPR@TPR95
  - Expected Calibration Error
  - Brier Score
- [ ] Document failure modes (low similarity, cross-domain, etc.)
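A minimal sketch of that split, assuming the dataframe from `get_all_questions_as_dataframe()` carries `domain` and `difficulty` columns to stratify on:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    """70/15/15 train/val/test split, stratified by domain × difficulty."""
    strata = df["domain"].astype(str) + "_" + df["difficulty"].astype(str)
    train_df, temp_df = train_test_split(df, test_size=0.30, stratify=strata, random_state=seed)
    val_df, test_df = train_test_split(
        temp_df, test_size=0.50, stratify=strata.loc[temp_df.index], random_state=seed
    )
    return train_df, val_df, test_df
```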
### Phase 3: Nested CV Implementation (Week 3)

- [ ] Implement `NestedCVEvaluator` class
- [ ] Outer CV: 5-fold, stratified by (domain × difficulty)
- [ ] Inner CV: 3-fold grid search over hyperparameters
- [ ] Temporary vector DB creation per fold
- [ ] Metrics computation on each outer fold
### Phase 4: Hyperparameter Tuning (Week 4)

- [ ] Run the full nested CV (5 outer folds × 3 inner folds = 15 inner train/validation runs per hyperparameter setting)
- [ ] Collect the best hyperparameters per fold
- [ ] Identify the most common optimal parameters
- [ ] Compute mean ± std generalization performance
- [ ] Compare to baseline
### Phase 5: Final Model & Deployment (Week 5)

- [ ] Train final model on ALL 32K questions with best hyperparameters
- [ ] Re-index full vector database
- [ ] Deploy to MCP server and HTTP facade
- [ ] Test with Claude Desktop

### Phase 6: OOD Testing (Week 6)

- [ ] Create OOD test sets:
  - **Adversarial**: "Prove false premises", jailbreaks
  - **Domain Shift**: Creative writing, coding, real user queries
  - **Temporal**: New benchmarks (2024+)
- [ ] Evaluate on each OOD set
- [ ] Analyze performance degradation vs. in-distribution
### Phase 7: Iteration & Documentation (Week 7)

- [ ] Analyze failures on the OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with the updated features
- [ ] Generate calibration plots (reliability diagrams; see the sketch after this list)
- [ ] Write a technical report
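A minimal reliability-diagram sketch using scikit-learn's `calibration_curve` and matplotlib; the file name and labels are placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

def plot_reliability_diagram(y_true: np.ndarray, y_score: np.ndarray,
                             n_bins: int = 10, path: str = "reliability.png") -> None:
    """Plot predicted risk vs. observed fraction of hard/risky prompts per bin."""
    frac_pos, mean_pred = calibration_curve(y_true, y_score, n_bins=n_bins)
    plt.figure(figsize=(5, 5))
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(mean_pred, frac_pos, "o-", label="ToGMAL risk scores")
    plt.xlabel("Predicted risk")
    plt.ylabel("Observed fraction of hard/risky prompts")
    plt.legend()
    plt.savefig(path, dpi=150)
```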
---

## 📈 Expected Improvements

Based on the OOD detection literature and nested CV best practices:

1. **Adaptive scoring**: +5-15% AUROC on low-similarity cases
   - Baseline: ~0.75 AUROC (naive weighted average)
   - Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV**: honest, robust performance estimates
   - Simple split: a single point estimate (could be lucky/unlucky)
   - Nested CV: mean ± std across 5 folds
3. **Domain calibration**: 10-20% fewer false positives
   - Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion**: better edge case detection
   - Combine vector similarity with rule-based heuristics
   - Improved recall on adversarial examples
5. **Calibration**: ECE < 0.05
   - Better alignment between predicted risk and actual difficulty
---

## ✅ Validation Checklist (Before Production Deploy)

- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes
- [ ] Performance: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries
---

## 🚀 Quick Start

See `togmal_improvement_plan.md` for full implementation details, including:

- Complete code for the `NestedCVEvaluator` class
- The adaptive scoring implementation
- All evaluation metrics with examples
- A detailed roadmap with weekly milestones

**Next action**: Implement adaptive scoring in `benchmark_vector_db.py` and test it on the edge cases above.