# ToGMAL Next Steps: Adaptive Scoring & Nested CV

## Updated: 2025-10-21

This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.

---

## 🎯 Immediate Goals (This Week)
### 1. **Implement Adaptive Uncertainty-Aware Scoring**

- **Problem**: The current naive weighted average fails on low-similarity matches
- **Example failure**: "Prove universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk
- **Solution**: Add uncertainty penalties (see the sketch after this list) when:
  - Max similarity < 0.7 (weak best match)
  - High variance in k-NN similarities (diverse, unreliable matches)
  - Low average similarity (all matches are weak)
- **File to modify**: `benchmark_vector_db.py::query_similar_questions()`
- **Expected improvement**: 5-15% AUROC gain on low-similarity cases
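A minimal sketch of the planned penalty logic, assuming the raw weighted-average score and the k-NN similarities are already available; the function name, parameter names, and default values are illustrative assumptions rather than the final `benchmark_vector_db.py` API:

```python
import numpy as np

def compute_adaptive_difficulty(
    base_score: float,            # naive weighted-average difficulty from the k-NN matches
    similarities: np.ndarray,     # cosine similarities of the k nearest benchmark questions
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4,
) -> float:
    """Push the score toward 'uncertain / higher risk' when the k-NN evidence is weak."""
    uncertainty = 0.0
    # Weak best match: even the closest benchmark question is not actually close.
    if similarities.max() < similarity_threshold:
        uncertainty += low_sim_penalty * (similarity_threshold - similarities.max())
    # Diverse, unreliable matches: the neighbours disagree about what the prompt is.
    uncertainty += variance_penalty * float(similarities.var())
    # All matches are weak: the average similarity is low as well.
    if similarities.mean() < similarity_threshold:
        uncertainty += low_avg_penalty * (similarity_threshold - similarities.mean())
    # Blend toward maximum difficulty (1.0) in proportion to the accumulated uncertainty.
    return float(np.clip(base_score + uncertainty * (1.0 - base_score), 0.0, 1.0))
```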
### 2. **Export Database for Evaluation**

- Add a `get_all_questions_as_dataframe()` method to export all 32K questions
- Prepare for train/val/test splitting and nested CV
- **File to modify**: `benchmark_vector_db.py`
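A rough sketch of what that export could look like; it assumes a Chroma-style collection exposing `get(include=["documents", "metadatas"])` and metadata keys named `domain`, `source`, and `difficulty`, none of which are confirmed here:

```python
import pandas as pd

def get_all_questions_as_dataframe(self) -> pd.DataFrame:
    """Dump every stored question plus metadata for offline splitting and nested CV."""
    # Assumes a Chroma-style collection; adapt to the actual vector store API.
    records = self.collection.get(include=["documents", "metadatas"])
    rows = [
        {
            "question": doc,
            "domain": meta.get("domain"),
            "source": meta.get("source"),
            "difficulty": meta.get("difficulty"),
        }
        for doc, meta in zip(records["documents"], records["metadatas"])
    ]
    return pd.DataFrame(rows)
```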
### 3. **Test Adaptive Scoring**

- Create a test script with edge cases
- Compare baseline vs. adaptive scoring on known failure modes
- **New file**: `test_adaptive_scoring.py` (sketched below)
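A possible skeleton for that script; the `BenchmarkVectorDB` constructor, the `adaptive` flag, and the `risk` result key are assumptions about the planned interface rather than existing code:

```python
"""test_adaptive_scoring.py -- compare baseline vs. adaptive scoring on edge cases."""
from benchmark_vector_db import BenchmarkVectorDB  # hypothetical class name

EDGE_CASES = [
    "Prove that the universe is 10,000 years old",  # low-similarity adversarial premise
    "What is the capital of France?",               # easy factual recall (control)
]

def main() -> None:
    db = BenchmarkVectorDB()
    for prompt in EDGE_CASES:
        baseline = db.query_similar_questions(prompt, adaptive=False)  # assumed flag
        adaptive = db.query_similar_questions(prompt, adaptive=True)
        print(f"{prompt!r}: baseline={baseline['risk']}  adaptive={adaptive['risk']}")

if __name__ == "__main__":
    main()
```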
---

## 📊 Evaluation Framework (Next 2-3 Weeks)

### Why Nested Cross-Validation?

**Problems with a simple train/val/test split:**

- A single validation set can be lucky/unlucky (unrepresentative)
- Repeated "peeking" at the validation set during hyperparameter search causes data leakage
- The test set gives only ONE performance estimate (high variance)

**Nested CV advantages:**

- **Outer loop**: 5-fold CV for an unbiased generalization estimate
- **Inner loop**: 3-fold grid search for hyperparameter tuning
- **No leakage**: Test folds are never seen during tuning
- **Robust**: Multiple performance estimates across 5 different test sets
### Hyperparameters to Tune

```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```
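One way the outer/inner structure described above could be wired to this grid, sketched as plain loops; `build_db()` and `evaluate()` stand in for project-specific code, and the `domain`/`difficulty` column names and stratification choice are assumptions:

```python
from itertools import product

import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_cv(questions_df, param_grid, outer_splits=5, inner_splits=3, seed=42):
    """Return mean and std of the chosen metric across the outer test folds."""
    # build_db() and evaluate() are placeholders for project-specific code.
    # Stratify folds by domain × difficulty so every fold covers every stratum.
    strata = questions_df["domain"].astype(str) + "_" + questions_df["difficulty"].astype(str)
    grid = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]
    outer = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=seed)
    outer_scores = []
    for train_idx, test_idx in outer.split(questions_df, strata):
        train_df, train_strata = questions_df.iloc[train_idx], strata.iloc[train_idx]
        inner = StratifiedKFold(n_splits=inner_splits, shuffle=True, random_state=seed)
        best_params, best_score = None, -np.inf
        for params in grid:  # inner grid search, never touches the outer test fold
            fold_scores = []
            for tr, va in inner.split(train_df, train_strata):
                db = build_db(train_df.iloc[tr])                  # temporary vector DB per fold
                fold_scores.append(evaluate(db, train_df.iloc[va], params))
            if np.mean(fold_scores) > best_score:
                best_score, best_params = float(np.mean(fold_scores)), params
        db = build_db(train_df)                                   # refit on the full outer-train split
        outer_scores.append(evaluate(db, questions_df.iloc[test_idx], best_params))
    return float(np.mean(outer_scores)), float(np.std(outer_scores))
```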
### Evaluation Metrics

1. **AUROC** (primary): discriminative ability (0.5 = random, 1.0 = perfect)
2. **FPR@TPR95**: false positive rate when catching 95% of risky prompts
3. **AUPR**: area under the precision-recall curve (well suited to imbalanced data)
4. **Expected Calibration Error (ECE)**: are the predicted probabilities accurate?
5. **Brier Score**: overall probabilistic prediction accuracy
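A sketch of how these five metrics could be computed with scikit-learn and NumPy, where `y_true` is 1 for genuinely hard/risky prompts and `y_score` is the predicted risk in [0, 1]:

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
    roc_curve,
)

def fpr_at_tpr95(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """False positive rate at the first threshold reaching 95% true positive rate."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(fpr[np.argmax(tpr >= 0.95)])

def expected_calibration_error(y_true: np.ndarray, y_score: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between predicted risk and observed frequency per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        mask = (y_score >= lo) & ((y_score < hi) | (i == n_bins - 1))  # last bin includes 1.0
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_score[mask].mean())
    return float(ece)

def compute_metrics(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "aupr": average_precision_score(y_true, y_score),
        "fpr_at_tpr95": fpr_at_tpr95(y_true, y_score),
        "ece": expected_calibration_error(y_true, y_score),
        "brier": brier_score_loss(y_true, y_score),
    }
```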
---

## 🗂️ Implementation Phases

### Phase 1: Adaptive Scoring (This Week)

- [x] 32K vector database with 20 domains, 7 benchmark sources
- [ ] Add `_compute_adaptive_difficulty()` method
- [ ] Integrate uncertainty penalties into scoring
- [ ] Test on known failure cases
- [ ] Update `togmal_mcp.py` to use adaptive scoring
### Phase 2: Data Export & Baseline (Week 2)

- [ ] Add `get_all_questions_as_dataframe()` export method
- [ ] Create a simple 70/15/15 train/val/test split (see the sketch after this list)
- [ ] Run current ToGMAL (baseline) on the test set
- [ ] Compute baseline metrics:
  - AUROC
  - FPR@TPR95
  - Expected Calibration Error
  - Brier Score
- [ ] Document failure modes (low similarity, cross-domain, etc.)
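A minimal sketch of that split, assuming the dataframe from `get_all_questions_as_dataframe()` carries `domain` and `difficulty` columns to stratify on:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    """70/15/15 train/val/test split, stratified by domain × difficulty."""
    strata = df["domain"].astype(str) + "_" + df["difficulty"].astype(str)
    train_df, temp_df = train_test_split(df, test_size=0.30, stratify=strata, random_state=seed)
    val_df, test_df = train_test_split(
        temp_df, test_size=0.50, stratify=strata.loc[temp_df.index], random_state=seed
    )
    return train_df, val_df, test_df
```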
### Phase 3: Nested CV Implementation (Week 3)

- [ ] Implement `NestedCVEvaluator` class
- [ ] Outer CV: 5-fold, stratified by (domain × difficulty)
- [ ] Inner CV: 3-fold grid search over hyperparameters
- [ ] Temporary vector DB creation per fold
- [ ] Metrics computation on each outer fold
### Phase 4: Hyperparameter Tuning (Week 4)

- [ ] Run the full nested CV (5 outer folds × 3 inner folds = 15 inner train/validation runs per hyperparameter setting)
- [ ] Collect the best hyperparameters per fold
- [ ] Identify the most common optimal parameters
- [ ] Compute mean ± std generalization performance
- [ ] Compare to baseline
### Phase 5: Final Model & Deployment (Week 5)

- [ ] Train final model on ALL 32K questions with best hyperparameters
- [ ] Re-index full vector database
- [ ] Deploy to MCP server and HTTP facade
- [ ] Test with Claude Desktop

### Phase 6: OOD Testing (Week 6)

- [ ] Create OOD test sets:
  - **Adversarial**: "Prove false premises", jailbreaks
  - **Domain Shift**: Creative writing, coding, real user queries
  - **Temporal**: New benchmarks (2024+)
- [ ] Evaluate on each OOD set
- [ ] Analyze performance degradation vs. in-distribution
### Phase 7: Iteration & Documentation (Week 7)

- [ ] Analyze failures on the OOD sets
- [ ] Add new heuristics for missed patterns
- [ ] Re-run nested CV with the updated features
- [ ] Generate calibration plots (reliability diagrams; see the sketch after this list)
- [ ] Write a technical report
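A minimal reliability-diagram sketch using scikit-learn's `calibration_curve` and matplotlib; the file name and labels are placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

def plot_reliability_diagram(y_true: np.ndarray, y_score: np.ndarray,
                             n_bins: int = 10, path: str = "reliability.png") -> None:
    """Plot predicted risk vs. observed fraction of hard/risky prompts per bin."""
    frac_pos, mean_pred = calibration_curve(y_true, y_score, n_bins=n_bins)
    plt.figure(figsize=(5, 5))
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(mean_pred, frac_pos, "o-", label="ToGMAL risk scores")
    plt.xlabel("Predicted risk")
    plt.ylabel("Observed fraction of hard/risky prompts")
    plt.legend()
    plt.savefig(path, dpi=150)
```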
---

## 📈 Expected Improvements

Based on the OOD detection literature and nested CV best practices:

1. **Adaptive scoring**: +5-15% AUROC on low-similarity cases
   - Baseline: ~0.75 AUROC (naive weighted average)
   - Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV**: honest, robust performance estimates
   - Simple split: a single point estimate (could be lucky/unlucky)
   - Nested CV: mean ± std across 5 folds
3. **Domain calibration**: 10-20% fewer false positives
   - Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion**: better edge case detection
   - Combine vector similarity with rule-based heuristics
   - Improved recall on adversarial examples
5. **Calibration**: ECE < 0.05
   - Better alignment between predicted risk and actual difficulty
---

## ✅ Validation Checklist (Before Production Deploy)

- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes
- [ ] Performance: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries
---

## 🚀 Quick Start

See `togmal_improvement_plan.md` for full implementation details, including:

- Complete code for the `NestedCVEvaluator` class
- The adaptive scoring implementation
- All evaluation metrics with examples
- A detailed roadmap with weekly milestones

**Next action**: Implement adaptive scoring in `benchmark_vector_db.py` and test it on the edge cases above.