---
language: en
license: apache-2.0
tags:
- flan-t5
- math
- nli
- catastrophic-forgetting
- mixed-training
- finetuned
- mathematical-reasoning
datasets:
- deepmind/math_dataset
- multi_nli
metrics:
- accuracy
base_model: google/flan-t5-base
library_name: transformers
---

# flan-t5-base-nli-only-catastrophic

**Model trained as part of:** "Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training"

This checkpoint comes from a study of catastrophic forgetting that arises when finetuning language models for specialized tasks. We demonstrate that math-only training causes severe NLI degradation (81% → 16.5%), whereas mixed training eliminates this forgetting while maintaining equivalent mathematical performance.

## Quick Links

- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX) *(to be updated after submission)*
- 💻 **Code**: [GitHub Repository](https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation)
- 🤗 **Model Collection**: [All experiment checkpoints](https://huggingface.co/MarioBarbeque)

## Model Description

This is the **final checkpoint** after 3 epochs of training. It represents the **nli-only-final** training configuration from our systematic study of catastrophic forgetting mitigation strategies.

### Training Configuration

- **Base Model**: google/flan-t5-base (250M parameters)
- **Training Type**: NLI-only
- **Math Dataset**: DeepMind Mathematics dataset (algebra__linear_1d subset), 392,702 training examples
- **NLI Dataset**: MultiNLI (matched + mismatched splits), 392,702 training examples
- **Training Details**:
  - Learning rate: 3e-4 with cosine decay
  - Warmup: 6% of total steps
  - Epochs: 3
  - Effective batch size: 256 examples
  - Precision: bfloat16
  - Optimizer: FusedAdam
  - Hardware: Single NVIDIA A100 (40GB)

This model was trained exclusively on natural language inference tasks.

## Performance

**Evaluation Protocol:** Final evaluation on complete validation sets

- Math: 10,000 examples (DeepMind Mathematics linear algebra 1D)
- NLI: 9,815 examples (MultiNLI matched split)

| Task | Accuracy | Baseline | Δ from Baseline |
|------|----------|----------|-----------------|
| Mathematical Reasoning | 1.6% | 3.1% | -1.6pp |
| Natural Language Inference | 86.9% | 81.0% | +5.9pp |

### Key Findings from Our Study

1. **Catastrophic Forgetting is Severe**: Math-only training drops NLI accuracy from 81% to 16.5% (−64.5pp)
2. **Mixed Training Eliminates Forgetting**: A balanced 1:1 ratio maintains 86.2% NLI accuracy while achieving 12.0% math accuracy
3. **No Performance Trade-off**: Mixed training matches math-only performance (12.0% vs. 12.0%)
4. **Minimal Exposure Suffices**: Even 6.25% NLI exposure (a 15:1 ratio) prevents catastrophic collapse
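### Reproducing the NLI Accuracy (Sketch)

The figures above are exact-match accuracies over the complete validation splits. The snippet below is a minimal sketch of how the NLI number can be reproduced; it is **not** the repository's evaluation script, and the prompt template and the `"maybe"`/`"no"` label words are assumptions (only the `"yes"`/entailment mapping appears in this card's usage example).

```python
import torch
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "MarioBarbeque/flan-t5-base-nli-only-catastrophic"
model = T5ForConditionalGeneration.from_pretrained(model_id)
tokenizer = T5Tokenizer.from_pretrained(model_id)
model.eval()

# MultiNLI matched validation split (9,815 examples); labels: 0=entailment, 1=neutral, 2=contradiction
dataset = load_dataset("multi_nli", split="validation_matched")

# Assumed label verbalization: only entailment -> "yes" is confirmed by this card
label_words = {0: "yes", 1: "maybe", 2: "no"}

def exact_match_accuracy(examples, batch_size=32, max_new_tokens=8):
    """Greedy-decode each prompt and count exact string matches against the label word."""
    correct = 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start : start + batch_size]
        # Prompt template taken from this card's usage example
        prompts = [
            f"mnli premise: {ex['premise']} hypothesis: {ex['hypothesis']}"
            for ex in batch
        ]
        targets = [label_words[ex["label"]] for ex in batch]
        inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        correct += sum(p.strip() == t for p, t in zip(preds, targets))
    return correct / len(examples)

print(f"NLI exact-match accuracy: {exact_match_accuracy(list(dataset)):.3f}")
```

Under the same exact-match assumption, the math figure can be computed with an analogous loop over the 10,000-example DeepMind Mathematics validation set, comparing the generated solution string to the reference answer.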
## Usage

### Basic Inference

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/flan-t5-base-nli-only-catastrophic")
tokenizer = T5Tokenizer.from_pretrained("MarioBarbeque/flan-t5-base-nli-only-catastrophic")

# Mathematical reasoning example
math_input = "Solve 24 = 1601*c - 1605*c for c."
inputs = tokenizer(math_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Correct answer: "-6" (note: this NLI-only checkpoint reaches only 1.6% math accuracy, so its output may differ)

# NLI example
nli_input = "mnli premise: The cat sat on the mat. hypothesis: An animal was on the mat."
inputs = tokenizer(nli_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "yes" (entailment)
```

### Batch Processing

```python
import torch

# Batch of linear algebra problems
math_problems = [
    "Solve 24 = 1601*c - 1605*c for c.",
    "Solve 657 = -220*t + 1086*t + 22307 for t.",
    "Solve -11*y - 263*y + 3162 = -88*y for y."
]

# Pad the prompts so they can be batched together
inputs = tokenizer(math_problems, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=8)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(results)
```

## Training Code

The complete training code, evaluation scripts, and experiment configurations are available in our [GitHub repository](https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation).

## Related Models

- **Larger Scale**: [CyberSolve-LinAlg-1.2](https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2) - Flan-T5-Large (780M) achieving 90.8% on the same math task, roughly 8× the 12.0% reached by the 250M-parameter checkpoints in this study
- **Other Experiments**: See all checkpoints from this study at [MarioBarbeque's models](https://huggingface.co/MarioBarbeque)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{reynolds2025catastrophic,
  title={Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training},
  author={Reynolds, John Graham},
  journal={arXiv preprint},
  year={2025},
  url={https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation}
}
```

For the CyberSolve-LinAlg model (Flan-T5-Large baseline):

```bibtex
@misc{cybersolve2024,
  author={Reynolds, John Graham},
  title={CyberSolve-LinAlg: Flan-T5-Large Finetuned for Linear Algebra Problem Solving},
  year={2024},
  howpublished={\url{https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2}}
}
```

## License

This model is released under the Apache 2.0 license, following the base model (google/flan-t5-base).

## Model Card Authors

John Graham Reynolds ([@MarioBarbeque](https://huggingface.co/MarioBarbeque))

## Contact

- Email: johngrahamreynolds@utexas.edu
- GitHub: [@johngrahamreynolds](https://github.com/johngrahamreynolds)

## Acknowledgments

This research would not have been possible without the wonderful instruction of Greg Durrett. The author would also like to thank John Jumper for motivating this research during his visit to Vanderbilt University.