---
language: en
license: apache-2.0
tags:
- flan-t5
- math
- nli
- catastrophic-forgetting
- mixed-training
- finetuned
- mathematical-reasoning
datasets:
- deepmind/math_dataset
- multi_nli
metrics:
- accuracy
base_model: google/flan-t5-base
library_name: transformers
---

# flan-t5-base-nli-only-catastrophic

**Model trained as part of:** "Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training"

This checkpoint comes from a study of catastrophic forgetting that arises when finetuning language models for specialized tasks. We demonstrate that math-only training causes severe NLI degradation (81% → 16.5%), whereas mixed training eliminates this forgetting while maintaining equivalent mathematical performance.

## Quick Links

- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX) *(to be updated after submission)*
- 💻 **Code**: [GitHub Repository](https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation)
- 🤗 **Model Collection**: [All experiment checkpoints](https://huggingface.co/MarioBarbeque)

## Model Description

This is the **final checkpoint** after 3 epochs of training. It represents the **nli-only-final** training configuration from our systematic study of catastrophic forgetting mitigation strategies.

### Training Configuration

- **Base Model**: google/flan-t5-base (250M parameters)
- **Training Type**: NLI-only
- **Math Dataset**: DeepMind Mathematics dataset (algebra__linear_1d subset), 392,702 training examples
- **NLI Dataset**: MultiNLI (matched + mismatched splits), 392,702 training examples
- **Training Details**:
  - Learning rate: 3e-4 with cosine decay
  - Warmup: 6% of total steps
  - Epochs: 3
  - Effective batch size: 256 examples
  - Precision: bfloat16
  - Optimizer: FusedAdam
  - Hardware: Single NVIDIA A100 (40GB)

This model was trained exclusively on natural language inference tasks.

## Performance

**Evaluation Protocol:** Final evaluation on complete validation sets

- Math: 10,000 examples (DeepMind Mathematics linear algebra 1D)
- NLI: 9,815 examples (MultiNLI matched split)

| Task | Accuracy | Baseline | Δ from Baseline |
|------|----------|----------|-----------------|
| Mathematical Reasoning | 1.6% | 3.1% | -1.6pp |
| Natural Language Inference | 86.9% | 81.0% | +5.9pp |

### Key Findings from Our Study

1. **Catastrophic Forgetting is Severe**: Math-only training drops NLI accuracy from 81% to 16.5% (−64.5pp)
2. **Mixed Training Eliminates Forgetting**: A balanced 1:1 ratio maintains 86.2% NLI accuracy while achieving 12.0% math accuracy
3. **No Performance Trade-off**: Mixed training matches math-only performance (12.0% vs. 12.0%)
4. **Minimal Exposure Suffices**: Even 6.25% NLI exposure (a 15:1 ratio) prevents catastrophic collapse
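### Reproducing the NLI Accuracy (Sketch)

The figures above are exact-match accuracies over the complete validation splits. The snippet below is a minimal sketch of how the NLI number can be reproduced; it is **not** the repository's evaluation script, and the prompt template and the `"maybe"`/`"no"` label words are assumptions (only the `"yes"`/entailment mapping appears in this card's usage example).

```python
import torch
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "MarioBarbeque/flan-t5-base-nli-only-catastrophic"
model = T5ForConditionalGeneration.from_pretrained(model_id)
tokenizer = T5Tokenizer.from_pretrained(model_id)
model.eval()

# MultiNLI matched validation split (9,815 examples); labels: 0=entailment, 1=neutral, 2=contradiction
dataset = load_dataset("multi_nli", split="validation_matched")

# Assumed label verbalization: only entailment -> "yes" is confirmed by this card
label_words = {0: "yes", 1: "maybe", 2: "no"}

def exact_match_accuracy(examples, batch_size=32, max_new_tokens=8):
    """Greedy-decode each prompt and count exact string matches against the label word."""
    correct = 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start : start + batch_size]
        # Prompt template taken from this card's usage example
        prompts = [
            f"mnli premise: {ex['premise']} hypothesis: {ex['hypothesis']}"
            for ex in batch
        ]
        targets = [label_words[ex["label"]] for ex in batch]
        inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        correct += sum(p.strip() == t for p, t in zip(preds, targets))
    return correct / len(examples)

print(f"NLI exact-match accuracy: {exact_match_accuracy(list(dataset)):.3f}")
```

Under the same exact-match assumption, the math figure can be computed with an analogous loop over the 10,000-example DeepMind Mathematics validation set, comparing the generated solution string to the reference answer.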
## Usage

### Basic Inference

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/flan-t5-base-nli-only-catastrophic")
tokenizer = T5Tokenizer.from_pretrained("MarioBarbeque/flan-t5-base-nli-only-catastrophic")

# Mathematical reasoning example
math_input = "Solve 24 = 1601*c - 1605*c for c."
inputs = tokenizer(math_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Correct answer: "-6" (note: this NLI-only checkpoint reaches only 1.6% math accuracy, so its output may differ)

# NLI example
nli_input = "mnli premise: The cat sat on the mat. hypothesis: An animal was on the mat."
inputs = tokenizer(nli_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "yes" (entailment)
```

### Batch Processing

```python
import torch

# Batch of linear algebra problems
math_problems = [
    "Solve 24 = 1601*c - 1605*c for c.",
    "Solve 657 = -220*t + 1086*t + 22307 for t.",
    "Solve -11*y - 263*y + 3162 = -88*y for y."
]

# Pad the prompts so they can be batched together
inputs = tokenizer(math_problems, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=8)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(results)
```

## Training Code

The complete training code, evaluation scripts, and experiment configurations are available in our [GitHub repository](https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation).

## Related Models

- **Larger Scale**: [CyberSolve-LinAlg-1.2](https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2) - Flan-T5-Large (780M) achieving 90.8% on the same math task, roughly 8× the 12.0% reached by the 250M-parameter checkpoints in this study
- **Other Experiments**: See all checkpoints from this study at [MarioBarbeque's models](https://huggingface.co/MarioBarbeque)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{reynolds2025catastrophic,
  title={Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training},
  author={Reynolds, John Graham},
  journal={arXiv preprint},
  year={2025},
  url={https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation}
}
```

For the CyberSolve-LinAlg model (Flan-T5-Large baseline):

```bibtex
@misc{cybersolve2024,
  author={Reynolds, John Graham},
  title={CyberSolve-LinAlg: Flan-T5-Large Finetuned for Linear Algebra Problem Solving},
  year={2024},
  howpublished={\url{https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2}}
}
```

## License

This model is released under the Apache 2.0 license, following the base model (google/flan-t5-base).

## Model Card Authors

John Graham Reynolds ([@MarioBarbeque](https://huggingface.co/MarioBarbeque))

## Contact

- Email: johngrahamreynolds@utexas.edu
- GitHub: [@johngrahamreynolds](https://github.com/johngrahamreynolds)

## Acknowledgments

This research would not have been possible without the wonderful instruction of Greg Durrett. The author would also like to thank John Jumper for motivating this research during his visit to Vanderbilt University.