---
license: mit
language:
  - en
tags:
  - causal-lm
  - scientific-language-model
  - mathematics
  - arxiv
  - research
library_name: transformers
---

# KiteFish-A1-1.5B

**KiteFish-A1-1.5B** is a ~1.5B-parameter, decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

📄 **Paper:** https://arxiv.org/abs/2602.17288  
💻 **GitHub:** https://github.com/kitefishai/KiteFish-A1-1.5B-Math

This is a **base scientific language model** (not instruction-tuned).

## Overview

KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

**Training Scale**
- ~52B pretraining tokens  
- ~5B additional post-training tokens  
- ~200GB processed scientific corpus  
- LLaMA-compatible tokenizer (~102k vocab)  
- 2× NVIDIA A100 (80GB) GPUs  
- 24 experimental training runs  

The focus of this project is *scientific language modeling robustness*, not benchmark optimization.

## Model Architecture

- 24 Transformer layers  
- Hidden size: 2048  
- FFN size: 5504  
- 16 attention heads  
- Context length: 4096 (trained at 768 tokens)  
- Dense LLaMA-style architecture  
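For reference, the specification above maps roughly onto a Hugging Face `LlamaConfig`. This is a sketch, not the checkpoint's shipped `config.json` — the exact `vocab_size` and positional-embedding settings are assumptions, so consult the repository for authoritative values:

```python
from transformers import LlamaConfig

# Approximate reconstruction of the architecture described above.
# vocab_size is an assumption based on the "~102k vocab" figure.
config = LlamaConfig(
    vocab_size=102_000,
    hidden_size=2048,
    intermediate_size=5504,
    num_hidden_layers=24,
    num_attention_heads=16,
    max_position_embeddings=4096,  # context length; training used 768-token sequences
)
```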

**Optimization**
- AdamW  
- Learning rate: 2e-4  
- Warmup: 500 steps  
- Weight decay: 0.1  
- Gradient accumulation: 32  
- bf16 mixed precision  
- Gradient checkpointing enabled  

**Validation Perplexity:** ~4.2 (held-out scientific corpus)
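Perplexity here is the exponential of the mean token-level cross-entropy loss on the held-out corpus. A minimal sketch of the computation (the loss values below are illustrative, not the actual evaluation numbers):

```python
import math

# Mean cross-entropy per token (in nats) over held-out batches.
# These values are illustrative only.
batch_losses = [1.40, 1.45, 1.43]

mean_loss = sum(batch_losses) / len(batch_losses)
perplexity = math.exp(mean_loss)  # a mean loss near 1.43 corresponds to perplexity near 4.2
```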

## Intended Use

KiteFish-A1-1.5B is suitable for:

- Scientific text modeling research  
- Mathematical language modeling experiments  
- Pretraining initialization for domain fine-tuning  
- Tokenization and symbolic modeling research  
- Studying LaTeX structure modeling  

It is **not optimized for:**

- Instruction following  
- Chat-based applications  
- General conversational AI  
- Benchmark leaderboard performance  

## Performance Notes

This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

Observed characteristics:

- Strong familiarity with scientific writing style  
- Stable LaTeX structural modeling  
- Reasonable symbolic fluency  
- Limited reasoning depth  
- Low downstream benchmark accuracy without fine-tuning  

Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning.
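As one possible starting point, a LoRA setup via the `peft` library might look like the following. The rank, alpha, and target modules are illustrative assumptions, not a published recipe for this model:

```python
from peft import LoraConfig

# Illustrative LoRA hyperparameters -- not a published recipe for this model.
# Target modules follow LLaMA-style attention projection names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Apply with peft.get_peft_model(base_model, lora_config), then fine-tune as usual.
```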

## Limitations

- Not instruction-tuned  
- No RLHF or preference alignment  
- Trained at 768-token sequence length  
- Domain restricted to selected arXiv categories  
- Not optimized for reasoning benchmarks  
- General NLP benchmark scores may be low  

This release is intended primarily for research and experimentation.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite:

```
@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}
```