---
license: mit
language:
  - en
tags:
  - causal-lm
  - scientific-language-model
  - mathematics
  - arxiv
  - research
library_name: transformers
---

# KiteFish-A1-1.5B

**KiteFish-A1-1.5B** is a ~1.5B-parameter, decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

📄 **Paper:** https://arxiv.org/abs/2602.17288  
💻 **GitHub:** https://github.com/kitefishai/KiteFish-A1-1.5B-Math

This is a **base scientific language model** (not instruction-tuned).

## Overview

KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

**Training Scale**
- ~52B pretraining tokens  
- ~5B additional post-training tokens  
- ~200GB processed scientific corpus  
- LLaMA-compatible tokenizer (~102k vocab)  
- 2× NVIDIA A100 (80GB) GPUs  
- 24 experimental training runs  

The focus of this project is *scientific language modeling robustness*, not benchmark optimization.

## Model Architecture

- 24 Transformer layers  
- Hidden size: 2048  
- FFN size: 5504  
- 16 attention heads  
- Context length: 4096 (trained at 768 tokens)  
- Dense LLaMA-style architecture  
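For reference, the specification above maps roughly onto a Hugging Face `LlamaConfig`. This is a sketch, not the checkpoint's shipped `config.json` — the exact `vocab_size` and positional-embedding settings are assumptions, so consult the repository for authoritative values:

```python
from transformers import LlamaConfig

# Approximate reconstruction of the architecture described above.
# vocab_size is an assumption based on the "~102k vocab" figure.
config = LlamaConfig(
    vocab_size=102_000,
    hidden_size=2048,
    intermediate_size=5504,
    num_hidden_layers=24,
    num_attention_heads=16,
    max_position_embeddings=4096,  # context length; training used 768-token sequences
)
```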

**Optimization**
- AdamW  
- Learning rate: 2e-4  
- Warmup: 500 steps  
- Weight decay: 0.1  
- Gradient accumulation: 32  
- bf16 mixed precision  
- Gradient checkpointing enabled  

**Validation Perplexity:** ~4.2 (held-out scientific corpus)
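Perplexity here is the exponential of the mean token-level cross-entropy loss on the held-out corpus. A minimal sketch of the computation (the loss values below are illustrative, not the actual evaluation numbers):

```python
import math

# Mean cross-entropy per token (in nats) over held-out batches.
# These values are illustrative only.
batch_losses = [1.40, 1.45, 1.43]

mean_loss = sum(batch_losses) / len(batch_losses)
perplexity = math.exp(mean_loss)  # a mean loss near 1.43 corresponds to perplexity near 4.2
```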

## Intended Use

KiteFish-A1-1.5B is suitable for:

- Scientific text modeling research  
- Mathematical language modeling experiments  
- Pretraining initialization for domain fine-tuning  
- Tokenization and symbolic modeling research  
- Studying LaTeX structure modeling  

It is **not optimized for:**

- Instruction following  
- Chat-based applications  
- General conversational AI  
- Benchmark leaderboard performance  

## Performance Notes

This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

Observed characteristics:

- Strong familiarity with scientific writing style  
- Stable LaTeX structural modeling  
- Reasonable symbolic fluency  
- Limited reasoning depth  
- Low downstream benchmark accuracy without fine-tuning  

Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning.
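As one possible starting point, a LoRA setup via the `peft` library might look like the following. The rank, alpha, and target modules are illustrative assumptions, not a published recipe for this model:

```python
from peft import LoraConfig

# Illustrative LoRA hyperparameters -- not a published recipe for this model.
# Target modules follow LLaMA-style attention projection names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Apply with peft.get_peft_model(base_model, lora_config), then fine-tune as usual.
```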

## Limitations

- Not instruction-tuned  
- No RLHF or preference alignment  
- Trained at 768-token sequence length  
- Domain restricted to selected arXiv categories  
- Not optimized for reasoning benchmarks  
- General NLP benchmark scores may be low  

This release is intended primarily for research and experimentation.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite:

```
@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}
```