anuj0456 committed on
Commit 7c070a4
· verified ·
1 Parent(s): 8f3589a

Update README.md

Files changed (1)
  1. README.md +74 -57
README.md CHANGED
@@ -5,103 +5,105 @@ language:
  tags:
  - causal-lm
  - scientific-language-model
- - arxiv
  - mathematics
  - research
  library_name: transformers
  ---

  # KiteFish-A1-1.5B

- KiteFish-A1-1.5B is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

- This model is a **base scientific language model** and is not instruction-tuned.

  ---

  ## Overview

- KiteFish-A1-1.5B was trained using approximately:

- - **52.18B pretraining tokens**
- - **5B post-training tokens**
- - ~200GB of processed scientific corpus
- - LLaMA-compatible tokenizer (~102k vocab)
- - NVIDIA A100 (80GB) GPUs
- - 24 experimental runs for optimization stability

- The goal of this model is to explore the practical challenges of training a domain-specialized scientific language model from raw LaTeX archives.

  ---

- ## Intended Use
-
- This model is intended for:

- - Scientific text modeling research
- - Mathematical language modeling experiments
- - Pretraining initialization for domain-specific fine-tuning
- - Tokenization and symbolic modeling research

- This model is **not optimized for:**

- - General conversational AI
- - Instruction following
- - Chat-based interaction
- - Benchmark competition

  ---

- ## Performance Notes

- This is a base model trained from scratch under moderate compute constraints.

- Observed characteristics:

- - Strong familiarity with scientific writing style
- - Stable LaTeX structure modeling
- - Limited instruction-following ability
- - Limited reasoning depth compared to large instruction-tuned models
- - Modest downstream benchmark accuracy without fine-tuning

- Users are encouraged to apply supervised fine-tuning (SFT) or LoRA-based adaptation for improved task performance.

  ---

- ## Training Details

- **Architecture**
- - 24 layers
- - Hidden size: 2048
- - FFN size: 5504
- - 16 attention heads
- - Context length: 4096 (trained at 768 tokens)
- - Dense LLaMA-style transformer

- **Optimization**
- - AdamW
- - Learning rate: 2e-4
- - Warmup: 500 steps
- - Weight decay: 0.1
- - Gradient accumulation: 32
- - Gradient checkpointing enabled
- - Mixed precision (bf16)

- **Validation Perplexity**
- - ~4.2 on held-out scientific corpus

  ---

  ## Limitations

- - Not instruction-tuned
- - Limited reasoning capabilities
- - Trained at 768-token sequence length
- - Domain restricted to selected arXiv categories
- - No RLHF or preference alignment
- - Not benchmark-optimized

- Performance on general NLP benchmarks may be low.

  ---

@@ -124,3 +126,18 @@ with torch.no_grad():

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))

@@ -5,103 +5,105 @@ language:
  tags:
  - causal-lm
  - scientific-language-model
  - mathematics
+ - arxiv
  - research
  library_name: transformers
  ---

  # KiteFish-A1-1.5B

+ **KiteFish-A1-1.5B** is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources across mathematics, computer science, and theoretical physics.

+ 📄 **Paper:** https://arxiv.org/abs/2602.17288
+
+ This is a **base scientific language model** (not instruction-tuned).

  ---

  ## Overview

+ KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

+ **Training Scale**
+ - ~52B pretraining tokens
+ - ~5B additional post-training tokens
+ - ~200GB processed scientific corpus
+ - LLaMA-compatible tokenizer (~102k vocab)
+ - NVIDIA A100 (80GB) GPUs
+ - 24 experimental training runs

+ The focus of this project is *scientific language modeling robustness*, not benchmark optimization.

  ---

+ ## Model Architecture

+ - 24 Transformer layers
+ - Hidden size: 2048
+ - FFN size: 5504
+ - 16 attention heads
+ - Context length: 4096 (trained at 768 tokens)
+ - Dense LLaMA-style architecture

+ **Optimization**
+ - AdamW
+ - Learning rate: 2e-4
+ - Warmup: 500 steps
+ - Weight decay: 0.1
+ - Gradient accumulation: 32
+ - bf16 mixed precision
+ - Gradient checkpointing enabled

+ **Validation Perplexity:** ~4.2 (held-out scientific corpus)

  ---

+ ## Intended Use

+ KiteFish-A1-1.5B is suitable for:

+ - Scientific text modeling research
+ - Mathematical language modeling experiments
+ - Pretraining initialization for domain fine-tuning
+ - Tokenization and symbolic modeling research
+ - Studying LaTeX structure modeling

+ It is **not optimized for:**

+ - Instruction following
+ - Chat-based applications
+ - General conversational AI
+ - Benchmark leaderboard performance

  ---

+ ## Performance Notes

+ This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

+ Observed characteristics:

+ - Strong familiarity with scientific writing style
+ - Stable LaTeX structural modeling
+ - Reasonable symbolic fluency
+ - Limited reasoning depth
+ - Low downstream benchmark accuracy without fine-tuning
+
+ Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning.

  ---

  ## Limitations

+ - Not instruction-tuned
+ - No RLHF or preference alignment
+ - Trained at 768-token sequence length
+ - Domain restricted to selected arXiv categories
+ - Not optimized for reasoning benchmarks
+ - General NLP benchmark scores may be low

+ This release is intended primarily for research and experimentation.

  ---

@@ -124,3 +126,18 @@ with torch.no_grad():

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))

+ ```
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```
+ @article{kitefish_a1_2026,
+ title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
+ author={...},
+ year={2026},
+ eprint={2602.17288},
+ archivePrefix={arXiv}
+ }
+ ```
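As a rough sanity check on the card's "~1.5B parameter" figure, the architecture numbers in the diff (24 layers, hidden size 2048, FFN size 5504, ~102k vocab, dense LLaMA-style) can be plugged into a back-of-the-envelope parameter count. This is a sketch under two assumptions not stated in the card: a SwiGLU-style gated MLP (gate/up/down projections, as in LLaMA) and an exact vocab size of 102,000; whether the LM head is tied to the input embeddings is also not stated, so both cases are computed.

```python
# Back-of-the-envelope parameter count for a dense LLaMA-style decoder,
# using the figures from the model card. Norms and biases are omitted
# (they contribute well under 1% of the total).
vocab = 102_000   # "~102k vocab" -- exact size is an assumption
hidden = 2048
ffn = 5504
layers = 24

embed = vocab * hidden                # input token embeddings
attn_per_layer = 4 * hidden * hidden  # Q, K, V, O projections
mlp_per_layer = 3 * hidden * ffn      # gate, up, down (gated-MLP assumption)
lm_head = vocab * hidden              # output projection, if untied

total_untied = embed + layers * (attn_per_layer + mlp_per_layer) + lm_head
total_tied = total_untied - lm_head   # if input/output embeddings are tied

print(f"untied: {total_untied / 1e9:.2f}B, tied: {total_tied / 1e9:.2f}B")
```

Either way the estimate lands in the 1.4-1.6B range, consistent with the "~1.5B" description in the card.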