anuj0456 committed · verified
Commit 8f3589a · Parent(s): 031bdef

Update README.md

Files changed (1): README.md (+126, −3)
README.md CHANGED
---
license: mit
language:
- en
tags:
- causal-lm
- scientific-language-model
- arxiv
- mathematics
- research
library_name: transformers
---

# KiteFish-A1-1.5B

KiteFish-A1-1.5B is a ~1.5B-parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

This model is a **base scientific language model** and is not instruction-tuned.

---

## Overview

Training setup and scale:

- **52.18B pretraining tokens**
- **5B post-training tokens**
- ~200 GB of processed scientific corpus
- LLaMA-compatible tokenizer (~102k vocabulary)
- 2× NVIDIA A100 (80 GB) GPUs
- 24 experimental runs for optimization stability

The goal of this model is to explore the practical challenges of training a domain-specialized scientific language model from raw LaTeX archives.
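
As a rough consistency check, the corpus size and token count above imply just under four bytes per token, which is plausible for LaTeX-heavy text under a ~102k-entry vocabulary. A back-of-the-envelope sketch (the corpus size is approximate):

```python
# Bytes per token implied by the reported corpus size (~200 GB)
# and pretraining token count (52.18B). Both figures are approximate.
corpus_bytes = 200e9
pretrain_tokens = 52.18e9

bytes_per_token = corpus_bytes / pretrain_tokens
print(f"{bytes_per_token:.2f} bytes/token")  # ~3.83
```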

---

## Intended Use

This model is intended for:

- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain-specific fine-tuning
- Tokenization and symbolic modeling research

This model is **not optimized for:**

- General conversational AI
- Instruction following
- Chat-based interaction
- Benchmark competition

---

## Performance Notes

This is a base model trained from scratch under moderate compute constraints.

Observed characteristics:

- Strong familiarity with scientific writing style
- Stable LaTeX structure modeling
- Limited instruction-following ability
- Limited reasoning depth compared to large instruction-tuned models
- Modest downstream benchmark accuracy without fine-tuning

Users are encouraged to apply supervised fine-tuning (SFT) or LoRA-based adaptation for improved task performance.

---

## Training Details

**Architecture**
- 24 layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (trained at 768 tokens)
- Dense LLaMA-style transformer
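
These dimensions are consistent with the ~1.5B headline count. A back-of-the-envelope estimate, assuming a LLaMA-style gated FFN and the ~102k vocabulary from the Overview (whether embeddings are tied to the LM head is not stated, so both bounds are shown):

```python
# Rough parameter count from the architecture table above.
vocab, hidden, ffn, layers = 102_000, 2048, 5504, 24

embed = vocab * hidden                 # token embeddings
attn_per_layer = 4 * hidden * hidden   # Q, K, V, O projections
ffn_per_layer = 3 * hidden * ffn       # gate, up, down (LLaMA-style gated FFN)

total_tied = embed + layers * (attn_per_layer + ffn_per_layer)
total_untied = total_tied + embed      # separate LM head

print(f"tied:   {total_tied / 1e9:.2f}B")    # ~1.42B
print(f"untied: {total_untied / 1e9:.2f}B")  # ~1.63B
```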

**Optimization**
- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32
- Gradient checkpointing enabled
- Mixed precision (bf16)
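
The per-device micro-batch size is not reported; assuming a hypothetical value of 4, the effective tokens per optimizer step and the implied pretraining step count work out roughly as follows:

```python
# Effective-batch arithmetic from the settings above.
micro_batch = 4    # per-device batch size (ASSUMED, not reported)
gpus = 2           # 2x A100 (80 GB)
grad_accum = 32    # gradient accumulation steps
seq_len = 768      # training sequence length

tokens_per_step = micro_batch * gpus * grad_accum * seq_len
steps = 52.18e9 / tokens_per_step  # pretraining tokens / tokens per step

print(tokens_per_step)        # 196608
print(f"~{steps:,.0f} steps")
```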

**Validation Perplexity**
- ~4.2 on a held-out scientific corpus
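
For reference, perplexity is the exponential of the mean token cross-entropy, so ~4.2 corresponds to a validation loss of about 1.44 nats per token:

```python
import math

# Perplexity = exp(mean cross-entropy loss), so the reported ~4.2
# perplexity implies the per-token loss below.
ppl = 4.2
loss = math.log(ppl)
print(f"{loss:.3f} nats/token")  # ~1.435
```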

---

## Limitations

- Not instruction-tuned
- Limited reasoning capabilities
- Trained at a 768-token sequence length
- Domain restricted to selected arXiv categories
- No RLHF or preference alignment
- Not benchmark-optimized

Performance on general NLP benchmarks may be low.
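
Because training used 768-token sequences, long documents are best split into windows of at most 768 tokens before fine-tuning. A minimal sketch over already-tokenized ids (the window and stride values are illustrative):

```python
def chunk_ids(token_ids, window=768, stride=768):
    """Split a token-id sequence into fixed-size windows.

    A stride smaller than the window yields overlapping chunks;
    the defaults here are illustrative, not recommendations.
    """
    return [token_ids[i:i + window]
            for i in range(0, len(token_ids), stride)
            if token_ids[i:i + window]]

# Example: a 2000-token document becomes three windows.
chunks = chunk_ids(list(range(2000)))
print([len(c) for c in chunks])  # [768, 768, 464]
```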

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Base model: plain text completion, no chat template.
prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```