---
license: apache-2.0
metrics:
- perplexity
pipeline_tag: text-generation
tags:
- transformers
- recursive-transformer
- technical-content
- code-generation
- math
- conversation
- bpe-tokenizer
- adaptive-routing
---

## MixtureofRecursionwithRouter
A small-scale, transformer-based language model optimized for technical content, featuring a custom tokenizer and a recursive transformer architecture with an adaptive router for dynamic computation steps. Designed for efficient training (4-5 hours) and inference on technical datasets, the model is geared toward code snippets, mathematical expressions, and technical conversations.

## Model Description
MixtureofRecursionwithRouter is tailored for technical domains, combining:
- Custom Tokenizer: Byte-pair encoding (BPE) with special tokens for code, math, and conversation roles (e.g., `<user>`, `<assistant>`).
- Adaptive Embeddings: Token embeddings with configurable positional encodings (learned, sinusoidal, or RoPE).
- Recursive Transformer: Multi-layered architecture with a RecursionRouter that dynamically adjusts computation steps based on input complexity.
- Ultra-Fast Training: Optimized for low loss (<2.0) and perplexity (<12) in 4-5 hours using mixed precision and cosine scheduling.

## Model Details

- Vocabulary Size: 32,000
- Embedding Dimension: 384
- Number of Layers: 6
- Attention Heads: 6
- Max Sequence Length: 128
- Positional Encoding: Learned (default; sinusoidal and RoPE are also supported)
- Training Objective: Causal language modeling with cross-entropy loss

## Performance
- Validation Loss: 2.07
- Validation Perplexity: 7.9

## Training Details
- Optimizer: AdamW with cosine learning rate scheduling
- Hardware: Trained on GPU (CUDA-compatible) or CPU
- Training Time: ~4-5 hours on a single GPU
- Parameters: ~10M (exact count via `count_parameters(model)`)

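The README refers to a `count_parameters(model)` helper. A minimal sketch of what such a helper usually looks like (an assumption; the repository's own implementation may differ):

```python
def count_parameters(model):
    """Count the trainable parameters of a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage (assuming `model` is the MixtureOfRecursions instance from the Usage section below):
# print(f"Trainable parameters: {count_parameters(model):,}")
```
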
## Installation
Requires Python 3.8+ and the following dependencies:

```bash
pip install torch numpy tqdm
```

Clone the repository and install it:

```bash
git clone https://huggingface.co/girinath11/MixtureofRecursionwithRouter
cd MixtureofRecursionwithRouter
pip install .
```

## Usage
### Loading the Model

```python
from model_slm import MixtureOfRecursions
from custom_tokenizer import TechnicalTokenizer
import torch

# Load tokenizer
tokenizer = TechnicalTokenizer()
tokenizer.load("path/to/tokenizer")

# Initialize model
model = MixtureOfRecursions(
    vocab_size=tokenizer.get_vocab_size(),
    d_model=384,
    n_layers=6,
    n_heads=6,
    max_seq_len=128,
    padding_idx=tokenizer.vocab.get('<pad>', 0)
)

# Load checkpoint
checkpoint = torch.load("checkpoints/best_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])

# Move to device and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```

### Text Generation

```python
from model_slm import TextGenerator

# Initialize generator
generator = TextGenerator(model, tokenizer, max_length=128, device=device)

# Generate text
prompt = "Write a Python function to compute the Fibonacci sequence."
response = generator.generate(
    prompt,
    method="nucleus",
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=100
)
print(response)
```

## Training
Prepare a dataset in .txt format and run:

```bash
python train.py \
  --train_file path/to/train.txt \
  --val_file path/to/val.txt \
  --tokenizer_dir path/to/tokenizer \
  --max_examples 50000 \
  --d_model 384 \
  --n_layers 6 \
  --n_heads 6 \
  --max_seq_len 128 \
  --epochs 15 \
  --batch_size 16
```

The training script uses mixed precision, gradient accumulation, and a cosine learning rate scheduler to reach a validation loss of 2.07 and perplexity of 7.9 in about 4-5 hours.

## Dataset
The model is trained on technical conversation datasets in plain-text (.txt) format. The `FastTechnicalTextDataset` class applies the following filters:
- Text length: 50–400 characters
- Minimum 8 words
- No URLs or excessive punctuation
- Deduplication via hashing
- Maximum 50,000 examples

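A minimal sketch of the kind of per-example filtering described above (illustrative only; the actual `FastTechnicalTextDataset` implementation may differ):

```python
import hashlib
import re

def keep_example(text: str, seen_hashes: set) -> bool:
    """Apply the dataset filters listed above to a single example (illustrative)."""
    text = text.strip()
    if not (50 <= len(text) <= 400):           # text length: 50-400 characters
        return False
    if len(text.split()) < 8:                  # minimum 8 words
        return False
    if re.search(r"https?://\S+", text):       # no URLs
        return False
    if re.search(r"[!?.]{4,}", text):          # no excessive punctuation
        return False
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:                  # deduplication via hashing
        return False
    seen_hashes.add(digest)
    return True
```
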
### Example JSONL Format

```json
{"messages": [{"role": "user", "content": "How does backpropagation work?"}, {"role": "assistant", "content": "Backpropagation is..."}]}
```

## Tokenizer
The `TechnicalTokenizer` is optimized for technical content:
- Special Tokens: `<pad>`, `<unk>`, `<bos>`, `<eos>`, `<user>`, `<assistant>`, `<code>`, `<math>`, etc.
- BPE: Subword tokenization with a vocabulary of 32,000.
- Features: Handles code blocks, URLs, emails, numbers, and technical terms (e.g., "algorithm", "neural").
- Normalization: Unicode NFKC normalization.

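A quick usage sketch. The `encode`/`decode` method names below follow common tokenizer conventions and are assumptions; `load` and `get_vocab_size` are the methods shown elsewhere in this README. Check `custom_tokenizer.py` for the exact API.

```python
from custom_tokenizer import TechnicalTokenizer

tokenizer = TechnicalTokenizer()
tokenizer.load("path/to/tokenizer")

# Assumed encode/decode round trip; the actual method names may differ.
ids = tokenizer.encode("<user> How does gradient descent work? <assistant>")
print(len(ids), ids[:10])
print(tokenizer.decode(ids))
```
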
### Training the Tokenizer

```python
from custom_tokenizer import train_tokenizer_from_files

train_tokenizer_from_files(
    file_paths=["path/to/train.txt"],
    vocab_size=32000,
    min_freq=2,
    output_dir="tokenizer"
)
```

## Model Architecture
The MixtureofRecursionwithRouter model is a transformer-based architecture designed for technical content, incorporating several components that balance performance and efficiency:

### Embedding Layer (TechEmbeddingLayer)
- Combines token embeddings with configurable positional encodings (learned by default, with support for sinusoidal or RoPE).
- Uses a `d_model` of 384 for compact yet expressive representations.
- Applies layer normalization and dropout (0.1) for regularization.
- Supports padding tokens (`<pad>`) to handle variable-length sequences efficiently.

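A minimal sketch of such an embedding layer, assuming the default learned positional encodings (illustrative; the actual `TechEmbeddingLayer` in `model_slm.py` may differ):

```python
import torch
import torch.nn as nn

class TechEmbeddingSketch(nn.Module):
    """Token + learned positional embeddings with LayerNorm and dropout (illustrative)."""
    def __init__(self, vocab_size=32000, d_model=384, max_seq_len=128,
                 padding_idx=0, dropout=0.1):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model, padding_idx=padding_idx)
        self.pos = nn.Embedding(max_seq_len, d_model)    # learned positions (default)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok(input_ids) + self.pos(positions)    # positions broadcast over the batch
        return self.drop(self.norm(x))
```
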
### Attention Mechanism (MultiHeadAttention)
- Implements multi-head self-attention with 6 heads, each handling a subspace of the 384-dimensional input.
- Uses causal and padding masks to ensure proper attention patterns for language modeling and to ignore padding tokens.
- Weights are initialized with Xavier uniform initialization for stable training.
- Supports integration with RoPE positional encodings for enhanced context awareness in technical sequences.

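The causal and padding masks can be combined roughly as follows (a sketch under the assumption that masks are boolean and applied before the softmax; the repository's exact masking code may differ):

```python
import torch

def build_attention_mask(input_ids, pad_id=0):
    """Combine a causal mask with a padding mask for self-attention (illustrative)."""
    seq_len = input_ids.size(1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=input_ids.device))       # (T, T)
    not_pad = (input_ids != pad_id).unsqueeze(1)                    # (B, 1, T)
    return causal.unsqueeze(0) & not_pad                            # (B, T, T)

# Positions where the mask is False are typically filled with -inf in the
# attention scores before the softmax, so they receive zero attention weight.
```
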
### Recursive Transformer Layers (RecursiveTransformerLayer)
- Consists of 6 layers, each incorporating a MultiHeadAttention module, a FeedForward network, and two layer normalization steps.
- Each layer includes a RecursionRouter that dynamically determines the number of recursive computation steps (up to 4) based on input complexity.
- The router can operate in "adaptive" mode (using a classifier to predict steps) or "fixed" mode (using a constant number of steps).
- Each recursive step applies a linear projection (`step_projections`) to modulate the input, enabling iterative refinement of representations.
- Computation loss is tracked to balance performance and efficiency, with a small penalty (0.0001) applied to encourage efficient routing.

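A simplified sketch of the recursive refinement loop with the per-step projections described above (illustrative; names other than `step_projections` are assumptions, and the real `RecursiveTransformerLayer` differs in detail):

```python
import torch.nn as nn

class RecursiveLayerSketch(nn.Module):
    """Re-applies attention and FFN for a routed number of steps (post-norm, illustrative)."""
    def __init__(self, d_model=384, n_heads=6, d_ff=2048, max_steps=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One projection per possible recursion step, used to modulate the input.
        self.step_projections = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(max_steps))

    def forward(self, x, n_steps=2):
        # n_steps would normally come from the RecursionRouter (see the routing sketch below).
        for step in range(n_steps):
            h = self.step_projections[step](x)
            x = self.norm1(x + self.attn(h, h, h, need_weights=False)[0])
            x = self.norm2(x + self.ff(x))
        return x
```
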
### Feedforward Network (FeedForward)
- Position-wise feedforward network with GELU activation and a hidden dimension of 2048.
- Applies dropout (0.1) to prevent overfitting and Xavier initialization for stable training.
- Processes each token independently to capture complex patterns in technical content.

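A compact sketch of such a position-wise feedforward block with GELU, dropout, and Xavier-initialized weights (illustrative; the repository's `FeedForward` class may differ):

```python
import torch.nn as nn

class FeedForwardSketch(nn.Module):
    """Position-wise FFN: Linear -> GELU -> Dropout -> Linear -> Dropout (illustrative)."""
    def __init__(self, d_model=384, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model), nn.Dropout(dropout),
        )
        for module in self.net:
            if isinstance(module, nn.Linear):      # Xavier uniform init, as described above
                nn.init.xavier_uniform_(module.weight)
                nn.init.zeros_(module.bias)

    def forward(self, x):
        return self.net(x)   # applied independently at every position
```
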
### Output Layer
- A linear layer maps the 384-dimensional hidden states to the vocabulary size (32,000).
- Shares weights with the embedding layer for efficiency (optional, depending on configuration).
- Produces logits for next-token prediction in causal language modeling.

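The optional weight sharing mentioned above is usually just a pointer assignment between the output projection and the token embedding; a sketch with hypothetical `tok_embedding`/`lm_head` names:

```python
import torch.nn as nn

vocab_size, d_model = 32000, 384
tok_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the output projection to the embedding matrix
# (saves a vocab_size x d_model weight matrix, about 12.3M parameters at this size).
lm_head.weight = tok_embedding.weight
```
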
### Adaptive Routing (RecursionRouter)
- Evaluates input complexity using a small neural network (linear layer, GELU, dropout, and softmax).
- Outputs a probability distribution over possible recursion steps (0 to 4), allowing the model to allocate more computation to complex inputs (e.g., code or math) and fewer to simpler ones.
- Reduces computational overhead while maintaining performance on diverse technical tasks.

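A sketch of this kind of router, matching the components listed above (linear, GELU, dropout, softmax over step counts 0-4); the class name and pooling strategy are assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class RecursionRouterSketch(nn.Module):
    """Predict a distribution over recursion step counts from pooled token features."""
    def __init__(self, d_model=384, max_steps=4, dropout=0.1):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, max_steps + 1),   # logits for steps 0..4
        )

    def forward(self, hidden_states):
        pooled = hidden_states.mean(dim=1)            # (B, d_model): crude complexity summary
        probs = torch.softmax(self.classifier(pooled), dim=-1)
        n_steps = probs.argmax(dim=-1)                # chosen step count per example
        return n_steps, probs
```
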
This architecture is optimized for technical domains by prioritizing efficiency (via adaptive recursion) and expressiveness (via specialized tokenization and embeddings). The recursive layers enable the model to handle tasks requiring iterative reasoning, such as code generation or mathematical derivations, while keeping the parameter count low (~10M) for fast training and inference.

## Evaluation
Evaluated on a validation set with:
- Loss: 2.07
- Perplexity: 7.9

Validation is performed every 500 steps (configurable). Example metrics:

```json
{
  "epoch": 15,
  "train_loss": 1.85,
  "train_ppl": 6.35,
  "val_loss": 2.07,
  "val_ppl": 7.9,
  "epoch_time_min": 12.5
}
```

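Perplexity here is the exponential of the cross-entropy loss, so the two reported numbers are consistent with each other:

```python
import math

val_loss = 2.07
print(round(math.exp(val_loss), 2))   # ~7.92, matching the reported validation perplexity of 7.9
```
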
## Checkpoints
Checkpoints are saved in the `checkpoints` directory whenever a new best validation loss is achieved. Each checkpoint includes:
- Model state
- Optimizer state
- Scaler state
- Metrics

To load a checkpoint:

```python
checkpoint = torch.load("checkpoints/best_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
```

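To resume training rather than just run inference, the optimizer and scaler states listed above can be restored as well. A sketch; the checkpoint key names other than `model_state_dict`, and the AdamW learning rate, are assumptions to verify against `train.py`:

```python
import torch

# Assumes `model` from the Usage section; AdamW and a GradScaler match the training setup described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # lr is an assumption
scaler = torch.cuda.amp.GradScaler()

checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])

# Assumed key names for the remaining states; adjust to the actual checkpoint schema.
if "optimizer_state_dict" in checkpoint:
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
if "scaler_state_dict" in checkpoint:
    scaler.load_state_dict(checkpoint["scaler_state_dict"])
```
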
## Limitations
- Sequence Length: Limited to 128 tokens (configurable, but longer sequences increase memory usage).
- Dataset Size: Optimized for 50,000 examples to ensure fast training.
- Domain: Tailored for technical content; may not generalize to non-technical text.
- Hardware: Best performance on GPU; CPU training is slower.

## License
This model is licensed under the Apache-2.0 License. See the LICENSE file for details.

## Acknowledgments
- Built using PyTorch.
- Inspired by transformer architectures and BPE tokenization.
- Optimized for technical content with insights from domain-specific language models.