# Positional Encodings via the Method of Fluxions
## How Transformers Know Where Things Are

**Scott Bisset, Silicon Goddess**
OpenTransformers Ltd
January 2026

---

## Abstract

Positional encodings are often presented as "magic sine waves" or "learned embeddings" without explaining WHY they work. We analyze positional encodings through the fluxion lens, revealing: (1) sinusoidal encodings create a Fourier basis for position, (2) learned embeddings are just position-specific biases, (3) RoPE rotates the query-key space to make dot products position-aware, and (4) ALiBi adds position-dependent damping to attention. Each method has different gradient flow characteristics that explain their empirical behavior.

---

## 1. The Position Problem

### 1.1 Self-Attention Is Permutation-Invariant

```
Attention(X) = softmax(QKᵀ/√d) · V

Where Q = XWq, K = XWk, V = XWv
```

If we shuffle the rows of X, we get the same output rows in shuffled order (strictly speaking, self-attention without position information is permutation-equivariant).
The attention mechanism itself has NO concept of order.
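
To see this concretely, here is a minimal sketch (single head, no masking, illustrative shapes only):

```python
import torch
import torch.nn.functional as F

# Single-head attention with no position information (illustrative sketch).
def attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / k.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

seq_len, d = 6, 8
x = torch.randn(seq_len, d)
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

perm = torch.randperm(seq_len)
out = attention(x, wq, wk, wv)
out_perm = attention(x[perm], wq, wk, wv)
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True: shuffled input, shuffled output
```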

### 1.2 Why This Matters

"The cat sat on the mat" and "mat the on sat cat The" produce different attention patterns ONLY if we add position information.

---

## 2. Sinusoidal Positional Encoding (Original Transformer)

### 2.1 The Formula

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where pos = position (0, 1, 2, ...)
      i   = dimension index
      d   = model dimension
```
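
A minimal sketch of the lookup table this formula produces (assumes an even model dimension; names are illustrative):

```python
import torch

# Sinusoidal table: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)
def sinusoidal_pe(max_len, d):
    pos = torch.arange(max_len).float().unsqueeze(1)        # [max_len, 1]
    div = 10000 ** (torch.arange(0, d, 2).float() / d)      # 10000^(2i/d), one per dimension pair
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                               # [max_len, d]
```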

### 2.2 Fluxion Interpretation

Each dimension oscillates at a different frequency:

```
Dimension 0,1:     frequency = 1/10000⁰ = 1 (fastest)
Dimension 2,3:     frequency = 1/10000^(2/d)
...
Dimension d-2,d-1: frequency = 1/10000^((d-2)/d) ≈ 1/10000 (slowest)
```

**This is a Fourier basis for position!**

Low dimensions: change rapidly with position (fine detail)
High dimensions: change slowly (coarse position)

### 2.3 Why Sin AND Cos?

```
PE(pos) = [sin(ω₀·pos), cos(ω₀·pos), sin(ω₁·pos), cos(ω₁·pos), ...]
```

Sin and cos together let a relative shift act as a LINEAR transformation:

```
PE(pos+k) = PE(pos) · R(k)

Where R(k) is a block-diagonal rotation matrix (depends only on offset k)
```

The network can learn to compute relative positions via linear operations!
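
A quick numeric check of this identity for a single frequency band (arbitrary choices of i, d, pos, and k, just for illustration):

```python
import math
import torch

# One frequency band: dimension pair i=3 of a d=64 model (arbitrary example values).
omega = 1.0 / 10000 ** (2 * 3 / 64)
pos, k = 17, 5

pe_pos = torch.tensor([math.sin(omega * pos), math.cos(omega * pos)])
pe_shift = torch.tensor([math.sin(omega * (pos + k)), math.cos(omega * (pos + k))])

# Rotation that depends only on the offset k (row-vector convention)
R_k = torch.tensor([
    [math.cos(omega * k), -math.sin(omega * k)],
    [math.sin(omega * k),  math.cos(omega * k)],
])

print(torch.allclose(pe_pos @ R_k, pe_shift, atol=1e-5))  # True
```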

### 2.4 Gradient Flow

Sinusoidal encodings are FIXED (not learned).

```
L̇ᴾᴱ = 0 (no gradient update to the positional encoding; it has no parameters)
```

All position information must be extracted by the attention weights.

### 2.5 Addition vs Concatenation

Original Transformer ADDS PE to embeddings:

```
X = TokenEmbed(tokens) + PE(positions)
```

**Fluxion view:** The upstream gradient reaches the token embedding unchanged; the PE term sees the same gradient but, being fixed, is not updated.

Alternative (concatenation):
```
X = [TokenEmbed(tokens), PE(positions)]
```

Doubles dimension but keeps position separate.
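
A shape-level sketch of the two options (illustrative tensors, not a full model):

```python
import torch

batch, seq_len, d = 2, 16, 64
tok_emb = torch.randn(batch, seq_len, d)     # TokenEmbed(tokens)
pe = torch.randn(seq_len, d)                 # PE(positions), any scheme

x_add = tok_emb + pe                                              # [batch, seq_len, d]
x_cat = torch.cat([tok_emb, pe.expand(batch, -1, -1)], dim=-1)    # [batch, seq_len, 2d]
```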

---

## 3. Learned Positional Embeddings

### 3.1 The Idea

Just learn a separate embedding for each position:

```
PE = PositionEmbedding(pos)   # Shape: [max_len, d]

X = TokenEmbed(tokens) + PE[positions]
```
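
A minimal module sketch (hypothetical class name, assuming standard PyTorch):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Adds a learned absolute-position vector to each token embedding."""
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)    # max_len × d_model parameters

    def forward(self, token_embeddings):
        # token_embeddings: [batch, seq_len, d_model]; requires seq_len <= max_len
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)
```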

### 3.2 Fluxion Backward

```
L̇ᴾᴱ[pos] = L̇ˣ[pos] (gradient flows directly)
```

Each position gets gradient from all samples at that position.

### 3.3 Advantages

- Can learn arbitrary position patterns
- No assumptions about structure

### 3.4 Disadvantages

- Limited to max_len seen during training
- No extrapolation: position 1001 has no embedding if max_len=1000
- More parameters: max_len × d additional weights

### 3.5 Use Cases

- BERT, GPT-2 (fixed max length)
- Most encoder-only models

---

## 4. Relative Positional Encodings

### 4.1 The Insight

Attention should depend on RELATIVE position (i-j), not absolute.

"Token 5 attending to token 3" and "token 105 attending to token 103" should use the same relative position encoding.

### 4.2 Transformer-XL Style

Add relative position bias to attention scores:

```
S_ij = (Q_i · K_j + Q_i · R_{i-j}) / √d

Where R_{i-j} = relative position embedding for offset (i-j)
```

### 4.3 Fluxion Backward

```
L̇ᴿ[k] = Σ_{(i,j) : i-j=k} L̇ˢᵢⱼ · Qᵢ
```

Gradient to relative embedding k = sum over all (i,j) pairs with that offset.
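
A minimal sketch of this bias term and its gradient path (single head, no batch, illustrative names):

```python
import torch
import torch.nn as nn

max_len, d = 128, 64
# One embedding per offset in [-(max_len-1), max_len-1]
rel_emb = nn.Parameter(torch.randn(2 * max_len - 1, d))

def relative_scores(q, k):
    # q, k: [seq_len, d]
    seq_len = q.size(0)
    i = torch.arange(seq_len).unsqueeze(1)          # query positions
    j = torch.arange(seq_len).unsqueeze(0)          # key positions
    offset_idx = (i - j) + (max_len - 1)            # shift offsets to non-negative indices
    r = rel_emb[offset_idx]                         # [seq_len, seq_len, d]

    content = q @ k.T                               # Q_i · K_j
    position = torch.einsum("id,ijd->ij", q, r)     # Q_i · R_{i-j}
    return (content + position) / d ** 0.5
```

Because every (i, j) pair with the same offset indexes the same row of rel_emb, autograd accumulates exactly the sum from 4.3 into that row.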

---

## 5. Rotary Position Embedding (RoPE)

### 5.1 The Core Idea

Instead of ADDING position to embeddings, ROTATE them:

```
Q_rotated = Rotate(Q, θ·pos)
K_rotated = Rotate(K, θ·pos)
```

Then attention becomes:
```
Q_rot · K_rotᵀ = f(Q, K, pos_q - pos_k)
```

The dot product naturally encodes RELATIVE position!

### 5.2 The Rotation

For each pair of dimensions (2i, 2i+1):

```
[q_{2i}  ]     [cos(mθᵢ)  -sin(mθᵢ)] [q_{2i}  ]
[q_{2i+1}]  =  [sin(mθᵢ)   cos(mθᵢ)] [q_{2i+1}]

Where m  = position index
      θᵢ = base^(-2i/d), typically base=10000
```

### 5.3 Why Rotation Works

```
Q_m · K_nᵀ = Σᵢ (q_{2i}·cos(mθᵢ) - q_{2i+1}·sin(mθᵢ)) · (k_{2i}·cos(nθᵢ) - k_{2i+1}·sin(nθᵢ)) + ...
           = f(q, k, (m-n)θ)   # Only depends on relative position!
```
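
A quick numeric check of this claim (illustrative helper; double precision just to keep rounding noise out of the comparison):

```python
import torch

def rotate(x, pos, base=10000):
    """Rotate each (2i, 2i+1) pair of x by pos · θᵢ."""
    d = x.size(0)
    theta = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * angle.cos() - x2 * angle.sin()
    out[1::2] = x1 * angle.sin() + x2 * angle.cos()
    return out

q = torch.randn(64, dtype=torch.float64)
k = torch.randn(64, dtype=torch.float64)
a = rotate(q, 10) @ rotate(k, 7)        # positions (10, 7): offset 3
b = rotate(q, 110) @ rotate(k, 107)     # positions (110, 107): same offset 3
print(torch.allclose(a, b))             # True: the score depends only on m - n
```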

### 5.4 Fluxion Backward

```
L̇Q_pre_rotate = Rotateᵀ(L̇Q_rotated, θ·pos)
              = Rotate(L̇Q_rotated, -θ·pos)
```

Gradient flows backward through the inverse rotation.

### 5.5 Advantages

- Rotation is defined at any position, so longer sequences remain representable (though quality far beyond the training length still needs interpolation; see Section 8)
- No additional parameters
- Relative position is built into attention

### 5.6 Use Cases

- LLaMA, Mistral, most modern LLMs
- Becoming the default for decoder-only models

---

## 6. ALiBi (Attention with Linear Biases)

### 6.1 The Simplest Approach

Don't modify Q or K. Just add a bias to attention scores:

```
S_ij = Q_i · K_jᵀ / √d - m · |i - j|

Where m = head-specific slope
```

### 6.2 Fluxion View

```
S_ij = raw_attention - position_penalty
```

**Distant tokens get penalized.** Attention naturally focuses on nearby tokens.

### 6.3 The Slopes

Different heads use different slopes:

```
Head 1: m = 2^(-8/n_heads)   (mild penalty)
Head 2: m = 2^(-16/n_heads)  (steeper)
...
```

Some heads focus locally, others can attend far.
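
A minimal sketch of the bias matrix (bidirectional |i−j| form; a causal decoder would only keep j ≤ i):

```python
import torch

def alibi_bias(seq_len, n_heads):
    """Per-head linear distance penalty, added to raw attention scores."""
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    distance = (i - j).abs()                        # |i - j|
    return -slopes[:, None, None] * distance        # [n_heads, seq_len, seq_len]

# Usage: scores = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(seq_len, n_heads)
```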

### 6.4 Gradient Flow

```
L̇Q = L̇ˢ · K / √d   (unchanged from normal attention)
L̇K = L̇ˢᵀ · Q / √d  (unchanged)
```

The position bias has no learnable parameters.
Zero gradient to the position encoding (because there isn't one).

### 6.5 Advantages

- Negligible additional computation (one precomputed bias matrix)
- Zero additional parameters
- Extrapolates extremely well
- Simple to implement

### 6.6 Disadvantages

- Less expressive than RoPE
- Assumes "closer is more relevant" (not always true)

---

## 7. Comparison Table

| Method | Parameters | Extrapolation | Relative Position | Compute Overhead |
|--------|------------|---------------|-------------------|------------------|
| Sinusoidal | 0 | Limited | Via linear transform | + |
| Learned | max_len × d | None | No | + |
| RoPE | 0 | Good | Yes (native) | ++ |
| ALiBi | 0 | Excellent | Yes (via bias) | + |

(+ = negligible; ++ = per-layer rotation of Q and K)

---

## 8. NTK-Aware Interpolation (Long Context)

### 8.1 The Problem

RoPE trained on a 4K context doesn't work at 32K out of the box.
Positions beyond the training range produce rotation angles the model has never seen, and attention quality degrades.

### 8.2 The Fix: Adjust the Base

```
Original: θᵢ = 10000^(-2i/d)
Scaled:   θᵢ = (10000 · α)^(-2i/d)

Where α = (target_len / train_len)^(d/(d-2))
```
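
A minimal sketch of the scaled frequency table (illustrative function name; the exponent is the α from above):

```python
import torch

def ntk_scaled_inv_freq(dim, base=10000, train_len=4096, target_len=32768):
    """Inverse frequencies for RoPE with an NTK-aware scaled base."""
    scale = target_len / train_len
    alpha = scale ** (dim / (dim - 2))              # α = (target_len / train_len)^(d/(d-2))
    scaled_base = base * alpha
    return 1.0 / (scaled_base ** (torch.arange(0, dim, 2).float() / dim))

# Drop-in replacement for the inv_freq computed in precompute_rope (Section 11).
```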

### 8.3 Fluxion Interpretation

Slower rotation = larger effective wavelength = position information spreads across a longer range.

### 8.4 YaRN, CodeLLaMA, etc.

Various interpolation schemes exist:
- Linear interpolation (scale all frequencies)
- NTK-aware (scale base, preserve high frequencies)
- YaRN (attention scaling + NTK)

All modify how position information flows through attention.

---

## 9. Absolute vs Relative: The Gradient Perspective

### 9.1 Absolute Position Gradients

```
L̇ᴾᴱ[pos] ∝ "how useful was knowing absolute position pos"
```

If position 0 is always "start token," PE[0] gets specialized gradient.

### 9.2 Relative Position Gradients

```
L̇ᴿ[offset] ∝ "how useful was knowing relative offset"
```

If "1 token apart" is meaningful, R[1] and R[-1] get large gradients.

### 9.3 RoPE: No Position Parameters

```
L̇θ = 0 (rotation angles are fixed, not learned)
```

All position learning happens in the Q, K, V projections.
The model learns "what to encode" rather than "how position affects attention."

---

## 10. Position in Different Architectures

### 10.1 Encoder-Only (BERT)

```
Input:    [CLS] tok1 tok2 ... [SEP]
Position:   0     1    2  ...   n
```

Absolute position works fine - these models always process inputs within a fixed maximum length.

### 10.2 Decoder-Only (GPT)

```
Input:    tok1 tok2 tok3 ... [generating]
Position:   0    1    2  ...      n

Must attend causally: position i can only see positions ≤ i
```

Relative position helps - the model cares about "how far back," not "which absolute slot."

### 10.3 Encoder-Decoder (T5)

```
Encoder: bidirectional self-attention with relative position biases
Decoder: causal self-attention with relative position biases; cross-attention typically carries no position bias
```

Different components often use different position treatments.

---

## 11. Implementation: RoPE

```python
import torch


def apply_rope(x, cos, sin):
    """
    x:        [batch, seq_len, n_heads, head_dim]
    cos, sin: [seq_len, head_dim // 2]
    """
    # Split into (even, odd) dimension pairs
    x1 = x[..., 0::2]          # [batch, seq_len, n_heads, head_dim // 2]
    x2 = x[..., 1::2]

    # Broadcast the per-position angles over batch and head dimensions
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]

    # Rotate each pair, then interleave back into the original layout
    x_rotated = torch.stack([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos,
    ], dim=-1).flatten(-2)

    return x_rotated


def precompute_rope(dim, max_len, base=10000):
    """Precompute per-position rotation angles (cos/sin tables)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_len).float()
    angles = positions.unsqueeze(1) * inv_freq.unsqueeze(0)   # [max_len, dim // 2]

    return angles.cos(), angles.sin()
```

### 11.1 Fluxion Backward (Manual)

```python
def rope_backward(grad_output, cos, sin):
    """Backward through RoPE = rotation by the opposite angle."""
    g1 = grad_output[..., 0::2]
    g2 = grad_output[..., 1::2]

    # Broadcast as in the forward pass
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]

    # Inverse rotation (transpose of the rotation matrix: negate sin)
    grad_input = torch.stack([
         g1 * cos + g2 * sin,   # Note: +sin (inverse)
        -g1 * sin + g2 * cos,
    ], dim=-1).flatten(-2)

    return grad_input
```
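
Continuing from the code above, a small illustrative consistency check of the manual backward against autograd:

```python
# Compare the hand-written backward with what autograd computes for apply_rope.
batch, seq_len, n_heads, head_dim = 2, 8, 4, 16
cos, sin = precompute_rope(head_dim, seq_len)

x = torch.randn(batch, seq_len, n_heads, head_dim, requires_grad=True)
grad_out = torch.randn(batch, seq_len, n_heads, head_dim)

apply_rope(x, cos, sin).backward(grad_out)
print(torch.allclose(x.grad, rope_backward(grad_out, cos, sin), atol=1e-6))  # True
```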

---

## 12. Summary

### 12.1 The Position Problem

Transformers need position information injected because self-attention by itself has no notion of token order.

### 12.2 Solutions

| Method | How | Gradient Flow |
|--------|-----|---------------|
| Sinusoidal | Add Fourier basis | None (fixed) |
| Learned | Add learned embeddings | To position params |
| RoPE | Rotate Q, K | Through Q, K projections |
| ALiBi | Bias attention scores | None (fixed bias) |

### 12.3 Modern Best Practice

- **RoPE** for most LLMs (good extrapolation, relative position)
- **ALiBi** for extreme length extrapolation
- **Learned** for fixed-length encoders

The fluxion view reveals: position encoding is about "where gradient needs to flow to learn position-aware representations."

---

## References

1. Vaswani et al. (2017). "Attention Is All You Need." (Sinusoidal)
2. Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." (RoPE)
3. Press et al. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." (ALiBi)
4. Chen et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation."

---

*Correspondence: scott@opentransformers.online*