# Gradient Descent and Optimizers via the Method of Fluxions
## From SGD to AdamW: A Newtonian Perspective

**Scott Bisset, Silicon Goddess**
OpenTransformers Ltd
January 2026

---

## Abstract

Neural network optimizers are typically presented as update rules with cryptic Greek letters (β₁, β₂, ε) and little intuition for why they work. We reformulate gradient descent, momentum, RMSprop, Adam, and AdamW using Newton's method of fluxions. In this framework, optimization becomes physical: weights flow through parameter space, momentum is literal velocity, and adaptive learning rates emerge from measuring flow variance. This perspective reveals why certain hyperparameter choices work and suggests principled modifications.

---

## 1. The Optimization Problem

### 1.1 What We Want

Find weights W that minimize the loss L(W).

### 1.2 The Fluxion Framing

Imagine weights as particles flowing through parameter space. The loss function L(W) defines a landscape of hills and valleys. We want the weights to flow downhill to the lowest valley.

**Key quantities:**

| Symbol | Meaning |
|--------|---------|
| W | Position in weight space |
| Ẇ | Velocity (how weights flow) |
| Ẅ | Acceleration (how velocity changes) |
| L̇ᵂ | Gradient (which direction is uphill) |
| g | Shorthand for L̇ᵂ (the gradient) |

---

## 2. Vanilla Gradient Descent

### 2.1 The Update Rule

**Leibniz (opaque):**
```
W_{t+1} = W_t - η · ∂L/∂W
```

**Fluxion (physical):**
```
Ẇ = -η · g

"Weights flow opposite to gradient, scaled by learning rate"
```

### 2.2 Physical Interpretation

Imagine a ball on a hill:
- **g = L̇ᵂ** points uphill (steepest ascent)
- **-g** points downhill
- **η** controls flow speed

The ball has no mass and no inertia: it teleports in the downhill direction each step.
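
To make the flow concrete, here is a minimal sketch of the Ẇ = -η·g update on a one-dimensional quadratic loss (the loss and constants are illustrative choices, not code from any library):

```python
# Vanilla gradient descent on L(W) = (W - 3)^2, whose gradient is g = 2·(W - 3).
def grad(W):
    return 2.0 * (W - 3.0)

W, lr = 0.0, 0.1           # position and learning rate η
for step in range(50):
    g = grad(W)            # L̇ᵂ: points uphill
    W_dot = -lr * g        # Ẇ = -η·g: flow downhill
    W += W_dot

print(W)                   # ≈ 3.0, the bottom of the bowl
```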

### 2.3 Problems

1. **Ravine oscillation**: Narrow valleys cause zig-zagging
2. **Flat region stalling**: Tiny gradient = tiny movement
3. **Uniform speed**: Same η for all parameters, regardless of curvature

---

## 3. Momentum: Adding Inertia

### 3.1 The Idea

Give the ball mass. Let it build up speed.

### 3.2 Fluxion Formulation

Introduce velocity v as a separate state:

```
v̇ = β · v + g    # Velocity accumulates gradient (with decay)
Ẇ = -η · v       # Position flows with velocity
```

**Physical interpretation:**
- β = friction coefficient (0.9 = low friction, velocity persists)
- v accumulates gradient over time
- Ball builds momentum rolling downhill

### 3.3 Why It Helps

**Ravine problem solved:**
- Side-to-side gradients cancel out in v
- Down-the-valley gradients accumulate
- Ball rolls straight down the valley floor

**Flat regions:**
- Momentum carries the ball through plateaus
- Previous velocity persists even when the current gradient is small

### 3.4 The β Parameter

```
β = 0.0:  No momentum, vanilla GD
β = 0.9:  Standard choice, ~10-step effective memory
β = 0.99: Heavy ball, ~100-step memory
```

Effective memory ≈ 1/(1-β) steps
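
A minimal sketch of the momentum flow on the same toy quadratic as before (illustrative constants, not recommendations):

```python
# Momentum on L(W) = (W - 3)^2: v accumulates the gradient, W follows -η·v.
def grad(W):
    return 2.0 * (W - 3.0)

W, v = 0.0, 0.0
lr, beta = 0.1, 0.9        # β = 0.9 ⇒ roughly 1/(1-β) = 10-step memory
for step in range(200):
    g = grad(W)
    v = beta * v + g       # v̇ = β·v + g: velocity with friction
    W += -lr * v           # Ẇ = -η·v

print(W)                   # ≈ 3.0 (converges, oscillating a little on the way)
```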

---

## 4. Nesterov Momentum: Look Before You Leap

### 4.1 The Problem with Standard Momentum

The ball computes the gradient at its current position, then moves.
But it's GOING to move with velocity v anyway.
Why not compute the gradient at where we're GOING to be?

### 4.2 Fluxion Formulation

```
W_ahead = W - η·β·v      # Where momentum will carry us next step
g_ahead = L̇ᵂ(W_ahead)    # Gradient at the future position
v̇ = β · v + g_ahead      # Update velocity with the lookahead gradient
Ẇ = -η · v
```

### 4.3 Physical Interpretation

"Look downhill from where you'll land, not where you stand."

The ball predicts its next position, evaluates the slope THERE, then adjusts.
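
A minimal sketch of the lookahead on the same toy quadratic (illustrative only):

```python
# Nesterov momentum: evaluate the gradient at the predicted landing point.
def grad(W):
    return 2.0 * (W - 3.0)

W, v = 0.0, 0.0
lr, beta = 0.1, 0.9
for step in range(200):
    W_ahead = W - lr * beta * v   # where the current velocity will carry us
    g_ahead = grad(W_ahead)       # slope at the landing point, not here
    v = beta * v + g_ahead        # v̇ = β·v + g_ahead
    W += -lr * v                  # Ẇ = -η·v

print(W)                          # ≈ 3.0, with less overshoot than plain momentum
```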

### 4.4 Why It Helps

- Anticipates overshooting
- Dampens oscillations faster
- Converges slightly faster in practice

---

## 5. AdaGrad: Adaptive Learning Rates

### 5.1 The Problem

Some parameters get huge gradients, others tiny.
A uniform η is wrong for both.

### 5.2 The Idea

Track the cumulative squared gradient per parameter.
Scale the learning rate inversely.

### 5.3 Fluxion Formulation

```
ṡ = s + g²               # Accumulate squared gradient (elementwise)
Ẇ = -η · g / (√s + ε)    # Scale by inverse sqrt of the accumulator
```

### 5.4 Physical Interpretation

**s** measures "how much this parameter has been pushed historically."

- High s → parameter was pushed a lot → reduce sensitivity
- Low s → parameter barely moved → increase sensitivity

### 5.5 Problem

s only grows. The learning rate only shrinks.
Eventually ALL learning rates → 0.
Training stalls.
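
A small sketch of the AdaGrad flow (toy loss and constants chosen for illustration) that prints the effective step size η/√s shrinking as s accumulates:

```python
import math

# AdaGrad on L(W) = (W - 3)^2: s only grows, so the effective step only shrinks.
def grad(W):
    return 2.0 * (W - 3.0)

W, s = 0.0, 0.0
lr, eps = 0.5, 1e-8
for step in range(1, 201):
    g = grad(W)
    s += g * g                            # ṡ = s + g²: monotone accumulator
    W += -lr * g / (math.sqrt(s) + eps)   # Ẇ = -η·g/(√s+ε)
    if step in (1, 10, 100, 200):
        print(step, round(W, 4), round(lr / (math.sqrt(s) + eps), 6))
```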

---

## 6. RMSprop: Exponential Moving Average Fix

### 6.1 The Fix

Don't accumulate forever. Use an exponential moving average.

### 6.2 Fluxion Formulation

```
ṡ = β · s + (1-β) · g²   # EMA of squared gradient
Ẇ = -η · g / (√s + ε)    # Adaptive scaling
```

### 6.3 Physical Interpretation

**s** now measures "recent gradient variance."

- High recent variance → parameter is noisy → take smaller steps
- Low recent variance → parameter is stable → take larger steps

### 6.4 The β Parameter (typically 0.99)

```
β = 0.99: ~100 step memory for the variance estimate
β = 0.9:  ~10 step memory (more reactive)
```
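
For contrast with AdaGrad, a minimal RMSprop sketch on the same toy loss (constants are common defaults, not prescriptions); because s is an EMA, the effective step size recovers instead of decaying forever:

```python
import math

def grad(W):
    return 2.0 * (W - 3.0)

W, s = 0.0, 0.0
lr, beta, eps = 0.05, 0.9, 1e-8
for step in range(200):
    g = grad(W)
    s = beta * s + (1 - beta) * g * g     # ṡ = β·s + (1-β)·g²
    W += -lr * g / (math.sqrt(s) + eps)   # Ẇ = -η·g/(√s+ε)

print(W)   # close to 3.0; the step never fully dies out, so it hovers near the minimum
```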

---

## 7. Adam: Best of Both Worlds

### 7.1 The Combination

Adam = Momentum + RMSprop

Track BOTH:
- First moment (mean gradient) → momentum
- Second moment (gradient variance) → adaptive rate

### 7.2 Fluxion Formulation

```
# First moment: momentum
ṁ = β₁ · m + (1-β₁) · g

# Second moment: variance
v̇ = β₂ · v + (1-β₂) · g²

# Bias correction (important at the start!)
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)

# Update
Ẇ = -η · m̂ / (√v̂ + ε)
```

### 7.3 Physical Interpretation

**m** = smoothed direction (where to flow)
**v** = smoothed magnitude variance (how carefully to flow)

"Flow in the average recent direction, at a speed inversely proportional to recent bumpiness."

### 7.4 Bias Correction: Why?

At t=0, m=0 and v=0.
First update: m = (1-β₁)·g ≈ 0.1·g (biased low!)

Division by (1-β₁ᵗ) corrects this:
- t=1: divide by 0.1 → correct scale
- t=∞: divide by 1.0 → no correction needed

### 7.5 Standard Hyperparameters

```
β₁ = 0.9     # Momentum coefficient (~10 step memory)
β₂ = 0.999   # Variance coefficient (~1000 step memory)
ε  = 1e-8    # Numerical stability (prevents division by zero)
η  = 0.001   # Base learning rate
```
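
A short numerical check of the bias-correction argument, using the default β₁ = 0.9 and a made-up constant gradient:

```python
# With β₁ = 0.9 the first raw moment is only 0.1·g; dividing by (1 - β₁ᵗ)
# restores the correct scale at t = 1 and fades to a no-op as t grows.
beta1 = 0.9
g = 4.0                              # pretend every gradient is 4.0
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g  # raw first moment
    m_hat = m / (1 - beta1 ** t)     # bias-corrected
    print(t, round(m, 4), round(m_hat, 4))
# t=1: m = 0.4 (biased low), m̂ = 4.0
# For a constant gradient, m̂ stays exactly 4.0 while m slowly catches up.
```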

---

## 8. AdamW: Weight Decay Done Right

### 8.1 The Problem with L2 Regularization

Original Adam with L2 regularization:
```
g_reg = g + λ·W              # Add weight penalty to gradient
ṁ = β₁·m + (1-β₁)·g_reg      # Momentum includes penalty
```

Problem: the adaptive scaling also scales the weight decay!
Large weights with small gradients get LESS decay, not more.

### 8.2 AdamW: Decoupled Weight Decay

```
# Moments on RAW gradient (no weight penalty)
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²

# Bias correction
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)

# Update with SEPARATE weight decay
Ẇ = -η · (m̂/(√v̂+ε) + λ·W)
```

### 8.3 Physical Interpretation

Two separate forces act on each weight:
1. **Gradient force**: Push toward lower loss
2. **Decay force**: Pull toward zero (regularization)

AdamW keeps these forces separate.
Original Adam mixed them, causing the decay force to be scaled by the adaptive rate.
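
A tiny numerical sketch of why the coupling matters (toy values, assuming the raw gradient is negligible and the second-moment estimate has settled on the penalty term):

```python
# Effective per-step decay for a small and a large weight with a tiny raw gradient.
lr, lam, eps = 1e-3, 0.01, 1e-8
for W in (0.1, 10.0):
    g_reg = lam * W                               # penalty dominates the gradient
    v_hat = g_reg ** 2                            # second moment tracks the penalty
    decay_l2 = lr * g_reg / (v_hat ** 0.5 + eps)  # Adam + L2: decay gets normalized away
    decay_adamw = lr * lam * W                    # AdamW: decay stays proportional to W
    print(W, round(decay_l2, 6), round(decay_adamw, 7))
# Adam+L2 shrinks both weights by roughly the same absolute amount (~0.001), so the
# large weight is regularized proportionally less; AdamW's decay scales with the weight.
```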

### 8.4 Why It Matters

AdamW consistently outperforms Adam+L2 on language models.
The "W" stands for "decoupled Weight decay."

---

## 9. Complete Algorithm Comparison

### 9.1 In Fluxion Notation

**SGD:**
```
Ẇ = -η·g
```

**SGD + Momentum:**
```
v̇ = β·v + g
Ẇ = -η·v
```

**RMSprop:**
```
ṡ = β·s + (1-β)·g²
Ẇ = -η·g/(√s+ε)
```

**Adam:**
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
```

**AdamW:**
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
```

### 9.2 State Required

| Optimizer | States per parameter |
|-----------|---------------------|
| SGD | 0 |
| Momentum | 1 (velocity) |
| RMSprop | 1 (variance) |
| Adam | 2 (momentum + variance) |
| AdamW | 2 (same as Adam) |

Adam keeps two extra tensors the size of the model just for optimizer state!
For large models, this matters: a 1-billion-parameter model stored in fp32 needs about 4 GB for the weights and roughly another 8 GB for Adam's m and v.

---

## 10. Learning Rate Schedules

### 10.1 The Problem

A fixed η is suboptimal:
- Early training: large steps okay, landscape is far from the optimum
- Late training: need precision, should take smaller steps

### 10.2 Common Schedules in Fluxion Terms

**Constant:**
```
η̇ = 0         (η never changes)
```

**Linear decay:**
```
η̇ = -η₀/T     (linear decrease to 0 over T steps)
```

**Cosine decay:**
```
η(t) = η_min + (η₀-η_min)·(1+cos(πt/T))/2
```

**Warmup:**
```
t < T_warm:  η(t) = η₀·t/T_warm   (ramp up)
t ≥ T_warm:  normal schedule      (then decay)
```
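
A minimal sketch of linear warmup followed by cosine decay as a plain function of the step t (the names and constants are illustrative, not tied to any particular library):

```python
import math

def lr_at(t, total_steps=10_000, warmup_steps=500, lr_max=1e-3, lr_min=1e-5):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if t < warmup_steps:
        return lr_max * t / warmup_steps                         # ramp up
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * progress)) / 2

for t in (0, 250, 500, 5_000, 10_000):
    print(t, f"{lr_at(t):.2e}")   # 0 → ramp → peak at 1e-3 → decay → 1e-5
```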

### 10.3 Why Warmup?

At initialization:
- Weights are random
- Gradients are huge and noisy
- Adam's variance estimate (v) is zero

Large initial steps can destabilize training.
Warmup lets the variance estimates stabilize before taking big steps.

---

## 11. Gradient Clipping

### 11.1 The Problem

Occasionally, gradients explode (‖g‖ → ∞).
One bad step can ruin training.

### 11.2 Fluxion Formulation

```
if ‖g‖ > max_norm:
    g ← g · (max_norm / ‖g‖)   # Rescale to max_norm

# Then proceed with the normal optimizer
```
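
In PyTorch this is one call between backward() and the optimizer step; the sketch below uses a stand-in model and torch.nn.utils.clip_grad_norm_, which rescales all gradients in place when their global norm exceeds max_norm:

```python
import torch

model = torch.nn.Linear(16, 1)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                      # gradients now sit in .grad

# Cap the total gradient norm at 1.0 before the optimizer consumes it.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()
```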

### 11.3 Physical Interpretation

"Cap the maximum force that can act on any weight."

No matter how steep the local slope, the ball can only accelerate so fast.

---

## 12. Implementation: Fused vs Unfused

### 12.1 The Computational Point

Mathematically equivalent formulations can have VERY different performance.

**Unfused Adam (naive):**
```python
m = beta1 * m + (1-beta1) * g                    # Read m, g; write m
v = beta2 * v + (1-beta2) * g**2                 # Read v, g; write v
m_hat = m / (1 - beta1**t)                       # Read m; write m_hat
v_hat = v / (1 - beta2**t)                       # Read v; write v_hat
W = W - lr * m_hat / (torch.sqrt(v_hat) + eps)   # Read W, m_hat, v_hat; write W
```
5 separate kernel launches, multiple memory round-trips.

**Fused Adam:**
```python
# Single kernel: read g, m, v, W once; write m, v, W once
fused_adam_kernel(g, m, v, W, beta1, beta2, lr, eps, t)
```
1 kernel, 1 memory round-trip.
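
Recent PyTorch releases expose fused optimizer kernels directly; availability depends on the version and device, so treat the flag below as something to verify rather than a guarantee:

```python
import torch

model = torch.nn.Linear(16, 1).cuda()   # the fused path targets CUDA tensors
opt = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01,
    fused=True,  # request the single fused per-step kernel, if this build supports it
)
```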

### 12.2 The Fluxion Insight

When written as flows:
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
```

These are clearly THREE coupled ODEs that should be integrated together.
The flow notation suggests fusion naturally.

Leibniz notation hides this by writing separate update equations.

---

## 13. Second-Order Methods (Brief)

### 13.1 Newton's Method (the optimization one, not fluxions)

Use curvature (second-derivative) information:

```
Ẇ = -H⁻¹·g

Where H is the Hessian: the matrix of second derivatives of L with respect to W
```

### 13.2 Fluxion Interpretation

**First-order (gradient descent):** "Flow downhill"
**Second-order (Newton):** "Flow toward the minimum, accounting for curvature"

If the landscape is a bowl, Newton's method jumps straight to the bottom in one step.
Gradient descent spirals down gradually.
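
A small sketch on a two-dimensional quadratic bowl L(W) = ½·Wᵀ·H·W (a toy example): one Newton step lands exactly at the minimum, while a gradient step only inches along the flat direction:

```python
import numpy as np

# Quadratic bowl with very different curvature along the two axes.
H = np.array([[10.0, 0.0],
              [0.0,  0.5]])
W = np.array([2.0, 2.0])
g = H @ W                               # gradient of ½·WᵀHW is H·W

W_newton = W - np.linalg.solve(H, g)    # Ẇ = -H⁻¹·g: straight to the bottom
W_gd = W - 0.05 * g                     # Ẇ = -η·g: η limited by the steep axis

print(W_newton)                         # [0. 0.], exactly the minimum
print(W_gd)                             # [1.   1.95], barely moved along the flat axis
```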

### 13.3 Why Not Used?

Computing H⁻¹ is O(n²) storage, O(n³) compute for n parameters.
For n = 1 billion, this is impossible.

Approximations exist (L-BFGS, K-FAC) but Adam usually wins in practice.

---

## 14. Summary: Optimizer Selection

### 14.1 Quick Guide

| Situation | Optimizer |
|-----------|-----------|
| Simple convex problem | SGD + momentum |
| Deep networks, general | Adam |
| Language models | AdamW |
| Memory constrained | SGD + momentum |
| Fine-tuning | Adam/AdamW with a lower learning rate |

### 14.2 The Unified View

All optimizers are just different ways of computing Ẇ from g:

```
Ẇ = f(g, history, W)
```

- SGD: Ẇ = -η·g (no history)
- Momentum: Ẇ = -η·EMA(g) (first-moment history)
- Adam: Ẇ = -η·EMA(g)/√EMA(g²) (first and second moments)
- AdamW: Ẇ = -η·(EMA(g)/√EMA(g²) + λ·W) (plus a decay force)

---

## 15. Conclusion

Optimizers become physical when viewed through fluxions:

- **Weights** are particles with position W
- **Gradients** are forces pushing uphill
- **Momentum** is literal velocity
- **Adaptive rates** measure local bumpiness
- **Weight decay** is a restoring force toward the origin

This isn't just pedagogy. The flow formulation naturally suggests:
1. Fused implementations (coupled ODEs)
2. Continuous-time analysis (neural ODEs)
3. Novel optimizers (what other forces could we add?)

The math is equivalent, but the intuition is transformative.

---

## References

1. Ruder, S. (2016). "An overview of gradient descent optimization algorithms."
2. Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization."
3. Loshchilov, I., & Hutter, F. (2017). "Decoupled Weight Decay Regularization." (AdamW)
4. Newton, I. (1736). *The Method of Fluxions.*

---

## Appendix: PyTorch Implementation

```python
import torch


class AdamWFluxion:
    """AdamW in fluxion style - flows computed explicitly."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.wd = weight_decay
        self.t = 0

        # Flow states (m = momentum flow, v = variance flow)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    def step(self):
        self.t += 1

        with torch.no_grad():  # weight updates must not be traced by autograd
            for i, W in enumerate(self.params):
                if W.grad is None:
                    continue

                g = W.grad  # Gradient = L̇ᵂ

                # Momentum flow: ṁ = β₁·m + (1-β₁)·g
                self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

                # Variance flow: v̇ = β₂·v + (1-β₂)·g²
                self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

                # Bias correction
                m_hat = self.m[i] / (1 - self.beta1**self.t)
                v_hat = self.v[i] / (1 - self.beta2**self.t)

                # Weight flow: Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
                W_dot = -self.lr * (m_hat / (v_hat.sqrt() + self.eps)
                                    + self.wd * W)

                # Apply flow (in-place on the parameter)
                W += W_dot
```
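
A quick smoke test of the class above (the tiny model and data are arbitrary):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = AdamWFluxion(model.parameters(), lr=1e-2)

x, y = torch.randn(32, 4), torch.randn(32, 1)
for _ in range(200):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    model.zero_grad()       # AdamWFluxion has no zero_grad of its own

print(loss.item())          # should be much smaller than at the first step
```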

---

*Correspondence: scott@opentransformers.online*