# Gradient Descent and Optimizers via the Method of Fluxions
## From SGD to AdamW: A Newtonian Perspective

**Scott Bisset, Silicon Goddess**
OpenTransformers Ltd
January 2026

---

## Abstract

Neural network optimizers are typically presented as update rules with cryptic Greek letters (β₁, β₂, ε) and little intuition for why they work. We reformulate gradient descent, momentum, RMSprop, Adam, and AdamW using Newton's method of fluxions. In this framework, optimization becomes physical: weights flow through parameter space, momentum is literal velocity, and adaptive learning rates emerge from measuring flow variance. This perspective reveals why certain hyperparameter choices work and suggests principled modifications.

---

## 1. The Optimization Problem

### 1.1 What We Want

Find weights W that minimize the loss L(W).

### 1.2 The Fluxion Framing

Imagine weights as particles flowing through parameter space. The loss function L(W) defines a landscape of hills and valleys. We want the weights to flow downhill to the lowest valley.

**Key quantities:**

| Symbol | Meaning |
|--------|---------|
| W | Position in weight space |
| Ẇ | Velocity (how weights flow) |
| Ẅ | Acceleration (how velocity changes) |
| L̇ᵂ | Gradient (which direction is uphill) |
| g | Shorthand for L̇ᵂ (the gradient) |

---

## 2. Vanilla Gradient Descent

### 2.1 The Update Rule

**Leibniz (opaque):**
```
W_{t+1} = W_t - η · ∂L/∂W
```

**Fluxion (physical):**
```
Ẇ = -η · g

"Weights flow opposite to gradient, scaled by learning rate"
```

### 2.2 Physical Interpretation

Imagine a ball on a hill:
- **g = L̇ᵂ** points uphill (steepest ascent)
- **-g** points downhill
- **η** controls flow speed

The ball has no mass and no inertia: it teleports in the downhill direction each step.
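
To make the flow concrete, here is a minimal sketch of the Ẇ = -η·g update on a one-dimensional quadratic loss (the loss and constants are illustrative choices, not code from any library):

```python
# Vanilla gradient descent on L(W) = (W - 3)^2, whose gradient is g = 2·(W - 3).
def grad(W):
    return 2.0 * (W - 3.0)

W, lr = 0.0, 0.1           # position and learning rate η
for step in range(50):
    g = grad(W)            # L̇ᵂ: points uphill
    W_dot = -lr * g        # Ẇ = -η·g: flow downhill
    W += W_dot

print(W)                   # ≈ 3.0, the bottom of the bowl
```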

### 2.3 Problems

1. **Ravine oscillation**: Narrow valleys cause zig-zagging
2. **Flat region stalling**: Tiny gradient = tiny movement
3. **Uniform speed**: Same η for all parameters, regardless of curvature

---

## 3. Momentum: Adding Inertia

### 3.1 The Idea

Give the ball mass. Let it build up speed.

### 3.2 Fluxion Formulation

Introduce velocity v as a separate state:

```
v̇ = β · v + g    # Velocity accumulates gradient (with decay)
Ẇ = -η · v       # Position flows with velocity
```

**Physical interpretation:**
- β = friction coefficient (0.9 = low friction, velocity persists)
- v accumulates gradient over time
- Ball builds momentum rolling downhill

### 3.3 Why It Helps

**Ravine problem solved:**
- Side-to-side gradients cancel out in v
- Down-the-valley gradients accumulate
- Ball rolls straight down the valley floor

**Flat regions:**
- Momentum carries the ball through plateaus
- Previous velocity persists even when the current gradient is small

### 3.4 The β Parameter

```
β = 0.0:  No momentum, vanilla GD
β = 0.9:  Standard choice, ~10-step effective memory
β = 0.99: Heavy ball, ~100-step memory
```

Effective memory ≈ 1/(1-β) steps
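
A minimal sketch of the momentum flow on the same toy quadratic as before (illustrative constants, not recommendations):

```python
# Momentum on L(W) = (W - 3)^2: v accumulates the gradient, W follows -η·v.
def grad(W):
    return 2.0 * (W - 3.0)

W, v = 0.0, 0.0
lr, beta = 0.1, 0.9        # β = 0.9 ⇒ roughly 1/(1-β) = 10-step memory
for step in range(200):
    g = grad(W)
    v = beta * v + g       # v̇ = β·v + g: velocity with friction
    W += -lr * v           # Ẇ = -η·v

print(W)                   # ≈ 3.0 (converges, oscillating a little on the way)
```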

---

## 4. Nesterov Momentum: Look Before You Leap

### 4.1 The Problem with Standard Momentum

The ball computes the gradient at its current position, then moves.
But it's GOING to move with velocity v anyway.
Why not compute the gradient at where we're GOING to be?

### 4.2 Fluxion Formulation

```
W_ahead = W - η·β·v      # Where momentum will carry us next step
g_ahead = L̇ᵂ(W_ahead)    # Gradient at the future position
v̇ = β · v + g_ahead      # Update velocity with the lookahead gradient
Ẇ = -η · v
```

### 4.3 Physical Interpretation

"Look downhill from where you'll land, not where you stand."

The ball predicts its next position, evaluates the slope THERE, then adjusts.
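
A minimal sketch of the lookahead on the same toy quadratic (illustrative only):

```python
# Nesterov momentum: evaluate the gradient at the predicted landing point.
def grad(W):
    return 2.0 * (W - 3.0)

W, v = 0.0, 0.0
lr, beta = 0.1, 0.9
for step in range(200):
    W_ahead = W - lr * beta * v   # where the current velocity will carry us
    g_ahead = grad(W_ahead)       # slope at the landing point, not here
    v = beta * v + g_ahead        # v̇ = β·v + g_ahead
    W += -lr * v                  # Ẇ = -η·v

print(W)                          # ≈ 3.0, with less overshoot than plain momentum
```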

### 4.4 Why It Helps

- Anticipates overshooting
- Dampens oscillations faster
- Converges slightly faster in practice

---

## 5. AdaGrad: Adaptive Learning Rates

### 5.1 The Problem

Some parameters get huge gradients, others tiny.
A uniform η is wrong for both.

### 5.2 The Idea

Track the cumulative squared gradient per parameter.
Scale the learning rate inversely.

### 5.3 Fluxion Formulation

```
ṡ = s + g²               # Accumulate squared gradient (elementwise)
Ẇ = -η · g / (√s + ε)    # Scale by inverse sqrt of the accumulator
```

### 5.4 Physical Interpretation

**s** measures "how much this parameter has been pushed historically."

- High s → parameter was pushed a lot → reduce sensitivity
- Low s → parameter barely moved → increase sensitivity

### 5.5 Problem

s only grows. The learning rate only shrinks.
Eventually ALL learning rates → 0.
Training stalls.
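
A small sketch of the AdaGrad flow (toy loss and constants chosen for illustration) that prints the effective step size η/√s shrinking as s accumulates:

```python
import math

# AdaGrad on L(W) = (W - 3)^2: s only grows, so the effective step only shrinks.
def grad(W):
    return 2.0 * (W - 3.0)

W, s = 0.0, 0.0
lr, eps = 0.5, 1e-8
for step in range(1, 201):
    g = grad(W)
    s += g * g                            # ṡ = s + g²: monotone accumulator
    W += -lr * g / (math.sqrt(s) + eps)   # Ẇ = -η·g/(√s+ε)
    if step in (1, 10, 100, 200):
        print(step, round(W, 4), round(lr / (math.sqrt(s) + eps), 6))
```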

---

## 6. RMSprop: Exponential Moving Average Fix

### 6.1 The Fix

Don't accumulate forever. Use an exponential moving average.

### 6.2 Fluxion Formulation

```
ṡ = β · s + (1-β) · g²   # EMA of squared gradient
Ẇ = -η · g / (√s + ε)    # Adaptive scaling
```

### 6.3 Physical Interpretation

**s** now measures "recent gradient variance."

- High recent variance → parameter is noisy → take smaller steps
- Low recent variance → parameter is stable → take larger steps

### 6.4 The β Parameter (typically 0.99)

```
β = 0.99: ~100 step memory for the variance estimate
β = 0.9:  ~10 step memory (more reactive)
```
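
For contrast with AdaGrad, a minimal RMSprop sketch on the same toy loss (constants are common defaults, not prescriptions); because s is an EMA, the effective step size recovers instead of decaying forever:

```python
import math

def grad(W):
    return 2.0 * (W - 3.0)

W, s = 0.0, 0.0
lr, beta, eps = 0.05, 0.9, 1e-8
for step in range(200):
    g = grad(W)
    s = beta * s + (1 - beta) * g * g     # ṡ = β·s + (1-β)·g²
    W += -lr * g / (math.sqrt(s) + eps)   # Ẇ = -η·g/(√s+ε)

print(W)   # close to 3.0; the step never fully dies out, so it hovers near the minimum
```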

---

## 7. Adam: Best of Both Worlds

### 7.1 The Combination

Adam = Momentum + RMSprop

Track BOTH:
- First moment (mean gradient) → momentum
- Second moment (gradient variance) → adaptive rate

### 7.2 Fluxion Formulation

```
# First moment: momentum
ṁ = β₁ · m + (1-β₁) · g

# Second moment: variance
v̇ = β₂ · v + (1-β₂) · g²

# Bias correction (important at the start!)
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)

# Update
Ẇ = -η · m̂ / (√v̂ + ε)
```

### 7.3 Physical Interpretation

**m** = smoothed direction (where to flow)
**v** = smoothed magnitude variance (how carefully to flow)

"Flow in the average recent direction, at a speed inversely proportional to recent bumpiness."

### 7.4 Bias Correction: Why?

At t=0, m=0 and v=0.
First update: m = (1-β₁)·g ≈ 0.1·g (biased low!)

Division by (1-β₁ᵗ) corrects this:
- t=1: divide by 0.1 → correct scale
- t=∞: divide by 1.0 → no correction needed

### 7.5 Standard Hyperparameters

```
β₁ = 0.9     # Momentum coefficient (~10 step memory)
β₂ = 0.999   # Variance coefficient (~1000 step memory)
ε  = 1e-8    # Numerical stability (prevents division by zero)
η  = 0.001   # Base learning rate
```
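
A short numerical check of the bias-correction argument, using the default β₁ = 0.9 and a made-up constant gradient:

```python
# With β₁ = 0.9 the first raw moment is only 0.1·g; dividing by (1 - β₁ᵗ)
# restores the correct scale at t = 1 and fades to a no-op as t grows.
beta1 = 0.9
g = 4.0                              # pretend every gradient is 4.0
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g  # raw first moment
    m_hat = m / (1 - beta1 ** t)     # bias-corrected
    print(t, round(m, 4), round(m_hat, 4))
# t=1: m = 0.4 (biased low), m̂ = 4.0
# For a constant gradient, m̂ stays exactly 4.0 while m slowly catches up.
```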

---

## 8. AdamW: Weight Decay Done Right

### 8.1 The Problem with L2 Regularization

Original Adam with L2 regularization:
```
g_reg = g + λ·W              # Add weight penalty to gradient
ṁ = β₁·m + (1-β₁)·g_reg      # Momentum includes penalty
```

Problem: the adaptive scaling also scales the weight decay!
Large weights with small gradients get LESS decay, not more.

### 8.2 AdamW: Decoupled Weight Decay

```
# Moments on RAW gradient (no weight penalty)
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²

# Bias correction
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)

# Update with SEPARATE weight decay
Ẇ = -η · (m̂/(√v̂+ε) + λ·W)
```

### 8.3 Physical Interpretation

Two separate forces act on each weight:
1. **Gradient force**: Push toward lower loss
2. **Decay force**: Pull toward zero (regularization)

AdamW keeps these forces separate.
Original Adam mixed them, causing the decay force to be scaled by the adaptive rate.
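
A tiny numerical sketch of why the coupling matters (toy values, assuming the raw gradient is negligible and the second-moment estimate has settled on the penalty term):

```python
# Effective per-step decay for a small and a large weight with a tiny raw gradient.
lr, lam, eps = 1e-3, 0.01, 1e-8
for W in (0.1, 10.0):
    g_reg = lam * W                               # penalty dominates the gradient
    v_hat = g_reg ** 2                            # second moment tracks the penalty
    decay_l2 = lr * g_reg / (v_hat ** 0.5 + eps)  # Adam + L2: decay gets normalized away
    decay_adamw = lr * lam * W                    # AdamW: decay stays proportional to W
    print(W, round(decay_l2, 6), round(decay_adamw, 7))
# Adam+L2 shrinks both weights by roughly the same absolute amount (~0.001), so the
# large weight is regularized proportionally less; AdamW's decay scales with the weight.
```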

### 8.4 Why It Matters

AdamW consistently outperforms Adam+L2 on language models.
The "W" stands for "decoupled Weight decay."

---

## 9. Complete Algorithm Comparison

### 9.1 In Fluxion Notation

**SGD:**
```
Ẇ = -η·g
```

**SGD + Momentum:**
```
v̇ = β·v + g
Ẇ = -η·v
```

**RMSprop:**
```
ṡ = β·s + (1-β)·g²
Ẇ = -η·g/(√s+ε)
```

**Adam:**
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
```

**AdamW:**
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
```

### 9.2 State Required

| Optimizer | States per parameter |
|-----------|---------------------|
| SGD | 0 |
| Momentum | 1 (velocity) |
| RMSprop | 1 (variance) |
| Adam | 2 (momentum + variance) |
| AdamW | 2 (same as Adam) |

Adam keeps two extra tensors the size of the model just for optimizer state!
For large models, this matters: a 1-billion-parameter model stored in fp32 needs about 4 GB for the weights and roughly another 8 GB for Adam's m and v.

---

## 10. Learning Rate Schedules

### 10.1 The Problem

A fixed η is suboptimal:
- Early training: large steps okay, landscape is far from the optimum
- Late training: need precision, should take smaller steps

### 10.2 Common Schedules in Fluxion Terms

**Constant:**
```
η̇ = 0         (η never changes)
```

**Linear decay:**
```
η̇ = -η₀/T     (linear decrease to 0 over T steps)
```

**Cosine decay:**
```
η(t) = η_min + (η₀-η_min)·(1+cos(πt/T))/2
```

**Warmup:**
```
t < T_warm:  η(t) = η₀·t/T_warm   (ramp up)
t ≥ T_warm:  normal schedule      (then decay)
```
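
A minimal sketch of linear warmup followed by cosine decay as a plain function of the step t (the names and constants are illustrative, not tied to any particular library):

```python
import math

def lr_at(t, total_steps=10_000, warmup_steps=500, lr_max=1e-3, lr_min=1e-5):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if t < warmup_steps:
        return lr_max * t / warmup_steps                         # ramp up
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * progress)) / 2

for t in (0, 250, 500, 5_000, 10_000):
    print(t, f"{lr_at(t):.2e}")   # 0 → ramp → peak at 1e-3 → decay → 1e-5
```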

### 10.3 Why Warmup?

At initialization:
- Weights are random
- Gradients are huge and noisy
- Adam's variance estimate (v) is zero

Large initial steps can destabilize training.
Warmup lets the variance estimates stabilize before taking big steps.

---

## 11. Gradient Clipping

### 11.1 The Problem

Occasionally, gradients explode (‖g‖ → ∞).
One bad step can ruin training.

### 11.2 Fluxion Formulation

```
if ‖g‖ > max_norm:
    g ← g · (max_norm / ‖g‖)   # Rescale to max_norm

# Then proceed with the normal optimizer
```
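
In PyTorch this is one call between backward() and the optimizer step; the sketch below uses a stand-in model and torch.nn.utils.clip_grad_norm_, which rescales all gradients in place when their global norm exceeds max_norm:

```python
import torch

model = torch.nn.Linear(16, 1)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                      # gradients now sit in .grad

# Cap the total gradient norm at 1.0 before the optimizer consumes it.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()
```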

### 11.3 Physical Interpretation

"Cap the maximum force that can act on any weight."

No matter how steep the local slope, the ball can only accelerate so fast.

---

## 12. Implementation: Fused vs Unfused

### 12.1 The Computational Point

Mathematically equivalent formulations can have VERY different performance.

**Unfused Adam (naive):**
```python
m = beta1 * m + (1-beta1) * g                    # Read m, g; write m
v = beta2 * v + (1-beta2) * g**2                 # Read v, g; write v
m_hat = m / (1 - beta1**t)                       # Read m; write m_hat
v_hat = v / (1 - beta2**t)                       # Read v; write v_hat
W = W - lr * m_hat / (torch.sqrt(v_hat) + eps)   # Read W, m_hat, v_hat; write W
```
5 separate kernel launches, multiple memory round-trips.

**Fused Adam:**
```python
# Single kernel: read g, m, v, W once; write m, v, W once
fused_adam_kernel(g, m, v, W, beta1, beta2, lr, eps, t)
```
1 kernel, 1 memory round-trip.
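
Recent PyTorch releases expose fused optimizer kernels directly; availability depends on the version and device, so treat the flag below as something to verify rather than a guarantee:

```python
import torch

model = torch.nn.Linear(16, 1).cuda()   # the fused path targets CUDA tensors
opt = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01,
    fused=True,  # request the single fused per-step kernel, if this build supports it
)
```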

### 12.2 The Fluxion Insight

When written as flows:
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
```

These are clearly THREE coupled ODEs that should be integrated together.
The flow notation suggests fusion naturally.

Leibniz notation hides this by writing separate update equations.

---

## 13. Second-Order Methods (Brief)

### 13.1 Newton's Method (the optimization one, not fluxions)

Use curvature (second-derivative) information:

```
Ẇ = -H⁻¹·g

Where H is the Hessian: the matrix of second derivatives of L with respect to W
```

### 13.2 Fluxion Interpretation

**First-order (gradient descent):** "Flow downhill"
**Second-order (Newton):** "Flow toward the minimum, accounting for curvature"

If the landscape is a bowl, Newton's method jumps straight to the bottom in one step.
Gradient descent spirals down gradually.
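
A small sketch on a two-dimensional quadratic bowl L(W) = ½·Wᵀ·H·W (a toy example): one Newton step lands exactly at the minimum, while a gradient step only inches along the flat direction:

```python
import numpy as np

# Quadratic bowl with very different curvature along the two axes.
H = np.array([[10.0, 0.0],
              [0.0,  0.5]])
W = np.array([2.0, 2.0])
g = H @ W                               # gradient of ½·WᵀHW is H·W

W_newton = W - np.linalg.solve(H, g)    # Ẇ = -H⁻¹·g: straight to the bottom
W_gd = W - 0.05 * g                     # Ẇ = -η·g: η limited by the steep axis

print(W_newton)                         # [0. 0.], exactly the minimum
print(W_gd)                             # [1.   1.95], barely moved along the flat axis
```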

### 13.3 Why Not Used?

Computing H⁻¹ is O(n²) storage, O(n³) compute for n parameters.
For n = 1 billion, this is impossible.

Approximations exist (L-BFGS, K-FAC) but Adam usually wins in practice.

---

## 14. Summary: Optimizer Selection

### 14.1 Quick Guide

| Situation | Optimizer |
|-----------|-----------|
| Simple convex problem | SGD + momentum |
| Deep networks, general | Adam |
| Language models | AdamW |
| Memory constrained | SGD + momentum |
| Fine-tuning | Adam/AdamW with a lower learning rate |

### 14.2 The Unified View

All optimizers are just different ways of computing Ẇ from g:

```
Ẇ = f(g, history, W)
```

- SGD: Ẇ = -η·g (no history)
- Momentum: Ẇ = -η·EMA(g) (first-moment history)
- Adam: Ẇ = -η·EMA(g)/√EMA(g²) (first and second moments)
- AdamW: Ẇ = -η·(EMA(g)/√EMA(g²) + λ·W) (plus a decay force)

---

## 15. Conclusion

Optimizers become physical when viewed through fluxions:

- **Weights** are particles with position W
- **Gradients** are forces pushing uphill
- **Momentum** is literal velocity
- **Adaptive rates** measure local bumpiness
- **Weight decay** is a restoring force toward the origin

This isn't just pedagogy. The flow formulation naturally suggests:
1. Fused implementations (coupled ODEs)
2. Continuous-time analysis (neural ODEs)
3. Novel optimizers (what other forces could we add?)

The math is equivalent, but the intuition is transformative.

---

## References

1. Ruder, S. (2016). "An overview of gradient descent optimization algorithms."
2. Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization."
3. Loshchilov, I., & Hutter, F. (2017). "Decoupled Weight Decay Regularization." (AdamW)
4. Newton, I. (1736). *The Method of Fluxions.*

---

## Appendix: PyTorch Implementation

```python
import torch


class AdamWFluxion:
    """AdamW in fluxion style - flows computed explicitly."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.wd = weight_decay
        self.t = 0

        # Flow states (m = momentum flow, v = variance flow)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    def step(self):
        self.t += 1

        with torch.no_grad():  # weight updates must not be traced by autograd
            for i, W in enumerate(self.params):
                if W.grad is None:
                    continue

                g = W.grad  # Gradient = L̇ᵂ

                # Momentum flow: ṁ = β₁·m + (1-β₁)·g
                self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

                # Variance flow: v̇ = β₂·v + (1-β₂)·g²
                self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

                # Bias correction
                m_hat = self.m[i] / (1 - self.beta1**self.t)
                v_hat = self.v[i] / (1 - self.beta2**self.t)

                # Weight flow: Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
                W_dot = -self.lr * (m_hat / (v_hat.sqrt() + self.eps)
                                    + self.wd * W)

                # Apply flow (in-place on the parameter)
                W += W_dot
```
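
A quick smoke test of the class above (the tiny model and data are arbitrary):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = AdamWFluxion(model.parameters(), lr=1e-2)

x, y = torch.randn(32, 4), torch.randn(32, 1)
for _ in range(200):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    model.zero_grad()       # AdamWFluxion has no zero_grad of its own

print(loss.item())          # should be much smaller than at the first step
```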

---

*Correspondence: scott@opentransformers.online*