SignSGD maps every component of the optimizer update to ±1, which makes the update naturally scale-invariant. However, when it is paired with momentum, as it almost always is in practice, a critical inconsistency emerges: the momentum buffer itself is not scale-invariant and remains susceptible to gradient outliers. Additionally, standard SignSGD produces no meaningful signal near zero (it always outputs a full ±1 update regardless of directional confidence), and it lacks any variance-based acceleration mechanism.
This post introduces two targeted techniques to address all three limitations: (1) Normalization-then-Momentum (NtM), which restructures the update order so that momentum becomes a direct tracker of sign consistency and sign-flip probability; and (2) Variance Acceleration, which translates the momentum signal through the arcsine function — an exact mathematical consequence of the binary gradient distribution — to accelerate confident directions and dampen uncertain ones.
The core idea behind SignSGD is radical in its simplicity: discard the magnitude of every gradient component and keep only its direction. Each update is exactly ±1, nothing more. This makes the optimizer inherently scale-invariant — it treats a gradient of 0.001 and a gradient of 1,000 identically, as long as they agree in direction. In distributed training settings, this property carries a practical communication bonus as well: you only need to transmit one bit per parameter instead of a full-precision float.
In practice, SignSGD is almost always paired with momentum to smooth out the inherently noisy binary updates. This combination is competitive with more complex adaptive methods and has seen renewed interest in the context of large-scale distributed optimization.
But there is a quiet inconsistency in the standard formulation. Momentum is accumulated on the raw, unscaled gradients — the sign operation is applied only at the very end. This ordering means the momentum buffer is fully sensitive to gradient magnitude. A single large outlier gradient will inflate the buffer and distort the optimizer's behavior for many subsequent steps. The scale invariance that SignSGD guarantees for the final update does not extend to the momentum that produces it.
Two further limitations are also worth addressing. First, the optimizer has no concept of directional confidence: whether the gradient has been consistently pointing the same way for 100 steps or has been flipping randomly, the output is always ±1. Second, there is no built-in mechanism to reward consistently low-variance directions with larger steps. AASS addresses all three issues within a unified, low-overhead framework.
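SignSGD is scale-invariant: it maps any input gradient to a discrete ±1 output, discarding magnitude entirely. In practice, it is almost always paired with momentum for faster, more stable convergence. The standard formulation accumulates raw gradients first, then signs. Written in the exponential-moving-average convention, with $\tilde g_t$ the raw gradient, $\beta$ the momentum coefficient, $\theta_t$ the parameters, and $\eta$ the learning rate:

$$m_t = \beta\, m_{t-1} + (1-\beta)\, \tilde g_t, \qquad \theta_t = \theta_{t-1} - \eta\, \operatorname{sign}(m_t)$$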
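The problem is subtle but meaningful: the momentum accumulates raw, unscaled gradients. If the optimizer encounters a single outlier, say a gradient 10× or 100× larger than typical, the momentum buffer becomes inflated and remains so for many subsequent steps. An outlier's contribution to the buffer decays as $\beta^k$ after $k$ steps, so $\frac{1}{1-\beta}$ steps is only the time constant over which it shrinks by a factor of about $e$, not the time it takes to vanish. At $\beta = 0.9$ that time constant is 10 steps; at $\beta = 0.99$ it is 100 steps. A 100× outlier needs roughly $\ln(100) \approx 4.6$ time constants before its contribution falls back to the scale of a typical gradient.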
During this window, the optimizer's effective direction can be driven entirely by the outlier rather than the true gradient signal. This reintroduces the very scale sensitivity that SignSGD was designed to eliminate, and makes it difficult or impossible to reliably exploit long-term momentum.
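The effect can be reproduced with a small sketch. It assumes the exponential-moving-average form of momentum and a made-up scalar gradient stream (true gradient +1 with a single -100 outlier); the function name `signum_directions` is illustrative, not from any library.

```python
def signum_directions(grads, beta=0.9):
    """Standard ordering: average the raw gradients, then take the sign."""
    m = 0.0
    directions = []
    for g in grads:
        m = beta * m + (1.0 - beta) * g  # the buffer sees the raw magnitude
        directions.append(1 if m > 0 else -1 if m < 0 else 0)
    return directions


# Hypothetical stream: 20 steps of +1, one -100 outlier, then +1 again.
grads = [1.0] * 20 + [-100.0] + [1.0] * 60
dirs = signum_directions(grads, beta=0.9)

# Number of steps on which the update points against the true gradient.
wrong_steps = sum(1 for d in dirs if d < 0)
print(wrong_steps)
```

In this toy stream a single outlier flips the update direction for well over the 10-step time constant, even though every other gradient agrees on the sign.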
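The fix for momentum's scale sensitivity is conceptually straightforward: apply the sign operation before momentum accumulation, not after. We call this reordering Normalization-then-Momentum (NtM). With $\tilde g_t$ denoting the raw gradient:

$$g_t = \operatorname{sign}(\tilde g_t), \qquad m_t = \beta\, m_{t-1} + (1-\beta)\, g_t$$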
With this ordering, the momentum buffer only ever accumulates discrete ±1 values. It is fully immune to gradient magnitude, because the sign operation discards that information before it ever reaches the accumulator. A gradient of 0.001 and a gradient of 1,000 — as long as they share the same direction — contribute identically to the momentum.
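A minimal sketch of this property, assuming the exponential-moving-average form of the buffer; `sign` and `ntm_momentum` are hypothetical helper names, and the two gradient streams are invented so that they share signs but differ wildly in magnitude.

```python
def sign(x):
    """Return +1, -1, or 0 (a zero gradient contributes nothing)."""
    return (x > 0) - (x < 0)


def ntm_momentum(grads, beta=0.9):
    """Sign first, then accumulate: the buffer only ever sees +1 or -1."""
    m = 0.0
    history = []
    for g in grads:
        m = beta * m + (1.0 - beta) * sign(g)
        history.append(m)
    return history


small = [0.001, -0.002, 0.003, 0.001]
large = [1000.0, -5.0, 0.3, 1e6]  # same signs, very different magnitudes

same_buffer = ntm_momentum(small) == ntm_momentum(large)
print(same_buffer)
```

Because the magnitude is discarded before accumulation, the two buffers are identical step by step, and every value stays inside $[-1, 1]$.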
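The resulting momentum $m_t \in [-1, 1]$ now has a precise, intuitive interpretation: it is the exponentially weighted running average of the gradient signs, or equivalently, a measure of directional consistency. Its behavior is easy to reason about:

- $m_t \to +1$: the gradient has been consistently positive.
- $m_t \to -1$: the gradient has been consistently negative.
- $m_t \approx 0$: the sign has been flipping about as often as not, so there is no reliable direction.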
Using $m_t$ itself as the update, rather than its sign, produces adaptive-like behavior without any per-parameter second-moment tracking. The optimizer naturally takes maximum steps in confident, consistent directions and near-zero steps in uncertain, noisy ones. It also directly fixes the "meaningless near zero" problem from the abstract: rather than always outputting ±1, the update magnitude now reflects the optimizer's actual confidence.
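To make the confidence interpretation concrete, the sketch below feeds two invented sign sequences into the buffer: one perfectly consistent, one alternating every step. It assumes the exponential-moving-average form with a zero-initialized buffer; `ntm_buffer` is an illustrative name.

```python
def ntm_buffer(signs, beta=0.9):
    """Final value of the sign-momentum buffer after a sequence of +1/-1."""
    m = 0.0
    for s in signs:
        m = beta * m + (1.0 - beta) * s
    return m


consistent = ntm_buffer([+1] * 100)       # same direction every step
alternating = ntm_buffer([+1, -1] * 50)   # sign flips every step

print(consistent, alternating)
```

The consistent sequence drives the buffer close to 1, so the step is close to full size, while the alternating sequence leaves it near 0, so the parameter barely moves.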
With NtM in place, we have a clean signal: the momentum $m_t \in [-1, 1]$ represents the sign-flip consistency of the gradient over time. The question now is what the optimal update function $f(m)$ should be. It turns out there is an exact, mathematically principled answer — derived directly from the binary structure of the gradient signal.
After the NtM step, the incoming gradient is strictly binary: $g_t \in \{-1, +1\}$. Because $g_t$ takes only these two values, its square is always exactly $1$:

$$g_t^2 = 1$$
The momentum tensor $m$ represents the exponentially weighted expected value (mean) of the sign gradients:

$$m = \mathbb{E}[g_t]$$
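We can compute the variance of the gradient signal directly. The variance is the expected squared deviation from the mean, and because $g_t^2 = 1$ it collapses to a function of $m$ alone:

$$V = \mathbb{E}\big[(g_t - m)^2\big] = \mathbb{E}[g_t^2] - m^2 = 1 - m^2$$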
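A natural way to use the variance is Adam-style normalization: divide the momentum by the standard deviation $\sigma = \sqrt{V} = \sqrt{1 - m^2}$. However, as $m \to \pm 1$ the standard deviation approaches zero, and the ratio $m / \sigma$ diverges to ±∞. To keep the update bounded, we use the atan2 operation in place of explicit division. Since $m^2 + \sigma^2 = 1$, the point $(\sigma, m)$ lies on the unit circle and the result is exactly the arcsine of $m$:

$$\operatorname{atan2}\!\left(m, \sqrt{1 - m^2}\right) = \arcsin(m) \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right]$$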
To bound the update to $[-2, 2]$, we multiply the resulting arcsine value by a normalization constant $\frac{4}{\pi}$:

$$f(m) = \frac{4}{\pi} \arcsin(m)$$
This $\frac{4}{\pi}$ scaling maps the maximum-confidence limits $m = \pm 1$ exactly to $\pm 2$ instead of $\pm \frac{\pi}{2} \approx \pm 1.571$. A perfectly consistent direction can therefore take steps up to twice as large as classical SignSGD at the same learning rate.
Figures 2 and 3 tell complementary stories. Fig. 2 shows how the scaled $\frac{4}{\pi}\arcsin(m)$ produces larger updates than the linear identity at high confidence values — the optimizer is accelerated exactly where the variance is lowest. Fig. 3 shows the underlying variance $V = 1 - m^2$: a parabola that encodes the information structure of the binary gradient. The arcsine function is the natural, variance-aware inverse of this structure.
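The update function from this section can be sketched in a few lines. This is a minimal scalar version using only the standard library; the name `variance_accelerated` and the clamping guard are assumptions added for illustration.

```python
import math


def variance_accelerated(m):
    """Map sign-momentum m in [-1, 1] to an update in [-2, 2]."""
    m = max(-1.0, min(1.0, m))       # guard against rounding drift
    sigma = math.sqrt(1.0 - m * m)   # standard deviation of a +/-1 signal
    # atan2 replaces the division m / sigma, so sigma == 0 is harmless.
    return (4.0 / math.pi) * math.atan2(m, sigma)


for m in (0.0, 0.5, 0.9, 1.0):
    print(m, variance_accelerated(m))
```

Using `atan2` rather than `asin` mirrors the derivation above; the two agree on $[-1, 1]$, and neither requires an epsilon in a denominator.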
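Putting both techniques together, the complete AASS update for a single parameter step, with $\tilde g_t$ the raw gradient and $\eta$ the learning rate, is:

$$g_t = \operatorname{sign}(\tilde g_t), \qquad m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{4}{\pi} \arcsin(m_t)$$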
The computational overhead relative to standard SignSGD with momentum is one scaled arcsine call per parameter per step — negligible in practice. There is no second moment to track, no per-parameter learning rate scaling, and no additional memory beyond the momentum buffer already required by standard SignSGD.
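As a rough illustration of how little machinery is involved, here is a sketch of one AASS step over plain Python lists. It is not a reference implementation: the function name `aass_step`, its signature, and the default hyperparameters are assumptions, and a real version would operate on tensors.

```python
import math


def sign(x):
    """Return +1, -1, or 0."""
    return (x > 0) - (x < 0)


def aass_step(params, grads, momentum, lr=1e-3, beta=0.9):
    """One in-place step: sign, then momentum, then scaled arcsine."""
    scale = 4.0 / math.pi
    for i, g in enumerate(grads):
        # NtM: the buffer accumulates signs only, so it stays in [-1, 1].
        m = beta * momentum[i] + (1.0 - beta) * sign(g)
        momentum[i] = m
        # Variance acceleration: atan2(m, sqrt(1 - m^2)) equals arcsin(m).
        sigma = math.sqrt(max(0.0, 1.0 - m * m))
        params[i] -= lr * scale * math.atan2(m, sigma)
    return params, momentum


# Toy quadratic loss sum(x^2); the second run scales every gradient by 1000.
p1, m1 = [1.0, -2.0], [0.0, 0.0]
p2, m2 = [1.0, -2.0], [0.0, 0.0]
for _ in range(50):
    aass_step(p1, [2.0 * x for x in p1], m1, lr=0.01)
    aass_step(p2, [2000.0 * x for x in p2], m2, lr=0.01)
print(p1 == p2)
```

The only state carried between steps is the `momentum` list, and rescaling the gradients leaves the whole trajectory unchanged.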
| Property | SGD | SignSGD + Momentum | AASS |
|---|---|---|---|
| Scale-invariant updates | ✗ | ✓ | ✓ |
| Scale-invariant momentum | ✗ | ✗ | ✓ |
| Outlier-resistant | ✗ | Partial | ✓ |
| Meaningful behavior near zero | ✓ | ✗ | ✓ |
| Confidence / variance acceleration | ✗ | ✗ | ✓ (Scaled to [-2, 2]) |
| Extra memory per parameter | 0 | 1 buffer | 1 buffer |