Work in Progress

Accelerated Arcsine SignSGD

AASS

Korata Hiu^*

^* nickname — adv_optm dev

Abstract

SignSGD clips all optimizer updates to ±1, making it naturally scale-invariant. However, when paired with momentum — as it almost always is in practice — a critical inconsistency emerges: the momentum buffer itself is not scale-invariant and remains susceptible to gradient outliers. Additionally, standard SignSGD produces no meaningful signal near zero (it always outputs a full ±1 update regardless of directional confidence), and it lacks any variance-based acceleration mechanism.

This post introduces two targeted techniques to address all three limitations: (1) Normalization-then-Momentum (NtM), which restructures the update order so that momentum becomes a direct tracker of sign consistency and sign-flip probability; and (2) Variance Acceleration, which translates the momentum signal through the arcsine function — an exact mathematical consequence of the binary gradient distribution — to accelerate confident directions and dampen uncertain ones.

§ 0 Introduction

The core idea behind SignSGD is radical in its simplicity: discard the magnitude of every gradient component and keep only its direction. Each update is exactly ±1, nothing more. This makes the optimizer inherently scale-invariant — it treats a gradient of 0.001 and a gradient of 1,000 identically, as long as they agree in direction. In distributed training settings, this property carries a practical communication bonus as well: you only need to transmit one bit per parameter instead of a full-precision float.

In practice, SignSGD is almost always paired with momentum to smooth out the inherently noisy binary updates. This combination is competitive with more complex adaptive methods and has seen renewed interest in the context of large-scale distributed optimization.

But there is a quiet inconsistency in the standard formulation. Momentum is accumulated on the raw, unscaled gradients — the sign operation is applied only at the very end. This ordering means the momentum buffer is fully sensitive to gradient magnitude. A single large outlier gradient will inflate the buffer and distort the optimizer's behavior for many subsequent steps. The scale invariance that SignSGD guarantees for the final update does not extend to the momentum that produces it.

Two further limitations are also worth addressing. First, the optimizer has no concept of directional confidence: whether the gradient has been consistently pointing the same way for 100 steps or has been flipping randomly, the output is always ±1. Second, there is no built-in mechanism to reward consistently low-variance directions with larger steps. AASS addresses all three issues within a unified, low-overhead framework.

§ 1 Momentum Isn't Scale Invariant — and Neither Is the Optimizer

SignSGD is scale-invariant because it maps every gradient component to a discrete ±1 output, discarding magnitude entirely. When it is paired with momentum, the standard formulation accumulates raw gradients first and applies the sign last:

$$m_t = \beta \cdot m_{t-1} + (1 - \beta) \cdot g_t \qquad \longrightarrow \qquad \text{update} = \operatorname{sign}(m_t)$$

The problem is subtle but meaningful: the momentum accumulates raw, unscaled gradients. If the optimizer encounters a single outlier (say, a gradient 10× or 100× larger than typical), the momentum buffer becomes inflated and remains so for many subsequent steps. An outlier's contribution shrinks by a factor of $\beta$ per step, so it takes approximately $\frac{1}{1-\beta}$ steps just to decay by a factor of $e$: about 10 steps at $\beta = 0.9$ and about 100 steps at $\beta = 0.99$. A 100× outlier needs several such time constants before it falls below the typical signal.

During this window, the optimizer's effective direction can be driven entirely by the outlier rather than the true gradient signal. This reintroduces the very scale sensitivity that SignSGD was designed to eliminate, and makes it difficult or impossible to reliably exploit long-term momentum.

The core issue: A single gradient magnitude outlier can reverse the sign of the accumulated momentum — flipping the update direction entirely — even when the true gradient signal remains consistently correct. The optimizer may then spend several steps actively moving in the wrong direction before recovering.
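The sign reversal described above can be reproduced with a few lines of plain Python. This is a minimal sketch under assumed values, not the simulation behind the figure: the function name `wrong_direction_steps`, the constant true gradient of -1, the 100× outlier, and the 50-step warm-up are all hypothetical choices made for illustration.

```python
def sign(x: float) -> float:
    """Return +1.0, -1.0, or 0.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def wrong_direction_steps(beta: float = 0.9, outlier: float = 100.0,
                          warmup: int = 50, horizon: int = 200) -> int:
    """Count steps on which sign(m_t) disagrees with the true gradient sign.

    The true gradient is a constant -1.0. A single positive outlier replaces
    it once, right after the warm-up. All values are illustrative assumptions.
    """
    m = 0.0
    for _ in range(warmup):               # let the buffer settle near -1
        m = beta * m + (1.0 - beta) * -1.0

    wrong = 0
    for k in range(horizon):
        g = outlier if k == 0 else -1.0   # one outlier, then the true signal again
        m = beta * m + (1.0 - beta) * g   # standard momentum on raw gradients
        if sign(m) != -1.0:               # update direction disagrees with the truth
            wrong += 1
    return wrong


steps_wrong = wrong_direction_steps()
print(f"steps spent moving in the wrong direction: {steps_wrong}")
```

Raising `beta` or `outlier` in this sketch lengthens the recovery window, which is the scale sensitivity this section is about.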

Fig. 1 — Outlier Sensitivity: Standard Momentum vs. AASS

Simulated training run: true signal is consistently negative, with a single large positive outlier at step 26. Standard momentum (red) inverts its sign and requires ~8 steps to recover. AASS momentum (green) sees only the sign of the outlier (+1) and is barely perturbed — it remains negative throughout.

§ 2 Normalization-then-Momentum

The fix for momentum's scale sensitivity is conceptually straightforward: apply the sign operation before momentum accumulation, not after. We call this reordering Normalization-then-Momentum (NtM).

Standard SignSGD vs. NtM — Update Ordering

// Standard: momentum on raw gradients, sign at the very end

m_t = β · m_{t-1} + (1 − β) · g_t ← outlier-sensitive

update = sign(m_t)

// NtM: normalize first, then accumulate the binary signal

s_t = sign(g_t) ← strip magnitude first

m_t = β · m_{t-1} + (1 − β) · s_t ← momentum over signs only

update = f(m_t) ← arcsine (§ 3)

With this ordering, the momentum buffer only ever accumulates discrete ±1 values. It is fully immune to gradient magnitude, because the sign operation discards that information before it ever reaches the accumulator. A gradient of 0.001 and a gradient of 1,000 — as long as they share the same direction — contribute identically to the momentum.
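A small sketch makes the magnitude immunity concrete. The helper names `ntm_momentum` and `raw_momentum` and the two gradient sequences are assumptions introduced here for illustration. The sequences share the same signs but differ in magnitude by several orders.

```python
def sign(x: float) -> float:
    """Return +1.0, -1.0, or 0.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def ntm_momentum(grads, beta: float = 0.9) -> float:
    """NtM ordering: take the sign first, then accumulate."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * sign(g)
    return m


def raw_momentum(grads, beta: float = 0.9) -> float:
    """Standard ordering: accumulate raw gradients (the sign would come last)."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m


# Same sign pattern (+, -, +, +), very different magnitudes.
small = [0.001, -0.002, 0.001, 0.003]
large = [1000.0, -0.5, 20.0, 7.0]

print("NtM buffers:", ntm_momentum(small), ntm_momentum(large))
print("raw buffers:", raw_momentum(small), raw_momentum(large))
```

The two NtM buffers are identical because only the sign pattern reaches the accumulator, while the raw buffers differ by orders of magnitude.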

The resulting momentum $m_t \in [-1, 1]$ now has a precise, intuitive interpretation: it is the exponentially-weighted running average of the gradient signs, or equivalently, a measure of directional consistency. Its behavior is easy to reason about:

If the gradient is consistently positive, all signs are $+1$, and $m_t \to +1$ — maximum confident update.
If the gradient is consistently negative, all signs are $-1$, and $m_t \to -1$ — maximum confident update in the opposite direction.
If the gradient is noisy and frequently flipping, positive and negative signs cancel, and $m_t \to 0$ — zeroed or heavily damped update.

This produces adaptive-like behavior without any per-parameter second-moment tracking. The optimizer naturally takes maximum steps in confident, consistent directions and near-zero steps in uncertain, noisy ones. It also directly fixes the "meaningless near zero" problem from the abstract: rather than always outputting ±1, the update magnitude now reflects the optimizer's actual confidence.
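The three regimes listed above can be checked numerically. The sketch below assumes $\beta = 0.9$, 200 steps, and a strictly alternating sequence as a stand-in for a "frequently flipping" gradient. The name `sign_ema` is hypothetical.

```python
def sign_ema(signs, beta: float = 0.9) -> float:
    """Exponentially weighted average of a ±1 sequence, starting from 0."""
    m = 0.0
    for s in signs:
        m = beta * m + (1.0 - beta) * s
    return m


steps = 200
m_pos = sign_ema([1.0] * steps)     # gradient sign always positive
m_neg = sign_ema([-1.0] * steps)    # gradient sign always negative
m_flip = sign_ema([1.0 if t % 2 == 0 else -1.0 for t in range(steps)])  # flips every step

print(f"consistently positive: m = {m_pos:+.4f}")
print(f"consistently negative: m = {m_neg:+.4f}")
print(f"flipping every step:   m = {m_flip:+.4f}")
```

For the alternating case the buffer settles into a small oscillation of amplitude $(1-\beta)/(1+\beta)$, so the update stays heavily damped rather than exactly zero.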

§ 3 Variance Acceleration — The Arcsine Function

With NtM in place, we have a clean signal: the momentum $m_t \in [-1, 1]$ measures how consistently the gradient sign has pointed one way over time. The question now is which update function $f(m)$ to apply to it. There is a mathematically principled candidate, derived directly from the binary structure of the gradient signal.

3.1 — Deriving the Exact Variance

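After the NtM step, the signal fed to the momentum buffer is strictly binary: $s_t = \operatorname{sign}(g_t) \in \{-1, +1\}$ (setting aside exactly-zero gradients, for which the sign is 0). To keep the notation light, the rest of this section writes $g_t$ for this signed gradient. Because $g_t$ takes only these two values, its square is always exactly $1$: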

$$g_t^2 = 1$$

The momentum tensor $m$ represents the exponentially-weighted expected value (mean) of the sign gradients:

$$m = \mathbb{E}[g_t]$$

We can compute the variance of the gradient signal directly. The variance is the expected squared deviation from the mean:

Definition

$$V = \mathbb{E}\!\left[(g_t - m)^2\right]$$

Expanding the polynomial

$$V = \mathbb{E}\!\left[g_t^2 - 2g_t m + m^2\right]$$

Applying linearity of expectation ($m$ is a constant w.r.t. the expectation)

$$V = \mathbb{E}[g_t^2] - 2m\,\mathbb{E}[g_t] + m^2$$

Substituting $g_t^2 = 1$ and $\mathbb{E}[g_t] = m$

$$V = 1 - 2m^2 + m^2 = 1 - m^2$$

Exact Variance of Binary Gradients

$$V = 1 - m^2 \qquad \Longrightarrow \qquad \sigma = \sqrt{V} = \sqrt{1 - m^2}$$

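No approximations are involved: the identity holds exactly whenever every sign is ±1, because then $g_t^2 = 1$. In practice $m$ is an exponentially weighted estimate of the mean, so $1 - m^2$ is the matching estimate of the variance.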
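The identity is easy to confirm numerically. The sketch below is an illustrative check, not part of the method: it draws ±1 samples with an assumed probability `p_plus` of being +1 (a hypothetical parameter) and compares the sample variance with $1 - \text{mean}^2$. Because every sample squares to 1, the two agree up to floating-point rounding for any sample, not just on average.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def check_variance_identity(p_plus: float, n: int = 10_000):
    """Draw n signs with P(+1) = p_plus; return (mean, variance, 1 - mean^2)."""
    signs = [1.0 if rng.random() < p_plus else -1.0 for _ in range(n)]
    mean = sum(signs) / n
    variance = sum((s - mean) ** 2 for s in signs) / n  # population variance
    return mean, variance, 1.0 - mean ** 2


for p in (0.1, 0.5, 0.9):
    mean, variance, predicted = check_variance_identity(p)
    print(f"P(+1)={p}: mean={mean:+.4f}  variance={variance:.6f}  1-mean^2={predicted:.6f}")
```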

3.2 — From Variance to the Arcsine Update

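A natural way to use the variance is a normalization in the spirit of Adam: divide the momentum by the standard deviation $\sigma = \sqrt{V}$. (Adam itself divides by the root of the uncentered second moment, which for a binary signal is identically 1, so the centered variance is the informative quantity here.) However, as $m \to \pm 1$, the standard deviation approaches zero, which sends the naive division to ±∞. To keep the update bounded, we use the atan2 operation in place of explicit division. Because $m^2 + \sigma^2 = 1$, the point $(\sigma, m)$ lies on the unit circle, and its angle is exactly $\arcsin(m)$: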

$$\text{update} = \operatorname{atan2}(m,\; \sigma) = \operatorname{atan2}\!\left(m,\; \sqrt{1 - m^2}\right) = \arcsin(m)$$
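The equality between the atan2 form and the arcsine can be verified with the standard `math` module. This is a minimal sketch, and the function name `atan2_update` is an assumption made for illustration.

```python
import math


def atan2_update(m: float) -> float:
    """Bounded replacement for m / sigma, with sigma = sqrt(1 - m^2)."""
    sigma = math.sqrt(max(0.0, 1.0 - m * m))  # guard against tiny negative rounding
    return math.atan2(m, sigma)               # argument order is atan2(y, x)


for m in (-1.0, -0.5, 0.0, 0.5, 0.999, 1.0):
    print(f"m={m:+.3f}  atan2(m, sigma)={atan2_update(m):+.6f}  asin(m)={math.asin(m):+.6f}")

# The naive m / sigma would raise ZeroDivisionError at m = ±1, where sigma is 0.
# atan2 stays finite there and returns ±pi/2.
```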

To bound the update to $[-2, 2]$, which gives stable training dynamics and a predictable maximum step size, we multiply the arcsine by the normalization constant $\frac{4}{\pi}$:

$$\boxed{\;\text{update} = \frac{4}{\pi} \arcsin(m_t)\;}$$

where $m_t = \beta\,m_{t-1} + (1-\beta)\operatorname{sign}(g_t)$ and update $\in [-2, 2]$

This $\frac{4}{\pi}$ scaling maps the maximum-confidence limits exactly to $\pm 2$ instead of $\pm \frac{\pi}{2} \approx \pm 1.571$, so the most consistent directions can take steps up to twice the size of classical SignSGD's ±1.

3.3 — Properties of the Arcsine Update

Near zero ($m \approx 0$): $\frac{4}{\pi}\arcsin(0) = 0$, so there is no update. Safe and conservative when direction is uncertain.
High confidence ($m \approx \pm 1$): $\frac{4}{\pi}\arcsin(\pm 1) = \pm 2$, exactly twice standard SignSGD's ±1.
Linear near zero, superlinear near the extremes: the slope is $\frac{4}{\pi} \approx 1.27$ at $m = 0$ and grows without bound as $m \to \pm 1$, so additional consistency is rewarded more the closer the signal is to certain.
Perfectly bounded output $\in [-2,\, 2]$: no numerical explosion, even as variance approaches zero.
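The properties above can be tabulated directly. The sketch below assumes a hypothetical helper name `scaled_arcsine` and evaluates it at a few values of $m$. Two reference points follow from exact arcsine values: $m = 0.5$ gives $\frac{2}{3}$, and $m = \frac{1}{\sqrt{2}} \approx 0.707$ gives exactly 1, the step size of standard SignSGD.

```python
import math

SCALE = 4.0 / math.pi  # maps arcsin's range [-pi/2, pi/2] onto [-2, 2]


def scaled_arcsine(m: float) -> float:
    """AASS update as a function of the sign-momentum m in [-1, 1]."""
    return SCALE * math.asin(m)


for m in (0.0, 0.1, 0.5, math.sqrt(0.5), 0.9, 0.99, 1.0):
    print(f"m={m:.3f}  update={scaled_arcsine(m):.4f}  ratio to m="
          f"{(scaled_arcsine(m) / m if m else SCALE):.4f}")
```

The printed ratio starts at the slope $\frac{4}{\pi}$ near zero and rises to 2 at $m = 1$, which is the superlinear growth described in the list.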

Fig. 2 — Scaled Arcsine vs. Linear Update

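Gold: (4/π) · arcsin(m). Blue dashed: the identity m, i.e. the NtM momentum used directly as the update (standard SignSGD would instead be the step function sign(m)). The gold curve reaches its extremes of exactly ±2 at m = ±1.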

Fig. 3 — Variance V = 1 − m²

Maximum uncertainty at m = 0 (V = 1). Fully certain at m = ±1 (V = 0). The arcsine update is steepest exactly where this variance is smallest.

Figures 2 and 3 tell complementary stories. Fig. 2 shows how the scaled $\frac{4}{\pi}\arcsin(m)$ produces larger updates than the linear identity at high confidence values: the optimizer is accelerated exactly where the variance is lowest. Fig. 3 shows the underlying variance $V = 1 - m^2$, a downward parabola that peaks at $m = 0$ and vanishes at $m = \pm 1$. The two are linked through the derivative $\frac{d}{dm}\arcsin(m) = \frac{1}{\sqrt{1 - m^2}} = \frac{1}{\sigma}$: the lower the variance, the more each increment of momentum adds to the update.

§ 4 The Algorithm

Putting both techniques together, the complete AASS update for a single parameter step is:

Algorithm — Accelerated Arcsine SignSGD (AASS)

// Hyperparameters

η ← learning rate

β ← momentum decay (e.g. 0.9)

ε ← numerical clamp (e.g. 1e-6)

// Initialize

m₀ = 0 ← momentum buffer (same memory footprint as SignSGD with momentum)

for each step t = 1, 2, 3, …:

1. g_t = ∇L(θ_{t-1}) ← compute gradient

2. s_t = sign(g_t) ← NtM: normalize first

3. m_t = β · m_{t-1} + (1 − β) · s_t ← momentum on binary signal

4. u_t = (4 / π) · arcsin(clamp(m_t, −1+ε, 1−ε)) ← variance acceleration scaled to [-2, 2]

5. θ_t = θ_{t-1} − η · u_t ← parameter update

The computational overhead relative to standard SignSGD with momentum is one clamp and one scaled arcsine call per parameter per step, which is negligible in practice. There is no second moment to track, no per-parameter learning rate scaling, and no additional memory beyond the single momentum buffer that SignSGD with momentum already requires.
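The five steps of the algorithm can be sketched in dependency-free Python as follows. This is one possible reading of the pseudocode, not a reference implementation: the function name `aass_step`, the list-of-floats parameter layout, the learning rate of 0.01, and the toy quadratic objective with curvatures 1 and 1000 are all assumptions chosen for illustration. An exactly-zero gradient yields a sign of 0 here, a case the binary derivation in § 3 sets aside.

```python
import math


def sign(x: float) -> float:
    """Return +1.0, -1.0, or 0.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def aass_step(theta, grads, m, lr=0.01, beta=0.9, eps=1e-6):
    """One AASS step over lists of floats; returns (new_theta, new_m)."""
    new_theta, new_m = [], []
    for th, g, mi in zip(theta, grads, m):
        mi = beta * mi + (1.0 - beta) * sign(g)        # steps 2-3: sign, then momentum
        clamped = min(max(mi, -1.0 + eps), 1.0 - eps)  # step 4: keep arcsin's input in range
        u = (4.0 / math.pi) * math.asin(clamped)       # step 4: scaled arcsine update
        new_theta.append(th - lr * u)                  # step 5: parameter update
        new_m.append(mi)
    return new_theta, new_m


# Hypothetical toy objective: L(theta) = 0.5 * sum(c_i * theta_i^2), gradient c_i * theta_i.
curvature = [1.0, 1000.0]   # badly scaled on purpose
theta = [1.0, -1.0]
m = [0.0, 0.0]

for _ in range(500):
    grads = [c * th for c, th in zip(curvature, theta)]  # step 1: compute gradient
    theta, m = aass_step(theta, grads, m)

print("final parameters:", theta)
```

In this sketch the two coordinates follow mirror-image trajectories despite a 1000× difference in curvature, because only the gradient signs enter the update. With a fixed learning rate the iterates settle into a small oscillation around the minimum rather than converging exactly, so a decaying learning-rate schedule would be a reasonable addition.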

Property	SGD	SignSGD + Momentum	AASS
Scale-invariant updates	✗	✓	✓
Scale-invariant momentum	✗	✗	✓
Outlier-resistant	✗	Partial	✓
Meaningful behavior near zero	✓	✗	✓
Confidence / variance acceleration	✗	✗	✓ (Scaled to [-2, 2])
Extra memory per parameter	0	1 buffer	1 buffer