SinkGD is an optimization algorithm that balances noise and signal by iteratively preconditioning gradient rows and columns to a uniform norm. While this prevents vanishing or exploding updates across channels, standard SinkGD has no adaptive variance mechanism to dampen noisy dimensions, which matters for stable convergence.
In this post, we propose two techniques that add variance preconditioning to SinkGD with zero memory overhead. First, we define Normalization-then-Momentum (NtM), a reordering that applies Sinkhorn normalization before momentum accumulation. Second, building on NtM's geometric constraints, we introduce Sinkhorn Implicit Variance (SINK-V). Because every incoming gradient row then lies on a hypersphere of fixed radius (unit RMS), we show that a per-row variance estimate can be read directly from the norm of the momentum buffer. This yields adaptivity without allocating any explicit variance trackers.
Standard SinkGD is an iterative row and column preconditioning method designed to balance noise and signal propagation across network dimensions.
By computing rank-1 diagonal preconditioning factors on the fly, the algorithm distributes energy evenly, so that every row and column contributes to the network’s learning process.
In a traditional SinkGD implementation, the optimizer accumulates raw gradients into a momentum buffer and then applies Sinkhorn normalization to that buffer to compute the final update.
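A minimal NumPy sketch of this momentum-then-normalize ordering may help. It is an illustration, not the reference implementation: the helper `sinkhorn_normalize`, the function `sinkgd_step`, the exponential-average form of the momentum, and the hyperparameters `lr`, `beta`, `n_iters`, and `eps` are all assumptions made here.

```python
import numpy as np

def sinkhorn_normalize(x, n_iters=5, eps=1e-8):
    """Hypothetical helper: alternately rescale rows and columns to unit RMS."""
    x = np.array(x, dtype=np.float64)
    for _ in range(n_iters):
        x = x / (np.sqrt(np.mean(x**2, axis=1, keepdims=True)) + eps)  # rows
        x = x / (np.sqrt(np.mean(x**2, axis=0, keepdims=True)) + eps)  # columns
    return x

def sinkgd_step(w, grad, m, lr=1e-2, beta=0.9):
    """Standard ordering: accumulate the raw gradient, then normalize the buffer."""
    m = beta * m + (1 - beta) * grad      # momentum on raw gradients
    update = sinkhorn_normalize(m)        # normalization is the last step
    return w - lr * update, m
```

Because normalization comes last, the applied update has the same RMS in every column (and approximately in every row) regardless of how consistent the gradients were.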
In deep learning, scaling updates inversely to their variance allows an optimizer to take safely large steps in consistent directions, while aggressively dampening steps in noisy ones. Standard SinkGD lacks this mechanism. Because the Sinkhorn operator is the final step before the update, every row and column is forced to have an identical structural norm, destroying any magnitude-based confidence metric.
To solve this, our first proposal is a simple reordering of operations called Normalization-then-Momentum (NtM). Instead of normalizing the momentum buffer, we normalize the raw gradient first and accumulate the normalized gradients into the momentum buffer.
This reordering is crucial. By enforcing the Sinkhorn constraint on $\mathbf{g}_t$ directly, we ensure that the root mean square (RMS) of every incoming gradient row and column is $1.0$ once the normalization has converged. As these bounded vectors are exponentially averaged into $\mathbf{m}_t$, the RMS of $\mathbf{m}_t$ shrinks when the gradient directions are noisy and approaches $1.0$ when they are consistent. This sets the mathematical foundation for SINK-V.
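The effect on the momentum magnitude can be checked with a small experiment. The sketch below is illustrative only: `sinkhorn_normalize`, `ntm_momentum`, and `rms` are hypothetical helpers, and the Gaussian gradients, matrix shape, and `beta = 0.9` are arbitrary choices.

```python
import numpy as np

def sinkhorn_normalize(x, n_iters=5, eps=1e-8):
    """Hypothetical helper: alternately rescale rows and columns to unit RMS."""
    x = np.array(x, dtype=np.float64)
    for _ in range(n_iters):
        x = x / (np.sqrt(np.mean(x**2, axis=1, keepdims=True)) + eps)  # rows
        x = x / (np.sqrt(np.mean(x**2, axis=0, keepdims=True)) + eps)  # columns
    return x

def ntm_momentum(m, grad, beta=0.9):
    """NtM ordering: normalize the raw gradient first, then accumulate it."""
    return beta * m + (1 - beta) * sinkhorn_normalize(grad)

def rms(x):
    return float(np.sqrt(np.mean(x**2)))

rng = np.random.default_rng(0)
shape, steps = (8, 16), 200
fixed_grad = rng.normal(size=shape)

m_consistent = np.zeros(shape)
m_noisy = np.zeros(shape)
for _ in range(steps):
    # Same gradient every step: the buffer converges to the normalized gradient.
    m_consistent = ntm_momentum(m_consistent, fixed_grad)
    # Fresh random gradient every step: directions cancel inside the average.
    m_noisy = ntm_momentum(m_noisy, rng.normal(size=shape))

rms_consistent, rms_noisy = rms(m_consistent), rms(m_noisy)
print(rms_consistent, rms_noisy)
```

With fully independent directions the buffer's RMS should settle near $\sqrt{(1-\beta)/(1+\beta)}$, roughly $0.23$ for $\beta = 0.9$, while the consistent buffer stays close to $1$.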
We now establish Sinkhorn Implicit Variance (SINK-V). Let $\hat{\mathbf{g}}_t \in \mathbb{R}^d$ be a Sinkhorn-normalized gradient row at step $t$ under the NtM regime. Let $\mathbf{m}_t$ be the momentum buffer for that row, acting as the expected value: $\mathbf{m}_t \approx \mathbb{E}[\hat{\mathbf{g}}_t]$.
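We want the spatial variance $V$ of the normalized gradient around its momentum, that is, the expected squared deviation of $\hat{\mathbf{g}}_t$ from $\mathbf{m}_t$ averaged over the $d$ elements of the row.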
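The derivation can be sketched as follows, under two assumptions that the text implies but does not spell out: $V$ is the per-element average of the expected squared deviation, and $\mathbf{m}_t = \mathbb{E}[\hat{\mathbf{g}}_t]$ holds exactly rather than approximately. Expanding the square gives

$$
V = \frac{1}{d}\,\mathbb{E}\big[\|\hat{\mathbf{g}}_t - \mathbf{m}_t\|^2\big]
  = \frac{1}{d}\Big(\mathbb{E}\big[\|\hat{\mathbf{g}}_t\|^2\big] - 2\,\mathbf{m}_t^\top \mathbb{E}[\hat{\mathbf{g}}_t] + \|\mathbf{m}_t\|^2\Big).
$$

Substituting $\mathbb{E}[\hat{\mathbf{g}}_t] = \mathbf{m}_t$ and the unit-RMS constraint $\|\hat{\mathbf{g}}_t\|^2 = d$ gives

$$
V = \frac{1}{d}\big(d - \|\mathbf{m}_t\|^2\big) = 1 - \frac{1}{d}\sum_{i=1}^{d} m_{t,i}^2 .
$$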
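This reveals a useful property: if $\mathbf{m}_t$ is treated as the mean of $\hat{\mathbf{g}}_t$, the expected variance of the gradient equals one minus the spatial mean of the squared momentum elements. In practice $\mathbf{m}_t$ is an exponential average, so the identity holds to the extent that the average tracks the true mean. We extract the variance dynamically from the momentum buffer that we already track, with zero explicit variance state.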
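A quick numerical check of this identity, using a plain sample mean in place of the momentum buffer. The synthetic data (a shared signal plus Gaussian noise, rescaled to unit RMS) is a hypothetical stand-in for Sinkhorn-normalized gradient rows, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 64, 1000

# Noisy vectors around a shared direction, each rescaled to unit RMS
# so that every row has squared norm d.
signal = rng.normal(size=d)
g = signal + 2.0 * rng.normal(size=(n_samples, d))
g_hat = g / np.sqrt(np.mean(g**2, axis=1, keepdims=True))

m = g_hat.mean(axis=0)                    # sample mean plays the role of m_t
v_explicit = np.mean((g_hat - m) ** 2)    # variance computed from all samples
v_implicit = 1.0 - np.mean(m**2)          # variance read off the mean alone

print(v_explicit, v_implicit)
```

For a sample mean the two quantities agree up to floating-point error. With an exponential moving average in place of `m`, the match is only approximate.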
We can visualize this exact equality using basic geometry.
Because Sinkhorn normalization under NtM enforces an RMS norm of 1, we have $\|\hat{\mathbf{g}}_t\|^2 = d$, so every incoming gradient vector $\hat{\mathbf{g}}_t$ lies on the surface of a hypersphere of radius $R = \sqrt{d}$ in $\mathbb{R}^d$.
The momentum buffer $\mathbf{m}_t$ is the exponentially weighted average of these surface vectors. By Jensen's inequality, $\mathbf{m}_t$ can never lie outside the hypersphere. Treating $\mathbf{m}_t$ as the mean of $\hat{\mathbf{g}}_t$, the Pythagorean decomposition of expected values (the cross term vanishes) fixes the relationship between the origin, the momentum, and the gradient surface: $\|\mathbf{m}_t\|^2 + \mathbb{E}\|\hat{\mathbf{g}}_t - \mathbf{m}_t\|^2 = R^2 = d$.
In standard adaptive optimizers like Adam, the update is computed using simple division: $\mathbf{m} / \sqrt{v}$. This division is unstable when the variance approaches zero, so a small $\epsilon$ must be added to the denominator and tuned to prevent the update from exploding.
In SINK-V, we replace division entirely with the two-argument arctangent: $\text{atan2}(\mathbf{m}, \sqrt{v})$. Because $\sqrt{v} \ge 0$, the result always lies in $[-\pi/2, \pi/2]$, so $\text{atan2}$ scales confident signals smoothly while bounding the raw step to $\pm \pi/2$. Multiplying the output by $4/\pi$ maps the effective update to the bounded range $[-2, 2]$. The update stays finite even when $v = 0$, with no division-by-zero errors and no $\epsilon$ tuning.
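The update rule can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the definitive implementation: `sink_v_update` is a hypothetical name, the variance is taken per row as one minus the mean of the squared momentum elements, and the clip at zero is added here only to guard against round-off.

```python
import numpy as np

def sink_v_update(m):
    """Hypothetical per-row SINK-V update from an NtM momentum buffer m (rows x d)."""
    mean_sq = np.mean(m**2, axis=1, keepdims=True)
    # Implicit variance from the momentum alone; clipped because round-off
    # can push 1 - mean_sq slightly below zero (assumption made here).
    v = np.clip(1.0 - mean_sq, 0.0, None)
    # atan2 replaces m / sqrt(v): finite even when v == 0, no epsilon needed.
    return (4.0 / np.pi) * np.arctan2(m, np.sqrt(v))

m_confident = np.ones((2, 4))          # unit RMS per row, so v = 0
m_noisy = 0.1 * np.ones((2, 4))        # small momentum, so v is close to 1
print(sink_v_update(m_confident))
print(sink_v_update(m_noisy))
```

A fully consistent row saturates at the bound of the range, while a noisy row with small momentum receives a correspondingly small step.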
The variance is read from the existing momentum buffer $\mathbf{m}_t$, so no additional VRAM is allocated for variance tracking.