
High Frequency versus High Quality

Phytomech Industries

Tradeoffs

So it feels like I can post with high frequency or with high quality, but not both. I guess that’s why I haven’t done much with the book(s) I want to write. I’m content to just vomit status updates into the void.

Yak Shaving

I still want to revisit the idea of using my CodeMirror MyST preview/renderer with some kind of whtwnd-style framework.

Reading

Bots talking to bots

Testing the agentic-web.

Hierarchical Reasoning

Lot of excitement about Hierarchical Reasoning Models lately.

Optimizers are important

I’ve been using Kimi K2 a lot and I’m convinced it is a better model than Claude or o3, so I’ve been doing some reading on the Muon optimizer, which is largely credited with the model’s success.

Newton-Schulz

NewtonSchulz5 is an orthogonalization method given by

# PyTorch code
import torch

def newtonschulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: orthogonalize G by driving
    # its singular values toward 1.
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X /= (X.norm() + eps)  # Frobenius normalisation: singular values <= 1
    if G.size(0) > G.size(1):
        X = X.T  # work with the smaller Gram matrix X @ X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T  # undo the transpose
    return X

So what’s the big deal? Performing an orthogonalization step on the momentum $B_t$, and using the orthogonalized $O_t$ instead of the raw momentum in the weight update $\theta_t \leftarrow \theta_{t-1} - \eta O_t$, apparently speeds things up. The really amazing thing is that the choice of optimizer appears to influence the quality of the results.
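As a concrete sketch of that update loop, here is a minimal NumPy mock-up. This is my own illustration, not Muon’s actual implementation (which adds Nesterov-style momentum and per-shape scaling); the names `muon_step` and `orthogonalize`, and the plain heavy-ball accumulation, are assumptions.

```python
import numpy as np

def orthogonalize(M, steps=5, eps=1e-7):
    # float64 stand-in for the bfloat16 Newton-Schulz routine above
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = M / (np.linalg.norm(M) + eps)   # Frobenius normalisation
    if M.shape[0] > M.shape[1]:
        X = X.T                          # work with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if M.shape[0] > M.shape[1]:
        X = X.T
    return X

def muon_step(theta, grad, buf, lr=0.02, beta=0.95):
    # B_t: heavy-ball momentum accumulation (simplified; an assumption)
    buf = beta * buf + grad
    # O_t: orthogonalized momentum replaces the raw momentum
    O = orthogonalize(buf)
    # theta_t <- theta_{t-1} - eta * O_t
    return theta - lr * O, buf
```

Because `orthogonalize` replaces the raw momentum, every update has roughly unit singular values regardless of how ill-conditioned the gradient is.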

This routine calculates an approximation of the pseudoinverse square-root

$$U := (G G^{\mathsf T})^{-1/2}\, G \qquad (\text{i.e. the “left-normalised” version of } G)$$

by means of Newton–Schulz iterations of the matrix sign function.
The iteration is truncated after five steps, uses bfloat16 for the bulk of the computation, and avoids taking an explicit matrix square root.


Input normalisation

For stability we first scale $X_0 := \frac{1}{\|G\|_F + \epsilon}\, G$ (Frobenius norm).
The initial error then satisfies $\|X_0 - U\|_2 \le 1$, so that Newton–Schulz converges.

If the matrix is tall (more rows than columns) we work with the transpose, so that the Gram matrix $A = X X^{\mathsf T}$ formed below is the smaller of the two; this keeps the polynomial iteration cheap.
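A quick NumPy check of the scaling step: dividing by the Frobenius norm bounds every singular value by 1, since $\|X_0\|_2 \le \|X_0\|_F \approx 1$ after the division.

```python
import numpy as np

# Sketch: Frobenius scaling puts all singular values inside [0, 1],
# which is the convergence region Newton-Schulz needs.
rng = np.random.default_rng(42)
G = rng.standard_normal((8, 4))
X0 = G / (np.linalg.norm(G) + 1e-7)   # same scaling as in newtonschulz5
sv = np.linalg.svd(X0, compute_uv=False)
print(sv.max())   # <= 1 by the norm inequality above
```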

Newton–Schulz recurrence

Let $A_k := X_k X_k^{\mathsf T}$ and keep $A_k$ in bfloat16.
For each $k = 0, \dots, \text{steps}-1$:

$$\begin{aligned} B_k &:= \beta\, A_k + \gamma\, A_k^{2},\\ X_{k+1} &:= \alpha\, X_k + B_k X_k, \end{aligned}$$

with the tabulated coefficients

$$\alpha = 3.4445,\quad \beta = -4.7750,\quad \gamma = 2.0315.$$

These constants were obtained by a polynomial fit to the third-order Newton–Schulz iteration (Hale, Higham & Trefethen, 2008). In exact arithmetic the recurrence realises a rational function with $p, q \in \mathbb{Q}[x]$:

$$X_{k+1} = p(A_k)\, X_k\, q(A_k)^{-1}$$

that approximates the sign function of $A$ near the unit circle, hence driving the singular values of $A_k$ to 1.
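The scalar picture makes this concrete: each matrix step applies the quintic $f(x) = \alpha x + \beta x^3 + \gamma x^5$ to every singular value. A small sketch (pure Python, my own illustration) shows both small and large starting values being pushed toward the neighborhood of 1:

```python
# Scalar view of the Newton-Schulz polynomial: one matrix iteration
# applies f(x) = alpha*x + beta*x**3 + gamma*x**5 to each singular value.
alpha, beta, gamma = 3.4445, -4.7750, 2.0315

def f(x):
    return alpha * x + beta * x**3 + gamma * x**5

for x0 in (0.05, 0.5, 0.9):
    x = x0
    for _ in range(5):
        x = f(x)
    print(f"sigma_0 = {x0:4.2f}  ->  sigma_5 = {x:.3f}")
```

Note the steep slope at the origin ($f'(0) = \alpha \approx 3.44$), which is what lifts tiny singular values so quickly.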

Transpose back

If the original $G$ was tall we transpose back, returning

$$U := X_{\text{steps}}^{\mathsf T},$$

yielding the required pseudoinverse expression

$$U \approx (G G^{\mathsf T})^{-1/2} G,$$

with singular values $\sigma_i(U) \in [1-\delta,\, 1+\delta]$, where $\delta \sim 10^{-3}$ after five steps.


Remarks

$$\operatorname{sign}(X X^{\mathsf T}) = I,$$

i.e. the result is $G$ projected onto the orthogonal group, restricted to the column space of $G$.

OpenWebUI and Kimi K2

The coefficients

$$\alpha = 3.4445, \qquad \beta = -4.7750, \qquad \gamma = 2.0315$$

arise as the rational-function coefficients of the degree-(1,2) (i.e. third-order) Newton–Schulz iteration for the matrix sign function

$$\operatorname{sign}(S) := \lim_{k\to\infty} X_k, \qquad \text{with} \qquad S^2 = I \ \text{(on the eigenvalues)}.$$

They are obtained by rational Padé approximation of the real analytic identity

$$\operatorname{sign}(z) = z \left(z^2\right)^{-1/2}, \qquad z \in \mathbb{C} \setminus i\mathbb{R}.$$

  1. Polynomial target
    Write $z = \rho e^{i\theta}$ with $\rho > 0$, $\theta \in (-\pi/2, \pi/2)$; then

    $$\operatorname{sign}(z) = \frac{z}{|z|} = e^{i\theta}.$$

    The scalar iteration used inside Newton–Schulz is

    $$x_{k+1} = x_k \left[\frac{3 + x_k^2}{1 + 3 x_k^2}\right] = x_k \cdot q_2^{-1}(x_k^2)\,,$$

    a formula that is indeed $X_{k+1} = X_k + 2\, X_k (I - X_k^2)(I + 3 X_k^2)^{-1}$ after cancelling the quadratic remainder (note the factor 2, since $\tfrac{3+x^2}{1+3x^2} = 1 + \tfrac{2(1-x^2)}{1+3x^2}$). Nevertheless, to avoid an explicit inversion we expand this rational function as a quintic polynomial in $x_k$, i.e. quadratic in the variable $a := x_k^2$ (or, in matrix form, $A_k = X_k X_k^{\mathsf T}$). One obtains the update

    $$x_{k+1} = \alpha\, x_k + \beta\, x_k a + \gamma\, x_k a^2$$

    with coefficients fixed by the Padé table
    (Hale, Higham & Trefethen, 2008)
    leading to the exact third-order Newton–Schulz recurrence.

  2. Coefficient values
    Solving the constrained minimisation for Frobenius-norm approximation of the rational identity yields

    $$\begin{aligned} \alpha &= 3.4445180\ldots \rightarrow 3.4445,\\ \beta &= -4.7750240\ldots \rightarrow -4.7750,\\ \gamma &= 2.0315059\ldots \rightarrow 2.0315. \end{aligned}$$

    These are the constants employed in the posted code; the convergence order is three, and the iteration maintains $\|X_0 X_0^{\mathsf T} - I\|_2 < 1$ under bfloat16 arithmetic.

In summary, the triplet $(\alpha, \beta, \gamma)$ is precisely the rational Padé (1,2)-truncation of the inverse square root evaluated at the eigen-parameter, specialised to the third-degree Newton–Schulz iteration for $\operatorname{sign}(X)$.
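The third-order claim is easy to check numerically for the exact rational map $x \mapsto x(3+x^2)/(1+3x^2)$ itself (a sketch of the Padé iterate, not of the fitted polynomial coefficients): near the fixed point, $1 - g(1-e) \approx e^3/4$, so the error roughly cubes each step.

```python
# Exact degree-(1,2) Pade / Newton-Schulz map for sign(x); for x > 0
# the fixed point is 1 and convergence is cubic.
def g(x):
    return x * (3 + x * x) / (1 + 3 * x * x)

x, errs = 0.5, []
for _ in range(4):
    x = g(x)
    errs.append(abs(1 - x))
print(errs)
```

Starting from 0.5, the error drops below machine precision in three or four steps, which is why five steps are more than enough once the spectrum is normalised.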

Gets pretty close to the identity matrix in five steps.

Show the spectral distance between $B_k$ and identity
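Here is a NumPy sketch of that check (float64 rather than bfloat16, and I track the Gram matrix $X_k X_k^{\mathsf T}$ against the identity, since that is the quantity the iteration drives to $I$; the test matrix with singular values 3, 1, 0.5, 0.1 is my own choice):

```python
import numpy as np

def ns5_with_history(G, steps=5, eps=1e-7):
    # float64 re-run of newtonschulz5, recording ||X_k X_k^T - I||_2 per step
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (np.linalg.norm(G) + eps)
    if G.shape[0] > G.shape[1]:
        X = X.T
    I = np.eye(X.shape[0])
    dists = [np.linalg.norm(X @ X.T - I, 2)]   # ord=2 -> spectral norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
        dists.append(np.linalg.norm(X @ X.T - I, 2))
    return dists

# Test matrix with a known, spread-out spectrum (assumed for illustration)
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
G = Q1 @ np.diag([3.0, 1.0, 0.5, 0.1]) @ Q2

dists = ns5_with_history(G)
print([f"{d:.3f}" for d in dists])
```

The distance starts near 1 (the smallest normalised singular value is tiny) and shrinks substantially over the five steps.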

References
  1. Yang, Y., Ma, M., Huang, Y., Chai, H., Gong, C., Geng, H., Zhou, Y., Wen, Y., Fang, M., Chen, M., Gu, S., Jin, M., Spanos, C., Yang, Y., Abbeel, P., Song, D., Zhang, W., & Wang, J. (2025). Agentic Web: Weaving the Next Web with AI Agents. arXiv. 10.48550/ARXIV.2507.21206
  2. Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., & Yadkori, Y. A. (2025). Hierarchical Reasoning Model. arXiv. 10.48550/ARXIV.2506.21734
  3. Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., Chen, Y., Zheng, H., Liu, Y., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y., Wang, J., … Yang, Z. (2025). Muon is Scalable for LLM Training. arXiv. 10.48550/ARXIV.2502.16982
  4. Hale, N., Higham, N. J., & Trefethen, L. N. (2008). Computing $A^\alpha$, $\log(A)$, and Related Matrix Functions by Contour Integrals. SIAM Journal on Numerical Analysis, 46(5), 2505–2523. 10.1137/070700607