
Numerical stability

A short guide to dtype choice, finite-difference \(\varepsilon\) tuning, and reorthogonalization defaults.

Dtype

| dtype | Where it makes sense | Caveats |
|-------|----------------------|---------|
| fp64 | Validation tests, small models, papers that quote eigenvalues to 6+ significant figures | Slow; doubles memory; CUDA kernels less optimized |
| fp32 | Most analysis on models up to ~1B parameters | The default |
| bf16 | LLM scale (≥7B), where fp32 master copies are infeasible | Per-matvec noise on the order of 1% |
| fp16 | Generally avoid for HVPs: the narrow exponent range underflows | Use bf16 instead at scale |

The operator inherits its dtype from the model parameters. Cast the model with model.to(torch.float64) for high-precision validation.
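A minimal sketch of the fp64 validation setup; the HessianOperator constructor arguments shown here (loss_fn, the data batch) are illustrative assumptions, not the library's exact signature:

```python
import torch

# Cast parameters and buffers to fp64; inputs must match the model dtype,
# or matvecs will silently upcast or fail.
model = model.to(torch.float64)
inputs = inputs.to(torch.float64)

# Assumed constructor signature, for illustration only.
op = HessianOperator(model, loss_fn, data=(inputs, targets))
```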

Finite-difference HVP \(\varepsilon\) selection

The central-difference approximation \(Hv \approx \left(\nabla L(\theta + \varepsilon v) - \nabla L(\theta - \varepsilon v)\right) / (2\varepsilon)\) has two competing error sources:

  • Truncation bias: \(O(\varepsilon^2 \|v\|^3 \|\nabla^4 L\|)\), from the first Taylor term that survives the central difference
  • Roundoff (cancellation): \(O(\varepsilon_\text{machine} \cdot \|\nabla L\| / \varepsilon)\)
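Setting the two terms equal (absorbing the norm factors into constants \(A\) and \(B\)) and solving for \(\varepsilon\) gives

\[
\varepsilon^2 A = \frac{\varepsilon_\text{machine}\, B}{\varepsilon}
\;\Longrightarrow\;
\varepsilon^* = \left(\frac{\varepsilon_\text{machine}\, B}{A}\right)^{1/3} \propto \varepsilon_\text{machine}^{1/3}.
\]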

This \(\varepsilon^* \approx \varepsilon_\text{machine}^{1/3}\) scaling translates into per-dtype defaults and error floors:

| dtype | Optimal \(\varepsilon\) | Best-case relative error | Realistic relative error |
|-------|------------------------|--------------------------|--------------------------|
| fp64 | \(\sim 10^{-5}\) | \(\sim 10^{-10}\) | \(\sim 10^{-9}\) |
| fp32 | \(\sim 10^{-3}\) | \(\sim 10^{-6}\) | \(\sim 10^{-5}\) to \(10^{-4}\) |
| bf16 | \(\sim 0.1\) | \(\sim 10^{-2}\) | \(\sim 10^{-2}\) to \(5 \cdot 10^{-2}\) |

HessianOperator(method="finite_difference") selects these defaults automatically; override them with the fd_eps argument.
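A minimal sketch of the central-difference matvec with the \(\varepsilon_\text{machine}^{1/3}\) default; fd_hvp and grad_fn are illustrative names, not the library's internals:

```python
import torch

def fd_hvp(params, grad_fn, v, eps=None):
    # params, v: flat 1-D tensors; grad_fn(p) returns the flat gradient at p.
    if eps is None:
        # eps_machine^(1/3), scaled so it stays meaningful relative to |params|
        eps = torch.finfo(params.dtype).eps ** (1 / 3) * (1 + params.norm())
    g_plus = grad_fn(params + eps * v)
    g_minus = grad_fn(params - eps * v)
    return (g_plus - g_minus) / (2 * eps)
```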

For LLM-scale spectral analysis where you are already in bf16, the finite-difference error is dominated by precision noise in the gradient evaluations themselves, so finite differences are the right tool. For high-precision eigenvalue claims on small models, use the autograd path in fp32 or fp64.
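For reference, the exact autograd HVP is two lines of double backprop (Pearlmutter, 1994). This standalone sketch uses plain torch.autograd rather than the library's operator:

```python
import torch

def autograd_hvp(loss, params, v):
    # Differentiate <grad L, v> with respect to params: exact Hv, no epsilon
    # to tune. params and v are matching sequences of tensors.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * vi).sum() for g, vi in zip(grads, v))
    return torch.autograd.grad(dot, params)
```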

Reorthogonalization in Lanczos

Without reorthogonalization, classical Lanczos loses orthogonality of its basis after enough steps and produces ghost eigenvalues: spurious duplicates of already-converged eigenvalues that pollute the result (Paige 1976). The fix is full Gram-Schmidt against all prior basis vectors after each step, at \(O(m^2 n)\) total cost for \(m\) steps and \(n\) parameters.

lanczos(operator, ..., reorthogonalize=None) defaults to True for max_iter <= 50 and to False for larger Krylov dimensions. If you are analyzing a Hessian known to have many near-degenerate eigenvalues (common for transformers), keep reorthogonalization on regardless.
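A sketch of the reorthogonalized iteration (not the library's implementation); matvec is any symmetric operator, e.g. one of the HVPs above:

```python
import torch

def lanczos_full_reorth(matvec, n, m, dtype=torch.float64):
    # Returns the tridiagonal coefficients (alphas, betas) after m steps.
    Q = torch.zeros(m, n, dtype=dtype)   # Krylov basis, kept for reorth
    alphas, betas = [], []
    q = torch.randn(n, dtype=dtype)
    Q[0] = q / q.norm()
    for j in range(m):
        w = matvec(Q[j])
        alpha = torch.dot(w, Q[j])
        alphas.append(alpha)
        w = w - alpha * Q[j]
        if j > 0:
            w = w - betas[-1] * Q[j - 1]
        # Full Gram-Schmidt against ALL prior basis vectors: the O(m^2 n)
        # step that suppresses ghost eigenvalues.
        w = w - Q[: j + 1].T @ (Q[: j + 1] @ w)
        beta = w.norm()
        if j < m - 1:
            betas.append(beta)
            Q[j + 1] = w / beta
    return torch.stack(alphas), torch.stack(betas)
```

The Ritz values are then the eigenvalues of the tridiagonal matrix, e.g. torch.linalg.eigvalsh(torch.diag(alphas) + torch.diag(betas, 1) + torch.diag(betas, -1)).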

Mixed-precision tradeoffs in iterative algorithms

For top eigenpairs in fp32 / bf16, expect the eigenvalue estimates to inherit roughly the per-matvec relative error compounded across Lanczos steps. Specifically:

  • Lanczos with reorthogonalization: tridiagonal eigenvalues accurate to roughly 3-5x the per-matvec error.
  • Power iteration: the error decays geometrically, as \(O(|\lambda_2/\lambda_1|^t)\) after \(t\) iterations; precision noise sets a floor on how close to the true top eigenvalue you can get.
  • Hutchinson trace: stochastic sampling error (\(\sim 1/\sqrt{\text{samples}}\)) dominates precision noise at practical sample counts.
  • SLQ density: the kernel smoothing used to form the spectral density washes out per-matvec noise effectively.

In practice, for qualitative spectral analysis (sharpness trends, density shape, top-eigenvalue changes during training), fp32 or bf16 is fine. For absolute precision of specific eigenvalues, drop to fp64 on a tractable problem.
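To see the Hutchinson sample-noise scaling concretely, a toy estimator (illustrative, not the library's API):

```python
import torch

def hutchinson_trace(matvec, n, num_samples=256, dtype=torch.float32):
    # Rademacher probes: E[v^T (Av)] = tr(A). The stochastic standard error
    # shrinks as 1/sqrt(num_samples), independent of per-matvec precision.
    total = torch.zeros((), dtype=dtype)
    for _ in range(num_samples):
        v = (2 * torch.randint(0, 2, (n,)) - 1).to(dtype)  # +/-1 entries
        total += torch.dot(v, matvec(v))
    return total / num_samples
```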

References

  • Paige, C. C. (1976). Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix.
  • Granziol & Juarev (2026). Hessian Spectral Analysis at Foundation Model Scale. arXiv:2602.00816.
  • Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation 6(1).