Numerical stability¶
A short guide to dtype choice, finite-difference \(\varepsilon\) tuning, and reorthogonalization defaults.
Dtype¶
| dtype | Where it makes sense | Caveats |
|---|---|---|
| fp64 | Validation tests, small models, papers that quote eigenvalues to 6+ sig figs | Slow; doubles memory; CUDA kernels less optimized |
| fp32 | Most analysis on models up to ~1B params | The library default; for absolute eigenvalue precision, prefer fp64 |
| bf16 | LLM scale (≥7B), where fp32 master copies are infeasible | Per-matvec noise on the order of 1% |
| fp16 | Generally avoid for HVP; its narrow exponent range makes gradient differences underflow | Use bf16 instead at scale |
The operator inherits its dtype from the model parameters. Cast the model with `model.to(torch.float64)` for high-precision validation.
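A minimal sketch of the fp64 validation setup (the `nn.Linear` toy model and squared-output loss are placeholders, not library API):

```python
import torch
import torch.nn as nn

# Toy model; any nn.Module works the same way.
model = nn.Linear(4, 1)

# Cast parameters to fp64 for high-precision validation; an operator
# built on top of the model then inherits this dtype.
model = model.to(torch.float64)

# Inputs must match the parameter dtype.
x = torch.randn(8, 4, dtype=torch.float64)
loss = model(x).pow(2).mean()
print(loss.dtype)  # torch.float64
```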
Finite-difference HVP \(\varepsilon\) selection¶
The central-difference approximation \(Hv \approx \left(\nabla L(\theta + \varepsilon v) - \nabla L(\theta - \varepsilon v)\right) / (2\varepsilon)\) has two competing error sources:
- Truncation bias: \(O(\varepsilon^2 \cdot \|v\| \cdot \|\nabla^3 L\|)\)
- Roundoff (cancellation): \(O(\varepsilon_\text{machine} \cdot \|\nabla L\| / \varepsilon)\)
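The two error regimes are easy to see in a one-dimensional analogue (plain fp64 scalar differentiation, not the library's HVP):

```python
import math

# Central-difference derivative of f(x) = exp(x) at x = 1 in fp64.
# The true derivative is exp(1), so relative error is easy to measure.
f, x, true = math.exp, 1.0, math.exp(1.0)

errs = {}
for eps in (1e-12, 1e-5, 1e-1):  # too small / near-optimal / too large
    approx = (f(x + eps) - f(x - eps)) / (2 * eps)
    errs[eps] = abs(approx - true) / true
    print(f"eps={eps:.0e}  relative error={errs[eps]:.1e}")

# eps=1e-12 is dominated by cancellation, eps=1e-1 by truncation.
```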
Optimal \(\varepsilon\) minimizes their sum at \(\varepsilon^* \approx \varepsilon_\text{machine}^{1/3}\):
| dtype | Optimal \(\varepsilon\) | Best-case relative error | Realistic relative error |
|---|---|---|---|
| fp64 | \(\sim 10^{-5}\) | \(\sim 10^{-10}\) | \(\sim 10^{-9}\) |
| fp32 | \(\sim 10^{-3}\) | \(\sim 10^{-6}\) | \(\sim 10^{-5}\) to \(10^{-4}\) |
| bf16 | \(\sim 0.1\) | \(\sim 10^{-2}\) | \(\sim 10^{-2}\) to \(5 \cdot 10^{-2}\) |
`HessianOperator(method="finite_difference")` selects these defaults automatically. Override them with the `fd_eps=` argument.
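The \(\varepsilon^* \approx \varepsilon_\text{machine}^{1/3}\) rule can be reproduced from `torch.finfo` (a sketch of the rule itself; the library's exact defaults may be tuned differently):

```python
import torch

# Machine epsilon per dtype, and the eps**(1/3) finite-difference rule.
for dtype in (torch.float64, torch.float32, torch.bfloat16):
    eps_machine = torch.finfo(dtype).eps
    fd_eps = eps_machine ** (1.0 / 3.0)
    print(f"{str(dtype):>16}: machine eps = {eps_machine:.1e}, fd_eps ~ {fd_eps:.0e}")
```

The printed values land close to the defaults tabulated above.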
For LLM-scale spectral analysis where you are already in bf16, precision noise in the gradients themselves dominates the finite-difference error, so finite differences are the right tool. For high-precision eigenvalue claims on small models, use the autograd path in fp32 or fp64.
Reorthogonalization in Lanczos¶
Without reorthogonalization, classical Lanczos loses orthogonality after enough steps and produces ghost eigenvalues: spurious duplicates of converged eigenvalues that pollute the result (Paige 1976). The fix is full Gram-Schmidt of each new Lanczos vector against all prior basis vectors, at \(O(m^2 n)\) cost for \(m\) steps in dimension \(n\).
`lanczos(operator, ..., reorthogonalize=None)` defaults to `True` for `max_iter <= 50` and `False` for larger Krylov dimensions. If you are analyzing a Hessian known to have many near-degenerate eigenvalues, which is common for transformers, keep reorthogonalization on regardless.
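The full Gram-Schmidt sweep itself is a one-line projection against the stored basis (a sketch of the idea, not the library's `lanczos` internals):

```python
import torch

def full_reorthogonalize(w: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Project the new Lanczos vector w against all prior basis
    vectors (the columns of Q). Cost per step is O(m * n), hence
    O(m^2 * n) over m steps."""
    return w - Q @ (Q.T @ w)

# Toy check: after the sweep, w is orthogonal to every prior vector.
torch.manual_seed(0)
Q, _ = torch.linalg.qr(torch.randn(50, 10, dtype=torch.float64))
w = torch.randn(50, dtype=torch.float64)
w = full_reorthogonalize(w, Q)
print(torch.abs(Q.T @ w).max())  # near machine epsilon
```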
Mixed-precision tradeoffs in iterative algorithms¶
For top eigenpairs in fp32 / bf16, expect the eigenvalue estimates to inherit roughly the per-matvec relative error compounded across Lanczos steps. Specifically:
- Lanczos with reorthogonalization: tridiagonal eigenvalues good to ~3-5x the per-matvec error.
- Power iteration: the error decays geometrically as \(|\lambda_2/\lambda_1|^t\); precision noise sets a floor on how close to the true top eigenvalue you can get.
- Hutchinson trace: stochastic noise dominates over precision noise as long as you take enough samples.
- SLQ density: smoothing washes out per-matvec noise effectively.
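The precision floor for power iteration shows up even on a small synthetic matrix (a sketch with a hypothetical test matrix, not the library's API):

```python
import torch

def power_iteration(A: torch.Tensor, iters: int = 200) -> float:
    """Plain power iteration; A's dtype sets the precision floor."""
    v = torch.randn(A.shape[0], dtype=A.dtype)
    for _ in range(iters):
        v = A @ v
        v = v / torch.linalg.norm(v)
    return (v @ (A @ v)).item()  # Rayleigh quotient

# Symmetric test matrix with a well-separated top eigenvalue of 2.0,
# so 200 iterations converge far past either dtype's noise floor.
torch.manual_seed(0)
Q, _ = torch.linalg.qr(torch.randn(100, 100, dtype=torch.float64))
evals = torch.linspace(0.1, 1.0, 100, dtype=torch.float64)
evals[-1] = 2.0
A = Q @ torch.diag(evals) @ Q.T

err64 = abs(power_iteration(A) - 2.0)
err32 = abs(power_iteration(A.to(torch.float32)) - 2.0)
print(f"fp64 error: {err64:.1e}, fp32 error: {err32:.1e}")
```

The fp64 run converges to roughly machine precision, while the fp32 run plateaus at an error set by the fp32 rounding of the matvecs, not by the iteration count.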
In practice, for qualitative spectral analysis (sharpness trends, density shape, top-eigenvalue changes during training), fp32 or bf16 is fine. For absolute precision of specific eigenvalues, drop to fp64 on a tractable problem.
References¶
- Paige, C. C. (1976). Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix.
- Granziol & Juarev (2026). Hessian Spectral Analysis at Foundation Model Scale. arXiv:2602.00816.
- Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation 6(1).