Spectral density¶
The eigenvalue density (or density of states) of an \(n \times n\) symmetric operator \(H\) is the probability density of its eigenvalues:

\[\phi(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \delta(\lambda - \lambda_i),\]

where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(H\) and \(\delta\) is the Dirac delta.
For a trained neural network this density typically has a large bulk concentrated near zero plus a small number of outliers; see Ghorbani, Krishnan, Xiao (2019) for the bulk-plus-outliers picture and Pan et al. (ICML 2025) for the power-law structure in LLMs.
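For a small matrix the density can be computed directly from a full eigendecomposition, which is useful as a ground truth when checking an estimator. A minimal NumPy sketch (the random test matrix is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
H = (A + A.T) / 2                    # small symmetric test operator

eigvals = np.linalg.eigvalsh(H)      # all n eigenvalues, O(n^3)

# Histogram estimate of phi(lambda); with density=True the bar areas sum to 1.
density, edges = np.histogram(eigvals, bins=100, density=True)
```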
Stochastic Lanczos Quadrature (SLQ)¶
Computing \(\phi\) exactly requires the full eigendecomposition. Ubaru, Chen, and Saad (2017) give a stochastic estimator that costs \(O(n_v \cdot m)\) matrix-vector products, where \(n_v\) is the number of random probe vectors and \(m\) is the number of Lanczos steps per probe.
For each of \(n_v\) random Rademacher probe vectors \(v_l\):
- Run \(m\) Lanczos steps from \(v_l\) to obtain a tridiagonal \(T_m^{(l)}\).
- Eigendecompose \(T_m^{(l)}\). The nodes \(\theta_k^{(l)}\) are the eigenvalues of \(T_m^{(l)}\), and the weights \(\tau_k^{(l)} = (e_1^\top y_k)^2\) are the squared first components of its eigenvectors \(y_k\), where \(e_1 = (1, 0, \ldots, 0)\).
- The smoothed density is the average over runs of a sum of Gaussian-blurred contributions:

  \[\hat{\phi}_\sigma(\lambda) = \frac{1}{n_v} \sum_{l=1}^{n_v} \sum_{k=1}^{m} \tau_k^{(l)} \, g_\sigma\!\left(\lambda - \theta_k^{(l)}\right),\]
where \(g_\sigma\) is a unit-mass Gaussian of bandwidth \(\sigma\). Each run's quadrature weights sum to 1, so the result integrates to \(\approx 1\) — it is a probability density of eigenvalues.
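The procedure is short enough to sketch directly. Below is a minimal NumPy illustration that assumes the operator is a dense symmetric matrix `H`; the names and defaults here are illustrative, not the library's internals, and the library itself only ever touches the operator through matrix-vector products.

```python
import numpy as np

def slq_density(H, num_runs=10, lanczos_steps=80, sigma=None, num_grid_points=1000, seed=0):
    """Minimal SLQ sketch for a dense symmetric matrix H (illustration only)."""
    n = H.shape[0]
    rng = np.random.default_rng(seed)
    all_nodes, all_weights = [], []
    for _ in range(num_runs):
        v = rng.choice([-1.0, 1.0], size=n)           # Rademacher probe vector
        v /= np.linalg.norm(v)
        Q = np.zeros((n, lanczos_steps))
        alpha = np.zeros(lanczos_steps)
        beta = np.zeros(lanczos_steps - 1)
        for j in range(lanczos_steps):
            Q[:, j] = v
            w = H @ v
            alpha[j] = v @ w
            w = w - alpha[j] * v
            if j > 0:
                w = w - beta[j - 1] * Q[:, j - 1]
            # Full reorthogonalization against the stored basis (costly but stable).
            w = w - Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
            if j == lanczos_steps - 1:
                break
            b = np.linalg.norm(w)
            if b < 1e-10:                             # invariant subspace found; stop early
                alpha = alpha[: j + 1]
                beta = beta[:j]
                break
            beta[j] = b
            v = w / b
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        theta, Y = np.linalg.eigh(T)                  # nodes = eigenvalues of T
        all_nodes.append(theta)
        all_weights.append(Y[0, :] ** 2)              # weights = squared first components

    nodes = np.concatenate(all_nodes)
    weights = np.concatenate(all_weights) / num_runs  # average over runs
    if sigma is None:
        sigma = (nodes.max() - nodes.min()) / lanczos_steps
    grid = np.linspace(nodes.min() - 3 * sigma, nodes.max() + 3 * sigma, num_grid_points)
    kernel = np.exp(-0.5 * ((grid[:, None] - nodes[None, :]) / sigma) ** 2)
    kernel /= sigma * np.sqrt(2.0 * np.pi)            # unit-mass Gaussians
    return grid, kernel @ weights
```

The library call below runs the same procedure through the `spectral_density` interface.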
```python
from hessian_eigenthings.algorithms import spectral_density

result = spectral_density(
    operator,
    num_runs=10,           # n_v in the paper
    lanczos_steps=80,      # m in the paper
    num_grid_points=10000,
    sigma=None,            # default: spectrum_range / lanczos_steps
    seed=0,
)
```
`result.density` integrates to ≈ 1 over `result.grid`. `result.raw_eigenvalues` and `result.raw_weights` give you the per-run nodes and weights so you can do your own smoothing or plotting.
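A quick sanity check on the normalization, as a sketch assuming NumPy:

```python
import numpy as np

# The smoothed density should integrate to roughly 1 over the grid.
mass = np.trapezoid(result.density, result.grid)   # np.trapz on NumPy < 2.0
print(f"total probability mass ≈ {mass:.4f}")
```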
Reading a spectral density¶
Trained-network Hessians are well known to have:
- A bulk concentrated near zero — most directions have very little curvature.
- A small number of outlier eigenvalues rising sharply above the bulk — for classification, the count is roughly the number of classes minus one (Ghorbani et al. 2019).
- A power-law tail in the bulk for well-trained LLMs (Pan et al. ICML 2025).
Plot in log-log to see the power-law structure. The outliers are usually clear in linear-y plots.
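One way to produce both views, assuming matplotlib and the `result` object from above (a sketch, not part of the library):

```python
import matplotlib.pyplot as plt

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Linear y-axis: outlier eigenvalues stand out clearly above the bulk.
ax_lin.plot(result.grid, result.density)
ax_lin.set(xlabel="eigenvalue", ylabel="density", title="linear")

# Log-log over the positive part of the spectrum: a power-law bulk shows up
# as an approximately straight line.
pos = result.grid > 0
ax_log.loglog(result.grid[pos], result.density[pos])
ax_log.set(xlabel="eigenvalue", ylabel="density", title="log-log (positive part)")

plt.tight_layout()
plt.show()
```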
Choosing \(m\) and \(n_v\)¶
The paper's bound: an \(\varepsilon\)-accurate spectral approximation needs roughly \(m = O\!\left(\sqrt{\kappa}\,\log(\kappa/\varepsilon)\right)\) Lanczos steps and \(n_v = O\!\left(\varepsilon^{-2}\log(1/\eta)\right)\) probe vectors, where \(\kappa\) is the condition number and \(\eta\) is the failure probability. In practice:
| Goal | Reasonable defaults |
|---|---|
| Quick visual check | lanczos_steps=50, num_runs=5 |
| Publication-quality density | lanczos_steps=100, num_runs=20 |
| High-resolution outlier study | lanczos_steps=200, num_runs=10 |
Larger lanczos_steps improves spectral resolution; larger num_runs reduces Monte-Carlo noise.
Bandwidth \(\sigma\)¶
The default \(\sigma = (\lambda_\max - \lambda_\min) / m\) relates the kernel width to the quadrature resolution and works well across most spectra. For very narrow features (closely spaced outliers) try a smaller \(\sigma\); for noisy estimates a larger \(\sigma\) smooths the result.
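Because the raw nodes and weights are returned, re-smoothing with a different \(\sigma\) does not require rerunning Lanczos. A sketch, assuming `raw_eigenvalues` and `raw_weights` can be flattened into matching 1-D arrays:

```python
import numpy as np

nodes = np.ravel(result.raw_eigenvalues)
weights = np.ravel(result.raw_weights)
weights = weights / weights.sum()        # renormalize to unit total mass

sigma = 1e-3                             # narrower than the default
grid = result.grid
kernel = np.exp(-0.5 * ((grid[:, None] - nodes[None, :]) / sigma) ** 2)
kernel /= sigma * np.sqrt(2.0 * np.pi)   # unit-mass Gaussian kernels
density_narrow = kernel @ weights        # still integrates to ≈ 1
```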
Limitations¶
- SLQ holds the entire Lanczos basis per run in memory while computing the tridiagonal (reorthogonalization needs every previous basis vector), which is the dominant memory cost at LLM scale; see the back-of-the-envelope estimate after this list. v1.3 will support a basis-free SLQ variant.
- The bias depends on \(m\) (Krylov truncation) and Gaussian smoothing; sharp delta-like features in the true spectrum get blurred. The lost detail is recoverable from `raw_eigenvalues`/`raw_weights`.
- For indefinite operators (Hessians at saddle points), reorthogonalization in the Lanczos kernel matters more; we keep it on by default in SLQ.
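To make the basis-storage limitation concrete, a back-of-the-envelope estimate with illustrative numbers:

```python
# One run's Lanczos basis: m vectors, each with one fp32 entry per parameter.
n_params = 7_000_000_000     # example 7B-parameter model
m_steps = 100                # Lanczos steps per run
bytes_per_float = 4          # fp32

basis_bytes = n_params * m_steps * bytes_per_float
print(f"Lanczos basis per run: {basis_bytes / 1e12:.1f} TB")   # ≈ 2.8 TB
```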
References¶
- Ubaru, Chen, Saad (2017). Fast Estimation of \(\mathrm{tr}(f(A))\) via Stochastic Lanczos Quadrature. SIAM J. Matrix Anal. Appl. 38(4).
- Lin, Saad, Yang (2016). Approximating Spectral Densities of Large Matrices. SIAM Review.
- Ghorbani, Krishnan, Xiao (2019). An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. ICML / arXiv:1901.10159.