Spectral density¶
The eigenvalue density (or density of states) of an \(n \times n\) symmetric operator \(H\) is the probability density of its eigenvalues:

\[\phi(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \delta(\lambda - \lambda_i),\]

where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(H\) and \(\delta\) is the Dirac delta.
For a trained neural network this density typically has a large bulk concentrated near zero plus a small number of outliers; see Ghorbani, Krishnan, Xiao (2019) for the bulk-plus-outliers picture and Pan et al. (ICML 2025) for the power-law structure in LLMs.
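For a small matrix the density can be computed directly from a full eigendecomposition, which is useful as a ground truth when checking an estimator. A minimal NumPy sketch (the random test matrix is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
H = (A + A.T) / 2                    # small symmetric test operator

eigvals = np.linalg.eigvalsh(H)      # all n eigenvalues, O(n^3)

# Histogram estimate of phi(lambda); with density=True the bar areas sum to 1.
density, edges = np.histogram(eigvals, bins=100, density=True)
```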
Stochastic Lanczos Quadrature (SLQ)¶
Computing \(\phi\) exactly requires the full eigendecomposition. Ubaru, Chen, and Saad (2017) give a stochastic estimator that costs \(O(n_v \cdot m)\) matrix-vector products, where \(n_v\) is the number of random probe vectors and \(m\) is the number of Lanczos steps per probe.
For each of \(n_v\) random Rademacher probe vectors \(v_l\):
- Run \(m\) Lanczos steps from \(v_l\) to obtain a tridiagonal \(T_m^{(l)}\).
- Eigendecompose \(T_m^{(l)}\). The nodes \(\theta_k^{(l)}\) are the eigenvalues of \(T_m^{(l)}\), and the weights \(\tau_k^{(l)} = (e_1^\top y_k)^2\) are the squared first components of its eigenvectors \(y_k\), where \(e_1 = (1, 0, \ldots, 0)\).
- The smoothed density is the average over runs of a sum of Gaussian-blurred contributions:

  \[\hat{\phi}_\sigma(\lambda) = \frac{1}{n_v} \sum_{l=1}^{n_v} \sum_{k=1}^{m} \tau_k^{(l)} \, g_\sigma\!\left(\lambda - \theta_k^{(l)}\right),\]
where \(g_\sigma\) is a unit-mass Gaussian of bandwidth \(\sigma\). Each run's quadrature weights sum to 1, so the result integrates to \(\approx 1\) — it is a probability density of eigenvalues.
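The procedure is short enough to sketch directly. Below is a minimal NumPy illustration that assumes the operator is a dense symmetric matrix `H`; the names and defaults here are illustrative, not the library's internals, and the library itself only ever touches the operator through matrix-vector products.

```python
import numpy as np

def slq_density(H, num_runs=10, lanczos_steps=80, sigma=None, num_grid_points=1000, seed=0):
    """Minimal SLQ sketch for a dense symmetric matrix H (illustration only)."""
    n = H.shape[0]
    rng = np.random.default_rng(seed)
    all_nodes, all_weights = [], []
    for _ in range(num_runs):
        v = rng.choice([-1.0, 1.0], size=n)           # Rademacher probe vector
        v /= np.linalg.norm(v)
        Q = np.zeros((n, lanczos_steps))
        alpha = np.zeros(lanczos_steps)
        beta = np.zeros(lanczos_steps - 1)
        for j in range(lanczos_steps):
            Q[:, j] = v
            w = H @ v
            alpha[j] = v @ w
            w = w - alpha[j] * v
            if j > 0:
                w = w - beta[j - 1] * Q[:, j - 1]
            # Full reorthogonalization against the stored basis (costly but stable).
            w = w - Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
            if j == lanczos_steps - 1:
                break
            b = np.linalg.norm(w)
            if b < 1e-10:                             # invariant subspace found; stop early
                alpha = alpha[: j + 1]
                beta = beta[:j]
                break
            beta[j] = b
            v = w / b
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        theta, Y = np.linalg.eigh(T)                  # nodes = eigenvalues of T
        all_nodes.append(theta)
        all_weights.append(Y[0, :] ** 2)              # weights = squared first components

    nodes = np.concatenate(all_nodes)
    weights = np.concatenate(all_weights) / num_runs  # average over runs
    if sigma is None:
        sigma = (nodes.max() - nodes.min()) / lanczos_steps
    grid = np.linspace(nodes.min() - 3 * sigma, nodes.max() + 3 * sigma, num_grid_points)
    kernel = np.exp(-0.5 * ((grid[:, None] - nodes[None, :]) / sigma) ** 2)
    kernel /= sigma * np.sqrt(2.0 * np.pi)            # unit-mass Gaussians
    return grid, kernel @ weights
```

The library call below runs the same procedure through the `spectral_density` interface.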
```python
from hessian_eigenthings.algorithms import spectral_density

result = spectral_density(
    operator,
    num_runs=10,           # n_v in the paper
    lanczos_steps=80,      # m in the paper
    num_grid_points=10000,
    sigma=None,            # default: spectrum_range / lanczos_steps
    seed=0,
)
```
`result.density` integrates to ≈ 1 over `result.grid`. `result.raw_eigenvalues` and `result.raw_weights` give you the per-run nodes and weights so you can do your own smoothing or plotting.
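A quick sanity check on the normalization, as a sketch assuming NumPy:

```python
import numpy as np

# The smoothed density should integrate to roughly 1 over the grid.
mass = np.trapezoid(result.density, result.grid)   # np.trapz on NumPy < 2.0
print(f"total probability mass ≈ {mass:.4f}")
```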
Reading a spectral density¶
Trained-network Hessians are well known to have:
- A bulk concentrated near zero — most directions have very little curvature.
- A small number of outlier eigenvalues rising sharply above the bulk — for classification, the count is roughly the number of classes minus one (Ghorbani et al. 2019).
- A power-law tail in the bulk for well-trained LLMs (Pan et al. ICML 2025).
Plot in log-log to see the power-law structure. The outliers are usually clear in linear-y plots.
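One way to produce both views, assuming matplotlib and the `result` object from above (a sketch, not part of the library):

```python
import matplotlib.pyplot as plt

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Linear y-axis: outlier eigenvalues stand out clearly above the bulk.
ax_lin.plot(result.grid, result.density)
ax_lin.set(xlabel="eigenvalue", ylabel="density", title="linear")

# Log-log over the positive part of the spectrum: a power-law bulk shows up
# as an approximately straight line.
pos = result.grid > 0
ax_log.loglog(result.grid[pos], result.density[pos])
ax_log.set(xlabel="eigenvalue", ylabel="density", title="log-log (positive part)")

plt.tight_layout()
plt.show()
```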
Choosing \(m\) and \(n_v\)¶
The paper's bound: an \(\varepsilon\)-accurate spectral approximation needs roughly \(m = O\!\left(\sqrt{\kappa}\,\log(\kappa/\varepsilon)\right)\) Lanczos steps and \(n_v = O\!\left(\varepsilon^{-2}\log(1/\eta)\right)\) probe vectors, where \(\kappa\) is the condition number and \(\eta\) is the failure probability. In practice:
| Goal | Reasonable defaults |
|---|---|
| Quick visual check | lanczos_steps=50, num_runs=5 |
| Publication-quality density | lanczos_steps=100, num_runs=20 |
| High-resolution outlier study | lanczos_steps=200, num_runs=10 |
Larger lanczos_steps improves spectral resolution; larger num_runs reduces Monte-Carlo noise.
Bandwidth \(\sigma\)¶
The default \(\sigma = (\lambda_\max - \lambda_\min) / m\) relates the kernel width to the quadrature resolution and works well across most spectra. For very narrow features (closely spaced outliers) try a smaller \(\sigma\); for noisy estimates a larger \(\sigma\) smooths the result.
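Because the raw nodes and weights are returned, re-smoothing with a different \(\sigma\) does not require rerunning Lanczos. A sketch, assuming `raw_eigenvalues` and `raw_weights` can be flattened into matching 1-D arrays:

```python
import numpy as np

nodes = np.ravel(result.raw_eigenvalues)
weights = np.ravel(result.raw_weights)
weights = weights / weights.sum()        # renormalize to unit total mass

sigma = 1e-3                             # narrower than the default
grid = result.grid
kernel = np.exp(-0.5 * ((grid[:, None] - nodes[None, :]) / sigma) ** 2)
kernel /= sigma * np.sqrt(2.0 * np.pi)   # unit-mass Gaussian kernels
density_narrow = kernel @ weights        # still integrates to ≈ 1
```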
Limitations¶
- SLQ holds the entire Lanczos basis per run in memory while computing the tridiagonal (reorthogonalization needs every previous basis vector), which is the dominant memory cost at LLM scale; see the back-of-the-envelope estimate after this list. v1.3 will support a basis-free SLQ variant.
- The bias depends on \(m\) (Krylov truncation) and Gaussian smoothing; sharp delta-like features in the true spectrum get blurred. The lost detail is recoverable from `raw_eigenvalues`/`raw_weights`.
- For indefinite operators (Hessians at saddle points), reorthogonalization in the Lanczos kernel matters more; we keep it on by default in SLQ.
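To make the basis-storage limitation concrete, a back-of-the-envelope estimate with illustrative numbers:

```python
# One run's Lanczos basis: m vectors, each with one fp32 entry per parameter.
n_params = 7_000_000_000     # example 7B-parameter model
m_steps = 100                # Lanczos steps per run
bytes_per_float = 4          # fp32

basis_bytes = n_params * m_steps * bytes_per_float
print(f"Lanczos basis per run: {basis_bytes / 1e12:.1f} TB")   # ≈ 2.8 TB
```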
References¶
- Ubaru, Chen, Saad (2017). Fast Estimation of \(\mathrm{tr}(f(A))\) via Stochastic Lanczos Quadrature. SIAM J. Matrix Anal. Appl. 38(4).
- Lin, Saad, Yang (2016). Approximating Spectral Densities of Large Matrices. SIAM Review.
- Ghorbani, Krishnan, Xiao (2019). An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. ICML / arXiv:1901.10159.