Why HVP, not full \(H\)¶
Forming the full Hessian costs \(O(n^2)\) memory. For a 7B-parameter model that's \(n^2 \approx 5\times10^{19}\) entries, or \(\sim\)200 exabytes at fp32. Iterative algorithms (power iteration, Lanczos, Hutchinson) only ever need to apply \(H\) to a vector, never to materialize it, so they cost \(O(n)\) memory.
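The back-of-the-envelope arithmetic (assuming fp32, i.e. 4 bytes per entry, and the 7B parameter count quoted above):

```python
# Memory to materialize H versus memory to hold one vector for Hv.
n = 7_000_000_000                      # 7B parameters
full_hessian_bytes = 4 * n * n         # O(n^2): the full Hessian at fp32
one_vector_bytes = 4 * n               # O(n): one probe vector at fp32

print(f"full H:     {full_hessian_bytes / 1e18:.0f} EB")
print(f"one vector: {one_vector_bytes / 1e9:.0f} GB")
```

So the full Hessian is roughly 196 EB, while each vector an iterative method touches is only about 28 GB.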
Two ways to compute \(Hv\)¶
Autograd (exact)¶
Pearlmutter's \(R\)-operator trick (1994), rediscovered as automatic double-backward: since \(Hv = \nabla_\theta\big(\nabla_\theta L(\theta)^\top v\big)\), differentiating the scalar \(\langle \nabla L, v \rangle\) a second time yields \(Hv\) directly.
In PyTorch:
```python
# params: iterable of model parameters; v_split: v chunked to match their shapes
g = torch.autograd.grad(loss, params, create_graph=True)       # ∇L, graph kept
g_dot_v = sum((gi * vi).sum() for gi, vi in zip(g, v_split))   # ⟨∇L, v⟩
Hv = torch.autograd.grad(g_dot_v, params)                      # ∇⟨∇L, v⟩ = Hv
```
Cost: roughly one extra backward pass per \(Hv\) on top of the original forward+backward. Numerically exact (to floating-point rounding). This is the default in `HessianOperator(method="autograd")`.
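A self-contained sanity check of the double-backward pattern (a sketch, not the library's API): on the quadratic loss \(L(w) = \tfrac{1}{2} w^\top A w\) the Hessian is exactly \(A\), so the product can be compared against \(A v\).

```python
import torch

# Quadratic loss L(w) = 0.5 * w^T A w with symmetric A, so Hessian == A.
torch.manual_seed(0)
A = torch.randn(4, 4)
A = A + A.T                       # symmetrize, like a true Hessian
w = torch.randn(4, requires_grad=True)
v = torch.randn(4)

loss = 0.5 * w @ (A @ w)
(g,) = torch.autograd.grad(loss, w, create_graph=True)  # ∇L = A w, graph kept
(hv,) = torch.autograd.grad(g @ v, w)                   # ∇⟨∇L, v⟩ = A v

assert torch.allclose(hv, A @ v, atol=1e-5)             # matches exact product
```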
Finite difference (FSDP-friendly)¶
The classic central difference:

\[
Hv \;\approx\; \frac{\nabla L(\theta + \varepsilon v) - \nabla L(\theta - \varepsilon v)}{2\varepsilon}
\]
Two normal forward+backward passes, with no second-backward graph anywhere. This is the technique Granziol & Juarev 2026 (arXiv:2602.00816) revive at LLM scale: Fully Sharded Data Parallel (FSDP), PyTorch's standard mechanism for training models too large to fit on a single device, detaches its gradient collectives from the autograd graph, which breaks double-backward. Finite difference doesn't care: it only uses first-order gradients, which FSDP handles correctly out of the box.
Use via `HessianOperator(method="finite_difference")`. See Numerical stability for how to pick \(\varepsilon\).
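A minimal numpy sketch of the central difference, again on the quadratic \(L(w) = \tfrac{1}{2} w^\top A w\) where the exact answer \(A v\) is known. The `grad` function here is an illustrative stand-in for one ordinary forward+backward pass; the \(\varepsilon\) scaling by \(\|v\|\) is one common heuristic, not the library's prescription.

```python
import numpy as np

# Quadratic loss: gradient is A @ theta, Hessian is A (symmetric).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A = A + A.T
w = rng.standard_normal(4)
v = rng.standard_normal(4)

def grad(theta):                  # first-order gradient only: FSDP-safe
    return A @ theta

eps = 1e-4 / max(np.linalg.norm(v), 1e-12)   # scale the step to ||v||
hv = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

assert np.allclose(hv, A @ v, atol=1e-6)     # exact for a quadratic
```

For a quadratic loss the central difference is exact up to rounding; on a real network the \(O(\varepsilon^2)\) truncation error appears, which is why the choice of \(\varepsilon\) matters.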
Why this is enough¶
For the things people actually want from the Hessian:
- Top eigenpairs: power iteration or Lanczos — both only need \(Hv\).
- Trace: Hutchinson's estimator \(\mathrm{tr}(H) \approx \frac{1}{m}\sum_{i=1}^{m} v_i^\top H v_i\) with random probe vectors \(v_i\) — only \(Hv\) products.
- Spectral density: Stochastic Lanczos Quadrature — same.
The full Hessian is never required. The library's job is to expose \(Hv\) in a clean, distributed-ready way and run these algorithms on top.
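To make the pattern concrete, here is a minimal Hutchinson sketch in plain numpy (not the library's implementation). The explicit matrix exists only so the estimate can be checked against the true trace; `hvp` is a stand-in for any \(Hv\) oracle, and the probes are Rademacher vectors \(v_i \in \{\pm 1\}^n\).

```python
import numpy as np

# Small symmetric "Hessian" so np.trace gives us ground truth to compare with.
rng = np.random.default_rng(0)
n = 50
H = rng.standard_normal((n, n))
H = H + H.T

def hvp(v):                       # stand-in for a matrix-free Hv oracle
    return H @ v

# tr(H) ≈ (1/m) Σ v_i^T H v_i with Rademacher probes.
m = 5000
est = 0.0
for _ in range(m):
    v = rng.choice([-1.0, 1.0], size=n)
    est += v @ hvp(v)
est /= m

print(f"estimate: {est:.2f}, true trace: {np.trace(H):.2f}")
```

The estimator's variance shrinks as \(1/m\), so a few thousand probes already land close to the true trace here.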
Reference¶
- Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation 6(1), 147-160.
- Granziol & Juarev (2026). Hessian Spectral Analysis at Foundation Model Scale. arXiv:2602.00816.