Softmax, can you derive the Jacobian? And should you care?


TLDR

  • Walkthrough of softmax internals: numerical stability via max-shift, full Jacobian derivation, and an efficient backward pass that avoids materializing the n x n matrix.

Key Takeaways

  • Softmax maps arbitrary logits onto the probability simplex; outputs sum to 1, creating cross-dimension coupling absent in elementwise activations like ReLU or sigmoid.
  • Naive implementation overflows at large inputs (e.g. x=1000); the fix is computing x - max(x) before exponentiation, a shift that leaves the softmax output unchanged (see the first sketch after this list).
  • Jacobian is diag(s) minus the outer product s·sᵀ, a diagonal plus rank-1 structure that enables efficient backprop without storing the full n x n matrix (derivation sketched after this list).
  • Backward pass reduces to dL/dx_i = s_i * (dL/ds_i - dot(s, dL/ds)): scale by the probability after centering the upstream gradient against its softmax-weighted mean (see the backward-pass sketch after this list).
  • At transformer vocab or sequence-length scale, naively materializing the Jacobian is prohibitive; the rank-1 structure is the practical escape hatch.
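A minimal NumPy sketch of the max-shift trick described above; the function name and test values are illustrative, not the article's own code:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis.

    Subtracting the row-wise max keeps every exponent <= 0, so np.exp
    never overflows; the shift cancels in the ratio and leaves the
    output unchanged.
    """
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Naive exp(1000) overflows to inf; the shifted version is fine.
print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # ~[0.09, 0.24, 0.67]
```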
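For reference, the derivation behind the diag-minus-rank-1 claim, written out as a short LaTeX sketch (notation assumed: s = softmax(x), Kronecker delta δ):

```latex
s_i = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad
\frac{\partial s_i}{\partial x_j}
  = \frac{\delta_{ij}\, e^{x_i} \sum_k e^{x_k} - e^{x_i} e^{x_j}}{\left(\sum_k e^{x_k}\right)^2}
  = s_i\,(\delta_{ij} - s_j)
\;\;\Longrightarrow\;\;
J = \operatorname{diag}(s) - s\, s^{\top}
```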
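And a sketch of the efficient backward pass from the last two bullets, checked against explicitly materializing the Jacobian on a tiny example (again illustrative, assumed code):

```python
import numpy as np

def softmax_backward(s: np.ndarray, grad_out: np.ndarray) -> np.ndarray:
    """Gradient w.r.t. the logits x, given s = softmax(x) and grad_out = dL/ds.

    Exploits J = diag(s) - s s^T: dL/dx_i = s_i * (dL/ds_i - <s, dL/ds>),
    so no n x n matrix is ever built.
    """
    return s * (grad_out - np.dot(s, grad_out))

# Verify against the explicit Jacobian, which is affordable only for small n.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
s = np.exp(x - x.max()); s /= s.sum()
grad_out = rng.normal(size=5)

jacobian = np.diag(s) - np.outer(s, s)  # n x n, fine for n = 5
assert np.allclose(jacobian @ grad_out, softmax_backward(s, grad_out))
```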

Hacker News Comment Review

  • No substantive HN discussion yet.
