Walkthrough of softmax internals: numerical stability via max-shift, full Jacobian derivation, and an efficient backward pass that avoids materializing the n × n Jacobian.
Key Takeaways
Softmax maps arbitrary logits onto the probability simplex; outputs sum to 1, creating cross-dimension coupling absent in elementwise activations like ReLU or sigmoid.
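A minimal sketch of that coupling, assuming NumPy: bumping a single logit moves every softmax output (the outputs must still sum to 1), while an elementwise activation like ReLU responds only in the bumped coordinate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # max-shift for stability (see next takeaway)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
bump = np.array([0.5, 0.0, 0.0])   # nudge only the first logit

print(softmax(x))               # [0.09003057 0.24472847 0.66524096]
print(softmax(x + bump))        # all three entries shift to keep the sum at 1
print(np.maximum(x + bump, 0))  # ReLU: only the first entry changes
```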
Naive implementation overflows for large inputs (e.g. x = 1000, where exp(x) exceeds float64 range); the fix is to compute x - max(x) before exponentiation, a shift that leaves the softmax output unchanged.
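A minimal sketch of both behaviors, assuming NumPy. The shift is safe because softmax(x - c) = softmax(x) for any constant c: the exp(-c) factor cancels between numerator and denominator.

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)              # exp(1000) overflows float64 -> inf, then inf/inf -> nan
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - np.max(x))  # largest exponent is exp(0) = 1; no overflow
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # [nan nan nan], with a RuntimeWarning
print(softmax_stable(x))  # [0.09003057 0.24472847 0.66524096]
```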
Jacobian is diag(s) minus the outer product s sᵀ, a diagonal-plus-rank-1 structure that enables efficient backprop without storing the full n × n matrix.
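One way to check the closed form J[i, j] = s_i * (delta_ij - s_j), assuming NumPy: build J = diag(s) - s sᵀ explicitly and compare one column against a central finite difference (the index j and step eps below are arbitrary test choices).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
s = softmax(x)

# Closed form: J[i, j] = ds_i/dx_j = s_i * (delta_ij - s_j)
J = np.diag(s) - np.outer(s, s)

# Central finite difference of column j as a sanity check
j, eps = 2, 1e-6
e_j = np.eye(5)[j]
col = (softmax(x + eps * e_j) - softmax(x - eps * e_j)) / (2 * eps)
print(np.allclose(J[:, j], col))  # True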
Backward pass reduces to s_i * (dL/ds_i - dot(s, dL/ds)): the upstream gradient is centered against its softmax-weighted mean, then scaled by each probability.
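A sketch of that reduction, assuming NumPy (the function name softmax_backward is ours, not a library API). The identity holds because J is symmetric, so the vector-Jacobian product Jᵀg equals Jg = s ⊙ g - s (sᵀg).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_backward(s, grad_s):
    # dL/dx_i = s_i * (dL/ds_i - dot(s, dL/ds)): O(n) time and memory
    return s * (grad_s - np.dot(s, grad_s))

rng = np.random.default_rng(1)
x = rng.standard_normal(6)
s = softmax(x)
grad_s = rng.standard_normal(6)  # upstream gradient dL/ds

J = np.diag(s) - np.outer(s, s)  # materialized only for this check
print(np.allclose(J.T @ grad_s, softmax_backward(s, grad_s)))  # True
```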
At transformer vocab or sequence-length scale, naively materializing the Jacobian is prohibitive; the rank-1 structure is the practical escape hatch.
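As an illustrative figure: for a vocabulary of 50,000 tokens, a single float32 Jacobian would occupy 50,000² × 4 bytes ≈ 10 GB per sequence position, while the rank-1 backward pass above needs only O(n) memory.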