Transformers Explained: The Discovery That Changed AI Forever
YC’s Ankit Gupta traces the lineage from LSTMs to Transformers, arguing ‘Attention Is All You Need’ was built on three sequential breakthroughs, not a single overnight discovery.
- The 2017 Google paper ‘Attention Is All You Need’ introduced Transformers by scrapping recurrence entirely, relying solely on attention.
- Vanishing gradients crippled early RNNs: backpropagating through N timesteps multiplies the gradient by the recurrent weight matrix N times, so the signal decays and long sequences become unreliable to train (see the first sketch below).
- Hochreiter and Schmidhuber proposed LSTMs in the 1990s to fix vanishing gradients, but they were too expensive to train until GPU acceleration arrived circa 2010.
- The fixed-length bottleneck in encoder-decoder LSTMs collapsed entire input sentences into one vector, breaking on long or complex sequences (see the second sketch below).
- Bahdanau, Cho, and Bengio’s 2014 seq2seq-with-attention paper beat the best statistical machine translation systems and triggered Google Translate’s quality jump.
- Transformers process all tokens in parallel via self-attention, making training dramatically faster than RNNs, which must step through tokens one at a time (see the self-attention sketch below).
- GPT (decoder-only) and BERT (encoder-only) each keep one half of the original encoder-decoder Transformer architecture.
2025-10-23 · Watch on YouTube
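To make the vanishing-gradient point concrete, here is a minimal NumPy sketch (not code from the video): a gradient vector is repeatedly multiplied by the transpose of a recurrent weight matrix whose largest singular value sits below 1, which is the regime where backpropagation through time decays. The matrix size, scale, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64
W = rng.normal(size=(hidden, hidden))
W *= 0.9 / np.linalg.norm(W, 2)          # rescale so the largest singular value is 0.9

grad = rng.normal(size=hidden)           # gradient arriving at the final timestep
for step in range(1, 101):
    grad = W.T @ grad                    # one step of backprop through time
    if step % 20 == 0:
        print(f"step {step:3d}: ||grad|| = {np.linalg.norm(grad):.3e}")
```

With the largest singular value at 0.9, the norm shrinks by roughly an order of magnitude every couple of dozen steps; with values above 1 the same loop explodes instead.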
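The encoder-decoder bottleneck shows up directly in the shapes: however long the input, a recurrent encoder hands the decoder one fixed-size state. A toy sketch under assumed simplifications (token embeddings already the same size as the hidden state, tanh as the recurrence), not the talk's code:

```python
import numpy as np

def encode(tokens: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy recurrent encoder: fold the whole sequence into a single state vector."""
    state = np.zeros(W.shape[0])
    for x in tokens:                     # one step per token, strictly sequential
        state = np.tanh(W @ state + x)   # assumes embeddings match the hidden size
    return state                         # fixed-length context, regardless of input length

rng = np.random.default_rng(0)
hidden = 8
W = rng.normal(scale=0.3, size=(hidden, hidden))

short_sentence = rng.normal(size=(5, hidden))
long_sentence = rng.normal(size=(50, hidden))
print(encode(short_sentence, W).shape)   # (8,)
print(encode(long_sentence, W).shape)    # (8,): same size, far more to compress
```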
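Finally, the parallelism claim: scaled dot-product self-attention scores every token against every other token with plain matrix multiplies, so there is no loop over positions. A minimal single-head sketch (no masking, no multi-head, no positional encoding; these are simplifications of the paper's architecture, not a faithful reimplementation):

```python
import numpy as np

def self_attention(X: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """X: (seq_len, d_model). Returns one attended vector per token, computed in parallel."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project all tokens at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len) token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ V                               # weighted mix of all value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (6, 16): one output per token, no recurrence
```

Because every operation here is a dense matrix product over the whole sequence, the work maps onto GPUs in one shot, which is the training-speed advantage the bullet describes.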