Step-by-step interactive walkthrough of LLM construction from raw web crawl through BPE tokenization, Transformer training, SFT, and RLHF.
Key Takeaways
FineWeb pipeline: Common Crawl’s 2.7B pages filtered via URL blocklists, text extraction, language detection, MinHash dedup, and PII removal yields 44TB / 15T tokens.
BPE tokenization starts from 256 byte symbols and merges most-frequent adjacent pairs iteratively; GPT-4 uses a 100,277-token vocabulary.
Pre-training loss measures next-token prediction error, driven steadily down over enormous numbers of gradient steps; Llama 3 trained a 405B-parameter model on 15T tokens.
Temperature at inference controls how broadly the model samples from the next-token probability distribution; 0.7-1.0 balances coherence and creativity.
Post-training is two stages: SFT on human-labeled ideal conversations, then RLHF, which trains a reward model on human-ranked responses and uses RL to tune the LLM toward higher reward scores.
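The MinHash dedup step in the FineWeb takeaway can be sketched in a few lines. This is a toy illustration, not FineWeb's actual implementation: the salted-MD5 hash family, 16-hash signature length, and 3-word shingles are all simplifying assumptions.

```python
import hashlib


def minhash_signature(text, num_hashes=16, shingle_size=3):
    """Toy MinHash: hash word shingles with a salted MD5 and keep the
    minimum per salt. Near-duplicate documents share most minimums, so
    their signatures agree in most positions."""
    words = text.lower().split()
    shingles = {
        " ".join(words[i:i + shingle_size])
        for i in range(len(words) - shingle_size + 1)
    }
    signature = []
    for salt in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates shingle overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A pipeline would bucket documents whose estimated Jaccard similarity exceeds some threshold and keep one representative per bucket.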
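The BPE procedure from the tokenization takeaway — start from 256 byte symbols, repeatedly merge the most frequent adjacent pair — can be written as a minimal trainer. The training string and merge count below are made up for illustration; production tokenizers like GPT-4's cl100k_base run this to a ~100k vocabulary over huge corpora.

```python
from collections import Counter


def bpe_train(text, num_merges):
    """Toy byte-pair encoding: tokens begin as single bytes, and each
    round fuses the most frequent adjacent pair into one new symbol."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:  # no pair worth merging
            break
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = bpe_train("low lower lowest low low", 10)
```

On this tiny corpus the first merges fuse the frequent "lo" and "low" byte pairs, shrinking the token sequence while growing the vocabulary.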
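The pre-training loss in the takeaways is per-position cross-entropy: the negative log-probability the model assigned to the token that actually came next. A minimal sketch (the toy distributions are invented for illustration):

```python
import math


def next_token_loss(probs, target_index):
    """Cross-entropy for one prediction: -log p(correct next token).
    Pre-training averages this over every position in the corpus."""
    return -math.log(probs[target_index])
```

Two reference points: guessing uniformly over a 100,277-token vocabulary gives a loss of ln(100277) ≈ 11.5, while assigning probability 0.9 to the right token gives ≈ 0.105 — the gap the billions of gradient updates close.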
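Temperature, as described in the takeaways, rescales the logits before the softmax: dividing by T < 1 sharpens the distribution toward the top token, T > 1 flattens it. A self-contained sketch (the logits are made up; real models produce one logit per vocabulary entry):

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, seed=None):
    """Softmax over logits / T, then draw one index from the result.
    Returns the sampled index and the full probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i, probs
    return len(probs) - 1, probs
```

At T near 0 the top logit dominates and output becomes nearly deterministic; the 0.7–1.0 range keeps meaningful mass on plausible alternatives.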
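The reward-model step of RLHF is commonly trained with a Bradley-Terry pairwise objective on the ranked responses: push the reward of the preferred answer above the rejected one. This is a standard formulation, assumed here rather than taken from the walkthrough:

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss for one ranked pair:
    -log sigmoid(r_chosen - r_rejected). Minimizing it widens the
    reward margin between preferred and rejected responses."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The RL stage then tunes the LLM to produce responses this reward model scores highly, typically with a penalty that keeps it close to the SFT model.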
Hacker News Comment Review
No substantive HN discussion yet.
Notable Comments
@learningToFly33: Suggests expanding the guide to cover how embedded data is fed at the final inference step and how it affects prediction results.