GPT-2 (2019) Paper Notes
Paper notes on GPT-2 covering its core ideas: decoder-only Transformer scaling, WebText, next-token prediction, zero-shot task transfer, and the staged release controversy.
Paper notes covering the core ideas of BERT: the bidirectional Transformer encoder, masked language model, next sentence prediction, and the fine-tuning paradigm.
A comparison of encoder-only and decoder-only architectures that distinguish the BERT and GPT families.
Explains how RNNs — the dominant architecture before Transformers — process sequences token by token, and the fundamental limitations that motivated moving beyond them.
Explains BERT's core training objective — the Masked Language Model — with formulas, commentary, and examples.
Explains large-scale pre-training and task-specific fine-tuning through the lens of the BERT workflow.
A clear explanation of what the Transformer encoder and decoder each do, grounded in the original architecture and illustrated with simple examples.
A reading note on the Transformer paper — the core ideas, why it mattered, and what to read next.
Explains cross-entropy and perplexity — the metrics used to measure how wrong a model is — with formulas, commentary, and examples.
A walkthrough of how softmax converts raw scores into probability-like values, with formulas, explanations, and examples.
A walkthrough of vectors and the dot product — with notation, explanations, and examples — covering what you need to know before reading LLM papers.
A Map of Content page for reading core LLM papers in order, starting from the Transformer.
Explains the role of Q, K, and V through the attention formula, a plain-language walkthrough, and a concrete example.
Why Residual, LayerNorm, and FFN are necessary in a Transformer block — explained with equations, commentary, and examples.