GPT-2 (2019) Paper Notes
Paper notes on GPT-2 covering its core ideas: decoder-only Transformer scaling, WebText, next-token prediction, zero-shot task transfer, and the staged release controversy.
Paper notes covering the core ideas of BERT: the bidirectional Transformer encoder, masked language model, next sentence prediction, and the fine-tuning paradigm.
A comparison of encoder-only and decoder-only architectures that distinguish the BERT and GPT families.
A clear explanation of what the Transformer encoder and decoder each do, grounded in the original architecture and illustrated with simple examples.
A reading note on the Transformer paper — the core ideas, why it mattered, and what to read next.
Explains the role of Q, K, and V through the attention formula, a plain-language walkthrough, and a concrete example.
Why Residual, LayerNorm, and FFN are necessary in a Transformer block — explained with equations, commentary, and examples.