Pre-Read: Overview of Transformers
Date: April 2, 2026
Speakers: Course Instructors (Steven Feng, Karan Singh, Michael C. Frank, Christopher Manning)
Topic Overview
The kickoff session provides a historical overview of transformers, core architectural components, and their explosive impact across NLP, vision, robotics, and multimodal AI. Expect a survey of the landscape rather than deep technical dives — this sets context for the rest of the course.
Key Concepts
| Concept | Definition |
|---|---|
| Self-Attention | Mechanism allowing each token to attend to all other tokens in a sequence, capturing relationships regardless of distance (see the code sketch after this table) |
| Position Encoding | Injects positional information since attention is permutation-invariant |
| Multi-Head Attention | Parallel attention heads learning different relationship types |
| Transformer Block | Stack of attention + feed-forward + layer norm + residual connections |
| Encoder-Decoder | Original Transformer: encoder processes input, decoder generates output |
| Encoder-only (BERT) | Bidirectional understanding, good for classification/extraction |
| Decoder-only (GPT) | Autoregressive generation, good for text completion |
| Foundation Model | Large pretrained model adaptable to many downstream tasks |
| Scaling Laws | Predictable relationship between compute, data, model size, and performance |
| Emergent Capabilities | Abilities that appear suddenly at certain scale thresholds (e.g., in-context learning) |
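To make the first few rows concrete, here is a minimal sketch of multi-head self-attention in PyTorch. Everything about it (dimensions, initialization, the absence of masking and dropout) is an illustrative simplification, not code from the session.

```python
# Minimal multi-head self-attention sketch (illustrative simplification).
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (batch, seq_len, d_model); each w_*: (d_model, d_model) projection."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V.
    scores = q @ k.transpose(-2, -1) / d_head**0.5  # (batch, heads, seq, seq)
    attn = F.softmax(scores, dim=-1)                # each token attends to all tokens
    out = attn @ v                                  # (batch, heads, seq, d_head)

    # Merge heads back together and apply the output projection.
    out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
    return out @ w_o

# Toy usage: batch of 2 sequences, 5 tokens each, d_model=16, 4 heads.
x = torch.randn(2, 5, 16)
weights = [torch.randn(16, 16) / 16**0.5 for _ in range(4)]
print(multi_head_self_attention(x, *weights, n_heads=4).shape)  # (2, 5, 16)
```

Note that nothing in this function depends on token order, which is exactly why the Position Encoding row above exists.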
Speakers
Christopher Manning
Affiliation: Stanford AI Lab Director, Professor of Computer Science and Linguistics
Why he matters: Pioneer of neural NLP, helped create Stanford CoreNLP, and a key figure in the deep learning NLP revolution. His textbook Foundations of Statistical Natural Language Processing (with Hinrich Schütze) remains a standard reference.
Recent focus: Interpretability, efficient transformers, embodied AI + language
Steven Feng
Affiliation: PhD Student, Stanford
Research: Multimodal AI, vision-language models
Karan Singh
Affiliation: PhD Student, Stanford
Research: Efficient ML, long-context modeling
Michael C. Frank
Affiliation: Professor of Psychology, Stanford
Research: Language acquisition, human-AI interaction, cognitive development
Key Papers
1. Attention Is All You Need (2017)
Authors: Vaswani et al. (Google Brain)
Why it matters: The paper that started it all. Introduced the Transformer architecture, replacing RNNs/CNNs with pure attention.
Key insight: Self-attention enables parallel processing of sequences and captures long-range dependencies efficiently.
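For reference, the paper's core operation, where $d_k$ is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Every row of the softmax can be computed independently, which is what makes sequence processing parallel.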
2. BERT: Pre-training of Deep Bidirectional Transformers (2018)
Authors: Devlin et al. (Google AI Language)
Why it matters: Showed that bidirectional pretraining on massive unlabeled text creates powerful language representations.
Key insight: Masked language modeling + next sentence prediction create generalizable understanding.
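A sketch of the masked language modeling recipe, specifically the paper's 80/10/10 masking rule; the token IDs and vocabulary size below are toy stand-ins:

```python
# Sketch of BERT-style MLM masking (80% [MASK] / 10% random / 10% unchanged).
# MASK_ID and VOCAB_SIZE are illustrative stand-ins, not tied to a real tokenizer.
import random

MASK_ID, VOCAB_SIZE = 103, 30522

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored in loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok  # the model must reconstruct the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with random token
            # remaining 10%: leave the token unchanged
    return inputs, labels

print(mask_tokens([7592, 1010, 2088, 999]))
```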
3. Language Models are Few-Shot Learners (GPT-3) (2020)
Authors: Brown et al. (OpenAI)
Why it matters: Demonstrated that scale enables in-context learning — models can perform tasks from examples without weight updates.
Key insight: Emergent capabilities appear at scale; 175B parameters was a turning point.
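In-context learning means the "training examples" live in the prompt and the weights never change; a hypothetical sketch of few-shot prompt construction:

```python
# Sketch of few-shot prompting: task examples go in the prompt, no weight updates.
# The sentiment task, examples, and format are hypothetical.
examples = [
    ("The movie was fantastic!", "positive"),
    ("Utterly boring and slow.", "negative"),
]
query = "A stunning, heartfelt performance."

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)  # fed to a large enough model, the completion should be "positive"
```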
4. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) (2020)
Authors: Dosovitskiy et al. (Google Brain)
Why it matters: Proved transformers work for vision too, not just language.
Key insight: Patch-based tokenization + transformer = competitive with CNNs when scaled.
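Patch tokenization is essentially a reshape; a sketch below, using the ViT-Base defaults (224x224 images, 16x16 patches) and omitting the learned linear projection and position embeddings:

```python
# Sketch of ViT patch tokenization: image -> sequence of flattened 16x16 patches.
import torch

def patchify(images, patch=16):
    """(batch, channels, H, W) -> (batch, num_patches, patch * patch * channels)."""
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)  # group the pixels of each patch together
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)

tokens = patchify(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]) -- 196 "words", each of dimension 768
```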
5. Scaling Laws for Neural Language Models (2020)
Authors: Kaplan et al. (OpenAI)
Why it matters: Established predictable relationships between compute, data, parameters, and performance.
Key insight: Power laws govern scaling — bigger is predictably better.
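The headline result is a power law in (non-embedding) parameter count; a sketch using the rounded constants Kaplan et al. report, which should be read as illustrative rather than exact:

```python
# Sketch of the parameter scaling law L(N) = (N_c / N)**alpha_N from Kaplan et al.
# Constants are the paper's rounded reported values; loss is in nats.
ALPHA_N = 0.076
N_C = 8.8e13  # non-embedding parameters

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss ~ {predicted_loss(n):.2f}")
```

Each 10x in parameters buys a predictable, smoothly shrinking slice of loss, which is what makes "bigger is predictably better" a planning tool rather than a slogan.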
Why It Matters for Autonomy
| Aspect | Relevance to Robotics/Embodied AI |
|---|---|
| Foundation models | Pretrained transformers can be fine-tuned for robot perception/control |
| Multimodal transformers | Vision-language models enable natural language robot commands |
| Long-context reasoning | Extended context windows support multi-step planning |
| In-context learning | Robots can adapt to new tasks from few demonstrations |
| World models | Transformers can learn predictive models for decision-making |
| Efficiency challenges | Edge deployment requires quantization, distillation, sparse attention (see the sketch after this table) |
| Interpretability gaps | Black-box nature is problematic for safety-critical autonomy |
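For the efficiency row: one standard edge-deployment move is post-training quantization of a transformer's linear layers. A minimal sketch with PyTorch dynamic quantization, where the tiny encoder is a stand-in for a real perception or policy network:

```python
# Sketch: int8 dynamic quantization of a transformer's linear layers in PyTorch.
# The toy encoder stands in for an actual robot model.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # weights stored as int8, activations stay fp32
)

x = torch.randn(1, 32, 256)  # (batch, seq_len, d_model)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 32, 256]), with a smaller weight footprint
```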
Question Bank
Architecture Questions
- What are the fundamental differences between encoder-only, decoder-only, and encoder-decoder transformers? When would you choose each?
- How does multi-head attention learn different relationship types? What determines the optimal number of heads?
- Why did transformers replace RNNs/LSTMs despite having O(n²) attention complexity?
Scaling Questions
- What capabilities emerge at scale that don’t exist in smaller models? Can we predict emergence?
- How do scaling laws inform decisions about model size vs. training compute vs. data?
- Is there a point of diminishing returns for transformer scale, or will bigger always be better?
Multimodal Questions
- How do vision transformers differ from CNNs in terms of inductive bias and data efficiency?
- What makes vision-language models like CLIP effective for cross-modal understanding?
- Can transformers be unified across modalities (text, image, audio, action) to enable embodied AI?
Autonomy Questions
- What are the biggest barriers to deploying transformers on robots (latency, memory, power)?
- How can transformers be used for world modeling and predictive control?
- What interpretability techniques exist for understanding transformer decision-making?
Pre-Lecture Reading
Essential
- Attention Is All You Need — The original paper
- The Illustrated Transformer — Visual explanation
Background
- Stanford CS224N: NLP with Deep Learning — Manning’s course
- The Annotated Transformer — Line-by-line implementation
Cross-References
- Week 2 (JEPA): Alternative to transformers for world modeling
- Week 3 (SSMs/Mamba): Linear-time alternatives to quadratic attention
- Week 6 (Interpretability): Understanding what transformers learn
- Week 7 (Med-PaLM): Safety-critical deployment patterns
Prepared: 2026-04-04 • Session 1 Overview