Pre-Read: Ultra-Scale Training — Nouamane Tazi
Date: April 23, 2026
Topic: Scaling Training to Thousands of GPUs
Speaker: Nouamane Tazi (Hugging Face)
Slides: Not yet posted — check course site Thursday
Speaker Bio
Nouamane Tazi is a research engineer at Hugging Face and contributor to the BigCode project. He’s known for work on efficient training, code generation models (StarCoder), and small language models (SmolLM3). His current focus is on the infrastructure and systems work required to train and deploy large models at scale.
Key affiliations:
- Hugging Face (research engineer)
- BigCode project contributor
Topic: Ultra-Scale Training
The talk is about scaling training infrastructure — how to efficiently train models across thousands of GPUs. This covers:
- Distributed training strategies
- Pipeline and tensor parallelism
- Memory optimization (gradient checkpointing, ZeRO; a checkpointing sketch follows this list)
- Hardware topology and network bottlenecks
- Training stability at scale
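As a concrete anchor for the memory-optimization bullet above, here is a minimal sketch of activation (gradient) checkpointing in PyTorch via torch.utils.checkpoint. The toy residual block, dimensions, and depth are illustrative assumptions, not anything from the talk:

```python
# Minimal sketch of activation (gradient) checkpointing in PyTorch.
# The toy model and sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedStack(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Discard this block's intermediate activations in forward;
            # recompute them on the fly during backward.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack(dim=512, depth=8)
x = torch.randn(4, 128, 512, requires_grad=True)
model(x).sum().backward()  # activations recomputed per block during backward
```

The trade is recomputation for memory: activation memory drops roughly in proportion to the number of checkpointed segments, at the cost of an extra forward pass inside backward.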
Research: Nouamane Tazi’s Recent Work
1. StarCoder 2 (BigCode, 2024)
arXiv: 2402.19173
Contribution: Family of open-source code generation LLMs (StarCoder 2-3B/7B/15B) trained on The Stack v2 dataset.
Key details:
- Trained on permissively licensed code from The Stack v2, which spans 600+ programming languages
- Context window: 16K tokens
- Competitive with proprietary code models on HumanEval
2. SmolLM3 (Hugging Face, 2025)
Blog: huggingface.co/blog/smollm3
Contribution: Small (3B parameters), multilingual, long-context reasoner demonstrating efficiency techniques; the earlier SmolLM/SmolLM2 families cover the 135M/360M/1.7B range.
Key details:
- Uses speculative decoding for fast inference (a sketch follows this list)
- Focus on quality per parameter — small but capable
- Long context (64K tokens, extendable to 128K) in a small footprint
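For context on the speculative-decoding bullet, below is a minimal sketch using the assisted-generation path of transformers' generate API, where a small draft model proposes tokens and the larger target model verifies them in a single forward pass. The checkpoint names are assumptions; any compatible target/draft pair sharing a tokenizer would do:

```python
# Sketch of speculative decoding via transformers' assisted generation.
# Checkpoint names below are assumptions; substitute any compatible pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "HuggingFaceTB/SmolLM2-1.7B"  # larger "target" model (assumed name)
draft_id = "HuggingFaceTB/SmolLM2-135M"   # small "draft" model (assumed name)

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16)
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
# The draft proposes several tokens per step; the target verifies them in
# one forward pass, so the output distribution matches the target model.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```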
3. Layer Normalization Research (2025)
Paper: Understanding Layer Normalization in Transformers
Contribution: Identified tanh-like, S-shaped curves in transformer layer input-output mappings, showing that transformers develop structured internal representations through training.
Relevance: Understanding this could help with training stability and model interpretability.
4. GPU Memory Prediction for MoE Models (2025)
Paper: Predicting GPU Memory Usage for Mixture-of-Experts Models
Contribution: Infrastructure work predicting GPU memory requirements for MoE models during training — critical for planning resource allocation.
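As a rough illustration of the planning problem this addresses, the sketch below estimates per-GPU training memory with the common 16-bytes-per-parameter rule of thumb for mixed-precision Adam (bf16 weights and gradients plus fp32 master weights and two optimizer moments, as in the ZeRO paper's accounting). This is a back-of-envelope heuristic for intuition, not the paper's actual predictor; it ignores activation and routing buffers:

```python
# Back-of-envelope training-memory estimate for an MoE model under
# mixed-precision Adam (~16 bytes/param: bf16 weights + grads, fp32
# master weights + two Adam moments). A rough planning sketch only;
# activation memory and routing buffers are ignored.
def moe_training_memory_gb(
    dense_params: float,     # parameters outside the expert layers
    expert_params: float,    # parameters per expert
    num_experts: int,
    bytes_per_param: float = 16.0,
    zero_shards: int = 1,    # GPUs sharding the states (ZeRO-3 style)
) -> float:
    total = dense_params + expert_params * num_experts
    return total * bytes_per_param / zero_shards / 1e9

# Example: 1B dense params, 64 experts of 0.5B each, sharded over 8 GPUs.
print(f"{moe_training_memory_gb(1e9, 0.5e9, 64, zero_shards=8):.1f} GB per GPU")
```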
Key Contributions & Why They Matter
| Contribution | Why It Matters |
|---|---|
| StarCoder 2 | Demonstrated that open code models can match proprietary ones. Foundation for code generation research. |
| SmolLM3 | Shows small models can still be capable. Important for edge/robotics deployment. |
| Layer norm insights | Understanding transformer internals helps with debugging and optimization. |
| GPU memory prediction | Practical infrastructure work that makes large-scale training more predictable. |
Relevance to Autonomy / Robotics
| Aspect | Relevance |
|---|---|
| Training infrastructure | Training autonomous models requires similar infrastructure |
| Edge deployment | SmolLM3’s efficiency techniques translate to robotics edge deployment |
| MoE for robotics | Mixture-of-experts could enable specialized sub-models for different robot tasks |
| Memory efficiency | Critical for onboard GPU systems with limited VRAM |
| Speculative decoding | Fast inference for real-time robot responses |
Key Insights to Listen For
Based on Nouamane’s work and topic:
- Distributed training bottlenecks — Where do bottlenecks typically occur? (communication, memory, compute)
- Pipeline parallelism strategies — How to partition model across GPUs efficiently
- Memory optimization techniques — ZeRO, gradient checkpointing, mixed precision (a ZeRO-style sharding sketch follows this list)
- Training stability at scale — What breaks when you scale, and how to fix it
- MoE infrastructure — How Hugging Face handles mixture-of-experts training
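To make the ZeRO bullet concrete, here is a minimal sketch of ZeRO stage-1-style optimizer-state sharding using PyTorch's built-in ZeroRedundancyOptimizer wrapped around DDP. It assumes a single-node launch via torchrun; the model and sizes are placeholders:

```python
# Minimal sketch of ZeRO stage-1-style optimizer-state sharding with
# PyTorch's ZeroRedundancyOptimizer. Assumes launch via torchrun (which
# sets the process-group environment variables) on a single node.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
# Adam's moments and master states are partitioned across ranks instead
# of being replicated on every GPU as in plain DDP.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-4
)

x = torch.randn(8, 1024, device=rank)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad()
dist.destroy_process_group()
```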
Papers Referenced in Talk
To be populated after lecture / when slides are posted.
Question Bank
Technical Questions
- What are the main bottlenecks when scaling to thousands of GPUs?
  - Is it communication bandwidth? Memory? Compute? Need to know where the pinch points are.
- How does pipeline parallelism interact with the transformer architecture?
  - Different frameworks (FSDP, DeepSpeed, Megatron) have different tradeoffs. What works best?
- What’s your approach to training stability at extreme scale?
  - Loss spikes, gradient explosion, NaN issues — how do you diagnose and fix them?
- How does speculative decoding work in practice, and what’s the quality/speed tradeoff?
  - Relevant for robotics: fast inference is critical.
- What memory optimization techniques are most impactful?
  - Gradient checkpointing, ZeRO stages, quantization — what’s the priority order?
Infrastructure Questions
- How do you handle GPU failures during training?
  - With thousands of GPUs, failures are inevitable. What’s the checkpoint strategy? (A save/resume sketch follows this list.)
- What’s the role of network topology in distributed training?
  - InfiniBand vs. RoCE vs. Ethernet — does the interconnect matter for training efficiency?
- How do you profile and debug training performance?
  - What tools/metrics do you use to find bottlenecks?
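On the checkpoint-strategy sub-question, a minimal save/resume loop might look like the sketch below. The path and the atomic-rename approach are assumptions about a reasonable setup, not anything confirmed from the talk:

```python
# Hedged sketch of a periodic-checkpoint / resume scheme for fault
# tolerance; the path and layout are placeholder assumptions.
import os
import torch

CKPT = "checkpoints/latest.pt"

def save_checkpoint(model, optimizer, step: int) -> None:
    os.makedirs(os.path.dirname(CKPT), exist_ok=True)
    tmp = CKPT + ".tmp"
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        tmp,
    )
    os.replace(tmp, CKPT)  # atomic rename: never leaves a half-written file

def load_checkpoint(model, optimizer) -> int:
    if not os.path.exists(CKPT):
        return 0  # fresh run
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume from the next step
```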
Robotics/Autonomy Questions
- How do these large-scale training techniques transfer to robot-specific training?
  - RL training, behavior cloning, world-model training — how do they differ from language-model training?
- SmolLM3 shows small models can be capable. What’s the path to truly capable edge models for robotics?
  - Can we get robot-capable models in the 1-3B parameter range?
- MoE seems promising for robotics — specialized experts for navigation, manipulation, and language. Any plans in that direction?
  - A practical question about whether HF is exploring robotics-specific MoE.
Research Philosophy Questions
- What’s surprised you most about transformer training behavior at scale?
  - Curious about unexpected findings from the layer normalization work.
- Open or closed models for code generation — what’s your take on the future?
  - BigCode vs. Codex/GitHub Copilot — where is the field heading?
Cross-References
- Week 3 (SSMs): Mamba’s efficient inference is relevant to the efficiency discussion
- Week 10 (Modal): Serverless GPU deployment — complementary infrastructure topic
- Related research: PEFT methods (LoRA, QLoRA) for fine-tuning at scale (a minimal LoRA sketch follows)
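Since LoRA is cross-referenced above, here is a minimal sketch with the peft library; the base checkpoint and target-module names are assumptions for illustration:

```python
# Minimal LoRA fine-tuning setup with the peft library. The base model
# and target module names are assumed for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")  # assumed checkpoint
config = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed names)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```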
Slide Link
Slides not yet posted. Check course schedule Thursday for updates.
Session Summary
To be filled in after lecture.
Last updated: 2026-04-22