Pre-Read: TBD Topic
Date: May 28, 2026 • Speaker: Charles Frye (Modal)
Topic Overview
Topic TBD; check the course site for updates. The speaker's work centers on serverless GPU infrastructure, CUDA Python, and practical ML deployment.
Key Concepts
Awaiting topic announcement.
Speaker
Charles Frye
Affiliation: Modal (AI Engineer)
Background: PhD from UC Berkeley (psychology/cognitive science), now focused on practical ML infrastructure and GPU computing. Creator of educational content including the GPU Glossary.
Key contributions:
- Modal platform — Serverless GPU infrastructure for GenAI workloads
- GPU Glossary — Educational resource on GPU architecture
- CUDA Python advocacy — Year of CUDA Python (GTC 2025 coverage)
- Distributed inference patterns — Practical deployment knowledge
Resources
1. Modal Platform
What it is: Serverless GPU platform for running GenAI models without managing infrastructure. A minimal usage sketch follows the feature list.
Key features:
- Pay-per-use GPU compute
- Auto-scaling from 0 to hundreds of GPUs
- Built-in distributed inference support
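To make the serverless model concrete, here is a minimal sketch of a GPU function on Modal, based on the public Python SDK. The app name, GPU type, and image contents are illustrative choices, and decorator names can differ across SDK versions:

```python
import modal

# Container image for the remote function; torch is installed at build time.
image = modal.Image.debian_slim().pip_install("torch")

app = modal.App("preread-demo")  # illustrative app name

@app.function(gpu="A10G", image=image)  # request one A10G; scales from zero
def gpu_check() -> str:
    import torch  # imported here because torch only exists in the remote image
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    # .remote() ships the call to Modal's cloud and blocks for the result
    print(gpu_check.remote())
```

Running this with `modal run` spins a container up on demand and tears it down afterward, which is the pay-per-use, scale-from-zero behavior listed above.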
2. GPU Glossary
What it is: Educational content explaining GPU architecture, CUDA concepts, and performance optimization. A short measurement sketch follows the topic list.
Topics covered:
- Memory hierarchy (HBM, L2 cache, shared memory)
- Kernel optimization
- Tensor cores and matrix operations
- Multi-GPU communication (NVLink, PCIe)
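Two of these topics can be felt directly from Python. The sketch below (PyTorch is an assumption here; the glossary itself is library-agnostic) contrasts a memory-bound copy, limited by HBM bandwidth, with a compute-bound fp16 matmul that exercises tensor cores. It assumes a CUDA-capable GPU and is a ballpark illustration, not a rigorous benchmark:

```python
import time
import torch

def timed(fn, iters=50):
    """Average wall-clock seconds per call, with GPU sync before and after."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Memory-bound: copying ~256 MB reads and writes HBM, so throughput ~ bandwidth.
x = torch.randn(1 << 26, device="cuda")
copy_s = timed(lambda: x.clone())
print(f"copy bandwidth ~ {2 * x.nbytes / copy_s / 1e9:.0f} GB/s")

# Compute-bound: a large fp16 matmul is dispatched to tensor cores.
a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
mm_s = timed(lambda: a @ a)
print(f"fp16 matmul ~ {2 * 8192**3 / mm_s / 1e12:.1f} TFLOP/s")
```

Comparing the two numbers against the GPU's datasheet is a quick way to see whether a workload is memory- or compute-limited, a standard framing for kernel optimization.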
3. CUDA Python Coverage
What it is: Podcast and blog coverage of NVIDIA's CUDA Python initiative (GTC 2025). A minimal kernel example follows the topic list.
Topics covered:
- Tensor memory accelerators
- Python-native GPU programming
- Performance comparisons
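Since the coverage centers on Python-native GPU programming, a minimal CUDA kernel written in pure Python gives a feel for it. The sketch below uses Numba's CUDA target, one established route; the GTC 2025 initiative also covers NVIDIA's own cuda-python packages, not shown here:

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    # One thread per element: the classic CUDA "hello world", in Python.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = cuda.to_device(np.random.rand(n).astype(np.float32))
y = cuda.to_device(np.random.rand(n).astype(np.float32))
out = cuda.device_array(n, dtype=np.float32)

threads = 256
blocks = (n + threads - 1) // threads  # enough blocks to cover all n elements
saxpy[blocks, threads](np.float32(2.0), x, y, out)
print(out.copy_to_host()[:4])
```

The launch syntax (`kernel[blocks, threads](...)`) and the thread-index bounds check map one-to-one onto their CUDA C++ equivalents, which is the point of the Python-native tooling.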
Why It Matters for Autonomy
| Aspect | Relevance to Robotics/Embodied AI |
|---|---|
| Deployment infrastructure | Serverless GPU for on-demand autonomy workloads — burst compute for planning/inference |
| Practical ML systems | Real-world deployment patterns that go beyond toy examples |
| Cost efficiency | Running inference without managing hardware — relevant for fleet deployment |
| Edge-cloud hybrid | Understanding when to use cloud GPU vs edge deployment |
| GPU optimization | Making models run faster on constrained hardware |
Question Bank
Infrastructure Questions
- What’s the break-even point where serverless GPU becomes more expensive than dedicated hardware?
- How does Modal handle cold starts for latency-sensitive autonomy workloads?
- What are the patterns for hybrid edge-cloud inference?
Optimization Questions
- What are the most impactful GPU optimizations for transformer inference?
- How much performance gain is realistic from CUDA kernel optimization vs just using better libraries?
- What’s the state of CUDA Python for production workloads vs C++?
Deployment Questions
- How do you handle model versioning and rollback in serverless GPU deployments?
- What monitoring/observability patterns work best for GPU workloads?
- How does Modal handle multi-GPU inference (tensor parallelism, pipeline parallelism)?
Pre-Lecture Reading
Essential
- Modal GPU Glossary — quick reference
- Modal docs — platform overview
Background
- To be populated once the topic is announced.
Cross-References
To be populated once topic is announced.
Prepared: 2026-04-04 • Topic TBD