Pre-Read: Topic TBD

Date: May 28, 2026
Speaker: Charles Frye (Modal)


Topic Overview

Topic TBD — check the course site for updates. The speaker's work focuses on serverless GPU infrastructure, CUDA Python, and practical ML deployment.


Key Concepts

Awaiting topic announcement.


Speaker

Charles Frye

Affiliation: Modal (AI Engineer)

Background: PhD from UC Berkeley (neuroscience), now focused on practical ML infrastructure and GPU computing. Creator of educational content including the GPU Glossary.

Key contributions:

  • Modal platform — Serverless GPU infrastructure for GenAI workloads
  • GPU Glossary — Educational resource on GPU architecture
  • CUDA Python advocacy — Year of CUDA Python (GTC 2025 coverage)
  • Distributed inference patterns — Practical deployment knowledge

Resources

1. Modal Platform

What it is: Serverless GPU platform for running GenAI models without managing infrastructure.

Key features:

  • Pay-per-use GPU compute
  • Auto-scaling from 0 to hundreds of GPUs
  • Built-in distributed inference support

🔗 Modal docs
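
Modal's Python SDK expresses the pattern above in a few decorators. Below is a minimal sketch, assuming Modal's current public API; the app name, GPU type, and model are illustrative placeholders, not anything announced for the talk:

```python
import modal

app = modal.App("example-inference")

# Dependencies are baked into a container image rather than installed at runtime.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image)
def generate(prompt: str) -> str:
    # Import inside the function so it resolves inside the container.
    from transformers import pipeline

    pipe = pipeline("text-generation", model="distilgpt2")
    return pipe(prompt, max_new_tokens=32)[0]["generated_text"]

@app.local_entrypoint()
def main():
    # .remote() ships the call to Modal; containers spin up from zero on demand.
    print(generate.remote("Serverless GPUs are"))
```

Run with `modal run app.py`; the GPU container exists only for the duration of the call, which is what makes the pay-per-use pricing work.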


2. GPU Glossary

What it is: Educational content explaining GPU architecture, CUDA concepts, and performance optimization.

Topics covered:

  • Memory hierarchy (HBM, L2 cache, shared memory)
  • Kernel optimization
  • Tensor cores and matrix operations
  • Multi-GPU communication (NVLink, PCIe)

🔗 GPU Glossary
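
To make the tensor-core entry concrete, here is a rough PyTorch timing sketch (an illustration, not taken from the glossary): fp16 matrix multiplies dispatch to tensor cores on modern NVIDIA GPUs, while strict fp32 runs on the ordinary CUDA cores. It requires a CUDA GPU, and the gap varies by architecture.

```python
import time

import torch

def time_matmul(dtype: torch.dtype, n: int = 4096, iters: int = 10) -> float:
    """Average seconds per n x n matmul at the given precision (CUDA GPU required)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    _ = a @ b  # warm-up launch (kernel selection and caching)
    torch.cuda.synchronize()  # GPU work is async; sync before and after timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Disable TF32 so the fp32 path stays off the tensor cores.
torch.backends.cuda.matmul.allow_tf32 = False
print(f"fp32: {time_matmul(torch.float32):.4f} s")
print(f"fp16: {time_matmul(torch.float16):.4f} s")
```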


3. CUDA Python Coverage

What it is: Podcast and blog coverage of NVIDIA’s CUDA Python initiative (GTC 2025).

Topics covered:

  • Tensor memory accelerators
  • Python-native GPU programming
  • Performance comparisons

🔗 Podcast
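
The initiative spans several layers, from low-level bindings (e.g., cuda.core) to kernel authoring in Python. For a feel of what Python-native GPU programming looks like today, here is a minimal elementwise kernel in Numba CUDA, one existing Python-GPU stack; it is an illustration, not necessarily the API the coverage discusses:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # One thread per element, standard CUDA indexing via cuda.grid.
    i = cuda.grid(1)
    if i < x.size:  # guard threads past the end of the array
        out[i] = x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = 2 * np.ones(n, dtype=np.float32)
out = np.empty_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # Numba copies host arrays for us
print(out[:4])  # -> [3. 3. 3. 3.]
```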


Why It Matters for Autonomy

  • Deployment infrastructure — Serverless GPU for on-demand autonomy workloads: burst compute for planning/inference (sketched below)
  • Practical ML systems — Real-world deployment patterns that go beyond toy examples
  • Cost efficiency — Running inference without managing hardware; relevant for fleet deployment
  • Edge-cloud hybrid — Understanding when to use cloud GPU vs edge deployment
  • GPU optimization — Making models run faster on constrained hardware
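
The burst-compute item above is the most directly transferable pattern: fan a batch of planning or inference calls out across autoscaled containers, then let the pool drain back to zero. A hedged sketch using Modal's .map() fan-out; the function body, GPU type, and sizes are placeholders:

```python
import modal

app = modal.App("burst-planning")

@app.function(gpu="T4")
def plan_step(scenario: dict) -> float:
    # Placeholder for a planner or policy forward pass.
    return sum(scenario["state"]) * 0.5

@app.local_entrypoint()
def main():
    scenarios = [{"state": [i, i + 1]} for i in range(1000)]
    # .map() fans the calls out across containers; Modal scales up from zero
    # for the burst and back down when it drains.
    costs = list(plan_step.map(scenarios))
    print(min(costs), max(costs))
```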

Question Bank

Infrastructure Questions

  1. What’s the break-even point where serverless GPU becomes more expensive than dedicated hardware? (See the back-of-envelope sketch after this list.)
  2. How does Modal handle cold starts for latency-sensitive autonomy workloads?
  3. What are the patterns for hybrid edge-cloud inference?
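
For question 1, the break-even framing is simple enough to sketch. All prices below are made-up placeholders, not quotes from Modal or any cloud vendor:

```python
# Back-of-envelope break-even for serverless vs dedicated GPU (illustrative prices).
serverless_per_gpu_hour = 2.50    # $/GPU-hour, pay-per-use
dedicated_per_gpu_month = 900.00  # $/GPU-month, reserved hardware
hours_per_month = 730

breakeven = dedicated_per_gpu_month / (serverless_per_gpu_hour * hours_per_month)
print(f"Dedicated hardware wins above ~{breakeven:.0%} sustained utilization")
# With these numbers: roughly 49%; bursty workloads below that favor serverless.
```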

Optimization Questions

  1. What are the most impactful GPU optimizations for transformer inference?
  2. How much performance gain is realistic from CUDA kernel optimization vs just using better libraries?
  3. What’s the state of CUDA Python for production workloads vs C++?

Deployment Questions

  1. How do you handle model versioning and rollback in serverless GPU deployments?
  2. What monitoring/observability patterns work best for GPU workloads?
  3. How does Modal handle multi-GPU inference (tensor parallelism, pipeline parallelism)?

Pre-Lecture Reading

Essential

To be populated once the topic is announced.

Background

To be populated once the topic is announced.

Cross-References

To be populated once the topic is announced.


Prepared: 2026-04-04 • Topic TBD