Category: MLOps
-
Beyond GPUs: Mastering Ultra-Scale LLM Training – Part 2
Introduction: In the first part of this series, we unpacked the big picture of scaling LLM training, covering the “why” and “what” behind ultra-scale setups and how different forms of parallelism come together to make training trillion-parameter models possible at all. That gave us the map. Now it’s time to get into the weeds of the “how.”…
-
Beyond GPUs: Mastering Ultra-Scale LLM Training – Part 1
Introduction: Training today’s largest language models demands massive computational resources, often thousands of GPUs humming in perfect harmony, orchestrated to act as one. Until recently, only a few elite research labs could marshal such “symphonies” of compute power. The open-source movement has started to change that by releasing model weights (like Llama or DeepSeek) and…
-
Supercharging Your Inference of Large Language Models with vLLM (Part 2)
As discussed in part 1 of this blog post, vLLM is a high-throughput distributed system for serving large language models (LLMs) efficiently. It addresses the challenge of memory management in LLM serving systems by introducing PagedAttention, an innovative attention algorithm inspired by virtual memory techniques in operating systems. This approach allows for near-zero waste in…
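For readers skimming this listing, a minimal sketch of vLLM’s offline-inference API is shown below; the model name, sampling settings, and gpu_memory_utilization value are placeholder assumptions for illustration, not values taken from the post itself.

```python
# Minimal vLLM offline-inference sketch (illustrative; model id and
# parameter values are placeholder assumptions, not from the post).
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does KV-cache fragmentation hurt serving throughput?",
]

# Sampling settings for generation; PagedAttention manages the KV cache
# in fixed-size blocks under the hood, so memory is allocated on demand
# rather than pre-reserved per sequence.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model id
    gpu_memory_utilization=0.90,               # fraction of GPU memory vLLM may use
)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```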