Category: LLM Inference
-
Supercharging Your Inference of Large Language Models with vLLM (part-2)
As discussed in part 1 of this blog post, vLLM is a high-throughput distributed system for serving large language models (LLMs) efficiently. It addresses the challenge of memory management in LLM serving systems by introducing PagedAttention, an attention algorithm inspired by virtual memory paging in operating systems. This approach allows for near-zero waste in…
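For a quick taste of what the post covers, here is a minimal sketch of offline batched generation with vLLM's Python API; the model name, prompts, and sampling values are arbitrary choices for illustration, and PagedAttention's block-based KV-cache management happens transparently inside the engine.

```python
# Minimal sketch: offline batched generation with vLLM.
# Model name and sampling values are illustrative, not prescriptive.
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence:",
    "Why does KV-cache fragmentation waste GPU memory?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# PagedAttention manages the KV cache in fixed-size blocks internally;
# callers only interact with the generate() interface.
llm = LLM(model="facebook/opt-125m")

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```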
-
Supercharging Your Inference of Large Language Models with vLLM (part-1)
As the demand for large language models (LLMs) continues to rise, optimizing inference performance becomes crucial. vLLM is an innovative library designed to enhance the efficiency and speed of LLM inference and serving. This blog post gives a high-level overview of vLLM’s capabilities, its unique features, and how it compares to similar solutions in…