vLLM

High-throughput LLM inference and serving engine

Open Source · Free · Trending

About

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models. It introduced PagedAttention, an attention algorithm inspired by virtual-memory paging that nearly eliminates KV-cache fragmentation and enables up to 24x higher throughput than Hugging Face Transformers. vLLM provides an OpenAI-compatible API server and supports a wide range of model architectures for production deployment.
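
Example

A minimal sketch of vLLM's offline Python API (the model name and prompt are illustrative, not part of this listing):

from vllm import LLM, SamplingParams

# Load a supported model; weights are fetched from the Hugging Face Hub.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Configure sampling for generation.
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in one call.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)

For serving, the same model can be exposed through the OpenAI-compatible server (e.g. vllm serve meta-llama/Llama-3.1-8B-Instruct) and queried with any OpenAI client pointed at http://localhost:8000/v1.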

Details

Type: inference-engine
Languages: Python, C++, CUDA
Supported Models: Llama, Mistral, Mixtral, Falcon, GPT-NeoX, Qwen, Phi, StarCoder, Yi, DeepSeek

Tags

inference · serving · high-throughput · paged-attention · openai-compatible · production · gpu