vLLM

High-throughput LLM inference and serving engine

Open Source · Free · Trending

About

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models. It introduced PagedAttention, an attention algorithm inspired by virtual-memory paging that nearly eliminates KV-cache fragmentation and enables up to 24x higher throughput than Hugging Face Transformers. vLLM provides an OpenAI-compatible API server and supports a wide range of model architectures for production deployment.
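
Example

A minimal sketch of vLLM's offline Python API (the model name and prompt are illustrative, not part of this listing):

from vllm import LLM, SamplingParams

# Load a supported model; weights are fetched from the Hugging Face Hub.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Configure sampling for generation.
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in one call.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)

For serving, the same model can be exposed through the OpenAI-compatible server (e.g. vllm serve meta-llama/Llama-3.1-8B-Instruct) and queried with any OpenAI client pointed at http://localhost:8000/v1.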

Details

Type: inference-engine
Languages: Python, C++, CUDA
Supported Models: Llama, Mistral, Mixtral, Falcon, GPT-NeoX, Qwen, Phi, StarCoder, Yi, DeepSeek

Tags

inference · serving · high-throughput · paged-attention · openai-compatible · production · gpu