vLLM
High-throughput LLM inference and serving engine
Open Source · Free · Trending
About
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models. It introduced PagedAttention, an attention algorithm that stores the KV cache in non-contiguous fixed-size blocks (analogous to virtual-memory paging), cutting memory waste and enabling up to 24x higher throughput than Hugging Face Transformers. vLLM exposes an OpenAI-compatible API server and supports a wide range of model architectures for production deployment.
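Because the server speaks the OpenAI schema, existing OpenAI clients can point at it unchanged. A minimal sketch of the request shape, assuming a server started with `vllm serve` listening on its default port; the model name is a placeholder:

```python
import json

# Assumptions: a vLLM server is running locally (e.g. `vllm serve <model>`)
# on the default http://localhost:8000, and the model name is whatever
# you served -- the one below is only an example.
BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Assemble a /v1/chat/completions request body in the OpenAI schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# Send with any HTTP client, e.g. curl or the official openai package
# configured with base_url="http://localhost:8000/v1".
print(json.dumps(body, indent=2))
```

The same body works against any OpenAI-compatible endpoint, which is what makes vLLM a drop-in backend for existing tooling.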
Details
| Type | inference-engine |
| Languages | Python, C++, CUDA |
| Supported Models | Llama, Mistral, Mixtral, Falcon, GPT-NeoX, Qwen, Phi, StarCoder, Yi, DeepSeek |
Tags
inference, serving, high-throughput, paged-attention, openai-compatible, production, gpu
Quick Info
- Organization: vLLM Team (UC Berkeley)
- Pricing: Free
- Free Tier: Yes
- Popularity: 86/100
- Stars: 70.0K
- Updated: Feb 19, 2026
Also in Training Tools
PyTorch
The most widely-used open-source deep learning framework
PyTorch Foundation (Linux Foundation) · 97.0K stars · Free (open source)
Axolotl
Streamlined tool for fine-tuning large language models
Axolotl AI · 11.3K stars · Free (open source)
Unsloth
Fine-tune LLMs 2-5x faster with 80% less memory
Unsloth AI · 51.0K stars · Free (open source) / Pro plans available · Trending