vLLM REST API
High-throughput LLM inference and serving engine
vLLM is a fast, easy-to-use library for LLM inference and serving, optimized for high throughput and memory efficiency. It exposes OpenAI-compatible REST API endpoints for running large language models, with features such as continuous batching, PagedAttention, and tensor parallelism. Developers use vLLM to serve LLM applications with low latency and high GPU utilization.
Base URL: http://localhost:8000/v1
API Endpoints
| Method | Path (from server root) | Description |
|---|---|---|
| POST | /v1/completions | Generate text completions from a prompt using the loaded model |
| POST | /v1/chat/completions | Generate chat completions from a list of role-based messages |
| GET | /v1/models | List the models currently served by the vLLM server |
| POST | /v1/embeddings | Generate vector embeddings for input text using an embedding model |
| POST | /tokenize | Tokenize input text and return token IDs and counts for the loaded model |
| POST | /detokenize | Convert token IDs back into readable text using the model's tokenizer |
| GET | /health | Check the health status of the vLLM server |
| GET | /version | Get the version of the running vLLM server |
| GET | /metrics | Retrieve Prometheus-compatible metrics for monitoring server performance |
| POST | /v1/score | Score the similarity or relevance of text pairs using a cross-encoder or embedding model |
| GET | /openapi.json | Get the OpenAPI specification for the endpoints the server exposes |
| POST | /generate | Low-level generation endpoint with fine-grained control over sampling parameters (available in the demo `vllm.entrypoints.api_server` entrypoint) |
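As a rough illustration of how these paths combine with the host, the sketch below builds full URLs and fetches JSON using only the Python standard library. It assumes the OpenAI-compatible routes sit under /v1 while operational routes such as /health and /metrics are served from the server root; `SERVER_ROOT`, `OPENAI_ROUTES`, `endpoint_url`, and `get_json` are hypothetical names chosen for this example, not part of vLLM.

```python
import json
import urllib.request

# Assumed server root; the default base URL on this page is http://localhost:8000/v1.
SERVER_ROOT = "http://localhost:8000"

# Assumption: these OpenAI-compatible routes live under /v1;
# everything else (health, metrics, tokenize, ...) is served from the root.
OPENAI_ROUTES = {"completions", "chat/completions", "models", "embeddings"}


def endpoint_url(name: str, root: str = SERVER_ROOT) -> str:
    """Return the full URL for an endpoint name such as 'models' or 'health'."""
    name = name.strip("/")
    prefix = "/v1/" if name in OPENAI_ROUTES else "/"
    return root.rstrip("/") + prefix + name


def get_json(name: str, root: str = SERVER_ROOT, timeout: float = 5.0):
    """GET an endpoint and decode its JSON body (needs a running server)."""
    with urllib.request.urlopen(endpoint_url(name, root), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `get_json("models")` should return an OpenAI-style list object whose `data` entries carry the model `id` values.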
Code Examples
# The Authorization header is only required if the server was started with an API key (--api-key).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": false
  }'
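The same request can be sketched in Python with the standard library. This is a minimal, non-streaming example under stated assumptions: the helper names (`build_chat_request`, `first_reply`, `chat`) are hypothetical, and the response is assumed to follow the OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None, **params):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": messages, **params}
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the server enforces an API key
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def first_reply(response_body: dict) -> str:
    """Extract the assistant text from an OpenAI-style chat completion body."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, model, messages, api_key=None, timeout=60.0, **params):
    """Send the request and return the assistant's reply (needs a running server)."""
    req = build_chat_request(base_url, model, messages, api_key, **params)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return first_reply(json.loads(resp.read().decode("utf-8")))
```

For example, `chat("http://localhost:8000/v1", "meta-llama/Llama-2-7b-chat-hf", [{"role": "user", "content": "Explain quantum computing in simple terms."}], temperature=0.7, max_tokens=150)` mirrors the curl call above. Splitting request construction from sending keeps the payload easy to inspect without a live server.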
Use vLLM from Claude / Cursor / ChatGPT
vLLM is self-hosted: it runs on a host you operate (default http://localhost:8000/v1). A
hosted MCP gateway cannot reach localhost on your machine, so the usual one-click setup does not apply.
These are the tools an MCP server for vLLM would expose:
vllm_generate_completion
Generate text completions from a prompt using vLLM's inference engine with customizable parameters like temperature and max tokens
vllm_chat
Have multi-turn conversations with LLMs using the chat completions API, supporting system prompts and conversation history
vllm_get_embeddings
Generate vector embeddings for semantic search, similarity matching, and RAG applications using vLLM embedding models
vllm_list_models
Query available models loaded in the vLLM server to determine capabilities and select appropriate models for tasks
vllm_monitor_metrics
Fetch performance metrics, including request throughput, KV-cache usage, and latency statistics, for optimization and monitoring
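One way such tools could be wired to the REST API is a simple lookup from tool name to HTTP method and path. The sketch below is illustrative only: the routing table, the `plan_tool_call` helper, and the choice of paths (OpenAI-compatible routes under /v1, metrics at the server root) are assumptions made for this example, not a published MCP implementation.

```python
import json

# Hypothetical mapping from the tool names listed above to an HTTP method and path.
TOOL_ROUTES = {
    "vllm_generate_completion": ("POST", "/v1/completions"),
    "vllm_chat": ("POST", "/v1/chat/completions"),
    "vllm_get_embeddings": ("POST", "/v1/embeddings"),
    "vllm_list_models": ("GET", "/v1/models"),
    "vllm_monitor_metrics": ("GET", "/metrics"),
}


def plan_tool_call(tool, arguments=None, root="http://localhost:8000"):
    """Translate a tool invocation into the HTTP call it would make (nothing is sent)."""
    if tool not in TOOL_ROUTES:
        raise ValueError(f"unknown tool: {tool}")
    method, path = TOOL_ROUTES[tool]
    # Only POST routes carry a JSON body; GET routes take no arguments here.
    body = json.dumps(arguments or {}) if method == "POST" else None
    return {"method": method, "url": root.rstrip("/") + path, "body": body}
```

Returning a plain description of the call, rather than executing it, lets an MCP server log or validate the request before forwarding it to the vLLM host.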
Run a vLLM MCP locally
A local CLI version of these tools is on the way (npx @meru/rest-mcp --vendor=vllm · bring your own connection string · zero secrets sent to us). For now, use the patterns below in your own MCP server, or self-host one from the IOX templates.