Self-hosted · AI & Machine Learning · API Key

vLLM REST API

High-throughput LLM inference and serving engine

vLLM is a fast and easy-to-use library for LLM inference and serving, optimized for high throughput and memory efficiency. It provides OpenAI-compatible REST API endpoints for running large language models, with features such as continuous batching, PagedAttention, and tensor parallelism. Developers use vLLM to deploy production-grade LLM applications with low latency and high GPU utilization.

Base URL http://localhost:8000/v1

API Endpoints

Paths below are relative to the server root (http://localhost:8000). The OpenAI-compatible routes live under /v1; the utility routes are served from the root.

Method  Endpoint              Description
POST    /v1/completions       Generate text completions from a prompt using the loaded model
POST    /v1/chat/completions  Generate chat completions from a list of role-based messages
GET     /v1/models            List the models currently loaded in the vLLM server
POST    /v1/embeddings        Generate vector embeddings for input text using an embedding model
POST    /tokenize             Tokenize input text and return token IDs and counts for the loaded model
POST    /detokenize           Convert token IDs back into readable text using the model's tokenizer
GET     /health               Check the health status of the vLLM server
GET     /version              Get the version of the vLLM server
GET     /metrics              Retrieve Prometheus-compatible metrics for monitoring server performance
POST    /v1/score             Score the similarity of text pairs using a cross-encoder or embedding model
GET     /openapi.json         Get the OpenAPI specification for all available vLLM endpoints
POST    /generate             Low-level generation endpoint with fine-grained control over sampling parameters (served by vLLM's demo API server, not the OpenAI-compatible server)
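
The endpoints above can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. The helper names `build_request` and `call` are hypothetical, and the split between root-level and `/v1` paths is an assumption about a default vLLM OpenAI-compatible server; check `/openapi.json` on your own deployment for the authoritative list.

```python
import json
import urllib.request

SERVER_ROOT = "http://localhost:8000"  # assumption: default host and port


def build_request(path, payload=None, api_key=None):
    """Build a GET request (no payload) or a JSON POST request (with payload)."""
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if api_key:
        # Only needed when the server was started with an API key configured.
        headers["Authorization"] = f"Bearer {api_key}"
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(
        SERVER_ROOT + path, data=data, headers=headers, method=method
    )


def call(path, payload=None, api_key=None, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_request(path, payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return json.loads(body) if body else None
```

With a server running, `call("/v1/models")` would return the model list, and `call("/v1/completions", {"model": "<model id>", "prompt": "Hello", "max_tokens": 16})` would request a completion. Both assume the OpenAI-style request and response shapes.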

Code Examples

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": false
  }'
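
The same chat request can be sketched in Python with the standard library. The function names `chat_payload`, `extract_reply`, and `chat` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the OpenAI chat completions format that vLLM mirrors.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # base URL from this page


def chat_payload(user_text,
                 model="meta-llama/Llama-2-7b-chat-hf",
                 system_text="You are a helpful assistant."):
    """Build the same JSON body as the curl example, with a custom user message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.7,
        "max_tokens": 150,
        "stream": False,
    }


def extract_reply(response):
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(user_text, api_key=None, timeout=60):
    """POST the payload to /chat/completions (needs a running server)."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(user_text)).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

Keeping payload construction and reply extraction separate from the network call makes both easy to check without a GPU server running.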

Use vLLM from Claude / Cursor / ChatGPT

vLLM is self-hosted: it runs on a host you operate (default http://localhost:8000/v1). A hosted MCP gateway cannot reach localhost on your machine, so the usual one-click setup does not apply. An MCP server for vLLM would expose these tools:

vllm_generate_completion: Generate text completions from a prompt, with customizable parameters such as temperature and max tokens
vllm_chat: Hold multi-turn conversations through the chat completions API, with support for system prompts and conversation history
vllm_get_embeddings: Generate vector embeddings for semantic search, similarity matching, and RAG applications using vLLM embedding models
vllm_list_models: List the models loaded in the vLLM server so a client can select an appropriate one for a task
vllm_monitor_metrics: Fetch performance metrics such as request throughput, KV cache usage, and latency statistics for monitoring and tuning
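
As a rough sketch of how a self-built MCP server might route the tools listed above to vLLM endpoints: the mapping below and the names `TOOL_ROUTES`, `resolve_tool`, and `parse_metrics` are illustrative assumptions, not part of vLLM or of any published MCP package. The metrics parser handles only the simple case of Prometheus text lines without trailing timestamps.

```python
SERVER_ROOT = "http://localhost:8000"  # assumption: default host and port

# Hypothetical mapping from tool name to (HTTP method, path from server root).
TOOL_ROUTES = {
    "vllm_generate_completion": ("POST", "/v1/completions"),
    "vllm_chat": ("POST", "/v1/chat/completions"),
    "vllm_get_embeddings": ("POST", "/v1/embeddings"),
    "vllm_list_models": ("GET", "/v1/models"),
    "vllm_monitor_metrics": ("GET", "/metrics"),
}


def resolve_tool(name, server_root=SERVER_ROOT):
    """Return (method, url) for a tool name, or raise ValueError if unknown."""
    try:
        method, path = TOOL_ROUTES[name]
    except KeyError:
        raise ValueError(f"unknown tool: {name}") from None
    return method, server_root.rstrip("/") + path


def parse_metrics(text):
    """Parse Prometheus text output into {"name{labels}": float_value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Lines are
    assumed to end with the sample value, with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # skip lines that do not end in a number
    return samples
```

A tool handler would call `resolve_tool`, send the request to the returned URL, and, for `vllm_monitor_metrics`, pass the response text through `parse_metrics` before returning it to the MCP client.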

Run a vLLM MCP locally

The local-CLI version of these tools is on the way (npx @meru/rest-mcp --vendor=vllm · bring your own connection string · zero secrets sent to us). For now, use the patterns below in your own MCP server, or self-host one from the IOX templates.

