AI & Machine Learning API Key

vLLM REST API

High-throughput LLM inference and serving engine

vLLM is a library for LLM inference and serving, optimized for high throughput and memory efficiency. It provides OpenAI-compatible REST API endpoints for running large language models, with features such as continuous batching, PagedAttention, and tensor parallelism. Developers use vLLM to serve LLMs in production with low latency and high GPU utilization.

Base URL http://localhost:8000/v1

API Endpoints

Method	Endpoint	Description
POST	`/completions`	Generate text completions from a prompt using the loaded model
POST	`/chat/completions`	Generate chat-based completions using the conversational format with role-based messages
GET	`/models`	List all available models currently loaded in the vLLM server
POST	`/embeddings`	Generate vector embeddings for input text using embedding models
POST	`/tokenize`	Tokenize input text and return token IDs and a token count for the loaded model (served at the server root, not under `/v1`)
POST	`/detokenize`	Convert token IDs back into readable text using the model's tokenizer (server root)
GET	`/health`	Check whether the vLLM server is up and healthy (server root)
GET	`/version`	Get the version of the running vLLM server (server root)
GET	`/metrics`	Retrieve Prometheus-compatible metrics for monitoring server performance (server root)
POST	`/score`	Score text pairs, for example a query against candidate documents, using a cross-encoder or embedding model
GET	`/openapi.json`	Get the OpenAPI specification for the endpoints the server exposes (server root)
POST	`/generate`	Low-level generation endpoint with fine-grained control over sampling parameters (part of vLLM's basic demo API server; may not be present on the OpenAI-compatible server)
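The paths above can be turned into full URLs with a small helper. This is a minimal sketch, assuming that only the OpenAI-style paths live under `/v1` while operational paths such as `/health` sit at the server root (worth confirming against `/openapi.json` on your own deployment). `SERVER_ROOT`, `V1_PATHS`, and `build_request` are hypothetical names chosen for illustration; the function only builds a request and does not send it.

```python
import json
import urllib.request

# Assumption: server root of a default local deployment (the base URL minus /v1).
SERVER_ROOT = "http://localhost:8000"

# Assumption: these paths are served under /v1; everything else sits at the root.
V1_PATHS = {"/completions", "/chat/completions", "/models", "/embeddings"}


def build_request(method, path, body=None, api_key=None):
    """Return an unsent urllib Request for a vLLM endpoint (hypothetical helper)."""
    prefix = "/v1" if path in V1_PATHS else ""
    headers = {}
    data = None
    if body is not None:
        # JSON-encode the body and declare its content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        SERVER_ROOT + prefix + path, data=data, headers=headers, method=method
    )


if __name__ == "__main__":
    # Show where two GET endpoints would resolve, without touching the network.
    for req in (build_request("GET", "/health"), build_request("GET", "/models")):
        print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen(req, timeout=10)` would send it to a running server.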

Code Examples

# cURL (the Authorization header is only needed if the server was started with --api-key)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": false
  }'

// JavaScript (fetch)
const response = await fetch('http://localhost:8000/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: JSON.stringify({
    model: 'meta-llama/Llama-2-7b-chat-hf',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Explain quantum computing in simple terms.' }
    ],
    temperature: 0.7,
    max_tokens: 150,
    stream: false
  })
});

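// Surface HTTP errors instead of failing later on a missing `choices` field
if (!response.ok) {
  throw new Error(`vLLM request failed: ${response.status}`);
}
const data = await response.json();
console.log(data.choices[0].message.content);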

# Python (requests)
import requests

url = 'http://localhost:8000/v1/chat/completions'
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
}
payload = {
    'model': 'meta-llama/Llama-2-7b-chat-hf',
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain quantum computing in simple terms.'}
    ],
    'temperature': 0.7,
    'max_tokens': 150,
    'stream': False
}

response = requests.post(url, json=payload, headers=headers, timeout=60)
response.raise_for_status()  # raise on 4xx/5xx instead of failing on a missing key
result = response.json()
print(result['choices'][0]['message']['content'])
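The examples above set `stream` to false. With `stream` set to true, OpenAI-compatible servers send the reply as server-sent events: lines of the form `data: {json}` carrying incremental `delta` fragments, ending with `data: [DONE]`. The following is a minimal sketch of reading that format; `iter_stream_text` is a hypothetical helper name, and the sample lines are hand-written for illustration rather than captured from a real server.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines shaped like a streamed chat completion.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Quantum "}}]}',
    b'data: {"choices": [{"delta": {"content": "bits."}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Quantum bits."
```

With the `requests` example above, the same function could consume `response.iter_lines()` after calling `requests.post(..., stream=True)` with `'stream': True` in the payload.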

Use vLLM from Claude / Cursor / ChatGPT

vLLM is self-hosted: it runs on a host you operate (default http://localhost:8000/v1). A hosted MCP gateway cannot reach localhost on your machine, so the usual one-click setup does not apply. These are the tools an MCP server for vLLM would expose:

vllm_generate_completion: Generate text completions from a prompt using vLLM's inference engine, with customizable parameters such as temperature and max tokens

vllm_chat: Hold multi-turn conversations with LLMs using the chat completions API, with support for system prompts and conversation history

vllm_get_embeddings: Generate vector embeddings for semantic search, similarity matching, and RAG applications using vLLM embedding models

vllm_list_models: Query the models loaded in the vLLM server to determine capabilities and select an appropriate model for a task

vllm_monitor_metrics: Fetch performance metrics, including request throughput, GPU utilization, and latency statistics, for optimization and monitoring
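A tool such as vllm_monitor_metrics would have to turn the Prometheus text returned by the metrics endpoint into structured values. Here is a minimal sketch of that parsing step, assuming the standard Prometheus text exposition format. The function name `parse_metrics` is hypothetical, and the metric names and values in the sample are illustrative, not taken from a real server response.

```python
def parse_metrics(text):
    """Map each sample (metric name plus label string) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:]
        else:
            key, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[key] = float(fields[0])  # ignore any trailing timestamp
    return samples


# Illustrative input only; real metric names and labels depend on the server.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
vllm:num_requests_waiting{model_name="demo"} 0.0
"""

for name, value in parse_metrics(SAMPLE).items():
    print(name, value)
```

This keeps the label string as part of the key; a fuller implementation would likely parse labels into a dictionary or use an existing Prometheus client library.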

Run a vLLM MCP locally

The local-CLI version of these tools is on the way (npx @meru/rest-mcp --vendor=vllm · BYO connection string · zero secrets sent to us). For now use the patterns below in your own MCP server, or self-host one from the IOX templates.
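One pattern for a self-built MCP server is a routing table that maps each tool name listed earlier to an HTTP method and path. The sketch below is an assumption-laden illustration, not the API of any MCP SDK or of the CLI mentioned above: `TOOL_ROUTES` and `plan_call` are hypothetical names, and the metrics path is assumed to sit at the server root rather than under `/v1`.

```python
# Hypothetical routing table: tool name -> (HTTP method, path from the server root).
TOOL_ROUTES = {
    "vllm_generate_completion": ("POST", "/v1/completions"),
    "vllm_chat": ("POST", "/v1/chat/completions"),
    "vllm_get_embeddings": ("POST", "/v1/embeddings"),
    "vllm_list_models": ("GET", "/v1/models"),
    "vllm_monitor_metrics": ("GET", "/metrics"),  # assumed to sit at the server root
}


def plan_call(tool, arguments=None, server_root="http://localhost:8000"):
    """Translate a tool invocation into (method, URL, JSON body or None)."""
    if tool not in TOOL_ROUTES:
        raise ValueError(f"unknown tool: {tool}")
    method, path = TOOL_ROUTES[tool]
    # Only POST tools carry a JSON body; GET tools take no arguments here.
    body = dict(arguments or {}) if method == "POST" else None
    return method, server_root.rstrip("/") + path, body


print(plan_call("vllm_list_models"))
# prints ('GET', 'http://localhost:8000/v1/models', None)
```

Keeping the mapping in data rather than in per-tool functions makes it easy to add or rename tools, and the returned tuple can be handed to whichever HTTP client the server already uses.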

