
vLLM REST API

High-throughput LLM inference and serving engine

vLLM is a fast and easy-to-use library for LLM inference and serving, optimized for high throughput and memory efficiency. It provides OpenAI-compatible REST API endpoints for running large language models with advanced features like continuous batching, paged attention, and tensor parallelism. Developers use vLLM to deploy production-grade LLM applications with minimal latency and maximum GPU utilization.
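Because the server speaks the OpenAI REST dialect, a request can be issued with nothing but the standard library. Below is a minimal sketch of calling the `/completions` endpoint, assuming a vLLM server at the default `localhost:8000` address; the model name is illustrative and must match whatever model your server actually has loaded.

```python
import json
import urllib.request

VLLM_BASE_URL = "http://localhost:8000/v1"  # default vLLM OpenAI-compatible base URL


def build_completion_request(prompt, model, max_tokens=64, temperature=0.7):
    """Build the JSON payload for a POST /completions request."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, model="meta-llama/Llama-2-7b-chat-hf"):
    """Send the request to a running vLLM server and return the first completion."""
    payload = build_completion_request(prompt, model)
    req = urllib.request.Request(
        f"{VLLM_BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# complete("The capital of France is")  # requires a running vLLM server
```

Because the request/response shapes follow the OpenAI spec, the same payload works with the official OpenAI client libraries by pointing their base URL at the vLLM server.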

Base URL: http://localhost:8000/v1

API Endpoints

Method  Endpoint           Description
POST    /completions       Generate text completions from a prompt using the loaded LLM model
POST    /chat/completions  Generate chat-based completions using the conversational format with role-based messages
GET     /models            List all available models currently loaded in the vLLM server
POST    /embeddings        Generate vector embeddings for input text using embedding models
POST    /tokenize          Tokenize input text and return token IDs and counts for the loaded model
POST    /detokenize        Convert token IDs back into readable text using the model's tokenizer
GET     /health            Check the health status of the vLLM server and model availability
GET     /version           Get the current version information of the vLLM server
GET     /metrics           Retrieve Prometheus-compatible metrics for monitoring server performance
POST    /v1/score          Score or rank multiple candidate continuations for a given prompt
GET     /openapi.json      Get the OpenAPI specification for all available vLLM endpoints
POST    /generate          Low-level generation endpoint with fine-grained control over sampling parameters
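The `/tokenize` and `/detokenize` endpoints listed above make a natural round trip: text in, token IDs out, text back. A stdlib sketch follows; the payload shapes shown here are assumptions based on common vLLM deployments, so verify them against your server's `/openapi.json` before relying on them.

```python
import json
import urllib.request

SERVER = "http://localhost:8000"  # tokenize/detokenize are addressed at the server root here


def post_json(path, payload):
    """POST a JSON payload to the vLLM server and return the decoded response."""
    req = urllib.request.Request(
        SERVER + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokenize_payload(text, model):
    """Assumed request shape for /tokenize; check /openapi.json on your server."""
    return {"model": model, "prompt": text}


def detokenize_payload(tokens, model):
    """Assumed request shape for /detokenize; check /openapi.json on your server."""
    return {"model": model, "tokens": tokens}


# Round trip against a running server (model name is illustrative):
# tokens = post_json("/tokenize", tokenize_payload("Hello, world!", "meta-llama/Llama-2-7b-chat-hf"))
# text = post_json("/detokenize", detokenize_payload(tokens["tokens"], "meta-llama/Llama-2-7b-chat-hf"))
```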

Code Examples

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": false
  }'
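The curl request above sets "stream": false, so the full response arrives in one JSON body. With "stream": true, the server instead emits Server-Sent Events, one "data: {...}" line per token delta, terminated by "data: [DONE]". A sketch of parsing that stream follows; the endpoint URL and model name are the same assumptions as above.

```python
import json
import urllib.request


def parse_sse_chunk(line):
    """Parse one Server-Sent Events line from a streaming chat completion.

    Returns the decoded JSON event, the sentinel string "[DONE]" at end of
    stream, or None for blank/keep-alive lines.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return "[DONE]"
    return json.loads(data)


def extract_delta_text(event):
    """Pull the incremental text out of one streamed chat-completion event."""
    return event["choices"][0]["delta"].get("content", "")


def stream_chat(messages, model="meta-llama/Llama-2-7b-chat-hf"):
    """Yield text chunks from a streaming chat completion (needs a running server)."""
    payload = {"model": model, "messages": messages, "stream": True}
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            event = parse_sse_chunk(raw.decode("utf-8"))
            if event == "[DONE]":
                break
            if event is not None:
                yield extract_delta_text(event)


# for chunk in stream_chat([{"role": "user", "content": "Hi"}]):
#     print(chunk, end="", flush=True)
```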

Connect vLLM to AI

Deploy a vLLM MCP server on IOX Cloud and connect it to Claude, ChatGPT, Cursor, or any AI client. Your AI assistant gets direct access to vLLM through these tools:

vllm_generate_completion: Generate text completions from a prompt using vLLM's inference engine, with customizable parameters such as temperature and max tokens
vllm_chat: Hold multi-turn conversations with LLMs through the chat completions API, with support for system prompts and conversation history
vllm_get_embeddings: Generate vector embeddings for semantic search, similarity matching, and RAG applications using vLLM embedding models
vllm_list_models: Query the models loaded in the vLLM server to determine capabilities and select the appropriate model for a task
vllm_monitor_metrics: Fetch performance metrics, including request throughput, GPU utilization, and latency statistics, for monitoring and optimization
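Once a tool like vllm_get_embeddings returns vectors, the semantic-search step it mentions is plain vector math on the client side. A minimal sketch with toy two-dimensional vectors (real embedding models return hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by similarity to the query, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.9, 0.1]]
ranking = rank_by_similarity(query, docs)  # doc 1 ranks first: nearly parallel to the query
```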

Deploy in 60 seconds

Describe what you need, AI generates the code, and IOX deploys it globally.

Deploy vLLM MCP Server →

Related APIs