AI & Machine Learning
API Key
vLLM REST API
High-throughput LLM inference and serving engine
vLLM is a fast, easy-to-use library for LLM inference and serving, optimized for high throughput and memory efficiency. It exposes OpenAI-compatible REST API endpoints for running large language models, with features such as continuous batching, PagedAttention, and tensor parallelism. Developers use vLLM to deploy production-grade LLM applications with low latency and high GPU utilization.
Base URL
http://localhost:8000/v1
Note: the OpenAI-compatible endpoints live under /v1; utility endpoints such as /health, /version, /metrics, /tokenize, and /detokenize are served at the server root (http://localhost:8000).
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /completions | Generate text completions from a prompt using the loaded model |
| POST | /chat/completions | Generate chat-based completions using the conversational format with role-based messages |
| GET | /models | List all available models currently loaded in the vLLM server |
| POST | /embeddings | Generate vector embeddings for input text using embedding models |
| POST | /tokenize | Tokenize input text and return token IDs and counts for the loaded model |
| POST | /detokenize | Convert token IDs back into readable text using the model's tokenizer |
| GET | /health | Check the health status of the vLLM server and model availability |
| GET | /version | Get the current version information of the vLLM server |
| GET | /metrics | Retrieve Prometheus-compatible metrics for monitoring server performance |
| POST | /score | Score or rank multiple candidate continuations for a given prompt |
| GET | /openapi.json | Get the OpenAPI specification for all available vLLM endpoints |
| POST | /generate | Low-level generation endpoint with fine-grained control over sampling parameters |
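As a quick connectivity check before wiring up any of the endpoints above, the /models listing can be fetched with a few lines of standard-library Python. This is a sketch assuming the default local server; the API key is a placeholder and is only required if the server was started with an API key configured:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default vLLM OpenAI-compatible base URL
API_KEY = "YOUR_API_KEY"               # placeholder; replace with your key if auth is enabled

def get(path: str) -> dict:
    """Issue an authenticated GET against the server and decode the JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server, list the loaded model IDs:
# models = get("/models")
# print([m["id"] for m in models["data"]])
```

The same helper works for any GET endpoint in the table, e.g. `get("/models")`.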
Code Examples
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"temperature": 0.7,
"max_tokens": 150,
"stream": false
}'
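The same chat request can be issued from Python. This sketch builds the identical JSON payload and POSTs it with the standard library; the model name and API key are the same placeholders as in the curl example:

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)

# With a running server, send the request and print the assistant's reply:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the server is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at the server.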
Connect vLLM to AI
Deploy a vLLM MCP server on IOX Cloud and connect it to Claude, ChatGPT, Cursor, or any AI client. Your AI assistant gets direct access to vLLM through these tools:
vllm_generate_completion
Generate text completions from a prompt using vLLM's inference engine with customizable parameters like temperature and max tokens
vllm_chat
Have multi-turn conversations with LLMs using the chat completions API, supporting system prompts and conversation history
vllm_get_embeddings
Generate vector embeddings for semantic search, similarity matching, and RAG applications using vLLM embedding models
vllm_list_models
Query available models loaded in the vLLM server to determine capabilities and select appropriate models for tasks
vllm_monitor_metrics
Fetch performance metrics including request throughput, GPU utilization, and latency statistics for optimization and monitoring
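Once vllm_get_embeddings returns vectors, ranking documents by cosine similarity is a small client-side computation. A minimal sketch with illustrative low-dimensional vectors (real embeddings from the /embeddings endpoint are much higher-dimensional):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Illustrative 3-d vectors standing in for real embeddings.
query = [0.1, 0.9, 0.2]
docs = {
    "doc_a": [0.1, 0.8, 0.3],  # points in nearly the same direction as the query
    "doc_b": [0.9, 0.1, 0.0],  # points in a very different direction
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
```

This is the core ranking step in a typical RAG retrieval loop: embed the query, embed the candidate passages, and keep the top-k by cosine similarity.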
Deploy in 60 seconds
Describe what you need, AI generates the code, and IOX deploys it globally.
Deploy vLLM MCP Server →