Why This Changes Everything
No More API Bills
Cloud APIs typically charge $0.01 or more per 1K tokens. Run unlimited queries on your Mac with no per-token fees. Your data never leaves your machine.
Serve Multiple Users
Continuous batching lets your Mac handle 5+ concurrent users. Perfect for team demos, internal tools, or local development.
Vision Models Too
Not just text. Run Qwen-VL, LLaVA, and Gemma 3 for image understanding. Our caching makes follow-up questions 28x faster.
Drop-in Replacement
OpenAI-compatible API. Change one line of code to switch from cloud to local. Works with LangChain, LlamaIndex, and any OpenAI SDK.
The Cloud AI Problem
Running AI in the cloud is expensive. A single GPT-4 Vision call costs ~$0.01. Sounds cheap until you're processing thousands of images or building an app with real users. Suddenly you're looking at $500+ monthly bills.
Plus, every request means sending your data to someone else's servers. For sensitive documents, medical images, or proprietary data, that's a non-starter.
Cloud APIs
- × Pay per token (adds up fast)
- × Data leaves your network
- × Rate limits and downtime
- × Latency to remote servers
vLLM-MLX on Your Mac
- ✓ Unlimited queries, zero cost
- ✓ 100% private, data stays local
- ✓ Always available, no limits
- ✓ Sub-second local latency
Under the Hood
Two core innovations make vLLM-MLX different: Continuous Batching for handling multiple users, and Content-Based Caching for faster vision model responses.
Content-Based Prefix Caching for MLLMs
Multimodal Large Language Models (MLLMs) like Qwen3-VL, Gemma 3, and LLaVA process images through a vision encoder before the language model can reason about them. This vision encoding step is computationally expensive, often taking 1.5 to 2 seconds per image on consumer hardware.
Our key insight: identical images can arrive through different delivery mechanisms (URLs, base64 encoding, file paths), but the pixel content remains the same. By computing a SHA-256 hash of the decoded pixel values, we can identify semantically identical images regardless of their delivery format.
Cache Key
SHA256(image) + prompt
Cached Value
embeddings + kv_cache
Same image content = same hash = cache hit (regardless of URL, base64, or file path)
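To make the content-hashing idea concrete, here is a minimal sketch using only the Python standard library. It is not the vLLM-MLX implementation. The helper names (`content_hash`, `pixels_from_base64`, `pixels_from_file`) are hypothetical, raw RGB bytes stand in for the output of a real image decoder, and mixing the image dimensions into the hash is an assumption of this sketch rather than something the description above specifies.

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def content_hash(pixels: bytes, width: int, height: int) -> str:
    """Hash the decoded pixels (plus dimensions), never the transport wrapper."""
    header = f"{width}x{height}:".encode("ascii")  # assumption: dimensions are part of the key
    return hashlib.sha256(header + pixels).hexdigest()


def pixels_from_base64(payload: str) -> bytes:
    # Hypothetical decoder: the payload is raw RGB bytes, base64-encoded.
    return base64.b64decode(payload)


def pixels_from_file(path: str) -> bytes:
    # Hypothetical decoder: the file holds the same raw RGB bytes.
    return Path(path).read_bytes()


# A tiny 2x2 RGB "image": 4 pixels x 3 channels = 12 bytes.
WIDTH, HEIGHT = 2, 2
raw_pixels = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 255])

# Delivery 1: a base64 string, as it might arrive in a request body.
as_base64 = base64.b64encode(raw_pixels).decode("ascii")

# Delivery 2: a file on disk.
with tempfile.NamedTemporaryFile(suffix=".rgb", delete=False) as handle:
    handle.write(raw_pixels)
    file_path = handle.name

hash_from_base64 = content_hash(pixels_from_base64(as_base64), WIDTH, HEIGHT)
hash_from_file = content_hash(pixels_from_file(file_path), WIDTH, HEIGHT)
Path(file_path).unlink()  # clean up the temporary file

print(hash_from_base64 == hash_from_file)  # True
```

Both delivery paths end in the same bytes, so they produce the same key. A real pipeline would decode PNG or JPEG data with an image library before hashing.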
Apple Silicon Unified Memory Architecture
Apple Silicon's unified memory model enables zero-copy cache management. Unlike discrete GPU systems where tensors must be explicitly transferred between CPU and GPU memory pools, MLX arrays exist in a single address space accessible by both compute units. This eliminates transfer overhead for cached embeddings.
System Architecture
- vLLM-Inspired API: OpenAI-compatible interface with continuous batching
- MLXPlatform: Apple Silicon backend
- mlx-lm: LLM inference
- mlx-vlm: Vision + LLM
- mlx-audio: TTS + STT
- MLX Framework: Apple ML framework with Metal GPU kernels
Cache-Aware Generation Algorithm
def generate_with_cache(image, prompt):
    # Compute content hash regardless of delivery format
    image_hash = sha256(decode_pixels(image))
    cache_key = (image_hash, prompt_prefix)

    if cache_key in prefix_cache:
        # Cache HIT → skip vision encoder
        embeddings, kv = prefix_cache[cache_key]
        return llm.generate(prompt, cached_kv=kv)

    else:
        # Cache MISS → encode, generate, store
        embeddings = vision_encoder(image)
        response, kv = llm.generate(prompt, embeddings)
        prefix_cache[cache_key] = (embeddings, kv)
        return response
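The pseudocode above leaves the encoder, the model, and `prompt_prefix` undefined. The sketch below is one runnable way to fill those gaps, not the project's actual code. `vision_encoder`, `prefill`, and `decode` are stubs invented for illustration, and treating the prompt prefix as a shared system prompt is an assumption. The point is only to show that a second question about the same image never reaches the encoder.

```python
import hashlib

prefix_cache = {}
stats = {"encoder_calls": 0}


def vision_encoder(pixels: bytes) -> list:
    """Stub for the expensive step: turns pixels into fake embeddings."""
    stats["encoder_calls"] += 1
    return [value / 255 for value in pixels]


def prefill(prompt_prefix: str, embeddings: list) -> dict:
    """Stub for building the KV cache over the shared prefix and the image."""
    return {"prefix": prompt_prefix, "image_tokens": len(embeddings)}


def decode(question: str, kv: dict) -> str:
    """Stub for generation that continues from an existing KV cache."""
    return f"answer to {question!r} (reused {kv['image_tokens']} image tokens)"


def generate_with_cache(pixels: bytes, prompt_prefix: str, question: str) -> str:
    cache_key = (hashlib.sha256(pixels).hexdigest(), prompt_prefix)
    if cache_key not in prefix_cache:
        # Cache miss: pay for the encoder and prefill once, then store both.
        embeddings = vision_encoder(pixels)
        prefix_cache[cache_key] = (embeddings, prefill(prompt_prefix, embeddings))
    # Hit or freshly filled: generation starts from the stored KV cache.
    _, kv = prefix_cache[cache_key]
    return decode(question, kv)


image = bytes(range(12))  # stand-in for decoded pixel data
system = "You are a helpful assistant."
print(generate_with_cache(image, system, "What is in the image?"))
print(generate_with_cache(image, system, "What color is it?"))
print(stats["encoder_calls"])  # 1: the second question skipped the encoder
```

Keying on the prefix rather than the full prompt is what lets different follow-up questions share one cache entry.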
Real Performance Numbers
No marketing fluff. Here's what you actually get on an M4 Max with 128GB RAM. Even older M1/M2 machines deliver impressive results.
Text Model Throughput Comparison
All models use 4-bit quantization. Bold indicates best throughput per row. Speedup = vLLM-MLX / llama.cpp.
| Model | vLLM-MLX | vllm-metal | mlx-lm | llama.cpp | Speedup |
|---|---|---|---|---|---|
| Qwen3 Family | | | | | |
| Qwen3-0.6B | **525.5** | 365.8 | 356.2 | 281.5 | 1.87x |
| Qwen3-4B | **159.0** | 137.3 | 128.9 | 118.2 | 1.35x |
| Qwen3-8B | **93.3** | 87.1 | 79.9 | 76.9 | 1.21x |
| Qwen3-30B-A3B | 109.7 | **110.3** | 107.4 | 89.9 | 1.22x |
| Llama 3.2 Family | | | | | |
| Llama-3.2-1B | **461.9** | 350.9 | 347.1 | 331.3 | 1.39x |
| Llama-3.2-3B | **203.6** | 174.3 | 167.5 | 155.8 | 1.31x |
| Other Architectures | | | | | |
| Gemma 3-4B | **152.5** | 117.0 | 105.4 | 123.2 | 1.24x |
| Nemotron-30B-A3B | **121.8** | -- | 101.6 | 85.1 | 1.43x |
Measured on M4 Max (128GB). Values in tokens/second.
Concurrency Scaling
How throughput scales with multiple concurrent users. Sequential processing (llama.cpp) stays flat, while continuous batching in vLLM-MLX scales efficiently.
Aggregate Throughput (Qwen3-0.6B)
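For readers who want intuition for why batching scales while sequential serving stays flat, here is a toy scheduler. It is a sketch under an idealised assumption, namely that one decoding step costs the same whether it serves one request or a full batch. The function names and the `max_batch` parameter are made up for this example and say nothing about how vLLM-MLX schedules work internally.

```python
from collections import deque


def sequential_steps(token_counts):
    """One request at a time: every generated token costs one decoding step."""
    return sum(token_counts)


def continuous_batching_steps(token_counts, max_batch=4):
    """Each step emits one token per active request. Slots freed by finished
    requests are refilled from the queue before the next step."""
    waiting = deque(token_counts)
    active = []
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decoding step: every active request produces one token.
        active = [remaining - 1 for remaining in active]
        # Finished requests leave the batch immediately.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


requests = [8, 3, 5, 2, 7, 4]  # tokens each request still needs
print(sequential_steps(requests))           # 29
print(continuous_batching_steps(requests))  # 9
```

The same 29 tokens finish in 9 steps instead of 29 because short requests hand their slot to the next one as soon as they complete.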
Image Caching Performance
Response time when asking multiple questions about the same image
| Question # | Without Cache | With Cache | Improvement |
|---|---|---|---|
| 1st (new image) | 21.7 seconds | 21.7 seconds | Same |
| 2nd | 21.7 seconds | 1.15 seconds | 19x faster |
| 3rd | 21.7 seconds | 0.79 seconds | 28x faster |
| 4th+ | 21.7 seconds | 0.78 seconds | 28x faster |
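The per-question gains add up over a whole conversation. As a quick worked example using only the timings from the table above, this snippet totals a four-question session with and without the cache. The variable names are illustrative.

```python
UNCACHED_SECONDS = 21.7
CACHED_SECONDS = [21.7, 1.15, 0.79, 0.78]  # 1st, 2nd, 3rd, 4th question

# Without the cache, every question pays the full cost.
total_without = UNCACHED_SECONDS * len(CACHED_SECONDS)
# With the cache, only the first question does.
total_with = sum(CACHED_SECONDS)
session_speedup = total_without / total_with

print(f"without cache: {total_without:.1f} s")   # without cache: 86.8 s
print(f"with cache:    {total_with:.2f} s")      # with cache:    24.42 s
print(f"session speedup: {session_speedup:.2f}x")  # session speedup: 3.55x
```

The first question dominates a short session, so the overall gain keeps growing as more follow-up questions are asked about the same image.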
Image Understanding Performance
Qwen3-VL-8B-Instruct-4bit performance across different image resolutions
| Resolution | Pixels | Time | Tokens | Speed |
|---|---|---|---|---|
| 224x224 | 50K | 1.04s | 78 | 74.8 tok/s |
| 448x448 | 201K | 1.45s | 70 | 48.1 tok/s |
| 768x768 | 590K | 2.05s | 91 | 44.3 tok/s |
| 1024x1024 | 1.0M | 2.79s | 76 | 27.2 tok/s |
| 1280x720 | 922K | 2.97s | 96 | 32.4 tok/s |
| 1920x1080 | 2.1M | 6.30s | 89 | 14.1 tok/s |
Average: 40.2 tok/s across the six resolutions listed. Fastest at 224x224, slowest at 1920x1080.
Video Understanding Performance
Qwen3-VL-8B-Instruct-4bit performance across different frame counts
| Configuration | Frames | Time | Speed | Memory |
|---|---|---|---|---|
| 2 @ 0.5fps | 2 | 4.48s | 57.1 tok/s | 6.4 GB |
| 4 @ 1fps | 4 | 4.65s | 55.0 tok/s | 6.4 GB |
| 8 @ 2fps | 8 | 6.45s | 37.2 tok/s | 6.8 GB |
| 16 @ 2fps | 16 | 10.96s | 23.4 tok/s | 7.6 GB |
| 32 @ 4fps | 32 | 20.00s | 12.8 tok/s | 9.2 GB |
| 64 @ 8fps | 64 | 59.81s | 4.3 tok/s | 12.9 GB |
Memory scales from 6.4 GB (2 frames) to 12.9 GB (64 frames). 96+ frames may cause GPU timeout.
Feature Comparison
How vLLM-MLX compares to other tools for running AI on Mac
| Feature | mlx-lm | llama.cpp | vllm-metal | vLLM-MLX |
|---|---|---|---|---|
| Native Apple GPU | Yes | Yes | Yes | Yes |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Continuous batching | No | No | Yes | Yes |
| Vision models (MLLM) | No | Yes | Yes | Yes |
| Content-based prefix caching | No | No | No | Yes |
What You Can Build
Private AI Assistant
Run ChatGPT-like conversations on your Mac without sending data to the cloud. Great for sensitive documents and privacy.
Image Analysis Tools
Build apps that understand photos, analyze charts, read documents, or describe scenes for accessibility.
Voice Applications
Text-to-speech in 10+ languages and speech-to-text transcription, all running locally.
AI-Powered Automation
Connect AI to external tools via the Model Context Protocol (MCP) for agentic workflows and automation.
Development & Testing
Test AI integrations locally before deploying to production, without API costs.
Research & Experimentation
Experiment with different AI models, fine-tuning, and prompt engineering on your own hardware.
Getting Started
You'll need a Mac with Apple Silicon (M1, M2, M3, or M4 chip) and Python installed.
Clone the Repository
$ git clone https://github.com/waybarrios/vllm-mlx.git
$ cd vllm-mlx
Install Dependencies
$ pip install -e .
Start the AI Server
$ vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
Start Chatting
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Running locally
)

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
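Image questions can be sent the same way. The sketch below builds a request body in the OpenAI chat format with an inline base64 image and posts it using only the standard library. It assumes the server accepts OpenAI-style `image_url` content parts and was started with a vision-capable model rather than the Llama text model above. `build_vision_request` and `send` are helper names invented for this example, and `photo.png` is a placeholder path.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes: bytes, question: str, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat completion body with one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{encoded}"},
                    },
                ],
            }
        ],
    }


def send(body: dict, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the body to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Usage (needs a running server with a vision model loaded):
#   with open("photo.png", "rb") as f:
#       body = build_vision_request(f.read(), "What is in this image?")
#   print(send(body))
```

Sending the same image again with a different question is the case the content-based cache is designed to speed up.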
The Bottom Line
Your MacBook isn't just a laptop anymore. It's a private AI server that can:
Keep Data Private
Nothing leaves your machine
Serve Your Team
Multiple concurrent users
Zero API Costs
Unlimited queries, free
Whether you're building a startup product, running internal tools, or just experimenting with AI, vLLM-MLX gives you production-grade inference without the cloud bills or privacy concerns.
Ready to Run AI Locally?
Open source, free forever. Get started in 5 minutes. No GPU rental, no API keys, no monthly bills.
Works on any Apple Silicon Mac (M1, M2, M3, M4)