Research Paper | MLX | Apple Silicon

Your Mac is Now an AI Server

Stop paying for cloud APIs. Run production-grade LLMs and vision models locally with the same performance you'd get from expensive GPU servers.

Wayner Barrios | January 2026 | 10 min read
  • 5+ Concurrent Users
  • 464 Tokens/Second
  • 200+ Models Supported
  • $0 API Cost

Why This Changes Everything

No More API Bills

Cloud APIs typically charge $0.01 or more per 1K tokens. Run unlimited queries on your Mac with no per-token fees. Your data never leaves your machine.

Serve Multiple Users

Continuous batching lets your Mac handle 5+ concurrent users. Perfect for team demos, internal tools, or local development.

Vision Models Too

Not just text. Run Qwen-VL, LLaVA, and Gemma 3 for image understanding. Our caching makes follow-up questions up to 28x faster.

Drop-in Replacement

OpenAI-compatible API. Change one line of code to switch from cloud to local. Works with LangChain, LlamaIndex, and any OpenAI SDK.

The Cloud AI Problem

Running AI in the cloud is expensive. A single GPT-4 Vision call costs ~$0.01. Sounds cheap until you're processing thousands of images or building an app with real users. Suddenly you're looking at $500+ monthly bills.
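To make that arithmetic concrete, here is a minimal sketch of the monthly bill at a flat per-call price. The ~$0.01 figure comes from the paragraph above; the call volumes and the function name are illustrative assumptions, and real pricing varies by model and image size.

```python
def monthly_api_cost(calls_per_month: int, price_per_call: float = 0.01) -> float:
    """Estimated monthly bill in dollars at a flat per-call price."""
    return calls_per_month * price_per_call


# Illustrative volumes only; plug in your own traffic.
for calls in (1_000, 10_000, 50_000):
    print(f"{calls:>6} calls/month -> ${monthly_api_cost(calls):,.2f}")
```

At this price, the $500 mark is reached at 50,000 calls a month, or roughly 1,700 a day.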

Plus, every request means sending your data to someone else's servers. For sensitive documents, medical images, or proprietary data, that's a non-starter.

Cloud APIs

  • Pay per token (adds up fast)
  • Data leaves your network
  • Rate limits and downtime
  • Latency to remote servers

vLLM-MLX on Your Mac

  • Unlimited queries, zero cost
  • 100% private, data stays local
  • Always available, no limits
  • Sub-second local latency

Under the Hood

Two core innovations make vLLM-MLX different: Continuous Batching for handling multiple users, and Content-Based Caching for faster vision model responses.

Content-Based Prefix Caching for MLLMs

Multimodal Large Language Models (MLLMs) like Qwen3-VL, Gemma 3, and LLaVA process images through a vision encoder before the language model can reason about them. This vision encoding step is computationally expensive, often taking 1.5 to 2 seconds per image on consumer hardware.

Our key insight: identical images can arrive through different delivery mechanisms (URLs, base64 encoding, file paths), but the pixel content remains the same. By computing a SHA-256 hash of the decoded pixel values, we can recognize images with identical pixel content regardless of their delivery format.

Cache Key: SHA256(image) + prompt

Cached Value: embeddings + kv_cache

Same image content = same hash = cache hit (regardless of URL, base64, or file path)
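The idea can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the vLLM-MLX implementation: `load_pixels` and `content_hash` are hypothetical helpers, and the "image" is a raw RGB buffer, so the decoding step a real pipeline needs for PNG or JPEG input is skipped.

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def load_pixels(source: str) -> bytes:
    """Return the pixel buffer from a base64 data URI or a file path.

    Hypothetical helper: a real pipeline would decode PNG/JPEG data
    into pixels here before hashing.
    """
    if source.startswith("data:"):
        payload = source.split(",", 1)[1]
        return base64.b64decode(payload)
    return Path(source).read_bytes()


def content_hash(source: str) -> str:
    """SHA-256 over the pixel bytes, independent of how they arrived."""
    return hashlib.sha256(load_pixels(source)).hexdigest()


# A 2x2 RGB image as raw bytes (12 bytes), delivered two different ways.
pixels = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 255])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "image.rgb"
    path.write_bytes(pixels)
    data_uri = ("data:application/octet-stream;base64,"
                + base64.b64encode(pixels).decode("ascii"))

    hash_from_file = content_hash(str(path))
    hash_from_base64 = content_hash(data_uri)

print(hash_from_file == hash_from_base64)  # True: same pixels, same key
```

Hashing the decoded pixels rather than the URL or the encoded file is what lets a re-uploaded or re-encoded copy of the same picture hit the cache.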

Apple Silicon Unified Memory Architecture

Apple Silicon's unified memory model enables zero-copy cache management. Unlike discrete GPU systems where tensors must be explicitly transferred between CPU and GPU memory pools, MLX arrays exist in a single address space accessible by both compute units. This eliminates transfer overhead for cached embeddings.

System Architecture

The stack, from top to bottom:

  • vLLM-Inspired API: OpenAI-compatible interface with continuous batching
  • MLXPlatform: Apple Silicon backend
  • Inference libraries: mlx-lm (LLM inference), mlx-vlm (vision + LLM), mlx-audio (TTS + STT)
  • MLX Framework: Apple ML framework with Metal GPU kernels

Cache-Aware Generation Algorithm

cache_generation.py
python
from hashlib import sha256

def generate_with_cache(image, prompt, prompt_prefix=""):
    # decode_pixels, vision_encoder, llm, and prefix_cache are supplied by the
    # serving runtime. prompt_prefix is assumed here to be passed in by the caller.
    # Compute the content hash regardless of delivery format
    image_hash = sha256(decode_pixels(image)).hexdigest()
    cache_key = (image_hash, prompt_prefix)

    if cache_key in prefix_cache:
        # Cache HIT: skip the vision encoder and reuse the stored KV cache
        _embeddings, kv = prefix_cache[cache_key]
        response, _ = llm.generate(prompt, cached_kv=kv)
        return response

    # Cache MISS: encode, generate, store
    embeddings = vision_encoder(image)
    response, kv = llm.generate(prompt, embeddings)
    prefix_cache[cache_key] = (embeddings, kv)
    return response

Real Performance Numbers

No marketing fluff. Here's what you actually get on an M4 Max with 128GB RAM. Even older M1/M2 machines deliver impressive results.

Text Model Throughput Comparison

All models use 4-bit quantization. Bold indicates best throughput per row. Speedup = Ours / llama.cpp.

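| Model | vLLM-MLX | vllm-metal | mlx-lm | llama.cpp | Speedup |
|---|---|---|---|---|---|
| Qwen3 Family | | | | | |
| Qwen3-0.6B | **525.5** | 365.8 | 356.2 | 281.5 | 1.87x |
| Qwen3-4B | **159.0** | 137.3 | 128.9 | 118.2 | 1.35x |
| Qwen3-8B | **93.3** | 87.1 | 79.9 | 76.9 | 1.21x |
| Qwen3-30B-A3B | 109.7 | **110.3** | 107.4 | 89.9 | 1.17x |
| Llama 3.2 Family | | | | | |
| Llama-3.2-1B | **461.9** | 350.9 | 347.1 | 331.3 | 1.39x |
| Llama-3.2-3B | **203.6** | 174.3 | 167.5 | 155.8 | 1.31x |
| Other Architectures | | | | | |
| Gemma 3-4B | **152.5** | 117.0 | 105.4 | 123.2 | 1.24x |
| Nemotron-30B-A3B | **121.8** | -- | 101.6 | 85.1 | 1.43x |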

Measured on M4 Max (128GB). Values in tokens/second.

Concurrency Scaling

How throughput scales with multiple concurrent users. Sequential processing (llama.cpp) stays flat, while continuous batching in vLLM-MLX scales efficiently.

Aggregate Throughput (Qwen3-0.6B)

  • 1 user: 441 tok/s (1.0x)
  • 4 users: 892 tok/s (2.0x)
  • 8 users: 1,156 tok/s (2.6x)
  • 16 users: 1,642 tok/s (3.7x)
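Why batching helps can be seen in a toy scheduler. This sketch is not vLLM-MLX's scheduler; it assumes every decode step takes the same time no matter how many requests share it, which overstates the gain (the measured numbers above show 3.7x at 16 users, not 16x). Function names and request lengths are made up for illustration.

```python
from collections import deque


def sequential_steps(lengths):
    """One request at a time: every generated token costs one decode step."""
    return sum(lengths)


def continuous_batching_steps(lengths, max_batch):
    """Decode steps needed when finished requests free their slot immediately."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before each step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step yields one token for every active request;
        # requests that just finished leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
        steps += 1
    return steps


requests = [4, 2, 6, 3]  # tokens to generate per request
print(sequential_steps(requests))               # 15
print(continuous_batching_steps(requests, 2))   # 8
```

The key property is that a short request leaving the batch makes room for the next one right away, instead of the whole batch waiting for its longest member.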

Image Caching Performance

Response time when asking multiple questions about the same image

| Question # | Without Cache | With Cache | Improvement |
|---|---|---|---|
| 1st (new image) | 21.7 seconds | 21.7 seconds | Same |
| 2nd | 21.7 seconds | 1.15 seconds | 19x faster |
| 3rd | 21.7 seconds | 0.79 seconds | 28x faster |
| 4th+ | 21.7 seconds | 0.78 seconds | 28x faster |
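The per-question ratios and the total time saved follow directly from these timings. A short worked calculation, using only the numbers listed above (ratios computed from the rounded timings can differ slightly from the rounded "x faster" figures):

```python
cold_s = 21.7                  # first question: full vision encoding
cached_s = [1.15, 0.79, 0.78]  # questions 2, 3, and 4 with the cache

speedups = [cold_s / t for t in cached_s]
total_without_cache = cold_s * (1 + len(cached_s))
total_with_cache = cold_s + sum(cached_s)

print([f"{s:.1f}x" for s in speedups])
print(f"4 questions: {total_without_cache:.2f} s without cache, "
      f"{total_with_cache:.2f} s with cache")
```

Over a four-question conversation about one image, total waiting time drops from about 87 seconds to about 24.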

Image Understanding Performance

Qwen3-VL-8B-Instruct-4bit performance across different image resolutions

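| Resolution | Pixels | Time | Tokens | Speed |
|---|---|---|---|---|
| 224x224 | 50K | 1.04s | 78 | 74.8 tok/s |
| 448x448 | 201K | 1.45s | 70 | 48.1 tok/s |
| 768x768 | 590K | 2.05s | 91 | 44.3 tok/s |
| 1024x1024 | 1.0M | 2.79s | 76 | 27.2 tok/s |
| 1280x720 | 922K | 2.97s | 96 | 32.4 tok/s |
| 1920x1080 | 2.1M | 6.30s | 89 | 14.1 tok/s |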

Average: 45.2 tok/s across all resolutions. Fastest at 224x224, slowest at 1920x1080.
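An image question like the ones benchmarked here could be sent as an OpenAI-style chat message with an `image_url` content part. The sketch below only builds the request body. It assumes the server accepts that message format with a base64 data URI; the helper name is hypothetical, the `"default"` model name is borrowed from the chat example in Getting Started, and the bytes are a placeholder rather than a real PNG.

```python
import base64
import json


def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "default",
                         mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat body with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }


# Placeholder bytes; in practice read them from an image file.
payload = build_vision_request(b"placeholder image bytes", "What is in this image?")
print(json.dumps(payload, indent=2)[:200])
```

Passing the same dictionary's `messages` list to `client.chat.completions.create` would be the SDK equivalent.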

Video Understanding Performance

Qwen3-VL-8B-Instruct-4bit performance across different frame counts

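| Configuration | Frames | Time | Speed | Memory |
|---|---|---|---|---|
| 2 @ 0.5fps | 2 | 4.48s | 57.1 tok/s | 6.4 GB |
| 4 @ 1fps | 4 | 4.65s | 55.0 tok/s | 6.4 GB |
| 8 @ 2fps | 8 | 6.45s | 37.2 tok/s | 6.8 GB |
| 16 @ 2fps | 16 | 10.96s | 23.4 tok/s | 7.6 GB |
| 32 @ 4fps | 32 | 20.00s | 12.8 tok/s | 9.2 GB |
| 64 @ 8fps | 64 | 59.81s | 4.3 tok/s | 12.9 GB |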

Memory scales from 6.4 GB (2 frames) to 12.9 GB (64 frames). 96+ frames may cause GPU timeout.
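For a rough feel of the memory cost per frame, the two endpoints of the table can be joined with a straight line. This is a back-of-the-envelope estimate only: it assumes linear growth, runs a little high in the middle of the measured range, and should not be extrapolated past 64 frames. The function name is illustrative.

```python
# (frames, memory in GB) taken from the table above
measurements = [(2, 6.4), (4, 6.4), (8, 6.8), (16, 7.6), (32, 9.2), (64, 12.9)]

(first_frames, first_gb), (last_frames, last_gb) = measurements[0], measurements[-1]
gb_per_frame = (last_gb - first_gb) / (last_frames - first_frames)


def estimate_memory_gb(frames: int) -> float:
    """Linear interpolation between the smallest and largest measured runs."""
    return first_gb + gb_per_frame * (frames - first_frames)


print(f"about {gb_per_frame * 1000:.0f} MB per extra frame")
print(f"48 frames: roughly {estimate_memory_gb(48):.1f} GB")
```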

Feature Comparison

How vLLM-MLX compares to other tools for running AI on Mac

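| Feature | mlx-lm | llama.cpp | vllm-metal | vLLM-MLX |
|---|---|---|---|---|
| Native Apple GPU | Yes | Yes | Yes | Yes |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Continuous batching | No | No | Yes | Yes |
| Vision models (MLLM) | No | Yes | Yes | Yes |
| Content-based prefix caching | No | No | No | Yes |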

What You Can Build

Private AI Assistant

Run ChatGPT-like conversations on your Mac without sending data to the cloud. Well suited to sensitive documents.

Image Analysis Tools

Build apps that understand photos, analyze charts, read documents, or describe scenes for accessibility.

Voice Applications

Text-to-speech in 10+ languages and speech-to-text transcription, all running locally.

AI-Powered Automation

Connect AI to external tools via the Model Context Protocol (MCP) for agentic workflows and automation.

Development & Testing

Test AI integrations locally before deploying to production, without API costs.

Research & Experimentation

Experiment with different AI models, fine-tuning, and prompt engineering on your own hardware.

Getting Started

You'll need a Mac with Apple Silicon (M1, M2, M3, or M4 chip) and Python installed.

1. Clone the Repository

Terminal

$ git clone https://github.com/waybarrios/vllm-mlx.git
$ cd vllm-mlx

2. Install Dependencies

Terminal

$ pip install -e .

3. Start the AI Server

Terminal

$ vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

4. Start Chatting

chat_example.py

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Running locally
)

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
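If you would rather not install the OpenAI SDK, the same call can be made with the standard library. This sketch assumes the server exposes the usual OpenAI path `/chat/completions` under the `/v1` base URL used above; `build_chat_request` is a hypothetical helper, and the network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       model: str = "default") -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completion."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8000/v1", "Hello!")
print(request.get_method(), request.full_url)
# POST http://localhost:8000/v1/chat/completions

# With the server from step 3 running, send it like this:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```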

The Bottom Line

Your MacBook isn't just a laptop anymore. It's a private AI server that can:

  • Keep Data Private: nothing leaves your machine
  • Serve Your Team: multiple concurrent users
  • Zero API Costs: unlimited queries, free

Whether you're building a startup product, running internal tools, or just experimenting with AI, vLLM-MLX gives you production-grade inference without the cloud bills or privacy concerns.

Ready to Run AI Locally?

Open source, free forever. Get started in 5 minutes. No GPU rental, no API keys, no monthly bills.

Works on any Apple Silicon Mac (M1, M2, M3, M4)