Wiqonn - AI & Technology Solutions for Real-World Impact

Concurrent Users

464

Tokens/Second

200+

Models Supported

API Cost

Why This Changes Everything

No More API Bills

Cloud APIs typically charge $0.01 or more per 1K tokens. Queries served from your Mac incur no per-token fees, and your data never leaves your machine.

Serve Multiple Users

Continuous batching lets your Mac handle 5+ concurrent users. Perfect for team demos, internal tools, or local development.

Vision Models Too

Not just text. Run Qwen-VL, LLaVA, and Gemma 3 for image understanding. Our caching makes follow-up questions about the same image up to 28x faster.

Drop-in Replacement

OpenAI-compatible API. Change one line of code to switch from cloud to local. Works with LangChain, LlamaIndex, and any OpenAI SDK.

The Cloud AI Problem

Running AI in the cloud is expensive. A single GPT-4 Vision call costs ~$0.01. Sounds cheap until you're processing thousands of images or building an app with real users. Suddenly you're looking at $500+ monthly bills.

Plus, every request means sending your data to someone else's servers. For sensitive documents, medical images, or proprietary data, that's a non-starter.
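To put the cost figures above in perspective, here is a back-of-the-envelope calculation. It uses only the "~$0.01 per call" and "$500+ per month" numbers quoted above; the 30-day month is an assumption made for illustration.

```python
COST_PER_CALL_USD = 0.01   # "~$0.01" per vision call, from the text above
MONTHLY_BILL_USD = 500     # the "$500+" figure from the text above
DAYS_PER_MONTH = 30        # assumption for illustration

# Number of calls that add up to the monthly bill.
calls_per_month = round(MONTHLY_BILL_USD / COST_PER_CALL_USD)
calls_per_day = calls_per_month / DAYS_PER_MONTH

print(f"{calls_per_month} calls per month, about {round(calls_per_day)} per day")
```

At these rates, a $500 bill corresponds to roughly 50,000 image calls a month, a volume that a small production app could plausibly reach.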

Cloud APIs

× Pay per token (adds up fast)
× Data leaves your network
× Rate limits and downtime
× Latency to remote servers

vLLM-MLX on Your Mac

✓ Unlimited queries, zero cost
✓ 100% private, data stays local
✓ Always available, no limits
✓ Sub-second local latency

Under the Hood

Two core innovations make vLLM-MLX different: Continuous Batching for handling multiple users, and Content-Based Caching for faster vision model responses.
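As a rough illustration of why continuous batching helps with multiple users, the toy simulation below compares it with static batching. It is a simplified sketch, not the vLLM-MLX scheduler: the function names, the request lengths, and the "one decode step yields one token per in-flight request" model are all assumptions made for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per in-flight request;
        # requests that just emitted their last token leave the batch.
        running = [left - 1 for left in running if left > 1]
        steps += 1
    return steps


# Output lengths (in tokens) for eight hypothetical requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this toy workload the short requests no longer wait for the longest one in their batch, so the same work finishes in fewer decode steps.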

Content-Based Prefix Caching for MLLMs

Multimodal Large Language Models (MLLMs) like Qwen3-VL, Gemma 3, and LLaVA process images through a vision encoder before the language model can reason about them. This vision encoding step is computationally expensive, often taking 1.5 to 2 seconds per image on consumer hardware.

Our key insight: identical images can arrive through different delivery mechanisms (URLs, base64 encoding, file paths), but the pixel content remains the same. By computing a SHA-256 hash of the decoded pixel values, we can recognize pixel-identical images regardless of their delivery format.

Cache Key: SHA256(image) + prompt

Cached Value: embeddings + kv_cache

Same image content = same hash = cache hit (regardless of URL, base64, or file path)
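The content-hashing idea can be sketched with the Python standard library. This is a minimal illustration under assumptions: `decode_pixels` here is a hypothetical stand-in that only handles raw bytes and base64 strings (a real decoder would also fetch URLs, read files, and decompress PNG or JPEG data), and mixing the image dimensions into the hash is a choice made for this example rather than something the text specifies.

```python
import base64
import hashlib


def decode_pixels(image):
    """Return raw pixel bytes for an image given as bytes or a base64 string.

    Hypothetical stand-in for a real image decoder.
    """
    if isinstance(image, str):
        return base64.b64decode(image)
    return bytes(image)


def content_hash(image, width, height):
    """SHA-256 over the image dimensions plus the decoded pixel bytes."""
    digest = hashlib.sha256()
    digest.update(f"{width}x{height}:".encode("ascii"))
    digest.update(decode_pixels(image))
    return digest.hexdigest()


# A 2x2 RGB image: 4 pixels, 3 bytes each.
raw = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 255])
as_base64 = base64.b64encode(raw).decode("ascii")

same_content = content_hash(raw, 2, 2) == content_hash(as_base64, 2, 2)
different_shape = content_hash(raw, 2, 2) != content_hash(raw, 4, 1)
print(same_content, different_shape)
```

Both delivery formats decode to the same bytes and therefore share one hash, while the same bytes interpreted with a different shape do not collide.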

Apple Silicon Unified Memory Architecture

Apple Silicon's unified memory model enables zero-copy cache management. Unlike discrete GPU systems where tensors must be explicitly transferred between CPU and GPU memory pools, MLX arrays exist in a single address space accessible by both compute units. This eliminates transfer overhead for cached embeddings.

System Architecture

vLLM-Inspired API: OpenAI-compatible interface with continuous batching

MLXPlatform: Apple Silicon backend

mlx-lm: LLM inference
mlx-vlm: Vision + LLM
mlx-audio: TTS + STT

MLX Framework: Apple ML framework with Metal GPU kernels

Cache-Aware Generation Algorithm

cache_generation.py

python

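import hashlib

# Assumes prefix_cache (a dict), llm, vision_encoder, and decode_pixels
# (returning raw pixel bytes) are defined elsewhere.
# prompt_prefix is the part of the prompt used in the cache key
# (assumed to be supplied by the caller).
def generate_with_cache(image, prompt, prompt_prefix):
    # Compute content hash regardless of delivery format
    image_hash = hashlib.sha256(decode_pixels(image)).hexdigest()
    cache_key = (image_hash, prompt_prefix)

    if cache_key in prefix_cache:
        # Cache HIT: skip the vision encoder and reuse the stored KV cache
        embeddings, kv = prefix_cache[cache_key]
        response, _ = llm.generate(prompt, cached_kv=kv)
        return response

    # Cache MISS: encode, generate, store
    embeddings = vision_encoder(image)
    response, kv = llm.generate(prompt, embeddings)
    prefix_cache[cache_key] = (embeddings, kv)
    return response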
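The hit and miss paths of the algorithm above can be exercised end to end with stand-in components. The sketch below is a simplification under assumptions: `toy_vision_encoder` and `toy_generate` are hypothetical placeholders for the real encoder and language model, and the cache stores only embeddings keyed by the image hash, leaving out the KV cache and the prompt prefix.

```python
import hashlib

prefix_cache = {}   # image hash -> embeddings
encoder_calls = 0   # counts how often the expensive step runs


def toy_vision_encoder(pixels: bytes) -> list:
    """Hypothetical stand-in for a real vision encoder."""
    global encoder_calls
    encoder_calls += 1
    return [value / 255 for value in pixels]


def toy_generate(prompt: str, embeddings: list) -> str:
    """Hypothetical stand-in for the language model."""
    return f"{prompt} -> answered from {len(embeddings)} features"


def answer(pixels: bytes, prompt: str) -> str:
    key = hashlib.sha256(pixels).hexdigest()
    if key not in prefix_cache:
        # Cache miss: run the encoder once and remember the result.
        prefix_cache[key] = toy_vision_encoder(pixels)
    # Cache hit (or freshly stored entry): reuse the embeddings.
    return toy_generate(prompt, prefix_cache[key])


image = bytes(range(12))
first = answer(image, "What is in this image?")
second = answer(image, "What color is it?")
print(encoder_calls)
```

Two questions about the same image trigger a single encoder call, which is the effect the caching layer relies on for follow-up questions.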

Real Performance Numbers

No marketing fluff. Here's what you actually get on an M4 Max with 128GB RAM. Even older M1/M2 machines deliver impressive results.

Text Model Throughput Comparison

All models use 4-bit quantization. Speedup = vLLM-MLX throughput / llama.cpp throughput.

Model	vLLM-MLX	vllm-metal	mlx-lm	llama.cpp	Speedup
Qwen3 Family
Qwen3-0.6B	525.5	365.8	356.2	281.5	1.87x
Qwen3-4B	159.0	137.3	128.9	118.2	1.35x
Qwen3-8B	93.3	87.1	79.9	76.9	1.21x
Qwen3-30B-A3B	109.7	110.3	107.4	89.9	1.17x
Llama 3.2 Family
Llama-3.2-1B	461.9	350.9	347.1	331.3	1.39x
Llama-3.2-3B	203.6	174.3	167.5	155.8	1.31x
Other Architectures
Gemma 3-4B	152.5	117.0	105.4	123.2	1.24x
Nemotron-30B-A3B	121.8	--	101.6	85.1	1.43x

Measured on M4 Max (128GB). Values in tokens/second.

Concurrency Scaling

How aggregate throughput scales with the number of concurrent users. Sequential processing (llama.cpp) stays flat, while continuous batching in vLLM-MLX raises aggregate throughput as users are added.

Aggregate Throughput (Qwen3-0.6B)

1 user: 441 tok/s (1.0x)
4 users: 892 tok/s (2.0x)
8 users: 1,156 tok/s (2.6x)
16 users: 1,642 tok/s (3.7x)
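The scaling factors above, and the throughput each individual user sees, follow directly from the aggregate numbers. The short calculation below uses only the four measurements listed; the per-user figure assumes throughput is shared evenly across users, which is a simplification.

```python
# Aggregate throughput in tokens/second, taken from the figures above.
aggregate_tok_s = {1: 441, 4: 892, 8: 1156, 16: 1642}

baseline = aggregate_tok_s[1]
for users, total in aggregate_tok_s.items():
    scaling = total / baseline   # speedup relative to a single user
    per_user = total / users     # average share per user (even split assumed)
    print(f"{users:>2} users: {scaling:.1f}x aggregate, {per_user:.1f} tok/s per user")
```

Aggregate throughput grows sublinearly: at 16 users the system delivers 3.7x the single-user total, which works out to roughly 103 tok/s per user.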

Image Caching Performance

Response time when asking multiple questions about the same image

Question #	Without Cache	With Cache	Improvement
1st (new image)	21.7 seconds	21.7 seconds	Same
2nd	21.7 seconds	1.15 seconds	19x faster
3rd	21.7 seconds	0.79 seconds	28x faster
4th+	21.7 seconds	0.78 seconds	28x faster

Image Understanding Performance

Qwen3-VL-8B-Instruct-4bit performance across different image resolutions

Resolution	Pixels	Time	Tokens	Speed
224x224	50K	1.04s	78	74.8 tok/s
448x448	201K	1.45s	70	48.1 tok/s
768x768	590K	2.05s	91	44.3 tok/s
1024x1024	1.0M	2.79s	76	27.2 tok/s
1280x720	922K	2.97s	96	32.4 tok/s
1920x1080	2.1M	6.30s	89	14.1 tok/s

Average: 45.2 tok/s across all resolutions. Fastest at 224x224, slowest at 1920x1080.

Video Understanding Performance

Qwen3-VL-8B-Instruct-4bit performance across different frame counts

Configuration	Frames	Time	Speed	Memory
2 @ 0.5fps	2	4.48s	57.1 tok/s	6.4 GB
4 @ 1fps	4	4.65s	55.0 tok/s	6.4 GB
8 @ 2fps	8	6.45s	37.2 tok/s	6.8 GB
16 @ 2fps	16	10.96s	23.4 tok/s	7.6 GB
32 @ 4fps	32	20.00s	12.8 tok/s	9.2 GB
64 @ 8fps	64	59.81s	4.3 tok/s	12.9 GB

Memory scales from 6.4 GB (2 frames) to 12.9 GB (64 frames). 96+ frames may cause GPU timeout.

Feature Comparison

How vLLM-MLX compares to other tools for running AI on Mac

Feature	mlx-lm	llama.cpp	vllm-metal	vLLM-MLX
Native Apple GPU	Yes	Yes	Yes	Yes
OpenAI-compatible API	Yes	Yes	Yes	Yes
Continuous batching	No	No	Yes	Yes
Vision models (MLLM)	No	Yes	Yes	Yes
Content-based prefix caching	No	No	No	Yes

What You Can Build

Private AI Assistant

Run ChatGPT-like conversations on your Mac without sending data to the cloud. Well suited to sensitive documents and other private data.

Image Analysis Tools

Build apps that understand photos, analyze charts, read documents, or describe scenes for accessibility.

Voice Applications

Text-to-speech in 10+ languages and speech-to-text transcription, all running locally.

AI-Powered Automation

Connect AI to external tools via the Model Context Protocol (MCP) for agentic workflows and automation.

Development & Testing

Test AI integrations locally before deploying to production, without API costs.

Research & Experimentation

Experiment with different AI models, fine-tuning, and prompt engineering on your own hardware.

Getting Started

You'll need a Mac with Apple Silicon (M1, M2, M3, or M4 chip) and Python installed.

Clone the Repository

Terminal

$ git clone https://github.com/waybarrios/vllm-mlx.git

$ cd vllm-mlx

Install Dependencies

Terminal

$ pip install -e .

Start the AI Server

Terminal

$ vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

Start Chatting

chat_example.py

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Running locally
)

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
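If the OpenAI SDK is not installed, the same request can be sketched with the Python standard library. This is a minimal illustration, assuming the server exposes the standard OpenAI-style `/chat/completions` path under the base URL used above; `build_chat_request` is a hypothetical helper name, and the block only builds the request so it can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8000/v1", "default", "Hello!")
print(request.full_url)

# With the server from the previous step running, send it like this:
# with urllib.request.urlopen(request) as reply:
#     body = json.load(reply)
#     print(body["choices"][0]["message"]["content"])
```

Switching between a cloud endpoint and the local server then comes down to changing the `base_url` argument.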

The Bottom Line

Your MacBook isn't just a laptop anymore. It's a private AI server that can:

Keep Data Private: Nothing leaves your machine
Serve Your Team: Multiple concurrent users
Zero API Costs: Unlimited queries, free

Whether you're building a startup product, running internal tools, or just experimenting with AI, vLLM-MLX gives you production-grade inference without the cloud bills or privacy concerns.

Ready to Run AI Locally?

Open source, free forever. Get started in 5 minutes. No GPU rental, no API keys, no monthly bills.


Works on any Apple Silicon Mac (M1, M2, M3, M4)
