Local AI vs Cloud
The Benchmark
Gemma 4 26B-A4B scores 97.9% — matching GPT-5.4 — running entirely on a MacBook Pro M5 at 59.9 tok/s, 414ms TTFT. Zero API costs. Full data privacy. All local.
MacBook Pro M5 · M5 Pro · 18 cores · 64 GB Unified Memory · macOS 15.3 (arm64) · llama.cpp b8638
Local LLM Deep Dive
Evaluated on Apple M-Series hardware. Models run via either llama.cpp (GGUF) or our SwiftLM (MLX) engine. Performance varies by system load; speeds reflect real-world on-device conditions.
| # | Engine | Model | Accuracy | Speed (tok/s) | Avg TTFT |
|---|---|---|---|---|---|
| 🥇 | GGUF | Gemma-4-26B-A4B-it-Q4_K_M (MoE) | 97.9% | 59.9 | 414ms |
| 🥇 | GGUF | Gemma-4-31B-it-Q4_K_M | 97.9% | 11.9 | 2.4s |
| 🥉 | GGUF | Qwen3.5-27B-UD-Q8_K_XL | 95.8% | 6.3 | 2.3s |
| 4 | GGUF | Qwen3.5-9B-BF16 | 94.8% | 12.5 | 626ms |
| 5 | GGUF | Qwen3.5-9B-Q4_K_M | 93.8% | 25 | 765ms |
| 5 | GGUF | Qwen3.5-27B-Q4_K_M | 93.8% | 10 | 2.2s |
| 7 | GGUF | Qwen3.5-122B-A10B-UD-IQ1_M (MoE) | 92.7% | 18 | 1.6s |
| 7 | MLX | Qwen3.5-9B-MLX-4bit | 92.7% | 40 | 315ms |
| 7 | MLX | Qwen3.5-27B-4bit | 92.7% | 13.7 | 1.3s |
| 7 | MLX | Qwen3.5-35B-A3B-4bit (MoE) | 92.7% | 62.8 | 391ms |
| 11 | GGUF | Qwen3.5-35B-A3B-UD-Q4_K_L (MoE) | 91.7% | 41.9 | 435ms |
| 12 | GGUF | Gemma-4-E4B-it-Q4_K_M | 90.6% | 56.9 | 356ms |
| 12 | MLX | Qwen3.5-9B-8bit | 90.6% | 22.6 | 471ms |
| 14 | GGUF | Mistral-Small-4-119B-Q2_K_XL (MoE) | 89.6% | 43.1 | 1.0s |
| 14 | MLX | Qwen3.5-35B-A3B-4bit (older run) | 89.6% | 24.3 | 446ms |
| 16 | GGUF | NVIDIA Nemotron-3-Nano-4B-Q4_K_M | 87.5% | 50 | 727ms |
| 17 | GGUF | Mistral-Small-4-119B-UD-IQ1_M (MoE) | 82.3% | 44 | 997ms |
| 18 | GGUF | NVIDIA Nemotron-30B-A3B-Q8_0 (MoE) | 81.2% | 39.9 | 1.1s |
| 19 | GGUF | Qwen3.5-27B (earlier Q4_K_M run) | 77.1% | 11.3 | 4.0s |
| 20 | GGUF | LFM2-24B-A2B-Q8_0 (MoE) | 75.0% | 68 | 535ms |
| 21 | MLX | GPT-OSS-20B-MXFP4-Q8 | 69.8% | 39.6 | 712ms |
| 22 | GGUF | LFM2.5-1.2B-Instruct-BF16 | 64.6% | 84.4 | 153ms |
| 23 | MLX | Qwen2.5-3B-Instruct-4bit | 59.4% | 89.6 | 78ms |
| 24 | MLX | Qwen3.5-27B-Claude-Distilled-4bit | 57.3% | 53.9 | 2.3s |
| 24 | GGUF | LFM2.5-350M-BF16 ⚡ glue | 57.3% | 220.8 | 54ms |
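Every GGUF row above was served locally through llama.cpp's bundled OpenAI-compatible server. A rough sketch of that setup (the model path and flag values here are illustrative, not the exact benchmark configuration):

```python
import subprocess

# llama.cpp's llama-server exposes an OpenAI-compatible API on --port.
# -ngl 99 offloads all layers to the GPU (Metal on Apple Silicon);
# the model path is a hypothetical local file.
server = subprocess.Popen([
    "llama-server",
    "-m", "models/gemma-4-26b-a4b-it-q4_k_m.gguf",
    "--port", "8080",
    "-ngl", "99",
    "-c", "4096",
])
# ... run the suite against http://localhost:8080/v1, then:
# server.terminate()
```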
Gemma 4 Ties GPT-5.4
The 26B-A4B MoE matches GPT-5.4 at 97.9% — running locally at 59.9 t/s with 414ms TTFT. The first local model to reach the cloud ceiling.
MoE vs Dense: 5× Faster
Gemma 4 26B-A4B activates only 4B params per token, matching the 31B Dense model's accuracy while decoding 5× faster (59.9 vs 11.9 t/s).
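The mechanism behind that gap: an MoE router selects only a few experts per token, so decode cost tracks active parameters rather than total parameters. A toy sketch of top-k gating (purely illustrative; the expert count, shapes, and router below are invented, not Gemma 4's actual architecture):

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy top-k MoE layer: route token x to k of len(experts) expert MLPs.

    Only the k selected experts run, so per-token compute scales with
    active params (the "A4B" in 26B-A4B), not total params.
    """
    scores = gate_w @ x                      # router logits, one per expert
    top_k = np.argsort(scores)[-k:]          # indices of the k best experts
    weights = np.exp(scores[top_k])
    weights /= weights.sum()                 # softmax over selected experts only
    # A dense layer would run ALL experts; MoE runs just k of them.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Hypothetical shapes: 8 experts, 2 active -- ~4x less compute per token.
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.standard_normal((d, d)) / d: W @ x for _ in range(8)]
gate_w = rng.standard_normal((8, d))
y = moe_layer(rng.standard_normal(d), gate_w, experts)
```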
Multi-Family Depth
Four model families now clear 87% on the suite: Gemma 4 (97.9%), Qwen3.5 (95.8%), Mistral (89.6%), and Nemotron (87.5%), proving local AI is no longer a single-vendor story.
🧩 System Controller
LFM2.5-350M hits 221 tok/s with Tool Use 81% & JSON Compliance 91% — fast enough to control the full event pipeline (routing, dispatch, tool selection) at 54ms TTFT, bundled under 1 GB with the app.
LFM2.5-350M — The System Controller
350M params · <1 GB RAM · ships bundled with the app · always-on tool caller
Controller-Critical Suite Performance
The Embedded Controller Pattern
LFM2.5-350M: routes, dispatches, calls tools · 9B model: reasoning & classification
LFM2.5-350M handles the structured fast-path. The 9B model handles the reasoning. Together they cover the full HomeSec-Bench suite.
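In code, the pattern is a two-tier router: the 350M controller emits a label, fast-path actions stay on the small model, and everything else escalates. A minimal sketch assuming two local OpenAI-compatible endpoints (the ports, model names, and labels are illustrative, not the shipped Aegis-AI controller):

```python
from openai import OpenAI

# Two local OpenAI-compatible endpoints (ports are illustrative).
controller = OpenAI(base_url="http://localhost:8081/v1", api_key="none")  # LFM2.5-350M
reasoner = OpenAI(base_url="http://localhost:8082/v1", api_key="none")    # 9B model

FAST_PATH = {"route_event", "dispatch_alert", "select_tool"}

def handle(event: str) -> str:
    # Step 1: the 350M controller classifies the event into an action label.
    label = controller.chat.completions.create(
        model="lfm2.5-350m",
        messages=[
            {"role": "system", "content": "Reply with exactly one label: "
             "route_event, dispatch_alert, select_tool, or needs_reasoning."},
            {"role": "user", "content": event},
        ],
    ).choices[0].message.content.strip()

    # Step 2: structured fast-path stays on the 350M model (54ms TTFT);
    # anything needing real reasoning escalates to the 9B model.
    client, model = (controller, "lfm2.5-350m") if label in FAST_PATH \
        else (reasoner, "qwen3.5-9b")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": event}],
    )
    return reply.choices[0].message.content
```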
Cloud LLM Deep Dive
Every major cloud provider, tested head-to-head on the same 96-test HomeSec-Bench suite. Metrics include accuracy, real-world Time to First Token, and decode throughput — no cherry picking.
| # | Provider | Model | Accuracy | Speed (tok/s) | Avg TTFT |
|---|---|---|---|---|---|
| 🥇 | OpenAI | gpt-5.4-2026-03-05 | 97.9% | 73.4 | 601ms |
| 🥈 | Anthropic | claude-opus-4-20250514 | 96.9% | 1.8 | 1.3s |
| 🥉 | OpenAI | gpt-5.4-mini-2026-03-17 | 95.8% | 234.5 | 553ms |
| 🥉 | Alibaba Cloud | qwen3.6-plus | 95.8% | 20.9 | 1.1s |
| 🥉 | Alibaba Cloud | qwen3-max | 95.8% | 5.9 | 1.2s |
| 6 | Anthropic | claude-sonnet-4-20250514 | 94.8% | 2.6 | 1.2s |
| 7 | MiniMax | MiniMax-M2.7-highspeed | 93.8% | 3 | 1.5s |
| 7 | Moonshot AI | kimi-k2-0905-preview | 93.8% | 62.5 | 2.2s |
| 7 | Anthropic | claude-opus-4-6 | 93.8% | 2.3 | 1.9s |
| 10 | OpenAI | gpt-5.4-nano-2026-03-17 | 92.7% | 136.4 | 508ms |
| 10 | MiniMax | MiniMax-M2.5-highspeed | 92.7% | 2.8 | 1.8s |
| 10 | Alibaba Cloud | qwen-plus | 92.7% | 11.7 | 535ms |
| 10 | Anthropic | claude-haiku-4-5 | 92.7% | 5.3 | 530ms |
| 10 | Anthropic | claude-sonnet-4-6 | 92.7% | 2.6 | 1.4s |
| 10 | Other Cloud | grok-4-1-fast-non-reasoning | 92.7% | 496.1 | 447ms |
| 16 | DeepSeek | deepseek-chat | 91.7% | 21.4 | 1.5s |
| 16 | MiniMax | MiniMax-M2.7 | 91.7% | 1.5 | 3.8s |
| 18 | MiniMax | MiniMax-M2.5 | 89.6% | 199.1 | 3.2s |
| 18 | Alibaba Cloud | qwen-flash | 89.6% | 28.5 | 398ms |
| 20 | Alibaba Cloud | qwen3.5-plus | 85.4% | 16.3 | 1.3s |
| 21 | Alibaba Cloud | qwen3.5-flash | 84.4% | 26.7 | 804ms |
| 22 | MiniMax | MiniMax-M2.1 | 80.2% | 195.7 | 4.5s |
| 23 | Moonshot AI | kimi-k2-thinking ⚠ thinking | 72.9% | 28.3 | 1.1s |
| 24 | Moonshot AI | kimi-k2-turbo-preview | 68.8% | 134.9 | 773ms |
| 25 | DeepSeek | deepseek-reasoner ⚠ thinking | 65.6% | 34.5 | 1.1s |
| 26 | OpenAI | gpt-5-mini-2025-08-07 | 62.5% | 72.6 | 7.2s |
Frontier Dominance
GPT-5.4 and Anthropic Opus 4 top the chart, but Alibaba's Qwen3-max is practically tied at 95.8% — a new axis of competition.
The Speed Anomaly
Grok-4-1-fast hits 496 tok/s with Tier-2 accuracy (89/96). Anthropic endpoints manage only 2–5 tok/s: high accuracy, low throughput.
The "Thinking" Penalty
kimi-k2-thinking (70/96) and deepseek-reasoner (63/96) collapse. Chain-of-thought tokens break strict JSON — a hard lesson for structured tasks.
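The failure is mechanical rather than a reasoning gap: if the grader requires the entire completion to parse as JSON, any leaked chain-of-thought preamble fails the test even when the JSON inside it is perfect. A minimal sketch of that kind of strict check (our grader's exact rules may differ):

```python
import json

def grade_strict_json(completion: str) -> bool:
    """Pass only if the ENTIRE completion is one valid JSON document."""
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

# A structured-output model answers with bare JSON and passes:
grade_strict_json('{"urgency": "critical", "action": "dispatch_alert"}')  # True

# A thinking model that leaks chain-of-thought before the JSON fails,
# even though the trailing JSON is well-formed:
grade_strict_json('Let me think: a person at 3am is unusual... '
                  '{"urgency": "critical", "action": "dispatch_alert"}')  # False
```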
Performance: Local vs Cloud
The Qwen3.5-35B-MoE has a lower TTFT than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.
Time to First Token (avg) · lower is better
Decode Speed (tok/s) · higher is better
GPU Memory Usage (Local Models)
What is HomeSec-Bench?
A benchmark we created to evaluate LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.
All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.
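Because every run, local or cloud, goes through the same OpenAI-compatible streaming API, TTFT and decode speed fall out of one timing loop. A minimal sketch of such a measurement (the endpoint URL, model name, and chunk-per-token approximation are assumptions, not the benchmark's exact harness):

```python
import time
from openai import OpenAI

# Works against a cloud API or a local llama.cpp / MLX server alike.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def measure(prompt: str, model: str = "local-model"):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # marks TTFT
            n_tokens += 1  # approximation: one streamed chunk ~ one token
    elapsed = time.perf_counter() - (first_token_at or start)
    ttft_ms = (first_token_at - start) * 1000 if first_token_at else float("nan")
    return ttft_ms, n_tokens / elapsed if elapsed > 0 else 0.0

ttft, tps = measure("Is a person lingering at the door at 3am suspicious?")
print(f"TTFT {ttft:.0f}ms, decode {tps:.1f} tok/s")
```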
Context Preprocessing (6 tests)
Deduplicating conversations, preserving system messages
Topic Classification (4 tests)
Routing queries to the right domain
Knowledge Distillation (5 tests)
Extracting durable facts from conversations
Event Deduplication (8 tests)
"Same person or new visitor?" across cameras
Tool Use (16 tests)
Selecting correct tools with correct parameters
Chat & JSON Compliance (11 tests)
Persona, JSON output, multilingual
Security Classification (12 tests)
Normal → Monitor → Suspicious → Critical triage
Narrative Synthesis (4 tests)
Summarizing event logs into daily reports
Prompt Injection Resistance (4 tests)
Role confusion, prompt extraction, escalation
Multi-Turn Reasoning (4 tests)
Reference resolution, temporal carry-over
Error Recovery (4 tests)
Handling impossible queries, API errors
Privacy & Compliance (3 tests)
PII redaction, illegal surveillance rejection
Alert Routing (5 tests)
Channel routing, quiet hours parsing
Knowledge Injection (5 tests)
Using injected KIs to personalize responses
VLM-to-Alert Triage (5 tests)
End-to-end: VLM output → urgency → alert dispatch (see the triage sketch below)
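To make the last suite concrete, here's a toy version of the VLM-to-Alert fast path: a four-level urgency ladder mapped onto alert channels (the field names and channel policy are invented for illustration; they are not the fixture schema):

```python
from enum import Enum

class Urgency(Enum):  # the four-level triage ladder from the suite above
    NORMAL = 0
    MONITOR = 1
    SUSPICIOUS = 2
    CRITICAL = 3

# Hypothetical channel policy: where each urgency level gets routed.
CHANNELS = {
    Urgency.NORMAL: [],                        # log only
    Urgency.MONITOR: ["app_feed"],
    Urgency.SUSPICIOUS: ["app_feed", "push"],
    Urgency.CRITICAL: ["push", "sms", "siren"],
}

def triage(vlm_output: dict) -> list[str]:
    """End-to-end fast path: VLM scene description -> urgency -> channels."""
    urgency = Urgency[vlm_output["urgency"].upper()]
    return CHANNELS[urgency]

# One plausible test shape (illustrative fields, not the real fixtures):
assert triage({"scene": "person at door, 3am", "urgency": "critical"}) \
    == ["push", "sms", "siren"]
```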
Why This Matters
See It Run
Watch the benchmark suite execute live on Apple Silicon — every test visible in real time.
A local MoE model matching GPT-5.4 on real security tasks — fully offline with complete privacy — is the value proposition of local AI.
System: Aegis-AI — Local-first AI home security on consumer hardware.
Benchmark: HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.
Skill Platform: DeepCamera — Decentralized AI skill ecosystem.