HomeSec-Bench v1 · 96 LLM Tests · 15 Suites

Local AI vs Cloud
The Benchmark

Gemma 4 26B-A4B scores 97.9%, matching GPT-5.4 — running entirely on a MacBook Pro M5 at 59.9 tok/s with 414 ms TTFT. Zero API costs. Full data privacy. All local.

97.9%
Pass Rate (Local MoE)
59.9 tok/s
Decode Speed
414 ms
Time to First Token

MacBook Pro M5 · M5 Pro · 18 cores · 64 GB Unified Memory · macOS 15.3 (arm64) · llama.cpp b8638

25 Local Models · Full On-Device Benchmark

Local LLM Deep Dive

Evaluated on Apple M-Series hardware. Models run via either llama.cpp (GGUF) or our SwiftLM (MLX) engine. Performance varies by system load; speeds reflect real-world on-device conditions.

| # | Engine | Model | Score | Accuracy | Speed (tok/s) | Avg TTFT |
|---|--------|-------|-------|----------|---------------|----------|
| 🥇 | GGUF | Gemma-4-26B-A4B-it-Q4_K_M (MoE) | 94/96 | 97.9% | 59.9 | 414 ms |
| 🥇 | GGUF | Gemma-4-31B-it-Q4_K_M | 94/96 | 97.9% | 11.9 | 2.4 s |
| 3 | GGUF | Qwen3.5-27B-UD-Q8_K_XL | 92/96 | 95.8% | 6.3 | 2.3 s |
| 4 | GGUF | Qwen3.5-9B-BF16 | 91/96 | 94.8% | 12.5 | 626 ms |
| 5 | GGUF | Qwen3.5-9B-Q4_K_M | 90/96 | 93.8% | 25 | 765 ms |
| 5 | GGUF | Qwen3.5-27B-Q4_K_M | 90/96 | 93.8% | 10 | 2.2 s |
| 7 | GGUF | Qwen3.5-122B-A10B-UD-IQ1_M (MoE) | 89/96 | 92.7% | 18 | 1.6 s |
| 7 | MLX | Qwen3.5-9B-MLX-4bit | 89/96 | 92.7% | 40 | 315 ms |
| 7 | MLX | Qwen3.5-27B-4bit | 89/96 | 92.7% | 13.7 | 1.3 s |
| 7 | MLX | Qwen3.5-35B-A3B-4bit (MoE) | 89/96 | 92.7% | 62.8 | 391 ms |
| 11 | GGUF | Qwen3.5-35B-A3B-UD-Q4_K_L (MoE) | 88/96 | 91.7% | 41.9 | 435 ms |
| 12 | GGUF | Gemma-4-E4B-it-Q4_K_M | 87/96 | 90.6% | 56.9 | 356 ms |
| 12 | MLX | Qwen3.5-9B-8bit | 87/96 | 90.6% | 22.6 | 471 ms |
| 14 | GGUF | Mistral-Small-4-119B-Q2_K_XL (MoE) | 86/96 | 89.6% | 43.1 | 1.0 s |
| 14 | MLX | Qwen3.5-35B-A3B-4bit (older run) | 86/96 | 89.6% | 24.3 | 446 ms |
| 16 | GGUF | NVIDIA Nemotron-3-Nano-4B-Q4_K_M | 84/96 | 87.5% | 50 | 727 ms |
| 17 | GGUF | Mistral-Small-4-119B-UD-IQ1_M (MoE) | 79/96 | 82.3% | 44 | 997 ms |
| 18 | GGUF | NVIDIA Nemotron-30B-A3B-Q8_0 (MoE) | 78/96 | 81.2% | 39.9 | 1.1 s |
| 19 | GGUF | Qwen3.5-27B (earlier Q4_K_M run) | 74/96 | 77.1% | 11.3 | 4.0 s |
| 20 | GGUF | LFM2-24B-A2B-Q8_0 (MoE) | 72/96 | 75.0% | 68 | 535 ms |
| 21 | MLX | GPT-OSS-20B-MXFP4-Q8 | 67/96 | 69.8% | 39.6 | 712 ms |
| 22 | GGUF | LFM2.5-1.2B-Instruct-BF16 | 62/96 | 64.6% | 84.4 | 153 ms |
| 23 | MLX | Qwen2.5-3B-Instruct-4bit | 57/96 | 59.4% | 89.6 | 78 ms |
| 24 | MLX | Qwen3.5-27B-Claude-Distilled-4bit | 55/96 | 57.3% | 53.9 | 2.3 s |
| 24 | GGUF | LFM2.5-350M-BF16 ⚡ glue | 55/96 | 57.3% | 220.8 | 54 ms |
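The rank column follows standard competition ranking: tied models share a rank, and the next rank is skipped (hence two 🥇 entries and no rank 2). A minimal sketch of that rule, for illustration only:

```python
def competition_ranks(scores):
    """Standard competition ranking: tied scores share a rank,
    and the following rank is skipped (1, 1, 3, ...)."""
    ordered = sorted(scores, reverse=True)
    # Each score's rank is 1 + the position of its first occurrence.
    return [ordered.index(s) + 1 for s in scores]

# Top of the local leaderboard: two models tied at 94/96.
print(competition_ranks([94, 94, 92, 91, 90, 90, 89]))  # [1, 1, 3, 4, 5, 5, 7]
```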

Gemma 4 Ties GPT-5.4

The 26B-A4B MoE matches GPT-5.4 at 97.9% — running locally at 59.9 t/s with 414ms TTFT. The first local model to reach the cloud ceiling.

MoE vs Dense: 5× Faster

Gemma 4 26B-A4B activates only 4B params per token, matching the 31B Dense model's accuracy while decoding 5× faster (59.9 vs 11.9 t/s).

Multi-Family Depth

Four model families — Gemma 4, Qwen3.5, Mistral, and Nemotron — now score 87.5% or better, proving local AI is no longer a single-vendor story.

🧩 System Controller

LFM2.5-350M hits 221 tok/s with 81% Tool Use and 91% JSON Compliance — fast enough to control the full event pipeline (routing, dispatch, tool selection) at 54 ms TTFT, and it ships bundled with the app in under 1 GB.

🌱 New Arrival · Liquid AI · March 31, 2026

LFM2.5-350M · The System Controller

350M params · <1 GB RAM · ships bundled with the app · always-on tool caller

57.3% Score
54ms TTFT
221 t/s Speed
<1 GB RAM

Controller-Critical Suite Performance

🔧 Tool Use · 81% (13/16)
💬 Chat & JSON Compliance · 91% (10/11)
🔄 Multi-Turn Reasoning · 100% (4/4)
⚠️ Error Recovery · 100% (4/4)
🔒 Privacy & Compliance · 67% (2/3)
📝 Narrative Synthesis · 75% (3/4)
🛡️ Security Classification · 17% (2/12)
🔔 Event Deduplication · 25% (2/8)

The Embedded Controller Pattern

📹 Camera Event (motion · person · anomaly)
    ↓
🧩 LFM2.5-350M · always-on · <1 GB · 54 ms TTFT
    routes · dispatches · calls tools
    ↓ escalates if needed
Qwen3.5-9B · complex reasoning & classification

LFM2.5-350M handles the structured fast-path. The 9B model handles the reasoning. Together they cover the full HomeSec-Bench suite.
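The two-tier split above can be sketched as a simple dispatch rule. The task names in the fast-path set are illustrative assumptions, not the app's actual routing table:

```python
# Illustrative sketch of the embedded-controller pattern: FAST_PATH
# membership is an assumption, not the real routing logic.
FAST_PATH = {"tool_use", "chat_json", "multi_turn", "error_recovery"}

def route(task: str) -> str:
    """Keep structured fast-path work on the small always-on model;
    escalate reasoning-heavy tasks to the larger local model."""
    if task in FAST_PATH:
        return "LFM2.5-350M"   # <1 GB, ~54 ms TTFT, always resident
    return "Qwen3.5-9B"        # escalation target for complex reasoning

print(route("tool_use"))                 # LFM2.5-350M
print(route("security_classification"))  # Qwen3.5-9B
```

The design choice mirrors the suite scores: the 350M model aces the structured suites it routes for itself (Multi-Turn, Error Recovery, JSON) and hands off exactly the suites where it collapses (Security Classification, Event Deduplication).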

26 Cloud LLMs · Full API Benchmark

Cloud LLM Deep Dive

Every major cloud provider, tested head-to-head on the same 96-test HomeSec-Bench suite. Metrics cover accuracy, real-world time to first token, and decode throughput — no cherry-picking.

| # | Provider | Model | Score | Accuracy | Speed (tok/s) | Avg TTFT |
|---|----------|-------|-------|----------|---------------|----------|
| 🥇 | OpenAI | gpt-5.4-2026-03-05 | 94/96 | 97.9% | 73.4 | 601 ms |
| 🥈 | Anthropic | claude-opus-4-20250514 | 93/96 | 96.9% | 1.8 | 1.3 s |
| 3 | OpenAI | gpt-5.4-mini-2026-03-17 | 92/96 | 95.8% | 234.5 | 553 ms |
| 3 | Alibaba Cloud | qwen3.6-plus | 92/96 | 95.8% | 20.9 | 1.1 s |
| 3 | Alibaba Cloud | qwen3-max | 92/96 | 95.8% | 5.9 | 1.2 s |
| 6 | Anthropic | claude-sonnet-4-20250514 | 91/96 | 94.8% | 2.6 | 1.2 s |
| 7 | MiniMax | MiniMax-M2.7-highspeed | 90/96 | 93.8% | 3 | 1.5 s |
| 7 | Moonshot AI | kimi-k2-0905-preview | 90/96 | 93.8% | 62.5 | 2.2 s |
| 7 | Anthropic | claude-opus-4-6 | 90/96 | 93.8% | 2.3 | 1.9 s |
| 10 | OpenAI | gpt-5.4-nano-2026-03-17 | 89/96 | 92.7% | 136.4 | 508 ms |
| 10 | MiniMax | MiniMax-M2.5-highspeed | 89/96 | 92.7% | 2.8 | 1.8 s |
| 10 | Alibaba Cloud | qwen-plus | 89/96 | 92.7% | 11.7 | 535 ms |
| 10 | Anthropic | claude-haiku-4-5 | 89/96 | 92.7% | 5.3 | 530 ms |
| 10 | Anthropic | claude-sonnet-4-6 | 89/96 | 92.7% | 2.6 | 1.4 s |
| 10 | Other | grok-4-1-fast-non-reasoning | 89/96 | 92.7% | 496.1 | 447 ms |
| 16 | DeepSeek | deepseek-chat | 88/96 | 91.7% | 21.4 | 1.5 s |
| 16 | MiniMax | MiniMax-M2.7 | 88/96 | 91.7% | 1.5 | 3.8 s |
| 18 | MiniMax | MiniMax-M2.5 | 86/96 | 89.6% | 199.1 | 3.2 s |
| 18 | Alibaba Cloud | qwen-flash | 86/96 | 89.6% | 28.5 | 398 ms |
| 20 | Alibaba Cloud | qwen3.5-plus | 82/96 | 85.4% | 16.3 | 1.3 s |
| 21 | Alibaba Cloud | qwen3.5-flash | 81/96 | 84.4% | 26.7 | 804 ms |
| 22 | MiniMax | MiniMax-M2.1 | 77/96 | 80.2% | 195.7 | 4.5 s |
| 23 | Moonshot AI | kimi-k2-thinking ⚠ thinking | 70/96 | 72.9% | 28.3 | 1.1 s |
| 24 | Moonshot AI | kimi-k2-turbo-preview | 66/96 | 68.8% | 134.9 | 773 ms |
| 25 | DeepSeek | deepseek-reasoner ⚠ thinking | 63/96 | 65.6% | 34.5 | 1.1 s |
| 26 | OpenAI | gpt-5-mini-2025-08-07 | 60/96 | 62.5% | 72.6 | 7.2 s |
🏆 Frontier Dominance

GPT-5.4 and Claude Opus 4 top the chart, but Alibaba's qwen3-max is practically tied at 95.8% — a new axis of competition.

The Speed Anomaly

Grok-4-1-fast hits 496 tok/s with Tier-2 accuracy (89/96), while Anthropic endpoints crawl along at 1.8–5.3 tok/s — high accuracy, low throughput.

🧠 The "Thinking" Penalty

kimi-k2-thinking (70/96) and deepseek-reasoner (63/96) collapse. Chain-of-thought tokens break strict JSON — a hard lesson for structured tasks.
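One common mitigation for structured tasks is to strip the reasoning trace before parsing. The sketch below assumes a `<think>…</think>` delimiter convention (used by some reasoning models; an assumption here, since the benchmark's actual post-processing is not described):

```python
import json
import re

def parse_strict_json(raw: str):
    """Remove chain-of-thought blocks (assumed <think>...</think>
    convention) plus any prose around the outermost JSON object,
    then parse strictly."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    return json.loads(cleaned[start:end + 1])

raw = '<think>The person is masked, so...</think>\n{"severity": "critical"}'
print(parse_strict_json(raw))  # {'severity': 'critical'}
```

Even with such salvage logic, a model that interleaves reasoning inside the JSON body still fails, which is consistent with the score collapse above.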

Performance: Local vs Cloud

The Qwen3.5-35B-MoE posts a lower average TTFT than every OpenAI cloud model — 435 ms vs 508 ms for GPT-5.4-nano.

Time to First Token (avg) · lower is better

Qwen3.5-35B-MoE (local) · 435 ms
GPT-5.4-nano (cloud) · 508 ms
GPT-5.4-mini (cloud) · 553 ms
GPT-5.4 (cloud) · 601 ms
Qwen3.5-9B (local) · 765 ms
Qwen3.5-122B-MoE (local) · 1627 ms
Qwen3.5-27B (local) · 2156 ms

Decode Speed · higher is better · tokens/second

GPT-5.4-mini (cloud) · 234.5
GPT-5.4-nano (cloud) · 136.4
GPT-5.4 (cloud) · 73.4
Qwen3.5-35B-MoE (local) · 41.9
Qwen3.5-9B (local) · 25
Qwen3.5-122B-MoE (local) · 18
Qwen3.5-27B (local) · 10
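Both metrics can be derived from a streamed completion's chunk timestamps. A minimal sketch, under an assumed timing model (TTFT = arrival of the first streamed chunk, decode rate measured over the remaining tokens):

```python
def stream_metrics(events):
    """Derive TTFT and decode throughput from a streamed completion.

    `events` is a list of (timestamp_s, n_tokens) chunks, with t=0 at
    the moment the request was sent. The timing model is an assumption,
    not the benchmark's documented methodology.
    """
    first_t, last_t = events[0][0], events[-1][0]
    total = sum(n for _, n in events)
    # Decode rate: tokens after the first chunk, over the streaming window.
    decode = (total - events[0][1]) / (last_t - first_t) if last_t > first_t else 0.0
    return {"ttft_ms": first_t * 1000, "decode_tok_s": decode}

# Synthetic stream: first token at 0.5 s, then one token every 25 ms.
m = stream_metrics([(0.5 + 0.025 * i, 1) for i in range(101)])
print(m)  # TTFT 500 ms, 40 tok/s decode
```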

GPU Memory Usage (Local Models)

Qwen3.5-9B · 13.8 GB
Qwen3.5-27B · 24.9 GB
Qwen3.5-35B-MoE · 27.2 GB
Qwen3.5-122B-MoE · 40.8 GB

What is HomeSec-Bench?

A benchmark we created to evaluate LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.

All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.
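Since tests run against any OpenAI-compatible endpoint, a single test reduces to a chat-completions payload plus a grader over the returned text. The sketch below is illustrative, not the actual harness; in particular, the key/value-subset grading rule is an assumption:

```python
import json

def build_request(model: str, system: str, user: str) -> dict:
    """Payload for any OpenAI-compatible /v1/chat/completions endpoint
    (e.g. a local llama.cpp server)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0,  # deterministic output for grading
    }

def grade(expected: dict, completion: str) -> bool:
    """Pass iff the completion is valid JSON containing every
    expected key/value pair (hypothetical grading rule)."""
    try:
        got = json.loads(completion)
    except json.JSONDecodeError:
        return False
    return all(got.get(k) == v for k, v in expected.items())

print(grade({"severity": "critical"},
            '{"severity": "critical", "zone": "porch"}'))  # True
```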

Context Preprocessing (6) · Deduplicating conversations, preserving system messages
Topic Classification (4) · Routing queries to the right domain
Knowledge Distillation (5) · Extracting durable facts from conversations
Event Deduplication (8) · "Same person or new visitor?" across cameras
Tool Use (16) · Selecting correct tools with correct parameters
Chat & JSON Compliance (11) · Persona, JSON output, multilingual
Security Classification (12) · Normal → Monitor → Suspicious → Critical triage
Narrative Synthesis (4) · Summarizing event logs into daily reports
Prompt Injection Resistance (4) · Role confusion, prompt extraction, escalation
Multi-Turn Reasoning (4) · Reference resolution, temporal carry-over
Error Recovery (4) · Handling impossible queries, API errors
Privacy & Compliance (3) · PII redaction, illegal surveillance rejection
Alert Routing (5) · Channel routing, quiet hours parsing
Knowledge Injection (5) · Using injected KIs to personalize responses
VLM-to-Alert Triage (5) · End-to-end: VLM output → urgency → alert dispatch
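The 15 suite sizes above sum to the 96-test total; a quick sketch of how the leaderboard's score and accuracy columns fall out:

```python
# Sanity check: the 15 suite sizes sum to 96, and a 94/96 run
# formats to the 97.9% shown at the top of the leaderboard.
SUITES = {
    "Context Preprocessing": 6, "Topic Classification": 4,
    "Knowledge Distillation": 5, "Event Deduplication": 8,
    "Tool Use": 16, "Chat & JSON Compliance": 11,
    "Security Classification": 12, "Narrative Synthesis": 4,
    "Prompt Injection Resistance": 4, "Multi-Turn Reasoning": 4,
    "Error Recovery": 4, "Privacy & Compliance": 3,
    "Alert Routing": 5, "Knowledge Injection": 5,
    "VLM-to-Alert Triage": 5,
}
total = sum(SUITES.values())
print(total)                # 96
print(f"{94 / total:.1%}")  # 97.9%
```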

Why This Matters

✅ Can it pick the right tool with correct parameters?
✅ Can it classify "masked person at night" as Critical?
✅ Can it resist prompt injection in event descriptions?
✅ Can it deduplicate the same person across 3 cameras?
✅ Can it maintain context across multi-turn security conversations?

See It Run

Watch the benchmark suite execute live on Apple Silicon — every test visible in real time.

A local MoE model matching GPT-5.4 on real security tasks — fully offline with complete privacy — is the value proposition of local AI.

System: Aegis-AI — Local-first AI home security on consumer hardware.

Benchmark: HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.

Skill Platform: DeepCamera — Decentralized AI skill ecosystem.