Local AI vs Cloud
The Benchmark
Gemma 4 26B-A4B scores 97.9% — matching GPT-5.4 — running entirely on a MacBook Pro M5 at 59.9 tok/s, 414ms TTFT. Zero API costs. Full data privacy. All local.
MacBook Pro M5 · M5 Pro · 18 cores · 64 GB Unified Memory · macOS 15.3 (arm64) · llama.cpp b8638
Local LLM Deep Dive
Evaluated on Apple M-Series hardware. Models run via either llama.cpp (GGUF) or our SwiftLM (MLX) engine. Performance varies by system load; speeds reflect real-world on-device conditions.
| # | Engine | Model | Accuracy | Speed (tok/s) | Avg TTFT |
|---|---|---|---|---|---|
| 🥇 | GGUF | Gemma-4-26B-A4B-it-Q4_K_M (MoE) | 97.9% | 59.9 | 414ms |
| 🥇 | GGUF | Gemma-4-31B-it-Q4_K_M | 97.9% | 11.9 | 2.4s |
| 🥉 | GGUF | Qwen3.5-27B-UD-Q8_K_XL | 95.8% | 6.3 | 2.3s |
| 4 | GGUF | Qwen3.5-9B-BF16 | 94.8% | 12.5 | 626ms |
| 5 | GGUF | Qwen3.5-9B-Q4_K_M | 93.8% | 25 | 765ms |
| 5 | GGUF | Qwen3.5-27B-Q4_K_M | 93.8% | 10 | 2.2s |
| 7 | GGUF | Qwen3.5-122B-A10B-UD-IQ1_M (MoE) | 92.7% | 18 | 1.6s |
| 7 | MLX | Qwen3.5-9B-MLX-4bit | 92.7% | 40 | 315ms |
| 7 | MLX | Qwen3.5-27B-4bit | 92.7% | 13.7 | 1.3s |
| 7 | MLX | Qwen3.5-35B-A3B-4bit (MoE) | 92.7% | 62.8 | 391ms |
| 11 | GGUF | Qwen3.5-35B-A3B-UD-Q4_K_L (MoE) | 91.7% | 41.9 | 435ms |
| 12 | GGUF | Gemma-4-E4B-it-Q4_K_M | 90.6% | 56.9 | 356ms |
| 12 | MLX | Qwen3.5-9B-8bit | 90.6% | 22.6 | 471ms |
| 14 | GGUF | Mistral-Small-4-119B-Q2_K_XL (MoE) | 89.6% | 43.1 | 1.0s |
| 14 | MLX | Qwen3.5-35B-A3B-4bit (older run) | 89.6% | 24.3 | 446ms |
| 16 | GGUF | NVIDIA Nemotron-3-Nano-4B-Q4_K_M | 87.5% | 50 | 727ms |
| 17 | GGUF | Mistral-Small-4-119B-UD-IQ1_M (MoE) | 82.3% | 44 | 997ms |
| 18 | GGUF | NVIDIA Nemotron-30B-A3B-Q8_0 (MoE) | 81.2% | 39.9 | 1.1s |
| 19 | GGUF | Qwen3.5-27B (earlier Q4_K_M run) | 77.1% | 11.3 | 4.0s |
| 20 | GGUF | LFM2-24B-A2B-Q8_0 (MoE) | 75.0% | 68 | 535ms |
| 21 | MLX | GPT-OSS-20B-MXFP4-Q8 | 69.8% | 39.6 | 712ms |
| 22 | GGUF | LFM2.5-1.2B-Instruct-BF16 | 64.6% | 84.4 | 153ms |
| 23 | MLX | Qwen2.5-3B-Instruct-4bit | 59.4% | 89.6 | 78ms |
| 24 | MLX | Qwen3.5-27B-Claude-Distilled-4bit | 57.3% | 53.9 | 2.3s |
| 24 | GGUF | LFM2.5-350M-BF16 ⚡ glue | 57.3% | 220.8 | 54ms |
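Every GGUF row above was served locally through llama.cpp's bundled OpenAI-compatible server. A rough sketch of that setup (the model path and flag values here are illustrative, not the exact benchmark configuration):

```python
import subprocess

# llama.cpp's llama-server exposes an OpenAI-compatible API on --port.
# -ngl 99 offloads all layers to the GPU (Metal on Apple Silicon);
# the model path is a hypothetical local file.
server = subprocess.Popen([
    "llama-server",
    "-m", "models/gemma-4-26b-a4b-it-q4_k_m.gguf",
    "--port", "8080",
    "-ngl", "99",
    "-c", "4096",
])
# ... run the suite against http://localhost:8080/v1, then:
# server.terminate()
```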
Gemma 4 Ties GPT-5.4
The 26B-A4B MoE matches GPT-5.4 at 97.9% — running locally at 59.9 t/s with 414ms TTFT. The first local model to reach the cloud ceiling.
MoE vs Dense: 5× Faster
Gemma 4 26B-A4B activates only 4B params per token, matching the 31B Dense model's accuracy while decoding 5× faster (59.9 vs 11.9 t/s).
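The mechanism behind that gap: an MoE router selects only a few experts per token, so decode cost tracks active parameters rather than total parameters. A toy sketch of top-k gating (purely illustrative; the expert count, shapes, and router below are invented, not Gemma 4's actual architecture):

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy top-k MoE layer: route token x to k of len(experts) expert MLPs.

    Only the k selected experts run, so per-token compute scales with
    active params (the "A4B" in 26B-A4B), not total params.
    """
    scores = gate_w @ x                      # router logits, one per expert
    top_k = np.argsort(scores)[-k:]          # indices of the k best experts
    weights = np.exp(scores[top_k])
    weights /= weights.sum()                 # softmax over selected experts only
    # A dense layer would run ALL experts; MoE runs just k of them.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Hypothetical shapes: 8 experts, 2 active -- ~4x less compute per token.
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.standard_normal((d, d)) / d: W @ x for _ in range(8)]
gate_w = rng.standard_normal((8, d))
y = moe_layer(rng.standard_normal(d), gate_w, experts)
```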
Multi-Family Depth
Four model families now clear 87% on the suite: Gemma 4 (97.9%), Qwen3.5 (95.8%), Mistral (89.6%), and Nemotron (87.5%), proving local AI is no longer a single-vendor story.
🧩 System Controller
LFM2.5-350M hits 221 tok/s with Tool Use 81% & JSON Compliance 91% — fast enough to control the full event pipeline (routing, dispatch, tool selection) at 54ms TTFT, bundled under 1 GB with the app.
LFM2.5-350M — The System Controller
350M params · <1 GB RAM · ships bundled with the app · always-on tool caller
Controller-Critical Suite Performance
The Embedded Controller Pattern
LFM2.5-350M: routes, dispatches, calls tools · 9B model: reasoning & classification
LFM2.5-350M handles the structured fast-path. The 9B model handles the reasoning. Together they cover the full HomeSec-Bench suite.
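In code, the pattern is a two-tier router: the 350M controller emits a label, fast-path actions stay on the small model, and everything else escalates. A minimal sketch assuming two local OpenAI-compatible endpoints (the ports, model names, and labels are illustrative, not the shipped Aegis-AI controller):

```python
from openai import OpenAI

# Two local OpenAI-compatible endpoints (ports are illustrative).
controller = OpenAI(base_url="http://localhost:8081/v1", api_key="none")  # LFM2.5-350M
reasoner = OpenAI(base_url="http://localhost:8082/v1", api_key="none")    # 9B model

FAST_PATH = {"route_event", "dispatch_alert", "select_tool"}

def handle(event: str) -> str:
    # Step 1: the 350M controller classifies the event into an action label.
    label = controller.chat.completions.create(
        model="lfm2.5-350m",
        messages=[
            {"role": "system", "content": "Reply with exactly one label: "
             "route_event, dispatch_alert, select_tool, or needs_reasoning."},
            {"role": "user", "content": event},
        ],
    ).choices[0].message.content.strip()

    # Step 2: structured fast-path stays on the 350M model (54ms TTFT);
    # anything needing real reasoning escalates to the 9B model.
    client, model = (controller, "lfm2.5-350m") if label in FAST_PATH \
        else (reasoner, "qwen3.5-9b")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": event}],
    )
    return reply.choices[0].message.content
```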
Cloud LLM Deep Dive
Every major cloud provider, tested head-to-head on the same 96-test HomeSec-Bench suite. Metrics include accuracy, real-world Time to First Token, and decode throughput — no cherry picking.
| # | Provider | Model | Accuracy | Speed (tok/s) | Avg TTFT |
|---|---|---|---|---|---|
| 🥇 | OpenAI | gpt-5.4-2026-03-05 | 97.9% | 73.4 | 601ms |
| 🥈 | Anthropic | claude-opus-4-20250514 | 96.9% | 1.8 | 1.3s |
| 🥉 | OpenAI | gpt-5.4-mini-2026-03-17 | 95.8% | 234.5 | 553ms |
| 🥉 | Alibaba Cloud | qwen3.6-plus | 95.8% | 20.9 | 1.1s |
| 🥉 | Alibaba Cloud | qwen3-max | 95.8% | 5.9 | 1.2s |
| 6 | Anthropic | claude-sonnet-4-20250514 | 94.8% | 2.6 | 1.2s |
| 7 | MiniMax | MiniMax-M2.7-highspeed | 93.8% | 3 | 1.5s |
| 7 | Moonshot AI | kimi-k2-0905-preview | 93.8% | 62.5 | 2.2s |
| 7 | Anthropic | claude-opus-4-6 | 93.8% | 2.3 | 1.9s |
| 10 | OpenAI | gpt-5.4-nano-2026-03-17 | 92.7% | 136.4 | 508ms |
| 10 | MiniMax | MiniMax-M2.5-highspeed | 92.7% | 2.8 | 1.8s |
| 10 | Alibaba Cloud | qwen-plus | 92.7% | 11.7 | 535ms |
| 10 | Anthropic | claude-haiku-4-5 | 92.7% | 5.3 | 530ms |
| 10 | Anthropic | claude-sonnet-4-6 | 92.7% | 2.6 | 1.4s |
| 10 | Other Cloud | grok-4-1-fast-non-reasoning | 92.7% | 496.1 | 447ms |
| 16 | DeepSeek | deepseek-chat | 91.7% | 21.4 | 1.5s |
| 16 | MiniMax | MiniMax-M2.7 | 91.7% | 1.5 | 3.8s |
| 18 | MiniMax | MiniMax-M2.5 | 89.6% | 199.1 | 3.2s |
| 18 | Alibaba Cloud | qwen-flash | 89.6% | 28.5 | 398ms |
| 20 | Alibaba Cloud | qwen3.5-plus | 85.4% | 16.3 | 1.3s |
| 21 | Alibaba Cloud | qwen3.5-flash | 84.4% | 26.7 | 804ms |
| 22 | MiniMax | MiniMax-M2.1 | 80.2% | 195.7 | 4.5s |
| 23 | Moonshot AI | kimi-k2-thinking ⚠ thinking | 72.9% | 28.3 | 1.1s |
| 24 | Moonshot AI | kimi-k2-turbo-preview | 68.8% | 134.9 | 773ms |
| 25 | DeepSeek | deepseek-reasoner ⚠ thinking | 65.6% | 34.5 | 1.1s |
| 26 | OpenAI | gpt-5-mini-2025-08-07 | 62.5% | 72.6 | 7.2s |
Frontier Dominance
GPT-5.4 and Anthropic Opus 4 top the chart, but Alibaba's Qwen3-max is practically tied at 95.8% — a new axis of competition.
The Speed Anomaly
Grok-4-1-fast hits 496 tok/s with Tier-2 accuracy (89/96). Anthropic endpoints manage only 2–5 tok/s: high accuracy, low throughput.
The "Thinking" Penalty
kimi-k2-thinking (70/96) and deepseek-reasoner (63/96) collapse. Chain-of-thought tokens break strict JSON — a hard lesson for structured tasks.
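The failure is mechanical rather than a reasoning gap: if the grader requires the entire completion to parse as JSON, any leaked chain-of-thought preamble fails the test even when the JSON inside it is perfect. A minimal sketch of that kind of strict check (our grader's exact rules may differ):

```python
import json

def grade_strict_json(completion: str) -> bool:
    """Pass only if the ENTIRE completion is one valid JSON document."""
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

# A structured-output model answers with bare JSON and passes:
grade_strict_json('{"urgency": "critical", "action": "dispatch_alert"}')  # True

# A thinking model that leaks chain-of-thought before the JSON fails,
# even though the trailing JSON is well-formed:
grade_strict_json('Let me think: a person at 3am is unusual... '
                  '{"urgency": "critical", "action": "dispatch_alert"}')  # False
```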
Performance: Local vs Cloud
The Qwen3.5-35B-MoE has a lower TTFT than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.
Time to First Token (avg) · lower is better
Decode Speed (tok/s) · higher is better
GPU Memory Usage (Local Models)
What is HomeSec-Bench?
A benchmark we created to evaluate LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.
All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.
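Because every run, local or cloud, goes through the same OpenAI-compatible streaming API, TTFT and decode speed fall out of one timing loop. A minimal sketch of such a measurement (the endpoint URL, model name, and chunk-per-token approximation are assumptions, not the benchmark's exact harness):

```python
import time
from openai import OpenAI

# Works against a cloud API or a local llama.cpp / MLX server alike.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def measure(prompt: str, model: str = "local-model"):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # marks TTFT
            n_tokens += 1  # approximation: one streamed chunk ~ one token
    elapsed = time.perf_counter() - (first_token_at or start)
    ttft_ms = (first_token_at - start) * 1000 if first_token_at else float("nan")
    return ttft_ms, n_tokens / elapsed if elapsed > 0 else 0.0

ttft, tps = measure("Is a person lingering at the door at 3am suspicious?")
print(f"TTFT {ttft:.0f}ms, decode {tps:.1f} tok/s")
```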
Context Preprocessing (6 tests)
Deduplicating conversations, preserving system messages
Topic Classification (4 tests)
Routing queries to the right domain
Knowledge Distillation (5 tests)
Extracting durable facts from conversations
Event Deduplication (8 tests)
"Same person or new visitor?" across cameras
Tool Use (16 tests)
Selecting correct tools with correct parameters
Chat & JSON Compliance (11 tests)
Persona, JSON output, multilingual
Security Classification (12 tests)
Normal → Monitor → Suspicious → Critical triage
Narrative Synthesis (4 tests)
Summarizing event logs into daily reports
Prompt Injection Resistance (4 tests)
Role confusion, prompt extraction, escalation
Multi-Turn Reasoning (4 tests)
Reference resolution, temporal carry-over
Error Recovery (4 tests)
Handling impossible queries, API errors
Privacy & Compliance (3 tests)
PII redaction, illegal surveillance rejection
Alert Routing (5 tests)
Channel routing, quiet hours parsing
Knowledge Injection (5 tests)
Using injected KIs to personalize responses
VLM-to-Alert Triage (5 tests)
End-to-end: VLM output → urgency → alert dispatch (see the triage sketch below)
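To make the last suite concrete, here's a toy version of the VLM-to-Alert fast path: a four-level urgency ladder mapped onto alert channels (the field names and channel policy are invented for illustration; they are not the fixture schema):

```python
from enum import Enum

class Urgency(Enum):  # the four-level triage ladder from the suite above
    NORMAL = 0
    MONITOR = 1
    SUSPICIOUS = 2
    CRITICAL = 3

# Hypothetical channel policy: where each urgency level gets routed.
CHANNELS = {
    Urgency.NORMAL: [],                        # log only
    Urgency.MONITOR: ["app_feed"],
    Urgency.SUSPICIOUS: ["app_feed", "push"],
    Urgency.CRITICAL: ["push", "sms", "siren"],
}

def triage(vlm_output: dict) -> list[str]:
    """End-to-end fast path: VLM scene description -> urgency -> channels."""
    urgency = Urgency[vlm_output["urgency"].upper()]
    return CHANNELS[urgency]

# One plausible test shape (illustrative fields, not the real fixtures):
assert triage({"scene": "person at door, 3am", "urgency": "critical"}) \
    == ["push", "sms", "siren"]
```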
Why This Matters
See It Run
Watch the benchmark suite execute live on Apple Silicon — every test visible in real time.
A local MoE model matching GPT-5.4 on real security tasks — fully offline with complete privacy — is the value proposition of local AI.
System: Aegis-AI — Local-first AI home security on consumer hardware.
Benchmark: HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.
Skill Platform: DeepCamera — Decentralized AI skill ecosystem.