Cluster Ops methodology · v1 · 2026

How on-prem MLX gets operated.

A working description of how Garnet runs a Mac Mini MLX cluster — from cold-boot to model hot-swap to monthly cost-per-token report. Your hardware, our pager.


Why on-prem inference

For two years the LLM stack has been "use the API." OpenAI, Anthropic, Gemini — pay per token, forget about infrastructure. That worked while frontier capability was the sole differentiator and open-source models trailed by 18 months. That gap closed in 2025. The Llama 3.1 70B family, Mistral Large 2, Qwen 2.5 72B, and DeepSeek V3 deliver production-grade output for the majority of buyer workloads — agentic flows, document summarization, structured-output extraction, vector-and-rerank retrieval, conversational interfaces, code completion. The remaining 10–20% of workloads still benefit from frontier API access; the other 80–90% do not.

Once the capability gap closes, the economic argument flips: the cost-per-token figures in the next section quantify how far.

The trade-off is operational. Someone has to keep the rack honest — model placement, thermal management, eviction policy, deploy hygiene, alerting. That's the lane.

Why Mac Minis

Mac Mini M4 / M4 Pro nodes hit a price/performance band the API providers can't match for inference of moderate-sized open models. M4 Pro 64GB unified holds the Llama 3.1 70B family quantized at q4 with room for KV cache; M4 Max 128GB holds Mistral Large 2 q4 comfortably. The /v1/chat/completions endpoint shape works without modification — applications that talk to OpenAI also talk to MLX.
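To make the drop-in claim concrete, here is a minimal sketch using the official openai Python client against a locally served MLX model. The node address, port, and model ID are placeholders, and the serving process is assumed to be an OpenAI-compatible MLX server such as mlx_lm.server; neither is specified by this methodology.

```python
# Sketch: pointing an existing OpenAI-client application at an on-prem MLX node.
# Host, port, and model name are illustrative placeholders.
from openai import OpenAI

# Same SDK the app already uses for api.openai.com -- only base_url changes.
client = OpenAI(
    base_url="http://cluster-node-1.internal:8080/v1",  # hypothetical node/router address
    api_key="unused",  # local endpoint; no OpenAI key required
)

response = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-70B-Instruct-4bit",  # example quantized model id
    messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The only change from an API-backed deployment is the base_url; request and response shapes stay the same.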

Cost per million tokens on a fully-amortized rack of M4 Pros runs $0.04–$0.12 for the 7–34B class and $0.30–$0.85 for the 70B class — roughly a 5–15× saving over GPT-4-class API for equivalent quality on most non-frontier tasks.
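The cost figure is straightforward amortization arithmetic. The sketch below shows the calculation with illustrative inputs; the hardware price and power draw echo figures quoted later in this document, while the amortization window, electricity rate, and aggregate throughput are assumptions of this example, not measurements from the methodology.

```python
# Back-of-the-envelope cost-per-million-tokens for one node.
# All inputs are illustrative assumptions; substitute your own measurements.
HARDWARE_USD = 2_250            # M4 Pro node, mid-range of the $2K-$2.5K figure below
AMORTIZATION_YEARS = 3          # assumed depreciation window
POWER_WATTS = 60                # steady-state draw under load (40-80 W range below)
ELECTRICITY_USD_PER_KWH = 0.15  # assumed utility rate
AGG_TOKENS_PER_SEC = 300        # hypothetical aggregate throughput, 7-34B class, batched

hours_per_year = 24 * 365
hourly_hw = HARDWARE_USD / (AMORTIZATION_YEARS * hours_per_year)
hourly_power = (POWER_WATTS / 1000) * ELECTRICITY_USD_PER_KWH
hourly_total = hourly_hw + hourly_power

tokens_per_hour = AGG_TOKENS_PER_SEC * 3600
cost_per_million = hourly_total / tokens_per_hour * 1_000_000
print(f"~${cost_per_million:.3f} per million tokens")
```

With these particular inputs the result lands near $0.09 per million tokens, inside the quoted 7–34B range; real numbers depend on utilization, batching, and model size.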

Daily — telemetry

A monitor process on each node ships structured metrics to a central R2 bucket every 60s:
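The metric schema isn't enumerated in this document. As a rough sketch of what one 60-second tick could look like, assuming a standard S3-compatible client pointed at the tenant's R2 bucket, with hypothetical field, bucket, and key names:

```python
# Sketch of a per-node telemetry tick. Field names, bucket name, and key layout
# are hypothetical; the real monitor's schema isn't reproduced in this document.
import json, socket, time
import boto3  # R2 speaks the S3 API, so a standard S3 client works

r2 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # tenant R2 endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

payload = {
    "node": socket.gethostname(),
    "ts": int(time.time()),
    "tokens_per_sec": 212.4,      # example aggregate throughput
    "mem_used_gb": 41.7,          # unified memory in use (weights + KV cache)
    "p95_latency_ms": 83,
    "thermal_pressure": "nominal",
}

r2.put_object(
    Bucket="tenant-telemetry",
    Key=f"metrics/{payload['node']}/{payload['ts']}.json",
    Body=json.dumps(payload).encode(),
)
```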

Telemetry flows to your tenant's R2 namespace. Garnet operations have read-only access to alert state but never to your inference inputs/outputs — the model operates entirely on-prem; nothing leaves your boundary except aggregate metrics.

Weekly — placement + cost diff

Once a week, three artifacts ship:

Monthly — executive PDF

On the 1st of each month, a Workflow renders an executive PDF covering:

Hot-swap protocol

Models are swapped without dropping connections via a dual-loader pattern:
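A minimal sketch of that dual-loader shape, assuming mlx_lm for model loading and a single lock-guarded reference flip on the serving path; the runbook's exact steps are not reproduced here:

```python
# Sketch of the dual-loader idea: warm the incoming model ("loader B") in the
# background, flip the serving reference, then release the outgoing model
# ("loader A"). Names, warmup prompt, and drain handling are illustrative.
import gc
import threading

from mlx_lm import load, generate

_active = {"model": None, "tokenizer": None}
_swap_lock = threading.Lock()

def current_model():
    # Serving path reads the active pair under the same lock as the swap.
    with _swap_lock:
        return _active["model"], _active["tokenizer"]

def hot_swap(new_model_path: str) -> None:
    # 1. Load and warm the new model while the old one keeps serving traffic.
    model_b, tok_b = load(new_model_path)
    generate(model_b, tok_b, prompt="warmup", max_tokens=8)  # pre-touch weights

    # 2. Flip: new requests immediately see loader B.
    with _swap_lock:
        old = dict(_active)
        _active["model"], _active["tokenizer"] = model_b, tok_b

    # 3. Release loader A's unified memory once in-flight requests drain.
    #    (Recent MLX releases also expose a buffer-cache clear helper.)
    del old
    gc.collect()
```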

Net impact: zero connections dropped, zero requests failed, and a ~30–90s window of elevated latency while loader B warms up. The procedure is documented in the runbook and replicable by your team if Cluster Ops rolls off (we don't keep you locked in).

Hardened deploys

Every code change to the cluster (model upgrades, router config, monitor binary) ships via a GitHub Actions deploy pipeline against your tenant. Each deploy:

What success looks like

Across the first 90 days, Cluster Pro typically operates 2–4 nodes at <100ms p95 latency on the 7–34B class, 99.9% uptime, and ≤2-minute hot-swap windows. Cost per million tokens stabilizes at 5–15× below GPT-4-class API pricing for equivalent quality on tracked workloads. Cluster Scale (4–8 nodes) and Enterprise (8+ nodes, with hardware sourcing and power planning) target the same ratios at higher absolute throughput.

What it isn't

Day 1, Day 30, Day 90

Day 1 — onboarding kickoff

Day 30 — first executive PDF + traffic-tuned placement

Day 90 — cluster steady-state

FAQ

How many nodes do we need to start?

Cluster Pro starts at 2 nodes for redundancy. One node is a single point of failure (no model failover), and we don't recommend running production traffic on it. Scale tier supports 4–8 nodes; Enterprise targets 8+. Most customers ramp from 2→4→6 over the first 6 months as they migrate workloads off API.

Which models run well on Mac Mini MLX?

As of 2026-Q1: Llama 3.1 8B/70B, Llama 3.3 70B, Mistral Large 2 (123B at q4 needs M4 Max 128GB), Mistral Small 3.1 24B, Qwen 2.5 family (7B/14B/32B/72B), DeepSeek-V3 (large; needs heavy quant + 128GB), Gemma 2 9B/27B, custom fine-tunes of any of the above. We re-evaluate the supported list each quarter as new models ship and quantization tooling improves.

How does this compare to running our own GPU servers?

For inference of moderate-sized models (7–70B class): an A100/H100 server delivers better throughput per dollar but costs 4–10× more upfront, draws 5–15× more power, and requires rack-grade cooling. Mac Mini M4 Pro at $2K–$2.5K hardware per node, with quiet desktop cooling, hits the price/performance sweet spot specifically for sub-frontier-model inference. For frontier models (DeepSeek-V3 full precision, GPT-4 class): GPU is the right answer; we won't pretend otherwise.

What if we want a different host OS than macOS?

The "MLX" in Cluster Ops is specifically Apple's MLX framework, which only runs on Apple Silicon hardware (macOS or iOS). If your sovereignty story requires Linux, you're looking at NVIDIA GPU + vLLM/TGI/llama.cpp territory, which is a different lane. Cluster Ops is purpose-built for the MLX/Apple Silicon path.

What happens during a hardware failure?

Each node has a hot-swap counterpart (the redundancy is why Pro starts at 2 nodes). When a node misses its heartbeat, the router routes around it; the alert fires in #cluster-alerts. For Pro/Scale customers we work with you to plan replacement hardware. For Enterprise customers we keep a cold spare on-site.
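The router's implementation isn't shown in this document; as a rough illustration of the route-around behavior, a heartbeat table with a staleness cutoff captures the idea. The timeout value and function names below are assumptions of this sketch.

```python
# Sketch of heartbeat-based node exclusion. Threshold and node naming are
# illustrative; the production router's logic isn't reproduced here.
import time

HEARTBEAT_TIMEOUT_S = 90          # assumed: miss ~1.5 telemetry intervals and you're out
last_seen: dict[str, float] = {}  # node -> unix timestamp of last heartbeat

def record_heartbeat(node: str) -> None:
    last_seen[node] = time.time()

def healthy_nodes() -> list[str]:
    now = time.time()
    return [n for n, ts in last_seen.items() if now - ts < HEARTBEAT_TIMEOUT_S]

def pick_node(nodes_for_model: list[str]) -> str | None:
    # Route only to nodes that both host the model and are currently healthy;
    # an empty result is the point where the #cluster-alerts notification fires.
    candidates = [n for n in nodes_for_model if n in healthy_nodes()]
    return candidates[0] if candidates else None
```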

What does power consumption look like?

Mac Mini M4 Pro under sustained inference draws 40–80W per node and peaks briefly at ~150W during loader init. A 4-node cluster runs ~250W steady-state, comfortably within a single 15A wall circuit (1,800W at 120V, or 1,440W at the 80% continuous-load rating). Enterprise customers running 8+ nodes get a power-planning doc covering circuit-breaker layout + UPS sizing.

Can we run this air-gapped?

Mostly yes. The model artifacts + monitor binary deploy via your CI runner; once installed, inference runs without internet. The audit-and-monitoring telemetry pipeline does require outbound to your Cloudflare R2, which is the bare minimum for cross-cluster observability. Air-gapped Enterprise customers either accept the R2 dependency or run an on-prem S3-compatible blob store with a periodic sync.

Adjacent lanes

Cluster Ops is one of three production lanes. The Workflow runtime that renders monthly PDFs across all three lanes lives in the same Cloudflare Workers space:

See Cluster Ops pricing →   See the 30-day onboarding walkthrough →   or talk to engineering