Why on-prem inference
For two years the LLM stack has been "use the API." OpenAI, Anthropic, Gemini — pay per token, forget about infrastructure. That worked while frontier capability was the sole differentiator and open-source models trailed by 18 months. That gap closed in 2025. The Llama 3.1 70B family, Mistral Large 2, Qwen 2.5 72B, and DeepSeek V3 deliver production-grade output for the majority of buyer workloads — agentic flows, document summarization, structured-output extraction, vector-and-rerank retrieval, conversational interfaces, code completion. The remaining 10–20% of workloads still benefit from frontier API access; the other 80–90% do not.
Once the capability gap closes, the economic argument flips:
- Cost: API tokens cost $1.50–$15 per million on the frontier providers; on-prem MLX runs $0.04–$0.85 depending on model size — a 5–15× saving on the line items where the open model is sufficient.
- Latency: API round-trips include cross-region networking (often 50–200ms of network latency before any inference begins). On-prem MLX inference runs sub-100ms p95 for 7–34B class models on M4 Pro, eliminating the network tax entirely.
- Sovereignty: prompts and completions never leave your hardware. For regulated workloads (healthcare, legal, finance, defense, EU GDPR-bound), the audit surface is a single rack instead of a vendor's TOS appendix.
- Predictability: API pricing changes (and has changed repeatedly, rate-limit clawbacks included). The amortized cost of an owned rack is predictable to the dollar once you know your traffic.
The trade-off is operational. Someone has to keep the rack honest — model placement, thermal management, eviction policy, deploy hygiene, alerting. That's the lane.
Why Mac Minis
Mac Mini M4 / M4 Pro nodes hit a price/performance band the API providers can't match
for inference of moderate-sized open models. M4 Pro 64GB unified holds the
Llama 3.1 70B family quantized at q4 with room for KV cache; M4 Max
128GB holds Mistral Large 2 q4 comfortably. The /v1/chat/completions
endpoint shape works without modification — applications that talk to OpenAI also talk
to MLX.
Cost per million tokens on a fully-amortized rack of M4 Pros runs $0.04–$0.12 for the 7–34B class and $0.30–$0.85 for the 70B class — roughly a 5–15× saving over GPT-4-class API for equivalent quality on most non-frontier tasks.
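For a sense of where those per-token figures come from, here is a back-of-envelope amortization sketch. Every input below (node price, amortization window, power cost, batched throughput, utilization) is an assumption chosen for illustration, not a measured number from a production rack:

```python
# Back-of-envelope $/M-token estimate for a single M4 Pro node.
# All inputs are illustrative assumptions, not measured figures.
node_cost_usd = 2_200            # rough street price, M4 Pro 64GB
amortization_months = 36
power_watts = 60                 # steady-state draw under inference load
power_cost_per_kwh = 0.15
batched_tokens_per_sec = 400     # aggregate throughput, small (7-8B) model, continuous batching
utilization = 0.6                # fraction of wall-clock time the node is actually serving

monthly_hardware = node_cost_usd / amortization_months             # ~$61
monthly_power = power_watts / 1000 * 24 * 30 * power_cost_per_kwh  # ~$6.50
monthly_tokens = batched_tokens_per_sec * utilization * 86_400 * 30

cost_per_million = (monthly_hardware + monthly_power) / (monthly_tokens / 1_000_000)
print(f"${cost_per_million:.2f} per million tokens")  # ~$0.11 with these inputs
```

The result moves roughly linearly with throughput and utilization, which is why the 70B class lands nearer the top of the quoted band.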
Daily — telemetry
A monitor process on each node ships structured metrics to a central R2 bucket every 60s:
- Per-model — tokens served, requests, p50/p95 latency, tokens/sec, queue depth.
- Thermal — package temp, performance-core temp, fan RPM, throttle events. An M4 Mac Mini under sustained inference will throttle from ~70°C ambient; we alert when throttle events exceed 5% of any 5-minute window.
- Memory pressure — vm_stat counters, swap usage, model-loader cache hit rate. KV cache evictions are first-class events here.
- Power — wall-meter draw if a metered PDU is wired (Enterprise tier includes one); otherwise system-reported package power.
Telemetry flows to your tenant's R2 namespace. Garnet operations has read-only access to alert state but never to your inference inputs/outputs — the model operates entirely on-prem; nothing leaves your boundary except aggregate metrics.
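A minimal sketch of what one 60-second telemetry batch could look like when shipped to R2 over its S3-compatible API. The bucket name, endpoint, credentials, and field names are illustrative assumptions, not the production schema:

```python
import json
import time
import boto3  # R2 speaks the standard S3 API, so any S3-compatible client works

# Endpoint, credentials, bucket, and node id are placeholders.
r2 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

def collect_batch(node_id: str) -> dict:
    """Assemble one 60s snapshot of the metric families listed above (values illustrative)."""
    return {
        "node": node_id,
        "ts": int(time.time()),
        "models": {
            "qwen-2.5-7b-q4": {
                "tokens_served": 14_400, "requests": 55,
                "p50_ms": 38, "p95_ms": 84, "tokens_per_sec": 240, "queue_depth": 1,
            },
        },
        "thermal": {"package_c": 68.4, "pcore_c": 71.2, "fan_rpm": 3120, "throttle_events": 0},
        "memory": {"swap_used_mb": 0, "loader_cache_hit_rate": 0.97, "kv_evictions": 1},
        "power": {"package_watts": 54.7},
    }

while True:
    batch = collect_batch("mini-03")
    r2.put_object(
        Bucket="tenant-telemetry",
        Key=f"{batch['node']}/{batch['ts']}.json",
        Body=json.dumps(batch).encode(),
    )
    time.sleep(60)
```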
Weekly — placement + cost diff
Once a week, three artifacts ship:
- Model placement diff — which model lives on which node, with rationale (memory budget, expected QPS, thermal headroom). Recommendations to rebalance if traffic shifted.
- Cost-per-million-tokens diff — per model, this week vs. last week. Anomalies flagged: a 2× spike usually means a hot prompt-cache miss or a poisoned KV cache that needs a purge (sketched after this list).
- Eviction log — top-N evicted models, with hint on whether the eviction was traffic-justified or a placement bug.
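The anomaly flag in the cost diff is a week-over-week ratio check. A sketch, assuming a per-model weekly rollup of dollars and tokens (the rollup shape is an assumption):

```python
def cost_per_million(dollars: float, tokens: int) -> float:
    return dollars / (tokens / 1_000_000)

def weekly_cost_diff(this_week: dict[str, dict], last_week: dict[str, dict]) -> list[dict]:
    """Compare per-model $/M tokens week-over-week and flag >=2x spikes.

    Rollup values are assumed to look like {"dollars": 41.20, "tokens": 380_000_000}.
    """
    rows = []
    for model, cur in this_week.items():
        prev = last_week.get(model)
        cur_cpm = cost_per_million(cur["dollars"], cur["tokens"])
        prev_cpm = cost_per_million(prev["dollars"], prev["tokens"]) if prev else None
        rows.append({
            "model": model,
            "cpm_this_week": round(cur_cpm, 3),
            "cpm_last_week": round(prev_cpm, 3) if prev_cpm is not None else None,
            # A 2x jump usually points at a prompt-cache miss storm or a poisoned KV cache.
            "anomaly": prev_cpm is not None and cur_cpm >= 2 * prev_cpm,
        })
    return rows
```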
Monthly — executive PDF
On the 1st of each month, a Workflow renders an executive PDF covering:
- Inference volume — total tokens, requests, per-model breakdown
- Thermal — peak temps, throttle event count, recommended ambient temp adjustments
- Eviction history — count + top-evicted models
- Distribution — per-node placement map, rebalance-recommended Y/N
- Cost — per-model $/M tokens, weighted-average $/M, vs.-API savings %
- Uptime — cluster uptime %, per-node uptime, incidents (with cause + fix)
- Recommendations — next-cycle placement adjustments, capacity additions, retirements
Hot-swap protocol
Models are swapped without dropping connections via a dual-loader pattern:
- Operator queues new model on the target node — loader-B starts cold.
- Health check confirms loader-B serves test prompts within p99 budget.
- Router flips traffic atomically from loader-A to loader-B.
- Loader-A drains in-flight requests then unloads.
Net effect: zero connections dropped, zero requests failed, and a ~30–90s window of elevated latency during loader-B warmup. Documented in the runbook; replicable by your team if Cluster Ops rolls off (we don't keep you locked in).
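A schematic of that sequence. The loader and router objects are placeholders for whatever process manager and routing table the cluster actually runs, so read this as the shape of the protocol rather than its implementation:

```python
def hot_swap(node, router, new_model: str, p99_budget_ms: float) -> None:
    """Dual-loader swap: warm loader-B, verify it, flip traffic, drain loader-A.

    `node` and `router` are illustrative placeholders, not real APIs.
    """
    loader_b = node.start_loader(new_model)              # 1. loader-B starts cold

    # 2. Health check: with only 20 samples, the max is a conservative stand-in for p99.
    latencies = [loader_b.serve_test_prompt() for _ in range(20)]
    if max(latencies) > p99_budget_ms:
        loader_b.unload()
        raise RuntimeError("loader-B failed the p99 health check; swap aborted")

    router.flip(node.id, to=loader_b)                    # 3. atomic traffic flip A -> B

    node.loader_a.drain(timeout_s=120)                   # 4. in-flight requests finish on A
    node.loader_a.unload()                               #    then loader-A unloads
```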
Hardened deploys
Every code change to the cluster (model upgrades, router config, monitor binary) ships via a GitHub Actions deploy pipeline against your tenant. Each deploy:
- Runs full test suite (unit + integration against a staging node)
- Canaries to one node, watches for 10 min
- If canary clean, rolls to remaining nodes one at a time
- Auto-rolls back if any node's p95 regresses >25% post-deploy
- Posts deploy summary to the ops Discord channel (Sentinel-aaS bus, if you have one)
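The rollback gate itself is a single comparison against the pre-deploy baseline. A sketch, with the telemetry read left abstract since the real pipeline pulls it from the metrics store:

```python
def should_rollback(node_id: str, baseline_p95_ms: float, fetch_p95) -> bool:
    """True when post-deploy p95 regresses more than 25% over the pre-deploy baseline.

    `fetch_p95` stands in for whatever reads the node's current p95 from telemetry;
    the 25% threshold mirrors the pipeline's gate, the 10-minute window the canary watch.
    """
    current_p95_ms = fetch_p95(node_id, window_minutes=10)
    return current_p95_ms > baseline_p95_ms * 1.25
```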
What success looks like
Across the first 90 days, Cluster Pro typically operates 2–4 nodes at <100ms p95 latency on the 7–34B class, 99.9% uptime, and ≤2-min hot-swap windows. Cost-per-million-tokens stabilizes 5–15× below GPT-4-class API for equivalent quality on tracked workloads. Cluster Scale (4–8 nodes) and Enterprise (8+ with hardware sourcing + power planning) target the same ratios at higher absolute throughput.
What it isn't
- Not model training. We don't train your foundation model. We operate the inference layer for whatever model you choose to run.
- Not data residency theater. Your inference inputs/outputs stay on-prem. Garnet operations sees aggregate metrics only — request counts, tokens, latency, thermal. Never prompt content.
- Not API-backed. If your traffic spikes past on-prem capacity, the router can fail over to OpenRouter / Anthropic / OpenAI as a configured fallback — but only if you turn that on. Default is fail-closed.
- Not GPU-cluster ops. If you need an H100 farm, this isn't the lane. Cluster Ops is specifically Mac Mini MLX (Apple Silicon) at moderate scale.
- Not a hardware vendor relationship. Pro and Scale tiers assume you own the hardware. Enterprise tier includes hardware sourcing + power-planning + rack layout, but Garnet is not a Mac reseller — we just take ownership of the operations once the rack lands.
Day 1, Day 30, Day 90
Day 1 — onboarding kickoff
- 60-min intake: traffic characterization (which workloads, which models, expected QPS, latency budget), hardware inventory (existing nodes or sourcing path), network topology
- Monitor process deployed on each node; first telemetry batch lands in your R2 within 24 hours
- Discord channel set bootstrapped: #cluster-alerts, #cluster-thermal, #cluster-monthly
- Day-1 model placement: typically the model the customer is migrating from API, quantized appropriately for the target node memory budget
- Router configured at the /v1/chat/completions shape — applications that talk to OpenAI talk to MLX without modification (see the client sketch below)
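Because the router exposes the standard /v1/chat/completions shape, an existing OpenAI integration only needs a different base URL. The host, port, key, and model id below are placeholders:

```python
from openai import OpenAI

# Point the stock OpenAI client at the on-prem router; URL, key, and model are placeholders.
client = OpenAI(
    base_url="http://cluster-router.internal:8080/v1",
    api_key="unused-on-prem",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct-q4",   # whichever model the router has placed
    messages=[{"role": "user", "content": "Summarize the attached contract clause."}],
)
print(resp.choices[0].message.content)
```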
Day 30 — first executive PDF + traffic-tuned placement
- Traffic-tuned model placement based on first 4 weeks of real load (model A might want its own node; model B might cohabit; routing rules tune accordingly)
- Hot-swap protocol exercised at least once (model upgrade, version bump, or experimental model swap)
- Cost-per-million-tokens baseline established; first month's API-equivalent cost saving surfaced
- First monthly executive PDF lands
Day 90 — cluster steady-state
- Cluster operates at <100ms p95 on tracked workloads, >99.9% uptime, throttle events under 1% of inference time
- Cost-per-million-tokens settles in the 5–15× below API band for the sufficient workloads
- Quarter-end review: capacity-add decision (scale to more nodes?), model swap proposals (try the new release of Llama / Qwen / Mistral?), tier escalation decision
FAQ
How many nodes do we need to start?
Cluster Pro starts at 2 nodes for redundancy. One node is a single point of failure (no model failover), and we don't recommend running production traffic on it. Scale tier supports 4–8 nodes; Enterprise targets 8+. Most customers ramp from 2→4→6 over the first 6 months as they migrate workloads off API.
Which models run well on Mac Mini MLX?
As of 2026-Q1: Llama 3.1 8B/70B, Llama 3.3 70B, Mistral Large 2 (123B at q4 needs M4 Max 128GB), Mistral Small 3.1 24B, Qwen 2.5 family (7B/14B/32B/72B), DeepSeek-V3 (large; needs heavy quant + 128GB), Gemma 2 9B/27B, custom fine-tunes of any of the above. We re-evaluate the supported list each quarter as new models ship and quantization tooling improves.
How does this compare to running our own GPU servers?
For inference of moderate-sized models (7–70B class): an A100/H100 server is faster throughput-per-dollar but costs 4–10× more upfront, draws 5–15× more power, and requires rack-grade cooling. Mac Mini M4 Pro at $2K-$2.5K hardware/node, with passive desktop cooling, hits the price/perf sweet spot specifically for sub-frontier-model inference. For frontier models (DeepSeek-V3 full precision, GPT-4 class): GPU is the right answer; we won't pretend otherwise.
What if we want a different host OS than macOS?
The "MLX" in Cluster Ops is specifically Apple's MLX framework, which only runs on Apple Silicon hardware (macOS or iOS). If your sovereignty story requires Linux, you're looking at NVIDIA GPU + vLLM/TGI/llama.cpp territory, which is a different lane. Cluster Ops is purpose-built for the MLX/Apple Silicon path.
What happens during a hardware failure?
Each node has a hot-swap counterpart (the redundancy is why Pro starts at 2 nodes). When a
node misses its heartbeat, the router routes around it; the alert fires in
#cluster-alerts. For Pro/Scale customers we work with you to plan replacement
hardware. For Enterprise customers we keep a cold spare on-site.
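The route-around logic is a heartbeat timeout check. A sketch, with the node registry, the timeout, and the alert hook as illustrative placeholders:

```python
import time

HEARTBEAT_TIMEOUT_S = 30   # assumption; the real threshold is router configuration

def healthy_nodes(last_beat: dict[str, float]) -> list[str]:
    """`last_beat` maps node id -> last heartbeat timestamp (epoch seconds)."""
    now = time.time()
    return [nid for nid, ts in last_beat.items() if now - ts < HEARTBEAT_TIMEOUT_S]

def routable(last_beat: dict[str, float], alert) -> list[str]:
    up = healthy_nodes(last_beat)
    for nid in set(last_beat) - set(up):
        alert(f"node {nid} missed heartbeat; routing around it")  # surfaces in #cluster-alerts
    return up   # the router only dispatches to nodes in this list
```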
What does power consumption look like?
Mac Mini M4 Pro under sustained inference draws 40–80W per node, peaks briefly at ~150W during loader init. A 4-node cluster runs ~250W steady-state, comfortably within a single 15A wall circuit. Enterprise customers running 8+ nodes get a power-planning doc covering circuit breaker layout + UPS sizing.
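The circuit arithmetic behind the single-15A-circuit claim, assuming a 120V North American circuit and the usual 80% continuous-load derating (both assumptions):

```python
circuit_amps = 15
volts = 120                                   # assumption: standard wall circuit
usable_watts = circuit_amps * volts * 0.8     # 80% continuous-load derating -> 1440W

nodes = 4
steady_w, peak_w = 65, 150                    # per-node steady-state vs. loader-init peak

steady_total = nodes * steady_w               # 260W, in line with the ~250W figure above
worst_case = (nodes - 1) * steady_w + peak_w  # 345W if one node peaks during a model load
assert worst_case < usable_watts              # comfortable headroom on one 15A circuit
```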
Can we run this air-gapped?
Mostly yes. The model artifacts + monitor binary deploy via your CI runner; once installed, inference runs without internet. The audit-and-monitoring telemetry pipeline does require outbound to your Cloudflare R2, which is the bare minimum for cross-cluster observability. Air-gapped Enterprise customers either accept the R2 dependency or run an on-prem S3-compatible blob store with a periodic sync.
Adjacent lanes
Cluster Ops is one of three production lanes. The Workflow runtime that renders monthly PDFs across all three lanes lives in the same Cloudflare worker space:
- GEO Methodology — citation engineering for the AI-search surface. The monthly executive PDF for GEO is rendered by the same Workflow infrastructure deployed in Cluster Ops.
- Audit Retainer — when the audit retainer covers a stack that includes on-prem inference, the cluster's compliance posture queries are part of the daily snapshot.
- Sentinel-aaS — node-down alerts, eviction notices, thermal-throttle warnings, and the cluster monthly PDF preview all flow through the Sentinel bus.
See Cluster Ops pricing → See the 30-day onboarding walkthrough → or talk to engineering