The four approaches
- Frontier API — OpenAI, Anthropic, Google Gemini API, Mistral La Plateforme. Pay per token, no infrastructure, frontier capability.
- Cloud GPU servers — AWS p4d / p5, GCP A3 / A4, Azure ND-series. NVIDIA H100 / H200 / B200 farms running vLLM, TGI, or llama.cpp. High throughput, high cost, real ops burden.
- In-house DevOps owning a Mac rack — your own engineers run the cluster. Lower hardware cost than GPU, but operational know-how is non-trivial.
- Garnet Cluster Ops — Mac Mini MLX cluster operations as a managed retainer. Same engineer every month, sub-100ms p95, monthly cost-per-token report.
What scales — and where the economics flip
| | Frontier API | Cloud GPU (H100) | In-house Mac rack | Garnet Cluster Ops |
|---|---|---|---|---|
| Frontier-class capability (GPT-4 / Claude Opus) | Yes | Yes (large open-source) | 7–70B class only | 7–70B class only |
| Cost per million tokens (7-34B class) | $1.50–$15 | $0.10–$0.40 | $0.04–$0.12 | $0.04–$0.12 |
| Cost per million tokens (70B class) | $3–$15 | $0.40–$1.20 | $0.30–$0.85 | $0.30–$0.85 |
| p95 latency (sub-frontier model) | 200–500 ms | 100–250 ms | <100 ms | <100 ms |
| Sovereignty (data stays on your hardware) | No | Cloud-region-bound | Yes | Yes |
| Hardware capex | $0 | $0 (rented) or $30K–$300K | $5K–$50K (Mac rack) | $5K–$50K (your hardware) |
| Power consumption (4-node cluster) | — | ~3–8 kW | ~250–500 W | ~250–500 W |
| Operations burden | None | 1.0+ FTE | 0.3–0.7 FTE | Handled by Garnet |
| Predictable monthly cost | Variable | Only with reserved instances | Yes | Yes |
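For a sense of where the Mac-rack per-token figures come from, the power-only arithmetic is below. Sustained throughput, node draw, and electricity price are placeholders, not measurements; amortized hardware and the ops cost sit on top of this.

```python
def power_cost_per_million_tokens(tokens_per_s: float,
                                  node_watts: float = 60.0,
                                  usd_per_kwh: float = 0.15) -> float:
    """Electricity cost per 1M generated tokens on a single node (placeholder inputs)."""
    hours_per_million = 1_000_000 / tokens_per_s / 3600
    return node_watts / 1000 * usd_per_kwh * hours_per_million

# e.g. a small (7-14B class) model sustaining ~60 tok/s on one Mac Mini:
print(f"${power_cost_per_million_tokens(60):.3f} per 1M tokens, power only")  # ~$0.042
```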
Where each approach wins
Frontier API is right when
- The workload genuinely needs frontier capability — multi-step agentic flows, complex reasoning, code generation at GPT-4 / Claude Opus level. Open-source models don't yet match frontier models on these workloads as of Q1 2026.
- Volume is small enough that the per-token premium is rounding error. Under ~$2K/month of API spend, the operational simplicity beats any infrastructure savings.
- The team has zero ops capacity and won't for 12+ months. API is the right answer until ops capacity exists.
Cloud GPU is right when
- You need to run frontier-class open models (DeepSeek-V3 full-precision, Llama 3.1 405B) that exceed Apple Silicon memory budgets. M4 Max tops out around 128GB unified — fine for 70B q4, not enough for 405B at any usable quantization (see the memory arithmetic after this list).
- Throughput-per-rack matters more than power-per-rack. A single H100 server matches 8–12 Mac Minis on tokens/second for the 70B class. If you're rate-limited, GPU wins.
- You already have NVIDIA infrastructure — vLLM/TGI Kubernetes deployments, NIM, Triton Inference Server — and adding a Mac fleet would fragment your ops.
- You're training or fine-tuning, not just inferring. Apple Silicon training tooling is behind NVIDIA's stack by a wide margin.
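The memory ceiling above is easy to check with back-of-envelope arithmetic. The sketch counts only quantized weights plus a flat KV-cache allowance and ignores activations and OS overhead, so treat it as a lower bound.

```python
def min_memory_gb(params_billion: float, bits_per_weight: int, kv_cache_gb: float = 8.0) -> float:
    """Rough lower bound on unified memory: quantized weights + a KV-cache allowance."""
    weights_gb = params_billion * bits_per_weight / 8      # 1B params at 8 bits = ~1 GB
    return weights_gb + kv_cache_gb

print(f"70B  q4: ~{min_memory_gb(70, 4):.0f} GB")   # ~43 GB -> fits in 64-128 GB unified memory
print(f"405B q4: ~{min_memory_gb(405, 4):.0f} GB")  # ~210 GB -> exceeds any single Apple Silicon box
```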
In-house Mac rack is right when
- You have a senior engineer with strong Apple Silicon + LLM ops experience already on payroll, with bandwidth for ongoing rack maintenance. The role is roughly 0.3–0.7 FTE at steady state.
- Your latency budget is tight (<100ms p95 for the 7–34B class), and you've decided that on-prem MLX hits that band best. The deployment work is straightforward; the operations work is what compounds over time.
- Your sovereignty story requires that no one outside the team — even Garnet — has operational access. A few regulated industries fall here.
Garnet Cluster Ops is right when
- You've identified that on-prem MLX is the right deployment for your sub-frontier workloads (cost, latency, sovereignty), but you don't want the ops headcount.
- Your scale is in the 2–8 node range — too small for a dedicated platform team, too large for a side project. The Garnet retainer is roughly 0.25–0.4 FTE in cost with senior-engineer output.
- You want the monthly cost-per-token report and capacity-plan recommendations as deliverables, not as another thing the team builds.
- You want the hot-swap protocol, automated eviction, and hardened deploy pipeline pre-built rather than authored from scratch.
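As context for what "pre-built" means here, this is a minimal sketch of the drain / atomic-flip / unload pattern. All names (`HotSwapServer`, `load_model`, `.generate()`, `.unload()`) are illustrative, not Garnet's actual implementation.

```python
import threading
import time


class _Slot:
    """A loaded model plus a count of requests currently pinned to it."""
    def __init__(self, model):
        self.model = model
        self.in_flight = 0


class HotSwapServer:
    """Illustrative drain -> atomic flip -> unload pattern for model upgrades."""

    def __init__(self, load_model):
        self._load_model = load_model   # callable: name -> object with .generate() / .unload()
        self._lock = threading.Lock()
        self._active = _Slot(None)      # replaced by the first hot_swap()

    def serve(self, prompt: str) -> str:
        with self._lock:                # pin the request to whichever slot is active now
            slot = self._active
            slot.in_flight += 1
        try:
            return slot.model.generate(prompt)
        finally:
            with self._lock:
                slot.in_flight -= 1

    def hot_swap(self, new_name: str, poll_s: float = 0.1) -> None:
        new_slot = _Slot(self._load_model(new_name))          # load B while A keeps serving
        with self._lock:
            old_slot, self._active = self._active, new_slot   # atomic flip: new traffic hits B
        while True:                                           # drain requests still pinned to A
            with self._lock:
                if old_slot.in_flight == 0:
                    break
            time.sleep(poll_s)
        if old_slot.model is not None:
            old_slot.model.unload()                           # free unified memory only after drain
```

The point is ordering: new traffic flips to the new model atomically, and the old weights are unloaded only after every in-flight request against them has finished.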
The economic flip points
Three thresholds matter for the API → on-prem decision:
- ~$5K/month of API spend: at this point, a 2-node Mac Mini cluster starts to break even on a 12-month amortization. Below it, API is operationally cheaper.
- ~$15K/month of API spend: a 4-node cluster running the right quantizations of Llama 70B / Mistral Large is paying for itself plus a managed retainer at Cluster Pro / Scale tier within 6 months.
- ~$50K/month of API spend: the inverse of "frontier-only" is true — most of that spend is workloads where 70B-class open models are sufficient, and the on-prem savings start to be material to leadership. This is also where Cluster Enterprise (8+ nodes, hardware sourcing included) typically lands.
These numbers are workload-dependent — RAG inference is cheaper per token than agentic multi-step generation; structured-output extraction is cheaper than long-context summary — but the order of magnitude is right. We see most customers move on-prem when API spend crosses ~$10K/month with steady traffic.
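The arithmetic behind those thresholds is simple enough to sanity-check against your own bill. The figures below (hardware cost, retainer, power) are placeholders, not Garnet pricing.

```python
def months_to_break_even(api_spend_per_month: float,
                         hardware_capex: float,
                         retainer_per_month: float,
                         power_and_misc_per_month: float = 150.0) -> float:
    """Months until cumulative on-prem savings cover the hardware capex.

    Assumes the cluster absorbs the same workload the API spend represents;
    all inputs are placeholders to be replaced with your own numbers.
    """
    monthly_savings = api_spend_per_month - retainer_per_month - power_and_misc_per_month
    if monthly_savings <= 0:
        return float("inf")   # on-prem never pays back at this spend level
    return hardware_capex / monthly_savings

# ~$5K/month API spend, 2-node rack (~$5K capex), small retainer: ~14 months
print(round(months_to_break_even(5_000, hardware_capex=5_000, retainer_per_month=4_500), 1))
# ~$15K/month API spend, 4-node rack (~$10K capex), larger retainer: ~1 month
print(round(months_to_break_even(15_000, hardware_capex=10_000, retainer_per_month=6_000), 1))
```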
Why specifically Apple Silicon
Cluster Ops is opinionated about hardware. The reasons:
- Unified memory — M4 Pro 64GB and M4 Max 128GB hold 70B-class models quantized at q4 with usable KV cache. NVIDIA equivalent (A100 80GB or H100 80GB) is 20–60× more expensive per node.
- Power profile — 40–80W steady-state under inference. A 4-node Mac Mini cluster runs on a single 15A wall circuit. Equivalent NVIDIA hardware needs rack-grade power and cooling.
- MLX framework maturity — Apple's MLX has caught up to vLLM / llama.cpp on the inference path for the 7–70B class. Quantization tooling (mlx-lm, exo, mlx_omni_server) is production-ready; a minimal load-and-generate sketch follows this list.
- OS predictability — macOS is a stable target. The kernel scheduler behaves predictably under sustained load. Driver thrash (a real NVIDIA-on-Linux concern) isn't a factor.
- Real-world thermal headroom — a Mac Mini under sustained inference will eventually thermal-throttle; Cluster Ops alerts when throttling exceeds 5% of any 5-minute window. Manageable in a typical office or small server closet.
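To make the tooling claim above concrete, here is roughly what a single-node load-and-generate looks like with mlx-lm. The model repo name is just an example of a pre-quantized community build, not a recommendation; swap in whatever your workload needs.

```python
# pip install mlx-lm  (Apple Silicon / macOS only)
from mlx_lm import load, generate

# Example pre-quantized community build; any MLX-format repo works here.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

prompt = "Summarize the trade-offs of on-prem LLM inference in two sentences."
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```

The serving layer (batching, an OpenAI-compatible endpoint, the cluster scheduler) sits on top of this; the point is that the load-and-generate path itself is a few lines.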
For frontier-class models or training workloads, Apple Silicon is the wrong tool. We don't pretend otherwise. Cluster Ops is specifically scoped to on-prem MLX at moderate scale.
How to evaluate any inference-deployment vendor
- Who owns the hardware? If the vendor owns it, you're locked into their pricing. If you own it (the Garnet pattern), you're locked into nothing — the rack outlives the vendor relationship.
- What's the hot-swap protocol? Model upgrades happen monthly. A vendor without a documented hot-swap pattern (drain in-flight, atomic flip, loader-A unload) is going to drop production traffic on every upgrade.
- How is thermal handled? If the answer is "we don't monitor thermals," your throttle events show up as unexplained latency spikes. Cluster Ops treats thermal monitoring as first-class; a minimal throttle check follows this list.
- What's the deploy pipeline? Manual SSH-and-restart is fine for one node, broken for four. Canary → watch → rolling deploy with auto-rollback should be standard.
- What's the cost-per-token reporting cadence? Monthly is the minimum. Without it, you're flying blind on whether on-prem is actually saving you money.
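To make the thermal question concrete, a minimal throttle check on macOS can be built on `pmset -g therm`. The parsing below assumes the output includes a `CPU_Speed_Limit = <n>` line (true on the macOS versions we have seen, but verify on your fleet); the threshold mirrors the 5%-of-a-5-minute-window rule mentioned earlier, and the alert hook is a placeholder.

```python
import re
import subprocess
import time
from collections import deque


def cpu_speed_limit() -> int:
    """Current CPU speed limit (%) from `pmset -g therm`; 100 means no throttling.

    Assumes a `CPU_Speed_Limit = <n>` line in the output; adjust for your macOS version.
    """
    out = subprocess.run(["pmset", "-g", "therm"], capture_output=True, text=True).stdout
    match = re.search(r"CPU_Speed_Limit\s*=\s*(\d+)", out)
    return int(match.group(1)) if match else 100


def watch_throttle(window_s: int = 300, interval_s: int = 5, max_frac: float = 0.05) -> None:
    """Alert when throttled samples exceed 5% of a rolling 5-minute window."""
    samples = deque(maxlen=window_s // interval_s)
    while True:
        samples.append(cpu_speed_limit() < 100)     # True while macOS is limiting clock speed
        if len(samples) == samples.maxlen and sum(samples) / len(samples) > max_frac:
            print("ALERT: thermal throttling >5% of the last 5 minutes")  # replace with your alert bus
        time.sleep(interval_s)
```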
Adjacent lanes
If your team is also evaluating other lanes:
- GEO vs SEO / AI-SEO — citation engineering for AI-search visibility. Many GEO customers run their own LLM inference on Cluster Ops for cost reasons.
- Audit Retainer vs Big-4 vs in-house — when the audit covers a stack with on-prem inference, Cluster Ops is the operational layer the audit retainer also tracks.
- Sentinel-aaS vs Zapier / PagerDuty — Discord-resident operations bus. Routes Cluster Ops node-down + thermal-throttle alerts.
See Cluster Ops pricing → Read the full methodology → or talk to engineering