MLX inference operations
Llama, Mistral, Qwen, DeepSeek, custom finetunes — running on your Apple Silicon cluster. Hot-swap models, A/B routes, latency budgets, automatic failover to fallback nodes.
A rack of Mac Minis runs your inference faster than the cloud and cheaper than the API. Cluster Ops keeps that rack honest — automated eviction, log shipping, hardened deploys, alerting at 03:00 so you don't have to be.
Your hardware, your IP, your latency. Ours: the runbooks, the alerting, the eviction logic, the upgrade path.
Mac Minis throttle. Cluster Ops routes around hot nodes, evicts stale model contexts, schedules thermal recovery, and tells you when a node needs a physical visit.
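What thermal recovery scheduling could look like, in sketch form. Everything here is an illustrative assumption: the thresholds, the node names, and the `read_gpu_temp` probe (in practice a node agent would report temperature):

```python
import time

# Illustrative thresholds, not Cluster Ops defaults:
# pull a node out of rotation above HOT_C, return it below COOL_C.
HOT_C = 95.0
COOL_C = 75.0

def read_gpu_temp(node: str) -> float:
    """Hypothetical probe; in practice the node's agent reports this."""
    return 72.0  # stub value for the sketch

def schedule_thermal_recovery(nodes: dict[str, bool]) -> None:
    """Hysteresis loop: a hot node rests until it cools well below the
    trip point, so it doesn't flap in and out of the rotation."""
    for node, in_service in nodes.items():
        temp = read_gpu_temp(node)
        if in_service and temp >= HOT_C:
            nodes[node] = False   # thermal break: stop routing here
        elif not in_service and temp <= COOL_C:
            nodes[node] = True    # recovered: return to rotation

nodes = {"mini-01": True, "mini-02": True}
schedule_thermal_recovery(nodes)
```

The gap between the two thresholds is the point: a single cutoff would bounce a borderline node in and out of service on every sample.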
Every inference call, eviction event, model load, GPU temp — shipped to your S3/R2 of choice. Grafana dashboards delivered. Pager rules tuned to your tolerance.
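A minimal sketch of that shipping path. R2 speaks the S3 API, so one client covers both targets; the bucket, endpoint, key layout, and event schema below are illustrative assumptions:

```python
import json
import time
import uuid

import boto3

# R2 is S3-compatible, so the same client works for either destination.
s3 = boto3.client("s3", endpoint_url="https://<account>.r2.cloudflarestorage.com")

def ship_events(events: list[dict], bucket: str = "cluster-ops-logs") -> None:
    """Batch events into one JSON Lines object, keyed by hour so
    dashboards can scan a time range without listing the whole bucket."""
    body = "\n".join(json.dumps(e) for e in events)
    key = f"{time.strftime('%Y/%m/%d/%H')}/{uuid.uuid4()}.jsonl"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode())

ship_events([
    {"ts": time.time(), "node": "mini-01", "type": "inference", "p95_ms": 212},
    {"ts": time.time(), "node": "mini-01", "type": "eviction", "model": "llama-3.1-8b"},
])
```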
New model? New MLX version? New macOS? We test in a staging node, ship to canary, watch for regressions, then roll the cluster. Zero downtime; rollback in ≤2 minutes.
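The shape of that rollout, as a sketch. The `deploy` and `p95_latency_ms` stubs, the 10% regression gate, and the "previous" version tag are illustrative assumptions, not the production pipeline:

```python
ROLLBACK_BUDGET_S = 120  # the <=2-minute rollback promise, as a constant

def deploy(node: str, version: str) -> None:
    print(f"deploy {version} -> {node}")  # stub; real deploys drain first

def p95_latency_ms(node: str) -> float:
    return 210.0  # stub; in practice read from the metrics pipeline

def roll_cluster(nodes: list[str], version: str, baseline_p95: float) -> bool:
    """Canary first; only roll the rest if the canary holds its latency."""
    canary, rest = nodes[0], nodes[1:]
    deploy(canary, version)
    if p95_latency_ms(canary) > baseline_p95 * 1.10:  # >10% regression: stop
        deploy(canary, "previous")                    # rollback, within budget
        return False
    for node in rest:           # one node at a time, so the cluster
        deploy(node, version)   # never loses more than one node of capacity
    return True

roll_cluster(["mini-01", "mini-02", "mini-03"], "mlx-0.22", baseline_p95=220.0)
```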
Every 60 seconds: GPU temp, RAM pressure, queue depth, p95 latency, model-context staleness. Per node, aggregated across the cluster.
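One way a 60-second sample could be shaped, mirroring the metrics named above. The field set and the aggregation are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class NodeSample:
    """One 60-second sample per node."""
    node: str
    gpu_temp_c: float
    ram_pressure: float         # 0.0-1.0
    queue_depth: int
    p95_latency_ms: float
    context_staleness_s: float  # seconds since the loaded model was last hit

def aggregate(samples: list[NodeSample]) -> dict:
    """Cluster-level view: worst temperature, total queue, worst p95."""
    return {
        "max_gpu_temp_c": max(s.gpu_temp_c for s in samples),
        "total_queue_depth": sum(s.queue_depth for s in samples),
        "worst_p95_ms": max(s.p95_latency_ms for s in samples),
    }

samples = [
    NodeSample("mini-01", 78.5, 0.62, 3, 212.0, 41.0),
    NodeSample("mini-02", 91.0, 0.81, 9, 340.0, 5.0),
]
print(aggregate(samples))
```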
Inference requests routed to the warmest node with capacity. Cold nodes get a warm-up call; hot nodes get a thermal break. Latency budget enforced at the router.
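A minimal sketch of that routing decision. The `Node` record, the 500 ms budget, and the queue cap are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    warm_models: set[str]   # models currently loaded in unified memory
    queue_depth: int
    est_p95_ms: float
    thermal_break: bool

LATENCY_BUDGET_MS = 500  # illustrative budget, enforced here at the router
MAX_QUEUE = 8

def route(model: str, nodes: list[Node]) -> Node | None:
    """Prefer nodes that already hold the model (no cold load); among
    those, pick the shortest queue. A cold node is the fallback and
    gets the warm-up call."""
    eligible = [n for n in nodes
                if not n.thermal_break
                and n.queue_depth < MAX_QUEUE
                and n.est_p95_ms <= LATENCY_BUDGET_MS]
    warm = [n for n in eligible if model in n.warm_models]
    if warm:
        return min(warm, key=lambda n: n.queue_depth)
    if eligible:
        return min(eligible, key=lambda n: n.queue_depth)  # cold: warm it up
    return None  # budget unmeetable: caller sheds load or queues
```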
Stale model contexts evicted on a sliding window. Memory reclaimed. The node returns to service before the queue notices.
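Sliding-window eviction in sketch form. The 10-minute window is an illustrative assumption, and the real implementation would also free the model weights it tracks:

```python
import time
from collections import OrderedDict

class ContextCache:
    """Contexts untouched for longer than the window are dropped and
    their unified memory reclaimed."""

    def __init__(self, window_s: float = 600.0):
        self.window_s = window_s
        self._last_hit: "OrderedDict[str, float]" = OrderedDict()

    def touch(self, model: str) -> None:
        """Record a hit; recently used contexts slide to the back."""
        self._last_hit[model] = time.monotonic()
        self._last_hit.move_to_end(model)

    def evict_stale(self) -> list[str]:
        cutoff = time.monotonic() - self.window_s
        stale = [m for m, t in self._last_hit.items() if t < cutoff]
        for m in stale:
            del self._last_hit[m]  # in production: also free the MLX weights
        return stale

cache = ContextCache(window_s=600.0)
cache.touch("llama-3.1-8b")
print(cache.evict_stale())  # [] until the window has elapsed
```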
When something needs a human — a node down, a thermal event, a model regression — your on-call gets a structured page. With a runbook link. We answer first; you confirm.
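What a structured page could carry, as a sketch. The field names and the runbook URL are placeholders, not the production schema:

```python
import json
import time

def build_page(event: str, node: str, runbook: str, severity: str = "page") -> str:
    """A page is a structured payload, not a raw log line: the on-call
    lands on a runbook with diagnostics already attached."""
    return json.dumps({
        "ts": time.time(),
        "severity": severity,
        "event": event,       # e.g. "node_down", "thermal", "model_regression"
        "node": node,
        "runbook": runbook,
        "ack_required": True, # "we answer first; you confirm"
    })

print(build_page("thermal", "mini-02",
                 "https://runbooks.example.com/thermal-event"))
```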
Per-token cost. A rack of M4 Pro Minis runs 7B-class inference at ~$0.0008/1K tokens*, about 3% of OpenAI list price for comparable models. Apple Silicon's unified memory means no PCIe bottleneck on context loads. For 95% of inference workloads under 14B, Minis beat GPU economics. We co-design GPU augmentation for the 5% that don't.
* Llama-3.1-8B on M4 Pro; 7B-class workloads, single-turn
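Roughly how that figure pencils out. Every input below is an illustrative assumption (hardware price, amortization, opex, throughput, utilization); plug in your own numbers, and one plausible set lands in the same ballpark as the quoted rate:

```python
# Back-of-envelope per-token cost for a single node.
node_price_usd = 2_000        # M4 Pro Mini, illustrative
amortization_months = 36
opex_usd_per_month = 50       # power + colo share, illustrative
tokens_per_sec = 80           # 7B-class generation throughput, illustrative
utilization = 0.70            # fraction of wall-clock the node is busy

monthly_cost = node_price_usd / amortization_months + opex_usd_per_month
monthly_tokens = tokens_per_sec * utilization * 60 * 60 * 24 * 30
cost_per_1k = monthly_cost / (monthly_tokens / 1_000)
print(f"${cost_per_1k:.4f} per 1K tokens")  # ~$0.0007 with these inputs
```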
Llama 3.1 70B fine-tuned on your domain matches GPT-4 on 70-80% of the workload it actually serves. We help you identify the remaining 20-30% that needs frontier API fallback (Cluster Ops includes the routing logic) and optimize for it.
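In sketch form, that split could look like the following. The task-tag routing signal is a placeholder assumption; production routing would use whatever classifier your eval data supports:

```python
# Illustrative split between the local fine-tune and a frontier API.
LOCAL_TASKS = {"extraction", "classification", "domain_qa", "summarization"}

def route_request(task: str, prompt: str) -> str:
    """Keep the 70-80% the fine-tune wins on local; send the rest out."""
    if task in LOCAL_TASKS:
        return f"local:llama-3.1-70b-ft <- {prompt[:40]}"
    return f"fallback:frontier-api <- {prompt[:40]}"  # the 20-30%

print(route_request("domain_qa", "What does clause 4.2 cover?"))
print(route_request("novel_reasoning", "Draft a multi-step negotiation plan"))
```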
Wherever you want it. Most Pro customers run a 1-3 node cluster in their office rack. Scale customers usually colocate at a regional DC. Enterprise customers run multi-site with hybrid Cloudflare Workers AI fallback for burst.
No. We operate hardware you own. We can recommend (specifications, vendors, financing); the purchase is yours. This keeps the IP, the depreciation, and the asset on your balance sheet — which is the whole point of going on-prem.
Tested in our staging node first (we run a representative cluster), then your canary, then the rest of yours. We know which OS versions break which MLX builds because we hit those bugs before you do.
The cluster routes around it; there's no quality-of-service drop until you're down to a single node. We page your on-call with the diagnostics; you swap the Mini; we re-onboard it to the cluster. Response coverage is business hours on Pro, 24/7 on Scale, with optional Garnet on-site engineering on Enterprise.
Start Cluster Scale for $9,999/mo. First node onboarded within 5 business days.