ON-PREM MLX · SUB-100MS P95

Local. Sovereign. Operated.

A rack of Mac Minis runs your inference faster than the cloud and cheaper than the API. Cluster Ops keeps that rack honest: automated eviction, log shipping, hardened deploys, and the 03:00 page answered so you don't have to be awake for it.

01 / WHAT CLUSTER OPS DOES

Sovereign inference, run by people who built clusters

Your hardware, your IP, your latency. Ours: the runbooks, the alerting, the eviction logic, the upgrade path.

A1

MLX inference operations

Llama, Mistral, Qwen, DeepSeek, custom finetunes — running on your Apple Silicon cluster. Hot-swap models, A/B routes, latency budgets, automatic failover to fallback nodes.
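
To make the shape of this concrete, here is a minimal sketch of a weighted A/B route table with a fallback node, in Python. The route name, node names, models, weights, and budget are illustrative placeholders, not the production schema.

    import random

    # Illustrative route table: one alias, two weighted backends for an A/B split,
    # plus a fallback node. Names, models, and weights are placeholders.
    ROUTES = {
        "chat-default": {
            "backends": [
                {"node": "mini-01", "model": "llama-3.1-8b-instruct", "weight": 90},
                {"node": "mini-02", "model": "qwen2.5-7b-instruct", "weight": 10},
            ],
            "fallback_node": "mini-03",   # takes traffic if both backends are unhealthy
            "latency_budget_ms": 100,     # p95 budget enforced at the router
        },
    }

    def pick_backend(route_name: str, healthy: set) -> dict:
        """Weighted A/B pick among healthy backends; fall back if none qualify."""
        route = ROUTES[route_name]
        candidates = [b for b in route["backends"] if b["node"] in healthy]
        if not candidates:
            return {"node": route["fallback_node"], "model": route["backends"][0]["model"]}
        weights = [b["weight"] for b in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    print(pick_backend("chat-default", healthy={"mini-01", "mini-02"}))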

A2

Automated eviction + thermal management

Mac Minis throttle. Cluster Ops routes around hot nodes, evicts stale model contexts, schedules thermal recovery, and tells you when a node needs a physical visit.
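
A rough sketch of the thermal policy this implies; the thresholds and recovery window below are illustrative stand-ins, not tuned defaults.

    import time

    # Illustrative thresholds; real limits depend on enclosure, airflow, and workload.
    THROTTLE_TEMP_C = 95.0
    RESUME_TEMP_C = 80.0
    RECOVERY_SECONDS = 300

    cooling_until = {}   # node -> earliest timestamp at which it may rejoin rotation

    def thermal_state(node: str, gpu_temp_c: float) -> str:
        """Decide whether a node keeps serving, cools down, or rejoins rotation."""
        now = time.time()
        if gpu_temp_c >= THROTTLE_TEMP_C:
            cooling_until[node] = now + RECOVERY_SECONDS   # pull it out of rotation
            return "cooling"
        if node in cooling_until:
            if now < cooling_until[node] or gpu_temp_c > RESUME_TEMP_C:
                return "cooling"
            del cooling_until[node]                        # recovered: back in rotation
        return "serving"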

A3

Log shipping + observability

Every inference call, eviction event, model load, GPU temp — shipped to your S3/R2 of choice. Grafana dashboards delivered. Pager rules tuned to your tolerance.
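
A minimal sketch of the shipping side, assuming boto3 against an S3-compatible endpoint (which is how R2 is addressed); the endpoint, bucket, and key layout are placeholders.

    import gzip
    import json
    import time

    import boto3  # the same client works for S3 or any S3-compatible endpoint such as R2

    # Placeholder destination; the real endpoint, bucket, and credentials are yours.
    s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")

    def ship_batch(events, bucket="cluster-ops-logs"):
        """Write a batch of inference/eviction/thermal events as gzipped JSON lines."""
        key = f"events/{time.strftime('%Y/%m/%d')}/{int(time.time())}.jsonl.gz"
        body = gzip.compress("\n".join(json.dumps(e) for e in events).encode())
        s3.put_object(Bucket=bucket, Key=key, Body=body, ContentType="application/gzip")
        return key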

A4

Hardened deploys + upgrades

New model? New MLX version? New macOS? We test on a staging node, ship to a canary, watch for regressions, then roll the cluster. Zero downtime; rollback in ≤2 minutes.
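
In outline, the canary gate looks something like the sketch below; the deploy, rollback, and probe callables, the thresholds, and the watch window are illustrative stand-ins, not the production tooling.

    import time

    CANARY_WINDOW_S = 600      # watch the canary for 10 minutes (illustrative)
    P95_BUDGET_MS = 100.0
    ERROR_BUDGET = 0.01

    def roll_cluster(nodes, new_version, deploy, rollback, probe):
        """Deploy to one canary, watch for regressions, then roll the rest or roll back.

        deploy/rollback/probe are injected callables (hypothetical here);
        probe(node) returns {"p95_ms": float, "error_rate": float}.
        """
        canary, rest = nodes[0], nodes[1:]
        deploy(canary, new_version)
        deadline = time.time() + CANARY_WINDOW_S
        while time.time() < deadline:
            m = probe(canary)
            if m["p95_ms"] > P95_BUDGET_MS or m["error_rate"] > ERROR_BUDGET:
                rollback(canary)               # regression: back out inside the rollback window
                return False
            time.sleep(30)
        for node in rest:                      # canary held: roll the remaining nodes
            deploy(node, new_version)
        return True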

02 / THE LOOP

Probe · Route · Evict · Page

01

Probe

Every 60 seconds: GPU temp, RAM pressure, queue depth, p95 latency, model-context staleness. Per node, aggregated across the cluster.
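
A minimal sketch of that probe loop, assuming each node exposes a hypothetical /health endpoint returning those fields; hostnames, port, and field names are placeholders.

    import json
    import time
    import urllib.request

    NODES = ["mini-01.local", "mini-02.local", "mini-03.local"]   # placeholder hostnames

    def probe(node):
        """Pull one health sample from a node's (hypothetical) /health endpoint."""
        with urllib.request.urlopen(f"http://{node}:8080/health", timeout=5) as resp:
            return json.load(resp)   # gpu_temp_c, ram_pressure, queue_depth, p95_ms, ctx_age_s

    def probe_loop():
        while True:
            samples = {}
            for node in NODES:
                try:
                    samples[node] = probe(node)
                except OSError:
                    samples[node] = {"down": True}    # a failed probe is itself a signal
            fleet_p95 = max(s.get("p95_ms", 0.0) for s in samples.values())  # crude aggregate
            print(json.dumps({"ts": time.time(), "fleet_p95_ms": fleet_p95, "nodes": samples}))
            time.sleep(60)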

02

Route

Inference requests routed to the warmest node with capacity. Cold nodes get a warm-up call; hot nodes get a thermal break. Latency budget enforced at the router.
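
A sketch of the selection rule, assuming probe samples like the ones above; field names are illustrative and the scoring is deliberately simplified.

    def choose_node(nodes, model, budget_ms=100.0):
        """Pick the warmest healthy node with headroom; None means shed load or fall back.

        Each entry is a probe sample with illustrative fields: name, loaded_models,
        queue_depth, max_queue, p95_ms, state ("serving" / "cooling" / "down").
        """
        eligible = [
            n for n in nodes
            if n["state"] == "serving"
            and n["queue_depth"] < n["max_queue"]
            and n["p95_ms"] <= budget_ms
        ]
        if not eligible:
            return None
        # "Warmest" = model already resident in memory, then shortest queue.
        eligible.sort(key=lambda n: (model not in n["loaded_models"], n["queue_depth"]))
        return eligible[0]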

03

Evict

Stale model contexts evicted on a sliding window. Memory reclaimed. The node returns to service before the queue notices.
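
A minimal sketch of sliding-window eviction over session-keyed contexts; the 15-minute window and the cache shape are illustrative, not tuned defaults.

    import time
    from collections import OrderedDict

    STALE_AFTER_S = 900   # illustrative sliding window, not a tuned default

    class ContextCache:
        """Model contexts keyed by session; anything idle past the window gets evicted."""

        def __init__(self):
            self._contexts = OrderedDict()   # session_id -> (last_used, context)

        def touch(self, session_id, ctx):
            self._contexts[session_id] = (time.time(), ctx)
            self._contexts.move_to_end(session_id)        # most recently used at the tail

        def evict_stale(self):
            cutoff = time.time() - STALE_AFTER_S
            evicted = 0
            for sid, (last_used, _) in list(self._contexts.items()):
                if last_used >= cutoff:
                    break                                 # ordered oldest-first; rest are fresh
                del self._contexts[sid]
                evicted += 1
            return evicted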

04

Page

When something needs a human — a node down, a thermal event, a model regression — your on-call gets a structured page. With a runbook link. We answer first; you confirm.
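
The page itself is just structured data plus a runbook link. A sketch, assuming a generic webhook receiver; the URL and field names are placeholders you wire to whatever pager you run.

    import json
    import urllib.request

    PAGER_WEBHOOK = "https://example.invalid/pager"   # placeholder; point at your pager's webhook

    def page(severity, summary, node, runbook, details):
        """Send a structured page: what broke, where, and which runbook to open."""
        payload = {
            "severity": severity,    # maps to your escalation policy
            "summary": summary,      # e.g. "mini-02 thermal event, pulled from rotation"
            "node": node,
            "runbook": runbook,      # deep link the on-call opens straight from the page
            "details": details,
        }
        req = urllib.request.Request(
            PAGER_WEBHOOK,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req, timeout=5)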

03 / PRICING

Three tiers. Per-cluster, fixed-fee.

Cluster Pro
$3,999/mo
1-3 Mac Minis
  • Up to 3 nodes monitored
  • 2 models managed (hot-swap)
  • Daily ops summary
  • Business-hours pager
  • Quarterly upgrade window
Start Cluster Pro
Cluster Scale
$9,999/mo
4-11 nodes
  • Up to 11 nodes monitored
  • 24/7 pager
  • Regional DC colocation supported
Start Cluster Scale
Cluster Enterprise
$24,999/mo
12+ nodes / multi-site
  • Unlimited nodes · multi-site
  • Hybrid cloud + on-prem
  • Dedicated cluster engineer + technical PM
  • 24/7 pager · 5-min ack SLA
  • Continuous deploys + per-customer model routes
  • Quarterly capacity planning
  • SOC 2 + custom compliance
Talk to engineering →
<100ms
P95 inference latency*
85%
Cost reduction vs API
99.9%
Uptime at Scale
≤2min
Rollback after canary

* Llama-3.1-8B on M4 Pro; 7B-class workloads, single-turn

04 / QUESTIONS

The ones SREs ask

Why Mac Mini? Why not GPUs?

Per-token cost. A rack of M4 Pro Minis runs 7B-class inference at ~$0.0008/1K tokens — about 3% of OpenAI list price for comparable models. Apple Silicon's unified memory means no PCIe bottleneck on context loads. For 95% of inference workloads under 14B, Minis beat GPU economics. We co-design GPU augmentation for the 5% that don't.
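
For the arithmetic behind a per-1K-token figure, the sketch below divides all-in monthly cost by monthly token throughput; every input is an illustrative placeholder, so swap in your own fleet size, fee, power rate, and measured throughput.

    # Every input below is an illustrative placeholder; swap in your own numbers.
    ops_fee_per_month = 9_999.0            # fixed-fee tier, example
    hardware_amortized = 12_000.0 / 36     # e.g. six Minis amortized over 36 months
    power_per_month = 219 * 0.15           # kWh x $/kWh, example
    tokens_per_month = 13_000_000_000      # measured cluster throughput, example

    monthly_total = ops_fee_per_month + hardware_amortized + power_per_month
    cost_per_1k = monthly_total / (tokens_per_month / 1_000)
    print(f"${cost_per_1k:.4f} per 1K tokens")   # ~ $0.0008 with these placeholders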

What about model quality / open-weight gap with frontier?

Llama 3.1 70B fine-tuned on your domain matches GPT-4 on 70-80% of the workload it actually serves. We help you identify which 20-30% needs frontier API fallback (Cluster Ops includes the routing logic) and optimize for it.
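
The routing logic can be as simple as a tag gate. The sketch below is illustrative; the tag set and target names are assumptions, and real routing is tuned per workload.

    # Illustrative gate: requests tagged as frontier-only leave the cluster, the rest stay local.
    FRONTIER_TAGS = {"long-context-synthesis", "multistep-tool-use", "novel-domain"}

    def route_request(request_tags):
        return "frontier-api" if request_tags & FRONTIER_TAGS else "onprem-cluster"

    print(route_request({"faq", "short-answer"}))    # onprem-cluster
    print(route_request({"multistep-tool-use"}))     # frontier-api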

Where does the cluster physically live?

Wherever you want it. Most Pro customers run a 1-3 node cluster in their office rack. Scale customers usually colocate at a regional DC. Enterprise customers do multi-site with hybrid Cloudflare Workers AI fallback for burst.

Do you sell us the hardware?

No. We operate hardware you own. We advise on specifications, vendors, and financing; the purchase is yours. That keeps the IP, the depreciation, and the asset on your balance sheet, which is the whole point of going on-prem.

How do you handle macOS updates?

Tested on our staging node first (we run a representative cluster), then on your canary, then rolled to the rest of your cluster. We know which OS versions break which MLX builds because we hit those bugs before you do.

What happens if a Mini fails physically?

The cluster routes around it (no quality-of-service drop until you're down to a single remaining node). We page your on-call with the diagnostics; you swap the Mini; we re-onboard it to the cluster. Pro gets business-hours response, Scale gets 24/7, and Enterprise adds optional Garnet on-site engineering.

Your hardware. Our pager.

Start Cluster Scale for $9,999/mo. First node onboarded within 5 business days.