MLX inference operations
Llama, Mistral, Qwen, DeepSeek, custom finetunes — running on your Apple Silicon cluster. Hot-swap models, A/B routes, latency budgets, automatic failover to fallback nodes.
A rack of Mac Minis runs your inference faster than the cloud and cheaper than the API. Cluster Ops keeps that rack honest — automated eviction, log shipping, hardened deploys, alerting at 03:00 so you don't have to be.
Your hardware, your IP, your latency. Ours: the runbooks, the alerting, the eviction logic, the upgrade path.
Mac Minis throttle. Cluster Ops routes around hot nodes, evicts stale model contexts, schedules thermal recovery, and tells you when a node needs a physical visit.
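What thermal recovery scheduling could look like, in sketch form. Everything here is an illustrative assumption: the thresholds, the node names, and the `read_gpu_temp` probe (in practice a node agent would report temperature):

```python
import time

# Illustrative thresholds, not Cluster Ops defaults:
# pull a node out of rotation above HOT_C, return it below COOL_C.
HOT_C = 95.0
COOL_C = 75.0

def read_gpu_temp(node: str) -> float:
    """Hypothetical probe; in practice the node's agent reports this."""
    return 72.0  # stub value for the sketch

def schedule_thermal_recovery(nodes: dict[str, bool]) -> None:
    """Hysteresis loop: a hot node rests until it cools well below the
    trip point, so it doesn't flap in and out of the rotation."""
    for node, in_service in nodes.items():
        temp = read_gpu_temp(node)
        if in_service and temp >= HOT_C:
            nodes[node] = False   # thermal break: stop routing here
        elif not in_service and temp <= COOL_C:
            nodes[node] = True    # recovered: return to rotation

nodes = {"mini-01": True, "mini-02": True}
schedule_thermal_recovery(nodes)
```

The gap between the two thresholds is the point: a single cutoff would bounce a borderline node in and out of service on every sample.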
Every inference call, eviction event, model load, GPU temp — shipped to your S3/R2 of choice. Grafana dashboards delivered. Pager rules tuned to your tolerance.
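A minimal sketch of that shipping path. R2 speaks the S3 API, so one client covers both targets; the bucket, endpoint, key layout, and event schema below are illustrative assumptions:

```python
import json
import time
import uuid

import boto3

# R2 is S3-compatible, so the same client works for either destination.
s3 = boto3.client("s3", endpoint_url="https://<account>.r2.cloudflarestorage.com")

def ship_events(events: list[dict], bucket: str = "cluster-ops-logs") -> None:
    """Batch events into one JSON Lines object, keyed by hour so
    dashboards can scan a time range without listing the whole bucket."""
    body = "\n".join(json.dumps(e) for e in events)
    key = f"{time.strftime('%Y/%m/%d/%H')}/{uuid.uuid4()}.jsonl"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode())

ship_events([
    {"ts": time.time(), "node": "mini-01", "type": "inference", "p95_ms": 212},
    {"ts": time.time(), "node": "mini-01", "type": "eviction", "model": "llama-3.1-8b"},
])
```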
New model? New MLX version? New macOS? We test in a staging node, ship to canary, watch for regressions, then roll the cluster. Zero downtime; rollback in ≤2 minutes.
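The shape of that rollout, as a sketch. The `deploy` and `p95_latency_ms` stubs, the 10% regression gate, and the "previous" version tag are illustrative assumptions, not the production pipeline:

```python
ROLLBACK_BUDGET_S = 120  # the <=2-minute rollback promise, as a constant

def deploy(node: str, version: str) -> None:
    print(f"deploy {version} -> {node}")  # stub; real deploys drain first

def p95_latency_ms(node: str) -> float:
    return 210.0  # stub; in practice read from the metrics pipeline

def roll_cluster(nodes: list[str], version: str, baseline_p95: float) -> bool:
    """Canary first; only roll the rest if the canary holds its latency."""
    canary, rest = nodes[0], nodes[1:]
    deploy(canary, version)
    if p95_latency_ms(canary) > baseline_p95 * 1.10:  # >10% regression: stop
        deploy(canary, "previous")                    # rollback, within budget
        return False
    for node in rest:           # one node at a time, so the cluster
        deploy(node, version)   # never loses more than one node of capacity
    return True

roll_cluster(["mini-01", "mini-02", "mini-03"], "mlx-0.22", baseline_p95=220.0)
```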
Every 60 seconds: GPU temp, RAM pressure, queue depth, p95 latency, model-context staleness. Per node, aggregated across the cluster.
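One way a 60-second sample could be shaped, mirroring the metrics named above. The field set and the aggregation are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class NodeSample:
    """One 60-second sample per node."""
    node: str
    gpu_temp_c: float
    ram_pressure: float         # 0.0-1.0
    queue_depth: int
    p95_latency_ms: float
    context_staleness_s: float  # seconds since the loaded model was last hit

def aggregate(samples: list[NodeSample]) -> dict:
    """Cluster-level view: worst temperature, total queue, worst p95."""
    return {
        "max_gpu_temp_c": max(s.gpu_temp_c for s in samples),
        "total_queue_depth": sum(s.queue_depth for s in samples),
        "worst_p95_ms": max(s.p95_latency_ms for s in samples),
    }

samples = [
    NodeSample("mini-01", 78.5, 0.62, 3, 212.0, 41.0),
    NodeSample("mini-02", 91.0, 0.81, 9, 340.0, 5.0),
]
print(aggregate(samples))
```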
Inference requests routed to the warmest node with capacity. Cold nodes get a warm-up call; hot nodes get a thermal break. Latency budget enforced at the router.
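A minimal sketch of that routing decision. The `Node` record, the 500 ms budget, and the queue cap are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    warm_models: set[str]   # models currently loaded in unified memory
    queue_depth: int
    est_p95_ms: float
    thermal_break: bool

LATENCY_BUDGET_MS = 500  # illustrative budget, enforced here at the router
MAX_QUEUE = 8

def route(model: str, nodes: list[Node]) -> Node | None:
    """Prefer nodes that already hold the model (no cold load); among
    those, pick the shortest queue. A cold node is the fallback and
    gets the warm-up call."""
    eligible = [n for n in nodes
                if not n.thermal_break
                and n.queue_depth < MAX_QUEUE
                and n.est_p95_ms <= LATENCY_BUDGET_MS]
    warm = [n for n in eligible if model in n.warm_models]
    if warm:
        return min(warm, key=lambda n: n.queue_depth)
    if eligible:
        return min(eligible, key=lambda n: n.queue_depth)  # cold: warm it up
    return None  # budget unmeetable: caller sheds load or queues
```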
Stale model contexts evicted on a sliding window. Memory reclaimed. The node returns to service before the queue notices.
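Sliding-window eviction in sketch form. The 10-minute window is an illustrative assumption, and the real implementation would also free the model weights it tracks:

```python
import time
from collections import OrderedDict

class ContextCache:
    """Contexts untouched for longer than the window are dropped and
    their unified memory reclaimed."""

    def __init__(self, window_s: float = 600.0):
        self.window_s = window_s
        self._last_hit: "OrderedDict[str, float]" = OrderedDict()

    def touch(self, model: str) -> None:
        """Record a hit; recently used contexts slide to the back."""
        self._last_hit[model] = time.monotonic()
        self._last_hit.move_to_end(model)

    def evict_stale(self) -> list[str]:
        cutoff = time.monotonic() - self.window_s
        stale = [m for m, t in self._last_hit.items() if t < cutoff]
        for m in stale:
            del self._last_hit[m]  # in production: also free the MLX weights
        return stale

cache = ContextCache(window_s=600.0)
cache.touch("llama-3.1-8b")
print(cache.evict_stale())  # [] until the window has elapsed
```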
When something needs a human — a node down, a thermal event, a model regression — your on-call gets a structured page. With a runbook link. We answer first; you confirm.
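What a structured page could carry, as a sketch. The field names and the runbook URL are placeholders, not the production schema:

```python
import json
import time

def build_page(event: str, node: str, runbook: str, severity: str = "page") -> str:
    """A page is a structured payload, not a raw log line: the on-call
    lands on a runbook with diagnostics already attached."""
    return json.dumps({
        "ts": time.time(),
        "severity": severity,
        "event": event,       # e.g. "node_down", "thermal", "model_regression"
        "node": node,
        "runbook": runbook,
        "ack_required": True, # "we answer first; you confirm"
    })

print(build_page("thermal", "mini-02",
                 "https://runbooks.example.com/thermal-event"))
```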
Per-token cost. A rack of M4 Pro Minis runs 7B-class inference at ~$0.0008/1K tokens*, about 3% of OpenAI list price for comparable models. Apple Silicon's unified memory means no PCIe bottleneck on context loads. For 95% of inference workloads under 14B, Minis beat GPU economics. We co-design GPU augmentation for the 5% that don't.
* Llama-3.1-8B on M4 Pro; 7B-class workloads, single-turn
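Roughly how that figure pencils out. Every input below is an illustrative assumption (hardware price, amortization, opex, throughput, utilization); plug in your own numbers, and one plausible set lands in the same ballpark as the quoted rate:

```python
# Back-of-envelope per-token cost for a single node.
node_price_usd = 2_000        # M4 Pro Mini, illustrative
amortization_months = 36
opex_usd_per_month = 50       # power + colo share, illustrative
tokens_per_sec = 80           # 7B-class generation throughput, illustrative
utilization = 0.70            # fraction of wall-clock the node is busy

monthly_cost = node_price_usd / amortization_months + opex_usd_per_month
monthly_tokens = tokens_per_sec * utilization * 60 * 60 * 24 * 30
cost_per_1k = monthly_cost / (monthly_tokens / 1_000)
print(f"${cost_per_1k:.4f} per 1K tokens")  # ~$0.0007 with these inputs
```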
Llama 3.1 70B fine-tuned on your domain matches GPT-4 on 70-80% of the workload it actually serves. We help you identify the remaining 20-30% that needs frontier API fallback (Cluster Ops includes the routing logic) and optimize for it.
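In sketch form, that split could look like the following. The task-tag routing signal is a placeholder assumption; production routing would use whatever classifier your eval data supports:

```python
# Illustrative split between the local fine-tune and a frontier API.
LOCAL_TASKS = {"extraction", "classification", "domain_qa", "summarization"}

def route_request(task: str, prompt: str) -> str:
    """Keep the 70-80% the fine-tune wins on local; send the rest out."""
    if task in LOCAL_TASKS:
        return f"local:llama-3.1-70b-ft <- {prompt[:40]}"
    return f"fallback:frontier-api <- {prompt[:40]}"  # the 20-30%

print(route_request("domain_qa", "What does clause 4.2 cover?"))
print(route_request("novel_reasoning", "Draft a multi-step negotiation plan"))
```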
Wherever you want it. Most Pro customers run a 1-3 node cluster in their office rack. Scale customers usually colocate at a regional DC. Enterprise customers run multi-site with hybrid Cloudflare Workers AI fallback for burst.
No. We operate hardware you own. We can recommend (specifications, vendors, financing); the purchase is yours. This keeps the IP, the depreciation, and the asset on your balance sheet — which is the whole point of going on-prem.
Tested in our staging node first (we run a representative cluster), then your canary, then the rest of yours. We know which OS versions break which MLX builds because we hit those bugs before you do.
The cluster routes around it; there's no quality-of-service drop until you're down to a single node. We page your on-call with the diagnostics; you swap the Mini; we re-onboard it to the cluster. Response coverage is business hours on Pro, 24/7 on Scale, with optional Garnet on-site engineering on Enterprise.
Start Cluster Scale for $9,999/mo. First node onboarded within 5 business days.