Day 0 — Stripe checkout + hardware question
You complete checkout
From the Cluster Ops page you click "Start Cluster
Scale — $9,999/mo" (or Pro / Enterprise). Stripe handles payment.
garnetgrid-fulfillment creates your envelope at
garnet-tokens/cluster/<slug>.json and sends a welcome email.
The welcome email asks the lane-specific question: What hardware are you running? Pro and Scale tiers assume you already own the rack. Enterprise includes hardware sourcing: if you don't have nodes yet, the engineer scopes procurement during the intake call.
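The envelope's schema isn't shown publicly; a plausible shape, with every field name an illustrative guess:

```ts
// Hypothetical contents of garnet-tokens/cluster/<slug>.json.
// Field names are guesses for illustration, not the fulfillment schema.
interface ClusterEnvelope {
  slug: string;                          // cluster identifier in the object key
  tier: "pro" | "scale" | "enterprise";  // lane chosen at checkout
  stripeSubscriptionId: string;          // ties the envelope back to billing
  createdAt: string;                     // ISO-8601 creation timestamp
  hardwareSourcing: boolean;             // true when Enterprise procurement is in scope
}
```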
Day 1 — Intake call + hardware inventory
Workload characterization, hardware inventory, network + access plan
- Workload characterization — which workloads, which models, expected QPS, latency budget, traffic shape (bursty / steady / scheduled).
- Hardware inventory — current rack: how many M4 / M4 Pro / M4 Max nodes, RAM per node, network topology (10GbE preferred, 1GbE acceptable for smaller deployments).
- Network + access plan — Cloudflare Tunnel for outbound telemetry, your firewall rules, SSH key provisioning for engineer access.
- Initial model selection — which model loads first. Typical Day-1 deployment: the model the customer is migrating off a hosted API. Quantization is chosen against the target node's memory budget (rough arithmetic sketched after this list).
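The memory-budget arithmetic behind that quantization choice looks roughly like this; the 1.2× overhead factor and 85% headroom cutoff are assumed rules of thumb, not Garnet-published constants:

```ts
// Rough check: does a model at a given quantization fit a node's RAM?
// paramsB is parameter count in billions; bitsPerWeight is the quantization.
function fitsNode(paramsB: number, bitsPerWeight: number, nodeRamGB: number): boolean {
  const weightsGB = (paramsB * bitsPerWeight) / 8; // billions of params -> GB
  const workingSetGB = weightsGB * 1.2;            // assumed KV-cache/activation overhead
  return workingSetGB < nodeRamGB * 0.85;          // leave headroom for macOS + monitor
}

// A 70B model at 4-bit is ~35 GB of weights (~42 GB working set):
console.log(fitsNode(70, 4, 64)); // true: fits a 64 GB M4 Max
console.log(fitsNode(70, 4, 36)); // false: does not fit a 36 GB node
```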
Monitor process deploys to each node
The Garnet engineer SSHes in (using the keys provisioned during intake) and runs the deploy script on each node.
Once deployed, the monitor process ships structured metrics to your R2 bucket every 60 seconds (a payload sketch follows the list):
- Per-model: tokens served, requests, p50/p95 latency, tokens/sec, queue depth
- Thermal: package temp, performance-core temp, fan RPM, throttle events
- Memory pressure: vm_stat counters, swap usage, model-loader cache hit rate, KV cache evictions
- Power: wall-meter draw if a metered PDU is wired, otherwise system-reported package power
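A plausible shape for one per-node telemetry record, mirroring the groups above; the field names are illustrative, not the monitor's actual schema:

```ts
// One 60-second record as it might land in R2. Names are assumptions.
interface TelemetryRecord {
  node: string;      // e.g. "node-2"
  timestamp: string; // ISO-8601
  models: Array<{
    id: string;
    tokensServed: number;
    requests: number;
    latencyMsP50: number;
    latencyMsP95: number;
    tokensPerSec: number;
    queueDepth: number;
  }>;
  thermal: {
    packageTempC: number;
    pCoreTempC: number;
    fanRpm: number;
    throttleEvents: number;
  };
  memory: {
    vmStat: Record<string, number>; // raw vm_stat counters
    swapUsedGB: number;
    loaderCacheHitRate: number;     // 0-1
    kvCacheEvictions: number;
  };
  power: {
    source: "pdu" | "package";      // metered PDU vs. system-reported
    watts: number;
  };
}
```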
Day 2 — First model loads, traffic begins
Initial model placement
The chosen Day-1 model loads on its target nodes. The router exposes an
OpenAI-compatible endpoint at /v1/chat/completions, so applications that
already talk to OpenAI talk to MLX without modification (example below).
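Cutover is a one-line change in the client: point the stock OpenAI SDK at the cluster router. The base URL and model name here are placeholders for your deployment:

```ts
import OpenAI from "openai";

// Same SDK, same calls: only the base URL changes at cutover.
const client = new OpenAI({
  baseURL: "http://inference.internal.example:8080/v1", // placeholder router address
  apiKey: "unused", // the SDK requires a key even if the router ignores it
});

const completion = await client.chat.completions.create({
  model: "your-day-1-model", // placeholder model id
  messages: [{ role: "user", content: "Smoke test after cutover." }],
});
console.log(completion.choices[0].message.content);
```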
First traffic flows. Telemetry confirms p50/p95 latency, tokens/sec, no throttle
events. The dashboard at /account/cluster/<slug> updates with
the first cost-per-token data point.
Day 7 — First weekly placement diff
Traffic-tuned model placement
The first week of real load reveals the actual traffic shape. The engineer reviews it and may rebalance:
- Model A's QPS is higher than projected — give it its own node, evict the secondary model that's been cohabiting.
- Model B's KV-cache hit rate is low — its workload is mostly long-context summarization; increase the cache budget.
- Node-2's throttle-event rate exceeds 3% — relocate one model off it to balance heat load.
Three weekly artifacts ship: the model placement diff (shape sketched below), the cost-per-million-tokens diff, and the eviction log.
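A plausible shape for the placement diff; fields are illustrative guesses, not the artifact's actual schema:

```ts
// Hypothetical weekly placement diff: what moved, where, and why.
interface PlacementDiff {
  week: string; // e.g. "2026-W07"
  moves: Array<{
    model: string;
    from: string | null; // null when the model is newly loaded
    to: string | null;   // null when the model is evicted
    reason: string;      // e.g. "qps-over-projection", "thermal-balance"
  }>;
}
```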
Day 14 — Hot-swap protocol exercise
First hot-swap: 0 connections dropped
Models get updated monthly (a new release, a refined quantization, a swap from one family to another). The hot-swap protocol runs as follows (the flip step is sketched after this list):
- Operator queues new model on target node — loader-B starts cold
- Health check confirms loader-B serves test prompts within p99 budget
- Router flips traffic atomically from loader-A to loader-B
- Loader-A drains in-flight requests then unloads
Net effect: 0 connections dropped, 0 requests failed, and a ~30–90s window of elevated latency during loader-B warmup. The procedure is documented in the runbook and is replicable by your team if Cluster Ops rolls off (we don't keep you locked in).
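A minimal sketch of the flip, assuming the router keys traffic off a single active-loader reference per model; the types and names are illustrative, not Garnet's implementation:

```ts
// Each loader can health-check itself, drain in-flight work, and unload.
type Loader = {
  name: string;
  healthy: () => Promise<boolean>; // serves test prompts within p99 budget?
  drain: () => Promise<void>;      // resolves once in-flight requests finish
  unload: () => void;
};

const active = new Map<string, Loader>(); // model id -> loader taking traffic

async function hotSwap(modelId: string, loaderB: Loader): Promise<void> {
  // Health gate: if loader-B fails, loader-A keeps serving untouched.
  if (!(await loaderB.healthy())) {
    throw new Error(`${loaderB.name} failed health check; flip aborted`);
  }
  // Atomic flip: one reference swap, so no request ever sees a gap.
  const loaderA = active.get(modelId);
  active.set(modelId, loaderB);
  // Drain loader-A's in-flight requests, then free its memory.
  if (loaderA) {
    await loaderA.drain();
    loaderA.unload();
  }
}
```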
Day 30 — First executive PDF
Cluster-monthly Workflow renders + emails the PDF
On the 1st of the next month, the garnet-cluster-monthly Workflow fires.
It aggregates the month's telemetry, computes uptime, surfaces thermal anomalies,
runs the cost-per-token math, then renders the PDF (workflow shape sketched below).
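The "-Workflow" naming suggests Cloudflare Workflows as the substrate; a sketch under that assumption, with the step names, bindings, key layout, and aggregation helper all illustrative:

```ts
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

type Env = { TELEMETRY: R2Bucket }; // assumed binding to the telemetry bucket
type Params = { slug: string };     // cluster slug passed at trigger time

export class GarnetClusterMonthly extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Each step.do() is durable: a failure retries the step, not the month.
    const keys = await step.do("list-telemetry", async () => {
      const listed = await this.env.TELEMETRY.list({
        prefix: `telemetry/${event.payload.slug}/`, // assumed key layout
      });
      return listed.objects.map((o) => o.key);
    });
    const report = await step.do("aggregate", async () => aggregate(keys));
    await step.do("render-and-email", async () => {
      // PDF rendering + email delivery elided in this sketch.
      console.log(`report ready for ${event.payload.slug}`, report);
    });
  }
}

// Placeholder: the real aggregation computes uptime, thermal anomalies,
// and cost per token from the listed records.
function aggregate(keys: string[]): { recordCount: number } {
  return { recordCount: keys.length };
}
```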
The PDF covers:
- Inference volume — total tokens, requests, per-model breakdown
- Thermal — peak temps, throttle event count, recommended ambient temp adjustments
- Eviction history — count + top-evicted models
- Distribution — per-node placement map, rebalance-recommended Y/N
- Cost — per-model $/M tokens, weighted-average $/M, vs.-API savings % (worked example after this list)
- Uptime — cluster uptime %, per-node uptime, incidents (with cause + fix)
- Recommendations — next-cycle placement adjustments, capacity additions, retirements
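The cost math reduces to fixed monthly dollars divided by tokens served. Every input below is an assumed illustration, not a quoted figure:

```ts
// All-in $/M tokens: retainer + power + amortization over the month's volume.
const retainerUsd = 9_999;          // Scale tier retainer
const powerUsd = 220;               // assumed monthly wall power for the rack
const hardwareAmortUsd = 1_250;     // assumed rack cost spread over 36 months
const tokensServed = 2_400_000_000; // assumed month's total from telemetry

const usdPerMTokens =
  (retainerUsd + powerUsd + hardwareAmortUsd) / (tokensServed / 1e6);
console.log(usdPerMTokens.toFixed(2)); // "4.78" all-in $/M tokens

// vs.-API savings %: 1 - (all-in $/M / hosted-API $/M) for the same model.
```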
Steady-state cadence from here: 60-second telemetry shipping, weekly placement diff, monthly executive PDF. Hot-swaps as warranted.
What you DON'T see in this walkthrough
- No cloud-vendor lock-in — your hardware, your inference, your data. Garnet operates the rack, doesn't own it. Cancellation removes operational access; the hardware keeps running.
- No SaaS observability bill — telemetry lives in your R2, queryable via your tools (Grafana, custom dashboards, or just the Garnet monthly PDF).
- No token-counted billing — your cost is the monthly retainer plus your existing power + hardware amortization. Inference itself doesn't have a per-token line item.
See Cluster Ops pricing → Read the methodology → Compare vs alternatives →