Day 0 — Stripe checkout + hardware question
You complete checkout
From the Cluster Ops page you click "Start Cluster
Scale — $9,999/mo" (or Pro / Enterprise). Stripe handles payment.
garnetgrid-fulfillment creates your envelope at
garnet-tokens/cluster/<slug>.json and sends a welcome email.
The welcome email asks the lane-specific question: What hardware are you running? Pro and Scale tiers assume you already own the rack. Enterprise includes hardware sourcing: if you don't have nodes yet, the engineer scopes procurement during the intake call.
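The envelope's schema isn't shown publicly; a plausible shape, with every field name an illustrative guess:

```ts
// Hypothetical contents of garnet-tokens/cluster/<slug>.json.
// Field names are guesses for illustration, not the fulfillment schema.
interface ClusterEnvelope {
  slug: string;                          // cluster identifier in the object key
  tier: "pro" | "scale" | "enterprise";  // lane chosen at checkout
  stripeSubscriptionId: string;          // ties the envelope back to billing
  createdAt: string;                     // ISO-8601 creation timestamp
  hardwareSourcing: boolean;             // true when Enterprise procurement is in scope
}
```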
Day 1 — Intake call + hardware inventory
Workload characterization, hardware inventory, network + access plan
- Workload characterization — which workloads, which models, expected QPS, latency budget, traffic shape (bursty / steady / scheduled).
- Hardware inventory — current rack: how many M4 / M4 Pro / M4 Max nodes, RAM per node, network topology (10GbE preferred, 1GbE acceptable for smaller deployments).
- Network + access plan — Cloudflare Tunnel for outbound telemetry, your firewall rules, SSH key provisioning for engineer access.
- Initial model selection — which model loads first. Typical Day-1 deployment: the model the customer is migrating off a hosted API. Quantization is chosen against the target node's memory budget (rough arithmetic sketched after this list).
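The memory-budget arithmetic behind that quantization choice looks roughly like this; the 1.2× overhead factor and 85% headroom cutoff are assumed rules of thumb, not Garnet-published constants:

```ts
// Rough check: does a model at a given quantization fit a node's RAM?
// paramsB is parameter count in billions; bitsPerWeight is the quantization.
function fitsNode(paramsB: number, bitsPerWeight: number, nodeRamGB: number): boolean {
  const weightsGB = (paramsB * bitsPerWeight) / 8; // billions of params -> GB
  const workingSetGB = weightsGB * 1.2;            // assumed KV-cache/activation overhead
  return workingSetGB < nodeRamGB * 0.85;          // leave headroom for macOS + monitor
}

// A 70B model at 4-bit is ~35 GB of weights (~42 GB working set):
console.log(fitsNode(70, 4, 64)); // true: fits a 64 GB M4 Max
console.log(fitsNode(70, 4, 36)); // false: does not fit a 36 GB node
```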
Monitor process deploys to each node
The Garnet engineer SSHes in (using the keys provisioned during intake) and runs the deploy script on each node.
Once deployed, the monitor process ships structured metrics to your R2 bucket every 60 seconds (a payload sketch follows the list):
- Per-model: tokens served, requests, p50/p95 latency, tokens/sec, queue depth
- Thermal: package temp, performance-core temp, fan RPM, throttle events
- Memory pressure: vm_stat counters, swap usage, model-loader cache hit rate, KV cache evictions
- Power: wall-meter draw if a metered PDU is wired, otherwise system-reported package power
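A plausible shape for one per-node telemetry record, mirroring the groups above; the field names are illustrative, not the monitor's actual schema:

```ts
// One 60-second record as it might land in R2. Names are assumptions.
interface TelemetryRecord {
  node: string;      // e.g. "node-2"
  timestamp: string; // ISO-8601
  models: Array<{
    id: string;
    tokensServed: number;
    requests: number;
    latencyMsP50: number;
    latencyMsP95: number;
    tokensPerSec: number;
    queueDepth: number;
  }>;
  thermal: {
    packageTempC: number;
    pCoreTempC: number;
    fanRpm: number;
    throttleEvents: number;
  };
  memory: {
    vmStat: Record<string, number>; // raw vm_stat counters
    swapUsedGB: number;
    loaderCacheHitRate: number;     // 0-1
    kvCacheEvictions: number;
  };
  power: {
    source: "pdu" | "package";      // metered PDU vs. system-reported
    watts: number;
  };
}
```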
Day 2 — First model loads, traffic begins
Initial model placement
The chosen Day-1 model loads on its target nodes. The router exposes an
OpenAI-compatible endpoint at /v1/chat/completions, so applications that
already talk to OpenAI talk to MLX without modification (example below).
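Cutover is a one-line change in the client: point the stock OpenAI SDK at the cluster router. The base URL and model name here are placeholders for your deployment:

```ts
import OpenAI from "openai";

// Same SDK, same calls: only the base URL changes at cutover.
const client = new OpenAI({
  baseURL: "http://inference.internal.example:8080/v1", // placeholder router address
  apiKey: "unused", // the SDK requires a key even if the router ignores it
});

const completion = await client.chat.completions.create({
  model: "your-day-1-model", // placeholder model id
  messages: [{ role: "user", content: "Smoke test after cutover." }],
});
console.log(completion.choices[0].message.content);
```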
First traffic flows. Telemetry confirms p50/p95 latency, tokens/sec, no throttle
events. The dashboard at /account/cluster/<slug> updates with
the first cost-per-token data point.
Day 7 — First weekly placement diff
Traffic-tuned model placement
The first week of real load reveals the actual traffic shape. The engineer reviews it and may rebalance:
- Model A's QPS is higher than projected — give it its own node, evict the secondary model that's been cohabiting.
- Model B's KV-cache hit rate is low — its workload is mostly long-context summarization; increase the cache budget.
- Node-2's throttle-event rate exceeds 3% — relocate one model off it to balance heat load.
Three weekly artifacts ship: the model placement diff (shape sketched below), the cost-per-million-tokens diff, and the eviction log.
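A plausible shape for the placement diff; fields are illustrative guesses, not the artifact's actual schema:

```ts
// Hypothetical weekly placement diff: what moved, where, and why.
interface PlacementDiff {
  week: string; // e.g. "2026-W07"
  moves: Array<{
    model: string;
    from: string | null; // null when the model is newly loaded
    to: string | null;   // null when the model is evicted
    reason: string;      // e.g. "qps-over-projection", "thermal-balance"
  }>;
}
```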
Day 14 — Hot-swap protocol exercise
First hot-swap: 0 connections dropped
Models get updated monthly (a new release, a refined quantization, a swap from one family to another). The hot-swap protocol runs as follows (the flip step is sketched after this list):
- Operator queues new model on target node — loader-B starts cold
- Health check confirms loader-B serves test prompts within p99 budget
- Router flips traffic atomically from loader-A to loader-B
- Loader-A drains in-flight requests then unloads
Net effect: 0 connections dropped, 0 requests failed, and a ~30–90s window of elevated latency during loader-B warmup. The procedure is documented in the runbook and is replicable by your team if Cluster Ops rolls off (we don't keep you locked in).
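A minimal sketch of the flip, assuming the router keys traffic off a single active-loader reference per model; the types and names are illustrative, not Garnet's implementation:

```ts
// Each loader can health-check itself, drain in-flight work, and unload.
type Loader = {
  name: string;
  healthy: () => Promise<boolean>; // serves test prompts within p99 budget?
  drain: () => Promise<void>;      // resolves once in-flight requests finish
  unload: () => void;
};

const active = new Map<string, Loader>(); // model id -> loader taking traffic

async function hotSwap(modelId: string, loaderB: Loader): Promise<void> {
  // Health gate: if loader-B fails, loader-A keeps serving untouched.
  if (!(await loaderB.healthy())) {
    throw new Error(`${loaderB.name} failed health check; flip aborted`);
  }
  // Atomic flip: one reference swap, so no request ever sees a gap.
  const loaderA = active.get(modelId);
  active.set(modelId, loaderB);
  // Drain loader-A's in-flight requests, then free its memory.
  if (loaderA) {
    await loaderA.drain();
    loaderA.unload();
  }
}
```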
Day 30 — First executive PDF
Cluster-monthly Workflow renders + emails the PDF
On the 1st of the next month, the garnet-cluster-monthly Workflow fires.
It aggregates the month's telemetry, computes uptime, surfaces thermal anomalies,
runs the cost-per-token math, then renders the PDF (workflow shape sketched below).
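The "-Workflow" naming suggests Cloudflare Workflows as the substrate; a sketch under that assumption, with the step names, bindings, key layout, and aggregation helper all illustrative:

```ts
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

type Env = { TELEMETRY: R2Bucket }; // assumed binding to the telemetry bucket
type Params = { slug: string };     // cluster slug passed at trigger time

export class GarnetClusterMonthly extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Each step.do() is durable: a failure retries the step, not the month.
    const keys = await step.do("list-telemetry", async () => {
      const listed = await this.env.TELEMETRY.list({
        prefix: `telemetry/${event.payload.slug}/`, // assumed key layout
      });
      return listed.objects.map((o) => o.key);
    });
    const report = await step.do("aggregate", async () => aggregate(keys));
    await step.do("render-and-email", async () => {
      // PDF rendering + email delivery elided in this sketch.
      console.log(`report ready for ${event.payload.slug}`, report);
    });
  }
}

// Placeholder: the real aggregation computes uptime, thermal anomalies,
// and cost per token from the listed records.
function aggregate(keys: string[]): { recordCount: number } {
  return { recordCount: keys.length };
}
```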
The PDF covers:
- Inference volume — total tokens, requests, per-model breakdown
- Thermal — peak temps, throttle event count, recommended ambient temp adjustments
- Eviction history — count + top-evicted models
- Distribution — per-node placement map, rebalance-recommended Y/N
- Cost — per-model $/M tokens, weighted-average $/M, vs.-API savings % (worked example after this list)
- Uptime — cluster uptime %, per-node uptime, incidents (with cause + fix)
- Recommendations — next-cycle placement adjustments, capacity additions, retirements
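The cost math reduces to fixed monthly dollars divided by tokens served. Every input below is an assumed illustration, not a quoted figure:

```ts
// All-in $/M tokens: retainer + power + amortization over the month's volume.
const retainerUsd = 9_999;          // Scale tier retainer
const powerUsd = 220;               // assumed monthly wall power for the rack
const hardwareAmortUsd = 1_250;     // assumed rack cost spread over 36 months
const tokensServed = 2_400_000_000; // assumed month's total from telemetry

const usdPerMTokens =
  (retainerUsd + powerUsd + hardwareAmortUsd) / (tokensServed / 1e6);
console.log(usdPerMTokens.toFixed(2)); // "4.78" all-in $/M tokens

// vs.-API savings %: 1 - (all-in $/M / hosted-API $/M) for the same model.
```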
Steady-state cadence from here: 60-second telemetry shipping, weekly placement diff, monthly executive PDF. Hot-swaps as warranted.
What you DON'T see in this walkthrough
- No cloud-vendor lock-in — your hardware, your inference, your data. Garnet operates the rack, doesn't own it. Cancellation removes operational access; the hardware keeps running.
- No SaaS observability bill — telemetry lives in your R2, queryable via your tools (Grafana, custom dashboards, or just the Garnet monthly PDF).
- No token-counted billing — your cost is the monthly retainer plus your existing power + hardware amortization. Inference itself doesn't have a per-token line item.
See Cluster Ops pricing → Read the methodology → Compare vs alternatives →