Why retainers, not one-shots
The standard architecture-audit engagement is a two-week, slide-deck deliverable. A senior engineer parachutes in, reads code, interviews the team, and ships a 30-page report. The report identifies twenty findings; the customer addresses three; the rest become technical debt that's "known but unowned." Six months later the architecture has drifted, the report is stale, and the cycle repeats with a new vendor.
The Audit Retainer collapses that cycle. Same engineer every month, indefinitely. The audit isn't a deliverable — it's a continuous diff. The findings aren't slides — they're merged pull requests in your repos. The cycle isn't six months between checkpoints — it's a week, with daily passive instrumentation underneath.
The economic argument is straightforward. A typical Garnet customer running Audit Scale ($9,999/mo) lands 2–8 merged engineering tickets/month, recovers ~28% on infra-spend line items within the first six months, and progresses from ~60% to ~95% on a SOC 2 readiness gap-list over the same window. The retainer is cheaper than the one-shots it replaces, and the work is owned end-to-end instead of handed off.
The four-axis audit
Most "architecture audit" engagements stop at security and call it done. We track four axes continuously, because the trade-offs between them are where production debt actually lives:
- Architecture — service topology, data flow, dependency graph, blast-radius diagrams. What happens when service X dies? Who knows?
- Security — secret rotation cadence, IAM least-privilege drift, public-exposure surface, dependency CVE exposure. SOC 2 / ISO 27001 / HIPAA / GDPR posture maps here too.
- Cost — month-over-month cloud + LLM spend, cost-per-tenant, cost-attributed-to-revenue. Who's spending how much, and why.
- Latency / reliability — committed SLOs vs. p50/p95/p99 actuals, error budgets, incident cadence, MTTR trend.
A single number on each axis is a point-in-time reading. The diff between months is the real signal — and the work item.
Daily — passive snapshots
A snapshot writer Worker (deployed in your Cloudflare account, talking to your cloud APIs with read-only credentials you provision) runs nightly and lands a structured JSON snapshot in your R2. Each snapshot covers:
Schema fingerprint
DB schemas (Postgres, MySQL, BigQuery, Snowflake — whatever you're on), IaC repos (Terraform, Pulumi, CloudFormation, CDK), API contract files (OpenAPI, GraphQL SDL, Protobuf). Hash-tracked. Drift between two consecutive snapshots trips an alert in #audit-drift the next morning. The schema diff is structural — we ignore migrations that only add columns or rename without breaking the contract, and surface anything that is a contract break.
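The hash-tracked fingerprint and the consecutive-snapshot diff can be sketched in a few lines. This is illustrative only — the real snapshot writer runs as a Worker against your repos; the types, file names, and function names below are made up for the sketch:

```typescript
import { createHash } from "node:crypto";

// One tracked artifact: a schema/IaC/contract file and its content hash.
interface Fingerprint {
  path: string;
  sha256: string;
}

// Hash each tracked file's content; the set of hashes is the snapshot fingerprint.
function fingerprint(files: Record<string, string>): Fingerprint[] {
  return Object.entries(files).map(([path, content]) => ({
    path,
    sha256: createHash("sha256").update(content).digest("hex"),
  }));
}

// Diff two consecutive snapshots: anything added, modified, or removed is drift.
function drift(prev: Fingerprint[], curr: Fingerprint[]): string[] {
  const prevMap = new Map(prev.map((f) => [f.path, f.sha256]));
  const currMap = new Map(curr.map((f) => [f.path, f.sha256]));
  const changed: string[] = [];
  for (const [path, hash] of currMap) {
    if (prevMap.get(path) !== hash) changed.push(path); // new or modified
  }
  for (const path of prevMap.keys()) {
    if (!currMap.has(path)) changed.push(path); // deleted
  }
  return changed.sort();
}
```

Any non-empty result from `drift(...)` between two nightly runs is what would trip the #audit-drift alert; the structural break-vs-additive classification happens downstream of this mechanic.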
IAM + secrets posture
IAM-graph snapshot (AWS IAM / GCP IAM / Azure RBAC + OIDC trust policies), last-rotation timestamps on tracked secrets (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, 1Password Connect, Doppler — pick your stack), public-bucket scan, exposed-key scan via gitleaks-style heuristics across the repos we have access to.
Cost + latency
Cloud-bill API pull (AWS CUR, GCP Billing, Azure Cost Management — full daily granularity, tagged-by-service), observability metric pull from your existing stack (Datadog API, CloudWatch Metrics, Grafana Mimir, Prometheus federation, Sentry — we adapt to what you already pay for). Per-service unit costs, request volume, p50/p95/p99. We do not require you to install a new agent — read-only API access is enough.
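The per-service unit-cost and latency numbers can be derived from those daily pulls roughly like this. The shapes and names are assumptions for illustration, not the actual snapshot schema:

```typescript
// Daily per-service rollup as it might land in the snapshot (illustrative shape).
interface ServiceDay {
  service: string;
  costUsd: number;       // from the cloud-bill API, tagged by service
  requests: number;      // from the observability metric pull
  latenciesMs: number[]; // sampled request latencies for the day
}

// Nearest-rank percentile over a latency sample (p50 / p95 / p99).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((x, y) => x - y);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Unit cost: dollars per 1,000 requests, a number worth trending month over month.
function unitCostPer1k(day: ServiceDay): number {
  return day.requests === 0 ? 0 : (day.costUsd / day.requests) * 1000;
}
```

A month-over-month jump in `unitCostPer1k` for one service, with flat request volume, is exactly the kind of anomaly the monthly PDF attributes to a specific deploy and day.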
Compliance posture
For customers on a SOC 2 / ISO 27001 / HIPAA / GDPR path: a daily diff against the control-list. Each control row is mapped to a passive evidence query (e.g., SOC 2 CC6.1 "logical access controls" maps to an IAM-graph query that confirms MFA-required on console access). When evidence shifts, the row's posture updates.
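The control-to-evidence mapping amounts to a table of passive queries evaluated against each day's snapshot. A minimal sketch, with an invented snapshot shape and invented control rows (the CC6.1/MFA example comes from the text; the rotation rule is an illustrative assumption):

```typescript
// A simplified snapshot shape for illustration only.
type Snapshot = {
  iam: { consoleMfaRequired: boolean };
  secrets: { lastRotatedDays: number };
};

// Each control row maps to a passive evidence query over the snapshot.
interface Control {
  id: string;
  evidence: (s: Snapshot) => boolean; // read-only query, no agent needed
}

const controls: Control[] = [
  // SOC 2 CC6.1 "logical access controls": MFA required on console access.
  { id: "CC6.1 logical access", evidence: (s) => s.iam.consoleMfaRequired },
  // Illustrative rotation control: tracked secrets rotated within 90 days.
  { id: "secret rotation", evidence: (s) => s.secrets.lastRotatedDays <= 90 },
];

// Re-evaluate posture daily; a row that flips pass/fail is a drift item.
function posture(s: Snapshot): Record<string, "pass" | "fail"> {
  return Object.fromEntries(
    controls.map((c) => [c.id, c.evidence(s) ? "pass" : "fail"]),
  );
}
```

"When evidence shifts, the row's posture updates" is just this function run against consecutive snapshots and diffed.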
Raw snapshots stay in your R2; only the structured diffs replicate to our control plane (encrypted at rest, your tenant only). We do not store production data verbatim. Snapshots are application/json, typically 50–500 KB/day, designed to be auditable in isolation. If you offboard, the snapshots remain in your R2 — they are your audit trail, not ours.
Weekly — active diff + drill
Once a week, the engineer (the same one all month) sits with the diff. Three artifacts ship:
- Drift report — what changed vs. the last sound state. Categorized architecture / security / cost / latency. Each item carries a severity (red / amber / green) and an estimated time-to-fix.
- Engineering tickets — for each red and amber finding at Scale tier and above, a ticket gets opened in your tracker (Linear, Jira, GitHub) with the fix scoped, sized, and assigned to Garnet. Not "we recommend you address X." Actual PRs.
- Pre-mortem entry — for any architecture change with non-trivial blast radius, a 1-page pre-mortem ("if this fails, what does the post-mortem say we should have known?"). Goes to a shared doc your team can challenge before merge.
Monthly — executive PDF
On the 1st of each month, a Cloudflare Workflow renders an executive PDF. It keeps the same shape every month so you can trend across quarters without re-orienting, and it's designed to be forwardable to your CFO without translation:
- Findings dashboard — opened / closed / carried, severity-grouped, with closure rate trended over 30/60/90 days
- Engineering tickets shipped — repo, PR number, 1-line summary, before/after metric where applicable (latency, cost, error rate)
- Architecture schema diff — what's new in the topology, with rationale. Includes blast-radius diagrams for new services and dependency-graph deltas
- Compliance posture — SOC 2 / ISO 27001 / HIPAA / GDPR % progress, gap-list, next-90-day projection if current pace holds
- Cost + latency posture — cloud spend Δ, LLM spend Δ, p95 Δ vs. last month, anomalies with attribution (which service, which deploy, which day)
- Recommendations — 3–5 prioritized next-cycle items, with cost/value framing and effort estimates
- Pre-mortems shipped — appendix listing each pre-mortem authored that month, the change it covered, and whether the pre-mortem caught something pre-merge
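The closure-rate trend in the findings dashboard can be computed directly from finding open/close events. A sketch under one assumed definition (closed-in-window over opened-in-window; the real report may define the rate differently):

```typescript
// One finding's lifecycle, in days since retainer start.
interface FindingEvent {
  openedDay: number;
  closedDay?: number; // undefined while the finding is carried
}

// Closure rate over a trailing window: findings closed in the window
// divided by findings opened in the window. Run with windowDays of
// 30, 60, and 90 to get the dashboard's trend points.
function closureRate(
  events: FindingEvent[],
  today: number,
  windowDays: number,
): number {
  const since = today - windowDays;
  const opened = events.filter((e) => e.openedDay > since).length;
  const closed = events.filter(
    (e) => e.closedDay !== undefined && e.closedDay > since,
  ).length;
  return opened === 0 ? 1 : closed / opened;
}
```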
The PDF is signed (DocuSign optional at Enterprise tier) and delivered to your inbox via Postal — sovereign mail, our infra, DKIM-verified. A summary embed also lands in your Discord #audit-monthly channel for the broader team.
Day 1, Day 30, Day 90
Day 1 — onboarding kickoff
- 60-min intake call: scope of audit (which repos, which clouds, which observability stack, which compliance frameworks)
- Read-only credentials provisioned to your Garnet-tenant Cloudflare account
- Snapshot writer Worker deployed; first nightly run completes within 24 hours
- Discord channel set bootstrapped: #audit-drift, #audit-tickets, #audit-monthly
- Garnet engineer onboarded into your existing tracker (Linear / Jira / GitHub Issues)
Day 30 — first executive PDF
- 4 weekly drift cycles complete; baseline-vs-now diff established
- 3–6 findings opened, 2–8 engineering tickets shipped (Scale tier and above)
- First monthly executive PDF delivered — sets the format you'll trend against
- Mid-month review call (optional, free) to recalibrate scope
Day 90 — full retainer maturity
- Architecture schema-diff cadence has surfaced enough state to identify the high-leverage drift patterns specific to your stack
- Compliance posture moved from baseline (typically ~60%) toward quarter target (~80–95%)
- Cost reduction visible on the line items the audit focused on (typical: 15–30%)
- Quarter-end readout: full retainer review, next-90-day plan, decision on tier escalation
What success looks like
Across the first 90 days, Audit Pro typically opens 3–6 findings per cycle and closes 4–8 (the surplus from the first month's backlog). Audit Scale closes 2–8 engineering tickets/month as merged PRs in customer repos. Compliance posture progress varies by stack — a typical SOC 2-aspiring customer moves from ~60% to ~95% within 6 months. Cost reduction averages ~28% on infra-spend line items in the first 6 months (anonymized aggregate; your mileage will depend on starting hygiene).
What it isn't
- Not a one-shot audit. One-shots find the holes. The retainer fixes them, finds the new ones that opened, fixes those. The work compounds; the cost doesn't.
- Not staff augmentation. You don't get "an engineer assigned" plus an account-manager wall. You get direct access to the same engineer who ships the work. Slack, Discord, email — your call.
- Not a recommendation deck. We don't ship slides we expect you to operationalize. We ship merged PRs and closed tickets, and the deck is the monthly executive PDF after the fact.
- Not unbounded. Pro is 4–8 audit hours/month; Scale is 12–24; Enterprise is 32+. Hours over the cap roll forward one cycle or get bid out as a separate scope.
- Not "we'll bring our tools." Audit Retainer runs on YOUR cloud accounts, YOUR Cloudflare tenant, YOUR R2. We don't shovel your data into a SaaS dashboard you don't control. The auditability of the retainer is itself part of the retainer's value.
FAQ
How is this priced relative to a Big-4 audit?
A Big-4 architecture audit typically runs $120K–$400K for a 6–10 week engagement. Audit Scale ($9,999/mo) is $120K/year — same headline number, but for ongoing work with merged tickets, not a single-shot deck. Audit Pro ($4,999/mo) is $60K/year and covers the diagnostic + 4–8 audit hours of remediation work. The economics flip the moment you treat audits as a recurring discipline rather than a snapshot.
What if our team disagrees with a finding?
Findings are proposals. Each red/amber finding lands in your tracker as a ticket your team can close, decline, or push back on. A declined finding is recorded with the rationale and doesn't re-open next cycle. The engineer's job is to surface and frame — not to override your team's judgment.
Do we need to be on Cloudflare to use this?
The snapshot writer Worker runs on Cloudflare (which is why this is the lane's host requirement), but your primary infra can live anywhere — AWS, GCP, Azure, on-prem, multi-cloud. The Worker pulls read-only from your existing cloud APIs. If you don't have a Cloudflare account, we'll provision one for you (free tier covers most Pro/Scale customers; Enterprise customers typically already have one).
What happens to our snapshots if we cancel?
They stay in your R2. Cancellation removes Garnet's read access to the diff control plane, but your historical snapshots (the daily JSON, the monthly PDFs) are entirely yours. Many customers retain the snapshot pipeline post-cancellation as a passive audit trail; you can also point a different operator at the same R2 bucket if you change vendors.
What about software vulnerabilities and CVEs?
Dependency CVE exposure is one of the security-axis snapshot inputs. We integrate with whichever scanner you already run (Snyk, Dependabot, Trivy, Grype) — we don't add a new scanner unless you don't have one. CVE counts and trends land in the monthly PDF security section; high-severity CVEs (CVSS >= 8.0) trigger same-day alerts at Scale and Enterprise.
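The same-day alert rule is a simple severity filter over whatever your existing scanner reports. A sketch with illustrative types (the CVSS >= 8.0 threshold and the Scale/Enterprise gating are from the text; the Pro-tier fallback is an assumption):

```typescript
type Tier = "pro" | "scale" | "enterprise";

// A CVE as normalized from your existing scanner (Snyk, Dependabot, Trivy, Grype).
interface Cve {
  id: string;
  cvss: number;
  service: string;
}

// High-severity CVEs (CVSS >= 8.0) trigger same-day alerts at Scale and
// Enterprise. (Assumption: at Pro, CVEs surface in the monthly PDF only.)
function sameDayAlerts(cves: Cve[], tier: Tier): Cve[] {
  if (tier === "pro") return [];
  return cves.filter((c) => c.cvss >= 8.0);
}
```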
Can the retainer cover only one repo or one cloud?
Yes. Pro tier is typically scoped to 1–2 repos + 1 cloud. Scale and Enterprise expand the scope. If you have a "primary" service plus a "long tail" of internal tooling, we usually recommend the audit scope = primary + most-shared internal libs, leaving the long tail as next-cycle expansion.
Adjacent lanes
Audit Retainer is one of three production lanes. Customers running an architecture-under-watch program often pair it with:
- GEO Methodology — citation-engineering for the AI-search surface. The architecture audit tracks the systems hosting your schema and llms.txt; GEO does the work that puts those signals into the model's retrieval.
- Sentinel-aaS — the Discord operations bus that fires the audit-drift alerts and posts the monthly PDF preview. The /audit-status slash-command is a Sentinel deliverable.
- Cluster Ops — for customers running on-prem inference, the operations layer the audit retainer also covers (the audit retainer's compliance-posture queries adapt to on-prem MLX clusters).
See Audit Retainer pricing → See the 30-day onboarding walkthrough → or talk to engineering