SaaS & Technology · SRE & observability · 6 months

From 99.5% to 99.95% for a B2B SaaS platform.

99.95% Availability, from 99.5%

88% Fewer customer-detected incidents

41 → 9 Minutes mean time to restore

Challenge

The company, a UK B2B SaaS scale-up moving upmarket, had started losing enterprise deals on a single question in security review: “describe your reliability practice.” Availability hovered around 99.5%, which sounds respectable until an enterprise customer translates it: nearly 44 hours of downtime a year, much of it discovered when their own users complained.
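
The translation from a percentage to hours is simple arithmetic, and worth making explicit. A minimal sketch in Python, assuming a 365-day year; the helper name is illustrative and not part of the client's tooling:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for pct in (99.5, 99.95):
    print(f"{pct}% availability -> {annual_downtime_hours(pct):.2f} hours/year")
```

Run as written, this gives about 43.8 hours a year at 99.5% and about 4.4 hours at 99.95%.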

Engineering was not careless; it was blind. Three monitoring tools, none correlated, meant incidents took 40+ minutes to even diagnose. On-call was a single exhausted rotation drowning in alerts that mostly meant nothing. The renewal book made the cost concrete: two churned logos in the previous year had each cited reliability.

Approach

Observability came first, because nothing else works without it. We consolidated on OpenTelemetry — traces, metrics, and logs sharing identifiers, so an alert links to the traces that triggered it and the logs those traces touched. Diagnosis stopped being archaeology.
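
The idea of shared identifiers can be sketched with the Python standard library alone. This is an illustration of the principle, not the OpenTelemetry API and not the client's code: the `current_trace_id` variable, the `checkout` logger name, and the log messages are all hypothetical stand-ins for what an instrumented service would provide.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical stand-in for the trace context a tracing SDK would manage.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Return a logger whose output lines all begin with the trace ID."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Simulate one request: every log line inside it shares one trace ID."""
    trace_id = uuid.uuid4().hex  # 32 hex characters
    token = current_trace_id.set(trace_id)
    try:
        logger.info("payment authorised")
        logger.warning("inventory lookup slow")
    finally:
        current_trace_id.reset(token)
    return trace_id


buffer = io.StringIO()
trace_id = handle_request(build_logger(buffer))
print(buffer.getvalue(), end="")
```

Because every line carries the same `trace_id`, a search on that one value returns everything the request touched, which is the property that turns diagnosis into a lookup.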

Then SLOs, defined with the product team rather than by engineering alone: what does “working” mean for a customer? The answer became seven SLOs, covering measures such as checkout latency, report generation time, and API error rate, each with an error budget and all reviewed fortnightly. Alerting was rebuilt on burn rates, which cut page volume by three quarters: a page now meant the budget was actually at risk, not that a threshold somewhere had twitched.
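
For readers new to the term, a burn rate is the observed error ratio divided by the error budget: a burn rate of 1 spends the budget exactly over the SLO window, and anything higher spends it early. The sketch below is a minimal illustration under stated assumptions, not the client's alert configuration. The 14.4 threshold and the pairing of a long and a short window follow a commonly cited convention for a 30-day SLO (14.4 sustained for one hour consumes 2% of the month's budget); the function names and sample numbers are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed convention, not taken from the case study
) -> bool:
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) shows the burn is significant; the short
    window (e.g. 5 minutes) shows it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Hypothetical readings against a 99.9% SLO:
print(should_page(0.02, 0.02, 0.999))    # sustained 2% errors
print(should_page(0.02, 0.0005, 0.999))  # long window high, but already recovered
print(should_page(0.002, 0.002, 0.999))  # elevated errors, budget not at risk
```

Only the first case pages. The second has already recovered and the third is burning too slowly to threaten the budget, which is the mechanism by which burn-rate alerting removes pages that a fixed threshold would have sent.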

Finally, incident practice. Lightweight roles, a single incident channel template, and blameless reviews that produced engineering actions rather than apologies. We ran the first six reviews ourselves, then handed the chair to their engineers. The discipline stuck because the reviews kept finding real work: connection pool exhaustion, a retry storm, a deploy-time cache stampede — each fixed permanently.

Telemetry   ·  OpenTelemetry end-to-end; correlated traces/metrics/logs
SLOs        ·  7 customer-facing SLOs, error budgets, fortnightly review
Alerting    ·  Burn-rate based; page volume down 76%
Incidents   ·  Role-based response, blameless reviews, action tracking
Stack       ·  Prometheus · Grafana · Sentry · PagerDuty

Results

Two quarters in, availability stood at 99.95% — a 10× reduction in downtime — and customer-detected incidents had fallen 88%: the platform now found its own problems first. Mean time to restore dropped from 41 minutes to 9. The commercial effect was the point: the reliability section of enterprise security reviews went from a liability to a strength, and the next two enterprise deals closed with the SLO dashboard in the sales deck.

Stack

OpenTelemetry · Prometheus · Grafana · Sentry · PagerDuty · GitHub Actions · ArgoCD · k6 · Sloth
