text
12.04.24 #004
We need to talk about: SLAs
;; 778w; 4'KEYWORDS: architecture, sre, devops, wntta
This is going to be a rant about corporate life. I've been working with enterprise customers in one way or another for over 10 years, and in that time I've not had a single meaningful discussion about Service Level Agreements (SLAs).
Now, don't get me wrong: I've had plenty of discussions about SLAs. But I can safely say that all of them were a complete waste of my time, for the simple reason that the people who wanted to talk to me about SLAs had no clue what SLAs were or what purpose they serve. Worse yet, ever single time someone wanted to talk to me about SLAs, they were convinced that SLAs were synonymous with "guaranteed minimum availability".
Utter nonsense.
So, same as last time, let's set the record straight once and for all.
You probably don't need SLAs
So what are SLAs, anyway?
SLAs are contracts. Legally binding documents, carefully drafted by your friendly corporate lawyers. Don't let the technical terminology included delude you, SLAs are not the domain of engineers, they are the domain of lawyers.
Corollary 1. If your SLA does not read like it was written by lawyers, it's probably just words someone put up on a website and not a real SLA.
So, if SLAs are contracts and are legally binding, whom do they legally bind, and to what? Broadly speaking, all SLAs are about the same thing: They bind a service provider into not charging you money for a service that was not delivered, or that was not delivered up to some agreed specification. They provide a clear description of the service, define how its quality will be measured, and establish some quality thresholds that needs to be met to avoid having to provide monetary compensation.
That is all there is to it. It sounds like a good thing, and it is. But let me illustrate why in practice they are not as relevant as you might expect them to be, from the perspective of a technology consumer.
Say you buy a ticket for a concert. If the artist cancels the tour, you'd be entitled to a refund. That's the SLA. Now, say you had deliberately planned your entire holidays around this concert. The only reason you're traveling is to attend this concert, and without it your entire trip no longer makes sense. Would the artist be responsible for any of this? Hell no. All that is on you. The artist made sure not to take your money for the service they failed to provide, but whatever else you built on top of this service, and the potential loss that you're now exposed, is entirely your responsibility.
That's the issue with SLAs in the real world.
If you have a $1000/hour business process running on top of a $10/hour server, the most an SLA will get you is your $10/hour back if the server is down. Can you afford to lose that $990/hour? If the answer is "no", then you want to look at building a resilient architecture that can tolerate server failures, not at the server's SLA.
And when you build this architecture, you now have a clear cost ceiling: Whatever you come up with, if it were to cost more than $990/hour, then you'd be better off letting your process go down and taking the hit. Between that number and your new cost is your potential business case.
When do you need SLAs?
From my experience, as an engineer you'll only care about SLAs when you're the one writing them. That is, when your team owns the service that's charging others money, and you need to offer them some terms that establish the quality you expect to deliver, and under which circumstances you will give them their money back.
Corollary 2. If you're being asked to write an SLA with no input or support from legal, it's probably just words someone wants to put up on a website and not a real SLA.
So, what do you do in this case? Well, I can tell you what you don't do: Math based on theoretical availabilities or other SLAs.
In a nutshell: Define some indicators of service quality you want to measure and measure them. Once you've collected enough data to understand what your baseline performance for these indicators is, you can define your objectives. From your objectives, you can derive your failure budget and set-up proactive alerting.
I could go into details, but I'll be repeating what others have already written down for me in Google's Site Reliability Workbook Chapters 2 and 5. Once you have Service Level Objectives (SLOs) you feel confident in, talk to legal.