Much has been written and discussed about SRE (Site Reliability Engineering) from what it is, how to do it, and how it’s the same (or different) as DevOps. Google coined the term, defined the profession, and wrote the book on it. Their “Site Reliability Engineering” book covers the ideas behind SRE and Google’s internal practices, which work well for them. Let’s put the Google specifics aside for a moment and instead focus on ideas, responsibilities, and objectives. Taking a step back from specific implementations, this post reviews the prerequisites required to bootstrap an SRE team that fits your organization.
Before we dive into it, check out Cloud Academy’s Recipe for DevOps Success webinar in collaboration with Capital One and don’t forget to have a look at Cloud Roster, the job role matrix that shows you what kind of skills a DevOps Engineer should master to land their dream job. If you’re a company, we suggest reading The Four Tactics for Cultural Change in DevOps Adoption.
The What and Why Behind SRE
SRE is a way to build and run reliable production systems in increasingly complex technical environments. SRE acknowledges that running successful production systems is a specific skill that’s different than other engineering disciplines. Ben Treynor, the founder of the SRE team at Google, describes SRE responsibilities in an interview for the SRE book:
[the] SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Site Reliability Engineers require software development and operations skills. They’re expected to write software that assists with deployment and production operations and also debug software in production environments. A cursory look at SRE job posts shows new hires are expected to be fluent in a programming language (such as Go or Node.js), configuration management, and automation tools (such as Ansible, Chef, or Puppet) and cloud infrastructure (like AWS, Azure, or GCP). Experience with containers and container orchestration like Mesos or Kubernetes is a common job requirement too.
The interdisciplinary skill set is useful throughout the SDLC and overlaps with other technical team members. It may also cause SRE to become an organization’s junk drawer for work that doesn’t map clearly onto existing teams. It also means that these skills will be less effective if they’re not focused on clear goals and defined responsibilities.
Framing Responsibilities with SLOs
Generally, SRE’s goal is to promote system reliability and efficiency throughout the SDLC. Doing SRE well means tracking and assessing progress against metrics. Service Level Objectives (SLOs) are the entry point to reliability for many organizations and are provided in one way or another. They may already be written down, quantified and tracked, or they may be something as simple as an unspoken idea that the website must be up during work hours. SLO’s frame SRE’s operational work and they’re fundamental in doing so.
Stephen Thorne, SRE at Google, echoes this point in his talk titled “Getting Started with SRE” from the DevOps Enterprise Summit 2018.
You can’t run an effective site reliability engineering org unless you’re monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn’t make any sense.
Setting measurable SLOs is the first checkpoint in getting started with SRE. Put bluntly, if your organization does not have written, measured, and reported SLO’s then it’s not ready for SRE. SLOs also need consequences since they’re worthless without enforcement. This provides SRE’s leverage to prioritize work that directly impacts SLOs as opposed to other work. The good news is that any organization can create and enforce SLOs (however they tend to carry different weight in a 2 person startup compared to a thousand strong enterprise organization).
SRE’s prime responsibility is ensuring their systems meet SLOs and many other things follow from that. That leads to the next question: how does SRE achieve this?
SRE strives to reduce toil in their day to day work which continuously improves their efficiency and dependent teams. (Also note, that continuous improvement is a fundamental DevOps principle that connects SRE to the larger DevOps movement.) The SRE book defines toil as:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1
It’s not work that engineers don’t want to do. Toil is an inhibitor that should be reduced in all possible areas. SRE’s should always maximize their automation skills to reduce manual work, enabling an SRE team to scale out while maintaining consistency across their systems. Reducing toil is a powerful idea since it expands out to capture the day to day work of maintaining logging and metric systems, standing up new services, reporting SLOs, and/or adding CICD pipelines to other systems. This is all important work that other teams need but someone has to set up for them.
Google found that capping their SRE time to 50% on toil and the other 50% on project work (such a driving improvement or supporting existing teams) was a key factor in successful SRE implementations. Capping the work sets a clear limit on how painful toil may be and it exposes a clear priority in addressing toil that habitually pushes against the limit. It also enforces the idea that SRE is more than just toil and encourages a shared responsibility model. If the SREs are overwhelmed with toil, then work can be distributed across other teams. This sheds load from the SRE team while exposing other engineers to the reality of running their own systems in production.
Capping toil is the second checkpoint in getting started with SRE. Stephen Thorne reiterates this point in the talk mentioned earlier:
if you’re not capping that toil and allowing them to actually go and implement that [monitoring] work, then all they’re doing is getting overloaded with toil and then they won’t be able to do any project work. The next time they need to do some things to improve the reliability of the system, they’re too overloaded. I think any org with one or a thousand SREs must be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work.2
After these two checkpoints, it’s up to management and leadership to form teams and set responsibilities.
Moving Towards Site Reliability Engineering
When you have SLOs, a declared cap on toil and a plan to handle overflow, then it’s time to consider what SRE looks like for your organization. There are three common models:
- A centralized SRE team (like a Google)
- A decentralized SRE team
- SREs embedded in teams
There is no one correct answer. The best fit varies by organization size and specific goals. Consider a simple example. An 8 person team may not require a dedicated SRE, and it certainly doesn’t mandate a dedicated SRE team. Conversely, there’s an inflection point where a dedicated SRE team makes sense and embedding SRE into existing teams makes sense. You must consider the trade-offs before making a decision.
VictorOps see SRE differently. They consider SRE a behavior rather than a dedicated role. Their goal is to build a culture of reliability into their engineers instead of into a specific team. They accomplished this by building a cross-functional council. Here’s Jason Hand from VictorOps in the eBook “Build the Resilient Future Faster: Creating a Culture of Reliability“:
For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.
VictorOps came to this conclusion by surveying SREs at other companies and determining what seemed right for them. You should do this before getting started with SRE since implementations of SRE ideas vary wildly between different organizations. There is no gold standard, just what’s effective for your organization and yielding results. Learning from other teams is a great way to avoid pitfalls.
Regardless of how SRE is structured within your organization, you’ll need buy-in from leadership and engineers. Management must enforce consequences for missed SLOs, breaching caps on toil, and defining clear boundaries between SRE and other teams. Introducing SRE can be a major organizational change and when so will only be successful if supported at the highest levels.
Let’s review the checkpoints we’ve established along the way to getting started with SRE. First and foremost is to establish, monitor, and report on SLOs. SLOs provides the foundation for building and maintaining reliable systems. Second is the cap on toil which ensures SREs are focused on continuous improvements throughout the system and not on low-value toil work. Lastly, there’s the collaborative effort of documenting responsibilities and building organizational buy-in.
Once you’re through these gates it’s time to consider the initial goals. Jason Hand, from VictorOps, poses a series of exercises. First, ask the team what keeps them up at night? The answer brings skeletons out of the closet. That kickstarts the process and allows new SREs to navigate their responsibilities while improving reliability.