Site Reliability Engineer (SRE) positions are open by the thousands. A recent search on Indeed returned 9,475 open SRE jobs in the U.S. alone. Hiring is robust for this role as organizations in all industries look to shore up the performance and reliability of their systems, whether customer-facing services or critical internal applications. If you have the right mix of qualifications, it’s a high-ceiling opportunity, much like its close cousin, the DevOps engineer role.
That said, SRE interviews can be tougher to prepare for than interviews for some other IT jobs. SRE is still a new-ish field and role in many companies, even if it has its roots in traditional IT operations as well as DevOps. It’s also a role where non-technical skills are just as important as tech IQ. IT prowess is only part of the job.
What is an SRE?
Here’s how Eveline Oehrlich, chief research officer at DevOps Institute, defines the SRE team role: “Site reliability engineering (SRE) is Google’s approach to service management, introduced in a book of the same name. It is a post-production set of practices for operating large systems at scale, with an engineering focus on operations.”
Oehrlich continues, “[SRE team members are] software engineers who are intended to perform operation functions instead of a dedicated operations team. The reliability of production systems, and therefore their users, are supported by an engineer who applies SRE site principles to manage availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. They can also function as support engineers, leveraging monitoring, capacity, and optimization automation tools. Their focus is on non-functional requirements of availability, performance, security, and maintainability.” (Read Oehrlich’s full article: DevOps vs. ITIL 4 vs. SRE: Stop the arguments.)
How to prepare for a Site Reliability Engineer (SRE) job interview
“When we get past the technical skills and experience, really the SRE role comes down to helping others weigh the tradeoffs and pressures on them to deliver fast and to deliver safely,” says Kit Merker, COO at Nobl9. “There is pressure from one side of the organization to deliver new shiny features, and from the other side to ensure we are secure, up, and stable. This conflict exists in every organization, from two people in a garage to a vast engineering org the size of Facebook or Netflix.”
If you roll your eyes when career discussions turn to people skills or the broad bucket of “soft” skills, the SRE field probably isn’t the best fit for you. These traits might be the toughest part of the job in some organizations, especially those with entrenched processes and culture.
“The rise of Site Reliability Engineering shows the importance of how much technology impacts our daily lives,” says Ravi Lachhman, evangelist at Harness. “Similar to DevOps, SRE represents more than a set of skills; organizations need to enable and foster SRE cultures and practices.”
The romantic notion that SREs are cape-wearing superheroes who swoop in and save the day when there’s an outage is mostly just that: a romantic notion. It happens, Lachhman says, but their real focus is ensuring that outages don’t occur in the first place; SREs are obsessive about the science of uptime and measurement.
“SREs are viewed as experts and help drive practices, architectures, and general recommendations about system robustness and reliability across the organization,” Lachhman says.
Given that the role is still new in many organizations, though, that status can’t be presumed. There’s a certain amount of evangelism involved in developing that trusted expert status, and that means close collaboration with individuals and teams throughout the organization. It’s a role that’s as much social as technical.
“The best candidates are those with a compelling narrative illustrating that SRE deals with socio-technical systems, not just computer systems,” Merker says. “Humans are the most important part of any system – not the code or services.”
7 Site Reliability Engineer (SRE) job interview questions
Keep this human aspect in mind when on the hunt for your next (or first) SRE job; likewise, keep it in mind when you’re hiring SREs. It’s going to inform at least some of the questions you’ll respond to (or ask) during an interview. Below, we’ll unpack seven example questions you can use to prepare for either side of the interview.
Question 1: How do you decide if the team should work on new features or pay down technical debt?
SREs play a growing role in negotiating the tension between building new features and reducing technical debt: Most organizations can’t do both simultaneously week in, week out. While this question might be rooted in technical decisions, it speaks to the “socio-technical” nature of SRE.
This is one of Merker’s favorite questions, and he deliberately leaves it open-ended – he wants to hear the candidate dig in for more data and context.
“If they have hard-and-fast rules, I am less impressed by their answer,” Merker says. “What I’m looking for is curiosity about the customer and the business, an understanding of a variety of roles in the company, and a desire to get data (when possible) to back up different points of view.”
For SRE candidates, this topic is a chance to show how you approach seemingly insurmountable conflicts. Everyone thinks their goal or issue is the most important; how do you actually set priorities that people can (mostly) agree on and work on? When is technical debt acceptable (or inevitable)? How do you pay it down?
“A big part of SRE is mediating between these different interests and finding practical and actionable answers to somewhat impossible questions,” Merker says. “There is no exact right answer; it’s the process of discovery to find what truly matters that makes me want to say STRONG HIRE!”
Question 2: How do you go about setting SLOs and SLIs and how do you make adjustments when necessary?
Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are foundational metrics for SREs. SLIs are the measurements of how a service actually behaves, such as the share of requests that succeed; SLOs are the target values those measurements are expected to meet.
Lachhman notes that the SRE function is often at the heart of defining and refining SLOs and SLIs; oftentimes, developers don’t necessarily know the norm or baseline for the applications they build and maintain, particularly if SRE is a relatively new dimension of the broader team.
Hiring managers should dig into how the candidate identifies and defines SLOs and SLIs; if you’re the candidate, you should be prepared to speak about how you approach these metrics. Moreover, make sure you can discuss a thoughtful process for reevaluating and optimizing those measurements over time.
“Like any metric, they need to evolve,” Lachhman says. “Negotiating changes to SLO/SLI measurements is par for the course.”
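To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. It assumes a simple availability SLI (good requests divided by total requests) and a hypothetical 99.9% target. The function name, the `SloReport` fields, and the "error budget" calculation are illustrative assumptions, not anything prescribed by the people quoted here. The error budget is a common SRE convention for the amount of failure the SLO still tolerates.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo_target: float        # goal for that fraction
    budget_remaining: float  # fraction of the error budget still unspent
    slo_met: bool


def evaluate_availability(good_requests: int, total_requests: int,
                          slo_target: float) -> SloReport:
    """Compare a measured availability SLI against its SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: what actually happened in the measurement window.
    sli = good_requests / total_requests

    # Error budget: how many failures the SLO allows for this much traffic.
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests

    # 1.0 means nothing spent; 0.0 means exhausted; negative means overspent.
    budget_remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(sli, slo_target, budget_remaining, sli >= slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: one million requests, 500 of them failed.
    report = evaluate_availability(good_requests=999_500,
                                   total_requests=1_000_000,
                                   slo_target=0.999)
    print(report)
```

In an interview, a candidate might use numbers like these to explain when a target should be tightened, loosened, or left alone. A budget that is never touched can suggest the SLO is too lax, while one that is constantly overspent can suggest the target or the system needs to change.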
Question 3: Which of the three pillars of observability is most important to you? Which one do you feel you need to get more exposure in?
The three pillars here are logging, metrics, and tracing. Observability as a whole is intrinsic to the SRE field.
“The science of measuring a system is core to what SREs are hired for,” Lachhman says, pointing to the “Four Golden Signals” (latency, traffic, errors, and saturation) described in the book Site Reliability Engineering as one basis for thinking about this question.
“Which pillar would help you determine those [signals] the best?” Lachhman asks. “These will eventually lead into your SLO/SLI measurements. Showing interest in one or more of the pillars shows you are ready to grow into your role.”
As a general principle, measurement is critical in any SRE position, so keep this in mind if you’re looking to pivot into this role from another IT area: It’s a data-driven discipline.
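As a rough illustration of that data-driven mindset, the sketch below derives three of the Four Golden Signals (latency, traffic, and errors) from a list of request records. The `Request` record, the field names, the choice of the 95th percentile, and the rule that status codes of 500 and above count as errors are all assumptions made for this example. The fourth signal, saturation, is left out because it would need resource data such as CPU or queue depth.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP-style status code


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected to be in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests: list[Request],
                   window_seconds: float) -> dict[str, float]:
    """Summarise latency, traffic, and errors for one time window."""
    if not requests:
        raise ValueError("requests must not be empty")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    # Assumption: server-side failures are status codes 500 and above.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }


if __name__ == "__main__":
    # Hypothetical sample: 20 requests over a 10-second window, 2 failures.
    sample = [Request(latency_ms=10.0 * (i + 1),
                      status=500 if i < 2 else 200)
              for i in range(20)]
    print(golden_signals(sample, window_seconds=10.0))
```

In this sketch every signal is computed from request records, which could be parsed from logs or from trace spans. In practice, metrics systems usually keep such values pre-aggregated. That overlap is one reason the question about a favorite pillar has no single right answer.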
Question 4: How have you implemented process improvements and other changes in the past?
It’s true: The “E” in SRE stands for engineering, and SREs have technical skills. But this role requires more people skills and change-agent capabilities than some other IT roles.
“While the SRE position is an engineering role, it is atypical to what one thinks of an engineering role,” says Oehrlich of the DevOps Institute. “While in some organizations existing monitoring practices, on-call procedures, and other standard processes are already well-established, an SRE should think and challenge existing ways of working. This calls for creativity and tenacity.”
Lots of roles might pay lip service to creativity and tenacity as desired traits in the job description. In SRE, though, they’re actually critical characteristics, especially when dealing with egos, cultural resistance to change, and other challenges.
“As hiring manager, I would ask for examples where the individual has shown such qualities, how they go about it, and what has been achieved,” Oehrlich says.
Let’s examine three more questions to expect: