What Is SRE?
Site Reliability Engineering
SRE is a discipline that applies software engineering principles to infrastructure and operations. Pioneered by Google, SREs write code to automate operations tasks, define reliability targets (SLOs), and balance feature velocity with system stability.
How SRE Works
SRE teams define Service Level Objectives (SLOs) — like 99.9% uptime or p99 latency under 200ms. They use error budgets: if reliability exceeds the SLO, the team ships features faster. If reliability drops below, they prioritize stability work.
Key Concepts
- SLO — Service Level Objective — a measurable reliability target like 99.95% availability or 200ms response time
- Error Budget — The acceptable amount of unreliability — 99.9% uptime allows 43 minutes of downtime per month
- Toil Automation — Writing code to automate repetitive operational tasks — the core SRE activity
Frequently Asked Questions
Is SRE different from DevOps?
DevOps is a culture and set of practices. SRE is a specific implementation with concrete principles (SLOs, error budgets, toil reduction). Google describes SRE as 'what happens when you ask a software engineer to design an operations team.'
Do small companies need SRE?
Not as a dedicated team, but adopting SRE practices (SLOs, incident response, automation) benefits any team running production services.