
The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization.

This book is divided into four sections:

Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices.
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE).
Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems.
Management: Explore Google's best practices for training, communication, and meetings that your organization can use.