Site reliability engineering: how Google runs production systems

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key m...

Bibliographic Details
Other Authors: Butow, Tammy (author); Beyer, Betsy (editor)
Format: E-book
Language: English
Published: Beijing, [China] : O'Reilly, 2016.
Edition: First edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630140306719
Table of Contents:
  • Introduction. The production environment at Google, from the viewpoint of an SRE
  • Principles. Embracing risk
  • Service level objectives
  • Eliminating toil
  • Monitoring distributed systems
  • The evolution of automation at Google
  • Release engineering
  • Simplicity
  • Practices. Practical alerting from time-series data
  • Being on-call
  • Effective troubleshooting
  • Emergency response
  • Managing incidents
  • Postmortem culture: learning from failure
  • Tracking outages
  • Testing for reliability
  • Software engineering in SRE
  • Load balancing at the frontend
  • Load balancing in the datacenter
  • Handling overload
  • Addressing cascading failures
  • Managing critical state: distributed consensus for reliability
  • Distributed periodic scheduling with Cron
  • Data processing pipelines
  • Data integrity: what you read is what you wrote
  • Reliable product launches at scale
  • Management. Accelerating SREs to on-call and beyond
  • Dealing with interrupts
  • Embedding an SRE to recover from operational overload
  • Communication and collaboration in SRE
  • The evolving SRE engagement model
  • Conclusions. Lessons learned from other industries.