Site reliability engineering: how Google runs production systems

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key m...

Bibliographic Details
Other Authors:	Butow, Tammy (author), Beyer, Betsy (editor)
Format:	E-book
Language:	English
Published:	Beijing, [China] : O'Reilly 2016.
Edition:	First edition
Subjects:	Google (Firm); Reliability (Engineering); Computer engineering.
View in the Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630140306719

Table of Contents:

Introduction. The production environment at Google, from the viewpoint of an SRE
Principles. Embracing risk
Service level objectives
Eliminating toil
Monitoring distributed systems
The evolution of automation at Google
Release engineering
Simplicity
Practices. Practical alerting from time-series data
Being on-call
Effective troubleshooting
Emergency response
Managing incidents
Postmortem culture: learning from failure
Tracking outages
Testing for reliability
Software engineering in SRE
Load balancing at the frontend
Load balancing in the datacenter
Handling overload
Addressing cascading failures
Managing critical state: distributed consensus for reliability
Distributed periodic scheduling with Cron
Data processing pipelines
Data integrity: what you read is what you wrote
Reliable product launches at scale
Management. Accelerating SREs to on-call and beyond
Dealing with interrupts
Embedding an SRE to recover from operational overload
Communication and collaboration in SRE
The evolving SRE engagement model
Conclusions. Lessons learned from other industries.