Site reliability engineering: how Google runs production systems
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key m...
Other Authors: | , |
---|---|
Format: | E-book |
Language: | English |
Published: | Beijing, [China] : O'Reilly, 2016. |
Edition: | First edition |
Subjects: | |
View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630140306719 |
Table of Contents:
- Introduction. The production environment at Google, from the viewpoint of an SRE
- Principles. Embracing risk
- Service level objectives
- Eliminating toil
- Monitoring distributed systems
- The evolution of automation at Google
- Release engineering
- Simplicity
- Practices. Practical alerting from time-series data
- Being on-call
- Effective troubleshooting
- Emergency response
- Managing incidents
- Postmortem culture: learning from failure
- Tracking outages
- Testing for reliability
- Software engineering in SRE
- Load balancing at the frontend
- Load balancing in the datacenter
- Handling overload
- Addressing cascading failures
- Managing critical state: distributed consensus for reliability
- Distributed periodic scheduling with Cron
- Data processing pipelines
- Data integrity: what you read is what you wrote
- Reliable product launches at scale
- Management. Accelerating SREs to on-call and beyond
- Dealing with interrupts
- Embedding an SRE to recover from operational overload
- Communication and collaboration in SRE
- The evolving SRE engagement model
- Conclusions. Lessons learned from other industries.