In today’s digital world, data is the new currency. We invest significant effort in safeguarding our customers’ data, yet all too often fail to fully consider another fundamental pillar of information security: availability. The uptime of our systems is no longer the sole remit of engineers; organisations add value through the provision of always-on technology platforms, and rapidly fall into obsolescence when such systems are unavailable.

Risks to our system availability abound throughout the development lifecycle, and often originate from unexpected sources. We protect against malicious third-party actors, yet often overlook the significant risk posed by increasingly sophisticated software systems to which our developers deploy changes multiple times per day.

Originally pioneered at Google, Site Reliability Engineering (SRE) is the discipline of automating processes and building systems so that this risk is managed pragmatically throughout the development cycle. We know systems fail in spite of human effort, not because of it, so we continually optimise and refine the boring parts.

In this talk, I will share my experience implementing SRE in a small organisation to promote availability, discuss the theoretical properties of reliability engineering, and provide practical guidance on building systems which cope well with continual change.

Session takeaways