Title: Site Reliability Engineering
Author(s): Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
There are some resources that you know you must read, even if it is a steep ascent. Considering that this title is freely available, that it contains so many good ideas, tips, and practices that many people won't ever put half of them into practice, and that it is a big book (around 500 pages), it is hard to complain about it. Still, I'd like to start with my only complaint: most of the book feels like a bunch of academic papers grouped together, often under a common topic, but sometimes overlapping in content despite sitting in different chapters and parts of the book. It has not been an easy book to read because it can get dull at times, to the point that I stopped reading it halfway through, and only recently picked it up again and decided to finish it.
That said, everything else is very interesting. Even if sometimes we're not told exactly how their internal systems work, the Google SRE authors explain enough techniques, procedures, ideas and advice to create tickets for years of work at any medium or large company that has more than a few services in production. The material ranges from recommending that you monitor everything, add retries and circuit breakers, or follow their "production readiness requirements" SRE guidelines, to less often heard concepts like "given enough requests per second, a simple random load balancer strategy can perform better than a round robin one", or how Google in some services employs an initially counter-intuitive practice of, in case of high latency, discarding the newer requests instead of the older ones (while also always discarding those that have been waiting for more than X seconds). There are tons of ideas on how, what and when to monitor, log, alert, automate, improve and fix, plus a myriad of related actions you can take regarding services, tasks, teams, scenarios... Just take a look at the table of contents to grasp the sheer number of topics it covers.
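To make the last of those ideas more concrete, here is a minimal Python sketch of the load-shedding policy as I described it above: when the queue is full the newest request is rejected, and anything that has waited longer than a limit is dropped. This is not Google's implementation; the `SheddingQueue` class, its method names and its parameters are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SheddingQueue:
    """Hypothetical bounded FIFO queue that sheds load in two ways."""

    def __init__(self, max_size, max_wait_seconds):
        self.max_size = max_size
        self.max_wait_seconds = max_wait_seconds
        self._items = deque()  # (enqueue_time, request), oldest first

    def _drop_expired(self, now):
        # Discard requests that have waited longer than the limit.
        while self._items and now - self._items[0][0] > self.max_wait_seconds:
            self._items.popleft()

    def offer(self, request, now):
        """Try to enqueue; return False if the new request is rejected."""
        self._drop_expired(now)
        if len(self._items) >= self.max_size:
            return False  # overloaded: reject the newest request
        self._items.append((now, request))
        return True

    def poll(self, now):
        """Return the oldest request that has not expired, or None."""
        self._drop_expired(now)
        if not self._items:
            return None
        return self._items.popleft()[1]
```

With `max_size=2` and `max_wait_seconds=5.0`, a third request offered while two are still queued is turned away, and a request that has been sitting for six seconds is never handed to a worker.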
I marked some highlights of my own, but if you want some really hardcore and useful notes, I can only encourage you to check the in-depth review at danluu.com/google-sre-book/. It summarizes every chapter except the final ones on SRE team management and integration.
But even the final management chapters are useful, explaining why interruptions are bad, how to deal with them and mitigate them, how to collaborate between teams, how, when and why to meet...
As I mentioned before, it is not the easiest book to read, but it is definitely one that most engineers, SREs or not, should read.