SoundCloud runs a complex microservice architecture to serve a great diversity of features to a large user base. All of this is done by a relatively small number of engineers, under constant pressure to innovate in the not exactly easy market of music streaming. While this might appear quite similar to the situation of many other startups, SoundCloud is a rather extreme example. As such, it is perfectly suited as a case study in how to tackle this tech-debt-prone situation.
About six years ago, with the microservice migration in full swing, site reliability became more and more problematic at SoundCloud. At about the same time, SoundCloud happened to employ a handful of ex-Google SREs. Naively, one might have expected that they would simply wave their magic G-wands and make the site reliable again. However, simply copying Google-style SRE and applying it to an organization very different in scale and culture was doomed to fail. Studying the exact reasons for the failure, and SoundCloud's subsequent mission to find its own implementation of SRE, is a helpful exercise for many smaller organizations facing the similar challenge of sustainably running a diverse set of services.
Björn is an engineer at Grafana and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.