SREcon18 EMEA
Wednesday, August 29 • 14:00 - 14:50
Availability, Latency, and Cost: Withstanding Regional Outages

Running in multiple regions serves your users better, with higher availability and lower latency, and it won't cost as much as you think. We've turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach, and our understanding, as we've matured.

Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it's a matter of routine that usually concludes with a brief "all is well" email.

This talk dives into our experience operating in multiple regions at scale and the algebraic models, code, and incident management playbooks we've developed to tame, refine, and leverage our approach. Once you've decided to go multi-region, three major questions arise: how many regions, how should we steer users to regions, and how do we actually perform the failover? In addition to the story of how we got to where we are, I'll present the design considerations and system models we used to make those decisions.



Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to serving over 100M users at Netflix. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations on the Traffic team...

Wednesday August 29, 2018 14:00 - 14:50 CEST
2 - Rheinlandsaal Ballroom BC