SREcon18 EMEA has ended

Track 2
Wednesday, August 29

11:20 CEST

The Silver Lining Consortium: Post-Mortems for the Rest of Us
This talk describes why the industry would be much better off if we shared our post-mortems more widely and in a more structured way. It shows how doing so benefits both cloud consumers and cloud providers, proposes the creation of a consortium to do precisely this, and suggests how to express interest in joining.


Niall Murphy

Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer...

Wednesday August 29, 2018 11:20 - 12:00 CEST
2 - Rheinlandsaal Ballroom BC

12:00 CEST

Migrations under Production Load: How to Switch Your Database without Disrupting Service
How do you handle live traffic and serve correct results while switching databases? In this talk, Vilde will share a general step-by-step approach to live database migrations from a backend application developer perspective. Using two recent migrations as examples she will walk through what went wrong, what went right, the patterns such projects follow, and how you can apply this to your own projects. You will hear about a database migration from MySQL to Cassandra in a high-traffic critical production system and a data migration that involved recreating the whole database in a complex legacy system.


Vilde Opsal

Vilde Opsal is a feminist developer from Norway, currently living in Berlin by way of California.

Wednesday August 29, 2018 12:00 - 12:30 CEST
2 - Rheinlandsaal Ballroom BC

14:00 CEST

Availability, Latency, and Cost: Withstanding Regional Outages
Running in multiple regions is better for your users through increased availability and lower latencies, and it won't cost as much as you think. We've turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high-level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach—and our understanding—as we've matured.

Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it's a matter of routine that usually concludes with a brief "all is well" email.

This talk dives into the experiences of operating in multiple regions at scale and the algebraic models, code and incident management playbooks we've developed to tame, refine, and leverage our approach. Once you've decided to go multi-region, the three major questions that arise are: how many regions, how should we steer users to regions, and how do we actually perform the failover? In addition to the story of how we got to where we are, I'll present the design considerations and system models we used to make those decisions.



Aaron has been building, breaking, and fixing systems for over a decade from tiny startups to serving over 100M users at Netflix. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations on the Traffic team...

Wednesday August 29, 2018 14:00 - 14:50 CEST
2 - Rheinlandsaal Ballroom BC

14:50 CEST

SRE for Mobile Applications
In the server side world, we can and do lean heavily on redundancy, scaling, and direct control to engineer reliability; however, in the mobile world these well known facets of SRE (amongst others) are virtually non-existent. We typically have no binary rollbacks or downgrades, no forced updates/upgrades, and no ability to turn it off and back on again. Users rely on mobile applications more and more, and it's important that SREs consider client-side reliability. This talk discusses practices, principles, and processes that can be applied in the mobile world to make client-side code a first-class citizen, along with real-world case studies of what has and hasn't worked.

Note: Much of the talk will centre around Android, although much of what is discussed can be applied to other platforms.


Samuel Littley

Samuel Littley has been a Site Reliability Engineer at Google for 2 years, and is currently working in London on a team supporting the Google Search App and Google Play Services, as well as working on a wider effort supporting production best practices for Google's wide array of mobile...

Wednesday August 29, 2018 14:50 - 15:30 CEST
2 - Rheinlandsaal Ballroom BC

16:00 CEST

Impact of Network Automation
Running a network efficiently requires a lot of time and effort. Provisioning resources and configuring networking devices often relies on many manual processes and requires highly trained and skilled engineers. Using automation saves time and helps maintain consistent configuration across many networking devices.

In this talk, I will discuss how we, a small network engineering team, started using automation to build and maintain the Squarespace network. We began our move towards orchestration by aggregating small Ansible roles and playbooks. For example, when we migrated the internal routing protocol from OSPF to BGP, we created an Ansible role to fully control the internal routing. Now we run Ansible playbooks when we need to move traffic away from devices before performing maintenance.

I will also describe how we use route servers to minimize configuration changes on network devices. BGP route servers are an elegant and sophisticated tool that we use to shift internal or external traffic without impact. We integrated route servers with our CI/CD tools to do BGP traffic engineering in the internal and external network. I will provide an example of how we route traffic onto a DDoS mitigation service with the click of a button.



Squarespace Inc.
I'm a Staff Network Engineer on the Infrastructure Engineering team at Squarespace. I have been building reliable and redundant datacenter networks, and I also focus on building automation tools for network deployments and management.

Wednesday August 29, 2018 16:00 - 16:45 CEST
2 - Rheinlandsaal Ballroom BC

16:45 CEST

Migrating Your Old Server Products to Be Stateless Cloud Services
This talk will be a technical walkthrough of how we built out an infrastructure in AWS using industry open-source tooling (like Packer, Ansible, and Terraform) to build stateless instances that run our antiquated applications and how the CI/CD pipeline allows for rollouts in a reliable way. Because these products are business critical (like billing services), it was essential to have a solid migration plan.

I'll cover the before architecture, the goal, the migration, the end architecture, the upgrade build and deployment process, and the monitoring and alerting taxonomy. I'll share what worked and what didn't and the complications (and outages) along the way.

The goal is to give any SRE the knowledge needed to be confident to perform this type of migration themselves.



Craig Knott is a Cloud Farmer at Atlassian.

Wednesday August 29, 2018 16:45 - 17:30 CEST
2 - Rheinlandsaal Ballroom BC
Thursday, August 30

09:00 CEST

Applying the Principles of Chaos to Serverless
Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users. Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?

Yan Cui

Principal Engineer, DAZN
Yan is an experienced engineer who has run production workloads at scale in AWS for nearly 10 years. He has been an architect and principal engineer in a variety of industries ranging from banking, e-commerce, and sports streaming to mobile gaming. He has worked extensively with AWS...

Thursday August 30, 2018 09:00 - 09:55 CEST
2 - Rheinlandsaal Ballroom BC

09:55 CEST

Know Your Kubernetes Deploys
Containers changed the way we develop and package our code. Kubernetes made it easy to deploy and orchestrate our workloads. Now that those steps are well understood, it is time to draw attention to securing the software supply chain. This talk shows how Shopify secures and tracks its workloads.

We secure our software supply chain by creating signatures on our containers which state that they originate from the correct deploy pipeline, got tested and contain no known vulnerabilities or outdated software.

During deployment, we use an admission controller that enforces deploy-time policies, which check for the presence of the previously created signatures so that we prevent privilege escalation via code deployment.

Since new exploits show up all the time, we need to add another piece to the puzzle to secure containers: a place to track all the metadata created during the lifetime of a container. For example, we track where a container is deployed so that, if it becomes vulnerable, it can be pulled out of production, fixed, and redeployed.

Felix Glaser

Senior Production Security Engineer ☁️ 生产安全工程师 ☁️, Shopify
Felix likes to climb, cycle, and code in Canada. The first two outside and the other one at Shopify, where he works on securing containers and their deployment into the cloud.

Thursday August 30, 2018 09:55 - 10:30 CEST
2 - Rheinlandsaal Ballroom BC

11:00 CEST

SoundCloud's Story of Seeking Sustainable SRE
SoundCloud runs a complex microservice architecture to serve a great diversity of features to a large user base. All of this is done by a relatively small number of engineers, under constant pressure to innovate in the not exactly easy market of music streaming. While this might appear quite similar to the situation of many other startups, SoundCloud is a rather extreme example. As such, it is perfectly suited to find out how to tackle this tech-debt prone situation.

About six years ago, with the microservice migration in full swing, site reliability became more and more problematic at SoundCloud. At about the same time, SoundCloud happened to employ a handful of ex-Google SREs. Naively, one might have expected they would simply wave their magic G-wands and make the site reliable again. However, simply copying Google-style SRE and applying it to an organization very different in scale and culture was doomed to fail. Studying the exact reasons for the failure and SoundCloud's subsequent mission to find their own implementation of SRE is a helpful exercise for many smaller organizations in a similarly challenging situation of sustainably running a diverse set of services.

Björn Rabenstein

Engineer, Grafana Labs
Björn is an engineer at Grafana and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.

Thursday August 30, 2018 11:00 - 11:35 CEST
2 - Rheinlandsaal Ballroom BC

11:35 CEST

How We Un-Scattered Our DNS Setup and Unlocked New Automation Options
We own over a hundred different domains. They were spread over multiple registrars. DNS servers were not under active management and DNS data was neither version controlled nor reviewed. Deployments were risky and rollbacks challenging.

We gained control over the situation by reducing the number of contracts with registrars, selecting a cloud-based DNS service, and convincing the teams to manage DNS data in a version-controlled manner. To deploy DNS changes, we built tooling that we open-sourced. Today, we are able to deploy much faster and more safely. We also have automated checks and have implemented some safety measures to prevent the most common mistakes I made in the past. Sharing those mistakes will be part of the presentation, as well as a quick look at the new automation options we unlocked by having a more robust DNS setup.

Dan Lüdtke

Site Reliability Manager, Google
Dan is a Site Reliability Manager in Munich. He contributes to open source software projects, regularly helps to organize large hacker events, runs an autonomous system for fun, and dreams of space travel. Prior to Google, Dan served his country, worked as a security consultant, joined...

Thursday August 30, 2018 11:35 - 12:00 CEST
2 - Rheinlandsaal Ballroom BC

12:00 CEST

Kernel Upgrades at Facebook
The goal of this talk is to explain the importance of automating your kernel upgrades and why you should invest time in building automation which reliably and continuously enforces newer kernels on your hosts.

The Kernel Team at Facebook is in charge of the Linux kernel used at Facebook, along with the other 'system level' packages that go with it. The kernel team works on tasks like:

  • Merging upstream changes into the Facebook Linux Kernel
  • Creating custom kernel changes for our needs
  • Investigating Linux-related performance issues and failures
  • Periodically building and initially testing new Facebook kernel RPMs

MySQL is one of the primary data stores that Facebook relies on. We have tens of thousands of database hosts running on Linux boxes with different kernel versions. No kernel is perfect, and database hosts often hit kernel bugs that impact production traffic. The remediation is often to upgrade to newer kernels that contain the fixes.

In this talk I will go over some of the kernel bugs which impacted our production database servers and how we invested time in developing an automation framework to enforce new kernels on our database hosts in a continuous fashion at Facebook scale. I will also go over how MySQL Infrastructure at Facebook adopted this and is successfully upgrading tens of thousands of database servers without impacting production traffic.


Pradeep Nayak Udupi Kadbet

Pradeep is a Production Engineer at Facebook and works with MySQL Infrastructure. He loves hacking code in Python and builds bots to do things for him. When he is not working, he enjoys traveling and clicking pictures.

Thursday August 30, 2018 12:00 - 12:30 CEST
2 - Rheinlandsaal Ballroom BC

14:00 CEST

Care and Feeding of Data Processing Pipelines
Data processing pipelines have important use cases ranging from business analytics, machine learning, eliminating spam and abuse, and delivering billing invoices to transforming data for many important user facing serving jobs. These pipelines are often composed of multiple steps where the input of one is the output of another and with dependencies on external systems and storage, all of which can break. When they do, and pipelines fail to meet SLOs, fixes are often expensive and time consuming, especially if a large data set needs to be reprocessed or repaired. It is best to focus on prevention and quickly detecting and responding to the issues, which is where SRE can help.

In part, the difficulty of managing pipelines lies in their difference from serving jobs. Because RPC latency and errors cannot be monitored directly as a proxy for customer happiness, it's necessary to gain visibility into the age of the oldest unprocessed data and to measure data correctness, since corrupt output data may be customer visible and persisted even when serving jobs report no errors. To prevent issues and minimize impact, techniques such as canarying, incremental rollout, automatic failover, and autoscaling can be used, all of which have specific considerations for pipelines.
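
As a rough illustration of the "age of oldest unprocessed data" signal mentioned in the abstract, here is a minimal sketch. It is not taken from the talk: the function names, the list of pending timestamps, and the 30-minute freshness threshold are all hypothetical, chosen only to show the idea.

```python
from datetime import datetime, timedelta, timezone


def oldest_unprocessed_age(pending_timestamps, now):
    """Return the age of the oldest record still waiting to be processed.

    pending_timestamps is a hypothetical list of arrival times for records
    the pipeline has not yet handled. An empty backlog has age zero.
    """
    if not pending_timestamps:
        return timedelta(0)
    return now - min(pending_timestamps)


def freshness_slo_met(pending_timestamps, now, max_age):
    """True if the oldest pending record is no older than max_age."""
    return oldest_unprocessed_age(pending_timestamps, now) <= max_age


# Fixed clock so the example is deterministic.
now = datetime(2018, 8, 30, 14, 0, tzinfo=timezone.utc)
pending = [now - timedelta(minutes=5), now - timedelta(minutes=42)]

print(oldest_unprocessed_age(pending, now))                    # 0:42:00
print(freshness_slo_met(pending, now, timedelta(minutes=30)))  # False
```

Unlike request latency, this value keeps growing while a pipeline is stalled, which is why it can stand in for customer-visible staleness even when no errors are being reported.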



Rita is an SRE at Google with experience managing data processing pipelines, including Google Analytics. She has worked with other pipeline groups at Google on automation and, in particular, monitoring products that meet the needs of pipelines as well as serving jobs. She started...

Thursday August 30, 2018 14:00 - 14:45 CEST
2 - Rheinlandsaal Ballroom BC

14:45 CEST

Panel: Data Pipelines—Scaling and Reliability
Data processing pipelines, in some form or another, are the lifeblood of all large systems that aggregate data, sort and structure unordered input, or compute features for machine learning. These kinds of systems have become much more common in recent years, and problems with delayed or incorrect results are becoming more likely to have business and user impact. Designing and running pipelines is quite different from designing and running serving jobs. In this panel, pipeline experts from a range of organisations and applications will discuss their experiences scaling pipelines and dealing with their pitfalls.

Laura Nolan

Laura Nolan is a software engineer whose fascination with failure and fragility in systems drew her into the field of Site Reliability Engineering. She is a contributor to "Site Reliability Engineering: How Google Runs Production Systems" and "Seeking SRE", and writes a quarterly...


Matthew Flaming

New Relic
Matthew Flaming began his career in software engineering back when creating a web portal meant hacking together your own version of JSP and racking your own Solaris boxes. Since then he has led the development of complex, high-scale backend systems ranging from CDNs to IoT platforms...


Rita is an SRE at Google with experience managing data processing pipelines, including Google Analytics. She has worked with other pipeline groups at Google on automation and, in particular, monitoring products that meet the needs of pipelines as well as serving jobs. She started...

Theo Schlossnagle

Founder & CEO, Circonus
The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded...

Thursday August 30, 2018 14:45 - 15:30 CEST
2 - Rheinlandsaal Ballroom BC

16:00 CEST

Canarying Well: Lessons Learned from Canarying Large Populations
Canarying, the process of controlled and observed partial rollout in production to mitigate risk, is one of the common techniques used to ensure safe production changes. In this talk, we will cover common pitfalls, discuss best practices, and outline an end-to-end strategy for the canary process.


Štěpán Davidovič

Štěpán Davidovič is a Site Reliability Engineer at Google. He currently works on internal infrastructure for automatic monitoring. In previous Google SRE roles, he developed Canary Analysis Service, worked on distributed Cron solution, and has worked on both a wide range of shared...

Thursday August 30, 2018 16:00 - 16:45 CEST
2 - Rheinlandsaal Ballroom BC

16:45 CEST

Real World SLOs and SLIs: A Deep Dive
If you've read almost anything about SRE best practices, you've probably come across the idea that clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity.

But in the real world, SLOs and SLIs can be challenging to define and implement. In this talk, we’ll dive into the nitty-gritty of how to define SLOs that support different reliability strategies and modalities of service failure. We’ll start by looking at key questions to consider when defining what “reliability” means for your organization and platform. Then we'll dig into how those choices translate into specific SLI/SLO measurement strategies in the context of different architectures (for example, hard-sharded vs. stateless random-workload systems) and availability goals.
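
To make the link between an SLO and a decision about risk concrete, the sketch below works through the common error-budget arithmetic for a request-based availability SLI. It is an illustrative assumption, not the speakers' method: the function name, the 99.9% target, and the request counts are hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the current window.

    slo_target is the availability objective as a fraction (e.g. 0.999).
    The budget is the number of requests allowed to fail under that target;
    a negative result means the budget has been overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room for failures.
        return 0.0 if failed_requests else 1.0
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
# The target allows about 1,000 failures, so roughly 75% of the budget remains.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(round(remaining, 3))  # 0.75
```

A team might slow feature rollouts as this value approaches zero and spend more freely when it is high, which is one way the balance between reliability and product velocity becomes data-driven.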

Elisa Binette

New Relic, Inc.
Elisa Binette is a Senior Engineering Manager within the Site Reliability Organization at New Relic. The group focuses on helping teams measure and achieve their reliability goals, improving reliability for both the engineers within the company and for the end customers of New Relic...

Matthew Flaming

New Relic
Matthew Flaming began his career in software engineering back when creating a web portal meant hacking together your own version of JSP and racking your own Solaris boxes. Since then he has led the development of complex, high-scale backend systems ranging from CDNs to IoT platforms...

Thursday August 30, 2018 16:45 - 17:30 CEST
2 - Rheinlandsaal Ballroom BC
Friday, August 31

09:00 CEST

SRE Team Lifecycles
Site Reliability Engineering is more than a job title or the name of a team: we know it is a collection of approaches and practices that allow an organisation to deliver their applications at scale.

We set out to give a definition of what the principles of SRE should be able to deliver to an organisation, and to supply you with a roadmap of how to go from first considering implementing SRE practices, to what your first SREs should be tasked with, to making your SRE team a stable and productive fixture in your organisation.

Stephen Thorne

Stephen Thorne is a Senior Site Reliability Engineer at Google. He currently works in Customer Reliability Engineering, helping to integrate Google's Cloud customers' operations with Google SRE. Stephen learned how to be an SRE on the team that runs Google's advertiser and publisher...

Friday August 31, 2018 09:00 - 09:55 CEST
2 - Rheinlandsaal Ballroom BC

09:55 CEST

Capacity Planning in Four Parts: Telling the Future without a Crystal Ball
Capacity Planning is a pretty difficult topic to wrap your head around and can seem almost impossible to start with. In this talk, I present the four basic steps to go from monitoring what you have to planning for the future. If you've found yourself woken up at 4 AM to expand nodes to meet demand, it's time to pull away from reactive actions and start proactive capacity planning.
The goal of the session is to:

  • Provide an approachable method for capacity planning right now.
  • Highlight concerns and considerations before starting your plan.
  • Explain how to enforce your plan after you've made it.
  • Explain when to re-evaluate your plans and reduce toil.
  • Recommend some methods for avoiding common pitfalls engineers/business folk fall into.

Evan Smith

SRE, Hosted Graphite
Evan Smith is a Service Reliability Engineer with Hosted Graphite in Dublin, Ireland. He's responsible for architecting capacity plans and the company ChatOps bot while helping manage an ingestion pipeline of over 100 billion data points a day. A lover of systems, security, and making...

Friday August 31, 2018 09:55 - 10:30 CEST
2 - Rheinlandsaal Ballroom BC

11:00 CEST

Tradeoffs in Resiliency: Managing the Burden of Data Recoverability
Almost every service has critical data somewhere, whether it's large-scale blob storage or minimalistic index tables or just the service's own production configuration. The data's sizes and shapes and storage technologies vary widely; and yet, the possibilities for data loss remain, and the same obstacles to recovery consistently appear. This talk reviews the practices that can prepare a service for practical data recoveries, highlights some of the hidden dangers waiting to ambush a recovery attempt, and examines some of the risk/cost tradeoffs that inevitably dominate data integrity coverage, based on the lessons of five years of data integrity tooling and consulting across Google.


Kristina Bennett

Kristina Bennett has worked at Google since 2009. Although she recently joined the Customer Reliability Engineering team in their mission to apply the principles and lessons of SRE at Google towards customers, prior to that she spent five years working on data integrity across Go...

Friday August 31, 2018 11:00 - 11:45 CEST
2 - Rheinlandsaal Ballroom BC

11:45 CEST

Scalable Coding—Find the Error
An important function in our job as SREs is being able to write code and review code, but even here we miss a lot of details and many bugs remain undiscovered. Why is this? A major part of the problem is how our brain works—we only see what we want to see. This talk will explain this important "shortcoming" of our brain and give some insights on how to overcome this problem by addressing common examples of code snippets that can easily lead to bugs and problems at scale.


Igor Ebner de Carvalho

At Microsoft Azure, Igor has been working with some of the largest and most resilient services on the planet, where he helps drive the necessary changes needed for Azure to be ahead of the current growth curve we are experiencing in the Cloud business.

Friday August 31, 2018 11:45 - 12:15 CEST
2 - Rheinlandsaal Ballroom BC

12:15 CEST

Delete This: Decommissioning Servers at Scale
Facebook's datacenter footprint has increased significantly; we now have 12 locations across the USA and Europe. As these new locations come online, we have had to plan for the end-of-life process: decommissioning server racks and replacing them in a timely and streamlined manner. Until recently, decommissioning a cluster entailed a lot of manual work: service oncalls were ticketed by project managers and then migrated off the old hardware onto new hardware, after which hardware was unplugged and rolled out.

We realized the need for automation that covered all of this. We started with a framework that allows for automated service migration, given a list of retiring machines and a list of replacements. We moved on to an automated process that looks at a decommission schedule and kicks off jobs to drain server clusters on time so that old racks can be taken away and new racks rolled into their place.

With this automated process in place, we have learned lessons and figured out how to minimize the time that old servers spend without services running on them before being rolled out of the datacenter. We are also exploring ways to reuse parts of this framework in other ways to increase efficiency.


Anirudh Ra

Customer support tech turned production engineer, Anirudh tries to remember that his job is still about helping people succeed. He builds frameworks for service owners to run their services with minimal bother and enjoys baking bread and reading fiction and histories and fictional...

Friday August 31, 2018 12:15 - 12:40 CEST
2 - Rheinlandsaal Ballroom BC

14:00 CEST

Keep Building Fresh: Shopify's Journey to Kubernetes
Shopify, in 2014, was one of the first large-scale users of Docker in production. We ran 100% of our production traffic in hundreds of containers. We saw the value of containerization and aspired to also introduce a real orchestration layer.

Fast forward two years to 2016, when instead we had a clumsy and fragile homemade middleware for controlling containers. We started looking at orchestration solutions again and the technology behind Kubernetes intrigued us.

In this talk I'll briefly go over challenges we saw in moving from a traditional host-based infrastructure to a cloud native one, moving not only our core app to Kubernetes but also hundreds of our other apps at the same time. I'll focus on the cluster tooling solutions we've built like controllers, cluster creators, and deploy tools. We've automated things ranging from our DNS to certificates and even complex cluster creations—and all with a real programming language and projects rather than a handful of random scripts.

The ability to extend Kubernetes to fit our needs has been the greatest reward of this project. It's given us a new paradigm on which to build rather than relying on old patterns.

Niko Kurtti

Production Engineer, Shopify
Tinkerer with a keen interest in container technologies. Working on the cloud platform team at Shopify and building an internal PaaS on top of Kubernetes.

Friday August 31, 2018 14:00 - 14:50 CEST
2 - Rheinlandsaal Ballroom BC

14:50 CEST

The Myth of Cloud Agnosticism
In theory, the idea of having infrastructure that can seamlessly deploy between different cloud providers is a wonderful concept. Who wouldn't love to migrate workloads seamlessly between providers for a variety of reasons? In theory, a tiger with an anger management problem is just a scaled-up house cat.

This talk explores the practical reality of cloud agnosticism, with all of its warts. The financial, technical, and operational complexities introduced by multiple providers can take companies by surprise. Come explore the basic truth of "however much you hate your cloud provider, you will hate the migration process far more."

Corey Quinn

Editor, Last Week in AWS
Corey is a Cloud Economist at the Quinn Advisory Group and an advisor to ReactiveOps. He has a history as an engineering director, public speaker, and cloud architect. Corey specializes in helping companies address horrifying AWS bills, hosts the "Screaming in the Cloud" podcast...

Friday August 31, 2018 14:50 - 15:30 CEST
2 - Rheinlandsaal Ballroom BC