SREcon18 EMEA has ended

Track 3
Wednesday, August 29

11:20 CEST

Instrumenting an Existing Service for Monitoring

A new service lands on your desk, ready for production! Only, it doesn't seem to have any monitoring, and the dev team have moved on to more urgent needs. Starting with a design document, an existing code-base, an editor and a compiler, we will take this from "runs," to "runs, and is suitable for production use."

Some familiarity with Go is expected. There's (probably) no need for extensive coding, but the workshop does include modifying code and then compiling it. During the workshop, you will need a Go compiler, a way of cloning git repositories, and a way of running the resulting binaries.


Ingvar Mattsson

Ingvar Mattsson has 20+ years in the "unix admin" trenches, paired with an unhealthy fascination for monitoring and alerting systems.

Wednesday August 29, 2018 11:20 - 12:30 CEST
3 - Hegel Room

14:00 CEST

Statistics for Engineers

Gathering all kinds of telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes making sense of it and extracting valuable information. Some key questions are:

  • How to interpret the telemetry data that is emitted from the systems you are running?
  • How to measure the quality of APIs you provide and consume?
  • How to aggregate metrics from single nodes to service-level views?

In this workshop we will address these questions with statistical methods such as data visualisation, averages, percentiles, outlier analysis, histograms, regressions, robustness, and mergeability. We will cover the material from both a theoretical and a practical perspective. Bring pen and paper and a laptop!
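
As a rough illustration of the mergeability question (not taken from the workshop material), the sketch below compares averaging per-node percentiles with merging per-node histograms. The latency values, the 50 ms bucket width, and all function names are hypothetical and chosen only for this example.

```python
from collections import Counter


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def histogram(samples, width):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    return Counter((s // width) * width for s in samples)


def histogram_percentile(hist, p, width):
    """Upper bound of the bucket that contains the p-th percentile."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, (p * total + 99) // 100)
    seen = 0
    for lower in sorted(hist):
        seen += hist[lower]
        if seen >= rank:
            return lower + width


# Hypothetical request latencies in milliseconds from two nodes.
node_a = [10] * 99 + [500]  # 100 requests, almost all fast
node_b = [400] * 20         # 20 requests, all slow

# Averaging per-node percentiles ignores how many requests each node served.
avg_of_p90 = (percentile(node_a, 90) + percentile(node_b, 90)) / 2

# The service-level percentile needs all samples ...
true_p90 = percentile(node_a + node_b, 90)

# ... or a mergeable summary: bucket counts can simply be added together.
merged = histogram(node_a, 50) + histogram(node_b, 50)
merged_p90 = histogram_percentile(merged, 90, 50)

print(avg_of_p90, true_p90, merged_p90)
```

With these made-up numbers the averaged p90 lands far from the service-level p90, while the merged histogram recovers it to within one bucket width. The bucket width bounds the error, which is the usual trade-off when choosing a histogram layout.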

Heinrich Hartmann

Analytics Lead, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician...

Wednesday August 29, 2018 14:00 - 17:30 CEST
3 - Hegel Room

Thursday, August 30

09:00 CEST

Developing Effective Service Level Indicators and Service Level Objectives

Devising service level indicators and service level objectives that effectively measure user experiences requires knowledge of the fundamentals of SLIs and SLOs and practice examining SLIs/SLOs for potential weaknesses. In this workshop, you will review SLI/SLO fundamentals, then practice devising and checking SLIs/SLOs in groups with guidance from the Google Customer Reliability Engineering team.

Attendees are expected to have some basic knowledge of monitoring (e.g. whitebox/blackbox) and some experience with how systems can fail.

By devising SLOs and SLIs for an example application, attendees will gain the knowledge and experience needed to propose SLOs and SLIs for their own businesses. They will understand what SLAs, SLOs, SLIs, and error budgets are and how they are related, and they will understand how to define and measure meaningful SLOs and how to run a service or application targeting those SLOs.
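
As a small worked example of how an SLI, an SLO, and an error budget relate (a sketch only, not the workshop's own material), the arithmetic can be written as below. The function names, the 99.9% target, and the request counts are all hypothetical.

```python
from fractions import Fraction


def sli(total_events, bad_events):
    """Service level indicator: the fraction of events that were good."""
    return Fraction(total_events - bad_events, total_events)


def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the measurement window."""
    return (1 - Fraction(slo_target)) * total_events


def budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - Fraction(bad_events) / budget


# Hypothetical window: one million requests, 250 of them failed, SLO of 99.9%.
# The target is passed as a string so Fraction parses it exactly.
print(float(sli(1_000_000, 250)))                 # measured SLI
print(error_budget("0.999", 1_000_000))           # failures the SLO allows
print(budget_remaining("0.999", 1_000_000, 250))  # share of budget left
```

Exact fractions are used here only to keep the arithmetic free of floating-point rounding; a production system would more likely compute these figures from its monitoring data.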


Kristina Bennett

Kristina Bennett has worked at Google since 2009. Although she recently joined the Customer Reliability Engineering team in their mission to apply the principles and lessons of SRE at Google towards customers, prior to that she spent five years working on data integrity across Go...

Liz Fong-Jones

Developer Advocate, Activist, and Site Reliability Engineer, Google
Liz is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates...

Gwendolyn Stockman

Gwendolyn Stockman has worked at Google since 2008, first as an SWE and then as an SRE for the last 5 years. She is on the Customer Reliability Engineering team, which she joined after being in a similar group that works with teams within Google launching to production. Before helping...

Stephen Thorne

Stephen Thorne is a Senior Site Reliability Engineer at Google. He currently works in Customer Reliability Engineering, helping to integrate Google's Cloud customers' operations with Google SRE. Stephen learned how to be an SRE on the team that runs Google's advertiser and publisher...

Thursday August 30, 2018 09:00 - 12:30 CEST
3 - Hegel Room

14:00 CEST

Chaos Engineering Bootcamp

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering can be thought of as the facilitation of controlled experiments to unearth weaknesses.

Ana Medina leads a hands-on tutorial on chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization. Everyone will be given a Kubernetes cluster with a demo application to perform a set of chaos engineering attacks using a variety of chaos engineering tools. You’ll identify new ways to use chaos engineering within your engineering organization and discover how other companies are using chaos engineering to create reliable distributed systems.

Pre-Reading List

1. Production-Ready Microservices by Susan Fowler—especially the section on chaos testing at Uber (p. 94)

Prerequisites, Skills, and Tools

1. A basic understanding of production environments and the infrastructure required to run systems
2. Experience with Linux, cloud infrastructure, hardware, networking, and systems troubleshooting

Ana Margarita Medina

Senior Chaos Engineer, Gremlin
Ana is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. Previously, she worked at Uber as an engineer on the SRE and Infrastructure teams, where she specifically focused on chaos engineering and cloud...

Thursday August 30, 2018 14:00 - 17:30 CEST
3 - Hegel Room

Friday, August 31

09:00 CEST

Unconference: Resilient Design

Building resilient applications is not easy. We expect our servers to have decent, predictable latency even in the face of high load and unexpected load or latency spikes, and we want graceful degradation in the face of failures. This unconference session provides a space for practitioners to discuss experiences of designing and building applications with these concerns in mind. The agenda will be decided by the participants on the day, but examples of topics relevant to the conference are:

  • Queueing delays
  • Sizing queues, thread pools, concurrency limits
  • The effect of outliers (high percentiles) on system performance and capacity
  • Load shedding
  • Circuit breakers
  • Cascading failures
  • Isolating interdependent systems
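
To make one of the listed topics concrete, here is a minimal circuit breaker sketch. It is an illustration only and not part of the session material: the class name, the thresholds, and the injectable clock are all assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period.

    Closed: calls pass through and consecutive failures are counted.
    Open: calls are rejected immediately until reset_after seconds pass.
    Half-open: after the cool-down, one trial call decides whether to
    close the circuit again or re-open it.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fail fast or serve a degraded response. This sketch is not thread-safe and counts only consecutive failures; real implementations often track failure rates over a time window instead.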

Avishai Ish-Shalom

Engineer in Residence, Aleph VC
Avishai is a veteran operations and software engineer with years of high-scale production experience. At present, Avishai helps growing startups and the Israeli high-tech ecosystem as Engineer in Residence at the Aleph VC fund. In his spare time, Avishai is spreading weird ideas and...

Friday August 31, 2018 09:00 - 10:30 CEST
3 - Hegel Room

11:00 CEST

Unconference: Developing Effective SRE Teams

Skill acquisition applies to both individuals and teams. For a team, it is helpful to understand where the team is currently executing and potential changes to move to greater levels of effectiveness. For convenience, we can usefully divide skill levels into five ranges: novice, advanced beginner, competent, proficient, and expert.

The precise agenda will be decided by the participants on the day, but the idea in this unconference is to take aspects of SRE practice (such as monitoring, measurement against SLOs, incident management and postmortems, managing toil, etc.) and discuss what these look like at the different skill levels, not so much at an individual level as at an organizational one.

We would also like to discuss how attendees can gauge their own team's and company's state and progress, and develop a plan for strengthening weak areas. This unconference session is a follow-up to Kurt's talk "The Never-Ending Story of Site Reliability" from SREcon17 Europe and a later version presented at SREcon Asia 2018, "Characterizing and Phases of SRE Practice".

Collaboration spaces for the unconference:

Kurt Andersen

Program Committee, LinkedIn
Kurt Andersen was one of the co-chairs for SREcon-Americas in 2017 and 2018. He has been active in the anti-abuse community for over 20 years and is currently the senior IC for the Product SRE team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging...

Friday August 31, 2018 11:00 - 12:40 CEST
3 - Hegel Room

14:00 CEST

SparkPost: The Day the DNS Died

More than 30% of the world's non-spam email is sent using SparkPost's technology, and our cloud service sends over 15 billion messages per month. Deploying that service on AWS has provided all the expected cloud benefits of flexibility and scalability, but also distinct challenges due to email's unique profile and needs.

Our DNS needs are particularly extreme. Our infrastructure currently has to support 8,000 DNS queries per second. We have experienced several issues deploying a service model that can meet this need, and a major DNS-related outage in May of 2017 caused significant pain for our customers and sent us back to the drawing board once again. We recently completed a ground-up DNS service redesign that includes dedicated VPCs with optimized security groups and ACLs, distribution across tiers and availability zones, resolver tuning and custom configurations, and multiple local caching resolvers per instance.

In this talk, we will discuss our history addressing this challenge and lessons learned, the May outage event itself, and our current architecture's design and results. Attendees will gain an understanding of what it takes to host a robust DNS service in AWS at a scale beyond what is currently natively supported by AWS' resolver services.


Jeremy Blosser

Jeremy Blosser has worked in systems administration and engineering for 20 years, and most of that time has included a focus on reliably delivering email and other traffic at scale. He is currently the Principal Operations Engineer at SparkPost, responsible for technical architecture...

Friday August 31, 2018 14:00 - 14:50 CEST
3 - Hegel Room

14:50 CEST

Unikernels—The New Black

The operations world has never been more exciting! Long gone are the days of monolithic systems responsible for multiple services, and one cannot help but smile with a bit of sentiment when reading old books that gave advice along the lines of "If possible, try to run a single service on a machine."

From the rise of VMs some 15 years ago to the rise in popularity of containers in the last few years, the need for performant, easily scalable, and manageable solutions to the age-old problem of running our software reliably, often on a massive scale, is obvious and is being addressed by both academia and industry.

Containers have been the new black for quite some time, but are they the future?

In this talk, I am going to:

  • Explore the amazing and exciting world of unikernels! Unikernels are not a new technology and have been around for quite a while, but they are gaining popularity as an application container.
  • Explore some "young" and "old" unikernel projects.
  • Talk about the growing unikernel community and just how easy it is nowadays to start experimenting with unikernels.


Hristo Mohamed

Hristo Mohamed really likes system administration and has decided to make this his main activity during the day. He works at CERN in the LHCb Online Team as a System Administrator and mainly deals with making things run as smoothly as possible. His main activities are automation and...

Friday August 31, 2018 14:50 - 15:30 CEST
3 - Hegel Room