SREcon18 EMEA has ended

Track 1
Wednesday, August 29

11:20 CEST

What Makes a Good SRE: Findings from the SRE Survey
Site Reliability Engineering is a relatively new discipline when it comes to careers, having only been in existence for about 15 years. While 15 years may seem like an eternity, the SRE role can be considered to be in its infancy. This leads to challenges defining the role and understanding exactly what it is. Browsing through job postings or speaking with other SREs, you will see many different job descriptions and responsibilities.

Earlier this year Catchpoint conducted a survey of 416 professionals with the title or responsibility of an SRE in an attempt to create a real-world profile of the SRE and the organizations where they work. This session will review the findings of the survey. Learn what the top technical and non-technical skills are, and whether they vary by industry or size of the company. What surprised us with the findings? What additional questions arose analyzing the results? And how can the survey results be used to help organizations build out an SRE team?


Dawn Parzych

Dawn is a Director at Catchpoint where she uses her storytelling prowess to write and speak about the intersection of technology and psychology. She makes technical information accessible avoiding buzzwords and jargon whenever possible. Dawn has spoken at DevOpsDays, Velocity, Interop...

Wednesday August 29, 2018 11:20 - 12:00 CEST
1 - Rheinlandsaal Ballroom A

12:00 CEST

Sustainability Starts Early: Creating a Great Ops Internship
Representing the next generation of engineers, interns are integral to a sustainable SRE team. Why, then, do some revolutionize your deployment process while others never make it past their first commit? Interns and junior engineers can be particularly susceptible to burnout, and effective mentorship can make all the difference.

With little effort, you can drastically increase the likelihood of successful internships. Supporting your intern with realistic projects and expectations, reasonable docs, and frequent feedback can tip the scales. Unfortunately, there is very little information out there about what this really looks like. I’ve known students who, after an internship, chose to leave tech entirely, and others who were inspired to start a career in operations. In this talk, you’ll hear about how you can set your intern up for success with concrete steps and examples.


Fatema Boxwala

University of Waterloo
Fatema Boxwala is a CS student at the University of Waterloo. At school she’s involved with the Women in Computer Science Committee and the Computer Science Club, occasionally teaching people about Python, Git and Systems Administration. At work she’s been an intern at Rackspace...

Wednesday August 29, 2018 12:00 - 12:30 CEST
1 - Rheinlandsaal Ballroom A

14:00 CEST

The 7 Deadly Sins of Documentation
Documentation can often be forgotten when it comes to the pillars of SRE practice; developing tools and new infrastructure has supremacy, and a constant sense of urgency leads to documentation being a "we'll get around to it later" task. Even in places where documentation is actually written, it's often done quickly or poorly in the first place, not maintained, or not organized in a way that makes it easy to use. This can impair onboarding, obfuscate problems, and impair incident response. But if done correctly, documentation can preserve institutional knowledge, streamline incident response, and allow new team members to quickly begin contributing. This talk will discuss the biggest problems surrounding creating, maintaining, and providing utility with documentation, and how to solve them.


Chastity Blackwell

Chastity Blackwell has worked in Operations for almost 20 years in a variety of environments, from the hallowed halls of the University of Illinois to a smaller startup, and now works for Yelp as an SRE. In addition, she has done work as a freelance writer and content developer for...

Wednesday August 29, 2018 14:00 - 14:50 CEST
1 - Rheinlandsaal Ballroom A

14:50 CEST

Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation
Every disaster is a concatenation of smaller failures. How can we design software and processes to accept that we live in an imperfect world? Explore the concepts of resiliency, harm reduction, over-engineering, and planning for failure with real examples.

Risk Reduction is trying to make sure bad things happen as rarely as possible. It's anti-lock brakes and vaccinations and irons that turn off by themselves and all sorts of things that we think of as safety modifications in our life. We are trying to build lives where bad things happen less often.

Harm Mitigation is what we do so that when bad things do happen, they are less catastrophic. Building fire sprinklers and seatbelts and needle exchanges are all about making the consequences of something bad less terrible.

This talk is focused on understanding where we can prevent problems and where we can just make them less bad, and what kinds of tools we can use to make every disaster a disappointing fizzle.


Heidi Waterhouse

Developer Advocate, LaunchDarkly
Heidi is a developer advocate with LaunchDarkly. She delights in working at the intersection of usability, risk reduction, and cutting-edge technology. One of her favorite hobbies is talking to developers about things they already knew but had never thought of that way before. She...

Wednesday August 29, 2018 14:50 - 15:30 CEST
1 - Rheinlandsaal Ballroom A

16:00 CEST

Your System Has Recovered from an Incident, but Have Your Developers?
Mistakes are inevitable, and happen to the best of us. Our industry adopts a blame-free culture, but that doesn't negate the sting that occurs when we're at the heart of a mess-up.

Developers continually raise the bar on how to prevent errors, mitigate damage for ones that arise, and wring out as many learnings as possible after the damage is done. But much of this work is focused on the products, and not the people. And given the high stakes in SRE, the range of how a mistake psychologically impacts people can run the gamut from minor to the near-traumatic.

Where are the game day exercises that simulate how to support a coworker who just caused 3 a.m. pings and 20-hour work days? What resources should we share to help people understand the stages of emotions they'll feel after a major incident?

The concept of psychological safety is well understood as a key predictor for high-performing teams, but what does that entail? Drawing from original research, and lessons from fields like sports, medicine, and even stand-up comedy, attendees will leave with a series of tangible actions and exercises to help restore team trust and rebuild a developer's confidence.


Jaime Woo

Jaime Woo started his career as a molecular biologist, working on cartilage replacements. While he adored nurturing genetically-modified E. coli, he realized his main passion was storytelling. He has written an award-nominated book, launched the Engineering blog at Riot, built the...

Wednesday August 29, 2018 16:00 - 16:45 CEST
1 - Rheinlandsaal Ballroom A

16:45 CEST

Against On-Call: A Polemic
There have been computer emergencies as long as there have been computers and emergencies. There is evidence dating computer on-call shifts to the 1940s, and we can trace on-call activity in an essentially unbroken line from today back through seven decades of the computer industry.

But just because we've been doing it for seven decades doesn't make it right. In fact, if you think about it, the fact we've been doing something essentially unchanged for seven decades, given everything else that has changed around the practice, might cause us to ask: is it right that we are doing this? Is there something better? What are the alternatives?

This talk builds on previous work to analyse and establish what the _real_ reasons for continuing to do on-call are, provides evidence that human beings are actually really bad at it (in fact, it's really harmful for them to do it), and proposes solutions, ending with a call to action for the industry to bring into being an on-call-free future.

(Note: this talk builds on an article written for DNB's "Seeking SRE.")


Niall Murphy

Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer...

Wednesday August 29, 2018 16:45 - 17:30 CEST
1 - Rheinlandsaal Ballroom A
Thursday, August 30

09:00 CEST

Dealing with Dark Debt: Lessons Learnt at Goldman Sachs
Dark debt is a form of technical debt that is invisible until it causes failures. In enterprise systems, technical engineering liabilities often develop over time due to system complexities, and changes down the line might have unintended consequences due to unforeseen inter-dependencies or hidden problems.

In this talk, we will cover strategies to prevent and mitigate dark debt, as well as tools we have developed or adopted at Goldman Sachs to help manage complexity across our environment.
Attendees of this session can expect to learn:

  • Challenges of managing distributed systems
  • Practical tips on how to manage technical debt in your environment
  • How SRE can help with tackling this form of risk


Vanessa Yiu

Site Reliability Engineering, Goldman Sachs
Vanessa is a site reliability engineer at Goldman Sachs in London. She has worked in a number of engineering roles across Goldman Sachs including sysadmin, storage, market data, trading support as well as managing the firm’s SDLC tool stack for build, test, and deploy.

Thursday August 30, 2018 09:00 - 09:55 CEST
1 - Rheinlandsaal Ballroom A

09:55 CEST

Halt and Don’t Catch Fire
Every young systems team (regardless of whether it’s in a startup) is burdened with someone else’s decisions, good and bad.

Startups are unique in that most of the people who contributed to their technical debt are still there. They are flexible and, in many cases, can still create a solid foundation (both cultural and technological) for the future. On the other hand, they are fragile, change fast, and frequently adopt new technologies prematurely. New systems teams in medium-sized companies face even harder challenges. They must keep everything running and take care of an unknown, complex infrastructure and development team while dealing with the lack of original information.

We will discuss how we can slow down the generation of technical debt in new teams, put it in perspective using practical examples and real-life stories, and show how small iterations and changes in culture can lead to more sustainable systems and teams.


effie mouzeli

Site Reliability Engineer, Wikimedia Foundation
Effie studied physics and scientific computing but decided to follow neither. Instead she became a sysadmin, later systems engineer, now SRE. She has worked in a number of startups and small organisations, where her responsibilities were usually automation, infrastructure architecture...

Thursday August 30, 2018 09:55 - 10:30 CEST
1 - Rheinlandsaal Ballroom A

11:00 CEST

Not Invented Here Syndrome and Dark Debt: The PagerDuty Story
The conundrum of building something in-house vs. using something off-the-shelf is an important one. Organizations sometimes have their own set of priorities and business requirements that lead them to conclude that no off-the-shelf solution fits the bill. In such scenarios, companies and teams often end up building their own custom software. Once such custom software is written and deployed, organizations rarely consider revisiting their original choices. Proliferation of such decisions within an organization ultimately results in dark debt.

This talk addresses the build vs. buy conundrum. I will start with the different choices that teams and organizations have when considering an in-house solution. Next, I will share a case study about PagerDuty’s journey: from building a highly available, Cassandra-based distributed queue solution for a specific use case, to using it (albeit incorrectly) everywhere else, to deprecating and retiring it in favour of something off-the-shelf. Along the way, the audience will also learn how such upgrades were done without affecting the health of production systems, and how revisiting decisions from the past helped uncover dark debt.


Aish Raj Dahal

Software Engineer, PagerDuty
Aish works as an Engineer at PagerDuty in San Francisco. He currently works on building PagerDuty’s event intelligence platform, often dealing with fallacies of distributed computing. His recent focus has been on Elixir/OTP and building event driven microservices using Kafka and...

Thursday August 30, 2018 11:00 - 11:35 CEST
1 - Rheinlandsaal Ballroom A

11:35 CEST

Building a Debuggable Go Server
Microservice architectures bring flexibility and scalability to service-based applications; however, they also drastically increase operational complexity. This presentation will explore how Improbable uses a common Go server with built-in debuggability as a basis for new services, minimizing the complexity inherent in microservices. This server provides consistent metrics, logging, tracing, and more, to deliver a reliable system, and it is now open-source!


Keeley Erhardt

Software Engineer, Improbable
Keeley is a software engineer at Improbable, a London-based tech company focused on enabling massive-scale simulation. She graduated from MIT with a B.S. and an M.Eng in Computer Science. Keeley is passionate about distributed systems and open source and has contributed to a variety...

Thursday August 30, 2018 11:35 - 12:00 CEST
1 - Rheinlandsaal Ballroom A

12:00 CEST

Building a Fellowship Program to Mentor and Grow Your SRE Team
Mentorship is invaluable at any point in your career. At DigitalOcean, we introduced an internal two-week fellowship program pairing any developer interested in learning more about what infrastructure does with a senior engineer. We followed Tuckman's four stages of group development: forming, storming, norming, and performing. We believe we create the best-performing team when mentors and mentees go through the four stages together as a team. Two weeks may seem brief, but it let us iterate quickly and focus our energies on mentoring just one person at a time, which limited the strain on the team’s bandwidth. The benefits were manifold: our infrastructure team gained a better perspective on what other teams go through and work on daily, which helps us build better tools and workflows to support them. Not only did participants strengthen their skills, but some joined infrastructure, realizing it was right for them. And for those that didn't join, it was an excellent way to cross-pollinate ideas and build the infrastructure team's relationships with other teams. In this talk, attendees will hear about the theory, lessons learned, and how to create their own fellowship program.


Tom Spiegelman

I am truly passionate about all things tech, specifically infrastructure, if you want to talk to me about it. I have been at DigitalOcean for coming up on 3 years now and really love it.

Thursday August 30, 2018 12:00 - 12:30 CEST
1 - Rheinlandsaal Ballroom A

14:00 CEST

Managing Misfortune for Best Results
Simulated outage training games are a regular part of SRE training at Google and elsewhere. They represent great opportunities to simulate an outage, and to practice problem debugging and escalation. Perhaps equally important, they provide an opportunity to simulate the stress of an outage for an on-call engineer.

This talk describes techniques to ensure a productive training environment. It will emphasise the importance of providing context to the trainee engineer. It will also talk about the importance of calibrating the level of stress to the needs of the student. Since training games are often observed by whole teams, the talk will cover ways to maintain engagement among the group of observers.

Finally, it will talk about potential anti-patterns to be avoided.


Kieran Barry

SRE @ Google
Kieran has worked at Google as an SRE on the search team for the past four years. He has also volunteered on new-SRE education as part of the SREEDU team.

Thursday August 30, 2018 14:00 - 14:45 CEST
1 - Rheinlandsaal Ballroom A

14:45 CEST

Clearing the Way for SRE in the Enterprise
Do you get excited listening to the SRE success stories from companies whose practices, infrastructure, and tooling seem ready-made for this new style of working? But then do you quickly become disillusioned when you look back at the hairball that is your company's multiple generations of legacy infrastructure, tooling, skills, processes, and business constraints? This talk is for you.

We'll first look at how decades of traditional operations beliefs and practices leave organizational scar tissue that is difficult to overcome. We'll examine examples of how silos, excessive toil, reliance on queues, and incorrectly applied governance models undermine the adoption of SRE principles and practices in the enterprise.

We will also look at how high-performing enterprises are having success transforming their operations. Specifically, we'll look at the design patterns they applied to change organizational structures, update roles/responsibilities, align incentives, and change individual mindsets. The result was that these organizations cleared away that scar tissue and made it easier to move to an SRE model of working.

Note: This talk is based on Damon Edwards's chapter "Clearing the Way for SRE in the Enterprise" in the upcoming Seeking SRE (O'Reilly 2018).


Damon Edwards

Co-Founder and Chief Product Officer, Rundeck
Damon Edwards is a Co-Founder of Rundeck, Inc., the makers of Rundeck, the popular open source Self-Service Operations platform. Damon was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy. Damon has spent the past years working with...

Thursday August 30, 2018 14:45 - 15:30 CEST
1 - Rheinlandsaal Ballroom A

16:00 CEST

The Math behind Project Scheduling, Bug Tracking, and Triage
Many projects have poorly defined (and often overridden) priorities, hopelessly optimistic schedules, and overflowing bug trackers that are occasionally purged out of frustration in a mysterious process called "bug bankruptcy." But a few projects seem to get everything right. What's the difference? Avery collected the best advice from the best-running teams at Google, then tried to break down why that advice works—using math, psychology, an ad-hoc engineer simulator (SimSWE), and pages torn out of Agile Project Management textbooks.

We'll answer questions like:

  • Why are my estimates always too optimistic, no matter how pessimistic I make them?
  • How many engineers have to come to the project planning meetings?
  • Why do people work on tasks that aren't on the schedule?
  • What do I do when new bugs are filed faster than I can fix them?
  • Should I make one release with two features or two releases with one new feature each?
  • If my bug tracker is already a hopeless mess, how can I clean it up without going crazy or declaring bankruptcy?


Avery Pennarun

Once upon a time, Avery was the lead engineer for Google Fiber's home wifi devices, building, managing, and monitoring the whole fleet in customers' homes. More recently, he's branched out into projects that are harder to explain. Before that, he started startups including one that...

Thursday August 30, 2018 16:00 - 16:45 CEST
1 - Rheinlandsaal Ballroom A

16:45 CEST

Ethics in Computing
As computing evolved to touch billions of lives, it did so over the rapid course of about 25 years. In that time, the profession of computing has not self-regulated with respect to professional domain-specific ethical enforcement. In fact, many computing professionals have not taken a course on ethics, know very little about ethics, and don't understand how it uniquely applies to their industry. This talk aims to cast light on this important subject in a tone and tenor that should resonate with computing professionals.


Theo Schlossnagle

Founder & CEO, Circonus
The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded...

Thursday August 30, 2018 16:45 - 17:30 CEST
1 - Rheinlandsaal Ballroom A
Friday, August 31

09:00 CEST

I’m SRE and You Can Too!—A Fine Manual for Migrating Your Organization to the New Hotness
Tales of brown-field SRE transitions from Netflix, Stripe, YouTube, Chrome, and the Google acquisitions process that you can use to overcome your own obstacles to continuous operational improvement.

How do you build a sustainable SRE program? How do you migrate an existing infrastructure, teams, and workflow to an architecture that can realize the benefits of SRE? What choices have companies made with their SRE programs, and what should you consider when adopting the SRE model?

We've been through this process a half dozen times in differing circumstances and organizations, and will share with you what has worked—and occasionally failed—for us as well as attempt to answer some of the most common questions we've had from previous attendees undergoing their own SRE/DevOps transitions.


Blake Bisset

Blake got his first legal tech job at 16, long enough ago that he’s entitled to make shakeyfists while shouting “Get off my LAN!” He did three startups (a Dupont/ConAgra venture; a UW biotech spinoff; and this other time some kids were sitting around New Year's Eve, wondering...

Jonah Horowitz

Site Reliability Engineer
Jonah is a Senior Site Reliability Engineer with 18 years of experience building and scaling production applications. He's worked at several startups and large companies including Quantcast, Netflix, and Stripe.

Friday August 31, 2018 09:00 - 09:55 CEST
1 - Rheinlandsaal Ballroom A

09:55 CEST

Lessons Learned—Data Driven Hiring 3 Years Later
In 2015 I gave a talk on Data Driven Hiring at the first SREcon EMEA. Three years later I would like to revisit that talk, and discuss what's changed and what has remained the same since I first put in place a new hiring system based on evaluating "potential" rather than hard skills.

With the speed of changing technology, we have shifted to a reality where companies need people with growth mindsets who are able to embrace change and pick up new technologies fast. Gone are the days when the most technologies on a resume wins the race, and hiring managers are now working to replace their old methods with ones that validate a baseline of technical competency but focus on hiring for unknown unknowns.

Even if you are not a manager, interviewing for new roles is a constant in our world. This talk will help you understand what many managers are looking for and will give you an understanding of the evolution of the hiring process in the event you have not changed jobs recently.


Chris Stankaitis

The Pythian Group
Working for Pythian, Chris builds and manages high-performing SRE and Hadoop teams which are globally distributed (follow the sun) and remote (work from home). Working with companies from startup to web-scale, Chris's teams keep many of the sites and services people use on...

Friday August 31, 2018 09:55 - 10:30 CEST
1 - Rheinlandsaal Ballroom A

11:00 CEST

The Nth Region Project: An Open Retrospective
For the past year, a small team of engineers and I have had one job: allow New Relic to run an independent European region for data sovereignty reasons. That means taking around 500 services written by around 50 teams that have historically been assumed to run in just one deployment and changing them to work anywhere. And, at the end of the process, we needed to be able to spin up new regions quickly and sustainably operate them with our existing staff.

The talk will be in two parts, because a project like this isn't purely technical or organizational.

We needed to choose technical changes that turned building out a new region from a many-month-long process for all teams into a project for one small team. We decided that the key was to move all services to run in containers, and have them all do service discovery via dependency injection.

The reality of working at a medium-sized organization meant we had to have a lot of coordination and buy-in. I'll talk about how our roadmapping process both hindered and enabled this project to work at all, and how we used test buildouts and teardowns to integrate early and often.

This wouldn't be an open retrospective without talking about what didn't work well, which was primarily organizational rather than technical. We've learned some lessons on how to run large-scale projects that will hopefully help us on our next one, so I hope that we can provide some hard-earned lessons.


Andrew Bloomgarden

New Relic
Andrew Bloomgarden is a Principal Software Engineer at New Relic. He's worked on a wide range of projects, including the NRDB distributed event database, charting, the autocompleting NRQL query editor, bare metal hardware provisioning, and supporting multiple regions. He lives in...

Friday August 31, 2018 11:00 - 11:45 CEST
1 - Rheinlandsaal Ballroom A

11:45 CEST

This IS NOT Fine: Putting Out (Code) Fires
So the dumpster is on fire. Again. The site’s down. Your boss’s face is an ever-deepening purple. And you begin debating whether you should join the #incident channel or call an ambulance to deal with his impending stroke.

Fires are never going to stop. We’re human. We miss bugs. Or we fat finger a command — deleting dozens of servers and bringing down S3 in US-EAST-1 for hours — effectively halting the internet. These things happen.

But we can fundamentally change the way we approach fires. And that requires adopting the techniques of industries much older than ours.

Firefighters have clear procedures and a strong hierarchy. The first truck at a scene immediately begins assessing the situation. They evaluate building construction, visible smoke, fire and flow paths. The Incident Commander gives orders to his or her personnel regarding fire attack as well as calling for additional resources, communicating with those not yet at the scene and preparing an action plan.


Emily Freeman

After many years of ghostwriting, Emily Freeman made the bold (insane?!) choice to switch careers into software engineering. Emily is the curator of JavaScript January, a collection of JavaScript articles which attracts 20,000 visitors in the month of January. She works as a developer...

Friday August 31, 2018 11:45 - 12:15 CEST
1 - Rheinlandsaal Ballroom A

12:15 CEST

What Medicine Can Teach Us about Being On-Call
Being on-call is a critical and stressful part of being an SRE. While most organizations want and are willing to take steps to reduce the on-call burden, few have used quantitative research methods to try to optimize being on-call.

At the same time, being on-call is a part of most physicians’ practice. This is especially true for medical residents (postgraduate doctors in training), who can be on-call as often as once every three days. The field of medicine has undertaken numerous studies and research projects to optimize the handling of on-call duties. These studies have explored work-life balance, ways to decrease the number of critical incidents (which can literally mean life or death), as well as reducing mistakes.

This talk breaks down the techniques and research that have led to practices that can be adopted for SREs. It also looks at issues that remain unsolved in both fields, like pages sent to the wrong team or those that shouldn’t have been sent at all. Finally, it concludes with words of warning that SREs are not physicians, and as with any interdisciplinary study, we must be mindful of these differences when borrowing techniques.


Daniel Turner

Senior Software Developer, Shopify
Daniel Turner is a senior software developer at Shopify. He is part of the team building Shopify’s Kubernetes-based platform-as-a-service. He came to the team after working on deploying and running Kubernetes in Shopify’s data centers. Daniel is an experienced speaker and currently...

Friday August 31, 2018 12:15 - 12:40 CEST
1 - Rheinlandsaal Ballroom A

14:00 CEST

Observability for Emerging Infra: What Got You Here Won't Get You There
Distributed systems, microservices, containers and schedulers, polyglot persistence... modern infrastructure is ever more fluid and dynamic, chaotic and transient. Likewise, individual engineering roles can no longer be broken down neatly into software engineers (who write the code) and ops engineers (who deploy the code and buffer the consequences). Many teams have already sailed past an event horizon of complexity and found that their old tools and processes no longer work for them. But why, exactly? What was wrong with traditional metrics and logs? Why are they failing to keep pace with modern requirements? Isn't observability just a marketing term for monitoring? And what on earth can we do about it? In this talk, we'll cover the shortcomings of traditional metrics and logs, and the technical and cultural differences between monitoring and observability. We'll also talk about the deep cultural revolution underway from siloed specialties towards software ownership (in every type of engineering role) and what exactly that means when it comes to systems observability, as well as the technical practices and mental shifts that are absolutely required to keep pace with modern infrastructure.


Charity Majors

Co-founder and CTO, Honeycomb.io
Charity Majors is the co-founder and CTO of Honeycomb.io, provider of tools for engineering teams to debug production systems faster and smarter. Previously Charity ran infrastructure at Parse and was an engineering manager at Facebook, where she ran next-generation distributed systems...

Friday August 31, 2018 14:00 - 14:50 CEST
1 - Rheinlandsaal Ballroom A

14:50 CEST

Deploying SRE Training Best Practices to Production: What We Learned (a.k.a. Strapping Jetpacks on Unicorns, the Postmortem)
In 2015, Andrew Widdowson gave a talk at SREcon Americas titled “From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams.” His recommendations were based on nearly a decade of personal experience ramping up new SREs at Google.

Fast forward to 2018. Google SRE now has a global training organization called SRE EDU. In many ways, SRE EDU was charged with developing a formal program to deploy these training best practices into production. Our goal? Spin up a globally consistent and reliable education program for Site Reliability Engineering.

Of course a cornerstone of SRE practice is the blameless postmortem. This talk addresses what we learned when scaling training best practices globally. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that you deliver an effective training experience for your SREs.


Jennifer Petoff

Senior Program Manager, Site Reliability Engineering Team, Google Ireland
Jennifer Petoff is a Senior Program Manager for Google's Site Reliability Engineering team based in Dublin, Ireland. She is the global lead for Google’s SRE EDU program and is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production...

Friday August 31, 2018 14:50 - 15:30 CEST
1 - Rheinlandsaal Ballroom A