Incident Adventures

Incidents Happen

Are you prepared for an incident? What will your team actually do when PagerDuty calls? Be ready with Incident Adventures, the latest way to have fun while preparing for when the worst happens.

Inspired by Escape This Podcast and role-playing games, Incident Adventures is an easy way to run simulated incidents and responses for your engineering team. Set up scenarios and let your team discover the root cause while learning what to do along the way.

Imagination Over Simulations

Incident Adventures are run by an Adventure Master (or GM/DM for the tabletop players). Each scenario plays out in real life as much as possible, but as the practice Response Team explores the graphs, logs, and data, the Adventure Master lets the team know when they “see” something different. For example, suppose the scenario’s root cause is an expired SSL certificate and the team looks at the production graph of 200 responses. The Adventure Master might tell the team that while the chart looks normal in reality, the rate is noticeably lower in the scenario.

Because the simulation happens in the imagination, the team exploring the scenario can look at anything, anywhere. The Adventure Master knows the root cause and only needs to tell the team what it would look like.
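
One way an Adventure Master might keep track of what the team “sees” is a small lookup of scenario differences. The Python sketch below is only an illustration: the names Scenario, differences, and describe are hypothetical, and the entries are made up around the expired certificate example. Anything without an entry is assumed to look the way it does in real life.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A hypothetical record of one simulated incident."""

    root_cause: str
    alert_text: str
    # Maps something the team might look at to what it shows in the scenario.
    differences: dict[str, str] = field(default_factory=dict)

    def describe(self, looked_at: str) -> str:
        """Return what the Adventure Master would tell the team."""
        key = looked_at.strip().lower()
        # Anything without an entry is unchanged from real life.
        return self.differences.get(key, "It looks the same as it does in real life.")


# Illustrative scenario based on the expired certificate example.
expired_cert = Scenario(
    root_cause="An SSL certificate has expired",
    alert_text="User reports they cannot access the service",
    differences={
        "200 response graph": "The rate of 200 responses is noticeably lower than normal.",
    },
)

print(expired_cert.describe("200 response graph"))
print(expired_cert.describe("CPU usage dashboard"))
```

Keeping only the differences means the Adventure Master does not have to script every graph or log in advance, which matches the idea that the team can explore freely.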

How It Works

Preparation

While not much is needed to run an Incident Adventure, some preparation is required.

1. Collect a list of root causes

You don’t need very many, and they don’t need to be super complex, but it is a good idea to have root causes in various parts of your infrastructure or code. That way each root cause scenario can introduce the team to that part of your system or refresh their knowledge of it.

2. Prepare to run the scenario

The Adventure Master running the scenario needs to know when and how a given root cause would show up in whatever the team is exploring. In addition, the Adventure Master, or perhaps a helper, needs to be able to step in and provide hints or direction if the team runs into a wall.

Remember: This is about learning and practicing what to do in an incident. Being stuck for too long is not only frustrating, but also hinders the goal of learning.

3. Practice Response Team Size

A practice Response Team should be 2-3 people of about the same level of ability. If your team is unbalanced, let the senior team members know that they should give their junior teammates time to think and respond first.

It is also possible to have some observers, but observing should be for learning the basics. You don’t want to give observers a false sense of ability. Observers also increase the pressure, so get the team’s permission first and favor senior team members for observed sessions.

4. Time Length

Scenarios should take between 30 and 60 minutes. Two 30-minute scenarios or a single 45-60 minute scenario is about all that a single session should try to do. More than that could be information overload for the team.
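
The timing guidance above can be written down as a quick check. This is a minimal Python sketch with some assumptions: the function name session_fits is hypothetical, and it treats 60 minutes as the cap for a whole session, which is one reading of “two 30-minute scenarios or a single 45-60 minute scenario”.

```python
MIN_SCENARIO_MINUTES = 30
MAX_SCENARIO_MINUTES = 60
# Assumption: a single session tops out at 60 minutes in total.
MAX_SESSION_MINUTES = 60


def session_fits(scenario_minutes: list[int]) -> bool:
    """Check a planned session against the suggested time limits."""
    if not scenario_minutes:
        return False
    # Every scenario should run between 30 and 60 minutes.
    each_ok = all(
        MIN_SCENARIO_MINUTES <= minutes <= MAX_SCENARIO_MINUTES
        for minutes in scenario_minutes
    )
    # The whole session should not run past the session cap.
    return each_ok and sum(scenario_minutes) <= MAX_SESSION_MINUTES


print(session_fits([30, 30]))  # True
print(session_fits([45]))  # True
print(session_fits([30, 30, 30]))  # False: too much for one session
```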

Getting Started

You have a practice Response Team and a selected root cause, so it is time to get started with the rules! Here’s a suggested introduction:

Welcome to Incident Adventures! In just a moment an alert will go off, but first let me set some ground rules.

  1. Talk through what you are doing and what you are looking at. Share your screen if possible.
  2. I am your filter to the outside world. I will let you know if something you are looking at is different in our scenario, but I can also be other teams and the customer.
  3. This is an open book setup. Search and use the resources you would have in a real incident.
  4. We’re here to learn and have fun. I might give you some hints, but getting stuck for a bit is ok.

The Alert

Let your team know that an alert has occurred. If possible, give them the text of the issue report or specific alert. User reports are also ok. The alert or issue should be true, but also shouldn’t give much away. For example, “increasing 400 responses” or “user reports they cannot access service” are clear starting points, but don’t tell too much about the root cause.

The Response

75% of the response work is just getting started. Just knowing where to find the Standard Operating Procedure, the logs, or the monitoring data is a great start, but some inexperienced teams may need help with this step.

Dead Ends

A good adventure has some dead ends. Take notes on the suggested but incorrect root causes. They might be great future scenarios for others!

Discovery and Closeout

You will have to determine when to count the root cause as found. Usually it takes two steps: the idea and the confirmation. When the team thinks they have the root cause, asking “How would you confirm that?” is a great way to take it out of speculation.
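
The two steps can be pictured as a tiny state machine. The Python sketch below is only an illustration of the idea-then-confirmation flow; the names Discovery and advance are hypothetical, and in practice the Adventure Master makes this call by judgment.

```python
from enum import Enum


class Discovery(Enum):
    SEARCHING = "searching"
    PROPOSED = "proposed"  # the team has voiced the correct idea
    CONFIRMED = "confirmed"  # the team has explained how they would confirm it


def advance(state: Discovery, correct_idea: bool = False, confirmed: bool = False) -> Discovery:
    """Move one step forward only when the matching step has happened."""
    if state is Discovery.SEARCHING and correct_idea:
        return Discovery.PROPOSED
    if state is Discovery.PROPOSED and confirmed:
        return Discovery.CONFIRMED
    # Otherwise nothing changes; a confirmation without an idea does not count.
    return state


state = Discovery.SEARCHING
state = advance(state, correct_idea=True)  # the team suggests the root cause
state = advance(state, confirmed=True)  # the team answers "How would you confirm that?"
print(state.value)  # confirmed
```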

Solution

Once you have congratulated the team on discovering the root cause, you can ask how they would resolve the issue, if the root cause lends itself to that. This might just be a discussion, or the team might bring up the runbook they would follow.

Expansions

Want more? Here are some ideas to set you up for more adventures!

Consulting

Want help setting up or running Incident Adventures for your company? Interested in an outside review of your incident planning?

Contact: Wil Wade
Email: wil@incidentadvenures.com