Guidance on Performing Retrospectives
In my career I’ve had to conduct a number of retrospectives. By the time one starts, things already suck: there was an outage at some point, customers were impacted, and it was our fault. Never was it solely on our underlying infrastructure provider (AWS or Heroku). Nope, the blame was on us and we’d failed in some way. And as soon as the incident was resolved, it wasn’t time to go home and decompress with a beer, it was time to start the process of a retrospective.
Finding the motivation to get right back to work is tough, but not losing time is important. There is probably a lot out there on retrospectives, and in general I was well rehearsed in them. But since I’d not performed a large-scale one in a few years, I found myself rusty and thought it’d be good to share some of our process.
Capture details immediately
It may not be clear if you’ve not been involved in many, but a retrospective is more than just a meeting to discuss what happened and how to fix it. It’s an overall process, and it begins with capturing thorough details of what happened. The starting point is a timeline, and the best time to capture it is while the details are fresh. Start with a Google Doc and simply document the timeline of everything. Copy in the relevant chat logs while they’re still recent in history and easy to find.
The start of an outage likely wasn’t the start of the timeline. There may have been something that happened days, weeks, or even years ago. Don’t just start from the time things went offline; go back to the causes as much as possible. If the offending code was committed a year ago, make sure to note that.
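If you’d rather keep the timeline as structured data than as free text, here is a minimal sketch of one way to do it. The TimelineEvent fields and both helper functions are names I’ve made up for illustration, not part of any tool; the point is only that events get sorted chronologically and that anything predating the outage stays on the timeline as a possible cause.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the timeline (hypothetical structure)."""
    when: datetime   # when it happened, not when it was written down
    source: str      # where the detail came from, e.g. "chat", "git", "pager"
    what: str        # short description of the event


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.when)


def contributing_causes(events, outage_start):
    """Return events that happened before the outage began, oldest first."""
    return [e for e in build_timeline(events) if e.when < outage_start]
```

Events can be added in whatever order you find them (chat logs first, commits later) and the timeline still comes out ordered, with the year-old commit sitting at the top where it belongs.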
Running the retrospective (the meeting part)
There are a number of good practices for running the retrospective itself. There are also a lot of different formats, all valid, each with its own pros and cons. You can go with a basic timeline, a what went well/what didn’t review, or a five whys analysis. I tend to prefer a clean and dry analysis of the timeline, what went well, what didn’t, and what we’re doing about it.
Some key tips to help a retrospective meeting be productive:
- Explicitly set time bounds for each activity ahead of time. More so than maybe any other meeting, it’s important to get through your whole agenda, and hard time limits on the planned items are how you accomplish this.
- Spend at least some time on what went well. Retrospectives aren’t fun, and time spent on the good parts of your process and response isn’t wasted. Just don’t be overly self-back-patting (that’s not a word but you get it).
- Don’t discuss people, or rather don’t point fingers. Yes, people will come up, but it’s about the technical and process errors, not the person who made them.
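For the time bounds in the first tip, a quick sanity check before the meeting can help. The sketch below is only an illustration: the schedule function is a hypothetical helper and the minute values in the usage note are made up. It turns a list of time boxes into start and end offsets and refuses an agenda that doesn’t fit in the meeting.

```python
def schedule(agenda, meeting_minutes):
    """Turn (item, minutes) pairs into (item, start, end) offsets in minutes.

    Raises ValueError if the time boxes add up to more than the meeting.
    """
    total = sum(minutes for _, minutes in agenda)
    if total > meeting_minutes:
        raise ValueError(
            f"agenda needs {total} minutes but the meeting is {meeting_minutes}"
        )
    slots = []
    offset = 0
    for item, minutes in agenda:
        slots.append((item, offset, offset + minutes))
        offset += minutes
    return slots
```

For example, schedule([("timeline", 20), ("what went well", 10), ("what didn’t", 15), ("what we’re doing about it", 15)], 60) fits exactly in an hour, while adding one more ten-minute item raises an error and tells you to cut something before the meeting rather than during it.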
The important part
Every step in the retrospective is important, but the goal of all of them is to get to how you can improve. With any retrospective there are likely two categories of improvements that will surface. The first is bugs that caused the issue, or engineering that could go in place to help with the specific issue. The second is process improvements. If you don’t have improvements in both areas, spend more time thinking about the one you’re missing.
Improvements shouldn’t be isolated to the exact issue you saw. Yes, you may see the exact same issue again, but there is also a lot more you can draw out that helps improve overall quality. It’s inevitable you’ll see different issues in the future, so thinking about how you can improve your systems and processes to catch those future issues is time well spent.
Sharing the details
You’ve done the hard work in the steps above, but it’s still good to share the results publicly and transparently. For the public retrospective I tend to follow this formula:
- Apologize, and mean it
- Show a firm understanding of your systems, and communicate the problem. Don’t try to be fancy technically, but don’t be too high-level either. Think Goldilocks.
- Share what you’re doing to improve
Credit to Mark for being the first person I’m aware of to lay it out like the above:
“Public outage retro formula reminder: Apologize, demonstrate thorough understanding of the failure, explain remediations. Simple as that.” (Mark Imbriaco, @markimbriaco, July 20, 2015)
Looking for more resources on the topic? Make sure to check out these two talks: