2 July 2020

You Build It, You Run It: at least there's one caveat

A quick introduction to You Build It, You Run It (YBIYRI)

If you are not yet familiar with YBIYRI, here‘s where it all started in 2006: A Conversation with Werner Vogels. 

Allow me to highlight the following paragraphs from the Interview with Amazon’s CTO:

“Another lesson we’ve learned is that it’s not only the technology side that was improved by using services. The development and operational process has greatly benefited from it as well. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service - from scoping out the functionality, to architecting it, to building it, and operating it.

“There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”

You Build It, You Run It, is an easy concept that doesn’t need a lot of explanation. It is basically the end-to-end responsibility of a team for the services it delivers . They define the architecture, pick the technologies, make the designs and take care of the development and deployment. Once a service is running in production, the team is responsible for its maintenance and support.

Engine

Does this concept work? In my honest opinion, yes, it does. Been there, done (doing) that. As you can expect though it’s not all champagne and caviar. It implies that the team has the authority to take all decisions. Often, this is where the shoe pinches (or where problems arise). When you can’t hold the wheel they can’t expect you to drive…

Giving support 24/7

Regularly, the lack of team authority becomes visible when handling support. I have the following rule of thumb in case an issue occurs: “For any particular problem you can only call me once”. That’s a bold statement, I know. So let’s drill down a bit and parse that statement. 

First, I want to stress that saying “you can only call me once” has nothing to do with loyalty or willingness. It has more to do with a safety lock against abuse. If a problem occurs and you get lifted out of your bed on a Saturday night you can state that the problem is something unforeseen. It’s a corner case, probably something you thought would never occur or could ever happen… But it’s a fact that things went wrong and your service went down.

Secondly, after the initial firefight to recover to a steady state, the next thing you want is “the authority” to fix the issue once and for all. Even if your Service Level Objective (SLO) was not broken, it’s you who feels the pain of being lifted out of bed in the middle of the night. On a Saturday night at 4 am I just want to stay in bed and my service to be self-healing.

Let the battle begin

Arriving at the office on Monday, the issue is gone and the pain it caused is forgotten. At the next sprint planning however, you arrive with your post-mortem at hand. But with the pain gone, the urge to fix the issue properly also seems to have disappeared. Other things are more urgent and once again the sprint is stuffed to the brim with new features. There’s no time for a better fix, so the service remains brittle and you know that it’s only a matter of time until things will go south again.

Don’t get me wrong, I feel very responsible for the software I ship. Besides, for any unforeseen issue you can call me anytime. Calling me twice for the same problem? A known problem I’m not allowed to properly fix? I feel very reluctant in answering those calls and that feeling will only get stronger every time I get called for the exact same issue.

Service ownership and authority

Till today, the lack of authority regarding quality assurance is the biggest pitfall I’ve encountered while practicing YBIYRI. Getting something back up quickly is easy most of the time, tackling the root cause of a problem often takes a lot more time and effort. At a moment where everything appears to be peaches and cream again it often seems very hard to convince people to spend more time on a better fix.

Engine

So yes, you can put me in the driver seat and I will hit the road. But in the middle of a race, when my tires are almost gone, it’s my call to make that necessary pitstop. It will cost some valuable time but without those crucial repairs I can assure you we will not make it to the finish line.