Offensive Engineering

A few nights ago at my Brazilian Jiu-Jitsu gym, I found myself in an (unfortunately) familiar position: flattened out, stuck under a Black Belt’s side control. Anyone who has trained knows the feeling: you’re not in immediate danger, but you definitely don’t want to be there. I fought for space: an elbow here, a knee there, and finally, recovered my guard… for three seconds, before I was passed again.

That cycle of defence, recovery, and defence again is a common rhythm when going against someone better than you. You never quite get on top of things, always one step behind, always scrambling. It is demoralising, because you are constantly moving from a defensive position, to a neutral one, and then back.

John Danaher frames this problem through the concept of offensive and defensive cycles. Danaher is one of the greatest coaches in Jiu-Jitsu. He has also coached MMA legends like Georges St-Pierre, arguably the greatest of all time in that sport. Danaher describes how, in grappling, you are always in one of two states: on offence, or playing defence. He argues that the sport is traditionally taught as moving from a defensive position into a neutral one. The classic example is the one I described: escaping from a pin and recovering guard. Danaher argues this approach is flawed because it allows your opponent to stay on the attack, forcing you to defend continuously. The idea isn’t novel; there is the old saying “the best defence is a good offence”. Danaher posits that you don’t want to move from a defensive position to a neutral one, but rather from a defensive position straight to an offensive one. Go immediately on the attack. You want to trap your opponent in their own defensive cycle, rather than allowing them to trap you in yours. Doing this allows you to dictate the pace of the match, and ultimately win through overwhelming momentum.

I’ve thought about how this concept applies elsewhere in life, so if you’ll forgive me, here is a blog post from a tech bro talking about how jiu-jitsu relates to software engineering. Mark Zuckerberg, eat your heart out.

Engineering Has Its Own Defensive Cycles

In my line of work, things break. Incidents happen. Requirements change halfway through implementation. Someone acts with haste, not speed, and deploys on a Friday afternoon without testing properly.

When the inevitable happens, we react. We stop what we were doing (typically something of much higher value than fixing whatever is currently on fire), stabilise the issue, and then go back to business as usual. Back to neutral. Then, to the surprise of no one, something else breaks, and we are right back on the defensive.

This cycle is the same loop I find myself in when I am losing in the gym: defend -> neutral -> defend again. We are constantly reacting and responding to what is happening to us, rather than defining what happens next.

The best engineering teams (like the best grapplers) behave differently. When something breaks, they fix it once. Not just enough to put out the fire, but enough to ensure that the same kind of fire never starts again, not just in this spot, but in any spot. They don’t defend just long enough to get back to business as usual (BAU); they move from defence to offence.

What does “Offensive” Look Like in Engineering?

A practical example of this mindset is something we practice at work. We have a weekly ritual called operational review. It’s essentially a small postmortem of the week’s operational issues (failures, alerts, helpline requests, etc.). We look not just at what broke, but also at what nearly did (pipelines approaching thresholds, long-running or expensive jobs, anything that succeeded but had to retry). As well as reviewing the short-term actions taken to fix the problem, we discuss what we could do so that, in theory, the same thing cannot fail in the same way ever again. Beyond this, we use second-order thinking to abstract the particular failure into a class of failures, and then look for other potential vectors for the same failure mode across the systems we support. To put it another way, you can’t just keep putting out fires; you need to burn away the fuel that keeps igniting in the first place.

Let’s look at a classic real-world example in data engineering: a pipeline has failed because some string was too big for the field we were inserting it into. The defensive move is obvious: we either fix the incoming data or widen the field, rerun the pipeline, and move on.

The offensive move starts where the defensive one ends. It asks those second-order questions:

Why was the string too long in the first place? Can we go back to the provider and find out some bounds for length? Can we confirm with them that the system producing this data (e.g. some web form) has protections in place to stop too-long strings from entering the data in the first place?
Can we add validations so that this is a warning rather than a failure? A good team responds to a fire alarm, but a great team installs smoke detectors to catch the fire before it burns out of control.
Should we increase the field precision and add validation to see if the data is trending bigger than normal? (e.g. if an invoice_amount field suddenly starts receiving values that are 10x their usual size, maybe the definition of that field upstream has changed from dollars to cents, which might not cause a problem for ingestion, but will certainly cause a problem for reporting)
Do other fields in this domain (or others we ingest from!) have this same problem? Can we fix those too so they can never have this failure?
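To make a couple of those questions concrete, here is a minimal sketch of what the "smoke detector" validations might look like. Everything in it is hypothetical: the function names, the `warn_ratio` and `factor` thresholds, and the assumption that rows arrive as dictionaries of strings are mine, not a description of any particular tool. It also assumes the numeric field holds positive amounts.

```python
from statistics import median


def check_string_lengths(rows, field, max_length, warn_ratio=0.8):
    """Return warnings for values that are over, or close to, the column width.

    Hypothetical helper: `rows` is a list of dicts whose `field` values are strings.
    """
    warnings = []
    for i, row in enumerate(rows):
        value = row.get(field) or ""
        if len(value) > max_length:
            warnings.append(
                f"row {i}: {field} is {len(value)} chars, over the limit of {max_length}"
            )
        elif len(value) >= warn_ratio * max_length:
            warnings.append(
                f"row {i}: {field} is {len(value)} chars, nearing the limit of {max_length}"
            )
    return warnings


def check_magnitude_shift(history, batch, factor=10.0):
    """Warn when the batch median is `factor` times larger or smaller than the baseline.

    A jump like this will not break ingestion, but it may mean the upstream
    definition changed (for example, dollars became cents).
    """
    baseline = median(history)
    current = median(batch)
    if baseline == 0:
        return []  # no meaningful ratio to compare against
    ratio = current / baseline
    if ratio >= factor or ratio <= 1 / factor:
        return [
            f"median moved from {baseline} to {current} ({ratio:.1f}x); "
            "has the unit changed upstream?"
        ]
    return []
```

The point of returning warnings instead of raising is that the pipeline keeps running while the near-misses show up in the next operational review.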

Before long, you’re not just fixing one failure, you’re removing an entire class of problem. You’ve gone from reacting to one bad record to making the whole system better.

This is the power of second-order thinking. Most teams can fix the immediate problem, and great teams can even stop it from breaking again by determining a root cause. But the best version of this is to go one level deeper, and spot patterns and failure modes across systems, not instances. It’s a shift in mindset from “How do I fix this?” to “What does this failure tell me about our system? What other failures share its characteristics?”

Momentum and Initiative

This idea of offensive and defensive cycles can also be put in terms of dictating the pace, or tempo, of a match. Whoever controls the pace controls the fight. If you are always reacting, your opponent has momentum, and it is a powerful thing to overcome. If, however, you are able to break that momentum and then immediately start building momentum of your own, you can change the rhythm of the fight. You make them respond to you.

Engineering appears to work the same way.

If you are always firefighting, you never have time to think strategically. This momentum pushes you towards all the things that break teams: constant defence, burning out, running postmortems that you don’t have time to act on because there is always the next fire. But if you can halt that momentum and break that cycle, you can change the direction of your systems and team. Through proper root cause fixes, better observability, increased automation, and increased communication, you can move from reactive to proactive, and start to have the power to choose your next move, rather than having it chosen for you. At this point, continue attacking. Find the fires waiting to start with the same DNA, and extinguish them before they ever have a chance to burn.

If you’ve been stuck on defence for a while, then it’s going to be difficult to stop that momentum: after all, it’s been building for a long time! But taking that initiative pays dividends, and the effect compounds. Each problem you go on the offence against goes away forever. All of that time can then be used on the next, and the next, until suddenly operation of your systems doesn’t induce dread. Your weekly operational review starts to become less and less about the failures of the last week and more about how to stop the failures of next week from ever occurring. You’re in control.

Stop Reacting, Start Attacking

Let’s look at some practical ways you can shift momentum and implement some of what I am talking about:

Automate the Root Cause

By automation, I mean here that if a manual step has to be taken to fix something, can you do some work so that happens automatically? If a system falls over and needs to be manually restarted, can you put something like auto-scaling groups or automatic failover in place to improve the resiliency of your systems? If a pipeline has a few known failure modes, and you have playbooks to fix those, is there some way that the steps you would take can be done by a machine instead of a person?

This isn’t an invitation for over-engineering or résumé-driven development of the latest AI-powered auto-resolution tool, but if there is a way to code yourself out of a call-out, I think it is time well spent. Every time something breaks, ask the question “Could this have been detected or prevented automatically?” Automate alerts, retries, validations and quality checks. Automate the things stealing your attention and keeping you on defence.
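As one small illustration of coding yourself out of a call-out, here is a sketch of an automatic retry with exponential backoff. The `run_with_retries` name and its parameters are hypothetical, and a real scheduler or orchestrator may already offer something similar. Note that each retry is logged, so a job that succeeded only after retrying still leaves a trace for the operational review.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def run_with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `task()` up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper: `task` is any zero-argument callable. `sleep` is
    injectable so the backoff can be tested without actually waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            sleep(delay)
```

Retrying blindly can hide a real problem, which is why the sketch logs a warning every time instead of swallowing the failure silently.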

Look for Patterns

Rather than treating incidents as isolated events, look for the patterns.

When something fails, don’t just ask why it happened, ask where else it could happen. What other processes, pipelines or systems share the same assumptions or failure modes? If a string overflowed here, could it overflow somewhere else? If this model doesn’t cater for daylight saving time in its logic, does that happen in other models?

Second-order thinking like this turns single fixes into systematic improvements. If you need to learn a lesson, try to only learn it once, not over and over. As the saying goes, history doesn’t repeat, but it does rhyme.
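The "could it overflow somewhere else?" question can be turned into a routine check. The sketch below is hypothetical throughout: it assumes you can already produce two dictionaries, one mapping each (table, column) pair to its declared width and one mapping it to the longest value observed, and the `headroom` threshold is an arbitrary choice of mine.

```python
def columns_at_risk(schema, observed_max, headroom=0.2):
    """List columns whose longest observed value is within `headroom` of the declared width.

    Hypothetical inputs:
      schema:       {(table, column): declared_width}
      observed_max: {(table, column): longest_value_seen}
    Returns (table, column, longest, width) tuples, fullest column first.
    """
    at_risk = []
    for (table, column), width in schema.items():
        longest = observed_max.get((table, column), 0)
        if longest >= width * (1 - headroom):
            at_risk.append((table, column, longest, width))
    return sorted(at_risk, key=lambda r: r[2] / r[3], reverse=True)


schema = {
    ("customers", "name"): 50,
    ("customers", "email"): 100,
    ("orders", "notes"): 255,
}
observed = {
    ("customers", "name"): 48,
    ("customers", "email"): 40,
    ("orders", "notes"): 255,
}
for table, column, longest, width in columns_at_risk(schema, observed):
    print(f"{table}.{column}: longest value {longest} of {width}")
```

Run against every table you ingest, a report like this gives the operational review a list of fields to widen or validate before any of them fail.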

Change the Culture

You can’t automate or pattern-spot your way out of a culture of reactivity. Culture is what determines whether your team treats incidents as annoyances or opportunities.

Encourage curiosity, not blame. As part of our postmortem process, we refer to the Prime Directive for project retrospectives:

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

The intention of this is to turn retrospectives into an effective team ritual to find solutions and to improve the ways of working in the future.

Build rituals into your ways of working (like a frequent operational review), rather than waiting for an incident to occur before doing a postmortem. Over time, your team’s way of thinking will change from “How do we fix this?” to “How do we make sure this never happens again?”

You Win or You Learn

In Jiu-Jitsu, if you’re always defending, you’re losing — even if you don’t get submitted. In engineering, if you’re always reacting, you’re losing too — even if everything is technically working.

The goal in both is the same: to move from defence to offence, to regain initiative, and get momentum on your side.

Defend what you must, but attack as soon as you can. By practicing Offensive Engineering, you’ll get that momentum on your side, and have time for all the important things that deliver real value.

To quote John Danaher:

“Losing is never a pleasant thing, but losing and not knowing why is completely intolerable since it takes away the value of losing as a learning experience.”

That idea has always stuck with me, in both Jiu-Jitsu and engineering. Early in my training, I lost a competition by being submitted with a triangle strangle. Afterwards, my coach made me start every round for the next month already in a locked triangle (sore neck anyone?). I had to learn to survive, escape, and eventually counter from there. I rarely ever get caught in one now.

The saying “You either win or you learn” is very popular in the sport. With this framing, my loss wasn’t really a failure, it was an opportunity to learn and make sure that particular mistake never happened again.

Engineering works the same way. Systems will fail, incidents will happen, but the real failure is if we don’t learn from them. Each outage or bug is a chance to analyse, adapt, and eliminate a class of problem for good — to move from defence to offence.
