Introduction
When a critical system goes down, the pressure to "just fix it" can lead to chaos. Engineers get pulled into a frantic, all-hands-on-deck effort that often results in either a rushed, risky patch or a multi-week deep dive that completely derails the roadmap. These emergencies demand a response, but the response doesn't have to be chaotic.
A structured approach can contain the immediate fire while methodically building long-term resilience. It borrows a philosophy from product development, where frameworks like "What Can We Do In X" use time-boxing to accelerate learning and development. We can adapt that idea for incident response by asking, "What can we fix in an hour, a day, a week, or a month?" I call this the WCWFIX ("What Can We Fix In X") framework.
The 1-Hour Fix: Put Out the Fire
The first priority is always to stop the immediate damage. This isn’t the time for an elegant or comprehensive solution; it’s about finding the quickest, safest way to halt the problem. This fix often doesn't even require a code change. It could be as simple as rolling back a deploy, disabling a feature flag, or changing a configuration value. The goal is mitigation, not perfection.
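As a rough illustration of a mitigation that needs no deploy, here is a minimal kill-switch sketch in Python. The `DISABLE_<FLAG>` environment variable convention, the `bulk_export` feature, and the `handle_export` handler are all hypothetical, invented for this example. A real system would more likely read flags from a dedicated flag service or config store.

```python
import os

def feature_enabled(name: str, env=os.environ) -> bool:
    """Return False when an operator has flipped the kill switch for `name`."""
    # Hypothetical convention: setting DISABLE_<NAME>=1 turns the feature off.
    return env.get(f"DISABLE_{name.upper()}", "0") != "1"

def handle_export(request_id: str, env=os.environ) -> str:
    """Hypothetical request handler guarded by the kill switch."""
    if not feature_enabled("bulk_export", env):
        # Fail fast with a clear message instead of running the broken path.
        return "503 bulk export temporarily disabled"
    return f"200 export started for {request_id}"
```

Flipping the variable stops the damage immediately, and removing it restores the feature once a real fix has shipped.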
The 1-Day Fix: Add an Alarm
With the immediate danger gone, the next step is to ensure the same problem can't happen again undetected. The vulnerability still exists, but you can now add visibility. This fix is about building a better alarm system. You might add a new dashboard chart to track the problematic metric or create a Slack alert that would have fired much earlier. This moves you from reacting to a fire to detecting the smoke.
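The alarm itself can be very small. The sketch below assumes a sliding window of recent request outcomes and a `notify` callback standing in for whatever would post to Slack or a pager. The class name, window size, and threshold are illustrative choices, not recommendations.

```python
from collections import deque

class ErrorRateAlarm:
    """Fires once when the error rate over the last `window` results exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05, notify=print):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.notify = notify                 # placeholder for a real alert channel
        self.firing = False

    def record(self, ok: bool) -> None:
        self.results.append(ok)
        if len(self.results) < self.results.maxlen:
            return  # wait for a full window before judging the rate
        rate = self.results.count(False) / len(self.results)
        if rate > self.threshold and not self.firing:
            self.firing = True
            self.notify(f"error rate {rate:.0%} exceeds {self.threshold:.0%}")
        elif rate <= self.threshold:
            self.firing = False  # re-arm once the rate recovers
```

Tracking the `firing` state keeps the alarm from sending a message on every failing request, which matters when the alert channel is one people actually read.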
The 1-Week Fix: Reinforce the Walls
Now you have breathing room and better monitoring. You can implement a proper, code-level guardrail that is tactical and focused on what just broke. This is a robust fix, but it's isolated to the immediate cause. Examples include adding input validation to a vulnerable endpoint, implementing stricter rate limiting, or adding queue limits to prevent a service from being overwhelmed. This moves the solution from detection to programmatic prevention.
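To make two of those guardrails concrete, the sketch below combines input validation with a bounded queue, using Python's standard `queue` module. The `submit_job` function, the `batch_size` field, and both limits are assumptions made up for this example.

```python
import queue

MAX_BATCH_SIZE = 500                               # hypothetical per-request limit
jobs: queue.Queue = queue.Queue(maxsize=1000)      # hypothetical backlog cap

def submit_job(payload: dict, q: queue.Queue = jobs) -> tuple[int, str]:
    """Validate the request, then enqueue it only if there is capacity."""
    size = payload.get("batch_size")
    # Reject missing, non-integer, or out-of-range sizes (bool is an int subclass).
    if not isinstance(size, int) or isinstance(size, bool) or not 1 <= size <= MAX_BATCH_SIZE:
        return 400, f"batch_size must be an integer from 1 to {MAX_BATCH_SIZE}"
    try:
        q.put_nowait(payload)  # raises queue.Full instead of blocking
    except queue.Full:
        return 429, "queue full, retry later"
    return 202, "accepted"
```

Rejecting work at the door with a 429 sheds load explicitly, rather than letting an unbounded backlog take the service down.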
The 1-Month Fix: Re-architect the Foundation
Finally, with the situation fully under control, you can address the root architectural weakness. This is the "right" long-term solution that was impossible to design and implement during the crisis. It addresses the entire class of potential issues, not just the specific one that occurred. This could involve re-architecting a vulnerable component for better isolation or building a generic system for cost and usage controls that applies platform-wide.
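What a generic usage-control layer looks like depends heavily on the platform, so the following is only a toy sketch of the idea: one budget object that any feature can consult, rather than a limit hard-coded into the endpoint that broke. The `UsageBudget` class, the tenant and resource names, and the in-memory counters are all hypothetical. A production version would need persistent storage and periodic resets.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class UsageBudget:
    """Per-tenant limits for any metered resource, shared across features."""
    limits: dict[str, int]  # resource name -> maximum units per tenant
    used: defaultdict = field(default_factory=lambda: defaultdict(int))

    def try_consume(self, tenant: str, resource: str, amount: int = 1) -> bool:
        limit = self.limits.get(resource)
        if limit is None:
            return True  # assumption: resources without a limit are unmetered
        key = (tenant, resource)
        if self.used[key] + amount > limit:
            return False  # over budget: the caller should reject or defer the work
        self.used[key] += amount
        return True
```

Because new features only need to register a limit, the same mechanism covers problems that have not happened yet, which is what separates this tier from the one-week guardrail.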
From Chaos to Control
Critical incidents are inevitable, but how you respond defines their impact on your product and your team. The WCWFIX framework turns a chaotic firefighting exercise into a structured process. It lets you solve the immediate problem quickly while layering in progressively stronger solutions over time, all without throwing your roadmap into disarray. It’s a way to build a more resilient product and a saner engineering culture.