The July 2024 CrowdStrike disruption unleashed digital chaos on an unprecedented scale. In a matter of hours, 8.5 million systems crashed worldwide, paralyzing businesses, governments and critical infrastructure.
Although the incident highlights the ever-increasing complexity of our software systems, for business and technology leaders the question is simple: how do I stop this from happening again?
CrowdStrike might be unique — but failures are inevitable
Unfortunately, the reality is there's no comprehensive safeguard against incidents causing significant disruption. Attempts to find a single root cause — blaming inadequate testing, kernel integration patterns, or flawed deployment pipelines — might be tempting, but the next major disruption is highly likely to come from an unexpected source within our interconnected systems. This is due in particular to our complex, layered digital ecosystems.
What really matters, then, is that you’re prepared and resilient enough to deal with the unexpected when it inevitably happens. That will allow you to address issues quickly and minimize the impact on your organization.
As we see it at Thoughtworks, there are two important ways of doing that:
Properly understanding and modeling your digital assets to achieve comprehensive system visibility
Using this knowledge to target technology controls and practices that minimize risk
This is how we can move towards digital resilience. While traditional disaster recovery and business continuity steps — such as manual business process fallbacks — are important pieces of the puzzle, in an age of highly distributed systems and complex dependencies, leaders need to recognize a more sophisticated approach is required.
Think of digital resilience like fire safety: it's not sufficient to just be concerned with preventing fires — you need to be prepared to handle incidents effectively when they arise.
The importance of asset awareness
Asset awareness is about developing a more complete understanding of all the components in your digital ecosystem: users, devices, third party dependencies, vendors and, crucially, services and data. It requires modeling how these elements interact to consider system health holistically.
Developing this comprehensive understanding requires navigating the specifics of your own context. Frameworks and guidance can be immensely valuable, and the 'Identify' function of NIST's Cybersecurity Framework (CSF) should be essential reading, but the real work begins when you start to reflect on and interrogate what’s unique to you. This is because resilience is always a question of trade-offs: you can’t do everything.
The essential first step is to identify levels of criticality within your environment and broader ecosystem. Instead of spending time and money trying to enhance resilience everywhere, you invest where it really matters. This approach prevents overinvestment in less critical areas, allowing you to focus resources effectively. By prioritizing criticality and business impact, you can make more informed decisions about where to apply your resilience efforts, balancing risk mitigation with resource allocation.
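To make the idea of modeling assets and criticality more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a prescribed method: the AssetGraph class, the three-level criticality scale and the asset names are all hypothetical. It shows two questions such a model can answer: how critical a component really is once you account for what depends on it, and which assets are affected if one component fails.

```python
from collections import defaultdict, deque

# Hypothetical three-level scale; a higher number means more business-critical.
CRITICALITY = {"low": 1, "medium": 2, "high": 3}


class AssetGraph:
    """Directed graph in which an edge A -> B means 'A depends on B'."""

    def __init__(self):
        self.declared = {}                  # asset name -> declared criticality
        self.depends_on = defaultdict(set)  # asset name -> its direct dependencies

    def add_asset(self, name, criticality="low"):
        self.declared[name] = CRITICALITY[criticality]

    def add_dependency(self, asset, dependency):
        for name in (asset, dependency):
            if name not in self.declared:
                raise KeyError(f"unknown asset: {name}")
        self.depends_on[asset].add(dependency)

    def effective_criticality(self):
        """Treat an asset as at least as critical as anything that depends on it."""
        effective = dict(self.declared)
        queue = deque(effective)
        while queue:
            asset = queue.popleft()
            for dep in self.depends_on.get(asset, ()):
                if effective[dep] < effective[asset]:
                    effective[dep] = effective[asset]
                    queue.append(dep)  # revisit: its own dependencies may need raising
        return effective

    def blast_radius(self, failed):
        """Return every asset that directly or transitively depends on `failed`."""
        dependents = defaultdict(set)
        for asset, deps in self.depends_on.items():
            for dep in deps:
                dependents[dep].add(asset)
        seen, queue = set(), deque([failed])
        while queue:
            current = queue.popleft()
            for parent in dependents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        seen.discard(failed)  # only relevant if the graph contains a cycle
        return seen


# Illustrative data only: every name below is made up.
graph = AssetGraph()
graph.add_asset("checkout", "high")
graph.add_asset("payments-api", "medium")
graph.add_asset("internal-wiki", "low")
graph.add_asset("windows-fleet", "low")
graph.add_asset("endpoint-agent", "low")  # a third-party vendor component
graph.add_dependency("checkout", "payments-api")
graph.add_dependency("payments-api", "windows-fleet")
graph.add_dependency("internal-wiki", "windows-fleet")
graph.add_dependency("windows-fleet", "endpoint-agent")

effective = graph.effective_criticality()
impacted = graph.blast_radius("endpoint-agent")
print(effective["endpoint-agent"])  # 3: inherited from "checkout"
print(sorted(impacted))
```

In this toy model the vendor agent is declared as low criticality, yet it ends up at the highest level because a business-critical service ultimately depends on it. That is the kind of mismatch a criticality review is meant to surface before deciding where to invest.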
This kind of awareness isn’t something that just happens; it needs to be prioritized and cultivated. Organizations that empower individuals with the resources they need to learn and adapt to a rapidly changing technology environment are invariably more resilient. Teams that own and support business outcomes, rather than simply administering and maintaining systems, create greater accountability and better business results.
Engineering practices informed by asset awareness
Taking stock of your digital ecosystem is just the first step. It needs to lead to practices that ensure quality and reliability in your software and systems. While no single practice can prevent incidents like CrowdStrike, doing the fundamentals well positions you to tackle the unexpected.
Those fundamentals include:
Ensuring your code is robust by using test-driven development and continuous integration
Addressing security concerns early by building security practices and tools directly into the development process
Reducing errors and increasing consistency through infrastructure as code and test automation
Enabling rapid responses to problems with automated deployment pipelines and small, frequent software releases
Improving transparency and system visibility through comprehensive monitoring and observability
Enhancing response preparedness through regular tabletop exercises
Beyond technology, resilience requires learning and skill development, cross-functional teams, and effective collaboration mechanisms. These practices, consistently applied, create a foundation for withstanding and recovering from digital disruptions.
Good engineering practices and asset awareness go hand-in-hand. Asset awareness provides the situational awareness from which teams can act, while sensible default practices ensure that this understanding translates into effective execution.
A time to reflect — and then act
The CrowdStrike incident was a significant shock, but it’s also an opportunity for business and technology leaders to take stock of where they are now and whether they’re fit for the future. It’s time to consider how much you really know about your systems, infrastructure and applications; it’s also worth reflecting on your teams’ capabilities and confidence in modern software engineering practices.
True, an event exactly like CrowdStrike may never happen again. But something equally disruptive almost certainly will. It’s impossible to predict what it will be, and equally impossible to predict what its impact will be. That’s why it’s important to take the necessary steps towards digital resilience today.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.