If the word CrowdStrike only recently entered your consciousness, you might be wondering what happened that led to such a disastrous outcome, and, more importantly, could it happen to your organization?
There are plenty of articles diagnosing the cause, the impact and the resolution, but there is, I think, a more fundamental lesson business leaders need to take from CrowdStrike: get closer to the technology at the heart of your organization.
Despite the trope that ‘software has eaten the world’ and is now ubiquitous, ironically — from a business perspective at least — technology is largely hidden from view. Indeed, even engineering teams themselves often find it challenging to gain visibility into complex, multi-layered systems — so how much can business executives be expected to know about what’s happening?
If you’re not sure where to begin, I’ll help you get started. Here are five questions that business leaders need to ask their technology teams:
- How much do we know about our systems right now?
- How do we communicate outages?
- How quickly can we respond to incidents?
- What processes and practices are in place to prevent these incidents from happening in the future?
- How can we manage the risks of external dependencies?
Together, they can help build the transparency and trust needed to develop a culture and capability that can better respond to unpredictable — and potentially devastating — IT crises when they happen.
How much do we know about our systems right now?
In a recent article exploring how to respond to CrowdStrike, my colleague Jim Gumbley stressed the importance of being ‘asset aware’. This is undoubtedly a vital starting point for businesses. But it can only begin with a conversation with your technologists.
So, take time to explore with your technology and engineering leaders how much visibility they have into risks. What tools do they use? Are there any specific roles charged with monitoring or interpreting system data? Does the team have the right capabilities? Do they have the time to pay attention to existing system performance?
One exercise that can prove useful is something called critical systems mapping. This is about outlining and delineating the various pieces of your current IT infrastructure and how they interact with one another. It can help you identify possible points of failure or vulnerability. Indeed, this process can involve more than just technology — it should also include people and processes, so you’re able to understand who has visibility or responsibility for certain parts of a system and where there may be gaps.
It’s important to note that these conversations need to be built on trust — there may well be things technology teams don’t know right now. That might be concerning, but it isn’t that surprising. What matters is to work out how to move to a world in which everyone is comfortable they have oversight of what really matters.
How do we communicate outages?
It’s often said that talk is cheap. However, when it comes to IT outages, having the right communication plans in place can be worth a hell of a lot. When done well, good communication provides stability and clarity, ensuring confidence as you move forward.
With this in mind, the second question you need to explore with technologists is how we talk about and communicate failures and outages. First and foremost, this matters from an internal perspective — when something goes wrong, who knows about it and when? What actions need to follow?
However, for many organizations it will also be important to have external communications plans in place. Consider, for instance, how you will talk to customers about serious technical issues, and think carefully about how information is shared with other stakeholders.
This is rarely easy, particularly when technical failures are complex and multi-faceted. It’s all well and good to say you need to be direct and clear, but when there are nuances and details that simply won’t make sense to some audiences, explanations require an ability to translate the complex into something understandable. Not everyone can do this — as a leader, it’s your job to identify the people who can and work with them closely to ensure that information is not only shared but also clear, honest and easy for everyone to understand.
How quickly can we respond to incidents?
Every organization has its own culture and processes. That means the way problems are addressed and incidents responded to will likely be unique — for better and worse.
However, it’s essential that business leaders get to know these processes. Do your technology teams have the resources needed to respond quickly? Are organizational structures helping them move as they need to or hindering them? What metrics are in place for measuring incident response times — and how do we measure up at the moment?
What processes and practices are in place to prevent these incidents from happening in the future?
Allied to questions about incident response is understanding what’s being done to ensure outages and other kinds of incidents don’t happen in the future. This isn’t to say you should expect your technology teams to prevent anything bad from ever happening, but rather that you should expect them to be thinking seriously about the practices and processes that can minimize the risk of failure.
In short, talk to your technology leaders about how they’re working to achieve software and delivery excellence — are we following best practices? Are we making informed decisions about tools? Are we bringing security decisions to bear on software early in the development process?
Again, trust and honesty are important here. No one wants to talk about their limitations and what they’re not currently doing. However, identifying how your teams can bolster their capabilities and improve their processes is a vital step that business leaders need to encourage and enable.
How can we manage the risks of external dependencies?
Many incidents — including the CrowdStrike outage — are caused by issues outside your organization. They happen when there are problems with third-party software, or when a vendor has an outage that brings down your own applications.
Although you cannot control what happens in other organizations, you can be prepared. A key part of this takes us back to our first point — ensuring you have transparency over your IT estate — but it’s also worth exploring in more detail how your technology teams work with external vendors. What kind of support do they receive? Are they happy with it? What’s in the contracts? How transparent are third parties when it comes to their own issues?
Of course, as I’ve already discussed, trust matters a lot in technology — suspicion and fear aren’t conducive to effective collaboration and a good working relationship. However, it’s also vital to be self-aware and confident in setting expectations, so the organization and its people have what they need to do their jobs consistently and well. Business leaders have a part to play in this — so explore these issues with technologists and give them the support they need.
Dive deeper into what’s happening on the ground
The CrowdStrike outage was just the latest in a line of significant incidents that demonstrate precisely this point. If ever there was a time for business leaders to get closer to their technology teams and learn more about what’s actually happening on the ground, this is it. Only then can you determine how resilient your organization really is.
How do you strike a balance between rapid software delivery and system resilience? In an upcoming article, Max Griffiths will explain why it doesn’t have to be a case of either/or.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.