CrowdStrike, What Happened? A Q&A with Tad Gralewski

 

August 1, 2024

 

After software giant CrowdStrike pushed a faulty content update to thousands of Windows-based systems on Friday, July 19, chaos ensued across industries—banks, broadcasters, airlines.

Systems were up and running again in a matter of hours or days, depending on a company’s size, but getting them back on track was hugely labor-intensive. If the incident didn’t serve as a warning, it should have: Are you ready if something similar, or far more serious, happens in the future? Because chances are very good it will.

We asked Mindsight CIO and COO Tad Gralewski to explain how companies can help protect themselves against future outages, whether caused by software glitches or cyberattacks. His advice boils down to this: Be prepared.

Q: How could something like this have been avoided, or at least mitigated?

A: Regardless of whether it’s CrowdStrike or another vendor, I think the lesson learned is it’s not if, it’s when. Businesses need to do everything they can to protect themselves so they’re safe. Number one: Have you done everything you can process-wise and procedurally to prepare for it? For example, does your company have an incident response plan? Do you know what you would do if you walked into the office in the morning and all communication was down? How would you inform your workforce? How would the executive team communicate if they couldn’t use email? What are your next steps? Do you have a triage room, a war room, a phone number that you’re going to call into to get people talking and working together just to establish basic communication?

In parallel with that, do you have any sort of documented business continuity and/or disaster recovery plans so that you at least know the first couple of options that you’re going to have to try to restore some sort of normalcy? Also, how are you going to take this on as a company? Most IT departments are thin. People might be on vacation. You may have an event like we saw with CrowdStrike that happens on a Friday. Fortunately, it was not a difficult thing to recover from, but it was 100 percent manual. You could not script it. You could not go in and automate it. It required an engineer to log into every server, take steps, reboot it, and get it fixed. What happens if you’ve got 1,000 servers and you have a team of only six people? Do the math. It’s going to take you days to recover. So it’s a matter of figuring out what else you can do to prepare for these sorts of events, including leaning on a trusted partner that’s going to be there to help you on that really horrible day.
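To make the "do the math" point concrete, here is a rough back-of-the-envelope sketch in Python. The 1,000 servers and six engineers come from the answer above. The 20 minutes of hands-on time per server and the 8-hour shift are assumptions chosen purely for illustration, and the function name is hypothetical; actual per-server time will vary by environment.

```python
import math


def manual_recovery_hours(servers: int, engineers: int, minutes_per_server: float) -> float:
    """Estimate elapsed hours when engineers fix servers one at a time, in parallel."""
    if servers < 0 or engineers <= 0 or minutes_per_server <= 0:
        raise ValueError("servers must be >= 0; engineers and minutes_per_server must be > 0")
    # The busiest engineer handles the rounded-up share of the fleet.
    servers_per_engineer = math.ceil(servers / engineers)
    return servers_per_engineer * minutes_per_server / 60


# ASSUMPTION: 20 minutes per server is an illustrative guess, not a measured figure.
hours = manual_recovery_hours(servers=1000, engineers=6, minutes_per_server=20)

# ASSUMPTION: 8-hour shifts with no overtime and no extra help.
working_days = hours / 8

print(f"About {hours:.0f} hours of elapsed effort, or roughly {working_days:.0f} working days")
```

Under these assumptions, each engineer is responsible for about 167 servers, which is well over a week of normal shifts. Changing the inputs to match your own fleet and team size gives a quick sense of whether outside help would be needed.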

Q: Do you find it’s helpful to present people with dire scenarios about what could or might happen, or do they have to actually go through something like this in order to believe that they need to be much more proactive?

A: You can scare people with stories of the boogeyman. But with many of them, until they’re staring that in the face—until the president of the company tries to log in in the morning and realizes they can’t get to their email or communicate with the help desk or talk to their other people—they don’t understand the level of absolute dependence they have on technology. There are certainly some leaders that do understand the consequences. But spending money proactively for people and technology to be prepared for something that might happen is still a tough ask.

Q: When we ask, “How could this have been avoided?” is the bottom line that companies aren’t investing enough time and resources in preparation?

A: This event was clearly on the vendor, CrowdStrike, and we’ll find out more over time about what happened. But there’s other software that companies use that could experience the same thing or a similar issue. So this isn’t just about CrowdStrike. This is about the fact that we live in a distributed world where we’re reliant on third-party vendors for software that runs our air-conditioning systems and takes care of our email and takes care of patching and protection. And if any one of them has an issue, you have no idea what the impact is going to be to your environment. So it goes back to the first two things: You’ve got to have a plan and you’ve got to be prepared.

Q: CrowdStrike is the number one EDR solution with a huge market share. How does that play into what happened? Would it be better if there were more vendor options?

A: I’ve thought about that, and I’ve even talked to a couple of our customers to get their opinion on it. And it comes down to a couple of things: Let’s say you had 100 servers. You put CrowdStrike on 50 of them, and then you’re going to choose another platform (like SentinelOne) and you’re going to put it on the other 50. So now what you’ve done is you’ve introduced two completely different variables on two different groups of servers. You’ve essentially doubled your risk of something happening, and you’ve lost some of your economies of scale for getting the best pricing that you possibly can. And you’ve introduced more unknowns. Now you might have one of your web servers running software X and the other one running software Y. Well, if one of them starts acting goofy, you have less certainty about what’s causing that problem. So that complexity can cause different issues down the line just in trying to troubleshoot basic problems.

Q: When something like the CrowdStrike outage happens that doesn’t affect public safety and is a pretty easy fix, is that usually an instance where companies proceed with business as usual without thinking about potentially catastrophic future events because it wasn’t so serious?

A: Yeah. If I rewind 10 or 15 years ago, it wasn’t uncommon to have an antivirus solution. You’d download the latest update, and it would cause some problems on some of your servers, and it kind of ticked you off, and you’d get through it and no big deal. Something like this CrowdStrike issue is a one-time event. This is on a piece of software that a company might have deployed to thousands of endpoints, hundreds and hundreds of servers. So as an IT professional, are you going to get angry that CrowdStrike screwed up? Absolutely. Are you going to open a web browser and read about competitors to CrowdStrike and compare cost? Absolutely. Are you going to pull the trigger, buy a replacement solution, and put in hundreds and hundreds of planning hours to uninstall what you have and replace it with a new solution? Absolutely not. Because at some point, you’re going to come to your senses and say, “We have bigger fish to fry than this.”

Q: How were Mindsight customers impacted?

A: There were about a dozen, and all of them save for one—and only because, this is tongue-in-cheek, that one didn’t call us back until the following morning—were up and going by start of business the next day. When we got alerts that their systems were down, our engineers went, “Wow, this is really, really bad,” and escalated it to our service team. We woke people up and had a team working in the middle of the night to restore servers. Because we manage their systems, we were able to get in remotely and pinpoint the problem. We then assembled a big enough team to systematically start going through and doing the remediation actions, rebooting the servers, and moving on to the next one.

Q: So the lesson here is that if you don’t have an internal IT team, which most SMBs don’t, you need outside IT professionals who are monitoring this stuff around the clock. That way they can step in when emergencies arise.

A: Yes. If our clients didn’t have a managed services provider that’s monitoring their systems 24/7, they would have walked into their office at 7 o’clock in the morning, had no idea what was going on, and probably would have been down for the better part of the day. We’re damn proud of the way our team reacted. They did a tremendous job and were a huge service to our customers. Several of them, by the time they got back into the office, were already up and going. It was like any other day, except that they were reading the news that all the airlines were down and everything was in chaos. Of course, these kinds of issues at bigger companies take a lot longer to address because they’re much more complicated in terms of scale. And this particular recovery effort was very manual.

Q: What are the lessons here for IT and other senior leaders? Is there a uniform incident response plan that applies across industries?

A: There are absolutely templates that can be used. But the primary things are, if your systems are down: How are you going to speak to each other as a leadership team? And how are you going to communicate status updates to your entire company? And then, depending on the type of event, What do we do next? If it’s, say, a cyber event, do we call the local police? Do we call the FBI? Do we call our cyber insurance company? For the top scenarios that might happen, have you at least contemplated steps one through three so that you’re not having to figure things out on the fly? Once you’ve got those processes in place, then you can start moving into asset control and software vendor management and things like that, which are a lot more time-consuming. But having a basic plan is the first step. Because, again, it’s not a matter of if, it’s when.

About The Expert

Mindsight CIO/COO Tad Gralewski is a graduate of the University of Illinois at Urbana-Champaign and has been in the IT industry for over three decades. At Mindsight, Tad focuses on both delivering Mindsight’s services to our customers and working with them to help develop strategies, roadmaps, and solutions to solve their issues. To Tad, “We don’t sell things – we solve problems.” A self-proclaimed “outdoors person,” Tad enjoys camping, hiking, and riding motorcycles in his spare time.

About Mindsight

Mindsight, a Chicagoland IT services provider, is an extension of your team. Located in Downers Grove, IL, we proudly serve customers across the area including Naperville, Oak Brook, Northbrook, and surrounding counties (Cook, Lake, DuPage, Will, Kane, and Grundy). Our culture is built on transparency and trust, and our team is made up of extraordinary people – the kinds of people you would hire. We have one of the largest expert-level engineering teams delivering the full spectrum of IT services and solutions, from cloud to infrastructure, collaboration to contact center. Our highly certified engineers and process-oriented excellence have certainly been key to our success. But what really sets us apart is our straightforward and honest approach to every conversation, whether it is for an emerging business or global enterprise. Our customers rely on our thought leadership, responsiveness, and dedication to solving their toughest technology challenges.




