Jack: Welcome back to Security Voices. Yes, it's an unfamiliar voice. I am back with Dave after missing a few episodes and we have an amazing guest for, uh, should be a very enlightening conversation. How's your world, Dave?
Dave: It's La Vida startup heading into the busy part of the year. So I managed to sneak in a quick, uh, summer vacation at the end of it.
But, um, yeah, it's good. It's, it's a super busy time of the year. Kiddo just started back to a new school. So it's a busy, busy time. All the sins of the summer, the things unaccomplished are looming in front. So it's time for the, uh, the mad dash for the fall. Kelly, welcome to Security Voices.
Kelly: Thank you so much for having me.
Dave: Our pleasure. You are right off the heels of doing kind of a book talk at Black Hat, right? Were you mostly talking about security chaos engineering at Black Hat?
Kelly: Sort of. I spoke a lot about resilience and the real thesis of the talk is, okay, we know that attackers have certain advantages. I think there are a lot of myths about what those advantages are.
So I clarified those. Then a lot of it was, how do we steal those advantages for ourselves? Like, how do we get a faster operational tempo? How do we make sure that we're adaptive? How do we adopt systems thinking? How do we make sure we can measure success well? And a lot of it drew on the book, but that specific framing, I thought, would really resonate with the Black Hat audience, where obviously there's kind of a love affair with all things offense, but most of the attendees are on the defense side.
So I tried to kind of bridge those two worlds.
Dave: All right, so I spent most of my Sunday morning reading Security Chaos Engineering. At the beginning, there's this mention that, you know, it's a tome, which always invokes in me imagery of like a big, leather-bound book, you know, something by Neal Stephenson with like a gold embossed cover, and I thought, oh, that's cute, I'll breeze through this.
And I was completely wrong. It's a tome. It was not false advertising at the beginning. And one of the things that I noticed and I really appreciated is the lack of Sun Tzu quotes and the normal kind of tropes and so on and the militaristic nonsense. God bless you for not doing that.
Instead, I was greeted by Ursula K. Le Guin, whom I love, Buckminster Fuller, Faust, and so on. Which made me think, ooh, these people have interesting literary influences. So what's your information diet? Who are some of your favorite authors? What are you reading now? Take us through it.
Kelly: I read all the time. I especially love literary fiction, which you picked up on.
One of my all-time favorites is Moby Dick. I like to say it's a compendium of whale facts that also happens to have a story. But what I especially love, and I'll wax poetic briefly because it's important for the book, is that Melville really analogizes the whale, as its own kind of complex system, to the nature of, you know, the human condition.
And I find that pattern matching, and that almost very literary or poetic way of interpreting the world around us through this other creature, this indomitable whale, to be really illuminating, and it gets you to think in a different way.
I always like to mention that chapter nine was all Aaron Rinehart, who compiled a bunch of case studies, which are fantastic. But for the rest of the book, which I wrote, I really wanted to bring that same sense of, okay, you're learning a lot, but you're learning through different lenses. You're gaining new perspectives on things.
You can kind of feel and almost touch these analogies that make these concepts more concrete in a way. And again, just get you to think differently about your work. Cause I think cybersecurity desperately needs that kind of rethinking overall. So, literary fiction. I also love reading lots of random papers.
Recently, it was about volcanic plumbing systems, the other VPS, not the AWS one. And yeah, anything and everything related to complex systems. I love reading in my spare time.
Dave: All right. So I'm going to give an example. And I think the book could have been incredibly dry, but for the analogies that are woven throughout the tips and the case studies. And look, this is not a book that can be skimmed.
Let's be clear about that. I wanted to treat this like Michal Zalewski's prepper book. Jack, it was untreatable in such a fashion. I will say, though, the summaries at the end of each chapter are great, but the analogies really pulled me in. And since we're talking about complex systems here, I felt like this was such a fun, great example.
I'll read this. It says, a classic example of safety boundaries is found in the book Jurassic Park. When the protagonists start their tour of the park, the system is in a stable state. The dinosaurs are within their designated territories and the first on-rail experience successfully returns the guests to the lodge.
But changes in conditions are accumulating. The lead geneticist fills the gaps in the velociraptors' DNA sequences with frog DNA, allowing them to change sex and reproduce. Landscapers plant West Indian lilacs in the park, whose berries stegosauruses confuse for gizzard stones, poisoning them. A computer program for estimating the dinosaur population is designed to search for the expected count rather than the real count to make it operate faster.
A disgruntled employee disables the phone lines and the park's security infrastructure to steal embryos, causing a power blackout that disables the electrical fences and tour vehicles. The disgruntled employee steals the emergency jeep with a rocket launcher in the back, making it infeasible for the game warden to rescue the protagonists now stranded.
These accumulated changes push the park-as-system past its threshold, moving it into a new state of existence that can be summarized as chaos of the lethal kind, rather than the constructive kind that SCE, security chaos engineering, embraces. Crucially, once that threshold is crossed, it's nearly impossible for the park to return to its prior state.
What a fun example of a complex system. You could have, you know, used the example of the Capital One hack here, you could have used any of a number of things, but the Jurassic Park example, you look at that and say, yeah, that's very much the world that so many organizations live in. And this is how bad things happen.
It's not so much one isolated thing. It's accumulated complexity, and then people acting out of their own incentive structures, and whoopsie, bad things happen.
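[Editor's sketch] The population-counting program in that passage is a compact illustration of a check that can only ever confirm the assumption it was built on. The following is a hypothetical Python sketch of that idea, not the program from the novel; the function names and the numbers are made up for illustration.

```python
from typing import Iterable


def count_until_expected(animals: Iterable[str], expected: int) -> int:
    """Stop counting as soon as the expected total is reached.

    Faster, but it can never report more animals than were expected.
    """
    found = 0
    for _ in animals:
        found += 1
        if found == expected:
            break
    return found


def count_all(animals: Iterable[str]) -> int:
    """Count every animal actually observed, whatever the expectation was."""
    return sum(1 for _ in animals)


# Hypothetical park: 10 animals expected, 13 actually present.
observed = [f"animal-{i}" for i in range(13)]
shortcut = count_until_expected(observed, expected=10)
real = count_all(observed)
print(f"shortcut count: {shortcut}, real count: {real}")
```

The shortcut reports exactly what the operators expected to see, so the extra animals stay invisible until someone asks for the real count.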
Kelly: I agree with that entirely. That's one kind of a point of contention I've had with the overall cyber security industry for a while is they'll say, well, the human clicked on a link, that's the cause.
It's like, well, is it actually the cause? It's the same thing you see with a lot of incident response, like, oh, the, you know, responder missed the alert. It's like, well, how many alerts do they have to deal with every day? You know, and how much has documentation decayed or playbooks decayed? There are always so many factors that create this kind of mosaic of what our system is and what can lead to failure.
So I think in general, it's a lot easier to understand that principle when you think about dinosaurs, because those are very visceral in your mind, whereas, you know, computer systems can be relatively abstract.
So I was stealing from Crichton, who really just wanted to write a book about complex systems and the dangers within them and was like, well, what will the public care about? Dinosaurs, murderous dinosaurs, right? So I'm taking the same sort of tactic.
Jack: It's interesting because as people, we often can relate to the idea with crises in your life pretty well.
And then one day, you know, whatever it is, your coffee mug isn't where it's supposed to be in the morning and it just sends you off. And, you know, to your point about somebody clicking on a link, personally, we know that the coffee mug shouldn't upset us that much, but the past two months have just driven us to the edge.
And they're complex systems. People are complex systems, but the systems that we, uh, we try to secure are too. And of course, there are people at play there too, who could just have that day where they have just had enough of notifications from the two-factor auth app, and they just don't care anymore because they lost their coffee cup.
You know, it's that the whole complexity thing is great. And that that is a phenomenal storyline to use to help people understand that.
Kelly: I think you're right. And I think there's a problem. What is it? The fundamental attribution error, right? We can really give ourselves empathy when, like you said, it's like, oh, the coffee cup's out of place, but it's like, yeah, we had a loved one who was sick, or all these different things that cause us to, like you say, tip over our thresholds as people. We often don't extend that empathy to other humans, which I think is a big part of the problem.
And I can't remember if it's in the book or in something else I wrote where I talk about, yes, we can say, ah, yes, this user, which by the way, is a very clinical kind of abstract term for a human being, like, yes, this user was lazy or negligent or whatever else. Or we can say, again, they were, you know, taking care of a loved one.
They're a caretaker. They had to, you know, bring their kid home from the hospital after a long battle with some sort of illness. Like you never know what is going on in another human's life. And to just label them as careless is, I think, kind of cruel, which I know is a harsh word, but I mean that very sincerely.
And I think is, is it such a surprise that cybersecurity gets a bad rap within organizations when we don't have that kind of empathy, right?
Dave: It's a very American and very Western way of looking at the problem. And you see it in our language. You hear this all the time, like, oh, I forgot this, which is very direct blame.
I forgot it. I did something wrong. In Spanish, we say, se me olvidó. It was forgotten to me. It's a much kinder, reflexive way of saying it. And you just would never kind of say it in that fashion in Japan. You might say, shō ga nai, which means like, as the fates would have it. So security feels like it's always been very inflected by, you know, this sort of punitive, you know, nearly Catholic, there it is, I said it, kind of, um, blaming.
But anyways, wow. It took us like 10 minutes to get to sacrilege. That's great.
Kelly: Oh no, like I published a series of posts last year, which by the way, people don't tend to read blog series. Um, I'm going to republish it as a single blog, but I went for the Catholicism thing too.
I was like, cause it's true. There is a kind of, well, you committed a sin. You need to pay penance, or you can basically, you know, suck up to us and we'll make it okay. There's that kind of vibe, which I agree is not very healthy, and I think that's often what also discourages that secure by design thing, which is obviously not new.
It's something we've wanted since the 70s and 80s, perhaps earlier. That's at least when I know it was kind of birthed. Because when you say, like, I forgot, it's like, well, why didn't the system prompt you, or why wasn't the system designed in a way that you didn't have to remember at all, right? It's that focusing on the individual, the socio part of the system, rather than, you know, maybe there are flaws in the technical part of the system and in the design, that I think creates a lot of the challenges we have, because evidence on the efficacy of awareness training is pretty poor.
The efficacy of a mitigation like isolation or immutability, or when you're just designing something to be impossible, is just substantially more effective.
Dave: Yeah. And it's interesting. So the name of the book is Security Chaos Engineering. And you'd hinted to me that the name was a little bit of a misnomer. When I got into it and was kind of poring through it yesterday, I realized like, yes, that's part of it.
But this is really one of those books, and I don't want to overstate it, but I felt like it was really a book that was about fundamentally looking at security in a different way. And a modern approach to thinking about security. I had a conversation with a guy who was looking to hire a CSO the other day, and we were talking about the right way to structure the person to hire and how to measure their performance.
But if we were talking about what they would do in the actual approach of a security leader, I think I would probably just hand them this book, for a modern cloud environment in particular, because so much of this seems to be influenced by DevOps and even kind of agile thinking and so forth. It just felt to me like this is a modern way to approach running a security program.
That was my thinking about it. And it was a little bit in the way that, as you get into Thinking, Fast and Slow, you're like, this isn't a book about behavioral economics. This is a book about the human brain and how we make decisions. Or like when you get into Yuval Noah Harari's Sapiens, and it's like, oh, this is a book about the evolution of the human species.
No, no, no, no. It's much, much bigger than that. Like this is about all of humanity, the impact we've had on the planet, the interaction with earlier species and all those things. So the book is a lot bigger than its title. In my experience, I'm not sure that that's a question, but that was very much my feeling as I went through it.
I don't know what the hell you would call it, but I would call it something different than Security Chaos Engineering. I think that dramatically undersells it.
Kelly: That's entirely possible. And I, I definitely appreciate you kind of vibing with my bigger vision to revolutionize cybersecurity because that is a huge part of my goal because it's just not working the way we need it to.
The subtitle I think is a little more accurate: Sustaining Resilience in Software and Systems. But I will first say that chaos experiments are definitely a tool for this. If you look across nearly every complex systems domain, what you see is these industries, whether healthcare, urban water systems, you name it, basically begging and spending tons of money trying to figure out a way to simulate those systems to be able to experiment with them, because it is extremely unethical to inject some sort of failure into, like, the financial system to just see what happens. Right? That would be bad. A lot of people would suffer.
So they don't have the same kind of blessing that we do, to be able to just clone something or create a replica. It's something that's really powerful, and we uniquely have the opportunity in software. We just aren't using it. And one thing I will say, and I'm working on a post to kind of highlight this, and a paper actually, is don't think about the chaos in security chaos engineering so much as the chaos experiments.
Again, those are a really valuable tool. Think about it in terms of chaos theory, which, again, goes back to Jurassic Park. Chaos theory basically says that there are some systems that are so complex that you just can't predict them.
The only way you can understand how they're going to behave is to see what happens when they run, essentially. It's like for some programs, you can't just predict everything ahead of time. And I think that's really the essence of it. If you look at complex chaotic systems, a lot of them are very non-deterministic, or even if they are deterministic, there are so many different factors that it's just impossible to trace all the possible outcomes from an initial starting point, you know, the classic butterfly flaps its wings or whatnot.
So I think if you want to stick with security chaos engineering, which again, I fully recognize sometimes, you know, with these transformations, you need something that sounds kind of, you know, pithy to get buy in for budget. So if you want to use security chaos engineering, great. If you want to use resilience engineering, platform resilience engineering, that's also great.
But I would encourage people to think about the chaos in terms of chaos theory, rather than just the chaos experiments. But again, I don't want to knock chaos experiments. That would be very off brand, and I do think they're generally useful tools. I do often think that companies should start with other things first and then kind of graduate into chaos experiments, and be careful with that.
In chapter eight, I kind of enumerate how to do that, that slow iterative process. So that's a long way of saying, yes, the title I think doesn't quite capture it, but also I'm trying to now find a way to get people to think about, again, the chaos theory angle with it. We'll see what happens.
Dave: The voice in the book, I think I can recognize it at this point is very much your own.
It seems like, you know, your personality comes through, but there's an element of it which is empathetic and recognizes human constraints, and there's an element of kindness, like, look, it's okay not to know, we're dealing with incredibly complex systems. But by the same token, it's not okay not to test.
It's not okay to presume that it's going to be okay. And on what you just said, there was a really nice quote. At the risk of being redundant, it's an important point, and I want to make sure we hammer it home. The quote is, we must remember that we live in an era of abundance in software and that we are inestimably lucky that for most of our systems, failure isn't really fatal.
No one will die or be displaced by a UDP flood like they will a hurricane flood. It may feel like a disaster to us, but we're largely fortunate. We should not waste this privilege. So let's harness experimentation to learn and adapt the way other domains wish they could. It's a powerful statement. I think it gets at like an affect in security where people think they have such complex and unique problems and it's so hard and it's changing all the time.
And they're not wrong, but also that mentality feels a little victim to me. It doesn't feel quite as helpful as the, hey, you're very fortunate. Look at the things you can do without risking human life. Look at the things you can do that someone who works in another field can only dream of.
Kelly: Yeah. I mean, think about incidents in basically every other domain: nuclear, petrochemical, healthcare, like, during an operation. You know, again, we have it very easy. Yes, privacy is important. I don't want to downplay that. If people's data gets stolen, there can be some really, truly tangible damaging effects.
But for a lot of people, it's just like their credit card company mostly handles it. And it's kind of fine for all intents and purposes, right? There can be some reputational damage. I think the big thing I also tried to highlight in the book is the rise of availability as the concern, where if software is eating the world, every company becomes more of a software company that incidentally does something else.
Availability is king or queen or whatever, regent, because that is the money printer for businesses. And so we need to start thinking about that. And for many businesses that are becoming more technology companies, that can start to lead to the problems we're talking about, where it is tangibly damaging to someone's life, maybe not lethal.
But I think the key thing is, with the software itself, we don't have to wrestle with a lot of the typical challenges and we can create replicas. This is true even for kind of your traditional on-prem thing. It doesn't even have to be the perfect kind of high-fidelity replica, because guess what?
In an urban water system, you can't perfectly replicate all of the sewer systems. You would have to have an enormous amount of compute to be able to do that. So even something that's close is better than nothing. And I promise every CISO out there, you can FAIR and predict and do all of that stuff that makes you feel like you're, you know, a real quantitative scientist.
That's not going to help you when something happens. What will help you is understanding, oh, here's how our system actually behaves when something bad happens in it. Here are all the things we thought we had that we don't. Right? I love Aaron Rinehart's experiment that he did at UnitedHealth Group, which showed that the firewall caught misconfigured ports 60 percent of the time, which is something you expect it to do 100 percent of the time.
Right? And it's catching those kind of assumptions that we hold is, you know, this is always going to be true that we don't actually, it would be impossible to bake that into our prediction models or whatever else. Right? So it's what's the point of expending all this effort on trying to predict when we're not actually preparing and understanding how our systems just are, how they become, how they respond.
That's my soapbox about, yeah, making sure that we leverage that kind of simulation and experimentation to its greatest effect and pulling away all of that gravity from the quantitative stuff towards maybe we just need a bit more qualitative judgment on just like, Hey, is this working as we intend?
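[Editor's sketch] The experiment Kelly describes has a simple shape: state the assumption ("the control always catches this"), inject the failure repeatedly, and measure what actually happens. Below is a minimal Python sketch of that shape. The `simulated_firewall` function is a made-up, deterministic stand-in that simply mirrors the 60 percent figure mentioned above; it is not data or tooling from the real experiment, and a real run would inject a misconfiguration into a replica environment and query the actual control.

```python
from typing import Callable


def run_experiment(inject_and_check: Callable[[int], bool], trials: int) -> float:
    """Inject the failure `trials` times; return the fraction the control caught."""
    caught = sum(1 for trial in range(trials) if inject_and_check(trial))
    return caught / trials


def simulated_firewall(trial: int) -> bool:
    """Hypothetical stand-in for 'open a misconfigured port, see if it is caught'.

    Deterministically catches 3 of every 5 injections.
    """
    return trial % 5 < 3


hypothesis = 1.0  # the assumption: misconfigured ports are always caught
observed = run_experiment(simulated_firewall, trials=100)
print(f"hypothesis: {hypothesis:.0%} caught, observed: {observed:.0%} caught")
if observed < hypothesis:
    print("assumption refuted: investigate why the control misses injections")
```

The useful output is not the number itself but the gap between the hypothesis and the observation, which is exactly the kind of assumption a prediction model would have taken as given.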
Dave: The examples from other industries, I think are really apropos here.
Security is still young. And if we look at related industries such as privacy, privacy is even younger. It's about 10 years younger. And a fledgling industry that I think has just popped up and is starting to become incredibly important is trust and safety, which I think is even about five years younger, you know, in some ways, than privacy is, at least in terms of mainstream thinking and so on. My perspective only.
So if we look at this, every one of these interrelated disciplines, and there's plenty of overlap between them, can benefit from looking at things like the automotive industry. You know, I thought of you as I was reading this yesterday, Jack. I'm like, I bet Jack has an analogy or two back to his previous occupation on how he thinks about resilience.
And Kelly, there was a ton of analogies throughout and they were all incredibly useful, from strangler figs to biodiversity in terms of orchard blocks and so on. What's a favorite analogy from either one of you that you think is applicable to security about another industry and how they think of resilience?
Jack: When you talk about human life, one of the things that's really savage, but part of the game in the automotive industry, is that when they build things in and think about safety, there is a dollar value on human life. That's balanced against the dollar value of the sales of the vehicle. And the same thing happens when they fight recalls. They convert human life to a dollar value and do a mathematical thing at some engineering level.
And it's, I don't know, cold and calculating, but it's also a way to measure something instead of a seat-of-the-pants speculation. If we kill a hundred people, uh, by a bad design, what is that going to do to the company? What's that going to do to the reputation? And they've got over a century of data to back this up, right?
And they actually think about these things. And so there's one angle of actually just getting absolutely cold and calculating. But I mean, you know, we can talk about the insurance company, you know, health insurance, anything. There are a bunch of things that do that, where they just really have something critical and they put a literal dollar value on human life and start working from there.
So we don't have to, for the most part. Unless you're doing security in healthcare, you're not doing that quite as much, but maybe if you're in, you know, industrial control systems, maybe if you're in power plants. We've seen what happens when power goes out during extreme heat or cold or whatever. But there's also, uh, you know, one of the things about automobiles that's sort of tangential to this, as I'm thinking about it, is, you know, even though when you rent a car you can't figure out how to turn the heat on, there's actually a lot of thought that goes into usability.
How do people use these things? Can we do a thing? If you remember, decades ago, there didn't used to be shift interlocks on automatic transmissions, and people would occasionally stomp on the gas instead of the brakes and drive into what was called unintended acceleration. And they tried to blame the cars.
And then one day, somebody said, you know what, if we make it so that you can't put it from park to reverse or drive without your foot firmly on the brake. Let's see what that does. And what has it done? It saved thousands of lives by making it harder, back to an earlier point, Kelly, making it harder to do the wrong thing.
There's nothing quite as simple as a, uh, a switch and a solenoid that we can put in, um, software, but there's that idea of, here's a common mistake. Is there a way for us to prevent the catastrophic failure? And like I said, it's not a switch and a solenoid in software, but on the other hand, it's an idea that people do this.
We know they do that. Is there a way to either prevent it from happening or encourage them to do the right thing or minimize the consequences of failure? And I think that, you know, some folks have been talking about looking at this for a long time, and I think you're pushing this idea forward in a meaningful way.
Kelly: Yeah, I think with automobiles too, there's the like crumple design. So to be clear, I live in New York and like my knowledge of automobiles is pretty stale at this point, but I think it's also the crumple design is another kind of like almost safety by design sort of thing too. But I actually love what you talked about with the brakes.
Cause to me, that's very much the transition from manual deployments to something like CI/CD and automation, which again, doesn't really require cloud if you don't want it to. But the idea is, if you have to get peer review on some sort of change that you're pushing, that is another mechanism to make sure you're not just deploying some sort of change that's going to break everything in production.
Right. It's almost the equivalent of the brakes, to have another human make sure. And the other nice thing with automated CI/CD is you can roll that back. So imagine you're like, oh my God, it's going into acceleration. And you just hit a button and then it stops and, like, you go back to being parked.
That would be really nice and is much more difficult to do in the car. Right? So, I think to me, that's kind of the closest analogy. I was thinking of, like, what is my favorite thing? Because obviously not automobiles for me. There are honestly too many. One of the things I find fascinating is mycorrhizal networks, which are basically, when you look in a forest, there's this whole underground network where fungi help trees essentially load balance nutrients.
Which is fascinating, but I think it's a great example of resilience and even kind of secure by design, because it has a bunch of features like adaptation and being able to tolerate droughts, being able to sense, okay, this part of the system needs more nitrogen.
And to be able to move that dynamically, it's such an adaptive system that's capable of evolving over time, based on changing conditions. I find that very beautiful. I especially like that it is something that emerged naturally. It's not something humans designed. This is just baked into the fabric of our reality.
And I think if you look across a lot of that, just how does reality operate, you see resilience everywhere. And even humanity, I would argue humanity has succeeded because of our ability to adapt. And what I find funny is that a lot of cybersecurity is specifically trying to remove the human capability to adapt and to change, trying to restrict change as much as possible.
And my view is very much, okay, we need to ensure that humans don't have to do the kind of super repetitive things where it has to be perfect every time. That is not our strength. It's never been our human strength. We need to free time for them to be able to do the things that they excel at, which is creativity, um, and adaptation.
So I'm very much of the view that cybersecurity should be supporting safe change rather than trying to block change. So that's a big reason why I love the natural examples, because it's like, okay, yes, we see resilience in our man-made systems, but this is also something that is clearly a viable strategy for life, which I find very poetic.
Dave: Yeah. There's a TED talk and a Smithsonian magazine article on it by Suzanne Simard. I just looked it up. I know I'd read it a while ago with my son and it's mind blowing. It covers how trees communicate across fungal networks and so on. It's astonishingly cool. Non sequitur, but fun nonetheless. Now, you get onto a number of really kind of SRE topics in there.
And I think it's no mistake that SRE stands for site reliability engineering. Yes, there's a cloud orientation, but I think the principles of what you're talking about apply even to something as far flung as an OT environment, which we just covered with Glean and Tova and so on. The principles hold, but the concept of toil, like you were just talking about, is a very real one and applies to security as well, the want for automation and so forth for things that can be automated, which feels, you know, prescient in the world of AI, when we're looking at things like large language models and what could be done there, in addition to things like AutoGPT, which can chain together.
If you want to get all sci-fi and futury, William Gibson's Agency kind of shows how the agent-based programming model in AI can make things that seem very futuristic and Jetsons, you know, possible. So it seems like that's an area where we can lean in quite a bit in the future.
But if we talk about the now, let's see if we can give ourselves a little bit of credit as an industry here. As I was reading through it, you focus on, hey, look, the prevention and prediction model.
You mentioned FAIR as a prediction model. And I tend to agree. To me, it feels a lot like software estimation in some ways, which is irreparably broken. And, you know, I'm not saying that it doesn't have value, but just to say that to rely upon it is insanity. And the prevention model, we know it doesn't work as well.
But if we look at this and say, is there evidence that the rest of the industry gets this and have we moved off of this starkly deterministic model of security? And I think we have a bit, at least. Detect and respond while the acronym makes me break out in hives XDR as a person who was working in antivirus before that at Norton and where we had improved antivirus dramatically.
Now, Symantec, the enterprise software didn't pick up the engines that we had, so it was hysterical. The. Consumer product was great. The enterprise product kind of sucked because they didn't bother to pick up the engines or make people take engines. And if you're running a three year old antivirus engine, it can't do a hell of a lot for you.
So again, resiliency within software water mining is important. But having said that, my whole point in bringing it up is that my revelation, as someone who had been in the antivirus industry, as I moved into EDR at CrowdStrike was: this makes sense. We're never going to get 100 percent of it. And AV is very focused on portable executable files, and it's not always going to be
a PE file. Particularly as we move more towards nation-state adversaries and so forth, there will be humans in the system. So it feels like, to a certain degree, the EDR, NDR, XDR stuff is sort of a step in that direction of saying, look, AV isn't going to do it all, the firewall isn't going to do it all, and neither are all these preventative measures that we never really trusted to begin with
and that we knew were going to have flaws. We're going to have to have, you know, people and automation to pore through things and look at the signal that's coming through instead of just relying upon signatures and behavioral heuristics to catch it all. Do you think it's fair to take a little bit of credit there as an industry?
How do you see it?
Kelly: This is going to be a spicy take. Just warning everyone. I think, yes, it's an improvement off of a bar that's, like, in the depths of hell, maybe? Overall, I'm not too fond of the DR category, because I have heard CISOs say that one of their key metrics is time to detect.
That is an output, not an outcome. It says nothing about whether the system was back to healthy or what the actual impact was, because with time to detect, maybe you do detect it, but you can't do anything about it for hours and hours or even longer, or someone doesn't actually notice that it was detected.
So I dislike the incentives that it's encouraged. I also used to work at a company that did Linux monitoring, and even at the time, a lot of what I was writing about was, especially when you're dealing with production environments:
sure, you can add some sort of DR tool to that system, but it potentially will cause its own problems, like kernel panics, and it's never going to be as effective as some sort of by-design mitigation. The problem I see is that, of all the infosec tools, *DR is seen as closest to a silver bullet, especially on endpoints.
And I think that's a counterproductive tendency. So again, it's certainly an improvement over antivirus. I will grant that. But if you look at how, like you mentioned, SREs are approaching the problem of how do we detect when things aren't behaving as intended, or if things are approaching failure, or, you know, if someone is clearly YOLO deploying or has created a new artifact on the system that they shouldn't be able to, how they're solving those problems is so drastically different.
I think it's worth asking, what are we missing as an industry? And I think we stumble upon interesting answers there. So that's my long way of saying like, yeah, it's better than antivirus, but I don't know how much, especially with that paper. I don't know if you saw that paper that came out, I think it was last year.
Time is not something I acknowledge, so I'm not sure exactly when. Relatively recent. It was an empirical study of the efficacy of various EDRs, and it very much had that "60% of the time, it works every time" vibe. Honestly, it was a little embarrassing, because a lot of the stuff that they weren't able to detect was stuff that was advertised on the tin that they were supposed to detect.
So I don't know. I think one thing I do love, I will say, about chaos experiments is if more of the industry starts conducting them, vendors will be held much more accountable to their claims, and I think the industry can only improve with that.
Dave: I think smart organizations, the more sophisticated ones, the ones that are deeper into their program, do this sort of testing, and they go in and they find out:
was there a signal when the red team got to the final mile, to different areas? And as much as I never thought we would do it, it's why we recently did data detect and respond, because we found out that, look, at the final mile, people aren't getting the level of observability that they need.
And maybe that's the better word for it. Certainly an SRE word. What's the level of observability we need at that point, let's assume that someone's going to get all the way back to the data. Did we see what we needed from all different parts of the system along the way? So that we had a prayer of catching this.
Let's talk a little bit about the alternative here. So you make the case against prevention and prediction and for the wise alternative of evaluate and experiment. Let's talk a little bit about some of the fundamentals of software chaos engineering, so you can explore this. The concept of evaluate and experiment, I think, was really important, and it's something that at a startup, at a young company, I felt acutely. Even on the marketing side, I tried to retrain them to think of things in terms of hypotheses and testing things in an early category,
which sounds a little fluffy at times, but the mindset change is significant, and that concept of we are running experiments and we're testing things and it's okay for stuff to break was essential to it. But those are my words. How would you describe the core principles of security chaos engineering in your own words?
Kelly: I'm going to kind of narrow down to specifically the E&E approach you mentioned, which I talk about, I think, in chapter two of the book, and which stands for evaluate and experiment. I figure, you know, there are going to be some people, because it is a tome, that only get through maybe chapters one and two, and then they use the rest as a reference guide sometimes.
So chapter two is really, how do you set a foundation for this kind of transformation to drag your security program out of the dark ages? With evaluation, the idea is basically, yeah, you're documenting your hypotheses. So I'm a huge fan of decision trees.
I'm actually especially a fan for the exact reason you mentioned, which is that so much of the time security teams think about that initial access and the exploitation, whatever else, and they miss the lateral movement and all of the rest of the attacker's operation. Because attackers have their own kind of lifecycle, very similar to software engineers, with delivering and monitoring and all of that stuff.
So with decision trees, for people who aren't familiar, decision trees are from the realm of behavioral game theory, with very long precedent. And there's a technique in behavioral game theory called belief prompting, which is very similar. If you know chess, really expert chess players will basically think ahead, like, if I make this move, my opponent will then make this move, and then I can make this move, and then they'll probably make this one, and they basically map out to, like, k levels of thinking through the game.
And that's exactly what decision trees help us do. So basically, we start with something where, I think the example I give in the book is with S3 buckets. So it's like the YOLOsec option, you know, when you're just completely abandoning all reason and caution. The YOLOsec option is to have
your S3 bucket be public. And of course, the attacker could pretty easily discover it through, I don't know, something like that. And from there, though, you think about, okay, are they going to have to exfiltrate data? You have to think through all the steps they make. And let's say, okay, now you set it to private.
Well, now you have to have a new branch, which is how does the attacker approach the problem of accessing your private bucket, which could be phishing. It could be potentially exploitation. But it really gets you to just map out: here are your hypotheses about how the attacker will approach attacking
your system, and how are they going to react to our mitigations? So a lot of people, again, think, oh, mitigate it with, again, *DR. That's not the end of the story. Especially, you mentioned nation-states: they are not going to stop there. If you're a valuable target, they will pivot.
They will figure out some sort of alternative strategy. So I always recommend, before you do your chaos experiments, have at least one decision tree, which is for whatever your most critical asset is. And OT actually understands this pretty well in my experience. For one, they have digital twins, which help them experiment, or they're excited by the idea.
They tend to know, like, again, like, a manufacturing company knows, like, we need the shop floor operational because that's how we make money. So an attacker disrupting that, that's exactly the kind of, like, key thing that they're worried about. Create a decision tree for how an attacker would do that. Think about the easiest thing for the attacker to do to the hardest thing, which is probably going to be, like, I don't know, like some sort of fancy backdoor, like that Bloomberg story claimed way back, you know, the grain of rice thing.
So when you start documenting those hypotheses, one, again, you're just understanding like, okay, what did we miss? And almost every time I've done these decision trees with companies and teams, including software engineering teams. There's something that's been overlooked that is revealed by the decision tree where you're like, Oh, my God, this is so obvious in hindsight.
Of course, we need to, like, introduce this mitigation. Of course, this isn't going to work the way we intend, or it's one of those things where, like, we think we have that. And then you go talk to the engineer and they're like, yeah, it was on the roadmap, but we never actually implemented it. So decision tree is very valuable for that, but also they're the basis for your chaos experiments.
Because you say, again, like, we expect the firewall to detect the misconfigured port. Then you run a chaos experiment, right? Validate that assumption. That's my long way of saying that I think the experimentation part is the really cool one that people would want to, like, blog about the most.
But the evaluation part is really essential just to understand, like, what are your own assumptions about the system? What are your hypotheses? And you can really go from there.
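The S3 walk-through above can be written down as a tiny data structure. What follows is only a minimal sketch in Python, not the format used in the book or in any tool: the `Step` class, the `attack_paths` helper, and the wording of each node are all hypothetical names chosen for illustration. Each node records an attacker action, the mitigation the defender answers with, and the branches the attacker might try next.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Step:
    """One attacker action in a hypothesized attack path (illustrative only)."""
    action: str
    mitigation: str = "(none yet)"  # how the defender answers this action
    responses: List["Step"] = field(default_factory=list)  # attacker's next options


def attack_paths(step: Step, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf path, alternating attacker and defender moves."""
    current = prefix + (f"attacker: {step.action}",)
    if not step.responses:
        yield current
        return
    countered = current + (f"defender: {step.mitigation}",)
    for response in step.responses:
        yield from attack_paths(response, countered)


# Hypotheses from the conversation: public bucket first, then what happens
# once the bucket is set to private.
tree = Step(
    "read the public S3 bucket",
    mitigation="make the bucket private",
    responses=[
        Step("phish a user who has access"),
        Step("exploit a service that can read the bucket"),
    ],
)

for path in attack_paths(tree):
    print(" -> ".join(path))
```

Each printed path is a hypothesis about how the attacker reacts to a mitigation, which makes it a candidate assumption to validate with a chaos experiment.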
Dave: Yeah, the S3 bucket example is wonderfully illustrated. I have to say, like, you clearly took great pains throughout the tome in order to make it not a wall of text and illustrate what you were thinking.
And I'll read a quote here, but having said that, you know, the words are great, you're a gifted writer, but the illustrations that go along with it were really nice, particularly in this instance, because you could see the iterative nature of the thinking. And I can imagine a group of people at a whiteboard working through this and saying, oh, well, if we did this, they would do that.
And this quote brings it to life, or I think provides some really interesting color around it: "The cat and mouse game of cybersecurity is better characterized as a spy versus spy game, where each side can inflict harm on each other's mental models through booby traps and other bamboozles."
Great turn of phrase. Don't hear bamboozles enough. "We must prowl our own assumptions for loopholes and alternative interpretations rather than waiting for attackers to take advantage of them. These insights galvanize an advantage whereby we can proactively refine our design and architecture before attackers can exploit the difference between our mental models and reality.
As is evident, the attacker mindset is ultimately one defined by curiosity. How can we foster such curiosity, nudging ourselves and our teams to challenge assumptions and enrich them with evidence? The answer is experimentation, curiosity in action. When we conduct experiments, we can generate evidence that unmasks blemishes in our mental model of the system, but without the stress and fear that arise when attackers succeed in compromising our systems."
What a great paragraph. Thank you. Curiosity in action. And the idea of this isn't cat versus mouse, uh, one the pursuer and the other the pursued, but both sides kind of equally yoked and with equal opportunities in front of them. Anything you, you kind of care to elaborate on that? I fear that I've said too much in reading the full paragraph because it says it all so well.
Kelly: I think I wrote like 130,000 words. So there are plenty more words for people to be surprised and delighted by. Um, yeah, Spy vs. Spy. I actually very fondly remember it; it was something my dad and I bonded over, because he loves that comic and I love it too. Because, like you said, I like to think it's a little
empowering for defenders to think of themselves as, like, oh yeah, we're actually peers to attackers. We don't have the disadvantage. As I mentioned in my Black Hat talk, one of the myths we have is that attackers only have to get it right once. And that's frankly just total bullshit, total bullshit.
They have to get it right once for initial access, and then they have to get it right every single move they make after that. Right. And it goes to the importance, like you said, of making sure that you have more observability beyond that initial access. So I really like that spy versus spy analogy, again, to get people thinking, like, oh, yeah, we can be kind of devious back at attackers.
Like, we can ruin their days and, like, mess with them to make it much harder for them. So that was definitely a lot of fun, right? I also love that there's something, I think nearby there, where it talks about how attackers are much more like lawyers, where they try to poke and prod your assumptions and find ways to creatively exploit them.
And I think that also almost defangs attackers a bit, if you think of them more like software lawyers, in a sense, rather than these kind of boogie monsters that we make them out to be a lot of the time. One thing I will mention that I highly recommend, and obviously this is tooting my own horn a bit:
I co-created an open source tool called Deciduous. Before, yes, I did decision trees on the whiteboard, but it turns out that if you can create some sort of YAML file you can check into GitHub, it's a lot easier to collaborate and make sure you have an artifact. So it's basically deciduous.app,
a nice user interface where you can, again, edit the YAML and create these decision trees collaboratively. It is a great exercise, and I will say that after the exercises, you can tell there's just more confidence in the room, in that very spy versus spy sense of, oh, there's actually a lot we can do to disrupt the attacker's workflow and how they approach attacking our systems.
And again, that's. That's something I was really hoping in the book. It's not like a, Hey, y'all are dumb because you're not doing all this cool stuff. That is not the vibe I was going for at all. It's very much like this stuff is accessible. We can draw on other disciplines. We can draw on modern software practices.
Like this is all within your reach. You can pick and choose and, hey, you're going to be able to start feeling like you're able to get the upper hand on attackers finally.
Dave: And I'll throw another quote out here that I think was more in the beginning, but we'll jump around a little bit here. It said: "All the attack phase lateral movement means is leveraging the connection between one component and another to others within the system. Defenders, however, traditionally conceive security in terms of whether individual components are secure or whether the connections between them are secure. Status quo security thinks in terms of lists rather than graphs, whereas attackers think in graphs rather than lists.
There is a dearth of systems knowledge in traditional cybersecurity defense, and attackers are all too happy to exploit." So going back to our Jurassic Park system and, you know, and so forth, this is an overall kind of call to arms to think different in general, to develop a new mental model. You're on equal footing.
You can test. You should test. It's your curiosity in many ways that's holding you back, which I think is really empowering. You know, so many times we talk about tooling, we talk about training and the rest of it. It's much more empowering to say our curiosity and our ability to think in terms of the system that's there is allowing us to fail
against, basically, evil software lawyers. Evil is a terrible word: adversarial software lawyers. So it's a very different model, I think, than what I've heard advocated before. And it's a much more empowering one that ultimately seems quite doable. Let me ask you this. You have a day job at Fastly.
How do you apply all of this in your day job? Do you find it straightforward, or at least are there ready, you know, places where you apply this methodology on a day-in, day-out basis? Is it more periodic? Is it more of, hey, we're trying to do this, but it's challenging? How would you describe your own journey applying these principles inside an organization?
Kelly: Yeah, so I'll say I'm in the office of the CTO at Fastly. So I'm tasked with thinking about the future of security. So a lot of this is really helping our customers think differently about security and leverage Fastly's product line to be able to achieve some of these things. I will say I'm very excited.
Y'all are actually the first place I think I'm mentioning this. There is now officially published a code example for doing experiments on our edge, as a serverless function, which is really cool, and which I created as part of one of our internal hackathons. So you can basically, um, strip cookies and force cross-site origin requests to make sure that your site actually requires them, assuming that's something that, again, is an assumption you hold dear, and obviously it will alert you whether or not it's working.
So that was a really cool, again, open source contribution I was able to make. I will say that at Fastly, as you might imagine, because we're a CDN, and I think this is something all CDNs have to do, you have to invest an enormous amount of effort into reliability. So there's already just a great reliability culture.
And even with our platform, resilience was thought of from the beginning. Even if you just abstract that to the WebAssembly ecosystem, and I don't know if your audience will be familiar, WebAssembly is just, like, extremely cool. But the thing that I love about WebAssembly is that from the very beginning, they thought about how do we ensure resilience against failure?
So you see a lot of nested isolation. You see a lot of, okay, we're going to wrap this basically in a memory safe wrapper, even with your C code, which I analogize to lead. There's a lot that they've invested in to make it safe by design, which to me very much gets into this.
And it's the same thing with Fastly's implementation of this on the edge. Um, they definitely thought about a lot of that. So I've helped internally with modeling decision trees with some really awesome people who hopefully will listen to this and know that they're super awesome. Yeah, in general, I'm constantly thinking about how we can help our customers in particular, whether that's through the experiment that I published or maybe other things that are in the works that I can't talk about, and how that benefits
our customers and their systems from a resilience perspective.
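The cookie-stripping experiment Kelly describes can be illustrated in a few lines. This is a hedged sketch in Python rather than the published edge example, which is not reproduced here: `strip_cookies`, `fake_origin`, and `run_experiment` are hypothetical names, the origin is a local stand-in rather than a real site, and only the cookie half of the experiment is shown. The hypothesis under test is "the site rejects requests that arrive without a session cookie."

```python
from typing import Callable, Dict


def strip_cookies(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with any Cookie header removed."""
    return {k: v for k, v in headers.items() if k.lower() != "cookie"}


def fake_origin(headers: Dict[str, str]) -> int:
    """Stand-in for the real site: 200 with a session cookie, 401 without."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return 200 if "session=" in lowered.get("cookie", "") else 401


def run_experiment(origin: Callable[[Dict[str, str]], int],
                   headers: Dict[str, str]) -> Dict[str, object]:
    """Inject the fault (no cookies) and check whether the assumption held."""
    status = origin(strip_cookies(headers))
    return {"status": status, "hypothesis_held": status in (401, 403)}


result = run_experiment(fake_origin, {"Cookie": "session=abc123", "Accept": "*/*"})
print(result)
```

If `hypothesis_held` comes back false, the site served content without the cookie it was assumed to require, and that is the signal worth alerting on.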
Dave: Do you have a favorite case study? It can be from the back of the book, and there's a bunch of really nice case studies there. And I do think the fact that you're a practitioner, at least actively working with customers, and I'm assuming engaged with the Fastly security team and so on,
and the number of case studies in the back, bring it to life. It's nice not to have a book that feels replete with theory but no real examples; this crosses over into the practical. But what's your favorite example of having the principles of security chaos engineering applied in real life?
Kelly: You know, I'm going to deviate from
the case studies in the back of the book, because I really want people to read them and I do feel like they stand on their own. They're from, like, Verizon and Accenture. This is not just a sexy tech company or startup thing. I want to make that clear. These are very established organizations that have regulatory requirements that are pursuing this too.
We even have one from Capital One back there. I'm actually going to mention something that I think is mentioned in the book lightly, and it's something I talked about in the Black Hat talk, just because I really want security teams to understand how a modern engineering organization approaches this. Basically, this is what it's like to live in 2023, which is: Mozilla partnered with UC San Diego on a project called RLBox,
where they're essentially wrapping, as I put it, smelly C components in what I call a memory membrane. I actually don't like the term sandbox; membrane, to me, again goes with all the biology things, but call it sandbox or whatever you want. The idea is basically that, using WebAssembly stuff, they're able to wrap their C components in a memory membrane, which means you get this added layer of protection for data going in and out.
But the important outcome that Mozilla and UC San Diego have talked about is that it means they don't have to worry about 0-days anymore. Like, zero days are just kind of, they're not a non-event, but they're not an all-hands-on-deck sort of event. It gives them the time to find a reliable fix.
And this RLBox thing, other than the outcome, it is not by security people. This is something that you want for, again, performance, resilience to performance failures as well. This is something that comes about by thinking about, okay, how do we bake in some sort of design-based mitigation to reduce the hazards in the system?
So, I think there are even spell checkers now wrapped in this. There's audio and image stuff. There's a bunch of stuff in Firefox now using RLBox, and it's a lot easier than refactoring the C code into a memory safe language. Obviously, I believe they're working on that as well. But a lot of people, I really hate the advice of, like, just use Rust.
It's like, one, there aren't many Rust developers. It's not the easiest language to learn. Better is having this, at the very least, interim solution that reduces the impact of compromises and exploitation in your C code by design. I think it's really powerful. And it's something that I have noticed is scary for a lot of CISOs or security leaders or even security practitioners, because it's like, Well, I don't actually really know how to code that much, or I don't really understand software engineering that much.
It's something that you have to learn, but I can tell you my background is behavioral economics. I started my career as an investment banker, I promise. If I can learn, you can definitely learn. And I think you're a product person. Like you're definitely exposed to software engineering. You can go down rabbit holes for the rest of your life.
You don't really need to do that to understand at least a little bit of like, okay, what are the kind of design levers we can pull in order to again, reduce or eliminate hazards by design. So that's my long winded example. That's kind of outside of the book of like, this is how you should be thinking. It should not be like more security training for developers.
It's like, how about we just make sure that things can't go wrong really in production? And I think that's such an important shift. Also, in a budget-conscious time, something like that is a lot cheaper than a lot of the solutions that are out there.
Dave: So great example and practical as well. What did I miss?
I wanted to ask you about the ice cream cone hierarchy of security solutions, mainly because it's fun and I have a sweet tooth and I freaking love ice cream. But having said that, you can elaborate on that, or just anything that you felt like I should have asked but didn't about security chaos engineering?
Kelly: The ice cream cone hierarchy is great. I encourage people to definitely check it out, because it's a great heuristic for, again, what we were just talking about: how do we prioritize the solutions we design? And hint: administrative controls, like thou-shalt-not sort of policies, aren't the most effective.
You can't scoop much resilience ice cream into the tiny base of the cone. So definitely check that out. But one thing, because you're a product leader, I want to mention is that chapter seven is on platform resilience engineering, which is basically my way of showing, hey, here's how your org should probably look to effect this transformation.
And it looks a lot more like platform engineering. We see it definitely among our customer base at Fastly, but platform engineering is just taking off like a rocket ship, pioneered by Camille Fournier, who's amazing. It basically says, okay, we need to approach internal engineering problems very much with a product mindset.
Like, these are user problems, or users have problems that we need to solve. We need to understand their perspectives, perform the user research, and also be accountable for the outcomes, just like you would with a product that's facing customers. And so I think that is, again, a huge shift for security teams, to think:
oh, these are our customers. They aren't just users we're policing. These are customers we have to satisfy. These are customers with problems that we need to be solving in a way that isn't going to disrupt their goals. Right? Because, as I joke, imagine a monopolistic firm that says, because they're the only provider on the market, hey, you hate using this thing, but
we're going to force you to use it. You never hear of that in private markets. Right. But that's essentially what a cybersecurity team asks people to do: you hate this thing that we're imposing on you, and we haven't really cared about how it interferes with your work, but it's important because security is important.
People are like, are you kidding me? I need to do my job. So I think this mindset's really important to get into that product mindset of like, okay, using a lot of the processes, both of us know very well, like again, understand the problem, understand their goals and constraints, like solve the problem accordingly.
Create an adoption path, track your success metrics for the project. And it's something, again, on the platform engineering side, that we see a ton of companies doing. Again, that's more around reliability and resilience to performance failures. There are a few companies that are starting to approach cybersecurity with that platform engineering mindset,
Netflix being the most famous, to pretty great effect.
Dave: 100%. And there's a great example. I was just looking for the quote in here, but I couldn't find it. I've given a bunch of quotes, so I'll paraphrase at this point. I'll take that liberty. It's where you talk about the progression from what a classic database administrator looked like to how it gracefully and quietly moved its way to a very different model in the cloud that, you know, is one where a lot of the work is taken on by managed services.
Some of it's assumed by SRE. Data professionals can focus on, bluntly, more important things than maintaining the data stores at this point. And I thought it was, once again, a really nice analogy, and a very close analogy to how we'll see security progress, where you will see,
you know, the security equivalent of platform engineering teams with that mindset, and they're already out there. Netflix is a great example and has a bunch of principles coming out of that, like paved roads, that are certainly more and more prevalent, you know, every year.
Kelly: Oh, yeah, paved roads are one of my absolute favorite things.
Um, it's something I would love to see more adoption of in cybersecurity. I've waxed poetic about them at, I think, a few different conferences now, and obviously in the book a lot. Yeah, Wall-E is a great example. There was another recent one by, um, Block, the artist formerly known as Square. I think they called it FARCARS, and I'm kind of butchering it a bit, but it was basically an intelligent service mesh sort of thing, where when they acquired a company, the company could call into their service mesh as a way of integrating.
So it sped up the integration time to just a couple of weeks rather than months, and anyone who's been through acquisition integration knows that can be very challenging to do. So having that kind of paved road, just so the acquired engineers can more easily integrate with a system, that's really powerful.
And it resulted in like more secure outcomes. So my, the thing I joke is like with paved roads, what we keep seeing in these public examples, it's like even if you don't care about the productivity gains or like reliability or all the stuff that you really should care about, there's still security gains here, right?
If you wanna ingratiate yourself with your CTO, or definitely even your CEO, being able to say, yes, this thing improves security. We invested a little more effort by our team to create this paved road, but it improved security. And by the way, it improved our deploy frequency and improved our time to deploy changes.
Like all of that is going to be very powerful and get you away from that department of no vibe to like, Oh, this is like a serious collaborator who also cares about our interests in the future of the company.
Dave: All right, last question. What do you get excited about as you look to the future for cyber security and the industry in general?
What gives you hope?
Kelly: This is going to be a wildly unpopular answer, I think, with your audience, but it's basically nothing within cybersecurity. It is entirely because every time I speak to an audience of SREs or software engineers, they actually get more excited about all these topics than a lot of cybersecurity people do.
A lot of cybersecurity people are very resistant, and I understand it can be scary to know that you have to change the way you're doing things. SREs, platform engineers, and even just programmers are like, wait, we can extend some of the things we're already doing just a little bit, with a little more effort, to cover these attack use cases too?
That's amazing. Like, it's mind blowing. They're thrilled. That gives me so much hope for the future of security, because that means, for them, it's like, okay, we need resilience to performance failures and also attacks. That starts to really get into that secure-by-design mindset much more organically than trying to force it.
So I'm very excited about that. I'm really excited to see innovations, again, around isolation, things like RLBox. And certainly in the WebAssembly ecosystem, they have the component model coming out, which, assuming all goes to plan (it's not out yet), would give you the ability to have intra-app isolation.
So being able to basically sandbox libraries from each other. So if an attacker poisoned one, the impact would be very minimal, which also is huge. I think my white whale, drawing back to the beginning when I talked about Moby Dick, is swappability. That is something that vendors basically have a vested interest against, but you can imagine Log4j happens.
And let's say you're able to swap out that logging library for something else, because you have some sort of interface that allows different languages to work nicely together, which, again, the Bytecode Alliance is thinking about. That would be huge, to be able to swap out software on demand.
You talked a bit about experimentation: you could do A/B testing, being able to swap out components much more easily in your experimentation environment, and be able to tell, oh, actually this is more effective than this for whatever security use case. There are just so many things you can do once we get swappability in software.
It's also a nearly intractable problem. So I don't have a ton of hope, but that's why it's my white whale.
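To make the swappability idea concrete within a single language, here is a minimal sketch in Python. It does not model Log4j or the WebAssembly component model, and it sidesteps the hard cross-language part Kelly calls nearly intractable; `Logger`, `PlainLogger`, `NeutralizingLogger`, and `handle_request` are hypothetical names for illustration. The point is only that when callers depend on an interface rather than a concrete library, the implementation can be replaced without touching the caller.

```python
from typing import Protocol


class Logger(Protocol):
    """The only thing callers are allowed to depend on."""
    def log(self, message: str) -> str: ...


class PlainLogger:
    """Original implementation: records the message as-is."""
    def log(self, message: str) -> str:
        return f"[plain] {message}"


class NeutralizingLogger:
    """Drop-in replacement that neutralizes '${' sequences before recording."""
    def log(self, message: str) -> str:
        return "[neutralized] " + message.replace("${", "(blocked)")


def handle_request(logger: Logger, user_input: str) -> str:
    """Application code: unaware of which logger implementation it received."""
    return logger.log(f"request from {user_input}")


# Swapping the component is a one-argument change at the call site.
print(handle_request(PlainLogger(), "${lookup}"))
print(handle_request(NeutralizingLogger(), "${lookup}"))
```

The same seam is what would let an experimentation environment run two implementations side by side and compare them for a given security use case.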
Dave: Intractable, but it doesn't mean that we can't make progress. It just means that, you know, there's no glorious resolution where we achieve, you know, the top of the mountain. We ascend to the summit and call a victory, but it doesn't mean we can't make progress.
Kelly: Exactly. In the spirit of resilience, we're going for iterative improvement.
Dave: So amazing. This feels like a great point to stop. Thanks so much, Kelly.
Kelly: Yeah. Thanks for the amazing questions.