Omer, Pathik, welcome to Security Voices.
Thanks, man. Hey, thank you.
All right, so Jack couldn't join us today. He's officially on the road somewhere. But we do have a very quiet fourth participant: as is the fashion these days, ChatGPT is going to join us for this session on security data lakes. So I won't even try to make ChatGPT use the voice of Jack Daniels. God knows what would happen if we did that. We'll use ChatGPT as a really poor AI stand-in for Jack throughout. With that, why don't we just do a quick kind of around the virtual room here. Pathik, why don't we start with you? Where are you based, where are you today, and what was the last book that you read?
My name is Pathik Patel. Currently, I'm working at Informatica. I've been here for six years, managing the cloud security org. The last book I read is widely for the brokenhearted. It's a story about a gentleman who broke up with a girlfriend and traveled through South Africa. Quite a fun one.
Oh, how did you pick that? Was it a recommendation from a friend or somewhere else?
Yeah, it was a recommendation from a friend like who travels as well. So that was the basis behind it.
Excellent. All right, Omer, over to you. Yeah.
Hey, so Omer Singer, based in Santa Monica, California. And last book: I've actually gotten really lost in the Stormlight series. For any of you that like Lord of the Rings and that kind of book, this series is so good. I hadn't heard about it. I've been a little bit addicted to Twitter lately, and somebody recommended it. They were like, okay, if you're into this kind of book, you're gonna love it. And I can't believe I hadn't heard about it before, because this is great. I'm deep in the Stormlight series. Just great stuff there.
Yeah. Brandon Sanderson.
Yeah, yeah. So you know about it. I feel like I'm the only one that missed out on this for years. It's like, where has it been all my life? But yeah, I'm enjoying catching up with it now.
Oh, my God, I'm with you.
He's really prolific.
There's a lot, right? The problem is the IUP can be reused for like 10 years now, I feel.
Yeah, they're thick. They're pretty easy reads, though. I'm reading The Peripheral by William Gibson. I hadn't read him since I read Neuromancer, and I realize now why I don't read more William Gibson. It's challenging work. Sanderson is really easy, fun reads, and you're right, you just kind of get lost. I think I finished Mistborn in like a month or something. He's a great author, and he's super prolific. A lot of the fantasy authors take these giant pauses in between things. George R. R. Martin is famous for not finishing, as are, you know, Pat Rothfuss, Scott Lynch, and so forth. But Sanderson just cranks stuff out. He gets it.
And I feel like in cybersecurity, we need something at the end of the day that's not going to be too challenging, right? Where we can unwind a little bit with a book. So that qualifies in our crazy industry.
Totally. I use it to shut down my brain at night or take a break, exactly what you're saying. There's a quiet but very large fantasy and sci-fi reading community in cybersecurity, for sure. Awesome. Omer's in Santa Monica today. I am away from home, in Seattle here. And Pathik is actually implicated in the reason why we're doing this. We started a conversation on get 1k, one of the many growing private Slack communities out there, around what would make for an interesting podcast. He said security data lakes, which of course we'd already been talking to Omer about, with all the work that he's done over at Snowflake on it. And then last week we got a gift with a big announcement from AWS and a bunch of work that's been done on the topic of data lakes. So our timing is impeccable, to say the least. With that, we figured we'd start out with a functional definition of a security data lake. We'll let ChatGPT have its turn. But why don't we start with Pathik: how would you describe a security data lake? What's your functional working definition?
For me, the functional working definition is an ability to ingest structured and unstructured data. Especially when we talk about security tools, every vendor, every tool out there has their own way to define a security event. So it's about ingesting structured and unstructured data, and then an ability to query it using a common query language like SQL. Excellent.
I think that's right. I think the key part of defining the security data lake is that general-purpose-ness, right? So a general-purpose big data solution and general-purpose analytics languages, applied to cybersecurity. That's to differentiate it from what is not a security data lake, and I think that's everything we've seen in security in the last decade or so, when security data was stored and analyzed in these very proprietary systems that were more kind of search, with very specific use cases in mind. The security data lake is saying, let's use these general-purpose big data solutions, but to solve cybersecurity problems more effectively.
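The working definition above (ingest differently shaped vendor data, then query it with plain SQL) can be sketched in a few lines. This is only an illustration under stated assumptions: the two event formats, the field names, and the `events` table are hypothetical, and Python's built-in `sqlite3` stands in for a real cloud data platform.

```python
import json
import re
import sqlite3

# Hypothetical sample events. Neither format comes from a real product.
VENDOR_A_JSON = '{"ts": "2022-12-01T10:00:00Z", "user": "alice", "action": "login_failed", "src": "10.0.0.5"}'
VENDOR_B_TEXT = "2022-12-01T10:02:00Z host=fw01 user=alice event=login_failed ip=10.0.0.5"


def normalize_json(raw):
    """Map a structured (JSON) event onto the common column order."""
    e = json.loads(raw)
    return (e["ts"], e["user"], e["action"], e["src"], "vendor_a")


def normalize_text(raw):
    """Map a loosely structured key=value log line onto the same columns."""
    ts, rest = raw.split(" ", 1)
    fields = dict(re.findall(r"(\w+)=(\S+)", rest))
    return (ts, fields["user"], fields["event"], fields["ip"], "vendor_b")


def build_lake(rows):
    """Load normalized rows into one table (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_time TEXT, username TEXT, action TEXT, "
        "source_ip TEXT, source_tool TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def failed_logins_by_user(conn):
    """One SQL question asked across every tool's data at once."""
    return conn.execute(
        "SELECT username, COUNT(*) FROM events "
        "WHERE action = 'login_failed' GROUP BY username ORDER BY username"
    ).fetchall()


conn = build_lake([normalize_json(VENDOR_A_JSON), normalize_text(VENDOR_B_TEXT)])
print(failed_logins_by_user(conn))
```

The point of the sketch is that once both vendors' events share one schema, the question is written once in SQL rather than once per tool.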
Yeah, let's take it to ChatGPT here. So, as is the fashion, asking a straightforward question to ChatGPT is a little less fun than putting a twist on it. I had moments when I thought I broke the thing, and it actually took it a while to come back with a response. But I asked it to explain a security data lake to me as if I was a botanist. Here it goes: "As a botanist, you're probably familiar with the concept of a lake, which is a large body of water that is surrounded by land. Similarly, a security data lake is a large repository of data that is surrounded by security-related tools and technologies. Just as a natural lake is a source of water for plants and animals, a security data lake is a source of data for security analysts. Just as a lake may contain a diverse array of plant and animal life, a security data lake may contain a diverse array of security-related data from various sources within an organization." Wow, not bad.
Yeah, I think we're all in trouble. You know, I thought our jobs were safe from AI, and I'm not so sure.
You know the term "explain like I'm five"? That's how ChatGPT explained it now.
It's pretty impressive. It's pretty darn impressive. So Omer, I'm sure you in particular have taken this question on a lot of times, but compare and contrast the world of the SIEM with a security data lake. And to what extent is the security data lake going to just subsume the market for SIEMs?
It's a great question, and it's an important question to answer. Because if you look at how most security programs are structured today and consider where their source of truth is, it's the SIEM. Just like most of the enterprise has the data warehouse as its source of truth: if finance needs to report on sales and revenues for the last quarter, the data warehouse is where they turn. If the security team wants to understand, have we been breached, and have we mitigated the incident, then the SIEM is typically where they would go. And by the way, I think that's part of the challenge, even with the naming around security data lake. You know, we played around with this. We referred to it as a security data warehouse, a security data platform. Security data lake is the term that's been around for a long time, so we've kind of stuck with it. And I think what we're seeing now, in terms of comparing and contrasting, is that the SIEM never delivered on its mission of being that single source of truth, because of architectural limitations, just the way that SIEMs have been built. Pretty much every security team uses it for some of their data, for some of the time, and then the whole model really breaks down. The security data lake, bringing unlimited scale, unlimited compute, and very flexible data types, for the first time does enable the security team to have all the data in one place. But is it an alternative to the SIEM? I wouldn't see it that way. Because when we think about what teams need to do to effectively detect threats, or to measure risk, or to manage vulnerabilities, all these different kinds of jobs that they have with data at the core, they need more than just scale. They need solutions.
And I think what we're seeing is more an unbundling of the SIEM, and that's how I would think about it. I think the SIEM is getting unbundled, and you're going to have SOC platforms or TDIR platforms running on top of the security data lake to operationalize that use case, compliance automation solutions on the security data lake for that, and vulnerability management solutions for that. So I think we're going to see the SIEM replaced with much more narrow but more effective solutions running on top of the security data lake, which becomes that true single source of truth.
I agree with Omer here. I don't think SIEM is going away in the near future, or that it's a one-to-one replacement. The way I see it, the security data lake will allow security engineers to find new ways to solve their problems. As Omer pointed out, with vulnerability management, compliance management, and so on, those work items are currently bogged down by the amount of data that we have and by how to get meaningful action items out of it. I think that's where this unbundling comes in: you will be able to manage some of your data using the security data lake, build out campaigns, build out your work plans, and go from there. Another thing: if I look five years into the future, I think SIEM and security data lake are two different interfaces to the data. That's how it will turn out, in my opinion. The security data lake is where you can create your tables, your analytics, and so on. But you go to the SIEM if you want to pinpoint information, if you want to search for something like a needle in a haystack. That's where the SIEM will help out. So that's how I differentiate them.
And I liked how Gartner, in their latest SIEM MQ, started giving extra points for SIEM solutions that support what they called a third-party data lake. From my perspective it's actually first party, because I think about it from the perspective of the customer. That's the customer's existing data platform, the data platform that the enterprise already has, and being able to run the SIEM on top of that is what Gartner is recognizing as a big advantage. I think what will get displaced are those SIEM vendors that consider a large part of their secret sauce, their added value, to be a proprietary data platform. We're starting to see that kind of separation. There are those SIEM solutions out there that say, hey, we've built kind of a mini Snowflake, and it's very optimized for some security use cases. The reality is that solving the scale challenge at multi-terabyte or petabyte scale (our larger customers at Snowflake have security data lakes into the dozens of petabytes) is a full-time job, and it's just not reasonable for any cybersecurity company to keep up with that and do it successfully. So that's what I think will get disrupted. But we'll continue to see a healthy SIEM industry that just takes advantage of advances at the cloud data platform, at the security data lake level.
All that sounds great. Let's see what ChatGPT says. We'll drag race against ChatGPT for a moment. So I asked it, with very low expectations: will security data lakes replace the SIEM? And here's the answer: "It is possible that security data lakes may eventually replace SIEM (security information and event management) systems, but it's not necessarily a given. SIEM and security data lakes serve similar purposes in that they both provide a central repository for security data and enable security analysts to quickly and easily access and analyze large volumes of data to identify potential security threats. However, there are some key differences between the two. SIEM systems are typically focused on providing real-time analysis and alerting, whereas security data lakes are more geared toward providing long-term storage and analysis of large volumes of data. Additionally, SIEM systems often provide a more comprehensive set of security-related features, such as user behavior analysis and compliance reporting, whereas security data lakes are often more focused on providing raw data for analysis. Ultimately, the decision to use a security data lake or a SIEM system or both will depend on the specific needs and requirements of the organization."
Holy shit. Yeah, it largely agrees with what we just said.
Yeah, I think that's a good thing.
I think there's one thing that probably is too new for GPT, right? I think it's indexed up until 2021, maybe. It's not that it's brand new this year, but I would say that over the last year there has been a shift in understanding, in thinking that it's not necessarily a dichotomy, an either/or, and a turning to the connected application model, where they work together. The point about real time, for example, is really interesting, because you don't want to rely just on querying the data lake for the near-real-time detections. There's going to be cost and latency associated with it. What we're seeing is that the more advanced SIEM solutions that do support a bring-your-own-data-lake model, as they normalize the data and prepare it for loading into the data lake, also do an initial round of detection. So those atomic detections that you can raise based on what's on the wire are done as the data is heading to the data lake. And then the more robust, maybe multi-dimensional, contextually oriented detections happen more on a schedule. That hybrid approach kind of eliminates the trade-off that was implied there.
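The hybrid approach described above can be sketched as two functions: a per-event check that runs while data is on its way into the lake, and a scheduled SQL query over what has accumulated. Everything here is an assumption made for illustration: the `events` table, the threshold, the rule names, and the indicator list (which uses a reserved documentation IP address). In-memory SQLite stands in for the data lake.

```python
import sqlite3

# Hypothetical indicator list; 203.0.113.0/24 is reserved for documentation.
KNOWN_BAD_IPS = {"203.0.113.7"}


def atomic_detect(event):
    """Per-event check that needs no history, so it can run at ingest time."""
    return event["source_ip"] in KNOWN_BAD_IPS


def ingest(conn, events):
    """Load events into the lake, raising atomic alerts along the way."""
    alerts = []
    for e in events:
        if atomic_detect(e):
            alerts.append(("known_bad_ip", e["username"]))
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (e["username"], e["action"], e["source_ip"]),
        )
    return alerts


def scheduled_detect(conn, threshold=3):
    """Run on a schedule: needs the accumulated history in the lake."""
    rows = conn.execute(
        "SELECT username FROM events WHERE action = 'login_failed' "
        "GROUP BY username HAVING COUNT(*) >= ? ORDER BY username",
        (threshold,),
    ).fetchall()
    return [("repeated_login_failures", r[0]) for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (username TEXT, action TEXT, source_ip TEXT)")
events = [
    {"username": "bob", "action": "login_failed", "source_ip": "10.0.0.9"},
    {"username": "bob", "action": "login_failed", "source_ip": "10.0.0.9"},
    {"username": "bob", "action": "login_failed", "source_ip": "10.0.0.9"},
    {"username": "carol", "action": "login_ok", "source_ip": "203.0.113.7"},
]
stream_alerts = ingest(conn, events)
batch_alerts = scheduled_detect(conn)
print(stream_alerts, batch_alerts)
```

The split keeps cheap, stateless checks off the query path, while anything that needs aggregation or context waits for the scheduled run.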
That makes sense. Going back to what you said before, prior to our ChatGPT break there: there's an element of this where, flatly, what you were saying in really crude terms is that the SIEMs are trying to do too much. They're trying to solve a data ingest and storage problem, while at the same time having to normalize a ton of different types of data and provide the logic and the features on top of it. That was in almost every instance, if not every instance, just too tall an order, especially with data growth and the number of security tools, the different varieties, the lack of canonical formats, and so on. And really, part of what we're seeing with security data lakes, I think, is the necessary specialization that allows organizations to come in and just do a piece of that and be successful at it, which I think is broadly where security is right now, where you have a whole bunch of specialists. One of the things that I like about what's happening with security data lakes, and one of the reasons we began writing to Snowpipe and creating basically a private Snowflake repo for customers, is that at the end of the day, as vendors, this specialization works, but I think it only works if we provide that data out. And I think we've done a pretty lousy job as an industry of building open platforms. Instead, it feels like what vendors have done, and this includes companies I've been in, is try to keep adding more things to grow customer lifetime value, to extend adjacencies, to prove that we're in bigger markets, and so forth. And oftentimes it just spreads the company too thin. Instead of building an open platform, we build a whole bunch of modules, which investors love and sometimes customers love. But it has taken us to a place where maybe the security data lake is a manifestation, and just the recognition, that look, we've got to start providing the data out in a common place, in a common format.
The only way we get ourselves out of this is not by giving up specialization, but maybe by making ourselves more open, or at least having a place where we exchange data. That was a mouthful, sorry for the monologue.
But I think security teams are, frankly, fed up with the tool sprawl, and with the fact that every tool they get becomes a silo that they're then on the hook to reconcile. Especially with the economy going in the direction that it's going, there's a lot of pressure now to consolidate. Can teams do more with less, and do they really have to buy that additional tool? We're even seeing a wider resurgence of these big suites, right? Where you'll have the mega vendors saying, just buy everything from us, and we can do it all. And I think that's becoming increasingly appealing to some security leaders who say, you know, I'm under this budget pressure, I feel like we have dozens and dozens and dozens of tools, just too much to manage, maybe I should buy all my stuff from, I don't know, Palo Alto Networks or whoever, and they'll give me everything. But where that falls over is when you actually look at the jobs you need to get done. You need to protect your endpoints, you need to have visibility across what's happening in the environment. And then with that suite approach, you start saying, well, actually, I don't want to install the Palo Alto Networks endpoint agent and use that to protect my laptops. That's not what I want. And you start making all these different exceptions, and the whole thing falls over. So I think the security data lake model opens up a third path, which says: pick the best of breed, use the tools that will actually get the job done effectively without putting too much burden on the team, but have that unification at the data layer. So plug these tools into the security data lake, and at least you as the customer own the data and have it be unified.
And I'm seeing security teams really responding to that. I've had security leaders in different places tell me, you know what, I want all my tools to align to this approach. So Dave, I think it makes a lot of sense for you guys to support that as an option. If a team does say, this is how I'm achieving my unification and consolidation, at that data layer, then you can meet them where they are. And I think they'll appreciate it.
And even, Omer, to add to what you say: even for the people who are going for the platform approach, there are going to be cases where the platform will not solve all of your problems. If you look at the current scenario, with Palo Alto or whichever big-name platform you pick, there are many cutting-edge problems that you have to solve that they are not there on yet. If you ask them, they'll probably take two to three years to get there. And there are vendors out there who are solving problems like data security, Open Raven, right? So from that point of view, security will always need a common ground where they can get all the data, start asking questions, and get the job done. And the security data lake is the common ground that we can probably all agree upon. But then the challenge is, how do you build the common connectors? You have tools, you have vendors, whether it's best of breed or a platform, and you want to bring this into the security data lake. As I said earlier in my definition, structured and unstructured data: I'm putting that in the definition because right now, some vendor may be defining the data, but it could effectively be unstructured for me because it doesn't make sense, it doesn't follow standard security terms and language. So that's a challenge. And whether it's the best-of-breed or the platform approach, again, you want to build your workflows. You want to integrate your tools with your Jira, ServiceNow, your Slack, your SOAR platform, and so on. That is also additional cost when you have to do this across multiple tools. And even when it's a platform, it's hard to understand each platform module's output.
So I would rather rely on putting all of this into the data lake, and then have the data lake act as a stream engine, which can then put out data to my Slack, Teams, or whatever integration you want to work on.
Totally. By the way, with that model that you just described, you have the opportunity to have some analytics and logic applied between those sensors and the automation, right? If you try to plug one of these cloud security solutions that are firing off, hey, I found this, I found that, directly into Slack or Jira or something, you risk just overloading the team with a whole bunch of noise. Conversely, if you have that layer of analytics in the middle, you can start applying context, maybe even context from non-security sources that you need to ETL in from your Workday or your Salesforce or wherever. Then you could say, well, this is interesting because it applies to an executive, or this is not interesting because this person is on vacation and I know it. Take those things into account, and you have a shot at managing the noise to the team and getting more value from the automation solutions that you mentioned.
At Informatica, do you have a security data lake today?
No, I wouldn't call it a security data lake. We are trying to figure this out. In my opinion, we haven't yet figured out a solid technology that we can bank on. Recently, everybody has seen those announcements, right? Amazon came out with a security data lake offering. It is out there. But we are still evaluating various technologies to understand what the long-term choice is, because in security, if we actually deploy something, it is not going to change quickly. It will take a lot of time to get it deployed, configured, and integrated with various tools. So we are still researching a good technology. I wouldn't call it a perfect technology, but a good technology.
That's so interesting. I think a lot of security teams are at the place where you're at now, thinking about where do I take it, and what kind of technology makes sense for me. If I could share one piece of guidance on that, it's to think about three pillars of success for the security data lake: scale, solutions, and unified analytics. I'm actually writing a post about this right now, because just like you, there are a lot of people out there seeing that this is gaining momentum, but they don't want to make a wrong decision. So when you're thinking about this for Informatica, scale is: once you do collect the data, can you keep it there for as long as you need? And will you be able to effectively ask questions of that data when you get up into the hundreds of terabytes or even into the petabytes? That's something that's important to check, and not to assume that just because it's an AWS solution, for example, it will be able to scale. I think we can talk about how some AWS solutions don't meet those needs. Then in terms of solutions: what content do you have off the shelf? What capabilities can you bring, not necessarily from the security data lake provider, but from the ecosystem, that you can drop in? And then unified analytics: ideally, the security team is not on a different data platform than the rest of the company. I think a big part of the opportunity here is the CISO aligning to the CIO, and the security team having access to the expertise, the data tools, and all of the rigging that the data team has put in place to serve the rest of the enterprise. Take advantage of that. So unified analytics with the rest of the enterprise is that third key pillar.
And I'd imagine the benefit of that isn't just, hey, we're all using the same tools, with price breaks, familiarity, and the ability to get help. It's also that context, like you mentioned before: bringing in something like Workday data, which can be super helpful if you have an insider threat use case, or you're trying to figure out what happened to data and you need to know whether or not that was an appropriate person for it to go to, and so on. Super easy if you're all in Snowflake. Are those some of the primary benefits of having your security data lake be a part of your broader data analytics program inside an organization? Or is there more to it than that?
I think that's definitely one part, right, the data sets. But I think we also need to be a little bit humble in cybersecurity and acknowledge the fact that we are way behind when it comes to data analytics. It's just a reality for us as an industry, and maybe now we're taking steps to catch up. I've been blown away by the amount of tools and processes that data practitioners have put in place over the years to ensure quality of data and governance of the data, and all these different metrics and things that need to be in place so that when you're relying on data, you're building on a solid foundation. So take advantage of that. Just yesterday, I was talking to a security team that wants to use the Terraform code that the data team has built to put in a new pipe. That's great. Let's leverage that. And let's have somebody to go to when you ask, hey, how do I write the SQL or create a materialized view to do XYZ? So I would say it's the data, but it's also the tools, the processes, and even the people, who are very busy but do want to support the security mission of the company. You kind of have to meet them where they are, though.
By the way, I asked ChatGPT if cybersecurity teams are behind in data analytics, and it basically punted on that one. It wasn't willing to be nearly as controversial. It kind of said, it depends.
So it doesn't stick its neck out. Okay.
Yeah, it wasn't willing to take a stance on that one.
Yeah, even ChatGPT wants to be diplomatic.
Yeah. Dave, I'm curious to hear from Open Raven what you see when you talk to security teams that are thinking about protecting data, which is a bit different from creating a security data lake, but they do get into the world of data. How much familiarity and comfort do you see at security teams around data platforms and data analytics? Or not so much?
Increasing. They're curious, and they know that it's important, but it's foreign to them. And I think it's going to take a while. If you go back into the history of security, how many security professionals have been able to successfully complete an enterprise-wide data security project? DLP is the one that comes to mind for people, and it was incredibly painful. I remember in the early days, before Phil Venables went over to GCP, he was with us as a board member. He invested early in the company and took a board seat with us, partly because he was curious about this, because he'd been using Symantec at Goldman to try to understand the data that they had on prem. And he said at one point, look, the scans never complete. We actually never had a full picture of our data. So we're playing into a very serious historical deficit in terms of data visibility. My sense is that when businesses moved to the cloud, they didn't ask the security team. They ran ahead of us, and we've been trying to catch up for a long time. What's happened with data, and Snowflake has been a big part of this, but so have Databricks and a bunch of other solutions (maybe it started with Hadoop 10 years or so ago), is that the rise of data science and the data economy has put both cloud and security teams in the position of racing to catch up. Now, we're up for the challenge. This isn't new for cybersecurity. We've had to catch up to cloud, and we will catch up to data. I think there's an appetite for it, for sure. But it's something that we don't have a lot of experience with. In most instances, there isn't a lot of muscle memory on successful projects. And the teams are super strapped, right? Like a 30% staffing shortage. We're all too busy at times to take the training, and what training would you get, and who would give it to you, and so forth.
So it's going to take time. And I think in some instances the cloud teams are just as behind as the security teams, and they're not any more familiar with it than the security teams are. Pathik, what use cases would drive you to a security data lake? And then Omer, I'd love it if you could give us a case study of a company that you guys have worked with, that you can share: what drove them to do it, and ultimately what's working for them and what isn't? Give us a sense of what it's like to work at scale. But before that, I'm curious, Pathik: if you're considering this, what would be the first things you would do with it? What would drive you to use a security data lake? What's that compelling use case, or use cases?
That's a great question. I'll start with how the security data lake idea was put into my mind. On our side, we have a huge amount of vulnerability management data, especially since we moved to containers. Now you are running hundreds of containers, thousands of containers actually, on multiple pods. At a company like Informatica, where we are multi-cloud, we are also running them across multiple clouds. Along with that, you have your traditional legacy systems, virtual machines, and so on. With all of this data, one of the big questions that we were trying to answer was: everybody's working to patch, so why do we still keep patching? This is the question that our executives and our teams are asking. Are we improving? Are we going up or going down? And this was a question that we were not able to answer quickly, on a regular basis, from our vulnerability management tool. Even if we tried to get it out, it was very unreliable. So this is where we decided that, rather than asking this question of the vulnerability management tools, why don't we take all of this data, put it somewhere, and start asking these questions ourselves? So we built our own mini security data lake within AWS using S3 buckets, Amazon Athena, and then QuickSight dashboards. This solved our initial problem: the ability to slice and dice the vulnerability management data the way we want, and provide answers to executives, various business unit leads, and even engineering functions on what they should be patching, and how fast, to get the best bang for the buck. All of these questions we are now able to answer using our mini data lake. But as the mini data lake became a success, more and more people started asking, hey, why can't I have my CSPM data in it? Can you put CSPM data in the security data lake? I want to ask the same question of it.
So I don't have to go patch 100 different things and still be behind. Let's try to do the same thing using CSPM. And then, one by one, requests came in for other tools. Other teams asked, can you put in this data? And this is where we realized that our mini data lake is a very customized data lake, because we started with one single idea. Now we're adding all of this SAP compliance data and CSPM data. Another challenge is that some of this compliance data, for example, is very different in nature. It does not follow your normal security terminology. So how do we figure out the columns and put this into a format that allows Athena to query it? This is where we are right now. We are planning to integrate more and more data sources. But before we do that, we want scalability in this, as Omer was mentioning. The security team did this, but we don't want to be in the business of continually adding new data sources and then maintaining this data lake. It's going to be a tough one. I would need a whole different operations team to manage this. So we are talking to our CIO's team, our data analytics team, who have done this for the enterprise, and trying to figure out how they can support this. Currently, we are trying to migrate that mini data lake into our enterprise data lake, so that we can start reaping the same benefits.
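The "are we improving?" question from the vulnerability management example above comes down to counting open findings per scan and comparing over time. The sketch below is only an illustration: the `findings` table, its columns, and the sample rows are hypothetical, the vulnerability IDs are placeholders rather than real CVEs, and in-memory SQLite stands in for the S3 and Athena setup mentioned in the conversation.

```python
import sqlite3

# Hypothetical scanner exports: one row per finding per scan date.
FINDINGS = [
    ("2022-10-01", "web-1", "VULN-1", "open"),
    ("2022-10-01", "web-1", "VULN-2", "open"),
    ("2022-10-01", "db-1", "VULN-3", "open"),
    ("2022-11-01", "web-1", "VULN-1", "fixed"),
    ("2022-11-01", "web-1", "VULN-2", "open"),
    ("2022-11-01", "db-1", "VULN-3", "open"),
    ("2022-11-01", "db-1", "VULN-4", "open"),
    ("2022-12-01", "web-1", "VULN-2", "fixed"),
    ("2022-12-01", "db-1", "VULN-3", "open"),
    ("2022-12-01", "db-1", "VULN-4", "open"),
]


def load_findings(rows):
    """Load findings into one table (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE findings (scan_date TEXT, asset TEXT, vuln_id TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO findings VALUES (?, ?, ?, ?)", rows)
    return conn


def open_findings_trend(conn):
    """Count open findings per scan date, oldest first."""
    return conn.execute(
        "SELECT scan_date, COUNT(*) FROM findings "
        "WHERE status = 'open' GROUP BY scan_date ORDER BY scan_date"
    ).fetchall()


def net_change(trend):
    """Negative means fewer open findings than at the first scan."""
    return trend[-1][1] - trend[0][1]


conn = load_findings(FINDINGS)
trend = open_findings_trend(conn)
print(trend, net_change(trend))
```

In this made-up data the middle scan shows why the raw count alone can mislead: one finding was fixed while a new one appeared, so the total stayed flat even though patching happened.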
Transcribed by https://otter.ai