Security Data Lakes: A New Tool for Threat Hunting, Detection & Response, and GenAI
This article explores the key benefits of security data lakes, including advanced use cases for threat hunting, streamlined detection and response workflows, and their role in enhancing GenAI projects. It also addresses the challenges of managing such large-scale data environments and offers solutions for optimizing performance.
What is a Security Data Lake?
A security data lake is a centralized repository of security data where you can store both structured and unstructured data at large scale. It enables you to store data in its raw form without needing to pre-structure it.
A security data lake typically uses an object storage cloud service to store the data, such as AWS S3 or Azure Blob Storage. It can scale indefinitely and is designed to be cost-effective.
The total size of the data in a security data lake is measured in terabytes or petabytes, which means that significantly more data is available for analysis compared to traditional security data analysis tools.
Thanks to the sheer scale of data they can store and make accessible, security data lakes unlock powerful new use cases for threat hunting, detection and response, and now even GenAI.
3 New Use Cases that are Unlocked by Security Data Lakes
1: Threat Hunting
Security data lakes allow analysts to search far more historical data than before for indicators of compromise, like malicious IP addresses, domains, and ransomware file hashes, finally making threat hunting over large time ranges practical.
Here's a scenario showing how a security data lake can help with threat hunting:
- Let's say that new research has just been released about an attack campaign that has been active for the past nine months and was only recently detected by the security community.
- The research includes indicators of compromise, such as domains, IP addresses, and file hashes associated with the attack.
- Your team needs to investigate the incident by answering the following questions:
- Are any of these indicators of compromise present in our historical logs going back nine months?
- If so, was our system successfully breached?
- Is there an advanced persistent threat still lurking?
- Before the availability of data lake query tools, querying historical data going back nine months was a significant challenge. You would need to rehydrate logs from cold storage back to your original SIEM or log search tool, which could take days and be very expensive.
- With your security data lake, you have 18 months of data readily accessible for analysis. Using a data lake query tool, like Amazon Athena or Azure Synapse Analytics, you can search for these indicators of compromise over a large time range, such as nine months (a query sketch follows this list).
- Let's say your team finds these indicators of compromise in logs from six months ago, a time range well outside of what your traditional SIEM can see.
- With some more searching in the data from six months ago, you discover an advanced persistent threat using DNS for command-and-control. This gives you the information you need to shut down the threat.
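To make this concrete, here is a minimal sketch of such a hunt using Amazon Athena via boto3. The database, table, and column names, the results bucket, and the IoC domains are all hypothetical placeholders to adapt to your own lake:

```python
# Minimal sketch: hunt for known-bad domains across nine months of DNS logs in
# the data lake using Amazon Athena. All names below (security_lake, dns_logs,
# query_name, the results bucket) are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena")

IOC_DOMAINS = ["evil-c2.example", "malware-drop.example"]  # from the published research

query = f"""
SELECT event_time, src_ip, query_name
FROM dns_logs
WHERE query_name IN ({", ".join(f"'{d}'" for d in IOC_DOMAINS)})
  AND event_time >= date_add('month', -9, current_date)
"""

# Kick off the query; Athena writes results to the S3 location you configure.
query_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "security_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print any matching rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```

In practice you would run similar searches over your proxy, firewall, and endpoint logs for the IP addresses and file hashes from the same report.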
Since security data lakes can store far more data than traditional security log tools, it's now feasible to keep your historical logs and search them, allowing you to find threats in the past much more easily.
2: Detection & Response
In detection and response, it is useful to correlate activity across a wide range of data sources to trace the steps of an incident and learn how to respond.
Since security data lakes are designed to handle both structured and unstructured data of various formats, they allow you to store a wide variety of data sources in one central location and correlate across them.
Here's a scenario showing how a security data lake can help with detection and response:
- Let's say that you receive an alert indicating that an employee at your company clicked on a phishing link in an email.
- Your team needs to investigate the incident by answering the following questions:
- What other sites did the user visit after clicking the link? This helps identify other potentially malicious domains.
- Did any malware get installed on the user’s laptop afterward?
- What internal systems did the user log in to afterward? Is there any evidence of compromise?
- Thankfully, your data lake contains a wide variety of data that can help with this investigation, including DNS logs, endpoint logs, and identity provider logs.
- First, you query your DNS logs in the data lake to see what other domains the user visited shortly after clicking the phishing link.
- Second, you query your endpoint logs to determine what activity occurred on the user's laptop shortly after the click. You find a process that appears to have connected to a suspicious domain, downloaded a script, and executed it, which could be malware or a command-and-control process.
- Third, you query your Identity Provider logs to check for login events from this user into other internal systems to determine if any internal systems were compromised.
- You respond to the incident by disabling the user's access to the affected systems, blocking the phishing and malware domains, adding a detection rule that looks for that script command in endpoint logs, and checking the internal systems that the user interacted with.
Since the data lake serves as a centralized store for your security data, you can run all of these queries across many data sources in one place, tracing through all of the steps of an incident without needing to jump between many different tools.
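As a rough illustration of that single-place correlation, here is a minimal sketch that runs the three investigation queries from the scenario against hypothetical dns_logs, endpoint_logs, and idp_logs tables via Amazon Athena; the table names, column names, database, results bucket, and time window are all assumptions:

```python
# Minimal sketch: correlate one phishing incident across DNS, endpoint, and
# identity-provider tables in the data lake. Table and column names, database
# name, and the S3 results bucket are hypothetical placeholders.
import time
from datetime import datetime, timedelta

import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> list:
    """Run a query in Athena and return its result rows (header row excluded)."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "security_lake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]

USER = "jdoe"                                # the user who clicked the link
CLICK = datetime(2024, 5, 2, 14, 37)         # time of the phishing click
WINDOW = f"BETWEEN timestamp '{CLICK}' AND timestamp '{CLICK + timedelta(hours=4)}'"

queries = {
    # 1. What other domains did the user resolve right after the click?
    "dns": f"SELECT event_time, query_name FROM dns_logs "
           f"WHERE username = '{USER}' AND event_time {WINDOW}",
    # 2. What ran on the laptop, and did anything reach out to a suspicious domain?
    "endpoint": f"SELECT event_time, process_name, command_line, remote_domain FROM endpoint_logs "
                f"WHERE username = '{USER}' AND event_time {WINDOW}",
    # 3. Which internal systems did the user log in to afterward?
    "identity": f"SELECT event_time, application, login_result FROM idp_logs "
                f"WHERE username = '{USER}' AND event_time {WINDOW}",
}

for name, sql in queries.items():
    rows = run_query(sql)
    print(f"--- {name}: {len(rows)} events ---")
```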
3: GenAI
Security data lakes are also powerful tools for GenAI projects, providing data that is useful for several approaches:
- Retrieval-Augmented Generation (RAG)
- Fine-tuning
- Training a foundation model
It's usually recommended to start with RAG and fine-tuning on top of a base foundation model rather than training your own model from scratch.
RAG
Retrieval-Augmented Generation (RAG) involves retrieving data and incorporating it into your prompt to enhance the response generated by a large language model (LLM). This approach brings relevant data into the context window of the model to improve the quality of its answers.
Here's a scenario showing how you could use your security data lake with an LLM and RAG to answer security questions.
Let's say you want to build an internal tool at your company that allows you to ask security questions in plain English.
For example, you want your tool to be able to answer questions like: "Are we adequately monitoring and mitigating potential insider threats?"
Here is how the internal tool could work:
- Let's say you choose Claude 3.5 Sonnet via Amazon Bedrock to be your LLM. This model is reasonably capable of writing code and queries.
- Before sending the question to the LLM, your internal tool retrieves all of the tables and schemas in your data lake catalogs and uses them to augment the prompt, giving the LLM the context it needs to generate a better response.
- Given the augmented prompt, the LLM responds by providing a few suggested SQL queries you can run against your data lake. If you asked it the question, "Are we adequately monitoring and mitigating potential insider threats?", it might reply with these query suggestions:
- Query the Cloud Audit Log Events table to look for the largest data transfers to a single IP address over the past month, which could indicate insider data exfiltration.
- Query the Identity Access Management Events table to look for users with a high number of failed login attempts.
- Query the Detection Alerts table to determine how many alerts are tagged as related to insider threats. If there are no such alerts, it might indicate insufficient coverage in detection rules.
- The tool then runs the suggested queries, fetches the results, and builds a new augmented prompt with that data included. The LLM responds to this prompt by summarizing the data and answering the original question (a minimal code sketch of this loop follows this list).
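Here is a minimal sketch of the first half of that loop, assuming AWS Glue as the data lake catalog and the Amazon Bedrock Converse API for Claude 3.5 Sonnet; the database name, model ID, and prompt wording are assumptions to adapt to your environment:

```python
# Minimal sketch of the RAG workflow described above: pull table schemas from
# the Glue data catalog, add them to the prompt, and ask Claude 3.5 Sonnet via
# Amazon Bedrock to suggest SQL queries. Names below are hypothetical.
import boto3

glue = boto3.client("glue")
bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # use the model ID enabled in your account
DATABASE = "security_lake"                              # hypothetical Glue database for the lake

def describe_schemas(database: str) -> str:
    """Return a plain-text description of every table and its columns."""
    tables = glue.get_tables(DatabaseName=database)["TableList"]
    return "\n".join(
        f"{t['Name']}(" + ", ".join(f"{c['Name']} {c['Type']}" for c in t["StorageDescriptor"]["Columns"]) + ")"
        for t in tables
    )

def ask(prompt: str) -> str:
    """Send a single-turn prompt to the model and return its text reply."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

question = "Are we adequately monitoring and mitigating potential insider threats?"

# Step 1: augment the question with the lake's schemas and ask for query suggestions.
augmented_prompt = (
    "You are a security analyst assistant. Our security data lake contains these tables:\n"
    f"{describe_schemas(DATABASE)}\n\n"
    f"Suggest a few SQL queries that would help answer this question: {question}"
)
print(ask(augmented_prompt))

# Step 2 (not shown): run the suggested queries with your query engine, then send
# the results back in a second prompt asking the model to summarize them and
# answer the original question.
```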
A security data lake can be highly useful for GenAI use cases, providing a rich data source that can make an LLM much more capable of providing insights.
Challenges your team may face running a security data lake
Data lakes are powerful, but they come with various challenges due to their sheer scale and the variety of data they contain.
How do we route data into the data lake?
Many vendors offer features to automatically export data to cloud storage, such as AWS S3. Examples include Cloudflare DNS and HTTP logs, CrowdStrike Falcon Data Replicator for XDR logs, and AWS CloudTrail audit logs, WAF logs, and VPC flow logs. This kind of export feature is likely to become increasingly common.
Log pipeline tools like Cribl, Fluentd, and Logstash can help fetch data from different sources and route it into cloud storage.
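For a source that has no native export and is not covered by your pipeline tool, a small custom shipper can fill the gap. Here is a minimal sketch that writes batches of events as gzipped, newline-delimited JSON under a date-partitioned S3 prefix; the bucket name and layout are assumptions:

```python
# Minimal sketch of a hand-rolled log shipper: batch events as gzipped NDJSON
# and write them to a date-partitioned prefix in S3. Bucket and prefix names
# are hypothetical placeholders.
import gzip
import json
from datetime import datetime, timezone
from uuid import uuid4

import boto3

s3 = boto3.client("s3")
BUCKET = "my-security-lake"

def ship_batch(source: str, events: list[dict]) -> None:
    """Write one batch of events under a dt=YYYY-MM-DD partition for the source."""
    now = datetime.now(timezone.utc)
    key = f"{source}/dt={now:%Y-%m-%d}/{now:%H%M%S}-{uuid4().hex}.json.gz"
    body = gzip.compress("\n".join(json.dumps(e) for e in events).encode())
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)

# Example usage with a single fake DNS event.
ship_batch("dns_logs", [
    {"event_time": "2024-05-02T14:37:00Z", "src_ip": "10.0.0.5", "query_name": "example.com"},
])
```

Partitioning by date keeps later queries cheap, since engines like Athena can prune partitions and scan only the days you ask about.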
At Scanner, we have a guide in our documentation about how to load a few dozen of the most common security log sources into a data lake.
The trend seems to be that eventually almost all security data sources will export directly to cloud storage like AWS S3 or Azure Blob Storage, making it easy to feed them into security data lakes.
How do we handle the wide variety of data formats?
In a data lake, data exists in a variety of formats, both structured and unstructured. This diversity can make querying difficult.
Some tools provide mechanisms for transforming common data sources into a standard schema, such as the Open Cybersecurity Schema Framework (OCSF). Amazon Security Lake is one such tool.
Tools like Presto or Amazon Athena can scan files in their raw format, but they require a defined schema, which can be challenging when dealing with diverse data formats.
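For example, before Athena can query raw newline-delimited JSON logs, you need to declare a table schema up front, roughly like this sketch (the table, columns, database, and bucket are hypothetical; the SerDe shown is the one commonly used for JSON):

```python
# Minimal sketch: declare a schema so Athena can query raw JSON DNS logs.
# Table, column, database, and bucket names are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS dns_logs (
    event_time string,
    src_ip     string,
    query_name string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-security-lake/dns_logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "security_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

Every new field or format change means revisiting definitions like this one, which is where the schema-maintenance burden comes from.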
A new category of data lake tools has emerged that can analyze both structured and unstructured data without requiring users to define schemas beforehand. The team I work with at Scanner.dev is building tools in this space, making it as easy as possible to run full-text search and detections on any data format.
How do we speed up queries now that our data set is very large?
Handling hundreds of terabytes or even petabytes of data can be slow unless you take steps to optimize or use specialized tools. To optimize SQL queries on tabular data, consider converting data to Parquet format. For optimizing full-text search on unstructured data, creating a search index can be beneficial.
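As a small illustration of the Parquet step, here is a minimal sketch that converts a newline-delimited JSON log file into a compressed Parquet file with pyarrow; the file names are placeholders:

```python
# Minimal sketch: convert newline-delimited JSON logs to Parquet so SQL engines
# can scan them faster and read only the columns they need. Paths are placeholders.
import pyarrow.json as pj
import pyarrow.parquet as pq

table = pj.read_json("dns_logs-2024-05-02.json")  # NDJSON in, Arrow table in memory
pq.write_table(table, "dns_logs-2024-05-02.parquet", compression="zstd")
```

Columnar layout plus compression means engines like Athena and Presto scan far fewer bytes per query, which is usually both faster and cheaper.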
Choosing the right tool for analyzing different kinds of data is crucial:
- For structured, tabular data in formats like CSV and Parquet (e.g., network flow logs, DNS logs), use a SQL-based query engine, like Presto, Amazon Athena, or Azure Synapse Analytics.
- For unstructured, complex data, like deeply nested JSON (e.g., web application firewall logs, cloud audit logs, and identity logs), there is a new category of tools for full-text search over unstructured data, like Scanner.dev.
The Road Ahead
As the security landscape evolves, security data lakes will remain essential for staying ahead of threats, responding effectively to detection alerts, and leveraging AI advancements to strengthen security.
I'm convinced that eventually almost all data relevant to security will make its way to data lakes in cloud storage. The tools here will continue to improve, and it will become easier and easier to analyze the wide diversity of data stored there.
Thanks to their ability to store and analyze massive data sets in a wide variety of formats, security data lakes can provide immense value for threat hunting, detection and response, and now even GenAI.