Recently, Microsoft announced the public preview of Microsoft Sentinel Data Lake[1]. This data lake makes it easier and more cost-effective to store data for the long term. In this article, we will delve into this data lake feature.
Note: This article is based on information available as of August 13, 2025. This feature is currently in public preview. The author is not responsible for any configuration errors or data loss in your environment. It is recommended to start with an evaluation in a small-scale test environment.
The Security Log Management Dilemma
It is often said that the global volume of data is exploding year after year. Consequently, security logs are also increasing rapidly, forcing many security teams to make difficult choices. Storing all logs in a perpetually analyzable state (the Analytics Tier in Sentinel) incurs enormous costs. Therefore, many companies likely manage by either narrowing the scope of log collection, creating blind spots, or shortening the retention period, sacrificing traceability and auditability.
Alternatively, some companies store logs separately in Syslog servers or Blob storage without ingesting them into a SIEM. However, this "you get what you pay for" approach is inefficient, as it requires ingesting the data into an analysis platform whenever an investigation is needed, and it's unsuitable for routine monitoring and alert detection. While large enterprises with deep pockets might solve this by storing all logs in the analytics tier, this is not sustainable and certainly not cost-effective.
Microsoft Sentinel Data Lake
Against this backdrop, the newly introduced Microsoft Sentinel Data Lake makes long-term data storage cheaper and easier. Although it has limitations compared to Sentinel's standard "Analytics Tier," it supports KQL queries, making it a great platform for digging "something" out of long-term stored logs during an incident.
Comparison of Analytics Tier and Data Lake Tier
Microsoft's own table summarizing the Analytics Tier, the Data Lake Tier, and their respective key features is very easy to understand, so I have quoted it here.[2] This is not me being lazy.
| Feature | Analytics Tier | Data Lake Tier |
|---|---|---|
| Key characteristics | High-performance querying and indexing of logs (also known as hot or interactive retention) | Cost-effective long-term retention of large data volumes (also known as cold storage) |
| Best for | Real-time analytics rules, alerts, hunting, workbooks, and all Microsoft Sentinel features | Compliance and regulatory logging, historical trend analysis and forensics, less-frequently touched data that doesn't require real-time alerts |
| Ingestion cost | Standard | Minimal |
| Query pricing | Included ✅ | Billed separately ❌ |
| Optimized query performance | ✅ | Slower queries ❌ Suitable for audits, but not optimized for real-time analysis |
| Query capabilities | Full query capabilities in Microsoft Defender and Azure portal, and API usage | Full KQL on a single table (can be enriched with data in analytics tables using lookups), run scheduled KQL or Spark jobs, use notebooks |
| Real-time analytics capabilities | Full set ✅ | Limited ❌ Restrictions on some features like analytics rules, hunting queries, parsers, watchlists, workbooks, and playbooks |
| Search jobs | ✅ | ✅ |
| Summary rules | ✅ | ✅ |
| KQL | Full functionality | Limited to a single table |
| Restore | ✅ | ❌ |
| Data export | ✅ | ❌ |
| Retention period | 90 days for Microsoft Sentinel, 30 days for Microsoft Defender XDR. Can be extended up to 2 years with a prorated monthly long-term retention fee | Same as analytics retention by default. Can be extended up to 12 years |
Note: I'm still investigating what exactly "Restore" refers to. If it means bringing data into the analytics tier, that can already be done from the Data Lake Tier. Perhaps it refers to whether a tier can be used as a destination for data restoration?
By the way, while a direct cost comparison isn't straightforward, ingesting data into the Analytics Tier with Sentinel's Pay-as-you-go plan costs $4.30 USD per GB, whereas the Data Lake costs $0.05 per GB for ingestion and $0.026 per GB per month for storage.[3]
Note: Please note that the capabilities of the Analytics Tier and the Data Lake are different, so it's not an apples-to-apples comparison. For example, running KQL queries against the Data Lake incurs additional costs. Microsoft's official information mentions it's less than 10% of the traditional cost, which can be a useful guideline. However, please estimate the actual costs based on your own use cases.
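To make the difference concrete, here is a back-of-the-envelope calculation using the list prices above, written as a throwaway KQL `print` statement. The 1 TB/month volume is an assumed example, and data lake query charges are not included, so treat this as a rough sketch rather than an estimate.

```kql
// Hypothetical example: ingesting 1,000 GB (~1 TB) per month at the preview list prices above.
// The volume and prices are assumptions for illustration only; verify current pricing yourself.
print
    AnalyticsIngestUSD   = 1000 * 4.30,   // ~4,300 USD to ingest 1 TB into the analytics tier
    DataLakeIngestUSD    = 1000 * 0.05,   // ~50 USD to ingest the same 1 TB into the data lake tier
    DataLakeStorageUSDpm = 1000 * 0.026   // ~26 USD per month to keep it in data lake storage
```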
Trying Out Microsoft Sentinel Data Lake
Prerequisites
To onboard to the Microsoft Sentinel Data Lake public preview, you must meet the following prerequisites:
- Microsoft Defender and Microsoft Sentinel must be integrated and available in Defender XDR.
- You need an existing Azure subscription and resource group for data lake billing, and you must have owner permissions on the subscription.
- The Microsoft Sentinel primary workspace must be connected to the Microsoft Defender portal.
- You need read permissions to the primary and other workspaces that you want to attach to the data lake.
- The Microsoft Sentinel primary workspace and other workspaces must be in the same region as your tenant's home region (a public preview constraint).
Note: As a public preview constraint, the Sentinel primary workspace must be in the same geographical region as the tenant (Entra ID). The "Entra ID" part is key. Many Japanese users might have their Defender logs stored in the US, but their Entra ID is likely in Japan (how confusing...). Therefore, configure and test your Sentinel workspace in the Japan East region. Also, be aware that not all regions are supported during the public preview.
Onboarding (Initial Setup)[4]
The steps to onboard your tenant to the Microsoft Sentinel Data Lake are simple. As a prerequisite, connect Sentinel to the Defender portal via the SIEM workspace feature. I wrote a blog about this at my previous job, which you can refer to here:
https://blog.cloudnative.co.jp/24112/
Next, navigate to the data lake settings page in the Defender XDR portal ( https://security.microsoft.com ) under [System] > [Settings] > [Microsoft Sentinel] > [Data lake]. Once all prerequisites are met, a setup button will appear. Click "Start setup" to launch the configuration screen, enter the required information, and click "Set up data lake." It can take up to 60 minutes for the data lake to be fully created and linked to your Defender tenant.
While the process is running, you will see the message "Lake setup in progress"; after a while, the data lake setup completes.
Once setup is complete, a new data lake exploration view appears in Defender XDR, and a workspace named "default" is created. This is the "Default" workspace shown in the workspace selector for KQL queries, created by Microsoft Sentinel Data Lake during onboarding.
Long-term Log Storage in the Data Lake Tier
Data connectors that ingest logs into Microsoft Sentinel are configured by default to send data to both the analytics tier and the long-term storage data lake tier. Once a Sentinel data connector is enabled, data is pushed to the analytics tier and automatically mirrored to the data lake tier. Mirroring data to the data lake with the same retention period as the analytics tier does not incur additional billing charges.[5]
After setting up the data lake, additional storage costs for the data lake are only incurred if the retention period is extended, as shown in the image below. Settings can be configured per table in the [Defender XDR portal] > [Microsoft Sentinel] > [Configuration] > [Tables].
Ingesting Data Directly into the Data Lake Tier
Alternatively, it is possible to ingest data only into the data lake tier. Raw logs from firewalls or Entra ID's `AADNonInteractiveUserSignInLogs` can be very costly if sent straight to the analytics tier. With the new data lake, you can stream such data directly to the data lake, skipping Sentinel's analytics tier entirely: ingestion into the analytics tier is stopped, and the data is stored only in the data lake.
KQL Queries
A key feature of the data lake tier is that you're not just storing data; you can also investigate logs by running KQL queries. This may sound simple, but it's actually very important. Previously, when storing logs in cheap storage (like Blob storage), analysis required the cumbersome step of ingesting them into an analysis platform first. In contrast, the data lake keeps costs low while allowing you to quickly run queries when an investigation is needed, which is a fantastic benefit.
Note: Please be aware that running KQL against the data lake tier is a paid feature. Also, note that you cannot join two tables in a query.
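As a rough sketch of what a data lake query can look like, the example below runs against a single table over a long time range. The table is a standard Entra ID log table, but the retention window and thresholds are illustrative assumptions; adjust them to whatever is actually mirrored into your data lake.

```kql
// Hypothetical single-table query against the data lake tier (no joins across tables).
// The one-year window and the threshold of 50 are illustrative assumptions.
AADNonInteractiveUserSignInLogs
| where TimeGenerated between (ago(365d) .. now())
| where ResultType != "0"                       // failed sign-ins only
| summarize FailedSignIns = count() by UserPrincipalName, bin(TimeGenerated, 1d)
| where FailedSignIns > 50
| order by FailedSignIns desc
```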
Search & Restore
As mentioned above, KQL investigations in the data lake tier have constraints. However, if you need to perform a thorough investigation without these limitations, you can restore the data to the analytics tier. On the [Search & Restore] tab, you can select a table and a time period to restore the data, allowing you to investigate it in Advanced Hunting.
However, be aware that the data format will be different from data ingested directly into Advanced Hunting, as shown below.
Jobs
In addition, you can use Jobs to move small amounts of data directly from the data lake to the analytics tier. A job is a feature that runs a KQL query against data in the data lake tier and promotes the results to the analytics tier. You can run these as one-off or scheduled tasks.
While the analytics tier incurs a higher billing rate than the data lake tier, KQL lets you filter and reduce the data before promotion, keeping costs down. This way you can send all of your data to the data lake tier and promote only the specific logs that meet certain criteria to the analytics tier for further hunting.
When moving data with Jobs, a dedicated table is used (or created) for the job's output. Be aware that query changes may be necessary later on because the table name changes.
Select the Sentinel workspace and write your query.
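For instance, a job query might pre-filter and aggregate data in the lake before promoting the result. The table below is a standard Sentinel table, but the filter conditions and aggregation are hypothetical placeholders, not a prescribed pattern.

```kql
// Hypothetical job query: reduce raw firewall-style logs before promoting the result
// to the analytics tier. The deny filter, 1-day window, and threshold are assumptions.
CommonSecurityLog
| where TimeGenerated > ago(1d)
| where DeviceAction == "deny"
| summarize DenyCount = count() by SourceIP, DestinationIP, DestinationPort, bin(TimeGenerated, 1h)
| where DenyCount > 100
```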
In the schedule settings, you can choose a one-time, daily, weekly, or monthly execution interval.
Once the job runs, you can view the completed jobs in a list.
As mentioned before, when migrating data, a separate table is created (or selected), so you will view the data in Advanced Hunting as a Custom Log.
Summary
Previously, while it was possible to store data long-term and cheaply using services like Blob storage, it wasn't truly cost-effective once setup and operational overhead were factored in. With the arrival of the Data Lake, the cost-performance has improved significantly. What's also interesting is that Microsoft describes this data lake not just as a cheap storage option, but as something that "accelerates the adoption of agentic AI." For now, its value in mitigating data silos is clear, but it's also evident that Microsoft has a vision for its future use in AI, so I'll be keeping a close eye on how it develops.