Hope for the Best, Prepare for the Worst: How to prepare for cloud DFIR

Aug 19 2023

Understand the specific steps that can be taken to significantly improve your organization's cloud incident response efficiency and efficacy.

Authored by Megan Roddie-Fonseca

While incident response is reactive in nature, there are steps DFIR teams can proactively take to ensure that if the worst happens they will be prepared to respond. In this blog post, we will provide three key recommendations that will help organizations improve their ability to efficiently and effectively respond if an incident occurs. Specifically, we will discuss:

Configuring cloud logging
Creating accounts and resources for responders
Understanding the environment

Although this is by no means an exhaustive list, it is a starting point that can significantly improve how your organization handles incidents. You should also take time to identify and implement proactive controls aimed at defending against threats, but our focus in this post will be on preparing for when those proactive controls fail.

1. Configuring cloud logging

When it comes to the cloud, our biggest source of evidence is logs. Whether it's sign-in logs, audit logs, resource logs, or any of the many other logs available, logging is what provides us visibility into activity in the environment. Without logging, it becomes very difficult to investigate incidents. While it is outside the scope of this post to go into the specific settings and configurations for each cloud service provider (CSP), it provides high-level guidance to help your organization understand the importance of cloud logging and, ultimately, implement it properly based on a few key considerations. Specifically, we will discuss the following two topics related to logging:

Enabling non-default events
Storing and centralizing logs

Before diving into each of these categories, it is important to discuss cost. Each cloud provider will give you a certain level of logging for free, typically the default logs for a limited retention period. Enabling additional services and/or turning on additional service features (e.g., logging additional events, increasing log retention periods, or even simply enabling logging for a service), however, may incur additional charges from the service provider. Most CSPs provide pricing details and calculators if you want to estimate how much investment some of these actions require, so we won’t go in-depth on incurred charges in this post.

Enabling Non-Default Events

Most cloud providers have a set of logs for each service that are enabled by default yet may offer additional, often valuable, logs that are disabled by default. There is considerable variation between which CSP services have which logs enabled by default, but generally the pattern is that events related to managing or administering the environment (control plane logs) are enabled by default, while events related to specific resource activity (data plane logs) are likely disabled by default.

Some examples of management events logged by default are:

New access key creation

Global administrator creation

Creation or deletion of a virtual machine (or other resources)

Some examples of resource activity logs disabled by default are:

Data read or write activity

Flow logs

Application or OS logs from VMs

It is important to understand which logging events you have enabled vs. disabled. If you find out during an incident that logs you need are disabled, you will have a major visibility gap. For example, let’s say your organization has sensitive data in an AWS S3 bucket, and you’re tasked with identifying whether that data was exposed as part of a breach. If you haven’t proactively enabled S3 data event logging and server access logging, neither of which is on by default, you will have a significant gap in your visibility and be unable to conclusively determine whether or not data was exposed.
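As a rough illustration of closing this gap for a single bucket, the sketch below builds the event selector that AWS CloudTrail uses to record S3 object-level (data) events and, optionally, applies it with boto3's `put_event_selectors`. The function names, trail name, and bucket name are hypothetical, and the sketch assumes a trail already exists and that boto3 is installed with suitable credentials. Server access logging is configured separately and is not covered here.

```python
def s3_data_event_selector(bucket_names, read_write_type="All"):
    """Build a CloudTrail event selector covering object-level events for the given buckets."""
    return {
        "ReadWriteType": read_write_type,  # "ReadOnly", "WriteOnly", or "All"
        "IncludeManagementEvents": True,   # keep control plane events on the trail
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # The trailing slash scopes the selector to every object in the bucket.
                "Values": [f"arn:aws:s3:::{name}/" for name in bucket_names],
            }
        ],
    }


def enable_s3_data_events(trail_name, bucket_names):
    """Apply the selector to an existing trail (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudtrail = boto3.client("cloudtrail")
    return cloudtrail.put_event_selectors(
        TrailName=trail_name,
        EventSelectors=[s3_data_event_selector(bucket_names)],
    )


# Hypothetical bucket name, used only to show the shape of the selector.
selector = s3_data_event_selector(["example-sensitive-bucket"])
```

Keep in mind that `put_event_selectors` replaces the selectors already configured on the trail, and that data events are billed separately, so review the existing configuration and pricing before applying anything like this.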

That’s not to say that all non-default events should be turned on. That is unrealistic from a data processing and cost perspective. Instead, your organization needs to evaluate what resources need to be monitored at what logging level and apply policies based on your requirements. Start by focusing on enabling additional logging for sensitive data and resources.
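One way to make this evaluation repeatable is to write the policy down as data and check an inventory against it. The sketch below is a minimal example under assumed inputs: the sensitivity tiers, log type names, and inventory records are all hypothetical placeholders for whatever your organization defines.

```python
# Hypothetical policy: which log types each sensitivity tier must have enabled.
REQUIRED_LOGS = {
    "high": {"control_plane", "data_access", "flow"},
    "medium": {"control_plane", "data_access"},
    "low": {"control_plane"},
}


def find_logging_gaps(resources):
    """Return {resource name: sorted missing log types} for resources below policy."""
    gaps = {}
    for res in resources:
        missing = REQUIRED_LOGS[res["sensitivity"]] - set(res["enabled_logs"])
        if missing:
            gaps[res["name"]] = sorted(missing)
    return gaps


# Illustrative inventory; in practice this would come from your CSP's APIs or tooling.
inventory = [
    {"name": "customer-data-bucket", "sensitivity": "high", "enabled_logs": ["control_plane"]},
    {"name": "public-website-assets", "sensitivity": "low", "enabled_logs": ["control_plane"]},
]
print(find_logging_gaps(inventory))
# {'customer-data-bucket': ['data_access', 'flow']}
```

Only the sensitive bucket is flagged, which matches the advice to spend the additional logging budget where the data warrants it.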

Storing and Centralizing Logs

On the topic of storing and, preferably, centralizing logs, there are a few aspects that we need to discuss. The first is how you are going to store your logs. Many of the CSPs provide you with multiple methods by which you can access logs. For example, Azure will allow you to view logs in the Azure Portal, send them to a Log Analytics Workspace or a Storage Account, or export them via Event Hubs or the Graph API. All methods have their pros and cons, and it's up to your organization to decide which method best fits its needs.

Location, location, location. Ideally, logs from all data sources will be centralized into a single location. This is particularly critical when investigating incidents, as it allows for quicker correlation and identification of related events and significantly reduces response time. It is generally recommended that resources be deployed regionally, which can complicate centralization. However, having to go to multiple locations and/or services to find logs creates a gap in visibility and increases analyst overhead. You will likely need to take a programmatic approach to centralize the logs from across regions, but it is more than worth the effort in order to effectively investigate incidents.
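To show why a single location helps with correlation, here is a minimal sketch that merges per-region event lists into one chronological timeline using Python's standard library. The record layout (a `time` field holding an ISO 8601 timestamp with an explicit UTC offset, plus an `action` field) and the region names are assumptions for illustration; real CSP log schemas differ.

```python
import heapq
from datetime import datetime


def parse_time(event):
    """Parse the ISO 8601 timestamp stored under the hypothetical 'time' field."""
    return datetime.fromisoformat(event["time"])


def merge_regional_logs(logs_by_region):
    """Merge per-region event lists into one chronological timeline."""
    streams = []
    for region, events in logs_by_region.items():
        # Tag each event with its region, then sort each stream before merging.
        tagged = [dict(event, region=region) for event in events]
        streams.append(sorted(tagged, key=parse_time))
    return list(heapq.merge(*streams, key=parse_time))


logs = {
    "us-east-1": [{"time": "2023-08-01T10:05:00+00:00", "action": "CreateAccessKey"}],
    "eu-west-1": [
        {"time": "2023-08-01T10:01:00+00:00", "action": "ConsoleLogin"},
        {"time": "2023-08-01T10:09:00+00:00", "action": "GetObject"},
    ],
}
timeline = merge_regional_logs(logs)
print([event["action"] for event in timeline])
# ['ConsoleLogin', 'CreateAccessKey', 'GetObject']
```

With the events interleaved, a login in one region followed by key creation in another reads as a single sequence rather than two unrelated entries in separate consoles.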

Each CSP provides in-cloud options for this, and you can also leverage APIs or other cloud services to export CSP logs to an external service such as a SIEM or log aggregation tool. One thing to keep in mind is that both exporting logs and using cloud-native log aggregation services will result in additional charges, once again emphasizing the importance of identifying which logs are of value to your organization and what your retention policies should be.

It's critical to consider your log retention period on a service-by-service basis. For example, the default logs for any of the CSPs typically come with a restricted retention period, sometimes as low as 30 days. Many incidents involve long dwell times, and having logs with a short retention period can greatly impact your ability to see the whole picture and get to the root cause of an incident. For that reason, it’s ideal to increase log retention by routing logs to another service or storage location. However, as previously mentioned, the longer the log retention period, the more data you generate, transmit, and store, and hence the more charges you will incur.
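The trade-off between retention and volume can be checked with simple arithmetic. The sketch below compares each log source's retention against the lookback window you want to support and estimates the steady-state volume held for a given retention period. All source names and numbers are hypothetical; actual charges depend on your provider's pricing.

```python
def retention_gaps(retention_days_by_source, required_days):
    """Return {source: shortfall in days} for sources retained for less than required_days."""
    return {
        source: required_days - days
        for source, days in retention_days_by_source.items()
        if days < required_days
    }


def stored_volume_gb(daily_ingest_gb, retention_days):
    """Steady-state volume held once the retention window is full."""
    return daily_ingest_gb * retention_days


# Hypothetical retention settings, in days.
retention = {"sign_in_logs": 30, "audit_logs": 90, "archived_flow_logs": 365}
print(retention_gaps(retention, required_days=180))
# {'sign_in_logs': 150, 'audit_logs': 90}
print(stored_volume_gb(daily_ingest_gb=5, retention_days=180))
# 900
```

Feeding the resulting volume into your CSP's pricing calculator gives a rough cost for closing each gap.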

2. Creating accounts and resources for responders

If an incident occurs in the cloud, it's likely that incident responders will need access to a variety of services and service data to investigate. You do not want to waste precious time during an incident trying to get your IR team access to the necessary resources. Proactively creating IR-specific roles or groups not only reduces the risk of over-provisioning an account’s permissions during the stress of an incident, which could lead to further compromise, but also allows you to scope out needs well beforehand and hence adhere to the principle of least privilege.

Depending on how your architecture is structured, the IR team will likely need visibility over your entire organization (all projects, subscriptions, management groups, OUs, etc.). Creating a role with the permissions required that can quickly be assigned to those that need access is the best approach. The level of permissions required, however, will vary. For example, read-only access may suffice for some services (the ability to read logs), while write permissions may be required for others (such as the ability to create snapshots). This is best determined well ahead of time with a tabletop exercise, wherein the goal is to identify the steps that will potentially be involved during an IR engagement and ensure that the responders will have all the permissions needed to take action.
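The output of such a tabletop exercise can be captured as a mapping from response steps to the permissions each step needs, and then compared against what the IR role actually grants. The sketch below is one possible shape for that check. The action names follow AWS IAM naming, but the step list and the role contents are hypothetical, and the comparison is an exact string match that does not expand wildcards such as `ec2:*`.

```python
# Hypothetical tabletop output: each response step and the permissions it needs.
STEP_PERMISSIONS = {
    "review control plane logs": {"cloudtrail:LookupEvents"},
    "read application logs": {"logs:FilterLogEvents", "logs:GetLogEvents"},
    "snapshot a suspect volume": {"ec2:DescribeVolumes", "ec2:CreateSnapshot"},
}


def missing_permissions(role_permissions, step_permissions=STEP_PERMISSIONS):
    """Return {step: sorted missing permissions} for steps the role cannot fully perform."""
    granted = set(role_permissions)
    gaps = {}
    for step, needed in step_permissions.items():
        missing = needed - granted
        if missing:
            gaps[step] = sorted(missing)
    return gaps


# Illustrative IR role that can read logs but was never granted snapshot creation.
ir_role = {
    "cloudtrail:LookupEvents",
    "logs:FilterLogEvents",
    "logs:GetLogEvents",
    "ec2:DescribeVolumes",
}
print(missing_permissions(ir_role))
# {'snapshot a suspect volume': ['ec2:CreateSnapshot']}
```

Finding the missing snapshot permission in a check like this is far cheaper than discovering it in the middle of an engagement.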

In another blog post on cloud DFIR, we talked about the capabilities that the cloud provides to incident responders. One of those was the ability to run forensic workstations in the cloud. Not only does this reduce egress costs when working with cloud data, but it also prevents responders from being limited by the hardware in their possession. To take full advantage of this capability, we recommend creating a forensic machine image ahead of time that has all the tools required to carry out investigations. Even more effective would be the use of infrastructure-as-code (IaC) templates to deploy all resources required for a DFIR workstation, such as the VMs, networking requirements, permissions, and more. All three major CSPs provide this type of service.

Terraform can also be used for the same purpose and supports all of the above clouds.

First, determine what requirements you have for a forensic workstation, both in terms of compute power and installed software, as well as what connectivity and permissions are needed in the scope of the environment. Once you have this information, develop an infrastructure-as-code template based on the requirements, which will allow any responder with the assigned permissions to spin up their own forensic workstation in minutes rather than hours.
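As a minimal sketch of what such a template might contain, the example below generates a small AWS CloudFormation template for a single forensic VM from Python. CloudFormation is used here only as one example of an IaC service; the parameter names, the default instance type, and the `CaseId` tag are assumptions, and a real template would also need networking, security groups, and an instance role matched to your environment.

```python
import json


def forensic_workstation_template(default_instance_type="m5.xlarge"):
    """Return a minimal CloudFormation template (as a dict) for one forensic VM."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Forensic workstation for incident responders (illustrative sketch)",
        "Parameters": {
            # The prebuilt forensic machine image and target subnet are supplied at deploy time.
            "ForensicImageId": {"Type": "AWS::EC2::Image::Id"},
            "SubnetId": {"Type": "AWS::EC2::Subnet::Id"},
            "InstanceType": {"Type": "String", "Default": default_instance_type},
            "CaseId": {"Type": "String"},
        },
        "Resources": {
            "ForensicWorkstation": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "ForensicImageId"},
                    "InstanceType": {"Ref": "InstanceType"},
                    "SubnetId": {"Ref": "SubnetId"},
                    # Tagging by case makes cleanup and cost attribution easier.
                    "Tags": [{"Key": "CaseId", "Value": {"Ref": "CaseId"}}],
                },
            }
        },
    }


print(json.dumps(forensic_workstation_template(), indent=2))
```

Keeping the image ID and instance type as parameters lets responders reuse one template across cases while sizing the workstation to the evidence at hand.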

3. Understanding the environment

This concept may appear very broad, but its importance cannot be overstated: while fundamental similarities exist, DFIR in the cloud can be very different from DFIR on-premises. One of the challenges of doing DFIR for cloud environments vs. on-premises is that responders need to have at minimum a basic understanding of cloud concepts in addition to understanding organization-specific details. Responders who are assigned to cloud incidents without an understanding of how the cloud works may not be able to successfully perform an investigation or remediate threats. At minimum, responders should be prepared to approach log analysis from a cloud perspective, knowing where to look for logs and how to interpret them. There are plenty of free resources online that responders can use to get up to speed, as well as high-quality paid training opportunities, such as FOR509: Enterprise Cloud Forensics and Incident Response.

After gaining an understanding of the cloud as a whole and any concepts specific to the CSP(s) you use, seek to grow your understanding of how the organization leverages the cloud. Again, going into an incident without having ever worked in your organization’s cloud environment is going to make it very challenging to interpret the activity you are seeing in logs and to know what risks your organization may be vulnerable to. Outside of engagements, the DFIR team should connect with your cloud administrators and seek to increase their understanding of how the organization’s cloud environment is structured, how permissions are assigned, what policies exist and how they are enforced, which cloud services are used, and any other information that will be needed during response. If the cloud administrators are unable to provide this information because it has not been documented, we recommend identifying the context yourself, although programmatic solutions or scripts will need to be developed to ensure complete coverage. Any gathered information should be documented in a place accessible to responders during incidents and for organizational reference.
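A script for this purpose can start small. The sketch below assumes you have already exported a list of resource records from your CSP's inventory tooling and simply summarizes which services are in use in each account. The field names (`account`, `service`) and the sample records are hypothetical.

```python
from collections import defaultdict


def summarize_inventory(resources):
    """Group resource records into {account: {service: resource count}}."""
    summary = defaultdict(lambda: defaultdict(int))
    for res in resources:
        summary[res["account"]][res["service"]] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {account: dict(services) for account, services in summary.items()}


# Illustrative export; real records would come from your CSP's inventory APIs.
exported = [
    {"account": "prod", "service": "storage", "name": "customer-data"},
    {"account": "prod", "service": "compute", "name": "api-server-1"},
    {"account": "prod", "service": "compute", "name": "api-server-2"},
    {"account": "dev", "service": "compute", "name": "build-agent"},
]
print(summarize_inventory(exported))
# {'prod': {'storage': 1, 'compute': 2}, 'dev': {'compute': 1}}
```

Even a coarse summary like this tells responders which services they should expect to see in logs and which accounts deserve documentation first.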

Conclusion

In this blog post, we’ve provided specific steps that can be taken to significantly improve your organization's cloud incident response efficiency and efficacy. We focused on expanding cloud logging capabilities, providing access and resources to responders, and developing an understanding of your cloud environment. This list is by no means exhaustive; it is meant to provide a starting point for your cloud DFIR journey and to help strengthen your organization’s overall security posture.

Meet the expert

Megan Roddie-Fonseca

Certified Instructor

Megan is a Senior Security Engineer at Datadog, SANS DFIR faculty, and co-author of FOR509. She holds two master’s degrees, serves as CFO of Mental Health Hackers, and is a strong advocate for hands-on cloud forensics training and mental wellness.
