SANS Digital Forensics and Incident Response Blog | Timeline analysis with Apache Spark and Python

This blog post introduces a technique for timeline analysis that mixes a bit of data science and domain-specific knowledge (file-systems, DFIR).

Analyzing CSV formatted timelines by loading them with Excel or any other spreadsheet application can be inefficient, even impossible at times. It all depends on the size of the timelines and how many different timelines or systems we are analyzing.

Looking at timelines that are gigabytes in size or trying to correlate data between 10 different system's timelines does not scale well with traditional tools.

One way to approach this problem is to leverage some of the open source data analysis tools that are available today. Apache Spark is a fast and general engine for big data processing. PySpark is its Python API, which in combination with Matplotlib, Pandas and NumPY, will allow you to drill down and analyze large amounts of data using SQL-syntax statements. This can come in handy for things like filtering, combining timelines and visualizing some useful statistics.

These tools can be easily installed on your SANS DFIR Workstation, although if you plan on analyzing a few TBs of data i would recommend setting up a Spark cluster separately.

The reader is assumed to have a basic understanding of Python programming.

Quick intro to Spark

This section is a quick introduction to PySpark and basic Spark concepts. For further details please check the Spark documentation, it is well written and fairly up to date. This section of the blog post does not intend to be a Spark tutorial, I'll barely scratch the surface of what's possible with Spark.

From apache.spark.org: Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark's documentation site covers the basics and can get you up and running with minimal effort. Best way to get started is to spin up an Amazon EMR cluster, although that's not free. You can always spin up a few VMs in your basement lab :)

In the context of this DFIR exercise, we'll leverage Spark SQL and Spark's DataFrame API. The DataFrame API is similar Python's Pandas. Essentially, it allows you to access any kind of data in a structured, columnar format. This is easy to do when you are handling CSV files.

Folks over at Databricks have been kind enough to publish a Spark package that can convert CSV files (with headers) into Spark DataFrames with one simple Python line of code. Isn't that nice? Although, if you are brave enough, you can do this yourself with Spark and some RDD transformations.

To summarize, you'll need the following apps/libraries on your DFIR workstation:

Access to a Spark (1.3+) cluster
Spark-CSV package from Databricks
Python 2.6+
Pandas
NumPy

It should be noted that the Spark-CSV package needs to be built from source (Scala) and for that there are a few other requirements that are out of the scope of this blog post, as is setting up a Spark cluster.