Recently I was asked to recover images from a suspect machine. Numerous tools have the ability to categorize files based on type. Students of SANS 508 get a look under the hood at how this is done using the "magic numbers" found at or near the start of files with well-known formats. Fortunately, most of the files we deal with have reliable file headers.
However, because these tools rely on magic numbers, there are countless ways files of a given type can be obfuscated and therefore go undetected. I encountered this on a recent case, not because the suspect was particularly savvy, but because of the instant messaging client the suspect used to send and receive images. The client software maintained log files containing portions of conversations, including image file transfers.
The images weren't found for two reasons. First, they were embedded in the log files, so the magic number sequences were not near the beginning of the files. Second, the images were base64 encoded so that they could be transferred as ASCII text via the chat client. The good news is that base64 is deterministic: a given byte string always produces the same encoding. Fixed-format files for common image types have well-defined headers, and those headers reliably produce the same base64 encoded values.
I only had to figure out what these magic numbers looked like when base64 encoded. On my SIFT system I ran the following command:
for i in $(locate gif | grep "gif$"); do base64 "$i" | head -c4; echo " (gif)"; done | sort | uniq -c
and received the following result:
   1 iVBO (gif)
1173 R0lG (gif)
Let's go over this command and what it does. It uses the "for i in" construct to loop over a list of values. In this case, the list is made up of the results returned by $(locate gif | grep "gif$"). This compound command returns files on the system whose names end in gif and are assumed to be gif images (whether or not this is a valid assumption is beyond the scope of this post). Each file in the list is base64 encoded, and that encoding is passed to head -c4, which prints the first four characters of the encoded image. The echo command then appends the " (gif)" label and a newline, and the loop moves on to the next file. Once the loop finishes, all of the results are sorted and passed to uniq -c, which prints a count of each encoding variation.
The whole statement allows me to quickly see the most common base64 encodings for the given file type's magic number.
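A variant of the loop that tolerates file names containing spaces and avoids encoding whole files might look like the following. This is a sketch under assumptions rather than the command used on the case: the function name count_prefixes is hypothetical, find is swapped in for locate so that a specific directory can be scanned, and only the first three bytes of each file are encoded, since three bytes map to exactly four base64 characters.

```shell
# count_prefixes DIR EXT (hypothetical helper, not from the original post)
# Tally the first four base64 characters of every file under DIR whose
# name ends in .EXT (case-insensitive).
count_prefixes() {
    find "$1" -type f -iname "*.$2" -exec sh -c '
        ext=$1; shift
        for f in "$@"; do
            # Three input bytes encode to exactly four base64 characters.
            printf "%s (%s)\n" "$(head -c3 "$f" | base64)" "$ext"
        done' sh "$2" {} + | sort | uniq -c
}

# Example invocation (the path is an assumption):
#   count_prefixes /usr/share gif
```

Files shorter than three bytes will show base64 padding characters in the tally, which makes them easy to spot and discard.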
I repeated this command substituting "jpg" for "gif", then "png" and so on. The results came back as follows:
704 /9j/ (jpg)
0003 AQAA (jpg)
7 f0VM (png)
22700 iVBO (png)
1 R0lG (png)
8 UE5H (png)
...
You can see that "/9j/" is a common base64 encoding for both jpg and png images and that "iVBO" is the most common base64 encoded magic number for png files.
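As a cross-check, the same prefixes can be derived directly from the documented magic bytes rather than from files on disk. The sketch below is an illustration, not part of the original workflow. The helper name b64_prefix is hypothetical, and it assumes a base64 utility that reads standard input, as the GNU coreutils version does. Non-printable bytes are written as octal escapes so that printf expands them in a plain POSIX shell.

```shell
# b64_prefix BYTES (hypothetical helper)
# Print the first four base64 characters of a byte sequence.
# BYTES is used as a printf format string, so octal escapes such as
# \377 are expanded; it must not contain a % character.
b64_prefix() {
    printf "$1" | base64 | head -c4
    echo
}

b64_prefix 'GIF8'          # GIF files begin with "GIF87a" or "GIF89a"
b64_prefix '\377\330\377'  # JPEG files begin with bytes FF D8 FF
b64_prefix '\211PNG'       # PNG files begin with byte 89 followed by "PNG"
```

The three calls print R0lG, /9j/ and iVBO (the last character is a capital letter O, not a zero).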
Armed with this information, I initiated string searches for these common encodings. Lo and behold, I had a number of hits that, when decoded using "base64 -d", resulted in the recovery of additional images that had an impact on the case. If you're not familiar with base64, you may want to have a look at RFC 4648, which defines the base64 standard and obsoletes the older RFC 3548. Below is a base64 encoded bmp image. Picking the encodings out of byte streams is not too difficult, though doing it manually certainly won't scale.
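Because manual picking won't scale, the search-and-decode step could be automated roughly as follows. This is a minimal sketch, not the commands used on the case. The function name extract_b64_images and the carved_N.bin output names are hypothetical. It assumes each encoded image sits on a single line with no line wrapping, that the run of base64 characters is bounded by characters outside the base64 alphabet, and that GNU grep and base64 are available.

```shell
# extract_b64_images LOGFILE OUTDIR (hypothetical helper)
# Find base64 runs in LOGFILE that start with a known image prefix and
# decode each one into OUTDIR/carved_N.bin.
extract_b64_images() {
    mkdir -p "$2"
    n=0
    # -a: treat the log as text, -o: print only the matching run.
    # Prefixes: /9j/ (jpg), iVBO (png), R0lG (gif).
    grep -aoE '(/9j/|iVBO|R0lG)[A-Za-z0-9+/]{16,}={0,2}' "$1" |
    while IFS= read -r blob; do
        n=$((n + 1))
        printf '%s' "$blob" | base64 -d > "$2/carved_$n.bin" 2>/dev/null
    done
}

# Example invocation (both paths are assumptions):
#   extract_b64_images chat.log ./carved
```

Hits are only candidates: a four-character prefix can occur by chance inside unrelated data, so each carved file should still be checked, for example with the file command, before it is treated as an image.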
Next time you're faced with searching a drive, consider the elements of the case. Are there applications involved that may store evidence in an encoded form such as base64? If so, would searching for those encoded variants affect the outcome of your case?
Update: I put together a text file containing a more complete set of base64 encoded magic byte sequences for common image types. The file is hosted at http://trustedsignal.com/forensics/b64_enc_img_types.txt.
Dave Hull is a forensic analyst at Trusted Signal and a Community Instructor with the SANS Institute. He'll be teaching Forensics 408: Computer Forensics Essentials in Boston, MA from Feb. 28 - Mar. 4.