A question on a certification practical exam I recently took got me thinking. The problem was presented as "find the specified text in the supplied disk image". However, the text turned out to be visible in a JPEG file nested inside a Word document. Once I'd found the text, the question was essentially answered, but I then started thinking about extraction options and the origins of that JPEG file.
I recalled a tool I'd recently discovered thanks to traffic on the GCFA mailing list: hachoir-subfile. The original email was about using this tool to extract executable objects from PPS files, but it works equally well for extracting .jpg files. I had always assumed that image files were somehow re-encoded when they were incorporated into MS Office documents, but this turns out not to be the case.
In fact, once a JPEG file has been extracted from its encapsulating Office document, you can recover any metadata that was present in the original picture file, at least for the Office 2003 format (I tested it with Word 2003) and presumably earlier ones. It also works on PDFs: I verified this by creating a PDF with Adobe Acrobat and then extracting the embedded JPEG using hachoir-subfile, finding all EXIF data intact. My testing with the new docx format of Office 2007 shows that much of the EXIF information in a JPEG file is apparently removed when the image is incorporated into the document. In any case, you don't need hachoir to extract from that format, because a docx file is a zip archive: unzipping it yields the component files in a hierarchy of folders.
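Since a docx file is just a zip archive, the unzip step can also be scripted. The following is a minimal sketch, assuming the embedded pictures sit under the conventional word/media/ folder inside the archive; the function name extract_docx_media is my own invention for illustration.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, out_dir):
    """Copy every file under word/media/ out of a .docx archive.

    Assumption: embedded pictures live under word/media/, which is the
    usual layout for Word 2007 documents.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for info in archive.infolist():
            # Skip folders and anything outside the media folder.
            if info.is_dir() or not info.filename.startswith("word/media/"):
                continue
            # Keep only the base name so nothing is written outside out_dir.
            target = out / Path(info.filename).name
            target.write_bytes(archive.read(info))
            written.append(target)
    return written
```

Calling extract_docx_media("report.docx", "extracted") would return the list of files written, ready to be handed to a metadata tool.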
I like using exiftool (a precompiled Windows binary is available for download) to extract metadata from JPEG files, but the same data, though somewhat less of it, can be extracted using another member of the hachoir family, hachoir-metadata.
The basic Hachoir application is actually a Python library, so using it requires Python to be installed on your forensic workstation. I already use Cygwin (a UNIX application porting overlay for Windows) for various tasks, so the Python package available for it works fine for me. However, various other Python ports are available from www.python.org if your needs differ from mine.
To get it working, I simply downloaded the various hachoir packages:
- hachoir-core-1.2.1.tar.gz
- hachoir-parser-1.2.1.tar.gz
- hachoir-regex-1.0.3.tar.gz
- hachoir-subfile-0.5.3.tar.gz
- hachoir-wx-0.3.tar.gz
- hachoir-metadata-1.2.1.tar.gz
- hachoir-urwid-1.1.tar.gz
I extracted each of these archives ("tar xvzf filename", or else use a Windows archive program such as 7-Zip, which understands tar and gzip), and then ran its included setup.py Python installation script (the command is just "python setup.py install"). Some of these packages depend on others, so you need to install them in the correct order. If you try to install one that needs another which hasn't yet been installed, it will complain, and you'll just have to go back and install the prerequisite before retrying.
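The install-and-retry routine can be automated rather than done by hand. This is only a sketch of the idea: run_setup and install_all are names I made up, and it assumes that "python" on your PATH is the interpreter you want hachoir installed into and that a missing prerequisite shows up as a nonzero exit status from setup.py.

```python
import subprocess


def run_setup(package_dir):
    """Run "python setup.py install" inside one extracted package folder.

    Assumption: a failed install (for example a missing prerequisite)
    is reported through a nonzero exit status.
    """
    return subprocess.call(["python", "setup.py", "install"], cwd=package_dir) == 0


def install_all(package_dirs, install=run_setup):
    """Install every package, retrying failures after others succeed."""
    pending = list(package_dirs)
    installed = []
    while pending:
        failed = []
        for package_dir in pending:
            if install(package_dir):
                installed.append(package_dir)
            else:
                failed.append(package_dir)
        if len(failed) == len(pending):
            # A whole pass made no progress, so retrying will not help.
            raise RuntimeError("could not install: " + ", ".join(failed))
        pending = failed
    return installed
```

The point of the retry loop is that you don't need to know the dependency order in advance: anything that fails on one pass gets another chance once its prerequisites are in place.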
Once all of the packages have been installed, extracting any included subfiles from a Word document or PDF becomes a simple matter of typing "hachoir-subfile document_file_name target_folder_name". This will drop all extracted objects into files named file-####.ext (numbered, and with appropriate extensions) in the specified folder.
Now that the subfiles are extracted, you can run exiftool or hachoir-metadata against them, and examine their metadata to your heart's content.
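If you have a pile of documents to process, the same command can be looped over them. Here is a rough sketch; subfile_command and extract_all are hypothetical helper names, and it assumes hachoir-subfile is on your PATH and directly executable, and that it is happy to write into a folder that already exists.

```python
import os
import subprocess


def subfile_command(document, target_folder):
    """Build the hachoir-subfile command line described above."""
    return ["hachoir-subfile", document, target_folder]


def extract_all(documents, output_root):
    """Run hachoir-subfile once per document, each into its own folder."""
    results = {}
    for document in documents:
        # One folder per document so the file-#### names cannot collide.
        target = os.path.join(output_root, os.path.basename(document) + ".subfiles")
        if not os.path.isdir(target):
            # Assumption: creating the folder up front does no harm.
            os.makedirs(target)
        results[document] = subprocess.call(subfile_command(document, target))
    return results
```

The returned dictionary maps each document to the exit status of its run, so failures are easy to spot afterwards.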
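For anything beyond a handful of files, it can be handy to pull the metadata into a script. The sketch below leans on exiftool's -json option, which prints one JSON object per file with a SourceFile key; parse_exiftool_json and read_metadata are names I invented, and exiftool is assumed to be on your PATH.

```python
import json
import subprocess


def parse_exiftool_json(text):
    """Map each SourceFile to its dictionary of tags."""
    return {entry["SourceFile"]: entry for entry in json.loads(text)}


def read_metadata(paths):
    """Run exiftool once over several files and return the parsed tags.

    Raises subprocess.CalledProcessError if exiftool exits with an error.
    """
    output = subprocess.check_output(["exiftool", "-json"] + list(paths))
    return parse_exiftool_json(output.decode("utf-8"))
```

Feeding read_metadata the list of extracted files gives you one dictionary per file, which is easy to filter for whichever tags you care about.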
As always, please feel free to leave commentary if you liked this article or want to call me on the carpet for some inaccuracy.