Hash filtering is a time-saving technique for a computer forensics examiner when working on a huge disk image. In a nutshell, this technique can filter out all those files in your image that belong to the operating system or well-known software packages. This will let the examiner focus on unknown files, reducing the scope of the investigation. After all, there's no point in spending time checking files we already know.
This filtering operation is based on hashes. Usually, we calculate the hash for every file in the image and check it against a list of hashes previously calculated over known good files. We call this list the known good hash set. All files with hashes matching the list are filtered out.
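To make the idea concrete, here is a minimal Perl sketch of known good filtering. It assumes the known good hash set has already been loaded into a plain Perl hash keyed by lowercase MD5 strings. The names `md5_of_file` and `filter_unknown` are invented for this illustration, and MD5 is used only as an example digest.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the lowercase hex MD5 of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    binmode($fh);    # hash the raw bytes, no newline translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Keep only the files whose hash is NOT in the known good set.
# $known_good is a hash ref of the form { md5_hex => 1, ... }
# (a hypothetical layout chosen for this sketch).
sub filter_unknown {
    my ($known_good, @paths) = @_;
    return grep { !$known_good->{ md5_of_file($_) } } @paths;
}
```

Calling `filter_unknown(\%known_good, @files)` returns the files that still need a human look.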
On the other hand, we would like to know if there are malicious files in our computer forensics case image. Again, the technique works by calculating the hash for every file in the image, looking for matches in a list containing pre-calculated hashes for known malicious files, viruses, cracker's tools, or anything you judge to be a malicious file. We call this list the known bad hash set and we want to be alerted when matches occur.
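Putting the two lists together, the triage of a single hash can be sketched as follows. This is only an illustration: `classify_hash` and the labels it returns are made up here, and checking the known bad set first is a design choice of this sketch, so that a hash present in both lists still raises an alert.

```perl
use strict;
use warnings;

# Classify one hash against both sets.
# $known_good and $known_bad are hash refs keyed by lowercase MD5
# (a hypothetical layout chosen for this sketch).
sub classify_hash {
    my ($md5, $known_good, $known_bad) = @_;
    $md5 = lc $md5;                              # NSRL prints hashes in uppercase
    return 'alert'  if $known_bad->{$md5};       # known bad: ring the bells
    return 'ignore' if $known_good->{$md5};      # known good: filter out
    return 'review';                             # unknown: examiner must look
}
```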
It's not an easy task to keep such hash sets, and they need to be huge in order to be effective. Thankfully, others are collecting files and calculating hashes for us. The National Institute of Standards and Technology (NIST) maintains the National Software Reference Library, or NSRL, which is one of the best hash set libraries available. It's public and free!
Unfortunately, life is not a bed of roses.
Practically all tools that use hash sets for filtering have a way to say "this is my known good hash set, ignore everything found here" and "this is my known bad hash set, ring all bells when something matches here". The Sleuth Kit's sorter tool does that using -x (for known good) and -a (for known bad). However, the NSRL hash set contains both good and bad files. If we use it as known good, there's a risk of ignoring malicious files in the image. If we use it as known bad, we will have thousands of false positives. What to do?
Doug White, from NIST, has given me good advice.
The NSRL file that correlates hashes and file names is NSRLFile.txt, while NSRLProd.txt classifies the products those files belong to. The known bad files belong to products classified as "Hacker Tool", so we can separate them. You can use MS Log Parser, AWK or any programming language. I prefer Perl and here is the code:
#!/usr/bin/perl -w
# Extracts known good and known bad hashsets from NSRL
# usage: nsrlext.pl -n <nsrl files comma separated> -p <nsrl prod files comma separated> -g <known good txt> -b <known bad txt> [-h]
#
# -n :nsrl files comma separated. Ex: -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt
# -p :nsrl prod files comma separated. Ex: -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt
# -g :known good txt filename. Ex: -g good.txt
# -b :known bad txt filename. Ex: -b bad.txt
# -h :help
#
use Getopt::Std;

my $ver = "0.1";

# options
my %args = ();
getopts("hn:p:g:b:", \%args);

# help
if ($args{h}) {
    &cabecalho;
    # single-quoted heredoc, so the backslashes in the example paths print literally
    print <<'DETALHE';
usage: nsrlext.pl -n nsrl_files_comma_separated -p nsrl_prod_files_comma_separated [-g known_good_txt] [-b known_bad_txt] [-h]

-n :nsrl files comma separated. Ex: -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt
-p :nsrl prod files comma separated. Ex: -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt
-g :known good txt filename. Ex: -g good.txt
-b :known bad txt filename. Ex: -b bad.txt
-h :help

DETALHE
    exit;
}

die "Enter the NSRL hashset file list (comma delimited)\n" unless ($args{n});
die "Enter the NSRL product file list (comma delimited)\n" unless ($args{p});
die "Enter known good and/or known bad output filenames\n" unless (($args{g}) || ($args{b}));

my %hack;

&cabecalho;

# Prod files
my @prod = split(/,/, $args{p});
foreach my $item (@prod) {
    open(PRODUCT, "<", $item) or die "Cannot open $item: $!\n";
    while (<PRODUCT>) {
        chomp;
        # naive split: a comma inside a quoted field would shift the columns
        my @line = split(/,/, $_);
        # create a hash of hacker tool codes
        $hack{$line[0]} = $item if (defined $line[6] && $line[6] =~ /Hacker Tool/);
    }
    close(PRODUCT);
}

# hashset files
my @hset = split(/,/, $args{n});
if ($args{b}) { open(BAD,  ">", $args{b}) or die "Cannot write $args{b}: $!\n"; }
if ($args{g}) { open(GOOD, ">", $args{g}) or die "Cannot write $args{g}: $!\n"; }

my $i = 0;
foreach my $item (@hset) {
    open(NSRL, "<", $item) or die "Cannot open $item: $!\n";
    while (<NSRL>) {
        # stdout feedback
        print ">" if (($i % 10000) == 0);
        my @line = split(/,/, $_);
        if (defined $line[5] && $hack{$line[5]}) {
            # is a hacker tool
            print BAD $_ if ($args{b});
        } else {
            print GOOD $_ if ($args{g});
        }
        $i++;
    }
    close(NSRL);
}

print "\nDone !\n";
close(BAD)  if ($args{b});
close(GOOD) if ($args{g});

### Subroutines ####
sub cabecalho {
    print <<CABEC;

nsrlext.pl v$ver
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
--------------------------------------------------------------------------

CABEC
}
#-----EOF-------
Usage:
nsrlext.pl -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt -b NSRLBad.txt -g NSRLGood.txt
This script runs on both Windows and Linux; it just requires Perl.
After this, we can use both hash sets in Autopsy, TSK Sorter or even with "md5deep/sha1deep".
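Some tools are happier with a plain list of one hash per line than with the NSRL comma-separated layout. The sketch below shows one way to derive such a list from the good or bad file. It assumes the MD5 is the second field of each line (check the header line of your own NSRL release), and it uses the core Text::ParseWords module instead of a bare split so that the surrounding quotes are removed. The names `md5_from_nsrl_line` and `write_md5_list` are invented for this example.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Pull the MD5 out of one NSRLFile.txt-style line.
# Assumes MD5 is the second field; returns undef for the header
# line or for anything that does not look like an MD5.
sub md5_from_nsrl_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # drop the line ending
    my @field = parse_line(',', 0, $line);       # 0 = strip the quotes
    return undef unless defined $field[1] && $field[1] =~ /\A[0-9A-Fa-f]{32}\z/;
    return lc $field[1];
}

# Read NSRL-style lines from $in and write one lowercase MD5 per
# line to $out (both are open filehandles). Returns the number of
# hashes written.
sub write_md5_list {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        my $md5 = md5_from_nsrl_line($line);
        next unless defined $md5;
        print {$out} "$md5\n";
        $count++;
    }
    return $count;
}
```

Opening NSRLBad.txt for reading and a new file for writing, then passing both handles to `write_md5_list`, yields a list that can be fed to tools expecting bare hashes.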
Tony Rodrigues has over 20 years of IT experience and 7 years in Information Security management. He currently holds CISSP, CFCP and Security+ certifications and has been in charge of several corporate digital investigations in Brazil. He loves CAINE Live CD and writes about Computer Forensics/Incident Response for forcomp.blogspot.com.