homepage
Open menu Go one level top
  • Train and Certify
    • Get Started in Cyber
    • Courses & Certifications
    • Training Roadmap
    • Search For Training
    • Online Training
    • OnDemand
    • Live Training
    • Summits
    • Cyber Ranges
    • College Degrees & Certificates
    • NICE Framework
    • DoDD 8140
    • Specials
  • Manage Your Team
    • Overview
    • Security Awareness Training
    • Voucher Program
    • Private Training
    • Workforce Development
    • Skill Assessments
    • Hiring Opportunities
  • Resources
    • Overview
    • Reading Room
    • Webcasts
    • Newsletters
    • Blog
    • Tip of The Day
    • Posters
    • Top 25 Programming Errors
    • The Critical Security Controls
    • Security Policy Project
    • Critical Vulnerability Recaps
    • Affiliate Directory
  • Focus Areas
    • Blue Team Operations
    • Cloud Security
    • Digital Forensics & Incident Response
    • Industrial Control Systems
    • Leadership
    • Offensive Operations
  • Get Involved
    • Overview
    • SANS Community
    • CyberTalent
    • Work Study
    • Instructor Development
    • Sponsorship Opportunities
    • COINS
  • About
    • About SANS
    • Why SANS?
    • Instructors
    • Cybersecurity Innovation Awards
    • Contact
    • Frequently Asked Questions
    • Customer Reviews
    • Press Room
  • Log In
  • Join
  • Contact Us
  • SANS Sites
    • GIAC Security Certifications
    • Internet Storm Center
    • SANS Technology Institute
    • Security Awareness Training
  • Search
  1. Home >
  2. Blog >
  3. Data reduction redux and map-reduce
Dave Hull

Data reduction redux and map-reduce

April 26, 2011

A few days ago I wrote a post about applying the principle of least frequent occurrence to string searches in forensics. This post will discuss how long that process may take and at the end, will show some significant ways to speed up the process.

In the previous post I used the following compound command line statement to create a list of strings ordered from most frequent occurring to least frequent:

cat sda1.dd.asc | awk '{$1=""; print}' | sort | uniq -c | sort -gr > sda1.dd.asc.lfo

One thing worth noting is that the sort command is invoked twice. As you can imagine, sorting the strings from a large disk image takes time, to say nothing of doing it twice, though the second sort is much faster as it is initially sorting on the count value that uniq -c tacks onto the front of each line of text.

For example, I've got an ASCII strings file that is approximately 19GB. Below I've split out the command above and used time to measure the time required to complete each step in the process on my i7 quad-core, 64-bit Ubuntu VM with 7GB of RAM:

time cat sda1.dd.asc | awk '{$1=""; print}' | sort > sda1.dd.asc.sorted
real126m57.159s
user112m59.670s
sys10m51.150s

Running uniq against that file, again with time:

time uniq -c sda1.dd.asc.sorted > sda1.dd.asc.sorted.uniq

real6m51.088s
user2m4.820s
sys4m45.170s

My 19GB strings file has been reduced to 2.9GB by removing duplicate lines of text. We can do better.

I can further reduce it by removing the garbage strings using the technique outlined in the previous post; namely grep and a large dictionary file. (Remember to clean up your dictionary file, no blank lines, no short strings).

time grep -iFf /usr/share/dict/american_english sda1.dd.asc.lfo > sda1.dd.asc.lfo.words

real4m9.105s
user3m56.350s
sys0m12.250s

Now my 2.9GB strings file is 574MB.

Now we're getting somewhere, albeit slowly — not to mention the time it will take to review 574MB of text.

How can we speed this up? My initial thought was to try creating a ramdisk to see if that made things any faster. I tried that using a smaller file of course. Much to my surprise, the difference between using a ramdisk and not using one was hardly noticeable. I mentioned this on Twitter and strcpy stepped into the conversation. He's a brilliant developer whom I've talked to a few times online over the last year or so.

The next day, he sent me the following tweet:

strcpy_lunch.jpg

Figure 1: I owe strcpy lunch

I grabbed a copy of strcpy's script and looked it over. It was brilliant and simple and I was disappointed that I hadn't thought of it.

In a nutshell, the script counts of the number of processors in the system then divides the data set to be sorted by the number of processors minus one, then assigns each processor (minus one) with the task of sorting a chunk of the data set. It also tells the sort command how much of the system's memory it can safely use for buffering, by default sort will use 1MB, which can result in the creation of lots of temporary files (more than 1000 for my 19GB file). strcpy's script specifies 30% of RAM can be used by sort, reducing the number of temporary files used. Lastly the script merges the three sorted files back into one file.

Screenshot-System-Monitor-1.jpg

Figure 2: System Monitor: single processor sorting

Notice that only one processor is being used for the heavy lifting of sorting the data set.

Screenshot-System-Monitor-3.jpg

Figure 3: System Monitor: Three processors pegged

Notice that three of the four system processors are pegged, or nearly so and the amount of system memory in use has dramatically increased. Memory utilisation is configurable from within strcpy's script.

How much faster was it to run the sort this way than the previous way? Here's the output from time:

time ~/ellzey_sort.sh sda1.asc 19717183333 6572394445.0 Splitting sda1.asc into 3 pieces. sda1.asc.0016572394445 bytes sda1.asc.0026572394445 bytes sda1.asc.0036572394443 bytes Done! Running sort on sda1.asc.001 on cpu 0x00000001 Running sort on sda1.asc.002 on cpu 0x00000002 Running sort on sda1.asc.003 on cpu 0x00000004 Finalizing sort into sda1.asc.sorted Removing temporary files

real72m28.051s
user52m32.580s
sys75m5.760s
codeslack.jpg

strcpy had said as much to me, "... you've been introduced to mapreduce 101."

One day in the not too distant future we'll be able to distribute these CPU intensive processes across dozens of cores.

Check out strcpy's script. It's easy to understand and can be easily modified. One thing worth noting, in my testing, the script rounded down to the nearest byte when splitting files, resulting in a loss of one byte (i.e. a 100 byte file split in three parts should be 33.33333 bytes, strcpy's script rounds this down to 33 bytes, so you'll end up losing a byte of data). I modified line 62 of the script and notifed strcpy of my change, here's what I ended up with:

... split_size=`echo "scale=10; $file_size / $proc_count" | bc | python -c "import math; print math.ceil(float(raw_input()))"` ...

This change fixes the rounding issue.

Go forth and sort across all cores.

Dave Hull is a wanna be when compared to the likes of strcpy and codeslack, but he's learning. He's an incident responder and digital forensics practitioner with Trusted Signal. He'll be teaching SANS Forensics 508: Advanced Computer Forensic Analysis and Incident Response in Baltimore, MD, May 15 - 20, 2011. If you'd like to learn powerful forensics techniques like these, sign up for the course.

Share:
TwitterLinkedInFacebook
Copy url Url was copied to clipboard
Subscribe to SANS Newsletters
Join the SANS Community to receive the latest curated cybersecurity news, vulnerabilities, and mitigations, training opportunities, plus our webcast schedule.
United States
Canada
United Kingdom
Spain
Belgium
Denmark
Norway
Netherlands
Australia
India
Japan
Singapore
Afghanistan
Aland Islands
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belize
Benin
Bermuda
Bhutan
Bolivia
Bonaire, Sint Eustatius, and Saba
Bosnia And Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Cape Verde
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Cocos (Keeling) Islands
Colombia
Comoros
Cook Islands
Costa Rica
Croatia (Local Name: Hrvatska)
Curacao
Cyprus
Czech Republic
Democratic Republic of the Congo
Djibouti
Dominica
Dominican Republic
East Timor
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Falkland Islands (Malvinas)
Faroe Islands
Fiji
Finland
France
French Guiana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guernsey
Guinea
Guinea-Bissau
Guyana
Haiti
Heard And McDonald Islands
Honduras
Hong Kong
Hungary
Iceland
Indonesia
Iraq
Ireland
Isle of Man
Israel
Italy
Jamaica
Jersey
Jordan
Kazakhstan
Kenya
Kingdom of Saudi Arabia
Kiribati
Korea, Republic Of
Kosovo
Kuwait
Kyrgyzstan
Lao People's Democratic Republic
Latvia
Lebanon
Lesotho
Liberia
Liechtenstein
Lithuania
Luxembourg
Macau
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Micronesia, Federated States Of
Moldova, Republic Of
Monaco
Mongolia
Montenegro
Montserrat
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands Antilles
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
Northern Mariana Islands
Oman
Pakistan
Palau
Palestine
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Pitcairn
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russian Federation
Rwanda
Saint Bartholemy
Saint Kitts And Nevis
Saint Lucia
Saint Martin
Saint Vincent And The Grenadines
Samoa
San Marino
Sao Tome And Principe
Senegal
Serbia
Seychelles
Sierra Leone
Sint Maarten
Slovakia (Slovak Republic)
Slovenia
Solomon Islands
South Africa
South Georgia and the South Sandwich Islands
South Sudan
Sri Lanka
St. Helena
St. Pierre And Miquelon
Suriname
Svalbard And Jan Mayen Islands
Swaziland
Sweden
Switzerland
Taiwan
Tajikistan
Tanzania
Thailand
Togo
Tokelau
Tonga
Trinidad And Tobago
Tunisia
Turkey
Turkmenistan
Turks And Caicos Islands
Tuvalu
Uganda
Ukraine
United Arab Emirates
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
Vatican City
Venezuela
Vietnam
Virgin Islands (British)
Virgin Islands (U.S.)
Wallis And Futuna Islands
Western Sahara
Yemen
Yugoslavia
Zambia
Zimbabwe

Tags:
  • Digital Forensics and Incident Response

Related Content

Blog
Digital Forensics and Incident Response
January 21, 2021
Things the Community Said About Chris Krebs's CTI Keynote
Chris Krebs Fmr. Director, US Cybersecurity and Infrastructure Security Agency (CISA); Founder, Krebs Stamos Group was the keynote speaker
SANS Institute
read more
Blog
Digital Forensics and Incident Response
January 7, 2021
How You Can Start Learning Malware Analysis
Lenny Zeltser shares a roadmap for getting into malware analysis, with pointers to 10 hours of free recorded content and additional references.
370x370_Lenny-Zeltser.jpg
Lenny Zeltser
read more
Blog
Digital Forensics and Incident Response
September 26, 2019
The State of Malware Analysis: Advice from the Trenches
What malware analysis approaches work well? Which don’t? How are the tools and methodologies evolving? The following discussion–captured as an MP3 audio file–offers friendly advice from 5 malware analysts. These are some of the practitioners who teach the reverse-engineering malware course...
370x370_Lenny-Zeltser.jpg
Lenny Zeltser
read more
  • Register to Learn
  • Courses
  • Certifications
  • Degree Programs
  • Cyber Ranges
  • Job Tools
  • Security Policy Project
  • Posters
  • The Critical Security Controls
  • Focus Areas
  • Blue Team Operations
  • Cloud Security
  • Cybersecurity Leadership
  • Digital Forensics
  • Industrial Control Systems
  • Offensive Operations
Subscribe to SANS Newsletters
Join the SANS Community to receive the latest curated cybersecurity news, vulnerabilities, and mitigations, training opportunities, plus our webcast schedule.
United States
Canada
United Kingdom
Spain
Belgium
Denmark
Norway
Netherlands
Australia
India
Japan
Singapore
Afghanistan
Aland Islands
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belize
Benin
Bermuda
Bhutan
Bolivia
Bonaire, Sint Eustatius, and Saba
Bosnia And Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Cape Verde
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Cocos (Keeling) Islands
Colombia
Comoros
Cook Islands
Costa Rica
Croatia (Local Name: Hrvatska)
Curacao
Cyprus
Czech Republic
Democratic Republic of the Congo
Djibouti
Dominica
Dominican Republic
East Timor
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Falkland Islands (Malvinas)
Faroe Islands
Fiji
Finland
France
French Guiana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guernsey
Guinea
Guinea-Bissau
Guyana
Haiti
Heard And McDonald Islands
Honduras
Hong Kong
Hungary
Iceland
Indonesia
Iraq
Ireland
Isle of Man
Israel
Italy
Jamaica
Jersey
Jordan
Kazakhstan
Kenya
Kingdom of Saudi Arabia
Kiribati
Korea, Republic Of
Kosovo
Kuwait
Kyrgyzstan
Lao People's Democratic Republic
Latvia
Lebanon
Lesotho
Liberia
Liechtenstein
Lithuania
Luxembourg
Macau
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Micronesia, Federated States Of
Moldova, Republic Of
Monaco
Mongolia
Montenegro
Montserrat
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands Antilles
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
Northern Mariana Islands
Oman
Pakistan
Palau
Palestine
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Pitcairn
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russian Federation
Rwanda
Saint Bartholemy
Saint Kitts And Nevis
Saint Lucia
Saint Martin
Saint Vincent And The Grenadines
Samoa
San Marino
Sao Tome And Principe
Senegal
Serbia
Seychelles
Sierra Leone
Sint Maarten
Slovakia (Slovak Republic)
Slovenia
Solomon Islands
South Africa
South Georgia and the South Sandwich Islands
South Sudan
Sri Lanka
St. Helena
St. Pierre And Miquelon
Suriname
Svalbard And Jan Mayen Islands
Swaziland
Sweden
Switzerland
Taiwan
Tajikistan
Tanzania
Thailand
Togo
Tokelau
Tonga
Trinidad And Tobago
Tunisia
Turkey
Turkmenistan
Turks And Caicos Islands
Tuvalu
Uganda
Ukraine
United Arab Emirates
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
Vatican City
Venezuela
Vietnam
Virgin Islands (British)
Virgin Islands (U.S.)
Wallis And Futuna Islands
Western Sahara
Yemen
Yugoslavia
Zambia
Zimbabwe
  • © 2021 SANS™ Institute
  • Privacy Policy
  • Contact
  • Twitter
  • Facebook
  • Youtube
  • LinkedIn