homepage
Open menu
Go one level top
  • Train and Certify
    Train and Certify

    Immediately apply the skills and techniques learned in SANS courses, ranges, and summits

    • Overview
    • Courses
      • Overview
      • Full Course List
      • By Focus Areas
        • Cloud Security
        • Cyber Defense
        • Cybersecurity and IT Essentials
        • DFIR
        • Industrial Control Systems
        • Offensive Operations
        • Management, Legal, and Audit
      • By Skill Levels
        • New to Cyber
        • Essentials
        • Advanced
        • Expert
      • Training Formats
        • OnDemand
        • In-Person
        • Live Online
      • Course Demos
    • Training Roadmaps
      • Skills Roadmap
      • Focus Area Job Roles
        • Cyber Defence Job Roles
        • Offensive Operations Job Roles
        • DFIR Job Roles
        • Cloud Job Roles
        • ICS Job Roles
        • Leadership Job Roles
      • NICE Framework
        • Security Provisionals
        • Operate and Maintain
        • Oversee and Govern
        • Protect and Defend
        • Analyze
        • Collect and Operate
        • Investigate
        • Industrial Control Systems
      • European Skills Framework
    • GIAC Certifications
    • Training Events & Summits
      • Events Overview
      • Event Locations
        • Asia
        • Australia & New Zealand
        • Latin America
        • Mainland Europe
        • Middle East & Africa
        • Scandinavia
        • United Kingdom & Ireland
        • United States & Canada
      • Summits
    • OnDemand
    • Get Started in Cyber
      • Overview
      • Degree and Certificate Programs
      • Scholarships
    • Cyber Ranges
  • Manage Your Team
    Manage Your Team

    Build a world-class cyber team with our workforce development programs

    • Overview
    • Why Work with SANS
    • Group Purchasing
    • Build Your Team
      • Team Development
      • Assessments
      • Private Training
      • Hire Cyber Professionals
      • By Industry
        • Health Care
        • Industrial Control Systems Security
        • Military
    • Leadership Training
  • Security Awareness
    Security Awareness

    Increase your staff’s cyber awareness, help them change their behaviors, and reduce your organizational risk

    • Overview
    • Products & Services
      • Security Awareness Training
        • EndUser Training
        • Phishing Platform
      • Specialized
        • Developer Training
        • ICS Engineer Training
        • NERC CIP Training
        • IT Administrator
      • Risk Assessments
        • Knowledge Assessment
        • Culture Assessment
        • Behavioral Risk Assessment
    • OUCH! Newsletter
    • Career Development
      • Overview
      • Training & Courses
      • Professional Credential
    • Blog
    • Partners
    • Reports & Case Studies
  • Resources
    Resources

    Enhance your skills with access to thousands of free resources, 150+ instructor-developed tools, and the latest cybersecurity news and analysis

    • Overview
    • Webcasts
    • Free Cybersecurity Events
      • Free Events Overview
      • Summits
      • Solutions Forums
      • Community Nights
    • Content
      • Newsletters
        • NewsBites
        • @RISK
        • OUCH! Newsletter
      • Blog
      • Podcasts
      • Summit Presentations
      • Posters & Cheat Sheets
    • Research
      • White Papers
      • Security Policies
    • Tools
    • Focus Areas
      • Cyber Defense
      • Cloud Security
      • Digital Forensics & Incident Response
      • Industrial Control Systems
      • Cyber Security Leadership
      • Offensive Operations
  • Get Involved
    Get Involved

    Help keep the cyber community one step ahead of threats. Join the SANS community or begin your journey of becoming a SANS Certified Instructor today.

    • Overview
    • Join the Community
    • Work Study
    • Teach for SANS
    • CISO Network
    • Partnerships
    • Sponsorship Opportunities
  • About
    About

    Learn more about how SANS empowers and educates current and future cybersecurity practitioners with knowledge and skills

    • SANS
      • Overview
      • Our Founder
      • Awards
    • Instructors
      • Our Instructors
      • Full Instructor List
    • Mission
      • Our Mission
      • Diversity
      • Scholarships
    • Contact
      • Contact Customer Service
      • Contact Sales
      • Press & Media Enquiries
    • Frequent Asked Questions
    • Customer Reviews
    • Press
    • Careers
  • Contact Sales
  • SANS Sites
    • GIAC Security Certifications
    • Internet Storm Center
    • SANS Technology Institute
    • Security Awareness Training
  • Search
  • Log In
  • Join
    • Account Dashboard
    • Log Out
  1. Home >
  2. Blog >
  3. Data reduction redux and map-reduce
Dave Hull

Data reduction redux and map-reduce

April 26, 2011

A few days ago I wrote a post about applying the principle of least frequent occurrence to string searches in forensics. This post will discuss how long that process may take and at the end, will show some significant ways to speed up the process.

In the previous post I used the following compound command line statement to create a list of strings ordered from most frequent occurring to least frequent:

cat sda1.dd.asc | awk '{$1=""; print}' | sort | uniq -c | sort -gr > sda1.dd.asc.lfo

One thing worth noting is that the sort command is invoked twice. As you can imagine, sorting the strings from a large disk image takes time, to say nothing of doing it twice, though the second sort is much faster as it is initially sorting on the count value that uniq -c tacks onto the front of each line of text.

For example, I've got an ASCII strings file that is approximately 19GB. Below I've split out the command above and used time to measure the time required to complete each step in the process on my i7 quad-core, 64-bit Ubuntu VM with 7GB of RAM:

time cat sda1.dd.asc | awk '{$1=""; print}' | sort > sda1.dd.asc.sorted
real126m57.159s
user112m59.670s
sys10m51.150s

Running uniq against that file, again with time:

time uniq -c sda1.dd.asc.sorted > sda1.dd.asc.sorted.uniq

real6m51.088s
user2m4.820s
sys4m45.170s

My 19GB strings file has been reduced to 2.9GB by removing duplicate lines of text. We can do better.

I can further reduce it by removing the garbage strings using the technique outlined in the previous post; namely grep and a large dictionary file. (Remember to clean up your dictionary file, no blank lines, no short strings).

time grep -iFf /usr/share/dict/american_english sda1.dd.asc.lfo > sda1.dd.asc.lfo.words

real4m9.105s
user3m56.350s
sys0m12.250s

Now my 2.9GB strings file is 574MB.

Now we're getting somewhere, albeit slowly — not to mention the time it will take to review 574MB of text.

How can we speed this up? My initial thought was to try creating a ramdisk to see if that made things any faster. I tried that using a smaller file of course. Much to my surprise, the difference between using a ramdisk and not using one was hardly noticeable. I mentioned this on Twitter and strcpy stepped into the conversation. He's a brilliant developer whom I've talked to a few times online over the last year or so.

The next day, he sent me the following tweet:

strcpy_lunch.jpg

Figure 1: I owe strcpy lunch

I grabbed a copy of strcpy's script and looked it over. It was brilliant and simple and I was disappointed that I hadn't thought of it.

In a nutshell, the script counts of the number of processors in the system then divides the data set to be sorted by the number of processors minus one, then assigns each processor (minus one) with the task of sorting a chunk of the data set. It also tells the sort command how much of the system's memory it can safely use for buffering, by default sort will use 1MB, which can result in the creation of lots of temporary files (more than 1000 for my 19GB file). strcpy's script specifies 30% of RAM can be used by sort, reducing the number of temporary files used. Lastly the script merges the three sorted files back into one file.

Screenshot-System-Monitor-1.jpg

Figure 2: System Monitor: single processor sorting

Notice that only one processor is being used for the heavy lifting of sorting the data set.

Screenshot-System-Monitor-3.jpg

Figure 3: System Monitor: Three processors pegged

Notice that three of the four system processors are pegged, or nearly so and the amount of system memory in use has dramatically increased. Memory utilisation is configurable from within strcpy's script.

How much faster was it to run the sort this way than the previous way? Here's the output from time:

time ~/ellzey_sort.sh sda1.asc 19717183333 6572394445.0 Splitting sda1.asc into 3 pieces. sda1.asc.0016572394445 bytes sda1.asc.0026572394445 bytes sda1.asc.0036572394443 bytes Done! Running sort on sda1.asc.001 on cpu 0x00000001 Running sort on sda1.asc.002 on cpu 0x00000002 Running sort on sda1.asc.003 on cpu 0x00000004 Finalizing sort into sda1.asc.sorted Removing temporary files

real72m28.051s
user52m32.580s
sys75m5.760s
codeslack.jpg

strcpy had said as much to me, "... you've been introduced to mapreduce 101."

One day in the not too distant future we'll be able to distribute these CPU intensive processes across dozens of cores.

Check out strcpy's script. It's easy to understand and can be easily modified. One thing worth noting, in my testing, the script rounded down to the nearest byte when splitting files, resulting in a loss of one byte (i.e. a 100 byte file split in three parts should be 33.33333 bytes, strcpy's script rounds this down to 33 bytes, so you'll end up losing a byte of data). I modified line 62 of the script and notifed strcpy of my change, here's what I ended up with:

... split_size=`echo "scale=10; $file_size / $proc_count" | bc | python -c "import math; print math.ceil(float(raw_input()))"` ...

This change fixes the rounding issue.

Go forth and sort across all cores.

Dave Hull is a wanna be when compared to the likes of strcpy and codeslack, but he's learning. He's an incident responder and digital forensics practitioner with Trusted Signal. He'll be teaching SANS Forensics 508: Advanced Computer Forensic Analysis and Incident Response in Baltimore, MD, May 15 - 20, 2011. If you'd like to learn powerful forensics techniques like these, sign up for the course.

Share:
TwitterLinkedInFacebook
Copy url Url was copied to clipboard
Subscribe to SANS Newsletters
Receive curated news, vulnerabilities, & security awareness tips
United States
Canada
United Kingdom
Spain
Belgium
Denmark
Norway
Netherlands
Australia
India
Japan
Singapore
Afghanistan
Aland Islands
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belize
Benin
Bermuda
Bhutan
Bolivia
Bonaire, Sint Eustatius, and Saba
Bosnia And Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Cape Verde
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Cocos (Keeling) Islands
Colombia
Comoros
Cook Islands
Costa Rica
Croatia (Local Name: Hrvatska)
Curacao
Cyprus
Czech Republic
Democratic Republic of the Congo
Djibouti
Dominica
Dominican Republic
East Timor
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Falkland Islands (Malvinas)
Faroe Islands
Fiji
Finland
France
French Guiana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guernsey
Guinea
Guinea-Bissau
Guyana
Haiti
Heard And McDonald Islands
Honduras
Hong Kong
Hungary
Iceland
Indonesia
Iraq
Ireland
Isle of Man
Israel
Italy
Jamaica
Jersey
Jordan
Kazakhstan
Kenya
Kiribati
Korea, Republic Of
Kosovo
Kuwait
Kyrgyzstan
Lao People's Democratic Republic
Latvia
Lebanon
Lesotho
Liberia
Liechtenstein
Lithuania
Luxembourg
Macau
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Micronesia, Federated States Of
Moldova, Republic Of
Monaco
Mongolia
Montenegro
Montserrat
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands Antilles
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
Northern Mariana Islands
Oman
Pakistan
Palau
Palestine
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Pitcairn
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russian Federation
Rwanda
Saint Bartholemy
Saint Kitts And Nevis
Saint Lucia
Saint Martin
Saint Vincent And The Grenadines
Samoa
San Marino
Sao Tome And Principe
Saudi Arabia
Senegal
Serbia
Seychelles
Sierra Leone
Sint Maarten
Slovakia
Slovenia
Solomon Islands
South Africa
South Georgia and the South Sandwich Islands
South Sudan
Sri Lanka
St. Helena
St. Pierre And Miquelon
Suriname
Svalbard And Jan Mayen Islands
Swaziland
Sweden
Switzerland
Taiwan
Tajikistan
Tanzania
Thailand
Togo
Tokelau
Tonga
Trinidad And Tobago
Tunisia
Turkey
Turkmenistan
Turks And Caicos Islands
Tuvalu
Uganda
Ukraine
United Arab Emirates
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
Vatican City
Venezuela
Vietnam
Virgin Islands (British)
Virgin Islands (U.S.)
Wallis And Futuna Islands
Western Sahara
Yemen
Yugoslavia
Zambia
Zimbabwe

By providing this information, you agree to the processing of your personal data by SANS as described in our Privacy Policy.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Tags:
  • Digital Forensics and Incident Response

Related Content

Blog
DFIR_-_DFIR_Origin_Stories_-_340x340_Thumb.jpg
Digital Forensics and Incident Response
March 20, 2023
DFIR Origin Stories - Kat Hedley
Digital Forensics and Incident Response (DFIR) called to Kat Hedley as soon as she first entered the workforce.
DFIR_ICON_(1).PNG
SANS DFIR
read more
Blog
Google.png
Digital Forensics and Incident Response, Cloud Security
March 13, 2023
Google Cloud Log Extraction
In this blog post, we review the methods through which we can extract logs from Google Cloud.
Megan_Roddie_370x370.png
Megan Roddie
read more
Blog
Untitled_design-43.png
Digital Forensics and Incident Response, Cybersecurity and IT Essentials, Industrial Control Systems Security, Purple Team, Open-Source Intelligence (OSINT), Penetration Testing and Red Teaming, Cyber Defense, Cloud Security, Security Management, Legal, and Audit
December 8, 2021
Good News: SANS Virtual Summits Will Remain FREE for the Community in 2022
They’re virtual. They’re global. They’re free.
370x370-person-placeholder.png
Emily Blades
read more
  • Register to Learn
  • Courses
  • Certifications
  • Degree Programs
  • Cyber Ranges
  • Job Tools
  • Security Policy Project
  • Posters & Cheat Sheets
  • White Papers
  • Focus Areas
  • Cyber Defense
  • Cloud Security
  • Cybersecurity Leadership
  • Digital Forensics
  • Industrial Control Systems
  • Offensive Operations
Subscribe to SANS Newsletters
Receive curated news, vulnerabilities, & security awareness tips
United States
Canada
United Kingdom
Spain
Belgium
Denmark
Norway
Netherlands
Australia
India
Japan
Singapore
Afghanistan
Aland Islands
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belize
Benin
Bermuda
Bhutan
Bolivia
Bonaire, Sint Eustatius, and Saba
Bosnia And Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Cape Verde
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Cocos (Keeling) Islands
Colombia
Comoros
Cook Islands
Costa Rica
Croatia (Local Name: Hrvatska)
Curacao
Cyprus
Czech Republic
Democratic Republic of the Congo
Djibouti
Dominica
Dominican Republic
East Timor
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Falkland Islands (Malvinas)
Faroe Islands
Fiji
Finland
France
French Guiana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guernsey
Guinea
Guinea-Bissau
Guyana
Haiti
Heard And McDonald Islands
Honduras
Hong Kong
Hungary
Iceland
Indonesia
Iraq
Ireland
Isle of Man
Israel
Italy
Jamaica
Jersey
Jordan
Kazakhstan
Kenya
Kiribati
Korea, Republic Of
Kosovo
Kuwait
Kyrgyzstan
Lao People's Democratic Republic
Latvia
Lebanon
Lesotho
Liberia
Liechtenstein
Lithuania
Luxembourg
Macau
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Micronesia, Federated States Of
Moldova, Republic Of
Monaco
Mongolia
Montenegro
Montserrat
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands Antilles
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
Northern Mariana Islands
Oman
Pakistan
Palau
Palestine
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Pitcairn
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russian Federation
Rwanda
Saint Bartholemy
Saint Kitts And Nevis
Saint Lucia
Saint Martin
Saint Vincent And The Grenadines
Samoa
San Marino
Sao Tome And Principe
Saudi Arabia
Senegal
Serbia
Seychelles
Sierra Leone
Sint Maarten
Slovakia
Slovenia
Solomon Islands
South Africa
South Georgia and the South Sandwich Islands
South Sudan
Sri Lanka
St. Helena
St. Pierre And Miquelon
Suriname
Svalbard And Jan Mayen Islands
Swaziland
Sweden
Switzerland
Taiwan
Tajikistan
Tanzania
Thailand
Togo
Tokelau
Tonga
Trinidad And Tobago
Tunisia
Turkey
Turkmenistan
Turks And Caicos Islands
Tuvalu
Uganda
Ukraine
United Arab Emirates
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
Vatican City
Venezuela
Vietnam
Virgin Islands (British)
Virgin Islands (U.S.)
Wallis And Futuna Islands
Western Sahara
Yemen
Yugoslavia
Zambia
Zimbabwe

By providing this information, you agree to the processing of your personal data by SANS as described in our Privacy Policy.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
  • © 2023 SANS™ Institute
  • Privacy Policy
  • Contact
  • Careers
  • Twitter
  • Facebook
  • Youtube
  • LinkedIn