Support for Unicode characters (such as Persian, Cyrillic and Arabic characters) has brought with it both a simple means of fingerprinting intellectual property (signature stamping) and a very simple steganographic data-hiding technique.
The following is an extract from the Cyrillic Unicode character set [1].
Unicode # Character
0410 А CYRILLIC CAPITAL LETTER A
0430 а CYRILLIC SMALL LETTER A
0412 В CYRILLIC CAPITAL LETTER VE
0415 Е CYRILLIC CAPITAL LETTER IE
0435 е CYRILLIC SMALL LETTER IE
041C М CYRILLIC CAPITAL LETTER EM
041E О CYRILLIC CAPITAL LETTER O
043E о CYRILLIC SMALL LETTER O
0420 Р CYRILLIC CAPITAL LETTER ER
0440 р CYRILLIC SMALL LETTER ER
0422 Т CYRILLIC CAPITAL LETTER TE
0443 у CYRILLIC SMALL LETTER U
0405 Ѕ CYRILLIC CAPITAL LETTER DZE (this is the Old Cyrillic zelo - Macedonian)
0455 ѕ CYRILLIC SMALL LETTER DZE
The Basic Latin character table contains symbols that look the same as these. The difference is that the underlying code points are not the same, even though the displayed characters are indistinguishable. An attacker can exploit this to mount a phishing attack using a look-alike domain name, now that the registration of domain names containing Unicode characters has been allowed. For instance, the following domains are distinctly different, but appear the same:
Microsoft.com
\x004D\x0069\x0063\x0072\x006F\x0073\x006F\x0066\x0074\x002E\x0063\x006F\x006D
and
?i?r???ft.com
\x041C\x0069\x0441\x072\x043E\x0445\x043E\x0066\x0074\x002E\x0063\x006F\x006D
Unicode Mixed Characters | Latin Characters |
041C М CYRILLIC CAPITAL LETTER EM | 004D M LATIN CAPITAL LETTER M |
0069 i LATIN SMALL LETTER I | 0069 i LATIN SMALL LETTER I |
0441 с CYRILLIC SMALL LETTER ES | 0063 c LATIN SMALL LETTER C |
0072 r LATIN SMALL LETTER R | 0072 r LATIN SMALL LETTER R |
043E о CYRILLIC SMALL LETTER O | 006F o LATIN SMALL LETTER O |
0455 ѕ CYRILLIC SMALL LETTER DZE | 0073 s LATIN SMALL LETTER S |
043E о CYRILLIC SMALL LETTER O | 006F o LATIN SMALL LETTER O |
0066 f LATIN SMALL LETTER F | 0066 f LATIN SMALL LETTER F |
0074 t LATIN SMALL LETTER T | 0074 t LATIN SMALL LETTER T |
002E . FULL STOP | 002E . FULL STOP |
0063 c LATIN SMALL LETTER C | 0063 c LATIN SMALL LETTER C |
006F o LATIN SMALL LETTER O | 006F o LATIN SMALL LETTER O |
006D m LATIN SMALL LETTER M | 006D m LATIN SMALL LETTER M |
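The comparison in the table can be reproduced in a few lines of Python. This is only a minimal sketch: the helper names `describe` and `scripts_used` are my own, and guessing the script from the first word of the Unicode character name is a rough assumption that is good enough for Latin and Cyrillic but is not a general script detector.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def scripts_used(text):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

latin = "Microsoft.com"
# Same visible string, built from the mixed code points in the table above.
mixed = "\u041Ci\u0441r\u043E\u0455\u043Eft.com"

print(latin == mixed)               # False
print(sorted(scripts_used(latin)))  # ['LATIN']
print(sorted(scripts_used(mixed)))  # ['CYRILLIC', 'LATIN']

for code, name in describe(mixed):
    print(code, name)
```

A label that mixes scripts in this way is a reasonable thing to flag for a closer look, although legitimate names can mix scripts too.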
At the same time, there are positive uses for this type of technique. Seemingly harmless information can be embedded in Word documents, for example by swapping a few Latin letters in a phrase for their Cyrillic look-alikes so that the phrase becomes a string unique to that document. If the document is ever published on the web, it can be searched for using an engine such as Google. The marked string can also be added to a standard forensic string search. Find the string and you have your document.
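One way such a fingerprint could be stamped into text is sketched below. This is an illustration under my own assumptions, not a description of any particular tool: the `HOMOGLYPHS` map is just the Latin/Cyrillic pairs from the table above, the functions `embed` and `extract` are hypothetical names, and the cover text is assumed to contain no Cyrillic to begin with. Each letter that has a look-alike carries one bit: left as Latin for 0, swapped to Cyrillic for 1.

```python
# Latin letter -> visually similar Cyrillic letter (pairs from the table above).
HOMOGLYPHS = {
    "A": "\u0410", "a": "\u0430", "B": "\u0412", "E": "\u0415", "e": "\u0435",
    "M": "\u041C", "O": "\u041E", "o": "\u043E", "P": "\u0420", "p": "\u0440",
    "T": "\u0422", "y": "\u0443", "S": "\u0405", "s": "\u0455",
}
REVERSE = {cyr: lat for lat, cyr in HOMOGLYPHS.items()}

def embed(cover, bits):
    """Hide a string of '0'/'1' characters in cover by swapping look-alikes."""
    pending = list(bits)
    out = []
    for ch in cover:
        if ch in HOMOGLYPHS and pending:
            bit = pending.pop(0)
            out.append(HOMOGLYPHS[ch] if bit == "1" else ch)
        else:
            out.append(ch)
    if pending:
        raise ValueError("cover text too short for the payload")
    return "".join(out)

def extract(stego, nbits):
    """Read back the first nbits bits from the carrier letters."""
    bits = []
    for ch in stego:
        if ch in HOMOGLYPHS:
            bits.append("0")   # carrier letter left as Latin
        elif ch in REVERSE:
            bits.append("1")   # carrier letter swapped to Cyrillic
    return "".join(bits[:nbits])

cover = "Please treat this report as confidential."
stamped = embed(cover, "1011001")
print(stamped == cover)      # False, although the two look alike on screen
print(extract(stamped, 7))   # 1011001
```

The stamped sentence can then serve as the search string: it looks like the original on screen, but an exact-match search for it should only hit copies of the marked document.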
Think of file names as well. Windows allows file names to be created using Unicode characters. Hence, if you are looking for a file called "cat.txt", a simple string search will miss a "саt.txt" whose first two letters are Cyrillic (\x0441\x0430\x0074\x002E\x0074\x0078\x0074). I have linked a site that does online Unicode conversions and display.
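A search can be made tolerant of this trick by folding look-alikes back to Latin before comparing. The sketch below assumes a small hand-built map (the pairs from the table above plus Cyrillic small ES for "c"); `fold` and `find_lookalikes` are hypothetical helper names, and a real investigation would need a far larger mapping than this one.

```python
# Cyrillic look-alike -> Latin letter. Deliberately small and incomplete.
CYRILLIC_TO_LATIN = {
    "\u0410": "A", "\u0430": "a", "\u0412": "B", "\u0415": "E", "\u0435": "e",
    "\u041C": "M", "\u041E": "O", "\u043E": "o", "\u0420": "P", "\u0440": "p",
    "\u0422": "T", "\u0443": "y", "\u0405": "S", "\u0455": "s", "\u0441": "c",
}
FOLD_TABLE = str.maketrans(CYRILLIC_TO_LATIN)

def fold(text):
    """Replace known Cyrillic look-alikes with their Latin twins."""
    return text.translate(FOLD_TABLE)

def find_lookalikes(target, names):
    """Return every name that matches target once look-alikes are folded."""
    wanted = fold(target)
    return [name for name in names if fold(name) == wanted]

names = ["cat.txt", "\u0441\u0430t.txt", "dog.txt"]

print("cat.txt" == names[1])                   # False: the plain comparison misses it
print(len(find_lookalikes("cat.txt", names)))  # 2
```

Folding both sides means one comparison covers every mixture of the mapped characters, so there is no need to list each disguised spelling separately.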
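An issue with trying to uncover all versions and possible combinations is that it quickly becomes computationally infeasible: each character that has a look-alike at least doubles the number of candidate strings, so the search space grows exponentially with the length of the name. There are more ways to hide data than there are simple string searches to find it. This means that we as forensic professionals need to use our greatest tool: our brain. Things are not always as they seem.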
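To get a feel for the growth, the sketch below counts and enumerates the disguised spellings of a name. It assumes only one look-alike per letter and uses a small lowercase map of my own choosing (`LOOKALIKE`), so the real numbers are larger still once other scripts and additional look-alikes are considered.

```python
from itertools import product

# Lowercase Latin letter -> one Cyrillic look-alike (a deliberately small map).
LOOKALIKE = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043E",
    "p": "\u0440", "s": "\u0455", "y": "\u0443",
}

def variant_count(text):
    """Number of spellings when each mapped letter has two possible forms."""
    return 2 ** sum(ch in LOOKALIKE for ch in text)

def variants(text):
    """Enumerate every spelling. Only practical for short strings."""
    choices = [(ch, LOOKALIKE[ch]) if ch in LOOKALIKE else (ch,) for ch in text]
    return ["".join(combo) for combo in product(*choices)]

print(variant_count("cat.txt"))        # 4
print(variant_count("microsoft.com"))  # 64
print(len(variants("cat.txt")))        # 4
```

With only two choices per letter, a name containing 30 substitutable letters already has more than a billion spellings, which is why listing every search string by hand does not scale.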
[1] Unicode Character Table: Cyrillic
http://jrgraphix.net/research/unicode_blocks.php?block=8
Craig Wright is a Director with Information Defense in Australia. He holds both the GSE-Malware and GSE-Compliance certifications from GIAC. He is a perpetual student with numerous postgraduate degrees, including an LLM specializing in international commercial law and ecommerce law, and is working on his 4th IT-focused Masters degree (Masters in System Development) from Charles Sturt University, where he is helping to launch a Masters degree in digital forensics. He starts his second doctorate, a PhD on the quantification of information system risk, at CSU in April this year.