[Editor's Note: In part 3 of this series on techniques penetration testers can use to analyze executable files, Yori Kvitchko takes a look at decompiling code, with specific tips for Python and Java. Decompiled executables are often chock full of useful stuff for pen testing, and Yori provides a bunch of helpful tips for teasing out their secrets! -Ed.]
In the first part of this series, I discussed analyzing binary files and looking for hints about their communications streams. In the second part of the series, I delved into the data files that binaries often create. For the third and final blog post in this series about analyzing binaries, I'll be discussing some quick and easy techniques for decompiling binaries that don't compile into raw assembly. I use the term "compiled" here rather loosely. The extent and difficulty of analyzing assembly is outside the scope of this blog post, so what I'll be talking about is analyzing executable files that are not pure assembly. These executables are either a "compiled" version of an interpreted language, such as Python, or are compiled into byte code and run in a virtual machine such as the Java Virtual Machine (JVM).
Seeing the source code of an application is big. It's the grand slam of reverse engineering because it gives you pretty much everything about what's going on inside the program itself. If you can see the entire source code of an application, assuming no trickery such as staged downloads, you know everything the application does. Just having the source code by itself isn't very helpful though, unless you know what to look for. Not only can source code often be difficult to search through to find what you're looking for, but it also depends on you knowing the language it's written in well enough to analyze it. So, what are some easy things to look for in source code that don't require poring over each line of code?
The usual suspects from my last two blog posts are much easier to look for in source code. Hardcoded passwords, URLs, and even XML or plain text configurations can often be found by searching for relevant keywords that often appear in variable names, such as "pass" or "xml". On top of that, looking at the source code can also inform the process of analyzing files created by the executable. Oftentimes, encoding or encryption is done with libraries that can easily be seen in source, telling us exactly how we might decode a given file. Furthermore, looking at source code is probably the easiest method for decoding a network protocol. Searching for function names such as "getInt" or "getString" can often reveal the area of the source responsible for decoding a custom protocol.
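As a rough illustration of that kind of keyword search, here is a minimal Python 3 sketch that walks a directory of decompiled source and reports every line containing one of the keywords mentioned above. The function name find_keyword_hits and the exact keyword list are my own assumptions for illustration; a plain grep would do the same job.

```python
import re
from pathlib import Path

# Keywords taken from the text above; extend the tuple as needed.
KEYWORDS = ("pass", "xml", "getInt", "getString")

def find_keyword_hits(root, keywords=KEYWORDS):
    """Return (path, line number, line) for each line that contains a keyword."""
    # Case-insensitive so that "pass" also matches "Password" or "PASSWD".
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Decompiled output can contain odd bytes, so decode leniently.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling find_keyword_hits on the folder where a decompiler saved its output gives a short list of places to start reading, rather than the whole code base.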
Now that we know what to keep an eye out for, here are some languages and corresponding tools that will let us get at the juicy source code innards of these executables. Keep in mind though, that there are source code obfuscation tools for each of these languages that can make an executable much harder to analyze with these techniques.
C# .NET
C Sharp is everywhere, from standalone applications, to languages inside of other tools, to web applications. Microsoft's .NET platform and all related technologies are flexible, relatively easy to code for, and therefore used all over the place. With such a large attack surface to work with, as well as a similar need coming from the developer side, it's no wonder that there are a number of tools that can decompile C# binaries and libraries.
Probably the best of these tools is redgate's Reflector. It works exceptionally well, rarely fails, and has a number of great features that help in analyzing the resulting source code. The only problem is that it isn't free. If you need it though, it might be well worth the asking price.
On the free side, we have ILSpy as one of the top contenders for decompiling and analyzing C#. In addition to having a relatively high success rate in decompiling code, it also has a number of very useful features. First, it allows you to browse the code much like you would in a modern IDE. Going to the definition of a function or variable is easy, as is finding all references to them. On top of that, ILSpy allows you to save all of the relevant source code as a series of source files and a corresponding project file. This gives you the ability to then browse the source in the editor or IDE of your choice and use tools such as grep to quickly search through it.
Both of these tools will automatically analyze any imported libraries, making it easy to see what the analyzed source code is importing and to browse through that code as well. I've personally used ILSpy to great effect while trying to replicate a network protocol.
Java
Although it has lost a bit of popularity in the last few years, Java still has a number of developers dedicated to it, and the advent of Android phone development has only strengthened Java's presence. Much like C#, Java is compiled to byte code and can often be reverted to near-original source code. Applications often come packaged as JAR files, which are little more than zip files of compiled Java byte code.
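Because a JAR is just a zip archive, you can peek inside one before reaching for a decompiler. The following is a minimal Python 3 sketch using the standard zipfile module; the helper name list_class_files is hypothetical and only there for illustration.

```python
import zipfile

def list_class_files(jar_path):
    """Return the names of the compiled .class entries inside a JAR."""
    # A JAR opens like any other zip archive.
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(name for name in jar.namelist() if name.endswith(".class"))
```

The entry names mirror the package layout, so the listing alone often hints at which classes handle networking, configuration, or crypto.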
With most of the functionality provided by ILSpy for C#, the Java Decompiler is freely available and does a great job with decompiling Java class files and JAR packages. It allows you to browse the source, find definitions, and save the source code for viewing in an IDE like Eclipse.
Probably the most common use for decompiling Java, other than internal tools, is the analysis of Android applications. For more on this topic, check out SANS Sec 575 which covers analyzing Android apps in detail.
Python
Ever the favorite for small scripts and the occasional Django web application, Python can still be found in the toolkit of many a developer. As an interpreted language, Python has an even looser definition of compiling, but still has a compiled layer in the form of Python byte code stored as ".pyc" files. These pyc files tend to be much more easily reversible than Java and C#, so unless the source has been obfuscated, you can almost always retrieve the exact source code used to create the final executable.
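To get a feel for why Python byte code gives up so much, here is a small Python 3 sketch. It does not read a real .pyc file; it uses the built-in compile() function to build the same kind of code object that a .pyc stores after its header, then pulls out the names and string constants that survive compilation. The function name embedded_details and the sample source are assumptions made for this example.

```python
def embedded_details(source):
    """Compile source text and return (names, string constants) kept in the byte code."""
    code = compile(source, "<example>", "exec")
    # co_names holds identifiers used by the code; co_consts holds literal values.
    strings = [c for c in code.co_consts if isinstance(c, str)]
    return list(code.co_names), strings

names, strings = embedded_details('secret_variable = "secret value"\n')
print(names)
print(strings)
```

Variable names and string literals are kept as-is inside the code object, which is a large part of why decompilers can get so close to the original source.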
There are a number of tools that can perform the necessary decompile operation, but my personal favorite is depython.com. It's easy and it works, but it only supports up to Python version 2.6. For later versions, check out unpyc and unpyc3.
Although decompiling is great, there is actually an even cooler alternative for getting at the innards of a Python script. Because Python is an interpreted language, interacting with it can be done in a much more dynamic fashion. For example, the Python interpreter can be executed with the "-i" option. This option executes the given Python script and then, without clearing any state, immediately gives you a Python prompt. This prompt then allows you to run any Python command inside of the environment left over by the Python script. What this means is that any variables stored by the script can be accessed with a simple print statement, and any functions can be called at will. Both are discoverable and listed by using the dir() function. That's quite a bit of power just from using the built-in features of the Python interpreter.
$ python -i someprogram.py
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'secret_variable']
>>> print secret_variable
secret value
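If you would rather script that inspection than type at a prompt, a roughly equivalent approach is sketched below in Python 3 (the session above is Python 2). It uses the standard runpy module to run a script and hand back whatever globals it leaves behind. The helper name leftover_state is hypothetical, and just like "-i", this really does execute the target script, so only run it somewhere you are comfortable running the program.

```python
import runpy

def leftover_state(script_path):
    """Run a script as __main__ and return the non-dunder globals it leaves behind."""
    # run_path executes the file and returns its resulting globals dictionary.
    namespace = runpy.run_path(script_path, run_name="__main__")
    return {name: value for name, value in namespace.items()
            if not name.startswith("__")}
```

The returned dictionary plays the role of dir() in the session above: its keys are the names the script defined, and its values are the live objects, functions included.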
That's it for my series on analyzing binaries. Any other languages I didn't cover will typically fall into either the byte code category, like C#, or the interpreted category, like Python, so a bit of Googling will often reveal tools that fill a similar function for those languages.
I hope those of you who are new to the subject learned enough to get started and to get past the mental barrier of thinking that reverse engineering requires you to know assembly. For any experts reading, I hope you picked up a trick or two as well. Thanks for reading and as always, happy hacking!
-Yori Kvitchko
Counter Hack