In this post, I am gonna walk you through compiling Python code to CPython bytecode: what code objects are, how to construct them, how to disassemble them, and how to decompile them.
I will be using CPython 3.6.5.
>>> codestr = """
... print('Witness me!')
... """
>>> compiled_codestr = compile(codestr, '<string>', 'exec')
>>> type(compiled_codestr)
<class 'code'>
Whoo, we have created our first code object.
We passed arguments to the compile function as follows:
codestr is, as you might have guessed, our code as a string.
The second argument is the filename of the file from which the code was read. Since we typed this into an interpreter rather than reading it from a file, we passed '<string>', as per the documentation.
The third argument is called mode in the documentation, and it specifies what kind of code must be compiled. We could also have used eval, since we’re compiling a single expression. Refer to the compile function documentation for more details on the mode argument, and refer to this for a detailed explanation of eval and exec and the differences between them.
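To make the difference between the modes a bit more concrete, here is a small sketch of my own (the variable names are just for illustration). It compiles the same expression in both exec and eval mode, and then shows that a statement is rejected in eval mode.

```python
# Sketch: comparing the 'exec' and 'eval' modes of compile().
source = "print('Witness me!')"

# 'exec' accepts any sequence of statements; exec() always returns None.
exec_code = compile(source, "<string>", "exec")
exec_result = exec(exec_code)

# 'eval' accepts exactly one expression; eval() returns its value.
# print(...) is an expression in Python 3, and its value is None.
eval_code = compile(source, "<string>", "eval")
eval_result = eval(eval_code)

# An expression with a more interesting value.
three = eval(compile("1 + 2", "<string>", "eval"))

# An assignment is a statement, not an expression,
# so compiling it in 'eval' mode raises SyntaxError.
try:
    compile("x = 1", "<string>", "eval")
    assignment_allowed_in_eval = True
except SyntaxError:
    assignment_allowed_in_eval = False

print(three, assignment_allowed_in_eval)
```

So eval mode is the one to reach for when you want the value of an expression back, and exec mode when you have arbitrary statements.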
Let’s see some of the attributes that this code object has:
>>> compiled_codestr.co_consts
('Witness me!',)
>>> compiled_codestr.co_filename
'<string>'
co_consts is a tuple of constants; you can see the string that we had in our codestr here as the first element of the tuple.
co_filename is the name of the file that this code object was compiled from. Since we passed '<string>' to the compile function, that’s what we get here.
We will explore more attributes (but not all) of the code object later on. For now, let’s see the list of all available attributes:
>>> dir(compiled_codestr)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno', 'co_flags', 'co_freevars', 'co_kwonlyargcount', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals', 'co_stacksize', 'co_varnames']
We’re interested in the attributes that start with co_. For a complete description of these attributes you can refer to:
Code objects in CPython’s documentation.
A description of the attributes of a code object is available in this table, under the code type.
The definition of the PyCodeObject struct in CPython.
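If you only want to see the co_ attributes without the dunder noise, a quick filter does the job. This is just a sketch; the exact list you get depends on your CPython version, so I am not showing an expected output here.

```python
# Sketch: list only the co_* attribute names of a code object.
code = compile("print('Witness me!')", "<string>", "exec")

co_attributes = [name for name in dir(code) if name.startswith("co_")]

for name in co_attributes:
    print(name)
```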
Well, we have now constructed a code object and looked at some of its attributes. What else can we do with it? Well, we can execute it:
>>> exec(compiled_codestr)
Witness me!
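exec also accepts an optional dictionary to use as the global namespace, which is handy when you want to look at what the code object defined without touching your own globals. A minimal sketch, with a made-up snippet of source:

```python
# Sketch: run a code object inside an explicit namespace dictionary.
code = compile("answer = 6 * 7", "<string>", "exec")

namespace = {}
exec(code, namespace)

# The assignment landed in our dictionary, not in the module globals.
print(namespace["answer"])
```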
Now let’s look at the code object of a function:
>>> def hello(name):
...     print("Hello,", name)
...
>>> codeobj = hello.__code__
>>> type(codeobj)
<class 'code'>
Since this function takes one argument, let’s see if the co_argcount attribute of the code object reflects this:
>>> codeobj.co_argcount
1
This code object should also have the string that we use in the call to print among its constants:
>>> codeobj.co_consts
(None, 'Hello,')
We’d also like to see the name of the function; we can find this by checking the co_name attribute:
>>> codeobj.co_name
'hello'
Let’s check out the local variables of our function:
# Number of locals
>>> codeobj.co_nlocals
1
# Names of locals
>>> codeobj.co_varnames
('name',)
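The attributes we just poked at one by one can be collected in a single go. Here is a small sketch; describe is a hypothetical helper of mine, not something from the standard library.

```python
def hello(name):
    print("Hello,", name)


def describe(func):
    """Collect a few code-object attributes of func (hypothetical helper)."""
    code = func.__code__
    return {
        "name": code.co_name,          # name of the function
        "argcount": code.co_argcount,  # number of positional arguments
        "nlocals": code.co_nlocals,    # number of local variables
        "varnames": code.co_varnames,  # names of the local variables
    }


summary = describe(hello)
print(summary)
```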
Now, let’s look at one final attribute of the code object, co_code:
>>> codeobj.co_code
b't\x00d\x01|\x00\x83\x02\x01\x00d\x00S\x00'
Oh, look at that, it’s a bytes object. But what does it represent? This is the bytecode representation of the code of the hello function.
Yes, it’s unwieldy to read, but fortunately there’s a better way to understand code objects.
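Just to demystify those bytes a little, here is a rough sketch that decodes them by hand. It assumes the "wordcode" layout used by CPython 3.6 and later, where every instruction is one opcode byte followed by one argument byte. On newer CPython versions you will see extra entries that 3.6 does not have, so treat the output as version-specific.

```python
import dis


def hello(name):
    print("Hello,", name)


raw = hello.__code__.co_code

# Assumption: CPython 3.6+ wordcode, i.e. (opcode byte, argument byte) pairs.
decoded = [
    (offset, dis.opname[raw[offset]], raw[offset + 1])
    for offset in range(0, len(raw), 2)
]

for offset, name, arg in decoded:
    print(offset, name, arg)
```

This is essentially a poor man’s version of what the dis module does for us.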
The CPython virtual machine (CPython VM) is a stack-based VM. This means that the bytecode works by pushing things onto the stack and popping things off of it.
Let’s see an example, continuing with the code object of the hello function that we defined before:
>>> import dis
>>> dis.dis(codeobj)
  2           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello,')
              4 LOAD_FAST                0 (name)
              6 CALL_FUNCTION            2
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
Before we dig into the meaning of the instructions, let’s first define what each column in the previous output means:
  2           0 LOAD_GLOBAL              0 (print)
  |           |     |                    |   |
  |           |     |                    |   +-- Interpretation of the parameters in parentheses.
  |           |     |                    +------ Operation parameters.
  |           |     +--------------------------- The operation code name.
  |           +--------------------------------- The address of the instruction.
  +--------------------------------------------- The line number, for the first instruction of each line.
Sometimes there can be more columns, but we’ll stick to the ones that were generated from our code object. For a complete description of the output of dis.dis you can refer to its documentation.
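If you would rather get at these columns programmatically than read them off the printed table, the dis module also offers dis.get_instructions, which yields one Instruction object per line of the disassembly. A small sketch follows; the instructions you get back vary between CPython versions, so I am not listing an expected output.

```python
import dis


def hello(name):
    print("Hello,", name)


# Each Instruction exposes the same information as the printed columns:
# offset (address), opname (operation name), arg (parameter) and
# argrepr (the human-readable interpretation shown in parentheses).
rows = [
    (ins.offset, ins.opname, ins.arg, ins.argrepr)
    for ins in dis.get_instructions(hello)
]

for offset, opname, arg, argrepr in rows:
    print(offset, opname, arg, argrepr)
```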
Now let’s dig into the disassembly and figure out what each line means:
  2           0 LOAD_GLOBAL              0 (print)
The LOAD_GLOBAL instruction pushes the global named co_names[namei] onto the stack. In this case namei is 0, so it loads the global whose name is the first element of co_names. Let’s verify this by inspecting our code object:
>>> codeobj.co_names
('print',)
Cool, now let’s move to the 2nd line:
              2 LOAD_CONST               1 ('Hello,')
The LOAD_CONST instruction pushes co_consts[consti] onto the stack. In our case this loads co_consts[1], which is 'Hello,':
>>> codeobj.co_consts
(None, 'Hello,')
Moving on to the next line:
              4 LOAD_FAST                0 (name)
The LOAD_FAST instruction pushes a reference to the local co_varnames[var_num] onto the stack. Verifying:
>>> codeobj.co_varnames
('name',)
Alright, things are about to get a little more interesting, but first let’s review what our stack looks like currently.
We’ve done 3 operations which push things onto the stack, roughly:
push print
push 'Hello,'
push (ref name)
Translating this into a visual representation, this is what our stack looks like:
|        |
+--------+
|ref name|
+--------+
|'Hello,'|
+--------+
| print  |
+--------+
Alright, let’s continue looking at the disassembly. The next line is:
              6 CALL_FUNCTION            2
CALL_FUNCTION, as is obvious from its name, calls a function. But what does the argument that it takes, 2 in our case, mean?
Well, in CPython 3.6 it is simply the number of positional arguments that the function is called with; calls that pass keyword arguments are compiled to a separate instruction, CALL_FUNCTION_KW. (In CPython 3.5 and earlier, the argument was interpreted as a 2-byte number, where the low byte indicated the number of positional arguments and the high byte the number of keyword arguments.)
In our case it means that the function takes 2 positional arguments, which are popped off the stack. The arguments were pushed from left to right, so the rightmost argument is on the top of the stack.
So to summarize what CALL_FUNCTION will do here:
Pop 2 arguments off of the stack.
Pop the function that’s below them in the stack, in our case print, and call it with those arguments.
Push the return value of the function onto the stack.
Awesome, let’s move to the next line in the disassembly:
              8 POP_TOP
POP_TOP simply removes the item on the top of the stack, so we’ve now removed the value that was returned by our last CALL_FUNCTION.
The final 2 lines in our disassembly are:
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
The first is a LOAD_CONST as before, the constant being None. The second returns the value on the top of the stack, i.e. it returns None to the caller of the function.
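To tie the whole walkthrough together, here is a toy sketch that replays the seven instructions on a plain Python list standing in for the VM stack. This is only my illustration of the stack discipline, not how CPython’s interpreter loop is actually implemented; simulate_hello and fake_print are hypothetical names.

```python
def simulate_hello(name, global_names):
    """Replay the disassembly of hello() on a plain list used as a stack."""
    stack = []
    stack.append(global_names["print"])  # LOAD_GLOBAL 0 (print)
    stack.append("Hello,")               # LOAD_CONST 1 ('Hello,')
    stack.append(name)                   # LOAD_FAST 0 (name)

    argc = 2                             # CALL_FUNCTION 2
    args = stack[-argc:]                 # the rightmost argument is on top
    del stack[-argc:]                    # pop the 2 arguments
    func = stack.pop()                   # pop the function below them
    stack.append(func(*args))            # push the return value

    stack.pop()                          # POP_TOP
    stack.append(None)                   # LOAD_CONST 0 (None)
    return stack.pop()                   # RETURN_VALUE


# A stand-in for print that records its arguments, so we can inspect the call.
recorded = []


def fake_print(*args):
    recorded.append(args)


result = simulate_hello("World", {"print": fake_print})
print(result, recorded)
```

Passing {"print": print} instead of the recording stand-in makes it behave like the real hello function.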
WOOHOO! We’ve now compiled Python code into bytecode, disassembled it, and explored the disassembly. The dis module documentation has information on more bytecode instructions if you want to dig deeper.
Suppose you somehow stumbled upon Python bytecode in the wild, maybe from malware, maybe from proprietary code, and you want to understand what’s going on but you are not in the mood to read disassembly. What can you do? Well, you can use a decompiler. In this case we’ll use uncompyle6; go ahead and install it.
First, let’s start by creating a file hello.py which contains the hello function:
def hello(name):
    print('Hello,', name)
Now we want to compile this file into Python bytecode. Let’s do this by running this in a shell:
>: python -m compileall .
You will now find a __pycache__ directory in your current working directory. Let’s go into it and decompile the compiled file:
>: cd __pycache__
>: uncompyle6 hello.cpython-36.pyc
This should output:
# ...
# ...
def hello(name):
    print('Hello,', name)
# okay decompiling hello.cpython-36.pyc
Voila, you now have your source code back.
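If you are curious what is actually inside that .pyc file, here is a rough sketch that compiles a file with py_compile and loads the code object back with marshal. The header size is an assumption on my part (12 bytes on CPython 3.3 to 3.6, 16 bytes on 3.7 and later), and the file must be loaded by the same Python version that produced it.

```python
import marshal
import os
import py_compile
import sys
import tempfile

source = "def hello(name):\n    print('Hello,', name)\n"

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "hello.py")
    with open(source_path, "w") as handle:
        handle.write(source)

    # py_compile.compile returns the path of the .pyc file it wrote.
    pyc_path = py_compile.compile(
        source_path, cfile=os.path.join(workdir, "hello.pyc")
    )

    with open(pyc_path, "rb") as handle:
        data = handle.read()

# Assumption: 12-byte header on CPython 3.3 to 3.6, 16-byte header on 3.7+.
header_size = 16 if sys.version_info >= (3, 7) else 12

# Everything after the header is the marshalled module code object.
module_code = marshal.loads(data[header_size:])

# The function's own code object is stored among the module's constants.
function_names = [
    const.co_name for const in module_code.co_consts if hasattr(const, "co_name")
]
print(function_names)
```

From here you could hand module_code to dis.dis and read the disassembly just like we did above.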