Python Internals – CPython Bytecode

June 16, 2018
programming python

In this post, I'm going to walk you through compiling Python code to CPython bytecode: what code objects are, how to construct them, how to disassemble them, and how to decompile them.

I will be using CPython 3.6.5.

A simple example

>>> codestr = """
print('Witness me!')
"""

>>> compiled_codestr = compile(codestr, '<string>', 'exec')

>>> type(compiled_codestr)
<class 'code'>

Whoo, we have created our first code object.

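We passed three arguments to the compile function: the source code to compile (codestr), the filename that the code is attributed to ('<string>', a conventional placeholder for code that does not come from a file), and the mode ('exec', which accepts a sequence of statements).

Let's see some of the attributes that this code object has: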

>>> compiled_codestr.co_consts
('Witness me!', None)

>>> compiled_codestr.co_filename
'<string>'

co_consts is a tuple of the constants used by the code; the string from our codestr is its first element.

co_filename is the filename which this code object belongs to. Since we defined this as '<string>' in our compile function, that's what we get here.
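As a small sketch of how the arguments to compile shape the resulting code object (the expression "6 * 7" and the label "my_snippet" are made up for illustration), here is the same call in 'eval' mode:

```python
# The second argument only labels the code object; no such file has to exist.
expr_code = compile("6 * 7", "my_snippet", "eval")

# The label we chose is what co_filename reports.
filename = expr_code.co_filename

# 'eval' mode compiles a single expression, so eval() hands back its value.
result = eval(expr_code)

print(filename, result)
```

In 'exec' mode the source may contain any number of statements, while 'eval' mode accepts exactly one expression.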

We will explore more (but not all) attributes of the code object later on. For now, let's list everything that is available:

>>> dir(compiled_codestr)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__',
'__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno', 'co_flags', 'co_freevars',
'co_kwonlyargcount', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals', 'co_stacksize', 'co_varnames']

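We're interested in the attributes that start with co_. For a complete description of these attributes you can refer to the Python documentation, for example the "Types and members" table in the documentation of the inspect module.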

So we have constructed a code object and looked at some of its attributes. What else can we do with it? Well, we can exec it:

>>> exec(compiled_codestr)
Witness me!
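exec also accepts a dictionary to use as the global namespace, which makes it easy to see what a code object did when it ran. A minimal sketch, with variable names (greeting, shout) chosen only for this example:

```python
source = "greeting = 'Witness me!'\nshout = greeting.upper()\n"
code = compile(source, "<string>", "exec")

# Run the code object against our own dictionary instead of the current globals.
namespace = {}
exec(code, namespace)

# The names assigned by the compiled code are now keys of the dictionary.
print(namespace["shout"])
```

Running code in a separate dictionary keeps it from touching the names of the calling module.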

A more complicated example

Now let's look at the code object of a function

>>> def hello(name):
...     print("Hello,", name)

>>> codeobj = hello.__code__

>>> type(codeobj)
<class 'code'>

Since this function takes one argument, let's see if the co_argcount attribute of the code object reflects this

>>> codeobj.co_argcount
1

This code object should also have the string that we use in the call to print in its co_consts attribute

>>> codeobj.co_consts
(None, 'Hello,')

We'd also like to see the name of the function; we can find this by checking the co_name attribute

>>> codeobj.co_name
'hello'

Let's check out the local variables of our codeobj

# Number of locals
>>> codeobj.co_nlocals
1

# Names of locals
>>> codeobj.co_varnames
('name',)
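The attributes inspected so far can also be gathered in one go. The helper below (describe is a hypothetical name, not part of any library) is a small sketch that collects a few co_* attributes of a function's code object into a dictionary:

```python
def hello(name):
    print("Hello,", name)


def describe(func):
    """Collect a few co_* attributes of func's code object in a dict."""
    code = func.__code__
    wanted = ("co_name", "co_argcount", "co_nlocals", "co_varnames", "co_names")
    return {attr: getattr(code, attr) for attr in wanted}


summary = describe(hello)
for attr, value in summary.items():
    print(attr, "=", value)
```

co_names, which has not been discussed yet, holds the global and attribute names that the code refers to; for hello that is just print.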

Now, let's look at one final attribute of code objects, co_code

>>> codeobj.co_code
b't\x00d\x01|\x00\x83\x02\x01\x00d\x00S\x00'

Oh, look at that, it's a bytes object. But what does it represent? This is the bytecode of the hello function. It is hard to read in this raw form, but fortunately there's a better way to understand code objects.
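In CPython 3.6 each instruction takes two bytes, an opcode followed by a one-byte argument, so the bytes above can be decoded by hand. The sketch below does that with a hand-written lookup table; OPNAMES_36 and decode are names invented for this example, and the opcode numbers are the CPython 3.6 values for just the instructions that occur here (other versions may number them differently).

```python
# Opcode numbers as used by CPython 3.6, limited to the opcodes in this example.
OPNAMES_36 = {
    1: "POP_TOP",
    83: "RETURN_VALUE",
    100: "LOAD_CONST",
    116: "LOAD_GLOBAL",
    124: "LOAD_FAST",
    131: "CALL_FUNCTION",
}

# The co_code bytes of hello() as shown above.
raw = b't\x00d\x01|\x00\x83\x02\x01\x00d\x00S\x00'


def decode(bytecode):
    """Split 3.6-style bytecode into (opcode name, argument) pairs."""
    pairs = []
    for i in range(0, len(bytecode), 2):
        opcode, arg = bytecode[i], bytecode[i + 1]
        # Unknown opcodes are shown by number instead of raising an error.
        pairs.append((OPNAMES_36.get(opcode, "<%d>" % opcode), arg))
    return pairs


decoded = decode(raw)
for offset, (name, arg) in zip(range(0, len(raw), 2), decoded):
    print(offset, name, arg)
```

The dis module, introduced next, does this lookup for you and works for every opcode of the running interpreter.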

Disassembly

The CPython virtual machine (CPython VM) is a stack-based VM. This means that the bytecode works by pushing things onto a stack and popping things off of it.

Let's see an example, continuing with the code object of the hello function that we have defined before.

>>> import dis
>>> dis.dis(codeobj)
  2           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello,')
              4 LOAD_FAST                0 (name)
              6 CALL_FUNCTION            2
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE

Before we dig into the meaning of the instructions, let's first define what each column in the previous output means:

  2           0 LOAD_GLOBAL              0 (print)
  |           |         |                |   |
  |           |         |                |   +--------- Interpretation of the parameters in parentheses.
  |           |         |                +------------- Operation Parameters.
  |           |         +------------------------------ The operation code name.
  |           +---------------------------------------- The address of the instruction.
  +---------------------------------------------------- The line number, for the first instruction of each line.

Sometimes there can be more columns, but we'll stick to the ones that were generated from our code object. For a complete description of the output of dis() you can refer to its documentation.
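The same columns are available programmatically through dis.get_instructions, which yields one Instruction object per bytecode instruction. A short sketch (the exact instructions printed depend on the Python version you run it with, so they may differ from the 3.6 listing above):

```python
import dis


def hello(name):
    print("Hello,", name)


# offset, opname, arg and argrepr correspond to the address, operation name,
# parameter and parenthesised interpretation columns of the dis() output.
rows = [
    (ins.offset, ins.opname, ins.arg, ins.argrepr)
    for ins in dis.get_instructions(hello)
]

for row in rows:
    print(row)
```

This is handy when you want to filter or count instructions instead of reading the printed listing.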

Now let's dig into the disassembly and figure out what each line means

  2           0 LOAD_GLOBAL              0 (print)

Here, the LOAD_GLOBAL instruction pushes the global named co_names[namei] onto the stack, where namei is the instruction's argument. In this case it loads co_names[0]. Let's verify this by inspecting our code object:

>>> codeobj.co_names
('print',)

Cool, now let's move to the 2nd line

              2 LOAD_CONST               1 ('Hello,')

The LOAD_CONST instruction pushes co_consts[consti] onto the stack; in our case that is co_consts[1]. Verifying:

>>> codeobj.co_consts
(None, 'Hello,')

Moving onto the next line

              4 LOAD_FAST                0 (name)

The LOAD_FAST instruction pushes a reference to the local co_varnames[var_num] onto the stack; in our case that is co_varnames[0]. Verifying:

>>> codeobj.co_varnames
('name',)

Alright, things are about to get a little more interesting, but first let's review what our stack looks like currently.

We've done 3 operations which push things onto the stack, roughly:

push print
push 'Hello,'
push (ref name)

Translating this into a visual representation, this is what our stack looks like:

   |        |
   +--------+
   |ref name|
   +--------+
   |'Hello,'|
   +--------+
   | print  |
   +--------+

Alright, let's continue looking at the disassembly, the next line is

              6 CALL_FUNCTION            2

The CALL_FUNCTION instruction, as its name suggests, calls a function. But what does its argument, 2 in our case, mean? In CPython 3.6 it is simply the number of positional arguments that the function is called with; calls that pass keyword arguments use a separate instruction, CALL_FUNCTION_KW. (In CPython 3.5 and earlier the argument packed two counts into one number: the low byte held the number of positional arguments and the high byte the number of keyword arguments.)

In our case it means that the function is called with 2 positional arguments, which are popped off the stack. They were pushed in left-to-right order, so the rightmost argument is on top of the stack and is popped first.

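So to summarize, CALL_FUNCTION will here pop the 2 arguments (the reference to name and 'Hello,') off the stack, pop the callable print that sits below them, call print('Hello,', name), and push the return value (None) onto the stack.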

Awesome, let's move to the next line in the disassembly

              8 POP_TOP

The instruction POP_TOP will simply remove the item on the top of the stack, so we've now removed the value that was returned by our last CALL_FUNCTION instruction.

The final 2 lines in our disassembly are

             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE

The first is a LOAD_CONST as before, this time loading the constant None. The second returns the value on top of the stack, i.e. it returns None to the caller of the function.
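To make the walk-through concrete, here is a toy interpreter for the seven instructions above, using a plain Python list as the stack. It is only a sketch of the idea, not how CPython is implemented; run, fake_print and the argument "Max" are invented for this example, and fake_print stands in for the real print so that we can record what it was called with.

```python
def run(instructions, consts, names, global_vars, local_vars, varnames):
    """Interpret the few opcodes used by hello() with a list as the stack."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_GLOBAL":
            stack.append(global_vars[names[arg]])
        elif opname == "LOAD_CONST":
            stack.append(consts[arg])
        elif opname == "LOAD_FAST":
            stack.append(local_vars[varnames[arg]])
        elif opname == "CALL_FUNCTION":
            # Pop `arg` positional arguments; the rightmost one is on top,
            # so the popped list has to be reversed before the call.
            args = [stack.pop() for _ in range(arg)][::-1]
            func = stack.pop()
            stack.append(func(*args))
        elif opname == "POP_TOP":
            stack.pop()
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError("unsupported opcode: " + opname)


calls = []


def fake_print(*args):
    """Stand-in for print that records its arguments."""
    calls.append(args)


# The disassembly of hello() from above, as (opname, argument) pairs.
program = [
    ("LOAD_GLOBAL", 0),
    ("LOAD_CONST", 1),
    ("LOAD_FAST", 0),
    ("CALL_FUNCTION", 2),
    ("POP_TOP", 0),
    ("LOAD_CONST", 0),
    ("RETURN_VALUE", 0),
]

returned = run(
    program,
    consts=(None, "Hello,"),
    names=("print",),
    global_vars={"print": fake_print},
    local_vars={"name": "Max"},
    varnames=("name",),
)
print(returned, calls)
```

After the run, calls holds the single call ('Hello,', 'Max') and returned is None, matching the behaviour of hello("Max").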

WOOHOO! We've now compiled Python code into bytecode, disassembled it, and explored the disassembly. The dis module documentation has information on more bytecode instructions if you want to dig deeper.

Decompilation

Suppose you stumble upon Python bytecode in the wild, maybe from malware, maybe from proprietary code. You want to understand what's going on, but you are not in the mood to read disassembly. What can you do? Well, you can use a decompiler. In this case we'll use uncompyle6, so go ahead and install it.

First, let's start by creating a file hello.py which contains the hello function

def hello(name):
    print('Hello,', name)

Now we want to compile this file into python bytecode, let's do this by running this in a shell

>: python -m compileall .

You will now find a __pycache__ directory in your current working directory. Change into it and run uncompyle6 on the compiled file:

>: cd __pycache__
>: uncompyle6 hello.cpython-36.pyc

This should output

# ...
# ...

def hello(name):
    print('Hello,', name)
# okay decompiling hello.cpython-36.pyc

Voila, you now have your source code back.
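If you only want to inspect a .pyc file rather than recover its source, the standard library is enough to load it back into a code object. The sketch below is an illustration under a few assumptions: it writes its own hello.py into a temporary directory, compiles it with py_compile instead of compileall, and assumes the .pyc header is 16 bytes on Python 3.7 and later and 12 bytes on 3.3 through 3.6.

```python
import marshal
import os
import py_compile
import sys
import tempfile

source = "def hello(name):\n    print('Hello,', name)\n"

with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "hello.py")
    pyc_path = os.path.join(workdir, "hello.pyc")
    with open(src_path, "w") as f:
        f.write(source)

    # Compile the source file to an explicit .pyc path.
    py_compile.compile(src_path, cfile=pyc_path, doraise=True)

    with open(pyc_path, "rb") as f:
        data = f.read()

# Skip the header (magic number and source metadata); the rest is a
# marshalled code object for the whole module.
header_size = 16 if sys.version_info >= (3, 7) else 12
module_code = marshal.loads(data[header_size:])

# The code object of hello() is stored among the module's constants.
function_codes = [c for c in module_code.co_consts if hasattr(c, "co_code")]
names = [c.co_name for c in function_codes]
print(names)
```

The resulting code objects can be passed to dis.dis just like hello.__code__ earlier. Note that marshal data is specific to a Python version, so a .pyc file should be loaded with the same version that produced it.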

References