Disassembler (Part 1)

Editorial Note: This article is the first in a three-part series on writing a disassembler. Today we’ll cover the high-level concepts involved in disassembly and see how to read machine code “by hand”. Next week, we’ll look at the issues involved in finding and/or constructing an opcode map. Finally, in week 3, we’ll build a disassembler for 8086 integer instructions, using Python.

Over the last few weeks, we’ve seen how to use a debugger (such as DOS DEBUG) to find and examine interesting portions of an executable’s machine code. Now it’s time to begin considering how the debugger performs some of its tricks. I don’t want to launch into a full discussion of debugger programming (yet!), but I do want to talk a little about how to produce an assembly-language view of an executable’s instructions.

Executable structure

To begin at the beginning, it’s important to understand that there is nothing magical about an executable. An executable is simply a collection of machine code and program data, prefixed with some header records specific to the operating system (OS) on which the executable is meant to be run.

Every OS has its own set of executable headers, but let’s just consider DOS for now; DOS headers are simple, and are still present (in a vestigial form) in modern Windows executables. Every DOS executable begins with an MZ header, as described on (for example) the following webpages:

The following Python class will, given a DOS executable’s filename, parse its MZ header into an object with several useful methods. (Since WordPress mangles code, I’m also making this available as a download.)

import struct

class MZ_Header:
    # Each entry is (struct format code, display format, attribute name).
    # Header words are unsigned 16-bit values, hence 'H' rather than 'h'.
    _fields = [('c', '%c',          'SigChar0'),
               ('c', '%c',          'SigChar1'),
               ('H', '%d',          'BytesOnLastPage'),
               ('H', '%d (*512)',   'PagesInExe'),
               ('H', '%d',          'RelocationCount'),
               ('H', '%d (*16)',    'ParagraphsInHeader'),
               ('H', '%d (*16)',    'MinAddlParagraphs'),
               ('H', '%d (*16)',    'MaxAddlParagraphs'),
               ('H', '0x%04x',      'InitialRelativeSS'),
               ('H', '0x%04x',      'InitialSP'),
               ('H', '0x%04x',      'Checksum'),
               ('H', '0x%04x',      'InitialIP'),
               ('H', '0x%04x',      'InitialRelativeCS'),
               ('H', '0x%04x',      'RelocationOffset'),
               ('H', '%d',          'OverlayNumber')]

    def __init__(self, fn):
        self._fn = fn
        # '<' selects little-endian byte order with no padding
        fmt = '<' + ''.join(p[0] for p in self._fields)
        with open(self._fn, 'rb') as f:
            data = struct.unpack_from(fmt, f.read(struct.calcsize(fmt)))
        for (code, _, name), value in zip(self._fields, data):
            if code == 'c':
                # Store the signature bytes as one-character strings
                value = value.decode('latin-1')
            setattr(self, name, value)

    def print_table(self):
        col_len = max(len(p[2]) for p in self._fields)
        print('\n'.join(('%-*s: '+p[1]) % (col_len, p[2], getattr(self, p[2])) for p in self._fields))

    def check_signature(self):
        # The first two bytes in an MZ header are always 'M' and 'Z'
        return self.SigChar0 + self.SigChar1 == 'MZ'

    def calc_length(self):
        # Find the overall length of the executable, including all headers
        if self.BytesOnLastPage:
            # A non-zero BytesOnLastPage means the final 512-byte page is only partially used
            return (self.PagesInExe-1)*512 + self.BytesOnLastPage
        else:
            return self.PagesInExe*512

    def calc_code_start(self):
        # Return the file offset of the start of machine code and program data
        return self.ParagraphsInHeader*16

    def calc_first_instruction_offset(self):
        # Return the file offset of the first instruction executed when the program runs.
        # CS:IP is relative to the start of the load image; real-mode addresses wrap at 1 MB.
        return self.calc_code_start() + ((self.InitialRelativeCS*16 + self.InitialIP) & 0xFFFFF)
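
As a quick check of the arithmetic behind calc_first_instruction_offset(), here is a minimal standalone sketch. It packs a made-up MZ header and computes the entry-point offset in the same way. The field values are invented for illustration and are not taken from any real executable, and the names MZ_FORMAT and first_instruction_offset are hypothetical.

```python
import struct

# Little-endian MZ header layout: 2 signature bytes followed by 13 unsigned words.
MZ_FORMAT = '<2s13H'

def first_instruction_offset(header_bytes):
    """Return the file offset of the entry point described by an MZ header."""
    fields = struct.unpack_from(MZ_FORMAT, header_bytes)
    if fields[0] != b'MZ':
        raise ValueError('not an MZ executable')
    paragraphs_in_header = fields[4]
    initial_ip = fields[10]
    initial_relative_cs = fields[11]
    # CS:IP is relative to the load image, which starts right after the header.
    return paragraphs_in_header * 16 + ((initial_relative_cs * 16 + initial_ip) & 0xFFFFF)

# Synthetic header (assumed values): 2 header paragraphs, CS = 0x0001, IP = 0x0010.
header = struct.pack(MZ_FORMAT, b'MZ',
                     0, 2, 0, 2, 0, 0xFFFF, 0, 0x100, 0, 0x0010, 0x0001, 0x1C, 0)
print(hex(first_instruction_offset(header)))
```

Here the header occupies 32 bytes, the relative code segment adds 16 bytes and the instruction pointer another 16, so the sketch reports 0x40.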

Machine code

If we use the calc_first_instruction_offset() method of the preceding class, we can find the offset (within the executable file) of the first instructions executed when the program is loaded and run by the DOS operating system. For the Neuromancer executable discussed in previous weeks, for example, the first bytes of machine code are located at offset 0x16f72. They are:

  • 8c c0 05 10 00 0e 1f a3 04 00 03 06 0c 00 8e c0 ...

What do these bytes mean?

As Dave Touretzky has pointed out, there’s no real difference between machine language (the instructions that a modern CPU natively executes, represented as a series of 8-bit numbers) and any other programming language; machine language is just a little harder to read. Understanding it requires patience, and the right reference.

Finding the right reference is largely a matter of knowing which CPU architecture the machine language was written for, as each type of CPU has its own dialect. It can also be important to know what CPU mode the machine language was written for: modern x86 CPUs, for instance, can be configured to use 16- or 32-bit operands and addressing by default, and the same sequence of machine language bytes may mean different things depending upon the CPU’s state. (Matters become even more complex when 64-bit instructions are introduced.)

Since we’re looking at a DOS (i.e. x86 real-mode) executable, a good reference is the Instruction Set Reference (ISR) volume from the Intel Architecture Software Developer’s Manual. This is a formidable volume, but only a few pages are immediately interesting for our purposes:

  • Pages 2-1 through 2-6 describe the basic layout of x86 machine language instructions. (Note that since we’re dealing with real-mode machine language, we’re only interested in 16-bit addressing modes.)
  • Pages A-1 through A-8 give the processor’s opcode map. (Note that since we’re dealing with such an old program, we can assume that it only uses 8086 integer opcodes; this means that we can ignore all two-byte and escape opcodes in the opcode map.)

Disassembly (first instruction)

The first byte of the first instruction executed when the Neuromancer program is run is 8c. Page A-5 of the ISR tells us that this is a MOV Ew, Sw instruction. Pages A-1 and A-2 of the ISR tell us that Ew denotes a WORD-sized general-purpose register or memory operand specified by a ModR/M byte following the primary opcode. These pages also tell us that Sw denotes a WORD-sized segment register specified by the reg field of a ModR/M byte following the primary opcode.

The second byte (the ModR/M byte) of the first instruction executed when the Neuromancer program is run is c0. If we decompose this byte into mod, reg, and r/m portions (as shown on page 2-1 of the ISR) we see that:

  • mod is 3
  • reg is 0
  • r/m is 0

Page 2-4 tells us that a mod of 3 and an r/m of 0 select the general-purpose register AX (when a 16-bit operand is being selected). Page B-4 tells us that a reg of 0 selects the segment register ES. Putting it all together, the first instruction is made up of the bytes 8c c0, and would be represented in assembly language as MOV AX, ES.
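
The decomposition above is just bit slicing: mod is the top two bits of the ModR/M byte, reg the middle three, and r/m the bottom three. A minimal sketch follows; the function name split_modrm and the table names are hypothetical, while the register orderings are the standard x86 encodings.

```python
# Register names indexed by their 3-bit (or 2-bit) encodings.
REG16 = ['AX', 'CX', 'DX', 'BX', 'SP', 'BP', 'SI', 'DI']
SREG = ['ES', 'CS', 'SS', 'DS']

def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, r/m) fields."""
    mod = (byte >> 6) & 0b11    # bits 7-6
    reg = (byte >> 3) & 0b111   # bits 5-3
    rm = byte & 0b111           # bits 2-0
    return mod, reg, rm

mod, reg, rm = split_modrm(0xC0)   # 0xC0 is 11 000 000 in binary
print(mod, reg, rm)
# With mod == 3 the r/m field names a register, so opcode 8c reads as:
print('MOV %s, %s' % (REG16[rm], SREG[reg]))
```

For the byte c0 this yields the fields 3, 0 and 0, and so MOV AX, ES.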

More disassembly (second instruction)

The first byte of the next instruction (at offset 2) is 05. On page A-4 of the ISR, we see that this is an ADD eAX, Iv instruction. Page A-3 of the ISR tells us that eAX denotes either the EAX or AX register, depending upon the operand size attribute, while pages A-1 and A-2 tell us that Iv denotes either a WORD or a DWORD (depending, again, upon the operand size attribute) encoded in the bytes at the end of the instruction, as shown on page 2-1.

Since we are dealing with real-mode code (and no operand-size prefix is present), the operand-size attribute is 16 bits. The next two bytes of machine code are 10 00, representing the 16-bit number 0x0010 in little-endian (least significant byte first) order. The overall interpretation of the program’s second instruction, then, is as the assembly language operation ADD AX, 0010.
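
To make the role of the operand-size attribute concrete, here is a small sketch that reads an Iv immediate from the bytes we are examining. The helper name read_immediate is hypothetical. It relies on Python's struct module, where '<H' and '<I' denote little-endian 16- and 32-bit unsigned integers.

```python
import struct

def read_immediate(code, pos, operand_size=16):
    """Read a little-endian immediate at code[pos]; return (value, next_pos)."""
    fmt = '<H' if operand_size == 16 else '<I'
    (value,) = struct.unpack_from(fmt, code, pos)
    return value, pos + struct.calcsize(fmt)

code = bytes.fromhex('8c c0 05 10 00 0e 1f a3 04 00 03 06 0c 00 8e c0')

# The 05 opcode sits at offset 2, so its immediate starts at offset 3.
value16, next16 = read_immediate(code, 3, 16)
value32, next32 = read_immediate(code, 3, 32)
print('%04x' % value16, next16)
print('%08x' % value32, next32)
```

With a 16-bit operand size the immediate is 0x0010 and the next instruction starts at offset 5. If the same bytes were decoded with a 32-bit operand size, the immediate would swallow the 0e and 1f bytes as well (giving 0x1f0e0010), which is why knowing the CPU mode matters.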

More disassembly (instructions 3 and 4)

The first byte of the next instruction (at offset 5) is 0e. On page A-5 of the ISR, we see that this is a PUSH CS instruction. Simple enough.

The first byte of the next instruction (at offset 6) is 1f. On page A-5 of the ISR, we see that this is a POP DS instruction. Also simple. Instructions 3 and 4 have the effect of copying the value of CS to DS.
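
One-byte instructions like these need no operand decoding at all, so a plain lookup table is enough. The sketch below covers only the segment-register PUSH and POP opcodes of the 8086; the names SEGMENT_STACK_OPCODES and decode_segment_stack_op are hypothetical.

```python
# One-byte PUSH/POP segment-register opcodes from the 8086 opcode map.
SEGMENT_STACK_OPCODES = {
    0x06: 'PUSH ES', 0x07: 'POP ES',
    0x0E: 'PUSH CS',
    0x16: 'PUSH SS', 0x17: 'POP SS',
    0x1E: 'PUSH DS', 0x1F: 'POP DS',
}

def decode_segment_stack_op(opcode):
    """Return the mnemonic for a one-byte PUSH/POP sreg opcode, or None."""
    return SEGMENT_STACK_OPCODES.get(opcode)

code = bytes.fromhex('0e 1f')
print([decode_segment_stack_op(b) for b in code])
```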

More disassembly (instruction 5)

The first byte of the next instruction (at offset 7) is a3. On page A-4 of the ISR, we see that this is a MOV Ov, eAX instruction. We already know that eAX denotes the AX register, and page A-2 of the ISR tells us that Ov denotes a WORD or a DWORD (depending upon the operand size attribute) stored in memory, the address of which is encoded in the bytes at the end of the instruction, as shown on page 2-1. Since we are dealing with real-mode code, the address of the operand will be 16 bits long. The next two bytes of machine code are 04 00, representing the 16-bit number 0x0004 in little-endian (least significant byte first) order. The overall interpretation of the program’s fifth instruction, then, is as the assembly language operation MOV [0004], AX.

More disassembly (instruction 6)

The first byte of the next instruction (at offset a) is 03. On page A-4 of the ISR, we see that this is an ADD Gv, Ev instruction. Pages A-1 and A-2 of the ISR tell us that Ev denotes a WORD- or DWORD-sized (depending upon the operand size attribute) general-purpose register or memory operand specified by a ModR/M byte following the primary opcode. Those pages also tell us that Gv represents a WORD- or DWORD-sized (depending upon the operand size attribute) general-purpose register specified by the reg field of a ModR/M byte following the primary opcode.

The second byte of this instruction (the ModR/M byte, at offset b) is 06. If we decompose this byte into mod, reg, and r/m portions (as shown on page 2-1 of the ISR) we see that:

  • mod is 0
  • reg is 0
  • r/m is 6

Page 2-4 tells us that a mod of 0 and an r/m of 6 select a memory operand given by a 16-bit displacement following the ModR/M byte. This page also tells us that a reg of 0 selects the general-purpose register AX (when a 16-bit operand is being selected). The two bytes following the ModR/M byte are 0c 00, representing the 16-bit number 0x000c in little-endian order. The program’s sixth instruction is equivalent to the assembly language operation ADD AX, [000c].
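
The mod 0, r/m 6 case is the one irregular entry in the 16-bit addressing table: every other combination names a base and/or index register, optionally followed by a displacement. The sketch below summarises that table as a lookup. The function name describe_memory_operand and the disp8/disp16 placeholder strings are hypothetical; the register combinations follow the standard 16-bit ModR/M encoding.

```python
# 16-bit addressing forms indexed by the r/m field, for mod = 0, 1 or 2.
RM16_BASES = ['BX+SI', 'BX+DI', 'BP+SI', 'BP+DI', 'SI', 'DI', 'BP', 'BX']

def describe_memory_operand(mod, rm):
    """Describe the memory operand selected by mod (0-2) and r/m (0-7)."""
    if mod == 3:
        raise ValueError('mod 3 selects a register, not memory')
    if mod == 0 and rm == 6:
        return '[disp16]'   # direct address: no base register at all
    # mod 0: no displacement; mod 1: 8-bit displacement; mod 2: 16-bit displacement
    suffix = ['', '+disp8', '+disp16'][mod]
    return '[%s%s]' % (RM16_BASES[rm], suffix)

print(describe_memory_operand(0, 6))
print(describe_memory_operand(0, 7))
print(describe_memory_operand(1, 6))
```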

More disassembly (instruction 7)

The first byte of the next instruction (at offset e) is 8e. On page A-5 of the ISR, we see that this is a MOV Sw, Ew instruction. We already know that Sw denotes a WORD-sized segment register specified by the reg field of a ModR/M byte following the primary opcode, and that Ew denotes a WORD-sized general-purpose register or memory operand specified by a ModR/M byte following the primary opcode. The next byte of machine code (at offset f) is c0. This decomposes as:

  • a mod of 3
  • a reg of 0
  • an r/m of 0

A mod of 3 and an r/m of 0 select AX (when a 16-bit operand is being selected), while a reg of 0 selects the segment register ES. This instruction is therefore equivalent to the assembly language operation MOV ES, AX.

Disassembly summary

The first seven instructions executed when the Neuromancer program is run are:

MOV        AX, ES
ADD        AX, 0010
PUSH       CS
POP        DS
MOV        [0004], AX
ADD        AX, [000c]
MOV        ES, AX
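
The hand decoding above can be replayed mechanically. The following is a deliberately tiny sketch, not the disassembler this series will build: it knows only the six opcodes met in this article and only the two ModR/M forms we actually encountered (mod 3, and mod 0 with r/m 6). All function names are hypothetical.

```python
import struct

REG16 = ['AX', 'CX', 'DX', 'BX', 'SP', 'BP', 'SI', 'DI']
SREG = ['ES', 'CS', 'SS', 'DS']

def word_at(code, pos):
    """Read a little-endian 16-bit value at code[pos]."""
    return struct.unpack_from('<H', code, pos)[0]

def modrm_operands(code, pos):
    """Decode the ModR/M byte at code[pos]; return (reg, r/m text, next_pos)."""
    modrm = code[pos]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:                      # r/m names a register
        return reg, REG16[rm], pos + 1
    if mod == 0 and rm == 6:          # direct 16-bit address
        return reg, '[%04x]' % word_at(code, pos + 1), pos + 3
    raise NotImplementedError('addressing form not covered by this sketch')

def disassemble_one(code, pos):
    """Decode one instruction at code[pos]; return (text, next_pos)."""
    opcode = code[pos]
    if opcode == 0x8C:                # MOV Ew, Sw
        reg, ew, nxt = modrm_operands(code, pos + 1)
        return 'MOV %s, %s' % (ew, SREG[reg]), nxt
    if opcode == 0x8E:                # MOV Sw, Ew
        reg, ew, nxt = modrm_operands(code, pos + 1)
        return 'MOV %s, %s' % (SREG[reg], ew), nxt
    if opcode == 0x03:                # ADD Gv, Ev
        reg, ev, nxt = modrm_operands(code, pos + 1)
        return 'ADD %s, %s' % (REG16[reg], ev), nxt
    if opcode == 0x05:                # ADD eAX, Iv
        return 'ADD AX, %04x' % word_at(code, pos + 1), pos + 3
    if opcode == 0xA3:                # MOV Ov, eAX
        return 'MOV [%04x], AX' % word_at(code, pos + 1), pos + 3
    if opcode == 0x0E:
        return 'PUSH CS', pos + 1
    if opcode == 0x1F:
        return 'POP DS', pos + 1
    raise NotImplementedError('opcode %02x not covered by this sketch' % opcode)

def disassemble(code):
    """Decode instructions back to back until the bytes run out."""
    pos, lines = 0, []
    while pos < len(code):
        text, pos = disassemble_one(code, pos)
        lines.append(text)
    return lines

code = bytes.fromhex('8c c0 05 10 00 0e 1f a3 04 00 03 06 0c 00 8e c0')
print('\n'.join(disassemble(code)))
```

Run on the sixteen bytes from the start of this article, the sketch prints the same seven instructions. Anything outside those few opcodes raises NotImplementedError, which is exactly the gap a proper opcode map is meant to fill.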

Conclusion

There is nothing all that special about an executable file. It can be read and understood by anyone with patience, who knows:

  • Which OS the executable is targeting
  • Which CPU (and CPU mode) the executable is targeting
  • The executable header format of the targeted OS
  • The instruction format of the targeted CPU mode

… and who has an appropriate opcode map. The only software required is a simple hex editor, or some other means of viewing the raw bytes which make up a file. Other software can be helpful but, as we’ll see, it needn’t be very complicated. Next week, we’ll look at the issue of opcode maps a little more closely. Until then, remember that anyone who says that you’re not allowed to “reverse engineer” software is really just saying that you’re not allowed to look at it.

And I think it’s long past time to retire the notion of forbidden texts.
