I often use the DOS DEBUG command to investigate older 16-bit programs; while there are doubtless better tools, DEBUG is ubiquitous and effective. There are many guides to its operation, but here I focus on a specific and practical question: How does one translate between file offsets and memory addresses?
It is sometimes necessary to know the memory address to which a particular part of an executable was loaded. (Or, conversely, the file offset from which a particular region of memory was loaded.) For instance, if one wishes to permanently alter a program’s opcodes, one must know where those opcodes are located in the executable, and not merely where they reside in memory.
DOS DEBUG works with a 16-bit segmented memory model. Under this model, addressable memory is limited to 1M bytes, from 0x00000 to 0xFFFFF. Memory addresses are represented by Segment:Offset pairs, where the 16-bit Segment is drawn from a Segment Register (e.g. CS, SS, DS, or ES) and the 16-bit Offset is drawn from another register (e.g. BX), an immediate, or some combination thereof.
A physical memory address is computed from a segmented address with the following formula:
- Physical_Address = Segment*16 + Offset
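As a quick illustration of this formula, here is a minimal Python sketch. The function name `physical_address` is my own choice for this example, and the sketch makes no attempt to model what happens above the 1M boundary.

```python
def physical_address(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a physical address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return segment * 16 + offset


print(hex(physical_address(0x3246, 0x0000)))  # 0x32460
print(hex(physical_address(0x3000, 0x2460)))  # 0x32460
```

Multiplying by 16 is the same as appending one hexadecimal zero to the segment, which is why the arithmetic is easy to do by eye in DEBUG.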
The most immediately relevant consequence of this is that a single physical (or, if running on virtualized hardware, “physical”) memory location can be addressed with many (approximately 4K) different Segment:Offset pairs. For instance, 3246:0000, 3245:0010, and 3000:2460 all refer to the same physical address (i.e. 0x32460).
Therefore, when discussing the memory address to which a particular file offset maps, we are concerned with *physical* locations, which are only incidentally represented by Segment:Offset pairs.
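To make the aliasing concrete, the following sketch enumerates every Segment:Offset pair for one physical address. The helper name `segmented_aliases` and the sample address 0x32460 are illustrative choices of mine, not anything DEBUG provides.

```python
def segmented_aliases(physical):
    """Yield every (segment, offset) pair that maps to the given physical address."""
    # Smallest segment that still keeps the offset <= 0xFFFF (ceiling division).
    lowest = max(0, (physical - 0xFFFF + 15) // 16)
    # Largest segment that still keeps the offset >= 0.
    highest = min(0xFFFF, physical // 16)
    for segment in range(lowest, highest + 1):
        yield segment, physical - segment * 16


aliases = list(segmented_aliases(0x32460))
print(len(aliases))                                   # 4096
print(["%04X:%04X" % pair for pair in aliases[-3:]])  # ['3244:0020', '3245:0010', '3246:0000']
```

Addresses very near the bottom of memory have fewer aliases, since the segment cannot go below zero.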
The MZ Header
DOS executables in the .EXE format begin with an MZ header (plain .COM files have no header at all). For our purposes, we are interested in two values:
|Field|Type|File offset|
|---|---|---|
|Header size in paragraphs|unsigned short|0x08|
|Initial relative CS|unsigned short|0x16|
The first value represents the executable’s header length as a number of 16-byte ‘paragraphs’. The second value is discussed in the next section.
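Here is a small sketch of pulling those two values out of a header. It assumes the standard MZ layout, with the header size at file offset 0x08 and the initial relative CS at 0x16, both little-endian. The function name `parse_mz_fields` and the synthetic header are made up for the demonstration.

```python
import struct


def parse_mz_fields(header_bytes):
    """Return (header_paragraphs, initial_relative_cs) from the start of an EXE image."""
    if header_bytes[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (header_paragraphs,) = struct.unpack_from("<H", header_bytes, 0x08)
    (initial_relative_cs,) = struct.unpack_from("<H", header_bytes, 0x16)
    return header_paragraphs, initial_relative_cs


# Synthetic header for demonstration only: 0x20 paragraphs, relative CS 0x0010.
demo = bytearray(0x200)
demo[0:2] = b"MZ"
struct.pack_into("<H", demo, 0x08, 0x20)
struct.pack_into("<H", demo, 0x16, 0x0010)

header_paragraphs, initial_relative_cs = parse_mz_fields(bytes(demo))
print(header_paragraphs * 16, hex(initial_relative_cs))  # 512 0x10
```

Multiplying the paragraph count by 16 gives the header length in bytes, which is also the file offset of the first byte DOS copies into memory.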
When DOS loads an executable into memory, it performs two actions of immediate interest to us:
- Copy a contiguous area of the executable to a contiguous area of memory
- Initialize registers, particularly the CS register
When DOS copies the executable, it first skips a number of header bytes (as indicated by the MZ header) and then (normally) continues reading until the end of the file. DOS begins writing at a “segment-aligned” address, which is a physical address representable by a segmented address with a zero Offset (e.g. 3246:0000, or physical address 0x32460).
During register initialization, the CS register is computed based upon the memory address to which DOS began writing the executable, and a value (the “Initial relative CS”) taken from the MZ header. Specifically, the initial CS value is computed according to this formula:
- Initial_CS = Physical_Start_Of_Executable/16 + Initial_Relative_CS
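The formula can be run in either direction, which is what makes it useful: DEBUG shows us CS, and we want the load address. A minimal sketch follows; the function names and the sample values (load address 0x32460, relative CS 0x0010) are hypothetical.

```python
def initial_cs(physical_start, initial_relative_cs):
    """CS at program start: Initial_CS = Physical_Start/16 + Initial_Relative_CS."""
    if physical_start % 16 != 0:
        raise ValueError("load address must be segment-aligned")
    return physical_start // 16 + initial_relative_cs


def physical_start_from_cs(cs, initial_relative_cs):
    """Invert the formula: recover the physical load address from CS."""
    return (cs - initial_relative_cs) * 16


cs = initial_cs(0x32460, 0x0010)
print("%04X" % cs)                              # 3256
print(hex(physical_start_from_cs(cs, 0x0010)))  # 0x32460
```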
These facts allow us to translate between file offsets and memory addresses.
Since an executable file is read into memory as one contiguous chunk, we really only need one number – which we may call a Translator – to map file offsets to physical memory addresses, and vice-versa. This number will be employed in these formulas:
- File_Offset = Physical_Address - Translator
- Physical_Address = File_Offset + Translator
Given these formulas, it’s obvious that the Translator can be computed from the difference between the physical address to which DOS began writing the executable, and the file offset from which DOS began reading it.
Putting together all the preceding facts, the formula for the Translator is:
- Translator = (Initial_CS - Initial_Relative_CS)*16 - Header_Size*16
Where “Initial_CS” is taken from the output of DOS DEBUG’s “r” command immediately after the executable is loaded, and “Header_Size” is the raw paragraph count from the executable’s MZ header.
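A worked example may help. The numbers below are invented for illustration (CS of 3256 as reported by DEBUG, a relative CS of 0x0010, and a header of 0x20 paragraphs), and the function names are my own.

```python
def compute_translator(initial_cs, initial_relative_cs, header_paragraphs):
    """Translator = (Initial_CS - Initial_Relative_CS)*16 - Header_Size*16."""
    return (initial_cs - initial_relative_cs) * 16 - header_paragraphs * 16


def memory_to_offset(segment, offset, translator):
    """File offset corresponding to a Segment:Offset memory address."""
    return segment * 16 + offset - translator


def offset_to_physical(file_offset, translator):
    """Physical memory address corresponding to a file offset."""
    return file_offset + translator


translator = compute_translator(0x3256, 0x0010, 0x20)
print(hex(translator))                                    # 0x32260
print(hex(memory_to_offset(0x3256, 0x0100, translator)))  # 0x400
print(hex(offset_to_physical(0x200, translator)))         # 0x32460
```

The last line is a useful sanity check: the first byte after the 0x200-byte header lands exactly at the segment-aligned load address.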
This Python code captures the preceding discussion: Use the ComputeTranslator() function to, well, compute a translator for an executable (you must supply the “initialCS” value by running DOS DEBUG), and then use the MemoryToOffset() and OffsetToMemory() functions to translate between file offsets and memory addresses.
    # File Offset <-> Memory address computer
    import struct

    # Pull relevant values from an executable's MZ header:
    # header size in paragraphs (offset 0x08) and initial relative CS (offset 0x16)
    def ScanMzValues(fn):
        with open(fn, 'rb') as f:
            return struct.unpack('<8xH12xH', f.read(24))

    # Display some values from an executable's MZ header
    def FmtMzValues(hLen, csOff):
        return 'Header length: 16*%-3d, CS Offset: %04x' % (hLen, csOff)

    # Given an initial CS value and an executable's filename, compute the Translator value
    def ComputeTranslator(fn, initialCS):
        hLen, csOff = ScanMzValues(fn)
        return (initialCS - csOff - hLen)*16

    # Given a physical address, format it as a legal Segment:Offset pair, if possible
    # Since a physical address may be represented by many Segment:Offset pairs,
    # a value must be supplied for the segment
    def PhysicalToSegmented(addy, segment):
        addy = int(addy)
        segment = int(segment)
        offset = addy - segment*16
        if 0 <= offset <= 0xffff and 0 <= segment <= 0xffff:
            return '%04X:%04X' % (segment, offset)
        else:
            return 'Invalid address/segment pairing'

    # Given a segmented memory address and a translator, compute a file offset
    def MemoryToOffset(segment, offset, translator):
        return segment*16 + offset - translator

    # Given a file offset and a translator, compute a memory address
    # If a segment is given (segment 0 included), attempt conversion to a segmented memory address
    def OffsetToMemory(offset, translator, segment=None):
        if segment is not None:
            return PhysicalToSegmented(offset + translator, segment)
        else:
            return offset + translator
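As a sanity check on the whole approach, here is a toy simulation that builds a fake executable, "loads" it the way the text describes, and confirms that the Translator maps every file byte to the right memory byte. This is only a sketch: the constants are hypothetical, and the loader is a deliberately simplified model of the copy step described above, not a reimplementation of the real DOS loader.

```python
import struct

HEADER_PARAGRAPHS = 2   # hypothetical: a 32-byte header
RELATIVE_CS = 0x0001    # hypothetical initial relative CS
LOAD_SEGMENT = 0x3246   # hypothetical segment where the image starts

# Build a toy "executable": an MZ header followed by a recognizable body.
header = bytearray(HEADER_PARAGRAPHS * 16)
header[0:2] = b"MZ"
struct.pack_into("<H", header, 0x08, HEADER_PARAGRAPHS)
struct.pack_into("<H", header, 0x16, RELATIVE_CS)
body = bytes(range(256))
exe = bytes(header) + body

# Toy loader: skip the header, copy the rest to LOAD_SEGMENT:0000.
memory = bytearray(0x100000)
start = LOAD_SEGMENT * 16
memory[start:start + len(body)] = exe[HEADER_PARAGRAPHS * 16:]
initial_cs = LOAD_SEGMENT + RELATIVE_CS  # stands in for the CS shown by DEBUG's "r"

translator = (initial_cs - RELATIVE_CS) * 16 - HEADER_PARAGRAPHS * 16

# Every byte after the header should appear at file_offset + translator.
for file_offset in range(HEADER_PARAGRAPHS * 16, len(exe)):
    assert memory[file_offset + translator] == exe[file_offset]

print("translator = 0x%X" % translator)  # translator = 0x32440
```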