Local Register Allocator for 8086
Table of contents
What is this?
This is an implementation of a greedy bottomup local register allocator for the intel 8086 CPU. It may be used as part of a simple compiler. In particular, it can be used in a C compiler whose int
is 16bit and char
is 8bit.
To make things interesting and interactive it comes with a code generator capable of generating 8086 assembly code (compilable into DOS .COM programs) from trees representing expressions with 16bit integer ALU operations, constants and memory loads and stores. A few extra instructions generated by the code check the results of expression evaluation in the output register and the output memory locations, if any.
Thus you can actually execute the generated code and tweak things around to see how your changes affect code generation.
Sample assembly output from the register allocator/code generator:
; 7 (vr4)
; mul (vr5)
; 5 (vr3)
; add (vr6)
; 3 (vr1)
; mul (vr2)
; 2 (vr0)
; 
; Regs needed (approximately): 3
; 
; ; vr0
mov ax, 2 ; vr0
mov cx, 3 ; vr1
mul cx ; vr2
mov cx, 5 ; vr3
mov dx, 7 ; vr4
xchg ax, cx ; vr5
mul dx ; vr5
add cx, ax ; vr6
; ; vr6
cmp cx, 41 ; vr6
jne failure ; vr6
N.B. This register allocator will not allocate registers perfectly in all circumstances. It will occasionally generate unnecessary register moves and exchanges. Bear in mind, optimal register allocation is an NPcomplete problem.
Motivation
The intel 8086 architecture has a number of instructions with fixed input and/or output registers and it's not entirely trivial to connect the ins and outs of these instructions. There's obviously a long history of implementing compilers for this and similar architectures, but it's a bit odd that it's hard to find a conceptually simple algorithm like this. The "Dragon" book and many other sources either consider architectures without instructions with such fixed register operands or solve the problem using graph coloring, which for certain reasons may be impractical.
The intel 8086 architecture survives to this day in the modern intel Pentium CPUs that are compatible descendants of the old 8086 and with it survive some of the instructions with fixed in/out register operands. So, the problem persists to this day, although there are fewer limitations to deal with when using newer 32bit and 64bit x86 instructions and registers today.
8086 registers by special uses in expression evaluation (pardon the mixture of C and assembly operators, instructions and syntax and ASCII art):
+ can be freely used in most ALU instructions as src/dst
 + can be used as address to access memory
  + has individual 8bit subregisters
   + can be signextended with cbw
    + can be shifted
     + can be shift count
v v v v v v
+<&^~ [r] 8bit cbw <<dst <<cnt *dst *src /dvd /dvsr /quo /rem
ax + + + + fix+ fix+ fix+ fix+
bx + + + + + +
cx + + fix+ + +
dx + + + + fix+
si + + + + +
di + + + + +
^ ^ ^ ^ ^ ^
can be product +     
can be multiplicand +    
can be dividend +   
can be divisor +  
can be quotient + 
can be remainder +
Scope
Ouside of the scope of this work are:
 data types other than 8bit bytes and 16bit words (e.g. floats, structs)
 conditional/ternary operators like in C/C++
 function calls and calling conventions
 use of immediate and memory operands directly in ALU instructions (we use separate load/store instructions instead)
These are left as an exercise to the reader.
Basic algorithm
Before we get to the 8086 specifics, let's describe the algorithm for an architecture, where there aren't instructions with fixed register operands. We'll be building on top of it.
Let's also consider a case, where our expressions consist of only binary operators (unary operators will be a trivial specialization), which are realized as 3register instructions in the CPU, that is, 2 input registers and 1 output register, much like in the MIPS architecture (e.g. sub r2, r3, r4
will do r2 = r3  r4;
).
Suppose we have an expression tree like this:
op

op op
+ 
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
op
and nm
represent operator and numeric nodes in the tree.
Our hardware register allocation will be done recursively from the root towards the leaves:
AllocHRegs(node):
recursively call AllocHRegs() for both input/child nodes
Ensure() that both input/child values are in registers
Free() both input registers
Allocate() one output register
generate an instruction using these 3 register operands
For the numeric leaf node there will only be the last two steps: Allocate() for the output register and the generation of an instruction to load an integer constant into that register.
The utility functions will be:
Ensure(node):
if node's value has no location yet:
Allocate() a register for it and return the register
else if node's value is already in a register:
return that register
else: /* the node's value must have been spilled onto the stack */
Allocate() a register
generate a pop instruction to pop a value from
the stack into this register
return this register
and
Allocate(node):
if there's a vacant register among the N registers the allocator manages:
mark it as holding the node
mark the node as being in this register
return the register
else:
from the N registers managed by the allocator find the register whose
value won't be needed the longest, that is, whose parent/using node
is the most distant
mark the node that that register holds as spilled
generate a push instruction to push the node's value from the
register onto the stack
mark the register as holding the argument node passed into Allocate()
mark the node as being in this register
return the register
and
Free(hardware_reg):
mark the node this register holds as not having a location
mark the register as holding no node
Using this algorithm we will arive at this code generated for the above tree (let's replicate the tree here):
op

op op
+ 
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
mov hr0, 1
mov hr1, 2
xor hr0, hr0, hr1
mov hr1, 3
mov hr2, 4
mul hr1, hr1, hr2
add hr0, hr0, hr1
mov hr1, 5
mov hr2, 6
idiv hr1, hr1, hr2
mov hr2, 7
mov hr3, 8
irem hr2, hr2, hr3
or hr1, hr1, hr2
sub hr0, hr0, hr1
This code needs 4 registers (hr0 through hr3).
Spilling
If we restrict the number of registers that the allocator can manage to 3, we'll get this instead (let's replicate the tree one more time):
op

op op
+ 
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
mov hr0, 1
mov hr1, 2
xor hr0, hr0, hr1
mov hr1, 3
mov hr2, 4
mul hr1, hr1, hr2
add hr0, hr0, hr1
mov hr1, 5
mov hr2, 6
idiv hr1, hr1, hr2
mov hr2, 7
push hr0
mov hr0, 8
irem hr0, hr2, hr0
or hr0, hr1, hr0
pop hr1
sub hr0, hr1, hr0
Note the appearance of the push and pop instructions in the above code and the disappearance of register hr3.
The allocator runs out of registers once it has loaded 7 into hr2 and is about to load 8 into another register. At this point there are 3 partial results occupying all 3 registers:
hr0 = result of +
hr1 = result of /
hr2 = 7
We spill hr0, the result of +, onto the stack to make hr0 available for loading of 8. We choose hr0 to spill because its parent, , is the most distant ( and %, the parents of / and 7, are closer). When we finally need the result of +, we pop it from the stack.
The relative distance of parent nodes can be obtained by simple recursive assignment of unique numbers to all nodes:
op

op op
+ 
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
0 2 1 6 3 5 4 14 7 9 8 13 10 12 11 < node number / virtual reg
Each node then will have this unique number (we may refer to it as the virtual register) and we can either follow the node's parent link to find the parent's unique number or we may simply store the parent's unique number in the child node instead of storing there the link.
And so in this example of ours, 14 is the largest of the three numbers (14, 13 and 12), which is how we decide to kick the hr0 value out onto the stack.
Spilling the register that has the most distant use guarantees that unspilling will occur in the exact opposite order (LIFO), which lets us use the stack with push and pop instructions. Specifically, with operators being at most binary (not ternary and so on), we'll never spill two sibling nodes, that is, we'll never have to distinguish equally distant nodes.
When implementing this algorithm we'll need a global array of N pointers to the expression tree nodes, where N is the number of the registers that the register allocator manages. (This array is known as NodeFromHReg[]
in our code.) By examining the contents of this array we'll know if a particular hardware register is occupied by the result of some node. The nodes themselves will carry the location of their results (that is, it can be one of these N hardware registers or the stack or nowhere yet). IOW, with such a setup we'll always be able to tell what's where, by examining the array or a node and we'll be modifying the array and nodes appropriately to reflect the current allocations throughout the process.
Evaluation order of subsexpressions
When evaluating a binary operator (e.g. a + b) we should first evaluate the child/subexpression that uses more registers, then the one that uses fewer. This leads to more effective register use and fewer spills. The logic is simple: before you can evaluate the binary operator, you need to hold the result of one of its children in a register while evaluating the other child. So, if the children need, let's say, 1 and 2 registers each to evaluate, then evaluating first then second will need 3 registers, whereas evaluating them in the opposite order will need 2 registers.
When assigning unique node numbers (virtual register numbers) as described in the preceding section, we can assign them in this order and we can then handle sibling nodes in this order by looking at and comparing their unique numbers.
See Ershov number and SethiUllman algorithm.
2register instructions
Not all CPU architectures offer 3register ALU instructions like the MIPS does. The x86 architecture's ALU instructions are 2register. That is, the output overwrites one of the inputs.
This affects the previously described AllocHRegs()
function a little. It has to be able to allocate a specific output register through the Alloc()
function such that Alloc()
would allocate the same register as one of the inputs. So, for example, if the inputs to the add
instruction are in, say, ax
and cx
, the output will be in ax
if we generate add ax, cx
(or the other way around, the output will be in cx
if we generate add cx, ax
; keep in mind that not all binary operators are symmetric).
8086specific approach
The 8086specific idea is:
 When an instruction at a node needs its inputs in specific registers, it requests its child nodes to produce their outputs in desired registers (this is done recursively through the familiar
AllocHRegs()
function). However, they may be unable to satisfy the requests due to their own limitations (that is, they can never produce the outputs elsewhere naturally) or because the desired registers already hold something useful. And so the child nodes do what they can and leave it to their parent to do the rest when they can't satisfy the request.  Thankfully, it's always possible to swap values in a few registers with the
xchg
instruction to satisfy the needs on the input side and such swaps aren't needed too often.
So the AllocHRegs()
, Ensure()
and Allocate()
functions introduced earlier need to receive an additional parameter, the desired register. It should be possible to use this parameter to request a specific register, any register or a register from a specific subset of all allocatable registers.
Currently we're using this enumeration to specify a register:
enum HReg : int
{
// Specific regs begin
HReg0, HRegAX = HReg0,
HReg1, HRegCX = HReg1,
HReg2, HRegDX = HReg2,
HReg3, HRegBX = HReg3,
HReg4, HRegSI = HReg4,
HReg5, HRegDI = HReg5,
// Specific regs end
HRegCnt,
// Constants representing multiplechoice desired regs:
HRegAny = HRegCnt, // unspecified, any reg at all is OK
HRegNotCX, // prefer regs other than cx (for shifts)
HRegNotDXNotAX, // prefer regs other than dx and ax (for (i)div)
HRegNotDXNotCXNotAX, // prefer regs other than dx, cx and ax
HRegByte, // prefer those with individual byte components: ax, cx, dx, bx
HRegByteNotCX, // prefer those with individual byte components except cx: ax, dx, bx
HRegAddr, // prefer those that can be a memory operand: bx, si, di
};
The Allocate()
function will try to allocate the requested/desired register first, but if it fails to, it'll return a different one and the caller will need to deal with it (e.g. use the xchg
instruction).
Shifts
The 8086 shift instructions (shl
, shr
, sar
) take the shift count from the fixed register cl
(lower half of cx
) if the count isn't a simple 1. The value that they shift can be anywhere else.
Hence the implementation of AllocHRegs()
for shift instructions should pass its own desired register parameter (possibly modified to exclude cx
) to the left child's AllocHRegs()
and Ensure()
while it should pass HRegCX
as the desired register to the right child's AllocHRegs()
and Ensure()
. Btw, when both children need the same number of registers to compute we should prefer to handle the right/count child first, to have higher chances of allocating cx
for the count.
desired register may pass modified to left child

v
shift dst, cl
 
 v
v desired reg = cx
desired reg excludes cx
Ensure()
may fail to allocate and return HRegCX
for the right child. When this happens, we either move the right child node from that other register to cx
(if cx
is vacant) or we exchange the registers between that node and the one currently residing in cx
(when cx
isn't vacant). For this we update the node(s) and the global array of pointers to nodes (NodeFromHReg[]
) and generate the mov
or xchg
instruction. We should also remember that Ensure()
may allocate cx
for the left child, in which case we need to properly update the left register.
^

shift dst, cl
^ ^
 
case 1: ?x cx Nothing to do
case 2: cx ?x Swap left and right
case 3: ?x ?x Swap right and cx (or move right to cx)
(i)mul
The (i)mul
instruction multiplies ax
by another input and outputs the product to a pair of registers, dx
:ax
(most significant bits:least significant bits). That is, the product is twice as wide as either multiplicand. However, for our purposes, for 16bit multiplicands whose product isn't expected to need more than 16 bits, we can treat dx
as a clobbered register.
(i)mul
therefore disregards its caller's desired register since (i)mul
always outputs to ax
. (i)mul
, being symmetric, also asks both its child/input nodes to produce their results in ax
in the hopes that at least one would end up in ax
.
desired reg is ignored because output is in ax

v
(i)mul ax, src
 
 v
v desired reg = ax
desired reg = ax
If either of (i)mul
's inputs ends up in dx
, it's kept there, since it can be conveniently clobbered. However, if none of the inputs is in dx
, dx
may need to be explicitly preserved (if dx
holds some other partial result that we don't want (i)mul
to clobber).
To preserve dx
we exchange the registers between one of the input nodes and the node in dx
. IOW, we reduce this to having one of the inputs in dx
, which is safe to clobber.
So, through at most 2 register exchanges we can make one input in ax
and the other, if needed, in dx
.
N.B. the 16 least significant bits of a product of two 16bit integers are the same for both signed and unsigned multiplication and so on the x86 we can use mul
and imul
interchangeably if we're interested in just those 16 bits of the product.
(i)div
Computing quotients and remainders with (i)div
is overall trickier.
The instruction divides a pair of registers, dx
:ax
(most significant bits: least significant bits), that is, a 32bit integer, by a 16bit input and the outputs are in dx
(the remainder) and ax
(the quotient). When we need to divide 16bit integers we zero or signextend the dividend from 16 bits to 32 bits, filling dx
with the extension. dx
being part of the dividend naturally prevents us from using dx
for the divisor.
But first, we need the dividend in ax
, for which we pass HRegAX
as the desired register into AllocHRegs()
and Ensure()
for the left input node. We use HRegNotDXNotAX
(meaning any register but dx
or ax
) as the desired register for the right input node.
desired reg is ignored because output is in ax or dx

v
(i)div dx:ax, src
 
 v
v desired reg excludes dx and ax
desired reg = ax
Then through at most one xchg
we can make sure that the dividend is indeed in ax
.
If we're unlucky and Ensure()
(or the xchcg
) gives us dx
for the divisor or dx
contains some other partial result, we Allocate()
another register (and immediately Free()
it, IOW, we just find an available register, spilling if necessary) and move to it whatever node is in dx
, thereby freeing dx
from anything useful.
Finally, when we get to allocate the output register for (i)div
, we allocate ax
if we're interested in the quotient or we allocate dx
if we're interested in the remainder.
N.B. when computing how many registers an (i)div
based division/remainder node needs, we factor in the above restriction around dx
. This raises the minimum from 2 to 3 registers, but otherwise the computation remains the same as for e.g. add
. The order of evaluation and the node numbering described earlier should take this into account.
Loads
To load a value from memory we need the memory address in one of 3 suitable registers: bx
, si
, di
. So, the child/input node supplying the address will get its AllocHRegs()
and Ensure()
called with HRegAddr
as the desired register. If we're unlucky to get the address in one of those registers, we can use xchg
, as usual, to bring the address into e.g. bx
.
A 16bit value can be loaded into any 16bit register, but an 8bit value can only be loaded into individual 8bit subregisters, e.g. al
, bl
, cl
, dl
(the desired register can be set to HRegByte
when we'd like a register with 8bit subregisters). Further, if we intend to signextend an 8bit value from memory to 16 bits with the cbw
instruction, we need to load the value into al
.
On top of that we want to load the value into the desired register if we can do it directly. So, before we call Allocate()
to allocate the output register we're going to examine the desired register value and see if it's available and suitable. Again, if Allocate()
doesn't give us a register with 8bit subregisters when we need one, we may use xchg
to get us ax
/al
for the output register.
desired reg may be ignored and
chosen from regs that have 8bit subregs

v
mov r, [r]

v
desired reg can address memory
Stores
Stores are a bit simpler than loads, but they have the same fundamental requirements of the memory address being in one of the 3 suitable registers (bx
, si
, di
) and the 8bit value being in one of the 4 suitable registers (ax
/al
, bx
/bl
, cx
/cl
, dx
/dl
) and we may need to mov
or xchg
to satisfy these requirements.
desired reg may be ignored

v
mov [r], r
 
 v
v desired reg may need to have 8bit subregs
desired reg can address memory
N.B. the store node's result (besides the direct effect of writing to memory) can be the value that's being stored similar to how we can write a = b = c;
in C/C++ to use the result of one assignment (b = c
) as a value for another assignment (a = ...
).
Resources
 See Ershov number, SethiUllman algorithm.
 Engineering a Compiler 2nd ed. by Keith D. Cooper and Linda Torczon. In particular, section 13.3.2 "BottomUp Local Register Allocation".
 Compiler Construction by William M. Waite and Gerhard Goos. In particular, section 10.2.2 "Targeting".
 Compilers Principles, Techniques, and Tools (AKA the "Dragon" book) by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman.