How to optimize for the Pentium
family of microprocessors

Copyright © 1996, 2000 by Agner Fog. Last modified 2000-07-03.

 

Contents

  1. Introduction
  2. Literature
  3. Calling assembly functions from high level language
  4. Debugging and verifying
  5. Memory model
  6. Alignment
  7. Cache
  8. First time versus repeated execution
  9. Address generation interlock (PPlain and PMMX)
  10. Pairing integer instructions (PPlain and PMMX)
    1. Perfect pairing
    2. Imperfect pairing
  11. Splitting complex instructions into simpler ones (PPlain and PMMX)
  12. Prefixes (PPlain and PMMX)
  13. Overview of PPro, PII and PIII pipeline
  14. Instruction decoding (PPro, PII and PIII)
  15. Instruction fetch (PPro, PII and PIII)
  16. Register renaming (PPro, PII and PIII)
    1. Eliminating dependencies
    2. Register read stalls
  17. Out of order execution (PPro, PII and PIII)
  18. Retirement (PPro, PII and PIII)
  19. Partial stalls (PPro, PII and PIII)
    1. Partial register stalls
    2. Partial flags stalls
    3. Flags stalls after shifts and rotates
    4. Partial memory stalls
  20. Dependency chains (PPro, PII and PIII)
  21. Searching for bottlenecks (PPro, PII and PIII)
  22. Jumps and branches (all processors)
    1. Branch prediction in PPlain
    2. Branch prediction in PMMX, PPro, PII and PIII
    3. Avoiding jumps (all processors)
    4. Avoiding conditional jumps by using flags (all processors)
    5. Replacing conditional jumps by conditional moves (PPro, PII and PIII)
  23. Reducing code size (all processors)
  24. Scheduling floating point code (PPlain and PMMX)
  25. Loop optimization (all processors)
    1. Loops in PPlain and PMMX
    2. Loops in PPro, PII and PIII
  26. Problematic Instructions
    1. XCHG (all processors)
    2. Rotates through carry (all processors)
    3. String instructions (all processors)
    4. Bit test (all processors)
    5. Integer multiplication (all processors)
    6. WAIT instruction (all processors)
    7. FCOM + FSTSW AX (all processors)
    8. FPREM (all processors)
    9. FRNDINT (all processors)
    10. FSCALE and exponential function (all processors)
    11. FPTAN (all processors)
    12. FSQRT (PIII)
    13. MOV [MEM], ACCUM (PPlain and PMMX)
    14. TEST instruction (PPlain and PMMX)
    15. Bit scan (PPlain and PMMX)
    16. FLDCW (PPro, PII and PIII)
  27. Special topics
    1. LEA instruction (all processors)
    2. Division (all processors)
    3. Freeing floating point registers (all processors)
    4. Transitions between floating point and MMX instructions PMMX, PII and PIII)
    5. Converting from floating point to integer (All processors)
    6. Using integer instructions to do floating point operations (All processors)
    7. Using floating point instructions to do integer operations (PPlain and PMMX)
    8. Moving blocks of data (All processors)
    9. Self-modifying code (All processors)
    10. Detecting processor type (All processors)
  28. List of instruction timings for PPlain and PMMX
    1. Integer instructions
    2. Floating point instructions
    3. MMX instructions (PMMX)
  29. List of instruction timings and micro-op breakdown for PPro, PII and PIII
    1. Integer instructions
    2. Floating point instructions
    3. MMX instructions (PII and PIII)
    4. XMM instructions (PIII)
  30. Testing speed
  31. Comparison of the different microprocessors

 

1. Introduction

This manual describes in detail how to write optimized assembly language code, with particular focus on the Pentium® family of microprocessors.

Most of the information herein is based on my own research. Many people have sent me useful information and corrections for this manual, and I keep updating it whenever I have new important information. This manual is therefore more accurate, detailed, comprehensive and exact than any other source of information, and it contains many details not found anywhere else. This information will enable you in many cases to calculate exactly how many clock cycles a piece of code will take. I do not claim, though, that all information in this manual is exact: Some timings etc. can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that the writers of Intel manuals have.

The following versions of Pentium processors are discussed in this manual:

abbreviationname
PPlainplain old Pentium (without MMX)
PMMXPentium with MMX
PProPentium Pro
PIIPentium II (including Celeron and Xeon)
PIIIPentium III (including variants)

The assembly language syntax used in this manual is MASM 5.10 syntax. There is no official standard for X86 assembly language, but this is the closest you can get to a de facto standard since most assemblers have a MASM 5.10 compatible mode. (I do not recommend using MASM version 5.10 though, because it has a serious bug in 32 bit mode. Use TASM or a later version of MASM).

Some of the remarks in this manual may seem like a criticism of Intel. This should not be taken to mean that other brands are better. The Pentium family of microprocessors compare well with competing brands, they are better documented, and have better testability features. For these reasons, no competing brand has been subjected to the same level of independent research by me or by anybody else.

Programming in assembly language is much more difficult than high level language. Making bugs is very easy, and finding them is very difficult. Now you have been warned! It is assumed that the reader is already experienced in assembly programming. If not, then please read some books on the subject and get some programming experience before you begin to do complicated optimizations.

The hardware design of the PPlain and PMMX chips has many features which are optimized specifically for some commonly used instructions or instruction combinations, rather than using general optimization methods. Consequently, the rules for optimizing software for this design are complicated and have many exceptions, but the possible gain in performance may be substantial. The PPro, PII and PIII processors have a very different design where the processor takes care of much of the optimization work by executing instructions out of order, but the more complicated design of these processors generate many potential bottlenecks, so there may be a lot to gain by optimizing manually for these processors. The Pentium 4 processor has yet another design, and the optimization guidelines for Pentium 4 are quite different from previous versions. This manual does not cover the Pentium 4 - the reader is referred to manuals from Intel.

Before you start to convert your code to assembly, make sure that your algorithm is optimal. Often you can improve a piece of code much more by improving the algorithm than by converting it to assembly code.

Next, you have to identify the critical parts of your program. Often more than 99% of the CPU time is spent in the innermost loop of a program. In this case you should optimize only this loop and leave everything else in high level language. Some assembly programmers waste a lot of energy optimizing the wrong parts of their programs, the only significant effect of their effort being that the programs become more difficult to debug and maintain!

If it is not obvious where the critical parts of your program are then you may use a profiler to find them. If it turns out that the bottleneck is disk access, then you may modify your program to make disk access sequential in order to improve disk caching, rather than turning to assembly programming. If the bottleneck is graphics output then you may look for a way of reducing the number of calls to graphic procedures.

Some high level language compilers offer relatively good optimization for specific processors, but further optimization by hand can usually make it much better.

Please don't send your programming questions to me. I am not gonna do your homework for you!

Good luck with your hunt for nanoseconds!

2. Literature

A lot of useful literature and tutorials can be downloaded for free from Intel's www site or acquired in print or on CD-ROM. It is recommended that you study this literature in order to get acquainted with the microprocessor architecture. However, the documents from Intel are not always accurate - especially the tutorials have many errors (evidently, they haven't tested their own examples).

I will not give the URL's here because the file locations change very often. You can find the documents you need by using the search facilities at: developer.intel.com or follow the links from www.agner.org/assem

Some documents are in .PDF format. If you don't have software for viewing or printing .PDF files, then you may download the Acrobat file reader from www.adobe.com

The use of MMX and XMM (SIMD) instructions for optimizing specific applications are described in several application notes. The instruction set is described in various manuals and tutorials.

VTUNE is a software tool from Intel for optimizing code. I have not tested it and can therefore not give any evalutation of it here.

A lot of other sources than Intel also have useful information. These sources are listed in the FAQ for the newsgroup comp.lang.asm.x86. For other internet ressources follow the links from www.agner.org/assem

3. Calling assembly functions from high level language

You can either use inline assembly or code a subroutine entirely in assembly language and link it into your project. If you choose the latter option, then it is recommended that you use a compiler which is capable of translating high level code directly to assembly. This assures that you get the function calling method right. Most C++ compilers can do this.

The way of transferring parameters depends on the calling convention:

 calling convention   parameter order on stack   parameters removed by 
 _cdecl  first par. at low address   caller 
 _stdcall   first par. at low address  subroutine 
 _fastcall  compiler specific   subroutine 
 _pascal  first par. at high address   subroutine 

The methods for function calling and name mangling can be quite complicated. There are many different calling conventions, and the different brands of compilers are not compatible in this respect. If you are calling assembly language subroutines from C++, then the best method in terms of consistency and compatibility is to declare your functions extern "C" and _cdecl. The assembly code must then have the function name prefixed by an underscore (_) and be assembled with case sensitivity on externals (option -mx). Example:

  ; extern "C" int _cdecl square (int x);
  _square PROC NEAR             ; integer square function
  PUBLIC  _square
          MOV     EAX, [ESP+4]
          IMUL    EAX
          RET
  _square ENDP

If you need to make overloaded functions, overloaded operators, methods, and other C++ specialties then you have to code it in C++ first and make your compiler translate it to assembly in order to get the right linking information and calling method. These details are different for different brands of compilers and seldom documented. If you want an assembly function with any other calling method than extern "C" and _cdecl to be callable from code compiled with different compilers then you need to give it one public name for each compiler. For example an overloaded square function:

  ; int square (int x);
  SQUARE_I PROC NEAR             ; integer square function
  @square$qi LABEL NEAR          ; link name for Borland compiler
  ?square@@YAHH@Z LABEL NEAR     ; link name for Microsoft compiler
  _square__Fi LABEL NEAR         ; link name for Gnu compiler
  PUBLIC @square$qi, ?square@@YAHH@Z, _square__Fi
          MOV     EAX, [ESP+4]
          IMUL    EAX
          RET
  SQUARE_I ENDP

  ; double square (double x);
  SQUARE_D PROC NEAR             ; double precision float square function
  @square$qd LABEL NEAR          ; link name for Borland compiler
  ?square@@YANN@Z LABEL NEAR     ; link name for Microsoft compiler
  _square__Fd LABEL NEAR         ; link name for Gnu compiler
  PUBLIC @square$qd, ?square@@YANN@Z, _square__Fd
          FLD     QWORD PTR [ESP+4]
          FMUL    ST(0), ST(0)
          RET
  SQUARE_D ENDP

This method works because all these compilers use the _cdecl calling convention by default for overloaded functions. For methods (member functions), however, not even the calling convention is the same for all compilers (Borland and Gnu compilers use _cdecl convention with the 'this' pointer first, Microsoft uses the _stdcall convention with the 'this' pointer in ecx).

In general, you shouldn't expect different compilers to be compatible on the object file level if you are using any of these constructs: long double, member pointers, virtual methods, new, delete, exceptions, system function calls, or standard library functions.

Register usage in 16 bit mode DOS or Windows, C or C++:
16-bit return value in AX, 32-bit return value in DX:AX, floating point return value in ST(0). Registers AX, BX, CX, DX, ES and arithmetic flags may be changed by the procedure; all other registers must be saved and restored. A procedure can rely on SI, DI, BP, DS and SS being unchanged across a call to another procedure.

Register usage in 32 bit Windows, C++ and other programming languages:
Integer return value in EAX, floating point return value in ST(0). Registers EAX, ECX, EDX (not EBX) may be changed by the procedure; all other registers must be saved and restored. Segment registers cannot be changed, not even temporarily. CS, DS, ES, and SS all point to the flat segment group. FS is used by the operating system. GS is unused, but reserved. Flags may be changed by the procedure with the following restrictions: The direction flag is 0 by default. The direction flag may be set temporarily, but must be cleared before any call or return. The interrupt flag cannot be cleared. The floating point register stack is empty at the entry of a procedure and must be empty at return, except for ST(0) if it is used for return value. MMX registers may be changed by the procedure and if so cleared by EMMS before returning and before calling any other procedure that may use floating point registers. All XMM registers may be modified by procedures. Rules for passing parameters and return values in XMM registers are described in Intel's application note AP 589. A procedure can rely on EBX, ESI, EDI, EBP and all segment registers being unchanged across a call to another procedure.

4. Debugging and verifying

Debugging assembly code can be quite hard and frustrating, as you probably already have discovered. I would recommend that you start with writing the piece of code you want to optimize as a subroutine in a high level language. Next, write a test program that will test your subroutine thoroughly. Make sure the test program goes into all branches and boundary cases.

When your high level language subroutine works with your test program then you are ready to translate the code to assembly language.

Now you can start to optimize. Each time you have made a modification you should run it on the test program to see if it works correctly. Number all your versions and save them so that you can go back and test them again in case you discover an error that the test program didn't catch (such as writing to a wrong address).

Test the speed of the most critical part of your program with the method described in chapter 30 or with a test program. If the code is significantly slower than expected, then the most probable causes are: cache misses (chapter 7), misaligned operands (chapter 6), first time penalty (chapter 8), branch mispredictions (chapter 22), instruction fetch problems (chapter 15), register read stalls (16), or long dependency chains (chapter 20).

Highly optimized code tends to be very difficult to read and understand for others, and even for yourself when you get back to it after some time. In order to make it possible to maintain the code it is important that you organize it into small logical units (procedures or macros) with a well-defined interface and appropriate comments. The more complicated the code is to read, the more important is a good documentation.

5. Memory model

The Pentiums are designed primarily for 32 bit code, and the performance is inferior on 16 bit code. Segmenting your code and data also degrades performance significantly, so you should generally prefer 32 bit flat mode, and an operating system which supports this mode. The code examples shown in this manual assume a 32 bit flat memory model, unless otherwise specified.

6. Alignment

All data in RAM should be aligned to addresses divisible by 2, 4, 8, or 16 according to this scheme:
  alignment
 operand size   PPlain and PMMX   PPro, PII and PIII 
 1 (byte)  11
 2 (word) 2 2
 4 (dword)  44
 6 (fword)  48
 8 (qword)  88
 10 (tbyte)  816
 16 (oword)  n.a.16

On PPlain and PMMX, misaligned data will take at least 3 clock cycles extra to access if a 4 byte boundary is crossed. The penalty is higher when a cache line boundary is crossed.

On PPro, PII and PIII, misaligned data will cost you 6-12 clocks extra when a cache line boundary is crossed. Misaligned operands smaller than 16 bytes that do not cross a 32 byte boundary give no penalty.

Aligning data by 8 or 16 on a dword size stack may be a problem. A common method is to set up an aligned frame pointer. A function with aligned local data may look like this:

_FuncWithAlign PROC NEAR
        PUSH    EBP                        ; prolog code
        MOV     EBP, ESP
        AND     EBP, -8                    ; align frame pointer by 8
        FLD     DWORD PTR [ESP+8]          ; function parameter
        SUB     ESP, LocalSpace + 4        ; allocate local space
        FSTP    QWORD PTR [EBP-LocalSpace] ; store something in aligned space
        ...
        ADD     ESP, LocalSpace + 4        ; epilog code. restore ESP
        POP     EBP                        ; (AGI stall on PPlain/PMMX)
        RET
_FuncWithAlign ENDP

While aligning data is always important, aligning code is not necessary on the PPlain and PMMX. Principles for aligning code on PPro, PII and PIII are explained in chapter 15.

7. Cache

The PPlain and PPro have 8 kb of on-chip cache (level one cache) for code, and 8 kb for data. The PMMX, PII and PIII have 16 kb for code and 16 kb for data. Data in the level 1 cache can be read or written to in just one clock cycle, whereas a cache miss may cost many clock cycles. It is therefore important that you understand how the cache works in order to use it most efficiently.

The data cache consists of 256 or 512 lines of 32 bytes each. Each time you read a data item which is not cached, the processor will read an entire cache line from memory. The cache lines are always aligned to a physical address divisible by 32. When you have read a byte at an address divisible by 32, then the next 31 bytes can be read or written to at almost no extra cost. You can take advantage of this by arranging data items which are used near each other together into aligned blocks of 32 bytes of memory. If, for example, you have a loop which accesses two arrays, then you may interleave the two arrays into one array of structures, so that data which are used together are also stored together.

If the size of an array or other data structure is a multiple of 32 bytes, then you should preferably align it by 32.

The cache is set-associative. This means that a cache line can not be assigned to an arbitrary memory address. Each cache line has a 7-bit set-value which must match bits 5 through 11 of the physical RAM address (bit 0-4 define the 32 bytes within a cache line). The PPlain and PPro have two cache lines for each of the 128 set-values, so there are two possible cache lines to assign to any RAM address. The PMMX, PII and PIII have four.

The consequence of this is that the cache can hold no more than two or four different data blocks which have the same value in bits 5-11 of the address. You can determine if two addresses have the same set-value by the following method: Strip off the lower 5 bits of each address to get a value divisible by 32. If the difference between the two truncated addresses is a multiple of 4096 (=1000H), then the addresses have the same set-value.

Let me illustrate this by the following piece of code, where ESI holds an address divisible by 32:

AGAIN:  MOV  EAX, [ESI]
        MOV  EBX, [ESI + 13*4096 +  4]
        MOV  ECX, [ESI + 20*4096 + 28]
        DEC  EDX
        JNZ  AGAIN

The three addresses used here all have the same set-value because the differences between the truncated addresses are multipla of 4096. This loop will perform very poorly on the PPlain and PPro. At the time you read ECX there is no free cache line with the proper set-value so the processor takes the least recently used of the two possible cache lines, that is the one which was used for EAX, and fills it with the data from [ESI+20*4096] to [ESI+20*4096+31] and reads ECX. Next, when reading EAX, you find that the cache line that held the value for EAX has now been discarded, so you take the least recently used line, which is the one holding the EBX value, and so on.. You have nothing but cache misses and the loop takes something like 60 clock cycles. If the third line is changed to:

        MOV  ECX, [ESI + 20*4096 + 32]

then we have crossed a 32 byte boundary, so that we do not have the same set-value as in the first two lines, and there will be no problem assigning a cache line to each of the three addresses. The loop now takes only 3 clock cycles (except for the first time) - a very considerable improvement! As already mentioned, the PMMX, PII and PIII have 4-way caches so that you have four cache lines with the same set-value. (Some Intel documents erroneously say that the PII cache is 2-way).

It may be very difficult to determine if your data addresses have the same set-values, especially if they are scattered around in different segments. The best thing you can do to avoid problems of this kind is to keep all data used in the critical part or your program within one contiguous block not bigger than the cache, or two contiguous blocks no bigger than half that size (for example one block for static data and another block for data on the stack). This will make sure that your cache lines are used optimally.

If the critical part of your code accesses big data structures or random data addresses, then you may want to keep all frequently used variables (counters, pointers, control variables, etc.) within a single contiguous block of max 4 kbytes so that you have a complete set of cache lines free for accessing random data. Since you probably need stack space anyway for subroutine parameters and return addresses, the best thing is to copy all frequently used static data to dynamic variables on the stack, and copy them back again outside the critical loop if they have been changed.

Reading a data item which is not in the level one cache causes an entire cache line to be filled from the level two cache, which takes approximately 200 ns (that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz system), but the bytes you ask for first are available already after 50-100 ns. If the data item is not in the level two cache either, then you will get a delay of something like 200-300 ns. This delay will be somewhat longer if you cross a DRAM page boundary. (The size of a DRAM page is 1 kb for 4 and 8 MB 72 pin RAM modules, and 2 kb for 16 and 32 MB modules).

When reading big blocks of data from memory, the speed is limited by the time it takes to fill cache lines. You can sometimes improve speed by reading data in a non-sequential order: before you finish reading data from one cache line start reading the first item from the next cache line. This method can increase reading speed by 20 - 40% when reading from main memory or level 2 cache on PPlain and PMMX, and from level 2 cache on PPro, PII and PIII. A disadvantage of this method is of course that the program code becomes extremely clumsy and complicated. For further information on this trick see www.intelligentfirm.com.

When you write to an address which is not in the level 1 cache, then the value will go right through to the level 2 cache or to the RAM (depending on how the level 2 cache is set up) on the PPlain and PMMX. This takes approximately 100 ns. If you write eight or more times to the same 32 byte block of memory without also reading from it, and the block is not in the level one cache, then it may be advantageous to make a dummy read from the block first to load it into a cache line. All subsequent writes to the same block will then go to the cache instead, which takes only one clock cycle. On PPlain and PMMX, there is sometimes a small penalty for writing repeatedly to the same address without reading in between.

On PPro, PII and PIII, a write miss will normally load a cache line, but it is possible to setup an area of memory to perform differently, for example video RAM (See Pentium Pro Family Developer's Manual, vol. 3: Operating System Writer's Guide").

Other ways of speeding up memory reads and writes are discussed in chapter 27.8 below.

The PPlain and PPro have two write buffers, PMMX, PII and PIII have four. On the PMMX, PII and PIII you may have up to four unfinished writes to uncached memory without delaying the subsequent instructions. Each write buffer can handle operands up to 64 bits wide.

Temporary data may conveniently be stored on the stack because the stack area is very likely to be in the cache. However, you should be aware of the alignment problems if your data elements are bigger than the stack word size.

If the life ranges of two data structures do not overlap, then they may share the same RAM area to increase cache efficiency. This is consistent with the common practice of allocating space for temporary variables on the stack.

Storing temporary data in registers is of course even more efficient. Since registers is a scarce ressource you may want to use [ESP] rather than [EBP] for addressing data on the stack, in order to free EBP for other purposes. Just don't forget that the value of ESP changes every time you do a PUSH or POP. (You cannot use ESP under 16-bit Windows because the timer interrupt will modify the high word of ESP at unpredictable places in your code.)

There is a separate cache for code, which is similar to the data cache. The size of the code cache is 8 kb on PPlain and PPro and 16 kb on the PMMX, PII and PIII. It is important that the critical part of your code (the innermost loops) fit in the code cache. Frequently used pieces of code or routines which are used together should preferable be stored near each other. Seldom used branches or procedures should be put away in the bottom of your code or somewhere else.

8. First time versus repeated execution

A piece of code usually takes much more time the first time it is executed than when it is repeated. The reasons are the following:

  1. Loading the code from RAM into the cache takes longer time than executing it.
  2. Any data accessed by the code has to be loaded into the cache, which may take much more time than executing the instructions. When the code is repeated then the data are more likely to be in the cache.
  3. Jump instructions will not be in the branch target buffer the first time they execute, and therefore are less likely to be predicted correctly. See chapter 22.
  4. In the PPlain, decoding the code is a bottleneck. If it takes one clock cycle to determine the length of an instruction, then it is not possible to decode two instructions per clock cycle, because the processor doesn't know where the second instruction begins. The PPlain solves this problem by remembering the length of any instruction which has remained in the cache since last time it was executed. As a consequence of this, a set of instructions will not pair in the PPlain the first time they are executed, unless the first of the two instructions is only one byte long. The PMMX, PPro, PII and PIII have no penalty on first time decoding.

For these four reasons, a piece of code inside a loop will generally take much more time the first time it executes than the subsequent times.

If you have a big loop which doesn't fit into the code cache then you will get penalties all the time because it doesn't run from the cache. You should therefore try to reorganize the loop to make it fit into the cache.

If you have very many jumps, calls, and branches inside a loop, then you may get the penalty of branch target buffer misses repeatedly.

Likewise, if a loop repeatedly accesses a data structure too big for the data cache, then you will get the penalty of data cache misses all the time.

9. Address generation interlock (PPlain and PMMX)

It takes one clock cycle to calculate the address needed by an instruction which accesses memory. Normally, this calculation is done at a separate stage in the pipeline while the preceding instruction or instruction pair is executing. But if the address depends on the result of an instruction executing in the preceding clock cycle, then you have to wait an extra clock cycle for the address to be calculated. This is called an AGI stall. Example:
ADD EBX,4 / MOV EAX,[EBX] ; AGI stall
The stall in this example can be removed by putting some other instructions in between ADD EBX,4 and MOV EAX,[EBX] or by rewriting the code to: MOV EAX,[EBX+4] / ADD EBX,4

You can also get an AGI stall with instructions which use ESP implicitly for addressing, such as PUSH, POP, CALL, and RET, if ESP has been changed in the preceding clock cycle by instructions such as MOV, ADD, or SUB. The PPlain and PMMX have special circuitry to predict the value of ESP after a stack operation so that you do not get an AGI delay after changing ESP with PUSH, POP, or CALL. You can get an AGI stall after RET only if it has an immediate operand to add to ESP.

Examples:

ADD ESP,4 / POP ESI            ; AGI stall
POP EAX   / POP ESI            ; no stall, pair
MOV ESP,EBP / RET              ; AGI stall
CALL L1 / L1: MOV EAX,[ESP+8]  ; no stall
RET / POP EAX                  ; no stall
RET 8 / POP EAX                ; AGI stall

The LEA instruction is also subject to an AGI stall if it uses a base or index register which has been changed in the preceding clock cycle. Example:

INC ESI / LEA EAX,[EBX+4*ESI]  ; AGI stall

PPro, PII and PIII have no AGI stalls for memory reads and LEA, but they do have AGI stalls for memory writes. This is not very significant unless the subsequent code has to wait for the write to finish.

10. Pairing integer instructions (PPlain and PMMX)

10.1 Perfect pairing

The PPlain and PMMX have two pipelines for executing instructions, called the U-pipe and the V-pipe. Under certain conditions it is possible to execute two instructions simultaneously, one in the U-pipe and one in the V-pipe. This can almost double the speed. It is therefore advantageous to reorder your instructions to make them pair.

The following instructions are pairable in either pipe:

The following instructions are pairable in the U-pipe only:

The following instructions can execute in either pipe but are only pairable when in the V-pipe:

All other integer instructions can execute in the U-pipe only, and are not pairable.

Two consecutive instructions will pair when the following conditions are met:

1. The first instruction is pairable in the U-pipe and the second instruction is pairable in the V-pipe.

2. The second instruction does not read or write a register which the first instruction writes to.
Examples:

    MOV EAX, EBX / MOV ECX, EAX     ; read after write, do not pair
    MOV EAX, 1   / MOV EAX, 2       ; write after write, do not pair
    MOV EBX, EAX / MOV EAX, 2       ; write after read, pair OK
    MOV EBX, EAX / MOV ECX, EAX     ; read after read, pair OK
    MOV EBX, EAX / INC EAX          ; read and write after read, pair OK

3. In rule 2 partial registers are treated as full registers. Example:

    MOV AL, BL  /  MOV AH, 0

writes to different parts of the same register, do not pair

4. Two instructions which both write to parts of the flags register can pair despite rule 2 and 3. Example:

    SHR EAX, 4 / INC EBX            ; pair OK

5. An instruction which writes to the flags can pair with a conditional jump despite rule 2. Example:

    CMP EAX, 2 / JA LabelBigger     ; pair OK

6. The following instruction combinations can pair despite the fact that they both modify the stack pointer:

    PUSH + PUSH,  PUSH + CALL,  POP + POP

7. There are restrictions on the pairing of instructions with prefix. There are several types of prefixes:

On the PPlain, a prefixed instruction can only execute in the U-pipe, except for conditional near jumps.

On the PMMX, instructions with operand size, address size, or 0FH prefix can execute in either pipe, whereas instructions with segment, repeat, or lock prefix can only execute in the U-pipe.

8. An instruction which has both a displacement and immediate data is not pairable on the PPlain and only pairable in the U-pipe on the PMMX:

    MOV DWORD PTR DS:[1000], 0    ; not pairable or only in U-pipe
    CMP BYTE PTR [EBX+8], 1       ; not pairable or only in U-pipe
    CMP BYTE PTR [EBX], 1         ; pairable
    CMP BYTE PTR [EBX+8], AL      ; pairable

(Another problem with instructions which have both a displacement and immediate data on the PMMX is that such instructions may be longer than 7 bytes, which means that only one instruction can be decoded per clock cycle, as explained in chapter 12.)

9. Both instructions must be preloaded and decoded. This is explained in chapter 8.

10. There are special pairing rules for MMX instructions on the PMMX:

10.2 Imperfect pairing

There are situations where the two instructions in a pair will not execute simultaneously, or only partially overlap in time. They should still be considered a pair, though, because the first instruction executes in the U-pipe, and the second in the V-pipe. No subsequent instruction can start to execute before both instructions in the imperfect pair have finished.

Imperfect pairing will happen in the following cases:

1. If the second instructions suffers an AGI stall (see chapter 9).

2. Two instructions cannot access the same DWORD of memory simultaneously. The following examples assume that ESI is divisible by 4:
MOV AL, [ESI] / MOV BL, [ESI+1]
The two operands are within the same DWORD, so they cannot execute simultaneously. The pair takes 2 clock cycles.
MOV AL, [ESI+3] / MOV BL, [ESI+4]
Here the two operands are on each side of a DWORD boundary, so they pair perfectly, and take only one clock cycle.

3. Rule 2 is extended to the case where bit 2-4 is the same in the two addresses (cache bank conflict). For DWORD addresses this means that the difference between the two addresses should not be divisible by 32. Examples:

   MOV [ESI], EAX / MOV [ESI+32000], EBX ;  imperfect pairing
   MOV [ESI], EAX / MOV [ESI+32004], EBX ;  perfect pairing

Pairable integer instructions which do not access memory take one clock cycle to execute, except for mispredicted jumps. MOV instructions to or from memory also take only one clock cycle if the data area is in the cache and properly aligned. There is no speed penalty for using complex addressing modes such as scaled index registers.

A pairable integer instruction which reads from memory, does some calculation, and stores the result in a register or flags, takes 2 clock cycles. (read/modify instructions).

A pairable integer instruction which reads from memory, does some calculation, and writes the result back to the memory, takes 3 clock cycles. (read/modify/write instructions).

4. If a read/modify/write instruction is paired with a read/modify or read/modify/write instruction, then they will pair imperfectly.

The number of clock cycles used is given in the following table:
First instruction Second instruction
  MOV or register only   read/modify   read/modify/write 
 MOV or register only   1   2   3 
 read/modify   2   2   3 
 read/modify/write   3   4   5 

Example:
ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles
ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles

5. When two paired instructions both take extra time due to cache misses, misalignment, or jump misprediction, then the pair will take more time than each instruction, but less than the sum of the two.

6. A pairable floating point instruction followed by FXCH will make imperfect pairing if the next instruction is not a floating point instruction.

In order to avoid imperfect pairing you have to know which instructions go into the U-pipe, and which to the V-pipe. You can find out this by looking backwards in your code and search for instructions which are unpairable, pairable only in one of the pipes, or cannot pair due to one of the rules above.

Imperfect pairing can often be avoided by reordering your instructions. Example:

L1:     MOV     EAX,[ESI]
        MOV     EBX,[ESI]
        INC     ECX

Here the two MOV instructions form an imperfect pair because they both access the same memory location, and the sequence takes 3 clock cycles. You can improve the code by reordering the instructions so that INC ECX pairs with one of the MOV instructions.

L2:     MOV     EAX,OFFSET A
        XOR     EBX,EBX
        INC     EBX
        MOV     ECX,[EAX]
        JMP     L1

The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter instruction has an AGI stall. The sequence takes 4 clocks. If you insert a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1 instead, then the sequence takes only 3 clocks.

The next example is in 16 bit mode, assuming that SP is divisible by 4:

L3:     PUSH    AX
        PUSH    BX
        PUSH    CX
        PUSH    DX
        CALL    FUNC

Here the PUSH instructions form two imperfect pairs, because both operands in each pair go into the same dword of memory. PUSH BX could possibly pair perfectly with PUSH CX (because they go on each side of a DWORD boundary) but it doesn't because it has already been paired with PUSH AX. The sequence therefore takes 5 clocks. If you insert a NOP or any other instruction so that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the sequence will take only 3 clocks. Another way to solve the problem is to make sure that SP is not divisible by 4. Knowing whether SP is divisible by 4 or not in 16 bit mode can be difficult, so the best way to avoid this problem is to use 32 bit mode.

11. Splitting complex instructions into simpler ones (PPlain and PMMX)

You may split up read/modify and read/modify/write instructions to improve pairing. Example:
ADD [mem1],EAX / ADD [mem2],EBX ; 5 clock cycles
This code may be split up into a sequence which takes only 3 clock cycles:

    MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
    MOV [mem1],ECX / MOV [mem2],EDX

Likewise you may split up non-pairable instructions into pairable instructions:

    PUSH [mem1]
    PUSH [mem2]  ; non-pairable

Split up into:

    MOV EAX,[mem1]
    MOV EBX,[mem2]
    PUSH EAX
    PUSH EBX     ; everything pairs

Other examples of non-pairable instructions which may be split up into simpler pairable instructions:
CDQ split into: MOV EDX,EAX / SAR EDX,31
NOT EAX change to XOR EAX,-1
NEG EAX split into XOR EAX,-1 / INC EAX
MOVZX EAX,BYTE PTR [mem] split into XOR EAX,EAX / MOV AL,BYTE PTR [mem]
JECXZ split into TEST ECX,ECX / JZ
LOOP split into DEC ECX / JNZ
XLAT change to MOV AL,[EBX+EAX]

If splitting instructions doesn't improve speed, then you may keep the complex or nonpairable instructions in order to reduce code size.

Splitting instructions is not needed on the PPro, PII and PIII, except when the split instructions generate fewer uops.

12. Prefixes (PPlain and PMMX)

An instruction with one or more prefixes may not be able to execute in the V-pipe (se chapter 10, sect. 7), and it may take more than one clock cycle to decode.

On the PPlain, the decoding delay is one clock cycle for each prefix except for the 0FH prefix of conditional near jumps.

The PMMX has no decoding delay for 0FH prefix. Segment and repeat prefixes take one clock extra to decode. Address and operand size prefixes take two clocks extra to decode. The PMMX can decode two instructions per clock cycle if the first instruction has a segment or repeat prefix or no prefix, and the second instruction has no prefix. Instructions with address or operand size prefixes can only decode alone on the PMMX. Instructions with more than one prefix take one clock extra for each prefix.

Address size prefixes can be avoided by using 32 bit mode. Segment prefixes can be avoided in 32 bit mode by using a flat memory model. Operand size prefixes can be avoided in 32 bit mode by using only 8 bit and 32 bit integers.

Where prefixes are unavoidable, the decoding delay may be masked if a preceding instruction takes more than one clock cycle to execute. The rule for the PPlain is that any instruction which takes N clock cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1 prefixes in the next two (sometimes three) instructions or instruction pairs. In other words, each extra clock cycle that an instruction takes to execute can be used to decode one prefix in a later instruction. This shadowing effect even extends across a predicted branch. Any instruction which takes more than one clock cycle to execute, and any instruction which is delayed because of an AGI stall, cache miss, misalignment, or any other reason except decoding delay and branch misprediction, has a shadowing effect.

The PMMX has a similar shadowing effect, but the mechanism is different. Decoded instructions are stored in a transparent first-in-first-out (FIFO) buffer, which can hold up to four instructions. As long as there are instructions in the FIFO buffer you get no delay. When the buffer is empty then instructions are executed as soon as they are decoded. The buffer is filled when instructions are decoded faster than they are executed, i.e. when you have unpaired or multi-cycle instructions. The FIFO buffer is emptied when instructions execute faster than they are decoded, i.e. when you have decoding delays due to prefixes. The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can receive two instructions per clock cycle provided that the second instruction is without prefixes and none of the instructions are longer than 7 bytes. The two execution pipelines (U and V) can each receive one instruction per clock cycle from the FIFO buffer.

Examples:
CLD / REP MOVSD
The CLD instruction takes two clock cycles and can therefore overshadow the decoding delay of the REP prefix. The code would take one clock cycle more if the CLD instruction was placed far from the REP MOVSD.
CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL
The CMP instruction takes two clock cycles here because it is a read/modify instruction. The 0FH prefix of the SETNZ instruction is decoded during the second clock cycle of the CMP instruction, so that the decoding delay is hidden on the PPlain (The PMMX has no decoding delay for the 0FH).

Prefix penalties in PPro, PII and PIII are described in chapter 14.

13. Overview of PPro, PII and PIII pipeline

The architecture of the PPro, PII and PIII microprocessors is well explained and illustrated in various manuals and tutorials from Intel. It is recommended that you study this material in order to get an understanding of how these microprocessors work. I will describe the structure briefly here with particular focus on those elements that are important for optimizing code.

Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double buffer that can hold two 16-byte chunks. The code is passed on from the double buffer to the decoders in blocks which I will call ifetch blocks (instruction fetch blocks). The ifetch blocks are usually 16 bytes long, but not aligned. The purpose of the double-buffer is to make it possible to decode an instruction that crosses a 16-byte boundary (i.e. an address divisible by 16).

The ifetch block goes to the instruction length decoder, which determines where each instruction begins and ends, and next to the instruction decoders. There are three decoders so that you can decode up to three instructions in each clock cycle. A group of up to three instructions that are decoded in the same clock cycle is called a decode group.

The decoders translate instructions into micro-operations, abbreviated uops. Simple instructions generate only one uop, while more complex instructions may generate several uops. For example, the instruction ADD EAX,[MEM] is decoded into two uops: one for reading the source operand from memory, and one for doing the addition. The purpose of splitting instructions into uops is to make the handling later in the system more effective.

The three decoders are called D0, D1, and D2. D0 can handle all instructions, while D1 and D2 can handle only simple instructions that generate one uop.

The uops from the decoders go via a short queue to the register allocation table (RAT). The execution of uops work on temporary registers which are later written to the permanent registers EAX, EBX, etc. The purpose of the RAT is to tell the uops which temporary registers to use, and to allow register renaming (see later).

After the RAT, the uops to go the reorder buffer (ROB). The purpose of the ROB is to enable out-of-order execution. A uop stays in the reservation station until the operands it needs are available. If an operand for one uop is delayed because a previous uop that generates the operand is not finished yet, then the ROB may find another uop later in the queue that can be executed in the meantime in order to save time.

The uops that are ready for execution are sent to the execution units, which are clustered around five ports: Port 0 and 1 can handle arithmetic operations, jumps, etc. Port 2 takes care of all reads from memory, port 3 calculates addresses for memory writes, and port 4 does memory writes.

When an instruction has been executed then it is marked in the ROB as ready to retire. It then goes to the retirement station. Here the contents of the temporary registers used by the uops are written to the permanent registers. While uops can be executed out of order, they must be retired in order.

In the following chapters, I will describe in detail how to optimize the throughput of each step in the pipeline.

14. Instruction decoding (PPro, PII and PIII)

I am describing instruction decoding before instruction fetching here because you need to know how the decoders work in order to understand the possible delays in instruction fetching.

The decoders can handle three instructions per clock cycle, but only when certain conditions are met. Decoder D0 can handle any instruction that generates up to 4 uops in a single clock cycle. Decoders D1 and D2 can handle only instructions that generate 1 uop and these instructions can be no more than 8 bytes long.

To summarize the rules for decoding two or three instructions in the same clock cycle:

There is no limit to the length of the instruction in D0 (despite Intel manuals saying something else), as long as the three instructions fit into one 16 bytes ifetch block.

An instruction that generates more than 4 uops takes two or more clock cycles to decode, and no other instructions can decode in parallel.

It follows from the rules above that the decoders can produce a maximum of 6 uops per clock cycle if the first instruction in each decode group generates 4 uops and the next two generate 1 uop each. The minimum production is 2 uops per clock cycle, which you get when all instructions generate 2 uops each, so that D1 and D2 are never used.

For maximum throughput, it is recommended that you order your instructions according to the 4-1-1 pattern: instructions that generate 2 to 4 uops can be interspearsed with two simple 1-uop instructions for free, in the sense that they do not add to the decoding time. Example:

MOV     EBX, [MEM1]     ; 1 uop  (D0)
INC     EBX             ; 1 uop  (D1)
ADD     EAX, [MEM2]     ; 2 uops (D0)
ADD     [MEM3], EAX     ; 4 uops (D0)

This takes 3 clock cycles to decode. You can save one clock cycle by reordering the instructions into two decode groups:

ADD     EAX, [MEM2]     ; 2 uops (D0)
MOV     EBX, [MEM1]     ; 1 uop  (D1)
INC     EBX             ; 1 uop  (D2)
ADD     [MEM3], EAX     ; 4 uops (D0)

The decoders now generate 8 uops in two clock cycles, which is probably satisfactory. Later stages in the pipeline can handle only 3 uops per clock cycle so with a decoding rate higher than this you can assume that decoding is not a bottleneck. However, complications in the fetch mechanism can delay decoding as described in the next chapter, so to be safe you may want to aim at a decoding rate higher than 3 uops per clock cycle.

You can see how many uops each instruction generates in the tables in chapter 29.

Instruction prefixes can also incur penalties in the decoders. Instructions can have several kinds of prefixes:

15. Instruction fetch (PPro, PII and PIII)

The code is fetched in aligned 16-bytes chunks from the code cache and placed in the double buffer, which is called so because it can contain two such chunks. The code is then taken from the double buffer and fed to the decoders in blocks which are usually 16 bytes long, but not necessarily aligned by 16. I will call these blocks ifetch blocks (instruction fetch blocks). If an ifetch block crosses a 16 byte boundary in the code then it needs to take from both chunks in the double buffer. So the purpose of the double buffer is to allow instruction fetching across 16 byte boundaries.

The double buffer can fetch one 16-bytes chunk per clock cycle and can generate one ifetch block per clock cycle. The ifetch blocks are usually 16 bytes long, but can be shorter if there is a predicted jump in the block. (See chapter 22 about jump prediction).

Unfortunately, the double buffer is not big enough for handling fetches around jumps without delay. If the ifetch block that contains the jump instruction crosses a 16-byte boundary then the double buffer needs to keep two consecutive aligned 16-bytes chunks of code in order to generate it. If the first instruction after the jump crosses a 16-byte boundary, then the double buffer needs to load two new 16-bytes chunks of code before a valid ifetch block can be generated. This means that, in the worst case, the decoding of the first instruction after a jump can be delayed for two clock cycles. You get one penalty for a 16-byte boundary in the ifetch block containing the jump instruction, and one penalty for a 16-byte boundary in the first instruction after the jump. You can get bonus if you have more than one decode group in the ifetch block that contains the jump because this gives the double buffer extra time to fetch one or two 16-byte chunks of code in advance for the instructions after the jump. The bonuses can compensate for the penalties according to the table below. If the double buffer has fetched only one 16-byte chunk of code after the jump, then the first ifetch block after the jump will be identical to this chunk, that is, aligned to a 16-byte boundary. In other words, the first ifetch block after the jump will not begin at the first instruction, but at the nearest preceding address divisible by 16. If the double buffer has had time to load two 16-byte chunks, then the new ifetch block can cross a 16-byte boundary and begin at the first instruction after the jump. These rules are summarized in the following table:

Number of
decode groups
in ifetch-block
containing jump
16-byte
boundary in this
ifetch-block
16-byte
boundary in first
instruction after
jump


decoder delay
alignment of first
ifetch after jump
1 0 0 0 by 16
1 0 1 1 to instruction
1 1 0 1 by 16
1 1 1 2 to instruction
2 0 0 0 to instruction
2 0 1 0 to instruction
2 1 0 0 by 16
2 1 1 1 to instruction
3 or more 0 0 0 to instruction
3 or more 0 1 0 to instruction
3 or more 1 0 0 to instruction
3 or more 1 1 0 to instruction

Jumps delay the fetching so that a loop always takes at least two clock cycles more per iteration than the number of 16 byte boundaries in the loop.

A further problem with the instruction fetch mechanism is that a new ifetch block is not generated until the previous one is exhausted. Each ifetch block can contain several decode groups. If a 16 bytes long ifetch block ends with an unfinished instruction, then the next ifetch block will begin at the beginning of that instruction. The first instruction in an ifetch block always goes to decoder D0, and the next two instructions go to D1 and D2, if possible. The consequence of this is that D1 and D2 are used less than optimally. If the code is structured according to the recommended 4-1-1 pattern, and an instruction intended to go into D1 or D2 happens to be the first instruction in an ifetch block, then that instruction has to go into D0 with the result that one clock cycle is wasted. This is probably a hardware design flaw. At least it is suboptimal design. The consequence of this problem is that the time it takes to decode a piece of code can vary considerably depending on where the first ifetch block begins.

If decoding speed is critical and you want to avoid these problems then you have to know where each ifetch block begins. This is quite a tedious job. First you need to make your code segment paragraph-aligned in order to know where the 16-byte boundaries are. Then you have to look at the output listing from your assembler to see how long each instruction is. (It is recommended that you study how instructions are coded so that you can predict the lengths of the instructions.) If you know where one ifetch block begins then you can find where the next ifetch block begins in the following way: Make the block 16 bytes long. If it ends at an instruction boundary then the next block will begin there. If it ends with an unfinished instruction then the next block will begin at the beginning of this instruction. (Only the lengths of the instructions counts here, it doesn't matter how many uops they generate or what they do). This way you can work your way all through the code and mark where each ifetch block begins. The only problem is knowing where to start. If you know where one ifetch block is then you can find all the subsequent ones, but you have to know where the first one begins. Here are some guidelines:

I am sure you want an example now:

 address      instruction             length    uops    expected decoder
 ----------------------------------------------------------------------
 1000h        MOV ECX, 1000             5         1       D0
 1005h   LL:  MOV [ESI], EAX            2         2       D0
 1007h        MOV [MEM], 0             10         2       D0
 1011h        LEA EBX, [EAX+200]        6         1       D1
 1017h        MOV BYTE PTR [ESI], 0     3         2       D0
 101Ah        BSR EDX, EAX              3         2       D0
 101Dh        MOV BYTE PTR [ESI+1],0    4         2       D0
 1021h        DEC ECX                   1         1       D1
 1022h        JNZ LL                    2         1       D2

Let's assume that the first ifetch block begins at address 1000h and ends at 1010h. This is before the end of the MOV [MEM],0 instruction so the next ifetch block will begin at 1007h and end at 1017h. This is at an instruction boundary so the third ifetch block will begin at 1017h and cover the rest of the loop. The number of clock cycles it takes to decode this is the number of D0 instructions, which is 5 per iteration of the LL loop. The last ifetch block contained three decode blocks covering the last five instructions, and it has one 16-byte boundary (1020h). Looking at the table above we find that the first ifetch block after the jump will begin at the first instruction after the jump, that is the LL label at 1005h, and end at 1015h. This is before the end of the LEA instruction, so the next ifetch block will go from 1011h to 1021h, and the last one from 1021h covering the rest. Now the LEA instruction and the DEC instruction both fall at the beginning of an ifetch block which forces them to go into D0. We now have 7 instructions in D0 and the loop takes 7 clocks to decode in the second iteration. The last ifetch block contains only one decode group (DEC ECX / JNZ LL) and has no 16-byte boundary. According to the table, the next ifetch block after the jump will begin at a 16-byte boundary, which is 1000h. This will give us the same situation as in the first iteration, and you will see that the loop takes alternatingly 5 and 7 clock cycles to decode. Since there are no other bottlenecks, the complete loop will take 6000 clocks to run 1000 iterations. If the starting address had been different so that you had a 16-byte boundary in the first or the last instruction of the loop then it would take 8000 clocks. If you reorder the loop so that no D1 or D2 instructions fall at the beginning of an ifetch block then you can make it take only 5000 clocks.

The example above was deliberately constructed so that fetch and decoding is the only bottleneck. The easiest way to avoid this problem is to structure your code to generate much more than 3 uops per clock cycle so that decoding will not be a bottleneck despite the penalties described here. In small loops this may not be possible and then you have to find out how to optimize the instruction fetch and decoding.

One thing you can do is to change the starting address of your procedure in order to avoid 16-byte boundaries where you don't want them. Remember to make your code segment paragraph aligned so that you know where the boundaries are.

If you insert an ALIGN 16 directive before the loop entry then the assembler will put in NOP's and other filler instructions up to the nearest 16 byte boundary. Most assemblers use the instruction XCHG EBX,EBX as a 2-byte filler (the so called 2-byte NOP). Whoever got this idea, it's a bad one because this instruction takes more time than two NOP's on most processors! If the loop executes many times then whatever is outside the loop is unimportant in terms of speed and you don't have to care about the suboptimal filler instructions. But if the time taken by the fillers is important then you may select the filler instructions manually. You may as well use filler instructions that do something useful, such as refreshing a register in order to avoid register read stalls (see chapter 16.2) For example, if you are using register EBP for addressing but seldom write to it, then you may use MOV EBP,EBP or ADD EBP, 0 as filler in order to reduce the possibilities of register read stalls. If you have nothing useful to do, you may use FXCH ST(0) as a good filler because it doesn't put any load on the execution ports, provided that ST(0) contains a valid floating point value.

Another possible remedy is to reorder your instructions in order to get the ifetch boundaries where they don't hurt. This can be quite a difficult puzzle and it is not always possible to find a satisfactory solution.

Yet another possibility is to manipulate instruction lengths. Sometimes you can substitute one instruction with another one with a different length. Many instructions can be coded in different versions with different lengths. The assembler always chooses the shortest possible version of an instruction, but it is often possible to hard-code a longer version. For example, DEC ECX is one byte long, SUB ECX,1 is 3 bytes, and you can code a 6 bytes version with a long immediate operand using this trick:

         SUB ECX, 9999
         ORG $-4
         DD 1

Instructions with a memory operand can be made one byte longer with a SIB byte, but the easiest way of making an instruction one byte longer is to add a DS: segment prefix (DB 3Eh). The microprocessors generally accept redundant and meaningless prefixes (except LOCK) as long as the instruction length does not exceed 15 bytes. Even instructions without a memory operand can have a segment prefix. So if you want the DEC ECX instruction to be 2 bytes long, write:

         DB  3Eh
         DEC ECX

Remember that you get a penalty in the decoder if an instruction has more than one prefix. It is possible that instructions with meaningless prefixes - especially repeat and lock prefixes - will be used in future processors for new instructions when there are no more vacant instruction codes, but I would consider it safe to use a segment prefix with any instruction.

With these methods it will usually be possible to put the ifetch boundaries where you want them, although it can be a tedious puzzle.

16. Register renaming (PPro, PII and PIII)

16.1 Eliminating dependencies

Register renaming is an advanced technique used by these microprocessors to remove dependencies between different parts of the code. Example:

         MOV EAX, [MEM1]
         IMUL EAX, 6
         MOV [MEM2], EAX
         MOV EAX, [MEM3]
         INC EAX
         MOV [MEM4], EAX

Here the last three instructions are independent of the first three in the sense that they don't need any result from the first three instructions. To optimize this on earlier processors you would have to use a different register instead of EAX in the last three instructions and reorder the instructions so that the last three instructions could execute in parallel with the first three instructions. The PPro, PII and PIII processors do this for you automatically. They assign a new temporary register for EAX every time you write to it. Thereby the MOV EAX,[MEM3] instruction becomes independent of the preceding instructions. With out-of-order execution it is likely to finish the move to [MEM4] before the slow IMUL instruction is finished.

Register renaming goes fully automatically. A new temporary register is assigned as an alias for the permanent register every time an instruction writes to this register. An instruction that both reads and writes a register also causes renaming. For example the INC EAX instruction above uses one temporary register for input and another temporary register for output. This does not remove any dependency, of course, but it has some significance for subsequent register reads as I will explain later.

All general purpose registers, stack pointer, flags, floating point registers, MMX registers, XMM registers and segment registers can be renamed. Control words, and the floating point status word cannot be renamed and this is the reason why the use of these registers is slow. There are 40 universal temporary registers so it is unlikely that you will run out of temporary registers.

A common way of setting a register to zero is XOR EAX,EAX or SUB EAX,EAX. These instructions are not recognized as independent of the previous value of the register. If you want to remove the dependency on slow preceding instructions then use MOV EAX,0.

Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB). The uops from the decoders go to the RAT via a queue, and then to the ROB and the reservation station. The RAT can handle only 3 uops per clock cycle. This means that the overall throughput of the microprocessor can never exceed 3 uops per clock cycle on average.

There is no practical limit to the number of renamings. The RAT can rename three registers per clock cycle, and it can even rename the same register three times in one clock cycle.

16.2 Register read stalls

But there is another limitation which may be quite serious, and that is that you can only read two different permanent register names per clock cycle. This limitation applies to all registers used by an instruction except those registers that the instruction writes to only. Example:

         MOV [EDI + ESI], EAX
         MOV EBX, [ESP + EBP]

The first instruction generates two uops: one that reads EAX and one that reads EDI and ESI. The second instruction generates one uop that reads ESP and EBP. EBX does not count as a read because it is only written to by the instruction. Let's assume that these three uops go through the RAT together. I will use the word triplet for a group of three consecutive uops that go through the RAT together. Since the ROB can handle only two permanent register reads per clock cycle and we need five register reads, our triplet will be delayed for two extra clock cycles before it comes to the reservation station. With 3 or 4 register reads in the triplet it would be delayed by one clock cycle.

The same register can be read more than once in the same triplet without adding to the count. If the instructions above are changed to:

         MOV [EDI + ESI], EDI
         MOV EBX, [EDI + EDI]

then you will need only two register reads (EDI and ESI) and the triplet will not be delayed.

A register that is going to be written to by a pending uop is stored in the ROB so that it can be read for free until it is written back, which takes at least 3 clock cycles, and usually more. Write-back is the end of the execution stage where the value becomes available. In other words, you can read any number of registers in the RAT without stall if their values are not yet available from the execution units. The reason for this is that when a value becomes available it is immediately written directly to any subsequent ROB entries that need it. But if the value has already been written back to a temporary or permanent register when a subsequent uop that needs it goes into the RAT, then the value has to be read from the register file, which has only two read ports. There are three pipeline stages from the RAT to the execution unit so you can be certain that a register written to in one uop-triplet can be read for free in at least the next three triplets. If the writeback is delayed by reordering, slow instructions, dependency chains, cache misses, or by any other kind of stall, then the register can be read for free further down the instruction stream.

Example:

         MOV EAX, EBX
         SUB ECX, EAX
         INC EBX
         MOV EDX, [EAX]
         ADD ESI, EBX
         ADD EDI, ECX

These 6 instructions generate 1 uop each. Let's assume that the first 3 uops go through the RAT together. These 3 uops read register EBX, ECX, and EAX. But since we are writing to EAX before reading it, the read is free and we get no stall. The next three uops read EAX, ESI, EBX, EDI, and ECX. Since both EAX, EBX and ECX have been modified in the preceding triplet and not yet written back then they can be read for free, so that only ESI and EDI count, and we get no stall in the second triplet either. If the SUB ECX,EAX instruction in the first triplet is changed to CMP ECX,EAX then ECX is not written to and we will get a stall in the second triplet for reading ESI, EDI and ECX. Similarly, if the INC EBX instruction in the first triplet is changed to NOP or something else then we will get a stall in the second triplet for reading ESI, EBX and EDI.

No uop can read more than two registers. Therefore, all instructions that need to read more than two registers are split up into two or more uops.

To count the number of register reads, you have to include all registers which are read by the instruction. This includes integer registers, the flags register, the stack pointer, floating point registers and MMX registers. An XMM register counts as two registers, except when only part of it is used, as e.g. in ADDSS and MOVHLPS. Segment registers and the instruction pointer do not count. For example, in SETZ AL you count the flags register but not AL. ADD EBX,ECX counts both EBX and ECX, but not the flags because they are written to only. PUSH EAX reads EAX and the stack pointer and then writes to the stack pointer.

The FXCH instruction is a special case. It works by renaming, but doesn't read any values so that it doesn't count in the rules for register read stalls. An FXCH instruction behaves like 1 uop that neither reads nor writes any registers with regard to the rules for register read stalls.

Don't confuse uop triplets with decode groups. A decode group can generate from 1 to 6 uops, and even if the decode group has three instructions and generates three uops there is no guarantee that the three uops will go into the RAT together.

The queue between the decoders and the RAT is so short (10 uops) that you cannot assume that register read stalls do not stall the decoders or that fluctuations in decoder throughput do not stall the RAT.

It is very difficult to predict which uops go through the RAT together unless the queue is empty, and for optimized code the queue should be empty only after mispredicted branches. Several uops generated by the same instruction do not necessarily go through the RAT together; the uops are simply taken consecutively from the queue, three at at time. The sequence is not broken by a predicted jump: uops before and after the jump can go through the RAT together. Only a mispredicted jump will discard the queue and start over again so that the next three uops are sure to go into the RAT together.

If three consecutive uops read more than two different registers then you would of course prefer that they do not go through the RAT together. The probability that they do is one third. The penalty of reading three or four written-back registers in one triplet of uops is one clock cycle. You can think of the one clock delay as equivalent to the load of three more uops through the RAT. With the probability of 1/3 of the three uops going into the RAT together, the average penalty will be the equivalent of 3/3 = 1 uop. To calculate the average time it will take for a piece of code to go through the RAT, add the number of potential register read stalls to the number of uops and divide by three. You can see that It doesn't pay to remove the stall by putting in an extra instruction unless you know for sure which uops go into the RAT together or you can prevent more than one potential register read stall by one extra instruction.

In situations where you aim at a throughput of 3 uops per clock, the limit of two permanent register reads per clock cycle may be a problematic bottleneck to handle. Possible ways to remove register read stalls are:

For instructions that generate more than one uop you may want to know the order of the uops generated by the instruction in order to make a precise analysis of the possibility of register read stalls. I have therefore listed the most common cases below.

Writes to memory
A memory write generates two uops. The first one (to port 4) is a store operation, reading the register to store. The second uop (port 3) calculates the memory address, reading any pointer registers. Examples:
MOV [EDI], EAX
First uop reads EAX, second uop reads EDI.
FSTP QWORD PTR [EBX+8*ECX]
First uop reads ST(0), second uop reads EBX and ECX.

Read and modify
An instruction that reads a memory operand and modifies a register by some arithmetic or logical operation generates two uops. The first one (port 2) is a memory load instruction reading any pointer registers, the second uop is an arithmetic instruction (port 0 or 1) reading and writing to the destination register and possibly writing to the flags. Example:
ADD EAX, [ESI+20]
First uop reads ESI, second uop reads EAX and writes EAX and flags.

Read/modify/write
A read/modify/write instruction generates four uops. The first uop (port 2) reads any pointer registers, the second uop (port 0 or 1) reads and writes to any source register and possibly writes to the flags, the third uop (port 4) reads only the temporary result which doesn't count here, the fourth uop (port 3) reads any pointer registers again. Since the first and the fourth uop cannot go into the RAT together you cannot take advantage of the fact that they read the same pointer registers. Example:
OR [ESI+EDI], EAX
The first uop reads ESI and EDI, the second uop reads EAX and writes EAX and the flags, the third uop reads only the temporary result, the fourth uop reads ESI and EDI again. No matter how these uops go into the RAT you can be sure that the uop that reads EAX goes together with one of the uops that read ESI and EDI. A register read stall is therefore inevitable for this instruction unless one of the registers has been modified recently.

Push register
A push register instruction generates 3 uops. The first one (port 4) is a store instruction, reading the register. The second uop (port 3) generates the address, reading the stack pointer. The third uop (port 0 or 1) subtracts the word size from the stack pointer, reading and modifying the stack pointer.

Pop register
A pop register instruction generates 2 uops. The first uop (port 2) loads the value, reading the stack pointer and writing to the register. The second uop (port 0 or 1) adjusts the stack pointer, reading and modifying the stack pointer.

Call
A near call generates 4 uops (port 1, 4, 3, 01). The first two uops read only the instruction pointer which doesn't count because it cannot be renamed. The third uop reads the stack pointer. The last uop reads and modifies the stack pointer.

Return
A near return generates 4 uops (port 2, 01, 01, 1). The first uop reads the stack pointer. The third uop reads and modifies the stack pointer.

An example of how to avoid a register read stall is given in example 2.6.

17. Out of order execution (PPro, PII and PIII)

The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its operands are ready and there is a vacant execution unit for it. This makes out-of-order execution possible. If one part of the code is delayed because of a cache miss then it won't delay later parts of the code if they are independent of the delayed operations.

Writes to memory cannot execute out of order relative to other writes. There are four write buffers, so if you expect many cache misses on writes or you are writing to uncached memory then it is recommended that you schedule four writes at at time and make sure the processor has something else to do before you give it the next four writes. Memory reads and other instructions can execute out of order, except IN, OUT and serializing instructions.

If your code writes to a memory address and soon after reads from the same address, then the read may by mistake be executed before the write because the ROB doesn't know the memory addresses at the time of reordering. This error is detected when the write address is calculated, and then the read operation (which was executed speculatively) has to be re-done. The penalty for this is approximately 3 clocks. The only way to avoid this penalty is to make sure the execution unit has other things to do between a write and a subsequent read from the same memory address.

There are several execution units clustered around five ports. Port 0 and 1 are for arithmetic operations etc. Simple move, arithmetic and logic operations can go to either port 0 or 1, whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and rotates, and floating point operations. Port 1 also handles jumps and some MMX and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory write, and port 4 executes all memory write operations. In chapter 29 you'll find a complete list of the uops generated by code instructions with an indication of which ports they go to. Note that all memory write operations require two uops, one for port 3 and one for port 4, while memory read operations use only one uop (port 2).

In most cases each port can receive one new uop per clock cycle. This means that you can execute up to 5 uops in the same clock cycle if they go to five different ports, but since there is a limit of 3 uops per clock earlier in the pipeline you will never execute more than 3 uops per clock on average.

You must make sure that no execution port receives more than one third of the uops if you want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter 29 and count how many uops go to each port. If port 0 and 1 are saturated while port 2 is free then you can improve your code by replacing some MOV register,register or MOV register,immediate instructions with MOV register,memory in order to move some of the load from port 0 and 1 to port 2.

Most uops take only one clock cycle to execute, but multiplications, divisions, and many floating point operations take more:

Floating point addition and subtraction takes 3 clocks, but the execution unit is fully pipelined so that it can receive a new FADD or FSUB in every clock cycle before the preceding ones are finished (provided, of course, that they are independent).

Integer multiplication takes 4 clocks, floating point multiplication 5, and MMX multiplication 3 clocks. Integer and MMX multiplication is pipelined so that it can receive a new instruction every clock cycle. Floating point multiplication is partially pipelined: The execution unit can receive a new FMUL instruction two clocks after the preceding one, so that the maximum throughput is one FMUL per two clock cycles. The holes between the FMUL's cannot be filled by integer multiplications because they use the same circuitry. XMM additions and multiplications take 3 and 4 clocks respectively, and are fully pipelined. But since each logical XMM register is implemented as two physical 64-bit registers, you need two uops for a packed XMM operation, and the throughput will then be one arithmetic XMM instruction every two clock cycles. XMM add and multiply instructions can execute in parallel because they don't use the same execution port.

Integer and floating point division takes up to 39 clocks and is not pipelined. This means that the execution unit cannot begin a new division until the previous division is finished. The same applies to squareroot and transcendental functions.

Also jump instructions, calls, and returns are not fully pipelined. You cannot execute a new jump in the first clock cycle after a preceding jump. So the maximum throughput for jumps, calls, and returns is one for every two clocks.

You should, of course, avoid instructions that generate many uops. The LOOP XX instruction, for example, should be replaced by DEC ECX / JNZ XX.

If you have consecutive POP instructions then you may break them up to reduce the number of uops:

POP ECX / POP EBX / POP EAX     ; can be changed to:
MOV ECX,[ESP] / MOV EBX,[ESP+4] / MOV EAX,[ESP] / ADD ESP,12
The former code generates 6 uops, the latter generates only 4 and decodes faster. Doing the same with PUSH instructions is less advantageous because the split-up code is likely to generate register read stalls unless you have other instructions to put in between or the registers have been renamed recently. Doing it with CALL and RET instructions will interfere with prediction in the return stack buffer. Note also that the ADD ESP instruction can cause an AGI stall in earlier processors.

18. Retirement (PPro, PII and PIII)

Retirement is a process where the temporary registers used by the uops are copied into the permanent registers EAX, EBX, etc. When a uop has been executed it is marked in the ROB as ready to retire.

The retirement station can handle three uops per clock cycle. This may not seem like a problem because the throughput is already limited to 3 uops per clock in the RAT. But retirement may still be a bottleneck for two reasons. Firstly, instructions must retire in order. If a uop is executed out of order then it cannot retire before all preceding uops in the order have retired. And the second limitation is that taken jumps must retire in the first of the three slots in the retirement station. Just like decoder D1 and D2 can be idle if the next instruction only fits into D0, the last two slots in the retirement station can be idle if the next uop to retire is a taken jump. This is significant if you have a small loop where the number of uops in the loop is not divisible by three.

All uops stay in the reorder buffer (ROB) until they retire. The ROB can hold 40 uops. This sets a limit to the number of instructions that can execute during the long delay of a division or other slow operation. Before the division is finished the ROB will be filled up with executed uops waiting to retire. Only when the division is finished and retired can the subsequent uops begin to retire, because retirement takes place in order.

In case of speculative execution of predicted branches (see chapter 22) the speculatively executed uops cannot retire until it is certain that the prediction was correct. If the prediction turns out to be wrong then the speculatively executed uops are discarded without retirement.

The following instructions cannot execute speculatively: memory writes, IN, OUT, and serializing instructions.

19. Partial stalls (PPro, PII and PIII)

19.1 Partial register stalls

Partial register stall is a problem that occurs when you write to part of a 32 bit register and later read from the whole register or a bigger part of it. Example:

        MOV AL, BYTE PTR [M8]
        MOV EBX, EAX            ; partial register stall

This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned to AL (to make it independent of AH). The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX. The stall can be avoided by changing to code to:

        MOVZX EBX, BYTE PTR [MEM8]
        AND EAX, 0FFFFFF00h
        OR EBX, EAX

Of course you can also avoid the partial stalls by putting in other instructions after the write to the partial register so that it has time to retire before you read from the full register.

You should be aware of partial stalls whenever you mix different data sizes (8, 16, and 32 bits):

        MOV BH, 0
        ADD BX, AX              ; stall
        INC EBX                 ; stall

You don't get a stall when reading a partial register after writing to the full register, or a bigger part of it:

        MOV EAX, [MEM32]
        ADD BL, AL              ; no stall
        ADD BH, AH              ; no stall
        MOV CX, AX              ; no stall
        MOV DX, BX              ; stall

The easiest way to avoid partial register stalls is to always use full registers and use MOVZX or MOVSX when reading from smaller memory operands. These instructions are fast on the PPro, PII and PIII, but slow on earlier processors. Therefore, a compromise is offered when you want your code to perform reasonably well on all processors. The replacement for MOVZX EAX,BYTE PTR [M8] looks like this:

        XOR EAX, EAX
        MOV AL, BYTE PTR [M8]

The PPro, PII and PIII processors make a special case out of this combination to avoid a partial register stall when later reading from EAX. The trick is that a register is tagged as empty when it is XOR'ed with itself. The processor remembers that the upper 24 bits of EAX are zero, so that a partial stall can be avoided. This mechanism works only on certain combinations:

        XOR EAX, EAX
        MOV AL, 3
        MOV EBX, EAX            ; no stall

        XOR AH, AH
        MOV AL, 3
        MOV BX, AX              ; no stall

        XOR EAX, EAX
        MOV AH, 3
        MOV EBX, EAX            ; stall

        SUB EBX, EBX
        MOV BL, DL
        MOV ECX, EBX            ; no stall

        MOV EBX, 0
        MOV BL, DL
        MOV ECX, EBX            ; stall

        MOV BL, DL
        XOR EBX, EBX            ; no stall

Setting a register to zero by subtracting it from itself works the same as the XOR, but setting it to zero with the MOV instruction doesn't prevent the stall.

You can set the XOR outside a loop:

        XOR EAX, EAX
        MOV ECX, 100
LL:     MOV AL, [ESI]
        MOV [EDI], EAX          ; no stall
        INC ESI
        ADD EDI, 4
        DEC ECX
        JNZ LL

The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.

You should remember to neutralize any partial register you have used before calling a subroutine that might push the full register:

        ADD BL, AL
        MOV [MEM8], BL
        XOR EBX, EBX            ; neutralize BL
        CALL _HighLevelFunction

Most high level language procedures push EBX at the start of the procedure which would generate a partial register stall in the example above if you hadn't neutralized BL.

Setting a register to zero with the XOR method doesn't break its dependency on earlier instructions:

        DIV EBX
        MOV [MEM], EAX
        MOV EAX, 0              ; break dependency
        XOR EAX, EAX            ; prevent partial register stall
        MOV AL, CL
        ADD EBX, EAX

Setting EAX to zero twice here seems redundant, but without the MOV EAX,0 the last instructions would have to wait for the slow DIV to finish, and without XOR EAX,EAX you would have a partial register stall.

The FNSTSW AX instruction is special: in 32 bit mode it behaves as if writing to the entire EAX. In fact, it does something like this in 32 bit mode:
AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP
hence, you don't get a partial register stall when reading EAX after this instruction in 32 bit mode:

    FNSTSW AX / MOV EBX,EAX         ; stall only if 16 bit mode
    MOV AX,0  / FNSTSW AX           ; stall only if 32 bit mode

19.2 Partial flags stalls

The flags register can also cause partial register stalls:

        CMP EAX, EBX
        INC ECX
        JBE XX          ; partial flags stall

The JBE instruction reads both the carry flag and the zero flag. Since the INC instruction changes the zero flag, but not the carry flag, the JBE instruction has to wait for the two preceding instructions to retire before it can combine the carry flag from the CMP instruction and the zero flag from the INC instruction. This situation is likely to be a bug rather than an intended combination of flags. To correct it change INC ECX to ADD ECX,1. A similar bug that causes a partial flags stall is SAHF / JL XX. The JL instruction tests the sign flag and the overflow flag, but SAHF doesn't change the overflow flag. To correct it, change JL XX to JS XX.

Unexpectedly (and contrary to what Intel manuals say) you also get a partial flags stall after an instruction that modifies some of the flag bits when reading only unmodified flag bits:

        CMP EAX, EBX
        INC ECX
        JC  XX          ; partial flags stall

but not when reading only modified bits:

        CMP EAX, EBX
        INC ECX
        JE  XX          ; no stall

Partial flags stalls are likely to occur on instructions that read many or all flags bits, i.e. LAHF, PUSHF, PUSHFD. The following instructions cause partial flags stalls when followed by LAHF or PUSHF(D): INC, DEC, TEST, bit tests, bit scan, CLC, STC, CMC, CLD, STD, CLI, STI, MUL, IMUL, and all shifts and rotates. The following instructions do not cause partial flags stalls: AND, OR, XOR, ADD, ADC, SUB, SBB, CMP, NEG. It is strange that TEST and AND behave differently while, by definition, they do exactly the same thing to the flags. You may use a SETcc instruction instead of LAHF or PUSHF(D) for storing the value of a flag in order to avoid a stall.

Examples:

    INC EAX   / PUSHFD      ; stall
    ADD EAX,1 / PUSHFD      ; no stall

    SHR EAX,1 / PUSHFD      ; stall
    SHR EAX,1 / OR EAX,EAX / PUSHFD   ; no stall

    TEST EBX,EBX / LAHF     ; stall
    AND  EBX,EBX / LAHF     ; no stall
    TEST EBX,EBX / SETZ AL  ; no stall

    CLC / SETZ AL           ; stall
    CLD / SETZ AL           ; no stall

The penalty for partial flags stalls is approximately 4 clocks.

19.3 Flags stalls after shifts and rotates

You can get a stall resembling the partial flags stall when reading any flag bit after a shift or rotate, except for shifts and rotates by one (short form):

    SHR EAX,1 / JZ XX                ; no stall
    SHR EAX,2 / JZ XX                ; stall
    SHR EAX,2 / OR EAX,EAX / JZ XX   ; no stall

    SHR EAX,5 / JC XX                ; stall
    SHR EAX,4 / SHR EAX,1 / JC XX    ; no stall

    SHR EAX,CL / JZ XX               ; stall, even if CL = 1
    SHRD EAX,EBX,1 / JZ XX           ; stall
    ROL EBX,8 / JC XX                ; stall

The penalty for these stalls is approximately 4 clocks.

19.4 Partial memory stalls

A partial memory stall is somewhat analogous to a partial register stall. It occurs when you mix data sizes for the same memory address:

        MOV BYTE PTR [ESI], AL
        MOV EBX, DWORD PTR [ESI]        ; partial memory stall

Here you get a stall because the processor has to combine the byte written from AL with the next three bytes, which were in memory before, to get the four bytes needed for reading into EBX. The penalty is approximately 7-8 clocks.

Unlike the partial register stalls, you also get a partial memory stall when you write a bigger operand to memory and then read part of it, if the smaller part doesn't start at the same address:

        MOV DWORD PTR [ESI], EAX
        MOV BL, BYTE PTR [ESI]          ; no stall
        MOV BH, BYTE PTR [ESI+1]        ; stall

You can avoid this stall by changing the last line to MOV BH,AH, but such a solution is not possible in a situation like this:

        FISTP QWORD PTR [EDI]
        MOV EAX, DWORD PTR [EDI]
        MOV EDX, DWORD PTR [EDI+4]      ; stall

Interestingly, you can also get a partial memory stall when writing and reading completely different addresses if they happen to have the same set-value in different cache banks:

        MOV BYTE PTR [ESI], AL
        MOV EBX, DWORD PTR [ESI+4092]   ; no stall
        MOV ECX, DWORD PTR [ESI+4096]   ; stall

20. Dependency chains (PPro, PII and PIII)

A series of instructions where each instruction depends on the result of the preceding one is called a dependency chain. Long dependency chains should be avoided, if possible, because they prevent out-of-order and parallel execution.

Example:

   MOV EAX, [MEM1]
   ADD EAX, [MEM2]
   ADD EAX, [MEM3]
   ADD EAX, [MEM4]
   MOV [MEM5], EAX

In this eaxmple, the ADD instructions generate 2 uops each, one for reading from memory (port 2), and one for adding (port 0 or 1). The read uops can execute out or order, while the add uops must wait for the previous uops to finish. This dependency chain does not take very long to execute, because each addition adds only 1 clock to the execution time. But if you have slow instructions like multiplications, or even worse: divisions, then you should definitely do something to break the dependency chain. The way to do this is to use multiple accumulators:

   MOV EAX, [MEM1]         ; start first chain
   MOV EBX, [MEM2]         ; start other chain in different accumulator
   IMUL EAX, [MEM3]
   IMUL EBX, [MEM4]
   IMUL EAX, EBX           ; join chains in the end
   MOV [MEM5], EAX

Here, the second IMUL instruction can start before the first one is finished. Since the IMUL instruction has a delay of 4 clocks and is fully pipelined, you may have up to 4 accumulators.

Division is not pipelined so you cannot do the same with chained divisions, but you can of course multiply all the divisors and do only one division in the end.

Floating point instructions have a longer delay than integer instructions, so you should definitely break up long dependency chains with floating point instructions:

   FLD [MEM1]         ; start first chain
   FLD [MEM2]         ; start second chain in different accumulator
   FADD [MEM3]
   FXCH
   FADD [MEM4]
   FXCH
   FADD [MEM5]
   FADD               ; join chains in the end
   FSTP [MEM6]

You need a lot of FXCH instructions for this, but don't worry: they are cheap. FXCH instructions are resolved in the RAT by register renaming so they don't put any load on the execution ports. An FXCH does count as 1 uop in the RAT, ROB, and retirement station, though.

If the dependency chain is long you may need three accumulators:

        FLD [MEM1]              ; start first chain
        FLD [MEM2]              ; start second chain
        FLD [MEM3]              ; start third chain
        FADD [MEM4]             ; third chain
        FXCH ST(1)
        FADD [MEM5]             ; second chain
        FXCH ST(2)
        FADD [MEM6]             ; first chain
        FXCH ST(1)
        FADD [MEM7]             ; third chain
        FXCH ST(2)
        FADD [MEM8]             ; second chain
        FXCH ST(1)
        FADD                    ; join first and third chain
        FADD                    ; join with second chain
        FSTP [MEM9]

Avoid storing intermediate data in memory and read them immediately afterwards:

        MOV [TEMP], EAX
        MOV EBX, [TEMP]

There is a penalty for attempting to read from a memory address before a previous write to that address is finished. In the example above, change the last instruction to MOV EBX,EAX or put some other instructions in between.

There is one situation where you cannot avoid storing intermediate data in memory, and that is when transferring data from an integer register to a floating point register, or vice versa. For example:

        MOV EAX, [MEM1]
        ADD EAX, [MEM2]
        MOV [TEMP], EAX
        FILD [TEMP]

If you don't have anything to put in between the write to TEMP and the read from TEMP, then you may consider using a floating point register instead of EAX:

        FILD [MEM1]
        FIADD [MEM2]

Consecutive jumps, calls, or returns may also be considered dependency chains. The throughput for these instructions is one jump per two clock cycles. It is therefore recommended that you give the microprocessor something else to do between the jumps.

21. Searching for bottlenecks (PPro, PII and PIII)

When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if there is another bottleneck which is narrower.

If you expect code cache misses then you should restructure your code to keep the most used parts of code together.

If you expect many data cache misses then forget about everything else and concentrate on how to restructure your data to reduce the number of cache misses (chapter 7), and avoid long dependency chains after a data read cache miss (chapter 20).

If you have many divisions then try to reduce them (chapter 27.2) and make sure the processor has something else to do during the divisions.

Dependency chains tend to hamper out-of-order execution (chapter 20). Try to break long dependency chains, especially if they contain slow instructions such as multiplication, division, and floating point instructions.

If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable, then try if some of them can be avoided. Replace conditional jumps with conditional moves if possible, and replace small procedures with macros (chapter 22.3).

If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial stalls. If you use PUSHF or LAHF instructions then look out for partial flags stalls. Avoid testing flags after shifts or rotates by more than 1 (chapter 19).

If you aim at a throughput of 3 uops per clock cycle then be aware of possible delays in instruction fetch and decoding (chapter and 14 and 15), especially in small loops.

The limit of two permanent register reads per clock cycle may reduce your throughput to less than 3 uops per clock cycle (chapter 16.2). This is likely to happen if you often read registers more than 4 clock cycles after they last were modified. This may, for example, happen if you often use pointers for addressing your data but seldom modify the pointers.

A throughput of 3 uops per clock requires that no execution port gets more than one third of the uops (chapter 17).

The retirement station can handle 3 uops per clock, but may be slightly less effective for taken jumps (chapter 18).

22. Jumps and branches (all processors)

The Pentium family of processors attempt to predict where a jump will go to, and whether a conditional jump will be taken or fall through. If the prediction is correct, then it can save a considerable amount of time by loading the subsequent instructions into the pipeline and start decoding them before the jump is executed. If the prediction turns out to be wrong, then the pipeline has to be flushed, which will cost a penalty depending on the length of the pipeline.

The predictions are based on a Branch Target Buffer (BTB) which stores the history for each branch or jump instruction and makes predictions based on the prior history of executions of each instruction. The BTB is organized like a set-associative cache where new entries are allocated according to a pseudo-random replacement method.

When optimizing code, it is important to minimize the number of misprediction penalties. This requires a good understanding of how the jump prediction works.

The branch prediction mechanisms are not described adequately in Intel manuals or anywhere else. I am therefore giving a very detailed description here. This information is based on my own research (with the help of Karki Jitendra Bahadur for the PPlain).

In the following, I will use the term 'control transfer instruction' for any instruction which can change the instruction pointer, including conditional and unconditional, direct and indirect, near and far, jumps, calls, and returns. All these instructions use prediction.

22.1 Branch prediction in PPlain

The branch prediction mechanism for the PPlain is very different from the other three processors. Information found in Intel documents and elsewhere on this subject is directly misleading, and following the advises given is such documents is likely to lead to sub-optimal code.

The PPlain has a branch target buffer (BTB), which can hold information for up to 256 jump instructions. The BTB is organized like a 4-way set-associative cache with 64 entries per way. This means that the BTB can hold no more than 4 entries with the same set value. Unlike the data cache, the BTB uses a pseudo random replacement algorithm, which means that a new entry will not necessarily displace the least recently used entry of the same set-value. How the set-value is calculated will be explained later. Each BTB entry stores the address of the jump target and a prediction state, which can have four different values:

state 0: "strongly not taken"
state 1: "weakly not taken"
state 2: "weakly taken"
state 3: "strongly taken"

A branch instruction is predicted to jump when in state 2 or 3, and to fall through when in state 0 or 1. The state transition works like a two-bit counter, so that the state is incremented when the branch is taken, and decremented when it falls through. The counter saturates, rather than wrap around, so that it does not decrement beyond 0 or increment beyond 3. Ideally, this would provide a reasonably good prediction, because a branch instruction would have to deviate twice from what it does most of the time, before the prediction changes.

However, this mechanism has been compromised by the fact that state 0 also means 'unused BTB entry'. So a BTB entry in state 0 is the same as no BTB entry. This makes sense, because a branch instruction is predicted to fall through if it has no BTB entry. This improves the utilization of the BTB, because a branch instruction which is seldom taken will most of the time not take up any BTB entry.

Now, if a jumping instruction has no BTB entry, then a new BTB entry will be generated, and this new entry will always be set to state 3. This means that it is impossible to go from state 0 to state 1 (except for a very special case discussed later). From state 0 you can only go to state 3, if the branch is taken. If the branch falls through, then it will stay out of the BTB.

This is a serious design flaw. By throwing state 0 entries out of the BTB and always setting new entries to state 3, the designers apparently have given priority to minimizing the first time penalty for unconditional jumps and branches often taken, and ignored that this seriously compromises the basic idea behind the mechanism and reduces the performance in small innermost loops. The consequence of this flaw is, that a branch instruction which falls through most of the time will have up to three times as many mispredictions as a branch instruction which is taken most of the time. (Apparently, Intel engineers have been unaware of this flaw until I published my findings).

You may take this asymmetry into account by organizing your branches so that they are taken more often than not. Consider for example this if-then-else construction:

        TEST EAX,EAX
        JZ   A
        <branch 1>
        JMP  E
A:      <branch 2>
E:

If branch 1 is executed more often than branch 2, and branch 2 is seldom executed twice in succession, then you can reduce the number of branch mispredictions by up to a factor 3 by swapping the two branches so that the branch instruction will jump more often than fall through:

        TEST EAX,EAX
        JNZ  A
        <branch 2>
        JMP  E
A:      <branch 1>
E:

(This is contrary to the recommendations in Intel's manuals and tutorials).

There may be reasons to put the most often executed branch first, however:

  1. Putting seldom executed branches away in the bottom of your code can improve code cache utilization.
  2. A branch instruction seldom taken will stay out of the BTB most of the time, possibly improving BTB utilization.
  3. The branch instruction will be predicted as not taken if it has been flushed out of the BTB by other branch instructions.
  4. The asymmetry in branch prediction only exists on the PPlain.

These considerations have little weight, however, for small critical loops, so I would still recommend organizing branches with a skewed distribution so that the branch instruction is taken more often than not, unless branch 2 is executed so seldom, that misprediction doesn't matter.

Likewise, you should preferably organize loops with the testing branch instruction at the bottom, as in this example:

        MOV ECX, [N]
L:      MOV [EDI],EAX
        ADD EDI,4
        DEC ECX
        JNZ L

If N is high, then the JNZ instruction here will be taken more often than not, and never fall through twice in succession.

Consider the situation where a branch is taken every second time. The first time it jumps the BTB entry will go into state 3, and will then alternate between state 2 and 3. It is predicted to jump all the time, which gives 50% mispredictions. Assume now that it deviates from this regular pattern and falls through an extra time. The jump pattern is:

01010100101010101010101, where 0 means nojump, and 1 means jump.
       ^
The extra nojump is indicated with a ^ above. After this incident, the BTB entry will alternate between state 1 and 2, which gives 100% mispredictions. It will continue in this unfortunate mode until there is another deviation from the 0101 pattern. This is the worst case for this branch prediction mechanism.

22.1.2 BTB is looking ahead (PPlain)

The BTB mechanism is counting instruction pairs, rather than single instructions, so you have to know how instructions are pairing in order to analyze where a BTB entry is stored. The BTB entry for any control instruction is attached to the address of the U-pipe instruction in the preceding instruction pair. (An unpaired instruction counts as one pair). Example:
    SHR EAX,1
    MOV EBX,[ESI]
    CMP EAX,EBX
    JB  L

Here SHR pairs with MOV, and CMP pairs with JB. The BTB entry for JB L is thus attached to the address of the SHR EAX,1 instruction. When this BTB entry is met, and if it is in state 2 or 3, then the Pentium will read the target address from the BTB entry, and load the instructions following L into the pipeline. This happens before the branch instruction has been decoded, so the Pentium relies solely on the information in the BTB when doing this.

You may remember, that instructions are seldom pairing the first time they are executed (see chapter 8). If the instructions above are not pairing, then the BTB entry should be attached to the address of the CMP instruction, and this entry would be wrong on the next execution, when instructions are pairing. However, in most cases the PPlain is smart enough to not make a BTB entry when there is an unused pairing opportunity, so you don't get a BTB entry until the second execution, and hence you won't get a prediction until the third execution. (In the rare case, where every second instruction is a single-byte instruction, you may get a BTB entry on the first execution which becomes invalid in the second execution, but since the instruction it is attached to will then go to the V-pipe, it is ignored and gives no penalty. A BTB entry is only read if it is attached to the address of a U-pipe instruction).

A BTB entry is identified by its set-value which is equal to bits 0-5 of the address it is attached to. Bits 6-31 are then stored in the BTB as a tag. Addresses which are spaced a multiple of 64 bytes apart will have the same set-value. You can have no more than four BTB entries with the same set-value. If you want to check whether your jump instructions contend for the same BTB entries, then you have to compare bits 0-5 of the addresses of the U-pipe instructions in the preceding instruction pairs. This is very tedious, and I have never heard of anybody doing so. There are no tools available to do this job for you.

22.1.3 Consecutive branches (PPlain)

When a jump is mispredicted, then the pipeline gets flushed. If the next instruction pair executed also contains a control transfer instruction, then the PPlain won't load its target because it cannot load a new target while the pipeline is being flushed. The result is that the second jump instruction is predicted to fall through regardless of the state of its BTB entry. Therefore, if the second jump is also taken, then you will get another penalty. The state of the BTB entry for the second jump instruction does get correctly updated, though. If you have a long chain of control transfer instructions, and the first jump in the chain is mispredicted, then the pipeline will get flushed all the time, and you will get nothing but mispredictions until you meet an instruction pair which does not jump. The most extreme case of this is a loop which jumps to itself: It will get a misprediction penalty for each iteration.

This is not the only problem with consecutive control transfer instructions. Another problem is that you can have another branch instruction between a BTB entry and the control transfer instruction it belongs to. If the first branch instruction jumps to somewhere else, then strange things may happen. Consider this example:

        SHR EAX,1
        MOV EBX,[ESI]
        CMP EAX,EBX
        JB  L1
        JMP L2

L1:     MOV EAX,EBX
        INC EBX

When JB L1 falls through, then you will get a BTB entry for JMP L2 attached to the address of CMP EAX,EBX. But what will happen when JB L1 later is taken? At the time when the BTB entry for JMP L2 is read, the processor doesn't know that the next instruction pair does not contain a jump instruction, so it will actually predict the instruction pair MOV EAX,EBX / INC EBX to jump to L2. The penalty for predicting non-jump instructions to jump is 3 clock cycles. The BTB entry for JMP L2 will get its state decremented, because it is applied to something which doesn't jump. If we keep going to L1, then the BTB entry for JMP L2 will be decremented to state 1 and 0, so that the problem will disappear until next time JMP L2 is executed.

The penalty for predicting the non-jumping instructions to jump only occurs when the jump to