The contents of this document are provided in connection with Advanced
Micro Devices, Inc. (“AMD”) products. AMD makes no representations or
warranties with respect to the accuracy or completeness of the contents of
this publication and reserves the right to make changes to specifications and
product descriptions at any time without notice. No license, whether express,
implied, arising by estoppel or otherwise, to any intellectual property rights
is granted by this publication. Except as set forth in AMD’s Standard Terms
and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims
any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular
purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use
as components in systems intended for surgical implant into the body, or in
other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur.
AMD reserves the right to discontinue or make changes to its products at any
time without notice.
Trademarks
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks,
and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc.
Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
MMX is a trademark and Pentium is a registered trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of
their respective companies.
Revision History
Date: Nov. 1999     Rev: E     Description:
Added “About this Document” on page 1.
Further clarification of “Consider the Sign of Integer Operands” on page 14.
Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Added the optimization, “Accelerating Floating-Point Divides and Square Roots” on page 29.
Clarified examples in “Copy Frequently De-referenced Pointer Arguments to Local Variables” on page 31.
Further clarification of “Select DirectPath Over VectorPath Instructions” on page 34.
Further clarification of “Align Branch Targets in Program Hot Spots” on page 36.
Further clarification of REP instruction as a filler in “Code Padding Using Neutral Code Fillers” on page 39.
Further clarification of “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46.
Modified examples 1 and 2 of “Unsigned Division by Multiplication of Constant” on page 78.
Added the optimization, “Efficient Implementation of Population Count Function” on page 91.
Further clarification of “Use FFREEP Macro to Pop One Register from the FPU Stack” on page 98.
Further clarification of “Minimize Floating-Point-to-Integer Conversions” on page 100.
Added the optimization, “Check Argument Range of Trigonometric Instructions Efficiently” on page 103.
Added the optimization, “Take Advantage of the FSINCOS Instruction” on page 105.
Further clarification of “Use 3DNow!™ Instructions for Fast Division” on page 108.
Further clarification of “Use FEMMS Instruction” on page 107.
Further clarification of “Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root” on
page 110.
Clarified “3DNow!™ and MMX™ Intra-Operand Swapping” on page 112.
Corrected PCMPGT information in “Use MMX™ PCMP Instead of 3DNow!™ PFCMP” on page 114.
Added the optimization, “Use MMX™ Instructions for Block Copies and Block Fills” on page 115.
Modified the rule for “Use MMX™ PXOR to Clear All Bits in an MMX™ Register” on page 118.
Modified the rule for “Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register” on page 119.
Added the optimization, “Optimized Matrix Multiplication” on page 119.
Added the optimization, “Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions” on page 122.
Added the optimization, “Complex Number Arithmetic” on page 126.
Added Appendix E, “Programming the MTRR and PAT”.
Rearranged the appendices.
Added Index.
1
Introduction
The AMD Athlon™ processor is the newest microprocessor in
the AMD K86™ family of microprocessors. The advances in the
AMD Athlon processor take superscalar operation and
out-of-order execution to a new level. The AMD Athlon
processor has been designed to efficiently execute code written
for previous-generation x86 processors. However, to enable the
fastest code execution with the AMD Athlon processor,
programmers should write software that includes specific code
optimization techniques.
About this Document
This document contains information to assist programmers in
creating optimized code for the AMD Athlon processor. In
addition to compiler and assembler designers, this document is targeted at C and assembly language programmers writing execution-sensitive code sequences.
This document assumes that the reader possesses in-depth
knowledge of the x86 instruction set, the x86 architecture
(registers, programming modes, etc.), and the IBM PC-AT
platform.
This guide has been written specifically for the AMD Athlon
processor, but it includes considerations for
previous-generation processors and describes how those
optimizations are applicable to the AMD Athlon processor. This
guide contains the following chapters:
Chapter 1: Introduction. Outlines the material covered in this
document. Summarizes the AMD Athlon microarchitecture.
Chapter 2: Top Optimizations. Provides convenient descriptions of
the most important optimizations a programmer should take
into consideration.
Chapter 3: C Source Level Optimizations. Describes optimizations that
C/C++ programmers can implement.
Chapter 4: Instruction Decoding Optimizations. Describes methods that
will make the most efficient use of the three sophisticated
instruction decoders in the AMD Athlon processor.
Chapter 5: Cache and Memory Optimizations. Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.
Chapter 6: Branch Optimizations. Describes optimizations that improve branch prediction and minimize branch penalties.
Chapter 7: Scheduling Optimizations. Describes optimizations that improve code scheduling for efficient execution resource utilization.
Chapter 8: Integer Optimizations. Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.
Chapter 9: Floating-Point Optimizations. Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines
for Enhanced 3DNow! and MMX code optimization techniques.
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in
detail the microarchitecture of the AMD Athlon processor.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
Appendix C: Implementation of Write Combining. Describes the
algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance Monitoring Counters. Describes the usage of
the performance counters available in the AMD Athlon
processor.
Appendix E: Programming the MTRR and PAT. Describes the steps
needed to program the Memory Type Range Registers and the
Page Attribute Table.
Appendix F: Instruction Dispatch and Execution Resources. Lists each instruction's execution resource usage.
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86
instructions that are DirectPath and VectorPath instructions.
AMD Athlon™ Processor Family
The AMD Athlon processor family uses state-of-the-art
decoupled decode/execution design techniques to deliver
next-generation performance with x86 binary software
compatibility. This next-generation processor family advances
x86 code execution by using flexible instruction predecoding,
wide and balanced decoders, aggressive out-of-order execution,
parallel integer execution pipelines, parallel floating-point
execution pipelines, deep pipelined execution for higher
delivered operating frequency, dedicated backside cache
memory, and a new high-performance double-rate 64-bit local
bus. As an x86 binary-compatible processor, the AMD Athlon
processor implements the industry-standard x86 instruction set
by decoding and executing the x86 instructions using a
proprietary microarchitecture. This microarchitecture allows
the delivery of maximum performance when running x86-based
PC software.
AMD Athlon™ Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance
and high operating frequency to PC systems running
industry-standard x86 software. A brief summary of the
next-generation design features implemented in the
AMD Athlon processor is as follows:
■High-speed double-rate local bus interface
■Large, split 128-Kbyte level-one (L1) cache
■Dedicated backside level-two (L2) cache
■Instruction predecode and branch detection during cache
line fills
■Decoupled decode/execution core
■Three-way x86 instruction decoding
■Dynamic scheduling and speculative execution
■Three-way integer execution
■Three-way address generation
■Three-way floating-point execution
■3DNow!™ technology and MMX™ single-instruction
multiple-data (SIMD) instruction extensions
■Super data forwarding
■Deep out-of-order integer and floating-point execution
■Register renaming
■Dynamic branch prediction
The AMD Athlon processor communicates through a
next-generation high-speed local bus that is beyond the current
Socket 7 or Super7™ bus standard. The local bus can transfer
data at twice the rate of the bus operating frequency by using
both the rising and falling edges of the clock (see
“AMD Athlon™ System Bus” on page 139 for more
information).
To reduce on-chip cache miss penalties and to avoid subsequent
data load or instruction fetch stalls, the AMD Athlon processor
has a dedicated high-speed backside L2 cache. The large
128-Kbyte L1 on-chip cache and the backside L2 cache allow the
AMD Athlon execution core to achieve and sustain maximum
performance.
As a decoupled decode/execution processor, the AMD Athlon
processor makes use of a proprietary microarchitecture, which
defines the heart of the AMD Athlon processor. With the
inclusion of all these features, the AMD Athlon processor is
capable of decoding, issuing, executing, and retiring multiple
x86 instructions per cycle, resulting in superior scalable
performance.
The AMD Athlon processor includes both the industry-standard
MMX SIMD integer instructions and the 3DNow! SIMD
floating-point instructions that were first introduced in the AMD-K6®-2 processor. The design of 3DNow! technology was
based on suggestions from leading graphics and independent
software vendors (ISVs). Using SIMD format, the AMD Athlon
processor can generate up to four 32-bit, single-precision
floating-point results per clock cycle.
The 3DNow! execution units allow for high-performance
floating-point vector operations, which can replace x87
instructions and enhance the performance of 3D graphics and
other floating-point-intensive applications. Because the
3DNow! architecture uses the same registers as the MMX
instructions, switching between MMX and 3DNow! has no
penalty.
The AMD Athlon processor designers took another innovative
step by carefully integrating the traditional x87 floating-point,
MMX, and 3DNow! execution units into one operational engine.
With the introduction of the AMD Athlon processor, the
switching overhead between x87, MMX, and 3DNow!
technology is virtually eliminated. The AMD Athlon processor
combined with 3DNow! technology brings a better multimedia
experience to mainstream PC users while maintaining
backwards compatibility with all existing x86 software.
Although the AMD Athlon processor can extract code
parallelism on-the-fly from off-the-shelf, commercially available
x86 software, specific code optimization for the AMD Athlon
processor can result in even higher delivered performance. This
document describes the proprietary microarchitecture in the
AMD Athlon processor and makes recommendations for
optimizing execution of x86 software on the processor.
The coding techniques for achieving peak performance on the
AMD Athlon processor include, but are not limited to, those for
the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not
necessary for the AMD Athlon processor to achieve maximum
performance. Due to the more flexible pipeline control and
aggressive out-of-order execution, the AMD Athlon processor is
not as sensitive to instruction selection and code scheduling.
This flexibility is one of the distinct advantages of the
AMD Athlon processor.
The AMD Athlon processor uses the latest in processor
microarchitecture design techniques to provide the highest x86
performance for today’s PC. In short, the AMD Athlon
processor offers true next-generation performance with x86
binary software compatibility.
2
Top Optimizations
This chapter contains concise descriptions of the best
optimizations for improving the performance of the
AMD Athlon™ processor. Subsequent chapters contain more
detailed descriptions of these and other optimizations. The
optimizations in this chapter are divided into two groups and
listed in order of importance.
Group I contains essential optimizations. Users should follow
these critical guidelines closely. The optimizations in Group I
are as follows:
■ Memory Size and Alignment Issues—avoid memory size mismatches; align data where possible
■ Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
■ Select DirectPath Over VectorPath Instructions
Group II contains secondary optimizations that can
significantly improve the performance of the AMD Athlon
processor. The optimizations in Group II are as follows:
■ Load-Execute Instruction Usage—use load-execute instructions; avoid load-execute floating-point instructions with integer operands
■ Take Advantage of Write Combining
■ Use 3DNow! Instructions
■ Avoid Branches Dependent on Random Data
■ Avoid Placing Code and Data in the Same 64-Byte Cache Line
Optimization Star
The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.
Group I Optimizations — Essential Optimizations
Memory Size and Alignment Issues
See “Memory Size and Alignment Issues” on page 45 for more
details.
Avoid Memory Size Mismatches
Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same data, keep operands aligned and keep the loads/stores of each operand the same size.

Align Data Where Possible
Avoid misaligned data references. A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline.
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All the prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:

Prefetch Length = 200 × (DS/C)

■ Round up to the nearest cache line.
■ DS is the data stride per loop iteration.
■ C is the number of cycles per loop iteration when hitting in the L1 cache.

See “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46 for more details.
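As a rough illustration of the formula (a minimal sketch; the function name, the intrinsic header, and the constants are assumptions, not part of this guide), consider a loop over an array of doubles: the data stride is DS = 8 bytes, and assume roughly C = 4 cycles per iteration when hitting in the L1 cache, giving 200 × (8/4) = 400 bytes, rounded up to 448 bytes (seven 64-byte cache lines). The _m_prefetch intrinsic is one common way to emit the PREFETCH instruction; compiler support varies, and inline assembly may be needed instead.

#include <mm3dnow.h>          /* declares _m_prefetch(); availability is compiler-dependent */

#define PREFETCH_AHEAD 56     /* 448 bytes / sizeof(double): 200*(8/4) = 400, rounded up to a cache line */

double sum_array(double *a, int n)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        /* Fetch data 448 bytes ahead of the current element. PREFETCH does
           not fault, so running past the end of the array is harmless. One
           prefetch per cache line suffices; one per iteration is used here
           only for brevity. */
        _m_prefetch((void *)&a[i + PREFETCH_AHEAD]);
        sum += a[i];
    }
    return sum;
}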
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations per x86 instruction. Three DirectPath instructions can be decoded in parallel. Using VectorPath instructions will block DirectPath instructions from decoding simultaneously.
See Appendix G, “DirectPath versus VectorPath Instructions”
on page 219 for a list of DirectPath and VectorPath instructions.
Group II Optimizations—Secondary Optimizations
Load-Execute Instruction Usage
See “Load-Execute Instruction Usage” on page 34 for more
details.
Use Load-Execute Instructions
Wherever possible, use load-execute instructions to increase code density, with the one exception described below. The split-instruction form of load-execute instructions can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle.

Take Advantage of Write Combining
This guideline applies only to operating system, device driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer.
See Appendix C, “Implementation of Write Combining” on page 155 for more details.

Use 3DNow! Instructions
Perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.
See Table 23 on page 217 for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow!™ Technology Manual, order# 21928.

Avoid Branches Dependent on Random Data
Avoid data-dependent branches around a single instruction. Data-dependent branches acting upon basically random data can cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences, which result in shorter average execution time.
See “Avoid Branches Dependent on Random Data” on page 57 for more details.
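As a minimal C sketch of a branch-free alternative (assuming 32-bit ints and a compiler that implements >> on negative values as an arithmetic shift, as x86 compilers commonly do), compare a branchy absolute value with a mask-based one:

/* Branchy version (avoid): mispredicts about 50% of the time on random data */
int abs_branchy(int x)
{
    if (x < 0)
        x = -x;
    return x;
}

/* Branch-free version (preferred): the sign mask selects the negation */
int abs_branchfree(int x)
{
    int mask = x >> 31;          /* 0 if x >= 0, -1 (all bits set) if x < 0 */
    return (x ^ mask) - mask;    /* conditionally negates without a branch */
}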
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified. In order to maintain cache coherency, the AMD Athlon processor may thrash its caches, resulting in lower performance.
In general, the following should be avoided:
■ Self-modifying code
■ Storing data in code segments
See “Avoid Placing Code and Data in the Same 64-Byte Cache
Line” on page 50 for more details.
3
C Source Level Optimizations
This chapter details C programming practices for optimizing
code for the AMD Athlon™ processor. Guidelines are listed in
order of importance.
Ensure Floating-Point Variables and Expressions are of Type Float
For compilers that generate 3DNow!™ instructions, make sure
that all floating-point variables and expressions are of type
float. Pay special attention to floating-point constants. These
require a suffix of “F” or “f” (for example, 3.14f) in order to be
of type float, otherwise they default to type double. To avoid
automatic promotion of float arguments to double, always use
function prototypes for all functions that accept float
arguments.
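A minimal illustration (the function is hypothetical):

/* The prototype prevents the float argument from being promoted to double
   at call sites, and the 'f' suffix keeps the constant of type float. */
float circle_area(float r);

float circle_area(float r)
{
    return 3.14159265f * r * r;   /* without 'f', 3.14159265 is of type double */
}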
Use 32-Bit Data Types for Integer Code
Use 32-bit data types for integer code. Compiler
implementations vary, but typically the following data types are
included—int, signed, signed int, unsigned, unsigned int, long,
signed long, long int, signed long int, unsigned long, and unsigned
long int.
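For example (a small sketch; the variable names are illustrative):

unsigned int count;     /* preferred: 32 bits map directly to a register */
unsigned short small;   /* 16 bits may incur operand-size prefixes and
                           partial-register operations */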
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines
whether a signed or an unsigned integer type is appropriate.
For example, to record the weight of a person in pounds, no
negative numbers are required so an unsigned type is
appropriate. However, recording temperatures in degrees
Celsius may require both positive and negative numbers so a
signed type is needed.
Where there is a choice of using either a signed or an unsigned
type, it should be considered that certain operations are faster
with unsigned types while others are faster for signed types.
Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 FPU provides
instructions for converting signed integers to floating-point, but
has no instructions for converting unsigned integers. In a
typical case, a 32-bit integer is converted as follows:
Example 1 (Avoid):
double x;         ====>   MOV   [temp+4], 0
unsigned int i;           MOV   EAX, i
                          MOV   [temp], EAX
x = i;                    FILD  QWORD PTR [temp]
                          FSTP  QWORD PTR [x]
This code is slow not only because of the number of instructions but also because a size mismatch prevents store-to-load forwarding to the FILD instruction.
Computing quotients and remainders in integer division by
constants are faster when performed on unsigned types. In a
typical case, a 32-bit integer is divided by four as follows:
Example (Avoid):
int i;            ====>   MOV   EAX, i
                          CDQ
i = i / 4;                AND   EDX, 3
                          ADD   EAX, EDX
                          SAR   EAX, 2
                          MOV   i, EAX
Example (Preferred):
unsigned int i;   ====>   SHR   i, 2
i = i / 4;
In summary:
Use unsigned types for:
■ Division and remainders
■ Loop counters
■ Array indexing
Use signed types for:
■ Integer-to-float conversion
Use Array Style Instead of Pointer Style Code
The use of pointers in C makes work difficult for the optimizers
in C compilers. Without detailed and aggressive pointer
analysis, the compiler has to assume that writes through a
pointer can write to any place in memory. This includes storage
allocated to other variables, creating the issue of aliasing, i.e.,
the same block of memory is accessible in more than one way.
In order to help the optimizer of the C compiler in its analysis,
avoid the use of pointers where possible. One example where
this is trivially possible is in the access of data organized as
arrays. C allows the use of either the array operator [] or
pointers to access the array. Using array-style code makes the
task of the optimizer easier by reducing possible aliasing.
For example, x[0] and x[2] can not possibly refer to the same
memory location, while *p and *q could. It is highly
recommended to use the array style, as significant performance
advantages can be achieved with most compilers.
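A minimal sketch of the two styles (the functions are hypothetical):

/* Pointer style: the compiler must assume *p and *q may alias */
void scale_ptr(float *p, float *q, int n)
{
    while (n-- > 0)
        *p++ = 2.0f * *q++;
}

/* Array style: accesses to x[i] and y[i] are easier to disambiguate */
void scale_arr(float x[], float y[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        x[i] = 2.0f * y[i];
}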
Note that source code transformations will interact with a
compiler’s code generator and that it is difficult to control the
generated machine code from the source level. It is even
possible that source code transformations for improving
performance and compiler optimizations "fight" each other.
Depending on the compiler and the specific source code it is
therefore possible that pointer style code will be compiled into
machine code that is faster than that generated from equivalent
array style code. It is advisable to check the performance after
any source code transformation to see whether performance
indeed increased.
Example (Avoid):
for (i = 0; i < numverts; i++) {
    rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
            vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
    rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
            vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
    rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
            vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
    rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
            vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
}
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor’s large, 64-Kbyte
instruction cache and completely unroll small loops. Unrolling
loops can be beneficial to performance, especially if the loop
body is small which makes the loop overhead significant. Many
compilers are not aggressive at unrolling loops. For loops that
have a small fixed loop count and a small loop body, completely
unrolling the loops at the source level is recommended.
Example 1 (Avoid):
// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
    r[i] = 0;
    for (j=0; j<4; j++) {
        r[i] += M[j][i]*V[j];
    }
}
Example 2 (Preferred):
// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];

Avoid Unnecessary Store-to-Load Dependencies
A store-to-load dependency exists when data is stored to
memory, only to be read back shortly thereafter. See
“Store-to-Load Forwarding Restrictions” on page 51 for more
details. The AMD Athlon processor contains hardware to
accelerate such store-to-load dependencies, allowing the load to
obtain the store data before it has been written to memory.
However, it is still faster to avoid such dependencies altogether
and keep the data in an internal register.
Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the
code in a way that avoids the store-to-load dependency. In some
instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the
programmer remove the dependency manually, e.g., by
introducing a temporary variable that can be kept in a register.
This can result in a significant performance increase. The
following is an example of this.
Example 1 (Avoid):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;

for (k = 1; k < VECLEN; k++) {
    x[k] = x[k-1] + y[k];
}

for (k = 1; k < VECLEN; k++) {
    x[k] = z[k] * (y[k] - x[k-1]);
}
Example 2 (Preferred):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
double t;

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = t + y[k];
    x[k] = t;
}

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = z[k] * (y[k] - t);
    x[k] = t;
}
Consider Expression Order in Compound Branch Conditions
Branch conditions in C programs are often compound
conditions consisting of multiple boolean expressions joined by
the boolean operators && and ||. C guarantees a short-circuit
evaluation of these operators. This means that in the case of ||,
the first operand to evaluate to TRUE terminates the
evaluation, i.e., following operands are not evaluated at all.
Similarly for &&, the first operand to evaluate to FALSE
terminates the evaluation. Because of this short-circuit
evaluation, it is not always possible to swap the operands of ||
and &&. This is especially the case when the evaluation of one
of the operands causes a side effect. However, in most cases the
exchange of operands is possible.
When used to control conditional branches, expressions
involving || and && are translated into a series of conditional
branches. The ordering of the conditional branches is a function
of the ordering of the expressions in the compound condition,
and can have a significant impact on performance. It is
unfortunately not possible to give an easy, closed-form formula
on how to order the conditions. Overall performance is a
function of a variety of the following factors:
■ probability of a branch mispredict for each of the branches generated
■ additional latency incurred due to a branch mispredict
■ cost of evaluating the conditions controlling each of the branches generated
■ amount of parallelism that can be extracted in evaluating the branch conditions
■ data stream consumed by an application (mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data-dependent branches)
It is therefore recommended to experiment with the ordering of
expressions in compound branch conditions in the most active
areas of a program (so called hot spots) where most of the
execution time is spent. Such hot spots can be found through
the use of profiling. A "typical" data stream should be fed to
the program while doing the experiments.
Switch Statement Usage
Optimize Switch Statements
Switch statements are translated using a variety of algorithms.
The most common of these are jump tables and comparison
chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrence, with the most probable first. This will improve performance when the
switch is translated as a comparison chain. It is further
recommended to make the case labels small, contiguous
integers, as this will allow the switch to be translated as a jump
table.
Example 1 (Avoid):
int days_in_month, short_months, normal_months, long_months;

switch (days_in_month) {
    case 28:
    case 29: short_months++; break;
    case 30: normal_months++; break;
    case 31: long_months++; break;
    default: printf("month has fewer than 28 or more than 31 days\n");
}
Example 2 (Preferred):
int days_in_month, short_months, normal_months, long_months;

switch (days_in_month) {
    case 31: long_months++; break;
    case 30: normal_months++; break;
    case 28:
    case 29: short_months++; break;
    default: printf("month has fewer than 28 or more than 31 days\n");
}
Use Prototypes for All Functions
In general, use prototypes for all functions. Prototypes can
convey additional information to the compiler that might
enable more aggressive optimizations.
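As a small illustration (the function name is hypothetical), with a prototype such as the following in scope, float arguments are passed as float at every call site instead of being promoted to double:

float blend(float x, float y);   /* prototype visible before any call */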
Use Const Type Qualifier
Use the “const” type qualifier as much as possible. This
optimization makes code more robust and may enable higher
performance code to be generated due to the additional
information available to the compiler. For example, the C
standard allows compilers to not allocate storage for objects
that are declared “const”, if their address is never taken.
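A small sketch of the storage point (the name is illustrative):

static const double kPi = 3.14159265358979;   /* address never taken, so the
                                                 compiler may propagate the value
                                                 and allocate no storage for it */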
Generic Loop Hoisting
To improve the performance of inner loops, it is beneficial to
reduce redundant constant calculations (i.e., loop invariant
calculations). However, this idea can be extended to invariant
control structures.
The first case is that of a constant “if()” statement in a “for()”
loop.
Example 1:
for( i ... ) {
    if( CONSTANT0 ) {
        DoWork0( i );    // does not affect CONSTANT0
    } else {
        DoWork1( i );    // does not affect CONSTANT0
    }
}
The above loop should be transformed into:

if( CONSTANT0 ) {
    for( i ... ) {
        DoWork0( i );
    }
} else {
    for( i ... ) {
        DoWork1( i );
    }
}
This will make your inner loops tighter by avoiding repetitious evaluation of a known “if()” control structure. Although the branch would be easily predicted, the extra instructions and decode limitations imposed by branching are saved, which is usually well worth it.
Generalization for Multiple Constant Control Code
To generalize this further for multiple constant control code, some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this to a simple switch statement.
The trick here is that there is some up-front work involved in
generating all the combinations for the switch constant and the
total amount of code has doubled. However, it is also clear that
the inner loops are "if()-free". In ideal cases where the
“DoWork*()” functions are inlined, the successive functions
will have greater overlap leading to greater parallelism than
would be possible in the presence of intervening “if()”
statements.
switch( ... ) {        /* switch over the enumerated constant combinations */
    ...
    case ...:
        for( i ... ) {
            DoWork1( i );
            DoWork3( i );
        }
        break;
    ...
}
The same idea can be applied to constant “switch()”
statements, or combinations of “switch()” statements and “if()”
statements inside of “for()” loops. The method for combining
the input constants gets more complicated but will be worth it
for the performance benefit.
However, the number of inner loops can also substantially
increase. If the number of inner loops is prohibitively high, then
only the most common cases need to be dealt with directly, and
the remaining cases can fall back to the old code in a "default:"
clause for the “switch()” statement.
This typically comes up when the programmer is considering
runtime generated code. While runtime generated code can
lead to similar levels of performance improvement, it is much
harder to maintain, and the developer must do their own
optimizations for their code generation without the help of an
available compiler.
Declare Local Functions as Static
Functions that are not used outside the file in which they are
defined should always be declared static, which forces internal
linkage. Otherwise, such functions default to external linkage,
which might inhibit certain optimizations with some
compilers—for example, aggressive inlining.
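A minimal sketch:

/* Internal linkage: the function is invisible outside this file, so the
   compiler is free to inline it aggressively and drop the standalone copy. */
static int square(int x)
{
    return x * x;
}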
Dynamic Memory Allocation Consideration
Dynamic memory allocation (‘malloc’ in C language) should
always return a pointer that is suitably aligned for the largest
base type (quadword alignment). Where this aligned pointer
cannot be guaranteed, use the technique shown in the following
code to make the pointer quadword aligned, if needed. This
code assumes the pointer can be cast to a long.
Example:
double* p;
double* np;
p = (double *)malloc(sizeof(double)*number_of_doubles+7L);
np = (double *)((((long)(p)) + 7L) & (-8L));
Then use ‘np’ instead of ‘p’ to access the data. ‘p’ is still needed
in order to deallocate the storage.
Introduce Explicit Parallelism into Code
Where possible, long dependency chains should be broken into
several independent dependency chains which can then be
executed in parallel exploiting the pipeline execution units.
This is especially important for floating-point code, whether it
is mapped to x87 or 3DNow! instructions because of the longer
latency of floating-point operations. Since most languages,
including ANSI C, guarantee that floating-point expressions are
not re-ordered, compilers can not usually perform such
optimizations unless they offer a switch to allow ANSI non-compliant reordering of floating-point expressions according to algebraic rules.
Note that re-ordered code that is algebraically identical to the
original code does not necessarily deliver identical
computational results due to the lack of associativity of floating
point operations. There are well-known numerical
considerations in applying these optimizations (consult a book
on numerical analysis). In some cases, these optimizations may
Dynamic Memory Allocation Consideration25
AMD Athlon™ Processor x86 Code Optimization
lead to unexpected results. Fortunately, in the vast majority of
cases, the final result will differ only in the least significant
bits.
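The following C sketch illustrates the transformation for a simple summation (a hypothetical example; the unrolled version assumes the element count n is a multiple of four):

/* Single accumulator (avoid): one long dependency chain of additions */
double sum1(const double *a, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators (preferred): four independent dependency chains */
double sum4(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i+1];
        s2 += a[i+2];
        s3 += a[i+3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine the partial sums at the end */
}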
Notice that the 4-way unrolling was chosen to exploit the 4-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximal sustained utilization.
Explicitly Extract Common Subexpressions
In certain situations, C compilers are unable to extract common
subexpressions from floating-point expressions due to the
guarantee against reordering of such expressions in the ANSI
standard. Specifically, the compiler can not re-arrange the
computation according to algebraic equivalencies before
extracting common subexpressions. In such cases, the
programmer should manually extract the common
subexpression. It should be noted that re-arranging the
expression may result in different computational results due to
the lack of associativity of floating-point operations, but the
results usually differ in only the least significant bits.
Example 1 (Avoid):
double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;
Example 1 (Preferred):
double a,b,c,d,e,f,t;
t = b/d;
e = c*t;
f = a*t;
Example 2 (Avoid):
double a,b,c,e,f;
e = a/c;
f = b/c;
Example 2 (Preferred):
double a,b,c,e,f,t;
t = 1/c;
e = a*t;
f = b*t;
C Language Structure Component Considerations
Many compilers have options that allow padding of structures
to make their size multiples of words, doublewords, or
quadwords, in order to achieve better alignment for structures.
In addition, to improve the alignment of structure members,
some compilers might allocate structure elements in an order
that differs from the order in which they are declared. However,
some compilers might not offer any of these features, or their
implementation might not work properly in all situations.
Therefore, to achieve the best alignment of structures and
structure members while minimizing the amount of padding
regardless of compiler optimizations, the following methods are
suggested.
Sort by Base Type Size
Sort structure members according to their base type size, declaring members with a larger base type size ahead of members with a smaller base type size.
Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other members are naturally aligned as well. The padding of the structure to a multiple of the largest base type size allows, for example, arrays of structures to be perfectly aligned.
The following example demonstrates the reordering of structure member declarations (the member names are illustrative):
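Original ordering (Avoid):
struct {
    char   a[5];
    long   k;
    double x;
} baz;

Improved ordering (Preferred):
struct {
    double x;
    long   k;
    char   a[5];
} baz;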
See “C Language Structure Component Considerations” on
page 55 for a different perspective.
Sort Local Variables According to Base Type Size
When a compiler allocates local variables in the same order in
which they are declared in the source code, it can be helpful to
declare local variables in such a manner that variables with a
larger base type size are declared ahead of the variables with
smaller base type size. Then, if the first variable is allocated so
that it is naturally aligned, all other variables are allocated
contiguously in the order they are declared, and are naturally
aligned without any padding.
Some compilers do not allocate variables in the order they are
declared. In these cases, the compiler should automatically
allocate variables in such a manner as to make them naturally
aligned with the minimum amount of padding. In addition,
some compilers do not guarantee that the stack is aligned
suitably for the largest base type (that is, they do not guarantee
quadword alignment), so that quadword operands might be
misaligned, even if this technique is used and the compiler does
allocate variables in the order they are declared.
The following example demonstrates the reordering of local
variable declarations:
Original ordering (Avoid):
short ga, gu, gi;
long foo, bar;
double x, y, z[3];
char a, b;
float baz;
Improved ordering (Preferred):
double z[3];
double x, y;
long foo, bar;
float baz;
short ga, gu, gi;
See “Sort Variables According to Base Type Size” on page 56 for
more information from a different perspective.
Accelerating Floating-Point Divides and Square Roots
Divides and square roots have a much longer latency than other
floating-point operations, even though the AMD Athlon
processor provides significant acceleration of these two
operations. In some codes, these operations occur so often as to
seriously impact performance. In these cases, it is
recommended to port the code to 3DNow! inline assembly or to
use a compiler that can generate 3DNow! code. If code has hot
spots that use single-precision arithmetic only (i.e., all
computation involves data of type float) and for some reason
cannot be ported to 3DNow!, the following technique may be
used to improve performance.
The x87 FPU has a precision-control field as part of the FPU
control word. The precision-control setting determines what
precision results get rounded to. It affects the basic arithmetic
operations, including divides and square roots. AMD Athlon and AMD-K6® family processors implement divide and square root in such fashion as to only compute the number of bits necessary for the currently selected precision. This means that
setting precision control to single precision (versus Win32
default of double precision) lowers the latency of those
operations.
The Microsoft® Visual C environment provides functions to
manipulate the FPU control word and thus the precision
control. Note that these functions are not very fast, so changes
of precision control should be inserted where it creates little
overhead, such as outside a computation-intensive loop.
Otherwise the overhead created by the function calls outweighs
the benefit from reducing the latencies of divide and square
root operations.
The following example shows how to set the precision control to
single precision and later restore the original settings in the
Microsoft Visual C environment.
Example:
/* prototype for _controlfp() function */
#include <float.h>
unsigned int orig_cw;
/* Get current FPU control word and save it */
orig_cw = _controlfp (0,0);
/* Set precision control in FPU control word to single
precision. This reduces the latency of divide and square
root operations.
*/
_controlfp (_PC_24, MCW_PC);
/* restore original FPU control word */
_controlfp (orig_cw, 0xfffff);
Avoid Unnecessary Integer Division
Integer division is the slowest of all integer arithmetic
operations and should be avoided wherever possible. One
possibility for reducing the number of integer divisions is to replace a chain of divisions with a single division by the product of the divisors, as shown in the following examples. This replacement is possible only if no overflow occurs during the computation of the product, which can be determined by considering the possible ranges of the divisors.
Example 1 (Avoid):
int i,j,k,m;
m = i / j / k;
Example 2 (Preferred):
int i,j,k,m;
m = i / (j * k);
Copy Frequently De-referenced Pointer Arguments to Local Variables
Avoid frequently de-referencing pointer arguments inside a
function. Since the compiler has no knowledge of whether
aliasing exists between the pointers, such de-referencing can
not be optimized away by the compiler. This prevents data from
being kept in registers and significantly increases memory
traffic.
Note that many compilers have an “assume no aliasing”
optimization switch. This allows the compiler to assume that
two different pointers always have disjoint contents and does
not require copying of pointer arguments to local variables.
Otherwise, copy the data pointed to by the pointer arguments
to local variables at the start of the function and if necessary
copy them back at the end of the function.
Example 1 (Avoid):
//assumes pointers are different and q!=r
void isqrt( unsigned long a,
            unsigned long *q,
            unsigned long *r )
{
    *q = a;
    if (a > 0)
    {
        while (*q > (*r = a / *q))
        {
            *q = (*q + *r) >> 1;
        }
    }
    *r = a - *q * *q;
}

Example 2 (Preferred):
//assumes pointers are different and q!=r
void isqrt( unsigned long a,
            unsigned long *q,
            unsigned long *r )
{
    unsigned long qq, rr;

    qq = a;
    if (a > 0)
    {
        while (qq > (rr = a / qq))
        {
            qq = (qq + rr) >> 1;
        }
    }
    rr = a - qq * qq;
    *q = qq;
    *r = rr;
}
4
Instruction Decoding Optimizations
Overview
This chapter discusses ways to maximize the number of
instructions decoded by the instruction decoders in the
AMD Athlon™ processor. Guidelines are listed in order of
importance.
The AMD Athlon processor instruction fetcher reads 16-byte
aligned code windows from the instruction cache. The
instruction bytes are then merged into a 24-byte instruction
queue. On each cycle, the in-order front-end engine selects for
decode up to three x86 instructions from the instruction-byte
queue.
All instructions (x86, x87, 3DNow!™, and MMX™) are
classified into two types of decodes— DirectPath and
VectorPath (see “DirectPath Decoder” and “VectorPath
Decoder” on page 133 for more information). DirectPath
instructions are common instructions that are decoded directly
in hardware. VectorPath instructions are more complex
instructions that require the use of a sequence of multiple
operations issued from an on-chip ROM.
Up to three DirectPath instructions can be selected for decode
per cycle. Only one VectorPath instruction can be selected for
decode per cycle. DirectPath instructions and VectorPath
instructions cannot be simultaneously decoded.
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath
instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations
per x86 instruction, which includes ‘register←register op
memory’ as well as ‘register←register op register’ forms of
instructions. Up to three DirectPath instructions can be
decoded per cycle. VectorPath instructions will block the
decoding of DirectPath instructions.
The vast majority of instructions used by a compiler have been implemented as DirectPath instructions in the AMD Athlon processor. Assembly writers must still take into
consideration the usage of DirectPath versus VectorPath
instructions.
See Appendix F, “Instruction Dispatch and Execution
Resources” on page 187 and Appendix G, “DirectPath versus
VectorPath Instructions” on page 219 for tables of DirectPath
and VectorPath instructions.
Load-Execute Instruction Usage
Use Load-Execute Integer Instructions
Most load-execute integer instructions are DirectPath
decodable and can be decoded at the rate of three per cycle.
Splitting a load-execute integer instruction into two separate
instructions—a load instruction and a “reg, reg” instruction—
reduces decoding bandwidth and increases register pressure,
which results in lower performance. The split-instruction form
can be used to avoid scheduler stalls for longer executing
instructions and to explicitly schedule the load and execute
operations.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision
floating-point data, wherever possible use floating-point
load-execute instructions to increase code density.
Note: This optimization applies only to floating-point instructions
with floating-point operands and not with integer operands,
as described in the next optimization.
This coding style helps in two ways. First, denser code allows
more work to be held in the instruction cache. Second, the
denser code generates fewer internal OPs and, therefore, the
FPU scheduler holds more work, which increases the chances of
extracting parallelism from the code.
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer
operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR,
FICOM, and FICOMP. Remember that floating-point instructions can have integer operands while integer instructions cannot have floating-point operands.
Floating-point computations involving integer-memory
operands should use separate FILD and arithmetic instructions.
This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle. In some situations this optimization can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.
Align Branch Targets in Program Hot Spots
In program hot spots (i.e., innermost loops in the absence of
profiling data), place branch targets at or near the beginning of
16-byte aligned code windows. This technique helps to
maximize the number of instructions that are filled into the instruction-byte queue while preserving I-cache space in branch-intensive code.
Use Short Instruction Lengths
Assemblers and compilers should generate the tightest code
possible to optimize use of the I-cache and increase average
decode rate. Wherever possible, use instructions with shorter
lengths. Using shorter instructions increases the number of
instructions that can fit into the instruction-byte queue. For
example, use 8-bit displacements as opposed to 32-bit
displacements. In addition, use the single-byte format of simple
integer instructions whenever possible, as opposed to the
2-byte opcode ModR/M format.
Example 1 (Avoid):
81 C0 78 56 34 12    add eax, 12345678h  ;uses 2-byte opcode
                                         ; form
81 C3 FB FF FF FF    add ebx, -5         ;uses 32-bit immediate
0F 84 05 00 00 00    jz  $label1         ;uses 2-byte opcode,
                                         ; 32-bit immediate
Example 2 (Preferred):
05 78 56 34 12       add eax, 12345678h  ;uses single byte
                                         ; opcode form
83 C3 FB             add ebx, -5         ;uses 8-bit sign
                                         ; extended immediate
74 05                jz  $label1         ;uses 1-byte opcode,
                                         ; 8-bit immediate
Avoid Partial Register Reads and Writes
In order to handle partial register writes, the AMD Athlon
processor execution core implements a data-merging scheme.
In the execution unit, an instruction writing a partial register
merges the modified portion with the current state of the
remainder of the register. Therefore, the dependency hardware
can potentially force a false dependency on the most recent
instruction that writes to any part of the register.
Example 1 (Avoid):
MOV AL, 10     ;inst 1
MOV AH, 12     ;inst 2 has a false dependency on
               ; inst 1
               ;inst 2 merges new AH with current
               ; EAX register value forwarded
               ; by inst 1
In addition, an instruction that has a read dependency on any
part of a given architectural register has a read dependency on
the most recent instruction that modifies any part of the same
architectural register.
Example 2 (Avoid):
MOV BX, 12h    ;inst 1
MOV BL, DL     ;inst 2, false dependency on
               ; completion of inst 1
MOV BH, CL     ;inst 3, false dependency on
               ; completion of inst 2
MOV AL, BL     ;inst 4, depends on completion of
               ; inst 2
Replace Certain SHLD Instructions with Alternative Code
Certain instances of the SHLD instruction can be replaced by
alternative code using SHR and LEA. The alternative code has
lower latency and requires less execution resources. SHR and
LEA (32-bit version) are DirectPath instructions, while SHLD is
a VectorPath instruction. SHR and LEA preserve decode bandwidth, as they potentially enable the decoding of a third DirectPath instruction.
Example 1 (Avoid):
SHLD REG1, REG2, 1

(Preferred):
SHR  REG2, 31
LEA  REG1, [REG1*2 + REG2]

Example 2 (Avoid):
SHLD REG1, REG2, 2

(Preferred):
SHR  REG2, 30
LEA  REG1, [REG1*4 + REG2]

Example 3 (Avoid):
SHLD REG1, REG2, 3

(Preferred):
SHR  REG2, 29
LEA  REG1, [REG1*8 + REG2]
Use 8-Bit Sign-Extended Immediates
Using 8-bit sign-extended immediates improves code density
with no negative effects on the AMD Athlon processor. For example, ADD BX, -5 should be encoded “83 C3 FB” and not “81 C3 FB FF”.
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional
branches. Using short, 8-bit sign-extended displacements for
conditional branches improves code density with no negative
effects on the AMD Athlon processor.
Code Padding Using Neutral Code Fillers
Occasionally a need arises to insert neutral code fillers into the
code stream, e.g., for code alignment purposes or to space out
branches. Since this filler code can be executed, it should take
up as few execution resources as possible, not diminish decode
density, and not modify any processor state other than
advancing EIP. One-byte padding can easily be achieved using the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the
x86 architecture, there are several multi-byte "NOP"
instructions available that do not change processor state other
than EIP:
■ MOV REG, REG
■ XCHG REG, REG
■ CMOVcc REG, REG
■ SHR REG, 0
■ SAR REG, 0
■ SHL REG, 0
■ SHRD REG, REG, 0
■ SHLD REG, REG, 0
■ LEA REG, [REG]
■ LEA REG, [REG+00]
■ LEA REG, [REG*1+00]
■ LEA REG, [REG+00000000]
■ LEA REG, [REG*1+00000000]
Not all of these instructions are equally suitable for purposes of
code padding. For example, SHLD/SHRD are microcoded which
reduces decode bandwidth and takes up execution resources.
Recommendations for the AMD Athlon™ Processor
For code that is optimized specifically for the AMD Athlon
processor, the optimal code fillers are NOP instructions (opcode
0x90) with up to two REP prefixes (0xF3). In the AMD Athlon
processor, a NOP with up to two REP prefixes can be handled
by a single decoder with no overhead. As the REP prefixes are
redundant and meaningless, they get discarded, and NOPs are
handled without using any execution resources. The three decoders of the AMD Athlon processor can handle up to three NOPs, each with up to two REP prefixes, in a single cycle, for a neutral code filler of up to nine bytes.
Note: When used as a filler instruction, REP/REPNE prefixes can
be used in conjunction only with NOPs. REP/REPNE has
undefined behavior when used with instructions other than
a NOP.
If a larger amount of code padding is required, it is
recommended to use a JMP instruction to jump across the
padding region. The following assembly language macros show
this:
Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code
On x86 processors other than the AMD Athlon processor
(including the AMD-K6 family of processors), the REP prefix
and especially multiple prefixes cause decoding overhead, so
the above technique is not recommended for code that has to run well both on the AMD Athlon processor and on other x86 processors (blended code). In such cases the instructions and
instruction sequences below are recommended. For neutral
code fillers longer than eight bytes in length, the JMP
instruction can be used to jump across the padding region.
Note that each of the instructions and instruction sequences
below utilizes an x86 register. To avoid performance
degradation, the register used in the padding should be
selected so as to not lengthen existing dependency chains, i.e.,
one should select a register that is not used by instructions in
the vicinity of the neutral code filler. Note that certain
instructions use registers implicitly. For example, PUSH, POP,
CALL, and RET all make implicit use of the ESP register. The
5-byte filler sequence below consists of two instructions. If flag
changes across the code padding are acceptable, the following
instructions may be used as single instruction, 5-byte code
fillers:
■TEST EAX, 0FFFF0000h
■CMP EAX, 0FFFF0000h
The following assembly language macros show the
recommended neutral code fillers for code optimized for the
AMD Athlon processor that also has to run well on other x86
processors. Note that for some padding lengths, versions using
ESP or EBP are missing due to the lack of fully generalized
addressing modes.
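A sketch in MASM syntax of what such fillers can look like
(the macro names are illustrative; the encodings follow the
instruction list above, and the register parameter should be
chosen as described):

NOP2_BLENDED MACRO reg
    mov reg, reg              ;2-byte filler
ENDM

NOP3_BLENDED MACRO reg
    lea reg, [reg+00]         ;3 bytes: LEA, 8-bit displacement
ENDM

NOP4_BLENDED MACRO reg
    lea reg, [reg*1+00]       ;4 bytes: LEA, SIB byte, 8-bit disp.
ENDM

NOP5_BLENDED MACRO reg
    mov reg, reg              ;5 bytes: the two-instruction
    lea reg, [reg+00]         ; sequence mentioned above
ENDM

NOP6_BLENDED MACRO reg
    lea reg, [reg+00000000]   ;6 bytes: LEA, 32-bit displacement
ENDM

Note that an assembler may shorten the zero displacements
during encoding; explicit db byte sequences may be needed to
force the intended lengths.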
5 Cache and Memory Optimizations
This chapter describes code optimization techniques that take
advantage of the large L1 caches and high-bandwidth buses of
the AMD Athlon™ processor. Guidelines are listed in order of
importance.
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
Avoid memory size mismatches when instructions operate on
the same data. For instructions that store and reload the same
data, keep operands aligned and keep the loads/stores of each
operand the same size. The following code example results in a
store-to-load forwarding stall:
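A sketch of the pattern (FOO is an illustrative label):

Example (Avoid):
MOV DWORD PTR [FOO], EAX   ;store DWORD to FOO
MOV DWORD PTR [FOO+4], EDX ;store DWORD to FOO+4
FLD QWORD PTR [FOO]        ;QWORD load cannot forward from the
                           ; two smaller stores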
In general, avoid misaligned data references. All data whose
size is a power of 2 is considered aligned if it is naturally
aligned. For example:
■QWORD accesses are aligned if they access an address
divisible by 8.
■DWORD accesses are aligned if they access an address
divisible by 4.
■WORD accesses are aligned if they access an address
divisible by 2.
■TBYTE accesses are aligned if they access an address
divisible by 8.
A misaligned store or load operation suffers a minimum
one-cycle penalty in the AMD Athlon processor load/store
pipeline. In addition, using misaligned loads and stores
increases the likelihood of encountering a store-to-load
forwarding pitfall. For a more detailed discussion of
store-to-load forwarding issues, see “Store-to-Load Forwarding
Restrictions” on page 51.
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the
3DNow! PREFETCH and PREFETCHW instructions to
increase the effective bandwidth to the AMD Athlon processor.
The PREFETCH and PREFETCHW instructions take
advantage of the AMD Athlon processor’s high bus bandwidth
to hide long latencies when fetching data from system memory.
The prefetch instructions are essentially integer instructions
and can be used anywhere, in any type of code (integer, x87,
3DNow!, MMX, etc.).
Large data sets typically require unit-stride access to ensure
that all data pulled in by PREFETCH or PREFETCHW is
actually used. If necessary, algorithms or data structures should
be reorganized to allow unit-stride access.
PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions in the MMX
extensions are processor implementation dependent. To
maintain compatibility with the 25 million AMD-K6®-2 and
AMD-K6-III processors already sold, use the 3DNow!
PREFETCH/W instructions instead of the various prefetch
flavors in the new MMX extensions.
PREFETCHW Usage

Code that intends to modify the cache line brought in through
prefetching should use the PREFETCHW instruction. While
PREFETCHW works the same as a PREFETCH on the
AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a
hint to the AMD Athlon processor of an intent to modify the
cache line. The AMD Athlon processor will mark the cache line
being brought in by PREFETCHW as Modified. Using
PREFETCHW can save an additional 15-25 cycles compared to
a PREFETCH and the subsequent cache state change caused by
a write to the prefetched cache line.
Multiple Prefetches

Programmers can initiate multiple outstanding prefetches on
the AMD Athlon processor. While the AMD-K6-2 and
AMD-K6-III processors can have only one outstanding prefetch,
the AMD Athlon processor can have up to six outstanding
prefetches. When all six buffers are filled by various memory
read requests, the processor will simply ignore any new
prefetch requests until a buffer frees up. Multiple prefetch
requests are essentially handled in-order. If data is needed first,
then that data should be prefetched first.
The example below shows how to initiate multiple prefetches
when traversing more than one array.
Example (Multiple Prefetches):
.CODE
.K3D
; original C code
;
; #define LARGE_NUM 65536
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;    array_a[i] = array_b[i] * array_c[i];
; }
MOV ECX, (-LARGE_NUM)   ;use biased index
MOV EAX, OFFSET array_a ;get address of array_a
MOV EDX, OFFSET array_b ;get address of array_b
MOV EBX, OFFSET array_c ;get address of array_c
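The loop body is not shown in this excerpt; a sketch consistent
with the rules listed below (the 192-byte lookahead is an
assumption, and only the first of the eight unrolled element
operations is written out) is:

$loop:
PREFETCHW [EAX+ECX*8+LARGE_NUM*8+192] ;array_a is written: prefetch
                                      ; for modify (assumed lookahead)
PREFETCH  [EDX+ECX*8+LARGE_NUM*8+192] ;prefetch array_b
PREFETCH  [EBX+ECX*8+LARGE_NUM*8+192] ;prefetch array_c
FLD  QWORD PTR [EDX+ECX*8+LARGE_NUM*8] ;array_b[i]
FMUL QWORD PTR [EBX+ECX*8+LARGE_NUM*8] ;array_b[i]*array_c[i]
FSTP QWORD PTR [EAX+ECX*8+LARGE_NUM*8] ;array_a[i] = product
;...repeat the FLD/FMUL/FSTP group for elements i+1 through i+7,
; adding 8 to each displacement...
ADD  ECX, 8                            ;eight doubles = one 64-byte line
JNZ  $loop                             ;until biased index reaches zero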
The following optimization rules were applied to this example.
■Loops should be unrolled to make sure that the data stride
per loop iteration is equal to the length of a cache line. This
avoids overlapping PREFETCH instructions and thus makes
optimal use of the available number of outstanding
PREFETCHes.
■Since the array "array_a" is written rather than read,
PREFETCHW is used instead of PREFETCH to avoid
overhead for switching cache lines to the correct MESI
state. The PREFETCH lookahead has been optimized such
that each loop iteration is working on three cache lines
while six active PREFETCHes bring in the next six cache
lines.
■Index arithmetic has been reduced to a minimum by use of
complex addressing modes and biasing of the array base
addresses in order to cut down on loop overhead.
Determining Prefetch Distance
Given the latency of a typical AMD Athlon processor system
and expected processor speeds, the following formula should be
used to determine the prefetch distance in bytes for a single
array:
Prefetch Distance = 200 × (DS/C) bytes
■Round up to the nearest 64-byte cache line.
■The number 200 is a constant based upon expected
AMD Athlon processor clock frequencies and typical system
memory latencies.
■DS is the data stride in bytes per loop iteration.
■C is the number of cycles for one loop to execute entirely
from the L1 cache.
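As a worked example with assumed numbers: a loop that
processes one 64-byte cache line per iteration (DS = 64) and
executes in 20 cycles out of the L1 cache (C = 20) gives
200 × (64/20) = 640 bytes, which is already a multiple of the
64-byte cache-line size, i.e., a lookahead of ten cache lines.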
The prefetch distance for multiple arrays is typically even
longer.
Prefetch at Least 64 Bytes Away from Surrounding Stores

The PREFETCH and PREFETCHW instructions can be
affected by false dependencies on stores. If there is a store to an
address that matches a request, that request (the PREFETCH
or PREFETCHW instruction) may be blocked until the store is
written to the cache. Therefore, code should prefetch data that
is located at least 64 bytes away from any surrounding store’s
data address.
Take Advantage of Write Combining
Operating system and device driver programmers should take
advantage of the write-combining capabilities of the
AMD Athlon processor. The AMD Athlon processor has a very
aggressive write-combining algorithm, which improves
performance significantly.
See Appendix C, “Implementation of Write Combining” on
page 155 for more details.
Avoid Placing Code and Data in the Same 64-Byte Cache
Line
Sharing code and data in the same 64-byte cache line may cause
the L1 caches to thrash (unnecessary castout of code/data) in
order to maintain coherency between the separate instruction
and data caches. The AMD Athlon processor has a cache-line
size of 64 bytes, which is twice the size of previous processors.
Programmers must be aware that code and data should not be
shared within this larger cache line, especially if the data
becomes modified.
For example, programmers should consider that a memory
indirect JMP instruction may have the data for the jump table
residing in the same 64-byte cache line as the JMP instruction,
which would result in lower performance.
Although rare, do not place critical code at the border between
32-byte aligned code segments and data segments. Code at the
start or end of a data segment should be executed as rarely as
possible or simply padded with garbage.
In general, the following should be avoided:
■self-modifying code
■storing data in code segments
Store-to-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading
(forwarding) data from the store buffer (LS2). There are
instances in the AMD Athlon processor load/store architecture
when either a load operation is not allowed to read needed data
from a store in the store buffer, or a load OP detects a false data
dependency on a store in the store buffer.
In either case, the load cannot complete (load the needed data
into a register) until the store has retired out of the store buffer
and written to the data cache. A store-buffer entry cannot retire
and write to the data cache until every instruction before the
store has completed and retired from the reorder buffer.
The implication of this restriction is that all instructions in the
reorder buffer, up to and including the store, must complete
and retire out of the reorder buffer before the load can
complete. Effectively, the load has a false dependency on every
instruction up to the store.
The following sections describe store-to-load forwarding
examples that are acceptable and those that should be avoided.
A load is allowed to read data from the store-buffer entry only if
all of the following conditions are satisfied:
■The start address of the load matches the start address of
the store.
■The load operand size is equal to or smaller than the store
operand size.
■Neither the load nor the store is misaligned.
■The store data is not from a high-byte register (AH, BH, CH,
or DH).
The following sections describe common-case scenarios to avoid
whereby a load has a true dependency on a LS2-buffered store
but cannot read (forward) data from a store-buffer entry.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
narrow-to-wide store-buffer data forwarding restriction:
■The operand size of the store data is smaller than the
operand size of the load data.
■The range of addresses spanned by the store data covers
some sub-region of the range of addresses spanned by the
load data.
Avoid the type of code shown in the following two examples.
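Sketches of the two patterns (FOO is an illustrative label):

Example 1 (Avoid):
MOV WORD PTR [FOO], AX    ;word store
MOV ECX, DWORD PTR [FOO]  ;doubleword load cannot forward from
                          ; the narrower store

Example 2 (Avoid):
MOV DWORD PTR [FOO], EAX  ;doubleword store
. . .
FLD QWORD PTR [FOO]       ;quadword load cannot forward from
                          ; the narrower store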
Misaligned Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a misaligned
store-buffer data forwarding restriction:
■The store or load address is misaligned. For example, a
quadword store is not aligned to a quadword boundary, a
doubleword store is not aligned to a doubleword boundary,
etc.
A common case of misaligned store-data forwarding involves
the passing of misaligned quadword floating-point data on the
doubleword-aligned integer stack. Avoid the type of code shown
in the following example.
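A sketch (the stack-pointer value 24h is an assumed example of
an address that is doubleword- but not quadword-aligned):

Example (Avoid):
MOV  ESP, 24h          ;assumed: doubleword- but not
                       ; quadword-aligned address
FSTP QWORD PTR [ESP]   ;misaligned quadword store
. . .
FLD  QWORD PTR [ESP]   ;load cannot forward from misaligned store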
High-Byte Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a high-byte
store-buffer data forwarding restriction:
■The store data is from a high-byte register (AH, BH, CH,
DH).
Avoid the type of code shown in the following example.
Example 7 (Avoid):
MOV EAX, 10h
MOV [EAX], BH   ;high-byte store
. . .
MOV DL, [EAX]   ;load cannot forward from
                ; high-byte store
One Supported Store-to-Load Forwarding Case

There is one case of mismatched store-to-load forwarding that
is supported by the AMD Athlon processor: the lower 32 bits
from an aligned QWORD write feeding into a DWORD read is
allowed.
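A sketch of the supported case (MEM64 is an illustrative,
quadword-aligned location):

Example (Allowed):
MOVQ [MEM64], MM0   ;aligned quadword store
. . .
MOV  EAX, [MEM64]   ;DWORD load of the lower 32 bits can forward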
Summary of Store-to-Load Forwarding Pitfalls to Avoid
To avoid store-to-load forwarding pitfalls, code should conform
to the following guidelines:
■Maintain consistent use of operand size across all loads and
stores. Preferably, use doubleword or quadword operand
sizes.
■Avoid misaligned data references.
■Avoid narrow-to-wide and wide-to-narrow forwarding cases.
■When using word or byte stores, avoid loading data from
anywhere in the same doubleword of memory other than the
identical start addresses of the stores.
Stack Alignment Considerations
Make sure the stack is suitably aligned for the local variable
with the largest base type. Then, using the technique described
in “C Language Structure Component Considerations” on page
55, all variables can be properly aligned with no padding.
Extend to 32 Bits Before Pushing onto Stack
Function arguments smaller than 32 bits should be extended to
32 bits before being pushed onto the stack, which ensures that
the stack is always doubleword aligned on entry to a function.
If a function has no local variables with a base type larger than
doubleword, no further work is necessary. If the function does
have local variables whose base type is larger than a
doubleword, additional code should be inserted to ensure
proper alignment of the stack. For example, the following code
achieves quadword alignment:
Example (Preferred):
Prolog:
PUSH EBP
MOV EBP, ESP
SUB ESP, SIZE_OF_LOCALS ;size of local variables
AND ESP, –8
;push registers that need to be preserved
Epilog: ;pop registers that needed to be preserved
MOV ESP, EBP
POP EBP
RET
With this technique, function arguments can be accessed via
EBP, and local variables can be accessed via ESP. In order to
free EBP for general use, it needs to be saved and restored
between the prolog and the epilog.
Align TBYTE Variables on Quadword Aligned Addresses
Align variables of type TBYTE on quadword aligned addresses.
To keep an array of TBYTE variables aligned, make the array
elements 16 bytes apart. In general, TBYTE variables
should be avoided. Use double-precision variables instead.
C Language Structure Component Considerations
Structures (‘struct’ in C language) should be made the size of a
multiple of the largest base type of any of their components. To
meet this requirement, padding should be used where
necessary.
Language definitions permitting, to minimize padding,
structure components should be sorted and allocated such that
the components with a larger base type are allocated ahead of
those with a smaller base type. For example, consider the
following code:
Example:
struct {
    char   a[5];
    long   k;
    double x;
} baz;
The structure components should be allocated (lowest to
highest address) as follows: x, k, a[0] through a[4], and seven
bytes of padding, giving a total size of 24 bytes (a multiple of
the 8-byte double, the largest component).
See “C Language Structure Component Considerations” on
page 27 for more information from a C source code perspective.
Sort Variables According to Base Type Size
Sort local variables according to their base type size and
allocate variables with larger base type size ahead of those with
smaller base type size. Assuming the first variable allocated is
naturally aligned, all other variables are naturally aligned
without any padding. The following example is a declaration of
local variables in a C function:
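The declaration itself is not shown in this excerpt; a
declaration consistent with the allocation order given below
(the exact types are assumptions) would be:

double x, y, z[3];
long   foo, bar, baz;
short  ga, gu, gi;
char   a, b;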
Allocate in the following order from left to right (from higher to
lower addresses):
x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b;
See “Sort Local Variables According to Base Type Size” on page
28 for more information from a C source code perspective.
6 Branch Optimizations
While the AMD Athlon™ processor contains a very
sophisticated branch unit, certain optimizations increase the
effectiveness of the branch prediction unit. This chapter
discusses rules that improve branch prediction and minimize
branch penalties. Guidelines are listed in order of importance.
Avoid Branches Dependent on Random Data
Avoid conditional branches depending on random data, as these
are difficult to predict. For example, a piece of code receives a
random stream of characters “A” through “Z” and branches if
the character is before “M” in the collating sequence.
Data-dependent branches acting upon basically random data
cause the branch prediction logic to mispredict the branch
about 50% of the time.
If possible, design branch-free alternative code sequences,
which results in shorter average execution time. This technique
is especially important if the branch body is small. Examples 1
and 2 illustrate this concept using the CMOV instruction. Note
that the AMD-K6
instruction. Therefore, blended AMD-K6 and AMD Athlon
processor code should use examples 3 and 4.
®
processor does not support the CMOV
AMD Athlon™ Processor Specific Code
Example 1 — Signed integer ABS function (X = labs(X)):
MOV ECX, [X]    ;load value
MOV EBX, ECX    ;save value
NEG ECX         ;–value
CMOVS ECX, EBX  ;if –value is negative, select value
MOV [X], ECX    ;save labs result

Example 2 — Unsigned integer min function (z = x < y ? x : y):
MOV EAX, [X]    ;load X value
MOV EBX, [Y]    ;load Y value
CMP EAX, EBX    ;EBX <= EAX ? CF=0 : CF=1
CMOVNC EAX, EBX ;EAX = (EBX <= EAX) ? EBX : EAX
MOV [Z], EAX    ;save min(X,Y)
Blended AMD-K6® and AMD Athlon™ Processor Code
Example 3 — Signed integer ABS function (X = labs(X)):
MOV ECX, [X] ;load value
MOV EBX, ECX ;save value
SAR ECX, 31  ;x < 0 ? 0xffffffff : 0
XOR EBX, ECX ;x < 0 ? ~x : x
SUB EBX, ECX ;x < 0 ? (~x)+1 : x
MOV [X], EBX ;x < 0 ? -x : x
Example 4 — Unsigned integer min function (z = x < y ? x : y):
MOV EAX, [x] ;load x
MOV EBX, [y] ;load y
SUB EAX, EBX ;x < y ? CF : NC ; x - y
SBB ECX, ECX ;x < y ? 0xffffffff : 0
AND ECX, EAX ;x < y ? x - y : 0
ADD ECX, EBX ;x < y ? x - y + y : y
MOV [z], ECX ;x < y ? x : y
Example 5 — Hexadecimal to ASCII conversion
(y = x < 10 ? x + 0x30 : x + 0x37):
MOV AL, [X]  ;load X value
CMP AL, 10   ;if x is less than 10, set carry flag
SBB AL, 69h  ;0..9 –> 96h, Ah..Fh –> A1h...A6h
DAS          ;0..9: subtract 66h, Ah..Fh: subtract 60h
MOV [Y], AL  ;save conversion in y
Example 6 — Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;
if (a < (BUFSIZE-1)) {
a++;
} else {
a = 0;
}
;------------;Assembly Code
MOV EAX, [a] ; old offset
CMP EAX, (BUFSIZE-1) ; a < (BUFSIZE-1) ? CF : NC
INC EAX ; a++
SBB EDX, EDX ; a < (BUFSIZE-1) ? 0xffffffff :0
AND EAX, EDX ; a < (BUFSIZE-1) ? a++ : 0
MOV [a], EAX ; store new offset
Example 7 — Integer Signum Function:
//C Code
int a, s;
if (!a) {
s = 0;
} else if (a < 0) {
s = -1;
} else {
s = 1;
}
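The branchless translation is not shown in this excerpt; a
well-known branchless sequence for the signum function (a
sketch) is:

;IN:  EAX = a
;OUT: EDX = s
CDQ           ;EDX = (a < 0) ? -1 : 0
CMP EDX, EAX  ;CF = 1 if EDX < EAX unsigned (i.e., if a > 0)
ADC EDX, 0    ;EDX = -1, 0, or 1 = signum(a)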
Always Pair CALL and RETURN

When the 12-entry return address stack gets out of
synchronization, the latency of returns increases. The return
address stack becomes out of sync when:
■calls and returns do not match
■the depth of the return stack is exceeded because of too
many levels of nested function calls
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code.
Branches can operate only on one data item at a time, i.e., they
are inherently scalar and inhibit the SIMD processing that
makes 3DNow! code superior. Also, branches based on 3DNow!
comparisons require data to be passed to the integer units,
which requires either transport through memory, or the use of
“MOVD reg, MMreg” instructions. If the body of the branch is
small, one can achieve higher performance by replacing the
branch with computation. The computation simulates
predicated execution or conditional moves. The principal tools
for this are the following instructions: PCMPGT, PFCMPGT,
PFCMPGE, PFMIN, PFMAX, PAND, PANDN, POR, PXOR.
Muxing Constructs
The most important construct for avoiding branches in
3DNow!™ and MMX™ code is a 2-way muxing construct that is
equivalent to the ternary operator “?:” in C and C++. It is
implemented using the PCMP/PFCMP, PAND, PANDN, and
POR instructions. To maximize performance, it is important to
apply the PAND and PANDN instructions in the proper order.
Example 1 (Avoid):
; r = (x < y) ? a : b
;
; in: mm0 a
; mm1 b
; mm2 x
; mm3 y
; out: mm1 r
PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
MOVQ MM4, MM3 ; duplicate mask
PANDN MM3, MM0 ; y > x ? 0 : a
PAND MM1, MM4 ; y > x ? b : 0
POR MM1, MM3 ; r = y > x ? b : a
Because the use of PANDN destroys the mask created by PCMP,
the mask needs to be saved, which requires an additional
register. This adds an instruction, lengthens the dependency
chain, and increases register pressure. Therefore 2-way muxing
constructs should be written as follows.
Example 2 (Preferred):
; r = (x < y) ? a : b
;
; in: mm0 a
; mm1 b
; mm2 x
; mm3 y
; out: mm1 r
PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
PAND MM1, MM3 ; y > x ? b : 0
PANDN MM3, MM0 ; y > x ? 0 : a
POR MM1, MM3 ; r = y > x ? b : a
Sample Code Translated into 3DNow!™ Code
The following examples use scalar code translated into 3DNow!
code. Note that it is not recommended to use 3DNow! SIMD
instructions for scalar code, because the advantage of 3DNow!
instructions lies in their “SIMDness”. These examples are
meant to demonstrate general techniques for translating source
code with branches into branchless 3DNow! code. Scalar source
code was chosen to keep the examples simple. These techniques
work in an identical fashion for vector code.
Each example shows the C code and the resulting 3DNow! code.
Example 1 — C code:
float x,y,z;
if (x < y) {
z += 1.0;
}
else {
z -= 1.0;
}
3DNow! code:
;in:MM0 = x
;MM1 = y
;MM2 = z
;out: MM0 = z
MOVQ MM3, MM0 ;save x
MOVQ MM4, one ;1.0
PFCMPGE MM0, MM1 ;x < y ? 0 : 0xffffffff
PSLLD MM0, 31    ;x < y ? 0 : 0x80000000
PXOR MM0, MM4 ;x < y ? 1.0 : -1.0
PFADD MM0, MM2 ;x < y ? z+1.0 : z-1.0
Example 2 — C code:
float x,z;
z = fabs(x);
if (z >= 1) {
z = 1/z;
}
3DNow! code:
;in: MM0 = x
;out: MM0 = z
MOVQ MM5, mabs ;0x7fffffff
PAND MM0, MM5 ;z=abs(x)
PFRCP MM2, MM0 ;1/z approx
MOVQ MM1, MM0 ;save z
PFRCPIT1 MM0, MM2 ;1/z step
PFRCPIT2 MM0, MM2 ;1/z final
PFMIN MM0, MM1     ;z = z < 1 ? z : 1/z
Example 3 — C code:
float x,z,r,res;
z = fabs(x);
if (z < 0.575) {
res = r;
}
else {
res = PI/2 - 2*r;
}
3DNow! code:
;in: MM0 = x
; MM1 = r
;out: MM0 = res
MOVQ MM7, mabs ;mask for absolute value
PAND MM0, MM7 ;z = abs(x)
MOVQ MM2, bnd ;0.575
PCMPGTD MM2, MM0 ;z < 0.575 ? 0xffffffff : 0
MOVQ MM3, pio2 ;pi/2
MOVQ MM0, MM1 ;save r
PFADD MM1, MM1 ;2*r
PFSUBR MM1, MM3 ;pi/2 - 2*r
PAND MM0, MM2 ;z < 0.575 ? r : 0
PANDN MM2, MM1 ;z < 0.575 ? 0 : pi/2 - 2*r
POR MM0, MM2 ;z < 0.575 ? r : pi/2 - 2 * r
Example 4 — C code:
#define PI 3.14159265358979323
float x,z,r,res;
/* 0 <= r <= PI/4 */
z = fabs(x);
if (z < 1) {
res = r;
}
else {
res = PI/2-r;
}
3DNow! code:
;in:MM0 = x
;MM1 = r
;out: MM1 = res
MOVQ MM5, mabs ; mask to clear sign bit
MOVQ MM6, one ; 1.0
PAND MM0, MM5 ; z=abs(x)
PCMPGTD MM6, MM0 ; z < 1 ? 0xffffffff : 0
MOVQ MM4, pio2 ; pi/2
PFSUB MM4, MM1 ; pi/2-r
PANDN MM6, MM4 ; z < 1 ? 0 : pi/2-r
PFMAX MM1, MM6 ; res = z < 1 ? r : pi/2-r
Example 5 — C code:
#define PI 3.14159265358979323
float x,y,xa,ya,r,res;
int xs,df;
xs = x < 0 ? 1 : 0;
xa = fabs(x);
ya = fabs(y);
df = (xa < ya);
if (xs && df) {
res = PI/2 + r;
}
else if (xs) {
res = PI - r;
}
else if (df) {
res = PI/2 - r;
}
else {
res = r;
}
3DNow! code (tail of the sequence):
; ...
; (df ? 0 : -pi/2))
POR MM0, MM7 ;ar = xs^df ? -r : r
PFADD MM0, MM6 ;res = ar + pr
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires
eight cycles to execute. Use the preferred code shown below:
Example 1 (Avoid):
LOOP LABEL

Example 2 (Preferred):
DEC ECX
JNZ LABEL
Avoid Far Control Transfer Instructions
Avoid using far control transfer instructions. Far control
transfer branches cannot be predicted by the branch target
buffer (BTB).
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the
return address stack. Convert end-recursive functions to
iterative code. An end-recursive function is one in which the
call to itself occurs at the end of the code.
Example 1 (Avoid):
long fac(long a)
{
if (a==0) {
return (1);
} else {
return (a*fac(a-1));
}
}
Example 2 (Preferred):
long fac(long a)
{
long t=1;
while (a > 0) {
t *= a;
a--;
}
return (t);
}
7 Scheduling Optimizations
This chapter describes how to code instructions for efficient
scheduling. Guidelines are listed in order of importance.
Schedule Instructions According to their Latency
The AMD Athlon™ processor can execute up to three x86
instructions per cycle, with each x86 instruction possibly having
a different latency. The AMD Athlon processor has flexible
scheduling, but for absolute maximum performance, schedule
instructions, especially FPU and 3DNow!™ instructions,
according to their latency. Dependent instructions will then not
have to wait on instructions with longer latencies.
See Appendix F, “Instruction Dispatch and Execution
Resources” on page 187 for a list of latency numbers.
Unrolling Loops
Complete Loop Unrolling
Make use of the large AMD Athlon processor 64-Kbyte
instruction cache and unroll loops to get more parallelism and
reduce loop overhead, even with branch prediction. Complete
unrolling reduces register pressure by removing the loop
counter. To completely unroll a loop, remove the loop control
and replicate the loop body N times. In addition, completely
unrolling a loop increases scheduling opportunities.
Note, however, that unrolling very large loops can result in the
inefficient use of the L1 instruction cache. Loops can be
unrolled completely if all of the following conditions are true:
■The loop is in a frequently executed piece of code.
■The loop count is known at compile time.
■The loop body, once unrolled, is less than 100 instructions,
which is approximately 400 bytes of code.
Partial Loop Unrolling
Partial loop unrolling can increase register pressure, which can
make it inefficient due to the small number of registers in the
x86 architecture. However, in certain situations, partial
unrolling can be efficient due to the performance gains
possible. Partial loop unrolling should be considered if the
following conditions are met:
■Spare registers are available
■Loop body is small, so that loop overhead is significant
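Consider the following example (a sketch; the array names and
counter setup are assumptions), which adds two arrays of
double-precision numbers:

    MOV  ECX, [N]         ;number of elements
    MOV  EAX, OFFSET A    ;address of array A
    MOV  EBX, OFFSET B    ;address of array B
$add_loop:
    FLD  QWORD PTR [EAX]  ;A[i]
    FADD QWORD PTR [EBX]  ;A[i] + B[i]
    FSTP QWORD PTR [EAX]  ;store result to A[i]
    ADD  EAX, 8           ;advance A pointer
    ADD  EBX, 8           ;advance B pointer
    DEC  ECX              ;decrement count
    JNZ  $add_loop        ;until count == 0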
The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential endcases that must be handled outside the
loop:
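A sketch of the loop unrolled by a factor of two (assuming an
even element count; an odd remaining element is the end case
handled outside the loop):

$add_loop:
    FLD  QWORD PTR [EAX]    ;A[i]
    FADD QWORD PTR [EBX]    ;A[i] + B[i]
    FSTP QWORD PTR [EAX]    ;store A[i]
    FLD  QWORD PTR [EAX+8]  ;A[i+1]
    FADD QWORD PTR [EBX+8]  ;A[i+1] + B[i+1]
    FSTP QWORD PTR [EAX+8]  ;store A[i+1]
    ADD  EAX, 16            ;advance A pointer two elements
    ADD  EBX, 16            ;advance B pointer two elements
    SUB  ECX, 2             ;two elements per pass
    JNZ  $add_loop          ;until count == 0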
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes
no faster than three iterations in 10 cycles, or 6/10
floating-point adds per cycle, or 1.4 times as fast as the original
loop.
Deriving Loop Control for Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typical
case, the loop count starts at some lower bound lo, increases by
some fixed, positive increment inc for each iteration of the
loop, and may not exceed some upper bound hi. The following
example shows how to partially unroll such a loop by an
unrolling factor of fac, and how to derive the loop control for
the partially unrolled version of the loop.
Example 1 (rolled loop):
for (k = lo; k <= hi; k += inc) {
x[k] =
...
}
Example 2 (partially unrolled loop):
for (k = lo; k <= (hi - (fac-1)*inc); k += fac*inc) {
x[k] =
...
x[k+inc] =
...
...
x[k+(fac-1)*inc] =
...
}
/* handle end cases */
for (k = k; k <= hi; k += inc) {
x[k] =
...
}
Use Function Inlining
Overview
Make use of the AMD Athlon processor’s large 64-Kbyte
instruction cache by inlining small routines to avoid
procedure-call overhead. Consider the cost of possible
increased register usage, which can increase load/store
instructions for register spilling.
Function inlining has the advantage of eliminating function call
overhead and allowing better register allocation and
instruction scheduling at the site of the function call. The
disadvantage is decreasing code locality, which can increase
execution time due to instruction cache misses. Therefore,
function inlining is an optimization that has to be used
judiciously.
In general, due to its very large instruction cache, the
AMD Athlon processor is less susceptible than other processors
to the negative side effect of function inlining. Function call
overhead on the AMD Athlon processor can be low because
calls and returns are executed at high speed due to the use of
prediction mechanisms. However, there is still overhead due to
passing function arguments through memory, which creates
STLF (store-to-load-forwarding) dependencies. Some compilers
allow for a reduction of this overhead by allowing arguments to
be passed in registers in one of their calling conventions, which
has the drawback of constraining register allocation in the
function and at the site of the function call.
In general, function inlining works best if the compiler can
utilize feedback from a profiler to identify the function call
sites most frequently executed. If such data is not available, a
reasonable heuristic is to concentrate on function calls inside
loops. Functions that are directly recursive should not be
considered candidates for inlining. However, if they are
end-recursive, the compiler should convert them to an iterative
equivalent to avoid potential overflow of the AMD Athlon
processor return prediction mechanism (return stack) during
deep recursion. For best results, a compiler should support
function inlining across multiple source files. In addition, a
compiler should provide inline templates for commonly used
library functions, such as sin(), strcmp(), or memcpy().
Always Inline Functions if Called from One Site
A function should always be inlined if it can be established that
it is called from just one site in the code. For the C language,
determination of this characteristic is made easier if functions
are explicitly declared static unless they require external
linkage. This case occurs quite frequently, as functionality that
could be concentrated in a single large function is split across
multiple small functions for improved maintainability and
readability.
Always Inline Functions with Fewer than 25 Machine Instructions
In addition, functions that create fewer than 25 machine
instructions once inlined should always be inlined because it is
likely that the function call overhead is close to or more than
the time spent executing the function body. For large functions,
the benefit of reduced function call overhead gives
diminishing returns. Therefore, a function that results in the
insertion of more than 500 machine instructions at the call site
should probably not be inlined. Some larger functions might
consist of multiple, relatively short paths that are negatively
affected by function overhead. In such a case, it can be
advantageous to inline larger functions. Profiling information is
the best guide in determining whether to inline such large
functions.
Avoid Address Generation Interlocks
Loads and stores are scheduled by the AMD Athlon processor to
access the data cache in program order. Newer loads and stores
with their addresses calculated can be blocked by older loads
and stores whose addresses are not yet calculated – this is
known as an address generation interlock. Therefore, it is
advantageous to schedule loads and stores that can calculate
their addresses quickly, ahead of loads and stores that require
the resolution of a long dependency chain in order to generate
their addresses. Consider the following code examples.
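The examples are not shown in this excerpt; a sketch of the
pattern (the register roles are illustrative) is:

Example (Avoid):
    ADD  EBX, ECX      ;long dependency chain computes the address
    MOV  EAX, [EBX]    ;load must wait for EBX
    MOV  EDX, [ESP+4]  ;address known early, but the load is queued
                       ; behind the older, blocked load

Example (Preferred):
    MOV  EDX, [ESP+4]  ;schedule the quick-address load first
    ADD  EBX, ECX
    MOV  EAX, [EBX]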
Use MOVZX and MOVSX

Use the MOVZX and MOVSX instructions to zero-extend and
sign-extend byte-size and word-size operands to doubleword
length. For example, typical code for zero extension creates a
superset dependency when the zero-extended value is used, as
in the following code:
Example 1 (Avoid):
XOR EAX, EAX
MOV AL, [MEM]
Example 2 (Preferred):
MOVZX EAX, BYTE PTR [MEM]
Minimize Pointer Arithmetic in Loops
Minimize pointer arithmetic in loops, especially if the loop
body is small. In this case, the pointer arithmetic would cause
significant overhead. Instead, take advantage of the complex
addressing modes to utilize the loop counter to index into
memory arrays. Using complex addressing modes does not have
any negative impact on execution speed, but the reduced
number of instructions preserves decode bandwidth.
Example 1 (Avoid):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
c[i] = a[i] + b[i];
}
MOV ECX, MAXSIZE ;initialize loop counter
XOR ESI, ESI ;initialize offset into array a
XOR EDI, EDI ;initialize offset into array b
XOR EBX, EBX ;initialize offset into array c
$add_loop:
MOV EAX, [ESI + a] ;get element a
MOV EDX, [EDI + b] ;get element b
ADD EAX, EDX ;a[i] + b[i]
MOV [EBX + c], EAX ;write result to c
ADD ESI, 4 ;increment offset into a
ADD EDI, 4 ;increment offset into b
ADD EBX, 4 ;increment offset into c
DEC ECX ;decrement loop count
JNZ $add_loop ;until loop count 0
Example 2 (Preferred):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
c[i] = a[i] + b[i];
}
MOV ECX, MAXSIZE-1 ;initialize loop counter
$add_loop:
MOV EAX, [ECX*4 + a] ;get element a
MOV EDX, [ECX*4 + b] ;get element b
ADD EAX, EDX ;a[i] + b[i]
MOV [ECX*4 + c], EAX ;write result to c
DEC ECX ;decrement index
JNS $add_loop ;until index negative
Note that the code in example 2 traverses the arrays in a
downward direction (i.e., from higher addresses to lower
addresses), whereas the original code in example 1 traverses
the arrays in an upward direction. Such a change in the
direction of the traversal is possible if each loop iteration is
completely independent of all other loop iterations, as is the
case here.
In code where the direction of the array traversal can’t be
switched, it is still possible to minimize pointer arithmetic by
appropriately biasing base addresses and using an index
variable that starts with a negative value and reaches zero when
the loop expires. Note that if the base addresses are held in
registers (e.g., when the base addresses are passed as
arguments of a function) biasing the base addresses requires
additional instructions to perform the biasing at run time and a
small amount of additional overhead is incurred. In the
examples shown here the base addresses are used in the
displacement portion of the address and biasing is
accomplished at compile time by simply modifying the
displacement.
Example 3 (Preferred):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
c[i] = a[i] + b[i];
}
MOV ECX, (-MAXSIZE) ;initialize index
$add_loop:
MOV EAX, [ECX*4 + a + MAXSIZE*4] ;get a element
MOV EDX, [ECX*4 + b + MAXSIZE*4] ;get b element
ADD EAX, EDX ;a[i] + b[i]
MOV [ECX*4 + c + MAXSIZE*4], EAX ;write result to c
INC ECX ;increment index
JNZ $add_loop ;until index==0
Push Memory Data Carefully
Carefully choose the best method for pushing memory data. To
reduce register pressure and code dependencies, follow
example 2 below.
Example 1 (Avoid):
MOV EAX, [MEM]
PUSH EAX
Example 2 (Preferred):
PUSH [MEM]
8 Integer Optimizations
This chapter describes ways to improve integer performance
through optimized programming techniques. The guidelines are
listed in order of importance.
Replace Divides with Multiplies
Replace integer division by constants with multiplication by
the reciprocal. Because the AMD Athlon™ processor has a very
fast integer multiply (5–9 cycles signed, 4–8 cycles unsigned)
and the integer division delivers only one bit of quotient per
cycle (22–47 cycles signed, 17–41 cycles unsigned), the
equivalent code is much faster. The user can follow the
examples in this chapter that illustrate the use of integer
division by constants, or access the executables in the
opt_utilities directory in the AMD documentation CD-ROM
(order# 21860) to find alternative code for dividing by a
constant.
Multiplication by Reciprocal (Division) Utility
The code for the utilities can be found at “Derivation of
Multiplier Used for Integer Division by Constants” on page 93.
All utilities were compiled for the Microsoft Windows® 95,
Windows 98, and Windows NT® environments. All utilities are
provided ‘as is’ and are not supported by AMD.
Signed Division Utility

In the opt_utilities directory of the AMD documentation
CD-ROM, run sdiv.exe in a DOS window to find the fastest code
for signed division by a constant. The utility displays the code
after the user enters a signed constant divisor. Type “sdiv >
example.out” to output the code to a file.
Unsigned Division Utility

In the opt_utilities directory of the AMD documentation
CD-ROM, run udiv.exe in a DOS window to find the fastest code
for unsigned division by a constant. The utility displays the code
after the user enters an unsigned constant divisor. Type “udiv >
example.out” to output the code to a file.
Unsigned Division by Multiplication of Constant
Algorithm: Divisors 1 <= d < 2^31, Odd d
The following code shows an unsigned division using a constant
value multiplier.
;In:  d = divisor, 1 <= d < 2^31, odd d
;Out: a = algorithm
;     m = multiplier
;     s = shift factor

;algorithm 0
MOV EDX, dividend
MOV EAX, m
MUL EDX
SHR EDX, s   ;EDX = quotient

;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
ADC EDX, 0
SHR EDX, s   ;EDX = quotient
Derivation of a, m, s

The derivation for the algorithm (a), multiplier (m), and shift
count (s) is found in the section “Unsigned Derivation for
Algorithm, Multiplier, and Shift Factor” on page 93.
Algorithm: Divisors 2^31 <= d < 2^32

For divisors 2^31 <= d < 2^32, the possible quotient values are
either 0 or 1. This makes it easy to establish the quotient by
simple comparison of the dividend and divisor. In cases where
the dividend needs to be preserved, example 1 below is
recommended.
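A sketch of such code (d stands for the constant divisor; the
exact sequence in the original is not shown here):

Example 1 (dividend preserved):
;IN:  EAX = dividend
;OUT: EDX = quotient
XOR EDX, EDX   ;quotient = 0
CMP EAX, d     ;CF = (dividend < d) ? 1 : 0
SBB EDX, -1    ;EDX = 1 - CF = (dividend >= d) ? 1 : 0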
In cases where the dividend does not need to be preserved, the
division can be accomplished without the use of an additional
register, thus reducing register pressure. This is shown in
example 2 below:
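A corresponding sketch that overwrites the dividend and needs
no additional register:

Example 2 (dividend not preserved):
;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, d     ;CF = (dividend < d) ? 1 : 0
SBB EAX, EAX   ;EAX = (dividend < d) ? -1 : 0
INC EAX        ;EAX = (dividend >= d) ? 1 : 0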
Signed Division by Multiplication of Constant

Algorithm: Divisors 2 <= d < 2^31

These algorithms work if the divisor is positive. If the divisor is
negative, use abs(d) instead of d, and append a ‘NEG EDX’ to
the code. The code makes use of the fact that n/–d = –(n/d).

;IN:  d = divisor, 2 <= d < 2^31
;OUT: a = algorithm
;     m = multiplier
;     s = shift count
;algorithm 0
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX ;quotient in EDX
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX ;quotient in EDX
Derivation for a, m, s

The derivation for the algorithm (a), multiplier (m), and shift
count (s) is found in the section “Signed Derivation for
Algorithm, Multiplier, and Shift Factor” on page 95.
Signed Division By 2

;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, 80000000h ;CY = 1, if dividend >= 0
SBB EAX, –1        ;increment dividend if it is < 0
SAR EAX, 1         ;perform a right shift

Signed Division By 2^n

;IN:  EAX = dividend
;OUT: EAX = quotient
CDQ                ;sign extend into EDX
AND EDX, (2^n–1)   ;mask correction (use divisor – 1)
ADD EAX, EDX       ;apply correction if necessary
SAR EAX, (n)       ;perform right shift by log2(divisor)
Signed Division By –2

;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, 80000000h ;CY = 1, if dividend >= 0
SBB EAX, –1        ;increment dividend if it is < 0
SAR EAX, 1         ;perform right shift
NEG EAX            ;use (x/–2) == –(x/2)

Signed Division By –(2^n)

;IN:  EAX = dividend
;OUT: EAX = quotient
CDQ                ;sign extend into EDX
AND EDX, (2^n–1)   ;mask correction (–divisor – 1)
ADD EAX, EDX       ;apply correction if necessary
SAR EAX, (n)       ;right shift by log2(–divisor)
NEG EAX            ;use (x/–(2^n)) == (–(x/2^n))
Remainder of Signed Integer Division by 2 or –2

;IN:  EAX = dividend
;OUT: EAX = remainder
CDQ                ;sign extend into EDX
AND EAX, 1         ;compute remainder
XOR EAX, EDX       ;negate remainder if
SUB EAX, EDX       ;dividend was < 0
MOV [remainder], EAX
Remainder of Signed Integer Division by 2^n or –(2^n)

;IN:  EAX = dividend
;OUT: EAX = remainder
CDQ                ;sign extend into EDX
AND EDX, (2^n–1)   ;mask correction (abs(divisor)–1)
ADD EAX, EDX       ;apply pre-correction
AND EAX, (2^n–1)   ;mask out remainder (abs(divisor)–1)
SUB EAX, EDX       ;apply pre-correction, if necessary
MOV [remainder], EAX
Use Alternative Code When Multiplying by a Constant
A 32-bit integer multiply by a constant has a latency of five
cycles. Therefore, use alternative code when multiplying by
certain constants. In addition, because there is just one
multiply unit, the replacement code may provide better
throughput.
The following code samples are designed such that the original
source also receives the final result. Other sequences are
possible if the result is in a different register. Adds have been
favored over shifts to keep code size small. Generally, there is a
fast replacement if the constant has very few 1 bits in binary.
More constants are found in the file multiply_by_constants.txt,
located in the same SDK directory as this document.
by 2:  ADD REG1, REG1           ;1 cycle
by 3:  LEA REG1, [REG1*2+REG1]  ;2 cycles
by 4:  SHL REG1, 2              ;1 cycle
by 5:  LEA REG1, [REG1*4+REG1]  ;2 cycles
by 6:  LEA REG2, [REG1*4+REG1]  ;3 cycles
       ADD REG1, REG2
by 7:  MOV REG2, REG1           ;2 cycles
       SHL REG1, 3
       SUB REG1, REG2
by 8:  SHL REG1, 3              ;1 cycle
by 9:  LEA REG1, [REG1*8+REG1]  ;2 cycles
by 10: LEA REG2, [REG1*8+REG1]  ;3 cycles
       ADD REG1, REG2
by 11: LEA REG2, [REG1*8+REG1]  ;3 cycles
       ADD REG1, REG1
       ADD REG1, REG2
by 12: SHL REG1, 2              ;3 cycles
       LEA REG1, [REG1*2+REG1]
by 13: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 4
       SUB REG1, REG2
by 14: LEA REG2, [REG1*4+REG1]  ;3 cycles
       LEA REG1, [REG1*8+REG1]
       ADD REG1, REG2
by 15: MOV REG2, REG1           ;2 cycles
       SHL REG1, 4
       SUB REG1, REG2
by 16: SHL REG1, 4              ;1 cycle
by 17: MOV REG2, REG1           ;2 cycles
       SHL REG1, 4
       ADD REG1, REG2
by 18: ADD REG1, REG1           ;3 cycles
       LEA REG1, [REG1*8+REG1]
by 19: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2
by 20: SHL REG1, 2              ;3 cycles
       LEA REG1, [REG1*4+REG1]
by 21: LEA REG2, [REG1*4+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2
by 22: use IMUL
by 23: LEA REG2, [REG1*8+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2
by 24: SHL REG1, 3              ;3 cycles
       LEA REG1, [REG1*2+REG1]
by 25: LEA REG2, [REG1*8+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2
by 26: use IMUL
by 27: LEA REG2, [REG1*4+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2
by 28: MOV REG2, REG1           ;3 cycles
       SHL REG1, 3
       SUB REG1, REG2
       SHL REG1, 2
by 29: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2
by 30: MOV REG2, REG1           ;3 cycles
       SHL REG1, 4
       SUB REG1, REG2
       ADD REG1, REG1
by 31: MOV REG2, REG1           ;2 cycles
       SHL REG1, 5
       SUB REG1, REG2
by 32: SHL REG1, 5              ;1 cycle
Use MMX™ Instructions for Integer-Only Work
In many programs it can be advantageous to use MMX
instructions to do integer-only work, especially if the function
already uses 3DNow!™ or MMX code. Using MMX instructions
relieves register pressure on the integer registers. As long as
data is simply loaded/stored, added, shifted, etc., MMX
instructions are good substitutes for integer instructions.
Integer registers are freed up with the following results:
■May be able to reduce the number of integer registers to be
saved/restored on function entry/exit.
■Free up integer registers for pointers, loop counters, etc., so
that they do not have to be spilled to memory, which
reduces memory traffic and latency in dependency chains.
Be careful with regard to passing data between MMX and
integer registers and creating mismatched store-to-load
forwarding cases. See “Store-to-Load Forwarding Restrictions”
on page 51.
In addition, using MMX instructions increases the available
parallelism. The AMD Athlon processor can issue three integer
OPs and two MMX OPs per cycle.
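As a sketch of the idea (the pointer register and data layout
are illustrative), eight DWORD array elements can be summed
pairwise in MMX registers without touching EAX or EDX:

MOVQ  MM0, [ESI]     ;elements 0,1
PADDD MM0, [ESI+8]   ;+ elements 2,3
MOVQ  MM1, [ESI+16]  ;elements 4,5
PADDD MM1, [ESI+24]  ;+ elements 6,7
PADDD MM0, MM1       ;two partial DWORD sums, one in each half of MM0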
Repeated String Instruction Usage
Latency of Repeated String Instructions
Table 1 shows the latency for repeated string instructions on the
AMD Athlon processor.
Table 1 lists the latencies with the direction flag (DF) = 0
(increment) and DF = 1. In addition, these latencies are
assumed for aligned memory operands. Note that for
MOVS/STOS, when DF = 1 (DOWN), the overhead portion of the
latency increases significantly. However, these cases are less
common. The user should use the formula and round up to the
nearest integer value to determine the latency.
Guidelines for Repeated String Instructions
To help achieve good performance, this section contains
guidelines for the careful scheduling of VectorPath repeated
string instructions.
Use the Largest Possible Operand Size

Always move data using the largest operand size possible. For
example, use REP MOVSD rather than REP MOVSW, and REP
MOVSW rather than REP MOVSB. Use REP STOSD rather than
REP STOSW, and REP STOSW rather than REP STOSB.