The contents of this document are provided in connection with Advanced
Micro Devices, Inc. (“AMD”) products. AMD makes no representations or
warranties with respect to the accuracy or completeness of the contents of
this publication and reserves the right to make changes to specifications and
product descriptions at any time without notice. No license, whether express,
implied, arising by estoppel or otherwise, to any intellectual property rights
is granted by this publication. Except as set forth in AMD’s Standard Terms
and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims
any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular
purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use
as components in systems intended for surgical implant into the body, or in
other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur.
AMD reserves the right to discontinue or make changes to its products at any
time without notice.
Trademarks
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks,
and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc.
Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
MMX is a trademark and Pentium is a registered trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of
their respective companies.
AMD Athlon™ Processor x86 Code Optimization
22007E/0, November 1999
Revision History
Date: Nov. 1999
Rev: E
Description:
Added “About this Document” on page 1.
Further clarification of “Consider the Sign of Integer Operands” on page 14.
Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Added the optimization, “Accelerating Floating-Point Divides and Square Roots” on page 29.
Clarified examples in “Copy Frequently De-referenced Pointer Arguments to Local Variables” on page 31.
Further clarification of “Select DirectPath Over VectorPath Instructions” on page 34.
Further clarification of “Align Branch Targets in Program Hot Spots” on page 36.
Further clarification of REP instruction as a filler in “Code Padding Using Neutral Code Fillers” on page 39.
Further clarification of “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46.
Modified examples 1 and 2 of “Unsigned Division by Multiplication of Constant” on page 78.
Added the optimization, “Efficient Implementation of Population Count Function” on page 91.
Further clarification of “Use FFREEP Macro to Pop One Register from the FPU Stack” on page 98.
Further clarification of “Minimize Floating-Point-to-Integer Conversions” on page 100.
Added the optimization, “Check Argument Range of Trigonometric Instructions Efficiently” on page 103.
Added the optimization, “Take Advantage of the FSINCOS Instruction” on page 105.
Further clarification of “Use 3DNow!™ Instructions for Fast Division” on page 108.
Further clarification of “Use FEMMS Instruction” on page 107.
Further clarification of “Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root” on page 110.
Clarified “3DNow!™ and MMX™ Intra-Operand Swapping” on page 112.
Corrected PCMPGT information in “Use MMX™ PCMP Instead of 3DNow!™ PFCMP” on page 114.
Added the optimization, “Use MMX™ Instructions for Block Copies and Block Fills” on page 115.
Modified the rule for “Use MMX™ PXOR to Clear All Bits in an MMX™ Register” on page 118.
Modified the rule for “Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register” on page 119.
Added the optimization, “Optimized Matrix Multiplication” on page 119.
Added the optimization, “Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions” on page 122.
Added the optimization, “Complex Number Arithmetic” on page 126.
Added Appendix E, “Programming the MTRR and PAT”.
Rearranged the appendices.
Added Index.
1
Introduction
The AMD Athlon™ processor is the newest microprocessor in
the AMD K86™ family of microprocessors. The advances in the
AMD Athlon processor take superscalar operation and
out-of-order execution to a new level. The AMD Athlon
processor has been designed to efficiently execute code written
for previous-generation x86 processors. However, to enable the
fastest code execution with the AMD Athlon processor,
programmers should write software that includes specific code
optimization techniques.
About this Document
This document contains information to assist programmers in
creating optimized code for the AMD Athlon processor. In
addition to compiler and assembler designers, this document
has been targeted to C and assembly language programmers
writing execution-sensitive code sequences.
This document assumes that the reader possesses in-depth
knowledge of the x86 instruction set, the x86 architecture
(registers, programming modes, etc.), and the IBM PC-AT
platform.
This guide has been written specifically for the AMD Athlon
processor, but it includes considerations for
previous-generation processors and describes how those
optimizations are applicable to the AMD Athlon processor. This
guide contains the following chapters:
Chapter 1: Introduction. Outlines the material covered in this
document. Summarizes the AMD Athlon microarchitecture.
Chapter 2: Top Optimizations. Provides convenient descriptions of
the most important optimizations a programmer should take
into consideration.
Chapter 3: C Source Level Optimizations. Describes optimizations that
C/C++ programmers can implement.
Chapter 4: Instruction Decoding Optimizations. Describes methods that
will make the most efficient use of the three sophisticated
instruction decoders in the AMD Athlon processor.
Chapter 5: Cache and Memory Optimizations. Describes optimizations
that make efficient use of the large L1 caches and high-bandwidth
buses of the AMD Athlon processor.
Chapter 6: Branch Optimizations. Describes optimizations that
improve branch prediction and minimize branch penalties.
Chapter 7: Scheduling Optimizations. Describes optimizations that
improve code scheduling for efficient execution resource
utilization.
Chapter 8: Integer Optimizations. Describes optimizations that
improve integer arithmetic and make efficient use of the
integer execution units in the AMD Athlon processor.
Chapter 9: Floating-Point Optimizations. Describes optimizations that
make maximum use of the superscalar and pipelined
floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines
for Enhanced 3DNow! and MMX code optimization techniques.
Chapter 11: General x86 Optimization Guidelines. Lists generic
optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in
detail the microarchitecture of the AMD Athlon processor.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes
in detail the execution units and their relation to the instruction
pipeline.
Appendix C: Implementation of Write Combining. Describes the
algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance Monitoring Counters. Describes the usage of
the performance counters available in the AMD Athlon
processor.
Appendix E: Programming the MTRR and PAT. Describes the steps
needed to program the Memory Type Range Registers and the
Page Attribute Table.
Appendix F: Instruction Dispatch and Execution Resources. Lists the
execution resource usage of each instruction.
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86
instructions that are DirectPath and VectorPath instructions.
AMD Athlon™ Processor Family
The AMD Athlon processor family uses state-of-the-art
decoupled decode/execution design techniques to deliver
next-generation performance with x86 binary software
compatibility. This next-generation processor family advances
x86 code execution by using flexible instruction predecoding,
wide and balanced decoders, aggressive out-of-order execution,
parallel integer execution pipelines, parallel floating-point
execution pipelines, deep pipelined execution for higher
delivered operating frequency, dedicated backside cache
memory, and a new high-performance double-rate 64-bit local
bus. As an x86 binary-compatible processor, the AMD Athlon
processor implements the industry-standard x86 instruction set
by decoding and executing the x86 instructions using a
proprietary microarchitecture. This microarchitecture allows
the delivery of maximum performance when running x86-based
PC software.
AMD Athlon™ Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance
and high operating frequency to PC systems running
industry-standard x86 software. A brief summary of the
next-generation design features implemented in the
AMD Athlon processor is as follows:
■ High-speed double-rate local bus interface
■ Large, split 128-Kbyte level-one (L1) cache
■ Dedicated backside level-two (L2) cache
■ Instruction predecode and branch detection during cache line fills
■ Decoupled decode/execution core
■ Three-way x86 instruction decoding
■ Dynamic scheduling and speculative execution
■ Three-way integer execution
■ Three-way address generation
■ Three-way floating-point execution
■ 3DNow!™ technology and MMX™ single-instruction multiple-data (SIMD) instruction extensions
■ Super data forwarding
■ Deep out-of-order integer and floating-point execution
■ Register renaming
■ Dynamic branch prediction
The AMD Athlon processor communicates through a
next-generation high-speed local bus that is beyond the current
Socket 7 or Super7™ bus standard. The local bus can transfer
data at twice the rate of the bus operating frequency by using
both the rising and falling edges of the clock (see
“AMD Athlon™ System Bus” on page 139 for more
information).
To reduce on-chip cache miss penalties and to avoid subsequent
data load or instruction fetch stalls, the AMD Athlon processor
has a dedicated high-speed backside L2 cache. The large
128-Kbyte L1 on-chip cache and the backside L2 cache allow the
AMD Athlon execution core to achieve and sustain maximum
performance.
As a decoupled decode/execution processor, the AMD Athlon
processor makes use of a proprietary microarchitecture, which
defines the heart of the AMD Athlon processor. With the
inclusion of all these features, the AMD Athlon processor is
capable of decoding, issuing, executing, and retiring multiple
x86 instructions per cycle, resulting in superior scalable
performance.
The AMD Athlon processor includes both the industry-standard
MMX SIMD integer instructions and the 3DNow! SIMD
floating-point instructions that were first introduced in the
AMD-K6®-2 processor. The design of 3DNow! technology was
based on suggestions from leading graphics and independent
software vendors (ISVs). Using SIMD format, the AMD Athlon
processor can generate up to four 32-bit, single-precision
floating-point results per clock cycle.
The 3DNow! execution units allow for high-performance
floating-point vector operations, which can replace x87
instructions and enhance the performance of 3D graphics and
other floating-point-intensive applications. Because the
3DNow! architecture uses the same registers as the MMX
instructions, switching between MMX and 3DNow! has no
penalty.
The AMD Athlon processor designers took another innovative
step by carefully integrating the traditional x87 floating-point,
MMX, and 3DNow! execution units into one operational engine.
With the introduction of the AMD Athlon processor, the
switching overhead between x87, MMX, and 3DNow!
technology is virtually eliminated. The AMD Athlon processor
combined with 3DNow! technology brings a better multimedia
experience to mainstream PC users while maintaining
backwards compatibility with all existing x86 software.
Although the AMD Athlon processor can extract code
parallelism on-the-fly from off-the-shelf, commercially available
x86 software, specific code optimization for the AMD Athlon
processor can result in even higher delivered performance. This
document describes the proprietary microarchitecture in the
AMD Athlon processor and makes recommendations for
optimizing execution of x86 software on the processor.
The coding techniques for achieving peak performance on the
AMD Athlon processor include, but are not limited to, those for
the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium
II processors. However, many of these optimizations are not
necessary for the AMD Athlon processor to achieve maximum
performance. Due to the more flexible pipeline control and
aggressive out-of-order execution, the AMD Athlon processor is
not as sensitive to instruction selection and code scheduling.
This flexibility is one of the distinct advantages of the
AMD Athlon processor.
The AMD Athlon processor uses the latest in processor
microarchitecture design techniques to provide the highest x86
performance for today’s PC. In short, the AMD Athlon
processor offers true next-generation performance with x86
binary software compatibility.
2
Top Optimizations
This chapter contains concise descriptions of the best
optimizations for improving the performance of the
AMD Athlon™ processor. Subsequent chapters contain more
detailed descriptions of these and other optimizations. The
optimizations in this chapter are divided into two groups and
listed in order of importance.
Group I: Essential Optimizations
Group I contains essential optimizations. Users should follow
these critical guidelines closely. The optimizations in Group I
are as follows:
■ Memory Size and Alignment Issues
  - Avoid memory size mismatches
  - Align data where possible
■ Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
■ Select DirectPath Over VectorPath Instructions
Group II: Secondary Optimizations
Group II contains secondary optimizations that can
significantly improve the performance of the AMD Athlon
processor. The optimizations in Group II are as follows:
■ Load-Execute Instruction Usage
  - Use Load-Execute instructions
  - Avoid load-execute floating-point instructions with integer operands
■ Take Advantage of Write Combining
■ Use 3DNow! Instructions
■ Avoid Branches Dependent on Random Data
■ Avoid Placing Code and Data in the Same 64-Byte Cache Line
Optimization Star
The top optimizations described in this chapter are flagged
with a star (✩ TOP). In addition, the star appears beside the more
detailed descriptions found in subsequent chapters.
Group I Optimizations — Essential Optimizations
Memory Size and Alignment Issues
See “Memory Size and Alignment Issues” on page 45 for more
details.
Avoid Memory Size Mismatches
✩ TOP Avoid memory size mismatches when instructions operate on
the same data. For instructions that store and reload the same
data, keep operands aligned and keep the loads/stores of each
operand the same size.
Align Data Where Possible
✩ TOP Avoid misaligned data references. A misaligned store or load
operation suffers a minimum one-cycle penalty in the
AMD Athlon processor load/store pipeline.
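The two guidelines above can be illustrated with a minimal C sketch. The structure `sample` and the helper functions are hypothetical names chosen for illustration; the layout assumes a typical x86 ABI with an 8-byte double, and the compiler, not the C source, ultimately decides which load and store instructions are emitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: members are ordered largest-first so that each one
   falls on its natural boundary with no padding between members. */
struct sample {
    double   value;   /* offset 0  */
    uint32_t id;      /* offset 8  */
    uint16_t flags;   /* offset 12 */
    uint16_t pad;     /* explicit padding: total size is 16 bytes */
};

/* Returns nonzero if p is aligned to `align` bytes (align must be a power of 2). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Store and reload the same data with the same operand size:
   a 32-bit store followed by a 32-bit load of the same location. */
uint32_t store_reload_same_size(uint32_t *slot, uint32_t v)
{
    *slot = v;
    return *slot;
}
```

Checking `offsetof` and `sizeof` at build time is one way to confirm that a layout change has not introduced a misaligned member.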
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
✩ TOP For code that can take advantage of prefetching, use the
3DNow! PREFETCH and PREFETCHW instructions to increase
the effective bandwidth to the AMD Athlon processor, which
significantly improves performance. All the prefetch
instructions are essentially integer instructions and can be used
anywhere, in any type of code (integer, x87, 3DNow!, MMX,
etc.). Use the following formula to determine prefetch distance:
Prefetch Length = 200 (DS/C)
■ Round up to the nearest cache line.
■ DS is the data stride per loop iteration.
■ C is the number of cycles per loop iteration when hitting in
the L1 cache.
See “Use the 3DNow!™ PREFETCH and PREFETCHW
Instructions” on page 46 for more details.
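As a worked illustration of the prefetch formula, the following C sketch computes the prefetch length and rounds it up to a whole cache line. The function name `prefetch_length` is hypothetical, and the sketch assumes that DS is expressed in bytes and that the cache line is 64 bytes, as stated elsewhere in this chapter.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* AMD Athlon cache line size, per this chapter */

/* Prefetch Length = 200 * (DS / C), rounded up to a whole cache line.
   ds_bytes: data stride per loop iteration (assumed to be in bytes).
   cycles:   cycles per loop iteration when hitting in the L1 cache (> 0). */
size_t prefetch_length(size_t ds_bytes, size_t cycles)
{
    /* Ceiling division keeps fractional bytes from being dropped. */
    size_t raw = (200u * ds_bytes + cycles - 1u) / cycles;

    /* Round up to the next multiple of the cache line size. */
    return (raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For example, a loop that advances 8 bytes per iteration and takes 60 cycles per iteration gives 200 × 8 / 60 ≈ 27 bytes, which rounds up to one 64-byte cache line.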
Select DirectPath Over VectorPath Instructions
✩ TOP Use DirectPath instructions rather than VectorPath
instructions. DirectPath instructions are optimized for decode
and execute efficiently by minimizing the number of operations
per x86 instruction. Three DirectPath instructions can be
decoded in parallel. Using VectorPath instructions will block
DirectPath instructions from decoding simultaneously.
See Appendix G, “DirectPath versus VectorPath Instructions”
on page 219 for a list of DirectPath and VectorPath instructions.
Group II Optimizations—Secondary Optimizations
Load-Execute Instruction Usage
See “Load-Execute Instruction Usage” on page 34 for more
details.
Use Load-Execute Instructions
✩ TOP Wherever possible, use load-execute instructions to increase
code density with the one exception described below. The
split-instruction form of load-execute instructions can be used
to avoid scheduler stalls for longer executing instructions and
to explicitly schedule the load and execute operations.
Avoid Load-Execute Floating-Point Instructions with Integer Operands
✩ TOP Do not use load-execute floating-point instructions with integer
operands. The floating-point load-execute instructions with
integer operands are VectorPath and generate two OPs in a
cycle, while the discrete equivalent enables a third DirectPath
instruction to be decoded in the same cycle.
Take Advantage of Write Combining
✩ TOP This guideline applies only to operating system, device driver,
and BIOS programmers. In order to improve system
performance, the AMD Athlon processor aggressively combines
multiple memory-write cycles of any data size that address
locations within a 64-byte cache line aligned write buffer.
See Appendix C, “Implementation of Write Combining” on
page 155 for more details.
Use 3DNow! Instructions
✩ TOP Perform floating-point computations using the 3DNow! instructions
instead of x87 instructions. The SIMD nature of 3DNow!
instructions achieves twice the number of FLOPs that are
achieved through x87 instructions. 3DNow! instructions also
provide for a flat register file instead of the stack-based
approach of x87 instructions.
See Table 23 on page 217 for a list of 3DNow! instructions. For
information about instruction usage, see the 3DNow!™ Technology Manual, order# 21928.
Avoid Branches Dependent on Random Data
✩ TOP Avoid data-dependent branches around a single instruction.
Data-dependent branches acting upon basically random data
can cause the branch prediction logic to mispredict the branch
about 50% of the time. Design branch-free alternative code
sequences, which results in shorter average execution time.
See “Avoid Branches Dependent on Random Data” on page 57
for more details.
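A minimal C sketch of the idea behind the branch guideline follows. Both function names are hypothetical, and a given compiler may already generate branch-free code for either version, so this shows the source-level pattern rather than guaranteed machine code.

```c
#include <stddef.h>
#include <stdint.h>

/* Branching version: the `if` guards a single increment and its outcome
   depends entirely on the data being processed. */
size_t count_above_branchy(const int32_t *a, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            count++;
    }
    return count;
}

/* Branch-free version: the comparison evaluates to 0 or 1, and that value
   is added directly, so no conditional jump is required in the loop body. */
size_t count_above_branchless(const int32_t *a, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(a[i] > threshold);
    return count;
}
```

The two functions return the same result for any input; only the control flow inside the loop differs.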
Avoid Placing Code and Data in the Same 64-Byte Cache Line
✩ TOP Consider that the AMD Athlon processor cache line is twice the
size of previous processors. Code and data should not be shared
in the same 64-byte cache line, especially if the data ever
becomes modified. In order to maintain cache coherency, the
AMD Athlon processor may thrash its caches, resulting in lower
performance.
In general, the following should be avoided:
■ Self-modifying code
■ Storing data in code segments
See “Avoid Placing Code and Data in the Same 64-Byte Cache
Line” on page 50 for more details.
3
C Source Level Optimizations
This chapter details C programming practices for optimizing
code for the AMD Athlon™ processor. Guidelines are listed in
order of importance.
Ensure Floating-Point Variables and Expressions are of
Type Float
For compilers that generate 3DNow!™ instructions, make sure
that all floating-point variables and expressions are of type
float. Pay special attention to floating-point constants. These
require a suffix of “F” or “f” (for example, 3.14f) in order to be
of type float, otherwise they default to type double. To avoid
automatic promotion of float arguments to double, always use
function prototypes for all functions that accept float
arguments.
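The following C sketch illustrates both points. The function names `scale` and `circle_area` are hypothetical; the behavior of the `f` suffix and of prototypes is standard C.

```c
/* Prototype visible before any call, so float arguments are passed as
   float rather than being promoted to double. */
float scale(float x, float factor);

float scale(float x, float factor)
{
    return x * factor;
}

float circle_area(float r)
{
    /* 3.14159f has type float. An unsuffixed 3.14159 would have type
       double and would promote the whole expression to double. */
    return 3.14159f * r * r;
}
```

The difference between the two constant forms can be checked with `sizeof(3.14f)` and `sizeof(3.14)`, which report the sizes of float and double respectively.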
Use 32-Bit Data Types for Integer Code
Use 32-bit data types for integer code. Compiler
implementations vary, but typically the following data types are
included—int, signed, signed int, unsigned, unsigned int, long,
signed long, long int, signed long int, unsigned long, and unsigned
long int.
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines
whether a signed or an unsigned integer type is appropriate.
For example, to record the weight of a person in pounds, no
negative numbers are required so an unsigned type is
appropriate. However, recording temperatures in degrees
Celsius may require both positive and negative numbers so a
signed type is needed.
Where there is a choice of using either a signed or an unsigned
type, it should be considered that certain operations are faster
with unsigned types while others are faster for signed types.
Integer-to-floating-point conversion using integers larger than
16-bit is faster with signed types, as the x86 FPU provides
instructions for converting signed integers to floating-point, but
has no instructions for converting unsigned integers. In a
typical case, a 32-bit integer is converted as follows:
Example 1 (Avoid):
double x;                 ====>   MOV  [temp+4], 0
unsigned int i;                   MOV  EAX, i
                                  MOV  [temp], EAX
x = i;                            FILD QWORD PTR [temp]
                                  FSTP QWORD PTR [x]
This code is slow not only because of the number of instructions
but also because a size mismatch prevents store-to-load
forwarding to the FILD instruction.
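At the C source level, the choice comes down to the declared type of the integer being converted. The sketch below uses hypothetical function names and assumes 32-bit integers via `<stdint.h>`; the instruction sequence actually generated depends on the compiler.

```c
#include <stdint.h>

/* Signed 32-bit source: maps onto the signed integer load (FILD)
   that the x86 FPU provides. */
double to_double_signed(int32_t i)
{
    return (double)i;
}

/* Unsigned 32-bit source: the x86 FPU has no unsigned conversion, so the
   compiler must first widen or adjust the value, as in Example 1 (Avoid). */
double to_double_unsigned(uint32_t i)
{
    return (double)i;
}
```

Both functions produce the same value for inputs that fit in the signed range, so a signed type is the cheaper choice whenever the data never exceeds that range.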
Computing quotients and remainders in integer division by
constants is faster when performed on unsigned types. In a
typical case, a 32-bit integer is divided by four as follows:
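The reason can be sketched in C, under the assumption of 32-bit integers; the function names are hypothetical and this is not the guide's own listing. For unsigned operands, dividing by four is exactly a right shift by two. For signed operands, C division truncates toward zero, so a shift alone gives the wrong answer for negative values that are not multiples of four, and the compiler must emit extra correction instructions.

```c
#include <stdint.h>

/* Unsigned: division by 4 and a logical right shift by 2 always agree. */
uint32_t div4_unsigned(uint32_t x)
{
    return x / 4u;
}

uint32_t div4_unsigned_shift(uint32_t x)
{
    return x >> 2;
}

/* Signed: the result truncates toward zero (for example, -7 / 4 is -1),
   which a plain shift does not reproduce for negative inputs. */
int32_t div4_signed(int32_t x)
{
    return x / 4;
}
```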