The contents of this document are provided in connection with Advanced Micro Devices,
Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication and reserves the right to make
changes to specifications and product descriptions at any time without notice. No license,
whether express, implied, arising by estoppel or otherwise, to any intellectual property
rights is granted by this publication. Except as set forth in AMD’s Standard Terms and
Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or
implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications
intended to support or sustain life, or in any other application in which the failure of
AMD’s product could create a situation where personal injury, death, or severe property or
environmental damage may occur. AMD reserves the right to discontinue or make changes
to its products at any time without notice.
Trademarks
AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, and combinations thereof, 3DNow! and AMD-8151 are trademarks of
Advanced Micro Devices, Inc.
HyperTransport is a licensed trademark of the HyperTransport Technology Consortium.
Microsoft is a registered trademark of Microsoft Corporation.
MMX is a trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Revision History
Date             Rev.   Description
August 2005      3.06   Updated latency tables in Appendix C. Added section 8.9 on optimizing integer
                        division. Clarified the use of non-temporal PREFETCHNTA instruction in section
                        5.6. Added explanatory information to section 5.3 on ccNUMA. Added section 4.5
                        on AMD64 complex addressing modes. Added new section 5.13 on memory copies.
October 2004     3.05   Updated information on write-combining optimizations in Appendix B,
                        Implementation of Write-Combining. Added latency information for SSE3
                        instructions.
March 2004       3.04   Incorporated a section on ccNUMA in Chapter 5. Added sections on moving
                        aligned versus unaligned data. Added to PREFETCHNTA information in Chapter
                        5. Fixed many minor typos.
September 2003   3.03   Made several minor typographical and formatting corrections.
July 2003        3.02   Added index references. Corrected information pertaining to L1 and L2 data and
                        instruction caches. Corrected information on alignment in Chapter 5, “Cache and
                        Memory Optimizations”. Amended latency information in Appendix C.
April 2003       3.01   Clarified section 2.22 “Array Indices”. Corrected factual errors and removed
                        misleading examples from the Cache and Memory chapter.
April 2003       3.00   Initial public release.
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 1 Introduction
This guide provides optimization information and recommendations for the AMD Athlon™ 64 and
AMD Opteron™ processors. These optimizations are designed to yield software code that is fast,
compact, and efficient. Toward this end, the optimizations in each of the following chapters are listed
in order of importance.
This chapter covers the following topics:
• Intended Audience
• Getting Started Quickly
• Using This Guide
• Important New Terms
• Key Optimizations
1.1 Intended Audience
This book is intended for compiler and assembler designers, as well as C, C++, and assembly-language programmers writing performance-sensitive code sequences. This guide assumes that you
are familiar with the AMD64 instruction set and the AMD64 architecture (registers and programming
modes). For complete information on the AMD64 architecture and instruction set, see the
multivolume AMD64 Architecture Programmer’s Manual available from AMD.com. Documentation
volumes and their order numbers are provided below.
Title                                                          Order no.
Volume 1, Application Programming                              24592
Volume 2, System Programming                                   24593
Volume 3, General-Purpose and System Instructions              24594
Volume 4, 128-Bit Media Instructions                           26568
Volume 5, 64-Bit Media and x87 Floating-Point Instructions     26569
1.2 Getting Started Quickly
More experienced readers may skip to “Key Optimizations” on page 6, which identifies the most
important optimizations.
1.3 Using This Guide
This chapter explains how to get the most benefit from this guide. It defines important new terms you
will need to understand before reading the rest of this guide and lists the most important optimizations
by rank.
Chapter 2 describes techniques that you can use to optimize your C and C++ source code. The
“Application” section for each optimization indicates whether the optimization applies to 32-bit
software, 64-bit software, or both.
Chapter 3 presents general assembly-language optimizations that improve the performance of
software designed to run in 64-bit mode. All optimizations in this chapter apply only to 64-bit
software.
The remaining chapters describe assembly-language optimizations. The “Application” section under
each optimization indicates whether the optimization applies to 32-bit software, 64-bit software, or
both.
Chapter 4    Instruction-Decoding Optimizations
Chapter 5    Cache and Memory Optimizations
Chapter 6    Branch Optimizations
Chapter 7    Scheduling Optimizations
Chapter 8    Integer Optimizations
Chapter 9    Optimizing with SIMD Instructions
Chapter 10   x87 Floating-Point Optimizations
Appendix A discusses the internal design, or microarchitecture, of the processor and provides
specifications on the translation-lookaside buffers. It also provides information on other functional
units that are not part of the main processor but are integrated on the chip.
Appendix B describes the memory write-combining feature of the processor.
Appendix C provides a complete listing of all AMD64 instructions. It shows each instruction’s
encoding, decode type, execution latency, and—where applicable—the pipe used in the floating-point
unit.
Appendix D discusses optimizations that improve the throughput of AGP transfers.
Appendix E describes coding practices that improve performance when using SSE and SSE2
instructions.
Special Information
Special information in this guide looks like this:
❖ This symbol appears next to the most important, or key, optimizations.
Numbering Systems
The following suffixes identify different numbering systems:
This suffix   Identifies a
b             Binary number. For example, the binary equivalent of the number 5 is written 101b.
d             Decimal number. Decimal numbers are followed by this suffix only when the possibility of
              confusion exists. In general, decimal numbers are shown without a suffix.
h             Hexadecimal number. For example, the hexadecimal equivalent of the number 60 is
              written 3Ch.
Typographic Notation
This guide uses the following typographic notations for certain types of information:
This type of text   Identifies
italic              Placeholders that represent information you must provide. Italicized text is also used
                    for the titles of publications and for emphasis.
monowidth           Program statements and function names.
Providing Feedback
If you have suggestions for improving this guide, we would like to hear from you. Please send your
comments to the following e-mail address:
code.optimization@amd.com
1.4 Important New Terms
This section defines several important terms and concepts used in this guide.
Primitive Operations
AMD Athlon 64 and AMD Opteron processors perform four types of primitive operations:
• Integer (arithmetic or logic)
• Floating-point (arithmetic)
• Load
• Store
Internal Instruction Formats
The AMD64 instruction set is complex; instructions have variable-length encodings and many
perform multiple primitive operations. AMD Athlon 64 and AMD Opteron processors do not execute
these complex instructions directly, but, instead, decode them internally into simpler fixed-length
instructions called macro-ops. Processor schedulers subsequently break down macro-ops into
sequences of even simpler instructions called micro-ops, each of which specifies a single primitive
operation.
A macro-op is a fixed-length instruction that:
• Expresses, at most, one integer or floating-point operation and one load and/or store operation.
• Is the primary unit of work managed (that is, dispatched and retired) by the processor.
A micro-op is a fixed-length instruction that:
• Expresses one and only one of the primitive operations that the processor can perform (for
example, a load).
• Is executed by the processor’s execution units.
Table 1 summarizes the differences between AMD64 instructions, macro-ops, and micro-ops.
Table 1. Instructions, Macro-ops and Micro-ops

Complexity
  AMD64 instructions: Complex. A single instruction may specify one or more of each of the
    following operations:
    • Integer or floating-point operation
    • Load
    • Store
  Macro-ops: Average. A single macro-op may specify, at most, one integer or floating-point
    operation and one of the following operations:
    • Load
    • Store
    • Load and store to the same address
  Micro-ops: Simple. A single micro-op specifies only one of the following primitive operations:
    • Integer or floating-point
    • Load
    • Store

Encoded length
  AMD64 instructions: Variable (instructions are different lengths)
  Macro-ops: Fixed (all macro-ops are the same length)
  Micro-ops: Fixed (all micro-ops are the same length)

Regularized instruction fields
  AMD64 instructions: No (field locations and definitions vary among instructions)
  Macro-ops: Yes (field locations and definitions are the same for all macro-ops)
  Micro-ops: Yes (field locations and definitions are the same for all micro-ops)
Types of Instructions
Instructions are classified according to how they are decoded by the processor. There are three types
of instructions:
Instruction Type    Description
DirectPath Single   A relatively common instruction that the processor decodes directly into one
                    macro-op in hardware.
DirectPath Double   A relatively common instruction that the processor decodes directly into two
                    macro-ops in hardware.
VectorPath          A sophisticated or less common instruction that the processor decodes into one or
                    more (usually three or more) macro-ops using the on-chip microcode-engine ROM
                    (MROM).
1.5 Key Optimizations
While all of the optimizations in this guide help improve software performance, some of them have
more impact than others. Optimizations that offer the most improvement are called key optimizations.
Guideline
Concentrate your efforts on implementing key optimizations before moving on to other optimizations,
and incorporate higher-ranking key optimizations first.
Key Optimizations by Rank
Table 2 lists the key optimizations by rank.
Table 2. Optimizations by Rank
Rank   Optimization                                                              Page
1      Memory-Size Mismatches                                                    92
2      Natural Alignment of Data Objects                                         95
3      Memory Copy                                                               120
4      Density of Branches                                                       126
5      Prefetch Instructions                                                     104
6      Two-Byte Near-Return RET Instruction                                      128
7      DirectPath Instructions                                                   72
8      Load-Execute Integer Instructions                                         73
9      Load-Execute Floating-Point Instructions with Floating-Point Operands     74
10     Load-Execute Floating-Point Instructions with Integer Operands            74
11     Write-combining                                                           113
12     Branches That Depend on Random Data                                       130
13     Half-Register Operations                                                  356
14     Placing Code and Data in the Same 64-Byte Cache Line                      116
Chapter 2 C and C++ Source-Level Optimizations
Although C and C++ compilers generally produce very compact object code, many performance
improvements are possible by careful source code optimization. Most such optimizations result from
taking advantage of the underlying mechanisms used by C and C++ compilers to translate source
code into sequences of AMD64 instructions. This chapter includes guidelines for writing C and C++
source code that results in the most efficiently optimized AMD64 code.
This chapter covers the following topics:
• Declarations of Floating-Point Values
• Using Arrays and Pointers
• Unrolling Small Loops
• Expression Order in Compound Branch Conditions
• Long Logical Expressions in If Statements
• Arrange Boolean Operands for Quick Expression Evaluation
• Dynamic Memory Allocation Consideration
• Unnecessary Store-to-Load Dependencies
• Matching Store and Load Size
• SWITCH and Noncontiguous Case Expressions
• Arranging Cases by Probability of Occurrence
• Use of Function Prototypes
• Use of const Type Qualifier
• Generic Loop Hoisting
• Local Static Functions
• Explicit Parallelism in Code
• Extracting Common Subexpressions
• Sorting and Padding C and C++ Structures
• Sorting Local Variables
• Replacing Integer Division with Multiplication
• Frequently Dereferenced Pointer Arguments
• Array Indices
• 32-Bit Integral Data Types
• Sign of Integer Operands
• Accelerating Floating-Point Division and Square Root
• Fast Floating-Point-to-Integer Conversion
• Speeding Up Branches Based on Comparisons Between Floats
2.1 Declarations of Floating-Point Values
Optimization
When working with single precision (float) values:
• Use the f or F suffix (for example, 3.14f) to specify a constant value of type float.
• Use function prototypes for all functions that accept arguments of type float.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
C and C++ compilers treat floating-point constants and arguments as double precision (double)
unless you specify otherwise. However, single precision floating-point values occupy half the
memory space of double precision values and can often provide the precision necessary for a given
computational problem.
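The two recommendations can be sketched as follows. This is a minimal illustration, not a listing from this guide; the function names scale and circumference are hypothetical.

```c
/* Prototype visible before any call, so arguments are passed as float.
   Without a prototype, older C dialects promote float arguments to double. */
float scale(float x, float factor);

float circumference(float radius)
{
    /* 2.0f and 3.14159f are float constants, so the expression stays in
       single precision. Writing 2.0 * 3.14159 would make the constants
       double and force a conversion when the result is passed on. */
    return scale(radius, 2.0f * 3.14159f);
}

float scale(float x, float factor)
{
    return x * factor;
}
```

The difference between the two kinds of constant is visible in their sizes: sizeof(3.14f) equals sizeof(float), while sizeof(3.14) equals sizeof(double).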
2.2 Using Arrays and Pointers
Optimization
Use array notation instead of pointer notation when working with arrays.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
C allows the use of either the array operator ([]) or pointers to access the elements of an array.
However, the use of pointers in C makes work difficult for optimizers in C compilers. Without
detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can
write to any location in memory, including storage allocated to other variables. (For example, *p and
*q can refer to the same memory location, while x[0] and x[2] cannot.) Using pointers causes
aliasing, where the same block of memory is accessible in more than one way. Using array notation
makes the task of the optimizer easier by reducing possible aliasing.
Example
Avoid code, such as the following, which uses pointer notation:
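A minimal sketch of the contrast, assuming a hypothetical 4x4 transform of a four-element vector. The function names transform_ptr and transform_idx, the row-major layout, and the float element type are illustrative assumptions, not this guide's own listing.

```c
/* Pointer notation: every store through res might, as far as the compiler
   can tell without alias analysis, modify the data that m and pv point to. */
void transform_ptr(float *res, const float *v, const float *m)
{
    int i;
    for (i = 0; i < 4; i++) {
        const float *pv = v;   /* restart at the first vector element */
        *res  = *m++ * *pv++;
        *res += *m++ * *pv++;
        *res += *m++ * *pv++;
        *res += *m++ * *pv;
        res++;
    }
}

/* Array notation: the same computation, with each element named by index,
   so distinct elements of one array are visibly distinct locations. */
void transform_idx(float res[4], const float v[4], float m[4][4])
{
    int i;
    for (i = 0; i < 4; i++) {
        res[i] = m[i][0] * v[0] + m[i][1] * v[1]
               + m[i][2] * v[2] + m[i][3] * v[3];
    }
}
```

Both functions compute the same result; the array version is the form this section recommends, subject to measuring the generated code with your own compiler.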
Source-code transformations interact with a compiler’s code generator, making it difficult to control
the generated machine code from the source level. It is even possible that source-code transformations
aimed at improving performance may conflict with compiler optimizations. Depending on the
compiler and the specific source code, it is possible for pointer-style code to compile into machine
code that is faster than that generated from equivalent array-style code. Compare the performance of
your code after implementing a source-code transformation with the performance of the original code
to be sure that there is an improvement.
2.3 Unrolling Small Loops
Optimization
Completely unroll loops that have a small fixed loop count and a small loop body.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance,
especially if the loop body is small, which makes the loop overhead significant.
Example
Avoid a small loop like this:
// 3D-transform: Multiply vector V by 4x4 transform matrix M.
for (i = 0; i < 4; i++) {
    r[i] = 0;
    for (j = 0; j < 4; j++) {
        r[i] += m[j][i] * v[j];
    }
}
Instead, replace it with its completely unrolled equivalent, as shown here:
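A sketch of the unrolled form follows. The function wrapper transform_unrolled and the float element type are assumptions made so that the fragment compiles on its own; the loop above does not state them.

```c
/* 3D-transform: multiply vector V by 4x4 transform matrix M, with both
   loops completely unrolled. Each r[i] is the sum over j of m[j][i] * v[j],
   added in the same order as in the rolled loop. */
void transform_unrolled(float r[4], float m[4][4], const float v[4])
{
    r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3];
    r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3];
    r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3];
    r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];
}
```

The unrolled version removes the loop counters, the compare-and-branch overhead, and the initial store of 0 to each r[i].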
For information on loop unrolling at the assembly-language level, see “Loop Unrolling” on page 145.
2.4 Expression Order in Compound Branch Conditions
Optimization
In the most active areas of a program, order the expressions in compound branch conditions to take
advantage of short circuiting of compound conditional expressions.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Branch conditions in C programs are often compound conditions consisting of multiple
Boolean expressions joined by the logical AND (&&) and logical OR (||) operators. C compilers
guarantee short-circuit evaluation of these operators. In a compound logical OR expression, the first
operand to evaluate to true terminates the evaluation, and subsequent operands are not evaluated at all.
Similarly, in a logical AND expression, the first operand to evaluate to false terminates the evaluation.
Because of this short-circuit evaluation, it is not always possible to swap the operands of logical OR
and logical AND. This is especially true when the evaluation of one of the operands causes a side
effect. However, in most cases the order of operands in such expressions is irrelevant.
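The short-circuit rule and its interaction with side effects can be sketched as follows. The counter check_calls and the functions expensive_check, either, and both are hypothetical names introduced only for this illustration.

```c
int check_calls = 0;   /* side effect: counts how often the check runs */

int expensive_check(int x)
{
    check_calls++;
    return x > 100;
}

/* Logical OR: if a is nonzero, expensive_check() is never called. */
int either(int a, int x)
{
    return a != 0 || expensive_check(x);
}

/* Logical AND: if a is zero, expensive_check() is never called. */
int both(int a, int x)
{
    return a != 0 && expensive_check(x);
}
```

Swapping the operands in either function would leave the returned value unchanged but would change how often check_calls is incremented, which is why operands with side effects cannot be reordered freely.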
When used to control conditional branches, expressions involving logical OR and logical AND are
translated into a series of conditional branches. The ordering of the conditional branches is a function
of the ordering of the expressions in the compound condition and can have a significant impact on
performance. It is impossible to give an easy, closed-form formula on how to order the conditions.
Overall performance is a function of the following factors:
• Probability of a branch misprediction for each of the branches generated
• Additional latency incurred due to a branch misprediction
• Cost of evaluating the conditions controlling each of the branches generated
• Amount of parallelism that can be extracted in evaluating the branch conditions
• Data stream consumed by an application (mostly due to the dependence of misprediction
probabilities on the nature of the incoming data in data-dependent branches)
It is recommended to experiment with the ordering of expressions in compound branch conditions in
the most active areas of a program (so-called “hot spots,” where most of the execution time is spent).
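One possible starting point for such an experiment is sketched below, assuming two side-effect-free predicates with very different costs. The names is_even, is_prime_slow, even_or_prime_v1, and even_or_prime_v2 are hypothetical, and which ordering is faster depends on the factors listed above and must be measured.

```c
/* Cheap predicate: a single bit test. */
int is_even(unsigned n)
{
    return (n & 1u) == 0;
}

/* Costly predicate: trial division. */
int is_prime_slow(unsigned n)
{
    unsigned d;
    if (n < 2)
        return 0;
    for (d = 2; d * d <= n; d++) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}

/* Ordering 1: the costly test always runs first. */
int even_or_prime_v1(unsigned n)
{
    return is_prime_slow(n) || is_even(n);
}

/* Ordering 2: the cheap test runs first and, for even inputs,
   short-circuits the costly one. */
int even_or_prime_v2(unsigned n)
{
    return is_even(n) || is_prime_slow(n);
}
```

Because neither predicate has side effects, both orderings return the same value for every input; only the work done and the branch behavior differ.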