Software Optimization
Guide
for
AMD64 Processors
Publication # 25112 |
Revision: 3.06 |
Issue Date: September 2005
© 2001 – 2005 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.
Trademarks
AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, and combinations thereof, 3DNow! and AMD-8151 are trademarks of Advanced Micro Devices, Inc.
HyperTransport is a licensed trademark of the HyperTransport Technology Consortium.
Microsoft is a registered trademark of Microsoft Corporation.
MMX is a trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xv
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
1.1 |
Intended Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.1 |
1.2 |
Getting Started Quickly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.1 |
1.3 |
Using This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.2 |
1.4 |
Important New Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.4 |
1.5 |
Key Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.6 |
Chapter 2 |
C and C++ Source-Level Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.7 |
2.1 |
Declarations of Floating-Point Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.9 |
2.2 |
Using Arrays and Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
10 |
2.3 |
Unrolling Small Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
13 |
2.4 |
Expression Order in Compound Branch Conditions . . . . . . . . . . . . . . . . . . . . . . . . . |
14 |
2.5 |
Long Logical Expressions in If Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
16 |
2.6 |
Arrange Boolean Operands for Quick Expression Evaluation . . . . . . . . . . . . . . . . . . |
17 |
2.7 |
Dynamic Memory Allocation Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
19 |
2.8 |
Unnecessary Store-to-Load Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
20 |
2.9 |
Matching Store and Load Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
22 |
2.10 |
SWITCH and Noncontiguous Case Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
25 |
2.11 |
Arranging Cases by Probability of Occurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
28 |
2.12 |
Use of Function Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
29 |
2.13 |
Use of const Type Qualifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
30 |
2.14 |
Generic Loop Hoisting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
31 |
2.15 |
Local Static Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
34 |
2.16 |
Explicit Parallelism in Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
35 |
2.17 |
Extracting Common Subexpressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
37 |
2.18 |
Sorting and Padding C and C++ Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
39 |
2.19 |
Sorting Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
41 |
Contents |
iii |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
2.20 Replacing Integer Division with Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43 2.21 Frequently Dereferenced Pointer Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44 2.22 Array Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46 2.23 32-Bit Integral Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 2.24 Sign of Integer Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48 2.25 Accelerating Floating-Point Division and Square Root . . . . . . . . . . . . . . . . . . . . . . .50 2.26 Fast Floating-Point-to-Integer Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52 2.27 Speeding Up Branches Based on Comparisons Between Floats . . . . . . . . . . . . . . . .54 2.28 Improving Performance in Linux Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57
Chapter 3 General 64-Bit Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59
3.1 64-Bit Registers and Integer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60 3.2 64-Bit Arithmetic and Large-Integer Multiplication . . . . . . . . . . . . . . . . . . . . . . . . .62 3.3 128-Bit Media Instructions and Floating-Point Operations . . . . . . . . . . . . . . . . . . . .67 3.4 32-Bit Legacy GPRs and Small Unsigned Integers . . . . . . . . . . . . . . . . . . . . . . . . . .68
Chapter 4 Instruction-Decoding Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
4.1 DirectPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72
4.2 Load-Execute Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73
4.2.1 Load-Execute Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73
4.2.2Load-Execute Floating-Point Instructions with Floating-Point Operands . . .74
4.2.3 Load-Execute Floating-Point Instructions with Integer Operands . . . . . . . . .74 4.3 Branch Targets in Program Hot Spots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76 4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction . . . . . . . . . . . . . . . . . . . . . . . . . .77 4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes . . . . . . . . . . . . . .78 4.6 Short Instruction Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80 4.7 Partial-Register Reads and Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81 4.8 Using LEAVE for Function Epilogues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83 4.9 Alternatives to SHLD Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85 4.10 8-Bit Sign-Extended Immediate Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .87 4.11 8-Bit Sign-Extended Displacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88 4.12 Code Padding with Operand-Size Override and NOP . . . . . . . . . . . . . . . . . . . . . . . .89
iv |
Contents |
25112 Rev. 3.06 September 2005 Software Optimization Guide for AMD64 Processors
Chapter 5 |
Cache and Memory Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.91 |
5.1 |
Memory-Size Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.92 |
5.2 |
Natural Alignment of Data Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.95 |
5.3 |
Cache-Coherent Nonuniform Memory Access (ccNUMA) . . . . . . . . . . . . . . . . . . . |
.96 |
5.4 |
Multiprocessor Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
.99 |
5.5 |
Store-to-Load Forwarding Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
100 |
5.6 |
Prefetch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
104 |
5.7 |
Streaming-Store/Non-Temporal Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
112 |
5.8 |
Write-combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
113 |
5.9 |
L1 Data Cache Bank Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
114 |
5.10 |
Placing Code and Data in the Same 64-Byte Cache Line . . . . . . . . . . . . . . . . . . . . . |
116 |
5.11 |
Sorting and Padding C and C++ Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
117 |
5.12 |
Sorting Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
119 |
5.13 |
Memory Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
120 |
5.14 |
Stack Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
122 |
5.15 |
Cache Issues when Writing Instruction Bytes to Memory . . . . . . . . . . . . . . . . . . . . |
123 |
5.16 |
Interleave Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
124 |
Chapter 6 |
Branch Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
125 |
6.1 |
Density of Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
126 |
6.2 |
Two-Byte Near-Return RET Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
128 |
6.3 |
Branches That Depend on Random Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
130 |
6.4 |
Pairing CALL and RETURN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
132 |
6.5 |
Recursive Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
133 |
6.6 |
Nonzero Code-Segment Base Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
135 |
6.7 |
Replacing Branches with Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
136 |
6.8 |
The LOOP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
141 |
6.9 |
Far Control-Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
142 |
Chapter 7 |
Scheduling Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
143 |
7.1 |
Instruction Scheduling by Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
144 |
7.2 |
Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
145 |
Contents |
v |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
7.3 |
Inline Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
149 |
7.4 |
Address-Generation Interlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
151 |
7.5 |
MOVZX and MOVSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
153 |
7.6 |
Pointer Arithmetic in Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
154 |
7.7 |
Pushing Memory Data Directly onto the Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
157 |
Chapter 8 |
Integer Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
159 |
8.1 |
Replacing Division with Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
160 |
8.2 |
Alternative Code for Multiplying by a Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . |
164 |
8.3 |
Repeated String Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
167 |
8.4 |
Using XOR to Clear Integer Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
169 |
8.5 |
Efficient 64-Bit Integer Arithmetic in 32-Bit Mode . . . . . . . . . . . . . . . . . . . . . . . . . |
170 |
8.6 |
Efficient Implementation of Population-Count Function in 32-Bit Mode . . . . . . . . |
179 |
8.7 |
Efficient Binary-to-ASCII Decimal Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . |
181 |
8.8 |
Derivation of Algorithm, Multiplier, and Shift Factor for Integer |
|
|
Division by Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
186 |
8.9 |
Optimizing Integer Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
192 |
Chapter 9 |
Optimizing with SIMD Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
193 |
9.1 |
Ensure All Packed Floating-Point Data are Aligned . . . . . . . . . . . . . . . . . . . . . . . . . |
195 |
9.2Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and
MOVLPS When Loading Data from Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . .196 9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access . . . . . . . . . . . . .198 9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS . . . . . . . . . . .199 9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency . . . . . . . . . .200
9.6Avoid Moving Data Directly Between
General-Purpose and MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .206
9.7Use MMX™ Instructions to Construct Fast Block-Copy
Routines in 32-Bit Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207 9.8 Passing Data between MMX™ and 3DNow!™ Instructions . . . . . . . . . . . . . . . . . .208 9.9 Storing Floating-Point Data in MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . .209 9.10 EMMS and FEMMS Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .210
vi |
Contents |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
9.11Using SIMD Instructions for Fast Square Roots and Fast
Reciprocal Square Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211
9.12Use XOR Operations to Negate Operands of SSE, SSE2, and
3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215 9.13 Clearing MMX™ and XMM Registers with XOR Instructions . . . . . . . . . . . . . . . .216
9.14Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and
3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217
9.15Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2,
and 3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .218 9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions . . . .221 9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines . . . . . . .230
Chapter 10 |
x87 Floating-Point Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .237 |
10.1 |
Using Multiplication Rather Than Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .238 |
10.2 |
Achieving Two Floating-Point Operations per Clock Cycle . . . . . . . . . . . . . . . . |
. .239 |
10.3 |
Floating-Point Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .244 |
10.4 |
Using the FXCH Instruction Rather Than FST/FLD Pairs . . . . . . . . . . . . . . . . . . |
. .245 |
10.5 |
Floating-Point Subexpression Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .246 |
10.6 |
Accumulating Precision-Sensitive Quantities in x87 Registers . . . . . . . . . . . . . . |
. .247 |
10.7 |
Avoiding Extended-Precision Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .248 |
Appendix A |
Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors |
. .249 |
A.1 |
Key Microarchitecture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .250 |
A.2 |
Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . . . . |
. .251 |
A.3 |
Superscalar Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .251 |
A.4 |
Processor Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .251 |
A.5 |
L1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .252 |
A.6 |
Branch-Prediction Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .253 |
A.7 |
Fetch-Decode Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .254 |
A.8 |
Instruction Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .254 |
A.9 |
Translation-Lookaside Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .254 |
A.10 |
L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .255 |
A.11 |
Integer Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
. .256 |
Contents |
vii |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
A.12 Integer Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .256 A.13 Floating-Point Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .257 A.14 Floating-Point Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .258 A.15 Load-Store Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .258 A.16 L2 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .259 A.17 Write-combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260 A.18 Buses for AMD Athlon™ 64 and AMD Opteron™ Processor . . . . . . . . . . . . . . . .260 A.19 Integrated Memory Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260 A.20 HyperTransport™ Technology Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260
Appendix B Implementation of Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .263
B.1 Write-Combining Definitions and Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . .263 B.2 Programming Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .264 B.3 Write-combining Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .264 B.4 Sending Write-Buffer Data to the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .266
B.5 Write-Combining Optimization on Revision D and E
AMD Athlon™ 64 and AMD Opteron™ Processors . . . . . . . . . . . . . . . . . . . . . . . .266
Appendix C Instruction Latencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
269 |
C.1 Understanding Instruction Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .270
C.2 Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .273
C.3 MMX™ Technology Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .303
C.4 x87 Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .307
C.5 3DNow!™ Technology Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .314
C.6 3DNow!™ Technology Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .316
C.7 SSE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .317
C.8 SSE2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .326
C.9 SSE3 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .342
Appendix D AGP Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345
D.1 Fast-Write Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345 D.2 Fast-Write Optimizations for Graphics-Engine Programming . . . . . . . . . . . . . . . . .346 D.3 Fast-Write Optimizations for Video-Memory Copies . . . . . . . . . . . . . . . . . . . . . . .349
viii |
Contents |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
D.4 Memory Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .351
D.5 Memory Optimizations for Graphics-Engine Programming
Using the DMA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .352 D.6 Optimizations for Texture-Map Copies to AGP Memory . . . . . . . . . . . . . . . . . . . .353 D.7 Optimizations for Vertex-Geometry Copies to AGP Memory . . . . . . . . . . . . . . . . .353
Appendix E SSE and SSE2 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
355 |
|
E.1 |
Half-Register Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
356 |
E.2 |
Zeroing Out an XMM Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
357 |
E.3 |
Reuse of Dead Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
359 |
E.4 |
Moving Data Between XMM Registers and GPRs . . . . . . . . . . . . . . . . . . . . . . . . . |
360 |
E.5 |
Saving and Restoring Registers of Unknown Format . . . . . . . . . . . . . . . . . . . . . . . |
361 |
E.6 |
SSE and SSE2 Copy Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
362 |
E.7 |
Explicit Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
363 |
E.8 |
Data Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
364 |
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .367
Contents |
ix |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
x |
Contents |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Table 1. |
Instructions, Macro-ops and Micro-ops ............................................................................ |
5 |
|
Table 2. |
Optimizations by Rank...................................................................................................... |
6 |
|
Table 3. |
Comparisons against Zero............................................................................................... |
55 |
|
Table 4. |
Comparisons against Positive Constant .......................................................................... |
55 |
|
Table 5. |
Comparisons among Two Floats..................................................................................... |
55 |
|
Table 6. |
Latency of Repeated String Instructions ....................................................................... |
167 |
|
Table 7. |
L1 |
Instruction Cache Specifications by Processor........................................................ |
253 |
Table 8. |
L1 |
Instruction TLB Specifications................................................................................ |
255 |
Table 9. |
L1 |
Data TLB Specifications.......................................................................................... |
255 |
Table 10. |
L2 |
TLB Specifications by Processor............................................................................. |
255 |
Table 11. |
L1 |
Data Cache Specifications by Processor.................................................................. |
256 |
Table 12. |
Write-Combining Completion Events........................................................................... |
265 |
|
Table 13. |
Integer Instructions........................................................................................................ |
273 |
|
Table 14. |
MMX™ Technology Instructions................................................................................. |
303 |
|
Table 15. |
x87 Floating-Point Instructions..................................................................................... |
307 |
|
Table 16. |
3DNow!™ Technology Instructions............................................................................. |
314 |
|
Table 17. |
3DNow!™ Technology Extensions .............................................................................. |
316 |
|
Table 18. |
SSE Instructions............................................................................................................ |
317 |
|
Table 19. |
SSE2 Instructions.......................................................................................................... |
326 |
|
Table 20. |
SSE3 Instructions.......................................................................................................... |
342 |
|
Table 21. |
Clearing XMM Registers .............................................................................................. |
357 |
|
Table 22. |
Converting Scalar Values ............................................................................................. |
364 |
|
Table 23. |
Converting Vector Values............................................................................................. |
365 |
|
Table 24. |
Converting Directly from Memory ............................................................................... |
365 |
Tables |
xi |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
xii |
Tables |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Figure 1. |
Simple SMP Block Diagram........................................................................................... |
97 |
Figure 2. |
Dual-Core AMD Opteron™ Processor Configuration ................................................... |
97 |
Figure 3. |
Memory-Limited Code ................................................................................................. |
109 |
Figure 4. |
Processor-Limited Code ............................................................................................... |
109 |
Figure 5. |
AMD Athlon™ 64 and AMD Opteron™ Processors Block Diagram ......................... |
252 |
Figure 6. |
Integer Execution Pipeline............................................................................................ |
256 |
Figure 7. |
Floating-Point Unit ....................................................................................................... |
258 |
Figure 8. |
Load-Store Unit ............................................................................................................ |
259 |
Figure 9. |
AGP 8x Fast-Write Transaction ................................................................................... |
346 |
Figure 10. |
Cacheable-Memory Command Structure ..................................................................... |
347 |
Figure 11. |
Northbridge Command Flow ........................................................................................ |
352 |
Figures |
xiii |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
xiv |
Figures |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Date |
Rev. |
Description |
|
|
|
August 2005 |
3.06 |
Updated latency tables in Appendix C. Added section 8.9 on optimizing integer |
|
|
division. Clarified the use of non-temporal PREFETCHNTA instruction in section |
|
|
5.6. Added explanatory information to section 5.3 on ccNUMA. Added section 4.5 |
|
|
on AMD64 complx addressing modes. Added new section 5.13 on memory copies. |
|
|
|
October 2004 |
3.05 |
Updated information on write-combining optimizations in Appendix B, |
|
|
Implementation of Write-Combining; Added latency information for SSE3 |
|
|
instructions. |
|
|
|
March 2004 |
3.04 |
Incorporated a section on ccNUMA in Chapter 5. Added sections on moving |
|
|
unaligned versus unaligned data. Added to PREFETCHNTA information in Chapter |
|
|
5. Fixed many minor typos. |
|
|
|
September 2003 |
3.03 |
Made several minor typographical and formatting corrections. |
|
|
|
July 2003 |
3.02 |
Added index references. Corrected information pertaining to L1 and L2 data and |
|
|
instruction caches. Corrected information on alignment in Chapter 5, “Cache and |
|
|
Memory Optimizations”. Amended latency information in Appendix C. |
|
|
|
April 2003 |
3.01 |
Clarified section 2.22 'Array Indices'. Corrected factual errors and removed |
|
|
misleading examples from Cache and Memory chapter.. |
|
|
|
April 2003 |
3.00 |
Initial public release. |
|
|
|
Revision History |
xv |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
xvi |
Revision History |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
This guide provides optimization information and recommendations for the AMD Athlon™ 64 and AMD Opteron™ processors. These optimizations are designed to yield software code that is fast, compact, and efficient. Toward this end, the optimizations in each of the following chapters are listed in order of importance.
This chapter covers the following topics:
Topic |
Page |
|
|
Intended Audience |
1 |
|
|
Getting Started Quickly |
1 |
|
|
Using This Guide |
2 |
|
|
Important New Terms |
4 |
|
|
Key Optimizations |
6 |
|
|
1.1Intended Audience
This book is intended for compiler and assembler designers, as well as C, C++, and assemblylanguage programmers writing performance-sensitive code sequences. This guide assumes that you are familiar with the AMD64 instruction set and the AMD64 architecture (registers and programming
modes). For complete information on the AMD64 architecture and instruction set, see the multivolume AMD64 Architecture Programmer’s Manual available from AMD.com. Documentation
volumes and their order numbers are provided below.
Title |
Order no. |
Volume 1, Application Programming |
24592 |
Volume 2, System Programming |
24593 |
Volume 3, General-Purpose and System Instructions |
24594 |
Volume 4, 128-Bit Media Instructions |
26568 |
Volume 5, 64-Bit Media and x87 Floating-Point Instructions |
26569 |
1.2Getting Started Quickly
More experienced readers may skip to “Key Optimizations” on page 6, which identifies the most important optimizations.
Chapter 1 |
Introduction |
1 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
This chapter explains how to get the most benefit from this guide. It defines important new terms you will need to understand before reading the rest of this guide and lists the most important optimizations by rank.
Chapter 2 describes techniques that you can use to optimize your C and C++ source code. The “Application” section for each optimization indicates whether the optimization applies to 32-bit software, 64-bit software, or both.
Chapter 3 presents general assembly-language optimizations that improve the performance of software designed to run in 64-bit mode. All optimizations in this chapter apply only to 64-bit software.
The remaining chapters describe assembly-language optimizations. The “Application” section under each optimization indicates whether the optimization applies to 32-bit software, 64-bit software, or both.
Chapter 4 |
Instruction-Decoding Optimizations |
Chapter 5 |
Cache and Memory Optimizations |
Chapter 6 |
Branch Optimizations |
Chapter 7 |
Scheduling Optimizations |
Chapter 8 |
Integer Optimizations |
Chapter 9 |
Optimizing with SIMD Instructions |
Chapter 10 |
x87 Floating-Point Optimizations |
Appendix A discusses the internal design, or microarchitecture, of the processor and provides specifications on the translation-lookaside buffers. It also provides information on other functional units that are not part of the main processor but are integrated on the chip.
Appendix B describes the memory write-combining feature of the processor.
Appendix C provides a complete listing of all AMD64 instructions. It shows each instruction’s encoding, decode type, execution latency, and—where applicable—the pipe used in the floating-point unit.
Appendix D discusses optimizations that improve the throughput of AGP transfers.
Appendix E describes coding practices that improve performance when using SSE and SSE2 instructions.
2 |
Introduction |
Chapter 1 |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Special Information
Special information in this guide looks like this:
This symbol appears next to the most important, or key, optimizations.
Numbering Systems
The following suffixes identify different numbering systems:
This suffix |
Identifies a |
|
|
b |
Binary number. For example, the binary equivalent of the number 5 is written 101b. |
|
|
d |
Decimal number. Decimal numbers are followed by this suffix only when the possibility of |
|
confusion exists. In general, decimal numbers are shown without a suffix. |
|
|
h |
Hexadecimal number. For example, the hexadecimal equivalent of the number 60 is |
|
written 3Ch. |
|
|
Typographic Notation
This guide uses the following typographic notations for certain types of information:
This type of text |
Identifies |
|
|
italic |
Placeholders that represent information you must provide. Italicized text is also used |
|
for the titles of publications and for emphasis. |
|
|
monowidth |
Program statements and function names. |
|
|
Providing Feedback
If you have suggestions for improving this guide, we would like to hear from you. Please send your comments to the following e-mail address:
code.optimization@amd.com
Chapter 1 |
Introduction |
3 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
This section defines several important terms and concepts used in this guide.
Primitive Operations
AMD Athlon 64 and AMD Opteron processors perform four types of primitive operations:
•Integer (arithmetic or logic)
•Floating-point (arithmetic)
•Load
•Store
Internal Instruction Formats
The AMD64 instruction set is complex; instructions have variable-length encodings and many perform multiple primitive operations. AMD Athlon 64 and AMD Opteron processors do not execute these complex instructions directly, but, instead, decode them internally into simpler fixed-length instructions called macro-ops. Processor schedulers subsequently break down macro-ops into sequences of even simpler instructions called micro-ops, each of which specifies a single primitive operation.
A macro-op is a fixed-length instruction that:
•Expresses, at most, one integer or floating-point operation and one load and/or store operation.
•Is the primary unit of work managed (that is, dispatched and retired) by the processor.
A micro-op is a fixed-length instruction that:
•Expresses one and only one of the primitive operations that the processor can perform (for example, a load).
•Is executed by the processor’s execution units.
4 |
Introduction |
Chapter 1 |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Table 1 summarizes the differences between AMD64 instructions, macro-ops, and micro-ops.
Table 1. |
Instructions, Macro-ops and Micro-ops |
|
||
|
|
|
|
|
Comparing |
|
AMD64 instructions |
Macro-ops |
Micro-ops |
|
|
|
|
|
Complexity |
|
Complex |
Average |
Simple |
|
|
A single instruction may |
A single macro-op may |
A single micro-op |
|
|
specify one or more of |
specify—at most—one |
specifies only one of the |
|
|
each of the following |
integer or floating-point |
following primitive |
|
|
operations: |
operation and one of the |
operations: |
|
|
• Integer or floating-point |
following operations: |
• Integer or floating-point |
|
|
|
||
|
|
operation |
• Load |
• Load |
|
|
|
|
|
|
|
• Load |
• Store |
• Store |
|
|
|
|
|
|
|
• Store |
• Load and store to the |
|
|
|
|
same address |
|
|
|
|
|
|
Encoded length |
Variable (instructions are |
Fixed (all macro-ops are |
Fixed (all micro-ops are |
|
|
|
different lengths) |
the same length) |
the same length) |
|
|
|
|
|
Regularized |
|
No (field locations and |
Yes (field locations and |
Yes (field locations and |
instruction fields |
definitions vary among |
definitions are the same |
definitions are the same |
|
|
|
instructions) |
for all macro-ops) |
for all micro-ops) |
|
|
|
|
|
Types of Instructions
Instructions are classified according to how they are decoded by the processor. There are three types of instructions:
Instruction Type |
Description |
|
|
DirectPath Single |
A relatively common instruction that the processor decodes directly into one macro-op |
|
in hardware. |
|
|
DirectPath Double |
A relatively common instruction that the processor decodes directly into two macro- |
|
ops in hardware. |
|
|
VectorPath |
A sophisticated or less common instruction that the processor decodes into one or |
|
more (usually three or more) macro-ops using the on-chip microcode-engine ROM |
|
(MROM). |
|
|
Chapter 1 |
Introduction |
5 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
While all of the optimizations in this guide help improve software performance, some of them have more impact than others. Optimizations that offer the most improvement are called key optimizations.
Guideline
Concentrate your efforts on implementing key optimizations before moving on to other optimizations, and incorporate higher-ranking key optimizations first.
Key Optimizations by Rank
Table 1 lists the key optimizations by rank.
Table 2. |
Optimizations by Rank |
|
|
|
|
Rank |
Optimization |
Page |
|
|
|
1 |
Memory-Size Mismatches |
92 |
|
|
|
2 |
Natural Alignment of Data Objects |
95 |
|
|
|
3 |
Memory Copy |
120 |
|
|
|
4 |
Density of Branches |
126 |
|
|
|
5 |
Prefetch Instructions |
104 |
|
|
|
6 |
Two-Byte Near-Return RET Instruction |
128 |
|
|
|
7 |
DirectPath Instructions |
72 |
|
|
|
8 |
Load-Execute Integer Instructions |
73 |
|
|
|
9 |
Load-Execute Floating-Point Instructions with Floating-Point Operands |
74 |
|
|
|
10 |
Load-Execute Floating-Point Instructions with Integer Operands |
74 |
|
|
|
11 |
Write-combining |
113 |
|
|
|
12 |
Branches That Depend on Random Data |
130 |
|
|
|
13 |
Half-Register Operations |
356 |
|
|
|
14 |
Placing Code and Data in the Same 64-Byte Cache Line |
116 |
|
|
|
6 |
Introduction |
Chapter 1 |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Chapter 2 C and C++ Source-Level
Optimizations
Although C and C++ compilers generally produce very compact object code, many performance improvements are possible by careful source code optimization. Most such optimizations result from taking advantage of the underlying mechanisms used by C and C++ compilers to translate source code into sequences of AMD64 instructions. This chapter includes guidelines for writing C and C++ source code that result in the most efficiently optimized AMD64 code.
This chapter covers the following topics:
Topic |
Page |
|
|
Declarations of Floating-Point Values |
9 |
|
|
Using Arrays and Pointers |
10 |
|
|
Unrolling Small Loops |
13 |
|
|
Expression Order in Compound Branch Conditions |
14 |
|
|
Long Logical Expressions in If Statements |
16 |
|
|
Arrange Boolean Operands for Quick Expression Evaluation |
17 |
|
|
Dynamic Memory Allocation Consideration |
19 |
|
|
Unnecessary Store-to-Load Dependencies |
20 |
|
|
Matching Store and Load Size |
22 |
|
|
SWITCH and Noncontiguous Case Expressions |
25 |
|
|
Arranging Cases by Probability of Occurrence |
28 |
|
|
Use of Function Prototypes |
29 |
|
|
Use of const Type Qualifier |
30 |
|
|
Generic Loop Hoisting |
31 |
|
|
Local Static Functions |
34 |
|
|
Explicit Parallelism in Code |
35 |
|
|
Extracting Common Subexpressions |
37 |
|
|
Sorting and Padding C and C++ Structures |
39 |
|
|
Sorting Local Variables |
41 |
|
|
Replacing Integer Division with Multiplication |
43 |
|
|
Frequently Dereferenced Pointer Arguments |
44 |
|
|
Array Indices |
46 |
|
|
32-Bit Integral Data Types |
47 |
|
|
Sign of Integer Operands |
48 |
|
|
Chapter 2 |
C and C++ Source-Level Optimizations |
7 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
Topic |
Page |
|
|
Accelerating Floating-Point Division and Square Root |
50 |
|
|
Fast Floating-Point-to-Integer Conversion |
52 |
|
|
Speeding Up Branches Based on Comparisons Between Floats |
54 |
|
|
8 |
C and C++ Source-Level Optimizations |
Chapter 2 |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
2.1Declarations of Floating-Point Values
Optimization
When working with single precision (float) values:
•Use the f or F suffix (for example, 3.14f) to specify a constant value of type float.
•Use function prototypes for all functions that accept arguments of type float.
Application
This optimization applies to:
•32-bit software
•64-bit software
Rationale
C and C++ compilers treat floating-point constants and arguments as double precision (double) unless you specify otherwise. However, single precision floating-point values occupy half the memory space as double precision values and can often provide the precision necessary for a given computational problem.
Chapter 2 |
C and C++ Source-Level Optimizations |
9 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
Optimization
Use array notation instead of pointer notation when working with arrays.
Application
This optimization applies to:
•32-bit software
•64-bit software
Rationale
C allows the use of either the array operator ([]) or pointers to access the elements of an array. However, the use of pointers in C makes work difficult for optimizers in C compilers. Without detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can write to any location in memory, including storage allocated to other variables. (For example, *p and *q can refer to the same memory location, while x[0] and x[2] cannot.) Using pointers causes aliasing, where the same block of memory is accessible in more than one way. Using array notation makes the task of the optimizer easier by reducing possible aliasing.
10 |
C and C++ Source-Level Optimizations |
Chapter 2 |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Example
Avoid code, such as the following, which uses pointer notation:
typedef struct { float x, y, z, w;
} VERTEX;
typedef struct { float m[4][4];
} MATRIX;
void XForm(float *res, const float *v, const float *m, int numverts) {
float dp; int i;
const VERTEX* vv = (VERTEX *)v;
for (i = 0; i < numverts; i++) { dp = vv->x * *m++;
dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;
*res++ = dp; // Write transformed x.
dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;
*res++ = dp; // Write transformed y.
dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;
*res++ = dp; // Write transformed z.
dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;
*res++ = dp; // Write transformed w.
++vv; |
// |
Next input vertex |
|
m -= 16; |
// |
Reset to start of |
transform matrix. |
}
}
Chapter 2 |
C and C++ Source-Level Optimizations |
11 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
Instead, use the equivalent array notation:
typedef struct { float x, y, z, w;
} VERTEX;
typedef struct { float m[4][4];
} MATRIX;
void XForm(float *res, const float *v, const float *m, int numverts) {
int i;
const VERTEX* vv = (VERTEX *)v; const MATRIX* mm = (MATRIX *)m; VERTEX* rr = (VERTEX *)res;
for (i = |
0; i < numverts; i++) { |
rr->x |
= vv->x * mm->m[0][0] + vv->y * mm->m[0][1] + |
|
vv->z * mm->m[0][2] + vv->w * mm->m[0][3]; |
rr->y |
= vv->x * mm->m[1][0] + vv->y * mm->m[1][1] + |
|
vv->z * mm->m[1][2] + vv->w * mm->m[1][3]; |
rr->z |
= vv->x * mm->m[2][0] + vv->y * mm->m[2][1] + |
|
vv->z * mm->m[2][2] + vv->w * mm->m[2][3]; |
rr->w |
= vv->x * mm->m[3][0] + vv->y * mm->m[3][1] + |
|
vv->z * mm->m[3][2] + vv->w * mm->m[3][3]; |
++rr; |
// Increment the results pointer. |
++vv; |
// Increment the input vertex pointer. |
} |
|
}
Additional Considerations
Source-code transformations interact with a compiler’s code generator, making it difficult to control the generated machine code from the source level. It is even possible that source-code transformations aimed at improving performance may conflict with compiler optimizations. Depending on the compiler and the specific source code, it is possible for pointer-style code to compile into machine code that is faster than that generated from equivalent array-style code. Compare the performance of your code after implementing a source-code transformation with the performance of the original code to be sure that there is an improvement.
12 |
C and C++ Source-Level Optimizations |
Chapter 2 |
25112 Rev. 3.06 September 2005 |
Software Optimization Guide for AMD64 Processors |
Optimization
Completely unroll loops that have a small fixed loop count and a small loop body.
Application
This optimization applies to:
•32-bit software
•64-bit software
Rationale
Many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance, especially if the loop body is small, which makes the loop overhead significant.
Example
Avoid a small loop like this:
// 3D-transform: Multiply vector V by 4x4 transform matrix M. for (i = 0; i < 4; i++) {
r[i] = 0;
for (j = 0; j < 4; j++) { r[i] += m[j][i] * v[j];
}
}
Instead, replace it with its completely unrolled equivalent, as shown here:
r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3]; r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3]; r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3]; r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];
Related Information
For information on loop unrolling at the assembly-language level, see “Loop Unrolling” on page 145.
Chapter 2 |
C and C++ Source-Level Optimizations |
13 |
Software Optimization Guide for AMD64 Processors |
25112 Rev. 3.06 September 2005 |
Optimization
In the most active areas of a program, order the expressions in compound branch conditions to take advantage of short circuiting of compound conditional expressions.
Application
This optimization applies to:
•32-bit software
•64-bit software
Rationale
Branch conditions in C programs often consist of compound conditions consisting of multiple boolean expressions joined by the logical AND (&&) and logical OR (||) operators. C compilers guarantee short-circuit evaluation of these operators. In a compound logical OR expression, the first operand to evaluate to true terminates the evaluation, and subsequent operands are not evaluated at all. Similarly, in a logical AND expression, the first operand to evaluate to false terminates the evaluation. Because of this short-circuit evaluation, it is not always possible to swap the operands of logical OR and logical AND. This is especially true when the evaluation of one of the operands causes a side effect. However, in most cases the order of operands in such expressions is irrelevant.
When used to control conditional branches, expressions involving logical OR and logical AND are translated into a series of conditional branches. The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition and can have a significant impact on performance. It is impossible to give an easy, closed-form formula on how to order the conditions. Overall performance is a function of a variety of the following factors:
•Probability of a branch misprediction for each of the branches generated
•Additional latency incurred due to a branch misprediction
•Cost of evaluating the conditions controlling each of the branches generated
•Amount of parallelism that can be extracted in evaluating the branch conditions
•Data stream consumed by an application (mostly due to the dependence of misprediction probabilities on the nature of the incoming data in data-dependent branches)
It is recommended to experiment with the ordering of expressions in compound branch conditions in the most active areas of a program (so-called “hot spots,” where most of the execution time is spent).
14 |
C and C++ Source-Level Optimizations |
Chapter 2 |