AMD 250 User Manual

Download

Software Optimization

Guide

for

AMD64 Processors

25112Publication # 3.06Revision:

September 2005Issue Date:

The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Trademarks

AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, and combinations thereof, 3DNow! and AMD-8151 are trademarks of Advanced Micro Devices, Inc.

HyperTransport is a licensed trademark of the HyperTransport Technology Consortium.

Microsoft is a registered trademark of Microsoft Corporation.

MMX is a trademark of Intel Corporation.

Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xv

Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

1.1 Intended Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

1.2 Getting Started Quickly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

1.3 Using This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2

1.4 Important New Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4

1.5 Key Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6

Chapter 2 C and C++ Source-Level Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7

2.1 Declarations of Floating-Point Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9

2.2 Using Arrays and Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10

2.3 Unrolling Small Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13

2.4 Expression Order in Compound Branch Conditions . . . . . . . . . . . . . . . . . . . . . . . . .14

2.5 Long Logical Expressions in If Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16

2.6 Arrange Boolean Operands for Quick Expression Evaluation . . . . . . . . . . . . . . . . . .17

2.7 Dynamic Memory Allocation Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19

2.8 Unnecessary Store-to-Load Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20

2.9 Matching Store and Load Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22

2.10 SWITCH and Noncontiguous Case Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . .25

2.11 Arranging Cases by Probability of Occurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28

2.12 Use of Function Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29

2.13 Use of const Type Qualifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30

2.14 Generic Loop Hoisting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31

2.15 Local Static Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34

2.16 Explicit Parallelism in Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35

2.17 Extracting Common Subexpressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .37

2.18 Sorting and Padding C and C++ Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .39

2.19 Sorting Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41

Contents iii

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.20 Replacing Integer Division with Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43

2.21 Frequently Dereferenced Pointer Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44

2.22 Array Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46

2.23 32-Bit Integral Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47

2.24 Sign of Integer Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48

2.25 Accelerating Floating-Point Division and Square Root . . . . . . . . . . . . . . . . . . . . . . .50

2.26 Fast Floating-Point-to-Integer Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52

2.27 Speeding Up Branches Based on Comparisons Between Floats . . . . . . . . . . . . . . . .54

2.28 Improving Performance in Linux Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57

Chapter 3 General 64-Bit Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

3.1 64-Bit Registers and Integer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60

3.2 64-Bit Arithmetic and Large-Integer Multiplication . . . . . . . . . . . . . . . . . . . . . . . . .62

3.3 128-Bit Media Instructions and Floating-Point Operations . . . . . . . . . . . . . . . . . . . .67

3.4 32-Bit Legacy GPRs and Small Unsigned Integers . . . . . . . . . . . . . . . . . . . . . . . . . .68

Chapter 4 Instruction-Decoding Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71

4.1 DirectPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

4.2 Load-Execute Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73

4.2.1 Load-Execute Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73

4.2.2 Load-Execute Floating-Point Instructions with Floating-Point Operands . . .74

4.2.3 Load-Execute Floating-Point Instructions with Integer Operands . . . . . . . . .74

4.3 Branch Targets in Program Hot Spots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76

4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction . . . . . . . . . . . . . . . . . . . . . . . . . .77

4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes . . . . . . . . . . . . . .78

4.6 Short Instruction Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80

4.7 Partial-Register Reads and Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81

4.8 Using LEAVE for Function Epilogues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83

4.9 Alternatives to SHLD Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85

4.10 8-Bit Sign-Extended Immediate Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .87

4.11 8-Bit Sign-Extended Displacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88

4.12 Code Padding with Operand-Size Override and NOP . . . . . . . . . . . . . . . . . . . . . . . .89

iv Contents

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 5 Cache and Memory Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91

5.1 Memory-Size Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .92

5.2 Natural Alignment of Data Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .95

5.3 Cache-Coherent Nonuniform Memory Access (ccNUMA) . . . . . . . . . . . . . . . . . . . .96

5.4 Multiprocessor Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99

5.5 Store-to-Load Forwarding Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .100

5.6 Prefetch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .104

5.7 Streaming-Store/Non-Temporal Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112

5.8 Write-combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .113

5.9 L1 Data Cache Bank Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .114

5.10 Placing Code and Data in the Same 64-Byte Cache Line . . . . . . . . . . . . . . . . . . . . .116

5.11 Sorting and Padding C and C++ Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .117

5.12 Sorting Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .119

5.13 Memory Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .120

5.14 Stack Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122

5.15 Cache Issues when Writing Instruction Bytes to Memory . . . . . . . . . . . . . . . . . . . .123

5.16 Interleave Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .124

Chapter 6 Branch Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125

6.1 Density of Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .126

6.2 Two-Byte Near-Return RET Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .128

6.3 Branches That Depend on Random Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .130

6.4 Pairing CALL and RETURN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .132

6.5 Recursive Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133

6.6 Nonzero Code-Segment Base Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .135

6.7 Replacing Branches with Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136

6.8 The LOOP Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .141

6.9 Far Control-Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .142

Chapter 7 Scheduling Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .143

7.1 Instruction Scheduling by Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .144

7.2 Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .145

Contents v

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

7.3 Inline Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149

7.4 Address-Generation Interlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151

7.5 MOVZX and MOVSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .153

7.6 Pointer Arithmetic in Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154

7.7 Pushing Memory Data Directly onto the Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . .157

Chapter 8 Integer Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159

8.1 Replacing Division with Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .160

8.2 Alternative Code for Multiplying by a Constant . . . . . . . . . . . . . . . . . . . . . . . . . . .164

8.3 Repeated String Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .167

8.4 Using XOR to Clear Integer Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169

8.5 Efficient 64-Bit Integer Arithmetic in 32-Bit Mode . . . . . . . . . . . . . . . . . . . . . . . . .170

8.6 Efficient Implementation of Population-Count Function in 32-Bit Mode . . . . . . . .179

8.7 Efficient Binary-to-ASCII Decimal Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . .181

8.8 Derivation of Algorithm, Multiplier, and Shift Factor for Integer

Division by Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .186

8.9 Optimizing Integer Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .192

Chapter 9 Optimizing with SIMD Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .193

9.1 Ensure All Packed Floating-Point Data are Aligned . . . . . . . . . . . . . . . . . . . . . . . . .195

9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and

MOVLPS When Loading Data from Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . .196

9.3 Use MOVLPx/MOVHPx Instructions for Unaligned Data Access . . . . . . . . . . . . .198

9.4 Use MOVAPD and MOVAPS Instead of MOVUPD and MOVUPS . . . . . . . . . . .199

9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency . . . . . . . . . .200

9.6 Avoid Moving Data Directly Between

General-Purpose and MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .206

9.7 Use MMX™ Instructions to Construct Fast Block-Copy

Routines in 32-Bit Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207

9.8 Passing Data between MMX™ and 3DNow!™ Instructions . . . . . . . . . . . . . . . . . .208

9.9 Storing Floating-Point Data in MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . .209

9.10 EMMS and FEMMS Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .210

vi Contents

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

9.11 Using SIMD Instructions for Fast Square Roots and Fast

Reciprocal Square Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211

9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and

3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215

9.13 Clearing MMX™ and XMM Registers with XOR Instructions . . . . . . . . . . . . . . . .216

9.14 Finding the Floating-Point Absolute Value of Operands of SSE, SSE2, and

3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217

9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2,

and 3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .218

9.16 Complex-Number Arithmetic Using SSE, SSE2, and 3DNow!™ Instructions . . . .221

9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines . . . . . . .230

Chapter 10 x87 Floating-Point Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .237

10.1 Using Multiplication Rather Than Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .238

10.2 Achieving Two Floating-Point Operations per Clock Cycle . . . . . . . . . . . . . . . . . .239

10.3 Floating-Point Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .244

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs . . . . . . . . . . . . . . . . . . . .245

10.5 Floating-Point Subexpression Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .246

10.6 Accumulating Precision-Sensitive Quantities in x87 Registers . . . . . . . . . . . . . . . .247

10.7 Avoiding Extended-Precision Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .248

Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . .249

A.1 Key Microarchitecture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .250

A.2 Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors . . . . . .251

A.3 Superscalar Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251

A.4 Processor Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251

A.5 L1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .252

A.6 Branch-Prediction Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .253

A.7 Fetch-Decode Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254

A.8 Instruction Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254

A.9 Translation-Lookaside Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254

A.10 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .255

A.11 Integer Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .256

Contents vii

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

A.12 Integer Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .256

A.13 Floating-Point Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .257

A.14 Floating-Point Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .258

A.15 Load-Store Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .258

A.16 L2 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .259

A.17 Write-combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260

A.18 Buses for AMD Athlon™ 64 and AMD Opteron™ Processor . . . . . . . . . . . . . . . .260

A.19 Integrated Memory Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260

A.20 HyperTransport™ Technology Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260

Appendix B Implementation of Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .263

B.1 Write-Combining Definitions and Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . .263

B.2 Programming Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .264

B.3 Write-combining Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .264

B.4 Sending Write-Buffer Data to the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .266

B.5 Write-Combining Optimization on Revision D and E

AMD Athlon™ 64 and AMD Opteron™ Processors . . . . . . . . . . . . . . . . . . . . . . . .266

Appendix C Instruction Latencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .269

C.1 Understanding Instruction Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .270

C.2 Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .273

C.3 MMX™ Technology Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .303

C.4 x87 Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .307

C.5 3DNow!™ Technology Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .314

C.6 3DNow!™ Technology Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .316

C.7 SSE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .317

C.8 SSE2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .326

C.9 SSE3 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .342

Appendix D AGP Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345

D.1 Fast-Write Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345

D.2 Fast-Write Optimizations for Graphics-Engine Programming . . . . . . . . . . . . . . . . .346

D.3 Fast-Write Optimizations for Video-Memory Copies . . . . . . . . . . . . . . . . . . . . . . .349

viii Contents

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

D.4 Memory Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .351

D.5 Memory Optimizations for Graphics-Engine Programming

Using the DMA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .352

D.6 Optimizations for Texture-Map Copies to AGP Memory . . . . . . . . . . . . . . . . . . . .353

D.7 Optimizations for Vertex-Geometry Copies to AGP Memory . . . . . . . . . . . . . . . . .353

Appendix E SSE and SSE2 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .355

E.1 Half-Register Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .356

E.2 Zeroing Out an XMM Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .357

E.3 Reuse of Dead Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .359

E.4 Moving Data Between XMM Registers and GPRs . . . . . . . . . . . . . . . . . . . . . . . . .360

E.5 Saving and Restoring Registers of Unknown Format . . . . . . . . . . . . . . . . . . . . . . .361

E.6 SSE and SSE2 Copy Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .362

E.7 Explicit Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .363

E.8 Data Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .364

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .367

Contents ix

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

x Contents

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Tables

Table 1. Instructions, Macro-ops and Micro-ops ............................................................................5

Table 2. Optimizations by Rank......................................................................................................6

Table 3. Comparisons against Zero...............................................................................................55

Table 4. Comparisons against Positive Constant ..........................................................................55

Table 5. Comparisons among Two Floats.....................................................................................55

Table 6. Latency of Repeated String Instructions .......................................................................167

Table 7. L1 Instruction Cache Specifications by Processor........................................................253

Table 8. L1 Instruction TLB Specifications................................................................................255

Table 9. L1 Data TLB Specifications..........................................................................................255

Table 10. L2 TLB Specifications by Processor.............................................................................255

Table 11. L1 Data Cache Specifications by Processor..................................................................256

Table 12. Write-Combining Completion Events...........................................................................265

Table 13. Integer Instructions........................................................................................................273

Table 14. MMX™ Technology Instructions.................................................................................303

Table 15. x87 Floating-Point Instructions.....................................................................................307

Table 16. 3DNow!™ Technology Instructions.............................................................................314

Table 17. 3DNow!™ Technology Extensions ..............................................................................316

Table 18. SSE Instructions............................................................................................................317

Table 19. SSE2 Instructions..........................................................................................................326

Table 20. SSE3 Instructions..........................................................................................................342

Table 21. Clearing XMM Registers ..............................................................................................357

Table 22. Converting Scalar Values .............................................................................................364

Table 23. Converting Vector Values.............................................................................................365

Table 24. Converting Directly from Memory ...............................................................................365

Tables xi

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

xii Tables

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Figures

Figure 1. Simple SMP Block Diagram...........................................................................................97

Figure 2. Dual-Core AMD Opteron™ Processor Configuration ...................................................97

Figure 3. Memory-Limited Code .................................................................................................109

Figure 4. Processor-Limited Code ...............................................................................................109

Figure 5. AMD Athlon™ 64 and AMD Opteron™ Processors Block Diagram .........................252

Figure 6. Integer Execution Pipeline............................................................................................256

Figure 7. Floating-Point Unit .......................................................................................................258

Figure 8. Load-Store Unit ............................................................................................................259

Figure 9. AGP 8x Fast-Write Transaction ...................................................................................346

Figure 10. Cacheable-Memory Command Structure .....................................................................347

Figure 11. Northbridge Command Flow ........................................................................................352

Figures xiii

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

xiv Figures

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Revision History

Date Rev. Description

August 2005 3.06 Updated latency tables in Appendix C. Added section 8.9 on optimizing integer

division. Clarified the use of non-temporal PREFETCHNTA instruction in section

5.6. Added explanatory information to section 5.3 on ccNUMA. Added section 4.5 on AMD64 complx addressing modes. Added new section 5.13 on memory copies.

October 2004 3.05 Updated information on write-combining optimizations in Appendix B,

Implementation of Write-Combining; Added latency information for SSE3 instructions.

March 2004 3.04 Incorporated a section on ccNUMA in Chapter 5. Added sections on moving

unaligned versus unaligned data. Added to PREFETCHNTA information in Chapter

5. Fixed many minor typos.

September 2003 3.03 Made several minor typographical and formatting corrections.

July 2003 3.02 Added index references. Corrected information pertaining to L1 and L2 data and

instruction caches. Corrected information on alignment in Chapter 5, “Cache and Memory Optimizations”. Amended latency information in Appendix C.

April 2003 3.01 Clarified section 2.22 'Array Indices'. Corrected factual errors and removed

misleading examples from Cache and Memory chapter..

April 2003 3.00 Initial public release.

Revision History xv

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

xvi Revision History

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 1 Introduction

This guide provides optimization information and recommendations for the AMD Athlon™ 64 and AMD Opteron™ processors. These optimizations are designed to yield software code that is fast, compact, and efficient. Toward this end, the optimizations in each of the following chapters are listed in order of importance.

This chapter covers the following topics:

Topic Page

Intended Audience 1

Getting Started Quickly 1

Using This Guide 2

Important New Terms 4

Key Optimizations 6

1.1 Intended Audience

This book is intended for compiler and assembler designers, as well as C, C++, and assemblylanguage programmers writing performance-sensitive code sequences. This guide assumes that you are familiar with the AMD64 instruction set and the AMD64 architecture (registers and programming modes). For complete information on the AMD64 architecture and instruction set, see the multivolume AMD64 Architecture Programmer’s Manual available from AMD.com. Documentation volumes and their order numbers are provided below.

Title Order no.

Volume 1 , Application Programming 24592

Volume 2 , System Programming 24593

Volume 3 , General-Purpose and System Instructions 24594

Volume 4 , 128-Bit Media Instructions 26568

Volume 5 , 64-Bit Media and x87 Floating-Point Instructions 26569

1.2 Getting Started Quickly

More experienced readers may skip to “Key Optimizations” on page 6, which identifies the most important optimizations.

Chapter 1 Introduction 1

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

1.3 Using This Guide

This chapter explains how to get the most benefit from this guide. It defines important new terms you will need to understand before reading the rest of this guide and lists the most important optimizations by rank.

Chapter 2 describes techniques that you can use to optimize your C and C++ source code. The “Application” section for each optimization indicates whether the optimization applies to 32-bit software, 64-bit software, or both.

Chapter 3 presents general assembly-language optimizations that improve the performance of software designed to run in 64-bit mode. All optimizations in this chapter apply only to 64-bit software.

The remaining chapters describe assembly-language optimizations. The “Application” section under each optimization indicates whether the optimization applies to 32-bit software, 64-bit software, or both.

Chapter 4 Instruction-Decoding Optimizations

Chapter 5 Cache and Memory Optimizations

Chapter 6 Branch Optimizations

Chapter 7 Scheduling Optimizations

Chapter 8 Integer Optimizations

Chapter 9 Optimizing with SIMD Instructions

Chapter 10 x87 Floating-Point Optimizations

Appendix A discusses the internal design, or microarchitecture, of the processor and provides specifications on the translation-lookaside buffers. It also provides information on other functional units that are not part of the main processor but are integrated on the chip.

Appendix B describes the memory write-combining feature of the processor.

Appendix C provides a complete listing of all AMD64 instructions. It shows each instruction’s encoding, decode type, execution latency, and—where applicable—the pipe used in the floating-point unit.

Appendix D discusses optimizations that improve the throughput of AGP transfers.

Appendix E describes coding practices that improve performance when using SSE and SSE2 instructions.

2 Introduction Chapter 1

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Special Information

Special information in this guide looks like this:

❖ This symbol appears next to the most important, or key, optimizations.

Numbering Systems

The following suffixes identify different numbering systems:

This suffix Identifies a

b Binary number. For example, the binary equivalent of the number 5 is written 101b.

d Decimal number. Decimal numbers are followed by this suffix only when the possibility of

confusion exists. In general, decimal numbers are shown without a suffix.

h Hexadecimal number. For example, the hexadecimal equivalent of the number 60 is

written 3Ch.

Typographic Notation

This guide uses the following typographic notations for certain types of information:

This type of text Identifies

italic Placeholders that represent information you must provide. Italicized text is also used

for the titles of publications and for emphasis.

monowidth Program statements and function names.

Providing Feedback

If you have suggestions for improving this guide, we would like to hear from you. Please send your comments to the following e-mail address:

code.optimization@amd.com

Chapter 1 Introduction 3

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

1.4 Important New Terms

This section defines several important terms and concepts used in this guide.

Primitive Operations

AMD Athlon 64 and AMD Opteron processors perform four types of primitive operations:

• Integer (arithmetic or logic)

• Floating-point (arithmetic)

• Load

• Store

Internal Instruction Formats

The AMD64 instruction set is complex; instructions have variable-length encodings and many perform multiple primitive operations. AMD Athlon 64 and AMD Opteron processors do not execute these complex instructions directly, but, instead, decode them internally into simpler fixed-length instructions called macro-ops. Processor schedulers subsequently break down macro-ops into sequences of even simpler instructions called micro-ops, each of which specifies a single primitive operation.

A macro-op is a fixed-length instruction that:

• Expresses, at most, one integer or floating-point operation and one load and/or store operation.

• Is the primary unit of work managed (that is, dispatched and retired) by the processor.

A micro-op is a fixed-length instruction that:

• Expresses one and only one of the primitive operations that the processor can perform (for example, a load).

• Is executed by the processor’s execution units.

4 Introduction Chapter 1

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Table 1 summarizes the differences between AMD64 instructions, macro-ops, and micro-ops.

Table 1. Instructions, Macro-ops and Micro-ops

Comparing AMD64 instructions Macro-ops Micro-ops

Complexity Complex

A single instruction may specify one or more of each of the following operations:

• Integer or floating-point operation

• Load

•Store

Encoded length Variable (instructions are

different lengths)

Regularized instruction fields

No (field locations and definitions vary among instructions)

Average

A single macro-op may specify—at most—one integer or floating-point operation and one of the following operations:

• Load

•Store

• Load and store to the same address

Fixed (all macro-ops are the same length)

Yes (field locations and definitions are the same for all macro-ops)

Simple

A single micro-op specifies only one of the following primitive operations:

• Integer or floating-point

• Load

•Store

Fixed (all micro-ops are the same length)

Yes (field locations and definitions are the same for all micro-ops)

Types of Instructions

Instructions are classified according to how they are decoded by the processor. There are three types of instructions:

Instruction Type Description

DirectPath Single A relatively common instruction that the processor decodes directly into one macro-op

in hardware.

DirectPath Double A relatively common instruction that the processor decodes directly into two macro-

ops in hardware.

VectorPath A sophisticated or less common instruction that the processor decodes into one or

more (usually three or more) macro-ops using the on-chip microcode-engine ROM (MROM).

Chapter 1 Introduction 5

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

1.5 Key Optimizations

While all of the optimizations in this guide help improve software performance, some of them have more impact than others. Optimizations that offer the most improvement are called key optimizations.

Guideline

Concentrate your efforts on implementing key optimizations before moving on to other optimizations, and incorporate higher-ranking key optimizations first.

Key Optimizations by Rank

Table 1 lists the key optimizations by rank.

Table 2. Optimizations by Rank

Rank Optimization Page

1 Memory-Size Mismatches 92

2 Natural Alignment of Data Objects 95

3 Memory Copy 120

4 Density of Branches 126

5 Prefetch Instructions 104

6 Two-Byte Near-Return RET Instruction 128

7 DirectPath Instructions 72

8 Load-Execute Integer Instructions 73

9 Load-Execute Floating-Point Instructions with Floating-Point Operands 74

10 Load-Execute Floating-Point Instructions with Integer Operands 74

11 Write-combining 113

12 Branches That Depend on Random Data 130

13 Half-Register Operations 356

14 Placing Code and Data in the Same 64-Byte Cache Line 116

6 Introduction Chapter 1

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 2 C and C++ Source-Level

Optimizations

Although C and C++ compilers generally produce very compact object code, many performance improvements are possible by careful source code optimization. Most such optimizations result from taking advantage of the underlying mechanisms used by C and C++ compilers to translate source code into sequences of AMD64 instructions. This chapter includes guidelines for writing C and C++ source code that result in the most efficiently optimized AMD64 code.

This chapter covers the following topics:

Topic Page

Declarations of Floating-Point Values 9

Using Arrays and Pointers 10

Unrolling Small Loops 13

Expression Order in Compound Branch Conditions 14

Long Logical Expressions in If Statements 16

Arrange Boolean Operands for Quick Expression Evaluation 17

Dynamic Memory Allocation Consideration 19

Unnecessary Store-to-Load Dependencies 20

Matching Store and Load Size 22

SWITCH and Noncontiguous Case Expressions 25

Arranging Cases by Probability of Occurrence 28

Use of Function Prototypes 29

Use of const Type Qualifier 30

Generic Loop Hoisting 31

Local Static Functions 34

Explicit Parallelism in Code 35

Extracting Common Subexpressions 37

Sorting and Padding C and C++ Structures 39

Sorting Local Variables 41

Replacing Integer Division with Multiplication 43

Frequently Dereferenced Pointer Arguments 44

Array Indices 46

32-Bit Integral Data Types 47

Sign of Integer Operands 48

Chapter 2 C and C++ Source-Level Optimizations 7

Software Optimization Guide for AMD64 Processors

Topic Page

Accelerating Floating-Point Division and Square Root 50

Fast Floating-Point-to-Integer Conversion 52

Speeding Up Branches Based on Comparisons Between Floats 54

25112 Rev. 3.06 September 2005

8 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.1 Declarations of Floating-Point Values

Optimization

When working with single precision (float) values:

• Use the f or F suffix (for example, 3.14f) to specify a constant value of type float.

• Use function prototypes for all functions that accept arguments of type float.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

C and C++ compilers treat floating-point constants and arguments as double precision (double) unless you specify otherwise. However, single precision floating-point values occupy half the memory space as double precision values and can often provide the precision necessary for a given computational problem.

Chapter 2 C and C++ Source-Level Optimizations 9

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.2 Using Arrays and Pointers

Optimization

Use array notation instead of pointer notation when working with arrays.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

C allows the use of either the array operator ([]) or pointers to access the elements of an array. However, the use of pointers in C makes work difficult for optimizers in C compilers. Without detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can write to any location in memory, including storage allocated to other variables. (For example, *p and

*q can refer to the same memory location, while x[0] and x[2] cannot.) Using pointers causes

aliasing, where the same block of memory is accessible in more than one way. Using array notation makes the task of the optimizer easier by reducing possible aliasing.

10 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Example

Avoid code, such as the following, which uses pointer notation:

typedef struct { float x, y, z, w; } VERTEX;

typedef struct { float m[4][4]; } MATRIX;

void XForm(float *res, const float *v, const float *m, int numverts) {

float dp; int i; const VERTEX* vv = (VERTEX *)v;

for (i = 0; i < numverts; i++) { dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;

*res++ = dp; // Write transformed x.

dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;

*res++ = dp; // Write transformed y.

dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;

*res++ = dp; // Write transformed z.

dp = vv->x * *m++; dp += vv->y * *m++; dp += vv->z * *m++; dp += vv->w * *m++;

*res++ = dp; // Write transformed w.

++vv; // Next input vertex m -= 16; // Reset to start of transform matrix. } }

Chapter 2 C and C++ Source-Level Optimizations 11

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Instead, use the equivalent array notation:

typedef struct { float x, y, z, w; } VERTEX;

typedef struct { float m[4][4]; } MATRIX;

void XForm(float *res, const float *v, const float *m, int numverts) {

int i; const VERTEX* vv = (VERTEX *)v; const MATRIX* mm = (MATRIX *)m; VERTEX* rr = (VERTEX *)res;

for (i = 0; i < numverts; i++) { rr->x = vv->x * mm->m[0][0] + vv->y * mm->m[0][1] + vv->z * mm->m[0][2] + vv->w * mm->m[0][3]; rr->y = vv->x * mm->m[1][0] + vv->y * mm->m[1][1] + vv->z * mm->m[1][2] + vv->w * mm->m[1][3]; rr->z = vv->x * mm->m[2][0] + vv->y * mm->m[2][1] + vv->z * mm->m[2][2] + vv->w * mm->m[2][3]; rr->w = vv->x * mm->m[3][0] + vv->y * mm->m[3][1] + vv->z * mm->m[3][2] + vv->w * mm->m[3][3]; ++rr; // Increment the results pointer. ++vv; // Increment the input vertex pointer. } }

Additional Considerations

Source-code transformations interact with a compiler’s code generator, making it difficult to control the generated machine code from the source level. It is even possible that source-code transformations aimed at improving performance may conflict with compiler optimizations. Depending on the compiler and the specific source code, it is possible for pointer-style code to compile into machine code that is faster than that generated from equivalent array-style code. Compare the performance of your code after implementing a source-code transformation with the performance of the original code to be sure that there is an improvement.

12 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.3 Unrolling Small Loops

Optimization

Completely unroll loops that have a small fixed loop count and a small loop body.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Many compilers do not aggressively unroll loops. Manually unrolling loops can benefit performance, especially if the loop body is small, which makes the loop overhead significant.

Example

Avoid a small loop like this:

// 3D-transform: Multiply vector V by 4x4 transform matrix M. for (i = 0; i < 4; i++) { r[i] = 0; for (j = 0; j < 4; j++) { r[i] += m[j][i] * v[j]; } }

Instead, replace it with its completely unrolled equivalent, as shown here:

r[0] = m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3]; r[1] = m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3]; r[2] = m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3]; r[3] = m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3];

Related Information

For information on loop unrolling at the assembly-language level, see “Loop Unrolling” on page 145.

Chapter 2 C and C++ Source-Level Optimizations 13

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.4 Expression Order in Compound Branch Conditions

Optimization

In the most active areas of a program, order the expressions in compound branch conditions to take advantage of short circuiting of compound conditional expressions.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Branch conditions in C programs often consist of compound conditions consisting of multiple boolean expressions joined by the logical AND (&&) and logical OR (||) operators. C compilers guarantee short-circuit evaluation of these operators. In a compound logical OR expression, the first operand to evaluate to true terminates the evaluation, and subsequent operands are not evaluated at all. Similarly, in a logical AND expression, the first operand to evaluate to false terminates the evaluation. Because of this short-circuit evaluation, it is not always possible to swap the operands of logical OR and logical AND. This is especially true when the evaluation of one of the operands causes a side effect. However, in most cases the order of operands in such expressions is irrelevant.

When used to control conditional branches, expressions involving logical OR and logical AND are translated into a series of conditional branches. The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition and can have a significant impact on performance. It is impossible to give an easy, closed-form formula on how to order the conditions. Overall performance is a function of a variety of the following factors:

• Probability of a branch misprediction for each of the branches generated

• Additional latency incurred due to a branch misprediction

• Cost of evaluating the conditions controlling each of the branches generated

• Amount of parallelism that can be extracted in evaluating the branch conditions

• Data stream consumed by an application (mostly due to the dependence of misprediction

probabilities on the nature of the incoming data in data-dependent branches)

It is recommended to experiment with the ordering of expressions in compound branch conditions in the most active areas of a program (so-called “hot spots,” where most of the execution time is spent).

14 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Such hot spots can be found through the use of profiling by feeding a typical data stream to the program while doing the experiments.

Chapter 2 C and C++ Source-Level Optimizations 15

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.5 Long Logical Expressions in If Statements

Optimization

In if statements, avoid long logical expressions that can generate dense conditional branches that violate the guideline described in “Density of Branches” on page 126.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Listing 1. Preferred for Data that Falls Mostly Within the Range

if (a <= max && a >= min && b <= max && b >= min)

If most of the data falls within the range, the branches will not be taken, so the above code is preferred. Otherwise, the following code is preferred.

Listing 2. Preferred for Data that Does Not Fall Mostly Within the Range

if (a > max || a < min || b > max || b < min)

16 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.6 Arrange Boolean Operands for Quick Expression Evaluation

Optimization

In expressions that use the logical AND (&&) or logical OR (||) operator, arrange the operands for quick evaluation of the expression:

If the expression uses this operator

&& (logical AND) False

|| (logical OR) True

Then arrange the operands from left to right in decreasing probablity of being

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

C and C++ compilers guarantee short-circuit evaluation of the boolean operators && and ||. In an expression that uses &&, the first operand to evaluate to false terminates the evaluation; subsequent operands are not evaluated. In an expression that uses ||, the first operand to evaluate to true terminates the evaluation.

When used to control program flow, expressions involving && and || are translated into a series of conditional branches. This optimization minimizes the total number of conditions evaluated and branches executed.

Example 1

In the following code, the operands of && are not arranged for quick expression evaluation because the first operand is not the condition case most likely to be false (it is far less likely for an animal name to begin with a ‘y’ than for it to have fewer than four characters):

char animalname[30]; char *p;

p = animalname;

if ((strlen(p) > 4) && (*p == 'y')) { ... }

Chapter 2 C and C++ Source-Level Optimizations 17

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Because the odds that the animal name begins with a ‘y’ are comparatively low, it is better to put that operand first:

if ((*p == 'y') && (strlen(p) > 4)) { ... }

Example 2

In the following code (assuming a uniform random distribution of i), the operands of || are not arranged for quick expression evaluation because the first operand is not the condition most likely to be true:

unsigned int i;

if ((i < 4) || (i & 1)) { ... }

Because it is more likely for the least-significant bit of i to be 1, it is better to put that operand first:

if ((i & 1) || (i < 4)) { ... }

18 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.7 Dynamic Memory Allocation Consideration

Dynamic memory allocation—accomplished through the use of the malloc library function in C— should always return a pointer that is suitably aligned for the largest base type (quadword alignment). Where this aligned pointer cannot be guaranteed, use the technique shown in the following code to make the pointer quadword aligned, if needed. This code assumes that it is possible to cast the pointer to a long.

double *p; double *np;

p = (double *)malloc(sizeof(double) * number_of_doubles + 7L); np = (double *)((((long)(p)) + 7L) & (-8L));

Then use np instead of p to access the data. The pointer p is still needed in order to deallocate the storage.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Chapter 2 C and C++ Source-Level Optimizations 19

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.8 Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. For details, see “Store-to-Load Forwarding Restrictions” on page 100. The AMD Athlon™ 64 and AMD Opteron™ processors contain hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register.

Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as may occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. Therefore, it is recommended that the programmer remove the dependency manually, for example, by introducing a temporary variable that can be kept in a register, as in the following example. This can result in a significant performance increase.

Listing 3. Avoid

double x[VECLEN], y[VECLEN], z[VECLEN]; unsigned int k;

for (k = 1; k < VECLEN; k++) { x[k] = x[k-1] + y[k]; }

for (k = 1; k < VECLEN; k++) { x[k] = z[k] * (y[k] - x[k-1]); }

Listing 4. Preferred

double x[VECLEN], y[VECLEN], z[VECLEN]; unsigned int k; double t;

t = x[0]; for (k = 1; k < VECLEN; k++) { t = t + y[k]; x[k] = t; }

t = x[0]; for (k = 1; k < VECLEN; k++) { t = z[k] * (y[k] - t); x[k] = t; }

20 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Software Optimization Guide for AMD64 Processors

Chapter 2 C and C++ Source-Level Optimizations 21

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.9 Matching Store and Load Size

Optimization

Align memory accesses and match addresses and sizes of stores and dependent loads.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

The AMD Athlon 64 and AMD Opteron processors contain a load-store buffer to speed up the forwarding of store data to dependent loads. However, this store-to-load forwarding (STLF) inside the load-store buffer occurs, in general, only when the addresses and sizes of the store and the dependent load match, and when both memory accesses are aligned. For details, see “Store-to-Load Forwarding Restrictions” on page 100.

It is impossible to control load and store activity at the source level so as to avoid all cases that violate restrictions placed on store-to-load-forwarding. In some instances it is possible to spot such cases in the source code. Size mismatches can easily occur when different-size data items are joined in a union. Address mismatches could be the result of pointer manipulation.

The following examples show a situation involving a union of different-size data items. The examples show a user-defined unsigned 16.16 fixed-point type and two operations defined on this type. Function portion of a fixed-point number. Listing 5 shows an inappropriate implementation of which, when used on the result of

fixed_add adds two fixed-point numbers, and function fixed_int extracts the integer

fixed_int,

fixed_add, causes misalignment, address mismatch, or size

mismatch between memory operands, such that no store-to-load forwarding in the load-store buffer takes place. Listing 6 shows how to properly implement fixed_int in order to allow store-to-load forwarding in the load-store buffer.

Examples

Listing 5. Avoid

typedef union { unsigned int whole; struct { unsigned short frac; /* Lower 16 bits are fraction. */ unsigned short intg; /* Upper 16 bits are integer. */ } parts; } FIXED_U_16_16;

22 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

__inline FIXED_U_16_16 fixed_add(FIXED_U_16_16 x, FIXED_U_16_16 y) { FIXED_U_16_16 z; z.whole = x.whole + y.whole; return (z); }

__inline unsigned int fixed_int(FIXED_U_16_16 x) { return((unsigned int)(x.parts.intg)); } ... FIXED_U_16_16 y, z; unsigned int q; ... label1: y = fixed_add (y, z); q = fixed_int (y);

label2: ...

Software Optimization Guide for AMD64 Processors

The object code generated for the source code between label1 and label2 typically follows one of these two variants:

; Variant 1 mov edx, DWORD PTR [z] mov eax, DWORD PTR [y] ; -+ add eax, edx ; | mov DWORD PTR [y], eax ; | mov EAX, DWORD PTR [y+2] ; <+ Address mismatch--no forwarding in LSU and EAX, 0FFFFh mov DWORD PTR [q], eax

; Variant 2 mov edx, DWORD PTR [z] mov eax, DWORD PTR [y] ; -+ add eax, edx ; | mov DWORD PTR [y], eax ; | movzx eax, WORD PTR [y+2] ; <+ Size and address mismatch--no forwarding in LSU mov DWORD PTR [q], eax

Listing 6. Preferred

typedef union { unsigned int whole; struct { unsigned short frac; /* Lower 16 bits are fraction. */ unsigned short intg; /* Upper 16 bits are integer. */ } parts; } FIXED_U_16_16;

__inline FIXED_U_16_16 fixed_add(FIXED_U_16_16 x, FIXED_U_16_16 y) {

Chapter 2 C and C++ Source-Level Optimizations 23

Software Optimization Guide for AMD64 Processors

FIXED_U_16_16 z; z.whole = x.whole + y.whole; return(z); }

__inline unsigned int fixed_int(FIXED_U_16_16 x) { return (x.whole >> 16); } ... FIXED_U_16_16 y, z; unsigned int q; ... label1: y = fixed_add (y, z); q = fixed_int (y);

label2: ...

25112 Rev. 3.06 September 2005

The object code generated for the source code between label1 and label2 typically looks like this:

mov edx, DWORD PTR [z] mov eax, DWORD PTR [y] add eax, edx mov DWORD PTR [y], eax ; -+ mov eax, DWORD PTR [y] ; <+ Aligned (size/address match)--forwarding in LSU shr eax, 16 mov DWORD PTR [q], eax

24 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.10 SWITCH and Noncontiguous Case Expressions

Optimization

Use if-else statements in place of switch statements that have noncontiguous case expressions. (Case expressions are the individual expressions to which the single switch expression is compared.)

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

If the case expressions are contiguous or nearly contiguous integer values, most compilers translate the switch statement as a jump table instead of a comparison chain. Jump tables generally improve performance because:

• They reduce the number of branches to a single procedure call.

• The size of the control-flow code is the same no matter how many cases there are.

• The amount of control-flow code that the processor must execute is the same for all values of the

switch expression.

However, if the case expressions are noncontiguous values, most compilers translate the switch statement as a comparison chain. Comparison chains are undesirable because:

• They use dense sequences of conditional branches, which interfere with the processor’s ability to

successfully perform branch prediction.

• The size of the control-flow code increases with the number of cases.

• The amount of control-flow code that the processor must execute varies with the value of the

switch expression.

Chapter 2 C and C++ Source-Level Optimizations 25

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Example 1

A switch statement like this one, whose case expressions are contiguous integer values, usually provides good performance:

switch (grade) { case ‘A’: ... break; case ‘B’: ... break; case ‘C’: ... break; case ‘D’: ... break; case ‘F’: ... break; }

Example 2

Because the case expressions in the following switch statement are not contiguous values, the compiler will likely translate the code into a comparison chain instead of a jump table:

switch (a) { case 8: // Sequence for a==8 break; case 16: // Sequence for a==16 break; ... default: // Default sequence break; }

26 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

To avoid a comparison chain and its undesirable effects on branch prediction, replace the switch statement with a series of if-else statements, as follows:

if (a==8) { // Sequence for a==8 } else if (a==16) { // Sequence for a==16 } ... else { // Default sequence }

Related Information

For information on preventing branch-prediction interference at the assembly-language level, see “Density of Branches” on page 126.

Chapter 2 C and C++ Source-Level Optimizations 27

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.11 Arranging Cases by Probability of Occurrence

Optimization

Arrange switch statement cases by probability of occurrence, from most probable to least probable.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Arranging switch statement cases by probability of occurrence improves performance when the

switch statement is translated as a comparison chain; this arrangement has no negative impact when

the statement is translated as a jump table.

Example

Avoid switch statements such as the following, in which the cases are not arranged by probability of occurrence:

int days_in_month, short_months, normal_months, long_months;

switch (days_in_month) { case 28: case 29: short_months++; break; case 30: normal_months++; break; case 31: long_months++; break; default: printf("Month has fewer than 28 or more than 31 days.\n"); }

Instead, arrange the cases to test for frequently occurring values first:

switch (days_in_month) { case 31: long_months++; break; case 30: normal_months++; break; case 28: case 29: short_months++; break; default: printf("Month has fewer than 28 or more than 31 days.\n"); }

28 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.12 Use of Function Prototypes

Optimization

In general, use prototypes for all functions.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Prototypes can convey additional information to the compiler that might enable more aggressive optimizations.

Chapter 2 C and C++ Source-Level Optimizations 29

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.13 Use of const Type Qualifier

Optimization

For objects whose values will not be changed, use the const type qualifier.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Using the const type qualifier makes code more robust and may enable the compiler to generate higher-performance code. For example, under the C standard, a compiler is not required to allocate storage for an object that is declared const, if its address is never used.

30 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.14 Generic Loop Hoisting

Optimization

To improve the performance of inner loops, reduce redundant constant calculations (that is, loopinvariant calculations). This idea can also be extended to invariant control structures.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale and Examples

The following example demonstrates the use of an invarient condition in an if statement in a for loop. The second listing shows the preferred optimization.

Listing 7. (Avoid)

for (i...) { if (CONSTANT0) { DoWork0(i); // Does not affect CONSTANT0. } else { DoWork1(i); // Does not affect CONSTANT0. } }

Listing 8. (Preferred Optimzation)

if (CONSTANT0) { for (i...) { DoWork0(i); } } else { for (i...) { DoWork1(i); } }

The preferred optimization in Listing 8 tightens the inner loops by avoiding repetitious evaluation of a known

if control structure. Although the branch would be easily predicted, the extra instructions and

decode limitations imposed by branching (which are usually advantageous) are saved.

Chapter 2 C and C++ Source-Level Optimizations 31

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

To generalize the example in Listing 8 further for multiple-constant control code, more work may be needed to create the proper outer loop. Enumeration of the constant cases reduces this to a simple

switch statement.

Listing 9.

for (i...) { if (CONSTANT0) { DoWork0(i); // Does not affect CONSTANT0 or CONSTANT1. } else { DoWork1(i); // Does not affect CONSTANT0 or CONSTANT1. }

if (CONSTANT1) { DoWork2(i); // Does not affect CONSTANT0 or CONSTANT1. } else { DoWork3(i); // Does not affect CONSTANT0 or CONSTANT1. } }

32 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Transform the loop in Listing 9 (by using the switch statement) into:

#define combine(c1, c2) (((c1) << 1) + (c2)) switch (combine(CONSTANT0 != 0, CONSTANT1 != 0)) { case combine(0, 0): for(i...) { DoWork0(i); DoWork2(i); } break; case combine(1, 0): for(i...) { DoWork1(i); DoWork2(i); } break; case combine(0, 1): for(i...) { DoWork0(i); DoWork3(i); } break; case combine( 1, 1 ): for(i...) { DoWork1(i); DoWork3(i); } break; default: break; }

Some introductory code is necessary to generate all the combinations for the switch constant and the total amount of code has doubled. However, the inner loops are now free of cases where the to greater parallelism than possible in the presence of intervening

DoWorkn functions are inlined, the successive functions have greater overlap, leading

if statements.

if statements. In ideal

The same idea can be applied to constant switch statements or to combinations of switch statements and

if statements inside of for loops. The method used to combine the input constants becomes

more complicated but benefits performance.

However, the number of inner loops can also substantially increase. If the number of inner loops is prohibitively high, then only the most common cases must be dealt with directly, and the remaining cases can fall back to the old code in the default clause of the

switch statement. This situation is

typical of run-time generated code. While the performance of run-time generated code can be improved by means similar to those presented here, it is much harder to maintain and developers must do their own code-generation optimizations without the help of an available compiler.

Chapter 2 C and C++ Source-Level Optimizations 33

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.15 Local Static Functions

Optimization

Declare as static functions that are not used outside the file where they are defined.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Declaring a function as static forces internal linkage. Functions that are not declared as static default to external linkage, which may inhibit certain optimizations—for example, aggressive inlining—with some compilers.

34 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.16 Explicit Parallelism in Code

Optimization

Where possible, break long dependency chains into several independent dependency chains that can then be executed in parallel, exploiting the execution units in each pipeline.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale and Examples

This is especially important to break long dependency chains into smaller executing units in floatingpoint code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer latency of floating-point operations. Because most languages (including ANSI C) guarantee that floating-point expressions are not reordered, compilers cannot usually perform such optimizations unless they offer a switch to allow noncompliant reordering of floating-point expressions according to algebraic rules.

Reordered code that is algebraically identical to the original code does not necessarily produce identical computational results due to the lack of associativity of floating-point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of cases, the final result differs only in the least-significant bits.

Listing 10. Avoid

double a[100], sum; int i;

sum = 0.0f; for (i = 0; i < 100; i++) { sum += a[i]; }

Chapter 2 C and C++ Source-Level Optimizations 35

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Listing 11. Preferred

double a[100], sum1, sum2, sum3, sum4, sum; int i;

sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; for (i = 0; i < 100; i + 4) { sum1 += a[i]; sum2 += a[i+1]; sum3 += a[i+2]; sum4 += a[i+3]; } sum = (sum4 + sum3) + (sum1 + sum2);

Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum sustained utilization.

36 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.17 Extracting Common Subexpressions

Optimization

Manually extract common subexpressions where C compilers may be unable to extract them from floating-point expressions due to the guarantee against reordering of such expressions in the ANSI standard.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Specifically, the compiler cannot rearrange the computation according to algebraic equivalencies before extracting common subexpressions. Rearranging the expression may give different computational results due to the lack of associativity of floating-point operations, but the results usually differ in only the least-significant bits.

Examples

Listing 12. Avoid

double a, b, c, d, e, f;

e = b * c / d; f = b / d * a;

Listing 13. Preferred

double a, b, c, d, e, f, t;

t = b / d; e = c * t; f = a * t;

Listing 14. Avoid

double a, b, c, e, f;

e = a / c; f = b / c;

Chapter 2 C and C++ Source-Level Optimizations 37

Software Optimization Guide for AMD64 Processors

Listing 15. Example 2 (Preferred)

double a, b, c, e, f, t;

t = 1 / c; e = a * t f = b * t;

25112 Rev. 3.06 September 2005

38 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.18 Sorting and Padding C and C++ Structures

Optimization

Sort and pad C and C++ structures to achieve natural alignment.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

In order to achieve better alignment for structures, many compilers have options that allow padding of structures to make their sizes multiples of words, doublewords, or quadwords. In addition, to improve the alignment of structure members, some compilers might allocate structure elements in an order that differs from the order in which they are declared. However, some compilers might not offer any of these features, or their implementations might not work properly in all situations.

By sorting and padding structures at the source-code level, if the first member of a structure is naturally aligned, then all other members are naturally aligned as well. This allows, for example, arrays of structures to be perfectly aligned.

Sorting and Padding C and C++ Structures

To sort and pad a C or C++ structure, follow these steps:

1. Sort the structure members according to their type sizes, declaring members with larger type sizes

ahead of members with smaller type sizes.

2. Pad the structure so the size of the structure is a multiple of the largest member’s type size.

Examples

Avoid structure declarations in which the members are not declared in order of their type sizes and the size of the structure is not a multiple of the size of the largest member’s type:

struct { char a[5]; \\ Smallest type size (1 byte * 5) long k; \\ 4 bytes in this example double x; \\ Largest type size (8 bytes) } baz;

Chapter 2 C and C++ Source-Level Optimizations 39

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Instead, declare the members according to their type sizes (largest to smallest) and add padding to ensure that the size of the structure is a multiple of the largest member’s type size:

struct { double x; \\ Largest type size (8 bytes) long k; \\ 4 bytes in this example char a[5]; \\ Smallest type size (1 byte * 5) char pad[7]; \\ Make structure size a multiple of 8. } baz;

40 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.19 Sorting Local Variables

Optimization

Sort local variables according to their type sizes, declaring those with larger type sizes ahead of those with smaller type sizes.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

It can be helpful to presort local variables, if your compiler allocates local variables in the same order in which they are declared in the source code. If the first variable is allocated for natural alignment, all other variables are allocated contiguously in the order they are declared and are naturally aligned without padding.

Some compilers do not allocate variables in the order they are declared. In these cases, the compiler should automatically allocate variables that are naturally aligned with the minimum amount of padding. In addition, some compilers do not guarantee that the stack is aligned suitably for the largest type (that is, they do not guarantee quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.

Example

Avoid local variable declarations, when the variables are not declared in order of their type sizes:

short ga, gu, gi; long foo, bar; double x, y, z[3]; char a, b; float baz;

Instead, sort the declarations according to their type sizes (largest to smallest):

double z[3]; double x, y; long foo, bar; float baz; short ga, gu, gi;

Chapter 2 C and C++ Source-Level Optimizations 41

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Related Information

For information on sorting local variables at the assembly-language level, see “Sorting Local Variables” on page 119.

42 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.20 Replacing Integer Division with Multiplication

Optimization

Replace integer division with multiplication when there are multiple divisions in an expression. (This is possible only if no overflow will occur during the computation of the product. The possibility of an overflow can be determined by considering the possible ranges of the divisors.)

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Integer division is the slowest of all integer arithmetic operations.

Examples

Avoid code that uses two integer divisions:

int i, j, k, m;

m = i / j / k;

Instead, replace one of the integer divisions with the appropriate multiplication:

m = i / (j * k);

Chapter 2 C and C++ Source-Level Optimizations 43

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.21 Frequently Dereferenced Pointer Arguments

Optimization

Avoid dereferenced pointer arguments inside a function.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Because the compiler has no knowledge of whether aliasing exists between the pointers, such dereferencing cannot be “optimized away” by the compiler. Since data may not be maintained in registers, memory traffic can significantly increase.

Many compilers have an “assume no aliasing” optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables. If your compiler does not have this type of optimization, then copy the data pointed to by the pointer arguments to local variables at the start of the function and if necessary copy them back at the end of the function.

Examples

Listing 16. Avoid

// Assumes pointers are different and q != r. void isqrt(unsigned long a, unsigned long *q, unsigned long *r) {

*q = a; if (a > 0) { while (*q > (*r = a / *q)) { *q = (*q + *r) >> 1; } } *r = a - *q * *q; }

44 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Listing 17. Preferred

// Assumes pointers are different and q != r. void isqrt(unsigned long a, unsigned long *q, unsigned long *r) {

unsigned long qq, rr; qq = a; if (a > 0) { while (qq > (rr = a / qq)) { qq = (qq + rr) >> 1; } } rr = a - qq * qq; *q = qq; *r = rr; }

Chapter 2 C and C++ Source-Level Optimizations 45

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.22 Array Indices

Optimization

The preferred type for array indices is ptrdiff_t.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Array indices are often used with pointers while doing arithmetic. Using ptrdiff_t produces more portable code and will generally provide good performance.

46 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.23 32-Bit Integral Data Types

Optimization

Use 32-bit integers instead of integers with smaller sizes (16-bit or 8-bit).

Application

This optimization applies to 32-bit software.

Rational

Be aware of the amount of storage associated with each integral data type.

Chapter 2 C and C++ Source-Level Optimizations 47

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.24 Sign of Integer Operands

Optimization

Where there is a choice of using either a signed or an unsigned type, take into consideration that some operations are faster with unsigned types while others are faster for signed types.

Application

This optimization applies to:

• 32-bit software

Rationale

In many cases, the type of data to be stored in an integer variable determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required, so an unsigned type is appropriate. However, recording temperatures in degrees Celsius may require both positive and negative numbers, so a signed type is needed.

Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the AMD64 architecture provides instructions for converting signed integers to floating-point but has no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted by a compiler to assembly as follows:

Examples

Listing 18. (Avoid)

double x; ====> mov [temp+4], 0 unsigned int i; mov eax, i mov [temp], eax x = i; fild QWORD PTR [temp] fstp QWORD PTR [x]

The preceding code is slow not only because of the number of instructions, but also because a size mismatch prevents store-to-load forwarding to the FILD instruction. Instead, use the following code:

Listing 19. (Preferred)

double x; ====> fild DWORD PTR [i] int i; fstp QWORD PTR [x]

x = i;

Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. The following typical case is the compiler output for a 32-bit integer divided by 4:

48 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Listing 20. Example 2 (Avoid)

int i; ====> mov eax, i cdq i = i / 4; and edx, 3 add eax, edx sar eax, 2 mov i, eax

Listing 21. Example 2 (Preferred)

unsigned int i; ====> shr i, 2

i = i / 4;

In summary, use unsigned types for:

• Division and remainders

• Loop counters

• Array indexing

Software Optimization Guide for AMD64 Processors

Use signed types for:

• Integer-to-floating-point conversion

Chapter 2 C and C++ Source-Level Optimizations 49

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.25 Accelerating Floating-Point Division and Square Root

Optimization

In applications that involve the heavy use of single precision division and square root operations, it is recommended that you port the code to SSE or 3DNow!™ inline assembly or use a compiler that can generate SSE or 3DNow! technology code. If neither of these methods are possible, the x87 FPU control word register precision control specification bits (PC) can be set to single precision to improve performance. (The processor defaults to double-extended precision. See AMD64 Architecture Programmer’s Manual Volume 1: Application Programming (order# 24592) for details on the FPU control register.)

Application

This optimization applies to 32-bit software.

Rationale

Division and square root have a much longer latency than other floating-point operations, even though the AMD Athlon 64 and AMD Opteron processors provide significant acceleration of these two operations. In some application programs, these operations occur so often as to seriously impact performance. If code has hot spots that use single precision arithmetic only (that is, all computation involves data of type float) and for some reason cannot be ported to 3DNow! code, the following technique may be used to improve performance.

The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines rounding precision of instruction results and affects the basic arithmetic operations, including division and the extraction of square root. Division and square root on the AMD Athlon 64 and AMD Opteron processors are only computed to the number of bits necessary for the currently selected precision. Setting precision control to single precision (versus the Win32 default of double precision) lowers the latency of those operations.

The Microsoft® Visual C environment provides functions to manipulate the FPU control word and thus the precision control. Note that these functions are not very fast, so insert changes of precision control where it creates little overhead, such as outside a computation-intensive loop. Otherwise, the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square-root operations. For more information on this topic, see AMD64 Architecture Programmer's Manual Volume 1: Application Programming (order# 24592).

The following example shows how to set the precision control to single precision and later restore the original settings in the Microsoft Visual C environment.

50 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Examples

Listing 22.

/* Prototype for _controlfp function */ #include <float.h> unsigned int orig_cw;

/* Get current FPU control word and save it. */ orig_cw = _controlfp(0, 0);

/* Set precision control in FPU control word to single precision. This reduces the latency of divide and square-root operations. */ _controlfp(_PC_24, MCW_PC);

/* Restore original FPU control word. */ _controlfp(orig_cw, 0xfffff);

Chapter 2 C and C++ Source-Level Optimizations 51

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.26 Fast Floating-Point-to-Integer Conversion

Optimization

Use 3DNow! PF2ID instruction to perform truncating conversion to accomplish rapid floating-pointto-integer conversion, if the floating-point operand is a type float.

Application

This optimization applies to 32-bit software.

Rationale

Floating-point-to-integer conversion in C programs is typically a very slow operation. The semantics of C and C++ demand that the conversion use truncation. If the floating-point operand is of type

float, and the compiler supports 3DNow! code generation, then the 3DNow! PF2ID instruction,

which performs truncating conversion, can be utilized by the compiler to accomplish rapid floatingpoint-to-integer conversion.

Note: The PF2ID instruction does not provide conversion compliant with the IEEE-754 standard.

Some operands of type float (IEEE-754 single precision) such as NaNs, infinities, and denormals, are either unsupported or not handled in compliance with the IEEE-754 standard by 3DNow! technology.

For double precision operands, the usual way to accomplish truncating conversion involves the following algorithm:

1. Save the current x87 rounding mode (this is usually round to nearest or even).

2. Set the x87 rounding mode to truncation.

3. Load the floating-point source operand and store the integer result.

4. Restore the original x87 rounding mode.

This algorithm is typically implemented through the C run-time library function

ftol. While the

AMD Athlon 64 and AMD Opteron processors have special hardware optimizations to speed up the changing of x87 rounding modes and therefore

ftol, calls to ftol may still tend to be slow.

For situations where very fast floating-point-to-integer conversion is required, the conversion code in Listing 24 on page 53 may be helpful. This code uses the current rounding mode instead of truncation when performing the conversion. Therefore, the result may differ by 1 from the replacement code adds the “magic number” 2

52+251

to the source operand, then stores the double

ftol result. The

precision result to memory and retrieves the lower doubleword of the stored result. Adding the magic number shifts the original argument to the right inside the double precision mantissa, placing the binary point of the sum immediately to the right of the least-significant mantissa bit. Extracting the lower doubleword of the sum then delivers the integral portion of the original argument.

52 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

The following conversion code causes a 64-bit store to feed into a 32-bit load. The load is from the lower 32 bits of the 64-bit store, the one case of size mismatch between a store and a dependent load that is specifically supported by the store-to-load-forwarding hardware of the AMD Athlon 64 and AMD Opteron processors.

Examples

Listing 23. Slow

double x; int i;

i = x;

Listing 24. Fast

#define DOUBLE2INT(i, d) \ {double t = ((d) + 6755399441055744.0); i = *((int *)(&t));}

double x; int i;

DOUBLE2INT(i, x);

Chapter 2 C and C++ Source-Level Optimizations 53

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

2.27 Speeding Up Branches Based on Comparisons Between Floats

Optimization

Store operands of type float into a memory location and use integer comparison with the memory location to perform fast branches in cases where compilers do not support fast floating-point comparison instructions or 3DNow! code generation.

Application

This optimization applies to 32-bit software.

Rationale

Branches based on floating-point comparisons are often slow. The AMD Athlon 64 and AMD Opteron processors support the FCOMI, FUCOMI, FCOMIP, and FUCOMIP instructions that allow implementation of fast branches based on comparisons between operands of type double or type float. However, many compilers do not support generating these instructions. Likewise, floating-point comparisons between operands of type float can be accomplished quickly by using the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation.

Many compilers only implement branches based on floating-point comparisons by using FCOM or FCOMP to compare the floating-point operands, followed by FSTSW AX in order to transfer the x87 condition-code flags into EAX. The subsequent branch is then based on the contents of the EAX register. Although the AMD Athlon 64 and AMD Opteron processors have acceleration hardware to speed up the FSTSW instruction, this process is still fairly slow.

Branches Dependent on Integer Comparisons Are Fast

One alternative for branches dependent upon the outcome of the comparison of operands of type

float is to store the operand(s) into a memory location and then perform an integer comparison with

that memory location. Branches dependent on integer comparisons are very fast. It should be noted that the replacement code uses a load dependent on an immediately prior store. If the store is not doubleword-aligned, no store-to-load-forwarding takes place, and the branch is still slow. Also, if there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat delayed, thus negating some of the advantages of using the replacement code. It is recommended that you experiment with the replacement code to test whether it actually provides a performance increase in the code at hand.

The replacement code works well for comparisons against zero, including correct behavior when encountering a negative zero as allowed by the IEEE-754 standard. It also works well for comparing

54 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

to positive constants. In that case, the user must first determine the integer representation of that floating-point constant. This can be accomplished with the following C code snippet:

float x; scanf("%g", &x); printf("%08X\n", (*((int *)(&x))));

The replacement code is IEEE-754 compliant for all classes of floating-point operands except NaNs. However, NaNs do not occur in properly working software.

Examples

Intial definitions:

#define FLOAT2INTCAST(f) (*((int *)(&f))) #define FLOAT2UINTCAST(f) (*((unsigned int *)(&f)))

Table 3: Comparisons against Zero

Use this … Instead of this.

if (FLOAT2UINTCAST(f) > 0x80000000U) if (f < 0.0f)

if (FLOAT2INCAST(f) <= 0) if (f <= 0.0f)

if (FLOAT2INTCAST(f) > 0) if (f > 0.0f)

if (FLOAT2UINTCAST(f) <= 0x80000000U) if (f >= 0.0f)

Table 4: Comparisons against Positive Constant

Use this … Instead of this.

if (FLOAT2INTCAST(f) < 0x40400000) if (f < 3.0f)

if (FLOAT2INTCAST(f) <= 0x40400000) if (f <= 3.0f)

if (FLOAT2INTCAST(f) > 0x40400000) if (f > 3.0f)

if (FLOAT2INTCAST(f) >= 0x40400000) if (f >= 3.0f)

Table 5: Comparisons among Two Floats

Use this … Instead of this.

float t = f1 - f2; if (FLOAT2UINTCAST(t) > 0x80000000U)

if (f1 < f2)

Chapter 2 C and C++ Source-Level Optimizations 55

Software Optimization Guide for AMD64 Processors

Table 5: Comparisons among Two Floats

Use this … Instead of this.

25112 Rev. 3.06 September 2005

float t = f1 - f2; if (FLOAT2INTCAST(t) <= 0)

float t = f1 - f2; if (FLOAT2INTCAST(t) > 0)

float t = f1 - f2; f (FLOAT2UINTCAST(f) <= 0x80000000U)

if (f1 <= f2)

if (f1 > f2)

if (f1 >= f2)

56 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

2.28 Improving Performance in Linux Libraries

Optimization

If interposition is not important to a particular application, then, if using ld in the binutils package, you can make use of a linker option that results in references to public global routines inside the library that cannot be overridden.

Application This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Dynamically loadable libraries are a versatile feature of the Linux operating system. They allow one or more symbols in one library to override the same symbol in another library. Known as interposition, this ability makes customizations and probing seamless. Interposition is implemented by means of a procedure linkage table (PLT). The PLT is so flexible that even references to an overridden symbol inside the library end up referencing the overriding symbol. However, the PLT imposes a performance penalty by requiring all function calls to public global routines to go through an extra step that increases the chances of cache misses and branch mispredictions. This is particularly severe for C++ classes whose methods typically refer to other methods in the same class.

Examples

When using ld, include the following command line option:

-Bsymbolic

If using gcc to build a library, add this option to the command-line:

-Wl,-Bsymbolic

Chapter 2 C and C++ Source-Level Optimizations 57

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

58 C and C++ Source-Level Optimizations Chapter 2

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 3 General 64-Bit Optimizations

In long mode, the AMD64 architecture provides both a compatibility mode, which allows a 64-bit operating system to run existing 16-bit and 32-bit applications, and a 64-bit mode, which provides 64-bit addressing and expanded register resources to support higher performance for recompiled 64-bit programs. This chapter presents general optimizations that improve the performance of software designed to run in 64-bit mode. Therefore, all optimizations in this chapter apply only to 64-bit software.

This chapter covers the following topics:

Topic Page

64-Bit Registers and Integer Arithmetic 60

64-Bit Arithmetic and Large-Integer Multiplication 62

128-Bit Media Instructions and Floating-Point Operations 67

32-Bit Legacy GPRs and Small Unsigned Integers 68

Chapter 3 General 64-Bit Optimizations 59

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

3.1 64-Bit Registers and Integer Arithmetic

Optimization

Use 64-bit registers for 64-bit integer arithmetic.

Rationale

Using 64-bit registers instead of their 32-bit equivalents can dramatically reduce the amount of code necessary to perform 64-bit integer arithmetic.

Example 1

This code performs 64-bit addition using 32-bit registers:

; Add ECX:EBX to EDX:EAX, and place sum in EDX:EAX. 00000000 03 C3 add eax, ebx 00000002 13 D1 adc edx, ecx

Using 64-bit registers, the previous code can be replaced by one simple instruction (assuming that RAX and RBX contain the 64-bit integer values to add):

00000000 48 03 C3 add rax, rbx

Although the preceding instruction requires one additional byte for the REX prefix, it is still one byte shorter than the original code. More importantly, this instruction still has a latency of only one cycle, uses two fewer registers, and occupies only one decode slot.

60 General 64-Bit Optimizations Chapter 3

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Example 2

To perform the low-order half of the product of two 64-bit integers using 32-bit registers, a procedure such as the following is necessary:

; In: [ESP+8]:[ESP+4] = multiplicand ; [ESP+16]:[ESP+12] = multiplier ; Out: EDX:EAX = (multiplicand * multiplier) % 2^64 ; Destroys: EAX, ECX, EDX, EFlags

llmul PROC mov edx, [esp+8] ; multiplicand_hi mov ecx, [esp+16] ; multiplier_hi or edx, ecx ; One operand >= 2^32? mov edx, [esp+12] ; multiplier_lo mov eax, [esp+4] ; multiplicand_lo jnz twomul ; Yes, need two multiplies. mul edx ; multiplicand_lo * multiplier_lo ret ; Done, return to caller.

twomul: imul edx, [esp+8] ; p3_lo = multiplicand_hi * multiplier_lo imul ecx, eax ; p2_lo = multiplier_hi * multiplicand_lo add ecx, edx ; p2_lo + p3_lo mul dword ptr [esp+12] ; p1 = multiplicand_lo * multiplier_lo add edx, ecx ; p1 + p2_lo + p3_lo = result in EDX:EAX ret ; Done, return to caller.

llmul ENDP

Using 64-bit registers, the entire product can be produced with only one instruction:

; Multiply RAX by RBX. The 128-bit product is stored in RDX:RAX. 00000000 48 F7 EB imul rbx

Related Information

For more examples of 64-bit arithmetic using only 32-bit registers, see “Efficient 64-Bit Integer Arithmetic in 32-Bit Mode” on page 170.

Chapter 3 General 64-Bit Optimizations 61

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

3.2 64-Bit Arithmetic and Large-Integer Multiplication

Optimization

Use 64-bit arithmetic for integer multiplication that produces 128-bit or larger products.

Background

Large-number multiplications (those involving 128-bit or larger products) are utilized in cryptography algorithms, which figure importantly in e-commerce applications and secure transactions on the Internet. Processors cannot perform large-number multiplication natively; they must break the operation into chunks that are permitted by their architecture (32-bit or 64-bit additions and multiplications).

Rationale

Using 64-bit rather than 32-bit integer operations dramatically reduces the number of additions and multiplications required to compute large products. For example, computing a 1024-bit product using 64-bit arithmetic requires fewer than one quarter the number of instructions required when using 32-bit operations:

Comparing... 32-bit arithmetic 64-bit arithmetic

Number of multiplications 256 64

Number of additions with carry 509 125

Number of additions 255 63

In addition, the processor performs 64-bit additions just as fast as it performs 32-bit additions, and the latency of 64-bit multiplications is only slightly higher than for 32-bit multiplications. (The processor is capable of performing a 64-bit addition each clock cycle and a 64-bit multiplication every other clock cycle.)

Example

Consider the multiplication of two unsigned 64-bit numbers a and b, represented in terms of 32-bit numbers a1:a0 and b1:b0.

a = a1 * 232 + a0

b = b1 * 232 + b0

The product of a and b, c, can be expressed in terms of products of the 32-bit components, as follows:

c = (a1 * b1) * 2

+ (a1 * b0 + a0 * b1) * 232 + (a0 * b0)

62 General 64-Bit Optimizations Chapter 3

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Each of the products of the components of a and b (for example, a1 * b1) is composed of 64 bits—an upper 32 bits and a lower 32 bits. it is convenient to represent these individual products as d, e, f, and

g, as follows:

a0 * b0 = d1:d0 = d1 * 232 + d0

a1 * b0 = e1:e0 = e1 * 232 + e0

a0 * b1 = f1:f0 = f1 * 232 + f0

a1 * b1 = g1:g0 = g1 * 232 + g0

Substitution yields the following equation:

c = (g1 * 232 + g0) * 264 + (e1 * 232 + e0 + f1 * 232 + f0) * 232 + (d1 * 232 + d0)

Simplifying yields this equation:

c = g1 * 296 + (e1 + f1 + g0) * 264 + (d1 + e0 + f0) * 232 + d0

it is convenient to represent the terms that are multiplied by each power of 2 as c3, c2, c1, and c0, as follows:

g1 = c3

e1 + f1 + g0 = c2

d1 + e0 + f0 = c1

d0 = c0

Substituting again yields:

c = c3 * 296 + c2 * 264 + c1 * 232 + c0

Chapter 3 General 64-Bit Optimizations 63

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

The following procedure performs 64-bit unsigned integer multiplication, as previously illustrated using 32-bit integer operations:

; 32bitalu_64x64(int *a, int *b, int *c); ; ; TO ASSEMBLE INTO *.obj DO THE FOLLOWING: ; ml.exe -coff -c 32bitalu_64x64.asm ; .586 .K3D .XMM _DATA SEGMENT tempESP dd 0 _DATA ENDS _TEXT SEGMENT ASSUME DS:_DATA PUBLIC _32bitalu_64x64 _32bitalu_64x64 PROC NEAR ;============================================================================== ; Save the register state. Registers EAX, ECX, and EDX are considered volatile ; and assumed to be changed, while the registers below must be preserved. push ebp mov ebp, esp ;============================================================================== ; Parameters passed into routine: ; [ebp+8] = ->a ; [ebp+12] = ->b ; [ebp+16] = ->c ;============================================================================== push ebx push esi push edi ;============================================================================== mov esi,[ebp+8] ; ESI = ->a mov edi,[ebp+12] ; EDI = ->b mov ecx,[ebp+16] ; ECX = ->c push ebp mov [tempESP], esp ;============================================================================== ; Multiply 64-bit numbers a and b, each of which is composed of two 32-bit ; components: ; a = a1 * 2^32 + a0 ; b = b1 * 2^32 + b0 mov eax,[esi] ; EAX = a0 mov edx,[edi] ; EDX = b0 mul edx ; EDX:EAX = a0*b0 = d1:d0 mov ebx,edx ; EDX = d1 mov [ecx],eax ; c0 = EAX

xor esp,esp ; ESP = 0 xor ebp,ebp ; EBP = 0

64 General 64-Bit Optimizations Chapter 3

25112 Rev. 3.06 September 2005

mov eax,[esi+4] ; EAX = a1 mov edx,[edi] ; EDX = b0 mul edx ; EDX:EAX = a1*b0 = e1:e0 add ebx,eax ; EBX = d1 + e0 adc ebp,edx ; EBP = e1 + possible carry from d1+e0 adc esp,0 ; Collect possible carry into c3.

mov eax,[esi] ; EAX = a0 mov edx,[edx+4] ; EDX = b1 mul edx ; EDX:EAX = a0*b1 = f1:f0 add ebx,eax ; EBX = d1 + e0 + f0 adc ebp,edx ; EBP = e1 + f1 + carry adc esp,0 ; Collect possible carry into c3. mov [ecx+4],ebx ; c1 = d1 + e0 + f0

mov eax,[esi+4] ; EAX = a1 mov edx,[edi+4] ; EDX = b1 mul edx ; EDX:EAX = a1*b1 = g1:g0 add ebp,eax ; EBP = e1 + f1 + g0 + carry adc esp,edx ; ESP = g1 + carry mov [ecx+8],ebp ; c2 = e1 + f1 + g0 + carry mov [ecx+12],esp ; c3 = g1 + carry ;============================================================================== ; Restore the register state. mov esp, [tempESP] pop ebp pop edi pop esi pop ebx mov esp, ebp pop ebp ;============================================================================== ret _32bitalu_64x64 ENDP _TEXT ENDS END

Software Optimization Guide for AMD64 Processors

Chapter 3 General 64-Bit Optimizations 65

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

To improve performance and substantially reduce code size, the following procedure performs the same 64-bit integer multiplication using 64-bit instead of 32-bit operations:

; 64bitalu_64x64(int *a, int *b, int *c); ; ; TO ASSEMBLE INTO *.obj DO THE FOLLOWING: ; ml64.exe -c 64bitalu_64x64.asm ; _TEXT SEGMENT 64bitalu_64x64 PROC NEAR ;============================================================================== ; Parameters passed into routine: ; rcx = ->a ; rdx = ->b ; r8 = ->c ;============================================================================== mov rax, [rcx] ; RAX = [a0] mul [rdx] ; Multiply [a0] by [b0] such that ; RDX:RAX = [c1]:[c0]. mov [r8], rax ; Store 128-bit product of a and b. mov [r8+8], rdx ;============================================================================== ret 64bitalu_64x64 ENDP END

66 General 64-Bit Optimizations Chapter 3

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

3.3 128-Bit Media Instructions and Floating-Point Operations

Optimization

Use 128-bit media (SSE and SSE2) instructions instead of x87 or 64-bit media (MMX™ and 3DNow!™ technology) instructions for floating-point operations.

Rationale

In 64-bit mode, the processor provides eight additional XMM registers (XMM8–XMM15) for a total of 16. These extra registers can substantially reduce register pressure in floating-point code written using 128-bit media instructions.

Although the processor fully supports the x87 and 64-bit media instructions, there are only eight registers available to these instructions (ST(0)–ST(7) or MMX0–MMX7, respectively).

Chapter 3 General 64-Bit Optimizations 67

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

3.4 32-Bit Legacy GPRs and Small Unsigned Integers

Optimization

Use the 32-bit legacy general-purpose registers (EAX through ESI) instead of their 64-bit extensions to store unsigned integer values whose range never requires more than 32 bits, even if subsequent statements use the 32-bit value in a 64-bit operation. (For example, use ECX instead of RCX until you need to perform a 64-bit operation; then use RCX.)

Rationale

In 64-bit mode, the machine-language representation of many instructions that operate on 64-bit register operands requires a REX prefix byte, which increases the size of the code. However, instructions that operate on a 32-bit legacy register operand do not require the prefix and have the desirable side-effect of clearing the upper 32 bits of the extended register to zero. For example, using the AND instruction on ECX clears the upper half of RCX.

Caution

Because the assembler also uses a REX prefix byte to encode the 32-bit sizes of the eight new 64-bit general-purpose registers (R8D–R15D), you should only use one of the original eight generalpurpose registers (EAX through ESI) to implement this technique.

Example

The following example illustrates the unnecessary use of 64-bit registers to calculate the number of bytes remaining to be copied by an aligned block-copy routine after copying the first few bytes having addresses not meeting the routine’s 8-byte-alignment requirements. The first two statements, after the program comments, use the 64-bit R10 register—presumably, because this value is later used to adjust a 64-bit value in R8—even though the range of values stored in R10 take no more than four bits to represent. Using R10 instead of a smaller register requires a REX prefix byte (in this case, 49), which increases the size of the machine-language code.

; Input: ; R10 = source address (src) ; R8 = number of bytes to copy (count) 49 F7 DA neg r10 ; Subtract the source address from 2^64. 49 83 E2 07 and r10, 7 ; Determine how many bytes were copied separately. 4D 2B C2 sub r8, r10 ; Subtract the number of bytes already copied from ; the number of bytes to copy.

68 General 64-Bit Optimizations Chapter 3

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

To improve code density, the following rewritten code uses ECX until it is absolutely necessary to use RCX, eliminating two REX prefix bytes:

F7 D9 neg ecx ; Subtract the source address from 2^32 (the processor ; clears the high 32 bits of RCX). 83 E1 07 and ecx, 7 ; Determine how many bytes were copied separately. 4C 2B C1 sub r8, rcx ; Subtract the number of bytes already copied from ; the number of bytes to copy.

Chapter 3 General 64-Bit Optimizations 69

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

70 General 64-Bit Optimizations Chapter 3

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Chapter 4 Instruction-Decoding

Optimizations

The optimizations in this chapter are designed to help maximize the number of instructions that the processor can decode at one time.

The instruction fetcher of both the AMD Athlon™ 64 and AMD Opteron™ processors reads 16-byte packets from the L1 instruction cache. These packets are 16-byte aligned. The instruction bytes are then merged into a 32-byte pick window. On each cycle, the in-order front-end engine selects up to three AMD64 instructions for decode from the pick window.

This chapter covers the following topics:

Topic Page

DirectPath Instructions 72

Load-Execute Instructions 73

Load-Execute Integer Instructions 73

Load-Execute Floating-Point Instructions with Floating-Point Operands 74

Load-Execute Floating-Point Instructions with Integer Operands 74

Branch Targets in Program Hot Spots 76

32/64-Bit vs. 16-Bit Forms of the LEA Instruction 77

Short Instruction Encodings 80

Partial-Register Reads and Writes 81

Using LEAVE for Function Epilogues 83

Alternatives to SHLD Instruction 85

8-Bit Sign-Extended Immediate Values 87

8-Bit Sign-Extended Displacements 88

Code Padding with Operand-Size Override and NOP 89

Chapter 4 Instruction-Decoding Optimizations 71

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

4.1 DirectPath Instructions

Optimization

❖ Use DirectPath instructions rather than VectorPath instructions. (To determine the type of an

instruction—either DirectPath or VectorPath—see Appendix C, “Instruction Latencies.”)

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

DirectPath instructions minimize the number of operations per AMD64 instruction, thus providing for optimally efficient decode and execution. Up to three DirectPath Single instructions, or one and a half DirectPath Double instructions, can be decoded per cycle. VectorPath instructions block the decoding of DirectPath instructions.

The AMD Athlon 64 and AMD Opteron processors implement the majority of instructions used by a compiler as DirectPath Single and DirectPath Double instructions. However, assembly writers must still take into consideration the use of DirectPath versus VectorPath instructions.

72 Instruction-Decoding Optimizations Chapter 4

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

4.2 Load-Execute Instructions

A load-execute instruction is an instruction that loads a value from memory into a register and then performs an operation on that value. Many general purpose instructions, such as ADD, SUB, AND, etc., have load-execute forms:

add rax, QWORD PTR [foo]

This instruction loads the value foo from memory and then adds it to the value in the RAX register.

The work performed by a load-execute instruction can also be accomplished by using two discrete instructions—a load instruction followed by an execute instruction. The following example employs discrete load and execute stages:

mov rbx, QWORD PTR [foo] add rax, rbx

The first statement loads the value foo from memory into the RBX register. The second statement adds the value in RBX to the value in RAX.

The following optimizations govern the use of load-execute instructions:

• Load-Execute Integer Instructions on page 73.

• Load-Execute Floating-Point Instructions with Floating-Point Operands on page 74.

• Load-Execute Floating-Point Instructions with Integer Operands on page 74.

4.2.1 Load-Execute Integer Instructions

Optimization

❖ When performing integer computations, use load-execute instructions instead of discrete load

and execute instructions. Use discrete load and execute instructions only to avoid scheduler stalls for longer executing instructions and to explicitly schedule load and execute operations.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle. Splitting a load-execute integer instruction into two separate instructions reduces decoding bandwidth and increases register pressure, which results in lower performance.

Chapter 4 Instruction-Decoding Optimizations 73

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

4.2.2 Load-Execute Floating-Point Instructions with Floating-Point Operands

Optimization

❖ When performing floating-point computations using floating-point (not integer) source operands,

use load-execute instructions instead of discrete load and execute instructions.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Using load-execute floating-point instructions that take floating-point operands improves performance for the following reasons:

• Denser code allows more work to be held in the instruction cache.

• Denser code generates fewer internal macro-ops, allowing the floating-point scheduler to hold

more work, which increases the chances of extracting parallelism from the code.

Example

Avoid code like this, which uses discrete load and execute instructions:

movss xmm0, [float_var1] movss xmm12, [float_var2] mulss xmm0, xmm12

Instead, use code like this, which uses a load-execute floating-point instruction:

movss xmm0, [float_var1] mulss xmm0, [float_var2]

4.2.3 Load-Execute Floating-Point Instructions with Integer Operands

Optimization

❖ Avoid x87 load-execute floating-point instructions that take integer operands (FIADD, FICOM,

FICOMP, FIDIV, FIDIVR, FIMUL, FISUB, and FISUBR). When performing floating-point computations using integer source operands, use discrete load (FILD) and execute instructions instead.

74 Instruction-Decoding Optimizations Chapter 4

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

The load-execute floating-point instructions that take integer operands are VectorPath instructions and generate two micro-ops in a cycle, while discrete load and execute intructions enable a third DirectPath instruction to be decoded in the same cycle. In some situations, these optimizations can also reduce execution time if FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.

Example

Avoid code such as the following, which uses load-execute floating-point instructions that take integer operands:

fld QWORD PTR [foo] ; Push foo onto FP stack [ST(0) = foo]. fimul DWORD PTR [bar] ; Multiply bar by ST(0) [ST(0) = bar * foo]. fiadd DWORD PTR [baz] ; Add baz to ST(0) [ST(0) = baz + (bar * foo)].

Instead, use code such as the following, which uses discrete load and execute instructions:

fild DWORD PTR [bar] ; Push bar onto FP stack. fild DWORD PTR [baz] ; Push baz onto FP stack. fld QWORD PTR [foo] ; Push foo onto FP stack. fmulp st(2), st ; Multiply and pop [ST(1) = foo * bar, ST(0) = baz]. faddp st(1), st ; Add and pop [ST(0) = baz + (foo * bar)].

Chapter 4 Instruction-Decoding Optimizations 75

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

4.3 Branch Targets in Program Hot Spots

Optimization

In program “hot spots” (as determined by either profiling or loop-nesting analysis), branch targets should be placed at or near the beginning of code windows that are 16-byte aligned. The smaller the basic block, the more beneficial this optimization will be.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Aligning branch targets maximizes the number of instructions in the pick window and preserves instruction-cache space in branch-intensive code outside such hot spots.

76 Instruction-Decoding Optimizations Chapter 4

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction

Optimization

Use the 32-bit or 64-bit forms of the Load Effective Address (LEA) instruction rather than the 16-bit form.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

The 32-bit and 64-bit LEA instructions are implemented as DirectPath operations with an execution latency of only two cycles. The 16-bit LEA instruction, however, is a VectorPath instruction, which lowers the decode bandwidth and has a longer execution latency.

Chapter 4 Instruction-Decoding Optimizations 77

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes

Optimization

When porting from other architectures, or, perhaps, if you are just new to x86 assembly language, remember that the x86 architecture provides many complex addressing modes. By building the effective address in one instruction, the instruction count can sometimes be reduced, leading to better code density and greater decode bandwidth. Refer to the the section on effective addresses in the AMD64 Architecture Programmer's Manual Volume 1: Application Programming for more detailed information on how effective addresses are formed.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

When building the effective address you sometimes seem to require numerous instructions when there is a base address (such as the base of an array) an index and perhaps a displacement. But x86 architecture can often handle all of this in one instruction. This can lead to reduced code size and fewer instructions to decode. As always, attention should be paid to total instruction length, latencies and whether or not the instruction choices are DirectPath (fastest) or VectorPath (slower).

Example

This first instruction sequence of 5 instructions and a total latency count of 8 can be replaced by one instruction.

Number of Bytes Latency Instruction

31 movl %r10d,%r11d

leaq 0x68E35,rcx

addq %rcx,%r11

movb (%r11,%r13),%cl

cmpb %al,%cl

The following instruction replaces the functionality of the above sequence.

78 Instruction-Decoding Optimizations Chapter 4

25112 Rev. 3.06 September 2005

Number of Bytes Latency Instruction

Software Optimization Guide for AMD64 Processors

84 cmpb %al,

Example

These two instructions can be replaced by one instruction.

movl 0x4c65a,%r11 movl (%r11,%r8,8),%r11

becomes:

movl 0x4c65a(,%r8,8),%r11

0x68e35(%r10,%r13)

Chapter 4 Instruction-Decoding Optimizations 79

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

4.6 Short Instruction Encodings

Optimization

Use instruction forms with shorter encodings rather than those with longer encodings. For example, use 8-bit displacements instead of 32-bit displacements, and use the single-byte form of simple integer instructions instead of the 2-byte opcode-ModRM form.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Using shorter instructions increases the number of instructions that can fit into the L1 instruction cache and increases the average decode rate.

Example

Avoid the use of instructions with longer encodings, such as those shown here:

81 C0 78 56 34 12 add eax, 12345678h ; 2-byte opcode form (with ModRM) 81 C3 FB FF FF FF add ebx, -5 ; 32-bit immediate value 0F 84 05 00 00 00 jz label1 ; 2-byte opcode, 32-bit immediate value

Instead, choose instructions with shorter encodings, like these:

05 78 56 34 12 add eax, 12345678h ; 1-byte opcode form 83 C3 FB add ebx, -5 ; 8-bit sign-extended immediate value 74 05 jz label1 ; 1-byte opcode, 8-bit immediate value

80 Instruction-Decoding Optimizations Chapter 4

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

4.7 Partial-Register Reads and Writes

Optimization

Avoid partial register reads and writes.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

In order to handle partial register writes, the processor’s execution core implements a data merging scheme.

In the execution unit, an instruction that writes part of a register merges the modified portion with the current state of the other part of the register. Therefore, the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register.

In addition, an instruction that has a read dependency on any part of a given architectural register has a read dependency on the most recent instruction that modifies any part of the same architectural register.

Example 1

Avoid code such as the following, which writes to only part of a register:

mov al, 10 ; Instruction 1 mov ah, 12 ; Instruction 2 has a false dependency on instruction 1. ; Instruction 2 merges new AH with current EAX register ; value forwarded by instruction 1.

Example 2

Avoid code such as the following, which both reads and writes only parts of registers:

mov bx, 12h ; Instruction 1 mov bl, dl ; Instruction 2 has a false dependency on the completion ; of instruction 1. mov bh, cl ; Instruction 3 has a false dependency on the completion ; of instruction 2. mov al, bl ; Instruction 4 depends on the completion of instruction 2.

Chapter 4 Instruction-Decoding Optimizations 81

Software Optimization Guide for AMD64 Processors

Example 3

Avoid:

mov al, bl

Preferred:

movzx eax, bl

Example 4

Avoid:

mov al, [ebx]

Preferred:

movzx eax, byte ptr [ebx]

Example 5

Avoid:

mov al, 01h

25112 Rev. 3.06 September 2005

Preferred:

mov eax, 00000001h

Example 6

Avoid:

movss xmm1, xmm2

Preferred:

movaps xmm1, xmm2

82 Instruction-Decoding Optimizations Chapter 4

25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

4.8 Using LEAVE for Function Epilogues

Optimization

The recommended optimization for function epilogues depends on whether the function allocates local variables.

If the function Then

Allocates local variables Replace the traditional function epilogue with the LEAVE instruction.

Does not allocate local variables Do no use function prologues or epilogues. Access function

arguments and local variables through rSP.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

Functions That Allocate Local Variables

The LEAVE instruction is a single-byte instruction and saves 2 bytes of code space over the traditional epilogue. Replacing the traditional sequence with LEAVE also preserves decode bandwidth.

Functions That Do not Allocate Local Variables

Accessing function arguments and local variables directly through ESP frees EBP for use as a general-purpose register.

Background

The function arguments and local variables inside a function are referenced through a so-called frame pointer. In AMD64 code, the base pointer register (rBP) is customarily used as a frame pointer. You set up the frame pointer at the beginning of the function using a function prologue:

push ebp ; Save old frame pointer. mov ebp, esp ; Initialize new frame pointer. sub esp, ; function allocates local variables).

; Allocate space for local variables (only if the

Chapter 4 Instruction-Decoding Optimizations 83

Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

Function arguments on the stack can now be accessed at positive offsets relative to rBP, and local variables are accessible at negative offsets relative to rBP.

Example

The traditional function epilogue looks like this:

mov esp, ebp ; Deallocate local variables (only if space was allocated). pop ebp ; Restore old frame pointer.

Replace the traditional function epilogue with a single LEAVE instruction:

leave

84 Instruction-Decoding Optimizations Chapter 4

AMD 250 User Manual

Specifications and Main Features

Frequently Asked Questions

User Manual

Contents

Tables

Figures

Revision History

Chapter 1 Introduction

1.1 Intended Audience

1.2 Getting Started Quickly

1.3 Using This Guide

1.4 Important New Terms

1.5 Key Optimizations

2.1 Declarations of Floating-Point Values

2.2 Using Arrays and Pointers

2.3 Unrolling Small Loops

2.4 Expression Order in Compound Branch Conditions

2.5 Long Logical Expressions in If Statements

2.6 Arrange Boolean Operands for Quick Expression Evaluation

2.7 Dynamic Memory Allocation Consideration

2.8 Unnecessary Store-to-Load Dependencies

2.9 Matching Store and Load Size

2.10 SWITCH and Noncontiguous Case Expressions

2.11 Arranging Cases by Probability of Occurrence

2.12 Use of Function Prototypes

2.13 Use of const Type Qualifier

2.14 Generic Loop Hoisting

2.15 Local Static Functions

2.16 Explicit Parallelism in Code

2.17 Extracting Common Subexpressions

2.18 Sorting and Padding C and C++ Structures

2.19 Sorting Local Variables

2.20 Replacing Integer Division with Multiplication

2.21 Frequently Dereferenced Pointer Arguments

2.22 Array Indices

2.23 32-Bit Integral Data Types

2.24 Sign of Integer Operands

2.25 Accelerating Floating-Point Division and Square Root

2.26 Fast Floating-Point-to-Integer Conversion

2.27 Speeding Up Branches Based on Comparisons Between Floats

2.28 Improving Performance in Linux Libraries

Chapter 3 General 64-Bit Optimizations

3.1 64-Bit Registers and Integer Arithmetic

3.2 64-Bit Arithmetic and Large-Integer Multiplication

3.3 128-Bit Media Instructions and Floating-Point Operations

3.4 32-Bit Legacy GPRs and Small Unsigned Integers

4.1 DirectPath Instructions

4.2 Load-Execute Instructions

4.3 Branch Targets in Program Hot Spots

4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction

4.5 Take Advantage of x86 and AMD64 Complex Addressing Modes

4.6 Short Instruction Encodings

4.7 Partial-Register Reads and Writes

4.8 Using LEAVE for Function Epilogues