AMD x86 User Manual

AMD Athlon™ Processor
x86 Code Optimization
Guide
© 1999 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.
Trademarks
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc.
Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Contents

Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
About this Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
AMD Athlon Processor Family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
AMD Athlon Processor Microarchitecture Summary . . . . . . . . . . . . . 4
2 Top Optimizations 7
Optimization Star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Group I Optimizations—Essential Optimizations . . . . . . . . . 8
Memory Size and Alignment Issues . . . . . . . . . . . . . . . . . . . . . . 8
Use the 3DNow! PREFETCH and PREFETCHW
Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Select DirectPath Over VectorPath Instructions . . . . . . . . . . . 9
Group II Optimizations—Secondary Optimizations . . . . . . . . . . . . 9
Load-Execute Instruction Usage. . . . . . . . . . . . . . . . . . . . . . . . . 9
Take Advantage of Write Combining. . . . . . . . . . . . . . . . . . . . 10
Use 3DNow! Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Avoid Branches Dependent on Random Data . . . . . . . . . . . . . 10
Avoid Placing Code and Data in the Same
64-Byte Cache Line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 C Source Level Optimizations 13
Ensure Floating-Point Variables and Expressions
are of Type Float. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Use 32-Bit Data Types for Integer Code . . . . . . . . . . . . . . . . . . . . . . . 13
Consider the Sign of Integer Operands . . . . . . . . . . . . . . . . . . . . . . . 14
Use Array Style Instead of Pointer Style Code . . . . . . . . . . . . . . . . . 15
Completely Unroll Small Loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Avoid Unnecessary Store-to-Load Dependencies . . . . . . . . . . . . . . . 18
Consider Expression Order in Compound Branch Conditions . . . . . 20
Switch Statement Usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Optimize Switch Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Use Prototypes for All Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Use Const Type Qualifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Generic Loop Hoisting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Generalization for Multiple Constant Control Code. . . . . . . . 23
Declare Local Functions as Static . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Dynamic Memory Allocation Consideration . . . . . . . . . . . . . . . . . . . 25
Introduce Explicit Parallelism into Code. . . . . . . . . . . . . . . . . . . . . . 25
Explicitly Extract Common Subexpressions . . . . . . . . . . . . . . . . . . . 26
C Language Structure Component Considerations . . . . . . . . . . . . . . 27
Sort Local Variables According to Base Type Size . . . . . . . . . . . . . . 28
Accelerating Floating-Point Divides and Square Roots . . . . . . . . . . 29
Avoid Unnecessary Integer Division. . . . . . . . . . . . . . . . . . . . . . . . . . 31
Copy Frequently De-referenced Pointer Arguments to
Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Instruction Decoding Optimizations 33
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Select DirectPath Over VectorPath Instructions. . . . . . . . . . . . . . . . 34
Load-Execute Instruction Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Use Load-Execute Integer Instructions . . . . . . . . . . . . . . . . . . 34
Use Load-Execute Floating-Point Instructions with
Floating-Point Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Avoid Load-Execute Floating-Point Instructions with
Integer Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Align Branch Targets in Program Hot Spots . . . . . . . . . . . . . . . . . . . 36
Use Short Instruction Lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Avoid Partial Register Reads and Writes. . . . . . . . . . . . . . . . . . . . . . 37
Replace Certain SHLD Instructions with Alternative Code. . . . . . . 38
Use 8-Bit Sign-Extended Immediates . . . . . . . . . . . . . . . . . . . . . . . . . 38
Use 8-Bit Sign-Extended Displacements. . . . . . . . . . . . . . . . . . . . . . . 39
Code Padding Using Neutral Code Fillers . . . . . . . . . . . . . . . . . . . . . 39
Recommendations for the AMD Athlon Processor . . . . . . . . . 40
Recommendations for AMD-K6® Family and
AMD Athlon Processor Blended Code . . . . . . . . . . . . . . . . . . . 41
5 Cache and Memory Optimizations 45
Memory Size and Alignment Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Avoid Memory Size Mismatches. . . . . . . . . . . . . . . . . . . . . . . . 45
Align Data Where Possible . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Use the 3DNow! PREFETCH and PREFETCHW Instructions. . . . . 46
Take Advantage of Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . 50
Avoid Placing Code and Data in the Same 64-Byte Cache Line. . . . 50
Store-to-Load Forwarding Restrictions. . . . . . . . . . . . . . . . . . . . . . . . 51
Store-to-Load Forwarding Pitfalls—True Dependencies . . . 51
Summary of Store-to-Load Forwarding Pitfalls to Avoid. . . . 54
Stack Alignment Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Align TBYTE Variables on Quadword Aligned Addresses. . . . . . . . 55
C Language Structure Component Considerations . . . . . . . . . . . . . . 55
Sort Variables According to Base Type Size . . . . . . . . . . . . . . . . . . . 56
6 Branch Optimizations 57
Avoid Branches Dependent on Random Data . . . . . . . . . . . . . . . . . . 57
AMD Athlon Processor Specific Code . . . . . . . . . . . . . . . . . . . 58
Blended AMD-K6 and AMD Athlon Processor Code . . . . . . . 58
Always Pair CALL and RETURN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Replace Branches with Computation in 3DNow! Code . . . . . . . . . . . 60
Muxing Constructs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Sample Code Translated into 3DNow! Code . . . . . . . . . . . . . . 61
Avoid the Loop Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Avoid Far Control Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . 65
Avoid Recursive Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7 Scheduling Optimizations 67
Schedule Instructions According to their Latency . . . . . . . . . . . . . . 67
Unrolling Loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Complete Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Partial Loop Unrolling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Use Function Inlining. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Always Inline Functions if Called from One Site . . . . . . . . . . 72
Always Inline Functions with Fewer than 25 Machine
Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Avoid Address Generation Interlocks. . . . . . . . . . . . . . . . . . . . . . . . . 72
Use MOVZX and MOVSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Minimize Pointer Arithmetic in Loops . . . . . . . . . . . . . . . . . . . . . . . . 73
Push Memory Data Carefully. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8 Integer Optimizations 77
Replace Divides with Multiplies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Multiplication by Reciprocal (Division) Utility . . . . . . . . . . . 77
Unsigned Division by Multiplication of Constant. . . . . . . . . . 78
Signed Division by Multiplication of Constant . . . . . . . . . . . . 79
Use Alternative Code When Multiplying by a Constant. . . . . . . . . . 81
Use MMX™ Instructions for Integer-Only Work . . . . . . . . . . . . . . . . 83
Repeated String Instruction Usage. . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Latency of Repeated String Instructions. . . . . . . . . . . . . . . . . 84
Guidelines for Repeated String Instructions . . . . . . . . . . . . . 84
Use XOR Instruction to Clear Integer Registers. . . . . . . . . . . . . . . . 86
Efficient 64-Bit Integer Arithmetic. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Efficient Implementation of Population Count Function. . . . . . . . . 91
Derivation of Multiplier Used for Integer Division
by Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Unsigned Derivation for Algorithm, Multiplier, and
Shift Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Signed Derivation for Algorithm, Multiplier, and
Shift Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9 Floating-Point Optimizations 97
Ensure All FPU Data is Aligned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Use Multiplies Rather than Divides . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Use FFREEP Macro to Pop One Register from the FPU Stack . . . . 98
Floating-Point Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . 98
Use the FXCH Instruction Rather than FST/FLD Pairs . . . . . . . . . . 99
Avoid Using Extended-Precision Data . . . . . . . . . . . . . . . . . . . . . . . . 99
Minimize Floating-Point-to-Integer Conversions . . . . . . . . . . . . . . . 100
Floating-Point Subexpression Elimination. . . . . . . . . . . . . . . . . . . . 103
Check Argument Range of Trigonometric Instructions
Efficiently . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Take Advantage of the FSINCOS Instruction . . . . . . . . . . . . . . . . . 105
10 3DNow!™ and MMX™ Optimizations 107
Use 3DNow! Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Use FEMMS Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Use 3DNow! Instructions for Fast Division . . . . . . . . . . . . . . . . . . . 108
Optimized 14-Bit Precision Divide . . . . . . . . . . . . . . . . . . . . . 108
Optimized Full 24-Bit Precision Divide . . . . . . . . . . . . . . . . . 108
Pipelined Pair of 24-Bit Precision Divides. . . . . . . . . . . . . . . 109
Newton-Raphson Reciprocal. . . . . . . . . . . . . . . . . . . . . . . . . . 109
Use 3DNow! Instructions for Fast Square Root and
Reciprocal Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Optimized 15-Bit Precision Square Root . . . . . . . . . . . . . . . . 110
Optimized 24-Bit Precision Square Root . . . . . . . . . . . . . . . . 110
Newton-Raphson Reciprocal Square Root. . . . . . . . . . . . . . . 111
Use MMX PMADDWD Instruction to Perform
Two 32-Bit Multiplies in Parallel. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3DNow! and MMX Intra-Operand Swapping . . . . . . . . . . . . . . . . . . 112
Fast Conversion of Signed Words to Floating-Point . . . . . . . . . . . . 113
Use MMX PXOR to Negate 3DNow! Data . . . . . . . . . . . . . . . . . . . . 113
Use MMX PCMP Instead of 3DNow! PFCMP. . . . . . . . . . . . . . . . . . 114
Use MMX Instructions for Block Copies and Block Fills . . . . . . . . 115
Use MMX PXOR to Clear All Bits in an MMX Register . . . . . . . . . 118
Use MMX PCMPEQD to Set All Bits in an MMX Register. . . . . . . 119
Use MMX PAND to Find Absolute Value in 3DNow! Code . . . . . . 119
Optimized Matrix Multiplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Efficient 3D-Clipping Code Computation Using
3DNow! Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation . . . . . 123
Stream of Packed Unsigned Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Complex Number Arithmetic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
11 General x86 Optimization Guidelines 127
Short Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Register Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Stack Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Appendix A AMD Athlon™ Processor Microarchitecture 129
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
AMD Athlon Processor Microarchitecture. . . . . . . . . . . . . . . . . . . . 130
Superscalar Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Predecode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Branch Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Early Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Instruction Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Integer Scheduler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Integer Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Floating-Point Scheduler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Floating-Point Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . 137
Load-Store Unit (LSU). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
L2 Cache Controller. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
AMD Athlon System Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Appendix B Pipeline and Execution Unit Resources Overview 141
Fetch and Decode Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Integer Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Floating-Point Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Execution Unit Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Integer Pipeline Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Floating-Point Pipeline Operations . . . . . . . . . . . . . . . . . . . . 150
Load/Store Pipeline Operations . . . . . . . . . . . . . . . . . . . . . . . 151
Code Sample Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Appendix C Implementation of Write Combining 155
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Write-Combining Definitions and Abbreviations . . . . . . . . . . . . . . 156
What is Write Combining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Programming Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Write-Combining Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Sending Write-Buffer Data to the System . . . . . . . . . . . . . . . 159
Appendix D Performance-Monitoring Counters 161
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Performance Counter Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
PerfEvtSel[3:0] MSRs
(MSR Addresses C001_0000h–C001_0003h) . . . . . . . . . . . . . 162
PerfCtr[3:0] MSRs
(MSR Addresses C001_0004h–C001_0007h) . . . . . . . . . . . . . 167
Starting and Stopping the Performance-Monitoring
Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Event and Time-Stamp Monitoring Software. . . . . . . . . . . . . . . . . . 168
Monitoring Counter Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Appendix E Programming the MTRR and PAT 171
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Memory Type Range Register (MTRR) Mechanism . . . . . . . . . . . . 171
Page Attribute Table (PAT). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Appendix F Instruction Dispatch and Execution Resources 187
Appendix G DirectPath versus VectorPath Instructions 219
Select DirectPath Over VectorPath Instructions. . . . . . . . . . . . . . . 219
DirectPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
VectorPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
List of Figures
Figure 1. AMD Athlon Processor Block Diagram . . . . . . . . . . . 131
Figure 2. Integer Execution Pipeline . . . . . . . . . . . . . . . . . . . . . . . 135
Figure 3. Floating-Point Unit Block Diagram . . . . . . . . . . . . . . . . 137
Figure 4. Load/Store Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware. . . . . . . . 142
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages. . . . . . . . . . . 142
Figure 7. Integer Execution Pipeline . . . . . . . . . . . . . . . . . . . . . . . 144
Figure 8. Integer Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Figure 9. Floating-Point Unit Block Diagram . . . . . . . . . . . . . . . . 146
Figure 10. Floating-Point Pipeline Stages . . . . . . . . . . . . . . . . . . . . 146
Figure 11. PerfEvtSel[3:0] Registers . . . . . . . . . . . . . . . . . . . . . . . . 162
Figure 12. MTRR Mapping of Physical Memory . . . . . . . . . . . . . . . 173
Figure 13. MTRR Capability Register Format . . . . . . . . . . . . . . . . 174
Figure 14. MTRR Default Type Register Format . . . . . . . . . . . . . . 175
Figure 15. Page Attribute Table (MSR 277h) . . . . . . . . . . . . . . . . . 177
Figure 16. MTRRphysBasen Register Format. . . . . . . . . . . . . . . . . 183
Figure 17. MTRRphysMaskn Register Format . . . . . . . . . . . . . . . . 184
List of Tables
Table 1. Latency of Repeated String Instructions. . . . . . . . . . . . . 84
Table 2. Integer Pipeline Operation Types . . . . . . . . . . . . . . . . . 149
Table 3. Integer Decode Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Table 4. Floating-Point Pipeline Operation Types . . . . . . . . . . . 150
Table 5. Floating-Point Decode Types . . . . . . . . . . . . . . . . . . . . . 150
Table 6. Load/Store Unit Stages . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Table 7. Sample 1 – Integer Register Operations . . . . . . . . . . 153
Table 8. Sample 2 – Integer Register and Memory Load
Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Table 9. Write Combining Completion Events . . . . . . . . . . . . . . 158
Table 10. AMD Athlon System Bus Commands
Generation Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Table 11. Performance-Monitoring Counters. . . . . . . . . . . . . . . . . 164
Table 12. Memory Type Encodings . . . . . . . . . . . . . . . . . . . . . . . . . 174
Table 13. Standard MTRR Types and Properties. . . . . . . . . . . . . 176
Table 14. PATi 3-Bit Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Table 15. Effective Memory Type Based on PAT and
MTRRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Table 16. Final Output Memory Types. . . . . . . . . . . . . . . . . . . . . . 180
Table 17. MTRR Fixed Range Register Format . . . . . . . . . . . . . . 182
Table 18. MTRR-Related Model-Specific Register
(MSR) Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Table 19. Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Table 20. MMX Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Table 21. MMX Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Table 22. Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . 212
Table 23. 3DNow! Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Table 24. 3DNow! Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Table 25. DirectPath Integer Instructions . . . . . . . . . . . . . . . . . . . 220
Table 26. DirectPath MMX Instructions. . . . . . . . . . . . . . . . . . . . . 227
Table 27. DirectPath MMX Extensions. . . . . . . . . . . . . . . . . . . . . . 228
Table 28. DirectPath Floating-Point Instructions . . . . . . . . . . . . . 229
Table 29. VectorPath Integer Instructions. . . . . . . . . . . . . . . . . . . 231
Table 30. VectorPath MMX Instructions . . . . . . . . . . . . . . . . . . . . 234
Table 31. VectorPath MMX Extensions . . . . . . . . . . . . . . . . . . . . . 234
Table 32. VectorPath Floating-Point Instructions. . . . . . . . . . . . . 235

Revision History

Date: Nov. 1999   Rev: E

Description:
Added “About this Document” on page 1.
Further clarification of “Consider the Sign of Integer Operands” on page 14.
Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Added the optimization, “Accelerating Floating-Point Divides and Square Roots” on page 29.
Clarified examples in “Copy Frequently De-referenced Pointer Arguments to Local Variables” on page 31.
Further clarification of “Select DirectPath Over VectorPath Instructions” on page 34.
Further clarification of “Align Branch Targets in Program Hot Spots” on page 36.
Further clarification of REP instruction as a filler in “Code Padding Using Neutral Code Fillers” on page 39.
Further clarification of “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46.
Modified examples 1 and 2 of “Unsigned Division by Multiplication of Constant” on page 78.
Added the optimization, “Efficient Implementation of Population Count Function” on page 91.
Further clarification of “Use FFREEP Macro to Pop One Register from the FPU Stack” on page 98.
Further clarification of “Minimize Floating-Point-to-Integer Conversions” on page 100.
Added the optimization, “Check Argument Range of Trigonometric Instructions Efficiently” on page 103.
Added the optimization, “Take Advantage of the FSINCOS Instruction” on page 105.
Further clarification of “Use 3DNow!™ Instructions for Fast Division” on page 108.
Further clarification of “Use FEMMS Instruction” on page 107.
Further clarification of “Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root” on page 110.
Clarified “3DNow!™ and MMX™ Intra-Operand Swapping” on page 112.
Corrected PCMPGT information in “Use MMX™ PCMP Instead of 3DNow!™ PFCMP” on page 114.
Added the optimization, “Use MMX™ Instructions for Block Copies and Block Fills” on page 115.
Modified the rule for “Use MMX™ PXOR to Clear All Bits in an MMX™ Register” on page 118.
Modified the rule for “Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register” on page 119.
Added the optimization, “Optimized Matrix Multiplication” on page 119.
Added the optimization, “Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions” on page 122.
Added the optimization, “Complex Number Arithmetic” on page 126.
Added Appendix E, “Programming the MTRR and PAT”.
Rearranged the appendices.
Added Index.
1 Introduction

The AMD Athlon processor is the newest microprocessor in the AMD K86 family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors. However, to enable the fastest code execution with the AMD Athlon processor, programmers should write software that includes specific code optimization techniques.

About this Document

This document contains information to assist programmers in creating optimized code for the AMD Athlon processor. It is targeted at compiler and assembler designers, as well as C and assembly language programmers writing execution-sensitive code sequences.
This document assumes that the reader possesses in-depth knowledge of the x86 instruction set, the x86 architecture (registers, programming modes, etc.), and the IBM PC-AT platform.
This guide has been written specifically for the AMD Athlon processor, but it includes considerations for
previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:
Chapter 1: Introduction. Outlines the material covered in this document. Summarizes the AMD Athlon microarchitecture.
Chapter 2: Top Optimizations. Provides convenient descriptions of the most important optimizations a programmer should take into consideration.
Chapter 3: C Source Level Optimizations. Describes optimizations that C/C++ programmers can implement.
Chapter 4: Instruction Decoding Optimizations. Describes methods that will make the most efficient use of the three sophisticated instruction decoders in the AMD Athlon processor.
Chapter 5: Cache and Memory Optimizations. Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.
Chapter 6: Branch Optimizations. Describes optimizations that improve branch prediction and minimize branch penalties.
Chapter 7: Scheduling Optimizations. Describes optimizations that improve code scheduling for efficient execution resource utilization.
Chapter 8: Integer Optimizations. Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.
Chapter 9: Floating-Point Optimizations. Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines for Enhanced 3DNow! and MMX code optimization techniques.
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in detail the microarchitecture of the AMD Athlon processor.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
Appendix C: Implementation of Write Combining. Describes the algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.
Appendix E: Programming the MTRR and PAT. Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table.
Appendix F: Instruction Dispatch and Execution Resources. Lists each instruction’s execution resource usage.
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86 instructions that are DirectPath and VectorPath instructions.
AMD Athlon Processor Family
The AMD Athlon processor family uses state-of-the-art decoupled decode/execution design techniques to deliver next-generation performance with x86 binary software compatibility. This next-generation processor family advances x86 code execution by using flexible instruction predecoding, wide and balanced decoders, aggressive out-of-order execution, parallel integer execution pipelines, parallel floating-point execution pipelines, deep pipelined execution for higher delivered operating frequency, dedicated backside cache memory, and a new high-performance double-rate 64-bit local bus. As an x86 binary-compatible processor, the AMD Athlon processor implements the industry-standard x86 instruction set by decoding and executing the x86 instructions using a proprietary microarchitecture. This microarchitecture allows the delivery of maximum performance when running x86-based PC software.
AMD Athlon Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software. A brief summary of the next-generation design features implemented in the AMD Athlon processor is as follows:
High-speed double-rate local bus interface
Large, split 128-Kbyte level-one (L1) cache
Dedicated backside level-two (L2) cache
Instruction predecode and branch detection during cache line fills
Decoupled decode/execution core
Three-way x86 instruction decoding
Dynamic scheduling and speculative execution
Three-way integer execution
Three-way address generation
Three-way floating-point execution
3DNow! technology and MMX single-instruction multiple-data (SIMD) instruction extensions
Super data forwarding
Deep out-of-order integer and floating-point execution
Register renaming
Dynamic branch prediction
The AMD Athlon processor communicates through a next-generation high-speed local bus that is beyond the current Socket 7 or Super7 bus standard. The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock (see “AMD Athlon System Bus” on page 139 for more information).
To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the
AMD Athlon execution core to achieve and sustain maximum performance.
As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.
The AMD Athlon processor includes both the industry-standard MMX SIMD integer instructions and the 3DNow! SIMD floating-point instructions that were first introduced in the AMD-K6®-2 processor. The design of 3DNow! technology was based on suggestions from leading graphics and independent software vendors (ISVs). Using SIMD format, the AMD Athlon processor can generate up to four 32-bit, single-precision floating-point results per clock cycle.
The 3DNow! execution units allow for high-performance floating-point vector operations, which can replace x87 instructions and enhance the performance of 3D graphics and other floating-point-intensive applications. Because the 3DNow! architecture uses the same registers as the MMX instructions, switching between MMX and 3DNow! has no penalty.
The AMD Athlon processor designers took another innovative step by carefully integrating the traditional x87 floating-point, MMX, and 3DNow! execution units into one operational engine. With the introduction of the AMD Athlon processor, the switching overhead between x87, MMX, and 3DNow! technology is virtually eliminated. The AMD Athlon processor combined with 3DNow! technology brings a better multimedia experience to mainstream PC users while maintaining backwards compatibility with all existing x86 software.
Although the AMD Athlon processor can extract code parallelism on-the-fly from off-the-shelf, commercially available x86 software, specific code optimization for the AMD Athlon processor can result in even higher delivered performance. This document describes the proprietary microarchitecture in the AMD Athlon processor and makes recommendations for optimizing execution of x86 software on the processor.
The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance. Due to the more flexible pipeline control and aggressive out-of-order execution, the AMD Athlon processor is not as sensitive to instruction selection and code scheduling. This flexibility is one of the distinct advantages of the AMD Athlon processor.
The AMD Athlon processor uses the latest in processor microarchitecture design techniques to provide the highest x86 performance for today’s PC. In short, the AMD Athlon processor offers true next-generation performance with x86 binary software compatibility.
2 Top Optimizations

This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. The optimizations in this chapter are divided into two groups and listed in order of importance.
Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:
Memory Size and Alignment Issues—Avoid memory size mismatches; align data where possible
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
Select DirectPath Over VectorPath Instructions
Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:
Load-Execute Instruction Usage—Use load-execute instructions; avoid load-execute floating-point instructions with integer operands
Take Advantage of Write Combining
Use 3DNow! Instructions
Avoid Branches Dependent on Random Data
Avoid Placing Code and Data in the Same 64-Byte Cache Line

Optimization Star

The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations—Essential Optimizations


Memory Size and Alignment Issues

See “Memory Size and Alignment Issues” on page 45 for more details.

Avoid Memory Size Mismatches

Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same data, keep operands aligned and keep the loads/stores of each operand the same size.

Align Data Where Possible

Avoid misaligned data references. A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline.

Use the 3DNow! PREFETCH and PREFETCHW Instructions

For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All the prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:

Prefetch Length = 200 × (DS/C)

Round up to the nearest cache line.

DS is the data stride per loop iteration.
C is the number of cycles per loop iteration when hitting in the L1 cache.

See “Use the 3DNow! PREFETCH and PREFETCHW Instructions” on page 46 for more details.
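As a hypothetical worked example, consider a loop that advances through its data by DS = 24 bytes per iteration and takes C = 10 cycles per iteration when hitting in the L1 cache. The formula gives Prefetch Length = 200 × (24/10) = 480 bytes; rounding up to the 64-byte cache line size yields a prefetch distance of 8 cache lines (512 bytes) ahead of the data currently being processed.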

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations per x86 instruction. Three DirectPath instructions can be decoded in parallel. Using VectorPath instructions blocks DirectPath instructions from decoding simultaneously.

See Appendix G, “DirectPath versus VectorPath Instructions” on page 219 for a list of DirectPath and VectorPath instructions.
Group II Optimizations—Secondary Optimizations
Load-Execute Instruction Usage
See “Load-Execute Instruction Usage” on page 34 for more details.

Use Load-Execute Instructions

Wherever possible, use load-execute instructions to increase code density, with the one exception described below. The split-instruction form of load-execute instructions can be used to avoid scheduler stalls for longer-executing instructions and to explicitly schedule the load and execute operations.
Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle.

Take Advantage of Write Combining

This guideline applies only to operating system, device driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer.

See Appendix C, “Implementation of Write Combining” on page 155 for more details.

Use 3DNow! Instructions

Unless accuracy requirements dictate otherwise, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.

See Table 23 on page 217 for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow!™ Technology Manual, order# 21928.

Avoid Branches Dependent on Random Data

Avoid data-dependent branches around a single instruction. Data-dependent branches acting upon basically random data can cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences, which result in shorter average execution time.

See “Avoid Branches Dependent on Random Data” on page 57 for more details.
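As an illustrative sketch (the function name is hypothetical, not from this guide), the following C fragment computes the minimum of two unsigned integers without a data-dependent branch by turning the comparison result into an all-zeros or all-ones mask:

#include <stdio.h>

/* Branch-free min(a, b): (a < b) evaluates to 0 or 1; subtracting it
   from 0 yields an all-zeros or all-ones mask that selects a or b
   without a conditional branch. */
unsigned int min_branchless(unsigned int a, unsigned int b)
{
   unsigned int mask = 0u - (a < b);   /* 0xFFFFFFFF if a < b, else 0 */
   return (a & mask) | (b & ~mask);
}

int main(void)
{
   printf("%u\n", min_branchless(17u, 42u));   /* prints 17 */
   return 0;
}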
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of that in previous processors. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified. In order to maintain cache coherency, the AMD Athlon processor may thrash its caches, resulting in lower performance.

In general, the following should be avoided:

Self-modifying code
Storing data in code segments

See “Avoid Placing Code and Data in the Same 64-Byte Cache Line” on page 50 for more details.
3 C Source Level Optimizations

This chapter details C programming practices for optimizing code for the AMD Athlon processor. Guidelines are listed in order of importance.
Ensure Floating-Point Variables and Expressions are of Type Float
For compilers that generate 3DNow! instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants. These require a suffix of “F” or “f” (for example, 3.14f) in order to be of type float, otherwise they default to type double. To avoid automatic promotion of float arguments to double, always use function prototypes for all functions that accept float arguments.
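A minimal sketch of this guideline (the function name is illustrative, not from this guide):

/* The "f" suffix keeps the constant, and therefore the whole
   expression, of type float; writing 3.14159 (no suffix) would make
   it a double constant and promote the computation to double. */
float circle_area(float r)
{
   return 3.14159f * r * r;
}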
Use 32-Bit Data Types for Integer Code
Use 32-bit data types for integer code. Compiler implementations vary, but typically the following data types are included—int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, and unsigned long int.
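A hypothetical illustration of this guideline:

/* A 32-bit loop counter and accumulator map directly onto 32-bit
   registers; char or short variables here could force extra
   size-conversion operations on 32-bit x86 processors. */
unsigned int sum_bytes(const unsigned char *p, unsigned int n)
{
   unsigned int i, sum = 0;
   for (i = 0; i < n; i++) {
      sum += p[i];
   }
   return sum;
}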

Consider the Sign of Integer Operands

In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required so an unsigned type is appropriate. However, recording temperatures in degrees Celsius may require both positive and negative numbers so a signed type is needed.
Where there is a choice of using either a signed or an unsigned type, it should be considered that certain operations are faster with unsigned types while others are faster for signed types.
Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 FPU provides instructions for converting signed integers to floating-point but has no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted as follows:

Example 1 (Avoid):

double x;            ====>   MOV   [temp+4], 0
unsigned int i;              MOV   EAX, i
                             MOV   [temp], EAX
x = i;                       FILD  QWORD PTR [temp]
                             FSTP  QWORD PTR [x]
This code is slow not only because of the number of instructions but also because a size mismatch prevents store-to-load forwarding to the FILD instruction.

Example 1 (Preferred):

double x;            ====>   FILD  DWORD PTR [i]
int i;                       FSTP  QWORD PTR [x]
x = i;
Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. In a typical case, a 32-bit integer is divided by four as follows:

Example 2 (Avoid):

int i;               ====>   MOV   EAX, i
                             CDQ
i = i / 4;                   AND   EDX, 3
                             ADD   EAX, EDX
                             SAR   EAX, 2
                             MOV   i, EAX

Example 2 (Preferred):

unsigned int i;      ====>   SHR   i, 2
i = i / 4;
In summary:

Use unsigned types for:

Division and remainders
Loop counters
Array indexing

Use signed types for:

Integer-to-float conversion

Use Array Style Instead of Pointer Style Code

The use of pointers in C makes work difficult for the optimizers in C compilers. Without detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can write to any place in memory. This includes storage allocated to other variables, creating the issue of aliasing, i.e., the same block of memory is accessible in more than one way.
In order to help the optimizer of the C compiler in its analysis, avoid the use of pointers where possible. One example where this is trivially possible is in the access of data organized as arrays. C allows the use of either the array operator [] or pointers to access the array. Using array-style code makes the task of the optimizer easier by reducing possible aliasing.
For example, x[0] and x[2] cannot possibly refer to the same memory location, while *p and *q could. It is highly recommended to use the array style, as significant performance advantages can be achieved with most compilers.
Note that source code transformations interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving performance and compiler optimizations "fight" each other. Depending on the compiler and the specific source code, it is therefore possible that pointer-style code will be compiled into machine code that is faster than that generated from equivalent array-style code. It is advisable to check the performance after any source code transformation to see whether performance indeed increased.

Example 1 (Avoid):

typedef struct { float x,y,z,w; } VERTEX;
typedef struct { float m[4][4]; } MATRIX;
void XForm (float *res, const float *v, const float *m, int numverts)
{
   float dp;
   int i;
   const VERTEX* vv = (VERTEX *)v;

   for (i = 0; i < numverts; i++) {
      dp  = vv->x * *m++;
      dp += vv->y * *m++;
      dp += vv->z * *m++;
      dp += vv->w * *m++;
      *res++ = dp;   /* write transformed x */

      dp  = vv->x * *m++;
      dp += vv->y * *m++;
      dp += vv->z * *m++;
      dp += vv->w * *m++;
      *res++ = dp;   /* write transformed y */

      dp  = vv->x * *m++;
      dp += vv->y * *m++;
      dp += vv->z * *m++;
      dp += vv->w * *m++;
      *res++ = dp;   /* write transformed z */

      dp  = vv->x * *m++;
      dp += vv->y * *m++;
      dp += vv->z * *m++;
      dp += vv->w * *m++;
      *res++ = dp;   /* write transformed w */

      ++vv;          /* next input vertex */
      m -= 16;       /* reset to start of transform matrix */
   }
}

Example 2 (Preferred):

typedef struct { float x,y,z,w; } VERTEX;
typedef struct { float m[4][4]; } MATRIX;
void XForm (float *res, const float *v, const float *m, int numverts)
{
   int i;
   const VERTEX* vv = (VERTEX *)v;
   const MATRIX* mm = (MATRIX *)m;
   VERTEX* rr = (VERTEX *)res;

   for (i = 0; i < numverts; i++) {
      rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
              vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
      rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
              vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
      rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
              vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
      rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
              vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
      ++vv;   /* advance to the next input vertex */
      ++rr;   /* advance to the next output vertex */
   }
}

Completely Unroll Small Loops

Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops. For loops that have a small fixed loop count and a small loop body, completely unrolling the loops at the source level is recommended.

Example 1 (Avoid):

// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
   r[i] = 0;
   for (j=0; j<4; j++) {
      r[i] += M[j][i]*V[j];
   }
}

Example 2 (Preferred):

// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];

Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. See “Store-to-Load Forwarding Restrictions” on page 51 for more details. The AMD Athlon processor contains hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register.
Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the
code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the programmer remove the dependency manually, e.g., by introducing a temporary variable that can be kept in a register. This can result in a significant performance increase. The following is an example of this.

Example 1 (Avoid):

double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;

for (k = 1; k < VECLEN; k++) {
   x[k] = x[k-1] + y[k];
}

for (k = 1; k < VECLEN; k++) {
   x[k] = z[k] * (y[k] - x[k-1]);
}

Example 2 (Preferred):

double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
double t;

t = x[0];
for (k = 1; k < VECLEN; k++) {
   t = t + y[k];
   x[k] = t;
}

t = x[0];
for (k = 1; k < VECLEN; k++) {
   t = z[k] * (y[k] - t);
   x[k] = t;
}
Consider Expression Order in Compound Branch Conditions
Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && and ||. C guarantees a short-circuit evaluation of these operators. This means that in the case of ||, the first operand to evaluate to TRUE terminates the evaluation, i.e., following operands are not evaluated at all. Similarly for &&, the first operand to evaluate to FALSE terminates the evaluation. Because of this short-circuit evaluation, it is not always possible to swap the operands of || and &&. This is especially the case when the evaluation of one of the operands causes a side effect. However, in most cases the exchange of operands is possible.
When used to control conditional branches, expressions involving || and && are translated into a series of conditional branches. The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition, and can have a significant impact on performance. It is unfortunately not possible to give an easy, closed-form formula on how to order the conditions. Overall performance is a function of a variety of the following factors:

probability of a branch mispredict for each of the branches generated
additional latency incurred due to a branch mispredict
cost of evaluating the conditions controlling each of the branches generated
amount of parallelism that can be extracted in evaluating the branch conditions
data stream consumed by an application (mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data-dependent branches)
It is therefore recommended to experiment with the ordering of expressions in compound branch conditions in the most active areas of a program (so called hot spots) where most of the execution time is spent. Such hot spots can be found through the use of profiling. A "typical" data stream should be fed to the program while doing the experiments.
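As a hypothetical sketch of such an experiment (all names are illustrative), consider placing the operand that is cheap to evaluate and most likely to decide the outcome first; the reordering is legal here because neither operand has side effects:

/* 'is_active' is a simple flag that is usually zero; expensive_test()
   models a costly, hard-to-predict condition. With the flag first,
   short-circuit evaluation of && skips the expensive test on most
   iterations. */
static int expensive_test(int x)
{
   int i, bits = 0;
   for (i = 0; i < 32; i++) {
      bits += (x >> i) & 1;   /* population count as a stand-in */
   }
   return bits > 16;
}

int count_hits(const int *a, int n, int is_active)
{
   int k, hits = 0;
   for (k = 0; k < n; k++) {
      if (is_active && expensive_test(a[k])) {   /* cheap operand first */
         hits++;
      }
   }
   return hits;
}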


Switch Statement Usage

Optimize Switch Statements

Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrence, with the most probable first. This improves performance when the switch is translated as a comparison chain. It is further recommended to make the case labels small, contiguous integers, as this allows the switch to be translated as a jump table.
Example 1 (Avoid):

int days_in_month, short_months, normal_months, long_months;

switch (days_in_month) {
   case 28:
   case 29: short_months++; break;
   case 30: normal_months++; break;
   case 31: long_months++; break;
   default: printf ("month has fewer than 28 or more than 31 days\n");
}
Example 2 (Preferred):

int days_in_month, short_months, normal_months, long_months;

switch (days_in_month) {
   case 31: long_months++; break;
   case 30: normal_months++; break;
   case 28:
   case 29: short_months++; break;
   default: printf ("month has fewer than 28 or more than 31 days\n");
}

Use Prototypes for All Functions

In general, use prototypes for all functions. Prototypes can convey additional information to the compiler that might enable more aggressive optimizations.
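A small sketch of what a prototype conveys (the names are illustrative):

float f(float x);   /* prototype: 'x' is passed as a float */

/* Without the prototype in scope, the default argument promotion
   rules would force a float-to-double conversion at the call site. */
float g(float x)
{
   return f(x) + 1.0f;
}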

Use Const Type Qualifier

Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available to the compiler. For example, the C standard allows compilers to not allocate storage for objects that are declared “const”, if their address is never taken.
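A brief illustrative sketch (hypothetical names):

/* 'weights' is const and its address is never taken, so a compiler
   may fold the values directly into the generated code instead of
   allocating storage for the array. */
static const float weights[4] = { 0.5f, 0.25f, 0.125f, 0.125f };

float blend(const float *s)
{
   return s[0]*weights[0] + s[1]*weights[1] +
          s[2]*weights[2] + s[3]*weights[3];
}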

Generic Loop Hoisting

To improve the performance of inner loops, it is beneficial to reduce redundant constant calculations (i.e., loop invariant calculations). However, this idea can be extended to invariant control structures.
The first case is that of a constant “if()” statement in a “for()” loop.

Example 1:

for( i ... ) {
if( CONSTANT0 ) {
DoWork0( i ); // does not affect CONSTANT0
} else {
DoWork1( i ); // does not affect CONSTANT0
}
}
The above loop should be transformed into:
if( CONSTANT0 ) {
for( i ... ) {
DoWork0( i );
}
} else {
for( i ... ) {
DoWork1( i );
}
}
This transformation makes the inner loops tighter by avoiding the repeated evaluation of a known “if()” control structure. Although the branch would be easily predicted, the extra instructions and decode bandwidth consumed by branching are saved, which is usually well worth it.

Generalization for Multiple Constant Control Code

To generalize this further for multiple constant control code, some more work may have to be done to create the proper outer loop. Enumerating the constant cases reduces this to a simple switch statement.
Example 2:
for( i ... ) {
   if( CONSTANT0 ) {
      DoWork0( i );    // does not affect CONSTANT0 or CONSTANT1
   } else {
      DoWork1( i );    // does not affect CONSTANT0 or CONSTANT1
   }
   if( CONSTANT1 ) {
      DoWork2( i );    // does not affect CONSTANT0 or CONSTANT1
   } else {
      DoWork3( i );    // does not affect CONSTANT0 or CONSTANT1
   }
}
The above loop should be transformed into:
#define combine( c0, c1 ) (((c0) << 1) + (c1))

switch( combine( CONSTANT0!=0, CONSTANT1!=0 ) ) {
   case combine( 0, 0 ):
      for( i ... ) {
         DoWork1( i );
         DoWork3( i );
      }
      break;
   case combine( 1, 0 ):
      for( i ... ) {
         DoWork0( i );
         DoWork3( i );
      }
      break;
   case combine( 0, 1 ):
      for( i ... ) {
         DoWork1( i );
         DoWork2( i );
      }
      break;
   case combine( 1, 1 ):
   default:
      for( i ... ) {
         DoWork0( i );
         DoWork2( i );
      }
      break;
}

Each case body pairs the calls that the original “if()” statements select for that combination of CONSTANT0 and CONSTANT1. The trick here is that there is some up-front work involved in generating all the combinations for the switch constant, and the total amount of code has doubled. However, it is also clear that the inner loops are "if()-free". In ideal cases where the DoWork*() functions are inlined, the successive functions will have greater overlap, leading to greater parallelism than would be possible in the presence of intervening “if()” statements.
The same idea can be applied to constant “switch()” statements, or combinations of “switch()” statements and “if()” statements inside of “for()” loops. The method for combining the input constants gets more complicated but will be worth it for the performance benefit.
However, the number of inner loops can also substantially increase. If the number of inner loops is prohibitively high, then only the most common cases need to be dealt with directly, and the remaining cases can fall back to the old code in a "default:" clause for the “switch()” statement.
This typically comes up when the programmer is considering runtime-generated code. While runtime-generated code can lead to similar levels of performance improvement, it is much harder to maintain, and the developer must perform the optimizations on the generated code without the help of a compiler.

Declare Local Functions as Static

Functions that are not used outside the file in which they are defined should always be declared static, which forces internal linkage. Otherwise, such functions default to external linkage, which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
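A brief sketch (hypothetical names): clamp() is used only within its file, so internal linkage lets the compiler inline it and discard the out-of-line copy:

static int clamp(int x, int lo, int hi)  /* internal linkage */
{
   return (x < lo) ? lo : ((x > hi) ? hi : x);
}

int saturate_to_byte(int x)              /* external entry point */
{
   return clamp(x, 0, 255);
}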

Dynamic Memory Allocation Consideration

Dynamic memory allocation (‘malloc’ in C language) should always return a pointer that is suitably aligned for the largest base type (quadword alignment). Where this aligned pointer cannot be guaranteed, use the technique shown in the following code to make the pointer quadword aligned, if needed. This code assumes the pointer can be cast to a long.

Example:

double* p;
double* np;

p  = (double *)malloc(sizeof(double)*number_of_doubles+7L);
np = (double *)((((long)(p))+7L) & (-8L));
Then use ‘np’ instead of ‘p’ to access the data. ‘p’ is still needed in order to deallocate the storage. Adding 7 and masking with -8 (which clears the low three address bits) rounds the pointer up to the next quadword boundary.

Introduce Explicit Parallelism into Code

Where possible, break long dependency chains into several independent dependency chains that can then be executed in parallel, exploiting the pipelined execution units. This is especially important for floating-point code, whether it is mapped to x87 or 3DNow! instructions, because of the longer latency of floating-point operations. Since most languages, including ANSI C, guarantee that floating-point expressions are not reordered, compilers usually cannot perform such optimizations unless they offer a switch to allow ANSI non-compliant reordering of floating-point expressions according to algebraic rules.
Note that re-ordered code that is algebraically identical to the original code does not necessarily deliver identical computational results due to the lack of associativity of floating point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may
lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.

Example 1 (Avoid):

double a[100], sum;
int i;

sum = 0.0;
for (i=0; i<100; i++) {
   sum += a[i];
}

Example 2 (Preferred):

double a[100], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i=0; i<100; i+=4) {
   sum1 += a[i];
   sum2 += a[i+1];
   sum3 += a[i+2];
   sum4 += a[i+3];
}
sum = (sum4+sum3)+(sum1+sum2);
Notice that the 4-way unrolling was chosen to exploit the 4-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximal sustained utilization.

Explicitly Extract Common Subexpressions

In certain situations, C compilers are unable to extract common subexpressions from floating-point expressions due to the guarantee against reordering of such expressions in the ANSI standard. Specifically, the compiler can not re-arrange the computation according to algebraic equivalencies before extracting common subexpressions. In such cases, the programmer should manually extract the common subexpression. It should be noted that re-arranging the expression may result in different computational results due to the lack of associativity of floating-point operations, but the results usually differ in only the least significant bits.

Example 1 (Avoid):

double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;

(Preferred):

double a,b,c,d,e,f,t;
t = b/d;
e = c*t;
f = a*t;

Example 2 (Avoid):

double a,b,c,e,f;
e = a/c;
f = b/c;

(Preferred):

double a,b,c,e,f,t;
t = 1.0/c;
e = a*t;
f = b*t;

C Language Structure Component Considerations

Many compilers have options that allow padding of structures to make their size multiples of words, doublewords, or quadwords, in order to achieve better alignment for structures. In addition, to improve the alignment of structure members, some compilers might allocate structure elements in an order that differs from the order in which they are declared. However, some compilers might not offer any of these features, or their implementation might not work properly in all situations. Therefore, to achieve the best alignment of structures and structure members while minimizing the amount of padding regardless of compiler optimizations, the following methods are suggested.

Sort by Base Type Size

Sort structure members according to their base type size, declaring members with a larger base type size ahead of members with a smaller base type size.

Pad by Multiple of Largest Base Type Size

Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other members are naturally aligned as well. The padding of the structure to a multiple of the largest base type size allows, for example, arrays of structures to be perfectly aligned.
The following example demonstrates the reordering of structure member declarations:
Original ordering (Avoid):
struct {
   char   a[5];
   long   k;
   double x;
} baz;

New ordering, with padding (Preferred):

struct {
   double x;
   long   k;
   char   a[5];
   char   pad[7];
} baz;
See "C Language Structure Component Considerations" on page 55 for a different perspective.

Sort Local Variables According to Base Type Size

When a compiler allocates local variables in the same order in which they are declared in the source code, it can be helpful to declare local variables in such a manner that variables with a larger base type size are declared ahead of the variables with smaller base type size. Then, if the first variable is allocated so that it is naturally aligned, all other variables are allocated contiguously in the order they are declared, and are naturally aligned without any padding.
Some compilers do not allocate variables in the order they are declared. In these cases, the compiler should automatically allocate variables in such a manner as to make them naturally aligned with the minimum amount of padding. In addition, some compilers do not guarantee that the stack is aligned suitably for the largest base type (that is, they do not guarantee
quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.
The following example demonstrates the reordering of local variable declarations:

Original ordering (Avoid):

short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;

Improved ordering (Preferred):

double z[3];
double x, y;
long   foo, bar;
float  baz;
short  ga, gu, gi;
char   a, b;
See "Sort Variables According to Base Type Size" on page 56 for more information from a different perspective.

Accelerating Floating-Point Divides and Square Roots

Divides and square roots have a much longer latency than other floating-point operations, even though the AMD Athlon processor provides significant acceleration of these two operations. In some codes, these operations occur so often as to seriously impact performance. In these cases, it is recommended to port the code to 3DNow! inline assembly or to use a compiler that can generate 3DNow! code. If code has hot spots that use single-precision arithmetic only (i.e., all computation involves data of type float) and for some reason cannot be ported to 3DNow!, the following technique may be used to improve performance.
The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines the precision to which results are rounded. It affects the basic arithmetic operations, including divides and square roots. AMD Athlon and AMD-K6® family processors implement divide and square root in such a fashion as to compute only the number of bits necessary for the currently selected precision. This means that setting precision control to single precision (versus the Win32 default of double precision) lowers the latency of those operations.
The Microsoft® Visual C environment provides functions to manipulate the FPU control word and thus the precision control. Note that these functions are not very fast, so changes of precision control should be inserted where it creates little overhead, such as outside a computation-intensive loop. Otherwise the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square root operations.
The following example shows how to set the precision control to single precision and later restore the original settings in the Microsoft Visual C environment.

Example:

/* prototype for _controlfp() function */
#include <float.h>

unsigned int orig_cw;

/* get current FPU control word and save it */
orig_cw = _controlfp (0,0);

/* Set precision control in FPU control word to single
   precision. This reduces the latency of divide and
   square root operations. */
_controlfp (_PC_24, MCW_PC);

/* restore original FPU control word */
_controlfp (orig_cw, 0xfffff);

Avoid Unnecessary Integer Division

Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One opportunity for reducing the number of integer divisions arises when a value is divided repeatedly: a sequence of divisions can be replaced with one division and a multiplication, as shown in the following examples. This replacement is possible only if no overflow occurs during the computation of the product, which can be determined by considering the possible ranges of the divisors.

Example 1 (Avoid):

int i,j,k,m;
m = i / j / k;

Example 2 (Preferred):

int i,j,k,m;
m = i / (j * k);
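As a worked illustration of the overflow check (the value ranges are assumed): if j and k are each known to lie in [1, 1000], the product j*k is at most 1,000,000, well below the 2,147,483,647 limit of a 32-bit signed integer, so the replacement is safe:

/* rows and cols assumed in [1, 1000], so rows*cols <= 1,000,000
   and the product cannot overflow a 32-bit int */
int average_per_cell(int total, int rows, int cols)
{
   return total / (rows * cols);  /* one divide instead of two */
}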
Copy Frequently De-referenced Pointer Arguments to Local Variables
Avoid frequently de-referencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such de-referencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic.
Note that many compilers have an "assume no aliasing" optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables.
Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and if necessary copy them back at the end of the function.

Example 1 (Avoid):

//assumes pointers are different and q != r
void isqrt ( unsigned long a, unsigned long *q, unsigned long *r )
{
   *q = a;
   if (a > 0)
   {
      while (*q > (*r = a / *q))
      {
         *q = (*q + *r) >> 1;
      }
   }
   *r = a - *q * *q;
}

Example 2 (Preferred):

//assumes pointers are different and q != r
void isqrt ( unsigned long a, unsigned long *q, unsigned long *r )
{
   unsigned long qq, rr;
   qq = a;
   if (a > 0)
   {
      while (qq > (rr = a / qq))
      {
         qq = (qq + rr) >> 1;
      }
   }
   rr = a - qq * qq;
   *q = qq;
   *r = rr;
}
4   Instruction Decoding Optimizations

Overview

This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon processor. Guidelines are listed in order of importance.
The AMD Athlon processor instruction fetcher reads 16-byte aligned code windows from the instruction cache. The instruction bytes are then merged into a 24-byte instruction queue. On each cycle, the in-order front-end engine selects for decode up to three x86 instructions from the instruction-byte queue.
All instructions (x86, x87, 3DNow!, and MMX) are classified into two types of decodes: DirectPath and VectorPath (see "DirectPath Decoder" and "VectorPath Decoder" on page 133 for more information). DirectPath instructions are common instructions that are decoded directly in hardware. VectorPath instructions are more complex instructions that require the use of a sequence of multiple operations issued from an on-chip ROM.
Up to three DirectPath instructions can be selected for decode per cycle. Only one VectorPath instruction can be selected for decode per cycle. DirectPath instructions and VectorPath instructions cannot be simultaneously decoded.

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction, which includes 'register ← register op memory' as well as 'register ← register op register' forms of instructions. Up to three DirectPath instructions can be decoded per cycle. VectorPath instructions block the decoding of DirectPath instructions.
The vast majority of instructions used by a compiler have been implemented as DirectPath instructions in the AMD Athlon processor. Nevertheless, assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions.
See Appendix F, "Instruction Dispatch and Execution Resources," on page 187 and Appendix G, "DirectPath versus VectorPath Instructions," on page 219 for tables of DirectPath and VectorPath instructions.
Load-Execute Instruction Usage
Use Load-Execute Integer Instructions
Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle. Splitting a load-execute integer instruction into two separate instructions (a load instruction and a reg, reg instruction) reduces decoding bandwidth and increases register pressure, which results in lower performance. The split-instruction form can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density.
Note: This optimization applies only to floating-point instructions with floating-point operands, not with integer operands, as described in the next optimization.
This coding style helps in two ways. First, denser code allows more work to be held in the instruction cache. Second, the denser code generates fewer internal OPs and, therefore, the FPU scheduler holds more work, which increases the chances of extracting parallelism from the code.
Example 1 (Avoid):
FLD  QWORD PTR [TEST1]
FLD  QWORD PTR [TEST2]
FMUL ST, ST(1)
Example 2 (Preferred):
FLD  QWORD PTR [TEST1]
FMUL QWORD PTR [TEST2]
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR, FICOM, and FICOMP. Remember that floating-point instructions can have integer operands, while integer instructions cannot have floating-point operands.
Floating-point computations involving integer-memory operands should use separate FILD and arithmetic instructions. This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle. In some situations, this optimization can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.
Example 1 (Avoid):
FLD   QWORD PTR [foo]
FIMUL DWORD PTR [bar]
FIADD DWORD PTR [baz]
Example 2 (Preferred):
FILD  DWORD PTR [bar]
FILD  DWORD PTR [baz]
FLD   QWORD PTR [foo]
FMULP ST(2), ST
FADDP ST(1), ST

Align Branch Targets in Program Hot Spots

In program hot spots (i.e., innermost loops in the absence of profiling data), place branch targets at or near the beginning of 16-byte aligned code windows. This technique helps to maximize the number of instructions that are filled into the instruction-byte queue while preserving I-cache space in branch-intensive code.

Use Short Instruction Lengths

Assemblers and compilers should generate the tightest code possible to optimize use of the I-cache and increase average decode rate. Wherever possible, use instructions with shorter lengths. Using shorter instructions increases the number of instructions that can fit into the instruction-byte queue. For example, use 8-bit displacements as opposed to 32-bit displacements. In addition, use the single-byte format of simple integer instructions whenever possible, as opposed to the 2-byte opcode ModR/M format.

Example 1 (Avoid):

81 C0 78 56 34 12    add eax, 12345678h  ;uses 2-byte opcode
                                         ; form (with ModR/M)
81 C3 FB FF FF FF    add ebx, -5         ;uses 32-bit immediate
0F 84 05 00 00 00    jz  $label1         ;uses 2-byte opcode,
                                         ; 32-bit immediate

Example 2 (Preferred):

05 78 56 34 12 add eax, 12345678h ;uses single byte
; opcode form
83 C3 FB add ebx, -5 ;uses 8-bit sign
; extended immediate
74 05 jz $label1 ;uses 1-byte opcode,
; 8-bit immediate

Avoid Partial Register Reads and Writes

In order to handle partial register writes, the AMD Athlon processor execution core implements a data-merging scheme.
In the execution unit, an instruction writing a partial register merges the modified portion with the current state of the remainder of the register. Therefore, the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register.
Example 1 (Avoid):
MOV AL, 10   ;inst 1
MOV AH, 12   ;inst 2 has a false dependency on
             ; inst 1
             ;inst 2 merges new AH with current
             ; EAX register value forwarded
             ; by inst 1
In addition, an instruction that has a read dependency on any part of a given architectural register has a read dependency on the most recent instruction that modifies any part of the same architectural register.
Example 2 (Avoid):
MOV BX, 12h  ;inst 1
MOV BL, DL   ;inst 2, false dependency on
; completion of inst 1
MOV BH, CL ;inst 3, false dependency on
; completion of inst 2
MOV AL, BL ;inst 4, depends on completion of
; inst 2

Replace Certain SHLD Instructions with Alternative Code

Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA. The alternative code has lower latency and requires fewer execution resources. SHR and LEA (32-bit version) are DirectPath instructions, while SHLD is a VectorPath instruction. The SHR and LEA alternative also preserves decode bandwidth, as it potentially enables the decoding of a third DirectPath instruction.

Example 1 (Avoid):

SHLD REG1, REG2, 1
(Preferred):
SHR REG2, 31
LEA REG1, [REG1*2 + REG2]
Example 2 (Avoid):
SHLD REG1, REG2, 2
(Preferred):
SHR REG2, 30
LEA REG1, [REG1*4 + REG2]
Example 3 (Avoid):
SHLD REG1, REG2, 3
(Preferred):
SHR REG2, 29
LEA REG1, [REG1*8 + REG2]
Use 8-Bit Sign-Extended Immediates
Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD Athlon processor. For example, ADD BX, -5 should be encoded 83 C3 FB and not 81 C3 FB FF.
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor.

Code Padding Using Neutral Code Fillers

Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out branches. Since this filler code can be executed, it should take up as few execution resources as possible, not diminish decode density, and not modify any processor state other than advancing EIP. One byte of padding can easily be achieved using the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the x86 architecture, there are several multi-byte "NOP" instructions available that do not change processor state other than EIP:
MOV REG, REG
XCHG REG, REG
CMOVcc REG, REG
SHR REG, 0
SAR REG, 0
SHL REG, 0
SHRD REG, REG, 0
SHLD REG, REG, 0
LEA REG, [REG]
LEA REG, [REG+00]
LEA REG, [REG*1+00]
LEA REG, [REG+00000000]
LEA REG, [REG*1+00000000]
Not all of these instructions are equally suitable for purposes of code padding. For example, SHLD/SHRD are microcoded which reduces decode bandwidth and takes up execution resources.
Recommendations for the AMD Athlon Processor
For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF3). In the AMD Athlon processor, a NOP with up to two REP prefixes can be handled by a single decoder with no overhead. As the REP prefixes are redundant and meaningless, they get discarded, and NOPs are handled without using any execution resources. The three decoders of the AMD Athlon processor can handle up to three NOPs, each with up to two REP prefixes, in a single cycle, for a neutral code filler of up to nine bytes.
Note: When used as filler instructions, REP/REPNE prefixes can be used in conjunction only with NOPs. REP/REPNE has undefined behavior when used with instructions other than a NOP.
If a larger amount of code padding is required, it is recommended to use a JMP instruction to jump across the padding region. The following assembly language macros show this:
NOP1_ATHLON  TEXTEQU <DB 090h>
NOP2_ATHLON  TEXTEQU <DB 0F3h, 090h>
NOP3_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h>
NOP4_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 090h>
NOP5_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 090h>
NOP6_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h>
NOP7_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h, 090h>
NOP8_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h, 0F3h, 090h>
NOP9_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h>
NOP10_ATHLON TEXTEQU <DB 0EBh, 008h, 90h, 90h, 90h, 90h, 90h, 90h, 90h, 90h>
Recommendations for AMD-K6® Family and AMD Athlon Processor Blended Code
On x86 processors other than the AMD Athlon processor (including the AMD-K6 family of processors), the REP prefix and especially multiple prefixes cause decoding overhead, so the above technique is not recommended for code that has to run well both on AMD Athlon processor and other x86 processors (blended code). In such cases the instructions and instruction sequences below are recommended. For neutral code fillers longer than eight bytes in length, the JMP instruction can be used to jump across the padding region.
Note that each of the instructions and instruction sequences below utilizes an x86 register. To avoid performance degradation, the register used in the padding should be selected so as to not lengthen existing dependency chains, i.e., one should select a register that is not used by instructions in the vicinity of the neutral code filler. Note that certain instructions use registers implicitly. For example, PUSH, POP, CALL, and RET all make implicit use of the ESP register. The 5-byte filler sequence below consists of two instructions. If flag changes across the code padding are acceptable, the following instructions may be used as single instruction, 5-byte code fillers:
TEST EAX, 0FFFF0000h
CMP EAX, 0FFFF0000h
The following assembly language macros show the recommended neutral code fillers for code optimized for the AMD Athlon processor that also has to run well on other x86 processors. Note for some padding lengths, versions using ESP or EBP are missing due to the lack of fully generalized addressing modes.
NOP2_EAX TEXTEQU <DB 08Bh,0C0h>                  ;mov eax, eax
NOP2_EBX TEXTEQU <DB 08Bh,0DBh>                  ;mov ebx, ebx
NOP2_ECX TEXTEQU <DB 08Bh,0C9h>                  ;mov ecx, ecx
NOP2_EDX TEXTEQU <DB 08Bh,0D2h>                  ;mov edx, edx
NOP2_ESI TEXTEQU <DB 08Bh,0F6h>                  ;mov esi, esi
NOP2_EDI TEXTEQU <DB 08Bh,0FFh>                  ;mov edi, edi
NOP2_ESP TEXTEQU <DB 08Bh,0E4h>                  ;mov esp, esp
NOP2_EBP TEXTEQU <DB 08Bh,0EDh>                  ;mov ebp, ebp

NOP3_EAX TEXTEQU <DB 08Dh,004h,020h>             ;lea eax, [eax]
NOP3_EBX TEXTEQU <DB 08Dh,01Ch,023h>             ;lea ebx, [ebx]
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h>             ;lea ecx, [ecx]
NOP3_EDX TEXTEQU <DB 08Dh,014h,022h>             ;lea edx, [edx]
NOP3_ESI TEXTEQU <DB 08Dh,024h,024h>             ;lea esi, [esi]
NOP3_EDI TEXTEQU <DB 08Dh,034h,026h>             ;lea edi, [edi]
NOP3_ESP TEXTEQU <DB 08Dh,03Ch,027h>             ;lea esp, [esp]
NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h>             ;lea ebp, [ebp]

NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h>        ;lea eax, [eax+00]
NOP4_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h>        ;lea ebx, [ebx+00]
NOP4_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h>        ;lea ecx, [ecx+00]
NOP4_EDX TEXTEQU <DB 08Dh,054h,022h,000h>        ;lea edx, [edx+00]
NOP4_ESI TEXTEQU <DB 08Dh,064h,024h,000h>        ;lea esi, [esi+00]
NOP4_EDI TEXTEQU <DB 08Dh,074h,026h,000h>        ;lea edi, [edi+00]
NOP4_ESP TEXTEQU <DB 08Dh,07Ch,027h,000h>        ;lea esp, [esp+00]

NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h>   ;lea eax, [eax+00] ;nop
NOP5_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h,090h>   ;lea ebx, [ebx+00] ;nop
NOP5_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h,090h>   ;lea ecx, [ecx+00] ;nop
NOP5_EDX TEXTEQU <DB 08Dh,054h,022h,000h,090h>   ;lea edx, [edx+00] ;nop
NOP5_ESI TEXTEQU <DB 08Dh,064h,024h,000h,090h>   ;lea esi, [esi+00] ;nop
NOP5_EDI TEXTEQU <DB 08Dh,074h,026h,000h,090h>   ;lea edi, [edi+00] ;nop
NOP5_ESP TEXTEQU <DB 08Dh,07Ch,027h,000h,090h>   ;lea esp, [esp+00] ;nop

NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0>          ;lea eax, [eax+00000000]
NOP6_EBX TEXTEQU <DB 08Dh,09Bh,0,0,0,0>          ;lea ebx, [ebx+00000000]
NOP6_ECX TEXTEQU <DB 08Dh,089h,0,0,0,0>          ;lea ecx, [ecx+00000000]
NOP6_EDX TEXTEQU <DB 08Dh,092h,0,0,0,0>          ;lea edx, [edx+00000000]
NOP6_ESI TEXTEQU <DB 08Dh,0B6h,0,0,0,0>          ;lea esi, [esi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>          ;lea edi, [edi+00000000]
NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0>          ;lea ebp, [ebp+00000000]

NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0>     ;lea eax, [eax*1+00000000]
NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0>     ;lea ebx, [ebx*1+00000000]
NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0>     ;lea ecx, [ecx*1+00000000]
NOP7_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0>     ;lea edx, [edx*1+00000000]
NOP7_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0>     ;lea esi, [esi*1+00000000]
NOP7_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0>     ;lea edi, [edi*1+00000000]
NOP7_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0>     ;lea ebp, [ebp*1+00000000]

NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h> ;lea eax, [eax*1+00000000] ;nop
NOP8_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0,90h> ;lea ebx, [ebx*1+00000000] ;nop
NOP8_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0,90h> ;lea ecx, [ecx*1+00000000] ;nop
NOP8_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0,90h> ;lea edx, [edx*1+00000000] ;nop
NOP8_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0,90h> ;lea esi, [esi*1+00000000] ;nop
NOP8_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0,90h> ;lea edi, [edi*1+00000000] ;nop
NOP8_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0,90h> ;lea ebp, [ebp*1+00000000] ;nop

NOP9 TEXTEQU <DB 0EBh,007h,90h,90h,90h,90h,90h,90h,90h> ;jmp
5   Cache and Memory Optimizations

This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon processor. Guidelines are listed in order of importance.

Memory Size and Alignment Issues

Avoid Memory Size Mismatches

Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same
data, keep operands aligned and keep the loads/stores of each operand the same size. The following code examples result in a
store-to-load-forwarding (STLF) stall:
Example 1 (Avoid):
MOV DWORD PTR [FOO], EAX
MOV DWORD PTR [FOO+4], EDX
FLD QWORD PTR [FOO]
Avoid large-to-small mismatches, as shown in the following code:
Example 2 (Avoid):
FST QWORD PTR [FOO]
MOV EAX, DWORD PTR [FOO]
MOV EDX, DWORD PTR [FOO+4]

Align Data Where Possible

In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned. For example:
- QWORD accesses are aligned if they access an address divisible by 8.
- DWORD accesses are aligned if they access an address divisible by 4.
- WORD accesses are aligned if they access an address divisible by 2.
- TBYTE accesses are aligned if they access an address divisible by 8.
A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline. In addition, using misaligned loads and stores increases the likelihood of encountering a store-to-load forwarding pitfall. For a more detailed discussion of store-to-load forwarding issues, see "Store-to-Load Forwarding Restrictions" on page 51.

Use the 3DNow! PREFETCH and PREFETCHW Instructions

For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor.
The PREFETCH and PREFETCHW instructions take advantage of the AMD Athlon processor’s high bus bandwidth to hide long latencies when fetching data from system memory. The prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.).
Large data sets typically require unit-stride access to ensure that all data pulled in by PREFETCH or PREFETCHW is actually used. If necessary, algorithms or data structures should be reorganized to allow unit-stride access.
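One possible reorganization (a sketch; names and sizes are illustrative) is a structure-of-arrays layout, which turns a strided walk over one member into a unit-stride walk, so that every byte of each prefetched cache line is consumed:

#define N 4096

struct points_soa {            /* structure-of-arrays layout */
   double x[N];
   double y[N];
   double z[N];
};

double sum_x(const struct points_soa *p)
{
   double sum = 0.0;
   int i;
   for (i = 0; i < N; i++) {
      sum += p->x[i];          /* consecutive addresses: unit stride */
   }
   return sum;
}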

PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent. To maintain compatibility with the 25 million AMD-K6®-2 and AMD-K6®-III processors already sold, use the 3DNow! PREFETCH/W instructions instead of the various prefetch flavors in the new MMX extensions.

PREFETCHW Usage

Code that intends to modify the cache line brought in through prefetching should use the PREFETCHW instruction. While PREFETCHW works the same as a PREFETCH on the AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a hint to the AMD Athlon processor of an intent to modify the cache line. The AMD Athlon processor marks the cache line being brought in by PREFETCHW as Modified. Using PREFETCHW can save an additional 15-25 cycles compared to a PREFETCH and the subsequent cache state change caused by a write to the prefetched cache line.

Multiple Prefetches

Programmers can initiate multiple outstanding prefetches on the AMD Athlon processor. While the AMD-K6-2 and AMD-K6-III processors can have only one outstanding prefetch, the AMD Athlon processor can have up to six outstanding prefetches. When all six buffers are filled by various memory read requests, the processor simply ignores any new prefetch requests until a buffer frees up. Multiple prefetch requests are essentially handled in-order. If data is needed first, then that data should be prefetched first.
The example below shows how to initiate multiple prefetches when traversing more than one array.
Example (Multiple Prefetches):
.CODE
.K3D

; original C code
;
; #define LARGE_NUM 65536
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;    a[i] = b[i] * c[i];
; }
MOV ECX, (-LARGE_NUM)    ;used biased index
MOV EAX, OFFSET array_a  ;get address of array_a
MOV EDX, OFFSET array_b  ;get address of array_b
MOV EBX, OFFSET array_c  ;get address of array_c

$loop:
PREFETCHW [EAX+196]                      ;two cachelines ahead
PREFETCH  [EDX+196]                      ;two cachelines ahead
PREFETCH  [EBX+196]                      ;two cachelines ahead
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE]      ;b[i]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE]      ;b[i]*c[i]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE]      ;a[i] = b[i]*c[i]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+8]    ;b[i+1]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+8]    ;b[i+1]*c[i+1]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+8]    ;a[i+1] = b[i+1]*c[i+1]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+16]   ;b[i+2]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+16]   ;b[i+2]*c[i+2]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+16]   ;a[i+2] = b[i+2]*c[i+2]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+24]   ;b[i+3]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+24]   ;b[i+3]*c[i+3]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+24]   ;a[i+3] = b[i+3]*c[i+3]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+32]   ;b[i+4]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+32]   ;b[i+4]*c[i+4]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+32]   ;a[i+4] = b[i+4]*c[i+4]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+40]   ;b[i+5]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+40]   ;b[i+5]*c[i+5]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+40]   ;a[i+5] = b[i+5]*c[i+5]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+48]   ;b[i+6]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+48]   ;b[i+6]*c[i+6]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+48]   ;a[i+6] = b[i+6]*c[i+6]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+56]   ;b[i+7]
FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+56]   ;b[i+7]*c[i+7]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+56]   ;a[i+7] = b[i+7]*c[i+7]
ADD ECX, 8                               ;next 8 products
JNZ $loop                                ;until none left

END
The following optimization rules were applied to this example.
- Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the available number of outstanding PREFETCHes.
- Since the array "array_a" is written rather than read, PREFETCHW is used instead of PREFETCH to avoid overhead for switching cache lines to the correct MESI state. The PREFETCH lookahead has been optimized such that each loop iteration is working on three cache lines while six active PREFETCHes bring in the next six cache lines.
- Index arithmetic has been reduced to a minimum by use of complex addressing modes and biasing of the array base addresses in order to cut down on loop overhead.

Determining Prefetch Distance

Prefetch at Least 64 Bytes Away from Surrounding Stores

Given the latency of a typical AMD Athlon processor system and expected processor speeds, the following formula should be used to determine the prefetch distance in bytes for a single array:
Prefetch Distance = 200 * (DS/C) bytes
Round up to the nearest 64-byte cache line.
- The number 200 is a constant based upon expected AMD Athlon processor clock frequencies and typical system memory latencies.
- DS is the data stride in bytes per loop iteration.
- C is the number of cycles for one loop iteration to execute entirely from the L1 cache.
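For example (values assumed purely for illustration): a loop with a data stride of DS = 64 bytes per iteration that executes in C = 16 cycles from the L1 cache gives 200 * (64/16) = 800 bytes; rounding up to the nearest 64-byte cache line yields a prefetch distance of 832 bytes, that is, 13 cache lines ahead.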
The prefetch distance for multiple arrays is typically even longer.
The PREFETCH and PREFETCHW instructions can be affected by false dependencies on stores. If there is a store to an address that matches a request, that request (the PREFETCH or PREFETCHW instruction) may be blocked until the store is written to the cache. Therefore, code should prefetch data that is located at least 64 bytes away from any surrounding store’s data address.

Take Advantage of Write Combining

Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggressive write-combining algorithm, which improves performance significantly.
See Appendix C, "Implementation of Write Combining," on page 155 for more details.
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Sharing code and data in the same 64-byte cache line may cause the L1 caches to thrash (unnecessary castout of code/data) in order to maintain coherency between the separate instruction and data caches. The AMD Athlon processor has a cache-line size of 64-bytes, which is twice the size of previous processors. Programmers must be aware that code and data should not be shared within this larger cache line, especially if the data becomes modified.
For example, programmers should consider that a memory indirect JMP instruction may have the data for the jump table residing in the same 64-byte cache line as the JMP instruction, which would result in lower performance.
Although rare, do not place critical code at the border between 32-byte aligned code segments and data segments. Code at the start or end of a data segment should be executed as rarely as possible or simply padded with garbage.
In general, the following should be avoided:
- self-modifying code
- storing data in code segments
Store-to-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store architecture when either a load operation is not allowed to read needed data from a store in the store buffer, or a load OP detects a false data dependency on a store in the store buffer.
In either case, the load cannot complete (load the needed data into a register) until the store has retired out of the store buffer and written to the data cache. A store-buffer entry cannot retire and write to the data cache until every instruction before the store has completed and retired from the reorder buffer.
The implication of this restriction is that all instructions in the reorder buffer, up to and including the store, must complete and retire out of the reorder buffer before the load can complete. Effectively, the load has a false dependency on every instruction up to the store.
The following sections describe store-to-load forwarding examples that are acceptable and those that should be avoided.
Store-to-Load Forwarding Pitfalls: True Dependencies
A load is allowed to read data from the store-buffer entry only if all of the following conditions are satisfied:
- The start address of the load matches the start address of the store.
- The load operand size is equal to or smaller than the store operand size.
- Neither the load nor the store is misaligned.
- The store data is not from a high-byte register (AH, BH, CH, or DH).
The following sections describe common-case scenarios to avoid whereby a load has a true dependency on a LS2-buffered store but cannot read (forward) data from a store-buffer entry.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a narrow-to-wide store-buffer data forwarding restriction:
- The operand size of the store data is smaller than the operand size of the load data.
- The range of addresses spanned by the store data covers some sub-region of the range of addresses spanned by the load data.
Avoid the type of code shown in the following two examples.
Example 1 (Avoid):
MOV EAX, 10h
MOV WORD PTR [EAX], BX    ;word store
...
MOV ECX, DWORD PTR [EAX]  ;doubleword load
                          ;cannot forward upper
                          ; byte from store buffer
Example 2 (Avoid):
MOV EAX, 10h
MOV BYTE PTR [EAX + 3], BL ;byte store
...
MOV ECX, DWORD PTR [EAX]   ;doubleword load
                           ;cannot forward upper byte
                           ; from store buffer
Wide-to-Narrow Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a wide-to-narrow store-buffer data forwarding restriction:
- The operand size of the store data is greater than the operand size of the load data.
- The start address of the store data does not match the start address of the load.
Example 3 (Avoid):
MOV EAX, 10h
ADD DWORD PTR [EAX], EBX   ;doubleword store
MOV CX, WORD PTR [EAX + 2] ;word load-cannot forward high
                           ; word from store buffer
Use example 5 instead of example 4.
Example 4 (Avoid):
MOVQ [foo], MM1   ;store upper and lower half
...
ADD EAX, [foo]    ;fine
ADD EDX, [foo+4]  ;uh-oh!
52 Store-to-Load Forwarding Restrictions
22007E/0November 1999 AMD Athlon Processor x86 Code Optimization
Example 5 (Preferred):
MOVD      [foo], MM1    ;store lower half
PUNPCKHDQ MM1, MM1      ;get upper half into lower half
MOVD      [foo+4], MM1  ;store upper half
...
ADD EAX, [foo]          ;fine
ADD EDX, [foo+4]        ;fine
Misaligned Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a misaligned store-buffer data forwarding restriction:
- The store or load address is misaligned. For example, a quadword store is not aligned to a quadword boundary, a doubleword store is not aligned to a doubleword boundary, etc.
A common case of misaligned store-data forwarding involves the passing of misaligned quadword floating-point data on the doubleword-aligned integer stack. Avoid the type of code shown
in the following example.
Example 6 (Avoid):
MOV  ESP, 24h
FSTP QWORD PTR [ESP]  ;esp=24
.                     ;store occurs to quadword
.                     ; misaligned address
.
FLD  QWORD PTR [ESP]  ;quadword load cannot forward
                      ; from quadword misaligned
                      ; ‘fstp [esp]’ store OP
High-Byte Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a high-byte store-data buffer forwarding restriction:
- The store data is from a high-byte register (AH, BH, CH, or DH).
Avoid the type of code shown in the following example.
Example 7 (Avoid):
MOV EAX, 10h
MOV [EAX], BH  ;high-byte store
.
MOV DL, [EAX]  ;load cannot forward from
; high-byte store
One Supported Store-to-Load Forwarding Case

There is one case of mismatched store-to-load forwarding that is supported by the AMD Athlon processor: the lower 32 bits from an aligned QWORD write feeding into a DWORD read are allowed.
Example 8 (Allowed):
MOVQ [AlignedQword], MM0
...
MOV EAX, [AlignedQword]
Summary of Store-to-Load Forwarding Pitfalls to Avoid
To avoid store-to-load forwarding pitfalls, code should conform to the following guidelines:
- Maintain consistent use of operand size across all loads and stores. Preferably, use doubleword or quadword operand sizes.
- Avoid misaligned data references.
- Avoid narrow-to-wide and wide-to-narrow forwarding cases.
- When using word or byte stores, avoid loading data from anywhere in the same doubleword of memory other than the identical start addresses of the stores.

Stack Alignment Considerations

Make sure the stack is suitably aligned for the local variable with the largest base type. Then, using the technique described in "C Language Structure Component Considerations" on page 55, all variables can be properly aligned with no padding.

Extend to 32 Bits Before Pushing onto Stack

Function arguments smaller than 32 bits should be extended to 32 bits before being pushed onto the stack, which ensures that the stack is always doubleword aligned on entry to a function.
If a function has no local variables with a base type larger than doubleword, no further work is necessary. If the function does have local variables whose base type is larger than a doubleword, additional code should be inserted to ensure proper alignment of the stack. For example, the following code achieves quadword alignment:
Example (Preferred):
Prolog:
   PUSH EBP
   MOV  EBP, ESP
   SUB  ESP, SIZE_OF_LOCALS  ;size of local variables
   AND  ESP, -8
   ;push registers that need to be preserved

Epilog:
   ;pop registers that need to be preserved
   MOV  ESP, EBP
   POP  EBP
   RET
With this technique, function arguments can be accessed via EBP, and local variables can be accessed via ESP. In order to free EBP for general use, it needs to be saved and restored between the prolog and the epilog.

Align TBYTE Variables on Quadword Aligned Addresses

Align variables of type TBYTE on quadword-aligned addresses. To keep an array of TBYTE variables aligned, allocate the elements 16 bytes apart. In general, TBYTE variables should be avoided; use double-precision variables instead.

C Language Structure Component Considerations

Structures (‘struct’ in C language) should be made the size of a multiple of the largest base type of any of their components. To meet this requirement, padding should be used where necessary.
Language definitions permitting, to minimize padding, structure components should be sorted and allocated such that the components with a larger base type are allocated ahead of those with a smaller base type. For example, consider the following code:

Example:

struct {
   char   a[5];
   long   k;
   double x;
} baz;
The structure components should be allocated (lowest to highest address) as follows:
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0
See "C Language Structure Component Considerations" on page 27 for more information from a C source code perspective.

Sort Variables According to Base Type Size

Sort local variables according to their base type size and allocate variables with larger base type size ahead of those with smaller base type size. Assuming the first variable allocated is naturally aligned, all other variables are naturally aligned without any padding. The following example is a declaration of local variables in a C function:

Example:

short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;
Allocate in the following order from left to right (from higher to lower addresses):
x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b;
See "Sort Local Variables According to Base Type Size" on page 28 for more information from a C source code perspective.
6   Branch Optimizations

While the AMD Athlon processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit. This chapter discusses rules that improve branch prediction and minimize branch penalties. Guidelines are listed in order of importance.

Avoid Branches Dependent on Random Data

Avoid conditional branches that depend on random data, as these are difficult to predict. For example, suppose a piece of code receives a random stream of characters "A" through "Z" and branches if the character is before "M" in the collating sequence. Data-dependent branches acting upon basically random data cause the branch prediction logic to mispredict the branch about 50% of the time.
If possible, design branch-free alternative code sequences, which results in shorter average execution time. This technique is especially important if the branch body is small. Examples 1 and 2 illustrate this concept using the CMOV instruction. Note that the AMD-K6® processor does not support the CMOV instruction. Therefore, blended AMD-K6 and AMD Athlon processor code should use Examples 3 and 4.
AMD Athlon Processor Specific Code
Example 1 Signed integer ABS function (X = labs(X)):
MOV   ECX, [X]  ;load value
MOV   EBX, ECX  ;save value
NEG   ECX       ;-value
CMOVS ECX, EBX  ;if -value is negative, select value
MOV   [X], ECX  ;save labs result
Example 2 Unsigned integer min function (z = x < y ? x : y):
MOV    EAX, [X]  ;load X value
MOV    EBX, [Y]  ;load Y value
CMP    EAX, EBX  ;EBX<=EAX ? CF=0 : CF=1
CMOVNC EAX, EBX  ;EAX=(EBX<=EAX) ? EBX : EAX
MOV    [Z], EAX  ;save min (X,Y)
Blended AMD-K6® and AMD Athlon Processor Code
Example 3 Signed integer ABS function (X = labs(X)):
MOV ECX, [X]  ;load value
MOV EBX, ECX  ;save value
SAR ECX, 31   ;x < 0 ? 0xffffffff : 0
XOR EBX, ECX  ;x < 0 ? ~x : x
SUB EBX, ECX  ;x < 0 ? (~x)+1 : x
MOV [X], EBX  ;x < 0 ? -x : x
Example 4 Unsigned integer min function (z = x < y ? x : y):
MOV EAX, [x]  ;load x
MOV EBX, [y]  ;load y
SUB EAX, EBX  ;x < y ? CF : NC ; x - y
SBB ECX, ECX  ;x < y ? 0xffffffff : 0
AND ECX, EAX  ;x < y ? x - y : 0
ADD ECX, EBX  ;x < y ? x - y + y : y
MOV [z], ECX  ;x < y ? x : y
Example 5 Hexadecimal to ASCII conversion (y=x < 10 ? x + 0x30: x + 0x41):
MOV AL, [X]  ;load X value
CMP AL, 10   ;if x is less than 10, set carry flag
SBB AL, 69h  ;0..9 -> 96h, Ah..Fh -> A1h...A6h
DAS          ;0..9: subtract 66h, Ah..Fh: subtract 60h
MOV [Y], AL  ;save conversion in y
Example 6 Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;

if (a < (BUFSIZE-1)) {
   a++;
} else {
   a = 0;
}

;-------------
;Assembly Code
MOV EAX, [a]          ;old offset
CMP EAX, (BUFSIZE-1)  ;a < (BUFSIZE-1) ? CF : NC
INC EAX               ;a++
SBB EDX, EDX          ;a < (BUFSIZE-1) ? 0xffffffff : 0
AND EAX, EDX          ;a < (BUFSIZE-1) ? a++ : 0
MOV [a], EAX          ;store new offset
Example 7 — Integer Signum Function:
//C Code
int a, s;

if (!a) {
   s = 0;
} else if (a < 0) {
   s = -1;
} else {
   s = 1;
}

;-------------
;Assembly Code
MOV EAX, [a]  ;load a
CDQ           ;t = a < 0 ? 0xffffffff : 0
CMP EDX, EAX  ;a > 0 ? CF : NC
ADC EDX, 0    ;a > 0 ? t+1 : t
MOV [s], EDX  ;signum(x)

Always Pair CALL and RETURN

When the 12-entry return address stack gets out of synchronization, the latency of returns increases. The return address stack becomes out of sync when:
- calls and returns do not match
- the depth of the return stack is exceeded because of too many levels of nested function calls

Replace Branches with Computation in 3DNow! Code

Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar and inhibit the SIMD processing that makes 3DNow! code superior. Also, branches based on 3DNow! comparisons require data to be passed to the integer units, which requires either transport through memory, or the use of MOVD reg, MMreg instructions. If the body of the branch is small, one can achieve higher performance by replacing the branch with computation. The computation simulates predicated execution or conditional moves. The principal tools for this are the following instructions: PCMPGT, PFCMPGT, PFCMPGE, PFMIN, PFMAX, PAND, PANDN, POR, PXOR.

Muxing Constructs

The most important construct to avoiding branches in 3DNow! and MMX code is a 2-way muxing construct that is equivalent to the ternary operator “?:” in C and C++. It is implemented using the PCMP/PFCMP, PAND, PANDN, and POR instructions. To maximize performance, it is important to apply the PAND and PANDN instructions in the proper order.
Example 1 (Avoid):
; r = (x < y) ? a : b ; ; in: mm0 a ; mm1 b ; mm2 x ; mm3 y ; out: mm1 r
PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0 MOVQ MM4, MM3 ; duplicate mask PANDN MM3, MM0 ; y > x ? 0 : a PAND MM1, MM4 ; y > x ? b : 0 POR MM1, MM3 ; r = y > x ? b : a
Because the use of PANDN destroys the mask created by PCMP, the mask needs to be saved, which requires an additional register. This adds an instruction, lengthens the dependency chain, and increases register pressure. Therefore 2-way muxing constructs should be written as follows.
Example 2 (Preferred):
; r = (x < y) ? a : b
;
; in:  mm0  a
;      mm1  b
;      mm2  x
;      mm3  y
; out: mm1  r

PCMPGTD MM3, MM2  ;y > x ? 0xffffffff : 0
PAND    MM1, MM3  ;y > x ? b : 0
PANDN   MM3, MM0  ;y > x ? 0 : a
POR     MM1, MM3  ;r = y > x ? b : a

Sample Code Translated into 3DNow! Code

The following examples use scalar code translated into 3DNow! code. Note that it is not recommended to use 3DNow! SIMD instructions for scalar code, because the advantage of 3DNow! instructions lies in their “SIMDness”. These examples are meant to demonstrate general techniques for translating source code with branches into branchless 3DNow! code. Scalar source code was chosen to keep the examples simple. These techniques work in an identical fashion for vector code.
Each example shows the C code and the resulting 3DNow! code.
Example 1: C code:
float x,y,z; if (x < y) {
z += 1.0; } else {
z -= 1.0; }
3DNow! code:
;in:  MM0 = x
;     MM1 = y
;     MM2 = z
;out: MM0 = z
MOVQ    MM3, MM0  ;save x
MOVQ    MM4, one  ;1.0
PFCMPGE MM0, MM1  ;x < y ? 0 : 0xffffffff
PSLLD   MM0, 31   ;x < y ? 0 : 0x80000000
PXOR    MM0, MM4  ;x < y ? 1.0 : -1.0
PFADD   MM0, MM2  ;x < y ? z+1.0 : z-1.0
Example 2: C code:

float x, z;

z = fabs(x);
if (z >= 1) {
   z = 1/z;
}
3DNow! code:
;in:  MM0 = x
;out: MM0 = z
MOVQ     MM5, mabs  ;0x7fffffff
PAND     MM0, MM5   ;z = abs(x)
PFRCP    MM2, MM0   ;1/z approx
MOVQ     MM1, MM0   ;save z
PFRCPIT1 MM0, MM2   ;1/z step
PFRCPIT2 MM0, MM2   ;1/z final
PFMIN    MM0, MM1   ;z = z < 1 ? z : 1/z
Example 3: C code:

float x, z, r, res;

z = fabs(x);
if (z < 0.575) {
   res = r;
} else {
   res = PI/2 - 2*r;
}
3DNow! code:
;in:  MM0 = x
;     MM1 = r
;out: MM0 = res
MOVQ    MM7, mabs  ;mask for absolute value
PAND    MM0, MM7   ;z = abs(x)
MOVQ    MM2, bnd   ;0.575
PCMPGTD MM2, MM0   ;z < 0.575 ? 0xffffffff : 0
MOVQ    MM3, pio2  ;pi/2
MOVQ    MM0, MM1   ;save r
PFADD   MM1, MM1   ;2*r
PFSUBR  MM1, MM3   ;pi/2 - 2*r
PAND    MM0, MM2   ;z < 0.575 ? r : 0
PANDN   MM2, MM1   ;z < 0.575 ? 0 : pi/2 - 2*r
POR     MM0, MM2   ;z < 0.575 ? r : pi/2 - 2*r
Example 4: C code:

#define PI 3.14159265358979323

float x, z, r, res;

/* 0 <= r <= PI/4 */
z = fabs(x);
if (z < 1) {
   res = r;
} else {
   res = PI/2 - r;
}
3DNow! code:
;in:  MM0 = x
;     MM1 = r
;out: MM1 = res
MOVQ    MM5, mabs  ;mask to clear sign bit
MOVQ    MM6, one   ;1.0
PAND    MM0, MM5   ;z = abs(x)
PCMPGTD MM6, MM0   ;z < 1 ? 0xffffffff : 0
MOVQ    MM4, pio2  ;pi/2
PFSUB   MM4, MM1   ;pi/2 - r
PANDN   MM6, MM4   ;z < 1 ? 0 : pi/2 - r
PFMAX   MM1, MM6   ;res = z < 1 ? r : pi/2 - r
Example 5: C code:

#define PI 3.14159265358979323

float x, y, xa, ya, r, res;
int xs, df;

xs = x < 0 ? 1 : 0;
xa = fabs(x);
ya = fabs(y);
df = (xa < ya);
if (xs && df) {
   res = PI/2 + r;
} else if (xs) {
   res = PI - r;
} else if (df) {
   res = PI/2 - r;
} else {
   res = r;
}
3DNow! code:

;in:  MM0 = r
;     MM1 = y
;     MM2 = x
;out: MM0 = res
MOVQ    MM7, sgn   ;mask to extract sign bit
MOVQ    MM5, mabs  ;mask to clear sign bit
PAND    MM7, MM2   ;xs = sign(x)
PAND    MM1, MM5   ;ya = abs(y)
PAND    MM2, MM5   ;xa = abs(x)
MOVQ    MM6, MM1   ;ya
PCMPGTD MM6, MM2   ;df = (xa < ya) ? 0xffffffff : 0
PSLLD   MM6, 31    ;df = bit<31>
MOVQ    MM5, MM7   ;xs
PXOR    MM7, MM6   ;xs^df ? 0x80000000 : 0
MOVQ    MM3, npio2 ;-pi/2
PXOR    MM5, MM3   ;xs ? pi/2 : -pi/2
PSRAD   MM6, 31    ;df ? 0xffffffff : 0
PANDN   MM6, MM5   ;xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2)
PFSUB   MM6, MM3   ;pr = pi/2 + (xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2))
POR     MM0, MM7   ;ar = xs^df ? -r : r
PFADD   MM0, MM6   ;res = ar + pr

Avoid the Loop Instruction

The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:

Example 1 (Avoid):

LOOP LABEL

Example 2 (Preferred):

DEC ECX
JNZ LABEL

Avoid Far Control Transfer Instructions

Avoid using far control transfer instructions. Far control transfer branches cannot be predicted by the branch target buffer (BTB).

Avoid Recursive Functions

Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive (tail-recursive) function is one in which the call to itself occurs as the last operation of the function, so it can be converted mechanically into a loop.

Example 1 (Avoid):

long fac(long a)
{
   if (a == 0) {
      return (1);
   } else {
      return (a * fac(a-1));
   }
}

Example 2 (Preferred):

long fac(long a)
{
   long t = 1;
   while (a > 0) {
      t *= a;
      a--;
   }
   return (t);
}
7

Scheduling Optimizations

This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.

Schedule Instructions According to their Latency

The AMD Athlon processor can execute up to three x86 instructions per cycle, with each x86 instruction possibly having a different latency. The AMD Athlon processor has flexible scheduling, but for absolute maximum performance, schedule instructions, especially FPU and 3DNow! instructions, according to their latency. Dependent instructions will then not have to wait on instructions with longer latencies.
See Appendix F, Instruction Dispatch and Execution Resources on page 187 for a list of latency numbers.

Unrolling Loops

Complete Loop Unrolling

Make use of the large AMD Athlon processor 64-Kbyte instruction cache and unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Complete unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.
Only unrolling very large loops can result in inefficient use of the L1 instruction cache. Loops can be unrolled completely if all of the following conditions are true (a short C illustration follows the list):

The loop is in a frequently executed piece of code.

The loop count is known at compile time.

The loop body, once unrolled, is less than 100 instructions, which is approximately 400 bytes of code.
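A minimal C illustration of complete unrolling (the function names are ours, not from this guide):

/* Rolled: the loop count (4) is known at compile time. */
int dot4(const int a[4], const int b[4])
{
    int i, sum = 0;
    for (i = 0; i < 4; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Completely unrolled: no loop counter, no branch, and all four
   products are exposed to the scheduler at once. */
int dot4_unrolled(const int a[4], const int b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}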

Partial Loop Unrolling

Partial loop unrolling can increase register pressure, which can make it inefficient due to the small number of registers in the x86 architecture. However, in certain situations, partial unrolling can be efficient due to the performance gains possible. Partial loop unrolling should be considered if the following conditions are met:
Spare registers are available
Loop body is small, so that loop overhead is significant
Number of loop iterations is likely > 10
Consider the following piece of C code:
double a[MAX_LENGTH], b[MAX_LENGTH];

for (i = 0; i < MAX_LENGTH; i++) {
   a[i] = a[i] + b[i];
}
Without loop unrolling, the code looks like the following:
Without Loop Unrolling:
MOV ECX, MAX_LENGTH
MOV EAX, OFFSET A
MOV EBX, OFFSET B

$add_loop:
FLD  QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD  EAX, 8
ADD  EBX, 8
DEC  ECX
JNZ  $add_loop
The loop consists of seven instructions. The AMD Athlon processor can decode/retire three instructions per cycle, so it cannot execute faster than three iterations in seven cycles, or 3/7 floating-point adds per cycle. However, the pipelined floating-point adder allows one add every cycle. In the following code, the loop is partially unrolled by a factor of two, which creates potential end cases that must be handled outside the loop:
With Partial Loop Unrolling:
MOV ECX, MAX_LENGTH
MOV EAX, offset A
MOV EBX, offset B
SHR ECX, 1
JNC $add_loop

FLD  QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD  EAX, 8
ADD  EBX, 8

$add_loop:
FLD  QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
FLD  QWORD PTR [EAX+8]
FADD QWORD PTR [EBX+8]
FSTP QWORD PTR [EAX+8]
ADD  EAX, 16
ADD  EBX, 16
DEC  ECX
JNZ  $add_loop
Now the loop consists of 10 instructions. Based on the decode/retire bandwidth of three OPs per cycle, this loop goes no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typical case, the loop count starts at some lower bound lo, increases by some fixed, positive increment inc for each iteration of the loop, and may not exceed some upper bound hi. The following example shows how to partially unroll such a loop by an unrolling factor of fac, and how to derive the loop control for the partially unrolled version of the loop.
Example 1 (rolled loop):
for (k = lo; k <= hi; k += inc) {
   x[k] = ...
}
Example 2 (partially unrolled loop):
for (k = lo; k <= (hi - (fac-1)*inc); k += fac*inc) {
   x[k] = ...
   x[k+inc] = ...
   ...
   x[k+(fac-1)*inc] = ...
}

/* handle end cases */
for (k = k; k <= hi; k += inc) {
   x[k] = ...
}
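As a concrete instance of the template above, the following sketch uses lo = 0, hi = n-1, inc = 1, and fac = 4 (the function is illustrative, not from this guide):

/* Partially unrolled by fac = 4: the main loop runs while
   k <= (n-1) - 3, so k, k+1, k+2, and k+3 are all valid indices,
   and the cleanup loop handles the remaining 0 to 3 iterations. */
void scale(float *x, int n, float s)
{
    int k;
    for (k = 0; k <= n - 4; k += 4) {
        x[k]   *= s;
        x[k+1] *= s;
        x[k+2] *= s;
        x[k+3] *= s;
    }
    for (; k < n; k++) {   /* handle end cases */
        x[k] *= s;
    }
}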

Use Function Inlining

Overview

Make use of the AMD Athlon processor's large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, which can increase load/store instructions for register spilling.
Function inlining has the advantage of eliminating function call overhead and allowing better register allocation and instruction scheduling at the site of the function call. The disadvantage is decreasing code locality, which can increase execution time due to instruction cache misses. Therefore, function inlining is an optimization that has to be used judiciously.
In general, due to its very large instruction cache, the AMD Athlon processor is less susceptible than other processors to the negative side effect of function inlining. Function call overhead on the AMD Athlon processor can be low because calls and returns are executed at high speed due to the use of prediction mechanisms. However, there is still overhead due to passing function arguments through memory, which creates STLF (store-to-load-forwarding) dependencies. Some compilers allow for a reduction of this overhead by allowing arguments to be passed in registers in one of their calling conventions, which has the drawback of constraining register allocation in the function and at the site of the function call.
In general, function inlining works best if the compiler can utilize feedback from a profiler to identify the function call sites most frequently executed. If such data is not available, a reasonable heuristic is to concentrate on function calls inside loops. Functions that are directly recursive should not be considered candidates for inlining. However, if they are end-recursive, the compiler should convert them to an iterative equivalent to avoid potential overflow of the AMD Athlon processor return prediction mechanism (return stack) during deep recursion. For best results, a compiler should support function inlining across multiple source files. In addition, a compiler should provide inline templates for commonly used library functions, such as sin(), strcmp(), or memcpy().

Always Inline Functions if Called from One Site

A function should always be inlined if it can be established that it is called from just one site in the code. For the C language, determination of this characteristic is made easier if functions are explicitly declared static unless they require external linkage. This case occurs quite frequently, as functionality that could be concentrated in a single large function is split across multiple small functions for improved maintainability and readability.
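A brief C sketch of this situation (the names are illustrative):

/* checksum_block() is declared static and has exactly one call site,
   so a compiler can prove it is local to this file and inline it. */
static unsigned checksum_block(const unsigned char *p, int n)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        sum += p[i];
    }
    return sum;
}

unsigned checksum_packet(const unsigned char *pkt, int len)
{
    return checksum_block(pkt, len);   /* the sole call site */
}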

Always Inline Functions with Fewer than 25 Machine Instructions

In addition, functions that create fewer than 25 machine instructions once inlined should always be inlined, because it is likely that the function call overhead is close to or more than the time spent executing the function body. For large functions, the benefits of reduced function call overhead give diminishing returns. Therefore, a function that results in the insertion of more than 500 machine instructions at the call site should probably not be inlined. Some larger functions might consist of multiple, relatively short paths that are negatively affected by function overhead. In such a case, it can be advantageous to inline larger functions. Profiling information is the best guide in determining whether to inline such large functions.

Avoid Address Generation Interlocks

Loads and stores are scheduled by the AMD Athlon processor to access the data cache in program order. Newer loads and stores with their addresses calculated can be blocked by older loads and stores whose addresses are not yet calculated – this is known as an address generation interlock. Therefore, it is advantageous to schedule loads and stores that can calculate their addresses quickly, ahead of loads and stores that require the resolution of a long dependency chain in order to generate their addresses. Consider the following code examples.

Example 1 (Avoid):

ADD EBX, ECX                  ;inst 1
MOV EAX, DWORD PTR [10h]      ;inst 2 (fast address calc.)
MOV ECX, DWORD PTR [EAX+EBX]  ;inst 3 (slow address calc.)
MOV EDX, DWORD PTR [24h]      ;this load is stalled from
                              ; accessing data cache due
                              ; to long latency for
                              ; generating address for
                              ; inst 3

Example 2 (Preferred):

ADD EBX, ECX                  ;inst 1
MOV EAX, DWORD PTR [10h]      ;inst 2
MOV EDX, DWORD PTR [24h]      ;place load above inst 3
                              ; to avoid address
                              ; generation interlock stall
MOV ECX, DWORD PTR [EAX+EBX]  ;inst 3

Use MOVZX and MOVSX

Use the MOVZX and MOVSX instructions to zero-extend and sign-extend byte-size and word-size operands to doubleword length. For example, typical code for zero extension creates a superset dependency when the zero-extended value is used, as in the following code:

Example 1 (Avoid):

XOR EAX, EAX
MOV AL, [MEM]

Example 2 (Preferred):

MOVZX EAX, BYTE PTR [MEM]

Minimize Pointer Arithmetic in Loops

Minimize pointer arithmetic in loops, especially if the loop body is small. In this case, the pointer arithmetic would cause significant overhead. Instead, take advantage of the complex addressing modes to utilize the loop counter to index into memory arrays. Using complex addressing modes does not have any negative impact on execution speed, but the reduced number of instructions preserves decode bandwidth.

Example 1 (Avoid):

int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;

for (i = 0; i < MAXSIZE; i++) {
   c[i] = a[i] + b[i];
}

MOV ECX, MAXSIZE    ;initialize loop counter
XOR ESI, ESI        ;initialize offset into array a
XOR EDI, EDI        ;initialize offset into array b
XOR EBX, EBX        ;initialize offset into array c

$add_loop:
MOV EAX, [ESI + a]  ;get element a
MOV EDX, [EDI + b]  ;get element b
ADD EAX, EDX        ;a[i] + b[i]
MOV [EBX + c], EAX  ;write result to c
ADD ESI, 4          ;increment offset into a
ADD EDI, 4          ;increment offset into b
ADD EBX, 4          ;increment offset into c
DEC ECX             ;decrement loop count
JNZ $add_loop       ;until loop count 0

Example 2 (Preferred):

int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;

for (i = 0; i < MAXSIZE; i++) {
   c[i] = a[i] + b[i];
}

MOV ECX, MAXSIZE-1    ;initialize loop counter

$add_loop:
MOV EAX, [ECX*4 + a]  ;get element a
MOV EDX, [ECX*4 + b]  ;get element b
ADD EAX, EDX          ;a[i] + b[i]
MOV [ECX*4 + c], EAX  ;write result to c
DEC ECX               ;decrement index
JNS $add_loop         ;until index negative
Note that the code in example 2 traverses the arrays in a downward direction (i.e., from higher addresses to lower addresses), whereas the original code in example 1 traverses the arrays in an upward direction. Such a change in the direction of the traversal is possible if each loop iteration is completely independent of all other loop iterations, as is the case here.
In code where the direction of the array traversal can't be switched, it is still possible to minimize pointer arithmetic by appropriately biasing base addresses and using an index variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as arguments of a function), biasing the base addresses requires additional instructions to perform the biasing at run time, and a small amount of additional overhead is incurred. In the examples shown here the base addresses are used in the displacement portion of the address, and biasing is accomplished at compile time by simply modifying the displacement.

Example 3 (Preferred):

int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;

for (i = 0; i < MAXSIZE; i++) {
   c[i] = a[i] + b[i];
}

MOV ECX, (-MAXSIZE)               ;initialize index

$add_loop:
MOV EAX, [ECX*4 + a + MAXSIZE*4]  ;get a element
MOV EDX, [ECX*4 + b + MAXSIZE*4]  ;get b element
ADD EAX, EDX                      ;a[i] + b[i]
MOV [ECX*4 + c + MAXSIZE*4], EAX  ;write result to c
INC ECX                           ;increment index
JNZ $add_loop                     ;until index==0
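The biasing technique can also be expressed in C. The following sketch (names are ours) biases the pointers past the ends of the arrays and counts a negative index up to zero, so the only loop-carried arithmetic is the index increment:

/* Bias each base pointer by maxsize elements; for i in
   [-maxsize, -1], ea[i], eb[i], and ec[i] address valid elements,
   and the loop exits when i reaches zero, matching the INC/JNZ
   idiom above. */
void add_arrays(const int *a, const int *b, int *c, int maxsize)
{
    const int *ea = a + maxsize, *eb = b + maxsize;
    int *ec = c + maxsize;
    int i;
    for (i = -maxsize; i != 0; i++) {
        ec[i] = ea[i] + eb[i];
    }
}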

Push Memory Data Carefully

Carefully choose the best method for pushing memory data. To reduce register pressure and code dependencies, follow example 2 below.

Example 1 (Avoid):

MOV EAX, [MEM]
PUSH EAX

Example 2 (Preferred):

PUSH [MEM]
8

Integer Optimizations

This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance.

Replace Divides with Multiplies

Replace integer division by constants with multiplication by the reciprocal. Because the AMD Athlon™ processor has a very fast integer multiply (5–9 cycles signed, 4–8 cycles unsigned) and the integer division delivers only one bit of quotient per cycle (22–47 cycles signed, 17–41 cycles unsigned), the equivalent code is much faster. The user can follow the examples in this chapter that illustrate the use of integer division by constants, or access the executables in the opt_utilities directory in the AMD documentation CD-ROM (order# 21860) to find alternative code for dividing by a constant.

Multiplication by Reciprocal (Division) Utility

The code for the utilities can be found at “Derivation of Multiplier Used for Integer Division by Constants” on page 93. All utilities were compiled for the Microsoft Windows® 95, Windows 98, and Windows NT® environments. All utilities are provided “as is” and are not supported by AMD.
Signed Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor. Type sdiv > example.out to output the code to a file.

Unsigned Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run udiv.exe in a DOS window to find the fastest code for unsigned division by a constant. The utility displays the code after the user enters an unsigned constant divisor. Type udiv > example.out to output the code to a file.

Unsigned Division by Multiplication of Constant

Algorithm: Divisors 1 <= d < 2^31, Odd d

The following code shows an unsigned division using a constant value multiplier.
;In:  d = divisor, 1 <= d < 2^31, odd d
;Out: a = algorithm
;     m = multiplier
;     s = shift factor

;algorithm 0
MOV EDX, dividend
MOV EAX, m
MUL EDX
SHR EDX, s     ;EDX = quotient

;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
ADC EDX, 0
SHR EDX, s     ;EDX = quotient
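For illustration, algorithm 0 is easily expressed in C. The divisor 5 with multiplier 0CCCCCCCDh and shift factor 2 is a standard example of such a derivation (the constants here are an assumption of this sketch, not output of the utility):

#include <stdint.h>

/* Unsigned division by the constant 5: multiply by m = 0xCCCCCCCD,
   take the high 32 bits of the product (EDX after MUL), then shift
   right by s = 2. */
uint32_t div5(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 32) >> 2;
}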
Derivation of a, m, s

The derivation for the algorithm (a), multiplier (m), and shift count (s) is found in the section “Unsigned Derivation for Algorithm, Multiplier, and Shift Factor” on page 93.
Algorithm: Divisors 2^31 <= d < 2^32

For divisors 2^31 <= d < 2^32, the possible quotient values are either 0 or 1. This makes it easy to establish the quotient by simple comparison of the dividend and divisor. In cases where the dividend needs to be preserved, example 1 below is recommended.
Example 1:
;In:  EAX = dividend
;Out: EDX = quotient
XOR EDX, EDX  ;0
CMP EAX, d    ;CF = (dividend < divisor) ? 1 : 0
SBB EDX, -1   ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
In cases where the dividend does not need to be preserved, the division can be accomplished without the use of an additional register, thus reducing register pressure. This is shown in example 2 below:
Example 2:

;In:  EDX = dividend
;Out: EAX = quotient
CMP EDX, d    ;CF = (dividend < divisor) ? 1 : 0
MOV EAX, 0    ;0
SBB EAX, -1   ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1

Simpler Code for Restricted Dividend

Integer division by a constant can be made faster if the range of the dividend is limited, which removes a shift associated with most divisors. For example, for a divide by 10 operation, use the following code if the dividend is less than 40000005h:

MOV EAX, dividend
MOV EDX, 01999999Ah
MUL EDX
MOV quotient, EDX

Signed Division by Multiplication of Constant

Algorithm: Divisors 2 <= d < 2^31
These algorithms work if the divisor is positive. If the divisor is negative, use abs(d) instead of d, and append a NEG EDX to the code. The code makes use of the fact that n/–d = –(n/d).
;In:  d = divisor, 2 <= d < 2^31
;Out: a = algorithm
;     m = multiplier
;     s = shift count

;algorithm 0
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX   ;quotient in EDX
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX   ;quotient in EDX
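A C sketch of algorithm 1 for the divisor 7, using the commonly derived pair m = 92492493h and s = 2 (these constants are an assumption of this illustration, not output of the utility):

#include <stdint.h>

/* Signed division by the constant 7. The multiplier is negative when
   interpreted as signed, so the dividend is added back after taking
   the high 32 bits of the product (IMUL/ADD), followed by the
   arithmetic shift and the sign-bit correction. */
int32_t div7(int32_t n)
{
    int32_t q = (int32_t)(((int64_t)n * (int32_t)0x92492493) >> 32);
    q += n;                  /* ADD EDX, ECX */
    q >>= 2;                 /* SAR EDX, s */
    q += (uint32_t)n >> 31;  /* add 1 if the dividend was negative */
    return q;
}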
Derivation for a, m, s

The derivation for the algorithm (a), multiplier (m), and shift count (s) is found in the section “Signed Derivation for Algorithm, Multiplier, and Shift Factor” on page 95.
Signed Division By 2

;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, 80000000h  ;CY = 1, if dividend >= 0
SBB EAX, -1         ;Increment dividend if it is < 0
SAR EAX, 1          ;Perform a right shift

Signed Division By 2^n

;IN:  EAX = dividend
;OUT: EAX = quotient
CDQ                 ;Sign extend into EDX
AND EDX, (2^n-1)    ;Mask correction (use divisor-1)
ADD EAX, EDX        ;Apply correction if necessary
SAR EAX, (n)        ;Perform right shift by log2(divisor)

Signed Division By -2

;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, 80000000h  ;CY = 1, if dividend >= 0
SBB EAX, -1         ;Increment dividend if it is < 0
SAR EAX, 1          ;Perform right shift
NEG EAX             ;Use (x/-2) == -(x/2)

Signed Division By -(2^n)

;IN:  EAX = dividend
;OUT: EAX = quotient
CDQ                 ;Sign extend into EDX
AND EDX, (2^n-1)    ;Mask correction (-divisor-1)
ADD EAX, EDX        ;Apply correction if necessary
SAR EAX, (n)        ;Right shift by log2(-divisor)
NEG EAX             ;Use (x/-(2^n)) == (-(x/2^n))
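The power-of-two cases translate directly into C. A minimal sketch for n = 3 (division by 8), assuming the compiler implements >> on signed values as an arithmetic shift, as x86 compilers do:

#include <stdint.h>

/* Signed division by 8 with truncation toward zero: (x >> 31) & 7
   adds the correction 2^n - 1 to negative dividends before the
   arithmetic shift, mirroring the CDQ/AND/ADD/SAR sequence. */
int32_t div8(int32_t x)
{
    return (x + ((x >> 31) & 7)) >> 3;
}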
Remainder of Signed Integer Division by 2 or -2

;IN:  EAX = dividend
;OUT: EAX = remainder
CDQ                 ;Sign extend into EDX
AND EAX, 1          ;Compute remainder
XOR EAX, EDX        ;Negate remainder if
SUB EAX, EDX        ;dividend was < 0
MOV [remainder], EAX
Remainder of Signed Integer Division by 2^n or -(2^n)

;IN:  EAX = dividend
;OUT: EAX = remainder
CDQ                 ;Sign extend into EDX
AND EDX, (2^n-1)    ;Mask correction (abs(divisor)-1)
ADD EAX, EDX        ;Apply pre-correction
AND EAX, (2^n-1)    ;Mask out remainder (abs(divisor)-1)
SUB EAX, EDX        ;Apply post-correction, if necessary
MOV [remainder], EAX
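The same pre-correct/mask/post-correct pattern in C, again for n = 3 (an illustrative sketch assuming arithmetic right shift on signed values):

#include <stdint.h>

/* Remainder of signed division by 8: negative dividends are
   pre-corrected by 2^n - 1, masked, then post-corrected, so the
   result carries the sign of the dividend. */
int32_t rem8(int32_t x)
{
    int32_t t = (x >> 31) & 7;   /* abs(divisor) - 1 for x < 0 */
    return ((x + t) & 7) - t;
}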

Use Alternative Code When Multiplying by a Constant

A 32-bit integer multiply by a constant has a latency of five cycles. Therefore, use alternative code when multiplying by certain constants. In addition, because there is just one multiply unit, the replacement code may provide better throughput.
The following code samples are designed such that the original source also receives the final result. Other sequences are possible if the result is in a different register. Adds have been favored over shifts to keep code size small. Generally, there is a fast replacement if the constant has very few 1 bits in binary.
More constants are found in the file multiply_by_constants.txt, located in the same SDK directory as this document.
by 2: ADD REG1, REG1 ;1 cycle
by 3: LEA REG1, [REG1*2+REG1] ;2 cycles
by 4: SHL REG1, 2 ;1 cycle
by 5: LEA REG1, [REG1*4+REG1] ;2 cycles
by 6: LEA REG2, [REG1*4+REG1] ;3 cycles
ADD REG1, REG2
by 7: MOV REG2, REG1 ;2 cycles
SHL REG1, 3
SUB REG1, REG2
by 8: SHL REG1, 3 ;1 cycle
by 9: LEA REG1, [REG1*8+REG1] ;2 cycles
by 10: LEA REG2, [REG1*8+REG1] ;3 cycles
ADD REG1, REG2
by 11: LEA REG2, [REG1*8+REG1] ;3 cycles
ADD REG1, REG1
ADD REG1, REG2
by 12: SHL REG1, 2
LEA REG1, [REG1*2+REG1] ;3 cycles
by 13: LEA REG2, [REG1*2+REG1] ;3 cycles
SHL REG1, 4
SUB REG1, REG2
by 14: LEA REG2, [REG1*4+REG1] ;3 cycles
LEA REG1, [REG1*8+REG1]
ADD REG1, REG2
by 15: MOV REG2, REG1 ;2 cycles
SHL REG1, 4
SUB REG1, REG2
by 16: SHL REG1, 4 ;1 cycle
by 17: MOV REG2, REG1 ;2 cycles
SHL REG1, 4
ADD REG1, REG2
by 18: ADD REG1, REG1 ;3 cycles
LEA REG1, [REG1*8+REG1]
by 19: LEA REG2, [REG1*2+REG1] ;3 cycles
SHL REG1, 4
ADD REG1, REG2
by 20: SHL REG1, 2 ;3 cycles
LEA REG1, [REG1*4+REG1]
by 21: LEA REG2, [REG1*4+REG1] ;3 cycles
SHL REG1, 4
ADD REG1, REG2
by 22: use IMUL
by 23: LEA REG2, [REG1*8+REG1] ;3 cycles
SHL REG1, 5
SUB REG1, REG2
by 24: SHL REG1, 3 ;3 cycles
LEA REG1, [REG1*2+REG1]
by 25: LEA REG2, [REG1*8+REG1] ;3 cycles
SHL REG1, 4
ADD REG1, REG2
by 26: use IMUL
by 27: LEA REG2, [REG1*4+REG1] ;3 cycles
SHL REG1, 5
SUB REG1, REG2
by 28: MOV REG2, REG1 ;3 cycles
SHL REG1, 3
SUB REG1, REG2
SHL REG1, 2
by 29: LEA REG2, [REG1*2+REG1] ;3 cycles
SHL REG1, 5
SUB REG1, REG2
by 30: MOV REG2, REG1 ;3 cycles
SHL REG1, 4
SUB REG1, REG2
ADD REG1, REG1
by 31: MOV REG2, REG1 ;2 cycles
SHL REG1, 5
SUB REG1, REG2
by 32: SHL REG1, 5 ;1 cycle
Use MMX Instructions for Integer-Only Work
In many programs it can be advantageous to use MMX instructions to do integer-only work, especially if the function already uses 3DNow! or MMX code. Using MMX instructions relieves register pressure on the integer registers. As long as data is simply loaded/stored, added, shifted, etc., MMX instructions are good substitutes for integer instructions. Integer registers are freed up with the following results:
May be able to reduce the number of integer registers to be saved/restored on function entry/exit.

Free up integer registers for pointers, loop counters, etc., so that they do not have to be spilled to memory, which reduces memory traffic and latency in dependency chains.
Be careful when passing data between MMX and integer registers not to create mismatched store-to-load forwarding cases. See “Unrolling Loops” on page 67.
In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.
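As a sketch of the idea using MMX intrinsics (a compiler-specific but widely available interface; the function name and the even-n assumption are ours):

#include <mmintrin.h>

/* Sum two integer arrays two elements at a time in MMX registers,
   leaving the integer registers free for the pointers and counter.
   n is assumed even; EMMS must execute before subsequent x87 code. */
void add_arrays_mmx(const int *a, const int *b, int *c, int n)
{
    int i;
    for (i = 0; i < n; i += 2) {
        __m64 va = *(const __m64 *)&a[i];
        __m64 vb = *(const __m64 *)&b[i];
        *(__m64 *)&c[i] = _mm_add_pi32(va, vb);   /* PADDD */
    }
    _mm_empty();   /* EMMS */
}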

Repeated String Instruction Usage

Latency of Repeated String Instructions

Table 1 shows the latency for repeated string instructions on the AMD Athlon processor.
Table 1. Latency of Repeated String Instructions

Instruction   ECX=0 (cycles)   DF = 0 (cycles)   DF = 1 (cycles)
REP MOVS      11               15 + (4/3*c)      25 + (4/3*c)
REP STOS      11               14 + (1*c)        24 + (1*c)
REP LODS      11               15 + (2*c)        15 + (2*c)
REP SCAS      11               15 + (5/2*c)      15 + (5/2*c)
REP CMPS      11               16 + (10/3*c)     16 + (10/3*c)

Note: c = value of ECX, (ECX > 0)
Table 1 lists the latencies with the direction flag (DF) = 0 (increment) and DF = 1 (decrement). These latencies assume aligned memory operands. Note that for MOVS/STOS, when DF = 1 (DOWN), the overhead portion of the latency increases significantly. However, such cases are less common. Use the formula and round up to the nearest integer value to determine the latency; for example, REP MOVS with ECX = 100 and DF = 0 takes 15 + ceil(4/3 * 100) = 149 cycles.

Guidelines for Repeated String Instructions

To help achieve good performance, this section contains guidelines for the careful scheduling of VectorPath repeated string instructions.
Use the Largest Possible Operand Size

Always move data using the largest operand size possible. For example, use REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB.