AMD Athlon™ Processor
x86 Code Optimization
Guide
© 1999 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.
Trademarks
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc.
Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
MMX is a trademark and Pentium is a registered trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
22007E/0—November 1999
AMD Athlon™ Processor x86 Code Optimization
Contents

Revision History

Chapter 1: Introduction
    About this Document
    AMD Athlon™ Processor Family
    AMD Athlon Processor Microarchitecture Summary

Chapter 2: Top Optimizations
    Optimization Star
    Group I Optimizations — Essential Optimizations
        Memory Size and Alignment Issues
        Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
        Select DirectPath Over VectorPath Instructions
    Group II Optimizations — Secondary Optimizations
        Load-Execute Instruction Usage
        Take Advantage of Write Combining
        Use 3DNow! Instructions
        Avoid Branches Dependent on Random Data
        Avoid Placing Code and Data in the Same 64-Byte Cache Line

Chapter 3: C Source Level Optimizations
    Ensure Floating-Point Variables and Expressions are of Type Float
    Use 32-Bit Data Types for Integer Code
    Consider the Sign of Integer Operands
    Use Array Style Instead of Pointer Style Code
    Completely Unroll Small Loops
    Avoid Unnecessary Store-to-Load Dependencies
    Consider Expression Order in Compound Branch Conditions
    Switch Statement Usage
        Optimize Switch Statements
    Use Prototypes for All Functions
    Use Const Type Qualifier
    Generic Loop Hoisting
        Generalization for Multiple Constant Control Code
    Declare Local Functions as Static
    Dynamic Memory Allocation Consideration
    Introduce Explicit Parallelism into Code
    Explicitly Extract Common Subexpressions
    C Language Structure Component Considerations
    Sort Local Variables According to Base Type Size
    Accelerating Floating-Point Divides and Square Roots
    Avoid Unnecessary Integer Division
    Copy Frequently De-referenced Pointer Arguments to Local Variables

Chapter 4: Instruction Decoding Optimizations
    Overview
    Select DirectPath Over VectorPath Instructions
    Load-Execute Instruction Usage
        Use Load-Execute Integer Instructions
        Use Load-Execute Floating-Point Instructions with Floating-Point Operands
        Avoid Load-Execute Floating-Point Instructions with Integer Operands
    Align Branch Targets in Program Hot Spots
    Use Short Instruction Lengths
    Avoid Partial Register Reads and Writes
    Replace Certain SHLD Instructions with Alternative Code
    Use 8-Bit Sign-Extended Immediates
    Use 8-Bit Sign-Extended Displacements
    Code Padding Using Neutral Code Fillers
        Recommendations for the AMD Athlon Processor
        Recommendations for AMD-K6® Family and AMD Athlon Processor Blended Code

Chapter 5: Cache and Memory Optimizations
    Memory Size and Alignment Issues
        Avoid Memory Size Mismatches
        Align Data Where Possible
    Use the 3DNow! PREFETCH and PREFETCHW Instructions
    Take Advantage of Write Combining
    Avoid Placing Code and Data in the Same 64-Byte Cache Line
    Store-to-Load Forwarding Restrictions
        Store-to-Load Forwarding Pitfalls—True Dependencies
        Summary of Store-to-Load Forwarding Pitfalls to Avoid
    Stack Alignment Considerations
    Align TBYTE Variables on Quadword Aligned Addresses
    C Language Structure Component Considerations
    Sort Variables According to Base Type Size

Chapter 6: Branch Optimizations
    Avoid Branches Dependent on Random Data
        AMD Athlon Processor Specific Code
        Blended AMD-K6 and AMD Athlon Processor Code
    Always Pair CALL and RETURN
    Replace Branches with Computation in 3DNow! Code
        Muxing Constructs
        Sample Code Translated into 3DNow! Code
    Avoid the Loop Instruction
    Avoid Far Control Transfer Instructions
    Avoid Recursive Functions

Chapter 7: Scheduling Optimizations
    Schedule Instructions According to their Latency
    Unrolling Loops
        Complete Loop Unrolling
        Partial Loop Unrolling
    Use Function Inlining
        Overview
        Always Inline Functions if Called from One Site
        Always Inline Functions with Fewer than 25 Machine Instructions
    Avoid Address Generation Interlocks
    Use MOVZX and MOVSX
    Minimize Pointer Arithmetic in Loops
    Push Memory Data Carefully

Chapter 8: Integer Optimizations
    Replace Divides with Multiplies
        Multiplication by Reciprocal (Division) Utility
        Unsigned Division by Multiplication of Constant
        Signed Division by Multiplication of Constant
    Use Alternative Code When Multiplying by a Constant
    Use MMX™ Instructions for Integer-Only Work
    Repeated String Instruction Usage
        Latency of Repeated String Instructions
        Guidelines for Repeated String Instructions
    Use XOR Instruction to Clear Integer Registers
    Efficient 64-Bit Integer Arithmetic
    Efficient Implementation of Population Count Function
    Derivation of Multiplier Used for Integer Division by Constants
        Unsigned Derivation for Algorithm, Multiplier, and Shift Factor
        Signed Derivation for Algorithm, Multiplier, and Shift Factor

Chapter 9: Floating-Point Optimizations
    Ensure All FPU Data is Aligned
    Use Multiplies Rather than Divides
    Use FFREEP Macro to Pop One Register from the FPU Stack
    Floating-Point Compare Instructions
    Use the FXCH Instruction Rather than FST/FLD Pairs
    Avoid Using Extended-Precision Data
    Minimize Floating-Point-to-Integer Conversions
    Floating-Point Subexpression Elimination
    Check Argument Range of Trigonometric Instructions Efficiently
    Take Advantage of the FSINCOS Instruction

Chapter 10: 3DNow!™ and MMX™ Optimizations
    Use 3DNow! Instructions
    Use FEMMS Instruction
    Use 3DNow! Instructions for Fast Division
        Optimized 14-Bit Precision Divide
        Optimized Full 24-Bit Precision Divide
        Pipelined Pair of 24-Bit Precision Divides
        Newton-Raphson Reciprocal
    Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root
        Optimized 15-Bit Precision Square Root
        Optimized 24-Bit Precision Square Root
        Newton-Raphson Reciprocal Square Root
    Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
    3DNow! and MMX Intra-Operand Swapping
    Fast Conversion of Signed Words to Floating-Point
    Use MMX PXOR to Negate 3DNow! Data
    Use MMX PCMP Instead of 3DNow! PFCMP
    Use MMX Instructions for Block Copies and Block Fills
    Use MMX PXOR to Clear All Bits in an MMX Register
    Use MMX PCMPEQD to Set All Bits in an MMX Register
    Use MMX PAND to Find Absolute Value in 3DNow! Code
    Optimized Matrix Multiplication
    Efficient 3D-Clipping Code Computation Using 3DNow! Instructions
    Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
    Stream of Packed Unsigned Bytes
    Complex Number Arithmetic

Chapter 11: General x86 Optimization Guidelines
    Short Forms
    Dependencies
    Register Operands
    Stack Allocation

Appendix A: AMD Athlon™ Processor Microarchitecture
    Introduction
    AMD Athlon Processor Microarchitecture
        Superscalar Processor
        Instruction Cache
        Predecode
        Branch Prediction
        Early Decoding
        Instruction Control Unit
        Data Cache
        Integer Scheduler
        Integer Execution Unit
        Floating-Point Scheduler
        Floating-Point Execution Unit
        Load-Store Unit (LSU)
        L2 Cache Controller
        Write Combining
        AMD Athlon System Bus

Appendix B: Pipeline and Execution Unit Resources Overview
    Fetch and Decode Pipeline Stages
    Integer Pipeline Stages
    Floating-Point Pipeline Stages
    Execution Unit Resources
        Terminology
        Integer Pipeline Operations
        Floating-Point Pipeline Operations
        Load/Store Pipeline Operations
        Code Sample Analysis

Appendix C: Implementation of Write Combining
    Introduction
    Write-Combining Definitions and Abbreviations
    What is Write Combining?
    Programming Details
    Write-Combining Operations
        Sending Write-Buffer Data to the System

Appendix D: Performance-Monitoring Counters
    Overview
    Performance Counter Usage
        PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h–C001_0003h)
        PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h)
        Starting and Stopping the Performance-Monitoring Counters
    Event and Time-Stamp Monitoring Software
    Monitoring Counter Overflow

Appendix E: Programming the MTRR and PAT
    Introduction
    Memory Type Range Register (MTRR) Mechanism
    Page Attribute Table (PAT)

Appendix F: Instruction Dispatch and Execution Resources

Appendix G: DirectPath versus VectorPath Instructions
    Select DirectPath Over VectorPath Instructions
    DirectPath Instructions
    VectorPath Instructions

Index
List of Figures
Figure 1. AMD Athlon™ Processor Block Diagram
Figure 2. Integer Execution Pipeline
Figure 3. Floating-Point Unit Block Diagram
Figure 4. Load/Store Unit
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages
Figure 7. Integer Execution Pipeline
Figure 8. Integer Pipeline Stages
Figure 9. Floating-Point Unit Block Diagram
Figure 10. Floating-Point Pipeline Stages
Figure 11. PerfEvtSel[3:0] Registers
Figure 12. MTRR Mapping of Physical Memory
Figure 13. MTRR Capability Register Format
Figure 14. MTRR Default Type Register Format
Figure 15. Page Attribute Table (MSR 277h)
Figure 16. MTRRphysBasen Register Format
Figure 17. MTRRphysMaskn Register Format
List of Tables

Table 1. Latency of Repeated String Instructions
Table 2. Integer Pipeline Operation Types
Table 3. Integer Decode Types
Table 4. Floating-Point Pipeline Operation Types
Table 5. Floating-Point Decode Types
Table 6. Load/Store Unit Stages
Table 7. Sample 1 – Integer Register Operations
Table 8. Sample 2 – Integer Register and Memory Load Operations
Table 9. Write Combining Completion Events
Table 10. AMD Athlon™ System Bus Commands Generation Rules
Table 11. Performance-Monitoring Counters
Table 12. Memory Type Encodings
Table 13. Standard MTRR Types and Properties
Table 14. PATi 3-Bit Encodings
Table 15. Effective Memory Type Based on PAT and MTRRs
Table 16. Final Output Memory Types
Table 17. MTRR Fixed Range Register Format
Table 18. MTRR-Related Model-Specific Register (MSR) Map
Table 19. Integer Instructions
Table 20. MMX™ Instructions
Table 21. MMX Extensions
Table 22. Floating-Point Instructions
Table 23. 3DNow!™ Instructions
Table 24. 3DNow! Extensions
Table 25. DirectPath Integer Instructions
Table 26. DirectPath MMX Instructions
Table 27. DirectPath MMX Extensions
Table 28. DirectPath Floating-Point Instructions
Table 29. VectorPath Integer Instructions
Table 30. VectorPath MMX Instructions
Table 31. VectorPath MMX Extensions
Table 32. VectorPath Floating-Point Instructions
Revision History

Date: Nov. 1999
Rev: E
Description:
Added “About this Document” on page 1.
Further clarification of “Consider the Sign of Integer Operands” on page 14.
Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Added the optimization, “Accelerating Floating-Point Divides and Square Roots” on page 29.
Clarified examples in “Copy Frequently De-referenced Pointer Arguments to Local Variables” on page 31.
Further clarification of “Select DirectPath Over VectorPath Instructions” on page 34.
Further clarification of “Align Branch Targets in Program Hot Spots” on page 36.
Further clarification of REP instruction as a filler in “Code Padding Using Neutral Code Fillers” on page 39.
Further clarification of “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46.
Modified examples 1 and 2 of “Unsigned Division by Multiplication of Constant” on page 78.
Added the optimization, “Efficient Implementation of Population Count Function” on page 91.
Further clarification of “Use FFREEP Macro to Pop One Register from the FPU Stack” on page 98.
Further clarification of “Minimize Floating-Point-to-Integer Conversions” on page 100.
Added the optimization, “Check Argument Range of Trigonometric Instructions Efficiently” on page 103.
Added the optimization, “Take Advantage of the FSINCOS Instruction” on page 105.
Further clarification of “Use 3DNow!™ Instructions for Fast Division” on page 108.
Further clarification of “Use FEMMS Instruction” on page 107.
Further clarification of “Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root” on page 110.
Clarified “3DNow!™ and MMX™ Intra-Operand Swapping” on page 112.
Corrected PCMPGT information in “Use MMX™ PCMP Instead of 3DNow!™ PFCMP” on page 114.
Added the optimization, “Use MMX™ Instructions for Block Copies and Block Fills” on page 115.
Modified the rule for “Use MMX™ PXOR to Clear All Bits in an MMX™ Register” on page 118.
Modified the rule for “Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register” on page 119.
Added the optimization, “Optimized Matrix Multiplication” on page 119.
Added the optimization, “Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions” on page 122.
Added the optimization, “Complex Number Arithmetic” on page 126.
Added Appendix E, “Programming the MTRR and PAT”.
Rearranged the appendices.
Added Index.
1 Introduction
The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors. However, to enable the fastest code execution with the AMD Athlon processor, programmers should write software that includes specific code optimization techniques.
About this Document
This document contains information to assist programmers in creating optimized code for the AMD Athlon processor. In addition to compiler and assembler designers, this document has been targeted to C and assembly language programmers writing execution-sensitive code sequences.
This document assumes that the reader possesses in-depth knowledge of the x86 instruction set, the x86 architecture (registers, programming modes, etc.), and the IBM PC-AT platform.
This guide has been written specifically for the AMD Athlon processor, but it includes considerations for previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:
Chapter 1: Introduction. Outlines the material covered in this document. Summarizes the AMD Athlon microarchitecture.
Chapter 2: Top Optimizations. Provides convenient descriptions of the most important optimizations a programmer should take into consideration.
Chapter 3: C Source Level Optimizations. Describes optimizations that C/C++ programmers can implement.
Chapter 4: Instruction Decoding Optimizations. Describes methods that will make the most efficient use of the three sophisticated instruction decoders in the AMD Athlon processor.
Chapter 5: Cache and Memory Optimizations. Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.
Chapter 6: Branch Optimizations. Describes optimizations that improve branch prediction and minimize branch penalties.
Chapter 7: Scheduling Optimizations. Describes optimizations that improve code scheduling for efficient execution resource utilization.
Chapter 8: Integer Optimizations. Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.
Chapter 9: Floating-Point Optimizations. Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines for Enhanced 3DNow! and MMX code optimization techniques.
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in detail the microarchitecture of the AMD Athlon processor.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
Appendix C: Implementation of Write Combining. Describes the algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance-Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.
Appendix E: Programming the MTRR and PAT. Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table.
Appendix F: Instruction Dispatch and Execution Resources. Lists the execution resource usage of each instruction.
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86 instructions that are DirectPath and VectorPath instructions.
AMD Athlon™ Processor Family
The AMD Athlon processor family uses state-of-the-art decoupled decode/execution design techniques to deliver next-generation performance with x86 binary software compatibility. This next-generation processor family advances x86 code execution by using flexible instruction predecoding, wide and balanced decoders, aggressive out-of-order execution, parallel integer execution pipelines, parallel floating-point execution pipelines, deep pipelined execution for higher delivered operating frequency, dedicated backside cache memory, and a new high-performance double-rate 64-bit local bus. As an x86 binary-compatible processor, the AMD Athlon processor implements the industry-standard x86 instruction set by decoding and executing the x86 instructions using a proprietary microarchitecture. This microarchitecture allows the delivery of maximum performance when running x86-based PC software.
AMD Athlon™ Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software. A brief summary of the next-generation design features implemented in the AMD Athlon processor is as follows:
■High-speed double-rate local bus interface
■Large, split 128-Kbyte level-one (L1) cache
■Dedicated backside level-two (L2) cache
■Instruction predecode and branch detection during cache line fills
■Decoupled decode/execution core
■Three-way x86 instruction decoding
■Dynamic scheduling and speculative execution
■Three-way integer execution
■Three-way address generation
■Three-way floating-point execution
■3DNow!™ technology and MMX™ single-instruction multiple-data (SIMD) instruction extensions
■Super data forwarding
■Deep out-of-order integer and floating-point execution
■Register renaming
■Dynamic branch prediction
The AMD Athlon processor communicates through a next-generation high-speed local bus that is beyond the current Socket 7 or Super7™ bus standard. The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock (see “AMD Athlon™ System Bus” on page 139 for more information).
To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the AMD Athlon execution core to achieve and sustain maximum performance.
As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.
The AMD Athlon processor includes both the industry-standard MMX SIMD integer instructions and the 3DNow! SIMD floating-point instructions that were first introduced in the AMD-K6®-2 processor. The design of 3DNow! technology was based on suggestions from leading graphics and independent software vendors (ISVs). Using SIMD format, the AMD Athlon processor can generate up to four 32-bit, single-precision floating-point results per clock cycle.
The 3DNow! execution units allow for high-performance floating-point vector operations, which can replace x87 instructions and enhance the performance of 3D graphics and other floating-point-intensive applications. Because the 3DNow! architecture uses the same registers as the MMX instructions, switching between MMX and 3DNow! has no penalty.
The AMD Athlon processor designers took another innovative step by carefully integrating the traditional x87 floating-point, MMX, and 3DNow! execution units into one operational engine. With the introduction of the AMD Athlon processor, the switching overhead between x87, MMX, and 3DNow! technology is virtually eliminated. The AMD Athlon processor combined with 3DNow! technology brings a better multimedia experience to mainstream PC users while maintaining backwards compatibility with all existing x86 software.
Although the AMD Athlon processor can extract code parallelism on-the-fly from off-the-shelf, commercially available x86 software, specific code optimization for the AMD Athlon processor can result in even higher delivered performance. This document describes the proprietary microarchitecture in the AMD Athlon processor and makes recommendations for optimizing execution of x86 software on the processor.
The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance. Due to the more flexible pipeline control and aggressive out-of-order execution, the AMD Athlon processor is not as sensitive to instruction selection and code scheduling. This flexibility is one of the distinct advantages of the AMD Athlon processor.
The AMD Athlon processor uses the latest in processor microarchitecture design techniques to provide the highest x86 performance for today’s PC. In short, the AMD Athlon processor offers true next-generation performance with x86 binary software compatibility.
2 Top Optimizations
This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. The optimizations in this chapter are divided into two groups and listed in order of importance.
Group I: Essential Optimizations
Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:
■Memory Size and Alignment Issues
  - Avoid memory size mismatches
  - Align data where possible
■Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
■Select DirectPath Over VectorPath Instructions
Group II: Secondary Optimizations
Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:
■Load-Execute Instruction Usage
  - Use Load-Execute instructions
  - Avoid load-execute floating-point instructions with integer operands
■Take Advantage of Write Combining
■Use 3DNow! Instructions
■Avoid Branches Dependent on Random Data
■Avoid Placing Code and Data in the Same 64-Byte Cache Line
Optimization Star
The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.
Memory Size and Alignment Issues
See “Memory Size and Alignment Issues” on page 45 for more details.
Avoid Memory Size Mismatches
Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same data, keep operands aligned and keep the loads/stores of each operand the same size.
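The following C sketch illustrates the kind of access pattern this guideline warns about. It is only an illustration: the function names are hypothetical, a little-endian (x86) byte order is assumed, and an optimizing compiler may transform either function, so the generated code should be inspected before drawing conclusions.

```c
#include <stdint.h>
#include <string.h>

/* Mismatch-prone pattern: two 32-bit stores followed by one 64-bit
 * load of the same memory. The load is wider than either store, so
 * the store and reload sizes do not match. */
uint64_t combine_via_memory(uint32_t lo, uint32_t hi)
{
    uint32_t halves[2];
    uint64_t whole;

    halves[0] = lo;                        /* 32-bit store */
    halves[1] = hi;                        /* 32-bit store */
    memcpy(&whole, halves, sizeof whole);  /* 64-bit reload */
    return whole;
}

/* Alternative: combine the halves with shifts and ORs so that no
 * store/reload pair of differing sizes is needed. */
uint64_t combine_in_registers(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Both functions return the same value on a little-endian machine; only the second avoids reloading data with a different operand size than was used to store it.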
Align Data Where Possible
Avoid misaligned data references. A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline.
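A minimal C sketch of one way to keep data aligned is shown below. The structure, the buffer, and the helper name are hypothetical, and the example assumes a C11 compiler that provides alignas.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with members ordered largest first, so that each
 * member falls on its natural boundary without interior padding. */
struct sample {
    double   value;   /* 8-byte member */
    uint32_t count;   /* 4-byte member */
    uint16_t tag;     /* 2-byte member */
    uint8_t  flag;    /* 1-byte member */
};

/* Buffer forced onto a 64-byte boundary, the cache line size that the
 * text gives for the AMD Athlon processor. */
alignas(64) unsigned char line_buffer[64];

/* Returns 1 when p is a multiple of align (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1u)) == 0;
}
```

A debug build could call is_aligned() on pointers handed to performance-critical routines to catch misaligned references early.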
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All the prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:
Prefetch Length = 200 (DS/C)
■Round up to the nearest cache line.
■DS is the data stride per loop iteration.
■C is the number of cycles per loop iteration when hitting in the L1 cache.
See “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46 for more details.
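The prefetch formula above can be worked through in C as follows. This is a sketch under stated assumptions: the function name is hypothetical, DS is taken to be in bytes, and a cache line is taken to be 64 bytes, as the text states elsewhere for the AMD Athlon processor.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* assumed cache line size in bytes */

/* Prefetch Length = 200 * (DS / C), rounded up to a whole cache line.
 * ds_bytes: data stride per loop iteration (assumed to be in bytes).
 * cycles:   cycles per loop iteration when hitting in the L1 cache
 *           (must be nonzero). */
size_t prefetch_length(size_t ds_bytes, size_t cycles)
{
    /* Ceiling of 200 * DS / C in integer arithmetic. */
    size_t raw = (200u * ds_bytes + cycles - 1u) / cycles;

    /* Round up to the next multiple of the cache line size. */
    return (raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For example, a loop that consumes 16 bytes per iteration in 10 cycles gives 200 × 16 / 10 = 320 bytes, which is already a whole number of cache lines (five). A loop that consumes 8 bytes in 6 cycles gives about 267 bytes, which rounds up to 320.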
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction. Three DirectPath instructions can be decoded in parallel. Using VectorPath instructions will block DirectPath instructions from decoding simultaneously.
See Appendix G, “DirectPath versus VectorPath Instructions” on page 219 for a list of DirectPath and VectorPath instructions.
Group II Optimizations—Secondary Optimizations
Load-Execute Instruction Usage
See “Load-Execute Instruction Usage” on page 34 for more details.
Use Load-Execute Instructions
Wherever possible, use load-execute instructions to increase code density with the one exception described below. The split-instruction form of load-execute instructions can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle.
Take Advantage of Write Combining
This guideline applies only to operating system, device driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer.
See Appendix C, “Implementation of Write Combining” on page 155 for more details.
Use 3DNow!™ Instructions
Unless accuracy requirements dictate otherwise, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.
See Table 23 on page 217 for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow!™ Technology Manual, order# 21928.
Avoid Branches Dependent on Random Data
Avoid data-dependent branches around a single instruction. Data-dependent branches acting upon basically random data can cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences, which results in shorter average execution time.
See “Avoid Branches Dependent on Random Data” on page 57 for more details.
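As an illustration of a branch-free alternative, the C sketch below counts the elements of an array that fall below a limit. The function names are hypothetical, and whether the compiler actually emits branch-free code for the second version is an assumption that should be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Branching version: a data-dependent branch around a single
 * increment. With random input the branch is hard to predict. */
uint32_t count_below_branchy(const int32_t *v, size_t n, int32_t limit)
{
    uint32_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < limit)
            count++;
    }
    return count;
}

/* Branch-free version: the comparison yields 0 or 1, which is added
 * unconditionally, so no branch depends on the data values. */
uint32_t count_below_branchless(const int32_t *v, size_t n, int32_t limit)
{
    uint32_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (uint32_t)(v[i] < limit);
    return count;
}
```

Both functions return the same result; only the loop-control branch, which is highly predictable, remains in the second version.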
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified. In order to maintain cache coherency, the AMD Athlon processor may thrash its caches, resulting in lower performance.
In general the following should be avoided:
■Self-modifying code
■Storing data in code segments
See “Avoid Placing Code and Data in the Same 64-Byte Cache Line” on page 50 for more details.
3
This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance.
Ensure Floating-Point Variables and Expressions are of Type Float
For compilers that generate 3DNow!™ instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants. These require a suffix of “F” or “f” (for example, 3.14f) in order to be of type float, otherwise they default to type double. To avoid automatic promotion of float arguments to double, always use function prototypes for all functions that accept float arguments.
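The short C sketch below illustrates these points. The function name is hypothetical; the constant 3.14f is the one used in the text. The two static assertions assume a C11 compiler and simply confirm that a suffixed constant has type float while an unsuffixed one has type double.

```c
/* A suffixed constant has type float; an unsuffixed one is a double. */
_Static_assert(sizeof(3.14f) == sizeof(float),  "3.14f has type float");
_Static_assert(sizeof(3.14)  == sizeof(double), "3.14 has type double");

/* The prototype is visible before any call, so a float argument is
 * passed as float rather than being promoted to double. */
float circumference(float radius);

float circumference(float radius)
{
    /* Every constant carries the f suffix, so the whole expression
     * has type float. */
    return 2.0f * 3.14f * radius;
}
```

Writing 2.0 * 3.14 * radius instead would make the expression type double, with the result converted back to float only on return.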
Use 32-Bit Data Types for Integer Code
Use 32-bit data types for integer code. Compiler implementations vary, but typically the following data types are included: int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, and unsigned long int.
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required so an unsigned type is appropriate. However, recording temperatures in degrees Celsius may require both positive and negative numbers so a signed type is needed.
Where there is a choice of using either a signed or an unsigned type, it should be considered that certain operations are faster with unsigned types while others are faster for signed types.
Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 FPU provides instructions for converting signed integers to floating-point, but has no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted as follows:
double x;           ====>   MOV  [temp+4], 0
unsigned int i;             MOV  EAX, i
                            MOV  [temp], EAX
x = i;                      FILD QWORD PTR [temp]
                            FSTP QWORD PTR [x]
This code is slow not only because of the number of instructions but also because a size mismatch prevents store-to-load forwarding to the FILD instruction.
double x;           ====>   FILD DWORD PTR [i]
int i;                      FSTP QWORD PTR [x]
x = i;
Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. In a typical case, a 32-bit integer is divided by four as follows: