IA-32 Intel® Architecture
Optimization Reference
Manual
Order Number: 248966-013US
April 2006
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.
This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel® processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.
Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading for more information including details on which processors support HT Technology.
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Pentium D, Itanium, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 1999-2006 Intel Corporation.
Contents
Introduction
Chapter 1 IA-32 Intel® Architecture Processor Family Overview
SIMD Technology.................................................................................................................... 1-2
Summary of SIMD Technologies...................................................................................... 1-5
MMX™ Technology..................................................................................................... 1-5
Streaming SIMD Extensions....................................................................................... 1-5
Streaming SIMD Extensions 2.................................................................................... 1-6
Streaming SIMD Extensions 3.................................................................................... 1-6
Intel® Extended Memory 64 Technology (Intel® EM64T)................................................. 1-7
Intel NetBurst® Microarchitecture............................................................................................ 1-8
Design Goals of Intel NetBurst Microarchitecture ............................................................ 1-8
Overview of the Intel NetBurst Microarchitecture Pipeline ............................................... 1-9
The Front End........................................................................................................... 1-11
The Out-of-order Core.............................................................................................. 1-12
Retirement................................................................................................................ 1-12
Front End Pipeline Detail............................................................................................... 1-13
Prefetching................................................................................................................ 1-13
Decoder.................................................................................................................... 1-14
Execution Trace Cache ............................................................................................ 1-14
Branch Prediction ..................................................................................................... 1-15
Execution Core Detail..................................................................................................... 1-16
Instruction Latency and Throughput......................................................................... 1-17
Execution Units and Issue Ports............................................................................... 1-18
Caches...................................................................................................................... 1-19
Data Prefetch............................................................................................................ 1-21
Loads and Stores...................................................................................................... 1-24
Store Forwarding...................................................................................................... 1-25
Intel® Pentium® M Processor Microarchitecture.................................................................. 1-26
The Front End........................................................................................................... 1-27
Data Prefetching....................................................................................................... 1-29
Out-of-Order Core..................................................................................................... 1-30
In-Order Retirement.................................................................................................. 1-31
Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors........................ 1-31
Front End........................................................................................................................ 1-32
Data Prefetching............................................................................................................. 1-33
Hyper-Threading Technology................................................................................................ 1-33
Processor Resources and Hyper-Threading Technology............................................... 1-36
Replicated Resources............................................................................................... 1-36
Partitioned Resources .............................................................................................. 1-36
Shared Resources.................................................................................................... 1-37
Microarchitecture Pipeline and Hyper-Threading Technology........................................ 1-38
Front End Pipeline......................................................................................................... 1-38
Execution Core............................................................................................................... 1-39
Retirement...................................................................................................................... 1-39
Multi-Core Processors........................................................................................................... 1-39
Microarchitecture Pipeline and Multi-Core Processors................................................... 1-42
Shared Cache in Intel Core Duo Processors ................................................................. 1-42
Load and Store Operations....................................................................................... 1-42
Chapter 2 General Optimization Guidelines
Tuning to Achieve Optimum Performance.............................................................................. 2-1
Tuning to Prevent Known Coding Pitfalls................................................................................ 2-2
General Practices and Coding Guidelines.............................................................................. 2-3
Use Available Performance Tools..................................................................................... 2-4
Optimize Performance Across Processor Generations.................................................... 2-4
Optimize Branch Predictability........................ .................................................................. 2-5
Optimize Memory Access................................................................................................. 2-5
Optimize Floating-point Performance............................................................................... 2-6
Optimize Instruction Selection.......................................................................................... 2-6
Optimize Instruction Scheduling....................................................................................... 2-7
Enable Vectorization......................................................................................................... 2-7
Coding Rules, Suggestions and Tuning Hints......................................................................... 2-8
Performance Tools.................................................................................................................. 2-9
Intel® C++ Compiler ......................................................................................................... 2-9
General Compiler Recommendations............................................................................ 2-10
VTune™ Performance Analyzer..................................................................................... 2-10
Processor Perspectives........................................................................................................ 2-11
CPUID Dispatch Strategy and Compatible Code Strategy............................................. 2-13
Transparent Cache-Parameter Strategy......................................................................... 2-14
Threading Strategy and Hardware Multi-Threading Support.......................................... 2-14
Branch Prediction.................................................................................................................. 2-15
Eliminating Branches...................................................................................................... 2-15
Spin-Wait and Idle Loops........................................ ... ..................................................... 2-18
Static Prediction.............................................................................................................. 2-19
Inlining, Calls and Returns ............................................................................................. 2-22
Branch Type Selection ................................................................................................... 2-23
Loop Unrolling ............................................................................................................... 2-26
Compiler Support for Branch Prediction......................................................................... 2-28
Memory Accesses................................................................................................................. 2-29
Alignment ....................................................................................................................... 2-29
Store Forwarding............................................................................................................ 2-32
Store-to-Load-Forwarding Restriction on Size and Alignment.................................. 2-33
Store-forwarding Restriction on Data Availability...................................................... 2-38
Data Layout Optimizations............................................................................................. 2-39
Stack Alignment.............................................................................................................. 2-42
Capacity Limits and Aliasing in Caches.......................................................................... 2-43
Capacity Limits in Set-Associative Caches............................................................... 2-44
Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors ............................. 2-45
Aliasing Cases in the Pentium M Processor............................................................. 2-46
Mixing Code and Data.................................................................................................... 2-47
Self-modifying Code ................................................................................................. 2-47
Write Combining............................................................................................................. 2-48
Locality Enhancement.................................................................................................... 2-50
Minimizing Bus Latency.................................................................................................. 2-52
Non-Temporal Store Bus Traffic ..................................................................................... 2-53
Prefetching..................................................................................................................... 2-55
Hardware Instruction Fetching.................................................................................. 2-55
Software and Hardware Cache Line Fetching.......................................................... 2-55
Cacheability Instructions ................................................................................................ 2-56
Code Alignment............................................................................ .................................. 2-57
Improving the Performance of Floating-point Applications.................................................... 2-57
Guidelines for Optimizing Floating-point Code............................................................... 2-58
Floating-point Modes and Exceptions............................................................................ 2-60
Floating-point Exceptions ......................................................................................... 2-60
Floating-point Modes................................................................................................ 2-62
Improving Parallelism and the Use of FXCH.................................................................. 2-68
x87 vs. Scalar SIMD Floating-point Trade-offs............................................................... 2-69
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo
Processors............................................................................................................. 2-70
Memory Operands.................................................. ... ... .................................... ... ........... 2-71
Floating-Point Stalls............................ ... ..................................... .. .................................. 2-72
x87 Floating-point Operations with Integer Operands .............................................. 2-72
x87 Floating-point Comparison Instructions............................................................. 2-72
Transcendental Functions ........................................................................................ 2-72
Instruction Selection.............................................................................................................. 2-73
Complex Instructions........................................... ... ........................................................ 2-74
Use of the lea Instruction................................................................................................ 2-74
Use of the inc and dec Instructions................................................................................ 2-75
Use of the shift and rotate Instructions........................................................................... 2-75
Flag Register Accesses.................................................................................................. 2-75
Integer Divide................................................................................................................. 2-76
Operand Sizes and Partial Register Accesses............................................................... 2-76
Prefixes and Instruction Decoding.............................................. .................................... 2-80
REP Prefix and Data Movement..................................................................................... 2-81
Address Calculations...................................................................................................... 2-86
Clearing Registers.......................................................................................................... 2-87
Compares....................................................................................................................... 2-87
Floating Point/SIMD Operands....................................................................................... 2-88
Prolog Sequences.......................................................................................................... 2-90
Code Sequences that Operate on Memory Operands................................................... 2-90
Instruction Scheduling........................................................................................................... 2-91
Latencies and Resource Constraints.............................................................................. 2-91
Spill Scheduling.............................................................................................................. 2-92
Scheduling Rules for the Pentium 4 Processor Decoder ............................................... 2-92
Scheduling Rules for the Pentium M Processor Decoder .............................................. 2-93
Vectorization ......................................................................................................................... 2-93
Miscellaneous....................................................................................................................... 2-95
NOPs.............................................................................................................................. 2-95
Summary of Rules and Suggestions..................................................................................... 2-96
User/Source Coding Rules............................................................................................. 2-97
Assembly/Compiler Coding Rules.................................................................................. 2-99
Tuning Suggestions...................................................................................................... 2-108
Chapter 3 Coding for SIMD Architectures
Checking for Processor Support of SIMD Technologies ......................................................... 3-2
Checking for MMX Technology Support ........................................................................... 3-2
Checking for Streaming SIMD Extensions Support.......................................................... 3-3
Checking for Streaming SIMD Extensions 2 Support....................................................... 3-5
Checking for Streaming SIMD Extensions 3 Support....................................................... 3-6
Considerations for Code Conversion to SIMD Programming.................................................. 3-8
Identifying Hot Spots...................................................................................................... 3-10
Determine If Code Benefits by Conversion to SIMD Execution...................................... 3-11
Coding Techniques ............................................................................................................... 3-12
Coding Methodologies.................................................................................................... 3-13
Assembly.................................................................................................................. 3-15
Intrinsics.................................................................................................................... 3-15
Classes..................................................................................................................... 3-17
Automatic Vectorization............................................ ... ... .................................... ... ... 3-18
Stack and Data Alignment..................................................................................................... 3-20
Alignment and Contiguity of Data Access Patterns........................................................ 3-20
Using Padding to Align Data..................................................................................... 3-20
Using Arrays to Make Data Contiguous.................................................................... 3-21
Stack Alignment For 128-bit SIMD Technologies ........................................................... 3-22
Data Alignment for MMX Technology............................................................................. 3-23
Data Alignment for 128-bit data...................................................................................... 3-24
Compiler-Supported Alignment................................................................................. 3-24
Improving Memory Utilization................................................................................................ 3-27
Data Structure Layout..................................................................................................... 3-27
Strip Mining..................................................................................................................... 3-32
Loop Blocking................................................................................................................. 3-34
Instruction Selection.............................................................................................................. 3-37
SIMD Optimizations and Microarchitectures.................................................................. 3-38
Tuning the Final Application.................................................................................................. 3-39
Chapter 4 Optimizing for SIMD Integer Applications
General Rules on SIMD Integer Code .................................................................................... 4-2
Using SIMD Integer with x87 Floating-point............................................................................ 4-3
Using the EMMS Instruction............................................................................................. 4-3
Guidelines for Using EMMS Instruction............................................................................ 4-4
Data Alignment............................... ... ...................................................................... ................ 4-6
Data Movement Coding Techniques....................................................................................... 4-6
Unsigned Unpack............................................................................................................. 4-6
Signed Unpack................................................................................................................. 4-7
Interleaved Pack with Saturation...................................................................................... 4-8
Interleaved Pack without Saturation............................................................................... 4-10
Non-Interleaved Unpack................................................................................................. 4-11
Extract Word................................................ ................................................................... 4-13
Insert Word..................................................................................................................... 4-14
Move Byte Mask to Integer............................................................................................. 4-16
Packed Shuffle Word for 64-bit Registers ...................................................................... 4-18
Packed Shuffle Word for 128-bit Registers .................................................................... 4-19
Unpacking/interleaving 64-bit Data in 128-bit Registers................................................. 4-20
Data Movement.............................................................................................................. 4-21
Conversion Instructions.................................................................................................. 4-21
Generating Constants................................................................. ... .................................... ... 4-21
Building Blocks...................................................................................................................... 4-23
Absolute Difference of Unsigned Numbers .................................................................... 4-23
Absolute Difference of Signed Numbers........................................................................ 4-24
Absolute Value ................................................................................................................ 4-25
Clipping to an Arbitrary Range [high, low]...................................................................... 4-26
Highly Efficient Clipping............................................................................................ 4-27
Clipping to an Arbitrary Unsigned Range [high, low]................................................ 4-28
Packed Max/Min of Signed Word and Unsigned Byte.................................................... 4-29
Signed Word............................................................................................................. 4-29
Unsigned Byte .......................................................................................................... 4-30
Packed Multiply High Unsigned...................................................................................... 4-30
Packed Sum of Absolute Differences............................................................................. 4-30
Packed Average (Byte/Word)......................................................................................... 4-31
Complex Multiply by a Constant........................................... .......................................... 4-32
Packed 32*32 Multiply.................................................................................................... 4-33
Packed 64-bit Add/Subtract............................................................................................ 4-33
128-bit Shifts................................................................................................................... 4-33
Memory Optimizations.......................................................................................................... 4-34
Partial Memory Accesses............................................................................................... 4-35
Supplemental Techniques for Avoiding Cache Line Splits........................................ 4-37
Increasing Bandwidth of Memory Fills and Video Fills ................................................... 4-39
Increasing Memory Bandwidth Using the MOVDQ Instruction................................. 4-39
Increasing Memory Bandwidth by Loading and Storing to and from the
Same DRAM Page ................................................................................................ 4-39
Increasing UC and WC Store Bandwidth by Using Aligned Stores........................... 4-40
Converting from 64-bit to 128-bit SIMD Integer .................................................................... 4-40
SIMD Optimizations and Microarchitectures.................................................................. 4-41
Packed SSE2 Integer versus MMX Instructions....................................................... 4-42
Chapter 5 Optimizing for SIMD Floating-point Applications
General Rules for SIMD Floating-point Code.......................................................................... 5-1
Planning Considerations......................................................................................................... 5-2
Using SIMD Floating-point with x87 Floating-point................................................................. 5-3
Scalar Floating-point Code...................................................................................................... 5-3
Data Alignment............................... ... ...................................................................... ................ 5-4
Data Arrangement............................................................................................................ 5-4
Vertical versus Horizontal Computation...................................................................... 5-5
Data Swizzling............................................................................................................ 5-9
Data Deswizzling...................................................................................................... 5-14
Using MMX Technology Code for Copy or Shuffling Functions................................ 5-17
Horizontal ADD Using SSE....................................................................................... 5-18
Use of cvttps2pi/cvttss2si Instructions.................................................................................. 5-21
Flush-to-Zero and Denormals-are-Zero Modes .................................................................... 5-22
SIMD Floating-point Programming Using SSE3................................................................... 5-22
SSE3 and Complex Arithmetics ................................................. .. .................................. 5-23
SSE3 and Horizontal Computation................................................................................. 5-26
SIMD Optimizations and Microarchitectures.................................................................. 5-27
Packed Floating-Point Performance......................................................................... 5-27
Chapter 6 Optimizing Cache Usage
General Prefetch Coding Guidelines....................................................................................... 6-2
Hardware Prefetching of Data................................................................................................. 6-4
Prefetch and Cacheability Instructions.................................................................................... 6-5
Prefetch................................................................................................................................... 6-6
Software Data Prefetch.......................................... ... .................................... ... ................ 6-6
The Prefetch Instructions – Pentium 4 Processor Implementation................................... 6-8
Prefetch and Load Instructions......................................................................................... 6-8
Cacheability Control................................................................................................................ 6-9
The Non-temporal Store Instructions............................ .. ........................................ ........ 6-10
Fencing..................................................................................................................... 6-10
Streaming Non-temporal Stores ............................................................................... 6-10
Memory Type and Non-temporal Stores................................................................... 6-11
Write-Combining....................................................................................................... 6-12
Streaming Store Usage Models...................................................................................... 6-13
Coherent Requests................................................................................................... 6-13
Non-coherent requests............................................................................................. 6-13
Streaming Store Instruction Descriptions ....................................................................... 6-14
The fence Instructions.................................................................................................... 6-15
The sfence Instruction.............................................................................................. 6-15
The lfence Instruction............................................................................................... 6-16
The mfence Instruction............................................................................................. 6-16
The clflush Instruction .................................................................................................... 6-17
Memory Optimization Using Prefetch.................................................................................... 6-18
Software-controlled Prefetch........................................................... ... ... ......................... 6-18
Hardware Prefetch ....................................................... .. ..................................... ... ........ 6-19
Example of Effective Latency Reduction with H/W Prefetch.......................................... 6-20
Example of Latency Hiding with S/W Prefetch Instruction ............................................ 6-22
Software Prefetching Usage Checklist........................................................................... 6-24
Software Prefetch Scheduling Distance......................................................................... 6-25
Software Prefetch Concatenation................................................................................... 6-26
Minimize Number of Software Prefetches...................................................................... 6-29
Mix Software Prefetch with Computation Instructions.................................................... 6-32
Software Prefetch and Cache Blocking Techniques....................................................... 6-34
Hardware Prefetching and Cache Blocking Techniques ................................................ 6-39
Single-pass versus Multi-pass Execution....................................................................... 6-41
Memory Optimization using Non-Temporal Stores................................................................ 6-43
Non-temporal Stores and Software Write-Combining..................................................... 6-43
Cache Management....................................................................................................... 6-44
Video Encoder.......................................................................................................... 6-45
Video Decoder.......................................................................................................... 6-45
Conclusions from Video Encoder and Decoder Implementation.............................. 6-46
Optimizing Memory Copy Routines.......................................................................... 6-46
TLB Priming.............................................................................................................. 6-47
Using the 8-byte Streaming Stores and Software Prefetch....................................... 6-48
Using 16-byte Streaming Stores and Hardware Prefetch......................................... 6-50
Performance Comparisons of Memory Copy Routines............................................ 6-52
Deterministic Cache Parameters .......................................................................................... 6-53
Cache Sharing Using Deterministic Cache Parameters................................................. 6-55
Cache Sharing in Single-core or Multi-core.................................................................... 6-55
Determine Prefetch Stride Using Deterministic Cache Parameters............................... 6-56
Chapter 7 Multi-Core and Hyper-Threading Technology
Performance and Usage Models............................................................................................. 7-2
Multithreading................................................................................................................... 7-2
Multitasking Environment................................................................................................. 7-4
Programming Models and Multithreading ............................................................................... 7-6
Parallel Programming Models .......................................................................................... 7-7
Domain Decomposition.................................. ............................................................. 7-7
Functional Decomposition.......................................................... .. ... ................................. 7-8
Specialized Programming Models.................................................................................... 7-8
Producer-Consumer Threading Models.................................................................... 7-10
Tools for Creating Multithreaded Applications................................................................ 7-14
Optimization Guidelines........................................................................................................ 7-16
Key Practices of Thread Synchronization ...................................................................... 7-16
Key Practices of System Bus Optimization.................................................................... 7-17
Key Practices of Memory Optimization .......................................................................... 7-17
Key Practices of Front-end Optimization........................................................................ 7-18
Key Practices of Execution Resource Optimization....................................................... 7-18
Generality and Performance Impact............................................................................... 7-19
Thread Synchronization........................................................................................................ 7-19
Choice of Synchronization Primitives............................................................ ... ... ........... 7-20
Synchronization for Short Periods.................................................................................. 7-22
Optimization with Spin-Locks ......................................................................................... 7-25
Synchronization for Longer Periods ............................................................................... 7-26
Avoid Coding Pitfalls in Thread Synchronization...................................................... 7-28
Prevent Sharing of Modified Data and False-Sharing..................................................... 7-30
Placement of Shared Synchronization Variable ............................................................. 7-31
System Bus Optimization...................................................................................................... 7-33
Conserve Bus Bandwidth............................................................................................... 7-34
Understand the Bus and Cache Interactions.................................................................. 7-35
Avoid Excessive Software Prefetches............................................................................ 7-36
Improve Effective Latency of Cache Misses................................................................... 7-36
Use Full Write Transactions to Achieve Higher Data Rate.......................................... ... 7-37
Memory Optimization............................................................................................................ 7-38
Cache Blocking Technique............................................................................................. 7-38
Shared-Memory Optimization......................................................................................... 7-39
Minimize Sharing of Data between Physical Processors.......................................... 7-39
Batched Producer-Consumer Model........................................................................ 7-40
Eliminate 64-KByte Aliased Data Accesses................................................................... 7-42
Preventing Excessive Evictions in First-Level Data Cache............................................ 7-43
Per-thread Stack Offset ............................................................................................ 7-44
Per-instance Stack Offset......................................................................................... 7-46
Front-end Optimization.......................................................................................................... 7-48
Avoid Excessive Loop Unrolling..................................................................................... 7-48
Optimization for Code Size............................................................................................. 7-49
Using Thread Affinities to Manage Shared Platform Resources........................................... 7-49
Using Shared Execution Resources in a Processor Core.............................................. 7-59
Chapter 8 64-bit Mode Coding Guidelines
Introduction............................................................................................................................. 8-1
Coding Rules Affecting 64-bit Mode........................................................................................ 8-1
Use Legacy 32-Bit Instructions When The Data Size Is 32 Bits....................................... 8-1
Use Extra Registers to Reduce Register Pressure .......................................................... 8-2
Use 64-Bit by 64-Bit Multiplies That Produce 128-Bit Results Only When Necessary..... 8-2
Sign Extension to Full 64-Bits........................................................................................... 8-3
Alternate Coding Rules for 64-Bit Mode.................................................................................. 8-4
Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic..................... 8-4
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible................................. 8-6
Using Software Prefetch................................................................................................... 8-6
Chapter 9 Power Optimization for Mobile Usages
Overview................................................................................................................................. 9-1
Mobile Usage Scenarios....................................................... ... ..................................... .. ........ 9-2
ACPI C-States......................................................................................................................... 9-4
Processor-Specific C4 and Deep C4 States..................................................................... 9-6
Guidelines for Extending Battery Life...................................................................................... 9-7
Adjust Performance to Meet Quality of Features ............................................................. 9-8
Reducing Amount of Work................................................................................................ 9-9
Platform-Level Optimizations.......................................................................................... 9-10
Handling Sleep State Transitions ................................................................................... 9-11
Using Enhanced Intel SpeedStep® Technology ............................................................. 9-12
Enabling Intel® Enhanced Deeper Sleep ....................................................................... 9-14
Multi-Core Considerations.............................................................................................. 9-15
Enhanced Intel SpeedStep® Technology.................................................................. 9-15
Thread Migration Considerations.............................................................................. 9-16
Multi-core Considerations for C-States..................................................................... 9-17
Appendix A Application Performance Tools
Intel® Compilers..................................................................................................................... A-2
Code Optimization Options ............................ ................................. ... ............................. A-3
Targeting a Processor (-Gn) ......................................................... ... .......................... A-3
Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions])...... A-4
Vectorizer Switch Options ................................................................................................ A-5
Loop Unrolling............................................................................................................ A-5
Multithreading with OpenMP*.................................................................................... A-6
Inline Expansion of Library Functions (-Oi, -Oi-) .......................... ... ................................ A-6
Floating-point Arithmetic Precision (-Op, -Op-, -Qprec, -Qprec_div, -Qpc,
-Qlong_double).................................. ... .................................... ................................... A-6
Rounding Control Option (-Qrcd) .................................................................................... A-6
Interprocedural and Profile-Guided Optimizations .......................................................... A-7
Interprocedural Optimization (IPO)............................................................................ A-7
Profile-Guided Optimization (PGO) ........................................................................... A-7
Intel® VTune™ Performance Analyzer................................................................................... A-8
Sampling ......................................................................................................................... A-9
Time-based Sampling..................................................... ... .................................... .... A-9
Event-based Sampling............................................................................................. A-10
Workload Characterization ...................................................................................... A-11
Call Graph ............................................. ... ..................................................................... A-13
Counter Monitor............................................................................................................. A-14
Intel® Tuning Assistant.................................................................................................. A-14
Intel® Performance Libraries................................................................................................ A-14
Benefits Summary......................................................................................................... A-15
Optimizations with the Intel® Performance Libraries.................................................... A-16
Enhanced Debugger (EDB) ................................................................................................. A-17
Intel® Threading Tools.......................................................................................................... A-17
Intel® Thread Checker................................................................................................... A-17
Thread Profiler............................................................................................................... A-19
Intel® Software College........................................................................................................ A-20
Appendix B Using Performance Monitoring Events
Pentium 4 Processor Performance Metrics......................... ................................................... B-1
Pentium 4 Processor-Specific Terminology............................................................................ B-2
Bogus, Non-bogus, Retire............................................................................................... B-2
Bus Ratio......................................................................................................................... B-2
Replay............................................................................................................................. B-3
Assist.............................................................................................................................. . B-3
Tagging............................................................................................................................ B-3
Counting Clocks..................................................................................................................... B-4
Non-Halted Clockticks..................................................................................................... B-5
Non-Sleep Clockticks...................................................................................................... B-6
Time Stamp Counter........................................................................................................ B-7
Microarchitecture Notes......................................................................................................... B-8
Trace Cache Events........................................................................................................ B-8
Bus and Memory Metrics...................................................... ........................................... B-8
Reads due to program loads ................................................................................... B-11
Reads due to program writes (RFOs)............................................................... .. ..... B-11
Writebacks (dirty evictions)...................................................................................... B-12
Usage Notes for Specific Metrics .................................................................................. B-13
Usage Notes on Bus Activities...................................................................................... B-15
Metrics Descriptions and Categories ................................................................................... B-16
Performance Metrics and Tagging Mechanisms.................................................................. B-46
Tags for replay_event.................................................................................................... B-46
Tags for front_end_event............................................................................................... B-48
Tags for execution_event .............................................................................................. B-48
Using Performance Metrics with Hyper-Threading Technology........................................... B-50
Using Performance Events of Intel Core Solo and Intel Core Duo processors.................... B-56
Understanding the Results in a Performance Counter.................................................. B-56
Ratio Interpretation........................................................................................................ B-57
Notes on Selected Events............................................................................................. B-58
Appendix C IA-32 Instruction Latency and Throughput
Overview................................................................................................................................ C-2
Definitions .............................................................................................................................. C-4
Latency and Throughput........................................................................................................ C-4
Latency and Throughput with Register Operands.......................................................... C-6
Table Footnotes.......................................... ..................................... ........................ C-19
Latency and Throughput with Memory Operands ......................................................... C-20
Appendix D Stack Alignment
Stack Frames......................................................................................................................... D-1
Aligned esp-Based Stack Frames................................................................................... D-4
Aligned ebp-Based Stack Frames................................................................................... D-6
Stack Frame Optimizations.............................................................................................. D-9
Inlined Assembly and ebx.................................................................................................... D-10
Appendix E Mathematics of Prefetch Scheduling Distance
Simplified Equation ............................................... .................................... ... ...................... ... . E-1
Mathematical Model for PSD................................................................................................. E-2
No Preloading or Prefetch.................. ... ..................................... .. ................................... E-6
Compute Bound (Case: Tc >= Tl + Tb)............................................................................ E-7
Compute Bound (Case: Tl + Tb > Tc > Tb) ..................................................................... E-8
Memory Throughput Bound (Case: Tb >= Tc)............................................................... E-10
Example ........................................................................................................................ E-11
Index
Examples
Example 2-1 Assembly Code with an Unpredictable Branch ............................. 2-17
Example 2-2 Code Optimization to Eliminate Branches..................................... 2-17
Example 2-3 Eliminating Branch with CMOV Instruction.................................... 2-18
Example 2-4 Use of pause Instruction ................................................................. 2-19
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm.............. 2-20
Example 2-6 Static Taken Prediction Example ...................................................2-21
Example 2-7 Static Not-Taken Prediction Example ............................................ 2-21
Example 2-8 Indirect Branch With Two Favored Targets .................................... 2-25
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction ..... 2-26
Example 2-10 Loop Unrolling ............................................................................... 2-28
Example 2-11 Code That Causes Cache Line Split ............................................. 2-31
Example 2-12 Several Situations of Small Loads After Large Store .................... 2-35
Example 2-13 A Non-forwarding Example of Large Load After Small Store ........ 2-36
Example 2-14 A Non-forwarding Situation in Compiler Generated Code............. 2-36
Example 2-15 Two Examples to Avoid the Non-forwarding Situation in Example 2-14 ..... 2-36
Example 2-16 Large and Small Load Stalls ......................................................... 2-37
Example 2-17 An Example of Loop-carried Dependence Chain .......................... 2-39
Example 2-18 Rearranging a Data Structure ....................................................... 2-39
Example 2-19 Decomposing an Array .................................................................. 2-40
Example 2-20 Dynamic Stack Alignment ............................................................. 2-43
Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions............ 2-54
Example 2-22 Non-temporal Stores and Partial Bus Write Transactions .............2-54
Example 2-23 Algorithm to Avoid Changing the Rounding Mode......................... 2-66
Example 2-24 Dependencies Caused by Referencing Partial Registers.............. 2-77
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form..................... 2-91
Example 2-26 Spill Scheduling Example Code .................................................... 2-92
Example 3-1 Identification of MMX Technology with cpuid................................... 3-3
Example 3-2 Identification of SSE with cpuid ....................................................... 3-4
Example 3-3 Identification of SSE by the OS ....................................................... 3-4
Example 3-4 Identification of SSE2 with cpuid..................................................... 3-5
Example 3-5 Identification of SSE2 by the OS..................................................... 3-6
Example 3-6 Identification of SSE3 with cpuid..................................................... 3-7
Example 3-7 Identification of SSE3 by the OS..................................................... 3-8
Example 3-8 Simple Four-Iteration Loop ............................................................ 3-14
Example 3-9 Streaming SIMD Extensions Using Inlined Assembly Encoding ...3-15
Example 3-10 Simple Four-Iteration Loop Coded with Intrinsics.......................... 3-16
Example 3-11 C++ Code Using the Vector Classes ............................................. 3-18
Example 3-12 Automatic Vectorization for a Simple Loop .................................... 3-19
Example 3-13 C Algorithm for 64-bit Data Alignment ........................................... 3-23
Example 3-14 AoS Data Structure ....................................................................... 3-27
Example 3-15 SoA Data Structure ....................................................................... 3-28
Example 3-16 AoS and SoA Code Samples ........................................................ 3-28
Example 3-17 Hybrid SoA Data Structure ............................................................ 3-30
Example 3-18 Pseudo-code Before Strip Mining.................................................. 3-32
Example 3-19 Strip Mined Code........................................................................... 3-33
Example 3-20 Loop Blocking................................................................................ 3-35
Example 3-21 Emulation of Conditional Moves.................................................... 3-37
Example 4-1 Resetting the Register between __m64 and FP Data Types...........4-5
Example 4-2 Unsigned Unpack Instructions......................................................... 4-7
Example 4-3 Signed Unpack Code ...................................................................... 4-8
Example 4-4 Interleaved Pack with Saturation ................................................... 4-10
Example 4-5 Interleaved Pack without Saturation ..............................................4-11
Example 4-6 Unpacking Two Packed-word Sources in a Non-interleaved Way . 4-13
Example 4-7 pextrw Instruction Code................................................................. 4-14
Example 4-8 pinsrw Instruction Code................................................................. 4-15
Example 4-9 Repeated pinsrw Instruction Code ................................................ 4-16
Example 4-10 pmovmskb Instruction Code.......................................................... 4-17
Example 4-11 pshuf Instruction Code .................................................................. 4-19
Example 4-12 Broadcast Using 2 Instructions...................................................... 4-19
Example 4-13 Swap Using 3 Instructions............................................................. 4-20
Example 4-14 Reverse Using 3 Instructions......................................................... 4-20
Example 4-15 Generating Constants ................................................................... 4-21
Example 4-16 Absolute Difference of Two Unsigned Numbers ............................ 4-23
Example 4-17 Absolute Difference of Signed Numbers ....................................... 4-24
Example 4-18 Computing Absolute Value ............................................................ 4-25
Example 4-19 Clipping to a Signed Range of Words [high, low] .......................... 4-27
Example 4-20 Clipping to an Arbitrary Signed Range [high, low]......................... 4-27
Example 4-21 Simplified Clipping to an Arbitrary Signed Range ......................... 4-28
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]..................... 4-29
Example 4-23 Complex Multiply by a Constant .................................................... 4-32
Example 4-24 A Large Load after a Series of Small Stores (Penalty).................. 4-35
Example 4-25 Accessing Data without Delay....................................................... 4-35
Example 4-26 A Series of Small Loads after a Large Store .................................4-36
Example 4-27 Eliminating Delay for a Series of Small Loads after a
Large Store.................................................................................... 4-36
Example 4-28 An Example of Video Processing with Cache Line Splits.............. 4-37
Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits ........ 4-38
Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation ....................... 5-8
Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation........ 5-9
Example 5-3 Swizzling Data............................................................................... 5-10
Example 5-4 Swizzling Data Using Intrinsics ..................................................... 5-12
Example 5-5 Deswizzling Single-Precision SIMD Data ...................................... 5-14
Example 5-6 Deswizzling Data Using the movlhps and shuffle
Instructions .................................................................................... 5-15
Example 5-7 Deswizzling Data 64-bit Integer SIMD Data .................................. 5-16
Example 5-8 Using MMX Technology Code for Copying or Shuffling.................5-18
Example 5-9 Horizontal Add Using movhlps/movlhps ........................................ 5-19
Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps .................5-21
Example 5-11 Multiplication of Two Pair of Single-precision Complex Number....5-24
Example 5-12 Division of Two Pair of Single-precision Complex Number............ 5-25
Example 5-13 Calculating Dot Products from AOS .............................................. 5-26
Example 6-1 Pseudo-code for Using cflush ....................................................... 6-18
Example 6-2 Populating an Array for Circular Pointer Chasing with
Constant Stride.............................................................................. 6-21
Example 6-3 Prefetch Scheduling Distance ....................................................... 6-26
Example 6-4 Using Prefetch Concatenation....................................................... 6-28
Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop ....... 6-28
Example 6-6 Spread Prefetch Instructions ......................................................... 6-33
Example 6-7 Data Access of a 3D Geometry Engine without Strip-mining........ 6-37
Example 6-8 Data Access of a 3D Geometry Engine with Strip-mining............. 6-38
Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic .......... 6-40
Example 6-10 Basic Algorithm of a Simple Memory Copy................................... 6-46
Example 6-11 A Memory Copy Routine Using Software Prefetch........................6-48
Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation.. 6-50
Example 7-1 Serial Execution of Producer and Consumer Work Items ............... 7-9
Example 7-2 Basic Structure of Implementing Producer Consumer Threads.... 7-11
Example 7-3 Thread Function for an Interlaced Producer Consumer Model .....7-13
Example 7-4 Spin-wait Loop and PAUSE Instructions........................................ 7-24
Example 7-5 Coding Pitfall using Spin Wait Loop .............................................. 7-29
Example 7-6 Placement of Synchronization and Regular Variables ..................7-32
Example 7-7 Declaring Synchronization Variables without Sharing
a Cache Line ................................................................................. 7-32
Example 7-8 Batched Implementation of the Producer Consumer Threads ......7-41
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads ............... 7-45
Example 7-10 Adding a Pseudo-random Offset to the Stack Pointer
in the Entry Function ..................................................................... 7-47
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical
Processor ...................................................................................... 7-51
Example 7-12 Assembling a Look up Table to Manage Affinity Masks
and Schedule Threads to Each Core First ....................................7-54
Example 7-13 Discovering the Affinity Masks for Sibling Logical
Processors Sharing the Same Cache ........................................... 7-55
Example D-1 Aligned esp-Based Stack Frames .................................................. D-5
Example D-2 Aligned ebp-based Stack Frames................................................... D-7
Example E-1 Calculating Insertion for Scheduling Distance of 3 ..........................E-3
Figures
Figure 1-1 Typical SIMD Operations ................................................................... 1-3
Figure 1-2 SIMD Instruction Register Usage ...................................................... 1-4
Figure 1-3 The Intel NetBurst Microarchitecture ............................................... 1-10
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core.......................1-19
Figure 1-5 The Intel Pentium M Processor Microarchitecture........................... 1-27
Figure 1-6 Hyper-Threading Technology on an SMP ........................................ 1-35
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition
and Intel Core Duo Processor ......................................................... 1-41
Figure 2-1 Cache Line Split in Accessing Elements in a Array ......................... 2-31
Figure 2-2 Size and Alignment Restrictions in Store Forwarding...................... 2-34
Figure 3-1 Converting to Streaming SIMD Extensions Chart .............................3-9
Figure 3-2 Hand-Coded Assembly and High-Level Compiler
Performance Trade-offs ................................................................... 3-13
Figure 3-3 Loop Blocking Access Pattern ......................................................... 3-36
Figure 4-1 PACKSSDW mm, mm/mm64 Instruction Example ............................ 4-9
Figure 4-2 Interleaved Pack with Saturation ....................................................... 4-9
Figure 4-3 Result of Non-Interleaved Unpack Low in MM0 .............................. 4-12
Figure 4-4 Result of Non-Interleaved Unpack High in MM1.............................. 4-12
Figure 4-5 pextrw Instruction ............................................................................ 4-14
Figure 4-6 pinsrw Instruction............................................................................. 4-15
Figure 4-7 pmovmskb Instruction Example....................................................... 4-17
Figure 4-8 pshuf Instruction Example ............................................................... 4-18
Figure 4-9 PSADBW Instruction Example ........................................................ 4-31
Figure 5-1 Homogeneous Operation on Parallel Data Elements ........................ 5-5
Figure 5-2 Dot Product Operation ....................................................................... 5-8
Figure 5-3 Horizontal Add Using movhlps/movlhps .......................................... 5-19
Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction .............. 5-23
Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction
HADDPD ......................................................................................... 5-23
Figure 6-1 Effective Latency Reduction as a Function of Access Stride........... 6-22
Figure 6-2 Memory Access Latency and Execution Without Prefetch ..............6-23
Figure 6-3 Memory Access Latency and Execution With Prefetch ...................6-23
Figure 6-4 Prefetch and Loop Unrolling ............................................................ 6-29
Figure 6-5 Memory Access Latency and Execution With Prefetch ...................6-31
Figure 6-6 Cache Blocking – Temporally Adjacent and Non-adjacent
Passes............................................................................................. 6-35
Figure 6-7 Examples of Prefetch and Strip-mining for Temporally
Adjacent and Non-Adjacent Passes Loops ..................................... 6-36
Figure 6-8 Single-Pass Vs. Multi-Pass 3D Geometry Engines .........................6-42
Figure 7-1 Amdahl’s Law and MP Speed-up ...................................................... 7-3
Figure 7-2 Single-threaded Execution of Producer-consumer
Threading Model................................................................................ 7-9
Figure 7-3 Execution of Producer-consumer Threading Model on
a Multi-core Processor..................................................................... 7-10
Figure 7-4 Interlaced Variation of the Producer Consumer Model.................... 7-12
Figure 7-5 Batched Approach of Producer Consumer Model ........................... 7-40
Figure 9-1 Performance History and State Transitions ....................................... 9-3
Figure 9-2 Active Time Versus Halted Time of a Processor ............................... 9-4
Figure 9-3 Application of C-states to Idle Time................................................... 9-6
Figure 9-4 Profiles of Coarse Task Scheduling and Power Consumption ......... 9-12
Figure 9-5 Thread Migration in a Multi-Core Processor .................................... 9-17
Figure 9-6 Progression to Deeper Sleep .......................................................... 9-18
Figure A-1 Sampling Analysis of Hotspots by Location.....................................A-10
Figure A-2 Intel Thread Checker Can Locate Data Race Conditions ................A-18
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded
Execution Timelines.........................................................................A-20
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ
and Front Side Bus ..........................................................................B-10
Figure D-1 Stack Frames Based on Alignment Type .......................................... D-3
Figure E-1 Pentium II, Pentium III and Pentium 4 Processors Memory
Pipeline Sketch ..................................................................................E-4
Figure E-2 Execution Pipeline, No Preloading or Prefetch ..................................E-6
Figure E-3 Compute Bound Execution Pipeline ..................................................E-7
Figure E-4 Another Compute Bound Execution Pipeline.....................................E-8
Figure E-5 Memory Throughput Bound Pipeline ...............................................E-10
Figure E-6 Accesses per Iteration, Example 1 ..................................................E-12
Figure E-7 Accesses per Iteration, Example 2 ..................................................E-13
Tables
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters .................. 1-20
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32
Processor Families............................................................................ 1-30
Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and
Intel® Core™ Duo Processors .......................................................... 1-30
Table 1-4 Family And Model Designations of Microarchitectures...................... 1-42
Table 1-5 Characteristics of Load and Store Operations
in Intel Core Duo Processors ............................................................ 1-43
Table 2-1 Coding Pitfalls Affecting Performance ................................................. 2-2
Table 2-2 Avoiding Partial Flag Register Stall ................................................... 2-76
Table 2-3 Avoiding Partial Register Stall When Packing Byte Values ............... 2-78
Table 2-4 Avoiding False LCP Delays with 0xF7 Group Instructions ................ 2-81
Table 2-5 Using REP STOSD with Arbitrary Count Size and
4-Byte-Aligned Destination ................................................................ 2-85
Table 5-1 SoA Form of Representing Vertices Data ........................................... 5-7
Table 6-1 Software Prefetching Considerations into Strip-mining Code............ 6-39
Table 6-2 Relative Performance of Memory Copy Routines ............................. 6-52
Table 6-3 Deterministic Cache Parameters Leaf............................................... 6-54
Table 7-1 Properties of Synchronization Objects .............................................. 7-21
Table B-1 Pentium 4 Processor Performance Metrics ....................................... B-18
Table B-2 Metrics That Utilize Replay Tagging Mechanism ............................... B-47
Table B-3 Metrics That Utilize the Front-end Tagging Mechanism ..................... B-48
Table B-4 Metrics That Utilize the Execution Tagging Mechanism .................... B-49
Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3).............. B-50
Table B-6 Metrics That Support Qualification by Logical Processor and
Parallel Counting ............................................................................... B-51
Table B-7 Metrics That Are Independent of Logical Processors........................ B-55
Table C-1 Streaming SIMD Extension 3 SIMD Floating-point Instructions ......... C-6
Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions.................. C-7
Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point
Instructions ......................................................................................... C-9
Table C-4 Streaming SIMD Extension Single-precision Floating-point
Instructions ....................................................................................... C-12
Table C-5 Streaming SIMD Extension 64-bit Integer Instructions..................... C-14
Table C-6 MMX Technology 64-bit Instructions ................................................ C-14
Table C-7 IA-32 x87 Floating-point Instructions................................................ C-16
Table C-8 IA-32 General Purpose Instructions ................................................. C-17
Introduction
The IA-32 Intel® Architecture Optimization Reference Manual describes
how to optimize software to take advantage of the performance
characteristics of the current generation of the IA-32 Intel architecture
family of processors. The optimizations described in this manual apply
to IA-32 processors based on the Intel® NetBurst® microarchitecture,
the Intel® Pentium® M processor family and IA-32 processors that
support Hyper-Threading Technology.
The target audience for this manual includes software programmers and
compiler writers. This manual assumes that the reader is familiar with the
basics of the IA-32 architecture and has access to the Intel® Architecture
Software Developer's Manual: Volume 1, Basic Architecture;
Volume 2A, Instruction Set Reference A-M; Volume 2B, Instruction Set
Reference N-Z; and Volume 3, System Programmer's Guide.
When developing and optimizing software applications to achieve a
high level of performance when running on IA-32 processors, a detailed
understanding of the IA-32 family of processors is often required. In many
cases, knowledge of IA-32 microarchitectures is required.
This manual provides an overview of the Intel NetBurst
microarchitecture and the Intel Pentium M processor microarchitecture.
It contains design guidelines for high-performance software applications,
coding rules, and techniques for many aspects of code-tuning. These
rules are useful to programmers and compiler developers.
The design guidelines that are discussed in this manual for developing
high-performance software apply to current as well as to future IA-32
processors. The coding rules and code optimization techniques listed
target the Intel NetBurst microarchitecture and the Pentium M processor
microarchitecture.
Tuning Your Application
Tuning an application for high performance on any IA-32 processor
requires understanding and basic skills in:
• IA-32 architecture
• C and Assembly language
• the hot-spot regions in your application that have significant impact
on software performance
• the optimization capabilities of your compiler
• techniques to evaluate the application’s performance
The Intel® VTune™ Performance Analyzer can help you analyze and
locate hot-spot regions in your applications. On the Pentium 4, Intel®
Xeon® and Pentium M processors, this tool can monitor an application
through a selection of performance monitoring events and analyze the
performance event data that is gathered during code execution.
This manual also describes information that can be gathered using the
performance counters through the Pentium 4 processor's performance
monitoring events.
For VTune Performance Analyzer order information, see the web page:
http://developer.intel.com
About This Manual
In this document, the reference “Pentium 4 processor” refers to
processors based on the Intel NetBurst microarchitecture. Currently this
includes the Intel Pentium 4 processor and Intel Xeon processor. Where
appropriate, differences between Pentium 4 processor and Intel Xeon
processor are noted.
The manual consists of the following parts:
Introduction. Defines the purpose and outlines the contents of this
manual.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview.
Describes the features relevant to software optimization of the current
generation of IA-32 Intel architecture processors, including the
architectural extensions to the IA-32 architecture and an overview of the
Intel NetBurst microarchitecture, Pentium M processor
microarchitecture and Hyper-Threading Technology.
Chapter 2: General Optimization Guidelines. Describes general code
development and optimization techniques that apply to all applications
designed to take advantage of the common features of the Intel NetBurst
microarchitecture and Pentium M processor microarchitecture.
Chapter 3: Coding for SIMD Architectures. Describes techniques
and concepts for using the SIMD integer and SIMD floating-point
instructions provided by the MMX™ technology, Streaming SIMD
Extensions, Streaming SIMD Extensions 2, and Streaming SIMD
Extensions 3.
Chapter 4: Optimizing for SIMD Integer Applications. Provides
optimization suggestions and common building blocks for applications
that use the 64-bit and 128-bit SIMD integer instructions.
Chapter 5: Optimizing for SIMD Floating-point Applications.
Provides optimization suggestions and common building blocks for
applications that use the single-precision and double-precision SIMD
floating-point instructions.
Chapter 6: Optimizing Cache Usage. Describes how to use the
prefetch instruction and cache control management instructions to
optimize cache usage, and describes the deterministic cache parameters.
Chapter 7: Multiprocessor and Hyper-Threading Technology.
Describes guidelines and techniques for optimizing multithreaded
applications to achieve optimal performance scaling. Use these when
targeting multiprocessor (MP) systems or MP systems using IA-32
processors that support Hyper-Threading Technology.
Chapter 8: 64-Bit Mode Coding Guidelines. This chapter describes a
set of additional coding guidelines for application software written to
run in 64-bit mode.
Chapter 9: Power Optimization for Mobile Usages. This chapter
provides background on power saving techniques in mobile processors
and makes recommendations that developers can leverage to provide
longer battery life.
Appendix A: Application Performance Tools. Introduces tools for
analyzing and enhancing application performance without having to
write assembly code.
Appendix B: Intel Pentium 4 Processor Performance Metrics.
Provides information that can be gathered using Pentium 4 processor’s
performance monitoring events. These performance metrics can help
programmers determine how effectively an application is using the
features of the Intel NetBurst microarchitecture.
Appendix C: IA-32 Instruction Latency and Throughput. Provides
latency and throughput data for the IA-32 instructions. Instruction
timing data specific to the Pentium 4 and Pentium M processors is
provided.
Appendix D: Stack Alignment. Describes stack alignment conventions
and techniques to optimize performance of accessing stack-based data.
Appendix E: The Mathematics of Prefetch Scheduling Distance.
Discusses the optimum spacing to insert
prefetch instructions and
presents a mathematical model for determining the prefetch scheduling
distance (PSD) for your application.
Related Documentation
For more information on the Intel architecture, specific techniques, and
processor architecture terminology referenced in this manual, see the
following documents:
• Intel® C++ Compiler User's Guide
• Intel® Fortran Compiler User's Guide
• VTune Performance Analyzer online help
• Intel® Architecture Software Developer's Manual:
— Volume 1: Basic Architecture, doc. number 253665
— Volume 2A: Instruction Set Reference Manual A-M, doc.
number 253666
— Volume 2B: Instruction Set Reference Manual N-Z, doc.
number 253667
— Volume 3: System Programmer's Guide, doc. number 253668
• Intel Processor Identification with the CPUID Instruction, doc.
number 241618
• Developing Multi-threaded Applications: A Platform Consistent
Approach, available at
http://cache-www.intel.com/cd/00/00/05/15/51534_developing_mul
tithreaded_applications.pdf
Also, refer to the following Application Notes:
• Adjusting Thread Stack Address To Improve Performance On Intel
Xeon MP Hyper-Threading Technology Enabled Processors
• Detecting Hyper-Threading Technology Enabled Processors
• Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon
Processor MP
In addition, refer to publications in the following web sites:
• http://developer.intel.com/technology/hyperthread
• http://cedar.intel.com/cgi-bin/ids.dll/topic.jsp?catCode=CDN
Notational Conventions
This manual uses the following conventions:
This type style      Indicates an element of syntax, a reserved word, a
                     keyword, a filename, instruction, computer output,
                     or part of a program example. The text appears in
                     lowercase unless uppercase is significant.
THIS TYPE STYLE      Indicates a value, for example, TRUE, CONST1, or a
                     variable, for example, A, B, or register names
                     MM0 through MM7.
l                    Indicates the lowercase letter L in examples. 1 is
                     the number 1 in examples. O is the uppercase O in
                     examples. 0 is the number 0 in examples.
This type style      Indicates a placeholder for an identifier, an
                     expression, a string, a symbol, or a value.
                     Substitute one of these items for the placeholder.
... (ellipses)       Indicate that a few lines of the code are omitted.
This type style      Indicates a hypertext link.
IA-32 Intel® Architecture
Processor Family Overview
This chapter gives an overview of the features relevant to software
optimization for the current generations of IA-32 processors, including:
Intel® Core™ Solo, Intel® Core™ Duo, Intel® Pentium® 4, Intel®
Xeon®, Intel® Pentium® M, and IA-32 processors with multi-core
architecture. These features include:
• SIMD instruction extensions including MMX™ technology,
Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2
(SSE2), and Streaming SIMD Extensions 3 (SSE3)
• Microarchitectures that enable executing instructions with high
throughput at high clock rates, a high speed cache hierarchy and the
ability to fetch data with high speed system bus
• Intel® Extended Memory 64 Technology (Intel® EM64T)
• Intel® processors supporting Hyper-Threading (HT) Technology¹
• Multi-core architecture supported in Intel® Core™ Duo, Intel®
Pentium® D processors and Pentium® processor Extreme Edition²
Intel Pentium 4 processors, Intel Xeon processors, Pentium D
processors, and Pentium processor Extreme Editions are based on Intel
NetBurst® microarchitecture. The Intel Pentium M processor
microarchitecture balances performance and low power consumption.
1. Hyper-Threading Technology requires a computer system with an Intel processor
supporting HT Technology and an HT Technology enabled chipset, BIOS and operating
system. Performance varies depending on the hardware and software used.
2. Dual-core platform requires an Intel Core Duo, Pentium D processor or Pentium processor
Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance
varies depending on the hardware and software used.
Intel Core Solo and Intel Core Duo processors incorporate
microarchitectural enhancements for performance and power efficiency
that are in addition to those introduced in the Pentium M processor.
SIMD Technology
SIMD computations (see Figure 1-1) were introduced in the IA-32
architecture with MMX technology. MMX technology allows SIMD
computations to be performed on packed byte, word, and doubleword
integers. The integers are contained in a set of eight 64-bit registers
called MMX registers (see Figure 1-2).
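As a minimal illustration of this packed-integer model (this sketch is not an
example from the manual; the function name and usage are hypothetical), the
following C fragment uses the MMX intrinsic _mm_add_pi16 to add four pairs of
packed 16-bit integers held in 64-bit __m64 operands with a single PADDW
operation:

    #include <mmintrin.h>

    /* Adds four pairs of packed 16-bit integers in one MMX operation. */
    __m64 add_packed_words(__m64 a, __m64 b)
    {
        return _mm_add_pi16(a, b);   /* four 16-bit additions in parallel */
    }

    /* Because the MMX registers alias the x87 register stack, callers
       should execute _mm_empty() (EMMS) before resuming x87 floating-point
       code; guidelines for mixing MMX technology and x87 floating-point
       code appear in Chapter 4. */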
The Pentium III processor extended the SIMD computation model with
the introduction of the Streaming SIMD Extensions (SSE). SSE allows
SIMD computations to be performed on operands that contain four
packed single-precision floating-point data elements. The operands can
be in memory or in a set of eight 128-bit XMM registers (see Figure
1-2). SSE also extended SIMD computational capability by adding
additional 64-bit MMX instructions.
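As a short sketch of the SSE model (again, not taken from the manual; the
function name and unaligned-load choice are illustrative assumptions), the
following C fragment loads two sets of four packed single-precision values and
adds them with a single ADDPS operation, the kind of element-wise parallel
computation depicted in Figure 1-1:

    #include <xmmintrin.h>

    /* Adds four pairs of single-precision values (X1..X4 + Y1..Y4)
       with one packed SSE operation. */
    void add_packed_floats(const float *x, const float *y, float *result)
    {
        __m128 vx = _mm_loadu_ps(x);     /* load X1, X2, X3, X4 */
        __m128 vy = _mm_loadu_ps(y);     /* load Y1, Y2, Y3, Y4 */
        __m128 vr = _mm_add_ps(vx, vy);  /* four additions in parallel */
        _mm_storeu_ps(result, vr);       /* store the four results */
    }

If the data were aligned on 16-byte boundaries, the aligned forms
_mm_load_ps and _mm_store_ps could be used instead; data alignment
considerations for SIMD code are discussed in Chapter 3.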
Figure 1-1 shows a typical SIMD computation. Two sets of four packed
data elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are
operated on in parallel, with the same operation being performed on