
IA-32 Intel® Architecture
Optimization Reference
Manual
Order Number: 248966-013US
April 2006
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.
This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel® processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.
Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading for more information including details on which processors support HT Technology.
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Pentium D, Itanium, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright © 1999-2006 Intel Corporation.

Contents

Introduction
Chapter 1 IA-32 Intel® Architecture Processor Family Overview
SIMD Technology.................................................................................................................... 1-2
Summary of SIMD Technologies...................................................................................... 1-5
MMX™ Technology..................................................................................................... 1-5
Streaming SIMD Extensions....................................................................................... 1-5
Streaming SIMD Extensions 2.................................................................................... 1-6
Streaming SIMD Extensions 3.................................................................................... 1-6
Intel® Extended Memory 64 Technology (Intel® EM64T).......................................... 1-7
Intel NetBurst® Microarchitecture............................................................................................ 1-8
Design Goals of Intel NetBurst Microarchitecture ............................................................ 1-8
Overview of the Intel NetBurst Microarchitecture Pipeline ............................................... 1-9
The Front End........................................................................................................... 1-11
The Out-of-order Core.............................................................................................. 1-12
Retirement................................................................................................................ 1-12
Front End Pipeline Detail............................................................................................... 1-13
Prefetching................................................................................................................ 1-13
Decoder.................................................................................................................... 1-14
Execution Trace Cache ............................................................................................ 1-14
Branch Prediction ..................................................................................................... 1-15
Execution Core Detail.................................................................................................... 1-16
Instruction Latency and Throughput......................................................................... 1-17
Execution Units and Issue Ports............................................................................... 1-18
Caches...................................................................................................................... 1-19
Data Prefetch............................................................................................................ 1-21
Loads and Stores...................................................................................................... 1-24
Store Forwarding...................................................................................................... 1-25
Intel® Pentium® M Processor Microarchitecture................................................................... 1-26
The Front End........................................................................................................... 1-27
Data Prefetching....................................................................................................... 1-29
Out-of-Order Core..................................................................................................... 1-30
In-Order Retirement.................................................................................................. 1-31
Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors........................ 1-31
Front End........................................................................................................................ 1-32
Data Prefetching............................................................................................................. 1-33
Hyper-Threading Technology................................................................................................ 1-33
Processor Resources and Hyper-Threading Technology............................................... 1-36
Replicated Resources............................................................................................... 1-36
Partitioned Resources............................................................................................... 1-36
Shared Resources.................................................................................................... 1-37
Microarchitecture Pipeline and Hyper-Threading Technology........................................ 1-38
Front End Pipeline......................................................................................................... 1-38
Execution Core............................................................................................................... 1-39
Retirement...................................................................................................................... 1-39
Multi-Core Processors........................................................................................................... 1-39
Microarchitecture Pipeline and Multi-Core Processors................................................... 1-42
Shared Cache in Intel Core Duo Processors ................................................................. 1-42
Load and Store Operations....................................................................................... 1-42
Chapter 2 General Optimization Guidelines
Tuning to Achieve Optimum Performance.............................................................................. 2-1
Tuning to Prevent Known Coding Pitfalls................................................................................ 2-2
General Practices and Coding Guidelines.............................................................................. 2-3
Use Available Performance Tools..................................................................................... 2-4
Optimize Performance Across Processor Generations.................................................... 2-4
Optimize Branch Predictability.......................................................................................... 2-5
Optimize Memory Access................................................................................................. 2-5
Optimize Floating-point Performance............................................................................... 2-6
Optimize Instruction Selection.......................................................................................... 2-6
Optimize Instruction Scheduling....................................................................................... 2-7
Enable Vectorization......................................................................................................... 2-7
Coding Rules, Suggestions and Tuning Hints......................................................................... 2-8
Performance Tools.................................................................................................................. 2-9
Intel® C++ Compiler......................................................................................................... 2-9
General Compiler Recommendations............................................................................ 2-10
VTune™ Performance Analyzer..................................................................................... 2-10
Processor Perspectives........................................................................................................ 2-11
CPUID Dispatch Strategy and Compatible Code Strategy............................................. 2-13
Transparent Cache-Parameter Strategy......................................................................... 2-14
Threading Strategy and Hardware Multi-Threading Support.......................................... 2-14
Branch Prediction.................................................................................................................. 2-15
Eliminating Branches...................................................................................................... 2-15
Spin-Wait and Idle Loops................................................................................................ 2-18
Static Prediction.............................................................................................................. 2-19
Inlining, Calls and Returns ............................................................................................. 2-22
Branch Type Selection ................................................................................................... 2-23
Loop Unrolling ............................................................................................................... 2-26
Compiler Support for Branch Prediction......................................................................... 2-28
Memory Accesses................................................................................................................. 2-29
Alignment ....................................................................................................................... 2-29
Store Forwarding............................................................................................................ 2-32
Store-to-Load-Forwarding Restriction on Size and Alignment.................................. 2-33
Store-forwarding Restriction on Data Availability...................................................... 2-38
Data Layout Optimizations............................................................................................. 2-39
Stack Alignment.............................................................................................................. 2-42
Capacity Limits and Aliasing in Caches.......................................................................... 2-43
Capacity Limits in Set-Associative Caches............................................................... 2-44
Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors ............................ 2-45
Aliasing Cases in the Pentium M Processor............................................................. 2-46
Mixing Code and Data.................................................................................................... 2-47
Self-modifying Code ................................................................................................. 2-47
Write Combining............................................................................................................. 2-48
Locality Enhancement.................................................................................................... 2-50
Minimizing Bus Latency.................................................................................................. 2-52
Non-Temporal Store Bus Traffic ..................................................................................... 2-53
Prefetching..................................................................................................................... 2-55
Hardware Instruction Fetching.................................................................................. 2-55
Software and Hardware Cache Line Fetching.......................................................... 2-55
Cacheability Instructions ................................................................................................ 2-56
Code Alignment.............................................................................................................. 2-57
Improving the Performance of Floating-point Applications.................................................... 2-57
Guidelines for Optimizing Floating-point Code............................................................... 2-58
Floating-point Modes and Exceptions............................................................................ 2-60
Floating-point Exceptions ......................................................................................... 2-60
Floating-point Modes................................................................................................ 2-62
Improving Parallelism and the Use of FXCH.................................................................. 2-68
x87 vs. Scalar SIMD Floating-point Trade-offs............................................................... 2-69
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo
Processors............................................................................................................. 2-70
Memory Operands........................................................................................................... 2-71
Floating-Point Stalls........................................................................................................ 2-72
x87 Floating-point Operations with Integer Operands .............................................. 2-72
x87 Floating-point Comparison Instructions............................................................. 2-72
Transcendental Functions ........................................................................................ 2-72
Instruction Selection.............................................................................................................. 2-73
Complex Instructions....................................................................................................... 2-74
Use of the lea Instruction................................................................................................ 2-74
Use of the inc and dec Instructions................................................................................ 2-75
Use of the shift and rotate Instructions........................................................................... 2-75
Flag Register Accesses.................................................................................................. 2-75
Integer Divide................................................................................................................. 2-76
Operand Sizes and Partial Register Accesses............................................................... 2-76
Prefixes and Instruction Decoding.................................................................................. 2-80
REP Prefix and Data Movement..................................................................................... 2-81
Address Calculations...................................................................................................... 2-86
Clearing Registers.......................................................................................................... 2-87
Compares....................................................................................................................... 2-87
Floating Point/SIMD Operands....................................................................................... 2-88
Prolog Sequences.......................................................................................................... 2-90
Code Sequences that Operate on Memory Operands................................................... 2-90
Instruction Scheduling........................................................................................................... 2-91
Latencies and Resource Constraints.............................................................................. 2-91
Spill Scheduling.............................................................................................................. 2-92
Scheduling Rules for the Pentium 4 Processor Decoder ............................................... 2-92
Scheduling Rules for the Pentium M Processor Decoder .............................................. 2-93
Vectorization ......................................................................................................................... 2-93
Miscellaneous....................................................................................................................... 2-95
NOPs.............................................................................................................................. 2-95
Summary of Rules and Suggestions..................................................................................... 2-96
User/Source Coding Rules............................................................................................. 2-97
Assembly/Compiler Coding Rules.................................................................................. 2-99
Tuning Suggestions...................................................................................................... 2-108
Chapter 3 Coding for SIMD Architectures
Checking for Processor Support of SIMD Technologies ......................................................... 3-2
Checking for MMX Technology Support ........................................................................... 3-2
Checking for Streaming SIMD Extensions Support.......................................................... 3-3
Checking for Streaming SIMD Extensions 2 Support....................................................... 3-5
Checking for Streaming SIMD Extensions 3 Support....................................................... 3-6
Considerations for Code Conversion to SIMD Programming.................................................. 3-8
Identifying Hot Spots...................................................................................................... 3-10
Determine If Code Benefits by Conversion to SIMD Execution...................................... 3-11
Coding Techniques ............................................................................................................... 3-12
Coding Methodologies.................................................................................................... 3-13
Assembly.................................................................................................................. 3-15
Intrinsics.................................................................................................................... 3-15
Classes..................................................................................................................... 3-17
Automatic Vectorization............................................................................................. 3-18
Stack and Data Alignment..................................................................................................... 3-20
Alignment and Contiguity of Data Access Patterns........................................................ 3-20
Using Padding to Align Data..................................................................................... 3-20
Using Arrays to Make Data Contiguous.................................................................... 3-21
Stack Alignment For 128-bit SIMD Technologies ........................................................... 3-22
Data Alignment for MMX Technology............................................................................. 3-23
Data Alignment for 128-bit data...................................................................................... 3-24
Compiler-Supported Alignment................................................................................. 3-24
Improving Memory Utilization................................................................................................ 3-27
Data Structure Layout..................................................................................................... 3-27
Strip Mining..................................................................................................................... 3-32
Loop Blocking................................................................................................................. 3-34
Instruction Selection.............................................................................................................. 3-37
SIMD Optimizations and Microarchitectures.................................................................. 3-38
Tuning the Final Application.................................................................................................. 3-39
Chapter 4 Optimizing for SIMD Integer Applications
General Rules on SIMD Integer Code .................................................................................... 4-2
Using SIMD Integer with x87 Floating-point............................................................................ 4-3
Using the EMMS Instruction............................................................................................. 4-3
Guidelines for Using EMMS Instruction............................................................................ 4-4
Data Alignment........................................................................................................................ 4-6
Data Movement Coding Techniques....................................................................................... 4-6
Unsigned Unpack............................................................................................................. 4-6
Signed Unpack................................................................................................................. 4-7
Interleaved Pack with Saturation...................................................................................... 4-8
Interleaved Pack without Saturation............................................................................... 4-10
Non-Interleaved Unpack................................................................................................. 4-11
Extract Word................................................................................................................... 4-13
Insert Word..................................................................................................................... 4-14
Move Byte Mask to Integer............................................................................................. 4-16
Packed Shuffle Word for 64-bit Registers ...................................................................... 4-18
Packed Shuffle Word for 128-bit Registers .................................................................... 4-19
Unpacking/interleaving 64-bit Data in 128-bit Registers................................................. 4-20
Data Movement.............................................................................................................. 4-21
Conversion Instructions.................................................................................................. 4-21
Generating Constants............................................................................................................ 4-21
Building Blocks...................................................................................................................... 4-23
Absolute Difference of Unsigned Numbers .................................................................... 4-23
Absolute Difference of Signed Numbers........................................................................ 4-24
Absolute Value ................................................................................................................ 4-25
Clipping to an Arbitrary Range [high, low]...................................................................... 4-26
Highly Efficient Clipping............................................................................................ 4-27
Clipping to an Arbitrary Unsigned Range [high, low]................................................ 4-28
Packed Max/Min of Signed Word and Unsigned Byte.................................................... 4-29
Signed Word............................................................................................................. 4-29
Unsigned Byte .......................................................................................................... 4-30
Packed Multiply High Unsigned...................................................................................... 4-30
Packed Sum of Absolute Differences............................................................................. 4-30
Packed Average (Byte/Word)......................................................................................... 4-31
Complex Multiply by a Constant...................................................................................... 4-32
Packed 32*32 Multiply.................................................................................................... 4-33
Packed 64-bit Add/Subtract............................................................................................ 4-33
128-bit Shifts................................................................................................................... 4-33
Memory Optimizations.......................................................................................................... 4-34
Partial Memory Accesses............................................................................................... 4-35
Supplemental Techniques for Avoiding Cache Line Splits........................................ 4-37
Increasing Bandwidth of Memory Fills and Video Fills ................................................... 4-39
Increasing Memory Bandwidth Using the MOVDQ Instruction................................. 4-39
Increasing Memory Bandwidth by Loading and Storing to and from the
Same DRAM Page ................................................................................................ 4-39
Increasing UC and WC Store Bandwidth by Using Aligned Stores........................... 4-40
Converting from 64-bit to 128-bit SIMD Integer .................................................................... 4-40
SIMD Optimizations and Microarchitectures.................................................................. 4-41
Packed SSE2 Integer versus MMX Instructions....................................................... 4-42
Chapter 5 Optimizing for SIMD Floating-point Applications
General Rules for SIMD Floating-point Code.......................................................................... 5-1
Planning Considerations......................................................................................................... 5-2
Using SIMD Floating-point with x87 Floating-point................................................................. 5-3
Scalar Floating-point Code...................................................................................................... 5-3
Data Alignment........................................................................................................................ 5-4
Data Arrangement............................................................................................................ 5-4
Vertical versus Horizontal Computation...................................................................... 5-5
Data Swizzling............................................................................................................ 5-9
Data Deswizzling...................................................................................................... 5-14
Using MMX Technology Code for Copy or Shuffling Functions................................ 5-17
Horizontal ADD Using SSE....................................................................................... 5-18
Use of cvttps2pi/cvttss2si Instructions.................................................................................. 5-21
Flush-to-Zero and Denormals-are-Zero Modes .................................................................... 5-22
SIMD Floating-point Programming Using SSE3................................................................... 5-22
SSE3 and Complex Arithmetics ...................................................................................... 5-23
SSE3 and Horizontal Computation................................................................................. 5-26
SIMD Optimizations and Microarchitectures.................................................................. 5-27
Packed Floating-Point Performance......................................................................... 5-27
Chapter 6 Optimizing Cache Usage
General Prefetch Coding Guidelines....................................................................................... 6-2
Hardware Prefetching of Data................................................................................................. 6-4
Prefetch and Cacheability Instructions.................................................................................... 6-5
Prefetch................................................................................................................................... 6-6
Software Data Prefetch..................................................................................................... 6-6
The Prefetch Instructions – Pentium 4 Processor Implementation................................... 6-8
Prefetch and Load Instructions......................................................................................... 6-8
Cacheability Control................................................................................................................ 6-9
The Non-temporal Store Instructions.............................................................................. 6-10
Fencing..................................................................................................................... 6-10
Streaming Non-temporal Stores ............................................................................... 6-10
Memory Type and Non-temporal Stores................................................................... 6-11
Write-Combining....................................................................................................... 6-12
Streaming Store Usage Models...................................................................................... 6-13
Coherent Requests................................................................................................... 6-13
Non-coherent requests............................................................................................. 6-13
Streaming Store Instruction Descriptions ....................................................................... 6-14
The fence Instructions.................................................................................................... 6-15
The sfence Instruction.............................................................................................. 6-15
The lfence Instruction............................................................................................... 6-16
The mfence Instruction............................................................................................. 6-16
The clflush Instruction .................................................................................................... 6-17
Memory Optimization Using Prefetch.................................................................................... 6-18
Software-controlled Prefetch........................................................................................... 6-18
Hardware Prefetch.......................................................................................................... 6-19
Example of Effective Latency Reduction with H/W Prefetch.......................................... 6-20
Example of Latency Hiding with S/W Prefetch Instruction ............................................ 6-22
Software Prefetching Usage Checklist........................................................................... 6-24
Software Prefetch Scheduling Distance......................................................................... 6-25
Software Prefetch Concatenation................................................................................... 6-26
Minimize Number of Software Prefetches...................................................................... 6-29
Mix Software Prefetch with Computation Instructions.................................................... 6-32
Software Prefetch and Cache Blocking Techniques....................................................... 6-34
Hardware Prefetching and Cache Blocking Techniques ................................................ 6-39
Single-pass versus Multi-pass Execution....................................................................... 6-41
Memory Optimization using Non-Temporal Stores................................................................ 6-43
Non-temporal Stores and Software Write-Combining..................................................... 6-43
Cache Management....................................................................................................... 6-44
Video Encoder.......................................................................................................... 6-45
Video Decoder.......................................................................................................... 6-45
Conclusions from Video Encoder and Decoder Implementation.............................. 6-46
Optimizing Memory Copy Routines.......................................................................... 6-46
TLB Priming.............................................................................................................. 6-47
Using the 8-byte Streaming Stores and Software Prefetch....................................... 6-48
Using 16-byte Streaming Stores and Hardware Prefetch......................................... 6-50
Performance Comparisons of Memory Copy Routines............................................ 6-52
Deterministic Cache Parameters .......................................................................................... 6-53
Cache Sharing Using Deterministic Cache Parameters................................................. 6-55
Cache Sharing in Single-core or Multi-core.................................................................... 6-55
Determine Prefetch Stride Using Deterministic Cache Parameters............................... 6-56
Chapter 7 Multi-Core and Hyper-Threading Technology
Performance and Usage Models............................................................................................. 7-2
Multithreading................................................................................................................... 7-2
Multitasking Environment................................................................................................. 7-4
Programming Models and Multithreading ............................................................................... 7-6
Parallel Programming Models .......................................................................................... 7-7
Domain Decomposition............................................................................................... 7-7
Functional Decomposition................................................................................................. 7-8
Specialized Programming Models.................................................................................... 7-8
Producer-Consumer Threading Models.................................................................... 7-10
Tools for Creating Multithreaded Applications................................................................ 7-14
Optimization Guidelines........................................................................................................ 7-16
Key Practices of Thread Synchronization ...................................................................... 7-16
Key Practices of System Bus Optimization.................................................................... 7-17
Key Practices of Memory Optimization .......................................................................... 7-17
Key Practices of Front-end Optimization........................................................................ 7-18
Key Practices of Execution Resource Optimization....................................................... 7-18
Generality and Performance Impact............................................................................... 7-19
Thread Synchronization........................................................................................................ 7-19
Choice of Synchronization Primitives.............................................................................. 7-20
Synchronization for Short Periods.................................................................................. 7-22
Optimization with Spin-Locks ......................................................................................... 7-25
Synchronization for Longer Periods ............................................................................... 7-26
Avoid Coding Pitfalls in Thread Synchronization...................................................... 7-28
Prevent Sharing of Modified Data and False-Sharing..................................................... 7-30
Placement of Shared Synchronization Variable ............................................................. 7-31
System Bus Optimization...................................................................................................... 7-33
Conserve Bus Bandwidth............................................................................................... 7-34
Understand the Bus and Cache Interactions.................................................................. 7-35
Avoid Excessive Software Prefetches............................................................................ 7-36
Improve Effective Latency of Cache Misses................................................................... 7-36
Use Full Write Transactions to Achieve Higher Data Rate.............................................. 7-37
Memory Optimization............................................................................................................ 7-38
Cache Blocking Technique............................................................................................. 7-38
Shared-Memory Optimization......................................................................................... 7-39
Minimize Sharing of Data between Physical Processors.......................................... 7-39
Batched Producer-Consumer Model........................................................................ 7-40
Eliminate 64-KByte Aliased Data Accesses................................................................... 7-42
Preventing Excessive Evictions in First-Level Data Cache............................................ 7-43
Per-thread Stack Offset ............................................................................................ 7-44
Per-instance Stack Offset......................................................................................... 7-46
Front-end Optimization.......................................................................................................... 7-48
Avoid Excessive Loop Unrolling..................................................................................... 7-48
Optimization for Code Size............................................................................................. 7-49
Using Thread Affinities to Manage Shared Platform Resources........................................... 7-49
Using Shared Execution Resources in a Processor Core.............................................. 7-59
Chapter 8 64-bit Mode Coding Guidelines
Introduction............................................................................................................................. 8-1
Coding Rules Affecting 64-bit Mode........................................................................................ 8-1
Use Legacy 32-Bit Instructions When The Data Size Is 32 Bits....................................... 8-1
Use Extra Registers to Reduce Register Pressure .......................................................... 8-2
Use 64-Bit by 64-Bit Multiplies That Produce 128-Bit Results Only When Necessary..... 8-2
Sign Extension to Full 64-Bits........................................................................................... 8-3
Alternate Coding Rules for 64-Bit Mode.................................................................................. 8-4
Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic..................... 8-4
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible................................. 8-6
Using Software Prefetch................................................................................................... 8-6
Chapter 9 Power Optimization for Mobile Usages
Overview................................................................................................................................. 9-1
Mobile Usage Scenarios......................................................................................................... 9-2
ACPI C-States......................................................................................................................... 9-4
Processor-Specific C4 and Deep C4 States..................................................................... 9-6
Guidelines for Extending Battery Life...................................................................................... 9-7
Adjust Performance to Meet Quality of Features ............................................................. 9-8
Reducing Amount of Work................................................................................................ 9-9
Platform-Level Optimizations.......................................................................................... 9-10
Handling Sleep State Transitions ................................................................................... 9-11
Using Enhanced Intel SpeedStep® Technology ............................................................. 9-12
Enabling Intel® Enhanced Deeper Sleep ....................................................................... 9-14
Multi-Core Considerations.............................................................................................. 9-15
Enhanced Intel SpeedStep® Technology.................................................................. 9-15
Thread Migration Considerations.............................................................................. 9-16
Multi-core Considerations for C-States..................................................................... 9-17
Appendix A Application Performance Tools
Intel® Compilers..................................................................................................................... A-2
Code Optimization Options .............................................................................................. A-3
Targeting a Processor (-Gn) ....................................................................................... A-3
Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions])...... A-4
Vectorizer Switch Options ................................................................................................ A-5
Loop Unrolling............................................................................................................ A-5
Multithreading with OpenMP*.................................................................................... A-6
Inline Expansion of Library Functions (-Oi, -Oi-) ............................................................. A-6
Floating-point Arithmetic Precision (-Op, -Op-, -Qprec, -Qprec_div, -Qpc,
-Qlong_double)............................................................................................................. A-6
Rounding Control Option (-Qrcd) .................................................................................... A-6
Interprocedural and Profile-Guided Optimizations .......................................................... A-7
Interprocedural Optimization (IPO)............................................................................ A-7
Profile-Guided Optimization (PGO) ........................................................................... A-7
Intel® VTune™ Performance Analyzer................................................................................... A-8
Sampling ......................................................................................................................... A-9
Time-based Sampling................................................................................................. A-9
Event-based Sampling............................................................................................. A-10
Workload Characterization ...................................................................................... A-11
Call Graph ..................................................................................................................... A-13
Counter Monitor............................................................................................................. A-14
Intel® Tuning Assistant.................................................................................................. A-14
Intel® Performance Libraries................................................................................................ A-14
Benefits Summary......................................................................................................... A-15
Optimizations with the Intel® Performance Libraries..................................................... A-16
Enhanced Debugger (EDB) ................................................................................................. A-17
Intel® Threading Tools.......................................................................................................... A-17
Intel® Thread Checker................................................................................................... A-17
Thread Profiler............................................................................................................... A-19
Intel® Software College........................................................................................................ A-20
Appendix B Using Performance Monitoring Events
Pentium 4 Processor Performance Metrics............................................................................ B-1
Pentium 4 Processor-Specific Terminology............................................................................ B-2
Bogus, Non-bogus, Retire............................................................................................... B-2
Bus Ratio......................................................................................................................... B-2
Replay............................................................................................................................. B-3
Assist............................................................................................................................... B-3
Tagging............................................................................................................................ B-3
Counting Clocks..................................................................................................................... B-4
Non-Halted Clockticks..................................................................................................... B-5
Non-Sleep Clockticks...................................................................................................... B-6
Time Stamp Counter........................................................................................................ B-7
Microarchitecture Notes......................................................................................................... B-8
Trace Cache Events........................................................................................................ B-8
Bus and Memory Metrics................................................................................................. B-8
Reads due to program loads ................................................................................... B-11
Reads due to program writes (RFOs)...................................................................... B-11
Writebacks (dirty evictions)...................................................................................... B-12
Usage Notes for Specific Metrics .................................................................................. B-13
Usage Notes on Bus Activities...................................................................................... B-15
Metrics Descriptions and Categories ................................................................................... B-16
Performance Metrics and Tagging Mechanisms.................................................................. B-46
Tags for replay_event.................................................................................................... B-46
Tags for front_end_event............................................................................................... B-48
Tags for execution_event .............................................................................................. B-48
Using Performance Metrics with Hyper-Threading Technology........................................... B-50
Using Performance Events of Intel Core Solo and Intel Core Duo processors.................... B-56
Understanding the Results in a Performance Counter.................................................. B-56
Ratio Interpretation........................................................................................................ B-57
Notes on Selected Events............................................................................................. B-58
Appendix C IA-32 Instruction Latency and Throughput
Overview................................................................................................................................ C-2
Definitions .............................................................................................................................. C-4
Latency and Throughput........................................................................................................ C-4
Latency and Throughput with Register Operands.......................................................... C-6
Table Footnotes........................................................................................................ C-19
Latency and Throughput with Memory Operands ......................................................... C-20
Appendix D Stack Alignment
Stack Frames......................................................................................................................... D-1
Aligned esp-Based Stack Frames................................................................................... D-4
Aligned ebp-Based Stack Frames................................................................................... D-6
Stack Frame Optimizations.............................................................................................. D-9
Inlined Assembly and ebx.................................................................................................... D-10
Appendix E Mathematics of Prefetch Scheduling Distance
Simplified Equation ................................................................................................................ E-1
Mathematical Model for PSD................................................................................................. E-2
No Preloading or Prefetch................................................................................................ E-6
Compute Bound (Case: Tc >= Tl + Tb)............................................................................ E-7
Compute Bound (Case: Tl + Tb > Tc > Tb) ..................................................................... E-8
Memory Throughput Bound (Case: Tb >= Tc)............................................................... E-10
Example ........................................................................................................................ E-11
Index
Examples
Example 2-1 Assembly Code with an Unpredictable Branch ............................. 2-17
Example 2-2 Code Optimization to Eliminate Branches..................................... 2-17
Example 2-3 Eliminating Branch with CMOV Instruction.................................... 2-18
Example 2-4 Use of pause Instruction ............................................................... 2-19
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm.............. 2-20
Example 2-6 Static Taken Prediction Example ...................................................2-21
Example 2-7 Static Not-Taken Prediction Example ............................................ 2-21
Example 2-8 Indirect Branch With Two Favored Targets .................................... 2-25
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction ..... 2-26
Example 2-10 Loop Unrolling ............................................................................... 2-28
Example 2-11 Code That Causes Cache Line Split ............................................. 2-31
Example 2-12 Several Situations of Small Loads After Large Store .................... 2-35
Example 2-13 A Non-forwarding Example of Large Load After Small Store ........ 2-36
Example 2-14 A Non-forwarding Situation in Compiler Generated Code............. 2-36
Example 2-15 Two Examples to Avoid the Non-forwarding Situation in
Example 2-14 ................................................................................ 2-36
Example 2-16 Large and Small Load Stalls ......................................................... 2-37
Example 2-17 An Example of Loop-carried Dependence Chain .......................... 2-39
Example 2-18 Rearranging a Data Structure ....................................................... 2-39
Example 2-19 Decomposing an Array .................................................................. 2-40
Example 2-20 Dynamic Stack Alignment ............................................................. 2-43
Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions............ 2-54
Example 2-22 Non-temporal Stores and Partial Bus Write Transactions .............2-54
Example 2-23 Algorithm to Avoid Changing the Rounding Mode......................... 2-66
Example 2-24 Dependencies Caused by Referencing Partial Registers.............. 2-77
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form..................... 2-91
Example 2-26 Spill Scheduling Example Code .................................................... 2-92
Example 3-1 Identification of MMX Technology with cpuid................................... 3-3
Example 3-2 Identification of SSE with cpuid ....................................................... 3-4
Example 3-3 Identification of SSE by the OS ....................................................... 3-4
Example 3-4 Identification of SSE2 with cpuid..................................................... 3-5
Example 3-5 Identification of SSE2 by the OS..................................................... 3-6
Example 3-6 Identification of SSE3 with cpuid..................................................... 3-7
Example 3-7 Identification of SSE3 by the OS..................................................... 3-8
Example 3-8 Simple Four-Iteration Loop ............................................................ 3-14
Example 3-9 Streaming SIMD Extensions Using Inlined Assembly Encoding ...3-15
Example 3-10 Simple Four-Iteration Loop Coded with Intrinsics.......................... 3-16
Example 3-11 C++ Code Using the Vector Classes ............................................. 3-18
Example 3-12 Automatic Vectorization for a Simple Loop .................................... 3-19
Example 3-13 C Algorithm for 64-bit Data Alignment ........................................... 3-23
Example 3-14 AoS Data Structure ....................................................................... 3-27
Example 3-15 SoA Data Structure ....................................................................... 3-28
Example 3-16 AoS and SoA Code Samples ........................................................ 3-28
Example 3-17 Hybrid SoA Data Structure ............................................................ 3-30
Example 3-18 Pseudo-code Before Strip Mining.................................................. 3-32
Example 3-19 Strip Mined Code........................................................................... 3-33
Example 3-20 Loop Blocking................................................................................ 3-35
Example 3-21 Emulation of Conditional Moves.................................................... 3-37
Example 4-1 Resetting the Register between __m64 and FP Data Types...........4-5
Example 4-2 Unsigned Unpack Instructions......................................................... 4-7
Example 4-3 Signed Unpack Code ...................................................................... 4-8
Example 4-4 Interleaved Pack with Saturation ................................................... 4-10
Example 4-5 Interleaved Pack without Saturation ..............................................4-11
Example 4-6 Unpacking Two Packed-word Sources in a Non-interleaved Way . 4-13
Example 4-7 pextrw Instruction Code................................................................. 4-14
Example 4-8 pinsrw Instruction Code................................................................. 4-15
Example 4-9 Repeated pinsrw Instruction Code ................................................ 4-16
Example 4-10 pmovmskb Instruction Code.......................................................... 4-17
Example 4-11 pshuf Instruction Code .................................................................. 4-19
Example 4-12 Broadcast Using 2 Instructions ..................................................... 4-19
Example 4-13 Swap Using 3 Instructions............................................................. 4-20
Example 4-14 Reverse Using 3 Instructions......................................................... 4-20
Example 4-15 Generating Constants ................................................................... 4-21
Example 4-16 Absolute Difference of Two Unsigned Numbers ............................ 4-23
Example 4-17 Absolute Difference of Signed Numbers ....................................... 4-24
Example 4-18 Computing Absolute Value ............................................................ 4-25
Example 4-19 Clipping to a Signed Range of Words [high, low] .......................... 4-27
Example 4-20 Clipping to an Arbitrary Signed Range [high, low]......................... 4-27
Example 4-21 Simplified Clipping to an Arbitrary Signed Range ......................... 4-28
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]..................... 4-29
Example 4-23 Complex Multiply by a Constant .................................................... 4-32
Example 4-24 A Large Load after a Series of Small Stores (Penalty).................. 4-35
Example 4-25 Accessing Data without Delay....................................................... 4-35
Example 4-26 A Series of Small Loads after a Large Store .................................4-36
Example 4-27 Eliminating Delay for a Series of Small Loads after a
Large Store.................................................................................... 4-36
Example 4-28 An Example of Video Processing with Cache Line Splits.............. 4-37
Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits ........ 4-38
Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation ....................... 5-8
Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation........ 5-9
Example 5-3 Swizzling Data............................................................................... 5-10
Example 5-4 Swizzling Data Using Intrinsics ..................................................... 5-12
Example 5-5 Deswizzling Single-Precision SIMD Data ...................................... 5-14
Example 5-6 Deswizzling Data Using the movlhps and shuffle
Instructions .................................................................................... 5-15
Example 5-7 Deswizzling Data 64-bit Integer SIMD Data .................................. 5-16
Example 5-8 Using MMX Technology Code for Copying or Shuffling.................5-18
Example 5-9 Horizontal Add Using movhlps/movlhps ........................................ 5-19
Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps .................5-21
Example 5-11 Multiplication of Two Pairs of Single-precision Complex Numbers 5-24
Example 5-12 Division of Two Pairs of Single-precision Complex Numbers ........ 5-25
Example 5-13 Calculating Dot Products from AOS .............................................. 5-26
Example 6-1 Pseudo-code for Using clflush ....................................................... 6-18
Example 6-2 Populating an Array for Circular Pointer Chasing with
Constant Stride.............................................................................. 6-21
Example 6-3 Prefetch Scheduling Distance ....................................................... 6-26
Example 6-4 Using Prefetch Concatenation....................................................... 6-28
Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop ....... 6-28
Example 6-6 Spread Prefetch Instructions ......................................................... 6-33
Example 6-7 Data Access of a 3D Geometry Engine without Strip-mining........ 6-37
Example 6-8 Data Access of a 3D Geometry Engine with Strip-mining............. 6-38
Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic .......... 6-40
Example 6-10 Basic Algorithm of a Simple Memory Copy................................... 6-46
Example 6-11 A Memory Copy Routine Using Software Prefetch........................6-48
Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation.. 6-50
Example 7-1 Serial Execution of Producer and Consumer Work Items ............... 7-9
Example 7-2 Basic Structure of Implementing Producer Consumer Threads.... 7-11
Example 7-3 Thread Function for an Interlaced Producer Consumer Model .....7-13
Example 7-4 Spin-wait Loop and PAUSE Instructions........................................ 7-24
Example 7-5 Coding Pitfall using Spin Wait Loop .............................................. 7-29
Example 7-6 Placement of Synchronization and Regular Variables ..................7-32
Example 7-7 Declaring Synchronization Variables without Sharing
a Cache Line ................................................................................. 7-32
Example 7-8 Batched Implementation of the Producer Consumer Threads ......7-41
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads ............... 7-45
Example 7-10 Adding a Pseudo-random Offset to the Stack Pointer
in the Entry Function ..................................................................... 7-47
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical
Processor ...................................................................................... 7-51
Example 7-12 Assembling a Look up Table to Manage Affinity Masks
and Schedule Threads to Each Core First ....................................7-54
Example 7-13 Discovering the Affinity Masks for Sibling Logical
Processors Sharing the Same Cache ........................................... 7-55
Example D-1 Aligned esp-Based Stack Frames .................................................. D-5
Example D-2 Aligned ebp-based Stack Frames................................................... D-7
Example E-1 Calculating Insertion for Scheduling Distance of 3 ..........................E-3
Figures
Figure 1-1 Typical SIMD Operations ................................................................... 1-3
Figure 1-2 SIMD Instruction Register Usage ...................................................... 1-4
Figure 1-3 The Intel NetBurst Microarchitecture ............................................... 1-10
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core.......................1-19
Figure 1-5 The Intel Pentium M Processor Microarchitecture........................... 1-27
Figure 1-6 Hyper-Threading Technology on an SMP ........................................ 1-35
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition
and Intel Core Duo Processor ......................................................... 1-41
Figure 2-1 Cache Line Split in Accessing Elements in a Array ......................... 2-31
Figure 2-2 Size and Alignment Restrictions in Store Forwarding...................... 2-34
Figure 3-1 Converting to Streaming SIMD Extensions Chart .............................3-9
Figure 3-2 Hand-Coded Assembly and High-Level Compiler
Performance Trade-offs ................................................................... 3-13
Figure 3-3 Loop Blocking Access Pattern ......................................................... 3-36
Figure 4-1 PACKSSDW mm, mm/mm64 Instruction Example ........................... 4-9
Figure 4-2 Interleaved Pack with Saturation ....................................................... 4-9
Figure 4-3 Result of Non-Interleaved Unpack Low in MM0 .............................. 4-12
Figure 4-4 Result of Non-Interleaved Unpack High in MM1.............................. 4-12
Figure 4-5 pextrw Instruction ............................................................................ 4-14
Figure 4-6 pinsrw Instruction............................................................................. 4-15
Figure 4-7 pmovmskb Instruction Example....................................................... 4-17
Figure 4-8 pshuf Instruction Example ............................................................... 4-18
Figure 4-9 PSADBW Instruction Example ........................................................ 4-31
Figure 5-1 Homogeneous Operation on Parallel Data Elements ........................ 5-5
Figure 5-2 Dot Product Operation ....................................................................... 5-8
Figure 5-3 Horizontal Add Using movhlps/movlhps .......................................... 5-19
Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction .............. 5-23
Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction
           HADDPD ......................................................................................... 5-23
Figure 6-1 Effective Latency Reduction as a Function of Access Stride........... 6-22
Figure 6-2 Memory Access Latency and Execution Without Prefetch ..............6-23
Figure 6-3 Memory Access Latency and Execution With Prefetch ...................6-23
Figure 6-4 Prefetch and Loop Unrolling ............................................................ 6-29
Figure 6-5 Memory Access Latency and Execution With Prefetch ...................6-31
Figure 6-6 Cache Blocking – Temporally Adjacent and Non-adjacent
Passes............................................................................................. 6-35
Figure 6-7 Examples of Prefetch and Strip-mining for Temporally
Adjacent and Non-Adjacent Passes Loops ..................................... 6-36
Figure 6-8 Single-Pass Vs. Multi-Pass 3D Geometry Engines .........................6-42
Figure 7-1 Amdahl’s Law and MP Speed-up ...................................................... 7-3
Figure 7-2 Single-threaded Execution of Producer-consumer
Threading Model................................................................................ 7-9
Figure 7-3 Execution of Producer-consumer Threading Model on
a Multi-core Processor..................................................................... 7-10
Figure 7-4 Interlaced Variation of the Producer Consumer Model.................... 7-12
Figure 7-5 Batched Approach of Producer Consumer Model ........................... 7-40
Figure 9-1 Performance History and State Transitions ....................................... 9-3
Figure 9-2 Active Time Versus Halted Time of a Processor ............................... 9-4
Figure 9-3 Application of C-states to Idle Time................................................... 9-6
Figure 9-4 Profiles of Coarse Task Scheduling and Power Consumption ......... 9-12
Figure 9-5 Thread Migration in a Multi-Core Processor .................................... 9-17
Figure 9-6 Progression to Deeper Sleep .......................................................... 9-18
Figure A-1 Sampling Analysis of Hotspots by Location.....................................A-10
Figure A-2 Intel Thread Checker Can Locate Data Race Conditions ................A-18
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded
Execution Timelines.........................................................................A-20
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ
and Front Side Bus ..........................................................................B-10
Figure D-1 Stack Frames Based on Alignment Type .......................................... D-3
Figure E-1 Pentium II, Pentium III and Pentium 4 Processors Memory
Pipeline Sketch ..................................................................................E-4
Figure E-2 Execution Pipeline, No Preloading or Prefetch ..................................E-6
Figure E-3 Compute Bound Execution Pipeline ..................................................E-7
Figure E-4 Another Compute Bound Execution Pipeline.....................................E-8
Figure E-5 Memory Throughput Bound Pipeline ...............................................E-10
Figure E-6 Accesses per Iteration, Example 1 ..................................................E-12
Figure E-7 Accesses per Iteration, Example 2 ..................................................E-13
Tables
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters .................. 1-20
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32
          Processor Families ............................................................................ 1-30
Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and
          Intel® Core™ Duo Processors .......................................................... 1-30
Table 1-4 Family And Model Designations of Microarchitectures...................... 1-42
Table 1-5 Characteristics of Load and Store Operations
          in Intel Core Duo Processors ............................................................ 1-43
Table 2-1 Coding Pitfalls Affecting Performance ................................................. 2-2
Table 2-2 Avoiding Partial Flag Register Stall ................................................... 2-76
Table 2-3 Avoiding Partial Register Stall When Packing Byte Values ............... 2-78
Table 2-4 Avoiding False LCP Delays with 0xF7 Group Instructions ................ 2-81
Table 2-5 Using REP STOSD with Arbitrary Count Size and
          4-Byte-Aligned Destination ................................................................ 2-85
Table 5-1 SoA Form of Representing Vertices Data ........................................... 5-7
Table 6-1 Software Prefetching Considerations into Strip-mining Code............ 6-39
Table 6-2 Relative Performance of Memory Copy Routines ............................. 6-52
Table 6-3 Deterministic Cache Parameters Leaf............................................... 6-54
Table 7-1 Properties of Synchronization Objects .............................................. 7-21
Table B-1 Pentium 4 Processor Performance Metrics ....................................... B-18
Table B-2 Metrics That Utilize Replay Tagging Mechanism ............................... B-47
Table B-3 Metrics That Utilize the Front-end Tagging Mechanism ..................... B-48
Table B-4 Metrics That Utilize the Execution Tagging Mechanism .................... B-49
Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3).............. B-50
Table B-6 Metrics That Support Qualification by Logical Processor and
          Parallel Counting ............................................................................... B-51
Table B-7 Metrics That Are Independent of Logical Processors........................ B-55
Table C-1 Streaming SIMD Extension 3 SIMD Floating-point Instructions ......... C-6
Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions.................. C-7
Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point
          Instructions ......................................................................................... C-9
Table C-4 Streaming SIMD Extension Single-precision Floating-point
          Instructions ....................................................................................... C-12
Table C-5 Streaming SIMD Extension 64-bit Integer Instructions..................... C-14
Table C-6 MMX Technology 64-bit Instructions ................................................ C-14
Table C-7 IA-32 x87 Floating-point Instructions................................................ C-16
Table C-8 IA-32 General Purpose Instructions ................................................. C-17

Introduction

The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of the current generation of IA-32 Intel architecture family of processors. The optimizations described in this manual apply to IA-32 processors based on the Intel® NetBurst® microarchitecture, the Intel® Pentium® M processor family and IA-32 processors that support Hyper-Threading Technology.
The target audience for this manual includes software programmers and compiler writers. This manual assumes that the reader is familiar with the basics of the IA-32 architecture and has access to the Intel Software Developer’s Manual: Volume 1, Basic Architecture; Volume 2A, Instruction Set Reference A-M; Volume 2B, Instruction Set Reference N-Z; and Volume 3, System Programmer’s Guide.
When developing and optimizing software applications to achieve a high level of performance on IA-32 processors, a detailed understanding of the IA-32 family of processors is often required; in many cases, knowledge of specific IA-32 microarchitectures is needed.
This manual provides an overview of the Intel NetBurst microarchitecture and the Intel Pentium M processor microarchitecture. It contains design guidelines for high-performance software applications, coding rules, and techniques for many aspects of code-tuning. These rules are useful to programmers and compiler developers.
The design guidelines that are discussed in this manual for developing high-performance software apply to current as well as to future IA-32 processors. The coding rules and code optimization techniques listed
target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture.
Tuning Your Application
Tuning an application for high performance on any IA-32 processor requires understanding and basic skills in:
IA-32 architecture
C and Assembly language
the hot-spot regions in your application that have significant impact
on software performance
the optimization capabilities of your compiler
techniques to evaluate the application’s performance
The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot regions in your applications. On the Pentium 4, Intel® Xeon and Pentium® M processors, this tool can monitor an application through a selection of performance monitoring events and analyze the performance event data that is gathered during code execution.
This manual also describes information that can be gathered using the performance counters through Pentium 4 processor’s performance monitoring events.
For VTune Performance Analyzer order information, see the web page:
http://developer.intel.com
About This Manual
In this document, the reference “Pentium 4 processor” refers to processors based on the Intel NetBurst microarchitecture. Currently this includes the Intel Pentium 4 processor and Intel Xeon processor. Where appropriate, differences between the Pentium 4 processor and the Intel Xeon processor are noted.
The manual consists of the following parts:
Introduction. Defines the purpose and outlines the contents of this manual.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview. Describes the features relevant to software optimization of the current generation of IA-32 Intel architecture processors, including the architectural extensions to the IA-32 architecture and an overview of the Intel NetBurst microarchitecture, Pentium M processor microarchitecture and Hyper-Threading Technology.
Chapter 2: General Optimization Guidelines. Describes general code development and optimization techniques that apply to all applications designed to take advantage of the common features of the Intel NetBurst microarchitecture and Pentium M processor microarchitecture.
Chapter 3: Coding for SIMD Architectures. Describes techniques and concepts for using the SIMD integer and SIMD floating-point instructions provided by the MMX technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, and Streaming SIMD Extensions 3.
Chapter 4: Optimizing for SIMD Integer Applications. Provides optimization suggestions and common building blocks for applications that use the 64-bit and 128-bit SIMD integer instructions.
Chapter 5: Optimizing for SIMD Floating-point Applications. Provides optimization suggestions and common building blocks for applications that use the single-precision and double-precision SIMD floating-point instructions.
Chapter 6: Optimizing Cache Usage. Describes how to use the prefetch instruction and cache control management instructions to optimize cache usage, and describes the deterministic cache parameters.
Chapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing multithreaded applications to achieve optimal performance scaling. Use these when targeting multiprocessor (MP) systems or MP systems using IA-32 processors that support Hyper-Threading Technology.
Chapter 8: 64-Bit Mode Coding Guidelines. This chapter describes a set of additional coding guidelines for application software written to run in 64-bit mode.
Chapter 9: Power Optimization for Mobile Usages. This chapter provides background on power saving techniques in mobile processors and makes recommendations that developers can leverage to provide longer battery life.
Appendix A: Application Performance Tools. Introduces tools for analyzing and enhancing application performance without having to write assembly code.
Appendix B: Intel Pentium 4 Processor Performance Metrics. Provides information that can be gathered using Pentium 4 processor’s performance monitoring events. These performance metrics can help programmers determine how effectively an application is using the features of the Intel NetBurst microarchitecture.
Appendix C: IA-32 Instruction Latency and Throughput. Provides latency and throughput data for the IA-32 instructions. Instruction timing data specific to the Pentium 4 and Pentium M processors are provided.
Appendix D: Stack Alignment. Describes stack alignment conventions and techniques to optimize performance of accessing stack-based data.
Appendix E: The Mathematics of Prefetch Scheduling Distance. Discusses the optimum spacing to insert prefetch instructions and presents a mathematical model for determining the prefetch scheduling distance (PSD) for your application.
Related Documentation
For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following documents:
Intel® C++ Compiler User’s Guide
Intel® Fortran Compiler User’s Guide
VTune Performance Analyzer online help
Intel® Architecture Software Developer’s Manual:
— Volume 1: Basic Architecture, doc. number 253665
— Volume 2A: Instruction Set Reference Manual A-M, doc. number 253666
— Volume 2B: Instruction Set Reference Manual N-Z, doc. number 253667
— Volume 3: System Programmer’s Guide, doc. number 253668
Intel Processor Identification with the CPUID Instruction, doc. number 241618
Developing Multi-threaded Applications: A Platform Consistent
Approach, available at http://cache-www.intel.com/cd/00/00/05/15/51534_developing_mul tithreaded_applications.pdf
Also, refer to the following Application Notes:
Adjusting Thread Stack Address To Improve Performance On Intel
Xeon MP Hyper-Threading Technology Enabled Processors
Detecting Hyper-Threading Technology Enabled Processors
Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon
Processor MP
In addition, refer to publications in the following web sites:
http://developer.intel.com/technology/hyperthread
http://cedar.intel.com/cgi-bin/ids.dll/topic.jsp?catCode=CDN
Notational Conventions
This manual uses the following conventions:
This type style    Indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example. The text appears in lowercase unless uppercase is significant.
THIS TYPE STYLE    Indicates a value, for example, TRUE, CONST1, or a variable, for example, A, B, or register names MMO through MM7. l indicates lowercase letter L in examples. 1 is the number 1 in examples. O is the uppercase O in examples. 0 is the number 0 in examples.
This type style    Indicates a placeholder for an identifier, an expression, a string, a symbol, or a value. Substitute one of these items for the placeholder.
... (ellipses)     Indicate that a few lines of the code are omitted.
This type style    Indicates a hypertext link.

IA-32 Intel® Architecture Processor Family Overview

This chapter gives an overview of the features relevant to software optimization for the current generations of IA-32 processors, including: Intel® Core™ Solo, Intel® Core™ Duo, Intel® Pentium® 4, Intel® Xeon®, Intel® Pentium® M, and IA-32 processors with multi-core architecture. These features include:
SIMD instruction extensions including MMX™ technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions 3 (SSE3)
Microarchitectures that enable executing instructions with high throughput at high clock rates, a high speed cache hierarchy and the ability to fetch data with high speed system bus
Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel® processors supporting Hyper-Threading (HT) Technology (see footnote 1)
Multi-core architecture supported in Intel® Core™ Duo, Intel® Pentium® D processors and Pentium® processor Extreme Edition (see footnote 2)
Intel Pentium 4 processors, Intel Xeon processors, Pentium D processors, and Pentium processor Extreme Editions are based on Intel NetBurst® microarchitecture. The Intel Pentium M processor microarchitecture balances performance and low power consumption.
1. Hyper-Threading Technology requires a computer system with an Intel processor supporting HT Technology and an HT Technology enabled chipset, BIOS and operating system. Performance varies depending on the hardware and software used.
2. Dual-core platform requires an Intel Core Duo, Pentium D processor or Pentium processor Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance varies depending on the hardware and software used.
Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor.

SIMD Technology

SIMD computations (see Figure 1-1) were introduced in the IA-32 architecture with MMX technology. MMX technology allows SIMD computations to be performed on packed byte, word, and doubleword integers. The integers are contained in a set of eight 64-bit registers called MMX registers (see Figure 1-2).
The Pentium III processor extended the SIMD computation model with the introduction of the Streaming SIMD Extensions (SSE). SSE allows SIMD computations to be performed on operands that contain four packed single-precision floating-point data elements. The operands can be in memory or in a set of eight 128-bit XMM registers (see Figure 1-2). SSE also extended SIMD computational capability by adding additional 64-bit MMX instructions.
Figure 1-1 shows a typical SIMD computation. Two sets of four packed data elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are operated on in parallel, with the same operation being performed on
each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.
Figure 1-1 Typical SIMD Operations
[figure: packed operands X4 X3 X2 X1 and Y4 Y3 Y2 Y1 are combined by four parallel OPs, producing X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1]
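To make the model concrete, the following minimal sketch uses SSE compiler intrinsics (discussed in Chapter 3) to perform the four parallel single-precision additions depicted in Figure 1-1 with a single instruction; the function name is illustrative, not from this manual.

#include <xmmintrin.h>  /* SSE intrinsics */

/* One ADDPS computes X4 op Y4, X3 op Y3, X2 op Y2 and X1 op Y1
   in parallel, where op is addition. */
__m128 simd_add4(__m128 x, __m128 y)
{
    return _mm_add_ps(x, y);
}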
The Pentium 4 processor further extended the SIMD computation model with the introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3).
SSE2 works with operands in either memory or in the XMM registers. The technology extends SIMD computations to process packed double-precision floating-point data elements and 128-bit packed integers. There are 144 instructions in SSE2 that operate on two packed double-precision floating-point data elements or on 16 packed byte, 8 packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that can accelerate application performance in specific areas. These include video processing, complex arithmetic, and thread synchronization. SSE3 complements SSE and SSE2 with instructions that process SIMD data asymmetrically, facilitate horizontal computation, and help avoid loading cache line splits.
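As a brief illustration of horizontal computation, the sketch below uses the SSE3 horizontal-add intrinsic; the function name is illustrative, not from this manual.

#include <pmmintrin.h>  /* SSE3 intrinsics */

/* HADDPS adds adjacent elements within each source operand:
   result = (a0+a1, a2+a3, b0+b1, b2+b3). */
__m128 horizontal_add(__m128 a, __m128 b)
{
    return _mm_hadd_ps(a, b);
}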
Figure 1-2 SIMD Instruction Register Usage
[figure: eight 64-bit MMX registers, MM0 through MM7, and eight 128-bit XMM registers, XMM0 through XMM7]
SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following characteristics:
inherently parallel
recurring memory access patterns
localized recurring operations performed on the data
data-independent control flow
SIMD floating-point instructions fully support the IEEE Standard 754 for Binary Floating-Point Arithmetic. They are accessible from all IA-32 execution modes: protected mode, real address mode, and Virtual 8086 mode.
SSE, SSE2, and MMX technologies are architectural extensions in the IA-32 Intel architecture. Existing software will continue to run correctly, without modification on IA-32 microprocessors that incorporate these technologies. Existing software will also run correctly in the presence of applications that incorporate SIMD technologies.
SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can improve cache usage and application performance.
For more on SSE, SSE2, SSE3 and MMX technologies, see:
IA-32 Intel® Architecture Software Developer’s Manual, Volume 1: Chapter 9, “Programming with Intel® MMX™ Technology”; Chapter 10, “Programming with Streaming SIMD Extensions (SSE)”; Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2)”; Chapter 12, “Programming with Streaming SIMD Extensions 3 (SSE3)”

Summary of SIMD Technologies

MMX Technology
MMX Technology introduced:
64-bit MMX registers
support for SIMD operations on packed byte, word, and doubleword
integers
MMX instructions are useful for multimedia and communications software.
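A minimal sketch of the MMX programming model using compiler intrinsics (the function name is illustrative, not from this manual): one PADDW performs four packed 16-bit additions, and EMMS clears the MMX state before any subsequent x87 floating-point code.

#include <mmintrin.h>  /* MMX technology intrinsics */

/* PADDW: four packed 16-bit additions in one instruction. */
void add_packed_words(const __m64 *a, const __m64 *b, __m64 *result)
{
    *result = _mm_add_pi16(*a, *b);
    _mm_empty();  /* EMMS: required before later x87 FP instructions */
}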
Streaming SIMD Extensions
Streaming SIMD extensions introduced:
128-bit XMM registers
128-bit data type with four packed single-precision floating-point
operands
data prefetch instructions
non-temporal store instructions and other cacheability and memory
ordering instructions
extra 64-bit SIMD integer support
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.
Streaming SIMD Extensions 2
Streaming SIMD extensions 2 add the following:
128-bit data type with two packed double-precision floating-point
operands
128-bit data types for SIMD integer operation on 16-byte, 8-word,
4-doubleword, or 2-quadword integers
support for SIMD arithmetic on 64-bit integer operands
instructions for converting between new and existing data types
extended support for data shuffling
extended support for cacheability and memory ordering operations
SSE2 instructions are useful for 3D graphics, video decoding/encoding, and encryption.
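A minimal sketch of an SSE2 packed integer operation using compiler intrinsics (the function name is illustrative, not from this manual): one PADDD adds four doubleword integers at once.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* PADDD: four packed 32-bit integer additions in one instruction. */
__m128i add_packed_dwords(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}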
Streaming SIMD Extensions 3
Streaming SIMD extensions 3 add the following:
SIMD floating-point instructions for asymmetric and horizontal
computation
a special-purpose 128-bit load instruction to avoid cache line splits (see the sketch at the end of this section)
an x87 FPU instruction to convert to integer independent of the
floating-point control word (FCW)
instructions to support thread synchronization
SSE3 instructions are useful for scientific, video and multi-threaded applications.
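The special-purpose load mentioned above is LDDQU, exposed through a compiler intrinsic. A minimal sketch (the function name is illustrative, not from this manual):

#include <pmmintrin.h>  /* SSE3 intrinsics */

/* LDDQU: load 16 bytes from an address with no alignment requirement,
   designed to avoid cache line split penalties. */
__m128i load_unaligned(const void *p)
{
    return _mm_lddqu_si128((const __m128i *)p);
}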

Intel® Extended Memory 64 Technology (Intel®EM64T)

Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits. The technology also introduces a new operating mode referred to as IA-32e mode.
IA-32e mode consists of two sub-modes: (1) compatibility mode enables a 64-bit operating system to run most legacy 32-bit software unmodified, (2) 64-bit mode enables a 64-bit operating system to run applications written to access 64-bit linear address space.
In the 64-bit mode of Intel EM64T, software may access:
64-bit flat linear addressing
8 additional general-purpose registers (GPRs)
8 additional registers for streaming SIMD extensions (SSE, SSE2
and SSE3)
64-bit-wide GPRs and instruction pointers
uniform byte-register addressing
fast interrupt-prioritization mechanism
a new instruction-pointer relative-addressing mode
For optimizing 64-bit applications, the features that impact software optimizations include:
using a set of prefixes to access new registers or 64-bit register
operand
pointer size increases from 32 bits to 64 bits (see the sketch following this list)
instruction-specific usages
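As a small illustration of the pointer-size change (a sketch, not from this manual), pointer-heavy data structures grow when compiled for 64-bit mode, which increases their cache footprint:

#include <stdio.h>

/* A pointer-heavy node grows from 8 bytes in 32-bit mode to 16 bytes
   in 64-bit mode (8-byte pointer plus 4-byte int plus padding). */
struct node {
    struct node *next;   /* 4 bytes in 32-bit mode, 8 bytes in 64-bit mode */
    int          value;
};

int main(void)
{
    printf("pointer size: %u bytes\n", (unsigned) sizeof(void *));
    printf("node size:    %u bytes\n", (unsigned) sizeof(struct node));
    return 0;
}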

Intel NetBurst® Microarchitecture

The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture.
This section describes the features of the Intel NetBurst microarchitecture and its operation common to the above processors. It provides the technical background required to understand optimization recommendations and the coding rules discussed in the rest of this manual. For implementation details, including instruction latencies, see Appendix C, “IA-32 Instruction Latency and Throughput.”
Intel NetBurst microarchitecture is designed to achieve high performance for integer and floating-point computations at high clock rates. It supports the following features:
hyper-pipelined technology that enables high clock rates
a high-performance, quad-pumped bus interface to the Intel
NetBurst microarchitecture system bus
a rapid execution engine to reduce the latency of basic integer
instructions
out-of-order speculative execution to enable parallelism
superscalar issue to enable parallelism
hardware register renaming to avoid register name space limitations
cache line sizes of 64 bytes
hardware prefetch

Design Goals of Intel NetBurst Microarchitecture

The design goals of Intel NetBurst microarchitecture are:
to execute legacy IA-32 applications and applications based on
single-instruction, multiple-data (SIMD) technology at high throughput
to operate at high clock rates and to scale to higher performance and
clock rates in the future
Design advances of the Intel NetBurst microarchitecture include:
a deeply pipelined design that allows for high clock rates (with
different parts of the chip running at different clock rates).
a pipeline that optimizes for the common case of frequently
executed instructions; the most frequently-executed instructions in common circumstances (such as a cache hit) are decoded efficiently and executed with short latencies
employment of techniques to hide stall penalties; among these are
parallel execution, buffering, and speculation. The microarchitecture executes instructions dynamically and out-of-order, so the time it takes to execute each individual instruction is not always deterministic
Chapter 2, “General Optimization Guidelines,” lists optimizations to use and situations to avoid. The chapter also gives a sense of relative priority. Because most optimizations are implementation dependent, the chapter does not quantify expected benefits and penalties.
The following sections provide more information about key features of the Intel NetBurst microarchitecture.

Overview of the Intel NetBurst Microarchitecture Pipeline

The pipeline of the Intel NetBurst microarchitecture contains:
an in-order issue front end
an out-of-order superscalar execution core
an in-order retirement unit
The front end supplies instructions in program order to the out-of-order core. It fetches and decodes IA-32 instructions. The decoded IA-32 instructions are translated into micro-operations (µops). The front end’s primary job is to feed a continuous stream of µops to the execution core in original program order.
The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution resources available) can execute as soon as possible. The core can issue multiple µops per cycle.
The retirement section ensures that the results of execution are processed according to original program order and that the proper architectural states are updated.
Figure 1-3 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline. The following subsections provide an overview for each.
Figure 1-3 The Intel NetBurst Microarchitecture
6\VWHP%XV
%XV8QLW
UG/HYHO&DFKH
QG/HYHO&DFKH
)URQW(QG
)HWFK'HFRGH
%7%V%UDQFK3UHGLFWLRQ
2SWLRQDO
:D\
7UDFH&DFKH
0LFURFRGH520
)UHTXHQWO\XVHGSDWKV
/HVVIUHTXHQWO\XVHGSDWKV
VW/HYHO&DFKH
ZD\
([HFXWLRQ
2XW2I2UGHU&RUH
%UDQFK+LVWRU\8SGDWH
5HWLUHPHQW
The Front End
The front end of the Intel NetBurst microarchitecture consists of two parts:
fetch/decode unit
execution trace cache
It performs the following functions:
prefetches IA-32 instructions that are likely to be executed
fetches required instructions that have not been prefetched
decodes instructions into µops
generates microcode for complex instructions and special-purpose
code
delivers decoded instructions from the execution trace cache
predicts branches using advanced algorithms
The front end is designed to address two problems that are sources of delay:
the time required to decode instructions fetched from the target
wasted decode bandwidth due to branches or a branch target in the
middle of a cache line
Instructions are fetched and decoded by a translation engine. The translation engine builds decoded instructions into µop sequences called traces, which are then stored in the execution trace cache.
The execution trace cache stores µops in the path of program execution flow, where the results of branches in the code are integrated into the same cache line. This increases the instruction flow from the cache and makes better use of the overall cache storage space since the cache no longer stores instructions that are branched over and never executed.
The trace cache can deliver up to 3 µops per clock to the core.
The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached, otherwise they are fetched from the memory hierarchy. The translation engine’s branch prediction information is used to form traces along the most likely paths.
The Out-of-order Core
The core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed while waiting for data or a contended resource, other µops that appear later in the program order may proceed. This implies that when one portion of the pipeline experiences a delay, the delay may be covered by other operations executing in parallel or by the execution of µops queued up in a buffer.
The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle through the issue ports (see Figure 1-4). Note that six µops per cycle exceeds the trace cache and retirement µop bandwidth. The higher bandwidth in the core allows for peak bursts of greater than three µops and to achieve higher issue rates by allowing greater flexibility in issuing µops to different execution ports.
Most core execution units can start executing a new µop every cycle, so several instructions can be in flight at one time in each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions start one every two cycles. Finally, µops can begin execution out of program order, as soon as their data inputs are ready and resources are available.
Retirement
The retirement section receives the results of the executed µops from the execution core and processes the results so that the architectural state is updated according to the original program order. For semantically
correct execution, the results of IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. For this reason, exceptions cannot occur speculatively.
When a µop completes and writes its result to the destination, it is retired. Up to three µops may be retired per cycle. The reorder buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state and manages the ordering of exceptions.
The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB). This updates branch history. Figure 1-3 illustrates the paths that are most frequently executing inside the Intel NetBurst microarchitecture: an execution loop that interacts with multilevel cache hierarchy and the system bus.
The following sections describe in more detail the operation of the front end and the execution core. This information provides the background for using the optimization techniques and instruction latency data documented in this manual.

Front End Pipeline Detail

The following information about front end operation may be useful for tuning software with respect to prefetching, branch prediction, and execution trace cache operations.
Prefetching
The Intel NetBurst microarchitecture supports three prefetching mechanisms:
a hardware instruction fetcher that automatically prefetches
instructions
a hardware mechanism that automatically fetches data and
instructions into the unified second-level cache
a mechanism that fetches data only, with two distinct components: (1) a hardware mechanism that fetches the adjacent cache line within a 128-byte sector containing the data needed due to a cache line miss (also referred to as adjacent cache line prefetch), and (2) a software controlled mechanism that fetches data into the caches using the prefetch instructions.
The hardware instruction fetcher reads instructions along the path predicted by the branch target buffer (BTB) into instruction streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are described later.
Decoder
The front end of the Intel NetBurst microarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock. Some complex instructions must enlist the help of the microcode ROM. The decoder operation is connected to the execution trace cache.
Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst microarchitecture. The TC stores decoded IA-32 instructions (µops).
In the Pentium 4 processor implementation, TC can hold up to 12K µops and can deliver up to three µops per cycle. TC does not hold all of the µops that need to be executed in the execution core. In some situations, the execution core may need to execute a microcode flow instead of the µop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace cache while only a few instructions involve the microcode ROM.
Branch Prediction
Branch prediction is important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. For Pentium 4 and Intel Xeon processors, the branch delay for a correctly predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many cycles, usually equivalent to the pipeline depth.
Branch prediction in the Intel NetBurst microarchitecture predicts all near branches (conditional branches, unconditional calls, returns and indirect branches). It does not predict far transfers (far calls, irets and software interrupts).
Mechanisms have been implemented to aid in predicting branches accurately and to reduce the cost of taken branches. These include:
the ability to dynamically predict the direction and target of
branches based on an instruction’s linear address, using the branch target buffer (BTB)
if no dynamic prediction is available or if it is invalid, the ability to
statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken
the ability to predict return addresses using the 16-entry return
address stack
the ability to build a trace of instructions across predicted taken
branches to avoid branch penalties.
The Static Predictor. Once a branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement, such as loop-closing branches) as taken. Forward branches are predicted not taken.
To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2).
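A minimal C sketch of this arrangement (illustrative code, not from this manual): the rare path is moved out of line so the forward branch guarding it falls through on the likely path, matching the static not-taken prediction.

/* The NULL check compiles to a forward branch, statically predicted
   not taken; the common case falls through and the rare error path
   sits at the end of the function. */
int process(const int *p)
{
    if (p == 0)
        goto rare_error;   /* forward branch, rarely taken */
    return *p * 2;         /* likely target immediately follows */
rare_error:
    return -1;
}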
Branch Target Buffer. Once branch history is available, the Pentium 4
processor can predict the branch outcome even before the branch instruction is decoded. The processor uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction’s linear address. Once the branch is retired, the BTB is updated with the target address.
Return Stack. Returns are always taken; but since a procedure may be invoked from several call sites, a single predicted target does not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the need to put certain procedures inline since the return penalty portion of the procedure call overhead is reduced.
Even if the direction and target address of the branch are correctly predicted, a taken branch may reduce available parallelism in a typical processor (since the decode bandwidth is wasted for instructions which immediately follow the branch and precede the target, if the branch does not end the line and target does not begin the line). The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing instruction delivery from the front end.

Execution Core Detail

The execution core is designed to optimize overall performance by handling common cases most efficiently. The hardware is designed to execute frequent operations in a common context as fast as possible, at the expense of infrequent operations using rare contexts.
Some parts of the core may speculate that a common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains to store-to-load forwarding (see “Store Forwarding” in this chapter). If a load is predicted to be dependent on a store, it gets its data from that store and tentatively proceeds. If the load turned out not to depend on the store, the load is delayed until the real data has been loaded from memory, then it proceeds.
Instruction Latency and Throughput
The superscalar out-of-order core contains hardware resources that can execute multiple μops in parallel. The core’s ability to make use of available parallelism of execution units can be enhanced by software’s ability to:
select IA-32 instructions that can be decoded in less than 4 μops
and/or have short latencies
order IA-32 instructions to preserve available parallelism by
minimizing long dependence chains and covering long instruction latencies
order instructions so that their operands are ready and their
corresponding issue ports and execution units are free when they reach the scheduler
This subsection describes port restrictions, result latencies, and issue latencies (also referred to as throughput). These concepts form the basis to assist software for ordering instructions to increase parallelism. The order that μops are presented to the core of the processor is further affected by the machine’s scheduling resources.
It is the execution core that reacts to an ever-changing machine state, reordering μops for faster execution or delaying them because of dependence and resource constraints. The ordering of instructions in software is more of a suggestion to the hardware.
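One common way software preserves available parallelism, sketched below with illustrative C code (not from this manual), is to break a long loop-carried dependence chain into several independent accumulators so the out-of-order core can issue the additions in parallel.

/* Four independent accumulators replace one serial dependence chain. */
float sum(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];      /* the four chains do not depend on each   */
        s1 += a[i + 1];  /* other, so their adds can execute in     */
        s2 += a[i + 2];  /* parallel on the out-of-order core       */
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* handle any remainder serially */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}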
Appendix C, “IA-32 Instruction Latency and Throughput,” lists some of the more-commonly-used IA-32 instructions with their latency, their issue throughput, and associated execution units (where relevant). Some
execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate. All µops executed out of the microcode ROM involve extra overhead.
Execution Units and Issue Ports
At each cycle, the core may dispatch µops to one or more of four issue ports. At the microarchitecture level, store operations are further divided into two parts: store data and store address operations. The four ports through which μops are dispatched to execution units and to load and store operations are shown in Figure 1-4. Some ports can dispatch two µops per clock. Those execution units are marked Double Speed.
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (a floating-point stack move, floating-point exchange or floating-point store data), or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store data). In the second half of the cycle, it can dispatch one similar ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) µop or one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic) µop. In the second half of the cycle, it can dispatch one similar ALU µop.
Port 2. This port supports the dispatch of one load operation per cycle.
Port 3. This port supports the dispatch of one store address operation per cycle.
The total issue bandwidth can range from zero to six µops per cycle.
Each pipeline contains several execution units. The µops are dispatched to the pipeline that corresponds to the correct type of operation. For example, an integer arithmetic logic unit and the floating-point execution units (adder, multiplier, and divider) can share a pipeline.
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core
[figure: Port 0 dispatches to ALU 0 (double speed: ADD/SUB, logic, store data, branches) and to FP Move (FP move, FP store data, FXCH); Port 1 dispatches to ALU 1 (double speed: ADD/SUB, shift/rotate), to normal-speed Integer Operation (multiply, shift, rotate), and to FP Execute (FP_ADD, FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, MMX_MISC); Port 2 dispatches Memory Load (all loads, prefetch); Port 3 dispatches Memory Store (store address). Note: FP_ADD refers to x87 FP, and SIMD FP add and subtract operations; FP_MUL refers to x87 FP, and SIMD FP multiply operations; FP_DIV refers to x87 FP, and SIMD FP divide and square root operations; MMX_ALU refers to SIMD integer arithmetic and logic operations; MMX_SHFT handles shift, rotate, shuffle, pack and unpack operations; MMX_MISC handles SIMD reciprocal and some integer operations]

Caches

The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture. The Intel Xeon processor MP and selected Pentium and Intel Xeon processors may also contain a third-level cache.
The first level cache (nearest to the execution core) contains separate caches for instructions and data. These include the first-level data cache and the trace cache (an advanced first-level instruction cache). All other caches are shared between instructions and data.
Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm.
Table 1-1 provides parameters for all cache levels of Pentium 4 and Intel Xeon processors with CPUID model encoding equal to 0, 1, 2 or 3.
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters

Level (Model)            Capacity                  Associativity (ways)  Line Size (bytes)  Access Latency, Integer/floating-point (clocks)  Write Update Policy
First (Model 0, 1, 2)    8 KB                      4                     64                 2/9                                               write through
First (Model 3)          16 KB                     8                     64                 4/12                                              write through
TC (All models)          12K µops                  8                     N/A                N/A                                               N/A
Second (Model 0, 1, 2)   256 KB or 512 KB (2)      8                     64 (1)             7/7                                               write back
Second (Model 3, 4)      1 MB                      8                     64 (1)             18/18                                             write back
Second (Model 3, 4, 6)   2 MB                      8                     64 (1)             20/20                                             write back
Third (Model 0, 1, 2)    0, 512 KB, 1 MB or 2 MB   8                     64 (1)             14/14                                             write back

1. Each read due to a cache miss fetches a sector, consisting of two adjacent cache lines; a write operation is 64 bytes.
2. Pentium 4 and Intel Xeon processors with CPUID model encoding value of 2 have a second level cache of 512 KB.
On processors without a third level cache, the second-level cache miss initiates a transaction across the system bus interface to the memory sub-system. On processors with a third level cache, the third-level cache miss initiates a transaction across the system bus. A bus write transaction writes 64 bytes to cacheable memory, or separate 8-byte chunks if the destination is not cacheable. A bus read transaction from cacheable memory fetches two cache lines of data.
The system bus interface supports using a scalable bus clock and achieves an effective speed that quadruples the speed of the scalable bus clock. It takes on the order of 12 processor cycles to get to the bus and
back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as bus ratio. For example, one bus cycle for a 100 MHz bus is equal to 15 processor cycles on a 1.50 GHz processor. Since the speed of the bus is implementation-dependent, consult the specifications of a given system for further details.
Data Prefetch
The Pentium 4 processor and other IA-32 processors based on the NetBurst microarchitecture have two types of mechanisms for prefetching data: software prefetch instructions and hardware-based prefetch mechanisms.
Software controlled prefetch is enabled using the four prefetch instructions (PREFETCHh) introduced with SSE. The software prefetch is not intended for prefetching code. Using it can incur significant penalties on a multiprocessor system if code is shared.
Software prefetch can provide benefits in selected situations (a usage sketch in C follows this list). These situations include:
when the pattern of memory access operations in software allows the programmer to hide memory latency
when a reasonable choice can be made about how many cache lines to fetch ahead of the line being executed
when a choice can be made about the type of prefetch to use
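As a minimal sketch, software prefetch is usually issued through a compiler intrinsic rather than hand-written assembly. The example below assumes a compiler that provides the _mm_prefetch intrinsic from xmmintrin.h (the C interface to the PREFETCHh instructions); the function name and prefetch distance are illustrative only.

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* Illustrative prefetch distance; the right value depends on memory
   latency and on the amount of work done per iteration. */
#define PREFETCH_DISTANCE 16

float sum_with_prefetch(const float *data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* _MM_HINT_T0 fetches into all cache levels;
               _MM_HINT_NTA requests a non-temporal fetch. */
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                         _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}

A production version would issue one prefetch per 64-byte cache line (every sixteen floats here) rather than per element, since redundant prefetches consume issue bandwidth.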
SSE prefetch instructions have different behaviors, depending on cache levels updated and the processor implementation. For instance, a processor may implement the non-temporal prefetch by returning data to the cache level closest to the processor core. This approach has the following effect:
minimizes disturbance of temporal data in other cache levels
avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache levels
Situations that are less likely to benefit from software prefetch are:
for cases that are already bandwidth bound, prefetching tends to increase bandwidth demands
prefetching far ahead can cause eviction of cached data from the caches prior to the data being used in execution
not prefetching far enough can reduce the ability to overlap memory and execution latencies
Software prefetches are treated by the processor as hints to initiate requests to fetch data from the memory system; they consume resources in the processor, and using too many prefetches can limit their effectiveness. Examples of this include prefetching data in a loop for a reference outside the loop, and prefetching in a basic block that is frequently executed but seldom precedes the reference for which the prefetch is targeted.
See also: Chapter 6, “Optimizing Cache Usage.”
Automatic hardware prefetch is a feature in the Pentium 4 processor. It brings cache lines into the unified second-level cache based on prior reference patterns. See also: Chapter 6, “Optimizing Cache Usage.”
Pros and Cons of Software and Hardware Prefetching. Software prefetching has the following characteristics:
handles irregular access patterns, which would not trigger the hardware prefetcher
handles prefetching of short arrays and avoids hardware prefetching start-up delay before initiating the fetches
must be added to new code, so it does not benefit existing applications
Hardware prefetching for Pentium 4 processor has the following characteristics:
works with existing applications
does not require extensive study of prefetch instructions
requires regular access patterns
avoids instruction and issue port bandwidth overhead
has a start-up penalty before the hardware prefetcher triggers and begins initiating fetches
The hardware prefetcher can handle multiple streams in either the forward or backward direction. The start-up delay and fetch-ahead distance have a larger effect for short arrays, where hardware prefetching generates requests for data beyond the end of the array (data that is not actually used). The hardware penalty diminishes if it is amortized over longer arrays.
Hardware prefetching is triggered after two successive cache misses in the last level cache, and requires the linear address distance between these cache misses to be within a threshold value. The threshold value depends on the processor implementation of the microarchitecture (see Table 1-2). However, hardware prefetching will not cross 4KB page boundaries. As a result, hardware prefetching can be very effective for cache miss patterns with small strides that are significantly less than half the trigger threshold distance. On the other hand, hardware prefetching will not benefit cache miss patterns that have frequent DTLB misses, or access strides that cause successive cache misses to be spatially apart by more than the trigger threshold distance.
Software can proactively control data access patterns to favor smaller access strides (e.g., strides less than half the trigger threshold distance) over larger access strides (strides greater than the trigger threshold distance). This can achieve the additional benefits of improved temporal locality and significantly fewer cache misses in the last level cache.
Thus, software optimization of a data access pattern should first emphasize tuning for hardware prefetch, favoring a greater proportion of smaller-stride data accesses in the workload, before attempting to provide hints to the processor by employing software prefetch instructions.
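As an illustration of this principle (assumed here, not from the original text), the following C sketch contrasts two traversal orders of the same two-dimensional array. The row-order loop produces consecutive accesses whose cache-miss stride is one 64-byte line, which the hardware prefetcher can follow; the column-order loop produces a stride of COLS * sizeof(double) bytes, which exceeds the trigger thresholds in Table 1-2.

#define ROWS 1024
#define COLS 1024

/* Row-order traversal: the inner loop walks consecutive addresses, so
   successive cache misses are 64 bytes apart. */
double sum_row_order(const double a[ROWS][COLS])
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-order traversal: successive accesses are COLS * sizeof(double)
   = 8192 bytes apart, beyond the trigger threshold, so the hardware
   prefetcher cannot help. */
double sum_column_order(const double a[ROWS][COLS])
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}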
Loads and Stores
The Pentium 4 processor employs the following techniques to speed up the execution of memory operations:
speculative execution of loads
reordering of loads with respect to loads and stores
multiple outstanding misses
buffering of writes
forwarding of data from stores to dependent loads
Performance may be enhanced by not exceeding the memory issue bandwidth and buffer resources provided by the processor. Up to one load and one store may be issued for each cycle from a memory port reservation station. In order to be dispatched to a reservation station, there must be a buffer entry available for each memory operation. There are 48 load buffers and 24 store buffers (see note below). These buffers hold the µop and address information until the operation is completed, retired, and deallocated.

The Pentium 4 processor is designed to enable the execution of memory operations out of order with respect to other instructions and with respect to each other. Loads can be carried out speculatively, that is, before all preceding branches are resolved. However, speculative loads cannot cause page faults.

Note: Pentium 4 processors with CPUID model encoding equal to 3 have more than 24 store buffers.
Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute operations as soon as their inputs are ready. Writes to memory are always carried out in program order to maintain program correctness.
A cache miss for a load does not prevent other loads from issuing and completing. The Pentium 4 processor supports up to four (or eight for Pentium 4 processor with CPUID signature corresponding to family 15, model 3) outstanding load misses that can be serviced either by on-chip caches or by memory.
Store buffers improve performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or cache is complete. Writes are generally not on the critical path for dependence chains, so it is often beneficial to delay writes for more efficient use of memory-access bus cycles.
Store Forwarding
Loads can be moved before stores that occurred earlier in the program if they are not predicted to load from the same linear address. If they do read from the same linear address, they have to wait for the store data to become available. However, with store forwarding, they do not have to wait for the store to write to the memory hierarchy and retire. The data from the store can be forwarded directly to the load, as long as the following conditions are met (a hypothetical C illustration follows this list):
Sequence: the data to be forwarded to the load has been generated by a programmatically-earlier store which has already executed.
Size: the bytes loaded must be a subset of (possibly identical to) the bytes stored.
Alignment: the store cannot wrap around a cache line boundary, and the linear address of the load must be the same as that of the store.
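The following hypothetical C fragment illustrates the size condition; whether forwarding actually occurs depends on the loads and stores the compiler emits, so this is a sketch of the principle rather than a guaranteed outcome.

#include <stdint.h>

typedef union {
    uint32_t dword;
    uint8_t  bytes[4];
} packed_t;

uint8_t forwarding_possible(packed_t *p)
{
    p->dword = 0x11223344u;  /* dword store                              */
    return p->bytes[0];      /* byte load reads a subset of the stored
                                bytes at the same address: data can be
                                forwarded from the store buffer          */
}

uint32_t forwarding_blocked(packed_t *p)
{
    p->bytes[0] = 0x44;      /* byte store                               */
    return p->dword;         /* dword load needs bytes the store did not
                                write: the load waits for the store to
                                complete and retire                      */
}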

Intel® Pentium® M Processor Microarchitecture

Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchitecture contains three sections:
in-order issue front end
out-of-order superscalar execution core
in-order retirement unit
Intel Pentium M processor microarchitecture supports a high-speed system bus (up to 533 MHz) with 64-byte line size. Most coding recommendations that apply to the Intel NetBurst microarchitecture also apply to the Intel Pentium M processor.
The Intel Pentium M processor microarchitecture is designed for lower power consumption. There are other specific areas of the Pentium M processor microarchitecture that differ from the Intel NetBurst microarchitecture. They are described next. A block diagram of the Intel Pentium M processor is shown in Figure 1-5.
Figure 1-5 The Intel Pentium M Processor Microarchitecture
[Figure: block diagram. The System Bus connects to the Bus Unit, which feeds the 2nd Level Cache. A 1st Level Instruction Cache with BTBs/Branch Prediction feeds the Front End (Fetch/Decode). The Front End issues to the Execution Out-Of-Order Core, backed by the 1st Level Data Cache; Retirement follows, with Branch History Update feeding back to branch prediction. Frequently used and less frequently used paths are distinguished.]

The Front End

The Intel Pentium M processor uses a pipeline depth that enables high performance and low power consumption. It is shorter than that of the Intel NetBurst microarchitecture.
The Intel Pentium M processor front end consists of two parts:
fetch/decode unit
instruction cache
The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallelism. It also provides a 32KB instruction cache that stores un-decoded binary instructions.
The instruction prefetcher fetches instructions in a linear fashion from memory if the target instructions are not already in the instruction cache. The prefetcher is designed to fetch efficiently from an aligned 16-byte block. If the modulo 16 remainder of a branch target address is 14, only two useful instruction bytes are fetched in the first cycle. The rest of the instruction bytes are fetched in subsequent cycles.
The three decoders decode IA-32 instructions and break them down into micro-ops (µops). In each clock cycle, the first decoder is capable of decoding an instruction with four or fewer µops. The remaining two decoders each decode a one-µop instruction in each clock cycle.
The front end can issue multiple µops per cycle, in original program order, to the out-of-order core.
The Intel Pentium M processor incorporates sophisticated branch prediction hardware to support the out-of-order core. The branch prediction hardware includes dynamic prediction, and branch target buffers.
The Intel Pentium M processor has enhanced dynamic branch prediction hardware. Branch target buffers (BTB) predict the direction and target of branches based on an instruction’s address.
The Pentium M processor includes two techniques to reduce the execution time of certain operations:
ESP folding. This eliminates the ESP manipulation micro-operations in stack-related instructions such as PUSH, POP, CALL and RET. It increases decode, rename and retirement throughput. ESP folding also increases execution bandwidth by eliminating µops which would have required execution resources.
Micro-op (µop) fusion. Some of the most frequent pairs of µops derived from the same instruction can be fused into a single µop. The following categories of fused µops have been implemented in the Pentium M processor:
— “Store address” and “store data” micro-ops are fused into a single “store” micro-op. This holds for all types of store operations, including integer, floating-point, MMX technology, and Streaming SIMD Extensions (SSE and SSE2) operations.
— A load micro-op can in most cases be fused with a successive execution micro-op. This holds for integer, floating-point and MMX technology loads and for most kinds of successive execution operations. Note that SSE loads cannot be fused.

Data Prefetching

The Intel Pentium M processor supports three prefetching mechanisms:
The first mechanism is a hardware instruction fetcher, described in the previous section.
The second mechanism automatically fetches data into the second-level cache. The implementation of automatic hardware prefetching in the Pentium M processor family is basically similar to that described for the NetBurst microarchitecture. The trigger threshold distance for each relevant processor model is shown in Table 1-2.
The third mechanism is a software mechanism that fetches data into the caches using the prefetch instructions.
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32 Processor Families

Trigger Threshold Distance (Bytes)   Extended Model ID   Extended Family ID   Family ID   Model ID
512                                  0                   0                    15          3, 4, 6
256                                  0                   0                    15          0, 1, 2
256                                  0                   0                    6           9, 13, 14
Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters.
Table 1-3 Cache Parameters of Pentium M, Intel® Core Solo and Intel® Core Duo Processors

Level               Capacity   Associativity (ways)   Line Size (bytes)   Access Latency (clocks)   Write Update Policy
First               32 KB      8                      64                  3                         Writeback
Instruction         32 KB      8                      N/A                 N/A                       N/A
Second (model 9)    1 MB       8                      64                  9                         Writeback
Second (model 13)   2 MB       8                      64                  10                        Writeback
Second (model 14)   2 MB       8                      64                  14                        Writeback

Out-of-Order Core

The processor core dynamically executes µops independent of program order. The core is designed to facilitate parallel execution by employing many buffers, issue ports, and parallel execution units.
The out-of-order core buffers µops in a Reservation Station (RS) until their operands are ready and resources are available. Each cycle, the core may dispatch up to five µops through the issue ports.

In-Order Retirement

The retirement unit in the Pentium M processor buffers completed µops in the reorder buffer (ROB). The ROB updates the architectural state in order. Up to three µops may be retired per cycle.

Microarchitecture of Intel® Core Solo and Intel® Core Duo Processors

Intel Core Solo and Intel Core Duo processors incorporate a microarchitecture that is similar to the Pentium M processor microarchitecture, but with additional enhancements for performance and power efficiency. Enhancements include:
Intel Smart Cache
This second-level cache is shared between the two cores in an Intel Core Duo processor to minimize bus traffic when the two cores access a single copy of cached data. It allows an Intel Core Solo processor (or an Intel Core Duo processor when one of its two cores is idle) to access the cache's full capacity.
Streaming SIMD Extensions 3
These extensions are supported in Intel Core Solo and Intel Core Duo processors.
Decoder improvement
Improvement in decoder and micro-op fusion allows the front end to see most instructions as single-µop instructions. This increases the throughput of the three decoders in the front end.
Improved execution core
Throughput of SIMD instructions is improved and the out-of-order engine is more robust in handling sequences of frequently-used instructions. Enhanced internal buffering and prefetch mechanisms also improve data bandwidth for execution.
Power-optimized bus
The system bus is optimized for power efficiency; the increased bus speed supports 667 MHz.
Data Prefetch
Intel Core Solo and Intel Core Duo processors implement improved hardware prefetch mechanisms: one mechanism can look ahead and prefetch data into L1 from L2. These processors also provide enhanced hardware prefetchers similar to those of the Pentium M processor (see Table 1-2).

Front End

Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors is improved over Pentium M processors by the following enhancements:
Micro-op fusion
Scalar SIMD operations on register and memory have single micro-op flows comparable to x87 flows. Many packed instructions are fused to reduce their micro-op flow from four to two micro-ops.
Eliminating decoder restrictions
Intel Core Solo and Intel Core Duo processors improve decoder throughput with micro-fusion and macro-fusion, so that many more SSE and SSE2 instructions can be decoded without restriction. On Pentium M processors, many single micro-op SSE and SSE2 instructions must be decoded by the main decoder.
Improved packed SIMD instruction decoding
On Intel Core Solo and Intel Core Duo processors, decoding of most packed SSE instructions is done by all three decoders. As a result, the front end can process up to three packed SSE instructions every cycle. There are some exceptions: some shuffle/unpack/shift operations are not fused and require the main decoder.

Data Prefetching

Intel Core Solo and Intel Core Duo processors provide hardware mechanisms to prefetch data from memory to the second-level cache. There are two techniques: one mechanism activates after the data access pattern experiences two cache-reference misses within a trigger-distance threshold (see Table 1-2). This mechanism is similar to that of the Pentium M processor, but can track 16 forward data streams and 4 backward streams. The second mechanism fetches an adjacent cache line of data after experiencing a cache miss. This effectively simulates the prefetching capabilities of 128-byte sectors (similar to the sectoring of two adjacent 64-byte cache lines available in Pentium 4 processors).
Hardware prefetch requests are queued up in the bus system at lower priority than normal cache-miss requests. If the bus queue is in high demand, hardware prefetch requests may be ignored or cancelled to service bus traffic required by demand cache misses and other bus transactions.
Hardware prefetch mechanisms are enhanced over those of the Pentium M processor in two ways:
Data stores that are not in the second-level cache generate read-for-ownership requests. These requests are treated as loads and can trigger a prefetch stream.
Software prefetch instructions are treated as loads; they can also trigger a prefetch stream.

Hyper-Threading Technology

Intel® Hyper-Threading (HT) Technology is supported by specific members of the Intel Pentium 4 and Xeon processor families. The technology enables software to take advantage of task-level, or thread-level parallelism by providing multiple logical processors within a physical processor package. In its first implementation in Intel Xeon processor, Hyper-Threading Technology makes a single physical processor appear as two logical processors.
The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
By sharing resources needed for peak demands between two logical processors, HT Technology is well suited for multiprocessor systems to provide an additional performance boost in throughput when compared to traditional MP systems.
Figure 1-6 shows a typical bus-based symmetric multiprocessor (SMP) based on processors supporting Hyper-Threading Technology. Each logical processor can execute a software thread, allowing a maximum of two software threads to execute simultaneously on one physical processor. The two software threads execute simultaneously, meaning that in the same clock cycle an “add” operation from logical processor 0 and another “add” operation and load from logical processor 1 can be executed simultaneously by the execution engine.
In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor. This minimizes the die area cost of implementing HT Technology while still achieving performance gains for multithreaded applications or multitasking workloads.
Figure 1-6 Hyper-Threading Technology on an SMP
[Figure: two physical processors on a system bus. Each physical processor contains two architectural states and two local APICs, which share a single execution engine and a single bus interface.]
The performance potential of HT Technology is due to:
the fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor
the ability to use on-chip execution resources at a higher level than when only a single thread is consuming the execution resources; a higher level of resource utilization can lead to higher system throughput

Processor Resources and Hyper-Threading Technology

The majority of microarchitecture resources in a physical processor are shared between the logical processors. Only a few small data structures were replicated for each logical processor. This section describes how resources are shared, partitioned or replicated.
Replicated Resources
The architectural state is replicated for each logical processor. The architecture state consists of registers that are used by the operating system and application code to control program behavior and store data for computations. This state includes the eight general-purpose registers, the control registers, machine state registers, debug registers, and others. There are a few exceptions, most notably the memory type range registers (MTRRs) and the performance monitoring resources. For a complete list of the architecture state and exceptions, see the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B.
Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track execution and state changes of the two logical processors. The return stack predictor is replicated to improve branch prediction of return instructions.
In addition, a few buffers (for example, the 2-entry instruction streaming buffers) were replicated to reduce complexity.
Partitioned Resources
Several buffers are shared by limiting the use of each logical processor to half the entries. These are referred to as partitioned resources. Reasons for this partitioning include:
operational fairness
permitting operations from one logical processor to bypass operations of the other logical processor that may have stalled
For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress.
In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers include µop queues after the execution trace cache, the queues after the register rename stage, the reorder buffer which stages instructions for retirement, and the load and store buffers.
In the case of load and store buffers, partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations.
Shared Resources
Most resources in a physical processor are fully shared to improve the dynamic utilization of the resource, including caches and all the execution units. Some shared resources which are linearly addressed, like the DTLB, include a logical processor ID bit to distinguish whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID bit:
Shared mode: The L1 data cache is fully shared by the two logical processors.
Adaptive mode: In adaptive mode, memory accesses using the page directory are mapped identically across logical processors sharing the L1 data cache.
The other resources are fully shared.

Microarchitecture Pipeline and Hyper-Threading Technology

This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline.
Although instructions originating from two programs or two threads execute simultaneously and not necessarily in program order in the execution core and memory hierarchy, the front end and back end contain several selection points to select between instructions from the two logical processors. All selection points alternate between the two logical processors unless one logical processor cannot make use of a pipeline stage. In this case, the other logical processor has full use of every cycle of the pipeline stage. Reasons why a logical processor may not use a pipeline stage include cache misses, branch mispredictions, and instruction dependencies.

Front End Pipeline

The execution trace cache is shared between two logical processors. Execution trace cache access is arbitrated by the two logical processors every clock. If a cache line is fetched for one logical processor in one clock cycle, the next clock cycle a line would be fetched for the other logical processor provided that both logical processors are requesting access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace cache, the other logical processor can use the full bandwidth of the trace cache until the initial logical processor’s instruction fetches return from the L2 cache.
After fetching the instructions and building traces of µops, the µops are placed in a queue. This queue decouples the execution trace cache from the register rename pipeline stage. As described earlier, if both logical processors are active, the queue is partitioned so that both logical processors can make independent forward progress.

Execution Core

The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors. The execution core and memory hierarchy is also oblivious to which instructions belong to which logical processor.
After execution, instructions are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor uses half the entries.

Retirement

The retirement logic tracks when instructions from the two logical processors are ready to be retired. It retires the instruction in program order for each logical processor by alternating between the two logical processors. If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor.
Once stores have retired, the processor needs to write the store data into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.

Multi-Core Processors

The Intel Pentium D processor and the Pentium Processor Extreme Edition introduce multi-core features in the IA-32 architecture. These processors enhance hardware support for multi-threading by providing two processor cores in each physical processor package. The Dual-core Intel Xeon and Intel Core Duo processors also provide two processor cores in a physical package.
The Intel Pentium D processor provides two logical processors in a physical package; each logical processor has a separate execution core and a cache hierarchy. The Dual-core Intel Xeon processor and the Intel
Pentium Processor Extreme Edition provide four logical processors in a physical package that has two execution cores. Each core provides two logical processors sharing an execution core and a cache hierarchy.
The Intel Core Duo processor provides two logical processors in a physical package. Each logical processor has a separate execution core (including first-level cache) and a smart second-level cache. The second-level cache is shared between two logical processors and optimized to reduce bus traffic when the same copy of cached data is used by two logical processors. The full capacity of the second-level cache can be used by one logical processor if the other logical processor is inactive.
The functional blocks of these processors are shown in Figure 1-7.
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor
[Figure: three block diagrams. Pentium D processor: two cores on the system bus, each with its own architectural state, execution engine, local APIC, caches, and bus interface. Pentium Processor Extreme Edition: two cores on the system bus, each with two architectural states and two local APICs sharing an execution engine, caches, and a bus interface. Intel Core Duo processor: two cores, each with its own architectural state, execution engine, local APIC, and first-level caches, sharing a second-level cache and a single bus interface to the system bus.]

Microarchitecture Pipeline and Multi-Core Processors

In general, each core in a multi-core processor resembles a single-core processor implementation of the underlying microarchitecture. The implementation of the cache hierarchy in a dual-core or multi-core processor may be the same or different from the cache hierarchy implementation in a single-core processor.
CPUID should be used to determine cache-sharing topology information in a processor implementation and the underlying microarchitecture. The former is obtained by querying the deterministic cache parameter leaf (see Chapter 6, “Optimizing Cache Usage”); the latter by using the encoded values for the extended family, family, extended model, and model fields. See Table 1-4 and the sketch that follows it.
Table 1-4 Family And Model Designations of Microarchitectures

Dual-Core Processor                 Microarchitecture    Extended Family   Family   Extended Model   Model
Pentium D processor                 NetBurst             0                 15       0                3, 4, 6
Pentium processor Extreme Edition   NetBurst             0                 15       0                3, 4, 6
Intel Core Duo processor            Improved Pentium M   0                 6        0                14

Shared Cache in Intel Core Duo Processors

The Intel Core Duo processor has two symmetric cores that share the second-level cache and a single bus interface (see Figure 1-7). Two threads executing on the two cores of an Intel Core Duo processor can take advantage of the shared second-level cache, accessing a single copy of cached data without generating bus traffic.
Load and Store Operations
When an instruction needs to read data from a memory address, the processor looks for it in caches and memory. When an instruction writes data to a memory location (write back) the processor first makes sure
that the cache line that contains the memory location is owned by the first-level data cache of the initiating core (that is, the line is in exclusive or modified state). Then the processor looks for the cache line in the cache and memory sub-systems. The look-ups for the locality of load or store operation are in the following order:
1. First level cache of the initiating core
2. Second-level cache and the first-level cache of the other core
3. Memory
Table 1-5 lists the performance characteristics of generic load and store operations in an Intel Core Duo processor. Numeric values are in terms of processor core cycles.
Table 1-5 Characteristics of Load and Store Operations in Intel Core Duo Processors

Data Locality                              Load Latency           Load Throughput        Store Latency          Store Throughput
1st-level cache (L1)                       3                      1                      2                      1
L1 of the other core in “Modified” state   14 + bus transaction   14 + bus transaction   14 + bus transaction   ~10
2nd-level cache                            14                     <6                     14                     <6
Memory                                     14 + bus transaction   Bus read protocol      14 + bus transaction   Bus write protocol
Throughput is expressed as the number of cycles to wait before the same operation can start again. The latency of a bus transaction is exposed in some of these operations, as indicated by entries containing “+ bus transaction”. On Intel Core Duo processors, a typical bus transaction may take 5.5 bus cycles. For a 667 MHz bus and a core frequency of 2.167 GHz, the total comes to 14 + 5.5 * 2167/(667/4) ≈ 86 core cycles.
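The arithmetic can be checked directly; the snippet below (illustrative only) recomputes the worked example, dividing the bus rate by four because the quad-pumped 667 MHz transfer rate corresponds to a 166.75 MHz bus clock.

#include <stdio.h>

int main(void)
{
    double core_mhz   = 2167.0;  /* 2.167 GHz core            */
    double bus_mhz    = 667.0;   /* quad-pumped transfer rate */
    double bus_cycles = 5.5;     /* typical bus transaction   */

    double core_cycles = 14.0 + bus_cycles * core_mhz / (bus_mhz / 4.0);
    printf("~%.1f core cycles\n", core_cycles);   /* ~85.5, i.e. about 86 */
    return 0;
}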
Sometimes a modified cache line has to be evicted to make room for a new cache line. The modified cache line is evicted in parallel to bringing in new data and does not require additional latency. However,
when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and are within a short time, there is an overall degradation in response time of these cache misses.
For a store operation, reading for ownership must be completed before the data is written to the first-level data cache and the line is marked as modified. Reading for ownership and storing the data happen after instruction retirement and follow the order of retirement. The bus store latency does not affect the store instruction itself. However, several sequential stores may have cumulative latency that can affect performance.

General Optimization Guidelines

This chapter discusses general optimization techniques that can improve the performance of applications running on the Intel Pentium 4, Intel Xeon, and Pentium M processors, as well as on dual-core processors. These techniques take advantage of the microarchitectural features of the IA-32 processor generations described in Chapter 1. Optimization guidelines for 64-bit mode applications are discussed in Chapter 8. Additional optimization guidelines applicable to dual-core processors and Hyper-Threading Technology are discussed in Chapter 7.
This chapter explains the optimization techniques both for those who use the Intel® C++ or Fortran Compiler and for those who use other compilers. The Intel® compiler, which generates code specifically tuned for the IA-32 processor family, provides most of the optimization. For those not using the Intel C++ or Fortran Compiler, the assembly code tuning optimizations may be useful. The explanations are supported by coding examples.

Tuning to Achieve Optimum Performance

The most important factors in achieving optimum processor performance are:
good branch prediction
avoiding memory access stalls
good floating-point performance
instruction selection, including use of SIMD instructions
instruction scheduling (to maximize trace cache bandwidth)
vectorization
The following sections describe practices, tools, coding rules and recommendations associated with these factors that will aid in optimizing the performance on IA-32 processors.

Tuning to Prevent Known Coding Pitfalls

To produce program code that takes advantage of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture, you must avoid the coding pitfalls that limit the performance of the target processor family. This section lists several known pitfalls that can limit performance of Pentium 4 and Intel Xeon processor implementations. Some of these pitfalls, to a lesser degree, also negatively impact Pentium M processor performance (store-to-load-forwarding restrictions, cache-line splits).
Table 2-1 lists coding pitfalls that cause performance degradation in some Pentium 4 and Intel Xeon processor implementations. For every issue, Table 2-1 references a section in this document. The section describes in detail the causes of the penalty and presents a recommended solution. Note that “aligned” here means that the address of the load is aligned with respect to the address of the store.
Table 2-1 Coding Pitfalls Affecting Performance

Factors Affecting Performance: Small, unaligned load after large store
Symptom: Store-forwarding blocked
Example (if applicable): Example 2-12
Section Reference: Store Forwarding, Store-to-Load-Forwarding Restriction on Size and Alignment

Factors Affecting Performance: Large load after small store; Load dword after store dword, store byte; Load dword, AND with 0xff after store byte
Symptom: Store-forwarding blocked
Example (if applicable): Example 2-13, Example 2-14
Section Reference: Store Forwarding, Store-to-Load-Forwarding Restriction on Size and Alignment

Factors Affecting Performance: Cache line splits
Symptom: Access across cache line boundary
Example (if applicable): Example 2-11
Section Reference: Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.

Factors Affecting Performance: Denormal inputs and outputs
Symptom: Slows x87, SSE*, SSE2** floating-point operations
Section Reference: Floating-point Exceptions

Factors Affecting Performance: Cycling more than 2 values of Floating-point Control Word
Symptom: fldcw not optimized
Section Reference: Floating-point Modes

* Streaming SIMD Extensions (SSE)
** Streaming SIMD Extensions 2 (SSE2)

General Practices and Coding Guidelines

This section discusses guidelines derived from the performance factors listed in the “Tuning to Achieve Optimum Performance” section. It also highlights practices that use performance tools.
The majority of these guidelines benefit processors based on the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Some guidelines benefit one microarchitecture more than the other. As a whole, these coding rules enable software to be optimized for the common performance features of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture.
The coding practices recommended under each heading and the bullets under each heading are listed in order of importance.

Use Available Performance Tools

Current-generation compiler, such as the Intel C++ Compiler:
— Set this compiler to produce code for the target processor implementation.
— Use the compiler switches for optimization and/or profile-guided optimization. These features are summarized in the “Intel® C++ Compiler” section. For more detail, see the Intel® C++ Compiler User’s Guide.
Current-generation performance monitoring tools, such as VTune™ Performance Analyzer:
— Identify performance issues using event-based sampling, the code coach and other analysis resources.
— Measure workload characteristics such as instruction throughput, data traffic locality, memory traffic characteristics, etc.
— Characterize the performance gain.

Optimize Performance Across Processor Generations

Use a cpuid dispatch strategy to deliver optimum performance for all processor generations.
Use the deterministic cache parameter leaf of cpuid to deliver scalable performance that is transparent across processor families with different cache sizes.
Use a compatible code strategy to deliver optimum performance for the current generation of the IA-32 processor family and future IA-32 processors.
Use a low-overhead threading strategy so that a multi-threaded application delivers optimal multi-processor scaling performance when executing on processors that have hardware multi-threading support, or delivers nearly identical single-processor scaling when executing on a processor without hardware multi-threading support.

Optimize Branch Predictability

Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken (see the sketch after this list).
Avoid mixing near calls, far calls and returns.
Avoid implementing a call by pushing the return address and jumping to the target. The hardware can pair up call and return instructions to enhance predictability.
Use the pause instruction in spin-wait loops.
Inline functions according to coding recommendations.
Whenever possible, eliminate branches.
Avoid indirect calls.
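A minimal C sketch of arranging code for the static prediction assumption, using the GCC-style __builtin_expect hint so the compiler keeps the rare error path on a forward, not-taken branch; handle_error and do_work are hypothetical routines.

int handle_error(int status);   /* hypothetical: rare error path       */
int do_work(void);              /* hypothetical: common fall-through   */

int process(int status)
{
    if (__builtin_expect(status != 0, 0)) {
        /* unlikely path: laid out so the forward branch is not taken
           in the common case */
        return handle_error(status);
    }
    return do_work();   /* common case falls through */
}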

Optimize Memory Access

Observe store-forwarding constraints.
Ensure proper data alignment to prevent data from being split across a cache line boundary. This includes the stack and passed parameters.
Avoid mixing code and data (self-modifying code).
Choose data types carefully (see next bullet below) and avoid type casting.
Employ data structure layout optimization to ensure efficient use of 64-byte cache line size.
Favor parallel data access to mask latency over data accesses with dependency that expose latency.
For cache-miss data traffic, favor smaller cache-miss strides to avoid frequent DTLB misses.
Use prefetching appropriately.
Use the following techniques to enhance locality: blocking, hardware-friendly tiling, loop interchange, loop skewing.
Minimize use of global variables and pointers.
Use the const modifier; use the static modifier for global variables.
Use new cacheability instructions and memory-ordering behavior.

Optimize Floating-point Performance

Avoid exceeding representable ranges during computation, since handling these cases can have a performance impact. Do not use a larger precision format (double-extended floating point) unless required, since this increases memory size and bandwidth utilization.
Use FISTTP to avoid changing the rounding mode when possible, or use optimized fldcw; avoid changing floating-point control/status registers (rounding modes) between more than two values.
Use efficient conversions, such as those that implicitly include a rounding mode, in order to avoid changing control/status registers.
Take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE) and of Streaming SIMD Extensions 2 (SSE2) instructions. Enable flush-to-zero mode and DAZ mode when using SSE and SSE2 instructions (see the sketch after this list).
Avoid denormalized input values, denormalized output values, and explicit constants that could cause denormal exceptions.
Avoid excessive use of the fxch instruction.
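A minimal sketch of enabling these modes, assuming the _MM_SET_FLUSH_ZERO_MODE and _MM_SET_DENORMALS_ZERO_MODE macros from xmmintrin.h and pmmintrin.h; note that both modes trade IEEE-754 conformance on denormals for speed, so they are inappropriate where strict conformance is required.

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

void enable_ftz_daz(void)
{
    /* Flush-to-zero: denormal results of SSE/SSE2 arithmetic become 0. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    /* Denormals-are-zero: denormal source operands are treated as 0. */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}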

Optimize Instruction Selection

Focus instruction selection on the path length of a sequence of instructions rather than on individual instruction selections; minimize the number of µops and the data/register dependencies in aggregate over the path length, and maximize retirement throughput.
Avoid longer-latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies).
Use the lea instruction and the full range of addressing modes to do address calculation.
Some types of stores use more µops than others; try to use simpler store variants and/or reduce the number of stores.
Avoid use of complex instructions that require more than 4 µops.
Avoid instructions that unnecessarily introduce dependence-related stalls: inc and dec instructions, partial register operations (8/16-bit operands).
Avoid use of ah, bh, and the other higher 8-bits of the 16-bit registers, because accessing them requires a shift operation internally.
Use xor and pxor instructions to clear registers and break dependencies for integer operations; also use xorps and xorpd to clear XMM registers for floating-point operations.
Use efficient approaches for performing comparisons.

Optimize Instruction Scheduling

Consider latencies and resource constraints.
Calculate store addresses as early as possible.

Enable Vectorization

Use the smallest possible data type. This enables more parallelism with the use of a longer vector (see the sketch after this list).
Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. It is especially important to avoid the case where the store of data in an earlier iteration happens lexically after the load of that data in a future iteration (called lexically-backward dependence).
Avoid the use of conditionals.
Keep induction (loop) variable expressions simple.
Avoid using pointers; try to replace pointers with arrays and indices.
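As an illustrative example (assumed here, not from the original text), the following loop satisfies these guidelines: 16-bit elements allow eight operations per 128-bit SSE2 register, the iterations carry no inter-iteration dependency, and arrays with an index replace pointer arithmetic.

void add_arrays(short *c, const short *a, const short *b, int n)
{
    /* Independent iterations over the smallest adequate data type:
       a vectorizing compiler can emit PADDW on eight elements at a time. */
    for (int i = 0; i < n; i++)
        c[i] = (short)(a[i] + b[i]);
}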

Coding Rules, Suggestions and Tuning Hints

This chapter includes rules, suggestions and hints. They are maintained in separately-numbered lists and are targeted for engineers who are:
modifying the source to enhance performance (user/source rules)
writing assembly or compilers (assembly/compiler rules)
doing detailed performance tuning (tuning suggestions)
Coding recommendations are ranked in importance using two measures:
Local impact (referred to as “impact”) is the difference that a recommendation makes to performance for a given instance, with the impact’s priority marked as: H = high, M = medium, L = low.
Generality measures how frequently such instances occur across all application domains, with the frequency marked as: H = high, M = medium, L = low.
These rules are very approximate. They can vary depending on coding style, application domain, and other factors. The purpose of including high, medium and low priorities with each recommendation is to provide some hints as to the degree of performance gain that one can expect if a recommendation is implemented.
Because it is not possible to predict the frequency of occurrence of a code instance in applications, priority hints cannot be directly correlated to application-level performance gain. However, in important cases where application-level performance gain has been observed, a more quantitative characterization of application-level performance gain is provided for information only (see: “Store-to-Load-Forwarding Restriction on Size and Alignment” and “Instruction Selection” in this document). In places where no priority is assigned, the impact has been deemed inapplicable.

Performance Tools

Intel offers several tools that can facilitate optimizing your application’s performance.

Intel® C++ Compiler

Use the Intel C++ Compiler following the recommendations described here. The Intel Compiler’s advanced optimization features provide good performance without the need to hand-tune assembly code. However, the following features may enhance performance even further:
Inlined assembly
Intrinsics, which have a one-to-one correspondence with assembly language instructions but allow the compiler to perform register allocation and instruction scheduling. Refer to the “Intel C++ Intrinsics Reference” section of the Intel® C++ Compiler User’s Guide.
C++ class libraries. Refer to the “Intel C++ Class Libraries for SIMD Operations Reference” section of the Intel® C++ Compiler User’s Guide.
Vectorization in conjunction with compiler directives (pragmas). Refer to the “Compiler Vectorization Support and Guidelines” section of the Intel® C++ Compiler User’s Guide.
The Intel C++ Compiler can generate an executable which uses features such as Streaming SIMD Extensions 2. The executable will maximize performance on the current generation of IA-32 processor family (for example, a Pentium 4 processor) and still execute correctly on older processors. Refer to the “Processor Dispatch Support” section in the Intel® C++ Compiler User’s Guide.

General Compiler Recommendations

A compiler that has been extensively tuned for the target microarchitecture can be expected to match or outperform hand-coding in a general case. However, if particular performance problems are noted with the compiled code, some compilers (like the Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to exert greater control over what code is generated. If inline assembly is used, the user should verify that the code generated to integrate the inline assembly is of good quality and yields good overall performance.
Default compiler switches are targeted for the common case. An optimization may be made to the compiler default if it is beneficial for most programs. If a performance problem is root-caused to a poor choice on the part of the compiler, using different switches or compiling the targeted module with a different compiler may be the solution.

VTune Performance Analyzer

Where performance is a critical concern, use performance monitoring hardware and software tools to tune your application and its interaction with the hardware. IA-32 processors have counters which can be used to monitor a large number of performance-related events for each microarchitecture. The counters also provide information that helps resolve the coding pitfalls.
The VTune Performance Analyzer allows engineers to use these counters to obtain two kinds of tuning feedback:
indication of a performance improvement gained by using a specific coding recommendation or microarchitectural feature
information on whether a change in the program has improved or degraded performance with respect to a particular metric
The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload characteristics, including:
retirement throughput of instruction execution, as an indication of the degree of extractable instruction-level parallelism in the workload
data traffic locality, as an indication of the stress point of the cache and memory hierarchy
data traffic parallelism, as an indication of the degree of effectiveness of amortization of data access latency
Note that improving performance in one part of the machine does not necessarily bring significant gains to overall performance. It is possible to degrade overall performance by improving performance for some particular metric.
Where appropriate, coding recommendations in this chapter include descriptions of the VTune analyzer events that provide measurable data of performance gain achieved by following recommendations. Refer to the VTune analyzer online help for instructions on how to use the tool.
VTune analyzer events include the Pentium 4 processor performance metrics described in Appendix B, “Using Performance Monitoring Events.”

Processor Perspectives

The majority of the coding recommendations for the Pentium 4 and Intel Xeon processors also apply to Pentium M, Intel Core Solo, and Intel Core Duo processors. However, there are situations where a recommendation may benefit one microarchitecture more than the other. The most important of these are:
Instruction decode throughput is important for the Pentium M, Intel Core Solo, and Intel Core Duo processors but less important for the Pentium 4 and Intel Xeon processors. Generating code with the 4-1-1 template (an instruction with four μops followed by two instructions with one μop each) helps the Pentium M processor. Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. The practice has no real impact on processors based on the Intel NetBurst microarchitecture.
Dependencies for partial register writes incur large penalties when using the Pentium M processor (this applies to processors with CPUID signature family 6, model 9). On Pentium 4, Intel Xeon processors, the Pentium M processor (with CPUID signature family 6, model 13), and Intel Core Solo and Intel Core Duo processors, such penalties are resolved by artificial dependencies between each partial register write. To avoid false dependences from partial register updates, use full register updates and extended moves.
On Pentium 4 and Intel Xeon processors, some latencies have increased: shifts, rotates, integer multiplies, and moves from memory with sign extension are longer than before. Use care when using the lea instruction. See the section “Use of the lea Instruction” for recommendations.
The inc and dec instructions should always be avoided. Using add and sub instructions instead avoids data dependence and improves performance.
Dependence-breaking support is added for the pxor instruction.
Floating point register stack exchange instructions were free; now they are slightly more expensive due to issue restrictions.
Writes and reads to the same location should now be spaced apart. This is especially true for writes that depend on long-latency instructions.
Hardware prefetching may shorten the effective memory latency for data and instruction accesses.
Cacheability instructions are available to streamline stores and manage cache utilization.
Cache lines are 64 bytes (see Table 1-1 and Table 1-3). Because of this, software prefetching should be done less often. False sharing, however, can be an issue.
On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache. On Pentium M processors, the code size limit is governed by the instruction cache.
There may be a penalty when instructions with immediates requiring more than 16-bit signed representation are placed next to other instructions that use immediates.
Note that memory-related optimization techniques for alignments, complying with store-to-load-forwarding restrictions and avoiding data splits help Pentium 4 processors as well as Pentium M processors.

CPUID Dispatch Strategy and Compatible Code Strategy

Where optimum performance on all processor generations is desired, applications can take advantage of the cpuid instruction to identify the processor generation and integrate processor-specific instructions (such as SSE2 instructions) into the source code. The Intel C++ Compiler supports the integration of different versions of the code for different target processors. The selection of which code to execute at runtime is made based on the CPU identifier that is read with cpuid. Binary code targeted for different processor generations can be generated under the control of the programmer or by the compiler.
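As a sketch of such a runtime dispatch in assembly (SSE2 support is reported in EDX bit 26 of cpuid function 1; the labels are illustrative):

    ; Select between an SSE2 path and a generic path at runtime.
    mov   eax, 1            ; cpuid function 1: feature flags
    cpuid
    test  edx, 04000000h    ; EDX bit 26 = SSE2 support
    jnz   sse2_path         ; illustrative label
    jmp   generic_path      ; illustrative label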
For applications that run on both the Intel Pentium 4 and Pentium M processors, and where minimum binary code size and a single code path are important, a compatible code strategy is best. Optimizing applications for the Intel NetBurst microarchitecture is likely to improve code efficiency and scalability when running on processors based on current and future generations of IA-32 processors. This approach to optimization is also likely to deliver high performance on Pentium M processors.

Transparent Cache-Parameter Strategy

If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, the leaf reports detailed cache parameters for each level of the cache hierarchy in a deterministic and forward-compatible manner across current and future IA-32 processor families. See the CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2B.
For coding techniques that rely on specific parameters of a cache level, using the deterministic cache parameter leaf allows software to implement such techniques in a manner that is forward-compatible with future generations of IA-32 processors and cross-compatible with processors equipped with different cache sizes.
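A minimal sketch of enumerating the leaf, assuming the documented convention that the sub-leaf in ECX selects a cache index and that a cache type of zero in EAX[4:0] ends the enumeration (labels are illustrative):

    ; Walk the deterministic cache parameter leaf (function 4).
    xor   esi, esi        ; esi = cache index
next_cache:
    mov   eax, 4          ; cpuid function 4
    mov   ecx, esi        ; sub-leaf selects the cache index
    cpuid
    and   eax, 1Fh        ; EAX[4:0] = cache type; 0 = no more caches
    jz    done
    ; EAX[7:5] holds the cache level; EBX[11:0] + 1 is the
    ; system coherency line size.
    inc   esi
    jmp   next_cache
done: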

Threading Strategy and Hardware Multi-Threading Support

Current IA-32 processor families offer hardware multi-threading support in two forms: dual-core technology and Hyper-Threading Technology. The trend for future IA-32 processors will continue in the direction of multi-core technology.
To fully harness the performance potential of the hardware multi-threading capabilities in current and future generations of IA-32 processors, software must embrace a threaded approach in application design. At the same time, to address the widest installed base of machines, multi-threaded software should run without failure on a single processor without hardware multi-threading support, and it should achieve performance on a single logical processor comparable to that of an unthreaded implementation, where such a comparison can be made. This generally requires architecting a multi-threaded application to minimize the overhead of thread synchronization. Additional software optimization guidelines on multi-threading are discussed in Chapter 7.

Branch Prediction

Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly.
Optimizations that help branch prediction are:
• Keep code and data on separate pages (a very important item; see more details in the “Memory Accesses” section).
• Whenever possible, eliminate branches.
• Arrange code to be consistent with the static branch prediction algorithm.
• Use the pause instruction in spin-wait loops.
• Inline functions and pair up calls and returns.
• Unroll as necessary so that repeatedly executed loops have sixteen or fewer iterations, unless this causes an excessive code size increase.
• Separate branches so that they occur no more frequently than every three μops where possible.

Eliminating Branches

Eliminating branches improves performance because it:
• reduces the possibility of mispredictions
• reduces the number of required branch target buffer (BTB) entries; conditional branches that are never taken do not consume BTB resources
There are four principal ways of eliminating branches:
• arrange code to make basic blocks contiguous
• unroll loops, as discussed in the “Loop Unrolling” section
• use the cmov instruction
• use the setcc instruction
Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.
For the Pentium M processor, every branch counts: even correctly predicted branches have a negative effect on the amount of useful code delivered to the processor. Also, taken branches consume space in the branch prediction structures, and extra branches create pressure on the capacity of the structures.
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the setcc and cmov instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches, because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch. In addition, converting conditional branches to cmovs or setcc trades control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
Consider a line of C code that has a condition dependent upon one of the constants:

X = (A < B) ? CONST1 : CONST2;

This code conditionally compares two values, A and B. If the condition is true, X is set to CONST1; otherwise it is set to CONST2. An assembly code sequence equivalent to the above C code can contain branches that are not predictable if there is no correlation between the two values.
Example 2-1 shows the assembly code with unpredictable branches. The unpredictable branches in Example 2-1 can be removed with the use of the setcc instruction. Example 2-2 shows optimized code that does not have branches.
Example 2-1 Assembly Code with an Unpredictable Branch

    cmp  A, B          ; condition
    jge  L30           ; conditional branch
    mov  ebx, CONST1   ; ebx holds X
    jmp  L31           ; unconditional branch
L30:
    mov  ebx, CONST2
L31:
Example 2-2 Code Optimization to Eliminate Branches

    xor   ebx, ebx      ; clear ebx (X in the C code)
    cmp   A, B
    setge bl            ; ebx = 0 or 1
                        ; (or use setl for the complement condition)
    sub   ebx, 1        ; ebx = 11...11 or 00...00
    and   ebx, CONST3   ; CONST3 = CONST1 - CONST2
    add   ebx, CONST2   ; ebx = CONST1 or CONST2
The optimized code in Example 2-2 sets ebx to zero, then compares A and B. If A is greater than or equal to B, ebx is set to one. Then ebx is decreased and “and-ed” with the difference of the constant values. This sets ebx to either zero or the difference of the values. By adding CONST2 back to ebx, the correct value is written to ebx. When CONST2 is equal to zero, the last instruction can be deleted.
Another way to remove branches on Pentium II and subsequent processors is to use the cmov and fcmov instructions. Example 2-3 shows how to change a test-and-branch instruction sequence by using cmov to eliminate a branch. If the test sets the equal flag, the value in ebx will be moved to eax. This branch is data-dependent, and is representative of an unpredictable branch.
Example 2-3 Eliminating Branch with CMOV Instruction

    test  ecx, ecx
    jne   1h
    mov   eax, ebx
1h:
    ; To optimize code, combine jne and mov into one cmovcc
    ; instruction that checks the equal flag
    test  ecx, ecx     ; test the flags
    cmove eax, ebx     ; if the equal flag is set, move
                       ; ebx to eax - the 1h: label is no
                       ; longer needed
The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the cpuid instruction.
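A sketch of that check (cpuid function 1 reports the CMOV feature in EDX bit 15; fcmov additionally requires the FPU feature bit; the label is illustrative):

    ; Test for CMOV support before using cmov/fcmov.
    mov   eax, 1          ; cpuid function 1: feature flags
    cpuid
    test  edx, 8000h      ; EDX bit 15 = CMOV
    jz    no_cmov_path    ; illustrative label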

Spin-Wait and Idle Loops

The Pentium 4 processor introduces a new pause instruction; the instruction is architecturally a nop on all IA-32 implementations. To the Pentium 4 processor, this instruction acts as a hint that the code sequence is a spin-wait loop. Without a pause instruction in such loops, the Pentium 4 processor may suffer a severe penalty when exiting the loop because the processor may detect a possible memory order violation. Inserting the pause instruction significantly reduces the likelihood of a memory order violation and as a result improves performance.
In Example 2-4, the code spins until memory location A matches the value stored in the register eax. Such code sequences are common when protecting a critical section, in producer-consumer sequences, for barriers, or for other synchronization.
Example 2-4 Use of pause Instruction

lock:   cmp  eax, A
        jne  loop
        ; code in critical section:
loop:   pause
        cmp  eax, A
        jne  loop
        jmp  lock

Static Prediction

Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm. The Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms:
• Predict unconditional branches to be taken.
• Predict indirect branches to be NOT taken.
In addition, conditional branches in processors based on the Intel NetBurst microarchitecture are predicted using the following static prediction algorithm:
• Predict backward conditional branches to be taken. This rule is suitable for loops.
• Predict forward conditional branches to be NOT taken.
Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at their first appearance.
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
Example 2-5 illustrates the static branch prediction algorithm. The body of an if-then conditional is predicted to be executed.
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm

[Figure: forward conditional branches, such as the body of an if <condition> block, are not taken (fall through); backward conditional branches, such as the branch closing a for or loop construct, are taken; unconditional branches (JMP) are taken.]
Example 2-6 and Example 2-7 provide basic rules for the static prediction algorithm.
In Example 2-6, the backward branch (JC Begin) is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not occur.
Example 2-6 Static Taken Prediction Example

Begin:  mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, edx, 7
        jc   Begin
The first branch instruction (JC Begin) in Example 2-7 is a conditional forward branch. It is not in the BTB the first time through, but the static predictor will predict the branch to fall through.
The static prediction algorithm correctly predicts that the call Convert instruction will be taken, even before the branch has any branch history in the BTB.
Example 2-7 Static Not-Taken Prediction Example

        mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, edx, 7
        jc   Begin
        mov  eax, 0
Begin:  call Convert

Inlining, Calls and Returns

The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may be degraded.
The trace cache maintains branch prediction information for calls and returns. As long as the trace with the call or return remains in the trace cache and if the call and return targets remain unchanged, the depth limit of the return address stack described above will not impede performance.
To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low.
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns.
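A sketch of the mismatch this rule warns against (routine and label names are illustrative):

    ; Avoid: simulating a call by pushing a return address and
    ; jumping. The ret inside my_proc is not paired with a
    ; call, so the return stack buffer will mispredict it.
    push  offset return_here
    jmp   my_proc
return_here:

    ; Preferred: a matched near call / near return pair.
    call  my_proc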
Calls and returns are expensive; use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch prediction within the routine.
• A mispredicted branch can lead to larger performance penalties inside a small function than if that function is inlined.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function where doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession, consider transforming the program with inlining to reduce the call depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is a call to another function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the return stack buffer.
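A minimal sketch of this tail-call conversion (routine names are illustrative):

Before:

wrapper:
    ; ...
    call  worker       ; the call is immediately followed by ret
    ret

After:

wrapper:
    ; ...
    jmp   worker       ; worker's ret now returns directly to
                       ; wrapper's caller, saving the call/return
                       ; overhead and a return stack buffer entry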
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end-loop branches in a 16-byte chunk.

Branch Type Selection

The default predicted target for indirect branches and calls is the fall-through path. The fall-through prediction is overridden if and when a hardware prediction is available for that branch. The predicted branch target from branch prediction hardware for an indirect branch is the previously executed branch target.
The default prediction to the fall-through path is only a significant issue if no branch prediction is available, due to poor code locality or pathological branch conflict problems. For indirect calls, predicting the fall-through path is usually not an issue, since execution will likely return to the instruction after the associated return.
Placing data immediately following an indirect branch can cause a performance problem. If the data consist of all zeros, they look like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 12. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
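A sketch of the second alternative in the rule (the register-indexed jump table is illustrative):

    ; Indirect branch with targets the hardware cannot predict.
    jmp   dword ptr [jump_table + eax*4]
    ud2   ; stops the processor from decoding down the
          ; fall-through path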
Indirect branches resulting from code constructs such as switch statements, computed GOTOs, or calls through pointers can jump to an arbitrary number of locations. If the code sequence is such that the target destination of a branch goes to the same address most of the time, then the BTB will predict accurately most of the time. Since only one taken (non-fall-through) target can be stored in the BTB, indirect branches with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing additional conditional branches. Adding a conditional branch to a target is fruitful if and only if:
• The branch direction is correlated with the branch history leading up to that branch; that is, not just the last target, but how it got to this branch.
• The source/target pair is common enough to warrant using the extra branch prediction capacity. (This may increase the number of overall branch mispredictions, while improving the misprediction rate of indirect branches. The profitability is lower if the number of mispredicting branches is very large.)
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets, and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches, even at the expense of adding more branches. The added branches must be very predictable for this to be worthwhile. One reason for such predictability is a strong correlation with preceding branch history; that is, the directions taken on preceding branches are a good indicator of the direction of the branch under consideration.
Example 2-8 shows a simple example of the correlation between a target of a preceding conditional branch and a target of an indirect branch.
Example 2-8 Indirect Branch With Two Favored Targets

function ()
{
    int n = rand();       // random integer 0 to RAND_MAX
    if ( !(n & 0x01) ) {  // n will be 0 half the time
        n = 0;            // updates branch history to predict taken
    }
    // indirect branches with multiple taken targets
    // may have lower prediction rates
    switch (n) {
        case 0: handle_0(); break;  // common target, correlated with
                                    // branch history that is forward taken
        case 1: handle_1(); break;  // uncommon
        case 3: handle_3(); break;  // uncommon
        default: handle_other();    // common target
    }
}
Correlation can be difficult to determine analytically, either for a compiler or for an assembly language programmer. It may be fruitful to evaluate performance with and without this peeling to get the best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9.
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction

function ()
{
    int n = rand();      // random integer 0 to RAND_MAX
    if ( !(n & 0x01) )
        n = 0;           // n will be 0 half the time
    if (!n)
        handle_0();      // peel out the most common target
                         // with correlated branch history
    else {
        switch (n) {
            case 1: handle_1(); break;  // uncommon
            case 3: handle_3(); break;  // uncommon
            default: handle_other();    // make the favored target
                                        // the fall-through path
        }
    }
}

Loop Unrolling

The benefits of unrolling loops are:
• Unrolling amortizes the branch overhead, since it eliminates branches and some of the code to manage induction variables.
• Unrolling allows you to aggressively schedule (or pipeline) the loop to hide latencies. This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path.
• Unrolling exposes the code to various other optimizations, such as removal of redundant loads, common subexpression elimination, and so on.
The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive and the probable number of iterations is known, unroll inner loops until they have a maximum of 16 iterations. With the Pentium M processor, do not unroll loops more than 64 iterations.
The potential costs of unrolling loops are:
• Excessive unrolling, or unrolling of very large loops, can lead to increased code size. This can be harmful if the unrolled loop no longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demands on BTB capacity. If the number of iterations of the unrolled loop is 16 or fewer, the branch predictor should be able to correctly predict branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll small loops until the overhead of the branch and the induction variable accounts, generally, for less than about 10% of the execution time of the loop.
Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid unrolling loops excessively, as this may thrash the trace cache or instruction cache.
Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll loops that are frequently executed and that have a predictable number of iterations to reduce the number of iterations to 16 or fewer, unless this increases code size so that the working set no longer fits in the trace cache or instruction cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).
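For instance, under this rule a loop whose body contains two conditional branches would be unrolled to run for at most 16/2 = 8 iterations.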
Example 2-10 shows how unrolling enables other optimizations.
Example 2-10 Loop Unrolling

Before unrolling:

do i = 1, 100
    if (i mod 2 == 0) then
        a(i) = x
    else
        a(i) = y
enddo

After unrolling:

do i = 1, 100, 2
    a(i) = y
    a(i+1) = x
enddo
In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. By unrolling the loop you can make both assignments in each iteration, removing one branch in the loop body.

Compiler Support for Branch Prediction

Compilers can generate code that improves the efficiency of branch prediction in the Pentium 4 and Pentium M processors. The Intel C++ Compiler accomplishes this by:
• keeping code and data on separate pages
• using conditional move instructions to eliminate branches
• generating code that is consistent with the static branch prediction algorithm
• inlining where appropriate
• unrolling, if the number of iterations is predictable
With profile-guided optimization, the Intel compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see the Intel® C++ Compiler User’s Guide.