
IA-32 Intel® Architecture
Optimization Reference
Manual
Order Number: 248966-013US
April 2006
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.
This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel® processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.
Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading for more information including details on which processors support HT Technology.
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Pentium D, Itanium, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright © 1999-2006 Intel Corporation.

Contents

Introduction
Chapter 1 IA-32 Intel® Architecture Processor Family Overview
SIMD Technology.................................................................................................................... 1-2
Summary of SIMD Technologies...................................................................................... 1-5
MMX™ Technology..................................................................................................... 1-5
Streaming SIMD Extensions....................................................................................... 1-5
Streaming SIMD Extensions 2.................................................................................... 1-6
Streaming SIMD Extensions 3.................................................................................... 1-6
Intel® Extended Memory 64 Technology (Intel® EM64T).......................................... 1-7
Intel NetBurst® Microarchitecture............................................................................................ 1-8
Design Goals of Intel NetBurst Microarchitecture ............................................................ 1-8
Overview of the Intel NetBurst Microarchitecture Pipeline ............................................... 1-9
The Front End........................................................................................................... 1-11
The Out-of-order Core.............................................................................................. 1-12
Retirement................................................................................................................ 1-12
Front End Pipeline Detail............................................................................................... 1-13
Prefetching................................................................................................................ 1-13
Decoder.................................................................................................................... 1-14
Execution Trace Cache ............................................................................................ 1-14
Branch Prediction ..................................................................................................... 1-15
Execution Core Detail.................................................................................................... 1-16
Instruction Latency and Throughput......................................................................... 1-17
Execution Units and Issue Ports............................................................................... 1-18
Caches...................................................................................................................... 1-19
Data Prefetch............................................................................................................ 1-21
Loads and Stores...................................................................................................... 1-24
Store Forwarding...................................................................................................... 1-25
Intel® Pentium® M Processor Microarchitecture................................................................... 1-26
The Front End........................................................................................................... 1-27
Data Prefetching....................................................................................................... 1-29
Out-of-Order Core..................................................................................................... 1-30
In-Order Retirement.................................................................................................. 1-31
Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors........................ 1-31
Front End........................................................................................................................ 1-32
Data Prefetching............................................................................................................. 1-33
Hyper-Threading Technology................................................................................................ 1-33
Processor Resources and Hyper-Threading Technology............................................... 1-36
Replicated Resources............................................................................................... 1-36
Partitioned Resources............................................................................................... 1-36
Shared Resources.................................................................................................... 1-37
Microarchitecture Pipeline and Hyper-Threading Technology........................................ 1-38
Front End Pipeline......................................................................................................... 1-38
Execution Core............................................................................................................... 1-39
Retirement...................................................................................................................... 1-39
Multi-Core Processors........................................................................................................... 1-39
Microarchitecture Pipeline and Multi-Core Processors................................................... 1-42
Shared Cache in Intel Core Duo Processors ................................................................. 1-42
Load and Store Operations....................................................................................... 1-42
Chapter 2 General Optimization Guidelines
Tuning to Achieve Optimum Performance.............................................................................. 2-1
Tuning to Prevent Known Coding Pitfalls................................................................................ 2-2
General Practices and Coding Guidelines.............................................................................. 2-3
Use Available Performance Tools..................................................................................... 2-4
Optimize Performance Across Processor Generations.................................................... 2-4
Optimize Branch Predictability.......................................................................................... 2-5
Optimize Memory Access................................................................................................. 2-5
Optimize Floating-point Performance............................................................................... 2-6
Optimize Instruction Selection.......................................................................................... 2-6
Optimize Instruction Scheduling....................................................................................... 2-7
Enable Vectorization......................................................................................................... 2-7
Coding Rules, Suggestions and Tuning Hints......................................................................... 2-8
Performance Tools.................................................................................................................. 2-9
Intel® C++ Compiler......................................................................................................... 2-9
General Compiler Recommendations............................................................................ 2-10
VTune™ Performance Analyzer..................................................................................... 2-10
Processor Perspectives........................................................................................................ 2-11
CPUID Dispatch Strategy and Compatible Code Strategy............................................. 2-13
Transparent Cache-Parameter Strategy......................................................................... 2-14
Threading Strategy and Hardware Multi-Threading Support.......................................... 2-14
Branch Prediction.................................................................................................................. 2-15
Eliminating Branches...................................................................................................... 2-15
Spin-Wait and Idle Loops................................................................................................ 2-18
Static Prediction.............................................................................................................. 2-19
Inlining, Calls and Returns ............................................................................................. 2-22
Branch Type Selection ................................................................................................... 2-23
Loop Unrolling ............................................................................................................... 2-26
Compiler Support for Branch Prediction......................................................................... 2-28
Memory Accesses................................................................................................................. 2-29
Alignment ....................................................................................................................... 2-29
Store Forwarding............................................................................................................ 2-32
Store-to-Load-Forwarding Restriction on Size and Alignment.................................. 2-33
Store-forwarding Restriction on Data Availability...................................................... 2-38
Data Layout Optimizations............................................................................................. 2-39
Stack Alignment.............................................................................................................. 2-42
Capacity Limits and Aliasing in Caches.......................................................................... 2-43
Capacity Limits in Set-Associative Caches............................................................... 2-44
Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors ............................ 2-45
Aliasing Cases in the Pentium M Processor............................................................. 2-46
Mixing Code and Data.................................................................................................... 2-47
Self-modifying Code ................................................................................................. 2-47
Write Combining............................................................................................................. 2-48
Locality Enhancement.................................................................................................... 2-50
Minimizing Bus Latency.................................................................................................. 2-52
Non-Temporal Store Bus Traffic ..................................................................................... 2-53
Prefetching..................................................................................................................... 2-55
Hardware Instruction Fetching.................................................................................. 2-55
Software and Hardware Cache Line Fetching.......................................................... 2-55
Cacheability Instructions ................................................................................................ 2-56
Code Alignment.............................................................................................................. 2-57
Improving the Performance of Floating-point Applications.................................................... 2-57
Guidelines for Optimizing Floating-point Code............................................................... 2-58
Floating-point Modes and Exceptions............................................................................ 2-60
Floating-point Exceptions ......................................................................................... 2-60
Floating-point Modes................................................................................................ 2-62
Improving Parallelism and the Use of FXCH.................................................................. 2-68
x87 vs. Scalar SIMD Floating-point Trade-offs............................................................... 2-69
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo
Processors............................................................................................................. 2-70
Memory Operands........................................................................................................... 2-71
Floating-Point Stalls........................................................................................................ 2-72
x87 Floating-point Operations with Integer Operands .............................................. 2-72
x87 Floating-point Comparison Instructions............................................................. 2-72
Transcendental Functions ........................................................................................ 2-72
Instruction Selection.............................................................................................................. 2-73
Complex Instructions....................................................................................................... 2-74
Use of the lea Instruction................................................................................................ 2-74
Use of the inc and dec Instructions................................................................................ 2-75
Use of the shift and rotate Instructions........................................................................... 2-75
Flag Register Accesses.................................................................................................. 2-75
Integer Divide................................................................................................................. 2-76
Operand Sizes and Partial Register Accesses............................................................... 2-76
Prefixes and Instruction Decoding.................................................................................. 2-80
REP Prefix and Data Movement..................................................................................... 2-81
Address Calculations...................................................................................................... 2-86
Clearing Registers.......................................................................................................... 2-87
Compares....................................................................................................................... 2-87
Floating Point/SIMD Operands....................................................................................... 2-88
Prolog Sequences.......................................................................................................... 2-90
Code Sequences that Operate on Memory Operands................................................... 2-90
Instruction Scheduling........................................................................................................... 2-91
Latencies and Resource Constraints.............................................................................. 2-91
Spill Scheduling.............................................................................................................. 2-92
Scheduling Rules for the Pentium 4 Processor Decoder ............................................... 2-92
Scheduling Rules for the Pentium M Processor Decoder .............................................. 2-93
Vectorization ......................................................................................................................... 2-93
Miscellaneous....................................................................................................................... 2-95
NOPs.............................................................................................................................. 2-95
Summary of Rules and Suggestions..................................................................................... 2-96
User/Source Coding Rules............................................................................................. 2-97
Assembly/Compiler Coding Rules.................................................................................. 2-99
Tuning Suggestions...................................................................................................... 2-108
Chapter 3 Coding for SIMD Architectures
Checking for Processor Support of SIMD Technologies ......................................................... 3-2
Checking for MMX Technology Support ........................................................................... 3-2
Checking for Streaming SIMD Extensions Support.......................................................... 3-3
Checking for Streaming SIMD Extensions 2 Support....................................................... 3-5
Checking for Streaming SIMD Extensions 3 Support....................................................... 3-6
Considerations for Code Conversion to SIMD Programming.................................................. 3-8
Identifying Hot Spots...................................................................................................... 3-10
Determine If Code Benefits by Conversion to SIMD Execution...................................... 3-11
Coding Techniques ............................................................................................................... 3-12
Coding Methodologies.................................................................................................... 3-13
Assembly.................................................................................................................. 3-15
Intrinsics.................................................................................................................... 3-15
Classes..................................................................................................................... 3-17
Automatic Vectorization............................................................................................. 3-18
Stack and Data Alignment..................................................................................................... 3-20
Alignment and Contiguity of Data Access Patterns........................................................ 3-20
Using Padding to Align Data..................................................................................... 3-20
Using Arrays to Make Data Contiguous.................................................................... 3-21
Stack Alignment For 128-bit SIMD Technologies ........................................................... 3-22
Data Alignment for MMX Technology............................................................................. 3-23
Data Alignment for 128-bit data...................................................................................... 3-24
Compiler-Supported Alignment................................................................................. 3-24
Improving Memory Utilization................................................................................................ 3-27
Data Structure Layout..................................................................................................... 3-27
Strip Mining..................................................................................................................... 3-32
Loop Blocking................................................................................................................. 3-34
Instruction Selection.............................................................................................................. 3-37
SIMD Optimizations and Microarchitectures.................................................................. 3-38
Tuning the Final Application.................................................................................................. 3-39
Chapter 4 Optimizing for SIMD Integer Applications
General Rules on SIMD Integer Code .................................................................................... 4-2
Using SIMD Integer with x87 Floating-point............................................................................ 4-3
Using the EMMS Instruction............................................................................................. 4-3
Guidelines for Using EMMS Instruction............................................................................ 4-4
Data Alignment........................................................................................................................ 4-6
Data Movement Coding Techniques....................................................................................... 4-6
Unsigned Unpack............................................................................................................. 4-6
Signed Unpack................................................................................................................. 4-7
Interleaved Pack with Saturation...................................................................................... 4-8
Interleaved Pack without Saturation............................................................................... 4-10
Non-Interleaved Unpack................................................................................................. 4-11
Extract Word................................................................................................................... 4-13
Insert Word..................................................................................................................... 4-14
Move Byte Mask to Integer............................................................................................. 4-16
Packed Shuffle Word for 64-bit Registers ...................................................................... 4-18
Packed Shuffle Word for 128-bit Registers .................................................................... 4-19
Unpacking/interleaving 64-bit Data in 128-bit Registers................................................. 4-20
Data Movement.............................................................................................................. 4-21
Conversion Instructions.................................................................................................. 4-21
Generating Constants............................................................................................................ 4-21
Building Blocks...................................................................................................................... 4-23
Absolute Difference of Unsigned Numbers .................................................................... 4-23
Absolute Difference of Signed Numbers........................................................................ 4-24
Absolute Value ................................................................................................................ 4-25
Clipping to an Arbitrary Range [high, low]...................................................................... 4-26
Highly Efficient Clipping............................................................................................ 4-27
Clipping to an Arbitrary Unsigned Range [high, low]................................................ 4-28
Packed Max/Min of Signed Word and Unsigned Byte.................................................... 4-29
Signed Word............................................................................................................. 4-29
Unsigned Byte .......................................................................................................... 4-30
Packed Multiply High Unsigned...................................................................................... 4-30
Packed Sum of Absolute Differences............................................................................. 4-30
Packed Average (Byte/Word)......................................................................................... 4-31
Complex Multiply by a Constant...................................................................................... 4-32
Packed 32*32 Multiply.................................................................................................... 4-33
Packed 64-bit Add/Subtract............................................................................................ 4-33
128-bit Shifts................................................................................................................... 4-33
Memory Optimizations.......................................................................................................... 4-34
Partial Memory Accesses............................................................................................... 4-35
Supplemental Techniques for Avoiding Cache Line Splits........................................ 4-37
Increasing Bandwidth of Memory Fills and Video Fills ................................................... 4-39
Increasing Memory Bandwidth Using the MOVDQ Instruction................................. 4-39
Increasing Memory Bandwidth by Loading and Storing to and from the
Same DRAM Page ................................................................................................ 4-39
Increasing UC and WC Store Bandwidth by Using Aligned Stores........................... 4-40
Converting from 64-bit to 128-bit SIMD Integer .................................................................... 4-40
SIMD Optimizations and Microarchitectures.................................................................. 4-41
Packed SSE2 Integer versus MMX Instructions....................................................... 4-42
Chapter 5 Optimizing for SIMD Floating-point Applications
General Rules for SIMD Floating-point Code.......................................................................... 5-1
Planning Considerations......................................................................................................... 5-2
Using SIMD Floating-point with x87 Floating-point................................................................. 5-3
Scalar Floating-point Code...................................................................................................... 5-3
Data Alignment........................................................................................................................ 5-4
Data Arrangement............................................................................................................ 5-4
Vertical versus Horizontal Computation...................................................................... 5-5
Data Swizzling............................................................................................................ 5-9
Data Deswizzling...................................................................................................... 5-14
Using MMX Technology Code for Copy or Shuffling Functions................................ 5-17
Horizontal ADD Using SSE....................................................................................... 5-18
Use of cvttps2pi/cvttss2si Instructions.................................................................................. 5-21
Flush-to-Zero and Denormals-are-Zero Modes .................................................................... 5-22
SIMD Floating-point Programming Using SSE3................................................................... 5-22
SSE3 and Complex Arithmetics ...................................................................................... 5-23
SSE3 and Horizontal Computation................................................................................. 5-26
SIMD Optimizations and Microarchitectures.................................................................. 5-27
Packed Floating-Point Performance......................................................................... 5-27
Chapter 6 Optimizing Cache Usage
General Prefetch Coding Guidelines....................................................................................... 6-2
Hardware Prefetching of Data................................................................................................. 6-4
Prefetch and Cacheability Instructions.................................................................................... 6-5
Prefetch................................................................................................................................... 6-6
Software Data Prefetch..................................................................................................... 6-6
The Prefetch Instructions – Pentium 4 Processor Implementation................................... 6-8
Prefetch and Load Instructions......................................................................................... 6-8
Cacheability Control................................................................................................................ 6-9
The Non-temporal Store Instructions.............................................................................. 6-10
Fencing..................................................................................................................... 6-10
Streaming Non-temporal Stores ............................................................................... 6-10
Memory Type and Non-temporal Stores................................................................... 6-11
Write-Combining....................................................................................................... 6-12
Streaming Store Usage Models...................................................................................... 6-13
Coherent Requests................................................................................................... 6-13
Non-coherent requests............................................................................................. 6-13
Streaming Store Instruction Descriptions ....................................................................... 6-14
The fence Instructions.................................................................................................... 6-15
The sfence Instruction.............................................................................................. 6-15
The lfence Instruction............................................................................................... 6-16
The mfence Instruction............................................................................................. 6-16
The clflush Instruction .................................................................................................... 6-17
Memory Optimization Using Prefetch.................................................................................... 6-18
Software-controlled Prefetch........................................................................................... 6-18
Hardware Prefetch.......................................................................................................... 6-19
Example of Effective Latency Reduction with H/W Prefetch.......................................... 6-20
Example of Latency Hiding with S/W Prefetch Instruction ............................................ 6-22
Software Prefetching Usage Checklist........................................................................... 6-24
Software Prefetch Scheduling Distance......................................................................... 6-25
Software Prefetch Concatenation................................................................................... 6-26
Minimize Number of Software Prefetches...................................................................... 6-29
Mix Software Prefetch with Computation Instructions.................................................... 6-32
Software Prefetch and Cache Blocking Techniques....................................................... 6-34
Hardware Prefetching and Cache Blocking Techniques ................................................ 6-39
Single-pass versus Multi-pass Execution....................................................................... 6-41
Memory Optimization using Non-Temporal Stores................................................................ 6-43
Non-temporal Stores and Software Write-Combining..................................................... 6-43
Cache Management....................................................................................................... 6-44
Video Encoder.......................................................................................................... 6-45
Video Decoder.......................................................................................................... 6-45
Conclusions from Video Encoder and Decoder Implementation.............................. 6-46
Optimizing Memory Copy Routines.......................................................................... 6-46
TLB Priming.............................................................................................................. 6-47
Using the 8-byte Streaming Stores and Software Prefetch....................................... 6-48
Using 16-byte Streaming Stores and Hardware Prefetch......................................... 6-50
Performance Comparisons of Memory Copy Routines............................................ 6-52
Deterministic Cache Parameters .......................................................................................... 6-53
Cache Sharing Using Deterministic Cache Parameters................................................. 6-55
Cache Sharing in Single-core or Multi-core.................................................................... 6-55
Determine Prefetch Stride Using Deterministic Cache Parameters............................... 6-56
Chapter 7 Multi-Core and Hyper-Threading Technology
Performance and Usage Models............................................................................................. 7-2
Multithreading................................................................................................................... 7-2
Multitasking Environment................................................................................................. 7-4
Programming Models and Multithreading ............................................................................... 7-6
Parallel Programming Models .......................................................................................... 7-7
Domain Decomposition............................................................................................... 7-7
Functional Decomposition................................................................................................. 7-8
Specialized Programming Models.................................................................................... 7-8
Producer-Consumer Threading Models.................................................................... 7-10
Tools for Creating Multithreaded Applications................................................................ 7-14
Optimization Guidelines........................................................................................................ 7-16
Key Practices of Thread Synchronization ...................................................................... 7-16
Key Practices of System Bus Optimization.................................................................... 7-17
Key Practices of Memory Optimization .......................................................................... 7-17
Key Practices of Front-end Optimization........................................................................ 7-18
Key Practices of Execution Resource Optimization....................................................... 7-18
Generality and Performance Impact............................................................................... 7-19
Thread Synchronization........................................................................................................ 7-19
Choice of Synchronization Primitives.............................................................................. 7-20
Synchronization for Short Periods.................................................................................. 7-22
Optimization with Spin-Locks ......................................................................................... 7-25
Synchronization for Longer Periods ............................................................................... 7-26
Avoid Coding Pitfalls in Thread Synchronization...................................................... 7-28
Prevent Sharing of Modified Data and False-Sharing..................................................... 7-30
Placement of Shared Synchronization Variable ............................................................. 7-31
System Bus Optimization...................................................................................................... 7-33
Conserve Bus Bandwidth............................................................................................... 7-34
Understand the Bus and Cache Interactions.................................................................. 7-35
Avoid Excessive Software Prefetches............................................................................ 7-36
Improve Effective Latency of Cache Misses................................................................... 7-36
Use Full Write Transactions to Achieve Higher Data Rate.............................................. 7-37
Memory Optimization............................................................................................................ 7-38
Cache Blocking Technique............................................................................................. 7-38
Shared-Memory Optimization......................................................................................... 7-39
Minimize Sharing of Data between Physical Processors.......................................... 7-39
Batched Producer-Consumer Model........................................................................ 7-40
Eliminate 64-KByte Aliased Data Accesses................................................................... 7-42
Preventing Excessive Evictions in First-Level Data Cache............................................ 7-43
Per-thread Stack Offset ............................................................................................ 7-44
Per-instance Stack Offset......................................................................................... 7-46
Front-end Optimization.......................................................................................................... 7-48
Avoid Excessive Loop Unrolling..................................................................................... 7-48
Optimization for Code Size............................................................................................. 7-49
Using Thread Affinities to Manage Shared Platform Resources........................................... 7-49
Using Shared Execution Resources in a Processor Core.............................................. 7-59
Chapter 8 64-bit Mode Coding Guidelines
Introduction............................................................................................................................. 8-1
Coding Rules Affecting 64-bit Mode........................................................................................ 8-1
Use Legacy 32-Bit Instructions When The Data Size Is 32 Bits....................................... 8-1
Use Extra Registers to Reduce Register Pressure .......................................................... 8-2
Use 64-Bit by 64-Bit Multiplies That Produce 128-Bit Results Only When Necessary..... 8-2
Sign Extension to Full 64-Bits........................................................................................... 8-3
Alternate Coding Rules for 64-Bit Mode.................................................................................. 8-4
Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic..................... 8-4
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible................................. 8-6
Using Software Prefetch................................................................................................... 8-6
Chapter 9 Power Optimization for Mobile Usages
Overview................................................................................................................................. 9-1
Mobile Usage Scenarios......................................................................................................... 9-2
ACPI C-States......................................................................................................................... 9-4
Processor-Specific C4 and Deep C4 States..................................................................... 9-6
Guidelines for Extending Battery Life...................................................................................... 9-7
Adjust Performance to Meet Quality of Features ............................................................. 9-8
Reducing Amount of Work................................................................................................ 9-9
Platform-Level Optimizations.......................................................................................... 9-10
Handling Sleep State Transitions ................................................................................... 9-11
Using Enhanced Intel SpeedStep® Technology ............................................................. 9-12
Enabling Intel® Enhanced Deeper Sleep ....................................................................... 9-14
Multi-Core Considerations.............................................................................................. 9-15
Enhanced Intel SpeedStep® Technology.................................................................. 9-15
Thread Migration Considerations.............................................................................. 9-16
Multi-core Considerations for C-States..................................................................... 9-17
Appendix A Application Performance Tools
Intel® Compilers..................................................................................................................... A-2
Code Optimization Options .............................................................................................. A-3
Targeting a Processor (-Gn) ....................................................................................... A-3
Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions])...... A-4
Vectorizer Switch Options ................................................................................................ A-5
Loop Unrolling............................................................................................................ A-5
Multithreading with OpenMP*.................................................................................... A-6
Inline Expansion of Library Functions (-Oi, -Oi-) ............................................................. A-6
Floating-point Arithmetic Precision (-Op, -Op-, -Qprec, -Qprec_div, -Qpc,
-Qlong_double)............................................................................................................. A-6
Rounding Control Option (-Qrcd) .................................................................................... A-6
Interprocedural and Profile-Guided Optimizations .......................................................... A-7
Interprocedural Optimization (IPO)............................................................................ A-7
Profile-Guided Optimization (PGO) ........................................................................... A-7
Intel® VTune™ Performance Analyzer................................................................................... A-8
Sampling ......................................................................................................................... A-9
Time-based Sampling................................................................................................. A-9
Event-based Sampling............................................................................................. A-10
Workload Characterization ...................................................................................... A-11
Call Graph ..................................................................................................................... A-13
Counter Monitor............................................................................................................. A-14
Intel® Tuning Assistant.................................................................................................. A-14
Intel® Performance Libraries................................................................................................ A-14
Benefits Summary......................................................................................................... A-15
Optimizations with the Intel® Performance Libraries..................................................... A-16
Enhanced Debugger (EDB) ................................................................................................. A-17
Intel® Threading Tools.......................................................................................................... A-17
Intel® Thread Checker................................................................................................... A-17
Thread Profiler............................................................................................................... A-19
Intel® Software College........................................................................................................ A-20
Appendix B Using Performance Monitoring Events
Pentium 4 Processor Performance Metrics............................................................................ B-1
Pentium 4 Processor-Specific Terminology............................................................................ B-2
Bogus, Non-bogus, Retire............................................................................................... B-2
Bus Ratio......................................................................................................................... B-2
Replay............................................................................................................................. B-3
Assist............................................................................................................................... B-3
Tagging............................................................................................................................ B-3
Counting Clocks..................................................................................................................... B-4
Non-Halted Clockticks..................................................................................................... B-5
Non-Sleep Clockticks...................................................................................................... B-6
Time Stamp Counter........................................................................................................ B-7
Microarchitecture Notes......................................................................................................... B-8
Trace Cache Events........................................................................................................ B-8
Bus and Memory Metrics................................................................................................. B-8
Reads due to program loads ................................................................................... B-11
Reads due to program writes (RFOs)...................................................................... B-11
Writebacks (dirty evictions)...................................................................................... B-12
Usage Notes for Specific Metrics .................................................................................. B-13
Usage Notes on Bus Activities...................................................................................... B-15
Metrics Descriptions and Categories ................................................................................... B-16
Performance Metrics and Tagging Mechanisms.................................................................. B-46
Tags for replay_event.................................................................................................... B-46
Tags for front_end_event............................................................................................... B-48
Tags for execution_event .............................................................................................. B-48
Using Performance Metrics with Hyper-Threading Technology........................................... B-50
Using Performance Events of Intel Core Solo and Intel Core Duo processors.................... B-56
Understanding the Results in a Performance Counter.................................................. B-56
Ratio Interpretation........................................................................................................ B-57
Notes on Selected Events............................................................................................. B-58
Appendix C IA-32 Instruction Latency and Throughput
Overview................................................................................................................................ C-2
Definitions .............................................................................................................................. C-4
Latency and Throughput........................................................................................................ C-4
Latency and Throughput with Register Operands.......................................................... C-6
Table Footnotes........................................................................................................ C-19
Latency and Throughput with Memory Operands ......................................................... C-20
Appendix D Stack Alignment
Stack Frames......................................................................................................................... D-1
Aligned esp-Based Stack Frames................................................................................... D-4
Aligned ebp-Based Stack Frames................................................................................... D-6
Stack Frame Optimizations.............................................................................................. D-9
Inlined Assembly and ebx.................................................................................................... D-10
Appendix E Mathematics of Prefetch Scheduling Distance
Simplified Equation ................................................................................................................ E-1
Mathematical Model for PSD................................................................................................. E-2
No Preloading or Prefetch................................................................................................ E-6
Compute Bound (Case: Tc >= Tl + Tb)............................................................................ E-7
Compute Bound (Case: Tl + Tb > Tc > Tb) ..................................................................... E-8
Memory Throughput Bound (Case: Tb >= Tc)............................................................... E-10
Example ........................................................................................................................ E-11
Index
Examples
Example 2-1 Assembly Code with an Unpredictable Branch ............................. 2-17
Example 2-2 Code Optimization to Eliminate Branches..................................... 2-17
Example 2-3 Eliminating Branch with CMOV Instruction.................................... 2-18
Example 2-4 Use of pause Instruction ............................................................... 2-19
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm.............. 2-20
Example 2-6 Static Taken Prediction Example ...................................................2-21
Example 2-7 Static Not-Taken Prediction Example ............................................ 2-21
Example 2-8 Indirect Branch With Two Favored Targets .................................... 2-25
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction ..... 2-26
Example 2-10 Loop Unrolling ............................................................................... 2-28
Example 2-11 Code That Causes Cache Line Split ............................................. 2-31
Example 2-12 Several Situations of Small Loads After Large Store .................... 2-35
Example 2-13 A Non-forwarding Example of Large Load After Small Store ........ 2-36
Example 2-14 A Non-forwarding Situation in Compiler Generated Code............. 2-36
Example 2-15 Two Examples to Avoid the Non-forwarding Situation in
Example 2-14 ................................................................................ 2-36
Example 2-16 Large and Small Load Stalls ......................................................... 2-37
Example 2-17 An Example of Loop-carried Dependence Chain .......................... 2-39
Example 2-18 Rearranging a Data Structure ....................................................... 2-39
Example 2-19 Decomposing an Array .................................................................. 2-40
Example 2-20 Dynamic Stack Alignment ............................................................. 2-43
Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions............ 2-54
Example 2-22 Non-temporal Stores and Partial Bus Write Transactions .............2-54
Example 2-23 Algorithm to Avoid Changing the Rounding Mode......................... 2-66
Example 2-24 Dependencies Caused by Referencing Partial Registers.............. 2-77
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form..................... 2-91
Example 2-26 Spill Scheduling Example Code .................................................... 2-92
Example 3-1 Identification of MMX Technology with cpuid................................... 3-3
Example 3-2 Identification of SSE with cpuid ....................................................... 3-4
Example 3-3 Identification of SSE by the OS ....................................................... 3-4
Example 3-4 Identification of SSE2 with cpuid..................................................... 3-5
Example 3-5 Identification of SSE2 by the OS..................................................... 3-6
Example 3-6 Identification of SSE3 with cpuid..................................................... 3-7
Example 3-7 Identification of SSE3 by the OS..................................................... 3-8
Example 3-8 Simple Four-Iteration Loop ............................................................ 3-14
Example 3-9 Streaming SIMD Extensions Using Inlined Assembly Encoding ...3-15
Example 3-10 Simple Four-Iteration Loop Coded with Intrinsics.......................... 3-16
Example 3-11 C++ Code Using the Vector Classes ............................................. 3-18
Example 3-12 Automatic Vectorization for a Simple Loop .................................... 3-19
Example 3-13 C Algorithm for 64-bit Data Alignment ........................................... 3-23
Example 3-14 AoS Data Structure ....................................................................... 3-27
Example 3-15 SoA Data Structure ....................................................................... 3-28
Example 3-16 AoS and SoA Code Samples ........................................................ 3-28
Example 3-17 Hybrid SoA Data Structure ............................................................ 3-30
Example 3-18 Pseudo-code Before Strip Mining.................................................. 3-32
Example 3-19 Strip Mined Code........................................................................... 3-33
Example 3-20 Loop Blocking................................................................................ 3-35
Example 3-21 Emulation of Conditional Moves.................................................... 3-37
Example 4-1 Resetting the Register between __m64 and FP Data Types...........4-5
Example 4-2 Unsigned Unpack Instructions......................................................... 4-7
Example 4-3 Signed Unpack Code ...................................................................... 4-8
Example 4-4 Interleaved Pack with Saturation ................................................... 4-10
Example 4-5 Interleaved Pack without Saturation ..............................................4-11
Example 4-6 Unpacking Two Packed-word Sources in a Non-interleaved Way . 4-13
Example 4-7 pextrw Instruction Code................................................................. 4-14
Example 4-8 pinsrw Instruction Code................................................................. 4-15
Example 4-9 Repeated pinsrw Instruction Code ................................................ 4-16
Example 4-10 pmovmskb Instruction Code.......................................................... 4-17
Example 4-11 pshuf Instruction Code .................................................................. 4-19
Example 4-12 Broadcast Using 2 Instructions ..................................................... 4-19
Example 4-13 Swap Using 3 Instructions............................................................. 4-20
Example 4-14 Reverse Using 3 Instructions......................................................... 4-20
Example 4-15 Generating Constants ................................................................... 4-21
Example 4-16 Absolute Difference of Two Unsigned Numbers ............................ 4-23
Example 4-17 Absolute Difference of Signed Numbers ....................................... 4-24
Example 4-18 Computing Absolute Value ............................................................ 4-25
Example 4-19 Clipping to a Signed Range of Words [high, low] .......................... 4-27
Example 4-20 Clipping to an Arbitrary Signed Range [high, low]......................... 4-27
Example 4-21 Simplified Clipping to an Arbitrary Signed Range ......................... 4-28
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]..................... 4-29
Example 4-23 Complex Multiply by a Constant .................................................... 4-32
Example 4-24 A Large Load after a Series of Small Stores (Penalty).................. 4-35
Example 4-25 Accessing Data without Delay....................................................... 4-35
Example 4-26 A Series of Small Loads after a Large Store .................................4-36
Example 4-27 Eliminating Delay for a Series of Small Loads after a
Large Store.................................................................................... 4-36
Example 4-28 An Example of Video Processing with Cache Line Splits.............. 4-37
Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits ........ 4-38
Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation ....................... 5-8
Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation........ 5-9
Example 5-3 Swizzling Data............................................................................... 5-10
Example 5-4 Swizzling Data Using Intrinsics ..................................................... 5-12
Example 5-5 Deswizzling Single-Precision SIMD Data ...................................... 5-14
Example 5-6 Deswizzling Data Using the movlhps and shuffle
Instructions .................................................................................... 5-15
Example 5-7 Deswizzling Data 64-bit Integer SIMD Data .................................. 5-16
Example 5-8 Using MMX Technology Code for Copying or Shuffling.................5-18
Example 5-9 Horizontal Add Using movhlps/movlhps ........................................ 5-19
Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps .................5-21
Example 5-11 Multiplication of Two Pairs of Single-precision Complex Numbers 5-24
Example 5-12 Division of Two Pairs of Single-precision Complex Numbers ........ 5-25
Example 5-13 Calculating Dot Products from AOS .............................................. 5-26
Example 6-1 Pseudo-code for Using clflush ....................................................... 6-18
Example 6-2 Populating an Array for Circular Pointer Chasing with
Constant Stride.............................................................................. 6-21
Example 6-3 Prefetch Scheduling Distance ....................................................... 6-26
Example 6-4 Using Prefetch Concatenation....................................................... 6-28
Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop ....... 6-28
Example 6-6 Spread Prefetch Instructions ......................................................... 6-33
Example 6-7 Data Access of a 3D Geometry Engine without Strip-mining........ 6-37
Example 6-8 Data Access of a 3D Geometry Engine with Strip-mining............. 6-38
Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic .......... 6-40
Example 6-10 Basic Algorithm of a Simple Memory Copy................................... 6-46
Example 6-11 A Memory Copy Routine Using Software Prefetch........................6-48
Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation.. 6-50
Example 7-1 Serial Execution of Producer and Consumer Work Items ............... 7-9
Example 7-2 Basic Structure of Implementing Producer Consumer Threads.... 7-11
Example 7-3 Thread Function for an Interlaced Producer Consumer Model .....7-13
Example 7-4 Spin-wait Loop and PAUSE Instructions........................................ 7-24
Example 7-5 Coding Pitfall using Spin Wait Loop .............................................. 7-29
Example 7-6 Placement of Synchronization and Regular Variables ..................7-32
Example 7-7 Declaring Synchronization Variables without Sharing
a Cache Line ................................................................................. 7-32
Example 7-8 Batched Implementation of the Producer Consumer Threads ......7-41
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads ............... 7-45
Example 7-10 Adding a Pseudo-random Offset to the Stack Pointer
in the Entry Function ..................................................................... 7-47
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical
Processor ...................................................................................... 7-51
Example 7-12 Assembling a Look up Table to Manage Affinity Masks
and Schedule Threads to Each Core First ....................................7-54
Example 7-13 Discovering the Affinity Masks for Sibling Logical
Processors Sharing the Same Cache ........................................... 7-55
Example D-1 Aligned esp-Based Stack Frames .................................................. D-5
Example D-2 Aligned ebp-based Stack Frames................................................... D-7
Example E-1 Calculating Insertion for Scheduling Distance of 3 ..........................E-3
Figures
Figure 1-1 Typical SIMD Operations ................................................................... 1-3
Figure 1-2 SIMD Instruction Register Usage ...................................................... 1-4
Figure 1-3 The Intel NetBurst Microarchitecture ............................................... 1-10
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core.......................1-19
Figure 1-5 The Intel Pentium M Processor Microarchitecture........................... 1-27
Figure 1-6 Hyper-Threading Technology on an SMP ........................................ 1-35
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition
and Intel Core Duo Processor ......................................................... 1-41
Figure 2-1 Cache Line Split in Accessing Elements in a Array ......................... 2-31
Figure 2-2 Size and Alignment Restrictions in Store Forwarding...................... 2-34
Figure 3-1 Converting to Streaming SIMD Extensions Chart .............................3-9
Figure 3-2 Hand-Coded Assembly and High-Level Compiler
Performance Trade-offs ................................................................... 3-13
Figure 3-3 Loop Blocking Access Pattern ......................................................... 3-36
Figure 4-1 PACKSSDW mm, mm/mm64 Instruction Example ........................... 4-9
Figure 4-2 Interleaved Pack with Saturation ....................................................... 4-9
Figure 4-3 Result of Non-Interleaved Unpack Low in MM0 .............................. 4-12
Figure 4-4 Result of Non-Interleaved Unpack High in MM1.............................. 4-12
Figure 4-5 pextrw Instruction ............................................................................ 4-14
Figure 4-6 pinsrw Instruction............................................................................. 4-15
Figure 4-7 pmovmskb Instruction Example....................................................... 4-17
Figure 4-8 pshuf Instruction Example ............................................................... 4-18
Figure 4-9 PSADBW Instruction Example ........................................................ 4-31
Figure 5-1 Homogeneous Operation on Parallel Data Elements ........................ 5-5
Figure 5-2 Dot Product Operation ....................................................................... 5-8
Figure 5-3 Horizontal Add Using movhlps/movlhps .......................................... 5-19
Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction .............. 5-23
Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction
           HADDPD ......................................................................................... 5-23
Figure 6-1 Effective Latency Reduction as a Function of Access Stride........... 6-22
Figure 6-2 Memory Access Latency and Execution Without Prefetch ..............6-23
Figure 6-3 Memory Access Latency and Execution With Prefetch ...................6-23
Figure 6-4 Prefetch and Loop Unrolling ............................................................ 6-29
Figure 6-5 Memory Access Latency and Execution With Prefetch ...................6-31
Figure 6-6 Cache Blocking – Temporally Adjacent and Non-adjacent
Passes............................................................................................. 6-35
Figure 6-7 Examples of Prefetch and Strip-mining for Temporally
Adjacent and Non-Adjacent Passes Loops ..................................... 6-36
Figure 6-8 Single-Pass Vs. Multi-Pass 3D Geometry Engines .........................6-42
Figure 7-1 Amdahl’s Law and MP Speed-up ...................................................... 7-3
Figure 7-2 Single-threaded Execution of Producer-consumer
Threading Model................................................................................ 7-9
Figure 7-3 Execution of Producer-consumer Threading Model on
a Multi-core Processor..................................................................... 7-10
Figure 7-4 Interlaced Variation of the Producer Consumer Model.................... 7-12
Figure 7-5 Batched Approach of Producer Consumer Model ........................... 7-40
Figure 9-1 Performance History and State Transitions ....................................... 9-3
Figure 9-2 Active Time Versus Halted Time of a Processor ............................... 9-4
Figure 9-3 Application of C-states to Idle Time................................................... 9-6
Figure 9-4 Profiles of Coarse Task Scheduling and Power Consumption ......... 9-12
Figure 9-5 Thread Migration in a Multi-Core Processor .................................... 9-17
Figure 9-6 Progression to Deeper Sleep .......................................................... 9-18
Figure A-1 Sampling Analysis of Hotspots by Location.....................................A-10
Figure A-2 Intel Thread Checker Can Locate Data Race Conditions ................A-18
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded
Execution Timelines.........................................................................A-20
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ
and Front Side Bus ..........................................................................B-10
Figure D-1 Stack Frames Based on Alignment Type .......................................... D-3
Figure E-1 Pentium II, Pentium III and Pentium 4 Processors Memory
Pipeline Sketch ..................................................................................E-4
Figure E-2 Execution Pipeline, No Preloading or Prefetch ..................................E-6
Figure E-3 Compute Bound Execution Pipeline ..................................................E-7
Figure E-4 Another Compute Bound Execution Pipeline.....................................E-8
Figure E-5 Memory Throughput Bound Pipeline ...............................................E-10
Figure E-6 Accesses per Iteration, Example 1 ..................................................E-12
Figure E-7 Accesses per Iteration, Example 2 ..................................................E-13
Tables
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters .................. 1-20
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32
          Processor Families ............................................................................ 1-30
Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and
          Intel® Core™ Duo Processors .......................................................... 1-30
Table 1-4 Family And Model Designations of Microarchitectures...................... 1-42
Table 1-5 Characteristics of Load and Store Operations
          in Intel Core Duo Processors ............................................................ 1-43
Table 2-1 Coding Pitfalls Affecting Performance ................................................. 2-2
Table 2-2 Avoiding Partial Flag Register Stall ................................................... 2-76
Table 2-3 Avoiding Partial Register Stall When Packing Byte Values ............... 2-78
Table 2-4 Avoiding False LCP Delays with 0xF7 Group Instructions ................ 2-81
Table 2-5 Using REP STOSD with Arbitrary Count Size and
          4-Byte-Aligned Destination ................................................................ 2-85
Table 5-1 SoA Form of Representing Vertices Data ........................................... 5-7
Table 6-1 Software Prefetching Considerations into Strip-mining Code............ 6-39
Table 6-2 Relative Performance of Memory Copy Routines ............................. 6-52
Table 6-3 Deterministic Cache Parameters Leaf............................................... 6-54
Table 7-1 Properties of Synchronization Objects .............................................. 7-21
Table B-1 Pentium 4 Processor Performance Metrics ....................................... B-18
Table B-2 Metrics That Utilize Replay Tagging Mechanism ............................... B-47
Table B-3 Metrics That Utilize the Front-end Tagging Mechanism ..................... B-48
Table B-4 Metrics That Utilize the Execution Tagging Mechanism .................... B-49
Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3).............. B-50
Table B-6 Metrics That Support Qualification by Logical Processor and
          Parallel Counting ............................................................................... B-51
Table B-7 Metrics That Are Independent of Logical Processors........................ B-55
Table C-1 Streaming SIMD Extension 3 SIMD Floating-point Instructions ......... C-6
Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions.................. C-7
Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point
          Instructions ......................................................................................... C-9
Table C-4 Streaming SIMD Extension Single-precision Floating-point
          Instructions ....................................................................................... C-12
Table C-5 Streaming SIMD Extension 64-bit Integer Instructions..................... C-14
Table C-6 MMX Technology 64-bit Instructions ................................................ C-14
Table C-7 IA-32 x87 Floating-point Instructions................................................ C-16
Table C-8 IA-32 General Purpose Instructions ................................................. C-17

Introduction

The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of the current generation of IA-32 Intel architecture family of processors. The optimizations described in this manual apply to IA-32 processors based on the Intel® NetBurst® microarchitecture, the Intel® Pentium® M processor family and IA-32 processors that support Hyper-Threading Technology.
The target audience for this manual includes software programmers and compiler writers. This manual assumes that the reader is familiar with the basics of the IA-32 architecture and has access to the Intel Software Developer’s Manual: Volume 1, Basic Architecture; Volume 2A, Instruction Set Reference A-M; Volume 2B, Instruction Set Reference N-Z; and Volume 3, System Programmer’s Guide.
When developing and optimizing software applications to achieve a high level of performance on IA-32 processors, a detailed understanding of the IA-32 family of processors is often required; in many cases, knowledge of specific IA-32 microarchitectures is needed.
This manual provides an overview of the Intel NetBurst microarchitecture and the Intel Pentium M processor microarchitecture. It contains design guidelines for high-performance software applications, coding rules, and techniques for many aspects of code-tuning. These rules are useful to programmers and compiler developers.
The design guidelines that are discussed in this manual for developing high-performance software apply to current as well as to future IA-32 processors. The coding rules and code optimization techniques listed
target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture.
Tuning Your Application
Tuning an application for high performance on any IA-32 processor requires understanding and basic skills in:
IA-32 architecture
C and Assembly language
the hot-spot regions in your application that have significant impact
on software performance
the optimization capabilities of your compiler
techniques to evaluate the application’s performance
The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot regions in your applications. On the Pentium 4, Intel® Xeon and Pentium® M processors, this tool can monitor an application through a selection of performance monitoring events and analyze the performance event data that is gathered during code execution.
This manual also describes information that can be gathered using the performance counters through Pentium 4 processor’s performance monitoring events.
For VTune Performance Analyzer order information, see the web page:
http://developer.intel.com
About This Manual
In this document, the reference “Pentium 4 processor” refers to processors based on the Intel NetBurst microarchitecture. Currently this includes the Intel Pentium 4 processor and Intel Xeon processor. Where appropriate, differences between the Pentium 4 processor and the Intel Xeon processor are noted.
The manual consists of the following parts:
Introduction. Defines the purpose and outlines the contents of this manual.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview. Describes the features relevant to software optimization of the current generation of IA-32 Intel architecture processors, including the architectural extensions to the IA-32 architecture and an overview of the Intel NetBurst microarchitecture, Pentium M processor microarchitecture and Hyper-Threading Technology.
Chapter 2: General Optimization Guidelines. Describes general code development and optimization techniques that apply to all applications designed to take advantage of the common features of the Intel NetBurst microarchitecture and Pentium M processor microarchitecture.
Chapter 3: Coding for SIMD Architectures. Describes techniques and concepts for using the SIMD integer and SIMD floating-point instructions provided by the MMX technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, and Streaming SIMD Extensions 3.
Chapter 4: Optimizing for SIMD Integer Applications. Provides optimization suggestions and common building blocks for applications that use the 64-bit and 128-bit SIMD integer instructions.
Chapter 5: Optimizing for SIMD Floating-point Applications. Provides optimization suggestions and common building blocks for applications that use the single-precision and double-precision SIMD floating-point instructions.
Chapter 6: Optimizing Cache Usage. Describes how to use the prefetch instruction and cache control management instructions to optimize cache usage, and describes the deterministic cache parameters.
Chapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing multithreaded applications to achieve optimal performance scaling. Use these when targeting multiprocessor (MP) systems or MP systems using IA-32 processors that support Hyper-Threading Technology.
Chapter 8: 64-Bit Mode Coding Guidelines. This chapter describes a set of additional coding guidelines for application software written to run in 64-bit mode.
Chapter 9: Power Optimization for Mobile Usages. This chapter provides background on power saving techniques in mobile processors and makes recommendations that developers can leverage to provide longer battery life.
Appendix A: Application Performance Tools. Introduces tools for analyzing and enhancing application performance without having to write assembly code.
Appendix B: Intel Pentium 4 Processor Performance Metrics. Provides information that can be gathered using Pentium 4 processor’s performance monitoring events. These performance metrics can help programmers determine how effectively an application is using the features of the Intel NetBurst microarchitecture.
Appendix C: IA-32 Instruction Latency and Throughput. Provides latency and throughput data for the IA-32 instructions. Instruction timing data specific to the Pentium 4 and Pentium M processors are provided.
Appendix D: Stack Alignment. Describes stack alignment conventions and techniques to optimize performance of accessing stack-based data.
Appendix E: The Mathematics of Prefetch Scheduling Distance. Discusses the optimum spacing to insert prefetch instructions and presents a mathematical model for determining the prefetch scheduling distance (PSD) for your application.
Related Documentation
For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following documents:
Intel® C++ Compiler User’s Guide
Intel® Fortran Compiler User’s Guide
VTune Performance Analyzer online help
Intel® Architecture Software Developer’s Manual:
— Volume 1: Basic Architecture, doc. number 253665
— Volume 2A: Instruction Set Reference Manual A-M, doc. number 253666
— Volume 2B: Instruction Set Reference Manual N-Z, doc. number 253667
— Volume 3: System Programmer’s Guide, doc. number 253668
Intel Processor Identification with the CPUID Instruction, doc. number 241618
Developing Multi-threaded Applications: A Platform Consistent
Approach, available at http://cache-www.intel.com/cd/00/00/05/15/51534_developing_mul tithreaded_applications.pdf
Also, refer to the following Application Notes:
Adjusting Thread Stack Address To Improve Performance On Intel
Xeon MP Hyper-Threading Technology Enabled Processors
Detecting Hyper-Threading Technology Enabled Processors
Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon
Processor MP
In addition, refer to publications in the following web sites:
http://developer.intel.com/technology/hyperthread
http://cedar.intel.com/cgi-bin/ids.dll/topic.jsp?catCode=CDN
Notational Conventions
This manual uses the following conventions:
This type style    Indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example. The text appears in lowercase unless uppercase is significant.
THIS TYPE STYLE    Indicates a value, for example, TRUE, CONST1, or a variable, for example, A, B, or register names MMO through MM7. l indicates lowercase letter L in examples. 1 is the number 1 in examples. O is the uppercase O in examples. 0 is the number 0 in examples.
This type style    Indicates a placeholder for an identifier, an expression, a string, a symbol, or a value. Substitute one of these items for the placeholder.
... (ellipses)     Indicate that a few lines of the code are omitted.
This type style    Indicates a hypertext link.

IA-32 Intel® Architecture Processor Family Overview

This chapter gives an overview of the features relevant to software optimization for the current generations of IA-32 processors, including: Intel® Core™ Solo, Intel® Core™ Duo, Intel® Pentium® 4, Intel® Xeon®, Intel® Pentium® M, and IA-32 processors with multi-core architecture. These features include:
SIMD instruction extensions including MMX™ technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions 3 (SSE3)
Microarchitectures that enable executing instructions with high throughput at high clock rates, a high speed cache hierarchy and the ability to fetch data with high speed system bus
Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel® processors supporting Hyper-Threading (HT) Technology (see footnote 1)
Multi-core architecture supported in Intel® Core™ Duo, Intel® Pentium® D processors and Pentium® processor Extreme Edition (see footnote 2)
Intel Pentium 4 processors, Intel Xeon processors, Pentium D processors, and Pentium processor Extreme Editions are based on Intel NetBurst® microarchitecture. The Intel Pentium M processor microarchitecture balances performance and low power consumption.
1. Hyper-Threading Technology requires a computer system with an Intel processor supporting HT Technology and an HT Technology enabled chipset, BIOS and operating system. Performance varies depending on the hardware and software used.
2. Dual-core platform requires an Intel Core Duo, Pentium D processor or Pentium processor Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance varies depending on the hardware and software used.
Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor.

SIMD Technology

SIMD computations (see Figure 1-1) were introduced in the IA-32 architecture with MMX technology. MMX technology allows SIMD computations to be performed on packed byte, word, and doubleword integers. The integers are contained in a set of eight 64-bit registers called MMX registers (see Figure 1-2).
The Pentium III processor extended the SIMD computation model with the introduction of the Streaming SIMD Extensions (SSE). SSE allows SIMD computations to be performed on operands that contain four packed single-precision floating-point data elements. The operands can be in memory or in a set of eight 128-bit XMM registers (see Figure 1-2). SSE also extended SIMD computational capability by adding additional 64-bit MMX instructions.
Figure 1-1 shows a typical SIMD computation. Two sets of four packed data elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are operated on in parallel, with the same operation being performed on
each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.
Figure 1-1 Typical SIMD Operations
[figure: packed operands X4 X3 X2 X1 and Y4 Y3 Y2 Y1 are combined by four parallel OPs, producing X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1]
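To make the model concrete, the following minimal sketch uses SSE compiler intrinsics (discussed in Chapter 3) to perform the four parallel single-precision additions depicted in Figure 1-1 with a single instruction; the function name is illustrative, not from this manual.

#include <xmmintrin.h>  /* SSE intrinsics */

/* One ADDPS computes X4 op Y4, X3 op Y3, X2 op Y2 and X1 op Y1
   in parallel, where op is addition. */
__m128 simd_add4(__m128 x, __m128 y)
{
    return _mm_add_ps(x, y);
}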
The Pentium 4 processor further extended the SIMD computation model with the introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3).
SSE2 works with operands in either memory or in the XMM registers. The technology extends SIMD computations to process packed double-precision floating-point data elements and 128-bit packed integers. There are 144 instructions in SSE2 that operate on two packed double-precision floating-point data elements or on 16 packed byte, 8 packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that can accelerate application performance in specific areas. These include video processing, complex arithmetic, and thread synchronization. SSE3 complements SSE and SSE2 with instructions that process SIMD data asymmetrically, facilitate horizontal computation, and help avoid loading cache line splits.
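As a brief illustration of horizontal computation, the sketch below uses the SSE3 horizontal-add intrinsic; the function name is illustrative, not from this manual.

#include <pmmintrin.h>  /* SSE3 intrinsics */

/* HADDPS adds adjacent elements within each source operand:
   result = (a0+a1, a2+a3, b0+b1, b2+b3). */
__m128 horizontal_add(__m128 a, __m128 b)
{
    return _mm_hadd_ps(a, b);
}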
Figure 1-2 SIMD Instruction Register Usage
[figure: eight 64-bit MMX registers, MM0 through MM7, and eight 128-bit XMM registers, XMM0 through XMM7]
SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following characteristics:
inherently parallel
recurring memory access patterns
localized recurring operations performed on the data
data-independent control flow
SIMD floating-point instructions fully support the IEEE Standard 754 for Binary Floating-Point Arithmetic. They are accessible from all IA-32 execution modes: protected mode, real address mode, and Virtual 8086 mode.
SSE, SSE2, and MMX technologies are architectural extensions in the IA-32 Intel architecture. Existing software will continue to run correctly, without modification on IA-32 microprocessors that incorporate these technologies. Existing software will also run correctly in the presence of applications that incorporate SIMD technologies.
SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can improve cache usage and application performance.
For more on SSE, SSE2, SSE3 and MMX technologies, see:
IA-32 Intel® Architecture Software Developer’s Manual, Volume 1: Chapter 9, “Programming with Intel® MMX™ Technology”; Chapter 10, “Programming with Streaming SIMD Extensions (SSE)”; Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2)”; Chapter 12, “Programming with Streaming SIMD Extensions 3 (SSE3)”

Summary of SIMD Technologies

MMX Technology
MMX Technology introduced:
64-bit MMX registers
support for SIMD operations on packed byte, word, and doubleword
integers
MMX instructions are useful for multimedia and communications software.
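A minimal sketch of the MMX programming model using compiler intrinsics (the function name is illustrative, not from this manual): one PADDW performs four packed 16-bit additions, and EMMS clears the MMX state before any subsequent x87 floating-point code.

#include <mmintrin.h>  /* MMX technology intrinsics */

/* PADDW: four packed 16-bit additions in one instruction. */
void add_packed_words(const __m64 *a, const __m64 *b, __m64 *result)
{
    *result = _mm_add_pi16(*a, *b);
    _mm_empty();  /* EMMS: required before later x87 FP instructions */
}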
Streaming SIMD Extensions
Streaming SIMD extensions introduced:
128-bit XMM registers
128-bit data type with four packed single-precision floating-point
operands
data prefetch instructions
non-temporal store instructions and other cacheability and memory
ordering instructions
extra 64-bit SIMD integer support
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.
Streaming SIMD Extensions 2
Streaming SIMD extensions 2 add the following:
128-bit data type with two packed double-precision floating-point
operands
128-bit data types for SIMD integer operation on 16-byte, 8-word,
4-doubleword, or 2-quadword integers
support for SIMD arithmetic on 64-bit integer operands
instructions for converting between new and existing data types
extended support for data shuffling
extended support for cacheability and memory ordering operations
SSE2 instructions are useful for 3D graphics, video decoding/encoding, and encryption.
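A minimal sketch of an SSE2 packed integer operation using compiler intrinsics (the function name is illustrative, not from this manual): one PADDD adds four doubleword integers at once.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* PADDD: four packed 32-bit integer additions in one instruction. */
__m128i add_packed_dwords(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}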
Streaming SIMD Extensions 3
Streaming SIMD extensions 3 add the following:
SIMD floating-point instructions for asymmetric and horizontal
computation
a special-purpose 128-bit load instruction to avoid cache line splits (see the sketch at the end of this section)
an x87 FPU instruction to convert to integer independent of the
floating-point control word (FCW)
instructions to support thread synchronization
SSE3 instructions are useful for scientific, video and multi-threaded applications.
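The special-purpose load mentioned above is LDDQU, exposed through a compiler intrinsic. A minimal sketch (the function name is illustrative, not from this manual):

#include <pmmintrin.h>  /* SSE3 intrinsics */

/* LDDQU: load 16 bytes from an address with no alignment requirement,
   designed to avoid cache line split penalties. */
__m128i load_unaligned(const void *p)
{
    return _mm_lddqu_si128((const __m128i *)p);
}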

Intel® Extended Memory 64 Technology (Intel®EM64T)

Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits. The technology also introduces a new operating mode referred to as IA-32e mode.
IA-32e mode consists of two sub-modes: (1) compatibility mode enables a 64-bit operating system to run most legacy 32-bit software unmodified, (2) 64-bit mode enables a 64-bit operating system to run applications written to access 64-bit linear address space.
In the 64-bit mode of Intel EM64T, software may access:
64-bit flat linear addressing
8 additional general-purpose registers (GPRs)
8 additional registers for streaming SIMD extensions (SSE, SSE2
and SSE3)
64-bit-wide GPRs and instruction pointers
uniform byte-register addressing
fast interrupt-prioritization mechanism
a new instruction-pointer relative-addressing mode
For optimizing 64-bit applications, the features that impact software optimizations include:
using a set of prefixes to access new registers or 64-bit register
operand
pointer size increases from 32 bits to 64 bits (see the sketch following this list)
instruction-specific usages
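As a small illustration of the pointer-size change (a sketch, not from this manual), pointer-heavy data structures grow when compiled for 64-bit mode, which increases their cache footprint:

#include <stdio.h>

/* A pointer-heavy node grows from 8 bytes in 32-bit mode to 16 bytes
   in 64-bit mode (8-byte pointer plus 4-byte int plus padding). */
struct node {
    struct node *next;   /* 4 bytes in 32-bit mode, 8 bytes in 64-bit mode */
    int          value;
};

int main(void)
{
    printf("pointer size: %u bytes\n", (unsigned) sizeof(void *));
    printf("node size:    %u bytes\n", (unsigned) sizeof(struct node));
    return 0;
}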

Intel NetBurst® Microarchitecture

The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture.
This section describes the features of the Intel NetBurst microarchitecture and its operation common to the above processors. It provides the technical background required to understand optimization recommendations and the coding rules discussed in the rest of this manual. For implementation details, including instruction latencies, see Appendix C, “IA-32 Instruction Latency and Throughput.”
Intel NetBurst microarchitecture is designed to achieve high performance for integer and floating-point computations at high clock rates. It supports the following features:
hyper-pipelined technology that enables high clock rates
a high-performance, quad-pumped bus interface to the Intel
NetBurst microarchitecture system bus
a rapid execution engine to reduce the latency of basic integer
instructions
out-of-order speculative execution to enable parallelism
superscalar issue to enable parallelism
hardware register renaming to avoid register name space limitations
cache line sizes of 64 bytes
hardware prefetch

Design Goals of Intel NetBurst Microarchitecture

The design goals of Intel NetBurst microarchitecture are:
to execute legacy IA-32 applications and applications based on
single-instruction, multiple-data (SIMD) technology at high throughput
to operate at high clock rates and to scale to higher performance and
clock rates in the future
Design advances of the Intel NetBurst microarchitecture include:
a deeply pipelined design that allows for high clock rates (with
different parts of the chip running at different clock rates).
a pipeline that optimizes for the common case of frequently
executed instructions; the most frequently-executed instructions in common circumstances (such as a cache hit) are decoded efficiently and executed with short latencies
employment of techniques to hide stall penalties; among these are
parallel execution, buffering, and speculation. The microarchitecture executes instructions dynamically and out-of-order, so the time it takes to execute each individual instruction is not always deterministic
Chapter 2, “General Optimization Guidelines,” lists optimizations to use and situations to avoid. The chapter also gives a sense of relative priority. Because most optimizations are implementation dependent, the chapter does not quantify expected benefits and penalties.
The following sections provide more information about key features of the Intel NetBurst microarchitecture.

Overview of the Intel NetBurst Microarchitecture Pipeline

The pipeline of the Intel NetBurst microarchitecture contains:
an in-order issue front end
an out-of-order superscalar execution core
an in-order retirement unit
The front end supplies instructions in program order to the out-of-order core. It fetches and decodes IA-32 instructions. The decoded IA-32 instructions are translated into micro-operations (µops). The front end’s primary job is to feed a continuous stream of µops to the execution core in original program order.
The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution resources available) can execute as soon as possible. The core can issue multiple µops per cycle.
The retirement section ensures that the results of execution are processed according to original program order and that the proper architectural states are updated.
Figure 1-3 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline. The following subsections provide an overview for each.
Figure 1-3 The Intel NetBurst Microarchitecture
6\VWHP%XV
%XV8QLW
UG/HYHO&DFKH
QG/HYHO&DFKH
)URQW(QG
)HWFK'HFRGH
%7%V%UDQFK3UHGLFWLRQ
2SWLRQDO
:D\
7UDFH&DFKH
0LFURFRGH520
)UHTXHQWO\XVHGSDWKV
/HVVIUHTXHQWO\XVHGSDWKV
VW/HYHO&DFKH
ZD\
([HFXWLRQ
2XW2I2UGHU&RUH
%UDQFK+LVWRU\8SGDWH
5HWLUHPHQW
The Front End
The front end of the Intel NetBurst microarchitecture consists of two parts:
fetch/decode unit
execution trace cache
It performs the following functions:
prefetches IA-32 instructions that are likely to be executed
fetches required instructions that have not been prefetched
decodes instructions into µops
generates microcode for complex instructions and special-purpose
code
delivers decoded instructions from the execution trace cache
predicts branches using advanced algorithms
The front end is designed to address two problems that are sources of delay:
the time required to decode instructions fetched from the target
wasted decode bandwidth due to branches or a branch target in the
middle of a cache line
Instructions are fetched and decoded by a translation engine. The translation engine builds decoded instructions into µop sequences called traces, which are then stored in the execution trace cache.
The execution trace cache stores µops in the path of program execution flow, where the results of branches in the code are integrated into the same cache line. This increases the instruction flow from the cache and makes better use of the overall cache storage space since the cache no longer stores instructions that are branched over and never executed.
The trace cache can deliver up to 3 µops per clock to the core.
The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached, otherwise they are fetched from the memory hierarchy. The translation engine’s branch prediction information is used to form traces along the most likely paths.
The Out-of-order Core
The core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed while waiting for data or a contended resource, other µops that appear later in the program order may proceed. This implies that when one portion of the pipeline experiences a delay, the delay may be covered by other operations executing in parallel or by the execution of µops queued up in a buffer.
The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle through the issue ports (see Figure 1-4). Note that six µops per cycle exceeds the trace cache and retirement µop bandwidth. The higher bandwidth in the core allows for peak bursts of greater than three µops and to achieve higher issue rates by allowing greater flexibility in issuing µops to different execution ports.
Most core execution units can start executing a new µop every cycle, so several instructions can be in flight at one time in each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions start one every two cycles. Finally, µops can begin execution out of program order, as soon as their data inputs are ready and resources are available.
Retirement
The retirement section receives the results of the executed µops from the execution core and processes the results so that the architectural state is updated according to the original program order. For semantically
correct execution, the results of IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. For this reason, exceptions cannot occur speculatively.
When a µop completes and writes its result to the destination, it is retired. Up to three µops may be retired per cycle. The reorder buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state and manages the ordering of exceptions.
The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB). This updates branch history. Figure 1-3 illustrates the paths that are most frequently executing inside the Intel NetBurst microarchitecture: an execution loop that interacts with multilevel cache hierarchy and the system bus.
The following sections describe in more detail the operation of the front end and the execution core. This information provides the background for using the optimization techniques and instruction latency data documented in this manual.

Front End Pipeline Detail

The following information about front end operation may be useful for tuning software with respect to prefetching, branch prediction, and execution trace cache operations.
Prefetching
The Intel NetBurst microarchitecture supports three prefetching mechanisms:
a hardware instruction fetcher that automatically prefetches
instructions
a hardware mechanism that automatically fetches data and
instructions into the unified second-level cache
a mechanism that fetches data only, with two distinct components: (1) a hardware mechanism that fetches the adjacent cache line within a 128-byte sector containing the data needed due to a cache line miss (also referred to as adjacent cache line prefetch), and (2) a software controlled mechanism that fetches data into the caches using the prefetch instructions.
The hardware instruction fetcher reads instructions along the path predicted by the branch target buffer (BTB) into instruction streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are described later.
Decoder
The front end of the Intel NetBurst microarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock. Some complex instructions must enlist the help of the microcode ROM. The decoder operation is connected to the execution trace cache.
Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst microarchitecture. The TC stores decoded IA-32 instructions (µops).
In the Pentium 4 processor implementation, TC can hold up to 12K µops and can deliver up to three µops per cycle. TC does not hold all of the µops that need to be executed in the execution core. In some situations, the execution core may need to execute a microcode flow instead of the µop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace cache while only a few instructions involve the microcode ROM.
Branch Prediction
Branch prediction is important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. For Pentium 4 and Intel Xeon processors, the branch delay for a correctly predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many cycles, usually equivalent to the pipeline depth.
Branch prediction in the Intel NetBurst microarchitecture predicts all near branches (conditional branches, unconditional calls, returns and indirect branches). It does not predict far transfers (far calls, irets and software interrupts).
Mechanisms have been implemented to aid in predicting branches accurately and to reduce the cost of taken branches. These include:
the ability to dynamically predict the direction and target of
branches based on an instruction’s linear address, using the branch target buffer (BTB)
if no dynamic prediction is available or if it is invalid, the ability to
statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken
the ability to predict return addresses using the 16-entry return
address stack
the ability to build a trace of instructions across predicted taken
branches to avoid branch penalties.
The Static Predictor. Once a branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement, such as loop-closing branches) as taken. Forward branches are predicted not taken.
To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2).
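A minimal C sketch of this arrangement (illustrative code, not from this manual): the rare path is moved out of line so the forward branch guarding it falls through on the likely path, matching the static not-taken prediction.

/* The NULL check compiles to a forward branch, statically predicted
   not taken; the common case falls through and the rare error path
   sits at the end of the function. */
int process(const int *p)
{
    if (p == 0)
        goto rare_error;   /* forward branch, rarely taken */
    return *p * 2;         /* likely target immediately follows */
rare_error:
    return -1;
}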
Branch Target Buffer. Once branch history is available, the Pentium 4
processor can predict the branch outcome even before the branch instruction is decoded. The processor uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction’s linear address. Once the branch is retired, the BTB is updated with the target address.
Return Stack. Returns are always taken; but since a procedure may be invoked from several call sites, a single predicted target does not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the need to put certain procedures inline since the return penalty portion of the procedure call overhead is reduced.
Even if the direction and target address of the branch are correctly predicted, a taken branch may reduce available parallelism in a typical processor (since the decode bandwidth is wasted for instructions which immediately follow the branch and precede the target, if the branch does not end the line and target does not begin the line). The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing instruction delivery from the front end.

Execution Core Detail

The execution core is designed to optimize overall performance by handling common cases most efficiently. The hardware is designed to execute frequent operations in a common context as fast as possible, at the expense of infrequent operations using rare contexts.
Some parts of the core may speculate that a common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains to store-to-load forwarding (see “Store Forwarding” in this chapter). If a load is predicted to be dependent on a store, it gets its data from that store and tentatively proceeds. If the load turned out not to depend on the store, the load is delayed until the real data has been loaded from memory, then it proceeds.
Instruction Latency and Throughput
The superscalar out-of-order core contains hardware resources that can execute multiple μops in parallel. The core’s ability to make use of available parallelism of execution units can be enhanced by software’s ability to:
select IA-32 instructions that can be decoded in less than 4 μops
and/or have short latencies
order IA-32 instructions to preserve available parallelism by
minimizing long dependence chains and covering long instruction latencies
order instructions so that their operands are ready and their
corresponding issue ports and execution units are free when they reach the scheduler
This subsection describes port restrictions, result latencies, and issue latencies (also referred to as throughput). These concepts form the basis to assist software for ordering instructions to increase parallelism. The order that μops are presented to the core of the processor is further affected by the machine’s scheduling resources.
It is the execution core that reacts to an ever-changing machine state, reordering μops for faster execution or delaying them because of dependence and resource constraints. The ordering of instructions in software is more of a suggestion to the hardware.
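One common way software preserves available parallelism, sketched below with illustrative C code (not from this manual), is to break a long loop-carried dependence chain into several independent accumulators so the out-of-order core can issue the additions in parallel.

/* Four independent accumulators replace one serial dependence chain. */
float sum(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];      /* the four chains do not depend on each   */
        s1 += a[i + 1];  /* other, so their adds can execute in     */
        s2 += a[i + 2];  /* parallel on the out-of-order core       */
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* handle any remainder serially */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}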
Appendix C, “IA-32 Instruction Latency and Throughput,” lists some of the more-commonly-used IA-32 instructions with their latency, their issue throughput, and associated execution units (where relevant). Some
execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate. All µops executed out of the microcode ROM involve extra overhead.
Execution Units and Issue Ports
At each cycle, the core may dispatch µops to one or more of four issue ports. At the microarchitecture level, store operations are further divided into two parts: store data and store address operations. The four ports through which μops are dispatched to execution units and to load and store operations are shown in Figure 1-4. Some ports can dispatch two µops per clock. Those execution units are marked Double Speed.
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (a floating-point stack move, floating-point exchange or floating-point store data), or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store data). In the second half of the cycle, it can dispatch one similar ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) µop or one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic) µop. In the second half of the cycle, it can dispatch one similar ALU µop.
Port 2. This port supports the dispatch of one load operation per cycle.
Port 3. This port supports the dispatch of one store address operation per cycle.
The total issue bandwidth can range from zero to six µops per cycle.
Each pipeline contains several execution units. The µops are dispatched to the pipeline that corresponds to the correct type of operation. For example, an integer arithmetic logic unit and the floating-point execution units (adder, multiplier, and divider) can share a pipeline.
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core
[figure: Port 0 dispatches to ALU 0 (double speed: ADD/SUB, logic, store data, branches) and to FP Move (FP move, FP store data, FXCH); Port 1 dispatches to ALU 1 (double speed: ADD/SUB, shift/rotate), to normal-speed Integer Operation (multiply, shift, rotate), and to FP Execute (FP_ADD, FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, MMX_MISC); Port 2 dispatches Memory Load (all loads, prefetch); Port 3 dispatches Memory Store (store address). Note: FP_ADD refers to x87 FP, and SIMD FP add and subtract operations; FP_MUL refers to x87 FP, and SIMD FP multiply operations; FP_DIV refers to x87 FP, and SIMD FP divide and square root operations; MMX_ALU refers to SIMD integer arithmetic and logic operations; MMX_SHFT handles shift, rotate, shuffle, pack and unpack operations; MMX_MISC handles SIMD reciprocal and some integer operations]

Caches

The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture. The Intel Xeon processor MP and selected Pentium and Intel Xeon processors may also contain a third-level cache.
The first level cache (nearest to the execution core) contains separate caches for instructions and data. These include the first-level data cache and the trace cache (an advanced first-level instruction cache). All other caches are shared between instructions and data.
Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm.
Table 1-1 provides parameters for all cache levels of Pentium 4 and Intel Xeon processors with CPUID model encoding equal to 0, 1, 2 or 3.
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters

Level (Model)            Capacity                  Associativity (ways)  Line Size (bytes)  Access Latency, Integer/floating-point (clocks)  Write Update Policy
First (Model 0, 1, 2)    8 KB                      4                     64                 2/9                                               write through
First (Model 3)          16 KB                     8                     64                 4/12                                              write through
TC (All models)          12K µops                  8                     N/A                N/A                                               N/A
Second (Model 0, 1, 2)   256 KB or 512 KB (2)      8                     64 (1)             7/7                                               write back
Second (Model 3, 4)      1 MB                      8                     64 (1)             18/18                                             write back
Second (Model 3, 4, 6)   2 MB                      8                     64 (1)             20/20                                             write back
Third (Model 0, 1, 2)    0, 512 KB, 1 MB or 2 MB   8                     64 (1)             14/14                                             write back

1. Each read due to a cache miss fetches a sector, consisting of two adjacent cache lines; a write operation is 64 bytes.
2. Pentium 4 and Intel Xeon processors with CPUID model encoding value of 2 have a second level cache of 512 KB.
On processors without a third level cache, the second-level cache miss initiates a transaction across the system bus interface to the memory sub-system. On processors with a third level cache, the third-level cache miss initiates a transaction across the system bus. A bus write transaction writes 64 bytes to cacheable memory, or separate 8-byte chunks if the destination is not cacheable. A bus read transaction from cacheable memory fetches two cache lines of data.
The system bus interface supports using a scalable bus clock and achieves an effective speed that quadruples the speed of the scalable bus clock. It takes on the order of 12 processor cycles to get to the bus and
back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as bus ratio. For example, one bus cycle for a 100 MHz bus is equal to 15 processor cycles on a 1.50 GHz processor. Since the speed of the bus is implementation-dependent, consult the specifications of a given system for further details.
Data Prefetch
The Pentium 4 processor and other IA-32 processors based on the NetBurst microarchitecture have two types of mechanisms for prefetching data: software prefetch instructions and hardware-based prefetch mechanisms.
Software controlled prefetch is enabled using the four prefetch instructions (PREFETCHh) introduced with SSE. The software prefetch is not intended for prefetching code. Using it can incur significant penalties on a multiprocessor system if code is shared.
Software prefetch can provide benefits in selected situations (a usage sketch in C follows this list). These situations include:
when the pattern of memory access operations in software allows the programmer to hide memory latency
when a reasonable choice can be made about how many cache lines to fetch ahead of the line being executed
when a choice can be made about the type of prefetch to use
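As a minimal sketch, software prefetch is usually issued through a compiler intrinsic rather than hand-written assembly. The example below assumes a compiler that provides the _mm_prefetch intrinsic from xmmintrin.h (the C interface to the PREFETCHh instructions); the function name and prefetch distance are illustrative only.

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* Illustrative prefetch distance; the right value depends on memory
   latency and on the amount of work done per iteration. */
#define PREFETCH_DISTANCE 16

float sum_with_prefetch(const float *data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* _MM_HINT_T0 fetches into all cache levels;
               _MM_HINT_NTA requests a non-temporal fetch. */
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                         _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}

A production version would issue one prefetch per 64-byte cache line (every sixteen floats here) rather than per element, since redundant prefetches consume issue bandwidth.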
SSE prefetch instructions have different behaviors, depending on cache levels updated and the processor implementation. For instance, a processor may implement the non-temporal prefetch by returning data to the cache level closest to the processor core. This approach has the following effect:
minimizes disturbance of temporal data in other cache levels
avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache levels
Situations that are less likely to benefit from software prefetch are:
for cases that are already bandwidth bound, prefetching tends to increase bandwidth demands
prefetching far ahead can cause eviction of cached data from the caches prior to the data being used in execution
not prefetching far enough can reduce the ability to overlap memory and execution latencies
Software prefetches are treated by the processor as hints to initiate requests to fetch data from the memory system; they consume resources in the processor, and using too many prefetches can limit their effectiveness. Examples of this include prefetching data in a loop for a reference outside the loop, and prefetching in a basic block that is frequently executed but seldom precedes the reference for which the prefetch is targeted.
See also: Chapter 6, “Optimizing Cache Usage.”
Automatic hardware prefetch is a feature in the Pentium 4 processor. It brings cache lines into the unified second-level cache based on prior reference patterns. See also: Chapter 6, “Optimizing Cache Usage.”
Pros and Cons of Software and Hardware Prefetching. Software prefetching has the following characteristics:
handles irregular access patterns, which would not trigger the hardware prefetcher
handles prefetching of short arrays and avoids hardware prefetching start-up delay before initiating the fetches
must be added to new code, so it does not benefit existing applications
Hardware prefetching for Pentium 4 processor has the following characteristics:
works with existing applications
does not require extensive study of prefetch instructions
requires regular access patterns
avoids instruction and issue port bandwidth overhead
has a start-up penalty before the hardware prefetcher triggers and begins initiating fetches
The hardware prefetcher can handle multiple streams in either the forward or backward direction. The start-up delay and fetch-ahead distance have a larger effect for short arrays, where hardware prefetching generates requests for data beyond the end of the array (data that is not actually used). The hardware penalty diminishes if it is amortized over longer arrays.
Hardware prefetching is triggered after two successive cache misses in the last level cache, and requires the linear address distance between these cache misses to be within a threshold value. The threshold value depends on the processor implementation of the microarchitecture (see Table 1-2). However, hardware prefetching will not cross 4KB page boundaries. As a result, hardware prefetching can be very effective for cache miss patterns with small strides that are significantly less than half the trigger threshold distance. On the other hand, hardware prefetching will not benefit cache miss patterns that have frequent DTLB misses, or access strides that cause successive cache misses to be spatially apart by more than the trigger threshold distance.
Software can proactively control data access patterns to favor smaller access strides (e.g., strides less than half the trigger threshold distance) over larger access strides (strides greater than the trigger threshold distance). This can achieve the additional benefits of improved temporal locality and significantly fewer cache misses in the last level cache.
Thus, software optimization of a data access pattern should first emphasize tuning for hardware prefetch, favoring a greater proportion of smaller-stride data accesses in the workload, before attempting to provide hints to the processor by employing software prefetch instructions.
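As an illustration of this principle (assumed here, not from the original text), the following C sketch contrasts two traversal orders of the same two-dimensional array. The row-order loop produces consecutive accesses whose cache-miss stride is one 64-byte line, which the hardware prefetcher can follow; the column-order loop produces a stride of COLS * sizeof(double) bytes, which exceeds the trigger thresholds in Table 1-2.

#define ROWS 1024
#define COLS 1024

/* Row-order traversal: the inner loop walks consecutive addresses, so
   successive cache misses are 64 bytes apart. */
double sum_row_order(const double a[ROWS][COLS])
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-order traversal: successive accesses are COLS * sizeof(double)
   = 8192 bytes apart, beyond the trigger threshold, so the hardware
   prefetcher cannot help. */
double sum_column_order(const double a[ROWS][COLS])
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}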
Loads and Stores
The Pentium 4 processor employs the following techniques to speed up the execution of memory operations:
speculative execution of loads
reordering of loads with respect to loads and stores
multiple outstanding misses
buffering of writes
forwarding of data from stores to dependent loads
Performance may be enhanced by not exceeding the memory issue bandwidth and buffer resources provided by the processor. Up to one load and one store may be issued for each cycle from a memory port reservation station. In order to be dispatched to a reservation station, there must be a buffer entry available for each memory operation. There are 48 load buffers and 24 store buffers (see note below). These buffers hold the µop and address information until the operation is completed, retired, and deallocated.

The Pentium 4 processor is designed to enable the execution of memory operations out of order with respect to other instructions and with respect to each other. Loads can be carried out speculatively, that is, before all preceding branches are resolved. However, speculative loads cannot cause page faults.

Note: Pentium 4 processors with CPUID model encoding equal to 3 have more than 24 store buffers.
Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute operations as soon as their inputs are ready. Writes to memory are always carried out in program order to maintain program correctness.
A cache miss for a load does not prevent other loads from issuing and completing. The Pentium 4 processor supports up to four (or eight for Pentium 4 processor with CPUID signature corresponding to family 15, model 3) outstanding load misses that can be serviced either by on-chip caches or by memory.
Store buffers improve performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or cache is complete. Writes are generally not on the critical path for dependence chains, so it is often beneficial to delay writes for more efficient use of memory-access bus cycles.
Store Forwarding
Loads can be moved before stores that occurred earlier in the program if they are not predicted to load from the same linear address. If they do read from the same linear address, they have to wait for the store data to become available. However, with store forwarding, they do not have to wait for the store to write to the memory hierarchy and retire. The data from the store can be forwarded directly to the load, as long as the following conditions are met (a hypothetical C illustration follows this list):
Sequence: the data to be forwarded to the load has been generated by a programmatically-earlier store which has already executed.
Size: the bytes loaded must be a subset of (possibly identical to) the bytes stored.
Alignment: the store cannot wrap around a cache line boundary, and the linear address of the load must be the same as that of the store.
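The following hypothetical C fragment illustrates the size condition; whether forwarding actually occurs depends on the loads and stores the compiler emits, so this is a sketch of the principle rather than a guaranteed outcome.

#include <stdint.h>

typedef union {
    uint32_t dword;
    uint8_t  bytes[4];
} packed_t;

uint8_t forwarding_possible(packed_t *p)
{
    p->dword = 0x11223344u;  /* dword store                              */
    return p->bytes[0];      /* byte load reads a subset of the stored
                                bytes at the same address: data can be
                                forwarded from the store buffer          */
}

uint32_t forwarding_blocked(packed_t *p)
{
    p->bytes[0] = 0x44;      /* byte store                               */
    return p->dword;         /* dword load needs bytes the store did not
                                write: the load waits for the store to
                                complete and retire                      */
}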

Intel® Pentium® M Processor Microarchitecture

Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchitecture contains three sections:
in-order issue front end
out-of-order superscalar execution core
in-order retirement unit
Intel Pentium M processor microarchitecture supports a high-speed system bus (up to 533 MHz) with 64-byte line size. Most coding recommendations that apply to the Intel NetBurst microarchitecture also apply to the Intel Pentium M processor.
The Intel Pentium M processor microarchitecture is designed for lower power consumption. There are other specific areas of the Pentium M processor microarchitecture that differ from the Intel NetBurst microarchitecture. They are described next. A block diagram of the Intel Pentium M processor is shown in Figure 1-5.
Figure 1-5 The Intel Pentium M Processor Microarchitecture
[Figure: block diagram. The System Bus connects to the Bus Unit, which feeds the 2nd Level Cache. A 1st Level Instruction Cache with BTBs/Branch Prediction feeds the Front End (Fetch/Decode). The Front End issues to the Execution Out-Of-Order Core, backed by the 1st Level Data Cache; Retirement follows, with Branch History Update feeding back to branch prediction. Frequently used and less frequently used paths are distinguished.]

The Front End

The Intel Pentium M processor uses a pipeline depth that enables high performance and low power consumption. It is shorter than that of the Intel NetBurst microarchitecture.
The Intel Pentium M processor front end consists of two parts:
fetch/decode unit
instruction cache
The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallelism. It also provides a 32KB instruction cache that stores un-decoded binary instructions.
The instruction prefetcher fetches instructions in a linear fashion from memory if the target instructions are not already in the instruction cache. The prefetcher is designed to fetch efficiently from an aligned 16-byte block. If the modulo 16 remainder of a branch target address is 14, only two useful instruction bytes are fetched in the first cycle. The rest of the instruction bytes are fetched in subsequent cycles.
The three decoders decode IA-32 instructions and break them down into micro-ops (µops). In each clock cycle, the first decoder is capable of decoding an instruction with four or fewer µops. The remaining two decoders each decode a one-µop instruction in each clock cycle.
The front end can issue multiple µops per cycle, in original program order, to the out-of-order core.
The Intel Pentium M processor incorporates sophisticated branch prediction hardware to support the out-of-order core. The branch prediction hardware includes dynamic prediction, and branch target buffers.
The Intel Pentium M processor has enhanced dynamic branch prediction hardware. Branch target buffers (BTB) predict the direction and target of branches based on an instruction’s address.
The Pentium M processor includes two techniques to reduce the execution time of certain operations:
ESP folding. This eliminates the ESP manipulation micro-operations in stack-related instructions such as PUSH, POP, CALL and RET. It increases decode, rename and retirement throughput. ESP folding also increases execution bandwidth by eliminating µops which would have required execution resources.
Micro-op (µop) fusion. Some of the most frequent pairs of µops derived from the same instruction can be fused into a single µop. The following categories of fused µops have been implemented in the Pentium M processor:
— “Store address” and “store data” micro-ops are fused into a single “store” micro-op. This holds for all types of store operations, including integer, floating-point, MMX technology, and Streaming SIMD Extensions (SSE and SSE2) operations.
— A load micro-op can in most cases be fused with a successive execution micro-op. This holds for integer, floating-point and MMX technology loads and for most kinds of successive execution operations. Note that SSE loads cannot be fused.

Data Prefetching

The Intel Pentium M processor supports three prefetching mechanisms:
The first mechanism is a hardware instruction fetcher, described in the previous section.
The second mechanism automatically fetches data into the second-level cache. The implementation of automatic hardware prefetching in the Pentium M processor family is basically similar to that described for the NetBurst microarchitecture. The trigger threshold distance for each relevant processor model is shown in Table 1-2.
The third mechanism is a software mechanism that fetches data into the caches using the prefetch instructions.
Table 1-2 Trigger Threshold and CPUID Signatures for IA-32 Processor Families

Trigger Threshold Distance (Bytes)   Extended Model ID   Extended Family ID   Family ID   Model ID
512                                  0                   0                    15          3, 4, 6
256                                  0                   0                    15          0, 1, 2
256                                  0                   0                    6           9, 13, 14
Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters.
Table 1-3 Cache Parameters of Pentium M, Intel® Core Solo and Intel® Core Duo Processors

Level               Capacity   Associativity (ways)   Line Size (bytes)   Access Latency (clocks)   Write Update Policy
First               32 KB      8                      64                  3                         Writeback
Instruction         32 KB      8                      N/A                 N/A                       N/A
Second (model 9)    1 MB       8                      64                  9                         Writeback
Second (model 13)   2 MB       8                      64                  10                        Writeback
Second (model 14)   2 MB       8                      64                  14                        Writeback

Out-of-Order Core

The processor core dynamically executes µops independent of program order. The core is designed to facilitate parallel execution by employing many buffers, issue ports, and parallel execution units.
The out-of-order core buffers µops in a Reservation Station (RS) until their operands are ready and resources are available. Each cycle, the core may dispatch up to five µops through the issue ports.

In-Order Retirement

The retirement unit in the Pentium M processor buffers completed µops in the reorder buffer (ROB). The ROB updates the architectural state in order. Up to three µops may be retired per cycle.

Microarchitecture of Intel® Core Solo and Intel® Core Duo Processors

Intel Core Solo and Intel Core Duo processors incorporate a microarchitecture that is similar to the Pentium M processor microarchitecture, but with additional enhancements for performance and power efficiency. Enhancements include:
Intel Smart Cache
This second-level cache is shared between the two cores in an Intel Core Duo processor to minimize bus traffic when the two cores access a single copy of cached data. It allows an Intel Core Solo processor (or an Intel Core Duo processor when one of its two cores is idle) to access the cache's full capacity.
Streaming SIMD Extensions 3
These extensions are supported in Intel Core Solo and Intel Core Duo processors.
Decoder improvement
Improvement in decoder and micro-op fusion allows the front end to see most instructions as single-µop instructions. This increases the throughput of the three decoders in the front end.
Improved execution core
Throughput of SIMD instructions is improved and the out-of-order engine is more robust in handling sequences of frequently-used instructions. Enhanced internal buffering and prefetch mechanisms also improve data bandwidth for execution.
Power-optimized bus
The system bus is optimized for power efficiency; the increased bus speed supports 667 MHz.
Data Prefetch
Intel Core Solo and Intel Core Duo processors implement improved hardware prefetch mechanisms: one mechanism can look ahead and prefetch data into L1 from L2. These processors also provide enhanced hardware prefetchers similar to those of the Pentium M processor (see Table 1-2).

Front End

Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors is improved over Pentium M processors by the following enhancements:
Micro-op fusion
Scalar SIMD operations on register and memory have single micro-op flows comparable to x87 flows. Many packed instructions are fused to reduce their micro-op flow from four to two micro-ops.
Eliminating decoder restrictions
Intel Core Solo and Intel Core Duo processors improve decoder throughput with micro-fusion and macro-fusion, so that many more SSE and SSE2 instructions can be decoded without restriction. On Pentium M processors, many single micro-op SSE and SSE2 instructions must be decoded by the main decoder.
Improved packed SIMD instruction decoding
On Intel Core Solo and Intel Core Duo processors, decoding of most packed SSE instructions is done by all three decoders. As a result, the front end can process up to three packed SSE instructions every cycle. There are some exceptions: some shuffle/unpack/shift operations are not fused and require the main decoder.

Data Prefetching

Intel Core Solo and Intel Core Duo processors provide hardware mechanisms to prefetch data from memory to the second-level cache. There are two techniques: one mechanism activates after the data access pattern experiences two cache-reference misses within a trigger-distance threshold (see Table 1-2). This mechanism is similar to that of the Pentium M processor, but can track 16 forward data streams and 4 backward streams. The second mechanism fetches an adjacent cache line of data after experiencing a cache miss. This effectively simulates the prefetching capabilities of 128-byte sectors (similar to the sectoring of two adjacent 64-byte cache lines available in Pentium 4 processors).
Hardware prefetch requests are queued up in the bus system at lower priority than normal cache-miss requests. If the bus queue is in high demand, hardware prefetch requests may be ignored or cancelled to service bus traffic required by demand cache misses and other bus transactions.
Hardware prefetch mechanisms are enhanced over those of the Pentium M processor in two ways:
Data stores that are not in the second-level cache generate read-for-ownership requests. These requests are treated as loads and can trigger a prefetch stream.
Software prefetch instructions are treated as loads; they can also trigger a prefetch stream.

Hyper-Threading Technology

Intel® Hyper-Threading (HT) Technology is supported by specific members of the Intel Pentium 4 and Xeon processor families. The technology enables software to take advantage of task-level, or thread-level parallelism by providing multiple logical processors within a physical processor package. In its first implementation in Intel Xeon processor, Hyper-Threading Technology makes a single physical processor appear as two logical processors.
The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
By sharing resources needed for peak demands between two logical processors, HT Technology is well suited for multiprocessor systems to provide an additional performance boost in throughput when compared to traditional MP systems.
Figure 1-6 shows a typical bus-based symmetric multiprocessor (SMP) based on processors supporting Hyper-Threading Technology. Each logical processor can execute a software thread, allowing a maximum of two software threads to execute simultaneously on one physical processor. The two software threads execute simultaneously, meaning that in the same clock cycle an “add” operation from logical processor 0 and another “add” operation and load from logical processor 1 can be executed simultaneously by the execution engine.
In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor. This minimizes the die area cost of implementing HT Technology while still achieving performance gains for multithreaded applications or multitasking workloads.
Figure 1-6 Hyper-Threading Technology on an SMP
[Figure: two physical processors on a system bus. Each physical processor contains two architectural states and two local APICs, which share a single execution engine and a single bus interface.]
The performance potential of HT Technology is due to:
the fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor
the ability to use on-chip execution resources at a higher level than when only a single thread is consuming the execution resources; a higher level of resource utilization can lead to higher system throughput

Processor Resources and Hyper-Threading Technology

The majority of microarchitecture resources in a physical processor are shared between the logical processors. Only a few small data structures were replicated for each logical processor. This section describes how resources are shared, partitioned or replicated.
Replicated Resources
The architectural state is replicated for each logical processor. The architecture state consists of registers that are used by the operating system and application code to control program behavior and store data for computations. This state includes the eight general-purpose registers, the control registers, machine state registers, debug registers, and others. There are a few exceptions, most notably the memory type range registers (MTRRs) and the performance monitoring resources. For a complete list of the architecture state and exceptions, see the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B.
Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track execution and state changes of the two logical processors. The return stack predictor is replicated to improve branch prediction of return instructions.
In addition, a few buffers (for example, the 2-entry instruction streaming buffers) were replicated to reduce complexity.
Partitioned Resources
Several buffers are shared by limiting the use of each logical processor to half the entries. These are referred to as partitioned resources. Reasons for this partitioning include:
operational fairness
permitting operations from one logical processor to bypass operations of the other logical processor that may have stalled
For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress.
In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers include µop queues after the execution trace cache, the queues after the register rename stage, the reorder buffer which stages instructions for retirement, and the load and store buffers.
In the case of load and store buffers, partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations.
Shared Resources
Most resources in a physical processor are fully shared to improve the dynamic utilization of the resource, including caches and all the execution units. Some shared resources which are linearly addressed, like the DTLB, include a logical processor ID bit to distinguish whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID bit:
Shared mode: The L1 data cache is fully shared by the two logical processors.
Adaptive mode: In adaptive mode, memory accesses using the page directory are mapped identically across logical processors sharing the L1 data cache.
The other resources are fully shared.

Microarchitecture Pipeline and Hyper-Threading Technology

This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline.
Although instructions originating from two programs or two threads execute simultaneously and not necessarily in program order in the execution core and memory hierarchy, the front end and back end contain several selection points to select between instructions from the two logical processors. All selection points alternate between the two logical processors unless one logical processor cannot make use of a pipeline stage. In this case, the other logical processor has full use of every cycle of the pipeline stage. Reasons why a logical processor may not use a pipeline stage include cache misses, branch mispredictions, and instruction dependencies.

Front End Pipeline

The execution trace cache is shared between two logical processors. Execution trace cache access is arbitrated by the two logical processors every clock. If a cache line is fetched for one logical processor in one clock cycle, the next clock cycle a line would be fetched for the other logical processor provided that both logical processors are requesting access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace cache, the other logical processor can use the full bandwidth of the trace cache until the initial logical processor’s instruction fetches return from the L2 cache.
After fetching the instructions and building traces of µops, the µops are placed in a queue. This queue decouples the execution trace cache from the register rename pipeline stage. As described earlier, if both logical processors are active, the queue is partitioned so that both logical processors can make independent forward progress.

Execution Core

The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors. The execution core and memory hierarchy is also oblivious to which instructions belong to which logical processor.
After execution, instructions are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor uses half the entries.

Retirement

The retirement logic tracks when instructions from the two logical processors are ready to be retired. It retires the instruction in program order for each logical processor by alternating between the two logical processors. If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor.
Once stores have retired, the processor needs to write the store data into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.

Multi-Core Processors

The Intel Pentium D processor and the Pentium Processor Extreme Edition introduce multi-core features in the IA-32 architecture. These processors enhance hardware support for multi-threading by providing two processor cores in each physical processor package. The Dual-core Intel Xeon and Intel Core Duo processors also provide two processor cores in a physical package.
The Intel Pentium D processor provides two logical processors in a physical package; each logical processor has a separate execution core and a cache hierarchy. The Dual-core Intel Xeon processor and the Intel
Pentium Processor Extreme Edition provide four logical processors in a physical package that has two execution cores. Each core provides two logical processors sharing an execution core and a cache hierarchy.
The Intel Core Duo processor provides two logical processors in a physical package. Each logical processor has a separate execution core (including first-level cache) and a smart second-level cache. The second-level cache is shared between two logical processors and optimized to reduce bus traffic when the same copy of cached data is used by two logical processors. The full capacity of the second-level cache can be used by one logical processor if the other logical processor is inactive.
The functional blocks of these processors are shown in Figure 1-7.
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor
[Figure: three block diagrams. Pentium D processor: two cores on the system bus, each with its own architectural state, execution engine, local APIC, caches, and bus interface. Pentium Processor Extreme Edition: two cores on the system bus, each with two architectural states and two local APICs sharing an execution engine, caches, and a bus interface. Intel Core Duo processor: two cores, each with its own architectural state, execution engine, local APIC, and first-level caches, sharing a second-level cache and a single bus interface to the system bus.]

Microarchitecture Pipeline and Multi-Core Processors

In general, each core in a multi-core processor resembles a single-core processor implementation of the underlying microarchitecture. The implementation of the cache hierarchy in a dual-core or multi-core processor may be the same or different from the cache hierarchy implementation in a single-core processor.
CPUID should be used to determine cache-sharing topology information in a processor implementation and the underlying microarchitecture. The former is obtained by querying the deterministic cache parameter leaf (see Chapter 6, “Optimizing Cache Usage”); the latter by using the encoded values for the extended family, family, extended model, and model fields. See Table 1-4 and the sketch that follows it.
Table 1-4 Family And Model Designations of Microarchitectures

Dual-Core Processor                 Microarchitecture    Extended Family   Family   Extended Model   Model
Pentium D processor                 NetBurst             0                 15       0                3, 4, 6
Pentium processor Extreme Edition   NetBurst             0                 15       0                3, 4, 6
Intel Core Duo processor            Improved Pentium M   0                 6        0                14

Shared Cache in Intel Core Duo Processors

The Intel Core Duo processor has two symmetric cores that share the second-level cache and a single bus interface (see Figure 1-7). Two threads executing on the two cores of an Intel Core Duo processor can take advantage of the shared second-level cache, accessing a single copy of cached data without generating bus traffic.
Load and Store Operations
When an instruction needs to read data from a memory address, the processor looks for it in caches and memory. When an instruction writes data to a memory location (write back) the processor first makes sure
that the cache line that contains the memory location is owned by the first-level data cache of the initiating core (that is, the line is in exclusive or modified state). Then the processor looks for the cache line in the cache and memory sub-systems. The look-ups for the locality of load or store operation are in the following order:
1. First level cache of the initiating core
2. Second-level cache and the first-level cache of the other core
3. Memory
Table 1-5 lists the performance characteristics of generic load and store operations in an Intel Core Duo processor. Numeric values are in terms of processor core cycles.
Table 1-5 Characteristics of Load and Store Operations in Intel Core Duo Processors

Data Locality                              Load Latency           Load Throughput        Store Latency          Store Throughput
1st-level cache (L1)                       3                      1                      2                      1
L1 of the other core in “Modified” state   14 + bus transaction   14 + bus transaction   14 + bus transaction   ~10
2nd-level cache                            14                     <6                     14                     <6
Memory                                     14 + bus transaction   Bus read protocol      14 + bus transaction   Bus write protocol
Throughput is expressed as the number of cycles to wait before the same operation can start again. The latency of a bus transaction is exposed in some of these operations, as indicated by entries containing “+ bus transaction”. On Intel Core Duo processors, a typical bus transaction may take 5.5 bus cycles. For a 667 MHz bus and a core frequency of 2.167 GHz, the total comes to 14 + 5.5 * 2167/(667/4) ≈ 86 core cycles.
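The arithmetic can be checked directly; the snippet below (illustrative only) recomputes the worked example, dividing the bus rate by four because the quad-pumped 667 MHz transfer rate corresponds to a 166.75 MHz bus clock.

#include <stdio.h>

int main(void)
{
    double core_mhz   = 2167.0;  /* 2.167 GHz core            */
    double bus_mhz    = 667.0;   /* quad-pumped transfer rate */
    double bus_cycles = 5.5;     /* typical bus transaction   */

    double core_cycles = 14.0 + bus_cycles * core_mhz / (bus_mhz / 4.0);
    printf("~%.1f core cycles\n", core_cycles);   /* ~85.5, i.e. about 86 */
    return 0;
}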
Sometimes a modified cache line has to be evicted to make room for a new cache line. The modified cache line is evicted in parallel to bringing in new data and does not require additional latency. However,
when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and are within a short time, there is an overall degradation in response time of these cache misses.
For a store operation, reading for ownership must be completed before the data is written to the first-level data cache and the line is marked as modified. Reading for ownership and storing the data happen after instruction retirement and follow the order of retirement. The bus store latency does not affect the store instruction itself. However, several sequential stores may have cumulative latency that can affect performance.

General Optimization Guidelines

This chapter discusses general optimization techniques that can improve the performance of applications running on the Intel Pentium 4, Intel Xeon, and Pentium M processors, as well as on dual-core processors. These techniques take advantage of the microarchitectural features of the IA-32 processor generations described in Chapter 1. Optimization guidelines for 64-bit mode applications are discussed in Chapter 8. Additional optimization guidelines applicable to dual-core processors and Hyper-Threading Technology are discussed in Chapter 7.
This chapter explains the optimization techniques both for those who use the Intel® C++ or Fortran Compiler and for those who use other compilers. The Intel® compiler, which generates code specifically tuned for the IA-32 processor family, provides most of the optimization. For those not using the Intel C++ or Fortran Compiler, the assembly code tuning optimizations may be useful. The explanations are supported by coding examples.

Tuning to Achieve Optimum Performance

The most important factors in achieving optimum processor performance are:
good branch prediction
avoiding memory access stalls
good floating-point performance
instruction selection, including use of SIMD instructions
instruction scheduling (to maximize trace cache bandwidth)
vectorization
The following sections describe practices, tools, coding rules and recommendations associated with these factors that will aid in optimizing the performance on IA-32 processors.

Tuning to Prevent Known Coding Pitfalls

To produce program code that takes advantage of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture, you must avoid the coding pitfalls that limit the performance of the target processor family. This section lists several known pitfalls that can limit performance of Pentium 4 and Intel Xeon processor implementations. Some of these pitfalls, to a lesser degree, also negatively impact Pentium M processor performance (store-to-load-forwarding restrictions, cache-line splits).
Table 2-1 lists coding pitfalls that cause performance degradation in some Pentium 4 and Intel Xeon processor implementations. For every issue, Table 2-1 references a section in this document. The section describes in detail the causes of the penalty and presents a recommended solution. Note that “aligned” here means that the address of the load is aligned with respect to the address of the store.
Table 2-1 Coding Pitfalls Affecting Performance

Factors Affecting Performance: Small, unaligned load after large store
Symptom: Store-forwarding blocked
Example (if applicable): Example 2-12
Section Reference: Store Forwarding, Store-to-Load-Forwarding Restriction on Size and Alignment

Factors Affecting Performance: Large load after small store; Load dword after store dword, store byte; Load dword, AND with 0xff after store byte
Symptom: Store-forwarding blocked
Example (if applicable): Example 2-13, Example 2-14
Section Reference: Store Forwarding, Store-to-Load-Forwarding Restriction on Size and Alignment

Factors Affecting Performance: Cache line splits
Symptom: Access across cache line boundary
Example (if applicable): Example 2-11
Section Reference: Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.

Factors Affecting Performance: Denormal inputs and outputs
Symptom: Slows x87, SSE*, SSE2** floating-point operations
Section Reference: Floating-point Exceptions

Factors Affecting Performance: Cycling more than 2 values of Floating-point Control Word
Symptom: fldcw not optimized
Section Reference: Floating-point Modes

* Streaming SIMD Extensions (SSE)
** Streaming SIMD Extensions 2 (SSE2)

General Practices and Coding Guidelines

This section discusses guidelines derived from the performance factors listed in the “Tuning to Achieve Optimum Performance” section. It also highlights practices that use performance tools.
The majority of these guidelines benefit processors based on the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Some guidelines benefit one microarchitecture more than the other. As a whole, these coding rules enable software to be optimized for the common performance features of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture.
The coding practices recommended under each heading and the bullets under each heading are listed in order of importance.

Use Available Performance Tools

Current-generation compiler, such as the Intel C++ Compiler:
— Set this compiler to produce code for the target processor implementation.
— Use the compiler switches for optimization and/or profile-guided optimization. These features are summarized in the “Intel® C++ Compiler” section. For more detail, see the Intel® C++ Compiler User’s Guide.
Current-generation performance monitoring tools, such as VTune™ Performance Analyzer:
— Identify performance issues using event-based sampling, the code coach and other analysis resources.
— Measure workload characteristics such as instruction throughput, data traffic locality, memory traffic characteristics, etc.
— Characterize the performance gain.

Optimize Performance Across Processor Generations

Use a cpuid dispatch strategy to deliver optimum performance for all processor generations.
Use the deterministic cache parameter leaf of cpuid to deliver scalable performance that is transparent across processor families with different cache sizes.
Use a compatible code strategy to deliver optimum performance for the current generation of the IA-32 processor family and future IA-32 processors.
Use a low-overhead threading strategy so that a multi-threaded application delivers optimal multi-processor scaling performance when executing on processors that have hardware multi-threading support, or delivers nearly identical single-processor scaling when executing on a processor without hardware multi-threading support.

Optimize Branch Predictability

Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken (see the sketch after this list).
Avoid mixing near calls, far calls and returns.
Avoid implementing a call by pushing the return address and jumping to the target. The hardware can pair up call and return instructions to enhance predictability.
Use the pause instruction in spin-wait loops.
Inline functions according to coding recommendations.
Whenever possible, eliminate branches.
Avoid indirect calls.
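A minimal C sketch of arranging code for the static prediction assumption, using the GCC-style __builtin_expect hint so the compiler keeps the rare error path on a forward, not-taken branch; handle_error and do_work are hypothetical routines.

int handle_error(int status);   /* hypothetical: rare error path       */
int do_work(void);              /* hypothetical: common fall-through   */

int process(int status)
{
    if (__builtin_expect(status != 0, 0)) {
        /* unlikely path: laid out so the forward branch is not taken
           in the common case */
        return handle_error(status);
    }
    return do_work();   /* common case falls through */
}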

Optimize Memory Access

Observe store-forwarding constraints.
Ensure proper data alignment to prevent data from being split across a cache line boundary. This includes the stack and passed parameters.
Avoid mixing code and data (self-modifying code).
Choose data types carefully (see next bullet below) and avoid type casting.
Employ data structure layout optimization to ensure efficient use of 64-byte cache line size.
Favor parallel data access to mask latency over data accesses with dependency that expose latency.
For cache-miss data traffic, favor smaller cache-miss strides to avoid frequent DTLB misses.
Use prefetching appropriately.
Use the following techniques to enhance locality: blocking, hardware-friendly tiling, loop interchange, loop skewing.
Minimize use of global variables and pointers.
Use the const modifier; use the static modifier for global variables.
Use new cacheability instructions and memory-ordering behavior.

Optimize Floating-point Performance

Avoid exceeding representable ranges during computation, since handling these cases can have a performance impact. Do not use a larger precision format (double-extended floating point) unless required, since this increases memory size and bandwidth utilization.
Use FISTTP to avoid changing the rounding mode when possible, or use optimized fldcw; avoid changing floating-point control/status registers (rounding modes) between more than two values.
Use efficient conversions, such as those that implicitly include a rounding mode, in order to avoid changing control/status registers.
Take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE) and of Streaming SIMD Extensions 2 (SSE2) instructions. Enable flush-to-zero mode and DAZ mode when using SSE and SSE2 instructions (see the sketch after this list).
Avoid denormalized input values, denormalized output values, and explicit constants that could cause denormal exceptions.
Avoid excessive use of the fxch instruction.
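A minimal sketch of enabling these modes, assuming the _MM_SET_FLUSH_ZERO_MODE and _MM_SET_DENORMALS_ZERO_MODE macros from xmmintrin.h and pmmintrin.h; note that both modes trade IEEE-754 conformance on denormals for speed, so they are inappropriate where strict conformance is required.

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

void enable_ftz_daz(void)
{
    /* Flush-to-zero: denormal results of SSE/SSE2 arithmetic become 0. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    /* Denormals-are-zero: denormal source operands are treated as 0. */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}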

Optimize Instruction Selection

Focus instruction selection on the path length of a sequence of instructions rather than on individual instruction selections; minimize the number of µops and the data/register dependencies in aggregate over the path length, and maximize retirement throughput.
Avoid longer-latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies).
Use the lea instruction and the full range of addressing modes to do address calculation.
Some types of stores use more µops than others; try to use simpler store variants and/or reduce the number of stores.
Avoid use of complex instructions that require more than 4 µops.
Avoid instructions that unnecessarily introduce dependence-related stalls: inc and dec instructions, partial register operations (8/16-bit operands).
Avoid use of ah, bh, and the other higher 8-bits of the 16-bit registers, because accessing them requires a shift operation internally.
Use xor and pxor instructions to clear registers and break dependencies for integer operations; also use xorps and xorpd to clear XMM registers for floating-point operations.
Use efficient approaches for performing comparisons.

Optimize Instruction Scheduling

Consider latencies and resource constraints.
Calculate store addresses as early as possible.

Enable Vectorization

Use the smallest possible data type. This enables more parallelism with the use of a longer vector (see the sketch after this list).
Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. It is especially important to avoid the case where the store of data in an earlier iteration happens lexically after the load of that data in a future iteration (called lexically-backward dependence).
Avoid the use of conditionals.
Keep induction (loop) variable expressions simple.
Avoid using pointers; try to replace pointers with arrays and indices.
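As an illustrative example (assumed here, not from the original text), the following loop satisfies these guidelines: 16-bit elements allow eight operations per 128-bit SSE2 register, the iterations carry no inter-iteration dependency, and arrays with an index replace pointer arithmetic.

void add_arrays(short *c, const short *a, const short *b, int n)
{
    /* Independent iterations over the smallest adequate data type:
       a vectorizing compiler can emit PADDW on eight elements at a time. */
    for (int i = 0; i < n; i++)
        c[i] = (short)(a[i] + b[i]);
}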

Coding Rules, Suggestions and Tuning Hints

This chapter includes rules, suggestions and hints. They are maintained in separately-numbered lists and are targeted for engineers who are:
modifying the source to enhance performance (user/source rules)
writing assembly or compilers (assembly/compiler rules)
doing detailed performance tuning (tuning suggestions)
Coding recommendations are ranked in importance using two measures:
Local impact (referred to as “impact”) is the difference that a recommendation makes to performance for a given instance, with the impact’s priority marked as: H = high, M = medium, L = low.
Generality measures how frequently such instances occur across all application domains, with the frequency marked as: H = high, M = medium, L = low.
These rules are very approximate. They can vary depending on coding style, application domain, and other factors. The purpose of including high, medium and low priorities with each recommendation is to provide some hints as to the degree of performance gain that one can expect if a recommendation is implemented.
Because it is not possible to predict the frequency of occurrence of a code instance in applications, priority hints cannot be directly correlated to application-level performance gain. However, in important cases where application-level performance gain has been observed, a more quantitative characterization of application-level performance gain is provided for information only (see: “Store-to-Load-Forwarding Restriction on Size and Alignment” and “Instruction Selection” in this document). In places where no priority is assigned, the impact has been deemed inapplicable.

Performance Tools

Intel offers several tools that can facilitate optimizing your application’s performance.

Intel® C++ Compiler

Use the Intel C++ Compiler following the recommendations described here. The Intel Compiler’s advanced optimization features provide good performance without the need to hand-tune assembly code. However, the following features may enhance performance even further:
Inlined assembly
Intrinsics, which have a one-to-one correspondence with assembly language instructions but allow the compiler to perform register allocation and instruction scheduling. Refer to the “Intel C++ Intrinsics Reference” section of the Intel® C++ Compiler User’s Guide.
C++ class libraries. Refer to the “Intel C++ Class Libraries for SIMD Operations Reference” section of the Intel® C++ Compiler User’s Guide.
Vectorization in conjunction with compiler directives (pragmas). Refer to the “Compiler Vectorization Support and Guidelines” section of the Intel® C++ Compiler User’s Guide.
The Intel C++ Compiler can generate an executable which uses features such as Streaming SIMD Extensions 2. The executable will maximize performance on the current generation of IA-32 processor family (for example, a Pentium 4 processor) and still execute correctly on older processors. Refer to the “Processor Dispatch Support” section in the Intel® C++ Compiler User’s Guide.

General Compiler Recommendations

A compiler that has been extensively tuned for the target microarchitecture can be expected to match or outperform hand-coding in a general case. However, if particular performance problems are noted with the compiled code, some compilers (like the Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to exert greater control over what code is generated. If inline assembly is used, the user should verify that the code generated to integrate the inline assembly is of good quality and yields good overall performance.
Default compiler switches are targeted for the common case. An optimization may be made to the compiler default if it is beneficial for most programs. If a performance problem is root-caused to a poor choice on the part of the compiler, using different switches or compiling the targeted module with a different compiler may be the solution.

VTune Performance Analyzer

Where performance is a critical concern, use performance monitoring hardware and software tools to tune your application and its interaction with the hardware. IA-32 processors have counters which can be used to monitor a large number of performance-related events for each microarchitecture. The counters also provide information that helps resolve the coding pitfalls.
The VTune Performance Analyzer allows engineers to use these counters to obtain two kinds of tuning feedback:
indication of a performance improvement gained by using a specific coding recommendation or microarchitectural feature
information on whether a change in the program has improved or degraded performance with respect to a particular metric
The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload characteristics, including:
retirement throughput of instruction execution, as an indication of the degree of extractable instruction-level parallelism in the workload
data traffic locality, as an indication of the stress point of the cache and memory hierarchy
data traffic parallelism, as an indication of the degree of effectiveness of amortization of data access latency
Note that improving performance in one part of the machine does not necessarily bring significant gains to overall performance. It is possible to degrade overall performance by improving performance for some particular metric.
Where appropriate, coding recommendations in this chapter include descriptions of the VTune analyzer events that provide measurable data of performance gain achieved by following recommendations. Refer to the VTune analyzer online help for instructions on how to use the tool.
VTune analyzer events include the Pentium 4 processor performance metrics described in Appendix B, “Using Performance Monitoring Events.”

Processor Perspectives

The majority of the coding recommendations for the Pentium 4 and Intel Xeon processors also apply to Pentium M, Intel Core Solo, and Intel Core Duo processors. However, there are situations where a recommendation may benefit one microarchitecture more than the other. The most important of these are:
Instruction decode throughput is important for the Pentium M, Intel Core Solo, and Intel Core Duo processors but less important for the Pentium 4 and Intel Xeon processors. Generating code with the 4-1-1 template (an instruction with four μops followed by two instructions with one μop each) helps the Pentium M processor. Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. The practice has no real impact on processors based on the Intel NetBurst microarchitecture.
Dependencies for partial register writes incur large penalties when using the Pentium M processor (this applies to processors with CPUID signature family 6, model 9). On Pentium 4, Intel Xeon processors, the Pentium M processor (with CPUID signature family 6, model 13), and Intel Core Solo and Intel Core Duo processors, such penalties are resolved by artificial dependencies between each partial register write. To avoid false dependences from partial register updates, use full register updates and extended moves.
On Pentium 4 and Intel Xeon processors, some latencies have increased: shifts, rotates, integer multiplies, and moves from memory with sign extension are longer than before. Use care when using the lea instruction. See the section “Use of the lea Instruction” for recommendations.
The inc and dec instructions should always be avoided. Using add and sub instructions instead avoids data dependence and improves performance.
Dependence-breaking support is added for the pxor instruction.
Floating point register stack exchange instructions were free; now they are slightly more expensive due to issue restrictions.
Writes and reads to the same location should now be spaced apart. This is especially true for writes that depend on long-latency instructions.
Hardware prefetching may shorten the effective memory latency for data and instruction accesses.
Cacheability instructions are available to streamline stores and manage cache utilization.
Cache lines are 64 bytes (see Table 1-1 and Table 1-3). Because of this, software prefetching should be done less often. False sharing, however, can be an issue.
On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache. On Pentium M processors, the code size limit is governed by the instruction cache.
There may be a penalty when instructions with immediates requiring more than 16-bit signed representation are placed next to other instructions that use immediates.
Note that memory-related optimization techniques for alignments, complying with store-to-load-forwarding restrictions and avoiding data splits help Pentium 4 processors as well as Pentium M processors.

CPUID Dispatch Strategy and Compatible Code Strategy

Where optimum performance on all processor generations is desired, applications can take advantage of the cpuid instruction to identify the processor generation and integrate processor-specific instructions (such as SSE2 instructions) into the source code. The Intel C++ Compiler supports the integration of different versions of the code for different target processors. The selection of which code to execute at runtime is made based on the CPU identifier that is read with cpuid. Binary code targeted for different processor generations can be generated under the control of the programmer or by the compiler.
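As a sketch of such a runtime dispatch in assembly (SSE2 support is reported in EDX bit 26 of cpuid function 1; the labels are illustrative):

    ; Select between an SSE2 path and a generic path at runtime.
    mov   eax, 1            ; cpuid function 1: feature flags
    cpuid
    test  edx, 04000000h    ; EDX bit 26 = SSE2 support
    jnz   sse2_path         ; illustrative label
    jmp   generic_path      ; illustrative label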
For applications that run on both the Intel Pentium 4 and Pentium M processors, and where minimum binary code size and a single code path are important, a compatible code strategy is best. Optimizing applications for the Intel NetBurst microarchitecture is likely to improve code efficiency and scalability when running on processors based on current and future generations of IA-32 processors. This approach to optimization is also likely to deliver high performance on Pentium M processors.

Transparent Cache-Parameter Strategy

If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, the leaf reports detailed cache parameters for each level of the cache hierarchy in a deterministic and forward-compatible manner across current and future IA-32 processor families. See the CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2B.
For coding techniques that rely on specific parameters of a cache level, using the deterministic cache parameter leaf allows software to implement such techniques in a manner that is forward-compatible with future generations of IA-32 processors and cross-compatible with processors equipped with different cache sizes.
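A minimal sketch of enumerating the leaf, assuming the documented convention that the sub-leaf in ECX selects a cache index and that a cache type of zero in EAX[4:0] ends the enumeration (labels are illustrative):

    ; Walk the deterministic cache parameter leaf (function 4).
    xor   esi, esi        ; esi = cache index
next_cache:
    mov   eax, 4          ; cpuid function 4
    mov   ecx, esi        ; sub-leaf selects the cache index
    cpuid
    and   eax, 1Fh        ; EAX[4:0] = cache type; 0 = no more caches
    jz    done
    ; EAX[7:5] holds the cache level; EBX[11:0] + 1 is the
    ; system coherency line size.
    inc   esi
    jmp   next_cache
done: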

Threading Strategy and Hardware Multi-Threading Support

Current IA-32 processor families offer hardware multi-threading support in two forms: dual-core technology and Hyper-Threading Technology. The trend for future IA-32 processors will continue in the direction of multi-core technology.
To fully harness the performance potential of the hardware multi-threading capabilities in current and future generations of IA-32 processors, software must embrace a threaded approach in application design. At the same time, to address the widest installed base of machines, multi-threaded software should run without failure on a single processor without hardware multi-threading support, and it should achieve performance on a single logical processor comparable to that of an unthreaded implementation, where such a comparison can be made. This generally requires architecting a multi-threaded application to minimize the overhead of thread synchronization. Additional software optimization guidelines on multi-threading are discussed in Chapter 7.

Branch Prediction

Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly.
Optimizations that help branch prediction are:
• Keep code and data on separate pages (a very important item; see more details in the “Memory Accesses” section).
• Whenever possible, eliminate branches.
• Arrange code to be consistent with the static branch prediction algorithm.
• Use the pause instruction in spin-wait loops.
• Inline functions and pair up calls and returns.
• Unroll as necessary so that repeatedly executed loops have sixteen or fewer iterations, unless this causes an excessive code size increase.
• Separate branches so that they occur no more frequently than every three μops where possible.

Eliminating Branches

Eliminating branches improves performance because it:
• reduces the possibility of mispredictions
• reduces the number of required branch target buffer (BTB) entries; conditional branches that are never taken do not consume BTB resources
There are four principal ways of eliminating branches:
• arrange code to make basic blocks contiguous
• unroll loops, as discussed in the “Loop Unrolling” section
• use the cmov instruction
• use the setcc instruction
Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.
For the Pentium M processor, every branch counts: even correctly predicted branches have a negative effect on the amount of useful code delivered to the processor. Also, taken branches consume space in the branch prediction structures, and extra branches create pressure on the capacity of the structures.
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the setcc and cmov instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches, because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch. In addition, converting conditional branches to cmovs or setcc trades control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
Consider a line of C code that has a condition dependent upon one of the constants:

X = (A < B) ? CONST1 : CONST2;

This code conditionally compares two values, A and B. If the condition is true, X is set to CONST1; otherwise it is set to CONST2. An assembly code sequence equivalent to the above C code can contain branches that are not predictable if there is no correlation between the two values.
Example 2-1 shows the assembly code with unpredictable branches. The unpredictable branches in Example 2-1 can be removed with the use of the setcc instruction. Example 2-2 shows optimized code that does not have branches.
Example 2-1 Assembly Code with an Unpredictable Branch

    cmp  A, B          ; condition
    jge  L30           ; conditional branch
    mov  ebx, CONST1   ; ebx holds X
    jmp  L31           ; unconditional branch
L30:
    mov  ebx, CONST2
L31:
Example 2-2 Code Optimization to Eliminate Branches

    xor   ebx, ebx      ; clear ebx (X in the C code)
    cmp   A, B
    setge bl            ; ebx = 0 or 1
                        ; (or use setl for the complement condition)
    sub   ebx, 1        ; ebx = 11...11 or 00...00
    and   ebx, CONST3   ; CONST3 = CONST1 - CONST2
    add   ebx, CONST2   ; ebx = CONST1 or CONST2
The optimized code in Example 2-2 sets ebx to zero, then compares A and B. If A is greater than or equal to B, ebx is set to one. Then ebx is decreased and “and-ed” with the difference of the constant values. This sets ebx to either zero or the difference of the values. By adding CONST2 back to ebx, the correct value is written to ebx. When CONST2 is equal to zero, the last instruction can be deleted.
Another way to remove branches on Pentium II and subsequent processors is to use the cmov and fcmov instructions. Example 2-3 shows how to change a test-and-branch instruction sequence by using cmov to eliminate a branch. If the test sets the equal flag, the value in ebx will be moved to eax. This branch is data-dependent, and is representative of an unpredictable branch.
Example 2-3 Eliminating Branch with CMOV Instruction

    test  ecx, ecx
    jne   1h
    mov   eax, ebx
1h:
    ; To optimize code, combine jne and mov into one cmovcc
    ; instruction that checks the equal flag
    test  ecx, ecx     ; test the flags
    cmove eax, ebx     ; if the equal flag is set, move
                       ; ebx to eax - the 1h: label is no
                       ; longer needed
The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the cpuid instruction.
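A sketch of that check (cpuid function 1 reports the CMOV feature in EDX bit 15; fcmov additionally requires the FPU feature bit; the label is illustrative):

    ; Test for CMOV support before using cmov/fcmov.
    mov   eax, 1          ; cpuid function 1: feature flags
    cpuid
    test  edx, 8000h      ; EDX bit 15 = CMOV
    jz    no_cmov_path    ; illustrative label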

Spin-Wait and Idle Loops

The Pentium 4 processor introduces a new pause instruction; the instruction is architecturally a nop on all IA-32 implementations. To the Pentium 4 processor, this instruction acts as a hint that the code sequence is a spin-wait loop. Without a pause instruction in such loops, the Pentium 4 processor may suffer a severe penalty when exiting the loop because the processor may detect a possible memory order violation. Inserting the pause instruction significantly reduces the likelihood of a memory order violation and as a result improves performance.
In Example 2-4, the code spins until memory location A matches the value stored in the register eax. Such code sequences are common when protecting a critical section, in producer-consumer sequences, for barriers, or for other synchronization.
Example 2-4 Use of pause Instruction

lock:   cmp  eax, A
        jne  loop
        ; code in critical section:
loop:   pause
        cmp  eax, A
        jne  loop
        jmp  lock

Static Prediction

Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm. The Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms:
• Predict unconditional branches to be taken.
• Predict indirect branches to be NOT taken.
In addition, conditional branches in processors based on the Intel NetBurst microarchitecture are predicted using the following static prediction algorithm:
• Predict backward conditional branches to be taken. This rule is suitable for loops.
• Predict forward conditional branches to be NOT taken.
Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at their first appearance.
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
Example 2-5 illustrates the static branch prediction algorithm. The body of an if-then conditional is predicted to be executed.
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm

[Figure: forward conditional branches, such as the body of an if <condition> block, are not taken (fall through); backward conditional branches, such as the branch closing a for or loop construct, are taken; unconditional branches (JMP) are taken.]
Example 2-6 and Example 2-7 provide basic rules for the static prediction algorithm.
In Example 2-6, the backward branch (JC Begin) is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not occur.
Example 2-6 Static Taken Prediction Example

Begin:  mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, edx, 7
        jc   Begin
The first branch instruction (JC Begin) in Example 2-7 is a conditional forward branch. It is not in the BTB the first time through, but the static predictor will predict the branch to fall through.
The static prediction algorithm correctly predicts that the call Convert instruction will be taken, even before the branch has any branch history in the BTB.
Example 2-7 Static Not-Taken Prediction Example

        mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, edx, 7
        jc   Begin
        mov  eax, 0
Begin:  call Convert

Inlining, Calls and Returns

The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may be degraded.
The trace cache maintains branch prediction information for calls and returns. As long as the trace with the call or return remains in the trace cache and if the call and return targets remain unchanged, the depth limit of the return address stack described above will not impede performance.
To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low.
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns.
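A sketch of the mismatch this rule warns against (routine and label names are illustrative):

    ; Avoid: simulating a call by pushing a return address and
    ; jumping. The ret inside my_proc is not paired with a
    ; call, so the return stack buffer will mispredict it.
    push  offset return_here
    jmp   my_proc
return_here:

    ; Preferred: a matched near call / near return pair.
    call  my_proc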
Calls and returns are expensive; use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch prediction within the routine.
• A mispredicted branch can lead to larger performance penalties inside a small function than if that function is inlined.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function where doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession, consider transforming the program with inlining to reduce the call depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is a call to another function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the return stack buffer.
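A minimal sketch of this tail-call conversion (routine names are illustrative):

Before:

wrapper:
    ; ...
    call  worker       ; the call is immediately followed by ret
    ret

After:

wrapper:
    ; ...
    jmp   worker       ; worker's ret now returns directly to
                       ; wrapper's caller, saving the call/return
                       ; overhead and a return stack buffer entry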
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end-loop branches in a 16-byte chunk.

Branch Type Selection

The default predicted target for indirect branches and calls is the fall-through path. The fall-through prediction is overridden if and when a hardware prediction is available for that branch. The predicted branch target from branch prediction hardware for an indirect branch is the previously executed branch target.
The default prediction to the fall-through path is only a significant issue if no branch prediction is available, due to poor code locality or pathological branch conflict problems. For indirect calls, predicting the fall-through path is usually not an issue, since execution will likely return to the instruction after the associated return.
Placing data immediately following an indirect branch can cause a performance problem. If the data consist of all zeros, they look like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 12. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
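A sketch of the second alternative in the rule (the register-indexed jump table is illustrative):

    ; Indirect branch with targets the hardware cannot predict.
    jmp   dword ptr [jump_table + eax*4]
    ud2   ; stops the processor from decoding down the
          ; fall-through path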
Indirect branches resulting from code constructs such as switch statements, computed GOTOs, or calls through pointers can jump to an arbitrary number of locations. If the code sequence is such that the target destination of a branch goes to the same address most of the time, then the BTB will predict accurately most of the time. Since only one taken (non-fall-through) target can be stored in the BTB, indirect branches with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing additional conditional branches. Adding a conditional branch to a target is fruitful if and only if:
• The branch direction is correlated with the branch history leading up to that branch; that is, not just the last target, but how it got to this branch.
• The source/target pair is common enough to warrant using the extra branch prediction capacity. (This may increase the number of overall branch mispredictions, while improving the misprediction rate of indirect branches. The profitability is lower if the number of mispredicting branches is very large.)
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets, and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches, even at the expense of adding more branches. The added branches must be very predictable for this to be worthwhile. One reason for such predictability is a strong correlation with preceding branch history; that is, the directions taken on preceding branches are a good indicator of the direction of the branch under consideration.
Example 2-8 shows a simple example of the correlation between a target of a preceding conditional branch and a target of an indirect branch.
Example 2-8 Indirect Branch With Two Favored Targets

function ()
{
    int n = rand();       // random integer 0 to RAND_MAX
    if ( !(n & 0x01) ) {  // n will be 0 half the time
        n = 0;            // updates branch history to predict taken
    }
    // indirect branches with multiple taken targets
    // may have lower prediction rates
    switch (n) {
        case 0: handle_0(); break;  // common target, correlated with
                                    // branch history that is forward taken
        case 1: handle_1(); break;  // uncommon
        case 3: handle_3(); break;  // uncommon
        default: handle_other();    // common target
    }
}
Correlation can be difficult to determine analytically, either for a compiler or for an assembly language programmer. It may be fruitful to evaluate performance with and without this peeling to get the best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9.
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction

function ()
{
    int n = rand();      // random integer 0 to RAND_MAX
    if ( !(n & 0x01) )
        n = 0;           // n will be 0 half the time
    if (!n)
        handle_0();      // peel out the most common target
                         // with correlated branch history
    else {
        switch (n) {
            case 1: handle_1(); break;  // uncommon
            case 3: handle_3(); break;  // uncommon
            default: handle_other();    // make the favored target
                                        // the fall-through path
        }
    }
}

Loop Unrolling

The benefits of unrolling loops are:
• Unrolling amortizes the branch overhead, since it eliminates branches and some of the code to manage induction variables.
• Unrolling allows you to aggressively schedule (or pipeline) the loop to hide latencies. This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path.
• Unrolling exposes the code to various other optimizations, such as removal of redundant loads, common subexpression elimination, and so on.
The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive and the probable number of iterations is known, unroll inner loops until they have a maximum of 16 iterations. With the Pentium M processor, do not unroll loops more than 64 iterations.
The potential costs of unrolling loops are:
• Excessive unrolling, or unrolling of very large loops, can lead to increased code size. This can be harmful if the unrolled loop no longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demands on BTB capacity. If the number of iterations of the unrolled loop is 16 or fewer, the branch predictor should be able to correctly predict branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll small loops until the overhead of the branch and the induction variable accounts, generally, for less than about 10% of the execution time of the loop.
Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid unrolling loops excessively, as this may thrash the trace cache or instruction cache.
Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll loops that are frequently executed and that have a predictable number of iterations to reduce the number of iterations to 16 or fewer, unless this increases code size so that the working set no longer fits in the trace cache or instruction cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).
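For instance, under this rule a loop whose body contains two conditional branches would be unrolled to run for at most 16/2 = 8 iterations.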
Example 2-10 shows how unrolling enables other optimizations.
Example 2-10 Loop Unrolling

Before unrolling:

do i = 1, 100
    if (i mod 2 == 0) then
        a(i) = x
    else
        a(i) = y
enddo

After unrolling:

do i = 1, 100, 2
    a(i) = y
    a(i+1) = x
enddo
In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. By unrolling the loop you can make both assignments in each iteration, removing one branch in the loop body.

Compiler Support for Branch Prediction

Compilers can generate code that improves the efficiency of branch prediction in the Pentium 4 and Pentium M processors. The Intel C++ Compiler accomplishes this by:
• keeping code and data on separate pages
• using conditional move instructions to eliminate branches
• generating code that is consistent with the static branch prediction algorithm
• inlining where appropriate
• unrolling, if the number of iterations is predictable
With profile-guided optimization, the Intel compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see the Intel® C++ Compiler User’s Guide.