THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY,
FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR
SAMPLE.
Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Pentium, Itanium and IA-32 architecture processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's web site at http://www.intel.com.
Intel, Itanium, Pentium, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Revision History

Revision Number    Description                                                                        Date
-001               Public release of the document.                                                    June 2002
-002               Refresh to incorporate new Itanium® 2 processor with up to 6M L3 cache models.     April 2003
-003               Refresh to incorporate new Itanium® 2 processor with up to 9M L3 cache models.     May 2004
1 About this Manual

1.1 Overview

The Intel® Itanium® 2 processor is the second implementation of the Intel® Itanium® architecture. There have now been three generations of the Itanium 2 processor, which can be identified by their unique CPUID model values. For simplicity of documentation, throughout this document we will group all processors of like model together. Table 1-1 lists the varieties of the Itanium 2 processor that are available, along with their grouping.
Table 1-1. Definition Table

Processor                                                                          Abbreviation

Intel® Itanium® 2 Processor 900 MHz with 1.5 MB L3 Cache                           Itanium 2 Processor (up to 3MB L3 cache)
Intel® Itanium® 2 Processor 1.0 GHz with 3 MB L3 Cache

Low Voltage Intel® Itanium® 2 Processor 1.0 GHz with 1.5 MB L3 Cache               Itanium 2 Processor (up to 6MB L3 cache)
Intel® Itanium® 2 Processor 1.40 GHz with 1.5 MB L3 Cache
Intel® Itanium® 2 Processor 1.40 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.60 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.30 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.40 GHz with 4 MB L3 Cache
Intel® Itanium® 2 Processor 1.50 GHz with 6 MB L3 Cache

Low Voltage Intel® Itanium® 2 Processor 1.20 GHz with 3 MB L3 Cache                Itanium 2 Processor (up to 9MB L3 cache)
Intel® Itanium® 2 Processor 1.60 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.60 GHz with 3 MB L3 Cache for 533MHz DP Platforms
Intel® Itanium® 2 Processor 1.50 GHz with 4 MB L3 Cache
Intel® Itanium® 2 Processor 1.60 GHz with 6 MB L3 Cache
Intel® Itanium® 2 Processor 1.70 GHz with 9 MB L3 Cache
The Itanium 2 processors with up to 9 MB L3 cache will have varieties capable of running with system bus speeds of 400 MHz, 533 MHz, and 667 MHz. For complete details on the current offerings please refer to the datasheets at http://developer.intel.com/design/Itanium2/.

This document describes how the Itanium 2 processor implements features of the Itanium architecture, as well as specific features of the Itanium 2 processor that are relevant to performance tuning, compilation, and assembler programming. Unless otherwise stated, all of the restrictions, rules, sizes, and capacities described in this document apply specifically to the Itanium 2 processor and may not apply to other implementations of the Itanium architecture.
General understanding of processor components and explicit familiarity with Itanium instructions are assumed. This document is not intended to be used as an architectural reference for the Itanium architecture. For more information on the Itanium architecture, consult the Intel® Itanium® Architecture Software Developer's Manual.
1.2 Contents

Chapter 2, "Itanium® 2 Processor Enhancements" compares the Itanium processor and the Itanium 2 processor, highlighting some of the considerations that should be taken when optimizing for the Itanium 2 processor.

Chapter 3, "Functional Units and Issue Rules" describes the number and type of available functional units, instruction issue rules, and heuristics for efficient instruction scheduling based upon machine resources and issue rules.

Chapter 4, "Latencies and Bypasses" describes latencies and bypasses for execution of the different instruction types on the Itanium 2 processor.

Chapter 5, "Data Operations" describes considerations for data operations such as speculative or predicated loads or stores, floating-point loads, and prefetches. Data alignment considerations are also discussed.

Chapter 6, "Memory Subsystem" provides an overview of the memory subsystem hierarchy on the Itanium 2 processor.
Chapter 7, “Branch Instructions and Branch Prediction” describes how hints for branch prediction
and instruction prefetch are implemented on the Itanium 2 processor.
Chapter 8, “Instruction Prefetching” describes how prefetching is implemented on the Itanium 2
processor.
Chapter 9, "Optimizing for the Itanium® 2 Processor" is a summary that draws conclusions from important points noted in earlier chapters.

Chapter 10, "Performance Monitoring" discusses performance monitoring registers and implementations specific to the Itanium 2 processor.

Chapter 11, "Performance Monitor Events" summarizes the Itanium 2 processor events and describes how to compute commonly used performance metrics.

Chapter 12, "Model-Specific and Optional Features" discusses Itanium 2 processor model-specific behavior, such as executing CPUID instructions.
1.3 Terminology

The following definitions are for terms that will be used throughout this document:

Dispersal                              The process of mapping instructions within bundles to functional units.
Bundle rotation                        The process of bringing new bundles into the two-bundle issue window.
Split issue                            Instruction execution when an instruction does not issue at the same time as the instruction immediately before it.
Advanced load address table (ALAT)     The ALAT holds the state necessary for advanced load and check operations.
Translation lookaside buffer (TLB)     The TLB holds virtual to physical mappings.
Virtual hash page table (VHPT)         The VHPT is an extension of the TLB hierarchy, which resides in the virtual memory space and is designed to enhance virtual address translation performance.
Hardware page walker (HPW)             The HPW is the third level of address translation. It is an engine that performs page look-ups from the VHPT and seeks opportunities to insert translations into the processor TLBs.
Register stack engine (RSE)            The RSE moves registers between the register stack and the backing store in memory.
Event address registers (EARs)         The EARs record the instruction and data addresses of data cache misses.
1.4 Related Documentation

The reader of this document should also be familiar with the material and concepts presented in the following documents:

• Intel® Itanium® Architecture Software Developer's Manual, Volume 1: Application Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 2: System Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 3: Instruction Set Reference
2 Itanium® 2 Processor Enhancements

This chapter outlines the major differences between the Itanium 2 processor and the Itanium processor. This is not an exhaustive list, so a reference to more details accompanies each topic.
2.1 Implemented Instructions

The Itanium 2 processor implements the 64-bit long branch (brl) instruction directly in hardware. This instruction was not implemented in the Itanium processor. It allows programmers to direct a branch to an address that uses all 64 address bits. Details on the brl instruction can be found in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual. There are some branch prediction performance implications associated with the brl instruction which are noted in Chapter 7, "Branch Instructions and Branch Prediction."
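For illustration, a minimal sketch of a long branch follows; the target label far_away_target is hypothetical and the example is not taken from the manual:

    // brl occupies an L+X slot pair (MLX template); its 64-bit IP-relative
    // displacement allows the target to be anywhere in the 64-bit address space.
    brl.cond.sptk.many  far_away_target ;;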
2.2 Functional Units and Issue Rules

In general, the Itanium 2 processor has more functional units than the Itanium processor.

• In particular, the Itanium 2 processor has six arithmetic logic units (ALUs) to perform arithmetic operations, compares, most multimedia instructions, etc. The Itanium processor can only issue four of these types of instructions per cycle.
• The Itanium 2 processor has four memory ports allowing two integer loads and two integer stores per cycle. The Itanium processor has two memory ports.
• The Itanium 2 processor can issue one SIMD floating-point (FP) instruction per cycle. The Itanium processor can issue two SIMD FP instructions per cycle.
• Under certain conditions, the Itanium 2 processor can issue I-type instructions to memory functional units, thus increasing the number of template pair types which can be issued in one cycle. For the Itanium processor, I-type instructions will only be issued to integer functional units.
• The Itanium 2 processor scoreboards multi-cycle operations such as first-level data cache (L1D) misses, multimedia, and floating-point operations.
This means that when an integer operation uses the result of a multimedia operation and the integer operation is not scheduled to cover the latency, the dependent instruction group will wait until the multimedia data is available.
A predicated off operation, with a use of a scoreboarded operand, will stall the issue group for one cycle if the predicate was generated in the previous cycle. A predicated off instruction with predicates generated two or more cycles earlier will not incur pipeline stalls even when operands are scoreboarded.
2.3 Operation Latencies

On the Itanium 2 processor, most latencies are the same as or shorter than on the Itanium processor, with a few exceptions; e.g., memory latencies are shorter and floating-point latencies are shorter. A few more bypasses exist which remove some asymmetries. Table 2-1, "Itanium® 2/Itanium Processors Operation Latencies" shows latencies for both the Itanium 2 processor and the Itanium processor. The areas of difference are indicated by non-shaded boxes. The two different latency numbers are separated by a forward slash ('/'). When reading from left to right, the first latency number corresponds to the Itanium 2 processor and the second number corresponds to the Itanium processor.
2.4 Data Operations

2.4.1 Data Speculation and the ALAT

The Itanium 2 processor advanced load address table (ALAT) is fully associative while the Itanium processor ALAT is two-way associative.

On the Itanium processor, a ld.c which misses the ALAT causes a 10-cycle pipeline flush. On the Itanium 2 processor, the penalty is 8 cycles.

On the Itanium processor, if a chk.a, chk.s, or fchkf fails, an operating system (OS) handler will be invoked through a trap handler to steer execution to the recovery code at the location specified in the target field of the chk.a/chk.s/fchkf instruction. On the Itanium 2 processor, hardware will usually perform the resteer without operating system intervention. This reduces the resteer cost from approximately 200 cycles to 18 cycles. If any of the following conditions are not met, the Itanium 2 processor will trap to the OS to service the chk.a/chk.s/fchkf:

psr.ic = 1
psr.it = 1
psr.ss = 0
psr.tb = 0

If a chk.a follows a store within the same cycle, the chk.a will always fail on the Itanium processor. On the Itanium 2 processor, a 12-bit address compare against ALAT entries will occur. See Section 5.1, "Data Speculation and the ALAT" for more details.
2.4.2 Data Alignment

The Itanium processor can support misaligned integer accesses within 16-byte blocks; however, the Itanium 2 processor supports misaligned integer accesses within 8-byte blocks. Section 5.5, "Data Alignment" has greater detail on misaligned access support for the Itanium 2 processor.
Notes for Table 2-1:
1. On the Itanium® processor, the address computation instruction must be in an M-slot type to avoid an extra cycle of latency.
2. N depends upon which level of cache is hit. For the Itanium processor, N=2 for L1D, N=6 for L2, N=21 for L3. For the Itanium 2 processor, N=1 for L1D, N=5 for L2, N=12-15 for L3. These are minimum latencies.
3. M depends upon which level of cache is hit. For the Itanium processor, M=8 for L2 and M=24 for L3. For the Itanium 2 processor, M=5 for L2 and M=12-15 for L3. These are minimum latencies. The "+1" entries indicate one cycle is needed for format conversion.
4. Best-case values of C range from 2 to 35 cycles depending upon registers accessed. EC and LC accesses are 2 cycles. FPSR and CR accesses are 10-12 cycles.
5. Best-case values of D range from 6 to 35 cycles depending upon indirect registers accessed; Iregs pkr and rr accesses are faster at 6 cycles.
2.4.3 Control Speculation

The Itanium 2 processor implements features intended to increase the performance of applications by decreasing the cost for incorrect control speculation. There are two parts of the solution for the Itanium 2 processor:

• The first part allows speculative load operations (this includes lfetch without the .fault completer) to abort and set a NaT bit at the time of a data translation lookaside buffer (TLB) miss. In contrast, the Itanium processor would wait for the hardware page walker (HPW) operation to complete the walk before setting the NaT bit.
• The second part allows a chk.s instruction (also a fchkf/chk.a instruction) to branch directly to the fix-up code without involving the OS. The Itanium processor faults on a chk.s, chk.a, or fchkf instruction and requests that the OS branch to the fix-up code.

Thus, deferrals on the Itanium 2 processor occur quickly and the branch to fix-up code occurs quickly.
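As a hedged sketch of this fix-up mechanism (register numbers and the labels back/recover are hypothetical, not from the manual):

    ld8.s   r14 = [r33] ;;        // speculative load hoisted above its original location;
                                  // a deferred fault sets the NaT bit of r14
    // ... other work scheduled here ...
    chk.s   r14, recover          // on the Itanium 2 processor this resteers to the
                                  // fix-up code in hardware, without an OS trap
back:
    add     r15 = r14, r10 ;;     // non-speculative use of the loaded value
    // ...
recover:
    ld8     r14 = [r33] ;;        // non-speculative reload raises any deferred fault
    br.cond.sptk  back ;;         // return to the main path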
The deferral at data TLB miss is turned off inside interrupt handlers (when PSR.is = 1), which allows ld.s and lfetch instructions to complete a TLB walk and possibly return data. Clearing the dcr.dm bit will also prevent speculative operations from deferring at data TLB miss. Fast deferral requires the dcr.dm bit to be set. Refer to Section 5.2, "Speculative and Predicated Loads/Stores" for more information.
2.5 Memory Hierarchy

Both the Itanium microarchitecture and the Itanium 2 microarchitecture incorporate a three-level cache structure. In general, line sizes of the Itanium 2 processor are twice as large as those of the Itanium processor. Also, latencies of the Itanium 2 processor are shorter than those of the Itanium processor. The third-level cache (L3) of the Itanium 2 processor is on-chip and runs at a higher core frequency, which results in a much shorter latency. The Itanium 2 processor has a two-level TLB design for both instruction and data, while the Itanium processor has a single-level instruction TLB. The Itanium 2 processor's TLBs are larger. The following tables list some of the differences in caches and TLBs. Details can be found in Chapter 6, "Memory Subsystem."

Table 2-2. L1I Cache Differences
(Table body not reproduced; see Chapter 6, "Memory Subsystem" for the L1I cache and instruction TLB parameters.)
Table 2-7. Data TLB Differences

                                            Hierarchy                    Size                 Associativity    Penalty for Missing First Level DTLB
Itanium® Processor                          2 levels: L1 DTLB, L2 DTLB   32-entry, 96-entry   Direct, Full     10 cycles
Itanium® 2 Processor (up to 3MB L3 cache)   2 levels: L1 DTLB, L2 DTLB   32-entry, 128-entry  Full, Full       2 cycles
Itanium® 2 Processor (up to 6MB L3 cache)   2 levels: L1 DTLB, L2 DTLB   32-entry, 128-entry  Full, Full       2 cycles
Itanium® 2 Processor (up to 9MB L3 cache)   2 levels: L1 DTLB, L2 DTLB   32-entry, 128-entry  Full, Full       2 cycles

2.6 Branch Prediction

The major differences between the Itanium 2 processor and the Itanium processor branch prediction support are:

• Latencies
• brp instructions are ignored for branch prediction, i.e., the brp.imp is not required to achieve zero-bubble branches.
• Indirect branch targets are predicted from the source branch register rather than from a hardware table.
• Possible reduced prediction of BBB bundles due to prediction encoding.
• More robust method for prediction structure repair after a mispredicted return.
• Hardware implementation of the brl (64-bit relative branch) instruction.
• Setting ar.ec = 1 is not required for perfect loop prediction.

Full details can be found in Section 7, "Branch Instructions and Branch Prediction."
Table 2-8. Branch Prediction Latencies (in cycles)

                                                  Itanium® 2 Processor    Itanium® Processor
Correctly Predicted Taken IP-relative Branch      0                       1
Correctly Predicted Taken Indirect Branch         2                       0
Correctly Predicted Taken Return Branch           1                       1
Last Branch in Perfect Loop Prediction            0                       2
Misprediction Latency                             6+                      9
2.7 Instruction Prefetching

The Itanium 2 processor has an improved implementation of streaming and hint prefetching. See Chapter 8, "Instruction Prefetching" for more details.
2.8 IA-32 Execution Layer

IA-32 Execution Layer (IA-32 EL) is a new technology that executes IA-32 applications on Itanium architecture-based systems. Previously, support for IA-32 applications on Itanium architecture-based platforms has been achieved using hardware circuitry on the Itanium 2 processors. IA-32 EL will enhance this capability.

IA-32 EL is a software layer that is currently shipping with Itanium architecture-based operating systems and will convert IA-32 instructions into Itanium instructions via dynamic translation. Further details on operating system support and functionality of IA-32 EL can be found at http://www.intel.com/products/server/processors/server/itanium2/index.htm.
3 Functional Units and Issue Rules

This chapter describes the number and type of available functional units, instruction issue rules, and heuristics for efficient instruction scheduling based upon machine resources and issue rules.
3.1 Execution Model

The Itanium 2 processor issues and executes instructions in assembly order, so programmer understanding of stall conditions is essential for generating high performance assembly code.

In general, when an instruction does not issue at the same time as the instruction immediately before it, instruction execution is said to have split issue. When a split issue condition occurs, all instructions after the split point stall one or more clocks, even if there are sufficient resources for some of them to execute. Common causes of split issue in the Itanium 2 processor are:

• An explicit stop is encountered.
• There are insufficient machine resources of the type required to execute an instruction.
• Instructions have not been placed in accordance with issue rules on the Itanium 2 processor.
The Itanium 2 processor issues instructions in the order defined by the static schedule. Care should be taken by the code generator to avoid register dependencies within an issue group, as shown in the sketch below. The Itanium 2 processor does not insert implicit stop bits to break WAW hazards; thus, a WAW hazard between loads and stores will result in an 8-cycle penalty if the predicates are true. Other WAW hazards, such as those due to ALU operations, will result in non-deterministic results; predicates are also taken into account.
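For example, a minimal sketch (registers are hypothetical) of using an explicit stop so that a consumer does not share an issue group with its producer:

    ld8     r14 = [r32]           // producer
    add     r16 = r8, r9 ;;       // independent work; the stop ends the issue group
    add     r15 = r14, r16        // consumer issues in a later group, so there is no
                                  // register dependency within an issue group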
Once instructions are issued as a group, they will proceed as a group through the pipeline. If one instruction in the issue group has a stall condition, the whole group will stall. This stall will also stall all instructions behind it (younger) in the pipeline.
3.2 Number and Types of Functional Units

Although parallel instruction groups may extend over an arbitrary number of bundles and contain an arbitrary number of each instruction type, the Itanium 2 processor has finite execution resources. If a parallel instruction group contains more instructions than there are available execution units, the first instruction for which an appropriate unit cannot be found will cause a split issue and break the parallel instruction group.

The front-end of the Itanium 2 processor pipeline can fetch up to two bundles per cycle and the back-end of the pipeline can issue as many as two bundles per cycle. Given that there are 3 instructions per bundle, the Itanium 2 processor can be considered a six instruction issue machine. For more details on the pipeline, see Appendix A, "Itanium® 2 Processor Pipeline."

The Itanium 2 processor has a large number of functional units of various types. This allows many combinations of instructions to be issued per cycle. Since only six instructions may issue per cycle, only a portion of the Itanium 2 processor's functional units described below will be used each cycle.
There are six general-purpose ALU units (ALU0, 1, 2, 3, 4, 5), two integer units (I0, 1), and one shift unit (ISHIFT, used for general purpose shifts and other special instructions). A maximum of six of these types of instructions can be issued per cycle.

The Data Cache Unit (DCU) contains four memory ports. Two ports are generally used for load operations; two are generally used for store operations. A maximum of four of these types of instructions can be issued per cycle. The two store ports can support a special subset of the floating-point load instructions.

There are six multimedia functional units (PALU0, 1, 2, 3, 4, 5), two parallel shift units (PSMU0, 1), one parallel multiply unit (PMUL), and one population count unit (POPCNT). These handle multimedia, parallel multiply, and the popcnt instruction types. At most, one pmul or popcnt instruction may be issued per cycle. However, the Itanium 2 processor may issue up to six PALU instructions per cycle.

There are four floating-point functional units: two FMAC units to execute floating-point multiply-adds and two FMISC units to perform other floating-point operations, such as fcmp, fmerge, etc. A maximum of two floating-point operations can be executed per cycle.

There are three branch units enabling three branches to be executed per cycle.

All of the computational functional units are fully pipelined, so each functional unit can accept one new instruction per clock cycle in the absence of other types of stalls. System instructions and access to system registers may be an exception.
3.3 Instruction Slot to Functional Unit Mapping

Each fetched instruction is assigned to a functional unit through an issue port. The numerous functional units share a smaller number of issue ports. There are 11 issue ports: eight for non-branch instructions and three for branch instructions. They are labeled M0, M1, M2, M3, I0, I1, F0, F1, B0, B1, and B2. The process of mapping instructions within bundles to functional units is called dispersal.

An instruction's type and position within the issue group define to which issue port the instruction is assigned. An instruction is mapped to a subset of the issue ports based upon the instruction type (i.e., ALU, Memory, Integer, etc.). Then, based on the position of the instruction within the instruction group presented for dispersal, the instruction is mapped to a particular issue port within that subset.

Table 3-1, "A-Type Instruction Port Mapping," Table 3-2, "I-Type Instruction Port Mapping," and Table 3-3, "M-Type Instruction Port Mapping" show the mappings of instruction types to ports and functional units. Section 3.3.2 describes the selection of the particular port based upon instruction position.

Note: Shading in the following tables indicates the instruction type can be issued on the port(s).

A-type instructions can be issued on all M and I ports (M0-M3 and I0 and I1). I-type instructions can only issue to I0 or I1. The I ports are asymmetric so some I-type instructions can only issue on port I0. M ports have many asymmetries: some M-type instructions can issue on all ports; some can only issue on M0 and M1; some can only issue on M2 and M3; some can only issue on M0; some can only issue on M2.
When dispersing instructions to functional units, the Itanium 2 processor views, at most, two bundles at a time with no special alignment requirements. This text refers to these bundles as the first and second bundles. A bundle rotation causes new bundles to be brought into the two-bundle window of instructions being considered for issue. Bundle rotations occur when all the instructions within a bundle are issued. Either one or two bundles can be rotated depending on how many instructions were issued.
3.3.2 Dispersal Rules

The Itanium 2 processor hardware makes no attempt to reorder instructions to avoid stalls. Thus, the code generator must be careful about the number, type, and order of instructions within a parallel instruction group to avoid unnecessary stalls. The use of predicates has no effect on dispersal – all instructions are dispersed in the same fashion whether predicated true, predicated false, or unpredicated. Similarly, nop instructions are dispersed to functional units as if they were normal instructions. The dispersal rules for execution units vary according to slot type; i.e., I, M, F, B, or L. The rules for the different slot types are described below.

Dispersal Rules for F Slot Instructions

• An F slot instruction in the first bundle maps to F0.
• An F slot instruction in the second bundle maps to F1.
• A SIMD FP instruction essentially maps to both F0 and F1. See Section 3.3.3 for more information on SIMD FP issue rules.
Dispersal Rules for B Slot Instructions

• Each B slot instruction in an MBB or BBB bundle maps to the corresponding B unit. That is, a B slot instruction in the first position of the template is mapped to B0; in the second position, it is mapped to B1; and in the third position, it is mapped to B2.
• The B instruction in an MIB/MFB/MMB bundle maps to B0 if it is a brp or nop.b and it is the first bundle, otherwise it maps to B2.
• For purposes of dispersal, break.b is treated like a branch.
Dispersal Rules for L Slot Instructions

• An MLX bundle uses ports equivalent to an MFI bundle. If the MLX bundle is the first bundle, the L slot instruction maps to F0. Otherwise, it maps to F1. However, there is no conflict when the MLX template is issued with an MMF or MIF bundle and the F op is a SIMD FP instruction.

Dispersal Rules for I Slot Instructions

• The instruction in the first I slot of the two-bundle issue group will issue to I0. The second I slot instruction will issue to I1.
• If the second I slot instruction can only map to an I0 port (see Table 3-2), an implicit stop will be inserted and the second I slot instruction will be issued in the next cycle. Thus, an I0-only instruction should be placed in the first I slot of a bundle pair. Only one I0-only instruction can be issued per cycle.
• An instruction in an I slot will not necessarily be issued to an I port. If the first two I slot instructions have been issued to the I ports, and an additional I slot instruction in the issue group contains A-type instructions as listed in Table 3-1, and M ports are available, these instructions will be mapped to available M ports. This allows the potential dual issue of the MII-MII bundle pair. This is new to the Itanium 2 processor and is not true on the Itanium processor.
• For the MLI template, the I slot instruction is always assigned to port I0 if it is in the first bundle or it is assigned to port I1 if it is in the second bundle. Thus, the bundle pair MII-MLI can never dual issue.
Dispersal Rules for M Slot Instructions

On the Itanium 2 processor, M slot instructions are grouped into four subtypes (see Table 3-3):

• Load subtype, which can be issued on either M0 or M1 or both (e.g., integer load, sync)
• Store subtype, which can be issued on either M2 or M3 or both (e.g., integer store, alloc, setf)
• Generic subtype, which can be issued on any of the four M ports (e.g., ALU, floating-point load)
• Special instructions, which can be issued only on the M2 port (e.g., getf, mov to AR)
The issue logic can reorder M slot instructions between different subtypes but cannot reorder instructions within the same subtype. For instance, within an issue group an integer store can precede an integer load without causing a split issue. The store will be mapped to M2 and the load to M0 since the two instructions are from different subtypes.

However, if a store precedes a getf, the store will be issued to M2 and a split issue will occur because the getf must issue on M2. Instructions within the same subtype cannot be reordered. Therefore, the code scheduler should place the getf instruction before the store to ensure the getf instruction is mapped to M2 and the store is mapped to M3 to avoid port oversubscription.
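A hedged sketch of the recommended ordering (registers are hypothetical):

    // getf is a special M-type instruction and can only issue on port M2, so it is
    // placed ahead of the store; the store subtype instruction then takes M3.
    getf.sig  r14 = f8            // maps to M2
    st8       [r32] = r15 ;;      // maps to M3; no port oversubscription, no split issue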
Dispersal becomes more complicated when generic subtype instructions early in the issue group consume M ports. There is no encompassing rule to cover these cases. It is recommended that the more restrictive subtypes get scheduled first in the issue group. Example 3-1 and Example 3-2 demonstrate some of the dispersal possibilities.
Note: MA is a generic subtype, ML is an integer load, and MS is a store subtype instruction.

Example 3-1. MA ML I - MS MA I
The bundle pair MA ML I - MS MA I gets mapped to ports M2 M0 I0 - M3 M1 I1. The first generic subtype instruction mapped to M2 causes the MS instruction to be mapped to M3. If MS is a getf instruction, a split issue will occur.

Example 3-2. MA MA I - MS MA I
The bundle pair MA MA I - MS MA I gets mapped to ports M0 M1 I0 - M2 M3 I1, which allows MS to get the more favorable M2 port.
Table 3-4 shows the combinations of bundle types that the Itanium 2 processor can dual issue (indicated by the shaded areas). Rows contain the first bundle of the pair; columns contain the second.

Table 3-4. Dual Issue Bundle Types
(Rows and columns list the bundle types MII, MLI, MMI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB; the shading that marks which pairs can dual issue is not reproduced here.)
1. The B must be nop.b or brp.
Note: Floating-point loads are generic subtype instructions. As such, the Itanium 2 processor can issue up to four per cycle. This capability is available to all normal and speculative floating-point loads of all sizes. Advanced floating-point loads, load pair instructions, and check load instructions are not generic and must issue on the two load ports, while the floating-point stores only issue to the two store ports.
3.3.3 Split Issue and Bundle Types

Because there is an increased number of functional units in the Itanium 2 processor and I slot instructions can sometimes issue to M ports, many bundle pairs can dual issue. Resource oversubscription rarely occurs. Reasons that bundle pairs would not dual issue are explicit stops and dispersal problems mentioned in the previous section. In addition, there are several Itanium 2 processor-specific (rather than architectural) special cases that will cause split issue. These specific cases are listed below:

• Branches
  — BBB/MBB: Always splits issue after either of these bundles.
  — MIB/MFB/MMB: Splits issue after any of these bundles unless the B slot contains a nop.b or a brp instruction. A br instruction always introduces an implicit stop bit for these bundle types.
  — MIB BBB: Splits issue after the first bundle in this pair from B port oversubscription.

• SIMD FP
  — Only one FP instruction can issue per cycle if the instruction is a SIMD FP instruction. For instance, for the bundle pair MFpI MFI, where Fp is a SIMD FP operation, there will be an implicit stop between the M and F instructions of the second bundle, even if the F instruction is a nop.f.
  — Similarly, for the bundle pair MFI MFpI, there will be an implicit stop between the M and Fp instructions of the second bundle since the Fp instruction must issue to the F0 port and the first F instruction has already mapped to F0.
  — One case which might seem to cause a split issue, but does not, is the bundle pair MFpI MLX. Even though the L slot acts like it maps to an F port, these two bundles can dual issue.
4 Latencies and Bypasses

This chapter describes latencies and bypasses for execution of the different instruction types on the Itanium 2 processor.

In general, integer instructions have one cycle of latency, floating-point instructions have four cycles of latency, multimedia instructions have two cycles of latency, and L1 cache hits have one cycle of latency. However, due to asymmetric bypasses, there are many special cases that need to be listed separately.
4.1 Control and Data Speculation Penalties

The Itanium 2 processor can compute the address of the recovery code from the offset in the chk.a/chk.s/fchkf instruction without having to trap to the OS fault handler. The speculative load recovery latencies listed in Table 4-1 are approximations based upon the time difference between the chk.s/chk.a/fchkf retirement and the completion of the first instruction of the fix-up code. These latencies do not include possible cache or TLB latencies, the cost of the recovery code itself, or the final branch at the end of the recovery code. Further information on advanced loads can be found in Section 5.1, "Data Speculation and the ALAT."
Table 4-1. Speculative Load Recovery Latencies

Instruction                                                    Latency (cycles)
chk.a, both int and fp (ALAT hit), chk.s (no NaT/NatVal)       0
chk.a, both int and fp (ALAT miss), chk.s (NaT/NatVal)         18
ld*.c, ldf*.c (ALAT hit, L1/L2 hit)                            0
ld*.c, ldf*.c (ALAT miss, L1/L2 hit)                           8
4.2 Branch Related Latencies and Penalties

Table 4-2 describes latencies for branch operations and branch related flushes. See Section 7, "Branch Instructions and Branch Prediction" for more detailed information.

Note for Table 4-2:
1. The 6-cycle penalty is for IP-relative branches that cross a 40-bit boundary. Loop branches that are mispredicted take 7 cycles. These incur a full branch mispredict penalty.
Table 4-3. Execution with Bypass Latency Summary
(Producer classes covered: cmp, tbit, tnat; FP side predicate writes (fcmp; frcpa, fprcpa, frsqrta, fpsqrta); integer load; FP load; IEU2: move_from_br, alloc; move to/from CR or AR; move to pr; move indirect. The per-consumer latency matrix is not reproduced here; the notes below still apply.)
Notes for Table 4-3:
1. Since these operations are performed on the L1D, they interact with the L1D and L2 pipelines. These are the minimum latencies, but they could be much larger because of this interaction.
2. Since these operations are performed on the L1D, they interact with the L1D and L2 pipelines. These are the minimum latencies, which could be much larger because of this interaction.
3. N depends upon which level of cache is hit: N=1 for L1D, N=5 for L2, N=12-15 for L3, N=~180-225 for main memory. These are minimum latencies and are likely to be larger for higher levels of cache.
4. M depends upon which level of cache is hit: M=5 for L2, M=12-15 for L3, M=~180-225 for main memory. These are minimum latencies and are likely to be larger for higher levels of cache. The +1 in all table entries denotes one cycle needed for format conversion.
5. Best-case values of C range from 2 to 35 cycles depending upon the registers accessed. EC and LC accesses are 2 cycles; FPSR and CR accesses are 10-12 cycles.
6. Best-case values of D range from 6 to 35 cycles depending upon the indirect registers accessed. Iregs pkr and rr are on the faster side, being 6-cycle accesses.
7. It should be noted that the multimedia type includes I1-I9, A9, A10, and only the cmp4 from A8 instructions as listed in Table 3-1 and Table 3-2.
Control register access latencies (cycles):

mov r2=cr.ifs     12        mov cr.ifs=r2     11
mov r2=cr.iim     2         mov cr.iim=r2     11
mov r2=cr.iha     5         mov cr.iha=r2     6
mov r2=cr.lid     36        mov cr.lid=r2     35
mov r2=cr.ivr     36        READ ONLY
mov r2=cr.tpr     36        mov cr.tpr=r2     35
mov r2=cr.eoi     36        mov cr.eoi=r2     35
mov r2=cr.irr0    36        READ ONLY
mov r2=cr.irr1    36        READ ONLY
mov r2=cr.irr2    36        READ ONLY
mov r2=cr.irr3    36        READ ONLY
mov r2=cr.itv     36        mov cr.itv=r2     35
mov r2=cr.pmv     36        mov cr.pmv=r2     35
mov r2=cr.cmcv    36        mov cr.cmcv=r2    35
mov r2=cr.lrr0    36        mov cr.lrr0=r2    35
mov r2=cr.lrr1    36        mov cr.lrr1=r2    35
mov from dbr[r0]  36        mov to dbr[r3]    1
mov from ibr[r0]  36        mov to ibr[r3]    46
mov from pkr[r0]  5         mov to pkr[r3]    1122
mov from pmc[r0]  36        mov to pmc[r3]    3546
mov from pmd[r0]  36        mov to pmd[r3]    3546
mov from rr[r0]   5         mov to rr[r3]     1122
5 Data Operations

This chapter describes considerations for data operations such as speculative or predicated loads or stores, floating-point loads, and prefetches. Load hints, data alignment, and write coalescing considerations are also discussed.
5.1 Data Speculation and the ALAT

The family of instructions composed of ld.a/ldf.a/ldfp.a, ld.c/ldf.c/ldfp.c, and chk.a provides the capability to dynamically disambiguate memory addresses between loads and stores. Architecturally, the ld.c and chk.a instructions have a 0-cycle latency to consuming instructions. This allows the ld.c/ldf.c/ldfp.c/chk.a and the corresponding consuming instruction to be scheduled in the same cycle. However, if a ld.c/ldf.c/ldfp.c/chk.a misses in the ALAT, additional latency is incurred. Also, an advanced load activates the scoreboard for the target register in order to ensure correct operation in the event of an L1D miss.
A ld.c, ldf.c, or ldfp.c that misses the ALAT initiates an L1 cache access. Other instructions in the issue group will be re-executed. This is an 8-cycle penalty that will affect all operations issued since the check load, whether there was a consumer in the same issue group or not. The consumer will be exposed to any additional cache latency (i.e., if the check load is found in the L1 then the penalty will be only 8 cycles). However, if the check load is in the L2, the user will see greater latency.
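A minimal data-speculation sketch using a check load (register numbers are hypothetical, not from the manual):

    ld8.a     r14 = [r33]         // advanced load; allocates an ALAT entry
    // ... the possibly conflicting store is scheduled ahead of the check ...
    st8       [r34] = r9 ;;
    ld8.c.clr r14 = [r33]         // check load: 0-cycle if the ALAT entry survived, so the
    add       r15 = r14, r10 ;;   // consumer may share its cycle; an ALAT miss costs 8 cycles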
A chk.a that misses in the ALAT executes a branch to recovery code. On the Itanium 2 processor, the branch target can be computed from the offset contained in the chk.a instruction in most instances. This avoids the trap to the operating system that is done on the Itanium processor. The cost of a chk.a that misses in the ALAT is at least 18 cycles to branch to recovery code, plus the cost of the recovery code, plus the return. The actual resteer to fix-up code occurs within 10 cycles; however, there are at least 8 cycles for the first instruction of the fix-up code to complete. The 8 cycles will increase when the branch to fix-up code misses the L2 ITLB or L1I and other cache levels.
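A hedged sketch of advanced-load recovery through chk.a (register numbers and the labels back/recover are hypothetical):

    ld8.a     r14 = [r33] ;;
    add       r15 = r14, r10      // speculative use of the loaded value
    // ... possibly conflicting store ...
    st8       [r34] = r9 ;;
    chk.a.clr r14, recover        // an ALAT miss branches to the recovery code (at least 18 cycles)
back:
    // ... continue using r15 ...
recover:
    ld8       r14 = [r33] ;;      // redo the load and the dependent computation
    add       r15 = r14, r10
    br.cond.sptk  back ;;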
The Itanium 2 processor ALAT has 32 entries and is fully associative. Each entry contains the register number, type, and the lower 20 bits of the physical address. The address is used to compare against potentially conflicting stores while the register index and type support the check operation. Since only partial addresses are saved in the ALAT, it is possible to have a false conflict if a store and an ALAT entry had different addresses yet shared the same lower 20 bits of physical address.

In addition, if a ld.c or chk.a follows a store too closely, the ALAT address comparison will be done on fewer than 20 bits of physical address. This is a result of the minimum 4K page size support and the need for both store and check addresses to be fully translated to accomplish the 20-bit physical address comparison. Table 5-1 lists the distances and comparison sizes.

Note: In Table 5-1, ld.c also implies ldf.c and ldfp.c.
Table 5-1. ALAT Entry Comparison Sizes

Distance                                         Comparison Size
st and ld.c in same cycle                        12-bit
st precedes ld.c by 1 cycle                      12-bit
st precedes ld.c by more than 1 cycle            20-bit
Table 5-1. ALAT Entry Comparison Sizes (Continued)

Distance                                         Comparison Size
st and chk.a in same cycle                       12-bit
st precedes chk.a by 1 cycle                     12-bit
st precedes chk.a by more than 1 cycle           20-bit

Note: On the Itanium processor, if a store and chk.a occur in the same cycle, the chk.a will always fail, but this is not the case for the Itanium 2 processor.
5.1.1 Allocation/Replacement Policy

When a new entry is added in the ALAT, the following is the priority listing of which entry is replaced:

• The entry with the same register number as the new entry.
• The first invalid entry.
• A valid entry is replaced based upon advancing pointers associated with ports M0 and M1. This approximates a first-in-first-out (FIFO) algorithm.
5.1.2 Rules and Special Cases

The following rules and special cases should be noted:

• The Itanium architecture definition prohibits scheduling a ld.a and ld.c in the same cycle if both instructions have the same target register. Similarly, ld.a and chk.a cannot be scheduled in the same cycle if they have the same target register. However, separation by one or more cycles will give normal ALAT behavior. A similar situation is true for ldf.a and ldfp.a.
• A faulting ld.a will not write to the ALAT. Such faults are listed in Volume 3: Instruction Set Reference of the Intel® Itanium® Architecture Software Developer's Manual and include, among others, Data Page Not Present, Data TLB, and Unaligned Data Reference faults. In these situations, a subsequent corresponding ld.c or chk.a will definitely miss in the ALAT.
• If both an ALAT set and ALAT invalidate instruction occur in the same cycle, the ALAT set will not occur. For instance, if a chk.a.clr rx and rx = ld.a[addr] occur in the same cycle, the address of the ld.a[addr] will not be entered in the ALAT.
5.2 Speculative and Predicated Loads/Stores

Memory operations with speculative inputs behave in the following manner:

• For a normal load/store whose source register contains a NaT value, a register NaT consumption fault will occur.
• For a speculative load whose source register contains a NaT value, the NaT bit is set and a zero value will be returned.
The Itanium 2 processor supports two deferral behaviors: early and late. The behavior of speculative memory operations depends on several factors such as interrupt state, deferral control registers, and processor configuration. Early deferral mode is enabled through the PAL procedure PAL_PROC_SET_FEATURES. The effects of this will be maintained until the system is rebooted and the processor returns to the default late deferral behavior. Table 5-2 lists the requirements to enable early deferral.

Table 5-2. Early and Late Deferral

Early Deferral Enabled    psr.ic    dcr.dm    Deferral Mode
Yes                       0         0         Late
Yes                       0         1         Early
Yes                       1         0         Late
Yes                       1         1         Late
No                        x         x         Late
Table 5-3 shows the latency, according to deferral mode, that a speculative load may incur before returning data or eventually setting the destination NaT bit. The cost of each exception deferral ranges from one cycle to several cycles depending on the latency of the HPW. These HPW-related penalties cannot be scheduled around and affect every instruction in the issue group. Also, it is possible for the exception causing a deferral to not be resolved when the exception is deferred. Thus, the deferral stall may be seen each time through a loop where the chk.s is not reached.

Table 5-3. Speculative Load Latencies by Deferral Mode (cycles)

                 Early Deferral            Late Deferral
L1 DTLB Hit      2                         2
L2 DTLB Hit      2 or 4 + L2 latency       2 or 4 + L2 latency
VHPT Hit         5 (no HPW walk)           22 or 2 + L2 latency
VHPT Miss        5 (no HPW walk)           20 + L2 latency
VHPT Fault       5 (no HPW walk)           17 + L2 latency
Note: Speculative loads are not limited to ld.s instructions. lfetch instructions are normally speculative and behave similarly to ld.s instructions with the exception that they never set a NaT bit or return data. An lfetch instruction may be made non-speculative with the .fault completer.

The advantage of early deferral is that speculative operations complete with low latency. The latency is at best three cycles for an early deferred ld.s as seen by a dependent operation. This is important in situations where the code generator is aggressive in its speculation and the chances of the speculative operation actually hitting in the data TLB are low. Since early deferral does not initiate a VHPT walk by the HPW, even valid requests may fault since they are not in the L2 DTLB.
5.3 Floating-Point Loads

Floating-point loads are not cached in the L1D and are instead processed directly by the L2. The limited size and bandwidth of the L1D makes caching this data unprofitable. It is expected that FP memory accesses can more easily be scheduled to cover the additional latency of the L2. Floating-point loads incur an extra clock of latency over integer accesses to accommodate format conversion. Therefore, a floating-point load takes 6 cycles if it hits in L2. Note also that the FP load pair instructions (both double-precision and single-precision) also access the L2 cache, so the latency for a load pair instruction is also 6 cycles assuming that it is an L2 hit.
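A hedged illustration (register numbers are hypothetical):

    ldfpd   f6, f7 = [r32], 16    // double-precision load pair, serviced by the L2 (bypasses L1D);
                                  // roughly 6 cycles on an L2 hit, with post-increment of the base
    ldfs    f8 = [r33]            // single-precision load: also serviced by the L2, 6 cycles on an L2 hit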
5.4 Data Cache Prefetching and Load Hints

The architecture provides two software mechanisms to control when and where data is loaded. The lfetch instruction is used to explicitly prefetch data into the L1D, L2, or L3 caches. To facilitate more data locality, temporal hints can be used to control the level of the cache hierarchy into which loaded data is placed.

5.4.1 lfetch Implementation

The Itanium 2 processor implementation of lfetch is as follows:

• lfetch.none is completed only if there are no exceptions. Exceptions are not reported. Section 5.2 contains information on the behavior of lfetch instructions that encounter memory management faults.
• lfetch.fault is completed whether or not there is an exception. If there is an exception, it is raised to the OS to complete the operation. A TLB miss is resolved as with a normal load.
• If the lfetch misses in L1D but hits in the L2, the L1D cache is allocated based on the lfetch temporal hint. lfetch instructions have the same temporal locality behavior as integer loads.
• All lfetch types which miss in the first level data TLB and hit in the second level data TLB will stall the main pipeline and fill the first level data TLB as a normal load operation. The behavior of the lfetch in the event of an L2 DTLB miss depends on the use of the early or late deferral modes described in Section 5.2. In early deferral mode, the lfetch aborts with an L2 DTLB miss. In late deferral mode, the lfetch will initiate an HPW access. If the access fails, the lfetch will abort. However, it is only the lfetch.fault instruction that will initiate a HPW access when it misses both data TLBs.
• An lfetch.excl appears as a store to other cache levels and the system bus. This means that these operations will place a line in the M state within the caches. Do not use the .excl completer unless there is a high probability that the data will truly be modified. Otherwise, the cache will evict unmodified data to the cache structures and eventually to memory.
• An lfetch to an uncacheable memory location will not reach the L2 cache as required by the architecture.
Note: The lfetch instruction appears as a load operation without a specific data return to the core. As such, many of the limitations that normal loads experience anywhere in the memory hierarchy will affect the lfetch instruction as well. Exceptions are noted and are provided with the intent that they will make lfetch instructions easier for the compiler to use in realizing performance.
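A short sketch of the prefetch forms discussed above (addresses and strides are hypothetical):

    lfetch            [r32], 128  // non-faulting prefetch; exceptions are not reported
    lfetch.fault.nt1  [r33], 128  // faulting prefetch, hinted to bypass the L1D
    lfetch.excl       [r34]       // requests exclusive ownership; use only when the line
                                  // will very likely be modified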
5.4.2 Load Temporal Locality Completers

The Itanium architecture uses memory locality hints for managing the data cache hierarchy. On the Itanium 2 processor, four types of memory locality hints are implemented: t1, nt1, nt2 and nta. The Itanium 2 processor does not support a non-temporal buffer; instead, non-temporal L2 accesses are allocated in L2 with biased replacement. The implementation is as follows:

• The t1 hint is for normal accesses. On a load, the line is allocated in L1D, L2, and L3. On a store, the line is allocated in L2 and L3, but not L1D.
• For loads with the nt1 hint, the line is only allocated in L2 and L3. In addition, the line is biased to be replaced in the L2. This is achieved by not updating the L2 LRU bits. Note that by doing so, the line has a higher probability of being replaced, though it is not guaranteed to be replaced next.
• Loads with the nt2 hint are implemented in the same manner as loads with the nt1 hint.
• For loads and stores with the nta hint, the line is only allocated and biased to be replaced in L2. The line is not allocated into L3.

Table 5-4 lists how L1D, L2, and L3 handle line allocation and LRU update for different hints. Note that:

• L1D is write through and does not support FP loads and stores.
• The valid bit update in the L1D cache and the LRU bits update in the L3 cache are independent of the hint bits. Only the update of the L2 LRU is biased to mimic the behavior of a non-temporal buffer.
Table 5-4. Processor Cache Hints
(The table gives, for each access type – lfetch, integer load, integer store, FP load, FP store – and each hint, whether the line is allocated and whether the LRU is updated in L1D, L2, and L3; the matrix itself is not reproduced here.)
1. Alloc indicates an entry is allocated in that level of the cache on a cache miss.
2. Integer load and FP load – only t1, nt1, and nta attributes are allowed.
3. Integer store and FP store – only t1 and nta are allowed.

Note: Other instruction/hint combinations are not allowed by the Itanium architecture.
Intel® Itanium® 2 Processor Reference Manual For Softwa re D evelopment and Optimization41
Data Operat io ns
5.4.2.1 General Descriptions of Hints

Memory locality hints are described below:

• .none: The load delivers the data and is loaded into both L1D and L2.
• .nt1: This hint means non-temporal locality in the first cache level capable of holding the referenced data. The Itanium architecture suggests this hint indicates that the load should deliver the data and the line should not be allocated in the first level caches. For the Itanium 2 processor, this hint will cause the line not to be allocated to the L1D on an integer cache miss. If it is already in the L1D cache, it will not be deallocated.
• .nt2: This hint means non-temporal locality in the second cache level capable of holding the instruction. For the Itanium 2 processor, this hint will cause integer accesses to the line to be allocated in L2; however, the LRU information will not be updated for the line (i.e., it will be the next line to be replaced in the particular set). If it is already in the L2 cache, it will not be deallocated.
• .nta: This hint means non-temporal locality in all levels of the cache hierarchy. For the Itanium 2 processor, this hint will cause the line to be allocated in L2; however, the LRU information will not be updated for the line (i.e., it will be the next line to be replaced in the particular set). This line will not be allocated in the L3 cache. If present in any cache, it will not be deallocated from that cache, although sometimes lines are deallocated for coherency reasons.
Note: There is no way to allocate only in L3 and not impact L2, even with an lfetch instruction.

The one-way allocation for non-temporal L2 data may lead to displacement of L2 data for a temporary data stream since the non-temporal data may be quickly replaced. A single L2 way holds 32KB. This may be large enough for a single .nt stream, but an attempt to use two non-temporal streams may cause one stream to displace the other.
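For instance, a hedged example of hinted integer loads (register numbers are hypothetical):

    ld8.nt1  r14 = [r32]          // non-temporal at level 1: the line is not allocated in L1D
    ld8.nta  r15 = [r33]          // non-temporal at all levels: biased L2 replacement, no L3 allocation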
5.5 Data Alignment

The Itanium 2 processor implementation supports arbitrarily aligned load and store accesses, except for integer accesses that cross 8-byte boundaries and any accesses that cross 16-byte boundaries.

If psr.ac = 1, all unaligned memory references will fault.

If psr.ac = 0, these rules must be followed to avoid faults:

• Integer loads and stores must be aligned within an 8-byte aligned window.
• All FP 4-byte and 8-byte load operations can be unaligned within a 16-byte aligned window.
• All FP load pairs must be naturally aligned; i.e., singles on an 8-byte alignment, doubles on a 16-byte alignment, ldfp8 on a 16-byte alignment.
• All FP 10-byte loads can be unaligned within a 16-byte window.
• FP fill/spill instructions must be aligned within a 16-byte aligned window.
• FP stores can be unaligned within a 16-byte aligned window.
• Semaphores (cmpxchg, xchg, fetchadd) must be restricted to natural alignment.
• All uncacheable (UC, WC) accesses which cross an 8-byte boundary will fault.
5.6 Write Coalescing
For increased performance of uncacheable references to frame buffers, previous generation IA-32 processors defined the write coalescing (WC) memory type. WC allows streams of data writes to be combined into a single, larger bus write transaction. The Itanium 2 processor fully supports write coalescing as defined by the Intel® Pentium® III processor. Like the Pentium III processor, the Itanium 2 processor performs WC loads directly from memory and not from the coalescing buffers.
The Itanium 2 processor has a separate two-entry, 128-byte buffer (WCB) that is used for WC
accesses exclusively. Each byte in the line has a valid bit. If all valid bits are true, then the line is
said to be full and will be evicted by the processor.
5.6.1 WC Buffer Eviction Conditions
To ensure consistency with memory, the WCB is flushed under the following conditions (both entries are flushed). Table 5-5 shows the eviction conditions when the processor is operating in the Itanium system environment:

Table 5-5. Itanium® 2 Processor WCB Eviction Conditions

  Eviction Condition                   WCB Evicted
  Itanium® Instructions:
    Flush cache (fc) hit on WCB        yes
    Flush write buffers (fwb)          yes
  Architectural Conditions for WCB Flush:
    Any UC load (1)                    no
    Any UC store (1)                   no
    UC load or ifetch hits WCB (1)     no
    UC store hits WCB (1)              no
    WC load/ifetch hits WCB            no
    WC store hits WCB (2)              no

  1. The Itanium® architecture doesn't require the WC buffers to be coherent with respect to UC load/store operations.
  2. A WC store which hits in the WCB updates that entry if it is not full. If it is full, a check is made whether that entry is older or younger than the other WCB entry. If it is younger, the older WCB entry is flushed out (even if it is not full), and the younger WCB entry is flushed afterwards. If the WCB entry is the oldest, it is flushed by itself.
5.6.2 WC Buffer Flushing Behavior
As mentioned previously, the Itanium 2 processor WCB contains two entries. The WC entries are flushed in the same order as they are allocated; that is, the entries are flushed in written order. This flushing order applies only to a “well-behaved” stream. A “well-behaved” stream writes one WC entry at a time and does not write the second WC entry until the first one is full. This implies that the addresses of the WC stores monotonically increase. A store with release semantics should be used to force a flush of a partial line before starting on the next line.
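The following is a minimal sketch, assuming fb points into a WC-mapped frame buffer (how that mapping is obtained is platform specific and not shown). Each 128-byte WCB entry (16 quadwords) is written with monotonically increasing addresses, and the last store of a line uses release semantics, the C-level analogue of st.rel, so a line is pushed out before the stream moves on.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a "well-behaved" WC stream; fb and wc_fill are illustrative
 * names, not part of any real API. */
void wc_fill(_Atomic uint64_t *fb, uint64_t value, size_t lines)
{
    for (size_t line = 0; line < lines; line++) {
        for (size_t q = 0; q < 15; q++)          /* first 15 quadwords */
            atomic_store_explicit(&fb[line * 16 + q], value,
                                  memory_order_relaxed);
        /* last quadword with release semantics closes out the line */
        atomic_store_explicit(&fb[line * 16 + 15], value,
                              memory_order_release);
    }
}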
In the absence of platform retry or deferral, the flushing rule implies that the WCB entries are always flushed in program written order for a “well-behaved” stream, even in the presence of interrupts. For example, consider the following scenario: if software issues a “well-behaved” stream but is interrupted in the middle, one of the WC entries could be partially filled. The WCB (including the partially filled entry) could be flushed by the OS kernel code or by other processes. When the interrupted context resumes, it sends out the remaining line and then moves on to fill the other entry. Note that the resumed context could be interrupted again in the middle of filling up the other entry, causing both entries to be partially filled when the interrupt occurs.
For streams that do not conform to the above “well-behaved” rule, the order in which the WC buffer is flushed is random.
WCB eviction is performed for full lines by a single 128-byte bus transaction. For partially full lines, the WCB is evicted using 1–8, 16, or 32-byte transactions with the proper enables. The flushing will issue the largest data transactions allowed by a contiguous and aligned set of write-coalescing data. When flushing, WC transactions are given the highest priority of all external bus operations.
5.7 Register Stack Engine
The Itanium 2 processor register stack engine (RSE) operates only in lazy mode (ar.rsc.mode = 0). All other mode configurations are ignored.
A maximum of two loads or two stores can be performed by the RSE in each cycle, but not both loads and stores at the same time.
Generally, RSE loads and stores are expected to hit in the L1D cache, and the L1D is capable of holding RSE cache lines.
5.8 FC Instructions
The fc instruction invalidates a specified cache line from all levels of the cache hierarchy. On the Itanium 2 processor, each fc invalidates 128 bytes, corresponding to the L3 cache line size. Since both the L1I and L1D have line sizes of 64 bytes, a single fc instruction can invalidate two lines in those caches.
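A short sketch of how software might walk a buffer given the 128-byte fc granularity is shown below. flush_cache_line() is a hypothetical wrapper (inline assembly or a compiler intrinsic) around a single fc; it is not a standard API.

#include <stdint.h>
#include <stddef.h>

extern void flush_cache_line(void *addr);   /* assumed: executes one fc */

/* Sketch: because each fc invalidates a 128-byte L3 line (covering two
 * 64-byte L1 lines), the loop steps by 128 bytes. */
void flush_range(void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)127;   /* align down */
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += 128)
        flush_cache_line((void *)p);
}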
Memory Subsystem 6
The Itanium 2 processor memory system has a three-level cache structure: a first-level instruction
cache (L1I), a first-level data cache (L1D), a unified second-level cache (L2), and a unified
third-level cache (L3).
The following sections contain detailed information on the workings of the L1D, L2, L3, and system bus. This information is presented to give a basis for the optimization recommendations and to give enough understanding to recognize bottlenecks that are not specifically covered in this document. Chapter 9, “Optimizing for the Itanium® 2 Processor,” provides some important suggestions for optimizing for the Itanium 2 processor memory subsystem.
Figure 6-1. Three Level Cache Hierarchy of the Itanium® 2 Processor
[Figure: the L1I (16 KB, 64-byte line, 1 cycle) and L1D (16 KB, 64-byte line, 1 cycle) back onto the unified L2 (256 KB, 128-byte line, 5+ cycle) and L3 (9 MB, 128-byte line, 14+ cycle); bus control logic connects the caches to memory and I/O over the 6.4 GB/s system bus.]
The Itanium 2 processor employs a two-level TLB for both instruction and data references: the first-level (L1 ITLB) and second-level (L2 ITLB) instruction TLBs for instructions, and the first-level (L1 DTLB) and second-level (L2 DTLB) data TLBs for data.
The Itanium 2 processor implements all the features of the Itanium architecture requirements for virtual memory support. Table 6-1 lists the specific parameters of the Itanium 2 processor implementation.
Table 6-1. Itanium® 2 Processor Virtual Memory Support

  Virtual Memory            Itanium® 2 Processor Implementation
  Page Size                 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, and 4G bytes
  Physical Address          50 bits
  Virtual Address           64 bits
  Region Registers          8 registers with 24 bits in each register
  Protection Key Registers  16 registers with 24 bits in each register
6.1 Translation Lookaside Buffers
Table 6-2 shows the major features of the TLBs of the Itanium 2 processor. The capabilities of the instruction and data TLBs are approximately equivalent. The first-level TLBs are closely tied to the first-level instruction and data caches. This is necessary to support single-cycle access for the L1 caches and comes at the price that a first-level TLB miss forces a first-level cache miss.

Table 6-2. Major Features of Instruction and Data TLBs

                                Instruction TLBs    Data TLBs
  Structures                    L1 ITLB, L2 ITLB    L1 DTLB, L2 DTLB
  Number of Entries             32, 128             32, 128
  Associativity                 Full, Full          Full, Full
  Penalty for First Level Miss  2 cycles            4 cycles

6.1.1 Instruction TLBs
The L1 ITLB has 32 fully associative entries and is dual ported. One port is used exclusively for
regular instruction fetches and LRU updates. The second port is shared among instruction
prefetches, snoops, and TLB purges. The L1 ITLB contains sufficient information, region registers, and protection keys, such that it does not need to be a strict subset of the larger L2 ITLB.
When an L1 ITLB page translation is replaced, all entries in the L1I cache from the victimized
page are invalidated. The victim entry is determined using true LRU. The L1 ITLB directly
supports only a 4KB-page size. Other page sizes are indirectly supported by allocating additional
L1 ITLB entries as each 4KByte segment of the larger page is referenced.
The L2 ITLB has 128 fully associative entries and is single ported. Up to 64 entries of the L2 ITLB can be assigned as translation registers (TRs). TRs are effectively translations locked into the L2 ITLB and are therefore not subject to the LRU replacement policy. The L2 ITLB directly supports page sizes of 4KB, 8KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB, 256MB, 1GB, and 4GB.
The L1 ITLB and L2 ITLB are accessed in parallel for demand fetches to reduce an L1 ITLB miss (and associated L1I cache miss) penalty. These parallel accesses do not update the L2 ITLB LRU values. If an instruction access misses in the L1 ITLB, but hits in the L2 ITLB, the first-level instruction cache access will incur a two-cycle penalty (in parallel with the second-level cache latency) to transfer the page information from the L2 ITLB to the L1 ITLB. Since an L1 ITLB miss results in an L1I cache miss, the penalty will likely be greater as the instruction must be accessed from higher-level caches or system memory.
6.1.2 Data TLBs
The L1 DTLB has 32 fully associative entries and is dual ported. Only two ports are required because it supports only integer load operations. Unlike the L1 ITLB, the L1 DTLB lacks protection and page attribute information. Consequently, the L1 DTLB is accessed in parallel with the L2 DTLB and must be a strict subset of the second-level DTLB for an L1D hit.
When an L1 DTLB page translation is replaced, all entries in the L1D from the victimized page are invalidated. The L1 DTLB has a fixed page size of 4 KB. Larger page sizes are supported by allocating an additional L1 DTLB entry for each referenced 4 KB portion of the larger page.
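A worked example of that allocation behavior, written as a small sketch (the program and its output format are illustrative, not from the manual):

#include <stdio.h>

/* Sketch: the L1 DTLB maps only 4 KB at a time, so each referenced 4 KB
 * chunk of a larger page consumes one of its 32 entries, even though the
 * L2 DTLB covers the page with a single translation. */
int main(void)
{
    unsigned page_kb[] = { 4, 16, 64 };
    for (int i = 0; i < 3; i++)
        printf("%2u KB page -> up to %2u of the 32 L1 DTLB entries\n",
               page_kb[i], page_kb[i] / 4);
    return 0;
}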
The L2 DTLB has 128 fully associative entries and is four ported. The four ports are needed to allow all combinations of integer loads, stores, and floating-point loads to be looked up in parallel. The integer loads rely on the L2 DTLB for protection and page attribute information. The other accesses get virtual-to-physical mapping, protection, and page attributes from the L2 DTLB.
Up to 64 entries of the L2 DTLB can be assigned as TRs. TRs are effectively translations locked
into the L2 DTLB and are therefore not subject to LRU replacement policy. The L2 DTLB directly
supports page sizes of 4KB, 8KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB, 256MB,
1GB, and 4GB.
Stores or floating-point accesses that miss the L1 DTLB incur no penalty from an L1 DTLB miss. Integer loads that miss the L1 DTLB but hit the L2 DTLB incur a 4-cycle penalty (in addition to the L2 cache latency) to transfer from the L2 DTLB to the L1 DTLB. Also, a load access that misses the L1 DTLB will not hit in the L1D.
6.2 Hardware Page Walker
The hardware page walker (HPW) is the third level of address translation. The HPW is an engine that performs page look-ups from the virtual hash page table (VHPT). When an L2 DTLB or L2 ITLB miss is encountered, the HPW will access (as necessary) the L2 cache, the L3 cache, and finally memory to obtain the page entry. If the HPW cannot locate the page entry in the L2, the L3, or memory, an interruption is generated and a software handler is called to complete the translation (unless the requesting instruction defers the exception). The HPW will accept a new instruction TLB miss while processing a data TLB miss (and vice versa); however, the HPW will not process them at the same time. The requests are effectively serialized.
Cache accesses must wait for TLB resolution to complete:
• L1D accesses both L1 DTLB and L2 DTLB in parallel.
• L1I accesses only require an L1 ITLB lookup (an L2 ITLB lookup is required upon an L1
ITLB miss).
• L2/L3 data accesses only require an L2 DTLB lookup.
• L2/L3 instruction accesses only requir e an L2 ITLB looku p.
When an L2 DTLB or L2 ITLB miss occurs, an HPW lookup is performed. This HPW walk may be aborted at any time. For non-speculative memory requests, when the HPW aborts or cannot successfully map the virtual address, a fault is raised. For speculative memory requests, the actual request is aborted and the ld.s will set the NaT bit. The minimum penalty for going to the HPW is summarized in Table 6-3. An HPW lookup does not look in or cause a fill of the L1D cache. Since an L2 DTLB or L2 ITLB miss also implies a miss in the L1D or L1I, the penalty shown in Table 6-3 includes the best case L2 cache latency added to the HPW walk latency.
Table 6-3. Best Case HPW Penalties

  Event                    Penalty in Cycles
  Hit in L2                25
  Miss in L2, hit in L3    31
  Miss in both L2 and L3   20 + main memory latency (system dependent)
Table 6-4 summarizes the key parameters of the on-chip caches of the Itanium 2 processor.
Table 6-4. Cache Summary

  Size:                L1I 16 KB; L1D 16 KB; L2 256 KB; L3 up to 9 MB
  Associativity:       L1I 4-way; L1D 4-way; L2 8-way; L3 18- or 12-way
  Line size:           L1I 64 bytes; L1D 64 bytes; L2 128 bytes; L3 128 bytes
  Latency:             L1I 1 cycle; L1D 1 cycle; L2 minimum 5 cycles (7 cycles, with a 6-cycle stall penalty in the ROT stage, for instruction load use); L3 minimum 14 or 12 cycles load use
  Data read bandwidth: L1I 32 bytes/cycle; L1D 2 x 8 bytes/cycle; L2 48 bytes/cycle (1); L3 1 x 32 bytes/cycle
  Data banks:          L1I n/a; L1D 8 bytes/bank; L2 16 bytes/bank; L3 n/a
  Write bandwidth:     L1I n/a; L1D 2 x 8 bytes/cycle; L2 4 x 16 bytes/cycle; L3 1 x 32 bytes/cycle
  Fill bandwidth:      L1I 64 bytes every 2 cycles; L1D 64 bytes every 2 cycles; L2 128 bytes (assembled over 4 cycles, written in 1 cycle); L3 128 bytes in 4 cycles
  Outstanding misses:  L1I 7 (prefetch); L1D 8; L2 16; L3 16 (shared with L2, 6 write)

  1. The L2 read bandwidth is 48 bytes/cycle because the L2 can complete 2 ldfpd and 2 integer loads at a time. Any combination of 4 floating-point and integer returns may also complete every cycle.
6.4 First-Level Instruction Cache
The first-level instruction cache (L1I) is a 16 KB, four-way set associative, physically addressed cache with a 64-byte line size. Lower virtual address bits 11:0, which represent the minimum virtual page, are never translated and are used for cache indexing. The L1I can fill a 64-byte line once every two cycles. It blocks on demand fetch misses but is non-blocking for prefetch misses, allowing up to seven to be outstanding to the L2 cache.
The L1I can sustain a rate of 32 bytes of reads per cycle to support a fetch rate of two bundles per cycle. The front-end always fetches aligned 32-byte bundle pairs from the L1I. If a branch points to the middle rather than the beginning of a 32-byte bundle pair, only the second bundle will be fetched. Therefore, branch targets must be on aligned 32-byte boundaries to achieve maximum fetch bandwidth from the L1I.
The tag array is dual-ported: one port is dedicated to instruction demand fetches, the other is shared between cache snoops and instruction prefetches. Cache snoops have priority over prefetches. The data array is dual ported, one port for reading and one for fills. Additionally, special effort has been made to allow L1I reads and fills to occur simultaneously, thus there are few events that can keep an L1I miss from eventually writing into the L1I.
6.5 Instruction Stream Buffer
The Itanium 2 processor instruction stream buffer (ISB) is located between the L1I and the L2 caches. It serves as a line fill buffer for the L1I and assists in instruction prefetching. The ISB contains eight 64-byte cache lines (8 double bundle pairs of instructions) and is fully associative.
L1I lines returned from the L2, whether demand misses or prefetches, are all stored in the ISB. If a returned cache line is a demand miss, it will be forwarded to the instruction pipeline and may be moved into the L1I. The cache line remains in the ISB until an idle period when it can drain into the L1I. The ISB entry may be victimized or invalidated before this move occurs, preventing the L1I fill from occurring. Since the L1I supports both reads and fills at the same time, ISB entries empty quickly into the L1I and few ISB victimizations or invalidations will occur.
The ISB is accessed in parallel with the L1I. An ISB hit has the same latency as an L1I hit. If the
target lin e hits both the ISB and the L1I, the matching line in the ISB is invalidated.
6.6 First-Level Data Cache
The first-level data cache (L1D) is a multi-ported, 16 KB, four-way set associative, physically-addressed cache with a 64-byte line size. The L1D is non-blocking and in-order. Lower virtual address bits 11:0, which represent the minimum virtual page, are never translated and are used for cache indexing.
The L1D is designed such that there are two dedicated load ports and two dedicated store ports. These ports are fixed, but the issue logic can rearrange loads and stores within an issue group to ensure they issue to the appropriate memory port. The load ports are dual ported, meaning that any two load addresses can be read from the memory in parallel without conflict. Stores, however, access the L1D data array in 8 groups that are 8 bytes wide. Stores do have the potential for conflicts, but the store buffer coalescing hardware limits the impact such conflicts have on performance.
The access latency of the L1D is one cycle unless the result is used as the address of another load operation (i.e., pointer chasing), in which case it is two cycles. The L1D enforces a write-through, no-write-allocate policy. All stores go to the L2 cache whether they hit or miss in the L1D. If a store hits in the L1D, the data is kept in a store buffer until the data arrays become available to update the L1D. These store buffers are capable of merging store data and forwarding it to later loads, with restrictions. The L1D allocates on load misses according to temporal hints, load type, and available resources.
The L1D is highly integrated into the integer data path. All integer loads must go through the L1D to return data to the integer register file/bypass network. Consequently, integer L1D misses, after being serviced by the L2, L3, or memory, also use the L1D datapath to the integer register file and block any core load that may require the same L1D data path.
Floating-point loads do not access the L1D. This allows them to issue on any of the four memory ports with minimal restrictions. Floating-point load-pairs and any floating-point loads with ALAT interactions can only be dispersed on the load ports. Although lfetch instructions do not deliver data to the core, they can only be issued on the two load ports because they may cause an L1D fill and that capability is only provided on the two load memory ports.
An unaligned data reference exception will be raised if an unaligned integer load crosses an 8-byte boundary. See Section 5.5, “Data Alignment” for more details about alignment support.
6.6.1 L1D Loads
When a core load request gets access to the L1D, it accesses the L1D tag and data arrays at the same time. Rotators at the output of the L1D data array provide support for both little- and big-endian accesses as well as some unaligned accesses without penalty. A virtual-to-physical mapping must be in the L1 DTLB and L1D tags for a load request to be an L1D hit. If the load misses, or is forced to miss the L1D, the request is passed on to the L2 when there are sufficient resources. The miss may result in an L1D fill depending on resources and cache hints. At minimum, all L1D misses eventually update the target register. Floating-point loads and ordered operations are forced to miss the L1D, but will not cause an L1D fill.
The L1D has resources for up to 8 outstanding L1D fill requests to the L2. If more than 8 misses are outstanding, the subsequent misses will be passed to the L2, but will not result in an L1D fill. If two or more accesses miss the L1D and are accessing the same L1D line, only one will request an L1D fill, but all will be passed to the L2 cache to be satisfied.
6.6.2 L1D Stores
All store requests are passed to the L2 cache since the L1D is a write-through cache. A store that misses the L1D has no effect on the L1D. However, if the store is a hit, the L1D must update the data array so that later loads can see the new data. To support this, the store data is read from the source register and staged down the L1D pipeline. Each store pipeline (M2/M3) has independent store buffers and control logic.
When the data is ready to update the L1D data array, it is allowed to do so provided there are no conflicts. Other operations writing the data array at the same time, such as an L1D fill, a load accessing the same 8-byte bank, or a store to the same bank, may prevent the needed update. In this case, the store data is moved to a backup buffer and waits for the array to become available. The store buffer can coalesce younger stores accessing the same L1D 8-byte wide data bank. If the backup buffer cannot update the data array and is needed by a new store that it cannot coalesce, the L1D pipeline will stall to create an opportunity for the backup buffer to drain.
Given this organization, it may be better for stores targeting the same bank to issue down the same L1D pipeline. For example, it would be better to have all accesses to bank 0 issue down M2 and all accesses to bank 1 issue down M3. Thus, when it comes time to update the array, M2 and M3 will not conflict and will be allowed to update without delay.
6.6.3 L1D Load and Store Considerations
Some memory requests may affect each other even when separated in time. This section covers some possible load/load, load/store, store/load, and store/store interactions for both the L1D hit and miss cases. Each discussion has a summary and a suggested solution.
6.6.3.1 Load/Load Conflicts
Load requests that hit in the L1D have no conflicts with each other because the L1D is true dual ported. However, a load request that misses the L1D may have conflicts at the L2 due to bank conflicts. If low latency is needed, special care should be taken to avoid loads in the same issue group that access the same L2 bank, i.e., A[7:4] should be unique for L2-bound accesses.
A less obvious load/load conflict can occur when a load is waiting to issue to the L1D, but is preempted by an older load returning from the L2/L3 or system bus. Here, the older load is given priority and the younger load must wait. These events are difficult to predict and hence difficult to schedule around. However, the L2 cache will only take the M1 port if there is only one integer load to return in a cycle. Thus, a conflict can be avoided by not using the M1 port for loads. This should not be done if it adds to the critical path.
This same conflict may exist between loads and special requests that use the L2 data paths to get information to the core. These are the probe, thash, ttag, tpa, and tak instructions.
6.6.3.2 Load/Store Conflicts
A load and store conflict has very different implications depending on which occurs first, the load or the store. Although issue groups are inherently parallel, loads and stores are ordered according to their position in the issue group.
When a load precedes a store and the load is a hit, there are no conflicts. However, there are significant implications when the load precedes the store and they are both L1D misses. In this case, the load will miss the L1D and likely request an L1D fill. The store, if it is seen by the L1D before the fill associated with the load, will be an L1D miss. As such, the store will invalidate the associated L1D fill buffer entry and stop the L1D fill from occurring. This is necessary because there is no opportunity for the store to update the incoming data before the L1D fill. The Itanium 2 processor must ensure that a later load sees an earlier store, so the fill is cancelled and the merge of the store with the cache line is taken care of by the L2. If the fill occurs before the store, then the fill completes and a normal store update of the L1D is done. These statements are true if the load and store share A[49:6] (a full L1D cache line).
One method to avoid this issue is to place a use of the load result before a conflicting store. This ensures that the data is filled into the L1D. Once the L1D is filled, the store updates the L1D and proceeds on to the L2 cache. This suggestion may not be appropriate for single load accesses or when the L1D line is not accessed again after a conflicting store.
6.6.3.3 Store/Load Conflicts
When a store precedes a load, the store data must be seen by the load. In the case where the requests are L1D misses, the L2 ensures this occurs. When the operations are L1D hits, the response to the load depends on the common address bits and how many cycles separate the store and load.
Table 6-5 shows the different store/load penalties. The penalty may depend on whether the load accesses the same data as the store, a subset of the store's data, or is completely independent of the store.
The 5- and 17-cycle penalties are due both to the load being forced to miss the L1D and to the load and store facing L2 conflict conditions. The 3- and 1-cycle penalties are due to the L1D recirculating the load request until the distance between the load and store exceeds 3 cycles. This allows time for the L1D to update the data array with the store data and allows the load to proceed as if there were no store.
To avoid store/load conflicts, the store and load must be separated by more than 3 cycles. If more than 3 cycles of separation is difficult to achieve, then ensure at least 1 cycle of separation.
6.6.3.4 Integer and Floating-Point Access Interactions
Floating-point loads and stores are passed directly to the L2 and bypass the L1D. If a floating-point store occurs to a line which is resident in the L1D cache, that L1D line will be invalidated. This can cause problems when integer and floating-point data share the same L1D cache line, which is possible when both integer and floating-point data exist in the stack or as part of the same data structure. Suppose that both an integer value and a floating-point value share the same 64-byte aligned block. An integer load will bring the line into the L1D. A later floating-point store will write to the L2 and invalidate the L1D line. Thus, a subsequent load of the integer value will miss the L1D.
This may be mitigated by bringing the line back into the L1D through an lfetch after issuing the store, or by using .nt1 hints on the integer accesses to keep them from filling the L1D and scheduling them for L2 latency.
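Another way to avoid the problem is at the data-layout level. The sketch below (field names and sizes are illustrative, not from the manual) keeps integer and floating-point members on separate 64-byte L1D lines, so a floating-point store cannot invalidate the line that holds the hot integer fields.

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

struct record {
    alignas(64) int64_t key;     /* integer data, serviced by the L1D      */
    int64_t count;
    char    pad[64 - 2 * sizeof(int64_t)];
    double  weight;              /* FP data starts on its own 64-byte line */
    double  bias;
};

_Static_assert(offsetof(struct record, weight) % 64 == 0,
               "FP fields must not share an L1D line with the integers");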
6.6.3.5 Store/Store Conflicts
The L1D is true dual ported for loads, but only pseudo-dual ported for stores; two stores cannot update the exact same location in the data array at the same time (see Section 6.6.2, “L1D Stores”). The store buffer design, with coalescing, prevents most store/store conflicts for L1D store hits. The exception is that two stores cannot update the same L1D bank at the same time. Should there be a conflict, the younger store will move into a store buffer and may later update the L1D data array without impacting the L1D pipeline. However, if the store buffer is unavailable, the L1D will stall until the store buffer is drained. The conflict does not exist if either of the two stores misses the L1D. Note that the two stores do not need to access the same L1D cache line to conflict.
6.6.4 L1D Misses
When an L1D request misses, it is passed on to the L2 once the L2 has sufficient resources available to hold the new request. The resources include at least an L2 OzQ entry and an L2 data entry. An L2 data entry must be available for a store to be accepted, but a load does not require an L2 data entry. If either the L2 OzQ or the L2 data structure is full, the operation and every other operation in the issue group will stall until these resources are made available.
The L2 control logic reserves some L2 OzQ entries to ensure that when a request is allowed to leave the L1D pipeline, there is an L2 OzQ entry available for it. The logic reserves four entries for every cycle of ambiguity, and there are three cycles of ambiguity, so 12 entries are held in reserve. The result is that in some instances, such as streaming, only about 20 of the 32 L2 OzQ entries are available. The Itanium 2 processor only stalls the L1D pipeline when the L2 is full and there is a request in the L1D that needs to go to the L2.
6.6.4.1 L1D Forced Misses
There are some load instructions that are forced to miss the L1D. A floating-point load will always miss the L1D. An ordered load (ld.acq) is allowed to hit in the L1D, but if it does miss the L1D, all subsequent loads, regardless of address or ordering constraints, will be forced to miss the L1D until the L2 indicates that the ordered load is visible.
6.6.4.2 L1D Forced Invalidates
Just as some operations are forced to miss the L1D, some operations will invalidate the L1D. A floating-point store will invalidate the L1D line if it is an L1D hit. Semaphores will also invalidate the L1D line if they hit in the L1D, to ensure that ordering is maintained.
6.7 Second-Level Unified Cache
The second-level unified cache (L2) is a unified, 256 KB, 8-way set-associative cache with a line size of 128 bytes. The L2 tags are true four ported and are accessed as part of the L1D pipeline. The L2 employs write-back and write-allocate policies. The integer access latency to the L2 is 5, 7, 9, or more cycles. Floating-point accesses take 6, 8, 10, or more cycles, which includes the floating-point conversion stage. An L1I miss that hits in the L2 and uses the L2 5-cycle bypass incurs a 7-cycle latency with a 6-cycle stall penalty.
The L2 cache is non-blocking and out of order. All memory operations that access the L2 (L1D misses and all stores) check the L2 tags and allocate into a 32-entry queuing structure called the L2 OzQ. All stores require one of the 24 L2 data entries to hold data to eventually update the L2 data array. The operations issue, up to four at a time, to access the L2 data array when conflicts are resolved and resources are available. L1I instruction misses are also sent to the L2, but are stored in the Instruction Prefetch FIFO (IPF). The L2 OzQ and IPF requests arbitrate for access to the data array and the L3/system bus.
The L2 data array has 16 banks which are each 16 bytes wide. This allows for multiple simultaneous accesses provided each access is to a different bank. Floating-point loads may issue from the L2 OzQ and access the L2 data array four at a time, since the L2 has four datapaths to the FP units and register file. The L2 does not have direct datapaths to the integer units and register file; integer loads deliver data via the L1D, which has two datapaths to the integer units and register file. Stores may issue from the L2 OzQ and access the L2 data array four at a time provided they are all to different banks.
The fill path width from the L2 to the L1D and the L1I is 32 bytes. The fill bandwidth from the L3 or memory to the L2 is 32 bytes per cycle. Four 32-byte quantities are accumulated in the L2 fill buffers, then the 128-byte cache line is written into the L2 in one cycle, updating both tag and data arrays. Note that an NRU algorithm is used for cache line replacement.
The L2 cache is not inclusive of the L1D and L1I caches. The L2 maintains state information for each line, tracking whether the data stored is modified (M), exclusive (E), shared (S), invalid (I), or pending update (P). This allows the L2 to use the MESIP protocol to maintain cache coherency and track victimized lines.
6.7.1 L1D Requests to L2
Every cycle, the L1D may issue up to four requests to the L2. These requests may be L1D load/store misses, L2 recirculates, L2 fills, instruction fetches, or snoops. The L2 tags are true four ported and are part of the L1D pipeline. This allows all four L1D load or store requests to access the L2 tags and determine whether they are an L2 hit or miss before being allocated into the L2 queuing structures. This feature allows L2 misses to be identified and quickly passed on to the system bus/L3. It also lowers the latency of L2 hit requests.
All L1D load, store, and semaphore requests are placed in the L2 OzQ. All L1I instruction misses, which are issued through the L1D to the L2, are placed in the IPF, where they arbitrate against the L2 OzQ for access to the L2 data arrays and the system bus/L3. Other requests coming from the L1D, such as snoops and fills, are transitory and are not queued.
Read (load) operations of the L2 data array occur three cycles before a write (store) of the L2 data array. This timing relationship becomes important when determining load/store data array conflicts.
The L2 provides 16 fill buffers to track L2 misses. Each L2 miss may result in a modified data eviction. The L2 provides 16 victim buffers to hold victim data; however, only 6 L2 victims may be outstanding at a time.
6.7.2 L2 OzQ
The non-blocking nature of the L2 is made possible by the L2 OzQ. This structure holds up to 32 operations that cannot be satisfied by the L1D. These include all stores, semaphores, uncacheable accesses, L1D load misses, and L1D unresolved conflict cases. The L2 cache design requires fewer than 32 L2 OzQ entries to hold the maximum number of L1D requests in conflict-free cases. However, there are many conflict cases within the L2. These cases may increase request lifetimes in the L2 OzQ. Thus, the additional entries allow the L1D pipeline to continue to service hits and make additional requests of the L2 while the L2 resolves the conflicts. The conflicts increase the L2 latency and make L2 latency prediction impossible.
6.7.2.1 L2 OzQ Allocation and Deallocation
The L2 OzQ control logic allocates up to four contiguous entries per cycle, starting from the last entry allocated the previous cycle. If there are too few entries available, the L1D pipeline is stalled to prohibit any additional operations from being passed to the L2. Requests are removed from the L2 OzQ when they complete at the L2 – that is, when a store updates the data array, when a load returns correct data to the core, or when an L2 miss request is accepted by the system bus/L3.
6.7.2.2 L2 OzQ Behavior
The L2 OzQ control logic enforces architectural ordering requirements; in instances where the architecture allows it, operations may complete out of order. An operation blocked due to conflict or issue restrictions does not block younger operations from completing. This allows for high resource utilization within the L2, resulting in a performance benefit. Additionally, the out-of-order
issue allows the L2 to quickly recover from circumstances where the L2 control logic was temporarily unable to retire requests.
The out-of-order and non-blocking nature of the L2 OzQ has the effect of removing any time relationships between operations. For example, if the code generator separates two operations by 4 cycles, they will appear 4 cycles apart in the L1D pipeline. However, conflicts may keep the first operation from issuing immediately and force it to wait in the L2 OzQ. This situation may result in the second operation actually completing in the L2 before the first operation, assuming no ordering restraints, despite their 4 cycles of separation in the code stream.
The latencies of L2 hit accesses are typically 5, 7, or 9+ cycles. These several latencies arise from the fact that some operations can issue and access the L2 data array at different times depending on the resources required and what preceded the request. The lower latencies come from allowing L1D requests to access the L2 data array before they are allocated in the L2 OzQ. These are the 5- and 7-cycle L2 OzQ bypasses. All latencies listed as 9+ are for operations that cannot take these bypasses and must allocate into the L2 OzQ and then later issue from the L2 OzQ to access the L2 data array.
6.7.2.3 5- and 7-Cycle Bypass
New L1D requests may take the 5-cycle bypass of the L2 OzQ and issue directly to the L2 data array provided there are no conflicts with older operations in the L2 OzQ. This bypass may be granted to the entire issue group provided there are no conflicts within the issue group. If a conflict occurs, the older request will take the bypass while the younger requests may not. Semaphores will never take a 5- or 7-cycle bypass and have a minimum latency of 9 cycles.
L2 bank conflicts will be discussed in Section 6.7.3, but they are used here in an example of how the L2 re-orders requests to give the lowest possible latency. Conflicts are typically due to multiple requests for the same L2 data array bank (a bank conflict). Consider the L1D request (issue) group below:
ldfs f20 = [0x004] (L2 Bank 0)
ldfs f21 = [0x008] (L2 Bank 0)
ldfs f22 = [0x00c] (L2 Bank 0)
ldfs f23 = [0x010] (L2 Bank 1)
The first load will take the 5-cycle bypass. The bank conflict between the first and second loads will prohibit the second and third loads from taking the 5-cycle bypass. The fourth load will also take the 5-cycle bypass since there is no bank conflict with the older requests or architectural ordering requirements.
When a request is kept from taking the 5-cycle bypass, the next choice is the 7-cycle bypass. The bank conflict between the first and second loads will keep the second and third loads from taking the 5- or 7-cycle bypass.
The situation becomes more complicated when the instructions above are followed by more instructions to be satisfied by the L2. Consider the issue group of loads from the previous example immediately followed by the following issue group of loads:
ldfs f25 = [0x014] (L2 Bank 1)
ldfs f26 = [0x018] (L2 Bank 1)
ldfs f27 = [0x01c] (L2 Bank 1)
ldfs f28 = [0x020] (L2 Bank 2)
In this example, the f20 and f25 loads take the 5-cycle bypass. The f21, f22, and f23 loads will try to take the 7-cycle bypass. However, before they can take the bypass, the new request group with f25, f26, f27, and f28 comes along. In this issue group, f25 and f28 take the 5-cycle bypass. Doing
so blocks the older issue group from taking the 7-cycle bypass. Those requests must then issue from the L2 OzQ, which increases their minimum latency from 6 to 12 cycles.
Every cycle, the L2 OzQ searches for requests to issue to the L2 data array (L2 hits), the system bus/L3 (L2 misses), or back to the L1D for another L2 tag lookup (recirculate). See Section 6.7.3 for more information on L2 cancel conditions and Section 6.7.4 for more information on L2 recirculate conditions.
The L2 can issue up to four L2 hit accesses per cycle provided there are no conflicts among them or among earlier issued operations. The conflicts for L2 hits include L2 data array banks, register ports, L1D fills, and ordering. In the case of the L1D fill, only one such load may issue. Also, since the L2 uses the L1D register return paths for loads, only two loads can issue per cycle.
The L2 can issue only one access to the system bus/L3 at a time. An L2 miss in the same L1D request group as an L2 hit should be on the M0 port to have the shortest L3 latency. If the miss is on another port, its latency will increase slightly.
The system bus/L3 control logic will then either accept or reject the request based on system bus/L3 resources and conflict cases. Once the request is accepted, it may be removed from the L2 OzQ. The L2 OzQ pipelines L2 miss requests; it does not wait for the system bus/L3 to accept a request before issuing another request.
6.7.3 L2 Cancels
The L2 cancels generally apply only to requests taking a 5- or 7-cycle L2 OzQ bypass. This is because in most cases the issue logic considers the conflict cases and holds off issue until the conflict is resolved. The best example of holding off issue from the L2 OzQ is bank conflicts. All the information needed to avoid all possible issue-time conflicts may not be available, and some L2 OzQ issued requests must be later cancelled and re-issued. When an operation taking a bypass gets cancelled, it will re-issue from the L2 OzQ since the bypasses are only available to L1D request groups. When an L2 OzQ request is issued and then later cancelled, its latency will increase by four cycles.
The cancel logic may also cancel or block issue in more instances than expected due to issue logic simplification or unavailable information. For example, requests that are recirculated will be included in cancel/block calculations for other instructions considered for issue, or the issue logic will try to issue up to four requests that need to recirculate even though it cannot recirculate more than one request.
A 5- or 7-cycle bypass is more likely to be canceled for P3 operations because they are the youngest in the issue group, and because of events external to the L2 such as system bus/L3 returns and snoop requests. P0 requests are the least likely to be canceled because these are the oldest instructions in the issue group.
There are many reasons to cancel or block L2 OzQ issue. The reasons fall into two categories: those that are predictably avoidable and those that are not.
6.7.3.1 Predictably Avoidable Cancel Conditions
L2 data array conflicts: The data array has 16-byte wide banks. Bits 7:4 of the address determine the bank. Any requests with the same bank, regardless of cache line, are candidates for a bank conflict. Any L1D request group with multiple loads targeting the same bank will see the younger requests cancelled or the L2 OzQ issue blocked. This also applies to multiple stores targeting the same bank.
Since L2 loads and stores access the L2 data array at different times, a load and store in the same request group cannot have bank conflicts; however, there is potential for load and store bank conflicts between entirely different L1D request groups. Store requests access the data array three cycles after a load would. This means that a store issued at time X may block or cancel a load that would issue at time X+3 if they both access the same L2 bank.
The following examples show how the conflict logic considers the L2 data array access time to determine bank conflicts. The following two examples do not have bank conflicts:
ld8 r20 = [0x008] ;;
ld8 r21 = [0x010]
and:
st8 [0x008] = r20
ld8 r21 = [0x010]
A bank conflict can occur, however, between a store and a load issued three cycles later when both requests target the same L2 bank, even if neither conflicts with any other request in its issue group.
Bank conflicts due to L1D fill requirements are slightly less predictable. These bank conflicts arise from the fact that an L1D fill requires 64 bytes of data and hence four banks at a time. Additionally, the data path to the L1D can only support one fill every two cycles. These are not predictable because not all L1D misses will request an L1D fill. Section 6.6.1 has more information on which requests can require an L1D fill.
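The bank arithmetic described in this subsection can be written as a small sketch (the helper names are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* The L2 data array has sixteen 16-byte banks, selected by address
 * bits 7:4. */
static unsigned l2_bank(uint64_t addr)
{
    return (unsigned)((addr >> 4) & 0xF);
}

/* Loads in the same issue group conflict when they select the same bank,
 * regardless of cache line; a store issued at cycle X can also conflict
 * with a load issued at cycle X+3 to the same bank, since the store's
 * data-array access occurs three cycles after a load's would. */
static bool l2_bank_conflict(uint64_t a, uint64_t b)
{
    return l2_bank(a) == l2_bank(b);
}

/* For the examples above: 0x008 and 0x010 map to banks 0 and 1 (no
 * conflict), while 0x008 and 0x108 both map to bank 0 and can conflict. */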
6.7.3.2 Unpredictably Avoidable Cancel Conditions
There are some bank conflicts that are generally unpredictable. These events are tightly coupled with the unpredictable events of system bus and L3 data returns. The unpredictable cancel conditions may result in unexplained L2 latency increases.
6.7.4 L2 Recirculate
The L2 OzQ will need to recirculate requests whenever the request does not have a clear indication of hit or miss, or the required resources to complete an L2 miss are unavailable.
The most predictable reason for a request to recirculate is that the request misses a line that is already being serviced by the system bus/L3, but has not yet returned to the L2. The L2 only retires L2 hits and primary L2 misses to an L2 line. It does not retire multiple L2 miss requests; additional misses remain in the L2 OzQ and recirculate until the tag lookup returns a hit. The request then
issues from the L2 OzQ and returns data (for a load) or updates the array (for a store) as a normal L2 hit request.
6.7.4.1 lfetch and Recirculation
There is one significant exception to this secondary L2 miss recirculate condition. lfetch instructions have been optimized to avoid allocation in the L2 OzQ if they meet the following criteria:
• They are secondary accesses to an L2 miss.
• They will not fill the L1D.
Since these lfetch instructions are not allocated into the L2 OzQ, they cannot recirculate. The only way to guarantee that an lfetch instruction will not fill the L1D is to use one of the locality hints .nt1, .nt2, or .nta.
6.7.5 Memory Ordering
Itanium architecture memory ordering requires that a request with acquire semantics must reach visibility before any younger operation. A request with release semantics must not reach visibility before older operations.
The L2 issue logic enforces the architectural release ordering semantics by blocking issue of a release request until it is the oldest operation in the L2 OzQ. The issue logic may issue a release operation that is not the oldest, but then cancel and re-issue it.
If the ordered operation is not an L2 hit, the L2 control logic can speculatively make a system bus/L3 request for the line or transform the request into a prefetch. If the other L2 OzQ entries preceding the ordered request do not conflict, the prefetch will have the benefit of starting the access early without violating ordering requirements. If there are conflicts, the request is re-issued to ensure proper ordering.
Since the L2 is responsible for maintaining architectural ordering, all loads that are in the shadow of a ld.acq must be seen by the L2. Thus, they are forced to miss the L1D until the ld.acq has achieved visibility.
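The following sketch shows the acquire/release pattern these rules enforce, written with C11 atomics as the high-level analogue of ld.acq and st.rel. The assumption that an Itanium compiler emits exactly those instructions for these orderings is compiler dependent; the function and variable names are illustrative.

#include <stdatomic.h>

static _Atomic int flag;
static int payload;

void producer(int value)
{
    payload = value;                 /* older store must be visible first */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void)
{
    /* acquire: no younger load may become visible before this one */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    return payload;
}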
6.7.6 L2 Instruction Prefetch FIFO
The Instruction Prefetch FIFO (IPF) is an 8-entry queue that holds L1I requests. Up to seven of these eight entries may contain prefetch requests. One slot is always reserved for a demand request. Just like the L2 OzQ, the IPF can have requests that are L2 hits, L2 misses, bank conflicts, or recirculates. The IPF faces the same issue restrictions for each of these requests as the L2 OzQ does. However, unlike L2 OzQ hit requests, only one IPF L2 hit may be issued to the L2 data array per cycle. This is due to the fact that all IPF requests return data to the L1I cache, and the data path back to the L1I can only support one fill per cycle.
Since the L2 supports both instruction and data accesses, the L2 issue control logic chooses among instruction and data requests according to Table 6-6.
Some memory requests may affect each other even when separated in time. This section covers some possible load/load, load/store, store/load, and store/store interactions for the L2 cache. Since the L2 OzQ allows out-of-order issue, the L2 OzQ will re-order requests to fully utilize the L2 data arrays in satisfying requests. As a result, any static timing placed in the code stream may not have the desired effect on L2 behavior; however, there are still actions the code generator can take to increase performance.
6.7.7.1 Effective Releases
The L2 cache deals with load/store, store/load, and store/store conflicts by ensuring that the issue order in the L2 OzQ is the same as the program order of the operations. The L2 control logic leverages the architectural ordering mechanisms that already exist to address the possible conflicts. When the L2 OzQ accepts a new request, it checks physical address bits 49:2 against all older incomplete requests in the L2 OzQ. If a match exists and a conflict results, the control logic applies architectural release semantics to the incoming request. This is called an effective release. The effective release association remains until the operation completes, and it causes the L2 issue and conflict logic to cancel the request until it is the oldest request in the L2 OzQ.
Table 6-7 summarizes the addresses and operation types that can experience an effective release.
Table 6-7. Effective Release Operations

  Incoming Request   Matching Request   Effective Release
  Load               Load               No
  Load               Store              Yes
  Store              Load               Yes
  Store              Store              Yes
6.8 System Bus/L3 Interactions
All requests that the L2 cannot satisfy reach the system bus/L3 as a Read Line (RL) or Read For Ownership (RFO) request. The RL request is used for code and common load operations. The L2 may receive the line in the M, E, or S state for RL requests depending on the L3 state or the snoop response provided on the system bus. The RFO request indicates the L2 intends to modify the line to store data. Stores, as well as lfetch.excl and ld.bias instructions, result in Read For Ownership requests. These requests will always exist in the M state in the L2. Table 6-8 summarizes this behavior.
Table 6-8. System Bus/L3 Requests and Final L2 State

  System Bus/L3 Request   L2 Request         Final L2 State by L3 State (S / E / M)   Final L2 State by System Bus Snoop (HIT / No Hit)
  Read Line               Code Read          S / E / M                                S / S
  Read Line               Data Read          S / E / M                                S / E
  Read For Ownership      store miss         M / M / M                                M / M
  Read For Ownership      lfetch.excl miss   M / M / M                                M / M
  Read For Ownership      ld.bias miss       M / M / M                                M / M
The L2 may make partial line requests of the system bus, but this is only for UC attribute accesses and is not part of this discussion because they are neither coherent nor a concern for performance.
The L2 will make one RL or RFO request to the system bus/L3 per cycle. Each of these requests will have a dirty victim associated with it when the L2 way chosen for victimization is in the M state. The L2 issues a request to the system bus/L3 and then later confirms the request. This protocol exists to allow issuing requests to the system bus/L3 that are later cancelled and/or recirculated. The L2 may make a request, but will not confirm a request if there are insufficient resources available. The L2 will not issue two requests to the same L2 line. A request that is not confirmed will wait at least four cycles before it is issued again.
The system bus/L3 will decide whether the request is accepted and inform the L2, based on address conflicts and the resources available to support the read request and the associated dirty victim. The L2 will then deallocate the request from the L2 OzQ if the system bus/L3 accepts the request. An L2 request may be rejected (see Section 6.10). A rejected request will wait at least four cycles before it is issued again.
When the system bus/L3 is ready to deliver data to the L2, this is indicated to the L2 and the L2 prepares to receive the data. The data returns come 32 bytes (a chunk) at a time from the system bus/L3, with the critical chunk first. L3 returns have higher priority than system bus data returns and come consecutively. In many instances, an L2 miss may also cause an L1D fill. Since the L1D line width is only 64 bytes, there is sufficient data to cause an L1D fill when only two chunks have been received from the system bus or L3. These requests must access the L1D pipeline and may block core requests from entering the L1D pipeline during that cycle. If there are two L1D fills for an L2 miss, another fill will occur when the last two chunks have been received by the L2.
6.9 Third-Level Unified Cache
The third-level unified cache (L3) is a unified, 9 MB, 18-way set associative cache with a 128-byte line size. Some versions of the Itanium 2 processor may have L3 cache sizes of 6, 4, 3, or 1.5 MB. Latencies and set-associativity may vary between the different cache sizes and models. See Chapter 2 for exact latency and set-associativity numbers. These caches are alike in all other respects.
All L3 accesses are for the entire 128-byte line – no partial line accesses are supported. The access latency is 12, 14, or more cycles. This latency depends on how quickly the L2 issues the request and the activity of the L3 at the time of the request.
On the Itanium 2 processor, L3 accesses are fully pipelined and thus have a much higher effective bandwidth than the L3 on the Itanium processor. The L3 tag array is single ported and is pipelined to allow a new tag access every cycle. The L3 data array is also single ported, but requires up to four cycles to transfer a full line of data to the L2 cache, or to the system bus in the case of an L3 dirty victim.
The L3 is non-blocking and has an 8-entry queue to support multiple outstanding requests. This queue orders requests and prioritizes them among tag read/write and data read/write to achieve the highest performance given the operations required.
6.10 System Bus
The system bus of the Itanium 2 processor with 3 MB and 6 MB L3 cache operates at 200 MHz and is comprised of multiple sub-busses for various functions, such as address/request, snoop, response, data, and defer. The data bus is 128 bits wide and operates source synchronously, achieving a peak bandwidth of 400 million data transfers, or 6.4 GB, per second. The Itanium 2 processor with 9 MB L3 cache has multiple system bus speed options: 200 MHz, 266 MHz, and 333 MHz. The operating frequency is the only change in the system bus. These faster speeds allow peak bandwidths of 8.5 GB and 10.6 GB per second.
The system bus control logic consists of an In Order Queue (IOQ) and an Out of Order Queue (OOQ), which track all transactions pending completion on the system bus. The IOQ tracks the in-order phases of a request and is identical across all processors. The OOQ holds only the processor's own requests that have been deferred. The IOQ can hold 8 entries while the OOQ can hold 18 requests, which allows a maximum of 19 transactions to be outstanding on the system bus from a single Itanium 2 processor.
L2 requests that have not been completed (i.e., have not accessed the L3 nor completed a data phase on the system bus) are maintained in structures of the following sizes:
• 16 outstanding read requests from the L2.
• 6 outstanding dirty writeback requests from the L2.
• 6 outstanding L3 writebacks (i.e., replacement of a dirty line) to be serviced by main memory.
• A combination of 16 outstanding L3 writebacks or L3 castouts (i.e., replacement of a clean line; depending on the coherence mechanism, this might incur memory traffic) to be serviced by main memory.
• Two 128-byte coalescing buffers to support WC stores.
Read transactions (this includes store instructions that miss the L2) are placed in one of the 16 bus request queues (BRQs). Each of these may then be sent to the L3 to see if the L3 can satisfy the request. In the case where the request is also an L3 miss, the request is scheduled to generate a system bus request (either Bus Read Line or Bus Read Invalidate Line for stores). When the system bus responds with the data, the line is written to the L2 and L3 based on its temporal locality hints and type of access.
Branch Instructions and Branch Prediction 7
The Itanium 2 processor employs both static and dynamic methods for branch prediction. For static branch prediction, the Itanium 2 processor uses the hint completers from the branch instructions. For dynamic prediction, the Itanium 2 processor uses several hardware structures.
This chapter describes how branch prediction affects software execution. The front-end instruction fetching is decoupled from the back-end instruction execution through an 8-bundle instruction buffer. For more detail regarding the instruction buffer, see Appendix A, “Itanium® 2 Processor Pipeline.” Throughout this chapter, the term ‘bubble’ refers to cycles for which the front-end cannot deliver useful data; the penalty may never translate to a loss in performance if there is another event blocking the back-end from retiring instructions. In the case where the back-end is waiting for the front-end, the penalty is a stall.
Table 7-1, “Branch Prediction Latencie s” summarizes bran ch predictio n la tencies for the It anium 2
processor. Notice that in the case of a correctly predicted IP-relative branch, there is no front-end
1. The + refers to the fact that some branches may cause the front-end to stall. This is only for incorrectly predicted short (up to 16 bundles) forward branches. The additional latency will be at most 8 cycles and may be less depending on how many branches were seen by the front-end after the mispredicted branch was seen by the front-end.
The branch prediction microarchitecture in the Itanium 2 processor is significantly different from that of the Itanium processor. Branch prediction is closely tied to the L1I cache, which allows for the zero-bubble resteer.
Single-cycle branches experience a stall once every two cycles (i.e., a one-cycle loop takes four cycles to make three iterations). Single-cycle loops should be avoided. It is also possible that a stall may occur if several branches are encountered in succession. For example, if the front-end sees a branch every cycle for three cycles, one cycle of stall may occur.
7.1 Branch Prediction Hints
Information about branch behavior can be provided to the processor to improve branch prediction. This information can be encoded through branch hints as part of a branch instruction. Branch hints do not affect the functional behavior of the program and may be ignored by the processor.
Only hints specified within a branch instruction are used for branch prediction. Hints on the brp or mov br instructions are ignored by the branch predictor.
For the Itanium 2 processor, branch hints are .sptk, .spnt, .dptk, or .dpnt (sp=static prediction, dp=dynamic prediction, tk=taken, nt=not taken). The terms "static" and "dynamic" hints refer to the code generator's confidence in the branch behavior. For example, .sptk means the code generator is very sure that the branch will be taken, whereas .dptk means that the code generator thinks the branch will be taken, but it is not as confident.
The impact of these branch hints depends on other branches in the two-bundle window and other branch information maintained in the processor. The consequence is that a branch with a .dpnt hint may be predicted taken the first time it is seen. The processor will quickly recover from this and correctly predict this branch in the future.
The use of .dpxx is recommended as the default, unless the loop is a ctop or cloop, in which case .spxx is recommended.
The .spxx hint is also important for very short (1 or 2 cycle) loops. With static prediction hints, these loops will not wait for the machine to generate a new hint prediction, but will instead use the taken or not-taken direction from the static hint. If dynamic hints are used in such short loops, the processor may stall on each iteration in which the branch prediction requires updating.
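At the source level, the code generator's confidence is usually expressed through compiler-specific mechanisms rather than written by hand; for example, GCC's __builtin_expect can steer static prediction. Whether a given compiler turns such hints into .sptk/.spnt versus .dptk/.dpnt completers is compiler-dependent, so the following is only an illustrative sketch:

    /* likely/unlikely macros built on GCC's __builtin_expect; the mapping
       to IA-64 branch hint completers is an assumption, not guaranteed. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_positive(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (unlikely(a[i] < 0))   /* rare case: hint the branch as not taken */
                continue;
            s += a[i];
        }
        return s;
    }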
The branch prediction hints have an anomalous behavior when used in .bbb bundles. Normally, the branch hints of each branch instruction affect only that specific branch. However, a .bbb bundle will always use the branch hints provided on the slot 0 branch for the slot 1 and slot 2 branches. There are a few ways to avoid this. The first is to break up the .bbb bundle into two other bundles. Unfortunately, this may not be good for code density, and other solutions, such as using a .dpxx hint or a .spxx hint with a .clr completer on the slot 0 branch, should be considered.
7.2 Indirect Branches
The predicted targets of indirect branches, other than returns, are extracted from the source branch register of the indirect branch rather than from a hardware table. This has several implications. There is always a penalty for indirect branches on the Itanium 2 processor. A two-cycle front-end bubble is seen for a correctly predicted indirect branch. An incorrect taken/not-taken or address prediction costs 6 or more pipeline stalls. The address prediction is based on the contents of the branch register referenced by the branch as seen by the front-end. An in-flight update to the branch register will not be seen by the front-end, and the predicted target may be wrong. Correct target prediction requires that the branch register write precede the indirect branch by several cycles. This distance varies since the front and back ends of the pipeline are decoupled. A code generator can minimize the impact of this in the following ways (a sketch follows the list):
• Separate the write and indirect branch by at least 6 front-end L1I cache accesses.
• Add an additional write to the branch register above the true branch register write to hint the target.
• Use different branch registers for each indirect branch instance to minimize conflicts with other indirect branches.
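A C-level sketch of the first suggestion: compute the indirect target early and give the compiler independent work to schedule between the branch-register write and the branch itself. The dispatch table and the work loop here are purely hypothetical:

    typedef void (*handler_t)(long *);

    static void op_add(long *d) { *d += 1; }
    static void op_neg(long *d) { *d = -*d; }

    static handler_t table[2] = { op_add, op_neg };   /* hypothetical dispatch table */

    void dispatch(int op, long *data, long n)
    {
        handler_t h = table[op & 1];   /* target known early: the branch register write can be hoisted */

        for (long i = 0; i < n; i++)   /* independent work between the write and the indirect branch */
            data[i] += i;

        h(&data[0]);                   /* indirect branch with a stable, early-written target */
    }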
7.3 Perfect Loop Prediction
In many cases, the perfect loop predictor (PLP) can correctly predict the back-edge branch of a counted loop (i.e., cloop or ctop type branches), including the fall-through instance as well as the loop-back iterations. Unlike the Itanium processor, the Itanium 2 processor does not need brp to accomplish this.
The Itanium 2 processor uses the PLP only for the final iteration of the loop. The initial loop predictions are decided on dynamic or static information based on the hints used.
If the last branch of a loop is predicted correctly, there might still be a one- or two-cycle bubble in order to get this correct prediction. The smaller the number of loop iterations, the more likely it is that there will be a two-bubble resteer. Conversely, the larger the loop iteration count, the more likely it is that there will be a zero-bubble resteer. The PLP uses the current values of ar.lc and ar.ec for prediction, so any writers to these registers should be well ahead of the counted loop branch to assure correct prediction.
In some instances, the Itanium processor required that ar.ec be set to 1 for correct prediction. The Itanium 2 processor does not have this same requirement and actually expects ar.ec = 0 when there is no epilog.
8 Instruction Prefetching
The Itanium 2 processor supports several forms of instruction prefetching. Instruction prefetch is defined to be the act of moving instruction cache lines from higher levels of cache or memory into L1I. Streaming prefetching initiates hardware prefetching of the next cache lines, either sequential or at the target of predicted taken branches. Hint prefetching allows software to specify a particular line, or lines, to be prefetched. On the Itanium 2 processor, it is expected that instruction prefetching will be an effective way to reduce instruction cache misses since the code generator has a wide degree of control over the prefetch agent and the Itanium 2 processor cache design specifically considered prefetching.
8.1 Streaming Prefetching
Streaming prefetching is initiated by using the .many completer on branch instructions. If the front-end processes a branch with a .many completer, the prefetch engine will continuously issue prefetch requests, at one request per cycle, for subsequent instruction lines, into the prefetch pipeline. The prefetch request is checked against the L1I and the L1 ITLB. If it hits in the L1 ITLB and misses in the L1I, the request is sent to the L2; otherwise it is discarded. The lines are prefetched starting at the branch target plus 64 or 128 bytes (depending on the alignment of the branch target). Streaming prefetching continues until one of the following stop conditions occurs:
• A predicted-taken branch is encountered by the front-end
• A branch misprediction occurs
• A brp instruction without the .imp completer is encountered by the front-end
The L1I cache design allows both fill and lookups to occur at the same time. Thus, the lifetime of a request in the ISB is typically very small. This allows the prefetch engine to prefetch instructions with little chance that the line will get overwritten before it is used. If the branch is predicted taken by the front-end, prefetching will be initiated in the front-end. If the branch is incorrectly predicted not-taken by the front-end, prefetching will be initiated by the back-end when the prediction is corrected. However, if the opposite case occurs, i.e., the branch is incorrectly predicted taken in the front-end, any streaming prefetch in progress is terminated and it will NOT be restarted when the back-end corrects the prediction. Finally, if a branch with a .many completer is incorrectly predicted taken by the front-end, the new stream it starts will be terminated when the prediction is corrected by the back-end.
A .many prefetch stream may be halted by an L1I TLB miss. The event does not cancel the prefetch, but suspends the prefetch until the L1I TLB fill completes, at which point the prefetch continues until stopped for one of the reasons described above.
1. A brp instruction suggests that an associated br.many is around the corner. The assumption is that the prefetch engine has already prefetched past the br.many, and additional prefetches would be useless. The reason that a brp.imp does not terminate prefetching is related to Itanium processor code. In the Itanium processor, brp.imp instructions are used to predict branches and might not have any association with a br.many.
Table 8-1. Summary of Streaming Prefetch Actions
  Actually Taken / Predicted Taken: Any current streaming prefetch is stopped in the front-end. If the branch has a .many completer, a new stream is started by the front-end.
  Actually Taken / Predicted Not-Taken: Any current streaming prefetch is stopped in the back-end. If the branch has a .many completer, a new stream is started in the back-end.
  Actually Not-Taken / Predicted Taken: Any current streaming prefetch is stopped in the front-end. It is NOT restarted when the misprediction is detected. If the branch has a .many completer, a new stream is started in the front-end. It is terminated when the misprediction is detected by the back-end.
  Actually Not-Taken / Predicted Not-Taken: No effect on any current streaming prefetch. A new stream is NOT started.
8.2 Hint Prefetching
Hint prefetching is initiated with the brp or mov br instructions. Unlike the Itanium processor, the Itanium 2 processor prefetch initiation does not affect branch prediction state. However, it has this same restriction as the Itanium processor: brp instructions must be in the last instruction slot (slot 2) of a bundle in order to be processed; otherwise, they are ignored. brp instructions have no associated branch prediction effects. Table 8-2 illustrates the prefetching mechanisms associated with the branch hints.
Table 8-2. Prefetch Mechanisms
  brp.(sptk,loop,dptk).few: Normal prefetch of 1 cache line generated.
  brp.(sptk,loop,dptk).many: Prefetches 2 cache lines from target.
  brp.(sptk,loop,dptk).imp.few: Flushes prefetch virtual address buffer (PVAB) and prefetches 1 cache line.
  brp.(sptk,loop,dptk).imp.many: Flushes prefetch virtual address buffer (PVAB) and prefetches 2 cache lines.
  move_to_br.(sptk,dptk).few: All other fields ignored; prefetches 1 cache line.
  .many hint on a branch: Streaming prefetches triggered off predicted-taken IP-relative branches.
A .few completer will prefetch one-half or one L2 line, depending on the alignment of the associated branch target, and a .many completer will prefetch 1.5 or 2 L2 lines, depending on the alignment of the associated branch target. Hint prefetches are sent to the 8-entry prefetch virtual address buffer (PVAB). Up to 2 hint prefetches can be sent to the PVAB in each cycle.
In a given cycle, if the prefetch pipeline is not stalled and if a br.many is not active, a prefetch request is removed from the PVAB. The prefetch request is then checked against the L1I and the L1 ITLB. If it hits in the L1 ITLB and misses in the L1I, it is sent to L2; otherwise it is discarded. The intent is to use hint prefetches to prefetch the first "chunk" of instructions at the target of a branch and to use streaming prefetching to prefetch the subsequent instructions. In order to fully hide the latency of an L2 hit, a hint prefetch should precede a branch by 9 fetch cycles. If a br.many is preceded by a brp.many, there will be some overlap between the prefetches generated by the two instructions. While this overlap is wasteful, there is benefit in having more lines prefetched earlier (as opposed to presaging the br.many with a brp.few). brp.few prefetches might be useful in conjunction with streaming prefetches as described in Section 8.1.
8.3 Prefetch Flush Hints
Certain forms of brp instruction have the side effect of flushing the contents of the PVAB and possibly the prefetch pipeline. These are provided to give the compiler some control over the state of prefetching.
• brp.few.imp - will remove all brp.few prefetches from the PVAB (but not any already in the prefetch pipeline).
• brp.exit.imp - will remove all prefetches from the PVAB and those in the prefetch pipeline, and additionally will stop the streaming prefetch engine (and therefore will stop br.many, brp.many and brp.exit prefetching).
• brp.* - (brp without the .imp completer) will cancel any streaming prefetches initiated by a br.many instruction. The intent is to allow the compiler to stop a br.many from prefetching too far.
The flushing side effect is in addition to the normal behavior of these prefetch instructions. Note that flushing a prefetch once it reaches the pipeline may not be effective (i.e., the prefetch may still be issued to the L2 and beyond).
8.4 The brl Instruction
The Itanium 2 processor implements the brl instruction, which provides 64-bit relative branches. These long relative branch instructions have less cost than on the Itanium processor, but they are higher cost than the short relative branch br instructions. Specifically, the branch prediction mechanisms in the Itanium 2 processor do not calculate the predicted target correctly for brl instructions unless the target is set when the L1I cache line is allocated. Thus, if a brl prediction target is aliased with another branch in the bundle pair, the target will be incorrect and the branch will see a full branch mispredict penalty, and it will not be fixed.
Despite this cost, the brl instruction is much more efficient than multiple short jumps. However, the linker should place brl instructions only where they are specifically needed.
9 Optimizing for the Itanium® 2 Processor
This chapter is a summary of conclusions that can be drawn from important points noted in earlier chapters. These guidelines are not applicable in all situations, and profiling should be used to guide the use of optimizations.
9.1 Hints for Scheduling
Observing the following heuristics whenever possible will minimize the chances of implicit stops or unexpected dispersal-related stalls:
• Schedule the most restricted instructions early in the bundle. This lessens the chance that a generic subtype instruction will consume a port which is needed by a later, more restricted instruction.
• In some cases, placing A-type instructions in I slots rather than M slots might achieve denser bundling. If this is done, place any I-type instructions (which must go in I slots) earlier in the issue group when possible. This way, the later instructions in I slots can be issued to available M ports. Since not all processors support this (such as the Itanium processor), it is preferable to place A-type instructions in M slots.
• Most floating-point load types can be issued to any of the four memory ports, not just M0 and M1. Control speculation-related (advanced and check) and pair floating-point loads are the exceptions, which can only be issued to ports M0 and M1. When scheduling a mix of FP loads, advanced FP loads, integer loads, and lfetch instructions, ensure that regular FP loads are scheduled late in the issue group so that, if necessary, they can be issued to the M2 and M3 ports. This frees the M0 and M1 ports needed by lfetch instructions or more restrictive load types.
• Avoid using nop.f. It risks unintended stalls due to outstanding long-latency instructions. For example, a write to FPSR is a multiple-cycle operation. Any floating-point operation, including a nop.f, will stall until the write is completed.
• On the Itanium processor, MFI was a commonly used template to facilitate dual issue. There are many other dual-issue template pairs on the Itanium 2 processor, so using this template should no longer be necessary.
9.2 Optimal Use of lfetch
The lfetch instruction is key to achieving good performance on the Itanium 2 processor in many memory-related situations. lfetch often allows the L1D to be a hit for integer data. This has the benefit of allowing the L1D cache to filter requests to the L2. Many L2 conflicts can be avoided by ensuring integer loads hit in the L1D and thus are never seen by the L2. The fewer requests the L2 sees, the fewer requests conflict.
lfetch instructions require careful use. Carelessly placing lfetch instructions may lower performance. Refer to Chapter 6, "Memory Subsystem" for details regarding the Itanium 2 processor cache structures. The following guidelines were developed with regard to the memory subsystem (a C-level sketch follows the list):
• The maximum number of outstanding lfetch operations to L3 or memory, the sum of both data and instruction requests, may not exceed 16.
• lfetch instructions are restricted to memory ports M0 and M1 only, while FP loads (not ldfpd or ldfps) can be issued on any of the four memory ports. Therefore, when mixing lfetch instructions with FP loads, lfetch instructions should be scheduled early in issue groups. For example, if two FP loads and an lfetch are to be scheduled in the same cycle, the lfetch should be scheduled in the first bundle so that it will be issued on one of the first two memory ports. If the two FP loads are scheduled first, the hardware will insert an implicit stop before issuing the lfetch instruction.
• The Itanium 2 lfetch.excl instruction will bring data into the L2 cache in the M state. The .excl completer should only be used when the data brought in by the lfetch will shortly be modified by store instructions.
• The Itanium 2 lfetch instructions will not bring the data into the cache if a DTLB entry providing translation and protection information is not available. To ensure the lfetch instruction completes a HPW walk and possibly generates a TLB translation or protection fault, the .fault completer should be used. Since there may be high cost associated with these events, the .fault completer should not be used for speculative addresses.
• lfetch instructions may have effects in the cache hierarchy that make their use costly. These effects include:
— Acquiring L2 resources such as the L2 OzQ.
— Arbitration for access to the L2 data arrays, and thus becoming a candidate for an L2 bank conflict.
— Recirculation of the lfetch in the case of a secondary L2 miss.
The effects of the L2 recirculate for a secondary L2 miss can be mitigated by placing .nt completers on the lfetch. The .nt hints keep the lfetch from causing an L1D fill and allow the lfetch to be removed from the L2 OzQ.
In the case where an lfetch hits the L2, it takes L2 OzQ resources, causes other requests to cancel, and may get canceled itself, as if it actually reads the L2 data array regardless of the .nt hint or actual need to fill the L1D.
Applying .nt hints to lfetch requests also reduces the L2 banks required to satisfy the lfetch to only 1 bank. For temporal lfetch instructions, 4 banks may be required, and such lfetch requests may have a significantly increased probability of causing L2 bank conflicts.
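As a rough C-level illustration of the guidelines above, the sketch below prefetches integer data a fixed distance ahead of its use, using GCC's generic __builtin_prefetch. How (or whether) a particular compiler lowers this to lfetch, and with which completers, is compiler-dependent, and the prefetch distance is purely a tuning assumption:

    /* Sketch of software prefetching for an integer array walk. */
    long sum(const long *a, long n)
    {
        enum { AHEAD = 16 };              /* prefetch distance in elements (tuning assumption) */
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }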
9.3 Data Streaming
There are several methods to handle long, high-bandwidth data streams. This section lists several possible solutions and discusses some of the benefits and costs of each.
9.3.1 Floating-Point Data Streams
Floating-point data resides in the L2 cache. Here, the lfetch.fault.nt1 instruction should be issued only once per L2 cache line for the source, and the lfetch.fault.excl.nt1 instruction should be issued only once per L2 cache line for the destination. The .fault completer is used to ensure that the data enters into the cache hierarchy, even if it results in an L2 DTLB miss or VHPT miss. The .nt1 completer ensures that the floating-point data will not displace data residing in the L1D. The .nt1 completer also allows an lfetch instruction that is a secondary L2 miss to avoid allocation in the L2 OzQ. This is important for situations where the design of the data streaming code cannot avoid additional requests to an L2 line without performance loss. The .excl completer for the destination stream will ensure the data is ready to be modified.
When data is accessed as an L2 hit, care should be taken to avoid L2 bank conflicts among request groups. This is necessary to ensure L2 5- and 7-cycle bypasses are available. Latency is not generally a concern for floating-point code; however, in streaming situations, the lifetime of an operation in the L2 OzQ coupled with the size of the OzQ may cause the L2 control logic to consider the OzQ full, stalling the core. A lower latency means a shorter lifetime in the OzQ, and effectively more OzQ entries are available.
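A minimal C sketch of a floating-point streaming copy with one prefetch per 128-byte L2 line, using GCC's generic __builtin_prefetch as a stand-in for lfetch.fault.nt1 (source) and lfetch.fault.excl.nt1 (destination). The locality argument and the use of rw=1 to suggest an exclusive prefetch are assumptions; the actual lfetch forms emitted are compiler-dependent:

    void stream_copy(double *dst, const double *src, long n)
    {
        enum { DBL_PER_L2_LINE = 16 };     /* 128-byte L2 line / 8-byte doubles */
        for (long i = 0; i < n; i++) {
            if ((i % DBL_PER_L2_LINE) == 0 && i + DBL_PER_L2_LINE < n) {
                __builtin_prefetch(&src[i + DBL_PER_L2_LINE], 0, 1); /* source line, for reading */
                __builtin_prefetch(&dst[i + DBL_PER_L2_LINE], 1, 1); /* destination line, for writing */
            }
            dst[i] = src[i];
        }
    }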
9.3.2 Integer Data Streams
Integer data streams are more complicated than floating-point streams because, in some instances, getting the data into the L1D will be important for performance. Streaming from the L1D presents several problems. First, each load operation requires integer register return resources even if it misses in the L1D. This makes it difficult for L1D misses to return data to the register file without impacting the flow of new L1D misses. Second, each fill operation will take an additional cycle to complete. Third, the need to fill the L1D eliminates an opportunity for the L2 OzQ to remove secondary L2 miss lfetch instructions. This is significant because the L1D line size is half of the L2's, and one lfetch per L1D line will result in at least one secondary L2 miss access for every L2 line, thus limiting L2 OzQ throughput.
One approach would be to use three separate lfetch instructions. An lfetch.fault.nt1 would bring the data into the L2. Later, when the data is in the L2, lfetch.fault instructions can hit in the L2 cache and bring the data into the L1D. This makes the lfetch instructions asymmetric and requires several load memory slots.
An optimization to the three-lfetch approach above would use only two separate lfetch.fault instructions, but stage them such that the first will bring data into the L2 and the L1D. Then, when the L2 is filled from the first request, the second lfetch can bring the data into the L1D without being a secondary L2 miss (the L2 is filled, so the lfetch is an L2 hit). This frees an additional load memory slot and makes the lfetch instructions re-usable.
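A C-level sketch of this staged, two-prefetch idea, again using the generic __builtin_prefetch as a stand-in for the two lfetch.fault instructions. The far and near distances are tuning assumptions, and the mapping to lfetch forms is compiler-dependent:

    long walk(const long *a, long n)
    {
        enum { FAR = 64, NEAR = 16 };      /* elements ahead (assumed distances) */
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (i + FAR < n)
                __builtin_prefetch(&a[i + FAR], 0, 1);   /* stage 1: bring the line toward the L2 */
            if (i + NEAR < n)
                __builtin_prefetch(&a[i + NEAR], 0, 3);  /* stage 2: should hit the L2 and fill the L1D */
            s += a[i];
        }
        return s;
    }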
An outstanding L1D fill may be invalidated by a store to the same line. Using lfetch instructions for even small data streams can result in a significant performance increase, provided the lfetch fills the L1D before the store to the line is seen.
Also, since loads that hit in the L1D never allocate into the L2 OzQ, using lfetch instructions to ensure an L1D hit may also help performance by limiting the L2 OzQ to only store data and lfetch requests. This relieves pressure on the limited OzQ resources and reduces the possibility of conflicts among OzQ entries.
9.3.3 Store Data Streams
Since store instructions are always seen by the L2, there is no benefit to bringing store destination data into the L1D. There are many benefits to using an lfetch.fault.excl.nt1 instruction for destination streams. For instance, the .nt1 hint allows secondary L2 misses to be removed, and the core is not slowed by the L1D fills. Also, the .excl hint ensures that the L2 data is ready to receive the store data.
9.4 Control and Data Speculation
The Itanium 2 processor reduces the costs associated with control and data speculation in the ALAT via fast deferral and low-latency fix-up. As such, additional performance may be realized by tuning the code generation to aggressively use speculation. Some speculation considerations are specific to the Itanium processor and do not apply to the Itanium 2 processor. If speculation is more aggressive, then more calls to fix-up code will be encountered. For the Itanium processor, the fix-up code was often moved to cold pages very far from the actual speculation. The heuristic for placing fix-up code near or far from the point of speculation should be revisited and should include profile information in the decision matrix.
9.5 Known L2 Miss Bundle Placement
Given the Itanium 2 processor design, it is slightly better to put instructions which are known to miss the L2 cache on memory port 0 (allocate the first memory op in the issue group). This will allow, when possible, a speculative request to be made to the L3. If the memory request that needs to go to the L2 is in M1, M2, or M3, then it will need to wait until it can be reissued out of the L2 OzQ.
9.6 Avoid Known L2 Cancel and Recirculate Conditions
The most predictable L2 cancel is an L2 bank conflict. These can be avoided by carefully organizing L2 accesses or by bringing the data into the L1D with an lfetch instruction and avoiding the L2 entirely.
The most predictable L2 recirculate is for secondary L2 miss accesses. These can be avoided by using the lfetch instruction to bring data into the L2. Only lfetch instructions that do not fill the L1D are not counted as secondary accesses. If an lfetch is the primary L2 miss and a load is the secondary L2 miss, then the load will still need to recirculate, as it must eventually return data to the core. It is important to schedule L2 miss lfetch instructions far in front of the load to avoid this situation.
9.7 Instruction Bundling
The Itanium 2 processor can completely issue almost all bundle template combinations. Provided the ILP is available, choosing the correct bundling and instruction scheduling may benefit performance. There are two concerns here: place more restrictive instructions early in the issue group and, where possible, transform restrictive instructions into less restrictive ones. The simple instruction nop.i must issue to an I port; however, an add can issue on either an M or I port. The nop.i should be scheduled early to ensure it receives its needed I port. An alternative would be to replace the nop.i with an instruction that is effectively a nop (such as add r3=r0, r3), which can issue on either an I or M port.
9.8 Branches
The following branch and branch prediction related optimization suggestions are covered in detail in Chapter 7, "Branch Instructions and Branch Prediction." They are summarized here.
9.8.1 Single Cycle Branches
The Itanium 2 processor cannot support single-cycle loop branches without some penalty in some iterations of the loop. Unroll the loop to at least two cycles to get the expected performance. This may come at a small cost to code size.
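A minimal C-level sketch of that unrolling: doing two iterations' worth of work per loop-back branch keeps the loop body at two cycles or more, at a small code-size cost.

    long sum_unrolled(const long *a, long n)
    {
        long s0 = 0, s1 = 0;
        long i = 0;
        for (; i + 1 < n; i += 2) {    /* two elements of work per loop branch */
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)                     /* handle an odd trailing element */
            s0 += a[i];
        return s0 + s1;
    }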
9.8.2 Perfect Loop Prediction
Perfect loop prediction only predicts the final iteration of the loop. As such, the Itanium 2 processor considers the branch hints in predicting the other iterations. The Itanium 2 processor requires ar.ec to be set correctly (i.e., if there is no epilogue, set ar.ec=0, not 1 as the Itanium processor expected).
9.8.3 Branch Targets
Branch targets should be aligned on 32-byte boundaries to ensure that the front-end can deliver two bundles per cycle to the back-end.
10 Performance Monitoring
10.1 Introduction
This chapter defines the performance monitoring features of the Itanium 2 processor. The Itanium 2 processor provides four 48-bit performance counters, 100+ monitorable events, and several advanced monitoring capabilities. This chapter outlines the targeted performance monitor usage models, defines the software interface and programming model, and lists the set of monitored events.
The Itanium architecture incorporates architected mechanisms that allow software to actively and directly manage performance-critical processor resources such as branch prediction structures, processor data and instruction caches, virtual memory translation structures, and more. To achieve the highest performance levels, dynamic processor behavior can be monitored and fed back into the code generation process to better encode observed run-time behavior or to expose higher levels of instruction level parallelism. On the Itanium 2 processor, we expect to measure the behavior of real-world Itanium architecture-based applications and operating systems as well as mixed IA-32 and Itanium architecture-based code. These measurements will be critical for understanding the behavior of compiler optimizations, the use of architectural features such as speculation and predication, or the effectiveness of microarchitectural structures such as the ALAT, the caches, and the TLBs. These measurements will provide the data to drive application tuning and future processor, compiler, and operating system designs.
The remainder of the document is split into the following sections:
• Section 10.2, "Performance Monitor Programming Models" discusses how performance monitors are used, and presents various Itanium 2 processor performance monitoring programming models.
• Section 10.3, "Performance Monitor State" defines the Itanium 2 processor specific PMC/PMD performance monitoring registers.
• Chapter 11, "Performance Monitor Events" gives an overview of the Itanium 2 processor event list.
10.2 Performance Monitor Programming Models
This section introduces the Itanium 2 processor performance monitoring features from a programming model point of view and describes how the different event monitoring mechanisms can be used effectively. The Itanium 2 processor performance monitor architecture focuses on the following two usage models:
• Workload Characterization: The first step in any performance analysis is to understand the performance characteristics of the workload under study. Section 10.2.1, "Workload Characterization" discusses the Itanium 2 processor support for workload characterization.
• Profiling: Profiling is used by application developers and profile-guided compilers. Application developers are interested in identifying performance bottlenecks and relating them back to their code. Their primary objective is to understand which program location caused performance degradation at the module, function, and basic block level. For optimization of data placement and the analysis of critical loops, instruction-level granularity is desirable. Profile-guided compilers that use advanced Itanium architectural features such as predication
and speculation benefit from run-time profile information to optimize instruction schedules. The Itanium 2 processor supports instruction-level statistical profiling of branch mispredicts and cache misses. Details of the Itanium 2 processor's profiling support are described in Section 10.2.2, "Profiling."
10.2.1 Workload Characterization
The first step in any performance analysis is to understand the performance characteristics of the workload under study. There are two fundamental measures of interest: event rates and program cycle breakdown.
• Event Rate Monitoring: Event rates of interest include average retired instructions per clock, data and instruction cache miss rates, or branch mispredict rates measured across the entire application. Characterization of operating systems or large commercial workloads (e.g., OLTP analysis) requires a system-level view of performance-relevant events such as TLB miss rates, VHPT walks/second, interrupts/second, or bus utilization rates. Section 10.2.1.1, "Event Rate Monitoring" discusses event rate monitoring.
• Cycle Accounting: The cycle breakdown of a workload attributes a reason to every cycle spent by a program. Apart from a program's inherent execution latency, extra cycles are usually due to pipeline stalls and flushes. Section 10.2.1.4, "Cycle Accounting" discusses cycle accounting.
10.2.1.1 Event Rate Monitoring
Event rate monitoring determines event rates by reading processor event occurrence counters before and after the workload is run, and then computing the desired rates. For instance, two basic Itanium 2 processor events that count the number of retired Itanium instructions (IA64_INST_RETIRED.u) and the number of elapsed clock cycles (CPU_CYCLES) allow a workload's instructions per cycle (IPC) to be computed as follows:
• IPC = (IA64_INST_RETIRED.u[t1] - IA64_INST_RETIRED.u[t0]) / (CPU_CYCLES[t1] - CPU_CYCLES[t0])
Time-based sampling is the basis for many performance debugging tools [VTune™, gprof, WinNT]. As shown in Figure 10-1, time-based sampling can be used to plot the event rates over time, and can provide insights into the different phases that the workload moves through.
Figure 10-1. Time-Based Sampling
[Figure: event rate plotted over time; counters are sampled at the start and end of each sample interval, e.g., at t0 and t1.]
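As an illustration of the IPC formula above, a minimal C sketch that computes IPC from two counter samples; how the two counters are read (e.g., through an OS performance monitoring interface) is outside the scope of this sketch:

    typedef struct {
        unsigned long long inst_retired;  /* IA64_INST_RETIRED.u sample */
        unsigned long long cpu_cycles;    /* CPU_CYCLES sample */
    } pmu_sample;

    /* IPC over the interval [t0, t1]. */
    double ipc(pmu_sample t0, pmu_sample t1)
    {
        return (double)(t1.inst_retired - t0.inst_retired) /
               (double)(t1.cpu_cycles   - t0.cpu_cycles);
    }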
On the Itanium processor, many event types, e.g., TLB misses or branch mispredicts, are limited to a rate of one per clock cycle. These are referred to as "single occurrence" events. However, in the Itanium 2 processor, multiple events of the same type may occur in the same clock. We refer to such events as "multi-occurrence" events. An example of a multi-occurrence event on the
Itanium 2 processor is data cache read misses (up to two per clock). Multi-occurrence events, such as the number of entries in the memory request queue, can be used to derive the average number and average latency of memory accesses. The next two sections describe the basic Itanium 2 processor mechanisms for monitoring single- and multi-occurrence events.
10.2.1.2 Single Occurrence Events and Duration Counts
A single occurrence event can be monitored by any of the Itanium 2 processor performance counters. For all single occurrence events, a counter is incremented by up to one per clock cycle. Duration counters that count the number of clock cycles during which a condition persists are considered "single occurrence" events. Examples of single occurrence events on the Itanium 2 processor are TLB misses, branch mispredictions, and cycle-based metrics.
10.2.1.3 Multi-Occurrence Events, Thresholding, and Averaging
Events that, due to hardware parallelism, may occur at rates greater than one per clock cycle are termed "multi-occurrence" events. Examples of such events on the Itanium 2 processor are retired instructions or the number of live entries in the memory request queue.
Thresholding capabilities are available in the Itanium 2 processor's multi-occurrence counters and can be used to plot an event distribution histogram. When a non-zero threshold is specified, the monitor is incremented by one in every cycle in which the observed event count exceeds that programmed threshold. This allows questions such as "For how many cycles did the memory request queue contain more than two entries?" or "During how many cycles did the machine retire more than three instructions?" to be answered. This capability allows microarchitectural buffer sizing experiments to be supported by real measurements. By running a benchmark with different threshold values, a histogram can be drawn up that may help to identify the performance "knee" at a certain buffer size.
For overlapping concurrent events, such as pending memory operations, the average number of concurrently outstanding requests and the average number of cycles that requests were pending are of interest. To calculate the average number or latency of multiple outstanding requests in the memory queue, we need to know the total number of requests (n_total) and the number of live requests per cycle (n_live/cycle). By summing up the live requests (n_live/cycle) using a multi-occurrence counter, Σn_live is directly measured by hardware. We can now calculate the average number of requests and the average latency as follows:
• Average outstanding requests/cycle = Σn_live / ∆t
• Average latency per request = Σn_live / n_total
An example of this calculation is given in Table 10-1, in which the average outstanding requests/cycle = 15/8 = 1.875, and the average latency per request = 15/5 = 3 cycles.
Table 10-1. Average Latency per Request and Requests per Cycle Calculation Example
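A small C illustration of those two averaging formulas, using the numbers quoted above for the Table 10-1 example (Σn_live = 15, an 8-cycle observation interval, and 5 total requests):

    #include <stdio.h>

    int main(void)
    {
        double sum_live = 15.0;   /* Σ n_live, measured by a multi-occurrence counter */
        double delta_t  = 8.0;    /* observation interval in cycles */
        double n_total  = 5.0;    /* total number of requests */

        printf("avg outstanding/cycle = %.3f\n", sum_live / delta_t);  /* 1.875 */
        printf("avg latency/request   = %.3f\n", sum_live / n_total);  /* 3.000 */
        return 0;
    }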
The Itanium 2 processor provides the following capabilities to support event rate monitoring:
• Clock cycle counter.
• Retired instruction counter.
• Event occurrence and duration counters.
• Multi-occurrence counters with thresholding capability.
10.2.1.4 Cycle Accounting
While event rate monitoring counts the number of events, it does not tell us whether the observed events are contributing to a performance problem. A commonly used strategy is to plot multiple event rates and correlate them with the measured IPC rate. If a low IPC occurs concurrently with a peak of cache miss activity, chances are that cache misses are causing a performance problem. To eliminate such guesswork, the Itanium 2 processor provides a set of cycle accounting monitors that break down the number of cycles that are lost due to various kinds of microarchitectural events. As shown in Figure 10-2, this lets us account for every cycle spent by a program and therefore provides insight into an application's microarchitectural behavior. Note that cycle accounting is different from simple stall or flush duration counting. Cycle accounting is based on the machine's actual stall and flush conditions, and accounts for overlapped pipeline delays, while simple stall or flush duration counters do not. Cycle accounting determines a program's cycle breakdown by stall and flush reasons, while simple duration counters are useful in determining cumulative stall or flush latencies.
Figure 10-2. Itanium® Processor Family Cycle Accounting
[Figure: 100% of execution time broken down into inherent program execution latency, data access cycles, branch mispredicts, instruction fetch stalls, and other stalls.]
The Itanium 2 processor cycle accounting monitors account for all major single- and multi-cycle stall and flush conditions. Overlapping stall and flush conditions are prioritized in reverse pipeline order, i.e., delays that occur later in the pipe and that overlap with earlier stage delays are reported as being caused later in the pipeline. The six back-end stall and flush reasons are prioritized in the following order:
1. Exception/Interruption Cycle: cycles spent flushing the pipe due to interrupts and exceptions.
2. Branch Mispredict Cycle: cycles spent flushing the pipe due to branch mispredicts.
3. Data/FPU Access Cycle: memory pipeline full, data TLB stalls, load-use stalls, and access to the floating-point unit.
4. Execution Latency Cycle: scoreboard and other register dependency stalls.
5. RSE Active Cycle: RSE spill/fill stall.
6. Front-end Stalls: stalls due to the back-end waiting on the front-end.
Additional front-end stall counters are available which detail seven possible reasons for a front-end stall to occur. However, the back-end and front-end stall events should not be compared since they are counted in different stages of the pipeline.
For details, refer to Section 11.6, "Stall Events."
10.2.2 Profiling
Profiling is used by application developers, profile-guided compilers, optimizing linkers, and run-time systems. Application developers are interested in identifying performance bottlenecks and relating them back to their source code. Based on profile feedback, developers can make changes to the high-level algorithms and data structures of the program. Compilers can use profile feedback to optimize instruction schedules by employing advanced Itanium architectural features such as predication and speculation.
To support profiling, performance monitor counts have to be associated with program locations. The following mechanisms are supported directly by the Itanium 2 processor's performance monitors:
• Program Counter Sampling
• Miss Event Address Sampling: Itanium 2 processor event address registers (EARs) provide sub-pipeline length event resolution for performance-critical events (instruction and data caches, branch mispredicts, and instruction and data TLBs).
• Event Qualification: constrains event monitoring to a specific instruction address range, to certain opcodes or privilege levels.
These profiling features are presented in the next three subsections.
10.2.2.1 Program Counter Sampling
Application tuning tools like [VTune, gprof] use time-based or event-based sampling of the program counter and other event counters to identify performance-critical functions and basic blocks. As shown in Figure 10-3, the sampled points can be histogrammed by instruction addresses. For application tuning, statistical sampling techniques have been very successful, because the programmer can rapidly identify code hot spots in which the program spends a significant fraction of its time, or where certain event counts are high.
Figure 10-3. Event Histogram by Program Counter
[Figure: event frequency (e.g., cache miss counts) histogrammed by instruction address across the application's address space.]
Program counter sampling points the performance analyst at code hot spots, but does not indicate what caused the performance problem. Inspection and manual analysis of the hot-spot region along with a fair amount of guesswork are required to identify the root cause of the performance problem. On the Itanium 2 processor, the cycle accounting mechanism (described in Section 10.2.1.4, "Cycle Accounting") can be used to directly measure an application's microarchitectural behavior.
The Itanium architectural interval timer facilities (ITC and ITM registers) can be used for time-based program counter sampling. Event-based program counter sampling is supported by a dedicated performance monitor overflow interrupt mechanism described in detail in Section 7.2.2, "Performance Monitor Overflow Status Registers (PMC[0]..PMC[3])" in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual.
To support program counter sampling, the Itanium 2 processor provides the following mechanisms:
• Timer interrupt for time-based program counter sampling.
• Event count overflow interrupt for event-based program counter sampling.
• Hardware-supported cycle accounting.
10.2.2.2 Miss Event Address Sampling
Program counter sampling and cycle accounting provide an accurate picture of cumulative microarchitectural behavior, but they do not provide the application developer with pointers to specific program elements (code locations and data structures) that repeatedly cause microarchitectural "miss events." In a cache study of the SPEC92 benchmarks, [Lebeck] used (trace-based) cache miss profiling to gain performance improvements of 1.02 to 3.46 on various benchmarks by making simple changes to the source code. This type of analysis requires identification of instruction and data addresses related to microarchitectural "miss events" such as cache misses, branch mispredicts, or TLB misses. Using symbol tables or compiler annotations, these addresses can be mapped back to critical source code elements. Like Lebeck, most performance analysts in the past have had to capture hardware traces and resort to trace-driven simulation.
Due to the superscalar issue, deep pipelining, and out-of-order instruction completion of today's microarchitectures, the sampled program counter value may not be related to the instruction address that caused a miss event. On a Pentium processor pipeline, the sampled program counter may be off by two dynamic instructions from the instruction that caused the miss event. On a Pentium® Pro processor, this distance increases to approximately 32 dynamic instructions. On the Itanium 2 processor, it is approximately 48 dynamic instructions. If program counter sampling is used for miss event address identification on the Itanium 2 processor, a miss event might be associated with an instruction almost five dynamic basic blocks away from where it actually occurred (assuming that 10% of all instructions are branches). Therefore, it is essential for hardware to precisely identify an event's address.
The Itanium 2 processor provides a set of event address registers (EARs) that record the instruction and data addresses of data cache misses for loads, the instruction and data addresses of data TLB misses, and the instruction addresses of instruction TLB and cache misses. A four-deep branch trace buffer captures sequences of branch instructions. Table 10-2 summarizes the capabilities offered by the Itanium 2 processor EARs and the branch trace buffer. Exposing miss event addresses to software allows them to be monitored either by sampling or by code instrumentation. This eliminates the need for trace generation to identify and solve performance problems and enables performance analysis by a much larger audience on unmodified hardware.
Table 10-2. Itanium® 2 Processor EARs and Branch Trace Buffer
  Instruction Cache EAR
    Triggers on: instruction fetches that miss the L1 instruction cache (demand fetches only)
    Records: instruction address; number of cycles fetch was in flight
  Instruction TLB (ITLB) EAR
    Triggers on: instruction fetches that missed the L1 ITLB (demand fetches only)
    Records: instruction address; who serviced the L1 ITLB miss (L2 ITLB, VHPT, or software)
  Data Cache EAR
    Triggers on: load instructions that miss the L1 data cache
    Records: instruction address; data address; number of cycles load was in flight
  Data TLB (DTLB) EAR
    Triggers on: data references that miss the L1 DTLB
    Records: instruction address; data address; who serviced the L1 DTLB miss (L2 DTLB, VHPT, or software)
  Branch Trace Buffer
    Triggers on: branch outcomes
    Records: branch instruction address; branch target instruction address; mispredict status and reason
The Itanium 2 processor EARs enable statistical sampling by configuring a performance counter to count, for instance, the number of data cache misses or retired instructions. The performance counter value is set up to interrupt the processor after a predetermined number of events have been observed. The data cache event address register repeatedly captures the instruction and data addresses of actual data cache load misses. Whenever the counter overflows, miss event address collection is suspended until the event address register is read by software (this prevents software from capturing a miss event that might be caused by the monitoring software itself). When the counter overflows, an interrupt is delivered to software, the observed event addresses are collected, and a new observation interval can be set up by rewriting the performance counter register. For time-based (rather than event-based) sampling methods, the event address registers indicate to software whether or not a qualified event was captured. Statistical sampling can achieve arbitrary event resolution by varying the number of events within an observation interval and by increasing the number of observation intervals.
10.2.3 Event Qualification
On the Itanium 2 processor, performance monitoring can be confined to a subset of all events. As shown in Figure 10-4, events can be qualified for monitoring based on an instruction address range, a particular instruction opcode, a data address range, an event-specific "unit mask" (umask), the privilege level and instruction set the event was caused by, and the status of the performance monitoring freeze bit (PMC0.fr).
• Itanium Instruction Address Range Check: The Itanium 2 processor allows event monitoring to be constrained to a programmable instruction address range. This enables monitoring of dynamically linked libraries (DLLs), functions, or loops of interest in the context of a large Itanium-based application. The Itanium instruction address range check is applied at the instruction fetch stage of the pipeline, and the resulting qualification is carried by the instruction throughout the pipeline. This enables conditional event counting at a level of granularity smaller than the dynamic instruction length of the pipeline (approximately 48 instructions). The Itanium 2 processor's instruction address range check operates only during Itanium-based code execution, i.e., when PSR.is is zero. For details, see the Itanium Opcode Match and Address Range Check Registers (PMC8,9).
Figure 10-4. Itanium® 2 Processor Event Qualification
[Figure: an event is qualified only if all of the following are true: the Itanium instruction pointer is in the IBR range (Itanium instruction address range check), the Itanium opcode matches (Itanium instruction opcode match), the Itanium data address is in the DBR range (Itanium data address range check, memory operations only), the event-specific "unit mask" matches, the code is executing at a monitored privilege level and in a monitored instruction set (Itanium or IA-32), and event monitoring is enabled (performance monitor freeze bit, PMC0.fr).]
• Itanium Instruction Opcode Match: The Itanium 2 processor provides two independent Itanium opcode match registers, each of which matches the currently issued instruction encodings with a programmable opcode match and mask function. The resulting match events can be selected as an event type for counting by the performance counters. This allows histogramming of instruction types, usage of destination and predicate registers, as well as basic block profiling (through insertion of tagged NOPs). The opcode matcher operates only during Itanium-based code execution, i.e., when PSR.is is zero. Details are described in Section 10.3.4.
• Itanium Data Address Range Check: The Itanium 2 processor allows event collection for memory operations to be constrained to a programmable data address range. This enables selective monitoring of data cache miss behavior of specific data structures. For details, see Section 10.3.6.
• Event Specific Unit Masks: Some events allow the specification of "unit masks" to filter out interesting events directly at the monitored unit. As an example, the number of counted bus transactions can be qualified by an event-specific unit mask to contain transactions that originated from any bus agent, from the processor itself, or from other I/O bus masters. In this case, the bus unit uses a three-way unit mask (any, self, or I/O) that specifies which transactions are to be counted. In the Itanium 2 processor, events from the branch, memory and bus units support a variety of unit masks. For details, refer to the event pages in Chapter 11, "Performance Monitor Events."
• Privilege Level: Two bits in the processor status register are provided to enable selective process-based event monitoring. The Itanium 2 processor supports conditional event counting based on the current privilege level; this allows performance monitoring software to break down event counts into user and operating system contributions. For details on how to constrain monitoring by privilege level, refer to Section 10.3.1, "Performance Monitor Control and Accessibility."
• Instruction Set: The Itanium 2 processor supports conditional event counting based on the currently executing instruction set (Itanium or IA-32) by providing two instruction set mask bits for each event monitor. This allows performance monitoring software to break down event counts into Itanium and IA-32 contributions. For details, refer to Section 10.3.1, "Performance Monitor Control and Accessibility."
• Performance Monitor Freeze: Event counter overflows or software can freeze event monitoring. When frozen, no event monitoring takes place until software clears the monitoring freeze bit (PMC0.fr). This ensures that the performance monitoring routines themselves, e.g., counter overflow interrupt handlers or performance monitoring context switch routines, do not "pollute" the event counts of the system under observation. For details refer to Section 7.2.4 of Volume 2 of the Intel® Itanium™ Architecture Software Developer's Manual.
10.2.3.1 Combining Opcode Matching, Instruction, and Data Address Range Check
The Itanium 2 processor allows various event qualification mechanisms to be combined by providing the instruction tagging mechanism shown in Figure 10-5.
Figure 10-5. Instruction Tagging Mechanism in the Itanium® 2 Processor
[Figure: the Itanium instruction address range check (IBRs, PMC14) produces an IBRRange tag that feeds two Itanium opcode matchers (PMC8, PMC15 and PMC9, PMC15), producing Tag(PMC8) and Tag(PMC9); the Itanium data address range check (DBRs, PMC13) produces a DBRRange tag for memory operations. Qualified events then pass through the event select (PMCi.es) and the privilege level and instruction set check (PMCi.plm, PMCi.ism), and are accumulated in counter PMDi.]
During Itanium instruction execution (PSR.is is zero), the instruction address range check is applied first. The resulting address range check tag (IBRRangeTag) is passed to two opcode matchers that combine the instruction address range check with the opcode match. Each of the two combined tags (Tag(PMC8) and Tag(PMC9)) can be counted as a retired instruction count event (for details refer to the event description "IA64_TAGGED_INST_RETIRED" on page 11-165).
One of the combined Itanium address range and opcode match tags, Tag(PMC8), qualifies all
downstream pipeline events. Events in the memory hierarchy (L1 and L2 data cache and data TLB
events) can further be qualified using a data address range check (DBRRangeTag).

As summarized in Table 10-3, data address range checking can be combined with opcode matching
and instruction range checking on the Itanium 2 processor. Additional event qualifications based
on the current privilege level and the current instruction set can be applied to all events and are
discussed in Section 10.2.3.2, “Privilege Level Constraints” and Section 10.2.3.3, “Instruction Set
Constraints.”
Table 10-3. Itanium® 2 Processor Event Qualification Modes

Event Qualification Modes | Opcode Match Enable PMC15.ibrp0-pmc8 | Instruction Opcode Matching PMC8 | Instruction Address Range Check Enable PMC14.ibrp0 | Data Address Range Check [PMC13.enable-dbrp#, PMC13.dbrp#] (mem pipe events only)
Unconstrained Monitoring (all events) | x | 0xffff_ffff_ffff_ffff | x | [1,11] or [0,xx]
Instruction Address Range Check Only | x | 0xffff_ffff_ffff_fffe | 0 | [1,00]
Opcode Matching Only | 0 | Desired Opcodes | x | [1,01]
Data Address Range Check Only | x | 0xffff_ffff_ffff_ffff | x | [1,10]
Instruction Address Range Check and Opcode Matching | 0 | Desired Opcodes | 0 | [1,01]
Instruction and Data Address Range Check | x | 0xffff_ffff_ffff_fffe | 0 | [1,00]
Opcode Matching and Data Address Range Check | 0 | Desired Opcodes | x | [1,00]

10.2.3.2 Privilege Level Constraints

Performance monitoring software cannot always count on context switch support from the
operating system. In general, this has made performance analysis of a single process in a
multi-processing system or a multi-process workload impossible. To provide hardware support for
this kind of analysis, the Itanium architecture specifies three global bits (PSR.up, PSR.pp, DCR.pp)
and a per-monitor “privilege monitor” bit (PMCi.pm). To break down the performance
contributions of operating system and user-level application components, each monitor specifies a
4-bit privilege level mask (PMCi.plm). The mask is compared to the current privilege level in the
processor status register (PSR.cpl), and event counting is enabled if PMCi.plm[PSR.cpl] is one.
The Itanium 2 processor performance monitor control is discussed in Section 10.3.1,
“Performance Monitor Control and Accessibility.”

PMC registers can be configured as user-level monitors (PMCi.pm is 0) or system-level monitors
(PMCi.pm is 1). A user-level monitor is enabled whenever PSR.up is one. PSR.up can be
controlled by an application using the sum/rum instructions. This allows applications to
enable/disable performance monitoring for specific code sections. A system-level monitor is
enabled whenever PSR.pp is one. PSR.pp can be controlled at privilege level 0 only, which allows
monitor control without interference from user-level processes. The pp field in the default control
register (DCR.pp) is copied into PSR.pp whenever an interruption is delivered. This allows events
generated during interruptions to be broken down separately: if DCR.pp is 0, events during
interruptions are not counted; if DCR.pp is 1, they are included in the kernel counts.
As shown in Figure 10-6, Figure 10-7, and Figure 10-8, single process, multi-process, and
system-level performance monitoring are possible by specifying the appropriate combination of
PSR and DCR bits. These bits allow performance monitoring to be controlled entirely from a
kernel-level device driver, without explicit operating system support. Once the desired monitoring
configuration has been set up in a process’ processor status register (PSR), “regular” unmodified
operating system context switch code automatically enables/disables performance monitoring.

With support from the operating system, individual per-process breakdown of event counts can be
generated as outlined in the performance monitoring chapter of the Intel® Itanium® Architecture
Software Developer’s Manual.
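As a concrete illustration of how the PSR, DCR and PMC bits combine, the C sketch below models only the settings shown in two panels of Figure 10-6 (user-level-only monitoring of one process, and monitoring that also includes kernel and interrupt-level execution). It is a host-side model of the decision, not code that touches the real registers, and the helper names are hypothetical.

#include <stdio.h>

/* Illustrative model of the monitoring-related control bits. */
struct pmu_cfg {
    unsigned pmc_pm;    /* PMC.pm: 0 = user monitor, 1 = privileged monitor */
    unsigned pmc_plm;   /* PMC.plm: 4-bit privilege level mask              */
    unsigned psr_up;    /* PSR.up: enables user-level monitors              */
    unsigned psr_pp;    /* PSR.pp: enables privileged monitors              */
    unsigned dcr_pp;    /* DCR.pp: copied into PSR.pp on interruptions      */
};

/* User-level portion of one process only (Figure 10-6, first panel). */
static struct pmu_cfg single_process_user_only(void)
{
    return (struct pmu_cfg){ .pmc_pm = 0, .pmc_plm = 0x8 /* 1000 */,
                             .psr_up = 1, .psr_pp = 0, .dcr_pp = 0 };
}

/* User, kernel and interrupt-level contributions (Figure 10-6, third panel). */
static struct pmu_cfg single_process_full(void)
{
    return (struct pmu_cfg){ .pmc_pm = 1, .pmc_plm = 0x9 /* 1001 */,
                             .psr_up = 0, .psr_pp = 1, .dcr_pp = 1 };
}

int main(void)
{
    struct pmu_cfg a = single_process_user_only(), b = single_process_full();
    printf("user-only: pm=%u plm=0x%x up=%u pp=%u dcr.pp=%u\n",
           a.pmc_pm, a.pmc_plm, a.psr_up, a.psr_pp, a.dcr_pp);
    printf("full:      pm=%u plm=0x%x up=%u pp=%u dcr.pp=%u\n",
           b.pmc_pm, b.pmc_plm, b.psr_up, b.psr_pp, b.dcr_pp);
    return 0;
}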
10.2.3.3 Instruction Set Constraints

On the Itanium 2 processor, monitoring can additionally be constrained based on the currently
executing instruction set as defined by PSR.is. This capability is supported by the four generic
performance counters, as well as the opcode matching and instruction and data event address
registers. However, the branch trace buffer only supports Itanium-based code execution. When
Itanium architecture-only features are used, the corresponding PMC register instruction set mask
(PMCi.ism) should be set to Itanium architecture only (10) to ensure that events generated by
IA-32 code do not corrupt the Itanium 2 processor event counts.
Figure 10-6. Single Process Monitor (three example configurations for monitoring process A across user level (cpl = 3), kernel level (cpl = 0) and interrupt level (cpl = 0): (a) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1000; DCR.pp = 0; (b) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1001; DCR.pp = 0; (c) PSR.pp = 1, others 0; PMC.pm = 1; PMC.plm = 1001; DCR.pp = 1)
Figure 10-7. Multiple Process Monitor (three example configurations for monitoring processes A and B: (a) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1000; DCR.pp = 0; (b) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1001; DCR.pp = 0; (c) PSR.pp = 1, others 0; PMC.pm = 1; PMC.plm = 1001; DCR.pp = 1)
Figure 10-8. System Wide Monitor (example configuration spanning all processes, kernel and interrupt level (cpl = 0): PMC.pm = 1; PMC.plm = 1000; DCR.pp = 0)
10.2.4 References

• [gprof] S.L. Graham, P.B. Kessler and M.K. McKusick, “gprof: A Call Graph Execution
Profiler”, Proceedings SIGPLAN’82 Symposium on Compiler Construction; SIGPLAN
Notices; Vol. 17, No. 6, pp. 120-126, June 1982.
• [Lebeck] Alvin R. Lebeck and David A. Wood, “Cache Profiling and the SPEC benchmarks:
A Case Study”, Tech Report 1164, Computer Science Dept., University of Wisconsin -
Madison, July 1993.
• [VTune] Mark Atkins and Ramesh Subramaniam, “PC Software Performance Tuning”, IEEE
Computer, Vol. 29, No. 8, pp. 47-54, August 1996.
• [WinNT] Russ Blake, “Optimizing Windows NT(tm)”, Volume 4 of the Microsoft “Windows
NT Resource Kit for Windows NT Version 3.51”, Microsoft Press, 1995.
10.3Performance Monitor State
T wo sets of performa nce monit or registe rs are def ined. Perform ance Monito r Config uratio n (PMC)
registers are used to configure the monitors. Performance Mon itor Data (PMD) registers pro vide
data values from the mon itors. This section describes the Itanium2 processor performance
monitoring re gisters which expands on the Itanium arc hitectural definition. As shown in
Figure 10-9 the Itanium 2 processor provides four 48-bit perfor mance counters (PMC/PMD
pairs), and the following model-specific monitoring registers: instruction and data event address
registers (EARs) for monitoring cache and TLB misses, a branch trace buffer , two op code match
registers, and an instructio n addres s range check register.
Table 10-4 defines the PMC/PMD register assignments for each monitoring feature. The interrupt
status registers are mapped to PMC0,1,2,3. The four generic performance counter pairs are assigned
to PMC/PMD4,5,6,7. The event address registers and the branch trace buffer are controlled by three
configuration registers (PMC10,11,12). Captured event addresses and cache miss latencies are
accessible to software through five event address data registers (PMD0,1,2,3,17) and a branch trace
buffer (PMD8-16). On the Itanium 2 processor, monitoring of some events can additionally be
constrained to a programmable instruction address range by appropriate setting of the instruction
breakpoint registers (IBR) and the instruction address range check register (PMC14) and turning on
the checking mechanism in the opcode match registers (PMC8,9). Two opcode match registers
(PMC8,9) and an opcode match configuration register (PMC15) allow monitoring of some events to
be qualified with a programmable opcode. For memory operations, events can be qualified by a
programmable data address range by appropriate setting of the data breakpoint registers (DBRs)
and the data address range configuration register (PMC13).

Table 10-4. Itanium® 2 Processor Performance Monitor Register Set
Monitoring Feature | Registers
Interrupt Status | PMC0,1,2,3
Event Counters | PMC4,5,6,7 / PMD4,5,6,7
Opcode Matching | PMC8,9,15
Instruction EAR | PMC10 / PMD0,1
Data EAR | PMC11 / PMD2,3,17
Branch Trace Buffer | PMC12 / PMD8-16
Instruction Address Range Check | PMC14
Memory Pipeline Event Constraints | PMC13
10.3.1 Performance Monitor Control and Accessibility

In order to use performance monitor features, the power to the PMU should be turned on by setting
PMC4.enable to 1. At reset, this bit will be set. To provide power savings, this bit can be cleared to
turn off the clocks to all PMDs, PMCs (with the exception of PMC4), and other non-critical
circuitry.

Once the power is turned on, event collection is controlled by the Performance Monitor
Configuration (PMC) registers and the processor status register (PSR). Four PSR fields (PSR.up,
PSR.pp, PSR.cpl and PSR.sp) and the performance monitor freeze bit (PMC0.fr) affect the
behavior of all performance monitor registers.

Per-monitor control is provided by three PMC register fields (PMCi.plm, PMCi.ism, and
PMCi.pm). Instruction set masking based on PMCi.ism is an Itanium 2 processor model-specific
feature. Event collection for a monitor is enabled under the following constraints on the Itanium 2
processor:

Monitor Enablei = (not PMC0.fr) and PMCi.plm[PSR.cpl] and ((not PMCi.ism[PSR.is]) or
(PMCi=12)) and ((not (PMCi.pm) and PSR.up) or (PMCi.pm and PSR.pp))

Figure 10-10 defines the PSR control fields that affect performance monitoring. For a detailed
definition of how the PSR bits affect event monitoring and control accessibility of PMD registers,
please refer to Section 3.3.2 and Section 7.2.1 of Volume 2 of the Intel® Itanium® Architecture
Software Developer’s Manual.

Table 10-5 defines per-monitor controls that apply to PMC4,5,6,7,10,11,12. As defined in Table 10-4,
“Itanium® 2 Processor Performance Monitor Register Set,” each of these PMC registers controls
the behavior of its associated performance monitor data registers (PMD). The Itanium 2 processor
model-specific PMD registers associated with instruction/data EARs and the branch trace buffer
(PMD0,1,2,3,8-17) can be read only when event monitoring is frozen (PMC0.fr is one).
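The enable equation in this section is simply a boolean combination of a handful of bits. The C sketch below transcribes it as a host-side model for clarity; it does not read the real PMC or PSR registers, and the struct fields are merely named after the architectural bits.

#include <stdbool.h>
#include <stdio.h>

struct pmc_ctrl {            /* per-monitor PMCi control fields       */
    unsigned plm;            /* 4-bit privilege level mask            */
    unsigned ism;            /* 2-bit instruction set mask [25:24]    */
    bool     pm;             /* privileged monitor                    */
};

struct cpu_state {
    bool     pmc0_fr;        /* PMC0.fr: monitoring frozen            */
    unsigned psr_cpl;        /* current privilege level, 0..3         */
    unsigned psr_is;         /* 0 = Itanium, 1 = IA-32                */
    bool     psr_up, psr_pp; /* user / privileged monitor enables     */
};

/* Direct transcription of the Monitor Enable(i) expression above. */
static bool monitor_enable(int i, const struct pmc_ctrl *c, const struct cpu_state *s)
{
    bool plm_ok = (c->plm >> s->psr_cpl) & 1;
    bool ism_ok = !((c->ism >> s->psr_is) & 1) || (i == 12); /* PMC12 has no ism */
    bool gate   = (!c->pm && s->psr_up) || (c->pm && s->psr_pp);
    return !s->pmc0_fr && plm_ok && ism_ok && gate;
}

int main(void)
{
    /* user-level counting, Itanium execution only (bit 24 of ism low) */
    struct pmc_ctrl  c = { .plm = 0x8, .ism = 0x2, .pm = false };
    struct cpu_state s = { .pmc0_fr = false, .psr_cpl = 3, .psr_is = 0,
                           .psr_up = true, .psr_pp = false };
    printf("enabled at cpl 3: %d\n", monitor_enable(4, &c, &s));  /* 1 */
    s.psr_cpl = 0;
    printf("enabled at cpl 0: %d\n", monitor_enable(4, &c, &s));  /* 0 */
    return 0;
}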
Figure 10-10. Processor Status Register (PSR) Fields for Performance Monitoring

Table 10-5. Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12)
Field | Bits | Description
plm | 3:0 | Privilege Level Mask - controls performance monitor operation for a specific privilege level. Each bit corresponds to one of the 4 privilege levels, with bit 0 corresponding to privilege level 0, bit 1 with privilege level 1, etc. A bit value of 1 indicates that the monitor is enabled at that privilege level. Writing zeros to all plm bits effectively disables the monitor. In this state, the Itanium 2 processor will not preserve the value of the corresponding PMD register(s).
pm | 6 | Privileged monitor - When 0, the performance monitor is configured as a user monitor and enabled by PSR.up. When PMC.pm is 1, the performance monitor is configured as a privileged monitor, enabled by PSR.pp, and PMD can only be read by privileged software. Any read of the PMD by non-privileged software in this case will return 0. NOTE: In PMC10 this field is implemented in bit [4].
ism | 25:24 | Instruction Set Mask - controls performance monitor operation based on the current instruction set. The instruction set mask applies to PMC4,5,6,7,10,11 but not to PMC12. 00: monitoring enabled during Itanium and IA-32 instruction execution (regardless of PSR.is); 10: bit 24 low enables monitoring during Itanium instruction execution (when PSR.is is zero); 01: bit 25 low enables monitoring during IA-32 instruction execution (when PSR.is is one); 11: disables monitoring. NOTE: In PMC10 this is implemented in [15:14]. PMC12 does not have this field.

10.3.2 Performance Counter Registers

The Itanium 2 processor provides four generic performance counters (PMC/PMD4,5,6,7 pairs). The
implemented counter width on the Itanium 2 processor is 48 bits (bit [47] indicates the overflow
condition). More so than on the Itanium processor, PMC/PMD pairs on the Itanium 2 processor are
symmetrical, i.e., nearly all event types can be monitored by all counters. There are exceptions
within some of the cache counters. See Section 11.8.2, “L1 Data Cache Events” and Section 11.8.3,
“L2 Unified Cache Events” for more information. These counters can track events whose
maximum per-cycle event increment is up to seven.
Figure 10-11 and Table 10-6 define the layout of the Itanium 2 processor Performance Counter
Configuration Registers (PMC4,5,6,7). The main task of these configuration registers is to select the
events to be monitored by the respective performance monitor data counters. The event selection (es)
and unit mask (umask) fields in the PMC registers perform the selection of these events. The rest
of the fields in the PMCs specify under what conditions the counting should be done (plm, pm, ism),
by how much the counter should be incremented (threshold), and what happens when the
counter overflows (oi, ev).

Table 10-6. Itanium® 2 Processor Generic PMC Register Fields (PMC4,5,6,7)
Field | Bits | Description
plm | 3:0 | Privilege Level Mask. See Table 10-5 “Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12).”
ev | 4 | External visibility - When 1, an external notification (BPM pin strobe) is provided whenever the counter overflows. External notification occurs regardless of the setting of the oi bit (see below). On the Itanium 2 processor, PMC4 external notification strobes the BPM0 pin, PMC5 strobes the BPM1 pin, PMC6 strobes the BPM2 pin, and PMC7 strobes the BPM3 pin.
oi | 5 | Overflow interrupt - When 1, a Performance Monitor Interrupt is raised and the performance monitor freeze bit (PMC0.fr) is set when the monitor overflows. When 0, no interrupt is raised and the performance monitor freeze bit (PMC0.fr) remains unchanged. Counter overflows generate only one interrupt. Setting the corresponding PMC0 bit on an overflow will be independent of this bit.
pm | 6 | Privilege Monitor. See Table 10-5 “Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12).”
ig | 7 | reserved
es | 15:8 | Event select - selects the performance event to be monitored. Itanium 2 processor event encodings are defined in Chapter 11, “Performance Monitor Events.”
umask | 19:16 | Unit Mask - event specific mask bits (see event definition for details)
threshold | 22:20 | Threshold - enables thresholding for “multi-occurrence” events. When threshold is zero, the counter sums up all observed event values. When the threshold is non-zero, the counter increments by one in every cycle in which the observed event value exceeds the threshold.
enable | 23 | PMC4 only. Enables use of the PMUs. A 1 must be written for the PMUs to function. Power up value is 1.
ism | 25:24 | Instruction Set Mask. See Table 10-5 “Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12).”
Figure 10-12 and Table 10-7 define the layout of the Itanium 2 processor Performance Counter
Data Registers (PMD4,5,6,7). A counter overflow occurs when the counter wraps (i.e., a carry out
from bit 46 is detected). Software can force an external interruption or external notification after N
events by preloading the monitor with a count value of 2^47 - N. Note that bit 47 is the overflow bit
and must be initialized to 0 whenever there is a need to initialize the register.

When accessible, software can continuously read the performance counter registers PMD4,5,6,7
without disabling event collection. Any read of the PMD from software without the appropriate
privilege level will return 0 (see “plm” in Table 10-6). The processor ensures that software will see
monotonically increasing counter values.

Table 10-7. Itanium® 2 Processor Generic PMD Register Fields (PMD4,5,6,7)
Field | Bits | Description
sxt47 | 63:48 | Writes are ignored. Reads return the value of bit 47, so count values appear as sign extended.
ov | 47 | Overflow bit (carry out from bit 46). NOTE: Writes to initialize the PMD should write 0 to this bit.
count | 46:0 | Event Count. The counter is defined to overflow when the count field wraps (carry out from bit 46).
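The 2^47 - N preload rule and the sign-extended read behavior in Table 10-7 are easy to get wrong, so the short sketch below models both with host-side arithmetic only, under the 48-bit implemented counter width stated above.

#include <stdint.h>
#include <stdio.h>

#define PMD_COUNT_BITS 47              /* count field is [46:0], bit 47 is ov */

/* Preload value that makes the counter overflow after n more events. */
static uint64_t preload_for(uint64_t n)
{
    return ((uint64_t)1 << PMD_COUNT_BITS) - n;   /* 2^47 - N, bit 47 stays 0 */
}

/* Model of a PMD read: bits [63:48] read as copies of bit 47 (sxt47). */
static uint64_t pmd_read_model(uint64_t hw_value)
{
    uint64_t low48 = hw_value & ((1ull << 48) - 1);
    return (low48 & (1ull << 47)) ? (low48 | ~((1ull << 48) - 1)) : low48;
}

int main(void)
{
    uint64_t v = preload_for(1000);    /* overflow after 1000 more events */
    printf("preload = 0x%013llx\n", (unsigned long long)v);
    printf("read after overflow appears sign extended: 0x%016llx\n",
           (unsigned long long)pmd_read_model(v + 1000));
    return 0;
}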
10.3.3 Performance Monitor Overflow Status Registers (PMC0,1,2,3)

As previously mentioned, the Itanium 2 processor supports four performance monitoring counters.
The overflow status of these four counters is indicated in register PMC0. As shown in Figure 10-13
and Table 10-8, only the PMC0[7:4,0] bits are populated. All other overflow bits are ignored, i.e., they
read as zero and ignore writes.

Figure 10-13. Itanium® 2 Processor Performance Monitor Overflow Status Registers (PMC0,1,2,3) (PMC0: fr in bit 0, reserved bits 3:1, overflow bits 7:4, remaining bits reserved; PMC1, PMC2 and PMC3 are reserved)

Table 10-8. Itanium® 2 Processor Performance Monitor Overflow Register Fields (PMC0,1,2,3)
Register | Field | Bits | Description
PMC0 | fr | 0 | Performance Monitor “freeze” bit - When 1, event monitoring is disabled. When 0, event monitoring is enabled. This bit is set by hardware whenever a performance monitor overflow occurs and its corresponding overflow interrupt bit (PMC.oi) is set to one. SW is responsible for clearing it. When the PMC.oi bit is not set, then counter overflows do not set this bit.
PMC0 | ignored | 3:1 | Read zero, Writes ignored.
PMC0 | overflow | 7:4 | Event Counter Overflow - When bit n is one, indicates that the corresponding PMDn overflowed. This is a bit vector indicating which performance monitor overflowed. These overflow bits are set on their corresponding counter's overflow regardless of the state of the PMC.oi bit. Software may also set these bits. These bits are sticky and multiple bits may be set.
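A counter-overflow interrupt handler typically inspects the PMC0 overflow vector, records which counters wrapped, and then clears PMC0.fr to resume monitoring. The sketch below models only that bookkeeping on an already-sampled PMC0 value; reading and writing the real register requires privileged mov-to/from-PMC instructions (or an OS interface such as a perfmon driver) and is not shown.

#include <stdint.h>
#include <stdio.h>

#define PMC0_FR        (1ull << 0)        /* freeze bit                     */
#define PMC0_OVF_SHIFT 4                  /* overflow bits for PMD4..PMD7   */
#define PMC0_OVF_MASK  (0xfull << PMC0_OVF_SHIFT)

/* Decide, from a sampled PMC0 value, which generic counters overflowed,
 * and compute the value to write back to unfreeze monitoring. */
static uint64_t handle_overflow(uint64_t pmc0, void (*record)(int counter))
{
    for (int n = 4; n <= 7; n++)
        if (pmc0 & (1ull << n))
            record(n);                    /* PMDn wrapped */
    /* clear the sticky overflow bits and the freeze bit before resuming */
    return pmc0 & ~(PMC0_OVF_MASK | PMC0_FR);
}

static void record_overflow(int counter) { printf("counter PMD%d overflowed\n", counter); }

int main(void)
{
    uint64_t pmc0 = PMC0_FR | (1ull << 5);            /* frozen, PMD5 wrapped */
    printf("write back 0x%llx\n",
           (unsigned long long)handle_overflow(pmc0, record_overflow));
    return 0;
}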
10.3.4 Opcode Match Check (PMC8,9,15)

The Itanium 2 processor allows event monitoring to be constrained based on the instruction address
and/or the Itanium encoding (opcode) of an instruction. Registers PMC15 and PMC14 (Section 10.3.5,
“Instruction Address Range Matching”) are used to enable these features. Registers PMC8,9 allow
configuring these features. For memory related events, the appropriate bits must be set in PMC13 to
enable this feature. Please refer to Section 10.3.6, “Data Address Range Matching (PMC13)” for
details. Unlike in the Itanium processor, the opcode matcher in the Itanium 2 processor operates
during both Itanium-based and IA-32 code execution. When operating in IA-32 mode it checks for
Itanium opcodes.

Figure 10-14 and Table 10-9 describe the fields of the PMC8,9 registers. Figure 10-15 and Table 10-10
describe the register PMC15. All combinations of bits [63:60] are supported. To match A-slot
instructions, set bits [63:62] to 11. To match all instruction types, bits [63:60] should be set to 1111.
To ensure that all events are counted independent of the opcode matcher, all mifb and all mask bits
of PMC8,9 should be set to one (all opcodes match).

PMC9 only qualifies the event IA64_TAGGED_INST_RETIRED. The Itanium 2 processor’s
opcode constraint for the IA64_TAGGED_INST_RETIRED event ANDs the PMC9 results with the
IBRP1 and IBRP3 matches, and the PMC8 results with the IBRP0 and IBRP2 matches. PMC8, however,
constrains other downstream events as well. To ensure that all events are counted independent of the
opcode matcher, bits [63:60] and bits [29:3] should be set to all ones.
Figure 10-14. Opcode Match Registers (PMC8,9) (fields, from the most significant bit down: m [63], i [62], f [61], b [60], match [59:33], rsv [32:30], mask [29:3], -- [2], inv [1], ig_ad [0])
Table 10-9. Opcode Match Register Fields (PMC8,9)
Field | Bits | Width | Description
ig_ad | 0 | 1 | Ignore Instruction Address Range Checking. If set to 1, all instruction addresses are considered for events. If 0, IBRs 0-1 will be used for address constraints. NOTE: This bit is ignored in PMC9.
inv | 1 | 1 | Invert Range Check. If set to 1, the address range specified by IBR0-1 is inverted. Effective only when the ig_ad bit is set to 0. NOTE: This bit is ignored in PMC9.
-- | 2 | 1 | Must write 1 for proper PMU operation.
mask | 29:3 | 27 | Bits that mask Itanium® instruction encoding bits. [15:3]: mask bits for opcode bits [12:0]; [29:16]: mask bits for opcode bits [40:27]. If a mask bit is set to 1, the corresponding opcode bit is not used for opcode matching.
rsv | 32:30 | 3 | Reserved bits
match | 59:33 | 27 | Opcode bits against which the Itanium instruction encoding is to be matched. [45:33]: match bits for opcode bits [12:0]; [59:46]: match bits for opcode bits [40:27].
b | 60 | 1 | If 1: match if opcode is a B-slot
f | 61 | 1 | If 1: match if opcode is an F-slot
i | 62 | 1 | If 1: match if opcode is an I-slot
m | 63 | 1 | If 1: match if opcode is an M-slot
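As a worked example of the mask/match split in Table 10-9, the sketch below builds the 27-bit mask and match values needed to select instructions by major opcode only, ignoring the destination and predicate bits. It assumes the major opcode occupies encoding bits 40:37, which map to bits 26:23 of the 27-bit match/mask values; treat it as an illustration of the field layout rather than a ready-to-use programming recipe.

#include <stdint.h>
#include <stdio.h>

/* 27-bit mask/match pair for PMC8/9: indices 0..12 correspond to encoding
 * bits 12:0, indices 13..26 to encoding bits 40:27 (Table 10-9). */
struct opcode_sel { uint32_t match, mask; };

/* Select instructions by 4-bit major opcode (assumed encoding bits 40:37),
 * ignoring all other compared bits. A mask bit of 1 means "don't care". */
static struct opcode_sel select_major_opcode(unsigned major)
{
    struct opcode_sel s;
    s.mask  = 0x7ffffff;                 /* start with: ignore everything    */
    s.mask &= ~(0xfu << 23);             /* ...but compare bits 40:37        */
    s.match = (uint32_t)(major & 0xf) << 23;
    return s;
}

int main(void)
{
    struct opcode_sel s = select_major_opcode(0x8);  /* hypothetical opcode */
    printf("match=0x%07x mask=0x%07x\n", s.match, s.mask);
    return 0;
}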
Figure 10-15. Opcode Match Configuration Register (PMC15) (fields: ibrp0-pmc8 [0], ibrp1-pmc9 [1], ibrp2-pmc8 [2], ibrp3-pmc9 [3], reserved [63:4])
Table 10-10. Opcode Match Configuration Register Fields (PMC15)
Field | Bits | Description
ibrp0-pmc8 | 0 | 1: PMU events will not be constrained by opcode. 0: PMU events (including IA64_TAGGED_INST_RETIRED.00) will be opcode constrained by PMC8.
ibrp1-pmc9 | 1 | 1: IA64_TAGGED_INST_RETIRED.01 won’t be constrained by opcode. 0: IA64_TAGGED_INST_RETIRED.01 will be opcode constrained by PMC9.
ibrp2-pmc8 | 2 | 1: IA64_TAGGED_INST_RETIRED.10 won’t be constrained by opcode. 0: IA64_TAGGED_INST_RETIRED.10 will be opcode constrained by PMC8.
ibrp3-pmc9 | 3 | 1: IA64_TAGGED_INST_RETIRED.11 won’t be constrained by opcode. 0: IA64_TAGGED_INST_RETIRED.11 will be opcode constrained by PMC9.

For opcode matching purposes, an Itanium instruction is defined by two items: the instruction type
“itype” (one of M, I, F or B) and the 42-bit encoding “enco{41:0}” defined in the Intel® Itanium®
Architecture Software Developer’s Manual. Each instruction is evaluated against each opcode
match register (PMC8,9) as follows:

Match(PMCi) = (imatch(itype, PMCi.mifb) AND ematch(enco, PMCi.match, PMCi.mask))

Where:

imatch(itype, PMC[i].mifb) = (itype=M AND PMC[i].m) OR (itype=I AND PMC[i].i) OR
(itype=F AND PMC[i].f) OR (itype=B AND PMC[i].b)

ematch(enco, match, mask) = AND[b=40..27] ((enco{b}=match{b-14}) OR mask{b-14}) AND
AND[b=12..0] ((enco{b}=match{b}) OR mask{b})

This function matches encoding bits {40:27} (major opcode) and encoding bits {12:0} (destination
and qualifying predicate) only. Bits {26:13} of the instruction encoding are ignored by the opcode
matcher.
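Written out in C, the match predicate above is only a few lines. The sketch below is a literal host-side transcription of it; the 27-bit match and mask values are taken as already extracted from PMC8 or PMC9 bits [59:33] and [29:3].

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum itype { M_SLOT, I_SLOT, F_SLOT, B_SLOT };

struct opcode_matcher {
    bool m, i, f, b;          /* PMC8/9 bits [63:60]                        */
    uint32_t match;           /* 27-bit value from PMC8/9 [59:33]           */
    uint32_t mask;            /* 27-bit value from PMC8/9 [29:3]            */
};

static bool imatch(enum itype t, const struct opcode_matcher *p)
{
    return (t == M_SLOT && p->m) || (t == I_SLOT && p->i) ||
           (t == F_SLOT && p->f) || (t == B_SLOT && p->b);
}

/* Compare encoding bits {40:27} and {12:0}; bits {26:13} are ignored. */
static bool ematch(uint64_t enco, uint32_t match, uint32_t mask)
{
    for (int b = 0; b <= 12; b++)
        if (((enco >> b) ^ (match >> b)) & 1 && !((mask >> b) & 1))
            return false;
    for (int b = 27; b <= 40; b++)
        if (((enco >> b) ^ (match >> (b - 14))) & 1 && !((mask >> (b - 14)) & 1))
            return false;
    return true;
}

static bool opcode_match(enum itype t, uint64_t enco, const struct opcode_matcher *p)
{
    return imatch(t, p) && ematch(enco, p->match, p->mask);
}

int main(void)
{
    /* "Match everything": all slot bits set, all mask bits set. */
    struct opcode_matcher all = { true, true, true, true, 0, 0x7ffffff };
    printf("%d\n", opcode_match(I_SLOT, 0x123456789abull, &all));   /* 1 */
    return 0;
}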
The IBRP matches are advanced with the instruction pointer to the point where opcodes are being
dispersed. The matches from the opcode matchers are ANDed with the IBRP matches at this point.
This produces two opcode match events that are combined with the instruction range check tag
(IBRRangeTag, see Section 10.3.5, “Instruction Address Range Matching”) as follows:

Tag(PMC8) = Match(PMC8) and IBRRangeTag
Tag(PMC9) = Match(PMC9) and IBRRangeTag

As shown in Figure 10-5, the two tags, Tag(PMC8) and Tag(PMC9), are staged down the processor
pipeline until instruction retirement and can be selected as a retired instruction count event (see
event description “IA64_TAGGED_INST_RETIRED” on page 11-165). In this way, a
performance counter (PMC/PMD4,5,6,7) can be used to count the number of retired instructions
within the programmed range that match the specified opcodes.

The opcodes dispersed to different pipelines are compared to PMC8; the opcode match is further
qualified by a number of user configurable bits (please refer to the definition of PMC15 in this
document) and ANDed with the IBRP0 match before being distributed to different places.

Note: Register PMC15 must contain the predetermined value of 0xfffffff0. If software modifies any bits
not listed in Table 10-10, processor behavior is not defined.
10.3.5 Instruction Address Range Matching

The Itanium 2 processor allows event monitoring to be constrained to a range of instruction
addresses. The four architectural Instruction Breakpoint Register Pairs IBRP0-3 (IBR0-7) can be
used to specify the desired address range. Once programmed, this restriction is applied to all
events. In the Itanium 2 processor, registers PMC8,14 specify how the resulting address match is
applied to the performance monitors. With the exception of IA64_INST_RETIRED and prefetch
events, IBRP0 is the only IBR pair used and will be considered the default for this section. For
memory related events, the appropriate bits must be set in PMC13 to enable this feature. Please
refer to Section 10.3.6, “Data Address Range Matching (PMC13)” for details.

Figure 10-16 and Table 10-12 describe the fields of register PMC14. Event address range
checking is controlled by the “ignore address range check” bit (PMC8.ig_ad). When PMC8.ig_ad is
one (or PMC14.ibrp0 is one), all instructions are tagged regardless of IBR settings. In this mode,
events from both IA-32 and Itanium-based code execution contribute to the event count. When both
PMC8.ig_ad and PMC14.ibrp0 are zero, the instruction address range check based on the IBR
settings is applied to all Itanium code fetches. In this mode, IA-32 instructions are never tagged,
and, as a result, events generated by IA-32 code execution are ignored. Table 10-11 defines the
behavior of the instruction address range checker for different combinations of PSR.is and
PMC8.ig_ad or PMC14.ibrp0.

Table 10-11. Itanium® 2 Processor Instruction Address Range Check by Instruction Set
PMC8.ig_ad OR PMC14.ibrp0 | PSR.is = 0 (Itanium®) | PSR.is = 1 (IA-32)
0 | Tag only Itanium instructions if they match the IBR range. | DO NOT tag any IA-32 operations.
1 | Tag all Itanium and IA-32 instructions. Ignore the IBR range.

The processor compares every Itanium instruction fetch address IP{63:0} against the address range
programmed into the architectural instruction breakpoint register pair IBRP0. Regardless of the
value of the instruction breakpoint fault enable (IBR x-bit), the IBRP0 match expression given
below is evaluated for the Itanium 2 processor.

The events which occur before the instruction dispersal stage will fire only if this qualified match
(IBRmatch) is true. This qualified match will be ANDed with the result of the Opcode Matcher PMC8
and further qualified with more user definable bits (see Table 10-12) before being distributed to
different places. The events which occur after the instruction dispersal stage will use this new
qualified match (ibrp0-pmc8 match).
Figure 10-16. Instruction Address Range Configuration Register (PMC14) (fields: ibrp0 [1], ibrp1 [4], ibrp2 [7], ibrp3 [10], fine [13], all other bits reserved)
Table 10-12. Instruction Address Range Configuration Register Fields (PMC14)
Field | Bits | Description
ibrp0 | 1 | 1: No constraint. 0: Non-prefetch PMU events (IA64_TAGGED_INST_RETIRED.00 included) will be constrained by IBRP0.
ibrp1 | 4 | 1: No constraint. 0: Prefetch PMU events (IA64_TAGGED_INST_RETIRED.01 included) will be constrained by IBRP1.
ibrp2 | 7 | 1: No constraint. 0: Non-prefetch PMU events (IA64_TAGGED_INST_RETIRED.10 included) will be constrained by IBRP2.
ibrp3 | 10 | 1: No constraint. 0: Non-prefetch PMU events (IA64_TAGGED_INST_RETIRED.11 included) will be constrained by IBRP3.
fine | 13 | Enable arbitrary range checking (not restricted to powers of 2). 1: IBRP0,2 and IBRP1,3 are paired as lo/hi limit pairs. 0: Normal mode. If set to 1, ibrp0 (lower limit) and ibrp2 (upper limit) are paired together; so are ibrp1 (lower limit) and ibrp3 (upper limit). Bits [63:12] of the upper and lower limits need to be exactly the same but could have any value. Bits [11:0] of the upper limit need to be greater than bits [11:0] of the lower limit. If an address falls in between the upper and lower limits then a match will be signaled for both of the IBR pairs used (ibrp0 and ibrp2 will signal matches at the same time). NOTE: The mask bits programmed in IBRs 1,3,5,7 for bits [11:0] have no effect in this mode.
The IBRP0 match is generated in the following fashion. Note that unless fine mode is used, arbitrary
range checking cannot be performed since the mask bits are in powers of 2. In fine mode, two IBR
pairs are used to specify the upper and lower limits of a range within a page (the upper bits of the lower
and upper limits must be exactly the same).

If PMC14.fine=0, IBRmatch0 = match[IP(63:0), IBR0(63:0), IBR1(55:0)]
Else, IBRmatch0 = match[IP(63:12), IBR0(63:12), IBR1(55:12)] and [IP(11:0) >
IBR0(11:0)] and [IP(11:0) < IBR4(11:0)]
IBRadrmatch0 = IBRmatch0
ibrp0 match = (PMC8.ig_ad or PMC14.ibrp0) or (IBRadrmatch0 and match[PSR.cpl,
IBR1(59:56)])

The instruction range check tag (IBRRangeTag) considers the IBR address ranges only if
PMC8.ig_ad, PMC14.ibrp0 and PSR.is are all zero and if none of the IBR x-bits or PSR.db are
set.

In order to allow simultaneous use of some IBRs for Performance Monitoring and the others for
debugging (the architected purpose of these registers), separate mechanisms are provided for
enabling IBRs, and the x-bit should be cleared to 0 for the IBRP which is going to be used for the PMU.
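The fine-mode branch of the expression above reduces to an equality check on the page-sized upper bits plus an open-interval test on bits [11:0]. The small host-side model below mirrors those strict comparisons; the IBR values are passed in as plain integers, since reading the real IBRs is privileged and not shown.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Fine-mode IBRP0 range check: IBR0 holds the lower limit, IBR4 the upper
 * limit; bits [63:12] of both limits must be identical (same page). */
static bool fine_mode_match(uint64_t ip, uint64_t ibr0_lo, uint64_t ibr4_hi)
{
    if ((ibr0_lo >> 12) != (ibr4_hi >> 12))
        return false;                         /* misconfigured pair          */
    if ((ip >> 12) != (ibr0_lo >> 12))
        return false;                         /* wrong page                  */
    uint64_t off = ip & 0xfff;
    return off > (ibr0_lo & 0xfff) && off < (ibr4_hi & 0xfff);
}

int main(void)
{
    uint64_t lo = 0x2000000000001100ull, hi = 0x2000000000001400ull;
    printf("%d %d\n", fine_mode_match(0x2000000000001200ull, lo, hi),   /* 1 */
                      fine_mode_match(0x2000000000001500ull, lo, hi));  /* 0 */
    return 0;
}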
10.3.5.1 Use of IBRP0 For Instruction Address Range Check – Exception 1

The address range constraint for prefetch events is on the target address of these events rather than
the address of the prefetch instruction. Therefore IBRP1 must be used for constraining these events.

Calculation of the IBRP1 match is the same as that of the IBRP0 match with the exception that
IBR2,3,6 are used instead of IBR0,1,4.

Note: Register PMC14 must contain the predetermined value 0xdb6. If software modifies any bits not
listed in Table 10-12, processor behavior is not defined. It is illegal to have PMC13[48:45]=0000
and PMC8[0]=0 and ((PMC14[2:1]=10 or 00) or (PMC14[5:4]=10 or 00)); this produces
inconsistencies in tagging I-side events in L1D and L2.

10.3.5.2 Use of IBRP0 For Instruction Address Range Check – Exception 2

The Address Range Constraint for the IA64_TAGGED_INST_RETIRED event uses all four IBR
pairs. Calculation of the IBRP2 match is the same as that of the IBRP0 match with the exception that
IBR4,5 (in non-fine mode) are used instead of IBR0,1. Calculation of the IBRP3 match is the same as that
of the IBRP1 match with the exception that we use IBR6,7 (in non-fine mode) instead of IBR2,3.
10.3.6 Data Address Range Matching (PMC13)

For instructions that reference memory, the Itanium 2 processor allows event counting to be
constrained by data address ranges. The 4 architectural Data Breakpoint Registers (DBRs) can be
used to specify the desired address range. Data address range checking capability is controlled by
the Memory Pipeline Event Constraints Register (PMC13).

Figure 10-17 and Table 10-13 describe the fields of register PMC13. When enabled ([1,x0] in the
bits corresponding to one of the 4 DBRs to be used), data address range checking is applied to
loads, stores, semaphore operations, and the lfetch instruction.

Table 10-13. Memory Pipeline Event Constraints Fields (PMC13)
Field | Bits | Description
cfg dbrp0 | 4:3 | These bits determine whether and how DBRP0 should be used for constraining memory pipeline events (where applicable). 00: IBR/Opc/DBR - Use IBRP0/PMC8 and DBRP0 for constraints (i.e., events will be counted only if their Instruction Address, opcodes and Data Address match the values programmed into these registers). 01: IBR/Opc - Use IBRP0/PMC8 for constraints. 10: DBR - Only use DBRP0 for constraints. 11: No constraints. NOTE: When used in conjunction with “fine” mode (see the PMC14 description), only the lower bound DBR pair (DBRP0 or DBRP1) config needs to be set. The upper bound DBR pair config should be left at no constraint. So if IBRP0,2 are chosen for “fine” mode, cfg_dbrp0 needs to be set according to the desired constraints but cfg_dbrp2 should be left as 11 (no constraints).
cfg dbrp1 | 12:11 | These bits determine whether and how DBRP1 should be used for constraining memory pipeline events (where applicable); bit for bit, these match those defined for DBRP0.
cfg dbrp2 | 20:19 | These bits determine whether and how DBRP2 should be used for constraining memory pipeline events (where applicable); bit for bit, these match those defined for DBRP0.
cfg dbrp3 | 48, 28:27 | These bits determine whether and how DBRP3 should be used for constraining memory pipeline events (where applicable); bit for bit, these match those defined for DBRP0.
Enable dbrp0 | 45 | 0 - No constraints. 1 - Constraints as set by cfg dbrp0.
DBRPx match is generated in the following fashion. Arbitrary range checking is not possible since
the mask bits are in powers of 2. Although it is possible to enable more than one DBRP at a time
for checking, it is not recommended. The resulting four matches are combined with PSR.db to form
a single DBR match:

DBRRangeMatch = ((DBRRangeMatch0 or DBRRangeMatch1 or DBRRangeMatch2 or
DBRRangeMatch3) and (not PSR.db))

Events which occur after a memory instruction gets to the EXE stage will fire only if this qualified
match (DBRPx match) is true. The data address is compared to DBRPx; the address match is
further qualified by a number of user configurable bits in PMC13 before being distributed to
different places. DBR matching for performance monitoring ignores the setting of the DBR r, w,
and plm fields.

In order to allow simultaneous use of some DBRs for Performance Monitoring and the others for
debugging (the architected purpose of these registers), separate mechanisms are provided for
enabling DBRs, and the r/w-bit should be cleared to 0 for the DBRP which is going to be used for
the PMU.

Note: Register PMC13 must contain the predetermined value 0x2078fefefefe. If software modifies any
bits not listed in Table 10-13, processor behavior is not defined. It is illegal to have
PMC13[48:45]=0000 and PMC8[0]=0 and ((PMC14[2:1]=10 or 00) or (PMC14[5:4]=10 or 00));
this produces inconsistencies in tagging I-side events in L1D and L3.
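The combination rule above is a plain OR of the individual DBR pair matches gated by PSR.db. The following sketch transcribes it directly; the per-pair match results are assumed to have been computed elsewhere.

#include <stdbool.h>
#include <stdio.h>

/* Combine the four per-pair data address matches, as in the expression above:
 * the combined match is suppressed entirely while PSR.db is set. */
static bool dbr_range_match(const bool m[4], bool psr_db)
{
    return (m[0] || m[1] || m[2] || m[3]) && !psr_db;
}

int main(void)
{
    bool m[4] = { false, true, false, false };
    printf("%d\n", dbr_range_match(m, /*psr_db=*/false));  /* 1 */
    printf("%d\n", dbr_range_match(m, /*psr_db=*/true));   /* 0 */
    return 0;
}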
10.3.7 Event Address Registers (PMC10,11 / PMD0,1,2,3,17)

This section defines the register layout for the Itanium 2 processor instruction and data event
address registers (EARs). Sampling of six events is supported on the Itanium 2 processor:
instruction cache and instruction TLB misses, data cache load misses and data TLB misses, ALAT
misses, and front-end stalls. The EARs are configured through two PMC registers (PMC10,11).
EAR-specific unit masks allow software to specify event collection parameters to hardware.
Instruction and data addresses, operation latencies and other captured event parameters are
provided in five PMD registers (PMD0,1,2,3,17). The instruction and data cache EARs report the
latency of captured cache events and allow latency thresholding to qualify event capture. Event
address data registers (PMD0,1,2,3,17) contain valid data only when event collection is frozen
(PMC0.fr is one). Reads of PMD0,1,2,3,17 while event collection is enabled return undefined values.