Dual-Core Update
to the Intel® Itanium®2 Processor
Reference Manual
For Software Development and Optimization
Revision 0.9
January 2006
Document Number: 308065-001
Notice: This document contains information on products in the design phase of development. The information here is subject to change without
notice. Do not finalize a design with this information.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS
ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Itanium 2 processor may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
The code name “Montecito” presented in this document is only for use by Intel to identify a product, technology, or service in development, that has not
been made commercially available to the public, i.e., announced, launched or shipped. It is not a “commercial” name for products or services and is
not intended to function as a trademark.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-
548-4725, or by visiting Intel's web site at http://www.intel.com.
Intel, Itanium, Pentium, VTune and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
4-9 Derived Monitors for L1 Instruction Cache and Prefetch Events ........................84
4-10 Performance Monitors for L1 Data Cache Events...............................................84
4-11 Performance Monitors for L1D Cache Set 0 .......................................................85
4-12 Performance Monitors for L1D Cache Set 1 .......................................................85
4-13 Performance Monitors for L1D Cache Set 2 .......................................................85
4-14 Performance Monitors for L1D Cache Set 3 .......................................................85
4-15 Performance Monitors for L1D Cache Set 4 .......................................................86
4-16 Performance Monitors for L1D Cache Set 6 .......................................................86
4-19 Performance Monitors for L2 Data Cache Events...............................................87
4-20 Derived Monitors for L2 Data Cache Events.......................................................88
4-21 Performance Monitors for L2 Data Cache Set 0 .................................................89
4-22 Performance Monitors for L2 Data Cache Set 1 .................................................89
4-23 Performance Monitors for L2 Data Cache Set 2 .................................................89
4-24 Performance Monitors for L2 Data Cache Set 3 .................................................89
4-25 Performance Monitors for L2 Data Cache Set 4 .................................................90
4-26 Performance Monitors for L2 Data Cache Set 5 .................................................90
4-27 Performance Monitors for L2 Data Cache Set 6 .................................................90
4-28 Performance Monitors for L2 Data Cache Set 7 .................................................90
4-29 Performance Monitors for L2 Data Cache Set 8 .................................................91
4-30 Performance Monitors for L2D Cache - Not Set Restricted ................................91
4-31 Performance Monitors for L3 Unified Cache Events ...........................................91
4-32 Derived Monitors for L3 Unified Cache Events ...................................................92
4-33 Performance Monitors for System Events...........................................................93
4-34 Derived Monitors for System Events...................................................................93
4-35 Performance Monitors for TLB Events ................................................................93
4-36 Derived Monitors for TLB Events ........................................................................94
4-37 Performance Monitors for System Bus Events....................................................95
4-38 Derived Monitors for System Bus Events............................................................97
4-39 Performance Monitors for RSE Events .............................................................100
4-40 Derived Monitors for RSE Events......................................................................101
4-41 Performance Monitors for Multi-thread Events..................................................101
4-42 All Performance Monitors Ordered by Code .....................................................102
Revision History

Document Number   Revision Number   Description                        Date
308065-001        0.9               Initial release of the document.   January 2006
1 Introduction
This document is an update to the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization. This update is meant to give guidance on the changes that the dual-core Intel® Itanium® 2 processor, code named Montecito, brings to the existing Itanium 2 processor family.
1.1 Terminology
The following definitions are for terms that will be used throughout this document:

Dispersal: The process of mapping instructions within bundles to functional units.
Bundle rotation: The process of bringing new bundles into the two-bundle issue window.
Split issue: Instruction execution when an instruction does not issue at the same time as the instruction immediately before it.
Advanced load address table (ALAT): The ALAT holds the state necessary for advanced load and check operations.
Translation lookaside buffer (TLB): The TLB holds virtual to physical address mappings.
Virtual hash page table (VHPT): The VHPT is an extension of the TLB hierarchy which resides in the virtual memory space and is designed to enhance virtual address translation performance.
Hardware page walker (HPW): The HPW is the third level of address translation. It is an engine that performs page look-ups from the VHPT and seeks opportunities to insert translations into the processor TLBs.
Register stack engine (RSE): The RSE moves registers between the register stack and the backing store in memory.
Event address registers (EARs): The EARs record the instruction and data addresses of data cache misses.
1.2 Related Documentation
The reader of this document should also be familiar with the material and concepts presented in the following documents:

• Intel® Itanium® Architecture Software Developer's Manual, Volume 1: Application Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 2: System Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 3: Instruction Set Reference

§
2 The Dual-Core Itanium 2 Processor
2.1 Overview
The first dual-core Itanium 2 processor, code named Montecito, is the fourth generation of the Itanium 2 processor. Montecito builds on the strengths of the previous Itanium 2 processors while bringing many key new technologies for performance and management to the Itanium processor family. Key improvements include multiple cores, multiple threads, an improved cache hierarchy, and enhanced speculation with the addition of new instructions.
This document describes key Montecito features and how Montecito differs in its implementation of the Itanium architecture from previous Itanium 2 processors. Some of this information may not be directly applicable to performance tuning, but is certainly needed to better understand and interpret changes in application behavior on Montecito versus other Itanium architecture-based processors. Unless otherwise stated, all of the restrictions, rules, sizes, and capacities described in this document apply specifically to Montecito and may not apply to other Itanium architecture-based processors. This document assumes familiarity with the previous Itanium 2 processors and some of their unique properties and behaviors. Furthermore, only differences as they relate to performance are included here. Information about Montecito features such as error protection, Virtualization Technology, Hyper-Threading Technology, and lockstep support may be obtained in separate documents.
General understanding of processor components and explicit familiarity with Itanium processor instructions are assumed. This document is not intended to be used as an architectural reference for the Itanium architecture. For more information on the Itanium architecture, consult the Intel® Itanium® Architecture Software Developer's Manual.
2.1.1 Identifying the Dual-Core Itanium 2 Processor
There have now been four generations of the Itanium 2 processor, which can be identified by their unique CPUID values. For simplicity of documentation, this document groups all processors of like model together. Table 2-1 details the CPUID values of all of the Itanium processor family generations. Table 2-2 lists all of the available varieties of the Itanium processor family along with their groupings.

Note that the Montecito CPUID family value changes to 0x20.
Table 2-1. Itanium® Processor Family and Model Values

Family   Model   Description
0x07     0x00    Itanium® Processor
0x1f     0x00    Itanium 2 Processor (up to 3 MB L3 cache)
0x1f     0x01    Itanium 2 Processor (up to 6 MB L3 cache)
0x1f     0x02    Itanium 2 Processor (up to 9 MB L3 cache)
0x20     0x00    Dual-Core Itanium 2 Processor (Montecito)
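For illustration, software can distinguish these generations by reading CPUID register 3 and extracting the family and model fields, whose positions (family in bits 31:24, model in bits 23:16) are architecturally defined. The following is a minimal sketch assuming GCC-style inline assembly on an Itanium target; it is not code from this manual.

    #include <stdio.h>
    #include <stdint.h>

    /* Read an Itanium CPUID register through the indirect cpuid[] register
       file. Sketch only: assumes GCC inline assembly on an ia64 target. */
    static inline uint64_t read_cpuid(uint64_t index)
    {
        uint64_t value;
        __asm__ ("mov %0 = cpuid[%1]" : "=r"(value) : "r"(index));
        return value;
    }

    int main(void)
    {
        uint64_t info = read_cpuid(3);          /* version information register */
        unsigned family = (info >> 24) & 0xff;  /* 0x20 on Montecito */
        unsigned model  = (info >> 16) & 0xff;  /* 0x00 on Montecito */

        if (family == 0x20)
            printf("Dual-Core Itanium 2 (Montecito), model 0x%02x\n", model);
        else if (family == 0x1f)
            printf("Itanium 2, model 0x%02x\n", model);
        return 0;
    }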
Table 2-2. Definition Table

Intel® Itanium® 2 Processor 900 MHz with 1.5 MB L3 Cache
Intel® Itanium® 2 Processor 1.0 GHz with 3 MB L3 Cache
Low Voltage Intel® Itanium® 2 Processor 1.0 GHz with 1.5 MB L3 Cache
Montecito takes the latest Itanium 2 processor core, improves the memory hierarchy, and adds an enhanced form of temporal multi-threading. A full introduction to the Itanium 2 processor is available elsewhere, but a brief review is provided below.
The front-end, with two levels of branch prediction, two TLBs, and a 0-cycle branch predictor, feeds two bundles of three instructions each into the instruction buffer every cycle. This 8-entry queue decouples the front-end from the back-end and delivers up to two bundles, of any alignment, to the remaining 6 stages of the pipeline. The dispersal logic determines issue groups and allocates up to 6 instructions to nearly every combination of the 11 available functional units (2 integer, 4 memory, 2 floating point, and 3 branch). The renaming logic maps virtual registers into physical registers. Actual register reads (up to 12 integer and 4 floating point) are performed just before the instructions execute or requests are issued to the cache hierarchy. The full bypass network allows nearly immediate access to previous instruction results while final results are written into the register file (up to 6 integer and 4 floating point).
Montecito preserves application and operating system investments while providing greater
opportunity for code generators to continue their steady performance push without any destructive
disturbance. This is important since even today, three years after the introduction of the first
Itanium 2 processor, compilers are providing significant performance improvements. The block
diagram of the Montecito processor can be found in Figure 2-1.
Montecito provides a second integer shifter and popcounter to help reduce port asymmetries. The front-end provides better branching behavior for single cycle branches and cache allocation/reclamation. Finally, Montecito decreases the time to reach recovery code when speculation fails, thereby providing a lower cost for speculation. All told, nearly every core block and piece of control logic includes some optimization to address small deficiencies.
Exposing additional performance in an already capable cache hierarchy is also challenging; Montecito does so with additional capacity, an improved coherence architecture, and more efficient cache organization and queuing. Montecito supports three levels of on-chip cache. The first level (L1) caches are each 4-way set associative and hold 16 KB of instruction or data. These caches are in-order, like the rest of the pipeline, but are non-blocking, allowing high request concurrency. These L1 caches are accessed in a single cycle using pre-validated tags. The data cache is write-through and dual-ported to support two integer loads and two stores, while the instruction cache has dual-ported tags and a single data port to support simultaneous demand and prefetch accesses.
While previous generations of the Itanium 2 processor share the second level (L2) cache between data and instructions, Montecito provides a dedicated 1 MB L2 cache for instructions. This cache is 8-way set associative with a 128 byte line size and provides the same 7 cycle instruction access latency as the previous, smaller Itanium 2 processor unified cache. A single tag and data port supports out-of-order and pipelined accesses to provide high utilization. The separate instruction and data L2 caches provide more efficient access to the caches compared to Itanium 2 processors, where instruction requests would contend against data accesses for L2 bandwidth and potentially impact core execution as well as L2 throughput.
The previously shared 256 KB L2 cache is now dedicated to data on Montecito, with several micro-architectural improvements to increase throughput. The instruction and data separation effectively increases the data hit rate. The L2D hit latency remains at 5 cycles for integer and 6 cycles for floating-point accesses. The tag is true 4-ported and the data is pseudo 4-ported with 16-byte banks. Montecito removes some of the code generator challenges found in the Itanium 2 processor L2 cache. Specifically, in previous Itanium 2 processors, any access beyond the first to miss the L2 would access the L2 tags periodically until a hit is detected. The repeated tag accesses consume bandwidth from the core and increase the miss latency. On Montecito, such misses are suspended until the L2 fill occurs. The fill awakens and immediately satisfies the request, which greatly reduces bandwidth contention and final latency. The Montecito L2D, like previous generations of the Itanium 2 processor L2, is out-of-order and pipelined with the ability to track up to 32 requests in addition to 16 misses and their associated victims. However, Montecito optimizes allocation of the 32 queue entries, providing a higher concurrency level than previously possible.
The third level (L3) cache remains unified as in previous Itanium 2 processors, but is now 12 MB
in size while maintaining the same 14 cycle integer access latency found on the 6 MB and 9 MB
Itanium 2 processors. The L3 uses an asynchronous interface with the data array to achieve this low
latency; there is no clock, only a read or write valid indication. The read signal is coincident with
index and way values that initiate L3 data array accesses. Four cycles later, the entire 128-byte line
is available and latched. This data is then delivered in 4 cycles to either the L2D or L2I cache in
critical byte order.
The L3 receives requests from both the L2I and L2D but gives priority to the L2I request in the rare
case of a conflict. Moving the arbitration point from the L1-L2 in the Itanium 2 processor to the
L2-L3 cache greatly reduces conflicts thanks to the high hit rates of the L2.
The cache hierarchy is replicated in each core, totaling more than 13.3 MB per core and nearly 27 MB for the entire processor.
Figure 2-1. The Montecito Processor
2.2 New Instructions
Montecito is compliant with the latest revisions of the Itanium architecture in addition to the Intel
Itanium Architecture Virtualization Specification Update. As such, Montecito introduces several
new instructions as summarized below:
Table 2-3. New Instructions Available in Montecito

New Instruction    Comment
fc.i (1)           Insures that instruction caches are coherent with data caches.
ld16 (2)           AR.csd and the register specified are the targets for this load.
st16 (2)           AR.csd and the value in the register specified are written for this store.
cmp8xchg16 (2)     AR.csd and the value in the register specified are written for this exchange if the 8 byte compare is true.
hint@pause (3)     The current thread is yielding resources to the other thread.
vmsw.0, vmsw.1     On promote pages, these instructions allow cooperative operating systems to obtain and give up VMM privilege.

NOTES:
1. This instruction behaves as the fc instruction on Montecito.
2. This instruction will fault if issued to UC, UCE, or WC memory.
3. This instruction will not initiate a thread switch if it is a B type instruction.
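Of these, hint@pause is the one most software will use directly, typically inside a spin-wait loop so the waiting thread yields the core. The following is a minimal sketch assuming GCC-style inline assembly on an Itanium target; the lock variable and loop structure are illustrative, not from this manual.

    #include <stdint.h>

    /* Spin until *lock becomes 0, yielding execution resources to the
       other hardware thread on each iteration via hint@pause. */
    static void spin_wait(volatile uint64_t *lock)
    {
        while (*lock != 0) {
            /* Tells the core this thread has no useful work right now;
               on Montecito this can trigger a switch to the background
               thread. */
            __asm__ volatile ("hint @pause" ::: "memory");
        }
    }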
2.3 Core
The Montecito core is very similar to previous generations of the Itanium 2 processor core from a code generation point of view. The core has new resources; specifically, an additional integer shifter and popcounter. The core also removes the rarely needed MMU to Memory Address bypass path. Finally, the core includes many optimizations, from the front-end to the cache hierarchy, that are transparent to the code generator, so legacy code can see improvements without any code change.
2.3.1 Instruction Slot to Functional Unit Mapping
This information is very similar to previous Itanium 2 processors. Changes between Itanium 2
processors and Montecito will be noted with footnotes.
Each fetched instruction is assigned to a functional unit through an issue port. The numerous
functional units share a smaller number of issue ports. There are 11 functional units: eight for non-branch instructions and three for branch instructions. They are labeled M0, M1, M2, M3, I0, I1, F0,
F1, B0, B1, and B2. The process of mapping instructions within bundles to functional units is
called dispersal.
An instruction's type and position within the issue group determine which functional unit the instruction is assigned to. An instruction is mapped to a subset of the functional units based upon the
instruction type (i.e. ALU, Memory, Integer, etc.). Then, based on the position of the instruction
within the instruction group presented for dispersal, the instruction is mapped to a particular
functional unit within that subset.
Table 2-4, Table 2-5, Table 2-6 and Table 2-7 show the mappings of instruction types to ports and
functional units.
Note: Shading in the following tables indicates the instruction type can be issued on the port(s).
A-type instructions can be issued on all M and I ports (M0-M3 and I0 and I1). I-type instructions
can only issue to I0 or I1. The I ports are asymmetric so some I-type instructions can only issue on
port I0. M ports have many asymmetries: some M-type instructions can issue on all ports; some can
only issue on M0 and M1; some can only issue on M2 and M3; some can only issue on M0; some
can only issue on M2.
[Latency table garbled in extraction. Recoverable rows: a memory/floating-point row with latencies of M+1 to M+3 cycles (see note 4); IEU2 operations (move_from_br, alloc) with latencies of 2 to 3 cycles; move to/from CR or AR with latency C (see note 5); move to pr; and move indirect with latency D (see note 6).]
NOTES:
1. The MMU to memory address bypass in Montecito does not exist. If code does not account for the missing bypass, the processor will detect the case
and cause a pipeflush to ensure proper separation between the producer and the consumer.
2. Since these operations are performed by the L2D, they interact with the L2D pipeline. These are the minimum latencies but they could be much larger
because of this interaction.
3. N depends upon which level of cache is hit: N=1 for L1D, N=5 for L2D, N=14-15 for L3, N=~180-225 for main memory. These are minimum latencies
and are likely to be larger for higher levels of cache.
4. M depends upon which level of cache is hit: M=5 for L2D, M=14-15 for L3, M=~180-225 for main memory. These are minimum latencies and are
likely to be larger for higher levels of cache. The +1 in all table entries denotes one cycle needed for format conversion.
5. Best case values of C range from 2 to 35 cycles depending upon the registers accessed. EC and LC accesses are 2 cycles, FPSR and CR accesses
are 10-12 cycles.
6. Best case values of D range from 6 to 35 cycles depending upon the indirect registers accessed. LREGS, PKR, and RR are on the faster side being
6 cycle accesses.
2.3.3 Caches and Cache Management Changes
Montecito, like the previous Itanium 2 processors, supports three levels of on-chip cache. Each
core contains a complete cache hierarchy, with nearly 13.3 Mbytes per core, for a total of nearly 27
Mbytes of processor cache.
Table 2-9. Montecito Cache Hierarchy Summary

Cache  Data Types Supported                   Write Policy  Data Array Size  Line Size  Ways  Index     Queuing                      Minimum/Typical Latency
L1D    Integer                                WT            16 KB            64 Bytes   4     VA[11:6]  8 Fills                      1/1
L1I    Instruction                            NA            16 KB            64 Bytes   4     VA[11:6]  1 Demand + 7 Prefetch Fills  1/1
L2D    Integer, Floating Point                WB            256 KB           128 Bytes  8     PA[14:7]  32 OzQ/16 Fills              5/11
L2I    Instruction                            NA            1 MByte          128 Bytes  8     PA[16:7]  8                            7/10
L3     Integer, Floating Point, Instruction   WB            12 MByte         128 Bytes  12    PA[19:7]  8                            14/21
2.3.3.1 L1 Caches

The L1I and L1D caches are essentially unchanged from previous generations of the Itanium 2 processor.
2.3.3.2 L2 Caches

The level 2 caches are both similar to and different from the Itanium 2 processor L2 cache. The previous Itanium 2 processor L2 is shared between data and instructions, while Montecito has dedicated instruction (L2I) and data (L2D) caches. This separation of instruction and data caches makes it possible to have dedicated access paths to the caches, and thus eliminates contention and eases capacity pressures on the L2 caches.
The L2I cache holds 1 Mbyte, is eight-way set associative, and has a 128-byte line size, yet has the same seven-cycle instruction-access latency as the smaller previous Itanium 2 processor unified cache. The tag and data arrays are single ported, but the control logic supports out-of-order and pipelined accesses. This large cache greatly reduces the number of instruction accesses seen at the L3 cache. Any coherence request that checks whether a cache line is in the processor will invalidate that line from the L2I cache.
The L2D cache has the same structure and organization as the Itanium 2 processor shared 256 KB L2 cache, but with several microarchitectural improvements to increase throughput. The L2D hit latency remains at five cycles for integer and six cycles for floating-point accesses. The tag array is true four-ported (four fully independent accesses in the same cycle) and the data array is pseudo four-ported with 16-byte banks.
Montecito optimizes several aspects of the L2D. In the Itanium 2 processor, any accesses to the
same cache line beyond the first access that misses L2 will access the L2 tags periodically until the
tags detect a hit. The repeated tag accesses consume bandwidth from the core and increase the L2
miss latency. Montecito suspends such secondary misses until the L2D fill occurs. At that point, the
fill immediately satisfies the suspended request. This approach greatly reduces bandwidth
contention and final latency. The L2D, like the Itanium 2 processor L2, is out of order, pipelined,
and tracks 32 requests (L2D hits or L2D misses not yet passed to the L3 cache) in addition to 16
misses and their associated victims. The difference is that Montecito allocates the 32 queue entries
more efficiently, which provides a higher concurrency level than with the Itanium 2 processor.
Specifically, the queue allocation policy now supports recovery of empty entries. This allows for
greater availability of the L2 OzQ in light of accesses completed out of order.
The L2D also considers the thread identifier when performing ordering such that an ordered
request from one thread is not needlessly ordered against another thread’s accesses.
2.3.3.3 L3 Cache
Montecito's L3 cache remains unified as in previous Itanium 2 processors, but is now 12 MB. Even so, it maintains the same 14-cycle best-case integer-access latency as the 6 MB and 9 MB Itanium 2 processors. Montecito's L3 cache uses an asynchronous interface with the data array to achieve this low latency; there is no clock, only a read or write valid indication. Four cycles after a read signal, index, and way are presented, the entire 128-byte line is available and latched. The array then delivers this data in four cycles to either the L2D or L2I in critical-byte order.
Montecito's L3 receives requests from both the L2I and L2D but gives priority to the L2I request in the rare case of a conflict. Conflicts are rare because Montecito moves the arbitration point from the Itanium 2 processor's L1-L2 boundary to the L2-L3 boundary. This greatly reduces conflicts because of the L2I's and L2D's high hit rates. The new arbitration point also reduces conflict and access pressure within the core; L1I misses go directly to the L2I and not through the core. L2I misses contend against L2D requests for L3 access.
2.3.3.4 Request Tracking
All L2I and L2D requests are allocated to one of 16 request buffers. Requests are sent to the L3 cache and the system from these buffers by the tracking logic. A modified L2D victim or partial write may be allocated to one of 8 write buffers, an increase of 2 over the Itanium 2 processor. The lifetime of the L2D victim buffers is also significantly decreased to further reduce pressure on them. Lastly, the L3 dirty victim resources have grown by 2 entries to 8 in Montecito.

In terms of write coalescing buffers (WCBs), Montecito has four 128-byte line WCBs in each core. These are fully shared between threads.
2.4 Threading
The multiple thread concept starts with the idea that the processor has some resources that cannot be effectively utilized by a single thread. Therefore, sharing under-utilized resources between multiple threads will increase utilization and performance. The Montecito processor Hyper-Threading Technology implementation duplicates and shares resources to create two logical processors. All architectural state and some micro-architectural state is duplicated.

The duplicated architectural state (general, floating point, predicate, branch, application, translation, performance monitoring, bank, and interrupt registers) allows each thread to appear as a complete processor to the operating system, thus minimizing the changes needed at the OS level. The duplicated micro-architectural state of the return stack buffer and the advanced load address table (ALAT) prevents cross-thread pollution that would occur if these resources were shared between the two logical processors.
The two logical processors share the parallel execution resources (core) and the memory hierarchy (caches and TLBs). There are many approaches to sharing resources, varying from fixed time intervals (temporal multi-threading, or TMT) to sharing resources concurrently (simultaneous multi-threading, or SMT). The Montecito Hyper-Threading Technology approach blends both, such that the core is shared between threads using a TMT approach while the memory hierarchy shares resources using an SMT approach. The core TMT approach is further augmented with control hardware that monitors the dynamic behavior of the threads and allocates core resources to the most appropriate thread; an event experienced by the workload may cause a switch before the thread quantum of TMT would cause a switch. This modification of TMT may be termed switch-on-event multi-threading.
2.4.1 Sharing Core Resources
Many processors implementing multi-threading share resources using the SMT paradigm. In SMT, instructions from different threads compete for and share execution resources such that each functional resource is dynamically allocated to an available thread. This approach allocates resources originally meant for instruction level parallelism (ILP), but under-utilized in the single thread case, to exploit thread level parallelism (TLP). This is common in many out-of-order execution designs where increased utilization of functional units can be attained for little cost.

Processor resources may also be shared temporally rather than simultaneously. In TMT, a thread is given exclusive ownership of resources for a small time period. Complexity may be reduced by expanding the time quantum to at least the pipeline depth, thus ensuring that only a single thread owns any execution or pipeline resources at any moment. Using this approach to multi-threading, nearly all structures and control logic can be thread agnostic: the natural behaviors of the pipeline, bypass, and stall control logic for execution are leveraged, while orthogonal logic that controls and completes a thread switch is added. However, this approach also means that a pipeline flush is required at thread switch points.
In the core, one thread has exclusive access to the execution resources (foreground thread) for a
period of time while the other thread is suspended (background thread). Control logic monitors the
workload's behavior and dynamically decreases the time quantum for a thread that is not likely to
make progress. Thus, if the control logic determines that a thread is not making progress, the
pipeline is flushed and the execution resources are given to the background thread. This ensures
better overall utilization of the core resources over strict TMT and effectively hides the cost of long
latency operations such as memory accesses.
A thread switch on Montecito requires 15 cycles from initiation until the background thread retires an instruction. Given the low latency of the memory hierarchy (1 cycle L1D, 5 cycle L2D, and 14 cycle L3), memory accesses that miss the on-chip caches are the only potentially stalling condition that greatly exceeds the thread switch time, and thus they are the primary switch event.
A thread switch also has other side effects such as invalidating the Prefetch Virtual Address Buffer
(PVAB) and canceling any prefetch requests in the prefetch pipeline.
2.4.1.1 The Switch Events
There are several events that can lead to a thread switch. Given that hiding memory latency is the primary motivation for multi-threading, the most common switch event is based on L3 cache misses and data returns. Other events, such as the time-out and forward progress events, provide fairness, while the hint events provide paths for software to influence thread switches. These events have an impact on a thread's urgency, which indicates the thread's ability to effectively use core resources. Each event is described below:
• L3 Cache Miss - An L3 miss by the foreground thread is likely to cause that thread to stall
waiting for the return from the system interface. Hence, L3 misses can trigger thread switches
subject to thread urgency comparisons. This event decreases the thread’s urgency. Since there
is some latency between when a thread makes a request and when it is determined to be an L3
miss, it is possible to have multiple requests from a thread miss the L3 cache before a thread
switch occurs.
• L3 Cache Return - An L3 miss data return for the background thread is likely to resolve data dependences and is an early indication of execution readiness; hence an L3 miss data return can trigger thread switch events subject to thread urgency comparisons. This event increases the thread's urgency.
• Time-out - Thread-quantum counters ensure fairness in access to the pipeline execution
resources for each thread. If the thread-quantum expiration occurs when the thread was not
stalled, its urgency is set to a high value to indicate execution readiness prior to the switch
event.
• Switch Hint - The Itanium architecture provides the hint@pause instruction, which can trigger a thread switch to yield execution to the background thread. This allows software to indicate when the current thread has no need of the core resources.
• Low-power Mode - When the active thread has entered into a quiesced low-power mode, a
thread switch is triggered to the background thread so that it may continue execution.
Similarly, if both threads are in a quiesced low-power state, and the background thread is
awakened, a thread switch is triggered.
The L3 miss and data return event can occur for several types of accesses: data or instruction,
prefetch or demand, cacheable or uncacheable, or hardware page walker (HPW). A data demand
access includes loads, stores, and semaphores.
The switch events are intended to enable the control logic to decide the appropriate time to switch threads without software intervention. Thus, Montecito Hyper-Threading Technology is mostly transparent to the application and the operating system.
2.4.1.2 Software Control of Thread Switching
The hint@pause instruction is used by software to initiate a thread switch. The intent is to allow code to indicate that it does not have any useful work to do and that its execution resources should be given to the other thread. Some later event, such as an interrupt, may change the work for the thread and should awaken the thread.

The hint@pause instruction forces a switch from the foreground thread to the background thread. This instruction can be predicated to conditionally initiate a thread switch. Since the current issue group retires before the switch is initiated, code sequences that differ only in the placement of hint@pause within an issue group are equivalent.
2.4.1.3 Urgency
Each thread has an urgency value ranging from 0 to 7. A value of 0 denotes that a thread has no useful work to perform. A value of 7 signifies that a thread is actively making forward progress. The nominal urgency is 5, indicating that a thread is actively progressing. The urgency of one thread is compared against the other at every L3 event. If the urgency of the currently executing thread is lower than that of the background thread, then the L3 event will initiate a thread switch. Every L3 miss event decrements the urgency by 1, eventually saturating at 0. Similarly, every L3 return event increments the urgency by 1 as long as the urgency is below 5. Figure 2-2 shows a typical urgency based switch scenario. The urgency can be set to 7 for a thread that is switched out due to a time-out event. An external interrupt directed at the background thread will set the urgency for that thread to 6, which increases the probability of a thread switch and provides a reasonable response time for interrupt servicing.
Figure 2-2. Urgency and Thread Switching
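The rules above can be summarized with a small behavioral model. The sketch below is purely illustrative; the names and structure are ours, not a hardware description.

    #include <stdbool.h>

    /* Illustrative model of Montecito's per-thread urgency counter. */
    typedef struct { int urgency; } thread_state;   /* 0..7, nominal 5 */

    /* L3 miss by a thread: urgency decrements, saturating at 0. */
    void on_l3_miss(thread_state *t)   { if (t->urgency > 0) t->urgency--; }

    /* L3 data return for a thread: urgency increments while below 5. */
    void on_l3_return(thread_state *t) { if (t->urgency < 5) t->urgency++; }

    /* Time-out while not stalled: urgency set high before the switch. */
    void on_timeout(thread_state *t)   { t->urgency = 7; }

    /* External interrupt for the background thread raises urgency to 6. */
    void on_interrupt(thread_state *t) { t->urgency = 6; }

    /* At every L3 event: switch if the foreground thread is less urgent. */
    bool should_switch(const thread_state *fg, const thread_state *bg)
    {
        return fg->urgency < bg->urgency;
    }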
2.4.2 Tailoring Thread Switch Behavior
Montecito allows the behavior of the thread switch control logic to be tailored to meet specific software requirements. Specifically, thread switch control may emphasize overall performance, thread fairness, or elevate the priority of one thread over the other. These different behaviors are available through a low latency PAL call, PAL_SET_HW_POLICY. This allows software to exert some level of control over how the processor determines the best time to switch. Details on this call and its parameters can be found in the latest Intel® Itanium® Architecture Software Developer's Manual and Intel® Itanium® Architecture Software Developer's Manual Specification Update.
2.4.3 Sharing Cache and Memory Resources
The Montecito memory resources that are concurrently or simultaneously shared between the two threads include the first and second level TLBs, the first, second, and third level caches, and the system interface resources. Each of these structures is impacted in different ways as a result of this sharing.
2.4.3.1 Hyper-Threading Technology and the TLBs
The ptc.e instruction in previous Itanium 2 processors would invalidate the entire Translation Cache (TC) section of the TLB with one instruction. This same behavior is retained for Montecito with the caveat that a ptc.e issued on one thread will invalidate the TC of the other thread at the same time.

The L2I and L2D TLBs on the Itanium 2 processor supported 64 Translation Registers (TR). Montecito supports 32 TRs for each logical processor.
2.4.3.1.1 Instruction TLBs
The replacement algorithms for the L1I and L2I TLB do not consider thread for replacement vector
updating. However, the L2I TLB will reserve one TLB entry for each thread to meet the
architectural requirements for TCs available to a logical processor.
The TLBs support SMT-based sharing by assigning a thread identifier to the virtual address. Thus,
two threads cannot share the same TLB entry at the same time even if the virtual address is the
same between the two threads.
Since the L1I TLB is key in providing pseudo-virtual access to the L1I cache using prevalidation, when an L1I TLB entry is invalidated, the L1I cache entries associated with that page (up to 4 KB) are invalidated. However, the invalidation of a page (and hence its cache contents) can be suppressed when two threads access the same virtual and physical addresses. This allows the two threads to share much of the L1I TLB and cache contents. For example, T0 inserts an L1I TLB entry with VA=0 and PA=0x1001000. T0 then accesses VAs 0x000 to 0xFFF, which are allocated to the L1I cache. A thread switch occurs. Now, T1 initiates an access with VA=0. It will miss in the L1I TLB because the entry with VA=0 belongs to T0. T1 will insert an L1I TLB entry with VA=0 and PA=0x1001000. The T1 L1I TLB entry replaces the T0 L1I TLB entry without causing an invalidation. Thus, the accesses performed by T0 become available to T1, with the exception of the initial T1 access that inserted the L1I TLB page. Since the L1I cache contents can be shared between two threads and the L1I cache includes branch prediction information, this optimization allows one thread to impact the branch information contained in the L1I cache and hence the branch predictions generated for each thread.
2.4.3.1.2 Data TLBs
The replacement algorithms for the L1D and L2D TLBs do not consider threads for replacement vector updating. However, the L2D TLB reserves 16 TLB entries for each thread to meet the architectural requirements for TCs available to a logical processor.
The TLBs support SMT based sharing by assigning a thread identifier to the virtual address. Thus,
two threads cannot share the same TLB entry at the same time even if the virtual address is the
same between the two threads.
Despite the fact that both the instruction and data L1 TLBs support prevalidation, the L1I TLB
optimization regarding cache contents is not supported in the L1D TLB.
24Reference Manual for Software Development and Optimization
The Dual-Core Itanium 2 Processor
2.4.3.2 Hyper-Threading Technology and the Caches
The L2I, L2D, and L3 caches are all physically addressed. Thus, the threads can fully share the cache contents (i.e. a line allocated by T0 can be accessed by both T0 and T1). The queueing resources for these cache levels are equally available to each thread. The replacement logic also ignores threads, such that T0 can cause an eviction of T1-allocated data, and a hit will cause a cache line to be considered recently used regardless of the thread that allocated or accessed the line.

A thread identifier is provided with each instruction or data cache request to ensure proper ordering of requests between threads at the L2D cache, in addition to supporting performance monitoring and switch event calculations at all levels. The thread identifier allows ordered and unordered transactions from T0 to pass ordered transactions from T1.
2.4.3.3 Hyper-Threading Technology and the System Interface
The system interface logic also ignores the thread identifier in allocating queue entries and in prioritizing system interface requests. The system interface logic tracks L3 misses and fills and, as such, uses the thread identifier to correctly signal to the core which thread missed or filled the cache for L3 miss/return events. The thread identifier is also used in performance monitor event collection and counting.

The thread identifier can be made visible on the system interface as part of the agent identifier through a PAL call. This is for informational purposes only, as the bit would appear in a reserved portion of the agent identifier, and Montecito does not require the memory controller to ensure forward progress and fairness based on the thread identifier; the L2D cache ensures forward progress between threads.
2.5 Dual Cores
Montecito is the first dual-core Itanium 2 processor. The two cores attach to the system interface through the arbiter, which provides a low-latency path for each core to initiate and respond to system events.

Figure 2-3 is a block diagram of the arbiter, which organizes and optimizes each core's requests to the system interface, ensures fairness and forward progress, and collects responses from each core to provide a unified response. The arbiter maintains each core's unique identity to the system interface and operates at a fixed ratio to the system interface frequency. The cores are responsible for thread ordering and fairness, so a thread identifier to uniquely identify transactions on the system interface is not necessary. However, the processor can be configured to provide the thread identifier for informational purposes only.
Figure 2-3. The Arbiter and Queues
As the figure shows, the arbiter consists of a set of address queues, data queues, and synchronizers,
as well as logic for core and system interface arbitration. Error-Correction Code (ECC) encoders/
decoders and parity generators exist but are not shown.
The core initiates one of three types of accesses, which the arbiter allocates to the following queues
and buffers:
• Request queue. This is the primary address queue that supports most request types. Each core
has four request queues.
• Write address queue. This queue holds addresses only and handles explicit writebacks and
partial line writes. Each core has two write address queues.
• Clean castout queue. This queue holds the address for the clean castout (directory and snoop
filter update) transactions. The arbiter holds pending transactions until it issues them on the
system interface. Each core has four clean castout queues.
• Write data buffer. This buffer holds outbound data and has a one-to-one correspondence with
addresses in the write address queue. Each core has four write data buffers, with the additional
two buffers holding implicit writeback data.
The number of entries in these buffers is small because the entries are deallocated once the transaction is issued on the system interface. System interface responses to the transaction are sent directly to the core, where the overall tracking of a system interface request occurs.
Note that there are no core-to-core bypasses. Thus, a cache line that is requested by core 0 but exists modified in core 1 will be requested on the system interface; the request snoops core 1, which provides the data and a modified snoop result, all of which is seen on the system interface.
The snoop queue issues snoop requests to the cores and coalesces the snoop responses from each core into a unified snoop response for the socket. If any core is delayed in delivering its snoop response, the arbiter will delay the snoop response on the system interface.
The arbiter delivers all data returns directly to the appropriate core using a unique identifier
provided with the initial request. It delivers broadcast transactions, such as interrupts and TLB
purges, to both cores in the same way that delivery would occur if each core were connected
directly to the system interface.
2.5.1 Fairness and Arbitration
The arbiter interleaves core requests on a one-to-one basis when both cores have transactions to issue. When only one core has requests, it can issue its requests without the other core having to issue a transaction. Because read latency is the greatest concern, read requests are typically the highest priority, followed by writes, and finally clean castouts. Each core tracks the occupancy of the arbiter's queues using a credit system for flow control. As requests complete, the arbiter informs the appropriate core of the type and number of deallocated queue entries. The cores use this information to determine which, if any, transaction to issue to the arbiter.
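Conceptually, the credit system amounts to a counter per queue type on each core. The sketch below is an illustrative model only, with names of our choosing, not a hardware description.

    #include <stdbool.h>

    /* Illustrative model of the credit-based flow control described above.
       Each core holds one credit per free arbiter queue entry of each type. */
    enum queue_type { Q_REQUEST, Q_WRITE_ADDR, Q_CLEAN_CASTOUT, Q_TYPES };

    typedef struct {
        int credits[Q_TYPES];  /* e.g. 4 request, 2 write address, 4 castout */
    } core_flow_control;

    /* The core may issue to the arbiter only if it holds a credit. */
    bool try_issue(core_flow_control *fc, enum queue_type q)
    {
        if (fc->credits[q] == 0)
            return false;   /* arbiter queue full: hold the transaction */
        fc->credits[q]--;   /* consume a credit on issue */
        return true;
    }

    /* The arbiter returns credits as entries deallocate after the
       transaction is issued on the system interface. */
    void on_deallocate(core_flow_control *fc, enum queue_type q, int count)
    {
        fc->credits[q] += count;
    }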
2.6 Intel® Virtualization Technology
The Montecito processor is the first Itanium 2 processor to implement Intel® Virtualization
Technology. The full specification as well as further information on Intel Virtualization Technology
can be found at:
Section 2.5 in Part 2 of Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual specifies the sequences that must be followed when any instruction code may exist in the data cache. Many violations of these sequences may have gone unnoticed on previous Itanium 2 processors, but such violations are likely to be exposed by the cache hierarchy found in Montecito. Code in violation of the architecture should be modified to adhere to the architectural requirements.
The large L2I and the separation of instructions and data at the L2 level also require additional time to ensure coherence when using the PAL_CACHE_FLUSH procedure with the I/D coherence option. Previously lower cost uses of the PAL_CACHE_FLUSH call should be replaced with the architecturally required code sequence for ensuring instruction and data consistency.
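For reference, the architected sequence for making the instruction stream coherent with data writes is to flush each modified line with fc.i, then execute sync.i followed by srlz.i. The sketch below assumes GCC-style inline assembly on an Itanium target and uses a conservative 32-byte flush stride; it is an outline of the architectural sequence, not code from this manual.

    #include <stdint.h>

    /* Make the instruction cache coherent with the data cache for a range
       that was written as data (e.g. generated code). */
    static void sync_icache(void *start, uint64_t len)
    {
        uintptr_t p   = (uintptr_t)start & ~31UL;
        uintptr_t end = (uintptr_t)start + len;

        for (; p < end; p += 32)
            __asm__ volatile ("fc.i %0" :: "r"(p) : "memory"); /* flush line */

        __asm__ volatile (";; sync.i" ::: "memory");    /* order the flushes */
        __asm__ volatile (";; srlz.i ;;" ::: "memory"); /* serialize i-fetch */
    }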
2.7.2 ld.bias and lfetch.excl
The ld.bias and lfetch.excl instructions have been enhanced on the Montecito processor. These instructions can now bring lines into the cache in a state that is ready to be modified, if supported by the memory controller. This feature allows a single ld.bias or lfetch.excl to prefetch both the source and destination streams. This feature is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES bit 7 of the Montecito feature_set (18).
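As an illustration of prefetching both streams of a copy, the sketch below uses GCC's __builtin_prefetch with the write hint, which an Itanium compiler may emit as lfetch.excl; the stride and prefetch distance are illustrative assumptions, not values from this manual.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy with software prefetch of both streams. The write hint
       (second argument = 1) asks for the destination line in a
       ready-to-modify state. */
    void copy_with_prefetch(uint64_t *dst, const uint64_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n) {
                __builtin_prefetch(src + i + 16, 0);  /* source: read stream */
                __builtin_prefetch(dst + i + 16, 1);  /* destination: write  */
            }
            dst[i] = src[i];
        }
    }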
2.7.3 L2D Victimization Optimization

Montecito also improves on the behaviors associated with internal cache line coherence tracking. The number of false L2D victims is drastically reduced on Montecito relative to previous Itanium 2 processors. This optimization is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES.
2.7.4 Instruction Cache Coherence Optimization
Coherence requests to the L1I and L2I caches will invalidate the line if it is present in the cache. Montecito allows instruction requests on the system interface to be filtered such that they will not initiate coherence requests to the L1I and L2I caches. This allows instructions to be cached at the L1I and L2I levels across multiple processors in a coherent domain. This optimization is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES bit 5 of the Montecito feature_set (18).
2.8 IA-32 Execution
IA-32 execution on the Montecito processor is enabled with the IA-32 Execution Layer (IA-32 EL) and PAL-based IA-32 execution. IA-32 EL is OS-based and is only available after an OS has booted. PAL-based IA-32 execution is available after PAL_COPY_PAL is called and provides IA-32 execution support before the OS has booted. All OSes running on Montecito are required to have IA-32 EL installed. There is no support for PAL-based IA-32 execution in an OS environment.
IA-32 EL is a software layer, currently shipping with Itanium architecture-based operating systems, that converts IA-32 instructions into Itanium processor instructions via dynamic translation. Further details on operating system support and functionality of IA-32 EL can be found at http://www.intel.com/cd/ids/developer/asmo-na/eng/strategy/66007.htm.
2.9 Brand Information
One of the newer additions to the Itanium architecture is the PAL_BRAND_INFO procedure. This procedure, along with PAL_PROC_GET_FEATURES, allows software to obtain processor branding and feature information. Details on these functions can be found in the Intel® Itanium® Architecture Software Developer's Manual.

Below is the table of the implementation-specific return values for PAL_BRAND_INFO. Montecito will implement all three; however, previous implementations of the Itanium 2 processor are unable to retrieve the processor frequency, so requests for these fields will return -6 (information not available). Also, previous Itanium 2 processors cannot return the system bus frequency. Implementation-specific values are expected to start at value 16 and continue until an invalid argument (-2) is returned.

Note: The values returned below are the values at which the processor was validated, which are not necessarily the values at which the processor is currently running.
Value   Description
18      The system bus frequency component (in Hz) of the brand identification string will be returned in the brand_info return argument.
17      The cache size component (in bytes) of the brand identification string will be returned in the brand_info return argument.
16      The frequency component (in Hz) of the brand identification string will be returned in the brand_info return argument.
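Software can therefore enumerate the implementation-specific fields by walking indices upward from 16 until PAL reports an invalid argument. The sketch below assumes a hypothetical pal_brand_info() wrapper, since PAL procedures use a firmware calling convention and are not directly callable from C.

    #include <stdio.h>
    #include <stdint.h>

    #define PAL_STATUS_INVALID_ARG  (-2)  /* invalid argument */
    #define PAL_STATUS_UNAVAILABLE  (-6)  /* information not available */

    /* Hypothetical wrapper around the PAL_BRAND_INFO firmware call; a real
       implementation must use the PAL static-register calling convention. */
    extern long pal_brand_info(uint64_t index, uint64_t *brand_info);

    void enumerate_brand_info(void)
    {
        /* Implementation-specific fields start at 16 and continue until
           PAL reports an invalid argument. */
        for (uint64_t index = 16; ; index++) {
            uint64_t value;
            long status = pal_brand_info(index, &value);

            if (status == PAL_STATUS_INVALID_ARG)
                break;        /* no more fields */
            if (status == PAL_STATUS_UNAVAILABLE)
                continue;     /* field not supported on this processor */
            printf("brand info field %llu = %llu\n",
                   (unsigned long long)index, (unsigned long long)value);
        }
    }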
There are other processor features that may not be included in the brand name above. To obtain information on whether a given technology or feature has been implemented, the PAL_PROC_GET_FEATURES procedure should be used. Montecito features are reported in the Montecito processor feature_set (18).
Table 2-11. Montecito Processor Feature Set Return Values

Value   Definition
18      Hyper-Threading Technology (HT) - This processor supports Hyper-Threading Technology.
17      Low Voltage (LV) - This processor is a low power SKU.
16      Dual-Processor (DP) - This processor is restricted to two processor (DP) systems.
§
3 Performance Monitoring
3.1 Introduction to Performance Monitoring
This chapter defines the performance monitoring features of the Montecito processor. The
Montecito processor provides 12 48-bit performance counters per thread, 200+ monitorable events,
and several advanced monitoring capabilities. This chapter outlines the targeted performance
monitor usage models and defines the software interface and programming model.
The Itanium architecture incorporates architected mechanisms that allow software to actively and directly manage performance critical processor resources such as branch prediction structures, processor data and instruction caches, virtual memory translation structures, and more. To achieve the highest performance levels, dynamic processor behavior needs to be monitored and fed back into the code generation process to better encode observed run-time behavior or to expose higher levels of instruction level parallelism. These measurements will be critical for understanding the behavior of compiler optimizations, the use of architectural features such as speculation and predication, or the effectiveness of microarchitectural structures such as the ALAT, the caches, and the TLBs. They will provide the data to drive application tuning and future processor, compiler, and operating system designs.
The remainder of this chapter is divided into the following sections:

• Section 3.2 discusses how performance monitors are used, and presents the Montecito processor performance monitor programming models.
• Section 3.3 defines the Montecito processor specific performance monitoring features, structures, and registers.

Chapter 4 provides an overview of the Montecito processor events that can be monitored.
3.2 Performance Monitor Programming Models
This section introduces the Montecito processor performance monitoring features from a
programming model point of view and describes how the different event monitoring mechanisms
can be used effectively. The Montecito processor performance monitor architecture focuses on the
following two usage models:
• Workload Characterization: the first step in any performance analysis is to understand the
performance characteristics of the workload under study. Section 3.2.1 discusses the
Montecito processor support for workload characterization.
• Profiling: profiling is used by application developers and profile-guided compilers.
Application developers are interested in identifying performance bottlenecks and relating them
back to their code. Their primary objective is to understand which program location caused
performance degradation at the module, function, and basic block level. For optimization of
data placement and the analysis of critical loops, instruction level granularity is desirable.
Profile-guided compilers that use advanced features of the Itanium architecture, such as predication and speculation, benefit from run-time profile information to optimize instruction schedules. The Montecito processor supports instruction level statistical profiling of branch mispredicts and cache misses. Details of the Montecito processor's profiling support are described in Section 3.2.2.
3.2.1 Workload Characterization
The first step in any performance analysis is to understand the performance characteristics of the workload under study. There are two fundamental measures of interest: event rates and the program cycle breakdown.
• Event Rate Monitoring: Event rates of interest include average retired instructions-per-clock
(IPC), data and instruction cache miss rates, or branch mispredict rates measured across the
entire application. Characterization of operating systems or large commercial workloads (e.g.
OLTP analysis) requires a system-level view of performance relevant events such as TLB miss
rates, VHPT walks/second, interrupts/second, or bus utilization rates. Section 3.2.1.1 discusses
event rate monitoring.
• Cycle Accounting: The cycle breakdown of a workload attributes a reason to every cycle
spent by a program. Apart from a program’s inherent execution latency, extra cycles are
usually due to pipeline stalls and flushes. Section 3.2.1.4 discusses cycle accounting.
3.2.1.1 Event Rate Monitoring
Event rate monitoring determines event rates by reading processor event occurrence counters
before and after the workload is run, and then computing the desired rates. For instance, two basic
Montecito processor events that count the number of retired Itanium instructions
(IA64_INST_RETIRED.u) and the number of elapsed clock cycles (CPU_OP_CYCLES) allow a
workload’s instructions per cycle (IPC) to be computed as follows:
• IPC = (IA64_INST_RETIRED.u_t1 - IA64_INST_RETIRED.u_t0) / (CPU_OP_CYCLES_t1 - CPU_OP_CYCLES_t0)
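In software, this is two counter snapshots and a division. The sketch below assumes a hypothetical read_pmd() helper that returns the current value of the counter programmed for the named event; real code would use a kernel interface such as perfmon.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical helper: returns the current value of the performance
       counter programmed to count the named event. */
    extern uint64_t read_pmd(const char *event);

    void measure_ipc(void (*workload)(void))
    {
        uint64_t inst0 = read_pmd("IA64_INST_RETIRED.u");
        uint64_t cyc0  = read_pmd("CPU_OP_CYCLES");

        workload();  /* run the code under study */

        uint64_t inst1 = read_pmd("IA64_INST_RETIRED.u");
        uint64_t cyc1  = read_pmd("CPU_OP_CYCLES");

        /* IPC = retired instructions / elapsed clock cycles */
        printf("IPC = %.3f\n",
               (double)(inst1 - inst0) / (double)(cyc1 - cyc0));
    }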
Time-based sampling is the basis for many performance debugging tools (VTune™ analyzer, gprof, WinNT). As shown in Figure 3-1, time-based sampling can be used to plot the event rates over time, and can provide insights into the different phases that the workload moves through.
Figure 3-1. Time-Based Sampling
On the Montecito processor, many event types, e.g. TLB misses or branch mispredicts, are limited to a rate of one per clock cycle. These are referred to as “single occurrence” events. However, in the Montecito processor, multiple events of the same type may occur in the same clock. We refer to such events as “multi-occurrence” events. An example of a multi-occurrence event on the Montecito processor is data cache read misses (up to two per clock). Multi-occurrence events, such as the number of entries in the memory request queue, can be used to derive the average number and average latency of memory accesses. Section 3.2.1.2 and Section 3.2.1.3 describe the basic Montecito processor mechanisms for monitoring single and multi-occurrence events.
3.2.1.2 Single Occurrence Events and Duration Counts
A single occurrence event can be monitored by any of the Montecito processor performance
counters. For all single occurrence events, a counter is incremented by up to one per clock cycle.
Duration counters that count the number of clock cycles during which a condition persists are
considered “single occurrence” events. Examples of single occurrence events on the Montecito
processor are TLB misses, branch mispredictions, and cycle-based metrics.
3.2.1.3 Multi-Occurrence Events, Thresholding, and Averaging
Events that, due to hardware parallelism, may occur at rates greater than one per clock cycle are
termed “multi-occurrence” events. Examples of such events on the Montecito processor are retired
instructions or the number of live entries in the memory request queue.
Thresholding capabilities are available in the Montecito processor’s multi-occurrence counters and
can be used to plot an event distribution histogram. When a non-zero threshold is specified, the
monitor is incremented by one in every cycle in which the observed event count exceeds that
programmed threshold. This allows questions such as “For how many cycles did the memory
request queue contain more than two entries?” or “During how many cycles did the machine retire
more than three instructions?” to be answered. This capability allows microarchitectural buffer
sizing experiments to be supported by real measurements. By running a benchmark with different
threshold values, a histogram can be drawn up that may help to identify the performance “knee” at
a certain buffer size.
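Such a histogram experiment might be driven by a loop like the following C sketch, with one run of the benchmark per threshold value. The program_counter() and run_and_read() helpers are hypothetical stand-ins for a privileged PMU interface, and the threshold values correspond to the 3-bit threshold field of the generic counter configuration registers described in Section 3.3.2.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helpers standing in for a privileged PMU interface. */
    extern void     program_counter(unsigned pmd, unsigned event_select,
                                    unsigned threshold);
    extern uint64_t run_and_read(unsigned pmd, void (*workload)(void));

    /* For each non-zero threshold t, the counter increments once in every
     * cycle in which more than t events were observed; sweeping t yields
     * the occupancy histogram discussed above. */
    void threshold_sweep(unsigned event_select, void (*workload)(void))
    {
        for (unsigned t = 1; t < 8; t++) {       /* 3-bit threshold field */
            program_counter(4, event_select, t); /* generic counter PMD4 */
            uint64_t cycles = run_and_read(4, workload);
            printf("cycles with more than %u events: %llu\n",
                   t, (unsigned long long)cycles);
        }
    }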
For overlapping concurrent events, such as pending memory operations, the average number of
concurrently outstanding requests and the average number of cycles that requests were pending are
of interest. To calculate the average number or latency of multiple outstanding requests in the
memory queue, we need to know the total number of requests (n_total) and the number of live
requests per cycle (n_live/cycle). By summing up the live requests (n_live/cycle) with a
multi-occurrence counter, n_live is directly measured by hardware. We can now calculate the
average number of requests and the average latency as follows:
• Average outstanding requests/cycle = n_live / t, where t is the number of elapsed clock cycles
• Average latency per request = n_live / n_total
An example of this calculation is given in Table 3-1, in which the average outstanding requests/
cycle = 15/8 = 1.875, and the average latency per request = 15/5 = 3 cycles.
Table 3-1. Average Latency per Request and Requests per Cycle
[Table: a cycle-by-cycle example spanning 8 cycles and 5 requests, in which the summed live-entry counts total 15.]
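In C, the two averages reduce to the following; the argument names mirror the n_live, n_total and t quantities above, and the values 15, 5 and 8 from Table 3-1 reproduce the 1.875 requests/cycle and 3-cycle results.

    #include <stdint.h>

    typedef struct {
        double avg_outstanding; /* n_live / t       */
        double avg_latency;     /* n_live / n_total */
    } queue_stats;

    /* n_live is the multi-occurrence counter sum of live entries per
     * cycle, n_total the number of requests, t the elapsed cycles. */
    queue_stats queue_averages(uint64_t n_live, uint64_t n_total, uint64_t t)
    {
        queue_stats s;
        s.avg_outstanding = (double)n_live / (double)t;       /* 15/8 = 1.875 */
        s.avg_latency     = (double)n_live / (double)n_total; /* 15/5 = 3     */
        return s;
    }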
The Montecito processor provides the following capabilities to support event rate monitoring:
• Clock cycle counter
• Retired instruction counter
• Event occurrence and duration counters
• Multi-occurrence counters with thresholding capability
3.2.1.4 Cycle Accounting
While event rate monitoring counts the number of events, it does not tell us whether the observed
events are contributing to a performance problem. A commonly used strategy is to plot multiple
event rates and correlate them with the measured instructions per cycle (IPC) rate. If a low IPC
occurs concurrently with a peak of cache miss activity, chances are that cache misses are causing a
performance problem. To eliminate such guesswork, the Montecito processor provides a set of
cycle accounting monitors that break down the number of cycles lost to various kinds
of microarchitectural events. As shown in Figure 3-2, this lets us account for every cycle spent by a
program and therefore provides insight into an application’s microarchitectural behavior. Note that
cycle accounting is different from simple stall or flush duration counting. Cycle accounting is
based on the machine’s actual stall and flush conditions, and accounts for overlapped pipeline
delays, while simple stall or flush duration counters do not. Cycle accounting determines a
program’s cycle breakdown by stall and flush reasons, while simple duration counters are useful in
determining cumulative stall or flush latencies.
Figure 3-2. Itanium® Processor Family Cycle Accounting
[Figure: 100% of execution time broken down into inherent program execution latency (30%), data access cycles (25%), branch mispredicts (20%), instruction fetch stalls (15%), and other stalls (10%).]
The Montecito processor cycle accounting monitors account for all major single and multi-cycle
stall and flush conditions. Overlapping stall and flush conditions are prioritized in reverse pipeline
order, i.e. delays that occur later in the pipe and that overlap with earlier stage delays are reported
as being caused later in the pipeline. The six back-end stall and flush reasons are prioritized in the
following order:
1. Exception/Interruption Cycle: cycles spent flushing the pipe due to interrupts and exceptions.
2. Branch Mispredict Cycle: cycles spent flushing the pipe due to branch mispredicts.
3. Data/FPU Access Cycle: memory pipeline full, data TLB stalls, load-use stalls, and access to
floating-point unit.
4. Execution Latency Cycle: scoreboard and other register dependency stalls.
5. RSE Active Cycle: RSE spill/fill stall.
6. Front End Stalls: stalls due to the back-end waiting on the front end.
Additional front-end stall counters are available which detail seven possible reasons for a front-end
stall to occur. However, the back-end and front-end stall events should not be compared since they
are counted in different stages of the pipeline.
For details, refer to Section 4.6.
3.2.2 Profiling
Profiling is used by application developers, profile-guided compilers, optimizing linkers, and
run-time systems. Application developers are interested in identifying performance bottlenecks and
relating them back to their source code. Based on profile feedback developers can make changes to
the high-level algorithms and data structures of the program. Compilers can use profile feedback to
optimize instruction schedules by employing advanced features of the Itanium architecture, such as
predication and speculation.
To support profiling, performance monitor counts have to be associated with program locations.
The following mechanisms are supported directly by the Montecito processor’s performance
monitors:
• Program Counter Sampling
• Miss Event Address Sampling: Montecito processor event address registers (EARs) provide
sub-pipeline length event resolution for performance critical events (instruction and data
caches, branch mispredicts, and instruction and data TLBs).
• Event Qualification: constrains event monitoring to a specific instruction address range, to
certain opcodes or privilege levels.
These profiling features are presented in Section 3.2.2.1, Section 3.2.2.2 and Section 3.2.3.
3.2.2.1 Program Counter Sampling
Application tuning tools like VTune analyzer and gprof use time-based or event-based sampling of
the program counter and other event counters to identify performance critical functions and basic
blocks. As shown in Figure 3-3, the sampled points can be histogrammed by instruction addresses.
For application tuning, statistical sampling techniques have been very successful, because the
programmer can rapidly identify code hot spots in which the program spends a significant fraction
of its time, or where certain event counts are high.
Program counter sampling points performance analysts at code hot spots, but does not indicate
what caused the performance problem. Inspection and manual analysis of the hot-spot region along
with a fair amount of guess work are required to identify the root cause of the performance
problem. On the Montecito processor, the cycle accounting mechanism (described in
Section 3.2.1.4) can be used to directly measure an application’s microarchitectural behavior.
The interval timer facilities of the Itanium architecture (ITC and ITM registers) can be used for
time-based program counter sampling. Event-based program counter sampling is supported by a
dedicated performance monitor overflow interrupt mechanism described in detail in Section 7.2.2
“Performance Monitor Overflow Status Registers (PMC[0]..PMC[3])” in Volume 2 of the Intel® Itanium® Architecture Software Developer’s Manual.
Figure 3-3. Event Histogram by Program Counter
[Figure: histogram of event frequency (e.g. # cache misses, # TLB misses) plotted against the address space.]
To support program counter sampling, the Montecito processor provides the following
mechanisms:
• Timer interrupt for time-based program counter sampling
• Event count overflow interrupt for event-based program counter sampling
• Hardware-supported cycle accounting
3.2.2.2 Miss Event Address Sampling
Program counter sampling and cycle accounting provide an accurate picture of cumulative
microarchitectural behavior, but they do not provide the application developer with pointers to
specific program elements (code locations and data structures) that repeatedly cause
microarchitectural “miss events”. In a cache study of the SPEC92 benchmarks, [Lebeck] used
trace-based cache miss profiling to gain performance improvements of 1.02 to 3.46 on various
benchmarks by making simple changes to the source code. This type of analysis requires
identification of instruction and data addresses related to microarchitectural “miss events” such as
cache misses, branch mispredicts, or TLB misses. Using symbol tables or compiler annotations
these addresses can be mapped back to critical source code elements. Like Lebeck, most
performance analysts in the past have had to capture hardware traces and resort to trace driven
simulation.
Due to the superscalar issue, deep pipelining, and out-of-order instruction completion of today’s
microarchitectures, the sampled program counter value may not be related to the instruction
address that caused a miss event. On a Pentium® processor pipeline, the sampled program counter
may be off by two dynamic instructions from the instruction that caused the miss event. On a
Pentium® Pro processor, this distance increases to approximately 32 dynamic instructions. On the
Montecito processor, it is approximately 48 dynamic instructions. If program counter sampling is
used for miss event address identification on the Montecito processor, a miss event might be
associated with an instruction almost five dynamic basic blocks away from where it actually
occurred (assuming that 10% of all instructions are branches). Therefore, it is essential for
hardware to precisely identify an event’s address.
The Montecito processor provides a set of event address registers (EARs) that record the
instruction and data addresses of data cache misses for loads, the instruction and data addresses of
data TLB misses, and the instruction addresses of instruction TLB and cache misses. A 16-entry-deep
execution trace buffer captures sequences of branch instructions and other instructions and
events which cause changes to execution flow. Table 3-2 summarizes the capabilities offered by
the Montecito processor EARs and the execution trace buffer. Exposing miss event addresses to
software allows them to be monitored either by sampling or by code instrumentation. This
eliminates the need for trace generation to identify and solve performance problems and enables
performance analysis by a much larger audience on unmodified hardware.
Table 3-2. Montecito Processor EARs and Branch Trace Buffer

Event Address Register: Instruction Cache
  Triggers on: Instruction fetches that miss the L1 instruction cache (demand fetches only)
  What is Recorded: Instruction address; number of cycles the fetch was in flight

Event Address Register: Instruction TLB (ITLB)
  Triggers on: Instruction fetches that missed the L1 ITLB (demand fetches only)
  What is Recorded: Instruction address; what serviced the L1 ITLB miss: L2 ITLB, VHPT or software

Event Address Register: Data Cache
  Triggers on: Load instructions that miss the L1 data cache
  What is Recorded: Instruction address; data address; number of cycles the load was in flight

Event Address Register: Data TLB (DTLB)
  Triggers on: Data references that miss the L1 DTLB
  What is Recorded: Instruction address; data address; what serviced the L1 DTLB miss: L2 DTLB, VHPT or software

Execution Trace Buffer
  Triggers on: Branch outcomes; rfi, exceptions, and failed “chk” instructions which cause a change in execution flow
  What is Recorded: Source instruction address of the event; target instruction address of the event; mispredict status and reason for branches
The Montecito processor EARs enable statistical sampling by configuring a performance counter
to count, for instance, the number of data cache misses or retired instructions. The performance
counter value is set up to interrupt the processor after a predetermined number of events have been
observed. The data cache event address register repeatedly captures the instruction and data
addresses of actual data cache load misses. Whenever the counter overflows, miss event address
collection is suspended until the event address register is read by software (this prevents software
from capturing a miss event that might be caused by the monitoring software itself). When the
counter overflows, an interrupt is delivered to software, the observed event addresses are collected,
and a new observation interval can be set up by rewriting the performance counter register. For
time-based (rather than event-based) sampling methods, the event address registers indicate to
software whether or not a qualified event was captured. Statistical sampling can achieve arbitrary
event resolution by varying the number of events within an observation interval and by increasing
the number of observation intervals.
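The overall sampling flow might look like the following C sketch. All helper functions are hypothetical (real code would run in a kernel driver or behind an interface such as perfmon), the preload rule (2^47 - N) is detailed in Section 3.3.2, and the example assumes the instruction EAR data registers PMD34,35 described in Section 3.3.8.

    #include <stdint.h>

    /* Hypothetical privileged helpers. */
    extern void     write_pmd(unsigned index, uint64_t value);
    extern uint64_t read_pmd(unsigned index);
    extern void     clear_freeze(void); /* clear PMC0.fr to resume monitoring */
    extern void     record_sample(uint64_t addr, uint64_t latency);

    /* Arm a generic counter so that it overflows (carry out of bit 46)
     * after n_events more events: preload with 2^47 - N. */
    void arm_sampling_counter(unsigned pmd, uint64_t n_events)
    {
        write_pmd(pmd, (UINT64_C(1) << 47) - n_events);
    }

    /* Sketch of the overflow-interrupt path: collect the captured miss
     * event address, then open a new observation interval. */
    void on_overflow_interrupt(unsigned pmd, uint64_t n_events)
    {
        uint64_t addr    = read_pmd(34); /* captured event address       */
        uint64_t latency = read_pmd(35); /* captured latency information */

        record_sample(addr, latency);

        arm_sampling_counter(pmd, n_events);
        clear_freeze(); /* re-enable event collection */
    }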
3.2.3 Event Qualification
In the Montecito processor, many of the performance monitoring events can be qualified in a
number of ways such that only a subset of the events are counted using performance monitoring
counters. As shown in Figure 3-4 events can be qualified for monitoring based on instruction
address range, instruction opcode, data address range, event-specific “unit mask” (umask), the
privilege level and instruction set the event was caused by, and the status of the performance
monitoring freeze bit (PMC0.fr). The following paragraphs describe these capabilities in detail.
• Itanium Instruction Address Range Check: The Montecito processor allows event monitoring
to be constrained to a programmable instruction address range. This enables monitoring of
dynamically linked libraries (DLLs), functions, or loops of interest in the context of a large
Itanium architecture-based application. The Itanium instruction address range check is applied
at the instruction fetch stage of the pipeline, and the resulting qualification is carried by the
instruction throughout the pipeline. This enables conditional event counting at a level of
granularity smaller than the dynamic instruction length of the pipeline (approximately 48
instructions). The Montecito processor’s instruction address range check operates only during
Itanium architecture-based code execution, i.e. when PSR.is is zero. For details, see Section 3.3.5.
Figure 3-4. Montecito Processor Event Qualification
[Figure: decision flow asking, in order: Is the Itanium instruction pointer in IBR range? Does the Itanium opcode match? Is the Itanium data address in DBR range? Executing at a monitored privilege level? Executing in the monitored instruction set? Does the event-specific “unit mask” (umask) qualify the event? Is event monitoring enabled? If all of the above are true, the event is qualified.]
• Itanium Instruction Opcode Match: The Montecito processor provides two independent
Itanium instruction opcode match ranges, each of which matches the currently issued instruction
encodings with a programmable opcode match and mask function. The resulting match events
can be selected as an event type for counting by the performance counters. This allows
histogramming of instruction types, usage of destination and predicate registers, as well as
basic block profiling (through insertion of tagged NOPs). The opcode matcher operates only
during Itanium architecture-based code execution, i.e. when PSR.is is zero. Details are
described in Section 3.3.6.
• Itanium Data Address Range Check: The Montecito processor allows event collection for
memory operations to be constrained to a programmable data address range. This enables
selective monitoring of data cache miss behavior of specific data structures. For details, see
Section 3.3.7.
• Event Specific Unit Masks: Some events allow the specification of “unit masks” to filter out
interesting events directly at the monitored unit. As an example, the number of counted bus
transactions can be qualified by an event-specific unit mask to be constrained to transactions that
originated from any bus agent, from the processor itself, or from other I/O bus masters. In this
case, the bus unit uses a three-way unit mask (any, self, or I/O) that specifies which
transactions are to be counted. In the Montecito processor, events from the branch, memory
and bus units support a variety of unit masks. For details, refer to the event pages in Chapter 4.
• Privilege Level: Two bits in the processor status register (PSR) are provided to enable selective
process-based event monitoring. The Montecito processor supports conditional event counting
based on the current privilege level; this allows performance monitoring software to break
down event counts into user and operating system contributions. For details on how to
constrain monitoring by privilege level, refer to Section 3.3.1.
• Instruction Set: The Montecito processor supports conditional event counting based on the
currently executing instruction set (Itanium or IA-32) by providing two instruction set mask
bits for each event monitor. This allows performance monitoring software to break down event
counts into Itanium architecture and IA-32 contributions. For details, refer to Section 3.3.1.
• Performance Monitor Freeze: Event counter overflows or software can freeze event
monitoring. When frozen, no event monitoring takes place until software clears the monitoring
freeze bit (PMC0.fr). This ensures that the performance monitoring routines themselves, e.g.
counter overflow interrupt handlers or performance monitoring context switch routines, do not
“pollute” the event counts of the system under observation. For details refer to Section 7.2.4 of
Volume 2 of the Intel® Itanium® Architecture Software Developer’s Manual.
3.2.3.1 Combining Opcode Matching, Instruction, and Data Address Range Check
The Montecito processor allows various event qualification mechanisms to be combined by
providing the instruction tagging mechanism shown in Figure 3-5.
Figure 3-5. Instruction Tagging Mechanism in the Montecito Processor
[Figure: the four IBR pairs IBRP0-IBRP3 (controlled by PMC38) generate address range tags in four tag channels; Opcode Matcher 0 (PMC32,33,36) combines with tag channels 0 and 2, and Opcode Matcher 1 (PMC34,35,36) with tag channels 1 and 3; memory events are further qualified by the data address range checkers (DBRs, PMC41); qualified events then pass the event select (PMCi.es) and the privilege level and instruction set checks (PMC.plm, PMC.ism) before incrementing a counter (PMDi).]
During Itanium instruction execution, the instruction address range check is applied first. This is
applied separately for each IBR pair (IBRP) to generate 4 independent tag bits which flow down
the machine in four tag channels. Tags in the four tag channels are then passed to two opcode
matchers that combine the instruction address range check with the opcode match and generate
another set of four tags. This is done by combining tag channels 0 and 2 with first opcode match
registers and tag channels 1 and 3 with the second opcode match registers as shown in Figure 3-5.
Each of the 4 combined tags in the four tag channels can be counted as a retired instruction count
event (for details refer to event description “IA64_TAGGED_INST_RETIRED”).
The combined Itanium processor address range and opcode match tag in tag channel 0 qualifies all
downstream pipeline events. Events in the memory hierarchy (L1 and L2 data cache and data TLB
events) can be further qualified using a data address range tag (DBRRangeTag).
As summarized in Figure 3-5, data address range checking can be combined with opcode matching
and instruction range checking on the Montecito processor. Additional event qualifications based
on the current privilege level can be applied to all events and are discussed in Section 3.2.3.2.
Table 3-3. Montecito Processor Event Qualification Modes
[Table: for each event qualification mode - unconstrained monitoring, instruction address range check and opcode matching (channel 0), instruction and data address range check, and opcode matching and data address range check - the required settings of PMC32.ig_ad, PMC38.ig_ibrp, PMC36.Ch_ig_OPC, the opcode match registers (desired opcodes or don’t-care), and PMC41.cfgdtag (e.g. [1,11], [1,10], [1,01], [1,00] or [0,xx]).
Notes: 1. For all cases where PMC32.ig_ad is set to 0, PMC32.inv must be set to 0 if address range inversion is not needed. 2. See column 2 for the value of the PMC32.ig_ad bit field.]
3.2.3.2 Privilege Level Constraints
Performance monitoring software cannot always count on context switch support from the
operating system. In general, this has made performance analysis of a single process in a
multi-processing system or of a multi-process workload impossible. To provide hardware support for this
kind of analysis, the Itanium architecture specifies three global bits (PSR.up, PSR.pp, DCR.pp) and
a per-monitor “privilege monitor” bit (PMCi.pm). To break down the performance contributions of
operating system and user-level application components, each monitor specifies a 4-bit privilege
level mask (PMCi.plm). The mask is compared to the current privilege level in the processor status
register (PSR.cpl), and event counting is enabled if PMCi.plm[PSR.cpl] is one. The Montecito
processor performance monitor control is discussed in Section 3.3.1.
PMC registers can be configured as user-level monitors (PMCi.pm is 0) or system-level monitors
(PMCi.pm is 1). A user-level monitor is enabled whenever PSR.up is one. PSR.up can be
controlled by an application using the “sum”/”rum” instructions. This allows applications to
enable/disable performance monitoring for specific code sections. A system-level monitor is
enabled whenever PSR.pp is one. PSR.pp can be controlled at privilege level 0 only, which allows
monitor control without interference from user-level processes. The pp field in the default control
register (DCR.pp) is copied into PSR.pp whenever an interruption is delivered. This allows events
generated during interruptions to be broken down separately: if DCR.pp is 0, events during
interruptions are not counted; if DCR.pp is 1, they are included in the kernel counts.
As shown in Figure 3-6, Figure 3-7, and Figure 3-8, single-process, multi-process, and system-level
performance monitoring are possible by specifying the appropriate combination of PSR and
DCR bits. These bits allow performance monitoring to be controlled entirely from a kernel-level
device driver, without explicit operating system support. Once the desired monitoring
configuration has been set up in a process’ processor status register (PSR), “regular” unmodified
operating system context switch code automatically enables/disables performance monitoring.
With support from the operating system, individual per-process breakdown of event counts can be
generated as outlined in the chapter on performance monitoring in the Intel® Itanium® Architecture
Software Developer’s Manual.
3.2.3.3 Instruction Set Constraints
Instruction set constraints are not fully supported in Montecito and the corresponding PMC register
instruction set mask (PMCi.ism) should be set to Itanium architecture only (‘10) to ensure correct
operation. Any other values for these bits may cause undefined behavior.
Figure 3-6. Single Process Monitor
Figure 3-7. Multiple Process Monitor
Figure 3-8. System Wide Monitor
[Figures: user-level (cpl=3, application), kernel-level (cpl=0, OS) and interrupt-level (cpl=0, handlers) execution for processes A, B and C, showing the PSR.up/PSR.pp, PMC.pm (0 or 1), PMC.plm (1000 or 1001) and DCR.pp settings that select per-process, multi-process, or system-wide monitoring, with DCR.pp determining whether interrupt-level events are included.]
3.2.4 References
• [gprof] S.L. Graham, P.B. Kessler and M.K. McKusick, “gprof: A Call Graph Execution
Profiler”, Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction; SIGPLAN
Notices, Vol. 17, No. 6, pp. 120-126, June 1982.
• [Lebeck] Alvin R. Lebeck and David A. Wood, “Cache Profiling and the SPEC Benchmarks:
A Case Study”, Tech Report 1164, Computer Science Dept., University of Wisconsin-Madison, July 1993.
• [VTune] Mark Atkins and Ramesh Subramaniam, “PC Software Performance Tuning”, IEEE
Computer, Vol. 29, No. 8, pp. 47-54, August 1996.
• [WinNT] Russ Blake, “Optimizing Windows NT(tm)”, Volume 4 of the Microsoft “Windows
NT Resource Kit for Windows NT Version 3.51”, Microsoft Press, 1995.
3.3 Performance Monitor State
The Itanium performance monitoring architecture described in Volume 2 of the Intel® Itanium®
Architecture Software Developer’s Manual defines two sets of performance monitor registers:
Performance Monitor Configuration (PMC) registers to configure the monitoring, and Performance
Monitor Data (PMD) registers to provide data values from the monitors. Additionally, the
architecture allows for architectural as well as model-specific registers. Complying with this
architectural definition, Montecito provides both kinds of PMCs and PMDs. As shown in Figure 3-9,
the Montecito processor provides 12 48-bit performance counters (PMC/PMD4-15 pairs) and a set
of model-specific monitoring registers.
Table 3-4 defines the PMC/PMD register assignments for each monitoring feature. The interrupt
status registers are mapped to PMC0,1,2,3. The 12 generic performance counter pairs are assigned to
PMC/PMD4-15. The Event Address Registers (EARs) and the Execution Trace Buffer (ETB) are
controlled by three configuration registers (PMC37,40,39). Captured event addresses and cache miss
latencies are accessible to software through five event address data registers (PMD32,33,34,35,36)
and a branch trace buffer (PMD48-63). On the Montecito processor, monitoring of some events can
additionally be constrained to a programmable instruction address range by appropriately setting
the instruction breakpoint registers (IBRs) and the instruction address range check register (PMC38)
and turning on the checking mechanism in the opcode match registers (PMC32,33,34,35). Two
opcode match register sets and an opcode match configuration register (PMC36) allow monitoring
of some events to be qualified with a programmable opcode. For memory operations, events can be
qualified by a programmable data address range by appropriate setting of the data breakpoint
registers (DBRs) and the data address range configuration register (PMC41).
Montecito, being a processor capable of running two threads, provides the illusion of having two
processors by providing exactly the same set of performance monitoring features and structures
separately for each thread.
Table 3-4. Montecito Processor Performance Monitor Register Set
[Table: maps each monitoring feature to its registers, spanning the generic register set (Processor Status Register PSR, Default Configuration Register DCR, Performance Monitor Vector Register PMV, and the generic PMC/PMD pairs) and the Montecito processor-specific performance monitoring register set.]
3.3.1 Performance Monitor Control and Accessibility
As in other IPF processors, Montecito event collection is controlled by the Performance Monitor
Configuration (PMC) registers and the processor status register (PSR). Four PSR fields (PSR.up,
PSR.pp, PSR.cpl and PSR.sp) and the performance monitor freeze bit (PMC0.fr) affect the
behavior of all performance monitor registers.
Per-monitor control is provided by three PMC register fields (PMCi.plm, PMCi.ism, and
PMCi.pm). Event collection for a monitor is enabled under the following constraints on the
Montecito processor:
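The architectural enable expression is not reproduced here; the following C sketch restates the conditions informally, assembled from the PSR, PMC0.fr and per-monitor controls described in this chapter and in Section 3.2.3.2. It is a paraphrase for illustration, not the architectural definition.

    #include <stdbool.h>

    /* A monitor counts only if: monitoring is not frozen (PMC0.fr = 0),
     * its privilege level mask admits the current privilege level, and
     * the relevant global enable (PSR.pp for privileged monitors,
     * PSR.up for user monitors) is set. */
    bool monitor_enabled(bool pmc0_fr, unsigned plm, unsigned psr_cpl,
                         bool pm, bool psr_up, bool psr_pp)
    {
        bool privilege_ok = (plm >> psr_cpl) & 1; /* PMCi.plm[PSR.cpl] */
        bool global_ok    = pm ? psr_pp : psr_up; /* system vs. user monitor */
        return !pmc0_fr && privilege_ok && global_ok;
    }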
Figure 3-10 defines the PSR control fields that affect performance monitoring. For a detailed
definition of how the PSR bits affect event monitoring and control accessibility of PMD registers,
please refer to Section 3.3.2 and Section 7.2.1 of Volume 2 of the Intel® Itanium® Architecture Software Developer’s Manual.
Table 3-5 defines per-monitor controls that apply to PMC4-15,32-42. As defined in Table 3-4, each of
these PMC registers controls the behavior of its associated performance monitor data registers
(PMD). The Montecito processor model-specific PMD registers associated with the instruction/data
EARs and the branch trace buffer (PMD32-39,48-63) can be read only when event monitoring is
frozen (PMC0.fr is one).
Figure 3-10. Processor Status Register (PSR) Fields for Performance Monitoring
Table 3-5. Performance Monitor PMC Register Control Fields (PMC4-15)

Field  Bits   Description
plm    3:0    Privilege Level Mask - controls performance monitor operation for a specific privilege level.
              Each bit corresponds to one of the 4 privilege levels, with bit 0 corresponding to privilege
              level 0, bit 1 to privilege level 1, etc. A bit value of 1 indicates that the monitor is enabled at
              that privilege level. Writing zeros to all plm bits effectively disables the monitor. In this state,
              the Montecito processor will not preserve the value of the corresponding PMD register(s).
pm     6      Privileged monitor - When 0, the performance monitor is configured as a user monitor and
              enabled by PSR.up. When PMC.pm is 1, the performance monitor is configured as a
              privileged monitor, enabled by PSR.pp, and the PMD can only be read by privileged software.
              Any read of the PMD by non-privileged software in this case will return 0.
              NOTE: In PMC37 this field is implemented in bit [4].
ism    25:24  Instruction Set Mask - Should be set to ‘10 for proper operation. Undefined behavior with
              other values.
              NOTE: PMC37 and PMC39 do not have this field.
3.3.2 Performance Counter Registers
The PMUs are not shared between hardware threads. Each hardware thread has its own set of 12
generic performance counter (PMC/PMD4-15) pairs.
Due to the complexities of monitoring in an MT “aware” environment, the PMC/PMD pairs are
split according to differences in functionality. These PMC/PMD pairs can be divided into two
categories: duplicated counters (PMC/PMD4-9) and banked counters (PMC/PMD10-15).
• Banked counters (PMC/PMD10-15): The banked counter capabilities are somewhat limited.
These PMDs cannot increment when their thread is in the background. That is, if Thread 0 is
placed in the background, Thread 0’s PMD10 cannot increment until the thread is brought back
to the foreground by hardware. Due to this, the banked counters should not be used to
monitor a thread-specific event (.all set to 0) that could occur while its thread is in the
background (e.g. L3_MISSES).
• Duplicated counters (PMC/PMD4-9): In contrast, duplicated counters can increment when
their thread is in the background. As such, they can be used to monitor thread-specific events
which could occur even when the thread those events belong to is not currently active.
PMC/PMD pairs are not entirely symmetrical in their ability to count events. Please refer to
Section 3.3.3 for more information.
Figure 3-11 and Table 3-6 define the layout of the Montecito processor Performance Counter
Configuration Registers (PMC4-15). The main task of these configuration registers is to select the
events to be monitored by the respective performance monitor data counters. The event selection
(.es), unit mask (.umask), and MESI fields in the PMC registers perform the selection of these
events. The rest of the fields in the PMCs specify under what conditions counting should be done
(.plm, .ism, .all), by how much the counter should be incremented (.threshold), and what needs to
be done if the counter overflows (.oi, .ev).
Table 3-6. Montecito Processor Performance Counter Configuration Register Fields (PMC4-15)

Field      Bits   Description
plm        3:0    Privilege Level Mask. See Table 3-5 “Performance Monitor PMC Register Control Fields
                  (PMC4-15).”
ev         4      External visibility - When 1, an external notification (if the capability is present) is
                  provided whenever the counter overflows. External notification occurs regardless of the
                  setting of the oi bit (see below).
oi         5      Overflow interrupt - When 1, a Performance Monitor Interrupt is raised and the
                  performance monitor freeze bit (PMC0.fr) is set when the monitor overflows. When 0, no
                  interrupt is raised and the performance monitor freeze bit (PMC0.fr) remains unchanged.
                  Counter overflows generate only one interrupt. Setting of the corresponding PMC0 bit on
                  an overflow is independent of this bit.
pm         6      Privilege Monitor. See Table 3-5 “Performance Monitor PMC Register Control Fields
                  (PMC4-15).”
ig         7      Read zero; writes ignored.
es         15:8   Event select - selects the performance event to be monitored. Montecito processor event
                  encodings are defined in Chapter 4, “Performance Monitor Events.”
umask      19:16  Unit Mask - event-specific mask bits (see event definition for details).
threshold  22:20  Threshold - enables thresholding for “multi-occurrence” events. When threshold is zero,
                  the counter sums up all observed event values. When the threshold is non-zero, the
                  counter increments by one in every cycle in which the observed event value exceeds the
                  threshold.
ig         23     Read zero; writes ignored.
ism        25:24  Instruction Set Mask. See Table 3-5 “Performance Monitor PMC Register Control Fields
                  (PMC4-15).”
all        26     All threads; this bit selects whether to monitor just the self thread or both threads. This bit
                  is applicable only to the duplicated counters (PMC4-9). If 1, events from both threads are
                  monitored; if 0, only the self thread is monitored. Filters (IAR/DAR/OPC) are only
                  associated with the thread they belong to. If filtering of an event with .all enabled is
                  desired, both threads’ filters should be given matching configurations.
MESI       30:27  Umask for MESI filtering; only the events with this capability are affected.
                  [27]: I; [28]: S; [29]: E; [30]: M.
                  If the counter is measuring an event implying that a cache line is being replaced, the filter
                  applies to the bits of the existing cache line and not the line being brought in. Also note,
                  for the events affected by MESI filtering, if a user wishes to simply capture all occurrences
                  of the event, the filter must be set to b1111.
ig         63:31  Read zero; writes ignored.

Figure 3-12 and Table 3-7 define the layout of the Montecito processor Performance Counter Data
Registers (PMD4-15). A counter overflow occurs when the counter wraps (i.e. a carry out from bit
46 is detected). Software can force an external interruption or external notification after N events
by preloading the monitor with a count value of 2^47 - N. Note that bit 47 is the overflow bit and
must be initialized to 0 whenever there is a need to initialize the register.

When accessible, software can continuously read the performance counter registers PMD4-15
without disabling event collection. Any read of a PMD by software without the appropriate
privilege level will return 0 (see “plm” in Table 3-6). The processor ensures that software will see
monotonically increasing counter values.

Table 3-7. Montecito Processor Generic Performance Counter Data Register Fields (PMD4-15)

Field   Bits   Description
count   46:0   Event Count. The counter is defined to overflow when the count field wraps (carry out
               from bit 46).
ov      47     Overflow bit (carry out from bit 46).
               NOTE: When writing to a PMD, always write 0 to this bit. Reads return the value of bit 46.
               DO NOT use this field to determine whether the counter has overflowed; use the
               appropriate bit from PMC0 instead.
sxt47   63:48  Writes are ignored; reads return the value of bit 46, so count values appear sign
               extended.
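As an illustration of Table 3-6, a PMC4-15 value can be assembled with shifts and masks as below. This is a sketch based on the field offsets listed above (with ism hard-wired to ‘10 as required), not an officially supplied macro.

    #include <stdint.h>

    /* Assemble a generic PMC value from the Table 3-6 fields. */
    static inline uint64_t pmc_encode(unsigned plm, unsigned ev, unsigned oi,
                                      unsigned pm, unsigned es, unsigned umask,
                                      unsigned threshold, unsigned all,
                                      unsigned mesi)
    {
        return ((uint64_t)(plm       & 0xf)  <<  0) |  /* plm,  bits 3:0   */
               ((uint64_t)(ev        & 0x1)  <<  4) |  /* ev,   bit  4     */
               ((uint64_t)(oi        & 0x1)  <<  5) |  /* oi,   bit  5     */
               ((uint64_t)(pm        & 0x1)  <<  6) |  /* pm,   bit  6     */
               ((uint64_t)(es        & 0xff) <<  8) |  /* es,   bits 15:8  */
               ((uint64_t)(umask     & 0xf)  << 16) |  /* umask,bits 19:16 */
               ((uint64_t)(threshold & 0x7)  << 20) |  /* thr,  bits 22:20 */
               (UINT64_C(2)                  << 24) |  /* ism = '10        */
               ((uint64_t)(all       & 0x1)  << 26) |  /* all,  bit  26    */
               ((uint64_t)(mesi      & 0xf)  << 27);   /* MESI, bits 30:27 */
    }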
3.3.3 Performance Monitor Event Counting Restrictions
Similar to other Itanium brand products, not all performance monitoring events can be monitored
using just any of the generic performance counters (PMD4-15). The following needs to be noted when
determining which counter to use to monitor events. This is just an overview; further details
can be found under the specific event/event type.
• ER/SI/L2D events can only be monitored using PMD4-9 (these are the events with event
select IDs belonging to ‘h8x, ‘h9x, ‘hax, ‘hbx, ‘hex and ‘hfx).
• To monitor any L2D events it is necessary to monitor at least one L2D event in either PMC4 or
PMC6 (see Section 4.8.4 for more information).
• To monitor any L1D events it is necessary to program PMC5/PMD5 to monitor one L1D
event (see Section 4.8.2 for more information).
• In an MT enabled system, if a “floating” event is monitored in a banked counter
(PMC/PMD10-15), the value may be incorrect. To ensure accuracy, these events should be
measured by a duplicated counter (PMC/PMD4-9).
• The CYCLES_HALTED event can only be monitored in PMD10. If measured by any other
PMD, the count value is undefined.
3.3.4 Performance Monitor Overflow Status Registers (PMC0,1,2,3)
As previously mentioned, the Montecito processor supports 12 performance monitoring counters
per thread. The overflow status of these 12 counters is indicated in register PMC0. As shown in
Figure 3-13 and Table 3-8, only PMC0[15:4,0] bits are populated. All other overflow bits are
ignored, i.e. they read as zero and ignore writes.

Figure 3-13. Montecito Processor Performance Monitor Overflow Status Registers (PMC0,1,2,3)

Table 3-8. Montecito Processor Performance Monitor Overflow Status Register Fields (PMC0,1,2,3)

PMC0:
Field     Bits   HW Reset  Description
fr        0      0         Performance Monitor “freeze” bit - When 1, event monitoring is
                           disabled. When 0, event monitoring is enabled. This bit is set by
                           hardware whenever a performance monitor overflow occurs and its
                           corresponding overflow interrupt bit (PMC.oi) is set to one. SW is
                           responsible for clearing it. When the PMC.oi bit is not set, counter
                           overflows do not set this bit.
ig        3:1    -         Read zero; writes ignored.
overflow  15:4   0         Event Counter Overflow - When bit n is one, it indicates that PMDn
                           overflowed. This is a bit vector indicating which performance
                           monitors overflowed. These overflow bits are set on their
                           corresponding counter’s overflow regardless of the state of the
                           PMC.oi bit. Software may also set these bits. These bits are sticky
                           and multiple bits may be set.
ig        63:16  -         Read zero; writes ignored.

PMC1,2,3:
Field     Bits   HW Reset  Description
ig        63:0   -         Read zero; writes ignored.
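Decoding PMC0 in software is a matter of bit tests; the helpers below sketch the Table 3-8 layout (fr in bit 0, per-counter sticky overflow bits in bits 15:4) and are illustrative only.

    #include <stdint.h>
    #include <stdbool.h>

    /* True if event monitoring is frozen (PMC0.fr, bit 0). */
    static inline bool pmc0_frozen(uint64_t pmc0)
    {
        return pmc0 & 1;
    }

    /* True if generic counter PMDn (n = 4..15) has overflowed; the
     * overflow bit for PMDn sits at bit position n of PMC0. */
    static inline bool pmd_overflowed(uint64_t pmc0, unsigned n)
    {
        return (pmc0 >> n) & 1;
    }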
3.3.5 Instruction Address Range Matching
The Montecito processor allows event monitoring to be constrained to a range of instruction
addresses. Once programmed with these constraints, only the events generated by instructions
whose addresses fall within this range are counted using PMD4-15. The four architectural Instruction
Breakpoint Register Pairs IBRP0-3 (IBR0-7) are used to specify the desired address ranges. Using
these IBR pairs it is possible to define up to 4 different address ranges (only 2 address ranges in
“fine mode”) that can be used to qualify event monitoring.

Once programmed, each of these 4 address restrictions can be applied separately to all events that
are identified to do so. The event IA64_INST_RETIRED is the only event that can be constrained
using any of the four address ranges. Events described as prefetch events can only be constrained
using address range 2 (IBRP1). All other events can only use the first address range (IBRP0), and
this range is considered the default for this section.

In addition to constraining events based on instruction addresses, the Montecito processor allows
event qualification based on the opcode of the instruction and on the address of the data that
memory-related instructions access. This is done by applying these constraints to the same 4
instruction address ranges described in this section. These features are explained in Section 3.3.6
and Section 3.3.7.
3.3.5.1 PMC38
Performance Monitoring Configuration register PMC38 is the main control register for the
instruction address range matching feature. In addition to this register, PMC32 also controls certain
aspects of this feature, as explained in the following paragraphs.

Figure 3-14 and Table 3-10 describe the fields of register PMC38. For the proper use of instruction
address range checking described in this section, PMC38 is expected to be programmed to 0xdb6 as
the default value.

Instruction address range checking is controlled by the “ignore address range check” bits
(PMC32.ig_ad and PMC38.ig_ibrp0). When PMC32.ig_ad is one (or PMC38.ig_ibrp0 is one), all
instructions are included (i.e. un-constrained) regardless of IBR settings. In this mode, events from
both IA-32 and Itanium architecture-based code execution contribute to the event count. When
both PMC32.ig_ad and PMC38.ig_ibrp0 are zero, the instruction address range check based on the
IBRP0 settings is applied to all Itanium processor code fetches. In this mode, IA-32 instructions are
never tagged, and, as a result, events generated by IA-32 code execution are ignored. Table 3-9
defines the behavior of the instruction address range checker for different combinations of PSR.is
and PMC32.ig_ad or PMC38.ig_ibrp0.
Table 3-9. Montecito Processor Instruction Address Range Check by Instruction Set

PMC32.ig_ad OR PMC38.ig_ibrp0   PSR.is = 0 (IA-64)                               PSR.is = 1 (IA-32)
0                               Tag only Itanium instructions if they match      DO NOT tag any IA-32 operations.
                                the IBR range.
1                               Tag all Itanium and IA-32 instructions. Ignore the IBR range.
The processor compares every Itanium instruction fetch address IP{63:0} against the address range
programmed into the architectural instruction breakpoint register pair IBRP0. Regardless of the
value of the instruction breakpoint fault enable (the IBR x-bit), a qualified match (IBRmatch) is
evaluated for the Montecito processor’s IBRP0.
The events which occur before the instruction dispersal stage will fire only if this qualified match
(IBRmatch) is true. This qualified match is ANDed with the result of Opcode Matcher PMC32,33
and further qualified with more user-definable bits (see Table 3-10) before being distributed to
different places. The events which occur after the instruction dispersal stage will use this new
qualified match (IBRP0-OpCode0 match).
Figure 3-14. Instruction Address Range Configuration Register (PMC38)
[Register layout: bits 63:14 reserved; bit 13 fine; bits 12:11 reserved; bit 10 ibrp3; bits 9:8 reserved; bit 7 ibrp2; bits 6:5 reserved; bit 4 ibrp1; bits 3:2 reserved; bit 1 ibrp0; bit 0 reserved.]
Table 3-10. Instruction Address Range Configuration Register Fields (PMC38)

Field     Bits  Description
ig_ibrp0  1     1: No constraint
                0: Address range 0 based on IBRP0 enabled
ig_ibrp1  4     1: No constraint
                0: Address range 1 based on IBRP1 enabled
ig_ibrp2  7     1: No constraint
                0: Address range 2 based on IBRP2 enabled
ig_ibrp3  10    1: No constraint
                0: Address range 3 based on IBRP3 enabled
fine      13    Enable fine-mode address range checking (non-power-of-2)
                1: IBRP0,2 and IBRP1,3 are paired to define two address ranges
                0: Normal mode
                If set to 1, IBRP0 and IBRP2 define the lower and upper limits for
                address range 0; similarly, IBRP1 and IBRP3 define the lower and
                upper limits for address range 1.
                Bits [63:16] of the upper and lower limits must be exactly the same but
                can have any value. Bits [15:0] of the upper limit must be greater than
                bits [15:0] of the lower limit. If an address falls between the upper and
                lower limits, a match is signaled only in address ranges 0 or 1. Event
                qualification based on address ranges 2 and 3 is not defined.
                NOTE: The mask bits programmed in IBRs 1,3,5,7 for bits [15:0] have no
                effect in this mode. When using fine mode address range 0, it is
                necessary to program PMC38.ig_ibrp0,ig_ibrp2 to 0. Similarly, when
                using address range 1, it is necessary to set PMC38.ig_ibrp1,ig_ibrp3 to 0.
The IBRP0 match is generated from the programmed IBR address and mask values. Note that
unless fine mode is used, arbitrary range checking cannot be performed since the mask bits are in
powers of 2. In fine mode, two IBR pairs are used to specify the upper and lower limits of a range
within a page (the upper bits of the lower and upper limits must be exactly the same).
The instruction range checking considers the address range specified by IBRPi only if
PMC32.ig_ad (for i=0), PMC38.ig_ibrpi and the IBRPi x-bits are all 0s. If the IBRPi x-bit is set,
that particular IBRP is used for debug purposes as described in the IA64 architecture.
3.3.5.2 Use of IBRP0 For Instruction Address Range Check - Exception 1
The address range constraint for prefetch events is on the target address of these events rather than
the address of the prefetch instruction. Therefore IBRP1 must be used for constraining these events.
Calculation of the IBRP1 match is the same as that of the IBRP0 match with the exception that we
use IBR2,3,6 instead of IBR0,1,4.

Note: Register PMC38 must contain the predetermined value 0x0db6. If software modifies any bits not
listed in Table 3-10, processor behavior is not defined. It is illegal to have PMC41[48:45]=0000 and
PMC32.ig_ad=0 and ((PMC38[2:1]=10 or 00) or (PMC38[5:4]=10 or 00)); this produces
inconsistencies in tagging I-side events in L1D and L2.
3.3.5.3 Use of IBRP0 For Instruction Address Range Check - Exception 2
The address range constraint for the IA64_TAGGED_INST_RETIRED event uses all four IBR
pairs. Calculation of the IBRP2 match is the same as that of the IBRP0 match with the exception
that IBR4,5 (in non-fine mode) are used instead of IBR0,1. Calculation of the IBRP3 match is the
same as that of the IBRP1 match with the exception that we use IBR6,7 (in non-fine mode) instead
of IBR2,3.

The instruction range check tag is computed early in the processor pipeline and therefore includes
speculative, wrong-path, as well as predicated-off instructions. Furthermore, range check tags are
not accurate in the instruction fetch and out-of-order parts of the pipeline (cache and bus units).
Therefore, software must accept a level of range check inaccuracy for events generated by these
units, especially for non-looping code sequences that are shorter than the Montecito processor
pipeline. As described in Section 3.2.3.1, the instruction range check result may be combined with
the results of the IA-64 opcode match registers described in Section 3.3.6.
3.3.5.4 Fine Mode Address Range Check
In addition to the coarse address range checking described above, the Montecito processor can be
programmed to perform address range checks in fine mode. Montecito provides the use of two
address ranges for fine mode. The first range is defined using IBRP0 and IBRP2 while the second
is defined using IBRP1 and IBRP3. When properly programmed to use address range 0, all
performance monitoring events that have been indicated as able to qualify with IBRP0 will now
qualify with this new address range (defined collectively by IBRP0 and IBRP2). Similarly, when
using address range 1, all events that could be qualified with IBRP1 now get qualified with this
new address range.

A user can configure the Montecito PMU to use fine mode address range 0 by following these
steps (it is assumed that PMCs 32,33,34,35,36,38,41 all start with default settings); a sketch of
the sequence appears after this list:
• Program IBRP0 and IBRP2 to define the instruction address range, following the
programming restrictions mentioned in Table 3-10.
• Program PMC32[ig_ad,inv] = ‘00 to turn off the default tags injected into tag channel 0.
• Program PMC38[ig_ibrp0,ig_ibrp2] = ‘00 to turn on address tagging based on IBRP0 and
IBRP2.
• Program PMC38.fine = 1.
Similarly, a user can configure the Montecito PMU to use fine mode address range 1 by following
the same steps, but this time with IBRP1 and IBRP3. The only exception is that PMC32[ig_ad,inv]
need not be programmed.
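A minimal C sketch of the four steps for address range 0 follows. The write_pmc()/write_ibr()/read_pmc() accessors are hypothetical privileged helpers, and the bit positions come from Table 3-11 (PMC32.inv bit 56, PMC32.ig_ad bit 57) and Table 3-10/Figure 3-14 (PMC38.ig_ibrp0 bit 1, ig_ibrp2 bit 7, fine bit 13).

    #include <stdint.h>

    /* Hypothetical privileged accessors for PMC and IBR registers. */
    extern uint64_t read_pmc(unsigned index);
    extern void     write_pmc(unsigned index, uint64_t value);
    extern void     write_ibr(unsigned index, uint64_t value);

    void setup_fine_mode_range0(uint64_t lower, uint64_t upper)
    {
        /* Step 1: IBRP0 (IBR0) holds the lower limit and IBRP2 (IBR4)
         * the upper limit; bits 63:16 of both must be identical
         * (Table 3-10). The x-bits stay 0 for PMU use. */
        write_ibr(0, lower);
        write_ibr(4, upper);

        /* Step 2: PMC32[ig_ad,inv] = '00 (bits 57 and 56). */
        write_pmc(32, read_pmc(32) &
                      ~((UINT64_C(1) << 57) | (UINT64_C(1) << 56)));

        /* Steps 3 and 4: starting from the required default 0xdb6, clear
         * ig_ibrp0 (bit 1) and ig_ibrp2 (bit 7), then set fine (bit 13). */
        uint64_t pmc38 = 0xdb6;
        pmc38 &= ~((UINT64_C(1) << 1) | (UINT64_C(1) << 7));
        pmc38 |= (UINT64_C(1) << 13);
        write_pmc(38, pmc38);
    }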
3.3.6 Opcode Match Check (PMC32,33,34,35,36)
As shown in Figure 3-5, in the Montecito processor, event monitoring can be constrained based on
the Itanium processor encoding (opcode) of an instruction. Registers PMC32,33,34,35,36 allow
configuring this feature. In Montecito, registers PMC32,33 and PMC34,35 define 2 opcode matchers
(Opcode Matcher 0 (OpCM0) and Opcode Matcher 1 (OpCM1)). Register PMC36 controls how to
apply opcode range checking to the four instruction address ranges defined by using IBRPs.

3.3.6.1 PMC32,33,34,35
Figure 3-15, Figure 3-16 and Table 3-11, Table 3-12 describe the fields of the PMC32,33,34,35
registers. Figure 3-17 and Table 3-13 describe the register PMC36.
All combinations of bits [51:48] in PMC32,34 are supported. To match an A-slot instruction, it is
necessary to set bits [51:50] to 11. To match all instruction types, bits [51:48] should be set to 1111.
To ensure that all events are counted independent of the opcode matcher, all mifb and all mask bits
of PMC32,34 should be set to one (all opcodes match) while keeping the inv bit cleared.

Once the opcode matcher constraints are generated, they are ANDed with the address range
constraints available on the 4 IBRP channels to form 4 combined address range and opcode match
ranges as described here. The constraints defined by OpCM0 are ANDed with the address
constraints defined by IBRP0 and IBRP2 to form combined constraints for channels 0 and 2.
Similarly, the constraints defined by OpCM1 are ANDed with the address constraints defined by
IBRP1 and IBRP3 to form combined constraints for channels 1 and 3.
Figure 3-15. Opcode Match Registers (PMC32,34)
[Register layout: bits 63:58 ig; bit 57 ig_ad; bit 56 inv; bits 55:52 ig; bit 51 m; bit 50 i; bit 49 f; bit 48 b; bits 47:41 ig; bits 40:0 mask.]

Table 3-11. Opcode Match Registers (PMC32,34)

Field  Bits   Width  HW Reset  Description
mask   40:0   41     all 1s    Bits that mask Itanium® instruction encoding bits. Any of the 41
                               syllable bits can be selectively masked. If a mask bit is set to 1,
                               the corresponding opcode bit is not used for opcode matching.
ig     47:41  7      n/a       Reads zero; writes ignored.
b      48     1      1         If 1: match if opcode is a B-slot.
f      49     1      1         If 1: match if opcode is an F-slot.
i      50     1      1         If 1: match if opcode is an I-slot.
m      51     1      1         If 1: match if opcode is an M-slot.
ig     55:52  4      n/a       Reads zero; writes ignored.
inv    56     1      1         Invert Range Check for tag channel 0. If set to 1, the address
                               range specified by IBRP0 is inverted. Effective only when the
                               ig_ad bit is set to 0.
                               NOTE: This bit is ignored in PMC34.
ig_ad  57     1      1         Ignore Instruction Address Range Checking for tag channel 0.
                               If set to 1, all instruction addresses are considered for events.
                               If 0, IBRs 0-1 will be used for address constraints.
                               NOTE: This bit is ignored in PMC34.
ig     63:58  6      n/a       Reads zero; writes ignored.
Table 3-12. Opcode Match Registers (PMC33,35)

Field  Bits   Width  HW Reset  Description
match  40:0   41     all 1s    Opcode bits against which the Itanium® instruction encoding is to
                               be matched. Each opcode bit has a corresponding bit position here.
ig     63:41  23     n/a       Ignored bits.
3.3.6.2 PMC36
Performance Monitoring Configuration register PMC36 controls whether or not to apply opcode
matching in event qualification. As mentioned earlier, opcode matching is applied to the same four
instruction address ranges defined by using IBRPs.
Figure 3-17. Opcode Match Configuration Register (PMC36)
[Register layout: bits 63:32 ig; bits 31:4 rsv; bit 3 Ch3_ig_OPC; bit 2 Ch2_ig_OPC; bit 1 Ch1_ig_OPC; bit 0 Ch0_ig_OPC.]

Table 3-13. Opcode Match Configuration Register Fields (PMC36)

Field       Bits   HW Reset   Description
Ch0_ig_OPC  0      1          1: Tag channel 0 PMU events will not be constrained by opcode.
                              0: Tag channel 0 PMU events (including
                              IA64_TAGGED_INST_RETIRED.00) will be opcode constrained by
                              OpCM0.
Ch1_ig_OPC  1      1          1: Tag channel 1 events (IA64_TAGGED_INST_RETIRED.01) won’t be
                              constrained by opcode.
                              0: Tag channel 1 events will be opcode constrained by OpCM1.
Ch2_ig_OPC  2      1          1: Tag channel 2 events (IA64_TAGGED_INST_RETIRED.10) won’t be
                              constrained by opcode.
                              0: Tag channel 2 events will be opcode constrained by OpCM0.
Ch3_ig_OPC  3      1          1: Tag channel 3 events (IA64_TAGGED_INST_RETIRED.11) won’t be
                              constrained by opcode.
                              0: Tag channel 3 events will be opcode constrained by OpCM1.
rsv         31:4   0xfffffff  Reserved. Users should not change this field from the reset value.
ig          63:32  n/a        Ignored bits.
For opcode matching purposes, an Itanium instruction is defined by two items: the instruction type
“itype” (one of M, I, F or B) and the 41-bit encoding “enco{40:0}” defined in the Intel® Itanium®
Architecture Software Developer’s Manual. Each instruction is evaluated against each opcode
match register.
The IBRP matches are advanced with the instruction pointer to the point where opcodes are being
dispersed. The matches from the opcode matchers are ANDed with the IBRP matches at this point.
This produces two opcode match events that are combined with the instruction range check tag
(IBRRangeTag, see Section 3.3.5).
As shown in Figure 3-5, the 4 tags, Tag(IBRChnli; i=0-3), are staged down the processor pipeline
until instruction retirement and can be selected as a retired instruction count event (see event
description “IA64_TAGGED_INST_RETIRED”). In this way, a performance counter
(PMC/PMD4-15) can be used to count the number of retired instructions within the programmed
range that match the specified opcodes.

Note: Register PMC36 must contain the predetermined value of 0xfffffff0. If software modifies any bits
not listed in Table 3-13, processor behavior is not defined. This is the reset value for PMC36.
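The mask/match semantics described above can be summarized in C: a PMC32/PMC34 mask bit of 1 makes the corresponding encoding bit a don't-care, and the remaining bits must equal the PMC33/PMC35 match value. This is a sketch of the comparison semantics, not of the hardware implementation.

    #include <stdint.h>
    #include <stdbool.h>

    /* enco is the 41-bit instruction encoding; match and mask are the
     * PMC33/35 and PMC32/34 mask fields, respectively. */
    static inline bool opcode_matches(uint64_t enco, uint64_t match,
                                      uint64_t mask)
    {
        const uint64_t bits41 = (UINT64_C(1) << 41) - 1;
        /* bits where mask = 1 are ignored; the rest must equal match */
        return ((enco ^ match) & ~mask & bits41) == 0;
    }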
3.3.7 Data Address Range Matching (PMC41)
For instructions that reference memory, the Montecito processor allows event counting to be
constrained by data address ranges. The 4 architectural Data Breakpoint Registers (DBRs) can be
used to specify the desired address range. Data address range checking capability is controlled by
the Memory Pipeline Event Constraints Register (PMC41).
Figure 3-18 and Table 3-14 describe the fields of register PMC41. When enabled ([1,x0] in the bits
corresponding to one of the 4 DBRs to be used), data address range checking is applied to loads,
stores, semaphore operations, and the lfetch instruction.

Table 3-14. Memory Pipeline Event Constraints Register Fields (PMC41)

Field     Bits   Description
cfgdtag0  4:3    These bits determine whether and how DBRP0 should be used for
                 constraining memory pipeline events (where applicable):
                 00: IBR/Opc/DBR - use IBRP0/OpCM0 and DBRP0 for constraints (i.e.
                 events are counted only if their instruction address, opcode and data
                 address match the values programmed into these registers)
                 01: IBR/Opc - use IBRP0/OpCM0 for constraints
                 10: DBR - only use DBRP0 for constraints
                 11: No constraints
                 NOTE: When used in conjunction with “fine” mode (see the PMC38
                 description), only the lower-bound DBR pair (DBRP0 or DBRP1)
                 config needs to be set. The upper-bound DBR pair config should be
                 left at no constraint. So if IBRP0,2 are chosen for “fine” mode,
                 cfgdtag0 needs to be set according to the desired constraints but
                 cfgdtag2 should be left as 11 (no constraints).
cfgdtag1  12:11  These bits determine whether and how DBRP1 should be used for
                 constraining memory pipeline events (where applicable); bit for bit,
                 these match those defined for DBRP0.
cfgdtag2  20:19  These bits determine whether and how DBRP2 should be used for
                 constraining memory pipeline events (where applicable); bit for bit,
                 these match those defined for DBRP0.
A DBRPx match is generated in the following fashion: the data address is compared to DBRPx, and the address match is further qualified by a number of user-configurable bits in PMC41 before being distributed to different places. Arbitrary range checking is not possible since the mask bits are in powers of 2. Although it is possible to enable more than one DBRP at a time for checking, it is not recommended. The resulting four matches are combined to form a single DBR match. Events which occur after a memory instruction gets to the EXE stage will fire only if this qualified match (DBRPx match) is true. DBR matching for performance monitoring ignores the setting of the DBR r, w, and plm fields.
In order to allow simultaneous use of some DBRs for Performance Monitoring and the others for
debugging (the architected purpose of these registers), separate mechanisms are provided for
enabling DBRs. DBR bits x and the r/w-bit should be cleared to 0 for the DBRP which is going to
be used for the PMU. PSR.db has no effect when DBRs are used for this purpose.
Note: Register PMC41 must contain the predetermined value 0x2078fefefefe. If software modifies any
bits not listed in Table 3-14 processor behavior is not defined. It is illegal to have
PMC41[48:45]=0000 and PMC32[57]=0 and ((PMC38[2:1]=10 or 00) or (PMC38[5:4]=10 or 00));
this produces inconsistencies in tagging I-side events in L1D and L3.
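As a worked example of the cfgdtag encodings, this sketch computes a PMC41 image that constrains memory pipeline events by DBRP0 alone. The field position (bits 4:3) and base value 0x2078fefefefe come from Table 3-14 and the note above; the enum names are illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* cfgdtag encodings from Table 3-14. */
    enum { CFG_IBR_OPC_DBR = 0, CFG_IBR_OPC = 1, CFG_DBR = 2, CFG_NONE = 3 };

    /* Set the 2-bit cfgdtag0 field (bits 4:3) in a PMC41 image, starting
     * from the architecturally required value 0x2078fefefefe. */
    static uint64_t pmc41_with_cfgdtag0(unsigned cfg) {
        uint64_t v = 0x2078fefefefeULL;   /* required base value      */
        v &= ~(3ULL << 3);                /* clear cfgdtag0, bits 4:3 */
        v |= (uint64_t)(cfg & 3) << 3;    /* install the new encoding */
        return v;
    }

    int main(void) {
        /* Constrain memory pipeline events by DBRP0 only. */
        printf("PMC41 = 0x%012llx\n",
               (unsigned long long)pmc41_with_cfgdtag0(CFG_DBR));
        return 0;
    }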
3.3.8 Instruction EAR (PMC37/PMD34,35)

This section defines the register layout for the Montecito processor instruction event address registers (IEAR). The IEAR, configured through PMC37, can be programmed in one of two modes: instruction cache and instruction TLB miss collection. EAR-specific unit masks allow software to specify event collection parameters to hardware. Figure 3-19 and Table 3-15 detail the register layout of PMC37. The instruction address, latency and other captured event parameters are provided in the PMD registers; Table 3-17 and Table 3-19 describe the associated event address data registers PMD34,35.
Both the instruction and data cache EARs (see Section 3.3.9) report the latency of captured cache events and allow latency thresholding to qualify event capture. Event address data registers (PMD32-36) contain valid data only when event collection is frozen (PMC0.fr is one). Reads of PMD32-36 while event collection is enabled return undefined values.
Table 3-15. Instruction Event Address Configuration Register Fields (PMC37) (Sheet 2 of 2)

Field    Bits   HW Reset  Description
ct              0         Selects the event to be monitored. If [13] = '1
                          then [12:5] are used for the umask.
                          1x: Monitor demand instruction cache misses (NOTE:
                              ISB hits are not considered misses); PMD34,35
                              register interpretation, see Table 3-17
                          01: Nothing monitored
                          00: Monitor L1 instruction TLB misses; PMD34,35
                              register interpretation, see Table 3-19
rsv      15:14  0         Reserved bits
ignored  63:16  -         Reads are 0; Writes are ignored
Figure 3-20. Instruction Event Address Register Format (PMD34,35): PMD34 holds the instruction cache line address [63:5], ignored bits [4:2] and stat [1:0]; PMD35 holds ignored bits [63:13], ov [12] and latency [11:0].
When the cache_tlb field (PMC37.ct) is set to 1x, instruction cache misses are monitored. When it is set to 00, instruction TLB misses are monitored. The interpretation of the umask field and performance monitor data registers PMD34,35 depends on this setting and is described in Section 3.3.8.1 for instruction cache monitoring and in Section 3.3.8.2 for instruction TLB monitoring.
3.3.8.1 Instruction EAR Cache Mode (PMC37.ct='1x)
When PMC37.ct is 1x, the instruction event address register captures instruction addresses and
access latencies for L1 instruction cache misses. Only misses whose latency exceeds a
programmable threshold are captured. The threshold is specified as an eight bit umask field in the
configuration register PMC37. Possible threshold values are defined in Table 3-16.
Table 3-16. Instruction EAR (PMC37) umask Field in Cache Mode (PMC37.ct='1x)

umask       Latency Threshold
Bits 12:5   [CPU cycles]
01xxxxxx    > 0 (All L1 Misses)
11111111    >= 4
11111110    >= 8
11111100    >= 16
11111000    >= 32
11110000    >= 128
11100000    >= 256
11000000    >= 1024
10000000    >= 4096
00000000    RAB hit (All L1 misses which hit in RAB)
other       undefined
As defined in Table 3-17, the address of the instruction cache line that missed the L1 instruction cache is provided in PMD34. Whether a qualified event was captured is indicated in PMD34.stat. The latency of the captured instruction cache miss, in CPU clock cycles, is provided in the latency field of PMD35.
Table 3-17. Instruction EAR (PMD34,35) in Cache Mode (PMC37.ct='1x)

Register  Field              Bits  Description
PMD34     stat               1:0   Status
                                   x0: EAR did not capture qualified event
                                   x1: EAR contains valid event data
          Instruction Cache  63:5  Address of instruction cache line that
          Line Address             caused cache miss
PMD35     latency            11:0  Latency in CPU clocks
          overflow           12    If 1, latency counter has overflowed one
                                   or more times before data was returned
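A minimal decoding sketch for a cache-mode sample, assuming the raw PMD34/PMD35 values have already been read out after a PMU freeze (field positions per Table 3-17):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode an instruction-cache-mode I-EAR sample per Table 3-17. */
    static void decode_iear_cache(uint64_t pmd34, uint64_t pmd35) {
        unsigned stat     = pmd34 & 0x3;          /* bits 1:0           */
        uint64_t line     = pmd34 & ~0x1fULL;     /* address, bits 63:5 */
        unsigned latency  = pmd35 & 0xfff;        /* bits 11:0          */
        unsigned overflow = (pmd35 >> 12) & 1;    /* bit 12             */

        if ((stat & 1) == 0) {                    /* x0: nothing captured */
            puts("no qualified event captured");
            return;
        }
        printf("miss line 0x%016llx, latency %u cycles%s\n",
               (unsigned long long)line, latency,
               overflow ? " (latency counter overflowed)" : "");
    }

    int main(void) {
        decode_iear_cache(0x2000000000400821ULL, 0x00000000000001a3ULL);
        return 0;
    }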
3.3.8.2 Instruction EAR TLB Mode (PMC37.ct=00)
When PMC37.ct is ‘00, the instruction event address register captures addresses of instruction TLB
misses. The unit mask allows event address collection to capture specific subsets of instruction
TLB misses. Table 3-18 summarizes the instruction TLB umask settings. All combinations of the
mask bits are supported.
Table 3-18. Instruction EAR (PMC37) umask Field in TLB Mode (PMC37.ct=00)

ITLB Miss Type  PMC.umask[7:5]  Description
---             000             Disabled; nothing will be counted
L2TLB           xx1             L1 ITLB misses which hit L2 TLB
VHPT            x1x             L1 Instruction TLB misses that hit VHPT
FAULT           1xx             Instruction TLB miss produced by an ITLB Miss Fault
ALL             111             Select all L1 ITLB Misses
NOTE: All combinations are supported.
As defined in Table 3-19, the address of the instruction cache line fetch that missed the L1 ITLB is provided in PMD34. The stat bit [1] indicates whether the captured TLB miss hit in the VHPT or required servicing by software, and PMD34.stat as a whole indicates whether a qualified event was captured. In TLB mode, the latency field of PMD35 is undefined.
Table 3-19. Instruction EAR (PMD34,35) in TLB Mode (PMC37.ct='00)

Register  Field              Bits  Description
PMD34     stat               1:0   Status Bits
                                   00: EAR did not capture qualified event
                                   01: L1 ITLB miss hit in L2 ITLB
                                   10: L1 ITLB miss hit in VHPT
                                   11: L1 ITLB miss produced an ITLB Miss Fault
          Instruction Cache  63:5  Address of instruction cache line that
          Line Address             caused TLB miss
PMD35     latency            11:2  Undefined in TLB mode
3.3.9 Data EAR (PMC40, PMD32,33,36)

The data event address configuration register (PMC40) can be programmed to monitor either L1 data cache load misses, FP loads, L1 data TLB misses, or ALAT misses. Figure 3-21 and Table 3-20 detail the register layout of PMC40. Figure 3-22 describes the associated event address data registers PMD32,33,36. The mode bits in configuration register PMC40 select data cache, data TLB, or ALAT monitoring. The interpretation of the umask field and registers PMD32,33,36 depends on the setting of the mode bits and is described in Section 3.3.9.1 for data cache load miss monitoring, Section 3.3.9.2 for data TLB monitoring, and Section 3.3.9.3 for ALAT monitoring.

Both the instruction (see Section 3.3.8) and data cache EARs report the latency of captured cache events and allow latency thresholding to qualify event capture. Event address data registers (PMD32-36) contain valid data only when event collection is frozen (PMC0.fr is one). Reads of PMD32-36 while event collection is enabled return undefined values.
Figure 3-21. Data Event Address Configuration Register (PMC40): ig [63:26], ism [25:24], ig [23:20], umask [19:16], ig [15:9], mode [8:7], pm [6], ig [5:4], plm [3:0].
Table 3-20. Data Event Address Configuration Register Fields (PMC40)

Field  Bits   HW Reset  Description
plm    3:0    0         See Table 3-5 "Performance Monitor PMC Register
                        Control Fields (PMC4-15)."
ig     5:4    -         Reads 0; Writes are ignored
pm     6                See Table 3-5 "Performance Monitor PMC Register
                        Control Fields (PMC4-15)."
mode   8:7              Data EAR mode select:
                        '00: L1 data cache load misses and FP loads
                        '01: L1 data TLB misses
                        '1x: ALAT misses
ig     15:9   -         Reads 0; Writes are ignored
umask  19:16            Data EAR unit mask
                        mode 00: data cache unit mask (definition see
                        Table 3-21, "Data EAR (PMC40) Umask Fields in Data
                        Cache Mode (PMC40.mode=00)")
                        mode 01: data TLB unit mask (definition see
                        Table 3-23, "Data EAR (PMC40) Umask Field in TLB Mode
                        (PMC40.ct=01)")
ig     23:20  -         Reads 0; Writes are ignored
ism    25:24            See Table 3-5 "Performance Monitor PMC Register
                        Control Fields (PMC4-15)."
ig     63:26  -         Reads 0; Writes are ignored
Figure 3-22. Data Event Address Register Format (PMD32,33,36): PMD36 holds the Instruction Address [63:4], vl [3], bn [2] and slot [1:0]; PMD33 holds ig [63:16], stat [15:14], ov [13] and latency [12:0]; PMD32 holds the 64-bit Data Address [63:0].
3.3.9.1 Data Cache Load Miss Monitoring (PMC40.mode=00)

If the Data EAR is configured to monitor data cache load misses, the umask is used as a load latency threshold defined by Table 3-21.

As defined in Table 3-22, the instruction and data addresses as well as the load latency of a captured data cache load miss are presented to software in three registers, PMD32,33,36. If no qualified event was captured, the valid bit in PMD36 is zero.

HPW accesses will not be monitored, and reads from ccv will not be monitored. If an L1D cache miss is not at least 7 clocks after a captured miss, it will not be captured. Semaphore instructions and floating point loads will be counted.
Table 3-21. Data EAR (PMC40) Umask Fields in Data Cache Mode (PMC40.mode=00)
[This table lists the umask encodings (bits 19:16) and their corresponding load latency thresholds in CPU cycles; the encodings are not reproduced here.]

Table 3-22. PMD32,33,36 Fields in Data Cache Load Miss Mode (PMC40.mode=00)

Register  Field         Bits   Description
PMD32     Data Address  63:0   64-bit virtual address of data item that
                               caused miss
PMD33     latency       12:0   Latency in CPU clocks
          overflow      13     Overflow - If 1, latency counter has overflowed
                               one or more times before data was returned
          stat          15:14  Status bits;
                               00: No valid information in PMD32,36 and rest
                                   of PMD33
                               01: Valid information in PMD32,33 and may be
                                   in PMD36
                               NOTE: These bits should be cleared before the
                               EAR is reused.
          ig            63:16  Reads 0; Writes are ignored
PMD36     slot          1:0    Slot bits; if ".vl" is 1, the instruction
                               bundle slot of the memory instruction
          bn            2      Bundle bit; if ".vl" is 1 this indicates which
                               of the executed instruction bundles is
                               associated with the captured miss
          vl            3      Valid bit;
                               0: Invalid Address (EAR did not capture
                                  qualified event)
                               1: EAR contains valid event data
                               NOTE: This bit should be cleared before the
                               EAR is reused.
          Instruction   63:4   Virtual address of the first bundle in the
          Address              2-bundle dispersal window which was being
                               executed at the time of the miss. If ".bn" is
                               1 then the second bundle contains the memory
                               instruction and 16 should be added to the
                               address.
The detection of data cache load misses requires a load instruction to be tracked during multiple clock cycles from instruction issue to cache miss occurrence. Since multiple loads may be outstanding at any point in time and the Montecito processor data cache miss event address register can only track a single load at a time, not all data cache load misses may be captured. When the processor hardware captures the address of a load (called the monitored load), it ignores all other overlapped concurrent loads until it is determined whether the monitored load turns out to be an L1 data cache miss or not. If the monitored load turns out to be a cache miss, its parameters are latched into PMD32,33,36. The processor randomizes the choice of which load instructions are tracked to prevent the same data cache load miss from always being captured (in a regular sequence of overlapped data cache load misses). While this mechanism will not always capture all data cache load misses in a particular sequence of overlapped loads, its accuracy is sufficient to be used by statistical sampling or code instrumentation.
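Once event collection is frozen (PMC0.fr = 1), a sampling tool can unpack the captured miss from the raw PMD32, PMD33 and PMD36 values per Table 3-22. A minimal decoding sketch (the register reads themselves are assumed to have been done elsewhere):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a D-EAR load-miss sample per Table 3-22. */
    static void decode_dear(uint64_t pmd32, uint64_t pmd33, uint64_t pmd36) {
        unsigned vl   = (pmd36 >> 3) & 1;   /* valid bit                  */
        unsigned bn   = (pmd36 >> 2) & 1;   /* bundle bit                 */
        unsigned slot = pmd36 & 0x3;        /* slot of the memory op      */
        uint64_t ip   = pmd36 & ~0xfULL;    /* first bundle of the window */

        if (!vl) {
            puts("no qualified event captured");
            return;
        }
        if (bn)                             /* miss was in the second     */
            ip += 16;                       /* bundle of the pair         */

        printf("data 0x%016llx, IP 0x%016llx slot %u, latency %llu cycles\n",
               (unsigned long long)pmd32,
               (unsigned long long)ip, slot,
               (unsigned long long)(pmd33 & 0x1fff));   /* bits 12:0 */
    }

    int main(void) {
        decode_dear(0x6000000000012340ULL, 0x4000u | 37u,
                    0x400000000000764cULL);
        return 0;
    }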
3.3.9.2 Data TLB Miss Monitoring (PMC40.mode='01)

If the Data EAR is configured to monitor data TLB misses, the umask defined in Table 3-23 determines which data TLB misses are captured by the Data EAR. For TLB monitoring, all combinations of the mask bits are supported.

As defined in Table 3-24, the instruction and data addresses of captured DTLB misses are presented to software in PMD32,36. If no qualified event was captured, the valid bit in PMD36 reads zero. When programmed for data TLB monitoring, the contents of the latency field of PMD33 are undefined.

Both load and store TLB misses will be captured. Some unreached instructions will also be captured. For example, if a load misses in the L1 DTLB but hits in the L2 DTLB and is in an instruction group after a taken branch, it will be captured. Stores and floating-point operations never miss in the L1 DTLB but could miss the L2 DTLB or fault to be handled by software.

Note: PMC39 must be 0 in this mode; otherwise the wrong IP may be captured for misses that come right after a mispredicted branch.

Table 3-23. Data EAR (PMC40) Umask Field in TLB Mode (PMC40.ct=01)

L1 DTLB Miss Type  PMC.umask[19:16]  Description
---                000x              Disabled; nothing will be counted
L2DTLB             xx1x              L1 DTLB misses which hit L2 DTLB
VHPT               x1xx              L1 DTLB misses that hit VHPT
FAULT              1xxx              Data TLB miss produced a fault
ALL                111x              Select all L1 DTLB Misses
NOTE: All combinations are supported.

Table 3-24. PMD32,33,36 Fields in TLB Miss Mode (PMC40.mode='01)

Register  Field         Bits   Description
PMD32     Data Address  63:0   64-bit virtual address of data item that
                               caused miss
PMD33     latency       12:0   Undefined in TLB Miss mode
          ov            13     Undefined in TLB Miss mode
          stat          15:14  Status
                               00: invalid information in PMD32,36 and rest
                                   of PMD33
                               01: L2 Data TLB hit
                               10: VHPT hit
                               11: Data TLB miss produced a fault
                               NOTE: These bits should be cleared before the
                               EAR is reused.
          ig            63:16  Reads 0; Writes are ignored
PMD36     slot          1:0    Slot bits; if ".vl" is 1, the instruction
                               bundle slot of the memory instruction
          bn            2      Bundle bit; if ".vl" is 1 this indicates which
                               of the executed instruction bundles is
                               associated with the captured miss
          vl            3      Valid bit;
                               0: Invalid Instruction Address
                               1: EAR contains valid instruction address of
                                  the miss
                               NOTE: It is possible for this bit to contain 0
                               while PMD33.stat indicates valid D-EAR data.
                               This can happen when the D-EAR is triggered by
                               an RSE load, for which no instruction address
                               is captured.
                               NOTE: This bit should be cleared before the
                               EAR is reused.
          Instruction   63:4   Virtual address of the first bundle in the
          Address              2-bundle dispersal window which was being
                               executed at the time of the miss. If ".bn" is
                               1 then the second bundle contains the memory
                               instruction and 16 should be added to the
                               address.
3.3.9.3 ALAT Miss Monitoring (PMC40.mode='1x)

As defined in Table 3-25, the address of the instruction (a failing ld.c or chk.a) causing an ALAT miss is presented to software in PMD36. If no qualified event was captured, the valid bit in PMD36 reads zero. When programmed for ALAT monitoring, the latency field of PMD33 and the contents of PMD32 are undefined.

Note: PMC39 must be 0 in this mode; otherwise the wrong IP may be captured for misses that come right after a mispredicted branch.

Table 3-25. PMD32,33,36 Fields in ALAT Miss Mode (PMC40.mode='1x)

Register  Field         Bits   Description
PMD32     Data Address  63:0   Undefined in ALAT Miss Mode
PMD33     latency       12:0   Undefined in ALAT Miss mode
          ov            13     Undefined in ALAT Miss mode
          stat          15:14  Status bits;
                               00: No valid information in PMD32,36 and rest
                                   of PMD33
                               01: Valid information in PMD32,33 and may be
                                   in PMD36
                               NOTE: These bits should be cleared before the
                               EAR is reused.
          ig            63:16  Reads 0; Writes are ignored
PMD36     slot          1:0    Slot bits; if ".vl" is 1, the instruction
                               bundle slot of the memory instruction
          bn            2      Bundle bit; if ".vl" is 1 this indicates which
                               of the executed instruction bundles is
                               associated with the captured miss
          vl            3      Valid bit;
                               0: Invalid Address (EAR did not capture
                                  qualified event)
                               1: EAR contains valid event data
                               NOTE: This bit should be cleared before the
                               EAR is reused.
          Instruction   63:4   Virtual address of the first bundle in the
          Address              2-bundle dispersal window which was being
                               executed at the time of the miss. If ".bn" is
                               1 then the second bundle contains the memory
                               instruction and 16 should be added to the
                               address.
3.3.10 Execution Trace Buffer (PMC39,42, PMD48-63,38,39)

The execution trace buffer provides information about the most recent Itanium processor control flow changes. The Montecito execution trace buffer configuration register (PMC39) defines the conditions under which instructions which cause changes to the execution flow are captured, and allows the trace buffer to capture specific subsets of these events.

In addition to the branches captured in the previous generation Itanium 2 processor BTB, Montecito's ETB captures rfi instructions, exceptions (excluding asynchronous interrupts) and silently resteered chk (failed chk) events. Passing chk instructions are not captured under any programming conditions (except when there is another capturable event).

In every cycle in which a qualified change to the execution flow happens, its source bundle address and slot number are written to the execution trace buffer. This event's target address is written to the next buffer location. If the target instruction bundle itself contains a qualified execution flow change, the execution trace buffer either records a single trace buffer entry (with the s-bit set) or makes two trace buffer entries: one that records the target instruction as a branch target (s-bit cleared), and another that records the target instruction as a branch source (s-bit set). As a result, the branch trace buffer may contain a mixed sequence of source and target addresses.

Note: The setting of PMC42 can override the setting of PMC39. PMC42 is used to configure the Execution Trace Buffer's alternate mode: the IP-EAR. Please refer to Section 3.3.10.2.1, "Notes on the IP-EAR" for more information about this mode. PMC42.mode must be set to 000 to enable normal branch trace capture in PMD48-63 as described below. If PMC42.mode is set to other than 000, PMC39's contents will be ignored.
3.3.10.1 Execution Trace Capture (PMC42.mode='000)
Section 3.3.10.1.1 through Section 3.3.10.1.3 describe the operation of the Execution Trace Buffer
when configured to capture an execution trace (or “enhanced” branch trace).
The execution trace buffer configuration register (PMC39) defines the conditions under which
execution flow changes are to be captured. These conditions are given in Figure 3-23 and
Table 3-26, which refer to conditions associated with the branch prediction. These conditions are:
• Whether the target of the branch should be captured
• The path of the branch (not taken/taken), and
• Whether or not the branch path was mispredicted
• Whether or not the target of the branch was mispredicted
• What type of branch should be captured
Note: All instructions eligible for capture are subject to filtering by the “plm” field but only branches are
affected by PMC39’s other filters (tm,ptm,ppm and brt) as well as the Instruction Addr Range and
Opcode Match filters.
Table 3-26. Execution Trace Buffer Configuration Register Fields (PMC39)

Field  Bits   Description
plm    3:0    See Table 3-5.
              NOTE: This mask is applied at the time the event's source
              address is captured. Once the source IP is captured, the
              target IP of this event is always captured even if the ETB is
              disabled.
ig     5:4    Reads zero; writes are ignored
pm     6      See Table 3-5.
              NOTE: This bit is applied at the time the event's source
              address is captured. Once the source IP is captured, the
              target IP of this event is always captured even if the ETB is
              disabled.
ds     7      Data selector:
              1: reserved (undefined data is captured in lieu of the target
                 address)
              0: capture branch target
tm     9:8    Taken Mask:
              11: all Itanium® instruction branches
              10: Taken Itanium instruction branches only
              01: Not Taken Itanium instruction branches only
              00: No branch is captured
ptm    11:10  Predicted Target Address Mask:
              11: capture branch regardless of target prediction outcome
              10: branch target address predicted correctly
              01: branch target address mispredicted
              00: No branch is captured
ppm    13:12  Predicted Predicate Mask:
              11: capture branch regardless of predicate prediction outcome
              10: branch predicted branch path (taken/not taken) correctly
              01: branch mispredicted branch path (taken/not taken)
              00: No branch is captured
brt    15:14  Branch Type Mask:
              11: only non-return indirect branches captured
              10: only return branches will be captured
              01: only IP-relative branches will be captured
              00: all branches are captured
ig     63:16  Reads zero; writes are ignored
To summarize, an Itanium instruction branch and its target are captured by the trace buffer when all of the qualifying conditions selected by the PMC39 fields above are satisfied.
To capture all correctly predicted Itanium instruction branches, the Montecito execution trace buffer configuration settings in PMC39 should be: ds=0, tm=11, ptm=10, ppm=10, brt=00. Either branches whose path was mispredicted can be captured (ds=0, tm=11, ptm=11, ppm=01, brt=00) or branches with a target misprediction (ds=0, tm=11, ptm=01, ppm=11, brt=00) can be captured, but not both. A setting of ds=0, tm=11, ptm=01, ppm=01, brt=00 will result in an empty buffer. If a branch's path is mispredicted, no target prediction is recorded.
Instruction Address Range Matching (Section 3.3.5) and Opcode Matching (Section 3.3.6) may also be used to constrain what is captured in the execution trace buffer.
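Given the field positions in Table 3-26, a PMC39 filter setting can be composed with shifts and ORs. A small sketch (plm and pm handling omitted; values per the examples above):

    #include <stdint.h>
    #include <stdio.h>

    /* Pack the PMC39 branch-capture filters (Table 3-26). */
    static uint64_t pmc39(unsigned ds, unsigned tm, unsigned ptm,
                          unsigned ppm, unsigned brt) {
        return ((uint64_t)(ds  & 1) << 7)  |
               ((uint64_t)(tm  & 3) << 8)  |
               ((uint64_t)(ptm & 3) << 10) |
               ((uint64_t)(ppm & 3) << 12) |
               ((uint64_t)(brt & 3) << 14);
    }

    int main(void) {
        /* All correctly predicted branches:
         * ds=0, tm=11, ptm=10, ppm=10, brt=00 */
        printf("PMC39 = 0x%04llx\n",
               (unsigned long long)pmc39(0, 3, 2, 2, 0));
        return 0;
    }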
3.3.10.1.2 Execution Trace Buffer Data Format (PMC42.mode='000)
Figure 3-24. Execution Trace Buffer Register Format (PMD48-63) (PMC42.mode='000, where PMC39.ds == 0): Address [63:4], slot [3:2], mp [1], s [0].

Table 3-27. Execution Trace Buffer Register Fields (PMD48-63) (PMC42.mode='000)

Field    Bits  Description
s        0     Source bit
               1: contents of register is the source address of a monitored
                  event (branch, rfi, exception or failed chk)
               0: contents of register is a target or undefined (if
                  PMC39.ds = 1)
mp       1     Mispredict Bit
               if s=1 and mp=1: mispredicted event (e.g. target, predicate
                  or back end misprediction)
               if s=1 and mp=0: correctly predicted event
               if s=0 and mp=1: valid target address
               if s=0 and mp=0: invalid ETB register
               rfi/exceptions/failed_chk are all considered as mispredicted
               events and are encoded as above.
slot     3:2   if s=0: undefined
               if s=1: Slot index of first taken event in bundle
               00: Itanium processor Slot 0 source/target
               01: Itanium processor Slot 1 source/target
               10: Itanium processor Slot 2 source/target
               11: this was a not taken event
Address  63:4  if s=1: 60-bit bundle address of Itanium instruction branch
               if ds=0 and s=0: 60-bit target bundle address of Itanium
               instruction branch
The sixteen execution trace buffer registers PMD48-63 provide information about the outcome of a captured event sequence. The branch trace buffer registers (PMD48-63) contain valid data only when event collection is frozen (PMC0.fr is one). While event collection is enabled, reads of PMD48-63 return undefined values. The registers follow the layout defined in Figure 3-24 and Table 3-27, and contain the address of either a captured branch instruction (s-bit=1) or a branch target (s-bit=0). For branch instructions, the mp-bit indicates a branch misprediction. An execution trace register with a zero s-bit and a zero mp-bit indicates an invalid buffer entry. The slot field captures the slot number of the first taken Itanium instruction branch in the captured instruction bundle. A slot number of 3 indicates a not-taken branch.
In every cycle in which a qualified Itanium instruction branch retires (see the footnote below), its source bundle address and slot number are written to the branch trace buffer. If within the next clock the target instruction bundle contains a branch that retires and meets the same conditions, the address of the second branch is stored. Otherwise, either the branch's target address (PMC39.ds=0) or details of the branch prediction (PMC39.ds=1) are written to the next buffer location. As a result, the execution trace buffer may contain a mixed sequence of branches and targets.
The Montecito branch trace buffer is a circular buffer containing the last four to eight qualified Itanium instruction branches. The Execution Trace Buffer Index Register (PMD38), defined in Figure 3-25 and Table 3-28, identifies the most recently recorded branch or target. In every cycle in which a qualified branch or target is recorded, the execution buffer index (ebi) is post-incremented. After 8 entries have been recorded, the branch index wraps around, and the next qualified branch will overwrite the first trace buffer entry. The wrap condition itself is recorded in the full bit of PMD38. The ebi field of PMD38 defines the next branch buffer index that is about to be written; the last written branch trace buffer PMD index can therefore be computed from the contents of PMD38 as 48 + ((ebi + 15) mod 16).

If both the full bit and the ebi field of PMD38 are zero, no qualified branch has been captured by the branch trace buffer. The full bit gets set every time the branch trace buffer wraps from PMD63 to PMD48. Once set, the full bit remains set until explicitly cleared by software, i.e. it is a sticky bit. Software can reset the ebi index and the full bit by writing to PMD38.

PMD39 provides additional information related to the ETB entries.
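Software can use PMD38 to walk the buffer from oldest to youngest entry. A sketch, assuming PMD48-63 have already been copied into an array after freezing the monitors:

    #include <stdint.h>
    #include <stdio.h>

    /* Walk ETB entries oldest-first given the PMD38 index register and a
     * copy of PMD48..PMD63 in etb[0..15] (etb[0] == PMD48). */
    static void walk_etb(uint64_t pmd38, const uint64_t etb[16]) {
        unsigned ebi   = pmd38 & 0xf;       /* next entry to be written */
        unsigned full  = (pmd38 >> 5) & 1;  /* sticky wrap indicator    */
        unsigned n     = full ? 16 : ebi;   /* number of valid entries  */
        unsigned start = full ? ebi : 0;    /* oldest entry             */

        for (unsigned i = 0; i < n; i++) {
            uint64_t e = etb[(start + i) & 0xf];
            unsigned s = e & 1, mp = (e >> 1) & 1;
            if (!s && !mp)                  /* s=0, mp=0: invalid entry */
                continue;
            printf("%-6s %s addr 0x%016llx slot %u\n",
                   s ? "source" : "target",
                   (s && mp) ? "(mispredicted)" : "",
                   (unsigned long long)(e & ~0xfULL),
                   (unsigned)((e >> 2) & 3));
        }
    }

    int main(void) {
        uint64_t etb[16] = { 0x4000000000001230ULL | 1 };  /* one source */
        walk_etb(0x1, etb);                 /* ebi=1, full=0            */
        return 0;
    }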
Figure 3-25. Execution Trace Buffer Index Register Format (PMD38): ig [63:6], full [5], ig [4], ebi [3:0].

Table 3-28. Execution Trace Buffer Index Register Fields (PMD38)

Field  Bit Range  Description
ebi    3:0        Execution Buffer Index [Range 0..15 - Index 0 indicates
                  PMD48]. Pointer to the next execution trace buffer entry
                  to be written:
                  if full=1: points to the oldest recorded branch/target
                  if full=0: points to the next location to be written
ig     4          Reads zero; Writes are ignored
full   5          Full Bit (sticky):
                  if full=1: execution trace buffer has wrapped
                  if full=0: execution trace buffer has not wrapped
ig     63:6       Reads zero; Writes are ignored
Figure 3-26. Execution Trace Buffer Extension Register Format (PMD39): sixteen 4-bit fields; pmd48 ext [3:0], pmd56 ext [7:4], pmd49 ext [11:8], pmd57 ext [15:12], pmd50 ext [19:16], pmd58 ext [23:20], and so on through pmd55 ext [59:56] and pmd63 ext [63:60].

1. In some cases, the Montecito processor execution trace buffer will capture the source (but not the target) address of an excepting branch instruction. This occurs on trapping branch instructions as well as faulting and multi-way branches.
Table 3-29. Execution Trace Buffer Extension Register Fields (PMD39)

Field      Bit Range  Description
pmd48 ext  3:0        Extension bits for PMD48:
                      If PMD48.bits[1:0] = 11:
                        1 = back end mispredicted the branch and the pipeline
                            was flushed by it
                        0 = no pipeline flushes are associated with this
                            branch
                      If PMD48.s = 1:
                        1 = branch was from bundle 1, add 0x1 to
                            PMD48.bits[63:4]
                        0 = branch was from bundle 0, no correction is
                            necessary
                        else, ignore
pmd56 ext  7:4        Same as above for PMD56
pmd49 ext  11:8       Same as above for PMD49
pmd57 ext  15:12      Same as above for PMD57
pmd50 ext  19:16      Same as above for PMD50
pmd58 ext  23:20      Same as above for PMD58
...        ...        ...
pmd63 ext  63:60      Same as above for PMD63
3.3.10.1.3 Notes on the Execution Trace Buffer
Although the Montecito ETB does not capture asynchronous interrupts as events, the address of
these handlers can be captured as target addresses. This could happen if, at the target of a captured
event (e.g. taken branch), an asynchronous event is taken before executing any instruction at the
target.
3.3.10.2 Instruction Pointer Address Capture (IP-EAR) (PMC42.mode='1xx)

Montecito has a new feature called Instruction Pointer Address Capture (or IP-EAR). This feature is intended to facilitate the correlation of performance monitoring events to IP values. To do this, the Montecito Execution Trace Buffer (ETB) can be configured to capture the IPs of retired instructions. When a performance monitoring event is used to trigger an IP-EAR freeze, if the IP which caused the event gets to retirement there is a good chance that IP will be captured in the ETB. The IP-EAR freezes after a programmable number of cycles following a PMU freeze, as described below.

Register PMC42 is used to configure this feature, and the ETB registers (PMD48-63) capture the data. PMD38 holds the index and overflow bits for the IP buffer much as it does for the ETB.
Note: Setting PMC42.mode to a non-0 value will override the setting of PMC39 (the ETB configuration register).

Table 3-30. IP-EAR Configuration Register Fields (PMC42)

Field  Bits   Description
plm    3:0    See Table 3-5, "Performance Monitor PMC Register Control
              Fields (PMC4-15)"
ig     5:4    Reads zero; Writes are ignored
pm     6      See Table 3-5, "Performance Monitor PMC Register Control
              Fields (PMC4-15)"
ig     7      Reads zero; Writes are ignored
mode   10:8   IP EAR mode:
              000: ETB Mode (IP-EAR not functional; ETB is functional)
              100: IP-EAR Mode (IP-EAR is functional; ETB not functional)
delay  18:11  Programmable delay before freezing
ig     63:20  Reads zero; Writes are ignored
The IP-EAR functions by continuously capturing retired IPs in PMD48-63 as long as it is enabled. It captures retired IPs and the elapsed time between retirements. Up to 16 entries can be captured. The IP-EAR has a slightly different freezing model than the rest of the Performance Monitors: it is capable of delaying its freeze for a number of cycles past the point of PMU freeze. The user can program an 8-bit number to determine the number of cycles the freeze will be delayed.

Note: PMD48-63 are not, in fact, 68b registers. Figure 3-28 and Figure 3-29 represent the virtual layout of an execution trace buffer entry in IP-EAR mode for the sake of clarity. The higher order bits [67:64] for each entry are mapped into PMD39 as described in Table 3-33.

Figure 3-28. IP-EAR data format (PMD48-63, where PMC42.mode == 100 and PMD48-63.ef = 0): ef [67], f [66], cycl [65:60], IP[63:4] [59:0].

Figure 3-29. IP-EAR data format (PMD48-63, where PMC42.mode == 100 and PMD48-63.ef = 1): ef [67], f [66], cycl [65:60], IP[63:12] [59:8], delay [7:0].

Table 3-31. IP-EAR Data Register Fields (PMD48-63) (PMC42.mode='1xx)

Field  Bits   Description
cycl   63:60  Elapsed cycles. The 4 least significant bits of a 6-bit
              elapsed cycle count from the previous retired IP. This is a
              saturating counter and will stay at all 1s when counted up to
              the maximum value.
              Note: the 2 most significant bits for each entry are found in
              PMD39 (see below).
IP     59:8   Retired IP value; bits [63:12]
delay  7:0    If ef = 1: indicates the remainder of the delay count
              Else: retired IP value, bits [11:4]
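Reconstructing the virtual 68-bit entry of Figure 3-28/3-29 means merging each PMD48-63 value with its 4-bit extension nibble from PMD39 (Table 3-33). A decoding sketch; the nibble ordering (48, 56, 49, 57, ...) follows Figure 3-31:

    #include <stdint.h>
    #include <stdio.h>

    /* Return the 4-bit PMD39 extension nibble for PMDreg (48..63).
     * Nibble order per Table 3-33: 48, 56, 49, 57, 50, 58, ... */
    static unsigned ext_nibble(uint64_t pmd39, unsigned reg) {
        unsigned k = reg - 48;
        unsigned j = (k < 8) ? (2 * k) : (2 * (k - 8) + 1);
        return (pmd39 >> (4 * j)) & 0xf;
    }

    static void decode_ipear(uint64_t pmd39, uint64_t pmd, unsigned reg) {
        unsigned ext  = ext_nibble(pmd39, reg);
        unsigned ef   = ext & 1;                 /* early freeze        */
        unsigned fl   = (ext >> 1) & 1;          /* pipe flush seen     */
        unsigned cycl = (((ext >> 2) & 3) << 4)  /* 2 MSBs from PMD39   */
                      | ((pmd >> 60) & 0xf);     /* 4 LSBs from the PMD */

        if (ef)   /* ef=1: IP[63:12] in bits 59:8, delay in bits 7:0 */
            printf("IP 0x%016llx, delay %u, cycl %u, flush %u\n",
                   (unsigned long long)((pmd & 0x0fffffffffffff00ULL) << 4),
                   (unsigned)(pmd & 0xff), cycl, fl);
        else      /* ef=0: IP[63:4] in bits 59:0 */
            printf("IP 0x%016llx, cycl %u, flush %u\n",
                   (unsigned long long)((pmd & 0x0fffffffffffffffULL) << 4),
                   cycl, fl);
    }

    int main(void) {
        decode_ipear(0x0, 0x0123456789abcdefULL >> 4, 48);
        return 0;
    }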
Figure 3-30. IP Trace Buffer Index Register Format (PMD38) (PMC42.mode='1xx): ig [63:6], full [5], ig [4], ebi [3:0].

Table 3-32. IP Trace Buffer Index Register Fields (PMD38) (PMC42.mode='1xx)

Field  Bit Range  Description
ebi    3:0        IP Trace Buffer Index [Range 0..15 - Index 0 indicates
                  PMD48]. Pointer to the next IP trace buffer entry to be
                  written:
                  if full=1: points to the oldest recorded IP entry
                  if full=0: points to the next location to be written
ig     4          Reads zero; Writes are ignored
full   5          Full Bit (sticky):
                  if full=1: IP trace buffer has wrapped
                  if full=0: IP trace buffer has not wrapped
ig     63:6       Reads zero; Writes are ignored
Figure 3-31. IP Trace Buffer Extension Register Format (PMD39) (PMC42.mode='1xx): sixteen 4-bit fields; pmd48 ext [3:0], pmd56 ext [7:4], pmd49 ext [11:8], pmd57 ext [15:12], pmd50 ext [19:16], pmd58 ext [23:20], and so on through pmd55 ext [59:56] and pmd63 ext [63:60].
Table 3-33. IP Trace Buffer Extension Register Fields (PMD39) (PMC42.mode='1xx)

Field      Bit Range  Bits  Description
pmd48 ext  3:0        3:2   cycl - Elapsed cycles
                            The 2 most significant bits of a 6-bit elapsed
                            cycle count from the previous retired IP. This
                            is a saturating counter and will stay at all 1s
                            when counted up to the maximum value.
                      1     f - Flush
                            Indicates whether there has been a pipe flush
                            since the last entry
                      0     ef - Early freeze
                            If 1: the current entry is an early freeze case.
                            Early freeze occurs if:
                            - PSR bits cause the IP-EAR to become disabled
                            - Thread switch
pmd56 ext  7:4              Same as above for PMD56
pmd49 ext  11:8             Same as above for PMD49
pmd57 ext  15:12            Same as above for PMD57
pmd50 ext  19:16            Same as above for PMD50
pmd58 ext  23:20            Same as above for PMD58
...        ...              ...
pmd63 ext  63:60            Same as above for PMD63
3.3.10.2.1 Notes on the IP-EAR

When the IP-EAR freezes due to its normal freeze mechanism (i.e. PMU freeze + delay), it captures one last entry with "ef"=0. The IP value in this entry could be incorrect since there is no guarantee that the CPU would be retiring an IP at this particular time. Since this is always the youngest entry captured in the IP-EAR buffer, it should be easier to identify this event.
3.3.11 Interrupts

As mentioned in Table 3-6, each one of registers PMD4-15 will cause an interrupt if the following conditions are all true:

• PMCi.oi=1 (i.e. the overflow interrupt is enabled for PMDi) and PMDi overflows. Note that there is only one interrupt line that will be raised regardless of which PMC/PMD set meets this condition.

This interrupt is an "External Interrupt" with Vector = 0x3000 and will be recognized only if the following conditions are true:

• PMV.m=0 and PMV.vector is set up correctly; i.e. Performance Monitor interrupts are not masked and a proper vector is programmed for this interrupt by writing to the PMV register.

• PSR.i=1 and PSR.ic=1; i.e. interruptions are unmasked and interruption collection is enabled in the Processor Status Register.

• TPR.mmi=0 (i.e. all external interrupts are not masked) and TPR.mic is a value such that the priority class that the Performance Monitor Interrupt belongs to is not masked. For example, if we assign vector 0xD2 to the Performance Monitor Interrupt, according to Table 5-7 "Interrupt Priorities, Enabling, and Masking" in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual, it will be in priority class 13. So any value less than 13 for TPR.mic is okay for recognizing this interrupt. Software sets these fields by writing to the TPR register.

• There are no higher priority faults, traps, or external interrupts pending.

The interrupt service routine needs to read the IVR register in order to figure out the highest priority external interrupt which needs to be serviced.

Before returning from the interrupt service routine, the Performance Monitor needs to be reinitialized such that the interrupt will be cleared. This can be done by clearing PMC.oi and/or reinitializing the PMD which caused the interrupt (which one can be determined by reading PMC0). In addition to this, all bits of PMC0 need to be cleared if further monitoring needs to be done.
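A skeleton of the service sequence described above is sketched below in C. The pmc[]/pmd[] arrays simulate the register files; on real hardware these accesses are privileged mov-to/from-PMC/PMD operations performed by the OS, and the assumption that PMC0 bit i flags an overflow of PMDi (i=4..15) follows the architected PMC0 layout.

    #include <stdint.h>
    #include <stdio.h>

    /* Simulated PMC/PMD register files; real accesses are privileged. */
    static uint64_t pmc[64], pmd[64];

    static void pmu_overflow_isr(void) {
        uint64_t ovf = pmc[0];               /* PMC0: freeze + overflow bits */

        for (unsigned i = 4; i <= 15; i++) { /* PMC0[i] flags PMDi overflow  */
            if (ovf & (1ULL << i)) {
                printf("PMD%u overflowed\n", i);
                pmd[i] = 0;                  /* re-arm (or clear PMCi.oi)    */
            }
        }
        pmc[0] = 0;                          /* clear all PMC0 bits so that  */
                                             /* monitoring can resume        */
        /* ...then EOI and return from interruption. */
    }

    int main(void) {
        pmc[0] = (1ULL << 5) | 1;            /* PMD5 overflow, fr=1          */
        pmu_overflow_isr();
        return 0;
    }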
3.3.12 Processor Reset, PAL Calls, and Low Power State

Processor Reset: On processor hardware reset, the monitoring control bits of all PMC registers are zero and PMV.m is set to one. This ensures that no interrupts are generated and events are not externally visible. On reset, PAL firmware ensures that the instruction address range check, the opcode matcher and the data address range check are initialized as follows:

• PMC32,33,34,35 = 0xffffffffffffffff (match all opcodes)
• PMC36 = 0xfffffff0 (no opcode match constraints)
• PMC38 = 0xdb6 (no instruction address range constraints)
• PMC41 = 0x2078fefefefe (no memory pipeline event constraints)

All other performance monitoring related state is undefined.
Table 3-34. Information Returned by PAL_PERF_MON_INFO for the Montecito Processor

Return Value          Description                                       Value
PAL_RETIRED           8-bit unsigned event type for counting the
                      number of untagged retired Itanium instructions
PAL_CYCLES            8-bit unsigned event type for counting the
                      number of running CPU cycles
PAL_WIDTH             8-bit unsigned number of implemented counter     48
                      bits
PAL_GENERIC_PM_PAIRS  8-bit unsigned number of generic PMC/PMD pairs   4
PAL_PMCmask           256-bit mask defining which PMC registers are
                      populated
PAL_PMDmask           256-bit mask defining which PMD registers are
                      populated
PAL_CYCLES_MASK       256-bit mask defining which PMC/PMD counters
                      can count running CPU cycles (event defined by
                      PAL_CYCLES)
PAL_RETIRED_MASK      256-bit mask defining which PMC/PMD counters
                      can count untagged retired Itanium instructions
                      (event defined by PAL_RETIRED)
PAL Call: As defined in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual, the PAL call PAL_PERF_MON_INFO provides software with information about the implemented performance monitors. The Montecito processor specific values are summarized in Table 3-34.

Low Power State: To ensure that monitor counts are preserved when the processor enters a low power state, PAL_LIGHT_HALT freezes event monitoring prior to powering down the processor. As a result, bus events occurring during the low power state (e.g. snoops) will not be counted. PAL_LIGHT_HALT preserves the original value of the PMC0 register.

§
4 Performance Monitor Events

4.1 Introduction

This chapter describes the architectural and microarchitectural events measurable on the Montecito processor through the performance monitoring mechanisms described earlier in Chapter 3. The early sections of this chapter provide a categorized high-level view of the event list, grouping logically related events together. Computation (either directly by a counter in hardware or indirectly as a "derived" event) of common performance metrics is also discussed. Each directly measurable event is then described in greater detail in the alphabetized list of all processor events in Section 4.15.

The Montecito processor is capable of monitoring numerous events. The majority of events can be selected as input to any of the PMD4-15 counters using the hexadecimal values shown in the "event code" field of the event list. Please refer to Section 4.8.2 and Section 4.8.4 for events that have more specific requirements.
4.2 Categorization of Events

Performance related events are grouped into the categories covered by the following sections. Each section includes a table providing information on directly measurable events. The section may also contain a second table of events that can be derived from those that are directly measurable. These derived events may simply rename existing events or present steps to determine the value of common performance metrics. Derived events are not, however, discussed in the systematic event listing in Section 4.15.

Directly measurable events often use the PMC.umask field (see Chapter 3) to measure a certain variant of the event in question. Symbolic event names for such events include a period to indicate use of the umask, specified by four bits in the detailed event description (x's are don't-cares).
The summary tables in the subsequent sections define events by specifying the following
attributes:
• Symbol Name - Symbolic name used to denote this event.
• Event Code - Hexadecimal value to program into bits [15:8] of the appropriate PMC register in
order to measure this event.
• IAR - Can this event be constrained by the Instruction Address Range registers?
• DAR - Can this event be constrained by the Data Address Range registers?
• OPC - Can this event be constrained by the Opcode Match registers?
• Max Inc/Cyc - Maximum Increment Per Cycle or the maximum value this event may be
increased by each cycle.
• T - Type; Either A for Active, F for Floating, S for Self Floating or C for Causal (see Table 4-42 for this information).
• Description - Brief description of the event.
4.2.1 Hyper-Threading and Event Types
The Montecito Processor implements a type of hardware based multi-threading that effectively
allows two threads to coexist within a processor core although only one thread is “active” within
the core’s pipeline at any moment in time. This affects how events are generated. Certain events
may be generated after the thread they belong to has become inactive. This also affects how events
are assigned to the threads occupying the same core, which is also dependent upon which PMD the
event was programmed into (see Section 3.3.2 for more information). Certain events do not have
the concept of a “home” thread.
These effects are further complicated by the use of the “.all” field, which allows a user to choose to
monitor a particular event for the thread being programmed to or for both threads (see Table 3-6). It
should be noted that monitoring with .all enabled does not always produce valid results and in
certain cases the setting of .all is ignored. Please refer to the individual events for further
information.
To help decipher these effects, events have been classified by the following types:
• Active - this event can only occur when the thread that generated it is “active” (currently
executing in the processor core’s pipeline) and is considered to be generated by the active
thread. Either type of monitor can be used if .all is not set. Example(s): BE_EXE_BUBBLE
and IA64_INST_RETIRED.
• Causal - this event does not belong to a thread. It is assigned to the active thread. Although it
seems natural to use either type of monitor if .all is not set, due to implementation constraints,
causal events should only be monitored in duplicated counters. There is one exception to this
rule: CPU_OP_CYCLES can be measured in both types of counters. Example(s):
CPU_OP_CYCLES and L2I_SNOOP_HITS.
• Floating - this event belongs to a thread, but could have been generated when its thread was
inactive (or “in the background”). These events should only be monitored in duplicated
counters. If .all is not set, only events associated with the monitoring thread will be captured.
If .all is set, events associated with both threads will be captured during the time the
monitoring thread has been assigned to a processor by the OS. Example(s):
L2D_REFERENCES and ER_MEM_READ_OUT_LO.
• Self Floating - this is a hybrid event used to better categorize certain BUS and SI (System
Interface) events. If this event was monitored with the .SELF umask, it is a Floating event. If
any other umask is used it is considered Causal. These events should only be monitored in
duplicated counters. Example(s): BUS_IO and SI_WRITEQ_INSERTS.
4.3 Basic Events

Table 4-1 summarizes two basic execution monitors. The Montecito retired instruction count, IA64_INST_RETIRED, includes both predicated true and predicated off instructions, but excludes RSE operations.
Table 4-1. Performance Monitors for Basic Events

Symbol Name        Event Code  IAR DAR OPC  Max Inc/Cyc  T  Description
CPU_OP_CYCLES      0x12        Y   N   Y    1            C  CPU Operating Cycles
IA64_INST_RETIRED  0x08        Y   N   Y    6            A  Retired Itanium® Instructions

Table 4-2. Derived Monitors for Basic Events

Symbol Name  Description                               Equation
IA64_IPC     Average Number of Itanium® Instructions   IA64_INST_RETIRED /
             Per Cycle During Itanium architecture-    CPU_OP_CYCLES
             based Code Sequences
4.4 Instruction Dispersal Events

Instruction cache lines are delivered to the execution core and dispersed to the Montecito processor functional units. The Montecito processor can issue, or disperse, 6 instructions per clock cycle. In other words, the Montecito processor can issue up to 6 instruction slots (or syllables). The following events are intended to give users an idea of how effectively instructions are dispersed and why they are not dispersed at full capacity. There are five reasons for not dispersing at full capacity. One is measured by DISP_STALLED. For every clock that dispersal is stalled, dispersal takes a hit of 6 syllables. The other four reasons are measured by SYLL_NOT_DISPERSED. Due to the way the hardware is designed, SYLL_NOT_DISPERSED may contain an overcount due to implicit and explicit bits; although this number should be small, SYLL_OVERCOUNT will provide an accurate count for it.
The relationship between these events is as follows:

6 * (CPU_OP_CYCLES - DISP_STALLED) = INST_DISPERSED + SYLL_NOT_DISPERSED.ALL - SYLL_OVERCOUNT.ALL
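This identity can be sanity-checked directly from a set of counter readings; the values below are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    /* Check the dispersal accounting identity:
     * 6*(CPU_OP_CYCLES - DISP_STALLED) ==
     *     INST_DISPERSED + SYLL_NOT_DISPERSED.ALL - SYLL_OVERCOUNT.ALL */
    int main(void) {
        uint64_t cpu_op_cycles      = 1000000;   /* illustrative values */
        uint64_t disp_stalled       = 120000;
        uint64_t inst_dispersed     = 4100000;
        uint64_t syll_not_dispersed = 1185000;
        uint64_t syll_overcount     = 5000;

        uint64_t lhs = 6 * (cpu_op_cycles - disp_stalled);
        uint64_t rhs = inst_dispersed + syll_not_dispersed - syll_overcount;

        printf("issue slots: %llu, accounted: %llu (%s)\n",
               (unsigned long long)lhs, (unsigned long long)rhs,
               lhs == rhs ? "consistent" : "inconsistent");
        return 0;
    }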
Table 4-3. Performance Monitors for Instruction Dispersal Events

Symbol Name         Event Code  IAR DAR OPC  Max Inc/Cyc  Description
DISP_STALLED        0x49        N   N   N    1            Number of cycles dispersal stalled
INST_DISPERSED      0x4d        Y   N   N    6            Syllables dispersed from REN to REG stage
SYLL_NOT_DISPERSED  0x4e        Y   N   N    5            Syllables not dispersed
SYLL_OVERCOUNT      0x4f        Y   N   N    2            Syllables overcounted
4.5 Instruction Execution Events

Retired instruction counts, IA64_TAGGED_INST_RETIRED and NOPS_RETIRED, are based on tag information specified by the address range check and opcode match facilities. A separate event, PREDICATE_SQUASHED_RETIRED, is provided to count predicated off instructions.

The FP monitors listed in the table capture dynamic information about pipeline flushes and flush-to-zero occurrences due to floating-point operations. The FP_OPS_RETIRED event counts the number of retired FP operations.

As Table 4-4 describes, monitors for control and data speculation capture dynamic run-time information: the number of failed instructions (INST_FAILED_CHKS_RETIRED.ALL), the number of advanced load checks and check loads (INST_CHKA_LDC_ALAT.ALL), and failed advanced load checks and check loads (INST_FAILED_CHKA_LDC_ALAT.ALL) as seen by the ALAT. The number of retired instructions is monitored by the IA64_TAGGED_INST_RETIRED event, given the appropriate opcode mask. Since the Montecito processor ALAT is updated by operations on mispredicted branch paths, the number of advanced load checks and check loads needs an explicit event (INST_CHKA_LDC_ALAT.ALL).
Table 4-4. Performance Monitors for Instruction Execution Events

Symbol Name                 Event Code  IAR DAR OPC  Max Inc/Cyc  Description
ALAT_CAPACITY_MISS          0x58        Y   Y   Y    2            ALAT Entry Replaced
FP_FAILED_FCHKF             0x06        Y   N   N    1            Failed fchkf
FP_FALSE_SIRSTALL           0x05        Y   N   N    1            SIR stall without a trap
FP_FLUSH_TO_ZERO            0x0b        Y   N   N    2            FP Result Flushed to Zero
FP_OPS_RETIRED              0x09        Y   N   N    6            Retired FP operations
FP_TRUE_SIRSTALL            0x03        Y   N   N    1            SIR stall asserted and leads to a trap
IA64_TAGGED_INST_RETIRED    0x08        Y   N   Y    6            Retired Tagged Instructions
INST_CHKA_LDC_ALAT          0x56        Y   Y   Y    2            Advanced Check Loads
INST_FAILED_CHKA_LDC_ALAT   0x57        Y   Y   Y    1            Failed Advanced Check Loads
INST_FAILED_CHKS_RETIRED    0x55        N   N   N    1            Failed Speculative Check Loads
LOADS_RETIRED               0xcd        Y   Y   Y    4            Retired Loads
MISALIGNED_LOADS_RETIRED    0xce        Y   Y   Y    4            Retired Misaligned Load Instructions
MISALIGNED_STORES_RETIRED   0xd2        Y   Y   Y    2            Retired Misaligned Store Instructions
NOPS_RETIRED                0x50        Y   N   Y    6            Retired NOP Instructions
PREDICATE_SQUASHED_RETIRED  0x51        Y   N   Y    6            Instructions Squashed Due to Predicate Off
STORES_RETIRED              0xd1        Y   Y   Y    2            Retired Stores
UC_LOADS_RETIRED            0xcf        Y   Y   Y    4            Retired Uncacheable Loads
UC_STORES_RETIRED           0xd0        Y   Y   Y    2            Retired Uncacheable Stores

Table 4-5. Derived Monitors for Instruction Execution Events

Symbol Name           Description                      Equation
ALAT_EAR_EVENTS       Counts the number of ALAT        DATA_EAR_EVENTS
                      events captured by EAR
CTRL_SPEC_MISS_RATIO  Control Speculation Miss Ratio   INST_FAILED_CHKS_RETIRED.ALL /
                                                       IA64_TAGGED_INST_RETIRED[chk.s]
DATA_SPEC_MISS_RATIO  Data Speculation Miss Ratio      INST_FAILED_CHKA_LDC_ALAT.ALL /
                                                       INST_CHKA_LDC_ALAT.ALL
4.6 Stall Events

Montecito processor stall accounting is separated into front-end and back-end stall accounting. Back-end and front-end events should not be compared since they are counted in different stages of the pipeline.
The back-end can be stalled due to five distinct mechanisms: FPU/L1D, RSE, EXE,
branch/exception or the front-end. BACK_END_BUBBLE provides an overview of which
mechanisms are producing stalls while the other back-end counters provide more explicit
information broken down by category. Each time there is a stall, a bubble is inserted in only one
location in the pipeline. Each time there is a flush, bubbles are inserted in all locations in the
pipeline. With the exception of BACK_END_BUBBLE, the back-end stall accounting events are
prioritized in order to mimic the operation of the main pipe (i.e. priority from high to low is given
to: BE_FLUSH_BUBBLE.XPN, BE_FLUSH_BUBBLE.BRU, L1D_FPU stalls, EXE stalls, RSE
stalls, front-end stalls). This prioritization guarantees that the events are mutually exclusive and
only the most important cause, the one latest in the pipeline, is counted.
The Montecito processor’s front-end can be stalled due to seven distinct mechanisms: FEFLUSH,
TLBMISS, IMISS, branch, FILL-RECIRC, BUBBLE, IBFULL (listed in priority from high to
low). The front-end stalls have exactly the same effect on the pipeline so their accounting is
simpler.
During every clock in which the CPU is not in a halted state, the back-end pipeline has either a
bubble or it retires 1 or more instructions, CPU_OP_CYCLES = BACK_END_BUBBLE.all +
(IA64_INST_RETIRED >= 1). To further investigate bubbles occurring in the back-end of the
pipeline the following equation holds true: BACK_END_BUBBLE.all = BE_RSE_BUBBLE.all +
BE_EXE_BUBBLE.all + BE_L1D_FPU_BUBBLE.all + BE_FLUSH_BUBBLE.all +
BACK_END_BUBBLE.fe.
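The decomposition above suggests a simple breakdown report. A sketch with illustrative counter values only:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Illustrative counter values only. */
        uint64_t back_end_bubble_all = 400000;
        uint64_t parts[5] = { 90000,   /* BE_RSE_BUBBLE.all     */
                              150000,  /* BE_EXE_BUBBLE.all     */
                              60000,   /* BE_L1D_FPU_BUBBLE.all */
                              40000,   /* BE_FLUSH_BUBBLE.all   */
                              60000 }; /* BACK_END_BUBBLE.fe    */
        const char *names[5] = { "RSE", "EXE", "L1D/FPU", "flush",
                                 "front-end" };

        for (int i = 0; i < 5; i++)
            printf("%-10s %6.2f%% of back-end bubbles\n", names[i],
                   100.0 * (double)parts[i] / (double)back_end_bubble_all);
        return 0;
    }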
Note: CPU_OP_CYCLES is not incremented during a HALT state. If a measurement is set up to match
clock cycles to bubbles to instructions retired (as outlined above) and a halt occurs within the
measurement interval, measuring CYCLES_HALTED in PMD10 may be used to compensate.
Each of the stall events (summarized in Table 4-6) takes a umask to choose among several available
sub-events. Please refer to the detailed event descriptions in Section 4.15 for a list of available
sub-events and their individual descriptions.
Table 4-6. Performance Monitors for Stall Events

Symbol Name                 Event Code  IAR DAR OPC  Max Inc/Cyc  Description
BACK_END_BUBBLE             0x00        N   N   N    1            Full pipe bubbles in main pipe
BE_EXE_BUBBLE               0x02        N   N   N    1            Full pipe bubbles in main pipe due to Execution unit stalls
BE_FLUSH_BUBBLE             0x04        N   N   N    1            Full pipe bubbles in main pipe due to flushes
BE_L1D_FPU_BUBBLE           0xca        N   N   N    1            Full pipe bubbles in main pipe due to FP or L1D cache
BE_LOST_BW_DUE_TO_FE        0x72        N   N   N    2            Invalid bundles if BE not stalled for other reasons
BE_RSE_BUBBLE               0x01        N   N   N    1            Full pipe bubbles in main pipe due to RSE stalls
FE_BUBBLE                   0x71        N   N   N    1            Bubbles seen by FE
FE_LOST_BW                  0x70        N   N   N    2            Invalid bundles at the entrance to IB
IDEAL_BE_LOST_BW_DUE_TO_FE  0x73        N   N   N    2            Invalid bundles at the exit from IB
4.7 Branch Events

Note that for branch events, retirement means a branch was reached and committed regardless of its predicate value. Details concerning prediction results are contained in pairs of monitors. For accurate misprediction counts, the BR_MISPRED_DETAIL count must be corrected by the corresponding BR_MISPRED_DETAIL2 (unknown path component) count. By performing this calculation for every umask, one can obtain a true value for the BR_MISPRED_DETAIL event.

The method for obtaining the true value of BR_PATH_PRED is slightly different. When there is more than one branch in a bundle and one is predicted as taken, all the higher number ports are forced to a predicted not taken mode without actually knowing their true prediction.

The true OKPRED_NOTTAKEN predicted path information can be obtained by calculating:

BR_PATH_PRED.[branch type].OKPRED_NOTTAKEN - BR_PATH_PRED2.[branch type].UNKNOWNPRED_NOTTAKEN

using the same "branch type" (ALL, IPREL, RETURN, NRETIND) specified for both events.

Similarly, the true MISPRED_TAKEN predicted path information can be obtained by calculating:

BR_PATH_PRED.[branch type].MISPRED_TAKEN - BR_PATH_PRED2.[branch type].UNKNOWNPRED_TAKEN

using the same "branch type" (ALL, IPREL, RETURN, NRETIND) selected for both events.

BRANCH_EVENT counts the number of events captured by the Execution Trace Buffer. For detailed information on the ETB please refer to Section 3.3.10.
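Both corrections are plain subtractions over the raw counts; the counter values below are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Illustrative counter values for one branch type (e.g. ALL). */
        uint64_t path_ok_nt      = 52000;  /* BR_PATH_PRED.ALL.OKPRED_NOTTAKEN       */
        uint64_t path2_unk_nt    = 7000;   /* BR_PATH_PRED2.ALL.UNKNOWNPRED_NOTTAKEN */
        uint64_t path_mp_taken   = 9000;   /* BR_PATH_PRED.ALL.MISPRED_TAKEN         */
        uint64_t path2_unk_taken = 6500;   /* BR_PATH_PRED2.ALL.UNKNOWNPRED_TAKEN    */

        printf("true OKPRED_NOTTAKEN: %llu\n",
               (unsigned long long)(path_ok_nt - path2_unk_nt));
        printf("true MISPRED_TAKEN:   %llu\n",
               (unsigned long long)(path_mp_taken - path2_unk_taken));
        return 0;
    }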
Table 4-7. Performance Monitors for Branch Events

Symbol Name           Event Code  IAR DAR OPC  Max Inc/Cyc  Description
BE_BR_MISPRED_DETAIL  0x61        Y   N   Y    1            BE branch misprediction detail
BRANCH_EVENT          0x11        Y   N   Y    1            Branch Event Captured
BR_MISPRED_DETAIL     0x5b        Y   N   Y    3            FE Branch Mispredict Detail
BR_MISPRED_DETAIL2    0x68        Y   N   Y    2            FE Branch Mispredict Detail (Unknown path component)
BR_PATH_PRED          0x54        Y   N   Y    3            FE Branch Path Prediction Detail
BR_PATH_PRED2         0x6a        Y   N   Y    2            FE Branch Path Prediction Detail (Unknown prediction component)
ENCBR_MISPRED_DETAIL  0x63        Y   N   Y    3            Number of encoded branches retired
4.8 Memory Hierarchy
This section summarizes events related to the Montecito processor’s memory hierarchy. The
memory hierarchy events are grouped as follows:
• L1 Instruction Cache and Prefetch Events (Section 4.8.1)
• L1 Data Cache Events (Section 4.8.2)
• L2 Instruction Cache Events (Section 4.8.3)
• L2 Data Cache Events (Section 4.8.4)
• L3 Cache Events (Section 4.8.5)
An overview of the Montecito processor’s three level memory hierarchy and its event monitors is
shown in Figure 4-1. The instruction and the data stream work through separate L1 caches. The L1
data cache is a write-through cache. Two separate L2I and L2D caches serve both the L1
instruction and data caches respectively, and are backed by a large unified L3 cache. Events for
individual levels of the cache hierarchy are described in the Section 4.8.1 through Section 4.8.3.
Figure 4-1. Event Monitors in the Itanium® 2 Processor Memory Hierarchy
4.8.1 L1 Instruction Cache and Prefetch Events
Table 4-8 describes and summarizes the events that the Montecito processor provides to monitor
L1 instruction cache demand fetch and prefetch activity. The instruction fetch monitors distinguish
between demand fetch (L1I_READS) and prefetch activity (L1I_PREFETCHES). The amount of
data returned from the L2I to the L1 instruction cache and the Instruction Streaming Buffer is
monitored by two events, L1I_FILLS and ISB_LINES_IN. The L1I_EAR_EVENTS monitor
counts how many instruction cache or L1ITLB misses are captured by the instruction event address
register.
The L1 instruction cache and prefetch events can be qualified by the instruction address range
check, but not by the opcode matcher. Since instruction cache and prefetch events occur early in
the processor pipeline, they include events caused by speculative, wrong-path instructions as well
as predicated-off instructions. Since the address range check is based on speculative instruction
addresses rather than retired instruction addresses, event counts may be inaccurate when the range
checker is confined to address ranges smaller than the length of the processor pipeline (see
Chapter 3 for details).
L1I_EAR_EVENTS counts the number of events captured by the Montecito processor’s
instruction EARs. Please refer to Chapter 3 for more detailed information about the instruction
EARs.
Table 4-8. Performance Monitors for L1/L2 Instruction Cache and Prefetch Events

Symbol Name           Event Code  IAR DAR OPC  Max Inc/Cyc  Description
ISB_BUNPAIRS_IN       0x46        Y   N   N    1            Bundle pairs written from L2I into FE
L1I_EAR_EVENTS        0x43        Y   N   N    1            Instruction EAR Events
L1I_FETCH_ISB_HIT     0x66        Y   N   N    1            "Just-in-time" instruction fetch hitting in and being bypassed from ISB
L1I_FETCH_RAB_HIT     0x65        Y   N   N    1            Instruction fetch hitting in RAB
L1I_FILLS             0x41        Y   N   N    1            L1 Instruction Cache Fills
L1I_PREFETCHES        0x44        Y   N   N    1            L1 Instruction Prefetch Requests
L1I_PREFETCH_STALL    0x67        N   N   N    1            Why prefetch pipeline is stalled?
L1I_PURGE             0x4b        Y   N   N    1            L1ITLB purges handled by L1I
L1I_PVAB_OVERFLOW     0x69        N   N   N    1            PVAB overflow
L1I_RAB_ALMOST_FULL   0x64        N   N   N    1            Is RAB almost full?
L1I_RAB_FULL          0x60        N   N   N    1            Is RAB full?
L1I_READS             0x40        Y   N   N    1            L1 Instruction Cache Reads
L1I_SNOOP             0x4a        Y   Y   Y    1            Snoop requests handled by L1I
L1I_STRM_PREFETCHES   0x5f        Y   N   N    1            L1 Instruction Cache line prefetch requests
L2I_DEMAND_READS      0x42        Y   N   N    1            L1 Instruction Cache and ISB Misses
L2I_PREFETCHES        0x45        Y   N   N    1            L2 Instruction Prefetch Requests
Table 4-9. Derived Monitors for L1 Instruction Cache and Prefetch Events

Symbol Name                 Description                                                              Equation
L1I_MISSES                  L1I Misses                                                               L2I_DEMAND_READS
ISB_LINES_IN                Number of cache lines written from L2I (and beyond) into the front end   ISB_BUNPAIRS_IN / 4
L1I_DEMAND_MISS_RATIO       L1I Demand Miss Ratio                                                    L2I_DEMAND_READS / L1I_READS
L1I_MISS_RATIO              L1I Miss Ratio                                                           (L1I_MISSES + L2I_PREFETCHES) / (L1I_READS + L1I_PREFETCHES)
L1I_PREFETCH_MISS_RATIO     L1I Prefetch Miss Ratio                                                  L2I_PREFETCHES / L1I_PREFETCHES
L1I_REFERENCES              Number of L1 Instruction Cache reads and fills                           L1I_READS + L1I_PREFETCHES
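As a worked illustration of the equations in Table 4-9, the sketch below (not part of the manual;
the raw counts are hypothetical values assumed to have been read from the appropriate PMDs)
computes the derived L1I miss ratios from the four raw monitors:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical raw counts read from the PMDs. */
        unsigned long l1i_reads        = 1000000; /* L1I_READS */
        unsigned long l1i_prefetches   =  250000; /* L1I_PREFETCHES */
        unsigned long l2i_demand_reads =   40000; /* L2I_DEMAND_READS */
        unsigned long l2i_prefetches   =   10000; /* L2I_PREFETCHES */

        /* L1I_MISSES is simply L2I_DEMAND_READS (Table 4-9). */
        unsigned long l1i_misses = l2i_demand_reads;

        /* L1I_DEMAND_MISS_RATIO = L2I_DEMAND_READS / L1I_READS */
        double demand_miss_ratio = (double)l2i_demand_reads / (double)l1i_reads;

        /* L1I_MISS_RATIO = (L1I_MISSES + L2I_PREFETCHES) /
                            (L1I_READS + L1I_PREFETCHES)          */
        double miss_ratio = (double)(l1i_misses + l2i_prefetches)
                          / (double)(l1i_reads + l1i_prefetches);

        printf("L1I demand miss ratio:  %.4f\n", demand_miss_ratio);
        printf("L1I overall miss ratio: %.4f\n", miss_ratio);
        return 0;
    }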
4.8.2 L1 Data Cache Events
Table 4-10 lists the Montecito processor’s L1 data cache monitors. As shown in Figure 4-1, the
write-through L1 data cache services cacheable loads, integer and RSE loads, check loads and
hinted L2 memory references. DATA_REFERENCES is the number of issued data memory
references.
L1 data cache reads (L1D_READS) and L1 data cache misses (L1D_READ_MISSES) monitor the
read hit/miss rate of the L1 data cache. RSE operations are included in all data cache monitors, but
are not broken down explicitly. The DATA_EAR_EVENTS monitor counts how many data cache
or DTLB misses are captured by the Data Event Address Register. Please refer to Section 3.3.9 for
more detailed information about the data EARs.
L1D cache events have been divided into six sets (sets 0, 1, 2, 3, 4 and 6; set 5 is reserved). Events
from different sets of L1D cache events cannot be measured at the same time. Each set is selected
by the event code programmed into PMC5 (i.e., to measure any of the events in a given set, one of
them must be measured by PMD5). There are no limitations on umasks. Monitors belonging to
each set are explicitly presented in Table 4-10 through Table 4-16; the sketch below illustrates the
set-selection rule.
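The following hypothetical helper (a sketch, not part of any Intel library) maps an L1D event code
from Table 4-11 through Table 4-16 to its set, so that a measurement tool can reject selections that
mix sets:

    /* Returns the L1D event set for a given event code, or -1 if the code
       is not one of the set-restricted L1D events (set 5 is reserved). */
    int l1d_event_set(unsigned int event_code)
    {
        switch (event_code) {
        case 0xc0: case 0xc1: case 0xc2: case 0xc3: return 0; /* Table 4-11 */
        case 0xc4: case 0xc5: case 0xc7:            return 1; /* Table 4-12 */
        case 0xca:                                  return 2; /* Table 4-13 */
        case 0xcd: case 0xce: case 0xcf:            return 3; /* Table 4-14 */
        case 0xd0: case 0xd1: case 0xd2:            return 4; /* Table 4-15 */
        case 0xd8: case 0xd9:                       return 6; /* Table 4-16 */
        default:                                    return -1;
        }
    }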
Table 4-10. Performance Monitors for L1 Data Cache Events

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
DATA_EAR_EVENTS         0xc8        Y    Y    Y    1            L1 Data Cache EAR Events
L1D_READS_SET0          0xc2        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET0    0xc3        Y    Y    Y    4            Data memory references issued to memory pipeline
L1D_READS_SET1          0xc4        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET1    0xc5        Y    Y    Y    4            Data memory references issued to memory pipeline
L1D_READ_MISSES         0xc7        Y    Y    Y    2            L1 Data Cache Read Misses
4.8.2.1 L1D Cache Events (set 0)

Table 4-11. Performance Monitors for L1D Cache Set 0

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L1DTLB_TRANSFER         0xc0        Y    Y    Y    1            L1DTLB misses hit in L2DTLB for access counted in L1D_READS
L2DTLB_MISSES           0xc1        Y    Y    Y    4            L2DTLB Misses
L1D_READS_SET0          0xc2        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET0    0xc3        Y    Y    Y    4            Data memory references issued to memory pipeline

4.8.2.2 L1D Cache Events (set 1)

Table 4-12. Performance Monitors for L1D Cache Set 1

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L1D_READS_SET1          0xc4        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET1    0xc5        Y    Y    Y    4            Data memory references issued to memory pipeline
L1D_READ_MISSES         0xc7        Y    Y    Y    2            L1 Data Cache Read Misses

4.8.2.3 L1D Cache Events (set 2)

Table 4-13. Performance Monitors for L1D Cache Set 2

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
BE_L1D_FPU_BUBBLE       0xca        N    N    N    1            Full pipe bubbles in main pipe due to FP or L1D cache
4.8.2.4 L1D Cache Events (set 3)

Table 4-14. Performance Monitors for L1D Cache Set 3

Symbol Name                  Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
LOADS_RETIRED                0xcd        Y    Y    Y    4            Retired Loads
MISALIGNED_LOADS_RETIRED     0xce        Y    Y    Y    4            Retired Misaligned Load Instructions
UC_LOADS_RETIRED             0xcf        Y    Y    Y    4            Retired Uncacheable Loads
4.8.2.5 L1D Cache Events (set 4)

Table 4-15. Performance Monitors for L1D Cache Set 4

Symbol Name                   Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
MISALIGNED_STORES_RETIRED     0xd2        Y    Y    Y    2            Retired Misaligned Store Instructions
STORES_RETIRED                0xd1        Y    Y    Y    2            Retired Stores
UC_STORES_RETIRED             0xd0        Y    Y    Y    2            Retired Uncacheable Stores

4.8.2.6 L1D Cache Events (set 6)

Table 4-16. Performance Monitors for L1D Cache Set 6

Symbol Name            Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
LOADS_RETIRED_INTG     0xd8        Y    Y    Y    2            Integer loads retired
SPEC_LOADS_NATTED      0xd9        Y    Y    Y    2            Times ld.s or ld.sa NaT'd
4.8.3 L2 Instruction Cache Events

Table 4-17. Performance Monitors for L2I Cache

Symbol Name           Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2I_READS             0x78        Y    N    Y    1            L2I Cacheable Reads
L2I_UC_READS          0x79        Y    N    Y    1            L2I uncacheable reads
L2I_VICTIMIZATIONS    0x7a        Y    N    Y    1            L2I victimizations
L2I_RECIRCULATES      0x7b        Y    N    Y    1            L2I recirculates
L2I_L3_REJECTS        0x7c        Y    N    Y    1            L3 rejects
L2I_HIT_CONFLICTS     0x7d        Y    N    Y    1            L2I hit conflicts
L2I_SPEC_ABORTS       0x7e        Y    N    Y    1            L2I speculative aborts
L2I_SNOOP_HITS        0x7f        Y    N    Y    1            L2I snoop hits

Table 4-18. Derived Monitors for L2I Cache

Symbol Name       Description                                              Equation
L2I_SNOOPS        Number of snoops received by the L2I                     L1I_SNOOPS
L2I_FILLS         L2I Fills                                                L2I_READS.MISS.DMND + L2I_READS.MISS.PFTCH
L2I_FETCHES       Requests made to L2I due to demand instruction fetches   L2I_READS.ALL.DMND
L2I_REFERENCES    Instruction requests made to L2I                         L2I_READS.ALL.ALL
L2I_MISS_RATIO    Percentage of L2I Misses                                 L2I_READS.MISS / L2I_READS.ALL
L2I_HIT_RATIO     Percentage of L2I Hits                                   L2I_READS.HIT / L2I_READS.ALL
4.8.4 L2 Data Cache Events

Table 4-19 summarizes the events available to monitor the Montecito processor L2D cache.

Most L2D events have been divided into eight sets. Only events within two of these sets (or
non-L2D events) can be measured at the same time. These two sets are selected by the event codes
programmed into PMC4 and PMC6 (i.e., to measure any of the events in a particular set, one of
them needs to be measured by PMD4 or PMD6).

Note: The opposite also holds. If PMC4 is not programmed to monitor an L2D event but PMC5 or PMC8
is (and similarly for PMC6 with PMC7/PMC9), the PMD values are undefined.

Any event set can be measured by programming either PMC4 or PMC6. Once PMC4 is
programmed to measure an event from one L2D event set, PMD4, PMD5 and PMD8 can only
measure events from that same L2D event set (PMD5 and PMD8 share the umask programmed into
PMC4). Similarly, once PMC6 is programmed to monitor another set (which could be the same set
as measured by PMC4), PMD6, PMD7 and PMD9 can measure events from that set only. None of
the L2 data cache events can be measured using PMD10-15.
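These pairing rules can be restated as a small validity check. The sketch below is a hypothetical
helper, not actual driver code; it assumes the L2D event set of each selected event is already known:

    /* pmc4_set/pmc6_set: L2D event set selected by PMC4/PMC6, or -1 if that
       PMC is not programmed with an L2D event. Returns nonzero if the given
       PMD may count an event from event_set under the rules above. */
    int l2d_pmd_selection_ok(int pmd, int event_set,
                             int pmc4_set, int pmc6_set)
    {
        switch (pmd) {
        case 4: case 5: case 8:  /* these PMDs follow PMC4's set (and umask) */
            return pmc4_set >= 0 && event_set == pmc4_set;
        case 6: case 7: case 9:  /* these PMDs follow PMC6's set */
            return pmc6_set >= 0 && event_set == pmc6_set;
        default:                 /* PMD10-15 cannot count L2D events */
            return 0;
        }
    }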
Support for the .all bit has the same restrictions as the set restrictions. The value set for .all in
PMC4 applies to all of the events selected by it; even if the .all values in PMC5 and PMC8 differ
from the value in PMC4, PMC4's value selects the capability. The same applies to PMC6, PMC7
and PMC9. Original Montecito documentation claimed that thread 0's PMC4 .me/.all setting
applied to PMC4-PMC7, but that is no longer true; this bit is available to both threads. Hence, it is
possible for one thread's PMDs to monitor just the events credited to that thread while the other
thread's PMDs monitor events for both threads (if PMC4.all is set). Note that some events do not
support .all counting. If .all counting is enabled for events that do not support it, the resulting
counts will be wrong.
While the L2D events support threading, not all counts have access to the exact thread ID needed.
Each count is labeled with one of ActiveTrue, ActiveApprox, or TrueThrd. ActiveTrue means that
the event is counted against the currently active thread, and that thread is the only one that can see
the event when it is counted. ActiveApprox means the event is counted against the currently active
thread, but there are some corner cases where the event may actually be due to the other,
non-active thread; in most cases the error due to this approximation is assumed to be negligible.
TrueThrd indicates that the L2D cache knows which thread the count belongs to, independent of
the active-thread indication, and that knowledge is always correct.
Table 4-19. Performance Monitors for L2 Data Cache Events

Symbol Name                  Event Code  IAR  DAR  OPC  .all capable  Max Inc/Cyc  Description
L2D_OZQ_CANCELS0             0xe0        Y    Y    Y    Y             4            L2D OZQ cancels
L2D_OZQ_FULL                 0xe1, 0xe3  N    N    N    N             1            L2D OZQ is full
L2D_OZQ_CANCELS1             0xe2        Y    Y    Y    Y             4            L2D OZQ cancels
L2D_BYPASS                   0xe4        Y    Y    Y    Y/N           1            L2D Hit or Miss Bypass (.all support is umask dependent)
L2D_OZQ_RELEASE              0xe5        N    N    N    N             1            Clocks with release ordering attribute existed in L2D OZQ
L2D_REFERENCES               0xe6        Y    Y    Y    Y             4            Data RD/WR access to L2D
L2D_L3ACCESS_CANCEL          0xe8        Y    Y    Y    N             1            Canceled L3 accesses
L2D_OZDB_FULL                0xe9        N    N    N    Y             1            L2D OZ data buffer is full
L2D_FORCE_RECIRC             0xea        Y    Y    Y    Y/N           4            Forced recirculates
L2D_ISSUED_RECIRC_OZQ_ACC    0xeb        Y    Y    Y    Y             1            Count the number of times a recirculate issue was attempted and not preempted
L2D_BAD_LINES_SELECTED       0xec        Y    Y    Y    Y             4            Valid line replaced when invalid line is available
L2D_STORE_HIT_SHARED         0xed        Y    Y    Y    Y             2            Store hit a shared line
L2D_OZQ_ACQUIRE              0xef        N    N    N    Y             1            Clocks with acquire ordering attribute existed in L2D OZQ
L2D_OPS_ISSUED               0xf0        Y    Y    Y    N             4            Different operations issued by L2D
L2D_FILLB_FULL               0xf1        N    N    N    N             1            L2D Fill buffer is full
L2D_FILL_MESI_STATE          0xf2        Y    Y    Y    Y             1            MESI states of fills to L2D cache
L2D_VICTIMB_FULL             0xf3        N    N    N    Y             1            L2D victim buffer is full
L2D_MISSES                   0xcb        Y    Y    Y    Y             1            An L2D miss has been issued to the L3; does not include secondary misses
L2D_INSERT_HITS              0xb1        Y    Y    Y    Y             4            Count number of times an inserting data request hit in the L2D
L2D_INSERT_MISSES            0xb0        Y    Y    Y    Y             4            Count number of times an inserting data request missed in the L2D
Table 4-20. Derived Monitors for L2 Data Cache Events

Symbol Name             Description                                                            Equation
L2D_READS               L2 Data Read Requests                                                  L2D_REFERENCES.READS
L2D_WRITES              L2 Data Write Requests                                                 L2D_REFERENCES.WRITES
L2D_MISS_RATIO          Percentage of L2D Misses                                               L2D_INSERT_MISSES / L2D_REFERENCES
L2D_HIT_RATIO           Percentage of L2D Hits                                                 L2D_INSERT_HITS / L2D_REFERENCES
L2D_RECIRC_ATTEMPTS     Number of times the L2 issue logic attempted to issue a recirculate
4.8.4.1 L2 Data Cache Events (set 0)

Table 4-21. Performance Monitors for L2 Data Cache Set 0

Symbol Name         Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_OZQ_CANCELS0    0xe0        Y    Y    Y    4            L2D OZQ cancels-TrueThrd
L2D_OZQ_FULL        0xe1, 0xe3  N    N    N    1            L2D OZQ Full-ActiveApprox
L2D_OZQ_CANCELS1    0xe2        Y    Y    Y    4            L2D OZQ cancels-TrueThrd

L2D_OZQ_FULL is not .all capable.
4.8.4.2 L2 Data Cache Events (set 1)

Table 4-22. Performance Monitors for L2 Data Cache Set 1

Symbol Name        Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_BYPASS         0xe4        Y    Y    Y    4            Count L2 Hit bypasses-TrueThrd
L2D_OZQ_RELEASE    0xe5        N    N    N    1            Effective Release is valid in Ozq-ActiveApprox

The L2D_BYPASS count on Itanium 2 processors was too speculative to be useful. It has been
fixed: the event now counts how many bypasses occurred in a given cycle, rather than signaling a 1
for 1-4 bypasses. The 5- and 7-cycle umasks of L2D_BYPASS and the L2D_OZQ_RELEASE
counts are not .all capable.
4.8.4.3 L2 Data Cache Events (set 2)
Table 4-23. Performance Monitors for L2 Data Cache Set 2

Symbol Name       Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_REFERENCES    0xe6        Y    Y    Y    4            Inserts of Data Accesses into Ozq-ActiveTrue
4.8.4.4 L2 Data Cache Events (set 3)

Table 4-24. Performance Monitors for L2 Data Cache Set 3

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_L3_ACCESS_CANCEL    0xe8        Y    Y    Y    1            L2D request to L3 was cancelled-TrueThrd
L2D_OZDB_FULL           0xe9        N    N    N    1            L2D OZ data buffer is full-ActiveApprox
L2D_L3_ACCESS_CANCEL events are not .all capable.
4.8.4.5 L2 Data Cache Events (set 4)

Table 4-25. Performance Monitors for L2 Data Cache Set 4

Symbol Name                  Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_FORCE_RECIRC             0xea        Y    Y    Y    4            Forced recirculates - ActiveTrue or ActiveApprox
L2D_ISSUED_RECIRC_OZQ_ACC    0xeb        Y    Y    Y    1            Ozq Issued Recirculate - TrueThrd

Some umasks of L2D_FORCE_RECIRC are not .all capable.

4.8.4.6 L2 Data Cache Events (set 5)

Table 4-26. Performance Monitors for L2 Data Cache Set 5

Symbol Name               Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_BAD_LINES_SELECTED    0xec        Y    Y    Y    4            Valid line replaced when invalid line is available
L2D_STORE_HIT_SHARED      0xed        Y    Y    Y    2            Store hit a shared line

4.8.4.7 L2 Data Cache Events (set 6)

Table 4-27. Performance Monitors for L2 Data Cache Set 6

Symbol Name        Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_OZQ_ACQUIRE    0xef        N    N    N    1            Valid acquire operation in Ozq-TrueThrd
4.8.4.8 L2 Data Cache Events (set 7)

Table 4-28. Performance Monitors for L2 Data Cache Set 7

Symbol Name       Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_OPS_ISSUED    0xf0        Y    Y    Y    4            Different operations issued by L2D-TrueThrd
L2D_FILLB_FULL    0xf1        N    N    N    1            L2D Fill buffer is full-ActiveApprox

L2D_OPS_ISSUED and L2D_FILLB_FULL are not .all capable.
4.8.4.9 L2 Data Cache Events (set 8)

Table 4-29. Performance Monitors for L2 Data Cache Set 8

Symbol Name            Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_FILL_MESI_STATE    0xf2        Y    Y    Y    1            Fill to L2D is of a particular MESI value. TrueThrd
L2D_VICTIMB_FULL       0xf3        N    N    N    1            L2D victim buffer is full-ActiveApprox

4.8.4.10 L2 Data Cache Events (Not Set Restricted)

These events are sent to the PMU block directly and thus are not set restricted.

Table 4-30. Performance Monitors for L2D Cache - Not Set Restricted

Symbol Name          Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_MISSES           0xcb        Y    Y    Y    1            An L2D miss has been issued to the L3; does not include secondary misses
L2D_INSERT_MISSES    0xb0        Y    Y    Y    4            An inserting Ozq op was a miss on its first lookup
L2D_INSERT_HITS      0xb1        Y    Y    Y    4            An inserting Ozq op was a hit on its first lookup

4.8.5 L3 Cache Events

Table 4-31 summarizes the directly-measured L3 cache events. An extensive list of derived events
is provided in Table 4-32.
Table 4-31. Performance Monitors for L3 Unified Cache Events

Symbol Name          Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L3_LINES_REPLACED    0xdf        N    N    N    1            L3 Cache Lines Replaced
L3_MISSES            0xdc        Y    Y    Y    1            L3 Misses
L3_READS             0xdd        Y    Y    Y    1            L3 Reads
L3_REFERENCES        0xdb        Y    Y    Y    1            L3 References
L3_WRITES            0xde        Y    Y    Y    1            L3 Writes
4.9 System Events

The debug register match events count how often the address of any instruction or data breakpoint
register (IBR or DBR) matches the current retired instruction pointer
(CODE_DEBUG_REGISTER_MATCHES) or the current data memory address
(DATA_DEBUG_REGISTER_MATCHES). CPU_CPL_CHANGES counts the number of
privilege level transitions due to interruptions, system calls (epc), returns (demoting branches), and
instructions.
Table 4-33. Performance Monitors for System Events

Symbol Name                    Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
CPU_CPL_CHANGES                0x13        N    N    N    1            Privilege Level Changes
DATA_DEBUG_REGISTER_FAULT      0x52        N    N    N    1            Fault due to data debug register match to load/store instruction
DATA_DEBUG_REGISTER_MATCHES    0xc6        Y    Y    Y    1            Data debug register matches data address of memory reference
SERIALIZATION_EVENTS           0x53        N    N    N    1            Number of srlz.i instructions
CYCLES_HALTED                  0x18        N    N    N    1            Number of core cycles the thread is in low-power halted state. NOTE: only the PMC/PMD10 pair is capable of counting this event.

Table 4-34. Derived Monitors for System Events

Symbol Name    Description    Equation
4.10 TLB Events

The Montecito processor instruction and data TLBs and the virtual hash page table walker are
monitored by the events described in Table 4-35.

L1ITLB_REFERENCES and L1DTLB_REFERENCES are derived from the respective
instruction/data cache access events. Note that ITLB_REFERENCES does not include prefetch
requests made to the L1I cache (L1I_PREFETCH_READS); this is because prefetches are
cancelled when they miss in the ITLB and thus do not trigger VHPT walks or software TLB miss
handling. ITLB_MISSES_FETCH and L2DTLB_MISSES count TLB misses.
L1ITLB_INSERTS_HPW and DTLB_INSERTS_HPW count the number of instruction/data TLB
inserts performed by the virtual hash page table walker.
Table 4-35. Performance Monitors for TLB Events

Symbol Name            Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
DTLB_INSERTS_HPW       0xc9        Y    Y    Y    4            Hardware Page Walker inserts to DTLB
HPW_DATA_REFERENCES    0x2d        Y    Y    Y    4            Data memory references to VHPT
L2DTLB_MISSES          0xc1        Y    Y    Y    4            L2DTLB Misses
L1ITLB_INSERTS_HPW     0x48        Y    N    N    1            L1ITLB Hardware Page Walker Inserts
ITLB_MISSES_FETCH      0x47        Y    N    N    1            ITLB Misses Demand Fetch
L1DTLB_TRANSFER        0xc0        Y    Y    Y    1            L1DTLB misses that hit in the L2DTLB for accesses counted in L1D_READS
Table 4-36. Derived Monitors for TLB Events

Symbol Name          Description                                                                                                Equation
L1DTLB_EAR_EVENTS    Counts the number of L1DTLB events captured by the EAR                                                     DATA_EAR_EVENTS
L1DTLB_MISS_RATIO    Miss Ratio of the L1DTLB servicing the L1D                                                                 L1DTLB_TRANSFER / L1D_READS_SET0 or L1DTLB_TRANSFER / L1D_READS_SET1
L1DTLB_REFERENCES    L1DTLB References                                                                                          DATA_REFERENCES_SET0 or DATA_REFERENCES_SET1
L1ITLB_EAR_EVENTS    Provides information on the number of L1ITLB events captured by the EAR. This is a subset of L1I_EAR_EVENTS    L1I_EAR_EVENTS
L1ITLB_MISS_RATIO    L1ITLB miss ratio                                                                                          ITLB_MISSES_FETCH.L1ITLB / L1I_READS
L2DTLB_MISS_RATIO    L2DTLB miss ratio                                                                                          L2DTLB_MISSES / DATA_REFERENCES_SET0 or L2DTLB_MISSES / DATA_REFERENCES_SET1
The Montecito processor has two data TLBs, called the L1DTLB and the L2DTLB (also referred to
as the DTLB). These TLBs are accessed in parallel, and the L2DTLB is the larger and slower of the
two. The possible actions for the combinations of hits and misses in these TLBs are outlined below:
• L1DTLB_hit=0, L2DTLB_hit=0: If enabled, the HPW kicks in and inserts a translation into
one or both TLBs.
• L1DTLB_hit=0, L2DTLB_hit=1: If floating-point, no action is taken; else a transfer is made
from the L2DTLB to the L1DTLB.
• L1DTLB_hit=1, L2DTLB_hit=0: If enabled, the HPW kicks in and inserts a translation into
one or both TLBs.
• L1DTLB_hit=1, L2DTLB_hit=1: No action is taken.
When a memory operation goes down the memory pipeline, DATA_REFERENCES counts it. If
the translation does not exist in the L2DTLB, L2DTLB_MISSES counts it, and if the HPW is
enabled, HPW_DATA_REFERENCES counts it. If the HPW finds the translation in the VHPT, it
inserts it into the L1DTLB and L2DTLB (as needed). If the translation exists in the L2DTLB, the
only case in which some work is done is when the translation does not exist in the L1DTLB: if the
operation is serviced by the L1D (see the L1D_READS description), L1DTLB_TRANSFER
counts it. For the purpose of calculating the TLB miss ratios, VHPT memory references have been
excluded from the DATA_REFERENCES event, and VHPT_REFERENCES is provided for
situations where one might want to add them back in. This flow is sketched in code form below.
Due to the TLB hardware design, there are some corner cases where some of these events will
show activity even though the instruction causing the activity never reaches retirement (they are
marked so). Since the processor is stalled even in these corner cases, they are included in the
counts; as long as all events used for calculating a metric are consistent with respect to this issue,
fairly accurate numbers can be expected.
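The following sketch restates that flow as code. It is a behavioral illustration only; the boolean
parameters are assumptions standing in for the hardware's internal state:

    /* Increment the counters that apply to one data memory reference. */
    void count_dtlb_events(int l1dtlb_hit, int l2dtlb_hit,
                           int hpw_enabled, int serviced_by_l1d,
                           unsigned long *data_references,
                           unsigned long *l2dtlb_misses,
                           unsigned long *hpw_data_references,
                           unsigned long *l1dtlb_transfer)
    {
        (*data_references)++;                /* DATA_REFERENCES: every op     */
        if (!l2dtlb_hit) {
            (*l2dtlb_misses)++;              /* L2DTLB_MISSES                 */
            if (hpw_enabled)
                (*hpw_data_references)++;    /* HPW_DATA_REFERENCES           */
        } else if (!l1dtlb_hit && serviced_by_l1d) {
            (*l1dtlb_transfer)++;            /* L1DTLB_TRANSFER: L2DTLB-to-   */
        }                                    /* L1DTLB transfer               */
    }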
4.11 System Bus Events

Table 4-37 lists the system bus transaction monitors. Many of the listed bus events take a umask
that qualifies the event by initiator. For all bus events, when "per cycles" is mentioned, SI clock
cycles (bus clock multiplied by the bus ratio) are implied rather than bus clock cycles, unless
otherwise specified. Numerous derived events are included in Table 4-38.
Table 4-37. Performance Monitors for System Bus Events

Symbol Name               Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
BUS_ALL                   0x87        N    N    N    1            Bus Transactions
ER_BRQ_LIVE_REQ_HI        0xb8        N    N    N    2            BRQ Live Requests (two most-significant bits of the 5-bit outstanding BRQ request count)
ER_BRQ_LIVE_REQ_LO        0xb9        N    N    N    7            BRQ Live Requests (three least-significant bits of the 5-bit outstanding BRQ request count)
ER_BRQ_REQ_INSERTED       0xba        N    N    N    1            BRQ Requests Inserted
ER_BKSNP_ME_ACCEPTED      0xbb        N    N    N    1            BacksnoopMe Requests accepted into the BRQ from the L2D (used by the L2D to get itself out of potential forward progress situations)
ER_REJECT_ALL_L1_REQ      0xbc        N    N    N    1            Number of cycles in which the BRQ was rejecting all L1I/L1D requests (for the "Big Hammer" forward progress logic)
ER_REJECT_ALL_L1D_REQ     0xbd        N    N    N    1            Number of cycles in which the BRQ was rejecting all L1D requests (for L1D/L1I forward progress)
ER_REJECT_ALL_L1I_REQ     0xbe        N    N    N    1            Number of cycles in which the BRQ was rejecting all L1I requests (for L1D/L1I forward progress)
BUS_DATA_CYCLE            0x88        N    N    N    1            Valid data cycle on the Bus
BUS_HITM                  0x84        N    N    N    1            Bus Hit Modified Line Transactions
BUS_IO                    0x90        N    N    N    1            IA-32 Compatible IO Bus Transactions
SI_IOQ_LIVE_REQ_HI        0x98        N    N    N    1            In-order Bus Queue Requests (one most-significant bit of the 4-bit outstanding IOQ request count)
SI_IOQ_LIVE_REQ_LO        0x97        N    N    N    7            In-order Bus Queue Requests (three least-significant bits of the 4-bit outstanding IOQ request count)
BUS_B2B_DATA_CYCLES       0x93        N    N    N    1            Back-to-back bursts of data
SI_CYCLES                 0x8e        N    N    N    1            Counts SI cycles
BUS_MEMORY                0x8a        N    N    N    1            Bus Memory Transactions
BUS_MEM_READ              0x8b        N    N    N    1            Full cache line D/I memory RD, RD invalidate, and BRIL
ER_MEM_READ_OUT_HI        0xb4        N    N    N    2            Outstanding memory RD transactions (upper two bits)
ER_MEM_READ_OUT_LO        0xb5        N    N    N    7            Outstanding memory RD transactions (lower three bits)
BUS_RD_DATA               0x8c        N    N    N    1            Bus Read Data Transactions
BUS_RD_HIT                0x80        N    N    N    1            Bus Read Hit Clean Non-local Cache Transactions
BUS_RD_HITM               0x81        N    N    N    1            Bus Read Hit Modified Non-local Cache Transactions
BUS_RD_INVAL_BST_HITM     0x83        N    N    N    1            Bus BRIL Burst Transaction Results in HITM
BUS_RD_INVAL_HITM         0x82        N    N    N    1            Bus BIL Transaction Results in HITM
BUS_RD_IO                 0x91        N    N    N    1            IA-32 Compatible IO Read Transactions
BUS_RD_PRTL               0x8d        N    N    N    1            Bus Read Partial Transactions
ER_SNOOPQ_REQ_HI          0xb6        N    N    N    1            ER Snoop Queue Requests (most-significant bit of the 4-bit count)
ER_SNOOPQ_REQ_LO          0xb7        N    N    N    7            ER Snoop Queue Requests (three least-significant bits of the 4-bit count)
BUS_SNOOP_STALL_CYCLES    0x8f        N    N    N    1            Bus Snoop Stall Cycles (from any agent)
BUS_WR_WB                 0x92        N    N    N    1            Bus Write Back Transactions
MEM_READ_CURRENT          0x89        N    N    N    1            Current Mem Read Transactions On Bus
SI_RQ_INSERTS             0x9e        N    N    N    2            SI request queue inserts
SI_RQ_LIVE_REQ_HI         0xa0        N    N    N    1            SI request queue live requests (most-significant bit)
SI_RQ_LIVE_REQ_LO         0x9f        N    N    N    7            SI request queue live requests (least-significant three bits)
SI_WRITEQ_INSERTS         0xa1        N    N    N    2            SI write queue inserts
SI_WRITEQ_LIVE_REQ_HI     0xa3        N    N    N    1            SI write queue live requests (most-significant bit)
SI_WRITEQ_LIVE_REQ_LO     0xa2        N    N    N    7            SI write queue live requests (least-significant three bits)
SI_WAQ_COLLISIONS         0xa4        N    N    N    1            SI write address queue collisions (incoming FSB snoop collides with an entry in the WAQ)
SI_CCQ_INSERTS            0xa5        N    N    N    2            SI clean castout queue inserts
SI_CCQ_LIVE_REQ_HI        0xa7        N    N    N    1            SI clean castout queue live requests (most-significant bit)
SI_CCQ_LIVE_REQ_LO        0xa6        N    N    N    7            SI clean castout queue live requests (least-significant three bits)
SI_CCQ_COLLISIONS         0xa8        N    N    N    1            SI clean castout queue collisions (incoming FSB snoop collides with an entry in the CCQ)
SI_IOQ_COLLISIONS         0xaa        N    N    N    1            SI inorder queue collisions (outgoing transaction collides with an entry in the IOQ)
SI_SCB_INSERTS            0xab        N    N    N    1            SI snoop coalescing buffer inserts
SI_SCB_LIVE_REQ_HI        0xad        N    N    N    1            SI snoop coalescing buffer live requests (most-significant bit)
SI_SCB_LIVE_REQ_LO        0xac        N    N    N    7            SI snoop coalescing buffer live requests (least-significant three bits)
SI_SCB_SIGNOFFS           0xae        N    N    N    1            SI snoop coalescing buffer coherency signoffs
SI_WDQ_ECC_ERRORS         0xaf        N    N    N    1            SI write data queue ECC errors
Table 4-38. Derived Monitors for System Bus Events

Symbol Name             Description                               Equation
BIL_HITM_LINE_RATIO     BIL Hit to Modified Line Ratio            BUS_RD_INVAL_HITM / BUS_RD_INVAL
BIL_RATIO               BIL Ratio                                 BUS_RD_INVAL / BUS_MEMORY
BRIL_HITM_LINE_RATIO    BRIL Hit to Modified Line Ratio           BUS_RD_INVAL_BST_HITM / BUS_RD_INVAL_BST
BUS_ADDR_BPRI           Bus transactions used by IO
BUS_BRQ_LIVE_REQ        BRQ Live Requests                         ER_BRQ_LIVE_REQ_HI * 8 + ER_BRQ_LIVE_REQ_LO
BUS_BURST               Full cache line memory transactions
BUS_HITM_RATIO          Bus Modified Line Hit Ratio               BUS_RD_HITM / BUS_RD_ALL or BUS_RD_HITM / BUS_MEMORY
BUS_HITS_RATIO          Bus Read Hit to Shared Line Ratio
BUS_IOQ_LIVE_REQ        Inorder Bus Queue Requests                SI_IOQ_LIVE_REQ_HI * 8 + SI_IOQ_LIVE_REQ_LO

Additional derived monitors cover Bus Read Invalid Line in Burst transactions (BRIL) satisfied by
memory (BUS_RD_INVAL_BST - BUS_RD_INVAL_BST_HITM), Bus Read Invalid Line
transactions (BIL) satisfied from memory (BUS_RD_INVAL - BUS_RD_INVAL_HITM), and
read invalidate transactions (BRIL and BIL) resulting in HITMs (BUS_RD_INVAL_BST_HITM
+ BUS_RD_INVAL_HITM).

4.11.1 System Bus Conventions
Table 4-39 defines the conventions that will be used when describing the Montecito processor
system bus transaction monitors in this section, as well as the individual monitor descriptions in
Section 4.15.
Other transactions besides those listed in Table 4-42 include Deferred Reply, Special Transactions,
Interrupt, Interrupt Acknowledge, and Purge TC. Note that the monitors will count if any
transaction gets a retry response from the priority agent.
To support the analysis of snoop traffic in a multiprocessor system, the Montecito processor
provides local processor and remote response monitors. The local processor snoop events
(SI_SCB_INSERTS and SI_SCB_SIGNOFFS) monitor inbound snoop traffic. The remote
response events (BUS_RD_HIT, BUS_RD_HITM, BUS_RD_INVAL_HITM and
BUS_RD_INVAL_BST_HITM) monitor the snoop responses of other processors to bus
transactions that the monitoring processor originated. Table 4-40 summarizes the remote snoop
events by bus transaction.
4.11.2 Extracting Memory Latency from Montecito Performance Counters
On the Itanium 2 processors, several events were provided to approximate memory latency as seen
by the processor using the following equation:
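In LaTeX form, the approximation has the following general shape; the weighting of the HI term
by 8 follows from the bit-field split described below, while the exact denominator event and
umask are an assumption here:

    \mathrm{AvgReadLatency} \approx
        \frac{8 \times \mathrm{BUS\_MEM\_READ\_OUT\_HI} + \mathrm{BUS\_MEM\_READ\_OUT\_LO}}
             {\mathrm{BUS\_MEM\_READ}}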
The BUS_MEM_READ_OUT event starts counting one bus clock after a request is issued on the
system interface (ADS) and stops incrementing when the request completes its first data transfer or
is retried. In each core cycle after counting is initiated, the number of live requests in that cycle is
added to the count. This count may be as high as 15. For ease of implementation, the count is split
into two parts: BUS_MEM_READ_OUT_LO sums the low-order three bits of the number of live
requests, while BUS_MEM_READ_OUT_HI sums the high-order bit.
In the above formula, the numerator provides the number of live requests and the denominator
provides the number of requests that are counted. When the live count is divided by the number of
transactions issued, you get an average lifetime of a transaction issued on the system interface (a
novel application of Little’s Law).
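In code form, the calculation is simply the following (a sketch; the raw values are assumed to have
been read from the PMDs counting these events):

    /* Average lifetime, in cycles, of a read transaction on the system
       interface (Little's Law: live-request cycles / requests issued). */
    double avg_read_latency(unsigned long out_hi,  /* BUS_MEM_READ_OUT_HI  */
                            unsigned long out_lo,  /* BUS_MEM_READ_OUT_LO  */
                            unsigned long reads)   /* read transactions issued */
    {
        /* HI accumulates the upper bits of the live-request count, so each
           HI increment represents 8 live requests; LO accumulates the low
           three bits directly. */
        unsigned long live_request_cycles = 8UL * out_hi + out_lo;
        return (double)live_request_cycles / (double)reads;
    }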
The Montecito processor has similar counters: ER_MEM_READ_OUT_{HI,LO}. Using these
events to derive Montecito memory latency will give results that are higher than the true memory
latency seen on Montecito. The main reason is that the start and stop points of the counters are not
equivalent between the two processors. Specifically, on Montecito, the
ER_MEM_READ_OUT_{HI,LO} events start counting one core clock after a request is sent to the
arbiter, and stop counting when the request receives its first data transfer within the external request
logic (after the arbiter). Thus, these events include the entire time requests spend in the arbiter (pre-
and post-request).

Requests may remain in the arbiter for a long or short time depending on system interface behavior.
The arbiter queue events SI_RQ_LIVE_REQ_{HI,LO} may be used to reduce the effects of arbiter
latency on the calculation. Unfortunately, these events are not sufficient to enable a measurement
completely equivalent to that on Itanium 2 processors. The arbiter return time from the FSB to the
core is fixed for a specific arbiter-to-system-interface ratio, and these arbiter events may occur in a
different time domain from core events.
The new memory latency approximation formula for Montecito, with corrective events included, is
below:
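One plausible form of the corrected approximation, assuming the arbiter residency measured by
SI_RQ_LIVE_REQ_{HI,LO} is scaled to core cycles by the core-to-SI clock ratio r and
subtracted from the external-request live count (the precise correction terms are an assumption):

    \mathrm{AvgReadLatency} \approx
        \frac{(8 \times \mathrm{ER\_MEM\_READ\_OUT\_HI} + \mathrm{ER\_MEM\_READ\_OUT\_LO})
              - r \times (8 \times \mathrm{SI\_RQ\_LIVE\_REQ\_HI} + \mathrm{SI\_RQ\_LIVE\_REQ\_LO})}
             {\mathrm{BUS\_MEM\_READ}}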
Note that the Data EAR may be used to compare data cache load miss latency between Madison
and Montecito. However, an access's memory latency, as measured by the Data EAR or other cycle
counters, will be inherently greater on Montecito than on previous Itanium 2 processors due to the
latency the arbiter adds to both the outbound request and the inbound data transfer. Also, the Data
EAR encompasses the entire latency through the processor's memory hierarchy and queues,
without detail on the time spent in any specific queue.

Even with this improved formula, the estimated memory latency for Montecito will appear greater
than on previous Itanium 2 processors. We have not observed any design point suggesting that the
system interface component of memory accesses is excessive on Montecito.
We have observed that snoop stalls and write queue pressure lead to additional memory latency on
Montecito compared to previous Itanium 2 processors, but these phenomena impact the pre- or
post-system-interface portion of memory latency and are very workload dependent in their impact.
Specifically, the write queues need to be sufficiently full to exert back pressure on the victimizing
read requests, such that a new read request cannot issue to the system interface because it cannot
identify a victim in the L3 cache to ensure its proper allocation. This severe pressure has only been
seen with steady streams in which every read request results in a dirty L3 victim. Additional snoop
stalls should only add latency to transactions that receive a HITM snoop response (cache-to-cache
transfers), because non-HITM responses are satisfied by memory and the memory access should be
initiated as a consequence of the initial transaction rather than its snoop response.
Figure 4-2 shows how the latency is determined using the above calculations on Itanium 2 and
Montecito processors. The red portion of the Montecito diagram shows the latency accounted for
by the correction in the Montecito calculation.

Figure 4-2. Extracting Memory Latency from PMUs

(The figure contrasts two timelines. Itanium 2: load issued to caches, load issued on system
interface, data delivery started, data returned to register; the time calculated with PMU events
spans from system-interface issue to the start of data delivery. Montecito: load issued to caches,
load issued to arbiter, time in arbiter, load issued on system interface, data delivery started, data
delivery seen by external request logic, data returned to register; the time calculated with PMU
events spans from arbiter issue to data delivery as seen by the external request logic.)
4.12 RSE Events

Register Stack Engine events are presented in Table 4-39. The number of current/dirty registers is
split among three monitors since there are 96 physical registers in the Montecito processor.
Table 4-39. Performance Monitors for RSE Events (Sheet 1 of 2)

Symbol Name                Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
RSE_CURRENT_REGS_2_TO_0    0x2b        N    N    N    7            Current RSE registers
RSE_CURRENT_REGS_5_TO_3    0x2a        N    N    N    7            Current RSE registers
RSE_CURRENT_REGS_6         0x26        N    N    N    1            Current RSE registers
RSE_DIRTY_REGS_2_TO_0      0x29        N    N    N    7            Dirty RSE registers
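Because these monitors accumulate bit fields of the per-cycle register count, the accumulated
counts combine linearly. A sketch follows; the field weights are inferred from the monitor names
(bits 6, 5..3, and 2..0 of the 7-bit count):

    /* Total current-RSE-register cycles accumulated over the sampling
       interval; divide by elapsed cycles for the average register count. */
    unsigned long rse_current_reg_cycles(unsigned long regs_6,      /* RSE_CURRENT_REGS_6      */
                                         unsigned long regs_5_to_3, /* RSE_CURRENT_REGS_5_TO_3 */
                                         unsigned long regs_2_to_0) /* RSE_CURRENT_REGS_2_TO_0 */
    {
        return 64UL * regs_6 + 8UL * regs_5_to_3 + regs_2_to_0;
    }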