Dual-Core Update
to the Intel® Itanium®2 Processor
Reference Manual
For Software Development and Optimization
Revision 0.9
January 2006
Document Number: 308065-001
Notice: This document contains information on products in the design phase of development. The information here is subject to change without
notice. Do not finalize a design with this information.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS
ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Itanium 2 processor may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
The code name “Montecito” presented in this document is only for use by Intel to identify a product, technology, or service in development, that has not
been made commercially available to the public, i.e., announced, launched or shipped. It is not a “commercial” name for products or services and is
not intended to function as a trademark.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-
548-4725, or by visiting Intel's web site at http://www.intel.com.
Intel, Itanium, Pentium, VTune and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
4-9 Derived Monitors for L1 Instruction Cache and Prefetch Events ........................84
4-10 Performance Monitors for L1 Data Cache Events...............................................84
4-11 Performance Monitors for L1D Cache Set 0 .......................................................85
4-12 Performance Monitors for L1D Cache Set 1 .......................................................85
4-13 Performance Monitors for L1D Cache Set 2 .......................................................85
4-14 Performance Monitors for L1D Cache Set 3 .......................................................85
4-15 Performance Monitors for L1D Cache Set 4 .......................................................86
4-16 Performance Monitors for L1D Cache Set 6 .......................................................86
4-19 Performance Monitors for L2 Data Cache Events...............................................87
4-20 Derived Monitors for L2 Data Cache Events.......................................................88
4-21 Performance Monitors for L2 Data Cache Set 0 .................................................89
4-22 Performance Monitors for L2 Data Cache Set 1 .................................................89
4-23 Performance Monitors for L2 Data Cache Set 2 .................................................89
4-24 Performance Monitors for L2 Data Cache Set 3 .................................................89
4-25 Performance Monitors for L2 Data Cache Set 4 .................................................90
4-26 Performance Monitors for L2 Data Cache Set 5 .................................................90
4-27 Performance Monitors for L2 Data Cache Set 6 .................................................90
4-28 Performance Monitors for L2 Data Cache Set 7 .................................................90
4-29 Performance Monitors for L2 Data Cache Set 8 .................................................91
4-30 Performance Monitors for L2D Cache - Not Set Restricted ................................91
4-31 Performance Monitors for L3 Unified Cache Events ...........................................91
4-32 Derived Monitors for L3 Unified Cache Events ...................................................92
4-33 Performance Monitors for System Events...........................................................93
4-34 Derived Monitors for System Events...................................................................93
4-35 Performance Monitors for TLB Events ................................................................93
4-36 Derived Monitors for TLB Events ........................................................................94
4-37 Performance Monitors for System Bus Events....................................................95
4-38 Derived Monitors for System Bus Events............................................................97
4-39 Performance Monitors for RSE Events .............................................................100
4-40 Derived Monitors for RSE Events......................................................................101
4-41 Performance Monitors for Multi-thread Events..................................................101
4-42 All Performance Monitors Ordered by Code .....................................................102
Revision History

Document Number   Revision Number   Description                        Date
308065-001        0.9               Initial release of the document.   January 2006
1 Introduction
This document is an update to the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization. This update is meant to give guidance on the changes that the dual-core Intel® Itanium® 2 processor, code named Montecito, brings to the existing Itanium 2 processor family.
1.1 Terminology
The following definitions are for terms that will be used throughout this document:

Dispersal: The process of mapping instructions within bundles to functional units.
Bundle rotation: The process of bringing new bundles into the two-bundle issue window.
Split issue: Instruction execution when an instruction does not issue at the same time as the instruction immediately before it.
Advanced load address table (ALAT): The ALAT holds the state necessary for advanced load and check operations.
Translation lookaside buffer (TLB): The TLB holds virtual to physical address mappings.
Virtual hash page table (VHPT): The VHPT is an extension of the TLB hierarchy which resides in the virtual memory space and is designed to enhance virtual address translation performance.
Hardware page walker (HPW): The HPW is the third level of address translation. It is an engine that performs page look-ups from the VHPT and seeks opportunities to insert translations into the processor TLBs.
Register stack engine (RSE): The RSE moves registers between the register stack and the backing store in memory.
Event address registers (EARs): The EARs record the instruction and data addresses of data cache misses.
1.2 Related Documentation
The reader of this document should also be familiar with the material and concepts presented in the following documents:

• Intel® Itanium® Architecture Software Developer's Manual, Volume 1: Application Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 2: System Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 3: Instruction Set Reference

§
2 The Dual-Core Itanium 2 Processor
2.1 Overview
The first dual-core Itanium 2 processor, code named Montecito, is the fourth generation of the Itanium 2 processor. Montecito builds on the strengths of the previous Itanium 2 processors while bringing many key new technologies for performance and management to the Itanium processor family. Key improvements include multiple cores, multiple threads, an improved cache hierarchy, and enhanced speculation with the addition of new instructions.
This document describes key Montecito features and how Montecito differs in its implementation of the Itanium architecture from previous Itanium 2 processors. Some of this information may not be directly applicable to performance tuning, but is certainly needed to better understand and interpret changes in application behavior on Montecito versus other Itanium architecture-based processors. Unless otherwise stated, all of the restrictions, rules, sizes, and capacities described in this document apply specifically to Montecito and may not apply to other Itanium architecture-based processors. This document assumes familiarity with the previous Itanium 2 processors and some of their unique properties and behaviors. Furthermore, only differences as they relate to performance are included here. Information about Montecito features such as error protection, Virtualization Technology, Hyper-Threading Technology, and lockstep support may be obtained in separate documents.
General understanding of processor components and explicit familiarity with Itanium processor instructions are assumed. This document is not intended to be used as an architectural reference for the Itanium architecture. For more information on the Itanium architecture, consult the Intel® Itanium® Architecture Software Developer's Manual.
2.1.1 Identifying the Dual-Core Itanium 2 Processor
There have now been four generations of the Itanium 2 processor, which can be identified by their unique CPUID values. For simplicity of documentation, this document groups all processors of like model together. Table 2-1 details the CPUID values of all of the Itanium processor family generations. Table 2-2 lists all of the available varieties of the Itanium processor family along with their groupings.

Note that the Montecito CPUID family value changes to 0x20.
Table 2-1. Itanium® Processor Family and Model Values

Family   Model   Description
0x07     0x00    Itanium® Processor
0x1f     0x00    Itanium 2 Processor (up to 3 MB L3 cache)
0x1f     0x01    Itanium 2 Processor (up to 6 MB L3 cache)
0x1f     0x02    Itanium 2 Processor (up to 9 MB L3 cache)
0x20     0x00    Dual-Core Itanium 2 Processor (Montecito)
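For illustration, software can distinguish these generations by reading CPUID register 3 and extracting the family and model fields, whose positions (family in bits 31:24, model in bits 23:16) are architecturally defined. The following is a minimal sketch assuming GCC-style inline assembly on an Itanium target; it is not code from this manual.

    #include <stdio.h>
    #include <stdint.h>

    /* Read an Itanium CPUID register through the indirect cpuid[] register
       file. Sketch only: assumes GCC inline assembly on an ia64 target. */
    static inline uint64_t read_cpuid(uint64_t index)
    {
        uint64_t value;
        __asm__ ("mov %0 = cpuid[%1]" : "=r"(value) : "r"(index));
        return value;
    }

    int main(void)
    {
        uint64_t info = read_cpuid(3);          /* version information register */
        unsigned family = (info >> 24) & 0xff;  /* 0x20 on Montecito */
        unsigned model  = (info >> 16) & 0xff;  /* 0x00 on Montecito */

        if (family == 0x20)
            printf("Dual-Core Itanium 2 (Montecito), model 0x%02x\n", model);
        else if (family == 0x1f)
            printf("Itanium 2, model 0x%02x\n", model);
        return 0;
    }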
Table 2-2. Definition Table

Intel® Itanium® 2 Processor 900 MHz with 1.5 MB L3 Cache
Intel® Itanium® 2 Processor 1.0 GHz with 3 MB L3 Cache
Low Voltage Intel® Itanium® 2 Processor 1.0 GHz with 1.5 MB L3 Cache
Montecito takes the latest Itanium 2 processor core, improves the memory hierarchy, and adds an enhanced form of temporal multi-threading. A full introduction to the Itanium 2 processor is available elsewhere, but a brief review is provided below.
The front-end, with two levels of branch prediction, two TLBs, and a 0-cycle branch predictor, feeds two bundles of three instructions each into the instruction buffer every cycle. This 8-entry queue decouples the front-end from the back-end and delivers up to two bundles, of any alignment, to the remaining 6 stages of the pipeline. The dispersal logic determines issue groups and allocates up to 6 instructions to nearly every combination of the 11 available functional units (2 integer, 4 memory, 2 floating point, and 3 branch). The renaming logic maps virtual registers into physical registers. Actual register reads (up to 12 integer and 4 floating point) are performed just before the instructions execute or requests are issued to the cache hierarchy. The full bypass network allows nearly immediate access to previous instruction results while final results are written into the register file (up to 6 integer and 4 floating point).
Montecito preserves application and operating system investments while providing greater
opportunity for code generators to continue their steady performance push without any destructive
disturbance. This is important since even today, three years after the introduction of the first
Itanium 2 processor, compilers are providing significant performance improvements. The block
diagram of the Montecito processor can be found in Figure 2-1.
Montecito provides a second integer shifter and popcounter to help reduce port asymmetries. The front-end provides better branching behavior for single cycle branches and cache allocation/reclamation. Finally, Montecito decreases the time to reach recovery code when speculation fails, thereby providing a lower cost for speculation. All told, nearly every core block and piece of control logic includes some optimization to address small deficiencies.
Exposing additional performance in an already capable cache hierarchy is also challenging; Montecito does so with additional capacity, an improved coherence architecture, and more efficient cache organization and queuing. Montecito supports three levels of on-chip cache. The first level (L1) caches are each 4-way set associative and hold 16 KB of instruction or data. These caches are in-order, like the rest of the pipeline, but are non-blocking, allowing high request concurrency. These L1 caches are accessed in a single cycle using pre-validated tags. The data cache is write-through and dual-ported to support two integer loads and two stores, while the instruction cache has dual-ported tags and a single data port to support simultaneous demand and prefetch accesses.
While previous generations of the Itanium 2 processor share the second level (L2) cache between data and instructions, Montecito provides a dedicated 1 MB L2 cache for instructions. This cache is 8-way set associative with a 128 byte line size and provides the same 7 cycle instruction access latency as the previous, smaller Itanium 2 processor unified cache. A single tag and data port supports out-of-order and pipelined accesses to provide high utilization. The separate instruction and data L2 caches provide more efficient access to the caches compared to Itanium 2 processors, where instruction requests would contend against data accesses for L2 bandwidth and potentially impact core execution as well as L2 throughput.
The previously shared 256 KB L2 cache is now dedicated to data on Montecito, with several micro-architectural improvements to increase throughput. The instruction and data separation effectively increases the data hit rate. The L2D hit latency remains at 5 cycles for integer and 6 cycles for floating-point accesses. The tag is true 4-ported and the data is pseudo 4-ported with 16-byte banks. Montecito removes some of the code generator challenges found in the Itanium 2 processor L2 cache. Specifically, in previous Itanium 2 processors, any access beyond the first to miss the L2 would access the L2 tags periodically until a hit is detected. The repeated tag accesses consume bandwidth from the core and increase the miss latency. On Montecito, such misses are suspended until the L2 fill occurs. The fill awakens and immediately satisfies the request, which greatly reduces bandwidth contention and final latency. The Montecito L2D, like previous generations of the Itanium 2 processor L2, is out-of-order and pipelined with the ability to track up to 32 requests in addition to 16 misses and their associated victims. However, Montecito optimizes allocation of the 32 queue entries, providing a higher concurrency level than previously possible.
The third level (L3) cache remains unified as in previous Itanium 2 processors, but is now 12 MB
in size while maintaining the same 14 cycle integer access latency found on the 6 MB and 9 MB
Itanium 2 processors. The L3 uses an asynchronous interface with the data array to achieve this low
latency; there is no clock, only a read or write valid indication. The read signal is coincident with
index and way values that initiate L3 data array accesses. Four cycles later, the entire 128-byte line
is available and latched. This data is then delivered in 4 cycles to either the L2D or L2I cache in
critical byte order.
The L3 receives requests from both the L2I and L2D but gives priority to the L2I request in the rare
case of a conflict. Moving the arbitration point from the L1-L2 in the Itanium 2 processor to the
L2-L3 cache greatly reduces conflicts thanks to the high hit rates of the L2.
The cache hierarchy is replicated in each core, totaling more than 13.3 MB per core and nearly 27 MB for the entire processor.
Figure 2-1. The Montecito Processor
2.2 New Instructions
Montecito is compliant with the latest revisions of the Itanium architecture in addition to the Intel
Itanium Architecture Virtualization Specification Update. As such, Montecito introduces several
new instructions as summarized below:
Table 2-3. New Instructions Available in Montecito

New Instruction    Comment
fc.i (1)           Insures that instruction caches are coherent with data caches.
ld16 (2)           AR.csd and the register specified are the targets for this load.
st16 (2)           AR.csd and the value in the register specified are written for this store.
cmp8xchg16 (2)     AR.csd and the value in the register specified are written for this exchange if the 8 byte compare is true.
hint@pause (3)     The current thread is yielding resources to the other thread.
vmsw.0, vmsw.1     On promote pages, these instructions allow cooperative operating systems to obtain and give up VMM privilege.

NOTES:
1. This instruction behaves as the fc instruction on Montecito.
2. This instruction will fault if issued to UC, UCE, or WC memory.
3. This instruction will not initiate a thread switch if it is a B type instruction.
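Of these, hint@pause is the one most software will use directly, typically inside a spin-wait loop so the waiting thread yields the core. The following is a minimal sketch assuming GCC-style inline assembly on an Itanium target; the lock variable and loop structure are illustrative, not from this manual.

    #include <stdint.h>

    /* Spin until *lock becomes 0, yielding execution resources to the
       other hardware thread on each iteration via hint@pause. */
    static void spin_wait(volatile uint64_t *lock)
    {
        while (*lock != 0) {
            /* Tells the core this thread has no useful work right now;
               on Montecito this can trigger a switch to the background
               thread. */
            __asm__ volatile ("hint @pause" ::: "memory");
        }
    }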
2.3 Core
The Montecito core is very similar to previous generations of the Itanium 2 processor core from a code generation point of view. The core has new resources; specifically, an additional integer shifter and popcounter. The core also removes the rarely needed MMU to Memory Address bypass path. Finally, the core includes many optimizations, from the front-end to the cache hierarchy, that are transparent to the code generator, so legacy code can see improvements without any code change.
2.3.1 Instruction Slot to Functional Unit Mapping
This information is very similar to previous Itanium 2 processors. Changes between Itanium 2
processors and Montecito will be noted with footnotes.
Each fetched instruction is assigned to a functional unit through an issue port. The numerous
functional units share a smaller number of issue ports. There are 11 functional units: eight for non-branch instructions and three for branch instructions. They are labeled M0, M1, M2, M3, I0, I1, F0,
F1, B0, B1, and B2. The process of mapping instructions within bundles to functional units is
called dispersal.
An instruction's type and position within the issue group determine which functional unit the instruction is assigned to. An instruction is mapped to a subset of the functional units based upon the
instruction type (i.e. ALU, Memory, Integer, etc.). Then, based on the position of the instruction
within the instruction group presented for dispersal, the instruction is mapped to a particular
functional unit within that subset.
Table 2-4, Table 2-5, Table 2-6 and Table 2-7 show the mappings of instruction types to ports and
functional units.
Note: Shading in the following tables indicates the instruction type can be issued on the port(s).
A-type instructions can be issued on all M and I ports (M0-M3 and I0 and I1). I-type instructions
can only issue to I0 or I1. The I ports are asymmetric so some I-type instructions can only issue on
port I0. M ports have many asymmetries: some M-type instructions can issue on all ports; some can
only issue on M0 and M1; some can only issue on M2 and M3; some can only issue on M0; some
can only issue on M2.
[Latency table garbled in extraction. Recoverable rows: a memory/floating-point row with latencies of M+1 to M+3 cycles (see note 4); IEU2 operations (move_from_br, alloc) with latencies of 2 to 3 cycles; move to/from CR or AR with latency C (see note 5); move to pr; and move indirect with latency D (see note 6).]
NOTES:
1. The MMU to memory address bypass in Montecito does not exist. If code does not account for the missing bypass, the processor will detect the case
and cause a pipeflush to ensure proper separation between the producer and the consumer.
2. Since these operations are performed by the L2D, they interact with the L2D pipeline. These are the minimum latencies but they could be much larger
because of this interaction.
3. N depends upon which level of cache is hit: N=1 for L1D, N=5 for L2D, N=14-15 for L3, N=~180-225 for main memory. These are minimum latencies
and are likely to be larger for higher levels of cache.
4. M depends upon which level of cache is hit: M=5 for L2D, M=14-15 for L3, M=~180-225 for main memory. These are minimum latencies and are
likely to be larger for higher levels of cache. The +1 in all table entries denotes one cycle needed for format conversion.
5. Best case values of C range from 2 to 35 cycles depending upon the registers accessed. EC and LC accesses are 2 cycles, FPSR and CR accesses
are 10-12 cycles.
6. Best case values of D range from 6 to 35 cycles depending upon the indirect registers accessed. LREGS, PKR, and RR are on the faster side being
6 cycle accesses.
2.3.3 Caches and Cache Management Changes
Montecito, like the previous Itanium 2 processors, supports three levels of on-chip cache. Each
core contains a complete cache hierarchy, with nearly 13.3 Mbytes per core, for a total of nearly 27
Mbytes of processor cache.
Table 2-9. Montecito Cache Hierarchy Summary

Cache  Data Types Supported                   Write Policy  Data Array Size  Line Size  Ways  Index     Queuing                      Minimum/Typical Latency
L1D    Integer                                WT            16 KB            64 Bytes   4     VA[11:6]  8 Fills                      1/1
L1I    Instruction                            NA            16 KB            64 Bytes   4     VA[11:6]  1 Demand + 7 Prefetch Fills  1/1
L2D    Integer, Floating Point                WB            256 KB           128 Bytes  8     PA[14:7]  32 OzQ/16 Fills              5/11
L2I    Instruction                            NA            1 MByte          128 Bytes  8     PA[16:7]  8                            7/10
L3     Integer, Floating Point, Instruction   WB            12 MByte         128 Bytes  12    PA[19:7]  8                            14/21
2.3.3.1 L1 Caches

The L1I and L1D caches are essentially unchanged from previous generations of the Itanium 2 processor.
2.3.3.2 L2 Caches

The level 2 caches are both similar to and different from the Itanium 2 processor L2 cache. The previous Itanium 2 processor L2 is shared between data and instructions, while Montecito has dedicated instruction (L2I) and data (L2D) caches. This separation of instruction and data caches makes it possible to have dedicated access paths to the caches, and thus eliminates contention and eases capacity pressures on the L2 caches.
The L2I cache holds 1 Mbyte, is eight-way set associative, and has a 128-byte line size, yet has the same seven-cycle instruction-access latency as the smaller previous Itanium 2 processor unified cache. The tag and data arrays are single ported, but the control logic supports out-of-order and pipelined accesses. This large cache greatly reduces the number of instruction accesses seen at the L3 cache. Any coherence request that checks whether a cache line is in the processor will invalidate that line from the L2I cache.
The L2D cache has the same structure and organization as the Itanium 2 processor shared 256 KB L2 cache, but with several microarchitectural improvements to increase throughput. The L2D hit latency remains at five cycles for integer and six cycles for floating-point accesses. The tag array is true four-ported (four fully independent accesses in the same cycle) and the data array is pseudo four-ported with 16-byte banks.
Montecito optimizes several aspects of the L2D. In the Itanium 2 processor, any accesses to the
same cache line beyond the first access that misses L2 will access the L2 tags periodically until the
tags detect a hit. The repeated tag accesses consume bandwidth from the core and increase the L2
miss latency. Montecito suspends such secondary misses until the L2D fill occurs. At that point, the
fill immediately satisfies the suspended request. This approach greatly reduces bandwidth
contention and final latency. The L2D, like the Itanium 2 processor L2, is out of order, pipelined,
and tracks 32 requests (L2D hits or L2D misses not yet passed to the L3 cache) in addition to 16
misses and their associated victims. The difference is that Montecito allocates the 32 queue entries
more efficiently, which provides a higher concurrency level than with the Itanium 2 processor.
Specifically, the queue allocation policy now supports recovery of empty entries. This allows for
greater availability of the L2 OzQ in light of accesses completed out of order.
The L2D also considers the thread identifier when performing ordering such that an ordered
request from one thread is not needlessly ordered against another thread’s accesses.
2.3.3.3 L3 Cache
Montecito's L3 cache remains unified as in previous Itanium 2 processors, but is now 12 MB. Even so, it maintains the same 14-cycle best-case integer-access latency as the 6 MB and 9 MB Itanium 2 processors. Montecito's L3 cache uses an asynchronous interface with the data array to achieve this low latency; there is no clock, only a read or write valid indication. Four cycles after a read signal, index, and way are presented, the entire 128-byte line is available and latched. The array then delivers this data in four cycles to either the L2D or L2I in critical-byte order.
Montecito's L3 receives requests from both the L2I and L2D but gives priority to the L2I request in the rare case of a conflict. Conflicts are rare because Montecito moves the arbitration point from the Itanium 2 processor's L1-L2 boundary to the L2-L3 boundary. This greatly reduces conflicts because of the L2I's and L2D's high hit rates. The new arbitration point also reduces conflict and access pressure within the core; L1I misses go directly to the L2I and not through the core. L2I misses contend against L2D requests for L3 access.
2.3.3.4 Request Tracking
All L2I and L2D requests are allocated to one of 16 request buffers. Requests are sent to the L3 cache and the system from these buffers by the tracking logic. A modified L2D victim or partial write may be allocated to one of 8 write buffers, an increase of 2 over the Itanium 2 processor. The lifetime of the L2D victim buffers is also significantly decreased to further reduce pressure on them. Lastly, the L3 dirty victim resources have grown by 2 entries to 8 in Montecito.

In terms of write coalescing buffers (WCBs), Montecito has four 128-byte line WCBs in each core. These are fully shared between threads.
2.4 Threading
The multiple thread concept starts with the idea that the processor has some resources that cannot be effectively utilized by a single thread. Therefore, sharing under-utilized resources between multiple threads will increase utilization and performance. The Montecito processor Hyper-Threading Technology implementation duplicates and shares resources to create two logical processors. All architectural state and some micro-architectural state is duplicated.

The duplicated architectural state (general, floating point, predicate, branch, application, translation, performance monitoring, bank, and interrupt registers) allows each thread to appear as a complete processor to the operating system, thus minimizing the changes needed at the OS level. The duplicated micro-architectural state of the return stack buffer and the advanced load address table (ALAT) prevents cross-thread pollution that would occur if these resources were shared between the two logical processors.
The two logical processors share the parallel execution resources (core) and the memory hierarchy (caches and TLBs). There are many approaches to sharing resources, varying from fixed time intervals (temporal multi-threading, or TMT) to sharing resources concurrently (simultaneous multi-threading, or SMT). The Montecito Hyper-Threading Technology approach blends both, such that the core is shared between threads using a TMT approach while the memory hierarchy shares resources using an SMT approach. The core TMT approach is further augmented with control hardware that monitors the dynamic behavior of the threads and allocates core resources to the most appropriate thread; an event experienced by the workload may cause a switch before the thread quantum of TMT would cause a switch. This modification of TMT may be termed switch-on-event multi-threading.
2.4.1 Sharing Core Resources
Many processors implementing multi-threading share resources using the SMT paradigm. In SMT, instructions from different threads compete for and share execution resources such that each functional resource is dynamically allocated to an available thread. This approach allocates resources originally meant for instruction level parallelism (ILP), but under-utilized in the single thread case, to exploit thread level parallelism (TLP). This is common in many out-of-order execution designs where increased utilization of functional units can be attained for little cost.

Processor resources may also be shared temporally rather than simultaneously. In TMT, a thread is given exclusive ownership of resources for a small time period. Complexity may be reduced by expanding the time quantum to at least the pipeline depth, thus ensuring that only a single thread owns any execution or pipeline resources at any moment. Using this approach to multi-threading, nearly all structures and control logic can be thread agnostic: the natural behaviors of the pipeline, bypass, and stall control logic for execution are leveraged, while orthogonal logic that controls and completes a thread switch is added. However, this approach also means that a pipeline flush is required at thread switch points.
In the core, one thread has exclusive access to the execution resources (foreground thread) for a
period of time while the other thread is suspended (background thread). Control logic monitors the
workload's behavior and dynamically decreases the time quantum for a thread that is not likely to
make progress. Thus, if the control logic determines that a thread is not making progress, the
pipeline is flushed and the execution resources are given to the background thread. This ensures
better overall utilization of the core resources over strict TMT and effectively hides the cost of long
latency operations such as memory accesses.
A thread switch on Montecito requires 15 cycles from initiation until the background thread retires an instruction. Given the low latency of the memory hierarchy (1 cycle L1D, 5 cycle L2D, and 14 cycle L3), memory accesses that miss the on-chip caches are the only potentially stalling condition that greatly exceeds the thread switch time, and thus they are the primary switch event.
A thread switch also has other side effects such as invalidating the Prefetch Virtual Address Buffer
(PVAB) and canceling any prefetch requests in the prefetch pipeline.
2.4.1.1 The Switch Events
There are several events that can lead to a thread switch. Given that hiding memory latency is the primary motivation for multi-threading, the most common switch event is based on L3 cache misses and data returns. Other events, such as the time-out and forward progress events, provide fairness, while the hint events provide paths for software to influence thread switches. These events have an impact on a thread's urgency, which indicates the thread's ability to effectively use core resources. Each event is described below:
• L3 Cache Miss - An L3 miss by the foreground thread is likely to cause that thread to stall
waiting for the return from the system interface. Hence, L3 misses can trigger thread switches
subject to thread urgency comparisons. This event decreases the thread’s urgency. Since there
is some latency between when a thread makes a request and when it is determined to be an L3
miss, it is possible to have multiple requests from a thread miss the L3 cache before a thread
switch occurs.
• L3 Cache Return - An L3 miss data return for the background thread is likely to resolve data dependences and is an early indication of execution readiness; hence an L3 miss data return can trigger thread switch events subject to thread urgency comparisons. This event increases the thread's urgency.
• Time-out - Thread-quantum counters ensure fairness in access to the pipeline execution
resources for each thread. If the thread-quantum expiration occurs when the thread was not
stalled, its urgency is set to a high value to indicate execution readiness prior to the switch
event.
• Switch Hint - The Itanium architecture provides the hint@pause instruction, which can trigger a thread switch to yield execution to the background thread. This allows software to indicate when the current thread has no need of the core resources.
• Low-power Mode - When the active thread has entered into a quiesced low-power mode, a
thread switch is triggered to the background thread so that it may continue execution.
Similarly, if both threads are in a quiesced low-power state, and the background thread is
awakened, a thread switch is triggered.
The L3 miss and data return event can occur for several types of accesses: data or instruction,
prefetch or demand, cacheable or uncacheable, or hardware page walker (HPW). A data demand
access includes loads, stores, and semaphores.
The switch events are intended to enable the control logic to decide the appropriate time to switch threads without software intervention. Thus, Montecito Hyper-Threading Technology is mostly transparent to the application and the operating system.
2.4.1.2 Software Control of Thread Switching
The hint@pause instruction is used by software to initiate a thread switch. The intent is to allow code to indicate that it does not have any useful work to do and that its execution resources should be given to the other thread. Some later event, such as an interrupt, may change the work for the thread and should awaken the thread.

The hint@pause instruction forces a switch from the foreground thread to the background thread. This instruction can be predicated to conditionally initiate a thread switch. Since the current issue group retires before the switch is initiated, code sequences that differ only in the placement of hint@pause within an issue group are equivalent.
2.4.1.3 Urgency
Each thread has an urgency value ranging from 0 to 7. A value of 0 denotes that a thread has no useful work to perform. A value of 7 signifies that a thread is actively making forward progress. The nominal urgency is 5, indicating that a thread is actively progressing. The urgency of one thread is compared against the other at every L3 event. If the urgency of the currently executing thread is lower than that of the background thread, then the L3 event will initiate a thread switch. Every L3 miss event decrements the urgency by 1, eventually saturating at 0. Similarly, every L3 return event increments the urgency by 1 as long as the urgency is below 5. Figure 2-2 shows a typical urgency based switch scenario. The urgency can be set to 7 for a thread that is switched out due to a time-out event. An external interrupt directed at the background thread will set the urgency for that thread to 6, which increases the probability of a thread switch and provides a reasonable response time for interrupt servicing.
Figure 2-2. Urgency and Thread Switching
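The rules above can be summarized with a small behavioral model. The sketch below is purely illustrative; the names and structure are ours, not a hardware description.

    #include <stdbool.h>

    /* Illustrative model of Montecito's per-thread urgency counter. */
    typedef struct { int urgency; } thread_state;   /* 0..7, nominal 5 */

    /* L3 miss by a thread: urgency decrements, saturating at 0. */
    void on_l3_miss(thread_state *t)   { if (t->urgency > 0) t->urgency--; }

    /* L3 data return for a thread: urgency increments while below 5. */
    void on_l3_return(thread_state *t) { if (t->urgency < 5) t->urgency++; }

    /* Time-out while not stalled: urgency set high before the switch. */
    void on_timeout(thread_state *t)   { t->urgency = 7; }

    /* External interrupt for the background thread raises urgency to 6. */
    void on_interrupt(thread_state *t) { t->urgency = 6; }

    /* At every L3 event: switch if the foreground thread is less urgent. */
    bool should_switch(const thread_state *fg, const thread_state *bg)
    {
        return fg->urgency < bg->urgency;
    }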
2.4.2 Tailoring Thread Switch Behavior
Montecito allows the behavior of the thread switch control logic to be tailored to meet specific software requirements. Specifically, thread switch control may emphasize overall performance, thread fairness, or elevate the priority of one thread over the other. These different behaviors are available through a low latency PAL call, PAL_SET_HW_POLICY. This allows software to exert some level of control over how the processor determines the best time to switch. Details on this call and its parameters can be found in the latest Intel® Itanium® Architecture Software Developer's Manual and Intel® Itanium® Architecture Software Developer's Manual Specification Update.
2.4.3 Sharing Cache and Memory Resources
The Montecito memory resources that are concurrently or simultaneously shared between the two threads include the first and second level TLBs, the first, second, and third level caches, and the system interface resources. Each of these structures is impacted in different ways as a result of this sharing.
2.4.3.1 Hyper-Threading Technology and the TLBs
The ptc.e instruction in previous Itanium 2 processors would invalidate the entire Translation Cache (TC) section of the TLB with one instruction. This same behavior is retained for Montecito with the caveat that a ptc.e issued on one thread will invalidate the TC of the other thread at the same time.

The L2I and L2D TLBs on the Itanium 2 processor supported 64 Translation Registers (TR). Montecito supports 32 TRs for each logical processor.
2.4.3.1.1 Instruction TLBs
The replacement algorithms for the L1I and L2I TLB do not consider thread for replacement vector
updating. However, the L2I TLB will reserve one TLB entry for each thread to meet the
architectural requirements for TCs available to a logical processor.
The TLBs support SMT-based sharing by assigning a thread identifier to the virtual address. Thus,
two threads cannot share the same TLB entry at the same time even if the virtual address is the
same between the two threads.
Since the L1I TLB is key in providing pseudo-virtual access to the L1I cache using prevalidation, when an L1I TLB entry is invalidated, the L1I cache entries associated with that page (up to 4 KB) are invalidated. However, the invalidation of a page (and hence its cache contents) can be suppressed when two threads access the same virtual and physical addresses. This allows the two threads to share much of the L1I TLB and cache contents. For example, T0 inserts an L1I TLB entry with VA=0 and PA=0x1001000. T0 then accesses VAs 0x000 to 0xFFF, which are allocated to the L1I cache. A thread switch occurs. Now, T1 initiates an access with VA=0. It will miss in the L1I TLB because the entry with VA=0 belongs to T0. T1 will insert an L1I TLB entry with VA=0 and PA=0x1001000. The T1 L1I TLB entry replaces the T0 L1I TLB entry without causing an invalidation. Thus, the accesses performed by T0 become available to T1, with the exception of the initial T1 access that inserted the L1I TLB page. Since the L1I cache contents can be shared between two threads and the L1I cache includes branch prediction information, this optimization allows one thread to impact the branch information contained in the L1I cache and hence the branch predictions generated for each thread.
2.4.3.1.2 Data TLBs
The replacement algorithms for the L1D and L2D TLBs do not consider threads for replacement vector updating. However, the L2D TLB reserves 16 TLB entries for each thread to meet the architectural requirements for TCs available to a logical processor.
The TLBs support SMT based sharing by assigning a thread identifier to the virtual address. Thus,
two threads cannot share the same TLB entry at the same time even if the virtual address is the
same between the two threads.
Despite the fact that both the instruction and data L1 TLBs support prevalidation, the L1I TLB
optimization regarding cache contents is not supported in the L1D TLB.
24Reference Manual for Software Development and Optimization
The Dual-Core Itanium 2 Processor
2.4.3.2 Hyper-Threading Technology and the Caches
The L2I, L2D, and L3 caches are all physically addressed. Thus, the threads can fully share the cache contents (i.e. a line allocated by T0 can be accessed by both T0 and T1). The queueing resources for these cache levels are equally available to each thread. The replacement logic also ignores threads, such that T0 can cause an eviction of T1-allocated data, and a hit will cause a cache line to be considered recently used regardless of the thread that allocated or accessed the line.

A thread identifier is provided with each instruction or data cache request to ensure proper ordering of requests between threads at the L2D cache, in addition to supporting performance monitoring and switch event calculations at all levels. The thread identifier allows ordered and unordered transactions from T0 to pass ordered transactions from T1.
2.4.3.3 Hyper-Threading Technology and the System Interface
The system interface logic also ignores the thread identifier in allocating queue entries and in prioritizing system interface requests. The system interface logic tracks L3 misses and fills and, as such, uses the thread identifier to correctly signal to the core which thread missed or filled the cache for L3 miss/return events. The thread identifier is also used in performance monitor event collection and counting.

The thread identifier can be made visible on the system interface as part of the agent identifier through a PAL call. This is for informational purposes only, as the bit would appear in a reserved portion of the agent identifier, and Montecito does not require the memory controller to ensure forward progress and fairness based on the thread identifier; the L2D cache ensures forward progress between threads.
2.5 Dual Cores
Montecito is the first dual-core Itanium 2 processor. The two cores attach to the system interface through the arbiter, which provides a low-latency path for each core to initiate and respond to system events.

Figure 2-3 is a block diagram of the arbiter, which organizes and optimizes each core's requests to the system interface, ensures fairness and forward progress, and collects responses from each core to provide a unified response. The arbiter maintains each core's unique identity to the system interface and operates at a fixed ratio to the system interface frequency. The cores are responsible for thread ordering and fairness, so a thread identifier to uniquely identify transactions on the system interface is not necessary. However, the processor can be configured to provide the thread identifier for informational purposes only.
Figure 2-3. The Arbiter and Queues
As the figure shows, the arbiter consists of a set of address queues, data queues, and synchronizers,
as well as logic for core and system interface arbitration. Error-Correction Code (ECC) encoders/
decoders and parity generators exist but are not shown.
The core initiates one of three types of accesses, which the arbiter allocates to the following queues
and buffers:
• Request queue. This is the primary address queue that supports most request types. Each core
has four request queues.
• Write address queue. This queue holds addresses only and handles explicit writebacks and
partial line writes. Each core has two write address queues.
• Clean castout queue. This queue holds the address for the clean castout (directory and snoop
filter update) transactions. The arbiter holds pending transactions until it issues them on the
system interface. Each core has four clean castout queues.
• Write data buffer. This buffer holds outbound data and has a one-to-one correspondence with
addresses in the write address queue. Each core has four write data buffers, with the additional
two buffers holding implicit writeback data.
The number of entries in these buffers is small because the entries are deallocated once the transaction is issued on the system interface. System interface responses to the transaction are sent directly to the core, where the overall tracking of a system interface request occurs.
Note that there are no core-to-core bypasses. Thus, a cache line that is requested by core 0 but exists modified in core 1 will be requested on the system interface; the request snoops core 1, which provides the data and a modified snoop result, all of which is seen on the system interface.
The snoop queue issues snoop requests to the cores and coalesces the snoop responses from each core into a unified snoop response for the socket. If any core is delayed in delivering its snoop response, the arbiter will delay the snoop response on the system interface.
The arbiter delivers all data returns directly to the appropriate core using a unique identifier
provided with the initial request. It delivers broadcast transactions, such as interrupts and TLB
purges, to both cores in the same way that delivery would occur if each core were connected
directly to the system interface.
2.5.1 Fairness and Arbitration
The arbiter interleaves core requests on a one-to-one basis when both cores have transactions to issue. When only one core has requests, it can issue its requests without the other core having to issue a transaction. Because read latency is the greatest concern, read requests are typically the highest priority, followed by writes, and finally clean castouts. Each core tracks the occupancy of the arbiter's queues using a credit system for flow control. As requests complete, the arbiter informs the appropriate core of the type and number of deallocated queue entries. The cores use this information to determine which, if any, transaction to issue to the arbiter.
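Conceptually, the credit system amounts to a counter per queue type on each core. The sketch below is an illustrative model only, with names of our choosing, not a hardware description.

    #include <stdbool.h>

    /* Illustrative model of the credit-based flow control described above.
       Each core holds one credit per free arbiter queue entry of each type. */
    enum queue_type { Q_REQUEST, Q_WRITE_ADDR, Q_CLEAN_CASTOUT, Q_TYPES };

    typedef struct {
        int credits[Q_TYPES];  /* e.g. 4 request, 2 write address, 4 castout */
    } core_flow_control;

    /* The core may issue to the arbiter only if it holds a credit. */
    bool try_issue(core_flow_control *fc, enum queue_type q)
    {
        if (fc->credits[q] == 0)
            return false;   /* arbiter queue full: hold the transaction */
        fc->credits[q]--;   /* consume a credit on issue */
        return true;
    }

    /* The arbiter returns credits as entries deallocate after the
       transaction is issued on the system interface. */
    void on_deallocate(core_flow_control *fc, enum queue_type q, int count)
    {
        fc->credits[q] += count;
    }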
2.6 Intel® Virtualization Technology
The Montecito processor is the first Itanium 2 processor to implement Intel® Virtualization
Technology. The full specification as well as further information on Intel Virtualization Technology
can be found at:
Section 2.5 in Part 2 of Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual specifies the sequences that must be followed when any instruction code may exist in the data cache. Many violations of these sequences may have gone unnoticed on previous Itanium 2 processors, but such violations are likely to be exposed by the cache hierarchy found in Montecito. Code in violation of the architecture should be modified to adhere to the architectural requirements.
The large L2I and the separation of instructions and data at the L2 level also require additional time to ensure coherence when using the PAL_CACHE_FLUSH procedure with the I/D coherence option. Previously lower cost uses of the PAL_CACHE_FLUSH call should be replaced with the architecturally required code sequence for ensuring instruction and data consistency.
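For reference, the architected sequence for making the instruction stream coherent with data writes is to flush each modified line with fc.i, then execute sync.i followed by srlz.i. The sketch below assumes GCC-style inline assembly on an Itanium target and uses a conservative 32-byte flush stride; it is an outline of the architectural sequence, not code from this manual.

    #include <stdint.h>

    /* Make the instruction cache coherent with the data cache for a range
       that was written as data (e.g. generated code). */
    static void sync_icache(void *start, uint64_t len)
    {
        uintptr_t p   = (uintptr_t)start & ~31UL;
        uintptr_t end = (uintptr_t)start + len;

        for (; p < end; p += 32)
            __asm__ volatile ("fc.i %0" :: "r"(p) : "memory"); /* flush line */

        __asm__ volatile (";; sync.i" ::: "memory");    /* order the flushes */
        __asm__ volatile (";; srlz.i ;;" ::: "memory"); /* serialize i-fetch */
    }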
2.7.2 ld.bias and lfetch.excl
The ld.bias and lfetch.excl instructions have been enhanced on the Montecito processor. These instructions can now bring lines into the cache in a state that is ready to be modified, if supported by the memory controller. This feature allows a single ld.bias or lfetch.excl to prefetch both the source and destination streams. This feature is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES bit 7 of the Montecito feature_set (18).
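As an illustration of prefetching both streams of a copy, the sketch below uses GCC's __builtin_prefetch with the write hint, which an Itanium compiler may emit as lfetch.excl; the stride and prefetch distance are illustrative assumptions, not values from this manual.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy with software prefetch of both streams. The write hint
       (second argument = 1) asks for the destination line in a
       ready-to-modify state. */
    void copy_with_prefetch(uint64_t *dst, const uint64_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n) {
                __builtin_prefetch(src + i + 16, 0);  /* source: read stream */
                __builtin_prefetch(dst + i + 16, 1);  /* destination: write  */
            }
            dst[i] = src[i];
        }
    }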
2.7.3 L2D Victimization Optimization

Montecito also improves on the behaviors associated with internal cache line coherence tracking. The number of false L2D victims is drastically reduced on Montecito relative to previous Itanium 2 processors. This optimization is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES.
2.7.4 Instruction Cache Coherence Optimization
Coherence requests to the L1I and L2I caches will invalidate the line if it is present in the cache. Montecito allows instruction requests on the system interface to be filtered such that they will not initiate coherence requests to the L1I and L2I caches. This allows instructions to be cached at the L1I and L2I levels across multiple processors in a coherent domain. This optimization is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES bit 5 of the Montecito feature_set (18).
2.8 IA-32 Execution
IA-32 execution on the Montecito processor is enabled with the IA-32 Execution Layer (IA-32 EL) and PAL-based IA-32 execution. IA-32 EL is OS-based and is only available after an OS has booted. PAL-based IA-32 execution is available after PAL_COPY_PAL is called and provides IA-32 execution support before the OS has booted. All OSes running on Montecito are required to have IA-32 EL installed. There is no support for PAL-based IA-32 execution in an OS environment.
IA-32 EL is a software layer, currently shipping with Itanium architecture-based operating systems, that converts IA-32 instructions into Itanium processor instructions via dynamic translation. Further details on operating system support and functionality of IA-32 EL can be found at http://www.intel.com/cd/ids/developer/asmo-na/eng/strategy/66007.htm.
2.9 Brand Information
One of the newer additions to the Itanium architecture is the PAL_BRAND_INFO procedure. This procedure, along with PAL_PROC_GET_FEATURES, allows software to obtain processor branding and feature information. Details on these functions can be found in the Intel® Itanium® Architecture Software Developer's Manual.

Below is the table of the implementation-specific return values for PAL_BRAND_INFO. Montecito will implement all three; however, previous implementations of the Itanium 2 processor are unable to retrieve the processor frequency, so requests for these fields will return -6 (information not available). Also, previous Itanium 2 processors cannot return the system bus frequency. Implementation-specific values are expected to start at value 16 and continue until an invalid argument (-2) is returned.

Note: The values returned below are the values at which the processor was validated, which are not necessarily the values at which the processor is currently running.
Value   Description
18      The system bus frequency component (in Hz) of the brand identification string will be returned in the brand_info return argument.
17      The cache size component (in bytes) of the brand identification string will be returned in the brand_info return argument.
16      The frequency component (in Hz) of the brand identification string will be returned in the brand_info return argument.
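Software can therefore enumerate the implementation-specific fields by walking indices upward from 16 until PAL reports an invalid argument. The sketch below assumes a hypothetical pal_brand_info() wrapper, since PAL procedures use a firmware calling convention and are not directly callable from C.

    #include <stdio.h>
    #include <stdint.h>

    #define PAL_STATUS_INVALID_ARG  (-2)  /* invalid argument */
    #define PAL_STATUS_UNAVAILABLE  (-6)  /* information not available */

    /* Hypothetical wrapper around the PAL_BRAND_INFO firmware call; a real
       implementation must use the PAL static-register calling convention. */
    extern long pal_brand_info(uint64_t index, uint64_t *brand_info);

    void enumerate_brand_info(void)
    {
        /* Implementation-specific fields start at 16 and continue until
           PAL reports an invalid argument. */
        for (uint64_t index = 16; ; index++) {
            uint64_t value;
            long status = pal_brand_info(index, &value);

            if (status == PAL_STATUS_INVALID_ARG)
                break;        /* no more fields */
            if (status == PAL_STATUS_UNAVAILABLE)
                continue;     /* field not supported on this processor */
            printf("brand info field %llu = %llu\n",
                   (unsigned long long)index, (unsigned long long)value);
        }
    }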
There are other processor features that may not be included in the brand name above. To obtain information on whether a given technology or feature has been implemented, the PAL_PROC_GET_FEATURES procedure should be used. Montecito features are reported in the Montecito processor feature_set (18).
Table 2-11. Montecito Processor Feature Set Return Values

Value   Definition
18      Hyper-Threading Technology (HT) - This processor supports Hyper-Threading Technology.
17      Low Voltage (LV) - This processor is a low power SKU.
16      Dual-Processor (DP) - This processor is restricted to two processor (DP) systems.
§
3 Performance Monitoring
3.1 Introduction to Performance Monitoring
This chapter defines the performance monitoring features of the Montecito processor. The
Montecito processor provides 12 48-bit performance counters per thread, 200+ monitorable events,
and several advanced monitoring capabilities. This chapter outlines the targeted performance
monitor usage models and defines the software interface and programming model.
The Itanium architecture incorporates architected mechanisms that allow software to actively and directly manage performance critical processor resources such as branch prediction structures, processor data and instruction caches, virtual memory translation structures, and more. To achieve the highest performance levels, dynamic processor behavior needs to be monitored and fed back into the code generation process to better encode observed run-time behavior or to expose higher levels of instruction level parallelism. These measurements will be critical for understanding the behavior of compiler optimizations, the use of architectural features such as speculation and predication, or the effectiveness of microarchitectural structures such as the ALAT, the caches, and the TLBs. They will provide the data to drive application tuning and future processor, compiler, and operating system designs.
The remainder of this chapter is divided into the following sections:

• Section 3.2 discusses how performance monitors are used, and presents the Montecito processor performance monitor programming models.
• Section 3.3 defines the Montecito processor specific performance monitoring features, structures, and registers.

Chapter 4 provides an overview of the Montecito processor events that can be monitored.
3.2 Performance Monitor Programming Models
This section introduces the Montecito processor performance monitoring features from a
programming model point of view and describes how the different event monitoring mechanisms
can be used effectively. The Montecito processor performance monitor architecture focuses on the
following two usage models:
• Workload Characterization: the first step in any performance analysis is to understand the
performance characteristics of the workload under study. Section 3.2.1 discusses the
Montecito processor support for workload characterization.
• Profiling: profiling is used by application developers and profile-guided compilers.
Application developers are interested in identifying performance bottlenecks and relating them
back to their code. Their primary objective is to understand which program location caused
performance degradation at the module, function, and basic block level. For optimization of
data placement and the analysis of critical loops, instruction level granularity is desirable.
Profile-guided compilers that use advanced features of the Itanium architecture, such as predication and speculation, benefit from run-time profile information to optimize instruction schedules. The Montecito processor supports instruction level statistical profiling of branch mispredicts and cache misses. Details of the Montecito processor's profiling support are described in Section 3.2.2.
3.2.1 Workload Characterization
The first step in any performance analysis is to understand the performance characteristics of the workload under study. There are two fundamental measures of interest: event rates and the program cycle breakdown.
• Event Rate Monitoring: Event rates of interest include average retired instructions-per-clock
(IPC), data and instruction cache miss rates, or branch mispredict rates measured across the
entire application. Characterization of operating systems or large commercial workloads (e.g.
OLTP analysis) requires a system-level view of performance relevant events such as TLB miss
rates, VHPT walks/second, interrupts/second, or bus utilization rates. Section 3.2.1.1 discusses
event rate monitoring.
• Cycle Accounting: The cycle breakdown of a workload attributes a reason to every cycle
spent by a program. Apart from a program’s inherent execution latency, extra cycles are
usually due to pipeline stalls and flushes. Section 3.2.1.4 discusses cycle accounting.
3.2.1.1 Event Rate Monitoring
Event rate monitoring determines event rates by reading processor event occurrence counters
before and after the workload is run, and then computing the desired rates. For instance, two basic
Montecito processor events that count the number of retired Itanium instructions
(IA64_INST_RETIRED.u) and the number of elapsed clock cycles (CPU_OP_CYCLES) allow a
workload’s instructions per cycle (IPC) to be computed as follows:
• IPC = (IA64_INST_RETIRED.u_t1 - IA64_INST_RETIRED.u_t0) / (CPU_OP_CYCLES_t1 - CPU_OP_CYCLES_t0)
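In software, this is two counter snapshots and a division. The sketch below assumes a hypothetical read_pmd() helper that returns the current value of the counter programmed for the named event; real code would use a kernel interface such as perfmon.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical helper: returns the current value of the performance
       counter programmed to count the named event. */
    extern uint64_t read_pmd(const char *event);

    void measure_ipc(void (*workload)(void))
    {
        uint64_t inst0 = read_pmd("IA64_INST_RETIRED.u");
        uint64_t cyc0  = read_pmd("CPU_OP_CYCLES");

        workload();  /* run the code under study */

        uint64_t inst1 = read_pmd("IA64_INST_RETIRED.u");
        uint64_t cyc1  = read_pmd("CPU_OP_CYCLES");

        /* IPC = retired instructions / elapsed clock cycles */
        printf("IPC = %.3f\n",
               (double)(inst1 - inst0) / (double)(cyc1 - cyc0));
    }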
Time-based sampling is the basis for many performance debugging tools (VTune™ analyzer, gprof, WinNT). As shown in Figure 3-1, time-based sampling can be used to plot the event rates over time, and can provide insights into the different phases that the workload moves through.
Figure 3-1. Time-Based Sampling
On the Montecito processor, many event types, e.g. TLB misses or branch mispredicts, are limited to a rate of one per clock cycle. These are referred to as “single occurrence” events. However, in the Montecito processor, multiple events of the same type may occur in the same clock. We refer to such events as “multi-occurrence” events. An example of a multi-occurrence event on the Montecito processor is data cache read misses (up to two per clock). Multi-occurrence events, such as the number of entries in the memory request queue, can be used to derive the average number and average latency of memory accesses. Section 3.2.1.2 and Section 3.2.1.3 describe the basic Montecito processor mechanisms for monitoring single and multi-occurrence events.
3.2.1.2 Single Occurrence Events and Duration Counts
A single occurrence event can be monitored by any of the Montecito processor performance
counters. For all single occurrence events, a counter is incremented by up to one per clock cycle.
Duration counters that count the number of clock cycles during which a condition persists are
considered “single occurrence” events. Examples of single occurrence events on the Montecito
processor are TLB misses, branch mispredictions, and cycle-based metrics.
3.2.1.3 Multi-Occurrence Events, Thresholding, and Averaging
Events that, due to hardware parallelism, may occur at rates greater than one per clock cycle are
termed “multi-occurrence” events. Examples of such events on the Montecito processor are retired
instructions or the number of live entries in the memory request queue.
Thresholding capabilities are available in the Montecito processor’s multi-occurrence counters and
can be used to plot an event distribution histogram. When a non-zero threshold is specified, the
monitor is incremented by one in every cycle in which the observed event count exceeds that
programmed threshold. This allows questions such as “For how many cycles did the memory
request queue contain more than two entries?” or “During how many cycles did the machine retire
more than three instructions?” to be answered. This capability allows microarchitectural buffer
sizing experiments to be supported by real measurements. By running a benchmark with different
threshold values, a histogram can be drawn up that may help to identify the performance “knee” at
a certain buffer size.
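Such a histogram experiment might be driven by a loop like the following C sketch, with one run of the benchmark per threshold value. The program_counter() and run_and_read() helpers are hypothetical stand-ins for a privileged PMU interface, and the threshold values correspond to the 3-bit threshold field of the generic counter configuration registers described in Section 3.3.2.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helpers standing in for a privileged PMU interface. */
    extern void     program_counter(unsigned pmd, unsigned event_select,
                                    unsigned threshold);
    extern uint64_t run_and_read(unsigned pmd, void (*workload)(void));

    /* For each non-zero threshold t, the counter increments once in every
     * cycle in which more than t events were observed; sweeping t yields
     * the occupancy histogram discussed above. */
    void threshold_sweep(unsigned event_select, void (*workload)(void))
    {
        for (unsigned t = 1; t < 8; t++) {       /* 3-bit threshold field */
            program_counter(4, event_select, t); /* generic counter PMD4 */
            uint64_t cycles = run_and_read(4, workload);
            printf("cycles with more than %u events: %llu\n",
                   t, (unsigned long long)cycles);
        }
    }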
For overlapping concurrent events, such as pending memory operations, the average number of
concurrently outstanding requests and the average number of cycles that requests were pending are
of interest. To calculate the average number or latency of multiple outstanding requests in the
memory queue, we need to know the total number of requests (n_total) and the number of live
requests per cycle (n_live/cycle). By summing up the live requests (n_live/cycle) with a
multi-occurrence counter, n_live is directly measured by hardware. We can now calculate the
average number of requests and the average latency as follows:
• Average outstanding requests/cycle = n_live / t, where t is the number of elapsed clock cycles
• Average latency per request = n_live / n_total
An example of this calculation is given in Table 3-1, in which the average outstanding requests/
cycle = 15/8 = 1.875, and the average latency per request = 15/5 = 3 cycles.
Table 3-1. Average Latency per Request and Requests per Cycle
[Table: a cycle-by-cycle example spanning 8 cycles and 5 requests, in which the summed live-entry counts total 15.]
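In C, the two averages reduce to the following; the argument names mirror the n_live, n_total and t quantities above, and the values 15, 5 and 8 from Table 3-1 reproduce the 1.875 requests/cycle and 3-cycle results.

    #include <stdint.h>

    typedef struct {
        double avg_outstanding; /* n_live / t       */
        double avg_latency;     /* n_live / n_total */
    } queue_stats;

    /* n_live is the multi-occurrence counter sum of live entries per
     * cycle, n_total the number of requests, t the elapsed cycles. */
    queue_stats queue_averages(uint64_t n_live, uint64_t n_total, uint64_t t)
    {
        queue_stats s;
        s.avg_outstanding = (double)n_live / (double)t;       /* 15/8 = 1.875 */
        s.avg_latency     = (double)n_live / (double)n_total; /* 15/5 = 3     */
        return s;
    }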
The Montecito processor provides the following capabilities to support event rate monitoring:
• Clock cycle counter
• Retired instruction counter
• Event occurrence and duration counters
• Multi-occurrence counters with thresholding capability
3.2.1.4 Cycle Accounting
While event rate monitoring counts the number of events, it does not tell us whether the observed
events are contributing to a performance problem. A commonly used strategy is to plot multiple
event rates and correlate them with the measured instructions per cycle (IPC) rate. If a low IPC
occurs concurrently with a peak of cache miss activity, chances are that cache misses are causing a
performance problem. To eliminate such guesswork, the Montecito processor provides a set of
cycle accounting monitors that break down the number of cycles lost to various kinds
of microarchitectural events. As shown in Figure 3-2, this lets us account for every cycle spent by a
program and therefore provides insight into an application’s microarchitectural behavior. Note that
cycle accounting is different from simple stall or flush duration counting. Cycle accounting is
based on the machine’s actual stall and flush conditions, and accounts for overlapped pipeline
delays, while simple stall or flush duration counters do not. Cycle accounting determines a
program’s cycle breakdown by stall and flush reasons, while simple duration counters are useful in
determining cumulative stall or flush latencies.
Figure 3-2. Itanium® Processor Family Cycle Accounting
[Figure: 100% of execution time broken down into inherent program execution latency (30%), data access cycles (25%), branch mispredicts (20%), instruction fetch stalls (15%), and other stalls (10%).]
The Montecito processor cycle accounting monitors account for all major single and multi-cycle
stall and flush conditions. Overlapping stall and flush conditions are prioritized in reverse pipeline
order, i.e. delays that occur later in the pipe and that overlap with earlier stage delays are reported
as being caused later in the pipeline. The six back-end stall and flush reasons are prioritized in the
following order:
1. Exception/Interruption Cycle: cycles spent flushing the pipe due to interrupts and exceptions.
2. Branch Mispredict Cycle: cycles spent flushing the pipe due to branch mispredicts.
3. Data/FPU Access Cycle: memory pipeline full, data TLB stalls, load-use stalls, and access to
floating-point unit.
4. Execution Latency Cycle: scoreboard and other register dependency stalls.
5. RSE Active Cycle: RSE spill/fill stall.
6. Front End Stalls: stalls due to the back-end waiting on the front end.
Additional front-end stall counters are available which detail seven possible reasons for a front-end
stall to occur. However, the back-end and front-end stall events should not be compared since they
are counted in different stages of the pipeline.
For details, refer to Section 4.6.
3.2.2 Profiling
Profiling is used by application developers, profile-guided compilers, optimizing linkers, and
run-time systems. Application developers are interested in identifying performance bottlenecks and
relating them back to their source code. Based on profile feedback developers can make changes to
the high-level algorithms and data structures of the program. Compilers can use profile feedback to
optimize instruction schedules by employing advanced features of the Itanium architecture, such as
predication and speculation.
To support profiling, performance monitor counts have to be associated with program locations.
The following mechanisms are supported directly by the Montecito processor’s performance
monitors:
• Program Counter Sampling
• Miss Event Address Sampling: Montecito processor event address registers (EARs) provide
sub-pipeline length event resolution for performance critical events (instruction and data
caches, branch mispredicts, and instruction and data TLBs).
• Event Qualification: constrains event monitoring to a specific instruction address range, to
certain opcodes or privilege levels.
These profiling features are presented in Section 3.2.2.1, Section 3.2.2.2 and Section 3.2.3.
3.2.2.1 Program Counter Sampling
Application tuning tools like VTune analyzer and gprof use time-based or event-based sampling of
the program counter and other event counters to identify performance critical functions and basic
blocks. As shown in Figure 3-3, the sampled points can be histogrammed by instruction addresses.
For application tuning, statistical sampling techniques have been very successful, because the
programmer can rapidly identify code hot spots in which the program spends a significant fraction
of its time, or where certain event counts are high.
Program counter sampling points performance analysts at code hot spots, but does not indicate
what caused the performance problem. Inspection and manual analysis of the hot-spot region along
with a fair amount of guess work are required to identify the root cause of the performance
problem. On the Montecito processor, the cycle accounting mechanism (described in
Section 3.2.1.4) can be used to directly measure an application’s microarchitectural behavior.
The interval timer facilities of the Itanium architecture (ITC and ITM registers) can be used for
time-based program counter sampling. Event-based program counter sampling is supported by a
dedicated performance monitor overflow interrupt mechanism described in detail in Section 7.2.2
“Performance Monitor Overflow Status Registers (PMC[0]..PMC[3])” in Volume 2 of the Intel® Itanium® Architecture Software Developer’s Manual.
Figure 3-3. Event Histogram by Program Counter
[Figure: histogram of event frequency (e.g. # cache misses, # TLB misses) plotted against the address space.]
To support program counter sampling, the Montecito processor provides the following
mechanisms:
• Timer interrupt for time-based program counter sampling
• Event count overflow interrupt for event-based program counter sampling
• Hardware-supported cycle accounting
3.2.2.2 Miss Event Address Sampling
Program counter sampling and cycle accounting provide an accurate picture of cumulative
microarchitectural behavior, but they do not provide the application developer with pointers to
specific program elements (code locations and data structures) that repeatedly cause
microarchitectural “miss events”. In a cache study of the SPEC92 benchmarks, [Lebeck] used
trace-based cache miss profiling to gain performance improvements of 1.02 to 3.46 on various
benchmarks by making simple changes to the source code. This type of analysis requires
identification of instruction and data addresses related to microarchitectural “miss events” such as
cache misses, branch mispredicts, or TLB misses. Using symbol tables or compiler annotations
these addresses can be mapped back to critical source code elements. Like Lebeck, most
performance analysts in the past have had to capture hardware traces and resort to trace driven
simulation.
Due to the superscalar issue, deep pipelining, and out-of-order instruction completion of today’s
microarchitectures, the sampled program counter value may not be related to the instruction
address that caused a miss event. On a Pentium® processor pipeline, the sampled program counter
may be off by two dynamic instructions from the instruction that caused the miss event. On a
Pentium® Pro processor, this distance increases to approximately 32 dynamic instructions. On the
Montecito processor, it is approximately 48 dynamic instructions. If program counter sampling is
used for miss event address identification on the Montecito processor, a miss event might be
associated with an instruction almost five dynamic basic blocks away from where it actually
occurred (assuming that 10% of all instructions are branches). Therefore, it is essential for
hardware to precisely identify an event’s address.
The Montecito processor provides a set of event address registers (EARs) that record the
instruction and data addresses of data cache misses for loads, the instruction and data addresses of
data TLB misses, and the instruction addresses of instruction TLB and cache misses. A 16-entry-deep
execution trace buffer captures sequences of branch instructions and other instructions and
events which cause changes to execution flow. Table 3-2 summarizes the capabilities offered by
the Montecito processor EARs and the execution trace buffer. Exposing miss event addresses to
software allows them to be monitored either by sampling or by code instrumentation. This
eliminates the need for trace generation to identify and solve performance problems and enables
performance analysis by a much larger audience on unmodified hardware.
Table 3-2. Montecito Processor EARs and Branch Trace Buffer

Event Address Register: Instruction Cache
  Triggers on: Instruction fetches that miss the L1 instruction cache (demand fetches only)
  What is Recorded: Instruction address; number of cycles the fetch was in flight

Event Address Register: Instruction TLB (ITLB)
  Triggers on: Instruction fetches that missed the L1 ITLB (demand fetches only)
  What is Recorded: Instruction address; what serviced the L1 ITLB miss: L2 ITLB, VHPT or software

Event Address Register: Data Cache
  Triggers on: Load instructions that miss the L1 data cache
  What is Recorded: Instruction address; data address; number of cycles the load was in flight

Event Address Register: Data TLB (DTLB)
  Triggers on: Data references that miss the L1 DTLB
  What is Recorded: Instruction address; data address; what serviced the L1 DTLB miss: L2 DTLB, VHPT or software

Execution Trace Buffer
  Triggers on: Branch outcomes; rfi, exceptions, and failed “chk” instructions which cause a change in execution flow
  What is Recorded: Source instruction address of the event; target instruction address of the event; mispredict status and reason for branches
The Montecito processor EARs enable statistical sampling by configuring a performance counter
to count, for instance, the number of data cache misses or retired instructions. The performance
counter value is set up to interrupt the processor after a predetermined number of events have been
observed. The data cache event address register repeatedly captures the instruction and data
addresses of actual data cache load misses. Whenever the counter overflows, miss event address
collection is suspended until the event address register is read by software (this prevents software
from capturing a miss event that might be caused by the monitoring software itself). When the
counter overflows, an interrupt is delivered to software, the observed event addresses are collected,
and a new observation interval can be set up by rewriting the performance counter register. For
time-based (rather than event-based) sampling methods, the event address registers indicate to
software whether or not a qualified event was captured. Statistical sampling can achieve arbitrary
event resolution by varying the number of events within an observation interval and by increasing
the number of observation intervals.
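The overall sampling flow might look like the following C sketch. All helper functions are hypothetical (real code would run in a kernel driver or behind an interface such as perfmon), the preload rule (2^47 - N) is detailed in Section 3.3.2, and the example assumes the instruction EAR data registers PMD34,35 described in Section 3.3.8.

    #include <stdint.h>

    /* Hypothetical privileged helpers. */
    extern void     write_pmd(unsigned index, uint64_t value);
    extern uint64_t read_pmd(unsigned index);
    extern void     clear_freeze(void); /* clear PMC0.fr to resume monitoring */
    extern void     record_sample(uint64_t addr, uint64_t latency);

    /* Arm a generic counter so that it overflows (carry out of bit 46)
     * after n_events more events: preload with 2^47 - N. */
    void arm_sampling_counter(unsigned pmd, uint64_t n_events)
    {
        write_pmd(pmd, (UINT64_C(1) << 47) - n_events);
    }

    /* Sketch of the overflow-interrupt path: collect the captured miss
     * event address, then open a new observation interval. */
    void on_overflow_interrupt(unsigned pmd, uint64_t n_events)
    {
        uint64_t addr    = read_pmd(34); /* captured event address       */
        uint64_t latency = read_pmd(35); /* captured latency information */

        record_sample(addr, latency);

        arm_sampling_counter(pmd, n_events);
        clear_freeze(); /* re-enable event collection */
    }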
3.2.3 Event Qualification
In the Montecito processor, many of the performance monitoring events can be qualified in a
number of ways such that only a subset of the events are counted using performance monitoring
counters. As shown in Figure 3-4 events can be qualified for monitoring based on instruction
address range, instruction opcode, data address range, event-specific “unit mask” (umask), the
privilege level and instruction set the event was caused by, and the status of the performance
monitoring freeze bit (PMC0.fr). The following paragraphs describe these capabilities in detail.
• Itanium Instruction Address Range Check: The Montecito processor allows event monitoring
to be constrained to a programmable instruction address range. This enables monitoring of
dynamically linked libraries (DLLs), functions, or loops of interest in the context of a large
Itanium architecture-based application. The Itanium instruction address range check is applied
at the instruction fetch stage of the pipeline, and the resulting qualification is carried by the
instruction throughout the pipeline. This enables conditional event counting at a level of
granularity smaller than the dynamic instruction length of the pipeline (approximately 48
instructions). The Montecito processor’s instruction address range check operates only during
Itanium architecture-based code execution, i.e. when PSR.is is zero. For details, see Section 3.3.5.
Figure 3-4. Montecito Processor Event Qualification
[Figure: decision flow asking, in order: Is the Itanium instruction pointer in IBR range? Does the Itanium opcode match? Is the Itanium data address in DBR range? Executing at a monitored privilege level? Executing in the monitored instruction set? Does the event-specific “unit mask” (umask) qualify the event? Is event monitoring enabled? If all of the above are true, the event is qualified.]
• Itanium Instruction Opcode Match: The Montecito processor provides two independent
Itanium instruction opcode match ranges, each of which matches the currently issued instruction
encodings with a programmable opcode match and mask function. The resulting match events
can be selected as an event type for counting by the performance counters. This allows
histogramming of instruction types, usage of destination and predicate registers, as well as
basic block profiling (through insertion of tagged NOPs). The opcode matcher operates only
during Itanium architecture-based code execution, i.e. when PSR.is is zero. Details are
described in Section 3.3.6.
• Itanium Data Address Range Check: The Montecito processor allows event collection for
memory operations to be constrained to a programmable data address range. This enables
selective monitoring of data cache miss behavior of specific data structures. For details, see
Section 3.3.7.
• Event Specific Unit Masks: Some events allow the specification of “unit masks” to filter out
interesting events directly at the monitored unit. As an example, the number of counted bus
transactions can be qualified by an event-specific unit mask to be constrained to transactions that
originated from any bus agent, from the processor itself, or from other I/O bus masters. In this
case, the bus unit uses a three-way unit mask (any, self, or I/O) that specifies which
transactions are to be counted. In the Montecito processor, events from the branch, memory
and bus units support a variety of unit masks. For details, refer to the event pages in Chapter 4.
• Privilege Level: Two bits in the processor status register (PSR) are provided to enable selective
process-based event monitoring. The Montecito processor supports conditional event counting
based on the current privilege level; this allows performance monitoring software to break
down event counts into user and operating system contributions. For details on how to
constrain monitoring by privilege level, refer to Section 3.3.1.
• Instruction Set: The Montecito processor supports conditional event counting based on the
currently executing instruction set (Itanium or IA-32) by providing two instruction set mask
bits for each event monitor. This allows performance monitoring software to break down event
counts into Itanium architecture and IA-32 contributions. For details, refer to Section 3.3.1.
• Performance Monitor Freeze: Event counter overflows or software can freeze event
monitoring. When frozen, no event monitoring takes place until software clears the monitoring
freeze bit (PMC0.fr). This ensures that the performance monitoring routines themselves, e.g.
counter overflow interrupt handlers or performance monitoring context switch routines, do not
“pollute” the event counts of the system under observation. For details refer to Section 7.2.4 of
Volume 2 of the Intel® Itanium® Architecture Software Developer’s Manual.
3.2.3.1 Combining Opcode Matching, Instruction, and Data Address Range Check
The Montecito processor allows various event qualification mechanisms to be combined by
providing the instruction tagging mechanism shown in Figure 3-5.
Figure 3-5. Instruction Tagging Mechanism in the Montecito Processor
[Figure: the four IBR pairs IBRP0-IBRP3 (controlled by PMC38) generate address range tags in four tag channels; Opcode Matcher 0 (PMC32,33,36) combines with tag channels 0 and 2, and Opcode Matcher 1 (PMC34,35,36) with tag channels 1 and 3; memory events are further qualified by the data address range checkers (DBRs, PMC41); qualified events then pass the event select (PMCi.es) and the privilege level and instruction set checks (PMC.plm, PMC.ism) before incrementing a counter (PMDi).]
During Itanium instruction execution, the instruction address range check is applied first. This is
applied separately for each IBR pair (IBRP) to generate 4 independent tag bits which flow down
the machine in four tag channels. Tags in the four tag channels are then passed to two opcode
matchers that combine the instruction address range check with the opcode match and generate
another set of four tags. This is done by combining tag channels 0 and 2 with first opcode match
registers and tag channels 1 and 3 with the second opcode match registers as shown in Figure 3-5.
Each of the 4 combined tags in the four tag channels can be counted as a retired instruction count
event (for details refer to event description “IA64_TAGGED_INST_RETIRED”).
The combined Itanium processor address range and opcode match tag in tag channel 0 qualifies all
downstream pipeline events. Events in the memory hierarchy (L1 and L2 data cache and data TLB
events) can be further qualified using a data address range tag (DBRRangeTag).
As summarized in Figure 3-5, data address range checking can be combined with opcode matching
and instruction range checking on the Montecito processor. Additional event qualifications based
on the current privilege level can be applied to all events and are discussed in Section 3.2.3.2.
Table 3-3. Montecito Processor Event Qualification Modes
[Table: for each event qualification mode - unconstrained monitoring, instruction address range check and opcode matching (channel 0), instruction and data address range check, and opcode matching and data address range check - the required settings of PMC32.ig_ad, PMC38.ig_ibrp, PMC36.Ch_ig_OPC, the opcode match registers (desired opcodes or don’t-care), and PMC41.cfgdtag (e.g. [1,11], [1,10], [1,01], [1,00] or [0,xx]).
Notes: 1. For all cases where PMC32.ig_ad is set to 0, PMC32.inv must be set to 0 if address range inversion is not needed. 2. See column 2 for the value of the PMC32.ig_ad bit field.]
3.2.3.2 Privilege Level Constraints
Performance monitoring software cannot always count on context switch support from the
operating system. In general, this has made performance analysis of a single process in a
multi-processing system or of a multi-process workload impossible. To provide hardware support for this
kind of analysis, the Itanium architecture specifies three global bits (PSR.up, PSR.pp, DCR.pp) and
a per-monitor “privilege monitor” bit (PMCi.pm). To break down the performance contributions of
operating system and user-level application components, each monitor specifies a 4-bit privilege
level mask (PMCi.plm). The mask is compared to the current privilege level in the processor status
register (PSR.cpl), and event counting is enabled if PMCi.plm[PSR.cpl] is one. The Montecito
processor performance monitor control is discussed in Section 3.3.1.
PMC registers can be configured as user-level monitors (PMCi.pm is 0) or system-level monitors
(PMCi.pm is 1). A user-level monitor is enabled whenever PSR.up is one. PSR.up can be
controlled by an application using the “sum”/”rum” instructions. This allows applications to
enable/disable performance monitoring for specific code sections. A system-level monitor is
enabled whenever PSR.pp is one. PSR.pp can be controlled at privilege level 0 only, which allows
monitor control without interference from user-level processes. The pp field in the default control
register (DCR.pp) is copied into PSR.pp whenever an interruption is delivered. This allows events
generated during interruptions to be broken down separately: if DCR.pp is 0, events during
interruptions are not counted; if DCR.pp is 1, they are included in the kernel counts.
As shown in Figure 3-6, Figure 3-7, and Figure 3-8, single-process, multi-process, and system-level
performance monitoring are possible by specifying the appropriate combination of PSR and
DCR bits. These bits allow performance monitoring to be controlled entirely from a kernel-level
device driver, without explicit operating system support. Once the desired monitoring
configuration has been set up in a process’ processor status register (PSR), “regular” unmodified
operating system context switch code automatically enables/disables performance monitoring.
With support from the operating system, individual per-process breakdown of event counts can be
generated as outlined in the chapter on performance monitoring in the Intel® Itanium® Architecture
Software Developer’s Manual.
3.2.3.3 Instruction Set Constraints
Instruction set constraints are not fully supported in Montecito and the corresponding PMC register
instruction set mask (PMCi.ism) should be set to Itanium architecture only (‘10) to ensure correct
operation. Any other values for these bits may cause undefined behavior.
Figure 3-6. Single Process Monitor
Figure 3-7. Multiple Process Monitor
Figure 3-8. System Wide Monitor
[Figures: user-level (cpl=3, application), kernel-level (cpl=0, OS) and interrupt-level (cpl=0, handlers) execution for processes A, B and C, showing the PSR.up/PSR.pp, PMC.pm (0 or 1), PMC.plm (1000 or 1001) and DCR.pp settings that select per-process, multi-process, or system-wide monitoring, with DCR.pp determining whether interrupt-level events are included.]
3.2.4 References
• [gprof] S.L. Graham, P.B. Kessler and M.K. McKusick, “gprof: A Call Graph Execution
Profiler”, Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction; SIGPLAN
Notices, Vol. 17, No. 6, pp. 120-126, June 1982.
• [Lebeck] Alvin R. Lebeck and David A. Wood, “Cache Profiling and the SPEC Benchmarks:
A Case Study”, Tech Report 1164, Computer Science Dept., University of Wisconsin-Madison, July 1993.
• [VTune] Mark Atkins and Ramesh Subramaniam, “PC Software Performance Tuning”, IEEE
Computer, Vol. 29, No. 8, pp. 47-54, August 1996.
• [WinNT] Russ Blake, “Optimizing Windows NT(tm)”, Volume 4 of the Microsoft “Windows
NT Resource Kit for Windows NT Version 3.51”, Microsoft Press, 1995.
3.3 Performance Monitor State
The Itanium performance monitoring architecture described in Volume 2 of the Intel® Itanium®
Architecture Software Developer’s Manual defines two sets of performance monitor registers:
Performance Monitor Configuration (PMC) registers to configure the monitoring, and Performance
Monitor Data (PMD) registers to provide data values from the monitors. Additionally, the
architecture allows for architectural as well as model-specific registers. Complying with this
architectural definition, Montecito provides both kinds of PMCs and PMDs. As shown in Figure 3-9,
the Montecito processor provides 12 48-bit performance counters (PMC/PMD4-15 pairs) and a set
of model-specific monitoring registers.
Table 3-4 defines the PMC/PMD register assignments for each monitoring feature. The interrupt
status registers are mapped to PMC0,1,2,3. The 12 generic performance counter pairs are assigned to
PMC/PMD4-15. The Event Address Registers (EARs) and the Execution Trace Buffer (ETB) are
controlled by three configuration registers (PMC37,40,39). Captured event addresses and cache miss
latencies are accessible to software through five event address data registers (PMD32,33,34,35,36)
and a branch trace buffer (PMD48-63). On the Montecito processor, monitoring of some events can
additionally be constrained to a programmable instruction address range by appropriately setting
the instruction breakpoint registers (IBRs) and the instruction address range check register (PMC38)
and turning on the checking mechanism in the opcode match registers (PMC32,33,34,35). Two
opcode match register sets and an opcode match configuration register (PMC36) allow monitoring
of some events to be qualified with a programmable opcode. For memory operations, events can be
qualified by a programmable data address range by appropriate setting of the data breakpoint
registers (DBRs) and the data address range configuration register (PMC41).
Montecito, being a processor capable of running two threads, provides the illusion of having two
processors by providing exactly the same set of performance monitoring features and structures
separately for each thread.
Table 3-4. Montecito Processor Performance Monitor Register Set
[Table: maps each monitoring feature to its registers, spanning the generic register set (Processor Status Register PSR, Default Configuration Register DCR, Performance Monitor Vector Register PMV, and the generic PMC/PMD pairs) and the Montecito processor-specific performance monitoring register set.]
3.3.1 Performance Monitor Control and Accessibility
As in other IPF processors, Montecito event collection is controlled by the Performance Monitor
Configuration (PMC) registers and the processor status register (PSR). Four PSR fields (PSR.up,
PSR.pp, PSR.cpl and PSR.sp) and the performance monitor freeze bit (PMC0.fr) affect the
behavior of all performance monitor registers.
Per-monitor control is provided by three PMC register fields (PMCi.plm, PMCi.ism, and
PMCi.pm). Event collection for a monitor is enabled under the following constraints on the
Montecito processor:
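The architectural enable expression is not reproduced here; the following C sketch restates the conditions informally, assembled from the PSR, PMC0.fr and per-monitor controls described in this chapter and in Section 3.2.3.2. It is a paraphrase for illustration, not the architectural definition.

    #include <stdbool.h>

    /* A monitor counts only if: monitoring is not frozen (PMC0.fr = 0),
     * its privilege level mask admits the current privilege level, and
     * the relevant global enable (PSR.pp for privileged monitors,
     * PSR.up for user monitors) is set. */
    bool monitor_enabled(bool pmc0_fr, unsigned plm, unsigned psr_cpl,
                         bool pm, bool psr_up, bool psr_pp)
    {
        bool privilege_ok = (plm >> psr_cpl) & 1; /* PMCi.plm[PSR.cpl] */
        bool global_ok    = pm ? psr_pp : psr_up; /* system vs. user monitor */
        return !pmc0_fr && privilege_ok && global_ok;
    }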
Figure 3-10 defines the PSR control fields that affect performance monitoring. For a detailed
definition of how the PSR bits affect event monitoring and control accessibility of PMD registers,
please refer to Section 3.3.2 and Section 7.2.1 of Volume 2 of the Intel® Itanium® Architecture Software Developer’s Manual.
Table 3-5 defines per-monitor controls that apply to PMC4-15,32-42. As defined in Table 3-4, each of
these PMC registers controls the behavior of its associated performance monitor data registers
(PMD). The Montecito processor model-specific PMD registers associated with the instruction/data
EARs and the branch trace buffer (PMD32-39,48-63) can be read only when event monitoring is
frozen (PMC0.fr is one).
Figure 3-10. Processor Status Register (PSR) Fields for Performance Monitoring
Table 3-5. Performance Monitor PMC Register Control Fields (PMC4-15)

Field  Bits   Description
plm    3:0    Privilege Level Mask - controls performance monitor operation for a specific privilege level.
              Each bit corresponds to one of the 4 privilege levels, with bit 0 corresponding to privilege
              level 0, bit 1 to privilege level 1, etc. A bit value of 1 indicates that the monitor is enabled at
              that privilege level. Writing zeros to all plm bits effectively disables the monitor. In this state,
              the Montecito processor will not preserve the value of the corresponding PMD register(s).
pm     6      Privileged monitor - When 0, the performance monitor is configured as a user monitor and
              enabled by PSR.up. When PMC.pm is 1, the performance monitor is configured as a
              privileged monitor, enabled by PSR.pp, and the PMD can only be read by privileged software.
              Any read of the PMD by non-privileged software in this case will return 0.
              NOTE: In PMC37 this field is implemented in bit [4].
ism    25:24  Instruction Set Mask - Should be set to ‘10 for proper operation. Undefined behavior with
              other values.
              NOTE: PMC37 and PMC39 do not have this field.
3.3.2 Performance Counter Registers
The PMUs are not shared between hardware threads. Each hardware thread has its own set of 12
generic performance counter (PMC/PMD4-15) pairs.
Due to the complexities of monitoring in an MT “aware” environment, the PMC/PMD pairs are
split according to differences in functionality. These PMC/PMD pairs can be divided into two
categories: duplicated counters (PMC/PMD4-9) and banked counters (PMC/PMD10-15).
• Banked counters (PMC/PMD10-15): The banked counter capabilities are somewhat limited.
These PMDs cannot increment when their thread is in the background. That is, if Thread 0 is
placed in the background, Thread 0’s PMD10 cannot increment until the thread is brought back
to the foreground by hardware. Due to this, the banked counters should not be used to
monitor a thread-specific event (.all set to 0) that could occur while its thread is in the
background (e.g. L3_MISSES).
• Duplicated counters (PMC/PMD4-9): In contrast, duplicated counters can increment when
their thread is in the background. As such, they can be used to monitor thread-specific events
which could occur even when the thread those events belong to is not currently active.
PMC/PMD pairs are not entirely symmetrical in their ability to count events. Please refer to
Section 3.3.3 for more information.
Figure 3-11 and Table 3-6 define the layout of the Montecito processor Performance Counter
Configuration Registers (PMC4-15). The main task of these configuration registers is to select the
events to be monitored by the respective performance monitor data counters. The event selection
(.es), unit mask (.umask), and MESI fields in the PMC registers perform the selection of these
events. The rest of the fields in the PMCs specify under what conditions counting should be done
(.plm, .ism, .all), by how much the counter should be incremented (.threshold), and what needs to
be done if the counter overflows (.oi, .ev).
Table 3-6. Montecito Processor Performance Counter Configuration Register Fields (PMC4-15)

Field      Bits   Description
plm        3:0    Privilege Level Mask. See Table 3-5 “Performance Monitor PMC Register Control Fields
                  (PMC4-15).”
ev         4      External visibility - When 1, an external notification (if the capability is present) is
                  provided whenever the counter overflows. External notification occurs regardless of the
                  setting of the oi bit (see below).
oi         5      Overflow interrupt - When 1, a Performance Monitor Interrupt is raised and the
                  performance monitor freeze bit (PMC0.fr) is set when the monitor overflows. When 0, no
                  interrupt is raised and the performance monitor freeze bit (PMC0.fr) remains unchanged.
                  Counter overflows generate only one interrupt. Setting of the corresponding PMC0 bit on
                  an overflow is independent of this bit.
pm         6      Privilege Monitor. See Table 3-5 “Performance Monitor PMC Register Control Fields
                  (PMC4-15).”
ig         7      Read zero; writes ignored.
es         15:8   Event select - selects the performance event to be monitored. Montecito processor event
                  encodings are defined in Chapter 4, “Performance Monitor Events.”
umask      19:16  Unit Mask - event-specific mask bits (see event definition for details).
threshold  22:20  Threshold - enables thresholding for “multi-occurrence” events. When threshold is zero,
                  the counter sums up all observed event values. When the threshold is non-zero, the
                  counter increments by one in every cycle in which the observed event value exceeds the
                  threshold.
ig         23     Read zero; writes ignored.
ism        25:24  Instruction Set Mask. See Table 3-5 “Performance Monitor PMC Register Control Fields
                  (PMC4-15).”
all        26     All threads; this bit selects whether to monitor just the self thread or both threads. This bit
                  is applicable only to the duplicated counters (PMC4-9). If 1, events from both threads are
                  monitored; if 0, only the self thread is monitored. Filters (IAR/DAR/OPC) are only
                  associated with the thread they belong to. If filtering of an event with .all enabled is
                  desired, both threads’ filters should be given matching configurations.
MESI       30:27  Umask for MESI filtering; only the events with this capability are affected.
                  [27]: I; [28]: S; [29]: E; [30]: M.
                  If the counter is measuring an event implying that a cache line is being replaced, the filter
                  applies to the bits of the existing cache line and not the line being brought in. Also note,
                  for the events affected by MESI filtering, if a user wishes to simply capture all occurrences
                  of the event, the filter must be set to b1111.
ig         63:31  Read zero; writes ignored.

Figure 3-12 and Table 3-7 define the layout of the Montecito processor Performance Counter Data
Registers (PMD4-15). A counter overflow occurs when the counter wraps (i.e. a carry out from bit
46 is detected). Software can force an external interruption or external notification after N events
by preloading the monitor with a count value of 2^47 - N. Note that bit 47 is the overflow bit and
must be initialized to 0 whenever there is a need to initialize the register.

When accessible, software can continuously read the performance counter registers PMD4-15
without disabling event collection. Any read of a PMD by software without the appropriate
privilege level will return 0 (see “plm” in Table 3-6). The processor ensures that software will see
monotonically increasing counter values.

Table 3-7. Montecito Processor Generic Performance Counter Data Register Fields (PMD4-15)

Field   Bits   Description
count   46:0   Event Count. The counter is defined to overflow when the count field wraps (carry out
               from bit 46).
ov      47     Overflow bit (carry out from bit 46).
               NOTE: When writing to a PMD, always write 0 to this bit. Reads return the value of bit 46.
               DO NOT use this field to determine whether the counter has overflowed; use the
               appropriate bit from PMC0 instead.
sxt47   63:48  Writes are ignored; reads return the value of bit 46, so count values appear sign
               extended.
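As an illustration of Table 3-6, a PMC4-15 value can be assembled with shifts and masks as below. This is a sketch based on the field offsets listed above (with ism hard-wired to ‘10 as required), not an officially supplied macro.

    #include <stdint.h>

    /* Assemble a generic PMC value from the Table 3-6 fields. */
    static inline uint64_t pmc_encode(unsigned plm, unsigned ev, unsigned oi,
                                      unsigned pm, unsigned es, unsigned umask,
                                      unsigned threshold, unsigned all,
                                      unsigned mesi)
    {
        return ((uint64_t)(plm       & 0xf)  <<  0) |  /* plm,  bits 3:0   */
               ((uint64_t)(ev        & 0x1)  <<  4) |  /* ev,   bit  4     */
               ((uint64_t)(oi        & 0x1)  <<  5) |  /* oi,   bit  5     */
               ((uint64_t)(pm        & 0x1)  <<  6) |  /* pm,   bit  6     */
               ((uint64_t)(es        & 0xff) <<  8) |  /* es,   bits 15:8  */
               ((uint64_t)(umask     & 0xf)  << 16) |  /* umask,bits 19:16 */
               ((uint64_t)(threshold & 0x7)  << 20) |  /* thr,  bits 22:20 */
               (UINT64_C(2)                  << 24) |  /* ism = '10        */
               ((uint64_t)(all       & 0x1)  << 26) |  /* all,  bit  26    */
               ((uint64_t)(mesi      & 0xf)  << 27);   /* MESI, bits 30:27 */
    }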
3.3.3 Performance Monitor Event Counting Restrictions
Similar to other Itanium brand products, not all performance monitoring events can be monitored
using just any of the generic performance counters (PMD4-15). The following needs to be noted when
determining which counter to use to monitor events. This is just an overview; further details
can be found under the specific event/event type.
• ER/SI/L2D events can only be monitored using PMD4-9 (these are the events with event
select IDs belonging to ‘h8x, ‘h9x, ‘hax, ‘hbx, ‘hex and ‘hfx).
• To monitor any L2D events it is necessary to monitor at least one L2D event in either PMC4 or
PMC6 (see Section 4.8.4 for more information).
• To monitor any L1D events it is necessary to program PMC5/PMD5 to monitor one L1D
event (see Section 4.8.2 for more information).
• In an MT enabled system, if a “floating” event is monitored in a banked counter
(PMC/PMD10-15), the value may be incorrect. To ensure accuracy, these events should be
measured by a duplicated counter (PMC/PMD4-9).
• The CYCLES_HALTED event can only be monitored in PMD10. If measured by any other
PMD, the count value is undefined.
3.3.4 Performance Monitor Overflow Status Registers (PMC0,1,2,3)
As previously mentioned, the Montecito processor supports 12 performance monitoring counters
per thread. The overflow status of these 12 counters is indicated in register PMC0. As shown in
Figure 3-13 and Table 3-8, only PMC0[15:4,0] bits are populated. All other overflow bits are
ignored, i.e. they read as zero and ignore writes.

Figure 3-13. Montecito Processor Performance Monitor Overflow Status Registers (PMC0,1,2,3)

Table 3-8. Montecito Processor Performance Monitor Overflow Status Register Fields (PMC0,1,2,3)

PMC0:
Field     Bits   HW Reset  Description
fr        0      0         Performance Monitor “freeze” bit - When 1, event monitoring is
                           disabled. When 0, event monitoring is enabled. This bit is set by
                           hardware whenever a performance monitor overflow occurs and its
                           corresponding overflow interrupt bit (PMC.oi) is set to one. SW is
                           responsible for clearing it. When the PMC.oi bit is not set, counter
                           overflows do not set this bit.
ig        3:1    -         Read zero; writes ignored.
overflow  15:4   0         Event Counter Overflow - When bit n is one, it indicates that PMDn
                           overflowed. This is a bit vector indicating which performance
                           monitors overflowed. These overflow bits are set on their
                           corresponding counter’s overflow regardless of the state of the
                           PMC.oi bit. Software may also set these bits. These bits are sticky
                           and multiple bits may be set.
ig        63:16  -         Read zero; writes ignored.

PMC1,2,3:
Field     Bits   HW Reset  Description
ig        63:0   -         Read zero; writes ignored.
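Decoding PMC0 in software is a matter of bit tests; the helpers below sketch the Table 3-8 layout (fr in bit 0, per-counter sticky overflow bits in bits 15:4) and are illustrative only.

    #include <stdint.h>
    #include <stdbool.h>

    /* True if event monitoring is frozen (PMC0.fr, bit 0). */
    static inline bool pmc0_frozen(uint64_t pmc0)
    {
        return pmc0 & 1;
    }

    /* True if generic counter PMDn (n = 4..15) has overflowed; the
     * overflow bit for PMDn sits at bit position n of PMC0. */
    static inline bool pmd_overflowed(uint64_t pmc0, unsigned n)
    {
        return (pmc0 >> n) & 1;
    }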
3.3.5 Instruction Address Range Matching
The Montecito processor allows event monitoring to be constrained to a range of instruction
addresses. Once programmed with these constraints, only the events generated by instructions
whose addresses fall within this range are counted using PMD4-15. The four architectural Instruction
Breakpoint Register Pairs IBRP0-3 (IBR0-7) are used to specify the desired address ranges. Using
these IBR pairs it is possible to define up to 4 different address ranges (only 2 address ranges in
“fine mode”) that can be used to qualify event monitoring.

Once programmed, each of these 4 address restrictions can be applied separately to all events that
are identified to do so. The event IA64_INST_RETIRED is the only event that can be constrained
using any of the four address ranges. Events described as prefetch events can only be constrained
using address range 2 (IBRP1). All other events can only use the first address range (IBRP0), and
this range is considered the default for this section.

In addition to constraining events based on instruction addresses, the Montecito processor allows
event qualification based on the opcode of the instruction and on the address of the data that
memory-related instructions access. This is done by applying these constraints to the same 4
instruction address ranges described in this section. These features are explained in Section 3.3.6
and Section 3.3.7.
3.3.5.1 PMC38
Performance Monitoring Configuration register PMC38 is the main control register for the
instruction address range matching feature. In addition to this register, PMC32 also controls certain
aspects of this feature, as explained in the following paragraphs.

Figure 3-14 and Table 3-10 describe the fields of register PMC38. For the proper use of instruction
address range checking described in this section, PMC38 is expected to be programmed to 0xdb6 as
the default value.

Instruction address range checking is controlled by the “ignore address range check” bits
(PMC32.ig_ad and PMC38.ig_ibrp0). When PMC32.ig_ad is one (or PMC38.ig_ibrp0 is one), all
instructions are included (i.e. un-constrained) regardless of IBR settings. In this mode, events from
both IA-32 and Itanium architecture-based code execution contribute to the event count. When
both PMC32.ig_ad and PMC38.ig_ibrp0 are zero, the instruction address range check based on the
IBRP0 settings is applied to all Itanium processor code fetches. In this mode, IA-32 instructions are
never tagged, and, as a result, events generated by IA-32 code execution are ignored. Table 3-9
defines the behavior of the instruction address range checker for different combinations of PSR.is
and PMC32.ig_ad or PMC38.ig_ibrp0.
Table 3-9. Montecito Processor Instruction Address Range Check by Instruction Set

PMC32.ig_ad OR PMC38.ig_ibrp0   PSR.is = 0 (IA-64)                               PSR.is = 1 (IA-32)
0                               Tag only Itanium instructions if they match      DO NOT tag any IA-32 operations.
                                the IBR range.
1                               Tag all Itanium and IA-32 instructions. Ignore the IBR range.
The processor compares every Itanium instruction fetch address IP{63:0} against the address range
programmed into the architectural instruction breakpoint register pair IBRP0. Regardless of the
value of the instruction breakpoint fault enable (the IBR x-bit), a qualified match (IBRmatch) is
evaluated for the Montecito processor’s IBRP0.
The events which occur before the instruction dispersal stage will fire only if this qualified match
(IBRmatch) is true. This qualified match is ANDed with the result of Opcode Matcher PMC32,33
and further qualified with more user-definable bits (see Table 3-10) before being distributed to
different places. The events which occur after the instruction dispersal stage will use this new
qualified match (IBRP0-OpCode0 match).
Figure 3-14. Instruction Address Range Configuration Register (PMC38)
[Register layout: bits 63:14 reserved; bit 13 fine; bits 12:11 reserved; bit 10 ibrp3; bits 9:8 reserved; bit 7 ibrp2; bits 6:5 reserved; bit 4 ibrp1; bits 3:2 reserved; bit 1 ibrp0; bit 0 reserved.]
Table 3-10. Instruction Address Range Configuration Register Fields (PMC38)

Field     Bits  Description
ig_ibrp0  1     1: No constraint
                0: Address range 0 based on IBRP0 enabled
ig_ibrp1  4     1: No constraint
                0: Address range 1 based on IBRP1 enabled
ig_ibrp2  7     1: No constraint
                0: Address range 2 based on IBRP2 enabled
ig_ibrp3  10    1: No constraint
                0: Address range 3 based on IBRP3 enabled
fine      13    Enable fine-mode address range checking (non-power-of-2)
                1: IBRP0,2 and IBRP1,3 are paired to define two address ranges
                0: Normal mode
                If set to 1, IBRP0 and IBRP2 define the lower and upper limits for
                address range 0; similarly, IBRP1 and IBRP3 define the lower and
                upper limits for address range 1.
                Bits [63:16] of the upper and lower limits must be exactly the same but
                can have any value. Bits [15:0] of the upper limit must be greater than
                bits [15:0] of the lower limit. If an address falls between the upper and
                lower limits, a match is signaled only in address ranges 0 or 1. Event
                qualification based on address ranges 2 and 3 is not defined.
                NOTE: The mask bits programmed in IBRs 1,3,5,7 for bits [15:0] have no
                effect in this mode. When using fine mode address range 0, it is
                necessary to program PMC38.ig_ibrp0,ig_ibrp2 to 0. Similarly, when
                using address range 1, it is necessary to set PMC38.ig_ibrp1,ig_ibrp3 to 0.
The IBRP0 match is generated from the programmed IBR address and mask values. Note that
unless fine mode is used, arbitrary range checking cannot be performed since the mask bits are in
powers of 2. In fine mode, two IBR pairs are used to specify the upper and lower limits of a range
within a page (the upper bits of the lower and upper limits must be exactly the same).
The instruction range checking considers the address range specified by IBRPi only if
PMC32.ig_ad (for i=0), PMC38.ig_ibrpi and the IBRPi x-bits are all 0s. If the IBRPi x-bit is set,
that particular IBRP is used for debug purposes as described in the IA64 architecture.
3.3.5.2 Use of IBRP0 For Instruction Address Range Check - Exception 1
The address range constraint for prefetch events is on the target address of these events rather than
the address of the prefetch instruction. Therefore IBRP1 must be used for constraining these events.
Calculation of the IBRP1 match is the same as that of the IBRP0 match with the exception that we
use IBR2,3,6 instead of IBR0,1,4.

Note: Register PMC38 must contain the predetermined value 0x0db6. If software modifies any bits not
listed in Table 3-10, processor behavior is not defined. It is illegal to have PMC41[48:45]=0000 and
PMC32.ig_ad=0 and ((PMC38[2:1]=10 or 00) or (PMC38[5:4]=10 or 00)); this produces
inconsistencies in tagging I-side events in L1D and L2.
3.3.5.3 Use of IBRP0 For Instruction Address Range Check - Exception 2
The address range constraint for the IA64_TAGGED_INST_RETIRED event uses all four IBR
pairs. Calculation of the IBRP2 match is the same as that of the IBRP0 match with the exception
that IBR4,5 (in non-fine mode) are used instead of IBR0,1. Calculation of the IBRP3 match is the
same as that of the IBRP1 match with the exception that we use IBR6,7 (in non-fine mode) instead
of IBR2,3.

The instruction range check tag is computed early in the processor pipeline and therefore includes
speculative, wrong-path, as well as predicated-off instructions. Furthermore, range check tags are
not accurate in the instruction fetch and out-of-order parts of the pipeline (cache and bus units).
Therefore, software must accept a level of range check inaccuracy for events generated by these
units, especially for non-looping code sequences that are shorter than the Montecito processor
pipeline. As described in Section 3.2.3.1, the instruction range check result may be combined with
the results of the IA-64 opcode match registers described in Section 3.3.6.
3.3.5.4 Fine Mode Address Range Check
In addition to the coarse address range checking described above, the Montecito processor can be
programmed to perform address range checks in fine mode. Montecito provides the use of two
address ranges for fine mode. The first range is defined using IBRP0 and IBRP2 while the second
is defined using IBRP1 and IBRP3. When properly programmed to use address range 0, all
performance monitoring events that have been indicated as able to qualify with IBRP0 will now
qualify with this new address range (defined collectively by IBRP0 and IBRP2). Similarly, when
using address range 1, all events that could be qualified with IBRP1 now get qualified with this
new address range.

A user can configure the Montecito PMU to use fine mode address range 0 by following these
steps (it is assumed that PMCs 32,33,34,35,36,38,41 all start with default settings); a sketch of
the sequence appears after this list:
• Program IBRP0 and IBRP2 to define the instruction address range, following the
programming restrictions mentioned in Table 3-10.
• Program PMC32[ig_ad,inv] = ‘00 to turn off the default tags injected into tag channel 0.
• Program PMC38[ig_ibrp0,ig_ibrp2] = ‘00 to turn on address tagging based on IBRP0 and
IBRP2.
• Program PMC38.fine = 1.
Similarly, a user can configure the Montecito PMU to use fine mode address range 1 by following
the same steps, but this time with IBRP1 and IBRP3. The only exception is that PMC32[ig_ad,inv]
need not be programmed.
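A minimal C sketch of the four steps for address range 0 follows. The write_pmc()/write_ibr()/read_pmc() accessors are hypothetical privileged helpers, and the bit positions come from Table 3-11 (PMC32.inv bit 56, PMC32.ig_ad bit 57) and Table 3-10/Figure 3-14 (PMC38.ig_ibrp0 bit 1, ig_ibrp2 bit 7, fine bit 13).

    #include <stdint.h>

    /* Hypothetical privileged accessors for PMC and IBR registers. */
    extern uint64_t read_pmc(unsigned index);
    extern void     write_pmc(unsigned index, uint64_t value);
    extern void     write_ibr(unsigned index, uint64_t value);

    void setup_fine_mode_range0(uint64_t lower, uint64_t upper)
    {
        /* Step 1: IBRP0 (IBR0) holds the lower limit and IBRP2 (IBR4)
         * the upper limit; bits 63:16 of both must be identical
         * (Table 3-10). The x-bits stay 0 for PMU use. */
        write_ibr(0, lower);
        write_ibr(4, upper);

        /* Step 2: PMC32[ig_ad,inv] = '00 (bits 57 and 56). */
        write_pmc(32, read_pmc(32) &
                      ~((UINT64_C(1) << 57) | (UINT64_C(1) << 56)));

        /* Steps 3 and 4: starting from the required default 0xdb6, clear
         * ig_ibrp0 (bit 1) and ig_ibrp2 (bit 7), then set fine (bit 13). */
        uint64_t pmc38 = 0xdb6;
        pmc38 &= ~((UINT64_C(1) << 1) | (UINT64_C(1) << 7));
        pmc38 |= (UINT64_C(1) << 13);
        write_pmc(38, pmc38);
    }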
3.3.6 Opcode Match Check (PMC32,33,34,35,36)
As shown in Figure 3-5, in the Montecito processor, event monitoring can be constrained based on
the Itanium processor encoding (opcode) of an instruction. Registers PMC32,33,34,35,36 allow
configuring this feature. In Montecito, registers PMC32,33 and PMC34,35 define 2 opcode matchers
(Opcode Matcher 0 (OpCM0) and Opcode Matcher 1 (OpCM1)). Register PMC36 controls how to
apply opcode range checking to the four instruction address ranges defined by using IBRPs.

3.3.6.1 PMC32,33,34,35
Figure 3-15, Figure 3-16 and Table 3-11, Table 3-12 describe the fields of the PMC32,33,34,35
registers. Figure 3-17 and Table 3-13 describe the register PMC36.
All combinations of bits [51:48] in PMC32,34 are supported. To match an A-slot instruction, it is
necessary to set bits [51:50] to 11. To match all instruction types, bits [51:48] should be set to 1111.
To ensure that all events are counted independent of the opcode matcher, all mifb and all mask bits
of PMC32,34 should be set to one (all opcodes match) while keeping the inv bit cleared.

Once the opcode matcher constraints are generated, they are ANDed with the address range
constraints available on the 4 IBRP channels to form 4 combined address range and opcode match
ranges as described here. The constraints defined by OpCM0 are ANDed with the address
constraints defined by IBRP0 and IBRP2 to form combined constraints for channels 0 and 2.
Similarly, the constraints defined by OpCM1 are ANDed with the address constraints defined by
IBRP1 and IBRP3 to form combined constraints for channels 1 and 3.
Figure 3-15. Opcode Match Registers (PMC32,34)
[Register layout: bits 63:58 ig; bit 57 ig_ad; bit 56 inv; bits 55:52 ig; bit 51 m; bit 50 i; bit 49 f; bit 48 b; bits 47:41 ig; bits 40:0 mask.]

Table 3-11. Opcode Match Registers (PMC32,34)

Field  Bits   Width  HW Reset  Description
mask   40:0   41     all 1s    Bits that mask Itanium® instruction encoding bits. Any of the 41
                               syllable bits can be selectively masked. If a mask bit is set to 1,
                               the corresponding opcode bit is not used for opcode matching.
ig     47:41  7      n/a       Reads zero; writes ignored.
b      48     1      1         If 1: match if opcode is a B-slot.
f      49     1      1         If 1: match if opcode is an F-slot.
i      50     1      1         If 1: match if opcode is an I-slot.
m      51     1      1         If 1: match if opcode is an M-slot.
ig     55:52  4      n/a       Reads zero; writes ignored.
inv    56     1      1         Invert Range Check for tag channel 0. If set to 1, the address
                               range specified by IBRP0 is inverted. Effective only when the
                               ig_ad bit is set to 0.
                               NOTE: This bit is ignored in PMC34.
ig_ad  57     1      1         Ignore Instruction Address Range Checking for tag channel 0.
                               If set to 1, all instruction addresses are considered for events.
                               If 0, IBRs 0-1 will be used for address constraints.
                               NOTE: This bit is ignored in PMC34.
ig     63:58  6      n/a       Reads zero; writes ignored.
Table 3-12. Opcode Match Registers (PMC33,35)

Field  Bits   Width  HW Reset  Description
match  40:0   41     all 1s    Opcode bits against which the Itanium® instruction encoding is to
                               be matched. Each opcode bit has a corresponding bit position here.
ig     63:41  23     n/a       Ignored bits.
3.3.6.2 PMC36
Performance Monitoring Configuration register PMC36 controls whether or not to apply opcode
matching in event qualification. As mentioned earlier, opcode matching is applied to the same four
instruction address ranges defined by using IBRPs.
Figure 3-17. Opcode Match Configuration Register (PMC36)
[Register layout: bits 63:32 ig; bits 31:4 rsv; bit 3 Ch3_ig_OPC; bit 2 Ch2_ig_OPC; bit 1 Ch1_ig_OPC; bit 0 Ch0_ig_OPC.]

Table 3-13. Opcode Match Configuration Register Fields (PMC36)

Field       Bits   HW Reset   Description
Ch0_ig_OPC  0      1          1: Tag channel 0 PMU events will not be constrained by opcode.
                              0: Tag channel 0 PMU events (including
                              IA64_TAGGED_INST_RETIRED.00) will be opcode constrained by
                              OpCM0.
Ch1_ig_OPC  1      1          1: Tag channel 1 events (IA64_TAGGED_INST_RETIRED.01) won’t be
                              constrained by opcode.
                              0: Tag channel 1 events will be opcode constrained by OpCM1.
Ch2_ig_OPC  2      1          1: Tag channel 2 events (IA64_TAGGED_INST_RETIRED.10) won’t be
                              constrained by opcode.
                              0: Tag channel 2 events will be opcode constrained by OpCM0.
Ch3_ig_OPC  3      1          1: Tag channel 3 events (IA64_TAGGED_INST_RETIRED.11) won’t be
                              constrained by opcode.
                              0: Tag channel 3 events will be opcode constrained by OpCM1.
rsv         31:4   0xfffffff  Reserved. Users should not change this field from the reset value.
ig          63:32  n/a        Ignored bits.
For opcode matching purposes, an Itanium instruction is defined by two items: the instruction type
“itype” (one of M, I, F or B) and the 41-bit encoding “enco{40:0}” defined in the Intel® Itanium®
Architecture Software Developer’s Manual. Each instruction is evaluated against each opcode
match register.
The IBRP matches are advanced with the instruction pointer to the point where opcodes are being
dispersed. The matches from the opcode matchers are ANDed with the IBRP matches at this point.
This produces two opcode match events that are combined with the instruction range check tag
(IBRRangeTag, see Section 3.3.5).
As shown in Figure 3-5, the 4 tags, Tag(IBRChnli; i=0-3), are staged down the processor pipeline
until instruction retirement and can be selected as a retired instruction count event (see event
description “IA64_TAGGED_INST_RETIRED”). In this way, a performance counter
(PMC/PMD4-15) can be used to count the number of retired instructions within the programmed
range that match the specified opcodes.

Note: Register PMC36 must contain the predetermined value of 0xfffffff0. If software modifies any bits
not listed in Table 3-13, processor behavior is not defined. This is the reset value for PMC36.
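The mask/match semantics described above can be summarized in C: a PMC32/PMC34 mask bit of 1 makes the corresponding encoding bit a don't-care, and the remaining bits must equal the PMC33/PMC35 match value. This is a sketch of the comparison semantics, not of the hardware implementation.

    #include <stdint.h>
    #include <stdbool.h>

    /* enco is the 41-bit instruction encoding; match and mask are the
     * PMC33/35 and PMC32/34 mask fields, respectively. */
    static inline bool opcode_matches(uint64_t enco, uint64_t match,
                                      uint64_t mask)
    {
        const uint64_t bits41 = (UINT64_C(1) << 41) - 1;
        /* bits where mask = 1 are ignored; the rest must equal match */
        return ((enco ^ match) & ~mask & bits41) == 0;
    }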
3.3.7 Data Address Range Matching (PMC41)
For instructions that reference memory, the Montecito processor allows event counting to be
constrained by data address ranges. The 4 architectural Data Breakpoint Registers (DBRs) can be
used to specify the desired address range. Data address range checking capability is controlled by
the Memory Pipeline Event Constraints Register (PMC41).
Figure 3-18 and Table 3-14 describe the fields of register PMC41. When enabled ([1,x0] in the bits
corresponding to one of the 4 DBRs to be used), data address range checking is applied to loads,
stores, semaphore operations, and the lfetch instruction.

Table 3-14. Memory Pipeline Event Constraints Register Fields (PMC41)

Field     Bits   Description
cfgdtag0  4:3    These bits determine whether and how DBRP0 should be used for
                 constraining memory pipeline events (where applicable):
                 00: IBR/Opc/DBR - use IBRP0/OpCM0 and DBRP0 for constraints (i.e.
                 events are counted only if their instruction address, opcode and data
                 address match the values programmed into these registers)
                 01: IBR/Opc - use IBRP0/OpCM0 for constraints
                 10: DBR - only use DBRP0 for constraints
                 11: No constraints
                 NOTE: When used in conjunction with “fine” mode (see the PMC38
                 description), only the lower-bound DBR pair (DBRP0 or DBRP1)
                 config needs to be set. The upper-bound DBR pair config should be
                 left at no constraint. So if IBRP0,2 are chosen for “fine” mode,
                 cfgdtag0 needs to be set according to the desired constraints but
                 cfgdtag2 should be left as 11 (no constraints).
cfgdtag1  12:11  These bits determine whether and how DBRP1 should be used for
                 constraining memory pipeline events (where applicable); bit for bit,
                 these match those defined for DBRP0.
cfgdtag2  20:19  These bits determine whether and how DBRP2 should be used for
                 constraining memory pipeline events (where applicable); bit for bit,
                 these match those defined for DBRP0.
A DBRPx match is generated in the following fashion: the data address is compared to DBRPx, and the address match is further qualified by a number of user-configurable bits in PMC41 before being distributed to different places. Arbitrary range checking is not possible since the mask bits are in powers of 2. Although it is possible to enable more than one DBRP at a time for checking, it is not recommended. The resulting four matches are combined to form a single DBR match. Events which occur after a memory instruction gets to the EXE stage will fire only if this qualified match (DBRPx match) is true. DBR matching for performance monitoring ignores the setting of the DBR r, w, and plm fields.
In order to allow simultaneous use of some DBRs for Performance Monitoring and the others for
debugging (the architected purpose of these registers), separate mechanisms are provided for
enabling DBRs. DBR bits x and the r/w-bit should be cleared to 0 for the DBRP which is going to
be used for the PMU. PSR.db has no effect when DBRs are used for this purpose.
Note: Register PMC41 must contain the predetermined value 0x2078fefefefe. If software modifies any
bits not listed in Table 3-14 processor behavior is not defined. It is illegal to have
PMC41[48:45]=0000 and PMC32[57]=0 and ((PMC38[2:1]=10 or 00) or (PMC38[5:4]=10 or 00));
this produces inconsistencies in tagging I-side events in L1D and L3.
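As a worked example of the cfgdtag encodings, this sketch computes a PMC41 image that constrains memory pipeline events by DBRP0 alone. The field position (bits 4:3) and base value 0x2078fefefefe come from Table 3-14 and the note above; the enum names are illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* cfgdtag encodings from Table 3-14. */
    enum { CFG_IBR_OPC_DBR = 0, CFG_IBR_OPC = 1, CFG_DBR = 2, CFG_NONE = 3 };

    /* Set the 2-bit cfgdtag0 field (bits 4:3) in a PMC41 image, starting
     * from the architecturally required value 0x2078fefefefe. */
    static uint64_t pmc41_with_cfgdtag0(unsigned cfg) {
        uint64_t v = 0x2078fefefefeULL;   /* required base value      */
        v &= ~(3ULL << 3);                /* clear cfgdtag0, bits 4:3 */
        v |= (uint64_t)(cfg & 3) << 3;    /* install the new encoding */
        return v;
    }

    int main(void) {
        /* Constrain memory pipeline events by DBRP0 only. */
        printf("PMC41 = 0x%012llx\n",
               (unsigned long long)pmc41_with_cfgdtag0(CFG_DBR));
        return 0;
    }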
3.3.8 Instruction EAR (PMC37/PMD34,35)

This section defines the register layout for the Montecito processor instruction event address registers (IEAR). The IEAR, configured through PMC37, can be programmed in one of two modes: instruction cache and instruction TLB miss collection. EAR-specific unit masks allow software to specify event collection parameters to hardware. Figure 3-19 and Table 3-15 detail the register layout of PMC37. The instruction address, latency and other captured event parameters are provided in the PMD registers; Table 3-17 and Table 3-19 describe the associated event address data registers PMD34,35.
Both the instruction and data cache EARs (see Section 3.3.9) report the latency of captured cache events and allow latency thresholding to qualify event capture. Event address data registers (PMD32-36) contain valid data only when event collection is frozen (PMC0.fr is one). Reads of PMD32-36 while event collection is enabled return undefined values.
Table 3-15. Instruction Event Address Configuration Register Fields (PMC37) (Sheet 2 of 2)

Field    Bits   HW Reset  Description
ct              0         Selects the event to be monitored. If [13] = '1
                          then [12:5] are used for the umask.
                          1x: Monitor demand instruction cache misses (NOTE:
                              ISB hits are not considered misses); PMD34,35
                              register interpretation, see Table 3-17
                          01: Nothing monitored
                          00: Monitor L1 instruction TLB misses; PMD34,35
                              register interpretation, see Table 3-19
rsv      15:14  0         Reserved bits
ignored  63:16  -         Reads are 0; Writes are ignored
Figure 3-20. Instruction Event Address Register Format (PMD34,35): PMD34 holds the instruction cache line address [63:5], ignored bits [4:2] and stat [1:0]; PMD35 holds ignored bits [63:13], ov [12] and latency [11:0].
When the cache_tlb field (PMC37.ct) is set to 1x, instruction cache misses are monitored. When it is set to 00, instruction TLB misses are monitored. The interpretation of the umask field and performance monitor data registers PMD34,35 depends on this setting and is described in Section 3.3.8.1 for instruction cache monitoring and in Section 3.3.8.2 for instruction TLB monitoring.
3.3.8.1 Instruction EAR Cache Mode (PMC37.ct='1x)
When PMC37.ct is 1x, the instruction event address register captures instruction addresses and
access latencies for L1 instruction cache misses. Only misses whose latency exceeds a
programmable threshold are captured. The threshold is specified as an eight bit umask field in the
configuration register PMC37. Possible threshold values are defined in Table 3-16.
Table 3-16. Instruction EAR (PMC37) umask Field in Cache Mode (PMC37.ct='1x)

umask       Latency Threshold
Bits 12:5   [CPU cycles]
01xxxxxx    > 0 (All L1 Misses)
11111111    >= 4
11111110    >= 8
11111100    >= 16
11111000    >= 32
11110000    >= 128
11100000    >= 256
11000000    >= 1024
10000000    >= 4096
00000000    RAB hit (All L1 misses which hit in RAB)
other       undefined
As defined in Table 3-17, the address of the instruction cache line that missed the L1 instruction cache is provided in PMD34. Whether a qualified event was captured is indicated in PMD34.stat. The latency of the captured instruction cache miss, in CPU clock cycles, is provided in the latency field of PMD35.
Table 3-17. Instruction EAR (PMD34,35) in Cache Mode (PMC37.ct='1x)

Register  Field              Bits  Description
PMD34     stat               1:0   Status
                                   x0: EAR did not capture qualified event
                                   x1: EAR contains valid event data
          Instruction Cache  63:5  Address of instruction cache line that
          Line Address             caused cache miss
PMD35     latency            11:0  Latency in CPU clocks
          overflow           12    If 1, latency counter has overflowed one
                                   or more times before data was returned
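A minimal decoding sketch for a cache-mode sample, assuming the raw PMD34/PMD35 values have already been read out after a PMU freeze (field positions per Table 3-17):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode an instruction-cache-mode I-EAR sample per Table 3-17. */
    static void decode_iear_cache(uint64_t pmd34, uint64_t pmd35) {
        unsigned stat     = pmd34 & 0x3;          /* bits 1:0           */
        uint64_t line     = pmd34 & ~0x1fULL;     /* address, bits 63:5 */
        unsigned latency  = pmd35 & 0xfff;        /* bits 11:0          */
        unsigned overflow = (pmd35 >> 12) & 1;    /* bit 12             */

        if ((stat & 1) == 0) {                    /* x0: nothing captured */
            puts("no qualified event captured");
            return;
        }
        printf("miss line 0x%016llx, latency %u cycles%s\n",
               (unsigned long long)line, latency,
               overflow ? " (latency counter overflowed)" : "");
    }

    int main(void) {
        decode_iear_cache(0x2000000000400821ULL, 0x00000000000001a3ULL);
        return 0;
    }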
3.3.8.2 Instruction EAR TLB Mode (PMC37.ct=00)
When PMC37.ct is ‘00, the instruction event address register captures addresses of instruction TLB
misses. The unit mask allows event address collection to capture specific subsets of instruction
TLB misses. Table 3-18 summarizes the instruction TLB umask settings. All combinations of the
mask bits are supported.
Table 3-18. Instruction EAR (PMC37) umask Field in TLB Mode (PMC37.ct=00)

ITLB Miss Type  PMC.umask[7:5]  Description
---             000             Disabled; nothing will be counted
L2TLB           xx1             L1 ITLB misses which hit L2 TLB
VHPT            x1x             L1 Instruction TLB misses that hit VHPT
FAULT           1xx             Instruction TLB miss produced by an ITLB Miss Fault
ALL             111             Select all L1 ITLB Misses
NOTE: All combinations are supported.
As defined in Table 3-19, the address of the instruction cache line fetch that missed the L1 ITLB is provided in PMD34. The stat bit [1] indicates whether the captured TLB miss hit in the VHPT or required servicing by software, and PMD34.stat as a whole indicates whether a qualified event was captured. In TLB mode, the latency field of PMD35 is undefined.
Table 3-19. Instruction EAR (PMD34,35) in TLB Mode (PMC37.ct='00)

Register  Field              Bits  Description
PMD34     stat               1:0   Status Bits
                                   00: EAR did not capture qualified event
                                   01: L1 ITLB miss hit in L2 ITLB
                                   10: L1 ITLB miss hit in VHPT
                                   11: L1 ITLB miss produced an ITLB Miss Fault
          Instruction Cache  63:5  Address of instruction cache line that
          Line Address             caused TLB miss
PMD35     latency            11:2  Undefined in TLB mode
3.3.9 Data EAR (PMC40, PMD32,33,36)

The data event address configuration register (PMC40) can be programmed to monitor either L1 data cache load misses, FP loads, L1 data TLB misses, or ALAT misses. Figure 3-21 and Table 3-20 detail the register layout of PMC40. Figure 3-22 describes the associated event address data registers PMD32,33,36. The mode bits in configuration register PMC40 select data cache, data TLB, or ALAT monitoring. The interpretation of the umask field and registers PMD32,33,36 depends on the setting of the mode bits and is described in Section 3.3.9.1 for data cache load miss monitoring, Section 3.3.9.2 for data TLB monitoring, and Section 3.3.9.3 for ALAT monitoring.

Both the instruction (see Section 3.3.8) and data cache EARs report the latency of captured cache events and allow latency thresholding to qualify event capture. Event address data registers (PMD32-36) contain valid data only when event collection is frozen (PMC0.fr is one). Reads of PMD32-36 while event collection is enabled return undefined values.
Figure 3-21. Data Event Address Configuration Register (PMC40): ig [63:26], ism [25:24], ig [23:20], umask [19:16], ig [15:9], mode [8:7], pm [6], ig [5:4], plm [3:0].
Table 3-20. Data Event Address Configuration Register Fields (PMC40)

Field  Bits   HW Reset  Description
plm    3:0    0         See Table 3-5 "Performance Monitor PMC Register
                        Control Fields (PMC4-15)."
ig     5:4    -         Reads 0; Writes are ignored
pm     6                See Table 3-5 "Performance Monitor PMC Register
                        Control Fields (PMC4-15)."
mode   8:7              Data EAR mode select:
                        '00: L1 data cache load misses and FP loads
                        '01: L1 data TLB misses
                        '1x: ALAT misses
ig     15:9   -         Reads 0; Writes are ignored
umask  19:16            Data EAR unit mask
                        mode 00: data cache unit mask (definition see
                        Table 3-21, "Data EAR (PMC40) Umask Fields in Data
                        Cache Mode (PMC40.mode=00)")
                        mode 01: data TLB unit mask (definition see
                        Table 3-23, "Data EAR (PMC40) Umask Field in TLB Mode
                        (PMC40.ct=01)")
ig     23:20  -         Reads 0; Writes are ignored
ism    25:24            See Table 3-5 "Performance Monitor PMC Register
                        Control Fields (PMC4-15)."
ig     63:26  -         Reads 0; Writes are ignored
Figure 3-22. Data Event Address Register Format (PMD32,33,36): PMD36 holds the Instruction Address [63:4], vl [3], bn [2] and slot [1:0]; PMD33 holds ig [63:16], stat [15:14], ov [13] and latency [12:0]; PMD32 holds the 64-bit Data Address [63:0].
3.3.9.1 Data Cache Load Miss Monitoring (PMC40.mode=00)

If the Data EAR is configured to monitor data cache load misses, the umask is used as a load latency threshold defined by Table 3-21.

As defined in Table 3-22, the instruction and data addresses as well as the load latency of a captured data cache load miss are presented to software in three registers, PMD32,33,36. If no qualified event was captured, the valid bit in PMD36 is zero.

HPW accesses will not be monitored, and reads from ccv will not be monitored. If an L1D cache miss is not at least 7 clocks after a captured miss, it will not be captured. Semaphore instructions and floating point loads will be counted.
Table 3-21. Data EAR (PMC40) Umask Fields in Data Cache Mode (PMC40.mode=00)
[This table lists the umask encodings (bits 19:16) and their corresponding load latency thresholds in CPU cycles; the encodings are not reproduced here.]

Table 3-22. PMD32,33,36 Fields in Data Cache Load Miss Mode (PMC40.mode=00)

Register  Field         Bits   Description
PMD32     Data Address  63:0   64-bit virtual address of data item that
                               caused miss
PMD33     latency       12:0   Latency in CPU clocks
          overflow      13     Overflow - If 1, latency counter has overflowed
                               one or more times before data was returned
          stat          15:14  Status bits;
                               00: No valid information in PMD32,36 and rest
                                   of PMD33
                               01: Valid information in PMD32,33 and may be
                                   in PMD36
                               NOTE: These bits should be cleared before the
                               EAR is reused.
          ig            63:16  Reads 0; Writes are ignored
PMD36     slot          1:0    Slot bits; if ".vl" is 1, the instruction
                               bundle slot of the memory instruction
          bn            2      Bundle bit; if ".vl" is 1 this indicates which
                               of the executed instruction bundles is
                               associated with the captured miss
          vl            3      Valid bit;
                               0: Invalid Address (EAR did not capture
                                  qualified event)
                               1: EAR contains valid event data
                               NOTE: This bit should be cleared before the
                               EAR is reused.
          Instruction   63:4   Virtual address of the first bundle in the
          Address              2-bundle dispersal window which was being
                               executed at the time of the miss. If ".bn" is
                               1 then the second bundle contains the memory
                               instruction and 16 should be added to the
                               address.
The detection of data cache load misses requires a load instruction to be tracked during multiple clock cycles from instruction issue to cache miss occurrence. Since multiple loads may be outstanding at any point in time and the Montecito processor data cache miss event address register can only track a single load at a time, not all data cache load misses may be captured. When the processor hardware captures the address of a load (called the monitored load), it ignores all other overlapped concurrent loads until it is determined whether the monitored load turns out to be an L1 data cache miss or not. If the monitored load turns out to be a cache miss, its parameters are latched into PMD32,33,36. The processor randomizes the choice of which load instructions are tracked to prevent the same data cache load miss from always being captured (in a regular sequence of overlapped data cache load misses). While this mechanism will not always capture all data cache load misses in a particular sequence of overlapped loads, its accuracy is sufficient to be used by statistical sampling or code instrumentation.
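Once event collection is frozen (PMC0.fr = 1), a sampling tool can unpack the captured miss from the raw PMD32, PMD33 and PMD36 values per Table 3-22. A minimal decoding sketch (the register reads themselves are assumed to have been done elsewhere):

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a D-EAR load-miss sample per Table 3-22. */
    static void decode_dear(uint64_t pmd32, uint64_t pmd33, uint64_t pmd36) {
        unsigned vl   = (pmd36 >> 3) & 1;   /* valid bit                  */
        unsigned bn   = (pmd36 >> 2) & 1;   /* bundle bit                 */
        unsigned slot = pmd36 & 0x3;        /* slot of the memory op      */
        uint64_t ip   = pmd36 & ~0xfULL;    /* first bundle of the window */

        if (!vl) {
            puts("no qualified event captured");
            return;
        }
        if (bn)                             /* miss was in the second     */
            ip += 16;                       /* bundle of the pair         */

        printf("data 0x%016llx, IP 0x%016llx slot %u, latency %llu cycles\n",
               (unsigned long long)pmd32,
               (unsigned long long)ip, slot,
               (unsigned long long)(pmd33 & 0x1fff));   /* bits 12:0 */
    }

    int main(void) {
        decode_dear(0x6000000000012340ULL, 0x4000u | 37u,
                    0x400000000000764cULL);
        return 0;
    }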
3.3.9.2 Data TLB Miss Monitoring (PMC40.mode='01)

If the Data EAR is configured to monitor data TLB misses, the umask defined in Table 3-23 determines which data TLB misses are captured by the Data EAR. For TLB monitoring, all combinations of the mask bits are supported.

As defined in Table 3-24, the instruction and data addresses of captured DTLB misses are presented to software in PMD32,36. If no qualified event was captured, the valid bit in PMD36 reads zero. When programmed for data TLB monitoring, the contents of the latency field of PMD33 are undefined.

Both load and store TLB misses will be captured. Some unreached instructions will also be captured. For example, if a load misses in the L1 DTLB but hits in the L2 DTLB and is in an instruction group after a taken branch, it will be captured. Stores and floating-point operations never miss in the L1 DTLB but could miss the L2 DTLB or fault to be handled by software.

Note: PMC39 must be 0 in this mode; otherwise the wrong IP may be captured for misses that come right after a mispredicted branch.

Table 3-23. Data EAR (PMC40) Umask Field in TLB Mode (PMC40.ct=01)

L1 DTLB Miss Type  PMC.umask[19:16]  Description
---                000x              Disabled; nothing will be counted
L2DTLB             xx1x              L1 DTLB misses which hit L2 DTLB
VHPT               x1xx              L1 DTLB misses that hit VHPT
FAULT              1xxx              Data TLB miss produced a fault
ALL                111x              Select all L1 DTLB Misses
NOTE: All combinations are supported.

Table 3-24. PMD32,33,36 Fields in TLB Miss Mode (PMC40.mode='01)

Register  Field         Bits   Description
PMD32     Data Address  63:0   64-bit virtual address of data item that
                               caused miss
PMD33     latency       12:0   Undefined in TLB Miss mode
          ov            13     Undefined in TLB Miss mode
          stat          15:14  Status
                               00: invalid information in PMD32,36 and rest
                                   of PMD33
                               01: L2 Data TLB hit
                               10: VHPT hit
                               11: Data TLB miss produced a fault
                               NOTE: These bits should be cleared before the
                               EAR is reused.
          ig            63:16  Reads 0; Writes are ignored
PMD36     slot          1:0    Slot bits; if ".vl" is 1, the instruction
                               bundle slot of the memory instruction
          bn            2      Bundle bit; if ".vl" is 1 this indicates which
                               of the executed instruction bundles is
                               associated with the captured miss
          vl            3      Valid bit;
                               0: Invalid Instruction Address
                               1: EAR contains valid instruction address of
                                  the miss
                               NOTE: It is possible for this bit to contain 0
                               while PMD33.stat indicates valid D-EAR data.
                               This can happen when the D-EAR is triggered by
                               an RSE load, for which no instruction address
                               is captured.
                               NOTE: This bit should be cleared before the
                               EAR is reused.
          Instruction   63:4   Virtual address of the first bundle in the
          Address              2-bundle dispersal window which was being
                               executed at the time of the miss. If ".bn" is
                               1 then the second bundle contains the memory
                               instruction and 16 should be added to the
                               address.
3.3.9.3 ALAT Miss Monitoring (PMC40.mode='1x)

As defined in Table 3-25, the address of the instruction (a failing ld.c or chk.a) causing an ALAT miss is presented to software in PMD36. If no qualified event was captured, the valid bit in PMD36 reads zero. When programmed for ALAT monitoring, the latency field of PMD33 and the contents of PMD32 are undefined.

Note: PMC39 must be 0 in this mode; otherwise the wrong IP may be captured for misses that come right after a mispredicted branch.

Table 3-25. PMD32,33,36 Fields in ALAT Miss Mode (PMC40.mode='1x)

Register  Field         Bits   Description
PMD32     Data Address  63:0   Undefined in ALAT Miss Mode
PMD33     latency       12:0   Undefined in ALAT Miss mode
          ov            13     Undefined in ALAT Miss mode
          stat          15:14  Status bits;
                               00: No valid information in PMD32,36 and rest
                                   of PMD33
                               01: Valid information in PMD32,33 and may be
                                   in PMD36
                               NOTE: These bits should be cleared before the
                               EAR is reused.
          ig            63:16  Reads 0; Writes are ignored
PMD36     slot          1:0    Slot bits; if ".vl" is 1, the instruction
                               bundle slot of the memory instruction
          bn            2      Bundle bit; if ".vl" is 1 this indicates which
                               of the executed instruction bundles is
                               associated with the captured miss
          vl            3      Valid bit;
                               0: Invalid Address (EAR did not capture
                                  qualified event)
                               1: EAR contains valid event data
                               NOTE: This bit should be cleared before the
                               EAR is reused.
          Instruction   63:4   Virtual address of the first bundle in the
          Address              2-bundle dispersal window which was being
                               executed at the time of the miss. If ".bn" is
                               1 then the second bundle contains the memory
                               instruction and 16 should be added to the
                               address.
3.3.10 Execution Trace Buffer (PMC39,42, PMD48-63,38,39)

The execution trace buffer provides information about the most recent Itanium processor control flow changes. The Montecito execution trace buffer configuration register (PMC39) defines the conditions under which instructions which cause changes to the execution flow are captured, and allows the trace buffer to capture specific subsets of these events.

In addition to the branches captured in the previous generation Itanium 2 processor BTB, Montecito's ETB captures rfi instructions, exceptions (excluding asynchronous interrupts) and silently resteered chk (failed chk) events. Passing chk instructions are not captured under any programming conditions (except when there is another capturable event).

In every cycle in which a qualified change to the execution flow happens, its source bundle address and slot number are written to the execution trace buffer. This event's target address is written to the next buffer location. If the target instruction bundle itself contains a qualified execution flow change, the execution trace buffer either records a single trace buffer entry (with the s-bit set) or makes two trace buffer entries: one that records the target instruction as a branch target (s-bit cleared), and another that records the target instruction as a branch source (s-bit set). As a result, the branch trace buffer may contain a mixed sequence of source and target addresses.

Note: The setting of PMC42 can override the setting of PMC39. PMC42 is used to configure the Execution Trace Buffer's alternate mode: the IP-EAR. Please refer to Section 3.3.10.2.1, "Notes on the IP-EAR" for more information about this mode. PMC42.mode must be set to 000 to enable normal branch trace capture in PMD48-63 as described below. If PMC42.mode is set to other than 000, PMC39's contents will be ignored.
3.3.10.1 Execution Trace Capture (PMC42.mode='000)
Section 3.3.10.1.1 through Section 3.3.10.1.3 describe the operation of the Execution Trace Buffer
when configured to capture an execution trace (or “enhanced” branch trace).
The execution trace buffer configuration register (PMC39) defines the conditions under which
execution flow changes are to be captured. These conditions are given in Figure 3-23 and
Table 3-26, which refer to conditions associated with the branch prediction. These conditions are:
• Whether the target of the branch should be captured
• The path of the branch (not taken/taken), and
• Whether or not the branch path was mispredicted
• Whether or not the target of the branch was mispredicted
• What type of branch should be captured
Note: All instructions eligible for capture are subject to filtering by the “plm” field but only branches are
affected by PMC39’s other filters (tm,ptm,ppm and brt) as well as the Instruction Addr Range and
Opcode Match filters.
Table 3-26. Execution Trace Buffer Configuration Register Fields (PMC39)

Field  Bits   Description
plm    3:0    See Table 3-5.
              NOTE: This mask is applied at the time the event's source
              address is captured. Once the source IP is captured, the
              target IP of this event is always captured even if the ETB is
              disabled.
ig     5:4    Reads zero; writes are ignored
pm     6      See Table 3-5.
              NOTE: This bit is applied at the time the event's source
              address is captured. Once the source IP is captured, the
              target IP of this event is always captured even if the ETB is
              disabled.
ds     7      Data selector:
              1: reserved (undefined data is captured in lieu of the target
                 address)
              0: capture branch target
tm     9:8    Taken Mask:
              11: all Itanium® instruction branches
              10: Taken Itanium instruction branches only
              01: Not Taken Itanium instruction branches only
              00: No branch is captured
ptm    11:10  Predicted Target Address Mask:
              11: capture branch regardless of target prediction outcome
              10: branch target address predicted correctly
              01: branch target address mispredicted
              00: No branch is captured
ppm    13:12  Predicted Predicate Mask:
              11: capture branch regardless of predicate prediction outcome
              10: branch predicted branch path (taken/not taken) correctly
              01: branch mispredicted branch path (taken/not taken)
              00: No branch is captured
brt    15:14  Branch Type Mask:
              11: only non-return indirect branches captured
              10: only return branches will be captured
              01: only IP-relative branches will be captured
              00: all branches are captured
ig     63:16  Reads zero; writes are ignored
To summarize, an Itanium instruction branch and its target are captured by the trace buffer when all of the qualifying conditions selected by the PMC39 fields above are satisfied.
To capture all correctly predicted Itanium instruction branches, the Montecito execution trace buffer configuration settings in PMC39 should be: ds=0, tm=11, ptm=10, ppm=10, brt=00. Either branches whose path was mispredicted can be captured (ds=0, tm=11, ptm=11, ppm=01, brt=00) or branches with a target misprediction (ds=0, tm=11, ptm=01, ppm=11, brt=00) can be captured, but not both. A setting of ds=0, tm=11, ptm=01, ppm=01, brt=00 will result in an empty buffer. If a branch's path is mispredicted, no target prediction is recorded.
Instruction Address Range Matching (Section 3.3.5) and Opcode Matching (Section 3.3.6) may also be used to constrain what is captured in the execution trace buffer.
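Given the field positions in Table 3-26, a PMC39 filter setting can be composed with shifts and ORs. A small sketch (plm and pm handling omitted; values per the examples above):

    #include <stdint.h>
    #include <stdio.h>

    /* Pack the PMC39 branch-capture filters (Table 3-26). */
    static uint64_t pmc39(unsigned ds, unsigned tm, unsigned ptm,
                          unsigned ppm, unsigned brt) {
        return ((uint64_t)(ds  & 1) << 7)  |
               ((uint64_t)(tm  & 3) << 8)  |
               ((uint64_t)(ptm & 3) << 10) |
               ((uint64_t)(ppm & 3) << 12) |
               ((uint64_t)(brt & 3) << 14);
    }

    int main(void) {
        /* All correctly predicted branches:
         * ds=0, tm=11, ptm=10, ppm=10, brt=00 */
        printf("PMC39 = 0x%04llx\n",
               (unsigned long long)pmc39(0, 3, 2, 2, 0));
        return 0;
    }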
3.3.10.1.2 Execution Trace Buffer Data Format (PMC42.mode='000)
Figure 3-24. Execution Trace Buffer Register Format (PMD48-63) (PMC42.mode='000, where PMC39.ds == 0): Address [63:4], slot [3:2], mp [1], s [0].

Table 3-27. Execution Trace Buffer Register Fields (PMD48-63) (PMC42.mode='000)

Field    Bits  Description
s        0     Source bit
               1: contents of register is the source address of a monitored
                  event (branch, rfi, exception or failed chk)
               0: contents of register is a target or undefined (if
                  PMC39.ds = 1)
mp       1     Mispredict Bit
               if s=1 and mp=1: mispredicted event (e.g. target, predicate
                  or back end misprediction)
               if s=1 and mp=0: correctly predicted event
               if s=0 and mp=1: valid target address
               if s=0 and mp=0: invalid ETB register
               rfi/exceptions/failed_chk are all considered as mispredicted
               events and are encoded as above.
slot     3:2   if s=0: undefined
               if s=1: Slot index of first taken event in bundle
               00: Itanium processor Slot 0 source/target
               01: Itanium processor Slot 1 source/target
               10: Itanium processor Slot 2 source/target
               11: this was a not taken event
Address  63:4  if s=1: 60-bit bundle address of Itanium instruction branch
               if ds=0 and s=0: 60-bit target bundle address of Itanium
               instruction branch
The sixteen execution trace buffer registers PMD48-63 provide information about the outcome of a captured event sequence. The branch trace buffer registers (PMD48-63) contain valid data only when event collection is frozen (PMC0.fr is one). While event collection is enabled, reads of PMD48-63 return undefined values. The registers follow the layout defined in Figure 3-24 and Table 3-27, and contain the address of either a captured branch instruction (s-bit=1) or a branch target (s-bit=0). For branch instructions, the mp-bit indicates a branch misprediction. An execution trace register with a zero s-bit and a zero mp-bit indicates an invalid buffer entry. The slot field captures the slot number of the first taken Itanium instruction branch in the captured instruction bundle. A slot number of 3 indicates a not-taken branch.
In every cycle in which a qualified Itanium instruction branch retires (see the footnote below), its source bundle address and slot number are written to the branch trace buffer. If within the next clock the target instruction bundle contains a branch that retires and meets the same conditions, the address of the second branch is stored. Otherwise, either the branch's target address (PMC39.ds=0) or details of the branch prediction (PMC39.ds=1) are written to the next buffer location. As a result, the execution trace buffer may contain a mixed sequence of branches and targets.
The Montecito branch trace buffer is a circular buffer containing the last four to eight qualified Itanium instruction branches. The Execution Trace Buffer Index Register (PMD38), defined in Figure 3-25 and Table 3-28, identifies the most recently recorded branch or target. In every cycle in which a qualified branch or target is recorded, the execution buffer index (ebi) is post-incremented. After 8 entries have been recorded, the branch index wraps around, and the next qualified branch will overwrite the first trace buffer entry. The wrap condition itself is recorded in the full bit of PMD38. The ebi field of PMD38 defines the next branch buffer index that is about to be written; the last written branch trace buffer PMD index can therefore be computed from the contents of PMD38 as 48 + ((ebi + 15) mod 16).

If both the full bit and the ebi field of PMD38 are zero, no qualified branch has been captured by the branch trace buffer. The full bit gets set every time the branch trace buffer wraps from PMD63 to PMD48. Once set, the full bit remains set until explicitly cleared by software, i.e. it is a sticky bit. Software can reset the ebi index and the full bit by writing to PMD38.

PMD39 provides additional information related to the ETB entries.
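Software can use PMD38 to walk the buffer from oldest to youngest entry. A sketch, assuming PMD48-63 have already been copied into an array after freezing the monitors:

    #include <stdint.h>
    #include <stdio.h>

    /* Walk ETB entries oldest-first given the PMD38 index register and a
     * copy of PMD48..PMD63 in etb[0..15] (etb[0] == PMD48). */
    static void walk_etb(uint64_t pmd38, const uint64_t etb[16]) {
        unsigned ebi   = pmd38 & 0xf;       /* next entry to be written */
        unsigned full  = (pmd38 >> 5) & 1;  /* sticky wrap indicator    */
        unsigned n     = full ? 16 : ebi;   /* number of valid entries  */
        unsigned start = full ? ebi : 0;    /* oldest entry             */

        for (unsigned i = 0; i < n; i++) {
            uint64_t e = etb[(start + i) & 0xf];
            unsigned s = e & 1, mp = (e >> 1) & 1;
            if (!s && !mp)                  /* s=0, mp=0: invalid entry */
                continue;
            printf("%-6s %s addr 0x%016llx slot %u\n",
                   s ? "source" : "target",
                   (s && mp) ? "(mispredicted)" : "",
                   (unsigned long long)(e & ~0xfULL),
                   (unsigned)((e >> 2) & 3));
        }
    }

    int main(void) {
        uint64_t etb[16] = { 0x4000000000001230ULL | 1 };  /* one source */
        walk_etb(0x1, etb);                 /* ebi=1, full=0            */
        return 0;
    }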
Figure 3-25. Execution Trace Buffer Index Register Format (PMD38): ig [63:6], full [5], ig [4], ebi [3:0].

Table 3-28. Execution Trace Buffer Index Register Fields (PMD38)

Field  Bit Range  Description
ebi    3:0        Execution Buffer Index [Range 0..15 - Index 0 indicates
                  PMD48]. Pointer to the next execution trace buffer entry
                  to be written:
                  if full=1: points to the oldest recorded branch/target
                  if full=0: points to the next location to be written
ig     4          Reads zero; Writes are ignored
full   5          Full Bit (sticky):
                  if full=1: execution trace buffer has wrapped
                  if full=0: execution trace buffer has not wrapped
ig     63:6       Reads zero; Writes are ignored
Figure 3-26. Execution Trace Buffer Extension Register Format (PMD39): sixteen 4-bit fields; pmd48 ext [3:0], pmd56 ext [7:4], pmd49 ext [11:8], pmd57 ext [15:12], pmd50 ext [19:16], pmd58 ext [23:20], and so on through pmd55 ext [59:56] and pmd63 ext [63:60].

1. In some cases, the Montecito processor execution trace buffer will capture the source (but not the target) address of an excepting branch instruction. This occurs on trapping branch instructions as well as faulting and multi-way branches.
Table 3-29. Execution Trace Buffer Extension Register Fields (PMD39)

Field      Bit Range  Description
pmd48 ext  3:0        Extension bits for PMD48:
                      If PMD48.bits[1:0] = 11:
                        1 = back end mispredicted the branch and the pipeline
                            was flushed by it
                        0 = no pipeline flushes are associated with this
                            branch
                      If PMD48.s = 1:
                        1 = branch was from bundle 1, add 0x1 to
                            PMD48.bits[63:4]
                        0 = branch was from bundle 0, no correction is
                            necessary
                        else, ignore
pmd56 ext  7:4        Same as above for PMD56
pmd49 ext  11:8       Same as above for PMD49
pmd57 ext  15:12      Same as above for PMD57
pmd50 ext  19:16      Same as above for PMD50
pmd58 ext  23:20      Same as above for PMD58
...        ...        ...
pmd63 ext  63:60      Same as above for PMD63
3.3.10.1.3 Notes on the Execution Trace Buffer
Although the Montecito ETB does not capture asynchronous interrupts as events, the address of
these handlers can be captured as target addresses. This could happen if, at the target of a captured
event (e.g. taken branch), an asynchronous event is taken before executing any instruction at the
target.
3.3.10.2 Instruction Pointer Address Capture (IP-EAR) (PMC42.mode='1xx)

Montecito has a new feature called Instruction Pointer Address Capture (or IP-EAR). This feature is intended to facilitate the correlation of performance monitoring events to IP values. To do this, the Montecito Execution Trace Buffer (ETB) can be configured to capture the IPs of retired instructions. When a performance monitoring event is used to trigger an IP-EAR freeze, if the IP which caused the event gets to retirement there is a good chance that IP will be captured in the ETB. The IP-EAR freezes after a programmable number of cycles following a PMU freeze, as described below.

Register PMC42 is used to configure this feature, and the ETB registers (PMD48-63) capture the data. PMD38 holds the index and overflow bits for the IP buffer much as it does for the ETB.
Note: Setting PMC42.mode to a non-0 value will override the setting of PMC39 (the ETB configuration register).

Table 3-30. IP-EAR Configuration Register Fields (PMC42)

Field  Bits   Description
plm    3:0    See Table 3-5, "Performance Monitor PMC Register Control
              Fields (PMC4-15)"
ig     5:4    Reads zero; Writes are ignored
pm     6      See Table 3-5, "Performance Monitor PMC Register Control
              Fields (PMC4-15)"
ig     7      Reads zero; Writes are ignored
mode   10:8   IP EAR mode:
              000: ETB Mode (IP-EAR not functional; ETB is functional)
              100: IP-EAR Mode (IP-EAR is functional; ETB not functional)
delay  18:11  Programmable delay before freezing
ig     63:20  Reads zero; Writes are ignored
The IP-EAR functions by continuously capturing retired IPs in PMD48-63 as long as it is enabled. It captures retired IPs and the elapsed time between retirements. Up to 16 entries can be captured. The IP-EAR has a slightly different freezing model than the rest of the Performance Monitors: it is capable of delaying its freeze for a number of cycles past the point of PMU freeze. The user can program an 8-bit number to determine the number of cycles the freeze will be delayed.

Note: PMD48-63 are not, in fact, 68b registers. Figure 3-28 and Figure 3-29 represent the virtual layout of an execution trace buffer entry in IP-EAR mode for the sake of clarity. The higher order bits [67:64] for each entry are mapped into PMD39 as described in Table 3-33.

Figure 3-28. IP-EAR data format (PMD48-63, where PMC42.mode == 100 and PMD48-63.ef = 0): ef [67], f [66], cycl [65:60], IP[63:4] [59:0].

Figure 3-29. IP-EAR data format (PMD48-63, where PMC42.mode == 100 and PMD48-63.ef = 1): ef [67], f [66], cycl [65:60], IP[63:12] [59:8], delay [7:0].

Table 3-31. IP-EAR Data Register Fields (PMD48-63) (PMC42.mode='1xx)

Field  Bits   Description
cycl   63:60  Elapsed cycles. The 4 least significant bits of a 6-bit
              elapsed cycle count from the previous retired IP. This is a
              saturating counter and will stay at all 1s when counted up to
              the maximum value.
              Note: the 2 most significant bits for each entry are found in
              PMD39 (see below).
IP     59:8   Retired IP value; bits [63:12]
delay  7:0    If ef = 1: indicates the remainder of the delay count
              Else: retired IP value, bits [11:4]
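Reconstructing the virtual 68-bit entry of Figure 3-28/3-29 means merging each PMD48-63 value with its 4-bit extension nibble from PMD39 (Table 3-33). A decoding sketch; the nibble ordering (48, 56, 49, 57, ...) follows Figure 3-31:

    #include <stdint.h>
    #include <stdio.h>

    /* Return the 4-bit PMD39 extension nibble for PMDreg (48..63).
     * Nibble order per Table 3-33: 48, 56, 49, 57, 50, 58, ... */
    static unsigned ext_nibble(uint64_t pmd39, unsigned reg) {
        unsigned k = reg - 48;
        unsigned j = (k < 8) ? (2 * k) : (2 * (k - 8) + 1);
        return (pmd39 >> (4 * j)) & 0xf;
    }

    static void decode_ipear(uint64_t pmd39, uint64_t pmd, unsigned reg) {
        unsigned ext  = ext_nibble(pmd39, reg);
        unsigned ef   = ext & 1;                 /* early freeze        */
        unsigned fl   = (ext >> 1) & 1;          /* pipe flush seen     */
        unsigned cycl = (((ext >> 2) & 3) << 4)  /* 2 MSBs from PMD39   */
                      | ((pmd >> 60) & 0xf);     /* 4 LSBs from the PMD */

        if (ef)   /* ef=1: IP[63:12] in bits 59:8, delay in bits 7:0 */
            printf("IP 0x%016llx, delay %u, cycl %u, flush %u\n",
                   (unsigned long long)((pmd & 0x0fffffffffffff00ULL) << 4),
                   (unsigned)(pmd & 0xff), cycl, fl);
        else      /* ef=0: IP[63:4] in bits 59:0 */
            printf("IP 0x%016llx, cycl %u, flush %u\n",
                   (unsigned long long)((pmd & 0x0fffffffffffffffULL) << 4),
                   cycl, fl);
    }

    int main(void) {
        decode_ipear(0x0, 0x0123456789abcdefULL >> 4, 48);
        return 0;
    }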
Figure 3-30. IP Trace Buffer Index Register Format (PMD38) (PMC42.mode='1xx): ig [63:6], full [5], ig [4], ebi [3:0].

Table 3-32. IP Trace Buffer Index Register Fields (PMD38) (PMC42.mode='1xx)

Field  Bit Range  Description
ebi    3:0        IP Trace Buffer Index [Range 0..15 - Index 0 indicates
                  PMD48]. Pointer to the next IP trace buffer entry to be
                  written:
                  if full=1: points to the oldest recorded IP entry
                  if full=0: points to the next location to be written
ig     4          Reads zero; Writes are ignored
full   5          Full Bit (sticky):
                  if full=1: IP trace buffer has wrapped
                  if full=0: IP trace buffer has not wrapped
ig     63:6       Reads zero; Writes are ignored
Figure 3-31. IP Trace Buffer Extension Register Format (PMD39) (PMC42.mode='1xx): sixteen 4-bit fields; pmd48 ext [3:0], pmd56 ext [7:4], pmd49 ext [11:8], pmd57 ext [15:12], pmd50 ext [19:16], pmd58 ext [23:20], and so on through pmd55 ext [59:56] and pmd63 ext [63:60].
Table 3-33. IP Trace Buffer Extension Register Fields (PMD39) (PMC42.mode='1xx)

Field      Bit Range  Bits  Description
pmd48 ext  3:0        3:2   cycl - Elapsed cycles
                            The 2 most significant bits of a 6-bit elapsed
                            cycle count from the previous retired IP. This
                            is a saturating counter and will stay at all 1s
                            when counted up to the maximum value.
                      1     f - Flush
                            Indicates whether there has been a pipe flush
                            since the last entry
                      0     ef - Early freeze
                            If 1: the current entry is an early freeze case.
                            Early freeze occurs if:
                            - PSR bits cause the IP-EAR to become disabled
                            - Thread switch
pmd56 ext  7:4              Same as above for PMD56
pmd49 ext  11:8             Same as above for PMD49
pmd57 ext  15:12            Same as above for PMD57
pmd50 ext  19:16            Same as above for PMD50
pmd58 ext  23:20            Same as above for PMD58
...        ...              ...
pmd63 ext  63:60            Same as above for PMD63
3.3.10.2.1 Notes on the IP-EAR

When the IP-EAR freezes due to its normal freeze mechanism (i.e. PMU freeze + delay), it captures one last entry with "ef"=0. The IP value in this entry could be incorrect since there is no guarantee that the CPU would be retiring an IP at this particular time. Since this is always the youngest entry captured in the IP-EAR buffer, it should be easier to identify this event.
3.3.11 Interrupts

As mentioned in Table 3-6, each one of registers PMD4-15 will cause an interrupt if the following conditions are all true:

• PMCi.oi=1 (i.e. the overflow interrupt is enabled for PMDi) and PMDi overflows. Note that there is only one interrupt line that will be raised regardless of which PMC/PMD set meets this condition.

This interrupt is an "External Interrupt" with Vector = 0x3000 and will be recognized only if the following conditions are true:

• PMV.m=0 and PMV.vector is set up correctly; i.e. Performance Monitor interrupts are not masked and a proper vector is programmed for this interrupt by writing to the PMV register.

• PSR.i=1 and PSR.ic=1; i.e. interruptions are unmasked and interruption collection is enabled in the Processor Status Register.

• TPR.mmi=0 (i.e. all external interrupts are not masked) and TPR.mic is a value such that the priority class that the Performance Monitor Interrupt belongs to is not masked. For example, if we assign vector 0xD2 to the Performance Monitor Interrupt, according to Table 5-7 "Interrupt Priorities, Enabling, and Masking" in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual, it will be in priority class 13. So any value less than 13 for TPR.mic is okay for recognizing this interrupt. Software sets these fields by writing to the TPR register.

• There are no higher priority faults, traps, or external interrupts pending.

The interrupt service routine needs to read the IVR register in order to figure out the highest priority external interrupt which needs to be serviced.

Before returning from the interrupt service routine, the Performance Monitor needs to be reinitialized such that the interrupt will be cleared. This can be done by clearing PMC.oi and/or reinitializing the PMD which caused the interrupt (which one can be determined by reading PMC0). In addition to this, all bits of PMC0 need to be cleared if further monitoring needs to be done.
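A skeleton of the service sequence described above is sketched below in C. The pmc[]/pmd[] arrays simulate the register files; on real hardware these accesses are privileged mov-to/from-PMC/PMD operations performed by the OS, and the assumption that PMC0 bit i flags an overflow of PMDi (i=4..15) follows the architected PMC0 layout.

    #include <stdint.h>
    #include <stdio.h>

    /* Simulated PMC/PMD register files; real accesses are privileged. */
    static uint64_t pmc[64], pmd[64];

    static void pmu_overflow_isr(void) {
        uint64_t ovf = pmc[0];               /* PMC0: freeze + overflow bits */

        for (unsigned i = 4; i <= 15; i++) { /* PMC0[i] flags PMDi overflow  */
            if (ovf & (1ULL << i)) {
                printf("PMD%u overflowed\n", i);
                pmd[i] = 0;                  /* re-arm (or clear PMCi.oi)    */
            }
        }
        pmc[0] = 0;                          /* clear all PMC0 bits so that  */
                                             /* monitoring can resume        */
        /* ...then EOI and return from interruption. */
    }

    int main(void) {
        pmc[0] = (1ULL << 5) | 1;            /* PMD5 overflow, fr=1          */
        pmu_overflow_isr();
        return 0;
    }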
3.3.12 Processor Reset, PAL Calls, and Low Power State

Processor Reset: On processor hardware reset, the monitoring control bits of all PMC registers are zero and PMV.m is set to one. This ensures that no interrupts are generated and events are not externally visible. On reset, PAL firmware ensures that the instruction address range check, the opcode matcher and the data address range check are initialized as follows:

• PMC32,33,34,35 = 0xffffffffffffffff (match all opcodes)
• PMC36 = 0xfffffff0 (no opcode match constraints)
• PMC38 = 0xdb6 (no instruction address range constraints)
• PMC41 = 0x2078fefefefe (no memory pipeline event constraints)

All other performance monitoring related state is undefined.
Table 3-34. Information Returned by PAL_PERF_MON_INFO for the Montecito Processor

Return Value          Description                                       Value
PAL_RETIRED           8-bit unsigned event type for counting the
                      number of untagged retired Itanium instructions
PAL_CYCLES            8-bit unsigned event type for counting the
                      number of running CPU cycles
PAL_WIDTH             8-bit unsigned number of implemented counter     48
                      bits
PAL_GENERIC_PM_PAIRS  8-bit unsigned number of generic PMC/PMD pairs   4
PAL_PMCmask           256-bit mask defining which PMC registers are
                      populated
PAL_PMDmask           256-bit mask defining which PMD registers are
                      populated
PAL_CYCLES_MASK       256-bit mask defining which PMC/PMD counters
                      can count running CPU cycles (event defined by
                      PAL_CYCLES)
PAL_RETIRED_MASK      256-bit mask defining which PMC/PMD counters
                      can count untagged retired Itanium instructions
                      (event defined by PAL_RETIRED)
PAL Call: As defined in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual, the PAL call PAL_PERF_MON_INFO provides software with information about the implemented performance monitors. The Montecito processor specific values are summarized in Table 3-34.

Low Power State: To ensure that monitor counts are preserved when the processor enters a low power state, PAL_LIGHT_HALT freezes event monitoring prior to powering down the processor. As a result, bus events occurring during the low power state (e.g. snoops) will not be counted. PAL_LIGHT_HALT preserves the original value of the PMC0 register.

§
4 Performance Monitor Events

4.1 Introduction

This chapter describes the architectural and microarchitectural events measurable on the Montecito processor through the performance monitoring mechanisms described earlier in Chapter 3. The early sections of this chapter provide a categorized high-level view of the event list, grouping logically related events together. Computation (either directly by a counter in hardware or indirectly as a "derived" event) of common performance metrics is also discussed. Each directly measurable event is then described in greater detail in the alphabetized list of all processor events in Section 4.15.

The Montecito processor is capable of monitoring numerous events. The majority of events can be selected as input to any of the PMD4-15 counters using the hexadecimal values shown in the "event code" field of the event list. Please refer to Section 4.8.2 and Section 4.8.4 for events that have more specific requirements.
4.2 Categorization of Events

Performance related events are grouped into the categories covered by the following sections. Each section includes a table providing information on directly measurable events. The section may also contain a second table of events that can be derived from those that are directly measurable. These derived events may simply rename existing events or present steps to determine the value of common performance metrics. Derived events are not, however, discussed in the systematic event listing in Section 4.15.

Directly measurable events often use the PMC.umask field (see Chapter 3) to measure a certain variant of the event in question. Symbolic event names for such events include a period to indicate use of the umask, specified by four bits in the detailed event description (x's are don't-cares).
The summary tables in the subsequent sections define events by specifying the following
attributes:
• Symbol Name - Symbolic name used to denote this event.
• Event Code - Hexadecimal value to program into bits [15:8] of the appropriate PMC register in
order to measure this event.
• IAR - Can this event be constrained by the Instruction Address Range registers?
• DAR - Can this event be constrained by the Data Address Range registers?
• OPC - Can this event be constrained by the Opcode Match registers?
• Max Inc/Cyc - Maximum Increment Per Cycle or the maximum value this event may be
increased by each cycle.
• T - Type; Either A for Active, F for Floating, S for Self Floating or C for Causal (see Table 4-42 for this information).
• Description - Brief description of the event.
4.2.1 Hyper-Threading and Event Types
The Montecito Processor implements a type of hardware based multi-threading that effectively
allows two threads to coexist within a processor core although only one thread is “active” within
the core’s pipeline at any moment in time. This affects how events are generated. Certain events
may be generated after the thread they belong to has become inactive. This also affects how events
are assigned to the threads occupying the same core, which is also dependent upon which PMD the
event was programmed into (see Section 3.3.2 for more information). Certain events do not have
the concept of a “home” thread.
These effects are further complicated by the use of the “.all” field, which allows a user to choose to
monitor a particular event for the thread being programmed to or for both threads (see Table 3-6). It
should be noted that monitoring with .all enabled does not always produce valid results and in
certain cases the setting of .all is ignored. Please refer to the individual events for further
information.
To help decipher these effects, events have been classified by the following types:
• Active - this event can only occur when the thread that generated it is “active” (currently
executing in the processor core’s pipeline) and is considered to be generated by the active
thread. Either type of monitor can be used if .all is not set. Example(s): BE_EXE_BUBBLE
and IA64_INST_RETIRED.
• Causal - this event does not belong to a thread. It is assigned to the active thread. Although it
seems natural to use either type of monitor if .all is not set, due to implementation constraints,
causal events should only be monitored in duplicated counters. There is one exception to this
rule: CPU_OP_CYCLES can be measured in both types of counters. Example(s):
CPU_OP_CYCLES and L2I_SNOOP_HITS.
• Floating - this event belongs to a thread, but could have been generated when its thread was
inactive (or “in the background”). These events should only be monitored in duplicated
counters. If .all is not set, only events associated with the monitoring thread will be captured.
If .all is set, events associated with both threads will be captured during the time the
monitoring thread has been assigned to a processor by the OS. Example(s):
L2D_REFERENCES and ER_MEM_READ_OUT_LO.
• Self Floating - this is a hybrid event used to better categorize certain BUS and SI (System
Interface) events. If this event was monitored with the .SELF umask, it is a Floating event. If
any other umask is used it is considered Causal. These events should only be monitored in
duplicated counters. Example(s): BUS_IO and SI_WRITEQ_INSERTS.
4.3 Basic Events

Table 4-1 summarizes two basic execution monitors. The Montecito retired instruction count, IA64_INST_RETIRED, includes both predicated true and predicated off instructions, but excludes RSE operations.
Table 4-1. Performance Monitors for Basic Events

Symbol Name        Event Code  IAR DAR OPC  Max Inc/Cyc  T  Description
CPU_OP_CYCLES      0x12        Y   N   Y    1            C  CPU Operating Cycles
IA64_INST_RETIRED  0x08        Y   N   Y    6            A  Retired Itanium® Instructions

Table 4-2. Derived Monitors for Basic Events

Symbol Name  Description                               Equation
IA64_IPC     Average Number of Itanium® Instructions   IA64_INST_RETIRED /
             Per Cycle During Itanium architecture-    CPU_OP_CYCLES
             based Code Sequences
4.4 Instruction Dispersal Events

Instruction cache lines are delivered to the execution core and dispersed to the Montecito processor functional units. The Montecito processor can issue, or disperse, 6 instructions per clock cycle. In other words, the Montecito processor can issue up to 6 instruction slots (or syllables). The following events are intended to give users an idea of how effectively instructions are dispersed and why they are not dispersed at full capacity. There are five reasons for not dispersing at full capacity. One is measured by DISP_STALLED. For every clock that dispersal is stalled, dispersal takes a hit of 6 syllables. The other four reasons are measured by SYLL_NOT_DISPERSED. Due to the way the hardware is designed, SYLL_NOT_DISPERSED may contain an overcount due to implicit and explicit bits; although this number should be small, SYLL_OVERCOUNT will provide an accurate count for it.
The relationship between these events is as follows:

6 * (CPU_OP_CYCLES - DISP_STALLED) = INST_DISPERSED + SYLL_NOT_DISPERSED.ALL - SYLL_OVERCOUNT.ALL
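This identity can be sanity-checked directly from a set of counter readings; the values below are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    /* Check the dispersal accounting identity:
     * 6*(CPU_OP_CYCLES - DISP_STALLED) ==
     *     INST_DISPERSED + SYLL_NOT_DISPERSED.ALL - SYLL_OVERCOUNT.ALL */
    int main(void) {
        uint64_t cpu_op_cycles      = 1000000;   /* illustrative values */
        uint64_t disp_stalled       = 120000;
        uint64_t inst_dispersed     = 4100000;
        uint64_t syll_not_dispersed = 1185000;
        uint64_t syll_overcount     = 5000;

        uint64_t lhs = 6 * (cpu_op_cycles - disp_stalled);
        uint64_t rhs = inst_dispersed + syll_not_dispersed - syll_overcount;

        printf("issue slots: %llu, accounted: %llu (%s)\n",
               (unsigned long long)lhs, (unsigned long long)rhs,
               lhs == rhs ? "consistent" : "inconsistent");
        return 0;
    }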
Table 4-3. Performance Monitors for Instruction Dispersal Events

Symbol Name         Event Code  IAR DAR OPC  Max Inc/Cyc  Description
DISP_STALLED        0x49        N   N   N    1            Number of cycles dispersal stalled
INST_DISPERSED      0x4d        Y   N   N    6            Syllables dispersed from REN to REG stage
SYLL_NOT_DISPERSED  0x4e        Y   N   N    5            Syllables not dispersed
SYLL_OVERCOUNT      0x4f        Y   N   N    2            Syllables overcounted
4.5 Instruction Execution Events

Retired instruction counts, IA64_TAGGED_INST_RETIRED and NOPS_RETIRED, are based on tag information specified by the address range check and opcode match facilities. A separate event, PREDICATE_SQUASHED_RETIRED, is provided to count predicated off instructions.

The FP monitors listed in the table capture dynamic information about pipeline flushes and flush-to-zero occurrences due to floating-point operations. The FP_OPS_RETIRED event counts the number of retired FP operations.

As Table 4-4 describes, monitors for control and data speculation capture dynamic run-time information: the number of failed instructions (INST_FAILED_CHKS_RETIRED.ALL), the number of advanced load checks and check loads (INST_CHKA_LDC_ALAT.ALL), and failed advanced load checks and check loads (INST_FAILED_CHKA_LDC_ALAT.ALL) as seen by the ALAT. The number of retired instructions is monitored by the IA64_TAGGED_INST_RETIRED event, given the appropriate opcode mask. Since the Montecito processor ALAT is updated by operations on mispredicted branch paths, the number of advanced load checks and check loads needs an explicit event (INST_CHKA_LDC_ALAT.ALL).
Table 4-4. Performance Monitors for Instruction Execution Events

Symbol Name                 Event Code  IAR DAR OPC  Max Inc/Cyc  Description
ALAT_CAPACITY_MISS          0x58        Y   Y   Y    2            ALAT Entry Replaced
FP_FAILED_FCHKF             0x06        Y   N   N    1            Failed fchkf
FP_FALSE_SIRSTALL           0x05        Y   N   N    1            SIR stall without a trap
FP_FLUSH_TO_ZERO            0x0b        Y   N   N    2            FP Result Flushed to Zero
FP_OPS_RETIRED              0x09        Y   N   N    6            Retired FP operations
FP_TRUE_SIRSTALL            0x03        Y   N   N    1            SIR stall asserted and leads to a trap
IA64_TAGGED_INST_RETIRED    0x08        Y   N   Y    6            Retired Tagged Instructions
INST_CHKA_LDC_ALAT          0x56        Y   Y   Y    2            Advanced Check Loads
INST_FAILED_CHKA_LDC_ALAT   0x57        Y   Y   Y    1            Failed Advanced Check Loads
INST_FAILED_CHKS_RETIRED    0x55        N   N   N    1            Failed Speculative Check Loads
LOADS_RETIRED               0xcd        Y   Y   Y    4            Retired Loads
MISALIGNED_LOADS_RETIRED    0xce        Y   Y   Y    4            Retired Misaligned Load Instructions
MISALIGNED_STORES_RETIRED   0xd2        Y   Y   Y    2            Retired Misaligned Store Instructions
NOPS_RETIRED                0x50        Y   N   Y    6            Retired NOP Instructions
PREDICATE_SQUASHED_RETIRED  0x51        Y   N   Y    6            Instructions Squashed Due to Predicate Off
STORES_RETIRED              0xd1        Y   Y   Y    2            Retired Stores
UC_LOADS_RETIRED            0xcf        Y   Y   Y    4            Retired Uncacheable Loads
UC_STORES_RETIRED           0xd0        Y   Y   Y    2            Retired Uncacheable Stores

Table 4-5. Derived Monitors for Instruction Execution Events

Symbol Name           Description                      Equation
ALAT_EAR_EVENTS       Counts the number of ALAT        DATA_EAR_EVENTS
                      events captured by EAR
CTRL_SPEC_MISS_RATIO  Control Speculation Miss Ratio   INST_FAILED_CHKS_RETIRED.ALL /
                                                       IA64_TAGGED_INST_RETIRED[chk.s]
DATA_SPEC_MISS_RATIO  Data Speculation Miss Ratio      INST_FAILED_CHKA_LDC_ALAT.ALL /
                                                       INST_CHKA_LDC_ALAT.ALL
4.6 Stall Events

Montecito processor stall accounting is separated into front-end and back-end stall accounting. Back-end and front-end events should not be compared since they are counted in different stages of the pipeline.
The back-end can be stalled due to five distinct mechanisms: FPU/L1D, RSE, EXE,
branch/exception or the front-end. BACK_END_BUBBLE provides an overview of which
mechanisms are producing stalls while the other back-end counters provide more explicit
information broken down by category. Each time there is a stall, a bubble is inserted in only one
location in the pipeline. Each time there is a flush, bubbles are inserted in all locations in the
pipeline. With the exception of BACK_END_BUBBLE, the back-end stall accounting events are
prioritized in order to mimic the operation of the main pipe (i.e. priority from high to low is given
to: BE_FLUSH_BUBBLE.XPN, BE_FLUSH_BUBBLE.BRU, L1D_FPU stalls, EXE stalls, RSE
stalls, front-end stalls). This prioritization guarantees that the events are mutually exclusive and
only the most important cause, the one latest in the pipeline, is counted.
The Montecito processor’s front-end can be stalled due to seven distinct mechanisms: FEFLUSH,
TLBMISS, IMISS, branch, FILL-RECIRC, BUBBLE, IBFULL (listed in priority from high to
low). The front-end stalls have exactly the same effect on the pipeline so their accounting is
simpler.
During every clock in which the CPU is not in a halted state, the back-end pipeline has either a
bubble or it retires 1 or more instructions, CPU_OP_CYCLES = BACK_END_BUBBLE.all +
(IA64_INST_RETIRED >= 1). To further investigate bubbles occurring in the back-end of the
pipeline the following equation holds true: BACK_END_BUBBLE.all = BE_RSE_BUBBLE.all +
BE_EXE_BUBBLE.all + BE_L1D_FPU_BUBBLE.all + BE_FLUSH_BUBBLE.all +
BACK_END_BUBBLE.fe.
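The decomposition above suggests a simple breakdown report. A sketch with illustrative counter values only:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Illustrative counter values only. */
        uint64_t back_end_bubble_all = 400000;
        uint64_t parts[5] = { 90000,   /* BE_RSE_BUBBLE.all     */
                              150000,  /* BE_EXE_BUBBLE.all     */
                              60000,   /* BE_L1D_FPU_BUBBLE.all */
                              40000,   /* BE_FLUSH_BUBBLE.all   */
                              60000 }; /* BACK_END_BUBBLE.fe    */
        const char *names[5] = { "RSE", "EXE", "L1D/FPU", "flush",
                                 "front-end" };

        for (int i = 0; i < 5; i++)
            printf("%-10s %6.2f%% of back-end bubbles\n", names[i],
                   100.0 * (double)parts[i] / (double)back_end_bubble_all);
        return 0;
    }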
Note: CPU_OP_CYCLES is not incremented during a HALT state. If a measurement is set up to match
clock cycles to bubbles to instructions retired (as outlined above) and a halt occurs within the
measurement interval, measuring CYCLES_HALTED in PMD10 may be used to compensate.
Each of the stall events (summarized in Table 4-6) takes a umask to choose among several available
sub-events. Please refer to the detailed event descriptions in Section 4.15 for a list of available
sub-events and their individual descriptions.
Table 4-6. Performance Monitors for Stall Events

Symbol Name                 Event Code  IAR DAR OPC  Max Inc/Cyc  Description
BACK_END_BUBBLE             0x00        N   N   N    1            Full pipe bubbles in main pipe
BE_EXE_BUBBLE               0x02        N   N   N    1            Full pipe bubbles in main pipe due to Execution unit stalls
BE_FLUSH_BUBBLE             0x04        N   N   N    1            Full pipe bubbles in main pipe due to flushes
BE_L1D_FPU_BUBBLE           0xca        N   N   N    1            Full pipe bubbles in main pipe due to FP or L1D cache
BE_LOST_BW_DUE_TO_FE        0x72        N   N   N    2            Invalid bundles if BE not stalled for other reasons
BE_RSE_BUBBLE               0x01        N   N   N    1            Full pipe bubbles in main pipe due to RSE stalls
FE_BUBBLE                   0x71        N   N   N    1            Bubbles seen by FE
FE_LOST_BW                  0x70        N   N   N    2            Invalid bundles at the entrance to IB
IDEAL_BE_LOST_BW_DUE_TO_FE  0x73        N   N   N    2            Invalid bundles at the exit from IB
4.7 Branch Events

Note that for branch events, retirement means a branch was reached and committed regardless of its predicate value. Details concerning prediction results are contained in pairs of monitors. For accurate misprediction counts, the BR_MISPRED_DETAIL count must be corrected by the corresponding BR_MISPRED_DETAIL2 (unknown path component) count. By performing this calculation for every umask, one can obtain a true value for the BR_MISPRED_DETAIL event.

The method for obtaining the true value of BR_PATH_PRED is slightly different. When there is more than one branch in a bundle and one is predicted as taken, all the higher number ports are forced to a predicted not taken mode without actually knowing their true prediction.

The true OKPRED_NOTTAKEN predicted path information can be obtained by calculating:

BR_PATH_PRED.[branch type].OKPRED_NOTTAKEN - BR_PATH_PRED2.[branch type].UNKNOWNPRED_NOTTAKEN

using the same "branch type" (ALL, IPREL, RETURN, NRETIND) specified for both events.

Similarly, the true MISPRED_TAKEN predicted path information can be obtained by calculating:

BR_PATH_PRED.[branch type].MISPRED_TAKEN - BR_PATH_PRED2.[branch type].UNKNOWNPRED_TAKEN

using the same "branch type" (ALL, IPREL, RETURN, NRETIND) selected for both events.

BRANCH_EVENT counts the number of events captured by the Execution Trace Buffer. For detailed information on the ETB please refer to Section 3.3.10.
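Both corrections are plain subtractions over the raw counts; the counter values below are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Illustrative counter values for one branch type (e.g. ALL). */
        uint64_t path_ok_nt      = 52000;  /* BR_PATH_PRED.ALL.OKPRED_NOTTAKEN       */
        uint64_t path2_unk_nt    = 7000;   /* BR_PATH_PRED2.ALL.UNKNOWNPRED_NOTTAKEN */
        uint64_t path_mp_taken   = 9000;   /* BR_PATH_PRED.ALL.MISPRED_TAKEN         */
        uint64_t path2_unk_taken = 6500;   /* BR_PATH_PRED2.ALL.UNKNOWNPRED_TAKEN    */

        printf("true OKPRED_NOTTAKEN: %llu\n",
               (unsigned long long)(path_ok_nt - path2_unk_nt));
        printf("true MISPRED_TAKEN:   %llu\n",
               (unsigned long long)(path_mp_taken - path2_unk_taken));
        return 0;
    }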
Table 4-7. Performance Monitors for Branch Events

Symbol Name           Event Code  IAR DAR OPC  Max Inc/Cyc  Description
BE_BR_MISPRED_DETAIL  0x61        Y   N   Y    1            BE branch misprediction detail
BRANCH_EVENT          0x11        Y   N   Y    1            Branch Event Captured
BR_MISPRED_DETAIL     0x5b        Y   N   Y    3            FE Branch Mispredict Detail
BR_MISPRED_DETAIL2    0x68        Y   N   Y    2            FE Branch Mispredict Detail (Unknown path component)
BR_PATH_PRED          0x54        Y   N   Y    3            FE Branch Path Prediction Detail
BR_PATH_PRED2         0x6a        Y   N   Y    2            FE Branch Path Prediction Detail (Unknown prediction component)
ENCBR_MISPRED_DETAIL  0x63        Y   N   Y    3            Number of encoded branches retired
4.8 Memory Hierarchy
This section summarizes events related to the Montecito processor’s memory hierarchy. The
memory hierarchy events are grouped as follows:
• L1 Instruction Cache and Prefetch Events (Section 4.8.1)
• L1 Data Cache Events (Section 4.8.2)
• L2 Instruction Cache Events (Section 4.8.3)
• L2 Data Cache Events (Section 4.8.4)
• L3 Cache Events (Section 4.8.5)
An overview of the Montecito processor’s three level memory hierarchy and its event monitors is
shown in Figure 4-1. The instruction and the data stream work through separate L1 caches. The L1
data cache is a write-through cache. Two separate L2I and L2D caches serve both the L1
instruction and data caches respectively, and are backed by a large unified L3 cache. Events for
individual levels of the cache hierarchy are described in the Section 4.8.1 through Section 4.8.3.
Figure 4-1. Event Monitors in the Itanium® 2 Processor Memory Hierarchy
4.8.1 L1 Instruction Cache and Prefetch Events
Table 4-8 describes and summarizes the events that the Montecito processor provides to monitor
L1 instruction cache demand fetch and prefetch activity. The instruction fetch monitors distinguish
between demand fetch (L1I_READS) and prefetch activity (L1I_PREFETCHES). The amount of
data returned from the L2I to the L1 instruction cache and the Instruction Streaming Buffer is
monitored by two events, L1I_FILLS and ISB_LINES_IN. The L1I_EAR_EVENTS monitor
counts how many instruction cache or L1ITLB misses are captured by the instruction event address
register.
The L1 instruction cache and prefetch events can be qualified by the instruction address range
check, but not by the opcode matcher. Since instruction cache and prefetch events occur early in
the processor pipeline, they include events caused by speculative, wrong-path instructions as well
as predicated-off instructions. Since the address range check is based on speculative instruction
addresses rather than retired instruction addresses, event counts may be inaccurate when the range
checker is confined to address ranges smaller than the length of the processor pipeline (see
Chapter 3 for details).
L1I_EAR_EVENTS counts the number of events captured by the Montecito processor’s
instruction EARs. Please refer to Chapter 3 for more detailed information about the instruction
EARs.
Table 4-8. Performance Monitors for L1/L2 Instruction Cache and Prefetch Events

Symbol Name           Event Code  IAR DAR OPC  Max Inc/Cyc  Description
ISB_BUNPAIRS_IN       0x46        Y   N   N    1            Bundle pairs written from L2I into FE
L1I_EAR_EVENTS        0x43        Y   N   N    1            Instruction EAR Events
L1I_FETCH_ISB_HIT     0x66        Y   N   N    1            "Just-in-time" instruction fetch hitting in and being bypassed from ISB
L1I_FETCH_RAB_HIT     0x65        Y   N   N    1            Instruction fetch hitting in RAB
L1I_FILLS             0x41        Y   N   N    1            L1 Instruction Cache Fills
L1I_PREFETCHES        0x44        Y   N   N    1            L1 Instruction Prefetch Requests
L1I_PREFETCH_STALL    0x67        N   N   N    1            Why prefetch pipeline is stalled?
L1I_PURGE             0x4b        Y   N   N    1            L1ITLB purges handled by L1I
L1I_PVAB_OVERFLOW     0x69        N   N   N    1            PVAB overflow
L1I_RAB_ALMOST_FULL   0x64        N   N   N    1            Is RAB almost full?
L1I_RAB_FULL          0x60        N   N   N    1            Is RAB full?
L1I_READS             0x40        Y   N   N    1            L1 Instruction Cache Reads
L1I_SNOOP             0x4a        Y   Y   Y    1            Snoop requests handled by L1I
L1I_STRM_PREFETCHES   0x5f        Y   N   N    1            L1 Instruction Cache line prefetch requests
L2I_DEMAND_READS      0x42        Y   N   N    1            L1 Instruction Cache and ISB Misses
L2I_PREFETCHES        0x45        Y   N   N    1            L2 Instruction Prefetch Requests
Table 4-9. Derived Monitors for L1 Instruction Cache and Prefetch Events

Symbol Name                 Description                                                              Equation
L1I_MISSES                  L1I Misses                                                               L2I_DEMAND_READS
ISB_LINES_IN                Number of cache lines written from L2I (and beyond) into the front end   ISB_BUNPAIRS_IN / 4
L1I_DEMAND_MISS_RATIO       L1I Demand Miss Ratio                                                    L2I_DEMAND_READS / L1I_READS
L1I_MISS_RATIO              L1I Miss Ratio                                                           (L1I_MISSES + L2I_PREFETCHES) / (L1I_READS + L1I_PREFETCHES)
L1I_PREFETCH_MISS_RATIO     L1I Prefetch Miss Ratio                                                  L2I_PREFETCHES / L1I_PREFETCHES
L1I_REFERENCES              Number of L1 Instruction Cache reads and fills                           L1I_READS + L1I_PREFETCHES
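As a worked illustration of the equations in Table 4-9, the sketch below (not part of the manual;
the raw counts are hypothetical values assumed to have been read from the appropriate PMDs)
computes the derived L1I miss ratios from the four raw monitors:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical raw counts read from the PMDs. */
        unsigned long l1i_reads        = 1000000; /* L1I_READS */
        unsigned long l1i_prefetches   =  250000; /* L1I_PREFETCHES */
        unsigned long l2i_demand_reads =   40000; /* L2I_DEMAND_READS */
        unsigned long l2i_prefetches   =   10000; /* L2I_PREFETCHES */

        /* L1I_MISSES is simply L2I_DEMAND_READS (Table 4-9). */
        unsigned long l1i_misses = l2i_demand_reads;

        /* L1I_DEMAND_MISS_RATIO = L2I_DEMAND_READS / L1I_READS */
        double demand_miss_ratio = (double)l2i_demand_reads / (double)l1i_reads;

        /* L1I_MISS_RATIO = (L1I_MISSES + L2I_PREFETCHES) /
                            (L1I_READS + L1I_PREFETCHES)          */
        double miss_ratio = (double)(l1i_misses + l2i_prefetches)
                          / (double)(l1i_reads + l1i_prefetches);

        printf("L1I demand miss ratio:  %.4f\n", demand_miss_ratio);
        printf("L1I overall miss ratio: %.4f\n", miss_ratio);
        return 0;
    }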
4.8.2 L1 Data Cache Events
Table 4-10 lists the Montecito processor’s L1 data cache monitors. As shown in Figure 4-1, the
write-through L1 data cache services cacheable loads, integer and RSE loads, check loads and
hinted L2 memory references. DATA_REFERENCES is the number of issued data memory
references.
L1 data cache reads (L1D_READS) and L1 data cache misses (L1D_READ_MISSES) monitor the
read hit/miss rate of the L1 data cache. RSE operations are included in all data cache monitors, but
are not broken down explicitly. The DATA_EAR_EVENTS monitor counts how many data cache
or DTLB misses are captured by the Data Event Address Register. Please refer to Section 3.3.9 for
more detailed information about the data EARs.
L1D cache events have been divided into six sets (sets 0, 1, 2, 3, 4 and 6; set 5 is reserved). Events
from different sets of L1D cache events cannot be measured at the same time. Each set is selected
by the event code programmed into PMC5 (i.e., to measure any of the events in a given set, one of
them must be measured by PMD5). There are no limitations on umasks. Monitors belonging to
each set are explicitly presented in Table 4-10 through Table 4-16; the sketch below illustrates the
set-selection rule.
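The following hypothetical helper (a sketch, not part of any Intel library) maps an L1D event code
from Table 4-11 through Table 4-16 to its set, so that a measurement tool can reject selections that
mix sets:

    /* Returns the L1D event set for a given event code, or -1 if the code
       is not one of the set-restricted L1D events (set 5 is reserved). */
    int l1d_event_set(unsigned int event_code)
    {
        switch (event_code) {
        case 0xc0: case 0xc1: case 0xc2: case 0xc3: return 0; /* Table 4-11 */
        case 0xc4: case 0xc5: case 0xc7:            return 1; /* Table 4-12 */
        case 0xca:                                  return 2; /* Table 4-13 */
        case 0xcd: case 0xce: case 0xcf:            return 3; /* Table 4-14 */
        case 0xd0: case 0xd1: case 0xd2:            return 4; /* Table 4-15 */
        case 0xd8: case 0xd9:                       return 6; /* Table 4-16 */
        default:                                    return -1;
        }
    }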
Table 4-10. Performance Monitors for L1 Data Cache Events

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
DATA_EAR_EVENTS         0xc8        Y    Y    Y    1            L1 Data Cache EAR Events
L1D_READS_SET0          0xc2        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET0    0xc3        Y    Y    Y    4            Data memory references issued to memory pipeline
L1D_READS_SET1          0xc4        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET1    0xc5        Y    Y    Y    4            Data memory references issued to memory pipeline
L1D_READ_MISSES         0xc7        Y    Y    Y    2            L1 Data Cache Read Misses
4.8.2.1 L1D Cache Events (set 0)

Table 4-11. Performance Monitors for L1D Cache Set 0

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L1DTLB_TRANSFER         0xc0        Y    Y    Y    1            L1DTLB misses hit in L2DTLB for access counted in L1D_READS
L2DTLB_MISSES           0xc1        Y    Y    Y    4            L2DTLB Misses
L1D_READS_SET0          0xc2        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET0    0xc3        Y    Y    Y    4            Data memory references issued to memory pipeline

4.8.2.2 L1D Cache Events (set 1)

Table 4-12. Performance Monitors for L1D Cache Set 1

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L1D_READS_SET1          0xc4        Y    Y    Y    2            L1 Data Cache Reads
DATA_REFERENCES_SET1    0xc5        Y    Y    Y    4            Data memory references issued to memory pipeline
L1D_READ_MISSES         0xc7        Y    Y    Y    2            L1 Data Cache Read Misses

4.8.2.3 L1D Cache Events (set 2)

Table 4-13. Performance Monitors for L1D Cache Set 2

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
BE_L1D_FPU_BUBBLE       0xca        N    N    N    1            Full pipe bubbles in main pipe due to FP or L1D cache
4.8.2.4 L1D Cache Events (set 3)

Table 4-14. Performance Monitors for L1D Cache Set 3

Symbol Name                  Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
LOADS_RETIRED                0xcd        Y    Y    Y    4            Retired Loads
MISALIGNED_LOADS_RETIRED     0xce        Y    Y    Y    4            Retired Misaligned Load Instructions
UC_LOADS_RETIRED             0xcf        Y    Y    Y    4            Retired Uncacheable Loads
4.8.2.5 L1D Cache Events (set 4)

Table 4-15. Performance Monitors for L1D Cache Set 4

Symbol Name                   Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
MISALIGNED_STORES_RETIRED     0xd2        Y    Y    Y    2            Retired Misaligned Store Instructions
STORES_RETIRED                0xd1        Y    Y    Y    2            Retired Stores
UC_STORES_RETIRED             0xd0        Y    Y    Y    2            Retired Uncacheable Stores

4.8.2.6 L1D Cache Events (set 6)

Table 4-16. Performance Monitors for L1D Cache Set 6

Symbol Name            Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
LOADS_RETIRED_INTG     0xd8        Y    Y    Y    2            Integer loads retired
SPEC_LOADS_NATTED      0xd9        Y    Y    Y    2            Times ld.s or ld.sa NaT'd
4.8.3 L2 Instruction Cache Events

Table 4-17. Performance Monitors for L2I Cache

Symbol Name           Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2I_READS             0x78        Y    N    Y    1            L2I Cacheable Reads
L2I_UC_READS          0x79        Y    N    Y    1            L2I uncacheable reads
L2I_VICTIMIZATIONS    0x7a        Y    N    Y    1            L2I victimizations
L2I_RECIRCULATES      0x7b        Y    N    Y    1            L2I recirculates
L2I_L3_REJECTS        0x7c        Y    N    Y    1            L3 rejects
L2I_HIT_CONFLICTS     0x7d        Y    N    Y    1            L2I hit conflicts
L2I_SPEC_ABORTS       0x7e        Y    N    Y    1            L2I speculative aborts
L2I_SNOOP_HITS        0x7f        Y    N    Y    1            L2I snoop hits

Table 4-18. Derived Monitors for L2I Cache

Symbol Name       Description                                              Equation
L2I_SNOOPS        Number of snoops received by the L2I                     L1I_SNOOPS
L2I_FILLS         L2I Fills                                                L2I_READS.MISS.DMND + L2I_READS.MISS.PFTCH
L2I_FETCHES       Requests made to L2I due to demand instruction fetches   L2I_READS.ALL.DMND
L2I_REFERENCES    Instruction requests made to L2I                         L2I_READS.ALL.ALL
L2I_MISS_RATIO    Percentage of L2I Misses                                 L2I_READS.MISS / L2I_READS.ALL
L2I_HIT_RATIO     Percentage of L2I Hits                                   L2I_READS.HIT / L2I_READS.ALL
4.8.4 L2 Data Cache Events

Table 4-19 summarizes the events available to monitor the Montecito processor L2D cache.

Most L2D events have been divided into eight sets. Only events within two of these sets (or
non-L2D events) can be measured at the same time. These two sets are selected by the event codes
programmed into PMC4 and PMC6 (i.e., to measure any of the events in a particular set, one of
them needs to be measured by PMD4 or PMD6).

Note: The opposite also holds. If PMC4 is not programmed to monitor an L2D event but PMC5 or PMC8
is (and similarly for PMC6 with PMC7/PMC9), the PMD values are undefined.

Any event set can be measured by programming either PMC4 or PMC6. Once PMC4 is
programmed to measure an event from one L2D event set, PMD4, PMD5 and PMD8 can only
measure events from that same L2D event set (PMD5 and PMD8 share the umask programmed into
PMC4). Similarly, once PMC6 is programmed to monitor another set (which could be the same set
as measured by PMC4), PMD6, PMD7 and PMD9 can measure events from that set only. None of
the L2 data cache events can be measured using PMD10-15.
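These pairing rules can be restated as a small validity check. The sketch below is a hypothetical
helper, not actual driver code; it assumes the L2D event set of each selected event is already known:

    /* pmc4_set/pmc6_set: L2D event set selected by PMC4/PMC6, or -1 if that
       PMC is not programmed with an L2D event. Returns nonzero if the given
       PMD may count an event from event_set under the rules above. */
    int l2d_pmd_selection_ok(int pmd, int event_set,
                             int pmc4_set, int pmc6_set)
    {
        switch (pmd) {
        case 4: case 5: case 8:  /* these PMDs follow PMC4's set (and umask) */
            return pmc4_set >= 0 && event_set == pmc4_set;
        case 6: case 7: case 9:  /* these PMDs follow PMC6's set */
            return pmc6_set >= 0 && event_set == pmc6_set;
        default:                 /* PMD10-15 cannot count L2D events */
            return 0;
        }
    }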
Support for the .all bit has the same restrictions as the set restrictions. The value set for .all in
PMC4 applies to all of the events selected by it; even if the .all values in PMC5 and PMC8 differ
from the value in PMC4, PMC4's value selects the capability. The same applies to PMC6, PMC7
and PMC9. Original Montecito documentation claimed that thread 0's PMC4 .me/.all setting
applied to PMC4-PMC7, but that is no longer true; this bit is available to both threads. Hence, it is
possible for one thread's PMDs to monitor just the events credited to that thread while the other
thread's PMDs monitor events for both threads (if PMC4.all is set). Note that some events do not
support .all counting. If .all counting is enabled for events that do not support it, the resulting
counts will be wrong.
While the L2D events support threading, not all counts have access to the exact thread ID needed.
Each count is labeled with one of ActiveTrue, ActiveApprox, or TrueThrd. ActiveTrue means that
the event is counted against the currently active thread, and that thread is the only one that can see
the event when it is counted. ActiveApprox means the event is counted against the currently active
thread, but there are some corner cases where the event may actually be due to the other,
non-active thread; in most cases the error due to this approximation is assumed to be negligible.
TrueThrd indicates that the L2D cache knows which thread the count belongs to, independent of
the active-thread indication, and that knowledge is always correct.
Table 4-19. Performance Monitors for L2 Data Cache Events

Symbol Name                  Event Code  IAR  DAR  OPC  .all capable  Max Inc/Cyc  Description
L2D_OZQ_CANCELS0             0xe0        Y    Y    Y    Y             4            L2D OZQ cancels
L2D_OZQ_FULL                 0xe1, 0xe3  N    N    N    N             1            L2D OZQ is full
L2D_OZQ_CANCELS1             0xe2        Y    Y    Y    Y             4            L2D OZQ cancels
L2D_BYPASS                   0xe4        Y    Y    Y    Y/N           1            L2D Hit or Miss Bypass (.all support is umask dependent)
L2D_OZQ_RELEASE              0xe5        N    N    N    N             1            Clocks with release ordering attribute existed in L2D OZQ
L2D_REFERENCES               0xe6        Y    Y    Y    Y             4            Data RD/WR access to L2D
L2D_L3ACCESS_CANCEL          0xe8        Y    Y    Y    N             1            Canceled L3 accesses
L2D_OZDB_FULL                0xe9        N    N    N    Y             1            L2D OZ data buffer is full
L2D_FORCE_RECIRC             0xea        Y    Y    Y    Y/N           4            Forced recirculates
L2D_ISSUED_RECIRC_OZQ_ACC    0xeb        Y    Y    Y    Y             1            Count the number of times a recirculate issue was attempted and not preempted
L2D_BAD_LINES_SELECTED       0xec        Y    Y    Y    Y             4            Valid line replaced when invalid line is available
L2D_STORE_HIT_SHARED         0xed        Y    Y    Y    Y             2            Store hit a shared line
L2D_OZQ_ACQUIRE              0xef        N    N    N    Y             1            Clocks with acquire ordering attribute existed in L2D OZQ
L2D_OPS_ISSUED               0xf0        Y    Y    Y    N             4            Different operations issued by L2D
L2D_FILLB_FULL               0xf1        N    N    N    N             1            L2D Fill buffer is full
L2D_FILL_MESI_STATE          0xf2        Y    Y    Y    Y             1            MESI states of fills to L2D cache
L2D_VICTIMB_FULL             0xf3        N    N    N    Y             1            L2D victim buffer is full
L2D_MISSES                   0xcb        Y    Y    Y    Y             1            An L2D miss has been issued to the L3; does not include secondary misses
L2D_INSERT_HITS              0xb1        Y    Y    Y    Y             4            Count number of times an inserting data request hit in the L2D
L2D_INSERT_MISSES            0xb0        Y    Y    Y    Y             4            Count number of times an inserting data request missed in the L2D
Table 4-20. Derived Monitors for L2 Data Cache Events

Symbol Name             Description                                                            Equation
L2D_READS               L2 Data Read Requests                                                  L2D_REFERENCES.READS
L2D_WRITES              L2 Data Write Requests                                                 L2D_REFERENCES.WRITES
L2D_MISS_RATIO          Percentage of L2D Misses                                               L2D_INSERT_MISSES / L2D_REFERENCES
L2D_HIT_RATIO           Percentage of L2D Hits                                                 L2D_INSERT_HITS / L2D_REFERENCES
L2D_RECIRC_ATTEMPTS     Number of times the L2 issue logic attempted to issue a recirculate
4.8.4.1 L2 Data Cache Events (set 0)

Table 4-21. Performance Monitors for L2 Data Cache Set 0

Symbol Name         Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_OZQ_CANCELS0    0xe0        Y    Y    Y    4            L2D OZQ cancels-TrueThrd
L2D_OZQ_FULL        0xe1, 0xe3  N    N    N    1            L2D OZQ Full-ActiveApprox
L2D_OZQ_CANCELS1    0xe2        Y    Y    Y    4            L2D OZQ cancels-TrueThrd

L2D_OZQ_FULL is not .all capable.
4.8.4.2 L2 Data Cache Events (set 1)

Table 4-22. Performance Monitors for L2 Data Cache Set 1

Symbol Name        Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_BYPASS         0xe4        Y    Y    Y    4            Count L2 Hit bypasses-TrueThrd
L2D_OZQ_RELEASE    0xe5        N    N    N    1            Effective Release is valid in Ozq-ActiveApprox

The L2D_BYPASS count on Itanium 2 processors was too speculative to be useful. It has been
fixed: the event now counts how many bypasses occurred in a given cycle, rather than signaling a 1
for 1-4 bypasses. The 5- and 7-cycle umasks of L2D_BYPASS and the L2D_OZQ_RELEASE
counts are not .all capable.
4.8.4.3 L2 Data Cache Events (set 2)
Table 4-23. Performance Monitors for L2 Data Cache Set 2

Symbol Name       Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_REFERENCES    0xe6        Y    Y    Y    4            Inserts of Data Accesses into Ozq-ActiveTrue
4.8.4.4 L2 Data Cache Events (set 3)

Table 4-24. Performance Monitors for L2 Data Cache Set 3

Symbol Name             Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_L3_ACCESS_CANCEL    0xe8        Y    Y    Y    1            L2D request to L3 was cancelled-TrueThrd
L2D_OZDB_FULL           0xe9        N    N    N    1            L2D OZ data buffer is full-ActiveApprox
L2D_L3_ACCESS_CANCEL events are not .all capable.
4.8.4.5 L2 Data Cache Events (set 4)

Table 4-25. Performance Monitors for L2 Data Cache Set 4

Symbol Name                  Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_FORCE_RECIRC             0xea        Y    Y    Y    4            Forced recirculates - ActiveTrue or ActiveApprox
L2D_ISSUED_RECIRC_OZQ_ACC    0xeb        Y    Y    Y    1            Ozq Issued Recirculate - TrueThrd

Some umasks of L2D_FORCE_RECIRC are not .all capable.

4.8.4.6 L2 Data Cache Events (set 5)

Table 4-26. Performance Monitors for L2 Data Cache Set 5

Symbol Name               Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_BAD_LINES_SELECTED    0xec        Y    Y    Y    4            Valid line replaced when invalid line is available
L2D_STORE_HIT_SHARED      0xed        Y    Y    Y    2            Store hit a shared line

4.8.4.7 L2 Data Cache Events (set 6)

Table 4-27. Performance Monitors for L2 Data Cache Set 6

Symbol Name        Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_OZQ_ACQUIRE    0xef        N    N    N    1            Valid acquire operation in Ozq-TrueThrd
4.8.4.8 L2 Data Cache Events (set 7)

Table 4-28. Performance Monitors for L2 Data Cache Set 7

Symbol Name       Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_OPS_ISSUED    0xf0        Y    Y    Y    4            Different operations issued by L2D-TrueThrd
L2D_FILLB_FULL    0xf1        N    N    N    1            L2D Fill buffer is full-ActiveApprox

L2D_OPS_ISSUED and L2D_FILLB_FULL are not .all capable.
4.8.4.9 L2 Data Cache Events (set 8)

Table 4-29. Performance Monitors for L2 Data Cache Set 8

Symbol Name            Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_FILL_MESI_STATE    0xf2        Y    Y    Y    1            Fill to L2D is of a particular MESI value. TrueThrd
L2D_VICTIMB_FULL       0xf3        N    N    N    1            L2D victim buffer is full-ActiveApprox

4.8.4.10 L2 Data Cache Events (Not Set Restricted)

These events are sent to the PMU block directly and thus are not set restricted.

Table 4-30. Performance Monitors for L2D Cache - Not Set Restricted

Symbol Name          Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L2D_MISSES           0xcb        Y    Y    Y    1            An L2D miss has been issued to the L3; does not include secondary misses
L2D_INSERT_MISSES    0xb0        Y    Y    Y    4            An inserting Ozq op was a miss on its first lookup
L2D_INSERT_HITS      0xb1        Y    Y    Y    4            An inserting Ozq op was a hit on its first lookup

4.8.5 L3 Cache Events

Table 4-31 summarizes the directly-measured L3 cache events. An extensive list of derived events
is provided in Table 4-32.
Table 4-31. Performance Monitors for L3 Unified Cache Events

Symbol Name          Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
L3_LINES_REPLACED    0xdf        N    N    N    1            L3 Cache Lines Replaced
L3_MISSES            0xdc        Y    Y    Y    1            L3 Misses
L3_READS             0xdd        Y    Y    Y    1            L3 Reads
L3_REFERENCES        0xdb        Y    Y    Y    1            L3 References
L3_WRITES            0xde        Y    Y    Y    1            L3 Writes
4.9 System Events

The debug register match events count how often the address of any instruction or data breakpoint
register (IBR or DBR) matches the current retired instruction pointer
(CODE_DEBUG_REGISTER_MATCHES) or the current data memory address
(DATA_DEBUG_REGISTER_MATCHES). CPU_CPL_CHANGES counts the number of
privilege level transitions due to interruptions, system calls (epc), returns (demoting branches), and
instructions.
Table 4-33. Performance Monitors for System Events

Symbol Name                    Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
CPU_CPL_CHANGES                0x13        N    N    N    1            Privilege Level Changes
DATA_DEBUG_REGISTER_FAULT      0x52        N    N    N    1            Fault due to data debug register match to load/store instruction
DATA_DEBUG_REGISTER_MATCHES    0xc6        Y    Y    Y    1            Data debug register matches data address of memory reference
SERIALIZATION_EVENTS           0x53        N    N    N    1            Number of srlz.i instructions
CYCLES_HALTED                  0x18        N    N    N    1            Number of core cycles the thread is in low-power halted state. NOTE: only the PMC/PMD10 pair is capable of counting this event.

Table 4-34. Derived Monitors for System Events

Symbol Name    Description    Equation
4.10 TLB Events

The Montecito processor instruction and data TLBs and the virtual hash page table walker are
monitored by the events described in Table 4-35.

L1ITLB_REFERENCES and L1DTLB_REFERENCES are derived from the respective
instruction/data cache access events. Note that ITLB_REFERENCES does not include prefetch
requests made to the L1I cache (L1I_PREFETCH_READS); this is because prefetches are
cancelled when they miss in the ITLB and thus do not trigger VHPT walks or software TLB miss
handling. ITLB_MISSES_FETCH and L2DTLB_MISSES count TLB misses.
L1ITLB_INSERTS_HPW and DTLB_INSERTS_HPW count the number of instruction/data TLB
inserts performed by the virtual hash page table walker.
Table 4-35. Performance Monitors for TLB Events

Symbol Name            Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
DTLB_INSERTS_HPW       0xc9        Y    Y    Y    4            Hardware Page Walker inserts to DTLB
HPW_DATA_REFERENCES    0x2d        Y    Y    Y    4            Data memory references to VHPT
L2DTLB_MISSES          0xc1        Y    Y    Y    4            L2DTLB Misses
L1ITLB_INSERTS_HPW     0x48        Y    N    N    1            L1ITLB Hardware Page Walker Inserts
ITLB_MISSES_FETCH      0x47        Y    N    N    1            ITLB Misses Demand Fetch
L1DTLB_TRANSFER        0xc0        Y    Y    Y    1            L1DTLB misses that hit in the L2DTLB for accesses counted in L1D_READS
Table 4-36. Derived Monitors for TLB Events

Symbol Name          Description                                                                                                Equation
L1DTLB_EAR_EVENTS    Counts the number of L1DTLB events captured by the EAR                                                     DATA_EAR_EVENTS
L1DTLB_MISS_RATIO    Miss Ratio of the L1DTLB servicing the L1D                                                                 L1DTLB_TRANSFER / L1D_READS_SET0 or L1DTLB_TRANSFER / L1D_READS_SET1
L1DTLB_REFERENCES    L1DTLB References                                                                                          DATA_REFERENCES_SET0 or DATA_REFERENCES_SET1
L1ITLB_EAR_EVENTS    Provides information on the number of L1ITLB events captured by the EAR. This is a subset of L1I_EAR_EVENTS    L1I_EAR_EVENTS
L1ITLB_MISS_RATIO    L1ITLB miss ratio                                                                                          ITLB_MISSES_FETCH.L1ITLB / L1I_READS
L2DTLB_MISS_RATIO    L2DTLB miss ratio                                                                                          L2DTLB_MISSES / DATA_REFERENCES_SET0 or L2DTLB_MISSES / DATA_REFERENCES_SET1
The Montecito processor has two data TLBs, called the L1DTLB and the L2DTLB (also referred to
as the DTLB). These TLBs are accessed in parallel, and the L2DTLB is the larger and slower of the
two. The possible actions for the combinations of hits and misses in these TLBs are outlined below:
• L1DTLB_hit=0, L2DTLB_hit=0: If enabled, the HPW kicks in and inserts a translation into
one or both TLBs.
• L1DTLB_hit=0, L2DTLB_hit=1: If floating-point, no action is taken; else a transfer is made
from the L2DTLB to the L1DTLB.
• L1DTLB_hit=1, L2DTLB_hit=0: If enabled, the HPW kicks in and inserts a translation into
one or both TLBs.
• L1DTLB_hit=1, L2DTLB_hit=1: No action is taken.
When a memory operation goes down the memory pipeline, DATA_REFERENCES counts it. If
the translation does not exist in the L2DTLB, L2DTLB_MISSES counts it, and if the HPW is
enabled, HPW_DATA_REFERENCES counts it. If the HPW finds the translation in the VHPT, it
inserts it into the L1DTLB and L2DTLB (as needed). If the translation exists in the L2DTLB, the
only case in which some work is done is when the translation does not exist in the L1DTLB: if the
operation is serviced by the L1D (see the L1D_READS description), L1DTLB_TRANSFER
counts it. For the purpose of calculating the TLB miss ratios, VHPT memory references have been
excluded from the DATA_REFERENCES event, and VHPT_REFERENCES is provided for
situations where one might want to add them back in. This flow is sketched in code form below.
Due to the TLB hardware design, there are some corner cases where some of these events will
show activity even though the instruction causing the activity never reaches retirement (they are
marked so). Since the processor is stalled even in these corner cases, they are included in the
counts; as long as all events used for calculating a metric are consistent with respect to this issue,
fairly accurate numbers can be expected.
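The following sketch restates that flow as code. It is a behavioral illustration only; the boolean
parameters are assumptions standing in for the hardware's internal state:

    /* Increment the counters that apply to one data memory reference. */
    void count_dtlb_events(int l1dtlb_hit, int l2dtlb_hit,
                           int hpw_enabled, int serviced_by_l1d,
                           unsigned long *data_references,
                           unsigned long *l2dtlb_misses,
                           unsigned long *hpw_data_references,
                           unsigned long *l1dtlb_transfer)
    {
        (*data_references)++;                /* DATA_REFERENCES: every op     */
        if (!l2dtlb_hit) {
            (*l2dtlb_misses)++;              /* L2DTLB_MISSES                 */
            if (hpw_enabled)
                (*hpw_data_references)++;    /* HPW_DATA_REFERENCES           */
        } else if (!l1dtlb_hit && serviced_by_l1d) {
            (*l1dtlb_transfer)++;            /* L1DTLB_TRANSFER: L2DTLB-to-   */
        }                                    /* L1DTLB transfer               */
    }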
4.11 System Bus Events

Table 4-37 lists the system bus transaction monitors. Many of the listed bus events take a umask
that qualifies the event by initiator. For all bus events, when "per cycles" is mentioned, SI clock
cycles (bus clock multiplied by the bus ratio) are implied rather than bus clock cycles, unless
otherwise specified. Numerous derived events are included in Table 4-38.
Table 4-37. Performance Monitors for System Bus Events

Symbol Name               Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
BUS_ALL                   0x87        N    N    N    1            Bus Transactions
ER_BRQ_LIVE_REQ_HI        0xb8        N    N    N    2            BRQ Live Requests (two most-significant bits of the 5-bit outstanding BRQ request count)
ER_BRQ_LIVE_REQ_LO        0xb9        N    N    N    7            BRQ Live Requests (three least-significant bits of the 5-bit outstanding BRQ request count)
ER_BRQ_REQ_INSERTED       0xba        N    N    N    1            BRQ Requests Inserted
ER_BKSNP_ME_ACCEPTED      0xbb        N    N    N    1            BacksnoopMe Requests accepted into the BRQ from the L2D (used by the L2D to get itself out of potential forward progress situations)
ER_REJECT_ALL_L1_REQ      0xbc        N    N    N    1            Number of cycles in which the BRQ was rejecting all L1I/L1D requests (for the "Big Hammer" forward progress logic)
ER_REJECT_ALL_L1D_REQ     0xbd        N    N    N    1            Number of cycles in which the BRQ was rejecting all L1D requests (for L1D/L1I forward progress)
ER_REJECT_ALL_L1I_REQ     0xbe        N    N    N    1            Number of cycles in which the BRQ was rejecting all L1I requests (for L1D/L1I forward progress)
BUS_DATA_CYCLE            0x88        N    N    N    1            Valid data cycle on the Bus
BUS_HITM                  0x84        N    N    N    1            Bus Hit Modified Line Transactions
BUS_IO                    0x90        N    N    N    1            IA-32 Compatible IO Bus Transactions
SI_IOQ_LIVE_REQ_HI        0x98        N    N    N    1            In-order Bus Queue Requests (one most-significant bit of the 4-bit outstanding IOQ request count)
SI_IOQ_LIVE_REQ_LO        0x97        N    N    N    7            In-order Bus Queue Requests (three least-significant bits of the 4-bit outstanding IOQ request count)
BUS_B2B_DATA_CYCLES       0x93        N    N    N    1            Back-to-back bursts of data
SI_CYCLES                 0x8e        N    N    N    1            Counts SI cycles
BUS_MEMORY                0x8a        N    N    N    1            Bus Memory Transactions
BUS_MEM_READ              0x8b        N    N    N    1            Full cache line D/I memory RD, RD invalidate, and BRIL
ER_MEM_READ_OUT_HI        0xb4        N    N    N    2            Outstanding memory RD transactions (upper two bits)
ER_MEM_READ_OUT_LO        0xb5        N    N    N    7            Outstanding memory RD transactions (lower three bits)
BUS_RD_DATA               0x8c        N    N    N    1            Bus Read Data Transactions
BUS_RD_HIT                0x80        N    N    N    1            Bus Read Hit Clean Non-local Cache Transactions
BUS_RD_HITM               0x81        N    N    N    1            Bus Read Hit Modified Non-local Cache Transactions
BUS_RD_INVAL_BST_HITM     0x83        N    N    N    1            Bus BRIL Burst Transaction Results in HITM
BUS_RD_INVAL_HITM         0x82        N    N    N    1            Bus BIL Transaction Results in HITM
BUS_RD_IO                 0x91        N    N    N    1            IA-32 Compatible IO Read Transactions
BUS_RD_PRTL               0x8d        N    N    N    1            Bus Read Partial Transactions
ER_SNOOPQ_REQ_HI          0xb6        N    N    N    1            ER Snoop Queue Requests (most-significant bit of the 4-bit count)
ER_SNOOPQ_REQ_LO          0xb7        N    N    N    7            ER Snoop Queue Requests (three least-significant bits of the 4-bit count)
BUS_SNOOP_STALL_CYCLES    0x8f        N    N    N    1            Bus Snoop Stall Cycles (from any agent)
BUS_WR_WB                 0x92        N    N    N    1            Bus Write Back Transactions
MEM_READ_CURRENT          0x89        N    N    N    1            Current Mem Read Transactions On Bus
SI_RQ_INSERTS             0x9e        N    N    N    2            SI request queue inserts
SI_RQ_LIVE_REQ_HI         0xa0        N    N    N    1            SI request queue live requests (most-significant bit)
SI_RQ_LIVE_REQ_LO         0x9f        N    N    N    7            SI request queue live requests (least-significant three bits)
SI_WRITEQ_INSERTS         0xa1        N    N    N    2            SI write queue inserts
SI_WRITEQ_LIVE_REQ_HI     0xa3        N    N    N    1            SI write queue live requests (most-significant bit)
SI_WRITEQ_LIVE_REQ_LO     0xa2        N    N    N    7            SI write queue live requests (least-significant three bits)
SI_WAQ_COLLISIONS         0xa4        N    N    N    1            SI write address queue collisions (incoming FSB snoop collides with an entry in the WAQ)
SI_CCQ_INSERTS            0xa5        N    N    N    2            SI clean castout queue inserts
SI_CCQ_LIVE_REQ_HI        0xa7        N    N    N    1            SI clean castout queue live requests (most-significant bit)
SI_CCQ_LIVE_REQ_LO        0xa6        N    N    N    7            SI clean castout queue live requests (least-significant three bits)
SI_CCQ_COLLISIONS         0xa8        N    N    N    1            SI clean castout queue collisions (incoming FSB snoop collides with an entry in the CCQ)
SI_IOQ_COLLISIONS         0xaa        N    N    N    1            SI inorder queue collisions (outgoing transaction collides with an entry in the IOQ)
SI_SCB_INSERTS            0xab        N    N    N    1            SI snoop coalescing buffer inserts
SI_SCB_LIVE_REQ_HI        0xad        N    N    N    1            SI snoop coalescing buffer live requests (most-significant bit)
SI_SCB_LIVE_REQ_LO        0xac        N    N    N    7            SI snoop coalescing buffer live requests (least-significant three bits)
SI_SCB_SIGNOFFS           0xae        N    N    N    1            SI snoop coalescing buffer coherency signoffs
SI_WDQ_ECC_ERRORS         0xaf        N    N    N    1            SI write data queue ECC errors
Table 4-38. Derived Monitors for System Bus Events

Symbol Name             Description                               Equation
BIL_HITM_LINE_RATIO     BIL Hit to Modified Line Ratio            BUS_RD_INVAL_HITM / BUS_RD_INVAL
BIL_RATIO               BIL Ratio                                 BUS_RD_INVAL / BUS_MEMORY
BRIL_HITM_LINE_RATIO    BRIL Hit to Modified Line Ratio           BUS_RD_INVAL_BST_HITM / BUS_RD_INVAL_BST
BUS_ADDR_BPRI           Bus transactions used by IO
BUS_BRQ_LIVE_REQ        BRQ Live Requests                         ER_BRQ_LIVE_REQ_HI * 8 + ER_BRQ_LIVE_REQ_LO
BUS_BURST               Full cache line memory transactions
BUS_HITM_RATIO          Bus Modified Line Hit Ratio               BUS_RD_HITM / BUS_RD_ALL or BUS_RD_HITM / BUS_MEMORY
BUS_HITS_RATIO          Bus Read Hit to Shared Line Ratio
BUS_IOQ_LIVE_REQ        Inorder Bus Queue Requests                SI_IOQ_LIVE_REQ_HI * 8 + SI_IOQ_LIVE_REQ_LO

Additional derived monitors cover Bus Read Invalid Line in Burst transactions (BRIL) satisfied by
memory (BUS_RD_INVAL_BST - BUS_RD_INVAL_BST_HITM), Bus Read Invalid Line
transactions (BIL) satisfied from memory (BUS_RD_INVAL - BUS_RD_INVAL_HITM), and
read invalidate transactions (BRIL and BIL) resulting in HITMs (BUS_RD_INVAL_BST_HITM
+ BUS_RD_INVAL_HITM).

4.11.1 System Bus Conventions
Table 4-39 defines the conventions that will be used when describing the Montecito processor
system bus transaction monitors in this section, as well as the individual monitor descriptions in
Section 4.15.
Other transactions besides those listed in Table 4-42 include Deferred Reply, Special Transactions,
Interrupt, Interrupt Acknowledge, and Purge TC. Note that the monitors will count if any
transaction gets a retry response from the priority agent.
To support the analysis of snoop traffic in a multiprocessor system, the Montecito processor
provides local processor and remote response monitors. The local processor snoop events
(SI_SCB_INSERTS and SI_SCB_SIGNOFFS) monitor inbound snoop traffic. The remote
response events (BUS_RD_HIT, BUS_RD_HITM, BUS_RD_INVAL_HITM and
BUS_RD_INVAL_BST_HITM) monitor the snoop responses of other processors to bus
transactions that the monitoring processor originated. Table 4-40 summarizes the remote snoop
events by bus transaction.
4.11.2 Extracting Memory Latency from Montecito Performance Counters
On the Itanium 2 processors, several events were provided to approximate memory latency as seen
by the processor using the following equation:
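In LaTeX form, the approximation has the following general shape; the weighting of the HI term
by 8 follows from the bit-field split described below, while the exact denominator event and
umask are an assumption here:

    \mathrm{AvgReadLatency} \approx
        \frac{8 \times \mathrm{BUS\_MEM\_READ\_OUT\_HI} + \mathrm{BUS\_MEM\_READ\_OUT\_LO}}
             {\mathrm{BUS\_MEM\_READ}}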
The BUS_MEM_READ_OUT event starts counting one bus clock after a request is issued on the
system interface (ADS) and stops incrementing when the request completes its first data transfer or
is retried. In each core cycle after counting is initiated, the number of live requests in that cycle is
added to the count. This count may be as high as 15. For ease of implementation, the count is split
into two parts: BUS_MEM_READ_OUT_LO sums the low-order three bits of the number of live
requests, while BUS_MEM_READ_OUT_HI sums the high-order bit.
In the above formula, the numerator provides the number of live requests and the denominator
provides the number of requests that are counted. When the live count is divided by the number of
transactions issued, you get an average lifetime of a transaction issued on the system interface (a
novel application of Little’s Law).
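In code form, the calculation is simply the following (a sketch; the raw values are assumed to have
been read from the PMDs counting these events):

    /* Average lifetime, in cycles, of a read transaction on the system
       interface (Little's Law: live-request cycles / requests issued). */
    double avg_read_latency(unsigned long out_hi,  /* BUS_MEM_READ_OUT_HI  */
                            unsigned long out_lo,  /* BUS_MEM_READ_OUT_LO  */
                            unsigned long reads)   /* read transactions issued */
    {
        /* HI accumulates the upper bits of the live-request count, so each
           HI increment represents 8 live requests; LO accumulates the low
           three bits directly. */
        unsigned long live_request_cycles = 8UL * out_hi + out_lo;
        return (double)live_request_cycles / (double)reads;
    }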
The Montecito processor has similar counters: ER_MEM_READ_OUT_{HI,LO}. Using these
events to derive Montecito memory latency will give results that are higher than the true memory
latency seen on Montecito. The main reason is that the start and stop points of the counters are not
equivalent between the two processors. Specifically, on Montecito, the
ER_MEM_READ_OUT_{HI,LO} events start counting one core clock after a request is sent to the
arbiter, and stop counting when the request receives its first data transfer within the external request
logic (after the arbiter). Thus, these events include the entire time requests spend in the arbiter (pre-
and post-request).

Requests may remain in the arbiter for a long or short time depending on system interface behavior.
The arbiter queue events SI_RQ_LIVE_REQ_{HI,LO} may be used to reduce the effects of arbiter
latency on the calculation. Unfortunately, these events are not sufficient to enable a measurement
completely equivalent to that on Itanium 2 processors. The arbiter return time from the FSB to the
core is fixed for a specific arbiter-to-system-interface ratio, and these arbiter events may occur in a
different time domain from core events.
The new memory latency approximation formula for Montecito, with corrective events included, is
below:
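One plausible form of the corrected approximation, assuming the arbiter residency measured by
SI_RQ_LIVE_REQ_{HI,LO} is scaled to core cycles by the core-to-SI clock ratio r and
subtracted from the external-request live count (the precise correction terms are an assumption):

    \mathrm{AvgReadLatency} \approx
        \frac{(8 \times \mathrm{ER\_MEM\_READ\_OUT\_HI} + \mathrm{ER\_MEM\_READ\_OUT\_LO})
              - r \times (8 \times \mathrm{SI\_RQ\_LIVE\_REQ\_HI} + \mathrm{SI\_RQ\_LIVE\_REQ\_LO})}
             {\mathrm{BUS\_MEM\_READ}}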
Note that the Data EAR may be used to compare data cache load miss latency between Madison
and Montecito. However, an access's memory latency, as measured by the Data EAR or other cycle
counters, will be inherently greater on Montecito than on previous Itanium 2 processors due to the
latency the arbiter adds to both the outbound request and the inbound data transfer. Also, the Data
EAR encompasses the entire latency through the processor's memory hierarchy and queues,
without detail on the time spent in any specific queue.

Even with this improved formula, the estimated memory latency for Montecito will appear greater
than on previous Itanium 2 processors. We have not observed any design point suggesting that the
system interface component of memory accesses is excessive on Montecito.
We have observed that snoop stalls and write queue pressure lead to additional memory latency on
Montecito compared to previous Itanium 2 processors, but these phenomena impact the pre- or
post-system-interface portion of memory latency and are very workload dependent in their impact.
Specifically, the write queues need to be sufficiently full to exert back pressure on the victimizing
read requests, such that a new read request cannot issue to the system interface because it cannot
identify a victim in the L3 cache to ensure its proper allocation. This severe pressure has only been
seen with steady streams in which every read request results in a dirty L3 victim. Additional snoop
stalls should only add latency to transactions that receive a HITM snoop response (cache-to-cache
transfers), because non-HITM responses are satisfied by memory and the memory access should be
initiated as a consequence of the initial transaction rather than its snoop response.
Figure 4-2 shows how the latency is determined using the above calculations on Itanium 2 and
Montecito processors. The red portion of the Montecito diagram shows the latency accounted for
by the correction in the Montecito calculation.

Figure 4-2. Extracting Memory Latency from PMUs

(The figure contrasts two timelines. Itanium 2: load issued to caches, load issued on system
interface, data delivery started, data returned to register; the time calculated with PMU events
spans from system-interface issue to the start of data delivery. Montecito: load issued to caches,
load issued to arbiter, time in arbiter, load issued on system interface, data delivery started, data
delivery seen by external request logic, data returned to register; the time calculated with PMU
events spans from arbiter issue to data delivery as seen by the external request logic.)
4.12 RSE Events

Register Stack Engine events are presented in Table 4-39. The number of current/dirty registers is
split among three monitors since there are 96 physical registers in the Montecito processor.
Table 4-39. Performance Monitors for RSE Events (Sheet 1 of 2)

Symbol Name                Event Code  IAR  DAR  OPC  Max Inc/Cyc  Description
RSE_CURRENT_REGS_2_TO_0    0x2b        N    N    N    7            Current RSE registers
RSE_CURRENT_REGS_5_TO_3    0x2a        N    N    N    7            Current RSE registers
RSE_CURRENT_REGS_6         0x26        N    N    N    1            Current RSE registers
RSE_DIRTY_REGS_2_TO_0      0x29        N    N    N    7            Dirty RSE registers
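Because these monitors accumulate bit fields of the per-cycle register count, the accumulated
counts combine linearly. A sketch follows; the field weights are inferred from the monitor names
(bits 6, 5..3, and 2..0 of the 7-bit count):

    /* Total current-RSE-register cycles accumulated over the sampling
       interval; divide by elapsed cycles for the average register count. */
    unsigned long rse_current_reg_cycles(unsigned long regs_6,      /* RSE_CURRENT_REGS_6      */
                                         unsigned long regs_5_to_3, /* RSE_CURRENT_REGS_5_TO_3 */
                                         unsigned long regs_2_to_0) /* RSE_CURRENT_REGS_2_TO_0 */
    {
        return 64UL * regs_6 + 8UL * regs_5_to_3 + regs_2_to_0;
    }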