Intel® Xeon® Processor E5-2600
Product Family Uncore Performance
Monitoring Guide
March 2012
Reference Number: 327043-001
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS
PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving,
life sustaining applications.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal
injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU
SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS,
OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE
ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH
ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS
NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel
reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future
changes to them.
Code names represented in this document are only for use by Intel to identify a product, technology, or service in development,
that has not been made commercially available to the public, i.e., announced, launched or shipped. It is not a “commercial” name
for products or services and is not intended to function as a trademark.
The Intel® 64 architecture processors may contain design defects or errors known as errata, which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
Hyper-Threading Technology requires a computer system with an Intel® processor supporting Hyper-Threading Technology and
an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and
software you use. For more information, see http://www.intel.com/technology/hyperthread/index.htm; including details on which processors
support HT Technology.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor
(VMM) and for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending
on hardware and software configurations. Intel® Virtualization Technology-enabled BIOS and VMM applications are currently in
development.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers
and applications enabled for Intel® 64 architecture. Processors will not operate (including 32-bit operation) without an Intel®
64 architecture-enabled BIOS. Performance will vary depending on your hardware and software configurations. Consult with your
system vendor for more information.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained
by calling 1-800-548-4725, or by visiting Intel's Web Site
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo, Intel Core 2 Extreme, Intel Pentium
D, Itanium, Intel SpeedStep, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in
The uncore subsystem of the Intel® Xeon® processor E5-2600 product family is shown in Figure 1-1.
The uncore subsystem also applies to the Intel® Xeon® processor E5-1600 product family in a
single-socket platform¹. The uncore sub-system consists of a variety of components, ranging from the
CBox caching agent to the power controller unit (PCU), integrated memory controller (iMC) and home
agent (HA), to name a few. Most of these components provide similar performance monitoring
capabilities.
Figure 1-1. Uncore Sub-system Block Diagram of Intel Xeon Processor E5-2600 Family
1.2 Uncore PMON Overview
The uncore performance monitoring facilities are organized into per-component performance
monitoring (or ‘PMON’) units. A PMON unit within an uncore component may contain one or more
sets of counter registers. With the exception of the UBox, each PMON unit provides a unit-level
control register to synchronize actions across the counters within the box (e.g., to start/stop
counting).
1. The uncore sub-system in Intel® Core™ i7-3930K and i7-3820 processors is derived from the
above, hence most of the descriptions in this document also apply.
Introduction
Events can be collected by reading a set of local counter registers. Each counter register is paired with
a dedicated control register used to specify what to count (i.e. through the event select/umask fields)
and how to count it. Some units provide the ability to specify additional information that can be used
to ‘filter’ the monitored events (e.g., C-box; see Section 2.3.3.3, “CBo Filter Register
(Cn_MSR_PMON_BOX_FILTER)”).
Because uncore performance monitors represent a per-socket resource that is not meant to be affected by
context switches and thread migration performed by the OS, it is recommended that the monitoring
software agent establish a fixed affinity binding to prevent cross-talk of event counts from different
uncore PMUs.
The programming interface of the counter registers and control registers falls into two address spaces:
• PMON registers within the CBo units, PCU, and UBox are accessed by MSR; see Table 1-2.
• PMON registers within the HA, iMC, Intel® QPI, R2PCIe and R3QPI units are accessed through
PCI device configuration space; see Table 1-3.
Irrespective of the address-space difference and with only minor exceptions, the bit-granular layout of
the control registers used to program event code, unit mask, start/stop, and signal filtering via threshold/
edge detect is the same.
The general performance monitoring capabilities of each box are outlined in the following table.
• Section 2.7, “Intel® QPI Link Layer Performance Monitoring”
• Section 2.9, “R3QPI Performance Monitoring”
• Section 2.8, “R2PCIe Performance Monitoring”
• Section 2.10, “Packet Matching Reference”
1.4 Uncore PMON - Typical Control/Counter Logic
Following is a diagram of the standard perfmon counter block illustrating how event information is
routed and stored within each counter and how its paired control register helps to select and filter the
incoming information. Details of how control bits affect event information are presented in each of the
box subsections of Chapter 2, with some summary information below.
Note: The PCU uses an adaptation of this block (refer to Section 2.6.3, “PCU Performance
Monitors” for more information). Also note that only a subset of the available control bits
are presented in the diagram.
Figure 1-2. Perfmon Control/Counter Block Diagram
Selecting What To Monitor: The main task of a configuration register is to select the event to be
monitored by its respective data counter. Setting the .ev_sel and .umask fields performs the event
selection.
Telling HW that the Control Register Is Set: The .en bit must be set to 1 to enable counting. Once
counting has been enabled at the box and global levels of the Performance Monitoring Hierarchy (refer
to Section 2.1.1, “Setting up a Monitoring Session” for more information), the paired data register will
begin to collect events.
Additional control bits include:
Applying a Threshold to Incoming Events: .thresh - since most counters can increment by a
value greater than 1, a threshold can be applied to generate an event based on the outcome of the
comparison. If the .thresh is set to a non-zero value, that value is compared against the incoming
count for that event in each cycle. If the incoming count is >= the threshold value, then the event
count captured in the data register will be incremented by 1.
Using the threshold field to generate additional events can be particularly useful when applied to a
queue occupancy count. For example, if a queue is known to contain eight entries, it may be useful to
know how often it contains 6 or more entries (i.e. Almost Full) or when it contains 1 or more entries
(i.e. Not Empty).
Note: The .invert and .edge_det bits follow the threshold comparison in sequence. If a user
wishes to apply these bits to events that only increment by 1 per cycle, .thresh must be
set to 0x1.
Inverting the Threshold Comparison: .invert - Changes the .thresh test condition to ‘<’.
Counting State Transitions Instead of per-Cycle Events: .edge_det - Rather than accumulating
the raw count each cycle (for events that can increment by 1 per cycle), the register can capture
transitions from no event to an event incoming (i.e. the ‘Rising Edge’).
1.5 Uncore PMU Summary Tables
Following is a list of the registers provided in the Uncore for Performance Monitoring. It should be
noted that the PMON interfaces are split between MSR space (U, CBo and PCU) and PCICFG space.
Table 1-2. MSR Space Uncore Performance Monitoring Registers (Sheet 1 of 2)
Box        MSR Addresses    Description
C-Box Counters
C-Box 7    0xDF9-0xDF6      Counter Registers
           0xDF4            Counter Filters
           0xDF3-0xDF0      Counter Config Registers
           0xDE4            Box Control
C-Box 6    0xDD9-0xDD6      Counter Registers
           0xDD4            Counter Filters
           0xDD3-0xDD0      Counter Config Registers
           0xDC4            Box Control
C-Box 5    0xDB9-0xDB6      Counter Registers
           0xDB4            Counter Filters
           0xDB3-0xDB0      Counter Config Registers
           0xDA4            Box Control
C-Box 4    0xD99-0xD96      Counter Registers
           0xD94            Counter Filters
           0xD93-0xD90      Counter Config Registers
           0xD84            Box Control
C-Box 3    0xD79-0xD76      Counter Registers
           0xD74            Counter Filters
           0xD73-0xD70      Counter Config Registers
Table 1-2. MSR Space Uncore Performance Monitoring Registers (Sheet 2 of 2)
Table 1-3. PCICFG Space Uncore Performance Monitoring Registers
Box              PCICFG Register Addresses   Description
R3QPI            D19:F5,6                    F(5,6) for Link 0,1
                 F4                          Box Control
                 E0-D8                       Counter Config Registers
                 B4-A0                       Counter Registers
R2PCIe           D19:F1
                 F4                          Box Control
                 E4-D8                       Counter Config Registers
                 BC-A0                       Counter Registers
iMC              D16:F0,1,4,5                F(0,1,4,5) For Channel 0,1,2,3
                 F4                          Box Control
                 F0                          Counter Config Register (Fixed)
                 E4-D8                       Counter Config Registers (General)
                 D4-D0                       Counter Register (Fixed)
                 BC-A0                       Counter Registers (General)
HA               D14:F1
                 F4                          Box Control
                 E4-D8                       Counter Config Registers
                 BC-A0                       Counter Registers
                 48-40                       Opcode/Addr Match Filters
QPI              D8,9:F2                     D(8,9) for Port 0,1
                 F4                          Box Control
                 E4-D8                       Counter Config Registers
                 BC-A0                       Counter Registers
QPI Mask/Match   D8,9:F6                     D(8,9) for Port 0,1
                 23C-238                     Mask 0,1
                 22C-228                     Match 0,1
QPI Misc         D8,9:F0                     D(8,9) for Port 0,1
                 D4                          QPI Rate Status
1.6 On Parsing and Using Derived Events
For many of the sections in the chapter covering the Performance Monitoring capabilities of each box,
a set of commonly measured metrics or ‘Derived Events’ has been included. For the most part,
these derived events are simple mathematical combinations of events found within the box (e.g.
[SAMPLE]). However, there are some extensions to the notation used by the metrics.
Following is a breakdown of a Derived Event to illustrate many of the notations used. To calculate
“Average Number of Data Read Entries that Miss the LLC when the TOR is not empty”.
mnemonic for the register will be included in the equation. Software will be responsible for
configuring the data register and setting it to start counting with the other events used by the
metric.
Requires more input to software to determine the specific event/subevent:
• In some cases, there may be multiple events/subevents that cover the same information across
multiple like hardware units. Rather than manufacturing a derived event for each combination,
the derived event will use a lower case variable in the event name.
• e.g., POWER_CKE_CYCLES.RANKx / MC_Chy_PCI_PMON_CTR_FIXED where ‘x’ is a variable to cover
events POWER_CKE_CYCLES.RANK0 through POWER_CKE_CYCLES.RANK7
Requires setting extra control bits in the register the event has been programmed in:
• event_name[.subevent_name]{ctrl_bit[=value],}
• e.g., COUNTER0_OCCUPANCY{edge_det,thresh=0x1}
NOTE: If there is no [=value] specified it is assumed that the bit must be set to 1.
Requires gathering of extra information outside the box (often for common terms):
• See following section for a breakdown of common terms found in Derived Events.
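As an illustration of the event_name[.subevent_name]{ctrl_bit[=value],} notation, the following sketch splits one term of a derived event into its parts. The function name `parse_derived_term` and the returned tuple layout are assumptions made for this example; the only rules taken from the text are the notation itself and the convention that a control bit without [=value] is treated as 1.

```python
import re

# event_name, optional .subevent_name, optional {ctrl_bit[=value],...}
_TERM = re.compile(
    r"^(?P<event>[A-Za-z0-9_]+)"
    r"(?:\.(?P<subevent>[A-Za-z0-9_]+))?"
    r"(?:\{(?P<ctrl>[^}]*)\})?$"
)


def parse_derived_term(term):
    """Return (event, subevent or None, {ctrl_bit: value}) for one term.

    Hypothetical helper for illustration only.
    """
    match = _TERM.match(term.strip())
    if match is None:
        raise ValueError("not a recognized derived-event term: %r" % term)
    ctrl_bits = {}
    for item in (match.group("ctrl") or "").split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty list or a trailing comma
        name, sep, value = item.partition("=")
        # A control bit without an explicit value is assumed to be 1.
        ctrl_bits[name.strip()] = int(value, 0) if sep else 1
    return match.group("event"), match.group("subevent"), ctrl_bits


event, subevent, ctrl = parse_derived_term(
    "COUNTER0_OCCUPANCY{edge_det,thresh=0x1}"
)
```

Here `ctrl` would tell software that both .edge_det and .thresh must be programmed in the control register that holds the event.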
1.6.1 On Common Terms found in Derived Events
To convert a Latency term from a count of clocks to a count of nanoseconds:
• e.g., For READ_MEM_BW, an event derived from iMC:CAS_COUNT.RD * 64, which is the amount
of memory bandwidth consumed by read requests, put ‘READ_MEM_BW’ into the bandwidth term
to convert the measurement from raw bytes to GB/sec.
Following are some other terms that may be found within Metrics and how they should be interpreted.
• GB_CONVERSION: 1024^3
• TSC_SPEED: Time Stamp Counter frequency in MHz
• SAMPLE_INTERVAL = TSC end time - TSC start time.
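The terms above can be combined to turn a raw byte count into GB/sec. The exact conversion formula is not reproduced in this text, so the sketch below is an assumption built only from the listed definitions: SAMPLE_INTERVAL is taken to be in TSC ticks, TSC_SPEED in MHz, and READ_MEM_BW is iMC:CAS_COUNT.RD * 64 bytes. The function and parameter names are hypothetical.

```python
GB_CONVERSION = 1024 ** 3  # bytes per GB, as defined above


def read_mem_bw_gb_per_sec(cas_count_rd, tsc_start, tsc_end, tsc_speed_mhz):
    """Convert an iMC CAS_COUNT.RD sample into GB/sec (assumed formula)."""
    read_mem_bw = cas_count_rd * 64            # bytes read in the interval
    sample_interval = tsc_end - tsc_start      # TSC ticks
    seconds = sample_interval / (tsc_speed_mhz * 1e6)
    return read_mem_bw / GB_CONVERSION / seconds


# 2**24 read CAS commands over one second on a 2000 MHz TSC.
bandwidth = read_mem_bw_gb_per_sec(2 ** 24, 0, 2_000_000_000, 2000)
```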
§
2 Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring
2.1 Uncore Per-Socket Performance Monitoring Control
The uncore PMON does not support interrupt based sampling. To manage the large number of counter
registers distributed across many units and collect event data efficiently, this section describes the
hierarchical technique to start/stop/restart event counting that a software agent may need to perform
during a monitoring session.
2.1.1 Setting up a Monitoring Session
On HW reset, all the counters are disabled. Because enabling is hierarchical, the following steps, which
include programming the event control registers and enabling the counters to begin collecting events,
must be taken to set up a monitoring session. Section 2.1.2 covers the steps to stop/re-start counter
registers during a monitoring session.
For each box in which events will be measured: Skip (a) and (b) for U-Box monitoring.
a) Enable each box to accept the freeze signal to start/stop/re-start all counter registers in that box
e.g., set Cn_MSR_PMON_BOX_CTL.frz_en to 1
Note: Recommended: set the .frz_en bits during the setup phase for each box a user intends
to monitor, and leave them alone for the duration of the monitoring session.
b) Freeze the box’s counters while setting up the monitoring session.
e.g., set Cn_MSR_PMON_BOX_CTL.frz to 1
For each event to be measured within each box:
c) Enable counting for each monitor
e.g., set C0_MSR_PMON_CTL2.en to 1
Note: Recommended: set the .en bit for all counters in each box a user intends to monitor,
and leave it alone for the duration of the monitoring session.
d) Select event to monitor if the event control register hasn’t been programmed:
Program the .ev_sel and .umask bits in the control register with the encodings necessary to capture
the requested event along with any signal conditioning bits (.thresh/.edge_det/.invert) used to qualify
the event.
e.g., set C0_MSR_PMON_CTL2.{ev_sel, umask} to {0x03, 0x1} in order to capture
LLC_VICTIMS.M_STATE in CBo 0’s C0_MSR_PMON_CTR2.
Note: It is also important to program any additional filter registers used to further qualify the
events (e.g., setting the opcode match field in Cn_MSR_PMON_BOX_FILTER to qualify
TOR_INSERTS by a specific opcode).
Back to the box level:
e) Reset counters in each box to ensure no stale values have been acquired from previous sessions.
• For each CBo, set Cn_MSR_PMON_BOX_CTL[1:0] to 0x2.
• For each Intel® QPI Port, set Q_Py_PCI_PMON_BOX_CTL[1:0] to 0x2.
• Set PCU_MSR_PMON_BOX_CTL[1:0] to 0x2.
• For each Link, set R3QPI_PCI_PMON_BOX_CTL[1:0] to 0x2.
• Set R2PCIE_PCI_PMON_BOX_CTL[1:0] to 0x2.
Note: The UBox does not have a Unit Control register and neither the iMC nor the HA has a
reset bit in its Unit Control register. The counters in the UBox, the HA and each populated
DRAM channel in the iMC will need to be manually reset by writing a 0 to each data
register.
Back to the box level:
f) Commence counting at the box level by unfreezing the counters in each box
e.g., set Cn_MSR_PMON_BOX_CTL.frz to 0
And with that, counting will begin.
Note: The UBox does not have a Unit Control register. Once enabled and programmed with a
valid event, its counters will begin collecting events. For somewhat better synchronization, a user
can keep the U_MSR_PMON_CTL.ev_sel at 0x0 while enabled and write it with a valid
value just prior to unfreezing the registers in other boxes.
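Steps (a) through (f) can be sketched for a single CBo 0 counter as an ordered series of register writes. This is a minimal illustration, not a driver: `FakeMsr` is a hypothetical stand-in that only records writes (real MSR access requires a privileged, platform-specific mechanism), and `setup_cbo0_counter2` is a name invented for the example. The bit positions (.frz_en = 16, .frz = 8, counter reset = bit 1, .en = 22, .umask = 15:8, .ev_sel = 7:0) and the MSR addresses follow the register descriptions in this guide.

```python
# Bit positions from the box control and counter control descriptions.
FRZ_EN = 1 << 16
FRZ = 1 << 8
RST_CTRS = 1 << 1      # BOX_CTL[1:0] = 0x2 resets the counters
CTL_EN = 1 << 22

C0_MSR_PMON_BOX_CTL = 0x0D04
C0_MSR_PMON_CTL2 = 0x0D12


class FakeMsr:
    """Hypothetical MSR interface that records writes instead of issuing them."""

    def __init__(self):
        self.log = []

    def write(self, addr, value):
        self.log.append((addr, value))


def setup_cbo0_counter2(msr, ev_sel, umask):
    # (a) + (b): accept the freeze signal and freeze the box during setup.
    msr.write(C0_MSR_PMON_BOX_CTL, FRZ_EN | FRZ)
    # (c) + (d): enable the counter and select the event/subevent.
    ctl = CTL_EN | ((umask & 0xFF) << 8) | (ev_sel & 0xFF)
    msr.write(C0_MSR_PMON_CTL2, ctl)
    # (e): reset the counters while the box is still frozen.
    msr.write(C0_MSR_PMON_BOX_CTL, FRZ_EN | FRZ | RST_CTRS)
    # (f): unfreeze; counting begins.
    msr.write(C0_MSR_PMON_BOX_CTL, FRZ_EN)


msr = FakeMsr()
setup_cbo0_counter2(msr, ev_sel=0x03, umask=0x1)  # LLC_VICTIMS.M_STATE
```

Keeping .frz_en set in every box-control write follows the recommendation to leave that bit alone for the duration of the session.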
2.1.2 Reading the Sample Interval
Software can poll the counters whenever it chooses.
a) Polling - before reading, it is recommended that software freeze the counters in each box in which
counting is taking place (by setting *_PMON_BOX_CTL.frz_en and .frz to 1). After reading the event
counts from the counter registers, the monitoring agent can choose to reset the event counts to avoid
event-count wrap-around, or resume the counter registers without resetting their values. The latter
choice will require the monitoring agent to check and adjust for potential wrap-around situations.
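When counters are resumed without being reset, the difference between two successive reads has to account for a possible wrap. A minimal sketch, assuming a 44-bit counter (the width this guide gives for the UBox and CBo data registers) and at most one wrap between reads; the helper name `counter_delta` is hypothetical:

```python
COUNTER_BITS = 44
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(previous, current):
    """Events counted between two reads of a free-running 44-bit counter.

    Masking the subtraction handles a single wrap past bit 43.
    """
    return (current - previous) & COUNTER_MASK


no_wrap = counter_delta(100, 250)
wrapped = counter_delta((1 << 44) - 10, 5)
```

If more than one wrap can occur between reads, this adjustment is no longer sufficient and the polling interval needs to be shortened.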
2.2 UBox Performance Monitoring
2.2.1 Overview of the UBox
The UBox serves as the system configuration controller for the Intel Xeon Processor E5-2600 family
uncore.
In this capacity, the UBox acts as the central unit for a variety of functions:
• The UBox is the master for reading and writing physically distributed registers across the uncore
using the Message Channel.
• The UBox is the intermediary for interrupt traffic, receiving interrupts from the system and
dispatching interrupts to the appropriate core.
• The UBox serves as the system lock master used when quiescing the platform (e.g., Intel® QPI
bus lock).
2.2.2 UBox Performance Monitoring Overview
The UBox supports event monitoring through two programmable 44-bit wide counters
(U_MSR_PMON_CTR{1:0}), and a 48-bit fixed counter which increments each u-clock. Each of the
programmable counters can be programmed (U_MSR_PMON_CTL{1:0}) to monitor any UBox event.
For information on how to set up a monitoring session, refer to Section 2.1, “Uncore Per-Socket
Performance Monitoring Control”.
The following registers represent the state governing all box-level PMUs in the UBox.
2.2.3.2 UBox PMON state - Counter/Control Pairs
The following table defines the layout of the UBox performance monitor control registers. The main
task of these configuration registers is to select the event to be monitored by their respective data
counter (.ev_sel, .umask). Additional control bits are provided to shape the incoming events (e.g.
.invert, .edge_det, .thresh) as well as provide additional functionality for monitoring software (.rst).
Table 2-2. U_MSR_PMON_CTL{1-0} Register – Field Definitions
Field      Bits    Attr   HW Reset Val   Description
rsv        31:29   RV     0              Reserved (?)
thresh     28:24   RW     0              Threshold used in counter comparison.
invert     23      RW     0              Invert comparison against Threshold.
                                         0 - comparison will be ‘is event increment >= threshold?’.
                                         1 - comparison is inverted - ‘is event increment < threshold?’
                                         NOTE: .invert is in series following .thresh. Due to this, the
                                         .thresh field must be set to a non-0 value. For events that
                                         increment by no more than 1 per cycle, set .thresh to 0x1.
                                         Also, if .edge_det is set to 1, the counter will increment when a 1
                                         to 0 transition (i.e. falling edge) is detected.
en         22      RW     0              Local Counter Enable.
rsv        21:20   RV     0              Reserved. SW must write to 0 for proper operation.
rsv        19      RV     0              Reserved (?)
edge_det   18      RW     0              When set to 1, rather than measuring the event in each cycle it
                                         is active, the corresponding counter will increment when a 0 to 1
                                         transition (i.e. rising edge) is detected.
                                         When 0, the counter will increment in each cycle that the event
                                         is asserted.
                                         NOTE: .edge_det is in series following .thresh. Due to this, the
                                         .thresh field must be set to a non-0 value. For events that
                                         increment by no more than 1 per cycle, set .thresh to 0x1.
rst        17      WO     0              When set to 1, the corresponding counter will be cleared to 0.
umask      15:8    RW     0              Select subevents to be counted within the selected event.
ev_sel     7:0     RW     0              Select event to be counted.
The UBox performance monitor data registers are 44 bits wide. Should a counter overflow (a carry out
from bit 43), the counter will wrap and continue to collect events.
If accessible, software can continuously read the data registers without disabling event collection.
Table 2-3. U_MSR_PMON_CTR{1-0} Register – Field Definitions
The Global UBox PMON registers also include a fixed counter that increments at UCLK for each cycle it
is enabled.
Table 2-4. U_MSR_PMON_FIXED_CTL Register – Field Definitions
Field      Bits    Attr   HW Reset Val   Description
rsv        31:23   RV     0              Reserved (?)
en         22      RW     0              Enable counter when global enable is set.
rsv        21:20   RV     0              Reserved. SW must write to 0 for proper operation.
rsv        19:0    RV     0              Reserved (?)
Table 2-5. U_MSR_PMON_FIXED_CTR Register – Field Definitions
• Definition: Number of times an IDI Lock/SplitLock sequence was started
2.3 Caching Agent (CBo) Performance Monitoring
2.3.1 Overview of the CBo
The LLC coherence engine (CBo) manages the interface between the core and the last level cache
(LLC). All core transactions that access the LLC are directed from the core to a CBo via the ring
interconnect. The CBo is responsible for managing data delivery from the LLC to the requesting core.
It is also responsible for maintaining coherence between the cores within the socket that share the
LLC; generating snoops and collecting snoop responses from the local cores when the MESIF protocol
requires it.
So, if the CBo fielding the core request indicates that a core within the socket owns the line (for a
coherent read), the request is snooped to that local core. That same CBo will then snoop all peers
which might have the address cached (other cores, remote sockets, etc) and send the request to the
appropriate Home Agent for conflict checking, memory requests and writebacks.
In the process of maintaining cache coherency within the socket, the CBo is the gate keeper for all
Intel® QuickPath Interconnect (Intel® QPI) messages that originate in the core and is responsible for
ensuring that all Intel® QPI messages that pass through the socket’s LLC remain coherent.
The CBo manages local conflicts by ensuring that only one request is issued to the system for a
specific cacheline.
The uncore contains up to eight instances of the CBo, each assigned to manage a distinct 2.5MB slice
of the processor’s total LLC capacity. Each slice can be up to 20-way set associative. For processors
with fewer than eight 2.5MB LLC slices, the CBo boxes for the missing slices will still be active and track ring
traffic caused by their co-located core even if they have no LLC related traffic to track (i.e. hits/
misses/snoops).
Every physical memory address in the system is uniquely associated with a single CBo instance via a
proprietary hashing algorithm that is designed to keep the distribution of traffic across the CBo
instances relatively uniform for a wide range of possible address patterns. This enables the individual
CBo instances to operate independently, each managing its slice of the physical address space without
any CBo in a given socket ever needing to communicate with the other CBos in that same socket.
2.3.2 CBo Performance Monitoring Overview
Each of the CBos in the uncore supports event monitoring through four 44-bit wide counters
(Cn_MSR_PMON_CTR{3:0}). Event programming in the CBo is restricted such that each event can
only be measured in certain counters within the CBo. For example, counter 0 is dedicated to
occupancy events. No other counter may be used to capture occupancy events.
• Counter 0: Queue-occupancy-enabled counter that tracks all events
• Counter 1: Basic counter that tracks all but queue occupancy events
• Counter 2: Basic counter that tracks ring events and the occupancy companion event
(COUNTER0_EVENT).
• Counter 3: Basic counter that tracks ring events and the occupancy companion event
(COUNTER0_EVENT).
CBo counter 0 can increment by a maximum of 20 per cycle; counters 1-3 can increment by 1 per
cycle.
Some uncore performance events that monitor transaction activities require additional details that
must be programmed in a filter register. Each CBo provides one filter register and allows only one
such event to be programmed at a given time; see Section 2.3.3.3.
For information on how to set up a monitoring session, refer to Section 2.1, “Uncore Per-Socket
Performance Monitoring Control”.
2.3.2.1 Special Note on CBo Occupancy Events
Although only counter 0 supports occupancy events, it is possible to program counters 1-3 to
monitor the same occupancy event by selecting the “OCCUPANCY_COUNTER0” event code on
counters 1-3.
This allows:
• Thresholding on all four counters.
While one can monitor no more than one queue at a time, it is possible to set up different queue
occupancy thresholds on each of the four counters. For example, if one wanted to monitor the
IRQ, one could set up thresholds of 1, 7, 14, and 18 to get a picture of the time spent at different
occupancies in the IRQ.
• Average Latency and Average Occupancy
It can be useful to monitor the average latency in a queue as well as the average number of
items in the queue. One could program counter 0 to accumulate the occupancy, counter 1 with
the queue’s allocations event, and counter 2 with the OCCUPANCY_COUNTER0 event and a
threshold of 1. Latency could then be calculated by counter 0 / counter 1, and occupancy by
counter 0 / counter 2.
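The two ratios can be written out as a short calculation. The sketch below assumes the three counter values have already been read as described (accumulated occupancy, allocations, and cycles with occupancy of at least 1); the function name `queue_metrics` and the choice to return 0.0 for an idle queue are assumptions for the example.

```python
def queue_metrics(occupancy_sum, allocations, not_empty_cycles):
    """Return (average latency in cycles, average occupancy while not empty).

    occupancy_sum    -- counter 0: accumulated per-cycle occupancy
    allocations      -- counter 1: the queue's allocations event
    not_empty_cycles -- counter 2: OCCUPANCY_COUNTER0 with a threshold of 1
    """
    latency = occupancy_sum / allocations if allocations else 0.0
    occupancy = occupancy_sum / not_empty_cycles if not_empty_cycles else 0.0
    return latency, occupancy


latency, occupancy = queue_metrics(1200, 100, 400)
```

With these sample values each entry spends 12 cycles in the queue on average, and the queue holds 3 entries on average while it is not empty.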
2.3.3 CBo Performance Monitors
Table 2-8. CBo Performance Monitoring MSRs
MSR Name                   MSR Address   Size (bits)   Description
C0_MSR_PMON_CTL3           0x0D13        32            CBo 0 PMON Control for Counter 3
C0_MSR_PMON_CTL2           0x0D12        32            CBo 0 PMON Control for Counter 2
C0_MSR_PMON_CTL1           0x0D11        32            CBo 0 PMON Control for Counter 1
C0_MSR_PMON_CTL0           0x0D10        32            CBo 0 PMON Control for Counter 0
Box-Level Control/Status
C0_MSR_PMON_BOX_CTL        0x0D04        32            CBo 0 PMON Box-Wide Control
C1_MSR_PMON_CTL3           0x0D33        32            CBo 1 PMON Control for Counter 3
C1_MSR_PMON_CTL2           0x0D32        32            CBo 1 PMON Control for Counter 2
C1_MSR_PMON_CTL1           0x0D31        32            CBo 1 PMON Control for Counter 1
C1_MSR_PMON_CTL0           0x0D30        32            CBo 1 PMON Control for Counter 0
Box-Level Control/Status
C1_MSR_PMON_BOX_CTL        0x0D24        32            CBo 1 PMON Box-Wide Control
C2_MSR_PMON_BOX_FILTER     0x0D54        32            CBo 2 PMON Filter
Generic Counter Control
C2_MSR_PMON_CTL3           0x0D53        32            CBo 2 PMON Control for Counter 3
C2_MSR_PMON_CTL2           0x0D52        32            CBo 2 PMON Control for Counter 2
C2_MSR_PMON_CTL1           0x0D51        32            CBo 2 PMON Control for Counter 1
C2_MSR_PMON_CTL0           0x0D50        32            CBo 2 PMON Control for Counter 0
Box-Level Control/Status
C2_MSR_PMON_BOX_CTL        0x0D44        32            CBo 2 PMON Box-Wide Control
C3_MSR_PMON_CTL3           0x0D73        32            CBo 3 PMON Control for Counter 3
C3_MSR_PMON_CTL2           0x0D72        32            CBo 3 PMON Control for Counter 2
C3_MSR_PMON_CTL1           0x0D71        32            CBo 3 PMON Control for Counter 1
C3_MSR_PMON_CTL0           0x0D70        32            CBo 3 PMON Control for Counter 0
Box-Level Control/Status
C3_MSR_PMON_BOX_CTL        0x0D64        32            CBo 3 PMON Box-Wide Control
C4_MSR_PMON_CTL3           0x0D93        32            CBo 4 PMON Control for Counter 3
C4_MSR_PMON_CTL2           0x0D92        32            CBo 4 PMON Control for Counter 2
C4_MSR_PMON_CTL1           0x0D91        32            CBo 4 PMON Control for Counter 1
C4_MSR_PMON_CTL0           0x0D90        32            CBo 4 PMON Control for Counter 0
Box-Level Control/Status
C4_MSR_PMON_BOX_CTL        0x0D84        32            CBo 4 PMON Box-Wide Control
C5_MSR_PMON_CTL3           0x0DB3        32            CBo 5 PMON Control for Counter 3
C5_MSR_PMON_CTL2           0x0DB2        32            CBo 5 PMON Control for Counter 2
C5_MSR_PMON_CTL1           0x0DB1        32            CBo 5 PMON Control for Counter 1
C5_MSR_PMON_CTL0           0x0DB0        32            CBo 5 PMON Control for Counter 0
Box-Level Control/Status
C5_MSR_PMON_BOX_CTL        0x0DA4        32            CBo 5 PMON Box-Wide Control
C6_MSR_PMON_CTL3           0x0DD3        32            CBo 6 PMON Control for Counter 3
C6_MSR_PMON_CTL2           0x0DD2        32            CBo 6 PMON Control for Counter 2
C6_MSR_PMON_CTL1           0x0DD1        32            CBo 6 PMON Control for Counter 1
C6_MSR_PMON_CTL0           0x0DD0        32            CBo 6 PMON Control for Counter 0
Box-Level Control/Status
C6_MSR_PMON_BOX_CTL        0x0DC4        32            CBo 6 PMON Box-Wide Control
C7_MSR_PMON_CTL3           0x0DF3        32            CBo 7 PMON Control for Counter 3
C7_MSR_PMON_CTL2           0x0DF2        32            CBo 7 PMON Control for Counter 2
C7_MSR_PMON_CTL1           0x0DF1        32            CBo 7 PMON Control for Counter 1
C7_MSR_PMON_CTL0           0x0DF0        32            CBo 7 PMON Control for Counter 0
Box-Level Control/Status
C7_MSR_PMON_BOX_CTL        0x0DE4        32            CBo 7 PMON Box-Wide Control
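The addresses listed for the CBo PMON MSRs follow a regular pattern: each CBo instance is offset by 0x20 from the previous one. The sketch below computes the addresses for CBo n from that pattern. The stride and the filter and counter base addresses are inferred from the values listed in Table 1-2 and Table 2-8 rather than stated as a rule in the text, so treat them as an assumption; the function name `cbo_msr_addresses` is hypothetical.

```python
CBO_STRIDE = 0x20        # inferred spacing between consecutive CBo instances
CBO_COUNT = 8            # the uncore contains up to eight CBos

# CBo 0 base addresses (box control and CTL0 from Table 2-8; filter and
# counter bases inferred from Table 1-2 by subtracting the stride).
_BOX_CTL0 = 0x0D04
_CTL0 = 0x0D10
_FILTER0 = 0x0D14
_CTR0 = 0x0D16


def cbo_msr_addresses(n):
    """Return the PMON MSR addresses for CBo n as a dict."""
    if not 0 <= n < CBO_COUNT:
        raise ValueError("CBo index must be in the range 0-7")
    base = n * CBO_STRIDE
    return {
        "box_ctl": _BOX_CTL0 + base,
        "ctl": [_CTL0 + base + i for i in range(4)],
        "filter": _FILTER0 + base,
        "ctr": [_CTR0 + base + i for i in range(4)],
    }


c7 = cbo_msr_addresses(7)
```

For CBo 7 this reproduces the listed values: box control 0xDE4, control registers 0xDF0-0xDF3, filter 0xDF4 and counters 0xDF6-0xDF9.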
2.3.3.1 CBo Box Level PMON State
The following registers represent the state governing all box-level PMUs in the CBo.
In the case of the CBo, the Cn_MSR_PMON_BOX_CTL register governs what happens when a freeze
signal is received (.frz_en). It also provides the ability to manually freeze the counters in the box
(.frz) and reset the generic state (.rst_ctrs and .rst_ctrl).
Table 2-9. Cn_MSR_PMON_BOX_CTL Register – Field Definitions
Field      Bits    Attr   HW Reset Val   Description
rsv        31:18   RV     0              Reserved (?)
rsv        17      RV     0              Reserved; SW must write to 0 else behavior is undefined.
frz_en     16      WO     0              Freeze Enable.
                                         If set to 1 and a freeze signal is received, the counters will be
                                         stopped or ‘frozen’, else the freeze signal will be ignored.
rsv        15:9    RV     0              Reserved (?)
frz        8       WO     0              Freeze.
                                         If set to 1 and the .frz_en is 1, the counters in this box will be
                                         frozen.
rst_ctrs   1       WO     0              Reset Counters.
                                         When set to 1, the Counter Registers will be reset to 0.
rst_ctrl   0       WO     0              Reset Control.
                                         When set to 1, the Counter Control Registers will be reset to 0.
2.3.3.2 CBo PMON state - Counter/Control Pairs
The following table defines the layout of the CBo performance monitor control registers. The main
task of these configuration registers is to select the event to be monitored by their respective data
counter (.ev_sel, .umask). Additional control bits are provided to shape the incoming events (e.g.
.invert, .edge_det, .thresh) as well as provide additional functionality for monitoring software (.rst).
Table 2-10. Cn_MSR_PMON_CTL{3-0} Register – Field Definitions

Field      Bits    Attr   HW Reset Val   Description
thresh     31:24   RW-V   0              Threshold used in counter comparison.
invert     23      RW-V   0              Invert comparison against Threshold.
                                         0 - comparison will be ‘is event increment >= threshold?’.
                                         1 - comparison is inverted - ‘is event increment < threshold?’
                                         NOTE: .invert is in series following .thresh. Due to this, the
                                         .thresh field must be set to a non-0 value. For events that
                                         increment by no more than 1 per cycle, set .thresh to 0x1.
                                         Also, if .edge_det is set to 1, the counter will increment when a 1
                                         to 0 transition (i.e. falling edge) is detected.
en         22      RW-V   0              Local Counter Enable.
rsv        21:20   RV     0              Reserved; SW must write to 0 else behavior is undefined.
tid_en     19      RW-V   0              TID Filter Enable
edge_det   18      RW-V   0              When set to 1, rather than measuring the event in each cycle it
                                         is active, the corresponding counter will increment when a 0 to 1
                                         transition (i.e. rising edge) is detected.
                                         When 0, the counter will increment in each cycle that the event
                                         is asserted.
                                         NOTE: .edge_det is in series following .thresh. Due to this, the
                                         .thresh field must be set to a non-0 value. For events that
                                         increment by no more than 1 per cycle, set .thresh to 0x1.
rst        17      WO     0              When set to 1, the corresponding counter will be cleared to 0.
rsv        16      RV     0              Reserved. SW must write to 0 else behavior is undefined.
umask      15:8    RW-V   0              Select subevents to be counted within the selected event.
ev_sel     7:0     RW-V   0              Select event to be counted.
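As a sketch of how the control fields combine, the Python function below packs the fields of Table 2-10 into a 32-bit Cn_MSR_PMON_CTLx value and enforces the NOTE that .thresh must be non-zero whenever .invert or .edge_det is set. The function name is hypothetical, and the event and umask codes in the example calls are placeholders, not event definitions taken from this guide.

```python
def pmon_ctl_value(ev_sel, umask=0, thresh=0, invert=False,
                   edge_det=False, tid_en=False, en=True):
    """Pack Table 2-10 fields into a 32-bit control value (reserved bits = 0)."""
    for name, field in (("ev_sel", ev_sel), ("umask", umask), ("thresh", thresh)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    # .invert and .edge_det sit in series after .thresh, so .thresh must be non-0.
    if (invert or edge_det) and thresh == 0:
        raise ValueError(".thresh must be non-zero when .invert or .edge_det is set")
    value = ev_sel                 # bits 7:0
    value |= umask << 8            # bits 15:8
    value |= int(edge_det) << 18   # bit 18
    value |= int(tid_en) << 19     # bit 19
    value |= int(en) << 22         # bit 22
    value |= int(invert) << 23     # bit 23
    value |= thresh << 24          # bits 31:24
    return value


# Placeholder codes: count event 0x34 with umask 0x03 in every cycle it is asserted.
print(hex(pmon_ctl_value(0x34, umask=0x03)))                           # 0x400334
# Same event, counting rising edges; .thresh is set to 0x1 as the NOTE requires.
print(hex(pmon_ctl_value(0x34, umask=0x03, edge_det=True, thresh=1)))  # 0x1440334
```

Rejecting the invalid combination in software is a design choice of this sketch; the table only states the requirement on .thresh.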
The CBo performance monitor data registers are 44b wide. Should a counter overflow (a carry out
from bit 43), the counter will wrap and continue to collect events. If accessible, software can
continuously read the data registers without disabling event collection.
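Because the counters wrap at 44 bits, software that samples them periodically has to compute differences modulo 2^44. A minimal Python sketch follows, assuming the sampling interval is short enough that the counter wraps at most once between two reads; the names are illustrative.

```python
COUNTER_BITS = 44
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(previous, current):
    """Events counted between two reads, assuming at most one wrap in between."""
    return (current - previous) & COUNTER_MASK


print(counter_delta(100, 250))               # 150
print(counter_delta(COUNTER_MASK - 4, 10))   # 15 (counter wrapped past bit 43)
```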
Table 2-11. Cn_MSR_PMON_CTR{3-0} Register – Field Definitions
In addition to generic event counting, each CBo provides a MATCH register that allows a user to filter
various traffic as it applies to specific events (see Event Section for more information). LLC_LOOKUP
may be filtered by the cacheline state, QPI_CREDITS may be filtered by link while TOR_INSERTS and
TOR_OCCUPANCY may be filtered by the opcode of the queued request as well as the corresponding
NodeID.
Any of the CBo events may be filtered by Thread/Core-ID. To do so, the control register’s .tid_en bit
must be set to 1 and the tid field in the FILTER register filled out.
Note: Not all transactions can be associated with a specific thread. For example, when a
snoop triggers a WB, it does not have an associated thread. Transactions that are
associated with PCIe will come from “0x1E” (b11110).
Note: Only one of these filtering criteria may be applied at a time.
Table 2-12. Cn_MSR_PMON_BOX_FILTER Register – Field Definitions

Field      Bits    Attr   HW Reset Val   Description
opc (7b IDI Opcode? w/top 2b 0x3)
                                         NOTE: Only tracks opcodes that come from the IRQ. It is not
                                         possible to track snoops (from IPQ) or other transactions from
                                         the ISMQ.
state      22:18   RW     0              Select state to monitor for LLC_LOOKUP event. Setting multiple
                                         bits in this field will allow a user to track multiple states.
                                         b1xxxx - ‘F’ state.
                                         bx1xxx - ‘M’ state.
                                         bxx1xx - ‘E’ state.
                                         bxxx1x - ‘S’ state.
                                         bxxxx1 - ‘I’ state.
nid                                      NID is a mask filter with each bit representing a different Node in
                                         the system. 0x01 would filter on NID 0, 0x2 would filter on NID
                                         1, etc.
tid                                      [0] Thread 1/0
                                         When .tid_en is 0; the specified counter will count ALL events.
                                         Thread-ID 0xF is reserved for non-associated requests such as:
                                         - LLC victims
                                         - PMSeq
                                         - External Snoops
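The .state encoding above can be sketched in Python as follows. The sketch builds only the .state portion (bits 22:18) of a Cn_MSR_PMON_BOX_FILTER value from a set of cacheline state letters; the names are hypothetical, and the other filter fields are left at 0.

```python
STATE_SHIFT = 18  # .state occupies bits 22:18

# One bit per state, in the order given by the table (F, M, E, S, I).
STATE_BITS = {
    "F": 0b10000,
    "M": 0b01000,
    "E": 0b00100,
    "S": 0b00010,
    "I": 0b00001,
}


def state_filter(states):
    """Return the .state field, shifted into place, for the given state letters."""
    mask = 0
    for state in states:
        mask |= STATE_BITS[state]
    return mask << STATE_SHIFT


print(hex(state_filter("ME")))     # 0x300000: track 'M' and 'E' lookups
print(hex(state_filter("FMESI")))  # 0x7c0000: track all five states
```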
Refer to Table 2-144, “Opcodes (Alphabetical Listing)” for definitions of the opcodes found in the
following table.
Table 2-13. Opcode Match by IDI Packet Type for Cn_MSR_PMON_BOX_FILTER.opc

opc Value   Opcode     Defn
0x180       RFO        Demand Data RFO
0x181       CRd        Demand Code Read
0x182       DRd        Demand Data Read
0x187       PRd        Partial Reads (UC)
0x18C       WCiLF      Streaming Store - Full
0x18D       WCiL       Streaming Store - Partial
0x190       PrefRFO    Prefetch RFO into LLC but don’t pass to L2. Includes Hints
0x191       PrefCode   Prefetch Code into LLC but don’t pass to L2. Includes Hints
0x192       PrefData   Prefetch Data into LLC but don’t pass to L2. Includes Hints
0x194       PCIWiLF    PCIe Write (non-allocating)
0x195       PCIPRd     PCIe UC Read
0x19C       PCIItoM    PCIe Write (allocating)
0x19E       PCIRdCur   PCIe read current
0x1C4       WbMtoI     Request writeback Modified invalidate line
0x1C5       WbMtoE     Request writeback Modified set to Exclusive
0x1C8       ItoM       Request Invalidate Line
0x1E4       PCINSRd    PCIe Non-Snoop Read
0x1E5       PCINSWr    PCIe Non-Snoop Write (partial)
0x1E6       PCINSWrF   PCIe Non-Snoop Write (full)
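The opc annotation in Table 2-12 can be read as a 7-bit IDI opcode with the top two bits set to 0x3. Assuming that reading, every opc value in Table 2-13 is 0x180 ORed with a 7-bit opcode. The Python sketch below transcribes the table and checks that assumption; the dictionary and function names are hypothetical.

```python
OPC_TOP = 0x3 << 7  # top two bits of the 9-bit opc value (0x180)

# opc values transcribed from Table 2-13.
OPC_VALUES = {
    "RFO": 0x180, "CRd": 0x181, "DRd": 0x182, "PRd": 0x187,
    "WCiLF": 0x18C, "WCiL": 0x18D,
    "PrefRFO": 0x190, "PrefCode": 0x191, "PrefData": 0x192,
    "PCIWiLF": 0x194, "PCIPRd": 0x195, "PCIItoM": 0x19C, "PCIRdCur": 0x19E,
    "WbMtoI": 0x1C4, "WbMtoE": 0x1C5, "ItoM": 0x1C8,
    "PCINSRd": 0x1E4, "PCINSWr": 0x1E5, "PCINSWrF": 0x1E6,
}


def idi_opcode(opc_value):
    """Return the low 7 bits of an opc value after checking the top two bits."""
    if opc_value >> 7 != 0x3:
        raise ValueError("top two bits of opc are expected to be 0x3")
    return opc_value & 0x7F


# Every table entry has the 0x3 prefix under the assumed reading.
print(all(value & OPC_TOP == OPC_TOP for value in OPC_VALUES.values()))  # True
print(hex(idi_opcode(OPC_VALUES["PCINSWrF"])))                            # 0x66
```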
2.3.4 CBo Performance Monitoring Events
2.3.4.1 An Overview:
The performance monitoring events within the CBo include all events internal to the LLC as well as
events which track ring related activity at the CBo/Core ring stops.
CBo performance monitoring events can be used to track LLC access rates, LLC hit/miss rates, LLC
eviction and fill rates, and to detect evidence of back pressure on the LLC pipelines. In addition, the
CBo has performance monitoring events for tracking MESI state transitions that occur as a result of
data sharing across sockets in a multi-socket system. And finally, there are events in the CBo for
tracking ring traffic at the CBo/Core sink inject points.
Every event in the CBo is from the point of view of the LLC and is not associated with any specific core
since all cores in the socket send their LLC transactions to all CBos in the socket. However, the PMON
logic in the CBo provides a thread-id field in the Cn_MSR_PMON_BOX_FILTER register which can be
applied to the CBo events to obtain the interactions between specific cores and threads.
There are separate sets of counters for each CBo instance. For any event, to get an aggregate count
of that event for the entire LLC, the counts across the CBo instances must be added together. The
counts can be averaged across the CBo instances to get a view of the typical count of an event from
the perspective of the individual CBos. Individual per-CBo deviations from the average can be used to
identify hot-spotting across the CBos or other evidence of non-uniformity in LLC behavior across the
CBos. Such hot-spotting should be rare, though a repetitive polling on a fixed physical address is one
obvious example of a case where an analysis of the deviations across the CBos would indicate hot-spotting.
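The aggregation described above can be sketched in Python as follows. The per-CBo counts are invented sample data, and the 2x-average threshold used to flag a hot CBo is an arbitrary choice for illustration, not a value taken from this guide.

```python
from statistics import mean


def summarize(per_cbo_counts):
    """Return (LLC-wide total, per-CBo average, per-CBo deviation from average)."""
    total = sum(per_cbo_counts)
    average = mean(per_cbo_counts)
    deviations = [count - average for count in per_cbo_counts]
    return total, average, deviations


def hot_cbos(per_cbo_counts, factor=2.0):
    """Indices of CBos whose count exceeds factor * average (hypothetical rule)."""
    average = mean(per_cbo_counts)
    return [i for i, count in enumerate(per_cbo_counts)
            if average > 0 and count > factor * average]


# Invented counts for one event read from CBo 0..7.
counts = [100, 110, 90, 105, 95, 100, 900, 100]
total, average, deviations = summarize(counts)
print(total, average)    # 1600 200
print(hot_cbos(counts))  # [6]
```

In this made-up sample, CBo 6 stands far above the average, which is the kind of pattern repeated polling of a single physical address would be expected to produce.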
2.3.4.2 Acronyms frequently used in CBo Events:
The Rings:
AD (Address) Ring - Core Read/Write Requests and Intel QPI Snoops. Carries Intel QPI requests and
snoop responses from C to Intel® QPI.
BL (Block or Data) Ring - Data == 2 transfers for 1 cache line