INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR
LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice.
Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.
The Intel® 64 architecture processors may contain design defects or errors known as errata. Current characterized errata are available on request.
Hyper-Threading Technology requires a computer system with an Intel® processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information, see http://www.intel.com/technology/hyperthread/index.htm; including details on which processors support HT Technology.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations. Intel® Virtualization Technology-enabled BIOS and VMM applications are currently in development.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Processors will not operate (including 32-bit operation) without an Intel® 64 architecture-enabled BIOS. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other
countries.
*Other names and brands may be claimed as the property of others.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an ordering nu mber and are r eferenced in t his document, or other Intel liter ature, may be obtained
from:
Intel Corporation
P.O. Box 5937
Denver, CO 80217-9808
or call 1-800-548-4725
or visit Intel’s website at http://www.intel.com
INTEL® XEON® PROCESSOR 7500 SERIES UNCORE PROGRAMMING GUIDE
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Figure 1-1 provides an Intel® Xeon® Processor 7500 Series block diagram.
Figure 1-1. Intel Xeon Processor 7500 Series Block Diagram
1.2 Uncore PMU Overview
The processor uncore performance monitoring is supported by PMUs local to each of the C, S, B, M, R, U,
and W-Boxes. Each of these boxes communicates with the U-Box which contains registers to control all
uncore PMU activity (as outlined in Section 2.1, “Global Performance Monitoring Control”).
All processor uncore performance monitoring features can be accessed through RDMSR/WRMSR instructions executed at ring 0.
Since the uncore performance monitors represent socket-wide resources that are not context switched
by the OS, it is highly recommended that only one piece of software (per-socket) attempt to program and
extract information from the monitors. To keep things simple, it is also recommended that the monitoring
software communicate with the OS such that it can be executed on coreId = 0, threadId = 0. Although
recommended, this step is not necessary. Software may be notified of an overflowing uncore counter on
any core.
The general performance monitoring capabilities in each box are outlined in the following table.
• Section 2.8, “W-Box Performance Monitoring”
• Section 2.9, “Packet Matching Reference”
CHAPTER 2
UNCORE PERFORMANCE MONITORING
2.1 Global Performance Monitoring Control
2.1.1 Counter Overflow
If a counter overflows, it will send the overflow signal towards the U-Box. This signal will be
accumulated along the way in summary registers contained in each S-Box and a final summary register
in the U-Box.
The Intel® Xeon® Processor 7500 Series uncore performance monitors may be configured to respond to
this overflow with two basic actions:
2.1.1.1 Freezing on Counter Overflow
Each uncore performance counter may be configured to, upon detection of overflow, disable (or
‘freeze’) all other counters in the uncore. To do so, the .pmi_en in the individual counter’s control
register must be set to 1. If the U_MSR_PMON_GLOBAL_CTL.frz_all is also set to 1, once the U-Box
receives the PMI from the uncore box, it will set U_MSR_PMON_GLOBAL_CTL.en_all to 0 which will
disable all counting.
2.1.1.2 PMI on Counter Overflow
The uncore may also be configured to, upon detection of a performance counter overflow, send a PMI
signal to the core executing the monitoring software. To do so, the .pmi_en in the individual counter’s
control register must be set to 1 and U_MSR_PMON_GLOBAL_CTL.pmi_core_sel must be set to point to
the core the monitoring software is executing on.
Note: PMI is decoupled from freeze, so if software also wants the counters frozen, it must set
U_MSR_PMON_GLOBAL_CTL.frz_all to 1.
2.1.2 Setting up a Monitoring Session
On HW reset, all the counters should be disabled. Enabling is hierarchical. So the following steps must
be taken to set up a new monitoring session:
a) Reset counters to ensure no stale values have been acquired from previous sessions:
- set U_MSR_PMON_GLOBAL_CTL.rst_all to 1.
b) Select event to monitor:
Determine what events should be captured and program the control registers to capture them (i.e.
typically selected by programming the .ev_sel bits although other bit fields may be involved).
e.g. Set B_MSR_PMON_EVT_SEL3.ev_sel to 0x03 to capture SNP_MERGE.
c) Enable counting locally:
e.g. Set B_MSR_PMON_EVT_SEL3.en to 1.
d) Enable counting at the box-level:
Enable counters within that box via its ‘GLOBAL_CTL’ register
e.g. set B_MSR_PMON_GLOBAL_CTL[3] to 1.
e) Select how to gather data. If polling, skip to step f). If sampling:
To set up a sample interval, software can pre-program the data register with a value of [2^48 -
sample interval length]. Doing so allows software, through use of the pmi mechanism, to be notified
when the number of events in the sample have been captured. Capturing a performance monitoring
sample every ‘X cycles’ (the fixed counter in the W-Box counts uncore clock cycles) is a common use of
this mechanism.
e.g. To stop counting and receive notification when the 1,000th SNP_MERGE has been detected,
- set B_MSR_PMON_CNT to (2^48 - 1000)
- set B_MSR_PMON_EVT_SEL.pmi_en to 1
- set U_MSR_PMON_GLOBAL_CTL.frz_all to 1
- set U_MSR_PMON_GLOBAL_CTL.pmi_core_sel to which core the monitoring thread is executing on.
f) Enable counting at the global level by setting the U_MSR_PMON_GLOBAL_CTL.en_all bit to 1. Set the
.rst_all field to 0 with the same write.
And with that, counting will begin.
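The ordering of steps a) through f) can be sketched in Python as follows. This is only an illustration of the sequence, not driver code: registers are modelled as named fields in a dictionary rather than real MSRs, because this section does not give the bit positions of most fields. The `FakeUncore` class, the `setup_sampling_session` function and the `counter3_enable` field name are hypothetical names introduced for the sketch.

```python
COUNTER_WIDTH = 48  # uncore data registers are 48 bits wide


class FakeUncore:
    """Hypothetical stand-in for WRMSR access; records field writes in order."""

    def __init__(self):
        self.fields = {}  # "REGISTER.field" -> last value written
        self.log = []     # ordered list of (register, {field: value}) writes

    def write(self, register, **fields):
        """Model a single register write that sets one or more fields."""
        for name, value in fields.items():
            self.fields[f"{register}.{name}"] = value
        self.log.append((register, dict(fields)))


def setup_sampling_session(uncore, event_code, sample_interval, pmi_core_mask):
    """Program a B-Box counter 3 sampling session following steps a) to f)."""
    # a) Reset counters so no stale values survive from a previous session.
    uncore.write("U_MSR_PMON_GLOBAL_CTL", rst_all=1)
    # b) Select the event to monitor.
    uncore.write("B_MSR_PMON_EVT_SEL3", ev_sel=event_code)
    # c) Enable counting locally.
    uncore.write("B_MSR_PMON_EVT_SEL3", en=1)
    # d) Enable counter 3 at the box level (B_MSR_PMON_GLOBAL_CTL[3] = 1).
    #    The field name is an assumption; only the bit index is from the text.
    uncore.write("B_MSR_PMON_GLOBAL_CTL", counter3_enable=1)
    # e) Sampling: preload the data register with 2^48 - interval and arm
    #    PMI plus freeze so software is notified when the interval elapses.
    uncore.write("B_MSR_PMON_CNT",
                 event_count=(1 << COUNTER_WIDTH) - sample_interval)
    uncore.write("B_MSR_PMON_EVT_SEL", pmi_en=1)
    uncore.write("U_MSR_PMON_GLOBAL_CTL", frz_all=1, pmi_core_sel=pmi_core_mask)
    # f) Enable globally, clearing rst_all in the same write.
    uncore.write("U_MSR_PMON_GLOBAL_CTL", en_all=1, rst_all=0)
```

Calling `setup_sampling_session(FakeUncore(), 0x03, 1000, 0b1)` reproduces the SNP_MERGE example above; the recorded `log` makes it easy to check that the global enable is the last write issued.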
2.1.3 Reading the Sample Interval
Software can either poll the counters whenever it chooses, or wait to be notified that a counter has
overflowed (by receiving a PMI).
a) Polling - before reading, it is recommended that software freeze and disable the counters (by
clearing U_MSR_PMON_GLOBAL_CTL.en_all).
b) Frozen counters - If software set up the counters to freeze on overflow and send notification when it
happens, the next question is: Who caused the freeze?
Overflow bits are stored hierarchically within the Intel Xeon Processor 7500 Series uncore. First,
software should read the U_MSR_PMON_GLOBAL_STATUS.ov_* bits to determine whether a U or W box
counter caused the overflow or whether it was a counter in a box attached to the S0 or S1 Box.
The S-Boxes aggregate overflow bits from the M/B/C/R boxes they are attached to. So the next step is
to read the S{0,1}_MSR_PMON_SUMMARY.ov_* bits. Once the box(es) that contains the overflowing
counter is identified, the last step is to read that box’s *_MSR_PMON_GLOBAL_STATUS.ov field to find
the overflowing counter.
Note: More than one counter may overflow at any given time.
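The first step of that walk, decoding U_MSR_PMON_GLOBAL_STATUS, can be sketched in Python as below. The bit positions are taken from Table 2-3; the function names and the returned labels are illustrative assumptions. The S-Box summary layout is defined elsewhere (Table 2-26), so the sketch stops at telling software where to look next.

```python
# Bit positions as listed in Table 2-3 (U_MSR_PMON_GLOBAL_STATUS).
U_STATUS_BITS = {
    "ov_u": 0,    # overflow from a U-Box PMON register
    "ov_w": 1,    # overflow from a W-Box PMON register
    "ov_s1": 2,   # overflow reported through S-Box 1
    "ov_s0": 3,   # overflow reported through S-Box 0
    "pmi": 30,    # PMI received from box with overflowing counter
    "cond": 31,   # condition change
}


def decode_u_global_status(value):
    """Split a raw status value into its named single-bit flags."""
    return {name: (value >> bit) & 1 for name, bit in U_STATUS_BITS.items()}


def next_places_to_look(value):
    """Return, in order, where software should look for the overflow source."""
    flags = decode_u_global_status(value)
    places = []
    if flags["ov_s0"]:
        places.append("S0_MSR_PMON_SUMMARY")
    if flags["ov_s1"]:
        places.append("S1_MSR_PMON_SUMMARY")
    if flags["ov_w"]:
        places.append("W-Box")
    if flags["ov_u"]:
        places.append("U-Box")
    return places
```

Because more than one counter may overflow at once, the helper returns a list rather than a single box.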
2.1.4 Enabling a New Sample Interval from Frozen Counters
Note: Software can determine if the counters have been frozen due to a PMI by examining two
bits: U_MSR_PMON_GLOBAL_STATUS.pmi should be 1 and
U_MSR_PMON_GLOBAL_CTL.en_all should be 0. If not, set
U_MSR_PMON_GLOBAL_CTL.en_all to 0 to disable counting.
a) Clear all uncore counters: Set U_MSR_PMON_GLOBAL_CTL.rst_all to 1.
b) Clear all overflow bits. When an overflow bit is cleared, all bits that summarize that overflow (above
in the hierarchy) will also be cleared. Therefore it is only necessary to clear the overflow bits
corresponding to the actual counter.
e.g. If counter 3 in B-Box 1 overflowed, to clear the overflow bit software should set
B_MSR_PMON_GLOBAL_OVF_CTL.clr_ov[3] to 1 in B-Box 1. This action will also clear
S_MSR_PMON_SUMMARY.ov_mb in S-Box 1 and U_MSR_PMON_GLOBAL_STATUS.ov_s1.
c) Create the next sample: Reinitialize the sample by setting the monitoring data register to (2^48 -
sample_interval). Or set up a new sample interval as outlined in Section 2.1.2, “Setting up a Monitoring
Session”.
d) Re-enable counting: Set U_MSR_PMON_GLOBAL_CTL.en_all to 1. Set the .rst_all field back to 0 with
the same write.
2.1.5 Global Performance Monitors
Table 2-1. Global Performance Monitoring Control MSRs
MSR Name                     Access   MSR Address   Size (bits)   Description
U_MSR_PMON_GLOBAL_OVF_CTL    RW_RW    0x0C02        32            U-Box PMON Global Overflow Control
U_MSR_PMON_GLOBAL_STATUS     RW_RO    0x0C01        32            U-Box PMON Global Status
U_MSR_PMON_GLOBAL_CTL        RW_RO    0x0C00        32            U-Box PMON Global Control
2.1.5.1 Global PMON Global Control/Status Registers
The following registers represent state governing all PMUs in the uncore, both to exert global control
and collect box-level information.
U_MSR_PMON_GLOBAL_CTL contains bits that can reset (.rst_all) and freeze/enable (.en_all) all the
uncore counters. The .en_all bit must be set to 1 before any uncore counters will collect events.
Note: The register also contains the enable for the U-Box counters.
If an overflow is detected in any of the uncore’s PMON registers, it will be summarized in
U_MSR_PMON_GLOBAL_STATUS. This register accumulates overflows sent to it from the U-Box, W-Box
and S-Boxes and indicates if a disable was received from one of the boxes. To reset the summary
overflow bits, a user must set the corresponding bits in the U_MSR_PMON_GLOBAL_OVF_CTL register.
Table 2-2. U_MSR_PMON_GLOBAL_CTL Register – Field Definitions
Field      Bits   HW Reset Val   Description
frz_all    31     0              Disable uncore counting (by clearing .en_all) if PMI is received from box
Ex:
If counter pmi is sent to U-Box for Box with overflowing counter...
00000000 - No PMI sent
00000001 - Send PMI to core 0
10000000 - Send PMI to core 7
11000100 - Send PMI to core 2, 6 & 7
etc.
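The bit patterns in the example above read as one bit per core, with bit n selecting core n. A minimal Python sketch of building such a mask follows. The function name and the limit of eight cores are assumptions for illustration, inferred from the 8-bit patterns shown rather than from a field definition.

```python
MAX_CORES = 8  # assumption: the example patterns above are 8 bits wide


def pmi_core_mask(cores):
    """Build a one-bit-per-core mask selecting the cores that receive the PMI."""
    mask = 0
    for core in cores:
        if not 0 <= core < MAX_CORES:
            raise ValueError(f"core index {core} out of range")
        mask |= 1 << core
    return mask
```

For instance, `format(pmi_core_mask([2, 6, 7]), "08b")` gives the `11000100` pattern listed above.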
Table 2-3. U_MSR_PMON_GLOBAL_STATUS Register – Field Definitions
Field    Bits    HW Reset Val   Description
cond     31      0              Condition Change
pmi      30      0              PMI Received from box with overflowing counter.
ig       29:4    0              Read zero; writes ignored. (?)
ov_s0    3       0              Set if overflow is detected from a S-Box 0 PMON register.
ov_s1    2       0              Set if overflow is detected from a S-Box 1 PMON register.
ov_w     1       0              Set if overflow is detected from a W-Box PMON register.
ov_u     0       0              Set if overflow is detected from a U-Box PMON register.
Table 2-4. U_MSR_PMON_GLOBAL_OVF_CTL Register – Field Definitions
2.2 U-Box Performance Monitoring
The U-Box serves as the system configuration controller for the Intel Xeon Processor 7500 Series.
It contains one counter which can be configured to capture a small set of events.
U-Box global state bits are stored in the uncore global state registers. Refer to Section 2.1, “Global
Performance Monitoring Control” for more information.
2.2.1.2 U-Box PMON state - Counter/Control Pairs
The following table defines the layout of the U-Box performance monitor control register. The main task
of this configuration register is to select the event to be monitored by its respective data counter.
Setting the .ev_sel field performs the event selection. The .en bit must be set to 1 to enable counting.
Additional control bits include:
- .pmi_en which governs what to do if an overflow is detected.
- .edge_detect - Rather than accumulating the raw count each cycle, the register can capture
transitions from no event to an event incoming.
Table 2-6. U_MSR_PMON_EVT_SEL Register – Field Definitions
Field         Bits    HW Reset Val   Description
ig            63      0              Read zero; writes ignored. (?)
rsv           62      0              Reserved; Must write to 0 else behavior is undefined.
ig            61:23   0              Read zero; writes ignored. (?)
en            22      0              Local Counter Enable. When set, the associated counter is locally
                                     enabled.
                                     NOTE: It must also be enabled in C_MSR_PMON_GLOBAL_CTL and the
                                     U-Box to be fully enabled.
ig            21      0              Read zero; writes ignored. (?)
pmi_en        20      0              When this bit is asserted and the corresponding counter overflows, a PMI
                                     exception is sent to the U-Box.
ig            19      0              Read zero; writes ignored. (?)
edge_detect   18      0              When asserted, the 0 to 1 transition edge of a 1 bit event input will cause
                                     the corresponding counter to increment. When 0, the counter will
                                     increment for however long the event is asserted.
ig            17:8    0              Read zero; writes ignored. (?)
ev_sel        7:0     0              Select event to be counted.
The U-Box performance monitor data register is 48b wide. A counter overflow occurs when a carry out
bit from bit 47 is detected. Software can force all uncore counting to freeze after N events by preloading
a monitor with a count value of 2^48 - N and setting the control register to send a PMI to the U-Box. Upon
receipt of the PMI, the U-Box will disable counting (Section 2.1.1.1, “Freezing on Counter Overflow”).
During the interval of time between overflow and global disable, the counter value will wrap and
continue to collect events.
In this way, software can capture the precise number of events that occurred between the time uncore
counting was enabled and when it was disabled (or ‘frozen’) with minimal skew.
If accessible, software can continuously read the data registers without disabling event collection.
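The preload arithmetic and the wrap behavior described above can be illustrated with a small Python model. It is a sketch under stated assumptions, not a description of the hardware: the function names are hypothetical, and the final helper assumes that the value left in the counter after the wrap equals the number of events collected between overflow and global disable.

```python
COUNTER_BITS = 48
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def preload_for(n_events):
    """Value to write so that the counter overflows on the n-th event."""
    if not 0 < n_events < (1 << COUNTER_BITS):
        raise ValueError("n_events must be between 1 and 2^48 - 1")
    return (1 << COUNTER_BITS) - n_events


def run_counter(start, increments):
    """Apply per-cycle increments to a 48-bit counter.

    Returns (final_value, overflow_index), where overflow_index is the index
    of the first increment that produced a carry out of bit 47, or None.
    """
    value = start
    overflow_index = None
    for index, increment in enumerate(increments):
        total = value + increment
        if total >> COUNTER_BITS and overflow_index is None:
            overflow_index = index
        value = total & COUNTER_MASK  # the counter wraps and keeps counting
    return value, overflow_index


def events_in_interval(n_events, frozen_value):
    """Events seen from enable to freeze: the preload distance plus the skew."""
    return n_events + frozen_value
```

With a preload for N = 3 and five single-event cycles before the freeze takes effect, the model overflows on the third event and leaves 2 in the counter, so the interval contained 5 events.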
Table 2-7. U_MSR_PMON_CTR Register – Field Definitions
Field         Bits   HW Reset Val   Description
event_count   47:0   0              48-bit performance event counter
2.2.2 U-Box Performance Monitoring Events
The set of events that can be monitored in the U-Box are summarized in the following section.
- Tracks NcMsgS packets generated by the U-Box, as they arbitrate to be broadcast. They are
prioritized as follows: Special Cycle->StopReq1/StartReq2->Lock/Unlock->Remote Interrupts->Local
Interrupts.
- Errors detected and distinguished between recoverable, corrected, uncorrected and fatal.
- Number of times cores were sent IPIs or were Woken up.
- Requests to the Ring or a B-Box.
etc.
2.2.3 U-Box Events Ordered By Code
Table 2-8 summarizes the directly-measured U-Box events.
Table 2-8. Performance Monitor Events for U-Box Events
Symbol Name             Event Code   Max Inc/Cyc   Description
BUF_VALID_LOCAL_INT     0x000        1             Local IPI Buffer is valid
BUF_VALID_REMOTE_INT    0x001        1             Remote IPI Buffer is valid
BUF_VALID_LOCK          0x002        1             Lock Buffer is valid
BUF_VALID_STST          0x003        1             Start/Stop Req Buffer is valid
BUF_VALID_SPC_CYCLES    0x004        1             SpcCyc Buffer is valid
U2R_REQUESTS            0x050        1             Number U-Box to Ring Requests
U2B_REQUEST_CYCLES      0x051        1             U to B-Box Active Request Cycles
WOKEN                   0x0F8        1             Number of core woken up
IPIS_SENT               0x0F9        1             Number of core IPIs sent
RECOV                   0x1DF        1             Recoverable
CORRECTED_ERR           0x1E4        1             Corrected Error
This section enumerates Intel Xeon Processor 7500 Series uncore performance monitoring events for
the U-Box.
BUF_VALID_LOCAL_INT
• Title: Local IPI Buffer Valid
• Category: U-Box Events
• Event Code: 0x000, Max. Inc/Cyc: 1,
• Definition: Number of cycles the Local Interrupt packet buffer contained a valid entry.
BUF_VALID_LOCK
• Title: Lock Buffer Valid
• Category: U-Box Events
• Event Code: 0x002, Max. Inc/Cyc: 1,
• Definition: Number of cycles the Lock packet buffer contained a valid entry.
BUF_VALID_REMOTE_INT
• Title: Remote IPI Buffer Valid
• Category: U-Box Events
• Event Code: 0x001, Max. Inc/Cyc: 1,
• Definition: Number of cycles the Remote IPI packet buffer contained a valid entry.
BUF_VALID_SPC_CYCLES
• Title: SpcCyc Buffer Valid
• Category: U-Box Events
• Event Code: 0x004, Max. Inc/Cyc: 1,
• Definition: Number of uncore cycles the Special Cycle packet buffer contains a valid entry. ‘Special
Cycles’ are NcMsgS packets generated by the U-Box and broadcast to internal cores to cover such
things as Shutdown, Invd_Ack and WbInvd_Ack conditions.
BUF_VALID_STST
• Title: Start/Stop Req Buffer Valid
• Category: U-Box Events
• Event Code: 0x003, Max. Inc/Cyc: 1,
• Definition: Number of uncore cycles the Start/Stop Request packet buffer contained a valid entry.
CORRECTED_ERR
• Title: Corrected Errors
• Category: U-Box Events
• Event Code: 0x1E4, Max. Inc/Cyc: 1,
• Definition: Number of corrected errors.
FATAL_ERR
• Title: Fatal Errors
• Category: U-Box Events
• Event Code: 0x1E6, Max. Inc/Cyc: 1,
• Definition: Number of fatal errors.
IPIS_SENT
• Title: Number Core IPIs Sent
• Category: U-Box Events
• Event Code: 0x0F9, Max. Inc/Cyc: 1,
• Definition: Number of core IPIs sent.
RECOV
• Title: Recoverable
• Category: U-Box Events
• Event Code: 0x1DF, Max. Inc/Cyc: 1,
• Definition: Number of recoverable errors.
U2R_REQUESTS
• Title: Number U2R Requests
• Category: U-Box Events
• Event Code: 0x050, Max. Inc/Cyc: 1,
• Definition: Number U-Box to Ring Requests.
U2B_REQUEST_CYCLES
• Title: U2B Active Request Cycles
• Category: U-Box Events
• Event Code: 0x051, Max. Inc/Cyc: 1,
• Definition: Number U to B-Box Active Request Cycles.
UNCORRECTED_ERR
• Title: Uncorrected Error
• Category: U-Box Events
• Event Code: 0x1E5, Max. Inc/Cyc: 1,
• Definition: Number of uncorrected errors.
WOKEN
• Title: Number Cores Woken Up
• Category: U-Box Events
• Event Code: 0x0F8, Max. Inc/Cyc: 1,
• Definition: Number of cores woken up.
2.3 C-Box Performance Monitoring
2.3.1 Overview of the C-Box
For the Intel Xeon Processor 7500 Series, the LLC coherence engine (C-Box) manages the interface
between the core and the last level cache (LLC). All core transactions that access the LLC are directed
from the core to a C-Box via the ring interconnect. The C-Box is responsible for managing data delivery
from the LLC to the requesting core. It is also responsible for maintaining coherence between the cores
within the socket that share the LLC; generating snoops and collecting snoop responses to the local
cores when the MESI protocol requires it.
The C-Box is also the gate keeper for all Intel® QuickPath Interconnect (Intel® QPI) messages that
originate in the core and is responsible for ensuring that all Intel QuickPath Interconnect messages that
pass through the socket’s LLC remain coherent.
The Intel Xeon Processor 7500 Series contains eight instances of the C-Box, each assigned to manage a
distinct 3MB, 24-way set associative slice of the processor’s total LLC capacity. For processors with
fewer than 8 3MB LLC slices, the C-Boxes for missing slices will still be active and track ring traffic
caused by their co-located core even if they have no LLC related traffic to track (i.e. hits/misses/
snoops).
Every physical memory address in the system is uniquely associated with a single C-Box instance via a
proprietary hashing algorithm that is designed to keep the distribution of traffic across the C-Box
instances relatively uniform for a wide range of possible address patterns. This enables the individual
C-Box instances to operate independently, each managing its slice of the physical address space without
any C-Box in a given socket ever needing to communicate with the other C-Boxes in that same socket.
Each C-Box is uniquely associated with a single S-Box. All messages which a given C-Box sends out to
the system memory or Intel QPI pass through the S-Box that is physically closest to that C-Box.
2.3.2 C-Box Performance Monitoring Overview
Each of the C-Boxes in the Intel Xeon Processor 7500 Series supports event monitoring through six
48-bit wide counters (CBx_CR_C_MSR_PMON_CTR{5:0}). Each of these six counters can be programmed
to count any C-Box event. The C-Box counters can increment by a maximum of 5b per cycle.
For information on how to set up a monitoring session, refer to Section 2.1, “Global Performance
Monitoring Control”.
2.3.2.1 C-Box PMU - Overflow, Freeze and Unfreeze
If an overflow is detected from a C-Box performance counter, the overflow bit is set at the box level
(C_MSR_PMON_GLOBAL_STATUS.ov), and forwarded up the chain towards the U-Box. If a C-Box0
counter overflows, a notification is sent and stored in S-Box0 (S_MSR_PMON_SUMMARY.ov_c_l) which,
in turn, sends the overflow notification up to the U-Box (U_MSR_PMON_GLOBAL_STATUS.ov_s0). Refer
to Table 2-26, “S_MSR_PMON_SUMMARY Register Fields” to determine how each C-Box’s overflow bit is
accumulated in the attached S-Box.
HW can be also configured (by setting the corresponding .pmi_en to 1) to send a PMI to the U-Box
when an overflow is detected. The U-Box may be configured to freeze all uncore counting and/or send a
PMI to selected cores when it receives this signal.
Once a freeze has occurred, in order to see a new freeze, the overflow field responsible for the freeze
must be cleared by setting the corresponding bit in C_MSR_PMON_GLOBAL_OVF_CTL.clr_ov. Assuming
all the counters have been locally enabled (.en bit in data registers meant to monitor events) and the
overflow bit(s) has been cleared, the C-Box is prepared for a new sample interval. Once the global
controls have been re-enabled (Section 2.1.4, “Enabling a New Sample Interval from Frozen
Counters”), counting will resume.
MSR Name                             Access   MSR Address   Size (bits)   Description
CB0_CR_C_MSR_PMON_EVT_SEL_0          RW_RO    0xD10         64            C-Box 0 PMON Event Select 0
CB0_CR_C_MSR_PMON_GLOBAL_OVF_CTL     WO_RO    0xD02         32            C-Box 0 PMON Global Overflow Control
CB0_CR_C_MSR_PMON_GLOBAL_STATUS      RW_RW    0xD01         32            C-Box 0 PMON Global Status
CB0_CR_C_MSR_PMON_GLOBAL_CTL         RW_RO    0xD00         32            C-Box 0 PMON Global Control
2.3.3.1 C-Box Box Level PMON state
The following registers represent the state governing all box-level PMUs in the C-Box.
The _GLOBAL_CTL register contains the bits used to enable monitoring. It is necessary to set the
.ctr_en bit to 1 before the corresponding data register can collect events.
If an overflow is detected from one of the C-Box PMON registers, the corresponding bit in the
_GLOBAL_STATUS.ov field will be set. To reset the overflow bits set in the _GLOBAL_STATUS.ov field, a
user must set the corresponding bits in the _GLOBAL_OVF_CTL.clr_ov field before beginning a new
sample interval.
Table 2-10. C_MSR_PMON_GLOBAL_CTL Register – Field Definitions
Field    Bits   HW Reset Val   Description
ctr_en   5:0    0              Must be set to enable each C-Box counter.
                               NOTE: U-Box enable and per counter enable must also be set to fully
                               enable the counter.
Table 2-11. C_MSR_PMON_GLOBAL_STATUS Register – Field Definitions
Field   Bits   HW Reset Val   Description
ov      5:0    0              If an overflow is detected from the corresponding C-Box PMON register,
                              its overflow bit will be set.
                              NOTE: This bit is also cleared by setting the corresponding bit in
                              C_MSR_PMON_GLOBAL_OVF_CTL
Table 2-12. C_MSR_PMON_GLOBAL_OVF_CTL Register – Field Definitions
Field    Bits   HW Reset Val   Description
clr_ov   5:0    0              Write ‘1’ to reset the corresponding C_MSR_PMON_GLOBAL_STATUS
                               overflow bit.
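Since the .ov and .clr_ov fields both occupy bits 5:0 with one bit per counter, the value to write to C_MSR_PMON_GLOBAL_OVF_CTL can be derived directly from the status value that was read. The Python sketch below illustrates this; the function names are hypothetical and the code only manipulates integers, it does not access any MSR.

```python
C_BOX_COUNTERS = 6                      # counters 0 through 5
OV_FIELD_MASK = (1 << C_BOX_COUNTERS) - 1  # bits 5:0


def overflowed_counters(global_status):
    """List the counter indices whose .ov bit is set in the status value."""
    ov = global_status & OV_FIELD_MASK
    return [index for index in range(C_BOX_COUNTERS) if (ov >> index) & 1]


def clr_ov_value(counters):
    """Build the .clr_ov value that clears the overflow bits of `counters`."""
    value = 0
    for index in counters:
        if not 0 <= index < C_BOX_COUNTERS:
            raise ValueError(f"counter index {index} out of range")
        value |= 1 << index
    return value
```

Clearing exactly the bits that were read back, rather than writing all ones, keeps the sketch aligned with the guidance that only the bits for the actual overflowing counters need to be cleared.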
2.3.3.2 C-Box PMON state - Counter/Control Pairs
The following table defines the layout of the C-Box performance monitor control registers. The main
task of these configuration registers is to select the event to be monitored by their respective data
counter. Setting the .ev_sel and .umask fields performs the event selection. The .en bit must be set to
1 to enable counting.
Additional control bits include:
- .pmi_en governs what to do if an overflow is detected.
- .threshold - since C-Box counters can increment by a value greater than 1, a threshold can be applied.
If the .threshold is set to a non-zero value, that value is compared against the incoming count for that
event in each cycle. If the incoming count is >= the threshold value, then the event count captured in
the data register will be incremented by 1.
- .invert - Changes the .threshold test condition to ‘<’
- .edge_detect - Rather than accumulating the raw count each cycle (for events that can increment by
1 per cycle), the register can capture transitions from no event to an event incoming.
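How these control bits combine can be modelled in a few lines of Python. This is a behavioral sketch based only on the prose above, with two assumptions made explicit: .invert is treated as having no effect when .threshold is zero, and edge detection treats any non-zero filtered value as "asserted". The function name is hypothetical.

```python
def counted_value(per_cycle, threshold=0, invert=False, edge_detect=False):
    """Model what a counter accumulates for a sequence of per-cycle inputs.

    per_cycle   -- incoming event count for each cycle
    threshold   -- non-zero enables the comparison (count >= threshold)
    invert      -- with a non-zero threshold, flips the test to count < threshold
    edge_detect -- count 0 -> 1 transitions instead of every asserted cycle
    """
    total = 0
    previously_asserted = False
    for incoming in per_cycle:
        if threshold:
            passed = incoming < threshold if invert else incoming >= threshold
            signal = 1 if passed else 0   # thresholded events add 1 per cycle
        else:
            signal = incoming             # raw count accumulates
        if edge_detect:
            asserted = signal != 0
            if asserted and not previously_asserted:
                total += 1
            previously_asserted = asserted
        else:
            total += signal
    return total
```

For the input `[0, 3, 5, 5, 0, 1, 4]` the raw count is 18; with a threshold of 4 it is 3 (three cycles at or above 4), and adding edge detection reduces that to 2 because cycles 2 and 3 form a single asserted run.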
Table 2-13. C_MSR_PMON_EVT_SEL{5-0} Register – Field Definitions
Field         Bits    HW Reset Val   Description
ig            63      0              Read zero; writes ignored. (?)
rsv           62:61   0              Reserved; Must write to 0 else behavior is undefined.
ig            60:50   0              Read zero; writes ignored. (?)
threshold     31:24   0              Threshold used in counter comparison.
invert        23      0              When 0, the comparison that will be done is threshold <= event. When
                                     set to 1, the comparison is inverted (e.g. threshold < event)
en            22      0              Local Counter Enable. When set, the associated counter is locally
                                     enabled.
                                     NOTE: It must also be enabled in C_MSR_PMON_GLOBAL_CTL and the
                                     U-Box to be fully enabled.
ig            21      0              Read zero; writes ignored. (?)
pmi_en        20      0              When this bit is asserted and the corresponding counter overflows, a PMI
                                     exception is sent to the U-Box.
ig            19      0              Read zero; writes ignored. (?)
edge_detect   18      0              When asserted, the 0 to 1 transition edge of a 1 bit event input will cause
                                     the corresponding counter to increment. When 0, the counter will
                                     increment for however long the event is asserted.
                                     NOTE: .edge_detect is in series following threshold and invert, so it can
                                     be applied to multi-increment events that have been filtered by the
                                     threshold field.
ig            17:16   0              Read zero; writes ignored. (?)
umask         15:8    0              Select subevents to be counted within the selected event.
ev_sel        7:0     0              Select event to be counted.
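Using the bit positions listed in Table 2-13, a control register value can be packed and unpacked as in the following Python sketch. The helper names are hypothetical, and the event code and umask used in the usage note are placeholders rather than real C-Box events.

```python
# (shift, width) for each programmable field, from Table 2-13.
EVT_SEL_FIELDS = {
    "ev_sel": (0, 8),
    "umask": (8, 8),
    "edge_detect": (18, 1),
    "pmi_en": (20, 1),
    "en": (22, 1),
    "invert": (23, 1),
    "threshold": (24, 8),
}


def encode_evt_sel(**values):
    """Pack named field values into a C_MSR_PMON_EVT_SEL register value."""
    word = 0
    for name, value in values.items():
        shift, width = EVT_SEL_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value:#x} does not fit in {width} bits")
        word |= value << shift
    return word


def decode_evt_sel(word):
    """Unpack a register value into a dictionary of field values."""
    return {
        name: (word >> shift) & ((1 << width) - 1)
        for name, (shift, width) in EVT_SEL_FIELDS.items()
    }
```

As a usage example, `encode_evt_sel(ev_sel=0x14, umask=0x01, en=1)` yields `0x400114`. Reserved and ignored bits are left at zero, which matches the requirement that rsv bits be written as 0.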
The C-Box performance monitor data registers are 48b wide. A counter overflow occurs when a carry
out bit from bit 47 is detected. Software can force all uncore counting to freeze after N events by
preloading a monitor with a count value of 2^48 - N and setting the control register to send a PMI to the
U-Box. Upon receipt of the PMI, the U-Box will disable counting (Section 2.1.1.1, “Freezing on Counter
Overflow”). During the interval of time between overflow and global disable, the counter value will wrap
and continue to collect events.
In this way, software can capture the precise number of events that occurred between the time uncore
counting was enabled and when it was disabled (or ‘frozen’) with minimal skew.
If accessible, software can continuously read the data registers without disabling event collection.
Table 2-14. C_MSR_PMON_CTR{5-0} Register – Field Definitions
Field         Bits   HW Reset Val   Description
event_count   47:0   0              48-bit performance event counter
2.3.4 C-BOX Performance Monitoring Events
2.3.4.1 An Overview:
The performance monitoring events within the C-Box include all events internal to the LLC as well as
events which track ring related activity at the C-Box/Core ring stops. The only ring specific events that
are not tracked by the C-Box PMUs are those events that track ring activity at the S-Box ring stop (see
the S-Box chapter for details on those events).
C-Box performance monitoring events can be used to track LLC access rates, LLC hit/miss rates, LLC
eviction and fill rates, and to detect evidence of back pressure on the LLC pipelines. In addition, the
C-Box has performance monitoring events for tracking MESI state transitions that occur as a result of
data sharing across sockets in a multi-socket system. And finally, there are events in the C-Box for
tracking ring traffic at the C-Box/Core sink inject points.
Every event in the C-Box (with the exception of the P2C inject and *2P sink counts) is from the point
of view of the LLC and cannot be associated with any specific core since all cores in the socket send
their LLC transactions to all C-Boxes in the socket. The P2C inject and *2P sink counts serve as the
exception since those events are tracking ring activity at the cores’ ring inject/sink points.
There are separate sets of counters for each C-Box instance. For any event, to get an aggregate count
of that event for the entire LLC, the counts across the C-Box instances must be added together. The
counts can be averaged across the C-Box instances to get a view of the typical count of an event from
the perspective of the individual C-Boxes. Individual per-C-Box deviations from the average can be
used to identify hot-spotting across the C-Boxes or other evidences of non-uniformity in LLC behavior
across the C-Boxes. Such hot-spotting should be rare, though a repetitive polling on a fixed physical
address is one obvious example of a case where an analysis of the deviations across the C-Boxes would
indicate hot-spotting.
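The aggregation described above can be sketched in Python as follows. The function names are hypothetical, and the hot-spot test (a count more than `factor` times the mean) is an arbitrary illustrative criterion, not one defined by this guide.

```python
def summarize_cbox_counts(counts):
    """Return (total, mean, deviations) for one event across C-Box instances."""
    if not counts:
        raise ValueError("at least one C-Box count is required")
    total = sum(counts)                 # aggregate count for the entire LLC
    mean = total / len(counts)          # typical count per C-Box
    deviations = [count - mean for count in counts]
    return total, mean, deviations


def hot_spots(counts, factor=2.0):
    """Indices of C-Boxes whose count exceeds `factor` times the mean."""
    _, mean, _ = summarize_cbox_counts(counts)
    return [
        index for index, count in enumerate(counts)
        if mean > 0 and count > factor * mean
    ]
```

For eight C-Boxes reporting `[100, 100, 100, 100, 100, 100, 100, 900]`, the total is 1600, the mean is 200, and only instance 7 is flagged, which is the pattern a repetitive poll of one physical address might produce.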
2.3.4.2 Acronyms frequently used in C-Box Events:
The Rings:
AD (Address) Ring - Core Read/Write Requests and Intel QPI Snoops. Carries Intel QPI requests and
snoop responses from C to S-Box.
BL (Block or Data) Ring - Data == 2 transfers for 1 cache line
AK (Acknowledge) Ring - Acknowledges S-Box to C-Box and C-Box to Core. Carries snoop responses
from Core to C-Box.
IV (Invalidate) Ring - C-Box Snoop requests of core caches
Internal C-Box Queues:
IRQ - Ingress Request Queue on AD Ring. Associated with requests from core.
IPQ - Ingress Probe Queue on AD Ring. Associated with snoops from S-Box.
VIQ - Victim Queue internal to C-Box.
IDQ - Ingress Data Queue on BL Ring. For data from either Core or S-Box.
ICQ - S-Box Ingress Complete Queue on AK Ring
SRQ - Processor Snoop Response Queue on AK ring
IGQ - Ingress GO-pending (tracking GO’s to core) Queue
MAF - Miss Address File. Intel QPI ordering buffer that also tracks local coherence.
2.3.4.3 The Queues:
There are seven internal occupancy queue counters, each of which is 5 bits wide and dedicated to its
queue: IRQ, IPQ, VIQ, MAF, RWRF, RSPF, IDF.
Note: IDQ, ICQ, SRQ and IGQ occupancies are not tracked since they are mapped 1:1 to the
MAF and, therefore, cannot create back pressure.
Note that while the IRQ, IPQ, VIQ and MAF queues reside within the C-Box, the RWRF,
RSPF and IDF queues do not. Instead, they sit between the Core and the Ring, buffering messages
as those messages transit between the two. This distinction is useful because the queues located within
the C-Box can provide information about what is going on in the LLC with respect to the flow of
transactions at the point where they become “observed” by the coherence fabric (i.e., where the MAF is
located). Occupancy of these buffers indicates how many transactions the C-Box is tracking, and where
the bottlenecks appear when the C-Box starts to get busy and/or congested.
There is no need to explicitly reset the occupancy counters in the C-Box since they are counting from
reset de-assertion.
2.3.4.4 Detecting Performance Problems in the C-Box Pipeline:
IRQ occupancy counters should be used to track if the C-Box pipeline is exerting back pressure on the
Core-request path. There is a one-to-one correspondence between the LLC requests generated by the
cores and the IRQ allocations. IPQ occupancy counters should be used to track if the C-Box pipeline is
exerting back pressure on the Intel QPI-snoop path. There is a one-to-one correspondence between the
Intel QPI snoops received by the socket, and the IPQ allocations in the C-Boxes. In both cases, if the
message is in the IRQ/IPQ then the C-Box hasn’t acknowledged it yet and the request hasn’t yet
entered the LLC’s “coherence domain”. It deallocates from the IRQ/IPQ at the moment that the C-Box
does acknowledge it. In optimal performance scenarios, where there are minimal conflicts between
transactions and loads are low enough to keep latencies relatively near to idle, IRQ and IPQ
occupancies should remain very low.
One relatively common scenario in which IRQ back pressure will be high is worth mentioning: The IRQ
will back up when software is demanding data from memory at a rate that exceeds the available
memory BW. The IRQ is designed to be the place where the extra transactions wait for U-Box’s RTIDs to
become available when memory becomes saturated. IRQ back pressure becomes interesting in a
scenario where memory is not operating at or near peak sustainable BW. That can be a sign of a
performance problem that may be correctable with software tuning.
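The reasoning in the paragraph above can be expressed as a simple check. This is a hedged sketch only: the function name, the occupancy threshold, and the "near peak" bandwidth fraction are hypothetical values picked for illustration, not figures from this guide, and the inputs are assumed to have been derived from measurements elsewhere.

```python
def irq_backpressure_is_suspicious(avg_irq_occupancy, memory_bw, peak_memory_bw,
                                   occupancy_threshold=8.0, bw_fraction=0.8):
    """Flag IRQ back pressure that is NOT explained by memory saturation.

    avg_irq_occupancy:   average IRQ occupancy over the sampling interval
    memory_bw:           measured memory bandwidth over the same interval
    peak_memory_bw:      peak sustainable memory bandwidth (same units)
    occupancy_threshold: hypothetical cutoff for "high" occupancy
    bw_fraction:         hypothetical fraction of peak treated as "near peak"
    """
    high_occupancy = avg_irq_occupancy >= occupancy_threshold
    memory_saturated = memory_bw >= bw_fraction * peak_memory_bw
    # High IRQ occupancy with saturated memory is the expected case;
    # high occupancy without saturation may merit software tuning.
    return high_occupancy and not memory_saturated

# Illustrative values only (bandwidth in arbitrary consistent units).
print(irq_backpressure_is_suspicious(12.0, 10.0, 30.0))
print(irq_backpressure_is_suspicious(12.0, 28.0, 30.0))
```

Appropriate thresholds depend on the workload and platform and would need to be calibrated against a known-good baseline.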
One final warning on LLC pipeline congestion: Care should be taken not to blindly sum events across
C-Boxes without also checking the deviation across individual C-Boxes when investigating performance
issues that are concentrated in the C-Box pipelines. Performance problems where congestion in the
C-Box pipelines is the cause should be rare, but if they do occur, the event counts may not be
homogeneous across the C-Boxes in the socket. The average count across the C-Boxes may be
misleading. If performance issues are found in this area it will be useful to know if they are or are not
localized to specific C-Boxes.
2.3.5 C-Box Events Ordered By Code
Table 2-15 summarizes the directly-measured C-Box events.
Table 2-15. Performance Monitor Events for C-Box Events

Symbol Name        Event Code    Max Inc/Cyc    Description
Ring Events
BOUNCES_P2C_AD     0x01          1              Number of P2C AD bounces.
BOUNCES_C2P_AK     0x02          1              Number of C2P AK bounces.
BOUNCES_C2P_BL     0x03          1              Number of C2P BL bounces.
BOUNCES_C2P_IV     0x04          1              Number of C2P IV bounces.
SINKS_P2C          0x05          3              Number of P2C sinks.
SINKS_C2P          0x06          3              Number of C2P sinks.
SINKS_S2C          0x07          3              Number of S2C sinks.
SINKS_S2P_BL       0x08          1              Number of S2P sinks (BL only).
ARB_WINS           0x09          7              Number of ARB wins.
ARB_LOSSES         0x0A          7              Number of ARB losses.