Freescale Semiconductor PowerQUICC III Application Note

Freescale Semiconductor
Document Number: AN3636
Application Note
PowerQUICC III Performance Monitors
Using the Core and System Performance Monitors
Rev. 2, 03/2014
This application note describes aspects of utilizing the core and device-level performance monitors on PowerQUICC III (PQ3). Included are example calculations to aid in interpreting data collected.
1 Performance Monitors
PowerQUICC III processors are the first family of PowerQUICC processors to include performance monitors on-chip. These include both core performance monitors, described in detail in the PowerPC® e500 Core Family Reference Manual, as well as device-level performance monitors, described in detail in the product-specific reference manual.
The e500 core-level performance monitors enable the counting of e500-specific events, for example, cache misses, mispredicted branches, or the number of cycles an execution unit stalls. These are configured by a set of special-purpose registers that can only be written through supervisor-level accesses. The core-level event counters are also available through a read-only set of user-level registers.
Contents
1. Performance Monitors
2. e500 Core Performance Monitors
3. Device Performance Monitors
4. Performance Metrics
5. Data Collection
6. Examples
7. Data Presentation
8. Revision History
© 2008-2014 Freescale Semiconductor, Inc. All rights reserved.
The device-level performance monitors can be used to monitor and record selected events on a device level. These performance monitors are similar in many respects to the performance monitors implemented on the e500 core. However, they are capable of counting events only outside the e500 core, for example, PCI, DDR, and L2 cache events. Device-level performance monitors are memory-mapped, allowing user space configuration accesses.
Together, these two sets of performance monitor registers can be used by the developer to improve system performance, characterize and benchmark processors, and help debug their systems.
2 e500 Core Performance Monitors
The e500 core performance monitors are described in detail in Chapter 7 of the PowerPC e500 Core Family Reference Manual.
Performance monitor registers are grouped into supervisor-level registers, accessed with mtpmr and mfpmr, and user-level performance monitor registers, which are read-only and accessed with the mfpmr instruction. The supervisor-level registers consist of the four performance monitor counters (PMC0-PMC3), each used to count up to 128 events; associated performance monitor local control registers (PMLCa0-PMLCa3); and the performance monitor global control register (PMGC0). The user-level registers are read-only copies of the supervisor-level registers. These consist of the same four counters (UPMC0-UPMC3), associated local control registers (UPMLCa0-UPMLCa3), and global control register (UPMGC0).
Additionally, the core performance monitor may use the external core input, pm_event, as well as the performance monitor mark bit in the MSR (MSR[PMM]) to control which processes are monitored.
2.1 Counter Events
Counter events are listed in the PowerPC e500 Core Family Reference Manual. These are subdivided into three groups:
Reference (Ref:#) - Possible to count these events on any of the four counters (PMC0-PMC3). These events are applicable to most Power Architecture® microprocessors.
Common (Com:#) - Possible to count these events on any of the four counters (PMC0-PMC3). These events are specific to the e500 microarchitecture.
Counter-Specific (C[0-3]:#) - Can only be counted on the specific counter noted. For example, an event assigned to counter PMC2 is shown as C2:#.
3 Device Performance Monitors
The device performance monitors are described in detail in the corresponding product reference manual. These performance monitor counters operate separately from the core performance monitors and are intended to monitor and record device-level events.
The device performance monitor consists of ten counters (PMC0-PMC9), capable of monitoring 576 events, as well as the associated local control registers (PMLCA0-PMLCA9) and the global control register (PMGC0). These registers are all memory-mapped and can be accessed in supervisor or user mode.
3.1 Counter Events
PMC0 is a 64-bit counter specifically designed to count core complex bus (CCB) clock cycles. This counter is started automatically out of reset and continually counts platform clock cycles. PMC1-PMC9 are 32-bit counters that can monitor up to 576 events.
Counter events are subdivided into two groups:
Reference (Ref:#) - Possible to count these events on any of the nine counters PMC1-PMC9.
Counter-Specific (C[0-3]:#) - Can only be counted on the specific counter noted. For example, an event assigned to counter PMC2 is shown as C2:#.
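Because PMC1-PMC9 are 32 bits wide, software that samples a running counter twice may see it wrap between samples. The following minimal C sketch shows one way to compute the event count between two samples, and to convert a PMC0 cycle count to seconds. The function names `pm_delta32` and `pm_ccb_seconds` are hypothetical, and the platform (CCB) clock frequency is assumed to be supplied by the caller.

```c
#include <stdint.h>

/* Hypothetical helper: events counted between two samples of a
 * free-running 32-bit counter. Unsigned subtraction is performed
 * modulo 2^32, so the result is correct as long as the counter
 * wrapped at most once between the two samples. */
static uint32_t pm_delta32(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: convert a PMC0 reading (CCB clock cycles)
 * to seconds. ccb_hz is an assumption: the platform clock
 * frequency in Hz, provided by the caller. */
static double pm_ccb_seconds(uint64_t ccb_cycles, double ccb_hz)
{
    return ccb_hz > 0.0 ? (double)ccb_cycles / ccb_hz : 0.0;
}
```

For example, a counter sampled at 0xFFFFFFF0 and later at 0x00000010 has advanced by 0x20 events.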
4 Performance Metrics
The use of the on-chip performance monitors to gather data is relatively straightforward. Using the data to calculate meaningful performance metrics presents a much bigger challenge. Table 1 presents metrics commonly used for performance analysis and characterization. These include:
Instructions per cycle (IPC)
Instructions per packet (IPP)
Packets per second (PPS)
Branch misses per total branches (%)
Branches per 1000 instructions
L1 instruction cache miss rate
L1 data cache miss rate
L2 cache core miss rate
L2 cache non-core miss rate
Memory system page hit ratio
Note that because these calculations make use of both the core events and the system events, we differentiate between them by a two-letter prefix:
CE - Core Event
SE - System Event
To specify an event, this prefix is followed by the event number, as defined in the core and system manuals. For example, CE:Ref:1 refers to Core Event, Reference 1, which according to the PowerPC e500 Core Family Reference Manual refers to processor cycles. SE:C0 would refer to System Event, Counter 0, which according to the device-specific reference manuals corresponds to CCB (platform) clock cycles.
Note that for counter-specific events, an offset of 64 must be added to the event number when programming the 7-bit event field: reference events use values 0-63, and counter-specific events use values 64-127. For example, SE:C2:59 is programmed as 59 + 64 = 123 (0x7B).
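The offset rule can be sketched in C as follows. This is an illustrative helper, not code from the reference manuals: the names `pm_event_code` and `pm_pmlca_value` are hypothetical, and the position of the event field (shifted left by 16 bits within the local control register) is an assumption inferred from the register values used in the example configuration in Section 4.1.

```c
#include <stdint.h>

/* Hypothetical helper: value for the 7-bit event field.
 * Reference events use their event number directly;
 * counter-specific events are offset by 64. */
static uint32_t pm_event_code(int counter_specific, uint32_t event)
{
    return (counter_specific ? 64u + event : event) & 0x7Fu;
}

/* Hypothetical helper: local control register value that selects
 * the event. Assumption: the event field starts 16 bits above the
 * least significant bit, as implied by values such as 0x007B0000
 * in Section 4.1. All other control bits are left at zero. */
static uint32_t pm_pmlca_value(int counter_specific, uint32_t event)
{
    return pm_event_code(counter_specific, event) << 16;
}
```

With these helpers, `pm_pmlca_value(1, 59)` yields 0x007B0000 for SE:C2:59 and `pm_pmlca_value(0, 22)` yields 0x00160000 for SE:Ref:22, matching the values written in Section 4.1.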
Table 1. Commonly Used Performance Metrics
Metric | Performance Monitor Event(s) | Formula
Core cycles | CE:Ref:1, or SE:C0 | CE:Ref:1 or SE:C0 * Clock Ratio
Time [processor cycles/processor frequency] | CE:Ref:1 | CE:Ref:1/Processor Frequency
Instructions per cycle (IPC) [instructions completed/processor cycles] | CE:Ref:1, CE:Ref:2 | CE:Ref:2/CE:Ref:1
Instructions per packet (IPP) [instructions completed/accepted frames on TSEC1] | SE:Ref:36, CE:Ref:2 | CE:Ref:2/SE:Ref:36
Packets per second (PPS) [accepted frames on TSEC1/Time] | SE:Ref:36, CE:Ref:1 | SE:Ref:36/(CE:Ref:1/Processor Frequency)
Branch miss ratio [branches mispredicted/branches finished] | CE:Com:12, CE:Com:17 | (CE:Com:12 - CE:Com:17)/CE:Com:12
Branches per 1000 instructions [1000*branches finished/instructions completed] | CE:Com:12, CE:Ref:2 | 1000*CE:Com:12/CE:Ref:2
L1 I-cache miss rate [(I-cache fetch & pre-fetch miss)/instructions completed] | CE:Ref:2, CE:Com:60 | CE:Com:60/CE:Ref:2
L1 D-cache miss rate [D-cache miss/data micro-ops completed] | CE:Com:41, CE:Com:9, CE:Com:10 | CE:Com:41/(CE:Com:9 + CE:Com:10)
L2 cache core miss rate [L2 D&I core miss/(L2 D&I core miss + L2 D&I core hit)] | SE:Ref:22, SE:C2:59, SE:Ref:23, SE:C4:57 | (SE:C2:59 + SE:C4:57)/(SE:C2:59 + SE:C4:57 + SE:Ref:22 + SE:Ref:23)
L2 cache non-core miss rate [L2 non-core miss/(L2 non-core miss + hit)] | SE:Ref:24, SE:C1:54 | SE:C1:54/(SE:C1:54 + SE:Ref:24)
DDR page row open table miss rate [DDR read & write miss/(DDR read & write miss + hit)] | SE:C2, SE:C4, SE:C6, SE:C8 | (SE:C2 + SE:C4)/(SE:C2 + SE:C4 + SE:C6 + SE:C8)
Note that some of the events can be used in the calculation of multiple metrics. For example, CE:Ref:2 (instructions completed) is used to calculate IPC, IPP, Branches per 1k Instructions, and L1 I-cache miss rate. This is advantageous, since only a limited number of events can be captured simultaneously in the limited number of PMCs available.
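As a minimal sketch of how several Table 1 formulas might be evaluated once the raw counts have been read, consider the following C helpers. The structure `pm_counts`, its field names, and the function names are hypothetical and chosen only for illustration. The processor frequency is assumed to be known and passed in by the caller, and each ratio returns 0 when its denominator is zero.

```c
#include <stdint.h>

/* Hypothetical container for raw counts; the comments give the
 * Table 1 event each field is assumed to hold. */
struct pm_counts {
    uint64_t core_cycles;       /* CE:Ref:1 */
    uint64_t instr_completed;   /* CE:Ref:2 */
    uint64_t branches_finished; /* CE:Com:12 */
    uint64_t ce_com_17;         /* CE:Com:17, as used in Table 1 */
    uint64_t frames_accepted;   /* SE:Ref:36 */
};

/* Guarded division so an empty measurement does not divide by zero. */
static double pm_ratio(double num, double den)
{
    return den != 0.0 ? num / den : 0.0;
}

/* IPC = CE:Ref:2 / CE:Ref:1 */
static double pm_ipc(const struct pm_counts *c)
{
    return pm_ratio((double)c->instr_completed, (double)c->core_cycles);
}

/* IPP = CE:Ref:2 / SE:Ref:36 */
static double pm_ipp(const struct pm_counts *c)
{
    return pm_ratio((double)c->instr_completed, (double)c->frames_accepted);
}

/* PPS = SE:Ref:36 / (CE:Ref:1 / Processor Frequency) */
static double pm_pps(const struct pm_counts *c, double core_hz)
{
    double seconds = pm_ratio((double)c->core_cycles, core_hz);
    return pm_ratio((double)c->frames_accepted, seconds);
}

/* Branch miss ratio = (CE:Com:12 - CE:Com:17) / CE:Com:12 */
static double pm_branch_miss_ratio(const struct pm_counts *c)
{
    double missed = (double)c->branches_finished - (double)c->ce_com_17;
    return pm_ratio(missed, (double)c->branches_finished);
}

/* Branches per 1000 instructions = 1000 * CE:Com:12 / CE:Ref:2 */
static double pm_branches_per_1000(const struct pm_counts *c)
{
    return pm_ratio(1000.0 * (double)c->branches_finished,
                    (double)c->instr_completed);
}
```

Because CE:Ref:2 appears in three of these functions, a single counter programmed for instructions completed serves all of them, which is the reuse described above.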
4.1 Example Configuration
As an example, note the calculation of the L2 cache core miss rate. This metric requires the following performance monitor events:
SE:Ref:22 - core instruction accesses to L2 that hit
SE:C2:59 - core instruction accesses to L2 that miss
SE:Ref:23 - core data accesses to L2 that hit
SE:C4:57 - core data accesses to L2 that miss
Note that these are all device-level performance monitor events that can all be run simultaneously. This example uses counters PMC2 - PMC5.
// Initialize Counters
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1038) = 0x0; /* PMC2 */
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1048) = 0x0; /* PMC3 */
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1058) = 0x0; /* PMC4 */
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1068) = 0x0; /* PMC5 */
// Initialize Global Control Register (counters held halted during setup)
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1000) = 0x80000000; /* PMGC0 */
// Initialize Local Control Registers (select the event for each counter)
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1030) = 0x007B0000; /* PMLCa2: SE:C2:59 (59 + 64 = 0x7B) */
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1040) = 0x00160000; /* PMLCa3: SE:Ref:22 (0x16) */
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1050) = 0x00790000; /* PMLCa4: SE:C4:57 (57 + 64 = 0x79) */
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1060) = 0x00170000; /* PMLCa5: SE:Ref:23 (0x17) */
// Start Global Control Register (clearing it starts the counting)
*(volatile unsigned int *) ((unsigned int) CCSB + 0xE1000) = 0x00000000; /* PMGC0 */
The above code shows a sequence for initializing counters PMC2-PMC5 to zero, then setting up the local control registers to count the events required for the metric previously mentioned. The global control register is then set to 0x0, which will start the counting.
Note that because the events counted by PMC2 and PMC4 are counter-specific events, their event numbers are offset by 64 (59 + 64 = 0x7B and 57 + 64 = 0x79).
When the software task is finished, the counters can be halted by the global control register, and results may be read from the relevant counters.
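The halt-and-read step might look like the following C sketch. It is not taken from the reference manual: the function name `pm_l2_core_miss_rate` and the macro names are hypothetical, and `pm` is assumed to point at the start of the performance monitor register block (CCSB + 0xE1000). The offsets come from the addresses in the listing above. Writing 0x80000000 to PMGC0 to halt the counters is an assumption inferred from that listing, where this value is written before the counters are configured and 0x0 starts them.

```c
#include <stdint.h>

/* Word indices from the register block base (CCSB + 0xE1000),
 * derived from the addresses used in the listing above. */
#define PM_PMGC0 (0x00 / 4)
#define PM_PMC2  (0x38 / 4)
#define PM_PMC3  (0x48 / 4)
#define PM_PMC4  (0x58 / 4)
#define PM_PMC5  (0x68 / 4)

/* Assumption: this PMGC0 value halts all counters. */
#define PM_HALT_ALL 0x80000000u

/* Hypothetical helper: halt the counters, read PMC2-PMC5, and
 * return the L2 cache core miss rate from Table 1. */
static double pm_l2_core_miss_rate(volatile uint32_t *pm)
{
    pm[PM_PMGC0] = PM_HALT_ALL;

    uint32_t i_miss = pm[PM_PMC2]; /* SE:C2:59  instruction misses */
    uint32_t i_hit  = pm[PM_PMC3]; /* SE:Ref:22 instruction hits   */
    uint32_t d_miss = pm[PM_PMC4]; /* SE:C4:57  data misses        */
    uint32_t d_hit  = pm[PM_PMC5]; /* SE:Ref:23 data hits          */

    /* Accumulate in double so the sums cannot overflow 32 bits. */
    double miss  = (double)i_miss + (double)d_miss;
    double total = miss + (double)i_hit + (double)d_hit;

    return total > 0.0 ? miss / total : 0.0;
}
```

Passing the block base as a parameter keeps the helper independent of how CCSB is mapped, and allows it to be exercised against an ordinary array during host-side testing.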
5 Data Collection
The core performance monitor has four 32-bit PMCs for capturing core events. The system performance monitor has eight 32-bit PMCs for capturing system events and one 64-bit PMC exclusively dedicated for capturing the CCB clock cycles. Collectively, these counters allow the capture of four core events, eight system events, and the CCB clock cycles simultaneously. Collecting data from various events simultaneously makes the captured events almost perfectly correlated, as they are collected under the