THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY,
FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR
SAMPLE.
Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Pentium, Itanium and IA-32 architecture processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's web site at http://www.intel.com.
Intel, Itanium, Pentium, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Revision History

Revision Number    Description                                                                        Date
-001               Public release of the document.                                                    June 2002
-002               Refresh to incorporate new Itanium® 2 processor with up to 6M L3 cache models.     April 2003
-003               Refresh to incorporate new Itanium® 2 processor with up to 9M L3 cache models.     May 2004
1 About this Manual

1.1 Overview

The Intel® Itanium® 2 processor is the second implementation of the Intel® Itanium® architecture. There have now been three generations of the Itanium 2 processor, which can be identified by their unique CPUID model values. For simplicity of documentation, throughout this document we will group all processors of like model together. Table 1-1 lists the varieties of the Itanium 2 processor that are available, along with their grouping.
Table 1-1. Definition Table

Processor                                                                          Abbreviation

Intel® Itanium® 2 Processor 900 MHz with 1.5 MB L3 Cache                           Itanium 2 Processor (up to 3MB L3 cache)
Intel® Itanium® 2 Processor 1.0 GHz with 3 MB L3 Cache

Low Voltage Intel® Itanium® 2 Processor 1.0 GHz with 1.5 MB L3 Cache               Itanium 2 Processor (up to 6MB L3 cache)
Intel® Itanium® 2 Processor 1.40 GHz with 1.5 MB L3 Cache
Intel® Itanium® 2 Processor 1.40 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.60 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.30 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.40 GHz with 4 MB L3 Cache
Intel® Itanium® 2 Processor 1.50 GHz with 6 MB L3 Cache

Low Voltage Intel® Itanium® 2 Processor 1.20 GHz with 3 MB L3 Cache                Itanium 2 Processor (up to 9MB L3 cache)
Intel® Itanium® 2 Processor 1.60 GHz with 3 MB L3 Cache
Intel® Itanium® 2 Processor 1.60 GHz with 3 MB L3 Cache for 533MHz DP Platforms
Intel® Itanium® 2 Processor 1.50 GHz with 4 MB L3 Cache
Intel® Itanium® 2 Processor 1.60 GHz with 6 MB L3 Cache
Intel® Itanium® 2 Processor 1.70 GHz with 9 MB L3 Cache
The Itanium 2 processors with up to 9 MB L3 cache will have varieties capable of running with system bus speeds of 400 MHz, 533 MHz, and 667 MHz. For complete details on the current offerings please refer to the datasheets at http://developer.intel.com/design/Itanium2/.

This document describes how the Itanium 2 processor implements features of the Itanium architecture, as well as specific features of the Itanium 2 processor that are relevant to performance tuning, compilation, and assembler programming. Unless otherwise stated, all of the restrictions, rules, sizes, and capacities described in this document apply specifically to the Itanium 2 processor and may not apply to other implementations of the Itanium architecture.
General understanding of processor components and explicit familiarity with Itanium instructions are assumed. This document is not intended to be used as an architectural reference for the Itanium architecture. For more information on the Itanium architecture, consult the Intel® Itanium® Architecture Software Developer's Manual.
1.2 Contents

Chapter 2, "Itanium® 2 Processor Enhancements" compares the Itanium processor and the Itanium 2 processor, highlighting some of the considerations that should be taken when optimizing for the Itanium 2 processor.

Chapter 3, "Functional Units and Issue Rules" describes the number and type of available functional units, instruction issue rules, and heuristics for efficient instruction scheduling based upon machine resources and issue rules.

Chapter 4, "Latencies and Bypasses" describes latencies and bypasses for execution of the different instruction types on the Itanium 2 processor.

Chapter 5, "Data Operations" describes considerations for data operations such as speculative or predicated loads or stores, floating-point loads, and prefetches. Data alignment considerations are also discussed.

Chapter 6, "Memory Subsystem" provides an overview of the memory subsystem hierarchy on the Itanium 2 processor.
Chapter 7, “Branch Instructions and Branch Prediction” describes how hints for branch prediction
and instruction prefetch are implemented on the Itanium 2 processor.
Chapter 8, “Instruction Prefetching” describes how prefetching is implemented on the Itanium 2
processor.
Chapter 9, "Optimizing for the Itanium® 2 Processor" is a summary that draws conclusions from important points noted in earlier chapters.

Chapter 10, "Performance Monitoring" discusses performance monitoring registers and implementations specific to the Itanium 2 processor.

Chapter 11, "Performance Monitor Events" summarizes the Itanium 2 processor events and describes how to compute commonly used performance metrics.

Chapter 12, "Model-Specific and Optional Features" discusses Itanium 2 processor model-specific behavior, such as executing CPUID instructions.
1.3 Terminology

The following definitions are for terms that will be used throughout this document:

Dispersal                              The process of mapping instructions within bundles to functional units.
Bundle rotation                        The process of bringing new bundles into the two-bundle issue window.
Split issue                            Instruction execution when an instruction does not issue at the same time as the instruction immediately before it.
Advanced load address table (ALAT)     The ALAT holds the state necessary for advanced load and check operations.
Translation lookaside buffer (TLB)     The TLB holds virtual to physical mappings.
Virtual hash page table (VHPT)         The VHPT is an extension of the TLB hierarchy, which resides in the virtual memory space and is designed to enhance virtual address translation performance.
Hardware page walker (HPW)             The HPW is the third level of address translation. It is an engine that performs page look-ups from the VHPT and seeks opportunities to insert translations into the processor TLBs.
Register stack engine (RSE)            The RSE moves registers between the register stack and the backing store in memory.
Event address registers (EARs)         The EARs record the instruction and data addresses of data cache misses.
1.4 Related Documentation

The reader of this document should also be familiar with the material and concepts presented in the following documents:

• Intel® Itanium® Architecture Software Developer's Manual, Volume 1: Application Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 2: System Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 3: Instruction Set Reference
2 Itanium® 2 Processor Enhancements

This chapter outlines the major differences between the Itanium 2 processor and the Itanium processor. This is not an exhaustive list, so a reference to more details accompanies each topic.
2.1 Implemented Instructions

The Itanium 2 processor implements the 64-bit long branch (brl) instruction directly in hardware. This instruction was not implemented in the Itanium processor. It allows programmers to direct a branch to an address that uses all 64 address bits. Details on the brl instruction can be found in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual. There are some branch prediction performance implications associated with the brl instruction which are noted in Chapter 7, "Branch Instructions and Branch Prediction."
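For illustration, a minimal sketch of a long branch follows; the target label far_away_target is hypothetical and the example is not taken from the manual:

    // brl occupies an L+X slot pair (MLX template); its 64-bit IP-relative
    // displacement allows the target to be anywhere in the 64-bit address space.
    brl.cond.sptk.many  far_away_target ;;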
2.2 Functional Units and Issue Rules

In general, the Itanium 2 processor has more functional units than the Itanium processor.

• In particular, the Itanium 2 processor has six arithmetic logic units (ALUs) to perform arithmetic operations, compares, most multimedia instructions, etc. The Itanium processor can only issue four of these types of instructions per cycle.
• The Itanium 2 processor has four memory ports allowing two integer loads and two integer stores per cycle. The Itanium processor has two memory ports.
• The Itanium 2 processor can issue one SIMD floating-point (FP) instruction per cycle. The Itanium processor can issue two SIMD FP instructions per cycle.
• Under certain conditions, the Itanium 2 processor can issue I-type instructions to memory functional units, thus increasing the number of template pair types which can be issued in one cycle. For the Itanium processor, I-type instructions will only be issued to integer functional units.
• The Itanium 2 processor scoreboards multi-cycle operations such as first-level data cache (L1D) misses, multimedia, and floating-point operations.
This means that when an integer operation uses the result of a multimedia operation and the integer operation is not scheduled to cover the latency, the dependent instruction group will wait until the multimedia data is available.
A predicated off operation, with a use of a scoreboarded operand, will stall the issue group for one cycle if the predicate was generated in the previous cycle. A predicated off instruction with predicates generated two or more cycles earlier will not incur pipeline stalls even when operands are scoreboarded.
2.3 Operation Latencies

On the Itanium 2 processor, most latencies are the same as or shorter than on the Itanium processor, with a few exceptions; e.g., memory latencies are shorter and floating-point latencies are shorter. A few more bypasses exist which remove some asymmetries. Table 2-1, "Itanium® 2/Itanium Processors Operation Latencies" shows latencies for both the Itanium 2 processor and the Itanium processor. The areas of difference are indicated by non-shaded boxes. The two different latency numbers are separated by a forward slash ('/'). When reading from left to right, the first latency number corresponds to the Itanium 2 processor and the second number corresponds to the Itanium processor.
2.4 Data Operations

2.4.1 Data Speculation and the ALAT

The Itanium 2 processor advanced load address table (ALAT) is fully associative while the Itanium processor ALAT is two-way associative.

On the Itanium processor, a ld.c which misses the ALAT causes a 10-cycle pipeline flush. On the Itanium 2 processor, the penalty is 8 cycles.

On the Itanium processor, if a chk.a, chk.s, or fchkf fails, an operating system (OS) handler will be invoked through a trap handler to steer execution to the recovery code at the location specified in the target field of the chk.a/chk.s/fchkf instruction. On the Itanium 2 processor, hardware will usually perform the resteer without operating system intervention. This reduces the resteer cost from approximately 200 cycles to 18 cycles. If any of the following conditions are not met, the Itanium 2 processor will trap to the OS to service the chk.a/chk.s/fchkf:

psr.ic = 1
psr.it = 1
psr.ss = 0
psr.tb = 0

If a chk.a follows a store within the same cycle, the chk.a will always fail on the Itanium processor. On the Itanium 2 processor, a 12-bit address compare against ALAT entries will occur. See Section 5.1, "Data Speculation and the ALAT" for more details.
2.4.2 Data Alignment

The Itanium processor can support misaligned integer accesses within 16-byte blocks; however, the Itanium 2 processor supports misaligned integer accesses within 8-byte blocks. Section 5.5, "Data Alignment" has greater detail on misaligned access support for the Itanium 2 processor.
Notes for Table 2-1:
1. On the Itanium® processor, the address computation instruction must be in an M-slot type to avoid an extra cycle of latency.
2. N depends upon which level of cache is hit. For the Itanium processor, N=2 for L1D, N=6 for L2, N=21 for L3. For the Itanium 2 processor, N=1 for L1D, N=5 for L2, N=12-15 for L3. These are minimum latencies.
3. M depends upon which level of cache is hit. For the Itanium processor, M=8 for L2 and M=24 for L3. For the Itanium 2 processor, M=5 for L2 and M=12-15 for L3. These are minimum latencies. The "+1" entries indicate one cycle is needed for format conversion.
4. Best-case values of C range from 2 to 35 cycles depending upon registers accessed. EC and LC accesses are 2 cycles. FPSR and CR accesses are 10-12 cycles.
5. Best-case values of D range from 6 to 35 cycles depending upon indirect registers accessed; Iregs pkr and rr accesses are faster at 6 cycles.
2.4.3 Control Speculation

The Itanium 2 processor implements features intended to increase the performance of applications by decreasing the cost for incorrect control speculation. There are two parts of the solution for the Itanium 2 processor:

• The first part allows speculative load operations (this includes lfetch without the .fault completer) to abort and set a NaT bit at the time of a data translation lookaside buffer (TLB) miss. In contrast, the Itanium processor would wait for the hardware page walker (HPW) operation to complete the walk before setting the NaT bit.
• The second part allows a chk.s instruction (also a fchkf/chk.a instruction) to branch directly to the fix-up code without involving the OS. The Itanium processor faults on a chk.s, chk.a, or fchkf instruction and requests that the OS branch to the fix-up code.

Thus, deferrals on the Itanium 2 processor occur quickly and the branch to fix-up code occurs quickly.
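As a hedged sketch of this fix-up mechanism (register numbers and the labels back/recover are hypothetical, not from the manual):

    ld8.s   r14 = [r33] ;;        // speculative load hoisted above its original location;
                                  // a deferred fault sets the NaT bit of r14
    // ... other work scheduled here ...
    chk.s   r14, recover          // on the Itanium 2 processor this resteers to the
                                  // fix-up code in hardware, without an OS trap
back:
    add     r15 = r14, r10 ;;     // non-speculative use of the loaded value
    // ...
recover:
    ld8     r14 = [r33] ;;        // non-speculative reload raises any deferred fault
    br.cond.sptk  back ;;         // return to the main path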
The deferral at data TLB miss is turned off inside interrupt handlers (when PSR.is = 1), which allows ld.s and lfetch instructions to complete a TLB walk and possibly return data. Clearing the dcr.dm bit will also prevent speculative operations from deferring at data TLB miss. Fast deferral requires the dcr.dm bit to be set. Refer to Section 5.2, "Speculative and Predicated Loads/Stores" for more information.
2.5 Memory Hierarchy

Both the Itanium microarchitecture and the Itanium 2 microarchitecture incorporate a three-level cache structure. In general, line sizes of the Itanium 2 processor are twice as large as those of the Itanium processor. Also, latencies of the Itanium 2 processor are shorter than those of the Itanium processor. The third-level cache (L3) of the Itanium 2 processor is on-chip and runs at a higher core frequency, which results in a much shorter latency. The Itanium 2 processor has a two-level TLB design for both instruction and data, while the Itanium processor has a single-level instruction TLB. The Itanium 2 processor's TLBs are larger. The following tables list some of the differences in caches and TLBs. Details can be found in Chapter 6, "Memory Subsystem."

Table 2-2. L1I Cache Differences
(Table body not reproduced; see Chapter 6, "Memory Subsystem" for the L1I cache and instruction TLB parameters.)
Table 2-7. Data TLB Differences

                                            Hierarchy                    Size                 Associativity    Penalty for Missing First Level DTLB
Itanium® Processor                          2 levels: L1 DTLB, L2 DTLB   32-entry, 96-entry   Direct, Full     10 cycles
Itanium® 2 Processor (up to 3MB L3 cache)   2 levels: L1 DTLB, L2 DTLB   32-entry, 128-entry  Full, Full       2 cycles
Itanium® 2 Processor (up to 6MB L3 cache)   2 levels: L1 DTLB, L2 DTLB   32-entry, 128-entry  Full, Full       2 cycles
Itanium® 2 Processor (up to 9MB L3 cache)   2 levels: L1 DTLB, L2 DTLB   32-entry, 128-entry  Full, Full       2 cycles

2.6 Branch Prediction

The major differences between the Itanium 2 processor and the Itanium processor branch prediction support are:

• Latencies
• brp instructions are ignored for branch prediction, i.e., the brp.imp is not required to achieve zero-bubble branches.
• Indirect branch targets are predicted from the source branch register rather than from a hardware table.
• Possible reduced prediction of BBB bundles due to prediction encoding.
• More robust method for prediction structure repair after a mispredicted return.
• Hardware implementation of the brl (64-bit relative branch) instruction.
• Setting ar.ec = 1 is not required for perfect loop prediction.

Full details can be found in Section 7, "Branch Instructions and Branch Prediction."
Table 2-8. Branch Prediction Latencies (in cycles)

                                                  Itanium® 2 Processor    Itanium® Processor
Correctly Predicted Taken IP-relative Branch      0                       1
Correctly Predicted Taken Indirect Branch         2                       0
Correctly Predicted Taken Return Branch           1                       1
Last Branch in Perfect Loop Prediction            0                       2
Misprediction Latency                             6+                      9
2.7 Instruction Prefetching

The Itanium 2 processor has an improved implementation of streaming and hint prefetching. See Chapter 8, "Instruction Prefetching" for more details.
2.8 IA-32 Execution Layer

IA-32 Execution Layer (IA-32 EL) is a new technology that executes IA-32 applications on Itanium architecture-based systems. Previously, support for IA-32 applications on Itanium architecture-based platforms has been achieved using hardware circuitry on the Itanium 2 processors. IA-32 EL will enhance this capability.

IA-32 EL is a software layer that is currently shipping with Itanium architecture-based operating systems and will convert IA-32 instructions into Itanium instructions via dynamic translation. Further details on operating system support and functionality of IA-32 EL can be found at http://www.intel.com/products/server/processors/server/itanium2/index.htm.
3 Functional Units and Issue Rules

This chapter describes the number and type of available functional units, instruction issue rules, and heuristics for efficient instruction scheduling based upon machine resources and issue rules.
3.1 Execution Model

The Itanium 2 processor issues and executes instructions in assembly order, so programmer understanding of stall conditions is essential for generating high performance assembly code.

In general, when an instruction does not issue at the same time as the instruction immediately before it, instruction execution is said to have split issue. When a split issue condition occurs, all instructions after the split point stall one or more clocks, even if there are sufficient resources for some of them to execute. Common causes of split issue in the Itanium 2 processor are:

• An explicit stop is encountered.
• There are insufficient machine resources of the type required to execute an instruction.
• Instructions have not been placed in accordance with issue rules on the Itanium 2 processor.
The Itanium 2 processor issues instructions in the order defined by the static schedule. Care should be taken by the code generator to avoid register dependencies within an issue group, as shown in the sketch below. The Itanium 2 processor does not insert implicit stop bits to break WAW hazards; thus, a WAW hazard between loads and stores will result in an 8-cycle penalty if the predicates are true. Other WAW hazards, such as those due to ALU operations, will result in non-deterministic results; predicates are also taken into account.
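For example, a minimal sketch (registers are hypothetical) of using an explicit stop so that a consumer does not share an issue group with its producer:

    ld8     r14 = [r32]           // producer
    add     r16 = r8, r9 ;;       // independent work; the stop ends the issue group
    add     r15 = r14, r16        // consumer issues in a later group, so there is no
                                  // register dependency within an issue group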
Once instructions are issued as a group, they will proceed as a group through the pipeline. If one instruction in the issue group has a stall condition, the whole group will stall. This stall will also stall all instructions behind it (younger) in the pipeline.
3.2 Number and Types of Functional Units

Although parallel instruction groups may extend over an arbitrary number of bundles and contain an arbitrary number of each instruction type, the Itanium 2 processor has finite execution resources. If a parallel instruction group contains more instructions than there are available execution units, the first instruction for which an appropriate unit cannot be found will cause a split issue and break the parallel instruction group.

The front-end of the Itanium 2 processor pipeline can fetch up to two bundles per cycle and the back-end of the pipeline can issue as many as two bundles per cycle. Given that there are 3 instructions per bundle, the Itanium 2 processor can be considered a six instruction issue machine. For more details on the pipeline, see Appendix A, "Itanium® 2 Processor Pipeline."

The Itanium 2 processor has a large number of functional units of various types. This allows many combinations of instructions to be issued per cycle. Since only six instructions may issue per cycle, only a portion of the Itanium 2 processor's functional units described below will be used each cycle.
There are six general-purpose ALU units (ALU0, 1, 2, 3, 4, 5), two integer units (I0, 1), and one shift unit (ISHIFT, used for general purpose shifts and other special instructions). A maximum of six of these types of instructions can be issued per cycle.

The Data Cache Unit (DCU) contains four memory ports. Two ports are generally used for load operations; two are generally used for store operations. A maximum of four of these types of instructions can be issued per cycle. The two store ports can support a special subset of the floating-point load instructions.

There are six multimedia functional units (PALU0, 1, 2, 3, 4, 5), two parallel shift units (PSMU0, 1), one parallel multiply unit (PMUL), and one population count unit (POPCNT). These handle multimedia, parallel multiply, and the popcnt instruction types. At most, one pmul or popcnt instruction may be issued per cycle. However, the Itanium 2 processor may issue up to six PALU instructions per cycle.

There are four floating-point functional units: two FMAC units to execute floating-point multiply-adds and two FMISC units to perform other floating-point operations, such as fcmp, fmerge, etc. A maximum of two floating-point operations can be executed per cycle.

There are three branch units enabling three branches to be executed per cycle.

All of the computational functional units are fully pipelined, so each functional unit can accept one new instruction per clock cycle in the absence of other types of stalls. System instructions and access to system registers may be an exception.
3.3 Instruction Slot to Functional Unit Mapping

Each fetched instruction is assigned to a functional unit through an issue port. The numerous functional units share a smaller number of issue ports. There are 11 issue ports: eight for non-branch instructions and three for branch instructions. They are labeled M0, M1, M2, M3, I0, I1, F0, F1, B0, B1, and B2. The process of mapping instructions within bundles to functional units is called dispersal.

An instruction's type and position within the issue group define to which issue port the instruction is assigned. An instruction is mapped to a subset of the issue ports based upon the instruction type (i.e., ALU, Memory, Integer, etc.). Then, based on the position of the instruction within the instruction group presented for dispersal, the instruction is mapped to a particular issue port within that subset.

Table 3-1, "A-Type Instruction Port Mapping," Table 3-2, "I-Type Instruction Port Mapping," and Table 3-3, "M-Type Instruction Port Mapping" show the mappings of instruction types to ports and functional units. Section 3.3.2 describes the selection of the particular port based upon instruction position.

Note: Shading in the following tables indicates the instruction type can be issued on the port(s).

A-type instructions can be issued on all M and I ports (M0-M3 and I0 and I1). I-type instructions can only issue to I0 or I1. The I ports are asymmetric so some I-type instructions can only issue on port I0. M ports have many asymmetries: some M-type instructions can issue on all ports; some can only issue on M0 and M1; some can only issue on M2 and M3; some can only issue on M0; some can only issue on M2.
When dispersing instructions to functional units, the Itanium 2 processor views, at most, two bundles at a time with no special alignment requirements. This text refers to these bundles as the first and second bundles. A bundle rotation causes new bundles to be brought into the two-bundle window of instructions being considered for issue. Bundle rotations occur when all the instructions within a bundle are issued. Either one or two bundles can be rotated depending on how many instructions were issued.
3.3.2 Dispersal Rules

The Itanium 2 processor hardware makes no attempt to reorder instructions to avoid stalls. Thus, the code generator must be careful about the number, type, and order of instructions within a parallel instruction group to avoid unnecessary stalls. The use of predicates has no effect on dispersal – all instructions are dispersed in the same fashion whether predicated true, predicated false, or unpredicated. Similarly, nop instructions are dispersed to functional units as if they were normal instructions. The dispersal rules for execution units vary according to slot type; i.e., I, M, F, B, or L. The rules for the different slot types are described below.

Dispersal Rules for F Slot Instructions

• An F slot instruction in the first bundle maps to F0.
• An F slot instruction in the second bundle maps to F1.
• A SIMD FP instruction essentially maps to both F0 and F1. See Section 3.3.3 for more information on SIMD FP issue rules.
Dispersal Rules for B Slot Instructions

• Each B slot instruction in an MBB or BBB bundle maps to the corresponding B unit. That is, a B slot instruction in the first position of the template is mapped to B0; in the second position, it is mapped to B1; and in the third position, it is mapped to B2.
• The B instruction in an MIB/MFB/MMB bundle maps to B0 if it is a brp or nop.b and it is the first bundle, otherwise it maps to B2.
• For purposes of dispersal, break.b is treated like a branch.
Dispersal Rules for L Slot Instructions

• An MLX bundle uses ports equivalent to an MFI bundle. If the MLX bundle is the first bundle, the L slot instruction maps to F0. Otherwise, it maps to F1. However, there is no conflict when the MLX template is issued with an MMF or MIF bundle and the F op is a SIMD FP instruction.

Dispersal Rules for I Slot Instructions

• The instruction in the first I slot of the two-bundle issue group will issue to I0. The second I slot instruction will issue to I1.
• If the second I slot instruction can only map to an I0 port (see Table 3-2), an implicit stop will be inserted and the second I slot instruction will be issued in the next cycle. Thus, an I0-only instruction should be placed in the first I slot of a bundle pair. Only one I0-only instruction can be issued per cycle.
• An instruction in an I slot will not necessarily be issued to an I port. If the first two I slot instructions have been issued to the I ports, and an additional I slot instruction in the issue group contains A-type instructions as listed in Table 3-1, and M ports are available, these instructions will be mapped to available M ports. This allows the potential dual issue of the MII-MII bundle pair. This is new to the Itanium 2 processor and is not true on the Itanium processor.
• For the MLI template, the I slot instruction is always assigned to port I0 if it is in the first bundle or it is assigned to port I1 if it is in the second bundle. Thus, the bundle pair MII-MLI can never dual issue.
Dispersal Rules for M Slot Instructions

On the Itanium 2 processor, M slot instructions are grouped into four subtypes (see Table 3-3):

• Load subtype, which can be issued on either M0 or M1 or both (e.g., integer load, sync)
• Store subtype, which can be issued on either M2 or M3 or both (e.g., integer store, alloc, setf)
• Generic subtype, which can be issued on any of the four M ports (e.g., ALU, floating-point load)
• Special instructions, which can be issued only on the M2 port (e.g., getf, mov to AR)
The issue logic can reorder M slot instructions between different subtypes but cannot reorder instructions within the same subtype. For instance, within an issue group an integer store can precede an integer load without causing a split issue. The store will be mapped to M2 and the load to M0 since the two instructions are from different subtypes.

However, if a store precedes a getf, the store will be issued to M2 and a split issue will occur because the getf must issue on M2. Instructions within the same subtype cannot be reordered. Therefore, the code scheduler should place the getf instruction before the store to ensure the getf instruction is mapped to M2 and the store is mapped to M3 to avoid port oversubscription.
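A hedged sketch of the recommended ordering (registers are hypothetical):

    // getf is a special M-type instruction and can only issue on port M2, so it is
    // placed ahead of the store; the store subtype instruction then takes M3.
    getf.sig  r14 = f8            // maps to M2
    st8       [r32] = r15 ;;      // maps to M3; no port oversubscription, no split issue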
Dispersal becomes more complicated when generic subtype instructions early in the issue group consume M ports. There is no encompassing rule to cover these cases. It is recommended that the more restrictive subtypes get scheduled first in the issue group. Example 3-1 and Example 3-2 demonstrate some of the dispersal possibilities.
Note: MA is a generic subtype, ML is an integer load, and MS is a store subtype instruction.

Example 3-1. MA ML I - MS MA I
The bundle pair MA ML I - MS MA I gets mapped to ports M2 M0 I0 - M3 M1 I1. The first generic subtype instruction mapped to M2 causes the MS instruction to be mapped to M3. If MS is a getf instruction, a split issue will occur.

Example 3-2. MA MA I - MS MA I
The bundle pair MA MA I - MS MA I gets mapped to ports M0 M1 I0 - M2 M3 I1, which allows MS to get the more favorable M2 port.
Table 3-4 shows the combinations of bundle types that the Itanium 2 processor can dual issue (indicated by the shaded areas). Rows contain the first bundle of the pair; columns contain the second.

Table 3-4. Dual Issue Bundle Types
(Rows and columns list the bundle types MII, MLI, MMI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB; the shading that marks which pairs can dual issue is not reproduced here.)
1. The B must be nop.b or brp.
Note: Floating-point loads are generic subtype instructions. As such, the Itanium 2 processor can issue up to four per cycle. This capability is available to all normal and speculative floating-point loads of all sizes. Advanced floating-point loads, load pair instructions, and check load instructions are not generic and must issue on the two load ports, while the floating-point stores only issue to the two store ports.
3.3.3 Split Issue and Bundle Types

Because there is an increased number of functional units in the Itanium 2 processor and I slot instructions can sometimes issue to M ports, many bundle pairs can dual issue. Resource oversubscription rarely occurs. Reasons that bundle pairs would not dual issue are explicit stops and dispersal problems mentioned in the previous section. In addition, there are several Itanium 2 processor-specific (rather than architectural) special cases that will cause split issue. These specific cases are listed below:

• Branches
  — BBB/MBB: Always splits issue after either of these bundles.
  — MIB/MFB/MMB: Splits issue after any of these bundles unless the B slot contains a nop.b or a brp instruction. A br instruction always introduces an implicit stop bit for these bundle types.
  — MIB BBB: Splits issue after the first bundle in this pair from B port oversubscription.

• SIMD FP
  — Only one FP instruction can issue per cycle if the instruction is a SIMD FP instruction. For instance, for the bundle pair MFpI MFI, where Fp is a SIMD FP operation, there will be an implicit stop between the M and F instructions of the second bundle, even if the F instruction is a nop.f.
  — Similarly, for the bundle pair MFI MFpI, there will be an implicit stop between the M and Fp instructions of the second bundle since the Fp instruction must issue to the F0 port and the first F instruction has already mapped to F0.
  — One case which might seem to cause a split issue, but does not, is the bundle pair MFpI MLX. Even though the L slot acts like it maps to an F port, these two bundles can dual issue.
4 Latencies and Bypasses

This chapter describes latencies and bypasses for execution of the different instruction types on the Itanium 2 processor.

In general, integer instructions have one cycle of latency, floating-point instructions have four cycles of latency, multimedia instructions have two cycles of latency, and L1 cache hits have one cycle of latency. However, due to asymmetric bypasses, there are many special cases that need to be listed separately.
4.1 Control and Data Speculation Penalties

The Itanium 2 processor can compute the address of the recovery code from the offset in the chk.a/chk.s/fchkf instruction without having to trap to the OS fault handler. The speculative load recovery latencies listed in Table 4-1 are approximations based upon the time difference between the chk.s/chk.a/fchkf retirement and the completion of the first instruction of the fix-up code. These latencies do not include possible cache or TLB latencies, the cost of the recovery code itself, or the final branch at the end of the recovery code. Further information on advanced loads can be found in Section 5.1, "Data Speculation and the ALAT."
Table 4-1. Speculative Load Recovery Latencies

Instruction                                                    Latency (cycles)
chk.a, both int and fp (ALAT hit), chk.s (no NaT/NatVal)       0
chk.a, both int and fp (ALAT miss), chk.s (NaT/NatVal)         18
ld*.c, ldf*.c (ALAT hit, L1/L2 hit)                            0
ld*.c, ldf*.c (ALAT miss, L1/L2 hit)                           8
4.2 Branch Related Latencies and Penalties

Table 4-2 describes latencies for branch operations and branch related flushes. See Section 7, "Branch Instructions and Branch Prediction" for more detailed information.

Note for Table 4-2:
1. The 6-cycle penalty is for IP-relative branches that cross a 40-bit boundary. Loop branches that are mispredicted take 7 cycles. These incur a full branch mispredict penalty.
Table 4-3. Execution with Bypass Latency Summary
(Producer classes covered: cmp, tbit, tnat; FP side predicate writes (fcmp; frcpa, fprcpa, frsqrta, fpsqrta); integer load; FP load; IEU2: move_from_br, alloc; move to/from CR or AR; move to pr; move indirect. The per-consumer latency matrix is not reproduced here; the notes below still apply.)
Notes for Table 4-3:
1. Since these operations are performed on the L1D, they interact with the L1D and L2 pipelines. These are the minimum latencies, but they could be much larger because of this interaction.
2. Since these operations are performed on the L1D, they interact with the L1D and L2 pipelines. These are the minimum latencies, which could be much larger because of this interaction.
3. N depends upon which level of cache is hit: N=1 for L1D, N=5 for L2, N=12-15 for L3, N=~180-225 for main memory. These are minimum latencies and are likely to be larger for higher levels of cache.
4. M depends upon which level of cache is hit: M=5 for L2, M=12-15 for L3, M=~180-225 for main memory. These are minimum latencies and are likely to be larger for higher levels of cache. The +1 in all table entries denotes one cycle needed for format conversion.
5. Best-case values of C range from 2 to 35 cycles depending upon the registers accessed. EC and LC accesses are 2 cycles; FPSR and CR accesses are 10-12 cycles.
6. Best-case values of D range from 6 to 35 cycles depending upon the indirect registers accessed. Iregs pkr and rr are on the faster side, being 6-cycle accesses.
7. It should be noted that the multimedia type includes I1-I9, A9, A10, and only the cmp4 from A8 instructions as listed in Table 3-1 and Table 3-2.
Control register access latencies (cycles):

mov r2=cr.ifs     12        mov cr.ifs=r2     11
mov r2=cr.iim     2         mov cr.iim=r2     11
mov r2=cr.iha     5         mov cr.iha=r2     6
mov r2=cr.lid     36        mov cr.lid=r2     35
mov r2=cr.ivr     36        READ ONLY
mov r2=cr.tpr     36        mov cr.tpr=r2     35
mov r2=cr.eoi     36        mov cr.eoi=r2     35
mov r2=cr.irr0    36        READ ONLY
mov r2=cr.irr1    36        READ ONLY
mov r2=cr.irr2    36        READ ONLY
mov r2=cr.irr3    36        READ ONLY
mov r2=cr.itv     36        mov cr.itv=r2     35
mov r2=cr.pmv     36        mov cr.pmv=r2     35
mov r2=cr.cmcv    36        mov cr.cmcv=r2    35
mov r2=cr.lrr0    36        mov cr.lrr0=r2    35
mov r2=cr.lrr1    36        mov cr.lrr1=r2    35
mov from dbr[r0]  36        mov to dbr[r3]    1
mov from ibr[r0]  36        mov to ibr[r3]    46
mov from pkr[r0]  5         mov to pkr[r3]    1122
mov from pmc[r0]  36        mov to pmc[r3]    3546
mov from pmd[r0]  36        mov to pmd[r3]    3546
mov from rr[r0]   5         mov to rr[r3]     1122
5 Data Operations

This chapter describes considerations for data operations such as speculative or predicated loads or stores, floating-point loads, and prefetches. Load hints, data alignment, and write coalescing considerations are also discussed.
5.1 Data Speculation and the ALAT

The family of instructions composed of ld.a/ldf.a/ldfp.a, ld.c/ldf.c/ldfp.c, and chk.a provides the capability to dynamically disambiguate memory addresses between loads and stores. Architecturally, the ld.c and chk.a instructions have a 0-cycle latency to consuming instructions. This allows the ld.c/ldf.c/ldfp.c/chk.a and the corresponding consuming instruction to be scheduled in the same cycle. However, if a ld.c/ldf.c/ldfp.c/chk.a misses in the ALAT, additional latency is incurred. Also, an advanced load activates the scoreboard for the target register in order to ensure correct operation in the event of an L1D miss.
A ld.c, ldf.c, or ldfp.c that misses the ALAT initiates an L1 cache access. Other instructions in the issue group will be re-executed. This is an 8-cycle penalty that will affect all operations issued since the check load, whether there was a consumer in the same issue group or not. The consumer will be exposed to any additional cache latency (i.e., if the check load is found in the L1 then the penalty will be only 8 cycles). However, if the check load is in the L2, the user will see greater latency.
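A minimal data-speculation sketch using a check load (register numbers are hypothetical, not from the manual):

    ld8.a     r14 = [r33]         // advanced load; allocates an ALAT entry
    // ... the possibly conflicting store is scheduled ahead of the check ...
    st8       [r34] = r9 ;;
    ld8.c.clr r14 = [r33]         // check load: 0-cycle if the ALAT entry survived, so the
    add       r15 = r14, r10 ;;   // consumer may share its cycle; an ALAT miss costs 8 cycles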
A chk.a that misses in the ALAT executes a branch to recovery code. On the Itanium 2 processor, the branch target can be computed from the offset contained in the chk.a instruction in most instances. This avoids the trap to the operating system that is done on the Itanium processor. The cost of a chk.a that misses in the ALAT is at least 18 cycles to branch to recovery code, plus the cost of the recovery code, plus the return. The actual resteer to fix-up code occurs within 10 cycles; however, there are at least 8 cycles for the first instruction of the fix-up code to complete. The 8 cycles will increase when the branch to fix-up code misses the L2 ITLB or L1I and other cache levels.
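A hedged sketch of advanced-load recovery through chk.a (register numbers and the labels back/recover are hypothetical):

    ld8.a     r14 = [r33] ;;
    add       r15 = r14, r10      // speculative use of the loaded value
    // ... possibly conflicting store ...
    st8       [r34] = r9 ;;
    chk.a.clr r14, recover        // an ALAT miss branches to the recovery code (at least 18 cycles)
back:
    // ... continue using r15 ...
recover:
    ld8       r14 = [r33] ;;      // redo the load and the dependent computation
    add       r15 = r14, r10
    br.cond.sptk  back ;;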
The Itanium 2 processor ALAT has 32 entries and is fully associative. Each entry contains the register number, type, and the lower 20 bits of the physical address. The address is used to compare against potentially conflicting stores while the register index and type support the check operation. Since only partial addresses are saved in the ALAT, it is possible to have a false conflict if a store and an ALAT entry had different addresses yet shared the same lower 20 bits of physical address.

In addition, if a ld.c or chk.a follows a store too closely, the ALAT address comparison will be done on fewer than 20 bits of physical address. This is a result of the minimum 4K page size support and the need for both store and check addresses to be fully translated to accomplish the 20-bit physical address comparison. Table 5-1 lists the distances and comparison sizes.

Note: In Table 5-1, ld.c also implies ldf.c and ldfp.c.
Table 5-1. ALAT Entry Comparison Sizes

Distance                                         Comparison Size
st and ld.c in same cycle                        12-bit
st precedes ld.c by 1 cycle                      12-bit
st precedes ld.c by more than 1 cycle            20-bit
Table 5-1. ALAT Entry Comparison Sizes (Continued)

Distance                                         Comparison Size
st and chk.a in same cycle                       12-bit
st precedes chk.a by 1 cycle                     12-bit
st precedes chk.a by more than 1 cycle           20-bit

Note: On the Itanium processor, if a store and chk.a occur in the same cycle, the chk.a will always fail, but this is not the case for the Itanium 2 processor.
5.1.1 Allocation/Replacement Policy

When a new entry is added in the ALAT, the following is the priority listing of which entry is replaced:

• The entry with the same register number as the new entry.
• The first invalid entry.
• A valid entry is replaced based upon advancing pointers associated with ports M0 and M1. This approximates a first-in-first-out (FIFO) algorithm.
5.1.2 Rules and Special Cases

The following rules and special cases should be noted:

• The Itanium architecture definition prohibits scheduling a ld.a and ld.c in the same cycle if both instructions have the same target register. Similarly, ld.a and chk.a cannot be scheduled in the same cycle if they have the same target register. However, separation by one or more cycles will give normal ALAT behavior. A similar situation is true for ldf.a and ldfp.a.
• A faulting ld.a will not write to the ALAT. Such faults are listed in Volume 3: Instruction Set Reference of the Intel® Itanium® Architecture Software Developer's Manual and include, among others, Data Page Not Present, Data TLB, and Unaligned Data Reference faults. In these situations, a subsequent corresponding ld.c or chk.a will definitely miss in the ALAT.
• If both an ALAT set and ALAT invalidate instruction occur in the same cycle, the ALAT set will not occur. For instance, if a chk.a.clr rx and rx = ld.a[addr] occur in the same cycle, the address of the ld.a[addr] will not be entered in the ALAT.
5.2 Speculative and Predicated Loads/Stores

Memory operations with speculative inputs behave in the following manner:

• For a normal load/store whose source register contains a NaT value, a register NaT consumption fault will occur.
• For a speculative load whose source register contains a NaT value, the NaT bit is set and a zero value will be returned.
The Itanium 2 processor supports two deferral behaviors: early and late. The behavior of speculative memory operations depends on several factors such as interrupt state, deferral control registers, and processor configuration. Early deferral mode is enabled through the PAL procedure PAL_PROC_SET_FEATURES. The effects of this will be maintained until the system is rebooted and the processor returns to the default late deferral behavior. Table 5-2 lists the requirements to enable early deferral.

Table 5-2. Early and Late Deferral

Early Deferral Enabled    psr.ic    dcr.dm    Deferral Mode
Yes                       0         0         Late
Yes                       0         1         Early
Yes                       1         0         Late
Yes                       1         1         Late
No                        x         x         Late
Table 5-3 shows the latency, according to deferral mode, that a speculative load may incur before returning data or eventually setting the destination NaT bit. The cost of each exception deferral ranges from one cycle to several cycles depending on the latency of the HPW. These HPW-related penalties cannot be scheduled around and affect every instruction in the issue group. Also, it is possible for the exception causing a deferral to not be resolved when the exception is deferred. Thus, the deferral stall may be seen each time through a loop where the chk.s is not reached.

Table 5-3. Speculative Load Latencies by Deferral Mode (cycles)

                 Early Deferral            Late Deferral
L1 DTLB Hit      2                         2
L2 DTLB Hit      2 or 4 + L2 latency       2 or 4 + L2 latency
VHPT Hit         5 (no HPW walk)           22 or 2 + L2 latency
VHPT Miss        5 (no HPW walk)           20 + L2 latency
VHPT Fault       5 (no HPW walk)           17 + L2 latency
Note: Speculative loads are not limited to ld.s instructions. lfetch instructions are normally speculative and behave similarly to ld.s instructions with the exception that they never set a NaT bit or return data. An lfetch instruction may be made non-speculative with the .fault completer.

The advantage of early deferral is that speculative operations complete with low latency. The latency is at best three cycles for an early deferred ld.s as seen by a dependent operation. This is important in situations where the code generator is aggressive in its speculation and the chances of the speculative operation actually hitting in the data TLB are low. Since early deferral does not initiate a VHPT walk by the HPW, even valid requests may fault since they are not in the L2 DTLB.
5.3 Floating-Point Loads

Floating-point loads are not cached in the L1D and are instead processed directly by the L2. The limited size and bandwidth of the L1D makes caching this data unprofitable. It is expected that FP memory accesses can more easily be scheduled to cover the additional latency of the L2. Floating-point loads incur an extra clock of latency over integer accesses to accommodate format conversion. Therefore, a floating-point load takes 6 cycles if it hits in L2. Note also that the FP load pair instructions (both double-precision and single-precision) also access the L2 cache, so the latency for a load pair instruction is also 6 cycles assuming that it is an L2 hit.
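A hedged illustration (register numbers are hypothetical):

    ldfpd   f6, f7 = [r32], 16    // double-precision load pair, serviced by the L2 (bypasses L1D);
                                  // roughly 6 cycles on an L2 hit, with post-increment of the base
    ldfs    f8 = [r33]            // single-precision load: also serviced by the L2, 6 cycles on an L2 hit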
5.4 Data Cache Prefetching and Load Hints

The architecture provides two software mechanisms to control when and where data is loaded. The lfetch instruction is used to explicitly prefetch data into the L1D, L2, or L3 caches. To facilitate more data locality, temporal hints can be used to control the level of the cache hierarchy into which loaded data is placed.

5.4.1 lfetch Implementation

The Itanium 2 processor implementation of lfetch is as follows:

• lfetch.none is completed only if there are no exceptions. Exceptions are not reported. Section 5.2 contains information on the behavior of lfetch instructions that encounter memory management faults.
• lfetch.fault is completed whether or not there is an exception. If there is an exception, it is raised to the OS to complete the operation. A TLB miss is resolved as with a normal load.
• If the lfetch misses in L1D but hits in the L2, the L1D cache is allocated based on the lfetch temporal hint. lfetch instructions have the same temporal locality behavior as integer loads.
• All lfetch types which miss in the first level data TLB and hit in the second level data TLB will stall the main pipeline and fill the first level data TLB as a normal load operation. The behavior of the lfetch in the event of an L2 DTLB miss depends on the use of the early or late deferral modes described in Section 5.2. In early deferral mode, the lfetch aborts with an L2 DTLB miss. In late deferral mode, the lfetch will initiate an HPW access. If the access fails, the lfetch will abort. However, it is only the lfetch.fault instruction that will initiate a HPW access when it misses both data TLBs.
• An lfetch.excl appears as a store to other cache levels and the system bus. This means that these operations will place a line in the M state within the caches. Do not use the .excl completer unless there is a high probability that the data will truly be modified. Otherwise, the cache will evict unmodified data to the cache structures and eventually to memory.
• An lfetch to an uncacheable memory location will not reach the L2 cache as required by the architecture.
Note: The lfetch instruction appears as a load operation without a specific data return to the core. As such, many of the limitations that normal loads experience anywhere in the memory hierarchy will affect the lfetch instruction as well. Exceptions are noted and are provided with the intent that they will make lfetch instructions easier for the compiler to use in realizing performance.
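A short sketch of the prefetch forms discussed above (addresses and strides are hypothetical):

    lfetch            [r32], 128  // non-faulting prefetch; exceptions are not reported
    lfetch.fault.nt1  [r33], 128  // faulting prefetch, hinted to bypass the L1D
    lfetch.excl       [r34]       // requests exclusive ownership; use only when the line
                                  // will very likely be modified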
5.4.2 Load Temporal Locality Completers

The Itanium architecture uses memory locality hints for managing the data cache hierarchy. On the Itanium 2 processor, four types of memory locality hints are implemented: t1, nt1, nt2 and nta. The Itanium 2 processor does not support a non-temporal buffer; instead, non-temporal L2 accesses are allocated in L2 with biased replacement. The implementation is as follows:

• The t1 hint is for normal accesses. On a load, the line is allocated in L1D, L2, and L3. On a store, the line is allocated in L2 and L3, but not L1D.
• For loads with the nt1 hint, the line is only allocated in L2 and L3. In addition, the line is biased to be replaced in the L2. This is achieved by not updating the L2 LRU bits. Note that by doing so, the line has a higher probability of being replaced, though it is not guaranteed to be replaced next.
• Loads with the nt2 hint are implemented in the same manner as loads with the nt1 hint.
• For loads and stores with the nta hint, the line is only allocated and biased to be replaced in L2. The line is not allocated into L3.

Table 5-4 lists how L1D, L2, and L3 handle line allocation and LRU update for different hints. Note that:

• L1D is write through and does not support FP loads and stores.
• The valid bit update in the L1D cache and the LRU bits update in the L3 cache are independent of the hint bits. Only the update of the L2 LRU is biased to mimic the behavior of a non-temporal buffer.
Table 5-4. Processor Cache Hints
(The table gives, for each access type – lfetch, integer load, integer store, FP load, FP store – and each hint, whether the line is allocated and whether the LRU is updated in L1D, L2, and L3; the matrix itself is not reproduced here.)
1. Alloc indicates an entry is allocated in that level of the cache on a cache miss.
2. Integer load and FP load – only t1, nt1, and nta attributes are allowed.
3. Integer store and FP store – only t1 and nta are allowed.

Note: Other instruction/hint combinations are not allowed by the Itanium architecture.
Intel® Itanium® 2 Processor Reference Manual For Softwa re D evelopment and Optimization41
Data Operat io ns
5.4.2.1 General Descriptions of Hints

Memory locality hints are described below:

• .none: The load delivers the data and is loaded into both L1D and L2.
• .nt1: This hint means non-temporal locality in the first cache level capable of holding the referenced data. The Itanium architecture suggests this hint indicates that the load should deliver the data and the line should not be allocated in the first level caches. For the Itanium 2 processor, this hint will cause the line not to be allocated to the L1D on an integer cache miss. If it is already in the L1D cache, it will not be deallocated.
• .nt2: This hint means non-temporal locality in the second cache level capable of holding the instruction. For the Itanium 2 processor, this hint will cause integer accesses to the line to be allocated in L2; however, the LRU information will not be updated for the line (i.e., it will be the next line to be replaced in the particular set). If it is already in the L2 cache, it will not be deallocated.
• .nta: This hint means non-temporal locality in all levels of the cache hierarchy. For the Itanium 2 processor, this hint will cause the line to be allocated in L2; however, the LRU information will not be updated for the line (i.e., it will be the next line to be replaced in the particular set). This line will not be allocated in the L3 cache. If present in any cache, it will not be deallocated from that cache, although sometimes lines are deallocated for coherency reasons.
Note: There is no way to allocate only in L3 and not impact L2, even with an lfetch instruction.

The one-way allocation for non-temporal L2 data may lead to displacement of L2 data for a temporary data stream since the non-temporal data may be quickly replaced. A single L2 way holds 32KB. This may be large enough for a single .nt stream, but an attempt to use two non-temporal streams may cause one stream to displace the other.
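For instance, a hedged example of hinted integer loads (register numbers are hypothetical):

    ld8.nt1  r14 = [r32]          // non-temporal at level 1: the line is not allocated in L1D
    ld8.nta  r15 = [r33]          // non-temporal at all levels: biased L2 replacement, no L3 allocation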
5.5 Data Alignment

The Itanium 2 processor implementation supports arbitrarily aligned load and store accesses, except for integer accesses that cross 8-byte boundaries and any accesses that cross 16-byte boundaries.

If psr.ac = 1, all unaligned memory references will fault.

If psr.ac = 0, these rules must be followed to avoid faults:

• Integer loads and stores must be aligned within an 8-byte aligned window.
• All FP 4-byte and 8-byte load operations can be unaligned within a 16-byte aligned window.
• All FP load pairs must be naturally aligned; i.e., singles on an 8-byte alignment, doubles on a 16-byte alignment, ldfp8 on a 16-byte alignment.
• All FP 10-byte loads can be unaligned within a 16-byte window.
• FP fill/spill instructions must be aligned within a 16-byte aligned window.
• FP stores can be unaligned within a 16-byte aligned window.
• Semaphores (cmpxchg, xchg, fetchadd) must be restricted to natural alignment.
• All uncacheable (UC, WC) accesses which cross an 8-byte boundary will fault.
5.6 Write Coalescing
For increased performance of uncacheable references to frame buffers, previous generation IA-32 processors defined the write coalescing (WC) memory type. WC allows streams of data writes to be combined into a single, larger bus write transaction. The Itanium 2 processor fully supports write coalescing as defined by the Intel® Pentium® III processor. Like the Pentium III processor, the Itanium 2 processor performs WC loads directly from memory and not from the coalescing buffers.
The Itanium 2 processor has a separate two-entry, 128-byte buffer (WCB) that is used for WC
accesses exclusively. Each byte in the line has a valid bit. If all valid bits are true, then the line is
said to be full and will be evicted by the processor.
5.6.1 WC Buffer Eviction Conditions
To ensure consistency with memory, the WCB is flushed under the following conditions (both entries are flushed). Table 5-5 shows the eviction conditions when the processor is operating in the Itanium system environment:

Table 5-5. Itanium® 2 Processor WCB Eviction Conditions

  Eviction Condition                   WCB Evicted
  Itanium® Instructions:
    Flush cache (fc) hit on WCB        yes
    Flush write buffers (fwb)          yes
  Architectural Conditions for WCB Flush:
    Any UC load (1)                    no
    Any UC store (1)                   no
    UC load or ifetch hits WCB (1)     no
    UC store hits WCB (1)              no
    WC load/ifetch hits WCB            no
    WC store hits WCB (2)              no

  1. The Itanium® architecture doesn't require the WC buffers to be coherent with respect to UC load/store operations.
  2. A WC store which hits in the WCB updates that entry if it is not full. If it is full, a check is made whether that entry is older or younger than the other WCB entry. If it is younger, the older WCB entry is flushed out (even if it is not full), and the younger WCB entry is flushed afterwards. If the WCB entry is the oldest, it is flushed by itself.
5.6.2 WC Buffer Flushing Behavior
As mentioned previously, the Itanium 2 processor WCB contains two entries. The WC entries are flushed in the same order as they are allocated; that is, the entries are flushed in written order. This flushing order applies only to a “well-behaved” stream. A “well-behaved” stream writes one WC entry at a time and does not write the second WC entry until the first one is full. This implies that the addresses of the WC stores monotonically increase. A store with release semantics should be used to force a flush of a partial line before starting on the next line.
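The following is a minimal sketch, assuming fb points into a WC-mapped frame buffer (how that mapping is obtained is platform specific and not shown). Each 128-byte WCB entry (16 quadwords) is written with monotonically increasing addresses, and the last store of a line uses release semantics, the C-level analogue of st.rel, so a line is pushed out before the stream moves on.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a "well-behaved" WC stream; fb and wc_fill are illustrative
 * names, not part of any real API. */
void wc_fill(_Atomic uint64_t *fb, uint64_t value, size_t lines)
{
    for (size_t line = 0; line < lines; line++) {
        for (size_t q = 0; q < 15; q++)          /* first 15 quadwords */
            atomic_store_explicit(&fb[line * 16 + q], value,
                                  memory_order_relaxed);
        /* last quadword with release semantics closes out the line */
        atomic_store_explicit(&fb[line * 16 + 15], value,
                              memory_order_release);
    }
}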
In the absence of platform retry or deferral, the flushing rule implies that the WCB entries are always flushed in program written order for a “well-behaved” stream, even in the presence of interrupts. For example, consider the following scenario: if software issues a “well-behaved” stream but is interrupted in the middle, one of the WC entries could be partially filled. The WCB (including the partially filled entry) could be flushed by the OS kernel code or by other processes. When the interrupted context resumes, it sends out the remaining line and then moves on to fill the other entry. Note that the resumed context could be interrupted again in the middle of filling up the other entry, causing both entries to be partially filled when the interrupt occurs.
For streams that do not conform to the above “well-behaved” rule, the order in which the WC buffer is flushed is random.
WCB eviction is performed for full lines by a single 128-byte bus transaction. For partially full lines, the WCB is evicted using 1–8, 16, or 32-byte transactions with the proper enables. The flushing will issue the largest data transactions allowed by a contiguous and aligned set of write-coalescing data. When flushing, WC transactions are given the highest priority of all external bus operations.
5.7 Register Stack Engine
The Itanium 2 processor register stack engine (RSE) operates only in lazy mode (ar.rsc.mode = 0). All other mode configurations are ignored.
A maximum of two loads or two stores can be performed by the RSE in each cycle, but not both loads and stores at the same time.
Generally, RSE loads and stores are expected to hit in the L1D cache, and the L1D is capable of holding RSE cache lines.
5.8 FC Instructions
The fc instruction invalidates a specified cache line from all levels of the cache hierarchy. On the Itanium 2 processor, each fc invalidates 128 bytes, corresponding to the L3 cache line size. Since both the L1I and L1D have line sizes of 64 bytes, a single fc instruction can invalidate two lines in those caches.
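A short sketch of how software might walk a buffer given the 128-byte fc granularity is shown below. flush_cache_line() is a hypothetical wrapper (inline assembly or a compiler intrinsic) around a single fc; it is not a standard API.

#include <stdint.h>
#include <stddef.h>

extern void flush_cache_line(void *addr);   /* assumed: executes one fc */

/* Sketch: because each fc invalidates a 128-byte L3 line (covering two
 * 64-byte L1 lines), the loop steps by 128 bytes. */
void flush_range(void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)127;   /* align down */
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += 128)
        flush_cache_line((void *)p);
}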
Memory Subsystem 6
The Itanium 2 processor memory system has a three-level cache structure: a first-level instruction
cache (L1I), a first-level data cache (L1D), a unified second-level cache (L2), and a unified
third-level cache (L3).
The following sections contain detailed information on the workings of the L1D, L2, L3, and system bus. This information is presented to give a basis for the optimization recommendations and to give enough understanding to recognize bottlenecks that are not specifically covered in this document. Chapter 9, “Optimizing for the Itanium® 2 Processor,” provides some important suggestions for optimizing for the Itanium 2 processor memory subsystem.
Figure 6-1. Three Level Cache Hierarchy of the Itanium® 2 Processor
[Figure: the L1I (16 KB, 64-byte line, 1 cycle) and L1D (16 KB, 64-byte line, 1 cycle) back onto the unified L2 (256 KB, 128-byte line, 5+ cycle) and L3 (9 MB, 128-byte line, 14+ cycle); bus control logic connects the caches to memory and I/O over the 6.4 GB/s system bus.]
The Itanium 2 processor employs a two-level TLB for both instruction and data references: the first-level (L1 ITLB) and second-level (L2 ITLB) instruction TLBs for instructions, and the first-level (L1 DTLB) and second-level (L2 DTLB) data TLBs for data.
The Itanium 2 processor implements all the features of the Itanium architecture requirements for virtual memory support. Table 6-1 lists the specific parameters of the Itanium 2 processor implementation.
Table 6-1. Itanium® 2 Processor Virtual Memory Support

  Virtual Memory            Itanium® 2 Processor Implementation
  Page Size                 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, and 4G bytes
  Physical Address          50 bits
  Virtual Address           64 bits
  Region Registers          8 registers with 24 bits in each register
  Protection Key Registers  16 registers with 24 bits in each register
6.1 Translation Lookaside Buffers
Table 6-2 shows the major features of the TLBs of the Itanium 2 processor. The capabilities of the instruction and data TLBs are approximately equivalent. The first-level TLBs are closely tied to the first-level instruction and data caches. This is necessary to support single-cycle access for the L1 caches and comes at the price that a first-level TLB miss forces a first-level cache miss.

Table 6-2. Major Features of Instruction and Data TLBs

                                Instruction TLBs    Data TLBs
  Structures                    L1 ITLB, L2 ITLB    L1 DTLB, L2 DTLB
  Number of Entries             32, 128             32, 128
  Associativity                 Full, Full          Full, Full
  Penalty for First Level Miss  2 cycles            4 cycles

6.1.1 Instruction TLBs
The L1 ITLB has 32 fully associative entries and is dual ported. One port is used exclusively for
regular instruction fetches and LRU updates. The second port is shared among instruction
prefetches, snoops, and TLB purges. The L1 ITLB contains sufficient information, region registers, and protection keys, such that it does not need to be a strict subset of the larger L2 ITLB.
When an L1 ITLB page translation is replaced, all entries in the L1I cache from the victimized
page are invalidated. The victim entry is determined using true LRU. The L1 ITLB directly
supports only a 4KB-page size. Other page sizes are indirectly supported by allocating additional
L1 ITLB entries as each 4KByte segment of the larger page is referenced.
The L2 ITLB has 128 fully associative entries and is single ported. Up to 64 entries of the L2 ITLB can be assigned as translation registers (TRs). TRs are effectively translations locked into the L2 ITLB and are therefore not subject to the LRU replacement policy. The L2 ITLB directly supports page sizes of 4KB, 8KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB, 256MB, 1GB, and 4GB.
The L1 ITLB and L2 ITLB are accessed in parallel for demand fetches to reduce an L1 ITLB miss (and associated L1I cache miss) penalty. These parallel accesses do not update the L2 ITLB LRU values. If an instruction access misses in the L1 ITLB, but hits in the L2 ITLB, the first-level instruction cache access will incur a two-cycle penalty (in parallel with the second-level cache latency) to transfer the page information from the L2 ITLB to the L1 ITLB. Since an L1 ITLB miss results in an L1I cache miss, the penalty will likely be greater as the instruction must be accessed from higher-level caches or system memory.
6.1.2 Data TLBs
The L1 DTLB has 32 fully associative entries and is dual ported. Only two ports are required because it supports only integer load operations. Unlike the L1 ITLB, the L1 DTLB lacks protection and page attribute information. Consequently, the L1 DTLB is accessed in parallel with the L2 DTLB and must be a strict subset of the second-level DTLB for an L1D hit.
When an L1 DTLB page translation is replaced, all entries in the L1D from the victimized page are invalidated. The L1 DTLB has a fixed page size of 4 KB. Larger page sizes are supported by allocating an additional L1 DTLB entry for each referenced 4 KB portion of the larger page.
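A worked example of that allocation behavior, written as a small sketch (the program and its output format are illustrative, not from the manual):

#include <stdio.h>

/* Sketch: the L1 DTLB maps only 4 KB at a time, so each referenced 4 KB
 * chunk of a larger page consumes one of its 32 entries, even though the
 * L2 DTLB covers the page with a single translation. */
int main(void)
{
    unsigned page_kb[] = { 4, 16, 64 };
    for (int i = 0; i < 3; i++)
        printf("%2u KB page -> up to %2u of the 32 L1 DTLB entries\n",
               page_kb[i], page_kb[i] / 4);
    return 0;
}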
The L2 DTLB has 128 fully associative entries and is four ported. The four ports are needed to allow all combinations of integer loads, stores, and floating-point loads to be looked up in parallel. The integer loads rely on the L2 DTLB for protection and page attribute information. The other accesses get virtual-to-physical mapping, protection, and page attributes from the L2 DTLB.
Up to 64 entries of the L2 DTLB can be assigned as TRs. TRs are effectively translations locked
into the L2 DTLB and are therefore not subject to LRU replacement policy. The L2 DTLB directly
supports page sizes of 4KB, 8KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB, 256MB,
1GB, and 4GB.
Stores or floating-point accesses that miss the L1 DTLB incur no penalty from an L1 DTLB miss. Integer loads that miss the L1 DTLB but hit the L2 DTLB incur a 4-cycle penalty (in addition to the L2 cache latency) to transfer from the L2 DTLB to the L1 DTLB. Also, a load access that misses the L1 DTLB will not hit in the L1D.
6.2 Hardware Page Walker
The hardware page walker (HPW) is the third level of address translation. The HPW is an engine that performs page look-ups from the virtual hash page table (VHPT). When an L2 DTLB or L2 ITLB miss is encountered, the HPW will access (as necessary) the L2 cache, the L3 cache, and finally memory to obtain the page entry. If the HPW cannot locate the page entry in the L2, the L3, or memory, an interruption is generated and a software handler is called to complete the translation (unless the requesting instruction defers the exception). The HPW will accept a new instruction TLB miss while processing a data TLB miss (and vice versa); however, the HPW will not process them at the same time. The requests are effectively serialized.
Cache accesses must wait for TLB resolution to complete:
• L1D accesses both L1 DTLB and L2 DTLB in parallel.
• L1I accesses only require an L1 ITLB lookup (an L2 ITLB lookup is required upon an L1
ITLB miss).
• L2/L3 data accesses only require an L2 DTLB lookup.
• L2/L3 instruction accesses only requir e an L2 ITLB looku p.
When an L2 DTLB or L2 ITLB miss occurs, an HPW lookup is performed. This HPW walk may be aborted at any time. For non-speculative memory requests, when the HPW aborts or cannot successfully map the virtual address, a fault is raised. For speculative memory requests, the actual request is aborted and the ld.s will set the NaT bit. The minimum penalty for going to the HPW is summarized in Table 6-3. An HPW lookup does not look in or cause a fill of the L1D cache. Since an L2 DTLB or L2 ITLB miss also implies a miss in the L1D or L1I, the penalty shown in Table 6-3 includes the best case L2 cache latency added to the HPW walk latency.
Table 6-3. Best Case HPW Penalties

  Event                    Penalty in Cycles
  Hit in L2                25
  Miss in L2, hit in L3    31
  Miss in both L2 and L3   20 + main memory latency (system dependent)
Table 6-4 summarizes the key parameters of the on-chip caches of the Itanium 2 processor.
Table 6-4. Cache Summary

  Size:                L1I 16 KB; L1D 16 KB; L2 256 KB; L3 up to 9 MB
  Associativity:       L1I 4-way; L1D 4-way; L2 8-way; L3 18- or 12-way
  Line size:           L1I 64 bytes; L1D 64 bytes; L2 128 bytes; L3 128 bytes
  Latency:             L1I 1 cycle; L1D 1 cycle; L2 minimum 5 cycles (7 cycles, with a 6-cycle stall penalty in the ROT stage, for instruction load use); L3 minimum 14 or 12 cycles load use
  Data read bandwidth: L1I 32 bytes/cycle; L1D 2 x 8 bytes/cycle; L2 48 bytes/cycle (1); L3 1 x 32 bytes/cycle
  Data banks:          L1I n/a; L1D 8 bytes/bank; L2 16 bytes/bank; L3 n/a
  Write bandwidth:     L1I n/a; L1D 2 x 8 bytes/cycle; L2 4 x 16 bytes/cycle; L3 1 x 32 bytes/cycle
  Fill bandwidth:      L1I 64 bytes every 2 cycles; L1D 64 bytes every 2 cycles; L2 128 bytes (assembled over 4 cycles, written in 1 cycle); L3 128 bytes in 4 cycles
  Outstanding misses:  L1I 7 (prefetch); L1D 8; L2 16; L3 16 (shared with L2, 6 write)

  1. The L2 read bandwidth is 48 bytes/cycle because the L2 can complete 2 ldfpd and 2 integer loads at a time. Any combination of 4 floating-point and integer returns may also complete every cycle.
6.4 First-Level Instruction Cache
The first-level instruction cache (L1I) is a 16 KB, four-way set associative, physically addressed cache with a 64-byte line size. Lower virtual address bits 11:0, which represent the minimum virtual page, are never translated and are used for cache indexing. The L1I can fill a 64-byte line once every two cycles. It blocks on demand fetch misses but is non-blocking for prefetch misses, allowing up to seven to be outstanding to the L2 cache.
The L1I can sustain a rate of 32 bytes of reads per cycle to support a fetch rate of two bundles per cycle. The front-end always fetches aligned 32-byte bundle pairs from the L1I. If a branch points to the middle rather than the beginning of a 32-byte bundle pair, only the second bundle will be fetched. Therefore, branch targets must be on aligned 32-byte boundaries to achieve maximum fetch bandwidth from the L1I.
The tag array is dual-ported: one port is dedicated to instruction demand fetches, the other is shared between cache snoops and instruction prefetches. Cache snoops have priority over prefetches. The data array is dual ported, one port for reading and one for fills. Additionally, special effort has been made to allow L1I reads and fills to occur simultaneously, thus there are few events that can keep an L1I miss from eventually writing into the L1I.
6.5 Instruction Stream Buffer
The Itanium 2 processor instruction stream buffer (ISB) is located between the L1I and the L2 caches. It serves as a line fill buffer for the L1I and assists in instruction prefetching. The ISB contains eight 64-byte cache lines (8 double bundle pairs of instructions) and is fully associative.
L1I lines returned from the L2, whether demand misses or prefetches, are all stored in the ISB. If a returned cache line is a demand miss, it will be forwarded to the instruction pipeline and may be moved into the L1I. The cache line remains in the ISB until an idle period when it can drain into the L1I. The ISB entry may be victimized or invalidated before this move occurs, preventing the L1I fill from occurring. Since the L1I supports both reads and fills at the same time, ISB entries empty quickly into the L1I and few ISB victimizations or invalidations will occur.
The ISB is accessed in parallel with the L1I. An ISB hit has the same latency as an L1I hit. If the
target lin e hits both the ISB and the L1I, the matching line in the ISB is invalidated.
6.6 First-Level Data Cache
The first-level data cache (L1D) is a multi-ported, 16 KB, four-way set associative, physically-addressed cache with a 64-byte line size. The L1D is non-blocking and in-order. Lower virtual address bits 11:0, which represent the minimum virtual page, are never translated and are used for cache indexing.
The L1D is designed such that there are two dedicated load ports and two dedicated store ports. These ports are fixed, but the issue logic can rearrange loads and stores within an issue group to ensure they issue to the appropriate memory port. The load ports are dual ported, meaning that any two load addresses can be read from the memory in parallel without conflict. Stores, however, access the L1D data array in 8 groups that are 8 bytes wide. Stores do have the potential for conflicts, but the store buffer coalescing hardware limits the impact such conflicts have on performance.
The access latency of the L1D is one cycle unless the result is used as the address of another load operation (i.e., pointer chasing), in which case it is two cycles. The L1D enforces a write-through, no-write-allocate policy. All stores go to the L2 cache whether they hit or miss in the L1D. If a store hits in the L1D, the data is kept in a store buffer until the data arrays become available to update the L1D. These store buffers are capable of merging store data and forwarding it to later loads, with restrictions. The L1D allocates on load misses according to temporal hints, load type, and available resources.
The L1D is highly integrated into the integer data path. All integer loads must go through the L1D to return data to the integer register file/bypass network. Consequently, integer L1D misses, after being serviced by the L2, L3, or memory, also use the L1D datapath to the integer register file and block any core load that may require the same L1D data path.
Floating-point loads do not access the L1D. This allows them to issue on any of the four memory ports with minimal restrictions. Floating-point load-pairs and any floating-point loads with ALAT interactions can only be dispersed on the load ports. Although lfetch instructions do not deliver data to the core, they can only be issued on the two load ports because they may cause an L1D fill and that capability is only provided on the two load memory ports.
An unaligned data reference exception will be raised if an unaligned integer load crosses an 8-byte boundary. See Section 5.5, “Data Alignment” for more details about alignment support.
6.6.1 L1D Loads
When a core load request gets access to the L1D, it accesses the L1D tag and data arrays at the same time. Rotators at the output of the L1D data array provide support for both little- and big-endian accesses as well as some unaligned accesses without penalty. A virtual-to-physical mapping must be in the L1 DTLB and L1D tags for a load request to be an L1D hit. If the load misses, or is forced to miss the L1D, the request is passed on to the L2 when there are sufficient resources. The miss may result in an L1D fill depending on resources and cache hints. At minimum, all L1D misses eventually update the target register. Floating-point loads and ordered operations are forced to miss the L1D, but will not cause an L1D fill.
The L1D has resources for up to 8 outstanding L1D fill requests to the L2. If more than 8 misses are outstanding, the subsequent misses will be passed to the L2, but will not result in an L1D fill. If two or more accesses miss the L1D and are accessing the same L1D line, only one will request an L1D fill, but all will be passed to the L2 cache to be satisfied.
6.6.2 L1D Stores
All store requests are passed to the L2 cache since the L1D is a write-through cache. A store that misses the L1D has no effect on the L1D. However, if the store is a hit, the L1D must update the data array so that later loads can see the new data. To support this, the store data is read from the source register and staged down the L1D pipeline. Each store pipeline (M2/M3) has independent store buffers and control logic.
When the data is ready to update the L1D data array, it is allowed to do so provided there are no conflicts. Other operations writing the data array at the same time, such as an L1D fill, a load accessing the same 8-byte bank, or a store to the same bank, may prevent the needed update. In this case, the store data is moved to a backup buffer and waits for the array to become available. The store buffer can coalesce younger stores accessing the same L1D 8-byte wide data bank. If the backup buffer cannot update the data array and is needed by a new store that it cannot coalesce, the L1D pipeline will stall to create an opportunity for the backup buffer to drain.
Given this organization, it may be better for stores targeting the same bank to issue down the same L1D pipeline. For example, it would be better to have all accesses to bank 0 issue down M2 and all accesses to bank 1 issue down M3. Thus, when it comes time to update the array, M2 and M3 will not conflict and will be allowed to update without delay.
6.6.3 L1D Load and Store Considerations
Some memory requests may affect each other even when separated in time. This section covers some possible load/load, load/store, store/load, and store/store interactions for both the L1D hit and miss cases. Each discussion has a summary and a suggested solution.
6.6.3.1 Load/Load Conflicts
Load requests that hit in the L1D have no conflicts with each other because the L1D is true dual ported. However, a load request that misses the L1D may have conflicts at the L2 due to bank conflicts. If low latency is needed, special care should be taken to avoid loads in the same issue group that access the same L2 bank, i.e., A[7:4] should be unique for L2-bound accesses.
A less obvious load/load conflict can occur when a load is waiting to issue to the L1D, but is preempted by an older load returning from the L2/L3 or system bus. Here, the older load is given priority and the younger load must wait. These events are difficult to predict and hence difficult to schedule around. However, the L2 cache will only take the M1 port if there is only one integer load to return in a cycle. Thus, a conflict can be avoided by not using the M1 port for loads. This should not be done if it adds to the critical path.
This same conflict may exist between loads and special requests that use the L2 data paths to get information to the core. These are the probe, thash, ttag, tpa, and tak instructions.
6.6.3.2 Load/Store Conflicts
A load and store conflict has very different implications depending on which occurs first, the load or the store. Although issue groups are inherently parallel, loads and stores are ordered according to their position in the issue group.
When a load precedes a store and the load is a hit, there are no conflicts. However, there are significant implications when the load precedes the store and they are both L1D misses. In this case, the load will miss the L1D and likely request an L1D fill. The store, if it is seen by the L1D before the fill associated with the load, will be an L1D miss. As such, the store will invalidate the associated L1D fill buffer entry and stop the L1D fill from occurring. This is necessary because there is no opportunity for the store to update the incoming data before the L1D fill. The Itanium 2 processor must ensure that a later load sees an earlier store, so the fill is cancelled and the merge of the store with the cache line is taken care of by the L2. If the fill occurs before the store, then the fill completes and a normal store update of the L1D is done. These statements are true if the load and store share A[49:6] (a full L1D cache line).
One method to avoid this issue is to place a use of the load result before a conflicting store. This ensures that the data is filled into the L1D. Once the L1D is filled, the store updates the L1D and proceeds on to the L2 cache. This suggestion may not be appropriate for single load accesses or when the L1D line is not accessed again after a conflicting store.
6.6.3.3 Store/Load Conflicts
When a store precedes a load, the store data must be seen by the load. In the case where the requests are L1D misses, the L2 ensures this occurs. When the operations are L1D hits, the response to the load depends on the common address bits and how many cycles separate the store and load.
Table 6-5 shows the different store/load penalties. The penalty may depend on whether the load accesses the same data as the store, a subset of the store's data, or is completely independent of the store.
The 5- and 17-cycle penalties are due both to the load being forced to miss the L1D and to the load and store facing L2 conflict conditions. The 3- and 1-cycle penalties are due to the L1D recirculating the load request until the distance between the load and store exceeds 3 cycles. This allows time for the L1D to update the data array with the store data and allows the load to proceed as if there were no store.
To avoid store/load conflicts, the store and load must be separated by more than 3 cycles. If more than 3 cycles of separation is difficult to achieve, then ensure at least 1 cycle of separation.
6.6.3.4 Integer and Floating-Point Access Interactions
Floating-point loads and stores are passed directly to the L2 and bypass the L1D. If a floating-point store occurs to a line which is resident in the L1D cache, that L1D line will be invalidated. This can cause problems when integer and floating-point data share the same L1D cache line, which is possible when both integer and floating-point data exist in the stack or as part of the same data structure. Suppose that both an integer value and a floating-point value share the same 64-byte aligned block. An integer load will bring the line into the L1D. A later floating-point store will write to the L2 and invalidate the L1D line. Thus, a subsequent load of the integer value will miss the L1D.
This may be mitigated by bringing the line back into the L1D through an lfetch after issuing the store, or by using .nt1 hints on the integer accesses to keep them from filling the L1D and scheduling them for L2 latency.
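Another way to avoid the problem is at the data-layout level. The sketch below (field names and sizes are illustrative, not from the manual) keeps integer and floating-point members on separate 64-byte L1D lines, so a floating-point store cannot invalidate the line that holds the hot integer fields.

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

struct record {
    alignas(64) int64_t key;     /* integer data, serviced by the L1D      */
    int64_t count;
    char    pad[64 - 2 * sizeof(int64_t)];
    double  weight;              /* FP data starts on its own 64-byte line */
    double  bias;
};

_Static_assert(offsetof(struct record, weight) % 64 == 0,
               "FP fields must not share an L1D line with the integers");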
6.6.3.5 Store/Store Conflicts
The L1D is true dual ported for loads, but only pseudo-dual ported for stores; two stores cannot update the exact same location in the data array at the same time (see Section 6.6.2, “L1D Stores”). The store buffer design, with coalescing, prevents most store/store conflicts for L1D store hits. The exception is that two stores cannot update the same L1D bank at the same time. Should there be a conflict, the younger store will move into a store buffer and may later update the L1D data array without impacting the L1D pipeline. However, if the store buffer is unavailable, the L1D will stall until the store buffer is drained. The conflict does not exist if either of the two stores misses the L1D. Note that the two stores do not need to access the same L1D cache line to conflict.
6.6.4 L1D Misses
When an L1D request misses, it is passed on to the L2 once the L2 has sufficient resources available to hold the new request. The resources include at least an L2 OzQ entry and an L2 data entry. An L2 data entry must be available for a store to be accepted, but a load does not require an L2 data entry. If either the L2 OzQ or the L2 data structure is full, the operation and every other operation in the issue group will stall until these resources are made available.
The L2 control logic reserves some L2 OzQ entries to ensure that when a request is allowed to leave the L1D pipeline, there is an L2 OzQ entry available for it. The logic reserves four entries for every cycle of ambiguity, and there are three cycles of ambiguity, so 12 entries are held in reserve. The result is that in some instances, such as streaming, only about 20 of the 32 L2 OzQ entries are available. The Itanium 2 processor only stalls the L1D pipeline when the L2 is full and there is a request in the L1D that needs to go to the L2.
6.6.4.1 L1D Forced Misses
There are some load instructions that are forced to miss the L1D. A floating-point load will always miss the L1D. An ordered load (ld.acq) is allowed to hit in the L1D, but if it does miss the L1D, all subsequent loads, regardless of address or ordering constraints, will be forced to miss the L1D until the L2 indicates that the ordered load is visible.
6.6.4.2 L1D Forced Invalidates
Just as some operations are forced to miss the L1D, some operations will invalidate the L1D. A floating-point store will invalidate the L1D line if it is an L1D hit. Semaphores will also invalidate the L1D line if they hit in the L1D, to ensure that ordering is maintained.
6.7 Second-Level Unified Cache
The second-level unified cache (L2) is a unified, 256 KB, 8-way set-associative cache with a line size of 128 bytes. The L2 tags are true four ported and are accessed as part of the L1D pipeline. The L2 employs write-back and write-allocate policies. The integer access latency to the L2 is 5, 7, 9, or more cycles. Floating-point accesses take 6, 8, 10, or more cycles, which includes the floating-point conversion stage. An L1I miss that hits in the L2 and uses the L2 5-cycle bypass incurs a 7-cycle latency with a 6-cycle stall penalty.
The L2 cache is non-blocking and out of order. All memory operations that access the L2 (L1D misses and all stores) check the L2 tags and allocate into a 32-entry queuing structure called the L2 OzQ. All stores require one of the 24 L2 data entries to hold data to eventually update the L2 data array. The operations issue, up to four at a time, to access the L2 data array when conflicts are resolved and resources are available. L1I instruction misses are also sent to the L2, but are stored in the Instruction Prefetch FIFO (IPF). The L2 OzQ and IPF requests arbitrate for access to the data array and the L3/system bus.
The L2 data array has 16 banks which are each 16 bytes wide. This allows for multiple simultaneous accesses provided each access is to a different bank. Floating-point loads may issue from the L2 OzQ and access the L2 data array four at a time, since the L2 has four datapaths to the FP units and register file. The L2 does not have direct datapaths to the integer units and register file; integer loads deliver data via the L1D, which has two datapaths to the integer units and register file. Stores may issue from the L2 OzQ and access the L2 data array four at a time provided they are all to different banks.
The fill path width from the L2 to the L1D and the L1I is 32 bytes. The fill bandwidth from the L3 or memory to the L2 is 32 bytes per cycle. Four 32-byte quantities are accumulated in the L2 fill buffers, then the 128-byte cache line is written into the L2 in one cycle, updating both tag and data arrays. Note that an NRU algorithm is used for cache line replacement.
The L2 cache is not inclusive of the L1D and L1I caches. The L2 maintains state information for each line, tracking whether the data stored is modified (M), exclusive (E), shared (S), invalid (I), or pending update (P). This allows the L2 to use the MESIP protocol to maintain cache coherency and track victimized lines.
6.7.1 L1D Requests to L2
Every cycle, the L1D may issue up to four requests to the L2. These requests may be L1D load/store misses, L2 recirculates, L2 fills, instruction fetches, or snoops. The L2 tags are true four ported and are part of the L1D pipeline. This allows all four L1D load or store requests to access the L2 tags and determine whether they are an L2 hit or miss before being allocated into the L2 queuing structures. This feature allows L2 misses to be identified and quickly passed on to the system bus/L3. It also lowers the latency of L2 hit requests.
All L1D load, store, and semaphore requests are placed in the L2 OzQ. All L1I instruction misses, which are issued through the L1D to the L2, are placed in the IPF, where they arbitrate against the L2 OzQ for access to the L2 data arrays and the system bus/L3. Other requests coming from the L1D, such as snoops and fills, are transitory and are not queued.
Read (load) operations of the L2 data array occur three cycles before a write (store) of the L2 data array. This timing relationship becomes important when determining load/store data array conflicts.
The L2 provides 16 fill buffers to track L2 misses. Each L2 miss may result in a modified data eviction. The L2 provides 16 victim buffers to hold victim data; however, only 6 L2 victims may be outstanding at a time.
6.7.2 L2 OzQ
The non-blocking nature of the L2 is made possible by the L2 OzQ. This structure holds up to 32 operations that cannot be satisfied by the L1D. These include all stores, semaphores, uncacheable accesses, L1D load misses, and L1D unresolved conflict cases. The L2 cache design requires fewer than 32 L2 OzQ entries to hold the maximum number of L1D requests in conflict-free cases. However, there are many conflict cases within the L2. These cases may increase request lifetimes in the L2 OzQ. Thus, the additional entries allow the L1D pipeline to continue to service hits and make additional requests of the L2 while the L2 resolves the conflicts. The conflicts increase the L2 latency and make L2 latency prediction impossible.
6.7.2.1 L2 OzQ Allocation and Deallocation
The L2 OzQ control logic allocates up to four contiguous entries per cycle, starting from the last entry allocated the previous cycle. If there are too few entries available, the L1D pipeline is stalled to prohibit any additional operations from being passed to the L2. Requests are removed from the L2 OzQ when they complete at the L2 – that is, when a store updates the data array, when a load returns correct data to the core, or when an L2 miss request is accepted by the system bus/L3.
6.7.2.2 L2 OzQ Behavior
The L2 OzQ control logic enforces architectural ordering requirements; in instances where the architecture allows it, operations may complete out of order. An operation blocked due to conflict or issue restrictions does not block younger operations from completing. This allows for high resource utilization within the L2, resulting in a performance benefit. Additionally, the out-of-order
issue allows the L2 to quickly recover from circumstances where the L2 control logic was temporarily unable to retire requests.
The out-of-order and non-blocking nature of the L2 OzQ has the effect of removing any time relationships between operations. For example, if the code generator separates two operations by 4 cycles, they will appear 4 cycles apart in the L1D pipeline. However, conflicts may keep the first operation from issuing immediately and force it to wait in the L2 OzQ. This situation may result in the second operation actually completing in the L2 before the first operation, assuming no ordering restraints, despite their 4 cycles of separation in the code stream.
The latencies of L2 hit accesses are typically 5, 7, or 9+ cycles. These several latencies arise from the fact that some operations can issue and access the L2 data array at different times depending on the resources required and what preceded the request. The lower latencies come from allowing L1D requests to access the L2 data array before they are allocated in the L2 OzQ. These are the 5- and 7-cycle L2 OzQ bypasses. All latencies listed as 9+ are for operations that cannot take these bypasses and must allocate into the L2 OzQ and then later issue from the L2 OzQ to access the L2 data array.
6.7.2.3 5- and 7-Cycle Bypass
New L1D requests may take the 5-cycle bypass of the L2 OzQ and issue directly to the L2 data array provided there are no conflicts with older operations in the L2 OzQ. This bypass may be granted to the entire issue group provided there are no conflicts within the issue group. If a conflict occurs, the older request will take the bypass while the younger requests may not. Semaphores will never take a 5- or 7-cycle bypass and have a minimum latency of 9 cycles.
L2 bank conflicts will be discussed in Section 6.7.3, but they are used here in an example of how the L2 re-orders requests to give the lowest possible latency. Conflicts are typically due to multiple requests for the same L2 data array bank (a bank conflict). Consider the L1D request (issue) group below:
ldfs f20 = [0x004] (L2 Bank 0)
ldfs f21 = [0x008] (L2 Bank 0)
ldfs f22 = [0x00c] (L2 Bank 0)
ldfs f23 = [0x010] (L2 Bank 1)
The first load will take the 5-cycle bypass. The bank conflict between the first and second loads will prohibit the second and third loads from taking the 5-cycle bypass. The fourth load will also take the 5-cycle bypass since there is no bank conflict with the older requests or architectural ordering requirements.
When a request is kept from taking the 5-cycle bypass, the next choice is the 7-cycle bypass. The bank conflict between the first and second loads will keep the second and third loads from taking the 5- or 7-cycle bypass.
The situation becomes more complicated when the instructions above are followed by more instructions to be satisfied by the L2. Consider the issue group of loads from the previous example immediately followed by the following issue group of loads:
ldfs f25 = [0x014] (L2 Bank 1)
ldfs f26 = [0x018] (L2 Bank 1)
ldfs f27 = [0x01c] (L2 Bank 1)
ldfs f28 = [0x020] (L2 Bank 2)
In this example, the f20 and f25 loads take the 5-cycle bypass. The f21, f22, and f23 loads will try to take the 7-cycle bypass. However, before they can take the bypass, the new request group with f25, f26, f27, and f28 comes along. In this issue group, f25 and f28 take the 5-cycle bypass. Doing
so blocks the older issue group from taking the 7-cycle bypass. Those requests must then issue from the L2 OzQ, which increases their minimum latency from 6 to 12 cycles.
Every cycle, the L2 OzQ searches for requests to issue to the L2 data array (L2 hits), the system bus/L3 (L2 misses), or back to the L1D for another L2 tag lookup (recirculate). See Section 6.7.3 for more information on L2 cancel conditions and Section 6.7.4 for more information on L2 recirculate conditions.
The L2 can issue up to four L2 hit accesses per cycle provided there are no conflicts among them or among earlier issued operations. The conflicts for L2 hits include L2 data array banks, register ports, L1D fills, and ordering. In the case of the L1D fill, only one such load may issue. Also, since the L2 uses the L1D register return paths for loads, only two loads can issue per cycle.
The L2 can issue only one access to the system bus/L3 at a time. An L2 miss in the same L1D request group as an L2 hit should be on the M0 port to have the shortest L3 latency. If the miss is on another port, its latency will increase slightly.
The system bus/L3 control logic will then either accept or reject the request based on system bus/L3 resources and conflict cases. Once the request is accepted, it may be removed from the L2 OzQ. The L2 OzQ pipelines L2 miss requests; it does not wait for the system bus/L3 to accept a request before issuing another request.
6.7.3 L2 Cancels
The L2 cancels generally apply only to requests taking a 5- or 7-cycle L2 OzQ bypass. This is because in most cases the issue logic considers the conflict cases and holds off issue until the conflict is resolved. The best example of holding off issue from the L2 OzQ is bank conflicts. All the information needed to avoid all possible issue-time conflicts may not be available, and some L2 OzQ issued requests must be later cancelled and re-issued. When an operation taking a bypass gets cancelled, it will re-issue from the L2 OzQ since the bypasses are only available to L1D request groups. When an L2 OzQ request is issued and then later cancelled, its latency will increase by four cycles.
The cancel logic may also cancel or block issue in more instances than expected due to issue logic simplification or unavailable information. For example, requests that are recirculated will be included in cancel/block calculations for other instructions considered for issue, or the issue logic will try to issue up to four requests that need to recirculate even though it cannot recirculate more than one request.
A 5- or 7-cycle bypass is more likely to be canceled for P3 operations because they are the youngest in the issue group, and because of events external to the L2 such as system bus/L3 returns and snoop requests. P0 requests are the least likely to be canceled because these are the oldest instructions in the issue group.
There are many reasons to cancel or block L2 OzQ issue. The reasons fall into two categories: those that are predictably avoidable and those that are not.
6.7.3.1 Predictably Avoidable Cancel Conditions
L2 data array conflicts: The data array has 16-byte wide banks. Bits 7:4 of the address determine the bank. Any requests with the same bank, regardless of cache line, are candidates for a bank conflict. Any L1D request group with multiple loads targeting the same bank will see the younger requests cancelled or the L2 OzQ issue blocked. This also applies to multiple stores targeting the same bank.
Since L2 loads and stores access the L2 data array at different times, a load and store in the same request group cannot have bank conflicts; however, there is potential for load and store bank conflicts between entirely different L1D request groups. Store requests access the data array three cycles after a load would. This means that a store issued at time X may block or cancel a load that would issue at time X+3 if they both access the same L2 bank.
The following examples show how the conflict logic considers the L2 data array access time to determine bank conflicts. The following two examples do not have bank conflicts:
ld8 r20 = [0x008] ;;
ld8 r21 = [0x010]
and:
st8 [0x008] = r20
ld8 r21 = [0x010]
A bank conflict can occur, however, between a store and a load issued three cycles later when both requests target the same L2 bank, even if neither conflicts with any other request in its issue group.
Bank conflicts due to L1D fill requirements are slightly less predictable. These bank conflicts arise from the fact that an L1D fill requires 64 bytes of data and hence four banks at a time. Additionally, the data path to the L1D can only support one fill every two cycles. These are not predictable because not all L1D misses will request an L1D fill. Section 6.6.1 has more information on which requests can require an L1D fill.
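The bank arithmetic described in this subsection can be written as a small sketch (the helper names are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* The L2 data array has sixteen 16-byte banks, selected by address
 * bits 7:4. */
static unsigned l2_bank(uint64_t addr)
{
    return (unsigned)((addr >> 4) & 0xF);
}

/* Loads in the same issue group conflict when they select the same bank,
 * regardless of cache line; a store issued at cycle X can also conflict
 * with a load issued at cycle X+3 to the same bank, since the store's
 * data-array access occurs three cycles after a load's would. */
static bool l2_bank_conflict(uint64_t a, uint64_t b)
{
    return l2_bank(a) == l2_bank(b);
}

/* For the examples above: 0x008 and 0x010 map to banks 0 and 1 (no
 * conflict), while 0x008 and 0x108 both map to bank 0 and can conflict. */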
6.7.3.2 Unpredictably Avoidable Cancel Conditions
There are some bank conflicts that are generally unpredictable. These events are tightly coupled with the unpredictable events of system bus and L3 data returns. The unpredictable cancel conditions may result in unexplained L2 latency increases.
6.7.4 L2 Recirculate
The L2 OzQ will need to recirculate requests whenever the request does not have a clear indication of hit or miss, or the required resources to complete an L2 miss are unavailable.
The most predictable reason for a request to recirculate is that the request misses a line that is already being serviced by the system bus/L3, but has not yet returned to the L2. The L2 only retires L2 hits and primary L2 misses to an L2 line. It does not retire multiple L2 miss requests; additional misses remain in the L2 OzQ and recirculate until the tag lookup returns a hit. The request then
issues from the L2 OzQ and returns data (for a load) or updates the array (for a store) as a normal L2 hit request.
6.7.4.1 lfetch and Recirculation
There is one significant exception to this secondary L2 miss recirculate condition. lfetch instructions have been optimized to avoid allocation in the L2 OzQ if they meet the following criteria:
• They are secondary accesses to an L2 miss.
• They will not fill the L1D.
Since these lfetch instructions are not allocated into the L2 OzQ, they cannot recirculate. The only way to guarantee that an lfetch instruction will not fill the L1D is to use one of the locality hints .nt1, .nt2, or .nta.
6.7.5 Memory Ordering
Itanium architecture memory ordering requires that a request with acquire semantics must reach visibility before any younger operation. A request with release semantics must not reach visibility before older operations.
The L2 issue logic enforces the architectural release ordering semantics by blocking issue of a release request until it is the oldest operation in the L2 OzQ. The issue logic may issue a release operation that is not the oldest, but then cancel and re-issue it.
If the ordered operation is not an L2 hit, the L2 control logic can speculatively make a system bus/L3 request for the line or transform the request into a prefetch. If the other L2 OzQ entries preceding the ordered request do not conflict, the prefetch will have the benefit of starting the access early without violating ordering requirements. If there are conflicts, the request is re-issued to ensure proper ordering.
Since the L2 is responsible for maintaining architectural ordering, all loads that are in the shadow of a ld.acq must be seen by the L2. Thus, they are forced to miss the L1D until the ld.acq has achieved visibility.
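The following sketch shows the acquire/release pattern these rules enforce, written with C11 atomics as the high-level analogue of ld.acq and st.rel. The assumption that an Itanium compiler emits exactly those instructions for these orderings is compiler dependent; the function and variable names are illustrative.

#include <stdatomic.h>

static _Atomic int flag;
static int payload;

void producer(int value)
{
    payload = value;                 /* older store must be visible first */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void)
{
    /* acquire: no younger load may become visible before this one */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    return payload;
}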
6.7.6 L2 Instruction Prefetch FIFO
The Instruction Prefetch FIFO (IPF) is an 8-entry queue that holds L1I requests. Up to seven of these eight entries may contain prefetch requests. One slot is always reserved for a demand request. Just like the L2 OzQ, the IPF can have requests that are L2 hits, L2 misses, bank conflicts, or recirculates. The IPF faces the same issue restrictions for each of these requests as the L2 OzQ does. However, unlike L2 OzQ hit requests, only one IPF L2 hit may be issued to the L2 data array per cycle. This is due to the fact that all IPF requests return data to the L1I cache, and the data path back to the L1I can only support one fill per cycle.
Since the L2 supports both instruction and data accesses, the L2 issue control logic chooses among instruction and data requests according to Table 6-6.
Some memory requests may affect each other even when separated in time. This section covers some possible load/load, load/store, store/load, and store/store interactions for the L2 cache. Since the L2 OzQ allows out-of-order issue, the L2 OzQ will re-order requests to fully utilize the L2 data arrays in satisfying requests. As a result, any static timing placed in the code stream may not have the desired effect on L2 behavior; however, there are still actions the code generator can take to increase performance.
6.7.7.1 Effective Releases
The L2 cache deals with load/store, store/load, and store/store conflicts by ensuring that the issue order in the L2 OzQ is the same as the program order of the operations. The L2 control logic leverages the architectural ordering mechanisms that already exist to address the possible conflicts. When the L2 OzQ accepts a new request, it checks physical address bits 49:2 against all older incomplete requests in the L2 OzQ. If a match exists and a conflict results, the control logic applies architectural release semantics to the incoming request. This is called an effective release. The effective release association remains until the operation completes, and it causes the L2 issue and conflict logic to cancel the request until it is the oldest request in the L2 OzQ.
Table 6-7 summarizes the addresses and operation types that can experience an effective release.
Table 6-7. Effective Release Operations

  Incoming Request   Matching Request   Effective Release
  Load               Load               No
  Load               Store              Yes
  Store              Load               Yes
  Store              Store              Yes
6.8 System Bus/L3 Interactions
All requests that the L2 cannot satisfy reach the system bus/L3 as a Read Line (RL) or Read For Ownership (RFO) request. The RL request is used for code and common load operations. The L2 may receive the line in the M, E, or S state for RL requests depending on the L3 state or the snoop response provided on the system bus. The RFO request indicates the L2 intends to modify the line to store data. Stores, as well as lfetch.excl and ld.bias instructions, result in Read For Ownership requests. These requests will always exist in the M state in the L2. Table 6-8 summarizes this behavior.
Table 6-8. System Bus/L3 Requests and Final L2 State

  System Bus/L3 Request   L2 Request         Final L2 State by L3 State (S / E / M)   Final L2 State by System Bus Snoop (HIT / No Hit)
  Read Line               Code Read          S / E / M                                S / S
  Read Line               Data Read          S / E / M                                S / E
  Read For Ownership      store miss         M / M / M                                M / M
  Read For Ownership      lfetch.excl miss   M / M / M                                M / M
  Read For Ownership      ld.bias miss       M / M / M                                M / M
The L2 may make partial line requests of the system bus, but this is only for UC attribute accesses and is not part of this discussion because they are neither coherent nor a concern for performance.
The L2 will make one RL or RFO request to the system bus/L3 per cycle. Each of these requests will have a dirty victim associated with it when the L2 way chosen for victimization is in the M state. The L2 issues a request to the system bus/L3 and then later confirms the request. This protocol exists to allow issuing requests to the system bus/L3 that are later cancelled and/or recirculated. The L2 may make a request, but will not confirm a request if there are insufficient resources available. The L2 will not issue two requests to the same L2 line. A request that is not confirmed will wait at least four cycles before it is issued again.
The system bus/L3 will decide whether the request is accepted and inform the L2, based on address conflicts and the resources available to support the read request and the associated dirty victim. The L2 will then deallocate the request from the L2 OzQ if the system bus/L3 accepts the request. An L2 request may be rejected (see Section 6.10). A rejected request will wait at least four cycles before it is issued again.
When the system bus/L3 is ready to deliver data to the L2, this is indicated to the L2 and the L2 prepares to receive the data. The data returns come 32 bytes (a chunk) at a time from the system bus/L3, with the critical chunk first. L3 returns have higher priority than system bus data returns and come consecutively. In many instances, an L2 miss may also cause an L1D fill. Since the L1D line width is only 64 bytes, there is sufficient data to cause an L1D fill when only two chunks have been received from the system bus or L3. These requests must access the L1D pipeline and may block core requests from entering the L1D pipeline during that cycle. If there are two L1D fills for an L2 miss, another fill will occur when the last two chunks have been received by the L2.
6.9 Third-Level Unified Cache
The third-level unified cache (L3) is a unified, 9 MB, 18-way set associative cache with a 128-byte line size. Some versions of the Itanium 2 processor may have L3 cache sizes of 6, 4, 3, or 1.5 MB. Latencies and set-associativity may vary between the different cache sizes and models. See Chapter 2 for exact latency and set-associativity numbers. These caches are alike in all other respects.
All L3 accesses are for the entire 128-byte line – no partial line accesses are supported. The access latency is 12, 14, or more cycles. This latency depends on how quickly the L2 issues the request and the activity of the L3 at the time of the request.
On the Itanium 2 processor, L3 accesses are fully pipelined and thus have a much higher effective bandwidth than the L3 on the Itanium processor. The L3 tag array is single ported and is pipelined to allow a new tag access every cycle. The L3 data array is also single ported, but requires up to four cycles to transfer a full line of data to the L2 cache, or to the system bus in the case of an L3 dirty victim.
The L3 is non-blocking and has an 8-entry queue to support multiple outstanding requests. This queue orders requests and prioritizes them among tag read/write and data read/write to achieve the highest performance given the operations required.
6.10 System Bus
The system bus of the Itanium 2 processor with 3 MB and 6 MB L3 cache operates at 200 MHz and is comprised of multiple sub-busses for various functions, such as address/request, snoop, response, data, and defer. The data bus is 128 bits wide and operates source synchronously, achieving a peak bandwidth of 400 million data transfers, or 6.4 GB, per second. The Itanium 2 processor with 9 MB L3 cache has multiple system bus speed options: 200 MHz, 266 MHz, and 333 MHz. The operating frequency is the only change in the system bus. These faster speeds allow peak bandwidths of 8.5 GB and 10.6 GB per second.
The system bus control logic consists of an In Order Queue (IOQ) and an Out of Order Queue (OOQ), which track all transactions pending completion on the system bus. The IOQ tracks the in-order phases of a request and is identical across all processors. The OOQ holds only the processor's own requests that have been deferred. The IOQ can hold 8 entries while the OOQ can hold 18 requests, which allows a maximum of 19 transactions to be outstanding on the system bus from a single Itanium 2 processor.
L2 requests that have not been completed (i.e., have not accessed the L3 nor completed a data phase on the system bus) are maintained in structures of the following sizes:
• 16 outstanding read requests from the L2.
• 6 outstanding dirty writeback requests from the L2.
• 6 outstanding L3 writebacks (i.e., replacement of a dirty line) to be serviced by main memory.
• A combination of 16 outstanding L3 writebacks or L3 castouts (i.e., replacement of a clean line; depending on the coherence mechanism, this might incur memory traffic) to be serviced by main memory.
• Two 128-byte coalescing buffers to support WC stores.
Read transactions (this includes store instructions that miss the L2) are placed in one of the 16 bus request queues (BRQs). Each of these may then be sent to the L3 to see if the L3 can satisfy the request. In the case where the request is also an L3 miss, the request is scheduled to generate a system bus request (either Bus Read Line or Bus Read Invalidate Line for stores). When the system bus responds with the data, the line is written to the L2 and L3 based on its temporal locality hints and type of access.
Branch Instructions and Branch Prediction 7
The Itanium 2 processor employs both static and dynamic methods for branch prediction. For static branch prediction, the Itanium 2 processor uses the hint completers from the branch instructions. For dynamic prediction, the Itanium 2 processor uses several hardware structures.
This chapter describes how branch prediction affects software execution. The front-end instruction fetching is decoupled from the back-end instruction execution through an 8-bundle instruction buffer. For more detail regarding the instruction buffer, see Appendix A, “Itanium® 2 Processor Pipeline.” Throughout this chapter, the term ‘bubble’ refers to cycles for which the front-end cannot deliver useful data; the penalty may never translate to a loss in performance if there is another event blocking the back-end from retiring instructions. In the case where the back-end is waiting for the front-end, the penalty is a stall.
Table 7-1, “Branch Prediction Latencie s” summarizes bran ch predictio n la tencies for the It anium 2
processor. Notice that in the case of a correctly predicted IP-relative branch, there is no front-end
1. The + refers to the fact that some branches may cause the front-end to stall. This is only for incorrectly predicted short (up to 16 bundles) forward branches. The additional latency will be at most 8 cycles and may be less depending on how many branches were seen by the front-end after the mispredicted branch was seen by the front-end.
The branch prediction microarchitecture in the Itanium 2 processor is significantly different from that of the Itanium processor. Branch prediction is closely tied to the L1I cache, which allows for the zero-bubble resteer.
Single-cycle branches experience a stall once every two cycles (i.e., a one-cycle loop takes four cycles to make three iterations). Single-cycle loops should be avoided. It is also possible that a stall may occur if several branches are encountered in succession. For example, if the front-end sees a branch every cycle for three cycles, one cycle of stall may occur.
7.1 Branch Prediction Hints
Information about branch behavior can be provided to the processor to improve branch prediction. This information can be encoded through branch hints as part of a branch instruction. Branch hints do not affect the functional behavior of the program and may be ignored by the processor.
Only hints specified within a branch instruction are used for branch prediction. Hints on the brp or mov br instructions are ignored by the branch predictor.
For the Itanium 2 processor, branch hints are .sptk, .spnt, .dptk, or .dpnt (sp=static prediction, dp=dynamic prediction, tk=taken, nt=not taken). The terms "static" and "dynamic" hints refer to the code generator's confidence in the branch behavior. For example, .sptk means the code generator is very sure that the branch will be taken, whereas .dptk means that the code generator thinks the branch will be taken, but it is not as confident.
The impact of these branch hints depends on other branches in the two-bundle window and other branch information maintained in the processor. The consequence is that a branch with a .dpnt hint may be predicted taken the first time it is seen. The processor will quickly recover from this and correctly predict this branch in the future.
The use of .dpxx is recommended as the default, unless the loop is a ctop or cloop, in which case .spxx is recommended.
The .spxx hint is also important for very short (1 or 2 cycle) loops. With static prediction hints, these loops will not wait for the machine to generate a new hint prediction, but will instead use the taken or not-taken direction from the static hint. If dynamic hints are used in such short loops, the processor may stall on each iteration in which the branch prediction requires updating.
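At the source level, the code generator's confidence is usually expressed through compiler-specific mechanisms rather than written by hand; for example, GCC's __builtin_expect can steer static prediction. Whether a given compiler turns such hints into .sptk/.spnt versus .dptk/.dpnt completers is compiler-dependent, so the following is only an illustrative sketch:

    /* likely/unlikely macros built on GCC's __builtin_expect; the mapping
       to IA-64 branch hint completers is an assumption, not guaranteed. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_positive(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (unlikely(a[i] < 0))   /* rare case: hint the branch as not taken */
                continue;
            s += a[i];
        }
        return s;
    }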
The branch prediction hints have an anomalous behavior when used in .bbb bundles. Normally, the branch hints of each branch instruction affect only that specific branch. However, a .bbb bundle will always use the branch hints provided on the slot 0 branch for the slot 1 and slot 2 branches. There are a few ways to avoid this. The first is to break up the .bbb bundle into two other bundles. Unfortunately, this may not be good for code density, and other solutions, such as using a .dpxx hint or a .spxx hint with a .clr completer on the slot 0 branch, should be considered.
7.2 Indirect Branches
The predicted targets of indirect branches, other than returns, are extracted from the source branch register of the indirect branch rather than from a hardware table. This has several implications. There is always a penalty for indirect branches on the Itanium 2 processor. A two-cycle front-end bubble is seen for a correctly predicted indirect branch. An incorrect taken/not-taken or address prediction costs 6 or more pipeline stalls. The address prediction is based on the contents of the branch register referenced by the branch as seen by the front-end. An in-flight update to the branch register will not be seen by the front-end, and the predicted target may be wrong. Correct target prediction requires that the branch register write precede the indirect branch by several cycles. This distance varies since the front and back ends of the pipeline are decoupled. A code generator can minimize the impact of this in the following ways (a sketch follows the list):
• Separate the write and indirect branch by at least 6 front-end L1I cache accesses.
• Add an additional write to the branch register above the true branch register write to hint the target.
• Use different branch registers for each indirect branch instance to minimize conflicts with other indirect branches.
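A C-level sketch of the first suggestion: compute the indirect target early and give the compiler independent work to schedule between the branch-register write and the branch itself. The dispatch table and the work loop here are purely hypothetical:

    typedef void (*handler_t)(long *);

    static void op_add(long *d) { *d += 1; }
    static void op_neg(long *d) { *d = -*d; }

    static handler_t table[2] = { op_add, op_neg };   /* hypothetical dispatch table */

    void dispatch(int op, long *data, long n)
    {
        handler_t h = table[op & 1];   /* target known early: the branch register write can be hoisted */

        for (long i = 0; i < n; i++)   /* independent work between the write and the indirect branch */
            data[i] += i;

        h(&data[0]);                   /* indirect branch with a stable, early-written target */
    }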
7.3 Perfect Loop Prediction
In many cases, the perfect loop predictor (PLP) can correctly predict the back-edge branch of a counted loop (i.e., cloop or ctop type branches), including the fall-through instance as well as the loop-back iterations. Unlike the Itanium processor, the Itanium 2 processor does not need brp to accomplish this.
The Itanium 2 processor uses the PLP only for the final iteration of the loop. The initial loop predictions are decided on dynamic or static information based on the hints used.
If the last branch of a loop is predicted correctly, there might still be a one- or two-cycle bubble in order to get this correct prediction. The smaller the number of loop iterations, the more likely it is that there will be a two-bubble resteer. Conversely, the larger the loop iteration count, the more likely it is that there will be a zero-bubble resteer. The PLP uses the current values of ar.lc and ar.ec for prediction, so any writers to these registers should be well ahead of the counted loop branch to assure correct prediction.
In some instances, the Itanium processor required that ar.ec be set to 1 for correct prediction. The Itanium 2 processor does not have this same requirement and actually expects ar.ec = 0 when there is no epilog.
8 Instruction Prefetching
The Itanium 2 processor supports several forms of instruction prefetching. Instruction prefetch is defined to be the act of moving instruction cache lines from higher levels of cache or memory into L1I. Streaming prefetching initiates hardware prefetching of the next cache lines, either sequential or at the target of predicted taken branches. Hint prefetching allows software to specify a particular line, or lines, to be prefetched. On the Itanium 2 processor, it is expected that instruction prefetching will be an effective way to reduce instruction cache misses since the code generator has a wide degree of control over the prefetch agent and the Itanium 2 processor cache design specifically considered prefetching.
8.1 Streaming Prefetching
Streaming prefetching is initiated by using the .many completer on branch instructions. If the front-end processes a branch with a .many completer, the prefetch engine will continuously issue prefetch requests, at one request per cycle, for subsequent instruction lines, into the prefetch pipeline. The prefetch request is checked against the L1I and the L1 ITLB. If it hits in the L1 ITLB and misses in the L1I, the request is sent to the L2; otherwise it is discarded. The lines are prefetched starting at the branch target plus 64 or 128 bytes (depending on the alignment of the branch target). Streaming prefetching continues until one of the following stop conditions occurs:
• A predicted-taken branch is encountered by the front-end
• A branch misprediction occurs
• A brp instruction without the .imp completer is encountered by the front-end
The L1I cache design allows both fill and lookups to occur at the same time. Thus, the lifetime of a request in the ISB is typically very small. This allows the prefetch engine to prefetch instructions with little chance that the line will get overwritten before it is used. If the branch is predicted taken by the front-end, prefetching will be initiated in the front-end. If the branch is incorrectly predicted not-taken by the front-end, prefetching will be initiated by the back-end when the prediction is corrected. However, if the opposite case occurs, i.e., the branch is incorrectly predicted taken in the front-end, any streaming prefetch in progress is terminated and it will NOT be restarted when the back-end corrects the prediction. Finally, if a branch with a .many completer is incorrectly predicted taken by the front-end, the new stream it starts will be terminated when the prediction is corrected by the back-end.
A .many prefetch stream may be halted by an L1I TLB miss. The event does not cancel the prefetch, but suspends the prefetch until the L1I TLB fill completes, at which point the prefetch continues until stopped for one of the reasons described above.
1. A brp instruction suggests that an associated br.many is around the corner. The assumption is that the prefetch engine has already prefetched past the br.many, and additional prefetches would be useless. The reason that a brp.imp does not terminate prefetching is related to Itanium processor code. In the Itanium processor, brp.imp instructions are used to predict branches and might not have any association with a br.many.
Table 8-1. Summary of Streaming Prefetch Actions
  Actually Taken / Predicted Taken: Any current streaming prefetch is stopped in the front-end. If the branch has a .many completer, a new stream is started by the front-end.
  Actually Taken / Predicted Not-Taken: Any current streaming prefetch is stopped in the back-end. If the branch has a .many completer, a new stream is started in the back-end.
  Actually Not-Taken / Predicted Taken: Any current streaming prefetch is stopped in the front-end. It is NOT restarted when the misprediction is detected. If the branch has a .many completer, a new stream is started in the front-end. It is terminated when the misprediction is detected by the back-end.
  Actually Not-Taken / Predicted Not-Taken: No effect on any current streaming prefetch. A new stream is NOT started.
8.2 Hint Prefetching
Hint prefetching is initiated with the brp or mov br instructions. Unlike the Itanium processor, the Itanium 2 processor prefetch initiation does not affect branch prediction state. However, it has this same restriction as the Itanium processor: brp instructions must be in the last instruction slot (slot 2) of a bundle in order to be processed; otherwise, they are ignored. brp instructions have no associated branch prediction effects. Table 8-2 illustrates the prefetching mechanisms associated with the branch hints.
Table 8-2. Prefetch Mechanisms
  brp.(sptk,loop,dptk).few: Normal prefetch of 1 cache line generated.
  brp.(sptk,loop,dptk).many: Prefetches 2 cache lines from target.
  brp.(sptk,loop,dptk).imp.few: Flushes prefetch virtual address buffer (PVAB) and prefetches 1 cache line.
  brp.(sptk,loop,dptk).imp.many: Flushes prefetch virtual address buffer (PVAB) and prefetches 2 cache lines.
  move_to_br.(sptk,dptk).few: All other fields ignored; prefetches 1 cache line.
  .many hint on a branch: Streaming prefetches triggered off predicted-taken IP-relative branches.
A .few completer will prefetch one-half or one L2 line, depending on the alignment of the associated branch target, and a .many completer will prefetch 1.5 or 2 L2 lines, depending on the alignment of the associated branch target. Hint prefetches are sent to the 8-entry prefetch virtual address buffer (PVAB). Up to 2 hint prefetches can be sent to the PVAB in each cycle.
In a given cycle, if the prefetch pipeline is not stalled and if a br.many is not active, a prefetch request is removed from the PVAB. The prefetch request is then checked against the L1I and the L1 ITLB. If it hits in the L1 ITLB and misses in the L1I, it is sent to L2; otherwise it is discarded. The intent is to use hint prefetches to prefetch the first "chunk" of instructions at the target of a branch and to use streaming prefetching to prefetch the subsequent instructions. In order to fully hide the latency of an L2 hit, a hint prefetch should precede a branch by 9 fetch cycles. If a br.many is preceded by a brp.many, there will be some overlap between the prefetches generated by the two instructions. While this overlap is wasteful, there is benefit in having more lines prefetched earlier (as opposed to presaging the br.many with a brp.few). brp.few prefetches might be useful in conjunction with streaming prefetches as described in Section 8.1.
8.3 Prefetch Flush Hints
Certain forms of brp instruction have the side effect of flushing the contents of the PVAB and possibly the prefetch pipeline. These are provided to give the compiler some control over the state of prefetching.
• brp.few.imp - will remove all brp.few prefetches from the PVAB (but not any already in the prefetch pipeline).
• brp.exit.imp - will remove all prefetches from the PVAB and those in the prefetch pipeline, and additionally will stop the streaming prefetch engine (and therefore will stop br.many, brp.many and brp.exit prefetching).
• brp.* - (brp without the .imp completer) will cancel any streaming prefetches initiated by a br.many instruction. The intent is to allow the compiler to stop a br.many from prefetching too far.
The flushing side effect is in addition to the normal behavior of these prefetch instructions. Note that flushing a prefetch once it reaches the pipeline may not be effective (i.e., the prefetch may still be issued to the L2 and beyond).
8.4 The brl Instruction
The Itanium 2 processor implements the brl instruction, which provides 64-bit relative branches. These long relative branch instructions have less cost than on the Itanium processor, but they are higher cost than the short relative branch br instructions. Specifically, the branch prediction mechanisms in the Itanium 2 processor do not calculate the predicted target correctly for brl instructions unless the target is set when the L1I cache line is allocated. Thus, if a brl prediction target is aliased with another branch in the bundle pair, the target will be incorrect and the branch will see a full branch mispredict penalty, and it will not be fixed.
Despite this cost, the brl instruction is much more efficient than multiple short jumps. However, the linker should place brl instructions only where they are specifically needed.
9 Optimizing for the Itanium® 2 Processor
This chapter is a summary of conclusions that can be drawn from important points noted in earlier chapters. These guidelines are not applicable in all situations, and profiling should be used to guide the use of optimizations.
9.1 Hints for Scheduling
Observing the following heuristics whenever possible will minimize the chances of implicit stops or unexpected dispersal-related stalls:
• Schedule the most restricted instructions early in the bundle. This lessens the chance that a generic subtype instruction will consume a port which is needed by a later, more restricted instruction.
• In some cases, placing A-type instructions in I slots rather than M slots might achieve denser bundling. If this is done, place any I-type instructions (which must go in I slots) earlier in the issue group when possible. This way, the later instructions in I slots can be issued to available M ports. Since not all processors support this (such as the Itanium processor), it is preferable to place A-type instructions in M slots.
• Most floating-point load types can be issued to any of the four memory ports, not just M0 and M1. Control speculation-related (advanced and check) and pair floating-point loads are the exceptions, which can only be issued to ports M0 and M1. When scheduling a mix of FP loads, advanced FP loads, integer loads, and lfetch instructions, ensure that regular FP loads are scheduled late in the issue group so that, if necessary, they can be issued to the M2 and M3 ports. This frees the M0 and M1 ports needed by lfetch instructions or more restrictive load types.
• Avoid using nop.f. It risks unintended stalls due to outstanding long-latency instructions. For example, a write to FPSR is a multiple-cycle operation. Any floating-point operation, including a nop.f, will stall until the write is completed.
• On the Itanium processor, MFI was a commonly used template to facilitate dual issue. There are many other dual-issue template pairs on the Itanium 2 processor, so using this template should no longer be necessary.
9.2 Optimal Use of lfetch
The lfetch instruction is key to achieving good performance on the Itanium 2 processor in many memory-related situations. lfetch often allows the L1D to be a hit for integer data. This has the benefit of allowing the L1D cache to filter requests to the L2. Many L2 conflicts can be avoided by ensuring integer loads hit in the L1D and thus are never seen by the L2. The fewer requests the L2 sees, the fewer requests conflict.
lfetch instructions require careful use. Carelessly placing lfetch instructions may lower performance. Refer to Chapter 6, "Memory Subsystem" for details regarding the Itanium 2 processor cache structures. The following guidelines were developed with regard to the memory subsystem (a C-level sketch follows the list):
• The maximum number of outstanding lfetch operations to L3 or memory, the sum of both data and instruction requests, may not exceed 16.
• lfetch instructions are restricted to memory ports M0 and M1 only, while FP loads (not ldfpd or ldfps) can be issued on any of the four memory ports. Therefore, when mixing lfetch instructions with FP loads, lfetch instructions should be scheduled early in issue groups. For example, if two FP loads and an lfetch are to be scheduled in the same cycle, the lfetch should be scheduled in the first bundle so that it will be issued on one of the first two memory ports. If the two FP loads are scheduled first, the hardware will insert an implicit stop before issuing the lfetch instruction.
• The Itanium 2 lfetch.excl instruction will bring data into the L2 cache in the M state. The .excl completer should only be used when the data brought in by the lfetch will shortly be modified by store instructions.
• The Itanium 2 lfetch instructions will not bring the data into the cache if a DTLB entry providing translation and protection information is not available. To ensure the lfetch instruction completes a HPW walk and possibly generates a TLB translation or protection fault, the .fault completer should be used. Since there may be high cost associated with these events, the .fault completer should not be used for speculative addresses.
• lfetch instructions may have effects in the cache hierarchy that make their use costly. These effects include:
— Acquiring L2 resources such as the L2 OzQ.
— Arbitration for access to the L2 data arrays, and thus becoming a candidate for an L2 bank conflict.
— Recirculation of the lfetch in the case of a secondary L2 miss.
The effects of the L2 recirculate for a secondary L2 miss can be mitigated by placing .nt completers on the lfetch. The .nt hints keep the lfetch from causing an L1D fill and allow the lfetch to be removed from the L2 OzQ.
In the case where an lfetch hits the L2, it takes L2 OzQ resources, causes other requests to cancel, and may get canceled itself, as if it actually reads the L2 data array regardless of the .nt hint or actual need to fill the L1D.
Applying .nt hints to lfetch requests also reduces the L2 banks required to satisfy the lfetch to only 1 bank. For temporal lfetch instructions, 4 banks may be required, and such lfetch requests may have a significantly increased probability of causing L2 bank conflicts.
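As a rough C-level illustration of the guidelines above, the sketch below prefetches integer data a fixed distance ahead of its use, using GCC's generic __builtin_prefetch. How (or whether) a particular compiler lowers this to lfetch, and with which completers, is compiler-dependent, and the prefetch distance is purely a tuning assumption:

    /* Sketch of software prefetching for an integer array walk. */
    long sum(const long *a, long n)
    {
        enum { AHEAD = 16 };              /* prefetch distance in elements (tuning assumption) */
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }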
9.3 Data Streaming
There are several methods to handle long, high-bandwidth data streams. This section lists several possible solutions and discusses some of the benefits and costs of each.
9.3.1 Floating-Point Data Streams
Floating-point data resides in the L2 cache. Here, the lfetch.fault.nt1 instruction should be issued only once per L2 cache line for the source, and the lfetch.fault.excl.nt1 instruction should be issued only once per L2 cache line for the destination. The .fault completer is used to ensure that the data enters into the cache hierarchy, even if it results in an L2 DTLB miss or VHPT miss. The .nt1 completer ensures that the floating-point data will not displace data residing in the L1D. The .nt1 completer also allows an lfetch instruction that is a secondary L2 miss to avoid allocation in the L2 OzQ. This is important for situations where the design of the data streaming code cannot avoid additional requests to an L2 line without performance loss. The .excl completer for the destination stream will ensure the data is ready to be modified.
When data is accessed as an L2 hit, care should be taken to avoid L2 bank conflicts among request groups. This is necessary to ensure L2 5- and 7-cycle bypasses are available. Latency is not generally a concern for floating-point code; however, in streaming situations, the lifetime of an operation in the L2 OzQ coupled with the size of the OzQ may cause the L2 control logic to consider the OzQ full, stalling the core. A lower latency means a shorter lifetime in the OzQ, and effectively more OzQ entries are available.
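A minimal C sketch of a floating-point streaming copy with one prefetch per 128-byte L2 line, using GCC's generic __builtin_prefetch as a stand-in for lfetch.fault.nt1 (source) and lfetch.fault.excl.nt1 (destination). The locality argument and the use of rw=1 to suggest an exclusive prefetch are assumptions; the actual lfetch forms emitted are compiler-dependent:

    void stream_copy(double *dst, const double *src, long n)
    {
        enum { DBL_PER_L2_LINE = 16 };     /* 128-byte L2 line / 8-byte doubles */
        for (long i = 0; i < n; i++) {
            if ((i % DBL_PER_L2_LINE) == 0 && i + DBL_PER_L2_LINE < n) {
                __builtin_prefetch(&src[i + DBL_PER_L2_LINE], 0, 1); /* source line, for reading */
                __builtin_prefetch(&dst[i + DBL_PER_L2_LINE], 1, 1); /* destination line, for writing */
            }
            dst[i] = src[i];
        }
    }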
9.3.2 Integer Data Streams
Integer data streams are more complicated than floating-point streams because, in some instances, getting the data into the L1D will be important for performance. Streaming from the L1D presents several problems. First, each load operation requires integer register return resources even if it misses in the L1D. This makes it difficult for L1D misses to return data to the register file without impacting the flow of new L1D misses. Second, each fill operation will take an additional cycle to complete. Third, the need to fill the L1D eliminates an opportunity for the L2 OzQ to remove secondary L2 miss lfetch instructions. This is significant because the L1D line size is half of the L2's, and one lfetch per L1D line will result in at least one secondary L2 miss access for every L2 line, thus limiting L2 OzQ throughput.
One approach would be to use three separate lfetch instructions. An lfetch.fault.nt1 would bring the data into the L2. Later, when the data is in the L2, lfetch.fault instructions can hit in the L2 cache and bring the data into the L1D. This makes the lfetch instructions asymmetric and requires several load memory slots.
An optimization to the three-lfetch approach above would use only two separate lfetch.fault instructions, but stage them such that the first will bring data into the L2 and the L1D. Then, when the L2 is filled from the first request, the second lfetch can bring the data into the L1D without being a secondary L2 miss (the L2 is filled, so the lfetch is an L2 hit). This frees an additional load memory slot and makes the lfetch instructions re-usable.
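A C-level sketch of this staged, two-prefetch idea, again using the generic __builtin_prefetch as a stand-in for the two lfetch.fault instructions. The far and near distances are tuning assumptions, and the mapping to lfetch forms is compiler-dependent:

    long walk(const long *a, long n)
    {
        enum { FAR = 64, NEAR = 16 };      /* elements ahead (assumed distances) */
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (i + FAR < n)
                __builtin_prefetch(&a[i + FAR], 0, 1);   /* stage 1: bring the line toward the L2 */
            if (i + NEAR < n)
                __builtin_prefetch(&a[i + NEAR], 0, 3);  /* stage 2: should hit the L2 and fill the L1D */
            s += a[i];
        }
        return s;
    }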
An outstanding L1D fill may be invalidated by a store to the same line. Using lfetch instructions for even small data streams can result in a significant performance increase, provided the lfetch fills the L1D before the store to the line is seen.
Also, since loads that hit in the L1D never allocate into the L2 OzQ, using lfetch instructions to ensure an L1D hit may also help performance by limiting the L2 OzQ to only store data and lfetch requests. This relieves pressure on the limited OzQ resources and reduces the possibility of conflicts among OzQ entries.
9.3.3 Store Data Streams
Since store instructions are always seen by the L2, there is no benefit to bringing store destination data into the L1D. There are many benefits to using an lfetch.fault.excl.nt1 instruction for destination streams. For instance, the .nt1 hint allows secondary L2 misses to be removed, and the core is not slowed by the L1D fills. Also, the .excl hint ensures that the L2 data is ready to receive the store data.
9.4 Control and Data Speculation
The Itanium 2 processor reduces the costs associated with control and data speculation in the ALAT via fast deferral and low-latency fix-up. As such, additional performance may be realized by tuning the code generation to aggressively use speculation. Some speculation considerations are specific to the Itanium processor and do not apply to the Itanium 2 processor. If speculation is more aggressive, then more calls to fix-up code will be encountered. For the Itanium processor, the fix-up code was often moved to cold pages very far from the actual speculation. The heuristic for placing fix-up code near or far from the point of speculation should be revisited and should include profile information in the decision matrix.
9.5 Known L2 Miss Bundle Placement
Given the Itanium 2 processor design, it is slightly better to put instructions which are known to miss the L2 cache on memory port 0 (allocate the first memory op in the issue group). This will allow, when possible, a speculative request to be made to the L3. If the memory request that needs to go to the L2 is in M1, M2, or M3, then it will need to wait until it can be reissued out of the L2 OzQ.
9.6 Avoid Known L2 Cancel and Recirculate Conditions
The most predictable L2 cancel is an L2 bank conflict. These can be avoided by carefully organizing L2 accesses or by bringing the data into the L1D with an lfetch instruction and avoiding the L2 entirely.
The most predictable L2 recirculate is for secondary L2 miss accesses. These can be avoided by using the lfetch instruction to bring data into the L2. Only lfetch instructions that do not fill the L1D are not counted as secondary accesses. If an lfetch is the primary L2 miss and a load is the secondary L2 miss, then the load will still need to recirculate, as it must eventually return data to the core. It is important to schedule L2 miss lfetch instructions far in front of the load to avoid this situation.
9.7 Instruction Bundling
The Itanium 2 processor can completely issue almost all bundle template combinations. Provided the ILP is available, choosing the correct bundling and instruction scheduling may benefit performance. There are two concerns here: place more restrictive instructions early in the issue group and, where possible, transform restrictive instructions into less restrictive ones. The simple instruction nop.i must issue to an I port; however, an add can issue on either an M or I port. The nop.i should be scheduled early to ensure it receives its needed I port. An alternative would be to replace the nop.i with an instruction that is effectively a nop (such as add r3=r0, r3), which can issue on either an I or M port.
9.8 Branches
The following branch and branch prediction related optimization suggestions are covered in detail in Chapter 7, "Branch Instructions and Branch Prediction." They are summarized here.
9.8.1 Single Cycle Branches
The Itanium 2 processor cannot support single-cycle loop branches without some penalty in some iterations of the loop. Unroll the loop to at least two cycles to get the expected performance. This may come at a small cost to code size.
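A minimal C-level sketch of that unrolling: doing two iterations' worth of work per loop-back branch keeps the loop body at two cycles or more, at a small code-size cost.

    long sum_unrolled(const long *a, long n)
    {
        long s0 = 0, s1 = 0;
        long i = 0;
        for (; i + 1 < n; i += 2) {    /* two elements of work per loop branch */
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)                     /* handle an odd trailing element */
            s0 += a[i];
        return s0 + s1;
    }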
9.8.2 Perfect Loop Prediction
Perfect loop prediction only predicts the final iteration of the loop. As such, the Itanium 2 processor considers the branch hints in predicting the other iterations. The Itanium 2 processor requires ar.ec to be set correctly (i.e., if there is no epilogue, set ar.ec=0, not 1 as the Itanium processor expected).
9.8.3 Branch Targets
Branch targets should be aligned on 32-byte boundaries to ensure that the front-end can deliver two bundles per cycle to the back-end.
10 Performance Monitoring
10.1 Introduction
This chapter defines the performance monitoring features of the Itanium 2 processor. The Itanium 2 processor provides four 48-bit performance counters, 100+ monitorable events, and several advanced monitoring capabilities. This chapter outlines the targeted performance monitor usage models, defines the software interface and programming model, and lists the set of monitored events.
The Itanium architecture incorporates architected mechanisms that allow software to actively and directly manage performance-critical processor resources such as branch prediction structures, processor data and instruction caches, virtual memory translation structures, and more. To achieve the highest performance levels, dynamic processor behavior can be monitored and fed back into the code generation process to better encode observed run-time behavior or to expose higher levels of instruction level parallelism. On the Itanium 2 processor, we expect to measure the behavior of real-world Itanium architecture-based applications and operating systems as well as mixed IA-32 and Itanium architecture-based code. These measurements will be critical for understanding the behavior of compiler optimizations, the use of architectural features such as speculation and predication, or the effectiveness of microarchitectural structures such as the ALAT, the caches, and the TLBs. These measurements will provide the data to drive application tuning and future processor, compiler, and operating system designs.
The remainder of the document is split into the following sections:
• Section 10.2, "Performance Monitor Programming Models" discusses how performance monitors are used, and presents various Itanium 2 processor performance monitoring programming models.
• Section 10.3, "Performance Monitor State" defines the Itanium 2 processor specific PMC/PMD performance monitoring registers.
• Chapter 11, "Performance Monitor Events" gives an overview of the Itanium 2 processor event list.
10.2 Performance Monitor Programming Models
This section introduces the Itanium 2 processor performance monitoring features from a programming model point of view and describes how the different event monitoring mechanisms can be used effectively. The Itanium 2 processor performance monitor architecture focuses on the following two usage models:
• Workload Characterization: The first step in any performance analysis is to understand the performance characteristics of the workload under study. Section 10.2.1, "Workload Characterization" discusses the Itanium 2 processor support for workload characterization.
• Profiling: Profiling is used by application developers and profile-guided compilers. Application developers are interested in identifying performance bottlenecks and relating them back to their code. Their primary objective is to understand which program location caused performance degradation at the module, function, and basic block level. For optimization of data placement and the analysis of critical loops, instruction-level granularity is desirable. Profile-guided compilers that use advanced Itanium architectural features such as predication
and speculation benefit from run-time profile information to optimize instruction schedules. The Itanium 2 processor supports instruction-level statistical profiling of branch mispredicts and cache misses. Details of the Itanium 2 processor's profiling support are described in Section 10.2.2, "Profiling."
10.2.1 Workload Characterization
The first step in any performance analysis is to understand the performance characteristics of the workload under study. There are two fundamental measures of interest: event rates and program cycle breakdown.
• Event Rate Monitoring: Event rates of interest include average retired instructions per clock, data and instruction cache miss rates, or branch mispredict rates measured across the entire application. Characterization of operating systems or large commercial workloads (e.g., OLTP analysis) requires a system-level view of performance-relevant events such as TLB miss rates, VHPT walks/second, interrupts/second, or bus utilization rates. Section 10.2.1.1, "Event Rate Monitoring" discusses event rate monitoring.
• Cycle Accounting: The cycle breakdown of a workload attributes a reason to every cycle spent by a program. Apart from a program's inherent execution latency, extra cycles are usually due to pipeline stalls and flushes. Section 10.2.1.4, "Cycle Accounting" discusses cycle accounting.
10.2.1.1 Event Rate Monitoring
Event rate monitoring determines event rates by reading processor event occurrence counters before and after the workload is run, and then computing the desired rates. For instance, two basic Itanium 2 processor events that count the number of retired Itanium instructions (IA64_INST_RETIRED.u) and the number of elapsed clock cycles (CPU_CYCLES) allow a workload's instructions per cycle (IPC) to be computed as follows:
• IPC = (IA64_INST_RETIRED.u[t1] - IA64_INST_RETIRED.u[t0]) / (CPU_CYCLES[t1] - CPU_CYCLES[t0])
Time-based sampling is the basis for many performance debugging tools [VTune™, gprof, WinNT]. As shown in Figure 10-1, time-based sampling can be used to plot the event rates over time, and can provide insights into the different phases that the workload moves through.
Figure 10-1. Time-Based Sampling
[Figure: event rate plotted over time; counters are sampled at the start and end of each sample interval, e.g., at t0 and t1.]
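As an illustration of the IPC formula above, a minimal C sketch that computes IPC from two counter samples; how the two counters are read (e.g., through an OS performance monitoring interface) is outside the scope of this sketch:

    typedef struct {
        unsigned long long inst_retired;  /* IA64_INST_RETIRED.u sample */
        unsigned long long cpu_cycles;    /* CPU_CYCLES sample */
    } pmu_sample;

    /* IPC over the interval [t0, t1]. */
    double ipc(pmu_sample t0, pmu_sample t1)
    {
        return (double)(t1.inst_retired - t0.inst_retired) /
               (double)(t1.cpu_cycles   - t0.cpu_cycles);
    }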
On the Itanium processor, many event types, e.g., TLB misses or branch mispredicts, are limited to a rate of one per clock cycle. These are referred to as "single occurrence" events. However, in the Itanium 2 processor, multiple events of the same type may occur in the same clock. We refer to such events as "multi-occurrence" events. An example of a multi-occurrence event on the
Itanium 2 processor is data cache read misses (up to two per clock). Multi-occurrence events, such as the number of entries in the memory request queue, can be used to derive the average number and average latency of memory accesses. The next two sections describe the basic Itanium 2 processor mechanisms for monitoring single- and multi-occurrence events.
10.2.1.2 Single Occurrence Events and Duration Counts
A single occurrence event can be monitored by any of the Itanium 2 processor performance counters. For all single occurrence events, a counter is incremented by up to one per clock cycle. Duration counters that count the number of clock cycles during which a condition persists are considered "single occurrence" events. Examples of single occurrence events on the Itanium 2 processor are TLB misses, branch mispredictions, and cycle-based metrics.
10.2.1.3 Multi-Occurrence Events, Thresholding, and Averaging
Events that, due to hardware parallelism, may occur at rates greater than one per clock cycle are termed "multi-occurrence" events. Examples of such events on the Itanium 2 processor are retired instructions or the number of live entries in the memory request queue.
Thresholding capabilities are available in the Itanium 2 processor's multi-occurrence counters and can be used to plot an event distribution histogram. When a non-zero threshold is specified, the monitor is incremented by one in every cycle in which the observed event count exceeds that programmed threshold. This allows questions such as "For how many cycles did the memory request queue contain more than two entries?" or "During how many cycles did the machine retire more than three instructions?" to be answered. This capability allows microarchitectural buffer sizing experiments to be supported by real measurements. By running a benchmark with different threshold values, a histogram can be drawn up that may help to identify the performance "knee" at a certain buffer size.
For overlapping concurrent events, such as pending memory operations, the average number of concurrently outstanding requests and the average number of cycles that requests were pending are of interest. To calculate the average number or latency of multiple outstanding requests in the memory queue, we need to know the total number of requests (n_total) and the number of live requests per cycle (n_live/cycle). By summing up the live requests (n_live/cycle) using a multi-occurrence counter, Σn_live is directly measured by hardware. We can now calculate the average number of requests and the average latency as follows:
• Average outstanding requests/cycle = Σn_live / ∆t
• Average latency per request = Σn_live / n_total
An example of this calculation is given in Table 10-1, in which the average outstanding requests/cycle = 15/8 = 1.875, and the average latency per request = 15/5 = 3 cycles.
Table 10-1. Average Latency per Request and Requests per Cycle Calculation Example
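A small C illustration of those two averaging formulas, using the numbers quoted above for the Table 10-1 example (Σn_live = 15, an 8-cycle observation interval, and 5 total requests):

    #include <stdio.h>

    int main(void)
    {
        double sum_live = 15.0;   /* Σ n_live, measured by a multi-occurrence counter */
        double delta_t  = 8.0;    /* observation interval in cycles */
        double n_total  = 5.0;    /* total number of requests */

        printf("avg outstanding/cycle = %.3f\n", sum_live / delta_t);  /* 1.875 */
        printf("avg latency/request   = %.3f\n", sum_live / n_total);  /* 3.000 */
        return 0;
    }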
The Itanium 2 processor provides the following capabilities to support event rate monitoring:
• Clock cycle counter.
• Retired instruction counter.
• Event occurrence and duration counters.
• Multi-occurrence counters with thresholding capability.
10.2.1.4 Cycle Accounting
While event rate monitoring counts the number of events, it does not tell us whether the observed events are contributing to a performance problem. A commonly used strategy is to plot multiple event rates and correlate them with the measured IPC rate. If a low IPC occurs concurrently with a peak of cache miss activity, chances are that cache misses are causing a performance problem. To eliminate such guesswork, the Itanium 2 processor provides a set of cycle accounting monitors that break down the number of cycles that are lost due to various kinds of microarchitectural events. As shown in Figure 10-2, this lets us account for every cycle spent by a program and therefore provides insight into an application's microarchitectural behavior. Note that cycle accounting is different from simple stall or flush duration counting. Cycle accounting is based on the machine's actual stall and flush conditions, and accounts for overlapped pipeline delays, while simple stall or flush duration counters do not. Cycle accounting determines a program's cycle breakdown by stall and flush reasons, while simple duration counters are useful in determining cumulative stall or flush latencies.
Figure 10-2. Itanium® Processor Family Cycle Accounting
[Figure: 100% of execution time broken down into inherent program execution latency, data access cycles, branch mispredicts, instruction fetch stalls, and other stalls.]
The Itanium 2 processor cycle accounting monitors account for all major single- and multi-cycle stall and flush conditions. Overlapping stall and flush conditions are prioritized in reverse pipeline order, i.e., delays that occur later in the pipe and that overlap with earlier stage delays are reported as being caused later in the pipeline. The six back-end stall and flush reasons are prioritized in the following order:
1. Exception/Interruption Cycle: cycles spent flushing the pipe due to interrupts and exceptions.
2. Branch Mispredict Cycle: cycles spent flushing the pipe due to branch mispredicts.
3. Data/FPU Access Cycle: memory pipeline full, data TLB stalls, load-use stalls, and access to the floating-point unit.
4. Execution Latency Cycle: scoreboard and other register dependency stalls.
5. RSE Active Cycle: RSE spill/fill stall.
6. Front-end Stalls: stalls due to the back-end waiting on the front-end.
Additional front-end stall counters are available which detail seven possible reasons for a front-end stall to occur. However, the back-end and front-end stall events should not be compared since they are counted in different stages of the pipeline.
For details, refer to Section 11.6, "Stall Events."
10.2.2 Profiling
Profiling is used by application developers, profile-guided compilers, optimizing linkers, and run-time systems. Application developers are interested in identifying performance bottlenecks and relating them back to their source code. Based on profile feedback, developers can make changes to the high-level algorithms and data structures of the program. Compilers can use profile feedback to optimize instruction schedules by employing advanced Itanium architectural features such as predication and speculation.
To support profiling, performance monitor counts have to be associated with program locations. The following mechanisms are supported directly by the Itanium 2 processor's performance monitors:
• Program Counter Sampling
• Miss Event Address Sampling: Itanium 2 processor event address registers (EARs) provide sub-pipeline length event resolution for performance-critical events (instruction and data caches, branch mispredicts, and instruction and data TLBs).
• Event Qualification: constrains event monitoring to a specific instruction address range, to certain opcodes or privilege levels.
These profiling features are presented in the next three subsections.
10.2.2.1 Program Counter Sampling
Application tuning tools like [VTune, gprof] use time-based or event-based sampling of the program counter and other event counters to identify performance-critical functions and basic blocks. As shown in Figure 10-3, the sampled points can be histogrammed by instruction addresses. For application tuning, statistical sampling techniques have been very successful, because the programmer can rapidly identify code hot spots in which the program spends a significant fraction of its time, or where certain event counts are high.
Figure 10-3. Event Histogram by Program Counter
[Figure: event frequency (e.g., cache miss counts) histogrammed by instruction address across the application's address space.]
Program counter sampling points the performance analyst at code hot spots, but does not indicate what caused the performance problem. Inspection and manual analysis of the hot-spot region along with a fair amount of guesswork are required to identify the root cause of the performance problem. On the Itanium 2 processor, the cycle accounting mechanism (described in Section 10.2.1.4, "Cycle Accounting") can be used to directly measure an application's microarchitectural behavior.
The Itanium architectural interval timer facilities (ITC and ITM registers) can be used for time-based program counter sampling. Event-based program counter sampling is supported by a dedicated performance monitor overflow interrupt mechanism described in detail in Section 7.2.2, "Performance Monitor Overflow Status Registers (PMC[0]..PMC[3])" in Volume 2 of the Intel® Itanium® Architecture Software Developer's Manual.
To support program counter sampling, the Itanium 2 processor provides the following mechanisms:
• Timer interrupt for time-based program counter sampling.
• Event count overflow interrupt for event-based program counter sampling.
• Hardware-supported cycle accounting.
10.2.2.2 Miss Event Address Sampling
Program counter sampling and cycle accounting provide an accurate picture of cumulative microarchitectural behavior, but they do not provide the application developer with pointers to specific program elements (code locations and data structures) that repeatedly cause microarchitectural "miss events." In a cache study of the SPEC92 benchmarks, [Lebeck] used (trace-based) cache miss profiling to gain performance improvements of 1.02 to 3.46 on various benchmarks by making simple changes to the source code. This type of analysis requires identification of instruction and data addresses related to microarchitectural "miss events" such as cache misses, branch mispredicts, or TLB misses. Using symbol tables or compiler annotations, these addresses can be mapped back to critical source code elements. Like Lebeck, most performance analysts in the past have had to capture hardware traces and resort to trace-driven simulation.
Due to the superscalar issue, deep pipelining, and out-of-order instruction completion of today's microarchitectures, the sampled program counter value may not be related to the instruction address that caused a miss event. On a Pentium processor pipeline, the sampled program counter may be off by two dynamic instructions from the instruction that caused the miss event. On a Pentium® Pro processor, this distance increases to approximately 32 dynamic instructions. On the Itanium 2 processor, it is approximately 48 dynamic instructions. If program counter sampling is used for miss event address identification on the Itanium 2 processor, a miss event might be associated with an instruction almost five dynamic basic blocks away from where it actually occurred (assuming that 10% of all instructions are branches). Therefore, it is essential for hardware to precisely identify an event's address.
The Itanium 2 processor provides a set of event address registers (EARs) that record the instruction and data addresses of data cache misses for loads, the instruction and data addresses of data TLB misses, and the instruction addresses of instruction TLB and cache misses. A four-deep branch trace buffer captures sequences of branch instructions. Table 10-2 summarizes the capabilities offered by the Itanium 2 processor EARs and the branch trace buffer. Exposing miss event addresses to software allows them to be monitored either by sampling or by code instrumentation. This eliminates the need for trace generation to identify and solve performance problems and enables performance analysis by a much larger audience on unmodified hardware.
Table 10-2. Itanium® 2 Processor EARs and Branch Trace Buffer
  Instruction Cache EAR
    Triggers on: instruction fetches that miss the L1 instruction cache (demand fetches only)
    Records: instruction address; number of cycles fetch was in flight
  Instruction TLB (ITLB) EAR
    Triggers on: instruction fetches that missed the L1 ITLB (demand fetches only)
    Records: instruction address; who serviced the L1 ITLB miss (L2 ITLB, VHPT, or software)
  Data Cache EAR
    Triggers on: load instructions that miss the L1 data cache
    Records: instruction address; data address; number of cycles load was in flight
  Data TLB (DTLB) EAR
    Triggers on: data references that miss the L1 DTLB
    Records: instruction address; data address; who serviced the L1 DTLB miss (L2 DTLB, VHPT, or software)
  Branch Trace Buffer
    Triggers on: branch outcomes
    Records: branch instruction address; branch target instruction address; mispredict status and reason
The Itanium 2 processor EARs enable statistical sampling by configuring a performance counter to count, for instance, the number of data cache misses or retired instructions. The performance counter value is set up to interrupt the processor after a predetermined number of events have been observed. The data cache event address register repeatedly captures the instruction and data addresses of actual data cache load misses. Whenever the counter overflows, miss event address collection is suspended until the event address register is read by software (this prevents software from capturing a miss event that might be caused by the monitoring software itself). When the counter overflows, an interrupt is delivered to software, the observed event addresses are collected, and a new observation interval can be set up by rewriting the performance counter register. For time-based (rather than event-based) sampling methods, the event address registers indicate to software whether or not a qualified event was captured. Statistical sampling can achieve arbitrary event resolution by varying the number of events within an observation interval and by increasing the number of observation intervals.
10.2.3 Event Qualification
On the Itanium 2 processor, performance monitoring can be confined to a subset of all events. As shown in Figure 10-4, events can be qualified for monitoring based on an instruction address range, a particular instruction opcode, a data address range, an event-specific "unit mask" (umask), the privilege level and instruction set the event was caused by, and the status of the performance monitoring freeze bit (PMC0.fr).
• Itanium Instruction Address Range Check: The Itanium 2 processor allows event monitoring to be constrained to a programmable instruction address range. This enables monitoring of dynamically linked libraries (DLLs), functions, or loops of interest in the context of a large Itanium-based application. The Itanium instruction address range check is applied at the instruction fetch stage of the pipeline, and the resulting qualification is carried by the instruction throughout the pipeline. This enables conditional event counting at a level of granularity smaller than the dynamic instruction length of the pipeline (approximately 48 instructions). The Itanium 2 processor's instruction address range check operates only during Itanium-based code execution, i.e., when PSR.is is zero. For details, see the Itanium Opcode Match and Address Range Check Registers (PMC8,9).
Figure 10-4. Itanium® 2 Processor Event Qualification
[Figure: an event is qualified only if all of the following are true: the Itanium instruction pointer is in the IBR range (Itanium instruction address range check), the Itanium opcode matches (Itanium instruction opcode match), the Itanium data address is in the DBR range (Itanium data address range check, memory operations only), the event-specific "unit mask" matches, the code is executing at a monitored privilege level and in a monitored instruction set (Itanium or IA-32), and event monitoring is enabled (performance monitor freeze bit, PMC0.fr).]
• Itanium Instruction Opcode Match: The Itanium 2 processor provides two independent Itanium opcode match registers, each of which matches the currently issued instruction encodings with a programmable opcode match and mask function. The resulting match events can be selected as an event type for counting by the performance counters. This allows histogramming of instruction types, usage of destination and predicate registers, as well as basic block profiling (through insertion of tagged NOPs). The opcode matcher operates only during Itanium-based code execution, i.e., when PSR.is is zero. Details are described in Section 10.3.4.
• Itanium Data Address Range Check: The Itanium 2 processor allows event collection for memory operations to be constrained to a programmable data address range. This enables selective monitoring of data cache miss behavior of specific data structures. For details, see Section 10.3.6.
• Event Specific Unit Masks: Some events allow the specification of "unit masks" to filter out interesting events directly at the monitored unit. As an example, the number of counted bus transactions can be qualified by an event-specific unit mask to contain transactions that originated from any bus agent, from the processor itself, or from other I/O bus masters. In this case, the bus unit uses a three-way unit mask (any, self, or I/O) that specifies which transactions are to be counted. In the Itanium 2 processor, events from the branch, memory and bus units support a variety of unit masks. For details, refer to the event pages in Chapter 11, "Performance Monitor Events."
• Privilege Level: Two bits in the processor status register are provided to enable selective process-based event monitoring. The Itanium 2 processor supports conditional event counting based on the current privilege level; this allows performance monitoring software to break down event counts into user and operating system contributions. For details on how to constrain monitoring by privilege level, refer to Section 10.3.1, "Performance Monitor Control and Accessibility."
• Instruction Set: The Itanium 2 processor supports conditional event counting based on the currently executing instruction set (Itanium or IA-32) by providing two instruction set mask bits for each event monitor. This allows performance monitoring software to break down event counts into Itanium and IA-32 contributions. For details, refer to Section 10.3.1, "Performance Monitor Control and Accessibility."
• Performance Monitor Freeze: Event counter overflows or software can freeze event monitoring. When frozen, no event monitoring takes place until software clears the monitoring freeze bit (PMC0.fr). This ensures that the performance monitoring routines themselves, e.g., counter overflow interrupt handlers or performance monitoring context switch routines, do not "pollute" the event counts of the system under observation. For details refer to Section 7.2.4 of Volume 2 of the Intel® Itanium™ Architecture Software Developer's Manual.
10.2.3.1 Combining Opcode Matching, Instruction, and Data Address Range Check
The Itanium 2 processor allows various event qualification mechanisms to be combined by providing the instruction tagging mechanism shown in Figure 10-5.
Figure 10-5. Instruction Tagging Mechanism in the Itanium® 2 Processor
[Figure: the Itanium instruction address range check (IBRs, PMC14) produces an IBRRange tag that feeds two Itanium opcode matchers (PMC8, PMC15 and PMC9, PMC15), producing Tag(PMC8) and Tag(PMC9); the Itanium data address range check (DBRs, PMC13) produces a DBRRange tag for memory operations. Qualified events then pass through the event select (PMCi.es) and the privilege level and instruction set check (PMCi.plm, PMCi.ism), and are accumulated in counter PMDi.]
During Itanium instruction execution (PSR.is is zero), the instruction address range check is applied first. The resulting address range check tag (IBRRangeTag) is passed to two opcode matchers that combine the instruction address range check with the opcode match. Each of the two combined tags (Tag(PMC8) and Tag(PMC9)) can be counted as a retired instruction count event (for details refer to the event description "IA64_TAGGED_INST_RETIRED" on page 11-165).
One of the combined Itanium address range and opcode match tags, Tag(PMC8), qualifies all
downstream pipeline events. Events in the memory hierarchy (L1 and L2 data cache and data TLB
events) can further be qualified using a data address range check (DBRRangeTag).

As summarized in Table 10-3, data address range checking can be combined with opcode matching
and instruction range checking on the Itanium 2 processor. Additional event qualifications based
on the current privilege level and the current instruction set can be applied to all events and are
discussed in Section 10.2.3.2, “Privilege Level Constraints” and Section 10.2.3.3, “Instruction Set
Constraints.”
Table 10-3. Itanium® 2 Processor Event Qualification Modes

Event Qualification Modes | Opcode Match Enable PMC15.ibrp0-pmc8 | Instruction Opcode Matching PMC8 | Instruction Address Range Check Enable PMC14.ibrp0 | Data Address Range Check [PMC13.enable-dbrp#, PMC13.dbrp#] (mem pipe events only)
Unconstrained Monitoring (all events) | x | 0xffff_ffff_ffff_ffff | x | [1,11] or [0,xx]
Instruction Address Range Check Only | x | 0xffff_ffff_ffff_fffe | 0 | [1,00]
Opcode Matching Only | 0 | Desired Opcodes | x | [1,01]
Data Address Range Check Only | x | 0xffff_ffff_ffff_ffff | x | [1,10]
Instruction Address Range Check and Opcode Matching | 0 | Desired Opcodes | 0 | [1,01]
Instruction and Data Address Range Check | x | 0xffff_ffff_ffff_fffe | 0 | [1,00]
Opcode Matching and Data Address Range Check | 0 | Desired Opcodes | x | [1,00]

10.2.3.2 Privilege Level Constraints

Performance monitoring software cannot always count on context switch support from the
operating system. In general, this has made performance analysis of a single process in a
multi-processing system or a multi-process workload impossible. To provide hardware support for
this kind of analysis, the Itanium architecture specifies three global bits (PSR.up, PSR.pp, DCR.pp)
and a per-monitor “privilege monitor” bit (PMCi.pm). To break down the performance
contributions of operating system and user-level application components, each monitor specifies a
4-bit privilege level mask (PMCi.plm). The mask is compared to the current privilege level in the
processor status register (PSR.cpl), and event counting is enabled if PMCi.plm[PSR.cpl] is one.
The Itanium 2 processor performance monitor control is discussed in Section 10.3.1,
“Performance Monitor Control and Accessibility.”

PMC registers can be configured as user-level monitors (PMCi.pm is 0) or system-level monitors
(PMCi.pm is 1). A user-level monitor is enabled whenever PSR.up is one. PSR.up can be
controlled by an application using the sum/rum instructions. This allows applications to
enable/disable performance monitoring for specific code sections. A system-level monitor is
enabled whenever PSR.pp is one. PSR.pp can be controlled at privilege level 0 only, which allows
monitor control without interference from user-level processes. The pp field in the default control
register (DCR.pp) is copied into PSR.pp whenever an interruption is delivered. This allows events
generated during interruptions to be broken down separately: if DCR.pp is 0, events during
interruptions are not counted; if DCR.pp is 1, they are included in the kernel counts.
As shown in Figure 10-6, Figure 10-7, and Figure 10-8, single process, multi-process, and
system-level performance monitoring are possible by specifying the appropriate combination of
PSR and DCR bits. These bits allow performance monitoring to be controlled entirely from a
kernel-level device driver, without explicit operating system support. Once the desired monitoring
configuration has been set up in a process’ processor status register (PSR), “regular” unmodified
operating system context switch code automatically enables/disables performance monitoring.

With support from the operating system, individual per-process breakdown of event counts can be
generated as outlined in the performance monitoring chapter of the Intel® Itanium® Architecture
Software Developer’s Manual.
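As a concrete illustration of how the PSR, DCR and PMC bits combine, the C sketch below models only the settings shown in two panels of Figure 10-6 (user-level-only monitoring of one process, and monitoring that also includes kernel and interrupt-level execution). It is a host-side model of the decision, not code that touches the real registers, and the helper names are hypothetical.

#include <stdio.h>

/* Illustrative model of the monitoring-related control bits. */
struct pmu_cfg {
    unsigned pmc_pm;    /* PMC.pm: 0 = user monitor, 1 = privileged monitor */
    unsigned pmc_plm;   /* PMC.plm: 4-bit privilege level mask              */
    unsigned psr_up;    /* PSR.up: enables user-level monitors              */
    unsigned psr_pp;    /* PSR.pp: enables privileged monitors              */
    unsigned dcr_pp;    /* DCR.pp: copied into PSR.pp on interruptions      */
};

/* User-level portion of one process only (Figure 10-6, first panel). */
static struct pmu_cfg single_process_user_only(void)
{
    return (struct pmu_cfg){ .pmc_pm = 0, .pmc_plm = 0x8 /* 1000 */,
                             .psr_up = 1, .psr_pp = 0, .dcr_pp = 0 };
}

/* User, kernel and interrupt-level contributions (Figure 10-6, third panel). */
static struct pmu_cfg single_process_full(void)
{
    return (struct pmu_cfg){ .pmc_pm = 1, .pmc_plm = 0x9 /* 1001 */,
                             .psr_up = 0, .psr_pp = 1, .dcr_pp = 1 };
}

int main(void)
{
    struct pmu_cfg a = single_process_user_only(), b = single_process_full();
    printf("user-only: pm=%u plm=0x%x up=%u pp=%u dcr.pp=%u\n",
           a.pmc_pm, a.pmc_plm, a.psr_up, a.psr_pp, a.dcr_pp);
    printf("full:      pm=%u plm=0x%x up=%u pp=%u dcr.pp=%u\n",
           b.pmc_pm, b.pmc_plm, b.psr_up, b.psr_pp, b.dcr_pp);
    return 0;
}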
10.2.3.3 Instruction Set Constraints

On the Itanium 2 processor, monitoring can additionally be constrained based on the currently
executing instruction set as defined by PSR.is. This capability is supported by the four generic
performance counters, as well as the opcode matching and instruction and data event address
registers. However, the branch trace buffer only supports Itanium-based code execution. When
Itanium architecture-only features are used, the corresponding PMC register instruction set mask
(PMCi.ism) should be set to Itanium architecture only (10) to ensure that events generated by
IA-32 code do not corrupt the Itanium 2 processor event counts.
Figure 10-6. Single Process Monitor (three example configurations for monitoring process A across user level (cpl = 3), kernel level (cpl = 0) and interrupt level (cpl = 0): (a) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1000; DCR.pp = 0; (b) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1001; DCR.pp = 0; (c) PSR.pp = 1, others 0; PMC.pm = 1; PMC.plm = 1001; DCR.pp = 1)
Figure 10-7. Multiple Process Monitor (three example configurations for monitoring processes A and B: (a) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1000; DCR.pp = 0; (b) PSR.up = 1, others 0; PMC.pm = 0; PMC.plm = 1001; DCR.pp = 0; (c) PSR.pp = 1, others 0; PMC.pm = 1; PMC.plm = 1001; DCR.pp = 1)
Figure 10-8. System Wide Monitor (example configuration spanning all processes, kernel and interrupt level (cpl = 0): PMC.pm = 1; PMC.plm = 1000; DCR.pp = 0)
10.2.4 References

• [gprof] S.L. Graham, P.B. Kessler and M.K. McKusick, “gprof: A Call Graph Execution
Profiler”, Proceedings SIGPLAN’82 Symposium on Compiler Construction; SIGPLAN
Notices; Vol. 17, No. 6, pp. 120-126, June 1982.
• [Lebeck] Alvin R. Lebeck and David A. Wood, “Cache Profiling and the SPEC benchmarks:
A Case Study”, Tech Report 1164, Computer Science Dept., University of Wisconsin -
Madison, July 1993.
• [VTune] Mark Atkins and Ramesh Subramaniam, “PC Software Performance Tuning”, IEEE
Computer, Vol. 29, No. 8, pp. 47-54, August 1996.
• [WinNT] Russ Blake, “Optimizing Windows NT(tm)”, Volume 4 of the Microsoft “Windows
NT Resource Kit for Windows NT Version 3.51”, Microsoft Press, 1995.
10.3Performance Monitor State
T wo sets of performa nce monit or registe rs are def ined. Perform ance Monito r Config uratio n (PMC)
registers are used to configure the monitors. Performance Mon itor Data (PMD) registers pro vide
data values from the mon itors. This section describes the Itanium2 processor performance
monitoring re gisters which expands on the Itanium arc hitectural definition. As shown in
Figure 10-9 the Itanium 2 processor provides four 48-bit perfor mance counters (PMC/PMD
pairs), and the following model-specific monitoring registers: instruction and data event address
registers (EARs) for monitoring cache and TLB misses, a branch trace buffer , two op code match
registers, and an instructio n addres s range check register.
Table 10-4 defines the PMC/PMD register assignments for each monitoring feature. The interrupt
status registers are mapped to PMC0,1,2,3. The four generic performance counter pairs are assigned
to PMC/PMD4,5,6,7. The event address registers and the branch trace buffer are controlled by three
configuration registers (PMC10,11,12). Captured event addresses and cache miss latencies are
accessible to software through five event address data registers (PMD0,1,2,3,17) and a branch trace
buffer (PMD8-16). On the Itanium 2 processor, monitoring of some events can additionally be
constrained to a programmable instruction address range by appropriate setting of the instruction
breakpoint registers (IBR) and the instruction address range check register (PMC14) and turning on
the checking mechanism in the opcode match registers (PMC8,9). Two opcode match registers
(PMC8,9) and an opcode match configuration register (PMC15) allow monitoring of some events to
be qualified with a programmable opcode. For memory operations, events can be qualified by a
programmable data address range by appropriate setting of the data breakpoint registers (DBRs)
and the data address range configuration register (PMC13).

Table 10-4. Itanium® 2 Processor Performance Monitor Register Set
Monitoring Feature | Registers
Interrupt Status | PMC0,1,2,3
Event Counters | PMC4,5,6,7 / PMD4,5,6,7
Opcode Matching | PMC8,9,15
Instruction EAR | PMC10 / PMD0,1
Data EAR | PMC11 / PMD2,3,17
Branch Trace Buffer | PMC12 / PMD8-16
Instruction Address Range Check | PMC14
Memory Pipeline Event Constraints | PMC13
10.3.1 Performance Monitor Control and Accessibility

In order to use performance monitor features, the power to the PMU should be turned on by setting
PMC4.enable to 1. At reset, this bit will be set. To provide power savings, this bit can be cleared to
turn off the clocks to all PMDs, PMCs (with the exception of PMC4), and other non-critical
circuitry.

Once the power is turned on, event collection is controlled by the Performance Monitor
Configuration (PMC) registers and the processor status register (PSR). Four PSR fields (PSR.up,
PSR.pp, PSR.cpl and PSR.sp) and the performance monitor freeze bit (PMC0.fr) affect the
behavior of all performance monitor registers.

Per-monitor control is provided by three PMC register fields (PMCi.plm, PMCi.ism, and
PMCi.pm). Instruction set masking based on PMCi.ism is an Itanium 2 processor model-specific
feature. Event collection for a monitor is enabled under the following constraints on the Itanium 2
processor:

Monitor Enablei = (not PMC0.fr) and PMCi.plm[PSR.cpl] and ((not PMCi.ism[PSR.is]) or
(PMCi=12)) and ((not (PMCi.pm) and PSR.up) or (PMCi.pm and PSR.pp))

Figure 10-10 defines the PSR control fields that affect performance monitoring. For a detailed
definition of how the PSR bits affect event monitoring and control accessibility of PMD registers,
please refer to Section 3.3.2 and Section 7.2.1 of Volume 2 of the Intel® Itanium® Architecture
Software Developer’s Manual.

Table 10-5 defines per-monitor controls that apply to PMC4,5,6,7,10,11,12. As defined in Table 10-4,
“Itanium® 2 Processor Performance Monitor Register Set,” each of these PMC registers controls
the behavior of its associated performance monitor data registers (PMD). The Itanium 2 processor
model-specific PMD registers associated with instruction/data EARs and the branch trace buffer
(PMD0,1,2,3,8-17) can be read only when event monitoring is frozen (PMC0.fr is one).
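The enable equation in this section is simply a boolean combination of a handful of bits. The C sketch below transcribes it as a host-side model for clarity; it does not read the real PMC or PSR registers, and the struct fields are merely named after the architectural bits.

#include <stdbool.h>
#include <stdio.h>

struct pmc_ctrl {            /* per-monitor PMCi control fields       */
    unsigned plm;            /* 4-bit privilege level mask            */
    unsigned ism;            /* 2-bit instruction set mask [25:24]    */
    bool     pm;             /* privileged monitor                    */
};

struct cpu_state {
    bool     pmc0_fr;        /* PMC0.fr: monitoring frozen            */
    unsigned psr_cpl;        /* current privilege level, 0..3         */
    unsigned psr_is;         /* 0 = Itanium, 1 = IA-32                */
    bool     psr_up, psr_pp; /* user / privileged monitor enables     */
};

/* Direct transcription of the Monitor Enable(i) expression above. */
static bool monitor_enable(int i, const struct pmc_ctrl *c, const struct cpu_state *s)
{
    bool plm_ok = (c->plm >> s->psr_cpl) & 1;
    bool ism_ok = !((c->ism >> s->psr_is) & 1) || (i == 12); /* PMC12 has no ism */
    bool gate   = (!c->pm && s->psr_up) || (c->pm && s->psr_pp);
    return !s->pmc0_fr && plm_ok && ism_ok && gate;
}

int main(void)
{
    /* user-level counting, Itanium execution only (bit 24 of ism low) */
    struct pmc_ctrl  c = { .plm = 0x8, .ism = 0x2, .pm = false };
    struct cpu_state s = { .pmc0_fr = false, .psr_cpl = 3, .psr_is = 0,
                           .psr_up = true, .psr_pp = false };
    printf("enabled at cpl 3: %d\n", monitor_enable(4, &c, &s));  /* 1 */
    s.psr_cpl = 0;
    printf("enabled at cpl 0: %d\n", monitor_enable(4, &c, &s));  /* 0 */
    return 0;
}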
Figure 10-10. Processor Status Register (PSR) Fields for Performance Monitoring

Table 10-5. Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12)
Field | Bits | Description
plm | 3:0 | Privilege Level Mask - controls performance monitor operation for a specific privilege level. Each bit corresponds to one of the 4 privilege levels, with bit 0 corresponding to privilege level 0, bit 1 with privilege level 1, etc. A bit value of 1 indicates that the monitor is enabled at that privilege level. Writing zeros to all plm bits effectively disables the monitor. In this state, the Itanium 2 processor will not preserve the value of the corresponding PMD register(s).
pm | 6 | Privileged monitor - When 0, the performance monitor is configured as a user monitor and enabled by PSR.up. When PMC.pm is 1, the performance monitor is configured as a privileged monitor, enabled by PSR.pp, and PMD can only be read by privileged software. Any read of the PMD by non-privileged software in this case will return 0. NOTE: In PMC10 this field is implemented in bit [4].
ism | 25:24 | Instruction Set Mask - controls performance monitor operation based on the current instruction set. The instruction set mask applies to PMC4,5,6,7,10,11 but not to PMC12. 00: monitoring enabled during Itanium and IA-32 instruction execution (regardless of PSR.is); 10: bit 24 low enables monitoring during Itanium instruction execution (when PSR.is is zero); 01: bit 25 low enables monitoring during IA-32 instruction execution (when PSR.is is one); 11: disables monitoring. NOTE: In PMC10 this is implemented in [15:14]. PMC12 does not have this field.

10.3.2 Performance Counter Registers

The Itanium 2 processor provides four generic performance counters (PMC/PMD4,5,6,7 pairs). The
implemented counter width on the Itanium 2 processor is 48 bits (bit [47] indicates the overflow
condition). More so than on the Itanium processor, PMC/PMD pairs on the Itanium 2 processor are
symmetrical, i.e., nearly all event types can be monitored by all counters. There are exceptions
within some of the cache counters. See Section 11.8.2, “L1 Data Cache Events” and Section 11.8.3,
“L2 Unified Cache Events” for more information. These counters can track events whose
maximum per-cycle event increment is up to seven.
Figure 10-11 and Table 10-6 define the layout of the Itanium 2 processor Performance Counter
Configuration Registers (PMC4,5,6,7). The main task of these configuration registers is to select the
events to be monitored by the respective performance monitor data counters. The event selection (es)
and unit mask (umask) fields in the PMC registers perform the selection of these events. The rest
of the fields in the PMCs specify under what conditions the counting should be done (plm, pm, ism),
by how much the counter should be incremented (threshold), and what happens when the
counter overflows (oi, ev).

Table 10-6. Itanium® 2 Processor Generic PMC Register Fields (PMC4,5,6,7)
Field | Bits | Description
plm | 3:0 | Privilege Level Mask. See Table 10-5 “Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12).”
ev | 4 | External visibility - When 1, an external notification (BPM pin strobe) is provided whenever the counter overflows. External notification occurs regardless of the setting of the oi bit (see below). On the Itanium 2 processor, PMC4 external notification strobes the BPM0 pin, PMC5 strobes the BPM1 pin, PMC6 strobes the BPM2 pin, and PMC7 strobes the BPM3 pin.
oi | 5 | Overflow interrupt - When 1, a Performance Monitor Interrupt is raised and the performance monitor freeze bit (PMC0.fr) is set when the monitor overflows. When 0, no interrupt is raised and the performance monitor freeze bit (PMC0.fr) remains unchanged. Counter overflows generate only one interrupt. Setting the corresponding PMC0 bit on an overflow will be independent of this bit.
pm | 6 | Privilege Monitor. See Table 10-5 “Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12).”
ig | 7 | reserved
es | 15:8 | Event select - selects the performance event to be monitored. Itanium 2 processor event encodings are defined in Chapter 11, “Performance Monitor Events.”
umask | 19:16 | Unit Mask - event specific mask bits (see event definition for details)
threshold | 22:20 | Threshold - enables thresholding for “multi-occurrence” events. When threshold is zero, the counter sums up all observed event values. When the threshold is non-zero, the counter increments by one in every cycle in which the observed event value exceeds the threshold.
enable | 23 | PMC4 only. Enables use of the PMUs. A 1 must be written for the PMUs to function. Power up value is 1.
ism | 25:24 | Instruction Set Mask. See Table 10-5 “Performance Monitor PMC Register Control Fields (PMC4,5,6,7,10,11,12).”
Figure 10-12 and Table 10-7 define the layout of the Itanium 2 processor Performance Counter
Data Registers (PMD4,5,6,7). A counter overflow occurs when the counter wraps (i.e., a carry out
from bit 46 is detected). Software can force an external interruption or external notification after N
events by preloading the monitor with a count value of 2^47 - N. Note that bit 47 is the overflow bit
and must be initialized to 0 whenever there is a need to initialize the register.

When accessible, software can continuously read the performance counter registers PMD4,5,6,7
without disabling event collection. Any read of the PMD from software without the appropriate
privilege level will return 0 (see “plm” in Table 10-6). The processor ensures that software will see
monotonically increasing counter values.

Table 10-7. Itanium® 2 Processor Generic PMD Register Fields (PMD4,5,6,7)
Field | Bits | Description
sxt47 | 63:48 | Writes are ignored. Reads return the value of bit 47, so count values appear as sign extended.
ov | 47 | Overflow bit (carry out from bit 46). NOTE: Writes to initialize the PMD should write 0 to this bit.
count | 46:0 | Event Count. The counter is defined to overflow when the count field wraps (carry out from bit 46).
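The 2^47 - N preload rule and the sign-extended read behavior in Table 10-7 are easy to get wrong, so the short sketch below models both with host-side arithmetic only, under the 48-bit implemented counter width stated above.

#include <stdint.h>
#include <stdio.h>

#define PMD_COUNT_BITS 47              /* count field is [46:0], bit 47 is ov */

/* Preload value that makes the counter overflow after n more events. */
static uint64_t preload_for(uint64_t n)
{
    return ((uint64_t)1 << PMD_COUNT_BITS) - n;   /* 2^47 - N, bit 47 stays 0 */
}

/* Model of a PMD read: bits [63:48] read as copies of bit 47 (sxt47). */
static uint64_t pmd_read_model(uint64_t hw_value)
{
    uint64_t low48 = hw_value & ((1ull << 48) - 1);
    return (low48 & (1ull << 47)) ? (low48 | ~((1ull << 48) - 1)) : low48;
}

int main(void)
{
    uint64_t v = preload_for(1000);    /* overflow after 1000 more events */
    printf("preload = 0x%013llx\n", (unsigned long long)v);
    printf("read after overflow appears sign extended: 0x%016llx\n",
           (unsigned long long)pmd_read_model(v + 1000));
    return 0;
}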
10.3.3 Performance Monitor Overflow Status Registers (PMC0,1,2,3)

As previously mentioned, the Itanium 2 processor supports four performance monitoring counters.
The overflow status of these four counters is indicated in register PMC0. As shown in Figure 10-13
and Table 10-8, only the PMC0[7:4,0] bits are populated. All other overflow bits are ignored, i.e., they
read as zero and ignore writes.

Figure 10-13. Itanium® 2 Processor Performance Monitor Overflow Status Registers (PMC0,1,2,3) (PMC0: fr in bit 0, reserved bits 3:1, overflow bits 7:4, remaining bits reserved; PMC1, PMC2 and PMC3 are reserved)

Table 10-8. Itanium® 2 Processor Performance Monitor Overflow Register Fields (PMC0,1,2,3)
Register | Field | Bits | Description
PMC0 | fr | 0 | Performance Monitor “freeze” bit - When 1, event monitoring is disabled. When 0, event monitoring is enabled. This bit is set by hardware whenever a performance monitor overflow occurs and its corresponding overflow interrupt bit (PMC.oi) is set to one. SW is responsible for clearing it. When the PMC.oi bit is not set, then counter overflows do not set this bit.
PMC0 | ignored | 3:1 | Read zero, Writes ignored.
PMC0 | overflow | 7:4 | Event Counter Overflow - When bit n is one, indicates that the corresponding PMDn overflowed. This is a bit vector indicating which performance monitor overflowed. These overflow bits are set on their corresponding counter's overflow regardless of the state of the PMC.oi bit. Software may also set these bits. These bits are sticky and multiple bits may be set.
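A counter-overflow interrupt handler typically inspects the PMC0 overflow vector, records which counters wrapped, and then clears PMC0.fr to resume monitoring. The sketch below models only that bookkeeping on an already-sampled PMC0 value; reading and writing the real register requires privileged mov-to/from-PMC instructions (or an OS interface such as a perfmon driver) and is not shown.

#include <stdint.h>
#include <stdio.h>

#define PMC0_FR        (1ull << 0)        /* freeze bit                     */
#define PMC0_OVF_SHIFT 4                  /* overflow bits for PMD4..PMD7   */
#define PMC0_OVF_MASK  (0xfull << PMC0_OVF_SHIFT)

/* Decide, from a sampled PMC0 value, which generic counters overflowed,
 * and compute the value to write back to unfreeze monitoring. */
static uint64_t handle_overflow(uint64_t pmc0, void (*record)(int counter))
{
    for (int n = 4; n <= 7; n++)
        if (pmc0 & (1ull << n))
            record(n);                    /* PMDn wrapped */
    /* clear the sticky overflow bits and the freeze bit before resuming */
    return pmc0 & ~(PMC0_OVF_MASK | PMC0_FR);
}

static void record_overflow(int counter) { printf("counter PMD%d overflowed\n", counter); }

int main(void)
{
    uint64_t pmc0 = PMC0_FR | (1ull << 5);            /* frozen, PMD5 wrapped */
    printf("write back 0x%llx\n",
           (unsigned long long)handle_overflow(pmc0, record_overflow));
    return 0;
}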
10.3.4 Opcode Match Check (PMC8,9,15)

The Itanium 2 processor allows event monitoring to be constrained based on the instruction address
and/or the Itanium encoding (opcode) of an instruction. Registers PMC15 and PMC14 (Section 10.3.5,
“Instruction Address Range Matching”) are used to enable these features. Registers PMC8,9 allow
configuring these features. For memory related events, the appropriate bits must be set in PMC13 to
enable this feature. Please refer to Section 10.3.6, “Data Address Range Matching (PMC13)” for
details. Unlike in the Itanium processor, the opcode matcher in the Itanium 2 processor operates
during both Itanium-based and IA-32 code execution. When operating in IA-32 mode it checks for
Itanium opcodes.

Figure 10-14 and Table 10-9 describe the fields of the PMC8,9 registers. Figure 10-15 and Table 10-10
describe the register PMC15. All combinations of bits [63:60] are supported. To match A-slot
instructions, set bits [63:62] to 11. To match all instruction types, bits [63:60] should be set to 1111.
To ensure that all events are counted independent of the opcode matcher, all mifb and all mask bits
of PMC8,9 should be set to one (all opcodes match).

PMC9 only qualifies the event IA64_TAGGED_INST_RETIRED. The Itanium 2 processor’s
opcode constraint for the IA64_TAGGED_INST_RETIRED event ANDs the PMC9 results with the
IBRP1 and IBRP3 matches, and the PMC8 results with the IBRP0 and IBRP2 matches. PMC8, however,
constrains other downstream events as well. To ensure that all events are counted independent of the
opcode matcher, bits [63:60] and bits [29:3] should be set to all ones.
Figure 10-14. Opcode Match Registers (PMC8,9) (fields, from the most significant bit down: m [63], i [62], f [61], b [60], match [59:33], rsv [32:30], mask [29:3], -- [2], inv [1], ig_ad [0])
Table 10-9. Opcode Match Register Fields (PMC8,9)
Field | Bits | Width | Description
ig_ad | 0 | 1 | Ignore Instruction Address Range Checking. If set to 1, all instruction addresses are considered for events. If 0, IBRs 0-1 will be used for address constraints. NOTE: This bit is ignored in PMC9.
inv | 1 | 1 | Invert Range Check. If set to 1, the address range specified by IBR0-1 is inverted. Effective only when the ig_ad bit is set to 0. NOTE: This bit is ignored in PMC9.
-- | 2 | 1 | Must write 1 for proper PMU operation.
mask | 29:3 | 27 | Bits that mask Itanium® instruction encoding bits. [15:3]: mask bits for opcode bits [12:0]; [29:16]: mask bits for opcode bits [40:27]. If a mask bit is set to 1, the corresponding opcode bit is not used for opcode matching.
rsv | 32:30 | 3 | Reserved bits
match | 59:33 | 27 | Opcode bits against which the Itanium instruction encoding is to be matched. [45:33]: match bits for opcode bits [12:0]; [59:46]: match bits for opcode bits [40:27].
b | 60 | 1 | If 1: match if opcode is a B-slot
f | 61 | 1 | If 1: match if opcode is an F-slot
i | 62 | 1 | If 1: match if opcode is an I-slot
m | 63 | 1 | If 1: match if opcode is an M-slot
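As a worked example of the mask/match split in Table 10-9, the sketch below builds the 27-bit mask and match values needed to select instructions by major opcode only, ignoring the destination and predicate bits. It assumes the major opcode occupies encoding bits 40:37, which map to bits 26:23 of the 27-bit match/mask values; treat it as an illustration of the field layout rather than a ready-to-use programming recipe.

#include <stdint.h>
#include <stdio.h>

/* 27-bit mask/match pair for PMC8/9: indices 0..12 correspond to encoding
 * bits 12:0, indices 13..26 to encoding bits 40:27 (Table 10-9). */
struct opcode_sel { uint32_t match, mask; };

/* Select instructions by 4-bit major opcode (assumed encoding bits 40:37),
 * ignoring all other compared bits. A mask bit of 1 means "don't care". */
static struct opcode_sel select_major_opcode(unsigned major)
{
    struct opcode_sel s;
    s.mask  = 0x7ffffff;                 /* start with: ignore everything    */
    s.mask &= ~(0xfu << 23);             /* ...but compare bits 40:37        */
    s.match = (uint32_t)(major & 0xf) << 23;
    return s;
}

int main(void)
{
    struct opcode_sel s = select_major_opcode(0x8);  /* hypothetical opcode */
    printf("match=0x%07x mask=0x%07x\n", s.match, s.mask);
    return 0;
}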
Figure 10-15. Opcode Match Configuration Register (PMC15) (fields: ibrp0-pmc8 [0], ibrp1-pmc9 [1], ibrp2-pmc8 [2], ibrp3-pmc9 [3], reserved [63:4])
Table 10-10. Opcode Match Configuration Register Fields (PMC15)
Field | Bits | Description
ibrp0-pmc8 | 0 | 1: PMU events will not be constrained by opcode. 0: PMU events (including IA64_TAGGED_INST_RETIRED.00) will be opcode constrained by PMC8.
ibrp1-pmc9 | 1 | 1: IA64_TAGGED_INST_RETIRED.01 won’t be constrained by opcode. 0: IA64_TAGGED_INST_RETIRED.01 will be opcode constrained by PMC9.
ibrp2-pmc8 | 2 | 1: IA64_TAGGED_INST_RETIRED.10 won’t be constrained by opcode. 0: IA64_TAGGED_INST_RETIRED.10 will be opcode constrained by PMC8.
ibrp3-pmc9 | 3 | 1: IA64_TAGGED_INST_RETIRED.11 won’t be constrained by opcode. 0: IA64_TAGGED_INST_RETIRED.11 will be opcode constrained by PMC9.

For opcode matching purposes, an Itanium instruction is defined by two items: the instruction type
“itype” (one of M, I, F or B) and the 42-bit encoding “enco{41:0}” defined in the Intel® Itanium®
Architecture Software Developer’s Manual. Each instruction is evaluated against each opcode
match register (PMC8,9) as follows:

Match(PMCi) = (imatch(itype, PMCi.mifb) AND ematch(enco, PMCi.match, PMCi.mask))

Where:

imatch(itype, PMC[i].mifb) = (itype=M AND PMC[i].m) OR (itype=I AND PMC[i].i) OR
(itype=F AND PMC[i].f) OR (itype=B AND PMC[i].b)

ematch(enco, match, mask) = AND[b=40..27] ((enco{b}=match{b-14}) OR mask{b-14}) AND
AND[b=12..0] ((enco{b}=match{b}) OR mask{b})

This function matches encoding bits {40:27} (major opcode) and encoding bits {12:0} (destination
and qualifying predicate) only. Bits {26:13} of the instruction encoding are ignored by the opcode
matcher.
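Written out in C, the match predicate above is only a few lines. The sketch below is a literal host-side transcription of it; the 27-bit match and mask values are taken as already extracted from PMC8 or PMC9 bits [59:33] and [29:3].

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum itype { M_SLOT, I_SLOT, F_SLOT, B_SLOT };

struct opcode_matcher {
    bool m, i, f, b;          /* PMC8/9 bits [63:60]                        */
    uint32_t match;           /* 27-bit value from PMC8/9 [59:33]           */
    uint32_t mask;            /* 27-bit value from PMC8/9 [29:3]            */
};

static bool imatch(enum itype t, const struct opcode_matcher *p)
{
    return (t == M_SLOT && p->m) || (t == I_SLOT && p->i) ||
           (t == F_SLOT && p->f) || (t == B_SLOT && p->b);
}

/* Compare encoding bits {40:27} and {12:0}; bits {26:13} are ignored. */
static bool ematch(uint64_t enco, uint32_t match, uint32_t mask)
{
    for (int b = 0; b <= 12; b++)
        if (((enco >> b) ^ (match >> b)) & 1 && !((mask >> b) & 1))
            return false;
    for (int b = 27; b <= 40; b++)
        if (((enco >> b) ^ (match >> (b - 14))) & 1 && !((mask >> (b - 14)) & 1))
            return false;
    return true;
}

static bool opcode_match(enum itype t, uint64_t enco, const struct opcode_matcher *p)
{
    return imatch(t, p) && ematch(enco, p->match, p->mask);
}

int main(void)
{
    /* "Match everything": all slot bits set, all mask bits set. */
    struct opcode_matcher all = { true, true, true, true, 0, 0x7ffffff };
    printf("%d\n", opcode_match(I_SLOT, 0x123456789abull, &all));   /* 1 */
    return 0;
}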
The IBRP matches are advanced with the instruction pointer to the point where opcodes are being
dispersed. The matches from the opcode matchers are ANDed with the IBRP matches at this point.
This produces two opcode match events that are combined with the instruction range check tag
(IBRRangeTag, see Section 10.3.5, “Instruction Address Range Matching”) as follows:

Tag(PMC8) = Match(PMC8) and IBRRangeTag
Tag(PMC9) = Match(PMC9) and IBRRangeTag

As shown in Figure 10-5, the two tags, Tag(PMC8) and Tag(PMC9), are staged down the processor
pipeline until instruction retirement and can be selected as a retired instruction count event (see
event description “IA64_TAGGED_INST_RETIRED” on page 11-165). In this way, a
performance counter (PMC/PMD4,5,6,7) can be used to count the number of retired instructions
within the programmed range that match the specified opcodes.

The opcodes dispersed to different pipelines are compared to PMC8; the opcode match is further
qualified by a number of user configurable bits (please refer to the definition of PMC15 in this
document) and ANDed with the IBRP0 match before being distributed to different places.

Note: Register PMC15 must contain the predetermined value of 0xfffffff0. If software modifies any bits
not listed in Table 10-10, processor behavior is not defined.
10.3.5 Instruction Address Range Matching

The Itanium 2 processor allows event monitoring to be constrained to a range of instruction
addresses. The four architectural Instruction Breakpoint Register Pairs IBRP0-3 (IBR0-7) can be
used to specify the desired address range. Once programmed, this restriction is applied to all
events. In the Itanium 2 processor, registers PMC8,14 specify how the resulting address match is
applied to the performance monitors. With the exception of IA64_INST_RETIRED and prefetch
events, IBRP0 is the only IBR pair used and will be considered the default for this section. For
memory related events, the appropriate bits must be set in PMC13 to enable this feature. Please
refer to Section 10.3.6, “Data Address Range Matching (PMC13)” for details.

Figure 10-16 and Table 10-12 describe the fields of register PMC14. Event address range
checking is controlled by the “ignore address range check” bit (PMC8.ig_ad). When PMC8.ig_ad is
one (or PMC14.ibrp0 is one), all instructions are tagged regardless of IBR settings. In this mode,
events from both IA-32 and Itanium-based code execution contribute to the event count. When both
PMC8.ig_ad and PMC14.ibrp0 are zero, the instruction address range check based on the IBR
settings is applied to all Itanium code fetches. In this mode, IA-32 instructions are never tagged,
and, as a result, events generated by IA-32 code execution are ignored. Table 10-11 defines the
behavior of the instruction address range checker for different combinations of PSR.is and
PMC8.ig_ad or PMC14.ibrp0.

Table 10-11. Itanium® 2 Processor Instruction Address Range Check by Instruction Set
PMC8.ig_ad OR PMC14.ibrp0 | PSR.is = 0 (Itanium®) | PSR.is = 1 (IA-32)
0 | Tag only Itanium instructions if they match the IBR range. | DO NOT tag any IA-32 operations.
1 | Tag all Itanium and IA-32 instructions. Ignore the IBR range.

The processor compares every Itanium instruction fetch address IP{63:0} against the address range
programmed into the architectural instruction breakpoint register pair IBRP0. Regardless of the
value of the instruction breakpoint fault enable (IBR x-bit), the IBRP0 match expression given
below is evaluated for the Itanium 2 processor.

The events which occur before the instruction dispersal stage will fire only if this qualified match
(IBRmatch) is true. This qualified match will be ANDed with the result of the Opcode Matcher PMC8
and further qualified with more user definable bits (see Table 10-12) before being distributed to
different places. The events which occur after the instruction dispersal stage will use this new
qualified match (ibrp0-pmc8 match).
Figure 10-16. Instruction Address Range Configuration Register (PMC14) (fields: ibrp0 [1], ibrp1 [4], ibrp2 [7], ibrp3 [10], fine [13], all other bits reserved)
Table 10-12. Instruction Address Range Configuration Register Fields (PMC14)
Field | Bits | Description
ibrp0 | 1 | 1: No constraint. 0: Non-prefetch PMU events (IA64_TAGGED_INST_RETIRED.00 included) will be constrained by IBRP0.
ibrp1 | 4 | 1: No constraint. 0: Prefetch PMU events (IA64_TAGGED_INST_RETIRED.01 included) will be constrained by IBRP1.
ibrp2 | 7 | 1: No constraint. 0: Non-prefetch PMU events (IA64_TAGGED_INST_RETIRED.10 included) will be constrained by IBRP2.
ibrp3 | 10 | 1: No constraint. 0: Non-prefetch PMU events (IA64_TAGGED_INST_RETIRED.11 included) will be constrained by IBRP3.
fine | 13 | Enable arbitrary range checking (not restricted to powers of 2). 1: IBRP0,2 and IBRP1,3 are paired as lo/hi limit pairs. 0: Normal mode. If set to 1, ibrp0 (lower limit) and ibrp2 (upper limit) are paired together; so are ibrp1 (lower limit) and ibrp3 (upper limit). Bits [63:12] of the upper and lower limits need to be exactly the same but could have any value. Bits [11:0] of the upper limit need to be greater than bits [11:0] of the lower limit. If an address falls in between the upper and lower limits then a match will be signaled for both of the IBR pairs used (ibrp0 and ibrp2 will signal matches at the same time). NOTE: The mask bits programmed in IBRs 1,3,5,7 for bits [11:0] have no effect in this mode.
The IBRP0 match is generated in the following fashion. Note that unless fine mode is used, arbitrary
range checking cannot be performed since the mask bits are in powers of 2. In fine mode, two IBR
pairs are used to specify the upper and lower limits of a range within a page (the upper bits of the lower
and upper limits must be exactly the same).

If PMC14.fine=0, IBRmatch0 = match[IP(63:0), IBR0(63:0), IBR1(55:0)]
Else, IBRmatch0 = match[IP(63:12), IBR0(63:12), IBR1(55:12)] and [IP(11:0) >
IBR0(11:0)] and [IP(11:0) < IBR4(11:0)]
IBRadrmatch0 = IBRmatch0
ibrp0 match = (PMC8.ig_ad or PMC14.ibrp0) or (IBRadrmatch0 and match[PSR.cpl,
IBR1(59:56)])

The instruction range check tag (IBRRangeTag) considers the IBR address ranges only if
PMC8.ig_ad, PMC14.ibrp0 and PSR.is are all zero and if none of the IBR x-bits or PSR.db are
set.

In order to allow simultaneous use of some IBRs for Performance Monitoring and the others for
debugging (the architected purpose of these registers), separate mechanisms are provided for
enabling IBRs, and the x-bit should be cleared to 0 for the IBRP which is going to be used for the PMU.
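The fine-mode branch of the expression above reduces to an equality check on the page-sized upper bits plus an open-interval test on bits [11:0]. The small host-side model below mirrors those strict comparisons; the IBR values are passed in as plain integers, since reading the real IBRs is privileged and not shown.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Fine-mode IBRP0 range check: IBR0 holds the lower limit, IBR4 the upper
 * limit; bits [63:12] of both limits must be identical (same page). */
static bool fine_mode_match(uint64_t ip, uint64_t ibr0_lo, uint64_t ibr4_hi)
{
    if ((ibr0_lo >> 12) != (ibr4_hi >> 12))
        return false;                         /* misconfigured pair          */
    if ((ip >> 12) != (ibr0_lo >> 12))
        return false;                         /* wrong page                  */
    uint64_t off = ip & 0xfff;
    return off > (ibr0_lo & 0xfff) && off < (ibr4_hi & 0xfff);
}

int main(void)
{
    uint64_t lo = 0x2000000000001100ull, hi = 0x2000000000001400ull;
    printf("%d %d\n", fine_mode_match(0x2000000000001200ull, lo, hi),   /* 1 */
                      fine_mode_match(0x2000000000001500ull, lo, hi));  /* 0 */
    return 0;
}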
10.3.5.1 Use of IBRP0 For Instruction Address Range Check – Exception 1

The address range constraint for prefetch events is on the target address of these events rather than
the address of the prefetch instruction. Therefore IBRP1 must be used for constraining these events.

Calculation of the IBRP1 match is the same as that of the IBRP0 match with the exception that
IBR2,3,6 are used instead of IBR0,1,4.

Note: Register PMC14 must contain the predetermined value 0xdb6. If software modifies any bits not
listed in Table 10-12, processor behavior is not defined. It is illegal to have PMC13[48:45]=0000
and PMC8[0]=0 and ((PMC14[2:1]=10 or 00) or (PMC14[5:4]=10 or 00)); this produces
inconsistencies in tagging I-side events in L1D and L2.

10.3.5.2 Use of IBRP0 For Instruction Address Range Check – Exception 2

The Address Range Constraint for the IA64_TAGGED_INST_RETIRED event uses all four IBR
pairs. Calculation of the IBRP2 match is the same as that of the IBRP0 match with the exception that
IBR4,5 (in non-fine mode) are used instead of IBR0,1. Calculation of the IBRP3 match is the same as that
of the IBRP1 match with the exception that we use IBR6,7 (in non-fine mode) instead of IBR2,3.
10.3.6 Data Address Range Matching (PMC13)

For instructions that reference memory, the Itanium 2 processor allows event counting to be
constrained by data address ranges. The 4 architectural Data Breakpoint Registers (DBRs) can be
used to specify the desired address range. Data address range checking capability is controlled by
the Memory Pipeline Event Constraints Register (PMC13).

Figure 10-17 and Table 10-13 describe the fields of register PMC13. When enabled ([1,x0] in the
bits corresponding to one of the 4 DBRs to be used), data address range checking is applied to
loads, stores, semaphore operations, and the lfetch instruction.

Table 10-13. Memory Pipeline Event Constraints Fields (PMC13)
Field | Bits | Description
cfg dbrp0 | 4:3 | These bits determine whether and how DBRP0 should be used for constraining memory pipeline events (where applicable). 00: IBR/Opc/DBR - Use IBRP0/PMC8 and DBRP0 for constraints (i.e., events will be counted only if their Instruction Address, opcodes and Data Address match the values programmed into these registers). 01: IBR/Opc - Use IBRP0/PMC8 for constraints. 10: DBR - Only use DBRP0 for constraints. 11: No constraints. NOTE: When used in conjunction with “fine” mode (see the PMC14 description), only the lower bound DBR pair (DBRP0 or DBRP1) config needs to be set. The upper bound DBR pair config should be left at no constraint. So if IBRP0,2 are chosen for “fine” mode, cfg_dbrp0 needs to be set according to the desired constraints but cfg_dbrp2 should be left as 11 (no constraints).
cfg dbrp1 | 12:11 | These bits determine whether and how DBRP1 should be used for constraining memory pipeline events (where applicable); bit for bit, these match those defined for DBRP0.
cfg dbrp2 | 20:19 | These bits determine whether and how DBRP2 should be used for constraining memory pipeline events (where applicable); bit for bit, these match those defined for DBRP0.
cfg dbrp3 | 48, 28:27 | These bits determine whether and how DBRP3 should be used for constraining memory pipeline events (where applicable); bit for bit, these match those defined for DBRP0.
Enable dbrp0 | 45 | 0 - No constraints. 1 - Constraints as set by cfg dbrp0.
DBRPx match is generated in the following fashion. Arbitrary range checking is not possible since
the mask bits are in powers of 2. Although it is possible to enable more than one DBRP at a time
for checking, it is not recommended. The resulting four matches are combined with PSR.db to form
a single DBR match:

DBRRangeMatch = ((DBRRangeMatch0 or DBRRangeMatch1 or DBRRangeMatch2 or
DBRRangeMatch3) and (not PSR.db))

Events which occur after a memory instruction gets to the EXE stage will fire only if this qualified
match (DBRPx match) is true. The data address is compared to DBRPx; the address match is
further qualified by a number of user configurable bits in PMC13 before being distributed to
different places. DBR matching for performance monitoring ignores the setting of the DBR r, w,
and plm fields.

In order to allow simultaneous use of some DBRs for Performance Monitoring and the others for
debugging (the architected purpose of these registers), separate mechanisms are provided for
enabling DBRs, and the r/w-bit should be cleared to 0 for the DBRP which is going to be used for
the PMU.

Note: Register PMC13 must contain the predetermined value 0x2078fefefefe. If software modifies any
bits not listed in Table 10-13, processor behavior is not defined. It is illegal to have
PMC13[48:45]=0000 and PMC8[0]=0 and ((PMC14[2:1]=10 or 00) or (PMC14[5:4]=10 or 00));
this produces inconsistencies in tagging I-side events in L1D and L3.
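The combination rule above is a plain OR of the individual DBR pair matches gated by PSR.db. The following sketch transcribes it directly; the per-pair match results are assumed to have been computed elsewhere.

#include <stdbool.h>
#include <stdio.h>

/* Combine the four per-pair data address matches, as in the expression above:
 * the combined match is suppressed entirely while PSR.db is set. */
static bool dbr_range_match(const bool m[4], bool psr_db)
{
    return (m[0] || m[1] || m[2] || m[3]) && !psr_db;
}

int main(void)
{
    bool m[4] = { false, true, false, false };
    printf("%d\n", dbr_range_match(m, /*psr_db=*/false));  /* 1 */
    printf("%d\n", dbr_range_match(m, /*psr_db=*/true));   /* 0 */
    return 0;
}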
10.3.7 Event Address Registers (PMC10,11 / PMD0,1,2,3,17)

This section defines the register layout for the Itanium 2 processor instruction and data event
address registers (EARs). Sampling of six events is supported on the Itanium 2 processor:
instruction cache and instruction TLB misses, data cache load misses and data TLB misses, ALAT
misses, and front-end stalls. The EARs are configured through two PMC registers (PMC10,11).
EAR-specific unit masks allow software to specify event collection parameters to hardware.
Instruction and data addresses, operation latencies and other captured event parameters are
provided in five PMD registers (PMD0,1,2,3,17). The instruction and data cache EARs report the
latency of captured cache events and allow latency thresholding to qualify event capture. Event
address data registers (PMD0,1,2,3,17) contain valid data only when event collection is frozen
(PMC0.fr is one). Reads of PMD0,1,2,3,17 while event collection is enabled return undefined values.