Digital Semiconductor
Alpha 21164PC Microprocessor
Hardware Reference Manual
Order Number: EC–R2W0A–TE
Revision/Update Information: This is a preliminary document.
Preliminary
Digital Equipment Corporation
Maynard, Massachusetts
http://www.digital.com/semiconductor
September 1997
While DIGITAL believes the information included in this publication is correct as of the date of publication, it is
subject to change without notice.
Digital Equipment Corporation makes no representations that the use of its products in the manner described in this
publication will not infringe on existing or future patent rights, nor do the descriptions contained in this publication
imply the granting of licenses to make, use, or sell equipment or software in accordance with the description.
DIGITAL, Digital Semiconductor, OpenVMS, VAX, the AlphaGeneration design mark, and the DIGITAL logo are
trademarks of Digital Equipment Corporation.
Digital Semiconductor is a Digital Equipment Corporation business.
GRAFOIL is a registered trademark of Union Carbide Corporation.
IEEE is a registered trademark of The Institute of Electrical and Electronics Engineers, Inc.
Windows NT is a trademark of Microsoft Corporation.
All other trademarks and registered trademarks are the property of their respective owners.
Preface
This manual provides information about the architecture, internal design, external
interface, and specifications of the Digital Semiconductor Alpha 21164PC microprocessor (referred to as the 21164PC) and its associated software.
Audience
This reference manual is for system designers and programmers who use the
21164PC.
Manual Organization
This manual includes the following chapters and appendixes, and an index.
•Chapter 1, Introduction, introduces the 21164PC and provides an overview of
the Alpha architecture.
•Chapter 2, Internal Architecture, describes the major hardware functions and the
internal chip architecture. It describes performance measurement facilities, coding rules, and design examples.
•Chapter 3, Hardware Interface, lists and describes the external hardware interface signals.
•Chapter 4, Clocks, Cache, and External Interface, describes the external bus
functions and transactions, lists bus commands, and describes the clock functions.
•Chapter 5, Internal Processor Registers, lists and describes the 21164PC internal
processor register set.
•Chapter 6, Privileged Architecture Library Code, describes the privileged architecture library code (PALcode).
29 September 1997 – Subject To Change
•Chapter 7, Initialization and Configuration, describes the initialization and con-
figuration sequence.
•Chapter 8, Error Detection and Error Handling, describes error detection and
error handling.
•Chapter 9, Electrical Data, provides electrical data and describes signal integrity
issues.
•Chapter 10, Thermal Management, provides information about thermal management.
•Chapter 11, Mechanical Packaging Information, provides mechanical data and
packaging information, including signal pin lists.
•Chapter 12, Testability and Diagnostics, describes chip and system testability
features.
•Appendix A, Alpha Instruction Set, summarizes the Alpha instruction set.
•Appendix B, 21164PC Microprocessor Specifications, summarizes the
21164PC specifications.
•Appendix C, Serial Icache Load Predecode Values, provides a C code example
that calculates the predecode values of a serial Icache load.
•Appendix D, Errata Sheet, lists changes and revisions to this manual.
•Appendix E, Support, Products, and Documentation, provides phone numbers
for support and lists related DIGITAL and third-party publications with order
information.
•The Glossary lists and defines terms associated with the 21164PC.
The companion volume to this manual, the Alpha AXP Architecture Reference Manual, contains the Alpha architecture information.
Conventions
This section defines product-specific terminology, abbreviations, and other conventions used throughout this manual.
Abbreviations
•Binary Multiples
The abbreviations K, M, and G (kilo, mega, and giga) represent binary multiples
and have the following values.
K = 2¹⁰ (1024)
M = 2²⁰ (1,048,576)
G = 2³⁰ (1,073,741,824)
For example:
2KB = 2 kilobytes = 2 × 2¹⁰ bytes
4MB = 4 megabytes = 4 × 2²⁰ bytes
8GB = 8 gigabytes = 8 × 2³⁰ bytes
•Register Access
The abbreviations used to indicate the type of access to register fields and bits
have the following definitions:
IGN — Ignore
Register bits specified as IGN are ignored when written and are UNPREDICTABLE
when read if not otherwise specified.
MBZ — Must Be Zero
Software must never place a nonzero value in bits and fields specified as
MBZ. Reads return unpredictable values. Such fields are reserved for future
use.
RAO — Read As One
Register bits specified as RAO return a 1 when read.
RAZ — Read As Zero
Register bits specified as RAZ return a 0 when read.
RC — Read To Clear
A register field specified as RC is written by hardware and remains
unchanged until read. The value may be read by software, at which point
hardware may write a new value into the field.
RES — Reserved
Bits and fields specified as RES are reserved by Digital Semiconductor and
should not be used; however, zeros can be written to reserved fields that cannot be masked.
RO — Read Only
Bits and fields specified as RO can be read and are ignored (not written) on
writes.
RW — Read/Write
Bits and fields specified as RW can be read and written.
W0C — Write Zero to Clear
Bits and fields specified as W0C can be read. Writing a zero clears these bits
for the duration of the write; writing a one has no effect.
W1C — Write One to Clear
Bits and fields specified as W1C can be read. Writing a one clears these bits
for the duration of the write; writing a zero has no effect.
WO — Write Only
Bits and fields specified as WO can be written but not read.
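The write-to-clear definitions above can be modeled with two bitwise operations. The following is a minimal C sketch, not 21164PC code: the function names w1c_write and w0c_write are hypothetical, and the model covers only the effect of a software write on the stored field value.

```c
#include <stdint.h>

/* Hypothetical model of a software write to a W1C field:
 * every bit written as 1 is cleared; bits written as 0 keep
 * their current value. */
static uint64_t w1c_write(uint64_t current, uint64_t written)
{
    return current & ~written;
}

/* Hypothetical model of a software write to a W0C field:
 * every bit written as 0 is cleared; bits written as 1 keep
 * their current value. */
static uint64_t w0c_write(uint64_t current, uint64_t written)
{
    return current & written;
}
```

For example, writing 0x30 to a W1C field that currently holds 0xF0 leaves 0xC0 in this model.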
Addresses
Unless otherwise noted, all addresses and offsets are hexadecimal.
Aligned and Unaligned
The terms aligned and naturally aligned are interchangeable and refer to data objects
that are powers of two in size. An aligned datum of size 2ⁿ is stored in memory at a
byte address that is a multiple of 2ⁿ; that is, one that has n low-order zeros. For example,
an aligned 64-byte stack frame has a memory address that is a multiple of 64.
A datum of size 2ⁿ is unaligned if it is stored in a byte address that is not a multiple
of 2ⁿ.
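The alignment rule can be checked by testing the n low-order address bits. This is a minimal C sketch; the helper name is_aligned is an assumption made for illustration, and n is assumed to be less than 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when addr is a multiple of 2^n, that is, when its
 * n low-order bits are all zero. Assumes n < 64. */
static bool is_aligned(uint64_t addr, unsigned n)
{
    uint64_t low_mask = (UINT64_C(1) << n) - 1;
    return (addr & low_mask) == 0;
}
```

With n = 6, the check corresponds to the 64-byte stack frame example: address 0x1000 passes and address 0x1008 does not.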
Bit Notation
Multiple-bit fields can include contiguous and noncontiguous bits contained in angle
brackets (<>). Multiple contiguous bits are indicated by a pair of numbers separated
by a colon (:). For example, <9:7,5,2:0> specifies bits 9, 8, 7, 5, 2, 1, and 0. Similarly,
single bits are frequently indicated with angle brackets. For example, <27> specifies
bit 27.
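The angle-bracket notation maps directly onto bit masks. The sketch below is illustrative C; bit_range and extract_field are hypothetical helper names, and both assume 0 <= lo <= hi <= 63.

```c
#include <stdint.h>

/* Mask with bits <hi:lo> set, inclusive. Assumes lo <= hi <= 63. */
static uint64_t bit_range(unsigned hi, unsigned lo)
{
    unsigned width = hi - lo + 1;
    uint64_t ones = (width == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << width) - 1);
    return ones << lo;
}

/* Value of field <hi:lo>, shifted down so that bit lo becomes bit 0. */
static uint64_t extract_field(uint64_t value, unsigned hi, unsigned lo)
{
    return (value & bit_range(hi, lo)) >> lo;
}
```

Under these definitions, <9:7,5,2:0> is bit_range(9, 7) | bit_range(5, 5) | bit_range(2, 0), which evaluates to 0x3A7.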
Caution
Cautions indicate potential damage to equipment or loss of data.
Data Units
The following data-unit terminology is used throughout this manual.
External
Unless otherwise stated, external means not contained in the 21164PC.
Numbering
All numbers are decimal or hexadecimal unless otherwise indicated. The prefix 0x
indicates a hexadecimal number. For example, 19 is decimal, but 0x19 and 0x19A
are hexadecimal (also see Addresses). Otherwise, the base is indicated by a subscript; for example, 100₂ is a binary number.
Ranges and Extents
Ranges are specified by a pair of numbers separated by two periods (..) and are inclusive. For example, a range of integers 0..4 includes the integers 0, 1, 2, 3, and 4.
Extents are specified by a pair of numbers in angle brackets (<>) separated by a
colon (:) and are inclusive. Bit fields are often specified as extents. For example, bits
<7:3> specifies bits 7, 6, 5, 4, and 3.
Security Holes
Security holes exist when unprivileged software (that is, software that is running outside of kernel mode) can:
•Affect the operation of another process without authorization from the operating
system.
•Amplify its privilege without authorization from the operating system.
•Communicate with another process, either overtly or covertly, without authorization from the operating system.
Signal Names
Signal names are printed in lowercase, boldface type. Low-asserted signals are indicated by the _l suffix, while high-asserted signals have the _h suffix. For example,
osc_clk_in_h is a high-asserted signal, and osc_clk_in_l is a low-asserted signal.
Unpredictable and Undefined
Throughout this manual, the terms UNPREDICTABLE and UNDEFINED are used.
Their meanings are quite different and must be carefully distinguished.
In particular, only privileged software (that is, software running in kernel mode) can
trigger UNDEFINED operations. Unprivileged software cannot trigger UNDEFINED operations. However, either privileged or unprivileged software can trigger
UNPREDICTABLE results or occurrences.
UNPREDICTABLE results or occurrences do not disrupt the basic operation of the
processor. The processor continues to execute instructions in its normal manner. In
contrast, UNDEFINED operations can halt the processor or cause it to lose information.
The terms UNPREDICTABLE and UNDEFINED can be further described as follows:
Unpredictable
•Results or occurrences specified as UNPREDICTABLE may vary from moment
to moment, implementation to implementation, and instruction to instruction
within implementations. Software can never depend on results specified as
UNPREDICTABLE.
•An UNPREDICTABLE result may acquire an arbitrary value subject to a few
constraints. Such a result may be an arbitrary function of the input operands or of
any state information that is accessible to the process in its current access mode.
UNPREDICTABLE results may be unchanged from their previous values.
Operations that produce UNPREDICTABLE results may also produce exceptions.
•An occurrence specified as UNPREDICTABLE may happen or not based on an
arbitrary choice function. The choice function is subject to the same constraints
as are UNPREDICTABLE results and, in particular, must not constitute a security hole.
Specifically, UNPREDICTABLE results must not depend upon, or be a function
of, the contents of memory locations or registers that are inaccessible to the current process in the current access mode.
Also, operations that may produce UNPREDICTABLE results must not:
–Write or modify the contents of memory locations or registers to which the
current process in the current access mode does not have access.
–Halt or hang the system or any of its components.
For example, a security hole would exist if some UNPREDICTABLE result
depended on the value of a register in another process, on the contents of processor
temporary registers left behind by some previously running process, or on a
sequence of actions of different processes.
Undefined
•Operations specified as UNDEFINED may vary from moment to moment,
implementation to implementation, and instruction to instruction within implementations. The operation may vary in effect from nothing to stopping system
operation.
•UNDEFINED operations may halt the processor or cause it to lose information.
However, UNDEFINED operations must not cause the processor to hang, that is,
reach an unhalted state from which there is no transition to a normal state in
which the machine executes instructions. Only privileged software (that is, software running in kernel mode) may trigger UNDEFINED operations.
1
Introduction
This chapter provides a brief introduction to the Alpha architecture, Digital
Equipment Corporation’s RISC (reduced instruction set computing) architecture
designed for high performance. The chapter then summarizes the specific features of
the Digital Semiconductor Alpha 21164PC microprocessor (hereafter called the
21164PC) that implements the Alpha architecture. Appendix A provides a list of
Alpha instructions.
For a complete definition of the Alpha architecture, refer to the companion volume,
the Alpha AXP Architecture Reference Manual.
1.1 The Architecture
The Alpha architecture is a 64-bit load and store RISC architecture designed with
particular emphasis on speed, multiple instruction issue, multiple processors, and
software migration from many operating systems.
All registers are 64 bits long and all operations are performed between 64-bit registers. All instructions are 32 bits long. Memory operations are either load or store
operations. All data manipulation is done between registers.
The Alpha architecture supports the following data types:
•8-, 16-, 32-, and 64-bit integers
•IEEE 32-bit and 64-bit floating-point formats
•VAX architecture 32-bit and 64-bit floating-point formats
In the Alpha architecture, instructions interact with each other only by one instruction writing to a register or memory location and another instruction reading from
that register or memory location. This use of resources makes it easy to build implementations that issue multiple instructions every CPU cycle.
The 21164PC uses a set of subroutines, called privileged architecture library code
(PALcode), that is specific to a particular Alpha operating system implementation
and hardware platform. These subroutines provide operating system primitives for
context switching, interrupts, exceptions, and memory management. These subroutines can be invoked by hardware or CALL_PAL instructions. CALL_PAL instructions use the function field of the instruction to vector to a specified subroutine.
PALcode is written in standard machine code with some implementation-specific
extensions to provide direct access to low-level hardware functions. PALcode supports optimizations for multiple operating systems, flexible memory-management
implementations, and multi-instruction atomic sequences.
The Alpha architecture performs byte shifting and masking with normal 64-bit, register-to-register instructions; it does not include single-byte load and store instructions.
1.1.1 Addressing
The basic addressable unit in the Alpha architecture is the 8-bit byte. The 21164PC
supports a 43-bit virtual address.
Virtual addresses as seen by the program are translated into physical memory
addresses by the memory-management mechanism. The 21164PC supports a 40-bit
uncached and a 33-bit cached physical address space.
1.1.2 Integer Data Types
Alpha architecture supports four integer data types.
Data Type   Description
Byte        A byte is eight contiguous bits that start at an addressable byte boundary. A
            byte is an 8-bit value. A byte is supported in Alpha architecture by the
            EXTRACT, INSERT, LDBU, MASK, SEXTB, STB, ZAP, PACK,
            UNPACK, MIN, MAX, and PERR instructions.
Word        A word is two contiguous bytes that start at an arbitrary byte boundary. A
            word is a 16-bit value. A word is supported in Alpha architecture by the
            EXTRACT, INSERT, LDWU, MASK, SEXTW, STW, PACK, UNPACK,
            MIN, and MAX instructions.
Longword    A longword is four contiguous bytes that start at an arbitrary byte boundary.
            A longword is a 32-bit value. A longword is supported in Alpha architecture
            by sign-extended load and store instructions and by longword
            arithmetic instructions.
Quadword    A quadword is eight contiguous bytes that start at an arbitrary byte boundary.
            A quadword is supported in Alpha architecture by load and store
            instructions and quadword integer operate instructions.
Note: Alpha implementations may impose a significant performance penalty
when accessing operands that are not NATURALLY ALIGNED. Refer
to the Alpha AXP Architecture Reference Manual for details.
1.1.3 Floating-Point Data Types
The 21164PC supports the following floating-point data types:
21164PC Microprocessor Features
The 21164PC is a superscalar pipelined processor manufactured using 0.35-µm
CMOS technology. It is packaged in a 413-pin IPGA carrier and has removable
application-specific heat sinks. The 21164PC has been optimized for uniprocessor
systems with very high cache and memory bandwidth. The 21164PC supports the
new motion video instructions (MVI) added to the Alpha instruction set.
The 21164PC can issue four Alpha instructions in a single cycle, thereby minimizing
the average cycles per instruction (CPI). A number of low-latency and/or high-throughput
features in the instruction issue unit and the onchip components of the
memory subsystem further reduce the average CPI.
The 21164PC and associated PALcode implements IEEE single-precision and double-precision, VAX F_floating and G_floating data types, and supports longword
(32-bit) and quadword (64-bit) integers. Byte (8-bit) and word (16-bit) support is
provided by byte-manipulation instructions. Limited hardware support is provided
for the VAX D_floating data type.
Other 21164PC features include:
•A peak instruction execution rate of four times the CPU clock frequency.
•The ability to issue up to four instructions during each clock cycle.
•An onchip, demand-paged memory-management unit with translation buffer,
which, when used with PALcode, can implement a variety of page table structures and translation algorithms. The unit consists of a 64-entry data translation
buffer (DTB) and a 48-entry instruction translation buffer (ITB), with each entry
able to map a single 8KB page or a group of 8, 64, or 512 8KB pages. The size of
each translation buffer entry’s group is specified by hint bits stored in the entry.
The DTB and ITB implement 7-bit address space numbers (ASN),
(MAX_ASN=127).
•Two onchip, high-throughput pipelined floating-point units, capable of executing
both DIGITAL and IEEE floating-point data types.
•An onchip, 16KB virtual instruction cache with 7-bit ASNs (MAX_ASN=127).
•An onchip, dual-read-ported, 8KB data cache.
•An onchip write buffer with six 32-byte entries.
•A 128-bit data bus with onchip parity and offchip longword parity.
•Support for an external second-level cache. The size and access time of the
external second-level cache is programmable.
•An internal clock generator providing a high-speed clock used by the 21164PC,
and a pair of programmable system clocks for use by the CPU module.
•Onchip performance counters to measure and analyze CPU and system performance.
•Chip and module level test support, including an instruction cache test interface
to support chip and module level testing.
•A 3.3-V external interface and 2.5-V internal interface.
Refer to Chapter 9 for 21164PC dc and ac electrical characteristics. Refer to the
Alpha AXP Architecture Reference Manual for a description of address space numbers (ASNs).
2
Internal Architecture
This chapter provides both an overview of the 21164PC microarchitecture and a system
designer’s view of the 21164PC implementation of the Alpha architecture. The
combination of the 21164PC microarchitecture and privileged architecture library
code (PALcode) defines the chip’s implementation of the Alpha architecture. If a
certain piece of hardware seems to be “architecturally incomplete,” the missing functionality is implemented in PALcode. Chapter 6 provides more information on PALcode.
This chapter describes the major functional hardware units and is not intended to be
a detailed hardware description of the chip. It is organized as follows:
•21164PC microarchitecture
•Pipeline organization
•Scheduling and issuing rules
•Replay traps
•Miss address file (MAF) and load-merging rules
•MTU store instruction execution
•Write buffer and the WMB instruction
•Performance measurement support
•Floating-point control register
•Design examples
2.1 21164PC Microarchitecture
The 21164PC microprocessor is a high-performance implementation of Digital
Equipment Corporation’s Alpha architecture. Figure 2–1 is a block diagram of the
21164PC that shows the major functional blocks relative to pipeline stage flow. The
following paragraphs provide an overview of the chip’s architecture and major functional units.
2.1.1 Instruction Fetch/Decode Unit and Branch Unit
The primary function of the instruction fetch/decode unit and branch unit (IDU) is to
manage and issue instructions to the IEU, MTU, and FEU. It also manages the
instruction cache. The IDU contains:
•Prefetcher and instruction buffer
•Instruction slot and issue logic
•Program counter (PC) and branch prediction logic
•48-entry instruction translation buffers (ITBs)
•Abort logic
•Register conflict logic
•Interrupt and exception logic
2.1.1.1 Instruction Decode and Issue
The IDU decodes up to four instructions in parallel and checks that the required
resources are available for each instruction. The IDU issues only the instructions for
which all required resources are available. The IDU does not issue instructions out of
order, even if the resources are available for a later instruction and not for an earlier
one.
In other words:
•If resources are available, and multiple issue is possible, then all four instruc-
tions are issued.
•If resources are available only for a later instruction and not for an earlier one,
then only the instructions up to the latest one for which resources are available
are issued.
The IDU handles only NATURALLY ALIGNED groups of four instructions
(INT16). The IDU does not advance to a new group of four instructions until all
instructions in a group are issued. If a branch to the middle of an INT16 group
occurs, then the IDU attempts to issue the instructions from the branch target to the
end of the current INT16; the IDU then proceeds to the next INT16 of instructions
after all the instructions in the target INT16 are issued. Thus, achieving maximum
issue rate and optimal performance requires that code be scheduled properly and
that floating or integer NOP instructions be used to fill empty slots in the scheduled
instruction stream.
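The effect of branching into the middle of an INT16 can be made concrete with a little address arithmetic. The following is a minimal C sketch, assuming 4-byte instructions and 16-byte naturally aligned groups as described above; the function names are hypothetical, and the sketch models only how many instruction slots remain in the target group, not resource availability.

```c
#include <stdint.h>

/* Number of instructions from target_pc to the end of its
 * naturally aligned INT16 (four 4-byte instructions). */
static unsigned slots_left_in_int16(uint64_t target_pc)
{
    unsigned slot = (unsigned)((target_pc >> 2) & 0x3); /* slot 0..3 */
    return 4u - slot;
}

/* Address of the INT16 that follows the one containing pc. */
static uint64_t next_int16(uint64_t pc)
{
    return (pc & ~UINT64_C(0xF)) + 16;
}
```

In this model, a branch to address 0x100C leaves only one issuable slot in the target group before the IDU moves on to the INT16 at 0x1010, which is why aligning hot branch targets to INT16 boundaries helps the issue rate.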
For more information on instruction scheduling and issuing, including detailed rules
governing multiple instruction issue, refer to Section 2.3.
2.1.1.2 Instruction Prefetch
The IDU contains an instruction prefetcher and a four-entry, 32-byte-per-entry,
prefetch buffer called the refill buffer. Each instruction cache (Icache) miss is
checked in the refill buffer. If the refill buffer contains the instruction data, it fills the
Icache and instruction buffer simultaneously. If the refill buffer does not contain the
necessary data, a fetch and a number of prefetches are sent to the MTU. One prefetch
is sent per cycle until each of the four entries in the refill buffer is filled or has a
pending fill. The refill buffer holds all returned fill data until the data is required by
the IDU pipeline or until it is overwritten by a subsequent fetch/prefetch sequence
caused by a future Icache miss.
Prefetching does not begin until there is a “true” miss. A true miss is a reference that
misses in the Icache and then also misses in the refill buffer. If an Icache miss results
in a refill buffer hit, prefetching is not started until all the data has been moved from
the refill buffer entry into the pipeline.
Each fill of the Icache by the refill buffer occurs when the instruction buffer stage in
the IDU pipeline requires a new INT16. The INT16 is written into the Icache and the
instruction buffer simultaneously. This can occur at a maximum rate of one Icache
fill per cycle. The actual rate depends on how frequently the instruction buffer stage
requires a new INT16, and on availability of data in the refill buffer.
Once an Icache miss occurs, the Icache enters fill mode. When the Icache is in fill
mode, the refill buffer is checked each cycle to see if it contains the next INT16
required by the instruction buffer.
When the required data is not available in the refill buffer (also a miss), the Icache is
checked for a hit while it awaits the arrival of the data from the Bcache or main
memory. The IDU sends a read request to the CBU by means of the MTU. The CBU
checks the Bcache, and if the request misses, the CBU drives a main memory
request.
If there is an Icache hit at this time, the Icache returns to access mode and the
prefetcher stops sending fetches to the MTU. When a new program counter (PC) is
loaded (that is, taken branches), the Icache returns to access mode until the first miss.
The refill buffer receives and holds instruction data from fetches initiated before the
Icache returned to access mode.
The Icache has a 64-byte block size, whereas the refill buffer is able to load the
Icache with only one INT16 (16 bytes) per cycle. Therefore, each Icache block has
four valid bits, one for each 16-byte subblock.
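The relationship between a 64-byte Icache block and its four 16-byte subblocks can be sketched as follows. This is an illustrative C model only: the bit assignment of the valid mask and the function names are assumptions, not the chip's actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index (0..3) of the 16-byte subblock within its 64-byte block. */
static unsigned subblock_index(uint64_t addr)
{
    return (unsigned)((addr >> 4) & 0x3);
}

/* Returns the valid mask after one INT16 fill at addr.
 * Assumption: bit i of the mask is the valid bit of subblock i. */
static uint8_t mark_filled(uint8_t valid_mask, uint64_t addr)
{
    return (uint8_t)(valid_mask | (1u << subblock_index(addr)));
}

/* True when the subblock containing addr has been filled. */
static bool subblock_valid(uint8_t valid_mask, uint64_t addr)
{
    return ((valid_mask >> subblock_index(addr)) & 1u) != 0;
}
```

Because each fill sets only one bit, a block can be partially valid: after a single fill at 0x1020, a fetch from 0x102C finds valid data while a fetch from 0x1030 in the same block does not.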
2.1.1.3 Branch Execution
When a branch or jump instruction is fetched from the Icache by the prefetcher, the
IDU needs one cycle to calculate the target PC before it is ready to fetch the target
instruction stream. In the second cycle after the fetch, the Icache is accessed at the
target address. Branch and PC prediction are necessary to predict and begin fetching
the target instruction stream before the branch or jump instruction is issued.
The Icache records the outcome of branch instructions in a 2048-entry, 2-bit per
entry branch history table. The table is indexed by the instruction’s virtual address
bits <13:03>. This information is used as the prediction for the next execution of the
branch instruction. The 2-bit history state is a saturating counter that increments on
taken branches and decrements on not-taken branches. The branch is predicted taken
on the top two count values and is predicted not-taken on the bottom two count values.
The history status is not initialized on Icache fill; therefore, it may “remember” a
branch that was evicted from the Icache and subsequently reloaded.
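The 2-bit saturating counter scheme can be sketched in C as follows. This is a behavioral illustration under the parameters stated above (2048 entries, index from virtual address bits <13:03>); the function names are hypothetical, and the counter encoding (0 to 3, with 2 and 3 meaning predicted taken) is an assumption consistent with the description.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 2048u

/* Table index taken from virtual address bits <13:03>. */
static unsigned bht_index(uint64_t va)
{
    return (unsigned)((va >> 3) & (BHT_ENTRIES - 1u));
}

/* Saturating update: count up on taken, down on not-taken. */
static uint8_t bht_update(uint8_t counter, bool taken)
{
    if (taken)
        return (uint8_t)(counter < 3 ? counter + 1 : 3);
    return (uint8_t)(counter > 0 ? counter - 1 : 0);
}

/* Predict taken on the top two count values (2 and 3). */
static bool bht_predict_taken(uint8_t counter)
{
    return counter >= 2;
}
```

The saturation gives hysteresis: a counter at 3 still predicts taken after a single not-taken outcome, so one loop exit does not flip the prediction for the next loop entry.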
The 21164PC does not limit the number of branch predictions outstanding to one. It
predicts branches even while waiting to confirm the prediction of previously predicted branches. There can be one branch prediction pending for each of pipeline
stages 3 and 4, plus up to four in pipeline stage 2. Refer to Section 2.2 for a description of pipeline stages.
When a predicted branch is issued, the IEU or FEU checks the prediction. The
branch history table is updated accordingly. On branch mispredict, a mispredict trap
occurs and the IDU restarts execution from the correct PC.
The 21164PC provides a 12-entry subroutine return stack that is controlled by
decoding the opcode (BSR, HW_REI, and JMP/JSR/RET/JSR_COROUTINE), and
DISP<15:14> in JMP/JSR/RET/JSR_COROUTINE. The stack stores an Icache
index in each entry. The stack is implemented as a circular queue that wraps around
in the overflow and underflow cases.
Table 2–1 lists the effect each of these instructions has on the state of the branch-prediction stack.
Table 2–1 Effect of Branching Instructions on the Branch-Prediction Stack
Instruction       Stack Used for Prediction?   Effect on Stack
BSR, JSR          No                           Push PC+4
RET               Yes                          Pop
JMP, BR, BRxx     No                           No effect
JSR_COROUTINE     Yes                          Pop, then push PC+4
PAL entry         No                           Push PC+4
HW_REI            Yes                          Pop
The 21164PC uses the Icache index hint in the JMP and JSR instructions to predict
the target PC. The Icache index hint in the instruction’s displacement field is used to
access the direct-mapped Icache. The upper bits of the PC are formed from the data
in the Icache tag store at that index. Later in the pipeline, the PC prediction is
checked against the actual PC generated by the IEU. A mismatch causes a PC
mispredict trap and restart from the correct PC. This is similar to branch prediction.
The RET, JSR_COROUTINE, and HW_REI instructions predict the next PC by
using the index from the subroutine return stack. The upper bits of the PC are formed
from the data in the Icache tag at that index. These predictions are checked against
the actual PC in exactly the same way that JMP and JSR predictions are checked.
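The push and pop behavior of a 12-entry circular return stack can be modeled as shown below. This is a minimal C sketch and not the hardware design: the type and function names are hypothetical, each entry holds a plain 64-bit value rather than an Icache index, and the wraparound policy (silently overwrite on overflow, reread stale entries on underflow) is an assumption based on the statement that the queue wraps in both cases.

```c
#include <stdint.h>

#define RSTACK_ENTRIES 12u

typedef struct {
    uint64_t entry[RSTACK_ENTRIES];
    unsigned top;                   /* index of the newest entry */
} return_stack;

/* Push (BSR, JSR, PAL entry): advance with wraparound, then store. */
static void rstack_push(return_stack *s, uint64_t value)
{
    s->top = (s->top + 1u) % RSTACK_ENTRIES;
    s->entry[s->top] = value;
}

/* Pop (RET, HW_REI): read the newest entry, then step back. */
static uint64_t rstack_pop(return_stack *s)
{
    uint64_t value = s->entry[s->top];
    s->top = (s->top + RSTACK_ENTRIES - 1u) % RSTACK_ENTRIES;
    return value;
}

/* JSR_COROUTINE: pop the prediction, then push PC+4. */
static uint64_t rstack_coroutine(return_stack *s, uint64_t pc)
{
    uint64_t predicted = rstack_pop(s);
    rstack_push(s, pc + 4);
    return predicted;
}
```

In this model a thirteenth nested call overwrites the oldest entry, so the outermost return would be mispredicted; the check against the actual PC described above is what keeps such a misprediction harmless.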
The branch-prediction stack never predicts a target address in PALmode. This prevents the possibility of nonprivileged code accessing privileged modes through
incorrect stack predictions (for example, by underflow/overflow of the stack). This
implies that PALcode libraries should avoid using instructions such as RET and
JSR_COROUTINE for internal jumps with PALmode targets, as the 21164PC will
always mispredict the target address.
2.1.1.4 Instruction Translation Buffer
The IDU includes a 48-entry, fully associative instruction translation buffer (ITB).
The buffer stores recently used Istream address translations and protection information for pages ranging from 8KB to 4MB and uses a not-last-used replacement algorithm.
PALcode fills and maintains the ITB. Each entry supports all four granularity hint bit
combinations, so that any single ITB entry can provide translation for up to 512 contiguously mapped 8KB pages. The operating system, using PALcode, must ensure
that virtual addresses can only be mapped through a single ITB entry or superpage
mapping at one time. Multiple simultaneous mapping can cause UNDEFINED
results.
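The four granularity hint combinations correspond to groups of 1, 8, 64, and 512 pages. The following C sketch computes the mapped size for each; the encoding used here (hint value gh selects 8 to the power gh pages) is an assumption for illustration, as are the names BASE_PAGE_BYTES, gh_pages, and gh_bytes.

```c
#include <stdint.h>

#define BASE_PAGE_BYTES (UINT64_C(8) * 1024)   /* 8KB page */

/* Pages mapped by one entry for hint value gh (0..3):
 * 1, 8, 64, or 512. Assumed encoding: 8 to the power gh. */
static uint64_t gh_pages(unsigned gh)
{
    return UINT64_C(1) << (3u * (gh & 0x3u));
}

/* Bytes mapped by one entry for hint value gh. */
static uint64_t gh_bytes(unsigned gh)
{
    return BASE_PAGE_BYTES * gh_pages(gh);
}
```

The largest group, 512 pages of 8KB, gives the 4MB upper bound on the page range quoted for the ITB.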
While not executing in PALmode, the 43-bit virtual PC is routed to the ITB each
cycle. If the page table entry (PTE) associated with the PC is cached in the ITB, the
protection bits for the page that contains the PC are used by the IDU to do the necessary access checks. If there is an Icache miss and the PC is cached in the ITB, the
page frame number (PFN) and protection bits for the page that contains the PC are
used by the IDU to do the address translation and access checks.
The 21164PC’s ITB supports 128 address space numbers (ASNs) (MAX_ASN=127)
by means of a 7-bit ASN field in each ITB entry. PALcode uses the hardware-specific HW_MTPR instruction to write to the architecturally defined ITB_IAP register.
This has the effect of invalidating ITB entries that do not have their ASM bit set.
The 21164PC provides two optional translation extensions called superpages.
Access to superpages is enabled using ICSR<SPE> and is allowed only while executing in privileged mode.
<39:13>, on a one-to-one basis, when virtual address bits <42:41> equal 2. This
maps the entire physical address space four times over to the quadrant of the virtual address space.
•The other superpage maps virtual address bits <29:13> to physical address bits
<29:13>, on a one-to-one basis, and forces physical address bits <39:30> to 0
when virtual address bits <42:30> equal 1FFE
region of physical address space to a single region of the virtual address space
defined by virtual address bits <42:30> = 1FFE
Access to either su perpag e mapping is al lo wed only while execut ing in kerne l mode.
Superpage mapping allows the operating system to map all physical memory to a
privileged virtual memory region.
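The two superpage mappings can be illustrated with a short software model. The following Python sketch is not taken from the manual: the function name and its parameters are hypothetical, and it assumes that the byte-within-page bits <12:0> pass through unchanged, as they do for any 8KB page translation.

```python
VA_BITS = 43  # width of the virtual address, as stated in the text


def superpage_translate(va, spe=True, kernel_mode=True):
    """Return the physical address for a superpage hit, or None otherwise.

    spe models ICSR<SPE>; kernel_mode models the current processor mode.
    Both parameter names are illustrative only.
    """
    if not (spe and kernel_mode):
        return None
    va &= (1 << VA_BITS) - 1
    if (va >> 41) & 0x3 == 2:
        # VA<39:13> maps one-to-one to PA<39:13>; offset bits <12:0> pass through.
        return va & ((1 << 40) - 1)
    if (va >> 30) == 0x1FFE:
        # VA<29:13> maps one-to-one to PA<29:13>; PA<39:30> is forced to 0.
        return va & ((1 << 30) - 1)
    return None
```

A return value of None means only that neither superpage applies; in that case the translation would come from an ITB or DTB entry, which this sketch does not model.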
2.1.1.5 Interrupts
The IDU exception logic supports three sources of interrupts:
•Hardware interrupts
There are 7 level-sensitive hardware interrupt sources supplied by the following
signals:
•Software interrupts
There are 15 prioritized software interrupts sourced by the software interrupt
request register (SIRR) (see Section 5.1.22).
•Asynchronous system traps (ASTs)
There are 4 ASTs sourced by the asynchronous system trap request (ASTRR)
register.
The serial interrupt, the performance counter interrupts, and irq_h<3:0> are all
maskable by bits in the ICSR (see Section 5.1.17). The four AST traps are maskable
by bits in the ASTER (see Section 5.1.21). In addition, the AST traps are qualified
by the current processor mode. All interrupts are disabled when the processor is executing PALcode.
29 September 1997 – Subject To Change
Each interrupt source, or group of sources, is assigned an interrupt priority level
(IPL), as shown in Table 4–11. The current IPL is set using the IPLR register (see
Section 5.1.18). Any interrupts that have an equal or lower IPL are masked. When an
interrupt occurs that has an IPL greater than the value in the IPLR register, program
control passes to the INTERRUPT PALcode entry point. PALcode processes the
interrupt by reading the ISR (see Section 5.1.24) and the INTID register (see
Section 5.1.19).
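The IPL comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name, the source names, and the IPL values used in the usage note are hypothetical and are not the assignments of Table 4–11.

```python
def deliverable_interrupts(requests, current_ipl, in_palmode=False):
    """Return the requesting sources that are not masked, highest IPL first.

    requests maps a source name to its interrupt priority level (IPL).
    A source is masked when its IPL is equal to or lower than current_ipl,
    and all interrupts are disabled while PALcode is executing.
    """
    if in_palmode:
        return []
    eligible = [(ipl, name) for name, ipl in requests.items() if ipl > current_ipl]
    return [name for ipl, name in sorted(eligible, reverse=True)]
```

For example, with hypothetical requests at IPL 20, 5, and 22 and a current IPL of 20, only the IPL 22 source would be delivered.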
2.1.2 Integer Execution Unit
The integer execution unit (IEU) contains two 64-bit integer execution pipelines, E0
and E1, which include the following:
•Two adders
•Two logic boxes
•A barrel shifter
•Byte-manipulation logic
•An integer multiplier
•A motion video instruction unit
The IEU also includes the 40-entry, 64-bit integer register file (IRF) that contains the
32 integer registers defined by the Alpha architecture and 8 PAL shadow registers.
The register file has four read ports and two write ports that provide operands to both
integer execution pipelines and accept results from both pipes. The register file also
accepts load instruction results (memory data) on the same two write ports.
2.1.3 Floating-Point Execution Unit
The onchip, pipelined floating-point unit (FPU) can execute both IEEE and VAX
floating-point instructions. The 21164PC supports IEEE S_floating and T_floating
data types, and all rounding modes. It also supports VAX F_floating and G_floating
data types, and provides limited support for the D_floating format. The FPU contains:
•A 32-entry, 64-bit floating-point register file
•A user-accessible control register
•A floating-point multiply pipeline
•A floating-point add pipeline
The floating-point divide unit is associated with the floating-point add pipeline
but is not pipelined.
The FPU can accept two instructions every cycle, with the exception of floating-point divide instructions. The result latency for nondivide, floating-point instructions
is four cycles.
The floating-point register file (FRF) has five read ports and four write ports. Four of
the read ports are used by the two pipelines to source operands. The remaining read
port is used by floating-point stores. Two of the write ports are used to write results
from the two pipelines. The other two write ports are used to write fills from floating-point loads.
2.1.4 Memory Address Translation Unit
The memory address translation unit (MTU) contains three major sections:
•Data translation buffer (dual ported)
•Miss address file
•Write buffer address file
The MTU receives up to two virtual addresses every cycle from the IEU. The translation buffer generates the corresponding physical addresses and access control
information for each virtual address. The 21164PC implements a 43-bit virtual
address, a 40-bit noncacheable physical address, and a 33-bit cacheable physical
address. Cacheable addresses consist of bits <32:0> when bit <39> = 0. Physical
addresses that set bits <38:33> are not supported by the 21164PC. These addresses
are not checked by the 21164PC and could result in erroneous data.
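Because the 21164PC does not check for unsupported physical addresses, system software may want to validate addresses itself. The following Python sketch is a hypothetical helper, not part of the manual; it assumes that an address with bit <39> set and bits <38:33> clear is a noncacheable address, which is an inference from the text above.

```python
def classify_physical_address(pa):
    """Classify a physical address according to the rules in the text.

    Returns "cacheable", "noncacheable", or "unsupported".
    """
    if pa < 0 or pa >> 40:
        raise ValueError("physical addresses are at most 40 bits wide")
    if (pa >> 33) & 0x3F:
        # Any of bits <38:33> set: not supported, and not checked by hardware.
        return "unsupported"
    if (pa >> 39) & 1:
        # Assumption: bit <39> = 1 selects noncacheable space.
        return "noncacheable"
    return "cacheable"  # bits <32:0> with bit <39> = 0
```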
2.1.4.1 Data Translation Buffer
The 64-entry, fully associative, dual-read-ported data translation buffer (DTB) stores
recently used data stream (Dstream) page table entries (PTEs). Each entry supports
all four granularity hint-bit combinations, so that a single DTB entry can provide
translation for up to 512 contiguously mapped, 8KB pages. The translation buffer
uses a not-last-used replacement algorithm.
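The figure of 512 pages follows from the granularity hint encoding. The Python sketch below assumes the usual Alpha encoding, in which a 2-bit hint value N means that the entry maps 8^N contiguous pages; the function names are illustrative only.

```python
PAGE_BYTES = 8 * 1024  # 8KB pages, as stated in the text


def pages_per_entry(gh):
    """Number of contiguous 8KB pages mapped by one entry for hint value gh."""
    if gh not in range(4):
        raise ValueError("the granularity hint is a 2-bit field (0..3)")
    return 8 ** gh


def bytes_per_entry(gh):
    """Size in bytes of the region mapped by one entry for hint value gh."""
    return pages_per_entry(gh) * PAGE_BYTES
```

Under this assumption, the four hint values cover 1, 8, 64, and 512 pages, so the largest single entry maps 4MB.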
For load and store instructions, and other MTU instructions requiring address translation, the effective 43-bit virtual address is presented to the DTB. If the PTE of the
supplied virtual address is cached in the DTB, the page frame number (PFN) and
protection bits for the page that contains the address are used by the MTU to complete the address translation and access checks.
The DTB also supports the optional superpage extensions that are enabled using
ICSR<SPE>. The DTB superpage maps provide virtual-to-physical address translation for two regions of the virtual address space, as described in Section 2.1.1.4.
PALcode fills and maintains the DTB. The operating system, using PALcode, must
ensure that virtual addresses are mapped either through a single DTB entry or through
superpage mapping. Multiple simultaneous mapping can cause UNDEFINED
results. The only exception to this rule is that any given virtual page may be mapped
twice with identical data in two different DTB entries. This occurs in operating systems, such as OpenVMS, which utilize virtually accessible page tables. If the level 1
page table is accessed virtually, PALcode loads the translation information twice;
once in the double-miss handler, and once in the primary handler. The PTE mapping
the level 1 page table must remain constant during accesses to this page to meet this
requirement.
2.1.4.2 Load Instruction and the Miss Address File
The MTU begins the execution of each load instruction by translating the virtual
address and by accessing the data cache (Dcache). Translation and Dcache tag read
operations occur in parallel. If the addressed location is found in the Dcache (a hit),
then the data from the Dcache is formatted and written to either the integer register
file (IRF) or floating-point register file (FRF). The formatting required depends on
the particular load instruction executed. If the data is not found in the Dcache (a
miss), then the address, target register number, and formatting information are
entered in the miss address file (MAF).
The MAF performs a load-merging function. When a load miss occurs, each MAF
entry is checked to see if it contains a load miss that addresses the same Dcache (32-byte) block. If it does, and certain merging rules are satisfied, then the new load miss
is merged with an existing MAF entry. This allows the MTU to service two or more
load misses with one data fill from the CBU.
There are six MAF entries for load misses and four more for IDU instruction fetches
and prefetches. Load misses are usually the highest MTU priority.
Refer to Section 2.5 for information on load-merging rules.
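The load-merging idea can be modeled in software as follows. This Python sketch is a simplification under stated assumptions: it merges on the 32-byte block address alone and ignores the additional merging rules of Section 2.5, and the class name, method names, and the "full" result are hypothetical. It does not model what the hardware does when no entry is free.

```python
BLOCK_SHIFT = 5        # 32-byte Dcache blocks
MAF_LOAD_ENTRIES = 6   # MAF entries available for load misses


class MissAddressFile:
    """Toy model of load merging in the miss address file (MAF)."""

    def __init__(self):
        self.entries = {}  # block number -> list of target registers

    def load_miss(self, paddr, target_reg):
        block = paddr >> BLOCK_SHIFT
        if block in self.entries:
            # Same 32-byte block: share the outstanding fill.
            self.entries[block].append(target_reg)
            return "merged"
        if len(self.entries) >= MAF_LOAD_ENTRIES:
            return "full"
        self.entries[block] = [target_reg]
        return "allocated"

    def fill(self, paddr):
        """One data fill services every load merged into the entry."""
        return self.entries.pop(paddr >> BLOCK_SHIFT, [])
```

In this model, two misses to addresses 0x1000 and 0x1008 occupy one entry and are both satisfied by a single fill.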
2.1.4.3 Dcache Control and Store Instructions
The Dcache follows a write-through protocol. During the execution of a store
instruction, the MTU probes the Dcache to determine whether the location to be
overwritten is currently cached. If so (a Dcache hit), the Dcache is updated. Regardless of the Dcache state, the MTU forwards the data to the CBU.
A load instruction that is issued one cycle after a store instruction in the pipeline creates a conflict if both the load and store operations access the same memory location.
(The store instruction has not yet updated the location when the load instruction
reads it.) This conflict is handled by forcing the load instruction to take a replay trap;
that is, the IDU flushes the pipeline and restarts execution from the load instruction.
By the time the load instruction arrives at the Dcache the second time, the conflicting
store instruction has written the Dcache and the load instruction is executed normally.
Replay traps can be avoided by scheduling the load instruction to issue three cycles
after the store instruction. If the load instruction is scheduled to issue two cycles after
the store instruction, then it will be issue-stalled for one cycle.
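The scheduling rule for a load that follows a store to the same location can be summarized as a small lookup. The Python sketch below is illustrative only; the function name and the result strings are hypothetical.

```python
def load_after_store_penalty(cycles_after_store, same_location=True):
    """Penalty for a load issued a given number of cycles after a store.

    cycles_after_store is the scheduled issue distance (1 = next cycle).
    """
    if cycles_after_store < 1:
        raise ValueError("the load must issue after the store")
    if not same_location or cycles_after_store >= 3:
        return "none"
    if cycles_after_store == 2:
        return "issue stall (1 cycle)"
    return "replay trap"
```

A code scheduler could use such a check to decide whether to place independent instructions between a store and a dependent load.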
2.1.4.4 Write Buffer
The MTU contains a write buffer that has six 32-byte entries, each of which holds
the data from one or more store instructions that access the same 32-byte block in
memory until the data is written into the Bcache. The write buffer provides a finite,
high-bandwidth resource for receiving store data to minimize the number of CPU
stall cycles. The write buffer and associated WMB instruction are described in Section 2.7.
2.1.5 Cache Control and Bus Interface Unit
The cache control and bus interface unit (CBU) processes all accesses sent by the
MTU and implements all memory-related external interface functions, particularly
the coherence protocol functions for write-back caching. It controls the board-level
backup cache (Bcache). The CBU handles all instruction and primary Dcache read
misses and performs the function of writing data from the write buffer into the
shared coherent memory subsystem. The CBU also controls the 128-bit bidirectional
data bus, address bus, and I/O control. Chapter 4 describes the external interface.
2.1.6 Cache Organization
The 21164PC has two onchip caches: a primary data cache (Dcache) and a primary
instruction cache (Icache). All memory cells in the onchip caches are fully static,
six-transistor, CMOS structures.
The 21164PC also provides control for the external cache (Bcache).
2.1.6.1 Data Cache
The data cache (Dcache) is a dual-read-ported, single-write-ported, 8KB cache. It is
a write-through, read-allocate, direct-mapped, byte-accessible, physical cache with
32-byte blocks and data parity at the byte level.
2.1.6.2 Instruction Cache
The instruction cache (Icache) is a 16KB, virtual, direct-mapped cache with 64-byte
blocks and 32-byte fills. Each block tag contains:
•A 7-bit address space number (ASN) field as defined by the Alpha architecture
•A 1-bit address space match (ASM) field as defined by the Alpha architecture
•A 1-bit PALcode (physically addressed) indicator
Software, rather than Icache hardware, maintains Icache coherence with memory.
2.1.6.3 External Cache
The CBU implements control for an external, direct-mapped, physical, write-back,
write-allocate cache with 64-byte blocks. The 21164PC supports board-level cache
sizes of 512KB, 1MB, 2MB, and 4MB.
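For the two physically addressed, direct-mapped caches (the Dcache and the Bcache), the cache sizes and block sizes given above determine how a physical address selects a block. The Python sketch below uses the conventional tag/index/offset split for a direct-mapped cache; the function names are hypothetical, and the exact bits that the hardware stores in each tag are not stated in this section.

```python
def block_count(cache_bytes, block_bytes):
    """Number of blocks in a direct-mapped cache."""
    return cache_bytes // block_bytes


def split_address(addr, cache_bytes, block_bytes):
    """Split a physical address into (tag, index, offset)."""
    offset = addr % block_bytes
    index = (addr // block_bytes) % block_count(cache_bytes, block_bytes)
    tag = addr // cache_bytes
    return tag, index, offset


DCACHE = (8 * 1024, 32)          # 8KB, 32-byte blocks
BCACHE_MIN = (512 * 1024, 64)    # smallest supported Bcache, 64-byte blocks
BCACHE_MAX = (4 * 1024 * 1024, 64)
```

With these parameters the Dcache has 256 blocks, and the supported Bcache sizes range from 8192 to 65536 blocks.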
2.1.7 Serial Read-Only Memory Interface
The serial read-only memory (SROM) interface provides the initialization data load
path from a system SROM to the Icache. Chapter 7 provides information about the
SROM interface.
2.2 Pipeline Organization
The 21164PC has a 7-stage (or 7-cycle) pipeline for integer operate and memory reference instructions, and a 9-stage pipeline for floating-point operate instructions. The
IDU maintains state for all pipeline stages to track outstanding register write operations.
Figure 2–2 shows the integer operate, memory reference, and floating-point operate
pipelines for the IDU, FPU, IEU, and MTU. The first four stages are executed in the
IDU. Remaining stages are executed by the IEU, FEU, MTU, and CBU. There are
bypass paths that allow the result of one instruction to be used as a source operand of
a following instruction before it is written to the register file.
Tables 2–2, 2–3, 2–4, 2–5, 2–6, and 2–7 provide examples of events at various stages
of pipelining during instruction execution.
[Figure 2–2: pipeline stage diagram (graphic not reproduced). Labels recovered from the figure:]
Stages common to all pipelines: 0 (IC); 1 (IB), determine next PC; 2 (SL), slot by function unit; 3 (AC), register file access checks and integer register file access.
Integer operate pipeline: stage 4, first integer operate stage; stage 5, second integer operate stage, if needed; stage 6, write integer register file. Arithmetic, logical, shift, and compare instructions complete in pipeline stage 4 (1-cycle latency). CMOV completes in stage 5 (2-cycle latency). IMULL has an 8-cycle or 9-cycle latency. CMOV or BR can issue in parallel (0-cycle latency) with a dependent CMP instruction.
Floating-point pipeline: floating-point register file access; first floating-point operate stage; write floating-point register file, last floating-point operate stage.
Memory reference pipeline: Dcache read begins; Dcache read ends; use Dcache data, store writes Dcache; Bcache tag/data access begins; Bcache tag access ends, 1st datum returned; fill Dcache/Icache (1st OW); use Bcache data; 2nd datum returned; fill Dcache/Icache (2nd OW). Bcache read latency is 5-20 CPU cycles. Bcache cycle time is 2-10 CPU cycles.
Table 2–2 Pipeline Examples—All Cases
Pipeline Stage   Events
0                Access Icache tag and data.
1                Buffer four instructions, check for branches, calculate branch
                 displacements, and check for Icache hit.
2                Slot-swap instructions around so they are headed for pipelines capable of
                 executing them. Stall preceding stages if all instructions in this stage
                 cannot issue simultaneously because of function unit conflicts.
3                Check the operands of each instruction to see that the source is valid and
                 available and that no write-write hazards exist. Read the IRF. Stall
                 preceding stages if any instruction cannot be issued. All source operands
                 must be available at the end of this stage for the instruction to issue.
Table 2–3 Pipeline Examples—Integer Add
Pipeline Stage   Events
4                Perform the add operation.
5                Result is available for use by an operate function in this cycle.
6                Write the IRF. Result is available for use by an operate function in this
                 cycle.
Table 2–4 Pipeline Examples—Floating Add
Pipeline Stage   Events
4                Read the FRF.
5                First stage of FEU add pipeline.
6                Second stage of FEU add pipeline.
7                Third stage of FEU add pipeline.
8                Fourth stage of FEU add pipeline. Write the FRF.
9                Result is available for use by an operate function in this cycle. For
                 instance, pipeline stage 5 of the user instruction can coincide with
                 pipeline stage 9 of the producer (latency of 4).
Table 2–5 Pipeline Examples—Load (Dcache Hit)
Pipeline Stage¹  Events
4                Calculate the effective address. Begin the Dcache data and tag store
                 access.
5                Finish the Dcache data and tag store access. Detect Dcache hit. Format
                 the data as required. Bcache arbitration defaults to pipe E0 in
                 anticipation of a possible miss.
6                Write the IRF or FRF. Data is available for use by an operate function in
                 this cycle.
¹ Pipe E0 has not been defined at this point.
Table 2–6 Pipeline Examples—Load (Dcache Miss)
Pipeline Stage¹  Events
4                Calculate the effective address. Begin the Dcache data and tag store
                 access.
5                Finish the Dcache data and tag store access. Detect Dcache miss. Bcache
                 arbitration defaults to pipe E0 in anticipation of a possible miss. If there
                 are load instructions in both E0 and E1, the load instruction in E1 would
                 be delayed at least one more cycle because default arbitration
                 speculatively assumes the load in E0 will miss.
6                Forward physical address to pins.
7                Begin Bcache access, cycle 1.
8                N more CPU cycles waiting for Bcache data.
9                Receive Bcache data at the pins, send data to the Dcache.
10               Begin Dcache fill. Format the data as required.
11               Finish the Dcache fill. Write the integer or floating-point register file.
                 Data is available for use by an operate function in this cycle.
¹ Pipes E0 and E1 have not been defined at this point.
Table 2–7 Pipeline Examples—Store (Dcache Hit)
Pipeline Stage   Events
4                Calculate the effective address. Begin the Dcache tag store access.
5                Finish the Dcache tag store access. Detect Dcache hit. Send store to the
                 write buffer simultaneously.
6                Write the Dcache data store if hit (write begins this cycle).
2.2.1 Pipeline Stages and Instruction Issue
The 21164PC pipeline divides instruction processing into four static and a number of
dynamic stages of execution. The first four stages consist of the instruction fetch,
buffer and decode, slotting, and issue-check logic. These stages are static in that
instructions may remain valid in the same pipeline stage for multiple cycles while
waiting for a resource or stalling for other reasons. Dynamic stages (IEU and FEU)
always advance state and are unaffected by any stall in the pipeline. A pipeline stall
may occur while zero instructions issue, or while some instructions of a set of four
issue and the others are held at the issue stage. A pipeline stall implies that a valid
instruction is (or instructions are) presented to be issued but cannot proceed.
Upon satisfying all issue requirements, instructions are issued into their slotted pipeline. After issuing, instructions cannot stall in a subsequent pipeline stage. The issue
stage is responsible for ensuring that all resource conflicts are resolved before an
instruction is allowed to continue. The only means of stopping instructions after the
issue stage is an abort condition. (The term abort as used here is different from its use
in the Alpha AXP Architecture Reference Manual.)
2.2.2 Aborts and Exceptions
Aborts result from a number of causes. In general, they can be grouped into two
classes, exceptions (including interrupts) and nonexceptions. The difference between
the two is that exceptions require that the pipeline be drained of all outstanding
instructions before restarting the pipeline at a redirected address. In either case, the
pipeline must be flushed of all instructions that were fetched subsequent to the
instruction that caused the abort condition (arithmetic exceptions are an exception to
this rule). This includes aborting some instructions of a multiple-issued set in the
case of an abort condition on the one instruction in the set.
The nonexception case does not need to drain the pipeline of all outstanding instructions ahead of the aborting instruction. The pipeline can be restarted immediately at a
redirected address. Examples of nonexception abort conditions are branch mispredictions, subroutine call/return mispredictions, and replay traps. Data cache misses
can cause aborts or issue stalls depending on the cycle-by-cycle timing.
In the event of an exception other than an arithmetic exception, the processor aborts
all instructions issued after the exceptional instruction, as described in the preceding
paragraphs. Due to the nature of some exception conditions, this may occur as late as
the integer register file (IRF) write cycle. In the case of an arithmetic exception, the
processor may execute instructions issued after the exceptional instruction.
After aborting, the address of the exceptional instruction or the immediately subsequent instruction is latched in the EXC_ADDR internal processor register (IPR). In
the case of an arithmetic exception, EXC_ADDR contains the address of the instruction immediately after the last instruction executed. (Every instruction prior to the
last instruction executed was also executed.) For machine check and interrupts,
EXC_ADDR points to the instruction immediately following the last instruction executed. For the remaining cases, EXC_ADDR points to the exceptional instruction;
where, in all cases, its execution should naturally restart.
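The EXC_ADDR cases described above can be restated as a small decision function. This Python sketch is a paraphrase of the text, not a description of the hardware logic: the function name, the event names, and the parameters are hypothetical, and it relies on Alpha instructions being 4 bytes long.

```python
INSTRUCTION_BYTES = 4  # Alpha instructions are 32 bits wide


def exc_addr(event, exceptional_pc=None, last_executed_pc=None):
    """Address latched in EXC_ADDR for a given kind of event.

    For arithmetic exceptions, machine checks, and interrupts, EXC_ADDR is
    the address of the instruction after the last one executed. For all
    other exceptions it is the address of the exceptional instruction.
    """
    if event in ("arithmetic", "machine_check", "interrupt"):
        return last_executed_pc + INSTRUCTION_BYTES
    return exceptional_pc
```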
When the pipeline is fully drained, the processor begins instruction execution at the
address given by the PALcode dispatch. The pipeline is drained when all outstanding
write operations to both the IRF and FRF have completed and all outstanding
instructions have passed the point in the pipeline such that they are guaranteed to
complete without an exception in the absence of a machine check.
Replay traps are aborts that occur when an instruction requires a resource that is not
available at some point in the pipeline. These are usually MTU resources whose
availability could not be anticipated accurately at issue time (refer to Section 2.4). If
the necessary resource is not available when the instruction requires it, the instruction is aborted and the IDU begins fetching at exactly that instruction, thereby
replaying the instruction in the pipeline. A slight variation on this is the load-miss-and-use replay trap in which an operate instruction is issued just as a Dcache hit is
being evaluated to determine if one of the instruction’s operands is valid. If the result
is a Dcache miss, then the operate instruction is aborted and replayed.
2.2.3 Nonissue Conditions
There are two reasons for nonissue conditions. The first is a pipeline stall wherein a
valid instruction or set of instructions are prepared to issue but cannot due to a
resource conflict (register conflict or function unit conflict). These types of nonissue
cycles can be minimized through code scheduling.
The second type of nonissue conditions consists of pipeline bubbles where there is
no valid instruction in the pipeline to issue. Pipeline bubbles result from the abort
conditions described in the previous section. In addition, a single pipeline bubble is
produced whenever a branch type instruction is predicted to be taken, including subroutine calls and returns.
Pipeline bubbles are reduced directly by the instruction buffer hardware and through
bubble squashing, but can also be effectively minimized through careful coding
practices. Bubble squashing involves the ability of the first four pipeline stages to
advance whenever a bubble or buffer slot is detected in the pipeline stage immediately ahead of it while the pipeline is otherwise stalled.
2.3 Scheduling and Issuing Rules
The following sections define the classes of instructions and provide rules for
instruction slotting, instruction issuing, and latency.
2.3.1 Instruction Class Definition and Instruction Slotting
The scheduling and multiple issue rules presented here are performance related only;
that is, there are no functional dependencies related to scheduling or multiple issuing.
The rules are defined in terms of instruction classes. Table 2–8 specifies all of
the instruction classes and the pipeline that executes the particular class. With a few
additional rules, the table provides the information necessary to determine the functional resource conflicts that determine which instructions can issue in a given cycle.
Table 2–8 Instruction Classes and Slotting
Class Name  Pipeline        Instruction List
LD          E0¹ or E1²      All loads except LDx_L
ST          E0              All stores except STx_C
MBX         E0              LDx_L, MB, WMB, STx_C, HW_LD-lock, HW_ST-cond, FETCH
RX          E0              RS, RC
MXPR        E0 or E1        HW_MFPR, HW_MTPR
            (depends on
            the IPR)
IBR         E1              Integer conditional branches
FBR         FA³             Floating-point conditional branches
JSR         E1              Jump-to-subroutine instructions: JMP, JSR, RET, or
                            JSR_COROUTINE, BSR, BR, HW_REI, CALLPAL
IADD        E0 or E1        ADDL, ADDL/V, ADDQ, ADDQ/V, SUBL, SUBL/V, SUBQ,
[text missing in source]    MINSB4, MINUB8, MINUW4, MAXUB8, MAXUW4,
                            MAXSB8, MAXSW4
FADD        FA              Floating-point operates, including CPYSN and CPYSE, except
                            multiply, divide, and CPYS
FDIV        FA              Floating-point divide
FMUL        FM⁴             Floating-point multiply
FCPYS       FM or FA        CPYS, not including CPYSN or CPYSE
MISC        E0              RPCC, TRAPB
UNOP⁵       None            UNOP
¹ IEU pipeline 0.
² IEU pipeline 1.
³ FEU add pipeline.
⁴ FEU multiply pipeline.
⁵ UNOP is LDQ_U R31,0(Rx).
Slotting
The slotting function in the IDU determines which instructions will be sent forward
to attempt to issue. The slotting function detects and removes all static functional
resource conflicts. The set of instructions output by the slotting function will issue if
no register or other dynamic resource conflict is detected in stage 3 of the pipeline.
The slotting algorithm follows:
Starting from the first (lowest addressed) valid instruction in the INT16 in stage
2 of the 21164PC IDU pipeline, attempt to assign that instruction to one of the
four pipelines (E0, E1, FA, FM). If it is an instruction that can issue in either E0
or E1, assign it to E0. However, if one of the following is true, assign it to E1:
•E0 is not free and E1 is free.
•The next integer instruction in this INT16 can issue only in E0. (In this context, an integer instruction is one that can issue in one or both of E0 or E1, but not FA or FM.)
If the current instruction is one that can issue in either FA or FM, assign it to FA
unless FA is not free. If it is an FA-only instruction, it must be assigned to FA. If
it is an FM-only instruction, it must be assigned to FM. Mark the pipeline
selected by this process as taken and resume with the next sequential instruction.
Stop when an instruction cannot be allocated in an execution pipeline because
any pipeline it can use is already taken.
The slotting logic does not send instructions forward out of logical instruction order
because the 21164PC always issues instructions in order. The slotting logic also
enforces the special rules in the following list, stopping the slotting process when a
rule would be violated by allocating the next instruction an execution pipeline:
•An instruction of class LD cannot be issued simultaneously with an instruction
of class ST.
•All instructions are discarded at the slotting stage after a predicted-taken IBR or
FBR class instruction, or a JSR class instruction.
•After a predicted not-taken IBR or FBR, no other IBR, FBR, or JSR class can be
slotted together.
•The following cases are detected by the slotting logic:
–From lowest address to highest within an INT16, with the following arrangement:
I-instruction is any instruction that can issue in one or both of E0 or E1.
F-instruction is any instruction that can issue in one or both of FA or FM.
–From lowest address to highest within an INT16, with the following arrangement:
When this type of case is detected, the first two instructions are forwarded to
the issue point in one cycle. The second two are sent only when the first two
have both issued, provided no other slotting rule would prevent the second
two from being slotted in the same cycle.
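The basic slotting algorithm, together with the LD/ST rule from the list above, can be sketched as a small software model. The following Python code is an illustration under stated assumptions, not the IDU logic: the names are hypothetical, the class-to-pipeline map contains only classes whose pipeline assignment is fixed in Table 2–8 (MXPR is omitted because its pipeline depends on the IPR), the branch-related rules and the four-instruction arrangement cases are not modeled, and an instruction whose preferred integer pipeline is taken is assumed to fall back to the other one if it is free.

```python
# Pipelines each class may use, in default preference order.
PIPES = {
    "LD": ("E0", "E1"), "ST": ("E0",), "MBX": ("E0",), "RX": ("E0",),
    "IBR": ("E1",), "FBR": ("FA",), "JSR": ("E1",), "IADD": ("E0", "E1"),
    "FADD": ("FA",), "FDIV": ("FA",), "FMUL": ("FM",),
    "FCPYS": ("FA", "FM"), "MISC": ("E0",), "UNOP": (),
}


def is_integer(cls):
    """True if the class can issue in E0 and/or E1 but not in FA or FM."""
    return bool(PIPES[cls]) and set(PIPES[cls]) <= {"E0", "E1"}


def slot(int16):
    """Slot up to four instruction classes, given in address order.

    Returns the (class, pipeline) pairs sent forward to the issue stage.
    """
    taken, slotted, result = set(), [], []
    for i, cls in enumerate(int16):
        # Special rule: LD and ST class instructions are never slotted together.
        if (cls == "LD" and "ST" in slotted) or (cls == "ST" and "LD" in slotted):
            break
        order = list(PIPES[cls])
        if order == ["E0", "E1"]:
            nxt = next((c for c in int16[i + 1:] if is_integer(c)), None)
            e0_busy_e1_free = "E0" in taken and "E1" not in taken
            if e0_busy_e1_free or (nxt is not None and PIPES[nxt] == ("E0",)):
                order = ["E1", "E0"]
        pipe = next((p for p in order if p not in taken), None)
        if order and pipe is None:
            break  # every pipeline this instruction can use is already taken
        if pipe is not None:
            taken.add(pipe)
        slotted.append(cls)
        result.append((cls, pipe))
    return result
```

For instance, in this model the sequence LD, IADD, IADD slots only the first two instructions, because both integer pipelines are taken when the second IADD is considered.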
2.3.2 Coding Guidelines
Code should be scheduled according to latency and function unit availability. This is
good practice in most RISC architectures. Code alignment and the effects of split-issue
should be considered. (Split-issue is the situation in which not all instructions sent from
the slotting stage to the issue stage issue. One or more stalls result.)
Instructions [a] (the LDL) and [b] (the first ADDL) in the following example are
slotted together. Instruction [b] stalls (split-issue), thus preventing instruction [c]
from advancing to the issue stage:
Code example showing          Code example showing
incorrect ordering            correct ordering
NOTES: The instruction examples are assumed to begin on an INT16
alignment. (n) = Expected execute cycle.
Eventually [b] issues when the result of [a] is returned from a presumed Dcache hit.
Instruction [c] is delayed because it cannot advance to the issue stage until [b] issues.
In the improved sequence, the LDL [d] is slotted with the NOP [e]. Then the first
ADDL [f] is slotted with the second ADDL [g] and those two instructions dual-issue.
This sequence takes one less cycle to complete than the first sequence.
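The one-cycle difference between the two orderings can be reproduced with a toy timing model. The Python sketch below is a simplification, not the issue logic: the register operands are hypothetical, it takes a load latency of 2 (Dcache hit) and an integer add latency of 1, and it assumes that a slotted group reaches the issue stage one cycle after the previous group has completely issued.

```python
def issue_cycles(groups):
    """Return the issue cycle of each instruction.

    groups is a list of slotted groups in program order. Each instruction is
    a tuple (label, dest, sources, latency); dest may be None.
    """
    ready = {}    # register -> first cycle in which a consumer may issue
    issued = {}
    arrival = 1   # cycle in which the current group reaches the issue stage
    for group in groups:
        cycle = arrival
        for label, dest, sources, latency in group:
            # In-order issue: never earlier than the previous instruction.
            cycle = max([cycle] + [ready.get(src, 0) for src in sources])
            issued[label] = cycle
            if dest is not None:
                ready[dest] = cycle + latency
        arrival = cycle + 1
    return issued


# Hypothetical operands; only the dependence on the load result matters.
incorrect = [
    [("a", "R2", ["R1"], 2), ("b", "R4", ["R2", "R3"], 1)],  # LDL, ADDL
    [("c", "R6", ["R2", "R5"], 1)],                          # ADDL
]
correct = [
    [("d", "R2", ["R1"], 2), ("e", None, [], 1)],            # LDL, NOP
    [("f", "R4", ["R2", "R3"], 1), ("g", "R6", ["R2", "R5"], 1)],
]
```

Under these assumptions the last instruction of the first ordering issues in cycle 4, while both ADDL instructions of the improved ordering issue in cycle 3.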
2.3.3 Instruction Latencies
After slotting, instruction issue is governed by the availability of registers for read or
write operations, and the availability of the floating divide unit and the integer multiply
unit. There are producer–consumer dependencies, producer–producer dependencies (also known as write-after-write conflicts), and dynamic function unit
availability dependencies (integer multiply and floating divide). The IDU logic in
stage 3 of the 21164PC pipeline detects all these conflicts.
The latency to produce a valid result for most instructions is fixed. The exceptions
are loads that miss, floating-point divides, and integer multiplies. Table 2–9 gives the
latencies for each instruction class. A latency of 1 means that the result may be used
by an instruction issued one cycle after the producing instruction. Most latencies are
only a property of the producer. An exception is integer multiply latencies. There are
no variations in latency due to which a particular unit produces a given result relative
to the particular unit that consumes it. In the case of integer multiply, the instruction
is issued at the time determined by the standard latency numbers. The multiply’s
latency is dependent on which previous instructions produced its operands and when
they executed.
Table 2–9 Instruction Latencies
Additional Time Before
Result Available to
ClassLatency
LDDcache hits, latency=2.
Dcache miss/Bcache hit, latency=10 or longer.
2
Integer Multiply Unit
1 cycle
STStore operations produce no result.—
MBXLDx_L Dcache hits, latency=2.
LDx_L Dcache miss/Bcache hit, latency=10 or longer.
2
—
LDx_L Dcache miss/Bcache miss, latency depends on memory
subsystem state.
STx_C, latency depends on memory subsystem state.
MB, WMB, and FETCH produce no result.
RXRS, RC, latency=1.2 cycles
MXPRHW_MFPR, latency=1, 2, or longer, depending on the IPR.
1 or 2 cycles
HW_MTPR, produces no result.
IBRProduces no result. (Taken branch issue latency minimum=1
2 cycles
SHIFTLatency=1.2 cycles
CMOV Latency=2.1 cycle
ICMPLatency=1.
IMULL Latency=8, plus up to 2 cycles of added latency, depen ding o n the
source of the data.
3
1
Latency until next IMULL, IMULQ, or
2 cycles
1 cycle
IMULH instruction can issue (if there are no data dependencies) is
4 cycles plus the number of cycles added to the latency.
2–24Internal Architecture
29 September 1997 – Subject To Change
Page 55
Scheduling and Issuing Rules
Table 2–9 Instruction Latencies (Sheet 2 of 2)

Class   Latency                                                         Additional Time Before Result
                                                                        Available to Integer Multiply Unit [1]
IMULQ   Latency=12, plus up to 2 cycles of added latency, depending on  1 cycle
        the source of the data. [1]
        Latency until next IMULL, IMULQ, or IMULH instruction can
        issue (if there are no data dependencies) is 8 cycles plus the
        number of cycles added to the latency.
IMULH   Latency=14, plus up to 2 cycles of added latency, depending on  1 cycle
        the source of the data. [1]
        Latency until next IMULL, IMULQ, or IMULH instruction can
        issue (if there are no data dependencies) is 8 cycles plus the
        number of cycles added to the latency.
MVI     Latency=2.                                                      1 cycle
FADD    Latency=4.                                                      —
FDIV    Data-dependent latency: 15 to 31 single precision, 22 to 60     —
        double precision. The next floating divide can be issued in
        the same cycle in which the result of the previous divide
        becomes available, regardless of data dependencies.
FMUL    Latency=4.                                                      —
FCPYS   Latency=4.                                                      —
MISC    RPCC, latency=2. TRAPB produces no result.                      1 cycle
UNOP    UNOP produces no result.                                        —

[1] The multiplier is unable to receive data from IEU bypass paths. The instruction issues at the expected time,
but its latency is increased by the time it takes for the input data to become available to the multiplier. For
example, an IMULL instruction issued one cycle later than an ADDL instruction, which produced one of its
operands, has a latency of 10 (8 + 2). If the IMULL instruction is issued two cycles later than the ADDL
instruction, the latency is 9 (8 + 1).
[2] When idle, Bcache arbitration predicts a load miss in E0. If a load actually does miss in E0, it is sent to the
Bcache immediately. If it hits in the Bcache, and no other event in the CBU affects the operation, the requested
data is available for use in 10 or more cycles. Otherwise, the request takes longer (possibly much longer,
depending on the state of the CBU and memory). It should be possible to schedule some unrolled code loops
for Bcache by prefetching data into the Dcache using LDQ R31, x(Rx).
[3] A special bypass provides an effective latency of 0 (zero) cycles for an ICMP or ILOG instruction producing
the test operand of an IBR or CMOV instruction. This is true only when the IBR or CMOV instruction issues
in the same cycle as the ICMP or ILOG instruction that produced the test operand of the IBR or CMOV
instruction. In all other cases, the effective latency of ICMP and ILOG instructions is 1 cycle.
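The multiplier footnote to Table 2–9 can be expressed as a small arithmetic model. The following Python sketch is illustrative only and is not taken from the manual: the function names are hypothetical, and the model assumes that an operand reaches the multiplier after the producer's latency plus its "additional time" column value. The ADDL parameters used in the usage note (latency 1, 2 additional cycles) are assumptions chosen to reproduce the footnote's example.

```python
# Base latencies and multiplier reissue intervals, taken from Table 2-9.
IMUL_BASE_LATENCY = {"IMULL": 8, "IMULQ": 12, "IMULH": 14}
IMUL_REISSUE_INTERVAL = {"IMULL": 4, "IMULQ": 8, "IMULH": 8}


def added_latency(producer_latency, extra_to_multiplier, issue_distance):
    """Cycles the multiplier waits for an operand (assumed model).

    producer_latency:    Latency column of the producing class.
    extra_to_multiplier: "additional time" column of the producing class.
    issue_distance:      cycles between issue of the producer and the multiply.
    """
    operand_ready = producer_latency + extra_to_multiplier
    return max(0, operand_ready - issue_distance)


def imul_latency(op, producer_latency, extra_to_multiplier, issue_distance):
    """Effective result latency of a multiply fed by a recent producer."""
    return IMUL_BASE_LATENCY[op] + added_latency(
        producer_latency, extra_to_multiplier, issue_distance)


def next_imul_issue_interval(op, added):
    """Cycles until the next multiply can issue, absent data dependencies."""
    return IMUL_REISSUE_INTERVAL[op] + added
```

Under these assumptions, imul_latency("IMULL", 1, 2, 1) gives 10 and imul_latency("IMULL", 1, 2, 2) gives 9, matching the footnote's two cases.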
2.3.3.1 Producer–Producer Latency
Producer–producer latency, also known as a write-after-write conflict, causes issue stalls to preserve write order. If two instructions write the same register, they are
forced to do so in different cycles by the IDU. This is necessary to ensure that the
correct result is left in the register file after both instructions have executed. For most
instructions, the order in which they write the register file is dictated by issue order.
However, IMUL, FDIV, and LD instructions may require more time than other
instructions to complete. Subsequent instructions that write the same destination register are issue-stalled to preserve write ordering at the register file.
Conditions that involve an intervening producer–consumer conflict can occur commonly in a multiple-issue situation when a register is reused. In these cases, producer–consumer latencies are equal to or greater than the required producer–
producer latency as determined by write ordering and therefore dictate the overall
latency.
An example of this case is shown in the following code:
LDQ R2,0(R0) ;R2 destination
ADDQ R2,R3,R4 ;wr-rd conflict stalls execution waiting for R2
LDQ R2,D(R1) ;wr-wr conflict may dual issue when ADDQ issues
Producer–producer latency is generally determined by applying the rule that register
file write operations must occur in the correct order (enforced by IDU hardware).
Two IADD or ILOG class instructions that write the same register issue at least one
cycle apart. The same is true of a pair of CMOV-class instructions, even though their
latency is 2. For IMUL, FDIV, and LD instructions, a producer–producer conflict
with any subsequent instruction results in the second instruction being issue-stalled
until the IMUL, FDIV, or LD instruction is about to complete. The second instruction is issued as soon as it is guaranteed to write the register file at least one cycle
after the IMUL, FDIV, or LD instruction.
If a load writes a register, and within two cycles a subsequent instruction writes the
same register, the subsequent instruction is issued speculatively, assuming the load
hits. If the load misses, a load-miss-and-use trap is generated. This causes the second
instruction to be replayed by the IDU. When the second instruction again reaches the
issue point, it is issue-stalled until the load fill occurs.
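The write-ordering rule can be illustrated with a simplified model. The sketch below is an assumption-laden approximation, not the IDU's actual logic: it treats an instruction's result latency as a proxy for the cycle in which it writes the register file, and the function name is hypothetical.

```python
def earliest_second_issue(first_issue, first_latency, second_latency):
    """Earliest issue cycle for a second writer of the same register.

    Assumed model: an instruction issued in cycle c with latency n writes
    the register file in cycle c + n. The second writer must issue at least
    one cycle after the first and must write at least one cycle after it.
    """
    first_write = first_issue + first_latency
    return max(first_issue + 1, first_write + 1 - second_latency)
```

With this model, two IADD-class (latency 1) or two CMOV-class (latency 2) writers issue one cycle apart, while a latency-1 writer that follows an IMULL (latency 8) is held until the multiply is about to complete.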
2.3.4 Issue Rules
The following is a list of conditions that prevent the 21164PC from issuing an
instruction:
•No instruction can be issued until all of its source and destination registers are
clean; that is, all outstanding write operations to the destination register are guaranteed to complete in issue order and there are no outstanding write operations to
the source registers, or those write operations can be bypassed.
Technically, load-miss-and-use replay traps are an exception to this rule. The
consumer of the load’s result issues, and is aborted, because a load was predicted
to hit and was discovered to miss just as the consumer instruction issued. In
practice, the only difference is that the latency of the consumer may be longer
than it would have been had the issue logic “known” the load would miss in time
to prevent issue.
•An instruction of class LD cannot be issued in the second cycle after an instruction of class ST is issued.
•No LD, ST, MXPR (to an MTU register), or MBX class instructions can be
issued after an MB instruction has been issued until the MB instruction has been
acknowledged by the CBU.
•No LD, ST, MXPR (to an MTU register), or MBX class instructions can be
issued after a STx_C (or HW_ST-cond) instruction has been issued until the
MTU writes the success/failure result of the STx_C (HW_ST-cond) in its destination register.
•No IMUL instructions can be issued if the integer multiplier is busy.
•No floating-point divide instructions can be issued if the floating-point divider is
busy.
•No instruction can be issued to pipe E0 exactly two cycles before an integer multiplication completes.
•No instruction can be issued to pipe FA exactly five cycles before a floating-point
divide completes.
•No Store instruction can be issued exactly three cycles before a fill. The data
store write operation, if the store hits, will conflict with the fill operation.
•No instruction can be issued to pipe E0 or E1 exactly two cycles before an integer
register fill is requested (speculatively) by the CBU, except IMULL,
IMULQ, and IMULH instructions and instructions that do not produce any
result.
•No LD, ST, or MBX class instructions can be issued to pipe E0 or E1 exactly
one cycle before an integer register fill is requested (speculatively) by the CBU.
•No instruction issues after a TRAPB instruction until all previously issued
instructions are guaranteed to finish without generating a trap other than a
machine check.
All instructions sent to the issue stage (stage 3) by the slotting logic (stage 2) are
issued subject to the previous rules. If issue is prevented for a given instruction at the
issue stage, all logically subsequent instructions at that stage are prevented from
issuing automatically. The 21164PC only issues instructions in order.
2.4 Replay Traps
There are no stalls after the instruction issue point in the pipeline. In some situations,
an MTU instruction cannot be executed because of insufficient resources (or some
other reason). These instructions trap and the IDU restarts their execution from the
beginning of the pipeline. This is called a replay trap. Replay traps occur in the following cases:
•The write buffer is full when a store instruction is executed and there are already
six write buffer entries allocated. The trap occurs even if the entry would have
merged in the write buffer.
•A load instruction is issued in pipe E0 when all six MAF entries are valid (not
available), or a load instruction issued in pipe E1 when five of the six MAF
entries are valid. The trap occurs even if the load instruction would have hit in
the Dcache or merged with an MAF entry.
•Alpha shared memory model order trap (Litmus test 1 trap): If a load instruction
is issued whose address matches any miss in the MAF (down to the quadword
boundary), the load instruction is aborted through a replay trap regardless of
whether the newly issued load instruction hits or misses in the Dcache. This
ensures that the two loads execute in issue order.
•Load-after-store trap: A replay trap occurs if a load instruction is issued in the
cycle immediately following a store instruction that hits in the Dcache, and both
access the same location. The address match is exact for address bits <12:2>
(longword granularity), but ignores address bits <42:13>.
•When a load instruction is followed, within one cycle, by any instruction that
uses the result of that load, and the load misses in the Dcache, the consumer
instruction traps and is restarted from the beginning of the pipeline. This occurs
because the consumer instruction is issued speculatively while the Dcache hit is
being evaluated. If the load misses in the Dcache, the speculative issue of the
consumer instruction was incorrect. The replay trap generally brings the consumer instruction to the issue point before or simultaneously with the availability
of fill data.
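The address comparisons used by two of the replay traps listed above can be sketched as follows. This Python fragment is illustrative only; the function names are hypothetical, and it models just the bit ranges stated in the list (bits <12:2> for the load-after-store trap, quadword granularity for the Litmus test 1 trap), not the timing conditions.

```python
def bit_field(value, high, low):
    """Extract bits <high:low> of value."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def load_after_store_match(store_address, load_address):
    """Compare address bits <12:2>; bits <42:13> are ignored."""
    return bit_field(store_address, 12, 2) == bit_field(load_address, 12, 2)


def litmus1_match(load_address, maf_address):
    """Compare addresses down to the quadword (8-byte) boundary."""
    return (load_address >> 3) == (maf_address >> 3)
```

Because bits <42:13> are ignored, load_after_store_match(0x0004, 0x2004) is true even though the two addresses differ, so a load can trap on a store to an unrelated location that shares the low-order bits.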
2.5 Miss Address File and Load-Merging Rules
The following sections describe the miss address file (MAF) and its load-merging
function, and the load-merging rules that apply after a load miss.
2.5.1 Merging Rules
When a load miss occurs, each MAF entry is checked to see if it contains a load miss
that addresses the same 32-byte Dcache block. If it does, and certain merging rules
are satisfied, then the new load miss is merged with an existing MAF entry. This
allows the MTU to service two or more load misses with one data fill from the CBU.
The merging rules for an individual MAF entry are different for cacheable and noncacheable space.
2.5.1.1 Cacheable Space Load-Merge Rules
The merging rules for cacheable space loads (physical address bit <39>=0) are as
follows:
•Merging only occurs if the new load miss addresses a different INT8 from all
loads previously entered or merged to that MAF entry. If it addresses the same
INT8, the machine traps and replays the instruction. This continues until the
MAF entry is retired, at which time the trapping load hits in the Dcache.
•Bytes, words, longwords, and quadwords can merge with each other, provided
that they are not in the same INT8.
•Merging is prevented for the MAF entry after the first data fill (to that MAF
entry) from the Bcache, regardless of whether the Bcache access hits or not.
•Load misses that match any MAF address down to the INT32 boundary, but
could not merge (for any reason), are replay trapped. Once the Dcache is filled,
this load instruction executes and hits in the Dcache.
All DREAD load-merging is prevented when MAF_MODE<00>=1 (see
Section 5.2.16).
Note: Merging rules result primarily from limitations of the implementation.
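The cacheable-space rules can be summarized as a decision for one MAF entry. The following Python sketch is a simplified illustration under stated assumptions: the entry is represented by its 32-byte block number, the set of INT8 indexes already recorded, and a single hypothetical merging_closed flag standing for any condition that prevents merging (for example, a first data fill having occurred, or MAF_MODE<00>=1).

```python
def int32_block(address):
    """32-byte (INT32) block number of an address."""
    return address >> 5


def int8_index(address):
    """Index (0 to 3) of the INT8 within its 32-byte block."""
    return (address >> 3) & 0x3


def cacheable_merge(entry_block, entry_int8s, merging_closed, address):
    """Classify a new load miss against one MAF entry.

    Returns "no-match" if the miss is to a different block, "merge" if it
    can join the entry, or "replay" if it matches but cannot merge.
    """
    if int32_block(address) != entry_block:
        return "no-match"
    if merging_closed or int8_index(address) in entry_int8s:
        return "replay"
    return "merge"
```

For an entry holding block 8 with INT8 0 recorded, a miss to address 0x108 (INT8 1) merges, while a second miss to address 0x104 (INT8 0) is replay trapped.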
2.5.1.2 Noncacheable Space Load-Merge Rules
The merging rules for noncacheable space loads (physical address bit <39>=1) are as
follows:
•Merging only occurs if the new load miss addresses a different INT8 from all
loads previously entered or merged to that MAF entry. If it addresses the same
INT8, the machine traps and replays the instruction. This continues until the
MAF entry is retired, at which time the trapping load hits in the Dcache.
•Only quadwords can merge with other quadwords, provided they are not in the
same INT8. Bytes, words, and longwords cannot merge.
•Merging stops for a load instruction to noncacheable space as soon as the CBU
accepts the reference. This permits the system environment to access only those
INT8s that are actually requested by load instructions.
•All accesses that could not merge (except those to the same INT8) are allocated
new MAF entries.
Noncacheable space load-merging is prevented when MAF_MODE<03>=1. All
DREAD load-merging is prevented when MAF_MODE<00>=1 (see
Section 5.2.16).
At the external interface, noncacheable read instructions indicate to the system environment which INT32 is addressed and which of the INT8s within the INT32 are
actually accessed. Each load for longword, word, or byte data results in a separate
request to the CBU.
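As an illustration of what a merged noncacheable read conveys, the sketch below derives the INT32 address and the set of accessed INT8s from a group of quadword loads. It is a hypothetical model: the function names and the encoding of the INT8 set as a 4-bit mask (bit i for INT8 i) are assumptions made for the example and are not the external interface's actual signal encoding.

```python
def is_noncacheable(physical_address):
    """Noncacheable space is selected by physical address bit <39>."""
    return ((physical_address >> 39) & 1) == 1


def read_request(quadword_addresses):
    """Return (INT32 address, INT8 mask) for quadword loads in one MAF entry."""
    block = quadword_addresses[0] >> 5
    mask = 0
    for address in quadword_addresses:
        if address >> 5 != block:
            raise ValueError("loads in one MAF entry must share an INT32")
        bit = 1 << ((address >> 3) & 0x3)
        if mask & bit:
            raise ValueError("two loads to the same INT8 cannot merge")
        mask |= bit
    return block << 5, mask
```

Two quadword loads at offsets 0x40 and 0x58 in noncacheable space would produce one request for the INT32 at offset 0x40 with mask 0b1001, so the system need only access the first and last INT8 of that block.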
2.5.2 Read Requests to the CBU
When two load instructions issue simultaneously and both miss, merging is done
as if they had been issued sequentially, with the load from IEU pipe E0 first.
The MTU sends a read request to the CBU for each MAF entry allocated.
A bypass is provided so that if the load instruction issues in IEU pipe E0, and no
MAF requests are pending, the load instruction’s read request is sent to the CBU
immediately, provided the CBU is ready for such an access. Similarly, if a load
instruction from IEU pipe E1 misses, and there was no load instruction in pipe E0 to
begin with, the E1 load miss is sent to the CBU immediately. In either case, the
bypassed read request is aborted if the load hits in the Dcache, merges in the MAF,
or is replay trapped by the MTU.
2.5.3 MAF Entries and MAF Full Conditions
There are six MAF entries for load misses and four for IDU instruction fetches and
prefetches. Load misses are usually the highest MTU priority request.
If the MAF is full and a load instruction issues in pipe E0, or if five of the six MAF
entries are valid and a load instruction issues in pipe E1, an MAF full trap occurs
causing the IDU to restart execution with the load instruction that caused the MAF
overflow. When the load instruction arrives at the MAF the second time, an MAF
entry may have become available. If not, the MAF full trap occurs again.
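The MAF full condition differs by pipe, which a short sketch makes explicit. The function name is hypothetical; the thresholds are the ones stated above.

```python
MAF_LOAD_ENTRIES = 6  # MAF entries available for load misses


def maf_full_trap(pipe, valid_entries):
    """Return True if a load issued in the given pipe takes an MAF full trap."""
    if pipe == "E0":
        return valid_entries >= MAF_LOAD_ENTRIES
    if pipe == "E1":
        # A load in E1 traps one entry earlier than a load in E0.
        return valid_entries >= MAF_LOAD_ENTRIES - 1
    raise ValueError("loads issue only in pipe E0 or E1")
```

With five valid entries, a load in pipe E0 proceeds but a load in pipe E1 traps.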
2.5.4 Fill Operation
Eventually, the CBU provides the data requested for a given MAF entry (a fill). The
CBU requests that the IDU allocate up to three consecutive “bubble” cycles in the
IEU pipelines. The first bubble prevents any store instruction from issuing. The second bubble prevents any instructions from issuing. The third bubble prevents only
MTU instructions (particularly load and store instructions) from issuing. The first
bubble prevents store data from colliding with the fill in the data cache. The fill uses
the second bubble cycle as it progresses down the IEU/MTU pipelines to format the
data and load the register file. It uses the third bubble cycle to fill the Dcache.
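The effect of the three bubble cycles on instruction issue can be tabulated in a few lines. This is a sketch only: the function name is hypothetical, and the set of classes treated as MTU instructions (LD, ST, and MBX) is an assumption made for the example.

```python
MTU_CLASSES = {"LD", "ST", "MBX"}  # assumed set of MTU instruction classes


def may_issue_in_bubble(bubble, instruction_class):
    """Return True if the class may issue in the given fill bubble (1 to 3)."""
    if bubble == 1:
        return instruction_class != "ST"           # no store instructions
    if bubble == 2:
        return False                               # no instructions at all
    if bubble == 3:
        return instruction_class not in MTU_CLASSES  # no MTU instructions
    raise ValueError("a fill reserves at most three bubble cycles")
```

For example, an IADD-class instruction can issue in the first and third bubbles but not the second.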
An instruction typically writes the register file in pipeline stage 6 (see Figure 2–2).
Because there is only one register file write port per integer pipeline, a no-instruction
bubble cycle is required to reserve a register file write port for the fill. A load or store
instruction accesses the Dcache in the second half of stage 4 and the first half of
stage 5. The fill operation writes the Dcache, making it unavailable for other
accesses at that time. Relative to the register file write operation, the Dcache (write)
access for a fill occurs a cycle later than the Dcache access for a load hit. Only load
and store instructions use the Dcache in the pipeline. Therefore, the third bubble
reserved for a fill is a no-MTU-instruction bubble.
Up to two floating or integer registers may be written for each CBU fill cycle. Fills
deliver 32 bytes in two cycles: two INT8s per cycle. The MAF merging rules ensure
that there is no more than one register to write for each INT8, so that there is a register file write port available for each INT8. After appropriate formatting, data from
each INT8 is written into the IRF or FRF provided there is a miss recorded for that
INT8.
Load misses are all checked against the write buffer contents for conflicts between
new load instructions and previously issued store instructions. Refer to Section 2.7
for more information on write operations.
LDL_L and LDQ_L instructions always allocate a new MAF entry if they miss the
Dcache. LDL_L and LDQ_L instructions that hit in the Dcache are retired by the
MTU immediately. No load instructions that follow an LDL_L or LDQ_L instruction are allowed to merge with it. After an LDL_L or LDQ_L instruction is issued
(and misses in the Dcache), the IDU does not issue any more MTU instructions until
the MTU has successfully sent the LDL_L or LDQ_L instruction to the CBU. This
guarantees correct ordering between an LDL_L or LDQ_L instruction and a subsequent STL_C or STQ_C instruction even if they access different addresses.
2.6 MTU Store Instruction Execution
Store instructions execute in the MTU by:
1. Reading the Dcache tag store in the pipeline stage in which a load instruction
would read the Dcache
2. Checking for a hit in the next stage
3. Writing the Dcache data store, if there is a hit, in the second (following)
pipeline stage
Load instructions are not allowed to issue in the second cycle after a store instruction
(one bubble cycle). Other instructions can be issued in that cycle. Store instructions
can issue at the rate of one per cycle because store instructions in the Dstream do not
conflict in their use of resources. The Dcache tag store and Dcache data store are the
principal resources. However, a load instruction uses the Dcache data store in the
same early stage that it uses the Dcache tag store. Therefore, a load instruction would
conflict with a store instruction if it were issued in the second cycle after any store
instruction. Refer to Section 2.2 for more information on store instruction execution
in the pipeline.
A load instruction that is issued one cycle after a store instruction in the pipeline creates a conflict if both access exactly the same memory location. This occurs because
the store instruction has not yet updated the location when the load instruction reads
it. This conflict is handled by forcing the load instruction to replay trap. The IDU
flushes the pipeline and restarts execution from the load instruction. By the time the
load instruction arrives at the Dcache the second time, the conflicting store instruction has written the Dcache and the load instruction is executed normally.
Software should not load data immediately after storing it. The replay trap that is
incurred “costs” seven cycles. The best solution is to schedule the load instruction to
issue three cycles after the store. No issue stalls or replay traps will occur in that
case. If the load instruction is scheduled to issue two cycles after the store instruction, it will be issue-stalled for one cycle. This is not an optimal solution, but is preferred over incurring a replay trap on the load instruction.
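The scheduling advice above reduces to a small penalty table. The Python sketch below is illustrative; the function name is hypothetical, and the conflict parameter is assumed to mean that the store hits in the Dcache and the load accesses the same location.

```python
REPLAY_TRAP_COST = 7  # cycles lost to a load-after-store replay trap


def load_after_store_penalty(issue_distance, conflict):
    """Cycles lost when a load issues issue_distance cycles after a store."""
    if issue_distance < 1:
        raise ValueError("the load must issue after the store")
    if issue_distance == 1:
        # Replay trap only if both access the same location.
        return REPLAY_TRAP_COST if conflict else 0
    if issue_distance == 2:
        return 1  # issue-stalled for one cycle
    return 0      # three or more cycles apart: no stall, no trap
```

A compiler or hand scheduler would therefore prefer to place at least two independent instructions between a store and a dependent load.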
For each store instruction, a search of the MAF is done to detect load-before-store
hazards. If a store instruction is executed, and a load of the same address is present in
the MAF, two things happen:
1. Bits are set in each conflicting MAF entry to prevent its fill from being placed in
the Dcache when it arrives, and to prevent subsequent load instructions from
merging with that MAF entry.
2. Conflict bits are set with the store instruction in the write buffer to prevent the
store instruction from being issued until all conflicting load instructions have
been issued to the CBU.
Conflict checking is done at the 32-byte block granularity. This ensures proper
results from the load instructions and prevents incorrect data from being cached in
the Dcache.
A check is performed for each new store against store instructions in the write buffer
that have already been sent to the CBU but have not been completed. Section 2.7
describes this process.
2.7 Write Buffer and the WMB Instruction
The following sections describe the write buffer and the WMB instruction.
2.7.1 The Write Buffer
The write buffer contains six fully associative 32-byte entries. The purpose of the
write buffer is to minimize the number of CPU stall cycles by providing a finite,
high-bandwidth resource for receiving store data. This is required because the
21164PC can generate store data at the peak rate of one INT8 every CPU cycle. This
is greater than the average rate at which the Bcache can accept the data.
In addition to HW_ST and other store instructions, the STQ_C and STL_C instructions are also written into the write buffer and sent to the CBU. However, unlike
store instructions, these write buffer-directed instructions are never merged into a
write buffer entry with other instructions.
2.7.2 The Write Memory Barrier (WMB) Instruction
The memory barrier (MB) instruction is suitable for ordering memory references of
any kind. The WMB instruction forces ordering of write operations only (store
instructions). The WMB instruction has a special effect on the write buffer. When it
is executed, a bit is set in every write buffer entry containing valid store data that will
prevent future store instructions from merging with any of the entries. Also, the next
entry to be allocated is marked with a WMB flag. At this point, the entry marked
with the WMB flag does not yet have valid data in it. When an entry marked with a
WMB flag is ready to issue to the CBU, the entry is not issued until every previously
issued write instruction is complete. This ensures correct ordering between store
instructions issued before the WMB instruction and store instructions issued after it.
Each write buffer entry contains a content-addressable memory (CAM) for holding
physical address bits <39:05>, 32 bytes of data, 32 byte-mask bits (that indicate
which of the 32 bytes in the entry contain valid data), and miscellaneous control bits.
Among the control bits are the WMB flag, and a no-merge bit, which indicates that
the entry is closed to further merging.
2.7.3 Entry-Pointer Queues
Two entry-pointer queues are associated with the write buffer: a free-entry queue and
a pending-request queue. The free-entry queue contains pointers to available invalid
write buffer entries. The pending-request queue contains pointers to valid write
buffer entries that have not yet been issued to the CBU. The pending-request queue
is ordered in allocation order.
Each time the write buffer is presented with a store instruction, the physical address
generated by the instruction is compared to the address in each valid write buffer
entry that is open for merging. If the address is in the same INT32 as an address in a
valid write buffer entry (that also contains a store instruction), and the entry is open
for merging, then the new store data is merged into that entry and the entry’s byte
mask bits are updated. If no matching address is found, or all entries are closed to
merging, then the store data is written into the entry at the top of the free-entry
queue. This entry is validated, and a pointer to the entry is moved from the free-entry
queue to the pending-request queue.
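The allocation and merging behavior described above can be modeled with two queues. The Python sketch below is a toy model, not the MTU implementation: the class and field names are hypothetical, only entries still in the pending-request queue are considered for merging, byte-mask bit i is assumed to correspond to byte i of the block, and stores are assumed to be naturally aligned so that they never cross a 32-byte boundary.

```python
from collections import deque


class WriteBuffer:
    """Toy model of write buffer allocation and merging (not cycle accurate)."""

    def __init__(self, entries=6):
        self.entries = [None] * entries
        self.free = deque(range(entries))  # free-entry queue
        self.pending = deque()             # pending-request queue

    def store(self, address, size):
        """Merge or allocate; return the entry index, or None if full."""
        block = address >> 5                        # INT32 (32-byte) block
        mask = ((1 << size) - 1) << (address & 31)  # one mask bit per byte
        for index in self.pending:
            entry = self.entries[index]
            if entry["block"] == block and not entry["no_merge"]:
                entry["mask"] |= mask               # merge into open entry
                return index
        if not self.free:
            return None  # the hardware takes a replay trap in this case
        index = self.free.popleft()
        self.entries[index] = {"block": block, "mask": mask, "no_merge": False}
        self.pending.append(index)
        return index
```

Two quadword stores to addresses 0x100 and 0x108 share entry 0 with byte mask 0xFFFF, whereas a store to 0x120 falls in a different INT32 and allocates a new entry.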
2.7.4 Write Buffer Entry Processing
When the number of entries in the pending-request queue reaches the number programmed in MAF_MODE<WB_SET_LO_THRESH> [4], the MTU begins arbitration
with the other MTU queue requests. Once the request is granted, the MTU sends the
entry at the head of the pending-request queue to the CBU. The MTU then removes
the entry from the pending-request queue without placing it in the free-entry queue.
When the CBU has completely processed the write buffer entry, it notifies the MTU,
and the now invalid write buffer entry is placed in the free-entry queue. The MTU
may request that up to five additional write buffer entries be processed while waiting
for the CBU to finish the first. The write buffer entries are invalidated and placed in
the free-entry queue in the order that the requests complete. This order may be different from the order in which the requests were made.
The MTU sends write requests from the write buffer to the CBU. The CBU processes these requests according to the cache coherence protocol. Typically, this
involves loading the target block into the Bcache, making it writable, and then writing it. Because the Bcache is write-back, this completes the operation.
The MTU continues to request that write buffer entries be processed as long as one
of the following occurs:
•One buffer contains an STQ_C or STL_C instruction
•One buffer is marked by a WMB flag
•An MB instruction is being executed by the MTU
•The number of entries in the write buffer exceeds the number programmed in
MAF_MODE<WB_CLR_LO_THRESH>.
This ensures that these instructions complete as quickly as possible.
[4] The following actions can also cause the WB to begin arbitration: (1) an MB or WMB
instruction is issued, or (2) 264 cycles have elapsed without completing a write operation
while there were pending write operations in the WB (triggered by the WB write counter).
The MTU requests that a write buffer entry be processed every 264 cycles (provided
there is a valid entry in the write buffer), even if the write buffer is not arbitrating.
This ensures that write instructions do not wait forever to be written to memory.
(This is triggered by a free-running timer that is reset each time a write operation is
completed.)
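The conditions that start and sustain write buffer arbitration can be collected into two predicates. This sketch is illustrative only; the function and parameter names are hypothetical, and the threshold arguments stand for the values programmed in MAF_MODE<WB_SET_LO_THRESH> and MAF_MODE<WB_CLR_LO_THRESH>.

```python
WB_TIMEOUT_CYCLES = 264  # timer limit, reset each time a write completes


def should_begin_arbitration(pending, set_lo_thresh, barrier_issued,
                             cycles_since_write):
    """Return True if the write buffer should start arbitrating."""
    if pending >= set_lo_thresh:
        return True
    if barrier_issued:  # an MB or WMB instruction was issued
        return True
    return pending > 0 and cycles_since_write >= WB_TIMEOUT_CYCLES


def should_keep_requesting(valid_entries, clr_lo_thresh, has_stx_c,
                           has_wmb_flag, mb_executing):
    """Return True while the MTU keeps requesting entry processing."""
    return (has_stx_c or has_wmb_flag or mb_executing
            or valid_entries > clr_lo_thresh)
```

For instance, a single pending entry below the set threshold is still written out once 264 cycles pass without a completed write.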
When an LDL_L or LDQ_L instruction is processed by the MTU, the MTU requests
processing of the next pending write buffer request. This increases the chances of the
write buffer being empty when an STL_C or STQ_C instruction is issued.
Every store instruction that does not merge in the write buffer is checked against
every valid entry. If any entry is an address match, then the WMB flag is set on the
newly allocated write buffer entry. This prevents the MTU from concurrently sending two write instructions to exactly the same block in the CBU.
Load misses are checked in the write buffer for conflicts. The granularity of this
check is an INT32. Any load instruction matching any write buffer entry’s address is
considered a hit even if it does not access a byte marked for update in that write
buffer entry. If a load hits in the write buffer, a conflict bit is set in the load instruction’s MAF entry, which prevents the load instruction from being issued to the CBU
before the conflicting write buffer entry has been issued and completed. At the same
time, the no-merge bit is set in every write buffer entry with which the load hit. A
write buffer flush flag is also set. The MTU continues to request that write buffer
entries be processed until all the entries that were ahead of, and including, the conflicting write instructions at the time of the load hit have been processed.
2.7.5 Ordering of Noncacheable Space Write Instructions
Special logic ensures that write instructions to noncacheable space are sent offchip in
the order in which their corresponding buffers were allocated (placed in the pending-request queue).
The 21164PC contains a performance-recording feature. The implementation of this
feature provides a mechanism to count various hardware events and causes an interrupt upon counter overflow. Interrupts are triggered six cycles after the event and,
therefore, the exception PC might not reflect the exact instruction causing counter
overflow. Three counters are provided to allow accurate comparison of two variables
under a potentially nonrepeatable experimental condition. The three counters are
designated counter 0 (16 bits), counter 1 (16 bits), and counter 2 (14 bits).
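Because an interrupt is raised on counter overflow, the counter widths determine how many events can elapse before an interrupt. The sketch below computes a starting value that would overflow after a chosen number of events. It is an assumption-based illustration: it presumes that a counter can be preloaded through the PMCTR register and that overflow occurs when the counter wraps past its maximum value; the function name is hypothetical.

```python
COUNTER_BITS = {0: 16, 1: 16, 2: 14}  # counter widths stated above


def preload_value(counter, events):
    """Starting count that overflows after the given number of events."""
    limit = 1 << COUNTER_BITS[counter]
    if not 1 <= events <= limit:
        raise ValueError("event count does not fit in this counter")
    return limit - events
```

Under these assumptions, counter 2 can count at most 16384 events between overflows, compared with 65536 for counters 0 and 1.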
Counter inputs include:
•Issues
•Nonissues
•Total cycles
•Pipe dry
•Pipe freeze
•Mispredicts and cache misses
•Counts for various instruction classifications
For information about counter control, refer to the following IPR descriptions:
•Hardware interrupt clear (HWINT_CLR) register (see Section 5.1.23)
•Interrupt summary register (ISR) (see Section 5.1.24)
•Performance counter (PMCTR) register (see Section 5.1.27)
•CBU configuration control (CBOX_CONFIG2) register bits <13:08> (see
Section 5.3.4)
2.8.1 CBU Performance Counters
The counters in the CBU (counters 0 and 1) are used to count Bcache and system bus
events. There are request events from the MTU to the CBU (three types), requests
from the CBU to the system (three types), and requests from the system to the CBU
(four types).
MTU-to-CBU Requests
The MTU can issue the following requests:
•Istream read request (32 bytes of instruction data), due to an Icache miss
•Dstream read request (32 bytes of noninstruction data), due to a Dcache miss
Read and write requests can be to either cacheable or I/O space addresses, but the
CBU performance counters only count requests to cacheable address space. The
total number of read requests is equal to the sum of the Dstream read requests and
the Istream read requests.
CBU-to-System Requests
The CBU can issue the following requests to the system:
•READ MISS commands
•BCACHE VICTIM commands
•WRITE BLOCK commands
READ MISS commands to I/O space and WRITE BLOCK commands (which are
always to I/O space on the 21164PC) are not counted by the performance counters.
BCACHE VICTIM commands are always to cacheable space and, therefore, are
always counted. READ MISS commands to cacheable space are generated when the
21164PC detects either a read miss or write miss in the Bcache. A BCACHE
VICTIM command is also generated along with the READ MISS command if the
block the request misses on is valid and dirty in the cache. In this case, the 64-byte
Bcache block is read from the Bcache and sent to the system.
System-to-CBU Requests
The system can issue the following requests to the 21164PC:
•FILL commands
•READ commands
•FLUSH commands
• INVAL commands
Cacheable FILL commands are in response to READ MISS commands and write 64
bytes of data into the Bcache. I/O space FILL commands are not counted by the
CBU performance counters. Depending on whether the miss was for a read or write
request, the 21164PC will either forward the data to the onchip caches or write data
from the write buffer into the newly filled block. The total number of FILL commands is the same as the total number of READ MISS commands.
The other three system commands are external probes of the Bcache. INVAL commands are not counted by the CBU performance counter.
Misses in the onchip caches can merge in the MTU before being issued to the CBU.
Therefore, MTU read or write requests are not the same as onchip cache misses.
Also, two Bcache misses can merge in the CBU and appear on the system bus as a
single READ MISS request. Requests only merge with other requests of the same
type (that is, Istream and Dstream requests do not merge, nor does a write request
merge with a read request).
Using the Counters
The two counters work in parallel, so they can be used to determine simple ratios like
Bcache miss rate or more complex statistics like Dstream read merging in the CBU
(by running several tests and normalizing the results).
For example:
Bcache miss rate=1 − (Bcache read hits/Total read requests)
(Bcache Dstream read fills/Bcache Dstream read requests)
Counter 0 selects 0x1 and counter 1 selects 0x0 on the first pass,
then counter 0 selects 0x2 and counter 1 selects 0x0 on the second
pass.
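Once the two counts have been collected, the Bcache miss rate formula above is a one-line calculation. The Python sketch below is illustrative; the function name is hypothetical, and the arguments are assumed to be the raw event counts read back from the counters after a test run.

```python
from fractions import Fraction


def bcache_miss_rate(read_hits, total_read_requests):
    """Bcache miss rate = 1 - (Bcache read hits / total read requests)."""
    if total_read_requests <= 0:
        raise ValueError("at least one read request is required")
    return 1 - Fraction(read_hits, total_read_requests)
```

Exact fractions are used so that counts gathered on separate passes can be combined without rounding; convert the result with float() for reporting.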
2.9 Floating-Point Control Register
Figure 2–3 shows the format of the floating-point control register (FPCR) and
Table 2–10 describes the fields.
Figure 2–3 Floating-Point Control Register (FPCR) Format
[Figure: 64-bit register layout. Bit 63 SUM, 62 INED, 61 UNFD, 60 UNDZ, 59:58 DYN_RM, 57 IOV, 56 INE, 55 UNF, 54 OVF, 53 DZE, 52 INV, 51 OVFD, 50 DZED, 49 INVD, 48:0 RAZ/IGN.]
Table 2–10 Floating-Point Control Register Bit Descriptions
Name      Extent    Description (Meaning When Set)
SUM       <63>      Summary bit. Records bitwise OR of FPCR exception bits. Equal to
                    FPCR<57 | 56 | 55 | 54 | 53 | 52>.
INED      <62>      Inexact disable. Suppress INE trap and place correct IEEE nontrapping
                    result in the destination register if the 21164PC is capable of
                    producing correct IEEE nontrapping result.
UNFD      <61>      Underflow disable. Subset support: Suppress UNF trap if UNDZ is
                    also set and the /S qualifier is set on the instruction.
UNDZ      <60>      Underflow to zero. When set together with UNFD, on underflow,
                    the hardware places a true zero (all 64 bits zero) in the destination
                    register rather than the denormal number specified by the IEEE standard.
DYN_RM    <59:58>   Dynamic rounding mode. Indicates the rounding mode to be used by
                    an IEEE floating-point operate instruction when the instruction’s
                    function field specifies dynamic mode (/D). The assignments are:
                    DYN  IEEE Rounding Mode Selected
                    00   Chopped rounding mode
                    01   Minus infinity
                    10   Normal rounding
                    11   Plus infinity
IOV       <57>      Integer overflow. An integer arithmetic operation or a conversion
                    from floating to integer overflowed the destination precision.
INE       <56>      Inexact result. A floating arithmetic or conversion operation gave a
                    result that differed from the mathematically exact result.
UNF       <55>      Underflow. A floating arithmetic or conversion operation underflowed
                    the destination exponent.
OVF       <54>      Overflow. A floating arithmetic or conversion operation overflowed
                    the destination exponent.
DZE       <53>      Division by zero. An attempt was made to perform a floating divide
                    operation with a divisor of zero.
INV       <52>      Invalid operation. An attempt was made to perform a floating arithmetic,
                    conversion, or comparison operation, and one or more of the
                    operand values were illegal.
OVFD      <51>      Overflow disable. Not supported.
DZED      <50>      Division by zero disable. Not supported.
INVD      <49>      Invalid operation disable. Not supported.
Reserved  <48:0>    Reserved. Read as zero; ignored when written.
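As an illustration of the bit assignments in Table 2–10, the following Python sketch decodes a 64-bit FPCR value into named fields. The function and dictionary names are hypothetical, and the value is assumed to have already been read into an integer; only the bit positions and rounding-mode assignments come from the table.

```python
# Single-bit FPCR fields and their bit positions (from Table 2-10).
FPCR_FLAGS = {
    "SUM": 63, "INED": 62, "UNFD": 61, "UNDZ": 60,
    "IOV": 57, "INE": 56, "UNF": 55, "OVF": 54, "DZE": 53, "INV": 52,
    "OVFD": 51, "DZED": 50, "INVD": 49,
}

# DYN_RM<59:58> assignments (from Table 2-10).
DYN_RM = {
    0b00: "Chopped rounding mode",
    0b01: "Minus infinity",
    0b10: "Normal rounding",
    0b11: "Plus infinity",
}


def decode_fpcr(value: int) -> dict:
    """Return a dict of FPCR field names to decoded values."""
    fields = {name: bool((value >> bit) & 1) for name, bit in FPCR_FLAGS.items()}
    fields["DYN_RM"] = DYN_RM[(value >> 58) & 0b11]
    return fields


def expected_sum(value: int) -> bool:
    """SUM is documented as the OR of the exception bits <57:52>."""
    return any((value >> bit) & 1 for bit in range(52, 58))
```

A value with OVF set and DYN_RM equal to 10, for example, decodes to normal rounding with the overflow flag raised, and expected_sum reports that SUM should be set.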
2.10 Design Examples
The 21164PC can be designed into many different uniprocessor system configura-
tions. Figure 2–4 illustrates one possible configuration. This configuration employs
additional system/memory controller chipsets.
Figure 2–4 shows a typical uniprocessor system with a board-level cache. This system configuration could be used in standalone or networked workstations.
Figure 2–4 Typical Uniprocessor Configuration
[Figure: block diagram. Labeled elements: 21164PC, Addr/cmd and Data buses, External Cache Tag, External Cache Data, Memory and I/O Interface, Main Memory (two DRAM banks), and I/O Bus.]
Hardware Interface
This chapter contains the 21164PC microprocessor logic symbol and provides a list
of signal names and their functions.
3.1 21164PC Microprocessor Logic Symbol
Figure 3–1 shows the logic symbol for the 21164PC chip.
The 21164PC is contained in a 413-pin interstitial pin grid array (IPGA) package.
There are 264 functional signal pins, 2 spare signal pins (unused), 5 voltage reference pins (unused), 46 external power (Vdd) pins, 22 internal power (Vddi) pins,
and 74 ground (Vss) pins.
The following table defines the 21164PC signal types referred to in this section:
Signal Type  Definition
B            Bidirectional
I            Input only
O            Output only
21164PC Signal Names and Functions
The remaining two tables describe the function of each 21164PC external signal.
Table 3–1 lists all signals in alphanumeric order. This table provides full signal
descriptions. Table 3–2 lists signals by function and provides an abbreviated description.
Table 3–1 21164PC Signal Descriptions
Signal  Type  Count  Description
addr_h<39:4>  B  36  Address bus. These bidirectional signals provide the address of
the requested data or operation between the 21164PC and the
system. If addr_h<39> is asserted, then the reference is to
noncached, I/O memory space.
When the byte/word instructions are used and addr_h<39> is
asserted, six additional bits of information are communicated
over the pin bus. Two of the new bits are driven over
addr_h<38:37>, becoming transfer_size<1:0>, with the fol-
lowing values:
addr_bus_req_h  I  1  Address bus request. The system interface uses this signal to
gain control of the addr_h<39:4> and cmd_h<3:0> pins (see
Figure 4–22).
addr_res_h<1:0>  O  2  Address response bits <1> and <0>. For system commands,
the 21164PC uses these pins to indicate the state of the block
in the Bcache:
Bits  Command     Meaning
00    NOP         Nothing.
01    NOACK       Data not found or clean.
10    —           Reserved.
11    ACK/Bcache  Data from Bcache.
cack_h  I  1  Command acknowledge. The system interface uses this signal
to acknowledge any one of the commands driven by the
21164PC.
clk_mode_h<1:0>  I  2  Clock test mode. These signals specify a relationship between
osc_clk_in_h,l, the CPU cycle time, and the duty-cycle equalizer.
These signals should be deasserted in normal operation
mode.
Bits  Description
00    CPU clock frequency is equal to the input clock frequency.
01    CPU clock frequency is equal to the input clock frequency,
      with the onchip duty-cycle equalizer enabled.
10    Initialize the CPU clock, allowing the system clock to be
      synchronized to a stable reference clock.
11    Initialize the CPU clock, allowing the system clock to be
      synchronized to a stable reference clock, with the onchip
      duty-cycle equalizer enabled.
cmd_h<3:0>  B  4  Command bus. These signals drive and receive the commands
from the command bus. The following tables define the commands
that can be driven on the cmd_h<3:0> bus by the
21164PC or the system. For additional information, refer to
Section 4.1.1.1.
21164PC Commands to System:
cmd_h<3:0>  Command        Meaning
0000        NOP            Nothing.
0001        —              Reserved.
0010        —              Reserved.
0011        —              Reserved.
0100        —              Reserved.
0101        —              Reserved.
0110        WRITE BLOCK    Request to write a block.
0111        —              Reserved.
1000        READ MISS0     Request for data.
1001        READ MISS1     Request for data.
1010        —              Reserved.
1011        —              Reserved.
1100        BCACHE VICTIM  Bcache victim should be
System Commands to 21164PC:
cmd_h<3:0>  Command     Meaning
0000        NOP         Nothing.
0001        FLUSH       Removes block from caches; return dirty data.
0010        INVALIDATE  Invalidates the block from caches.
0011        —           Reserved.
0100        READ        Read a block.
0101        —           Reserved.
0111        —           Reserved.
1xxx        —           Reserved.
cpu_clk_out_h  O  1  CPU clock output. This signal is used for test purposes.
dack_h  I  1  Data acknowledge. The system interface uses this signal to
control data transfer between the 21164PC and the system.
data_h<127:0>  B  128  Data bus. These signals are used to move data between the
21164PC, the system, and the Bcache.
data_adsc_l  O  1  Load a new address into the Bcache SSRAM.
data_adv_l  O  1  Advances the Bcache index to the next address.
data_bus_req_h  I  1  Data bus request. If the 21164PC samples this signal asserted
on the rising edge of sysclk n, then the 21164PC does not drive
the data bus on the rising edge of sysclk n+1. Before asserting
this signal, the system should assert idle_bc_h for the correct
number of cycles. If the 21164PC samples this signal deasserted
on the rising edge of sysclk n, then the 21164PC drives
the data bus on the rising edge of sysclk n+1. For timing
details, refer to Section 4.9.4.
data_ram_oe_l  O  1  Data RAM output enable. This signal is asserted for Bcache
read operations.
data_ram_we_l<3:0>  O  4  Data RAM write-enable. These signals are asserted for any
Bcache write operation. Refer to Section 5.3.1 for timing
details.
dc_ok_h  I  1  dc voltage OK. Must be deasserted until dc voltage reaches
proper operating level. After that, dc_ok_h is asserted.
fill_h  I  1  Fill warning. If the 21164PC samples this signal asserted on
the rising edge of sysclk n, then the 21164PC provides the
address indicated by fill_id_h to the Bcache on the rising edge
of sysclk n+1. The Bcache begins to write in that sysclk. At the
end of sysclk n+1, the 21164PC waits for the next sysclk and
then begins the write operation again if dack_h is not asserted.
Refer to Section 4.9.3 for timing details.
fill_dirty_h  I  1  Fill dirty. If the block being filled is dirty, this pin should be
asserted.
fill_error_h  I  1  Fill error. If this signal is asserted during a fill from memory, it
indicates to the 21164PC that the system has detected an
invalid address or hard error. The system still provides an
apparently normal read sequence with correct ECC/parity
though the data is not valid. The 21164PC traps to the machine
check (MCHK) PALcode entry point and indicates a serious
hardware error. fill_error_h should be asserted when the data
is returned. Each assertion produces a MCHK trap.
fill_id_h  I  1  Fill identification. Asserted with fill_h to indicate which register
is used. The 21164PC supports two outstanding load
instructions. If this signal is asserted when the 21164PC samples
fill_h asserted, then the 21164PC provides the address
from miss register 1. If it is deasserted, then the address in miss
register 0 is used for the read operation.
idle_bc_h  I  1  Idle Bcache. When asserted, the 21164PC finishes the current
Bcache read or write operation but does not start a new read or
write operation until the signal is deasserted. The system interface
must assert this signal in time to idle the Bcache before
fill data arrives.
index_h<21:4>  O  18  Index. These signals index the Bcache.
int4_valid_h<3:0>  O  4  INT4 data valid. During write operations to noncached space,
these signals are used to indicate which INT4 bytes of data are
valid. This is useful for noncached write operations that have
been merged in the write buffer.
During read operations to noncached space, these signals indicate
which INT8 bytes of a 32-byte block need to be read and
returned to the processor. This is useful for read operations to
noncached memory.
Note: For both read and write operations, multiple
int4_valid_h<3:0> bits can be set simultaneously.
When addr_h<39> is asserted, the int4_valid_h<3:0> signals
are considered the addr_h<3:0> bits required for byte/word
transactions. The functionality of these bits is tied to the value
stored in addr_h<38:37>.
For read transactions:
addr_h<38:37>  int4_valid_h<3:0> Value
00             Valid INT8 mask
01             addr_h<3:2> valid on int4_valid_h<3:2>;
               int4_valid<1:0> undefined
10             addr_h<3:1> valid on int4_valid_h<3:1>;
               int4_valid<0> undefined
11             addr_h<3:0> valid on int4_valid_h<3:0>
For write transactions:
addr_h<38:37>  int4_valid_h<3:0> Value
00             Valid INT4 mask
01             Valid INT4 mask
10             addr_h<3:1> valid on int4_valid_h<3:1>;
               int4_valid<0> undefined
11             addr_h<3:0> valid on int4_valid_h<3:0>
irq_h<3:0>  I  4  System interrupt requests. These signals have multiple modes
of operation. During normal operation, these level-sensitive
signals are used to signal interrupt requests. During initialization,
these signals are used to set up the CPU cycle time divisor
for sys_clk_out1_h as follows:
lw_parity_h<3:0>  B  4  Longword parity. These signals set even INT4 parity for the
current data cycle. Refer to Section 4.12.1 for information on
the purpose of each lw_parity_h bit.
mch_hlt_irq_h  I  1  Machine halt interrupt request. This signal has multiple modes
of operation. During initialization, this signal is used to set up
sys_clk_out2_h delay (see Table 4–3). During normal operation,
it is used to signal a halt request.
osc_clk_in_h  I  1
osc_clk_in_l  I  1  Oscillator clock inputs. These signals provide the differential
clock input that is the fundamental timing of the 21164PC.
These signals are driven at the same frequency as the internal
clock frequency (clk_mode_h<1:0> = 01).
port_mode_h<1:0>  I  2  Select test port interface modes (normal, manufacturing, and
debug). For normal operation, both signals must be deasserted.
pwr_fail_irq_h  I  1  Power failure interrupt request. This signal has multiple modes
of operation. During initialization, this signal is used to set up
sys_clk_out2_h delay (see Table 4–3). During normal operation,
this signal is used to signal a power failure.
srom_clk_h  O  1  Serial ROM clock. Supplies the clock that causes the SROM to
advance to the next bit. The cycle time of this clock is 128
times the cycle time of the CPU clock.
srom_data_h  I  1  Serial ROM data. Input for the SROM.
srom_oe_l  O  1  Serial ROM output enable. Supplies the output enable to the
SROM.
srom_present_l¹  B  1  Serial ROM present. Indicates that SROM is present and ready
to load the Icache.
st_clk1_h  O  1  STRAM clock. Clock for synchronously timed RAMs
(STRAMs). For Bcache, this signal is synchronous with
index_h<21:4> during private read and write operations, and
with sys_clk_out1_h during read and fill operations.
st_clk2_h  O  1  This signal is a duplicate of st_clk1_h, to increase the fanout
capability of the signal.
st_clk3_h  O  1  This signal is another duplicate of st_clk1_h, to increase the
fanout capability of the signal.
sys_clk_out1_h  O  1  System clock output. Programmable system clock
(cpu_clk_out_h divided by a value of 3 to 15) is used for
board-level cache and system logic.
sys_clk_out2_h  O  1  System clock output. A version of sys_clk_out1_h delayed by
a programmable amount from 0 to 7 CPU cycles.
sys_mch_chk_irq_h  I  1  System machine check interrupt request. This signal has multiple
modes of operation. During initialization, it is used to set
up sys_clk_out2_h delay (see Table 4–3). During normal
operation, it is used to signal a machine check interrupt
request.
sys_reset_l  I  1  System reset. This signal protects the 21164PC from damage
during initial power-up. It must be asserted until dc_ok_h is
asserted. After that, it is deasserted and the 21164PC begins its
reset sequence.
tag_data_h<32:19>  B  14  Bcache tag data bits. This bit range supports 0.5MB to 4MB
Bcaches.
tag_data_par_h  B  1  Tag data parity bit. This signal indicates odd parity for
tag_data_h<32:19>.
tag_dirty_h  B  1  Tag dirty state bit. This bit is private to the 21164PC.
tag_ram_oe_l  O  1  Tag RAM output enable. This signal is asserted during any
Bcache read operation.
tag_ram_we_l  O  1  Tag RAM write-enable. This signal is asserted during any tag
write operation.
tag_valid_h  B  1  Tag valid bit. During fills, this signal is asserted to indicate
that the block has valid data. See Table 4–5 for information
about Bcache protocol.
tck_h  B  1  JTAG boundary-scan clock.
tdi_h  I  1  JTAG serial boundary-scan data-in signal.
tdo_h  O  1  JTAG serial boundary-scan data-out signal.
temp_sense  I  1  Temperature sense. This signal is used to measure the die temperature
and is for manufacturing use only. For normal operation,
this signal must be left disconnected.
test_status_h<1>  O  1  Icache test status or timeout reset. This signal is used for
manufacturing test purposes only to extract Icache test status
information from the chip.
tms_h  I  1  JTAG test mode select signal.
trst_l¹  B  1  JTAG test access port (TAP) reset signal.
victim_pending_h  O  1  Victim pending. When asserted, this signal indicates that the
current read miss has generated a victim.
¹ This signal is shown as bidirectional. However, for normal operation, it is input only. The output function is
used during manufacturing test and verification only.
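The cmd_h<3:0> and addr_res_h<1:0> encodings listed in Table 3–1 lend themselves to a simple lookup. The sketch below is illustrative only: the function names and the "Unlisted" result are hypothetical, and codes that do not appear in the tables above are reported as unlisted rather than guessed.

```python
# Encodings taken from the cmd_h<3:0> rows of Table 3-1.
CPU_TO_SYSTEM = {
    0b0000: "NOP", 0b0110: "WRITE BLOCK", 0b1000: "READ MISS0",
    0b1001: "READ MISS1", 0b1100: "BCACHE VICTIM",
}
CPU_RESERVED = {0b0001, 0b0010, 0b0011, 0b0100, 0b0101, 0b0111, 0b1010, 0b1011}

SYSTEM_TO_CPU = {0b0000: "NOP", 0b0001: "FLUSH", 0b0010: "INVALIDATE", 0b0100: "READ"}
SYSTEM_RESERVED = {0b0011, 0b0101, 0b0111} | set(range(0b1000, 0b10000))  # 1xxx

# Encodings taken from the addr_res_h<1:0> row of Table 3-1.
ADDR_RES = {0b00: "NOP", 0b01: "NOACK", 0b10: "Reserved", 0b11: "ACK/Bcache"}


def decode_cmd(code: int, from_cpu: bool) -> str:
    """Name the command driven on cmd_h<3:0> by the 21164PC or the system."""
    if not 0 <= code <= 0xF:
        raise ValueError("cmd_h is a 4-bit field")
    names, reserved = (
        (CPU_TO_SYSTEM, CPU_RESERVED) if from_cpu else (SYSTEM_TO_CPU, SYSTEM_RESERVED)
    )
    if code in names:
        return names[code]
    # Codes absent from the tables are not assumed to be reserved.
    return "Reserved" if code in reserved else "Unlisted"


def decode_addr_res(bits: int) -> str:
    """Name the response driven on addr_res_h<1:0>."""
    return ADDR_RES[bits & 0b11]
```

Keeping the two directions in separate tables reflects the fact that the same 4-bit code means different things depending on which side drives the bus.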
Table 3–2 lists signals by function and provides an abbreviated description.
Table 3–2 21164PC Signal Descriptions by Function
Signal  Type  Count  Description
dc_ok_h  I  1  dc voltage OK.
port_mode_h<1:0>  I  2  Selects the test port interface mode (normal, manufacturing, and debug).
srom_clk_h  O  1  Serial ROM clock.
srom_data_h  I  1  Serial ROM data.
srom_oe_l  O  1  Serial ROM output enable.
srom_present_l¹  B  1  Serial ROM present.
tck_h  B  1  JTAG boundary-scan clock.
tdi_h  I  1  JTAG serial boundary-scan data in.
tdo_h  O  1  JTAG serial boundary-scan data out.
temp_sense  I  1  Temperature sense.
test_status_h<1>  O  1  Icache test status or timeout reset.
tms_h  I  1  JTAG test mode select.
trst_l¹  B  1  JTAG test access port (TAP) reset.
¹ This signal is shown as bidirectional. However, for normal operation, it is input only. The output
function is used during manufacturing test and verification only.
4
Clocks, Cache, and External Interface
This chapter describes the 21164PC microprocessor external interface, which
includes the backup cache (Bcache) and system interfaces. It also describes the clock
circuitry, interrupt signals, and parity generation. It is organized as follows:
•Introduction to the external interface
•Clocks
•Physical address considerations
•Bcache structure and operation
•Cache coherency
•21164PC-to-Bcache transactions
•21164PC-initiated system transactions
•System-initiated transactions
•Data bus and command/address bus contention
•21164PC interface restrictions
•21164PC/system race conditions
•Data integrity and Bcache errors
•Interrupts
Chapter 3 lists and defines all 21164PC hardware interface signal pins. Chapter 9
describes the 21164PC hardware interface electrical requirements.
4.1 Introduction to the External Interface
A 21164PC-based system can be divided into three major sections:
•21164PC microprocessor
•External Bcache
•System interface logic
The 21164PC external interface is optimized for uniprocessor-based systems and
mandates few design rules. The interface includes a 128-bit bidirectional data bus, a
36-bit bidirectional address bus, and several control signals.
Read latencies and data repetition rates of the external Bcache can be programmed
by means of register bits. The Bcache clock frequency for private read and write
operations is independent of the system interface clock frequency and makes for a
more flexible design.
The cache system supports a 64-byte block size to the external Bcache.
Figure 4–1 shows a simplified view of the external interface. The function and purpose of each signal is described in Chapter 3.
4.1.1 System Interface
This section describes the system or external bus interface. The system interface is
made up of bidirectional address and command buses, a data bus that is shared with
the Bcache interface, and several control signals.
The system interface is under the control of the bus interface unit (BIU) in the CBU.
The system interface is a 128-bit bidirectional data bus.
The cycle time of the system interface is programmable to speeds of 4 to 15 times the
CPU cycle time (sysclk ratio). All system interface signals are driven or sampled by
the 21164PC on the rising edge of signal sys_clk_out1_h. In this chapter, this edge
is sometimes referred to as “sysclk.” Precisely when interface signals rise and fall
does not matter as long as they meet the setup and hold times specified in Chapter 9.
The 21164PC can take up to two commands from the system at a time. The Bcache is
probed to determine what must be done with the command.
•If nothing is to be done, the 21164PC acknowledges receiving the command.
•If a Bcache read or invalidate operation is required, the 21164PC performs the
task as soon as the Bcache becomes free. The 21164PC acknowledges receiving
the command at the start of the Bcache transaction.
The BIU contains a three-entry BIU command/address buffer (BAF) capable of
queueing up to three Bcache misses or I/O references. These buffers are capable of
merging both read and write miss references, to reduce external system bus traffic.
4.1.2 Bcache Interface
The 21164PC includes an interface and control for a required backup cache
(Bcache). The Bcache interface features the following:
•Support for pipelined and flow-through synchronous burst SRAMs (SSRAMs)
•Nonblocking, pipelined Bcache (up to three probes in flight)
•Fully interleaved writes to saturate write-hit traffic
•Flexible Bcache sizes (512KB - 4MB)
•Direct-mapped organization with 64-byte block size
•Read/write-allocate replacement policy
•Write-back cache policy
•A 128-bit data bus (shared with the system interface)
•4.8 GB/s peak data transfer rate
•Programmable Bcache clock rate up to 300-MHz operation
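The peak transfer rate in the list above is consistent with the 128-bit data bus running at the maximum 300-MHz Bcache clock. The following sketch shows the arithmetic; it assumes one 16-byte transfer per Bcache clock cycle, which the list does not state explicitly, and the function name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits: int, clock_hz: int,
                               transfers_per_clock: int = 1) -> int:
    """Peak transfer rate of a bus, in bytes per second.

    Assumes transfers_per_clock complete transfers every clock cycle
    (an assumption for this sketch, not a documented parameter).
    """
    if bus_width_bits % 8:
        raise ValueError("bus width must be a whole number of bytes")
    return (bus_width_bits // 8) * clock_hz * transfers_per_clock


# 128-bit bus at 300 MHz: 16 bytes x 300,000,000 = 4.8 GB/s
BCACHE_PEAK = peak_bandwidth_bytes_per_s(128, 300_000_000)
```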
4.1.2.1 Bcache Interface Enhancements
With the advent of commodity SSRAMs, offchip high-speed caches can now be built
at low cost to take advantage of the same performance techniques that until now had
been restricted to onchip caches. The SSRAMs contain an address register, a self-incrementing address mechanism, and optionally, a data output register (pipelined).
The 21164PC uses these additional control features to deliver a high-performance
nonblocking, interleaved, fully pipelined Bcache interface.
4.1.2.2 Pipelined Bcache
A pipelined cache allows the processor to issue multiple cache operations that are
overlapped in time to increase throughput. The 21164PC supports pipelining of up to
three outstanding read or write probes at any given time to attain 100% data bus
utilization. The outstanding Bcache probes are tracked by the BIU's “Bcache in flight”
(or BIF) buffer. Figure 4–2 shows the benefits of having multiple probes in flight
for a pipelined cache.
Figure 4–2 Merits of a Multiprobes In Flight – Pipelined Cache
[Figure: timing diagram comparing index and data activity for a nonpipelined cache and for a pipelined cache with multiple probes in flight. Pipelining allows 100% utilization of the data bus.]
4.1.2.3 Write Interleaving
The 21164PC Bcache interface takes advantage of the SSRAM address input register
to employ interleaving techniques to maximize write-hit dirty bandwidth. The
Bcache interface decouples the tag and data store control to allow tag write probes to
be interleaved with data writes. Figure 4–3 shows an example of write interleaving
and its ability to keep the data bus at 100% utilization.
Figure 4–3 Tag/Data Store Interleaving
[Figure: timing diagram of index, tag, and data activity showing data writes interleaved with tag probes. Interleaving tag write probes with data write operations allows 100% utilization of the data bus.]
Tag probes for writes that hit clean (valid, not dirty) in the Bcache must schedule a
tag store write to update the dirty bit.
4.2 Clocks
The 21164PC develops three clock signals that are available at output pins.
Signal          Description
cpu_clk_out_h   A 21164PC internal clock that may or may not drive the system clock.
sys_clk_out1_h  A clock of programmable speed supplied to the external interface.
sys_clk_out2_h  A delayed copy of sys_clk_out1_h. The delay is programmable and is
                an integer number of cpu_clk_out_h periods.
The behavior of the programmable clocks during the reset sequence is described in
Section 7.1.
4.2.1 CPU Clock
The 21164PC uses the differential input clock lines osc_clk_in_h,l as a source to
generate its CPU clock. The input signals clk_mode_h<1:0> control generation of
the CPU clock, as listed in Table 4–1 and as shown in Figure 4–4.
The 21164PC uses clk_mode_h<0> to provide onchip capability to equalize the
duty cycle of the input clock (eliminating the need for a 2× oscillator). When
clk_mode_h<0> is asserted, the equalizing circuitry, called a symmetrator, is
enabled.
The 21164PC uses clk_mode_h<1> to reset the CPU clock. When clk_mode_h<1>
is set, the internal CPU clock is reset to a known state. When it is clear, the CPU
clock is driven at the same frequency as the osc_clk_in_h,l differential input.
Table 4–1 CPU Clock Generation Control
Mode    clk_mode_h<1:0>  Description
Normal  00               CPU clock frequency is the same as the input clock
                         frequency; symmetrator is disabled.
Normal  01               CPU clock frequency is the same as the input clock
                         frequency; symmetrator is enabled. Also used to
                         accommodate chip testers.
Reset   10               Initializes CPU clock, allowing system clock to be
                         synchronized to a stable reference clock; symmetrator
                         is disabled.
Reset   11               Initializes CPU clock, allowing system clock to be
                         synchronized to a stable reference clock; symmetrator
                         is enabled.
Caution: A clock source should always be provided on osc_clk_in_h,l when signal
dc_ok_h is asserted.
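The two clk_mode_h bits in Table 4–1 act independently: bit <1> selects reset versus normal operation, and bit <0> enables the symmetrator. A minimal sketch of that decoding follows; the function name and the returned dictionary keys are hypothetical.

```python
def decode_clk_mode(bits: int) -> dict:
    """Decode clk_mode_h<1:0> according to Table 4-1."""
    if bits not in range(4):
        raise ValueError("clk_mode_h is a 2-bit field")
    return {
        # clk_mode_h<1> set: CPU clock is reset to a known state.
        "mode": "Reset" if bits & 0b10 else "Normal",
        # clk_mode_h<0> asserted: duty-cycle equalizer (symmetrator) enabled.
        "symmetrator_enabled": bool(bits & 0b01),
    }
```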
Figure 4–4 Clock Signals and Functions
[Figure: block diagram of the 21164PC clock-related input signals: osc_clk_in_h,l, clk_mode_h<1:0>, irq_h<3:0>, mch_hlt_irq_h, pwr_fail_irq_h, sys_mch_chk_irq_h, sys_reset_l, and dc_ok_h.]
4.2.2 System Clock
The CPU clock is the source clock used to generate the system clock
sys_clk_out1_h. The system clock divider controls the frequency of
sys_clk_out1_h. The divisor, 4 to 15, is obtained from the four interrupt lines
irq_h<3:0> at power-up as listed in Table 4–2. The system clock frequency is deter-
mined by dividing the ratio into the CPU clock frequency. Refer to Section 7.2 for
information on sysclk behavior during reset. The value is also latched into the
SYS_CLK_RATIO<3:0> field of the CBOX_STATUS IPR (bits <7:4>) for read-only purposes.
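The relationship between the CPU clock, the divisor, and the system clock can be sketched as follows. The function names are hypothetical, and the sketch assumes that the SYS_CLK_RATIO field holds the divisor value itself; the encoding in Table 4–2 should be checked before relying on that.

```python
def sys_clk_ratio_from_cbox_status(cbox_status: int) -> int:
    """Extract SYS_CLK_RATIO<3:0> from bits <7:4> of CBOX_STATUS."""
    return (cbox_status >> 4) & 0xF


def sysclk_hz(cpu_hz: float, ratio: int) -> float:
    """System clock frequency: the CPU clock divided by the sysclk ratio."""
    if not 4 <= ratio <= 15:
        raise ValueError("the system clock divisor ranges from 4 to 15")
    return cpu_hz / ratio
```

For example, a 400-MHz CPU clock with a divisor of 4 would give a 100-MHz system clock.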
Figure 4–5 shows the 21164PC driving the system clock on a uniprocessor system.
Figure 4–5 21164PC Uniprocessor Clock
[Figure: block diagram with the 21164PC, its sys_clk_out signal, a memory ASIC, and a bus ASIC.]
4.2.3 Delayed System Clock
The system clock sys_clk_out1_h is the source clock for the delayed system clock
sys_clk_out2_h. These clock signals provide flexible timing for system use. The
delay unit, from 0 to 7 CPU CLK cycles, is obtained from the three interrupt signals:
mch_hlt_irq_h, pwr_fail_irq_h, and sys_mch_chk_irq_h at power-up, as listed in
Table 4–3. The output of this programmable divider is symmetric if the divisor is
even. The output is asymmetric if the divisor is odd. When the divisor is odd, the
clock is high for an extra cycle. Refer to Section 7.2 for information on sysclk
behavior during reset.
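The symmetry rule above can be expressed as a short calculation of the high and low phases of the divided clock, measured in CPU cycles. This is a sketch under the stated rule only; the function names are hypothetical.

```python
def sysclk_high_low_cycles(divisor: int) -> tuple:
    """Return (high, low) phase lengths of the divided clock in CPU cycles.

    Even divisors give a symmetric clock. Odd divisors give a clock that
    is high for one extra CPU cycle.
    """
    if not 4 <= divisor <= 15:
        raise ValueError("the system clock divisor ranges from 4 to 15")
    low = divisor // 2
    high = divisor - low
    return high, low


def delayed_edge(edge_cpu_cycle: int, delay: int) -> int:
    """CPU cycle at which sys_clk_out2_h shows an edge of sys_clk_out1_h."""
    if not 0 <= delay <= 7:
        raise ValueError("the delay ranges from 0 to 7 CPU cycles")
    return edge_cpu_cycle + delay
```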
4.3 Physical Address Considerations
This section lists and describes the physical address regions. Cache and data
wrapping characteristics of physical addresses are also described.
4.3.1 Physical Address Regions
Physical memory of the 21164PC is divided into three regions:
1. The first region is the first half of the physical address space. It is treated by the
21164PC as memory-like.
2. The second region is the second half of the physical address space except for a
1MB region reserved for CBU IPRs. It is treated by the 21164PC as noncacheable.
3. The third region is the 1MB region reserved for CBU IPRs.
In the first region, write merging and load merging are permitted. All 21164PC
accesses in this region are 64-byte, the Bcache block size. This memory-like region
is limited to 8GB (maximum).
The 21164PC does not cache data accessed in the second and third regions of the
physical address space; 21164PC read accesses in these regions are always INT32
requests. Load merging is permitted, but the request includes a mask to inform the
system environment as to which INT8s are accessed. Write merging is permitted.
Write accesses are INT32 requests with a mask indicating which INT4s are actually
modified.
The 21164PC never writes more than 32 bytes at a time in noncached space.
The 21164PC does not broadcast accesses to the CBU IPR region if they map to a
CBU IPR. Accesses in this region, that are not to a defined CBU IPR, produce
UNDEFINED results. The system should not probe this region.
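The three regions described in this section can be told apart from the physical address alone. The sketch below assumes a 40-bit physical address (addr_h<39> is the highest address signal) and uses the region boundaries given in this section; the constant and function names are hypothetical.

```python
ADDR_MAX = (1 << 40) - 1          # FF FFFF FFFF (hex)
IPR_BASE = 0xFF_FFF0_0000         # start of the 1MB CBU IPR region
MEMORY_LIMIT = 0x02_0000_0000     # 8GB limit of the memory-like region


def classify_physical_address(pa: int) -> str:
    """Return the physical address region that pa falls in."""
    if not 0 <= pa <= ADDR_MAX:
        raise ValueError("physical addresses are 40 bits wide")
    if pa >= IPR_BASE:
        return "CBU IPR region"
    if pa & (1 << 39):            # addr_h<39> asserted: second half
        return "noncacheable"
    return "memory-like"


def within_memory_limit(pa: int) -> bool:
    """True if a memory-like address lies inside the 8GB maximum."""
    return 0 <= pa < MEMORY_LIMIT
```

The IPR check comes first because the IPR region lies inside the upper half of the address space.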
Table 4–4 shows the 21164PC physical memory regions.
Table 4–4 Physical Memory Regions
Region        Address Range (hexadecimal)  Description
Memory-like   00 0000 0000 – 01 FFFF FFFF  Write invalidate cached, load, and store merging
                                           allowed.
Noncacheable  80 0000 0000 – FF FFEF FFFF  Not cached, load merging limited.
IPR region    FF FFF0 0000 – FF FFFF FFFF  Accesses do not appear on the interface unless an
                                           undefined location is accessed (which produces
                                           UNDEFINED results).
4.3.2 Data Wrapping
The 21164PC requires that wrapped read operations be performed on INT16 boundaries. READ and FLUSH commands are all wrapped on INT16 boundaries as
described here. The valid wrap orders for 64-byte blocks are selected by
addr_h<5:4>. They are:
0, 1, 2, 3
1, 0, 3, 2
2, 3, 0, 1
3, 2, 1, 0
Similarly, when the system interface supplies a command that returns data from the
21164PC caches, the values that the system drives on addr_h<5:4> determine the
order in which data is supplied by the 21164PC.
BCACHE VICTIM commands provide the data with the same wrap order as the read
miss that produced them.
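The four valid wrap orders for a 64-byte block follow a simple pattern: each INT16 position is the starting position from addr_h<5:4> exclusive-ORed with the transfer count. This is an observation about the listed orders rather than a statement of how the hardware is built, and the function name below is hypothetical.

```python
def wrap_order(address: int) -> list:
    """Return the order in which the four INT16s of a block are transferred.

    The starting INT16 is selected by address bits <5:4>.
    """
    start = (address >> 4) & 0b11
    return [start ^ i for i in range(4)]
```

An address with bits <5:4> equal to 10, for example, yields the order 2, 3, 0, 1.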
4.3.3 Noncached Read Operations
Read operations to physical addresses that have addr_h<39> asserted are not cached
in the Dcache or Bcache. They are merged like other read operations in the miss
address file (MAF). To prevent several read operations to noncached memory from
being merged into a single 32-byte bus request, software must insert memory barrier
(MB) instructions or set MAF_MODE IPR bit (IO_NMERGE). The MAF merges as
many Dstream read operations together as it can and sends the request to the BIU.
Rather than merging two 32-byte requests into a single 64-byte request, the BIU
requests a READ MISS from the system. Signals int4_valid_h<3:0> indicate which
of the four quadwords are being requested by software. The system should return the
fill data to the 21164PC as usual. The 21164PC does not write the Dcache or Bcache
with the fill data. The requested data is written in the register file or Icache.
Note: A special case using int4_valid_h<3:0> occurs during an Icache fill. In
this case the entire returned block is valid although int4_valid_h<3:0>
indicates zero.
4.3.4 Noncached Write Operations
Write operations to physical addresses that have addr_h<39> asserted are not written
to any of the caches. These write operations are merged in the write buffer before
being sent to the system. If software does not want write operations to merge, it must
insert MB or WMB instructions between them.
When the write buffer decides to write data to noncached memory, the BIU requests
a WRITE BLOCK. During each data cycle, int4_valid_h<3:0> indicates which
INT4s within the INT16 are valid.
4.4 Bcache Structure
The 21164PC supports Bcache sizes of 0.5, 1, 2, and 4MB. The size is under program
control and is specified by CBOX_CONFIG<13:12> (BC_SIZE<1:0>). The Bcache
block size is 64 bytes.
Industry-standard, burst-mode synchronous static RAMs (SSRAMs) may be connected
to the 21164PC without many extra components, although fanout buffers may
be required for the index lines. The SSRAMs are directly controlled by the 21164PC,
and the Bcache data lines are connected to the 21164PC data bus.
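For a direct-mapped cache, a physical address splits into a tag and an index whose widths depend on the cache size. The sketch below illustrates that split for the four supported Bcache sizes. It assumes the usual direct-mapped arrangement, in which the index addresses 16-byte units (matching index_h<21:4> and the 128-bit data bus) and the tag is made of the address bits above the index. The function name is hypothetical, and the BC_SIZE encoding is deliberately not modeled.

```python
BCACHE_SIZES = (512 * 1024, 1 << 20, 2 << 20, 4 << 20)  # 0.5, 1, 2, 4MB


def split_bcache_address(pa: int, bcache_bytes: int) -> tuple:
    """Return (tag, index) for physical address pa.

    index is in 16-byte units; tag holds the address bits above the
    index (assumed direct-mapped layout, for illustration only).
    """
    if bcache_bytes not in BCACHE_SIZES:
        raise ValueError("unsupported Bcache size")
    index_bits = bcache_bytes.bit_length() - 1   # log2 of the cache size
    index = (pa & (bcache_bytes - 1)) >> 4
    tag = pa >> index_bits
    return tag, index
```

With a 0.5MB Bcache the tag starts at address bit 19, which agrees with the low end of tag_data_h<32:19>.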