This manual is directly derived from the internal 21264/EV67 Specifications, Revision 1.4. You can access this hardware reference manual in PDF format from the
following site:
ftp://ftp.compaq.com/pub/products/alphaCPUdocs
Revision/Update Information: This is a revised document. It supersedes
the Alpha 21264A Microprocessor
Hardware Reference Manual
The information in this publication is subject to change without notice.
COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL
ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL. THIS
INFORMATION IS PROVIDED “AS IS” AND COMPAQ COMPUTER CORPORATION DISCLAIMS ANY
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY AND EXPRESSLY DISCLAIMS THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR PARTICULAR PURPOSE, GOOD TITLE AND AGAINST
INFRINGEMENT.
This publication contains information protected by copyright. No part of this publication may be photocopied or
reproduced in any form without prior written consent from Compaq Computer Corporation.
This manual is for system designers and programmers who use the Alpha 21264/EV67
microprocessor (referred to as the 21264/EV67).
This manual contains the following chapters and appendixes:
Chapter 1, Introduction, introduces the 21264/EV67 and provides an overview of the
Alpha architecture.
Chapter 2, Internal Architecture, describes the major hardware functions and the internal chip architecture. It describes performance measurement facilities, coding rules, and
design examples.
Chapter 3, Hardware Interface, lists and describes the internal hardware interface signals, and provides mechanical data and packaging information, including signal pin
lists.
Chapter 4, Cache and External Interfaces, describes the external bus functions and
transactions, lists bus commands, and describes the clock functions.
Chapter 5, Internal Processor Registers, lists and describes the internal processor register set.
Chapter 7, Initialization and Configuration, describes the initialization and configuration sequence.
Chapter 8, Error Detection and Error Handling, describes error detection and error handling.
Chapter 9, Electrical Data, provides electrical data and describes signal integrity issues.
Chapter 10, Thermal Management, provides information about thermal management.
Chapter 11, Testability and Diagnostics, describes chip and system testability features.
Appendix A, Alpha Instruction Set, summarizes the Alpha instruction set.
Appendix B, 21264/EV67 Boundary-Scan Register, presents the BSDL description of
the 21264/EV67 boundary-scan register.
Alpha 21264/EV67 Hardware Reference Manual
xvii
Appendix C, Serial Icache Load Predecode Values, provides a pointer to the Alpha
Motherboards Software Developer’s Kit (SDK), which contains this information.
Appendix D, PALcode Restrictions and Guidelines, lists restrictions and guidelines
that must be adhered to when generating PALcode.
Appendix E, 21264/EV67-to-Bcache Pin Interconnections, provides the pin interface
between the 21264/EV67 and Bcache SSRAMs.
The Glossary lists and defines terms associated with the 21264/EV67.
An Index is provided at the end of the document.
Documentation Included by Reference
The companion volume to this manual, the Alpha Architecture Handbook, Version 4, contains the instruction set architecture. You can access this document from the following
website: ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
Also available is the Alpha Architecture Reference Manual, Third Edition, which contains the complete architecture information. That manual is available at bookstores
from the Digital Press as EQ-W938E-DP.
Terminology and Conventions
This section defines the abbreviations, terminology, and other conventions used
throughout this document.
Abbreviations
•Binary Multiples
The abbreviations K, M, and G (kilo, mega, and giga) represent binary multiples
and have the following values: K = 2^10 (1024), M = 2^20 (1,048,576), and G = 2^30 (1,073,741,824).
The abbreviations used to indicate the type of access to register fields and bits have
the following definitions:
Abbreviation  Meaning
IGN           Ignore
              Bits and fields specified are ignored on writes.
MBZ           Must Be Zero
              Software must never place a nonzero value in bits and fields specified as
              MBZ. A nonzero read produces an Illegal Operand exception. Also, MBZ
              fields are reserved for future use.
RAZ           Read As Zero
              Bits and fields return a zero when read.
RC            Read Clears
              Bits and fields are cleared when read. Unless otherwise specified, such bits
              cannot be written.
RES           Reserved
              Bits and fields are reserved by Compaq and should not be used; however,
              zeros can be written to reserved fields that cannot be masked.
RO            Read Only
              The value may be read by software. It is written by hardware. Software write
              operations are ignored.
RO,n          Read Only, and takes the value n at power-on reset.
              The value may be read by software. It is written by hardware. Software write
              operations are ignored.
RW            Read/Write
              Bits and fields can be read and written.
RW,n          Read/Write, and takes the value n at power-on reset.
              Bits and fields can be read and written.
W1C           Write One to Clear
              If read operations are allowed to the register, then the value may be read by
              software. If it is a write-only register, then a read operation by software
              returns an UNPREDICTABLE result. Software write operations of a 1 cause
              the bit to be cleared by hardware. Software write operations of a 0 do not
              modify the state of the bit.
W1S           Write One to Set
              If read operations are allowed to the register, then the value may be read by
              software. If it is a write-only register, then a read operation by software
              returns an UNPREDICTABLE result. Software write operations of a 1 cause
              the bit to be set by hardware. Software write operations of a 0 do not modify
              the state of the bit.
WO            Write Only
              Bits and fields can be written but not read.
WO,n          Write Only, and takes the value n at power-on reset.
              Bits and fields can be written but not read.
•Sign extension
SEXT(x) means x is sign-extended to the required size.
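As an illustration of SEXT(x), the following Python sketch sign-extends a field of a given width to a 64-bit result. The function name `sext` and its parameters are hypothetical and chosen for this example only; they are not part of the Alpha notation.

```python
def sext(x: int, bits: int, width: int = 64) -> int:
    """Sign-extend the low `bits` bits of x to `width` bits.

    The result is returned as an unsigned integer of `width` bits.
    """
    x &= (1 << bits) - 1              # keep only the source field
    sign = 1 << (bits - 1)            # position of the source sign bit
    extended = (x ^ sign) - sign      # negative if the sign bit was set
    return extended & ((1 << width) - 1)
```

For example, `sext(0x80, 8)` treats 0x80 as a negative byte and fills the upper 56 bits with ones, while `sext(0x7F, 8)` leaves the value unchanged.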
Addresses
Unless otherwise noted, all addresses and offsets are hexadecimal.
Aligned and Unaligned
The terms aligned and naturally aligned are interchangeable and refer to data objects
that are powers of two in size. An aligned datum of size 2^n is stored in memory at a
byte address that is a multiple of 2^n; that is, one that has n low-order zeros. For example, an aligned 64-byte stack frame has a memory address that is a multiple of 64.
A datum of size 2^n is unaligned if it is stored in a byte address that is not a multiple of
2^n.
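The alignment rule can be checked with a single mask operation. The sketch below is illustrative; the function name `is_aligned` is an assumption made for this example.

```python
def is_aligned(address: int, size: int) -> bool:
    """Return True if `address` is a multiple of `size` (a power of two)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    # A multiple of 2^n has n low-order zero bits.
    return (address & (size - 1)) == 0
```

With this definition, address 0x1000 is aligned for a 64-byte datum, whereas address 0x1008 is aligned for a quadword (8 bytes) but not for a 64-byte datum.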
Bit Notation
Multiple-bit fields can include contiguous and noncontiguous bits contained in square
brackets ([]). Multiple contiguous bits are indicated by a pair of numbers separated by a
colon [:]. For example, [9:7,5,2:0] specifies bits 9, 8, 7, 5, 2, 1, and 0. Similarly, single bits
are frequently indicated with square brackets. For example, [27] specifies bit 27. See
also Field Notation.
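The bit notation can be expanded mechanically. The following sketch lists the bits named by a specification such as [9:7,5,2:0]; the helper name `expand_bits` is hypothetical and used only for illustration.

```python
from typing import List


def expand_bits(spec: str) -> List[int]:
    """Expand a bit specification such as "[9:7,5,2:0]" into bit numbers."""
    bits: List[int] = []
    for part in spec.strip("[]").split(","):
        part = part.strip()
        if ":" in part:
            # A contiguous run, written high:low and inclusive at both ends.
            high, low = (int(n) for n in part.split(":"))
            bits.extend(range(high, low - 1, -1))
        else:
            bits.append(int(part))
    return bits
```

Calling `expand_bits("[9:7,5,2:0]")` yields the bit numbers in the same order as the example in the text.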
Caution
Cautions indicate potential damage to equipment or loss of data.
Data Units
The following data unit terminology is used throughout this manual.
Unless otherwise stated, external means not contained in the chip.
Field Notation
The names of single-bit and multiple-bit fields can be used rather than the actual bit
numbers (see Bit Notation). When the field name is used, it is contained in square
brackets ([]). For example, RegisterName[LowByte] specifies RegisterName[7:0].
Note
Notes emphasize particularly important information.
Numbering
All numbers are decimal or hexadecimal unless otherwise indicated. The prefix 0x indicates a hexadecimal number. For example, 19 is decimal, but 0x19 and 0x19A are hexadecimal (also see Addresses). Otherwise, the base is indicated by a subscript; for
example, 100₂ is a binary number.
Ranges and Extents
Ranges are specified by a pair of numbers separated by two periods (..) and are inclusive. For example, a range of integers 0..4 includes the integers 0, 1, 2, 3, and 4.
Extents are specified by a pair of numbers in square brackets ([]) separated by a colon
(:) and are inclusive. Bit fields are often specified as extents. For example, bits [7:3]
specifies bits 7, 6, 5, 4, and 3.
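Reading an extent out of a register value amounts to a shift and a mask. The sketch below is illustrative only; the function name `extract` and its argument order are assumptions made for this example.

```python
def extract(value: int, high: int, low: int) -> int:
    """Return the bits [high:low] of `value`, inclusive at both ends."""
    if high < low:
        raise ValueError("extent must be written high:low")
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)
```

For instance, `extract(value, 7, 3)` returns the five bits 7, 6, 5, 4, and 3 of `value` as a number between 0 and 31.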
Register Figures
The gray areas in register figures indicate reserved or unused bits and fields.
Bit ranges that are coupled with the field name specify the bits of the named field that
are included in the register. The bit range may, but need not necessarily, correspond to
the bit Extent in the register. See the explanation above Table 5–1 for more information.
Signal Names
The following examples describe signal-name conventions used in this document.
AlphaSignal[n:n]    Boldface, mixed-case type denotes signal names that are
                    assigned internal and external to the 21264/EV67 (that is,
                    the signal traverses a chip interface pin).
AlphaSignal_x[n:n]  When a signal has high and low assertion states, a lowercase
                    italic x represents the assertion states. For example,
                    SignalName_x[3:0] represents SignalName_H[3:0] and
                    SignalName_L[3:0].
UNDEFINED
Operations specified as UNDEFINED may vary from moment to moment, implementation to implementation, and instruction to instruction within implementations. The
operation may vary in effect from nothing to stopping system operation.
UNDEFINED operations may halt the processor or cause it to lose information. However, UNDEFINED operations must not cause the processor to hang, that is, reach an
unhalted state from which there is no transition to a normal state in which the machine
executes instructions.
UNPREDICTABLE
UNPREDICTABLE results or occurrences do not disrupt the basic operation of the processor; it continues to execute instructions in its normal manner. Further:
•Results or occurrences specified as UNPREDICTABLE may vary from moment to
moment, implementation to implementation, and instruction to instruction within
implementations. Software can never depend on results specified as UNPREDICTABLE.
•An UNPREDICTABLE result may acquire an arbitrary value subject to a few constraints. Such a result may be an arbitrary function of the input operands or of any
state information that is accessible to the process in its current access mode.
UNPREDICTABLE results may be unchanged from their previous values.
Operations that produce UNPREDICTABLE results may also produce exceptions.
•An occurrence specified as UNPREDICTABLE may happen or not based on an
arbitrary choice function. The choice function is subject to the same constraints as
are UNPREDICTABLE results and, in particular, must not constitute a security
hole.
Specifically, UNPREDICTABLE results must not depend upon, or be a function of,
the contents of memory locations or registers that are inaccessible to the current
process in the current access mode.
Also, operations that may produce UNPREDICTABLE results must not:
–Write or modify the contents of memory locations or registers to which the current process in the current access mode does not have access, or
–Halt or hang the system or any of its components.
For example, a security hole would exist if some UNPREDICTABLE result
depended on the value of a register in another process, on the contents of processor
temporary registers left behind by some previously running process, or on a
sequence of actions of different processes.
X
Do not care. A capital X represents any valid value.
1
Introduction
This chapter provides a brief introduction to the Alpha architecture, Compaq’s RISC
(reduced instruction set computing) architecture designed for high performance. The
chapter then summarizes the specific features of the Alpha 21264/EV67 microprocessor (hereafter called the 21264/EV67) that implements the Alpha architecture. Appendix A provides a list of Alpha instructions.
The companion volume to this manual, the Alpha Architecture Handbook, Version 4,
contains the instruction set architecture. Also available is the Alpha Architecture Reference Manual, Third Edition, which contains the complete architecture information.
1.1 The Architecture
The Alpha architecture is a 64-bit load and store RISC architecture designed with particular emphasis on speed, multiple instruction issue, multiple processors, and software
migration from many operating systems.
All registers are 64 bits long and all operations are performed between 64-bit registers.
All instructions are 32 bits long. Memory operations are either load or store operations.
All data manipulation is done between registers.
The Alpha architecture supports the following data types:
•8-, 16-, 32-, and 64-bit integers
•IEEE 32-bit and 64-bit floating-point formats
•VAX architecture 32-bit and 64-bit floating-point formats
In the Alpha architecture, instructions interact with each other only by one instruction
writing to a register or memory location and another instruction reading from that register or memory location. This use of resources makes it easy to build implementations
that issue multiple instructions every CPU cycle.
The 21264/EV67 uses a set of subroutines, called privileged architecture library code
(PALcode), that is specific to a particular Alpha operating system implementation and
hardware platform. These subroutines provide operating system primitives for context
switching, interrupts, exceptions, and memory management. These subroutines can be
invoked by hardware or CALL_PAL instructions. CALL_PAL instructions use the
function field of the instruction to vector to a specified subroutine. PALcode is written
in standard machine code with some implementation-specific extensions to provide
direct access to low-level hardware functions. PALcode supports optimizations for multiple operating systems, flexible memory-management implementations, and multi-instruction atomic sequences.
The Alpha architecture performs byte shifting and masking with normal 64-bit, register-to-register instructions. The 21264/EV67 performs single-byte and single-word load
and store instructions.
1.1.1 Addressing
The basic addressable unit in the Alpha architecture is the 8-bit byte. The 21264/EV67
supports a 48-bit or 43-bit virtual address (selectable under IPR control).
Virtual addresses as seen by the program are translated into physical memory addresses
by the memory-management mechanism. The 21264/EV67 supports a 44-bit physical
address.
1.1.2 Integer Data Types
Alpha architecture supports the four integer data types listed in Table 1–1.
Table 1–1 Integer Data Types
Data Type  Description
Byte       A byte is 8 contiguous bits that start at an addressable byte boundary.
           A byte is an 8-bit value.
Word       A word is 2 contiguous bytes that start at an arbitrary byte boundary.
           A word is a 16-bit value.
Longword   A longword is 4 contiguous bytes that start at an arbitrary byte boundary.
           A longword is a 32-bit value.
Quadword   A quadword is 8 contiguous bytes that start at an arbitrary byte boundary.
Note: Alpha implementations may impose a significant performance penalty
when accessing operands that are not naturally aligned. Refer to the Alpha
Architecture Handbook, Version 4.
1.1.3 Floating-Point Data Types
The 21264/EV67 supports the following floating-point data types:
1.2 21264/EV67 Microprocessor Features
The 21264/EV67 microprocessor is a superscalar pipelined processor. It is packaged in
a 587-pin PGA carrier and has removable application-specific heat sinks. A number of
configuration options allow its use in a range of system designs ranging from extremely
simple uniprocessor systems with minimum component count to high-performance
multiprocessor systems with very high cache and memory bandwidth.
The 21264/EV67 can issue four Alpha instructions in a single cycle, thereby minimizing the average cycles per instruction (CPI). A number of low-latency and/or high-throughput features in the instruction issue unit and the onchip components of the memory subsystem further reduce the average CPI.
The 21264/EV67 and associated PALcode implement IEEE single-precision and double-precision, VAX F_floating and G_floating data types, and support longword
(32-bit) and quadword (64-bit) integers. Byte (8-bit) and word (16-bit) support is provided by byte-manipulation instructions. Limited hardware support is provided for the
VAX D_floating data type.
Other 21264/EV67 features include:
•The ability to issue up to four instructions during each CPU clock cycle.
•A peak instruction execution rate of four times the CPU clock frequency.
•An onchip, demand-paged memory-management unit with translation buffer, which,
when used with PALcode, can implement a variety of page table structures and translation algorithms. The unit consists of a 128-entry, fully-associative data translation
buffer (DTB) and a 128-entry, fully-associative instruction translation buffer (ITB),
with each entry able to map a single 8KB page or a group of 8, 64, or 512 8KB
pages. The allocation scheme for the ITB and DTB is round-robin. The size of each
translation buffer entry’s group is specified by hint bits stored in the entry. The
DTB and ITB implement 8-bit address space numbers (ASN), MAX_ASN=255.
•Two onchip, high-throughput pipelined floating-point units, capable of executing
both VAX and IEEE floating-point data types.
•An onchip, 64KB virtually-addressed instruction cache with 8-bit ASNs
(MAX_ASN=255).
•An onchip, virtually-indexed, physically-tagged dual-read-ported, 64KB data
cache.
•Supports a 48-bit or 43-bit virtual address (program selectable).
•Supports a 44-bit physical address.
•An onchip I/O write buffer with four 64-byte entries for I/O write transactions.
•An onchip, 8-entry victim data buffer.
•An onchip, 32-entry load queue.
•An onchip, 32-entry store queue.
•An onchip, 8-entry miss address file for cache fill requests and I/O read
transactions.
•An onchip, 8-entry probe queue, holding pending system port probe commands.
•An onchip, duplicate tag array used to maintain level 2 cache coherency.
•A 64-bit data bus with onchip parity and error correction code (ECC) support.
•Support for an external second-level (Bcache) cache. The size and some timing
parameters of the Bcache are programmable.
•An internal clock generator providing a high-speed clock used by the 21264/EV67,
and two clocks for use by the CPU module.
•Onchip performance counters to measure and analyze CPU and system perfor-
mance.
•Chip and module level test support, including an instruction cache test interface to
support chip and module level testing.
•A 2.0-V external interface.
Refer to Chapter 9 for 21264/EV67 dc and ac electrical characteristics. Refer to the
Alpha Architecture Handbook, Version 4, Appendix E, for waivers and any other
implementation-dependent information.
2
Internal Architecture
This chapter provides both an overview of the 21264/EV67 microarchitecture and a system designer’s view of the 21264/EV67 implementation of the Alpha architecture. The
combination of the 21264/EV67 microarchitecture and privileged architecture library
code (PALcode) defines the chip’s implementation of the Alpha architecture. If a certain
piece of hardware seems to be “architecturally incomplete,” the missing functionality is
implemented in PALcode. Chapter 6 provides more information on PALcode.
This chapter describes the major functional hardware units and is not intended to be a
detailed hardware description of the chip. It is organized as follows:
•21264/EV67 microarchitecture
•Pipeline organization
•Instruction issue and retire rules
•Load instructions to R31/F31 (software-directed instruction prefetch)
•Special cases of Alpha instruction execution
•Memory and I/O address space
•Miss address file (MAF) and load-merging rules
•Instruction ordering
•Replay traps
•I/O write buffer and the WMB instruction
•Performance measurement support
•Floating-point control register
•AMASK and IMPLVER instruction values
•Design examples
2.1 21264/EV67 Microarchitecture
The 21264/EV67 microprocessor is a high-performance third-generation implementation of the Compaq Alpha architecture. The 21264/EV67 consists of the following sections, as shown in Figure 2–1:
•Instruction fetch, issue, and retire unit (Ibox)
•Integer execution unit (Ebox)
•Floating-point execution unit (Fbox)
•Onchip caches (Icache and Dcache)
•Memory reference unit (Mbox)
•External cache and system interface unit (Cbox)
•Pipeline operation sequence
2.1.1 Instruction Fetch, Issue, and Retire Unit
The instruction fetch, issue, and retire unit (Ibox) consists of the following subsections:
•Virtual program counter logic
•Branch predictor
•Instruction-stream translation buffer (ITB)
•Instruction fetch logic
•Register rename maps
•Integer and floating-point issue queues
•Exception and interrupt logic
•Retire logic
2.1.1.1 Virtual Program Counter Logic
The virtual program counter (VPC) logic maintains the virtual addresses for instructions that are in flight. There can be up to 80 instructions, in 20 successive fetch slots, in
flight between the register rename mappers and the end of the pipeline. The VPC logic
contains a 20-entry table to store these fetched VPC addresses.
Figure 2–1 21264/EV67 Block Diagram
[Block diagram. Labeled blocks: Instruction Cache; Ibox (Fetch Unit, VPC Queue, Branch Predictor, ITB, Predecode, Decode and Rename Registers, Retire Unit, Integer Issue Queue of 20 entries, FP Issue Queue of 15 entries); Ebox (Address ALU 0 (L0), Address ALU 1 (L1), INT UNIT 0 (U0), INT UNIT 1 (U1), Integer Registers 0 and Integer Registers 1 of 80 registers each); Fbox (FP MUL, FP ADD/DIV/SQRT, FP Registers of 72 registers); Mbox (dual-ported, 128-entry DTB, Load Queue, Store Queue, Miss Address File, Dual-Ported Data Cache); Cbox (Victim Buffer, IOWB, Duplicate Tag Store, Probe Queue, Arbiter); cache and system data, address, and index buses.]
2.1.1.2 Branch Predictor
The branch predictor is composed of three units: the local, global, and choice predictors. Figure 2–2 shows how the branch predictor generates the predicted branch
address.
Figure 2–2 Branch Predictor
[Diagram: the outputs of the local predictor, global predictor, and choice predictor are combined to produce the predicted branch address.]
Local Predictor
The local predictor uses a 2-level table that holds the history of individual branches.
The 2-level table design approaches the prediction accuracy of a larger single-level
table while requiring fewer total bits of storage. Figure 2–3 shows how the local predictor generates a prediction. Bits [11:2] of the VPC of the current branch are used as
the index to a 1K entry table in which each entry is a 10-bit value. This 10-bit value is
used as the index to a 1K entry table of 3-bit saturating counters. The value of the saturating counter determines the prediction, taken/not-taken, of the current branch.
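The two-level lookup described above can be modeled in a few lines of Python. This is an illustrative sketch, not a description of the hardware: the class name, the initial table contents, the taken threshold (counter value 4 or higher), and the update policy are assumptions that the text does not specify.

```python
class LocalPredictor:
    """Sketch of a 2-level local predictor (1K x 10 history, 1K x 3 counters)."""

    def __init__(self) -> None:
        self.history = [0] * 1024    # per-branch 10-bit taken/not-taken history
        self.counters = [0] * 1024   # 3-bit saturating counters (assumed to start at 0)

    @staticmethod
    def _index(vpc: int) -> int:
        return (vpc >> 2) & 0x3FF    # VPC[11:2]

    def predict(self, vpc: int) -> bool:
        pattern = self.history[self._index(vpc)]
        return self.counters[pattern] >= 4   # assumed threshold: upper half means taken

    def update(self, vpc: int, taken: bool) -> None:
        i = self._index(vpc)
        pattern = self.history[i]
        count = self.counters[pattern]
        # Saturate at 0 and 7 rather than wrapping around.
        self.counters[pattern] = min(count + 1, 7) if taken else max(count - 1, 0)
        # Shift the newest outcome into the branch's 10-bit history.
        self.history[i] = ((pattern << 1) | int(taken)) & 0x3FF
```

In this model, a branch that is always taken fills its history with ones, and the counter selected by that all-ones pattern saturates toward the taken state after a few further executions.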
Figure 2–3 Local Predictor
[Diagram: VPC[11:2] indexes the local history table (1K x 10); the resulting 10-bit history indexes the local predictor table (1K x 3) of saturating counters, which produces the local branch prediction.]
Global Predictor
The global predictor is indexed by a global history of all recent branches. The global
predictor correlates the local history of the current branch with all recent branches. Figure 2–4 shows how the global predictor generates a prediction. The global path history
comprises the taken/not-taken state of the 12 most-recent branches. These 12
states are used to form an index into a 4K entry table of 2-bit saturating counters. The
value of the saturating counter determines the prediction, taken/not-taken, of the current branch.
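A minimal Python model of the global predictor follows. It is a sketch under stated assumptions: the class name, the zero initial state, the taken threshold (counter value 2 or higher), and the point at which the history is updated are illustrative choices, not details given in the text.

```python
class GlobalPredictor:
    """Sketch of a global predictor: 12-bit path history, 4K x 2 counters."""

    def __init__(self) -> None:
        self.path_history = 0        # outcomes of the 12 most recent branches
        self.counters = [0] * 4096   # 2-bit saturating counters (assumed to start at 0)

    def predict(self) -> bool:
        return self.counters[self.path_history] >= 2   # assumed threshold

    def update(self, taken: bool) -> None:
        count = self.counters[self.path_history]
        # Saturate at 0 and 3 rather than wrapping around.
        self.counters[self.path_history] = min(count + 1, 3) if taken else max(count - 1, 0)
        # Shift the newest outcome into the 12-bit path history.
        self.path_history = ((self.path_history << 1) | int(taken)) & 0xFFF
```

Because the index is the recent path rather than the branch address, this model learns repeating patterns such as strict taken/not-taken alternation: each of the two alternating history values selects its own counter.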
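Figure 2–4 Global Predictor
[Diagram: the 12-bit global path history indexes the global predictor table (4K x 2) of saturating counters, which produces the global branch prediction.]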
Choice Predictor
The choice predictor monitors the history of the local and global predictors and chooses
the best of the two predictors for a particular branch. Figure 2–5 shows how the choice
predictor generates its choice of the result of the local or global prediction. The 12-bit
global path history (see Figure 2–4) is used to index a 4K entry table of 2-bit saturating
counters. The value of the saturating counter determines the choice between the outputs
of the local and global predictors.
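The selection step can be sketched as follows. This model is illustrative only: the class name, the counter encoding (values 2 and 3 select the global prediction), the zero initial state, and the rule of training the counter only when the two predictions disagree are assumptions and are not stated in the text.

```python
class ChoicePredictor:
    """Sketch of a choice predictor: 4K x 2 counters indexed by path history."""

    def __init__(self) -> None:
        self.counters = [0] * 4096   # 2-bit saturating counters (assumed to start at 0)

    def choose(self, path_history: int, local_pred: bool, global_pred: bool) -> bool:
        use_global = self.counters[path_history & 0xFFF] >= 2
        return global_pred if use_global else local_pred

    def update(self, path_history: int, local_pred: bool,
               global_pred: bool, taken: bool) -> None:
        if local_pred == global_pred:
            return                    # no information about which predictor is better
        i = path_history & 0xFFF
        if global_pred == taken:
            self.counters[i] = min(self.counters[i] + 1, 3)   # global was right
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)   # local was right
```

A caller would pass in the predictions produced by the local and global predictors for the current branch, then call `update` with the resolved outcome.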
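Figure 2–5 Choice Predictor
[Diagram: the 12-bit global path history indexes the choice predictor table (4K x 2) of saturating counters, which produces the choice prediction.]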
2.1.1.3 Instruction-Stream Translation Buffer
The Ibox includes a 128-entry, fully-associative instruction-stream translation buffer
(ITB) that is used to store recently used instruction-stream (Istream) address translations and page protection information. Each of the entries in the ITB can map 1, 8, 64,
or 512 contiguous 8KB pages. The allocation scheme is round-robin.
The ITB supports an 8-bit ASN and contains an ASM bit. The Icache is virtually
addressed and contains the access-check information, so the ITB is accessed only for
Istream references that miss in the Icache.
Istream transactions to I/O address space are UNDEFINED.
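The ITB behavior described in this section (fully associative lookup, entries that map 1, 8, 64, or 512 contiguous 8KB pages, and round-robin allocation) can be sketched in Python. This is a simplified model, not the hardware design: the class and method names are hypothetical, protection information is omitted, groups of pages are assumed to be naturally aligned, and the ASM bit is modeled as matching any ASN.

```python
from typing import List, Optional, Tuple

PAGE_SHIFT = 13                         # 8KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1
GROUP_SIZES = (1, 8, 64, 512)           # pages mapped by a single entry

Entry = Tuple[int, int, int, int, bool]  # (base_vpn, base_pfn, pages, asn, asm)


class TranslationBuffer:
    def __init__(self, entries: int = 128) -> None:
        self.entries: List[Optional[Entry]] = [None] * entries
        self.next_slot = 0               # round-robin allocation pointer

    def insert(self, vpn: int, pfn: int, asn: int,
               pages: int = 1, asm: bool = False) -> None:
        if pages not in GROUP_SIZES:
            raise ValueError("an entry maps 1, 8, 64, or 512 pages")
        group_mask = pages - 1           # assumes naturally aligned groups
        self.entries[self.next_slot] = (
            vpn & ~group_mask, pfn & ~group_mask, pages, asn, asm)
        self.next_slot = (self.next_slot + 1) % len(self.entries)

    def translate(self, vaddr: int, asn: int) -> Optional[int]:
        vpn = vaddr >> PAGE_SHIFT
        for entry in self.entries:       # fully associative: check every entry
            if entry is None:
                continue
            base_vpn, base_pfn, pages, entry_asn, asm = entry
            group_mask = pages - 1
            if (asm or entry_asn == asn) and (vpn & ~group_mask) == base_vpn:
                pfn = base_pfn | (vpn & group_mask)
                return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK)
        return None                      # miss
```

In this sketch a miss simply returns `None`; filling the buffer after a miss is left to the caller.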
2.1.1.4 Instruction Fetch Logic
The instruction prefetcher (predecode) reads an octaword, containing up to four naturally aligned instructions per cycle, from the Icache. Branch prediction and line prediction bits accompany the four instructions. The branch prediction scheme operates most
efficiently when only one branch instruction is contained among the four fetched
instructions. The line prediction scheme attempts to predict the Icache line that the
branch predictor will generate, and is described in Section 2.2.
An entry from the subroutine return prediction stack, together with set prediction bits
for use by the Icache stream controller, is fetched along with the octaword. The Icache
stream controller generates fetch requests for additional Icache lines and stores the
Istream data in the Icache. There is no separate buffer to hold Istream requests.
2.1.1.5 Register Rename Maps
The instruction prefetcher forwards instructions to the integer and floating-point register rename maps. The rename maps perform the two functions listed here:
•Eliminate register write-after-read (WAR) and write-after-write (WAW) data
dependencies while preserving true read-after-write (RAW) data dependencies, in
order to allow instructions to be dynamically rescheduled.
•Provide a means of speculatively executing instructions before the control flow
previous to those instructions is resolved. Both exceptions and branch
mispredictions represent deviations from the control flow predicted by the
instruction prefetcher.
The map logic translates each instruction’s operand register specifiers from the virtual
register numbers in the instruction to the physical register numbers that hold the corresponding architecturally-correct values. The map logic also renames each instruction’s
destination register specifier from the virtual number in the instruction to a physical
register number chosen from a list of free physical registers, and updates the register
maps.
The map logic can process four instructions per cycle. It does not return the physical
register, which holds the old value of an instruction’s virtual destination register, to the
free list until the instruction has been retired, indicating that the control flow up to that
instruction has been resolved.
If a branch mispredict or exception occurs, the map logic backs up the contents of the
integer and floating-point register rename maps to the state associated with the instruction that triggered the condition, and the prefetcher restarts at the appropriate VPC. At
most, 20 valid fetch slots containing up to 80 instructions can be in flight between the
register maps and the end of the machine’s pipeline, where the control flow is finally
resolved. The map logic is capable of backing up the contents of the maps to the state
associated with any of these 80 instructions in a single cycle.
The register rename logic places instructions into an integer or floating-point issue
queue, from which they are later issued to functional units for execution.
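The renaming scheme described in this section can be illustrated with a small Python model. It is a sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, 32 architectural and 80 physical integer registers are assumed, R31 is not treated specially, and the map is restored by walking back through in-flight instructions rather than in a single cycle.

```python
from collections import deque
from typing import List, Tuple


class RenameMap:
    def __init__(self, arch_regs: int = 32, phys_regs: int = 80) -> None:
        self.map = list(range(arch_regs))              # virtual -> physical
        self.free = deque(range(arch_regs, phys_regs))  # free physical registers
        self.in_flight: List[Tuple[int, int, int]] = []  # (dest, new_phys, old_phys)

    def rename(self, dest: int, srcs: List[int]) -> Tuple[int, List[int]]:
        # Sources read the current mapping, which preserves RAW dependencies.
        phys_srcs = [self.map[s] for s in srcs]
        # A fresh destination register removes WAR and WAW dependencies.
        new_phys = self.free.popleft()
        self.in_flight.append((dest, new_phys, self.map[dest]))
        self.map[dest] = new_phys
        return new_phys, phys_srcs

    def retire_oldest(self) -> None:
        # Only at retirement is the old destination register returned to the free list.
        _, _, old_phys = self.in_flight.pop(0)
        self.free.append(old_phys)

    def squash_from(self, index: int) -> None:
        # Undo in-flight instruction `index` and everything younger than it.
        while len(self.in_flight) > index:
            dest, new_phys, old_phys = self.in_flight.pop()
            self.map[dest] = old_phys
            self.free.appendleft(new_phys)
```

Two successive writes to the same virtual register receive different physical registers in this model, so the second write need not wait for readers of the first.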
2.1.1.6 Integer Issue Queue
The 20-entry integer issue queue (IQ), associated with the integer execution units
(Ebox), issues the following types of instructions at a maximum rate of four per cycle:
•Integer operate
•Integer conditional branch
•Unconditional branch – both displacement and memory format
•Integer-to-floating-point (ITOFx) and floating-point-to-integer (FTOIx)
Each queue entry asserts four request signals, one for each of the Ebox subclusters. A
queue entry asserts a request when it contains an instruction that can be executed by the
subcluster, if the instruction’s operand register values are available within the subcluster.
There are two arbiters, one for the upper subclusters and one for the lower subclusters.
(Subclusters are described in Section 2.1.2.) Each arbiter picks two of the possible 20
requesters for service each cycle. A given instruction only requests upper subclusters or
lower subclusters, but because many instructions can only be executed in one type or
another, this is not too limiting.
For example, load and store instructions can only go to lower subclusters and shift
instructions can only go to upper subclusters. Other instructions, such as addition and
logic operations, can execute in either upper or lower subclusters and are statically
assigned before being placed in the IQ.
The IQ arbiters choose between simultaneous requesters of a subcluster based on the
age of the request—older requests are given priority over newer requests. If a given
instruction requests both lower subclusters, and no older instruction requests a lower
subcluster, then the arbiter assigns subcluster L0 to the instruction. If a given instruction
requests both upper subclusters, and no older instruction requests an upper subcluster,
then the arbiter assigns subcluster U1 to the instruction. This asymmetry between the
upper and lower subcluster arbiters is a circuit implementation optimization with negligible overall performance effect.
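The age-based arbitration can be sketched as follows. This is a simplified illustration: the `Entry` fields and the function name are hypothetical, every ready entry is assumed to request both subclusters of its type, and giving the second-oldest requester the remaining subcluster is an assumption rather than a statement from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    name: str
    age: int        # smaller value means older request
    cluster: str    # "upper" or "lower", assigned before entering the IQ
    ready: bool     # operand register values are available


# The oldest lower requester is assigned L0; the oldest upper requester is assigned U1.
ASSIGNMENT_ORDER = {"lower": ["L0", "L1"], "upper": ["U1", "U0"]}


def issue_cycle(queue: List[Entry]) -> Dict[str, str]:
    """Pick up to two requesters per arbiter, oldest first."""
    issued: Dict[str, str] = {}
    for cluster, subclusters in ASSIGNMENT_ORDER.items():
        requesters = sorted(
            (e for e in queue if e.ready and e.cluster == cluster),
            key=lambda e: e.age)
        for entry, subcluster in zip(requesters, subclusters):
            issued[subcluster] = entry.name
    return issued
```

Because each of the two arbiters picks at most two requesters, the function never issues more than four instructions in one call, matching the maximum rate of four per cycle.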
2.1.1.7 Floating-Point Issue Queue
The 15-entry floating-point issue queue (FQ) associated with the Fbox issues the following instruction types:
•Floating-point operates
•Floating-point conditional branches
•Floating-point stores
•Floating-point register to integer register transfers (FTOIx)
Each queue entry has three request lines, one for the add pipeline, one for the multiply
pipeline, and one for the two store pipelines. There are three arbiters, one for each of
the add, multiply, and store pipelines. The add and multiply arbiters pick one requester
per cycle, while the store pipeline arbiter picks two requesters per cycle, one for each
store pipeline.
The FQ arbiters pick between simultaneous requesters of a pipeline based on the age of
the request; older requests are given priority over newer requests. Floating-point store
instructions and FTOIx instructions in even-numbered queue entries arbitrate for one
store port. Floating-point store instructions and FTOIx instructions in odd-numbered
queue entries arbitrate for the second store port.
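As a rough illustration of the even/odd rule above, the following Python sketch maps FQ entries to store ports and picks the oldest requester for each port. The port numbering (even entries to port 0, odd entries to port 1), the (age, entry) tuple encoding, and the function names are assumptions made for this sketch; the manual only states that even and odd entries arbitrate for different store ports and that older requests have priority.

```python
FQ_ENTRIES = 15  # size of the floating-point issue queue


def store_port(entry_index: int) -> int:
    """Map an FQ entry number to a store port by parity (numbering assumed)."""
    if not 0 <= entry_index < FQ_ENTRIES:
        raise ValueError("FQ entry index out of range")
    return entry_index % 2


def pick_store_requesters(requests):
    """requests: iterable of (age, entry_index); smaller age means older.

    Returns {port: entry_index} with the oldest requester for each port.
    """
    winners = {}
    for _age, entry in sorted(requests):
        # setdefault keeps the first (oldest) requester seen for each port.
        winners.setdefault(store_port(entry), entry)
    return winners
```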
Floating-point store instructions and FTOIx instructions are queued in both the integer
and floating-point queues. They wait in the floating-point queue until their operand register values are available. They subsequently request service from the store arbiter.
Upon being issued from the floating-point queue, they signal the corresponding entry in
the integer queue to request service. Upon being issued from the integer queue, the
operation is completed.
2.1.1.8 Exception and Interrupt Logic
There are two types of exceptions: faults and synchronous traps. Arithmetic exceptions
are precise and are reported as synchronous traps.
The four sources of interrupts are listed as follows:
•Level-sensitive hardware interrupts sourced by the IRQ_H[5:0] pins
•Edge-sensitive hardware interrupts generated by the serial line receive pin,
performance counter overflows, and hardware corrected read errors
•Software interrupts sourced by the software interrupt request (SIRR) register
•Asynchronous system traps (ASTs)
Interrupt sources can be individually masked. In addition, AST interrupts are qualified
by the current processor mode.
2.1.1.9 Retire Logic
The Ibox fetches instructions in program order, executes them out of order, and then
retires them in order. The Ibox retire logic maintains the architectural state of the
machine by retiring an instruction only if all previous instructions have executed without generating exceptions or branch mispredictions. Retiring an instruction commits the
machine to any changes the instruction may have made to the software-visible state.
The three software-visible states are listed as follows:
•Integer and floating-point registers
•Memory
•Internal processor registers (including control/status registers and translation
buffers)
The retire logic can sustain a maximum retire rate of eight instructions per cycle, and
can retire up to as many as 11 instructions in a single cycle.
2.1.2 Integer Execution Unit
The integer execution unit (Ebox) is a 4-path integer execution unit that is implemented
as two functional-unit “clusters” labeled 0 and 1. Each cluster contains a copy of an 80-entry, physical-register file and two “subclusters”, named upper (U) and lower (L). Figure 2–6 shows the integer execution unit. In the figure, iop_wr is the cross-cluster bus
for moving integer result values between clusters.
Figure 2–6 Integer Execution Unit: Clusters 0 and 1
[Figure: cluster 0 contains subclusters U0 and L0 and a register file copy; cluster 1 contains subclusters U1 and L1 and a register file copy. Signal labels: iop_wr, eff_VA, Load/Store Data.]
Most instructions have 1-cycle latency for consumers that execute within the same cluster. Also, there is another 1-cycle delay associated with producing a value in one cluster
and consuming the value in the other cluster. The instruction issue queue minimizes the
performance effect of this cross-cluster delay. The Ebox contains the following
resources:
•Four 64-bit adders that are used to calculate results for integer add instructions
(located in U0, U1, L0, and L1)
•The adders in the lower subclusters that are used to generate the effective virtual
address for load and store instructions (located in L0 and L1)
•Four logic units
•Two barrel shifters and associated byte logic (located in U0 and U1)
•Two sets of conditional branch logic (located in U0 and U1)
•Two copies of an 80-entry register file
•One pipelined multiplier (located in U1) with 7-cycle latency for all integer multiply
operations
•One fully-pipelined unit (located in U0), with 3-cycle latency, that executes the fol-
The Ebox has 80 register-file entries that contain storage for the values of the 31 Alpha
integer registers (the value of R31 is not stored), the values of 8 PALshadow registers,
and 41 results written by instructions that have not yet been retired.
Ignoring cross-cluster delay, the two copies of the Ebox register file contain identical
values. Each copy of the Ebox register file contains four read ports and six write ports.
The four read ports are used to source operands to each of the two subclusters within a
cluster. The six write ports are used as follows:
•Two write ports are used to write results generated within the cluster.
•Two write ports are used to write results generated by the other cluster.
•Two write ports are used to write results from load instructions. These two ports
are also used for FTOIx instructions.
2.1.3 Floating-Point Execution Unit
The floating-point execution unit (Fbox) has two paths. The Fbox executes both VAX
and IEEE floating-point instructions. It supports IEEE S_floating-point and T_floating-point data types and all rounding modes. It also supports VAX F_floating-point and
G_floating-point data types, and provides limited support for D_floating-point format.
The basic structure of the floating-point execution unit is shown in Figure 2–7.
Figure 2–7 Floating-Point Execution Units
The Fbox contains the following resources:
•72-entry physical register file
•Fully-pipelined multiplier with 4-cycle latency
•Fully-pipelined adder with 4-cycle latency
•Nonpipelined divide unit associated with the adder pipeline
•Nonpipelined square root unit associated with the adder pipeline
The 72 Fbox register file entries contain storage for the values of the 31 Alpha floating-point registers (F31 is not stored) and 41 values written by instructions that have not
been retired.
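The register-file sizes quoted for the Ebox and Fbox follow from simple accounting, which the short Python check below reproduces. The constant names are illustrative only; the counts themselves come from the text.

```python
ARCH_REGS = 31    # R0-R30 or F0-F30; R31 and F31 are not stored
PAL_SHADOW = 8    # PALshadow registers, integer register file only
IN_FLIGHT = 41    # results of instructions that have not yet retired

ebox_entries = ARCH_REGS + PAL_SHADOW + IN_FLIGHT  # integer register file
fbox_entries = ARCH_REGS + IN_FLIGHT               # floating-point register file

print(ebox_entries, fbox_entries)
```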
The Fbox register file contains six read ports and four write ports. Four read ports are
used to source operands to the add and multiply pipelines, and two read ports are used
to source data for store instructions. Two write ports are used to write results generated
by the add and multiply pipelines, and two write ports are used to write results from
floating-point load instructions.
2.1.4 External Cache and System Interface Unit
The interface for the system and external cache (Cbox) controls the Bcache and system
ports. It contains the following structures:
•Victim address file (VAF)
•Victim data file (VDF)
•I/O write buffer (IOWB)
•Probe queue (PQ)
•Duplicate Dcache tag (DTAG)
2.1.4.1 Victim Address File and Victim Data File
The victim address file (VAF) and victim data file (VDF) together form an 8-entry victim buffer used for holding:
•Dcache blocks to be written to the Bcache
•Istream cache blocks from memory to be written to the Bcache
•Bcache blocks to be written to memory
•Cache blocks sent to the system in response to probe commands
2.1.4.2 I/O Write Buffer
The I/O write buffer (IOWB) consists of four 64-byte entries and associated address
and control logic used for buffering I/O write data between the store queue and the system port.
2.1.4.3 Probe Queue
The probe queue (PQ) is an 8-entry queue that holds pending system port cache probe
commands and addresses.
2.1.4.4 Duplicate Dcache Tag Array
The duplicate Dcache tag (DTAG) array holds a duplicate copy of the Dcache tags and
is used by the Cbox when processing Dcache fills, Icache fills, and system port probes.
2.1.5 Onchip Caches
The 21264/EV67 contains two onchip primary-level caches.
2.1.5.1 Instruction Cache
The instruction cache (Icache) is a 64KB virtual-addressed, 2-way set-predict cache.
Set prediction is used to approximate the performance of a 2-set cache without slowing
the cache access time. Each Icache block contains:
•16 Alpha instructions (64 bytes)
•Virtual tag bits [47:15]
•8-bit address space number (ASN) field
•1-bit address space match (ASM) bit
•1-bit PALcode bit to indicate physical addressing
•Valid bit
•Data and tag parity bits
•Four access-check bits for the following modes: kernel, executive, supervisor, and
user (KESU)
•Additional predecoded information to assist with instruction processing and fetch
control
2.1.5.2 Data Cache
The data cache (Dcache) is a 64KB, 2-way set-associative, virtually indexed, physically
tagged, write-back, read/write allocate cache with 64-byte blocks. During each cycle
the Dcache can perform one of the following transactions:
•Two quadword (or shorter) read transactions to arbitrary addresses
•Two quadword write transactions to the same aligned octaword
•Two non-overlapping less-than-quadword writes to the same aligned quadword
•One sequential read and write transaction from and to the same aligned octaword
Each Dcache block contains:
•64 data bytes and associated quadword ECC bits
•Physical tag bits
•Valid, dirty, shared, and modified bits
•Tag parity bit calculated across the tag, dirty, shared, and modified bits
•One bit to control round-robin set allocation (one bit per two cache blocks)
The Dcache contains two sets, each with 512 rows containing 64-byte blocks per row
(that is, 32K bytes of data per set). The 21264/EV67 requires two additional bits of virtual address beyond the bits that specify an 8KB page, in order to specify a Dcache row
index. A given virtual address might be found in four unique locations in the Dcache,
depending on the virtual-to-physical translation for those two bits. The 21264/EV67
prevents this aliasing by keeping only one of the four possible translated addresses in
the cache at any time.
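The arithmetic behind the two extra index bits can be checked with a short Python sketch. The variable and function names are illustrative; only the cache size, number of sets, block size, and page size come from the text. The row_index helper assumes the usual layout in which the block offset occupies the low-order address bits.

```python
CACHE_BYTES = 64 * 1024   # total Dcache capacity
SETS = 2                  # the manual's "sets" (ways of associativity)
BLOCK_BYTES = 64          # bytes per cache block
PAGE_BYTES = 8 * 1024     # Alpha page size

bytes_per_set = CACHE_BYTES // SETS                  # 32K bytes per set
rows = bytes_per_set // BLOCK_BYTES                  # 512 rows per set
offset_bits = BLOCK_BYTES.bit_length() - 1           # 6 bits of block offset
row_bits = rows.bit_length() - 1                     # 9 bits of row index
page_offset_bits = PAGE_BYTES.bit_length() - 1       # 13 untranslated bits

# Index bits that lie above the page offset and therefore depend on translation.
extra_bits = offset_bits + row_bits - page_offset_bits
alias_locations = 2 ** extra_bits


def row_index(virtual_address: int) -> int:
    """Dcache row selected by a virtual address (low-order offset assumed)."""
    return (virtual_address >> offset_bits) & (rows - 1)
```

Under these assumptions the row index spans virtual address bits [14:6], so bits [14:13] are the two bits beyond the 8KB page offset, which gives the four possible locations mentioned above.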
2.1.6 Memory Reference Unit
The memory reference unit (Mbox) controls the Dcache and ensures architecturally
correct behavior for load and store instructions. The Mbox contains the following structures:
•Load queue (LQ)
•Store queue (SQ)
•Miss address file (MAF)
•Dstream translation buffer (DTB)
2.1.6.1 Load Queue
The load queue (LQ) is a reorder buffer for load instructions. It contains 32 entries and
maintains the state associated with load instructions that have been issued to the Mbox,
but for which results have not been delivered to the processor and the instructions
retired. The Mbox assigns load instructions to LQ slots based on the order in which
they were fetched from the Icache, then places them into the LQ after they are issued by
the IQ. The LQ helps ensure correct Alpha memory reference behavior.
2.1.6.2 Store Queue
The store queue (SQ) is a reorder buffer and graduation unit for store instructions. It
contains 32 entries and maintains the state associated with store instructions that have
been issued to the Mbox, but for which data has not been written to the Dcache and the
instruction retired. The Mbox assigns store instructions to SQ slots based on the order
in which they were fetched from the Icache and places them into the SQ after they are
issued by the IQ. The SQ holds data associated with store instructions issued from the
IQ until they are retired, at which point the store can be allowed to update the Dcache.
The SQ also helps ensure correct Alpha memory reference behavior.
2.1.6.3 Miss Address File
The 8-entry miss address file (MAF) holds physical addresses associated with pending
Icache and Dcache fill requests and pending I/O space read transactions.
2.1.6.4 Dstream Translation Buffer
The Mbox includes a 128-entry, fully associative Dstream translation buffer (DTB) used
to store Dstream address translations and page protection information. Each of the entries
in the DTB can map 1, 8, 64, or 512 contiguous 8KB pages. The allocation scheme is
round-robin. The DTB supports an 8-bit ASN and contains an ASM bit.
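To make the mapping sizes concrete, the following Python sketch computes the address range that a single DTB entry can cover for each granularity, and the largest range the whole DTB could cover if every entry used the largest mapping. The names are illustrative, and the "reach" figure is a derived upper bound, not a number stated in this manual.

```python
PAGE_BYTES = 8 * 1024            # base page size
PAGES_PER_ENTRY = (1, 8, 64, 512)  # contiguous pages one entry can map
DTB_ENTRIES = 128

# Bytes mapped by one entry at each granularity: 8KB, 64KB, 512KB, 4MB.
entry_bytes = [n * PAGE_BYTES for n in PAGES_PER_ENTRY]

# Upper bound on total mapped memory if all entries use 512-page mappings.
max_reach_bytes = DTB_ENTRIES * max(entry_bytes)

for n, size in zip(PAGES_PER_ENTRY, entry_bytes):
    print(f"{n:3d} pages -> {size // 1024} KB per entry")
print(f"maximum reach: {max_reach_bytes // (1024 * 1024)} MB")
```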
2.1.7 SROM Interface
The serial read-only memory (SROM) interface provides the initialization data load
path from a system SROM to the Icache. Refer to Chapter 7 for more information.
2.2 Pipeline Organization
The 7-stage pipeline provides an optimized environment for executing Alpha instructions. The pipeline stages (0 to 6) are shown in Figure 2–8 and described in the following paragraphs.
Figure 2–8 Pipeline Organization
[Figure: block diagram of pipeline stages 0 through 6. Labeled blocks: branch predictor; instruction cache (64KB, 2-set) delivering four instructions; integer and floating-point register rename maps; integer issue queue (20) and floating-point issue queue (15); integer and floating-point register files; ALU, shifter, multiplier, and address ALU units; floating-point add, divide, and square root; floating-point multiply; 64KB data cache; bus interface unit with system bus (64 bits), cache bus (128 bits), and physical address (44 bits).]
Stage 0 — Instruction Fetch
The branch predictor uses a branch history algorithm to predict a branch instruction target address.
Up to four aligned instructions are fetched from the Icache, in program order. The
branch prediction tables are also accessed in this cycle. The branch predictor uses tables
and a branch history algorithm to predict a branch instruction target address for one
branch or memory format JSR instruction per cycle. Therefore, the prefetcher is limited
to fetching through one branch per cycle. If there is more than one branch within the
fetch line, and the branch predictor predicts that the first branch will not be taken, it will
predict through subsequent branches at the rate of one per cycle, until it predicts a taken
branch or predicts through the last branch in the fetch line.
The Icache array also contains a line prediction field, the contents of which are applied
to the Icache in the next cycle. The purpose of the line predictor is to remove the pipeline bubble which would otherwise be created when the branch predictor predicts a
branch to be taken. In effect, the line predictor attempts to predict the Icache line which
the branch predictor will generate. On fills, the line predictor value at each fetch line is
initialized with the index of the next sequential fetch line, and later retrained by the
branch predictor if necessary.
Stage 1 — Instruction Slot
The Ibox maps four instructions per cycle from the 64KB 2-way set-predict Icache.
Instructions are mapped in order, executed dynamically, but are retired in order.
In the slot stage, the branch predictor compares the next Icache index that it generates to
the index that was generated by the line predictor. If there is a mismatch, the branch
predictor wins—the instructions fetched during that cycle are aborted, and the index
predicted by the branch predictor is applied to the Icache during the next cycle. Line
mispredictions result in one pipeline bubble.
The line predictor takes precedence over the branch predictor during memory format
calls or jumps. If the line predictor was trained with a true (as opposed to predicted)
memory format call or jump target, then its contents take precedence over the target
hint field associated with these instructions. This allows dynamic calls or jumps to be
correctly predicted.
The instruction fetcher produces the full VPC address during the fetch stage of the pipeline. The Icache produces the tags for both Icache sets 0 and 1 each time it is accessed.
That enables the fetcher to separate set mispredictions from true Icache misses. If the
access was caused by a set misprediction, the instruction fetcher aborts the last two
fetched slots and refetches the slot in the next cycle. It also retrains the appropriate set
prediction bits.
The instruction data is transferred from the Icache to the integer and floating-point register map hardware during this stage. When the integer instr uction is fetched from the
Icache and slotted into the IQ, the slot logic determines whether the instruction is for
the upper or lower subclusters. The slot logic makes the decision based on the
resources needed by the (up to four) integer instructions in the fetch block. Although all
four instructions need not be issued simultaneously, distributing their resource usage
improves instruction loading across the units. For example, if a fetch block contains
two instructions that can be placed in either cluster followed by two instructions that
must execute in the lower cluster, the slot logic would designate that combination as
EELL and slot them as UULL. Slot combinations are described in Section 2.3.2 and
Table 2–3.
Stage 2 — Map
Instructions are sent from the Icache to the integer and floating-point register maps during the slot stage and register renaming is performed during the map stage. Also, each
instruction is assigned a unique 8-bit number, called an inum, which is used to identify
the instruction and its program order with respect to other instructions during the time
that it is in flight. Instructions are considered to be in flight between the time they are
mapped and the time they are retired.
Mapped instructions and their associated inums are placed in the integer and floating-point queues by the end of the map stage.
Stage 3 — Issue
The 20-entry integer issue queue (IQ) issues instructions at the rate of four per cycle.
The 15-entry floating-point issue queue (FQ) issues floating-point operate instructions,
conditional branch instructions, and store instructions, at the rate of two per cycle. Normally, instructions are deleted from the IQ or FQ two cycles after they are issued. For
example, if an instruction is issued in cycle n, it remains in the FQ or IQ in cycle n+1
but does not request service, and is deleted in cycle n+2.
Stage 4 — Register Read
Instructions issued from the issue queues read their operands from the integer and floating-point register files and receive bypass data.
Stage 5 — Execute
The Ebox and Fbox pipelines begin execution.
Stage 6 — Dcache Access
Memory reference instructions access the Dcache and data translation buffers. Normally load instructions access the tag and data arrays while store instructions only
access the tag arrays. Store data is written to the store queue where it is held until the
store instruction is retired. Most integer operate instructions write their register results
in this cycle.
2.2.1 Pipeline Aborts
The abort penalty as given is measured from the cycle after the fetch stage of the
instruction which triggers the abort to the fetch stage of the new target, ignoring any
Ibox pipeline stalls or queuing delay that the triggering instruction might experience.
Table 2–1 lists the timing associated with each common source of pipeline abort.
Table 2–1 Pipeline Abort Delay (GCLK Cycles)
Abort Condition            Penalty (Cycles)   Comments
Branch misprediction       7                  Integer or floating-point conditional branch.
JSR misprediction          8                  Memory format JSR or HW_RET.
Mbox order trap            14                 Load-load order or store-load order.
Other Mbox replay traps    13                 -
Table 2–2 Instruction Name, Pipeline, and Types (Continued)
Class Name   Pipeline             Instruction Type
                                  divide, square root, and conditional move instructions
fdiv         FA                   Floating-point divide instruction
fsqrt        FA                   Floating-point square root instruction
nop          None                 TRAPB, EXCB, UNOP - LDQ_U R31, 0(Rx)
ftoi         FST0, FST1, L0, L1   FTOIS, FTOIT
itof         L0, L1               ITOFS, ITOFF, ITOFT
mx_fpcr      FM                   Instructions that move data from the floating-point
                                  control register
2.3.2 Ebox Slotting
Instructions that are issued from the IQ, and could execute in either upper or lower
Ebox subclusters, are slotted to one pair or the other during the pipeline mapping stage
based on the instruction mixture in the fetch line. The codes that are used in Table 2–3
are as follows:
•U: The instruction only executes in an upper subcluster.
•L: The instruction only executes in a lower subcluster.
•E: The instruction could execute in either an upper or lower subcluster.
Table 2–3 defines the slotting rules. The table field Instruction Class 3, 2, 1 and 0 identifies each instruction’s location in the fetch line by the value of bits [3:2] in its PC.
Table 2–3 Instruction Group Definitions and Pipeline Unit
Instruction Class   Slotting    Instruction Class   Slotting
3 2 1 0             3 2 1 0     3 2 1 0             3 2 1 0
E E E E             U L U L     L L L L             L L L L
E E E L             U L U L     L L L U             L L L U
E E E U             U L L U     L L U E             L L U U
E E L E             U L L U     L L U L             L L U L
E E L L             U U L L     L L U U             L L U U
E E L U             U L L U     L U E E             L U L U
E E U E             U L U L     L U E L             L U U L
E E U L             U L U L     L U E U             L U L U
E E U U             L L U U     L U L E             L U L U
E L E E             U L U L     L U L L             L U L L
E L E L             U L U L     L U L U             L U L U
E L E U             U L L U     L U U E             L U U L
E L L E             U L L U     L U U L             L U U L
E L L L             U L L L     L U U U             L U U U
E L L U             U L L U     U E E E             U L U L
E L U E             U L U L     U E E L             U L U L
E L U L             U L U L     U E E U             U L L U
E L U U             L L U U     U E L E             U L L U
E U E E             L U L U     U E L L             U U L L
E U E L             L U U L     U E L U             U L L U
E U E U             L U L U     U E U E             U L U L
E U L E             L U L U     U E U L             U L U L
E U L L             U U L L     U E U U             U L U U
E U L U             L U L U     U L E E             U L U L
E U U E             L U U L     U L E L             U L U L
E U U L             L U U L     U L E U             U L L U
E U U U             L U U U     U L L E             U L L U
L E E E             L U L U     U L L L             U L L L
L E E L             L U U L     U L L U             U L L U
L E E U             L U L U     U L U E             U L U L
L E L E             L U L U     U L U L             U L U L
L E L L             L U L L     U L U U             U L U U
L E L U             L U L U     U U E E             U U L L
L E U E             L U U L     U U E L             U U L L
L E U L             L U U L     U U E U             U U L U
L E U U             L L U U     U U L E             U U L L
L L E E             L L U U     U U L L             U U L L
L L E L             L L U L     U U L U             U U L U
L L E U             L L U U     U U U E             U U U L
L L L E             L L L U     U U U L             U U U L
-                   -           U U U U             U U U U
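The slotting table lends itself to a simple lookup. The Python sketch below encodes only a handful of rows copied from Table 2–3, with the class and slotting strings written in the table's order (instruction 3 first, instruction 0 last). The dictionary name and helper functions are illustrative, and the consistency check expresses an observation about these rows rather than a rule stated in the manual: instructions classed U or L keep their class, and only E instructions are assigned.

```python
# Subset of Table 2-3: instruction classes (3 2 1 0) -> slotting (3 2 1 0).
SLOTTING = {
    "EEEE": "ULUL",
    "EELL": "UULL",
    "EEUU": "LLUU",
    "ELLL": "ULLL",
    "LLLE": "LLLU",
    "LLLL": "LLLL",
    "UEEE": "ULUL",
    "UUUU": "UUUU",
}


def slot(classes: str) -> str:
    """Look up the slotting for a fetch line; raises KeyError if not encoded."""
    return SLOTTING[classes]


def respects_fixed_classes(classes: str, slotting: str) -> bool:
    """True if every U or L instruction keeps its class in the slotting."""
    return all(c == "E" or c == s for c, s in zip(classes, slotting))


print(slot("EELL"))
```

The EELL entry corresponds to the example given in the slot stage description, where two either-subcluster instructions followed by two lower-subcluster instructions are slotted as UULL.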
2.3.3 Instruction Latencies
After an instruction is placed in the IQ or FQ, its issue point is determined by the availability of its register operands, functional unit(s), and relationship to other instructions
in the queue. There are register producer-consumer dependencies and dynamic functional unit availability dependencies that affect instruction issue. The mapper removes
register producer-producer dependencies.
The latency to produce a register result is generally fixed. The one exception is for load
instructions that miss the Dcache. Table 2–4 lists the latency, in cycles, for each
instruction class.
Table 2–4 Instruction Class Latency in Cycles
Class    Latency   Comments
ild      3         Dcache hit.
         13+       Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop
                   latency if Bcache latency is greater than 6 cycles.
fld      4         Dcache hit.
         14+       Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop
                   latency if Bcache latency is greater than 6 cycles.
ist      -         Does not produce register value.
fst      -         Does not produce register value.
rpcc     1         Possible 1-cycle cross-cluster delay.
rx       1         -
mxpr     1 or 3    HW_MFPR: Ebox IPRs = 1. Ibox and Mbox IPRs = 3.
                   HW_MTPR does not produce a register value.
icbr     -         Conditional branch. Does not produce register value.
ubr      3         Unconditional branch. Does not produce register value.
jsr      3         -
iadd     1         Possible 1-cycle Ebox cross-cluster delay.
ilog     1         Possible 1-cycle Ebox cross-cluster delay.
ishf     1         Possible 1-cycle Ebox cross-cluster delay.
cmov1    1         Only consumer is cmov2. Possible 1-cycle Ebox cross-cluster delay.
cmov2    1         Possible 1-cycle Ebox cross-cluster delay.
imul     7         Possible 1-cycle Ebox cross-cluster delay.
imisc    3         Possible 1-cycle Ebox cross-cluster delay.
fcbr     -         Does not produce register value.
fadd     4         Consumer other than fst or ftoi.
         6         Consumer fst or ftoi.
                   Measured from when an fadd is issued from the FQ to when an fst or
                   ftoi is issued from the IQ.
fmul     4         Consumer other than fst or ftoi.
         6         Consumer fst or ftoi.
                   Measured from when an fmul is issued from the FQ to when an fst or
                   ftoi is issued from the IQ.
fcmov1   4         Only consumer is fcmov2.
fcmov2   4         Consumer other than fst.
         6         Consumer fst or ftoi.
                   Measured from when an fcmov2 is issued from the FQ to when an fst or
                   ftoi is issued from the IQ.
fdiv     12        Single precision - latency to consumer of result value.
         9         Single precision - latency to using divider again.
         15        Double precision - latency to consumer of result value.
         12        Double precision - latency to using divider again.
fsqrt    18        Single precision - latency to consumer of result value.
         15        Single precision - latency to using unit again.
         33        Double precision - latency to consumer of result value.
         30        Double precision - latency to using unit again.
ftoi     3         -
itof     4         -
nop      -         Does not produce register value.
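As a rough aid for reading Table 2–4, the Python sketch below adds up the latencies along a chain of dependent instructions. It is a simplified model under stated assumptions: every load hits the Dcache, each consumer issues as soon as its operand is available, floating-point consumers are not fst or ftoi, and the caller supplies the number of cross-cluster hops. The dictionary covers only a few classes, and the function name is illustrative.

```python
# Register-result latencies in cycles for a few classes from Table 2-4
# (Dcache-hit values for loads; fadd/fmul for consumers other than fst/ftoi).
LATENCY = {
    "ild": 3, "fld": 4,
    "iadd": 1, "ilog": 1, "ishf": 1, "imul": 7, "imisc": 3,
    "fadd": 4, "fmul": 4,
}
CROSS_CLUSTER_DELAY = 1  # extra cycle when an Ebox result crosses clusters


def chain_latency(classes, cluster_crossings=0):
    """Cycles from issue of the first instruction until the last result is usable."""
    base = sum(LATENCY[c] for c in classes)
    return base + cluster_crossings * CROSS_CLUSTER_DELAY


# A load feeding an add feeding a multiply, all in one cluster.
print(chain_latency(["ild", "iadd", "imul"]))
```

With one producer and consumer in different clusters, the same chain costs one additional cycle, for example chain_latency(["ild", "iadd", "imul"], cluster_crossings=1).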
2.4 Instruction Retire Rules
An instruction is retired when it has been executed to completion, and all previous
instructions have been retired. The execution pipeline stage in which an instruction
becomes eligible to be retired depends upon the instruction’s class.
Table 2–5 gives the minimum retire latencies (assuming that all previous instructions
have been retired) for various classes of instructions.
Table 2–5 Minimum Retire Latencies for Instruction Classes
Instruction Class                   Retire Stage   Comments
Integer conditional branch          7              -
Integer multiply                    7/13           Latency is 13 cycles for the MUL/V instruction.
Integer operate                     7              -
Memory                              10             -
Floating-point add                  11             -
Floating-point multiply             11             -
Floating-point DIV/SQRT             11 + latency   Add latency of unit reuse for the instruction indicated in
                                                   Table 2–4. For example, latency for a single-precision fdiv
                                                   would be 11 plus 9 from Table 2–4. Latency is 11 if hardware
                                                   detects that no exception is possible (see Section 2.4.1).
Floating-point conditional branch   11             Branch instruction mispredict is reported in stage 7.
BSR/JSR                             10             JSR instruction mispredict is reported in stage 8.
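The "11 + latency" entry for floating-point divide and square root can be worked through with a small Python sketch. The unit-reuse values are copied from Table 2–4; the function name, the dictionary layout, and the early_retire flag (standing in for the hardware's detection that no exception is possible) are assumptions of this sketch.

```python
FP_BASE_RETIRE_STAGE = 11

# Unit-reuse latencies in cycles from Table 2-4.
UNIT_REUSE = {
    ("fdiv", "single"): 9,
    ("fdiv", "double"): 12,
    ("fsqrt", "single"): 15,
    ("fsqrt", "double"): 30,
}


def min_retire_latency(op: str, precision: str, early_retire: bool = False) -> int:
    """Minimum retire latency for fdiv or fsqrt per Table 2-5."""
    if early_retire:
        # Retired with the same latency as the floating-point add class.
        return FP_BASE_RETIRE_STAGE
    return FP_BASE_RETIRE_STAGE + UNIT_REUSE[(op, precision)]


print(min_retire_latency("fdiv", "single"))
```

The single-precision fdiv case reproduces the example in the table comment: 11 plus 9.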
2.4.1 Floating-Point Divide/Square Root Early Retire
The floating-point divider and square root unit can detect that, for many combinations
of source operand values, no exception can be generated. Instructions with these operands can be retired before the result is generated. When detected, they are retired with
the same latency as the FP add class. Early retirement is not possible for the following
instruction/operand/architecture state conditions:
•Instruction is not a DIV or SQRT.
•SQRT source operand is negative.
•Divide operand exponent_a is 0.
•Either operand is NaN or INF.
•Divide operand exponent_b is 0.
•Trapping mode is /I (inexact).
•INE status bit is 0.
Early retirement is also not possible for divide instructions if the resulting exponent has
any of the following characteristics (EXP is the result exponent):
•DIVT, DIVG: (EXP >= 3FF₁₆) OR (EXP <= 2₁₆)
•DIVS, DIVF: (EXP >= 7F₁₆) OR (EXP <= 382₁₆)
2.5 Retire of Operate Instructions into R31/F31
Many instructions that have R31 or F31 as their destination are retired immediately
upon decode (stage 3). These instructions do not produce a result and are removed from
the pipeline as well. They do not occupy a slot in the issue queues and do not occupy a
functional unit. Table 2–6 lists these instructions and some of their characteristics. The
instruction type in Table 2–6 is from Table C-6 in Appendix C of the Alpha Architecture Handbook, Version 4.
Table 2–6 Instructions Retired Without Execution
Instruction Type          Notes
INTA, INTL, INTM, INTS    All with R31 as destination.
FLTI, FLTL, FLTV          All with F31 as destination. MT_FPCR is not included
                          because it has no destination; it is never removed from the
                          pipeline.
LDQ_U                     All with R31 as destination.
MISC                      TRAPB and EXCB are always removed. Others are never
                          removed.
FLTS                      All (SQRT, ITOF) with F31 as destination.
2.6 Load Instructions to R31 and F31
This section describes how the 21264/EV67 processes software-directed prefetch transactions and load instructions with a destination of R31 and F31.
Prefetches allocate a MAF entry. How the MAF entry is allocated is what distinguishes
the type of prefetch. A normal prefetch is equivalent to a normal load MAF (that is, a
MAF entry that puts the block into the Dcache in a readable state). A prefetch with
modify intent is equivalent to a normal store MAF (that is, a MAF entry that puts the
block into the Dcache in a writeable state). A prefetch, evict next, is equivalent to a normal load MAF, with the additional behavior described in Section 2.6.3, below.
A prefetch is not performed if the prefetch hits in the Dcache (as if it were a normal
load).
Load operations to R31 and F31 may generate exceptions. These exceptions must be
dismissed by PALcode.
The following sections describe the operational prefetch behavior of these instructions.
The 21264/EV67 processes these instructions as normal cache line prefetches. If the
load instruction hits the Dcache, the instruction is dismissed, otherwise the addressed
cache block is allocated into the Dcache.
The HW_LDL instruction construct equates to the HW_LD instruction with the LEN
field clear. See Table 6–3.
2.6.2 Prefetch with Modify Intent: LDS Instruction
The 21264/EV67 processes an LDS instruction, with F31 as the destination, as a
prefetch with modify intent transaction (ReadBlkMod command). If the transaction hits
a dirty Dcache block, the instruction is dismissed. Otherwise, the addressed cache block
is allocated into the Dcache for write access, with its dirty and modified bits set.
2.6.3 Prefetch, Evict Next: LDQ and HW_LDQ Instructions
The 21264/EV67 processes this instruction like a normal prefetch transaction (ReadBlkSpec
command), with one exception: if the load misses the Dcache, the addressed
cache block is allocated into the Dcache, but the Dcache set allocation pointer is left
pointing to this block. The next miss to the same Dcache line will evict the block. For
example, this instruction might be used when software is reading an array that is known
to fit in the offchip Bcache, but will not fit into the onchip Dcache. In this case, the
instruction ensures that the hardware provides the desired prefetch function without displacing useful cache blocks stored in the other set within the Dcache.
The HW_LDQ instruction construct equates to the HW_LD instruction with the LEN
field set. See Table 6–3.
2.6.4 Prefetch with the LDx_L / STx_C Instruction Sequence
A prefetch within a dynamic 80-instruction window of a LDx_L instruction can cause
the subsequent STx_C to incorrectly succeed when all three references are to the same
64-byte cache block. Within that 80-instruction window, the proximity of the prefetch
to the LDx_L instruction directly affects the possibility of the incorrect behavior. Further, if the prefetch issues before the LDx_L, the error cannot occur, and if the prefetch
issues after the LDx_L, the error can only occur when another processor is simultaneously acquiring the same lock.
2.7 Special Cases of Alpha Instruction Execution
This section describes the mechanisms that the 21264/EV67 uses to process irregular
instructions in the Alpha instruction set, and cases in which the 21264/EV67 processes
instructions in a non-intuitive way.
2.7.1 Load Hit Speculation
The latency of integer load instructions that hit in the Dcache is three cycles. Figure 2–9
shows the pipeline timing for these integer load instructions. In Figure 2–9:
Symbol   Meaning
Q        Issue queue
R        Register file read
E        Execute
D        Dcache access
B        Data bus active
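Figure 2–9 Pipeline Timing for Integer Load Instructions
[Figure: pipeline stages plotted against cycle numbers 1 through 8 for three rows: ILD (stages Q, R, E, D, B, with the hit indication marked), Instruction 1 (Q, R), and Instruction 2 (Q).]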
There are two cycles in which the IQ may speculatively issue instructions that use load
data before Dcache hit information is known. Any instructions that are issued by the IQ
within this 2-cycle speculative window are kept in the IQ with their requests inhibited
until the load instruction’s hit condition is known, even if they are not dependent on the
load operation. If the load instruction hits, then these instructions are removed from the
queue. If the load instruction misses, then the execution of these instructions is aborted
and the instructions are allowed to request service again.
For example, in Figure 2–9, instruction 1 and instruction 2 are issued within the speculative window of the load instruction. If the load instruction hits, then both instructions
will be deleted from the queue by the start of cycle 7 (one cycle later than normal for
instruction 1 and at the normal time for instruction 2). If the load instruction misses, both
instructions are aborted from the execution pipelines and may request service again in
cycle 6.
IQ-issued instructions are aborted if issued within the speculative window of an integer
load instruction that missed in the Dcache, even if they are not dependent on the load
data. However, if software misses are likely, the 21264/EV67 can still benefit from
scheduling the instruction stream for Dcache miss latency. The 21264/EV67 includes a
4-bit saturating counter that is incremented by one when a load instruction hits and is
decremented by two when a load instruction misses. When the upper bit of the counter
equals zero, the integer load latency is increased to five cycles and the speculative
window is removed.
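The behavior of the hit/miss counter can be modeled in a few lines. The following Python sketch is illustrative only: the class name, the method names, and the initial counter value are assumptions and are not taken from the 21264/EV67 specification.

```python
class LoadHitCounter:
    """Illustrative model of the 4-bit saturating load hit/miss counter."""

    MAX = 15  # a 4-bit counter saturates at binary 1111

    def __init__(self, value=MAX):
        # Assumption: the reset value is not given in the text; start saturated.
        self.value = value

    def record(self, hit):
        """Add 1 on a Dcache hit, subtract 2 on a miss, saturating at 0 and 15."""
        if hit:
            self.value = min(self.MAX, self.value + 1)
        else:
            self.value = max(0, self.value - 2)

    def speculation_enabled(self):
        """The speculative window is used only while the upper bit (bit 3) is 1."""
        return (self.value >> 3) & 1 == 1

    def integer_load_latency(self):
        """Three cycles with load hit speculation, five cycles without it."""
        return 3 if self.speculation_enabled() else 5
```

In this model, four consecutive misses from the saturated state take the counter from 15 to 7, which clears the upper bit and switches the modeled latency to five cycles; a single subsequent hit (7 to 8) restores speculation.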
Since load instructions to R31 do not produce a result, they do not create a speculative
window when they execute and, therefore, never waste IQ-issue cycles if they miss.
Floating-point load instructions that hit in the Dcache have a latency of four cycles. Figure 2–10 shows the pipeline timing for floating-point load instructions. In Figure 2–10:
Symbol   Meaning
Q        Issue queue
R        Register file read
E        Execute
D        Dcache access
B        Data bus active
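Figure 2–10 Pipeline Timing for Floating-Point Load Instructions
[Figure: pipeline stages plotted against cycle numbers 1 through 8 for three rows: FLD (stages Q, R, E, D, B, with the hit indication marked), Instruction 1 (Q, R), and Instruction 2 (Q).]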
The speculative window for floating-point load instructions is one cycle wide.
FQ-issued instructions that are issued within the speculative window of a floating-point
load instruction that has missed are aborted only if they depend on the load being successful.
For example, in Figure 2–10, instruction 1 is issued in the speculative window of the
load instruction.
If instruction 1 is not a user of the data returned by the load instruction, then it is
removed from the queue at its normal time (at the start of cycle 7).
If instruction 1 is dependent on the load instruction data and the load instruction hits,
instruction 1 is removed from the queue one cycle later (at the start of cycle 8). If the
load instruction misses, then instruction 1 is aborted from the Fbox pipeline and may
request service again in cycle 7.
2.7.2 Floating-Point Store Instructions
Floating-point store instructions are duplicated and loaded into both the IQ and the FQ
from the mapper. Each IQ entry contains a control bit, fpWait, that when set prevents
that entry from asserting its requests. This bit is initially set for each floating-point store
instruction that enters the IQ, unless it was the target of a replay trap. The instruction’s
FQ clone is issued when its Ra register is about to become clean, resulting in its IQ
clone’s fpWait bit being cleared and allowing the IQ clone to issue and be executed by
the Mbox. This mechanism ensures that floating-point store instructions are always
issued to the Mbox, along with the associated data, without requiring the floating-point
register dirty bits to be available within the IQ.
2.7.3 CMOV Instruction
For the 21264/EV67, the Alpha CMOV instruction has three operands, and so presents
a special case. The required operation is to move either the value in register Rb or the
value from the old physical destination register into the new destination register, based
upon the value in Ra. Since neither the mapper nor the Ebox and Fbox data paths are
otherwise required to handle three operand instructions, the CMOV instruction is
decomposed by the Ibox pipeline into two 2-operand instructions:
The first instruction, CMOV1, tests the value of Ra and records the result of this test in
a 65th bit of its destination register, newRc1. It also copies the value of the old physical
destination register, oldRc, to newRc1.
The second instruction, CMOV2, then copies either the value in newRc1 or the value in
Rb into a second physical destination register, newRc2, based on the CMOV predicate
bit stored in newRc1.
In summary, the original CMOV instruction is decomposed into two dependent instructions that each use a physical register from the free list.
To further simplify this operation, the two component instructions of a CMOV instruction
are driven through the mappers in successive cycles. Hence, if a fetch line contains
n CMOV instructions, it takes n+1 cycles to run that fetch line through the mappers.
For example, the following fetch line:
ADD CMOVx SUB CMOVy
Results in the following three map cycles:
ADD CMOVx1
CMOVx2 SUB CMOVy1
CMOVy2
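The splitting of a fetch line into map cycles can be sketched as follows. This Python fragment is an illustration under stated assumptions: the function name is hypothetical, instructions are represented as plain mnemonic strings, and a CMOV is recognized simply by a "CMOV" or "FCMOV" prefix.

```python
def map_cycles(fetch_line):
    """Group a fetch line into map cycles, splitting each CMOV into two halves.

    The first half of a CMOV ends the current map cycle and the second half
    starts the next one, so a line with n CMOVs occupies n+1 map cycles.
    """
    cycles = [[]]
    for op in fetch_line:
        if op.startswith(("CMOV", "FCMOV")):
            cycles[-1].append(op + "1")  # first component closes this cycle
            cycles.append([op + "2"])    # second component opens the next cycle
        else:
            cycles[-1].append(op)
    return cycles


# The fetch line used in the example above.
for cycle in map_cycles(["ADD", "CMOVx", "SUB", "CMOVy"]):
    print(" ".join(cycle))
```

For the example fetch line, the sketch produces the same three map cycles listed above.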
The Ebox executes integer CMOV instructions as two distinct 1-cycle latency operations. The Fbox add pipeline executes floating-point CMOV instructions as two distinct
4-cycle latency operations.
2.8 Memory and I/O Address Space Instructions
This section provides an overview of the way the 21264/EV67 processes memory and
I/O address space instructions.
The 21264/EV67 supports, and internally recognizes, a 44-bit physical address space
that is divided equally between memory address space and I/O address space. Memory
address space resides in the lower half of the physical address space (PA[43]=0)
and I/O address space resides in the upper half of the physical address space
(PA[43]=1).
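As a small illustration of the address space split, the following Python sketch classifies a physical address by testing PA[43]. The function and constant names are hypothetical and chosen only for this example.

```python
PA_BITS = 44        # width of the physical address space
IO_SELECT_BIT = 43  # PA[43] = 0 selects memory space, PA[43] = 1 selects I/O space


def address_space(pa):
    """Return "memory" or "io" for a 44-bit physical address."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 44 bits")
    return "io" if (pa >> IO_SELECT_BIT) & 1 else "memory"
```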
The IQ can issue any combination of load and store instructions to the Mbox at the rate
of two per cycle. The two lower Ebox subclusters, L0 and L1, generate the
48-bit effective virtual address for these instructions.
An instruction is defined to be newer than another instruction if it follows that instruction in program order and is older if it precedes that instruction in program order.
2.8.1 Memory Address Space Load Instructions
The Mbox begins execution of a load instruction by translating its virtual address to a
physical address using the DTB and by accessing the Dcache. The Dcache is virtually
indexed, allowing these two operations to be done in parallel. The Mbox puts information about the load instruction, including its physical address, destination register, and
data format, into the LQ.
If the requested physical location is found in the Dcache (a hit), the data is formatted
and written into the appropriate integer or floating-point register. If the location is not in
the Dcache (a miss), the physical address is placed in the miss address file (MAF) for
processing by the Cbox. The MAF performs a merging function in which a new miss
address is compared to miss addresses already held in the MAF. If the new miss address
points to the same Dcache block as a miss address in the MAF, then the new miss
address is discarded.
When Dcache fill data is returned to the Dcache by the Cbox, the Mbox satisfies the
requesting load instructions in the LQ.
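The MAF merging function for load misses can be sketched as a block-address comparison. The Python model below is a simplification under stated assumptions: the class and method names are hypothetical, the 64-byte block size is the cache block size used elsewhere in this chapter, and the finite number of MAF entries is not modeled.

```python
BLOCK_BYTES = 64  # Dcache block size in bytes


class MissAddressFile:
    """Simplified model: one entry per outstanding 64-byte block."""

    def __init__(self):
        self.blocks = []  # block numbers of outstanding misses, oldest first

    def record_miss(self, pa):
        """Return True if a new entry is allocated, False if the miss merges."""
        block = pa // BLOCK_BYTES
        if block in self.blocks:
            return False  # same Dcache block already pending: address discarded
        self.blocks.append(block)
        return True
```

In this model, two load misses to addresses 0x1000 and 0x103F fall in the same block and share one entry, while a miss to 0x1040 allocates a second entry.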
2.8.2 I/O Address Space Load Instructions
Because I/O space load instructions may have side effects, they cannot be performed
speculatively. When the Mbox receives an I/O space load instruction, the Mbox places
the load instruction in the LQ, where it is held until it retires. The Mbox replays retired
I/O space load instructions from the LQ to the MAF in program order, at a rate of one
per GCLK cycle.
The Mbox allocates a new MAF entry to an I/O load instruction and increases I/O bandwidth
by attempting to merge I/O load instructions in a merge register. Table 2–7 shows
the rules for merging data. The columns represent the load instructions replayed to the
MAF while the rows represent the size of the load in the merge register.
Table 2–7 Rules for I/O Address Space Load Instruction Data Merging
Merge Register/
Replayed Instruction   Load Byte/Word   Load Longword          Load Quadword
Byte/Word              No merge         No merge               No merge
Longword               No merge         Merge up to 32 bytes   No merge
Quadword               No merge         No merge               Merge up to 64 bytes
In summary, Table 2–7 shows some of the following rules:
• Byte/word load instructions and different size load instructions are not allowed to
  merge.
• A stream of ascending non-overlapping, but not necessarily consecutive, longword
  load instructions are allowed to merge into naturally aligned 32-byte blocks.
• A stream of ascending non-overlapping, but not necessarily consecutive, quadword
  load instructions are allowed to merge into naturally aligned 64-byte blocks.
• Merging of quadwords can be limited to naturally-aligned 32-byte blocks based on
  the Cbox WRITE_ONCE chain 32_BYTE_IO field.
• Issued MB, WMB, and I/O load instructions close the I/O register merge window.
  To minimize latency, the merge window is also closed when a timer detects no I/O
  store instruction activity for 1024 cycles.
After the Mbox I/O register has closed its merge window, the Cbox sends I/O read
requests offchip in the order that they were received from the Mbox.
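The size and alignment rules for I/O merging can be expressed as a single predicate. The Python sketch below is a reading of the rules above, not a description of the hardware logic: the function name and arguments are hypothetical, addresses are assumed to be naturally aligned, and the merge register is represented only by the size and address of its most recent reference. Merge-window closing by MB, WMB, or the timer is not modeled.

```python
SIZE_BYTES = {"byte": 1, "word": 2, "longword": 4, "quadword": 8}
MERGE_BLOCK = {"longword": 32, "quadword": 64}  # only these sizes may merge


def can_merge(reg_size, reg_addr, new_size, new_addr, limit_32_byte_io=False):
    """Decide whether a replayed I/O reference may join the merge register.

    reg_size/reg_addr describe the most recent reference already merged;
    new_size/new_addr describe the replayed instruction.
    limit_32_byte_io models the Cbox 32_BYTE_IO option for quadwords.
    """
    if reg_size != new_size or new_size not in MERGE_BLOCK:
        return False  # byte/word never merge; different sizes never merge
    block = MERGE_BLOCK[new_size]
    if limit_32_byte_io:
        block = min(block, 32)
    same_block = reg_addr // block == new_addr // block
    ascending = new_addr >= reg_addr + SIZE_BYTES[reg_size]  # no overlap
    return same_block and ascending
```

For example, under these assumptions two quadword references at 0x100 and 0x138 merge into one 64-byte block, but do not merge when the 32-byte limit is selected.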
2.8.3 Memory Address Space Store Instructions
The Mbox begins execution of a store instruction by translating its virtual address to a
physical address using the DTB and by probing the Dcache. The Mbox puts information about the store instruction, including its physical address, its data and the results of
the Dcache probe, into the store queue (SQ).
If the Mbox does not find the addressed location in the Dcache, it places the address
into the MAF for processing by the Cbox. If the Mbox finds the addressed location in a
Dcache block that is not dirty, then it places a ChangeToDirty request into the MAF.
A store instruction can write its data into the Dcache when it is retired, and when the
Dcache block containing its address is dirty and not shared. SQ entries that meet these
two conditions can be placed into the writable state. These SQ entries are placed into
the writable state in program order at a maximum rate of two entries per cycle. The
Mbox transfers writable store queue entry data from the SQ to the Dcache in program
order at a maximum rate of two entries per cycle. Dcache lines associated with writable
store queue entries are locked by the Mbox. System port probe commands cannot evict
these blocks until their associated writable SQ entries have been transferred into the
Dcache. This restriction assists in STx_C instruction and Dcache ECC processing.
SQ entry data that has not been transferred to the Dcache may source data to newer load
instructions. The Mbox compares the virtual Dcache index bits of incoming load
instructions to queued SQ entries, and sources the data from the SQ, bypassing the
Dcache, when necessary.
2.8.4 I/O Address Space Store Instructions
The Mbox begins processing I/O space store instructions, like memory space store
instructions, by translating the virtual address and placing the state associated with the
store instruction into the SQ.
The Mbox replays retired I/O space store entries from the SQ to the IOWB in program
order at a rate of one per GCLK cycle. The Mbox never allows queued I/O space store
instructions to source data to subsequent load instructions.
The Cbox maximizes I/O bandwidth when it allocates a new IOWB entry to an I/O
store instruction by attempting to merge I/O store instructions in a merge register. Table
2–8 shows the rules for I/O space store instruction data merging. The columns represent
the store instructions replayed to the IOWB while the rows represent the size of the store
in the merge register.
Table 2–8 Rules for I/O Address Space Store Instruction Data Merging
Merge Register/
Replayed Instruction   Store Byte/Word   Store Longword         Store Quadword
Byte/Word              No merge          No merge               No merge
Longword               No merge          Merge up to 32 bytes   No merge
Quadword               No merge          No merge               Merge up to 64 bytes
Table 2–8 shows some of the following rules:
• Byte/word store instructions and different size store instructions are not allowed to
  merge.
• A stream of ascending non-overlapping, but not necessarily consecutive, longword
  store instructions are allowed to merge into naturally aligned 32-byte blocks.
• A stream of ascending non-overlapping, but not necessarily consecutive, quadword
  store instructions are allowed to merge into naturally aligned 64-byte blocks.
• Merging of quadwords can be limited to naturally-aligned 32-byte blocks based on
  the Cbox WRITE_ONCE chain 32_BYTE_IO field.
• Issued MB, WMB, and I/O load instructions close the I/O register merge window.
  To minimize latency, the merge window is also closed when a timer detects no I/O
  store instruction activity for 1024 cycles.
After the IOWB merge register has closed its merge window, the Cbox sends I/O space
store requests offchip in the order that they were received from the Mbox.
2.9 MAF Memory Address Space Merging Rules
Because all memory transactions are to 64-byte blocks, efficiency is improved by merging
several small data transactions into a single larger data transaction. Table 2–9 lists
the rules the 21264/EV67 uses when merging memory transactions into 64-byte naturally aligned data block transactions. Rows represent the merged instruction in the
MAF and columns represent the new issued transaction.
In summary, Table 2–9 shows that only like instruction types, with the exception of
load instructions merging with store instructions, are merged.
2.10 Instruction Ordering
In the absence of explicit instruction ordering, such as with MB or WMB instructions,
the 21264/EV67 maintains a default instruction ordering relationship between pairs of
load and store instructions.
The 21264/EV67 maintains the default memory data instruction ordering as shown in
Table 2–10 (assume address X and address Y are different).
Table 2–10 Memory Reference Ordering
First Instruction in Pair    Second Instruction in Pair   Reference Order
Load memory to address X     Load memory to address X     Maintained (litmus test 1)
Load memory to address X     Load memory to address Y     Not maintained
Store memory to address X    Store memory to address X    Maintained
Store memory to address X    Store memory to address Y    Maintained
Load memory to address X     Store memory to address X    Maintained
Load memory to address X     Store memory to address Y    Not maintained
Store memory to address X    Load memory to address X     Maintained
Store memory to address X    Load memory to address Y     Not maintained
The 21264/EV67 maintains the default I/O instruction ordering as shown in Table 2–11
(assume address X and address Y are different).
Table 2–11 I/O Reference Ordering
First Instruction in Pair    Second Instruction in Pair   Reference Order
Load I/O to address X        Load I/O to address X        Maintained
Load I/O to address X        Load I/O to address Y        Maintained
Store I/O to address X       Store I/O to address X       Maintained
Store I/O to address X       Store I/O to address Y       Maintained
Load I/O to address X        Store I/O to address X       Maintained
Load I/O to address X        Store I/O to address Y       Not maintained
Store I/O to address X       Load I/O to address X        Maintained
Store I/O to address X       Load I/O to address Y        Not maintained
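Tables 2–10 and 2–11 can be condensed into a small lookup. The Python sketch below encodes only what the two tables state; the function name and the string values used for the address space and instruction type are assumptions made for this example.

```python
# Pairs whose order is maintained even when the two addresses differ.
# Every pair to the same address is maintained in both tables.
ORDERED_FOR_DIFFERENT_ADDRESSES = {
    ("memory", "store", "store"),
    ("io", "load", "load"),
    ("io", "store", "store"),
}


def reference_order_maintained(space, first, second, same_address):
    """Return True if the default ordering of the pair is maintained."""
    if space not in ("memory", "io"):
        raise ValueError("space must be 'memory' or 'io'")
    if first not in ("load", "store") or second not in ("load", "store"):
        raise ValueError("instructions must be 'load' or 'store'")
    if same_address:
        return True
    return (space, first, second) in ORDERED_FOR_DIFFERENT_ADDRESSES
```

Software that relies on any pair for which this function returns False would need an explicit MB or WMB instruction, as noted at the start of this section.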
2.11 Replay Traps
There are some situations in which a load or store instruction cannot be executed due to
a condition that occurs after that instruction issues from the IQ or FQ. The instruction is
aborted (along with all newer instructions) and restarted from the fetch stage of the
pipeline. This mechanism is called a replay trap.
2.11.1 Mbox Order Traps
Load and store instructions may be issued from the IQ in a different order than they
were fetched from the Icache, while the architecture dictates that Dstream memory
transactions to the same physical bytes must be completed in order. Usually, the Mbox
manages the memory reference stream by itself to achieve architecturally correct
behavior, but the two cases in which the Mbox uses replay traps to manage the memory
stream are load-load and store-load order traps.
2.11.1.1 Load-Load Order Trap
The Mbox ensures that load instructions that read the same physical byte(s) ultimately
issue in correct order by using the load-load order trap. The Mbox compares the
address of each load instruction, as it is issued, to the address of all load instructions in
the load queue. If the Mbox finds a newer load instruction in the load queue, it invokes
a load-load order trap on the newer instruction. This is a replay trap that aborts the target of the trap and all newer instructions from the machine and refetches instructions
starting at the target of the trap.
2.11.1.2 Store-Load Order Trap
The Mbox ensures that a load instruction ultimately issues after an older store instruction that writes some portion of its memory operand by using the store-load order trap.
The Mbox compares the address of each store instruction, as it is issued, to the address
of all load instructions in the load queue. If the Mbox finds a newer load instruction in
the load queue, it invokes a store-load order trap on the load instruction. This is a replay
trap. It functions like the load-load order trap.
The Ibox contains extra hardware to reduce the frequency of the store-load trap. There
is a 1-bit by 1024-entry VPC-indexed table in the Ibox called the stWait table. When an
Icache instruction is fetched, the associated stWait table entry is fetched along with the
Icache instruction. The stWait table produces 1 bit for each instruction accessed from
the Icache. When a load instruction gets a store-load order replay trap, its associated bit
in the stWait table is set during the cycle that the load is refetched. Hence, the trapping
load instruction’s stWait bit will be set the next time it is fetched.
The IQ will not issue load instructions whose stWait bit is set while there are older unissued
store instructions in the queue. A load instruction whose stWait bit is set can be
issued the cycle immediately after the last older store instruction is issued from the
queue. All the bits in the stWait table are unconditionally cleared every 16384 cycles, or
every 65536 cycles if I_CTL[ST_WAIT_64K] is set.
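The stWait mechanism can be modeled as a small bit table with a periodic clear. The Python sketch below is illustrative only. The class and method names are hypothetical, and the index function is an assumption: the text states that the table is VPC-indexed with 1024 entries but does not state which VPC bits form the index, so the sketch drops the two low-order bits of the 4-byte-aligned instruction address and uses the next ten.

```python
class StWaitTable:
    """Illustrative model of the 1-bit by 1024-entry stWait table."""

    ENTRIES = 1024

    def __init__(self, st_wait_64k=False):
        self.bits = [0] * self.ENTRIES
        # Clear interval selected by I_CTL[ST_WAIT_64K].
        self.clear_interval = 65536 if st_wait_64k else 16384
        self.cycle = 0

    @staticmethod
    def index(vpc):
        # Assumption: index by the instruction number modulo the table size.
        return (vpc >> 2) % StWaitTable.ENTRIES

    def on_store_load_trap(self, load_vpc):
        """Set the bit for a load that took a store-load order replay trap."""
        self.bits[self.index(load_vpc)] = 1

    def must_wait(self, load_vpc, older_unissued_stores):
        """A marked load is held while older stores remain unissued in the queue."""
        return bool(self.bits[self.index(load_vpc)]) and older_unissued_stores > 0

    def tick(self, cycles=1):
        """Advance time; clear every bit when a clear interval boundary is crossed."""
        before = self.cycle // self.clear_interval
        self.cycle += cycles
        if self.cycle // self.clear_interval > before:
            self.bits = [0] * self.ENTRIES
```

The periodic clear keeps a load from being held back indefinitely because of a single past conflict.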
2.11.2 Other Mbox Replay Traps
The Mbox also uses replay traps to control the flow of the load queue and store queue,
and to ensure that there are never multiple outstanding misses to different physical
addresses that map to the same Dcache or Bcache line. Unlike the order traps, however,
these replay traps are invoked on the incoming instruction that triggered the condition.
2.12 I/O Write Buffer and the WMB Instruction
The I/O write buffer (IOWB) consists of four 64-byte entries with the associated
address and control logic used to buffer I/O write data between the store queue (SQ)
and the system port.
2.12.1 Memory Barrier (MB/WMB/TB Fill Flow)
The Cbox CSR SYSBUS_MB_ENABLE bit determines if MB instructions produce
external system port transactions. When the SYSBUS_MB_ENABLE bit equals 0, the
Cbox CSR MB_CNT[3:0] field contains the number of pending uncommitted transactions. The counter will increment for each of the following commands:
The counter is decremented with the C (commit) bit in the Probe and SysDc commands
(see Section 4.7.7). Systems can assert the C bit in the SysDc fill response to the commands
that originally incremented the counter, or attached to the last probe seen by that
command when it reached the system serialization point. If the number of uncommitted
transactions reaches 15 (saturating the counter), the Cbox will stall MAF and IOWB
processing until at least one of the pending transactions has been committed. Probe processing
is not interrupted by the state of this counter.
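The counting and stall behavior of MB_CNT[3:0] when SYSBUS_MB_ENABLE equals 0 can be sketched as follows. This Python model is an illustration with hypothetical names; the `issue` method stands for any command that increments the counter, and the specific command list is not reproduced here.

```python
class MbCounter:
    """Illustrative model of the 4-bit count of uncommitted transactions."""

    SATURATED = 15

    def __init__(self):
        self.pending = 0

    def stalled(self):
        """MAF and IOWB processing stalls while the counter is saturated."""
        return self.pending == self.SATURATED

    def issue(self):
        """Try to send one counted command; return False if processing is stalled."""
        if self.stalled():
            return False
        self.pending += 1
        return True

    def commit(self):
        """A Probe or SysDc command with the C bit set commits one transaction."""
        if self.pending > 0:
            self.pending -= 1
```

In this model, a single commit while the counter is saturated is enough to let MAF and IOWB processing resume.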
2.12.1.1 MB Instruction Processing
When an MB instruction is fetched in the predicted instruction execution path, it stalls
in the map stage of the pipeline. This also stalls all instructions after the MB, and control of instruction flow is based upon the value in Cbox CSR SYSBUS_MB_ENABLE
as follows:
• If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the IQ is
  empty and then performs the following actions:
  a. Sends all pending MAF and IOWB entries to the system port.
  b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed
     events. When the counter decrements from one to zero, the Cbox marks the
     youngest probe queue entry.
  c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and
     IOWB are empty.
  When all of the above have occurred and a probe response has been sent to the system
  for the marked probe queue entry, instruction execution continues with the
  instruction after the MB.
• If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox waits until the IQ is empty
  and then performs the following actions:
  a. Sends all pending MAF and IOWB entries to the system port.
  b. Sends the MB command to the system port.
  c. Waits until the MB command is acknowledged, then marks the youngest entry
     in the probe queue.
  d. Waits until the MAF contains no more Dstream references and the SQ, LQ, and
     IOWB are empty.
  When all of the above have occurred and a probe response has been sent to the system
  for the marked probe queue entry, instruction execution continues with the
  instruction after the MB.
Because the MB instruction is executed speculatively, MB processing can begin
and the original MB can be killed. In the internal acknowledge case, the MB may
have already been sent to the system interface, and the system is still expected to
respond to the MB.
2.12.1.2 WMB Instruction Processing
Write memory barrier (WMB) instructions are issued into the Mbox store-queue, where
they wait until they are retired and all prior store instructions become writable. The
Mbox then stalls the writable pointer and informs the Cbox. The Cbox closes the IOWB
merge register and responds in one of the following two ways:
• If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox performs the following
  actions:
  a. Stalls further MAF and IOWB processing.
  b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed
     events. When the counter decrements from one to zero, the Cbox marks the
     youngest probe queue entry.
  c. When a probe response has been sent to the system for the marked probe queue
     entry, the Cbox considers the WMB to be satisfied.
• If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox performs the following
  actions:
  a. Stalls further MAF and IOWB processing.
  b. Sends the MB command to the system port.
  c. Waits until the MB command is acknowledged by the system with a SysDc
     MBDone command, then sends acknowledge and marks the youngest entry in
     the probe queue.
  d. When a probe response has been sent to the system for the marked probe queue
     entry, the Cbox considers the WMB to be satisfied.
2.12.1.3 TB Fill Flow
Load instructions (HW_LDs) to a virtual page table entry (VPTE) are processed by the
21264/EV67 to avoid litmus test problems associated with the ordering of memory
transactions from another processor against loading of a page table entry and the
subsequent virtual-mode load from this processor.
Consider the sequence shown in Table 2–12. The data could be in the Bcache. Pj should
fetch datai if it is using PTEi.
Table 2–12 TB Fill Flow Example Sequence 1
Pi    Pj
Also consider the related sequence shown in Table 2–13. In this case, the data could be
cached in the Bcache; Pj should fetch datai if it is using PTEi.
<write TB>
Istream read (restart) - will miss the Icache
The 21264/EV67 processes Dstream loads to the PTE by injecting, in hardware, some
memory barrier processing between the PTE transaction and any subsequent load or
store instruction. This is accomplished by the following mechanism:
1. The integer queue issues a HW_LD instruction with VPTE.
2. The integer queue issues a HW_MTPR instruction with a DTB_PTE0 that is data-dependent
on the HW_LD instruction with a VPTE, and is required in order to fill
the DTBs. The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4]
and [0].
3. When a HW_MTPR instruction with a DTB_PTE0 is issued, the Ibox signals the
Cbox indicating that a HW_LD instruction with a VPTE has been processed. This
causes the Cbox to begin processing the MB instruction. The Ibox prevents any
subsequent memory operations from being issued by not clearing the IPR scoreboard bit
[0]. IPR scoreboard bit [0] is one of the scoreboard bits associated with the
HW_MTPR instruction with DTB_PTE0.
4. When the Cbox completes processing the MB instruction (using one of the above
sequences, depending upon the state of SYSBUS_MB_ENABLE), the Cbox signals the Ibox to clear IPR scoreboard bit [0].
The 21264/EV67 uses a similar mechanism to process Istream TB misses and fills to
the PTE for the Istream.
1. The integer queue issues a HW_LD instruction with VPTE.
2. The IQ issues a HW_MTPR instruction with an ITB_PTE that is data-dependent
upon the HW_LD instruction with VPTE. This is required in order to fill the ITB.
The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4] and [0].
3. The Cbox issues a HW_MTPR instruction for the ITB_PTE and signals the Ibox
that a HW_LD/VPTE instruction has been processed, causing the Cbox to start processing
the MB instruction. The Mbox stalls Ibox fetching from when the HW_LD/VPTE
instruction finishes until the probe queue is drained.
4. When the 21264/EV67 is finished (SYS_MB selects one of the above sequences),
the Cbox directs the Ibox to clear IPR scoreboard bit [0]. Also, the Mbox directs the
Ibox to start prefetching.
Inserting MB instruction processing within the TB fill flow is only required for multiprocessor systems. Uniprocessor systems can disable MB instruction processing by
deasserting Ibox CSR I_CTL[TB_MB_EN].
The 21264/EV67 provides hardware support for two methods of obtaining program
performance feedback information. The two methods do not require program modification. The first method offers similar capabilities to earlier microprocessor performance
counters. The second method supports the new ProfileMe way of statistically sampling
individual instructions during program execution to develop a model of program execution. Both methods use the same hardware registers.
See Section 6.10 for information about counter control.
2.14 Floating-Point Control Register
The floating-point control register (FPCR) is shown in Figure 2–11.
Figure 2–11 Floating-Point Control Register
[Figure: FPCR bit layout. Bits 63 through 48 hold, in descending order, SUM, INED, UNFD, UNDZ, DYN (two bits), IOV, INE, UNF, OVF, DZE, INV, OVFD, DZED, INVD, and DNZ; bits 47 through 0 are reserved.]
The floating-point control register fields are described in Table 2–14.
Table 2–14 Floating-Point Control Register Fields
Name   Extent    Type  Description
SUM    [63]      RW    Summary bit. Records bit-wise OR of FPCR exception bits.
INED   [62]      RW    Inexact Disable. If this bit is set and a floating-point instruction that
                       enables trapping on inexact results generates an inexact value, the result
                       is placed in the destination register and the trap is suppressed.
UNFD   [61]      RW    Underflow Disable. The 21264/EV67 hardware cannot generate IEEE compliant
                       denormal results. UNFD is used in conjunction with UNDZ as follows:
                       UNFD  UNDZ  Result
                       0     X     Underflow trap.
                       1     0     Trap to supply a possible denormal result.
                       1     1     Underflow trap suppressed. Destination is written with a
                                   true zero (+0.0).
UNDZ   [60]      RW    Underflow to zero. When UNDZ is set together with UNFD, underflow traps
                       are disabled and the 21264/EV67 places a true zero in the destination
                       register. See UNFD, above.
DYN    [59:58]   RW    Dynamic rounding mode. Indicates the rounding mode to be used by an IEEE
                       floating-point instruction when the instruction specifies dynamic rounding
                       mode:
IOV    [57]      RW    Integer overflow. An integer arithmetic operation or a conversion from
                       floating-point to integer overflowed the destination precision.
INE    [56]      RW    Inexact result. A floating-point arithmetic or conversion operation gave a
                       result that differed from the mathematically exact result.
UNF    [55]      RW    Underflow. A floating-point arithmetic or conversion operation gave a
                       result that underflowed the destination exponent.
OVF    [54]      RW    Overflow. A floating-point arithmetic or conversion operation gave a result
                       that overflowed the destination exponent.
DZE    [53]      RW    Divide by zero. An attempt was made to perform a floating-point divide
                       with a divisor of zero.
INV    [52]      RW    Invalid operation. An attempt was made to perform a floating-point
                       arithmetic operation and one or more of its operand values were illegal.
OVFD   [51]      RW    Overflow disable. If this bit is set and a floating-point arithmetic
                       operation generates an overflow condition, then the appropriate IEEE
                       nontrapping result is placed in the destination register and the trap is
                       suppressed.
DZED   [50]      RW    Division by zero disable. If this bit is set and a floating-point divide
                       by zero is detected, the appropriate IEEE nontrapping result is placed in
                       the destination register and the trap is suppressed.
INVD   [49]      RW    Invalid operation disable. If this bit is set and a floating-point operate
                       generates an invalid operation condition and 21264/EV67 is capable of
                       producing the correct IEEE nontrapping result, that result is placed in the
                       destination register and the trap is suppressed.
DNZ    [48]      RW    Denormal operands to zero. If this bit is set, treat all Denormal operands
                       as a signed zero value with the same sign as the Denormal operand.
Reserved [47:0] (1)  -  -
(1) Alpha architecture FPCR bit 47 (DNOD) is not implemented by the 21264/EV67.
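The field positions in Table 2–14 can be applied to a 64-bit FPCR value as shown in the following Python sketch. The dictionary and function names are hypothetical; the bit extents are the ones listed in the table.

```python
# Field name -> (high bit, low bit), as listed in Table 2-14.
FPCR_FIELDS = {
    "SUM": (63, 63), "INED": (62, 62), "UNFD": (61, 61), "UNDZ": (60, 60),
    "DYN": (59, 58), "IOV": (57, 57), "INE": (56, 56), "UNF": (55, 55),
    "OVF": (54, 54), "DZE": (53, 53), "INV": (52, 52), "OVFD": (51, 51),
    "DZED": (50, 50), "INVD": (49, 49), "DNZ": (48, 48),
}


def decode_fpcr(value):
    """Split a 64-bit FPCR value into its named fields."""
    fields = {}
    for name, (high, low) in FPCR_FIELDS.items():
        width = high - low + 1
        fields[name] = (value >> low) & ((1 << width) - 1)
    return fields


# Example: UNFD and UNDZ both set selects "underflow trap suppressed,
# destination written with a true zero".
example = decode_fpcr((1 << 61) | (1 << 60))
print(example["UNFD"], example["UNDZ"])
```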
2.15 AMASK and IMPLVER Instruction Values
The AMASK and IMPLVER instructions return the supported architecture extensions and the processor type, respectively.
2.15.1 AMASK
The 21264/EV67 returns the AMASK instruction values provided in Table 2–15. The
I_CTL register reports the 21264/EV67 pass level (see I_CTL[CHIP_ID], Section
5.2.15).
Table 2–15 21264/EV67 AMASK Values
21264/EV67 Pass Level                 AMASK Feature Mask Value
See I_CTL[CHIP_ID], Table 5–11        307 (hexadecimal)
The bits of the AMASK value listed in Table 2–15 are defined in Table 2–16.
Table 2–16 AMASK Bit Assignments
Bit   Meaning
0     Support for the byte/word extension (BWX). The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB, SEXTW, STB, and STW.
1     Support for the square-root and floating-point convert extension (FIX). The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS, ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
2     Support for the count extension (CIX). The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.
8     Support for the multimedia extension (MVI). The instructions that comprise the MVI extension are MAXSB8, MAXSW4, MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB, PKWB, UNPKBL, and UNPKBW.
9     Support for precise arithmetic trap reporting in hardware. The trap PC is the same as the instruction PC after the trapping instruction is executed.
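The feature mask value 307 (hexadecimal) can be checked against the bit assignments in Table 2–16. The sketch below is illustrative only: the names are hypothetical, and the amask function assumes the Alpha architecture behavior in which AMASK clears the operand bits that correspond to implemented features, which this section does not state.

```python
FEATURE_MASK = 0x307  # AMASK feature mask value from Table 2-15

# Bit assignments from Table 2-16 (short labels are illustrative).
AMASK_BITS = {
    0: "BWX",
    1: "FIX",
    2: "CIX",
    8: "MVI",
    9: "precise arithmetic traps",
}

def supported_features(mask=FEATURE_MASK):
    """List the Table 2-16 features whose bits are set in the feature mask."""
    return [name for bit, name in sorted(AMASK_BITS.items()) if (mask >> bit) & 1]

def amask(operand, mask=FEATURE_MASK):
    """Assumed AMASK semantics: clear operand bits for implemented features."""
    return operand & ~mask

print(supported_features())
# A bit that stays set after amask() would indicate a feature that is not implemented.
print(hex(amask(0x1)))
```

Under that assumption, software passes the bits it cares about and tests for a zero result.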
2.15.2 IMPLVER
For the 21264/EV67, the IMPLVER instruction returns the value 2.
2.16 Design Examples
The 21264/EV67 can be designed into many different uniprocessor and multiprocessor
system configurations. Figures 2–12 and 2–13 illustrate two possible configurations.
These configurations employ additional system/memory controller chipsets.
Figure 2–12 shows a typical uniprocessor system with a second-level cache. This system configuration could be used in standalone or networked workstations.
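Figure 2–12 Typical Uniprocessor Configuration
[Figure: block diagram. The 21264 connects to an L2 cache (tag store and data store) through tag, address, and data lines, and to the 21272 core logic chipset (control chips, data slice chips, and a host PCI bridge chip driving a 64-bit PCI bus) through address-out, address-in, and data lines. The diagram also shows an optional duplicate tag store and the DRAM arrays (address and data).]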
Figure 2–13 shows a typical multiprocessor system, each processor with a second-level
cache. Each interface controller must employ a duplicate tag store to maintain cache
coherency. This system configuration could be used in a networked database server
application.
Figure 2–13 Typical Multiprocessor Configuration
[Figure: block diagram. Two 21264 processors, each with its own L2 cache, connect to the 21272 core logic chipset (control chip and data slice chips). The diagram also shows two DRAM arrays (address and data) and two host PCI bridge chips, each driving a 64-bit PCI bus.]
Hardware Interface
This chapter contains the 21264/EV67 microprocessor logic symbol and provides information about signal names, their function, and their location. This chapter also
describes the mechanical specifications of the 21264/EV67. It is organized as follows:
•The 21264/EV67 logic symbol
•The 21264/EV67 signal names and functions
•Lists of the signal pins, sorted by name and PGA location
•The specifications for the 21264/EV67 mechanical package
•The top and bottom views of the 21264/EV67 pinouts
3.1 21264/EV67 Microprocessor Logic Symbol
Figure 3–1 shows the logic symbol for the 21264/EV67 chip.
Table 3–1 defines the 21264/EV67 signal types referred to in this section.
Table 3–1 Signal Pin Types Definitions
Signal Type    Definition
Inputs
I_DC_REF       Input DC reference pin
I_DA           Input differential amplifier receiver
I_DA_CLK       Input clock pin
Outputs
O_OD           Open drain output driver
O_OD_TP        Open drain driver for test pins
O_PP           Push/pull output driver
O_PP_CLK       Push/pull output clock driver
Bidirectional
B_DA_OD        Bidirectional differential amplifier receiver with open drain output
B_DA_PP        Bidirectional differential amplifier receiver with push/pull output
Other
Spare          Reserved to Compaq (note 1)
NoConnect      No connection. Do not connect to these pins for any revision of the 21264/EV67. These pins must float.
1 All Spare connections are Reserved to Compaq to maintain compatibility between passes of the chip. Designers should not use these pins.
Table 3–2 lists all signal pins in alphabetic order and provides a full functional description of the pins. Table 3–4 lists the signal pins and their corresponding pin grid array
(PGA) locations in alphabetic order for the signal type. Table 3–5 lists the pin grid array
locations in alphabetical order.
Table 3–2 21264/EV67 Signal Descriptions
Signal                  Type       Count  Description
BcAdd_H[23:4]           O_PP       20     These signals provide the index to the Bcache.
BcCheck_H[15:0]         B_DA_PP    16     ECC check bits for BcData_H[127:0].
BcData_H[127:0]         B_DA_PP    128    Bcache data signals.
BcDataInClk_H[7:0]      I_DA       8      Bcache data input clocks. These clocks are used with high-speed SDRAMs, such as DDRs, that provide a clock-out with data-output pins to optimize Bcache read bandwidths. The 21264/EV67 internally synchronizes the data to its logic with clock forward receive circuits similar to the system interface.
BcDataOE_L              O_PP       1      Bcache data output enable. The 21264/EV67 asserts this signal during Bcache read operations.
BcDataOutClk_H[3:0]
BcDataOutClk_L[3:0]     O_PP       8      Bcache data output clocks. These free-running clocks are differential copies of the Bcache clock and are derived from the 21264/EV67 GCLK. Their period is a multiple of the GCLK and is fixed for all operations. They can be configured so that their rising edge lags BcAdd_H[23:4] by 0 to 2 GCLK cycles. The 21264/EV67 synchronizes tag output information with these clocks.
BcDataWr_L              O_PP       1      Bcache data write enable. The 21264/EV67 asserts this signal when writing data to the Bcache data arrays.
BcLoad_L                O_PP       1      Bcache burst enable.
BcTag_H[42:20]          B_DA_PP    23     Bcache tag bits.
BcTagDirty_H            B_DA_PP    1      Tag dirty state bit. During cache write operations, the 21264/EV67 will assert this signal if the Bcache data has been modified.
BcTagInClk_H            I_DA       1      Bcache tag input clock. The 21264/EV67 uses this input clock to latch the tag information on Bcache read operations. This clock is used with high-speed SDRAMs, such as DDRs, that provide a clock-out with data-output pins to optimize Bcache read bandwidths. The 21264/EV67 internally synchronizes the data to its logic with clock forward receive circuits similar to the system interface.
BcTagOE_L               O_PP       1      Bcache tag output enable. This signal is asserted by the 21264/EV67 for Bcache read operations.
BcTagOutClk_H
BcTagOutClk_L           O_PP       2      Bcache tag output clock. These clocks “echo” the clock-forwarded BcDataOutClk_x[3:0] clocks.
BcTagParity_H           B_DA_PP    1      Tag parity state bit.
BcTagShared_H           B_DA_PP    1      Tag shared state bit. The 21264/EV67 will write a 1 on this signal line if another agent has a copy of the cache line.
BcTagValid_H            B_DA_PP    1      Tag valid state bit. If set, this line indicates that the cache line is valid.
BcTagWr_L               O_PP       1      Tag RAM write enable. The 21264/EV67 asserts this signal when writing a tag to the Bcache tag arrays.
BcVref                  I_DC_REF   1      Bcache tag reference voltage.
ClkFwdRst_H             I_DA       1      Systems assert this synchronous signal to wake up a powered-down 21264/EV67. The ClkFwdRst_H signal is clocked into a 21264/EV67 register by the captured FrameClk_x signals. Systems must ensure that the timing of this signal meets 21264/EV67 requirements (see Section 4.7.2).
ClkIn_H
ClkIn_L                 I_DA_CLK   2      Differential input signals provided by the system.
DCOK_H                  I_DA       1      dc voltage OK. Must be deasserted until dc voltage reaches proper operating level. After that, DCOK_H is asserted.
EV6Clk_H
EV6Clk_L                O_PP_CLK   2      Provides an external test point to measure phase alignment of the PLL.
FrameClk_H
FrameClk_L              I_DA_CLK   2      A skew-controlled differential 50% duty cycle copy of the system clock. It is used by the 21264/EV67 as a reference, or framing, clock.
IRQ_H[5:0]              I_DA       6      These six interrupt signal lines may be asserted by the system. The response of the 21264/EV67 is determined by the system software.
MiscVref                I_DC_REF   1      Voltage reference for the miscellaneous pins (see Table 3–3).
PllBypass_H             I_DA       1      When asserted, this signal will cause the two input clocks (ClkIn_x) to be applied to the 21264/EV67 internal circuits, instead of the 21264/EV67 global clock (GCLK).
PLL_VDD                 3.3 V      1      3.3-V dedicated power supply for the 21264/EV67 PLL.
Reset_L                 I_DA       1      System reset. This signal protects the 21264/EV67 from damage during initial power-up. It must be asserted until DCOK_H is asserted. After that, it is deasserted and the 21264/EV67 begins its reset sequence.
SromClk_H               O_OD_TP    1      Serial ROM clock. Supplies the clock that causes the SROM to advance to the next bit. The cycle time for this clock is 256 times the cycle time of the GCLK (internal 21264/EV67 clock).
SromData_H              I_DA       1      Serial ROM data. Input data line from the SROM.
SromOE_L                O_OD_TP    1      Serial ROM enable. Supplies the output enable to the SROM.
SysAddIn_L[14:0]        I_DA       15     Time-multiplexed command/address/ID/Ack from system to the 21264/EV67.
SysAddInClk_L           I_DA       1      Single-ended forwarded clock from system for SysAddIn_L[14:0] and SysFillValid_L.
SysAddOut_L[14:0]       O_OD       15     Time-multiplexed command/address/ID/mask from the 21264/EV67 to the system bus.
SysAddOutClk_L          O_OD       1      Single-ended forwarded clock output for SysAddOut_L[14:0].
SysCheck_L[7:0]         B_DA_OD    8      Quadword ECC check bits for SysData_L[63:0].
SysData_L[63:0]         B_DA_OD    64     Data bus for memory and I/O data.
SysDataInClk_H[7:0]     I_DA       8      Single-ended system-generated clocks for clock forwarded input system data.
SysDataInValid_L        I_DA       1      When asserted, marks a valid data cycle for data transfers to the 21264/EV67.
SysDataOutClk_L[7:0]    O_OD       8      Single-ended 21264/EV67-generated clocks for clock forwarded output system data.
SysDataOutValid_L       I_DA       1      When asserted, marks a valid data cycle for data transfers from the 21264/EV67.
SysFillValid_L          I_DA       1      When asserted, this bit indicates validation for the cache fill delivered in the previous system SysDc command.
SysVref                 I_DC_REF   1      System interface reference voltage.
Tck_H                   I_DA       1      IEEE 1149.1 test clock.
Tdi_H                   I_DA       1      IEEE 1149.1 test data-in signal.
Tdo_H                   O_OD_TP    1      IEEE 1149.1 test data-out signal.
TestStat_H              O_OD_TP    1      Test status pin. System reset drives the test status pin low. The TestStat_H pin is forced high at the start of the Icache BiST. If the Icache BiST passes, the pin is deasserted at the end of the BiST operation; otherwise, it remains high. The 21264/EV67 generates a timeout reset signal if an instruction is not retired within one billion cycles. The 21264/EV67 signals the timeout reset event by outputting a 256 GCLK cycle wide pulse on TestStat_H.
Tms_H                   I_DA       1      IEEE 1149.1 test mode select signal.
Trst_L                  I_DA       1      IEEE 1149.1 test access port (TAP) reset signal.
Table 3–3 lists signals by function and provides an abbreviated description.
Table 3–3 21264/EV67 Signal Descriptions by Function
Signal                  Type       Count  Description
BcVref Domain
BcAdd_H[23:4]           O_PP       20     Bcache index.
BcCheck_H[15:0]         B_DA_PP    16     ECC check bits for BcData_H[127:0].
BcData_H[127:0]         B_DA_PP    128    Bcache data.
BcDataInClk_H[7:0]      I_DA       8      Bcache data input clocks.
BcDataOE_L              O_PP       1      Bcache data output enable.
BcDataOutClk_H[3:0]
BcDataOutClk_L[3:0]     O_PP       8      Bcache data output clocks.
BcDataWr_L              O_PP       1      Bcache data write enable.
BcLoad_L                O_PP       1      Bcache burst enable.
BcTag_H[42:20]          B_DA_PP    23     Bcache tag bits.
BcTagDirty_H            B_DA_PP    1      Tag dirty state bit.
BcTagInClk_H            I_DA       1      Bcache tag input clock.
BcTagOE_L               O_PP       1      Bcache tag output enable.
BcTagOutClk_H
BcTagOutClk_L           O_PP       2      Bcache tag output clocks.
BcTagParity_H           B_DA_PP    1      Tag parity state bit.
BcTagShared_H           B_DA_PP    1      Tag shared state bit.
BcTagValid_H            B_DA_PP    1      Tag valid state bit.
BcTagWr_L               O_PP       1      Tag RAM write enable.
BcVref                  I_DC_REF   1      Tag data input reference voltage.
SysVref Domain
SysAddIn_L[14:0]        I_DA       15     Time-multiplexed SysAddIn, system-to-21264/EV67.
SysAddInClk_L           I_DA       1      Single-ended forwarded clock from system for SysAddIn_L[14:0] and SysFillValid_L.
SysAddOut_L[14:0]       O_OD       15     Time-multiplexed SysAddOut, 21264/EV67-to-system.
SysAddOutClk_L          O_OD       1      Single-ended forwarded clock.
SysCheck_L[7:0]         B_DA_OD    8      Quadword ECC check bits for SysData_L[63:0].
SysData_L[63:0]         B_DA_OD    64     Data bus for memory and I/O data.
SysDataInClk_H[7:0]     I_DA       8      Single-ended system-generated clocks for clock forwarded input system data.
SysDataInValid_L        I_DA       1      When asserted, marks a valid data cycle for data transfers to the 21264/EV67.
SysDataOutClk_L[7:0]    O_OD       8      Single-ended 21264/EV67-generated clocks for clock forwarded output system data.
SysDataOutValid_L       I_DA       1      When asserted, marks a valid data cycle for data transfers from the 21264/EV67.
SysFillValid_L          I_DA       1      Validation for fill given in previous SysDc command.
SysVref                 I_DC_REF   1      System interface reference voltage.
Clocks and PLL
ClkIn_H
ClkIn_L                 I_DA_CLK   2      Differential input signals provided by the system.
EV6Clk_H
EV6Clk_L                O_PP_CLK   2      Provides an external test point to measure phase alignment of the PLL.
FrameClk_H
FrameClk_L              I_DA_CLK   2      A skew-controlled differential 50% duty cycle copy of the system clock. It is used by the 21264/EV67 as a reference, or framing, clock.
PLL_VDD                 3.3 V      1      3.3-V dedicated power supply for the 21264/EV67 PLL.
MiscVref Domain
ClkFwdRst_H             I_DA       1      Systems assert this synchronous signal to wake up a powered-down 21264/EV67. The ClkFwdRst_H signal is clocked into a 21264/EV67 register by the captured FrameClk_x signals.
DCOK_H                  I_DA       1      dc voltage OK. Must be deasserted until dc voltage reaches proper operating level. After that, DCOK_H is asserted.
IRQ_H[5:0]              I_DA       6      These six interrupt signal lines may be asserted by the system.
MiscVref                I_DC_REF   1      Reference voltage for miscellaneous pins.
PllBypass_H             I_DA       1      When asserted, this signal will cause the input clocks (ClkIn_x) to be applied to the 21264/EV67 internal circuits, instead of the 21264/EV67 global clock (GCLK).
Reset_L                 I_DA       1      System reset. This signal protects the 21264/EV67 from damage during initial power-up. It must be asserted until DCOK_H is asserted. After that, it is deasserted and the 21264/EV67 begins its reset sequence.
SromClk_H               O_OD_TP    1      Serial ROM clock.
SromData_H              I_DA       1      Serial ROM data.
SromOE_L                O_OD_TP    1      Serial ROM enable.
Tck_H                   I_DA       1      IEEE 1149.1 test clock.
Tdi_H                   I_DA       1      IEEE 1149.1 test data-in signal.
Tdo_H                   O_OD_TP    1      IEEE 1149.1 test data-out signal.
TestStat_H              O_OD_TP    1      Test status pin.
Tms_H                   I_DA       1      IEEE 1149.1 test mode select signal.
Trst_L                  I_DA       1      IEEE 1149.1 test access port (TAP) reset signal.
3.3 Pin Assignments
The 21264/EV67 package has 587 pins aligned in a pin grid array (PGA) design. There
are 380 functional signal pins, 1 dedicated 3.3-V pin for the PLL, 112 ground VSS pins,
and 94 VDD pins. Table 3–4 lists the signal pins and their corresponding pin grid array
(PGA) locations in alphabetical order for the signal type. Table 3–5 lists the pin grid
array locations in alphabetical order.
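The pin budget quoted above can be confirmed with a short calculation. This is only an arithmetic check of the four numbers given in the paragraph; the dictionary keys are descriptive labels chosen for this sketch.

```python
# Pin counts quoted in Section 3.3; the labels are illustrative.
PIN_GROUPS = {
    "functional signal pins": 380,
    "dedicated 3.3-V PLL pin": 1,
    "VSS ground pins": 112,
    "VDD pins": 94,
}

total_pins = sum(PIN_GROUPS.values())
print(total_pins)  # should agree with the 587-pin package
```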
Table 3–4 Pin List Sorted by Signal Name
Signal NamePGA Location Signal NamePGA Location Signal NamePGA Location
This chapter describes the 21264/EV67 cache and external interface, which includes the
second-level cache (Bcache) interface and the system interface. It also describes locks,
interrupt signals, and ECC/parity generation. It is organized as follows:
•Introduction to the external interfaces
•Physical address considerations
•Bcache structure
•Victim data buffer
•Cache coherency
•Lock mechanism
•System port
•Bcache port
•Interrupts
Chapter 3 lists and defines all 21264/EV67 hardware interface signal pins. Chapter 9
describes the 21264/EV67 hardware interface electrical requirements.
4.1 Introduction to the External Interfaces
A 21264/EV67-based system can be divided into three major sections:
•21264/EV67 microprocessor
•Second-level Bcache
•System interface logic
–Optional duplicate tag store
–Optional lock register
–Optional victim buffers
The 21264/EV67 external interface is flexible and mandates few design rules, allowing
a wide range of prospective systems. The external interface is composed of the Bcache
interface and the system interface.
•Input clocks must have the same frequency as their corresponding output clock. For
example, the frequency of SysAddInClk_L must be the same as
SysAddOutClk_L.
•The Bcache interface includes a 128-bit bidirectional data bus, a 20-bit unidirec-
tional address bus, and several control signals.
–The BcDataOutClk_x[3:0] clocks are free-running and are derived from the
internal GCLK. The period of BcDataOutClk_x[3:0] is a programmable mul-
tiple of GCLK.
–The Bcache turns the BcDataOutClk_x[3:0] clocks around and returns them
to the 21264/EV67 as BcDataInClk_H[7:0]. Likewise, BcTagOutClk_x
returns as BcTagInClk_H.
–The Bcache interface supports a 64-byte block size.
•The system interface includes a 64-bit bidirectional data bus, two 15-bit
unidirectional address buses, and several control signals.
–The SysAddOutClk_L clock is free-running and is derived from the internal
GCLK. The period of SysAddOutClk_L is a programmable multiple of
GCLK.
–The SysAddInClk_L clock is a turned-around copy of SysAddOutClk_L.
Figure 4–1 shows a simplified view of the external interface. The function and purpose
of each signal are described in Chapter 3.
Figure 4–1 21264/EV67 System and Bcache Interfaces
[Figure: block diagram of the 21264 between the system and the Bcache.
System side: SysAddIn_L[14:0], SysAddInClk_L, SysAddOut_L[14:0], SysAddOutClk_L, SysVref, SysData_L[63:0], SysCheck_L[7:0], SysDataInClk_H[7:0], SysDataOutClk_L[7:0], SysDataInValid_L, SysDataOutValid_L, SysFillValid_L, IRQ_H[5:0].
Bcache side: BcAdd_H[23:4], BcLoad_L, BcData_H[127:0], BcCheck_H[15:0], BcDataInClk_H[7:0], BcDataOutClk_x[3:0], BcDataOE_L, BcDataWr_L, BcTag_H[42:20], BcTagInClk_H, BcTagOutClk_x, BcVref, BcTagWr_L, BcTagOE_L, BcTagValid_H, BcTagDirty_H, BcTagShared_H, BcTagParity_H.
Bcache arrays: Data [23:4], Tag [23:6], Status [23:6].]
4.1.1 System Interface
This section introduces the system (external) bus interface. The system interface is
made up of two unidirectional 15-bit address buses, 64 bidirectional data lines, eight
bidirectional check bits, two single-ended unidirectional clocks, and a few control pins.
The 15-bit address buses provide time-shared address/command/ID in two or four
GCLK cycles. The Cbox controls the system interface.
4.1.1.1 Commands and Addresses
The system sends probe and data movement commands to the 21264/EV67. The 21264/EV67
can hold up to eight probe commands from the system. The system controls the
number of outstanding probe commands and must ensure that the 21264/EV67 8-entry
probe queue does not overflow.
The Cbox contains an 8-entry miss buffer (MAF) and an 8-entry victim buffer (VAF).
A miss occurs when the 21264/EV67 probes the Bcache but does not find the addressed
block. The 21264/EV67 can queue eight cache misses to the system in its MAF.
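Because the system is responsible for not overflowing the 8-entry probe queue, system logic typically needs some form of outstanding-probe accounting. The following is a minimal sketch of one such scheme, assuming a simple credit counter; the class name, its methods, and the idea that a credit is returned when a probe response is observed are all assumptions made for illustration and are not taken from the 21264/EV67 specification.

```python
class ProbeCredits:
    """Hypothetical system-side counter guarding the 8-entry probe queue."""

    def __init__(self, depth=8):
        self.depth = depth        # probe queue depth stated in Section 4.1.1.1
        self.outstanding = 0      # probes sent but not yet retired

    def can_send(self):
        return self.outstanding < self.depth

    def send_probe(self):
        if not self.can_send():
            raise RuntimeError("probe queue would overflow")
        self.outstanding += 1

    def probe_response(self):
        # Assumption: one credit is returned per completed probe.
        if self.outstanding == 0:
            raise RuntimeError("no probe outstanding")
        self.outstanding -= 1

credits = ProbeCredits()
for _ in range(8):
    credits.send_probe()
print(credits.can_send())   # False: the queue is full until a probe completes
```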
4.1.2 Second-Level Cache (Bcache) Interface
The 21264/EV67 Cbox provides control signals and an interface for a second-level
cache, the Bcache. The 21264/EV67 supports a Bcache from 1MB to 16MB, with 64-byte
blocks. A 128-bit data bus is used for transfers between the 21264/EV67 and the
Bcache. The Bcache must be composed of synchronous static RAMs (SSRAMs) and
must contain either one, two, or three internal registers. All Bcache control and address
pins are clocked synchronously on Bcache cycle boundaries. The Bcache clock period
is a multiple of the CPU clock cycle, selectable in half-cycle increments from 1.5 to 4.0,
and in full-cycle increments of 5, 6, 7, and 8 times the CPU clock cycle. The 1.5 multiple is only available in dual-data mode.
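The set of permitted clock multiples can be enumerated directly from the rule above. The sketch below is illustrative: the function names are hypothetical, and it treats the multiple as a ratio of the Bcache clock period to the GCLK period, which is an interpretation of the text.

```python
def valid_bcache_ratios(dual_data_mode):
    """Bcache-to-CPU clock multiples: 1.5..4.0 in half steps, then 5, 6, 7, 8."""
    half_steps = [n / 2 for n in range(3, 9)]      # 1.5, 2.0, ..., 4.0
    ratios = half_steps + [5.0, 6.0, 7.0, 8.0]
    if not dual_data_mode:
        ratios.remove(1.5)                         # 1.5 requires dual-data mode
    return ratios

def bcache_clock_period(gclk_period_ns, ratio, dual_data_mode=False):
    """Return the Bcache clock period for a permitted multiple of GCLK."""
    if ratio not in valid_bcache_ratios(dual_data_mode):
        raise ValueError(f"unsupported Bcache clock multiple: {ratio}")
    return gclk_period_ns * ratio

print(valid_bcache_ratios(dual_data_mode=True))
print(bcache_clock_period(2.0, 2.5))   # GCLK period of 2.0 ns is a made-up value
```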
4.2 Physical Address Considerations
The 21264/EV67 supports a 44-bit physical address space that is divided equally
between memory space and I/O space. Memory space resides in the lower half of the
physical address space (PA[43] = 0) and I/O space resides in the upper half of the physical address space (PA[43] = 1). The 21264/EV67 recognizes these spaces internally.
The 21264/EV67-generated external references to memory space are always of a fixed
64-byte size, though the internal access granularity is byte, word, longword, or quadword. All 21264/EV67-generated external references to memory or I/O space are physical addresses that are either successfully translated from a virtual address or produced
by PALcode. Speculative execution may cause a reference to nonexistent memory. Systems must check the range of all addresses and report nonexistent addresses to the
21264/EV67.
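The split of the 44-bit physical address space can be expressed in a few lines. This is a minimal sketch with hypothetical helper names; it only encodes the PA[43] rule and the fixed 64-byte size of memory-space references described above.

```python
PA_BITS = 44          # physical address width
BLOCK_BYTES = 64      # fixed size of external memory-space references

def address_space(pa):
    """Classify a physical address by PA[43]: 0 is memory space, 1 is I/O space."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("address does not fit in 44 bits")
    return "io" if (pa >> 43) & 1 else "memory"

def block_address(pa):
    """Align a memory-space address down to its 64-byte block."""
    return pa & ~(BLOCK_BYTES - 1)

print(address_space(0x0000_1234_5678))   # memory
print(address_space(1 << 43))            # io
print(hex(block_address(0x12345)))       # 0x12340
```

A system-side range check for nonexistent memory would sit on top of this classification; its details depend on the platform and are not sketched here.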
Table 4–1 describes the translation of internal references to external interface references. The first column lists the instructions used by the programmer, including load
(LDx) and store (STx) instructions of several sizes. The column headings are described
here:
•DcHit (block was found in the Dcache)
•DcW (block was found in a writable state in the Dcache)
•BcHit (block was found in the Bcache)
•BcW (block was found in a writable state in the Bcache)
•Status and Action (status at end of instruction and action performed by the 21264/
EV67)
Prefetches (LDL, LDF, LDG, LDT, LDBU, LDWU) to R31 use the LDx flow, and
prefetch with modify intent (LDS) uses the STx flow. If the prefetch target is addressed
to I/O space, the upper address bit is cleared, converting the address to memory space
(PA[42:6]). Notes follow the table.
Table 4–1 Translation of Internal References to External Interface Reference
Instruction             DcHit  DcW  BcHit  BcW  Status and Action
LDx Memory              1      X    X      X    Dcache hit, done.
LDx Memory              0      X    1      X    Bcache hit, done.
LDx Memory              0      X    0      X    Miss, generate RdBlk command.
LDx I/O                 X      X    X      X    RdBytes, RdLWs, or RdQWs based on size.
Istream Memory          1      X    X      X    Dcache hit, Istream serviced from Dcache.
Istream Memory          0      X    1      X    Bcache hit, Istream serviced from Bcache.
Istream Memory          0      X    0      X    Miss, generate RdBlkI command.
STx Memory              1      1    X      X    Store Dcache hit and writable, done.
STx Memory              1      0    X      X    Store hit and not writable, set-dirty flow (note 1).
STx Memory              0      X    1      1    Store Bcache hit and writable, done.
STx Memory              0      X    1      0    Store hit and not writable, set-dirty flow (note 1).
STx Memory              0      X    0      X    Miss, generate RdBlkMod command.
STx I/O                 X      X    X      X    WrBytes, WrLWs, or WrQWs based on size.
STx_C Memory            0      X    X      X    Fail STx_C.
STx_C Memory            1      0    X      X    STx_C hit and not writable, set-dirty flow (note 1).
STx_C I/O               X      X    X      X    Always succeed and WrQWs or WrLWs are generated, based on the size.
WH64 Memory             1      1    X      X    Hit, done.
WH64 Memory             1      0    X      X    WH64 hit not writable, set-dirty flow (note 1).
WH64 Memory             0      X    1      1    WH64 hit dirty, done.
WH64 Memory             0      X    1      0    WH64 hit not writable, set-dirty flow (note 1).
WH64 Memory             0      X    0      X    Miss, generate InvalToDirty command (note 2).
WH64 I/O                X      X    X      X    NOP the instruction. WH64 is UNDEFINED for I/O space.
ECB Memory              X      X    X      X    Generate evict command (note 3).
ECB I/O                 X      X    X      X    NOP the instruction. ECB instruction is UNDEFINED for I/O space.
MB/WMB, TB Fill Flows   X      X    X      X    Generate MB command (note 4). See Section 2.12.1.
Table 4–1 notes:
1. Set Dirty Flow: Based on the Cbox CSR SET_DIRTY_ENABLE[2:0], SetDirty
requests can be either internally acknowledged (called a SetModify) or sent to the
system environment for processing. When externally acknowledged, the shared status information for the cache block is also broadcast. The commands sent externally are SharedToDirty or CleanToDirty. Based on the Cbox CSR
ENABLE_STC_COMMAND[0], the external system can be informed of a STx_C
generating a SetDirty using the STCChangeToDirty command. See Table 4–16 for
more information.
2. InvalToDirty: Based on the Cbox CSR INVAL_TO_DIRTY_ENABLE[1:0], InvalToDirty requests can be either internally acknowledged or sent to the system environment as InvalToDirty commands. This Cbox CSR provides the ability to convert
WH64 instructions to RdModx operations. See Table 4–15 for more information.
3. Evict: There are two aspects to the commands that are generated by an ECB
instruction: first, those commands that are generated to notify the system of an evict
being performed; second, those commands that are generated by any victim that is
created by servicing the ECB.
–If Cbox CSR ENABLE_EVICT[0] is clear, no command is issued by the
21264/EV67 on the external interface to notify the system of an evict being
performed. If Cbox CSR ENABLE_EVICT[0] is set, the 21264/EV67 issues an
Evict command on the system interface only if a Bcache index match to the
ECB address is found in the 21264/EV67 cache system.
Note that whenever ENABLE_EVICT[0] is true (in the write-many chain),
BC_CLEAN_VICTIM must also be true (in the write-once chain). Otherwise,
the 21264/EV67 could respond miss to a probe, rather than hit, before an Evict
command has been sent off chip, but after the Evict command has removed a
(clean) block from the internal caches and the Bcache. That behavior might
cause systems that maintain an external duplicate copy of the Bcache tags to
become confused, because the system could receive the probe response indicating the miss before it receives the Evict command.
–The 21264/EV67 can issue the commands CleanVictimBlk and WrVictimBlk
for a victim that is created by an ECB. CleanVictimBlk is issued only if Cbox
CSR BC_CLEAN_VICTIM is set and there is a Bcache index match valid but
not dirty in the 21264/EV67 cache system. WrVictimBlk is issued for any
Bcache match of the ECB address that is dirty in the 21264/EV67 cache system.
4. MB: Based on the Cbox CSR SYSBUS_MB_ENABLE, the MB command can be
sent to the pins.
Each of these CSRs is programmed appropriately, based on the cache coherence protocol used by the system environment. For example, uniprocessor systems would prefer
to internally acknowledge most of these transactions. In contrast, multiprocessor systems may require notification and control of any change in cache state. The 21264/EV67
and the external system must cooperate to maintain cache coherence. Section 4.5
explains the 21264/EV67 part of the cache coherency protocol.
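The memory-space rows of Table 4–1 for LDx and STx, and the ECB behavior in note 3, can be summarized as two small lookup functions. The following is a minimal sketch under stated assumptions: the function names and return strings are illustrative, the ECB helper collapses "index match" and "valid" into a single match argument, and the order in which commands are listed is not meant to imply an issue order on the system interface.

```python
def memory_reference(kind, dc_hit, dc_w, bc_hit, bc_w):
    """Outcome of an LDx or STx reference to memory space, per Table 4-1."""
    if kind == "LDx":
        if dc_hit:
            return "Dcache hit, done"
        return "Bcache hit, done" if bc_hit else "generate RdBlk"
    if kind == "STx":
        if dc_hit:
            return "done" if dc_w else "set-dirty flow"
        if bc_hit:
            return "done" if bc_w else "set-dirty flow"
        return "generate RdBlkMod"
    raise ValueError("only LDx and STx are modeled in this sketch")

def ecb_commands(enable_evict, bc_clean_victim, match):
    """Commands for an ECB to memory space, per note 3.

    match is None (no Bcache match), "clean" (valid, not dirty), or "dirty".
    """
    commands = []
    if enable_evict and match is not None:
        commands.append("Evict")
    if match == "clean" and bc_clean_victim:
        commands.append("CleanVictimBlk")
    if match == "dirty":
        commands.append("WrVictimBlk")
    return commands

print(memory_reference("STx", dc_hit=False, dc_w=False, bc_hit=True, bc_w=False))
print(ecb_commands(enable_evict=True, bc_clean_victim=True, match="clean"))
```

Whether the set-dirty flow is acknowledged internally or sent to the system depends on the Cbox CSR settings in note 1 and is left out of the sketch.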
4.3 Bcache Structure
The 21264/EV67 Cbox provides control signals and an interface for a second-level
cache (Bcache).
The 21264/EV67 supports a Bcache from 1MB to 16MB, with 64-byte blocks. A 128-bit
bidirectional data bus is used for transfers between the 21264/EV67 and the Bcache.
The Bcache is fully synchronous and the synchronous static RAMs (SSRAMs) must
contain either one, two, or three internal registers. All Bcache control and address pins
are clocked synchronously on Bcache cycle boundaries. The Bcache clock period is
a multiple of the CPU clock cycle, selectable in half-cycle increments from 1.5 to 4.0, and in full-cycle
increments of 5, 6, 7, and 8 times the CPU clock cycle. The 1.5 multiple is only
available in dual-data mode.
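The sizes quoted above are mutually consistent, which a short calculation makes visible. The sketch below uses hypothetical names and assumes that each value on the 20-bit index bus BcAdd_H[23:4] selects one 128-bit (16-byte) transfer; that reading follows from the bit range [23:4] but is an interpretation rather than a statement from this section.

```python
BLOCK_BYTES = 64       # Bcache block size
DATA_BUS_BITS = 128    # width of BcData_H[127:0]
INDEX_BITS = 20        # width of BcAdd_H[23:4]

def transfers_per_block():
    """Number of 128-bit bus transfers needed to move one 64-byte block."""
    return BLOCK_BYTES * 8 // DATA_BUS_BITS

def blocks_in_cache(size_mb):
    """Number of 64-byte blocks in a Bcache of the given size (1MB to 16MB)."""
    if not 1 <= size_mb <= 16:
        raise ValueError("supported Bcache sizes are 1MB to 16MB")
    return size_mb * 2**20 // BLOCK_BYTES

def max_indexable_bytes():
    """Bytes reachable if each index value selects one 16-byte transfer (assumption)."""
    return 2**INDEX_BITS * (DATA_BUS_BITS // 8)

print(transfers_per_block())            # 4
print(blocks_in_cache(16))              # 262144
print(max_indexable_bytes() // 2**20)   # 16, matching the 16MB maximum
```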
4.3.1 Bcache Interface Signals
Figure 4–2 shows the 21264/EV67 system interface signals.
The 21264/EV67 provides Bcache state support for systems with and without duplicate
tag stores, and will take different actions on this basis. The system sets the Cbox CSR
DUP_TAG_ENA[0], indicating that it has a duplicate tag store for the Bcache. Systems
using the DUP_TAG_ENA[0] bit must also use the Cbox CSR
BC_CLEAN_VICTIM[0] bit to avoid deadlock situations.
Systems using a Bcache duplicate tag store can accelerate system performance by:
•Issuing probes and SysDc fill commands to the 21264/EV67 out-of-order with
respect to their order at the system serialization point
•Filtering out all probe misses from the 21264/EV67 cache system
If a probe misses in the 21264/EV67 cache system (Bcache miss and VAF miss), the
21264/EV67 stalls probe processing with the expectation that a SysDc fill will allocate
this block. Because of this, in duplicate tag mode, the 21264/EV67 can never generate a
probe miss response.
When Cbox CSR DUP_TAG_ENA[0] equals 0, the 21264/EV67 delivers a miss
response for probes that do not hit in its cache system.
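The two probe-miss behaviors can be stated as a single decision on DUP_TAG_ENA[0]. This is a minimal sketch; the function name and the returned strings are illustrative descriptions, not command names from the specification.

```python
def probe_miss_outcome(dup_tag_ena):
    """What the 21264/EV67 does when a probe misses in its cache system."""
    if dup_tag_ena:
        # Duplicate tag mode: no miss response is ever generated.
        return "stall probe processing until a SysDc fill allocates the block"
    return "deliver miss response"

print(probe_miss_outcome(dup_tag_ena=True))
print(probe_miss_outcome(dup_tag_ena=False))
```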
4.4 Victim Data Buffer
The 21264/EV67 has eight victim data buffers (VDBs). They have the following properties:
•The VDBs are used for both victims (fills that are replacing dirty cache blocks) and
for system probes that require data movement. The CleanVictimBlk command
(optional) assigns and uses a VDB.
•Each VDB has two valid bits that indicate the buffer is valid for a victim or valid
for a probe or valid for both a victim and a probe. Probe commands that match the
address of a victim address file (VAF) entry with an asserted probe-valid bit (P)
will stall the 21264/EV67 probe queue. No ProbeResponses will be returned until
the P bit is clear.
•The release victim buffer (RVB) bit, when asserted, causes the victim valid bit, on
the victim data buffer (VDB) specified in the ID field, to be cleared. The RVB bit
will also clear the IOWB when systems move data on I/O write transactions. In this
case, ID[3] equals one.
•The release probe buffer (RPB) bit, when asserted (with a WriteData or ReleaseBuffer
SysDc command), clears the P bit in the victim buffer entry specified in the
ID field.
•Read data commands and victim write commands use IDs 0-7, while IDs 8-11 are
used to address the four I/O write buffers.
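The ID ranges in the list above map onto the two buffer types as follows. This sketch is illustrative: the function name is hypothetical, and numbering the I/O write buffers 0 through 3 by subtracting 8 from the ID is an assumption about how a system might label them.

```python
def decode_buffer_id(buffer_id):
    """Map a 4-bit ID to the buffer it addresses.

    IDs 0-7 select a victim data buffer (VDB); IDs 8-11 select one of the
    four I/O write buffers (IOWB), for which ID[3] equals one.
    """
    if 0 <= buffer_id <= 7:
        return ("VDB", buffer_id)
    if 8 <= buffer_id <= 11:
        return ("IOWB", buffer_id - 8)   # IOWB numbering is an assumption
    raise ValueError("IDs 12-15 are not described in this section")

print(decode_buffer_id(5))    # ('VDB', 5)
print(decode_buffer_id(10))   # ('IOWB', 2)
```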
4.5 Cache Coherency
This section describes the basics and protocols of the 21264/EV67 cache coherency
scheme.
4.5.1 Cache Coherency Basics
The 21264/EV67 systems maintain the cache hierarchy shown in Figure 4–3.
Figure 4–3 Cache Subset Hierarchy
[Figure: nested subset diagram with the labels System Main Memory, Bcache, Dcache, and Icache.]
The following tasks must be performed to maintain cache coherency:
•Istream data from memory spaces may be cached in the Icache and Bcache. Icache
coherence is not maintained by hardware; it must be maintained by software using
the CALL_PAL IMB instruction.
•The 21264/EV67 maintains the Dcache as a subset of the Bcache. The Dcache is
set-associative but is kept a subset of the larger externally implemented direct-mapped Bcache.
•System logic must help the 21264/EV67 to keep the Bcache coherent with main
memory and other caches in the system.
•The 21264/EV67 requires the system to allow only one change to a block at a time.
This means that if the 21264/EV67 gains the bus to read or write a block, no other
node on the bus should be allowed to access that block until the data has been
moved.
•The 21264/EV67 provides hardware mechanisms to support several cache coherency
protocols. The protocols can be separated into two classes: write invalidate
cache coherency protocol and flush cache coherency protocol.
4.5.2 Cache Block States
Table 4–2 lists the cache block states supported by the 21264/EV67.
Table 4–2 21264/EV67-Supported Cache Block States
State Name      Description
Invalid         The 21264/EV67 does not have a copy of the block.
Clean           This 21264/EV67 holds a read-only copy of the block, and no other
                agent in the system holds a copy. Upon eviction, the block is not
                written to memory.
Clean/Shared    This 21264/EV67 holds a read-only copy of the block, and at least
                one other agent in the system may hold a copy of the block. Upon
                eviction, the block is not written to memory.
Dirty           This 21264/EV67 holds a read-write copy of the block, and must
                write it to memory after it is evicted from the cache. No other
                agent in the system holds a copy of the block.
Dirty/Shared    This 21264/EV67 holds a read-only copy of the dirty block, which
                may be shared with another agent. The block must be written back
                to memory when it is evicted.
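The properties in Table 4–2 can be summarized as three predicates per state. The Python sketch below is a reading aid under stated assumptions: the enum and function names are invented for illustration, and "writable" is taken to mean the read-write copy described for the Dirty state.

```python
from enum import Enum


class BlockState(Enum):
    INVALID = "Invalid"
    CLEAN = "Clean"
    CLEAN_SHARED = "Clean/Shared"
    DIRTY = "Dirty"
    DIRTY_SHARED = "Dirty/Shared"


def is_writable(state):
    """Only Dirty is described as a read-write copy."""
    return state is BlockState.DIRTY


def needs_writeback_on_eviction(state):
    """Dirty and Dirty/Shared blocks must be written to memory when evicted."""
    return state in (BlockState.DIRTY, BlockState.DIRTY_SHARED)


def may_be_shared(state):
    """The two /Shared states allow another agent to hold a copy."""
    return state in (BlockState.CLEAN_SHARED, BlockState.DIRTY_SHARED)
```

Note that Dirty/Shared is the one state that is both read-only and subject to write-back on eviction.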
4.5.3 Cache Block State Transitions
Cache block state transitions are reflected by 21264/EV67-generated commands to the
system. Cache block state transitions can also be caused by system-generated commands to the 21264/EV67 (probes). Probes control the next state for the cache block.
The next state can be based on the previous state of the cache block. Table 4–3 lists the
next state for the cache block.
Table 4–3 Cache Block State Transitions
Next State                 Action Based on Probe Hit
No change                  Do not update cache state. Useful for DMA transactions that
                           sample data but do not want to update tag state.
Clean                      Independent of previous state, update next state to Clean.
Clean/Shared               Independent of previous state, update next state to
                           Clean/Shared. This transaction is useful for systems that
                           update memory on probe hits.
T1                         Based on the dirty bit, make the block clean or dirty shared.
                           This transaction is useful for systems that do not update
                           memory on probe hits.
Clean/Shared or Invalid    If the block is Clean or Dirty/Shared, change to Clean/Shared.
(state-dependent)          If the block is Dirty, change to Invalid. This transaction is
                           useful for systems that use the Dirty/Shared state as an
                           exclusive state.
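The probe next-state rules in Table 4–3 can be sketched as a single function. This is a minimal Python model under several assumptions: the action names (`"no_change"`, `"clean"`, `"clean_shared"`, `"t1"`, `"dirty_to_invalid"`) are hypothetical labels, T1 is read as "dirty blocks become Dirty/Shared, all others become Clean/Shared", and a Clean/Shared block under the last transition is assumed to stay Clean/Shared because the table does not mention it.

```python
INVALID, CLEAN, CLEAN_SHARED, DIRTY, DIRTY_SHARED = (
    "Invalid", "Clean", "Clean/Shared", "Dirty", "Dirty/Shared")


def probe_next_state(current, action):
    """Return the next cache block state after a probe hit."""
    if current == INVALID:
        # A probe hit presupposes that the block is present in the cache.
        raise ValueError("a probe hit implies a valid block")
    if action == "no_change":
        return current
    if action == "clean":
        return CLEAN
    if action == "clean_shared":
        return CLEAN_SHARED
    if action == "t1":
        # Assumed reading of "based on the dirty bit".
        is_dirty = current in (DIRTY, DIRTY_SHARED)
        return DIRTY_SHARED if is_dirty else CLEAN_SHARED
    if action == "dirty_to_invalid":
        # Dirty goes to Invalid; Clean and Dirty/Shared go to Clean/Shared.
        # Clean/Shared staying Clean/Shared is an assumption.
        return INVALID if current == DIRTY else CLEAN_SHARED
    raise ValueError(f"unknown probe action: {action}")
```

For example, `probe_next_state(DIRTY, "t1")` yields `"Dirty/Shared"` in this model, while `probe_next_state(DIRTY, "dirty_to_invalid")` yields `"Invalid"`.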
The cache state transitions caused by 21264/EV67-generated commands are under the
full control of the system environment using the SysDc (system data control)
commands. Table 4–4 lists these commands.
Table 4–4 System Responses to 21264/EV67 Commands
Response Type                 21264/EV67 Action
SysDc ReadData                Fill block with the associated data and update tag with clean cache status.
SysDc ReadDataDirty           Fill block with the associated data and update tag with dirty cache status.
SysDc ReadDataShared          Fill block with the associated data and update tag with shared cache status.
SysDc ReadDataShared/Dirty    Fill block with the associated data and update tag with dirty/shared status.
SysDc ReadDataError           Fill block with all-ones reference pattern and update tag with invalid status.
SysDc ChangeToDirtySuccess    Unconditionally update block with dirty cache status.
SysDc ChangeToDirtyFail       Do not update cache status and fail any associated STx_C instructions.
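The tag updates in Table 4–4 amount to a lookup from SysDc response to resulting cache status. The Python sketch below illustrates that lookup only; the function name is hypothetical, "shared cache status" is assumed to correspond to the Clean/Shared state of Table 4–2, and the failing of STx_C instructions on ChangeToDirtyFail is noted in a comment but not modeled.

```python
# Tag status written by each fill-type SysDc response (per Table 4-4).
FILL_STATUS = {
    "ReadData": "Clean",
    "ReadDataDirty": "Dirty",
    "ReadDataShared": "Clean/Shared",  # assumed meaning of "shared cache status"
    "ReadDataShared/Dirty": "Dirty/Shared",
    "ReadDataError": "Invalid",
}


def tag_status_after(response, current_status):
    """Return the tag status after the given SysDc response."""
    if response in FILL_STATUS:
        return FILL_STATUS[response]
    if response == "ChangeToDirtySuccess":
        return "Dirty"
    if response == "ChangeToDirtyFail":
        # Status is left alone; associated STx_C instructions also fail,
        # which this sketch does not model.
        return current_status
    raise ValueError(f"unknown SysDc response: {response}")
```

Only ChangeToDirtyFail depends on the current status; every other response sets the status unconditionally.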
4.5.4 Using SysDc Commands
Note the following:
•The conventional response for RdBlk commands is SysDc ReadData or
ReadDataShared.
•The conventional response for a RdBlkMod command is SysDc ReadDataDirty.
•The conventional response for ChangeToDirty commands is
ChangeToDirtySuccess or ChangeToDirtyFail.
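The conventional pairings listed above can be captured in a small table, for example in a system-logic simulation or test bench. The Python sketch below is an assumption-laden illustration: the dictionary and function names are hypothetical, and a `False` result only means the pairing is not one of the conventional ones named here.

```python
# Conventional SysDc responses for each command, as listed in the notes above.
CONVENTIONAL_RESPONSES = {
    "RdBlk": {"ReadData", "ReadDataShared"},
    "RdBlkMod": {"ReadDataDirty"},
    "ChangeToDirty": {"ChangeToDirtySuccess", "ChangeToDirtyFail"},
}


def is_conventional(command, response):
    """True if the response is a conventional one for the command."""
    return response in CONVENTIONAL_RESPONSES.get(command, set())
```

A checker like this could flag unconventional responses for review without rejecting them outright.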
However, the system environment is not limited to these responses. Table 4–5 shows all
21264/EV67 commands, system responses, and the 21264/EV67 reaction. The
21264/EV67 commands are described in the following list:
•Rdx commands are generated by load or Istream references.
•RdBlkModx commands are generated by store references.
•The ChxToDirty command group includes CleanToDirty, SharedToDirty, and
STCChangeToDirty commands, which are generated by store references that hit in the
21264/EV67 cache system.
•InvalToDirty commands are generated by WH64 instructions that miss in the
21264/EV67 cache system.
•FetchBlk and FetchBlkSpec are noncached references to memory space that have
missed in the 21264/EV67 cache system.
•Rdiox commands are noncached references to I/O address space.
•Evict and STCChangeToDirty commands are generated by ECB and STx_C
instructions, respectively.
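The command list above can be kept as a lookup when annotating bus traces. The following Python sketch is only a convenience structure: the names `COMMAND_ORIGIN`, `CHX_TO_DIRTY_GROUP`, and `group_of` are hypothetical, and the group names ending in "x" are used as written in the list rather than expanded into individual commands.

```python
# What generates each command or command group, per the list above.
COMMAND_ORIGIN = {
    "Rdx": "load or Istream reference",
    "RdBlkModx": "store reference",
    "ChxToDirty": "store reference that hits in the cache system",
    "InvalToDirty": "WH64 instruction that misses in the cache system",
    "FetchBlk": "noncached reference to memory space that missed in the cache system",
    "FetchBlkSpec": "noncached reference to memory space that missed in the cache system",
    "Rdiox": "noncached reference to I/O address space",
    "Evict": "ECB instruction",
    "STCChangeToDirty": "STx_C instruction",
}

# Members of the ChxToDirty command group named in the list.
CHX_TO_DIRTY_GROUP = ("CleanToDirty", "SharedToDirty", "STCChangeToDirty")


def group_of(command):
    """Map a specific command to its group name, or return it unchanged."""
    return "ChxToDirty" if command in CHX_TO_DIRTY_GROUP else command
```

STCChangeToDirty appears twice on purpose: it has its own entry for the STx_C origin and is also a member of the ChxToDirty group.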
Table 4–5 shows the system responses to 21264/EV67 commands and 21264/EV67
reactions.
Table 4–5 System Responses to 21264/EV67 Commands and 21264/EV67 Reactions
21264/EV67
CMD            SysDc                   21264/EV67 Action
Rdx            ReadData                This is a normal fill. The cache block is filled and marked
               ReadDataShared          clean or shared based on SysDc.
Rdx            ReadDataShared/Dirty    The cache block is filled and marked dirty/shared. Succeeding
                                       store commands cannot update the block without external
                                       reference.
Rdx            ReadDataDirty           The cache block is filled and marked dirty.
Rdx            ReadDataError           The cache block access was to NXM address space. The
                                       21264/EV67 delivers an all-ones pattern to any load command
                                       and evicts the block from the cache (with associated victim
                                       processing). The cache block is marked invalid.
Rdx            ChangeToDirtySuccess    Both SysDc responses are illegal for read commands.
               ChangeToDirtyFail
RdBlkModx      ReadData                The cache block is filled and marked with a nonwritable
               ReadDataShared          status. If the store instruction that generated the
               ReadDataShared/Dirty    RdBlkModx command is still active (not killed), the
                                       21264/EV67 will retry the instruction, generating the
                                       appropriate ChangeToDirty command. Succeeding store
                                       commands cannot update the block without external reference.
RdBlkModx      ReadDataDirty           The 21264/EV67 performs a normal fill response, and the
                                       cache block becomes writable.
RdBlkModx      ChangeToDirtySuccess    Both SysDc responses are illegal for read/modify commands.
               ChangeToDirtyFail
RdBlkModx      ReadDataError           The cache block command was to NXM address space. The
                                       21264/EV67 delivers an all-ones pattern to any dependent
                                       load command, forces a fail action on any pending store
                                       commands to this block, and any store to this block is not
                                       retried. The Cbox evicts the cache block from the cache
                                       system (with associated victim processing). The cache
                                       block is marked invalid.
ChxToDirty     ReadData                The original data in the Dcache is replaced with the filled
               ReadDataShared          data. The block is not writable, so the 21264/EV67 will
               ReadDataShared/Dirty    retry the store instruction and generate another ChxToDirty
                                       class command. To avoid a potential livelock situation, the
                                       STC_ENABLE CSR bit must be set. Any STx_C instruction to
                                       this block is forced to fail. In addition, a Shared/Dirty
                                       response causes the 21264/EV67 to generate a victim for
                                       this block upon eviction.
ChxToDirty     ReadDataDirty           The data in the Dcache is replaced with the filled data. The
                                       block is writable, so the store instruction that generated
                                       the original command can update this block. Any STx_C
                                       instruction to this block is forced to fail. In addition,
                                       the 21264/EV67 generates a victim for this block upon
                                       eviction.
ChxToDirty     ReadDataError           Impossible situation. The block must be cached to generate
                                       a ChxToDirty command. Caching the block is not possible
                                       because all NXM fills are filled noncached.
ChxToDirty     ChangeToDirtySuccess    Normal response. ChangeToDirtySuccess makes the block
                                       writable. The 21264/EV67 retries the store instruction and
                                       updates the Dcache. Any STx_C instruction associated with
                                       this block is allowed to succeed.
ChxToDirty     ChangeToDirtyFail       The MAF entry is retired. Any STx_C instruction associated
                                       with the block is forced to fail. If a STx instruction
                                       generated this block, the 21264/EV67 retries and generates
                                       either a RdBlkModx (because the reference that failed the
                                       ChangeToDirty also invalidated the cache by way of an
                                       invalidating probe) or another ChxToDirty command.
InvalToDirty   ReadData                The block is not writable, so the 21264/EV67 will retry the
               ReadDataShared          WH64 instruction and generate a ChxToDirty command.