This manual is directly derived from the internal 21264/EV68A Specifications, Revision 1.1. You can access this hardware reference manual in PDF format from the
following site:
ftp://ftp.compaq.com/pub/products/alphaCPUdocs
Revision/Update Information: Revision 1.1, March 2002
Compaq Computer Corporation
Shrewsbury, Massachusetts
March 2002
The information in this publication is subject to change without notice.
COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL
ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL. THIS
INFORMATION IS PROVIDED “AS IS” AND COMPAQ COMPUTER CORPORATION DISCLAIMS ANY
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY AND EXPRESSLY DISCLAIMS THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR PARTICULAR PURPOSE, GOOD TITLE AND AGAINST
INFRINGEMENT.
This publication contains information protected by copyright. No part of this publication may be photocopied or
reproduced in any form without prior written consent from Compaq Computer Corporation.
This manual is for system designers and programmers who use the Alpha 21264/EV68A
microprocessor (referred to as the 21264/EV68A).
This manual contains the following chapters and appendixes:
Chapter 1, Introduction, introduces the 21264/EV68A and provides an overview of the
Alpha architecture.
Chapter 2, Internal Architecture, describes the major hardware functions and the internal
chip architecture. It describes performance measurement facilities, coding rules, and
design examples.
Chapter 3, Hardware Interface, lists and describes the internal hardware interface signals, and provides mechanical data and packaging information, including signal pin
lists.
Chapter 4, Cache and External Interfaces, describes the external bus functions and
transactions, lists bus commands, and describes the clock functions.
Chapter 5, Internal Processor Registers, lists and describes the internal processor register set.
Chapter 7, Initialization and Configuration, describes the initialization and configuration sequence.
Chapter 8, Error Detection and Error Handling, describes error detection and error handling.
Chapter 9, Electrical Data, provides electrical data and describes signal integrity issues.
Chapter 10, Thermal Management, provides information about thermal management.
Chapter 11, Testability and Diagnostics, describes chip and system testability features.
Appendix A, Alpha Instruction Set, summarizes the Alpha instruction set.
Appendix B, 21264/EV68A Boundary-Scan Register, presents the BSDL description
of the 21264/EV68A boundary-scan register.
21264/EV68A Hardware Reference Manual
Appendix C, Serial Icache Load Predecode Values, provides a pointer to the Alpha
Motherboards Software Developer’s Kit (SDK), which contains this information.
Appendix D, PALcode Restrictions and Guidelines, lists restrictions and guidelines
that must be adhered to when generating PALcode.
Appendix E, 21264/EV68A-to-Bcache Pin Interface, provides the pin interface
between the 21264/EV68A and Bcache SSRAMs.
The Glossary lists and defines terms associated with the 21264/EV68A.
An Index is provided at the end of the document.
Documentation Included by Reference
The companion volume to this manual, the Alpha Architecture Reference Manual,
Fourth Edition, can be accessed from the following website:
ftp.compaq.com/pub/products/alphaCPUdocs.
Terminology and Conventions
This section defines the abbreviations, terminology, and other conventions used
throughout this document.
Abbreviations
•Binary Multiples
The abbreviations K, M, and G (kilo, mega, and giga) represent binary multiples
and have the following values: K = 2^10 (1024), M = 2^20 (1,048,576), and
G = 2^30 (1,073,741,824).
The abbreviations used to indicate the type of access to register fields and bits have
the following definitions:
Abbreviation  Meaning
IGN           Ignore
              Bits and fields specified are ignored on writes.
MBZ           Must Be Zero
              Software must never place a nonzero value in bits and fields specified as
              MBZ. A nonzero read produces an Illegal Operand exception. Also, MBZ
              fields are reserved for future use.
RAZ           Read As Zero
              Bits and fields return a zero when read.
RC            Read Clears
              Bits and fields are cleared when read. Unless otherwise specified, such bits
              cannot be written.
RES           Reserved
              Bits and fields are reserved by Compaq and should not be used; however,
              zeros can be written to reserved fields that cannot be masked.
RO            Read Only
              The value may be read by software. It is written by hardware. Software write
              operations are ignored.
RO,n          Read Only, and takes the value n at power-on reset.
              The value may be read by software. It is written by hardware. Software write
              operations are ignored.
RW            Read/Write
              Bits and fields can be read and written.
RW,n          Read/Write, and takes the value n at power-on reset.
              Bits and fields can be read and written.
W1C           Write One to Clear
              If read operations are allowed to the register, then the value may be read by
              software. If it is a write-only register, then a read operation by software
              returns an UNPREDICTABLE result. Software write operations of a 1 cause
              the bit to be cleared by hardware. Software write operations of a 0 do not
              modify the state of the bit.
W1S           Write One to Set
              If read operations are allowed to the register, then the value may be read by
              software. If it is a write-only register, then a read operation by software
              returns an UNPREDICTABLE result. Software write operations of a 1 cause
              the bit to be set by hardware. Software write operations of a 0 do not modify
              the state of the bit.
WO            Write Only
              Bits and fields can be written but not read.
WO,n          Write Only, and takes the value n at power-on reset.
              Bits and fields can be written but not read.
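The write-side behavior of several of these access types can be sketched as a small software model. This is an illustration only: the function names `apply_write` and `read_clears` and the 64-bit register width are assumptions made for the sketch, not part of the 21264/EV68A specification.

```python
MASK64 = (1 << 64) - 1  # assumed register width for this sketch


def apply_write(access, current, written):
    """Return the new field value after software writes `written`.

    `access` is one of the access-type abbreviations defined above.
    """
    if access in ("RW", "WO"):
        return written & MASK64              # the value is replaced
    if access == "W1C":
        return current & ~written & MASK64   # each 1 written clears that bit
    if access == "W1S":
        return (current | written) & MASK64  # each 1 written sets that bit
    if access in ("RO", "IGN"):
        return current                       # software writes are ignored
    raise ValueError("unmodeled access type: " + access)


def read_clears(current):
    """RC field: a read returns the value and leaves the field cleared."""
    return current, 0
```

For example, writing 0b0010 to a W1C field that holds 0b1011 leaves 0b1001, while writing a 0 to any bit position leaves that bit unchanged.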
•Sign extension
SEXT(x) means x is sign-extended to the required size.
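As a minimal sketch of what SEXT(x) computes, the following Python function sign-extends a field of a given width. The name `sext` and its parameters are hypothetical; the default 64-bit target width is an assumption based on the register size described in this manual.

```python
def sext(x, from_bits, to_bits=64):
    """Sign-extend the low `from_bits` bits of x to a `to_bits`-wide value."""
    x &= (1 << from_bits) - 1            # keep only the source field
    sign = 1 << (from_bits - 1)          # position of the sign bit
    value = (x ^ sign) - sign            # two's-complement reinterpretation
    return value & ((1 << to_bits) - 1)  # express as an unsigned bit pattern
```

A 16-bit value with its top bit set, such as 0x8000, extends to 0xFFFFFFFFFFFF8000, while 0x7FFF is unchanged.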
Addresses
Unless otherwise noted, all addresses and offsets are hexadecimal.
Aligned and Unaligned
The terms aligned and naturally aligned are interchangeable and refer to data objects
that are powers of two in size. An aligned datum of size 2^n is stored in memory at a
byte address that is a multiple of 2^n; that is, one that has n low-order zeros. For example, an aligned 64-byte stack frame has a memory address that is a multiple of 64.
A datum of size 2^n is unaligned if it is stored in a byte address that is not a multiple of
2^n.
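The alignment rule reduces to a test of the low-order address bits. The helper below is a hypothetical illustration of that test, not code from the manual.

```python
def is_aligned(address, size):
    """True if `address` is naturally aligned for a datum of `size` bytes.

    `size` must be a power of two (2**n); alignment then means the n
    low-order address bits are all zero.
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return (address & (size - 1)) == 0
```

With this definition, address 0x1000 is aligned for a 64-byte stack frame, and address 0x1008 is aligned for a quadword but not for a 64-byte frame.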
Bit Notation
Multiple-bit fields can include contiguous and noncontiguous bits contained in square
brackets ([]). Multiple contiguous bits are indicated by a pair of numbers separated by a
colon [:]. For example, [9:7,5,2:0] specifies bits 9, 8, 7, 5, 2, 1, and 0. Similarly, single bits
are frequently indicated with square brackets. For example, [27] specifies bit 27. See
also Field Notation.
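To make the notation concrete, the sketch below expands a bit specification into the individual bit numbers it names. The function name `expand_bits` is hypothetical, and the parser assumes a well-formed specification.

```python
def expand_bits(spec):
    """Expand a bit specification such as "[9:7,5,2:0]" into a list of bits."""
    bits = []
    for part in spec.strip("[]").split(","):
        if ":" in part:
            high, low = (int(n) for n in part.split(":"))
            bits.extend(range(high, low - 1, -1))  # inclusive, high to low
        else:
            bits.append(int(part))                 # a single bit
    return bits
```

Calling `expand_bits("[9:7,5,2:0]")` yields the bits 9, 8, 7, 5, 2, 1, and 0 in that order.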
Caution
Cautions indicate potential damage to equipment or loss of data.
Data Units
The following data unit terminology is used throughout this manual.
Unless otherwise stated, external means not contained in the chip.
Field Notation
The names of single-bit and multiple-bit fields can be used rather than the actual bit
numbers (see Bit Notation). When the field name is used, it is contained in square
brackets ([]). For example, RegisterName[LowByte] specifies RegisterName[7:0].
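Reading a field by name amounts to a shift and a mask over the field's bit extent. In the sketch below, the `FIELDS` table and the functions `extract` and `field` are hypothetical; only the LowByte example comes from the text.

```python
FIELDS = {"LowByte": (7, 0)}  # hypothetical table: field name -> (high bit, low bit)


def extract(value, high, low):
    """Return bits [high:low] of value, inclusive at both ends."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def field(value, name):
    """Return the named field of a register value, as in RegisterName[LowByte]."""
    high, low = FIELDS[name]
    return extract(value, high, low)
```

For a register value of 0xABCD, `field(value, "LowByte")` returns 0xCD, the same result as extracting bits [7:0] directly.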
Note
Notes emphasize particularly important information.
Numbering
All numbers are decimal or hexadecimal unless otherwise indicated. The prefix 0x indicates a hexadecimal number. For example, 19 is decimal, but 0x19 and 0x19A are hexadecimal (also see Addresses). Otherwise, the base is indicated by a subscript; for
example, 100₂ is a binary number.
Ranges and Extents
Ranges are specified by a pair of numbers separated by two periods (..) and are inclusive. For example, a range of integers 0..4 includes the integers 0, 1, 2, 3, and 4.
Extents are specified by a pair of numbers in square brackets ([]) separated by a colon
(:) and are inclusive. Bit fields are often specified as extents. For example, bits [7:3]
specifies bits 7, 6, 5, 4, and 3.
Register Figures
The gray areas in register figures indicate reserved or unused bits and fields.
Bit ranges that are coupled with the field name specify the bits of the named field that
are included in the register. The bit range may, but need not necessarily, correspond to
the bit extent in the register. See the explanation above Table 5–1 for more information.
Signal Names
The following examples describe signal-name conventions used in this document.
AlphaSignal[n:n]    Boldface, mixed-case type denotes signal names that are
                    assigned internal and external to the 21264/EV68A (that
                    is, the signal traverses a chip interface pin).
AlphaSignal_x[n:n]  When a signal has high and low assertion states, a lowercase
                    italic x represents the assertion states. For example,
                    SignalName_x[3:0] represents SignalName_H[3:0] and
                    SignalName_L[3:0].
UNDEFINED
Operations specified as UNDEFINED may vary from moment to moment, implementation to implementation, and instruction to instruction within implementations. The
operation may vary in effect from nothing to stopping system operation.
UNDEFINED operations may halt the processor or cause it to lose information. However, UNDEFINED operations must not cause the processor to hang, that is, reach an
unhalted state from which there is no transition to a normal state in which the machine
executes instructions.
UNPREDICTABLE
UNPREDICTABLE results or occurrences do not disrupt the basic operation of the processor; it continues to execute instructions in its normal manner. Further:
•Results or occurrences specified as UNPREDICTABLE may vary from moment to
moment, implementation to implementation, and instruction to instruction within
implementations. Software can never depend on results specified as UNPREDICTABLE.
•An UNPREDICTABLE result may acquire an arbitrary value subject to a few constraints.
Such a result may be an arbitrary function of the input operands or of any
state information that is accessible to the process in its current access mode.
UNPREDICTABLE results may be unchanged from their previous values.
Operations that produce UNPREDICTABLE results may also produce exceptions.
•An occurrence specified as UNPREDICTABLE may happen or not based on an
arbitrary choice function. The choice function is subject to the same constraints as
are UNPREDICTABLE results and, in particular, must not constitute a security
hole.
Specifically, UNPREDICTABLE results must not depend upon, or be a function of,
the contents of memory locations or registers that are inaccessible to the current
process in the current access mode.
Also, operations that may produce UNPREDICTABLE results must not:
–Write or modify the contents of memory locations or registers to which the current
process in the current access mode does not have access, or
–Halt or hang the system or any of its components.
For example, a security hole would exist if some UNPREDICTABLE result
depended on the value of a register in another process, on the contents of processor
temporary registers left behind by some previously running process, or on a
sequence of actions of different processes.
X
Do not care. A capital X represents any valid value.
1
Introduction
This chapter provides a brief introduction to the Alpha architecture, Compaq’s RISC
(reduced instruction set computing) architecture designed for high performance. The
chapter then summarizes the specific features of the Alpha 21264/EV68A microprocessor (hereafter called the 21264/EV68A) that implements the Alpha architecture. Appendix A provides a list of Alpha instructions.
The companion volume to this document, the Alpha Architecture Reference Manual, Fourth Edition, contains the complete architecture information.
1.1 The Architecture
The Alpha architecture is a 64-bit load and store RISC architecture designed with particular emphasis on speed, multiple instruction issue, multiple processors, and software
migration from many operating systems.
All registers are 64 bits long and all operations are performed between 64-bit registers.
All instructions are 32 bits long. Memory operations are either load or store operations.
All data manipulation is done between registers.
The Alpha architecture supports the following data types:
•8-, 16-, 32-, and 64-bit integers
•IEEE 32-bit and 64-bit floating-point formats
•VAX architecture 32-bit and 64-bit floating-point formats
In the Alpha architecture, instructions interact with each other only by one instruction
writing to a register or memory location and another instruction reading from that register or memory location. This use of resources makes it easy to build implementations
that issue multiple instructions every CPU cycle.
The 21264/EV68A uses a set of subroutines, called privileged architecture library code
(PALcode), that is specific to a particular Alpha operating system implementation and
hardware platform. These subroutines provide operating system primitives for context
switching, interrupts, exceptions, and memory management. These subroutines can be
invoked by hardware or CALL_PAL instructions. CALL_PAL instructions use the
function field of the instruction to vector to a specified subroutine. PALcode is written
in standard machine code with some implementation-specific extensions to provide
direct access to low-level hardware functions. PALcode supports optimizations for multiple operating systems, flexible memory-management implementations, and multi-instruction atomic sequences.
The Alpha architecture performs byte shifting and masking with normal 64-bit, register-to-register instructions. The 21264/EV68A performs single-byte and single-word
load and store instructions.
1.1.1 Addressing
The basic addressable unit in the Alpha architecture is the 8-bit byte. The 21264/EV68A
supports a 48-bit or 43-bit virtual address (selectable under IPR control).
Virtual addresses as seen by the program are translated into physical memory addresses
by the memory-management mechanism. The 21264/EV68A supports a 44-bit physical
address.
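The address widths translate directly into address-space sizes. The short sketch below works through that arithmetic; the helper names are hypothetical, and the code models only the sizes, not the translation mechanism.

```python
def address_space_bytes(bits):
    """Number of byte addresses reachable with an address of `bits` bits."""
    return 1 << bits


def fits_physical(address, pa_bits=44):
    """True if `address` can be expressed as a 44-bit physical address."""
    return 0 <= address < (1 << pa_bits)


TB = 2 ** 40  # one binary terabyte, used here only to express the results
```

Under these definitions, a 43-bit virtual address spans 8 × 2^40 bytes, a 48-bit virtual address spans 256 × 2^40 bytes, and the 44-bit physical address spans 16 × 2^40 bytes.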
1.1.2 Integer Data Types
The Alpha architecture supports the four integer data types listed in Table 1–1.
Table 1–1 Integer Data Types
Data Type  Description
Byte       A byte is 8 contiguous bits that start at an addressable byte boundary.
           A byte is an 8-bit value.
Word       A word is 2 contiguous bytes that start at an arbitrary byte boundary.
           A word is a 16-bit value.
Longword   A longword is 4 contiguous bytes that start at an arbitrary byte boundary.
           A longword is a 32-bit value.
Quadword   A quadword is 8 contiguous bytes that start at an arbitrary byte boundary.
           A quadword is a 64-bit value.
Note: Alpha implementations may impose a significant performance penalty
when accessing operands that are not naturally aligned. Refer to the Alpha
Architecture Handbook, Version 4 for details.
1.1.3 Floating-Point Data Types
The 21264/EV68A supports the following floating-point data types:
•Longword integer format in floating-point unit
•Quadword integer format in floating-point unit
•IEEE floating-point formats
–S_floating
–T_floating
•VAX floating-point formats
–F_floating
–G_floating
–D_floating (limited support)
1.2 21264/EV68A Microprocessor Features
The 21264/EV68A microprocessor is a superscalar pipelined processor. It is packaged
in a 587-pin PGA carrier and has removable application-specific heat sinks. A number
of configuration options allow its use in system designs ranging from
extremely simple uniprocessor systems with minimum component count to high-performance multiprocessor systems with very high cache and memory bandwidth.
The 21264/EV68A can issue four Alpha instructions in a single cycle, thereby minimizing the average cycles per instruction (CPI). A number of low-latency and/or high-throughput features in the instruction issue unit and the onchip components of the memory subsystem further reduce the average CPI.
The 21264/EV68A and associated PALcode implement IEEE single-precision and
double-precision, VAX F_floating and G_floating data types, and support longword
(32-bit) and quadword (64-bit) integers. Byte (8-bit) and word (16-bit) support is provided by byte-manipulation instructions. Limited hardware support is provided for the
VAX D_floating data type.
Other 21264/EV68A features include:
•The ability to issue up to four instructions during each CPU clock cycle.
•A peak instruction execution rate of four times the CPU clock frequency.
•An onchip, demand-paged memory-management unit with translation buffer, which,
when used with PALcode, can implement a variety of page table structures and translation algorithms. The unit consists of a 128-entry, fully-associative data translation
buffer (DTB) and a 128-entry, fully-associative instruction translation buffer (ITB),
with each entry able to map a single 8KB page or a group of 8, 64, or 512 8KB
pages. The allocation scheme for the ITB and DTB is round-robin. The size of each
translation buffer entry’s group is specified by hint bits stored in the entry. The
DTB and ITB implement 8-bit address space numbers (ASN), MAX_ASN=255.
•Two onchip, high-throughput pipelined floating-point units, capable of executing
both VAX and IEEE floating-point data types.
•An onchip, 64KB virtually-addressed instruction cache with 8-bit ASNs
(MAX_ASN=255).
•An onchip, virtually-indexed, physically-tagged dual-read-ported, 64KB data
cache.
•Supports a 48-bit or 43-bit virtual address (program selectable).
•Supports a 44-bit physical address.
•An onchip I/O write buffer with four 64-byte entries for I/O write transactions.
•An onchip, 8-entry victim data buffer.
•An onchip, 32-entry load queue.
•An onchip, 32-entry store queue.
•An onchip, 8-entry miss address file for cache fill requests and I/O read
transactions.
•An onchip, 8-entry probe queue, holding pending system port probe commands.
•An onchip, duplicate tag array used to maintain level 2 cache coherency.
•A 64-bit data bus with onchip parity and error correction code (ECC) support.
•Support for an external second-level (Bcache) cache. The size and some timing
parameters of the Bcache are programmable.
•An internal clock generator providing a high-speed clock used by the 21264/EV68A,
and two clocks for use by the CPU module.
•Onchip performance counters to measure and analyze CPU and system performance.
•Chip and module level test support, including an instruction cache test interface to
support chip and module level testing.
•A 2.0-V external interface.
Refer to Chapter 9 for 21264/EV68A dc and ac electrical characteristics. Refer to the
Alpha Architecture Handbook, Version 4, Appendix E, for waivers and any other
implementation-dependent information.
2
Internal Architecture
This chapter provides both an overview of the 21264/EV68A microarchitecture and a system designer’s view of the 21264/EV68A implementation of the Alpha architecture. The
combination of the 21264/EV68A microarchitecture and privileged architecture library
code (PALcode) defines the chip’s implementation of the Alpha architecture. If a certain
piece of hardware seems to be “architecturally incomplete,” the missing functionality is
implemented in PALcode. Chapter 6 provides more information on PALcode.
This chapter describes the major functional hardware units and is not intended to be a
detailed hardware description of the chip. It is organized as follows:
•21264/EV68A microarchitecture
•Pipeline organization
•Instruction issue and retire rules
•Load instructions to R31/F31 (software-directed instruction prefetch)
•Special cases of Alpha instruction execution
•Memory and I/O address space
•Miss address file (MAF) and load-merging rules
•Instruction ordering
•Replay traps
•I/O write buffer and the WMB instruction
•Performance measurement support
•Floating-point control register
•AMASK and IMPLVER instruction values
•Design examples
2.1 21264/EV68A Microarchitecture
The 21264/EV68A microprocessor is a high-performance third-generation implementation of the Compaq Alpha architecture. The 21264/EV68A consists of the following
sections, as shown in Figure 2–1:
•Instruction fetch, issue, and retire unit (Ibox)
•Integer execution unit (Ebox)
•Floating-point execution unit (Fbox)
•Onchip caches (Icache and Dcache)
•Memory reference unit (Mbox)
•External cache and system interface unit (Cbox)
•Pipeline operation sequence
2.1.1 Instruction Fetch, Issue, and Retire Unit
The instruction fetch, issue, and retire unit (Ibox) consists of the following subsections:
•Virtual program counter logic
•Branch predictor
•Instruction-stream translation buffer (ITB)
•Instruction fetch logic
•Register rename maps
•Integer and floating-point issue queues
•Exception and interrupt logic
•Retire logic
2.1.1.1 Virtual Program Counter Logic
The virtual program counter (VPC) logic maintains the virtual addresses for instructions that are in flight. There can be up to 80 instructions, in 20 successive fetch slots, in
flight between the register rename mappers and the end of the pipeline. The VPC logic
contains a 20-entry table to store these fetched VPC addresses.
Figure 2–1 21264/EV68A Block Diagram
[Figure: block diagram of the instruction cache, Ibox (fetch unit, VPC queue, branch predictor, ITB, predecode, decode and rename registers, retire unit, 20-entry integer issue queue, 15-entry FP issue queue), Ebox (address ALUs L0 and L1, integer units U0 and U1, two 80-entry integer register files), Fbox (FP add, divide, square root, and multiply units; 72 FP registers), Mbox (dual-ported, 128-entry DTB), Cbox, and dual-ported data cache.]
2.1.1.2 Branch Predictor
The branch predictor is composed of three units: the local, global, and choice predictors. Figure 2–2 shows how the branch predictor generates the predicted branch
address.
Figure 2–2 Branch Predictor
[Figure: the local predictor, global predictor, and choice predictor combine to produce the predicted branch address.]
Local Predictor
The local predictor uses a 2-level table that holds the history of individual branches.
The 2-level table design approaches the prediction accuracy of a larger single-level
table while requiring fewer total bits of storage. Figure 2–3 shows how the local predictor generates a prediction. Bits [11:2] of the VPC of the current branch are used as
the index to a 1K entry table in which each entry is a 10-bit value. This 10-bit value is
used as the index to a 1K entry table of 3-bit saturating counters. The value of the saturating counter determines the prediction, taken/not-taken, of the current branch.
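The two-level lookup can be sketched in software. The table sizes and the VPC[11:2] index come from the text; the training rule, the history update, and the taken threshold are assumptions made for this illustration, because this section does not describe them. The class name `LocalPredictor` is hypothetical.

```python
class LocalPredictor:
    """Two-level local predictor: per-branch history selects a counter."""

    def __init__(self):
        self.history = [0] * 1024   # 1K entries, each a 10-bit branch history
        self.counters = [0] * 1024  # 1K entries, each a 3-bit saturating counter

    def _index(self, vpc):
        return (vpc >> 2) & 0x3FF   # VPC[11:2]

    def predict(self, vpc):
        counter = self.counters[self.history[self._index(vpc)]]
        return counter >= 4         # assumed threshold: upper half means taken

    def update(self, vpc, taken):
        i = self._index(vpc)
        h = self.history[i]
        c = self.counters[h]
        # Assumed training rule: count up on taken, down on not-taken.
        self.counters[h] = min(c + 1, 7) if taken else max(c - 1, 0)
        # Assumed history update: shift the newest outcome into the low bit.
        self.history[i] = ((h << 1) | int(taken)) & 0x3FF
```

In this model, a branch that is always taken first fills its 10-bit history with ones and then drives the counter selected by that history to saturation, after which the branch is predicted taken.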
Figure 2–3 Local Predictor
[Figure: VPC[11:2] indexes the 1K x 10 local history table; the 10-bit history indexes the 1K x 3 local predictor table, whose counter gives the local branch prediction.]
Global Predictor
The global predictor is indexed by a global history of all recent branches. The global
predictor correlates the local history of the current branch with all recent branches. Figure 2–4 shows how the global predictor generates a prediction. The global path history
consists of the taken/not-taken state of the 12 most-recent branches. These 12
states are used to form an index into a 4K entry table of 2-bit saturating counters. The
value of the saturating counter determines the prediction, taken/not-taken, of the current branch.
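A minimal software model of the global predictor follows. The 12-bit path history and the 4K table of 2-bit counters come from the text; the training rule and the taken threshold are assumptions, and the class name `GlobalPredictor` is hypothetical.

```python
class GlobalPredictor:
    """Global predictor indexed by the outcomes of the 12 most recent branches."""

    def __init__(self):
        self.path_history = 0        # taken/not-taken state of the last 12 branches
        self.counters = [0] * 4096   # 4K entries, each a 2-bit saturating counter

    def predict(self):
        return self.counters[self.path_history] >= 2  # assumed threshold

    def update(self, taken):
        c = self.counters[self.path_history]
        # Assumed training rule: count up on taken, down on not-taken.
        self.counters[self.path_history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the newest outcome into the 12-bit path history.
        self.path_history = ((self.path_history << 1) | int(taken)) & 0xFFF
```

Unlike the local predictor, this model takes no branch address: the prediction depends only on the recent path, so two different branches reached through the same 12-branch history share a counter.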
Figure 2–4 Global Predictor
[Figure: the 12-bit global path history indexes the 4K x 2 global predictor table, whose counter gives the global branch prediction.]
Choice Predictor
The choice predictor monitors the history of the local and global predictors and chooses
the best of the two predictors for a particular branch. Figure 2–5 shows how the choice
predictor generates its choice of the result of the local or global prediction. The 12-bit
global path history (see Figure 2–4) is used to index a 4K entry table of 2-bit saturating
counters. The value of the saturating counter determines the choice between the outputs
of the local and global predictors.
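The selection step can be sketched as follows. The 4K table of 2-bit counters indexed by the 12-bit global path history comes from the text; the counter encoding (upper half selects the global predictor) and the training rule (train only when the two predictors disagree) are assumptions made for this illustration. The class name `ChoicePredictor` is hypothetical.

```python
class ChoicePredictor:
    """Selects between the local and global predictions for a branch."""

    def __init__(self):
        self.counters = [0] * 4096  # 4K entries, each a 2-bit saturating counter

    def choose(self, path_history, local_pred, global_pred):
        use_global = self.counters[path_history & 0xFFF] >= 2  # assumed encoding
        return global_pred if use_global else local_pred

    def update(self, path_history, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return                  # assumed: no training when both agree
        i = path_history & 0xFFF
        c = self.counters[i]
        # Assumed rule: move toward whichever predictor was correct.
        self.counters[i] = min(c + 1, 3) if global_pred == taken else max(c - 1, 0)
```

In this model, a path on which the global predictor is repeatedly right while the local predictor is wrong migrates toward the global prediction after two such outcomes.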
Figure 2–5 Choice Predictor
[Figure: the 12-bit global path history indexes the 4K x 2 choice predictor table, whose counter gives the choice prediction.]
2.1.1.3 Instruction-Stream Translation Buffer
The Ibox includes a 128-entry, fully-associative instruction-stream translation buffer
(ITB) that is used to store recently used instruction-stream (Istream) address translations and page protection information. Each of the entries in the ITB can map 1, 8, 64,
or 512 contiguous 8KB pages. The allocation scheme is round-robin.
The ITB supports an 8-bit ASN and contains an ASM bit. The Icache is virtually
addressed and contains the access-check information, so the ITB is accessed only for
Istream references that miss in the Icache.
Istream transactions to I/O address space are UNDEFINED.
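The entry sizes and the round-robin allocation scheme can be illustrated with a short model. The 8KB page size, the group sizes of 1, 8, 64, and 512 pages, the 128 entries, and round-robin allocation come from the text. The encoding of the group size as a hint value 0..3 selecting 8^hint pages is an assumption of this sketch, and the names `mapped_bytes` and `RoundRobinTB` are hypothetical.

```python
PAGE_SIZE = 8 * 1024  # 8KB page


def mapped_bytes(hint):
    """Bytes covered by one entry for a hint value 0..3.

    Assumed encoding: a group of 8**hint pages, giving 1, 8, 64, or 512 pages.
    """
    if hint not in range(4):
        raise ValueError("hint must be 0..3")
    return PAGE_SIZE * 8 ** hint


class RoundRobinTB:
    """Fully associative translation buffer with round-robin allocation."""

    def __init__(self, entries=128):
        self.slots = [None] * entries
        self.next_slot = 0

    def allocate(self, translation):
        slot = self.next_slot
        self.slots[slot] = translation                 # overwrite the oldest slot
        self.next_slot = (slot + 1) % len(self.slots)  # wrap after the last slot
        return slot
```

The largest group, 512 pages of 8KB each, lets a single entry cover 4MB of contiguous address space.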
2.1.1.4 Instruction Fetch Logic
The instruction prefetcher (predecode) reads an octaword, containing up to four naturally aligned instructions per cycle, from the Icache. Branch prediction and line prediction bits accompany the four instructions. The branch prediction scheme operates most
efficiently when only one branch instruction is contained among the four fetched
instructions. The line prediction scheme attempts to predict the Icache line that the
branch predictor will generate, and is described in Section 2.2.
An entry from the subroutine return prediction stack, together with set prediction bits
for use by the Icache stream controller, are fetched along with the octaword. The Icache
stream controller generates fetch requests for additional Icache lines and stores the
Istream data in the Icache. There is no separate buffer to hold Istream requests.
2.1.1.5 Register Rename Maps
The instruction prefetcher forwards instructions to the integer and floating-point register rename maps. The rename maps perform the two functions listed here:
•Eliminate register write-after-read (WAR) and write-after-write (WAW) data
dependencies while preserving true read-after-write (RAW) data dependencies, in
order to allow instructions to be dynamically rescheduled.
•Provide a means of speculatively executing instructions before the control flow
previous to those instructions is resolved. Both exceptions and branch
mispredictions represent deviations from the control flow predicted by the
instruction prefetcher.
The map logic translates each instruction’s operand register specifiers from the virtual
register numbers in the instruction to the physical register numbers that hold the corresponding architecturally-correct values. The map logic also renames each instruction’s
destination register specifier from the virtual number in the instruction to a physical
register number chosen from a list of free physical registers, and updates the register
maps.
The map logic can process four instructions per cycle. It does not return the physical
register, which holds the old value of an instruction’s virtual destination register, to the
free list until the instruction has been retired, indicating that the control flow up to that
instruction has been resolved.
If a branch mispredict or exception occurs, the map logic backs up the contents of the
integer and floating-point register rename maps to the state associated with the instruction that triggered the condition, and the prefetcher restarts at the appropriate VPC. At
most, 20 valid fetch slots containing up to 80 instructions can be in flight between the
register maps and the end of the machine’s pipeline, where the control flow is finally
resolved. The map logic is capable of backing up the contents of the maps to the state
associated with any of these 80 instructions in a single cycle.
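The renaming and free-list behavior described above can be sketched with a small model. This is a simplified illustration under stated assumptions: the class `RenameMap`, its methods, and the identity initial mapping are hypothetical, and the single-cycle backup of the maps after a mispredict or exception is not modeled.

```python
class RenameMap:
    """Maps virtual register numbers to physical registers."""

    def __init__(self, virtual_regs, physical_regs):
        # Assumed starting state: virtual register i lives in physical register i.
        self.map = list(range(virtual_regs))
        self.free = list(range(virtual_regs, physical_regs))
        self.in_flight = []  # old physical register of each unretired instruction

    def rename(self, sources, dest):
        srcs = [self.map[s] for s in sources]  # read current mappings first (RAW kept)
        old = self.map[dest]
        new = self.free.pop(0)                 # fresh register removes WAR and WAW
        self.map[dest] = new
        self.in_flight.append(old)
        return srcs, new

    def retire_oldest(self):
        # Only at retirement is the previous physical register returned to the free list.
        self.free.append(self.in_flight.pop(0))
```

Two back-to-back writes to the same virtual register receive different physical registers, so the second write need not wait for readers of the first, while a later read of that register still sees the most recent mapping.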
The register rename logic places instructions into an integer or floating-point issue
queue, from which they are later issued to functional units for execution.
2.1.1.6 Integer Issue Queue
The 20-entry integer issue queue (IQ), associated with the integer execution units
(Ebox), issues the following types of instructions at a maximum rate of four per cycle:
•Integer operate
•Integer conditional branch
•Unconditional branch – both displacement and memory format
•Integer-to-floating-point (ITOFx) and floating-point-to-integer (FTOIx)
Each queue entry asserts four request signals—one for each of the Ebox subclusters. A
queue entry asserts a request when it contains an instruction that can be executed by the
subcluster, if the instruction’s operand register values are available within the subcluster.
There are two arbiters—one for the upper subclusters and one for the lower subclusters.
(Subclusters are described in Section 2.1.2.) Each arbiter picks two of the possible 20
requesters for service each cycle. A given instruction requests either the upper subclusters or
the lower subclusters, never both; because many instructions can execute in only one type
of subcluster anyway, this is not very limiting.
For example, load and store instructions can only go to lower subclusters and shift
instructions can only go to upper subclusters. Other instructions, such as addition and
logic operations, can execute in either upper or lower subclusters and are statically
assigned before being placed in the IQ.
The IQ arbiters choose between simultaneous requesters of a subcluster based on the
age of the request—older requests are given priority over newer requests. If a given
instruction requests both lower subclusters, and no older instruction requests a lower
subcluster, then the arbiter assigns subcluster L0 to the instruction. If a given instruction
requests both upper subclusters, and no older instruction requests an upper subcluster,
then the arbiter assigns subcluster U1 to the instruction. This asymmetry between the
upper and lower subcluster arbiters is a circuit implementation optimization with negligible overall performance effect.
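The age-based selection can be approximated in software. The sketch below models one arbiter serving a pair of subclusters: oldest requests win, and an instruction that could use either subcluster gets the preferred one (L0 for the lower pair, U1 for the upper pair). The function `arbitrate`, the tuple layout of a request, and the use of a larger number for an older request are all assumptions of this illustration, not a description of the circuit.

```python
def arbitrate(requests, preferred, other):
    """Pick up to two requesters for one pair of subclusters.

    `requests` is a list of (age, name, allowed) tuples, where a larger age
    means an older request and `allowed` is the set of subclusters it can use.
    """
    grants = {}
    for age, name, allowed in sorted(requests, key=lambda r: -r[0]):  # oldest first
        for sub in (preferred, other):
            if sub in allowed and sub not in grants:
                grants[sub] = name   # grant the first free subcluster it can use
                break
        if len(grants) == 2:
            break                    # both subclusters are assigned this cycle
    return grants
```

With three requesters that can all use either lower subcluster, the oldest is granted L0, the next oldest L1, and the newest waits for a later cycle.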
2.1.1.7 Floating-Point Issue Queue
The 15-entry floating-point issue queue (FQ) associated with the Fbox issues the following instruction types:
•Floating-point operates
•Floating-point conditional branches
•Floating-point stores
•Floating-point register to integer register transfers (FTOIx)
Each queue entry has three request lines—one for the add pipeline, one for the multiply
pipeline, and one for the two store pipelines. There are three arbiters—one for each of
the add, multiply, and store pipelines. The add and multiply arbiters pick one requester
per cycle, while the store pipeline arbiter picks two requesters per cycle, one for each
store pipeline.
The FQ arbiters pick between simultaneous requesters of a pipeline based on the age of
the request—older requests are given priority over newer requests. Floating-point store
instructions and F TOIx instructions in even-numbered queue entries arbitrate for one
store port. Floating-point store instructions and FTOIx instructions in odd-numbered
queue entries arbitrate for the second store port.
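The FQ arbitration rules can be sketched in a few lines of Python. This is an illustrative model only, not the hardware implementation: the names `FqEntry` and `fq_arbitrate`, the numeric `age` field, and the labels `store_even` and `store_odd` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FqEntry:
    index: int  # queue entry number (0 to 14)
    age: int    # larger value means an older request (assumed encoding)
    pipe: str   # requested pipeline: "add", "mul", or "store"


def fq_arbitrate(requests):
    """Pick at most one add, one multiply, and two store requesters."""

    def oldest(candidates):
        # Older requests win; None means nobody requested this pipeline.
        return max(candidates, key=lambda e: e.age, default=None)

    stores = [e for e in requests if e.pipe == "store"]
    return {
        "add": oldest([e for e in requests if e.pipe == "add"]),
        "mul": oldest([e for e in requests if e.pipe == "mul"]),
        # Stores and FTOIx in even-numbered entries compete for one store
        # port; those in odd-numbered entries compete for the other.
        "store_even": oldest([e for e in stores if e.index % 2 == 0]),
        "store_odd": oldest([e for e in stores if e.index % 2 == 1]),
    }
```

In this model a cycle can issue two stores only when one sits in an even-numbered entry and the other in an odd-numbered entry.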
Floating-point store instructions and FTOIx instructions are queued in both the integer
and floating-point queues. They wait in the floating-point queue until their operand register values are available. They subsequently request service from the store arbiter.
Upon being issued from the floating-point queue, they signal the corresponding entry in
the integer queue to request service. Upon being issued from the integer queue, the
operation is completed.
2.1.1.8 Exception and Interrupt Logic
There are two types of exceptions: faults and synchronous traps. Arithmetic exceptions
are precise and are reported as synchronous traps.
The four sources of interrupts are listed as follows:
•Level-sensitive hardware interrupts sourced by the IRQ_H[5:0] pins
•Edge-sensitive hardware interrupts generated by the serial line receive pin, performance counter overflows, and hardware corrected read errors
•Software interrupts sourced by the software interrupt request (SIRR) register
•Asynchronous system traps (ASTs)
Interrupt sources can be individually masked. In addition, AST interrupts are qualified
by the current processor mode.
2.1.1.9 Retire Logic
The Ibox fetches instructions in program order, executes them out of order, and then
retires them in order. The Ibox retire logic maintains the architectural state of the
machine by retiring an instruction only if all previous instructions have executed without generating exceptions or branch mispredictions. Retiring an instruction commits the
machine to any changes the instruction may have made to the software-visible state.
The three software-visible states are listed as follows:
•Integer and floating-point registers
•Memory
•Internal processor registers (including control/status registers and translation buffers)
The retire logic can sustain a maximum retire rate of eight instructions per cycle, and
can retire as many as 11 instructions in a single cycle.
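The in-order retire rule can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the function name `retire_one_cycle` and the dictionary keys `executed` and `exception` are hypothetical, and only the single-cycle peak of 11 instructions is modeled, not the sustained rate of eight.

```python
MAX_RETIRE_PER_CYCLE = 11  # single-cycle peak stated in the text


def retire_one_cycle(in_flight):
    """Retire instructions from the head of a program-ordered list.

    Each element is a dict with boolean keys "executed" and "exception"
    (hypothetical field names). Returns the instructions retired this cycle
    and removes them from `in_flight`.
    """
    retired = []
    while in_flight and len(retired) < MAX_RETIRE_PER_CYCLE:
        head = in_flight[0]
        # The head retires only if it executed cleanly. Because retirement
        # is in order, everything older has already retired.
        if not head["executed"] or head["exception"]:
            break
        retired.append(in_flight.pop(0))
    return retired
```

A younger instruction that finished early waits behind an older one that has not yet executed, which is what keeps the software-visible state consistent.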
2.1.2 Integer Execution Unit
The integer execution unit (Ebox) is a 4-path integer execution unit that is implemented
as two functional-unit “clusters” labeled 0 and 1. Each cluster contains a copy of an 80-entry physical-register file and two “subclusters”, named upper (U) and lower (L). Figure 2–6 shows the integer execution unit. In the figure, iop_wr is the cross-cluster bus
for moving integer result values between clusters.
Figure 2–6 Integer Execution Unit—Clusters 0 and 1
[Figure 2–6 labels: subclusters U0 and L0 with a register file (cluster 0); subclusters U1 and L1 with a register file (cluster 1); iop_wr buses; Load/Store Data; eff_VA.]
Most instructions have 1-cycle latency for consumers that execute within the same cluster. Also, there is another 1-cycle delay associated with producing a value in one cluster
and consuming the value in the other cluster. The instruction issue queue minimizes the
performance effect of this cross-cluster delay. The Ebox contains the following
resources:
•Four 64-bit adders that are used to calculate results for integer add instructions
(located in U0, U1, L0, and L1)
•The adders in the lower subclusters that are used to generate the effective virtual
address for load and store instructions (located in L0 and L1)
•Four logic units
•Two barrel shifters and associated byte logic (located in U0 and U1)
•Two sets of conditional branch logic (located in U0 and U1)
•Two copies of an 80-entry register file
•One pipelined multiplier (located in U1) with 7-cycle latency for all integer multiply
operations
•One fully-pipelined unit (located in U0), with 3-cycle latency, that executes the fol-
The Ebox has 80 register-file entries that contain storage for the values of the 31 Alpha
integer registers (the value of R31 is not stored), the values of 8 PALshadow registers,
and 41 results written by instructions that have not yet been retired.
Ignoring cross-cluster delay, the two copies of the Ebox register file contain identical
values. Each copy of the Ebox register file contains four read ports and six write ports.
The four read ports are used to source operands to each of the two subclusters within a
cluster. The six write ports are used as follows:
•Two write ports are used to write results generated within the cluster.
•Two write ports are used to write results generated by the other cluster.
•Two write ports are used to write results from load instructions. These two ports
are also used for FTOIx instructions.
2.1.3 Floating-Point Execution Unit
The floating-point execution unit (Fbox) has two paths. The Fbox executes both VAX
and IEEE floating-point instructions. It supports IEEE S_floating-point and T_floating-point data types and all rounding modes. It also supports VAX F_floating-point and
G_floating-point data types, and provides limited support for D_floating-point format.
The basic structure of the floating-point execution unit is shown in Figure 2–7.
Figure 2–7 Floating-Point Execution Units
[Figure 2–7 labels: floating-point execution units; Reg; FP Mul; FP Add; FP Div; SQRT.]
The Fbox contains the following resources:
•72-entry physical register file
•Fully-pipelined multiplier with 4-cycle latency
•Fully-pipelined adder with 4-cycle latency
•Nonpipelined divide unit associated with the adder pipeline
•Nonpipelined square root unit associated with the adder pipeline
The 72 Fbox register file entries contain storage for the values of the 31 Alpha floating-point registers (F31 is not stored) and 41 values written by instructions that have not
been retired.
The Fbox register file contains six read ports and four write ports. Four read ports are
used to source operands to the add and multiply pipelines, and two read ports are used
to source data for store instructions. Two write ports are used to write results generated
by the add and multiply pipelines, and two write ports are used to write results from
floating-point load instructions.
2.1.4 External Cache and System Interface Unit
The interface for the system and external cache (Cbox) controls the Bcache and system
ports. It contains the following structures:
•Victim address file (VAF)
•Victim data file (VDF)
•I/O write buffer (IOWB)
•Probe queue (PQ)
•Duplicate Dcache tag (DTAG)
2.1.4.1 Victim Address File and Victim Data File
The victim address file (VAF) and victim data file (VDF) together form an 8-entry victim buffer used for holding:
•Dcache blocks to be written to the Bcache
•Istream cache blocks from memory to be written to the Bcache
•Bcache blocks to be written to memory
•Cache blocks sent to the system in response to probe commands
2.1.4.2 I/O Write Buffer
The I/O write buffer (IOWB) consists of four 64-byte entries and associated address
and control logic used for buffering I/O write data between the store queue and the system port.
2.1.4.3 Probe Queue
The probe queue (PQ) is an 8-entry queue that holds pending system port cache probe
commands and addresses.
2.1.4.4 Duplicate Dcache Tag Array
The duplicate Dcache tag (DTAG) array holds a duplicate copy of the Dcache tags and
is used by the Cbox when processing Dcache fills, Icache fills, and system port probes.
2.1.5 Onchip Caches
The 21264/EV68A contains two onchip primary-level caches.
2.1.5.1 Instruction Cache
The instruction cache (Icache) is a 64KB virtual-addressed, 2-way set-predict cache.
Set prediction is used to approximate the performance of a 2-set cache without slowing
the cache access time. Each Icache block contains:
•16 Alpha instructions (64 bytes)
•Virtual tag bits [47:15]
•8-bit address space number (ASN) field
•1-bit address space match (ASM) bit
•1-bit PALcode bit to indicate physical addressing
•Valid bit
•Data and tag parity bits
•Four access-check bits for the following modes: kernel, executive, supervisor, and
user (KESU)
•Additional predecoded information to assist with instruction processing and fetch
control
2.1.5.2 Data Cache
The data cache (Dcache) is a 64KB, 2-way set-associative, virtually indexed, physically
tagged, write-back, read/write allocate cache with 64-byte blocks. During each cycle
the Dcache can perform one of the following transactions:
•Two quadword (or shorter) read transactions to arbitrary addresses
•Two quadword write transactions to the same aligned octaword
•Two non-overlapping less-than-quadword writes to the same aligned quadword
•One sequential read and write transaction from and to the same aligned octaword
Each Dcache block contains:
•64 data bytes and associated quadword ECC bits
•Physical tag bits
•Valid, dirty, shared, and modified bits
•Tag parity bit calculated across the tag, dirty, shared, and modified bits
•One bit to control round-robin set allocation (one bit per two cache blocks)
The Dcache contains two sets, each with 512 rows containing 64-byte blocks per row
(that is, 32K bytes of data per set). The 21264/EV68A requires two additional bits of
virtual address beyond the bits that specify an 8KB page, in order to specify a Dcache
row index. A given virtual address might be found in four unique locations in the
Dcache, depending on the virtual-to-physical translation for those two bits. The 21264/
EV68A prevents this aliasing by keeping only one of the four possible translated
addresses in the cache at any time.
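The arithmetic behind the two extra index bits can be worked through in Python. This is a sketch derived from the sizes stated above (64-byte blocks, 512 rows per set, 8KB pages); the function names `dcache_row` and `candidate_rows` are hypothetical, and the exact bit positions are an inference from those sizes rather than something the text states.

```python
BLOCK_BYTES = 64        # bytes per Dcache block
ROWS_PER_SET = 512      # rows in each of the two sets
PAGE_BYTES = 8 * 1024   # 8KB page

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 6 bits select a byte in a block
INDEX_BITS = ROWS_PER_SET.bit_length() - 1   # 9 bits select a row
PAGE_BITS = PAGE_BYTES.bit_length() - 1      # 13 bits are untranslated

# Index bits that lie above the page offset and so depend on translation.
EXTRA_BITS = OFFSET_BITS + INDEX_BITS - PAGE_BITS


def dcache_row(va):
    """Row index selected by a virtual address."""
    return (va >> OFFSET_BITS) & (ROWS_PER_SET - 1)


def candidate_rows(va):
    """Rows that could hold the block, one per value of the extra bits."""
    low_bits = PAGE_BITS - OFFSET_BITS  # index bits inside the page offset
    low = dcache_row(va) & ((1 << low_bits) - 1)
    return [low | (k << low_bits) for k in range(1 << EXTRA_BITS)]
```

With these sizes `EXTRA_BITS` is 2, which gives the four possible locations mentioned in the text.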
2.1.6 Memory Reference Unit
The memory reference unit (Mbox) controls the Dcache and ensures architecturally
correct behavior for load and store instructions. The Mbox contains the following structures:
•Load queue (LQ)
•Store queue (SQ)
•Miss address file (MAF)
•Dstream translation buffer (DTB)
2.1.6.1 Load Queue
The load queue (LQ) is a reorder buffer for load instructions. It contains 32 entries and
maintains the state associated with load instructions that have been issued to the Mbox,
but for which results have not been delivered to the processor and the instructions
retired. The Mbox assigns load instructions to LQ slots based on the order in which
they were fetched from the Icache, then places them into the LQ after they are issued by
the IQ. The LQ helps ensure correct Alpha memory reference behavior.
2.1.6.2 Store Queue
The store queue (SQ) is a reorder buffer and graduation unit for store instructions. It
contains 32 entries and maintains the state associated with store instructions that have
been issued to the Mbox, but for which data has not been written to the Dcache and the
instruction retired. The Mbox assigns store instructions to SQ slots based on the order
in which they were fetched from the Icache and places them into the SQ after they are
issued by the IQ. The SQ holds data associated with store instructions issued from the
IQ until they are retired, at which point the store can be allowed to update the Dcache.
The SQ also helps ensure correct Alpha memory reference behavior.
2.1.6.3 Miss Address File
The 8-entry miss address file (MAF) holds physical addresses associated with pending
Icache and Dcache fill requests and pending I/O space read transactions.
2.1.6.4 Dstream Translation Buffer
The Mbox includes a 128-entry, fully associative Dstream translation buffer (DTB) used
to store Dstream address translations and page protection information. Each of the entries
in the DTB can map 1, 8, 64, or 512 contiguous 8KB pages. The allocation scheme is
round-robin. The DTB supports an 8-bit ASN and contains an ASM bit.
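The four mapping sizes translate into region sizes as shown in the Python sketch below. It is illustrative only: the names `mapped_bytes` and `entry_matches` are hypothetical, natural alignment of the mapped region is an assumption, and the ASN and ASM checks are left out.

```python
PAGE_BYTES = 8 * 1024
GRANULARITY_PAGES = (1, 8, 64, 512)  # contiguous 8KB pages one entry can map


def mapped_bytes(pages):
    """Size in bytes of the region covered by one DTB entry."""
    if pages not in GRANULARITY_PAGES:
        raise ValueError("an entry maps 1, 8, 64, or 512 pages")
    return pages * PAGE_BYTES


def entry_matches(entry_va_base, pages, va):
    """True if `va` falls inside the region mapped by the entry.

    Assumes the region is naturally aligned to its own size; the text
    does not state this.
    """
    mask = ~(mapped_bytes(pages) - 1)
    return (va & mask) == (entry_va_base & mask)
```

The region sizes come out to 8KB, 64KB, 512KB, and 4MB.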
2.1.7 SROM Interface
The serial read-only memory (SROM) interface provides the initialization data load
path from a system SROM to the Icache. Refer to Chapter 7 for more information.
2.2 Pipeline Organization
The 7-stage pipeline provides an optimized environment for executing Alpha instructions. The pipeline stages (0 to 6) are shown in Figure 2–8 and described in the following paragraphs.
Figure 2–8 Pipeline Organization
[Figure 2–8 labels, stages 0 through 6: branch predictor; instruction cache (64KB, 2-set); four instructions; integer register rename map; floating-point register rename map; integer issue queue (20); floating-point issue queue (15); integer register file; floating-point register file; ALU, shifter, multiplier, and address ALU units; floating-point add, divide, and square root; floating-point multiply; 64KB data cache; bus interface unit; system bus (64 bits); cache bus (128 bits); physical address (44 bits).]
Stage 0 — Instruction Fetch
The branch predictor uses a branch history algorithm to predict a branch instruction target address.
Up to four aligned instructions are fetched from the Icache, in program order. The
branch prediction tables are also accessed in this cycle. The branch predictor uses tables
and a branch history algorithm to predict a branch instruction target address for one
branch or memory format JSR instruction per cycle. Therefore, the prefetcher is limited
to fetching through one branch per cycle. If there is more than one branch within the
fetch line, and the branch predictor predicts that the first branch will not be taken, it will
predict through subsequent branches at the rate of one per cycle, until it predicts a taken
branch or predicts through the last branch in the fetch line.
The Icache array also contains a line prediction field, the contents of which are applied
to the Icache in the next cycle. The purpose of the line predictor is to remove the pipeline bubble which would otherwise be created when the branch predictor predicts a
branch to be taken. In effect, the line predictor attempts to predict the Icache line which
the branch predictor will generate. On fills, the line predictor value at each fetch line is
initialized with the index of the next sequential fetch line, and later retrained by the
branch predictor if necessary.
Stage 1 — Instruction Slot
The Ibox maps four instructions per cycle from the 64KB 2-way set-predict Icache.
Instructions are mapped in order, executed dynamically, but are retired in order.
In the slot stage, the branch predictor compares the next Icache index that it generates to
the index that was generated by the line predictor. If there is a mismatch, the branch
predictor wins—the instructions fetched during that cycle are aborted, and the index
predicted by the branch predictor is applied to the Icache during the next cycle. Line
mispredictions result in one pipeline bubble.
The line predictor takes precedence over the branch predictor during memory format
calls or jumps. If the line predictor was trained with a true (as opposed to predicted)
memory format call or jump target, then its contents take precedence over the target
hint field associated with these instructions. This allows dynamic calls or jumps to be
correctly predicted.
The instruction fetcher produces the full VPC address during the fetch stage of the pipeline. The Icache produces the tags for both Icache sets 0 and 1 each time it is accessed.
That enables the fetcher to separate set mispredictions from true Icache misses. If the
access was caused by a set misprediction, the instruction fetcher aborts the last two
fetched slots and refetches the slot in the next cycle. It also retrains the appropriate set
prediction bits.
The instruction data is transferred from the Icache to the integer and floating-point register map hardware during this stage. When the integer instruction is fetched from the
Icache and slotted into the IQ, the slot logic determines whether the instruction is for
the upper or lower subclusters. The slot logic makes the decision based on the
resources needed by the (up to four) integer instructions in the fetch block. Although all
four instructions need not be issued simultaneously, distributing their resource usage
improves instruction loading across the units. For example, if a fetch block contains
two instructions that can be placed in either cluster followed by two instructions that
must execute in the lower cluster, the slot logic would designate that combination as
EELL and slot them as UULL. Slot combinations are described in Section 2.3.2 and
Table 2–3.
Stage 2 — Map
Instructions are sent from the Icache to the integer and floating-point register maps during the slot stage and register renaming is performed during the map stage. Also, each
instruction is assigned a unique 8-bit number, called an inum, which is used to identify
the instruction and its program order with respect to other instructions during the time
that it is in flight. Instructions are considered to be in flight between the time they are
mapped and the time they are retired.
Mapped instructions and their associated inums are placed in the integer and floating-point queues by the end of the map stage.
Stage 3 — Issue
The 20-entry integer issue queue (IQ) issues instructions at the rate of four per cycle.
The 15-entry floating-point issue queue (FQ) issues floating-point operate instructions,
conditional branch instructions, and store instructions, at the rate of two per cycle. Normally, instructions are deleted from the IQ or FQ two cycles after they are issued. For
example, if an instruction is issued in cycle n, it remains in the FQ or IQ in cycle n+1
but does not request service, and is deleted in cycle n+2.
Stage 4 — Register Read
Instructions issued from the issue queues read their operands from the integer and floating-point register files and receive bypass data.
Stage 5 — Execute
The Ebox and Fbox pipelines begin execution.
Stage 6 — Dcache Access
Memory reference instructions access the Dcache and data translation buffers. Normally load instructions access the tag and data arrays while store instructions only
access the tag arrays. Store data is written to the store queue where it is held until the
store instruction is retired. Most integer operate instructions write their register results
in this cycle.
2.2.1 Pipeline Aborts
The abort penalty as given is measured from the cycle after the fetch stage of the
instruction which triggers the abort to the fetch stage of the new target, ignoring any
Ibox pipeline stalls or queuing delay that the triggering instruction might experience.
Table 2–1 lists the timing associated with each common source of pipeline abort.
Table 2–1 Pipeline Abort Delay (GCLK Cycles)
Abort Condition                  Penalty (Cycles)  Comments
Branch misprediction             7                 Integer or floating-point conditional branch misprediction.
JSR misprediction                8                 Memory format JSR or HW_RET.
Mbox order trap                  14                Load-load order or store-load order.
Other Mbox replay traps          13                —
DTB miss                         13                —
ITB miss                         7                 —
Integer arithmetic trap          12                —
Floating-point arithmetic trap   13+latency        Add latency of instruction. See Section 2.3.3 for instruction latencies.
2.3 Instruction Issue Rules
This section defines instruction classes, the functional unit pipelines to which they are
issued, and their associated latencies.
2.3.1 Instruction Group Definitions
Table 2–2 lists the instruction class, the pipeline assignments, and the instructions
included in the class.
Table 2–2 Instruction Name, Pipeline, and Types (Continued)
Class Name  Pipeline            Instruction Type
ftoi        FST0, FST1, L0, L1  FTOIS, FTOIT
itof        L0, L1              ITOFS, ITOFF, ITOFT
mx_fpcr     FM                  Instructions that move data from the floating-point control register
2.3.2 Ebox Slotting
Instructions that are issued from the IQ, and could execute in either upper or lower
Ebox subclusters, are slotted to one pair or the other during the pipeline mapping stage
based on the instruction mixture in the fetch line. The codes that are used in Table 2–3
are as follows:
•U: The instruction only executes in an upper subcluster.
•L: The instruction only executes in a lower subcluster.
•E: The instruction could execute in either an upper or lower subcluster.
Table 2–3 defines the slotting rules. The table field Instruction Class 3, 2, 1 and 0 identifies
each instruction’s location in the fetch line by the value of bits [3:2] in its PC.
Table 2–3 Instruction Group Definitions and Pipeline Unit
After an instruction is placed in the IQ or FQ, its issue point is determined by the availability of its register operands, functional unit(s), and relationship to other instructions
in the queue. There are register producer-consumer dependencies and dynamic functional unit availability dependencies that affect instruction issue. The mapper removes
register producer-producer dependencies.
The latency to produce a register result is generally fixed. The one exception is for load
instructions that miss the Dcache. Table 2–4 lists the latency, in cycles, for each
instruction class.
Table 2–4 Instruction Class Latency in Cycles
Class     Latency  Comments
ild       3        Dcache hit.
          13+      Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop latency if Bcache latency is greater than 6 cycles.
fld       4        Dcache hit.
          14+      Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop latency if Bcache latency is greater than 6 cycles.
lda       1        Possible 1-cycle Ebox cross-cluster delay.
mem_misc  —        Does not produce register value.
ist       —        Does not produce register value.
fst       —        Does not produce register value.
rpcc      1        Possible 1-cycle cross-cluster delay.
rx        1        —
mxpr      1 or 3   HW_MFPR: Ebox IPRs = 1. Ibox and Mbox IPRs = 3. HW_MTPR does not produce a register value.
icbr      —        Conditional branch. Does not produce register value.
ubr       3        Unconditional branch. Does not produce register value.
jsr       3        —
iadd      1        Possible 1-cycle Ebox cross-cluster delay.
ilog      1        Possible 1-cycle Ebox cross-cluster delay.
ishf      1        Possible 1-cycle Ebox cross-cluster delay.
cmov1     1        Only consumer is cmov2. Possible 1-cycle Ebox cross-cluster delay.
cmov2     1        Possible 1-cycle Ebox cross-cluster delay.
imul      7        Possible 1-cycle Ebox cross-cluster delay.
imisc     3        Possible 1-cycle Ebox cross-cluster delay.
fcbr      —        Does not produce register value.
fadd      4        Consumer other than fst or ftoi.
          6        Consumer fst or ftoi. Measured from when an fadd is issued from the FQ to when an fst or ftoi is issued from the IQ.
fmul      4        Consumer other than fst or ftoi.
          6        Consumer fst or ftoi. Measured from when an fmul is issued from the FQ to when an fst or ftoi is issued from the IQ.
fcmov1    4        Only consumer is fcmov2.
fcmov2    4        Consumer other than fst.
          6        Consumer fst or ftoi. Measured from when an fcmov2 is issued from the FQ to when an fst or ftoi is issued from the IQ.
fdiv      12       Single precision - latency to consumer of result value.
          9        Single precision - latency to using divider again.
          15       Double precision - latency to consumer of result value.
          12       Double precision - latency to using divider again.
fsqrt     18       Single precision - latency to consumer of result value.
          15       Single precision - latency to using unit again.
          33       Double precision - latency to consumer of result value.
          30       Double precision - latency to using unit again.
ftoi      3        —
itof      4        —
nop       —        Does not produce register value.
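As a worked reading of Table 2–4, the Python sketch below looks up the result latency for a few instruction classes. It is a minimal illustration, not a scheduler model: the function name `result_latency` and its parameters are hypothetical, only a subset of classes is included, and the cross-cluster case is modeled as a flat one-cycle addition.

```python
# (latency to an ordinary consumer, latency to an fst/ftoi consumer),
# taken from Table 2-4.
FP_LATENCY = {"fadd": (4, 6), "fmul": (4, 6), "fcmov2": (4, 6)}

# Integer classes whose comments note a possible 1-cycle cross-cluster delay.
INT_LATENCY = {"lda": 1, "iadd": 1, "ilog": 1, "ishf": 1,
               "cmov2": 1, "imul": 7, "imisc": 3}


def result_latency(producer, consumer=None, cross_cluster=False):
    """Cycles from issue of `producer` until a dependent can issue."""
    if producer in FP_LATENCY:
        normal, to_store = FP_LATENCY[producer]
        # Stores and FTOIx see the longer latency because they issue
        # from the IQ after their FQ copy has issued.
        return to_store if consumer in ("fst", "ftoi") else normal
    latency = INT_LATENCY[producer]
    # Assumption: consuming the value in the other cluster adds one cycle.
    return latency + 1 if cross_cluster else latency
```

For example, `result_latency("fmul", "fst")` gives 6, while an integer multiply consumed in the other cluster gives 8.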
2.4 Instruction Retire Rules
An instruction is retired when it has been executed to completion, and all previous
instructions have been retired. The execution pipeline stage in which an instruction
becomes eligible to be retired depends upon the instruction’s class.
Table 2–5 gives the minimum retire latencies (assuming that all previous instructions
have been retired) for various classes of instructions.
Table 2–5 Minimum Retire Latencies for Instruction Classes
Instruction Class                  Retire Stage  Comments
Integer conditional branch         7             —
Integer multiply                   7/13          Latency is 13 cycles for the MUL/V instruction.
Integer operate                    7             —
Memory                             10            —
Floating-point add                 11            —
Floating-point multiply            11            —
Floating-point DIV/SQRT            11 + latency  Add latency of unit reuse for the instruction indicated in Table 2–4. For example, latency for a single-precision fdiv would be 11 plus 9 from Table 2–4. Latency is 11 if hardware detects that no exception is possible (see Section 2.4.1).
Floating-point conditional branch  11            Branch instruction mispredict is reported in stage 7.
BSR/JSR                            10            JSR instruction mispredict is reported in stage 8.
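The divide and square root entry combines two tables, so a short Python sketch may help. It is illustrative only: the function name `fp_divsqrt_retire_stage` and the `early_retire` flag are hypothetical, and the unit-reuse values are copied from Table 2–4.

```python
FP_BASE_RETIRE_STAGE = 11  # retire stage of the floating-point add class

# Unit-reuse latencies for divide and square root, from Table 2-4.
UNIT_REUSE = {
    ("fdiv", "single"): 9,
    ("fdiv", "double"): 12,
    ("fsqrt", "single"): 15,
    ("fsqrt", "double"): 30,
}


def fp_divsqrt_retire_stage(op, precision, early_retire=False):
    """Minimum retire latency for a floating-point divide or square root."""
    if early_retire:
        # Hardware detected that no exception is possible, so the
        # instruction retires like the floating-point add class.
        return FP_BASE_RETIRE_STAGE
    return FP_BASE_RETIRE_STAGE + UNIT_REUSE[(op, precision)]
```

This reproduces the example in the table: a single-precision fdiv gives 11 plus 9, or 20.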
2.4.1 Floating-Point Divide/Square Root Early Retire
The floating-point divider and square root unit can detect that, for many combinations
of source operand values, no exception can be generated. Instructions with these operands can be retired before the result is generated. When detected, they are retired with
the same latency as the FP add class. Early retirement is not possible for the following
instruction/operand/architecture state conditions:
•Instruction is not a DIV or SQRT.
•SQRT source operand is negative.
•Divide operand exponent_a is 0.
•Either operand is NaN or INF.
•Divide operand exponent_b is 0.
•Trapping mode is /I (inexact).
•INE status bit is 0.
Early retirement is also not possible for divide instructions if the resulting exponent has
any of the following characteristics (EXP is the result exponent):
•DIVT, DIVG: (EXP >= 3FF₁₆) OR (EXP <= 2₁₆)
•DIVS, DIVF: (EXP >= 7F₁₆) OR (EXP <= 382₁₆)
2.5 Retire of Operate Instructions into R31/F31
Many instructions that have R31 or F31 as their destination are retired immediately
upon decode (stage 3). These instructions do not produce a result and are removed from
the pipeline as well. They do not occupy a slot in the issue queues and do not occupy a
functional unit. Table 2–6 lists these instructions and some of their characteristics. The
instruction type in Table 2–6 is from Table C-6 in Appendix C of the Alpha Architecture Handbook, Version 4.
Table 2–6 Instructions Retired Without Execution
Instruction Type        Notes
INTA, INTL, INTM, INTS  All with R31 as destination.
FLTI, FLTL, FLTV        All with F31 as destination. MT_FPCR is not included because it has no destination; it is never removed from the pipeline.
LDQ_U                   All with R31 as destination.
MISC                    TRAPB and EXCB are always removed. Others are never removed.
FLTS                    All (SQRT, ITOF) with F31 as destination.
2.6 Load Instructions to R31 and F31
This section describes how the 21264/EV68A processes software-directed prefetch
transactions and load instructions with a destination of R31 and F31.
Prefetches allocate a MAF entry. How the MAF entry is allocated is what distinguishes
the type of prefetch. A normal prefetch is equivalent to a normal load MAF (that is, a
MAF entry that puts the block into the Dcache in a readable state). A prefetch with
modify intent is equivalent to a normal store MAF (that is, a MAF entry that puts the
block into the Dcache in a writeable state). A prefetch, evict next, is equivalent to a normal load MAF, with the additional behavior described in Section 2.6.3.
A prefetch is not performed if the prefetch hits in the Dcache (as if it were a normal
load).
Load operations to R31 and F31 may generate exceptions. These exceptions must be
dismissed by PALcode.
The following sections describe the operational prefetch behavior of these instructions.
The 21264/EV68A processes these instructions as normal cache line prefetches. If the
load instruction hits the Dcache, the instruction is dismissed, otherwise the addressed
cache block is allocated into the Dcache.
The HW_LDL instruction construct equates to the HW_LD instruction with the LEN
field clear. See Table 6–3.
2.6.2 Prefetch with Modify Intent: LDS Instruction
The 21264/EV68A processes an LDS instruction, with F31 as the destination, as a
prefetch with modify intent transaction (ReadBlkMod command). If the transaction hits
a dirty Dcache block, the instruction is dismissed. Otherwise, the addressed cache block
is allocated into the Dcache for write access, with its dirty and modified bits set.
2.6.3 Prefetch, Evict Next: LDQ and HW_LDQ Instructions
The 21264/EV68A processes this instruction like a normal prefetch transaction (ReadBlkSpec command), with one exception—if the load misses the Dcache, the addressed
cache block is allocated into the Dcache, but the Dcache set allocation pointer is left
pointing to this block. The next miss to the same Dcache line will evict the block. For
example, this instruction might be used when software is reading an array that is known
to fit in the offchip Bcache, but will not fit into the onchip Dcache. In this case, the
instruction ensures that the hardware provides the desired prefetch function without displacing useful cache blocks stored in the other set within the Dcache.
The HW_LDQ instruction construct equates to the HW_LD instruction with the LEN
field set. See Table 6–3.
2.7 Special Cases of Alpha Instruction Execution
This section describes the mechanisms that the 21264/EV68A uses to process irregular
instructions in the Alpha instruction set, and cases in which the 21264/EV68A processes instructions in a non-intuitive way.
2.7.1 Load Hit Speculation
The latency of integer load instructions that hit in the Dcache is three cycles. Figure 2–9
shows the pipeline timing for these integer load instructions. In Figure 2–9:
Figure 2–9 Pipeline Timing for Integer Load Instructions
[Figure 2–9 labels: Cycle Number 1 through 8; ILD with pipeline stages Q, R, E, D, B; Instruction 1 with stages Q, R; Instruction 2 with stage Q; Hit.]
There are two cycles in which the IQ may speculatively issue instructions that use load
data before Dcache hit information is known. Any instructions that are issued by the IQ
within this 2-cycle speculative window are kept in the IQ with their requests inhibited
until the load instruction’s hit condition is known, even if they are not dependent on the
load operation. If the load instruction hits, then these instructions are removed from the
queue. If the load instruction misses, then the execution of these instructions is aborted
and the instructions are allowed to request service again.
For example, in Figure 2–9, instruction 1 and instruction 2 are issued within the speculative window of the load instruction. If the load instruction hits, then both instructions
will be deleted from the queue by the start of cycle 7, one cycle later than normal for
instruction 1 and at the normal time for instruction 2. If the load instruction misses, both
instructions are aborted from the execution pipelines and may request service again in
cycle 6.
IQ-issued instructions are aborted if issued within the speculative window of an integer
load instruction that missed in the Dcache, even if they are not dependent on the load
data. However, if software misses are likely, the 21264/EV68A can still benefit from
scheduling the instruction stream for Dcache miss latency. The 21264/EV68A includes
a saturating counter that is incremented when load instructions hit and is decremented
when load instructions miss. When the upper bit of the counter equals zero, the integer
load latency is increased to five cycles and the speculative window is removed. The
counter is 4 bits wide and is incremented by 1 on a hit and is decremented by two on a
miss.
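The behavior of this counter can be sketched in Python. The class name `LoadHitCounter`, its property names, and the initial value of 15 are assumptions made for illustration; the text specifies only the width, the step sizes, the saturation, and the meaning of the upper bit.

```python
class LoadHitCounter:
    """4-bit saturating counter that gates integer load-hit speculation."""

    MAX_VALUE = 15  # largest value a 4-bit counter can hold

    def __init__(self, value=15):
        # Assumption: the counter starts saturated high. The text does not
        # give a reset value.
        self.value = value

    def update(self, hit):
        if hit:
            self.value = min(self.MAX_VALUE, self.value + 1)  # +1 on a hit
        else:
            self.value = max(0, self.value - 2)               # -2 on a miss

    @property
    def speculate(self):
        # The speculative window is used only while the upper bit is set.
        return bool(self.value & 0b1000)

    @property
    def integer_load_latency(self):
        return 3 if self.speculate else 5
```

Starting from 15, four consecutive misses bring the counter to 7, which clears the upper bit and raises the modeled load latency to five cycles. A single hit then restores speculation.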
Since load instructions to R31 do not produce a result, they do not create a speculative
window when they execute and, therefore, never waste IQ-issue cycles if they miss.
Floating-point load instructions that hit in the Dcache have a latency of four cycles. Figure 2–10 shows the pipeline timing for floating-point load instructions. In Figure 2–10:
Figure 2–10 Pipeline Timing for Floating-Point Load Instructions
[Figure not reproduced: pipeline stages by cycle number (1 through 8) for a floating-point load (FLD) that hits, followed by Instruction 1 and Instruction 2.]
The speculative window for floating-point load instructions is one cycle wide.
FQ-issued instructions that are issued within the speculative window of a floating-point
load instruction that has missed are only aborted if they depend on the load being successful.
For example, in Figure 2–10 instruction 1 is issued in the speculative window of the
load instruction.
If instruction 1 is not a user of the data returned by the load instruction, then it is
removed from the queue at its normal time (at the start of cycle 7).
If instruction 1 is dependent on the load instruction data and the load instruction hits,
instruction 1 is removed from the queue one cycle later (at the start of cycle 8). If the
load instruction misses, then instruction 1 is aborted from the Fbox pipeline and may
request service again in cycle 7.
2.7.2 Floating-Point Store Instructions
Floating-point store instructions are duplicated and loaded into both the IQ and the FQ
from the mapper. Each IQ entry contains a control bit, fpWait, that when set prevents
that entry from asserting its requests. This bit is initially set for each floating-point store
instruction that enters the IQ, unless it was the target of a replay trap. The instruction’s
FQ clone is issued when its Ra register is about to become clean, resulting in its IQ
clone’s fpWait bit being cleared and allowing the IQ clone to issue and be executed by
the Mbox. This mechanism ensures that floating-point store instructions are always
issued to the Mbox, along with the associated data, without requiring the floating-point
register dirty bits to be available within the IQ.
2.7.3 CMOV Instruction
For the 21264/EV68A, the Alpha CMOV instruction has three operands, and so presents a special case. The required operation is to move either the value in register Rb or
the value from the old physical destination register into the new destination register,
based upon the value in Ra. Since neither the mapper nor the Ebox and Fbox data paths
are otherwise required to handle three-operand instructions, the CMOV instruction is
decomposed by the Ibox pipeline into two 2-operand instructions:
The Alpha architecture instruction       CMOV Ra, Rb ⇒ Rc
Becomes the 21264/EV68A instructions     CMOV1 Ra, oldRc ⇒ newRc1
                                         CMOV2 newRc1, Rb ⇒ newRc2
The first instruction, CMOV1, tests the value of Ra and records the result of this test in
a 65th bit of its destination register, newRc1. It also copies the value of the old physical
destination register, oldRc, to newRc1.
The second instruction, CMOV2, then copies either the value in newRc1 or the value in
Rb into a second physical destination register, newRc2, based on the CMOV predicate
bit stored in newRc1.
In summary, the original CMOV instruction is decomposed into two dependent instructions that each use a physical register from the free list.
To further simplify this operation, the two component instructions of a CMOV instruction are driven through the mappers in successive cycles. Hence, if a fetch line contains
n CMOV instructions, it takes n+1 cycles to run that fetch line through the mappers.
For example, the following fetch line:
ADD CMOVx SUB CMOVy
Results in the following three map cycles:
ADD CMOVx1
CMOVx2 SUB CMOVy1
CMOVy2
The Ebox executes integer CMOV instructions as two distinct 1-cycle latency operations. The Fbox add pipeline executes floating-point CMOV instructions as two distinct
4-cycle latency operations.
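The way a fetch line containing CMOV instructions is spread across map cycles can be sketched as follows. This is an illustrative model only: the function name is hypothetical, and it assumes that any mnemonic beginning with "CMOV" is split, as in the manual's ADD CMOVx SUB CMOVy example.

```python
def map_cycles(fetch_line):
    """Split a fetch line into map cycles (hypothetical helper).

    Each CMOV contributes its first half to the current map cycle and its
    second half to the next one, so n CMOVs yield n+1 map cycles.
    """
    cycles = [[]]
    for insn in fetch_line:
        if insn.startswith("CMOV"):
            cycles[-1].append(insn + "1")   # CMOV1 closes the current map cycle
            cycles.append([insn + "2"])     # CMOV2 opens the next map cycle
        else:
            cycles[-1].append(insn)
    return cycles


# The fetch line used in the example above
example = map_cycles(["ADD", "CMOVx", "SUB", "CMOVy"])
```

For the example fetch line, the sketch produces three groups: ADD CMOVx1, then CMOVx2 SUB CMOVy1, then CMOVy2.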
2.8 Memory and I/O Address Space Instructions
This section provides an overview of the way the 21264/EV68A processes memory and
I/O address space instructions.
The 21264/EV68A supports, and internally recognizes, a 44-bit physical address space
that is divided equally between memory address space and I/O address space. Memory
address space resides in the lower half of the physical address space (PA[43]=0)
and I/O address space resides in the upper half of the physical address space
(PA[43]=1).
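As a small illustration of this split, the following sketch classifies a physical address by testing PA[43]. The function and constant names are hypothetical and are used only for this example.

```python
PA_BITS = 44  # width of the physical address space described above


def address_space(pa):
    """Return "memory" or "io" for a 44-bit physical address (hypothetical helper)."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 44 bits")
    # PA[43]=0 selects memory address space, PA[43]=1 selects I/O address space.
    return "io" if (pa >> 43) & 1 else "memory"
```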
The IQ can issue any combination of load and store instructions to the Mbox at the rate
of two per cycle. The two lower Ebox subclusters, L0 and L1, generate the
48-bit effective virtual address for these instructions.
An instruction is defined to be newer than another instruction if it follows that instruction in program order and is older if it precedes that instruction in program order.
2.8.1 Memory Address Space Load Instructions
The Mbox begins execution of a load instruction by translating its virtual address to a
physical address using the DTB and by accessing the Dcache. The Dcache is virtually
indexed, allowing these two operations to be done in parallel. The Mbox puts information about the load instruction, including its physical address, destination register, and
data format, into the LQ.
If the requested physical location is found in the Dcache (a hit), the data is formatted
and written into the appropriate integer or floating-point register. If the location is not in
the Dcache (a miss), the physical address is placed in the miss address file (MAF) for
processing by the Cbox. The MAF performs a merging function in which a new miss
address is compared to miss addresses already held in the MAF. If the new miss address
points to the same Dcache block as a miss address in the MAF, then the new miss
address is discarded.
When Dcache fill data is returned to the Dcache by the Cbox, the Mbox satisfies the
requesting load instructions in the LQ.
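The MAF merging function for load misses can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, the 64-byte block size is taken from Section 2.9, and the number of MAF entries and all other MAF state are ignored.

```python
BLOCK_BYTES = 64  # assumption: 64-byte blocks, as stated for memory transactions in Section 2.9


def block_address(pa):
    """Address of the naturally aligned block that contains pa."""
    return pa & ~(BLOCK_BYTES - 1)


class MissAddressFileSketch:
    """Toy model of MAF load-miss merging (hypothetical names, capacity ignored)."""

    def __init__(self):
        self.pending_blocks = []

    def record_miss(self, pa):
        """Return True if a new entry was allocated, False if the miss merged."""
        blk = block_address(pa)
        if blk in self.pending_blocks:
            return False  # same Dcache block already pending: new miss address discarded
        self.pending_blocks.append(blk)
        return True

    def fill(self, pa):
        """Model fill data returning for the block that contains pa."""
        self.pending_blocks.remove(block_address(pa))
```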
2.8.2 I/O Address Space Load Instructions
Because I/O space load instructions may have side effects, they cannot be performed
speculatively. When the Mbox receives an I/O space load instruction, the Mbox places
the load instruction in the LQ, where it is held until it retires. The Mbox replays retired
I/O space load instructions from the LQ to the MAF in program order, at a rate of one
per GCLK cycle.
The Mbox allocates a new MAF entry to an I/O load instruction and increases I/O bandwidth by attempting to merge I/O load instructions in a merge register. Table 2–7 shows
the rules for merging data. The columns represent the load instructions replayed to the
MAF while the rows represent the size of the load in the merge register.
Table 2–7 Rules for I/O Address Space Load Instruction Data Merging
Merge Register/          Load             Load                    Load
Replayed Instruction     Byte/Word        Longword                Quadword
Byte/Word                No merge         No merge                No merge
Longword                 No merge         Merge up to 32 bytes    No merge
Quadword                 No merge         No merge                Merge up to 64 bytes
In summary, Table 2–7 shows some of the following rules:
• Byte/word load instructions and different size load instructions are not allowed to
merge.
• A stream of ascending non-overlapping, but not necessarily consecutive, longword
load instructions are allowed to merge into naturally aligned 32-byte blocks.
• A stream of ascending non-overlapping, but not necessarily consecutive, quadword
load instructions are allowed to merge into naturally aligned 64-byte blocks.
• Merging of quadwords can be limited to naturally aligned 32-byte blocks based on
the Cbox WRITE_ONCE chain 32_BYTE_IO field.
• Issued MB, WMB, and I/O load instructions close the I/O register merge window.
To minimize latency, the merge window is also closed when a timer detects no I/O
store instruction activity for 1024 cycles.
After the Mbox I/O register has closed its merge window, the Cbox sends I/O read
requests offchip in the order that they were received from the Mbox.
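The size-matching rules of Table 2–7 can be expressed as a small lookup. The sketch below is illustrative only: the function name, the size labels, and the boolean parameter standing in for the Cbox WRITE_ONCE chain 32_BYTE_IO field are all hypothetical.

```python
def io_merge_limit(in_register, replayed, limit_quad_to_32_bytes=False):
    """Return the merge block size in bytes, or 0 when merging is not allowed.

    in_register and replayed are one of "byte/word", "longword", "quadword".
    limit_quad_to_32_bytes stands in for the 32_BYTE_IO field (assumption).
    """
    if in_register != replayed:
        return 0   # different size instructions never merge
    if in_register == "longword":
        return 32  # merge up to 32 bytes
    if in_register == "quadword":
        return 32 if limit_quad_to_32_bytes else 64
    return 0       # byte/word instructions never merge
```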
2.8.3 Memory Address Space Store Instructions
The Mbox begins execution of a store instruction by translating its virtual address to a
physical address using the DTB and by probing the Dcache. The Mbox puts information about the store instruction, including its physical address, its data, and the results of
the Dcache probe, into the store queue (SQ).
If the Mbox does not find the addressed location in the Dcache, it places the address
into the MAF for processing by the Cbox. If the Mbox finds the addressed location in a
Dcache block that is not dirty, then it places a ChangeToDirty request into the MAF.
A store instruction can write its data into the Dcache when it is retired, and when the
Dcache block containing its address is dirty and not shared. SQ entries that meet these
two conditions can be placed into the writable state. These SQ entries are placed into
the writable state in program order at a maximum rate of two entries per cycle. The
Mbox transfers writable store queue entry data from the SQ to the Dcache in program
order at a maximum rate of two entries per cycle. Dcache lines associated with writable
store queue entries are locked by the Mbox. System port probe commands cannot evict
these blocks until their associated writable SQ entries have been transferred into the
Dcache. This restriction assists in STx_C instruction and Dcache ECC processing.
SQ entry data that has not been transferred to the Dcache may source data to newer load
instructions. The Mbox compares the virtual Dcache index bits of incoming load
instructions to queued SQ entries, and sources the data from the SQ, bypassing the
Dcache, when necessary.
2.8.4 I/O Address Space Store Instructions
The Mbox begins processing I/O space store instructions, like memory space store
instructions, by translating the virtual address and placing the state associated with the
store instruction into the SQ.
The Mbox replays retired I/O space store entries from the SQ to the IOWB in program
order at a rate of one per GCLK cycle. The Mbox never allows queued I/O space store
instructions to source data to subsequent load instructions.
The Cbox maximizes I/O bandwidth when it allocates a new IOWB entry to an I/O
store instruction by attempting to merge I/O store instructions in a merge register. Table
2–8 shows the rules for I/O space store instruction data merging. The columns represent
the store instructions replayed to the IOWB while the rows represent the size of the store
in the merge register.
Table 2–8 Rules for I/O Address Space Store Instruction Data Merging
Merge Register/          Store            Store                   Store
Replayed Instruction     Byte/Word        Longword                Quadword
Byte/Word                No merge         No merge                No merge
Longword                 No merge         Merge up to 32 bytes    No merge
Quadword                 No merge         No merge                Merge up to 64 bytes
Table 2–8 shows some of the following rules:
• Byte/word store instructions and different size store instructions are not allowed to
merge.
• A stream of ascending non-overlapping, but not necessarily consecutive, longword
store instructions are allowed to merge into naturally aligned 32-byte blocks.
• A stream of ascending non-overlapping, but not necessarily consecutive, quadword
store instructions are allowed to merge into naturally aligned 64-byte blocks.
• Merging of quadwords can be limited to naturally aligned 32-byte blocks based on
the Cbox WRITE_ONCE chain 32_BYTE_IO field.
• Issued MB, WMB, and I/O load instructions close the I/O register merge window.
To minimize latency, the merge window is also closed when a timer detects no I/O
store instruction activity for 1024 cycles.
After the IOWB merge register has closed its merge window, the Cbox sends I/O space
store requests offchip in the order that they were received from the Mbox.
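The stream conditions listed above (same size, ascending, non-overlapping, within one naturally aligned block) can be sketched as a merge-register check. This is a rough model with hypothetical names; it assumes naturally aligned store addresses and does not model the conditions that close the merge window, such as MB or WMB instructions or the 1024-cycle timer.

```python
STORE_BYTES = {"longword": 4, "quadword": 8}


class MergeRegisterSketch:
    """Toy model of one open I/O store merge register (hypothetical names)."""

    def __init__(self, kind, addr, block_bytes):
        self.kind = kind                    # "longword" or "quadword"
        self.size = STORE_BYTES[kind]
        self.block_bytes = block_bytes      # 32 or 64, per Table 2–8
        self.block = addr // block_bytes    # naturally aligned block being merged
        self.last_addr = addr

    def try_merge(self, kind, addr):
        """Return True if the replayed store can join the open merge register."""
        if kind != self.kind:
            return False                    # different size instructions do not merge
        if addr // self.block_bytes != self.block:
            return False                    # falls outside the aligned block
        if addr < self.last_addr + self.size:
            return False                    # not ascending, or overlapping
        self.last_addr = addr
        return True
```

In this sketch a longword stream at 0x100, 0x108, 0x110 merges into one 32-byte block even though the addresses are not consecutive, while a store back to 0x104 or to 0x120 would not merge.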
2.9 MAF Memory Address Space Merging Rules
Because all memory transactions are to 64-byte blocks, efficiency is improved by merging several small data transactions into a single larger data transaction. Table 2–9 lists
the rules the 21264/EV68A uses when merging memory transactions into 64-byte naturally aligned data block transactions. Rows represent the merged instruction in the
MAF and columns represent the new issued transaction.
In summary, Table 2–9 shows that only like instruction types, with the exception of
load instructions merging with store instructions, are merged.
2.10 Instruction Ordering
In the absence of explicit instruction ordering, such as with MB or WMB instructions,
the 21264/EV68A maintains a default instruction ordering relationship between pairs of
load and store instructions.
The 21264/EV68A maintains the default memory data instruction ordering as shown in
Table 2–10 (assume address X and address Y are different).
Table 2–10 Memory Reference Ordering
First Instruction in Pair       Second Instruction in Pair      Reference Order
Load memory to address X        Load memory to address X        Maintained (litmus test 1)
Load memory to address X        Load memory to address Y        Not maintained
Store memory to address X       Store memory to address X       Maintained
Store memory to address X       Store memory to address Y       Maintained
Load memory to address X        Store memory to address X       Maintained
Load memory to address X        Store memory to address Y       Not maintained
Store memory to address X       Load memory to address X        Maintained
Store memory to address X       Load memory to address Y        Not maintained
The 21264/EV68A maintains the default I/O instruction ordering as shown in Table 2–11
(assume address X and address Y are different).
Table 2–11 I/O Reference Ordering
First Instruction in Pair       Second Instruction in Pair      Reference Order
Load I/O to address X           Load I/O to address X           Maintained
Load I/O to address X           Load I/O to address Y           Maintained
Store I/O to address X          Store I/O to address X          Maintained
Store I/O to address X          Store I/O to address Y          Maintained
Load I/O to address X           Store I/O to address X          Maintained
Load I/O to address X           Store I/O to address Y          Not maintained
Store I/O to address X          Load I/O to address X           Maintained
Store I/O to address X          Load I/O to address Y           Not maintained
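The default ordering rules in Tables 2–10 and 2–11 can be captured in a compact lookup. The sketch below is only a restatement of the two tables; the function and constant names are hypothetical.

```python
# Pairs (first, second) to DIFFERENT addresses whose order is not maintained.
# Every pair to the SAME address is maintained in both tables.
NOT_MAINTAINED = {
    "memory": {("load", "load"), ("load", "store"), ("store", "load")},
    "io": {("load", "store"), ("store", "load")},
}


def order_maintained(space, first, second, same_address):
    """Return True if the default reference order of the pair is maintained.

    space is "memory" (Table 2-10) or "io" (Table 2-11);
    first and second are "load" or "store".
    """
    if same_address:
        return True
    return (first, second) not in NOT_MAINTAINED[space]
```

Software that relies on an ordering reported as not maintained would need explicit ordering, such as the MB or WMB instructions mentioned at the start of this section.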
2.11 Replay Traps
There are some situations in which a load or store instruction cannot be executed due to
a condition that occurs after that instruction issues from the IQ or FQ. The instruction is
aborted (along with all newer instructions) and restarted from the fetch stage of the
pipeline. This mechanism is called a replay trap.
2.11.1 Mbox Order Traps
Load and store instructions may be issued from the IQ in a different order than they
were fetched from the Icache, while the architecture dictates that Dstream memory
transactions to the same physical bytes must be completed in order. Usually, the Mbox
manages the memory reference stream by itself to achieve architecturally correct
behavior, but the two cases in which the Mbox uses replay traps to manage the memory
stream are load-load and store-load order traps.
2.11.1.1 Load-Load Order Trap
The Mbox ensures that load instructions that read the same physical byte(s) ultimately
issue in correct order by using the load-load order trap. The Mbox compares the
address of each load instruction, as it is issued, to the address of all load instructions in
the load queue. If the Mbox finds a newer load instruction in the load queue, it invokes
a load-load order trap on the newer instruction. This is a replay trap that aborts the target
of the trap and all newer instructions from the machine and refetches instructions
starting at the target of the trap.
2.11.1.2 Store-Load Order Trap
The Mbox ensures that a load instruction ultimately issues after an older store instruction that writes some portion of its memory operand by using the store-load order trap.
The Mbox compares the address of each store instruction, as it is issued, to the address
of all load instructions in the load queue. If the Mbox finds a newer load instruction in
the load queue, it invokes a store-load order trap on the load instruction. This is a replay
trap. It functions like the load-load order trap.
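The address check shared by the load-load and store-load order traps can be sketched as follows. This is a simplified model with hypothetical names: program order is represented by a sequence number, addresses are compared for exact equality rather than overlapping physical bytes, and the choice of the oldest matching newer load as the trap target is an assumption made for the illustration.

```python
def find_order_trap(load_queue, issuing_seq, issuing_addr):
    """Return the sequence number of the load to trap on, or None.

    load_queue: list of (seq, addr) pairs for loads that have already issued.
    issuing_seq, issuing_addr: the load or store that is issuing now.
    A larger seq means newer in program order (assumption for this sketch).
    """
    newer_matches = [
        seq for seq, addr in load_queue
        if seq > issuing_seq and addr == issuing_addr
    ]
    # Assumption: trap on the oldest of the newer matching loads, so that it
    # and everything newer is aborted and refetched.
    return min(newer_matches) if newer_matches else None
```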
The Ibox contains extra hardware to reduce the frequency of the store-load trap. There
is a 1-bit by 1024-entry VPC-indexed table in the Ibox called the stWait table. When an
Icache instruction is fetched, the associated stWait table entry is fetched along with the
Icache instruction. The stWait table produces 1 bit for each instruction accessed from
the Icache. When a load instruction gets a store-load order replay trap, its associated bit
in the stWait table is set during the cycle that the load is refetched. Hence, the trapping
load instruction’s stWait bit will be set the next time it is fetched.
The IQ will not issue load instructions whose stWait bit is set while there are older unissued store instructions in the queue. A load instruction whose stWait bit is set can be
issued the cycle immediately after the last older store instruction is issued from the
queue. All the bits in the stWait table are unconditionally cleared every 16384 cycles, or
every 65536 cycles if I_CTL[ST_WAIT_64K] is set.
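The stWait mechanism can be modeled roughly as follows. The sketch uses hypothetical names, and the index function is an assumption: the text says only that the table is VPC-indexed, so the low-order instruction-word address bits are used here purely for illustration.

```python
class StWaitTableSketch:
    """Toy model of the 1-bit by 1024-entry stWait table (hypothetical names)."""

    ENTRIES = 1024

    def __init__(self, st_wait_64k=False):
        self.bits = [False] * self.ENTRIES
        # Cleared every 16384 cycles, or every 65536 if I_CTL[ST_WAIT_64K] is set.
        self.clear_interval = 65536 if st_wait_64k else 16384
        self.cycles_since_clear = 0

    def _index(self, vpc):
        # Assumption: index by instruction-word address bits; the real hash is not given.
        return (vpc >> 2) % self.ENTRIES

    def on_store_load_trap(self, load_vpc):
        # The trapping load's bit is set so that it waits the next time it is fetched.
        self.bits[self._index(load_vpc)] = True

    def must_wait(self, load_vpc, older_unissued_stores):
        # A marked load is held while older unissued stores remain in the queue.
        return self.bits[self._index(load_vpc)] and older_unissued_stores > 0

    def tick(self, cycles=1):
        self.cycles_since_clear += cycles
        if self.cycles_since_clear >= self.clear_interval:
            self.bits = [False] * self.ENTRIES
            self.cycles_since_clear %= self.clear_interval
```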
2.11.2 Other Mbox Replay Traps
The Mbox also uses replay traps to control the flow of the load queue and store queue,
and to ensure that there are never multiple outstanding misses to different physical
addresses that map to the same Dcache or Bcache line. Unlike the order traps, however,
these replay traps are invoked on the incoming instruction that triggered the condition.
2.12 I/O Write Buffer and the WMB Instruction
The I/O write buffer (IOWB) consists of four 64-byte entries with the associated
address and control logic used to buffer I/O write data between the store queue (SQ)
and the system port.
2.12.1 Memory Barrier (MB/WMB/TB Fill Flow)
The Cbox CSR SYSBUS_MB_ENABLE bit determines if MB instructions produce
external system port transactions. When the SYSBUS_MB_ENABLE bit equals 0, the
Cbox CSR MB_CNT[3:0] field contains the number of pending uncommitted transactions. The counter will increment for each of the following commands:
The counter is decremented with the C (commit) bit in the Probe and SysDc commands
(see Section 4.7.7). Systems can assert the C bit in the SysDc fill response to the commands that originally incremented the counter, or attached to the last probe seen by that
command when it reached the system serialization point. If the number of uncommitted
transactions reaches 15 (saturating the counter), the Cbox will stall MAF and IOWB
processing until at least one of the pending transactions has been committed. Probe processing is not interrupted by the state of this counter.
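The behavior of the MB_CNT[3:0] counter can be sketched as follows. This is a minimal model with hypothetical names; it tracks only the count and the stall condition, not which commands increment the counter or how the C bit is delivered.

```python
class MbCountSketch:
    """Toy model of the 4-bit pending-uncommitted-transaction counter."""

    SATURATED = 15  # 4-bit counter; MAF and IOWB processing stalls at this value

    def __init__(self):
        self.pending = 0

    def can_send(self):
        # The Cbox stalls MAF and IOWB processing while the counter is saturated.
        return self.pending < self.SATURATED

    def send_command(self):
        if not self.can_send():
            raise RuntimeError("MAF/IOWB processing is stalled")
        self.pending += 1

    def commit(self):
        # Models a C (commit) bit arriving with a Probe or SysDc command.
        if self.pending > 0:
            self.pending -= 1
```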
2.12.1.1 MB Instruction Processing
When an MB instruction is fetched in the predicted instruction execution path, it stalls
in the map stage of the pipeline. This also stalls all instructions after the MB, and control of instruction flow is based upon the value in Cbox CSR SYSBUS_MB_ENABLE
as follows:
•If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the IQ is
empty and then performs the following actions:
a. Sends all pending MAF and IOWB entries to the system port.
b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed
events. When the counter decrements from one to zero, the Cbox marks the
youngest probe queue entry.
c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and
IOWB are empty.
When all of the above have occurred and a probe response has been sent to the system for the marked probe queue entry, instruction execution continues with the
instruction after the MB.
•If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox waits until the IQ is empty
and then performs the following actions:
a. Sends all pending MAF and IOWB entries to the system port
b. Sends the MB command to the system port
c. Waits until the MB command is acknowledged, then marks the youngest entry
in the probe queue
d. Waits until the MAF contains no more Dstream references and the SQ, LQ, and
IOWB are empty
When all of the above have occurred and a probe response has been sent to the system for the marked probe queue entry, instruction execution continues with the
instruction after the MB.
Because the MB instruction is executed speculatively, MB processing can begin
and the original MB can be killed. In the internal acknowledge case, the MB may
have already been sent to the system interface, and the system is still expected to
respond to the MB.
2.12.1.2 WMB Instruction Processing
Write memory barrier (WMB) instructions are issued into the Mbox store queue, where
they wait until they are retired and all prior store instructions become writable. The
Mbox then stalls the writable pointer and informs the Cbox. The Cbox closes the IOWB
merge register and responds in one of the following two ways:
•If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox performs the following
actions:
a. Stalls further MAF and IOWB processing.
b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed
events. When the counter decrements from one to zero, the Cbox marks the
youngest probe queue entry.
c. When a probe response has been sent to the system for the marked probe queue
entry, the Cbox considers the WMB to be satisfied.
•If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox performs the following
actions:
a. Stalls further MAF and IOWB processing.
b. Sends the MB command to the system port.
c. Waits until the MB command is acknowledged by the system with a SysDc
MBDone command, then sends acknowledge and marks the youngest entry in
the probe queue.
d. When a probe response has been sent to the system for the marked probe queue
entry, the Cbox considers the WMB to be satisfied.
2.12.1.3 TB Fill Flow
Load instructions (HW_LDs) to a virtual page table entry (VPTE) are processed by the
21264/EV68A to avoid litmus test problems associated with the ordering of memory
transactions from another processor against loading of a page table entry and the subsequent virtual-mode load from this processor.
Consider the sequence shown in Table 2–12. The data could be in the Bcache. Pj should
fetch datai if it is using PTEi.
Also consider the related sequence shown in Table 2–13. In this case, the data could be
cached in the Bcache; Pj should fetch datai if it is using PTEi.
<write TB>
Istream read (restart) - will miss the Icache
The 21264/EV68A processes Dstream loads to the PTE by injecting, in hardware, some
memory barrier processing between the PTE transaction and any subsequent load or
store instruction. This is accomplished by the following mechanism:
1. The integer queue issues a HW_LD instruction with VPTE.
2. The integer queue issues a HW_MTPR instruction with a DTB_PTE0, that is data-dependent on the HW_LD instruction with a VPTE, and is required in order to fill
the DTBs. The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4]
and [0].
3. When a HW_MTPR instruction with a DTB_PTE0 is issued, the Ibox signals the
Cbox indicating that a HW_LD instruction with a VPTE has been processed. This
causes the Cbox to begin processing the MB instruction. The Ibox prevents any
subsequent memory operations being issued by not clearing the IPR scoreboard bit
[0]. IPR scoreboard bit [0] is one of the scoreboard bits associated with the
HW_MTPR instruction with DTB_PTE0.
4. When the Cbox completes processing the MB instruction (using one of the above
sequences, depending upon the state of SYSBUS_MB_ENABLE), the Cbox signals the Ibox to clear IPR scoreboard bit [0].
The 21264/EV68A uses a similar mechanism to process Istream TB misses and fills to
the PTE for the Istream.
1. The integer queue issues a HW_LD instruction with VPTE.
2. The IQ issues a HW_MTPR instruction with an ITB_PTE that is data-dependent
upon the HW_LD instruction with VPTE. This is required in order to fill the ITB.
The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4] and [0].
3. The Cbox issues a HW_MTPR instruction for the ITB_PTE and signals the Ibox
that a HW_LD/VPTE instruction has been processed, causing the Cbox to start processing the MB instruction. The Mbox stalls Ibox fetching from when the HW_LD/
VPTE instruction finishes until the probe queue is drained.
4. When the 21264/EV68A is finished (SYS_MB selects one of the above sequences),
the Cbox directs the Ibox to clear IPR scoreboard bit [0]. Also, the Mbox directs the
Ibox to start prefetching.
Inserting MB instruction processing within the TB fill flow is only required for multiprocessor systems. Uniprocessor systems can disable MB instruction processing by
deasserting Ibox CSR I_CTL[TB_MB_EN].
The 21264/EV68A provides hardware support for two methods of obtaining program
performance feedback information. The two methods do not require program modification. The first method offers similar capabilities to earlier microprocessor performance
counters. The second method supports the new ProfileMe way of statistically sampling
individual instructions during program execution to develop a model of program execution. Both methods use the same hardware registers.
See Section 6.10 for information about counter control.
2.14 Floating-Point Control Register
The floating-point control register (FPCR) is shown in Figure 2–11.
Figure 2–11 Floating-Point Control Register
[Figure not reproduced: register layout for bits 63 through 0, showing the fields SUM, INED, UNFD, UNDZ, DYN, IOV, INE, UNF, OVF, DZE, INV, OVFD, DZED, INVD, and DNZ. Bit positions are given in Table 2–14.]
The floating-point control register fields are described in Table 2–14.
Table 2–14 Floating-Point Control Register Fields
Name    Extent     Type   Description
SUM     [63]       RW     Summary bit. Records bit-wise OR of FPCR exception bits. The summary bit is
                          not directly modified by writes to bit 63 of the FPCR, but is indirectly modified
                          by changes to FPCR bits 57–52.
INED    [62]       RW     Inexact Disable. If this bit is set and a floating-point instruction that enables
                          trapping on inexact results generates an inexact value, the result is placed in the
                          destination register and the trap is suppressed.
UNFD    [61]       RW     Underflow Disable. The 21264/EV68A hardware cannot generate IEEE compliant
                          denormal results. UNFD is used in conjunction with UNDZ as follows:
                          UNFD   UNDZ   Result
                          0      X      Underflow trap.
                          1      0      Trap to supply a possible denormal result.
                          1      1      Underflow trap suppressed. Destination is written
                                        with a true zero (+0.0).
UNDZ    [60]       RW     Underflow to zero. When UNDZ is set together with UNFD, underflow traps
                          are disabled and the 21264/EV68A places a true zero in the destination register.
                          See UNFD, above.
DYN     [59:58]    RW     Dynamic rounding mode. Indicates the rounding mode to be used by an IEEE
                          floating-point instruction when the instruction specifies dynamic rounding
                          mode:
IOV     [57]       RW     Integer overflow. A CVTGQ, CVTTQ, or CVTQL overflowed the destination
                          precision.
INE     [56]       RW     Inexact result. A floating-point arithmetic or conversion operation gave a result
                          that differed from the mathematically exact result.
UNF     [55]       RW     Underflow. A floating-point arithmetic or conversion operation gave a result
                          that underflowed the destination exponent.
OVF     [54]       RW     Overflow. A floating-point arithmetic or conversion operation gave a result that
                          overflowed the destination exponent.
DZE     [53]       RW     Divide by zero. An attempt was made to perform a floating-point divide with a
                          divisor of zero.
INV     [52]       RW     Invalid operation. An attempt was made to perform a floating-point arithmetic
                          operation and one or more of its operand values were illegal.
OVFD    [51]       RW     Overflow disable. If this bit is set and a floating-point arithmetic operation
                          generates an overflow condition, then the appropriate IEEE nontrapping result is
                          placed in the destination register and the trap is suppressed.
DZED    [50]       RW     Division by zero disable. If this bit is set and a floating-point divide by zero is
                          detected, the appropriate IEEE nontrapping result is placed in the destination
                          register and the trap is suppressed.
INVD    [49]       RW     Invalid operation disable. If this bit is set and a floating-point operate generates
                          an invalid operation condition and 21264/EV68A is capable of producing the
                          correct IEEE nontrapping result, that result is placed in the destination register
                          and the trap is suppressed.
DNZ     [48]       RW     Denormal operands to zero. If this bit is set, treat all Denormal operands as a
                          signed zero value with the same sign as the Denormal operand.
Reserved¹  [47:0]  -      -
¹ Alpha architecture FPCR bit 47 (DNOD) is not implemented by the 21264/EV68A.
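The bit positions in Table 2–14 can be exercised with a small decoder. This is an illustrative sketch with hypothetical function names; it extracts the fields from a 64-bit FPCR value and recomputes the summary as the OR of bits 57–52, and it leaves the 2-bit DYN field undecoded.

```python
# Single-bit FPCR fields and their bit positions, from Table 2-14
FPCR_BITS = {
    "SUM": 63, "INED": 62, "UNFD": 61, "UNDZ": 60,
    "IOV": 57, "INE": 56, "UNF": 55, "OVF": 54, "DZE": 53, "INV": 52,
    "OVFD": 51, "DZED": 50, "INVD": 49, "DNZ": 48,
}


def decode_fpcr(value):
    """Return a dict of FPCR field values for a 64-bit register image."""
    fields = {name: (value >> bit) & 1 for name, bit in FPCR_BITS.items()}
    fields["DYN"] = (value >> 58) & 0b11  # bits [59:58], returned as a raw 2-bit value
    return fields


def expected_summary(value):
    """SUM records the bit-wise OR of the exception bits, FPCR bits 57-52."""
    return 1 if ((value >> 52) & 0x3F) != 0 else 0
```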
2.15 AMASK and IMPLVER Instruction Values
The AMASK and IMPLVER instructions return the supported architecture extensions
and processor type, respectively.
2.15.1 AMASK
The 21264/EV68A returns the AMASK instruction values provided in Table 2–15. The
I_CTL register reports the 21264/EV68A pass level (see I_CTL[CHIP_ID], Section
5.2.15).
Table 2–15 21264/EV68A AMASK Values
21264/EV68A Pass Level               AMASK Feature Mask Value
See I_CTL[CHIP_ID], Table 5–11       1307₁₆
The AMASK bit definitions provided in Table 2–15 are defined in Table 2–16.
Table 2–16 AMASK Bit Assignments
Bit   Meaning
0     Support for the byte/word extension (BWX)
      The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB,
      SEXTW, STB, and STW.
1     Support for the square-root and floating-point convert extension (FIX)
      The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS,
      ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
2     Support for the count extension (CIX)
      The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.
8     Support for the multimedia extension (MVI)
      The instructions that comprise the MVI extension are MAXSB8, MAXSW4,
      MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB,
      PKWB, UNPKBL, and UNPKBW.
9     Support for precise arithmetic trap reporting in hardware. The trap PC is the same as
      the instruction PC after the trapping instruction is executed.
12    Support for using a prefetch with modify intent to improve the performance of the
      first attempt to acquire a lock. When clear, indicates possible prefetch error with
      locks, described in waiver 10 to the Alpha Architecture and in the prefetch section of
      the appropriate processor (21264/EV6 and 21264/EV67) documents.
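As a cross-check of Tables 2–15 and 2–16, the feature mask value 1307 (hexadecimal) can be decoded bit by bit. The sketch below uses hypothetical names and short labels for each bit; it decodes the mask value only and does not model the operand handling of the AMASK instruction itself.

```python
# Bit assignments from Table 2-16 (labels abbreviated for this sketch)
AMASK_FEATURES = {
    0: "BWX",
    1: "FIX",
    2: "CIX",
    8: "MVI",
    9: "precise arithmetic traps",
    12: "prefetch with modify intent",
}

EV68A_FEATURE_MASK = 0x1307  # value from Table 2-15


def amask_features(mask):
    """Return the feature labels whose bits are set in the feature mask."""
    return [name for bit, name in sorted(AMASK_FEATURES.items()) if (mask >> bit) & 1]


features = amask_features(EV68A_FEATURE_MASK)
```

Decoding 1307 (hexadecimal) sets exactly bits 0, 1, 2, 8, 9, and 12, so all six features listed in Table 2–16 are reported.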
2.15.2 IMPLVER
For the 21264/EV68A, the IMPLVER instruction returns the value 2.
2.16 Design Examples
The 21264/EV68A can be designed into many different uniprocessor and multiprocessor system configurations. Figures 2–12 and 2–13 illustrate two possible configurations. These configurations employ additional system/memory controller chipsets.
Figure 2–12 shows a typical uniprocessor system with a second-level cache. This system configuration could be used in standalone or networked workstations.
Figure 2–12 Typical Uniprocessor Configuration
[Figure not reproduced: block diagram with the following labeled blocks: 21264; L2 cache (tag store and data store); 21272 core logic chipset (control chips, data slice chips, and host PCI bridge chip); duplicate tag store (optional); DRAM arrays; and a 64-bit PCI bus. Labeled connections include tag, address, address out, address in, and data.]
Figure 2–13 shows a typical multiprocessor system, each processor with a second-level
cache. Each interface controller must employ a duplicate tag store to maintain cache
coherency. This system configuration could be used in a networked database server
application.
Figure 2–13 Typical Multiprocessor Configuration
[Figure not reproduced: block diagram with the following labeled blocks: two 21264 processors, each with an L2 cache; 21272 core logic chipset (control chip, data slice chips, and two host PCI bridge chips); two 64-bit PCI buses; and two sets of DRAM arrays. Labeled connections include address and data.]
3
Hardware Interface
This chapter contains the 21264/EV68A microprocessor logic symbol and provides
information about signal names, their function, and their location. This chapter also
describes the mechanical specifications of the 21264/EV68A. It is organized as follows:
• The 21264/EV68A logic symbol
• The 21264/EV68A signal names and functions
• Lists of the signal pins, sorted by name and PGA location
• The specifications for the 21264/EV68A mechanical package
• The top and bottom views of the 21264/EV68A pinouts
3.1 21264/EV68A Microprocessor Logic Symbol
Figure 3–1 shows the logic symbol for the 21264/EV68A chip.
Figure 3–1 21264/EV68A Microprocessor Logic Symbol
[Figure: logic symbol showing the 21264/EV68A signal groups, including the system interface.]
3.2 21264/EV68A Signal Names and Functions
Table 3–1 defines the 21264/EV68A signal types referred to in this section.
Table 3–1 Signal Pin Types Definitions
Signal Type   Definition
Inputs
I_DC_REF      Input DC reference pin
I_DA          Input differential amplifier receiver
I_DA_CLK      Input clock pin
Outputs
O_OD          Open drain output driver
O_OD_TP       Open drain driver for test pins
O_PP          Push/pull output driver
O_PP_CLK      Push/pull output clock driver
Bidirectional
B_DA_OD       Bidirectional differential amplifier receiver with open drain output
B_DA_PP       Bidirectional differential amplifier receiver with push/pull output
Other
Spare         Reserved to COMPAQ (note 1)
NoConnect     No connection. Do not connect to these pins for any revision of the 21264/EV68A. These pins must float.
1. All Spare connections are Reserved to COMPAQ to maintain compatibility between
passes of the chip. Designers should not use these pins.
Table 3–2 lists all signal pins in alphabetic order and provides a full functional description of the pins. Table 3–4 lists the signal pins and their corresponding pin grid array
(PGA) locations in alphabetic order for the signal type. Table 3–5 lists the pin grid array
locations in alphabetical order.
Table 3–2 21264/EV68A Signal Descriptions
Signal                Type      Count Description
BcAdd_H[23:4]         O_PP      20    These signals provide the index to the Bcache.
BcCheck_H[15:0]       B_DA_PP   16    ECC check bits for BcData_H[127:0].
BcData_H[127:0]       B_DA_PP   128   Bcache data signals.
BcDataInClk_H[7:0]    I_DA      8     Bcache data input clocks. These clocks are used with high-speed SDRAMs, such as DDRs, that provide a clock-out with data-output pins to optimize Bcache read bandwidths. The 21264/EV68A internally synchronizes the data to its logic with clock forward receive circuits similar to the system interface.
BcDataOE_L            O_PP      1     Bcache data output enable. The 21264/EV68A asserts this signal during Bcache read operations.
BcDataOutClk_H[3:0]   O_PP      8     Bcache data output clocks. These free-running clocks are differential copies of the Bcache clock and are derived from the 21264/EV68A GCLK. Their period is a multiple of the GCLK and is fixed for all operations. They can be configured so that their rising edge lags BcAdd_H[23:4] by 0 to 2 GCLK cycles. The 21264/EV68A synchronizes tag output information with these clocks.
BcDataOutClk_L[3:0]
BcDataWr_L            O_PP      1     Bcache data write enable. The 21264/EV68A asserts this signal when writing data to the Bcache data arrays.
BcLoad_L              O_PP      1     Bcache burst enable.
BcTag_H[42:20]        B_DA_PP   23    Bcache tag bits.
BcTagDirty_H          B_DA_PP   1     Tag dirty state bit. During cache write operations, the 21264/EV68A will assert this signal if the Bcache data has been modified.
BcTagInClk_H          I_DA      1     Bcache tag input clock. The 21264/EV68A uses this input clock to latch the tag information on Bcache read operations. This clock is used with high-speed SDRAMs, such as DDRs, that provide a clock-out with data-output pins to optimize Bcache read bandwidths. The 21264/EV68A internally synchronizes the data to its logic with clock forward receive circuits similar to the system interface.
BcTagOE_L             O_PP      1     Bcache tag output enable. This signal is asserted by the 21264/EV68A for Bcache read operations.
BcTagOutClk_H         O_PP      2     Bcache tag output clock. These clocks “echo” the clock-forwarded BcDataOutClk_x[3:0] clocks.
BcTagOutClk_L
BcTagParity_H         B_DA_PP   1     Tag parity state bit.
BcTagShared_H         B_DA_PP   1     Tag shared state bit. The 21264/EV68A will write a 1 on this signal line if another agent has a copy of the cache line.
BcTagValid_H          B_DA_PP   1     Tag valid state bit. If set, this line indicates that the cache line is valid.
BcTagWr_L             O_PP      1     Tag RAM write enable. The 21264/EV68A asserts this signal when writing a tag to the Bcache tag arrays.
BcVref                I_DC_REF  1     Bcache tag reference voltage.
ClkFwdRst_H           I_DA      1     Systems assert this synchronous signal to wake up a powered-down 21264/EV68A. The ClkFwdRst_H signal is clocked into a 21264/EV68A register by the captured FrameClk_x signals. Systems must ensure that the timing of this signal meets 21264/EV68A requirements (see Section 4.7.2).
ClkIn_H               I_DA_CLK  2     Differential input signals provided by the system.
ClkIn_L
DCOK_H                I_DA      1     dc voltage OK. Must be deasserted until dc voltage reaches proper operating level. After that, DCOK_H is asserted.
EV6Clk_H              O_PP_CLK  2     Provides an external test point to measure phase alignment of the PLL.
EV6Clk_L
FrameClk_H            I_DA_CLK  2     A skew-controlled differential 50% duty cycle copy of the system clock. It is used by the 21264/EV68A as a reference, or framing, clock.
FrameClk_L
IRQ_H[5:0]            I_DA      6     These six interrupt signal lines may be asserted by the system. The response of the 21264/EV68A is determined by the system software.
MiscVref              I_DC_REF  1     Voltage reference for the miscellaneous pins (see Table 3–3).
PllBypass_H           I_DA      1     When asserted, this signal will cause the two input clocks (ClkIn_x) to be applied to the 21264/EV68A internal circuits, instead of the 21264/EV68A global clock (GCLK).
PLL_VDD               2.5V      1     2.5-V dedicated power supply for the 21264/EV68A PLL.
Reset_L               I_DA      1     System reset. This signal protects the 21264/EV68A from damage during initial power-up. It must be asserted until DCOK_H is asserted. After that, it is deasserted and the 21264/EV68A begins its reset sequence.
SromClk_H             O_OD_TP   1     Serial ROM clock. Supplies the clock that causes the SROM to advance to the next bit. The cycle time for this clock is 256 times the cycle time of the GCLK (internal 21264/EV68A clock).
SromData_H            I_DA      1     Serial ROM data. Input data line from the SROM.
SromOE_L              O_OD_TP   1     Serial ROM enable. Supplies the output enable to the SROM.
SysAddIn_L[14:0]      I_DA      15    Time-multiplexed command/address/ID/Ack from system to the 21264/EV68A.
SysAddInClk_L         I_DA      1     Single-ended forwarded clock from system for SysAddIn_L[14:0] and SysFillValid_L.
SysAddOut_L[14:0]     O_OD      15    Time-multiplexed command/address/ID/mask from the 21264/EV68A to the system bus.
SysAddOutClk_L        O_OD      1     Single-ended forwarded clock output for SysAddOut_L[14:0].
SysCheck_L[7:0]       B_DA_OD   8     Quadword ECC check bits for SysData_L[63:0].
SysData_L[63:0]       B_DA_OD   64    Data bus for memory and I/O data.
SysDataInClk_H[7:0]   I_DA      8     Single-ended system-generated clocks for clock forwarded input system data.
SysDataInValid_L      I_DA      1     When asserted, marks a valid data cycle for data transfers to the 21264/EV68A.
SysDataOutClk_L[7:0]  O_OD      8     Single-ended 21264/EV68A-generated clocks for clock forwarded output system data.
SysDataOutValid_L     I_DA      1     When asserted, marks a valid data cycle for data transfers from the 21264/EV68A.
SysFillValid_L        I_DA      1     When asserted, this bit indicates validation for the cache fill delivered in the previous system SysDc command.
SysVref               I_DC_REF  1     System interface reference voltage.
Tck_H                 I_DA      1     IEEE 1149.1 test clock.
Tdi_H                 I_DA      1     IEEE 1149.1 test data-in signal.
Tdo_H                 O_OD_TP   1     IEEE 1149.1 test data-out signal.
TestStat_H            O_OD_TP   1     Test status pin. System reset drives the test status pin low. The TestStat_H pin is forced high at the start of the Icache BiST. If the Icache BiST passes, the pin is deasserted at the end of the BiST operation; otherwise, it remains high. The 21264/EV68A generates a timeout reset signal if an instruction is not retired within one billion cycles. The 21264/EV68A signals the timeout reset event by outputting a 256 GCLK cycle wide pulse on TestStat_H.
Tms_H                 I_DA      1     IEEE 1149.1 test mode select signal.
Trst_L                I_DA      1     IEEE 1149.1 test access port (TAP) reset signal.
Table 3–3 lists signals by function and provides an abbreviated description.
Table 3–3 21264/EV68A Signal Descriptions by Function
Signal                Type      Count Description
BcVref Domain
BcAdd_H[23:4]         O_PP      20    Bcache index.
BcCheck_H[15:0]       B_DA_PP   16    ECC check bits for BcData_H[127:0].
BcData_H[127:0]       B_DA_PP   128   Bcache data.
BcDataInClk_H[7:0]    I_DA      8     Bcache data input clocks.
BcDataOE_L            O_PP      1     Bcache data output enable.
BcDataOutClk_H[3:0]   O_PP      8     Bcache data output clocks.
BcDataOutClk_L[3:0]
BcDataWr_L            O_PP      1     Bcache data write enable.
BcLoad_L              O_PP      1     Bcache burst enable.
BcTag_H[42:20]        B_DA_PP   23    Bcache tag bits.
BcTagDirty_H          B_DA_PP   1     Tag dirty state bit.
BcTagInClk_H          I_DA      1     Bcache tag input clock.
BcTagOE_L             O_PP      1     Bcache tag output enable.
BcTagOutClk_H         O_PP      2     Bcache tag output clocks.
BcTagOutClk_L
BcTagParity_H         B_DA_PP   1     Tag parity state bit.
BcTagShared_H         B_DA_PP   1     Tag shared state bit.
BcTagValid_H          B_DA_PP   1     Tag valid state bit.
BcTagWr_L             O_PP      1     Tag RAM write enable.
BcVref                I_DC_REF  1     Tag data input reference voltage.
SysVref Domain
SysAddIn_L[14:0]      I_DA      15    Time-multiplexed SysAddIn, system-to-21264/EV68A.
SysAddInClk_L         I_DA      1     Single-ended forwarded clock from system for SysAddIn_L[14:0] and SysFillValid_L.
SysAddOut_L[14:0]     O_OD      15    Time-multiplexed SysAddOut, 21264/EV68A-to-system.
SysAddOutClk_L        O_OD      1     Single-ended forwarded-clock.
SysCheck_L[7:0]       B_DA_OD   8     Quadword ECC check bits for SysData_L[63:0].
SysData_L[63:0]       B_DA_OD   64    Data bus for memory and I/O data.
SysDataInClk_H[7:0]   I_DA      8     Single-ended system-generated clocks for clock forwarded input system data.
SysDataInValid_L      I_DA      1     When asserted, marks a valid data cycle for data transfers to the 21264/EV68A.
SysDataOutClk_L[7:0]  O_OD      8     Single-ended 21264/EV68A-generated clocks for clock forwarded output system data.
SysDataOutValid_L     I_DA      1     When asserted, marks a valid data cycle for data transfers from the 21264/EV68A.
SysFillValid_L        I_DA      1     Validation for fill given in previous SysDC command.
SysVref               I_DC_REF  1     System interface reference voltage.
Clocks and PLL
ClkIn_H               I_DA_CLK  2     Differential input signals provided by the system.
ClkIn_L
EV6Clk_H              O_PP_CLK  2     Provides an external test point to measure phase alignment of the PLL.
EV6Clk_L
FrameClk_H            I_DA_CLK  2     A skew-controlled differential 50% duty cycle copy of the system clock. It is used by the 21264/EV68A as a reference, or framing, clock.
FrameClk_L
PLL_VDD               2.5V      1     2.5-V dedicated power supply for the 21264/EV68A PLL.
MiscVref Domain
ClkFwdRst_H           I_DA      1     Systems assert this synchronous signal to wake up a powered-down 21264/EV68A. The ClkFwdRst_H signal is clocked into a 21264/EV68A register by the captured FrameClk_x signals.
DCOK_H                I_DA      1     dc voltage OK. Must be deasserted until dc voltage reaches proper operating level. After that, DCOK_H is asserted.
IRQ_H[5:0]            I_DA      6     These six interrupt signal lines may be asserted by the system.
MiscVref              I_DC_REF  1     Reference voltage for miscellaneous pins.
PllBypass_H           I_DA      1     When asserted, this signal will cause the input clocks (ClkIn_x) to be applied to the 21264/EV68A internal circuits, instead of the 21264/EV68A’s global clock (GCLK).
Reset_L               I_DA      1     System reset. This signal protects the 21264/EV68A from damage during initial power-up. It must be asserted until DCOK_H is asserted. After that, it is deasserted and the 21264/EV68A begins its reset sequence.
SromClk_H             O_OD_TP   1     Serial ROM clock.
SromData_H            I_DA      1     Serial ROM data.
SromOE_L              O_OD_TP   1     Serial ROM enable.
Tck_H                 I_DA      1     IEEE 1149.1 test clock.
Tdi_H                 I_DA      1     IEEE 1149.1 test data-in signal.
Tdo_H                 O_OD_TP   1     IEEE 1149.1 test data-out signal.
TestStat_H            O_OD_TP   1     Test status pin.
Tms_H                 I_DA      1     IEEE 1149.1 test mode select signal.
Trst_L                I_DA      1     IEEE 1149.1 test access port (TAP) reset signal.
3.3 Pin Assignments
The 21264/EV68A package has 587 pins aligned in a pin grid array (PGA) design.
There are 380 functional signal pins, 1 dedicated 2.5-V pin for the PLL, 112 ground
VSS pins, and 94 VDD pins. Table 3–4 lists the signal pins and their corresponding pin
grid array (PGA) locations in alphabetical order for the signal type. Table 3–5 lists the
pin grid array locations in alphabetical order.
Table 3–4 Pin List Sorted by Signal Name
Signal Name    PGA Location
Figure 3–3 shows the 21264/EV68A pinout from the top view with pins facing down.
Figure 3–3 21264/EV68A Top View (Pin Down)
[Figure: pin grid viewed from the top with pins facing down. Rows are lettered A through BE and columns are numbered 01 through 45.]
Figure 3–4 shows the 21264/EV68A pinout from the bottom view with pins facing up.
Figure 3–4 21264/EV68A Bottom View (Pin Up)
[Figure: pin grid viewed from the bottom with pins facing up. Rows are lettered A through BE and columns are numbered 01 through 45.]
4
Cache and External Interfaces
This chapter describes the 21264/EV68A cache and external interface, which includes
the second-level cache (Bcache) interface and the system interface. It also describes
locks, interrupt signals, and ECC/parity generation. It is organized as follows:
• Introduction to the external interfaces
• Physical address considerations
• Bcache structure
• Victim data buffer
• Cache coherency
• Lock mechanism
• System port
• Bcache port
• Interrupts
Chapter 3 lists and defines all 21264/EV68A hardware interface signal pins. Chapter 9
describes the 21264/EV68A hardware interface electrical requirements.
4.1 Introduction to the External Interfaces
A 21264/EV68A-based system can be divided into three major sections:
• 21264/EV68A microprocessor
• Second-level Bcache
• System interface logic
– Optional duplicate tag store
– Optional lock register
– Optional victim buffers
The 21264/EV68A external interface is flexible and mandates few design rules, allowing a wide range of prospective systems. The external interface is composed of the
Bcache interface and the system interface.
• Input clocks must have the same frequency as their corresponding output clock. For
example, the frequency of SysAddInClk_L must be the same as
SysAddOutClk_L.
• The Bcache interface includes a 128-bit bidirectional data bus, a 20-bit unidirectional address bus, and several control signals.
– The BcDataOutClk_x[3:0] clocks are free-running and are derived from the
internal GCLK. The period of BcDataOutClk_x[3:0] is a programmable multiple of GCLK.
– The Bcache turns the BcDataOutClk_x[3:0] clocks around and returns them
to the 21264/EV68A as BcDataInClk_H[7:0]. Likewise, BcTagOutClk_x
returns as BcTagInClk_H.
– The Bcache interface supports a 64-byte block size.
• The system interface includes a 64-bit bidirectional data bus, two 15-bit
unidirectional address buses, and several control signals.
– The SysAddOutClk_L clock is free-running and is derived from the internal
GCLK. The period of SysAddOutClk_L is a programmable multiple of
GCLK.
– The SysAddInClk_L clock is a turned-around copy of SysAddOutClk_L.
Figure 4–1 shows a simplified view of the external interface. The function and purpose
of each signal is described in Chapter 3.
Figure 4–1 21264/EV68A System and Bcache Interfaces
[Figure: block diagram of the 21264 between the system and the Bcache. The system side carries SysAddIn_L[14:0], SysAddInClk_L, SysAddOut_L[14:0], SysAddOutClk_L, SysVref, SysData_L[63:0], SysCheck_L[7:0], SysDataInClk_H[7:0], SysDataOutClk_L[7:0], SysDataInValid_L, SysDataOutValid_L, SysFillValid_L, and IRQ_H[5:0]. The Bcache side carries BcAdd_H[23:4], BcLoad_L, BcData_H[127:0], BcCheck_H[15:0], BcDataInClk_H[7:0], BcDataOutClk_x[3:0], BcDataOE_L, BcDataWr_L, BcTag_H[42:20], BcTagInClk_H, BcTagOutClk_x, BcVref, BcTagWr_L, BcTagOE_L, BcTagValid_H, BcTagDirty_H, BcTagShared_H, and BcTagParity_H to the data, tag, and status arrays.]
4.1.1 System Interface
This section introduces the system (external) bus interface. The system interface is
made up of two unidirectional 15-bit address buses, 64 bidirectional data lines, eight
bidirectional check bits, two single-ended unidirectional clocks, and a few control pins.
The 15-bit address buses provide time-shared address/command/ID in two or four
GCLK cycles. The Cbox controls the system interface.
4.1.1.1 Commands and Addresses
The system sends probe and data movement commands to the 21264/EV68A. The
21264/EV68A can hold up to eight probe commands from the system. The system controls the number of outstanding probe commands and must ensure that the
21264/EV68A 8-entry probe queue does not overflow.
The Cbox contains an 8-entry miss buffer (MAF) and an 8-entry victim buffer (VAF).
A miss occurs when the 21264/EV68A probes the Bcache but does not find the
addressed block. The 21264/EV68A can queue eight cache misses to the system in its
MAF.
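One way a system might meet the no-overflow requirement is with a credit counter. The Python sketch below is illustrative only: the class name ProbeCreditCounter, its methods, and the idea that the system observes when each probe completes are assumptions, not part of this manual. Only the queue depth of eight comes from the text above.

```python
PROBE_QUEUE_DEPTH = 8  # from the text: the 21264/EV68A holds up to eight probes


class ProbeCreditCounter:
    """Tracks outstanding probes so that the system never exceeds the queue depth."""

    def __init__(self, depth=PROBE_QUEUE_DEPTH):
        self.depth = depth
        self.outstanding = 0

    def can_send(self):
        """True if another probe can be issued without overflowing the queue."""
        return self.outstanding < self.depth

    def send_probe(self):
        """Account for one probe being issued to the processor."""
        if not self.can_send():
            raise RuntimeError("probe queue would overflow; hold the probe")
        self.outstanding += 1

    def probe_completed(self):
        """Account for one probe finishing (assumed to be visible to the system)."""
        if self.outstanding == 0:
            raise RuntimeError("no probe is outstanding")
        self.outstanding -= 1
```

A system model would call send_probe() before driving each probe command and probe_completed() when the matching response arrives.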
4.1.2 Second-Level Cache (Bcache) Interface
The 21264/EV68A Cbox provides control signals and an interface for a second-level
cache, the Bcache. The 21264/EV68A supports a Bcache from 1MB to 16MB, with 64-byte blocks. A 128-bit data bus is used for transfers between the 21264/EV68A and the
Bcache. The Bcache must be composed of synchronous static RAMs (SSRAMs) and
must contain either one, two, or three internal registers. All Bcache control and address
pins are clocked synchronously on Bcache cycle boundaries. The Bcache clock rate
varies as a multiple of the CPU clock cycle in half-cycle increments from 1.5 to 4.0,
and in full-cycle increments of 5, 6, 7, and 8 times the CPU clock cycle. The 1.5 multiple is only available in dual-data mode.
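The constraints in this paragraph can be checked mechanically. The Python sketch below lists the clock multiples named above, applies the dual-data restriction on 1.5, and derives two figures that follow from the stated block size and bus width. The function and constant names are hypothetical, and the sketch checks only the 1MB to 16MB range, not which sizes within it are usable.

```python
from fractions import Fraction

# Multiples named in the text: 1.5 to 4.0 in half steps, then 5, 6, 7, and 8.
HALF_STEP_RATIOS = [Fraction(n, 2) for n in range(3, 9)]
FULL_STEP_RATIOS = [Fraction(n) for n in (5, 6, 7, 8)]

BLOCK_BYTES = 64      # Bcache block size
DATA_BUS_BITS = 128   # Bcache data bus width
TRANSFERS_PER_BLOCK = BLOCK_BYTES * 8 // DATA_BUS_BITS  # bus transfers per block


def is_valid_bcache_ratio(ratio, dual_data_mode=False):
    """True if ratio is one of the Bcache clock multiples listed in the text."""
    r = Fraction(ratio)
    if r == Fraction(3, 2):
        return dual_data_mode  # the 1.5 multiple requires dual-data mode
    return r in HALF_STEP_RATIOS or r in FULL_STEP_RATIOS


def bcache_block_count(size_mb):
    """Number of 64-byte blocks in a Bcache of size_mb megabytes."""
    if not 1 <= size_mb <= 16:
        raise ValueError("Bcache size must be from 1MB to 16MB")
    return size_mb * 1024 * 1024 // BLOCK_BYTES
```

With these definitions, a 64-byte block moves across the 128-bit bus in four transfers, and a 1MB Bcache holds 16384 blocks.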
4.2 Physical Address Considerations
The 21264/EV68A supports a 44-bit physical address space that is divided equally
between memory space and I/O space. Memory space resides in the lower half of the
physical address space (PA[43] = 0) and I/O space resides in the upper half of the physical address space (PA[43] = 1). The 21264/EV68A recognizes these spaces internally.
The 21264/EV68A-generated external references to memory space are always of a
fixed 64-byte size, though the internal access granularity is byte, word, longword, or
quadword. All 21264/EV68A-generated external references to memory or I/O space
are physical addresses that are either successfully translated from a virtual address or
produced by PALcode. Speculative execution may cause a reference to nonexistent
memory. Systems must check the range of all addresses and report nonexistent
addresses to the 21264/EV68A.
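The address split described above reduces to a test of bit 43, and the fixed 64-byte reference size reduces to clearing the low six address bits. The Python sketch below illustrates both. The function names are hypothetical, and rejecting addresses wider than 44 bits is a choice made for the example rather than documented behavior.

```python
PA_BITS = 44       # width of the physical address space
BLOCK_BYTES = 64   # size of external memory-space references


def check_pa(pa):
    """Reject values that do not fit in a 44-bit physical address."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address must fit in 44 bits")
    return pa


def is_io_space(pa):
    """True if PA[43] = 1 (I/O space); False if PA[43] = 0 (memory space)."""
    return bool((check_pa(pa) >> 43) & 1)


def block_address(pa):
    """Address of the 64-byte block that contains pa."""
    return check_pa(pa) & ~(BLOCK_BYTES - 1)


print(is_io_space(0x800_0000_0000))   # PA[43] = 1
print(hex(block_address(0x1234)))     # low six bits cleared
```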
Table 4–1 describes the translation of internal references to external interface references. The first column lists the instructions used by the programmer, including load
(LDx) and store (STx) instructions of several sizes. The column headings are described
here:
• DcHit (block was found in the Dcache)
• DcW (block was found in a writable state in the Dcache)
• BcHit (block was found in the Bcache)
• BcW (block was found in a writable state in the Bcache)
• Status and Action (status at end of instruction and action performed by the 21264/EV68A)
Prefetches (LDL, LDF, LDG, LDT, LDBU, LDWU) to R31 use the LDx flow, and
prefetch with modify intent (LDS) uses the STx flow. If the prefetch target is addressed
to I/O space, the upper address bit is cleared, converting the address to memory space
(PA[42:6]). Notes follow the table.
Table 4–1 Translation of Internal References to External Interface Reference
Instruction     DcHit  DcW  BcHit  BcW  Status and Action
LDx Memory      1      X    X      X    Dcache hit, done.
LDx Memory      0      X    1      X    Bcache hit, done.
LDx Memory      0      X    0      X    Miss, generate RdBlk command.
LDx I/O         X      X    X      X    RdBytes, RdLWs, or RdQWs based on size.
Istream Memory  1      X    X      X    Dcache hit, Istream serviced from Dcache.
Istream Memory  0      X    1      X    Bcache hit, Istream serviced from Bcache.
Istream Memory  0      X    0      X    Miss, generate RdBlkI command.
STx Memory      1      1    X      X    Store Dcache hit and writable, done.
STx Memory      1      0    X      X    Store hit and not writable, set-dirty flow (note 1).
STx Memory      0      X    1      1    Store Bcache hit and writable, done.
STx Memory      0      X    1      0    Store hit and not writable, set-dirty flow (note 1).
STx Memory      0      X    0      X    Miss, generate RdBlkMod command.
STx I/O         X      X    X      X    WrBytes, WrLWs, or WrQWs based on size.
STx_C Memory    0      X    X      X    Fail STx_C.
STx_C Memory    1      0    X      X    STx_C hit and not writable, set-dirty flow (note 1).
STx_C I/O       X      X    X      X    Always succeed and WrQws or WrLws are generated, based on the size.
WH64 Memory     1      1    X      X    Hit, done.
WH64 Memory     1      0    X      X    WH64 hit not writable, set-dirty flow (note 1).
WH64 Memory     0      X    1      1    WH64 hit dirty, done.
WH64 Memory     0      X    1      0    WH64 hit not writable, set-dirty flow (note 1).
WH64 Memory     0      X    0      X    Miss, generate InvalToDirty command (note 2).
WH64 I/O        X      X    X      X    NOP the instruction. WH64 is UNDEFINED for I/O
1. Set Dirty Flow: Based on the Cbox CSR SET_DIRTY_ENABLE[2:0], SetDirty
requests can be either internally acknowledged (called a SetModify) or sent to the
system environment for processing. When externally acknowledged, the shared status information for the cache block is also broadcast. The commands sent externally are SharedToDirty or CleanToDirty. Based on the Cbox CSR
ENABLE_STC_COMMAND[0], the external system can be informed of a STx_C
generating a SetDirty using the STCChangeToDirty command. See Table 4–16 for
more information.
2. InvalToDirty: Based on the Cbox CSR INVAL_TO_DIRTY_ENABLE[1:0], InvalToDirty requests can be either internally acknowledged or sent to the system environment as InvalToDirty commands. This Cbox CSR provides the ability to convert
WH64 instructions to RdModx operations. See Table 4–15 for more information.
3. Evict: There are two aspects to the commands that are generated by an ECB
instruction: first, those commands that are generated to notify the system of an evict
being performed; second, those commands that are generated by any victim that is
created by servicing the ECB.
– If Cbox CSR ENABLE_EVICT[0] is clear, no command is issued by the
21264/EV68A on the external interface to notify the system of an evict being
performed. If Cbox CSR ENABLE_EVICT[0] is set, the 21264/EV68A issues
an Evict command on the system interface only if a Bcache index match to the
ECB address is found in the 21264/EV68A cache system.
Note that whenever ENABLE_EVICT[0] is true (in the write-many chain),
BC_CLEAN_VICTIM must also be true (in the write-once chain). Otherwise,
the 21264/EV68A could respond miss to a probe, rather than hit, before an
Evict command has been sent off chip, but after the Evict command has
removed a (clean) block from the internal caches and the Bcache. That behavior might cause systems that maintain an external duplicate copy of the Bcache
tags to become confused, because the system could receive the probe response
indicating the miss before it receives the Evict command.
– The 21264/EV68A can issue the commands CleanVictimBlk and WrVictimBlk
for a victim that is created by an ECB. CleanVictimBlk is issued only if Cbox
CSR BC_CLEAN_VICTIM is set and there is a Bcache index match valid but
not dirty in the 21264/EV68A cache system. WrVictimBlk is issued for any
Bcache match of the ECB address that is dirty in the 21264/EV68A cache system.
4. MB: Based on the Cbox CSR SYSBUS_MB_ENABLE, the MB command can be
sent to the pins.
Each of these CSRs is programmed appropriately, based on the cache coherence protocol used by the system environment. For example, uniprocessor systems would prefer
to internally acknowledge most of these transactions. In contrast, multiprocessor systems may require notification and control of any change in cache state. The 21264/EV68A
and the external system must cooperate to maintain cache coherence. Section
4.5 explains the 21264/EV68A part of the cache coherency protocol.
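The LDx and STx rows of Table 4–1 can be read as a priority decision: I/O space first, then the Dcache, then the Bcache, and finally a miss. The Python sketch below encodes only those rows. The function names and the shortened result strings are hypothetical, and the CSR-dependent handling of the set-dirty flow (note 1) is not modeled.

```python
def ldx_action(io_space, dc_hit, bc_hit):
    """Status and action for a load, following the LDx rows of Table 4-1."""
    if io_space:
        return "RdBytes, RdLWs, or RdQWs based on size"
    if dc_hit:
        return "Dcache hit, done"
    if bc_hit:
        return "Bcache hit, done"
    return "Miss, generate RdBlk command"


def stx_action(io_space, dc_hit, dc_writable, bc_hit, bc_writable):
    """Status and action for a store, following the STx rows of Table 4-1."""
    if io_space:
        return "WrBytes, WrLWs, or WrQWs based on size"
    if dc_hit:
        # DcW decides between finishing and starting the set-dirty flow.
        return "Hit and writable, done" if dc_writable else "Set-dirty flow"
    if bc_hit:
        # On a Dcache miss, BcW plays the same role for the Bcache.
        return "Hit and writable, done" if bc_writable else "Set-dirty flow"
    return "Miss, generate RdBlkMod command"


print(ldx_action(io_space=False, dc_hit=False, bc_hit=False))
print(stx_action(False, True, False, False, False))
```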
4.3 Bcache Structure
The 21264/EV68A Cbox provides control signals and an interface for a second-level
cache (Bcache).
The 21264/EV68A supports a Bcache from 1MB to 16MB, with 64-byte blocks. A 128-bit bidirectional data bus is used for transfers between the 21264/EV68A and the
Bcache. The Bcache is fully synchronous and the synchronous static RAMs (SSRAMs)
must contain either one, two, or three internal registers. All Bcache control and address
pins are clocked synchronously on Bcache cycle boundaries. The Bcache clock rate
varies as a multiple of the CPU clock cycle in half-cycle increments from 1.5 to 4.0,
and in full-cycle increments of 5, 6, 7, and 8 times the CPU clock cycle. The 1.5 multiple is only available in dual-data mode.
4.3.1 Bcache Interface Signals
Figure 4–2 shows the 21264/EV68A system interface signals.
The 21264/EV68A provides Bcache state support for systems with and without duplicate tag stores, and will take different actions on this basis. The system sets the Cbox
CSR DUP_TAG_ENA[0], indicating that it has a duplicate tag store for the Bcache.
Systems using the DUP_TAG_ENA[0] bit must also use the Cbox CSR
BC_CLEAN_VICTIM[0] bit to avoid deadlock situations.
Systems using a Bcache duplicate tag store can accelerate system performance by:
• Issuing probes and SysDc fill commands to the 21264/EV68A out-of-order with
respect to their order at the system serialization point
• Filtering out all probe misses from the 21264/EV68A cache system
If a probe misses in the 21264/EV68A cache system (Bcache miss and VAF miss), the
21264/EV68A stalls probe processing with the expectation that a SysDc fill will allocate this block. Because of this, in duplicate tag mode, the 21264/EV68A can never
generate a probe miss response.
When Cbox CSR DUP_TAG_ENA[0] equals 0, the 21264/EV68A delivers a miss
response for probes that do not hit in its cache system.
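The CSR dependencies stated in this chapter can be collected into a small configuration check. The Python sketch below is a hypothetical helper, not a documented interface. It reads "must also use" BC_CLEAN_VICTIM[0] as "must also set" it, which is an assumption, and it takes the ENABLE_EVICT[0] rule from note 3 of Table 4–1.

```python
def check_cbox_settings(dup_tag_ena, bc_clean_victim, enable_evict):
    """Return a list of rule violations for the given Cbox CSR bit values."""
    problems = []
    if dup_tag_ena and not bc_clean_victim:
        # Section 4.3.1: needed to avoid deadlock (assumes "use" means "set").
        problems.append("DUP_TAG_ENA[0] is set but BC_CLEAN_VICTIM[0] is clear")
    if enable_evict and not bc_clean_victim:
        # Table 4-1, note 3: keeps probe responses ordered with Evict commands.
        problems.append("ENABLE_EVICT[0] is set but BC_CLEAN_VICTIM is clear")
    return problems


def probe_miss_result(dup_tag_ena):
    """What the processor does when a probe misses in its cache system."""
    if dup_tag_ena:
        return "stall probe processing until a SysDc fill allocates the block"
    return "deliver a miss response"


print(check_cbox_settings(dup_tag_ena=True, bc_clean_victim=False, enable_evict=False))
print(probe_miss_result(dup_tag_ena=False))
```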
4.4 Victim Data Buffer
The 21264/EV68A has eight victim data buffers (VDBs). They have the following
properties:
• The VDBs are used for both victims (fills that are replacing dirty cache blocks) and
for system probes that require data movement. The CleanVictimBlk command
(optional) assigns and uses a VDB.
• Each VDB has two valid bits that indicate the buffer is valid for a victim or valid
for a probe or valid for both a victim and a probe. Probe commands that match the
address of a victim address file (VAF) entry with an asserted probe-valid bit (P)
will stall the 21264/EV68A probe queue. No ProbeResponses will be returned until
the P bit is clear.
• The release victim buffer (RVB) bit, when asserted, causes the victim valid bit, on
the victim data buffer (VDB) specified in the ID field, to be cleared. The RVB bit
will also clear the IOWB when systems move data on I/O write transactions. In this
case, ID[3] equals one.
• The release probe buffer (RPB) bit, when asserted (with a WriteData or ReleaseBuffer
SysDc command), clears the P bit in the victim buffer entry specified in the
ID field.
• Read data commands and victim write commands use IDs 0-7, while IDs 8-11 are
used to address the four I/O write buffers.
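The ID ranges in the last bullet can be decoded with a range check. In the Python sketch below, the function name and the mapping of IDs 8 through 11 onto I/O write buffer numbers 0 through 3 are assumptions made for illustration. The ranges themselves, and the fact that ID[3] is one for the I/O write buffers, come from the text.

```python
def decode_buffer_id(buf_id):
    """Map an ID field value to (buffer kind, index within that kind)."""
    if 0 <= buf_id <= 7:
        return ("VDB", buf_id)        # victim data buffers use IDs 0-7
    if 8 <= buf_id <= 11:
        # ID[3] is one for these values. The index is assumed to be ID - 8.
        return ("IOWB", buf_id - 8)
    raise ValueError("IDs 0-11 are the only values described")


print(decode_buffer_id(5))
print(decode_buffer_id(9))
```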
4.5 Cache Coherency
This section describes the basics and protocols of the 21264/EV68A cache coherency
scheme.
4.5.1 Cache Coherency Basics
The 21264/EV68A systems maintain the cache hierarchy shown in Figure 4–3.
Figure 4–3 Cache Subset Hierarchy
[Figure: nested hierarchy showing the System, Main Memory, Bcache, Dcache, and Icache.]
The following tasks must be performed to maintain cache coherency:
• Istream data from memory spaces may be cached in the Icache and Bcache. Icache
coherence is not maintained by hardware; it must be maintained by software using
the CALL_PAL IMB instruction.
• The 21264/EV68A maintains the Dcache as a subset of the Bcache. The Dcache is
set-associative but is kept a subset of the larger externally implemented direct-mapped Bcache.
• System logic must help the 21264/EV68A to keep the Bcache coherent with main
memory and other caches in the system.
• The 21264/EV68A requires the system to allow only one change to a block at a
time. This means that if the 21264/EV68A gains the bus to read or write a block, no
other node on the bus should be allowed to access that block until the data has been
moved.
• The 21264/EV68A provides hardware mechanisms to support several cache coherency
protocols. The protocols can be separated into two classes: write invalidate
cache coherency protocol and flush cache coherency protocol.
4.5.2 Cache Block States
Table 4–2 lists the cache block states supported by the 21264/EV68A.
Table 4–2 21264/EV68A-Supported Cache Block States
State Name    Description
Invalid       The 21264/EV68A does not have a copy of the block.
Clean         This 21264/EV68A holds a read-only copy of the block, and no other agent in the system holds a copy. Upon eviction, the block is not written to memory.
Clean/Shared  This 21264/EV68A holds a read-only copy of the block, and at least one other agent in the system may hold a copy of the block. Upon eviction, the block is not written to memory.
Dirty         This 21264/EV68A holds a read-write copy of the block, and must write it to memory after it is evicted from the cache. No other agent in the system holds a copy of the block.
Dirty/Shared  This 21264/EV68A holds a read-only copy of the dirty block, which may be shared with another agent. The block must be written back to memory when it is evicted.
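The properties in Table 4–2 can be captured in a small enumeration, which makes it easy to ask, for example, which states require a write to memory on eviction. The Python sketch below is an illustrative model only. The class name, the property names, and the tuple layout are assumptions, and the sketch says nothing about how the states are encoded in the tag bits.

```python
from enum import Enum


class BlockState(Enum):
    # value = (holds_copy, writable, may_be_shared, write_back_on_evict)
    INVALID = (False, False, False, False)
    CLEAN = (True, False, False, False)
    CLEAN_SHARED = (True, False, True, False)
    DIRTY = (True, True, False, True)
    DIRTY_SHARED = (True, False, True, True)

    @property
    def writable(self):
        """True only for the read-write state (Dirty)."""
        return self.value[1]

    @property
    def may_be_shared(self):
        """True if another agent may hold a copy of the block."""
        return self.value[2]

    @property
    def write_back_on_evict(self):
        """True if the block must be written to memory when evicted."""
        return self.value[3]


for state in BlockState:
    print(state.name, state.writable, state.write_back_on_evict)
```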
4.5.3 Cache Block State Transitions
Cache block state transitions are reflected by 21264/EV68A-generated commands to
the system. Cache block state transitions can also be caused by system-generated commands to the 21264/EV68A (probes). Probes control the next state for the cache block.
The next state can be based on the previous state of the cache block. Table 4–3 lists the
next state for the cache block.
Table 4–3 Cache Block State Transitions
Next State                        Action Based on Probe Hit
No change                         Do not update cache state. Useful for DMA transactions that sample data
                                  but do not want to update tag state.
Clean                             Independent of previous state, update next state to Clean.
Clean/Shared                      Independent of previous state, update next state to Clean/Shared. This
                                  transaction is useful for systems that update memory on probe hits.
T1:                               Based on the dirty bit, make the block clean or dirty shared. This
  Clean ⇒ Clean/Shared            transaction is useful for systems that do not update memory on probe
  Dirty ⇒ Dirty/Shared            hits.
T3:                               If the block is Clean or Dirty/Shared, change to Clean/Shared. If the
  Clean ⇒ Clean/Shared            block is Dirty, change to Invalid. This transaction is useful for
  Dirty ⇒ Invalid                 systems that use the Dirty/Shared state as an exclusive state.
  Dirty/Shared ⇒ Clean/Shared
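The next-state rules of Table 4–3 can be sketched as a function of the current state and the probe's next-state selection. This is a minimal model, not the hardware implementation. The names `State`, `ProbeNext`, and `next_state` are hypothetical. Two behaviors are assumptions rather than statements from the table: a state that has no explicit T1 or T3 entry is left unchanged, and a probe hit on an Invalid block is rejected because a hit implies that a copy is present.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    CLEAN = "Clean"
    CLEAN_SHARED = "Clean/Shared"
    DIRTY = "Dirty"
    DIRTY_SHARED = "Dirty/Shared"


class ProbeNext(Enum):
    """Next-state selections listed in Table 4-3."""
    NO_CHANGE = "No change"
    CLEAN = "Clean"
    CLEAN_SHARED = "Clean/Shared"
    T1 = "T1"
    T3 = "T3"


# T1: based on the dirty bit, make the block clean shared or dirty shared.
T1_TRANSITIONS = {
    State.CLEAN: State.CLEAN_SHARED,
    State.DIRTY: State.DIRTY_SHARED,
}

# T3: Clean and Dirty/Shared become Clean/Shared; Dirty becomes Invalid.
T3_TRANSITIONS = {
    State.CLEAN: State.CLEAN_SHARED,
    State.DIRTY: State.INVALID,
    State.DIRTY_SHARED: State.CLEAN_SHARED,
}


def next_state(current: State, probe: ProbeNext) -> State:
    """Return the block state after a probe hit."""
    if current is State.INVALID:
        # Assumption: a probe hit requires a valid block.
        raise ValueError("a probe hit requires a valid block")
    if probe is ProbeNext.NO_CHANGE:
        return current
    if probe is ProbeNext.CLEAN:
        return State.CLEAN
    if probe is ProbeNext.CLEAN_SHARED:
        return State.CLEAN_SHARED
    table = T1_TRANSITIONS if probe is ProbeNext.T1 else T3_TRANSITIONS
    # Assumption: states without an explicit entry are left unchanged.
    return table.get(current, current)
```

Under this model a T3 probe that hits a Dirty block invalidates it, while a T1 probe that hits the same block leaves it Dirty/Shared.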
The cache state transitions caused by 21264/EV68A-generated commands are under the
full control of the system environment using the SysDc (system data control) commands.
Table 4–4 lists these commands.
Table 4–4 System Responses to 21264/EV68A Commands
Response Type                 21264/EV68A Action
SysDc ReadData                Fill block with the associated data and update tag with clean cache status.
SysDc ReadDataDirty           Fill block with the associated data and update tag with dirty cache status.
SysDc ReadDataShared          Fill block with the associated data and update tag with shared cache status.
SysDc ReadDataShared/Dirty    Fill block with the associated data and update tag with dirty/shared status.
SysDc ReadDataError           Fill block with all-ones reference pattern and update tag with invalid status.
SysDc ChangeToDirtySuccess    Unconditionally update block with dirty cache status.
SysDc ChangeToDirtyFail       Do not update cache status and fail any associated STx_C instructions.
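The tag updates in Table 4–4 amount to a mapping from the SysDc response to the resulting cache status. The Python sketch below models only that mapping. The names `FILL_STATUS` and `tag_status_after` are hypothetical, and reading "shared cache status" as the Clean/Shared state of Table 4–2 is an assumption.

```python
# Tag status written by each fill-type SysDc response (Table 4-4).
FILL_STATUS = {
    "ReadData": "Clean",
    "ReadDataDirty": "Dirty",
    # Assumption: "shared cache status" corresponds to Clean/Shared.
    "ReadDataShared": "Clean/Shared",
    "ReadDataShared/Dirty": "Dirty/Shared",
    # The block is filled with the all-ones pattern and tagged invalid.
    "ReadDataError": "Invalid",
}


def tag_status_after(response: str, current: str) -> str:
    """Return the cache status after the given SysDc response."""
    if response == "ChangeToDirtySuccess":
        # Unconditionally mark the block dirty.
        return "Dirty"
    if response == "ChangeToDirtyFail":
        # Status is not updated; associated STx_C instructions fail.
        return current
    return FILL_STATUS[response]
```

For instance, `tag_status_after("ChangeToDirtyFail", "Clean/Shared")` leaves the status at Clean/Shared. The sketch does not model the failure of the associated STx_C instruction.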
4.5.4 Using SysDc Commands
Note the following:
• The conventional response for RdBlk commands is SysDc ReadData or ReadDataShared.
• The conventional response for a RdBlkMod command is SysDc ReadDataDirty.
• The conventional response for ChangeToDirty commands is
ChangeToDirtySuccess or ChangeToDirtyFail.
However, the system environment is not limited to these responses. Table 4–5 shows all
21264/EV68A commands, system responses, and the 21264/EV68A reaction. The
21264/EV68A commands are described in the following list:
• Rdx commands are generated by load or Istream references.
• RdBlkModx commands are generated by store references.
• The ChxToDirty command group includes CleanToDirty, SharedToDirty, and
STCChangeToDirty commands, which are generated by store references that hit in the
21264/EV68A cache system.
• InvalToDirty commands are generated by WH64 instructions that miss in the
21264/EV68A cache system.
• FetchBlk and FetchBlkSpec are noncached references to memory space that have
missed in the 21264/EV68A cache system.
• Rdiox commands are noncached references to I/O address space.
• Evict and STCChangeToDirty commands are generated by ECB and STx_C
instructions, respectively.
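The conventional command and response pairings noted at the start of this section can be expressed as a simple check. The Python sketch below is illustrative only. The names `CHX_TO_DIRTY`, `CONVENTIONAL_RESPONSES`, `response_class`, and `is_conventional` are hypothetical, and treating every member of the ChxToDirty group as a "ChangeToDirty command" is an assumption. A pairing that this check rejects is not necessarily illegal, because the system environment is not limited to the conventional responses.

```python
# Members of the ChxToDirty command group named in the list above.
CHX_TO_DIRTY = frozenset({"CleanToDirty", "SharedToDirty", "STCChangeToDirty"})

# Conventional SysDc responses for each command class.
CONVENTIONAL_RESPONSES = {
    "RdBlk": frozenset({"ReadData", "ReadDataShared"}),
    "RdBlkMod": frozenset({"ReadDataDirty"}),
    "ChangeToDirty": frozenset({"ChangeToDirtySuccess", "ChangeToDirtyFail"}),
}


def response_class(command):
    """Map a command name to a key of CONVENTIONAL_RESPONSES, or None."""
    if command in CHX_TO_DIRTY:
        # Assumption: ChxToDirty commands are the "ChangeToDirty commands".
        return "ChangeToDirty"
    if command in CONVENTIONAL_RESPONSES:
        return command
    return None


def is_conventional(command, response):
    """True if the response is a conventional one for the command."""
    key = response_class(command)
    return key is not None and response in CONVENTIONAL_RESPONSES[key]
```

A system model could use `is_conventional` to flag unusual pairings for closer inspection rather than to reject them.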
Table 4–5 shows the system responses to 21264/EV68A commands and 21264/EV68A
reactions.
Table 4–5 System Responses to 21264/EV68A Commands and Reactions
21264/EV68A
CMD           SysDc                 21264/EV68A Action
Rdx           ReadData              This is a normal fill. The cache block is filled and marked clean or
              ReadDataShared        shared based on SysDc.
Rdx           ReadDataShared/Dirty  The cache block is filled and marked dirty/shared. Succeeding store
                                    commands cannot update the block without external reference.
Rdx           ReadDataDirty         The cache block is filled and marked dirty.
Rdx           ReadDataError         The cache block access was to NXM address space. The 21264/EV68A
                                    delivers an all-ones pattern to any load command and evicts the block
                                    from the cache (with associated victim processing). The cache block is
                                    marked invalid.
Rdx           ChangeToDirtySuccess  Both SysDc responses are illegal for read commands.
              ChangeToDirtyFail
RdBlkModx     ReadData              The cache block is filled and marked with a nonwritable status. If the
              ReadDataShared        store instruction that generated the RdBlkModx command is still active
              ReadDataShared/Dirty  (not killed), the 21264/EV68A will retry the instruction, generating
                                    the appropriate ChangeToDirty command. Succeeding store commands
                                    cannot update the block without external reference.
RdBlkModx     ReadDataDirty         The 21264/EV68A performs a normal fill response, and the cache block
                                    becomes writable.
RdBlkModx     ChangeToDirtySuccess  Both SysDc responses are illegal for read/modify commands.
              ChangeToDirtyFail
RdBlkModx     ReadDataError         The cache block command was to NXM address space. The 21264/EV68A
                                    delivers an all-ones pattern to any dependent load command, forces a
                                    fail action on any pending store commands to this block, and any store
                                    to this block is not retried. The Cbox evicts the cache block from the
                                    cache system (with associated victim processing). The cache block is
                                    marked invalid.
ChxToDirty    ReadData              The original data in the Dcache is replaced with the filled data. The
              ReadDataShared        block is not writable, so the 21264/EV68A will retry the store
              ReadDataShared/Dirty  instruction and generate another ChxToDirty class command. To avoid a
                                    potential livelock situation, the STC_ENABLE CSR bit must be set. Any
                                    STx_C instruction to this block is forced to fail. In addition, a
                                    Shared/Dirty response causes the 21264/EV68A to generate a victim for
                                    this block upon eviction.
ChxToDirty    ReadDataDirty         The data in the Dcache is replaced with the filled data. The block is
                                    writable, so the store instruction that generated the original command
                                    can update this block. Any STx_C instruction to this block is forced
                                    to fail. In addition, the 21264/EV68A generates a victim for this
                                    block upon eviction.
ChxToDirty    ReadDataError         Impossible situation. The block must be cached to generate a
                                    ChxToDirty command. Caching the block is not possible because all NXM
                                    fills are filled noncached.
ChxToDirty    ChangeToDirtySuccess  Normal response. ChangeToDirtySuccess makes the block writable. The
                                    21264/EV68A retries the store instruction and updates the Dcache. Any
                                    STx_C instruction associated with this block is allowed to succeed.
ChxToDirty    ChangeToDirtyFail     The MAF entry is retired. Any STx_C instruction associated with the
                                    block is forced to fail. If a STx instruction generated this block,
                                    the 21264/EV68A retries and generates either a RdBlkModx (because the
                                    reference that failed the ChangeToDirty also invalidated the cache by
                                    way of an invalidating probe) or another ChxToDirty command.
InvalToDirty  ReadData              The block is not writable, so the 21264/EV68A will retry the WH64
              ReadDataShared        instruction and generate a ChxToDirty command.