Compaq EV68A User Manual

21264/EV68A Microprocessor Hardware Reference Manual
Part Number: DS–0038B–TE
This manual is directly derived from the internal 21264/EV68A Specifications, Revision 1.1. You can access this hardware reference manual in PDF format from the following site:
Revision/Update Information: Revision 1.1, March 2002
Compaq Computer Corporation, Shrewsbury, Massachusetts
March 2002
The information in this publication is subject to change without notice.
COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL. THIS INFORMATION IS PROVIDED “AS IS” AND COMPAQ COMPUTER CORPORATION DISCLAIMS ANY WARRANTIES, EXPRESS, IMPLIED OR STATUTORY AND EXPRESSLY DISCLAIMS THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR PARTICULAR PURPOSE, GOOD TITLE AND AGAINST INFRINGEMENT.
This publication contains information protected by copyright. No part of this publication may be photocopied or reproduced in any form without prior written consent from Compaq Computer Corporation.
© Compaq Computer Corporation 2002. All rights reserved. Printed in the U.S.A.
COMPAQ, the Compaq logo, the Digital logo, and VAX Registered in United States Patent and Trademark Office.
Pentium is a registered trademark of Intel Corporation.
Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

Table of Contents

Preface
1 Introduction
1.1 The Architecture ..... 1–1
1.1.1 Addressing ..... 1–2
1.1.2 Integer Data Types ..... 1–2
1.1.3 Floating-Point Data Types ..... 1–2
1.2 21264/EV68A Microprocessor Features ..... 1–3
2 Internal Architecture
2.1 21264/EV68A Microarchitecture ..... 2–1
2.1.1 Instruction Fetch, Issue, and Retire Unit ..... 2–2
2.1.1.1 Virtual Program Counter Logic ..... 2–2
2.1.1.2 Branch Predictor ..... 2–3
2.1.1.3 Instruction-Stream Translation Buffer ..... 2–5
2.1.1.4 Instruction Fetch Logic ..... 2–6
2.1.1.5 Register Rename Maps ..... 2–6
2.1.1.6 Integer Issue Queue ..... 2–6
2.1.1.7 Floating-Point Issue Queue ..... 2–7
2.1.1.8 Exception and Interrupt Logic ..... 2–8
2.1.1.9 Retire Logic ..... 2–8
2.1.2 Integer Execution Unit ..... 2–8
2.1.3 Floating-Point Execution Unit ..... 2–10
2.1.4 External Cache and System Interface Unit ..... 2–11
2.1.4.1 Victim Address File and Victim Data File ..... 2–11
2.1.4.2 I/O Write Buffer ..... 2–11
2.1.4.3 Probe Queue ..... 2–11
2.1.4.4 Duplicate Dcache Tag Array ..... 2–11
2.1.5 Onchip Caches ..... 2–11
2.1.5.1 Instruction Cache ..... 2–11
2.1.5.2 Data Cache ..... 2–12
2.1.6 Memory Reference Unit ..... 2–12
2.1.6.1 Load Queue ..... 2–13
2.1.6.2 Store Queue ..... 2–13
2.1.6.3 Miss Address File ..... 2–13
2.1.6.4 Dstream Translation Buffer ..... 2–13
2.1.7 SROM Interface ..... 2–13
2.2 Pipeline Organization ..... 2–13
2.2.1 Pipeline Aborts ..... 2–16
2.3 Instruction Issue Rules ..... 2–16
2.3.1 Instruction Group Definitions ..... 2–17
2.3.2 Ebox Slotting ..... 2–18
2.3.3 Instruction Latencies ..... 2–20
2.4 Instruction Retire Rules ..... 2–21
2.4.1 Floating-Point Divide/Square Root Early Retire ..... 2–22
2.5 Retire of Operate Instructions into R31/F31 ..... 2–22
2.6 Load Instructions to R31 and F31 ..... 2–23
2.6.1 Normal Prefetch: LDBU, LDF, LDG, LDL, LDT, LDWU, HW_LDL Instructions ..... 2–23
2.6.2 Prefetch with Modify Intent: LDS Instruction ..... 2–23
2.6.3 Prefetch, Evict Next: LDQ and HW_LDQ Instructions ..... 2–24
2.7 Special Cases of Alpha Instruction Execution ..... 2–24
2.7.1 Load Hit Speculation ..... 2–24
2.7.2 Floating-Point Store Instructions ..... 2–26
2.7.3 CMOV Instruction ..... 2–26
2.8 Memory and I/O Address Space Instructions ..... 2–27
2.8.1 Memory Address Space Load Instructions ..... 2–27
2.8.2 I/O Address Space Load Instructions ..... 2–27
2.8.3 Memory Address Space Store Instructions ..... 2–28
2.8.4 I/O Address Space Store Instructions ..... 2–29
2.9 MAF Memory Address Space Merging Rules ..... 2–30
2.10 Instruction Ordering ..... 2–30
2.11 Replay Traps ..... 2–31
2.11.1 Mbox Order Traps ..... 2–31
2.11.1.1 Load-Load Order Trap ..... 2–31
2.11.1.2 Store-Load Order Trap ..... 2–31
2.11.2 Other Mbox Replay Traps ..... 2–32
2.12 I/O Write Buffer and the WMB Instruction ..... 2–32
2.12.1 Memory Barrier (MB/WMB/TB Fill Flow) ..... 2–32
2.12.1.1 MB Instruction Processing ..... 2–33
2.12.1.2 WMB Instruction Processing ..... 2–33
2.12.1.3 TB Fill Flow ..... 2–34
2.13 Performance Measurement Support – Performance Counters ..... 2–35
2.14 Floating-Point Control Register ..... 2–35
2.15 AMASK and IMPLVER Instruction Values ..... 2–37
2.15.1 AMASK ..... 2–38
2.15.2 IMPLVER ..... 2–38
2.16 Design Examples ..... 2–38
3 Hardware Interface
3.1 21264/EV68A Microprocessor Logic Symbol ..... 3–1
3.2 21264/EV68A Signal Names and Functions ..... 3–3
3.3 Pin Assignments ..... 3–8
3.4 Mechanical Specifications ..... 3–17
3.5 21264/EV68A Packaging ..... 3–18
4 Cache and External Interfaces
4.1 Introduction to the External Interfaces ..... 4–1
4.1.1 System Interface ..... 4–3
4.1.1.1 Commands and Addresses ..... 4–4
4.1.2 Second-Level Cache (Bcache) Interface ..... 4–4
4.2 Physical Address Considerations ..... 4–4
4.3 Bcache Structure ..... 4–7
4.3.1 Bcache Interface Signals ..... 4–7
4.3.2 System Duplicate Tag Stores ..... 4–7
4.4 Victim Data Buffer ..... 4–8
4.5 Cache Coherency ..... 4–8
4.5.1 Cache Coherency Basics ..... 4–8
4.5.2 Cache Block States ..... 4–9
4.5.3 Cache Block State Transitions ..... 4–10
4.5.4 Using SysDc Commands ..... 4–11
4.5.5 Dcache States and Duplicate Tags ..... 4–13
4.6 Lock Mechanism ..... 4–14
4.6.1 In-Order Processing of LDx_L/STx_C Instructions ..... 4–15
4.6.2 Internal Eviction of LDx_L Blocks ..... 4–15
4.6.3 Liveness and Fairness ..... 4–15
4.6.4 Managing Speculative Store Issues with Multiprocessor Systems ..... 4–16
4.7 System Port ..... 4–16
4.7.1 System Port Pins ..... 4–17
4.7.2 Programming the System Interface Clocks ..... 4–18
4.7.3 21264/EV68A-to-System Commands ..... 4–19
4.7.3.1 Bank Interleave on Cache Block Boundary Mode ..... 4–19
4.7.3.2 Page Hit Mode ..... 4–20
4.7.4 21264/EV68A-to-System Commands Descriptions ..... 4–21
4.7.5 ProbeResponse Commands (Command[4:0] = 00001) ..... 4–24
4.7.6 SysAck and 21264/EV68A-to-System Commands Flow Control ..... 4–25
4.7.7 System-to-21264/EV68A Commands ..... 4–26
4.7.7.1 Probe Commands (Four Cycles) ..... 4–26
4.7.7.2 Data Transfer Commands (Two Cycles) ..... 4–28
4.7.8 Data Movement In and Out of the 21264/EV68A ..... 4–30
4.7.8.1 21264/EV68A Clock Basics ..... 4–30
4.7.8.2 Fast Data Mode ..... 4–31
4.7.8.3 Fast Data Disable Mode ..... 4–33
4.7.8.4 SysDataInValid_L and SysDataOutValid_L ..... 4–34
4.7.8.5 SysFillValid_L ..... 4–35
4.7.8.6 Data Wrapping ..... 4–36
4.7.9 Nonexistent Memory Processing ..... 4–38
4.7.10 Ordering of System Port Transactions ..... 4–40
4.7.10.1 21264/EV68A Commands and System Probes ..... 4–40
4.7.10.2 System Probes and SysDc Commands ..... 4–42
4.8 Bcache Port ..... 4–42
4.8.1 Bcache Port Pins ..... 4–43
4.8.2 Bcache Clocking ..... 4–44
4.8.2.1 Setting the Period of the Cache Clock ..... 4–45
4.8.3 Bcache Transactions ..... 4–47
4.8.3.1 Bcache Data Read and Tag Read Transactions ..... 4–47
4.8.3.2 Bcache Data Write Transactions ..... 4–48
4.8.3.3 Bubbles on the Bcache Data Bus ..... 4–49
4.8.4 Pin Descriptions ..... 4–50
4.8.4.1 BcAdd_H[23:4] ..... 4–51
4.8.4.2 Bcache Control Pins ..... 4–51
4.8.4.3 BcDataInClk_H and BcTagInClk_H ..... 4–53
4.8.5 Bcache Banking ..... 4–53
4.8.6 Disabling the Bcache for Debugging ..... 4–53
4.9 Interrupts ..... 4–54
5 Internal Processor Registers
5.1 Ebox IPRs ..... 5–3
5.1.1 Cycle Counter Register – CC ..... 5–3
5.1.2 Cycle Counter Control Register – CC_CTL ..... 5–3
5.1.3 Virtual Address Register – VA ..... 5–4
5.1.4 Virtual Address Control Register – VA_CTL ..... 5–4
5.1.5 Virtual Address Format Register – VA_FORM ..... 5–5
5.2 Ibox IPRs ..... 5–6
5.2.1 ITB Tag Array Write Register – ITB_TAG ..... 5–6
5.2.2 ITB PTE Array Write Register – ITB_PTE ..... 5–6
5.2.3 ITB Invalidate All Process (ASM=0) Register – ITB_IAP ..... 5–7
5.2.4 ITB Invalidate All Register – ITB_IA ..... 5–7
5.2.5 ITB Invalidate Single Register – ITB_IS ..... 5–7
5.2.6 ProfileMe PC Register – PMPC ..... 5–8
5.2.7 Exception Address Register – EXC_ADDR ..... 5–8
5.2.8 Instruction Virtual Address Format Register – IVA_FORM ..... 5–9
5.2.9 Interrupt Enable and Current Processor Mode Register – IER_CM ..... 5–9
5.2.10 Software Interrupt Request Register – SIRR ..... 5–10
5.2.11 Interrupt Summary Register – ISUM ..... 5–11
5.2.12 Hardware Interrupt Clear Register – HW_INT_CLR ..... 5–12
5.2.13 Exception Summary Register – EXC_SUM ..... 5–13
5.2.14 PAL Base Register – PAL_BASE ..... 5–15
5.2.15 Ibox Control Register – I_CTL ..... 5–15
5.2.16 Ibox Status Register – I_STAT ..... 5–18
5.2.17 Icache Flush Register – IC_FLUSH ..... 5–21
5.2.18 Icache Flush ASM Register – IC_FLUSH_ASM ..... 5–21
5.2.19 Clear Virtual-to-Physical Map Register – CLR_MAP ..... 5–21
5.2.20 Sleep Mode Register – SLEEP ..... 5–21
5.2.21 Process Context Register – PCTX ..... 5–21
5.2.22 Performance Counter Control Register – PCTR_CTL ..... 5–23
5.3 Mbox IPRs ..... 5–25
5.3.1 DTB Tag Array Write Registers 0 and 1 – DTB_TAG0, DTB_TAG1 ..... 5–25
5.3.2 DTB PTE Array Write Registers 0 and 1 – DTB_PTE0, DTB_PTE1 ..... 5–26
5.3.3 DTB Alternate Processor Mode Register – DTB_ALTMODE ..... 5–26
5.3.4 Dstream TB Invalidate All Process (ASM=0) Register – DTB_IAP ..... 5–27
5.3.5 Dstream TB Invalidate All Register – DTB_IA ..... 5–27
5.3.6 Dstream TB Invalidate Single Registers 0 and 1 – DTB_IS0,1 ..... 5–27
5.3.7 Dstream TB Address Space Number Registers 0 and 1 – DTB_ASN0,1 ..... 5–28
5.3.8 Memory Management Status Register – MM_STAT ..... 5–28
5.3.9 Mbox Control Register – M_CTL ..... 5–29
5.3.10 Dcache Control Register – DC_CTL ..... 5–30
5.3.11 Dcache Status Register – DC_STAT ..... 5–31
5.4 Cbox CSRs and IPRs ..... 5–32
5.4.1 Cbox Data Register – C_DATA ..... 5–33
5.4.2 Cbox Shift Register – C_SHFT ..... 5–33
5.4.3 Cbox WRITE_ONCE Chain Description ..... 5–33
5.4.4 Cbox WRITE_MANY Chain Description ..... 5–38
5.4.5 Cbox Read Register (IPR) Description ..... 5–41
6 Privileged Architecture Library Code
6.1 PALcode Description ..... 6–1
6.2 PALmode Environment ..... 6–2
6.3 Required PALcode Function Codes ..... 6–3
6.4 Opcodes Reserved for PALcode ..... 6–3
6.4.1 HW_LD Instruction ..... 6–3
6.4.2 HW_ST Instruction ..... 6–4
6.4.3 HW_RET Instruction ..... 6–5
6.4.4 HW_MFPR and HW_MTPR Instructions ..... 6–6
6.5 Internal Processor Register Access Mechanisms ..... 6–7
6.5.1 IPR Scoreboard Bits ..... 6–8
6.5.2 Hardware Structure of Explicitly Written IPRs ..... 6–8
6.5.3 Hardware Structure of Implicitly Written IPRs ..... 6–9
6.5.4 IPR Access Ordering ..... 6–9
6.5.5 Correct Ordering of Explicit Writers Followed by Implicit Readers ..... 6–10
6.5.6 Correct Ordering of Explicit Readers Followed by Implicit Writers ..... 6–11
6.6 PALshadow Registers ..... 6–11
6.7 PALcode Emulation of the FPCR ..... 6–11
6.7.1 Status Flags ..... 6–12
6.7.2 MF_FPCR ..... 6–12
6.7.3 MT_FPCR ..... 6–12
6.8 PALcode Entry Points ..... 6–12
6.8.1 CALL_PAL Entry Points ..... 6–12
6.8.2 PALcode Exception Entry Points ..... 6–13
6.9 Translation Buffer (TB) Fill Flows ..... 6–14
6.9.1 DTB Fill ..... 6–14
6.9.2 ITB Fill ..... 6–16
6.10 Performance Counter Support ..... 6–17
6.10.1 General Precautions ..... 6–18
6.10.2 Aggregate Mode Programming Guidelines ..... 6–18
6.10.2.1 Aggregate Mode Precautions ..... 6–18
6.10.2.2 Operation ..... 6–19
6.10.2.3 Aggregate Counting Mode Description ..... 6–20
6.10.2.3.1 Cycle counting ..... 6–20
6.10.2.3.2 Retired instructions cycles ..... 6–20
6.10.2.3.3 Bcache miss or long latency probes cycles ..... 6–20
6.10.2.3.4 Mbox replay traps cycles ..... 6–20
6.10.2.4 Counter Modes for Aggregate Mode ..... 6–20
6.10.3 ProfileMe Mode Programming Guidelines ..... 6–20
6.10.3.1 ProfileMe Mode Precautions ..... 6–20
6.10.3.2 Operation ..... 6–21
6.10.3.3 ProfileMe Counting Mode Description ..... 6–23
6.10.3.3.1 Cycle counting ..... 6–23
6.10.3.3.2 Inum retire delay cycles ..... 6–23
6.10.3.3.3 Retired instructions cycles ..... 6–23
6.10.3.3.4 Bcache miss or long latency probes cycles ..... 6–23
6.10.3.3.5 Mbox replay traps cycles ..... 6–23
6.10.3.4 Counter Modes for ProfileMe Mode ..... 6–24
7 Initialization and Configuration
7.1 Power-Up Reset Flow and the Reset_L and DCOK_H Pins ..... 7–1
7.1.1 Power Sequencing and Reset State for Signal Pins ..... 7–3
7.1.2 Clock Forwarding and System Clock Ratio Configuration ..... 7–4
7.1.3 PLL Ramp Up ..... 7–6
7.1.4 BiST and SROM Load and the TestStat_H Pin ..... 7–6
7.1.5 Clock Forward Reset and System Interface Initialization ..... 7–7
7.2 Fault Reset Flow ..... 7–8
7.3 Energy Star Certification and Sleep Mode Flow ..... 7–9
7.4 Warm Reset Flow ..... 7–11
7.5 Array Initialization ..... 7–12
7.6 Initialization Mode Processing ..... 7–12
7.7 External Interface Initialization ..... 7–14
7.8 Internal Processor Register Power-Up Reset State ..... 7–14
7.9 IEEE 1149.1 Test Port Reset ..... 7–16
7.10 Reset State Machine ..... 7–16
7.11 Phase-Lock Loop (PLL) Functional Description ..... 7–19
7.11.1 Differential Reference Clocks ..... 7–19
7.11.2 PLL Output Clocks ..... 7–19
7.11.2.1 GCLK ..... 7–19
7.11.2.2 Differential 21264/EV68A Clocks ..... 7–19
7.11.2.3 Nominal Operating Frequency ..... 7–19
7.11.2.4 Power-Up/Reset Clocking ..... 7–20
8 Error Detection and Error Handling
8.1 Data Error Correction Code ..... 8–2
8.2 Icache Data or Tag Parity Error ..... 8–2
8.3 Dcache Tag Parity Error ..... 8–2
8.4 Dcache Data Single-Bit Correctable ECC Error ..... 8–3
8.4.1 Load Instruction ..... 8–3
8.4.2 Store Instruction (Quadword or Smaller) ..... 8–4
8.4.3 Dcache Victim Extracts ..... 8–4
8.5 Dcache Store Second Error ..... 8–4
8.6 Dcache Duplicate Tag Parity Error ..... 8–4
8.7 Bcache Tag Parity Error ..... 8–5
8.8 Controlling Bcache Block Parity Calculation ..... 8–5
8.9 Bcache Data Single-Bit Correctable ECC Error ..... 8–5
8.9.1 Icache Fill from Bcache ..... 8–5
8.9.2 Dcache Fill from Bcache ..... 8–6
8.9.3 Bcache Victim Read ..... 8–7
8.9.3.1 Bcache Victim Read During a Dcache/Bcache Miss ..... 8–7
8.9.3.2 Bcache Victim Read During an ECB Instruction ..... 8–7
8.10 Memory/System Port Single-Bit Data Correctable ECC Error ..... 8–7
8.10.1 Icache Fill from Memory ..... 8–7
8.10.2 Dcache Fill from Memory ..... 8–8
8.11 Bcache Data Single-Bit Correctable ECC Error on a Probe ..... 8–9
8.12 Double-Bit Fill Errors ..... 8–9
8.13 Error Case Summary ..... 8–10
9 Electrical Data
9.1 Electrical Characteristics ..... 9–1
9.2 DC Characteristics ..... 9–2
9.3 Power Supply Sequencing and Avoiding Potential Failure Mechanisms ..... 9–5
9.4 AC Characteristics ..... 9–6
10 Thermal Management
10.1 Operating Temperature ..... 10–1
10.2 Heat Sink Specifications ..... 10–3
10.3 Thermal Design Considerations ..... 10–6
11 Testability and Diagnostics
11.1 Test Pins ..... 11–1
11.2 SROM/Serial Diagnostic Terminal Port ..... 11–2
11.2.1 SROM Load Operation ..... 11–2
11.2.2 Serial Terminal Port ..... 11–2
11.3 IEEE 1149.1 Port ..... 11–3
11.4 TestStat_H Pin ..... 11–4
11.5 Power-Up Self-Test and Initialization ..... 11–5
11.5.1 Built-in Self-Test ..... 11–5
11.5.2 SROM Initialization ..... 11–5
11.5.2.1 Serial Instruction Cache Load Operation ..... 11–6
11.6 Notes on IEEE 1149.1 Operation and Compliance ..... 11–7
A Alpha Instruction Set
A.1 Alpha Instruction Summary ..... A–1
A.2 Reserved Opcodes ..... A–8
A.2.1 Opcodes Reserved for Compaq ..... A–8
A.2.2 Opcodes Reserved for PALcode ..... A–9
A.3 IEEE Floating-Point Instructions ..... A–9
A.4 VAX Floating-Point Instructions ..... A–11
A.5 Independent Floating-Point Instructions ..... A–11
A.6 Opcode Summary ..... A–12
A.7 Required PALcode Function Codes ..... A–13
A.8 IEEE Floating-Point Conformance ..... A–14
B 21264/EV68A Boundary-Scan Register
B.1 Boundary-Scan Register ..... B–1
B.1.1 BSDL Description of the Alpha 21264/EV68A Boundary-Scan Register ..... B–1
C Serial Icache Load Predecode Values
D PALcode Restrictions and Guidelines
D.1 Restriction 1: Reset Sequence Required by Retire Logic and Mapper ..... D–1
D.2 Restriction 2: No Multiple Writers to IPRs in Same Scoreboard Group ..... D–8
D.3 Restriction 4: No Writers and Readers to IPRs in Same Scoreboard Group ..... D–8
D.4 Guideline 6: Avoid Consecutive Read-Modify-Write-Read-Modify-Write ..... D–9
D.5 Restriction 7: Replay Trap, Interrupt Code Sequence, and STF/ITOF ..... D–9
D.6 Restriction 9: PALmode Istream Address Ranges ..... D–10
D.7 Restriction 10: Duplicate IPR Mode Bits ..... D–10
D.8 Restriction 11: Ibox IPR Update Synchronization ..... D–11
D.9 Restriction 12: MFPR of Implicitly-Written IPRs EXC_ADDR, IVA_FORM, and EXC_SUM ..... D–11
D.10 Restriction 13: DTB Fill Flow Collision ..... D–11
D.11 Restriction 14: HW_RET ..... D–11
D.12 Guideline 16: JSR-BAD VA ..... D–12
D.13 Restriction 17: MTPR to DTB_TAG0/DTB_PTE0/DTB_TAG1/DTB_PTE1 ..... D–12
D.14 Restriction 18: No FP Operates, FP Conditional Branches, FTOI, or STF in Same Fetch Block as HW_MTPR ..... D–12
D.15 Restriction 19: HW_RET/STALL After Updating the FPCR by way of MT_FPCR in PALmode ..... D–12
D.16 Guideline 20: I_CTL[SBE] Stream Buffer Enable ..... D–12
D.17 Restriction 21: HW_RET/STALL After HW_MTPR ASN0/ASN1 ..... D–12
D.18 Restriction 22: HW_RET/STALL After HW_MTPR IS0/IS1 ..... D–13
D.19 Restriction 23: HW_ST/P/CONDITIONAL Does Not Clear the Lock Flag ..... D–13
D.20 Restriction 24: HW_RET/STALL After HW_MTPR IC_FLUSH, IC_FLUSH_ASM, CLEAR_MAP ..... D–14
D.21 Restriction 25: HW_MTPR ITB_IA After Reset ..... D–14
D.22 Guideline 26: Conditional Branches in PALcode ..... D–14
D.23 Restriction 27: Reset of ‘Force-Fail Lock Flag’ State in PALcode ..... D–15
D.24 Restriction 28: Enforce Ordering Between IPRs Implicitly Written by Loads and Subsequent Loads ..... D–15
D.25 Guideline 29: JSR, JMP, RET, and JSR_COR in PALcode ..... D–15
D.26 Restriction 30: HW_MTPR and HW_MFPR to the Cbox CSR ..... D–15
D.27 Restriction 31: I_CTL[VA_48] Update ..... D–17
D.28 Restriction 32: PCTR_CTL Update ..... D–17
D.29 Restriction 33: HW_LD Physical/Lock Use ..... D–18
D.30 Restriction 34: Writing Multiple ITB Entries in the Same PALcode Flow ..... D–18
D.31 Guideline 35: HW_INT_CLR Update ..... D–18
D.32 Restriction 36: Updating I_CTL[SDE] ..... D–18
D.33 Restriction 37: Updating VA_CTL[VA_48] ..... D–18
D.34 Restriction 38: Updating PCTR_CTL ..... D–18
D.35 Guideline 39: Writing Multiple DTB Entries in the Same PAL Flow ..... D–19
D.36 Restriction 40: Scrubbing a Single-Bit Error ..... D–19
D.37 Restriction 41: MTPR ITB_TAG, MTPR ITB_PTE Must be in the Same Fetch Block ..... D–21
D.38 Restriction 42: Updating VA_CTL, CC_CTL, or CC IPRs ..... D–21
D.39 Restriction 43: No Trappable Instructions Along with HW_MTPR ..... D–21
D.40 Restriction 44: Not Applicable to the 21264/EV68A ..... D–21
D.41 Restriction 45: No HW_JMP or JMP Instructions in PALcode ..... D–21
D.42 Restriction 46: Avoiding Livelocks in Speculative Load CRD Handlers ..... D–22
D.43 Restriction 47: Cache Eviction for Single-Bit Cache Errors ..... D–22
D.44 Restriction 48: MB Bracketing of Dcache Writes to Force Bad Data ECC and Force Bad Tag Parity ..... D–24
E 21264/EV68A-to-Bcache Pin Interface
E.1 Forwarding Clock Pin Groupings ..... E–1
E.2 Late-Write Non-Bursting SSRAMs ..... E–2
E.3 Dual-Data Rate SSRAMs ..... E–3
Glossary
Index

Figures

2–1 21264/EV68A Block Diagram ........ 2–3
2–2 Branch Predictor ........ 2–4
2–3 Local Predictor ........ 2–4
2–4 Global Predictor ........ 2–5
2–5 Choice Predictor ........ 2–5
2–6 Integer Execution Unit—Clusters 0 and 1 ........ 2–9
2–7 Floating-Point Execution Units ........ 2–10
2–8 Pipeline Organization ........ 2–14
2–9 Pipeline Timing for Integer Load Instructions ........ 2–24
2–10 Pipeline Timing for Floating-Point Load Instructions ........ 2–25
2–11 Floating-Point Control Register ........ 2–36
2–12 Typical Uniprocessor Configuration ........ 2–39
2–13 Typical Multiprocessor Configuration ........ 2–39
3–1 21264/EV68A Microprocessor Logic Symbol ........ 3–2
3–2 Package Dimensions ........ 3–17
3–3 21264/EV68A Top View (Pin Down) ........ 3–18
3–4 21264/EV68A Bottom View (Pin Up) ........ 3–19
4–1 21264/EV68A System and Bcache Interfaces ........ 4–3
4–2 21264/EV68A Bcache Interface Signals ........ 4–7
4–3 Cache Subset Hierarchy ........ 4–9
4–4 System Interface Signals ........ 4–17
4–5 Fast Transfer Timing Example ........ 4–32
4–6 SysFillValid_L Timing ........ 4–36
5–1 Cycle Counter Register ........ 5–3
5–2 Cycle Counter Control Register ........ 5–3
5–3 Virtual Address Register ........ 5–4
5–4 Virtual Address Control Register ........ 5–4
5–5 Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 0) ........ 5–5
5–6 Virtual Address Format Register (VA_48 = 1, VA_FORM_32 = 0) ........ 5–6
5–7 Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 1) ........ 5–6
5–8 ITB Tag Array Write Register ........ 5–6
5–9 ITB PTE Array Write Register ........ 5–7
5–10 ITB Invalidate Single Register ........ 5–7
5–11 ProfileMe PC Register ........ 5–8
5–12 Exception Address Register ........ 5–8
5–13 Instruction Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 0) ........ 5–9
5–14 Instruction Virtual Address Format Register (VA_48 = 1, VA_FORM_32 = 0) ........ 5–9
5–15 Instruction Virtual Address Format Register (VA_48 = 0, VA_FORM_32 = 1) ........ 5–9
5–16 Interrupt Enable and Current Processor Mode Register ........ 5–10
5–17 Software Interrupt Request Register ........ 5–11
5–18 Interrupt Summary Register ........ 5–11
5–19 Hardware Interrupt Clear Register ........ 5–12
5–20 Exception Summary Register ........ 5–14
5–21 PAL Base Register ........ 5–15
5–22 Ibox Control Register ........ 5–16
5–23 Ibox Status Register ........ 5–19
5–24 Process Context Register ........ 5–22
5–25 Performance Counter Control Register ........ 5–23
5–26 DTB Tag Array Write Registers 0 and 1 ........ 5–25
5–27 DTB PTE Array Write Registers 0 and 1 ........ 5–26
5–28 DTB Alternate Processor Mode Register ........ 5–26
5–29 Dstream Translation Buffer Invalidate Single Registers ........ 5–27
5–30 Dstream Translation Buffer Address Space Number Registers 0 and 1 ........ 5–28
5–31 Memory Management Status Register ........ 5–28
5–32 Mbox Control Register ........ 5–29
5–33 Dcache Control Register ........ 5–31
5–34 Dcache Status Register ........ 5–32
5–35 Cbox Data Register ........ 5–33
5–36 Cbox Shift Register ........ 5–33
5–37 WRITE_MANY Chain Write Transaction Example ........ 5–39
6–1 HW_LD Instruction Format ........ 6–4
6–2 HW_ST Instruction Format ........ 6–4
6–3 HW_RET Instruction Format ........ 6–6
6–4 HW_MFPR and HW_MTPR Instructions Format ........ 6–6
6–5 Single-Miss DTB Instructions Flow Example ........ 6–14
6–6 ITB Miss Instructions Flow Example ........ 6–16
7–1 Power-Up Timing Sequence ........ 7–3
7–2 Fault Reset Sequence of Operation ........ 7–9
7–3 Sleep Mode Sequence of Operation ........ 7–11
7–4 Example for Initializing Bcache ........ 7–13
7–5 21264/EV68A Reset State Machine State Diagram ........ 7–17
10–1 Type 1 Heat Sink ........ 10–3
10–2 Type 2 Heat Sink ........ 10–4
10–3 Type 3 Heat Sink ........ 10–5
11–1 TestStat_H Pin Timing During Power-Up Built-In Self-Test (BiST) ........ 11–5
11–2 TestStat_H Pin Timing During Built-In Self-Initialization (BiSI) ........ 11–5
11–3 SROM Content Map ........ 11–6

Tables

1–1 Integer Data Types ........ 1–2
2–1 Pipeline Abort Delay (GCLK Cycles) ........ 2–16
2–2 Instruction Name, Pipeline, and Types ........ 2–17
2–3 Instruction Group Definitions and Pipeline Unit ........ 2–18
2–4 Instruction Class Latency in Cycles ........ 2–20
2–5 Minimum Retire Latencies for Instruction Classes ........ 2–21
2–6 Instructions Retired Without Execution ........ 2–23
2–7 Rules for I/O Address Space Load Instruction Data Merging ........ 2–28
2–8 Rules for I/O Address Space Store Instruction Data Merging ........ 2–29
2–9 MAF Merging Rules ........ 2–30
2–10 Memory Reference Ordering ........ 2–30
2–11 I/O Reference Ordering ........ 2–31
2–12 TB Fill Flow Example Sequence 1 ........ 2–34
2–13 TB Fill Flow Example Sequence 2 ........ 2–34
2–14 Floating-Point Control Register Fields ........ 2–36
2–15 21264/EV68A AMASK Values ........ 2–38
2–16 AMASK Bit Assignments ........ 2–38
3–1 Signal Pin Types Definitions ........ 3–3
3–2 21264/EV68A Signal Descriptions ........ 3–3
3–3 21264/EV68A Signal Descriptions by Function ........ 3–6
3–4 Pin List Sorted by Signal Name ........ 3–8
3–5 Pin List Sorted by PGA Location ........ 3–12
3–6 Ground and Power (VSS and VDD) Pin List ........ 3–16
4–1 Translation of Internal References to External Interface Reference ........ 4–5
4–2 21264/EV68A-Supported Cache Block States ........ 4–9
4–3 Cache Block State Transitions ........ 4–10
4–4 System Responses to 21264/EV68A Commands ........ 4–10
4–5 System Responses to 21264/EV68A Commands and Reactions ........ 4–11
4–6 System Port Pins ........ 4–17
4–7 Programming Values for System Interface Clocks ........ 4–18
4–8 Program Values for Data-Sample/Drive CSRs ........ 4–18
4–9 Forwarded Clocks and Frame Clock Ratio ........ 4–19
4–10 Bank Interleave on Cache Block Boundary Mode of Operation ........ 4–19
4–11 Page Hit Mode of Operation ........ 4–20
4–12 21264/EV68A-to-System Command Fields Definitions ........ 4–20
4–13 Maximum Physical Address for Short Bus Format ........ 4–21
4–14 21264/EV68A-to-System Commands Descriptions ........ 4–21
4–15 Programming INVAL_TO_DIRTY_ENABLE[1:0] ........ 4–23
4–16 Programming SET_DIRTY_ENABLE[2:0] ........ 4–24
4–17 21264/EV68A ProbeResponse Command ........ 4–24
4–18 ProbeResponse Fields Descriptions ........ 4–25
4–19 System-to-21264/EV68A Probe Commands ........ 4–26
4–20 System-to-21264/EV68A Probe Commands Fields Descriptions ........ 4–27
4–21 Data Movement Selection by Probe[4:3] ........ 4–27
4–22 Next Cache Block State Selection by Probe[2:0] ........ 4–27
4–23 Data Transfer Command Format ........ 4–28
4–24 SysDc[4:0] Field Description ........ 4–29
4–25 SYSCLK Cycles Between SysAddOut and SysData ........ 4–32
4–26 Cbox CSR SYSDC_DELAY[4:0] Examples ........ 4–33
4–27 Four Timing Examples ........ 4–34
4–28 Data Wrapping Rules ........ 4–36
4–29 System Wrap and Deliver Data ........ 4–37
4–30 Wrap Interleave Order ........ 4–37
4–31 Wrap Order for Double-Pumped Data Transfers ........ 4–38
4–32 21264/EV68A Commands with NXM Addresses and System Response ........ 4–39
4–33 21264/EV68A Response to System Probe and In-Flight Command Interaction ........ 4–41
4–34 Rules for System Control of Cache Status Update Order ........ 4–42
4–35 Range of Maximum Bcache Clock Ratios ........ 4–43
4–36 Bcache Port Pins ........ 4–43
4–37 BC_CPU_CLK_DELAY[1:0] Values ........ 4–45
4–38 BC_CLK_DELAY[1:0] Values ........ 4–45
4–39 Program Values to Set the Cache Clock Period (Single-Data) ........ 4–46
4–40 Program Values to Set the Cache Clock Period (Dual-Data Rate) ........ 4–46
4–41 Data-Sample/Drive Cbox CSRs ........ 4–47
4–42 Programming the Bcache to Support Each Size of the Bcache ........ 4–51
4–43 Programming the Bcache Control Pins ........ 4–51
4–44 Control Pin Assertion for RAM_TYPE A ........ 4–51
4–45 Control Pin Assertion for RAM_TYPE B ........ 4–52
4–46 Control Pin Assertion for RAM_TYPE C ........ 4–52
4–47 Control Pin Assertion for RAM_TYPE D ........ 4–52
5–1 Internal Processor Registers ........ 5–1
5–2 Cycle Counter Control Register Fields Description ........ 5–4
5–3 Virtual Address Control Register Fields Description ........ 5–5
5–4 ProfileMe PC Fields Description ........ 5–8
5–5 IER_CM Register Fields Description ........ 5–10
5–6 Software Interrupt Request Register Fields Description ........ 5–11
5–7 Interrupt Summary Register Fields Description ........ 5–12
5–8 Hardware Interrupt Clear Register Fields Description ........ 5–13
5–9 Exception Summary Register Fields Description ........ 5–14
5–10 PAL Base Register Fields Description ........ 5–15
5–11 Ibox Control Register Fields Description ........ 5–16
5–12 Ibox Status Register Fields Description ........ 5–19
5–13 IPR Index Bits and Register Fields ........ 5–21
5–14 Process Context Register Fields Description ........ 5–22
5–15 Performance Counter Control Register Fields Description ........ 5–23
5–16 Performance Counter Control Register Input Select Fields ........ 5–25
5–17 DTB Alternate Processor Mode Register Fields Description ........ 5–26
5–18 Memory Management Status Register Fields Description ........ 5–28
5–19 Mbox Control Register Fields Description ........ 5–30
5–20 Dcache Control Register Fields Description ........ 5–31
5–21 Dcache Status Register Fields Description ........ 5–32
5–22 Cbox Data Register Fields Description ........ 5–33
5–23 Cbox Shift Register Fields Description ........ 5–33
5–24 Cbox WRITE_ONCE Chain Order ........ 5–34
5–25 Cbox WRITE_MANY Chain Order ........ 5–39
5–26 Cbox Read IPR Fields Description ........ 5–41
6–1 Required PALcode Function Codes ........ 6–3
6–2 Opcodes Reserved for PALcode ........ 6–3
6–3 HW_LD Instruction Fields Descriptions ........ 6–4
6–4 HW_ST Instruction Fields Descriptions ........ 6–5
6–5 HW_RET Instruction Fields Descriptions ........ 6–6
6–6 HW_MFPR and HW_MTPR Instructions Fields Descriptions ........ 6–7
6–7 Paired Instruction Fetch Order ........ 6–9
6–8 PALcode Exception Entry Locations ........ 6–13
6–9 IPRs Used for Performance Counter Support ........ 6–18
6–10 Aggregate Mode Returned IPR Contents ........ 6–19
6–11 Aggregate Mode Performance Counter IPR Input Select Fields ........ 6–20
6–12 CMOV Decomposed ........ 6–21
6–13 ProfileMe Mode Returned IPR Contents ........ 6–22
6–14 ProfileMe Mode PCTR_CTL Input Select Fields ........ 6–24
7–1 21264/EV68A Reset State Machine Major Operations ........ 7–1
7–2 Signal Pin Reset State ........ 7–3
7–3 Pin Signal Names and Initialization State ........ 7–5
7–4 Power-Up Flow Signals and Their Constraints ........ 7–7
7–5 Effect on IPRs After Fault Reset ........ 7–8
7–6 Effect on IPRs After Transition Through Sleep Mode ........ 7–10
7–7 Signals and Constraints for the Sleep Mode Sequence ........ 7–11
7–8 Effect on IPRs After Warm Reset ........ 7–11
7–9 WRITE_MANY Chain CSR Values for Bcache Initialization ........ 7–12
7–10 Internal Processor Registers at Power-Up Reset State ........ 7–14
7–11 21264/EV68A Reset State Machine State Descriptions ........ 7–17
7–12 Differential Reference Clock Frequencies in Full-Speed Lock ........ 7–20
8–1 21264/EV68A Error Detection Mechanisms ........ 8–1
8–2 64-Bit Data and Check Bit ECC Code ........ 8–2
8–3 Error Case Summary ........ 8–10
9–1 Maximum Electrical Ratings ........ 9–1
9–2 Signal Types ........ 9–2
9–3 VDD (I_DC_POWER) ........ 9–3
9–4 Input DC Reference Pin (I_DC_REF) ........ 9–3
9–5 Input Differential Amplifier Receiver (I_DA) ........ 9–3
9–6 Input Differential Amplifier Clock Receiver (I_DA_CLK) ........ 9–3
9–7 Pin Type: Open-Drain Output Driver (O_OD) ........ 9–4
9–8 Bidirectional, Differential Amplifier Receiver, Open-Drain Output Driver (B_DA_OD) ........ 9–4
9–9 Pin Type: Open-Drain Driver for Test Pins (O_OD_TP) ........ 9–4
9–10 Bidirectional, Differential Amplifier Receiver, Push-Pull Output Driver (B_DA_PP) ........ 9–4
9–11 Push-Pull Output Driver (O_PP) ........ 9–5
9–12 Push-Pull Output Clock Driver (O_PP_CLK) ........ 9–5
9–13 AC Specifications ........ 9–7
10–1 Operating Temperature at Heat Sink Center (Tc) ........ 10–1
10–2 θca at Various Airflows for 21264/EV68A ........ 10–2
10–3 Maximum Ta for 21264/EV68A @ 750 MHz and @ 1.7 V with Various Airflows ........ 10–2
10–4 Maximum Ta for 21264/EV68A @ 833 MHz and @ 1.7 V with Various Airflows ........ 10–2
10–5 Maximum Ta for 21264/EV68A @ 875 MHz and @ 1.7 V with Various Airflows ........ 10–2
10–6 Maximum Ta for 21264/EV68A @ 940 MHz and @ 1.7 V with Various Airflows ........ 10–2
11–1 Dedicated Test Port Pins ........ 11–1
11–2 IEEE 1149.1 Instructions and Opcodes ........ 11–3
11–3 TAP Controller State Machine ........ 11–4
11–4 Icache Bit Fields in an SROM Line ........ 11–7
A–1 Instruction Format and Opcode Notation ........ A–1
A–2 Architecture Instructions ........ A–2
A–3 Opcodes Reserved for Compaq ........ A–8
A–4 Opcodes Reserved for PALcode ........ A–9
A–5 IEEE Floating-Point Instruction Function Codes ........ A–9
A–6 VAX Floating-Point Instruction Function Codes ........ A–11
A–7 Independent Floating-Point Instruction Function Codes ........ A–12
A–8 Opcode Summary ........ A–12
A–9 Key to Opcode Summary Used in Table A–8 ........ A–13
A–10 Required PALcode Function Codes ........ A–13
A–11 Exceptional Input and Output Conditions ........ A–15
E–1 Bcache Forwarding Clock Pin Groupings ........ E–1
E–2 Late-Write Non-Bursting SSRAMs Data Pin Usage ........ E–2
E–3 Late-Write Non-Bursting SSRAMs Tag Pin Usage ........ E–2
E–4 Dual-Data Rate SSRAM Data Pin Usage ........ E–3
E–5 Dual-Data Rate SSRAM Tag Pin Usage ........ E–4

Preface

Audience
This manual is for system designers and programmers who use the Alpha 21264/EV68A microprocessor (referred to as the 21264/EV68A).
Content
This manual contains the following chapters and appendixes:
Chapter 1, Introduction, introduces the 21264/EV68A and provides an overview of the Alpha architecture.
Chapter 2, Internal Architecture, describes the major hardware functions and the internal chip architecture. It describes performance measurement facilities, coding rules, and design examples.
Chapter 3, Hardware Interface, lists and describes the internal hardware interface signals, and provides mechanical data and packaging information, including signal pin lists.
Chapter 4, Cache and External Interfaces, describes the external bus functions and transactions, lists bus commands, and describes the clock functions.
Chapter 5, Internal Processor Registers, lists and describes the internal processor register set.
Chapter 6, Privileged Architecture Library Code, describes the privileged architecture library code (PALcode).
Chapter 7, Initialization and Configuration, describes the initialization and configuration sequence.
Chapter 8, Error Detection and Error Handling, describes error detection and error handling.
Chapter 9, Electrical Data, provides electrical data and describes signal integrity issues.
Chapter 10, Thermal Management, provides information about thermal management.
Chapter 11, Testability and Diagnostics, describes chip and system testability features.
Appendix A, Alpha Instruction Set, summarizes the Alpha instruction set.
Appendix B, 21264/EV68A Boundary-Scan Register, presents the BSDL description of the 21264/EV68A boundary-scan register.
Appendix C, Serial Icache Load Predecode Values, provides a pointer to the Alpha Motherboards Software Developer’s Kit (SDK), which contains this information.
Appendix D, PALcode Restrictions and Guidelines, lists restrictions and guidelines that must be adhered to when generating PALcode.
Appendix E, 21264/EV68A-to-Bcache Pin Interface, provides the pin interface between the 21264/EV68A and Bcache SSRAMs.
The Glossary lists and defines terms associated with the 21264/EV68A.
An Index is provided at the end of the document.
Documentation Included by Reference
The companion volume to this manual, the Alpha Architecture Reference Manual, Fourth Edition, can be accessed from the following website: ftp.compaq.com/pub/products/alphaCPUdocs.
Terminology and Conventions
This section defines the abbreviations, terminology, and other conventions used throughout this document.
Abbreviations
Binary Multiples
The abbreviations K, M, and G (kilo, mega, and giga) represent binary multiples and have the following values.
K = 2¹⁰ (1024)
M = 2²⁰ (1,048,576)
G = 2³⁰ (1,073,741,824)
For example:
2KB = 2 kilobytes = 2 × 2¹⁰ bytes
4MB = 4 megabytes = 4 × 2²⁰ bytes
8GB = 8 gigabytes = 8 × 2³⁰ bytes
2K pixels = 2 kilopixels = 2 × 2¹⁰ pixels
4M pixels = 4 megapixels = 4 × 2²⁰ pixels
Register Access
The abbreviations used to indicate the type of access to register fields and bits have the following definitions:
Abbreviation Meaning
IGN Ignore
Bits and fields specified are ignored on writes.
MBZ Must Be Zero
Software must never place a nonzero value in bits and fields specified as MBZ. A nonzero read produces an Illegal Operand exception. Also, MBZ fields are reserved for future use.
RAZ Read As Zero
Bits and fields return a zero when read.
RC Read Clears
Bits and fields are cleared when read. Unless otherwise specified, such bits cannot be written.
RES Reserved
Bits and fields are reserved by Compaq and should not be used; however, zeros can be written to reserved fields that cannot be masked.
RO Read Only
The value may be read by software. It is written by hardware. Software write operations are ignored.
RO,n Read Only, and takes the value n at power-on reset.
The value may be read by software. It is written by hardware. Software write operations are ignored.
RW Read/Write
Bits and fields can be read and written.
RW,n Read/Write, and takes the value n at power-on reset.
Bits and fields can be read and written.
W1C Write One to Clear
If read operations are allowed to the register, then the value may be read by software. If it is a write-only register, then a read operation by software returns an UNPREDICTABLE result. Software write operations of a 1 cause the bit to be cleared by hardware. Software write operations of a 0 do not modify the state of the bit.
W1S Write One to Set
If read operations are allowed to the register, then the value may be read by software. If it is a write-only register, then a read operation by software returns an UNPREDICTABLE result. Software write operations of a 1 cause the bit to be set by hardware. Software write operations of a 0 do not modify the state of the bit.
WO Write Only
Bits and fields can be written but not read.
WO,n Write Only, and takes the value n at power-on reset.
Bits and fields can be written but not read.
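The write semantics of a few of these access types can be illustrated with a small software model. The following Python sketch is illustrative only: the function name write_field, its parameters, and the choice of which access types to model (RW, RO, W1C, W1S) are assumptions made for this example and are not part of the 21264/EV68A definition.

```python
def write_field(current: int, written: int, access: str, width: int = 64) -> int:
    """Return the field value after a software write (hypothetical model).

    current -- value held by the field before the write
    written -- value software attempts to write
    access  -- one of "RW", "RO", "W1C", "W1S"
    width   -- field width in bits (assumed parameter for this sketch)
    """
    mask = (1 << width) - 1
    current &= mask
    written &= mask
    if access == "RW":
        # Read/Write: the written value replaces the old one.
        return written
    if access == "RO":
        # Read Only: software write operations are ignored.
        return current
    if access == "W1C":
        # Write One to Clear: each 1 written clears that bit; 0s leave bits alone.
        return current & ~written & mask
    if access == "W1S":
        # Write One to Set: each 1 written sets that bit; 0s leave bits alone.
        return current | written
    raise ValueError(f"access type not modeled in this sketch: {access}")


if __name__ == "__main__":
    print(bin(write_field(0b1010, 0b0010, "W1C", width=4)))
    print(bin(write_field(0b1010, 0b0101, "W1S", width=4)))
```

Writing zeros to a W1C or W1S field leaves it unchanged, which is why such registers can be updated without a read-modify-write sequence.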
Sign extension
SEXT(x) means x is sign-extended to the required size.
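As an illustration of SEXT, the following Python sketch sign-extends the low-order bits of a value to a wider size. The function name sext and its parameters from_bits and to_bits are assumptions for this example; the manual itself leaves the source and destination sizes to context.

```python
def sext(x: int, from_bits: int, to_bits: int = 64) -> int:
    """Sign-extend the low from_bits of x to to_bits (hypothetical helper).

    The result is returned as an unsigned integer of to_bits bits.
    """
    if not 0 < from_bits <= to_bits:
        raise ValueError("require 0 < from_bits <= to_bits")
    x &= (1 << from_bits) - 1          # keep only the source bits
    sign = 1 << (from_bits - 1)        # weight of the source sign bit
    signed = (x ^ sign) - sign         # signed value of the source field
    return signed & ((1 << to_bits) - 1)


if __name__ == "__main__":
    print(hex(sext(0x80, 8, 16)))
    print(hex(sext(0x7F, 8, 16)))
```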
Addresses
Unless otherwise noted, all addresses and offsets are hexadecimal.
Aligned and Unaligned
The terms aligned and naturally aligned are interchangeable and refer to data objects that are powers of two in size. An aligned datum of size 2ⁿ is stored in memory at a byte address that is a multiple of 2ⁿ; that is, one that has n low-order zeros. For example, an aligned 64-byte stack frame has a memory address that is a multiple of 64.
A datum of size 2ⁿ is unaligned if it is stored in a byte address that is not a multiple of 2ⁿ.
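The alignment rule above amounts to checking that the n low-order address bits are zero. A minimal Python sketch follows; the function name is_aligned is an assumption for this example.

```python
def is_aligned(address: int, size: int) -> bool:
    """Return True if a datum of the given size is naturally aligned.

    size must be a power of two (2**n); the address is aligned when its
    n low-order bits are all zero.
    """
    if size <= 0 or (size & (size - 1)) != 0:
        raise ValueError("size must be a power of two")
    return (address & (size - 1)) == 0


if __name__ == "__main__":
    # A 64-byte stack frame: 0x1040 is a multiple of 64, 0x1020 is not.
    print(is_aligned(0x1040, 64), is_aligned(0x1020, 64))
```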
Bit Notation
Multiple-bit fields can include contiguous and noncontiguous bits contained in square brackets ([]). Multiple contiguous bits are indicated by a pair of numbers separated by a colon [:]. For example, [9:7,5,2:0] specifies bits 9, 8, 7, 5, 2, 1, and 0. Similarly, single bits are frequently indicated with square brackets. For example, [27] specifies bit 27. See also Field Notation.
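To make the notation concrete, the following Python sketch expands a bit specification into the list of bit numbers it names. The function name expand_bits is an assumption for this example, and the parser accepts only the simple forms shown in this section.

```python
from typing import List


def expand_bits(spec: str) -> List[int]:
    """Expand a bit specification such as "[9:7,5,2:0]" into bit numbers."""
    text = spec.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("bit specification must be enclosed in square brackets")
    bits: List[int] = []
    for part in text[1:-1].split(","):
        part = part.strip()
        if ":" in part:
            # Contiguous bits: high bit first, low bit second, both inclusive.
            high, low = (int(p) for p in part.split(":"))
            if high < low:
                raise ValueError("expected high:low ordering")
            bits.extend(range(high, low - 1, -1))
        else:
            # A single bit number.
            bits.append(int(part))
    return bits


if __name__ == "__main__":
    print(expand_bits("[9:7,5,2:0]"))
    print(expand_bits("[27]"))
```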
Caution
Cautions indicate potential damage to equipment or loss of data.
Data Units
The following data unit terminology is used throughout this manual.
Term      Words  Bytes  Bits  Other
Byte      ½      1      8     —
Word      1      2      16    —
Longword  2      4      32    Dword
Quadword  4      8      64    2 longwords
Do Not Care (X)
A capital X represents any valid value.
External
Unless otherwise stated, external means not contained in the chip.
Field Notation
The names of single-bit and multiple-bit fields can be used rather than the actual bit numbers (see Bit Notation). When the field name is used, it is contained in square brackets ([]). For example, RegisterName[LowByte] specifies RegisterName[7:0].
Note
Notes emphasize particularly important information.
Numbering
All numbers are decimal or hexadecimal unless otherwise indicated. The prefix 0x indicates a hexadecimal number. For example, 19 is decimal, but 0x19 and 0x19A are hexadecimal (also see Addresses). Otherwise, the base is indicated by a subscript; for example, 100₂ is a binary number.
Ranges and Extents
Ranges are specified by a pair of numbers separated by two periods (..) and are inclusive. For example, a range of integers 0..4 includes the integers 0, 1, 2, 3, and 4.
Extents are specified by a pair of numbers in square brackets ([]) separated by a colon (:) and are inclusive. Bit fields are often specified as extents. For example, bits [7:3] specifies bits 7, 6, 5, 4, and 3.
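Both conventions are inclusive at each end, which is easy to get wrong in languages whose ranges exclude the upper bound. The following Python sketch shows one way to model them; the function names range_values and extract_extent are assumptions for this example.

```python
from typing import List


def range_values(low: int, high: int) -> List[int]:
    """Integers in the inclusive range low..high."""
    return list(range(low, high + 1))


def extract_extent(value: int, high: int, low: int) -> int:
    """Return the bits of value named by the inclusive extent [high:low]."""
    if high < low or low < 0:
        raise ValueError("expected high >= low >= 0")
    width = high - low + 1             # inclusive extent, so add one
    return (value >> low) & ((1 << width) - 1)


if __name__ == "__main__":
    print(range_values(0, 4))
    print(hex(extract_extent(0xA5, 7, 3)))
```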
Register Figures
The gray areas in register figures indicate reserved or unused bits and fields. Bit ranges that are coupled with the field name specify the bits of the named field that
are included in the register. The bit range may, but need not necessarily, correspond to the bitExtent in theregister.See the explanationabove Table 5–1 formore information.
Signal Names
The following examples describe signal-name conventions used in this document.
AlphaSignal[n:n]      Boldface, mixed-case type denotes signal names that are assigned internal and external to the 21264/EV68A (that is, the signal traverses a chip interface pin).
AlphaSignal_x[n:n]    When a signal has high and low assertion states, a lowercase italic x represents the assertion states. For example, SignalName_x[3:0] represents SignalName_H[3:0] and SignalName_L[3:0].
UNDEFINED
Operations specified as UNDEFINED may vary from moment to moment, implementation to implementation, and instruction to instruction within implementations. The operation may vary in effect from nothing to stopping system operation.
UNDEFINED operations may halt the processor or cause it to lose information. However, UNDEFINED operations must not cause the processor to hang, that is, reach an unhalted state from which there is no transition to a normal state in which the machine executes instructions.
UNPREDICTABLE
UNPREDICTABLE results or occurrences do not disrupt the basic operation of the processor; it continues to execute instructions in its normal manner. Further:
Results or occurrences specified as UNPREDICTABLE may vary from moment to moment, implementation to implementation, and instruction to instruction within implementations. Software can never depend on results specified as UNPREDICTABLE.
An UNPREDICTABLE result may acquire an arbitrary value subject to a few constraints. Such a result may be an arbitrary function of the input operands or of any state information that is accessible to the process in its current access mode. UNPREDICTABLE results may be unchanged from their previous values.
Operations that produce UNPREDICTABLE results may also produce exceptions.
An occurrence specified as UNPREDICTABLE may happen or not based on an arbitrary choice function. The choice function is subject to the same constraints as are UNPREDICTABLE results and, in particular, must not constitute a security hole.
Specifically, UNPREDICTABLE results must not depend upon, or be a function of, the contents of memory locations or registers that are inaccessible to the current process in the current access mode.
Also, operations that may produce UNPREDICTABLE results must not:
– Write or modify the contents of memory locations or registers to which the current process in the current access mode does not have access, or
– Halt or hang the system or any of its components.
For example, a security hole would exist if some UNPREDICTABLE result depended on the value of a register in another process, on the contents of processor temporary registers left behind by some previously running process, or on a sequence of actions of different processes.
X
Do not care. A capital X represents any valid value.

1 Introduction

This chapter provides a brief introduction to the Alpha architecture, Compaq’s RISC (reduced instruction set computing) architecture designed for high performance. The chapter then summarizes the specific features of the Alpha 21264/EV68A microprocessor (hereafter called the 21264/EV68A) that implements the Alpha architecture. Appendix A provides a list of Alpha instructions.
The companion volume to this document, the Alpha Architecture Reference Manual, Fourth Edition, contains the complete architecture information.

1.1 The Architecture

The Alpha architecture is a 64-bit load and store RISC architecture designed with particular emphasis on speed, multiple instruction issue, multiple processors, and software migration from many operating systems.
All registers are 64 bits long and all operations are performed between 64-bit registers. All instructions are 32 bits long. Memory operations are either load or store operations. All data manipulation is done between registers.
The Alpha architecture supports the following data types:
8-, 16-, 32-, and 64-bit integers
IEEE 32-bit and 64-bit floating-point formats
VAX architecture 32-bit and 64-bit floating-point formats
In the Alpha architecture, instructions interact with each other only by one instruction writing to a register or memory location and another instruction reading from that register or memory location. This use of resources makes it easy to build implementations that issue multiple instructions every CPU cycle.
The 21264/EV68A uses a set of subroutines, called privileged architecture library code (PALcode), that is specific to a particular Alpha operating system implementation and hardware platform. These subroutines provide operating system primitives for context switching, interrupts, exceptions, and memory management. These subroutines can be invoked by hardware or CALL_PAL instructions. CALL_PAL instructions use the function field of the instruction to vector to a specified subroutine. PALcode is written in standard machine code with some implementation-specific extensions to provide direct access to low-level hardware functions. PALcode supports optimizations for multiple operating systems, flexible memory-management implementations, and multi-instruction atomic sequences.
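As a rough illustration of how a CALL_PAL instruction selects a subroutine, the following sketch splits a 32-bit instruction word into its opcode and function field and looks the function up in a table. The field layout used here (opcode in bits [31:26], function in bits [25:0], CALL_PAL opcode 0x00) follows the Alpha PALcode instruction format. The function codes and handler names in the table are hypothetical; real PALcode entry points are specific to the operating system and platform and are not modeled.

```python
# Sketch only: decodes the PALcode instruction format and dispatches on the
# function field. The handler table below is hypothetical, not real PALcode.
CALL_PAL_OPCODE = 0x00

def decode_call_pal(instr):
    """Return the 26-bit function field, or None if instr is not CALL_PAL."""
    opcode = (instr >> 26) & 0x3F        # bits [31:26]
    if opcode != CALL_PAL_OPCODE:
        return None
    return instr & 0x03FFFFFF            # bits [25:0]

# Hypothetical function codes and handler names, for illustration only.
PAL_HANDLERS = {
    0x0001: "example_privileged_primitive",
    0x0080: "example_unprivileged_primitive",
}

def dispatch(instr):
    """Name the handler a CALL_PAL word would vector to in this toy table."""
    function = decode_call_pal(instr)
    if function is None:
        return "not CALL_PAL"
    return PAL_HANDLERS.get(function, "unrecognized function")
```

For example, `dispatch(0x00000080)` selects the second table entry, while a word whose opcode field is nonzero is reported as not being a CALL_PAL instruction.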
The Alpha architecture performs byte shifting and masking with normal 64-bit, register-to-register instructions. The 21264/EV68A performs single-byte and single-word load and store instructions.
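The byte shifting and masking mentioned above can be modeled with ordinary 64-bit shifts and masks. The following sketch mimics the effect of the Alpha EXTBL, INSBL, and MSKBL byte-manipulation instructions on Python integers. The function names are illustrative, and the model ignores instruction encoding and timing.

```python
MASK64 = (1 << 64) - 1

def extbl(ra, rb):
    """Extract the byte of ra selected by the low 3 bits of rb."""
    shift = (rb & 7) * 8
    return (ra >> shift) & 0xFF

def insbl(ra, rb):
    """Place the low byte of ra at the byte position selected by rb."""
    shift = (rb & 7) * 8
    return ((ra & 0xFF) << shift) & MASK64

def mskbl(ra, rb):
    """Clear the byte of ra selected by the low 3 bits of rb."""
    shift = (rb & 7) * 8
    return ra & ~(0xFF << shift) & MASK64

def store_byte(quadword, addr, value):
    """Replace one byte inside a quadword using mask-then-insert."""
    return mskbl(quadword, addr) | insbl(value, addr)
```

The `store_byte` helper shows the usual pattern for updating a byte inside a register-held quadword: clear the target byte, then OR in the shifted replacement.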

1.1.1 Addressing

The basic addressable unit in the Alpha architecture is the 8-bit byte. The 21264/EV68A supports a 48-bit or 43-bit virtual address (selectable under IPR control).
Virtual addresses as seen by the program are translated into physical memory addresses by the memory-management mechanism. The 21264/EV68A supports a 44-bit physical address.
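To make the address widths concrete, the sketch below splits a virtual address into a virtual page number and a byte offset, and checks a physical address against the 44-bit limit. The 8KB page size is taken from the translation buffer description in Section 1.2. The sign-extension check is an assumption about how the selected width is enforced, not something stated in this section.

```python
PAGE_SHIFT = 13                      # 8KB pages
MASK64 = (1 << 64) - 1
PHYS_ADDR_BITS = 44

def is_sign_extended(va, va_bits):
    """True if bits [63:va_bits-1] of the 64-bit value va are all equal."""
    upper = (va & MASK64) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones

def split_virtual_address(va, va_bits=48):
    """Return (virtual page number, byte offset) for a 43- or 48-bit VA."""
    if va_bits not in (43, 48):
        raise ValueError("va_bits must be 43 or 48")
    # Assumption: addresses must be sign-extended from the top implemented bit.
    if not is_sign_extended(va, va_bits):
        raise ValueError("address is not sign-extended")
    low = va & ((1 << va_bits) - 1)
    return low >> PAGE_SHIFT, low & ((1 << PAGE_SHIFT) - 1)

def fits_physical(pa):
    """True if pa can be expressed in 44 physical address bits."""
    return 0 <= pa < (1 << PHYS_ADDR_BITS)
```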

1.1.2 Integer Data Types

The Alpha architecture supports the four integer data types listed in Table 1–1.
Table 1–1 Integer Data Types
Data Type   Description
Byte        A byte is 8 contiguous bits that start at an addressable byte boundary. A byte is an 8-bit value.
Word        A word is 2 contiguous bytes that start at an arbitrary byte boundary. A word is a 16-bit value.
Longword    A longword is 4 contiguous bytes that start at an arbitrary byte boundary. A longword is a 32-bit value.
Quadword    A quadword is 8 contiguous bytes that start at an arbitrary byte boundary. A quadword is a 64-bit value.
Note: Alpha implementations may impose a significant performance penalty when accessing operands that are not naturally aligned. Refer to the Alpha Architecture Handbook, Version 4 for details.
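The alignment note can be illustrated with a small check. An operand is naturally aligned when its address is a multiple of its size; the sketch below encodes the four sizes from Table 1–1 and tests an address against them. The names are illustrative only.

```python
# Operand sizes in bytes, from Table 1-1.
SIZES = {"byte": 1, "word": 2, "longword": 4, "quadword": 8}

def is_naturally_aligned(address, data_type):
    """True if address is a multiple of the operand size."""
    size = SIZES[data_type]
    return (address & (size - 1)) == 0
```

A quadword at address 0x1004, for instance, is not naturally aligned, although a longword at the same address is.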

1.1.3 Floating-Point Data Types

The 21264/EV68A supports the following floating-point data types:
Longword integer format in floating-point unit
Quadword integer format in floating-point unit
IEEE floating-point formats
– S_floating
– T_floating
VAX floating-point formats
– F_floating
– G_floating
– D_floating (limited support)

1.2 21264/EV68A Microprocessor Features
The 21264/EV68A microprocessor is a superscalar pipelined processor. It is packaged in a 587-pin PGA carrier and has removable application-specific heat sinks. A number of configuration options allow its use in system designs ranging from extremely simple uniprocessor systems with minimum component count to high-performance multiprocessor systems with very high cache and memory bandwidth.
The 21264/EV68A can issue four Alpha instructions in a single cycle, thereby minimizing the average cycles per instruction (CPI). A number of low-latency and/or high-throughput features in the instruction issue unit and the onchip components of the memory subsystem further reduce the average CPI.
The 21264/EV68A and associated PALcode implement IEEE single-precision and double-precision, VAX F_floating and G_floating data types, and support longword (32-bit) and quadword (64-bit) integers. Byte (8-bit) and word (16-bit) support is provided by byte-manipulation instructions. Limited hardware support is provided for the VAX D_floating data type.
Other 21264/EV68A features include:
The ability to issue up to four instructions during each CPU clock cycle.
A peak instruction execution rate of four times the CPU clock frequency.
An onchip, demand-paged memory-management unit with translation buffer, which, when used with PALcode, can implement a variety of page table structures and translation algorithms. The unit consists of a 128-entry, fully-associative data translation buffer (DTB) and a 128-entry, fully-associative instruction translation buffer (ITB), with each entry able to map a single 8KB page or a group of 8, 64, or 512 8KB pages. The allocation scheme for the ITB and DTB is round-robin. The size of each translation buffer entry’s group is specified by hint bits stored in the entry. The DTB and ITB implement 8-bit address space numbers (ASN), MAX_ASN=255.
Two onchip, high-throughput pipelined floating-point units, capable of executing both VAX and IEEE floating-point data types.
An onchip, 64KB virtually-addressed instruction cache with 8-bit ASNs (MAX_ASN=255).
An onchip, virtually-indexed, physically-tagged dual-read-ported, 64KB data cache.
Supports a 48-bit or 43-bit virtual address (program selectable).
Supports a 44-bit physical address.
An onchip I/O write buffer with four 64-byte entries for I/O write transactions.
An onchip, 8-entry victim data buffer.
An onchip, 32-entry load queue.
An onchip, 32-entry store queue.
An onchip, 8-entry miss address file for cache fill requests and I/O read transactions.
An onchip, 8-entry probe queue, holding pending system port probe commands.
An onchip, duplicate tag array used to maintain level 2 cache coherency.
A 64-bit data bus with onchip parity and error correction code (ECC) support.
Support for an external second-level (Bcache) cache. The size and some timing parameters of the Bcache are programmable.
An internal clock generator providing a high-speed clock used by the 21264/EV68A, and two clocks for use by the CPU module.
Onchip performance counters to measure and analyze CPU and system performance.
Chip and module level test support, including an instruction cache test interface to support chip and module level testing.
A 2.0-V external interface.
Refer to Chapter 9 for 21264/EV68A dc and ac electrical characteristics. Refer to the Alpha Architecture Handbook, Version 4, Appendix E, for waivers and any other implementation-dependent information.

2 Internal Architecture

This chapter provides both an overview of the 21264/EV68A microarchitecture and a system designer’s view of the 21264/EV68A implementation of the Alpha architecture. The combination of the 21264/EV68A microarchitecture and privileged architecture library code (PALcode) defines the chip’s implementation of the Alpha architecture. If a certain piece of hardware seems to be “architecturally incomplete,” the missing functionality is implemented in PALcode. Chapter 6 provides more information on PALcode.
This chapter describes the major functional hardware units and is not intended to be a detailed hardware description of the chip. It is organized as follows:
21264/EV68A microarchitecture
Pipeline organization
Instruction issue and retire rules
Load instructions to R31/F31 (software-directed instruction prefetch)
Special cases of Alpha instruction execution
Memory and I/O address space
Miss address file (MAF) and load-merging rules
Instruction ordering
Replay traps
I/O write buffer and the WMB instruction
Performance measurement support
Floating-point control register
AMASK and IMPLVER instruction values
Design examples

2.1 21264/EV68A Microarchitecture

The 21264/EV68A microprocessor is a high-performance third-generation implementation of the Compaq Alpha architecture. The 21264/EV68A consists of the following sections, as shown in Figure 2–1:
Instruction fetch, issue, and retire unit (Ibox)
Integer execution unit (Ebox)
Floating-point execution unit (Fbox)
Onchip caches (Icache and Dcache)
Memory reference unit (Mbox)
External cache and system interface unit (Cbox)
Pipeline operation sequence

2.1.1 Instruction Fetch, Issue, and Retire Unit

The instruction fetch, issue, and retire unit (Ibox) consists of the following subsections:
Virtual program counter logic
Branch predictor
Instruction-stream translation buffer (ITB)
Instruction fetch logic
Register rename maps
Integer and floating-point issue queues
Exception and interrupt logic
Retire logic
2.1.1.1 Virtual Program Counter Logic
The virtual program counter (VPC) logic maintains the virtual addresses for instructions that are in flight. There can be up to 80 instructions, in 20 successive fetch slots, in flight between the register rename mappers and the end of the pipeline. The VPC logic contains a 20-entry table to store these fetched VPC addresses.
Figure 2–1 21264/EV68A Block Diagram
[Figure: block diagram showing the instruction cache and Ibox (fetch unit, VPC queue, branch predictor, ITB, predecode, decode and rename registers, retire unit); the integer issue queue (20 entries) feeding the Ebox (subclusters U0, U1, L0, and L1, with two 80-register integer register files); the FP issue queue (15 entries) feeding the Fbox (multiply, add, divide, and square root units, with a 72-register FP register file); the Mbox (dual-ported, 128-entry DTB, load queue, store queue, miss address file); the dual-ported data cache; and the Cbox (IOWB, duplicate tag store, probe queue, victim buffer, arbiter) with its cache and system ports.]
2.1.1.2 Branch Predictor
The branch predictor is composed of three units: the local, global, and choice predictors. Figure 2–2 shows how the branch predictor generates the predicted branch address.
Figure 2–2 Branch Predictor
[Figure: block diagram in which the local predictor, global predictor, and choice predictor combine to produce the predicted branch address.]
Local Predictor
The local predictor uses a 2-level table that holds the history of individual branches. The 2-level table design approaches the prediction accuracy of a larger single-level table while requiring fewer total bits of storage. Figure 2–3 shows how the local predictor generates a prediction. Bits [11:2] of the VPC of the current branch are used as the index to a 1K entry table in which each entry is a 10-bit value. This 10-bit value is used as the index to a 1K entry table of 3-bit saturating counters. The value of the saturating counter determines the prediction, taken/not-taken, of the current branch.
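The two-level lookup described above can be sketched in a few lines. The table sizes and index bits come from the text. The prediction threshold, the counter training rule, and the way outcomes are shifted into the history are assumptions made so the sketch can run; they are not taken from this manual.

```python
class LocalPredictor:
    """Two-level local predictor: 1K x 10-bit histories, 1K x 3-bit counters."""

    def __init__(self):
        self.history = [0] * 1024     # per-branch 10-bit taken/not-taken history
        self.counters = [0] * 1024    # 3-bit saturating counters, range 0..7

    @staticmethod
    def _index(vpc):
        return (vpc >> 2) & 0x3FF     # VPC bits [11:2]

    def predict(self, vpc):
        pattern = self.history[self._index(vpc)]
        # Assumption: the counter's most significant bit gives the prediction.
        return self.counters[pattern] >= 4

    def update(self, vpc, taken):
        i = self._index(vpc)
        pattern = self.history[i]
        # Assumption: train the counter toward the actual outcome.
        if taken:
            self.counters[pattern] = min(7, self.counters[pattern] + 1)
        else:
            self.counters[pattern] = max(0, self.counters[pattern] - 1)
        # Assumption: shift the outcome into the low end of the history.
        self.history[i] = ((pattern << 1) | int(taken)) & 0x3FF
```

With these assumptions, a branch that is always taken fills its history with ones and then trains the counter selected by that history until the prediction switches to taken.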
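Figure 2–3 Local Predictor
[Figure: VPC[11:2] indexes the local history table (1K x 10); the 10-bit history indexes the local predictor table (1K x 3), whose 3-bit counter is adjusted up or down (+/-) and supplies the 1-bit local branch prediction.]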
Global Predictor
The global predictor is indexed by a global history of all recent branches. The global predictor correlates the local history of the current branch with all recent branches. Figure 2–4 shows how the global predictor generates a prediction. The global path history comprises the taken/not-taken state of the 12 most-recent branches. These 12 states are used to form an index into a 4K entry table of 2-bit saturating counters. The value of the saturating counter determines the prediction, taken/not-taken, of the current branch.
Figure 2–4 Global Predictor
[Figure: the 12-bit global path history indexes the global predictor table (4K x 2), whose 2-bit counter is adjusted up or down (+/-) and supplies the 1-bit global branch prediction.]
Choice Predictor
The choice predictor monitors the history of the local and global predictors and chooses the best of the two predictors for a particular branch. Figure 2–5 shows how the choice predictor generates its choice of the result of the local or global prediction. The 12-bit global path history (see Figure 2–4) is used to index a 4K entry table of 2-bit saturating counters. The value of the saturating counter determines the choice between the outputs of the local and global predictors.
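A minimal sketch of how the global and choice tables might work together is given here. The 12-bit path history and the two 4K x 2-bit tables come from the text. The thresholds, the meaning of a high choice counter, and the training policy (the choice counter is trained only when the two predictors disagree) are assumptions, and the local prediction is passed in as a plain boolean rather than modeled.

```python
class GlobalAndChoice:
    """Global and choice predictors, both indexed by a 12-bit path history."""

    def __init__(self):
        self.path = 0                   # taken/not-taken of the last 12 branches
        self.global_ctr = [0] * 4096    # 2-bit counters, range 0..3
        self.choice_ctr = [0] * 4096    # 2-bit counters, range 0..3

    def predict(self, local_prediction):
        global_prediction = self.global_ctr[self.path] >= 2
        # Assumption: a high choice counter selects the global predictor.
        use_global = self.choice_ctr[self.path] >= 2
        return global_prediction if use_global else local_prediction

    def update(self, local_prediction, taken):
        i = self.path
        global_prediction = self.global_ctr[i] >= 2
        # Assumption: train the choice counter only when the two disagree.
        if global_prediction != local_prediction:
            step = 1 if global_prediction == taken else -1
            self.choice_ctr[i] = min(3, max(0, self.choice_ctr[i] + step))
        step = 1 if taken else -1
        self.global_ctr[i] = min(3, max(0, self.global_ctr[i] + step))
        # Shift the outcome into the 12-bit path history.
        self.path = ((self.path << 1) | int(taken)) & 0xFFF
```

In this model, if the local predictor keeps mispredicting a branch that the global table predicts correctly, the choice counter for that path history drifts toward the global predictor.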
Figure 2–5 Choice Predictor
[Figure: the 12-bit global path history indexes the choice predictor table (4K x 2), whose 2-bit counter supplies the choice prediction.]
2.1.1.3 Instruction-Stream Translation Buffer
The Ibox includes a 128-entry, fully-associative instruction-stream translation buffer (ITB) that is used to store recently used instruction-stream (Istream) address translations and page protection information. Each of the entries in the ITB can map 1, 8, 64, or 512 contiguous 8KB pages. The allocation scheme is round-robin.
The ITB supports an 8-bit ASN and contains an ASM bit. The Icache is virtually addressed and contains the access-check information, so the ITB is accessed only for Istream references that miss in the Icache.
Istream transactions to I/O address space are UNDEFINED.
2.1.1.4 Instruction Fetch Logic
The instruction prefetcher (predecode) reads an octaword, containing up to four naturally aligned instructions per cycle, from the Icache. Branch prediction and line prediction bits accompany the four instructions. The branch prediction scheme operates most efficiently when only one branch instruction is contained among the four fetched instructions. The line prediction scheme attempts to predict the Icache line that the branch predictor will generate, and is described in Section 2.2.
An entry from the subroutine return prediction stack, together with set prediction bits for use by the Icache stream controller, are fetched along with the octaword. The Icache stream controller generates fetch requests for additional Icache lines and stores the Istream data in the Icache. There is no separate buffer to hold Istream requests.
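The phrase "up to four naturally aligned instructions" can be made concrete with a short sketch. It assumes that a fetch starting at a VPC inside an octaword delivers only the instructions from that VPC to the end of the aligned octaword; the function name and return format are illustrative.

```python
INSTR_BYTES = 4        # every Alpha instruction is 32 bits long
OCTAWORD_BYTES = 16    # one fetch reads an aligned 16-byte octaword

def fetch_slot(vpc):
    """Return (octaword base address, addresses of instructions fetched)."""
    base = vpc & ~(OCTAWORD_BYTES - 1)
    first = (vpc - base) // INSTR_BYTES
    # Assumption: instructions before vpc in the octaword are not used.
    addresses = [base + INSTR_BYTES * n for n in range(first, 4)]
    return base, addresses
```

A fetch at an octaword-aligned VPC yields four instruction addresses, while a fetch that enters the octaword at its third instruction yields two.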
2.1.1.5 Register Rename Maps
The instruction prefetcher forwards instructions to the integer and floating-point register rename maps. The rename maps perform the two functions listed here:
Eliminate register write-after-read (WAR) and write-after-write (WAW) data dependencies while preserving true read-after-write (RAW) data dependencies, in order to allow instructions to be dynamically rescheduled.
Provide a means of speculatively executing instructions before the control flow previous to those instructions is resolved. Both exceptions and branch mispredictions represent deviations from the control flow predicted by the instruction prefetcher.
The map logic translates each instruction’s operand register specifiers from the virtual register numbers in the instruction to the physical register numbers that hold the corresponding architecturally-correct values. The map logic also renames each instruction’s destination register specifier from the virtual number in the instruction to a physical register number chosen from a list of free physical registers, and updates the register maps.
The map logic can process four instructions per cycle. It does not return the physical register, which holds the old value of an instruction’s virtual destination register, to the free list until the instruction has been retired, indicating that the control flow up to that instruction has been resolved.
If a branch mispredict or exception occurs, the map logic backs up the contents of the integer and floating-point register rename maps to the state associated with the instruction that triggered the condition, and the prefetcher restarts at the appropriate VPC. At most, 20 valid fetch slots containing up to 80 instructions can be in flight between the register maps and the end of the machine’s pipeline, where the control flow is finally resolved. The map logic is capable of backing up the contents of the maps to the state associated with any of these 80 instructions in a single cycle.
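The renaming, retirement, and back-up behavior described above can be sketched with a map table and a free list. The register counts (31 integer registers plus 8 PAL shadow registers mapped onto 80 physical registers) come from the Ebox description in Section 2.1.2. Everything else is a software simplification: the class and method names are hypothetical, R31 is not modeled, and the back-up walks the in-flight list one instruction at a time, whereas the hardware restores the maps in a single cycle.

```python
class RenameMap:
    """Integer rename sketch: 39 mapped registers, 80 physical registers."""

    def __init__(self, virtual=39, physical=80):
        self.map = list(range(virtual))              # virtual -> physical
        self.free = list(range(virtual, physical))   # 41 free physical registers
        self.in_flight = []                          # (dest, new phys, old phys)

    def rename(self, dest, sources):
        """Rename one instruction; return (physical sources, physical dest)."""
        phys_sources = [self.map[s] for s in sources]  # read before remapping dest
        new_phys = self.free.pop(0)
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        self.in_flight.append((dest, new_phys, old_phys))
        return phys_sources, new_phys

    def retire_oldest(self):
        """Retire the oldest instruction; only now is the old register freed."""
        _, _, old_phys = self.in_flight.pop(0)
        self.free.append(old_phys)

    def back_up(self, n):
        """Squash in-flight instruction n and everything younger."""
        while len(self.in_flight) > n:
            dest, new_phys, old_phys = self.in_flight.pop()
            self.map[dest] = old_phys
            self.free.insert(0, new_phys)
```

Two back-to-back writes to the same virtual register receive different physical registers, which is how the WAW dependency disappears, while the second instruction still reads the first one's result through the updated map.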
The register rename logic places instructions into an integer or floating-point issue queue, from which they are later issued to functional units for execution.
2.1.1.6 Integer Issue Queue
The 20-entry integer issue queue (IQ), associated with the integer execution units (Ebox), issues the following types of instructions at a maximum rate of four per cycle:
Integer operate
Integer conditional branch
Unconditional branch – both displacement and memory format
Integer and floating-point load and store
PAL-reserved instructions: HW_MTPR, HW_MFPR, HW_LD, HW_ST, HW_RET
Integer-to-floating-point (ITOFx) and floating-point-to-integer (FTOIx)
Each queue entry asserts four request signals, one for each of the Ebox subclusters. A queue entry asserts a request when it contains an instruction that can be executed by the subcluster, if the instruction’s operand register values are available within the subcluster.
There are two arbiters, one for the upper subclusters and one for the lower subclusters. (Subclusters are described in Section 2.1.2.) Each arbiter picks two of the possible 20 requesters for service each cycle. A given instruction requests either the upper subclusters or the lower subclusters, never both; because many instructions can execute in only one type of subcluster anyway, this restriction is not too limiting.
For example, load and store instructions can only go to lower subclusters and shift instructions can only go to upper subclusters. Other instructions, such as addition and logic operations, can execute in either upper or lower subclusters and are statically assigned before being placed in the IQ.
The IQ arbiters choose between simultaneous requesters of a subcluster based on the age of the request: older requests are given priority over newer requests. If a given instruction requests both lower subclusters, and no older instruction requests a lower subcluster, then the arbiter assigns subcluster L0 to the instruction. If a given instruction requests both upper subclusters, and no older instruction requests an upper subcluster, then the arbiter assigns subcluster U1 to the instruction. This asymmetry between the upper and lower subcluster arbiters is a circuit implementation optimization with negligible overall performance effect.
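The age-based arbitration can be sketched as a greedy pass over the requesters, oldest first. The entry format, the function name, and the rule that an entry whose subcluster is already taken is simply skipped are assumptions for illustration; only the age priority and the L0/U1 preference come from the text.

```python
def arbitrate(entries, pair, preferred):
    """Grant each subcluster in pair to at most one requester, oldest first.

    entries   -- list of (age, name, set of requested subclusters);
                 a larger age means an older request
    pair      -- the arbiter's two subclusters, for example ("L0", "L1")
    preferred -- subcluster granted first to an entry that requests both
    """
    other = pair[1] if preferred == pair[0] else pair[0]
    grants = {}
    for age, name, requests in sorted(entries, key=lambda e: e[0], reverse=True):
        for sub in (preferred, other):
            if sub in requests and sub not in grants:
                grants[sub] = name
                break
        if len(grants) == 2:
            break
    return grants
```

Calling `arbitrate(entries, ("L0", "L1"), "L0")` models the lower arbiter and `arbitrate(entries, ("U0", "U1"), "U1")` models the upper arbiter, which reproduces the asymmetry described above.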
2.1.1.7 Floating-Point Issue Queue
The 15-entry floating-point issue queue (FQ) associated with the Fbox issues the following instruction types:
Floating-point operates
Floating-point conditional branches
Floating-point stores
Floating-point register to integer register transfers (FTOIx)
Each queue entry has three request lines: one for the add pipeline, one for the multiply pipeline, and one for the two store pipelines. There are three arbiters, one for each of the add, multiply, and store pipelines. The add and multiply arbiters pick one requester per cycle, while the store pipeline arbiter picks two requesters per cycle, one for each store pipeline.
The FQ arbiters pick between simultaneous requesters of a pipeline based on the age of the request: older requests are given priority over newer requests. Floating-point store instructions and FTOIx instructions in even-numbered queue entries arbitrate for one store port. Floating-point store instructions and FTOIx instructions in odd-numbered queue entries arbitrate for the second store port.
Floating-point store instructions and FTOIx instructions are queued in both the integer and floating-point queues. They wait in the floating-point queue until their operand register values are available. They subsequently request service from the store arbiter. Upon being issued from the floating-point queue, they signal the corresponding entry in the integer queue to request service. Upon being issued from the integer queue, the operation is completed.
2.1.1.8 Exception and Interrupt Logic
There are two types of exceptions: faults and synchronous traps. Arithmetic exceptions are precise and are reported as synchronous traps.
The four sources of interrupts are listed as follows:
Level-sensitive hardware interrupts sourced by the IRQ_H[5:0] pins
Edge-sensitive hardware interrupts generated by the serial line receive pin, performance counter overflows, and hardware corrected read errors
Software interrupts sourced by the software interrupt request (SIRR) register
Asynchronous system traps (ASTs)
Interrupt sources can be individually masked. In addition, AST interrupts are qualified by the current processor mode.
2.1.1.9 Retire Logic
The Ibox fetches instructions in program order, executes them out of order, and then retires them in order. The Ibox retire logic maintains the architectural state of the machine by retiring an instruction only if all previous instructions have executed without generating exceptions or branch mispredictions. Retiring an instruction commits the machine to any changes the instruction may have made to the software-visible state. The three software-visible states are listed as follows:
Integer and floating-point registers
Memory
Internal processor registers (including control/status registers and translation buffers)
The retire logic can sustain a maximum retire rate of eight instructions per cycle, and can retire up to as many as 11 instructions in a single cycle.
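In-order retirement can be sketched as draining the oldest completed instructions from the front of a list. The per-cycle limit of 11 comes from the retire logic description; the list-of-dictionaries layout and key names are hypothetical, and exception handling itself is not modeled.

```python
def retire(in_flight, max_per_cycle=11):
    """Retire instructions in program order; return the number retired.

    in_flight -- list of dicts in fetch order with keys "done" and "fault"
    """
    retired = 0
    while in_flight and retired < max_per_cycle:
        oldest = in_flight[0]
        if not oldest["done"] or oldest["fault"]:
            break            # younger instructions must wait behind this one
        in_flight.pop(0)
        retired += 1
    return retired
```

An instruction that finished executing early still cannot retire until every older instruction has completed without a fault, which is what keeps the software-visible state consistent.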

2.1.2 Integer Execution Unit

The integer execution unit (Ebox) is a 4-path integer execution unit that is implemented as two functional-unit “clusters” labeled 0 and 1. Each cluster contains a copy of an 80-entry, physical-register file and two “subclusters”, named upper (U) and lower (L). Figure 2–6 shows the integer execution unit. In the figure, iop_wr is the cross-cluster bus for moving integer result values between clusters.
Figure 2–6 Integer Execution Unit—Clusters 0 and 1
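[Figure: clusters 0 and 1, each with an upper subcluster (U0, U1), a register file, and a lower subcluster (L0, L1); iop_wr buses run between the clusters, and each cluster has load/store data and eff_VA connections.]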
Most instructions have 1-cycle latency for consumers that execute within the same cluster. Also, there is another 1-cycle delay associated with producing a value in one cluster and consuming the value in the other cluster. The instruction issue queue minimizes the performance effect of this cross-cluster delay. The Ebox contains the following resources:
Four 64-bit adders that are used to calculate results for integer add instructions (located in U0, U1, L0, and L1)
The adders in the lower subclusters that are used to generate the effective virtual address for load and store instructions (located in L0 and L1)
Four logic units
Two barrel shifters and associated byte logic (located in U0 and U1)
Two sets of conditional branch logic (located in U0 and U1)
Two copies of an 80-entry register file
One pipelined multiplier (located in U1) with 7-cycle latency for all integer multiply operations
One fully-pipelined unit (located in U0), with 3-cycle latency, that executes the following instructions:
– CTLZ, CTPOP, CTTZ
– PERR, MINxxx, MAXxxx, UNPKxx, PKxx
The Ebox has 80 register-file entries that contain storage for the values of the 31 Alpha integer registers (the value of R31 is not stored), the values of 8 PAL shadow registers, and 41 results written by instructions that have not yet been retired.
Ignoring cross-cluster delay, the two copies of the Ebox register file contain identical values. Each copy of the Ebox register file contains four read ports and six write ports. The four read ports are used to source operands to each of the two subclusters within a cluster. The six write ports are used as follows:
Two write ports are used to write results generated within the cluster.
Two write ports are used to write results generated by the other cluster.
Two write ports are used to write results from load instructions. These two ports are also used for FTOIx instructions.

2.1.3 Floating-Point Execution Unit

The floating-point execution unit (Fbox) has two paths. The Fbox executes both VAX and IEEE floating-point instructions. It supports IEEE S_floating-point and T_floating-point data types and all rounding modes. It also supports VAX F_floating-point and G_floating-point data types, and provides limited support for D_floating-point format. The basic structure of the floating-point execution unit is shown in Figure 2–7.
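Figure 2–7 Floating-Point Execution Units
[Figure: floating-point execution units, showing the register file (Reg), the FP multiply unit, and the FP add unit with its associated FP divide and square root units.]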
The Fbox contains the following resources:
72-entry physical register file
Fully-pipelined multiplier with 4-cycle latency
Fully-pipelined adder with 4-cycle latency
Nonpipelined divide unit associated with the adder pipeline
Nonpipelined square root unit associated with the adder pipeline
The 72 Fbox register file entries contain storage for the values of the 31 Alpha floating-point registers (F31 is not stored) and 41 values written by instructions that have not been retired.
The Fbox register file contains six read ports and four write ports. Four read ports are used to source operands to the add and multiply pipelines, and two read ports are used to source data for store instructions. Two write ports are used to write results generated by the add and multiply pipelines, and two write ports are used to write results from floating-point load instructions.

2.1.4 External Cache and System Interface Unit

The interface for the system and external cache (Cbox) controls the Bcache and system ports. It contains the following structures:
Victim address file (VAF)
Victim data file (VDF)
I/O write buffer (IOWB)
Probe queue (PQ)
Duplicate Dcache tag (DTAG)
2.1.4.1 Victim Address File and Victim Data File
The victim address file (VAF) and victim data file (VDF) together form an 8-entry victim buffer used for holding:
Dcache blocks to be written to the Bcache
Istream cache blocks from memory to be written to the Bcache
Bcache blocks to be written to memory
Cache blocks sent to the system in response to probe commands
2.1.4.2 I/O Write Buffer
The I/O write buffer (IOWB) consists of four 64-byte entries and associated address and control logic used for buffering I/O write data between the store queue and the system port.
2.1.4.3 Probe Queue
The probe queue (PQ) is an 8-entry queue that holds pending system port cache probe commands and addresses.
2.1.4.4 Duplicate Dcache Tag Array
The duplicate Dcache tag (DTAG) array holds a duplicate copy of the Dcache tags and is used by the Cbox when processing Dcache fills, Icache fills, and system port probes.

2.1.5 Onchip Caches

The 21264/EV68A contains two onchip primary-level caches.
2.1.5.1 Instruction Cache
The instruction cache (Icache) is a 64KB virtual-addressed, 2-way set-predict cache. Set prediction is used to approximate the performance of a 2-set cache without slowing the cache access time. Each Icache block contains:
16 Alpha instructions (64 bytes)
Virtual tag bits [47:15]
8-bit address space number (ASN) field
1-bit address space match (ASM) bit
1-bit PALcode bit to indicate physical addressing
Valid bit
Data and tag parity bits
Four access-check bits for the following modes: kernel, executive, supervisor, and user (KESU)
Additional predecoded information to assist with instruction processing and fetch control
2.1.5.2 Data Cache
The data cache (Dcache) is a 64KB, 2-way set-associative, virtually indexed, physically tagged, write-back, read/write allocate cache with 64-byte blocks. During each cycle the Dcache can perform one of the following transactions:
Two quadword (or shorter) read transactions to arbitrary addresses
Two quadword write transactions to the same aligned octaword
Two non-overlapping less-than-quadword writes to the same aligned quadword
One sequential read and write transaction from and to the same aligned octaword
Each Dcache block contains:
64 data bytes and associated quadword ECC bits
Physical tag bits
Valid, dirty, shared, and modified bits
Tag parity bit calculated across the tag, dirty, shared, and modified bits
One bit to control round-robin set allocation (one bit per two cache blocks)
The Dcache contains two sets, each with 512 rows containing 64-byte blocks per row (that is, 32K bytes of data per set). The 21264/EV68A requires two additional bits of virtual address beyond the bits that specify an 8KB page, in order to specify a Dcache row index. A given virtual address might be found in four unique locations in the Dcache, depending on the virtual-to-physical translation for those two bits. The 21264/EV68A prevents this aliasing by keeping only one of the four possible translated addresses in the cache at any time.
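The arithmetic behind the two extra index bits can be worked through in a short sketch. With 512 rows of 64-byte blocks, the row index uses address bits [14:6], while an 8KB page offset covers only bits [12:0]. The function names are illustrative, and reading the "four unique locations" as the four rows that share bits [12:6] is an interpretation of the text, not a statement of how the hardware resolves the aliases.

```python
BLOCK_BYTES = 64
ROWS_PER_SET = 512
PAGE_BYTES = 8 * 1024

def dcache_row(address):
    """Row index from address bits [14:6] (512 rows of 64-byte blocks)."""
    return (address // BLOCK_BYTES) % ROWS_PER_SET

def candidate_rows(address):
    """The four rows that share this address's bits [12:6].

    Bits [12:6] lie inside the page offset and are the same before and after
    translation; bits [14:13] are not, so they can take any of four values.
    """
    low = (address % PAGE_BYTES) // BLOCK_BYTES      # bits [12:6]
    rows_per_page = PAGE_BYTES // BLOCK_BYTES        # 128 rows per 8KB page
    return [upper * rows_per_page + low for upper in range(4)]
```

Each set holds 512 x 64 = 32K bytes, four times the 8KB page size, which is where the factor of four comes from.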

2.1.6 Memory Reference Unit

The memory reference unit (Mbox) controls the Dcache and ensures architecturally correct behavior for load and store instructions. The Mbox contains the following structures:
Load queue (LQ)
Store queue (SQ)
Miss address file (MAF)
Dstream translation buffer (DTB)
2.1.6.1 Load Queue
The load queue (LQ) is a reorder buffer for load instructions. It contains 32 entries and maintains the state associated with load instructions that have been issued to the Mbox, but for which results have not been delivered to the processor and the instructions retired. The Mbox assigns load instructions to LQ slots based on the order in which they were fetched from the Icache, then places them into the LQ after they are issued by the IQ. The LQ helps ensure correct Alpha memory reference behavior.
2.1.6.2 Store Queue
The store queue (SQ) is a reorder buffer and graduation unit for store instructions. It contains 32 entries and maintains the state associated with store instructions that have been issued to the Mbox, but for which data has not been written to the Dcache and the instruction retired. The Mbox assigns store instructions to SQ slots based on the order in which they were fetched from the Icache and places them into the SQ after they are issued by the IQ. The SQ holds data associated with store instructions issued from the IQ until they are retired, at which point the store can be allowed to update the Dcache. The SQ also helps ensure correct Alpha memory reference behavior.
2.1.6.3 Miss Address File
The 8-entry miss address file (MAF) holds physical addresses associated with pending Icache and Dcache fill requests and pending I/O space read transactions.
2.1.6.4 Dstream Translation Buffer
The Mbox includes a 128-entry, fully associative Dstream translation buffer (DTB) used to store Dstream address translations and page protection information. Each of the entries in the DTB can map 1, 8, 64, or 512 contiguous 8KB pages. The allocation scheme is round-robin. The DTB supports an 8-bit ASN and contains an ASM bit.
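A translation buffer entry that maps 1, 8, 64, or 512 pages can be matched by ignoring the low bits of the virtual page number. The sketch below assumes the group size is encoded as a 2-bit hint N meaning 8^N pages, in line with the Alpha granularity hint, and that a set ASM bit makes an entry match every address space number. The function and parameter names are illustrative, and protection checks are not modeled.

```python
PAGE_SHIFT = 13   # 8KB pages

def pages_mapped(hint):
    """Pages covered by one entry for hint values 0..3: 1, 8, 64, 512."""
    return 8 ** hint

def entry_matches(entry_vpn, hint, va, entry_asn, asn, asm):
    """True if a translation buffer entry covers virtual address va."""
    ignore_bits = 3 * hint                      # low VPN bits not compared
    same_group = (va >> PAGE_SHIFT) >> ignore_bits == entry_vpn >> ignore_bits
    # Assumption: ASM set means the entry matches regardless of the ASN.
    return same_group and (asm or entry_asn == asn)
```

With a hint of 1, for example, an entry whose virtual page number is 0x40 also covers pages 0x41 through 0x47.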

2.1.7 SROM Interface

The serial read-only memory (SROM) interface provides the initialization data load path from a system SROM to the Icache. Refer to Chapter 7 for more information.
2.2 Pipeline Organization
The 7-stage pipeline provides an optimized environment for executing Alpha instructions. The pipeline stages (0 to 6) are shown in Figure 2–8 and described in the following paragraphs.
Figure 2–8 Pipeline Organization
[Block diagram of pipeline stages 0 through 6: branch predictor and 64KB 2-set instruction cache; integer and floating-point register rename maps (four instructions); 20-entry integer and 15-entry floating-point issue queues; integer and floating-point register files; integer ALU, shifter, multiplier, and address ALU units; floating-point add, divide, and square root unit and floating-point multiply unit; 64KB data cache; bus interface unit with 64-bit system bus, 128-bit cache bus, and 44-bit physical address.]
Stage 0 Instruction Fetch
The branch predictor uses a branch history algorithm to predict a branch instruction target address.
Up to four aligned instructions are fetched from the Icache, in program order. The branch prediction tables are also accessed in this cycle. The branch predictor uses tables and a branch history algorithm to predict a branch instruction target address for one branch or memory format JSR instruction per cycle. Therefore, the prefetcher is limited to fetching through one branch per cycle. If there is more than one branch within the fetch line, and the branch predictor predicts that the first branch will not be taken, it will predict through subsequent branches at the rate of one per cycle, until it predicts a taken branch or predicts through the last branch in the fetch line.
The Icache array also contains a line prediction field, the contents of which are applied to the Icache in the next cycle. The purpose of the line predictor is to remove the pipeline bubble which would otherwise be created when the branch predictor predicts a branch to be taken. In effect, the line predictor attempts to predict the Icache line which the branch predictor will generate. On fills, the line predictor value at each fetch line is initialized with the index of the next sequential fetch line, and later retrained by the branch predictor if necessary.
Stage 1 — Instruction Slot
The Ibox maps four instructions per cycle from the 64KB 2-way set-predict Icache. Instructions are mapped in order, executed dynamically, but are retired in order.
In the slot stage, the branch predictor compares the next Icache index that it generates to the index that was generated by the line predictor. If there is a mismatch, the branch predictor wins: the instructions fetched during that cycle are aborted, and the index predicted by the branch predictor is applied to the Icache during the next cycle. Line mispredictions result in one pipeline bubble.
The line predictor takes precedence over the branch predictor during memory format calls or jumps. If the line predictor was trained with a true (as opposed to predicted) memory format call or jump target, then its contents take precedence over the target hint field associated with these instructions. This allows dynamic calls or jumps to be correctly predicted.
The instruction fetcher produces the full VPC address during the fetch stage of the pipeline. The Icache produces the tags for both Icache sets 0 and 1 each time it is accessed. That enables the fetcher to separate set mispredictions from true Icache misses. If the access was caused by a set misprediction, the instruction fetcher aborts the last two fetched slots and refetches the slot in the next cycle. It also retrains the appropriate set prediction bits.
The instruction data is transferred from the Icache to the integer and floating-point register map hardware during this stage. When the integer instruction is fetched from the Icache and slotted into the IQ, the slot logic determines whether the instruction is for the upper or lower subclusters. The slot logic makes the decision based on the resources needed by the (up to four) integer instructions in the fetch block. Although all four instructions need not be issued simultaneously, distributing their resource usage improves instruction loading across the units. For example, if a fetch block contains two instructions that can be placed in either cluster followed by two instructions that must execute in the lower cluster, the slot logic would designate that combination as EELL and slot them as UULL. Slot combinations are described in Section 2.3.2 and Table 2–3.
Stage 2 Map
Instructions are sent from the Icache to the integer and floating-point register maps during the slot stage and register renaming is performed during the map stage. Also, each instruction is assigned a unique 8-bit number, called an inum, which is used to identify the instruction and its program order with respect to other instructions during the time that it is in flight. Instructions are considered to be in flight between the time they are mapped and the time they are retired.
Mapped instructions and their associated inums are placed in the integer and floating-point queues by the end of the map stage.
Stage 3 Issue
The 20-entry integer issue queue (IQ) issues instructions at the rate of four per cycle. The 15-entry floating-point issue queue (FQ) issues floating-point operate instructions, conditional branch instructions, and store instructions, at the rate of two per cycle. Normally, instructions are deleted from the IQ or FQ two cycles after they are issued. For example, if an instruction is issued in cycle n, it remains in the FQ or IQ in cycle n+1 but does not request service, and is deleted in cycle n+2.

Stage 4 Register Read
Instructions issued from the issue queues read their operands from the integer and floating-point register files and receive bypass data.
Stage 5 — Execute
The Ebox and Fbox pipelines begin execution.
Stage 6 Dcache Access
Memory reference instructions access the Dcache and data translation buffers. Normally, load instructions access the tag and data arrays while store instructions only access the tag arrays. Store data is written to the store queue where it is held until the store instruction is retired. Most integer operate instructions write their register results in this cycle.

2.2.1 Pipeline Aborts

The abort penalty as given is measured from the cycle after the fetch stage of the instruction which triggers the abort to the fetch stage of the new target, ignoring any Ibox pipeline stalls or queuing delay that the triggering instruction might experience. Table 2–1 lists the timing associated with each common source of pipeline abort.
Table 2–1 Pipeline Abort Delay (GCLK Cycles)
Abort Condition                  Penalty (Cycles)  Comments
Branch misprediction             7                 Integer or floating-point conditional branch misprediction.
JSR misprediction                8                 Memory format JSR or HW_RET.
Mbox order trap                  14                Load-load order or store-load order.
Other Mbox replay traps          13                —
DTB miss                         13                —
ITB miss                         7                 —
Integer arithmetic trap          12                —
Floating-point arithmetic trap   13+latency        Add latency of instruction. See Section 2.3.3 for instruction latencies.
2.3 Instruction Issue Rules
This section defines instruction classes, the functional unit pipelines to which they are issued, and their associated latencies.

2.3.1 Instruction Group Definitions

Table 2–2 lists the instruction class, the pipeline assignments, and the instructions included in the class.
Table 2–2 Instruction Name, Pipeline, and Types
Class Name  Pipeline                  Instruction Type
ild         L0, L1                    All integer load instructions
fld         L0, L1                    All floating-point load instructions
ist         L0, L1                    All integer store instructions
fst         FST0, FST1, L0, L1        All floating-point store instructions
lda         L0, L1, U0, U1            LDA, LDAH
mem_misc    L1                        WH64, ECB, WMB
rpcc        L1                        RPCC
rx          L1                        RS, RC
mxpr        L0, L1 (depends on IPR)   HW_MTPR, HW_MFPR
icbr        U0, U1                    Integer conditional branch instructions
jsr         L0                        BR, BSR, JMP, CALL, RET, COR, HW_RET, CALL_PAL
iadd        L0, U0, L1, U1            Instructions with opcode 10₁₆, except CMPBGE
ilog        L0, U0, L1, U1            AND, BIC, BIS, ORNOT, XOR, EQV, CMPBGE
ishf        U0, U1                    Instructions with opcode 12₁₆
cmov        L0, U0, L1, U1            Integer CMOV - either cluster
imul        U1                        Integer multiply instructions
imisc       U0                        CTLZ, CTPOP, CTTZ, PERR, MINxxx, MAXxxx, PKxx, UNPKxx
fcbr        FA                        Floating-point conditional branch instructions
fadd        FA                        All floating-point operate instructions except multiply, divide, square root, and conditional move instructions
fmul        FM                        Floating-point multiply instruction
fcmov1      FA                        Floating-point CMOV - first half
fcmov2      FA                        Floating-point CMOV - second half
fdiv        FA                        Floating-point divide instruction
fsqrt       FA                        Floating-point square root instruction
nop         None                      TRAPB, EXCB, UNOP - LDQ_U R31, 0(Rx)
ftoi        FST0, FST1, L0, L1        FTOIS, FTOIT
itof        L0, L1                    ITOFS, ITOFF, ITOFT
mx_fpcr     FM                        Instructions that move data from the floating-point control register

2.3.2 Ebox Slotting

Instructions that are issued from the IQ, and could execute in either upper or lower Ebox subclusters, are slotted to one pair or the other during the pipeline mapping stage based on the instruction mixture in the fetch line. The codes that are used in Table 2–3 are as follows:
U: The instruction only executes in an upper subcluster.
L: The instruction only executes in a lower subcluster.
E: The instruction could execute in either an upper or lower subcluster.
Table 2–3 defines the slotting rules. The table field Instruction Class 3, 2, 1 and 0 identifies each instruction’s location in the fetch line by the value of bits [3:2] in its PC.
Table 2–3 Instruction Group Definitions and Pipeline Unit
Instruction Class  Slotting   Instruction Class  Slotting
3210               3210       3210               3210
EEEE               ULUL       LLLL               LLLL
EEEL               ULUL       LLLU               LLLU
EEEU               ULLU       LLUE               LLUU
EELE               ULLU       LLUL               LLUL
EELL               UULL       LLUU               LLUU
EELU               ULLU       LUEE               LULU
EEUE               ULUL       LUEL               LUUL
EEUL               ULUL       LUEU               LULU
EEUU               LLUU       LULE               LULU
ELEE               ULUL       LULL               LULL
ELEL               ULUL       LULU               LULU
ELEU               ULLU       LUUE               LUUL
ELLE               ULLU       LUUL               LUUL
ELLL               ULLL       LUUU               LUUU
ELLU               ULLU       UEEE               ULUL
ELUE               ULUL       UEEL               ULUL
ELUL               ULUL       UEEU               ULLU
ELUU               LLUU       UELE               ULLU
EUEE               LULU       UELL               UULL
EUEL               LUUL       UELU               ULLU
EUEU               LULU       UEUE               ULUL
EULE               LULU       UEUL               ULUL
EULL               UULL       UEUU               ULUU
EULU               LULU       ULEE               ULUL
EUUE               LUUL       ULEL               ULUL
EUUL               LUUL       ULEU               ULLU
EUUU               LUUU       ULLE               ULLU
LEEE               LULU       ULLL               ULLL
LEEL               LUUL       ULLU               ULLU
LEEU               LULU       ULUE               ULUL
LELE               LULU       ULUL               ULUL
LELL               LULL       ULUU               ULUU
LELU               LULU       UUEE               UULL
LEUE               LUUL       UUEL               UULL
LEUL               LUUL       UUEU               UULU
LEUU               LLUU       UULE               UULL
LLEE               LLUU       UULL               UULL
LLEL               LLUL       UULU               UULU
LLEU               LLUU       UUUE               UUUL
LLLE               LLLU       UUUL               UUUL
—                  —          UUUU               UUUU
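The U, L, and E codes imply a simple property of every row in Table 2–3: an instruction classed U or L keeps that assignment, and only E positions are resolved by the slot logic. The Python sketch below checks that property for a handful of rows copied from the table. It is an illustration only; the dictionary holds a sample rather than the full table, and the names are assumptions made for this example.

```python
# A few (instruction class -> slotting) rows copied from Table 2-3, in 3210 order.
SLOTTING_SAMPLE = {
    "EEEE": "ULUL",
    "EELL": "UULL",
    "EEUU": "LLUU",
    "LLLE": "LLLU",
    "UUUU": "UUUU",
}


def slotting_is_consistent(instr_class: str, slotting: str) -> bool:
    """Return True if `slotting` is a legal resolution of `instr_class`.

    U and L positions must be preserved; E positions may become U or L.
    """
    if len(instr_class) != 4 or len(slotting) != 4:
        return False
    for cls_code, slot_code in zip(instr_class, slotting):
        if cls_code not in "ULE" or slot_code not in "UL":
            return False
        if cls_code != "E" and cls_code != slot_code:
            return False
    return True


if __name__ == "__main__":
    for cls, slot in SLOTTING_SAMPLE.items():
        print(cls, slot, slotting_is_consistent(cls, slot))
```

The check says nothing about which legal resolution the hardware picks; that choice is given only by the table itself.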

2.3.3 Instruction Latencies

After an instruction is placed in the IQ or FQ, its issue point is determined by the availability of its register operands, functional unit(s), and relationship to other instructions in the queue. There are register producer-consumer dependencies and dynamic functional unit availability dependencies that affect instruction issue. The mapper removes register producer-producer dependencies.
The latency to produce a register result is generally fixed. The one exception is for load instructions that miss the Dcache. Table 2–4 lists the latency, in cycles, for each instruction class.
Table 2–4 Instruction Class Latency in Cycles
Class     Latency  Comments
ild       3        Dcache hit.
          13+      Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop latency if Bcache latency is greater than 6 cycles.
fld       4        Dcache hit.
          14+      Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop latency if Bcache latency is greater than 6 cycles.
lda       1        Possible 1-cycle Ebox cross-cluster delay.
mem_misc  —        Does not produce register value.
ist       —        Does not produce register value.
fst       —        Does not produce register value.
rpcc      1        Possible 1-cycle cross-cluster delay.
rx        1        —
mxpr      1 or 3   HW_MFPR: Ebox IPRs = 1. Ibox and Mbox IPRs = 3. HW_MTPR does not produce a register value.
icbr      —        Conditional branch. Does not produce register value.
ubr       3        Unconditional branch. Does not produce register value.
jsr       3        —
iadd      1        Possible 1-cycle Ebox cross-cluster delay.
ilog      1        Possible 1-cycle Ebox cross-cluster delay.
ishf      1        Possible 1-cycle Ebox cross-cluster delay.
cmov1     1        Only consumer is cmov2. Possible 1-cycle Ebox cross-cluster delay.
cmov2     1        Possible 1-cycle Ebox cross-cluster delay.
imul      7        Possible 1-cycle Ebox cross-cluster delay.
imisc     3        Possible 1-cycle Ebox cross-cluster delay.
fcbr      —        Does not produce register value.
fadd      4        Consumer other than fst or ftoi.
          6        Consumer fst or ftoi. Measured from when an fadd is issued from the FQ to when an fst or ftoi is issued from the IQ.
fmul      4        Consumer other than fst or ftoi.
          6        Consumer fst or ftoi. Measured from when an fmul is issued from the FQ to when an fst or ftoi is issued from the IQ.
fcmov1    4        Only consumer is fcmov2.
fcmov2    4        Consumer other than fst.
          6        Consumer fst or ftoi. Measured from when an fcmov2 is issued from the FQ to when an fst or ftoi is issued from the IQ.
fdiv      12       Single precision - latency to consumer of result value.
          9        Single precision - latency to using divider again.
          15       Double precision - latency to consumer of result value.
          12       Double precision - latency to using divider again.
fsqrt     18       Single precision - latency to consumer of result value.
          15       Single precision - latency to using unit again.
          33       Double precision - latency to consumer of result value.
          30       Double precision - latency to using unit again.
ftoi      3        —
itof      4        —
nop       —        Does not produce register value.
2.4 Instruction Retire Rules
An instruction is retired when it has been executed to completion, and all previous instructions have been retired. The execution pipeline stage in which an instruction becomes eligible to be retired depends upon the instruction’s class.
Table 2–5 gives the minimum retire latencies (assuming that all previous instructions have been retired) for various classes of instructions.
Table 2–5 Minimum Retire Latencies for Instruction Classes
Instruction Class                  Retire Stage  Comments
Integer conditional branch         7             —
Integer multiply                   7/13          Latency is 13 cycles for the MUL/V instruction.
Integer operate                    7             —
Memory                             10            —
Floating-point add                 11            —
Floating-point multiply            11            —
Floating-point DIV/SQRT            11 + latency  Add latency of unit reuse for the instruction indicated in Table 2–4. For example, latency for a single-precision fdiv would be 11 plus 9 from Table 2–4. Latency is 11 if hardware detects that no exception is possible (see Section 2.4.1).
Floating-point conditional branch  11            Branch instruction mispredict is reported in stage 7.
BSR/JSR                            10            JSR instruction mispredict is reported in stage 8.
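The "11 + latency" entry for floating-point DIV/SQRT can be worked through with a small Python sketch. The unit-reuse latencies are the "latency to using divider/unit again" values from Table 2–4; the function and dictionary names are assumptions made for this illustration.

```python
FP_RETIRE_BASE = 11  # retire stage of the FP add class (Table 2-5)

# Unit-reuse latencies from Table 2-4, keyed by (class, precision).
UNIT_REUSE_LATENCY = {
    ("fdiv", "single"): 9,
    ("fdiv", "double"): 12,
    ("fsqrt", "single"): 15,
    ("fsqrt", "double"): 30,
}


def min_retire_stage(instr_class: str, precision: str, early_retire: bool = False) -> int:
    """Minimum retire stage for an fdiv or fsqrt instruction.

    `early_retire` models the case where hardware detects that no
    exception is possible (Section 2.4.1).
    """
    if early_retire:
        return FP_RETIRE_BASE
    return FP_RETIRE_BASE + UNIT_REUSE_LATENCY[(instr_class, precision)]


if __name__ == "__main__":
    print(min_retire_stage("fdiv", "single"))                     # 11 + 9
    print(min_retire_stage("fdiv", "single", early_retire=True))  # 11
```

The single-precision fdiv case reproduces the "11 plus 9" example given in the table comment.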

2.4.1 Floating-Point Divide/Square Root Early Retire

The floating-point divider and square root unit can detect that, for many combinations of source operand values, no exception can be generated. Instructions with these operands can be retired before the result is generated. When detected, they are retired with the same latency as the FP add class. Early retirement is not possible for the following instruction/operand/architecture state conditions:
Instruction is not a DIV or SQRT.
SQRT source operand is negative.
Divide operand exponent_a is 0.
Either operand is NaN or INF.
Divide operand exponent_b is 0.
Trapping mode is /I (inexact).
INE status bit is 0.
Early retirement is also not possible for divide instructions if the resulting exponent has any of the following characteristics (EXP is the result exponent):
DIVT, DIVG: (EXP >= 3FF₁₆) OR (EXP <= 2₁₆)
DIVS, DIVF: (EXP >= 7F₁₆) OR (EXP <= 382₁₆)
2.5 Retire of Operate Instructions into R31/F31
Many instructions that have R31 or F31 as their destination are retired immediately upon decode (stage 3). These instructions do not produce a result and are removed from the pipeline as well. They do not occupy a slot in the issue queues and do not occupy a functional unit. Table 2–6 lists these instructions and some of their characteristics. The instruction type in Table 2–6 is from Table C-6 in Appendix C of the Alpha Architecture Handbook, Version 4.
Table 2–6 Instructions Retired Without Execution
Instruction Type         Notes
INTA, INTL, INTM, INTS   All with R31 as destination.
FLTI, FLTL, FLTV         All with F31 as destination. MT_FPCR is not included because it has no destination; it is never removed from the pipeline.
LDQ_U                    All with R31 as destination.
MISC                     TRAPB and EXCB are always removed. Others are never removed.
FLTS                     All (SQRT, ITOF) with F31 as destination.

2.6 Load Instructions to R31 and F31

This section describes how the 21264/EV68A processes software-directed prefetch transactions and load instructions with a destination of R31 and F31.
Prefetches allocate a MAF entry. How the MAF entry is allocated is what distinguishes the type of prefetch. A normal prefetch is equivalent to a normal load MAF (that is, a MAF entry that puts the block into the Dcache in a readable state). A prefetch with modify intent is equivalent to a normal store MAF (that is, a MAF entry that puts the block into the Dcache in a writeable state). A prefetch, evict next, is equivalent to a normal load MAF, with the additional behavior described in Section 2.6.3.
A prefetch is not performed if the prefetch hits in the Dcache (as if it were a normal load).
Load operations to R31 and F31 may generate exceptions. These exceptions must be dismissed by PALcode.
The following sections describe the operational prefetch behavior of these instructions.

2.6.1 Normal Prefetch: LDBU, LDF, LDG, LDL, LDT, LDWU, HW_LDL Instructions

The 21264/EV68A processes these instructions as normal cache line prefetches. If the load instruction hits the Dcache, the instruction is dismissed; otherwise, the addressed cache block is allocated into the Dcache.
The HW_LDL instruction construct equates to the HW_LD instruction with the LEN field clear. See Table 6–3.

2.6.2 Prefetch with Modify Intent: LDS Instruction

The 21264/EV68A processes an LDS instruction, with F31 as the destination, as a prefetch with modify intent transaction (ReadBlkMod command). If the transaction hits a dirty Dcache block, the instruction is dismissed. Otherwise, the addressed cache block is allocated into the Dcache for write access, with its dirty and modified bits set.

2.6.3 Prefetch, Evict Next: LDQ and HW_LDQ Instructions

The 21264/EV68A processes this instruction like a normal prefetch transaction (ReadBlkSpec command), with one exception: if the load misses the Dcache, the addressed cache block is allocated into the Dcache, but the Dcache set allocation pointer is left pointing to this block. The next miss to the same Dcache line will evict the block. For example, this instruction might be used when software is reading an array that is known to fit in the offchip Bcache, but will not fit into the onchip Dcache. In this case, the instruction ensures that the hardware provides the desired prefetch function without displacing useful cache blocks stored in the other set within the Dcache.
The HW_LDQ instruction construct equates to the HW_LD instruction with the LEN field set. See Table 6–3.
2.7 Special Cases of Alpha Instruction Execution
This section describes the mechanisms that the 21264/EV68A uses to process irregular instructions in the Alpha instruction set, and cases in which the 21264/EV68A processes instructions in a non-intuitive way.

2.7.1 Load Hit Speculation

The latency of integer load instructions that hit in the Dcache is three cycles. Figure 2–9 shows the pipeline timing for these integer load instructions. In Figure 2–9:
Symbol  Meaning
Q       Issue queue
R       Register file read
E       Execute
D       Dcache access
B       Data bus active
Figure 2–9 Pipeline Timing for Integer Load Instructions
[Timing diagram over cycles 1 through 8, with rows for ILD (stages Q, R, E, D, B), Instruction 1 (Q, R), and Instruction 2 (Q), and the Dcache hit point marked.]
There are two cycles in which the IQ may speculatively issue instructions that use load data before Dcache hit information is known. Any instructions that are issued by the IQ within this 2-cycle speculative window are kept in the IQ with their requests inhibited until the load instruction’s hit condition is known, even if they are not dependent on the load operation. If the load instruction hits, then these instructions are removed from the queue. If the load instruction misses, then the execution of these instructions is aborted and the instructions are allowed to request service again.
For example, in Figure 2–9, instruction 1 and instruction 2 are issued within the speculative window of the load instruction. If the load instruction hits, then both instructions will be deleted from the queue by the start of cycle 7, one cycle later than normal for instruction 1 and at the normal time for instruction 2. If the load instruction misses, both instructions are aborted from the execution pipelines and may request service again in cycle 6.
IQ-issued instructions are aborted if issued within the speculative window of an integer load instruction that missed in the Dcache, even if they are not dependent on the load data. However, if software misses are likely, the 21264/EV68A can still benefit from scheduling the instruction stream for Dcache miss latency. The 21264/EV68A includes a saturating counter that is incremented when load instructions hit and is decremented when load instructions miss. When the upper bit of the counter equals zero, the integer load latency is increased to five cycles and the speculative window is removed. The counter is 4 bits wide and is incremented by 1 on a hit and is decremented by two on a miss.
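The hit/miss counter described above can be modeled in a few lines of Python. This is a behavioral sketch only: the class and attribute names are illustrative, and the initial counter value is an assumption, since the text does not state the reset value.

```python
class LoadHitCounter:
    """4-bit saturating counter: +1 on a Dcache hit, -2 on a miss."""

    MAX_VALUE = 0b1111  # 4 bits wide, per the text

    def __init__(self, value: int = MAX_VALUE):
        # Starting at the maximum is an assumption made for this sketch.
        self.value = value

    def record(self, hit: bool) -> None:
        if hit:
            self.value = min(self.MAX_VALUE, self.value + 1)
        else:
            self.value = max(0, self.value - 2)

    @property
    def speculation_enabled(self) -> bool:
        # The speculative window is removed when the upper bit is zero.
        return bool(self.value & 0b1000)

    @property
    def integer_load_latency(self) -> int:
        # Three cycles with hit speculation, five cycles without it.
        return 3 if self.speculation_enabled else 5


if __name__ == "__main__":
    counter = LoadHitCounter()
    for _ in range(4):
        counter.record(hit=False)
    print(counter.value, counter.integer_load_latency)
```

Because a miss subtracts two and a hit adds only one, a miss rate above roughly one in three keeps pulling the counter toward the non-speculative state.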
Since load instructions to R31 do not produce a result, they do not create a speculative window when they execute and, therefore, never waste IQ-issue cycles if they miss.
Floating-point load instructions that hit in the Dcache have a latency of four cycles. Figure 2–10 shows the pipeline timing for floating-point load instructions. In Figure 2–10:
Symbol  Meaning
Q       Issue queue
R       Register file read
E       Execute
D       Dcache access
B       Data bus active
Figure 2–10 Pipeline Timing for Floating-Point Load Instructions
[Timing diagram over cycles 1 through 8, with rows for FLD (stages Q, R, E, D, B), Instruction 1 (Q, R), and Instruction 2 (Q), and the Dcache hit point marked.]
The speculative window for floating-point load instructions is one cycle wide. FQ-issued instructions that are issued within the speculative window of a floating-point load instruction that has missed are only aborted if they depend on the load being successful.
For example, in Figure 2–10 instruction 1 is issued in the speculative window of the load instruction.
If instruction 1 is not a user of the data returned by the load instruction, then it is removed from the queue at its normal time (at the start of cycle 7).
If instruction 1 is dependent on the load instruction data and the load instruction hits, instruction 1 is removed from the queue one cycle later (at the start of cycle 8). If the load instruction misses, then instruction 1 is aborted from the Fbox pipeline and may request service again in cycle 7.

2.7.2 Floating-Point Store Instructions

Floating-point store instructions are duplicated and loaded into both the IQ and the FQ from the mapper. Each IQ entry contains a control bit, fpWait, that when set prevents that entry from asserting its requests. This bit is initially set for each floating-point store instruction that enters the IQ, unless it was the target of a replay trap. The instruction’s FQ clone is issued when its Ra register is about to become clean, resulting in its IQ clone’s fpWait bit being cleared and allowing the IQ clone to issue and be executed by the Mbox. This mechanism ensures that floating-point store instructions are always issued to the Mbox, along with the associated data, without requiring the floating-point register dirty bits to be available within the IQ.

2.7.3 CMOV Instruction

For the 21264/EV68A, the Alpha CMOV instruction has three operands, and so presents a special case. The required operation is to move either the value in register Rb or the value from the old physical destination register into the new destination register, based upon the value in Ra. Since neither the mapper nor the Ebox and Fbox data paths are otherwise required to handle three operand instructions, the CMOV instruction is decomposed by the Ibox pipeline into two 2-operand instructions:
The Alpha architecture instruction
CMOV Ra, Rb ⇒ Rc
Becomes the 21264/EV68A instructions
CMOV1 Ra, oldRc ⇒ newRc1
CMOV2 newRc1, Rb ⇒ newRc2
The first instruction, CMOV1, tests the value of Ra and records the result of this test in a 65th bit of its destination register, newRc1. It also copies the value of the old physical destination register, oldRc, to newRc1.
The second instruction, CMOV2, then copies either the value in newRc1 or the value in Rb into a second physical destination register, newRc2, based on the CMOV predicate bit stored in newRc1.
In summary, the original CMOV instruction is decomposed into two dependent instructions that each use a physical register from the free list.
To further simplify this operation, the two component instructions of a CMOV instruction are driven through the mappers in successive cycles. Hence, if a fetch line contains n CMOV instructions, it takes n+1 cycles to run that fetch line through the mappers.
For example, the following fetch line:
ADD CMOVx SUB CMOVy
Results in the following three map cycles:
ADD CMOVx1
CMOVx2 SUB CMOVy1
CMOVy2
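The n+1 rule can be reproduced with a short Python sketch that splits a fetch line into map cycles. It is modeled only on the example above: each CMOV is split into two halves, and the second half always starts a new map cycle. The function name and the convention of recognizing CMOV by its mnemonic prefix are assumptions made for this illustration.

```python
from typing import List


def map_cycles(fetch_line: List[str]) -> List[List[str]]:
    """Group a fetch line into map cycles, splitting each CMOV in two.

    A mnemonic starting with "CMOV" is expanded into "<name>1" and
    "<name>2"; the second half is deferred to the following cycle.
    """
    cycles: List[List[str]] = [[]]
    for mnemonic in fetch_line:
        if mnemonic.upper().startswith("CMOV"):
            cycles[-1].append(mnemonic + "1")   # first half ends this cycle
            cycles.append([mnemonic + "2"])     # second half opens the next
        else:
            cycles[-1].append(mnemonic)
    return cycles


if __name__ == "__main__":
    for cycle in map_cycles(["ADD", "CMOVx", "SUB", "CMOVy"]):
        print(" ".join(cycle))
```

Running it on the example fetch line yields the same three groups as listed above, and a fetch line with no CMOV instructions maps in a single cycle.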

The Ebox executes integer CMOV instructions as two distinct 1-cycle latency operations. The Fbox add pipeline executes floating-point CMOV instructions as two distinct 4-cycle latency operations.
2.8 Memory and I/O Address Space Instructions
This section provides an overview of the way the 21264/EV68A processes memory and I/O address space instructions.
The 21264/EV68A supports, and internally recognizes, a 44-bit physical address space that is divided equally between memory address space and I/O address space. Memory address space resides in the lower half of the physical address space (PA[43]=0) and I/O address space resides in the upper half of the physical address space (PA[43]=1).
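The split of the physical address space on PA[43] can be expressed directly in code. The following Python sketch classifies a physical address as memory or I/O space; the function name and the returned strings are illustrative assumptions.

```python
PA_BITS = 44  # width of the physical address space, per the text


def address_space(physical_address: int) -> str:
    """Classify a 44-bit physical address by the value of PA[43]."""
    if not 0 <= physical_address < (1 << PA_BITS):
        raise ValueError("address does not fit in 44 bits")
    return "io" if (physical_address >> 43) & 1 else "memory"


if __name__ == "__main__":
    print(address_space(0x000_0000_1000))  # lower half: memory space
    print(address_space(1 << 43))          # upper half: I/O space
```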
The IQ can issue any combination of load and store instructions to the Mbox at the rate of two per cycle. The two lower Ebox subclusters, L0 and L1, generate the 48-bit effective virtual address for these instructions.
An instruction is defined to be newer than another instruction if it follows that instruction in program order and is older if it precedes that instruction in program order.

2.8.1 Memory Address Space Load Instructions

The Mbox begins execution of a load instruction by translating its virtual address to a physical address using the DTB and by accessing the Dcache. The Dcache is virtually indexed, allowing these two operations to be done in parallel. The Mbox puts information about the load instruction, including its physical address, destination register, and data format, into the LQ.
If the requested physical location is found in the Dcache (a hit), the data is formatted and written into the appropriate integer or floating-point register. If the location is not in the Dcache (a miss), the physical address is placed in the miss address file (MAF) for processing by the Cbox. The MAF performs a merging function in which a new miss address is compared to miss addresses already held in the MAF. If the new miss address points to the same Dcache block as a miss address in the MAF, then the new miss address is discarded.
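The MAF merging function can be sketched as a small Python model. This is a simplified illustration, not a description of the hardware: the 64-byte Dcache block size is an assumption not stated in this section, the behavior when all eight entries are in use is not described here and is reported simply as "full", and all names are hypothetical.

```python
BLOCK_BYTES = 64   # assumed Dcache block size; not stated in this section
MAF_ENTRIES = 8    # MAF size from Section 2.1.6.3


class MissAddressFile:
    """Toy model of MAF merging for Dcache miss addresses."""

    def __init__(self) -> None:
        self.pending_blocks = []  # block numbers with outstanding fills

    def record_miss(self, physical_address: int) -> str:
        block = physical_address // BLOCK_BYTES
        if block in self.pending_blocks:
            return "merged"       # same block already pending: address discarded
        if len(self.pending_blocks) >= MAF_ENTRIES:
            return "full"         # handling of this case is outside this sketch
        self.pending_blocks.append(block)
        return "allocated"


if __name__ == "__main__":
    maf = MissAddressFile()
    print(maf.record_miss(0x1000))  # new block
    print(maf.record_miss(0x1008))  # same 64-byte block
```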
When Dcache fill data is returned to the Dcache by the Cbox, the Mbox satisfies the requesting load instructions in the LQ.

2.8.2 I/O Address Space Load Instructions

Because I/O space load instructions may have side effects, they cannot be performed speculatively. When the Mbox receives an I/O space load instruction, the Mbox places the load instruction in the LQ, where it is held until it retires. The Mbox replays retired I/O space load instructions from the LQ to the MAF in program order, at a rate of one per GCLK cycle.
The Mbox allocates a new MAF entry to an I/O load instruction and increases I/O bandwidth by attempting to merge I/O load instructions in a merge register. Table 2–7 shows the rules for merging data. The columns represent the load instructions replayed to the MAF while the rows represent the size of the load in the merge register.
Table 2–7 Rules for I/O Address Space Load Instruction Data Merging
Merge Register/
Replayed Instruction  Load Byte/Word  Load Longword         Load Quadword
Byte/Word             No merge        No merge              No merge
Longword              No merge        Merge up to 32 bytes  No merge
Quadword              No merge        No merge              Merge up to 64 bytes
In summary, Table 2–7 shows some of the following rules:
Byte/word load instructions and different size load instructions are not allowed to merge.
A stream of ascending non-overlapping, but not necessarily consecutive, longword load instructions are allowed to merge into naturally aligned 32-byte blocks.
A stream of ascending non-overlapping, but not necessarily consecutive, quadword load instructions are allowed to merge into naturally aligned 64-byte blocks.
Merging of quadwords can be limited to naturally-aligned 32-byte blocks based on the Cbox WRITE_ONCE chain 32_BYTE_IO field.
Issued MB, WMB, and I/O load instructions close the I/O register merge window.
To minimize latency, the merge window is also closed when a timer detects no I/O store instruction activity for 1024 cycles.
After the Mbox I/O register has closed its merge window, the Cbox sends I/O read requests offchip in the order that they were received from the Mbox.
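The size and alignment rules from Table 2–7 can be summarized in a small predicate. The Python sketch below is a simplification: it compares an incoming access only against the previous access in the merge register, assumes naturally aligned accesses, and ignores the events that close the merge window (MB, WMB, and the inactivity timer). The function and parameter names are assumptions for illustration.

```python
# Merge block size in bytes, keyed by access size in bytes (Table 2-7).
MERGE_BLOCK_BYTES = {4: 32, 8: 64}   # longword -> 32 bytes, quadword -> 64 bytes


def can_merge(prev_addr: int, prev_size: int, addr: int, size: int,
              quad_limit_32: bool = False) -> bool:
    """Return True if the new I/O access may merge with the previous one.

    `quad_limit_32` models the 32_BYTE_IO setting that limits quadword
    merging to naturally aligned 32-byte blocks.
    """
    if prev_size != size or size not in MERGE_BLOCK_BYTES:
        return False                      # byte/word and mixed sizes never merge
    block = 32 if (size == 8 and quad_limit_32) else MERGE_BLOCK_BYTES[size]
    ascending = addr >= prev_addr + prev_size        # ascending, non-overlapping
    same_block = prev_addr // block == addr // block
    return ascending and same_block


if __name__ == "__main__":
    print(can_merge(0x100, 4, 0x108, 4))   # longwords in one 32-byte block
    print(can_merge(0x100, 8, 0x120, 8))   # quadwords in one 64-byte block
```

The second access in each call skips an address, which is allowed because merged accesses need not be consecutive.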

2.8.3 Memory Address Space Store Instructions

The Mbox begins execution of a store instruction by translating its virtual address to a physical address using the DTB and by probing the Dcache. The Mbox puts information about the store instruction, including its physical address, its data, and the results of the Dcache probe, into the store queue (SQ).
If the Mbox does not find the addressed location in the Dcache, it places the address into the MAF for processing by the Cbox. If the Mbox finds the addressed location in a Dcache block that is not dirty, then it places a ChangeToDirty request into the MAF.
A store instruction can write its data into the Dcache when it is retired, and when the Dcache block containing its address is dirty and not shared. SQ entries that meet these two conditions can be placed into the writable state. These SQ entries are placed into the writable state in program order at a maximum rate of two entries per cycle. The Mbox transfers writable store queue entry data from the SQ to the Dcache in program order at a maximum rate of two entries per cycle. Dcache lines associated with writable store queue entries are locked by the Mbox. System port probe commands cannot evict these blocks until their associated writable SQ entries have been transferred into the Dcache. This restriction assists in STx_C instruction and Dcache ECC processing.
SQ entry data that has not been transferred to the Dcache may source data to newer load instructions. The Mbox compares the virtual Dcache index bits of incoming load instructions to queued SQ entries, and sources the data from the SQ, bypassing the Dcache, when necessary.
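
The conditions for moving SQ entries into the writable state can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not a description of the Mbox logic: the dictionary keys (retired, dirty, shared, writable) and the function name advance_writable are hypothetical names chosen for the example.

```python
# Illustrative model only; field and function names are hypothetical.
def advance_writable(sq_entries, max_per_cycle=2):
    """Promote SQ entries to the writable state for one cycle.

    sq_entries is a list of dicts in program order (oldest first).
    Returns the number of entries promoted during this cycle.
    """
    promoted = 0
    for entry in sq_entries:
        if entry["writable"]:
            continue  # already promoted in an earlier cycle
        # A store is eligible once it is retired and its Dcache block
        # is dirty and not shared.
        eligible = entry["retired"] and entry["dirty"] and not entry["shared"]
        if not eligible or promoted == max_per_cycle:
            break  # program order: a blocked entry holds back all newer ones
        entry["writable"] = True
        promoted += 1
    return promoted
```

In this model, an older store whose block is not yet dirty blocks every newer store behind it, which reflects the program-order requirement stated above.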

2.8.4 I/O Address Space Store Instructions

The Mbox begins processing I/O space store instructions, like memory space store instructions, by translating the virtual address and placing the state associated with the store instruction into the SQ.
The Mbox replays retired I/O space store entries from the SQ to the IOWB in program order at a rate of one per GCLK cycle. The Mbox never allows queued I/O space store instructions to source data to subsequent load instructions.
The Cbox maximizes I/O bandwidth when it allocates a new IOWB entry to an I/O store instruction by attempting to merge I/O store instructions in a merge register. Table 2–8 shows the rules for I/O space store instruction data merging. The columns represent the store instructions replayed to the IOWB while the rows represent the size of the store in the merge register.
Table 2–8 Rules for I/O Address Space Store Instruction Data Merging
Merge Register/
Replayed Instruction   Store Byte/Word   Store Longword         Store Quadword
Byte/Word              No merge          No merge               No merge
Longword               No merge          Merge up to 32 bytes   No merge
Quadword               No merge          No merge               Merge up to 64 bytes
Table 2–8 shows some of the following rules:
Byte/word store instructions and different size store instructions are not allowed to merge.
A stream of ascending non-overlapping, but not necessarily consecutive, longword store instructions are allowed to merge into naturally aligned 32-byte blocks.
A stream of ascending non-overlapping, but not necessarily consecutive, quadword store instructions are allowed to merge into naturally aligned 64-byte blocks.
Merging of quadwords can be limited to naturally aligned 32-byte blocks based on the Cbox WRITE_ONCE chain 32_BYTE_IO field.
Issued MB, WMB, and I/O load instructions close the I/O register merge window.
To minimize latency, the merge window is also closed when a timer detects no I/O store instruction activity for 1024 cycles.
After the IOWB merge register has closed its merge window, the Cbox sends I/O space store requests offchip in the order that they were received from the Mbox.
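
The size and alignment rules for I/O store merging can be sketched as a short predicate. This is an illustrative model, not the Cbox logic. The names can_merge, merge_block_bytes, and limit_32_byte_io are assumptions made for the example, and the sketch assumes that every address is naturally aligned to its access size and that the merge register remembers the address of the last store it merged.

```python
# Illustrative model of the I/O store merge rules; names are hypothetical.
SIZE_BYTES = {"byte": 1, "word": 2, "longword": 4, "quadword": 8}

def merge_block_bytes(size, limit_32_byte_io=False):
    """Return the merge block size in bytes, or 0 if this size never merges."""
    if size == "longword":
        return 32
    if size == "quadword":
        # limit_32_byte_io models the Cbox WRITE_ONCE chain 32_BYTE_IO field.
        return 32 if limit_32_byte_io else 64
    return 0  # byte and word stores never merge

def can_merge(reg_size, reg_last_addr, new_size, new_addr, limit_32_byte_io=False):
    """Decide whether a replayed store may join the current merge register."""
    block = merge_block_bytes(reg_size, limit_32_byte_io)
    if block == 0 or new_size != reg_size:
        return False  # different sizes and byte/word stores do not merge
    same_block = (reg_last_addr // block) == (new_addr // block)
    # Ascending and non-overlapping, but not necessarily consecutive.
    ascending = new_addr >= reg_last_addr + SIZE_BYTES[reg_size]
    return same_block and ascending
```

For example, under these assumptions can_merge("quadword", 0x00, "quadword", 0x10) is true, while the same pair of stores at 0x00 and 0x20 does not merge when limit_32_byte_io is set.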

2.9 MAF Memory Address Space Merging Rules
Because all memory transactions are to 64-byte blocks, efficiency is improved by merging several small data transactions into a single larger data transaction. Table 2–9 lists the rules the 21264/EV68A uses when merging memory transactions into 64-byte naturally aligned data block transactions. Rows represent the merged instruction in the MAF and columns represent the new issued transaction.
Table 2–9 MAF Merging Rules
MAF/New   LDx     STx     STx_C   WH64    ECB     Istream
LDx       Merge   -       -       -       -       -
STx       Merge   Merge   -       -       -       -
STx_C     -       -       Merge   -       -       -
WH64      -       -       -       Merge   -       -
ECB       -       -       -       -       Merge   -
Istream   -       -       -       -       -       Merge
In summary, Table 2–9 shows that only like instruction types, with the exception of load instructions merging with store instructions, are merged.
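
The MAF merging rules can be expressed as a lookup keyed by the transaction type already held in the MAF. The following is a minimal sketch, not the hardware implementation. The names MAF_MERGE and maf_can_merge are hypothetical, and the check that both addresses fall in the same naturally aligned 64-byte block is an assumption drawn from the description of 64-byte block transactions above.

```python
# Illustrative lookup for the MAF merging rules; names are hypothetical.
# Key: transaction type already in the MAF (table row).
# Value: new transaction types that may merge with it (table columns).
MAF_MERGE = {
    "LDx": {"LDx"},
    "STx": {"LDx", "STx"},
    "STx_C": {"STx_C"},
    "WH64": {"WH64"},
    "ECB": {"ECB"},
    "Istream": {"Istream"},
}

BLOCK_SHIFT = 6  # 64-byte naturally aligned blocks

def maf_can_merge(maf_type, maf_addr, new_type, new_addr):
    """Return True if the new transaction may merge with the MAF entry."""
    same_block = (maf_addr >> BLOCK_SHIFT) == (new_addr >> BLOCK_SHIFT)
    return same_block and new_type in MAF_MERGE.get(maf_type, set())
```

Note the asymmetry in this model: a new load may merge with a store entry in the MAF, but a new store does not merge with a load entry.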

2.10 Instruction Ordering

In the absence of explicit instruction ordering, such as with MB or WMB instructions, the 21264/EV68A maintains a default instruction ordering relationship between pairs of load and store instructions.
The 21264/EV68A maintains the default memory data instruction ordering as shown in Table 2–10 (assume address X and address Y are different).
Table 2–10 Memory Reference Ordering
First Instruction in Pair     Second Instruction in Pair    Reference Order
Load memory to address X      Load memory to address X      Maintained (litmus test 1)
Load memory to address X      Load memory to address Y      Not maintained
Store memory to address X     Store memory to address X     Maintained
Store memory to address X     Store memory to address Y     Maintained
Load memory to address X      Store memory to address X     Maintained
Load memory to address X      Store memory to address Y     Not maintained
Store memory to address X     Load memory to address X      Maintained
Store memory to address X     Load memory to address Y      Not maintained
The 21264/EV68A maintains the default I/O instruction ordering as shown in Table 2–11 (assume address X and address Y are different).
Table 2–11 I/O Reference Ordering
First Instruction in Pair   Second Instruction in Pair   Reference Order
Load I/O to address X       Load I/O to address X        Maintained
Load I/O to address X       Load I/O to address Y        Maintained
Store I/O to address X      Store I/O to address X       Maintained
Store I/O to address X      Store I/O to address Y       Maintained
Load I/O to address X       Store I/O to address X       Maintained
Load I/O to address X       Store I/O to address Y       Not maintained
Store I/O to address X      Load I/O to address X        Maintained
Store I/O to address X      Load I/O to address Y        Not maintained
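
The default ordering rules for memory space and I/O space can be condensed into one lookup. The sketch below is illustrative only. The function name order_maintained and the string labels "memory", "io", "load", and "store" are assumptions made for the example. It relies on the observation that every same-address pair in both tables is listed as maintained, so only the different-address pairs need to be enumerated.

```python
# Illustrative encoding of the default ordering tables; names are hypothetical.
# Pairs (first, second) whose order is maintained when the addresses differ.
MAINTAINED_DIFFERENT_ADDRESS = {
    "memory": {("store", "store")},
    "io": {("load", "load"), ("store", "store")},
}

def order_maintained(space, first, second, same_address):
    """Return True if the reference order of the pair is maintained by default."""
    if same_address:
        return True  # every same-address pair is listed as maintained
    return (first, second) in MAINTAINED_DIFFERENT_ADDRESS[space]
```

Software that needs an ordering this lookup reports as not maintained must use explicit ordering, such as an MB or WMB instruction.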

2.11 Replay Traps

There are some situations in which a load or store instruction cannot be executed due to a condition that occurs after that instruction issues from the IQ or FQ. The instruction is aborted (along with all newer instructions) and restarted from the fetch stage of the pipeline. This mechanism is called a replay trap.

2.11.1 Mbox Order Traps

Load and store instructions may be issued from the IQ in a different order than they were fetched from the Icache, while the architecture dictates that Dstream memory transactions to the same physical bytes must be completed in order. Usually, the Mbox manages the memory reference stream by itself to achieve architecturally correct behavior, but the two cases in which the Mbox uses replay traps to manage the memory stream are load-load and store-load order traps.
2.11.1.1 Load-Load Order Trap
The Mbox ensures that load instructions that read the same physical byte(s) ultimately issue in correct order by using the load-load order trap. The Mbox compares the address of each load instruction, as it is issued, to the address of all load instructions in the load queue. If the Mbox finds a newer load instruction in the load queue, it invokes a load-load order trap on the newer instruction. This is a replay trap that aborts the target of the trap and all newer instructions from the machine and refetches instructions starting at the target of the trap.
2.11.1.2 Store-Load Order Trap
The Mbox ensures that a load instruction ultimately issues after an older store instruction that writes some portion of its memory operand by using the store-load order trap. The Mbox compares the address of each store instruction, as it is issued, to the address of all load instructions in the load queue. If the Mbox finds a newer load instruction in the load queue, it invokes a store-load order trap on the load instruction. This is a replay trap. It functions like the load-load order trap.
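
Both order traps use the same comparison: an issuing instruction is checked against newer loads that are already in the load queue. The sketch below models that check in software and is not the Mbox implementation. The function name find_order_trap, the use of program-order sequence numbers, the exact-match address comparison, and the choice to trap on the oldest matching newer load are all assumptions made for illustration.

```python
# Illustrative model of the order-trap check; names and details are hypothetical.
def find_order_trap(issuing_seq, issuing_addr, load_queue):
    """Return the sequence number of the load to trap on, or None.

    issuing_seq  -- program-order sequence number of the issuing load or store
    issuing_addr -- physical address it references
    load_queue   -- iterable of (seq, addr) pairs for loads already issued
    """
    newer_matches = [
        seq for seq, addr in load_queue
        if seq > issuing_seq and addr == issuing_addr
    ]
    # Assumption: trapping on the oldest matching newer load is sufficient,
    # because a replay trap also aborts every instruction newer than its target.
    return min(newer_matches) if newer_matches else None
```

Passing an issuing load models the load-load order trap, and passing an issuing store models the store-load order trap.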

The Ibox contains extra hardware to reduce the frequency of the store-load trap. There is a 1-bit by 1024-entry VPC-indexed table in the Ibox called the stWait table. When an Icache instruction is fetched, the associated stWait table entry is fetched along with the Icache instruction. The stWait table produces 1 bit for each instruction accessed from the Icache. When a load instruction gets a store-load order replay trap, its associated bit in the stWait table is set during the cycle that the load is refetched. Hence, the trapping load instruction’s stWait bit will be set the next time it is fetched.
The IQ will not issue load instructions whose stWait bit is set while there are older unissued store instructions in the queue. A load instruction whose stWait bit is set can be issued the cycle immediately after the last older store instruction is issued from the queue. All the bits in the stWait table are unconditionally cleared every 16384 cycles, or every 65536 cycles if I_CTL[ST_WAIT_64K] is set.
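
The behavior of the stWait table can be approximated with a small model. This is a sketch under stated assumptions and not the Ibox design. The class and method names are hypothetical, and the index function is an assumption, since the text states only that the table is VPC-indexed and has 1024 entries.

```python
# Illustrative model of the stWait table; names and index function are hypothetical.
class StWaitTable:
    ENTRIES = 1024  # 1 bit per entry

    def __init__(self, clear_interval=16384):
        # Pass 65536 to model I_CTL[ST_WAIT_64K] being set.
        self.clear_interval = clear_interval
        self.bits = [False] * self.ENTRIES
        self.cycle = 0

    def _index(self, vpc):
        # Assumption: index by instruction number (instructions are 4 bytes).
        return (vpc >> 2) % self.ENTRIES

    def record_store_load_trap(self, vpc):
        """Set the bit for a load that took a store-load order replay trap."""
        self.bits[self._index(vpc)] = True

    def load_may_issue(self, vpc, older_unissued_stores):
        """A marked load waits while older stores remain unissued in the queue."""
        return not (self.bits[self._index(vpc)] and older_unissued_stores > 0)

    def tick(self):
        """Advance one cycle and clear the whole table at each interval."""
        self.cycle += 1
        if self.cycle % self.clear_interval == 0:
            self.bits = [False] * self.ENTRIES
```

The periodic clear keeps a load from being held back indefinitely after a single trap, at the cost of possibly taking the trap again later.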

2.11.2 Other Mbox Replay Traps

The Mbox also uses replay traps to control the flow of the load queue and store queue, and to ensure that there are never multiple outstanding misses to different physical addresses that map to the same Dcache or Bcache line. Unlike the order traps, however, these replay traps are invoked on the incoming instruction that triggered the condition.
2.12 I/O Write Buffer and the WMB Instruction
The I/O write buffer (IOWB) consists of four 64-byte entries with the associated address and control logic used to buffer I/O write data between the store queue (SQ) and the system port.

2.12.1 Memory Barrier (MB/WMB/TB Fill Flow)

The Cbox CSR SYSBUS_MB_ENABLE bit determines if MB instructions produce external system port transactions. When the SYSBUS_MB_ENABLE bit equals 0, the Cbox CSR MB_CNT[3:0] field contains the number of pending uncommitted transactions. The counter will increment for each of the following commands:
RdBlk, RdBlkMod, RdBlkI
RdBlkSpec (valid), RdBlkModSpec (valid), RdBlkSpecI (valid)
RdBlkVic, RdBlkModVic, RdBlkVicI
CleanToDirty, SharedToDirty, STChangeToDirty, InvalToDirty
FetchBlk, FetchBlkSpec (valid), Evict
RdByte, RdLw, RdQw, WrByte, WrLW, WrQW
The counter is decremented with the C (commit) bit in the Probe and SysDc commands (see Section 4.7.7). Systems can assert the C bit in the SysDc fill response to the commands that originally incremented the counter, or attached to the last probe seen by that command when it reached the system serialization point. If the number of uncommitted transactions reaches 15 (saturating the counter), the Cbox will stall MAF and IOWB processing until at least one of the pending transactions has been committed. Probe processing is not interrupted by the state of this counter.
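
The counting and stall behavior of MB_CNT[3:0] can be modeled as a saturating counter. The sketch below is illustrative and is not the Cbox implementation. The class name MbCounter and its method names are hypothetical.

```python
# Illustrative model of the MB_CNT[3:0] counter; names are hypothetical.
class MbCounter:
    MAX = 15  # a 4-bit counter saturates at 15

    def __init__(self):
        self.count = 0

    @property
    def stalled(self):
        """MAF and IOWB processing stalls while the counter is saturated."""
        return self.count == self.MAX

    def issue(self):
        """Try to issue one counted command. Returns False if stalled."""
        if self.stalled:
            return False
        self.count += 1
        return True

    def commit(self):
        """Model a Probe or SysDc command that carries the C (commit) bit."""
        if self.count > 0:
            self.count -= 1
```

In this model, the sixteenth call to issue() without an intervening commit() is refused, and a single commit() is enough to resume processing.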
2.12.1.1 MB Instruction Processing
When an MB instruction is fetched in the predicted instruction execution path, it stalls in the map stage of the pipeline. This also stalls all instructions after the MB, and control of instruction flow is based upon the value in Cbox CSR SYSBUS_MB_ENABLE as follows:
If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the IQ is empty and then performs the following actions:
a. Sends all pending MAF and IOWB entries to the system port.
b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed events. When the counter decrements from one to zero, the Cbox marks the youngest probe queue entry.
c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and IOWB are empty.
When all of the above have occurred and a probe response has been sent to the system for the marked probe queue entry, instruction execution continues with the instruction after the MB.
If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox waits until the IQ is empty and then performs the following actions:
a. Sends all pending MAF and IOWB entries to the system port.
b. Sends the MB command to the system port.
c. Waits until the MB command is acknowledged, then marks the youngest entry in the probe queue.
d. Waits until the MAF contains no more Dstream references and the SQ, LQ, and IOWB are empty.
When all of the above have occurred and a probe response has been sent to the system for the marked probe queue entry, instruction execution continues with the instruction after the MB.
Because the MB instruction is executed speculatively, MB processing can begin and the original MB can be killed. In the internal acknowledge case, the MB may have already been sent to the system interface, and the system is still expected to respond to the MB.
2.12.1.2 WMB Instruction Processing
Write memory barrier (WMB) instructions are issued into the Mbox store queue, where they wait until they are retired and all prior store instructions become writable. The Mbox then stalls the writable pointer and informs the Cbox. The Cbox closes the IOWB merge register and responds in one of the following two ways:
If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox performs the following actions:
a. Stalls further MAF and IOWB processing.
b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed events. When the counter decrements from one to zero, the Cbox marks the youngest probe queue entry.
c. When a probe response has been sent to the system for the marked probe queue entry, the Cbox considers the WMB to be satisfied.
If Cbox CSR SYSBUS_MB_ENABLE is set, the Cbox performs the following actions:
a. Stalls further MAF and IOWB processing.
b. Sends the MB command to the system port.
c. Waits until the MB command is acknowledged by the system with a SysDc MBDone command, then sends acknowledge and marks the youngest entry in the probe queue.
d. When a probe response has been sent to the system for the marked probe queue entry, the Cbox considers the WMB to be satisfied.
2.12.1.3 TB Fill Flow
Load instructions (HW_LDs) to a virtual page table entry (VPTE) are processed by the 21264/EV68A to avoid litmus test problems associated with the ordering of memory transactions from another processor against loading of a page table entry and the subsequent virtual-mode load from this processor.
Consider the sequence shown in Table 2–12. The data could be in the Bcache. Pj should fetch datai if it is using PTEi.
Table 2–12 TB Fill Flow Example Sequence 1
Pi            Pj
Write Datai   Load/Store datai
MB            <TB miss>
Write PTEi    Load-PTE
              <write TB>
              Load/Store (restart)
Also consider the related sequence shown in Table 2–13. In this case, the data could be cached in the Bcache; Pj should fetch datai if it is using PTEi.
Table 2–13 TB Fill Flow Example Sequence 2
Pi            Pj
Write Datai   Istream read datai
MB            <TB miss>
Write PTEi    Load-PTE
              <write TB>
              Istream read (restart) - will miss the Icache
The 21264/EV68A processes Dstream loads to the PTE by injecting, in hardware, some memory barrier processing between the PTE transaction and any subsequent load or store instruction. This is accomplished by the following mechanism:
1. The integer queue issues a HW_LD instruction with VPTE.
2. The integer queue issues a HW_MTPR instruction with a DTB_PTE0 that is data-dependent on the HW_LD instruction with a VPTE, and is required in order to fill the DTBs. The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4] and [0].
3. When a HW_MTPR instruction with a DTB_PTE0 is issued, the Ibox signals the Cbox indicating that a HW_LD instruction with a VPTE has been processed. This causes the Cbox to begin processing the MB instruction. The Ibox prevents any subsequent memory operations being issued by not clearing the IPR scoreboard bit [0]. IPR scoreboard bit [0] is one of the scoreboard bits associated with the HW_MTPR instruction with DTB_PTE0.
4. When the Cbox completes processing the MB instruction (using one of the above sequences, depending upon the state of SYSBUS_MB_ENABLE), the Cbox signals the Ibox to clear IPR scoreboard bit [0].
The 21264/EV68A uses a similar mechanism to process Istream TB misses and fills to the PTE for the Istream.
1. The integer queue issues a HW_LD instruction with VPTE.
2. The IQ issues a HW_MTPR instruction with an ITB_PTE that is data-dependent upon the HW_LD instruction with VPTE. This is required in order to fill the ITB. The HW_MTPR instruction, when queued, sets IPR scoreboard bits [4] and [0].
3. The Cbox issues a HW_MTPR instruction for the ITB_PTE and signals the Ibox that a HW_LD/VPTE instruction has been processed, causing the Cbox to start processing the MB instruction. The Mbox stalls Ibox fetching from when the HW_LD/VPTE instruction finishes until the probe queue is drained.
4. When the 21264/EV68A is finished (SYS_MB selects one of the above sequences), the Cbox directs the Ibox to clear IPR scoreboard bit [0]. Also, the Mbox directs the Ibox to start prefetching.
Inserting MB instruction processing within the TB fill flow is only required for multiprocessor systems. Uniprocessor systems can disable MB instruction processing by deasserting Ibox CSR I_CTL[TB_MB_EN].
2.13 Performance Measurement Support—Performance Counters
The 21264/EV68A provides hardware support for two methods of obtaining program performance feedback information. The two methods do not require program modification. The first method offers similar capabilities to earlier microprocessor performance counters. The second method supports the new ProfileMe way of statistically sampling individual instructions during program execution to develop a model of program execution. Both methods use the same hardware registers.
See Section 6.10 for information about counter control.

2.14 Floating-Point Control Register

The floating-point control register (FPCR) is shown in Figure 2–11.
Figure 2–11 Floating-Point Control Register
[Figure: FPCR bit layout, bits 63 through 0. Fields from bit 63 downward: SUM, INED, UNFD, UNDZ, DYN, IOV, INE, UNF, OVF, DZE, INV, OVFD, DZED, INVD, DNZ. Bit positions are given in Table 2–14.]
The floating-point control register fields are described in Table 2–14.
Table 2–14 Floating-Point Control Register Fields
Name          Extent    Type  Description
SUM           [63]      RW    Summary bit. Records bit-wise OR of FPCR exception bits. The summary bit is not directly modified by writes to bit 63 of the FPCR, but is indirectly modified by changes to FPCR bits 57–52.
INED          [62]      RW    Inexact Disable. If this bit is set and a floating-point instruction that enables trapping on inexact results generates an inexact value, the result is placed in the destination register and the trap is suppressed.
UNFD          [61]      RW    Underflow Disable. The 21264/EV68A hardware cannot generate IEEE compliant denormal results. UNFD is used in conjunction with UNDZ as follows:
                              UNFD  UNDZ  Result
                              0     X     Underflow trap.
                              1     0     Trap to supply a possible denormal result.
                              1     1     Underflow trap suppressed. Destination is written with a true zero (+0.0).
UNDZ          [60]      RW    Underflow to zero. When UNDZ is set together with UNFD, underflow traps are disabled and the 21264/EV68A places a true zero in the destination register. See UNFD, above.
DYN           [59:58]   RW    Dynamic rounding mode. Indicates the rounding mode to be used by an IEEE floating-point instruction when the instruction specifies dynamic rounding mode:
                              Bits  Meaning
                              00    Chopped
                              01    Minus infinity
                              10    Normal
                              11    Plus infinity
IOV           [57]      RW    Integer overflow. A CVTGQ, CVTTQ, or CVTQL overflowed the destination precision.
INE           [56]      RW    Inexact result. A floating-point arithmetic or conversion operation gave a result that differed from the mathematically exact result.
UNF           [55]      RW    Underflow. A floating-point arithmetic or conversion operation gave a result that underflowed the destination exponent.
OVF           [54]      RW    Overflow. A floating-point arithmetic or conversion operation gave a result that overflowed the destination exponent.
DZE           [53]      RW    Divide by zero. An attempt was made to perform a floating-point divide with a divisor of zero.
INV           [52]      RW    Invalid operation. An attempt was made to perform a floating-point arithmetic operation and one or more of its operand values were illegal.
OVFD          [51]      RW    Overflow disable. If this bit is set and a floating-point arithmetic operation generates an overflow condition, then the appropriate IEEE nontrapping result is placed in the destination register and the trap is suppressed.
DZED          [50]      RW    Division by zero disable. If this bit is set and a floating-point divide by zero is detected, the appropriate IEEE nontrapping result is placed in the destination register and the trap is suppressed.
INVD          [49]      RW    Invalid operation disable. If this bit is set and a floating-point operate generates an invalid operation condition and 21264/EV68A is capable of producing the correct IEEE nontrapping result, that result is placed in the destination register and the trap is suppressed.
DNZ           [48]      RW    Denormal operands to zero. If this bit is set, treat all Denormal operands as a signed zero value with the same sign as the Denormal operand.
Reserved (1)  [47:0]    -     -
(1) Alpha architecture FPCR bit 47 (DNOD) is not implemented by the 21264/EV68A.
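
The FPCR field layout lends itself to a short decoding example. The following sketch extracts each field from a 64-bit FPCR value and applies the UNFD/UNDZ rules described in Table 2–14. It is illustrative only: the names FPCR_FIELDS, DYN_MODES, decode_fpcr, and underflow_action are hypothetical, and the code operates on a plain integer rather than reading the register.

```python
# Illustrative FPCR decoder; names are hypothetical, bit positions follow Table 2-14.
FPCR_FIELDS = {
    "SUM": (63, 63), "INED": (62, 62), "UNFD": (61, 61), "UNDZ": (60, 60),
    "DYN": (59, 58), "IOV": (57, 57), "INE": (56, 56), "UNF": (55, 55),
    "OVF": (54, 54), "DZE": (53, 53), "INV": (52, 52), "OVFD": (51, 51),
    "DZED": (50, 50), "INVD": (49, 49), "DNZ": (48, 48),
}

DYN_MODES = {0: "Chopped", 1: "Minus infinity", 2: "Normal", 3: "Plus infinity"}

def decode_fpcr(value):
    """Split a 64-bit FPCR value into a dict of field name to field value."""
    fields = {}
    for name, (high, low) in FPCR_FIELDS.items():
        width = high - low + 1
        fields[name] = (value >> low) & ((1 << width) - 1)
    return fields

def underflow_action(unfd, undz):
    """Describe the underflow behavior selected by UNFD and UNDZ."""
    if not unfd:
        return "underflow trap"
    if not undz:
        return "trap to supply a possible denormal result"
    return "trap suppressed, destination written with +0.0"
```

For example, decoding a value with bits 63, 59, and 55 set yields SUM = 1, DYN = 2 (Normal), and UNF = 1.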
2.15 AMASK and IMPLVER Instruction Values
The AMASK and IMPLVER instructions return the supported architecture extensions and processor type, respectively.

2.15.1 AMASK

The 21264/EV68A returns the AMASK instruction values provided in Table 2–15. The I_CTL register reports the 21264/EV68A pass level (see I_CTL[CHIP_ID], Section 5.2.15).
Table 2–15 21264/EV68A AMASK Values
21264/EV68A Pass Level           AMASK Feature Mask Value
See I_CTL[CHIP_ID], Table 5–11   1307 (hexadecimal)
The AMASK bit definitions provided in Table 2–15 are defined in Table 2–16.
Table 2–16 AMASK Bit Assignments
Bit   Meaning
0     Support for the byte/word extension (BWX)
      The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB, SEXTW, STB, and STW.
1     Support for the square-root and floating-point convert extension (FIX)
      The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS, ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
2     Support for the count extension (CIX)
      The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.
8     Support for the multimedia extension (MVI)
      The instructions that comprise the MVI extension are MAXSB8, MAXSW4, MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB, PKWB, UNPKBL, and UNPKBW.
9     Support for precise arithmetic trap reporting in hardware. The trap PC is the same as the instruction PC after the trapping instruction is executed.
12    Support for using a prefetch with modify intent to improve the performance of the first attempt to acquire a lock. When clear, indicates possible prefetch error with locks, described in waiver 10 to the Alpha Architecture and in the prefetch section of the appropriate processor (21264/EV6 and 21264/EV67) documents.
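
The feature mask value in Table 2–15 can be checked against the bit assignments in Table 2–16 with a short decoding example. This sketch is illustrative: the names AMASK_BITS and decode_amask are hypothetical, and the short feature labels are abbreviations chosen for the example.

```python
# Illustrative decoder for the AMASK feature mask; names are hypothetical.
AMASK_BITS = {
    0: "BWX",
    1: "FIX",
    2: "CIX",
    8: "MVI",
    9: "precise arithmetic trap reporting",
    12: "prefetch with modify intent",
}

def decode_amask(mask):
    """Return the features whose bits are set in the mask, lowest bit first."""
    return [name for bit, name in sorted(AMASK_BITS.items()) if (mask >> bit) & 1]

# The six defined bits (0, 1, 2, 8, 9, 12) together form the value 1307 (hexadecimal).
ALL_FEATURES_MASK = sum(1 << bit for bit in AMASK_BITS)
```

Under these definitions, ALL_FEATURES_MASK equals 0x1307, so decode_amask(0x1307) lists every feature in Table 2–16.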

2.15.2 IMPLVER

For the 21264/EV68A, the IMPLVER instruction returns the value 2.
2.16 Design Examples
The 21264/EV68A can be designed into many different uniprocessor and multiprocessor system configurations. Figures 2–12 and 2–13 illustrate two possible configurations. These configurations employ additional system/memory controller chipsets.
Figure 2–12 shows a typical uniprocessor system with a second-level cache. This system configuration could be used in standalone or networked workstations.
Figure 2–12 Typical Uniprocessor Configuration
[Figure: block diagram of a 21264 connected to an L2 cache (tag store and data store) and to a 21272 core logic chipset (control chips, data slice chips, and host PCI bridge chip), with an optional duplicate tag store, DRAM arrays, and a 64-bit PCI bus. Address and data paths connect the blocks.]
Figure 2–13 shows a typical multiprocessor system, each processor with a second-level cache. Each interface controller must employ a duplicate tag store to maintain cache coherency. This system configuration could be used in a networked database server application.
Figure 2–13 Typical Multiprocessor Configuration
[Figure: block diagram of two 21264 processors, each with its own L2 cache, connected to a 21272 core logic chipset (control chip, data slice chips, and two host PCI bridge chips), with two DRAM arrays and two 64-bit PCI buses. Address and data paths connect the blocks.]

3 Hardware Interface

This chapter contains the 21264/EV68A microprocessor logic symbol and provides information about signal names, their function, and their location. This chapter also describes the mechanical specifications of the 21264/EV68A. It is organized as follows:
The 21264/EV68A logic symbol
The 21264/EV68A signal names and functions
Lists of the signal pins, sorted by name and PGA location
The specifications for the 21264/EV68A mechanical package
The top and bottom views of the 21264/EV68A pinouts

3.1 21264/EV68A Microprocessor Logic Symbol

Figure 3–1 shows the logic symbol for the 21264/EV68A chip.
Figure 3–1 21264/EV68A Microprocessor Logic Symbol
[Figure: logic symbol for the 21264 with its signals grouped as follows.
System Interface: SysAddIn_L[14:0], SysAddInClk_L, SysAddOut_L[14:0], SysAddOutClk_L, SysVref, SysData_L[63:0], SysCheck_L[7:0], SysDataInClk_H[7:0], SysDataOutClk_L[7:0], SysDataInValid_L, SysDataOutValid_L, SysFillValid_L
Clocks: ClkIn_x, FrameClk_x, EV6Clk_x, PLL_VDD (2.5 V)
Bcache Interface: BcAdd_H[23:4], BcData_H[127:0], BcCheck_H[15:0], BcDataInClk_H[7:0], BcDataOutClk_x[3:0], BcDataOE_L, BcDataWr_L, BcTag_H[42:20], BcTagInClk_H, BcTagOutClk_x, BcVref, BcTagDirty_H, BcTagParity_H, BcTagShared_H, BcTagValid_H, BcTagOE_L, BcTagWr_L, BcLoad_L
Miscellaneous: IRQ_H[5:0], ClkFwdRst_H, SromData_H, Tms_H, Trst_L, Tck_H, Tdi_H, PllBypass_H, MiscVref, Reset_L, DCOK_H, SromClk_H, SromOE_L, TestStat_H, Tdo_H]

3.2 21264/EV68A Signal Names and Functions
Table 3–1 defines the 21264/EV68A signal types referred to in this section.
Table 3–1 Signal Pin Types Definitions
Signal Type    Definition
Inputs
I_DC_REF       Input DC reference pin
I_DA           Input differential amplifier receiver
I_DA_CLK       Input clock pin
Outputs
O_OD           Open drain output driver
O_OD_TP        Open drain driver for test pins
O_PP           Push/pull output driver
O_PP_CLK       Push/pull output clock driver
Bidirectional
B_DA_OD        Bidirectional differential amplifier receiver with open drain output
B_DA_PP        Bidirectional differential amplifier receiver with push/pull output
Other
Spare          Reserved to COMPAQ (1)
NoConnect      No connection. Do not connect to these pins for any revision of the 21264/EV68A. These pins must float.
(1) All Spare connections are Reserved to COMPAQ to maintain compatibility between passes of the chip. Designers should not use these pins.
Table 3–2 lists all signal pins in alphabetic order and provides a full functional description of the pins. Table 3–4 lists the signal pins and their corresponding pin grid array (PGA) locations in alphabetic order for the signal type. Table 3–5 lists the pin grid array locations in alphabetical order.
Table 3–2 21264/EV68A Signal Descriptions
Signal                Type      Count  Description
BcAdd_H[23:4]         O_PP      20     These signals provide the index to the Bcache.
BcCheck_H[15:0]       B_DA_PP   16     ECC check bits for BcData_H[127:0].
BcData_H[127:0]       B_DA_PP   128    Bcache data signals.
BcDataInClk_H[7:0]    I_DA      8      Bcache data input clocks. These clocks are used with high-speed SDRAMs, such as DDRs, that provide a clock-out with data-output pins to optimize Bcache read bandwidths. The 21264/EV68A internally synchronizes the data to its logic with clock forward receive circuits similar to the system interface.
BcDataOE_L            O_PP      1      Bcache data output enable. The 21264/EV68A asserts this signal during Bcache read operations.
BcDataOutClk_H[3:0]   O_PP      8      Bcache data output clocks. These free-running clocks are differential copies of the Bcache clock and are derived from the 21264/EV68A GCLK. Their period is a multiple of the GCLK and is fixed for all operations. They can be configured so that their rising edge lags BcAdd_H[23:4] by 0 to 2 GCLK cycles. The 21264/EV68A synchronizes tag output information with these clocks.
BcDataOutClk_L[3:0]
BcDataWr_L            O_PP      1      Bcache data write enable. The 21264/EV68A asserts this signal when writing data to the Bcache data arrays.
BcLoad_L              O_PP      1      Bcache burst enable.
BcTag_H[42:20]        B_DA_PP   23     Bcache tag bits.
BcTagDirty_H          B_DA_PP   1      Tag dirty state bit. During cache write operations, the 21264/EV68A will assert this signal if the Bcache data has been modified.
BcTagInClk_H          I_DA      1      Bcache tag input clock. The 21264/EV68A uses this input clock to latch the tag information on Bcache read operations. This clock is used with high-speed SDRAMs, such as DDRs, that provide a clock-out with data-output pins to optimize Bcache read bandwidths. The 21264/EV68A internally synchronizes the data to its logic with clock forward receive circuits similar to the system interface.
BcTagOE_L             O_PP      1      Bcache tag output enable. This signal is asserted by the 21264/EV68A for Bcache read operations.
BcTagOutClk_H         O_PP      2      Bcache tag output clock. These clocks “echo” the clock-forwarded BcDataOutClk_x[3:0] clocks.
BcTagOutClk_L
BcTagParity_H         B_DA_PP   1      Tag parity state bit.
BcTagShared_H         B_DA_PP   1      Tag shared state bit. The 21264/EV68A will write a 1 on this signal line if another agent has a copy of the cache line.
BcTagValid_H          B_DA_PP   1      Tag valid state bit. If set, this line indicates that the cache line is valid.
BcTagWr_L             O_PP      1      Tag RAM write enable. The 21264/EV68A asserts this signal when writing a tag to the Bcache tag arrays.
BcVref                I_DC_REF  1      Bcache tag reference voltage.
ClkFwdRst_H           I_DA      1      Systems assert this synchronous signal to wake up a powered-down 21264/EV68A. The ClkFwdRst_H signal is clocked into a 21264/EV68A register by the captured FrameClk_x signals. Systems must ensure that the timing of this signal meets 21264/EV68A requirements (see Section 4.7.2).
ClkIn_H               I_DA_CLK  2      Differential input signals provided by the system.
ClkIn_L
DCOK_H                I_DA      1      dc voltage OK. Must be deasserted until dc voltage reaches proper operating level. After that, DCOK_H is asserted.
EV6Clk_H              O_PP_CLK  2      Provides an external test point to measure phase alignment of the PLL.
EV6Clk_L
FrameClk_H            I_DA_CLK  2      A skew-controlled differential 50% duty cycle copy of the system clock. It is used by the 21264/EV68A as a reference, or framing, clock.
FrameClk_L
IRQ_H[5:0]            I_DA      6      These six interrupt signal lines may be asserted by the system. The response of the 21264/EV68A is determined by the system software.
MiscVref              I_DC_REF  1      Voltage reference for the miscellaneous pins (see Table 3–3).
PllBypass_H           I_DA      1      When asserted, this signal will cause the two input clocks (ClkIn_x) to be applied to the 21264/EV68A internal circuits, instead of the 21264/EV68A global clock (GCLK).
PLL_VDD               2.5V      1      2.5-V dedicated power supply for the 21264/EV68A PLL.
Reset_L               I_DA      1      System reset. This signal protects the 21264/EV68A from damage during initial power-up. It must be asserted until DCOK_H is asserted. After that, it is deasserted and the 21264/EV68A begins its reset sequence.
SromClk_H             O_OD_TP   1      Serial ROM clock. Supplies the clock that causes the SROM to advance to the next bit. The cycle time for this clock is 256 times the cycle time of the GCLK (internal 21264/EV68A clock).
SromData_H            I_DA      1      Serial ROM data. Input data line from the SROM.
SromOE_L              O_OD_TP   1      Serial ROM enable. Supplies the output enable to the SROM.
SysAddIn_L[14:0]      I_DA      15     Time-multiplexed command/address/ID/Ack from system to the 21264/EV68A.
SysAddInClk_L         I_DA      1      Single-ended forwarded clock from system for SysAddIn_L[14:0] and SysFillValid_L.
SysAddOut_L[14:0]     O_OD      15     Time-multiplexed command/address/ID/mask from the 21264/EV68A to the system bus.
SysAddOutClk_L        O_OD      1      Single-ended forwarded clock output for SysAddOut_L[14:0].
SysCheck_L[7:0]       B_DA_OD   8      Quadword ECC check bits for SysData_L[63:0].
SysData_L[63:0]       B_DA_OD   64     Data bus for memory and I/O data.
SysDataInClk_H[7:0]   I_DA      8      Single-ended system-generated clocks for clock forwarded input system data.
SysDataInValid_L      I_DA      1      When asserted, marks a valid data cycle for data transfers to the 21264/EV68A.
SysDataOutClk_L[7:0]  O_OD      8      Single-ended 21264/EV68A-generated clocks for clock forwarded output system data.
SysDataOutValid_L     I_DA      1      When asserted, marks a valid data cycle for data transfers from the 21264/EV68A.
SysFillValid_L        I_DA      1      When asserted, this bit indicates validation for the cache fill delivered in the previous system SysDc command.
SysVref I_DC_REF 1 System interface reference voltage.
Tck_H I_DA 1 IEEE 1149.1 test clock.
Tdi_H I_DA 1 IEEE 1149.1 test data-in signal.
Tdo_H O_OD_TP 1 IEEE 1149.1 test data-out signal.
TestStat_H O_OD_TP 1 Test status pin. System reset drives the test status pin low. The TestStat_H pin is forced high at the start of the Icache BiST. If the Icache BiST passes, the pin is deasserted at the end of the BiST operation; otherwise, it remains high. The 21264/EV68A generates a timeout reset signal if an instruction is not retired within one billion cycles. The 21264/EV68A signals the timeout reset event by outputting a 256 GCLK cycle wide pulse on TestStat_H.
Tms_H I_DA 1 IEEE 1149.1 test mode select signal.
Trst_L I_DA 1 IEEE 1149.1 test access port (TAP) reset signal.
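Several entries in Table 3–2 are expressed in GCLK cycles: the SromClk_H cycle time (256 GCLK cycles), the TestStat_H timeout pulse (256 GCLK cycles wide), and the instruction-retire timeout (one billion cycles). The following minimal sketch converts these to seconds. The function name and the 1 GHz frequency are illustrative assumptions, as is the reading that the one-billion-cycle timeout is counted in GCLK cycles.

```python
def derived_timings(gclk_hz):
    """Convert cycle counts from Table 3-2 into seconds.

    gclk_hz is the internal GCLK frequency in hertz (a caller-supplied
    assumption; the table itself gives only cycle counts).
    """
    if gclk_hz <= 0:
        raise ValueError("gclk_hz must be positive")
    gclk_period_s = 1.0 / gclk_hz
    return {
        # SromClk_H cycle time is 256 times the GCLK cycle time.
        "srom_clk_period_s": 256 * gclk_period_s,
        # The timeout reset event is a 256 GCLK cycle wide pulse on TestStat_H.
        "teststat_timeout_pulse_s": 256 * gclk_period_s,
        # Timeout if no instruction retires within one billion cycles
        # (assumed here to be GCLK cycles).
        "retire_timeout_s": 1_000_000_000 * gclk_period_s,
    }


# Hypothetical 1 GHz GCLK, chosen only to make the arithmetic easy to follow.
timings = derived_timings(1_000_000_000.0)
```

At the assumed 1 GHz, the retire timeout works out to about one second and the SROM clock period to about 256 ns.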
Table 3–3 lists signals by function and provides an abbreviated description.
Table 3–3 21264/EV68A Signal Descriptions by Function
Signal Type Count Description
BcVref Domain
BcAdd_H[23:4] O_PP 20 Bcache index.
BcCheck_H[15:0] B_DA_PP 16 ECC check bits for BcData_H[127:0].
BcData_H[127:0] B_DA_PP 128 Bcache data.
BcDataInClk_H[7:0] I_DA 8 Bcache data input clocks.
BcDataOE_L O_PP 1 Bcache data output enable.
BcDataOutClk_H[3:0] BcDataOutClk_L[3:0] O_PP 8 Bcache data output clocks.
BcDataWr_L O_PP 1 Bcache data write enable.
BcLoad_L O_PP 1 Bcache burst enable.
BcTag_H[42:20] B_DA_PP 23 Bcache tag bits.
BcTagDirty_H B_DA_PP 1 Tag dirty state bit.
BcTagInClk_H I_DA 1 Bcache tag input clock.
BcTagOE_L O_PP 1 Bcache tag output enable.
BcTagOutClk_H BcTagOutClk_L O_PP 2 Bcache tag output clocks.
BcTagParity_H B_DA_PP 1 Tag parity state bit.
BcTagShared_H B_DA_PP 1 Tag shared state bit.
BcTagValid_H B_DA_PP 1 Tag valid state bit.
BcTagWr_L O_PP 1 Tag RAM write enable.
BcVref I_DC_REF 1 Tag data input reference voltage.
SysVref Domain
SysAddIn_L[14:0] I_DA 15 Time-multiplexed SysAddIn, system-to-21264/EV68A.
SysAddInClk_L I_DA 1 Single-ended forwarded clock from system for SysAddIn_L[14:0] and SysFillValid_L.
SysAddOut_L[14:0] O_OD 15 Time-multiplexed SysAddOut, 21264/EV68A-to-system.
SysAddOutClk_L O_OD 1 Single-ended forwarded clock.
SysCheck_L[7:0] B_DA_OD 8 Quadword ECC check bits for SysData_L[63:0].
SysData_L[63:0] B_DA_OD 64 Data bus for memory and I/O data.
SysDataInClk_H[7:0] I_DA 8 Single-ended system-generated clocks for clock forwarded input system data.
SysDataInValid_L I_DA 1 When asserted, marks a valid data cycle for data transfers to the 21264/EV68A.
SysDataOutClk_L[7:0] O_OD 8 Single-ended 21264/EV68A-generated clocks for clock forwarded output system data.
SysDataOutValid_L I_DA 1 When asserted, marks a valid data cycle for data transfers from the 21264/EV68A.
SysFillValid_L I_DA 1 Validation for fill given in previous SysDC command.
SysVref I_DC_REF 1 System interface reference voltage.
Clocks and PLL
ClkIn_H ClkIn_L I_DA_CLK 2 Differential input signals provided by the system.
EV6Clk_H EV6Clk_L O_PP_CLK 2 Provides an external test point to measure phase alignment of the PLL.
FrameClk_H FrameClk_L I_DA_CLK 2 A skew-controlled differential 50% duty cycle copy of the system clock. It is used by the 21264/EV68A as a reference, or framing, clock.
PLL_VDD 2.5V 1 2.5-V dedicated power supply for the 21264/EV68A PLL.
MiscVref Domain
ClkFwdRst_H I_DA 1 Systems assert this synchronous signal to wake up a powered-down 21264/EV68A. The ClkFwdRst_H signal is clocked into a 21264/EV68A register by the captured FrameClk_x signals.
DCOK_H I_DA 1 dc voltage OK. Must be deasserted until dc voltage reaches proper operating level. After that, DCOK_H is asserted.
IRQ_H[5:0] I_DA 6 These six interrupt signal lines may be asserted by the system.
MiscVref I_DC_REF 1 Reference voltage for miscellaneous pins.
PllBypass_H I_DA 1 When asserted, this signal will cause the input clocks (ClkIn_x) to be applied to the 21264/EV68A internal circuits, instead of the 21264/EV68A’s global clock (GCLK).
Reset_L I_DA 1 System reset. This signal protects the 21264/EV68A from damage during initial power-up. It must be asserted until DCOK_H is asserted. After that, it is deasserted and the 21264/EV68A begins its reset sequence.
SromClk_H O_OD_TP 1 Serial ROM clock.
SromData_H I_DA 1 Serial ROM data.
SromOE_L O_OD_TP 1 Serial ROM enable.
Tck_H I_DA 1 IEEE 1149.1 test clock.
Tdi_H I_DA 1 IEEE 1149.1 test data-in signal.
Tdo_H O_OD_TP 1 IEEE 1149.1 test data-out signal.
TestStat_H O_OD_TP 1 Test status pin.
Tms_H I_DA 1 IEEE 1149.1 test mode select signal.
Trst_L I_DA 1 IEEE 1149.1 test access port (TAP) reset signal.
3.3 Pin Assignments
The 21264/EV68A package has 587 pins aligned in a pin grid array (PGA) design. There are 380 functional signal pins, 1 dedicated 2.5-V pin for the PLL, 112 ground (VSS) pins, and 94 VDD pins. Table 3–4 lists the signal pins and their corresponding PGA locations in alphabetical order by signal name. Table 3–5 lists the PGA locations in alphabetical order.
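As a quick consistency check on the pin budget stated above, the four pin groups can be summed and compared with the package total. This is only an arithmetic sketch; the dictionary keys are illustrative labels, not names taken from the manual.

```python
# Pin groups as stated in Section 3.3 (the keys are illustrative labels).
PIN_GROUPS = {
    "functional signal": 380,
    "PLL 2.5-V supply": 1,
    "VSS (ground)": 112,
    "VDD (power)": 94,
}

PACKAGE_PIN_COUNT = 587

# The groups should account for every pin in the package.
total_pins = sum(PIN_GROUPS.values())
pin_budget_is_consistent = (total_pins == PACKAGE_PIN_COUNT)
```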
Table 3–4 Pin List Sorted by Signal Name
Signal Name PGA Location Signal Name PGA Location Signal Name PGA Location
BcAdd_H_10 B30 BcAdd_H_11 D30 BcAdd_H_12 C31 BcAdd_H_13 H28 BcAdd_H_14 G29 BcAdd_H_15 A33 BcAdd_H_16 E31 BcAdd_H_17 D32 BcAdd_H_18 B34 BcAdd_H_19 A35 BcAdd_H_20 B36 BcAdd_H_21 H30 BcAdd_H_22 C35 BcAdd_H_23 E33 BcAdd_H_4 B28 BcAdd_H_5 E27 BcAdd_H_6 A29 BcAdd_H_7 G27 BcAdd_H_8 C29 BcAdd_H_9 F28 BcCheck_H_0 F2 BcCheck_H_1 AB4 BcCheck_H_10 AW1 BcCheck_H_11 BD10 BcCheck_H_12 E45 BcCheck_H_13 AC45 BcCheck_H_14 AT44 BcCheck_H_15 BB36 BcCheck_H_2 AT2 BcCheck_H_3 BC11
BcCheck_H_4 M38 BcCheck_H_5 AB42 BcCheck_H_6 AU43 BcCheck_H_7 BC37 BcCheck_H_8 M8 BcCheck_H_9 AA3 BcData_H_0 B10 BcData_H_1 D10 BcData_H_10 L3 BcData_H_100 D42 BcData_H_101 D44 BcData_H_102 H40 BcData_H_103 H42 BcData_H_104 G45 BcData_H_105 L43
BcData_H_106 L45 BcData_H_107 N45 BcData_H_108 T44 BcData_H_109 U45 BcData_H_11 M2 BcData_H_110 W45 BcData_H_111 AA43 BcData_H_112 AC43 BcData_H_113 AD44 BcData_H_114 AE41 BcData_H_115 AG45 BcData_H_116 AK44 BcData_H_117 AL43 BcData_H_118 AM42 BcData_H_119 AR45 BcData_H_12 T2 BcData_H_120 AP40 BcData_H_121 BA45 BcData_H_122 AV42 BcData_H_123 BB44 BcData_H_124 BB42 BcData_H_125 BC41 BcData_H_126 BA37 BcData_H_127 BD40 BcData_H_13 U1 BcData_H_14 V2 BcData_H_15 Y4 BcData_H_16 AC1 BcData_H_17 AD2 BcData_H_18 AE3 BcData_H_19 AG1 BcData_H_2 A5 BcData_H_20 AK2 BcData_H_21 AL3 BcData_H_22 AR1 BcData_H_23 AP2 BcData_H_24 AY2 BcData_H_25 BB2 BcData_H_26 AW5 BcData_H_27 BB4 BcData_H_28 BB8 BcData_H_29 BE5 BcData_H_3 C5 BcData_H_30 BB10 BcData_H_31 BE7 BcData_H_32 G33 BcData_H_33 C37 BcData_H_34 B40 BcData_H_35 C41 BcData_H_36 C43 BcData_H_37 E43 BcData_H_38 G41 BcData_H_39 F44 BcData_H_4 C3 BcData_H_40 K44 BcData_H_41 N41 BcData_H_42 M44 BcData_H_43 P42 BcData_H_44 U43 BcData_H_45 V44 BcData_H_46 Y42 BcData_H_47 AB44 BcData_H_48 AD42 BcData_H_49 AE43 BcData_H_5 E3 BcData_H_50 AF42 BcData_H_51 AJ45 BcData_H_52 AK42 BcData_H_53 AN45 BcData_H_54 AP44 BcData_H_55 AN41 BcData_H_56 AW45 BcData_H_57 AU41 BcData_H_58 AY44 BcData_H_59 BA43 BcData_H_6 H6 BcData_H_60 BC43 BcData_H_61 BD42 BcData_H_62 BB38 BcData_H_63 BE41 BcData_H_64 C11 BcData_H_65 A7 BcData_H_66 C9 BcData_H_67 B6 BcData_H_68 B4 BcData_H_69 D4 BcData_H_7 E1 BcData_H_70 G5 BcData_H_71 D2 BcData_H_72 H4 BcData_H_73 G1 BcData_H_74 N5 BcData_H_75 L1 BcData_H_76 N1 BcData_H_77 U3 BcData_H_78 W5 BcData_H_79 W1 BcData_H_8 J3 BcData_H_80 AB2 BcData_H_81 AC3 BcData_H_82 AD4 BcData_H_83 AF4 BcData_H_84 AJ3 BcData_H_85 AK4 BcData_H_86 AN1 BcData_H_87 AM4 BcData_H_88 AU5 BcData_H_89 BA1
BcData_H_9 K2 BcData_H_90 BA3 BcData_H_91 BC3 BcData_H_92 BD6 BcData_H_93 BA9 BcData_H_94 BC9 BcData_H_95 AY12 BcData_H_96 A39 BcData_H_97 D36 BcData_H_98 A41 BcData_H_99 B42 BcDataInClk_H_0 E7 BcDataInClk_H_1 R3 BcDataInClk_H_2 AH2 BcDataInClk_H_3 BC5 BcDataInClk_H_4 F38 BcDataInClk_H_5 U39 BcDataInClk_H_6 AH44 BcDataInClk_H_7 AY40 BcDataOE_L A27 BcDataOutClk_H_0 J5 BcDataOutClk_H_1 AU3 BcDataOutClk_H_2 J43 BcDataOutClk_H_3 AR43 BcDataOutClk_L_0 K4 BcDataOutClk_L_1 AV4 BcDataOutClk_L_2 K42 BcDataOutClk_L_3 AT42 BcDataWr_L D26 BcLoad_L F26 BcTag_H_20 E13 BcTag_H_21 H16 BcTag_H_22 A11 BcTag_H_23 B12 BcTag_H_24 D14 BcTag_H_25 E15 BcTag_H_26 A13 BcTag_H_27 G17 BcTag_H_28 C15 BcTag_H_29 H18 BcTag_H_30 D16 BcTag_H_31 B16 BcTag_H_32 C17 BcTag_H_33 A17 BcTag_H_34 E19 BcTag_H_35 B18 BcTag_H_36 A19 BcTag_H_37 F20 BcTag_H_38 D20 BcTag_H_39 E21 BcTag_H_40 C21 BcTag_H_41 D22 BcTag_H_42 H22 BcTagDirty_H C23 BcTagInClk_H G19 BcTagOE_L H24 BcTagOutClk_H C25 BcTagOutClk_L D24 BcTagParity_H B22 BcTagShared_H G23 BcTagValid_H B24 BcTagWr_L E25 BcVref F18 ClkFwdRst_H BE11 ClkIn_H AM8 ClkIn_L AN7 DCOK_H AY18 EV6Clk_H AM6 EV6Clk_L AL7 FrameClk_H AV16 FrameClk_L AW15 IRQ_H_0 BA15 IRQ_H_1 BE13 IRQ_H_2 AW17 IRQ_H_3 AV18 IRQ_H_4 BC15 IRQ_H_5 BB16 MiscVref AV22 NoConnect BB14 NoConnect BD2 PLL_VDD AV8 PllBypass_H BD12 Reset_L BD16 Spare AJ1
Spare V38 Spare AT4 Spare BE9 Spare F8 Spare BD4 Spare AJ43 Spare AR3 Spare T4 Spare E39 Spare BA39 Spare BC21 SromClk_H AW19
SromData_H BC17 SromOE_L BE17 SysAddIn_L_0 BD30 SysAddIn_L_1 BC29 SysAddIn_L_10 BB24 SysAddIn_L_11 AV24 SysAddIn_L_12 BD24 SysAddIn_L_13 BE23 SysAddIn_L_14 AW23 SysAddIn_L_2 AY28 SysAddIn_L_3 BE29 SysAddIn_L_4 AW27
SysAddIn_L_5 BA27 SysAddIn_L_6 BD28 SysAddIn_L_7 BE27 SysAddIn_L_8 AY26 SysAddIn_L_9 BC25 SysAddInClk_L BB26 SysAddOut_L_0 AW33 SysAddOut_L_1 BE39 SysAddOut_L_10 BE33 SysAddOut_L_11 AW29 SysAddOut_L_12 BC31 SysAddOut_L_13 AV28 SysAddOut_L_14 BB30 SysAddOut_L_2 BD36 SysAddOut_L_3 BC35 SysAddOut_L_4 BA33 SysAddOut_L_5 AY32 SysAddOut_L_6 BE35 SysAddOut_L_7 AV30 SysAddOut_L_8 BB32 SysAddOut_L_9 BA31 SysAddOutClk_L BD34 SysCheck_L_0 L7 SysCheck_L_1 AA5 SysCheck_L_2 AK8 SysCheck_L_3 BA13 SysCheck_L_4 L39 SysCheck_L_5 AA41 SysCheck_L_6 AM40 SysCheck_L_7 AY34 SysData_L_0 F14 SysData_L_1 G13 SysData_L_10 P6 SysData_L_11 T8 SysData_L_12 V8 SysData_L_13 V6 SysData_L_14 W7 SysData_L_15 Y6 SysData_L_16 AB8 SysData_L_17 AC7 SysData_L_18 AD8 SysData_L_19 AE5 SysData_L_2 F12 SysData_L_20 AH6 SysData_L_21 AH8 SysData_L_22 AJ7 SysData_L_23 AL5 SysData_L_24 AP8 SysData_L_25 AR7 SysData_L_26 AT8 SysData_L_27 AV6 SysData_L_28 AV10 SysData_L_29 AW11 SysData_L_3 H12 SysData_L_30 AV12 SysData_L_31 AW13 SysData_L_32 F32 SysData_L_33 F34 SysData_L_34 H34 SysData_L_35 G35 SysData_L_36 F40 SysData_L_37 G39 SysData_L_38 K38 SysData_L_39 J41 SysData_L_4 H10 SysData_L_40 M40 SysData_L_41 N39 SysData_L_42 P40 SysData_L_43 T38 SysData_L_44 V40 SysData_L_45 W41 SysData_L_46 W39 SysData_L_47 Y40 SysData_L_48 AB38 SysData_L_49 AC39 SysData_L_5 G7 SysData_L_50 AD38 SysData_L_51 AF40 SysData_L_52 AH38 SysData_L_53 AJ39 SysData_L_54 AL41 SysData_L_55 AK38 SysData_L_56 AN39 SysData_L_57 AP38 SysData_L_58 AR39 SysData_L_59 AT38 SysData_L_6 F6 SysData_L_60 AY38 SysData_L_61 AV36 SysData_L_62 AW35 SysData_L_63 AV34 SysData_L_7 K8 SysData_L_8 M6 SysData_L_9 N7 SysDataInClk_H_0 D8 SysDataInClk_H_1 P4 SysDataInClk_H_2 AF6 SysDataInClk_H_3 AY6 SysDataInClk_H_4 E37 SysDataInClk_H_5 R43 SysDataInClk_H_6 AG41 SysDataInClk_H_7 AV40 SysDataInValid_L BD22 SysDataOutClk_L_0 G11 SysDataOutClk_L_1 U7 SysDataOutClk_L_2 AG7 SysDataOutClk_L_3 AY8 SysDataOutClk_L_4 H36
SysDataOutClk_L_5 R41 SysDataOutClk_L_6 AH40 SysDataOutClk_L_7 AW39 SysDataOutValid_L BB22 SysFillValid_L BC23 SysVref BA25 Tck_H BE19 Tdi_H BA21 Tdo_H BB20 TestStat_H BA19 Tms_H BD18 Trst_L AY20
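Table 3–5, which follows, is the same data as Table 3–4 keyed by PGA location instead of signal name. A minimal sketch of that inversion, assuming the table is available as whitespace-separated signal/location pairs; the function names and the short excerpt are illustrative only. Because the names Spare and NoConnect repeat, only the location-to-signal direction is a unique mapping.

```python
def parse_pin_pairs(text):
    """Split 'Signal Location Signal Location ...' into (signal, location) pairs."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens")
    return list(zip(tokens[0::2], tokens[1::2]))


def by_location(pairs):
    """Build a location -> signal mapping, rejecting duplicate locations."""
    table = {}
    for signal, location in pairs:
        if location in table:
            raise ValueError("location listed twice: " + location)
        table[location] = signal
    return table


# A few entries copied from Table 3-4, used purely as sample input.
EXCERPT = "ClkIn_H AM8 ClkIn_L AN7 DCOK_H AY18 Spare AJ1 Spare V38"

location_to_signal = by_location(parse_pin_pairs(EXCERPT))
# Plain string ordering, which is how Table 3-5 appears to be sorted.
sorted_locations = sorted(location_to_signal)
```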
Table 3–5 Pin List Sorted by PGA Location
PGA Location Signal Name PGA Location Signal Name PGA Location Signal Name
A11 BcTag_H_22 A13 BcTag_H_26 A17 BcTag_H_33 A19 BcTag_H_36 A27 BcDataOE_L A29 BcAdd_H_6 A33 BcAdd_H_15 A35 BcAdd_H_19 A39 BcData_H_96 A41 BcData_H_98 A5 BcData_H_2 A7 BcData_H_65 AA3 BcCheck_H_9 AA41 SysCheck_L_5 AA43 BcData_H_111 AA5 SysCheck_L_1 AB2 BcData_H_80 AB38 SysData_L_48 AB4 BcCheck_H_1 AB42 BcCheck_H_5 AB44 BcData_H_47 AB8 SysData_L_16 AC1 BcData_H_16 AC3 BcData_H_81 AC39 SysData_L_49 AC43 BcData_H_112 AC45 BcCheck_H_13 AC7 SysData_L_17 AD2 BcData_H_17 AD38 SysData_L_50 AD4 BcData_H_82 AD42 BcData_H_48 AD44 BcData_H_113 AD8 SysData_L_18 AE3 BcData_H_18 AE41 BcData_H_114 AE43 BcData_H_49 AE5 SysData_L_19 AF4 BcData_H_83 AF40 SysData_L_51 AF42 BcData_H_50 AF6 SysDataInClk_H_2 AG1 BcData_H_19 AG41 SysDataInClk_H_6 AG45 BcData_H_115 AG7 SysDataOutClk_L_2 AH2 BcDataInClk_H_2 AH38 SysData_L_52 AH40 SysDataOutClk_L_6 AH44 BcDataInClk_H_6 AH6 SysData_L_20 AH8 SysData_L_21 AJ1 Spare AJ3 BcData_H_84 AJ39 SysData_L_53 AJ43 Spare AJ45 BcData_H_51 AJ7 SysData_L_22 AK2 BcData_H_20 AK38 SysData_L_55 AK4 BcData_H_85 AK42 BcData_H_52 AK44 BcData_H_116 AK8 SysCheck_L_2 AL3 BcData_H_21 AL41 SysData_L_54 AL43 BcData_H_117 AL5 SysData_L_23 AL7 EV6Clk_L AM4 BcData_H_87 AM40 SysCheck_L_6 AM42 BcData_H_118 AM6 EV6Clk_H AM8 ClkIn_H AN1 BcData_H_86 AN39 SysData_L_56 AN41 BcData_H_55 AN45 BcData_H_53 AN7 ClkIn_L AP2 BcData_H_23 AP38 SysData_L_57 AP40 BcData_H_120 AP44 BcData_H_54 AP8 SysData_L_24
AR1 BcData_H_22 AR3 Spare AR39 SysData_L_58 AR43 BcDataOutClk_H_3 AR45 BcData_H_119 AR7 SysData_L_25 AT2 BcCheck_H_2 AT38 SysData_L_59 AT4 Spare AT42 BcDataOutClk_L_3 AT44 BcCheck_H_14 AT8 SysData_L_26 AU3 BcDataOutClk_H_1 AU41 BcData_H_57 AU43 BcCheck_H_6 AU5 BcData_H_88 AV10 SysData_L_28 AV12 SysData_L_30 AV16 FrameClk_H AV18 IRQ_H_3 AV22 MiscVref AV24 SysAddIn_L_11 AV28 SysAddOut_L_13 AV30 SysAddOut_L_7 AV34 SysData_L_63 AV36 SysData_L_61 AV4 BcDataOutClk_L_1 AV40 SysDataInClk_H_7 AV42 BcData_H_122 AV6 SysData_L_27 AV8 PLL_VDD AW1 BcCheck_H_10 AW11 SysData_L_29 AW13 SysData_L_31 AW15 FrameClk_L AW17 IRQ_H_2 AW19 SromClk_H AW23 SysAddIn_L_14 AW27 SysAddIn_L_4 AW29 SysAddOut_L_11 AW33 SysAddOut_L_0 AW35 SysData_L_62 AW39 SysDataOutClk_L_7 AW45 BcData_H_56 AW5 BcData_H_26 AY12 BcData_H_95 AY18 DCOK_H AY2 BcData_H_24 AY20 Trst_L AY26 SysAddIn_L_8 AY28 SysAddIn_L_2 AY32 SysAddOut_L_5 AY34 SysCheck_L_7 AY38 SysData_L_60 AY40 BcDataInClk_H_7 AY44 BcData_H_58 AY6 SysDataInClk_H_3 AY8 SysDataOutClk_L_3 B10 BcData_H_0 B12 BcTag_H_23 B16 BcTag_H_31 B18 BcTag_H_35 B22 BcTagParity_H B24 BcTagValid_H B28 BcAdd_H_4 B30 BcAdd_H_10 B34 BcAdd_H_18 B36 BcAdd_H_20 B4 BcData_H_68 B40 BcData_H_34 B42 BcData_H_99 B6 BcData_H_67 BA1 BcData_H_89 BA13 SysCheck_L_3 BA15 IRQ_H_0 BA19 TestStat_H BA21 Tdi_H BA25 SysVref BA27 SysAddIn_L_5 BA3 BcData_H_90 BA31 SysAddOut_L_9 BA33 SysAddOut_L_4 BA37 BcData_H_126 BA39 Spare BA43 BcData_H_59 BA45 BcData_H_121 BA9 BcData_H_93 BB10 BcData_H_30 BB14 NoConnect BB16 IRQ_H_5 BB2 BcData_H_25 BB20 Tdo_H BB22 SysDataOutValid_L BB24 SysAddIn_L_10 BB26 SysAddInClk_L BB30 SysAddOut_L_14 BB32 SysAddOut_L_8 BB36 BcCheck_H_15 BB38 BcData_H_62 BB4 BcData_H_27 BB42 BcData_H_124 BB44 BcData_H_123 BB8 BcData_H_28 BC11 BcCheck_H_3 BC15 IRQ_H_4 BC17 SromData_H BC21 Spare BC23 SysFillValid_L
BC25 SysAddIn_L_9 BC29 SysAddIn_L_1 BC3 BcData_H_91 BC31 SysAddOut_L_12 BC35 SysAddOut_L_3 BC37 BcCheck_H_7 BC41 BcData_H_125 BC43 BcData_H_60 BC5 BcDataInClk_H_3 BC9 BcData_H_94 BD10 BcCheck_H_11 BD12 PllBypass_H BD16 Reset_L BD18 Tms_H BD2 NoConnect BD22 SysDataInValid_L BD24 SysAddIn_L_12 BD28 SysAddIn_L_6 BD30 SysAddIn_L_0 BD34 SysAddOutClk_L BD36 SysAddOut_L_2 BD4 Spare BD40 BcData_H_127 BD42 BcData_H_61 BD6 BcData_H_92 BE11 ClkFwdRst_H BE13 IRQ_H_1 BE17 SromOE_L BE19 Tck_H BE23 SysAddIn_L_13 BE27 SysAddIn_L_7 BE29 SysAddIn_L_3 BE33 SysAddOut_L_10 BE35 SysAddOut_L_6 BE39 SysAddOut_L_1 BE41 BcData_H_63 BE5 BcData_H_29 BE7 BcData_H_31 BE9 Spare C11 BcData_H_64 C15 BcTag_H_28 C17 BcTag_H_32 C21 BcTag_H_40 C23 BcTagDirty_H C25 BcTagOutClk_H C29 BcAdd_H_8 C3 BcData_H_4 C31 BcAdd_H_12 C35 BcAdd_H_22 C37 BcData_H_33 C41 BcData_H_35 C43 BcData_H_36 C5 BcData_H_3 C9 BcData_H_66 D10 BcData_H_1 D14 BcTag_H_24 D16 BcTag_H_30 D2 BcData_H_71 D20 BcTag_H_38 D22 BcTag_H_41 D24 BcTagOutClk_L D26 BcDataWr_L D30 BcAdd_H_11 D32 BcAdd_H_17 D36 BcData_H_97 D4 BcData_H_69 D42 BcData_H_100 D44 BcData_H_101 D8 SysDataInClk_H_0 E1 BcData_H_7 E13 BcTag_H_20 E15 BcTag_H_25 E19 BcTag_H_34 E21 BcTag_H_39 E25 BcTagWr_L E27 BcAdd_H_5 E3 BcData_H_5 E31 BcAdd_H_16 E33 BcAdd_H_23 E37 SysDataInClk_H_4 E39 Spare E43 BcData_H_37 E45 BcCheck_H_12 E7 BcDataInClk_H_0 F12 SysData_L_2 F14 SysData_L_0 F18 BcVref F2 BcCheck_H_0 F20 BcTag_H_37 F26 BcLoad_L F28 BcAdd_H_9 F32 SysData_L_32 F34 SysData_L_33 F38 BcDataInClk_H_4 F40 SysData_L_36 F44 BcData_H_39 F6 SysData_L_6 F8 Spare G1 BcData_H_73 G11 SysDataOutClk_L_0 G13 SysData_L_1 G17 BcTag_H_27 G19 BcTagInClk_H G23 BcTagShared_H G27 BcAdd_H_7 G29 BcAdd_H_14 G33 BcData_H_32 G35 SysData_L_35
G39 SysData_L_37 G41 BcData_H_38 G45 BcData_H_104 G5 BcData_H_70 G7 SysData_L_5 H10 SysData_L_4 H12 SysData_L_3 H16 BcTag_H_21 H18 BcTag_H_29 H22 BcTag_H_42 H24 BcTagOE_L H28 BcAdd_H_13 H30 BcAdd_H_21 H34 SysData_L_34 H36 SysDataOutClk_L_4 H4 BcData_H_72 H40 BcData_H_102 H42 BcData_H_103 H6 BcData_H_6 J3 BcData_H_8 J41 SysData_L_39 J43 BcDataOutClk_H_2 J5 BcDataOutClk_H_0 K2 BcData_H_9 K38 SysData_L_38 K4 BcDataOutClk_L_0 K42 BcDataOutClk_L_2 K44 BcData_H_40 K8 SysData_L_7 L1 BcData_H_75 L3 BcData_H_10 L39 SysCheck_L_4 L43 BcData_H_105 L45 BcData_H_106 L7 SysCheck_L_0 M2 BcData_H_11 M38 BcCheck_H_4 M40 SysData_L_40 M44 BcData_H_42 M6 SysData_L_8 M8 BcCheck_H_8 N1 BcData_H_76 N39 SysData_L_41 N41 BcData_H_41 N45 BcData_H_107 N5 BcData_H_74 N7 SysData_L_9 P4 SysDataInClk_H_1 P40 SysData_L_42 P42 BcData_H_43 P6 SysData_L_10 R3 BcDataInClk_H_1 R41 SysDataOutClk_L_5 R43 SysDataInClk_H_5 T2 BcData_H_12 T38 SysData_L_43 T4 Spare T44 BcData_H_108 T8 SysData_L_11 U1 BcData_H_13 U3 BcData_H_77 U39 BcDataInClk_H_5 U43 BcData_H_44 U45 BcData_H_109 U7 SysDataOutClk_L_1 V2 BcData_H_14 V38 Spare V40 SysData_L_44 V44 BcData_H_45 V6 SysData_L_13 V8 SysData_L_12 W1 BcData_H_79 W39 SysData_L_46 W41 SysData_L_45 W45 BcData_H_110 W5 BcData_H_78 W7 SysData_L_14 Y4 BcData_H_15 Y40 SysData_L_47 Y42 BcData_H_46 Y6 SysData_L_15
Table 3–6 lists the 21264/EV68A ground (VSS) and power (VDD) pins.
Table 3–6 Ground and Power (VSS and VDD) Pin List
Signal PGA Location
VSS A15 A21 A25 A3 A31 A37 A43 A9 AA1 AA39 AA45 AA7 AC41 AC5 AE1 AE39 AE45 AE7 AG3 AG39 AG43 AG5 AJ41 AJ5 AL1 AL39 AL45 AN3 AN43 AN5 AR41 AR5 AU1 AU39 AU45 AU7 AW21 AW25 AW3 AW31 AW37 AW41 AW43 AW7 AW9 AY14 BA11 BA17 BA23 BA29 BA35 BA41 BA5 BA7 BC1 BC13 BC19 BC27 BC33 BC39 BC45 BC7 BE15 BE21 BE25 BE3 BE31 BE37 BE43 C1 C13 C19 C27 C33 C39 C45 C7 D38 E11 E17 E23 E29 E35 E41 E5 E9 G15 G21 G25 G3 G31 G37 G43 G9 J1 J39 J45 J7 L41 L5 N3 N43 R1 R39 R45 R5 R7 T42 U41 U5 W3 W43
VDD A23 AB40 AB6 AD40 AD6 AF2 AF38 AF44 AF8 AH4 AH42 AK40 AK6 AM2 AM38 AM44 AP4 AP42 AP6 AT40 AT6 AV14 AV2 AV20 AV26 AV32 AV38 AV44 AY10 AY16 AY22 AY24 AY30 AY36 AY4 AY42 B14 B2 B20 B26 B32 B38 B44 B8 BB12 BB18 BB28 BB34 BB40 BB6 BD14 BD20 BD26 BD32 BD38 BD44 BD8 D12 D18 D28 D34 D40 D6 F10 F16 F22 F24 F30 F36 F4 F42 H14 H2 H20 H26 H32 H38 H44 K40 K6 M4 M42 P2 P38 P44 P8 T40 T6 V4 V42 Y2 Y38 Y44 Y8
3.4 Mechanical Specifications
This section shows the 21264/EV68A mechanical package dimensions without a heat sink. For heat sink information and dimensions, refer to Chapter 10.
Figure 3–2 shows the package physical dimensions without a heat sink.
Figure 3–2 Package Dimensions
[Figure 3–2 is a dimensioned package outline drawing (FM-05662.AI4) that is not reproduced here. It shows the 587-pin grid, with rows lettered A through BE and columns numbered 01 through 45, the lid, four standoffs, and two 1/4-20 studs. Dimension callouts in the drawing include 59.94 mm (2.360 in), 53.85 mm (2.120 in), 29.62 mm (1.180 in), 27.94 mm (1.100 in), 25.40 mm (1.000 in), 7.62 mm (.300 in), 4.32 mm (.170 in), 2.54 mm (.100 in), 1.905 mm (.075 in), 587x 1.40 mm (.055 in), 1.377 mm (.055 in), 1.27 mm (.050 in), and .13 mm (.005 in) R.]
3.5 21264/EV68A Packaging
Figure 3–3 shows the 21264/EV68A pinout from the top view with pins facing down.
Figure 3–3 21264/EV68A Top View (Pin Down)
[Figure 3–3 is a pin-grid diagram (FM-05644) that is not reproduced here. It shows the top view of the package, pins down, with rows lettered A through BE and columns numbered 01 through 45.]
Figure 3–4 shows the 21264/EV68A pinout from the bottom view with pins facing up.
Figure 3–4 21264/EV68A Bottom View (Pin Up)
[Figure 3–4 is a pin-grid diagram (FM-05645) that is not reproduced here. It shows the bottom view of the package, pins up, with rows lettered A through BE and columns numbered 01 through 45.]
4 Cache and External Interfaces

This chapter describes the 21264/EV68A cache and external interface, which includes the second-level cache (Bcache) interface and the system interface. It also describes locks, interrupt signals, and ECC/parity generation. It is organized as follows:
Introduction to the external interfaces
Physical address considerations
Bcache structure
Victim data buffer
Cache coherency
Lock mechanism
System port
Bcache port
Interrupts
Chapter 3 lists and defines all 21264/EV68A hardware interface signal pins. Chapter 9 describes the 21264/EV68A hardware interface electrical requirements.

4.1 Introduction to the External Interfaces

A 21264/EV68A-based system can be divided into three major sections:
21264/EV68A microprocessor
Second-level Bcache
System interface logic
– Optional duplicate tag store
– Optional lock register
– Optional victim buffers
The 21264/EV68A external interface is flexible and mandates few design rules, allowing a wide range of prospective systems. The external interface is composed of the Bcache interface and the system interface.
Input clocks must have the same frequency as their corresponding output clock. For example, the frequency of SysAddInClk_L must be the same as that of SysAddOutClk_L.
The Bcache interface includes a 128-bit bidirectional data bus, a 20-bit unidirectional address bus, and several control signals.
– The BcDataOutClk_x[3:0] clocks are free-running and are derived from the internal GCLK. The period of BcDataOutClk_x[3:0] is a programmable multiple of GCLK.
– The Bcache turns the BcDataOutClk_x[3:0] clocks around and returns them to the 21264/EV68A as BcDataInClk_H[7:0]. Likewise, BcTagOutClk_x returns as BcTagInClk_H.
– The Bcache interface supports a 64-byte block size.
The system interface includes a 64-bit bidirectional data bus, two 15-bit unidirectional address buses, and several control signals.
– The SysAddOutClk_L clock is free-running and is derived from the internal GCLK. The period of SysAddOutClk_L is a programmable multiple of GCLK.
– The SysAddInClk_L clock is a turned-around copy of SysAddOutClk_L.
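The Bcache bus widths quoted above can be cross-checked with a little arithmetic. The sketch below assumes that each value on the 20-bit index bus BcAdd_H[23:4] selects one 128-bit (16-byte) data transfer, which is an interpretation of the signal ranges rather than a statement from this section; the constant names are illustrative.

```python
DATA_BUS_BITS = 128            # BcData_H[127:0]
INDEX_HI, INDEX_LO = 23, 4     # BcAdd_H[23:4]
BLOCK_BYTES = 64               # Bcache block size

bytes_per_transfer = DATA_BUS_BITS // 8       # 16 bytes per data-bus transfer
index_bits = INDEX_HI - INDEX_LO + 1          # 20 index bits

# Assumption: one index value addresses one data-bus-wide transfer, so the
# lowest index bit (bit 4) corresponds to a 16-byte granule.
addressable_bytes = (1 << index_bits) * bytes_per_transfer

# A 64-byte block therefore takes four data-bus transfers.
transfers_per_block = BLOCK_BYTES // bytes_per_transfer
```

Under that assumption the index bus reaches 16 MB, which agrees with the maximum Bcache size given in Section 4.1.2.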
Figure 4–1 shows a simplified view of the external interface. The function and purpose of each signal is described in Chapter 3.
Figure 4–1 21264/EV68A System and Bcache Interfaces
[Figure 4–1 is a block diagram (FM-05818B-EV67) that is not reproduced here. It shows the 21264 connected to the system and to the Bcache data, tag, and status arrays.
System-side signals: SysAddIn_L[14:0], SysAddInClk_L, SysAddOut_L[14:0], SysAddOutClk_L, SysVref, SysData_L[63:0], SysCheck_L[7:0], SysDataInClk_H[7:0], SysDataOutClk_L[7:0], SysDataInValid_L, SysDataOutValid_L, SysFillValid_L, IRQ_H[5:0].
Bcache-side signals: BcAdd_H[23:4], BcLoad_L, BcData_H[127:0], BcCheck_H[15:0], BcDataInClk_H[7:0], BcDataOutClk_x[3:0], BcDataOE_L, BcDataWr_L, BcTag_H[42:20], BcTagInClk_H, BcTagOutClk_x, BcVref, BcTagWr_L, BcTagOE_L, BcTagValid_H, BcTagDirty_H, BcTagShared_H, BcTagParity_H.
Address ranges marked on the Bcache arrays: [23:4], [23:6], [23:6].]

4.1.1 System Interface

This section introduces the system (external) bus interface. The system interface is made up of two unidirectional 15-bit address buses, 64 bidirectional data lines, eight bidirectional check bits, two single-ended unidirectional clocks, and a few control pins. The 15-bit address buses provide time-shared address/command/ID in two or four GCLK cycles. The Cbox controls the system interface.

4.1.1.1 Commands and Addresses
The system sends probe and data movement commands to the 21264/EV68A. The 21264/EV68A can hold up to eight probe commands from the system. The system controls the number of outstanding probe commands and must ensure that the 21264/EV68A 8-entry probe queue does not overflow.
The Cbox contains an 8-entry miss buffer (MAF) and an 8-entry victim buffer (VAF). A miss occurs when the 21264/EV68A probes the Bcache but does not find the addressed block. The 21264/EV68A can queue eight cache misses to the system in its MAF.
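Because the system is responsible for keeping the 8-entry probe queue from overflowing, system logic needs some way to count outstanding probes. The sketch below shows one possible approach, a simple credit counter. The class and method names are hypothetical, and the manual does not prescribe this mechanism.

```python
class ProbeCreditTracker:
    """Hypothetical system-side counter for probes outstanding at the processor."""

    QUEUE_DEPTH = 8  # the probe queue holds up to eight probe commands

    def __init__(self):
        self.outstanding = 0

    def can_send(self):
        """True if another probe can be issued without overflowing the queue."""
        return self.outstanding < self.QUEUE_DEPTH

    def send_probe(self):
        """Record that a probe command was sent to the processor."""
        if not self.can_send():
            raise RuntimeError("probe queue would overflow")
        self.outstanding += 1

    def probe_response(self):
        """Record that the processor responded to one outstanding probe."""
        if self.outstanding == 0:
            raise RuntimeError("no probe is outstanding")
        self.outstanding -= 1
```

A system model would call send_probe() when it issues a probe and probe_response() when the matching response arrives, stalling new probes whenever can_send() is false.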

4.1.2 Second-Level Cache (Bcache) Interface

The 21264/EV68A Cbox provides control signals and an interface for a second-level cache, the Bcache. The 21264/EV68A supports a Bcache from 1 MB to 16 MB, with 64-byte blocks. A 128-bit data bus is used for transfers between the 21264/EV68A and the Bcache. The Bcache must be composed of synchronous static RAMs (SSRAMs) and must contain either one, two, or three internal registers. All Bcache control and address pins are clocked synchronously on Bcache cycle boundaries. The Bcache clock period is a multiple of the CPU clock cycle: it can be set in half-cycle increments from 1.5 to 4.0, and in full-cycle increments of 5, 6, 7, and 8 times the CPU clock cycle. The 1.5 multiple is only available in dual-data mode.
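The permitted Bcache clock multiples can be enumerated directly from the rules in the preceding paragraph. The following is a minimal sketch; the function name and the boolean parameter are illustrative, and the actual configuration registers are not modeled.

```python
def valid_bcache_ratios(dual_data_mode):
    """List the Bcache clock multiples of the CPU clock cycle allowed by the text."""
    # Half-cycle increments from 1.5 to 4.0: 1.5, 2.0, 2.5, 3.0, 3.5, 4.0
    half_steps = [n / 2 for n in range(3, 9)]
    # Full-cycle increments of 5, 6, 7, and 8
    full_steps = [5.0, 6.0, 7.0, 8.0]
    ratios = half_steps + full_steps
    if not dual_data_mode:
        # The 1.5 multiple is only available in dual-data mode.
        ratios.remove(1.5)
    return ratios
```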
4.2 Physical Address Considerations
The 21264/EV68A supports a 44-bit physical address space that is divided equally between memory space and I/O space. Memory space resides in the lower half of the physical address space (PA[43] = 0) and I/O space resides in the upper half of the physical address space (PA[43] = 1). The 21264/EV68A recognizes these spaces internally.
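The memory/I/O split described above reduces to a test of physical address bit 43. A minimal sketch follows; the function and constant names are illustrative.

```python
PA_BITS = 44        # width of the physical address space
IO_SPACE_BIT = 43   # PA[43] = 1 selects I/O space, PA[43] = 0 selects memory space


def is_io_space(pa):
    """Return True if a 44-bit physical address lies in I/O space."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("address is outside the 44-bit physical address space")
    return bool((pa >> IO_SPACE_BIT) & 1)
```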
The 21264/EV68A-generated external references to memory space are always of a fixed 64-byte size, though the internal access granularity is byte, word, longword, or quadword. All 21264/EV68A-generated external references to memory or I/O space are physical addresses that are either successfully translated from a virtual address or produced by PALcode. Speculative execution may cause a reference to nonexistent memory. Systems must check the range of all addresses and report nonexistent addresses to the 21264/EV68A.
Table 4–1 describes the translation of internal references to external interface references. The first column lists the instructions used by the programmer, including load (LDx) and store (STx) instructions of several sizes. The column headings are described here:
DcHit (block was found in the Dcache)
DcW (block was found in a writable state in the Dcache)
BcHit (block was found in the Bcache)
BcW (block was found in a writable state in the Bcache)
Status and Action (status at end of instruction and action performed by the 21264/EV68A)
Prefetches (LDL, LDF, LDG, LDT, LDBU, LDWU) to R31 use the LDx flow, and prefetch with modify intent (LDS) uses the STx flow. If the prefetch target is addressed to I/O space, the upper address bit is cleared, converting the address to memory space (PA[42:6]). Notes follow the table.
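The prefetch address conversion can be illustrated as a bit operation. The sketch below assumes that "PA[42:6]" means that only bits 42 through 6 of the address are used, that is, bit 43 is cleared and the address is treated as a 64-byte block address. The function name is illustrative.

```python
BLOCK_OFFSET_MASK = 0x3F   # low six bits select a byte within a 64-byte block
IO_SPACE_MASK = 1 << 43    # PA[43] distinguishes I/O space from memory space


def prefetch_block_address(pa):
    """Return the memory-space block address a prefetch to pa would target.

    Assumption: the result keeps only PA[42:6] of the original address.
    """
    if not 0 <= pa < (1 << 44):
        raise ValueError("address is outside the 44-bit physical address space")
    memory_pa = pa & ~IO_SPACE_MASK          # clear the upper address bit
    return memory_pa & ~BLOCK_OFFSET_MASK    # drop the byte offset within the block
```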
Table 4–1 Translation of Internal References to External Interface Reference
Instruction          DcHit  DcW  BcHit  BcW  Status and Action
LDx Memory           1      X    X      X    Dcache hit, done.
LDx Memory           0      X    1      X    Bcache hit, done.
LDx Memory           0      X    0      X    Miss, generate RdBlk command.
LDx I/O              X      X    X      X    RdBytes, RdLWs, or RdQWs based on size.
Istream Memory       1      X    X      X    Dcache hit, Istream serviced from Dcache.
Istream Memory       0      X    1      X    Bcache hit, Istream serviced from Bcache.
Istream Memory       0      X    0      X    Miss, generate RdBlkI command.
STx Memory           1      1    X      X    Store Dcache hit and writable, done.
STx Memory           1      0    X      X    Store hit and not writable, set dirty flow (note 1).
STx Memory           0      X    1      1    Store Bcache hit and writable, done.
STx Memory           0      X    1      0    Store hit and not writable, set dirty flow (note 1).
STx Memory           0      X    0      X    Miss, generate RdBlkMod command.
STx I/O              X      X    X      X    WrBytes, WrLWs, or WrQWs based on size.
STx_C Memory         0      X    X      X    Fail STx_C.
STx_C Memory         1      0    X      X    STx_C hit and not writable, set dirty flow (note 1).
STx_C I/O            X      X    X      X    Always succeed and WrQWs or WrLWs are generated, based on the size.
WH64 Memory          1      1    X      X    Hit, done.
WH64 Memory          1      0    X      X    WH64 hit not writable, set dirty flow (note 1).
WH64 Memory          0      X    1      1    WH64 hit dirty, done.
WH64 Memory          0      X    1      0    WH64 hit not writable, set dirty flow (note 1).
WH64 Memory          0      X    0      X    Miss, generate InvalToDirty command (note 2).
WH64 I/O             X      X    X      X    NOP the instruction. WH64 is UNDEFINED for I/O space.
ECB Memory           X      X    X      X    Generate evict command (note 3).
ECB I/O              X      X    X      X    NOP the instruction. ECB instruction is UNDEFINED for I/O space.
MB/WMB TBFill Flows  X      X    X      X    Generate MB command (note 4). Also see Section 3.2.5.
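The LDx, STx, and WH64 rows of Table 4–1 can be read as a priority decode: I/O space first, then Dcache hit, then Bcache hit, then miss. The following is a minimal sketch of that reading, not a description of the Cbox implementation. The function names, the "memory"/"io" strings, and the returned labels are illustrative assumptions.

```python
def load_action(space, dc_hit=False, bc_hit=False):
    """LDx rows of Table 4-1. `space` is "memory" or "io" (assumed labels)."""
    if space == "io":
        return "RdBytes/RdLWs/RdQWs"   # chosen by access size
    if dc_hit:
        return "Dcache hit, done"
    if bc_hit:
        return "Bcache hit, done"
    return "RdBlk"                     # miss in both caches


def store_action(instr, space, dc_hit=False, dc_w=False, bc_hit=False, bc_w=False):
    """STx and WH64 rows of Table 4-1. `instr` is "STx" or "WH64"."""
    if instr not in ("STx", "WH64"):
        raise ValueError("only the STx and WH64 rows are modeled here")
    if space == "io":
        # WH64 is UNDEFINED for I/O space and is turned into a NOP.
        return "WrBytes/WrLWs/WrQWs" if instr == "STx" else "NOP"
    if dc_hit:
        return "done" if dc_w else "set dirty flow"
    if bc_hit:
        return "done" if bc_w else "set dirty flow"
    return "RdBlkMod" if instr == "STx" else "InvalToDirty"
```

For example, `store_action("STx", "memory", dc_hit=True, dc_w=False)` yields the set dirty flow described in note 1.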
Table 4–1 notes:
1. Set Dirty Flow: Based on the Cbox CSR SET_DIRTY_ENABLE[2:0], SetDirty requests can be either internally acknowledged (called a SetModify) or sent to the system environment for processing. When externally acknowledged, the shared status information for the cache block is also broadcast. The commands sent externally are SharedToDirty or CleanToDirty. Based on the Cbox CSR ENABLE_STC_COMMAND[0], the external system can be informed of a STx_C generating a SetDirty using the STCChangeToDirty command. See Table 4–16 for more information.
2. InvalToDirty: Based on the Cbox CSR INVAL_TO_DIRTY_ENABLE[1:0], InvalToDirty requests can be either internally acknowledged or sent to the system environment as InvalToDirty commands. This Cbox CSR provides the ability to convert WH64 instructions to RdModx operations. See Table 4–15 for more information.
3. Evict: There are two aspects to the commands that are generated by an ECB instruction: first, those commands that are generated to notify the system of an evict being performed; second, those commands that are generated by any victim that is created by servicing the ECB.
If Cbox CSR ENABLE_EVICT[0] is clear, no command is issued by the 21264/EV68A on the external interface to notify the system of an evict being performed. If Cbox CSR ENABLE_EVICT[0] is set, the 21264/EV68A issues an Evict command on the system interface only if a Bcache index match to the ECB address is found in the 21264/EV68A cache system.
Note that whenever ENABLE_EVICT[0] is true (in the write-many chain), BC_CLEAN_VICTIM must also be true (in the write-once chain). Otherwise, the 21264/EV68A could respond miss to a probe, rather than hit, before an Evict command has been sent off chip, but after the Evict command has removed a (clean) block from the internal caches and the Bcache. That behavior might cause systems that maintain an external duplicate copy of the Bcache tags to become confused, because the system could receive the probe response indicating the miss before it receives the Evict command.
The 21264/EV68A can issue the commands CleanVictimBlk and WrVictimBlk for a victim that is created by an ECB. CleanVictimBlk is issued only if Cbox CSR BC_CLEAN_VICTIM is set and there is a Bcache index match valid but not dirty in the 21264/EV68A cache system. WrVictimBlk is issued for any Bcache match of the ECB address that is dirty in the 21264/EV68A cache system.
4. MB: Based on the Cbox CSR SYSBUS_MB_ENABLE, the MB command can be sent to the pins.
Each of these CSRs is programmed appropriately, based on the cache coherence protocol used by the system environment. For example, uniprocessor systems would prefer to internally acknowledge most of these transactions. In contrast, multiprocessor systems may require notification and control of any change in cache state. The 21264/EV68A and the external system must cooperate to maintain cache coherence. Section 4.5 explains the 21264/EV68A part of the cache coherency protocol.
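The ECB rules in note 3 can be summarized as a small decision function. This is a sketch under stated assumptions: the function name and its boolean parameters are hypothetical, the order of the returned commands is arbitrary (the note does not specify one), and the configuration check simply encodes the requirement that BC_CLEAN_VICTIM be set whenever ENABLE_EVICT[0] is set.

```python
from typing import List


def ecb_commands(enable_evict: bool, bc_clean_victim: bool,
                 index_match: bool, valid: bool, dirty: bool) -> List[str]:
    """Commands an ECB may cause on the system interface (per note 3)."""
    if enable_evict and not bc_clean_victim:
        # Note 3: ENABLE_EVICT[0] set requires BC_CLEAN_VICTIM set.
        raise ValueError("ENABLE_EVICT requires BC_CLEAN_VICTIM")
    commands = []
    if enable_evict and index_match:
        commands.append("Evict")            # notify the system of the evict
    if index_match and valid:
        if dirty:
            commands.append("WrVictimBlk")  # dirty victim is always written
        elif bc_clean_victim:
            commands.append("CleanVictimBlk")
    return commands
```

With both CSR bits clear, a clean valid match produces no external command at all, which matches the uniprocessor preference for internally acknowledged transactions.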

4.3 Bcache Structure

The 21264/EV68A Cbox provides control signals and an interface for a second-level cache (Bcache).
The 21264/EV68A supports a Bcache from 1MB to 16MB, with 64-byte blocks. A 128-bit bidirectional data bus is used for transfers between the 21264/EV68A and the Bcache. The Bcache is fully synchronous and the synchronous static RAMs (SSRAMs) must contain either one, two, or three internal registers. All Bcache control and address pins are clocked synchronously on Bcache cycle boundaries. The Bcache clock rate varies as a multiple of the CPU clock cycle in half-cycle increments from 1.5 to 4.0, and in full-cycle increments of 5, 6, 7, and 8 times the CPU clock cycle. The 1.5 multiple is only available in dual-data mode.
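The figures in the paragraph above lend themselves to a quick arithmetic check. The sketch below encodes the allowed clock multiples and derives the block count and the number of bus transfers per block from the stated sizes. The function names are illustrative assumptions, and 1MB is taken to mean 2^20 bytes.

```python
BLOCK_BYTES = 64          # Bcache block size
BUS_BYTES = 128 // 8      # 128-bit data bus moves 16 bytes per transfer

HALF_STEP_RATIOS = {1.5, 2.0, 2.5, 3.0, 3.5, 4.0}
FULL_STEP_RATIOS = {5.0, 6.0, 7.0, 8.0}


def is_valid_bcache_ratio(ratio: float, dual_data: bool) -> bool:
    """True if `ratio` CPU cycles per Bcache cycle is an allowed multiple."""
    if ratio == 1.5:
        return dual_data  # 1.5 is available only in dual-data mode
    return ratio in HALF_STEP_RATIOS or ratio in FULL_STEP_RATIOS


def blocks_in_bcache(size_mb: int) -> int:
    """Number of 64-byte blocks in a Bcache of `size_mb` megabytes."""
    if not 1 <= size_mb <= 16:
        raise ValueError("supported Bcache sizes are 1MB to 16MB")
    return size_mb * 2**20 // BLOCK_BYTES


def transfers_per_block() -> int:
    """Bus transfers needed to move one block over the 128-bit bus."""
    return BLOCK_BYTES // BUS_BYTES
```

A 1MB Bcache therefore holds 16384 blocks, and each block takes four transfers on the data bus.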

4.3.1 Bcache Interface Signals

Figure 4–2 shows the 21264/EV68A Bcache interface signals.
Figure 4–2 21264/EV68A Bcache Interface Signals
Signals shown on the 21264: BcData_H[127:0], BcCheck_H[15:0], BcDataInClk_H[7:0], BcDataOutClk_x[3:0], BcDataOE_L, BcDataWr_L, BcAdd_H[23:4], BcTag_H[42:20], BcTagInClk_H, BcTagOutClk_x, BcVref, BcTagDirty_H, BcTagParity_H, BcTagShared_H, BcTagValid_H, BcTagOE_L, BcTagWr_L, BcLoad_L

4.3.2 System Duplicate Tag Stores

The 21264/EV68A provides Bcache state support for systems with and without duplicate tag stores, and will take different actions on this basis. The system sets the Cbox CSR DUP_TAG_ENA[0], indicating that it has a duplicate tag store for the Bcache. Systems using the DUP_TAG_ENA[0] bit must also use the Cbox CSR BC_CLEAN_VICTIM[0] bit to avoid deadlock situations.
Systems using a Bcache duplicate tag store can accelerate system performance by:
Issuing probes and SysDc fill commands to the 21264/EV68A out-of-order with respect to their order at the system serialization point
Filtering out all probe misses from the 21264/EV68A cache system
If a probe misses in the 21264/EV68A cache system (Bcache miss and VAF miss), the 21264/EV68A stalls probe processing with the expectation that a SysDc fill will allocate this block. Because of this, in duplicate tag mode, the 21264/EV68A can never generate a probe miss response.
When Cbox CSR DUP_TAG_ENA[0] equals 0, the 21264/EV68A delivers a miss response for probes that do not hit in its cache system.
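The difference between the two modes comes down to what happens when a probe finds nothing. A minimal sketch, with a hypothetical function name and illustrative return labels:

```python
def probe_outcome(dup_tag_ena: bool, bcache_hit: bool, vaf_hit: bool) -> str:
    """How a probe is resolved, depending on Cbox CSR DUP_TAG_ENA[0]."""
    if bcache_hit or vaf_hit:
        return "hit"
    if dup_tag_ena:
        # Duplicate tag mode: wait for the SysDc fill expected to allocate
        # the block, so a miss response is never produced.
        return "stall"
    return "miss response"
```

In duplicate tag mode the function never returns "miss response", which is the property the system relies on when it filters probe misses itself.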
4.4 Victim Data Buffer
The 21264/EV68A has eight victim data buffers (VDBs). They have the following properties:
The VDBs are used for both victims (fills that are replacing dirty cache blocks) and for system probes that require data movement. The CleanVictimBlk command (optional) assigns and uses a VDB.
Each VDB has two valid bits that indicate the buffer is valid for a victim or valid for a probe or valid for both a victim and a probe. Probe commands that match the address of a victim address file (VAF) entry with an asserted probe-valid bit (P) will stall the 21264/EV68A probe queue. No ProbeResponses will be returned until the P bit is clear.
The release victim buffer (RVB) bit, when asserted, causes the victim valid bit, on the victim data buffer (VDB) specified in the ID field, to be cleared. The RVB bit will also clear the IOWB when systems move data on I/O write transactions. In this case, ID[3] equals one.
The release probe buffer (RPB) bit, when asserted (with a WriteData or ReleaseBuffer SysDc command), clears the P bit in the victim buffer entry specified in the ID field.
Read data commands and victim write commands use IDs 0-7, while IDs 8-11 are used to address the four I/O write buffers.
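The ID field and the two release bits described above can be modeled in a few lines. This is a sketch, not the hardware's bookkeeping: the class, the function names, and the mapping of IDs 8-11 onto I/O write buffer indexes 0-3 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VictimBuffer:
    victim_valid: bool = False  # buffer is valid for a victim
    probe_valid: bool = False   # P bit: buffer is valid for a probe


def decode_buffer_id(buf_id: int):
    """IDs 0-7 select a VDB; IDs 8-11 (ID[3] == 1) select an I/O write buffer."""
    if 0 <= buf_id <= 7:
        return ("VDB", buf_id)
    if 8 <= buf_id <= 11:
        return ("IOWB", buf_id - 8)  # index mapping is an assumption
    raise ValueError("ID out of range")


def release(vdbs, iowb_busy, buf_id, rvb=False, rpb=False):
    """Apply the RVB and RPB bits of a SysDc command to the buffer state."""
    kind, index = decode_buffer_id(buf_id)
    if kind == "IOWB":
        if rvb:
            iowb_busy[index] = False       # RVB also clears the IOWB
        return
    if rvb:
        vdbs[index].victim_valid = False   # release victim buffer
    if rpb:
        vdbs[index].probe_valid = False    # release probe buffer (clear P)
```

A buffer that is valid for both a victim and a probe needs both bits released before it is free again.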

4.5 Cache Coherency

This section describes the basics and protocols of the 21264/EV68A cache coherency scheme.

4.5.1 Cache Coherency Basics

The 21264/EV68A systems maintain the cache hierarchy shown in Figure 4–3.
Figure 4–3 Cache Subset Hierarchy
(Figure labels: System, Main Memory, Bcache, Dcache, Icache)
The following tasks must be performed to maintain cache coherency:
Istream data from memory spaces may be cached in the Icache and Bcache. Icache coherence is not maintained by hardware; it must be maintained by software using the CALL_PAL IMB instruction.
The 21264/EV68A maintains the Dcache as a subset of the Bcache. The Dcache is set-associative but is kept a subset of the larger externally implemented direct-mapped Bcache.
System logic must help the 21264/EV68A to keep the Bcache coherent with main memory and other caches in the system.
The 21264/EV68A requires the system to allow only one change to a block at a time. This means that if the 21264/EV68A gains the bus to read or write a block, no other node on the bus should be allowed to access that block until the data has been moved.
The 21264/EV68A provides hardware mechanisms to support several cache coherency protocols. The protocols can be separated into two classes: write invalidate cache coherency protocol and flush cache coherency protocol.

4.5.2 Cache Block States

Table 4–2 lists the cache block states supported by the 21264/EV68A.
Table 4–2 21264/EV68A-Supported Cache Block States
State Name    Description
Invalid       The 21264/EV68A does not have a copy of the block.
Clean         This 21264/EV68A holds a read-only copy of the block, and no other agent in the system holds a copy. Upon eviction, the block is not written to memory.
Clean/Shared  This 21264/EV68A holds a read-only copy of the block, and at least one other agent in the system may hold a copy of the block. Upon eviction, the block is not written to memory.
Dirty         This 21264/EV68A holds a read-write copy of the block, and must write it to memory after it is evicted from the cache. No other agent in the system holds a copy of the block.
Dirty/Shared  This 21264/EV68A holds a read-only copy of the dirty block, which may be shared with another agent. The block must be written back to memory when it is evicted.
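Two properties separate the states in Table 4–2: whether the local copy is writable, and whether the block must be written to memory on eviction. A minimal sketch of that classification follows; the enum and helper names are illustrative assumptions.

```python
from enum import Enum


class BlockState(Enum):
    INVALID = "Invalid"
    CLEAN = "Clean"
    CLEAN_SHARED = "Clean/Shared"
    DIRTY = "Dirty"
    DIRTY_SHARED = "Dirty/Shared"


def is_writable(state: BlockState) -> bool:
    """Only Dirty is described as a read-write copy."""
    return state is BlockState.DIRTY


def written_back_on_eviction(state: BlockState) -> bool:
    """Dirty and Dirty/Shared blocks must be written to memory when evicted."""
    return state in (BlockState.DIRTY, BlockState.DIRTY_SHARED)
```

Dirty/Shared is the one state that is not writable yet still carries a write-back obligation.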

4.5.3 Cache Block State Transitions

Cache block state transitions are reflected by 21264/EV68A-generated commands to the system. Cache block state transitions can also be caused by system-generated commands to the 21264/EV68A (probes). Probes control the next state for the cache block. The next state can be based on the previous state of the cache block. Table 4–3 lists the next state for the cache block.
Table 4–3 Cache Block State Transitions
Next State                        Action Based on Probe Hit
No change                         Do not update cache state. Useful for DMA transactions that sample data but do not want to update tag state.
Clean                             Independent of previous state, update next state to Clean.
Clean/Shared                      Independent of previous state, update next state to Clean/Shared. This transaction is useful for systems that update memory on probe hits.
T1: Clean -> Clean/Shared         Based on the dirty bit, make the block clean or dirty shared. This transaction is useful for systems that do not update memory on probe hits.
    Dirty -> Dirty/Shared
T3: Clean -> Clean/Shared         If the block is Clean or Dirty/Shared, change to Clean/Shared. If the block is Dirty, change to Invalid. This transaction is useful for systems that use the Dirty/Shared state as an exclusive state.
    Dirty -> Invalid
    Dirty/Shared -> Clean/Shared
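Table 4–3 can be expressed as a function from the probe's next-state selection and the block's current state to its new state. The sketch below uses plain strings for states. The selector names ("NoChange", "T1", "T3", and so on) are illustrative labels, and leaving a Clean/Shared block unchanged under T3 is an assumption, since the table does not list that case.

```python
def probe_next_state(selection: str, current: str) -> str:
    """New state of a block that a probe hit, per Table 4-3."""
    if selection == "NoChange":
        return current
    if selection in ("Clean", "Clean/Shared"):
        return selection                      # independent of previous state
    if selection == "T1":
        # Decided by the dirty bit: dirty blocks become Dirty/Shared.
        dirty = current in ("Dirty", "Dirty/Shared")
        return "Dirty/Shared" if dirty else "Clean/Shared"
    if selection == "T3":
        table = {"Clean": "Clean/Shared",
                 "Dirty": "Invalid",
                 "Dirty/Shared": "Clean/Shared"}
        return table.get(current, current)    # unlisted states: assumed unchanged
    raise ValueError("unknown next-state selection")
```

The function only models probe hits, so an Invalid current state is not a meaningful input.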
The cache state transitions caused by 21264/EV68A-generated commands are under the full control of the system environment using the SysDc (system data control) commands. Table 4–4 lists these commands.
Table 4–4 System Responses to 21264/EV68A Commands
Response Type                 21264/EV68A Action
SysDc ReadData                Fill block with the associated data and update tag with clean cache status.
SysDc ReadDataDirty           Fill block with the associated data and update tag with dirty cache status.
SysDc ReadDataShared          Fill block with the associated data and update tag with shared cache status.
SysDc ReadDataShared/Dirty    Fill block with the associated data and update tag with dirty/shared status.
SysDc ReadDataError           Fill block with all-ones reference pattern and update tag with invalid status.
SysDc ChangeToDirtySuccess    Unconditionally update block with dirty cache status.
SysDc ChangeToDirtyFail       Do not update cache status and fail any associated STx_C instructions.
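Read this way, Table 4–4 is a mapping from the SysDc response to the tag status written for the block. A minimal sketch follows; the function name and the lowercase status strings are illustrative assumptions taken from the wording of the table.

```python
FILL_STATUS = {
    "ReadData": "clean",
    "ReadDataDirty": "dirty",
    "ReadDataShared": "shared",
    "ReadDataShared/Dirty": "dirty/shared",
    "ReadDataError": "invalid",   # block is filled with an all-ones pattern
}


def tag_status_after(response: str, current: str) -> str:
    """Tag status after the system answers with `response`."""
    if response in FILL_STATUS:
        return FILL_STATUS[response]
    if response == "ChangeToDirtySuccess":
        return "dirty"            # unconditional update
    if response == "ChangeToDirtyFail":
        return current            # status untouched; associated STx_C fails
    raise ValueError("unknown SysDc response")
```

Only ChangeToDirtyFail depends on the current status; every other response overwrites it.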

4.5.4 Using SysDc Commands

Note the following:
The conventional response for RdBlk commands is SysDc ReadData or ReadDataShared.
The conventional response for a RdBlkMod command is SysDc ReadDataDirty.
The conventional response for ChangeToDirty commands is ChangeToDirtySuccess or ChangeToDirtyFail.
However, the system environment is not limited to these responses. Table 4–5 shows all 21264/EV68A commands, system responses, and the 21264/EV68A reaction. The 21264/EV68A commands are described in the following list:
Rdx commands are generated by load or Istream references.
RdBlkModx commands are generated by store references.
The ChxToDirty command group includes CleanToDirty, SharedToDirty, and STCChangeToDirty commands, which are generated by store references that hit in the 21264/EV68A cache system.
InvalToDirty commands are generated by WH64 instructions that miss in the 21264/EV68A cache system.
FetchBlk and FetchBlkSpec are noncached references to memory space that have missed in the 21264/EV68A cache system.
Rdiox commands are noncached references to I/O address space.
Evict and STCChangeToDirty commands are generated by ECB and STx_C instructions, respectively.
Table 4–5 shows the system responses to 21264/EV68A commands and 21264/EV68A reactions.
Table 4–5 System Responses to 21264/EV68A Commands and Reactions
21264/EV68A CMD  SysDc                                           21264/EV68A Action
Rdx              ReadData, ReadDataShared                        This is a normal fill. The cache block is filled and marked clean or shared based on SysDc.
Rdx              ReadDataShared/Dirty                            The cache block is filled and marked dirty/shared. Succeeding store commands cannot update the block without external reference.
Rdx              ReadDataDirty                                   The cache block is filled and marked dirty.
Rdx              ReadDataError                                   The cache block access was to NXM address space. The 21264/EV68A delivers an all-ones pattern to any load command and evicts the block from the cache (with associated victim processing). The cache block is marked invalid.
Rdx              ChangeToDirtySuccess, ChangeToDirtyFail         Both SysDc responses are illegal for read commands.
RdBlkModx        ReadData, ReadDataShared, ReadDataShared/Dirty  The cache block is filled and marked with a nonwritable status. If the store instruction that generated the RdBlkModx command is still active (not killed), the 21264/EV68A will retry the instruction, generating the appropriate ChangeToDirty command. Succeeding store commands cannot update the block without external reference.
RdBlkModx        ReadDataDirty                                   The 21264/EV68A performs a normal fill response, and the cache block becomes writable.
RdBlkModx        ChangeToDirtySuccess, ChangeToDirtyFail         Both SysDc responses are illegal for read/modify commands.
RdBlkModx        ReadDataError                                   The cache block command was to NXM address space. The 21264/EV68A delivers an all-ones pattern to any dependent load command, forces a fail action on any pending store commands to this block, and any store to this block is not retried. The Cbox evicts the cache block from the cache system (with associated victim processing). The cache block is marked invalid.
ChxToDirty       ReadData, ReadDataShared, ReadDataShared/Dirty  The original data in the Dcache is replaced with the filled data. The block is not writable, so the 21264/EV68A will retry the store instruction and generate another ChxToDirty class command. To avoid a potential livelock situation, the STC_ENABLE CSR bit must be set. Any STx_C instruction to this block is forced to fail. In addition, a Shared/Dirty response causes the 21264/EV68A to generate a victim for this block upon eviction.
ChxToDirty       ReadDataDirty                                   The data in the Dcache is replaced with the filled data. The block is writable, so the store instruction that generated the original command can update this block. Any STx_C instruction to this block is forced to fail. In addition, the 21264/EV68A generates a victim for this block upon eviction.
ChxToDirty       ReadDataError                                   Impossible situation. The block must be cached to generate a ChxToDirty command. Caching the block is not possible because all NXM fills are filled noncached.
ChxToDirty       ChangeToDirtySuccess                            Normal response. ChangeToDirtySuccess makes the block writable. The 21264/EV68A retries the store instruction and updates the Dcache. Any STx_C instruction associated with this block is allowed to succeed.
ChxToDirty       ChangeToDirtyFail                               The MAF entry is retired. Any STx_C instruction associated with the block is forced to fail. If a STx instruction generated this block, the 21264/EV68A retries and generates either a RdBlkModx (because the reference that failed the ChangeToDirty also invalidated the cache by way of an invalidating probe) or another ChxToDirty command.
InvalToDirty     ReadData, ReadDataShared, ReadDataShared/Dirty  The block is not writable, so the 21264/EV68A will retry the WH64 instruction and generate a ChxToDirty command.
InvalToDirty     ReadDataError                                   The 21264/EV68A doesn't send InvalToDirty commands offchip speculatively. This NXM condition is a hard error. Systems should perform a machine check.
InvalToDirty     ReadDataDirty, ChangeToDirtySuccess             The block is writable. Done.
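For the store-type commands, the rows above reduce to a handful of outcomes. The sketch below tabulates them so that a system model can check a planned response before issuing it. The outcome labels ("writable", "retry", "illegal", and so on) and the function name are illustrative assumptions, and only combinations that appear in the rows above are included; any other pair raises KeyError.

```python
NONWRITABLE_FILLS = ("ReadData", "ReadDataShared", "ReadDataShared/Dirty")

OUTCOMES = {}
for cmd in ("RdBlkModx", "ChxToDirty", "InvalToDirty"):
    for fill in NONWRITABLE_FILLS:
        OUTCOMES[(cmd, fill)] = "retry"        # block filled but not writable
    OUTCOMES[(cmd, "ReadDataDirty")] = "writable"

OUTCOMES.update({
    ("RdBlkModx", "ChangeToDirtySuccess"): "illegal",
    ("RdBlkModx", "ChangeToDirtyFail"): "illegal",
    ("RdBlkModx", "ReadDataError"): "invalid, stores fail",
    ("ChxToDirty", "ChangeToDirtySuccess"): "writable",
    ("ChxToDirty", "ChangeToDirtyFail"): "retry",
    ("ChxToDirty", "ReadDataError"): "impossible",
    ("InvalToDirty", "ChangeToDirtySuccess"): "writable",
    ("InvalToDirty", "ReadDataError"): "hard error",
})


def store_outcome(cmd: str, sysdc: str) -> str:
    """Outcome label for a store-type command and the SysDc response to it."""
    return OUTCOMES[(cmd, sysdc)]
```

Note that "retry" for RdBlkModx applies only while the originating store is still active, as the table states.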