MOTOROLA PowerPC 603e Technical data

SA14-2027-00 (IBM Order Number)
MPC603E/D (Motorola Order Number)
1/96
REV 1
Advance Information
PowerPC 603e
RISC Microprocessor
Technical Summary
This document provides an overview of the PowerPC 603e microprocessor features, including a block diagram showing the major functional components. It also provides an overview of the PowerPC architecture specification, and information about how the 603e implementation complies with the architectural definitions.
This document is divided into two parts:
Part 1, “PowerPC 603e Microprocessor Overview,” provides an overview of the 603e features, including a block diagram showing the major functional components.
Part 2, “PowerPC 603e Microprocessor: Implementation,” describes the PowerPC architecture in general, provides specific details about the implementation of the 603e as a low-power, 32-bit member of the PowerPC processor family, and enumerates the differences from the PowerPC 603 microprocessor.
In this document, the term “603e” is used as an abbreviation for the phrase, “PowerPC 603e microprocessor,” and the term “603” is used as an abbreviation for the phrase “PowerPC 603 microprocessor.” The PowerPC 603e microprocessors are available from IBM as PPC603e and from Motorola as MPC603e.
This document contains information on a new product under development by Motorola and IBM. Motorola and IBM reserve the right to change or discontinue this product without notice.
© Motorola Inc., 1996. All rights reserved.
Portions hereof © International Business Machines Corporation, 1991–1996. All rights reserved.
Part 1 PowerPC 603e Microprocessor Overview
This section describes the features of the 603e, provides a block diagram showing the major functional units, and gives an overview of how the 603e operates.
The 603e is a low-power implementation of the PowerPC microprocessor family of reduced instruction set computer (RISC) microprocessors. The 603e implements the 32-bit portion of the PowerPC architecture, which provides 32-bit effective addresses, integer data types of 8, 16, and 32 bits, and floating-point data types of 32 and 64 bits.
The 603e provides four software controllable power-saving modes. Three of the modes (the nap, doze, and sleep modes) are static in nature, and progressively reduce the amount of power dissipated by the processor. The fourth is a dynamic power management mode that causes the functional units in the 603e to automatically enter a low-power mode when the functional units are idle without affecting operational performance, software execution, or any external hardware.
The 603e is a superscalar processor that can issue and retire as many as three instructions per clock. Instructions can execute out of order for increased performance; however, the 603e makes completion appear sequential.
The 603e integrates five execution units—an integer unit (IU), a floating-point unit (FPU), a branch processing unit (BPU), a load/store unit (LSU), and a system register unit (SRU). The ability to execute five instructions in parallel and the use of simple instructions with rapid execution times yield high efficiency and throughput for 603e-based systems. Most integer instructions execute in one clock cycle. The FPU is pipelined so a single-precision multiply-add instruction can be issued and completed every clock cycle.
The 603e provides independent on-chip, 16-Kbyte, four-way set-associative, physically addressed caches for instructions and data and on-chip instruction and data memory management units (MMUs). The MMUs contain 64-entry, two-way set-associative, data and instruction translation lookaside buffers (DTLB and ITLB) that provide support for demand-paged virtual memory address translation and variable-sized block translation. The TLBs and caches use a least recently used (LRU) replacement algorithm. The 603e also supports block address translation through the use of two independent instruction and data block address translation (IBAT and DBAT) arrays of four entries each. Effective addresses are compared simultaneously with all four entries in the BAT array during block translation. In accordance with the PowerPC architecture, if an effective address hits in both the TLB and BAT array, the BAT translation takes priority.
The 603e has a selectable 32- or 64-bit data bus and a 32-bit address bus. The 603e interface protocol allows multiple masters to compete for system resources through a central external arbiter. The 603e provides a three-state coherency protocol that supports the exclusive, modified, and invalid cache states. This protocol is a compatible subset of the MESI (modified/exclusive/shared/invalid) four-state protocol and operates coherently in systems that contain four-state caches. The 603e supports single-beat and burst data transfers for memory accesses, and supports memory-mapped I/O operations.
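The three-state protocol described above can be pictured as a small state table. The following is a simplified sketch for a single write-back cacheable block, not the 603e's full coherency rules: the event names (`cpu_read`, `cpu_write`, `snoop`) and the action strings are hypothetical labels chosen for illustration, and real behavior also depends on the bus transaction type and the page or block memory attributes.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    INVALID = "I"


# (current state, event) -> (next state, illustrative bus action)
# Assumption: every snoop is a global access by another master to this block.
MEI_TABLE = {
    (State.INVALID, "cpu_read"): (State.EXCLUSIVE, "burst read"),
    (State.INVALID, "cpu_write"): (State.MODIFIED, "burst read with intent to modify"),
    (State.EXCLUSIVE, "cpu_read"): (State.EXCLUSIVE, "none"),
    (State.EXCLUSIVE, "cpu_write"): (State.MODIFIED, "none"),
    (State.MODIFIED, "cpu_read"): (State.MODIFIED, "none"),
    (State.MODIFIED, "cpu_write"): (State.MODIFIED, "none"),
    (State.INVALID, "snoop"): (State.INVALID, "none"),
    # With no shared state, a snoop hit gives up the block entirely.
    (State.EXCLUSIVE, "snoop"): (State.INVALID, "none"),
    (State.MODIFIED, "snoop"): (State.INVALID, "retry snoop and push block to memory"),
}


def next_state(state, event):
    """Return (new_state, action) for one cache block in this simplified MEI model."""
    return MEI_TABLE[(state, event)]
```

Because there is no shared state, a block is held by at most one cache at a time, which is why the protocol can coexist with four-state MESI caches on the same bus.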
The 603e is fabricated using an advanced CMOS process technology and is fully compatible with TTL devices. The 603e is implemented in both a 2.5-volt version (PID 0007v PowerPC 603e microprocessor, or PID7v-603e) and a 3.3-volt version (PID 0006 PowerPC 603e microprocessor, or PID6-603e).
1.1 PowerPC 603e Microprocessor Features
This section describes details of the 603e’s implementation of the PowerPC architecture. Major features of the 603e are as follows:
High-performance, superscalar microprocessor
— As many as three instructions issued and retired per clock
— As many as five instructions in execution per clock
— Single-cycle execution for most instructions
— Pipelined FPU for all single-precision and most double-precision operations
Five independent execution units and two register files
— BPU featuring static branch prediction
— A 32-bit IU
— Fully IEEE 754-compliant FPU for both single- and double-precision operations
— LSU for data transfer between data cache and GPRs and FPRs
— SRU that executes condition register (CR), special-purpose register (SPR), and integer add/compare instructions
— Thirty-two GPRs for integer operands
— Thirty-two FPRs for single- or double-precision operands
High instruction and data throughput
— Zero-cycle branch capability (branch folding)
— Programmable static branch prediction on unresolved conditional branches
— Instruction fetch unit capable of fetching two instructions per clock from the instruction cache
— A six-entry instruction queue that provides lookahead capability
— Independent pipelines with feed-forwarding that reduces data dependencies in hardware
— 16-Kbyte data cache—four-way set-associative, physically addressed; LRU replacement algorithm
— 16-Kbyte instruction cache—four-way set-associative, physically addressed; LRU replacement algorithm
— Cache write-back or write-through operation programmable on a per page or per block basis
— BPU that performs CR lookahead operations
— Address translation facilities for 4-Kbyte page size, variable block size, and 256-Mbyte segment size
— A 64-entry, two-way set-associative ITLB
— A 64-entry, two-way set-associative DTLB
— Four-entry data and instruction BAT arrays providing 128-Kbyte to 256-Mbyte blocks
— Software table search operations and updates supported through fast trap mechanism
— 52-bit virtual address; 32-bit physical address
Facilities for enhanced system performance
— A 32- or 64-bit split-transaction external data bus with burst transfers
— Support for one-level address pipelining and out-of-order bus transactions
— Hardware support for misaligned little-endian accesses (PID7v-603e)
Integrated power management
— Low-power 2.5-volt and 3.3-volt design
— Internal processor/bus clock multiplier ratios as follows:
– 1/1, 1.5/1, 2/1, 2.5/1, 3/1, 3.5/1, and 4/1 (PID6-603e)
– 2/1, 2.5/1, 3/1, 3.5/1, 4/1, 4.5/1, 5/1, 5.5/1, and 6/1 (PID7v-603e)
— Three power-saving modes: doze, nap, and sleep
— Automatic dynamic power reduction when internal functional units are idle
In-system testability and debugging features through JTAG boundary-scan capability
1.2 Block Diagram
Figure 1 provides a block diagram of the 603e that illustrates how the execution units—IU, FPU, BPU, LSU, and SRU—operate independently and in parallel.
The 603e provides address translation and protection facilities, including an ITLB, DTLB, and instruction and data BAT arrays. Instruction fetching and issuing are handled in the instruction unit. Translation of addresses for cache or external memory accesses is handled by the MMUs. Both units are discussed in more detail in Sections 1.3, “Instruction Unit,” and 1.5.1, “Memory Management Units (MMUs).”
1.3 Instruction Unit
As shown in Figure 1, the 603e instruction unit, which contains a sequential fetcher, instruction queue, dispatch unit, and BPU, provides centralized control of instruction flow to the execution units. The instruction unit determines the address of the next instruction to be fetched based on information from the sequential fetcher and from the BPU.
The sequential fetcher fetches the instructions from the instruction cache into the instruction queue. The BPU extracts branch instructions from the sequential fetcher and uses static branch prediction on unresolved conditional branches to allow the instruction unit to fetch instructions from a predicted target instruction stream while a conditional branch is evaluated. The BPU folds out branch instructions for unconditional branches or conditional branches unaffected by instructions in progress in the execution pipeline.
Instructions issued beyond a predicted branch do not complete execution until the branch is resolved, preserving the programming model of sequential execution. If any of these instructions are to be executed in the BPU, they are decoded but not issued. Instructions to be executed by the FPU, IU, LSU, and SRU are issued and allowed to complete up to the register write-back stage. Write-back is allowed when a correctly predicted branch is resolved, and instruction execution continues without interruption along the predicted path.
If branch prediction is incorrect, the instruction unit flushes all predicted path instructions, and instructions are issued from the correct path.
[Block diagram not reproduced. Labeled blocks: instruction unit (sequential fetcher, instruction queue, dispatch unit, and branch processing unit with CTR, CR, and LR); system register unit; integer unit with XER; GPR file and GP rename registers; load/store unit; FPR file and FP rename registers; floating-point unit with FPSCR; completion unit; D MMU (SRs, DTLB, DBAT array) with 16-Kbyte D cache and tags; I MMU (SRs, ITLB, IBAT array) with 16-Kbyte I cache and tags; processor bus interface with touch load buffer and copyback buffer; 32-bit address bus and 32-/64-bit data bus; power dissipation control; JTAG/COP interface; time base counter/decrementer; clock multiplier. Most internal data paths are labeled 64 bits wide.]
Figure 1. PowerPC 603e Microprocessor Block Diagram
1.3.1 Instruction Queue and Dispatch Unit
The instruction queue (IQ), shown in Figure 1, holds as many as six instructions and loads up to two instructions from the instruction unit during a single cycle. The instruction fetch unit continuously loads as many instructions as space in the IQ allows. Instructions are dispatched to their respective execution units from the dispatch unit at a maximum rate of two instructions per cycle. Dispatching is facilitated to the IU, FPU, LSU, and SRU by the provision of a reservation station at each unit. The dispatch unit checks for source and destination register dependencies, determines if dispatch serialization is required, and inhibits subsequent instruction dispatching as required.
For a more detailed overview of instruction dispatch, see Section 2.7, “Instruction Timing.”
1.3.2 Branch Processing Unit (BPU)
The BPU receives branch instructions from the fetch unit and performs CR lookahead operations on conditional branches to resolve them early, achieving the effect of a zero-cycle branch in many cases.
The BPU uses a bit in the instruction encoding to predict the direction of the conditional branch. Therefore, when an unresolved conditional branch instruction is encountered, the 603e fetches instructions from the predicted target stream until the conditional branch is resolved.
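As an illustration of predicting from the instruction encoding, the sketch below decodes a `bc` (Branch Conditional) instruction word. The details are assumptions drawn from the general PowerPC architecture definition rather than from this summary: the prediction bit is taken to be the low-order bit of the BO field (often called the y bit), and with y clear a branch with a negative displacement is predicted taken while a forward branch is predicted not taken. The helper names are hypothetical.

```python
def encode_bc(bo, bi, displacement):
    """Build a bc instruction word (AA=0, LK=0) from BO, BI and a byte displacement."""
    if displacement % 4:
        raise ValueError("displacement must be a multiple of 4")
    bd = (displacement >> 2) & 0x3FFF          # 14-bit signed word displacement
    return (16 << 26) | ((bo & 0x1F) << 21) | ((bi & 0x1F) << 16) | (bd << 2)


def predict_bc_taken(instr):
    """Static prediction for a 32-bit bc instruction word (primary opcode 16).

    Assumes a BO encoding that actually tests a condition or the CTR;
    "branch always" encodings need no prediction.
    """
    if (instr >> 26) & 0x3F != 16:
        raise ValueError("not a bc instruction")
    bo = (instr >> 21) & 0x1F
    y = bo & 1                                  # assumed prediction (hint) bit
    bd = (instr >> 2) & 0x3FFF
    backward = bool(bd & 0x2000)                # sign bit of the displacement
    # Default: backward taken, forward not taken; y = 1 reverses the default.
    return backward != bool(y)
```

For example, `predict_bc_taken(encode_bc(0b01100, 0, -4))` models a loop-closing branch, which this scheme predicts as taken unless the compiler sets the hint bit.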
The BPU contains an adder to compute branch target addresses and three user-control registers—the link register (LR), the count register (CTR), and the CR. The BPU calculates the return pointer for subroutine calls and saves it into the LR for certain types of branch instructions. The LR also contains the branch target address for the Branch Conditional to Link Register (bclrx) instruction. The CTR contains the branch target address for the Branch Conditional to Count Register (bcctrx) instruction. The contents of the LR and CTR can be copied to or from any GPR. Because the BPU uses dedicated registers rather than GPRs or FPRs, execution of branch instructions is largely independent from execution of integer and floating-point instructions.
1.4 Independent Execution Units
The PowerPC architecture’s support for independent execution units allows implementation of processors with out-of-order instruction execution. For example, because branch instructions do not depend on GPRs or FPRs, branches can often be resolved early, eliminating stalls caused by taken branches.
In addition to the BPU, the 603e provides four other execution units and a completion unit, which are described in the following sections.
1.4.1 Integer Unit (IU)
The IU can execute all integer instructions. The IU executes one integer instruction at a time, performing computations with its arithmetic logic unit (ALU) and XER register. Most integer instructions are single-cycle instructions. Thirty-two general-purpose registers are provided to support integer operations. Stalls due to contention for GPRs are minimized by automatic allocation of the five rename registers. The 603e writes the contents of the rename registers to the appropriate GPR when integer instructions are retired by the completion unit.
1.4.2 Floating-Point Unit (FPU)
The FPU contains a single-precision multiply-add array and the floating-point status and control register (FPSCR). The multiply-add array allows the 603e to efficiently implement multiply and multiply-add operations. The FPU is pipelined so that one single- or double-precision instruction can be issued per clock cycle. Thirty-two 64-bit floating-point registers are provided to support floating-point operations. Stalls due to contention for FPRs are minimized by automatic allocation of the four rename registers. The 603e writes the contents of the rename registers to the appropriate FPR when floating-point instructions are retired by the completion unit.
The 603e supports all IEEE 754 floating-point data types (normalized, denormalized, NaN, zero, and infinity) in hardware, eliminating the latency incurred by software exception routines. (Note that exception is also referred to as interrupt in the architecture specification.)
1.4.3 Load/Store Unit (LSU)
The LSU executes all load and store instructions and provides the data transfer interface between the GPRs, FPRs, and the cache/memory subsystem. The LSU calculates effective addresses, performs data alignment, and provides sequencing for load/store string and multiple instructions.
Load and store instructions are issued and translated in program order; however, the actual memory accesses can occur out of order. Synchronizing instructions are provided to enforce strict ordering.
Cacheable loads, when free of data dependencies, execute in a speculative manner with a maximum throughput of one per cycle and a two-cycle total latency. Data returned from the cache is held in a rename register until the completion logic commits the value to a GPR or FPR. Stores cannot be executed out of order and are held in the store queue until the completion logic signals that the store operation is to be completed to memory. The 603e executes store instructions with a maximum throughput of one per cycle and a three-cycle total latency. The time required to perform the actual load or store operation varies depending on the processor/bus clock ratio, and whether the operation involves the cache, system memory, or an I/O device.
1.4.4 System Register Unit (SRU)
The SRU executes various system-level instructions, including condition register logical operations and move to/from special-purpose register instructions, and also executes integer add/compare instructions. In order to maintain system state, most instructions executed by the SRU are completion-serialized; that is, the instruction is held for execution in the SRU until all prior instructions issued have completed. Results from completion-serialized instructions executed by the SRU are not available or forwarded for subsequent instructions until the instruction completes.
1.4.5 Completion Unit
The completion unit tracks instructions from dispatch through execution, and then retires, or “completes,” them in program order. Completing an instruction commits the 603e to any architectural register changes caused by that instruction. In-order completion ensures the correct architectural state when the 603e must recover from a mispredicted branch or any exception.
Instruction state and other information required for completion is kept in a first-in-first-out (FIFO) queue of five completion buffers. A single completion buffer entry is allocated for each instruction once it enters the dispatch unit. A completion buffer entry is required for instruction dispatch; otherwise, instruction dispatch stalls. A maximum of two instructions per cycle are completed in order from the queue.
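The interaction between dispatch, the five completion buffers, and in-order retirement can be sketched with a toy cycle model. This is only an illustration under stated assumptions: each instruction is given a fixed execution latency, an instruction dispatched in cycle c with latency n may retire no earlier than cycle c + n + 1, and register dependencies, reservation stations, and dispatch serialization are ignored. The function and constant names are hypothetical.

```python
from collections import deque

COMPLETION_BUFFERS = 5   # FIFO depth stated in the text
MAX_DISPATCH = 2         # instructions dispatched per cycle
MAX_RETIRE = 2           # instructions completed per cycle


def simulate(latencies):
    """Return the retirement cycle of each instruction, given latencies in program order."""
    pending = deque(enumerate(latencies))
    queue = deque()              # (instruction index, cycle in which execution finishes)
    retired = {}
    cycle = 0
    while pending or queue:
        cycle += 1
        # Retire from the head of the FIFO only, so completion stays in program order.
        done = 0
        while queue and done < MAX_RETIRE and queue[0][1] < cycle:
            index, _ = queue.popleft()
            retired[index] = cycle
            done += 1
        # Dispatch stalls when no completion buffer entry is free.
        sent = 0
        while pending and sent < MAX_DISPATCH and len(queue) < COMPLETION_BUFFERS:
            index, latency = pending.popleft()
            queue.append((index, cycle + latency))
            sent += 1
    return [retired[i] for i in range(len(latencies))]
```

Running `simulate([10, 1, 1, 1, 1, 1, 1])` shows the effect of a long-latency instruction at the head of the queue: the short instructions behind it finish executing early but cannot retire, and dispatch stops once all five buffers are occupied.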
1.5 Memory Subsystem Support
The 603e provides support for cache and memory management through dual instruction and data memory management units. The 603e also provides dual 16-Kbyte instruction and data caches, and an efficient processor bus interface for access into main memory and other bus subsystems. The memory subsystem support functions are described in the following subsections.
1.5.1 Memory Management Units (MMUs)
The 603e’s MMUs support up to 4 Petabytes (2^52) of virtual memory and 4 Gigabytes (2^32) of physical memory (referred to as real memory in the architecture specification) for instructions and data. The MMUs also control access privileges for these spaces on block and page granularities. Referenced and changed status is maintained by the processor for each page to assist implementation of a demand-paged virtual memory system. A key bit is implemented to provide information about memory protection violations prior to page table search operations.
The LSU calculates effective addresses for data loads and stores, performs data alignment to and from cache memory, and provides the sequencing for load and store string and multiple word instructions. The instruction unit calculates the effective addresses for instruction fetching.
The higher-order bits of the effective address are translated by the appropriate MMU into physical address bits. Simultaneously, the lower-order address bits (which are untranslated and therefore considered both logical and physical) are directed to the on-chip caches, where they form the index into the four-way set-associative tag array. After translating the address, the MMU passes the higher-order bits of the physical address to the cache and the cache lookup completes. For caching-inhibited accesses or accesses that miss in the cache, the untranslated lower-order address bits are concatenated with the translated higher-order address bits; the resulting 32-bit physical address is used by the memory unit and the system interface, which accesses external memory.
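The 52-bit virtual address follows from the sizes quoted in this summary: a 256-Mbyte segment leaves 28 bits of the 32-bit effective address for the page index and byte offset, so the remaining 24 bits come from a virtual segment ID (VSID). A minimal sketch of that composition follows; the function name and the `segment_vsids` list (standing in for the 16 segment registers) are hypothetical.

```python
def virtual_address(ea, segment_vsids):
    """Combine a 32-bit effective address with a 24-bit VSID into a 52-bit virtual address."""
    sr = (ea >> 28) & 0xF              # upper 4 bits select one of 16 segment registers
    page_index = (ea >> 12) & 0xFFFF   # 16-bit page index within the 256-Mbyte segment
    offset = ea & 0xFFF                # 12-bit byte offset within the 4-Kbyte page
    vsid = segment_vsids[sr] & 0xFFFFFF
    return (vsid << 28) | (page_index << 12) | offset


# Usage: segment register 0xC holds an arbitrary example VSID.
vsids = [0] * 16
vsids[0xC] = 0xABCDEF
example_va = virtual_address(0xC0012345, vsids)
```

Only the VSID and page index take part in translation; the 12-bit offset passes through unchanged, which is what lets the cache lookup start in parallel with the MMU.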
The MMU also directs the address translation and enforces the protection hierarchy programmed by the operating system in relation to the supervisor/user privilege level of the access and in relation to whether the access is a load or store.
For instruction accesses, the MMU performs an address lookup in both the 64 entries of the ITLB, and in the IBAT array. If an effective address hits in both the ITLB and the IBAT array, the IBAT array translation takes priority. Data accesses cause a lookup in the DTLB and DBAT array for the physical address translation. In most cases, the physical address translation resides in one of the TLBs and the physical address bits are readily available to the on-chip cache.
When the physical address translation misses in the TLBs, the 603e provides hardware assistance for software to perform a search of the translation tables in memory. The hardware assist consists of the following features:
Automatic storage of the missed effective address in the IMISS and DMISS registers
Automatic generation of the primary and secondary hashed real address of the page table entry group (PTEG), which are readable from the HASH1 and HASH2 register locations. The HASH data is generated from the contents of the IMISS or DMISS register. Which register is selected depends on which miss (instruction or data) was last acknowledged.
Automatic generation of the first word of the page table entry (PTE) for which the tables are being searched
A real page address (RPA) register that matches the format of the lower word of the PTE
Two TLB access instructions (tlbli and tlbld) that are used to load an address translation into the instruction or data TLBs
Shadow registers for GPR0–GPR3 that allow miss code to execute without corrupting the state of any of the existing GPRs. These shadow registers are only used for servicing a TLB miss.
See Section 2.6.2, “PowerPC 603e Microprocessor Memory Management,” for more information about memory management for the 603e.
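To make the primary and secondary hashed PTEG addresses concrete, the sketch below follows the hashing scheme of the 32-bit PowerPC architecture as generally documented. It is not taken from this summary and should be checked against the user's manual. The assumptions are: the primary hash is the low-order 19 bits of the VSID XORed with the 16-bit page index, the secondary hash is its ones' complement, SDR1 holds the table origin in its upper 16 bits and a 9-bit mask in its low bits, and each PTEG is 64 bytes. Function names are hypothetical.

```python
HASH_MASK = 0x7FFFF  # 19-bit hash value


def pteg_hashes(vsid, page_index):
    """Return (primary, secondary) 19-bit hash values for a VSID and page index."""
    primary = (vsid & HASH_MASK) ^ (page_index & 0xFFFF)
    secondary = ~primary & HASH_MASK
    return primary, secondary


def pteg_address(sdr1, hash_value):
    """Form the physical address of a 64-byte PTEG from SDR1 and a 19-bit hash."""
    htaborg = sdr1 & 0xFFFF0000          # page table origin
    htabmask = sdr1 & 0x1FF              # enables extra hash bits for larger tables
    upper = ((hash_value >> 10) & htabmask) << 16
    lower = (hash_value & 0x3FF) << 6    # 10 hash bits select a PTEG in the minimum-size table
    return htaborg | upper | lower
```

In this model, `pteg_address(sdr1, primary)` and `pteg_address(sdr1, secondary)` correspond to the values a miss handler would expect to read from HASH1 and HASH2 before searching the group for a matching PTE.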
1.5.2 Cache Units
The 603e provides independent 16-Kbyte, four-way set-associative instruction and data caches. The cache block is 32 bytes long. The caches adhere to a write-back policy, but the PowerPC architecture allows control of cacheability, write policy, and memory coherency at the page and block levels. The caches use a least recently used (LRU) replacement policy.
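The cache geometry follows directly from these figures: 16 Kbytes divided by four ways and 32-byte blocks gives 128 sets. The sketch below splits an address into tag, set index, and byte offset under that geometry; the names are illustrative only.

```python
CACHE_BYTES = 16 * 1024
WAYS = 4
BLOCK_BYTES = 32
SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)   # number of sets


def split_address(addr):
    """Split a 32-bit address into (tag, set index, byte offset within the block)."""
    offset = addr & (BLOCK_BYTES - 1)         # low 5 bits
    index = (addr // BLOCK_BYTES) % SETS      # next 7 bits
    tag = addr // (BLOCK_BYTES * SETS)        # remaining upper 20 bits
    return tag, index, offset
```

The offset and index together occupy the low 12 address bits, which equals the 4-Kbyte page offset. That is why the set can be selected from untranslated bits while the MMU is still producing the physical page number used for the tag compare.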
As shown in Figure 1, the caches provide a 64-bit interface to the instruction fetch unit and load/store unit. The surrounding logic selects, organizes, and forwards the requested information to the requesting unit. Write operations to the cache can be performed on a byte basis, and a complete read-modify-write operation to the cache can occur in each cycle.
The load/store and instruction fetch units provide the caches with the address of the data or instruction to be fetched. In the case of a cache hit, the cache returns two words to the requesting unit.
Since the 603e data cache tags are single ported, simultaneous load or store and snoop accesses cause resource contention. Snoop accesses have the highest priority and are given first access to the tags, unless the snoop access coincides with a tag write, in which case the snoop is retried and must re-arbitrate for access to the cache. Loads or stores that are deferred due to snoop accesses are executed on the clock cycle following the snoop.
1.6 Processor Bus Interface
Memory accesses can occur in single-beat (1–8 bytes) and four-beat burst (32 bytes) data transfers when the bus is configured as 64 bits, and in single-beat (1–4 bytes), two-beat (8 bytes), and eight-beat (32 bytes) data transfers when the bus is configured as 32 bits. The address and data buses operate independently to support pipelining and split transactions during memory accesses. The 603e can pipeline its bus transactions to a depth of one level.
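The beat counts above are simply the transfer size divided by the bus width, rounded up. A minimal sketch with a hypothetical helper name:

```python
def beats(transfer_bytes, bus_bits):
    """Number of data beats needed to move transfer_bytes over a bus of bus_bits width."""
    if bus_bits not in (32, 64):
        raise ValueError("the 603e data bus is configured as 32 or 64 bits")
    bus_bytes = bus_bits // 8
    return -(-transfer_bytes // bus_bytes)    # ceiling division


# A 32-byte cache block: four beats on a 64-bit bus, eight beats on a 32-bit bus.
burst_64 = beats(32, 64)
burst_32 = beats(32, 32)
```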
Because the caches on the 603e are on-chip, write-back caches, the predominant type of transaction for most applications is burst-read memory operations, followed by burst-write memory operations, and single-beat (noncacheable or write-through) memory read and write operations. Additionally, there can be address-only operations, variants of the burst and single-beat operations (for example, global memory operations that are snooped and atomic memory operations), and address retry activity (for example, when a snooped read access hits a modified line in the cache).
Access to the system interface is granted through an external arbitration mechanism that allows devices to compete for bus mastership. This arbitration mechanism is flexible, allowing the 603e to be integrated into systems that implement various fairness and bus parking procedures to avoid arbitration overhead.
Typically, memory accesses are weakly ordered—sequences of operations, including load/store string and multiple instructions, do not necessarily complete in the order they begin—maximizing the efficiency of the bus without sacrificing coherency of the data. The 603e allows read operations to precede store operations (except when a dependency exists, or in cases where a non-cacheable access is performed), and provides support for a write operation to precede a previously queued read data tenure (for example, allowing a snoop push to be enveloped by the address and data tenures of a read operation). Because the processor can dynamically optimize run-time ordering of load/store traffic, overall performance is improved.
1.7 System Support Functions
The 603e implements several support functions that include power management, time base/decrementer registers for system timing tasks, an IEEE 1149.1 (JTAG)/common on-chip processor (COP) test interface, and a phase-locked loop (PLL) clock multiplier. These system support functions are described in the following subsections.
1.7.1 Power Management
The 603e provides four power modes, selectable by setting the appropriate control bits in the machine state register (MSR) and hardware implementation register 0 (HID0). The four power modes are as follows:
Full-power–This is the default power state of the 603e. The 603e is fully powered and the internal functional units are operating at the full processor clock speed. If the dynamic power management mode is enabled, functional units that are idle will automatically enter a low-power state without affecting performance, software execution, or external hardware.
Doze–All the functional units of the 603e are disabled except for the time base/decrementer registers and the bus snooping logic. When the processor is in doze mode, an external asynchronous interrupt, a system management interrupt, a decrementer exception, a hard or soft reset, or machine check brings the 603e into the full-power state. The 603e in doze mode maintains the PLL in a fully powered state and locked to the system external clock input (SYSCLK) so a transition to the full-power state takes only a few processor clock cycles.
Nap–The nap mode further reduces power consumption by disabling bus snooping, leaving only the time base register and the PLL in a powered state. The 603e returns to the full-power state upon receipt of an external asynchronous interrupt, a system management interrupt, a decrementer exception, a hard or soft reset, or a machine check input (MCP). A return to full-power state from a nap state takes only a few processor clock cycles.
Sleep–Sleep mode reduces power consumption to a minimum by disabling all internal functional units, after which external system logic may disable the PLL and SYSCLK. Returning the 603e to the full-power state requires the enabling of the PLL and SYSCLK, followed by the assertion of an external asynchronous interrupt, a system management interrupt, a hard or soft reset, or a machine check input (MCP) signal after the time required to relock the PLL.
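The wake-up conditions listed for the three static modes can be collected into a small lookup table, which may be convenient when reasoning about which mode a system can afford to enter. The identifiers below are hypothetical labels for the events named in the text, not register or signal names.

```python
COMMON_WAKE_EVENTS = {
    "external_interrupt",
    "system_management_interrupt",
    "hard_reset",
    "soft_reset",
    "machine_check",
}

# Events that return the processor to the full-power state, per mode.
WAKE_EVENTS = {
    "doze": COMMON_WAKE_EVENTS | {"decrementer_exception"},
    "nap": COMMON_WAKE_EVENTS | {"decrementer_exception"},
    "sleep": COMMON_WAKE_EVENTS,   # the decrementer is not listed as a wake source
}

# Whether bus snooping stays active in each mode, as described above.
SNOOPING_ACTIVE = {"full-power": True, "doze": True, "nap": False, "sleep": False}


def wakes(mode, event):
    """Return True if the event brings the processor out of the given power-saving mode."""
    return event in WAKE_EVENTS[mode]
```

One practical reading of the table: a system that relies on a periodic decrementer tick to wake up can use doze or nap, while sleep requires an external event and, if the PLL was disabled, time to relock it.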
1.7.2 Time Base/Decrementer
The time base is a 64-bit register (accessed as two 32-bit registers) that is incremented once every four bus clock cycles; external control of the time base is provided through the time base enable (TBEN) signal. The decrementer is a 32-bit register that can generate a maskable decrementer exception after a programmable delay. The contents of the decrementer register are decremented once every four bus clock cycles, and the decrementer exception is generated as the count passes through zero.
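Since both registers tick once every four bus clocks, converting between time and counts is a matter of dividing the bus frequency by four. The sketch below shows the arithmetic; the 66-MHz bus frequency is only an example value, not a figure from this document, and the function names are hypothetical.

```python
TICKS_PER_BUS_CLOCK = 1 / 4   # one time base/decrementer tick per four bus clocks


def timebase_hz(bus_hz):
    """Tick rate of the time base and decrementer for a given bus clock."""
    return bus_hz * TICKS_PER_BUS_CLOCK


def decrementer_count(delay_s, bus_hz):
    """Value to load into the 32-bit decrementer for a delay in seconds."""
    count = round(delay_s * timebase_hz(bus_hz))
    if not 0 <= count < 2 ** 32:
        raise ValueError("delay does not fit in the 32-bit decrementer")
    return count


def max_decrementer_delay_s(bus_hz):
    """Longest delay a single decrementer load can cover."""
    return 2 ** 32 / timebase_hz(bus_hz)


EXAMPLE_BUS_HZ = 66_000_000   # assumed example bus clock
one_ms = decrementer_count(0.001, EXAMPLE_BUS_HZ)
```

Longer intervals than `max_decrementer_delay_s` would have to be built in software, for instance by counting decrementer exceptions or by comparing 64-bit time base readings.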
1.7.3 IEEE 1149.1 (JTAG)/COP Test Interface
The 603e provides IEEE 1149.1 and COP functions for facilitating board testing and chip debug. The IEEE 1149.1 test interface provides a means for boundary-scan testing the 603e and the board to which it is attached. The COP function shares the IEEE 1149.1 test port, provides a means for executing test routines, and facilitates chip and software debugging.
1.7.4 Clock Multiplier
The internal clocking of the 603e is generated from and synchronized to the external clock signal, SYSCLK, by means of a voltage-controlled oscillator-based PLL. The PLL provides programmable internal processor clock rates of 1x, 1.5x, 2x, 2.5x, 3x, 3.5x, and 4x multiples of the externally supplied clock frequency for the PID6-603e, and multiples of 2x, 2.5x, 3x, 3.5x, 4x, 4.5x, 5x, 5.5x, and 6x of the externally provided clock for the PID7v-603e. The bus clock is the same frequency and is synchronous with SYSCLK. The configuration of the PLL can be read by software from hardware implementation register 1 (HID1).
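The relationship between SYSCLK and the processor clock can be sketched as a lookup of the multipliers listed above. This is illustrative only: it checks the multiplier against the list for each version but does not model the permitted frequency ranges, which this summary does not give. The names and the 33-MHz example are assumptions.

```python
PLL_MULTIPLIERS = {
    "PID6-603e": (1, 1.5, 2, 2.5, 3, 3.5, 4),
    "PID7v-603e": (2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6),
}


def core_clock_hz(sysclk_hz, multiplier, part):
    """Processor clock for a given SYSCLK and PLL multiplier on the named part."""
    if multiplier not in PLL_MULTIPLIERS[part]:
        raise ValueError(f"{multiplier}x is not listed for {part}")
    return sysclk_hz * multiplier


# Example: an assumed 33-MHz SYSCLK with a 3x multiplier.
example_core_hz = core_clock_hz(33_000_000, 3, "PID6-603e")
```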