This document provides an overview of the PowerPC 603e microprocessor features,
including a block diagram showing the major functional components. It also provides an
overview of the PowerPC architecture specification, and information about how the 603e
implementation complies with the architectural definitions.
This document is divided into two parts:
•Part 1, “PowerPC 603e Microprocessor Overview,” provides an overview of the
603e features, including a block diagram showing the major functional
components.
•Part 2, “PowerPC 603e Microprocessor: Implementation,” describes the PowerPC
architecture in general, as well as providing specific details about the
implementation of the 603e as a low-power, 32-bit member of the PowerPC
processor family, and an enumeration of the differences from the PowerPC 603
microprocessor.
In this document, the term “603e” is used as an abbreviation for the phrase, “PowerPC 603e
microprocessor,” and the term “603” is used as an abbreviation for the phrase “PowerPC
603 microprocessor.” The PowerPC 603e microprocessors are available from IBM as
PPC603e and from Motorola as MPC603e.
The PowerPC name, the PowerPC logotype, PowerPC 601, PowerPC 603, and PowerPC 603e are trademarks
of International Business Machines Corporation, used by Motorola under license from International
Business Machines Corporation.
This document contains information on a new product under development by Motorola and IBM. Motorola and IBM reserve the right to
change or discontinue this product without notice.
© Motorola Inc., 1996. All rights reserved.
Portions hereof © International Business Machines Corporation, 1991–1996. All rights reserved.
603e Technical Summary
Part 1 PowerPC 603e Microprocessor Overview
This section describes the features of the 603e, provides a block diagram showing the major functional units,
and gives an overview of how the 603e operates.
The 603e is a low-power implementation of the PowerPC microprocessor family of reduced instruction set
computer (RISC) microprocessors. The 603e implements the 32-bit portion of the PowerPC architecture,
which provides 32-bit effective addresses, integer data types of 8, 16, and 32 bits, and floating-point data
types of 32 and 64 bits.
The 603e provides four software-controllable power-saving modes. Three of the modes (the nap, doze, and
sleep modes) are static in nature, and progressively reduce the amount of power dissipated by the processor.
The fourth is a dynamic power management mode that causes the functional units in the 603e to
automatically enter a low-power mode when the functional units are idle without affecting operational
performance, software execution, or any external hardware.
The 603e is a superscalar processor that can issue and retire as many as three instructions per clock.
Instructions can execute out of order for increased performance; however, the 603e makes completion
appear sequential.
The 603e integrates five execution units—an integer unit (IU), a floating-point unit (FPU), a branch
processing unit (BPU), a load/store unit (LSU), and a system register unit (SRU). The ability to execute five
instructions in parallel and the use of simple instructions with rapid execution times yield high efficiency
and throughput for 603e-based systems. Most integer instructions execute in one clock cycle. The FPU is
pipelined so a single-precision multiply-add instruction can be issued and completed every clock cycle.
The 603e provides independent on-chip, 16-Kbyte, four-way set-associative, physically addressed caches
for instructions and data and on-chip instruction and data memory management units (MMUs). The MMUs
contain 64-entry, two-way set-associative, data and instruction translation lookaside buffers (DTLB and
ITLB) that provide support for demand-paged virtual memory address translation and variable-sized block
translation. The TLBs and caches use a least recently used (LRU) replacement algorithm. The 603e also
supports block address translation through the use of two independent instruction and data block address
translation (IBAT and DBAT) arrays of four entries each. Effective addresses are compared simultaneously
with all four entries in the BAT array during block translation. In accordance with the PowerPC architecture,
if an effective address hits in both the TLB and BAT array, the BAT translation takes priority.
The 603e has a selectable 32- or 64-bit data bus and a 32-bit address bus. The 603e interface protocol allows
multiple masters to compete for system resources through a central external arbiter. The 603e provides a
three-state coherency protocol that supports the exclusive, modified, and invalid cache states. This protocol
is a compatible subset of the MESI (modified/exclusive/shared/invalid) four-state protocol and operates
coherently in systems that contain four-state caches. The 603e supports single-beat and burst data transfers
for memory accesses, and supports memory-mapped I/O operations.
The 603e is fabricated using an advanced CMOS process technology and is fully compatible with TTL
devices. The 603e is implemented in both a 2.5-volt version (PID 0007v PowerPC 603e microprocessor, or
PID7v-603e) and a 3.3-volt version (PID 0006 PowerPC 603e microprocessor, or PID6-603e).
This section describes details of the 603e’s implementation of the PowerPC architecture. Major features of
the 603e are as follows:
•High-performance, superscalar microprocessor
— As many as three instructions issued and retired per clock
— As many as five instructions in execution per clock
— Single-cycle execution for most instructions
— Pipelined FPU for all single-precision and most double-precision operations
•Five independent execution units and two register files
— BPU featuring static branch prediction
— A 32-bit IU
— Fully IEEE 754-compliant FPU for both single- and double-precision operations
— LSU for data transfer between data cache and GPRs and FPRs
— SRU that executes condition register (CR), special-purpose register (SPR), and integer add/compare
instructions
— Thirty-two GPRs for integer operands
— Thirty-two FPRs for single- or double-precision operands
•High instruction and data throughput
— Zero-cycle branch capability (branch folding)
— Programmable static branch prediction on unresolved conditional branches
— Instruction fetch unit capable of fetching two instructions per clock from the instruction cache
— A six-entry instruction queue that provides lookahead capability
— Independent pipelines with feed-forwarding that reduces data dependencies in hardware
— 16-Kbyte data cache—four-way set-associative, physically addressed; LRU replacement
algorithm
— Cache write-back or write-through operation programmable on a per page or per block basis
— BPU that performs CR lookahead operations
— Address translation facilities for 4-Kbyte page size, variable block size, and 256-Mbyte
segment size
— A 64-entry, two-way set-associative ITLB
— A 64-entry, two-way set-associative DTLB
— Four-entry data and instruction BAT arrays providing 128-Kbyte to 256-Mbyte blocks
— Software table search operations and updates supported through fast trap mechanism
— 52-bit virtual address; 32-bit physical address
•Facilities for enhanced system performance
— A 32- or 64-bit split-transaction external data bus with burst transfers
— Support for one-level address pipelining and out-of-order bus transactions
— Hardware support for misaligned little-endian accesses (PID7v-603e)
— Three power-saving modes: doze, nap, and sleep
— Automatic dynamic power reduction when internal functional units are idle
•In-system testability and debugging features through JTAG boundary-scan capability
1.2 Block Diagram
Figure 1 provides a block diagram of the 603e that illustrates how the execution units—IU, FPU, BPU,
LSU, and SRU—operate independently and in parallel.
The 603e provides address translation and protection facilities, including an ITLB, DTLB, and instruction
and data BAT arrays. Instruction fetching and issuing is handled in the instruction unit. Translation of
addresses for cache or external memory accesses is handled by the MMUs. Both units are discussed in
more detail in Sections 1.3, “Instruction Unit,” and 1.5.1, “Memory Management Units (MMUs).”
1.3 Instruction Unit
As shown in Figure 1, the 603e instruction unit, which contains a sequential fetcher, instruction queue,
dispatch unit, and BPU, provides centralized control of instruction flow to the execution units. The
instruction unit determines the address of the next instruction to be fetched based on information from the
sequential fetcher and from the BPU.
The sequential fetcher fetches the instructions from the instruction cache into the instruction queue. The
BPU extracts branch instructions from the sequential fetcher and uses static branch prediction on unresolved
conditional branches to allow the instruction unit to fetch instructions from a predicted target instruction
stream while a conditional branch is evaluated. The BPU folds out branch instructions for unconditional
branches or conditional branches unaffected by instructions in progress in the execution pipeline.
Instructions issued beyond a predicted branch do not complete execution until the branch is resolved,
preserving the programming model of sequential execution. If any of these instructions are to be executed
in the BPU, they are decoded but not issued. Instructions to be executed by the FPU, IU, LSU, and SRU are
issued and allowed to complete up to the register write-back stage. Write-back is allowed when a correctly
predicted branch is resolved, and instruction execution continues without interruption along the predicted
path.
If branch prediction is incorrect, the instruction unit flushes all predicted path instructions, and instructions
are issued from the correct path.
The instruction queue (IQ), shown in Figure 1, holds as many as six instructions and loads up to two
instructions from the instruction unit during a single cycle. The instruction fetch unit continuously loads as
many instructions as space in the IQ allows. Instructions are dispatched to their respective execution units
from the dispatch unit at a maximum rate of two instructions per cycle. A reservation station at each of the
IU, FPU, LSU, and SRU facilitates dispatching to those units. The dispatch unit checks for
source and destination register dependencies, determines if dispatch serialization is required, and inhibits
subsequent instruction dispatching as required.
For a more detailed overview of instruction dispatch, see Section 2.7, “Instruction Timing.”
1.3.2 Branch Processing Unit (BPU)
The BPU receives branch instructions from the fetch unit and performs CR lookahead operations on
conditional branches to resolve them early, achieving the effect of a zero-cycle branch in many cases.
The BPU uses a bit in the instruction encoding to predict the direction of the conditional branch. Therefore,
when an unresolved conditional branch instruction is encountered, the 603e fetches instructions from the
predicted target stream until the conditional branch is resolved.
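As a rough illustration of static prediction, the sketch below follows the convention that the PowerPC architecture defines for the PC-relative conditional branch (bc); this summary does not spell the convention out, so treat it as an assumption. The hint is the low-order "y" bit of the BO field: with y = 0, a backward branch is predicted taken and a forward branch is predicted not taken, and y = 1 reverses that default. The function name and arguments are hypothetical.

```python
def predict_taken(displacement: int, y_bit: int) -> bool:
    """Static prediction for a PC-relative conditional branch.

    Assumption (PowerPC architecture convention, not stated in this summary):
    y = 0 predicts backward branches (negative displacement) taken and
    forward branches not taken; y = 1 reverses the default prediction.
    """
    default_taken = displacement < 0      # backward branch, typically a loop
    return default_taken != bool(y_bit)   # the y bit flips the default


# A loop-closing branch with the default hint is predicted taken.
loop_branch_prediction = predict_taken(displacement=-16, y_bit=0)
```

A compiler that knows a forward branch is usually taken could set y = 1 so that fetching continues from the target stream while the condition is still unresolved.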
The BPU contains an adder to compute branch target addresses and three user-control registers—the link
register (LR), the count register (CTR), and the CR. The BPU calculates the return pointer for subroutine
calls and saves it into the LR for certain types of branch instructions. The LR also contains the branch target
address for the Branch Conditional to Link Register (bclrx) instruction. The CTR contains the branch target
address for the Branch Conditional to Count Register (bcctrx) instruction. The contents of the LR and CTR
can be copied to or from any GPR. Because the BPU uses dedicated registers rather than GPRs or FPRs,
execution of branch instructions is largely independent from execution of integer and floating-point
instructions.
1.4 Independent Execution Units
The PowerPC architecture’s support for independent execution units allows implementation of processors
with out-of-order instruction execution. For example, because branch instructions do not depend on GPRs
or FPRs, branches can often be resolved early, eliminating stalls caused by taken branches.
In addition to the BPU, the 603e provides four other execution units and a completion unit, which are
described in the following sections.
1.4.1 Integer Unit (IU)
The IU can execute all integer instructions. The IU executes one integer instruction at a time, performing
computations with its arithmetic logic unit (ALU) and XER register. Most integer instructions are single-cycle
instructions. Thirty-two general-purpose registers are provided to support integer operations. Stalls
due to contention for GPRs are minimized by automatic allocation of the five rename registers. The 603e
writes the contents of the rename registers to the appropriate GPR when integer instructions are retired by
the completion unit.
1.4.2 Floating-Point Unit (FPU)
The FPU contains a single-precision multiply-add array and the floating-point status and control register
(FPSCR). The multiply-add array allows the 603e to efficiently implement multiply and multiply-add
operations. The FPU is pipelined so that one single- or double-precision instruction can be issued per clock
cycle. Thirty-two 64-bit floating-point registers are provided to support floating-point operations. Stalls due
to contention for FPRs are minimized by automatic allocation of the four rename registers. The 603e writes the
contents of the rename registers to the appropriate FPR when floating-point instructions are retired by the
completion unit.
The 603e supports all IEEE 754 floating-point data types (normalized, denormalized, NaN, zero, and
infinity) in hardware, eliminating the latency incurred by software exception routines. (Note that exception
is also referred to as interrupt in the architecture specification.)
1.4.3 Load/Store Unit (LSU)
The LSU executes all load and store instructions and provides the data transfer interface between the GPRs,
FPRs, and the cache/memory subsystem. The LSU calculates effective addresses, performs data alignment,
and provides sequencing for load/store string and multiple instructions.
Load and store instructions are issued and translated in program order; however, the actual memory accesses
can occur out of order. Synchronizing instructions are provided to enforce strict ordering.
Cacheable loads, when free of data dependencies, execute in a speculative manner with a maximum
throughput of one per cycle and a two-cycle total latency. Data returned from the cache is held in a rename
register until the completion logic commits the value to a GPR or FPR. Stores cannot be executed out of
order and are held in the store queue until the completion logic signals that the store operation is to be
completed to memory. The 603e executes store instructions with a maximum throughput of one per cycle
and a three-cycle total latency. The time required to perform the actual load or store operation varies
depending on the processor/bus clock ratio, and whether the operation involves the cache, system memory,
or an I/O device.
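The throughput and latency figures above can be combined into a simple best-case estimate. The sketch below assumes a run of independent, back-to-back cache hits with no stalls, so one operation enters the pipeline per cycle and the last result appears one latency period after it enters. The function name is hypothetical, and real timing also depends on the factors listed above.

```python
LOAD_LATENCY = 2    # cycles, cacheable load (figure from the text above)
STORE_LATENCY = 3   # cycles, store (figure from the text above)


def pipelined_cycles(count: int, latency: int) -> int:
    """Best-case cycles for `count` back-to-back operations issued one per cycle."""
    if count <= 0:
        return 0
    # The first operation takes `latency` cycles; each later one adds a cycle.
    return latency + (count - 1)


four_loads = pipelined_cycles(4, LOAD_LATENCY)     # 5 cycles
four_stores = pipelined_cycles(4, STORE_LATENCY)   # 6 cycles
```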
1.4.4 System Register Unit (SRU)
The SRU executes various system-level instructions, including condition register logical operations and
move to/from special-purpose register instructions, and also executes integer add/compare instructions. In
order to maintain system state, most instructions executed by the SRU are completion-serialized; that is, the
instruction is held for execution in the SRU until all prior instructions issued have completed. Results from
completion-serialized instructions executed by the SRU are not available or forwarded for subsequent
instructions until the instruction completes.
1.4.5 Completion Unit
The completion unit tracks instructions from dispatch through execution, and then retires, or “completes,”
them in program order. Completing an instruction commits the 603e to any architectural register changes
caused by that instruction. In-order completion ensures the correct architectural state when the 603e must
recover from a mispredicted branch or any exception.
Instruction state and other information required for completion is kept in a first-in-first-out (FIFO) queue of
five completion buffers. A single completion buffer entry is allocated for each instruction once it enters the
dispatch unit. A completion buffer entry is required for instruction dispatch; otherwise, instruction dispatch
stalls. A maximum of two instructions per cycle are completed in order from the queue.
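The completion queue behavior described above can be modeled in a few lines. This is a simplified sketch, not the 603e's actual logic: it keeps only the three constraints stated in the text (five buffers, dispatch stalls when none is free, at most two in-order completions per cycle). The class and method names are hypothetical.

```python
from collections import deque

COMPLETION_BUFFERS = 5   # entries in the completion queue
MAX_DISPATCH = 2         # instructions dispatched per cycle
MAX_COMPLETE = 2         # instructions completed per cycle


class CompletionQueue:
    def __init__(self):
        self.entries = deque()   # each entry is [tag, finished]

    def dispatch(self, tags):
        """Allocate buffers for up to MAX_DISPATCH tags; return the tags accepted."""
        accepted = []
        for tag in tags[:MAX_DISPATCH]:
            if len(self.entries) == COMPLETION_BUFFERS:
                break                      # no free buffer: dispatch stalls
            self.entries.append([tag, False])
            accepted.append(tag)
        return accepted

    def finish(self, tag):
        """Mark an instruction as having finished execution (possibly out of order)."""
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True

    def complete(self):
        """Retire up to MAX_COMPLETE finished instructions from the head, in order."""
        retired = []
        while (self.entries and self.entries[0][1]
               and len(retired) < MAX_COMPLETE):
            retired.append(self.entries.popleft()[0])
        return retired
```

In this model an instruction that finishes early still waits behind any older unfinished instruction, which is what keeps the architectural state precise after a mispredicted branch or an exception.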
1.5 Memory Subsystem Support
The 603e provides support for cache and memory management through dual instruction and data memory
management units. The 603e also provides dual 16-Kbyte instruction and data caches, and an efficient
processor bus interface for access into main memory and other bus subsystems. The memory subsystem
support functions are described in the following subsections.
1.5.1 Memory Management Units (MMUs)
The 603e’s MMUs support up to 4 Petabytes (2^52) of virtual memory and 4 Gigabytes (2^32) of physical
memory (referred to as real memory in the architecture specification) for instructions and data. The MMUs
also control access privileges for these spaces on block and page granularities. Referenced and changed
status is maintained by the processor for each page to assist implementation of a demand-paged virtual
memory system. A key bit is implemented to provide information about memory protection violations prior
to page table search operations.
The LSU calculates effective addresses for data loads and stores, performs data alignment to and from cache
memory, and provides the sequencing for load and store string and multiple word instructions. The
instruction unit calculates the effective addresses for instruction fetching.
The higher-order bits of the effective address are translated by the appropriate MMU into physical address
bits. Simultaneously, the lower-order address bits (which are untranslated and therefore considered both
logical and physical) are directed to the on-chip caches where they form the index into the four-way
set-associative tag array. After translating the address, the MMU passes the higher-order bits of the physical
address to the cache and the cache lookup completes. For caching-inhibited accesses or accesses that miss
in the cache, the untranslated lower-order address bits are concatenated with the translated higher-order
address bits; the resulting 32-bit physical address is used by the memory unit and the system interface,
which accesses external memory.
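The arithmetic behind the parallel cache lookup can be checked directly. The sketch below uses figures stated in this document (4-Kbyte pages, 16-Kbyte four-way set-associative caches, 32-byte cache blocks) and shows that the set index falls entirely within the untranslated page offset, which is why indexing can start before translation finishes. The function and constant names are hypothetical.

```python
PAGE_OFFSET_BITS = 12          # 4-Kbyte pages
CACHE_BYTES = 16 * 1024        # per cache
WAYS = 4                       # four-way set-associative
BLOCK_BYTES = 32               # cache block size

SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)          # 128 sets
BLOCK_OFFSET_BITS = BLOCK_BYTES.bit_length() - 1    # 5 bits select a byte in a block
INDEX_BITS = SETS.bit_length() - 1                  # 7 bits select a set


def cache_set_index(address: int) -> int:
    """Set index taken from the low-order (untranslated) address bits."""
    return (address >> BLOCK_OFFSET_BITS) & (SETS - 1)


def split_effective_address(ea: int):
    """Split a 32-bit effective address into (page number, page offset)."""
    ea &= 0xFFFFFFFF
    return ea >> PAGE_OFFSET_BITS, ea & ((1 << PAGE_OFFSET_BITS) - 1)


# Block offset plus index use exactly the 12 untranslated bits.
index_fits_in_page_offset = BLOCK_OFFSET_BITS + INDEX_BITS <= PAGE_OFFSET_BITS
```

Two addresses with the same page offset select the same set regardless of how their page numbers translate; the translated high-order bits are needed only for the tag comparison.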
The MMU also directs the address translation and enforces the protection hierarchy programmed by the
operating system in relation to the supervisor/user privilege level of the access and in relation to whether
the access is a load or store.
For instruction accesses, the MMU performs an address lookup in both the 64 entries of the ITLB, and in
the IBAT array. If an effective address hits in both the ITLB and the IBAT array, the IBAT array translation
takes priority. Data accesses cause a lookup in the DTLB and DBAT array for the physical address
translation. In most cases, the physical address translation resides in one of the TLBs and the physical
address bits are readily available to the on-chip cache.
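The priority rule for the two lookups can be sketched as follows. The hardware consults the BAT array and the TLB at the same time; this model runs them one after the other and captures only the outcome, namely that a BAT hit wins. The data layout (tuples for BAT entries, a dictionary for the TLB) and the function name are hypothetical simplifications.

```python
PAGE_SHIFT = 12   # 4-Kbyte pages


def translate(ea, bat_entries, tlb):
    """Return (physical_address, source) for a 32-bit effective address.

    bat_entries: list of (effective_base, length_bytes, physical_base)
    tlb: dict mapping effective page number -> physical page number
    """
    # A BAT hit takes priority, even if the TLB also holds a match.
    for base, length, phys_base in bat_entries:
        if base <= ea < base + length:
            return phys_base + (ea - base), "BAT"
    page = ea >> PAGE_SHIFT
    if page in tlb:
        offset = ea & ((1 << PAGE_SHIFT) - 1)
        return (tlb[page] << PAGE_SHIFT) | offset, "TLB"
    return None, "miss"   # software table search takes over


bats = [(0x00000000, 128 * 1024, 0x10000000)]   # one 128-Kbyte block
tlb = {0x00000: 0x2ABCD, 0x00040: 0x00123}
both_hit = translate(0x00001234, bats, tlb)      # address matches BAT and TLB
```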
When the physical address translation misses in the TLBs, the 603e provides hardware assistance for
software to perform a search of the translation tables in memory. The hardware assist consists of the
following features:
•Automatic storage of the missed effective address in the IMISS and DMISS registers
•Automatic generation of the primary and secondary hashed real address of the page table entry
group (PTEG), which are readable from the HASH1 and HASH2 register locations.
The HASH data is generated from the contents of the IMISS or DMISS register. Which register is
selected depends on which miss (instruction or data) was last acknowledged.
•Automatic generation of the first word of the page table entry (PTE) for which the tables are being
searched
•A real page address (RPA) register that matches the format of the lower word of the PTE
•Two TLB access instructions (tlbli and tlbld) that are used to load an address translation into the
instruction or data TLBs
•Shadow registers for GPR0–GPR3 that allow miss code to execute without corrupting the state of
any of the existing GPRs. These shadow registers are only used for servicing a TLB miss.
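For readers who want to see what the hashed PTEG addresses look like, the sketch below follows the hashed page table definition of the 32-bit PowerPC architecture as commonly documented; none of these bit-level details appear in this summary, so treat them as assumptions. The primary hash is the low-order 19 bits of the virtual segment ID XORed with the 16-bit page index, the secondary hash is its ones' complement, and the PTEG address combines the hash with the HTABORG and HTABMASK fields of the SDR1 register. Function names are hypothetical.

```python
HASH_MASK = 0x7FFFF   # hashes are 19 bits wide (assumed from the architecture)


def primary_hash(vsid: int, ea: int) -> int:
    """Low-order 19 bits of the VSID XOR the 16-bit page index of the EA."""
    page_index = (ea >> 12) & 0xFFFF
    return (vsid & HASH_MASK) ^ page_index


def secondary_hash(primary: int) -> int:
    """Ones' complement of the primary hash."""
    return ~primary & HASH_MASK


def pteg_address(hash19: int, htaborg: int, htabmask: int) -> int:
    """Physical address of a 64-byte PTEG (htaborg: 16 bits, htabmask: 9 bits)."""
    hash_hi9 = (hash19 >> 10) & 0x1FF
    hash_lo10 = hash19 & 0x3FF
    org_hi7 = (htaborg >> 9) & 0x7F
    org_lo9 = htaborg & 0x1FF
    return ((org_hi7 << 25)
            | ((org_lo9 | (hash_hi9 & htabmask)) << 16)
            | (hash_lo10 << 6))


h1 = primary_hash(vsid=0x000001, ea=0x00003000)
h2 = secondary_hash(h1)
```

With a zero HTABMASK (the smallest table), both PTEG addresses in this example fall inside the same 64-Kbyte table, at different ends of it.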
See Section 2.6.2, “PowerPC 603e Microprocessor Memory Management,” for more information about
memory management for the 603e.
The 603e provides independent 16-Kbyte, four-way set-associative instruction and data caches. The cache
block is 32 bytes long. The caches adhere to a write-back policy, but the PowerPC architecture allows
control of cacheability, write policy, and memory coherency at the page and block levels. The caches use a
least recently used (LRU) replacement policy.
As shown in Figure 1, the caches provide a 64-bit interface to the instruction fetch unit and load/store unit.
The surrounding logic selects, organizes, and forwards the requested information to the requesting unit.
Write operations to the cache can be performed on a byte basis, and a complete read-modify-write operation
to the cache can occur in each cycle.
The load/store and instruction fetch units provide the caches with the address of the data or instruction to
be fetched. In the case of a cache hit, the cache returns two words to the requesting unit.
Since the 603e data cache tags are single ported, simultaneous load or store and snoop accesses cause
resource contention. Snoop accesses have the highest priority and are given first access to the tags, unless
the snoop access coincides with a tag write, in which case the snoop is retried and must re-arbitrate for
access to the cache. Loads or stores that are deferred due to snoop accesses are executed on the clock cycle
following the snoop.
1.6 Processor Bus Interface
Memory accesses can occur in single-beat (1–8 bytes) and four-beat burst (32 bytes) data transfers when the
bus is configured as 64 bits, and in single-beat (1–4 bytes), two-beat (8 bytes), and eight-beat (32 bytes) data
transfers when the bus is configured as 32 bits. The address and data buses operate independently to support
pipelining and split transactions during memory accesses. The 603e can pipeline its bus transactions to a
depth of one level.
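The beat counts above follow from dividing the transfer size by the bus width. A minimal sketch, ignoring alignment and using a hypothetical function name:

```python
def beats(transfer_bytes: int, bus_width_bits: int) -> int:
    """Number of data beats needed to move transfer_bytes over the data bus."""
    bus_bytes = bus_width_bits // 8
    return -(-transfer_bytes // bus_bytes)   # ceiling division


burst_on_64_bit_bus = beats(32, 64)   # one 32-byte cache block
burst_on_32_bit_bus = beats(32, 32)
```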
Because the caches on the 603e are on-chip, write-back caches, the predominant type of transaction for most
applications is burst-read memory operations, followed by burst-write memory operations, and single-beat
(noncacheable or write-through) memory read and write operations. Additionally, there can be address-only
operations, variants of the burst and single-beat operations (for example, global memory operations that are
snooped and atomic memory operations), and address retry activity (for example, when a snooped read
access hits a modified line in the cache).
Access to the system interface is granted through an external arbitration mechanism that allows devices to
compete for bus mastership. This arbitration mechanism is flexible, allowing the 603e to be integrated into
systems that implement various fairness and bus parking procedures to avoid arbitration overhead.
Typically, memory accesses are weakly ordered—sequences of operations, including load/store string and
multiple instructions, do not necessarily complete in the order they begin—maximizing the efficiency of the
bus without sacrificing coherency of the data. The 603e allows read operations to precede store operations
(except when a dependency exists, or in cases where a non-cacheable access is performed), and provides
support for a write operation to precede a previously queued read data tenure (for example, allowing a snoop
push to be enveloped by the address and data tenures of a read operation). Because the processor can
dynamically optimize run-time ordering of load/store traffic, overall performance is improved.
1.7 System Support Functions
The 603e implements several support functions that include power management, time base/decrementer
registers for system timing tasks, an IEEE 1149.1(JTAG)/common on-chip processor (COP) test interface,
and a phase-locked loop (PLL) clock multiplier. These system support functions are described in the
following subsections.
1.7.1 Power Management
The 603e provides four power modes, selectable by setting the appropriate control bits in the machine state
register (MSR) and hardware implementation register 0 (HID0). The four power modes are as
follows:
•Full-power–This is the default power state of the 603e. The 603e is fully powered and the internal
functional units are operating at the full processor clock speed. If the dynamic power management
mode is enabled, functional units that are idle will automatically enter a low-power state without
affecting performance, software execution, or external hardware.
•Doze–All the functional units of the 603e are disabled except for the time base/decrementer
registers and the bus snooping logic. When the processor is in doze mode, an external asynchronous
interrupt, a system management interrupt, a decrementer exception, a hard or soft reset, or a machine
check brings the 603e into the full-power state. The 603e in doze mode maintains the PLL in a fully
powered state and locked to the system external clock input (SYSCLK) so a transition to the full-power
state takes only a few processor clock cycles.
•Nap–The nap mode further reduces power consumption by disabling bus snooping, leaving only the
time base register and the PLL in a powered state. The 603e returns to the full-power state upon
receipt of an external asynchronous interrupt, a system management interrupt, a decrementer
exception, a hard or soft reset, or a machine check input (MCP). A return to full-power state from
a nap state takes only a few processor clock cycles.
•Sleep–Sleep mode reduces power consumption to a minimum by disabling all internal functional
units, after which external system logic may disable the PLL and SYSCLK. Returning the 603e to
the full-power state requires the enabling of the PLL and SYSCLK, followed by the assertion of an
external asynchronous interrupt, a system management interrupt, a hard or soft reset, or a machine
check input (MCP) signal after the time required to relock the PLL.
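The wake-up conditions listed for the three static modes can be collected into a small lookup table, which makes the differences easy to see. The table restates only what the list above says; the event names and the helper function are hypothetical labels, not signal or register names.

```python
COMMON_WAKE_EVENTS = {
    "external_interrupt",
    "system_management_interrupt",
    "hard_reset",
    "soft_reset",
    "machine_check",
}

WAKE_EVENTS = {
    "doze": COMMON_WAKE_EVENTS | {"decrementer_exception"},
    "nap": COMMON_WAKE_EVENTS | {"decrementer_exception"},
    "sleep": COMMON_WAKE_EVENTS,   # PLL and SYSCLK must be re-enabled first
}

SNOOPING_ACTIVE = {"doze": True, "nap": False, "sleep": False}


def wakes(mode: str, event: str) -> bool:
    """True if `event` returns the processor from `mode` to full power."""
    return event in WAKE_EVENTS[mode]
```

One consequence for system design, under this reading of the text: a system that relies on the decrementer as its periodic wake-up source can use doze or nap mode, but not sleep mode.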
1.7.2 Time Base/Decrementer
The time base is a 64-bit register (accessed as two 32-bit registers) that is incremented once every four bus
clock cycles; external control of the time base is provided through the time base enable (TBEN) signal. The
decrementer is a 32-bit register that can generate a maskable decrementer exception after a programmable
delay. The contents of the decrementer register are decremented once every four bus clock cycles, and the
decrementer exception is generated as the count passes through zero.
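Because both registers tick once every four bus clocks, converting between time and counts is simple arithmetic. The sketch below assumes a 66-MHz bus clock purely as an example value and ignores the exact cycle on which the exception is raised as the count passes through zero. Function names are hypothetical.

```python
BUS_CLOCKS_PER_TICK = 4   # time base and decrementer tick rate (from the text)


def tick_rate_hz(bus_hz: int) -> int:
    """Ticks per second for the time base and the decrementer."""
    return bus_hz // BUS_CLOCKS_PER_TICK


def decrementer_count(delay_us: int, bus_hz: int) -> int:
    """Approximate decrementer value for a delay given in microseconds."""
    count = delay_us * bus_hz // (BUS_CLOCKS_PER_TICK * 1_000_000)
    if not 0 <= count < 2**32:
        raise ValueError("delay does not fit in the 32-bit decrementer")
    return count


EXAMPLE_BUS_HZ = 66_000_000                          # assumed example frequency
one_ms = decrementer_count(1000, EXAMPLE_BUS_HZ)     # 16500 ticks
```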
1.7.3 IEEE 1149.1 (JTAG)/COP Test Interface
The 603e provides IEEE 1149.1 and COP functions for facilitating board testing and chip debug. The IEEE
1149.1 test interface provides a means for boundary-scan testing the 603e and the board to which it is
attached. The COP function shares the IEEE 1149.1 test port, provides a means for executing test routines,
and facilitates chip and software debugging.
1.7.4 Clock Multiplier
The internal clocking of the 603e is generated from and synchronized to the external clock signal, SYSCLK,
by means of a voltage-controlled oscillator-based PLL. The PLL provides programmable internal processor
clock rates of 1x, 1.5x, 2x, 2.5x, 3x, 3.5x, and 4x multiples of the externally supplied clock frequency for
the PID6-603e, and multiples of 2x, 2.5x, 3x, 3.5x, 4x, 4.5x, 5x, 5.5x, and 6x of the externally provided
clock for the PID7v-603e. The bus clock has the same frequency as SYSCLK and is synchronous with it. The
configuration of the PLL can be read by software from hardware implementation register 1 (HID1).
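The multiplier sets listed above can be written down as data and used to compute the internal clock rate. This sketch checks only that a multiplier is one of those listed for the given part; it does not model the valid frequency ranges of the PLL or how the configuration is selected, neither of which this overview describes. The names and the 66-MHz example value are hypothetical.

```python
from fractions import Fraction

# Multipliers listed in the text, in steps of one half.
PLL_MULTIPLIERS = {
    "PID6-603e": [Fraction(n, 2) for n in range(2, 9)],     # 1x through 4x
    "PID7v-603e": [Fraction(n, 2) for n in range(4, 13)],   # 2x through 6x
}


def core_clock_hz(sysclk_hz: int, multiplier, part: str) -> int:
    """Internal processor clock for a given SYSCLK and PLL multiplier."""
    ratio = Fraction(multiplier)
    if ratio not in PLL_MULTIPLIERS[part]:
        raise ValueError(f"{multiplier}x is not listed for {part}")
    return int(sysclk_hz * ratio)


example = core_clock_hz(66_000_000, 2.5, "PID6-603e")   # 165 MHz
```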