Note: This document and the information it contains are provided on an as-is basis.
There is no plan for providing for future updates and corrections to this document.
Printed in the United States of America October 2012
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Other company, product, and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document
are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction
could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not
affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied
license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating
environments may vary.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. In no event will IBM be
liable for damages arising directly or indirectly from any use of the information contained in this document.
IBM Systems and Technology Group
2070 Route 52, Bldg. 330
Hopewell Junction, NY 12533-6351
The IBM home page can be found at ibm.com®.
The IBM semiconductor solutions home page can be found at ibm.com/chips.
Version 1.3
October 23, 2012
Page 3
User’s Manual
A2 Processor
Contents
List of Figures    21
List of Tables    23
9.1 Time Base    388
9.1.1 Reading the Time Base    389
9.1.2 Writing the Time Base    389
Revision Log
Each release of this document supersedes all previously released versions. The revision log lists all significant changes made to the document since its initial release. In the rest of the document, change bars in the
margin indicate that the adjacent text was modified from the previous release of this document.

Revision Date        Pages    Description
October 23, 2012     657      Version 1.3. Updated Section 14.5.99 PVR - Processor Version Register. Removed “IBM Confidential.”
May 25, 2011         -        Version 1.2.
April 1, 2011                 Version 1.1.
                     518      Added a programming note to Section 12.5.3 Execution.
                     90       Revised Table 2-11 Operand Handling Dependent on Alignment.
December 15, 2010    -        Version 1.0. Initial release.
About This Book
This user’s manual provides the architectural overview, programming model, and detailed information about
the instruction set, registers, and other facilities of the IBM® Power ISA A2 64-bit embedded processor core.
The A2 embedded controller core features:
• Power ISA Architecture
• Concurrent-issue pipeline with dynamic branch prediction
• Separate instruction and data caches, 16 KB each
• Memory management unit (MMU) with a 512-entry translation lookaside buffer (TLB)
• 4 TB (42-bit) physical address capability
• 128-bit reload interface and 128-bit store interface
• ANSI/IEEE 754-1985 compliant floating-point (see footnote 1)
• Single-precision and double-precision operation in hardware
• Auxiliary execution unit (AXU) that executes the Power ISA floating-point instruction set
• Super-pipelined: Single cycle throughput for most instructions
• In-order execution and completion
1. Power ISA FUs require software support for IEEE compliance.
Who Should Use This Book
This book is for system hardware and software developers and for application developers who need to understand the A2 core. The audience should understand embedded system design, operating systems, RISC
microprocessing, and computer organization and architecture.
How to Use This Book
This book describes the A2 core device architecture, programming model, registers, and instruction set. This
book contains the following chapters:
• Overview on page 45
• CPU Programming Model on page 61
• FU Programming Model on page 127
• Initialization on page 153
• Instruction and Data Caches on page 169
• Memory Management on page 185
• CPU Interrupts and Exceptions on page 293
• FU Interrupts and Exceptions on page 371
• Timer Facilities on page 387
• Debug Facilities on page 399
• Performance Events and Event Selection on page 449
• Implementation Dependent Instructions on page 481
• Power Management Methods on page 525
• Register Summary on page 529
• SCOM Accessible Registers on page 701
This book contains the following appendixes:
• Processor Instruction Summary on page 737
• FU Instruction Summary on page 756
• Debug and Trigger Groups on page 761
• Instruction Execution Performance and Code Optimizations on page 833
• Programming Examples on page 861
Notation
The manual uses the following notational conventions:
• Active low signals are shown with an overbar (Active_Low).
• All numbers are decimal unless specified in some special way.
• 0bnnnn means a number expressed in binary format.
• 0xnnnn means a number expressed in hexadecimal format.
Underscores might be used between digits.
• RA refers to General Purpose Register (GPR) RA.
• (RA) refers to the contents of GPR RA.
• (RA|0) refers to the contents of GPR RA or to the value 0 if the RA field is 0.
• Bits in registers, instructions, and fields are specified as follows.
• Bits are numbered most-significant bit to least-significant bit, starting with bit 0.
• X_p (p written as a subscript) means bit p of register, instruction, or field X.
• X_p:q (p:q written as a subscript) means bits p through q of a register, instruction, or field X.
• X_p,q,... (p,q,... written as a subscript) means bits p, q,... of a register, instruction, or field X.
• X[p] means a named field p of register X.
• X[p:q] means named fields p through q of register X.
• X[p,q,...] means named fields p, q,... of register X.
• ¬X means the ones complement of the contents of X.
• A period (.) as the last character of an instruction mnemonic means that the instruction records status
information in certain fields of the Condition Register as a side effect of execution, as described in
Section 12 Implementation Dependent Instructions on page 481.
• The symbol || is used to describe the concatenation of two values. For example, 0b010 || 0b111 is the
same as 0b010111.
• x^n (n written as a superscript after x) means x raised to the nth power.
• ^n x (n written as a superscript before x) means the replication of x, n times (that is, x concatenated to itself n - 1 times). ^n 0 and ^n 1 are special
cases:
• ^n 0 means a field of n bits with each bit equal to 0. Thus ^5 0 is equivalent to 0b00000.
• ^n 1 means a field of n bits with each bit equal to 1. Thus ^5 1 is equivalent to 0b11111.
• /, //, ///,... denotes a reserved field in an instruction or in a register.
• ? denotes an allocated bit in a register.
• A shaded field denotes a field that is reserved or allocated in an instruction or in a register.
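The bit-numbering, concatenation, and replication conventions above can be illustrated with a short sketch. The helper names (bit, bits, concat, replicate) and the explicit width arguments are illustrative assumptions and are not part of the manual or the Power ISA.

```python
def bit(x: int, p: int, width: int = 64) -> int:
    """Return bit p of a width-bit value x; bit 0 is the most-significant bit."""
    return (x >> (width - 1 - p)) & 1


def bits(x: int, p: int, q: int, width: int = 64) -> int:
    """Return bits p through q (inclusive, p <= q) of x, using MSB-0 numbering."""
    length = q - p + 1
    return (x >> (width - 1 - q)) & ((1 << length) - 1)


def concat(a: int, b: int, b_width: int) -> int:
    """a || b: place a to the left of the b_width-bit value b."""
    return (a << b_width) | (b & ((1 << b_width) - 1))


def replicate(x: int, x_width: int, n: int) -> int:
    """Replicate the x_width-bit value x, n times."""
    result = 0
    for _ in range(n):
        result = (result << x_width) | x
    return result
```

For example, concat(0b010, 0b111, 3) yields 0b010111, and replicate(1, 1, 5) yields 0b11111, matching the examples in the list above.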
Related Publications
• Power ISA User Set Architecture (Book I, Version 2.06)
• Power ISA Virtual Environment Architecture (Book II, Version 2.06)
• Power ISA Operating Environment Architecture (Book III-E, Version 2.06)
The Power ISA specifications are available at www.power.org
List of Acronyms and Abbreviations
UX       underflow exception or user mode execution access
V        vector category
V.LE     little-endian category
VA       virtual addresses
VF       virtualization fault
VHDL     very-high-speed integrated circuit (VHSIC) hardware description language
VLE      variable length encoding category
VLPT     virtual linear page table
VPN      virtual page number
VSID     virtual segment ID
VSX      vector-scalar extension category
VX       invalid operation exception
W        write-through
WAW      write-after-write
WC       wake control or write to clear
WDT      watchdog timer
WIMGE    write-through, caching-inhibited, memory coherency required, guarded, and endianness attributes
WP       watchdog timer period
WS       write to set
WT       wait category
XOR      exclusive OR
XU       execution unit
ZX       zero divide exception
1. Overview
The IBM Power ISA A2 64-bit embedded processor core is an implementation of the scalable and flexible
Power ISA architecture. The A2 core implements four simultaneous threads of execution within the core.
Each thread of execution can be viewed as a processor within a 4-way multiprocessor with shared dataflow.
This gives the effective appearance of four independent processing units to software. The performance of the
four threads is limited because they share some resources such as the L1 and L2 caches.
The floating-point unit interfaces to the A2 processor core and incorporates a 6-stage arithmetic pipeline. The
pipeline enables one arithmetic instruction to be issued during each cycle. Floating-point instructions execute
with 6-cycle latency and 1-cycle throughput, except for operations on denormalized operands, division, and
square root.
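As a rough illustration of what 6-cycle latency with 1-cycle throughput implies, the following sketch estimates cycle counts for two idealized instruction streams. It assumes no stalls, no thread contention, and none of the excepted operations (denormalized operands, division, square root); the function names are hypothetical.

```python
FP_LATENCY = 6         # cycles from issue to result, from the text above
FP_ISSUE_INTERVAL = 1  # one arithmetic instruction can issue per cycle


def cycles_independent(n: int) -> int:
    """Cycles to finish n independent floating-point operations (idealized)."""
    if n == 0:
        return 0
    # The first result appears after the full latency; each later
    # instruction finishes one issue interval after the previous one.
    return FP_LATENCY + (n - 1) * FP_ISSUE_INTERVAL


def cycles_dependent_chain(n: int) -> int:
    """Cycles for n operations where each one consumes the previous result."""
    # Assumption: a dependent instruction cannot issue until the
    # previous result is available, so latencies add up.
    return n * FP_LATENCY
```

Under these assumptions, ten independent operations take 15 cycles, while a chain of ten dependent operations takes 60 cycles, which is why interleaving independent work matters on an in-order pipeline.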
1.1 A2 Core Key Design Fundamentals
The key design fundamentals of the A2 core are the following:
• 64-bit implementation of the Power ISA Version 2.06 Book III-E - Embedded Platform Environment.
– The A2 core provides binary compatibility for IBM PowerPC® application level code (problem state).
– The A2 core implements the Embedded Hypervisor Architecture to provide secure compute domains
and operating system virtualization.
• The A2 core is optimized for aggregate throughput.
– 4-way, fine-grained simultaneous multithreaded.
– 2-way concurrent issue: one branch/integer/load/store + one AXU (FP/vector).
– In-order dispatch and execution.
– 27 FO4 design.
• The A2 core is a modular design to support reuse.
– The A2 core provides a general purpose coprocessor (AXU) port to attach unique AXUs.
    • AXUs have full ISA flexibility.
    • AXUs currently include:
        – FU - Power ISA V2.06 scalar double-precision floating-point unit.
    • The AXU is an optional unit.
– The A2 core provides for an optional MMU unit.
    • The MMU unit supports Power ISA V2.06 Book III-E Memory Management (MAV 2.0).
    • Without the MMU, the A2 core supports the software-managed ERATs defined in this document.
– The A2 core provides for an optional microcode engine and ROM.
    • Power ISA V2.06 Book I and II instructions are supported with a combination of microcoded
    instructions and hardware implemented instructions.
1.2 A2 Core Features
The A2 core is a high-performance, low-power engine that implements the flexible and powerful Power ISA
Architecture.
The A2 core contains a single-issue, in-order, pipelined processing unit, along with other functional elements
required by embedded product specifications. These other functions include memory management, cache
control, timers, and debug facilities. Interfaces for custom coprocessors and floating-point functions are
provided. The processor interface is 128 bits for reads and 128 bits for writes (256 bits for writes in an optional version of the A2),
and it provides the framework to efficiently support system-on-a-chip (SOC) designs.
A2 core features include:
• High-performance, concurrent-issue, 64-bit RISC CPU
• 4-way, fine-grained simultaneous multithreaded implementation of the full 64-bit Power ISA Architecture
– One outstanding I-fetch request to the L2 cache per thread
– One 8-entry instruction fetch buffer per thread
– Up to four instructions can be placed in the instruction buffer per cycle
– Up to one instruction can be taken out of the instruction buffer per cycle per thread
– Instruction decode and dependency per thread
• Two-way concurrent instruction decode and issue
• In-order dispatch, execution, and completion
• High-accuracy dynamic branch prediction
–81024 entry branch history table with 2 bits of history
– Four-entry link stack per thread
• Highly-pipelined microarchitecture
– Full GPR bypass
– Full CR bypass
– Link Register bypass
• Single unified pipeline
• Complex integer, system, branch, simple integer, and load/store pipelines
• Unified (for all threads) nonblocking with up to eight outstanding load misses
• Cache line locking supported
• Caches can be partitioned to provide separate regions for transient instructions and data
• Critical-word-first data access and forwarding
• Pseudo LRU replacement policy
• Cache tags and data are parity protected. Errors are recoverable.
• Memory Management Unit (MMU)
• Support for Power ISA categories Embedded.Hypervisor (E.HV), Embedded.Hypervisor.LRAT
(E.HV.LRAT), Embedded.TLB Write Conditional (E.TWC), and Embedded.Page Table (E.PT)
• Support for Power ISA Book III-E MMU Architecture Version 2.0 (MAV 2.0)
• Separate instruction and data ERATs
– Fully associative 16-entry I-ERAT shared by all threads
– Fully associative 32-entry D-ERAT shared by all threads
– Entries can be shared by two or more threads via 4-bit thread ID mask field
– Exclusion range function to allow address “holes” at base of page entries
– ERATs operate in one of two modes: MMU mode or ERAT-only mode
1. MMU mode; ERAT with backing MMU
– Software-managed page tables and indirect (IND = 1) TLB entries
– Hardware handles ERAT miss with TLB hit
– Hardware handles direct (IND = 0) TLB miss via hardware page table walking
– Software handles indirect (IND = 1) TLB miss via instruction and data TLB miss exceptions
– Software can also install direct (IND = 0) TLB entries as required
2. ERAT-only mode; effective-to-real address translation with ERATs only
– MMU removed, no backing TLB
– Software-managed ERAT entries – I/D TLB miss exceptions
1. The A2 FPU requires software support for IEEE 754 compliance. See IEEE 754 and Architectural Compliance on page 56
for details.
1.3 The A2 Core as a Power ISA Implementation
The A2 core implements the full, 64-bit fixed-point Power ISA Architecture. The A2 core fully complies with
these architectural specifications. The core does not implement the floating-point operations, although a
floating-point unit (FU) can be attached (using the AXU interface).
1.3.1 Embedded Hypervisor
The A2 core implements the Embedded Hypervisor Architecture to provide secure compute domains and
operating system virtualization. The Embedded Hypervisor Architecture introduces the concept of partitions
by two main architectural changes. The first is by extending the virtual address with a logical partition identifier (LPID). The identifier serves an analogous purpose to the process ID (PID) and is used to distinguish
partitions. The second change is introducing a new privilege level above supervisor and reallocating ownership of resources between the two levels. Moving the ownership of certain resources beyond the supervisor
helps software to provide secure compute domains.
In addition to providing logical partitions, the following requirements are set forth:
• Ensure a secure environment. An operating system in one logical partition is not allowed to affect the
resources of an operating system in another partition.
• Maintain compatibility with the existing programming model. An existing operating system today should
require only minor initialization changes to run.
• An operating system running in a logical partition should not be able to deny service to any shared
resources.
• Clean and secure communication channels between supervisor and embedded hypervisor states (in both
directions).
• The ability to run guest operating systems efficiently and provide real-time response to interrupts.
1.4 A2 Core Organization
The A2 core includes a concurrent-issue instruction fetch and decode unit with an attached branch unit,
together with a pipeline for complex integer, simple integer, and load/store operations. The A2 core also
includes a memory management unit (MMU); separate instruction and data cache units; pervasive and debug
logic; and timer facilities.
Figure 1-1. A2 Core Organization
1.4.1 Instruction Unit
The instruction unit of the A2 core fetches, decodes, and issues two instructions from different threads per
cycle to any combination of the one execution pipeline and the AXU interface (see Section 1.4.2 Execution Unit on page 51 and Section 1.5.2 Auxiliary Execution Unit (AXU) Port on page 59). The instruction unit
includes a branch unit that provides dynamic branch prediction using a branch history table (BHT). This
mechanism greatly improves the branch prediction accuracy and reduces the latency of taken branches, such
that the target of a branch can usually be executed immediately after the branch itself with no penalty.
1.4.2 Execution Unit
The A2 core contains a single execution pipeline. The pipeline consists of seven stages and can access the
5-ported (three read, two write) GPR file.
The pipeline handles all arithmetic, logical, branch, and system management instructions (such as interrupt
and TLB management, move to/from system registers, and so on), as well as all loads, stores, and cache
management operations. The pipelined multiply unit can perform 32-bit × 32-bit multiply operations with
single-cycle throughput and single-cycle latency. The width of the divider is 64
bits. Divide instructions dealing with 64-bit operands recirculate for 65 cycles, and operations with 32-bit operands recirculate for 32 cycles. No divide instructions are pipelined; they all require some recirculation.
All misaligned operations are handled in hardware with no penalty on any operation that is contained within
an aligned 32-byte region. The load/store pipeline supports all operations to both big-endian and little-endian
data regions.
Appendix D Instruction Execution Performance and Code Optimizations on page 833 provides detailed information about instruction timings and performance implications in the A2 core.
1.4.3 Instruction and Data Cache Controllers
The A2 core provides separate instruction and data cache controllers and arrays, which allow concurrent
access and minimize pipeline stalls. The storage capacity of each cache array is 16 KB. Both caches
have 64-byte lines; the I-cache is 4-way set-associative and the D-cache is 8-way set-associative. Both
caches support parity checking on the tags and data in the memory arrays to protect against soft errors. If a
parity error is detected, the CPU forces an L1 miss and reloads from the system bus. The A2 core can be
configured to cause a machine check exception on a D-cache parity error.
The Power ISA instruction set provides a rich set of cache management instructions for software-enforced
coherency. See Instruction and Data Caches on page 169 for detailed information about the instruction and
data cache controllers.
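The stated capacity, line size, and associativity determine the number of sets in each cache. The following sketch works through that arithmetic using the standard set-associative relationship; how the hardware actually selects index bits is not described in this section, so the index_bits helper is an assumption for illustration only.

```python
def cache_sets(capacity_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets = total lines / associativity."""
    total_lines = capacity_bytes // line_bytes
    return total_lines // ways


def index_bits(sets: int) -> int:
    """Bits needed to select one set (sets must be a power of two)."""
    return sets.bit_length() - 1


CAPACITY = 16 * 1024  # 16 KB per cache array
LINE = 64             # 64-byte cache lines

icache_sets = cache_sets(CAPACITY, LINE, ways=4)  # I-cache is 4-way
dcache_sets = cache_sets(CAPACITY, LINE, ways=8)  # D-cache is 8-way
```

With these figures each cache holds 256 lines, organized as 64 sets of four lines in the I-cache and 32 sets of eight lines in the D-cache.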
1.4.3.1 Instruction Cache Controller
The instruction cache controller (ICC) delivers up to four instructions per cycle to the instruction unit of the A2
core. The ICC also handles the execution of the Power ISA instruction cache management instructions for
coherency.
1.4.3.2 Data Cache Controller
The data cache controller (DCC) handles all load and store data accesses, as well as the Power ISA data
cache management instructions. All misaligned accesses are handled in hardware. Cacheable load accesses
that are contained within a double quadword (32 bytes) are handled as a single request. Cacheable store accesses,
and caching-inhibited load or store accesses, that are contained within a quadword (16 bytes) are handled as a
single request. Load and store accesses that cross these boundaries are broken into separate byte accesses
in hardware by the microcode engine. In 32-byte store mode (XUCR0[L2SIW] = 1), all
misaligned store or load accesses contained within a double quadword (32 bytes) are handled as a single
request. This includes cacheable and caching-inhibited stores and loads.
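The request-splitting rules above can be summarized as a small predicate. This is a sketch only: it assumes that "contained within" a quadword or double quadword refers to a naturally aligned 16-byte or 32-byte block, and the function and parameter names are hypothetical.

```python
def within_aligned_block(addr: int, size: int, block: int) -> bool:
    """True if [addr, addr + size) lies inside one aligned block of `block` bytes."""
    return addr // block == (addr + size - 1) // block


def is_single_request(addr: int, size: int, is_load: bool,
                      cacheable: bool, l2siw: bool = False) -> bool:
    """Decide whether an access is handled as a single request.

    l2siw models XUCR0[L2SIW] = 1 (32-byte store mode).
    """
    if l2siw or (is_load and cacheable):
        limit = 32  # double quadword
    else:
        limit = 16  # quadword: cacheable stores, caching-inhibited loads/stores
    return within_aligned_block(addr, size, limit)
```

For example, a 4-byte cacheable load at address 0x0E stays within one 32-byte block and is a single request, whereas a 4-byte cacheable store at the same address crosses a 16-byte boundary and is split unless 32-byte store mode is enabled.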
The DCC interfaces to the AXU port to provide direct load/store access to the data cache for AXU load and
store operations. Such AXU load and store instructions can access up to 32 bytes (a double quadword) in a
single cycle for cacheable accesses and can access up to 16 bytes (a quadword) in a single cycle for caching
inhibited accesses.
The data cache always operates in a write-through manner.
The DCC also supports cache line locking and “transient” data via way locking.
The DCC provides for up to eight outstanding load misses, and the DCC can continue servicing subsequent
load and store hits in an out-of-order fashion. Store-gathering is not performed within the A2 core.
1.4.4 Memory Management Unit (MMU)
The A2 core supports a flat, 42-bit (4 TB) real (physical) address space. This 42-bit real address is generated
by the MMU as part of the translation process from the 64-bit effective address, which is calculated by the
processor core as an instruction fetch or load/store address.
Note: In 32-bit mode, the A2 core forces bits 0:31 of the calculated 64-bit effective address to zeros. Therefore, to have a translation hit in 32-bit mode, software needs to set the effective address upper bits to zero in
the ERATs and TLB.
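Because bits are numbered from the most-significant end, bits 0:31 are the upper half of the 64-bit effective address. A minimal sketch of the 32-bit mode behavior described in the note follows; the function name and the mode flag are illustrative assumptions.

```python
MASK_64 = (1 << 64) - 1
MASK_LOW_32 = (1 << 32) - 1


def effective_address(calculated_ea: int, mode_32bit: bool) -> int:
    """Return the effective address presented for translation.

    In 32-bit mode, bits 0:31 (the upper 32 bits) are forced to zero.
    """
    ea = calculated_ea & MASK_64
    return ea & MASK_LOW_32 if mode_32bit else ea
```

A translation entry intended for 32-bit mode therefore needs zeros in the upper 32 bits of its effective address tag to match.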
The MMU provides address translation, access protection, and storage attribute control for embedded applications. The MMU supports demand paged virtual memory and other management schemes that require
precise control of logical to physical address mapping and flexible memory protection. Working with appropriate system level software, the MMU provides the following functions:
• Translation of the 88-bit virtual address, which consists of a 1-bit guest state (GS), an 8-bit logical partition ID (LPID), a 1-bit
address space (AS) identifier, a 14-bit process ID (PID), and the 64-bit effective address, into the 42-bit real
address (note that the 1-bit indirect entry IND bit is not considered part of the virtual address)
• Page-level read, write, and execute access control
• Storage attributes for cache policy, byte order (endianness), and speculative memory access
• Software control of page replacement strategy
The translation lookaside buffer (TLB) is the primary hardware resource involved in the control of translation,
protection, and storage attributes. It consists of 512 entries, each specifying the various attributes of a given
page of the address space. The TLB is 4-way set associative. The TLB entries can be of type direct (IND = 0),
in which case the virtual address is translated immediately by a matching entry, or of type indirect (IND = 1),
in which case the hardware page table walker is invoked to fetch and install an entry from the hardware page
table.
The TLB tag and data memory arrays are parity protected against soft errors; if a parity error is detected
during an address translation, the TLB and ERAT caches treat the parity error like a miss and proceed to
either reload the entry with correct parity (in the case of an ERAT miss, TLB hit) and set the parity error bit in
the appropriate fault isolation register (FIR), or generate a TLB exception where software can take appropriate action (in the case of a TLB miss).
An operating system can choose to implement hardware page tables in memory that contain virtual to logical
translation page table entries (PTEs) per Category E.PT. These PTEs are loaded into the TLB by the hardware page table walker logic after the logical address is converted to a real address via the logical to real
address translation (LRAT) per Category E.HV.LRAT. Software must install indirect (IND = 1) type TLB
entries for each page table that is to be traversed by the hardware walker. Alternately, software can manage
the establishment and replacement of TLB entries by simply not using indirect entries (that is, by using only
direct IND = 0 entries). This gives system software significant flexibility in implementing a custom page
replacement strategy. For example, to reduce TLB thrashing or translation delays, software can reserve
several TLB entries for globally accessible static mappings. The instruction set provides several instructions
for managing TLB entries. These instructions are privileged, and the processor must be in supervisor state
for them to be executed.
The first step in the address translation process is to expand the effective address into a virtual address. This
is done by taking the 64-bit effective address and prepending to it a 1-bit guest state (GS) identifier, an 8-bit
logical partition ID (LPID), a 1-bit address space (AS) identifier, and the 14-bit process identifier (PID). The
1-bit indirect entry (IND) identifier is not considered part of the virtual address. The LPID value is provided by
the LPIDR register, and the PID value is provided by the PID register (see Memory Management on
page 185). The GS and AS identifiers are provided by the Machine State Register (MSR, see CPU Interrupts and Exceptions on page 293), which contains separate bits for the instruction fetch address space (MSR[IS])
and the data access address space (MSR[DS]). Together, the 64-bit effective address and the other identifiers form an 88-bit virtual address. This 88-bit virtual address is then translated into the 42-bit real address
using the TLB.
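The expansion of an effective address into a virtual address can be sketched as a concatenation of the five fields. The field order used here (GS, LPID, AS, PID, then the effective address) follows the order in which the text lists them and is an assumption about the exact bit layout; the function name is hypothetical.

```python
def virtual_address(gs: int, lpid: int, as_bit: int, pid: int, ea: int) -> int:
    """Form GS || LPID || AS || PID || EA as an 88-bit integer."""
    fields = ((gs, 1), (lpid, 8), (as_bit, 1), (pid, 14), (ea, 64))
    va = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value:#x} does not fit in {width} bits")
        va = (va << width) | value  # append the field on the right
    return va
```

The field widths sum to 1 + 8 + 1 + 14 + 64 = 88 bits, which matches the virtual address width given above.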
The MMU divides the address space (whether effective, virtual, or real) into pages. Five direct (IND = 0) page
sizes (4 KB, 64 KB, 1 MB, 16 MB, 1 GB) are simultaneously supported, such that at any given time the TLB
can contain entries for any combination of page sizes. The MMU also supports two indirect (IND = 1) page
sizes (1 MB and 256 MB) with associated sub-page sizes (see Section 6.16 Hardware Page Table Walking (Category E.PT)). For an address translation to occur, a valid direct entry for the page containing the virtual
address must be in the TLB. An attempt to access an address for which no direct TLB entry exists results in a
search for an indirect TLB entry to be used by the hardware page table walker. If neither a direct nor an indirect
entry exists, an instruction (for fetches) or data (for load/store accesses) TLB miss exception occurs.
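For each direct page size, an address splits into a page number and a byte offset within the page. The sketch below shows that split for the five sizes listed above; the dictionary and helper names are illustrative, and the actual TLB entry format is described elsewhere in the manual.

```python
DIRECT_PAGE_SIZES = {
    "4 KB": 4 << 10,
    "64 KB": 64 << 10,
    "1 MB": 1 << 20,
    "16 MB": 16 << 20,
    "1 GB": 1 << 30,
}


def offset_bits(page_size: int) -> int:
    """Number of low-order address bits that select a byte within the page."""
    return page_size.bit_length() - 1


def split_address(addr: int, page_size: int) -> tuple:
    """Return (page_number, offset) for addr under the given page size."""
    return addr >> offset_bits(page_size), addr & (page_size - 1)
```

Larger pages leave fewer bits to translate: a 4 KB page uses 12 offset bits, while a 1 GB page uses 30.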
To improve performance, both the instruction cache and the data cache maintain separate shadow TLBs
called ERATs. The ERATs contain only direct (IND = 0) type entries. The instruction ERAT (I-ERAT) contains
16 entries, while the data ERAT (D-ERAT) contains 32 entries. These ERAT arrays minimize TLB contention
between instruction fetch and data load/store operations. The instruction fetch and data access mechanisms
only access the main unified TLB when a miss occurs in the respective ERAT. Hardware manages the
replacement and invalidation of both the I-ERAT and D-ERAT; no system software action is required in MMU
mode. In ERAT-only mode, an attempt to access an address for which no ERAT entry exists causes an
instruction (for fetches) or data (for load/store accesses) TLB miss exception.
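The lookup order described in this section can be modeled as a simple decision flow. This is a simplified sketch, not the hardware algorithm: it uses a single page granule, represents the ERAT, TLB, and page table as Python containers keyed by virtual page number, and omits replacement policy. All names are hypothetical. The outcome when an indirect entry exists but the page table holds no matching entry is not covered in this overview, so the sketch only flags it.

```python
def translate(vpn, erat, tlb_direct, tlb_indirect, page_table, mmu_mode=True):
    """Return (outcome, real_page_number) for a virtual page number."""
    if vpn in erat:
        return "erat_hit", erat[vpn]
    if not mmu_mode:
        # ERAT-only mode: no backing TLB, software must install the entry.
        return "tlb_miss_exception", None
    if vpn in tlb_direct:
        erat[vpn] = tlb_direct[vpn]  # hardware reloads the ERAT
        return "erat_reload_from_tlb", erat[vpn]
    if vpn in tlb_indirect:
        if vpn in page_table:
            # Hardware page table walker installs a direct entry.
            tlb_direct[vpn] = page_table[vpn]
            erat[vpn] = page_table[vpn]
            return "hardware_page_table_walk", erat[vpn]
        return "walk_found_no_entry", None  # handling not modeled here
    return "tlb_miss_exception", None
```

A first access to a page backed by a direct TLB entry reports an ERAT reload, and a repeated access to the same page then reports an ERAT hit.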
Each TLB entry provides separate user state and supervisor state read, write, and execute permission
controls for the memory page associated with the entry. If software attempts to access a page for which it
does not have the necessary permission, an instruction (for fetches) or data (for load/store accesses) storage
exception occurs.
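The permission check can be sketched as follows. The data structure and the outcome strings are illustrative assumptions; the real TLB entry encodes these permissions as individual bits described in the Memory Management chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagePermissions:
    """Allowed access types per state; each a subset of {'r', 'w', 'x'}."""
    user: frozenset
    supervisor: frozenset


def check_access(perms: PagePermissions, supervisor_state: bool, access: str) -> str:
    """Return 'ok' or the kind of storage exception that would occur."""
    allowed = perms.supervisor if supervisor_state else perms.user
    if access in allowed:
        return "ok"
    # Fetches raise an instruction storage exception; loads and stores
    # raise a data storage exception.
    return "instruction_storage_exception" if access == "x" else "data_storage_exception"
```

For instance, a page that is read-only in user state but read/write in supervisor state rejects a user-state store with a data storage exception.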
Each TLB entry also provides a collection of storage attributes for the associated page. These attributes
control cache policy (such as cacheability and write-through as opposed to copy-back behavior), byte order
(big-endian as opposed to little-endian), and enabling of speculative access for the page. In addition, a set of
four user-definable storage attributes is provided. These attributes can be used to control various system-level behaviors.
Section 6 Memory Management describes the A2 core MMU functions in greater detail.
1.4.5 Timers
The A2 core contains a time base and three timers: a decrementer (DEC), a fixed interval timer (FIT), and a
watchdog timer. The time base is a 64-bit counter that gets incremented at a frequency either equal to the
processor core clock rate or as controlled by a separate asynchronous timer clock input to the core. No interrupt is generated as a result of the time base wrapping back to zero.
The DEC is a 32-bit register that is decremented at the same rate at which the time base is incremented. The
user loads the DEC register with a value to create the desired interval. When the register is decremented to
zero, a number of actions occur: the DEC stops decrementing, a status bit is set in the Timer Status Register
(TSR), and a decrementer exception is reported to the interrupt mechanism of the A2 core. Optionally, the
DEC can be programmed to reload automatically the value contained in the Decrementer Auto-Reload
Register (DECAR), after which the DEC resumes decrementing. The Timer Control Register (TCR) contains
the interrupt enable for the decrementer interrupt.
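The decrementer behavior described above can be modeled with a small class. This is a behavioral sketch under stated assumptions: the attribute names are hypothetical stand-ins for the DEC, DECAR, TSR status bit, and TCR control bits, and tick() returning True only indicates that the exception was reported while the interrupt enable is set, not that an interrupt is necessarily taken.

```python
class Decrementer:
    def __init__(self, value, auto_reload=False, reload_value=0,
                 interrupt_enabled=True):
        self.dec = value & 0xFFFFFFFF            # 32-bit DEC
        self.auto_reload = auto_reload           # stand-in for a TCR control
        self.reload_value = reload_value & 0xFFFFFFFF  # DECAR
        self.interrupt_enabled = interrupt_enabled     # stand-in for TCR enable
        self.status = False                      # stand-in for the TSR status bit

    def tick(self) -> bool:
        """Advance by one time base increment."""
        if self.dec == 0:
            return False  # DEC has stopped at zero
        self.dec -= 1
        if self.dec != 0:
            return False
        self.status = True  # decremented to zero: record status
        if self.auto_reload:
            self.dec = self.reload_value  # resume from the reload value
        return self.interrupt_enabled
```

Without auto-reload, a DEC loaded with 3 reports once on the third tick and then stays at zero; with auto-reload and a reload value of 2, it reports on every second tick.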
The FIT generates periodic interrupts based on the transition of a selected bit from the time base. Users can
select one of four intervals for the FIT period by setting a control field in the TCR to select the appropriate bit
from the time base. When the selected time base bit transitions from 0 to 1, a status bit is set in the TSR and
a fixed interval timer exception is reported to the interrupt mechanism of the A2 core. The FIT interrupt enable
is contained in the TCR.
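Because the FIT fires on the 0-to-1 transition of a selected time base bit, the interval depends only on the bit position. The sketch below assumes the 64-bit time base is numbered with bit 0 as the most-significant bit, per the notation conventions of this manual; which bit positions the TCR field can actually select is not stated here, so bit_index is a hypothetical parameter.

```python
def transition_period_ticks(bit_index: int) -> int:
    """Time base increments between successive 0-to-1 transitions of a bit."""
    # A bit with weight 2**(63 - bit_index) completes one full 0 -> 1 -> 0
    # cycle every 2**(64 - bit_index) increments.
    return 1 << (64 - bit_index)


def rising_edge(prev_tb: int, new_tb: int, bit_index: int) -> bool:
    """True if the selected bit went from 0 to 1 between two time base values."""
    shift = 63 - bit_index
    return ((prev_tb >> shift) & 1) == 0 and ((new_tb >> shift) & 1) == 1
```

For example, bit 61 has weight 4, so counting the time base from 0 it first transitions at value 4 and then again every 8 increments.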
Similar to the FIT, the watchdog timer also generates a periodic interrupt based on the transition of a selected
bit from the time base. Users can select one of four intervals for the watchdog period, again by setting a
control field in the TCR to select the appropriate bit from the time base. Upon the first transition from 0 to 1 of
the selected time base bit, a status bit is set in the TSR and a watchdog timer exception is reported to the
interrupt mechanism of the A2 core. The watchdog timer can also be configured to initiate a hardware reset if
a second transition of the selected time base bit occurs before the first watchdog exception is serviced.
This capability provides an extra measure of recoverability from potential system lock-ups.
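The two-stage watchdog behavior can be sketched as a small state machine. This is a simplified model of the description above, not the architected state machine: the class, method names, and outcome strings are hypothetical, and the behavior on a second transition when reset is not configured is not specified in this section, so the sketch reports no new action in that case.

```python
class Watchdog:
    def __init__(self, reset_on_second_timeout=False):
        self.reset_on_second_timeout = reset_on_second_timeout
        self.status = False  # stand-in for the TSR watchdog status bit

    def on_selected_bit_rising(self) -> str:
        """Handle a 0-to-1 transition of the selected time base bit."""
        if not self.status:
            self.status = True
            return "watchdog_exception"
        if self.reset_on_second_timeout:
            return "hardware_reset"
        return "no_new_action"  # assumption: not specified in this section

    def service(self) -> None:
        """Software services the watchdog exception and clears the status."""
        self.status = False
```

If software services each exception before the next transition, the reset path is never reached; a stalled system that fails to do so is reset on the following transition when so configured.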
The timer functions of the A2 core are more fully described in Timer Facilities on page 387.
1.4.6 Debug Facilities
The A2 core debug facilities include debug modes for the various types of debugging used during hardware
and software development. Also included are debug events that allow developers to control the debug
process. Debug modes and debug events are controlled using debug registers in the chip. The debug registers are accessed either through software running on the processor, or through the serial communications
(SCOM) port.
The debug modes, events, controls, and interfaces provide a powerful combination of debug facilities for
hardware development tools such as the RISCWatch debugger from IBM.
A brief overview of the debug modes and development tool support is provided below. Debug Facilities on
page 399 provides detailed information about each debug mode and other debug resources.
1.4.6.1 Debug Modes
The A2 core supports two debug modes: internal and external. Each mode supports a different type of debug
tool used in embedded systems development. Internal debug mode supports software-based ROM monitors,
and external debug mode supports a hardware emulator type of debug. The debug modes are controlled by
Debug Control Register 0 (DBCR0) and the setting of bits in the Machine State Register (MSR).
Internal debug mode supports accessing architected processor resources, setting hardware and software
breakpoints, and monitoring processor status. In internal debug mode, debug events can generate debug
exceptions, which can interrupt normal program flow so that monitor software can collect processor status
and alter processor resources.
Internal debug mode relies on exception-handling software—running on the processor—along with an
external communications path to debug software problems. This mode is used while the processor continues
executing instructions and enables debugging of problems in application or operating system code. Access to
debugger software executing in the processor while in internal debug mode is through a communications port
on the processor board, such as a serial port or Ethernet connection.
External debug mode supports stopping, starting, and single-stepping the processor, accessing architected
processor resources, setting hardware and software breakpoints, and monitoring processor status. In
external debug mode, debug events can architecturally “freeze” the processor. While the processor is frozen,
normal instruction execution stops, and the architected processor resources can be accessed and altered
using a debug tool (such as RISCWatch) attached through the SCOM port. This mode is useful for debugging
hardware and low-level control software problems.
1.4.6.2 Development Tool Support
The A2 core provides powerful debug support for a wide range of hardware and software development tools.
RISCWatch is an example of a development tool that uses the external debug mode, debug events, and the
SCOM port to support hardware and software development and debugging.
1.4.7 Floating-Point Unit Organization
The floating-point unit incorporates a single-issue instruction decode and issue unit and a 6-stage arithmetic
pipeline working in parallel with a 4-stage load/store pipeline. The floating-point unit contains a Floating-Point
Register (FPR) file that interfaces to both pipelines. There are thirty-two 64-bit FPRs.
Figure 1-2 illustrates the logical organization of the A2 core and its relationship to the A2 processor core.
Figure 1-2. A2 Processor Block Diagram
[Block diagram not reproduced. Labels in the figure: A2 Processor; A2 Core; Thread 0, Thread 1, Thread 2, Thread 3; AXU Interface; Floating-Point AXU; Instruction Decode/Issue Unit; Arithmetic Pipe; Load/Store Pipe; FPR0 through FPR31; FPSCR; CR; Data Cache.]
1.4.7.1 Arithmetic and Load/Store Pipelines
The A2 core has a single execution pipeline. The pipeline handles all computational instructions and reads
from and writes to the FPRs, Floating-Point Status and Control Register (FPSCR), and the Condition Register
(CR).
1.4.8 IEEE 754 and Architectural Compliance
The A2 core is IEEE 754 and Power ISA compliant and implements single-precision and double-precision
instructions.
1.4.8.1 IEEE 754 Compliance
IEEE 754 requires a certain set of operations to be included in any implementation that claims to be
compliant. Such operations can be implemented in hardware, software, or a combination of the two. The
Power ISA floating-point architecture includes most of the required operations but some are missing. The
missing operations are: floating-point remainder, format conversion between binary and decimal, and format
conversion from integer to floating-point. It is necessary to provide a software library to support these missing
functions. In other words, the Power ISA Architecture requires software support to be fully compliant with the
IEEE standard.
1.4.9 Floating-Point Unit Implementation
Certain aspects of the behavior of the floating-point unit are implementation-specific.
1.4.9.1 Reciprocal Estimates
While the Power ISA Architecture defines single-precision reciprocal estimates and reciprocal square root
estimates to have relative errors of 2^-5 and 2^-8 respectively, both are implemented in the A2 core to have a
relative error of 2^-14.
Programmers are encouraged to take advantage of this increased accuracy, but must be aware that code
that relies on this increased accuracy might not work on any other Power ISA FU.
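Code that must remain portable can avoid depending on the accuracy of the initial estimate by refining it. The following is a minimal sketch in C of one Newton-Raphson refinement step for a reciprocal. The function name is hypothetical, and the estimate is passed in as an ordinary argument rather than produced by a reciprocal estimate instruction.

```c
/* One Newton-Raphson step for the reciprocal 1/b, starting from an
 * estimate x0. If x0 = (1/b)(1 + e), then b*x0 = 1 + e and the result
 * is (1/b)(1 + e)(1 - e) = (1/b)(1 - e^2): the relative error is
 * squared, so the number of correct bits roughly doubles. */
double refine_reciprocal(double b, double x0)
{
    return x0 * (2.0 - b * x0);
}
```

Starting from a relative error of 2^-14, one step gives roughly 2^-28. Starting from 2^-8, one step gives only about 2^-16, so an additional step is needed to reach similar accuracy.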
1.4.9.2 Denormalized B Operands
The floating-point unit supports all denormal numbers in the dataflow with no additional latency except the
following cases:
1. B is a double-precision denorm AND NOT (move{fabs/fnabs/fneg} OR fsel OR fcfid OR mv_to_fpscr).
2. B is a single-precision denorm AND NOT (move{fabs/fnabs/fneg} OR fsel).
If any of the above cases are detected, the A2 core flushes to the microcode engine, which in turn issues a
prenormalization instruction, followed by the original instruction. The latency for these operations increases
by 20 cycles when this occurs.
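When tuning code, it can help to check whether the data contains denormalized values at all. The following sketch tests a double-precision value by inspecting its bit pattern. It assumes the host represents double as IEEE 754 binary64; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns true if x is a denormalized (subnormal) double: the 11-bit
 * exponent field is zero and the 52-bit fraction field is nonzero.
 * Assumes double is IEEE 754 binary64. */
bool is_denormal_double(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret without aliasing issues */
    uint64_t exponent = (bits >> 52) & 0x7FF;
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;
    return exponent == 0 && fraction != 0;
}
```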
1.4.9.3 Non-IEEE mode
Non-IEEE mode, controlled by the NI bit in the FPSCR, is intended to eliminate data-dependent overhead
cycles caused by exceptional operands or results. The result is faster, deterministic performance with reasonable results. This mode is not supported by the A2 core. The value of the NI bit is ignored.
1.4.10 Floating-Point Unit Interfaces
The floating-point unit interfaces to the A2 processor core.
1.4.10.1 A2 Processor Core Interface
This interface enables the A2 core to interact with the A2 processor core. Interactions include resets and
updating the CR.
1.4.10.2 Clock and Power Management Interface
The CPM interface supports clock distribution and power management to reduce power consumption below
the normal operational level. External logic is necessary for the sleep mode to function.
1.5 Core Interfaces
The core includes the following interfaces:
• System interface
• Auxiliary execution unit (AXU) port
• SCOM, debug, trace, and performance monitor event ports
• Interrupt interface
• Clock and power management interface
Several of these interfaces are described briefly in the sections below.
1.5.1 System Interface
The A2 core interface has one command interface for instruction reads, data reads, and data writes, and uses
a 42-bit address bus. A full 64-byte cache line is implied for cacheable data reads and cacheable instruction
fetches. The transfer length indicates 1, 2, 4, 8, 16, or 32 bytes for
noncacheable reads and 16 bytes for noncacheable instruction fetches. There is a 256-bit data interface for
data writes with 32 byte enables indicating which bytes should be written.
Data writes can be 1, 2, 4, 8, or 16 bytes for noncacheable or cacheable writes. There is a
128-bit data reload interface for instruction reads and data reads. When the reload data is less than 16 bytes
(because the transfer length indicates 1, 2, 4, or 8 bytes), the data should be aligned within the 16-byte
reload bus based on the associated command interface address. There is a back invalidate interface for
systems with an entity outside the A2 core (such as an L2 cache controller) that provides hardware cache
coherency.
A2 supports a mode that enables a 32-byte write bus to the A2 core/L2 interface. Only the AXU can produce
32-byte writes.
The command interface is a credit-based interface. The A2 core can handle up to eight load-type credits. The
actual number of load-type credits (L) that it will handle is initialized in the A2 core configuration ring. In the A2
core, there is a 12-entry load command queue that includes eight entries for data loads and four entries for
instruction fetches. An entity outside the A2 core is expected to have a near queue of L entries for load-type
operations and to give a pop indication to the A2 core as each is sent to the far queue that contains 8 to 12
entries. The specific command is indicated in the transaction type.
Examples of transaction types that expect data to be returned on the reload bus are instruction fetch, load,
and dcbt. Examples of transaction types that do not expect data to be returned on the reload bus are store,
dcbz and dcbf. The A2 core can handle up to 32 store-type credits. The actual number of credits (S) that it
will handle is initialized in the A2 core configuration ring.
An entity outside the A2 core is expected to be able to queue the S store-type operations and give a pop indication to the A2 core for each as it is processed and the queue entry is available. An entity outside the A2
core that also supports store gathering should give a gather indication to the A2 core when a store is gathered with an existing queue entry, to let the A2 core know that an additional queue entry is available.
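The credit scheme described above can be sketched as a simple counter on the sending side. This is an illustrative model only: the type and function names are hypothetical, and the initial value stands in for the L or S credit count loaded from the configuration ring.

```c
#include <stdbool.h>

/* Hypothetical model of one credit pool (load-type or store-type). */
typedef struct {
    unsigned credits; /* commands the core may still send */
} credit_counter;

void credit_init(credit_counter *c, unsigned initial)
{
    c->credits = initial;
}

/* Try to send one command. Returns false if no credit is available,
 * in which case the sender must wait. */
bool credit_try_send(credit_counter *c)
{
    if (c->credits == 0)
        return false;
    c->credits--;
    return true;
}

/* A pop indication (or, for stores, a gather indication) tells the
 * sender that a queue entry is available again. */
void credit_return(credit_counter *c)
{
    c->credits++;
}
```

Because the sender never issues a command without a credit, the receiving queue cannot overflow as long as the initial credit count does not exceed its depth.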
1.5.2 Auxiliary Execution Unit (AXU) Port
This interface provides the A2 core with the flexibility to attach a tightly-coupled coprocessor-type macro
incorporating instructions that go beyond those provided within the processor core itself. The AXU port
provides sufficient functionality for attachment of various coprocessor functions such as a fully-compliant
Power ISA floating-point unit (single- or double-precision), multimedia engine, DSP, or other custom function
implementing algorithms appropriate for specific system applications. The AXU interface can be
used with macros that contain their own register files. AXU load and store instructions can directly access the
A2 core data cache, with operands of up to a double quadword (32 bytes) in length.
The AXU interface provides the capability for a coprocessor to execute instructions that are not part of the
Power ISA instruction set at the same time that the A2 core is executing Power ISA instructions. Areas within
the architected instruction space allow for these customer-specific or application-specific AXU instruction set
extensions. Further description is beyond the scope of this document.
1.5.3 JTAG Port
The A2 core SCOM port supports the indirect attachment of a debug tool such as the RISCWatch product
from IBM. A logic block outside the A2 core must provide JTAG to SCOM port translation. Through the SCOM
port, and using the debug facilities designed into the A2 core, a debug workstation can single-step the
processor and interrogate the internal processor state to facilitate hardware and software debugging.
2. CPU Programming Model
The programming model of the A2 core describes how the following features and operations of the core
appear to programmers:
• Logical Partitioning on page 61
• Storage Addressing on page 62
• Multithreading on page 70
• Registers on page 82
• 32-Bit Mode on page 85
• Instruction Categories on page 86
• Instruction Classes on page 87
• Implemented Instruction Set Summary on page 88
• Wait Instruction on page 98
• Branch Processing on page 99
• Integer Processing on page 110
• Processor Control on page 113
• Privileged Modes on page 120
• Speculative Accesses on page 122
• Synchronization on page 122
• Software Transactional Memory Acceleration on page 125
2.1 Logical Partitioning
2.1.1 Overview
Logical partitioning defines instructions, resources, and methods for establishing an additional attribute of
processor privilege called a guest state.
The Embedded.Hypervisor category permits processors and portions of real storage to be assigned to logical
collections called partitions such that a program executing on a processor in one partition cannot interfere
with any program executing on a processor in a different partition. This isolation can be provided for both
problem state and privileged state programs by using a layer of trusted software called a hypervisor program
(or simply a “hypervisor”) and the resources provided by this category to manage system resources. The
collection of software that runs in a given partition and its associated resources is called a guest. The guest
normally includes an operating system (or other system software) running in privileged state and its associated processes running in the problem state under the management of the hypervisor. The processor is in the
guest state when a guest is executing, and it is in the hypervisor state when the hypervisor is executing. The
processor is executing in the guest state when MSR[GS] = 1.
A2 implements 2^8 partitions. See Section 6.17.2 Logical Partition ID Register (LPIDR) on page 245. All
threads of a single A2 core must be assigned to the same logical partition.
A processor is assigned to one partition at any given time. A processor can be assigned to any given partition
without consideration of the physical configuration of the system (for example, shared registers, caches,
organization of the storage hierarchy), except that processors that share certain hypervisor resources might
need to be assigned to the same partition. Additionally, certain resources can be used by the guest at the
discretion of the hypervisor. Such usage might cause interference between partitions, and the hypervisor
should allocate those resources accordingly. The primary registers and facilities used to control logical partitioning are described in the following subsections. Other facilities associated with logical partitioning are
described in the appropriate sections of this book.
Category Embedded.Hypervisor changes the operating system programming model to allow for easier virtualization, while retaining a default backwards compatible mode where an operating system written for processors not containing this category will still operate as before without using the logical partitioning facilities.
2.2 Storage Addressing
As a 64-bit implementation of the Power ISA Architecture, the A2 core implements a uniform 64-bit effective
address (EA) space. Effective addresses are expanded into virtual addresses and then translated to 42-bit
(4 TB) real addresses by the memory management unit (see Memory Management on page 185 for more
information about the translation process). The organization of the real address space into a physical address
space is system-dependent, and is described in the user’s manuals for chip-level products that incorporate an
A2 core.
The A2 core generates an effective address whenever it executes a storage access, branch, cache management, or translation lookaside buffer (TLB) management instruction, or when it fetches the next sequential
instruction.
2.2.1 Storage Operands
Bytes in storage are numbered consecutively starting with 0. Each number is the address of the corresponding byte.
Data storage operands accessed by the integer load/store instructions can be bytes, halfwords, words,
doublewords or—for load/store multiple and string instructions—a sequence of words or bytes, respectively.
Data storage operands accessed by auxiliary execution unit (AXU) load/store instructions can be bytes, halfwords, words, doublewords, quadwords or double quadwords. The address of a storage operand is the
address of its first byte (that is, of its lowest-numbered byte). Byte ordering can be either big endian or little
endian, as controlled by the endian storage attribute (see Byte Ordering on page 66; also see Endian (E) on
page 197 for more information about the endian storage attribute).
Operand length is implicit for each scalar storage access instruction type (that is, each storage access
instruction type other than the load/store multiple and string instructions). The operand of such a scalar
storage access instruction has a “natural” alignment boundary equal to the operand length. In other words,
the natural address of an operand is an integral multiple of the operand length. A storage operand is said to
be aligned if it is aligned at its natural boundary; otherwise, it is said to be unaligned.
Data storage operands for storage access instructions have the characteristics shown in Table 2-1 on
page 63.
Table 2-1. Data Operand Definitions
Storage Access Instruction Type    Operand Length    Addr[59:63] if Aligned
Byte (or String)                   1 byte            0bxxxxx
Halfword                           2 bytes           0bxxxx0
Word (or Multiple)                 4 bytes           0bxxx00
Doubleword                         8 bytes           0bxx000
Quadword (AXU only)                16 bytes          0bx0000
Double Quadword (AXU only)         32 bytes          0b00000
Note: An “x” in an address bit position indicates that the bit can be 0 or 1 independent of the state of other bits in the address.
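The natural alignment rule in Table 2-1 amounts to requiring that the low-order address bits covered by the operand length are zero. A minimal sketch in C follows; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if ea is aligned at the natural boundary of an operand that is
 * len bytes long. len must be a power of two (1, 2, 4, 8, 16, or 32),
 * so len - 1 is a mask of the address bits that must be zero. */
bool is_naturally_aligned(uint64_t ea, unsigned len)
{
    return (ea & (uint64_t)(len - 1)) == 0;
}
```

For example, a word at address 0x1002 is unaligned, while a halfword at the same address is aligned.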
The alignment of the operand effective address of some storage access instructions might affect performance; in some cases, it might cause an alignment exception to occur. For such storage access instructions,
the best performance is obtained when the storage operands are aligned. Table 2-2 summarizes the effects
of alignment on those storage access instruction types for which such effects exist. If an instruction type is not
shown in the table, there are no alignment effects for that instruction type.
Table 2-2. Alignment Effects for Storage Access Instructions
Storage Access Instruction Type                                      Alignment Effects
Integer cacheable load halfword                                      Broken into byte accesses if crosses 32-byte boundary (EA[59:63] = 0b11111); otherwise no effect. (See notes.)
Integer cacheable store or caching inhibited load/store halfword     Broken into byte accesses if crosses 16-byte boundary (EA[60:63] = 0b1111); otherwise no effect. (See notes.)
Integer cacheable load word                                          Broken into byte accesses if crosses 32-byte boundary (EA[59:63] > 0b11100); otherwise no effect. (See notes.)
Integer cacheable store or caching inhibited load/store word         Broken into byte accesses if crosses 16-byte boundary (EA[60:63] > 0b1100); otherwise no effect. (See notes.)
Integer cacheable load doubleword                                    Broken into byte accesses if crosses 32-byte boundary (EA[59:63] > 0b11000); otherwise no effect. (See notes.)
Integer cacheable store or caching inhibited load/store doubleword   Broken into byte accesses if crosses 16-byte boundary (EA[60:63] > 0b1000); otherwise no effect. (See notes.)
Integer load/store multiple                                          Broken into a series of word (4-byte) accesses until the last word is accessed. The load/store multiple address must be word aligned. (See notes.)
Integer load/store string                                            Broken into a series of byte accesses until the last byte is accessed. (See notes.)
AXU cacheable load halfword                                          Broken into byte accesses if crosses 32-byte boundary (EA[59:63] = 0b11111); otherwise no effect. (See notes.)
AXU cacheable store or caching inhibited load/store halfword         Broken into byte accesses if crosses 16-byte boundary (EA[60:63] = 0b1111); otherwise no effect. (See notes.)
AXU cacheable load word                                              Broken into byte accesses if crosses 32-byte boundary (EA[59:63] > 0b11100); otherwise no effect. (See notes.)
AXU cacheable store or caching inhibited load/store word             Broken into byte accesses if crosses 16-byte boundary (EA[60:63] > 0b1100); otherwise no effect. (See notes.)
AXU cacheable load doubleword                                        Broken into byte accesses if crosses 32-byte boundary (EA[59:63] > 0b11000); otherwise no effect. (See notes.)
AXU cacheable store or caching inhibited load/store doubleword       Broken into byte accesses if crosses 16-byte boundary (EA[60:63] > 0b1000); otherwise no effect. (See notes.)
AXU cacheable load quadword                                          Broken into byte accesses if crosses 32-byte boundary (EA[59:63] > 0b10000); otherwise no effect. (See notes.)
AXU cacheable store or caching inhibited load/store quadword         Broken into byte accesses if crosses 16-byte boundary (EA[60:63] > 0b0000); otherwise no effect. (See notes.)
AXU cacheable load double quadword                                   Broken into byte accesses if crosses 32-byte boundary (EA[59:63] > 0b00000); otherwise no effect. (See notes.)
AXU cacheable store or caching inhibited load/store double quadword  Broken into byte accesses if crosses 16-byte boundary (EA[60:63] > 0b0000); otherwise no effect. (See notes.)
Notes:
• Any unaligned access that also crosses a 4 K page boundary causes an alignment exception.
• An auxiliary processor can specify that the EA for a given AXU load/store instruction must be aligned at the operand-size boundary
or, alternatively, at a word boundary. If the AXU so indicates this requirement and the calculated EA fails to meet it, the A2 core
generates an alignment exception. Alternatively, an auxiliary processor can specify that the EA for a given AXU load/store instruction should be “forced” to be aligned by ignoring the appropriate number of low-order EA bits and processing the AXU load/store as
if those bits were 0. Byte, halfword, word, doubleword, and quadword AXU load/store instructions ignore 0, 1, 2, 3, and 4 low-order
EA bits, respectively.
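Most of the boundary conditions in Table 2-2 follow from one test: an operand crosses a boundary when its offset within the block plus its length exceeds the block size. The sketch below expresses that test in C. The function name is hypothetical, and the test applies only when the operand is no longer than the boundary, so it does not cover the double quadword store row, where the table gives the condition EA[60:63] > 0b0000 directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an operand of len bytes starting at ea extends past the end
 * of the boundary-byte block that contains ea. boundary must be a
 * power of two, and len must not exceed boundary. */
bool crosses_boundary(uint64_t ea, unsigned len, unsigned boundary)
{
    uint64_t offset = ea & (uint64_t)(boundary - 1); /* offset within the block */
    return offset + len > boundary;
}
```

For a word and a 32-byte boundary, the test is true exactly when EA[59:63] > 0b11100, matching the table. Calling it with a boundary of 4096 checks the 4 K page crossing mentioned in the notes.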
Cache management instructions access cache block operands; for the A2 core, the cache block size is 64
bytes. However, the effective addresses calculated by cache management instructions are not required to be
aligned on cache block boundaries. Instead, the architecture specifies that the associated low-order effective
address bits (bits 58:63 for the A2 core) are ignored during the execution of these instructions.
Similarly, the TLB management instructions access page operands, and—as determined by the page size—
the associated low-order effective address bits are ignored during the execution of these instructions.
Instruction storage operands, on the other hand, are always 4 bytes long, and the effective addresses calculated by branch instructions are therefore always word-aligned.
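To illustrate the low-order bits that cache management instructions ignore, the following sketch computes the address of the cache block containing a given effective address. The macro and function names are hypothetical; the 64-byte block size is the value stated above for the A2 core.

```c
#include <stdint.h>

#define CACHE_BLOCK_SIZE 64u /* A2 cache block size in bytes */

/* Address of the cache block containing ea: the six low-order
 * effective address bits (bits 58:63) are cleared. */
uint64_t cache_block_address(uint64_t ea)
{
    return ea & ~(uint64_t)(CACHE_BLOCK_SIZE - 1);
}
```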
2.2.2 Effective Address Calculation
For a storage access instruction, if the sum of the effective address and the operand length exceeds the
maximum effective address of 2^64 - 1 for 64-bit mode or 2^32 - 1 in 32-bit mode (that is, the storage operand
itself crosses the maximum address boundary), the result of the operation is undefined, as specified by the
architecture. The A2 core performs the operation as if the storage operand wrapped around from the
maximum effective address to effective address 0. Software, however, should not depend upon this behavior,
so that it can be ported to other implementations that do not handle this scenario in the same fashion. Accordingly, software should ensure that no data storage operands cross the maximum address boundary.
Note: Because instructions are words and because the effective addresses of instructions are always implicitly on word boundaries, it is not possible for an instruction storage operand to cross any word boundary,
including the maximum address boundary.
Effective address arithmetic, which calculates the starting address for storage operands, wraps around from
the maximum address to address 0 for all effective address computations except next sequential instruction
fetching. See Instruction Storage Addressing Modes on page 65 for more information about next sequential
instruction fetching at the maximum address boundary.
2.2.2.1 Data Storage Addressing Modes
There are two data storage addressing modes supported by the A2 core:
• Base + displacement (D-mode) addressing mode:
The 16-bit D field is sign-extended and added to the contents of the GPR designated by RA or to zero if
RA = 0.
• Base + index (X-mode) addressing mode:
The contents of the GPR designated by RB (or the value 0 for lswi and stswi) are added to the contents
of the GPR designated by RA or to 0 if RA = 0.
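The two data storage addressing modes can be sketched in C as follows. The function and parameter names are hypothetical, only 64-bit mode is modeled, and the caller is assumed to pass 0 as the RB value for lswi and stswi. Unsigned 64-bit addition wraps modulo 2^64, which matches the wraparound of effective address arithmetic described above.

```c
#include <stdint.h>

/* D-mode: sign-extend the 16-bit D field and add it to the contents
 * of GPR[RA], or to 0 if the RA field is 0. */
uint64_t ea_d_mode(unsigned ra_field, uint64_t ra_value, int16_t d)
{
    uint64_t base = (ra_field == 0) ? 0 : ra_value;
    uint64_t disp = (uint64_t)(int64_t)d; /* sign extension to 64 bits */
    return base + disp;                   /* wraps modulo 2^64 */
}

/* X-mode: add the contents of GPR[RB] to the contents of GPR[RA],
 * or to 0 if the RA field is 0. */
uint64_t ea_x_mode(unsigned ra_field, uint64_t ra_value, uint64_t rb_value)
{
    uint64_t base = (ra_field == 0) ? 0 : ra_value;
    return base + rb_value;
}
```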
2.2.2.2 Instruction Storage Addressing Modes
There are four instruction storage addressing modes supported by the A2 core:
• I-form branch instructions (unconditional):
The 24-bit LI field is concatenated on the right with 0b00, sign-extended, and then added to either the
address of the branch instruction if AA = 0 or to 0 if AA = 1.
• Taken B-form branch instructions:
The 14-bit BD field is concatenated on the right with 0b00, sign-extended, and then added to either the
address of the branch instruction if AA = 0 or to 0 if AA = 1.
• Taken XL-form branch instructions:
The contents of bits 0:61 of the Link Register (LR) or the Count Register (CTR) are concatenated on the
right with 0b00 to form the 64-bit effective address of the next instruction.
Note: In 32-bit mode, the A2 core forces bits 0:31 of the calculated 64-bit effective address to zeros.
• Next sequential instruction fetching (including nontaken branch instructions):
The value 4 is added to the address of the current instruction to form the 64-bit effective address of the
next instruction. If the address of the current instruction is 0xFFFF_FFFF_FFFF_FFFC in 64-bit mode or
0x0000_0000_FFFF_FFFC in 32-bit mode, the A2 core wraps the next sequential instruction address
back to address 0. This behavior is not required by the architecture, which specifies that the next sequential instruction address is undefined under these circumstances. Therefore, software should not depend
upon this behavior, so that it can be ported to other implementations that do not handle this scenario in
the same fashion. Accordingly, if software wants to execute across this maximum address boundary and
wrap back to address 0, it should place an unconditional branch at the boundary with a displacement of 4.
In addition to the above four instruction storage addressing modes, the following behavior applies to
branch instructions:
• Any branch instruction with LK = 1:
The value 4 is added to the address of the current instruction and the low-order 64 bits of the result are
placed into the LR. As for the similar scenario for next sequential instruction fetching, if the address of the
branch instruction is 0xFFFF_FFFF_FFFF_FFFC in 64-bit mode or 0x0000_0000_FFFF_FFFC in 32-bit
mode, the result placed into the LR is architecturally undefined, although once again the A2 core wraps
the LR update value back to address 0. Again, however, software should not depend on this behavior so
that it can be ported to implementations that do not handle this scenario in the same fashion.
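The I-form and B-form target calculations above can be sketched in C as follows. The function names are hypothetical, the field arguments hold the raw LI or BD bits in their low-order positions, and only 64-bit mode is modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* I-form: LI (24 bits) is concatenated with 0b00 to form a 26-bit
 * value, sign-extended, and added to the branch address (AA = 0)
 * or to 0 (AA = 1). */
uint64_t branch_target_i_form(uint64_t cia, uint32_t li, bool aa)
{
    uint32_t field = (li & 0xFFFFFFu) << 2;
    int64_t disp = (int64_t)field;
    if (field & 0x2000000u)            /* sign bit of the 26-bit value */
        disp -= (int64_t)1 << 26;
    return (aa ? 0 : cia) + (uint64_t)disp;
}

/* B-form: BD (14 bits) is concatenated with 0b00 to form a 16-bit
 * value, then handled in the same way. */
uint64_t branch_target_b_form(uint64_t cia, uint32_t bd, bool aa)
{
    uint32_t field = (bd & 0x3FFFu) << 2;
    int64_t disp = (int64_t)field;
    if (field & 0x8000u)               /* sign bit of the 16-bit value */
        disp -= (int64_t)1 << 16;
    return (aa ? 0 : cia) + (uint64_t)disp;
}
```

With LI = 1 (a displacement of 4) and a branch at 0xFFFF_FFFF_FFFF_FFFC, the computed target wraps to address 0, which is the explicit-branch technique recommended above for crossing the maximum address boundary.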
2.2.3 Byte Ordering
If scalars (individual data items and instructions) were indivisible, there would be no such concept as “byte
ordering.” It is meaningless to consider the order of bits or groups of bits within the smallest addressable unit
of storage, because nothing can be observed about such order. Only when scalars, which the programmer
and processor regard as indivisible quantities, can comprise more than one addressable unit of storage does
the question of order arise.
For a machine in which the smallest addressable unit of storage is the 64-bit doubleword, there is no question
of the ordering of bytes within doublewords. All transfers of individual scalars between registers and storage
are of doublewords, and the address of the byte containing the high-order 8 bits of a scalar is no different
from the address of a byte containing any other part of the scalar.
For the Power ISA Architecture, as for most current computer architectures, the smallest addressable unit of
storage is the 8-bit byte. Many scalars are halfwords, words, or doublewords that consist of groups of bytes.
When a word-length scalar is moved from a register to storage, the scalar occupies 4 consecutive byte
addresses. It thus becomes meaningful to discuss the order of the byte addresses with respect to the value of
the scalar: which byte contains the highest-order 8 bits of the scalar, which byte contains the next-highest-order 8 bits, and so on.
Given a scalar that contains multiple bytes, the choice of byte ordering is essentially arbitrary. There are 24
ways to specify the ordering of 4 bytes within a word, but only two of these orderings are sensible:
• The ordering that assigns the lowest address to the highest-order (left-most) 8 bits of the scalar, the next
sequential address to the next-highest-order 8 bits, and so on.
This ordering is called big endian because the “big end” (most-significant end) of the scalar, considered
as a binary number, comes first in storage. IBM RISC System/6000, IBM System/390®, and Motorola
680x0 are examples of computer architectures using this byte ordering.
• The ordering that assigns the lowest address to the lowest-order (“right-most”) 8 bits of the scalar, the
next sequential address to the next-lowest-order 8 bits, and so on.
This ordering is called little endian because the “little end” (least-significant end) of the scalar, considered
as a binary number, comes first in storage. The Intel x86 is an example of a processor architecture using
this byte ordering.
Power ISA supports both big-endian and little-endian byte ordering, for both instruction and data storage
accesses. Which byte ordering is used is controlled on a memory page basis by the endian (E) storage
attribute, which is a field within the TLB entry for the page. The endian storage attribute is set to 0 for a big-endian page and is set to 1 for a little-endian page. See Memory Management on page 185 for more information about memory pages, the TLB, and storage attributes, including the endian storage attribute.
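The two byte orderings can be made concrete with a short sketch that stores a word-length scalar into four consecutive bytes. The function names are hypothetical; the code is written with shifts so that it behaves the same on any host.

```c
#include <stdint.h>

/* Big endian: the most-significant byte goes to the lowest address. */
void store_word_big_endian(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Little endian: the least-significant byte goes to the lowest address. */
void store_word_little_endian(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}
```

Storing 0x11121314 produces the byte sequence 11 12 13 14 in big-endian order and 14 13 12 11 in little-endian order.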
2.2.3.1 Structure Mapping Examples
The following C language structure, s, contains an assortment of scalars and a character string. The
comments show the value assumed to be in each structure element; these values show how the bytes
comprising each structure element are mapped into storage.
struct {
    int a;          /* 0x1112_1314 word */
    long long b;    /* 0x2122_2324_2526_2728 doubleword */
    int c;          /* 0x3132_3334 word */
    char d[7];      /* 'A','B','C','D','E','F','G' array of bytes */
    short e;        /* 0x5152 halfword */
    int f;          /* 0x6162_6364 word */
} s;
C structure mapping rules permit the use of padding (skipped bytes) to align scalars on desirable boundaries.
The following structure mapping examples show each scalar aligned at its natural boundary. This alignment
introduces padding of 4 bytes between a and b, one byte between d and e, and two bytes between e and f.
The same amount of padding is present in both big-endian and little-endian mappings.
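The padding described above can be reproduced by placing each member at the next offset that is a multiple of its alignment. The sketch below computes such a layout from member sizes and alignments. It is a generic illustration with hypothetical function names, not a statement about what any particular compiler or ABI does.

```c
#include <stddef.h>

/* Round offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Place n members in order, each at its natural alignment. Writes the
 * offset of each member to offset_out and returns the total size,
 * padded to the largest member alignment. */
size_t layout_struct(const size_t *size, const size_t *align, size_t n,
                     size_t *offset_out)
{
    size_t offset = 0;
    size_t max_align = 1;
    for (size_t i = 0; i < n; i++) {
        offset = align_up(offset, align[i]); /* insert padding if needed */
        offset_out[i] = offset;
        offset += size[i];
        if (align[i] > max_align)
            max_align = align[i];
    }
    return align_up(offset, max_align);
}
```

For structure s, with sizes 4, 8, 4, 7, 2, 4 and alignments 4, 8, 4, 1, 2, 4, the members land at offsets 0x00, 0x08, 0x10, 0x14, 0x1C, and 0x20, and the total size is 40 bytes.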
Big-Endian Mapping
The big-endian mapping of structure s follows. Addresses, in hexadecimal, are below the data stored at the
address. The contents of each byte, as defined in structure s, are shown as a (hexadecimal) number or
character (for the string elements). Padding bytes are shown as --.
11    12    13    14    --    --    --    --
0x00  0x01  0x02  0x03  0x04  0x05  0x06  0x07
21    22    23    24    25    26    27    28
0x08  0x09  0x0A  0x0B  0x0C  0x0D  0x0E  0x0F
31    32    33    34    'A'   'B'   'C'   'D'
0x10  0x11  0x12  0x13  0x14  0x15  0x16  0x17
'E'   'F'   'G'   --    51    52    --    --
0x18  0x19  0x1A  0x1B  0x1C  0x1D  0x1E  0x1F
61    62    63    64    --    --    --    --
0x20  0x21  0x22  0x23  0x24  0x25  0x26  0x27
Little-Endian Mapping
Structure s is shown mapped little endian, with padded bytes again shown as --.

 14   13   12   11   --   --   --   --
0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07

 28   27   26   25   24   23   22   21
0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F

 34   33   32   31  'A'  'B'  'C'  'D'
0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17

'E'  'F'  'G'   --   52   51   --   --
0x18 0x19 0x1A 0x1B 0x1C 0x1D 0x1E 0x1F

 64   63   62   61
0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27
2.2.3.2 Instruction Byte Ordering
Power ISA defines instructions as aligned words (4 bytes) in memory. As such, instructions in a big-endian
program image are arranged with the most-significant byte (MSB) of the instruction word at the lowest-numbered address.
Consider the big-endian mapping of instruction p at address 0x00, where, for example, p = add r7, r7, r4:

MSB            LSB
0x00 0x01 0x02 0x03

On the other hand, in a little-endian mapping the same instruction is arranged with the least-significant byte
(LSB) of the instruction word at the lowest-numbered address:

LSB            MSB
0x00 0x01 0x02 0x03
By the definition of Power ISA bit numbering, the most-significant byte of an instruction is the byte containing
bits 0:7 of the instruction. As depicted in the instruction format diagrams (see Instruction Formats in the Power ISA specification), this most-significant byte is the one that contains the primary opcode field (bits 0:5).
Due to this difference in byte orderings, the processor must perform whatever byte reversal is required
(depending on the particular byte ordering in use) to correctly deliver the opcode field to the instruction
decoder. In the A2 core, this reversal is performed between the memory interface and the instruction cache,
according to the value of the endian storage attribute for each memory page, such that the bytes in the
instruction cache are always correctly arranged for delivery directly to the instruction decoder.
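The effect of this reversal can be modeled in a few lines of C. This is an illustrative sketch only: store_insn and fetch_primary_opcode are hypothetical helpers, the e_attr argument stands in for the endian storage attribute (0 for big endian, 1 for little endian), and 0x7CE72214 is assumed here to be the encoding of add r7, r7, r4.

```c
#include <stdint.h>

/* Hypothetical helper: place a 32-bit instruction word into 4 bytes of
   memory using the byte ordering selected by e_attr. */
static void store_insn(uint8_t mem[4], uint32_t insn, int e_attr)
{
    for (int i = 0; i < 4; i++) {
        /* i = 0 selects the most-significant byte (instruction bits 0:7). */
        uint8_t byte = (uint8_t)(insn >> (24 - 8 * i));
        mem[e_attr ? 3 - i : i] = byte;
    }
}

/* Hypothetical helper: read the word back, reversing the bytes when the
   page is little endian, and return the primary opcode (bits 0:5). */
static unsigned fetch_primary_opcode(const uint8_t mem[4], int e_attr)
{
    uint32_t insn = 0;
    for (int i = 0; i < 4; i++)
        insn = (insn << 8) | mem[e_attr ? 3 - i : i];
    return (unsigned)(insn >> 26);
}
```

In this model the decoder sees the same opcode for either byte ordering, because the reversal is undone before the word is interpreted.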
If the endian storage attribute for a memory page is reprogrammed from one byte ordering to the other, the
contents of the memory page must be reloaded with program and data structures that are in the appropriate
byte ordering. Furthermore, anytime the contents of instruction memory change, the instruction cache must
be made coherent with the updates by invalidating the instruction cache and refetching the updated memory
contents with the new byte ordering.
2.2.3.3 Data Byte Ordering
Unlike instruction fetches, data accesses cannot be byte-reversed between memory and the data cache.
Data byte ordering in memory depends upon the data type (byte, halfword, word, and so on) of a specific data
item. It is only when moving a data item of a specific type from or to an architected register (as directed by the
execution of a particular storage access instruction) that it becomes known what kind of byte reversal might
be required due to the byte ordering of the memory page containing the data item. Therefore, byte reversal
during load or store accesses is performed between the data cache (or memory, on a data cache miss, for
example) and the load register target or store register source, depending on the specific type of load or store
instruction (that is, byte, halfword, word, and so on).
Comparing the big-endian and little-endian mappings of structure s, as shown in Structure Mapping Examples
on page 66, the difference between the byte locations of any data item in the structure depends upon
the size of the particular data item. For example (again referring to the big-endian and little-endian mappings
of structure s):
• The word a has its 4 bytes reversed within the word spanning addresses 0x00 – 0x03.
• The halfword e has its 2 bytes reversed within the halfword spanning addresses 0x1C – 0x1D.
Note: The array of bytes d, where each data item is a byte, is not reversed when the big-endian and little-endian mappings are compared. For example, the character 'A' is located at address 0x14 in both the big-endian and little-endian mappings.
The size of the data item being loaded or stored must be known before the processor can decide whether,
and if so, how, to reorder the bytes when moving them between a register and the data cache (or memory).
• For byte loads and stores, including strings, no reordering of bytes occurs regardless of byte ordering.
• For halfword loads and stores, bytes are reversed within the halfword for one byte order with respect to
the other.
• For word loads and stores (including load/store multiple), bytes are reversed within the word for one byte
order with respect to the other.
• For doubleword loads and stores, bytes are reversed within the doubleword for one byte order with
respect to the other.
• For quadword loads and stores (AXU loads/stores only), bytes are reversed within the quadword for one
byte order with respect to the other.
Note: This mechanism applies independent of the alignment of data. In other words, when loading a multibyte data operand with a scalar load instruction, bytes are accessed from the data cache (or memory) starting
with the byte at the calculated effective address and continuing with consecutively higher-numbered bytes
until the required number of bytes have been retrieved. Then, the bytes are arranged such that either the byte
from the highest-numbered address (for big-endian storage regions) or the lowest-numbered address (for little-endian storage regions) is placed into the least-significant byte of the register. The rest of the register is
filled in corresponding order with the rest of the accessed bytes. An analogous procedure is followed for scalar store instructions.
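The scalar load procedure in the note above can be sketched in C as follows. This is a software model for illustration, not a description of the hardware datapath; load_scalar and its little_endian parameter (standing in for the endian storage attribute of the page) are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a scalar load of `size` bytes (1, 2, 4, or 8) starting at the
   effective address represented by `mem`. */
static uint64_t load_scalar(const uint8_t *mem, size_t size, int little_endian)
{
    uint64_t reg = 0;
    for (size_t i = 0; i < size; i++) {
        if (little_endian)
            /* Lowest-numbered address lands in the least-significant byte. */
            reg |= (uint64_t)mem[i] << (8 * i);
        else
            /* Highest-numbered address lands in the least-significant byte. */
            reg = (reg << 8) | mem[i];
    }
    return reg;
}
```

With the bytes of word a from structure s, both load_scalar((uint8_t[]){0x11,0x12,0x13,0x14}, 4, 0) and load_scalar((uint8_t[]){0x14,0x13,0x12,0x11}, 4, 1) yield 0x11121314, and a 1-byte load is unaffected by the byte ordering.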
For load/store multiple instructions, each group of 4 bytes is transferred between memory and the register
according to the procedure for a scalar load word instruction.
For load/store string instructions, the most-significant byte of the first register is transferred to or from memory
at the starting (lowest-numbered) effective address, regardless of byte ordering. Subsequent register bytes
(from most-significant to least-significant, and then moving into the next register, starting with the most-significant byte, and so on) are transferred to or from memory at sequentially higher-numbered addresses. This
behavior for byte strings ensures that if two strings are loaded into registers and then compared, the first
bytes of the strings are treated as most significant with respect to the comparison.
2.2.3.4 Byte-Reverse Instructions
The Power ISA defines load/store byte-reverse instructions, which can access storage that is specified as
being of one byte ordering in the same manner that a regular (that is, nonbyte-reverse) load/store instruction
would access storage that is specified as being of the opposite byte ordering. In other words, a load/store
byte-reverse instruction to a big-endian memory page transfers data between the data cache (or memory)
and the register in the same manner that a normal load/store would transfer the data to or from a little-endian
memory page. Similarly, a load/store byte-reverse instruction to a little-endian memory page transfers data
between the data cache (or memory) and the register in the same manner that a normal load/store would
transfer the data to or from a big-endian memory page.
The function of the load/store byte-reverse instructions is useful when a particular memory page contains a
combination of data with both big-endian and little-endian byte ordering. In such an environment, the endian
storage attribute for the memory page would be set according to the predominant byte ordering for the page,
and the normal load/store instructions would be used to access data operands that used this predominant
byte ordering. Conversely, the load/store byte-reverse instructions would be used to access the data operands that were of the other (less prevalent) byte ordering.
Software compilers cannot typically make general use of the load/store byte-reverse instructions, so they are
ordinarily used only in special, hand-coded device drivers.
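As a rough model of this behavior, a byte-reverse word load can be viewed as a normal word load performed with the opposite byte ordering. The sketch below makes that relationship explicit; load_word and load_word_byte_reversed are hypothetical C helpers, not instruction definitions, and little_endian stands in for the endian storage attribute of the page.

```c
#include <stdint.h>

/* Model of a normal word load from a page with the given byte ordering. */
static uint32_t load_word(const uint8_t *mem, int little_endian)
{
    uint32_t reg = 0;
    for (int i = 0; i < 4; i++) {
        if (little_endian)
            reg |= (uint32_t)mem[i] << (8 * i);
        else
            reg = (reg << 8) | mem[i];
    }
    return reg;
}

/* Model of a byte-reverse word load: same access, opposite byte ordering. */
static uint32_t load_word_byte_reversed(const uint8_t *mem, int little_endian)
{
    return load_word(mem, !little_endian);
}
```

For example, a little-endian data item 0x11121314 stored as the bytes 14 13 12 11 in a big-endian page reads as 0x14131211 with the normal model and as 0x11121314 with the byte-reverse model.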
2.3 Multithreading
The A2 core has four threads that allow simultaneous execution within the processor and can be viewed as a
4-way multiprocessor with shared dataflow. To software, this gives the appearance of four independent
processing units. The performance of each thread can be limited by the sharing of resources among the
threads.
2.3.1 Thread Identification
2.3.1.1 Thread Identification Register (TIR)
The TIR is a read-only register that can be used to distinguish a thread from other threads on the A2 core.
The TIR returns a value n, where n is referred to as “thread n.”
Register Short Name: TIR                  Read Access: Hypv
Decimal SPR Number: 446                   Write Access: None
Initial Value: 0x0000000000000000         Duplicated for Multithread: N
Slow SPR: N                               Notes:
Guest Supervisor Mapping:                 Scan Ring: func

Bits    Field Name  Initial Value  Description
0:31    ///         0x0            Reserved
32:61   ///         0x0            Reserved
62:63   TID         0b00           Processor Thread ID
                                   This field can be used to distinguish the thread from other threads on the processor.
                                   Threads are numbered sequentially, with valid values ranging from 0 to 3.
2.3.1.2 Processor Identification Register (PIR)
The PIR is a read-only register that uniquely identifies a specific instance of a processor thread, within a
multiprocessor configuration, enabling software to determine exactly which thread it is running on. This capability is important for operating system software within multiprocessor configurations.
Register Short Name: PIR                  Read Access: Priv
Decimal SPR Number: 286                   Write Access: None
Initial Value: 0x0000000000000000         Duplicated for Multithread: N
Slow SPR: N                               Notes:
Guest Supervisor Mapping: GPIR            Scan Ring: func

Bits    Field Name  Initial Value  Description
32:53   ///         0x0            Reserved
54:61   CID         0x0 (IO)       Processor Core ID
                                   Returns the value of the I/O pin an_ac_coreid. This can be used to distinguish a processor
                                   core from other processor cores in the system.
62:63   TID         0b00           Processor Thread ID
                                   This field can be used to distinguish the thread from other threads on the processor.
                                   Threads are numbered sequentially, with valid values ranging from 0 to 3.
The GPIR is a register that identifies a specific instance of a processor thread for the guest operating system.
The GPIR is used to filter incoming processor messages. See Processor Messages on page 357.
Register Short Name: GPIR                 Read Access: Priv
Decimal SPR Number: 382                   Write Access: Hypv
Initial Value: 0x0000000000000000         Duplicated for Multithread: Y
Slow SPR: N                               Notes: HM
Guest Supervisor Mapping: Y               Scan Ring: func

Bits    Field Name  Initial Value  Description
32:49   VPTAG       0x0            Virtual Processor Tag
                                   Storage used by the guest operating system to identify the virtual processor on which the
                                   operating system is running.
50:63   DBTAG       0x0            Doorbell Tag
                                   Used to match guest doorbell messages that are sent to all the processors and virtual
                                   processors in a coherence domain. If a sent guest doorbell message tag matches the DBTAG
                                   field, a guest doorbell is said to be accepted on the (virtual) processor.
2.3.2 Thread Run State
The A2 core provides several methods for controlling a thread’s run state. For a thread to fetch instructions,
all methods outlined below must be properly configured. If any one I/O or register is configured to stop a
thread, the affected thread will not fetch instructions.
2.3.2.1 Thread Stop I/O Pin
The I/O pin, an_ac_pm_thread_stop, can be used to stop the A2 core from fetching instructions. Stopping a
thread causes all instructions that have begun executing to be completed and all prefetched instructions to be
discarded.
2.3.2.2 Thread Control and Status Register (THRCTL)
The SCOM-accessible THRCTL register can control the thread run state to allow an external debugger
control of the processor. See Direct Access to I-Cache and D-Cache Directories on page 437. Stopping a
thread via THRCTL causes all instructions that have begun executing to be completed and all prefetched
instructions to be discarded.
2.3.2.3 Core Configuration Register 0 (CCR0)
The CCR0 is used to disable or enable threads. When a thread is disabled by setting the CCR0[WE] bit
corresponding to the thread to 1, all instructions that have begun executing are completed and all prefetched
instructions are discarded. Subsequent instructions are not prefetched or initiated. Asynchronous interrupts
or other conditions that are unmasked and enabled in CCR1 for the thread will cause the thread to be re-enabled. Executing a wait instruction on a thread will cause that thread’s CCR0[WE] to be set to 1. CCR0
also contains controls for allowing the processor to enter a power-managed state. See Section 13 Power Management Methods on page 525 for information about power savings modes.
Programming Note: When using mtccr0 to put other threads to sleep, using an external interrupt or any
asynchronous interrupt as the wake-up method is not reliable. The thread being put to sleep might have just
taken an interrupt, in which case MSR[EE] is zero and wake-up is prevented. In this case, mtccr0 should be
used to wake up the sleeping threads. A thread can reliably put itself to sleep using mtccr0 or the wait
instruction and wake up using an external interrupt or any asynchronous interrupt.
Register Short Name: CCR0                 Read Access: Hypv
Decimal SPR Number: 1008                  Write Access: Hypv
Initial Value: 0x0000000000000000         Duplicated for Multithread: N
Slow SPR: N                               Notes:
Guest Supervisor Mapping:                 Scan Ring: bcfg

Bits    Field Name  Initial Value  Description
32:33   PME         0b00           Power Management Enable
                                   00 Disabled: No power savings mode entered.
                                   01 PM_Sleep_enable: PM_Sleep state entered when all threads are stopped.
                                   10 PM_RVW_enable: PM_RVW state entered when all threads are stopped.
                                   11 Disabled2: No power savings mode entered.
                                   Note: See the A2 User Manual, Power Management Methods section.
34:51   ///         0x0            Reserved
52:55   WEM         0b0000         Wait Enable Mask
                                   0 No effect to CCR0[WE].
                                   1 Allows writing of the corresponding bit in the CCR0[WE] field. These bits are
                                     nonpersistent. A read always returns zeros.
56:59   ///         0b0000         Reserved
60:63   WE          0b0000         Wait Enable
                                   For t < 4, bit 63-t corresponds to thread t:
                                   0 Indicates that the thread is enabled.
                                   1 Indicates that the thread is disabled.
                                   Note: This field can also be set by a wait instruction.
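The interaction between the WEM and WE fields can be illustrated by computing the value that software would move to CCR0. This is a sketch under stated assumptions, not a verified programming sequence: ccr0_we_update is a hypothetical helper, and it assumes that WEM bit 55-t gates WE bit 63-t (the table states only that each WEM bit gates the corresponding WE bit). Bit 63 is the least-significant bit of the 64-bit value.

```c
#include <stdint.h>

#define CCR0_PME_MASK 0xC0000000ull   /* PME, bits 32:33 */

/* Build a CCR0 write value that changes only thread t's WE bit.
   ccr0  : current CCR0 value (read beforehand), used to preserve PME
   sleep : nonzero to disable (sleep) the thread, zero to enable it */
static uint64_t ccr0_we_update(uint64_t ccr0, unsigned t, int sleep)
{
    uint64_t value = ccr0 & CCR0_PME_MASK;  /* keep the power management setting */
    value |= 1ull << (8 + t);               /* WEM bit 55-t (assumed mapping) */
    if (sleep)
        value |= 1ull << t;                 /* WE bit 63-t = 1: thread disabled */
    return value;
}
```

Because only one WEM bit is set in the result, the WE bits of the other threads are left unchanged by the write.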
2.3.2.4 Thread Enable Register (TENS, TENC)
The Thread Enable Register is used to disable or enable threads and is provided as a means to access
shared resources (see Accessing Shared Resources on page 78). When a thread is disabled by setting the
TEN bit corresponding to the thread to 0, all instructions that have begun executing are completed and all
prefetched instructions are discarded. Subsequent instructions are not prefetched or initiated. All asynchronous interrupts for the thread are delayed until the thread is re-enabled.
The TEN is accessed by using two registers: TENS and TENC. When TENS is written, threads for which the
corresponding bit in TENS is 1 are enabled; threads for which the corresponding bit in TENS is 0 are unaffected. When TENC is written, threads for which the corresponding bit in TENC is 1 are disabled; threads for
which the corresponding bit in TENC is 0 are unaffected. When either SPR is read, the current value of the
TEN is returned.
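The set and clear semantics can be summarized with a small software model. The type ten_model and the functions below are hypothetical names used only for illustration; they model the 4-bit TEN field (bits 60:63), not the actual SPR access.

```c
#include <stdint.h>

/* Software model of the Thread Enable (TEN) state; bit t of `ten`
   (register bit 63-t) is 1 when thread t is enabled. */
typedef struct {
    uint64_t ten;
} ten_model;

/* Write to TENS: 1 bits enable threads, 0 bits leave threads unaffected. */
static void write_tens(ten_model *m, uint64_t value)
{
    m->ten |= value & 0xF;
}

/* Write to TENC: 1 bits disable threads, 0 bits leave threads unaffected. */
static void write_tenc(ten_model *m, uint64_t value)
{
    m->ten &= ~(value & 0xF);
}

/* A read of either TENS or TENC returns the current TEN value. */
static uint64_t read_ten(const ten_model *m)
{
    return m->ten;
}
```

Starting from the initial value 0b0001 (only thread 0 enabled), writing 0b1110 to TENS enables all four threads, and then writing 0b0110 to TENC leaves threads 0 and 3 enabled. Separate set and clear registers let a thread change selected bits without a read-modify-write sequence.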
Register Short Name: TENS                 Read Access: Hypv
Decimal SPR Number: 438                   Write Access: Hypv
Initial Value: 0x0000000000000001         Duplicated for Multithread: N
Slow SPR: N                               Notes: WS
Guest Supervisor Mapping:                 Scan Ring: bcfg

Bits    Field Name  Initial Value  Description
0:31    ///         0x0            Reserved
32:59   ///         0x0            Reserved
60:63   TEN         0b0001         Thread Enable Set
                                   For t < 4, bit 63-t corresponds to thread t. When bit 63-t is set to 1, thread t is enabled, if it
                                   is not already. When bit 63-t is set to 0, thread t is unaffected.
                                   When bit 63-t is read, the current value of the thread enable is returned.
Register Short Name: TENC                 Read Access: Hypv
Decimal SPR Number: 439                   Write Access: Hypv
Initial Value: 0x0000000000000001         Duplicated for Multithread: N
Slow SPR: N                               Notes: WC
Guest Supervisor Mapping:                 Scan Ring: bcfg

Bits    Field Name  Initial Value  Description
0:31    ///         0x0            Reserved
32:59   ///         0x0            Reserved
60:63   TEN         0b0001         Thread Enable Clear
                                   For t < 4, bit 63-t corresponds to thread t. When bit 63-t is set to 1, thread t is disabled, if it
                                   is not already. When bit 63-t is set to 0, thread t is unaffected.
                                   When bit 63-t is read, the current value of the thread enable is returned.
2.3.2.5 Thread Enable Status Register (TENSR)
The TENSR indicates which threads are quiesced.
Programming Note: The TENSR is only valid after a context synchronizing instruction or an event that precisely stops a thread, such as a write to TEN.
Programming Note: When thread T1 disables other threads, Tn, it sets the TEN bits corresponding to Tn to
zeros. To ensure that all operations being performed by threads Tn have been performed with respect to all
threads on the processor, thread T1 reads the TENSR until all the bits corresponding to the disabled threads,
Tn, are zeros.
Register Short Name: TENSR                Read Access: Hypv
Decimal SPR Number: 437                   Write Access: None
Initial Value: 0x0000000000000000         Duplicated for Multithread: N
Slow SPR: N                               Notes:
Guest Supervisor Mapping:                 Scan Ring: func

Bits    Field Name  Initial Value  Description
0:31    ///         0x0            Reserved
32:59   ///         0x0            Reserved
60:63   TENSR       0b0000         Thread Enable Status Register
                                   Bit 63-t of the TENSR corresponds to thread t.
2.3.3 Wake On Interrupt
The A2 core can be configured to wake on interrupts or other conditions, if the thread was disabled by a write
to CCR0 or by executing a wait instruction.
2.3.3.1 Core Configuration Register 1 (CCR1)
CCR1 provides additional masking on what conditions can cause the processor to resume execution. The
conditions or interrupts specified must be appropriately unmasked and must also be enabled in CCR1 to exit
the stopped state.
Register Short Name: CCR1                 Read Access: Hypv
Decimal SPR Number: 1009                  Write Access: Hypv
Initial Value: 0x000000000F0F0F0F         Duplicated for Multithread: N
Slow SPR: N                               Notes:
Guest Supervisor Mapping:                 Scan Ring: func

Bits    Field Name  Initial Value  Description
32:33   ///         0b00           Reserved
34:39   WC3         0xF            Thread 3 Wake Control
                                   (0) 1 Disables sleep on waitrsv.
                                   (1) 1 Disables sleep on waitimpl.
                                   (2) 1 Enables wake on critical input, watchdog, critical doorbell, guest critical doorbell,
                                         or guest machine check doorbell interrupts.
                                   (3) 1 Enables wake on external input, performance monitor, doorbell, or guest doorbell
                                         interrupts.
                                   (4) 1 Enables wake on decrementer or user decrementer interrupts.
                                   (5) 1 Enables wake on fixed interval timer interrupts.
40:41   ///         0b00           Reserved
42:47   WC2         0xF            Thread 2 Wake Control
                                   (0) 1 Disables sleep on waitrsv.
                                   (1) 1 Disables sleep on waitimpl.
                                   (2) 1 Enables wake on critical input, watchdog, critical doorbell, guest critical doorbell,
                                         or guest machine check doorbell interrupts.
                                   (3) 1 Enables wake on external input, performance monitor, doorbell, or guest doorbell
                                         interrupts.
                                   (4) 1 Enables wake on decrementer or user decrementer interrupts.
                                   (5) 1 Enables wake on fixed interval timer interrupts.
48:49   ///         0b00           Reserved
50:55   WC1         0xF            Thread 1 Wake Control
                                   (0) 1 Disables sleep on waitrsv.
                                   (1) 1 Disables sleep on waitimpl.
                                   (2) 1 Enables wake on critical input, watchdog, critical doorbell, guest critical doorbell,
                                         or guest machine check doorbell interrupts.
                                   (3) 1 Enables wake on external input, performance monitor, doorbell, or guest doorbell
                                         interrupts.
                                   (4) 1 Enables wake on decrementer or user decrementer interrupts.
                                   (5) 1 Enables wake on fixed interval timer interrupts.
56:57   ///         0b00           Reserved
58:63   WC0         0xF            Thread 0 Wake Control
                                   (0) 1 Disables sleep on waitrsv.
                                   (1) 1 Disables sleep on waitimpl.
                                   (2) 1 Enables wake on critical input, watchdog, critical doorbell, guest critical doorbell,
                                         or guest machine check doorbell interrupts.
                                   (3) 1 Enables wake on external input, performance monitor, doorbell, or guest doorbell
                                         interrupts.
                                   (4) 1 Enables wake on decrementer or user decrementer interrupts.
                                   (5) 1 Enables wake on fixed interval timer interrupts.
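The CCR1 layout places one 6-bit wake control field per thread in each byte of the low-order word, so a field can be extracted with a shift and a mask. The sketch below is illustrative only; ccr1_wc and wc_bit are hypothetical helpers, and the code assumes that the sub-bit numbers (0) through (5) follow the Power ISA convention in which bit 0 is the most-significant bit of the field.

```c
#include <stdint.h>

/* Return the 6-bit wake control field WCt for thread t (0 to 3).
   WC0 occupies register bits 58:63, WC1 bits 50:55, and so on, so
   each field sits in the low 6 bits of byte t of the value. */
static unsigned ccr1_wc(uint64_t ccr1, unsigned t)
{
    return (unsigned)((ccr1 >> (8 * t)) & 0x3F);
}

/* Return sub-bit (i) of a wake control field, where (0) is assumed to be
   the most-significant bit of the 6-bit field and (5) the least. */
static int wc_bit(unsigned wc, unsigned i)
{
    return (int)((wc >> (5 - i)) & 1);
}
```

Applied to the initial value 0x000000000F0F0F0F, every field decodes to 0b001111: under the stated assumption, sub-bits (2) through (5) are set, so all four classes of wake-up interrupt are enabled, and sub-bits (0) and (1) are clear, so sleep on waitrsv and waitimpl is not disabled.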
2.3.4 Thread Priority
Thread priority can be changed by writing the PPR32 register, executing an or Rx,Rx,Rx instruction, or by
causing an interrupt.
2.3.4.1 Program Priority Register (PPR32)
The program priority register controls thread priority. A2 hardware supports three physical priorities. In A2’s
lowest hardware priority, the number of cycles between two instructions being issued is determined by
IUCR1[THRES]. See Instruction Unit Configuration Register 1 (IUCR1) on page 77.
The mapping of the three hardware priorities to the architected priorities in the PPR32 register is shown in
Table 2-3. An or Rx,Rx,Rx is used to set PPR32[PRI]; these are also shown in Table 2-3. Other defined or
Rx,Rx,Rx hints shown in Table 2-4 are ignored. PPR32[PRI] remains unchanged if the privilege state of the
processor executing the instruction is lower than the privilege indicated in Table 2-3. PPR32[PRI] also
remains unchanged if “000” is written to the field.
If MSR[EE] is 0 and PPR32 = low then thread priority is increased to medium; PPR32 is unchanged. When
MSR[EE] is 1, thread priority is determined by PPR32[PRI]. This function is provided to reduce delay in the
processing of interrupts.
Table 2-3. Priority Levels

                                A2 Hardware Priority with IUCR1[HIPRI] Setting
Rx   PPR32[PRI]  ISA Priority   00         01         10         11          Privileged
31   001         very low       a2low      a2low      a2low      a2low       yes
1    010         low                                                         no
6    011         medium low     a2medium   a2medium   a2medium   a2medium    no
2    100         medium         a2high                                       no
5    101         medium high    a2high                                       yes
3    110         high           a2high                                       yes
7    111         very high      a2high                                       hypv

Table 2-4. Other “or” Instruction Hints

Rx   Mnemonic  Reserved
27   yield     Yes
29   mdoio     Yes
30   mdoom     Yes

Table 2-5. Program Priority Register (PPR32)

Register Short Name: PPR32                Read Access: Any
Decimal SPR Number: 898                   Write Access: Any
Initial Value: 0x00000000000C0000         Duplicated for Multithread: Y
Slow SPR: Y                               Notes:
Guest Supervisor Mapping:                 Scan Ring: ccfg

Bits    Field Name  Initial Value  Description
32:42   ///         0x0            Reserved
43:45   PRI         0b011          Thread Priority
                                   001 Very low (privileged).
                                   010 Low.
                                   011 Medium low.
                                   100 Medium.
                                   101 Medium high (privileged).
                                   110 High (privileged).
                                   111 Very high (hypervisor).
                                   Access violations or writing a value of zero will result in a nop.
46:63   ///         0x0            Reserved
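The PRI field position and the or Rx,Rx,Rx mapping can be expressed compactly in C. This is a sketch for illustration; ppr32_pri and pri_from_or_rx are hypothetical helper names, and the Rx values are the ones listed in Table 2-3. Privilege checks are not modeled.

```c
#include <stdint.h>

/* Extract PPR32[PRI] (register bits 43:45). With bit 63 as the
   least-significant bit, the field starts 18 bits above it. */
static unsigned ppr32_pri(uint64_t ppr32)
{
    return (unsigned)((ppr32 >> 18) & 0x7);
}

/* Map the register number used in "or Rx,Rx,Rx" to the PPR32[PRI] value
   it requests. Returns 0 for any other Rx, including the ignored hints
   27, 29, and 30. */
static unsigned pri_from_or_rx(unsigned rx)
{
    switch (rx) {
    case 31: return 1;  /* 001 very low    */
    case 1:  return 2;  /* 010 low         */
    case 6:  return 3;  /* 011 medium low  */
    case 2:  return 4;  /* 100 medium      */
    case 5:  return 5;  /* 101 medium high */
    case 3:  return 6;  /* 110 high        */
    case 7:  return 7;  /* 111 very high   */
    default: return 0;  /* not a priority-setting form */
    }
}
```

For the initial value 0x00000000000C0000, ppr32_pri returns 0b011 (medium low), which agrees with the initial value shown for the PRI field.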
2.3.4.2 Instruction Unit Configuration Register 1 (IUCR1)
Register Short Name: IUCR1                Read Access: Hypv
Decimal SPR Number: 883                   Write Access: Hypv
Initial Value: 0x0000000000001000         Duplicated for Multithread: Y
Slow SPR: Y                               Notes:
Guest Supervisor Mapping:                 Scan Ring: ccfg

Bits    Field Name  Initial Value  Description
32:49   ///         0x0            Reserved
50:51   HIPRI       0b01           High Priority Privilege Level
                                   The A2 core has three priority values implemented in hardware. This field configures which
                                   value in PPR32[PRI] corresponds to the implementation’s highest priority.
                                   00 Medium normal.
                                   01 Medium high.
                                   10 High.
                                   11 Very high.
52:57   ///         0x0            Reserved
58:63   THRES       0x0            Low Priority Minimum Issue Count
                                   Sets the number of cycles between low priority issues, which is set by PPR32[PRI]. The
                                   number of cycles is equal to THRES 4. This field is not used when a thread is set to high
                                   or medium priority.
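The two IUCR1 fields can be extracted as shown in the following sketch. The helper names are hypothetical. The function highest_pri_code additionally assumes that the HIPRI encodings 00 through 11 select the PPR32[PRI] values 100 through 111 in order; that reading is inferred from the field description and Table 2-3 rather than stated outright.

```c
#include <stdint.h>

/* HIPRI occupies register bits 50:51, that is, bits 13:12 of the value. */
static unsigned iucr1_hipri(uint64_t iucr1)
{
    return (unsigned)((iucr1 >> 12) & 0x3);
}

/* THRES occupies register bits 58:63, the low 6 bits of the value. */
static unsigned iucr1_thres(uint64_t iucr1)
{
    return (unsigned)(iucr1 & 0x3F);
}

/* Assumed mapping: PPR32[PRI] value that corresponds to the highest
   hardware priority (00 -> 100, 01 -> 101, 10 -> 110, 11 -> 111). */
static unsigned highest_pri_code(uint64_t iucr1)
{
    return 4u + iucr1_hipri(iucr1);
}
```

For the initial value 0x0000000000001000, HIPRI decodes to 0b01 and THRES to 0, so under the stated assumption the highest hardware priority is reached at PPR32[PRI] = 101 (medium high).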
2.3.5 Resources Shared between Threads
All architected states are duplicated for each thread except for logical partitioning and memory. This allows
each thread to look independent from a software standpoint. Some nonarchitected resources are shared
between threads to save on the overall area for the core. Section 2.3.6 provides more information about
shared resources. Section 2.3.7 on page 78 provides more information about duplicated resources.
2.3.6 Shared Resources
Instruction ERAT array        Entries can be used as shared or thread specific.
L1 instruction cache array
Data ERAT array               Entries can be used as shared or thread specific.
L1 data cache array
Load miss queue
Store queue
Microcode ROM array
Branch history table          This is a configurable resource and can be set up to be shared or duplicated.
SPR registers                 Not all SPRs are shared. See Table 14-1 Register Summary on page 530 for
                              more information.
Instruction fetch pipeline
Instruction issue
Integer execution pipeline
TLB
LRAT
2.3.6.1 Accessing Shared Resources
When software executing in thread Tn writes a new value in an SPR (mtspr) that is shared with other
threads, either of the following sequences of operations can be performed to ensure that the write operation
has been performed with respect to other threads.
Sequence 1
• Disable all other threads (see Thread Enable Register (TENS, TENC) on page 72).
• Write to the shared SPR (mtspr).
• Perform a context synchronizing operation.
• Enable the previously disabled threads.
In the above sequence, the context synchronizing operation ensures that the write operation has been
performed with respect to all other threads that share the SPR. The enabling of other threads ensures that
subsequent instructions of the enabled threads use the new SPR value because enabling a thread is a
context synchronizing operation.
Sequence 2
• All threads are put in hypervisor state and begin polling a storage flag.
• The thread updating the SPR does the following:
    • Writes to the SPR (mtspr).
    • Sets a storage flag indicating that the write operation was done.
    • Performs a context synchronizing operation.
• When other threads see the updated storage flag, they perform context synchronizing operations.
In the above sequence, the context synchronizing operation by the thread that writes to the SPR ensures that
the write operation has been performed with respect to all other threads that share the SPR; the context
synchronizing operation by the other threads ensures that subsequent instructions for these threads use the
updated value.
2.3.7 Duplicated Resources
Link stack queue
Instruction buffer
Thread dependency
GPR register file             This includes extra registers for microcode instruction use.
SPR registers                 Not all SPRs are duplicated. See Table 14-1 Register Summary on page 530 for
                              more information.
Branch history table          This is a configurable resource and can be set up to be shared or duplicated.
2.3.8 Pipeline Sharing
Figure 2-1 shows the instruction flow for the A2 core.
Figure 2-1. A2 Core Instruction Unit
2.3.8.1 Instruction Cache
The instruction cache is shared among all threads. A single thread is selected each cycle, depending on the
number of instructions currently contained within that thread’s instruction buffers. Two watermarks within the
instruction buffer, empty and half-empty, determine a thread’s priority level for fetches. The empty watermark
gives the corresponding thread a high-priority fetch request, and the half-empty watermark gives the thread a
low-priority fetch request. The high-priority and low-priority fetches are held in two separate round-robin
queues to give each thread an even chance at getting the next command. A low-priority fetch is issued only
when none of the high-priority watermarks are active. The instruction cache and instruction directories are
4-way associative and are shared among all threads. The branch prediction unit that is part of the instruction
cache in Figure 2-1 on page 79 contains a branch history table and link stack to allow proper branch
resolution. The link stack is a 4-deep queue per thread, whereas the branch history table is a 2-bit history that
can be configured as either 1 k per thread or a 4 k history shared among all four threads.
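The two-level fetch selection described above can be approximated with the following C model. It is a sketch under several assumptions that the text does not specify: the names fetch_arb and select_fetch are hypothetical, the buffer capacity is a parameter, "half-empty" is taken to mean a fill level at or below half the capacity, and each round-robin pointer advances past the thread that was just selected.

```c
#define NTHREADS 4

/* Round-robin pointers for the high-priority and low-priority queues. */
typedef struct {
    unsigned next_hi;
    unsigned next_lo;
} fetch_arb;

/* Select the thread that fetches this cycle, or return -1 if no thread
   has an active watermark. fill[t] is the number of instructions in
   thread t's instruction buffer. */
static int select_fetch(fetch_arb *a, const unsigned fill[NTHREADS],
                        unsigned capacity)
{
    /* High priority: threads whose buffers are empty. */
    for (unsigned k = 0; k < NTHREADS; k++) {
        unsigned t = (a->next_hi + k) % NTHREADS;
        if (fill[t] == 0) {
            a->next_hi = (t + 1) % NTHREADS;
            return (int)t;
        }
    }
    /* Low priority: considered only when no high-priority request exists. */
    for (unsigned k = 0; k < NTHREADS; k++) {
        unsigned t = (a->next_lo + k) % NTHREADS;
        if (fill[t] <= capacity / 2) {
            a->next_lo = (t + 1) % NTHREADS;
            return (int)t;
        }
    }
    return -1;
}
```

Keeping separate pointers for the two queues means that a burst of high-priority requests does not disturb the rotation order among the low-priority requesters.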
2.3.8.2 Instruction Buffer and Decode Dependency
The colored portion of Figure 1-1 on page 50 contains all of the instruction buffer, decode, and dependency
logic for each of the threads. This logic is duplicated for each thread to allow other threads with nondependent
commands to be issued to maximize usage for the integer and floating-point pipelines.
2.3.8.3 Instruction Issue
Instruction issue is a shared resource within the core, and the logic is a 1+1 concurrent issue machine. This
allows two commands to be issued per cycle; however, each of the commands issued must be from separate
threads, with one to the XU and another to the AXU units. The selection logic for the issue logic is a simple
round-robin scheme with three levels of priority to allow software more flexibility.
See Figure 2-2, Figure 2-3, and Figure 2-4 for examples of round-robin logic.
Figure 2-2. Instruction Issue Timing Diagram 1 (Thread 0, high priority; threads 1, 2, 3 low priority; timeout set to 3.)
Figure 2-3. Instruction Issue Timing Diagram 2 (All threads set to high priority; timeout set to 3.)
Figure 2-4. Instruction Issue Timing Diagram 3 (Threads 0 and 1, high priority; threads 2 and 3, medium priority; timeout set to 3.)
2.3.8.4 Ram Unit
The Ram unit allows an external command to be issued within a given thread’s instruction stream. This unit is
a shared resource within a core in that only one thread can issue a Ram command at a time. It is software’s
responsibility to only allow one outstanding command per core, and it is necessary to poll the core until this
command has completed before issuing any new commands.
2.3.8.5 Microcode Unit
The microcode unit (uCode) is partially shared and partially duplicated logic. The ROM that contains the
actual stream of instructions to be issued is a shared unit; however, each thread contains its own microcode
engine so that all four threads can be within a uCode stream at the same time. One of the engines will read a
single command from the ROM each cycle based upon a fair round-robin scheme (not based upon the thread
priority level for the issue logic), and issue that command to the appropriate thread’s instruction buffer. If the
instruction buffer is over halfway filled, the uCode will stop issuing new commands. In addition, it will not
include this thread for ROM reads until the instruction buffer has drained below this point.
2.3.8.6 Integer Unit
The integer execution unit is shared between threads because there is a unified execution, load/store, and
branch pipeline. Exceptions and flushes from one thread usually will not affect another thread.
However, a data cache invalidate (DCI) or instruction cache invalidate (ICI) that reaches completion on one
thread causes a flush that affects all threads. A DCI or ICI flushes all threads for one cycle to allow the L1
caches to be invalidated. Software is required to guarantee that the load miss queue is empty for all threads
before execution of a DCI.
Another flush condition caused by one thread that can affect another thread occurs when reload data
returning for an outstanding load collides with a load or store at the data cache array pins.
For a comprehensive list of flush conditions, see Interrupt Conditions on page 854.
Some multiply operations and all divide operations require recirculation within the multiply/divide unit, therefore blocking all other threads from executing multiplies and divides. This does not prevent other threads from
executing any instructions other than multiplies and divides. If any multiply or divide instructions are issued
and collide with a recirculating multiply or divide, the younger instructions are flushed. In the case of the multiplier, the size of the operands determines how many cycles are needed for recirculation. The width of the
multiplier is 32 bits by 32 bits, so any operations that require multiplying 64-bit operands will require recirculation. If both operands are 32 bits, no recirculation is needed (in other words, the instruction is pipelined as
normal). The width of the divider is 64 bits. Divide instructions dealing with 64-bit operands recirculate for 65
cycles, and operations with 32-bit operands recirculate for 32 cycles. No divide instructions are pipelined; they
all require some recirculation.
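The recirculation rules above can be summarized in a small model. This is only a sketch of the behavior described in the text, not the hardware logic; the function names and the reduction of operand size to a 32 or 64 argument are assumptions made for illustration.

```python
def multiply_recirculates(operand_bits: int) -> bool:
    """The multiplier is 32 x 32 bits, so only 64-bit operands need recirculation."""
    if operand_bits not in (32, 64):
        raise ValueError("operand_bits must be 32 or 64")
    return operand_bits == 64


def divide_recirculation_cycles(operand_bits: int) -> int:
    """Cycle counts quoted in the text: 65 for 64-bit operands, 32 for 32-bit operands."""
    cycles = {64: 65, 32: 32}
    if operand_bits not in cycles:
        raise ValueError("operand_bits must be 32 or 64")
    return cycles[operand_bits]
```

In this model, a multiply with 32-bit operands is pipelined as normal, while every divide reports a nonzero recirculation count.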
A forward progress timer monitors that each thread is making forward progress. If the thread appears to be
hung, thread priorities are adjusted to break out of a potential live-lock condition.
2.4 Registers
This section provides an overview of the register categories and types provided by the A2 core. Detailed
descriptions of each of the registers are provided within the chapters covering the functions with which they
are associated (for example, the cache control and cache debug registers are described in Instruction and
Data Caches on page 169). An alphabetical summary of all registers, including bit definitions, is provided in
Register Summary on page 529.
All registers in the A2 core are architected as 64 bits wide, although certain bits in some registers are
reserved and thus not necessarily implemented. For all registers with fields marked as reserved, these
reserved fields should be written as 0 and read as undefined. The recommended coding practice is to
perform the initial write to a register with reserved fields set to 0, and to perform all subsequent writes to the
register using a read-modify-write strategy: read the register; use logical instructions to alter defined fields,
leaving reserved fields unmodified; and write the register.
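The read-modify-write strategy can be sketched as follows. This is a minimal illustration, assuming hypothetical read_reg and write_reg callables that stand in for the instructions used to read and write the register (for example, mfspr and mtspr); field_mask marks the defined bits being changed.

```python
MASK64 = (1 << 64) - 1


def read_modify_write(read_reg, write_reg, field_mask: int, field_value: int) -> int:
    """Alter only the bits selected by field_mask; write all other bits back as read."""
    old = read_reg() & MASK64                      # read the register
    new = (old & ~field_mask & MASK64) | (field_value & field_mask)  # alter defined field
    write_reg(new)                                 # write the register
    return new
```

Because the bits outside field_mask are written back exactly as they were read, reserved fields are left unmodified by the update.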
All of the registers are grouped into categories according to the processor functions with which they are associated. In addition, each register is classified as being of a particular type, as characterized by the specific
instructions that are used to read and write registers of that type. Finally, most of the registers contained
within the A2 core are defined by the Power ISA Architecture, although some registers are implementation-specific and unique to the A2 core.
Figure 2-5 illustrates the A2 core registers contained in the user programming model; that is, those registers
to which access is nonprivileged and that are available to both user and supervisor programs.
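Figure 2-5. User Programming Model Registers
[Figure: user-model registers, replicated per thread. Integer Processing: General Purpose Registers (GPR0–GPR31), Integer Exception Register (XER). Branch Control: Condition Register (CR), Link Register (LR), Count Register (CTR). Timer: Time Base (TB, TBU), User Decrementer Register (UDEC). Processor Control: SPR General 3–7 (SPRG3–SPRG7), VR Save Register (VRSAVE).]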
Table 14-1 on page 530 lists the A2 core registers contained in the supervisor or hypervisor programming
model, to which access is privileged.
2.4.1 Register Mapping
Some special purpose register (SPR) accesses in guest state are mapped to analogous guest-state registers.
This removes the need for hypervisor software to handle embedded hypervisor privilege interrupts for these accesses and to emulate the required changes to these high-use registers.
Accesses to the registers listed in Table 2-6 are changed by the processor to the registers given in the table
when the processor is in guest state (MSR[GS] = 1). Accesses to these registers are not mapped when not in
guest state.
Table 2-6. Register Mapping
SPR Accessed    SPR Mapped to    Type of Access
SRR0            GSRR0            mtspr, mfspr
SRR1            GSRR1            mtspr, mfspr
ESR             GESR             mtspr, mfspr
DEAR            GDEAR            mtspr, mfspr
PIR             GPIR             mtspr, mfspr
SPRG0           GSPRG0           mtspr, mfspr
SPRG1           GSPRG1           mtspr, mfspr
SPRG2           GSPRG2           mtspr, mfspr
SPRG3           GSPRG3           mtspr, mfspr
USPRG3          GSPRG3           mtspr, mfspr
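The mapping in Table 2-6 can be modeled as a simple lookup. The sketch below is illustrative only; the dictionary and the resolve_spr function are hypothetical names, and the model covers only the name substitution, not privilege checks or interrupts.

```python
# SPR accessed -> SPR actually used when MSR[GS] = 1 (from Table 2-6).
GUEST_SPR_MAP = {
    "SRR0": "GSRR0", "SRR1": "GSRR1", "ESR": "GESR", "DEAR": "GDEAR",
    "PIR": "GPIR", "SPRG0": "GSPRG0", "SPRG1": "GSPRG1", "SPRG2": "GSPRG2",
    "SPRG3": "GSPRG3", "USPRG3": "GSPRG3",
}


def resolve_spr(name: str, msr_gs: int) -> str:
    """Return the SPR an mtspr/mfspr reaches for the given MSR[GS] value."""
    if msr_gs == 1:
        return GUEST_SPR_MAP.get(name, name)  # unlisted SPRs are not remapped
    return name                               # no mapping outside guest state
```

For example, resolve_spr("SRR0", 1) yields "GSRR0", while the same access with MSR[GS] = 0 reaches SRR0 itself.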
2.4.2 Register Types
There are four register types contained within and/or supported by the A2 core. Each register type is characterized by the instructions that are used to read and write the registers of that type. The following subsections
provide an overview of each of the register types and the instructions associated with them.
2.4.2.1 General Purpose Registers
The A2 core contains 32 integer general purpose registers (GPRs); each contains 64 bits. All
instructions that operate on GPRs produce the same GPR results in 32-bit mode as in 64-bit mode.
Integer Processing on page 110 provides more information about integer operations and the use of GPRs.
2.4.2.2 Special Purpose Registers
Special Purpose Registers (SPRs) are directly accessed using the mtspr and mfspr instructions. In addition,
certain SPRs might be updated as a side-effect of the execution of various instructions. For example, the
Integer Exception Register (XER) (see Integer Exception Register (XER) on page 110) is an SPR that is
updated with arithmetic status (such as carry and overflow) upon execution of certain forms of integer arithmetic instructions.
SPRs control the use of the debug facilities, timers, interrupts, memory management, caches, and other
architected processor resources. Table 14-1 on page 530 shows the mnemonic, name, and number for each
SPR, in alphabetical order. Each of the SPRs is described in more detail within the section or chapter
covering the function with which it is associated.
2.4.2.3 Condition Register
The Condition Register (CR) is a 32-bit register of its own unique type and is divided into eight independent 4-bit fields (CR0–CR7). The CR can be used to record certain conditional results of various arithmetic
and logical operations. Subsequently, conditional branch instructions can designate a bit of the CR as one of
the branch conditions (see Branch Processing on page 99). Instructions are also provided for performing logical
bit operations and for moving fields within the CR.
See Condition Register (CR) on page 107 for more information about the various instructions that can update
the CR.
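The division of the CR into fields can be illustrated with a short sketch. It assumes the Power ISA bit-numbering convention, in which CR bits are numbered 32 through 63 from the most significant bit and CR0 occupies bits 32:35; the helper names are hypothetical.

```python
def cr_field(cr: int, n: int) -> int:
    """Return the 4-bit field CRn (n = 0..7) of a 32-bit CR value; CR0 is the leftmost field."""
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be 0..7")
    return (cr >> (4 * (7 - n))) & 0xF


def cr_bit(cr: int, bit: int) -> int:
    """Return one CR bit using the 32..63 numbering (bit 32 is the most significant)."""
    if not 32 <= bit <= 63:
        raise ValueError("CR bit number must be 32..63")
    return (cr >> (63 - bit)) & 1
```

A conditional branch that designates one CR bit as its condition is, in this model, testing the value returned by cr_bit.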
2.4.2.4 Machine State Register
The Machine State Register (MSR) is a register of its own unique type that controls important chip functions,
such as the enabling or disabling of various interrupt types.
The MSR can be written from a GPR using the mtmsr instruction. The contents of the MSR can be read into
a GPR using the mfmsr instruction. The MSR[EE] bit can be set or cleared atomically using the wrtee or
wrteei instructions. The MSR contents are also automatically saved, altered, and restored by the interrupt-handling mechanism. See Machine State Register (MSR) on page 301 for more detailed information about
the MSR and the function of each of its bits.
2.5 32-Bit Mode
2.5.1 64-Bit Specific Instructions
Instructions or registers that are categorized as 64-bit are only available in 64-bit implementations of the A2
core. In a 64-bit implementation in 32-bit mode, all instructions that operate on GPRs produce the same GPR
results in 32-bit mode as in 64-bit mode. Instructions that set condition bits do so based on the 32-bit result
computed. Effective addresses and all SPRs operate on the low-order 32 bits only unless otherwise stated.
2.5.2 32-Bit Instruction Selection
Any software that uses any of the instructions listed in the 64-bit category is considered 64-bit software.
Generally speaking, 32-bit software should avoid using any instruction or instructions that depend on any
particular setting of bits 0:31 of any 64-bit application-accessible system register, including General Purpose
Registers, for producing the correct 32-bit results. Context switching might or might not preserve the upper 32
bits of application-accessible 64-bit system registers, and insertion of arbitrary settings of those upper 32 bits
at arbitrary times during the execution of the 32-bit application must not affect the final result.
2.6 Instruction Categories
The Power ISA defines that each facility (including registers and fields therein) and instruction is in exactly
one category. Table 2-7 indicates the categories that are implemented by the A2 processor core.
Table 2-7. Category Listing
Implemented
by A2 Core    Category                       Abbreviation    Notes
Yes           Base                           B               Required for all implementations.
No            Server                         S               Required for server implementations.
Yes           Embedded                       E               Required for embedded implementations.
No            Alternate Time Base            ATB             An additional time base; see Book II.
Yes           Cache Specification            CS              Specify a specific cache for some instructions; see Book II.
                                             S.RPTA          HTAB alignment on a 256 KB boundary; see Book III-S.
                                             SP              Facility for signal processing.
              .Embedded Float Scalar Dou-    SP.FD           GPR-based floating-point double-precision instruction set.
                                             SP.FS           GPR-based floating-point single-precision instruction set.
                                             SP.FV           GPR-based floating-point vector instruction set.
                                             V               Vector facilities.
                                             V.LE            Little-endian support for vector storage operations.
                                                             Required for 64-bit implementations; not defined for 32-bit implementations.
2.7 Instruction Classes
The Power ISA architecture defines all instructions as falling into exactly one of the following three classes, as
determined by the primary opcode (and the extended opcode, if any):
1. Defined
2. Illegal
3. Reserved
2.7.1 Defined Instruction Class
This class of instructions consists of all the instructions defined in Power ISA. In general, defined instructions
are guaranteed to be supported within a Power ISA system as specified by the architecture, either within the
processor implementation itself or within emulation software supported by the system operating software.
As defined by Power ISA, any attempt to execute a defined instruction will:
• Cause an illegal instruction exception type of program interrupt, if the instruction is not recognized by the
implementation; or
• Cause a floating-point unavailable interrupt if the instruction is recognized as a floating-point instruction,
but floating-point processing is disabled; or
• Perform the actions described in the rest of this document, if the instruction is recognized and supported
by the implementation. The architected behavior might cause other exceptions.
The A2 core recognizes and fully supports all of the instructions in the defined class and in the categories
supported, with a few exceptions. First, instructions that are defined for floating-point processing are not
supported within the A2 core, but can be implemented within an auxiliary processor and attached to the core
using the AXU interface. If no such auxiliary processor is attached, attempting to execute any floating-point
instructions causes an illegal instruction exception type of program interrupt. If an auxiliary processor that
supports the floating-point instructions is attached, the behavior of these instructions is as defined above and
as determined by the implementation details of the floating-point auxiliary processor.
2.7.2 Illegal Instruction Class
This class of instructions contains the set of instructions described in Power ISA Appendix D of Book Appendices. Illegal instructions are available for future extensions of the Power ISA; that is, some future version of
the Power ISA might define any of these instructions to perform new functions.
Any attempt to execute an illegal instruction causes the system illegal instruction error handler to be invoked
and will have no other effect.
An instruction consisting entirely of binary zeros is guaranteed always to be an illegal instruction. This
increases the probability that an attempt to execute data or uninitialized storage will result in the invocation of
the system illegal instruction error handler.
2.7.3 Reserved Instruction Class
This class of instructions contains the set of instructions described in Power ISA Appendix E of Book Appendices.
Reserved instructions are allocated to specific purposes that are outside the scope of the Power ISA.
Any attempt to execute a reserved instruction causes the system illegal instruction error handler to be
invoked if the instruction is not implemented.
Because implementations are typically expected to treat reserved-nop instructions as true no-ops, these
instruction opcodes are available for future extensions to Power ISA that have no effect on the architected
state. Such extensions might include performance-enhancing hints, such as new forms of cache touch
instructions. Software would be able to take advantage of the functionality offered by the new instructions and
still remain backwards-compatible with implementations of previous versions of Power ISA.
The A2 core implements all of the reserved-nop instruction opcodes as true no-ops. The specific reserved-nop opcodes are the following extended opcodes under primary opcode 31: 530, 562, 594, 626, 658, 690,
722, and 754.
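A decoder could recognize the reserved-nop opcodes as sketched below. This is an illustration only; it assumes the usual X-form layout of a 32-bit Power ISA instruction, with the primary opcode in bits 0:5 and a 10-bit extended opcode in bits 21:30, and the function names are hypothetical.

```python
RESERVED_NOP_XO = {530, 562, 594, 626, 658, 690, 722, 754}


def primary_opcode(insn: int) -> int:
    """Bits 0:5 of the 32-bit instruction word (the six most significant bits)."""
    return (insn >> 26) & 0x3F


def extended_opcode(insn: int) -> int:
    """Bits 21:30 of the instruction word (assumed X-form extended opcode)."""
    return (insn >> 1) & 0x3FF


def is_reserved_nop(insn: int) -> bool:
    """True for the reserved-nop encodings listed in the text."""
    return primary_opcode(insn) == 31 and extended_opcode(insn) in RESERVED_NOP_XO
```

An instruction word of all zeros has primary opcode 0, so it is never classified as a reserved-nop here, consistent with it being an illegal instruction.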
2.8 Implemented Instruction Set Summary
This section provides an overview of the various types and categories of instructions implemented within the
A2 core. Appendix A Processor Instruction Summary on page 737 lists each implemented instruction alphabetically
(and by opcode) along with a short-form description and its extended mnemonics.
Table 2-8 summarizes the A2 core instruction set by category. Instructions within each category are
described in subsequent sections.
Note: The A2 core does not implement any device control registers (DCRs). Move to and move from DCR instructions are dropped
silently. They are no-ops and do not cause an exception.
system call, return from interrupt, return from critical interrupt, return
from machine check interrupt
data allocate, data invalidate, data touch, data zero, data flush, data
store, instruction invalidate, instruction touch
2.8.1 Integer Instructions
Integer instructions transfer data between memory and the GPRs and perform various operations on the
GPRs. This category of instructions is further divided into nine subcategories, described in the following
sections.
2.8.1.1 Integer Storage Access Instructions
Integer storage access instructions load and store data between memory and the GPRs. These instructions
operate on bytes, halfwords, and words. Integer storage access instructions also support loading and storing
multiple registers, character strings, and byte-reversed data, and loading data with sign-extension.
Table 2-9 shows the integer storage access instructions in the A2 core. In the table, the syntax “[u]” indicates
that the instruction has both an “update” form (in which the RA addressing register is updated with the calculated address) and a “nonupdate” form. Similarly, the syntax “[x]” indicates that the instruction has both an
“indexed” form (in which the address is formed by adding the contents of the RA and RB GPRs) and a
“base + displacement” form (in which the address is formed by adding a 16-bit signed immediate value, specified as part of the instruction, to the contents of GPR RA).
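The two address forms can be sketched as follows. This is a simplified model with hypothetical function names; it assumes 64-bit effective addresses that wrap modulo 2^64 and deliberately ignores special cases such as how an RA field of 0 is treated.

```python
MASK64 = (1 << 64) - 1


def sign_extend16(d: int) -> int:
    """Interpret the low 16 bits of d as a signed immediate."""
    d &= 0xFFFF
    return d - 0x10000 if d & 0x8000 else d


def ea_displacement(ra_value: int, d: int) -> int:
    """Base + displacement form: contents of GPR RA plus the signed 16-bit immediate."""
    return (ra_value + sign_extend16(d)) & MASK64


def ea_indexed(ra_value: int, rb_value: int) -> int:
    """Indexed form: contents of GPR RA plus contents of GPR RB."""
    return (ra_value + rb_value) & MASK64
```

For an update form, the calculated address returned by either function would also be written back to GPR RA.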
1. If the storage operand spans two virtual pages that have different storage control attributes, an alignment exception occurs.
2. Only valid if the request is a cache-inhibited load or a store request with the L2 interface in 16-byte mode.
Virtual Page    None    32B Block    16B Block(2)    Virtual Page
2.8.1.2 Integer Arithmetic Instructions
Arithmetic operations are performed on integer or ordinal operands stored in registers. Instructions that
perform operations on two operands are defined in a 3-operand format; an operation is performed on the
operands, which are stored in two registers. The result is placed in a third register. Instructions that perform
operations on one operand are defined in a 2-operand format; the operation is performed on the operand in a
register, and the result is placed in another register. Several instructions also have immediate formats in
which one of the source operands is a field in the instruction.
Most integer arithmetic instructions have versions that can update CR[CR0] and/or XER[SO, OV] (Summary
Overflow, Overflow), based on the result of the instruction. Some integer arithmetic instructions also update
XER[CA] (Carry) implicitly. See Integer Processing on page 110 for more information about how these
instructions update the CR and/or the XER.
Table 2-12 lists the integer arithmetic instructions in the A2 core. In the table, the syntax “[o]” indicates that
the instruction has both an “o” form (which updates the XER[SO,OV] fields) and a “non-o” form. Similarly, the
syntax “[.]” indicates that the instruction has both a “record” form (which updates CR[CR0]) and a “nonrecord”
form.
2.8.1.3 Integer Logical Instructions
Table 2-13 lists the integer logical instructions in the A2 core. See Integer Arithmetic Instructions on page 91
for an explanation of the “[.]” syntax.
Table 2-13. Integer Logical Instructions
Operation              Instructions
And                    and[.], andi., andis.
And with Complement    andc[.]
Nand                   nand[.]
Or                     or[.], ori, oris
Or with Complement     orc[.]
Nor                    nor[.]
Xor                    xor[.], xori, xoris
Equivalence            eqv[.]
Extend Sign            extsb[.], extsh[.], extsw[.]
Count Leading Zeros    cntlzw[.], cntlzd[.]
Permute                bpermd
Parity                 prtyw, prtyd
2.8.1.4 Integer Compare Instructions
These instructions perform arithmetic or logical comparisons between two operands and update the CR with
the result of the comparison.
Table 2-14 lists the integer compare instructions in the A2 core.
Table 2-14. Integer Compare Instructions
Arithmetic    Logical
cmp
cmpi
cmpb
cmpl
cmpli
2.8.1.5 Integer Trap Instructions
Table 2-15 lists the integer trap instructions in the A2 core.
Table 2-15. Integer Trap Instructions
Trap
tw
twi
td
tdi
2.8.1.6 Integer Rotate Instructions
These instructions rotate operands stored in the GPRs. Rotate instructions can also mask rotated operands.
Table 2-16 lists the rotate instructions in the A2 core. See Integer Arithmetic Instructions on page 91 for an
explanation of the “[.]” syntax.
Table 2-16. Integer Rotate Instructions
Operation            Instructions
Rotate and Insert    rlwimi[.], rldimi[.]
Rotate and Mask      rlwinm[.], rlwnm[.]
Rotate and Clear     rldcl[.], rldcr[.], rldic[.], rldicl[.], rldicr[.]
2.8.1.7 Integer Shift Instructions
Table 2-17 lists the integer shift instructions in the A2 core. Note that the shift right algebraic instructions
implicitly update the XER[CA] field. See Integer Arithmetic Instructions on page 91 for an explanation of the
“[.]” syntax.
Table 2-17. Integer Shift Instructions
Operation                Instructions
Shift Left               slw[.], sld[.]
Shift Right              srw[.], srd[.]
Shift Right Algebraic    sraw[.], srawi[.], srad[.], sradi[.]
2.8.1.8 Integer Population Count Instructions
Table 2-18 lists the integer population count instructions in the A2 core.
Table 2-18. Integer Population Count Instructions
Pop Count
popcntb
popcntw
popcntd
2.8.1.9 Integer Select Instruction
Table 2-19 lists the integer select instruction in the A2 core. The RA operand is 0 if the RA field of the instruction is 0; it is the contents of GPR(RA) otherwise.
Table 2-19. Integer Select Instruction
Integer Select
isel
2.8.2 Branch Instructions
These instructions unconditionally or conditionally branch to an address. Conditional branch instructions can
test condition codes set in the CR by a previous instruction and branch accordingly. Conditional branch
instructions can also decrement and test the Count Register (CTR) as part of branch determination and can
save the return address in the Link Register (LR). The target address for a branch can be a displacement
from the current instruction address or an absolute address or contained in the LR or CTR.
See Branch Processing on page 99 for more information about branch operations.
Table 2-20 lists the branch instructions in the A2 core. In the table, the syntax “[l]” indicates that the instruction
has both a “link update” form (which updates LR with the address of the instruction after the branch) and
a “nonlink update” form. Similarly, the syntax “[a]” indicates that the instruction has both an “absolute
address” form (in which the target address is formed directly using the immediate field specified as part of the
instruction) and a “relative” form (in which the target address is formed by adding the specified immediate
field to the address of the branch instruction).
Table 2-20. Branch Instructions
Branch
b[l][a]
bc[l][a]
bcctr[l]
bclr[l]
2.8.3 Processor Control Instructions
Processor control instructions manipulate system registers, perform system software linkage, and synchronize processor operations. The subcategories of processor control instructions are
described below.
2.8.3.1 Condition Register Logical Instructions
These instructions perform logical operations on a specified pair of bits in the CR, placing the result in another
specified bit. The benefit of these instructions is that they can logically combine the results of several comparison operations without incurring the overhead of conditional branching between each one. Software performance can significantly improve if multiple conditions are tested at once as part of a branch decision.
Table 2-21 lists the condition register logical instructions in the A2 core.
2.8.3.2 Register Management Instructions
These instructions move data between the GPRs and control registers in the A2 core.
Table 2-22 lists the register management instructions in the A2 core.
Table 2-22. Register Management Instructions
Register    Instructions
CR          mcrf, mcrxr, mfcr, mfocrf, mtcrf, mtocrf
DCR(1)      mfdcr, mfdcrx, mfdcrux, mtdcr, mtdcrx, mtdcrux
MSR         mfmsr, mtmsr, wrtee, wrteei
SPR         mfspr, mtspr
TB          mttb
1. When CCR2(EN_DCR) is zero, DCR instructions are dropped silently. They are no-ops and do not cause an exception.
2.8.3.3 System Linkage Instructions
These instructions invoke the supervisor software level for system services and return from interrupts.
When executing in the guest state (MSR[GS,PR] = 0b10), execution of an rfi instruction is mapped to rfgi
and the rfgi instruction is executed in place of the rfi.
Table 2-23 lists the system linkage instructions in the A2 core.
Table 2-23. System Linkage Instructions
ehpriv
rfi
rfci
rfgi
rfmci
sc
2.8.3.4 Processor Control Instructions
The msgsnd and msgclr instructions are provided for sending and clearing messages to processors and
other devices in the coherence domain. These instructions are hypervisor privileged.
Table 2-24 shows the processor control instructions in the A2 core.
Table 2-24. Processor Control Instruction
msgsnd
msgclr
2.8.4 Storage Control Instructions
These instructions manage the instruction and data caches and the TLB of the A2 core. Instructions are also
provided to synchronize and order storage accesses. The subcategories of storage
control instructions are described in the following sections.
2.8.4.1 Cache Management Instructions
These instructions control the operation of the data and instruction caches. Instructions are provided to fill,
flush, invalidate, or zero data cache blocks, where a block is defined as a 64-byte cache line. Instructions are
also provided to fill or invalidate instruction cache blocks.
Table 2-25 lists the cache management instructions in the A2 core.
Table 2-25. Cache Management Instructions
Cache                Instructions
Data Cache           dcba, dcbf, dcbi, dcbst, dcbt, dcbtst, dcbz, dcbtls, dcbtstls, dcblc
Instruction Cache    icbi, icbt, icbtls, icblc
Table 2-26. Cache Management Instructions by External Process ID
Cache                Instructions
Data Cache           dcbstep, dcbtep, dcbfep, dcbtstep, dcbzep
Instruction Cache    icbiep
2.8.4.2 TLB Management Instructions
The TLB management instructions read and write entries of the TLB array and search the TLB array for an
entry that will translate a given virtual address.
Table 2-27 lists the TLB management instructions in the A2 core. See Integer Arithmetic Instructions on
page 91 for an explanation of the “[.]” syntax.
Table 2-27. TLB Management Instructions
tlbre
tlbsx[.]
tlbsync
tlbwe
tlbivax
2.8.4.3 Processor Synchronization Instruction
The processor synchronization instruction, isync, forces the processor to complete all instructions preceding
the isync before allowing any context changes as a result of any instructions that follow the isync. Additionally, all instructions that follow the isync will execute within the context established by the completion of all
the instructions that precede the isync. See Synchronization on page 122 for more information about the
synchronizing effect of isync.
Table 2-28 shows the processor synchronization instructions in the A2 core.
Table 2-28. Processor Synchronization Instruction
isync
sync
2.8.4.4 Load and Reserve and Store Conditional Instructions
The load and reserve and store conditional instructions can be used to construct a sequence of instructions
that appears to perform an atomic update operation on an aligned storage location.
The A2 core implements the exclusive access hint (EH) included in load and reserve instructions.
Table 2-29. Load and Reserve and Store Conditional Instructions
          Word      Double
Loads     lwarx     ldarx
Stores    stwcx.    stdcx.
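The retry pattern that these instructions support can be modeled in software. The sketch below is a single-threaded simulation, not A2 code: ReservationMemory, atomic_add, and the interfere hook are hypothetical names, and the model assumes that any store to the reserved location cancels the reservation.

```python
class ReservationMemory:
    """Toy memory with one reservation, standing in for load-and-reserve semantics."""

    def __init__(self):
        self.cells = {}
        self.reservation = None

    def load_reserve(self, addr):
        self.reservation = addr          # like lwarx/ldarx: load and set reservation
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value         # ordinary store
        if self.reservation == addr:
            self.reservation = None      # modeled as cancelling the reservation

    def store_conditional(self, addr, value) -> bool:
        ok = self.reservation == addr    # like stwcx./stdcx.: store only if reserved
        self.reservation = None
        if ok:
            self.cells[addr] = value
        return ok


def atomic_add(mem, addr, delta, interfere=None):
    """Retry load-reserve / store-conditional until the update succeeds; return the old value."""
    while True:
        old = mem.load_reserve(addr)
        if interfere is not None:        # optional one-shot interfering store, for demonstration
            interfere()
            interfere = None
        if mem.store_conditional(addr, old + delta):
            return old
```

If another store hits the location between the load and the conditional store, the conditional store fails and the loop repeats with the fresh value, which is what makes the sequence appear atomic.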
2.8.4.5 Storage Synchronization Instructions
The storage synchronization instructions allow software to enforce ordering amongst the storage accesses
caused by load and store instructions, which by default are weakly-ordered by the processor. “Weakly-ordered” means that the processor is architecturally permitted to perform loads and stores generally out-of-order with respect to their sequence within the instruction stream, with some exceptions. However, if a
storage synchronization instruction is executed, then all storage accesses prompted by instructions
preceding the synchronizing instruction must be performed before any storage accesses prompted by
instructions that come after the synchronizing instruction. See Synchronization on page 122 for more information about storage synchronization.
msync is an extended mnemonic for the synchronize instruction so that it can be coded with the L value as
part of the mnemonic rather than as a numeric operand.
Table 2-30 shows the storage synchronization instructions in the A2 core.
Table 2-30. Storage Synchronization Instructions
msync
mbar
2.8.4.6 Wait Instruction
The wait instruction allows instruction fetching and execution to be suspended under certain conditions,
depending on the value of the WC field. WC = 11 is treated as a no-op instruction. WC = 10 specifies a wake
condition determined by an A2 input signal called an_ac_sleep_en.
Table 2-31 shows the wait instructions in the A2 core.
Table 2-31. Wait Instruction
wait
2.8.5 Initiate Coprocessor Instructions
Initiation of a coprocessor is requested by issuing the Initiate Coprocessor Store Word Indexed (icswx)
instruction. A coprocessor is not a standard processor, but instead is a specialized processor that is capable
of one or more particular tasks with the intent to provide acceleration of each task that might have otherwise
been done by the program. See Section 12.5 Coprocessor Instructions on page 513.
Table 2-32 shows the icswx instructions in the A2 core.
Table 2-32. Initiate Coprocessor Instructions
icswx[.]
icswepx[.]
2.8.5.1 Cache Initialization Instructions
The dci and ici instructions are privileged instructions, and if executed in supervisor mode they will flash
invalidate the entire associated cache. They do not generate an address, nor are they affected by the access
control mechanism.
Table 2-33 shows the cache initialization instructions in the A2 core.
Table 2-33. Cache Initialization Instructions
dci
ici
The dci and ici instructions have a CT field. The following describes the effects of the CT field.
• CT = 0 indicates L1 only. The L1 cache is invalidated and the request is not sent to the L2.
• CT = 2 indicates L1 and L2. The L1 cache is invalidated and the request is sent to the L2.
• CT != 0,2 indicates a no-op. No L1 caches are invalidated and the request is not sent to the L2.
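The CT field rules listed above amount to a three-way decode, sketched here. The function name and the tuple return shape are assumptions made for illustration.

```python
def cache_invalidate_scope(ct: int):
    """Return (invalidate_l1, send_to_l2) for a dci or ici with the given CT value."""
    if ct == 0:
        return (True, False)   # L1 only
    if ct == 2:
        return (True, True)    # L1 and L2
    return (False, False)      # any other CT value is a no-op
```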
2.9 Branch Processing
The four branch instructions provided by the A2 core are summarized in Section 2.8.2 on page 94. The following
sections provide additional information about branch addressing, instruction fields, prediction, and registers.
2.9.1 Branch Addressing
The branch instruction (b[l][a]) specifies the displacement of the branch target address as a 26-bit value (the
24-bit LI field right-extended with 0b00). This displacement is regarded as a signed 26-bit number covering an
address range of 32 MB. Similarly, the branch conditional instruction (bc[l][a]) specifies the displacement as
a 16-bit value (the 14-bit BD field right-extended with 0b00). This displacement covers an address range of
32 KB.
For the relative form of the branch and branch conditional instructions (b[l] and bc[l], with instruction field
AA = 0), the target address is the address of the branch instruction itself (the current instruction address, or
CIA) plus the signed displacement. This address calculation is defined to “wrap around” from the maximum
effective address (0xFFFF_FFFF_FFFF_FFFF) to 0x0000_0000_0000_0000 and vice-versa.
For the absolute form of the branch and branch conditional instructions (ba[l] and bca[l], with instruction field
AA = 1), the target address is the sign-extended displacement. This means that with absolute forms of the
branch and branch conditional instructions, the branch target can be within the first or last 32 MB or 32 KB of
the address space, respectively.
The other two branch instructions, bclr (branch conditional to LR) and bcctr (branch conditional to CTR), do
not use absolute or relative addressing. Instead, they use indirect addressing, in which the target of the
branch is specified indirectly as the contents of the LR or CTR.
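The relative and absolute target calculations described in this section can be sketched as follows. The function names are hypothetical; li and bd are the raw 24-bit LI and 14-bit BD field values, aa is the AA field, and addresses are modeled as 64-bit values that wrap around.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def b_target(cia: int, li: int, aa: int) -> int:
    """Target of b[l][a]: LI right-extended with 0b00 gives a signed 26-bit displacement."""
    disp = sign_extend((li & 0xFFFFFF) << 2, 26)
    return (disp if aa else cia + disp) & MASK64


def bc_target(cia: int, bd: int, aa: int) -> int:
    """Target of bc[l][a]: BD right-extended with 0b00 gives a signed 16-bit displacement."""
    disp = sign_extend((bd & 0x3FFF) << 2, 16)
    return (disp if aa else cia + disp) & MASK64
```

With aa = 1 and a negative displacement, the result falls in the last 32 MB (or 32 KB) of the address space, matching the description of the absolute forms.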
2.9.2 Branch Instruction BI Field
Conditional branch instructions can optionally test one bit of the CR, as indicated by instruction field BO[0]
(see Section 2.9.3). The value of instruction field BI specifies the CR bit to be tested (32-63). The BI field is
ignored if BO[0] = 1. The branch (b[l][a]) instruction is by definition unconditional; hence, it does not have a BI
instruction field. Instead, the position of this field is part of the LI displacement field.
2.9.3 Branch Instruction BO Field
The BO field specifies the condition under which a conditional branch is taken and whether the branch decrements the CTR as shown in Table 2-34. In the table, M = 0 in 64-bit mode and M = 32 in 32-bit mode. The
branch (b[l][a]) instruction is by definition unconditional; hence, it does not have a BO instruction field.
Instead, the position of this field is part of the LI displacement field.
Conditional branch instructions can optionally test one bit in the CR. This option is selected when BO[0] = 0. If
BO[0] = 1, the CR does not participate in the branch condition test. If the CR condition option is selected, the
condition is satisfied (branch can occur) if the CR bit selected by the BI instruction field matches BO[1].
Conditional branch instructions can also optionally decrement the CTR by one and test whether the decremented value is 0. This option is selected when BO[2] = 0. If BO[2] = 1, the CTR is not decremented and
does not participate in the branch condition test. If the CTR decrement option is selected, BO[3] specifies the
condition that must be satisfied to allow the branch to be taken. If BO[3] = 0, CTR ≠ 0 is required for the
branch to occur. If BO[3] = 1, CTR = 0 is required for the branch to occur.
Table 2-34. BO Field Encodings
BO       Description
0000z    Decrement the CTR, then branch if the decremented CTR[M:63] ≠ 0 and CR[BI] = 0.
0001z    Decrement the CTR, then branch if the decremented CTR[M:63] = 0 and CR[BI] = 0.
001at    Branch if CR[BI] = 0.
0100z    Decrement the CTR, then branch if the decremented CTR[M:63] ≠ 0 and CR[BI] = 1.
0101z    Decrement the CTR, then branch if the decremented CTR[M:63] = 0 and CR[BI] = 1.
011at    Branch if CR[BI] = 1.
1a00t    Decrement the CTR, then branch if the decremented CTR[M:63] ≠ 0.
1a01t    Decrement the CTR, then branch if the decremented CTR[M:63] = 0.
1z1zz    Branch always.
Notes:
1. ‘z’ denotes a bit that is ignored.
2. The ‘a’ and ‘t’ bits are used as described in Table 2-35 on page 100.
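The BO and BI rules summarized in Table 2-34 can be modeled in a few lines of Python. This is only an illustrative sketch, not a description of the hardware: the function names are hypothetical, the CR is assumed to be passed as a 32-bit integer holding CR bits 32:63 (bit 32 most significant), and BO[0] is taken to be the most significant bit of the 5-bit BO field.

```python
MASK64 = (1 << 64) - 1


def cr_bit(cr, bi):
    """Return CR bit 32 + bi from a 32-bit integer holding CR[32:63]."""
    return (cr >> (31 - bi)) & 1


def bo_bit(bo, n):
    """Return BO[n], where BO[0] is the most significant of the five bits."""
    return (bo >> (4 - n)) & 1


def evaluate_branch(bo, bi, cr, ctr, mode64=True):
    """Return (taken, new_ctr) for a conditional branch.

    mode64 selects M = 0 (64-bit mode) or M = 32 (32-bit mode).
    """
    # BO[2] = 0 selects the decrement-and-test option.
    if bo_bit(bo, 2) == 0:
        ctr = (ctr - 1) & MASK64
    # Only CTR[M:63] takes part in the test.
    tested = ctr if mode64 else ctr & 0xFFFFFFFF
    # BO[3] = 1 requires CTR = 0; BO[3] = 0 requires CTR != 0.
    ctr_ok = bo_bit(bo, 2) == 1 or ((tested == 0) == (bo_bit(bo, 3) == 1))
    # BO[0] = 0 selects the CR test; CR[BI] must match BO[1].
    cr_ok = bo_bit(bo, 0) == 1 or cr_bit(cr, bi) == bo_bit(bo, 1)
    return (ctr_ok and cr_ok), ctr
```

For example, evaluate_branch(0b10000, 0, 0, 2) models a 1a00t encoding with CTR = 2: the CTR is decremented to 1 and the branch is taken because the decremented value is nonzero.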
The “a” and “t” bits of the BO field can be used by software to provide a hint about whether the branch is likely
to be taken or is likely not to be taken, as shown in Table 2-35.
Table 2-35. ‘at’ Bit Encodings
at Hint
00 No hint is given.
01 Reserved.
10 The branch is very likely not to be taken.
11 The branch is very likely to be taken.
This implementation has dynamic mechanisms for predicting whether a branch will be taken. Because the
dynamic prediction is likely to be very accurate, and because any hint provided by the “at” bits is likely to
override it, the “at” bits should be set to 0b00 unless the static prediction implied by at = 0b10 or at = 0b11
is highly likely to be correct.
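Because the position of the “a” bit depends on the BO encoding, extracting the hint takes a small amount of decoding. The sketch below follows Table 2-34 and Table 2-35; the function names and the hint strings are hypothetical, and BO[0] is assumed to be the most significant bit of the 5-bit field.

```python
HINTS = {
    0b00: "no hint",
    0b01: "reserved",
    0b10: "very likely not taken",
    0b11: "very likely taken",
}


def at_bits(bo):
    """Return the 2-bit 'at' value of a BO field, or None if it has no 'at' bits."""
    b = [(bo >> (4 - n)) & 1 for n in range(5)]  # b[n] is BO[n]
    if b[0] == 0 and b[2] == 1:
        # 001at and 011at: 'a' is BO[3], 't' is BO[4].
        return (b[3] << 1) | b[4]
    if b[0] == 1 and b[2] == 0:
        # 1a00t and 1a01t: 'a' is BO[1], 't' is BO[4].
        return (b[1] << 1) | b[4]
    # 0000z, 0001z, 0100z, 0101z, and 1z1zz carry no hint bits.
    return None


def branch_hint(bo):
    """Return the Table 2-35 hint for a BO field, or None if none is encoded."""
    at = at_bits(bo)
    return None if at is None else HINTS[at]
```

With these definitions, branch_hint(0b01111) reports "very likely taken" for a 011at encoding with at = 0b11, and branch_hint(0b10100) returns None because the branch-always encoding has no “at” bits.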
2.9.4 Branch Prediction
The following sections detail the methods by which the branch predictor decodes incoming branches, generates predictions for both the direction and target of these branches, and guides instruction flow based on
these predictions.
2.9.4.1 Branch Decoder
Before reaching the branch predictor itself, every instruction cache line passes through the branch decoder. The
primary purpose of the branch decoder is to identify any valid branch instructions contained within the cache
line. Valid branches include b, bc, bclr, bcctr, and their derivatives.
The branch decoder also decodes any hints contained within the branch instructions. Hints can be specified
for any branch conditional instruction (bc, bclr, bcctr, and their derivatives). Hints are encoded in the branch
instruction's BO field.
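As a rough software analogy of the identification step, the following Python sketch scans a sequence of 32-bit instruction words and reports the branches it finds. It is not a description of the A2 hardware: it assumes the Power ISA encodings (primary opcode 18 for b, 16 for bc, and primary opcode 19 with extended opcode 16 for bclr or 528 for bcctr, with BO in instruction bits 6:10), and the function names are hypothetical.

```python
def classify_branch(insn):
    """Return 'b', 'bc', 'bclr', or 'bcctr' for a branch word, else None."""
    opcode = (insn >> 26) & 0x3F          # instruction bits 0:5
    if opcode == 18:
        return "b"
    if opcode == 16:
        return "bc"
    if opcode == 19:
        xo = (insn >> 1) & 0x3FF          # instruction bits 21:30
        if xo == 16:
            return "bclr"
        if xo == 528:
            return "bcctr"
    return None


def find_branches(words):
    """Return (index, kind, BO) for each branch in a sequence of instruction words.

    BO is None for the unconditional b family, which has no BO field.
    """
    found = []
    for index, insn in enumerate(words):
        kind = classify_branch(insn)
        if kind is None:
            continue
        bo = None if kind == "b" else (insn >> 21) & 0x1F  # bits 6:10
        found.append((index, kind, bo))
    return found
```

The LK and AA bits are ignored here, so derivatives such as bl, ba, and bclrl classify under the same four families.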