INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER
INTELLECTUAL PROPERTY RIGHT.
Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the
presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by
estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.
Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Intel® PXA27x Processor Family may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
MPEG is an international standard for video compression/decompression promoted by ISO. Implementations of MPEG CODECs, or MPEG enabled
platforms may require licenses from various entities, including Intel Corporation.
This document and the software described in it are furnished under license and may only be used or copied in accordance with the terms of the
license. The information in this document is furnished for informational use only, is subject to change without notice, and should not be construed as a
commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this
document or any software that may be provided in association with this document. Except as permitted by such license, no part of this document may
be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling
1-800-548-4725 or by visiting Intel's website at http://www.intel.com.
Intel® PXA27x Processor Family Optimization Guide
1 Introduction
1.1 About This Document
This document is a guide to optimizing software, the operating system, and system configuration to
best use the Intel® PXA27x Processor Family (PXA27x processor) feature set. The Intel®
PXA27x Processor Family consists of:
• Intel® PXA270 Processor – discrete processor
• Intel® PXA271 Processor – 32 MBytes of Intel StrataFlash® Memory and 32 MBytes of Low
Optimization” discusses how to optimize software (mostly at the assembly programming
level) to take advantage of the Intel XScale® Microarchitecture and Intel® Wireless MMX™
technology media co-processor.
• Chapter 5, “High Level Language Optimization” is a set of guidelines for C and C++ code
developers to maximize the performance by making the best use of the system resources.
• Chapter 6, “Power Optimization” discusses the trade-offs between performance and power
using the PXA27x processor.
• Appendix A, “Performance Checklist” is a set of guidelines for system level optimizations
that help obtain greater performance when using the Intel XScale® Microarchitecture
and the PXA27x processor.
1.2 High-Level Overview
Mobile and wireless devices simplify our lives, keep us entertained, increase productivity and
maximize our responsiveness. Enterprise and individual consumers alike realize the potential and
are integrating these products at a rapid rate into their everyday life. Customer expectations exceed
what is being delivered today. The desire to communicate and compute wirelessly - to have access
to information anytime, anywhere, is the expectation. Manufacturers require technologies that
deliver high performance, flexibility, and robust functionality, all in the small-size, low-power
framework of mobile handheld, battery-powered devices. The Intel® Personal Internet Client
Architecture (Intel® PCA) processors with Intel XScale® Microarchitecture help drive wireless
handheld device functionality to new heights to meet customer demand. Combining low-power,
high-performance, compelling new features and second generation memory stacking, Intel PCA
processors help to redefine what a mobile device can do to meet many of the performance demands
of Enterprise-class wireless computing and feature-hungry technology consumers.
Targeted at wireless handhelds and handsets such as cell phones and PDAs with full featured
operating systems, the Intel PXA27x processor family is the next generation of ultra-low-power
applications with industry leading multimedia performance for wireless clients. The Intel PXA27x
processor is a highly integrated solution that includes Wireless Intel Speedstep® technology for
ultra-low-power, Intel® Wireless MMX™ technology and up to 624 MHz for
multimedia capabilities, and Intel® Quick Capture Interface to give customers the ability to capture
high quality images and video.
The PXA27x processor incorporates a comprehensive set of system and peripheral functions that
make it useful in a variety of low-power applications. The block diagram of
the PXA27x processor system-on-a-chip shows a primary system bus with the Intel XScale®
Microarchitecture core (Intel XScale® core) attached along with an LCD controller, USB Host
controller and 256 KBytes of internal memory. The system bus is connected to a memory
controller to allow communication to a variety of external memory or companion-chip devices, and
it is also connected to a DMA/bridge to allow communication with the on-chip peripherals. The
key features of all the sub-blocks are described in this section, with more detail provided in
subsequent sections.
1.2.1 Intel XScale® Microarchitecture and Intel XScale® core
The Intel XScale® Microarchitecture is based on a core that is ARM* version 5TE compliant. The
microarchitecture surrounds the core with instruction and data memory management units;
instruction, data, and mini-data caches; write, fill, pend, and branch-target buffers; power
management, performance monitoring, debug, and JTAG units; coprocessor interface; 32K caches;
MMUs; BTB; MAC coprocessor; and core memory bus.
The Intel XScale® Microarchitecture can be combined with peripherals to provide
application-specific standard products (ASSPs) targeted at selected market segments. For example, the RISC
core can be integrated with peripherals such as an LCD controller, multimedia controllers, and an
external memory interface to empower OEMs to develop smaller, more cost-effective handheld
devices with long battery life, with the performance to run rich multimedia applications. Or the
microarchitecture could be surrounded by high-bandwidth PCI interfaces, memory controllers, and
networking micro-engines to provide a highly integrated, low-power, I/O or network processor.
1.2.2 Intel XScale® Microarchitecture Features
• Superpipelined RISC technology achieves high speed and low power
• Wireless Intel Speedstep® technology allows on-the-fly voltage and frequency scaling to
enable applications to use the right blend of performance and power
• Media processing technology enables the MAC coprocessor to perform two simultaneous 16-bit
SIMD multiplies with 64-bit accumulation for efficient media processing
• Power management unit provides power savings via multiple low-power modes
• 32-Kbyte instruction cache (I-cache) keeps local copy of important instructions to enable high
performance and low power
• 32-Kbyte data cache (D-cache) keeps local copy of important data to enable high performance
and low power
• 2-Kbyte mini-data cache avoids “thrashing” of the D-cache for frequently changing data
streams
• 32-entry instruction memory management unit enables logical-to-physical address translation,
access permissions, I-cache attributes
• 32-entry data memory management unit enables logical-to-physical address translation, access
permissions, D-cache attributes
• 4-entry Fill and Pend buffers promote core efficiency by allowing “hit-under-miss” operation
with data caches
• Performance monitoring unit furnishes two 32-bit event counters and one 32-bit cycle counter
for analysis of hit rates
• Debug unit uses hardware breakpoints and 256-entry Trace History buffer (for flow change
messages) to debug programs
• 32-bit coprocessor interface provides high performance interface between core and
coprocessors
• 8-entry Write buffer allows the core to continue execution while data is written to memory
See the Intel XScale® Microarchitecture Users Guide for additional information.
1.2.3 Intel® Wireless MMX™ technology
The Intel XScale® Microarchitecture has attached to it a coprocessor to accelerate multimedia
applications. This coprocessor, characterized by a 64-bit Single Instruction Multiple Data (SIMD)
architecture and compatibility with the integer functionality of the Intel® MMX™ technology
and SSE instruction sets, is known by its Intel project name, Intel® Wireless MMX™
technology. The key features of this coprocessor are:
• 30 new media processing instructions
• 64-bit architecture up to eight-way SIMD
• 16 x 64-bit register file
• SIMD PSR flags with group conditional execution support
• Instruction support for SIMD, SAD, and MAC
• Instruction support for alignment and video
• Intel® MMX™ technology and SSE integer compatibility
• Superset of existing Intel XScale® Microarchitecture media processing instructions
See the Intel® Wireless MMX™ technology Coprocessor EAS for more details.
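As an illustration of what "up to eight-way SIMD" over a 64-bit register means, the following Python sketch treats a 64-bit integer as eight unsigned byte lanes and computes a lane-wise wrapping add and a sum of absolute differences (SAD). Python is used here only as executable pseudocode. The function names are hypothetical and do not correspond to coprocessor mnemonics, and lane 0 is assumed to be the least significant byte.

```python
def unpack_bytes(word64):
    """Split a 64-bit value into eight unsigned 8-bit lanes (lane 0 = least significant)."""
    return [(word64 >> (8 * lane)) & 0xFF for lane in range(8)]


def pack_bytes(lanes):
    """Inverse of unpack_bytes: combine eight 8-bit lanes into one 64-bit value."""
    return sum((value & 0xFF) << (8 * lane) for lane, value in enumerate(lanes))


def add8_wrap(a, b):
    """Lane-wise modular addition; a carry never crosses into the neighboring lane."""
    return pack_bytes([x + y for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


def sad8(a, b):
    """Sum of absolute differences across the eight byte lanes of two 64-bit words."""
    return sum(abs(x - y) for x, y in zip(unpack_bytes(a), unpack_bytes(b)))
```

One call to either function stands in for work that scalar code would perform with eight separate byte operations, which is the source of the speedup in pixel-oriented kernels such as motion estimation.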
1.2.4 Memory Architecture
1.2.4.1 Caches
There are two caches:
• Data cache – The PXA27x processor supports 32 Kbytes of data cache.
• Instruction Cache – The PXA27x processor supports 32 Kbytes of instruction cache.
1.2.4.2 Internal Memories
The key features of the PXA27x processor internal memory are:
• 256 Kbytes of on-chip SRAM arranged as four banks of 64 Kbytes
• Bank-by-bank power management with automatic power management for reduced power
consumption
• Byte write support
1.2.4.3 External Memory Controller
The PXA27x processor supports a memory controller for external memory which can access:
• SDRAM up to 100 MHz at 1.8 Volts.
• Flash memories
• Synchronous ROM
• SRAM
• Variable latency input/output (VLIO) memory
• PC card and compact flash expansion memory
1.2.5 Processor Internal Communications
The PXA27x processor supports a hierarchical bus architecture. A system bus supports high
bandwidth peripherals, and a slower peripheral bus supports peripherals with lower data
throughputs.
1.2.5.1 System Bus
• Interconnection between the major components is through the system bus.
• 64-bit wide, address and data multiplexed bus.
• The system bus allows split transactions, increasing the maximum data-throughput in the
system.
• Different burst sizes are allowed; up to 4 data phases per transaction (that is, 32 bytes). The
burst size is set in silicon for each peripheral and is not configurable.
• The system bus can operate at different frequency ratios with respect to the Intel XScale® core
(up to 208 MHz). The frequency control of the system bus is pivotal to striking a balance
between the desired performance and power consumption.
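The burst figures above follow directly from the bus width: a 64-bit bus moves 8 bytes per data phase, so 4 data phases move 32 bytes. The short Python sketch below works through that arithmetic and derives a rough upper bound on throughput. Because the bus multiplexes address and data, the sketch assumes one address phase per transaction; that value, like the function names, is an assumption for illustration and not a figure from this guide. Arbitration and split-transaction overlap are ignored.

```python
BUS_WIDTH_BYTES = 64 // 8      # 64-bit system bus
MAX_DATA_PHASES = 4            # largest burst allowed on the system bus


def burst_bytes(data_phases):
    """Bytes moved by one transaction with the given number of data phases."""
    if not 1 <= data_phases <= MAX_DATA_PHASES:
        raise ValueError("the system bus allows 1 to 4 data phases per transaction")
    return data_phases * BUS_WIDTH_BYTES


def peak_mbytes_per_second(bus_mhz, data_phases, address_phases=1):
    """Upper-bound throughput in MBytes/s (address_phases=1 is an assumption)."""
    cycles_per_transaction = address_phases + data_phases
    return bus_mhz * burst_bytes(data_phases) / cycles_per_transaction
```

Under these assumptions, longer bursts amortize the address phase over more data, which is why full-length bursts make better use of the bus than single-phase accesses at the same bus frequency.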
1.2.5.2 Peripheral Bus
The peripheral bus is a single master bus. The bus master arbitrates between the Intel XScale® core
and the DMA controller with a pre-defined priority scheme between them. The peripheral bus is
used by the low-bandwidth peripherals; the peripheral bus runs at 26 MHz.
1.2.5.3 Peripherals in the Processor
The PXA27x processor has a rich set of peripherals. The list of peripherals and key features are
described in the subsections below.
1.2.5.3.1 LCD Display Controller
The LCD controller supports single- or dual-panel LCD displays. Color panels without internal
frame buffers up to 262144 colors (18 bits) are supported. Color panels with internal frame buffers
up to 16777216 colors (24 bits) are supported. Monochrome panels up to 256 gray-scale levels
(8 bits) are supported.
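The color counts quoted for the LCD controller are powers of two of the pixel depth. The following Python sketch makes the relationship explicit in both directions; the function names are hypothetical.

```python
def levels_for_depth(bits_per_pixel):
    """Number of distinct colors or gray levels representable at a given pixel depth."""
    return 1 << bits_per_pixel


def depth_for_levels(levels):
    """Smallest pixel depth, in bits, able to represent the given number of levels."""
    return (levels - 1).bit_length()
```

For example, `levels_for_depth(18)` gives the 262144 colors quoted for panels without internal frame buffers.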
1.2.5.3.2 DMA Controller
The PXA27x processor has a high performance DMA controller supporting memory-to-memory
transfers, peripheral-to-memory and memory-to-peripheral device transfers. It has support for
32 channels and up to 63 peripheral devices. The controller can perform descriptor chaining. DMA
supports descriptor-fetch, no-descriptor-fetch and descriptor-chaining.
1.2.5.3.3 Other Peripherals
The PXA27x processor offers this peripheral support:
• USB Client Controller with 23 programmable endpoints (compliant with USB Revision 1.1).
• USB Host controller (USB Rev. 1.1 compatible), which supports both low-speed and full-
speed USB devices through a built-in DMA controller.
• Intel® Quick Capture Interface which provides a connection between the processor and a
camera image sensor.
• Infrared Communication Port (ICP) which supports 4 Mbps data rate compliant with Infrared
Data Association (IrDA) standard.
• I2C Serial Bus Port, which is compliant with I2C standard (also supports arbitration between
multiple masters).
• USIM card interface (compliant with ISO standard 7816-3 and 3G TS 31.101)
• MSL – the physical interface of communication subsystems for mobile or wireless platforms.
The operating system and application software use this to communicate with each other.
• Keypad interface supports both direct keys and matrix keys.
• Real-time clock (RTC) controller which provides a general-purpose, real-time reference clock
for use by the system.
• The pulse width modulator (PWM) controller generates four independent PWM outputs.
• Interrupt controller identifies and controls the interrupt sources available to the processor.
• The OS timers controller provides a set of timer channels that allow software to generate timed
interrupts or wake-up events.
• General-purpose I/O (GPIO) controller for use in generating and capturing application-specific
input and output signals. Each of the 121 GPIOs (see footnote 1) may be programmed as an output,
an input (or as bidirectional for certain alternate functions).
1.2.6 Wireless Intel Speedstep® technology
Wireless Intel Speedstep® technology advances the capabilities of Intel® Dynamic Voltage
Management - a function already built into the Intel XScale® Microarchitecture - by incorporating
three new low-power states: deep idle, standby and deep sleep. The technology is able to change
both voltage and frequency on-the-fly by intelligently switching the processor into the various low
power modes, saving additional power while still providing the necessary performance to run rich
applications.
The PXA27x processor integrated microprocessor provides a rich set of flexible power-management
controls for a wide range of usage models, while enabling very low-power operation.
The key features include:
• Five reset sources:
—Power-on
— Hardware
— Watchdog
—GPIO
— Exit from sleep mode
• Three clock-speed controls to adjust frequency:
— Turbo mode
— Divisor mode
—Fast Bus mode
• Switchable clock source
• Functional clock gating
• Programmable frequency-change capability
1. 121 GPIOs are available on the PXA271 processor, PXA272 processor, and PXA273 processor. The PXA270 processor only has 119 GPIOs
bonded out.
• Six power modes to control power consumption:
—Normal
—Idle
— Deep idle
— Standby
— Sleep
— Deep sleep
• Programmable I2C-based external regulator interface to support voltage changing.
See the Intel® PXA27x Processor Family Developer’s Manual for more details.
1.3 Intel XScale® Microarchitecture Compatibility
The Intel XScale® Microarchitecture is ARM* Version 5 (V5TE) architecture compliant. The
PXA27x processor implements the integer instruction set architecture of ARM* V5TE.
Backward compatibility for user-mode applications is maintained with the earlier generations of
StrongARM* and Intel XScale® Microarchitecture processors. Operating systems may require
modifications to match the specific Intel XScale® Microarchitecture hardware features, and to take
advantage of the performance enhancements added to this core.
Memory map and register locations are backward-compatible with the previous Intel XScale®
Microarchitecture hand-held products.
The Intel® Wireless MMX™ technology instruction set is compatible with the standard ARM*
coprocessor instruction format (see The Complete Guide to Intel® Wireless MMX™ Technology
for more details).
1.3.1 PXA27x Processor Performance Features
Performance features of the PXA27x processor are:
• 32-Kbyte instruction cache
• 32-Kbyte data cache
• Intel® Wireless MMX™ technology with sixteen 64-bit registers, optimized instructions for
video, and multi-media applications.
• The PXA27x processor has an internal SRAM of 256 KBytes.
• Capability of locking entries in the instruction or data caches
• 2-Kbyte mini-data cache, separate from the data cache
• L1 caches and the mini-data cache use virtual address indices (or tags)
• Separate instruction and data Translation Lookaside buffers (TLBs), each with 32 entries
• Capability of locking entries in the TLBs
• 32-channel DMA engine with transfer-size control and descriptor chaining
This chapter contains an overview of Intel XScale® Microarchitecture and Intel® Wireless
MMX™ Technology. The Intel XScale® Microarchitecture includes a superpipelined RISC
architecture with an enhanced memory pipeline. The Intel XScale® Microarchitecture instruction
set is based on ARM* V5TE architecture; however, the Intel XScale® Microarchitecture includes
new instructions. Code developed for the Intel® StrongARM* SA-110 (SA-110), Intel®
StrongARM* SA-1100 (SA-1100), and Intel® StrongARM* SA-1110 (SA-1110) microprocessors
is portable to Intel XScale® Microarchitecture based processors. However, to obtain the maximum
performance, the code should be optimized for the Intel XScale® Microarchitecture using the
techniques presented in this document.
2.2 Intel XScale® Microarchitecture Pipeline
This section provides a brief description of the structure and behavior of Intel XScale®
Microarchitecture pipeline.
2.2.1 General Pipeline Characteristics
The following sections discuss general pipeline characteristics.
2.2.1.1 Pipeline Organization
The Intel XScale® Microarchitecture has a 7-stage pipeline operating at a higher frequency than its
predecessors allowing for greater overall performance. The Intel XScale® Microarchitecture
single-issue superpipeline consists of a main execution pipeline, a multiply-accumulate (MAC)
pipeline, and a memory access pipeline. Figure 2-1 shows the pipeline organization with the main
execution pipeline shaded.
[Figure 2-1: main execution pipeline ending in XWB; MAC pipeline M1, M2, ... Mx; memory pipeline D1, D2, DWB]
Microarchitecture Overview
Table 2-1 gives a brief description of each pipe stage and a reference for further information.
Table 2-1. Pipelines and Pipe Stages
Pipe / Pipestage           Description                            For More Information
Main Execution Pipeline    Handles data processing instructions   Section 2.2.3
• IF1/IF2                  Instruction Fetch                      Section 2.2.3.1
• ID                       Instruction Decode                     Section 2.2.3.2
• RF                       Register File / Operand Shifter        Section 2.2.3.3
• X1                       ALU Execute                            Section 2.2.3.4
• X2                       State Execute                          Section 2.2.3.5
• XWB                      Write-back                             Section 2.2.3.6
Memory Pipeline            Handles load/store instructions        Section 2.2.4
• D1/D2                    Data cache access                      Section 2.2.4.1
• DWB                      Data cache writeback                   Section 2.2.5.1
MAC Pipeline               Handles all multiply instructions      Section 2.2.5
• M1-M5                    Multiplier stages                      Section 2.2.5
• MWB (not shown)          MAC write-back occurs during M2-M5     Section 2.2.5
2.2.1.2 Out of Order Completion
While the pipeline is scalar and single-issue, instructions occupy all three pipelines at once. The
main execution pipeline, memory, and MAC pipelines have different execution times because they
are not lock-stepped. Sequential consistency of instruction execution relates to two aspects: first,
the order in which instructions are completed and second, the order in which memory is accessed
due to load and store instructions. The Intel XScale® Microarchitecture only preserves a weak processor
consistency because instructions complete out of order (assuming no data dependencies exist).
The Intel XScale® Microarchitecture can buffer up to four outstanding reads. If load operations
miss the data cache, subsequent instructions complete independently. This operation is called a
hit-under-miss operation.
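Out-of-order completion under hit-under-miss can be pictured with a small timing model. The Python sketch below issues one instruction per cycle and lets each one complete after its own latency, so an instruction behind a cache miss can finish first when it does not depend on the missing data. The latencies (3 cycles for a hit or ALU operation, 30 for a miss) and the function name are hypothetical values chosen for illustration; they are not measured PXA27x figures.

```python
def completion_order(ops, hit_latency=3, miss_latency=30):
    """Return instruction names in the order they complete.

    ops is a list of (name, is_cache_miss) tuples in program order. One
    instruction issues per cycle, and the instructions are assumed to have
    no data dependencies on each other.
    """
    finished = []
    for issue_cycle, (name, is_miss) in enumerate(ops):
        latency = miss_latency if is_miss else hit_latency
        finished.append((issue_cycle + latency, issue_cycle, name))
    # Sort by completion cycle; program order breaks ties.
    return [name for _, _, name in sorted(finished)]
```

With a missing load followed by a hitting load and an ALU instruction, the two later instructions complete before the first load does, which is the weak consistency described above.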
2.2.1.3 Use of Bypassing
The pipeline makes extensive use of bypassing to minimize data hazards. To eliminate the need to
stall the pipeline, bypassing allows results forwarding from multiple sources.
In certain situations, the pipeline must stall because of register dependencies between instructions.
A register dependency occurs when a previous MAC or load instruction is about to modify a
register value that has not returned to the register file. Core bypassing allows the current instruction
to execute when the previous instruction’s results are available without waiting for the register file
to update.
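The effect of a register dependency on issue timing can be sketched with a simple scoreboard. In the Python model below, each instruction names a destination register, its source registers, and a result latency: the number of cycles after issue at which bypassing makes the result usable by a dependent instruction. A latency of 1 means the very next instruction can consume the result without stalling. The latency values in the usage note (3 for a load, 1 for an ALU operation) and the function name are assumptions for illustration.

```python
def stall_cycles(instructions):
    """Count stall cycles for a single-issue pipeline with result bypassing.

    instructions is a list of (dest, sources, result_latency) tuples in
    program order; dest may be None for instructions that write no register.
    """
    ready_at = {}      # register -> first cycle its value can be bypassed
    cycle = 0
    stalls = 0
    for dest, sources, latency in instructions:
        earliest = max([ready_at.get(reg, 0) for reg in sources], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # wait for the outstanding result
            cycle = earliest
        if dest is not None:
            ready_at[dest] = cycle + latency
        cycle += 1
    return stalls
```

In this model a load followed immediately by its consumer stalls for two cycles, while placing two independent instructions between them removes the stall entirely. This is the scheduling opportunity that bypassing leaves to the programmer or compiler.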
2.2.2 Instruction Flow Through the Pipeline
With the exception of the MAC unit, the pipeline issues one instruction per clock cycle. Instruction
execution begins at the F1 pipestage and completes at the WB pipestage.
Although a single instruction is issued per clock cycle, all three pipelines are processing
instructions simultaneously. If there are no data hazards, each instruction completes independently
of the others.
Figure 2-1 uses arrows to show the possible flow of instructions in the pipeline. Instruction
execution flows from the F1 pipestage to the RF pipestage. The RF pipestage issues a single
instruction to either the X1 pipestage or the MAC unit (multiply instructions go to the MAC, while
all others continue to X1). This means that either M1 or X1 is idle.
After calculating the effective addresses in X1, all load and store instructions route to the memory
pipeline.
The ARM* V5TE branch and exchange (BX) instruction (used to branch between ARM* and
THUMB* code) causes the entire pipeline to be flushed. If the processor is in THUMB* mode the
ID pipestage dynamically expands each THUMB* instruction into a normal ARM* V5TE RISC
instruction and normal execution resumes.
2.2.2.2 Pipeline Stalls
Pipeline stalls can seriously degrade performance. The primary reasons for stalls are register
dependencies, load dependencies, multiple-cycle instruction latency, and unpredictable branches.
To help maximize performance, it is important to understand some of the ways to avoid pipeline
stalls. The following sections provide more detail on the nature of the pipeline and ways of
preventing stalls.
2.2.3 Main Execution Pipeline
2.2.3.1 F1 / F2 (Instruction Fetch) Pipestages
The job of the instruction fetch stages F1 and F2 is to present the next instruction to be executed to
the ID stage. Two important functional units residing within the F1 and F2 stages are the BTB and
IFU.
• Branch Target Buffer (BTB)
The BTB provides a 128-entry dynamic branch prediction buffer. An entry in the BTB is
created when a B or BL instruction branch is taken for the first time. On subsequent executions
of the branch instruction at the same address, the next instruction loaded into the pipeline is
predicted by the BTB. Once the branch type instruction reaches the X1 pipestage, its target
address is known. Execution continues without stalling if the target address is the same as the
BTB predicted address. If the address is different from the address that the BTB predicted, the
pipeline is flushed, execution starts at the new target address, and the branch’s history is
updated in the BTB.
• Instruction Fetch Unit (IFU)
The IFU is responsible for delivering instructions to the instruction decode (ID) pipestage. It
delivers one instruction word each cycle (if possible) to the ID. The instruction could come
from one of two sources: instruction cache or fetch buffers.
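The BTB behavior described above (allocate on the first taken branch, predict on later executions, check the prediction once the target is known in X1) can be sketched as a small Python model. Several details are simplifying assumptions and not statements about the hardware: the table is indexed directly by word address, an entry is simply dropped when a predicted-taken branch falls through, and ARM* mode is assumed so that the fall-through address is the branch address plus 4. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Behavioral sketch of a 128-entry branch target buffer."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = {}   # index -> (branch_address, predicted_target)

    def _index(self, address):
        # Assumption: direct-mapped on the word address.
        return (address >> 2) % self.entries

    def predict(self, address):
        """Predicted next fetch address for the instruction at `address`."""
        entry = self.table.get(self._index(address))
        if entry is not None and entry[0] == address:
            return entry[1]
        return address + 4            # no entry: predict fall-through

    def resolve(self, address, actual_next):
        """Called when the branch reaches X1. Returns True if a flush is needed."""
        mispredicted = self.predict(address) != actual_next
        if actual_next != address + 4:
            # Taken branch: create or refresh the entry.
            self.table[self._index(address)] = (address, actual_next)
        elif mispredicted:
            # Predicted taken but fell through: drop the entry (simplification).
            self.table.pop(self._index(address), None)
        return mispredicted
```

For a loop-closing branch, this model mispredicts on the first iteration (no entry yet) and again on loop exit, and predicts correctly for every iteration in between.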
2.2.3.2 Instruction Decode (ID) Pipestage
The ID pipestage accepts an instruction word from the IFU and sends register decode information
to the RF pipestage. The ID is able to accept a new instruction word from the IFU on every clock
cycle in which there is no stall. The ID pipestage is responsible for:
• General instruction decoding (extracting the opcode, operand addresses, destination addresses
and the offset).
• Detecting undefined instructions and generating an exception.
• Dynamic expansion of complex instructions into a sequence of simple instructions. Complex
instructions are defined as ones that take more than one clock cycle to issue, such as LDM,
STM, and SWP.
2.2.3.3 Register File / Shifter (RF) Pipestage
The main function of the RF pipestage is to read and write to the register file unit (RFU). It
provides source data to:
• X1 for ALU operations
• MAC for multiply operations
• Data cache for memory writes
• Coprocessor interface
The ID unit decodes the instruction and specifies the registers accessed in the RFU. Based on this
information, the RFU determines if it needs to stall the pipeline due to a register dependency. A
register dependency occurs when a previous instruction is about to modify a register value that has
not been returned to the RFU and the current instruction needs to access that same register. If no
dependencies exist, the RFU selects the appropriate data from the register file and passes it to the
next pipestage. When a register dependency does exist, the RFU keeps track of the unavailable
register. The RFU stops stalling the pipe when the result is returned.
The ARM* architecture specifies one of the operands for data processing instructions as the shifter
operand. A 32-bit shift can be performed on a value before it is used as an input to the ALU. This
shifter is located in the second half of the RF pipestage.
2.2.3.4 Execute (X1) Pipestage
The X1 pipestage performs these functions:
• ALU calculations – the ALU performs arithmetic and logic operations, as required for data
processing instructions and load/store index calculations.
• Determine conditional instruction executions – the instruction’s condition is compared to the
CPSR prior to execution of each instruction. Any instruction with a false condition is
cancelled and does not cause any architectural state changes, including modifications of
registers, memory, and PSR.
• Branch target determinations – the X1 pipestage flushes all instructions in the previous
pipestages and sends the branch target address to the BTB if a branch is mispredicted by the
BTB. The flushing of these instructions restarts the pipeline.
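The condition check performed in X1 compares the instruction's condition field against the N, Z, C, and V flags held in CPSR bits 31 to 28. The Python sketch below evaluates the standard ARM* condition mnemonics against a CPSR value. The function name is hypothetical, and the model covers only the condition test, not the cancellation mechanism.

```python
def condition_passes(cond, cpsr):
    """Return True if an instruction with condition `cond` executes for this CPSR."""
    n = (cpsr >> 31) & 1   # negative
    z = (cpsr >> 30) & 1   # zero
    c = (cpsr >> 29) & 1   # carry
    v = (cpsr >> 28) & 1   # overflow
    outcomes = {
        "EQ": z == 1,
        "NE": z == 0,
        "CS": c == 1,
        "CC": c == 0,
        "MI": n == 1,
        "PL": n == 0,
        "VS": v == 1,
        "VC": v == 0,
        "HI": c == 1 and z == 0,
        "LS": c == 0 or z == 1,
        "GE": n == v,
        "LT": n != v,
        "GT": z == 0 and n == v,
        "LE": z == 1 or n != v,
        "AL": True,
    }
    return outcomes[cond]
```

An instruction whose condition evaluates to False in this table is the case described above: it is cancelled and changes no registers, memory, or PSR.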
2.2.3.5 Execute 2 (X2) Pipestage
The X2 pipestage contains the current program status register (CPSR). This pipestage selects what
is written to the RFU in the WB cycle, including program status registers.
2.2.3.6 Write-Back (WB)
When an instruction reaches the write-back stage it is considered complete. Instruction results are
written to the RFU.
2.2.4 Memory Pipeline
The memory pipeline consists of two stages, D1 and D2. The data cache unit (DCU) consists of the
data cache array, mini-data cache, fill buffers, and write buffers. The memory pipeline handles load
and store instructions.
2.2.4.1 D1 and D2 Pipestage
Operation begins in D1 after the X1 pipestage calculates the effective address for loads and stores.
The data cache and mini-data cache return the destination data in the D2 pipestage. Before data is
returned in the D2 pipestage, sign extension and byte alignment occurs for byte and half-word
loads.
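The byte alignment and sign extension applied to byte and half-word loads can be illustrated as follows. The Python sketch takes the 32-bit word that contains the requested data, selects the addressed byte or half-word, and returns the 32-bit value that would be written to the destination register. Little-endian byte ordering is assumed, and the function name is hypothetical.

```python
def align_and_extend(word, byte_offset, size_bytes, signed):
    """Extract a 1- or 2-byte item from a 32-bit word and extend it to 32 bits.

    byte_offset is the item's offset within the word (little-endian assumed).
    """
    bits = 8 * size_bytes
    value = (word >> (8 * byte_offset)) & ((1 << bits) - 1)   # byte alignment
    if signed and value & (1 << (bits - 1)):
        value -= 1 << bits                                    # sign extension
    return value & 0xFFFFFFFF                                 # register image
```

For the word 0x1234F680, a signed byte load at offset 0 yields 0xFFFFFF80, while the unsigned form yields 0x00000080.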
2.2.4.1.1 Write Buffer Behavior
The Intel XScale® Microarchitecture has enhanced write performance by the use of write
coalescing. Coalescing is combining a new store operation with an existing store operation already
resident in the write buffer. The new store is placed in the same write buffer entry as an existing
store when the address of the new store falls within the same 4-word-aligned (16-byte) block as the existing entry.
The core can coalesce any of the four entries in the write buffer. The Intel XScale®
Microarchitecture has a global coalesce disable bit located in the Control register (CP15, register 1,
opcode_2=1).
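Write coalescing can be sketched as a buffer whose entries each cover one 16-byte (4-word) aligned block. In the Python model below, a store merges into an existing entry when it targets the same block and otherwise allocates a new entry. The model assumes that a store never crosses a 16-byte boundary and that the oldest entry drains to memory when the buffer is full. The drain policy, class name, and method names are assumptions for illustration.

```python
class WriteBuffer:
    """Behavioral sketch of a coalescing write buffer."""

    BLOCK_BYTES = 16   # four 32-bit words

    def __init__(self, entries=4, coalesce=True):
        self.entries = entries
        self.coalesce = coalesce   # models the global coalesce disable bit
        self.slots = []            # oldest first: [block_base, {address: byte}]
        self.drained = 0           # entries written out to make room

    def store(self, address, data):
        """Buffer a store. Returns True if it coalesced into an existing entry."""
        base = address & ~(self.BLOCK_BYTES - 1)
        new_bytes = {address + i: value for i, value in enumerate(data)}
        if self.coalesce:
            for slot in self.slots:
                if slot[0] == base:
                    slot[1].update(new_bytes)
                    return True
        if len(self.slots) == self.entries:
            self.slots.pop(0)      # assumption: oldest entry drains first
            self.drained += 1
        self.slots.append([base, new_bytes])
        return False
```

Two word stores to 0x1000 and 0x1004 occupy a single entry when coalescing is enabled and two entries when it is disabled, so sequential stores to adjacent addresses use the buffer more efficiently than scattered ones.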
2.2.4.1.2 Read Buffer Behavior
The Intel XScale® Microarchitecture has four fill buffers that allow four outstanding loads to the
cache and external memory. Four outstanding loads increases the memory throughput and the bus
efficiency. This feature can also be used to hide latency. Page table attributes affect the load
behavior; for a section with C=0, B=0 there is only one outstanding load from the memory. Thus,
the load performance for a memory page with C=0, B=1 is significantly better compared to a
memory page with C=0, B=0.
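A rough estimate shows why the number of outstanding loads matters. The Python sketch below uses a deliberately simple batch model: independent loads are serviced in groups of up to `max_outstanding`, and each group costs one full memory latency. The 40-cycle latency in the usage note and the function name are hypothetical, and bus contention is ignored.

```python
import math


def load_time_cycles(num_loads, miss_latency, max_outstanding):
    """Approximate cycles to complete independent loads that all miss the cache."""
    batches = math.ceil(num_loads / max_outstanding)
    return batches * miss_latency
```

Under this model, eight independent loads at a 40-cycle latency take 80 cycles with four outstanding loads but 320 cycles when only one load can be outstanding, which is the C=0, B=0 case described above.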
2.2.5 Multiply/Multiply Accumulate (MAC) Pipeline
The multiply-accumulate (MAC) unit executes the multiply and multiply-accumulate instructions
supported by the Intel XScale® Microarchitecture. The MAC implements the 40-bit Intel XScale®
Microarchitecture accumulator register acc0 and handles the instructions that transfer its value
to and from general-purpose ARM* registers.
These are important characteristics about the MAC:
• The MAC is not a true pipeline. The processing of a single instruction requires use of the same
data-path resources for several cycles before a new instruction is accepted. The type of
instruction and source arguments determine the number of required cycles.
• No more than two instructions can concurrently occupy the MAC pipeline.
• When the MAC is processing an instruction, another instruction cannot enter M1 unless the
original instruction completes in the next cycle.
• The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and
memory traffic size. Two 16-bit data items can be loaded into a register with one LDR.
• The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit
multiply.
• ACC registers in the Intel XScale® Microarchitecture can be up to 64 bits in future
implementations. Code should be written to depend on the 40-bit nature of the current
implementation.
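The packed 16-bit data format and the 40-bit accumulator can be illustrated with a short Python model. Two signed 16-bit values share one 32-bit register image, and a dual multiply-accumulate multiplies the low halves and the high halves of two such registers and adds both products to the accumulator. Wrapping the accumulator at 40 bits is an assumption of this sketch, and the function names are hypothetical rather than instruction mnemonics.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack16x2(low, high):
    """Pack two signed 16-bit values into one 32-bit register image."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def dual_mac(acc, a, b):
    """acc + a.low * b.low + a.high * b.high, wrapped to a signed 40-bit value."""
    low_product = to_signed(a, 16) * to_signed(b, 16)
    high_product = to_signed(a >> 16, 16) * to_signed(b >> 16, 16)
    return to_signed(acc + low_product + high_product, 40)
```

Because each register carries two samples, a dot product over 16-bit data needs half as many loads as the same loop over unpacked 32-bit values.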
2.2.5.1 Behavioral Description
The execution of the MAC unit starts at the beginning of the M1 pipestage. At this point, the MAC
unit receives two 32-bit source operands. Results are completed N cycles later (where N is
dependent on the operand size) and returned to the register file. For more information on MAC
instruction latencies, refer to Section 4.8, “Instruction Latencies for Intel XScale®
Microarchitecture”.
An instruction occupying the M1 or M2 pipestages also occupies the X1 and X2 pipestages, respectively.
Each cycle, a MAC operation progresses from M1 to M5. A MAC operation may complete anywhere
from M2-M5.
2.2.5.2 Perils of Superpipelining
The longer pipeline has several consequences worth considering:
• Larger branch misprediction penalty (four cycles in the Intel XScale® Microarchitecture
instead of one in StrongARM* Architecture).
• Larger load use delay (LUD) — LUDs arise from load-use dependencies. A load-use
dependency gives rise to a LUD if the result of the load instruction cannot be made available
by the pipeline in time for the subsequent instruction. To avoid these penalties, an optimizing
compiler should take advantage of the core’s multiple outstanding load capability (also called
hit-under-miss) as well as finding independent instructions to fill the slot following the load.
• Certain instructions incur a few extra cycles of delay with the Intel XScale® Microarchitecture
as compared to StrongARM* processors (LDM, STM).
• Decode and register file lookups are spread out over two cycles with the Intel XScale®
Microarchitecture, instead of one cycle in predecessors.
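The load-use delay point above can be sketched at the source level. The following C example is a minimal illustration, not a guaranteed optimization: it contrasts a loop that consumes each loaded value immediately with one that issues the next load before using the current value, which gives a compiler independent work to place after the load. The function names sum_naive and sum_scheduled are hypothetical, and an optimizing compiler may already perform this scheduling on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Each loaded value is consumed by the very next operation. */
static uint32_t sum_naive(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* The load for the next element is issued before the current one is used. */
static uint32_t sum_scheduled(const uint32_t *a, size_t n)
{
    if (n == 0)
        return 0;

    uint32_t sum = 0;
    uint32_t cur = a[0];
    for (size_t i = 1; i < n; i++) {
        uint32_t next = a[i];   /* load issued early */
        sum += cur;             /* independent of the load just issued */
        cur = next;
    }
    return sum + cur;
}
```

Both functions return the same result; only the distance between each load and its first use differs.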
As the Intel® Wireless MMX™ Technology is tightly coupled with the Intel XScale®
Microarchitecture, the Intel® Wireless MMX™ Technology pipeline follows a pipeline structure
similar to that of the Intel XScale® Microarchitecture. Figure 2-2 shows the Intel® Wireless MMX™
Technology pipeline, which contains three independent pipeline threads:
• X pipeline - Execution pipe
• M pipeline - Multiply pipe
• D pipeline - Memory pipe
Figure 2-2. Intel® Wireless MMX™ Technology Pipeline Threads and relation with Intel
XScale® Microarchitecture Pipeline
[Figure: Intel XScale® pipeline stages IF1, IF2, ID, RF, X1, X2, XWB. X pipeline: ID, RF, X1, X2,
XWB. M pipeline: M1, M2, M3, MWB. D pipeline: D1, D2, DWB.]
2.3.1 Execute Pipeline Thread
2.3.1.1 ID Stage
The ID pipe stage is where decoding of Intel® Wireless MMX™ Technology instructions
commences. Because of the significance of the transit time from Intel XScale® Microarchitecture
in the ID pipe stage, only group decoding is performed in the ID stage, with the remainder of the
decoding being completed in the RF stage. However, it is worth noting that the register address
decoding is fully completed in the ID stage because the register file needs to be accessed at the
beginning of the RF stage.
All instructions are issued in a single cycle, and they pass through the ID stage in one cycle if no
pipeline stall occurs.
2.3.1.2 RF Stage
The RF stage controls the reading/writing of the register file, and determines if the pipeline has to
stall due to data or resource hazards. Instruction decoding also continues at the RF stage and
completes at the end of the RF stage. The register file is accessed for reads in the high phase of the
clock and accessed for writes in the low phase. If data or resource hazards are detected, the
Intel® Wireless MMX™ Technology stalls the Intel XScale® Microarchitecture. Note that control hazards
are detected in the Intel XScale® Microarchitecture, and a flush signal is sent from the core to the
Intel® Wireless MMX™ Technology.
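The data-hazard check described for the RF stage can be pictured with a simple scoreboard. The following C sketch is only a conceptual model under stated assumptions: it tracks pending writes to 16 registers in a bitmask and reports a stall when an instruction in RF reads a register whose write is still in flight. The type pending_mask_t and the functions reg_bit, rf_must_stall, rf_issue, and writeback are hypothetical names; the actual hazard logic is not described in this guide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit i set means a write to register i is still in flight (hypothetical model). */
typedef uint16_t pending_mask_t;

/* One-hot bit for a register number in the assumed range 0..15. */
static uint16_t reg_bit(unsigned reg)
{
    return (uint16_t)(1u << (reg & 15u));
}

/* Data hazard: the instruction in RF reads a register with a pending write. */
static bool rf_must_stall(pending_mask_t pending, unsigned src_a, unsigned src_b)
{
    return (pending & (reg_bit(src_a) | reg_bit(src_b))) != 0;
}

/* Mark the destination as pending when an instruction leaves RF. */
static pending_mask_t rf_issue(pending_mask_t pending, unsigned dest)
{
    return (pending_mask_t)(pending | reg_bit(dest));
}

/* Clear the pending bit when the result is written back. */
static pending_mask_t writeback(pending_mask_t pending, unsigned dest)
{
    return (pending_mask_t)(pending & ~reg_bit(dest));
}
```

In this model, an instruction that reads register 3 stalls between rf_issue(p, 3) and writeback(p, 3), while an instruction that reads only other registers proceeds.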
2.3.1.3 X1 Stage
The X1 stage is also known as the execution stage, which is where most instructions begin being
executed. All instructions are conditionally executed and that determination occurs at the X1 stage
in the Intel XScale® Microarchitecture. A signal from the core is required to indicate whether the
instruction being executed is committed. In other words, an instruction being executed at the X1
stage may be canceled by a signal from the core. This signal is available to the Intel® Wireless
MMX™ Technology in the middle of the X1 pipe stage.
2.3.1.4 X2 Stage
The Intel® Wireless MMX™ Technology supports saturated arithmetic operations. Saturation
detection is completed in the X2 pipe stage.
If the Intel XScale® Microarchitecture detects exceptions and flushes in the X2 pipe stage, Intel®
Wireless MMX™ Technology also flushes all the pipeline stages.
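For readers unfamiliar with saturated arithmetic, the following C sketch shows what saturation means for a single signed 16-bit addition: a result outside the representable range is clamped to the nearest limit instead of wrapping. The function name sat_add16 is hypothetical, and the sketch models one lane only; it does not describe how the hardware detects saturation.

```c
#include <stdint.h>

/* Signed 16-bit add that clamps to [INT16_MIN, INT16_MAX] instead of wrapping. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* exact result in a wider type */

    if (sum > INT16_MAX)
        return INT16_MAX;                    /* positive overflow saturates high */
    if (sum < INT16_MIN)
        return INT16_MIN;                    /* negative overflow saturates low */
    return (int16_t)sum;
}
```

With wrapping arithmetic, 30000 + 10000 in 16 bits would produce a negative value; with saturation the result stays at the largest positive value.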
2.3.1.5 XWB Stage
The XWB stage is the last stage of the X pipeline, where a final result calculated in the X pipeline
is written back to the register file.
2.3.2 Multiply Pipeline Thread
2.3.2.1 M1 Stage
The M pipeline is separated from the X pipeline. The execution of multiply instructions starts at the
beginning of the M1 stage, which aligns with the X1 stage of the X pipeline. While the issue cycle
for multiply operations is one clock cycle, the result latency is at least three cycles. Certain
instructions such as TMIA, WMAC, WMUL, WMADD spend two M1 cycles since the Intel®
Wireless MMX™ Technology has only two 16x16 multiplier arrays. Booth encoding and first-level
compression occur in the M1 pipe stage.
2.3.2.2 M2 Stage
Additional compression occurs in the M2 pipe stage, and the lower 32 bits of the result are
calculated with a 32-bit adder.
2.3.2.3 M3 Stage
The upper 32 bits of the result are calculated with a 32-bit adder.
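The split of the final addition across the M2 and M3 stages can be illustrated with a 64-bit sum built from two 32-bit additions. The following C sketch is an arithmetic model only: it assumes that the carry out of the lower half is passed to the upper half, which the text above does not state explicitly. The function name add64_two_stage is hypothetical.

```c
#include <stdint.h>

/* 64-bit addition performed as two 32-bit additions, lower half first. */
static uint64_t add64_two_stage(uint64_t a, uint64_t b)
{
    /* First step (modeled on M2): lower 32 bits. */
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t carry = lo < (uint32_t)a;       /* 1 if the lower add wrapped */

    /* Second step (modeled on M3): upper 32 bits plus the assumed carry. */
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32) + carry;

    return ((uint64_t)hi << 32) | lo;
}
```

The result matches an ordinary 64-bit addition, including wraparound at 64 bits.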
2.3.2.4 MWB Stage
The MWB stage is the last stage of the M pipeline, which is where a final result calculated in the M
pipeline is written back to the register file.
A forwarding path from the MWB stage to the RF stage serves as a non-critical bypass. Critical and
reasonable logic insertion are allowed.
2.3.3 Memory Pipeline Thread
2.3.3.1 D1 Stage
In the D1 pipe stage, the Intel XScale® Microarchitecture provides a virtual address that is used to
access the data cache. There is no logic inside the Intel® Wireless MMX™ Technology in the D1
pipe stage.
2.3.3.2 D2 Stage
The D2 stage is where load data is returned. Load data comes from either data cache or external
memory, with external memory having the highest priority. The Intel® Wireless MMX™
Technology needs to bridge incoming 32-bit data to internal 64-bit data.
2.3.3.3 DWB Stage
The DWB stage, the last stage of the D pipeline, is where load data is written back to the register
file.
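The 32-bit to 64-bit bridging mentioned for the D2 stage amounts to combining two incoming words into one 64-bit value. The following C sketch shows the idea under an explicit assumption: the first word received forms the lower half and the second word the upper half. The word order and the function name bridge_words are assumptions made for illustration and are not taken from this guide.

```c
#include <stdint.h>

/* Combine two incoming 32-bit words into one 64-bit value.
 * Assumption: 'first' becomes bits 31..0 and 'second' becomes bits 63..32. */
static uint64_t bridge_words(uint32_t first, uint32_t second)
{
    return (uint64_t)first | ((uint64_t)second << 32);
}
```

Under this assumption, the words 0x89ABCDEF and 0x01234567 combine into 0x0123456789ABCDEF.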