INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER
INTELLECTUAL PROPERTY RIGHT.
Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the
presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by
estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.
Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Intel® PXA27x Processor Family may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
MPEG is an international standard for video compression/decompression promoted by ISO. Implementations of MPEG CODECs, or MPEG enabled
platforms may require licenses from various entities, including Intel Corporation.
This document and the software described in it are furnished under license and may only be used or copied in accordance with the terms of the
license. The information in this document is furnished for informational use only, is subject to change without notice, and should not be construed as a
commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this
document or any software that may be provided in association with this document. Except as permitted by such license, no part of this document may
be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling
1-800-548-4725 or by visiting Intel's website at http://www.intel.com.
1 Introduction
1.1 About This Document
This document is a guide to optimizing software, the operating system, and system configuration to
best use the Intel® PXA27x Processor Family (PXA27x processor) feature set. The Intel®
PXA27x Processor Family consists of:
• Intel® PXA270 Processor – discrete processor
• Intel® PXA271 Processor – 32 MBytes of Intel StrataFlash® Memory and 32 MBytes of Low-Power
SDRAM stacked with the processor
• Intel® PXA272 Processor – 64 MBytes of Intel StrataFlash® Memory stacked with the processor
This guide is organized as follows:
• Chapter 2, “Microarchitecture Overview” provides an overview of the Intel XScale®
Microarchitecture and the Intel® Wireless MMX™ technology pipelines.
• Chapter 3, “System Level Optimization” covers system-level optimizations such as frequency
selection, memory-system configuration, and cache usage.
• Chapter 4, “Intel XScale® Microarchitecture and Intel® Wireless MMX™ Technology
Optimization” discusses how to optimize software (mostly at the assembly programming
level) to take advantage of the Intel XScale® Microarchitecture and Intel® Wireless MMX™
technology media co-processor.
• Chapter 5, “High Level Language Optimization” is a set of guidelines for C and C++ code
developers to maximize performance by making the best use of system resources.
• Chapter 6, “Power Optimization” discusses the trade-offs between performance and power
using the PXA27x processor.
• Appendix A, “Performance Checklist” is a set of guidelines for system-level optimizations
which help obtain greater performance when using the Intel XScale® Microarchitecture
and the PXA27x processor.
1.2 High-Level Overview
Mobile and wireless devices simplify our lives, keep us entertained, increase productivity and
maximize our responsiveness. Enterprise and individual consumers alike realize the potential and
are integrating these products at a rapid rate into their everyday life. Customer expectations exceed
what is being delivered today. The desire to communicate and compute wirelessly, with access to
information anytime and anywhere, is the expectation. Manufacturers require technologies that
deliver high performance, flexibility, and robust functionality, all in the small-size, low-power
framework of mobile handheld, battery-powered devices. The Intel® Personal Internet Client
Architecture (Intel® PCA) processors with Intel XScale® Microarchitecture help drive wireless
handheld device functionality to new heights to meet customer demand. Combining low-power,
high-performance, compelling new features and second generation memory stacking, Intel PCA
processors help to redefine what a mobile device can do to meet many of the performance demands
of Enterprise-class wireless computing and feature-hungry technology consumers.
Targeted at wireless handhelds and handsets such as cell phones and PDAs with full-featured
operating systems, the Intel PXA27x processor family is the next generation of ultra-low-power
processors with industry-leading multimedia performance for wireless clients. The Intel PXA27x
processor is a highly integrated solution that includes Wireless Intel Speedstep® technology for
ultra-low power, Intel® Wireless MMX™ technology and core clock speeds of up to 624 MHz for
advanced multimedia capabilities, and the Intel® Quick Capture Interface to give customers the
ability to capture high-quality images and video.
The PXA27x processor incorporates a comprehensive set of system and peripheral functions that
make it useful in a variety of low-power applications. The block diagram of the PXA27x processor
system-on-a-chip shows a primary system bus with the Intel XScale® Microarchitecture core
(Intel XScale® core) attached, along with an LCD controller, a USB host controller, and 256 KBytes
of internal memory. The system bus is connected to a memory controller to allow communication
with a variety of external memory or companion-chip devices, and it is also connected to a
DMA/bridge to allow communication with the on-chip peripherals. The key features of all the
sub-blocks are described in this section, with more detail provided in subsequent sections.
1.2.1 Intel XScale® Microarchitecture and Intel XScale® core
The Intel XScale® Microarchitecture is based on a core that is ARM* version 5TE compliant. The
microarchitecture surrounds the core with instruction and data memory management units;
instruction, data, and mini-data caches; write, fill, pend, and branch-target buffers; power
management, performance monitoring, debug, and JTAG units; a coprocessor interface; a MAC
coprocessor; and a core memory bus.
The Intel XScale® Microarchitecture can be combined with peripherals to provide
application-specific standard products (ASSPs) targeted at selected market segments. For example, the RISC
core can be integrated with peripherals such as an LCD controller, multimedia controllers, and an
external memory interface to empower OEMs to develop smaller, more cost-effective handheld
devices with long battery life, with the performance to run rich multimedia applications. Or the
microarchitecture could be surrounded by high-bandwidth PCI interfaces, memory controllers, and
networking micro-engines to provide a highly integrated, low-power, I/O or network processor.
1.2.2 Intel XScale® Microarchitecture Features
• Superpipelined RISC technology achieves high speed and low power
• Wireless Intel Speedstep® technology allows on-the-fly voltage and frequency scaling to
enable applications to use the right blend of performance and power
• Media processing technology enables the MAC coprocessor to perform two simultaneous 16-bit
SIMD multiplies with 64-bit accumulation for efficient media processing
• Power management unit provides power savings via multiple low-power modes
• 32-Kbyte instruction cache (I-cache) keeps local copy of important instructions to enable high
performance and low power
• 32-Kbyte data cache (D-cache) keeps local copy of important data to enable high performance
and low power
• 2-Kbyte mini-data cache avoids “thrashing” of the D-cache for frequently changing data
streams
• 32-entry instruction memory management unit enables logical-to-physical address translation,
access permissions, I-cache attributes
• 32-entry data memory management unit enables logical-to-physical address translation, access
permissions, D-cache attributes
• 4-entry Fill and Pend buffers promote core efficiency by allowing “hit-under-miss” operation
with data caches
• Performance monitoring unit furnishes two 32-bit event counters and one 32-bit cycle counter
for analysis of hit rates
• Debug unit uses hardware breakpoints and 256-entry Trace History buffer (for flow change
messages) to debug programs
• 32-bit coprocessor interface provides high performance interface between core and
coprocessors
• 8-entry Write buffer allows the core to continue execution while data is written to memory
See the Intel XScale® Microarchitecture User's Guide for additional information.
1.2.3 Intel® Wireless MMX™ technology
The Intel XScale® Microarchitecture has attached to it a coprocessor to accelerate multimedia
applications. This coprocessor, characterized by a 64-bit Single Instruction Multiple Data (SIMD)
architecture and compatibility with the integer functionality of the Intel® MMX™ technology and
SSE instruction sets, is known by its Intel project name, Intel® Wireless MMX™ technology. The
key features of this coprocessor are:
• 30 new media processing instructions
• 64-bit architecture up to eight-way SIMD
• 16 x 64-bit register file
• SIMD PSR flags with group conditional execution support
• Instruction support for SIMD, SAD, and MAC
• Instruction support for alignment and video
• Intel® Wireless MMX™ technology and SSE integer compatibility
• Superset of existing Intel XScale® Microarchitecture media processing instructions
See the Intel® Wireless MMX™ technology Coprocessor EAS for more details.
1.2.4 Memory Architecture
1.2.4.1 Caches
There are two caches:
• Data cache – The PXA27x processor supports 32 Kbytes of data cache.
• Instruction Cache – The PXA27x processor supports 32 Kbytes of instruction cache.
1.2.4.2 Internal Memories
The key features of the PXA27x processor internal memory are:
• 256 Kbytes of on-chip SRAM arranged as four banks of 64 Kbytes
• Bank-by-bank power management with automatic power management for reduced power
consumption
• Byte write support
1.2.4.3 External Memory Controller
The PXA27x processor supports a memory controller for external memory which can access:
• SDRAM up to 100 MHz at 1.8 Volts.
• Flash memories
• Synchronous ROM
• SRAM
• Variable latency input/output (VLIO) memory
• PC card and compact flash expansion memory
1.2.5 Processor Internal Communications
The PXA27x processor supports a hierarchical bus architecture. A system bus supports high
bandwidth peripherals, and a slower peripheral bus supports peripherals with lower data
throughputs.
1.2.5.1 System Bus
• Interconnection between the major key components is through the system bus.
• 64-bit wide, address and data multiplexed bus.
• The system bus allows split transactions, increasing the maximum data-throughput in the
system.
• Different burst sizes are allowed; up to 4 data phases per transaction (that is, 32 bytes). The
burst size is set in silicon for each peripheral and is not configurable.
• The system bus can operate at different frequency ratios with respect to the Intel XScale® core
(up to 208 MHz). The frequency control of the system bus is pivotal to striking a balance
between the desired performance and power consumption.
1.2.5.2 Peripheral Bus
The peripheral bus is a single-master bus. The bus master arbitrates between the Intel XScale® core
and the DMA controller with a pre-defined priority scheme between them. The peripheral bus is
used by the low-bandwidth peripherals; the peripheral bus runs at 26 MHz.
1.2.5.3 Peripherals in the Processor
The PXA27x processor has a rich set of peripherals. The peripherals and their key features are
described in the subsections below.
1.2.5.3.1 LCD Display Controller
The LCD controller supports single- or dual-panel LCD displays. Color panels without internal
frame buffers up to 262144 colors (18 bits) are supported. Color panels with internal frame buffers
up to 16777216 colors (24 bits) are supported. Monochrome panels up to 256 gray-scale levels
(8 bits) are supported.
1.2.5.3.2 DMA Controller
The PXA27x processor has a high-performance DMA controller supporting memory-to-memory,
peripheral-to-memory, and memory-to-peripheral transfers. It supports 32 channels and up to
63 peripheral devices. The controller can perform descriptor chaining. DMA supports
descriptor-fetch, no-descriptor-fetch, and descriptor-chaining modes.
1.2.5.3.3 Other Peripherals
The PXA27x processor offers this peripheral support:
• USB Client Controller with 23 programmable endpoints (compliant with USB Revision 1.1).
• USB Host controller (USB Rev. 1.1 compatible), which supports both low-speed and full-
speed USB devices through a built-in DMA controller.
• Intel® Quick Capture Interface which provides a connection between the processor and a
camera image sensor.
• Infrared Communication Port (ICP) which supports 4 Mbps data rate compliant with Infrared
Data Association (IrDA) standard.
• I2C serial bus port, which is compliant with the I2C standard (also supports arbitration between
multiple masters)
• USIM card interface (compliant with ISO standard 7816-3 and 3G TS 31.101)
• MSL – the physical interface of communication subsystems for mobile or wireless platforms.
The operating system and application software use this to communicate with each other.
• Keypad interface, which supports both direct keys and matrix keys.
• Real-time clock (RTC) controller which provides a general-purpose, real-time reference clock
for use by the system.
• The pulse width modulator (PWM) controller generates four independent PWM outputs.
• Interrupt controller identifies and controls the interrupt sources available to the processor.
• The OS timers controller provides a set of timer channels that allow software to generate timed
interrupts or wake-up events.
• General-purpose I/O (GPIO) controller for use in generating and capturing application-specific
input and output signals. Each of the 121 GPIOs may be programmed as an output, an input, or
as bidirectional for certain alternate functions. (121 GPIOs are available on the PXA271 and
PXA272 processors; the PXA270 processor only has 119 GPIOs bonded out.)
1.2.6 Wireless Intel Speedstep® technology
Wireless Intel Speedstep® technology advances the capabilities of Intel® Dynamic Voltage
Management (a function already built into the Intel XScale® Microarchitecture) by incorporating
three new low-power states: deep idle, standby, and deep sleep. The technology is able to change
both voltage and frequency on the fly by intelligently switching the processor into the various
low-power modes, saving additional power while still providing the necessary performance to run
rich applications.
The PXA27x processor provides a rich set of flexible power-management controls for a wide range of usage models, while enabling very low-power operation.
The key features include:
• Five reset sources:
— Power-on
— Hardware
— Watchdog
— GPIO
— Exit from sleep mode
• Three clock-speed controls to adjust frequency:
— Turbo mode
— Divisor mode
— Fast Bus mode
• Switchable clock source
• Functional clock gating
• Programmable frequency-change capability
• Six power modes to control power consumption:
— Normal
— Idle
— Deep idle
— Standby
— Sleep
— Deep sleep
• Programmable I2C-based external regulator interface to support voltage changing
See the Intel® PXA27x Processor Family Developer’s Manual for more details.
1.3 Intel XScale® Microarchitecture Compatibility
The Intel XScale® Microarchitecture is ARM* Version 5 (V5TE) architecture compliant. The
PXA27x processor implements the integer instruction set architecture of ARM* V5TE.
Backward compatibility for user-mode applications is maintained with the earlier generations of
StrongARM* and Intel XScale® Microarchitecture processors. Operating systems may require
modifications to match the specific Intel XScale® Microarchitecture hardware features, and to take
advantage of the performance enhancements added to this core.
Memory map and register locations are backward-compatible with the previous Intel XScale®
Microarchitecture hand-held products.
The Intel® Wireless MMX™ technology instruction set is compatible with the standard ARM*
coprocessor instruction format (see The Complete Guide to Intel® Wireless MMX™ Technology
for more details).
1.3.1 PXA27x Processor Performance Features
Performance features of the PXA27x processor are:
• 32-Kbyte instruction cache
• 32-Kbyte data cache
• Intel® Wireless MMX™ technology with sixteen 64-bit registers and optimized instructions for
video and multimedia applications
• 256 KBytes of internal SRAM
• Capability of locking entries in the instruction or data caches
• 2-Kbyte mini-data cache, separate from the data cache
• L1 caches and the mini-data cache use virtual address indices (or tags)
• Separate instruction and data Translation Lookaside buffers (TLBs), each with 32 entries
• Capability of locking entries in the TLBs
• 16-channel DMA engine with transfer-size control and descriptor chaining
2 Microarchitecture Overview

This chapter contains an overview of the Intel XScale® Microarchitecture and Intel® Wireless
MMX™ Technology. The Intel XScale® Microarchitecture includes a superpipelined RISC
architecture with an enhanced memory pipeline. The Intel XScale® Microarchitecture instruction
set is based on ARM* V5TE architecture; however, the Intel XScale® Microarchitecture includes
new instructions. Code developed for the Intel® StrongARM* SA-110 (SA-110), Intel®
StrongARM* SA-1100 (SA-1100), and Intel® StrongARM* SA-1110 (SA-1110) microprocessors
is portable to Intel XScale® Microarchitecture based processors. However, to obtain the maximum
performance, the code should be optimized for the Intel XScale® Microarchitecture using the
techniques presented in this document.
2.2 Intel XScale® Microarchitecture Pipeline
This section provides a brief description of the structure and behavior of the Intel XScale®
Microarchitecture pipeline.
2.2.1 General Pipeline Characteristics
The following sections discuss general pipeline characteristics.
2.2.1.1 Pipeline Organization
The Intel XScale® Microarchitecture has a 7-stage pipeline operating at a higher frequency than its
predecessors, allowing for greater overall performance. The Intel XScale® Microarchitecture
single-issue superpipeline consists of a main execution pipeline, a multiply-accumulate (MAC)
pipeline, and a memory access pipeline. Figure 2-1 shows the pipeline organization with the main
execution pipeline shaded.

Figure 2-1. Intel XScale® Microarchitecture Pipeline Organization
[The figure shows the main execution pipeline (IF1, IF2, ID, RF, X1, X2, XWB), the memory
pipeline (D1, D2, DWB), and the MAC pipeline (M1, M2, ... Mx).]
Table 2-1 gives a brief description of each pipe stage and a reference for further information.

Table 2-1. Pipelines and Pipe Stages

Pipe / Pipestage | Description | For More Information
Main Execution Pipeline | Handles data processing instructions | Section 2.2.3
• IF1/IF2 | Instruction Fetch | Section 2.2.3.1
• ID | Instruction Decode | Section 2.2.3.2
• RF | Register File / Operand Shifter | Section 2.2.3.3
• X1 | ALU Execute | Section 2.2.3.4
• X2 | State Execute | Section 2.2.3.5
• XWB | Write-back | Section 2.2.3.6
Memory Pipeline | Handles load/store instructions | Section 2.2.4
• D1/D2 | Data cache access | Section 2.2.4.1
• DWB | Data cache writeback | Section 2.2.4.1
MAC Pipeline | Handles all multiply instructions | Section 2.2.5
• M1-M5 | Multiplier stages | Section 2.2.5.1
• MWB (not shown) | MAC write-back occurs during M2-M5 | Section 2.2.5
2.2.1.2 Out of Order Completion
While the pipeline is scalar and single-issue, instructions occupy all three pipelines at once. The
main execution pipeline, memory, and MAC pipelines have different execution times because they
are not lock-stepped. Sequential consistency of instruction execution relates to two aspects: first,
the order instructions are completed and second, the order memory is accessed due to load and
store instructions. The Intel XScale® Microarchitecture only preserves a weak processor
consistency because instructions complete out of order (assuming no data dependencies exist).
The Intel XScale® Microarchitecture can buffer up to four outstanding reads. If load operations
miss the data cache, subsequent instructions complete independently. This operation is called a
hit-under-miss operation.
2.2.1.3 Use of Bypassing
The pipeline makes extensive use of bypassing to minimize data hazards. To eliminate the need to
stall the pipeline, bypassing allows results forwarding from multiple sources.
In certain situations, the pipeline must stall because of register dependencies between instructions.
A register dependency occurs when a previous MAC or load instruction is about to modify a
register value that has not returned to the register file. Core bypassing allows the current instruction
to execute when the previous instruction’s results are available without waiting for the register file
to update.
2.2.2 Instruction Flow Through the Pipeline
With the exception of the MAC unit, the pipeline issues one instruction per clock cycle. Instruction
execution begins at the F1 pipestage and completes at the WB pipestage.
Although a single instruction is issued per clock cycle, all three pipelines are processing
instructions simultaneously. If there are no data hazards, each instruction completes independently
of the others.
Figure 2-1 uses arrows to show the possible flow of instructions in the pipeline. Instruction
execution flows from the F1 pipestage to the RF pipestage. The RF pipestage issues a single
instruction to either the X1 pipestage or the MAC unit (multiply instructions go to the MAC, while
all others continue to X1). This means that on any given cycle, either M1 or X1 is idle.
After calculating the effective addresses in X1, all load and store instructions route to the memory
pipeline.
The ARM* V5TE branch and exchange (BX) instruction (used to branch between ARM* and
THUMB* code) causes the entire pipeline to be flushed. If the processor is in THUMB* mode, the
ID pipestage dynamically expands each THUMB* instruction into a normal ARM* V5TE RISC
instruction and normal execution resumes.
2.2.2.2 Pipeline Stalls
Pipeline stalls can seriously degrade performance. The primary reasons for stalls are register
dependencies, load dependencies, multiple-cycle instruction latency, and unpredictable branches.
To help maximize performance, it is important to understand some of the ways to avoid pipeline
stalls. The following sections provide more detail on the nature of the pipeline and ways of
preventing stalls.
2.2.3 Main Execution Pipeline
2.2.3.1 F1 / F2 (Instruction Fetch) Pipestages
The job of the instruction fetch stages F1 and F2 is to present the next instruction to be executed to
the ID stage. Two important functional units residing within the F1 and F2 stages are the BTB and
IFU.
• Branch Target Buffer (BTB)
The BTB provides a 128-entry dynamic branch prediction buffer. An entry in the BTB is
created when a B or BL instruction branch is taken for the first time. On sequential executions
of the branch instruction at the same address, the next instruction loaded into the pipeline is
predicted by the BTB. Once the branch type instruction reaches the X1 pipestage, its target
address is known. Execution continues without stalling if the target address is the same as the
BTB predicted address. If the address is different from the address that the BTB predicted, the
pipeline is flushed, execution starts at the new target address, and the branch’s history is
updated in the BTB.
• Instruction Fetch Unit (IFU)
The IFU is responsible for delivering instructions to the instruction decode (ID) pipestage. It
delivers one instruction word each cycle (if possible) to the ID. The instruction could come
from one of two sources: instruction cache or fetch buffers.
2.2.3.2 Instruction Decode (ID) Pipestage
The ID pipestage accepts an instruction word from the IFU and sends register decode information
to the RF pipestage. The ID is able to accept a new instruction word from the IFU on every clock
cycle in which there is no stall. The ID pipestage is responsible for:
• General instruction decoding (extracting the opcode, operand addresses, destination addresses
and the offset).
• Detecting undefined instructions and generating an exception.
• Dynamic expansion of complex instructions into a sequence of simple instructions. Complex
instructions are defined as ones that take more than one clock cycle to issue, such as LDM,
STM, and SWP.
2.2.3.3 Register File / Shifter (RF) Pipestage
The main function of the RF pipestage is to read and write to the register file unit (RFU). It
provides source data to:
• X1 for ALU operations
• MAC for multiply operations
• Data cache for memory writes
• Coprocessor interface
The ID unit decodes the instruction and specifies the registers accessed in the RFU. Based on this
information, the RFU determines if it needs to stall the pipeline due to a register dependency. A
register dependency occurs when a previous instruction is about to modify a register value that has
not been returned to the RFU and the current instruction needs to access that same register. If no
dependencies exist, the RFU selects the appropriate data from the register file and passes it to the
next pipestage. When a register dependency does exist, the RFU keeps track of the unavailable
register. The RFU stops stalling the pipe when the result is returned.
The ARM* architecture specifies one of the operands for data processing instructions as the shifter
operand. A 32-bit shift can be performed on a value before it is used as an input to the ALU. This
shifter is located in the second half of the RF pipestage.
2.2.3.4 Execute (X1) Pipestage
The X1 pipestage performs these functions:
• ALU calculations – the ALU performs arithmetic and logic operations, as required for data
processing instructions and load/store index calculations.
• Determine conditional instruction executions – the instruction’s condition is compared to the
CPSR prior to execution of each instruction. Any instruction with a false condition is
cancelled and does not cause any architectural state changes, including modifications of
registers, memory, and PSR.
• Branch target determinations – the X1 pipestage flushes all instructions in the previous
pipestages and sends the branch target address to the BTB if a branch is mispredicted by the
BTB. The flushing of these instructions restarts the pipeline.
2.2.3.5 Execute 2 (X2) Pipestage

The X2 pipestage contains the program status registers (PSRs). This pipestage selects what is
written to the RFU in the WB cycle, including the program status registers.
2.2.3.6 Write-Back (WB)
When an instruction reaches the write-back stage it is considered complete. Instruction results are
written to the RFU.
2.2.4 Memory Pipeline
The memory pipeline consists of two stages, D1 and D2. The data cache unit (DCU) consists of the
data cache array, mini-data cache, fill buffers, and write buffers. The memory pipeline handles load
and store instructions.
2.2.4.1 D1 and D2 Pipestage
Operation begins in D1 after the X1 pipestage calculates the effective address for loads and stores.
The data cache and mini-data cache return the destination data in the D2 pipestage. Before data is
returned in the D2 pipestage, sign extension and byte alignment occurs for byte and half-word
loads.
2.2.4.1.1 Write Buffer Behavior

The Intel XScale® Microarchitecture has enhanced write performance by the use of write
coalescing. Coalescing is combining a new store operation with an existing store operation already
resident in the write buffer. The new store is placed in the same write buffer entry as an existing
store when the address of the new store falls within the 4-word aligned address range of the
existing entry.
The core can coalesce any of the four entries in the write buffer. The Intel XScale®
Microarchitecture has a global coalesce disable bit located in the Control register (CP15, register 1,
opcode_2=1).
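As an illustration, stores that fall within one 4-word-aligned region can share a single write-buffer
entry. The following C sketch assumes a structure placed in a bufferable memory region; the type
and field names are hypothetical:

/* Hypothetical header in a bufferable (B=1) memory region.  The four
 * word stores below fall within one 16-byte (4-word) aligned region,
 * so they can coalesce into a single write-buffer entry instead of
 * consuming four entries (assuming coalescing is not disabled). */
typedef struct {
    unsigned int len;   /* offset 0x0 */
    unsigned int seq;   /* offset 0x4 */
    unsigned int src;   /* offset 0x8 */
    unsigned int dst;   /* offset 0xC */
} pkt_hdr_t;

void write_header(volatile pkt_hdr_t *h)
{
    h->len = 64;
    h->seq = 1;
    h->src = 0x10;
    h->dst = 0x20;
}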
2.2.4.1.2 Read Buffer Behavior

The Intel XScale® Microarchitecture has four fill buffers that allow four outstanding loads to the
cache and external memory. Four outstanding loads increase the memory throughput and the bus
efficiency. This feature can also be used to hide latency. Page table attributes affect the load
behavior; for a section with C=0, B=0 there is only one outstanding load from the memory. Thus,
the load performance for a memory page with C=0, B=1 is significantly better compared to a
memory page with C=0, B=0.
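As a hedged illustration of hit-under-miss, issuing several independent loads back to back lets up
to four cache misses be serviced in parallel by the fill buffers rather than serially:

/* All four loads are independent, so up to four misses can be
 * outstanding at once in the fill buffers; the miss latencies
 * overlap instead of adding up. */
int sum4(const int *a, const int *b, const int *c, const int *d)
{
    int va = *a;
    int vb = *b;
    int vc = *c;
    int vd = *d;
    return va + vb + vc + vd;
}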
2.2.5 Multiply/Multiply Accumulate (MAC) Pipeline

The multiply-accumulate (MAC) unit executes the multiply and multiply-accumulate instructions
supported by the Intel XScale® Microarchitecture. The MAC implements the 40-bit Intel XScale®
Microarchitecture accumulator register acc0 and handles the instructions that transfer its value
to and from general-purpose ARM* registers.
These are important characteristics about the MAC:
• The MAC is not a true pipeline. The processing of a single instruction requires use of the same
data-path resources for several cycles before a new instruction is accepted. The type of
instruction and source arguments determine the number of required cycles.
• No more than two instructions can concurrently occupy the MAC pipeline.
• When the MAC is processing an instruction, another instruction cannot enter M1 unless the
original instruction completes in the next cycle.
• The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and
memory traffic size. Two 16-bit data items can be loaded into a register with one LDR (see the
sketch after this list).
• The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit
multiply.
• ACC registers in the Intel XScale® Microarchitecture can be up to 64 bits in future
implementations. Code should not be written to depend on the 40-bit nature of the current
implementation.
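A minimal C sketch of the packed 16-bit point above (the union and function names are
illustrative): packing two 16-bit items into one 32-bit word lets a single LDR fetch both.

#include <stdint.h>

/* Two 16-bit items share one 32-bit word, so one LDR loads both. */
typedef union {
    uint32_t packed;    /* loaded with a single LDR */
    int16_t  half[2];   /* the two 16-bit data items */
} pair16_t;

int32_t dot2(const pair16_t *a, const pair16_t *b)
{
    pair16_t x = *a, y = *b;   /* one 32-bit load per operand pair */
    return (int32_t)x.half[0] * y.half[0]
         + (int32_t)x.half[1] * y.half[1];
}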
2.2.5.1 Behavioral Description

The execution of the MAC unit starts at the beginning of the M1 pipestage. At this point, the MAC
unit receives two 32-bit source operands. Results are completed N cycles later (where N is
dependent on the operand size) and returned to the register file. For more information on MAC
instruction latencies, refer to Section 4.8, “Instruction Latencies for Intel XScale®
Microarchitecture”.
An instruction occupying the M1 or M2 pipestage also occupies the X1 or X2 pipestage,
respectively. Each cycle, a MAC operation progresses from M1 to M5. A MAC operation may
complete anywhere from M2-M5.
2.2.5.2 Perils of Superpipelining
The longer pipeline has several consequences worth considering:
• Larger branch misprediction penalty (four cycles in the Intel XScale® Microarchitecture
instead of one in StrongARM* Architecture).
• Larger load use delay (LUD) — LUDs arise from load-use dependencies. A load-use
dependency gives rise to a LUD if the result of the load instruction cannot be made available
by the pipeline in time for the subsequent instruction. To avoid these penalties, an optimizing
compiler should take advantage of the core’s multiple outstanding load capability (also called
hit-under-miss) as well as finding independent instructions to fill the slot following the load
(see the sketch after this list).
• Certain instructions incur a few extra cycles of delay with the Intel XScale® Microarchitecture
as compared to StrongARM* processors (LDM, STM).
• Decode and register file lookups are spread out over two cycles with the Intel XScale®
Microarchitecture, instead of one cycle in predecessors.
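As a sketch of the load-scheduling point above (illustrative only), moving independent work
between a load and its first use gives the pipeline time to return the data and avoids the LUD stall:

/* The independent update of 'count' fills the slot between the load
 * of 'v' and its first use, hiding part of the load-use delay. */
int scale_sum(const int *in, int n)
{
    int acc = 0, count = 0;
    for (int i = 0; i < n; i++) {
        int v = in[i];    /* load issued                         */
        count += 2;       /* independent instruction fills slot  */
        acc += v * 3;     /* 'v' ready by the time it is used    */
    }
    return acc + count;
}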
2.3 Intel® Wireless MMX™ Technology Pipeline

As the Intel® Wireless MMX™ Technology is tightly coupled with the Intel XScale®
Microarchitecture, the Intel® Wireless MMX™ Technology pipeline follows a similar pipeline
structure as the Intel XScale® Microarchitecture. Figure 2-2 shows the Intel® Wireless MMX™
Technology pipeline, which contains three independent pipeline threads:
• X pipeline - Execution pipe
• M pipeline - Multiply pipe
• D pipeline - Memory pipe

Figure 2-2. Intel® Wireless MMX™ Technology Pipeline Threads and Relation with Intel
XScale® Microarchitecture Pipeline
[The figure aligns the Intel XScale® pipeline (IF1, IF2, ID, RF, X1, X2, XWB) with the X pipeline
(ID, RF, X1, X2, XWB), the M pipeline (M1, M2, M3, MWB), and the D pipeline (D1, D2, DWB).]
2.3.1 Execute Pipeline Thread
2.3.1.1 ID Stage
The ID pipe stage is where decoding of Intel® Wireless MMX™ Technology instructions
commences. Because of the significance of the transit time from Intel XScale® Microarchitecture
in the ID pipe stage, only group decoding is performed in the ID stage, with the remainder of the
decoding being completed in the RF stage. However, it is worth noting that the register address
decoding is fully completed in the ID stage because the register file needs to be accessed at the
beginning of the RF stage.
All instructions are issued in a single cycle, and they pass through the ID stage in one cycle if no
pipeline stall occurs.
2.3.1.2 RF Stage
The RF stage controls the reading/writing of the register file, and determines if the pipeline has to
stall due to data or resource hazards. Instruction decoding also continues at the RF stage and
completes at the end of the RF stage. The register file is accessed for reads in the high phase of the
clock and accessed for writes in the low phase. If data or resource hazards are detected, the Intel®
Wireless MMX™ Technology stalls the Intel XScale® Microarchitecture. Note that control hazards
are detected in the Intel XScale® Microarchitecture, and a flush signal is sent from the core to the
Intel® Wireless MMX™ Technology.
2.3.1.3 X1 Stage

The X1 stage is also known as the execution stage, which is where most instructions begin being
executed. All instructions are conditionally executed, and that determination occurs at the X1 stage
in the Intel XScale® Microarchitecture. A signal from the core is required to indicate whether the
instruction being executed is committed. In other words, an instruction being executed at the X1
stage may be canceled by a signal from the core. This signal is available to the Intel® Wireless
MMX™ Technology in the middle of the X1 pipe stage.

2.3.1.4 X2 Stage

The Intel® Wireless MMX™ Technology supports saturated arithmetic operations. Saturation
detection is completed in the X2 pipe stage. If the Intel XScale® Microarchitecture detects
exceptions and flushes in the X2 pipe stage, the Intel® Wireless MMX™ Technology also flushes
all the pipeline stages.
2.3.1.5 XWB Stage
The XWB stage is the last stage of the X pipeline, where a final result calculated in the X pipeline
is written back to the register file.
2.3.2 Multiply Pipeline Thread
2.3.2.1 M1 Stage
The M pipeline is separated from the X pipeline. The execution of multiply instructions starts at
the beginning of the M1 stage, which aligns with the X1 stage of the X pipeline. While the issue
cycle for multiply operations is one clock cycle, the result latency is at least three cycles. Certain
instructions such as TMIA, WMAC, WMUL, and WMADD spend two M1 cycles, since the Intel®
Wireless MMX™ Technology has only two 16x16 multiplier arrays. Booth encoding and first-level
compression occur in the M1 pipe stage.

2.3.2.2 M2 Stage

Additional compression occurs in the M2 pipe stage, and the lower 32 bits of the result are
calculated with a 32-bit adder.

2.3.2.3 M3 Stage

The upper 32 bits of the result are calculated with a 32-bit adder.
2.3.2.4 MWB Stage

The MWB stage is the last stage of the M pipeline, which is where a final result calculated in the M
pipeline is written back to the register file. A forwarding path from the MWB stage to the RF stage
serves as a non-critical bypass, so reasonable logic insertion is allowed.
2.3.3 Memory Pipeline Thread

2.3.3.1 D1 Stage

In the D1 pipe stage, the Intel XScale® Microarchitecture provides a virtual address that is used to
access the data cache. There is no logic inside the Intel® Wireless MMX™ Technology in the D1
pipe stage.

2.3.3.2 D2 Stage

The D2 stage is where load data is returned. Load data comes from either the data cache or external
memory, with external memory having the highest priority. The Intel® Wireless MMX™
Technology needs to bridge incoming 32-bit data to internal 64-bit data.

2.3.3.3 DWB Stage

The DWB stage—the last stage of the D pipeline—is where load data is written back to the register
file.
3 System Level Optimization

This chapter describes relevant performance considerations that developers and system designers
should be aware of to efficiently use the Intel® PXA27x Processor Family (PXA27x processor).
3.1 Optimizing Frequency Selection
The PXA27x processor offers a range of combinations of core, system bus and memory clock
speed. The run mode frequency and derived system bus and memory controller frequencies affect
the latency and throughput of external memory interfaces.
Memory latencies depend on the run mode frequency; a higher run mode frequency improves
performance for memory-bound applications. The core clock speed is indicated by the run
frequency or (if CLKCFG[T] is set) by the turbo frequency. This value is most significant for
computationally bound applications. For a memory-bound application, a processor operating in
333-MHz run mode might perform better than a processor operating in 400-MHz turbo mode using
only a 200-MHz run mode frequency. The clock frequency combination should be chosen to fit the
target application mix. Possible frequency selections are listed in the clocks and power manager
section of the Intel® PXA27x Processor Family Developer’s Manual.
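As a hedged sketch (the coprocessor encoding and bit positions are assumptions to verify against
the clocks and power manager section of the Developer's Manual), the run/turbo selection is made
through the CLKCFG register, accessed through coprocessor 14:

/* Read and write CLKCFG (assumed: CP14, register 6) to toggle turbo
 * mode.  Bit assumptions: T = bit 0 (turbo), F = bit 1 (frequency
 * change), B = bit 3 (fast bus).  Privileged mode required. */
static inline unsigned int read_clkcfg(void)
{
    unsigned int v;
    __asm__ volatile("mrc p14, 0, %0, c6, c0, 0" : "=r"(v));
    return v;
}

static inline void write_clkcfg(unsigned int v)
{
    __asm__ volatile("mcr p14, 0, %0, c6, c0, 0" : : "r"(v));
}

void enter_turbo_mode(void)
{
    write_clkcfg(read_clkcfg() | 0x1);   /* set CLKCFG[T] */
}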
3.2 Memory System Optimization
3.2.1 Optimal Setting for Memory Latency and Bandwidth
Because the PXA27x processor has a multi-transactional internal bus, there are latencies involved
with accesses to and from the Intel XScale® core. The internal bus, also called the system bus,
allows many internal operations to occur concurrently, such as LCD, DMA controller, and related
data transfers. Table 3-1 and Table 3-2 list the latencies and throughputs associated with different
frequencies. The throughput reported in the tables measures loads only; store throughput is
similar.
Table 3-1. External SDRAM Access Latency and Throughput for Different Frequencies
(Silicon Measurement Pending)

Core Clock Speed (MHz, up to) | Run Mode Frequency (MHz, up to) | System Bus Clock Speed (MHz, up to) | Memory Clock Speed (MHz, up to) | Memory Latency (core cycles) | Load Throughput from Memory (MBytes/Sec)†
104 | 104 | 104 | 104 | 17 | 205
208 | 208 | 208 | 104 | 21 | 326
312 | 208 | 208 | 104 | 30 | 343

† Store throughput is similar.

Table 3-2. Internal SRAM Access Latency and Throughput for Different Frequencies
(Silicon Measurement Pending)

Core Clock Speed (MHz, up to) | Run Mode Frequency (MHz, up to) | System Bus Clock Speed (MHz, up to) | Memory Clock Speed (MHz, up to) | Memory Latency (core cycles) | Load Throughput from Memory (MBytes/Sec)†
104 | 104 | 104 | 104 | 14 | 236
208 | 208 | 208 | 104 | 14 | 472
312 | 208 | 208 | 104 | 21 | 473

† Store throughput is similar.
• Setting wait-states for static memory
For static memory, it is important to use the correct number of wait-states to get optimal
performance. The Intel® PXA27x Processor Family Developer’s Manual explains the
possible values in the MSCx registers. These registers control wait states and set up the access
mode used. For flash memory that supports burst-of-four reads or burst-of-eight reads, these
modes provide improvements in reading and executing from flash.
• CAS latency for SDRAM
For SDRAM, the key parameter is the CAS latency. Lower CAS latency gives higher
performance. Most current SDRAM supports a CAS latency of two or three.
• Setting of the APD bit
Use of the APD bit in the memory controller can save power; however, it can also increase the
memory latency. For high performance, the APD bit should be cleared.
• Buffer Strength registers
The output drivers for the PXA27x processor external memory bus have programmable
strength settings. This feature allows for simple, software-based control of the output driver
impedance for the external memory bus. Use these registers to match the driver strength of the
PXA27x processor to the external memory bus. The buffer strength should be set to the lowest
possible setting (minimum drive strength) that still allows for reliable memory system
performance. This will minimize the power usage of the external memory bus, which is a
major component of total system power. Refer to the Programmable Output Buffer Strength
registers described in the Intel® PXA27x Processor Family Developer’s Manual, for more
information.
3.2.2 Alternate Memory Clock Setting

An alternate set of memory bus selections is available through the use of CCCR[A]; refer to the
“CCCR Bit Definitions” table in the Intel® PXA27x Processor Family Developer’s Manual. When
this bit is set, the memory clock speed is expanded to allow it to be set as high as 208 MHz. When
cleared, the maximum memory clock speed is 130 MHz.
If CCCR[A] is cleared, use the “Core PLL Output Frequencies for 13-MHz Crystal with CCCR[A]
= 0” table in the Intel® PXA27x Processor Family Developer’s Manual when making the clock
setting selections. If CCCR[A] is set, use the “Core PLL Output Frequencies for 13-MHz Crystal
With B=0 and CCCR[A] = 1” table and the “Core PLL Output Frequencies for 13 MHz Crystal
With B=1 and CCCR[A] = 1” table in the Intel® PXA27x Processor Family Developer’s Manual
instead.
3.2.3 Page Table Configuration

Three bits for each page are used to configure each memory page’s cache behavior. Different
values of X, C, and B determine the caching, reading and writing, and buffering policies of the
pages.
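A minimal sketch of where the three bits live in an ARM* first-level section descriptor (bit
positions follow the ARM* architecture and the Intel XScale® extension bit; the helper name is
illustrative and permission/domain fields are omitted):

#include <stdint.h>

/* Build a first-level section descriptor carrying the X, C, and B
 * policy bits: B = bit 2, C = bit 3, X = TEX[0] = bit 12 for section
 * descriptors on the Intel XScale(R) Microarchitecture. */
static uint32_t section_desc(uint32_t phys_base, int x, int c, int b)
{
    uint32_t d = (phys_base & 0xFFF00000u)   /* 1-MB section base        */
               | (1u << 1);                  /* descriptor type: section */
    if (b) d |= 1u << 2;                     /* B: bufferable            */
    if (c) d |= 1u << 3;                     /* C: cacheable             */
    if (x) d |= 1u << 12;                    /* X: extension bit         */
    return d;
}

/* Example: map 0xA0000000 write-back cached (X=0, C=1, B=1). */
/* uint32_t desc = section_desc(0xA0000000u, 0, 1, 1); */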
3.2.3.1 Page Attributes For Instructions

When examining these bits in a descriptor, the instruction cache only utilizes the C bit. If the C bit
is clear, the instruction cache considers a code fetch from that memory to be noncacheable and will
not fill a cache entry. If the C bit is set, then fetches from the associated memory region are cached.
3.2.3.2 Page Attributes For Data Access

For data access, all three attributes are important. If the X bit for a descriptor is zero, the C and B
bits operate as defined by the ARM* architecture. This behavior is detailed in Table 3-3. If the X
bit for a descriptor is one, the C and B bits behave differently, as shown in Table 3-4. The load and
store buffer behavior in the Intel XScale® Microarchitecture is explained in Section 2.2.4.1.1,
“Write Buffer Behavior” and Section 2.2.4.1.2, “Read Buffer Behavior”.
Table 3-3. Data Cache and Buffer Behavior when X = 0

C B | Cacheable? | Load Buffering and Write Coalescing? | Write Policy | Line Allocation Policy | Notes
0 0 | N | N | — | — | Stall until complete†
0 1 | N | Y | — | — | —
1 0 | Y | Y | Write-through | Read Allocate | —
1 1 | Y | Y | Write-back | Read Allocate | —

† Normally, the processor continues executing after a data access if no dependency on that access is
encountered. With this setting, the processor stalls execution until the data access completes. This
guarantees to software that the data access has taken effect by the time execution of the data access
instruction completes. External data aborts from such accesses are imprecise.
Table 3-4. Data Cache and Buffer Behavior when X = 1

C B | Cacheable? | Load Buffering and Write Coalescing? | Write Policy | Line Allocation Policy | Notes
0 0 | — | — | — | — | Unpredictable -- do not use
0 1 | N | Y | — | — | Writes will not coalesce into buffers†
1 0 | (Mini-data cache) | — | — | — | Cache policy is determined by MD field of Auxiliary Control register††
1 1 | Y | Y | Write-back | Read/Write Allocate | —

† Normally, “bufferable” writes can coalesce with previously buffered data in the same address range.
†† Refer to the Intel XScale® Core Developer’s Manual and the Intel® PXA27x Processor Family
Developer’s Manual for a description of this register.
Note: The Intel XScale® Microarchitecture page attributes are different from those of the Intel®
StrongARM* SA-1110 Microprocessor (SA-1110). SA-1110 code may behave differently on
PXA27x processor systems due to page attribute differences. Table 3-5 describes the differences in
the encoding of the C and B bits for data accesses. The main difference occurs when cacheable and
nonbufferable data is specified (C=1, B=0); the SA-1110 uses this encoding for the mini-data cache
while the Intel XScale® Microarchitecture uses this encoding to specify write-through caching.
Another difference is when C=0, B=1, where the Intel XScale® Microarchitecture coalesces stores
in the write buffer; the SA-1110 does not.
Table 3-5. Data Cache and Buffer Operation Comparison for Intel® SA-1110 and Intel XScale®
Microarchitecture, X=0

Encoding | SA-1110 Function | Intel XScale® Microarchitecture Function
C=1,B=1 | Cacheable in data cache; store misses can coalesce in write buffer | Cacheable in data cache; store misses can coalesce in write buffer
C=1,B=0 | Cacheable in mini-data cache; store misses can coalesce in write buffer | Cacheable in data cache, with a write-through policy; store misses can coalesce in write buffer
C=0,B=1 | Noncacheable; no coalescing in write buffer, but can wait in write buffer | Noncacheable; stores can coalesce in the write buffer
C=0,B=0 | Noncacheable; no coalescing in the write buffer; SA-1110 stalls until this transaction is done | Noncacheable; no coalescing in the write buffer; Intel XScale® Microarchitecture stalls until the operation completes
3.3 Optimizing for Instruction and Data Caches

Cache locking allows frequently used code to be locked in the cache. Up to 28 cache lines can be
locked in each set, while the remaining four entries still participate in the round-robin replacement
policy.
3.3.1 Increasing Instruction Cache Performance

The performance of the PXA27x processor is highly dependent on the cache miss rate. Due to the
complexity of the processor, fetching instructions from external memory can incur a large latency.
Moreover, this cycle penalty becomes significant when the Intel XScale® core is running much
faster than external memory. Executing non-cached instructions severely curtails the processor’s
performance, so it is important to do everything possible to minimize cache misses.
3.3.1.1 Round-Robin Replacement Policy

Both the data and the instruction caches use a round-robin replacement policy to evict a cache line.
The simple consequence of this is that every line will eventually be evicted, assuming a non-trivial
program. The less obvious consequence is that it is difficult to predict when and over which cache
lines evictions take place. This information must be gained by experimentation using performance
profiling.
3.3.1.2 Code Placement to Reduce Cache Misses

Code placement can greatly affect cache misses. One way to view the cache is to think of it as 32
sets of 32 bytes, which span an address range of 1024 bytes. When running, the code maps into 32
blocks, modulo 1024, of cache space. Any overused sets will thrash the cache. The ideal situation is
for the software tools to distribute the code with temporal evenness over this space.
This is not possible for a compiler to do automatically. Most of the input needed to best estimate
how to distribute the code comes from profiling followed by compiler-based two-pass
optimizations.
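For example, the set an address maps to can be computed as below (a sketch based on the 32-set,
32-byte-line organization described above); comparing the set ranges of hot functions shows
whether they compete for the same ways:

/* Each cache way covers 1024 bytes: 32 sets x 32-byte lines.
 * Addresses with equal bits [9:5] land in the same set and
 * compete for that set's ways. */
unsigned int icache_set(unsigned long addr)
{
    return (addr >> 5) & 0x1F;   /* (addr / 32) % 32 */
}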
3.3.1.3 Locking Code into the Instruction Cache
One important instruction cache feature is the ability to lock code into the instruction cache. Once
locked into the instruction cache, the code is always available for fast execution. Another reason
for locking critical code into cache is that with the round robin replacement policy, eventually the
code is evicted, even if it is a frequently executed function. Key code components to consider
locking are:
• Interrupt handlers
• OS Timer clock handlers
• OS critical code
• Time critical application code
The disadvantage to locking code into the cache is that it reduces the cache size for the rest of the
program. How much code to lock is application dependent and requires experimentation to
optimize.
Code placed into the instruction cache should be aligned on a 1024-byte boundary and placed
sequentially together as tightly as possible so as not to waste memory space. Making the code
sequential also ensures even distribution across all cache ways. Though it is possible to choose
randomly located functions for cache locking, this approach runs the risk of locking multiple cache
ways in one set and few or none in another set. This distribution unevenness can lead to excessive
thrashing of the instruction cache.
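A hedged sketch of the locking step itself, using the CP15 fetch-and-lock operation described in
the Intel XScale® Core Developer's Manual (the encoding should be verified there; the function
name is illustrative):

/* Lock 'len' bytes of code starting at 'start' into the I-cache,
 * one 32-byte line at a time (CP15 register 9: fetch and lock).
 * Requires a privileged mode; 'start' should be 1024-byte aligned
 * as discussed above. */
static void lock_icache_region(unsigned long start, unsigned long len)
{
    unsigned long addr;
    for (addr = start; addr < start + len; addr += 32)
        __asm__ volatile("mcr p15, 0, %0, c9, c1, 0" : : "r"(addr));
}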
3.3.2 Increasing Data Cache Performance

Different techniques can be used to increase data cache performance. These include optimizing the
cache configuration and using appropriate programming techniques. This section offers a set of
system-level optimization opportunities; however, program-level optimization techniques are
equally important.
3.3.2.1 Cache Configuration

The Intel XScale® Microarchitecture allows users to define memory regions whose cache policies
can be set by the user. To support these various memory regions, the OS configures the page tables
accordingly. The performance of application code depends on the cache policy used for data
objects. A description of when to use a particular policy follows.
If the application is running under an OS, then the OS may restrict the application from using
certain cache policies.
3.3.2.1.1 Cache Configuration: Write-through and Write-back Cached Memory Regions
Write-back mode avoids some memory transactions by allowing data to collect in the data cache
before eventually being written to memory when the cache line is evicted. When cache lines are
evicted, the writes coalesce and are efficiently written to memory. This differs from write-through
mode where writes are always written to memory immediately. Write-through memory regions
generate more data traffic on the bus and consume more power due to increased bus activity. The
write-back policy is recommended whenever possible. However, in a multiple-bus-master
environment it may be necessary to use a write-through policy if data is shared across multiple
masters. In such a situation, all shared memory regions should use the write-through policy. Memory
regions that are private to a particular master should use the write-back policy.
3.3.2.1.2 Cache Configuration: Read Allocate and Read-write Allocate Memory Regions

Write-back caches with read/write allocation cause an additional read from memory during a
write miss. Subsequent read and write performance may be improved by more frequent cache hits.
Most of the regular data and the stack for applications should be allocated to a read-write allocate
region. Data that is write only (or data that is written to and subsequently not used for a long time)
should be placed in a read allocate region. Under the read allocate policy, if a cache write miss
occurs, a new cache line is not allocated, and hence does not evict data from the data cache.
Memory intensive operations like a memcopy can actually be slowed down by the extra reads
required for the write allocate policy.
3.3.2.1.3 Cache Configuration: Noncacheable Regions

Noncacheable memory regions (X=0, C=0, B=0) are frequently needed for I/O devices. For these
devices, the relevant device registers and memory spaces are mapped as noncacheable. In some
cases, making the noncacheable regions bufferable (X=0, C=0, and B=1) can accelerate memory
performance due to write coalescing. There are cases where a noncached memory region must be
set as nonbufferable (B=0):
• Any device where consecutive writes to the same address could be over-written in the write
buffer before reaching the target device (e.g. FIFOs).
• Devices where read/write order to the device is required. When coalescing occurs, writes
occur in numerical address order, not in the temporal order.
3.3.2.2 Creating Scratch RAM in the Internal SRAM

A very simple method for creating a fast scratch RAM is to allocate a portion of the internal SRAM
for this purpose. This allows data mapped to this area to be accessed much more quickly than if it
resided in external memory. Additionally, there are none of the considerations for cache locking
that are discussed in the next section, Section 3.3.2.3, “Creating Scratch RAM in Data Cache”.
This is the preferred method for creating scratch RAM for the PXA27x processor. It is generally
preferable to keep as much of the data cache as possible available for its designated use: cache
space. While access to the internal SRAM is slower than access to the data cache, it is much faster
than access to external memory.
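A minimal sketch of the approach; the base address is an assumption to check against the PXA27x
memory map, and a real system would reserve the region in its linker script or memory manager:

/* Place frequently used scratch data in on-chip SRAM rather than
 * external memory.  0x5C000000 is the assumed internal-SRAM base;
 * verify against the Developer's Manual memory map. */
#define INTERNAL_SRAM_BASE 0x5C000000u

typedef struct {
    int history[64];   /* frequently accessed filter state */
    int coeff[16];
} scratch_t;

static volatile scratch_t *const scratch =
    (volatile scratch_t *)INTERNAL_SRAM_BASE;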
3.3.2.3 Creating Scratch RAM in Data Cache

Like the instruction cache, lines of the data cache can be locked as well. This can be thought of as
converting parts of the cache into fast on-chip RAM. Access to objects in this on-chip RAM will
not incur cache miss penalties, thereby reducing the number of processor stalls. Application
performance can be improved by locking data cache lines and allocating frequently used variables
to this space. Due to the Intel XScale® Microarchitecture round robin replacement
policy, all non-locked cache data will eventually be evicted. Therefore, to prevent critical or
frequently used data from being evicted it can be allocated to on-chip RAM.
These variables are good candidates for allocating to the on-chip RAM:
• Frequently used global data used for storing context for context switching.
• Global variables that are accessed in time-critical functions such as interrupt service routines.
When locking a memory region into the data cache to create on-chip RAM, care must be taken to
ensure that all sets in the on-chip RAM area of the data cache have approximately the same number
of ways locked. If some sets have more ways locked than others, this increases the level of
thrashing in some sets and leaves other sets under-utilized.
For example, consider three arrays arr1, arr2 and arr3 of size 64 bytes each that are allocated to the
on-chip RAM and assume that the address of arr1 is 0, address of arr2 is 1024, and the address of
arr3 is 2048. All three arrays are within the same sets, set0 and set1. As a result, three ways in both
sets set0 and set1 are locked, leaving 29 ways for use by other variables.
This can be overcome by allocating the on-chip RAM data in sequential order. In the above example,
allocating arr2 at address 64 and arr3 at address 128 allows the three arrays to use only one way in
sets zero through five.
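A sketch of the sequential layout described above (illustrative): placing the arrays contiguously
spreads them across consecutive sets so that each set loses at most one way.

/* Contiguous layout: arr1 at offset 0, arr2 at 64, arr3 at 128.
 * With 32-byte lines the 192 bytes span sets 0..5, one way each,
 * instead of pinning three ways in sets 0 and 1. */
struct onchip_ram {
    char arr1[64];
    char arr2[64];
    char arr3[64];
} __attribute__((aligned(32)));   /* line-aligned base (GCC syntax) */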
In order to reduce cache pollution between two processes and avoid frequent cache flushing during
context switch, the OS could potentially lock critical data sections in the cache. The OS can also
potentially offer the locking mechanism as a system function to its applications.
3.3.2.4Reducing Memory Page Thrashing
Memory page thrashing occurs because of the nature of SDRAM. SDRAMs are typically divided
into 4 banks. Each bank can have one selected page where a page address size for current memory
components is often defined as 4k Bytes. Memory lookup time or latency time for a selected page
address is currently 2 to 3 bus clocks. Thrashing occurs when subsequent memory accesses within
the same memory bank access different pages. The memory page change adds 3 to 4 bus clock
cycles to memory latency. This added delay extends the preload distance1, correspondingly making
it more difficult to hide memory access latencies. This type of thrashing can be resolved by placing
the conflicting data structures into different memory banks or by arranging the data structures
such that the data resides within the same memory page. It is also extremely important to ensure
that the instruction and data sections and the LCD frame buffer are in different memory banks, or
they will continually thrash the memory page selection.
1. Preload distance is defined as the number of instructions required to preload data in order to avoid a core stall.
System Level Optimization
3.3.2.5 Using Mini-Data Cache
The mini-data cache (X=1, C=1, B=0) is best used for data structures which have short temporal
lives, and/or cover vast amounts of data space. Addressing these types of data spaces from the data
cache would corrupt much, if not all, of the data cache by evicting valuable data. Eviction of
valuable data will reduce performance. Placing this data instead in a mini-data cache memory
region would help prevent data cache corruption while providing the benefits of cached accesses.
These are examples of data that could be assigned to the mini-data cache:
• The stack space of a frequently occurring interrupt: The stack is used during the short duration
of the interrupt only.
• Streaming media data: In many cases, the media steam’s data has limited time span usage and
would otherwise repeatedly evict the main data cache.
Overuse of the mini-data cache leads to thrashing the cache. This is easy to do because the
mini-data cache has only two ways per set. For example, consider a loop which uses a simple
statement such as:
for (i = 0; i < IMAX; i++)
{
A[i] = B[i] + C[i];
}
Where A, B, and C reside in a mini-data cache memory region and each array is aligned on a 1-Kbyte
boundary, the loop quickly thrashes the cache.
The mini-data cache could also be used to keep frequently used tables cached. The advantage of
keeping these in the mini-data cache is two-fold. First, data thrashing in the main cache does not
thrash the frequently used tables and coefficients. Second, it saves the main cache space that
locking the critical blocks would otherwise consume. For applications such as MPEG-4, MP3, and
GSM-AMR that handle big data streams, locking the main data cache for these tables is not an
efficient use of cache. During execution of such applications, these are some examples of tables
which can effectively make use of the mini-data cache:
• Huffman tables
• Sine-Cosine look-up tables
• Color-conversion look-up tables
• Motion compensation vector tables
3.3.2.6 Reducing Cache Conflicts, Pollution and Pressure
Cache pollution occurs when unused data is loaded in the cache, and cache pressure occurs when
data that is not temporal to the current process is loaded into the cache. Excessive pre-loading and
data locking should be avoided; for an example, see the discussion on page 5-2. Increasing data
locality through the use of programming techniques helps reduce both pollution and pressure.
The Intel XScale® Microarchitecture offers 32 entries each in the instruction and data TLBs. The
TLB unit also offers a hardware page-table walk. This eliminates the need for a software page-table
walk and software management of the TLBs.
The Intel XScale® Microarchitecture allows individual entries to be locked in the TLBs. Each
locked TLB entry reduces the number of TLB entries available to hold other translation
information. The entries one would expect to lock in the TLBs are those used during access to
locked cache lines. A TLB global invalidate does not affect locked entries.
The TLBs can be used to translate a virtual address to the physical address. The hardware page-table
walk eliminates the page translation task for the OS. From the performance point of view, HW
TLBs are more efficient than SW-managed TLBs. It is recommended that HW TLBs are used for
page table walking; however, to reduce data aborts the page table attributes need to be set
correctly. During a context switch, an OS implementation may choose to flush the TLBs. However,
the OS is free to lock critical TLB entries in the TLBs to reduce excessive thrashing and hence
retain performance.
3.4 Optimizing for Internal Memory Usage
The PXA27x processor has a 256-Kbyte internal memory which offers low latency and high memory
bandwidth. Any data structure which requires high throughput and low latency can be placed in
the internal memory. While the LCD frame buffer is highly likely to be mapped to the internal
memory, depending on the LCD size, refresh rate, and latency that the LCD can tolerate, some
overlays can be placed in the external memory. This scheme may free up some internal memory
space for the OS and user applications. Depending on the user profile, the internal memory can be
used for different purposes.
3.4.1 LCD Frame Buffer
The LCD is a significant bandwidth consumer in the system. The LCD frame buffer can be mapped
to the internal memory. Apart from using the LCD frame buffer, the internal memory space may be
used for an application frame buffer. Many applications update the image to be displayed in their
local copy of frame buffer and then copy the content into the LCD frame buffer. Depending on the
application’s update rate and LCD size, it might be preferable to allow the application to update the
application’s frame-buffer, while system DMA can copy from the application’s frame-buffer to the
LCD frame-buffer.
The LCD controller uses its DMA controller to fetch data from the frame buffer. This makes it
possible to split the frame buffer between internal SRAM and external memory, if necessary,
through the use of chained DMA descriptors. In this way it is possible to use the internal SRAM
for a portion of the frame buffer, even if the entire frame buffer cannot fit within the 256 Kbytes.
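The following C sketch shows one way such a split can be described with two chained descriptors.
The four-word descriptor layout (FDADR, FSADR, FIDR, LDCMD) follows the Developer’s
Manual; the physical addresses, partition sizes, and function name are illustrative assumptions.

#include <stdint.h>

struct lcd_dma_desc {
    uint32_t fdadr;   /* physical address of the next descriptor */
    uint32_t fsadr;   /* physical address of the frame data      */
    uint32_t fidr;    /* frame ID, for software use              */
    uint32_t ldcmd;   /* transfer length and command bits        */
};

#define SRAM_PART_PHYS 0x5C000000u   /* assumed: internal SRAM portion */
#define EXT_PART_PHYS  0xA0400000u   /* assumed: SDRAM portion         */
#define SRAM_PART_LEN  (192u * 1024u)
#define EXT_PART_LEN   (108u * 1024u)

/* Build a two-entry circular chain: SRAM part first, then the rest. */
void build_split_chain(struct lcd_dma_desc *d, uint32_t d_phys)
{
    d[0].fdadr = d_phys + sizeof d[0];
    d[0].fsadr = SRAM_PART_PHYS;
    d[0].fidr  = 0;
    d[0].ldcmd = SRAM_PART_LEN;

    d[1].fdadr = d_phys;          /* chain back to descriptor 0 */
    d[1].fsadr = EXT_PART_PHYS;
    d[1].fidr  = 1;
    d[1].ldcmd = EXT_PART_LEN;
}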
3.4.2 Buffer for Capture Interface
Frames captured at the camera interface are typically processed for image enhancement and
often encoded for transmission or storage. The image enhancement and video encoding applications
are accelerated by allowing the storage of the raw data in the internal memory. Note that the capture
interface can be on-board or an external device; both benefit from the use of the internal
memory buffering scheme.
3.4.3 Buffer for Context Switch
During a context switch the state of the process has to be saved. For the PXA27x processor, the
PCB (process control block) can be large in size due to the additional registers for Intel® Wireless
MMX™ Technology. In order to reduce context switch latency, the internal memory can be
employed.
3.4.4 Scratch RAM
For many applications (such as graphics), the working set may often be larger than the data
cache, and due to the random access nature of the application, effective preload may be difficult to
perform. Thus, part of the internal RAM can be used for storing these critical data structures. The OS
can offer management of such critical data spaces through malloc() or virtual_alloc().
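As a hedged illustration, an OS might expose such a space through a trivial allocator over a
reserved region of internal SRAM; the base address, size, and function name below are
assumptions, not part of the PXA27x programming model.

#include <stddef.h>
#include <stdint.h>

#define SCRATCH_BASE ((uintptr_t)0x5C020000u)  /* assumed SRAM region */
#define SCRATCH_SIZE (64u * 1024u)

static uintptr_t scratch_next = SCRATCH_BASE;

/* Bump allocator for critical data structures; never freed. */
void *scratch_alloc(size_t bytes)
{
    uintptr_t p = (scratch_next + 7u) & ~(uintptr_t)7u;  /* 8-byte align */
    if (p + bytes > SCRATCH_BASE + SCRATCH_SIZE)
        return NULL;               /* caller falls back to malloc() */
    scratch_next = p + bytes;
    return (void *)p;
}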
3.4.5 OS Acceleration
There is much OS- and system-related code that is used in a periodic fashion (for example, device
drivers and OS daemon processes). Code for these routines can be stored in the internal memory;
this reduces the instruction cache miss penalties for the periodic routines.
3.4.6 Increasing Preloads for Memory Performance
Apart from increasing cache efficiency, hiding the memory latency is extremely important. A
proper preload scheme can be used to hide the memory latency for data accesses.
The Intel XScale® Microarchitecture has a preload instruction (PLD). The purpose of this
instruction is to preload data into the data and mini-data caches. Data pre-loading allows hiding of
memory transfer latency while the processor continues to execute instructions. The preload is
important to compiler and assembly code because judicious use of the preload instruction can
enormously improve throughput performance of Intel XScale® Microarchitecture-based
processors. Data preload can be applied not only to loops but also to any data references within a
block of code. Preload also applies to data writing when the memory type is enabled as
write-allocate.
Note: The Intel XScale® Microarchitecture PLD instruction encoding translates to a never-execute in the
ARM* V4 architecture. This allows compatibility between code using PLD on an Intel
XScale® Microarchitecture processor and older devices. Code that has to run on both architectures
can include the PLD instruction, gaining performance on the Intel XScale® Microarchitecture
while maintaining compatibility for ARM* V4 (for example, StrongARM*). A detailed discussion
of efficient pre-loading of data and possible use cases is given in Section 4,
“Intel XScale® Microarchitecture & Intel® Wireless MMX™ Technology Optimization”,
Section 5, “High Level Language Optimization”, and Section 6, “Power Optimization”.
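As a sketch of what preloading looks like from C, GCC-style compilers expose PLD through
__builtin_prefetch; PRELOAD_AHEAD is a hypothetical tuning parameter corresponding to the
preload distance.

#define PRELOAD_AHEAD 4

int sum_array(const int *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        /* Preload a few iterations ahead; emitted as PLD on Intel
         * XScale® Microarchitecture targets. */
        __builtin_prefetch(&x[i + PRELOAD_AHEAD]);
        s += x[i];
    }
    return s;
}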
3.5 Optimization of System Components
In the PXA27x processor, the LCD, DMA controller, Intel® Quick Capture Interface and Intel
XScale® core share the same resources such as system bus, memory controller, etc. Thus, there
may be potential resource conflicts and the sharing of resources may impact the performance of the
end application. For example, a larger LCD display consumes more memory and system bus
bandwidth and hence an application could potentially run faster in a system with a smaller LCD
display or a display with a lower refresh rate. Also, DMA channels can influence the performance
of applications. This section describes how different sub-systems can be optimized for improving
system performance.
3.5.1 LCD Controller Optimization
The LCD controller provides an interface between the PXA27x processor and a LCD module. The
LCD module can be passive (STN), active (TFT), or an LCD panel with internal frame buffering.
3.5.1.1 Bandwidth and Latency Requirements for LCD
The LCD controller may have up to seven DMA channels running, depending on the mode of operation.
Therefore the LCD can potentially consume the majority of the bus bandwidth when used with
large panels. Bandwidth requirements for each plane (that is: base, overlay 1, overlay 2, etc.) must
be considered when determining LCD bandwidth requirements. The formula for each plane is:

Plane Bandwidth = (Length x Width x Refresh Rate x BPP) / 8 bytes per second

Length and width are the number of lines per panel and pixels per line, respectively. Refresh rate is
in frames per second. BPP is bits per pixel in physical memory, that is: 16 for 16 BPP, 32 for 18
BPP unpacked, 24 for 18 BPP packed (refer to the Intel® PXA27x Processor Family Developer’s Manual for more information).
Depending on where the overlay planes are placed, there might be variable data bandwidth
requirement during a refresh cycle of the LCD. The sections on the screen with overlaps between
overlay 1 and 2 require fetching data at the highest rate. It is important to understand both the long
term average and the peak bandwidth. The average bandwidth is a long term average of the
consumed bandwidth over the entire frame. The peak bandwidth is the highest (instantaneous) data
rate that the LCD consumes - which occurs when fetching data for the overlapped section of the
frame.
The average bandwidth can be calculated as:

Average Bandwidth = Sum of (Plane Bandwidths)

The formula for peak bandwidth is:

Peak Bandwidth = Maximum Plane Overlap x Base Plane Bandwidth
Maximum plane overlap is the maximum number of overlapping planes (base, overlay 1, overlay
2). The planes do not need to completely overlap each other; they simply need to occupy the same
pixel location. It is generally the number of planes used, unless the overlays are guaranteed never
to be positioned over one another. The peak bandwidth is required whenever the LCD controller is
displaying a portion of the screen where the planes overlap. While the peak bandwidth is higher
than the average bandwidth, it is not sustained for long. The sustained period of peak-bandwidth
activity depends on the overlay sizes and color depth.
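The bandwidth formulas above are simple enough to transcribe directly into code; the helper names
below are hypothetical.

/* All results are in bytes per second. */
double plane_bandwidth(double lines, double pixels_per_line,
                       double refresh_hz, double bpp)
{
    return lines * pixels_per_line * refresh_hz * bpp / 8.0;
}

double average_bandwidth(const double *plane_bw, int num_planes)
{
    double sum = 0.0;
    for (int i = 0; i < num_planes; i++)
        sum += plane_bw[i];
    return sum;
}

double peak_bandwidth(int max_plane_overlap, double base_plane_bw)
{
    return max_plane_overlap * base_plane_bw;
}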
The system needs to guarantee the LCD has enough bandwidth available to meet peak bandwidth
requirements for the sustained peak-bandwidth period to avoid underruns during plane overlap
periods. Optimizing the arbitration scheme and internal memory usage is encouraged to address this
problem. The LCD controller has an internal buffering mechanism to minimize the impact of
fluctuations in the bandwidths.
The maximum latency the LCD controller can tolerate for its 32-byte burst data fetches can be
calculated with the equation below. Note that the latency requirements may vary for different
overlays (refer to Table 3-6, “Sample LCD Configurations with Latency and Peak Bandwidth
Requirements”).

Latency = 32 / Peak Bandwidth seconds

Peak bandwidth comes from the equation above and is in bytes per second.
• So, for example, for a 640x480x16 BPP screen with a 320x240x16 BPP overlay and a 70-Hz
refresh rate, the average bandwidth required is:
[(480 x 640 x 70 x 16) / 8] + [(240 x 320 x 70 x 16) / 8]
= 43,008,000 + 10,752,000
= 53,760,000 bytes per sec, or about 51 MBytes/sec.
• The peak bandwidth required is:
2 x [(480 x 640 x 70 x 16) / 8] = 86,016,000 bytes per sec, or 82 MBytes per sec.
• The maximum allowable average latency for LCD DMA burst data fetches is:
32 / 86,016,000 = 372 ns.
• For a 195-MHz system bus, this is (372 x 10^-9) x (195 x 10^6), or approximately 72 bus clock cycles.
Note that each LCD DMA channel has a 16-entry, 8-byte wide FIFO buffer to help deal with
fluctuations in available bandwidth due to spikes in system activity.
Table 3-6. Sample LCD Configurations with Latency and Peak Bandwidth Requirements

LCD (Base Plane Size, Overlay 1, Overlay 2, Cursor) | Refresh Rate (Hz) | Color Depth | Frame Buffer Foot Print Requirement (KBytes) | Maximum Latency Tolerance (ns) | Average Bandwidth Requirements (MBytes/Sec) | Peak Bandwidth Requirements (MBytes/sec)
320x240 + No Overlay | 77 | 16 BPP | 150 | 2702.56 | 11.28 | 11.28
640x480 + No Overlay | 78 | 18 BPP unpacked | 1200 | 333.87 | 91.41 | 91.41
640x480 + No Overlay | 78 | 18 BPP packed | 900 | 445.16 | 68.55 | 68.55
800x600 + No Overlay | 73 | 16 BPP | 937.5 | 456.62 | 66.83 | 66.83
800x600 + 176x144 Overlay | 73 | 16 BPP | 937.5 base + 49.5 overlay | 223.78 | 70.52 | 133.67
3.5.1.2 Frame Buffer Placement for LCD Optimization
3.5.1.2.1 Internal Memory Usage
As the bandwidth and latency requirements increase with screen size, it may become necessary to
utilize internal memory in order to meet LCD requirements. Internal memory provides the lowest
latency and highest bandwidth of all memories in the system. In addition, having the frame buffer
located in internal SRAM dramatically reduces external memory traffic and internal bus
utilization.
3.5.1.2.2 Overlay Placement
Most systems that use overlays require more memory for the frame buffers (base plane and
overlays) than is available (or allocated for frame buffer usage) in the internal SRAM. Optimum
system performance is achieved by placing the most frequently accessed frame buffers in internal
SRAM and placing the remainder in external memory. Frame buffer accesses include not only the
OS and applications writing to the plane when updating the content displayed, but also the LCD
controller reading the data from the plane.
For the base plane, the total accesses are simply the sum of the refresh rate and the frequency of
content updates of the base plane. For each overlay, the total accesses are the same sum multiplied
by the percentage of time the overlay is enabled. After estimating the total accesses for the base plane
and all overlays employed, place the frame buffers for the planes with the highest total accesses in
the internal SRAM, as the sketch below illustrates.
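A small helper capturing this access-counting heuristic (the name and rate units are illustrative
assumptions):

/* Estimated accesses per second for a plane. Use 1.0 for the base
 * plane's enabled_fraction; for an overlay, use the fraction of time
 * the overlay is enabled. */
double plane_total_accesses(double refresh_hz, double update_hz,
                            double enabled_fraction)
{
    return (refresh_hz + update_hz) * enabled_fraction;
}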
Some systems might benefit from dynamically reconfiguring the location of the frame buffer
memory whenever the overlays are enabled. When overlays are disabled the frame buffer for the
base plane is placed in the internal SRAM. However, when the overlays are enabled, the base
plane’s frame buffer is moved to external memory and the frame buffers of the overlays are placed
in the internal SRAM. This method requires close coordination with the LCD controller to ensure
that no artifacts are seen on the LCD. Refer to the LCD chapter in the Intel® PXA27x Processor Family Developer’s Manual for more information on reconfiguring the LCD.
3.5.1.2.3 Multiple Descriptor Technique
Another technique for utilizing internal SRAM is the use of multiple-descriptor frames. This
technique can be used if the LCD controller underruns due to occasional slow memory accesses.
The frame buffer is split across two different chained descriptors for a particular LCD DMA
channel. Both descriptors can be stored in internal SRAM for speedy descriptor reloading. One
descriptor points to frame buffer source data in external memory, while the second descriptor
points to the remainder of the frame buffer in internal SRAM. The descriptor that points to frame
data in external memory should have the end of frame (EOF) interrupt bit set. The idea is to queue
slower memory transfers and push them out when the EOF interrupt occurs, indicating that the
LCD is switching to internal SRAM. This allows a ping-pong between slow memory traffic (that
would cause LCD output FIFO underruns) and LCD traffic. This technique is only necessary with
very large screens, and will not work if the offending slow memory accesses also occur when the
LCD is fetching from external memory.
3.5.1.3 LCD Display Frame Buffer Setting
For most products the LCD frame buffer is allocated in a noncacheable region. If this region is
instead set to noncached but bufferable, graphics performance improvements can be achieved.
The noncached but bufferable mode (X=0, C=0, B=1) improves write performance by allowing the
consecutive writes to coalesce in the write buffer and result in more efficient bus transactions.
System developers should set their LCD frame buffer as noncached but bufferable.
3.5.1.4 LCD Color Conversion HW
The LCD controller is equipped with hardware color management capabilities such as:
• Up-scaling from YCbCr 4:2:0 & 4:2:2 to YCbCr 4:4:4
• Color Space Conversion from YCbCr 4:4:4 to RGB 8:8:8 (CCIR 601)
• Conversion from RGB 8:8:8, to RGB 5:5:5 and the supported formats of RGBT
For many video and image applications, the color-conversion routines require a significant amount
of processing power. This work can be off-loaded to the LCD controller by configuring it
properly. This has two advantages: first, the Intel XScale® core is not burdened with the
processing; second, the LCD bandwidth consumption is lowered by using the lower bit-precision
format.
3.5.1.5 Arbitration Scheme Tuning for LCD
The most important thing to do in order to enable larger screens is to reprogram the arbiter
ARBCNTL register. The default arbiter weights for the programmable clients are LCD=2, DMA=3
and XScale=4; this is only sufficient for very small screens. Typically the LCD needs to be given the
highest weight of the programmable clients - this is discussed further in Section 3.5.2.2.1, “Weight
for LCD”.
3.5.2 Arbiter Optimization
The PXA27x processor arbiter features programmable “weights” for the LCD controller, DMA
controller, and Intel XScale® Microarchitecture bus requests. In addition, the “park” bit can be set,
which causes the arbiter to grant the bus to a specific client whenever the bus is idle. These two
features should be used to tune the PXA27x processor to match your system bandwidth requirements.
The USB host controller cannot tolerate long latencies and is given highest priority whenever it
requests the bus, unless the memory controller is requesting the bus. The memory controller has the
absolute highest priority in the system. Since the weight of the USB host and memory controller
are not programmable, they are not discussed any further in the text below. The weights of the
LCD, DMA controller and Intel XScale® Microarchitecture bus requests are programmable via the
ARBCNTL register. The maximum weight allowed is 15. Each client weight is loaded into a
counter, and whenever a client is granted the bus the counter decrements. When all counters reach
zero, the counters are reloaded with the weights in the ARBCNTL register and the process restarts.
At any given time, the arbiter gives a grant to the client with the highest value in their respective
counter, unless the USB host or memory controller is requesting the bus. If one or more client
counts are at zero and no non-zero clients are requesting the bus, the arbiter grants the bus to the
zero-count client with the oldest pending request. If this happens three times, the counters are all
reloaded even though one more client counts never reached zero. This basic understanding of how
the arbiter works is necessary in order to begin tuning the arbiter settings.
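To make the counter mechanics concrete, here is a toy software model of the scheme just
described, using the reset-default weights; it omits the USB host/memory controller overrides and
the reload-after-three-zero-grants rule, and is an illustration rather than the hardware
implementation.

#define CLIENTS 3  /* 0 = LCD, 1 = DMA, 2 = Intel XScale® core */

static const int weight[CLIENTS] = { 2, 3, 4 };  /* reset defaults */
static int counter[CLIENTS] = { 2, 3, 4 };

/* Return the client granted the bus this cycle, or -1 if none request. */
int arbitrate(const int requesting[CLIENTS])
{
    int best = -1;

    for (int c = 0; c < CLIENTS; c++)
        if (requesting[c] && (best < 0 || counter[c] > counter[best]))
            best = c;

    if (best >= 0 && counter[best] > 0)
        counter[best]--;

    /* When every counter has drained, reload all from the weights. */
    int drained = 1;
    for (int c = 0; c < CLIENTS; c++)
        if (counter[c] > 0)
            drained = 0;
    if (drained)
        for (int c = 0; c < CLIENTS; c++)
            counter[c] = weight[c];

    return best;
}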
3.5.2.2 Determining the Optimal Weights for Clients
The weights are decided based on the real-time (RT) deadline1, bandwidth (BW) requirements, and
the likelihood of a client requesting the bus. Setting the correct weight helps ensure that each client is
statistically guaranteed a fixed amount of bandwidth.
Over-assigning or under-assigning weights may violate the BW and RT requirements of a client.
Also, when the weights for one or more clients become zero, the effective arbitration becomes first
come, first served (FCFS).
1. Real-time deadline is the maximum time that a client can wait for data across the bus without impacting the client’s performance (for
example, by causing a stall).
3.5.2.2.1 Weight for LCD
The first client to consider is the LCD controller. When used with larger panel sizes or overlays, the
LCD controller has very demanding real-time data requirements which, if not satisfied, result in
underruns and visual artifacts. Therefore, the LCD controller is usually given the highest weight of
all of the programmable clients. The safest and easiest method of ensuring the LCD controller gets
all of the bandwidth it requires is to set the LCD weight to 15. This gives the LCD controller the
bus whenever it needs it, allowing the LCD FIFO buffers to stay as full as possible in order to avoid
underrun situations. The remaining bus bandwidth, which may be very little if a very large panel is
used, is then split between the DMA controller and the Intel XScale® Microarchitecture.
3.5.2.2.2 Weight for DMA
The DMA controller is a unique client in that it is “friendly” and always deasserts its request line
whenever it gets a grant. Therefore, it never performs back-to-back transactions unless nobody else
is requesting the bus. In addition, if the DMA controller is the only non-zero client, there is a fair
chance the client counters are prematurely reloaded due to three zero-count clients getting grants in
between DMA grants. For these reasons, the DMA controller never consumes all of the bus
bandwidth, even when programmed with a large weight. The best weight to use is
system-dependent, based on the number of DMA channels running and the bandwidth requirements
of those channels. Since the LCD controller and USB host have real-time requirements, DMA
bandwidth usually reduces the available bandwidth to the Intel XScale® core.
3.5.2.2.3 Weight for Core
A good method for setting the Intel XScale® core weight and the DMA controller weight is to
determine the ratio of the bandwidth requirements of both. Once the ratio is determined the weights
can be programmed with that same ratio. For instance, if the Intel XScale® core requires twice the
bandwidth of the DMA controller, the DMA weight could be set to two with the Intel XScale®
core weight set to four. Larger weights are used for greater accuracy, but the worst-case time to
grant also increases. It is often best to start with low weights while the LCD weight is high to avoid
LCD problems at this point. The weights can be increased later using the same ratio if desired,
provided there are no LCD underruns.
3.5.2.3 Taking Advantage of Bus Parking
Another arbiter feature is the ability to park the grant on a particular client when the bus is idle. If
the bus is not parked and is idle, it takes 1-2 cycles to get a bus grant. This can be reduced to zero if
the bus is successfully parked on the next client that needs it. However, if the bus is parked on a
particular client and a different client requests the bus, it takes 2-3 cycles to get a grant. Consider
the 1-cycle penalty for a mispredicted park.
For most applications it is recommended to park the bus on the core. Since the bus parking can be
easily and dynamically changed, it is also recommended that the OS and applications use this
feature to park the bus where it results in the best performance for the current task.
While most applications have the highest performance with the bus parked on the Intel XScale®
core, some might perform better with different bus park settings. As an example, it is likely that
parking the bus on the memory controller will result in higher performance than having it parked
on the core if an application was invoked to copy a large section of memory from SDRAM. Use the
performance monitoring capabilities of the Intel XScale® core to verify that the choice of bus
parking resulted in increased performance.
3.5.2.4 Dynamic Adaptation of Weights
Once the initial weights for all of the programmable clients have been determined, the arbiter
settings should be tested with real system traffic. It is important to make sure all real-time
requirements are met with both typical and worst-case traffic loads. It may take several iterations to
find the best arbiter setting. Once the best ratio is determined, arbiter accuracy can be increased by
raising the DMA controller and Intel XScale® core weights as much as possible while still
preserving the ratio between the two. The system should be retested while increasing weights to
ensure the increase in worst-case time-to-grant does not affect performance. Also, LCD output
FIFO buffer underruns have to be monitored to make sure the LCD does not fail as its bandwidth
allocation decreases. If worst-case time-to-grant is more important than arbiter accuracy, smaller
weights can be used and the LCD weight can be lowered as long as LCD output FIFO underruns do
not occur with a worst-case traffic load.
A final consideration is dynamically changing the ARBCNTL register based on the current state of
the system. For example, experimentation may show different DMA controller weights should be
used based on the number of channels running. When the system enables a new channel, the
ARBCNTL register can be written. This results in an immediate reload of the client counters.
3.5.3 DMA Controller Optimization
The DMA controller is used by the PXA27x processor peripherals for data transfers between the
peripheral buffers and memory (internal and external). Also, depending on the use cases and
user profiles, the operating system may use DMA for copying pages for its own operations.
Table 3-7 shows DMA controller performance data.

Table 3-7. Memory-to-Memory Performance Using DMA for Different Memories and Frequencies

Clock Ratio† | DMA Throughput for Internal to Internal Memory | DMA Throughput for Internal to External Memory
104:104:104 | 127.3 | 52.9
208:104:104 | 127.6 | 52.3
195:195:97.5 | 238.2 | 70.9
338:169:84.5 | 206 | 59.4
390:195:97.5 | 237.9 | 68.6
† Ratio = Core Frequency : System Bus Frequency : Memory Bus Frequency
Proper DMA controller usage can reduce the workload of the Intel XScale® core by offloading
peripheral I/O to the DMA controller. The DMA controller can also be used to populate the
internal memory from the capture interface or external memory.
3.5.4 Peripheral Bus Split Transactions
The DMA bridge between the peripheral bus and the system bus normally performs split
transactions for all operations. This allows for some decoupling of the address and data phases of
transactions and generally improves efficiency. This behavior can be disabled, which requires
active transactions to complete before another transaction starts. Please refer to the DMA
Programmed I/O Control Status register described in the Intel® PXA27x Processor Family
Developer’s Manual for detailed information on this feature and its usage.
Note: When using split transactions (the default): If software requires that a write complete on the
peripheral bus before continuing, then software must write the address, then immediately read the
same address. This guarantees that the address has been updated before letting the core continue
execution. The user must perform this read-after-write transaction to ensure the processor is in a
correct state before the core continues execution.
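In C, the read-after-write rule might be wrapped as follows; the function name and register pointer
are hypothetical.

#include <stdint.h>

/* Posted write across the DMA bridge followed by a read-back of the
 * same address, which forces the write to complete before the core
 * continues. */
static inline void periph_write_sync(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;
    (void)*reg;
}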
This section outlines optimizations specific to the ARM* architecture and also to the Intel® Wireless
MMX™ Technology. These optimizations are modified for the Intel XScale® Microarchitecture
where needed. This chapter focuses mainly on assembly code level optimization. Chapter 5
explains optimization during high level language programming.
4.2 General Optimization Techniques
The Intel XScale® Microarchitecture provides the ability to execute instructions conditionally.
This feature combined with the ability of the Intel XScale® Microarchitecture instructions to
modify the condition codes makes a wide array of optimizations possible.
4.2.1 Conditional Instructions and Loop Control
The Intel XScale® Microarchitecture instructions can selectively modify the state of the condition
codes. When generating code for if-else and loop conditions it is often beneficial to make use of
this feature to set condition codes, thereby eliminating the need for a subsequent compare
instruction.
Consider the following C statement
if (a + b)
Code generated for the if condition without using an add instruction to set condition codes is:
; Assume r0 contains the value a, r1 contains the value b,
; and r2 is available
add r2,r0,r1
cmp r2, #0
However, the code can be optimized as follows, making use of the add instruction to set the
condition codes:
; Assume r0 contains the value a, r1 contains the value b,
; and r2 is available
adds r2, r0, r1
The instructions that increment or decrement the loop counter can also modify the condition codes.
This eliminates the need for a subsequent compare instruction. A conditional branch instruction
can then exit or continue with the next loop iteration.
Consider the following C code segment.
for (i = 10; i != 0; i--) {
do something;
}
The optimized code generated for the preceding code segment would look like:
.L6:
subs r3, r3, #1
bne .L6
It is also beneficial to rewrite loops whenever possible to make the loop exit conditions check
against the value 0. For example, the code generated for the code segment below needs a compare
instruction to check for the loop exit condition.
for (i = 0; i < 10; i++){
do something;
}
If the loop were rewritten as follows, the code generated avoids using the compare instruction to
check for the loop exit condition.
for (i = 9; i >= 0; i--){
do something;
}
4.2.2 Program Flow and Branch Instructions
Branches decrease application performance by indirectly causing pipeline stalls. Branch prediction
improves performance by lessening the delay inherent in fetching a new instruction stream. The
Intel® PXA27x Processor Family (PXA27x processor) adds a branch target buffer (BTB), which
helps mitigate the penalty due to branch misprediction. However, the BTB must be enabled.
The size of the branch target buffer limits the number of correctly predictable branches. Because
the total number of branches executed in a program is relatively large compared to the size of the
branch target buffer, it is often beneficial to minimize the number of branches in a program.
Consider the following C code segment:
if (a > 10)
a = 0;
else
a = 1;
The code generated for the if-else portion of this code segment using branches is:
cmp r0, #10
ble L1
mov r0, #0
b L2
L1:
mov r0, #1
L2:
This code takes three cycles to execute the else statement and four cycles for the if statement
assuming best case conditions and no branch misprediction penalties. In the case of the Intel
XScale® Microarchitecture, a branch misprediction incurs a penalty of four cycles. If the branch is
mispredicted 50% of the time, and if both the if statement and the else statement are equally likely
to be taken, then on average the code above takes 5.5 cycles to execute:
(50/100 x 4) + ((3 + 4)/2) = 5.5 cycles
Using the Intel XScale® Microarchitecture to execute instructions conditionally, the code
generated for the preceding if-else statement is:
cmp r0, #10
movgt r0, #0
movle r0, #1
The preceding code segment would not incur any branch misprediction penalties and would take
three cycles to execute, assuming best-case conditions. Using conditional instructions speeds up
execution significantly. However, the use of conditional instructions should be considered carefully
to ensure it improves performance. To decide when to use conditional instructions over branches,
consider this hypothetical code segment:
if (cond)
if_stmt
else
else_stmt
Using the following data:
N1B = number of cycles to execute the if_stmt assuming the use of branch instructions
N2B = number of cycles to execute the else_stmt assuming the use of branch instructions
add r4, r4, #1
b L2
L1:
sub r0, r0, #1
sub r1, r1, #1
sub r2, r2, #1
sub r3, r3, #1
sub r4, r4, #1
L2:
The CMP instruction takes one cycle to execute, the if statement takes seven cycles to execute, and
the else statement takes six cycles to execute.
If the code were changed to eliminate the branch instructions by using conditional instructions,
the if-else statement would take 10 cycles to complete.
Assuming an equal probability of both paths being taken and that branch mispredictions occur 50%
of the time, the average cost of the branch version is (7 + 6)/2 + (50/100 x 4) = 8.5 cycles,
compared to 10 cycles for conditional execution. Users get better performance by using branch
instructions in this scenario.
4.2.3 Optimizing Complex Expressions
Using conditional instructions improves the code generated for complex expressions such as the C
shortcut evaluation feature. The use of conditional instructions in this fashion improves
performance by minimizing the number of branches, thereby minimizing the penalties caused by
branch misprediction.
int foo(int a, int b){
if (a != 0 && b != 0)
return 0;
else
return 1;
}
The optimized code for the if condition is:
cmp r0, #0
cmpne r1, #0
Similarly, the code generated for this C segment:
int foo(int a, int b){
if (a != 0 || b != 0)
return 0;
else
return 1;
}
is:
cmp r0, #0
cmpeq r1, #0
This approach also reduces the utilization of branch prediction resources.
The Intel XScale® Microarchitecture shift and logical operations provide a useful way of
manipulating bit fields. Bit field operations can be optimized as:
;Set the bit number specified by r1 in register r0
mov r2, #1
orr r0, r0, r2, asl r1
;Clear the bit number specified by r1 in register r0
mov r2, #1
bic r0, r0, r2, asl r1
;Extract the bit-value of the bit number specified by r1 of the
;value in r0 storing the value in r0
mov r1, r0, asr r1
and r0, r1, #1
;Extract the higher order 8 bits of the value in r0 storing
;the result in r1
mov r1, r0, lsr #24
The method outlined here can greatly accelerate encryption algorithms such as the Data Encryption
Standard (DES), Triple DES (3DES), and hashing functions (such as SHA). This approach also
helps other applications such as network packet parsing and voice stream parsing.
4.2.4 Optimizing the Use of Immediate Values
Use the Intel XScale® Microarchitecture MOV or MVN instruction when loading an immediate
(constant) value into a register. Refer to the ARM* Architecture Reference Manual for the set of
immediate values that can be used in a MOV or MVN instruction. It is also possible to generate a
whole set of constant values using a combination of MOV, MVN, ORR, BIC, and ADD
instructions. The LDR instruction has the potential of incurring a cache miss in addition to
polluting the data and instruction caches. Use a combination of the above instructions to set a
register to a constant value. An example of this is shown in these code samples.
;Set the value of r0 to 127
mov r0, #127
;Set the value of r0 to 0xfffffefb.
mvn r0, #260
;Set the value of r0 to 257
mov r0, #1
orr r0, r0, #256
It is possible to load any 32-bit value into a register using a sequence of four instructions.
4.2.5 Optimizing Integer Multiply and Divide
Optimize multiplication by an integer constant to make use of the shift operation.
;Multiplication of R0 by 2^n
mov r0, r0, LSL #n
;Multiplication of R0 by (2^n + 1)
add r0, r0, r0, LSL #n
Multiplication by an integer constant that can be expressed as (2^n + 1) x 2^m can also be optimized.
;Multiplication of r0 by an integer constant that can be
;expressed as (2^n + 1)*(2^m)
add r0, r0, r0, LSL #n
mov r0, r0, LSL #m
Note: Use the preceding optimization only in cases where the multiply operation cannot be scheduled
far enough in advance to prevent pipeline stalls.
Optimize division of an unsigned integer by an integer constant to make use of the shift operation.
;Dividing r0 containing an unsigned value by an integer constant
;that can be represented as 2^n
mov r0, r0, LSR #n
Optimize division of a signed integer by an integer constant to make use of the shift operation.
;Dividing r0 containing a signed value by an integer constant
;that can be represented as 2^n
mov r1, r0, ASR #31
add r0, r0, r1, LSR #(32 - n)
mov r0, r0, ASR #n
The add instruction stalls for one cycle. Prevent this stall by filling in another instruction before the
add instruction.
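For reference, the signed divide-by-2^n sequence corresponds to the following C sketch; it assumes
arithmetic right shifts on signed values, which ARM compilers provide.

/* Signed division by 2^n, rounding toward zero like a true division.
 * The bias is 2^n - 1 when x is negative and 0 otherwise. */
int sdiv_pow2(int x, int n)
{
    unsigned bias = (unsigned)(x >> 31) >> (32 - n);
    return (x + (int)bias) >> n;
}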
4.2.6 Effective Use of Addressing Modes
The Intel XScale® Microarchitecture provides a variety of addressing modes that make indexing
an array of objects highly efficient. Refer to the ARM* Architecture Reference Manual for a
detailed description of ARM* addressing modes. These code samples illustrate how various kinds
of array operations can be optimized to make use of the various addressing modes:
;Set the contents of the word pointed to by r0 to the value
;contained in r1 and make r0 point to the next word
str r1,[r0], #4
;Increment the contents of r0 to make it point to the next word
;and set the contents of the word pointed to the value contained
;in r1
str r1, [r0, #4]!
;Set the contents of the word pointed to by r0 to the value
;contained in r1 and make r0 point to the previous word
str r1,[r0], #-4
;Decrement the contents of r0 to make it point to the previous
;word and set the contents of the word pointed to the value
;contained in r1
str r1,[r0, #-4]!
4.3 Instruction Scheduling for Intel XScale® Microarchitecture and Intel® Wireless MMX™ Technology
This section discusses instruction scheduling optimizations. Instruction scheduling refers to the
rearrangement of a sequence of instructions for the purpose of helping to minimize pipeline stalls.
Reducing the number of pipeline stalls helps improve application performance. While making these
rearrangements, ensure the new sequence of instructions has the same effect as the original
sequence of instructions.
4.3.1 Instruction Scheduling for Intel XScale® Microarchitecture
4.3.1.1 Scheduling Loads
On the Intel XScale® Microarchitecture, an LDR instruction has a result latency of 3 cycles,
assuming the data being loaded is in the data cache. If the instruction after the LDR needs to use the
result of the load, then it would stall for 2 cycles. If possible, rearrange the instructions surrounding
the LDR instruction to avoid this stall.
In the code shown in the following example, the ADD instruction following the LDR stalls for two
cycles because it uses the result of the load.
add r1, r2, r3
ldr r0, [r5]
add r6, r0, r1
sub r8, r2, r3
mul r9, r2, r3
Rearrange the code as shown to prevent the stall:
ldr r0, [r5]
add r1, r2, r3
sub r8, r2, r3
add r6, r0, r1
mul r9, r2, r3
This rearrangement is not always possible. In the following example, the LDR instruction cannot
be moved before the ADDNE or the SUBEQ instructions because the LDR instruction depends on
the result of these instructions.
The optimized code takes six cycles to execute compared to the seven cycles taken by the
unoptimized version.
The result latency for an LDR instruction is significantly higher if the data being loaded is not in
the data cache. To help minimize the number of pipeline stalls in such a situation, move the LDR
instruction as far away as possible from the instruction that uses the result of the load. Moving the
LDR instruction can cause certain register values to be spilled to memory due to the increase in
register pressure. In such cases, use a preload instruction to ensure that the data access in the LDR
instruction hits the cache when it executes.
Intel® PXA27x Processor Family Optimization Guide4-9
In the following code sample, the ADD and LDR instructions can be moved before the MOV
instruction. This helps prevent pipeline stalls if the load hits the data cache. However, if the load is
likely to miss the data cache, move the LDR instruction so it executes as early as possible - before
the SUB instruction. Note, however, that moving the LDR instruction before the SUB instruction
changes the program semantics.
; all other registers are in use
sub r1, r6, r7
mul r3,r6, r2
mov r2, r2, LSL #2
orr r9, r9, #0xf
add r0,r4, r5
ldr r6, [r0]
add r8, r6, r8
add r8, r8, #4
orr r8,r8, #0xf
; The value in register r6 is not used after this
It is possible to move the ADD and the LDR instructions before the SUB instruction so that the
contents of register R6 are allowed to spill and restore from the stack as shown in this example:
; all other registers are in use
str r6, [sp, #-4]!
add r0, r4, r5
ldr r6, [r0]
mov r2, r2, LSL #2
orr r9, r9, #0xf
add r8, r6, r8
ldr r6, [sp], #4
sub r1, r6, r7
mul r3, r6, r2
add r8, r8, #4
orr r8, r8, #0xf
In the previous example, the contents of register R6 are spilled to the stack and subsequently
loaded back to register R6 to retain the program semantics. Using a preload instruction, such as the
one shown in the following example, is another way to optimize the code in the previous example.
; all other registers are in use
add r0, r4, r5
pld [r0]
sub r1, r6, r7
mul r3, r6, r2
mov r2, r2, LSL #2
ldr r6, [r0]
add r8, r6, r8
orr r8, r8, #0xf
; The value in register r6 is not used after this
The Intel XScale® Microarchitecture has four fill buffers used to fetch data from external memory
when a data cache miss occurs. The Intel XScale® Microarchitecture stalls when all fill buffers are
in use. This happens when more than four loads are outstanding and are being fetched from
memory. Write the code to ensure no more than four loads are simultaneously outstanding. For
example, the number of loads issued sequentially should not exceed four. A preload instruction can
cause a fill buffer to be used. As a result, the number of outstanding preload instructions should
also be considered to arrive at the number of loads that are outstanding.
Keeping track of the number of outstanding loads is therefore important to achieving the best
performance on the PXA27x processor.
4.3.1.2 Increasing Load Throughput
Increasing load throughput for data-demanding applications is important. Making use of multiple
outstanding loads increases throughput in the PXA27x processor. Use register rotation to allow
multiple outstanding loads. The following code allows one outstanding load at a time due to the
data dependency between the instructions (load and add). Throughput falls drastically in cases
where there is a cache miss.
Loop:
ldr r1, [r0], #32; r0 be a pointer to some initialized memory
add r2, r2, r1
ldr r1, [r0], #32;
add r2, r2, r1
ldr r1, [r0], #32;
add r2, r2, r1
.
.
.
bne Loop
However, the following example uses multiple registers as the targets for the loads and allows
multiple outstanding loads.
Loop:
ldr r1, [r0], #32; r0 be a pointer to some initialized memory
ldr r2, [r0], #32
ldr r3, [r0], #32
ldr r4, [r0], #32
add r5, r5, r1
add r5, r5, r2
add r5, r5, r3
add r5, r5, r4
.
.
bne Loop
The modified code not only hides the load-to-use latencies for the cases of cache-hits, but also
increases the throughput by allowing several loads to be outstanding at a time.
Due to the complexity of the PXA27x processor, the memory latency can be high, so latency hiding
is very critical. Thus, two things to remember: issue loads as early as possible, and keep up to four
loads outstanding. Another technique for hiding memory latency is to use preloads; the prefetch
technique is discussed in Chapter 5, “High Level Language Optimization”.
4.3.1.3 Increasing Store Throughput
Increasing store throughput is important in applications that process video while updating the
output to the display. Write coalescing in the PXA27x processor (set by the page table attributes)
combines multiple stores going to the same half of the cache line into a single memory transaction.
This approach increases the bus efficiency and throughput. The coalescing operation is transparent
to software. However, software can cause more frequent coalescing by placing store instructions
targeted to the same cache line next to each other and configuring the target page attributes as
bufferable. For example, this code does not take advantage of coalescing:
add r1, r1,r2
str r1,[r0],#4 ; A separate bus transaction
add r1, r1,r3
str r1,[r0],#4; A separate bus transaction
add r1, r1,r4
str r1,[r0],#4; A separate bus transaction
add r1, r1,r5
str r1,[r0],#4; A separate bus transaction
However, it can be modified to allow coalescing to occur, for example as follows (a sketch
assuming registers r6-r9 are available):
add r6, r1, r2
add r7, r6, r3
add r8, r7, r4
add r9, r8, r5
str r6, [r0], #4 ; These four stores target the same
str r7, [r0], #4 ; cache line back-to-back and can
str r8, [r0], #4 ; coalesce in the write buffer
str r9, [r0], #4
The number of write buffers limits the number of successive writes that can be issued before the
processor stalls. No more than eight uncoalesced store instructions can be issued. If the data caches
are using the write-allocate with writeback policy, then a load operation may cause stores to the
external memory if the read operation evicts a cache line that is dirty (modified). The number of
sequential stores may be limited by this fact.
4.3.1.4 Scheduling Load Double and Store Double (LDRD/STRD)
The Intel XScale® Microarchitecture introduces two new double-word instructions: LDRD and
STRD. LDRD loads 64 bits of data from an effective address into two consecutive registers. STRD
stores 64 bits from two consecutive registers to an effective address. There are two important
restrictions on how these instructions are used:
• The effective address must be aligned on an 8-byte boundary
• The specified register must be even (r0, r2)
Using LDRD/STRD instead of LDM/STM to do the same thing is more efficient because
LDRD/STRD issues in only one or two clock cycles, while LDM/STM issues in four clock cycles.
Avoid LDRDs targeting R12 because this incurs an extra cycle of issue latency.
The LDRD instruction has a result latency of three or four cycles depending on the destination
register being accessed (assuming the data being loaded is in the data cache).
add r6, r7, r8
sub r5, r6, r9
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
orr r8, r1, #0xf
mul r7, r0, r7
In the code example above, the ORR instruction stalls for three cycles because of the four cycle
result latency for the second destination register of an LDRD instruction. The preceding code can
be rearranged to help remove the pipeline stalls:
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
add r6, r7, r8
sub r5, r6, r9
mul r7, r0, r7
orr r8, r1, #0xf
Any memory operation following an LDRD instruction (LDR, LDRD, STR and others) stalls for
one cycle.
; The str instruction below will stall for 1 cycle
ldrd r0, [r3]
str r4, [r5]
Intel® PXA27x Processor Family Optimization Guide4-13
4.3.1.5 Scheduling Load and Store Multiple (LDM/STM)
LDM and STM instructions have an issue latency of 2 to 20 cycles depending on the number of
registers being loaded or stored. The issue latency is typically two cycles plus an additional cycle
for each of the registers loaded or stored assuming a data cache hit. The instruction following an
LDM stalls whether or not this instruction depends on the results of the load. An LDRD or STRD
instruction does not suffer from this drawback (except when followed by a memory operation) and
should be used where possible. Consider the task of adding two 64-bit integer values. Assume that
the addresses of these values are aligned on an 8-byte boundary. Achieve this using the following
LDM instructions.
; r0 contains the address of the value being copied
; r1 contains the address of the destination location
ldm r0, {r2, r3}
ldm r1, {r4, r5}
adds r0, r2, r4
adc r1,r3, r5
Assuming all accesses hit the cache, this example code takes 11 cycles to complete. Rewriting the
code as shown in the following example using the LDRD instruction would take only seven cycles
to complete. The performance increases further if users fill in other instructions after the LDRD
instruction to reduce the stalls due to the result latencies of the LDRD instructions and the one
cycle stall of any memory operation.
; r0 contains the address of the value being copied
; r1 contains the address of the destination location
ldrd r2, [r0]
ldrd r4, [r1]
adds r0, r2, r4
adc r1,r3, r5
Similarly, the code sequence in the following example takes five cycles to complete.
stm r0, {r2, r3}
add r1, r1, #1
The alternative version shown below would take only three cycles to complete.
strd r2, [r0]
add r1, r1, #1
4.3.1.6 Scheduling Data-Processing Instructions
Most Intel XScale® Microarchitecture data-processing instructions have a result latency of one
cycle. This means that the current instruction can use the result of the previous data-processing
instruction in the next cycle. However, the result latency is two cycles if the current instruction uses
the result of the previous data-processing instruction for a shift by immediate. As a result, this code
segment would incur a one-cycle stall for the MOV instruction:
sub r6, r7, r8
add r1, r2, r3
mov r4, r1, LSL #2
This code removes the one-cycle stall:
add r1, r2, r3
sub r6, r7, r8
mov r4, r1, LSL #2
All data-processing instructions incur a two-cycle issue penalty and a two-cycle result penalty
when the shifter operand is shifted/rotated by a register or the shifter operand is a register. The next
instruction incurs a two-cycle issue penalty, and there is no way to avoid such a stall except by
rewriting the assembly code. In the following example, the subtract instruction incurs a one-cycle
stall due to the issue latency of the add instruction, as its shifter operand is shifted by a register:
mov r3, #10
mul r4, r2, r3
add r5, r6, r2, LSL r3
sub r7, r8, r2
The issue latency can be avoided by changing the code as:
mov r3, #10
mul r4, r2, r3
sub r7, r8, r2
add r5, r6, r2, LSL r3

4.3.1.7 Scheduling Multiply Instructions
Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. This
code segment incurs a stall of 0-3 cycles depending on the values in registers R1, R2, R4 and R5
due to resource conflicts:
mul r0, r1, r2
mul r3, r4, r5
Due to result latency, this code segment incurs a stall of 1-3 cycles depending on the values in
registers R1 and R2:
mul r0, r1, r2
mov r4, r0
A multiply instruction that sets the condition codes blocks the whole pipeline. A four-cycle
multiply operation that sets the condition codes behaves the same as a four-cycle issue operation.
The add operation in the following example stalls for three cycles if the multiply takes three cycles
to complete.
muls r0, r1, r2
add r3, r3, #1
sub r4, r4, #1
sub r5, r5, #1
It is better to replace the previous example code with this sequence:
mul r0, r1, r2
add r3, r3, #1
sub r4, r4, #1
sub r5, r5, #1
cmp r0, #0
Refer to Section 4.8, “Instruction Latencies for Intel XScale® Microarchitecture” for more
information on instruction latencies for various multiply instructions. The multiply instructions
should be scheduled taking into consideration their respective instruction latencies.
4.3.1.8 Scheduling SWP and SWPB Instructions
The SWP and SWPB instructions have a five cycle issue latency. As a result of this latency, the
instruction following the SWP/SWPB instruction stalls for 4 cycles. Only use the SWP/SWPB
instructions where they are needed. For example, use SWP/SWPB to execute an atomic swap for a
semaphore.
For example, the following code can be used to swap the contents of two memory locations. This
code takes nine cycles to complete.
; Swap the contents of memory locations pointed to by r0 and r1
ldr r2, [r0]
swp r2, r2, [r1]
str r2, [r0]
This code takes six cycles to execute:
; Swap the contents of memory locations pointed to by r0 and r1
ldr r2, [r0]
ldr r3, [r1]
str r3, [r0]
str r2, [r1]
4.3.1.9 Scheduling the MRA and MAR Instructions (MRRC/MCRR)
The MRA (MRRC) instruction has an issue latency of one cycle, a result latency of two or three
cycles depending on the destination register value being accessed, and a resource latency of two
cycles. The code in the following example incurs a one-cycle stall due to the two-cycle resource
latency of an MRA instruction.
mra r6, r7, acc0
mra r8, r9, acc0
add r1, r1, #1
Rearrange the code to prevent the stall:
mra r6, r7, acc0
add r1, r1, #1
mra r8, r9, acc0
Similarly, the following code incurs a two-cycle penalty due to the three-cycle result latency for the
second destination register.
mra r6, r7, acc0
mov r1, r7
The MAR (MCRR) instruction has an issue latency, a result latency, and a resource latency of
two cycles. Due to the two-cycle issue latency, the pipeline always stalls for one cycle following an
MAR instruction. Only use the MAR instruction when necessary.
4.3.1.10 Scheduling MRS and MSR Instructions
The issue latency of the MRS instruction is one cycle and the result latency is two cycles. The issue
latency of the MSR instruction is two cycles (six if updating the mode bits) and the result latency is
one cycle. The ORR instruction in the following example incurs a one cycle stall due to the 2-cycle
result latency of the MRS instruction.
mrs r0, cpsr
orr r0, r0, #1
add r1, r2, r3
Intel® PXA27x Processor Family Optimization Guide4-17
Move the ADD instruction in between the MRS and ORR instructions to prevent this stall.
4.3.1.11 Scheduling Coprocessor 15 Instructions
The MRC instruction has an issue latency of one cycle and a result latency of three cycles. The
MCR instruction has an issue latency of one cycle. The MOV instruction in the following example
incurs a two-cycle latency due to the three-cycle result latency of the MRC instruction.
mrc p15, 0, r0, c1, c0, 0
mov r1, r0
4.3.2 Instruction Scheduling for Intel® Wireless MMX™ Technology
The Intel® Wireless MMX™ Technology provides an instruction set which offers the same
functionality as the Intel® MMX™ Technology and Streaming SIMD Extensions (SSE) integer
instructions.
4.3.2.1 Increasing Load Throughput on Intel® Wireless MMX™ Technology
The constraints on issuing load transactions with the Intel XScale® Microarchitecture also hold with
Intel® Wireless MMX™ Technology. The considerations reviewed using the Intel XScale®
Microarchitecture instructions are re-illustrated in this section using the Intel® Wireless MMX™
Technology instruction set. The primary observations with load transactions are:
• The buffering in the memory pipeline allows two load double transactions to be outstanding
without incurring a penalty (stall).
• Back-to-back WLDRD instructions incur a stall; back-to-back WLDR(BHW) instructions do
not incur a stall.
• The WLDRD requires four cycles to return the DWORD assuming a cache hit; back-to-back
WLDR(BHW) instructions require three cycles to return the data.
• Use prefetching schemes together with the above suggestions.
The overhead of issuing load transactions can be minimized by instruction scheduling and load
pipelining. In most cases it is straightforward to interleave other operations to avoid the penalty with
back-to-back WLDRD instructions. In the following code sequence, three WLDRD instructions are
issued back-to-back, incurring a stall on the second and third instruction.
WLDRD wR0, [r0], #8
WLDRD wR1, [r0], #8
WLDRD wR2, [r0], #8
Always try to separate three consecutive WLDRD instructions so that only two are outstanding at
any one time and the loads are always interleaved with other instructions.
4.3.2.2 Scheduling the WMAC Instructions
The issue latency of the WMAC instruction is one cycle and the result and resource latency is two
cycles. The second WMAC instruction in the following example stalls for one cycle due to the two-cycle
resource latency.
WMACS wR0, wR2, wR3
WMACS wR1, wR4, wR5
The WADD instruction in the following example stalls for one cycle due to the two cycle result
latency.
WMACS wR0, wR2, wR3
WADD wR1, wR0, wR2
It is often possible to interleave instructions and effectively overlap their execution with
multi-cycle instructions that utilize the multiply pipeline. The two-cycle WMAC instruction may be
easily interleaved with operations which do not utilize the same resources:
WMACS wR0, wR2, wR3
WLDRD wR4, [r1], #8
WALIGNI wR5, wR6, wR7, #4
In the above example, the WLDRD and WALIGNI instructions do not incur a stall since they are
utilizing the memory and execution pipelines respectively and there are no data dependencies.
When utilizing both Intel XScale® Microarchitecture and Intel® Wireless MMX™ Technology
execution resources, it is also possible to overlap the multicycle instructions. The ADD instruction
in the following example executes with no stalls.
WMACS wR14, wR1, wR2
ADD R1, R2, R3
Refer to Section 4.8, “Instruction Latencies for Intel XScale® Microarchitecture” for more
information on instruction latencies for various multiply instructions. The multiply instructions
should be scheduled taking into consideration their respective instruction latencies.
4.3.2.3 Scheduling the TMIA Instruction
The issue latency of the TMIA instruction is one cycle and the result and resource latency are two
cycles. The second TMIA instruction in the following example stalls for one cycle due to the two
cycle resource latency.
TMIA wR0, r2, r3
TMIA wR1, r4, r5
The WADD instruction in the following example stalls for one cycle due to the two cycle result
latency.
TMIA wR0, r2, r3
WADD wR1, wR0, wR2
Refer to Section 4.8, “Instruction Latencies for Intel XScale® Microarchitecture” for more
information on instruction latencies for various multiply instructions. The multiply instructions
should be scheduled taking into consideration their respective instruction latencies.

4.3.2.4 Scheduling the WMUL and WMADD Instructions

The issue latency of the WMUL and WMADD instructions is one cycle and the result and resource
latency are two cycles. The second WMUL instruction in the following example stalls for one
cycle due to the two-cycle resource latency.
WMUL wR0, wR1, wR2
WMUL wR3, wR4, wR5
The WADD instruction in the following example stalls for one cycle due to the two cycle result
latency.
WMUL wR0, wR1, wR2
WADD wR1, wR0, wR2
4.4 SIMD Optimization Techniques

The Single Instruction Multiple Data (SIMD) architecture provided by the Intel® Wireless
MMX™ Technology enables us to exploit the inherent parallelism found in the wide domain of
multimedia and communication applications. The most time-consuming code sequences have
certain characteristics in common:
• Operations are performed on small native data types (8-bit pixels, 16-bit voice, 32-bit audio)
• Regular and recurring memory access patterns, usually data independent
• Localized, recurring computations performed on the data
• Compute-intensive processing
In the following sections we illustrate how the rules for writing fast sequences of Intel® MMX™
Technology instructions on Intel® Wireless MMX™ Technology can be applied to the
optimization of short loops of Intel® MMX™ Technology code.
4.4.1 Software Pipelining

Software pipelining, or loop unrolling, is a well-known optimization technique in which multiple
calculations are executed with each loop iteration. The disadvantages of applying this technique
include increases in code size for critical loops and restrictions on the minimum number and
multiples of taps or samples.
The obvious advantage is reduced cycle consumption: overhead from loop-exit testing may be
reduced, load-to-use stalls may be minimized and in some cases eliminated completely, and
instruction-scheduling opportunities may be created and exploited.
To illustrate the need for software pipelining, let us consider a key kernel of Intel® MMX™
Technology code that is central to many signal-processing algorithms: the real block Finite
Impulse Response (FIR) filter. A real block FIR filter operates on two real vectors c(i) and x(i) and
produces an output vector y(n). The vectors are represented for Intel® MMX™ Technology
programming as arrays of 16-bit integers of some length N. The real FIR filter is represented by the
equation:

y(n) = c(0)*x(n) + c(1)*x(n-1) + ... + c(T-1)*x(n-T+1)

which corresponds to the following C code:

for (i = 0; i < N; i++) {
    s = 0;
    for (j = 0; j < T; j++) {
        s += a[j]*x[i-j];
    }
    y[i] = round(s);
}
The WMAC instruction is utilized for this calculation and provides for four parallel 16-bit by
16-bit multiplications with accumulation. The first level of unrolling is a direct function of the
four-way SIMD instruction that is used to implement the filter.
The C code for the real block FIR filter is re-written to illustrate that four taps are computed for
each loop iteration:

for (i = 0; i < N; i++) {
    s0 = 0;
    for (j = 0; j < T/4; j++) {
The direct assembly code implementation of the inner loop illustrates clearly that optimum
execution has not been accomplished. In the following code sequence there are several undesirable
stalls: the back-to-back WLDRD instructions incur a one-cycle stall, and the loop overhead is high,
with two cycles being consumed for every four taps.
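A representative (unoptimized) inner loop is sketched below; the pointer and counter registers are
illustrative:

Inner_Loop:
WLDRD wR0, [r1], #8  ; load 4 input samples
WLDRD wR8, [r2], #8  ; load 4 coefficients; stalls behind the first WLDRD
SUBS  r4, r4, #1     ; loop overhead
WMACS wR15, wR0, wR8 ; 4 parallel 16-bit multiply-accumulates
BNE   Inner_Loop     ; loop overhead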
The parallelism of the filter may be exposed further by unrolling the loop to provide for eight taps
per iteration. In the following code sequence, the loop has been unrolled once, allowing several
load-to-use stalls to be eliminated. The loop overhead has also been further amortized, reducing it
from two cycles for every four taps to two cycles for every eight taps. There is still a single
load-to-use stall present between the second WLDRD instruction and the second WMACS
instruction within the inner loop.
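A sketch of the unrolled loop (the register allocation is assumed for illustration):

Inner_Loop:
WLDRD wR0, [r1], #8  ; samples, taps 0-3
WLDRD wR8, [r2], #8  ; coefficients, taps 0-3
WLDRD wR1, [r1], #8  ; samples, taps 4-7
WMACS wR15, wR0, wR8 ; accumulate taps 0-3
WLDRD wR9, [r2], #8  ; coefficients, taps 4-7
SUBS  r4, r4, #1
WMACS wR15, wR1, wR9 ; accumulate taps 4-7; the remaining load-to-use stall
BNE   Inner_Loop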
In the example for the real block FIR filter, two copies of the basic sequence of code were
interleaved, eliminating all but one of the stalls. The throughput for the sequence went from 9
cycles for every four taps to 9 cycles for every eight taps. This corresponds to a throughput of
1.125 cycles per tap and represents a 2X throughput improvement.
It is useful to define a metric to describe the number of copies of a basic sequence of instructions
which need to be interleaved in order to remove all stalls. We can call this the interleave factor, k.
The real block FIR filter requires k=2 to eliminate all possible stalls primarily because it is a small
sequence which must take into account the long load-to-use latency. In practice, k=2 is sufficient
for most loops encountered in real applications. This is fortunate because each interleaving requires
its own set of temporary registers and with some algorithms interleaving with k=3 is not possible.
A good rule of thumb is to try k=2 first, as it is usually the right choice.
4.4.2 Multi-Sample Technique

The multi-sample optimization technique provides for calculating multiple outputs with each loop
iteration, similar to loop unrolling. The disadvantages of applying this technique include increases
in code size for critical loops; restrictions on the minimum number and multiples of taps or
samples are also imposed. The obvious advantage is reduced cycle consumption. In addition:
• Memory bandwidth is reduced by data re-use.
• Load-to-use stalls may be easily eliminated with scheduling.
In the inner loop, we are calculating four output samples using the adjacent data samples x(n-i),
x(n-i+1), x(n-i+2) and x(n-i+3). The output samples y(n), y(n+1), y(n+2), and y(n+3) are assigned
to four 64-bit Intel® Wireless MMX™ Technology registers. In order to obtain near-ideal
throughput, the inner loop is unrolled to provide for eight taps for each of the four output samples
per loop iteration.
Outer_Loop:
; ** Update pointers, zero accumulators and prime the loop with DWORD loads
WLDRD wR0, [R1], #8 ; Load first 4 input samples
WZERO wR15
WLDRD wR1, [R1], #8 ; Load even groups of 4 input samples
WZERO wR14
WLDRD wR8, [R2], #8 ; Load first 4 coefficients
WZERO wR13
WZERO wR12
Inner_Loop:
; ** Executes 8 taps for each of the four output samples
; ** y(n), y(n+1), y(n+2), y(n+3)
WLDRD wR8, [R2], #8 ; even groups of 4 coeff.
WMAC wR12, wR8, wR5 ; y(n+3) +=
BNE Inner_Loop
; ** Outer loop code calculates the last four taps for
; ** y(n), y(n+1), y(n+2), y(n+3)
; ** Store results
BNE Outer_Loop
4.4.2.1 General Remarks on Multi-Sample Technique

In the example for the real block FIR filter, four outputs are computed simultaneously in the same
inner loop. This allows the coefficients and sample data loaded into the registers for the
computation of the first output to be re-used for the computation of the next three outputs. The
interleave factor is set at k=2, which results in the elimination of load-to-use stalls. The throughput
for the sequence is 20 cycles for every 32 taps, or 0.625 cycles per tap. This represents near-ideal
saturation of the execution resources.
The multi-sample technique may be applied whenever the same data is being utilized for multiple
calculations. The large register file on
Intel® Wireless MMX™ Technology facilitates this
approach and a number of variations are possible.
4.4.3 Data Alignment Techniques
The exploitation of the data parallelism present in multimedia algorithms is accomplished by
executing the same operation on different elements in parallel. This is accomplished by packing
several data elements into a single register and using the packed data instructions provided by the
Intel® Wireless MMX™ Technology.
An important guideline for achieving optimum performance is always to align memory references.
This means that an N-byte memory read or write should always be on an N-byte boundary. In some
cases it is easy to align data so that all of the reads and writes are aligned. In other cases it is more
difficult because an algorithm naturally reads data in a misaligned fashion. A couple of examples
of this include the single-sample FIR filter and video motion estimation.
The Intel® Wireless MMX™ Technology provides a mechanism for reducing the overhead
associated with the classes of algorithms which require data to be accessed on 32-bit, 16-bit, or
8-bit boundaries. The WALIGNI instruction is useful when the sequence of alignments is known
beforehand, as with the single-sample FIR filter. The WALIGNR instruction is useful when the
sequence of alignments is calculated as the algorithm executes, as with the fast motion search
algorithms used in video compression. Both of these instructions operate on register pairs which
may be effectively ping-ponged with alternate loads, reducing the alignment overhead significantly.
The following code sequence illustrates the set-up process for an unaligned array access. The
procedure involves loading one of the general-purpose control registers of Intel® Wireless
MMX™ Technology with the alignment offset and rounding the source pointer down so that
subsequent accesses to the array are on a 64-bit boundary.

; r0 -> pointer to misaligned array.
MOV r5, #7
AND r7, r0, r5  ; r7 = 3 LSBs of the pointer
TMCR wCGR1, r7  ; alignment offset into wCGR1
SUB r0, r0, r7  ; r0 -> psrc, 64-bit aligned
Following the initial setup for alignment, the data can now be accessed, aligned, and presented to
the execution resources.
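Continuing the sketch above, the aligned DWORDs can then be loaded in pairs and combined with
WALIGNR (the register choices are illustrative):

WLDRD wR0, [r0], #8    ; first aligned DWORD
WLDRD wR1, [r0], #8    ; next aligned DWORD
WALIGNR1 wR2, wR0, wR1 ; extract the 8 misaligned bytes using wCGR1

By alternating which register of the pair receives each new load, the pair can be ping-ponged so
that only one new WLDRD is needed per iteration.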
4.5 Porting Existing Intel® MMX™ Technology Code

The re-use of existing Intel® MMX™ Technology code is encouraged, since algorithm mapping to
Intel® Wireless MMX™ Technology may be significantly accelerated. The Intel® MMX™
Technology target pipeline and architecture differ from Intel® Wireless MMX™ Technology,
and several changes are required for optimal mapping. The algorithms may require some re-design,
and attention to several aspects will make the task more manageable:
• Data width – Intel® MMX™ Technology uses different designators for data types:
— Packed words for 16-bit operands; Intel® Wireless MMX™ Technology uses halfword (H)
— Packed doublewords for 32-bit operands; Intel® Wireless MMX™ Technology uses word (W)
— Quadwords for 64-bit operands; Intel® Wireless MMX™ Technology uses doubleword (D)
• Instruction latencies – Instruction latencies are different with Intel® Wireless MMX™
Technology; the scheduling of instructions may need to be altered.
• Instruction pairing – Intel® MMX™ Technology interleaves with x86 instructions to reduce
stalls; the pairing of instructions may need to be altered in some cases on Intel® Wireless
MMX™ Technology.
• Operand alignment – DWORD loads and stores require 64-bit alignment. The pointers must be
on a 64-bit boundary to avoid an exception.
• Memory latency – Memory latency for the PXA27x processor is different than for existing
Intel® MMX™ Technology processors.
• Register file – Intel® Wireless MMX™ Technology provides sixteen 64-bit data registers,
twice the number available with Intel® MMX™ Technology. These registers can be used to
store intermediate results and coefficients for tight multimedia inner loops without having to
perform memory operations.
• Three-operand format – The Intel® Wireless MMX™ Technology instructions provide
encoding for three registers, unlike the Intel® MMX™ Technology instructions, which provide
for two registers only. The destination register may therefore be different from the source
registers when converting Intel® MMX™ Technology code to Intel® Wireless MMX™
Technology. Remove code sequences in Intel® MMX™ Technology that used MOV
instructions to work around the destructive two-operand register behavior, to improve
throughput.
The following is an example of mapping Intel® MMX™ Technology code onto Intel® Wireless
MMX™ Technology.
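A minimal sketch (the instruction choice is illustrative): the two-operand Intel® MMX™
Technology sequence

movq  mm2, mm1 ; copy needed only to preserve mm1
paddw mm2, mm0 ; mm2 = mm1 + mm0, packed 16-bit add

maps onto a single three-operand Intel® Wireless MMX™ Technology instruction:

WADDH wR2, wR1, wR0 ; wR2 = wR1 + wR0, packed 16-bit add

The explicit register copy disappears, which is exactly the clean-up of destructive-register MOV
instructions described above.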
4.6 Optimizing Libraries for System Performance

Many of the standard C library routines can benefit greatly from being optimized for the Intel
XScale® Microarchitecture. String and memory manipulation routines, such as memcpy(),
memset(), strcpy(), and strcmp(), are good candidates to be tuned for the Intel XScale®
Microarchitecture.
Apart from the C libraries, there are many critical functions that can be optimized in the same
fashion. For example, graphics drivers and graphics applications frequently use a set of key
functions. These functions can be optimized for the PXA27x processor. In the following sections a
set of routines are provided as optimization case studies.
4.6.1 Case Study 1: Memory-to-Memory Copy
The performance of memory copy (memcpy) is influenced by memory-access latency and memory
throughput. During memcpy, if the source and destination are both in cache, the performance is the
highest and simple load-instruction scheduling can ensure the most efficient performance.
However, if the source or the destination is not in the cache, a load-latency-hiding technique has to
be applied.
Using preloads appropriately, the code can be desensitized to the memory latency (the terms
preload and prefetch are used interchangeably). Preloads are described further in Section 5.1.1.1.2,
“Preload Loop Scheduling” on page 5-2. The following code performs memcpy with optimizations
for latency desensitization.
; for cache-line-aligned case
PLD [r5]
PLD [r5, #32]
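The remainder of the routine might look like the following sketch (the register roles are assumed:
r5 = source, r6 = destination, r7 = count of 32-byte cache lines):

Loop:
PLD  [r5, #96]     ; preload three cache lines ahead
LDRD r0, [r5], #8  ; read one 32-byte line with four LDRDs
LDRD r2, [r5], #8
LDRD r8, [r5], #8
LDRD r10, [r5], #8
STR  r0, [r6], #4  ; grouped stores coalesce
STR  r1, [r6], #4  ; in the write buffer
STR  r2, [r6], #4
STR  r3, [r6], #4
STR  r8, [r6], #4
STR  r9, [r6], #4
STR  r10, [r6], #4
STR  r11, [r6], #4
SUBS r7, r7, #1    ; one cache line per iteration
BNE  Loop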
This code preloads three cache lines ahead of its current iteration. It also uses LDRD and groups
the STRs together so that the stores coalesce in the write buffer.
4.6.2 Case Study 2: Optimizing Memory Fill

Graphics applications use fill routines. Most personal digital assistant (PDA) LCD displays use an
output color format of RGB (16 bits or 8 bits). Therefore, most fill routines write out pixels as
bytes or half-words, which is not recommended in terms of bus-bandwidth usage. Instead, multiple
pixels can be packed into a 32-bit data format and used for writing to memory. Use packing to
improve efficiency.
Fill routines can make effective use of the write-coalescing feature which the PXA27x processor
provides if the LCD frame buffer is allocated as un-cached but bufferable. This code example
shows a common fill function:

unsigned short wColor, *pDst, DstStride;
BlitFill( ) {
    for (int i = 0; i < iRows; i++) {
        // Set this solid color for whole scanline, then advance to next
        for (int j = 0; j < iCols; j++)
            *pDst++ = wColor;
        pDst += DstStride;
    }
}

The optimized assembly version below packs two pixels into each 32-bit word:
; Get wColor for each pixel arranged into hi and lo half-words
; of register so that multiple pixels can be written out
orr r4,r1,r1,LSL #16
; code to check alignment of source and destination
; and code to handle end cases
; setup counters etc. is not shown
; Optimized loop may look like …
LOOP
str r4,[r0],#4 ; inner loop that fills destination scan line
str r4,[r0],#4 ; pointed by r0
str r4,[r0],#4 ; these stores take advantage of write coalescing
str r4,[r0],#4
str r4,[r0],#4 ; writing out as words
str r4,[r0],#4 ; instead of bytes or half-words
str r4,[r0],#4 ; achieves optimum performance
subs r5,r5,#1 ;Fill 32 units(16 bits WORD) in each loop here
bne LOOP
If the data is going to the internal memory, the same code offers even greater throughput.
4.6.3 Case Study 3: Dot Product

The dot product is a typical vector operation for signal-processing applications and graphics. For
example, vertex transformation uses a graphics dot product. Using Intel® Wireless MMX™
Technology features can help accelerate these applications. The following code demonstrates how
to attain this acceleration. These items are key issues for optimizing the dot-product code:
• Use WLDRD if the input is aligned
• Use the 2-cycle WMAC instruction
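A minimal sketch of such a kernel (the register roles are assumed: r0 and r1 point to
64-bit-aligned arrays of 16-bit values, r2 = element count / 4):

      WZERO wR15          ; clear the accumulator
Loop:
      WLDRD wR0, [r0], #8 ; four 16-bit elements of the first vector
      WLDRD wR1, [r1], #8 ; four 16-bit elements of the second vector
      SUBS  r2, r2, #1
      WMACS wR15, wR0, wR1 ; four parallel multiply-accumulates
      BNE   Loop
      TMRRC r2, r3, wR15  ; move the accumulated result to core registers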
4.6.4 Case Study 4: Landscape-to-Portrait Conversion

Many handheld devices use a native landscape orientation for internal graphics application
processing. However, if the end user views the output in portrait mode, a landscape-to-portrait
conversion needs to occur each time the frame buffer is written to the display.
The display driver usually implements the landscape-to-portrait conversion when the frame is
copied from the off-screen buffer to the display buffer. The following C code example shows a
landscape-to-portrait rotation.
In this example, row indicates the current row of the off-screen buffer; pDst and pSrc are
single-byte pointers to the display and off-screen buffers, respectively.
for (row=Top; row < Bottom; row++) {
for (j=0;j<2;j++) {
*pDst++=*(pSrc+j);
}
pSrc-=bytesPerRow; // bytesPerRow = Stride
}
This is an optimized version of the previous example in assembly:
;Set up for loop goes here
;This shows only the critical loop in the implementation
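A sketch of such a loop (the stride and register roles are assumptions: r0 walks a source column,
r2 = bytesPerRow, r3 = pixels remaining, r10 = destination):

Loop:
LDRB r4, [r0], -r2      ; gather four bytes from one source column
LDRB r5, [r0], -r2
LDRB r6, [r0], -r2
LDRB r7, [r0], -r2
ORR  r4, r4, r5, LSL #8 ; pack the four pixels into one word
ORR  r4, r4, r6, LSL #16
ORR  r4, r4, r7, LSL #24
STR  r4, [r10], #4      ; one word store per four pixels
SUBS r3, r3, #4
BNE  Loop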
In the following example, scheduled instructions take advantage of write-coalescing of multiple
store instructions to the same line. In this example, the two stores are combined in a single
write-buffer entry and issued as a single write request.
str r11, [r10], #4; Write Coalesce the two stores
str r12, [r10], #4
This can be exploited by either unrolling the C loop or by explicitly inlining multiple stores which
can be combined.
The register rotation technique also allows multiple loads to be outstanding.
4.6.5 Case Study 5: 8x8 Block 1/2X Motion Compensation
Bi-linear interpolation is a typical operation in image and video processing applications. For
example, video-decode motion compensation uses the 1/2X interpolation operation. Using
Intel® Wireless MMX™ Technology features can help accelerate these key applications. The
following code demonstrates how to attain this acceleration. These items are key issues for
optimizing the 1/2X motion compensation:
• Use the WALIGNR instruction for aligning the packed byte array.
• Use the WAVG2BR instruction for calculating the average of bytes.
• Schedule around the load-to-use latency.
This example code is for the 1/2X interpolation:
; Test for special case of aligned ( LSBs = 110b and 000b)
; r0 -> pointer to misaligned array.
MOV r5,#7 ; r5 =0x7
AND r7,r0,r5 ; r7 -> 3 LSBs of *psrc
MOV r12,#4 ; counter
Intel® PXA27x Processor Family Optimization Guide4-33
WLDRD wR0, [r0] ; load first group of 8 bytes
ADD r8,r7,#1 ; r8 -> 3LSBs of *psrc + 1
WLDRD wR1, [r0,#8] ; load second group of 8 bytes
TMCR wCGR1, r7 ; transfer alignment to wCGR1
WLDRD wR2, [r0]
TMCR wCGR2, r8
WLDRD wR3, [r0,#8]
LOOP
; We recommend completely unrolling this loop; it saves 8 cycles
ADD r0, r0, r1
SUBS r12, r12, #1
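The elided loop body essentially performs two alignments and one average per group of eight
pixels; a sketch follows, with r3 assumed as the destination pointer:

WALIGNR1 wR4, wR0, wR1 ; 8 bytes starting at psrc (offset in wCGR1)
WALIGNR2 wR5, wR0, wR1 ; 8 bytes starting at psrc+1 (offset in wCGR2)
WAVG2BR  wR6, wR4, wR5 ; byte-wise average with rounding
WSTRD    wR6, [r3], #8 ; store 8 interpolated pixels
WLDRD    wR0, [r0]     ; load the next row after the pointer update
WLDRD    wR1, [r0, #8]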
4.7 Intel® Performance Primitives

Users who want to take full advantage of many of the optimizations in this guide are likely to use
these techniques:
• Write hand-optimized assembly code.
• Take advantage of a compiler tuned for the Intel XScale® Microarchitecture and the ARM*
v5TE instruction set architecture.
• Incorporate a library with optimizations already present.
For the last item, a library of fully optimized code, the Intel® Integrated Performance Primitives
(IPP), is available. The IPP comprise a rich and powerful set of general and multimedia signal-
processing kernels optimized for high performance on the PXA27x processor. Besides
optimization, the IPP offer application developers a number of significant advantages, including
accelerated time-to-market, compatibility with many major real-time embedded operating systems,
and support for porting across certain Intel® platforms.
The IPP include optimized general signal and image processing primitives, as well as primitives for
use in constructing internationally standardized audio, video, image, and speech encoder/decoders
(CODECs) for the PXA27x processor.
IPP available for general one-dimensional (1D) signal processing include:
• Vector initialization, arithmetic, statistics, thresholding, and measure
• Deterministic and random signal generation
• Convolution, filtering, windowing, and transforms
IPP for general two-dimensional (2D) image processing include:
• Vector initialization, arithmetic, statistics, thresholding, and measure
• Color conversions
• Morphological operations
• Convolution, filtering, windowing, and transforms
Additional IPP are available allowing construction of these multimedia CODECs:
• Video - ITU H.263 decoder, ISO/IEC 14496-2 MPEG-4 decoder
• Speech - ITU-T G.723.1 CODEC and ETSI GSM-AMR codec
• Image - ISO/IEC JPEG CODEC
For more details on the IPP, as well as upcoming libraries for 3D graphics and encryption, browse
http://intel.com/software/products/ipp/.
4.8 Instruction Latencies for Intel XScale® Microarchitecture

The following sections show the latencies for all the instructions with respect to their functional
groups: branch, data processing, multiply, status register access, load/store, semaphore, and
coprocessor.
Section 4.8.1, “Performance Terms” explains how to read Table 4-2 through Table 4-17.
4.8.1 Performance Terms
• Issue Clock (cycle 0)
The first cycle when an instruction is decoded and allowed to proceed to further stages in the
execution pipeline.
• Cycle Distance from A to B
The cycle distance from cycle A to cycle B is (B-A) — the number of cycles from the start of
cycle A to the start of cycle B. For example, the cycle distance from cycle 3 to cycle 4 is one
cycle.
• Issue Latency
The cycle distance from the first issue clock of the current instruction to the issue clock of the
next instruction. Cache misses, resource-dependency stalls, and resource availability conflicts
can influence the actual number of cycles.
• Result Latency
The cycle distance from the first issue clock of the current instruction to the issue clock of the
first instruction using the result without incurring a resource dependency stall. Cache-misses,
resource-dependency stalls, and resource availability conflicts influence the actual number of
cycles.
• Minimum Issue Latency (without Branch Misprediction)
The minimum cycle distance from the issue clock of the current instruction to the first possible
issue clock of the next instruction, assuming these conditions:
— The issuing of the next instruction is not stalled by a resource dependency
— The next instruction can be immediately fetched from the cache or memory interface
— The current instruction does not incur a resource dependency stall during execution that
can not be detected at its issue time
— If the instruction uses dynamic branch prediction, correct prediction is assumed
• Minimum Result Latency
The minimum cycle distance from the issue clock of the current instruction to the issue clock
of the first instruction that uses the result without incurring a resource dependency stall,
assuming these conditions:
— The issuing of the next instruction is not stalled by a resource dependency
— The next instruction can be immediately fetched from the cache or memory interface
— The current instruction does not incur resource dependency stalls during execution that
can not be detected at its issue time
• Minimum Issue Latency with Branch Misprediction
The minimum cycle distance from the issue clock of the current branching instruction to the
first possible issue clock of the next instruction. The value is identical to the minimum issue
latency except that the branching instruction is mispredicted. It is calculated by adding the
minimum issue latency (without branch misprediction) to the minimum branch latency penalty
cycles from Table 4-3 and Table 4-4.
• Minimum Resource Latency
The minimum cycle distance from the issue clock of the current multiply instruction to the
issue clock of the next multiply instruction assuming the second multiply does not incur a data
dependency and is immediately available from the instruction cache or memory interface.
This code is an example of computing latencies:
UMLAL r6,r8,r0,r1
ADD r9,r10,r11
SUB r2,r8,r9
MOV r0,r1
Table 4-2 shows how to calculate issue latency and result latency for each instruction. The
UMLAL instruction (shown in the Issue column) starts to issue on cycle 0 and the next instruction,
ADD, issues on cycle 2, so the issue latency for UMLAL is two. From the code fragment, there is a
result dependency between the UMLAL instruction and the SUB instruction. In Table 4-2,
UMLAL starts to issue at cycle 0 and the SUB issues at cycle 5, thus the result latency is five.

Table 4-2. Latency Example

Cycle   Issue               Executing
0       umlal (1st cycle)   —
1       umlal (2nd cycle)   umlal
2       add                 umlal
3       sub (stalled)       umlal & add
4       sub (stalled)       umlal
5       sub                 umlal
6       mov                 sub
7       —                   mov
4.8.2 Branch Instruction Timings

Table 4-3. Branch Instruction Timings (Those Predicted by the BTB (Branch Target Buffer))

Instruction   Minimum Issue Latency When Correctly   Minimum Issue Latency with
              Predicted by the BTB                   Branch Misprediction
B             1                                      5
BL            1                                      5

Table 4-4. Branch Instruction Timings (Those Not Predicted by the BTB)

Instruction                          Minimum Issue Latency When   Minimum Issue Latency When
                                     the Branch Is Not Taken      the Branch Is Taken
BLX(1)                               —                            5
BLX(2)                               1                            5
BX                                   1                            5
Data Processing Instruction with     Same as Table 4-5            4 + numbers in Table 4-5
PC as the destination
LDR PC,<>                            2                            8
LDM with PC in register list         3 + numreg†                  10 + max (0, numreg-3)†

† numreg is the number of registers in the register list including the PC.
In general, the timing of THUMB* instructions is the same as their equivalent ARM* instructions,
except for these cases:
• If the equivalent ARM* instruction maps to an entry in Table 4-3, the “Minimum Issue
Latency with branch misprediction” goes from 5 to 6 cycles. This is due to the branch latency
penalty.
• If the equivalent ARM* instruction maps to one in Table 4-4, the “Minimum Issue Latency
when the Branch is Taken” increases by one cycle. This is due to the branch latency penalty.
• The timings of a THUMB* BL instruction and an ARM* data processing instruction when
H=0 are the same.
The mapping of THUMB* instructions to ARM* instructions can be found in the ARM*
Architecture Reference Manual.
4.9 Instruction Latencies for Intel® Wireless MMX™ Technology

The issue cycle and result latency of all the PXA27x processor instructions are shown in
Table 4-18. In this table, the issue cycle is the number of cycles that an instruction takes to leave
the register file. The result latency is the number of cycles required to calculate the result and make
it available to the bypassing logic. A result latency of 1 indicates that the value is available
immediately to the following instruction. Table 4-18 shows the best-case result latency, which can
be degraded by data or resource hazards.
Table 4-18. Issue Cycle and Result Latency of the PXA27x Processor Instructions (Sheet 1 of 2)

Instructions   Issue Cycle   Result Latency
WADD           1             1
WSUB           1             1
WCMPEQ         1             2
WCMPGT         1             2
WAND           1             1
WANDN          1             1
WOR            1             1
WXOR           1             1
WAVG2          1             1
WMAX           1             2
WMIN           1             2
WSAD           1             1
WACC           1             1
WMUL           1             1
WMADD          1             1
4.10 Performance Hazards

The basic performance of the system can be affected by stalls caused by data or resource hazards.
This section describes the factors affecting each type of hazard and the implications for
performance.
4.10.1 Data Hazards

A data hazard occurs when an instruction requires data that cannot be provided by the register file
or the data-forwarding mechanism, or when two instructions update the same destination register in
an out-of-order fashion. The first hazard is termed Read-After-Write (RAW) and the second is
termed Write-After-Write (WAW). The processing of the new instruction is stalled until the data
becomes available, for RAW hazards, and until it can be guaranteed that the new instruction
updates the register file after the previous instruction has updated the same destination register, for
WAW hazards. The PXA27x processor contains a bypassing mechanism for ensuring that data at
different stages of the pipeline can be forwarded to the correct instructions. There are, however,
certain combinations of instructions where it is not possible to forward directly between
instructions in the PXA27x processor 1.0 implementation.
The result latencies shown in Table 4-18 are best-case values and are generally achievable.
However, there are certain instruction combinations where these result latencies do not hold,
because not all combinations of bypassing logic exist in the hardware and some instructions
require more time to calculate the result when certain qualifiers are specified. This list describes the
data hazards for the PXA27x processor 1.0 implementation:
• When saturation is specified for WADD or WSUB, the result latency is increased to two cycles.
• The destination register (accumulator) for certain multiplier instructions (WMAC, WSAD,
TMIA, TMIAph, TMIAxy) can be forwarded for accumulation to the same destination register
only. If the destination register results are needed by another instruction as source operands,
there is an additional result latency as the result is available from the regular forwarding paths,
external to the multiplier. The exact number of extra cycles depends upon the multiplier
instruction that is delivering results to source operands of other instructions.
• If an instruction is updating a destination register from the multiply pipeline, a following
instruction in the execute, memory, or core interface pipelines updating the same destination
register is stalled until it can be guaranteed that the following instruction will update the
register file after the previous instruction in the multiply pipe has updated the register file.
• If an instruction is updating a destination register from the memory pipeline, a following
instruction updating the same destination register is stalled until it can be guaranteed that the
following instruction will update the register file after the previous instruction in the memory
pipe has updated the register file.
• If the Intel XScale® Microarchitecture MAC unit is in use, the result latency of a TMRC,
TMRRC, or TEXTRM increases accordingly.
4.10.2 Resource Hazard
A resource hazard is caused when an instruction requires a resource that is already in use. When
this condition is detected, the processing of the new instruction is stalled at the register file stage.
Figure 4-1 shows a high-level representation of the operation of the PXA27x processor
coprocessor. After the register file, there are four concurrent pipelines to which an instruction can
be dispatched. An instruction can be issued to a pipeline if the resource is available and there are no
unresolved data dependencies. For example, a load instruction that uses the Memory pipeline can
be issued while a multiply instruction is completing in the Multiply pipeline (assuming there are no
data hazards.)
Figure 4-1. High-Level Pipeline Organization (the register file feeds four concurrent pipelines:
the execution pipeline, the multiply pipeline, the memory pipeline, and the core interface pipeline)
The performance effect of resource contention can be quantified by examining the delay taken for a
particular instruction to release the resource after starting execution. The definition of “release the
resource” in this context is that the resource can accept another instruction (note: the resource may
still be processing the previous instruction further down its internal pipeline). A delay of one clock
cycle indicates that the resource is available immediately to the next instruction. A delay greater
than one clock cycle stalls the next instruction if the same resource is required. The following
sections examine the resource-usage delays for the four pipelines, and how these map onto the
instruction set.
4.10.2.1 Execution Pipeline
An instruction can be accepted into the execution pipeline when the first stage of the pipeline is
empty. Table 4-19 shows the instructions that execute in the main execution pipeline. All these
instructions have a resource usage delay of one clock cycle. Therefore, the execution pipeline is
always available to the next instruction.
Table 4-19. Resource Availability Delay for the Execution Pipeline

Instructions   Delay (Clocks)
WAVG2          1
WMAX           1
WMIN           1
WSAD†          1
WSLL           1
WSRA           1
WSRL           1
WROR           1
WPACK          1
WUNPCKEH       1
WUNPCKEL       1
WUNPCKIH       1
WUNPCKIL       1
WALIGNI        1
WALIGNR        1
WSHUF          1
TMIA†          1
TMIAph†        1
TMIAxy†        1
TMCR           1
TMCRR          1
TINSR          1
TBCST          1
TANDC          1
TORC           1
TEXTRC         1

† The WSAD, TMIA, TMIAph, and TMIAxy instructions execute in both the main execution
pipeline and the multiplier pipeline. They execute for one cycle in the execution pipeline and the
rest in the multiplier pipeline. See Section 4.10.2.5 for more details.
4.10.2.2 Multiply Pipeline

Instructions issued to the multiply pipeline may take up to two cycles before another instruction
can be issued to the pipeline. The instructions in the multiply pipe can be categorized into the four
classes shown in Table 4-20. The resource-availability delay for the instructions that are mapped
onto the multiplier pipeline depends upon the class of the multiply instruction that subsequently
wants to use the multiply resource. These delays are shown in Table 4-21. For example, if a TMIA
instruction is followed by a TMIAph (class 3) instruction, the TMIAph sees a resource availability
delay of two cycles.

Table 4-21. Resource Availability Delay for the Multiplier Pipeline

               Delay (Clocks) for a subsequent multiply pipe instruction of:
Instructions   Class 1   Class 2   Class 3   Class 4
WSAD†          2         2         1         1
WACC           1         1         1         1
WMUL           2         2         1         1
WMADD          2         2         1         1
WMAC           2         2         1         1
TMIA†          3         3         2         2
TMIAPH†        2         2         1         1
TMIAxy†        2         2         1         1

† WSAD, TMIA, TMIAxy, and TMIAph execute in both the main execution pipeline and the
multiplier pipeline. See Section 4.10.2.5 for more details.
4.10.2.3 Memory Control Pipeline

The memory control pipeline is responsible for coordinating the load/store activity with the main
core. The external interface to memory is 32 bits, so the 64-bit loads and stores issued by the
PXA27x processor are sequenced as two 32-bit loads/stores to memory. This is transparent to end
users and is already factored into the result latencies shown in Table 4-18. After the PXA27x
processor issues a 64-bit memory transaction, it must buffer the data until the two 32-bit half
transactions are complete. Currently, there are two 64-bit buffer slots for load operations and
one 64-bit buffer slot available for store transactions. If the memory buffer is currently empty, the
memory pipeline resource-availability delay is only one clock. However, if the buffer is currently
full due to a sequence of memory transactions, the following instruction must wait for space in the
buffer. The resource availability delay in this case is two cycles. This is summarized in Table 4-22.
Table 4-22. Resource Availability Delay for the Memory Pipeline
4.10.2.4 Coprocessor Interface Pipeline

The coprocessor interface pipeline also contains buffering to allow multiple outstanding
MRC/MRRC operations. The coprocessor interface pipeline can continue to accept MRC and
MRRC instructions every cycle until its buffers are full. Currently there is sufficient storage in the
buffer for either four MRC data values (32-bit) or two MRRC data values (64-bit). Table 4-23
shows a summary of the resource availability delay for the coprocessor interface.
Table 4-23. Resource Availability Delay for the Coprocessor Interface Pipeline

Instructions   Delay (Clocks)   Condition
TMRC           1                Buffer empty
TMRC           2                Buffer full
TMRRC          1                Buffer empty
TMRRC          2                Buffer full
There is also an interaction between TMRC/TMRRC and any instructions in the core that utilize
the MAC unit of the core. For optimum performance, the MAC unit in the core should not be used
adjacent to TMRC instructions as they both share the route back to the core register file.
4.10.2.5 Multiple Pipelines

The WSAD, TMIA, TMIAph, and TMIAxy instructions execute in both the main execution
pipeline and the multiplier pipeline. The instruction executes for one cycle in the execution
pipeline and the rest in the multiplier pipeline. The WSAD, TMIA, TMIAph, and TMIAxy
instructions always issue without stalls to the execution pipeline (see Section 4.10.2.1). The
availability of the multiplier pipeline depends on the previous instruction that was using the
multiply resource. If the previous instruction was a TMIA, there is an effective resource
availability delay of two cycles.
5. High Level Language Optimization

For embedded systems, the system’s performance is greatly affected by software programming
techniques. In order to attain performance at the application level, there are many techniques
which can be applied at the C/C++ code development phase. This chapter covers a set of
programming optimization techniques relevant to deeply embedded systems such as the Intel®
PXA27x Processor Family (PXA27x processor).
5.1.1 Efficient Usage of Preloading
The Intel XScale® Microarchitecture preload instruction is a true preload instruction because the
load destination is the data or mini-data cache and not a register. Compilers for processors which
have data caches, but do not support preload, sometimes use a load instruction to preload the data
cache. This technique has the disadvantages of using a register to load data and requiring additional
registers for subsequent preloads and thus increasing register pressure. By contrast, the Intel
XScale® Microarchitecture preload can be used to reduce register pressure instead of increasing it.
The Intel XScale® Microarchitecture preload is a hint instruction and does not guarantee that the
data is loaded. Whenever the load would cause a fault or a table walk, the processor ignores the
preload instruction, forgoes the fault or table walk, and continues processing the next instruction.
This is particularly advantageous in the case where a linked list or recursive data structure is
terminated by a NULL pointer: preloading the NULL pointer does not cause a fault.
The preload instructions (PLD) can be inserted by the compiler during compilation. However, the
programmer can also insert preload operations into the code explicitly. A function can be defined
during high-level language programming which results in a PLD instruction being inserted in-line.
This function can then be called at other suitable places in the code to insert PLD instructions.
5.1.1.1Preload Considerations
The issues associated with using preloading which require consideration are explained below.
5.1.1.1.1 Preload Distances in the Intel XScale® Microarchitecture
Scheduling the preload instruction requires understanding the system latency times and system
resources which determine when to use the preload instruction.
The optimum advantage of using preload is obtained if the preload issue-to-use distance is equal to
the memory latency. The memory latency shown in Section 3.2.1, “Optimal Setting for Memory
Latency and Bandwidth” should be used to determine the proper insertion point for preloads.
Depending on whether the target is in the internal memory or in the external memory, the preload
distance may need to be varied. Also, for external memory in which the target address is not
aligned to a cache line, the memory latency can increase due to the critical word first (CWF) mode
of the memory accesses. CWF mode returns the requested data starting with the requested word
instead of starting with the word at the aligned address. When using preloads, align the target
address to a cache-line boundary in order to avoid the extra memory bus usage.
Consider this code sample, in which the preload is issued well ahead of the data's use (the preload
target and register usage are illustrative):

pld [r3]         ; preload issued early
add r1, r1, #1
; Sequence of instructions using r2, but leave r3 unchanged.
ldr r4, [r3]     ; the preloaded data is available by now
For most cases, optimizing for the external memory latency also satisfies the requirements for the
internal memory latency.
5.1.1.1.2 Preload Loop Scheduling
When adding preload instructions to a loop which operates on arrays, preload ahead one, two, or
more iterations. The data for future iterations is located in memory a fixed offset from the data for
the current iteration. This makes it easy to predict where to fetch the data. The number of iterations
to preload ahead is referred to as the preload scheduling distance (PSD). For the Intel XScale®
Microarchitecture this can be calculated as: