INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY
RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining
applications. Intel may make changes to specifications and product descriptions at any time, without notice.
This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished
under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies
that may appear in this document or any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.
Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in
developer's software code when running on an Intel® processor. Intel reserves these features or instructions for future
definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized
use.
Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary
depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading for more
information including details on which processors support HT Technology.
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Pentium D, Itanium, MMX, and
VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other
countries.
Introduction
The IA-32 Intel® Architecture Optimization Reference Manual describes
how to optimize software to take advantage of the performance
characteristics of the current generation of the IA-32 Intel architecture
family of processors. The optimizations described in this manual apply
to IA-32 processors based on the Intel® NetBurst® microarchitecture,
the Intel® Pentium® M processor family, and IA-32 processors that
support Hyper-Threading Technology.
The target audience for this manual includes software programmers and
compiler writers. This manual assumes that the reader is familiar with the
basics of the IA-32 architecture and has access to the IA-32 Intel®
Architecture Software Developer’s Manual: Volume 1, Basic Architecture;
Volume 2A, Instruction Set Reference A-M; Volume 2B, Instruction Set
Reference N-Z; and Volume 3, System Programmer’s Guide.
When developing and optimizing software applications to achieve a
high level of performance when running on IA-32 processors, a detailed
understanding of the IA-32 family of processors is often required. In many
cases, knowledge of IA-32 microarchitectures is required.
This manual provides an overview of the Intel NetBurst
microarchitecture and the Intel Pentium M processor microarchitecture.
It contains design guidelines for high-performance software applications,
coding rules, and techniques for many aspects of code-tuning. These
rules are useful to programmers and compiler developers.
The design guidelines that are discussed in this manual for developing
high-performance software apply to current as well as to future IA-32
processors. The coding rules and code optimization techniques listed
target the Intel NetBurst microarchitecture and the Pentium M processor
microarchitecture.
Tuning Your Application
Tuning an application for high performance on any IA-32 processor
requires understanding and basic skills in:
•IA-32 architecture
•C and Assembly language
•the hot-spot regions in your application that have significant impact
on software performance
•the optimization capabilities of your compiler
•techniques to evaluate the application’s performance
The Intel® VTune™ Performance Analyzer can help you analyze and
locate hot-spot regions in your applications. On the Pentium 4, Intel®
Xeon® and Pentium M processors, this tool can monitor an application
through a selection of performance monitoring events and analyze the
performance event data that is gathered during code execution.
This manual also describes information that can be gathered using the
performance counters through Pentium 4 processor’s performance
monitoring events.
For VTune Performance Analyzer order information, see the web page:
http://developer.intel.com
About This Manual
In this document, the reference “Pentium 4 processor” refers to
processors based on the Intel NetBurst microarchitecture. Currently this
includes the Intel Pentium 4 processor and Intel Xeon processor. Where
appropriate, differences between Pentium 4 processor and Intel Xeon
processor are noted.
The manual consists of the following parts:
Introduction. Defines the purpose and outlines the contents of this
manual.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview.
Describes the features relevant to software optimization of the current
generation of IA-32 Intel architecture processors, including the
architectural extensions to the IA-32 architecture and an overview of the
Intel NetBurst microarchitecture, Pentium M processor
microarchitecture and Hyper-Threading Technology.
Chapter 2: General Optimization Guidelines. Describes general code
development and optimization techniques that apply to all applications
designed to take advantage of the common features of the Intel NetBurst
microarchitecture and Pentium M processor microarchitecture.
Chapter 3: Coding for SIMD Architectures. Describes techniques
and concepts for using the SIMD integer and SIMD floating-point
instructions provided by the MMX™ technology, Streaming SIMD
Extensions, Streaming SIMD Extensions 2, and Streaming SIMD
Extensions 3.
Chapter 4: Optimizing for SIMD Integer Applications. Provides
optimization suggestions and common building blocks for applications
that use the 64-bit and 128-bit SIMD integer instructions.
Chapter 5: Optimizing for SIMD Floating-point Applications.
Provides optimization suggestions and common building blocks for
applications that use the single-precision and double-precision SIMD
floating-point instructions.
Chapter 6: Optimizing Cache Usage. Describes how to use the
prefetch instruction and cache control management instructions to
optimize cache usage, and describes the deterministic cache parameters.
Chapter 7: Multiprocessor and Hyper-Threading Technology.
Describes guidelines and techniques for optimizing multithreaded
applications to achieve optimal performance scaling. Use these when
targeting multiprocessor (MP) systems or MP systems using IA-32
processors that support Hyper-Threading Technology.
Chapter 8: 64-Bit Mode Coding Guidelines. This chapter describes a
set of additional coding guidelines for application software written to
run in 64-bit mode.
Chapter 9: Power Optimization for Mobile Usages. This chapter
provides background on power saving techniques in mobile processors
and makes recommendations that developers can leverage to provide
longer battery life.
Appendix A: Application Performance Tools. Introduces tools for
analyzing and enhancing application performance without having to
write assembly code.
Appendix B: Intel Pentium 4 Processor Performance Metrics.
Provides information that can be gathered using Pentium 4 processor’s
performance monitoring events. These performance metrics can help
programmers determine how effectively an application is using the
features of the Intel NetBurst microarchitecture.
Appendix C: IA-32 Instruction Latency and Throughput. Provides
latency and throughput data for the IA-32 instructions. Instruction
timing data specific to the Pentium 4 and Pentium M processors are
provided.
Appendix D: Stack Alignment. Describes stack alignment conventions
and techniques to optimize performance of accessing stack-based data.
Appendix E: The Mathematics of Prefetch Scheduling Distance.
Discusses the optimum spacing to insert
prefetch instructions and
presents a mathematical model for determining the prefetch scheduling
distance (PSD) for your application.
Related Documentation
For more information on the Intel architecture, specific techniques, and
processor architecture terminology referenced in this manual, see the
following documents:
•Intel
•Intel
•VTune Performance Analyzer online help
•Intel
•Intel Processor Identification with the CPUID Instruction, doc.
Notational Conventions
This type style       Indicates an element of syntax, a reserved
                      word, a keyword, a filename, an instruction,
                      computer output, or part of a program
                      example. The text appears in lowercase
                      unless uppercase is significant.
THIS TYPE STYLE       Indicates a value, for example, TRUE, CONST1,
                      or a variable, for example, A, B, or register
                      names MMO through MM7.
                      l indicates the lowercase letter L in examples.
                      1 is the number 1 in examples. O is the
                      uppercase O in examples. 0 is the number 0 in
                      examples.
This type style       Indicates a placeholder for an identifier, an
                      expression, a string, a symbol, or a value.
                      Substitute one of these items for the
                      placeholder.
... (ellipses)        Indicate that a few lines of the code are
                      omitted.
This type style       Indicates a hypertext link.
IA-32 Intel® Architecture
Processor Family Overview
This chapter gives an overview of the features relevant to software
optimization for the current generations of IA-32 processors, including:
• Microarchitectures that enable executing instructions with high throughput at high clock rates
• Intel® Hyper-Threading Technology¹ (HT Technology)
• Intel® Pentium® M and IA-32 processors with multi-core technology
• Multi-core architecture supported in Intel® Core™ Duo, Intel® Pentium® D processors and Pentium® processor Extreme Edition²
Intel Pentium 4 processors, Intel Xeon processors, Pentium D
processors, and Pentium processor Extreme Editions are based on the Intel
NetBurst® microarchitecture. The Intel Pentium M processor
microarchitecture balances performance and low power consumption.
1. Hyper-Threading Technology requires a computer system with an Intel processor
supporting HT Technology and an HT Technology enabled chipset, BIOS and operating
system. Performance varies depending on the hardware and software used.
2. Dual-core platform requires an Intel Core Duo, Pentium D processor or Pentium processor
Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance
varies depending on the hardware and software used.
Intel Core Solo and Intel Core Duo processors incorporate
microarchitectural enhancements for performance and power efficiency
that are in addition to those introduced in the Pentium M processor.
SIMD Technology
SIMD computations (see Figure 1-1) were introduced in the IA-32
architecture with MMX technology. MMX technology allows SIMD
computations to be performed on packed byte, word, and doubleword
integers. The integers are contained in a set of eight 64-bit registers
called MMX registers (see Figure 1-2).
The Pentium III processor extended the SIMD computation model with
the introduction of the Streaming SIMD Extensions (SSE). SSE allows
SIMD computations to be performed on operands that contain four
packed single-precision floating-point data elements. The operands can
be in memory or in a set of eight 128-bit XMM registers (see Figure
1-2). SSE also extended SIMD computational capability by adding
additional 64-bit MMX instructions.
Figure 1-1 shows a typical SIMD computation. Two sets of four packed
data elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are
operated on in parallel, with the same operation being performed on
each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3
and Y3, and X4 and Y4). The results of the four parallel computations
are stored as a set of four packed data elements.
Figure 1-1 Typical SIMD Operations (packed elements X4, X3, X2, X1 and Y4, Y3, Y2, Y1 are combined element-wise by an operation OP, producing X4 op Y4, X3 op Y3, X2 op Y2, and X1 op Y1)
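As a concrete illustration of this packed-operation model, the following
C sketch (assuming a compiler that provides the SSE intrinsics in
xmmintrin.h; it is an illustration, not code from this manual) performs
four single-precision additions with one SIMD operation:

    #include <xmmintrin.h>   /* SSE intrinsics, assumed available in the toolchain */

    /* Add four pairs of packed single-precision elements at once,
       mirroring the X1..X4 op Y1..Y4 model of Figure 1-1. */
    void add_four(const float *x, const float *y, float *result)
    {
        __m128 vx = _mm_loadu_ps(x);      /* load X4..X1 */
        __m128 vy = _mm_loadu_ps(y);      /* load Y4..Y1 */
        __m128 vr = _mm_add_ps(vx, vy);   /* one instruction adds all four pairs */
        _mm_storeu_ps(result, vr);        /* store the four packed results */
    }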
The Pentium 4 processor further extended the SIMD computation model
with the introduction of Streaming SIMD Extensions 2 (SSE2) and
Streaming SIMD Extensions 3 (SSE3).
SSE2 works with operands in either memory or in the XMM registers.
The technology extends SIMD computations to process packed
double-precision floating-point data elements and 128-bit packed
integers. There are 144 instructions in SSE2 that operate on two packed
double-precision floating-point data elements or on 16 packed byte, 8
packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that
can accelerate application performance in specific areas. These include
video processing, complex arithmetic, and thread synchronization.
SSE3 complements SSE and SSE2 with instructions that process SIMD
data asymmetrically, facilitate horizontal computation, and help avoid
loading cache line splits.
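To make the horizontal-computation point concrete, the following sketch
(assuming the SSE3 intrinsics in pmmintrin.h; an illustration rather
than code from this manual) sums the four single-precision elements of
one XMM register with horizontal adds:

    #include <pmmintrin.h>   /* SSE3 intrinsics, assumed available */

    /* Sum the four packed single-precision elements of v using horizontal adds. */
    float horizontal_sum(__m128 v)
    {
        __m128 t = _mm_hadd_ps(v, v);   /* {v0+v1, v2+v3, v0+v1, v2+v3} */
        t = _mm_hadd_ps(t, t);          /* every element now holds v0+v1+v2+v3 */
        return _mm_cvtss_f32(t);        /* extract the low element as a scalar */
    }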
Figure 1-2 SIMD Instruction Register Usage (the eight 64-bit MMX registers MM0 through MM7 and the eight 128-bit XMM registers XMM0 through XMM7)
SIMD improves the performance of 3D graphics, speech recognition,
image processing, scientific applications and applications that have the
following characteristics:
• inherently parallel
• recurring memory access patterns
• localized recurring operations performed on the data
• data-independent control flow
SIMD floating-point instructions fully support the IEEE Standard 754
for Binary Floating-Point Arithmetic. They are accessible from all
IA-32 execution modes: protected mode, real address mode, and Virtual
8086 mode.
SSE, SSE2, and MMX technologies are architectural extensions in the
IA-32 Intel architecture. Existing software will continue to run
correctly, without modification on IA-32 microprocessors that
incorporate these technologies. Existing software will also run correctly
in the presence of applications that incorporate SIMD technologies.
SSE and SSE2 instructions also introduced cacheability and memory
ordering instructions that can improve cache usage and application
performance.
For more on SSE, SSE2, SSE3 and MMX technologies, see the IA-32
Intel® Architecture Software Developer’s Manual.
Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel EM64T is an extension of the IA-32 Intel architecture. Intel
EM64T increases the linear address space for software to 64 bits and
supports physical address space up to 40 bits. The technology also
introduces a new operating mode referred to as IA-32e mode.
IA-32e mode consists of two sub-modes: (1) compatibility mode
enables a 64-bit operating system to run most legacy 32-bit software
unmodified, (2) 64-bit mode enables a 64-bit operating system to run
applications written to access 64-bit linear address space.
In the 64-bit mode of Intel EM64T, software may access:
•64-bit flat linear addressing
•8 additional general-purpose registers (GPRs)
•8 additional registers for streaming SIMD extensions (SSE, SSE2
and SSE3)
•64-bit-wide GPRs and instruction pointers
•uniform byte-register addressing
•fast interrupt-prioritization mechanism
•a new instruction-pointer relative-addressing mode
For optimizing 64-bit applications, the features that impact software
optimizations include:
•using a set of prefixes to access new registers or 64-bit register
operand
•pointer size increases from 32 bits to 64 bits
•instruction-specific usages
Intel NetBurst® Microarchitecture
The Pentium 4 processor, Pentium 4 processor Extreme Edition
supporting Hyper-Threading Technology, Pentium D processor,
Pentium processor Extreme Edition and the Intel Xeon processor
implement the Intel NetBurst microarchitecture.
This section describes the features of the Intel NetBurst
microarchitecture and its operation common to the above processors. It
provides the technical background required to understand optimization
recommendations and the coding rules discussed in the rest of this
manual. For implementation details, including instruction latencies, see
Appendix C, “IA-32 Instruction Latency and Throughput.”
Intel NetBurst microarchitecture is designed to achieve high
performance for integer and floating-point computations at high clock
rates. It supports the following features:
•hyper-pipelined technology that enables high clock rates
•a high-performance, quad-pumped bus interface to the Intel
NetBurst microarchitecture system bus
•a rapid execution engine to reduce the latency of basic integer
instructions
•out-of-order speculative execution to enable parallelism
•superscalar issue to enable parallelism
•hardware register renaming to avoid register name space limitations
•cache line sizes of 64 bytes
•hardware prefetch
Design Goals of Intel NetBurst Microarchitecture
The design goals of Intel NetBurst microarchitecture are:
•to execute legacy IA-32 applications and applications based on
single-instruction, multiple-data (SIMD) technology at high
throughput
•to operate at high clock rates and to scale to higher performance and
clock rates in the future
Design advances of the Intel NetBurst microarchitecture include:
•a deeply pipelined design that allows for high clock rates (with
different parts of the chip running at different clock rates).
•a pipeline that optimizes for the common case of frequently
executed instructions; the most frequently-executed instructions in
common circumstances (such as a cache hit) are decoded efficiently
and executed with short latencies
• employment of techniques to hide stall penalties; among these are
parallel execution, buffering, and speculation. The
microarchitecture executes instructions dynamically and
out-of-order, so the time it takes to execute each individual
instruction is not always deterministic
Chapter 2, “General Optimization Guidelines,” lists optimizations to use
and situations to avoid. The chapter also gives a sense of relative
priority. Because most optimizations are implementation dependent, the
chapter does not quantify expected benefits and penalties.
The following sections provide more information about key features of
the Intel NetBurst microarchitecture.
Overview of the Intel NetBurst Microarchitecture Pipeline
The pipeline of the Intel NetBurst microarchitecture contains:
•an in-order issue front end
•an out-of-order superscalar execution core
•an in-order retirement unit
The front end supplies instructions in program order to the out-of-order
core. It fetches and decodes IA-32 instructions. The decoded IA-32
instructions are translated into micro-operations (µops). The front end’s
primary job is to feed a continuous stream of µops to the execution core
in original program order.
The out-of-order core aggressively reorders µops so that µops whose
inputs are ready (and have execution resources available) can execute as
soon as possible. The core can issue multiple µops per cycle.
The retirement section ensures that the results of execution are
processed according to original program order and that the proper
architectural states are updated.
Figure 1-3 illustrates a diagram of the major functional blocks
associated with the Intel NetBurst microarchitecture pipeline. The
following subsections provide an overview for each.
Figure 1-3 The Intel NetBurst Microarchitecture (block diagram: System Bus; Bus Unit; 3rd Level Cache, optional; 2nd Level Cache; 1st Level Cache; Front End with Fetch/Decode, Trace Cache, Microcode ROM, and BTBs/Branch Prediction; Out-Of-Order Execution Core; Retirement with Branch History Update. Frequently used and less frequently used paths are distinguished.)
The Front End
The front end of the Intel NetBurst microarchitecture consists of two
parts:
•fetch/decode unit
•execution trace cache
It performs the following functions:
•prefetches IA-32 instructions that are likely to be executed
•fetches required instructions that have not been prefetched
•decodes instructions into µops
•generates microcode for complex instructions and special-purpose
code
•delivers decoded instructions from the execution trace cache
•predicts branches using advanced algorithms
The front end is designed to address two problems that are sources of
delay:
•the time required to decode instructions fetched from the target
•wasted decode bandwidth due to branches or a branch target in the
middle of a cache line
Instructions are fetched and decoded by a translation engine. The
translation engine then builds decoded instructions into µop sequences
called traces. Next, traces are then stored in the execution trace cache.
The execution trace cache stores µops in the path of program execution
flow, where the results of branches in the code are integrated into the
same cache line. This increases the instruction flow from the cache and
makes better use of the overall cache storage space since the cache no
longer stores instructions that are branched over and never executed.
The trace cache can deliver up to 3 µops per clock to the core.
The execution trace cache and the translation engine have cooperating
branch prediction hardware. Branch targets are predicted based on their
linear address using branch prediction logic and fetched as soon as
possible. Branch targets are fetched from the execution trace cache if
they are cached, otherwise they are fetched from the memory hierarchy.
The translation engine’s branch prediction information is used to form
traces along the most likely paths.
The Out-of-order Core
The core’s ability to execute instructions out of order is a key factor in
enabling parallelism. This feature enables the processor to reorder
instructions so that if one µop is delayed while waiting for data or a
contended resource, other µops that appear later in the program order
may proceed. This implies that when one portion of the pipeline
experiences a delay, the delay may be covered by other operations
executing in parallel or by the execution of µops queued up in a buffer.
The core is designed to facilitate parallel execution. It can dispatch up to
six µops per cycle through the issue ports (see Figure 1-4). Note
that six µops per cycle exceeds the trace cache and retirement µop
bandwidth. The higher bandwidth in the core allows for peak bursts of
greater than three µops and to achieve higher issue rates by allowing
greater flexibility in issuing µops to different execution ports.
Most core execution units can start executing a new µop every cycle, so
several instructions can be in flight at one time in each pipeline. A
number of arithmetic logical unit (ALU) instructions can start at two per
cycle; many floating-point instructions start one every two cycles.
Finally, µops can begin execution out of program order, as soon as their
data inputs are ready and resources are available.
Retirement
The retirement section receives the results of the executed µops from the
execution core and processes the results so that the architectural state is
updated according to the original program order. For semantically
correct execution, the results of IA-32 instructions must be committed
in original program order before they are retired. Exceptions may be
raised as instructions are retired. For this reason, exceptions cannot
occur speculatively.
When a µop completes and writes its result to the destination, it is
retired. Up to three µops may be retired per cycle. The reorder buffer
(ROB) is the unit in the processor which buffers completed µops,
updates the architectural state and manages the ordering of exceptions.
The retirement section also keeps track of branches and sends updated
branch target information to the branch target buffer (BTB). This
updates branch history. Figure 1-3 illustrates the paths that are most
frequently executed inside the Intel NetBurst microarchitecture: an
execution loop that interacts with multilevel cache hierarchy and the
system bus.
The following sections describe in more detail the operation of the front
end and the execution core. This information provides the background
for using the optimization techniques and instruction latency data
documented in this manual.
Front End Pipeline Detail
The following information about the front end operation will be useful for
tuning software with respect to prefetching, branch prediction, and
execution trace cache operations.
Prefetching
The Intel NetBurst microarchitecture supports three prefetching
mechanisms:
•a hardware instruction fetcher that automatically prefetches
instructions
•a hardware mechanism that automatically fetches data and
instructions into the unified second-level cache
• a mechanism that fetches data only and includes two distinct
components: (1) a hardware mechanism that fetches the adjacent cache
line within a 128-byte sector containing the data needed due to a
cache line miss (this is also referred to as adjacent cache line
prefetch), and (2) a software controlled mechanism that fetches data into
the caches using the prefetch instructions.
The hardware instruction fetcher reads instructions along the path
predicted by the branch target buffer (BTB) into instruction streaming
buffers. Data is read in 32-byte chunks starting at the target address. The
second and third mechanisms are described later.
Decoder
The front end of the Intel NetBurst microarchitecture has a single
decoder that decodes instructions at the maximum rate of one
instruction per clock. Some complex instructions must enlist the help of
the microcode ROM. The decoder operation is connected to the
execution trace cache.
Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the
Intel NetBurst microarchitecture. The TC stores decoded IA-32
instructions (µops).
In the Pentium 4 processor implementation, TC can hold up to 12K
µops and can deliver up to three µops per cycle. TC does not hold all of
the µops that need to be executed in the execution core. In some
situations, the execution core may need to execute a microcode flow
instead of the µop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently-executed
IA-32 instructions come from the trace cache while only a few
instructions involve the microcode ROM.
Branch Prediction
Branch prediction is important to the performance of a deeply pipelined
processor. It enables the processor to begin executing instructions long
before the branch outcome is certain. Branch delay is the penalty that is
incurred in the absence of correct prediction. For Pentium 4 and Intel
Xeon processors, the branch delay for a correctly predicted instruction
can be as few as zero clock cycles. The branch delay for a mispredicted
branch can be many cycles, usually equivalent to the pipeline depth.
Branch prediction in the Intel NetBurst microarchitecture predicts all
near branches (conditional calls, unconditional calls, returns and
indirect branches). It does not predict far transfers (far calls, irets and
software interrupts).
Mechanisms have been implemented to aid in predicting branches
accurately and to reduce the cost of taken branches. These include:
•the ability to dynamically predict the direction and target of
branches based on an instruction’s linear address, using the branch
target buffer (BTB)
•if no dynamic prediction is available or if it is invalid, the ability to
statically predict the outcome based on the offset of the target: a
backward branch is predicted to be taken, a forward branch is
predicted to be not taken
•the ability to predict return addresses using the 16-entry return
address stack
•the ability to build a trace of instructions across predicted taken
branches to avoid branch penalties.
The Static Predictor. Once a branch instruction is decoded, the
direction of the branch (forward or backward) is known. If there was no
valid entry in the BTB for the branch, the static predictor makes a
prediction based on the direction of the branch. The static prediction
mechanism predicts backward conditional branches (those with
negative displacement, such as loop-closing branches) as taken.
Forward branches are predicted not taken.
To take advantage of the forward-not-taken and backward-taken static
predictions, code should be arranged so that the likely target of the
branch immediately follows forward branches (see also: “Branch
Prediction” in Chapter 2).
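The following C sketch (illustrative only; actual layout depends on the
compiler and source arrangement) shows the intended arrangement: the
common case falls through the forward branch, the rare error path is
placed out of line, and the loop-closing backward branch is taken:

    /* The forward branch around the error handler is usually not taken,
       matching the forward-not-taken static prediction; the backward
       loop-closing branch is usually taken, matching the backward-taken
       static prediction. */
    int sum_positive(const int *data, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (data[i] < 0)      /* rare case: forward branch, predicted not taken */
                goto error;
            sum += data[i];       /* common case falls through */
        }
        return sum;

    error:
        return -1;                /* infrequent path placed after the main flow */
    }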
Branch Target Buffer. Once branch history is available, the Pentium 4
processor can predict the branch outcome even before the branch
instruction is decoded. The processor uses a branch history table and a
branch target buffer (collectively called the BTB) to predict the
direction and target of branches based on an instruction’s linear address.
Once the branch is retired, the BTB is updated with the target address.
Return Stack. Returns are always taken, but since a procedure may be
invoked from several call sites, a single predicted target does not suffice.
The Pentium 4 processor has a Return Stack that can predict return
addresses for a series of procedure calls. This increases the benefit of
unrolling loops containing function calls. It also mitigates the need to
put certain procedures inline since the return penalty portion of the
procedure call overhead is reduced.
Even if the direction and target address of the branch are correctly
predicted, a taken branch may reduce available parallelism in a typical
processor (since the decode bandwidth is wasted for instructions which
immediately follow the branch and precede the target, if the branch does
not end the line and target does not begin the line). The branch predictor
allows a branch and its target to coexist in a single trace cache line,
maximizing instruction delivery from the front end.
Execution Core Detail
The execution core is designed to optimize overall performance by
handling common cases most efficiently. The hardware is designed to
execute frequent operations in a common context as fast as possible, at
the expense of infrequent operations using rare contexts.
Some parts of the core may speculate that a common condition holds to
allow faster execution. If it does not, the machine may stall. An example
of this pertains to store-to-load forwarding (see “Store Forwarding” in
this chapter). If a load is predicted to be dependent on a store, it gets its
data from that store and tentatively proceeds. If the load turned out not
to depend on the store, the load is delayed until the real data has been
loaded from memory, then it proceeds.
Instruction Latency and Throughput
The superscalar out-of-order core contains hardware resources that can
execute multiple μops in parallel. The core’s ability to make use of
available parallelism of execution units can be enhanced by software’s
ability to:
•select IA-32 instructions that can be decoded in less than 4 μops
and/or have short latencies
•order IA-32 instructions to preserve available parallelism by
minimizing long dependence chains and covering long instruction
latencies
•order instructions so that their operands are ready and their
corresponding issue ports and execution units are free when they
reach the scheduler
This subsection describes port restrictions, result latencies, and issue
latencies (also referred to as throughput). These concepts provide a basis
for ordering instructions to increase parallelism. The order in which
μops are presented to the core of the processor is further
affected by the machine’s scheduling resources.
It is the execution core that reacts to an ever-changing machine state,
reordering μops for faster execution or delaying them because of
dependence and resource constraints. The ordering of instructions in
software is more of a suggestion to the hardware.
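One common way for software to preserve parallelism is to break a long
dependence chain into shorter, independent ones. The following C sketch
(an illustration, assuming the compiler keeps both accumulators in
registers) replaces a single serial accumulator with two independent
ones so that adds from the two chains can execute in parallel:

    /* With one accumulator, every add waits for the previous add.
       Two independent accumulators halve the length of the dependence
       chain visible to the out-of-order core. */
    float sum_two_chains(const float *a, int n)
    {
        float s0 = 0.0f, s1 = 0.0f;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];        /* chain 0 */
            s1 += a[i + 1];    /* chain 1, independent of chain 0 */
        }
        if (i < n)
            s0 += a[i];        /* odd trailing element */
        return s0 + s1;
    }

Note that reassociating floating-point additions in this way can change
rounding slightly.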
Appendix C, “IA-32 Instruction Latency and Throughput,” lists some of
the more-commonly-used IA-32 instructions with their latency, their
issue throughput, and associated execution units (where relevant). Some
execution units are not pipelined (meaning that µops cannot be
dispatched in consecutive cycles and the throughput is less than one per
cycle). The number of µops associated with each instruction provides a
basis for selecting instructions to generate. All µops executed out of the
microcode ROM involve extra overhead.
Execution Units and Issue Ports
At each cycle, the core may dispatch µops to one or more of four issue
ports. At the microarchitecture level, store operations are further divided
into two parts: store data and store address operations. The four ports
through which μops are dispatched to execution units and to load and
store operations are shown in Figure 1-4. Some ports can dispatch two
µops per clock. Those execution units are marked Double Speed.
Port 0. In the first half of the cycle, port 0 can dispatch either one
floating-point move µop (a floating-point stack move, floating-point
exchange or floating-point store data), or one arithmetic logical unit
(ALU) µop (arithmetic, logic, branch or store data). In the second half
of the cycle, it can dispatch one similar ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one
floating-point execution (all floating-point operations except moves, all
SIMD operations) µop or one normal-speed integer (multiply, shift and
rotate) µop or one ALU (arithmetic) µop. In the second half of the cycle,
it can dispatch one similar ALU µop.
Port 2. This port supports the dispatch of one load operation per cycle.
Port 3. This port supports the dispatch of one store address operation
per cycle.
The total issue bandwidth can range from zero to six µops per cycle.
Each pipeline contains several execution units. The µops are dispatched
to the pipeline that corresponds to the correct type of operation. For
example, an integer arithmetic logic unit and the floating-point
execution units (adder, multiplier, and divider) can share a pipeline.
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core. Port 0 dispatches to ALU 0 (double speed: ADD/SUB, logic, store data, branches) and to FP Move (FP move, FP store data, FXCH). Port 1 dispatches to ALU 1 (double speed: ADD/SUB), to the normal-speed integer unit (shift/rotate), and to FP Execute (FP_ADD, FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, MMX_MISC). Port 2 dispatches memory loads (all loads, prefetch). Port 3 dispatches memory stores (store address).
Note: FP_ADD refers to x87 FP and SIMD FP add and subtract operations; FP_MUL refers to x87 FP and SIMD FP multiply operations; FP_DIV refers to x87 FP and SIMD FP divide and square root operations; MMX_ALU refers to SIMD integer arithmetic and logic operations; MMX_SHFT handles shift, rotate, shuffle, pack and unpack operations; MMX_MISC handles SIMD reciprocal and some integer operations.
Caches
The Intel NetBurst microarchitecture supports up to three levels of
on-chip cache. At least two levels of on-chip cache are implemented in
processors based on the Intel NetBurst microarchitecture. The Intel
Xeon processor MP and selected Pentium 4 and Intel Xeon processors
may also contain a third-level cache.
The first level cache (nearest to the execution core) contains separate
caches for instructions and data. These include the first-level data cache
and the trace cache (an advanced first-level instruction cache). All other
caches are shared between instructions and data.
Levels in the cache hierarchy are not inclusive. The fact that a line is in
level i does not imply that it is also in level i+1. All caches use a
pseudo-LRU (least recently used) replacement algorithm.
Table 1-1 provides parameters for all cache levels for Pentium 4 and Intel
Xeon processors with a CPUID model encoding equal to 0, 1, 2 or 3.
Table 1-1  Pentium 4 and Intel Xeon Processor Cache Parameters

Level (Model)           Capacity                  Associativity  Line Size  Access Latency,          Write Update
                                                  (ways)         (bytes)    Integer/floating-point   Policy
                                                                            (clocks)
First (Model 0, 1, 2)   8 KB                      4              64         2/9                      write through
First (Model 3)         16 KB                     8              64         4/12                     write through
TC (All models)         12K µops                  8              N/A        N/A                      N/A
Second (Model 0, 1, 2)  256 KB or 512 KB (2)      8              64 (1)     7/7                      write back
Second (Model 3, 4)     1 MB                      8              64 (1)     18/18                    write back
Second (Model 3, 4, 6)  2 MB                      8              64 (1)     20/20                    write back
Third (Model 0, 1, 2)   0, 512 KB, 1 MB or 2 MB   8              64 (1)     14/14                    write back

(1) Each read due to a cache miss fetches a sector, consisting of two adjacent cache lines; a write operation is 64 bytes.
(2) Pentium 4 and Intel Xeon processors with CPUID model encoding value of 2 have a second level cache of 512 KB.
On processors without a third level cache, the second-level cache miss
initiates a transaction across the system bus interface to the memory
sub-system. On processors with a third level cache, the third-level cache
miss initiates a transaction across the system bus. A bus write
transaction writes 64 bytes to cacheable memory, or separate 8-byte
chunks if the destination is not cacheable. A bus read transaction from
cacheable memory fetches two cache lines of data.
The system bus interface supports using a scalable bus clock and
achieves an effective speed that quadruples the speed of the scalable bus
clock. It takes on the order of 12 processor cycles to get to the bus and
back within the processor, and 6-12 bus cycles to access memory if
there is no bus congestion. Each bus cycle equals several processor
cycles. The ratio of processor clock speed to the scalable bus clock
speed is referred to as bus ratio. For example, one bus cycle for a
100 MHz bus is equal to 15 processor cycles on a 1.50 GHz processor.
Since the speed of the bus is implementation-dependent, consult the
specifications of a given system for further details.
Data Prefetch
The Pentium 4 processor and other IA-32 processors based on the
NetBurst microarchitecture have two types of mechanisms for
prefetching data: software prefetch instructions and hardware-based
prefetch mechanisms.
Software controlled prefetch is enabled using the four prefetch
instructions (PREFETCHh) introduced with SSE. The software prefetch
is not intended for prefetching code. Using it can incur significant
penalties on a multiprocessor system if code is shared.
Software prefetch can provide benefits in selected situations. These
situations include:
•when the pattern of memory access operations in software allows
the programmer to hide memory latency
• when a reasonable choice can be made about how many cache lines
to fetch ahead of the line being executed
• when a choice can be made about the type of prefetch to use
SSE prefetch instructions have different behaviors, depending on cache
levels updated and the processor implementation. For instance, a
processor may implement the non-temporal prefetch by returning data
to the cache level closest to the processor core. This approach has the
following effect:
•minimizes disturbance of temporal data in other cache levels
•avoids the need to access off-chip caches, which can increase the
realized bandwidth compared to a normal load-miss, which returns
data to all cache levels
Situations that are less likely to benefit from software prefetch are:
•for cases that are already bandwidth bound, prefetching tends to
increase bandwidth demands
•prefetching far ahead can cause eviction of cached data from the
caches prior to the data being used in execution
•not prefetching far enough can reduce the ability to overlap memory
and execution latencies
Software prefetches are treated by the processor as hints to initiate
requests to fetch data from the memory system; they consume resources
in the processor, and the use of too many prefetches can limit their
effectiveness. Examples of this include prefetching data in a loop for a
reference outside the loop and prefetching in a basic block that is
frequently executed, but which seldom precedes the reference for which
the prefetch is targeted.
See also: Chapter 6, “Optimizing Cache Usage.”
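A minimal sketch of the software-controlled mechanism follows (assuming
the xmmintrin.h intrinsics; the prefetch distance shown is illustrative,
not a recommendation, and the appropriate value is application- and
processor-dependent, as discussed in Chapter 6 and Appendix E):

    #include <xmmintrin.h>   /* _mm_prefetch and the PREFETCHh hint values, assumed available */

    #define PREFETCH_AHEAD 128   /* illustrative distance, in elements */

    /* Sum a large array while prefetching data well ahead of its use,
       issuing roughly one prefetch per 64-byte cache line (16 floats). */
    float sum_with_prefetch(const float *a, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if ((i % 16) == 0 && i + PREFETCH_AHEAD < n)
                _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_T0);
            sum += a[i];
        }
        return sum;
    }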
Automatic hardware prefetch is a feature in the Pentium 4 processor.
It brings cache lines into the unified second-level cache based on prior
reference patterns. See also: Chapter 6, “Optimizing Cache Usage.”
Pros and Cons of Software and Hardware Prefetching. Software
prefetching has the following characteristics:
•handles irregular access patterns, which would not trigger the
hardware prefetcher
•handles prefetching of short arrays and avoids hardware prefetching
start-up delay before initiating the fetches
• must be added to new code, so it does not benefit existing
applications
Hardware prefetching for Pentium 4 processor has the following
characteristics:
•works with existing applications
•does not require extensive study of prefetch instructions
•requires regular access patterns
•avoids instruction and issue port bandwidth overhead
•has a start-up penalty before the hardware prefetcher triggers and
begins initiating fetches
The hardware prefetcher can handle multiple streams in either the
forward or backward directions. The start-up delay and fetch-ahead have
a larger effect for short arrays when hardware prefetching generates a
request for data beyond the end of an array (data that is not actually
utilized). The hardware penalty diminishes if it is amortized over longer arrays.
Hardware prefetching is triggered after two successive cache misses in
the last level cache and requires these cache misses to satisfy a condition
that the linear address distance between these cache misses is within a
threshold value. The threshold value depends on the processor
implementation of the microarchitecture (see Table 1-2). However,
hardware prefetching will not cross 4KB page boundaries. As a result,
hardware prefetching can be very effective when dealing with cache
miss patterns that have small strides that are significantly less than half
the threshold distance to trigger hardware prefetching. On the other
hand, hardware prefetching will not benefit cache miss patterns that
have frequent DTLB misses or have access strides that cause successive
cache misses that are spatially apart by more than the trigger threshold
distance.
Software can proactively control data access patterns to favor smaller
access strides (e.g., a stride that is less than half of the trigger threshold
distance) over larger access strides (a stride that is greater than the trigger
threshold distance). This can achieve the additional benefits of improved
temporal locality and significantly reduced cache misses in the last level
cache.
Thus, software optimization of a data access pattern should emphasize
tuning for hardware prefetch first, to favor greater proportions of
smaller-stride data accesses in the workload, before attempting to
provide hints to the processor by employing software prefetch
instructions.
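As an illustration of favoring small strides (a sketch only; the benefit
depends on the matrix size and the rest of the workload), traversing a
row-major C matrix row by row produces consecutive 4-byte accesses that
the hardware prefetcher handles well, whereas the column-first loop
strides by an entire row on every access:

    #define ROWS 1024
    #define COLS 1024

    /* Column-first traversal: each access strides by COLS * sizeof(float)
       bytes, which can exceed the hardware prefetch trigger threshold. */
    float sum_column_first(const float m[ROWS][COLS])
    {
        float sum = 0.0f;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += m[i][j];
        return sum;
    }

    /* Row-first traversal: consecutive addresses, small strides, a pattern
       the hardware prefetcher can follow. */
    float sum_row_first(const float m[ROWS][COLS])
    {
        float sum = 0.0f;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += m[i][j];
        return sum;
    }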
Loads and Stores
The Pentium 4 processor employs the following techniques to speed up
the execution of memory operations:
•speculative execution of loads
•reordering of loads with respect to loads and stores
•multiple outstanding misses
•buffering of writes
•forwarding of data from stores to dependent loads
Performance may be enhanced by not exceeding the memory issue
bandwidth and buffer resources provided by the processor. Up to one
load and one store may be issued for each cycle from a memory port
reservation station. In order to be dispatched to a reservation station,
there must be a buffer entry available for each memory operation. There
are 48 load buffers and 24 store buffers³. These buffers hold the µop and
address information until the operation is completed, retired, and
deallocated.
The Pentium 4 processor is designed to enable the execution of memory
operations out of order with respect to other instructions and with
respect to each other. Loads can be carried out speculatively, that is,
before all preceding branches are resolved. However, speculative loads
cannot cause page faults.
3. Pentium 4 processors with CPUID model encoding equal to 3 have more than 24 store
buffers.
Reordering loads with respect to each other can prevent a load miss
from stalling later loads. Reordering loads with respect to other loads
and stores to different addresses can enable more parallelism, allowing
the machine to execute operations as soon as their inputs are ready.
Writes to memory are always carried out in program order to maintain
program correctness.
A cache miss for a load does not prevent other loads from issuing and
completing. The Pentium 4 processor supports up to four (or eight for
Pentium 4 processor with CPUID signature corresponding to family 15,
model 3) outstanding load misses that can be serviced either by on-chip
caches or by memory.
Store buffers improve performance by allowing the processor to
continue executing instructions without having to wait until a write to
memory and/or cache is complete. Writes are generally not on the
critical path for dependence chains, so it is often beneficial to delay
writes for more efficient use of memory-access bus cycles.
Store Forwarding
Loads can be moved before stores that occurred earlier in the program if
they are not predicted to load from the same linear address. If they do
read from the same linear address, they have to wait for the store data to
become available. However, with store forwarding, they do not have to
wait for the store to write to the memory hierarchy and retire. The data
from the store can be forwarded directly to the load, as long as the
following conditions are met:
•Sequence: the data to be forwarded to the load has been generated
by a programmatically-earlier store which has already executed
• Size: the bytes loaded must be a subset of (or the same as) the bytes
stored
• Alignment: the store cannot wrap around a cache line boundary, and
the linear address of the load must be the same as that of the store
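The size and alignment conditions can be illustrated with a short C
sketch (hypothetical variables; whether forwarding actually occurs also
depends on the code the compiler generates). A load that reads a subset
of the bytes written by one earlier store can be forwarded; a wide load
that needs bytes from two separate narrower stores cannot, and must wait
for both stores:

    #include <stdint.h>
    #include <string.h>

    uint64_t store_forwarding_sketch(void)
    {
        uint64_t wide;
        uint32_t halves[2];
        uint32_t low;

        /* Forwardable: the 4-byte load reads a subset of the bytes written
           by the single preceding 8-byte store, starting at the same address. */
        wide = 0x1122334455667788ULL;          /* one 64-bit store */
        memcpy(&low, &wide, sizeof(low));      /* 32-bit load of the low bytes */

        /* Not forwardable: the 8-byte load needs bytes from two separate
           32-bit stores, so it must wait for both to complete. */
        halves[0] = 0x55667788u;               /* first 32-bit store */
        halves[1] = 0x11223344u;               /* second 32-bit store */
        memcpy(&wide, halves, sizeof(wide));   /* 64-bit load spanning both stores */

        return wide + low;
    }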
Intel® Pentium® M Processor Microarchitecture
Like the Intel NetBurst microarchitecture, the pipeline of the Intel
Pentium M processor microarchitecture contains three sections:
•in-order issue front end
•out-of-order superscalar execution core
•in-order retirement unit
Intel Pentium M processor microarchitecture supports a high-speed
system bus (up to 533 MHz) with 64-byte line size. Most coding
recommendations that apply to the Intel NetBurst microarchitecture also
apply to the Intel Pentium M processor.
The Intel Pentium M processor microarchitecture is designed for lower
power consumption. There are other specific areas of the Pentium M
processor microarchitecture that differ from the Intel NetBurst
microarchitecture. They are described next. A block diagram of the Intel
Pentium M processor is shown in Figure 1-5.
Figure 1-5 The Intel Pentium M Processor Microarchitecture (block diagram: System Bus; Bus Unit; 2nd Level Cache; 1st Level Instruction Cache; 1st Level Data Cache; Front End with Fetch/Decode and BTBs/Branch Prediction; Out-Of-Order Execution Core; Retirement with Branch History Update. Frequently used and less frequently used paths are distinguished.)
The Front End
The Intel Pentium M processor uses a pipeline depth that enables high
performance and low power consumption. It’s shorter than that of the
Intel NetBurst microarchitecture.
The Intel Pentium M processor front end consists of two parts:
•fetch/decode unit
•instruction cache
The fetch and decode unit includes a hardware instruction prefetcher
and three decoders that enable parallelism. It also provides a 32KB
instruction cache that stores un-decoded binary instructions.
The instruction prefetcher fetches instructions in a linear fashion from
memory if the target instructions are not already in the instruction cache.
The prefetcher is designed to fetch efficiently from an aligned 16-byte
block. If the modulo 16 remainder of a branch target address is 14, only
two useful instruction bytes are fetched in the first cycle. The rest of the
instruction bytes are fetched in subsequent cycles.
The three decoders decode IA-32 instructions and break them down into
micro-ops (µops). In each clock cycle, the first decoder is capable of
decoding an instruction with four or fewer µops. The remaining two
decoders each decode a one µop instruction in each clock cycle.
The front end can issue multiple µops per cycle, in original program
order, to the out-of-order core.
The Intel Pentium M processor incorporates sophisticated branch
prediction hardware to support the out-of-order core. The branch
prediction hardware includes dynamic prediction, and branch target
buffers.
The Intel Pentium M processor has enhanced dynamic branch prediction
hardware. Branch target buffers (BTB) predict the direction and target
of branches based on an instruction’s address.
The Pentium M Processor includes two techniques to reduce the
execution time of certain operations:
•ESP Folding. This eliminates the ESP manipulation
micro-operations in stack-related instructions such as PUSH, POP,
CALL and RET. It increases decode, rename, and retirement
throughput. ESP folding also increases execution bandwidth by
eliminating µops which would have required execution resources.
• Micro-ops (µops) fusion. Some of the most frequent pairs of µops
derived from the same instruction can be fused into a single µop.
The following categories of fused µops have been implemented in
the Pentium M processor:
— “Store address” and “store data” micro-ops are fused into a
single “Store” micro-op. This holds for all types of store
operations, including integer, floating-point, MMX technology,
and Streaming SIMD Extensions (SSE and SSE2) operations.
— A load micro-op in most cases can be fused with a successive
execution micro-op. This holds for integer, floating-point and
MMX technology loads and for most kinds of successive
execution operations. Note that SSE loads cannot be fused.
Data Prefetching
The Intel Pentium M processor supports three prefetching mechanisms:
•The first mechanism is a hardware instruction fetcher and is
described in the previous section.
• The second mechanism automatically fetches data into the
second-level cache. The implementation of automatic hardware
prefetching in the Pentium M processor family is basically similar to
that described for the NetBurst microarchitecture. The trigger
threshold distance for each relevant processor model is shown in
Table 1-2.
•The third mechanism is a software mechanism that fetches data into
the caches using the prefetch instructions.
Table 1-2  Trigger Threshold and CPUID Signatures for IA-32 Processor Families

Trigger Threshold    Extended    Extended
Distance (Bytes)     Model ID    Family ID    Family ID    Model ID
512                  0           0            15           3, 4, 6
256                  0           0            15           0, 1, 2
256                  0           0            6            9, 13, 14
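To match a processor against the CPUID signatures in Table 1-2, software
can read the family and model fields from CPUID leaf 1. A minimal
sketch, assuming a GCC-style toolchain that provides <cpuid.h> (other
compilers expose equivalent facilities):

    #include <cpuid.h>   /* __get_cpuid, assumed GCC-style toolchain */
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;                               /* CPUID leaf 1 not available */

        unsigned int family_id     = (eax >> 8)  & 0xF;
        unsigned int model_id      = (eax >> 4)  & 0xF;
        unsigned int ext_family_id = (eax >> 20) & 0xFF;
        unsigned int ext_model_id  = (eax >> 16) & 0xF;

        /* Compare these fields against the rows of Table 1-2 to find the
           hardware prefetch trigger threshold for this processor. */
        printf("family %u, model %u, extended family %u, extended model %u\n",
               family_id, model_id, ext_family_id, ext_model_id);
        return 0;
    }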
Data is fetched 64 bytes at a time; the instruction and data translation
lookaside buffers support 128 entries. See Table 1-3 for processor cache
parameters.
Table 1-3  Cache Parameters of Pentium M, Intel® Core™ Solo and Intel® Core™ Duo Processors

Level               Capacity   Associativity  Line Size  Access Latency  Write Update
                               (ways)         (bytes)    (clocks)        Policy
First               32 KB      8              64         3               Writeback
Instruction         32 KB      8              N/A        N/A             N/A
Second (model 9)    1 MB       8              64         9               Writeback
Second (model 13)   2 MB       8              64         10              Writeback
Second (model 14)   2 MB       8              64         14              Writeback
Out-of-Order Core
The processor core dynamically executes µops independent of program
order. The core is designed to facilitate parallel execution by employing
many buffers, issue ports, and parallel execution units.
The out-of-order core buffers µops in a Reservation Station (RS) until
their operands are ready and resources are available. Each cycle, the
core may dispatch up to five µops through the issue ports.
In-Order Retirement
The retirement unit in the Pentium M processor buffers completed µops
in the reorder buffer (ROB). The ROB updates the architectural state in
order. Up to three µops may be retired per cycle.
Microarchitecture of Intel® Core™ Solo and
Intel®Core™ Duo Processors
Intel Core Solo and Intel Core Duo processors incorporate a
microarchitecture that is similar to the Pentium M processor
microarchitecture, but provides additional enhancements for
performance and power efficiency. Enhancements include:
•Intel Smart Cache
This second level cache is shared between two cores in an Intel Core
Duo processor to minimize bus traffic between two cores accessing
a single copy of cached data. It allows an Intel Core Solo processor
(or when one of the two cores in an Intel Core Duo processor is idle)
to access its full capacity.
• Streaming SIMD Extensions 3
These extensions are supported in Intel Core Solo and Intel Core
Duo processors.
•Decoder improvement
Improvement in decoder and micro-op fusion allows the front end to
see most instructions as single μop instructions. This increases the
throughput of the three decoders in the front end.
•Improved execution core
Throughput of SIMD instructions is improved and the out-of-order
engine is more robust in handling sequences of frequently-used
instructions. Enhanced internal buffering and prefetch mechanisms
also improve data bandwidth for execution.
•Power-optimized bus
The system bus is optimized for power efficiency; the increased bus
speed supports up to 667 MHz.
•Data Prefetch
Intel Core Solo and Intel Core Duo processors implement improved
hardware prefetch mechanisms: one mechanism can look ahead and
prefetch data into L1 from L2. These processors also provide
enhanced hardware prefetchers similar to those of the Pentium M
processor (see Table 1-2).
Front End
Execution of SIMD instructions on Intel Core Solo and Intel Core Duo
processors is improved over Pentium M processors by the following
enhancements:
•Micro-op fusion
Scalar SIMD operations on register and memory have single
micro-op flows comparable to x87 flows. Many packed instructions
are fused to reduce their micro-op flow from four to two micro-ops.
•Eliminating decoder restrictions
Intel Core Solo and Intel Core Duo processors improve decoder
throughput with micro-fusion and macro-fusion, so that many more
SSE and SSE2 instructions can be decoded without restriction. On
Pentium M processors, many single micro-op SSE and SSE2
instructions must be decoded by the main decoder.
•Improved packed SIMD instruction decoding
On Intel Core Solo and Intel Core Duo processors, decoding of most
packed SSE instructions is done by all three decoders. As a result
the front end can process up to three packed SSE instructions every
cycle. There are some exceptions to the above; some
shuffle/unpack/shift operations are not fused and require the main
decoder.
Data Prefetching
Intel Core Solo and Intel Core Duo processors provide hardware
mechanisms to prefetch data from memory to the second-level cache.
There are two techniques: one mechanism activates after the data access
pattern experiences two cache-reference misses within a trigger-distance
threshold (see Table 1-2). This mechanism is similar to that of the
Pentium M processor, but can track 16 forward data streams and 4
backward streams. The second mechanism fetches an adjacent cache
line of data after experiencing a cache miss. This effectively simulates
the prefetching capabilities of 128-byte sectors (similar to the sectoring
of two adjacent 64-byte cache lines available in Pentium 4 processors).
Hardware prefetch requests are queued up in the bus system at lower
priority than normal cache-miss requests. If the bus queue is in high
demand, hardware prefetch requests may be ignored or cancelled to
service bus traffic required by demand cache-misses and other bus
transactions.
Hardware prefetch mechanisms are enhanced over those of the Pentium M
processor in the following ways:
•Data stores that are not in the second-level cache generate read for
ownership requests. These requests are treated as loads and can
trigger a prefetch stream.
•Software prefetch instructions are treated as loads; they can also
trigger a prefetch stream.
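As an illustration of the second point, the following sketch issues software prefetches while walking a large array. It assumes the SSE intrinsic header xmmintrin.h; the prefetch distance is a made-up tuning parameter, not a value taken from this manual.

#include <xmmintrin.h>               /* _mm_prefetch, _MM_HINT_T1 */

#define PREFETCH_DISTANCE 16         /* elements ahead; tune per workload */

float sum_with_prefetch(const float *data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                         _MM_HINT_T1);   /* software prefetch, treated like a load */
        sum += data[i];
    }
    return sum;
}

Because the prefetches follow a regular stride, they can also help start a hardware prefetch stream, as described above.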
Hyper-Threading Technology
Intel® Hyper-Threading (HT) Technology is supported by specific
members of the Intel Pentium 4 and Xeon processor families. The
technology enables software to take advantage of task-level, or
thread-level parallelism by providing multiple logical processors within
a physical processor package. In its first implementation in the Intel Xeon
processor, Hyper-Threading Technology makes a single physical
processor appear as two logical processors.
The two logical processors each have a complete set of architectural
registers while sharing one single physical processor's resources. By
maintaining the architecture state of two processors, an HT Technology
capable processor looks like two processors to software, including
operating system and application code.
By sharing resources needed for peak demands between two logical
processors, HT Technology is well suited for multiprocessor systems to
provide an additional performance boost in throughput when compared
to traditional MP systems.
Figure 1-6 shows a typical bus-based symmetric multiprocessor (SMP)
based on processors supporting Hyper-Threading Technology. Each
logical processor can execute a software thread, allowing a maximum of
two software threads to execute simultaneously on one physical
processor. The two software threads execute simultaneously, meaning
that in the same clock cycle an “add” operation from logical processor 0
and another “add” operation and load from logical processor 1 can be
executed simultaneously by the execution engine.
In the first implementation of HT Technology, the physical execution
resources are shared and the architecture state is duplicated for each
logical processor. This minimizes the die area cost of implementing HT
Technology while still achieving performance gains for multithreaded
applications or multitasking workloads.
Figure 1-6 Hyper-Threading Technology on an SMP
(The figure shows two physical processors on a system bus. Each physical
package contains a single execution engine and bus interface shared by two
logical processors, with the architectural state and local APIC duplicated
per logical processor.)
The performance potential of HT Technology is due to:
•the fact that operating systems and user programs can schedule
processes or threads to execute simultaneously on the logical
processors in each physical processor
•the ability to use on-chip execution resources at a higher level than
when only a single thread is consuming the execution resources;
higher level of resource utilization can lead to higher system
throughput
Processor Resources and Hyper-Threading Technology
The majority of microarchitecture resources in a physical processor are
shared between the logical processors. Only a few small data structures
were replicated for each logical processor. This section describes how
resources are shared, partitioned or replicated.
Replicated Resources
The architectural state is replicated for each logical processor. The
architecture state consists of registers that are used by the operating
system and application code to control program behavior and store data
for computations. This state includes the eight general-purpose
registers, the control registers, machine state registers, debug registers,
and others. There are a few exceptions, most notably the memory type
range registers (MTRRs) and the performance monitoring resources.
For a complete list of the architecture state and exceptions, see the IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A & 3B.
Other resources such as instruction pointers and register renaming tables
were replicated to simultaneously track execution and state changes of
the two logical processors. The return stack predictor is replicated to
improve branch prediction of return instructions.
In addition, a few buffers (for example, the 2-entry instruction
streaming buffers) were replicated to reduce complexity.
Partitioned Resources
Several buffers are shared by limiting the use of each logical processor
to half the entries. These are referred to as partitioned resources.
Reasons for this partitioning include:
•operational fairness
•allowing operations from one logical processor to bypass operations
of the other logical processor that may have stalled
For example: a cache miss, a branch misprediction, or instruction
dependencies may prevent a logical processor from making forward
progress for some number of cycles. The partitioning prevents the
stalled logical processor from blocking forward progress.
In general, the buffers for staging instructions between major pipe
stages are partitioned. These buffers include µop queues after the
execution trace cache, the queues after the register rename stage, the
reorder buffer which stages instructions for retirement, and the load and
store buffers.
In the case of load and store buffers, partitioning also provided an easier
implementation to maintain memory ordering for each logical processor
and detect memory ordering violations.
Shared Resources
Most resources in a physical processor are fully shared to improve the
dynamic utilization of the resource, including caches and all the
execution units. Some shared resources which are linearly addressed,
like the DTLB, include a logical processor ID bit to distinguish whether
the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a
context-ID bit:
•Shared mode: The L1 data cache is fully shared by two logical
processors.
•Adaptive mode: In adaptive mode, memory accesses using the page
directory are mapped identically across logical processors sharing the
L1 data cache.
The other resources are fully shared.
Microarchitecture Pipeline and Hyper-Threading Technology
This section describes the HT Technology microarchitecture and how
instructions from the two logical processors are handled between the
front end and the back end of the pipeline.
Although instructions originating from two programs or two threads
execute simultaneously and not necessarily in program order in the
execution core and memory hierarchy, the front end and back end
contain several selection points to select between instructions from the
two logical processors. All selection points alternate between the two
logical processors unless one logical processor cannot make use of a
pipeline stage. In this case, the other logical processor has full use of
every cycle of the pipeline stage. Reasons why a logical processor may
not use a pipeline stage include cache misses, branch mispredictions,
and instruction dependencies.
Front End Pipeline
The execution trace cache is shared between two logical processors.
Execution trace cache access is arbitrated by the two logical processors
every clock. If a cache line is fetched for one logical processor in one
clock cycle, the next clock cycle a line would be fetched for the other
logical processor provided that both logical processors are requesting
access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace
cache, the other logical processor can use the full bandwidth of the trace
cache until the initial logical processor’s instruction fetches return from
the L2 cache.
After fetching the instructions and building traces of µops, the µops are
placed in a queue. This queue decouples the execution trace cache from
the register rename pipeline stage. As described earlier, if both logical
processors are active, the queue is partitioned so that both logical
processors can make independent forward progress.
Execution Core
The core can dispatch up to six µops per cycle, provided the µops are
ready to execute. Once the µops are placed in the queues waiting for
execution, there is no distinction between instructions from the two
logical processors. The execution core and memory hierarchy are also
oblivious to which instructions belong to which logical processor.
After execution, instructions are placed in the re-order buffer. The
re-order buffer decouples the execution stage from the retirement stage.
The re-order buffer is partitioned such that each logical processor uses
half the entries.
Retirement
The retirement logic tracks when instructions from the two logical
processors are ready to be retired. It retires instructions in program
order for each logical processor by alternating between the two logical
processors. If one logical processor is not ready to retire any
instructions, then all retirement bandwidth is dedicated to the other
logical processor.
Once stores have retired, the processor needs to write the store data into
the level-one data cache. Selection logic alternates between the two
logical processors to commit store data to the cache.
Multi-Core Processors
The Intel Pentium D processor and the Pentium Processor Extreme
Edition introduce multi-core features in the IA-32 architecture. These
processors enhance hardware support for multi-threading by providing
two processor cores in each physical processor package. The Dual-core
Intel Xeon and Intel Core Duo processors also provide two processor
cores in a physical package.
The Intel Pentium D processor provides two logical processors in a
physical package; each logical processor has a separate execution core
and a cache hierarchy. The Dual-core Intel Xeon processor and the Intel
Pentium Processor Extreme Edition provide four logical processors in a
physical package that has two execution cores. Each core provides two
logical processors sharing an execution core and a cache hierarchy.
The Intel Core Duo processor provides two logical processors in a
physical package. Each logical processor has a separate execution core
(including first-level cache) and a smart second-level cache. The
second-level cache is shared between two logical processors and
optimized to reduce bus traffic when the same copy of cached data is
used by two logical processors. The full capacity of the second-level
cache can be used by one logical processor if the other logical processor
is inactive.
The functional blocks of these processors are shown in Figure 1-7.
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor
(The figure shows the functional blocks of each processor on a system bus: the
Pentium D processor with two cores, each having its own architectural state,
execution engine, local APIC, caches and bus interface; the Pentium Processor
Extreme Edition with two cores, each supporting two architectural states and
local APICs; and the Intel Core Duo processor with two cores sharing a second
level cache and a single bus interface.)
Microarchitecture Pipeline and Multi-Core Processors
In general, each core in a multi-core processor resembles a single-core
processor implementation of the underlying microarchitecture. The
implementation of the cache hierarchy in a dual-core or multi-core
processor may be the same or different from the cache hierarchy
implementation in a single-core processor.
CPUID should be used to determine cache-sharing topology
information in a processor implementation and the underlying
microarchitecture. The former is obtained by querying the deterministic
cache parameter leaf (see Chapter 6, “Optimizing Cache Usage”); the
latter by using the encoded values for the extended family, family, extended
model, and model fields. See Table 1-4.
Table 1-4 Family And Model Designations of Microarchitectures

Dual-Core Processor          Microarchitecture    Extended Family  Family  Extended Model  Model
Pentium D processor          NetBurst             0                15      0               3, 4, 6
Pentium processor            NetBurst             0                15      0               3, 4, 6
  Extreme Edition
Intel Core Duo processor     Improved Pentium M   0                6       0               14
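The following sketch shows one way to read the raw fields that Table 1-4 lists. It assumes a GCC- or Clang-style compiler with the <cpuid.h> helper; the variable names are illustrative only.

#include <stdio.h>
#include <cpuid.h>                     /* __get_cpuid */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                      /* CPUID leaf 1 not available */

    unsigned int family     = (eax >> 8)  & 0xF;    /* bits 11:8  */
    unsigned int model      = (eax >> 4)  & 0xF;    /* bits 7:4   */
    unsigned int ext_family = (eax >> 20) & 0xFF;   /* bits 27:20 */
    unsigned int ext_model  = (eax >> 16) & 0xF;    /* bits 19:16 */

    printf("extended family %u, family %u, extended model %u, model %u\n",
           ext_family, family, ext_model, model);
    return 0;
}

Comparing the printed values against the rows of Table 1-4 identifies the underlying microarchitecture.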
Shared Cache in Intel Core Duo Processors
The Intel Core Duo processor has two symmetric cores that share the
second-level cache and a single bus interface (see Figure 1-7). Two
threads executing on two cores in an Intel Core Duo processor can take
advantage of the shared second-level cache, accessing a single copy of
cached data without generating bus traffic.
Load and Store Operations
When an instruction needs to read data from a memory address, the
processor looks for it in caches and memory. When an instruction writes
data to a memory location (write back) the processor first makes sure
that the cache line that contains the memory location is owned by the
first-level data cache of the initiating core (that is, the line is in
exclusive or modified state). Then the processor looks for the cache line
in the cache and memory sub-systems. The look-ups for the locality of
load or store operation are in the following order:
1. First-level cache of the initiating core
2. Second-level cache and the first-level cache of the other core
3. Memory
Table 1-5 lists the performance characteristics of generic load and store
operations in an Intel Core Duo processor. Numeric values of Table 1-5 are
in terms of processor core cycles.

Table 1-5 Characteristics of Load and Store Operations in Intel Core Duo Processors

                              Load                                  Store
Data Locality                 Latency            Throughput         Latency            Throughput
1st-level cache (L1)          3                  1                  2                  1
L1 of the other core in       14 + bus           14 + bus           14 + bus           ~10
"Modified" state              transaction        transaction        transaction
2nd-level cache               14                 <6                 14                 <6
Memory                        14 + bus           Bus read           14 + bus           Bus write
                              transaction        protocol           transaction        protocol
Throughput is expressed as the number of cycles to wait before the
same operation can start again. The latency of a bus transaction is
exposed in some of these operations, as indicated by entries
containing "+ bus transaction". On Intel Core Duo processors, a
typical bus transaction may take 5.5 bus cycles. For a 667 MHz bus
and a core frequency of 2.167 GHz, the total is 14 + 5.5 * 2167
/(667/4), or approximately 86 core cycles.
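For reference, the same arithmetic can be written out as a small helper; the function name and the quad-pumped-bus assumption are illustrative only.

/* 667 MHz is the transfer rate of a quad-pumped bus, so the bus clock is
   667/4 MHz; 5.5 bus cycles are scaled by the core-to-bus clock ratio and
   added to the 14-cycle base latency. */
double l1_to_l1_core_cycles(double core_mhz, double bus_mhz)
{
    double bus_clock_mhz = bus_mhz / 4.0;
    double ratio         = core_mhz / bus_clock_mhz;
    return 14.0 + 5.5 * ratio;
}
/* l1_to_l1_core_cycles(2167.0, 667.0) evaluates to roughly 86. */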
Sometimes a modified cache line has to be evicted to make room for a
new cache line. The modified cache line is evicted in parallel with
bringing in the new data and does not require additional latency. However,
when data is written back to memory, the eviction consumes cache
bandwidth and bus bandwidth. If many cache misses that require the
eviction of modified lines occur within a short time, there is an
overall degradation in the response time of these cache misses.
For a store operation, the read for ownership must be completed before the
data is written to the first-level data cache and the line is marked as
modified. Reading for ownership and storing the data happen after
instruction retirement and follow the order of retirement. The bus store
latency does not affect the store instruction itself. However, several
sequential stores may have cumulative latency that can affect
performance.
General Optimization
Guidelines
This chapter discusses general optimization techniques that can improve
the performance of applications running on the Intel Pentium 4, Intel
Xeon, Pentium M processors, as well as on dual-core processors. These
techniques take advantage of the microarchitectural features of the
generation of IA-32 processor family described in Chapter 1.
Optimization guidelines for 64-bit mode applications are discussed in
Chapter 8. Additional optimization guidelines applicable to dual-core
processors and Hyper-Threading Technology are discussed in Chapter 7.
This chapter explains the optimization techniques both for those who
use the Intel® C++ or Fortran Compiler and for those who use other
compilers. The Intel® compiler, which generates code specifically tuned
for the IA-32 processor family, provides most of the optimization. For
those not using the Intel C++ or Fortran Compiler, the assembly code
tuning optimizations may be useful. The explanations are supported by
coding examples.
Tuning to Achieve Optimum Performance
The most important factors in achieving optimum processor
performance are:
•good branch prediction
•avoiding memory access stalls
•good floating-point performance
•instruction selection, including use of SIMD instructions
The following sections describe practices, tools, coding rules and
recommendations associated with these factors that will aid in
optimizing the performance on IA-32 processors.
Tuning to Prevent Known Coding Pitfalls
To produce program code that takes advantage of the Intel NetBurst
microarchitecture and the Pentium M processor microarchitecture, you
must avoid the coding pitfalls that limit the performance of the target
processor family. This section lists several known pitfalls that can limit
performance of Pentium 4 and Intel Xeon processor implementations.
Some of these pitfalls, to a lesser degree, also negatively impact
Pentium M processor performance (store-to-load-forwarding
restrictions, cache-line splits).
Table 2-1 lists coding pitfalls that cause performance degradation in
some Pentium 4 and Intel Xeon processor implementations. For every
issue, Table 2-1 references a section in this document. The section
describes in detail the causes of the penalty and presents a
recommended solution. Note that “aligned” here means that the address
of the load is aligned with respect to the address of the store.
Table 2-1 Coding Pitfalls Affecting Performance

Factors Affecting Performance        Symptom              Example (if applicable)  Section Reference
Small, unaligned load after          Store-forwarding     Example 2-12             Store Forwarding,
large store                          blocked                                       Store-to-Load-Forwarding
                                                                                   Restriction on Size and
                                                                                   Alignment
Large load after small store;        Store-forwarding     Example 2-13,            Store Forwarding,
Load dword after store dword,        blocked              Example 2-14             Store-to-Load-Forwarding
store byte; Load dword, AND                                                        Restriction on Size and
with 0xff after store byte                                                         Alignment
Misaligned data access                                    Example 2-11             Align data on natural operand
                                                                                   size address boundaries. If the
                                                                                   data will be accessed with
                                                                                   vector instruction loads and
                                                                                   stores, align the data on
                                                                                   16-byte boundaries.
Cycling more than 2 values of        Slows x87, SSE*,     fldcw not optimized      Floating-point Modes,
Floating-point Control Word          SSE2** floating-                              Floating-point Exceptions
                                     point operations

*  Streaming SIMD Extensions (SSE)
** Streaming SIMD Extensions 2 (SSE2)
General Practices and Coding Guidelines
This section discusses guidelines derived from the performance factors
listed in the “Tuning to Achieve Optimum Performance” section. It also
highlights practices that use performance tools.
The majority of these guidelines benefit processors based on the Intel
NetBurst microarchitecture and the Pentium M processor
microarchitecture. Some guidelines benefit one microarchitecture more
than the other. As a whole, these coding rules enable software to be
optimized for the common performance features of the Intel NetBurst
microarchitecture and the Pentium M processor microarchitecture.
The coding practices recommended under each heading and the bullets
under each heading are listed in order of importance.
Use Available Performance Tools
•Current-generation compiler, such as the Intel C++ Compiler:
— Set this compiler to produce code for the target processor
implementation
— Use the compiler switches for optimization and/or
profile-guided optimization. These features are summarized in
the “Intel® C++ Compiler” section. For more detail, see the
Intel® C++ Compiler User’s Guide.
•Current-generation performance monitoring tools, such as VTune™
Performance Analyzer:
— Identify performance issues, use event-based sampling, code
coach and other analysis resources.
— Measure workload characteristics such as instruction
throughput, data traffic locality, memory traffic characteristics,
etc.
— Characterize the performance gain.
Optimize Performance Across Processor Generations
•Use a cpuid dispatch strategy to deliver optimum performance for
all processor generations.
•Use the deterministic cache parameter leaf of cpuid to deliver scalable
performance that is transparent across processor families with
different cache sizes.
•Use a compatible code strategy to deliver optimum performance for
the current generation of the IA-32 processor family and future IA-32
processors.
•Use a low-overhead threading strategy so that a multi-threaded
application delivers optimal multi-processor scaling performance
when executing on processors that have hardware multi-threading
support, or delivers nearly identical single-processor scaling when
executing on a processor without hardware multi-threading support.
Optimize Branch Predictability
•Improve branch predictability and optimize instruction prefetching
by arranging code to be consistent with the static branch prediction
assumption: backward taken and forward not taken.
•Avoid mixing near calls, far calls and returns.
•Avoid implementing a call by pushing the return address and
jumping to the target. The hardware can pair up call and return
instructions to enhance predictability.
•Use the pause instruction in spin-wait loops.
•Inline functions according to coding recommendations.
•Whenever possible, eliminate branches.
•Avoid indirect calls.
Optimize Memory Access
•Observe store-forwarding constraints.
•Ensure proper data alignment to prevent data from being split across a
cache line boundary. This includes the stack and passing of parameters
(see the alignment sketch after this list).
•Avoid mixing code and data (self-modifying code).
•Choose data types carefully (see next bullet below) and avoid type
casting.
•Employ data structure layout optimization to ensure efficient use of
64-byte cache line size.
•Favor parallel data access to mask latency over data accesses with
dependency that expose latency.
•For cache-miss data traffic, favor smaller cache-miss strides to
avoid frequent DTLB misses.
•Use prefetching appropriately.
•Use techniques such as blocking to enhance locality.
•Use the const modifier; use the static modifier for global
variables.
•Use new cacheability instructions and memory-ordering behavior.
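A minimal alignment sketch for the bullet on data alignment above, assuming a C11 compiler (older compilers offer __declspec(align(16)) or __attribute__((aligned(16))) instead); the names and padding choice are illustrative only.

#include <stdalign.h>

/* 16-byte alignment so SSE/SSE2 aligned vector loads and stores can be used. */
alignas(16) static float coefficients[256];

/* Keep fields that are accessed together in one 64-byte cache line, and pad
   the structure to 64 bytes so array elements do not straddle lines. */
struct particle {
    float pos[3];
    float vel[3];
    float mass;
    float pad[9];                      /* 16 floats = 64 bytes total */
};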
Optimize Floating-point Performance
•Avoid exceeding representable ranges during computation, since
handling these cases can have a performance impact. Do not use a
larger precision format (double-extended floating point) unless
required, since this increases memory size and bandwidth
utilization.
•Use FISTTP to avoid changing the rounding mode when possible, or use
optimized fldcw sequences. Avoid changing floating-point control/status
registers (rounding modes) between more than two values.
•Use efficient conversions, such as those that implicitly include a
rounding mode, in order to avoid changing control/status registers.
•Take advantage of the SIMD capabilities of Streaming SIMD
Extensions (SSE) and of Streaming SIMD Extensions 2 (SSE2)
instructions. Enable flush-to-zero mode and DAZ mode when using
SSE and SSE2 instructions (a sketch for enabling both modes follows this list).
•Avoid denormalized input values, denormalized output values, and
explicit constants that could cause denormal exceptions.
•Avoid excessive use of the fxch instruction.
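A minimal sketch for enabling the flush-to-zero and DAZ modes mentioned in the list above, assuming the SSE/SSE3 intrinsic headers; check processor support for DAZ before relying on it.

#include <xmmintrin.h>                 /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>                 /* _MM_SET_DENORMALS_ZERO_MODE */

void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* flush denormal results to zero */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* treat denormal inputs as zero  */
}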
Optimize Instruction Selection
•Focus instruction selection at the granularity of path length for a
sequence of instructions versus individual instruction selections;
minimize the number of uops, data/register dependency in
aggregates of the path length, and maximize retirement throughput.
•Avoid longer latency instructions: integer multiplies and divides.
Replace them with alternate code sequences (e.g., use shifts instead
of multiplies).
•Use the lea instruction and the full range of addressing modes to do
address calculation.
•Some types of stores use more µops than others; try to use simpler
store variants and/or reduce the number of stores.
•Avoid use of complex instructions that require more than 4 µops.
•Avoid instructions that unnecessarily introduce dependence-related
stalls:
inc and dec instructions, partial register operations (8/16-bit
operands).
•Avoid use of ah, bh, and other higher 8-bits of the 16-bit registers,
because accessing them requires a shift operation internally.
•Use xor and pxor instructions to clear registers and break
dependencies for integer operations; also use xorps and xorpd to
clear XMM registers for floating-point operations.
•Use efficient approaches for performing comparisons.
Optimize Instruction Scheduling
•Consider latencies and resource constraints.
•Calculate store addresses as early as possible.
Enable Vectorization
•Use the smallest possible data type. This enables more parallelism
with the use of a longer vector.
•Arrange the nesting of loops so the innermost nesting level is free of
inter-iteration dependencies. It is especially important to avoid the
case where the store of data in an earlier iteration happens lexically
after the load of that data in a future iteration (called
lexically-backward dependence).
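The following sketch illustrates the dependence issue described in the last bullet; the loops and names are illustrative only. The first loop carries a lexically-backward dependence (the store appears lexically after the load it feeds in a later iteration), so it cannot be vectorized safely; the second loop has independent iterations.

void carried_dependence(float *a, const float *b, int n)
{
    for (int i = 0; i < n - 1; i++) {
        float t = a[i];                /* load of a[i] ...                       */
        a[i + 1] = t + b[i];           /* ... then a lexically later store that  */
                                       /* the next iteration's load will read    */
    }
}

void vectorizable(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];            /* no inter-iteration dependence */
}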
Coding recommendations are ranked in importance using two measures:
•Local impact (referred to as “impact”) is the difference that a
recommendation makes to performance for a given instance, with
the impact’s priority marked as: H = high, M = medium, L = low.
•Generality measures how frequently such instances occur across all
application domains, with the frequency marked as: H = high,
M = medium, L = low.
These rules are very approximate. They can vary depending on coding
style, application domain, and other factors. The purpose of including
high, medium and low priorities with each recommendation is to
provide some hints as to the degree of performance gain that one can
expect if a recommendation is implemented.
Because it is not possible to predict the frequency of occurrence of a
code instance in applications, priority hints cannot be directly correlated
to application-level performance gain. However, in important cases
where application-level performance gain has been observed, a more
quantitative characterization of application-level performance gain is
provided for information only (see: “Store-to-Load-Forwarding
Restriction on Size and Alignment” and “Instruction Selection” in this
document). In places where no priority is assigned, the impact has been
deemed inapplicable.
Performance Tools
Intel offers several tools that can facilitate optimizing your application’s
performance.
Intel® C++ Compiler
Use the Intel C++ Compiler following the recommendations described
here. The Intel Compiler’s advanced optimization features provide good
performance without the need to hand-tune assembly code. However,
the following features may enhance performance even further:
•Inlined assembly
•Intrinsics, which have a one-to-one correspondence with assembly
language instructions but allow the compiler to perform register
allocation and instruction scheduling. Refer to the “Intel C++
Intrinsics Reference” section of the Intel® C++ Compiler User’s Guide.
•C++ class libraries. Refer to the “Intel C++ Class Libraries for
SIMD Operations Reference” section of the Intel® C++ Compiler
User’s Guide.
•Vectorization in conjunction with compiler directives (pragmas).
Refer to the “Compiler Vectorization Support and Guidelines”
section of the Intel® C++ Compiler User’s Guide.
The Intel C++ Compiler can generate an executable which uses features
such as Streaming SIMD Extensions 2. The executable will maximize
performance on the current generation of IA-32 processor family (for
example, a Pentium 4 processor) and still execute correctly on older
processors. Refer to the “Processor Dispatch Support” section in the
Intel® C++ Compiler User’s Guide.
General Compiler Recommendations
A compiler that has been extensively tuned for the target microarchitecture can be expected to match or outperform hand-coding in a general
case. However, if particular performance problems are noted with the
compiled code, some compilers (like the Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to
exert greater control over what code is generated. If inline assembly is
used, the user should verify that the code generated to integrate the
inline assembly is of good quality and yields good overall performance.
Default compiler switches are targeted for the common case. An
optimization may be made to the compiler default if it is beneficial for
most programs. If a performance problem is root-caused to a poor
choice on the part of the compiler, using different switches or compiling
the targeted module with a different compiler may be the solution.
VTune™ Performance Analyzer
Where performance is a critical concern, use performance monitoring
hardware and software tools to tune your application and its interaction
with the hardware. IA-32 processors have counters which can be used to
monitor a large number of performance-related events for each
microarchitecture. The counters also provide information that helps
resolve the coding pitfalls.
The VTune Performance Analyzer allows engineers to use these counters
to obtain two kinds of tuning feedback:
•indication of a performance improvement gained by using a specific
coding recommendation or microarchitectural feature,
•information on whether a change in the program has improved or
degraded performance with respect to a particular metric.
The VTune Performance Analyzer also enables engineers to use these
counters to measure a number of workload characteristics, including:
•retirement throughput of instruction execution as an indication of
the degree of extractable instruction-level parallelism in the
workload,
•data traffic locality as an indication of the stress point of the cache
and memory hierarchy,
•data traffic parallelism as an indication of the degree of
effectiveness of amortization of data access latency.
Note that improving performance in one part of the machine does not
necessarily bring significant gains to overall performance. It is possible
to degrade overall performance by improving performance for some
particular metric.
Where appropriate, coding recommendations in this chapter include
descriptions of the VTune analyzer events that provide measurable data
of performance gain achieved by following recommendations. Refer to
the VTune analyzer online help for instructions on how to use the tool.
VTune analyzer events include the Pentium 4 processor performance
metrics described in Appendix B, “Using Performance Monitoring
Events.”
Processor Perspectives
The majority of the coding recommendations for the Pentium 4 and
Intel Xeon processors also apply to Pentium M, Intel Core Solo, and
Intel Core Duo processors. However, there are situations where a
recommendation may benefit one microarchitecture more than the other.
The most important of these are:
•Instruction decode throughput is important for the Pentium M, Intel
Core Solo, and Intel Core Duo processors but less important for the
Pentium 4 and Intel Xeon processors. Generating code with the
4-1-1 template (instruction with four μops followed by two
instructions with one μop each) helps the Pentium M processor.
Intel Core Solo and Intel Core Duo processors have an enhanced front
end that is less sensitive to the 4-1-1 template. The practice has no
real impact on processors based on the Intel NetBurst
microarchitecture.
•Dependencies for partial register writes incur large penalties when
using the Pentium M processor (this applies to processors with
CPUID signature family 6, model 9). On Pentium 4, Intel Xeon
processors, Pentium M processor (with CPUID signature family 6,
model 13), and Intel Core Solo, and Intel Core Duo processors, such
penalties are resolved by artificial dependencies between each
partial register write. To avoid false dependences from partial
register updates, use full register updates and extended moves.
•On Pentium 4 and Intel Xeon processors, some latencies have
increased: shifts, rotates, integer multiplies, and moves from
memory with sign extension are longer than before. Use care when
using the lea instruction. See the section “Use of the lea
Instruction” for recommendations.
•The inc and dec instructions should always be avoided. Using add
and
sub instructions instead avoids data dependence and improves
performance.
•Dependence-breaking support is added for the pxor instruction.
•Floating point register stack exchange instructions were free; now
they are slightly more expensive due to issue restrictions.
•Writes and reads to the same location should now be spaced apart.
This is especially true for writes that depend on long-latency
instructions.
•Hardware prefetching may shorten the effective memory latency for
data and instruction accesses.
•Cacheability instructions are available to streamline stores and
manage cache utilization.
•Cache lines are 64 bytes (see Table 1-1 and Table 1-3). Because of
this, software prefetching should be done less often. False sharing,
however, can be an issue.
•On the Pentium 4 and Intel Xeon processors, the primary code size
limit of interest is imposed by the trace cache. On Pentium M
processors, code size limit is governed by the instruction cache.
•There may be a penalty when instructions with immediates
requiring more than 16-bit signed representation are placed next to
other instructions that use immediates.
Note that memory-related optimization techniques for alignments,
complying with store-to-load-forwarding restrictions and avoiding data
splits help Pentium 4 processors as well as Pentium M processors.
CPUID Dispatch Strategy and Compatible Code Strategy
Where optimum performance on all processor generations is desired,
applications can take advantage of
generation and integrate processor-specific instructions (such as SSE2
instructions) into the source code. The Intel C++ Compiler supports the
integration of different versions of the code for different target
processors. The selection of which code to execute at runtime is made
based on the CPU identifier that is read with
targeted for different processor generations can be generated under the
control of the programmer or by the compiler.
cpuid to identify the processor
cpuid. Binary code
For applications run on both the Intel Pentium 4 and Pentium M
processors, and where minimum binary code size and single code path
is important, a compatible code strategy is the best. Optimizing
applications for the Intel NetBurst microarchitecture is likely to improve
code efficiency and scalability when running on processors based on
current and future generations of IA-32 processors. This approach to
optimization is also likely to deliver high performance on Pentium M
processors.
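A hedged sketch of such a dispatch, assuming a GCC- or Clang-style compiler with <cpuid.h>; the two implementations are assumed to exist elsewhere, and their names are illustrative only.

#include <cpuid.h>                     /* __get_cpuid */

void sum_generic(const float *a, float *out, int n);   /* baseline IA-32 path */
void sum_sse2(const float *a, float *out, int n);      /* SSE2-specific path  */

typedef void (*sum_fn)(const float *, float *, int);

sum_fn select_sum(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 26)))
        return sum_sse2;               /* EDX bit 26 of leaf 1 reports SSE2 */

    return sum_generic;                /* safe path for older processors   */
}

The selected function pointer can be cached at program startup so the cpuid check is not repeated on every call.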
Transparent Cache-Parameter Strategy
If the CPUID instruction supports function leaf 4, also known as the
deterministic cache parameter leaf, this function leaf reports detailed
cache parameters for each level of the cache hierarchy in a deterministic
and forward-compatible manner across current and future IA-32
processor families. See the CPUID instruction in the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2B.
For coding techniques that rely on specific parameters of a cache level,
using the deterministic cache parameter leaf allows software to implement
such techniques in a manner that is forward-compatible with future
generations of IA-32 processors and cross-compatible with processors
equipped with different cache sizes.
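A sketch of such a query, assuming a GCC- or Clang-style compiler with <cpuid.h>; the decoding follows the deterministic cache parameter leaf layout, in which line size, partitions, ways and sets are each stored minus one.

#include <stdio.h>
#include <cpuid.h>                     /* __get_cpuid, __cpuid_count */

void print_cache_levels(void)
{
    unsigned int max_leaf, ebx0, ecx0, edx0;
    if (!__get_cpuid(0, &max_leaf, &ebx0, &ecx0, &edx0) || max_leaf < 4)
        return;                        /* function leaf 4 not supported */

    for (unsigned int subleaf = 0; ; subleaf++) {
        unsigned int eax, ebx, ecx, edx;
        __cpuid_count(4, subleaf, eax, ebx, ecx, edx);

        unsigned int type = eax & 0x1F;             /* 0 = no more caches */
        if (type == 0)
            break;

        unsigned int level      = (eax >> 5) & 0x7;
        unsigned int line_size  = (ebx & 0xFFF) + 1;
        unsigned int partitions = ((ebx >> 12) & 0x3FF) + 1;
        unsigned int ways       = ((ebx >> 22) & 0x3FF) + 1;
        unsigned int sets       = ecx + 1;

        printf("L%u cache: line size %u bytes, total %u bytes\n",
               level, line_size, ways * partitions * line_size * sets);
    }
}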
Threading Strategy and Hardware Multi-Threading Support
Current IA-32 processor families offer hardware multi-threading
support in two forms: dual-core technology and Hyper-Threading
Technology. The trend for future IA-32 processors will continue in the
direction of multi-core technology.
To fully harness the performance potentials of the hardware
multi-threading capabilities in current and future generations of IA-32
processors, software must embrace a threaded approach in application
design. At the same time, to address the widest installed base of
machines, multi-threaded software should be able to run without failure
on a single processor without hardware multi-threading support, and
multi-threaded software implementation should also achieve
comparable performance on a single logical processor relative to an
unthreaded implementation if such comparison can be made. This
generally requires architecting a multi-threaded application to minimize
the overhead of thread synchronization. Additional software
optimization guidelines on multi-threading are discussed in Chapter 7.
Branch Prediction
Branch optimizations have a significant impact on performance. By
understanding the flow of branches and improving the predictability of
branches, you can increase the speed of code significantly.
Optimizations that help branch prediction are:
•Keep code and data on separate pages (a very important item, see
more details in the “Memory Accesses” section).
•Whenever possible, eliminate branches.
•Arrange code to be consistent with the static branch prediction
algorithm.
•Use the pause instruction in spin-wait loops.
•Inline functions and pair up calls and returns.
•Unroll as necessary so that repeatedly-executed loops have sixteen
or fewer iterations, unless this causes an excessive code size
increase.
•Separate branches so that they occur no more frequently than every
three
μops where possible.
Eliminating Branches
Eliminating branches improves performance because it:
•reduces the possibility of mispredictions
•reduces the number of required branch target buffer (BTB) entries;
conditional branches, which are never taken, do not consume BTB
resources
There are four principal ways of eliminating branches:
•arrange code to make basic blocks contiguous
•unroll loops, as discussed in the “Loop Unrolling” section
•use the cmov instruction
•use the setcc instruction
Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange
code to make basic blocks contiguous and eliminate unnecessary branches.
For the Pentium M processor, every branch counts, even correctly
predicted branches have a negative effect on the amount of useful code
delivered to the processor. Also, taken branches consume space in the
branch prediction structures and extra branches create pressure on the
capacity of the structures.
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the
setcc and cmov instructions to eliminate unpredictable conditional branches
where possible. Do not do this for predictable branches. Do not use these
instructions to eliminate all unpredictable conditional branches (because using
these instructions will incur execution overhead due to the requirement for
executing both paths of a conditional branch). In addition, converting
conditional branches to cmovs or setcc trades off control flow dependence for
data dependence and restricts the capability of the out-of-order engine. When
tuning, note that all IA-32 based processors have very high branch prediction
rates. Consistently mispredicted branches are rare. Use these instructions only
if the increase in computation time is less than the expected cost of a
mispredicted branch.
Consider a line of C code that has a condition dependent upon one of the
constants:
X = (A < B) ? CONST1 : CONST2;
This code conditionally compares two values, A and B. If the condition is
true,
X is set to CONST1; otherwise it is set to CONST2. An assembly code
sequence equivalent to the above C code can contain branches that are
not predictable if there is no correlation between the two values.
Example 2-1 shows the assembly code with unpredictable branches.
The unpredictable branches in Example 2-1 can be removed with the
use of the
setcc instruction. Example 2-2 shows an optimized code that
does not have branches.
Example 2-1 Assembly Code with an Unpredictable Branch
cmp A, B ; condition
jge L30 ; conditional branch
mov ebx, CONST1 ; ebx holds X
jmp L31; unconditional branch
L30:
mov ebx, CONST2
L31:
Example 2-2 Code Optimization to Eliminate Branches
xor ebx, ebx ; clear ebx (X in the C code)
cmp A, B
setge bl ; When ebx = 0 or 1
; OR the complement condition
sub ebx, 1 ; ebx=11...11 or 00...00
and ebx, CONST3 ; CONST3 = CONST1-CONST2
add ebx, CONST2 ; ebx=CONST1 or CONST2
See Example 2-2. The optimized code sets ebx to zero, then compares A
and B. If A is greater than or equal to B, ebx is set to one. Then ebx is
decreased and “and-ed” with the difference of the constant values. This
sets ebx to either zero or the difference of the values. By adding CONST2
back to ebx, the correct value is written to ebx. When CONST2 is equal to
zero, the last instruction can be deleted.
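The same mask trick can be written at the C level; this is only a sketch of the idea behind Example 2-2, assuming 32-bit ints and that CONST1 - CONST2 does not overflow.

int select_const(int a, int b, int const1, int const2)
{
    int mask = -(int)(a < b);          /* all ones if a < b, otherwise zero */
    return (mask & (const1 - const2)) + const2;
}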
Another way to remove branches on Pentium II and subsequent
processors is to use the cmov and fcmov instructions. Example 2-3
shows how to change a test and branch instruction sequence using cmov
and eliminate a branch. If the test sets the equal flag, the value in ebx
will be moved to eax. This branch is data-dependent, and is
representative of an unpredictable branch.
Example 2-3 Eliminating Branch with CMOV Instruction
test ecx, ecx
jne 1h
mov eax, ebx
1h:
; To optimize code, combine jne and mov into one cmovcc
; instruction that checks the equal flag
test ecx, ecx   ; test the flags
cmove eax, ebx  ; if the equal flag is set, move
                ; ebx to eax - the 1h: tag is no longer
                ; needed
The cmov and fcmov instructions are available on the Pentium II and
subsequent processors, but not on Pentium processors and earlier 32-bit
Intel architecture processors. Be sure to check whether a processor
supports these instructions with the
cpuid instruction.
Spin-Wait and Idle Loops
The Pentium 4 processor introduces a new pause instruction; the
instruction is architecturally a
nop on all IA-32 implementations. To the
Pentium 4 processor, this instruction acts as a hint that the code
sequence is a spin-wait loop. Without a
pause instruction in such loops,
the Pentium 4 processor may suffer a severe penalty when exiting the
loop because the processor may detect a possible memory order
violation. Inserting the
pause instruction significantly reduces the
likelihood of a memory order violation and as a result improves
performance.
In Example 2-4, the code spins until memory location A matches the
value stored in the register
eax. Such code sequences are common when
protecting a critical section, in producer-consumer sequences, for
barriers, or other synchronization.
Example 2-4 Use of pause Instruction
lock:   cmp eax, A
        jne loop
        ; code in critical section:
loop:   pause
        cmp eax, A
        jne loop
        jmp lock
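A C-level counterpart to Example 2-4 can use the _mm_pause intrinsic; this sketch assumes the SSE intrinsic header, and the flag variable is hypothetical. A real critical section would also need the appropriate synchronization primitives around it.

#include <xmmintrin.h>                 /* _mm_pause */

void spin_wait(volatile const int *flag, int expected)
{
    while (*flag != expected)
        _mm_pause();                   /* hint that this is a spin-wait loop */
}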
Static Prediction
Branches that do not have a history in the BTB (see the “Branch
Prediction” section) are predicted using a static prediction algorithm.
The Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo
processors have similar static prediction algorithms:
•Predict unconditional branches to be taken.
•Predict indirect branches to be NOT taken.
General Optimization Guidelines 2
In addition, conditional branches in processors based on the Intel
NetBurst microarchitecture are predicted using the following static
prediction algorithm:
•Predict backward conditional branches to be taken. This rule is
suitable for loops.
•Predict forward conditional branches to be NOT taken.
Pentium M, Intel Core Solo and Intel Core Duo processors do not
statically predict conditional branches according to the jump direction.
All conditional branches are dynamically predicted, even at their first
appearance.
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange
code to be consistent with the static branch prediction algorithm: make the
fall-through code following a conditional branch be the likely target for a
branch with a forward target, and make the fall-through code following a
conditional branch be the unlikely target for a branch with a backward target.
Example 2-5 illustrates the static branch prediction algorithm. The body
of an
if-then conditional is predicted to be executed.
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm

Forward conditional branches are not taken (fall through):
    if <condition> {
        ...
    }
    for <condition> {
        ...
    }

Backward conditional branches are taken:
    loop {
        ...
    } <condition>

Unconditional branches are taken:
    JMP
Example 2-6 and Example 2-7 provide basic rules for the static prediction
algorithm.
In Example 2-6, the backward branch (JC Begin) is not in the BTB the
first time through; therefore, the BTB does not issue a prediction. The
static predictor, however, will predict the branch to be taken, so a
misprediction will not occur.
Example 2-6 Static Taken Prediction Example
Begin:  mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, edx, 7
        jc   Begin
The first branch instruction (JC Begin) in the Example 2-7 code segment is a
conditional forward branch. It is not in the BTB the first time through,
but the static predictor will predict the branch to fall through.
The static prediction algorithm also correctly predicts that the Call
Convert instruction will be taken, even before the branch has any
branch history in the BTB.
The return address stack mechanism augments the static and dynamic
predictors to optimize specifically for calls and returns. It holds 16
entries, which is large enough to cover the call depth of most programs.
If there is a chain of more than 16 nested calls and more than 16 returns
in rapid succession, performance may be degraded.
The trace cache maintains branch prediction information for calls and
returns. As long as the trace with the call or return remains in the trace
cache and if the call and return targets remain unchanged, the depth
limit of the return address stack described above will not impede
performance.
To enable the use of the return stack mechanism, calls and returns must
be matched in pairs. If this is done, the likelihood of exceeding the
stack depth in a manner that will impact performance is very low.
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near
calls must be matched with near returns, and far calls must be matched with
far returns. Pushing the return address on the stack and jumping to the routine
to be called is not recommended since it creates a mismatch in calls and
returns.
Calls and returns are expensive; use inlining for the following reasons:
•Parameter passing overhead can be eliminated.
•In a compiler, inlining a function exposes more opportunity for
optimization.
•If the inlined routine contains branches, the additional context of the
caller may improve branch prediction within the routine.
•A mispredicted branch can lead to larger performance penalties
inside a small function than if that function is inlined.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality)
Selectively inline a function where doing so decreases code size or if the
function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline
a function if doing so increases the working set size beyond what will fit in the
trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there
are more than 16 nested calls and returns in rapid succession, consider
transforming the program with inlining to reduce the call depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor
inlining small functions that contain branches with poor prediction rates. If a
branch misprediction results in a RETURN being prematurely predicted as
taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last
statement in a function is a call to another function, consider converting the
call to a jump. This will save the call/return overhead as well as an entry in the
return stack buffer.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put
more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put
more than two end loop branches in a 16-byte chunk.
Branch Type Selection
The default predicted target for indirect branches and calls is the
fall-through path. The fall-through prediction is overridden if and when
a hardware prediction is available for that branch. The predicted branch
target from branch prediction hardware for an indirect branch is the
previously executed branch target.
The default prediction to the fall-through path is only a significant issue
if no branch prediction is available, due to poor code locality or
pathological branch conflict problems. For indirect calls, predicting the
fall-through path is usually not an issue, since execution will likely
return to the instruction after the associated return.
Placing data immediately following an indirect branch can cause a
performance problem. If the data consists of all zeros, it looks like a long
stream of adds to memory destinations, which can cause resource
conflicts and slow down branch recovery. Also, the data immediately
following indirect branches may appear as branches to the branch
prediction hardware, which can branch off to execute other data pages.
This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 12. (M impact, L generality) When
indirect branches are present, try to put the most likely target of an indirect
branch immediately following the indirect branch. Alternatively, if indirect
branches are common but they cannot be predicted by branch prediction
hardware, then follow the indirect branch with a UD2 instruction, which will
stop the processor from decoding down the fall-through path.
Indirect branches resulting from code constructs, such as switch
statements, computed GOTOs or calls through pointers, can jump to an
arbitrary number of locations. If the code sequence is such that the target
destination of a branch goes to the same address most of the time, then
the BTB will predict accurately most of the time. Since only one taken
(non-fall-through) target can be stored in the BTB, indirect branches
with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing
additional conditional branches. Adding a conditional branch to a target
is fruitful if and only if:
•The branch direction is correlated with the branch history leading up
to that branch, that is, not just the last target, but how it got to this
branch.
•The source/target pair is common enough to warrant using the extra
branch prediction capacity. (This may increase the number of
overall branch mispredictions, while improving the misprediction of
indirect branches. The profitability is lower if the number of
mispredicting branches is very large).
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch
has two or more common taken targets, and at least one of those targets is
correlated with branch history leading up to the branch, then convert the
indirect branch into a tree where one or more indirect branches are preceded
by conditional branches to those targets. Apply this “peeling” procedure to the
common target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions
by enhancing the predictability of branches, even at the expense of
adding more branches. The added branches must be very predictable for
this to be worthwhile. One reason for such predictability is a strong
correlation with preceding branch history, that is, the directions taken on
preceding branches are a good indicator of the direction of the branch
under consideration.
Example 2-8 shows a simple example of the correlation between a target
of a preceding conditional branch with a target of an indirect branch.
Example 2-8 Indirect Branch With Two Favored Targets
function ()
{
int n = rand(); // random integer 0 to RAND_MAX
if( !(n & 0x01) ){ // n will be 0 half the times
n = 0; // updates branch history to predict taken
}
// indirect branches with multiple taken targets
// may have lower prediction rates
switch (n) {
case 0: handle_0(); break; // common target, correlated with
// branch history that is forward taken
case 1: handle_1(); break;// uncommon
case 3: handle_3(); break;// uncommon
default: handle_other(); // common target
}
}
Correlation can be difficult to determine analytically, either for a
compiler or sometimes for an assembly language programmer. It may be
fruitful to evaluate performance with and without this peeling, to get the
best performance from a coding effort. An example of peeling out the
most favored target of an indirect branch with correlated branch history
is shown in Example 2-9.
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction
function ()
{
int n = rand();// random integer 0 to RAND_MAX
if( !(n & 0x01) ) n = 0;
// n will be 0 half the times
if (!n) handle_0(); // peel out the most common target
// with correlated branch history
else {
switch (n) {
case 1: handle_1(); break;// uncommon
case 3: handle_3(); break;// uncommon
default: handle_other();// make the favored target in
// the fall-through path
}
}
}
Loop Unrolling
The benefits of unrolling loops are:
•Unrolling amortizes the branch overhead, since it eliminates
branches and some of the code to manage induction variables.
•Unrolling allows you to aggressively schedule (or pipeline) the loop
to hide latencies. This is useful if you have enough free registers to
keep variables live as you stretch out the dependence chain to
expose the critical path.
•Unrolling exposes the code to various other optimizations, such as
removal of redundant loads, common subexpression elimination,
and so on.
•The Pentium 4 processor can correctly predict the exit branch for an
inner loop that has 16 or fewer iterations, if that number of iterations
is predictable and there are no conditional branches in the loop.
Therefore, if the loop body size is not excessive, and the probable
number of iterations is known, unroll inner loops until they have a
maximum of 16 iterations. With the Pentium M processor, do not
unroll loops more than 64 iterations.
The potential costs of unrolling loops are:
•Excessive unrolling, or unrolling of very large loops, can lead to
increased code size. This can be harmful if the unrolled loop no
longer fits in the trace cache (TC).
•Unrolling loops whose bodies contain branches increases demands
on the BTB capacity. If the number of iterations of the unrolled loop
is 16 or less, the branch predictor should be able to correctly predict
branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll small
loops until the overhead of the branch and the induction variable accounts,
generally, for less than about 10% of the execution time of the loop.
Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid
unrolling loops excessively, as this may thrash the trace cache or instruction
cache.
Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll
loops that are frequently executed and that have a predictable number of
iterations to reduce the number of iterations to 16 or fewer, unless this
increases code size so that the working set no longer fits in the trace cache or
instruction cache. If the loop body contains more than one conditional branch,
then unroll so that the number of iterations is 16/(# conditional branches).
Example 2-10 shows how unrolling enables other optimizations.
Example 2-10 Loop Unrolling
Before unrolling:
do i=1,100
if (i mod 2 == 0) then a(i) = x
else a(i) = y
enddo
After unrolling:
do i=1,100,2
a(i) = y
a(i+1) = x
enddo
In this example, a loop that executes 100 times assigns x to every
even-numbered element and
y to every odd-numbered element. By
unrolling the loop you can make both assignments each iteration,
removing one branch in the loop body.
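For reference, the same transformation written in C (zero-based indexing, so the even/odd roles follow the element's position rather than the Fortran-style index; n is assumed even in the unrolled version):

void fill_before(float *a, float x, float y, int n)
{
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0) a[i] = x;
        else            a[i] = y;
    }
}

void fill_after(float *a, float x, float y, int n)
{
    for (int i = 0; i < n; i += 2) {
        a[i]     = x;                  /* both assignments each iteration; */
        a[i + 1] = y;                  /* the even/odd test is gone        */
    }
}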
Compiler Support for Branch Prediction
Compilers can generate code that improves the efficiency of branch
prediction in the Pentium 4 and Pentium M processors. The Intel C++
Compiler accomplishes this by:
•keeping code and data on separate pages
•using conditional move instructions to eliminate branches
•generating code that is consistent with the static branch prediction
algorithm
•inlining where appropriate
•unrolling, if the number of iterations is predictable
With profile-guided optimization, the Intel compiler can lay out basic
blocks to eliminate branches for the most frequently executed paths of a
function or at least improve their predictability. Branch prediction need
not be a concern at the source level. For more information, see the
Intel® C++ Compiler User’s Guide.