For more information
Call to action
Abstract
Since 1996, HP and AMD have collaborated to provide high-performance, energy efficient
solutions that deliver quality, variety, and value in industry-standard servers. This collaboration
includes the adoption of the latest multi-core AMD Opteron™ processors. This technology brief
discusses current and near future AMD Opteron processors and the evolving AMD Opteron
processor microarchitecture.
Introduction
The AMD Opteron family of processors is AMD’s offering for the industry-standard server market.
HP is providing AMD processors in the ProLiant server product line to offer enterprise customers
expanded options for improved performance while maintaining cost-effective infrastructures.
AMD Opteron processors are based on the x86 architecture with AMD64 technology and feature
an integrated memory controller and the Direct Connect I/O Architecture, which uses
HyperTransport™ technology. In addition, AMD multi-core processors feature AMD Virtualization™
and AMD PowerNow!™ technologies.
x86 architecture
AMD Opteron processors adhere to the x86 instruction set architecture to be compatible with the
wealth of 32-bit software applications available. In other words, the software interface of the
AMD Opteron processor remains the same as that of the x86 architecture with regard to memory
addressing size, instruction sets, and register design.
32-bit operations
A 32-bit processor has general-purpose registers (GPRs) that are 32 bits wide and can operate on
an integer data stream that is 32 bits wide. In addition, a 32-bit processor can hold 32 bits of
memory address data in a single register, for a maximum of 4 GB of addressable memory.
The x86 architecture supports Physical Address Extension (PAE), which extends physical
addressing to 36 bits for a maximum of 64 GB of physical addressable memory. However, the OS
and applications must be written to take advantage of the additional memory addressing.
As shown in Table 1, the x86, 32-bit instruction set of the AMD Opteron family of processors
includes the following:
• Standard x86 instructions, which are general-purpose arithmetic functions
• Single Instruction, Multiple Data (SIMD) instructions, which let one instruction work
simultaneously on multiple data items. These include Streaming SIMD Extensions (SSE), SSE2,
SSE3, and SSE4a instructions.
• x87 floating point instructions
AMD Opteron processors support 32-bit addressing as well as the 36-bit PAE.
Table 1. 32-bit x86 instructions common to AMD processors
Instruction name | Description | Register type | Size of registers | Number of registers
Standard x86 | Instructions for logical and arithmetic operations and address calculations; includes 16-bit index registers for memory pointers | GPR | 32-bit | 8
MMX | Multimedia instructions that allow the processor to do 64-bit SIMD operations | MMX | 64-bit | 8
x87 | Instructions for floating point calculations | FP | 80-bit* | 8
SSE, SSE2, SSE3, and SSE4a | SSE improved upon the MMX instructions and allowed processors to do 128-bit SIMD floating-point operations. SSE2 added 64-bit parallel floating point numeric support. It also added new instructions to support 128-bit SIMD integer operations. SSE3 instructions include 13 instructions that accelerate performance of SSE technology, SSE2 technology, and x87 floating-point math capabilities. SSE4a instructions include two new SSE instructions. SSE4a instructions also add support for unaligned SSE load operations, which formerly required 16-byte alignment. | XMM | 128-bit | 8
* According to the article “An Introduction to 64-bit Computing and x86-64” by Jon Stokes1, the “x87 uses 80-bit registers
to do double-precision floating point. The floats themselves are 64-bit, but the processor converts them to an internal, 80-bit
format for increased precision when doing computations.”
AMD Opteron processors support the AMD 3DNow!™ instruction set, AMD’s version of multimedia
instructions. The 3DNow! set added SIMD instructions to improve the vector-processing (floating
point) performance of graphic-intensive and multimedia applications.
AMD64 technology
Introduced in 2003, AMD64 technology is the AMD microarchitecture and instruction set that
provides full support for 64-bit operating systems and applications. The most important feature of
AMD64 is support for very large virtual and physical memory in a flat address space.
Instruction set and registers
AMD64 instructions can take advantage of the 64-bit wide registers in AMD Opteron processors.
To support the AMD64 instructions, the registers expand to include the following:
• Eight new 64-bit GPRs
• Extensions of the eight original, 32-bit GPRs to 64 bits
• Eight new 128-bit registers for SSE, SSE2, and SSE3 instructions
These registers are used by the applications only when running the processors in 64-bit long mode.
1 Available at http://arstechnica.com/cpu/03q1/x86-64/x86-64-1.html
Operating modes
AMD Opteron processors use three different operating modes: 64-bit long mode, 64-bit
compatibility mode, and 32-bit legacy mode. The 64-bit long mode requires a 64-bit OS and an
application recompiled to use the 64-bit registers. In other words, the full capabilities of the
expanded register set are available only when both the OS and the application support 64 bits.
The 64-bit compatibility mode requires a 64-bit OS, but can use a 32-bit application. The
additional registers are available to the OS, but not to the 32-bit application, because it cannot
make use of them. When running in legacy mode, the processor acts just like a 32-bit processor,
and the extra registers are not available (Table 2).
Table 2. Operating modes for AMD Opteron processors2
Mode | OS required | Application recompile required? | Register extensions available? | GPR width (bits)
64-bit long mode | 64-bit OS | Yes | Yes | 64
64-bit compatibility mode | 64-bit OS | No | Yes (to OS); No (to application) | 32
32-bit legacy mode | 32-bit OS | No | No | 32
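The rows of Table 2 can be read as a lookup keyed on the OS and application word sizes. The following Python sketch is illustrative only; the function name and the tuple layout are assumptions made for this example and are not part of any AMD or HP tool.

```python
# Table 2 as a lookup: (OS bits, application bits) ->
# (mode name, register extensions visible to the application, GPR width in bits).
MODES = {
    (64, 64): ("64-bit long mode", True, 64),
    (64, 32): ("64-bit compatibility mode", False, 32),
    (32, 32): ("32-bit legacy mode", False, 32),
}


def operating_mode(os_bits: int, app_bits: int) -> tuple:
    """Return the Table 2 row that applies to an OS/application pairing."""
    try:
        return MODES[(os_bits, app_bits)]
    except KeyError:
        # For example, a 64-bit application cannot run on a 32-bit OS.
        raise ValueError(f"unsupported pairing: {os_bits}-bit OS, {app_bits}-bit app") from None


for pairing in MODES:
    print(pairing, "->", operating_mode(*pairing))
```

In compatibility mode the extended registers remain available to the 64-bit OS; the second field describes only what the 32-bit application can use.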
Memory addressability
The AMD Opteron registers are at least 64 bits wide. When operating in 64-bit long mode, the
AMD Opteron processors support up to 48 bits (256 terabytes) for physical memory and use 64
bits for virtual memory.
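The memory limits quoted in this brief follow directly from the address width: an n-bit address can select 2^n bytes. The short Python sketch below (the helper name is an assumption for illustration) reproduces the 4 GB, 64 GB, and 256 TB figures using binary units.

```python
# Illustrative only: bytes addressable with a given number of address bits.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]


def addressable(bits: int) -> str:
    """Return 2**bits bytes as a human-readable string in binary units."""
    size = 2 ** bits
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size} {UNITS[unit]}"


for bits in (32, 36, 48):
    print(f"{bits}-bit addressing: {addressable(bits)}")
```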
Naming conventions
First-generation single-core AMD Opteron processors (Socket 940 and Socket 939) have three-digit
model numbers in the form XZZ, and third-generation Quad-Core AMD Opteron processors (Socket
F and Socket AM2) have four-digit model numbers XYZZ. AMD Opteron processor “generations”
are called Revisions.
For all AMD Opteron processors, the first digit “X” specifies the number of CPUs on the target
machine:
• 1000 Series - Single-processor systems
• 2000 Series - Dual-processor systems
• 8000 Series - Systems with up to 8 processors
The second digit, Y, indicates socket generation, where “2” indicates Socket AM2 or Socket F
(1207). Series 12ZZ processors are based on Socket AM2; Series 22ZZ and 82ZZ processors are
based on Socket F (1207). If the second digit is “3,” it stands for third-generation AMD Opteron
processors for Socket AM2 and Socket F (1207). If the second digit is “4,” it indicates Six-Core
AMD Opteron processors.
2 From the document titled “AMD64 Architecture Programmer’s Manual, Vol. 1: Application Programming,”
available at www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf
The last two digits, ZZ, indicate the relative performance within the series. Higher numbers indicate
higher performance.
In addition, the model number can include a suffix designator to indicate a non-standard power
level. HE designates a lower power version, and SE a higher power version. For example, Model
2220, Model 2220 HE, and Model 2220 SE all offer equivalent performance, but differ in power
consumption.
The AMD website includes a quick reference guide3 that details each processor part number by
socket, revision (stepping), core frequency, manufacturing process (45 nm, 65 nm, or 90 nm),
HyperTransport frequency, and wattage.
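The naming rules above can be expressed as a small decoder. This Python sketch is a hedged illustration only: the function name, the dictionary keys, and the regular expression are assumptions made for this example, and the mapping covers only the digits and suffixes described in this section.

```python
import re

SERIES = {"1": "single-processor systems",
          "2": "dual-processor systems",
          "8": "systems with up to 8 processors"}
GENERATION = {"2": "Socket AM2 or Socket F (1207)",
              "3": "third-generation, Socket AM2 or Socket F (1207)",
              "4": "Six-Core AMD Opteron"}
SUFFIX = {"": "standard power", "HE": "lower power", "SE": "higher power"}


def decode_model(model: str) -> dict:
    """Decode an XZZ or XYZZ model number with an optional HE/SE suffix."""
    m = re.fullmatch(r"([128])(\d?)(\d{2})(?:\s*(HE|SE))?", model.strip())
    if m is None:
        raise ValueError(f"unrecognized model number: {model!r}")
    x, y, zz, suffix = m.group(1), m.group(2), m.group(3), m.group(4) or ""
    if y == "":
        generation = "first-generation (three-digit model number)"
    else:
        generation = GENERATION.get(y, "unknown socket generation")
    return {"series": SERIES[x],
            "generation": generation,
            "relative_performance": int(zz),
            "power": SUFFIX[suffix]}


print(decode_model("2220 SE"))
```

Note that ZZ is only meaningful for comparison within one series; the decoder returns it as a number without assigning it any absolute meaning.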
Direct Connect I/O Architecture
The AMD Direct Connect I/O Architecture replaces the traditional front side bus with point-to-point
HyperTransport Technology links and an integrated memory controller connected to dedicated
memory banks for each processor.
Integrated memory controller and dedicated memory banks
Each AMD Opteron processor contains an integrated dual-channel SDRAM memory controller that
is directly connected to dedicated memory banks. Integrating the controller into the processor
means that memory performance can scale linearly based on the number of processors in a multiprocessor system. For example, in a multi-processor system, the integrated memory controller allows
for multiple memory requests in parallel, thereby increasing the effective memory bandwidth and
decreasing average memory latency.
The memory controller operates at a frequency independent of—and usually slower than—the
processor core. It has a 128-bit interface that is capable of supporting up to eight DDR2 DIMMs.
With four DDR2-800 DIMMs per channel, the memory bandwidth is up to 12.8 GB/s. The 128-bit
interface can be divided into two independent 64-bit memory channels for memory controller
utilization and better memory performance.
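The 12.8 GB/s figure is the product of the interface width and the transfer rate. The Python sketch below shows the arithmetic; the function name is an assumption for illustration, and the result is a theoretical peak rather than a measured value.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, megatransfers_per_s: int) -> float:
    """Theoretical peak bandwidth: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * megatransfers_per_s / 1000  # MB/s -> GB/s


# DDR2-800 performs 800 million transfers per second.
print(peak_bandwidth_gb_s(128, 800))  # full 128-bit interface
print(peak_bandwidth_gb_s(64, 800))   # one independent 64-bit channel
```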
3 http://www.amdcompare.com/us-en/AMD Opteron/
HyperTransport Technology
HyperTransport is a point-to-point interconnect with two unidirectional links (see Figure 1) that
directly connect the processors to each other and connect each processor to its dedicated memory
banks, as well as to other I/O chipsets.4 Compared to a shared, parallel front-side bus,
HyperTransport has the advantages of no overhead for bus arbitration and easier signal integrity
maintenance, resulting in a scalable, high-bandwidth architecture.
Each 16-bit (2-byte) HyperTransport link is double-pumped, performing two data transfers per clock
cycle. From HyperTransport 1.0 in 2001 to HyperTransport 3.0 in 2008, the maximum clock
speed and transfer rate increased from 800 MHz (1,600 MT/s5, or 1.6 GT/s) to a maximum of
2.4 GHz (4.8 GT/s) in each direction. This gives each HyperTransport 3.0 link a maximum data rate
of 4.8 GT/s × 2 bytes per transfer, or 9.6 GB/s (19.2 GB/s aggregate data rate).
Figure 1. The HyperTransport interconnect separates memory and I/O traffic and directly attaches memory to
each processor, allowing memory capacity to scale with the number of processors.
4 HyperTransport Technology was invented at AMD with contributions from industry partners and is managed and licensed by the
HyperTransport Technology Consortium, a Texas non-profit corporation.
5 MT/s, or megatransfers per second, equals the speed of the link in millions of cycles per second times the number of transfers per cycle.
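The HyperTransport link rates quoted in this section can be recomputed from the clock speed, the two transfers per clock cycle, and the 2-byte link width. The following Python sketch shows that arithmetic; the function name and return layout are assumptions for illustration.

```python
def ht_link_rate(clock_ghz: float, width_bits: int = 16, transfers_per_clock: int = 2):
    """Return (GT/s, GB/s per direction, GB/s aggregate) for one HyperTransport link."""
    gt_s = clock_ghz * transfers_per_clock       # double-pumped: two transfers per clock
    per_direction = gt_s * width_bits / 8        # bytes moved per transfer
    return gt_s, per_direction, 2 * per_direction  # two unidirectional links


print(ht_link_rate(0.8))  # 800 MHz clock
print(ht_link_rate(2.4))  # 2.4 GHz clock
```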
Multi-core technologies
In the past, the most common way to improve processor performance was to increase core
frequency and/or cache size. However, both of these solutions increase power consumption (and
heat generation) and have other limitations. Alternatively, higher performance can be achieved by
using multiple execution cores per processor. Multi-core processors run applications more efficiently
and allow multi-threaded software to achieve higher performance, while maintaining a similar
power budget to single-core processors. Also, multi-core processors are increasingly attractive with
reductions in the manufacturing process (for example, from 90 nm to 65 nm to 45 nm). This is
because smaller cores require less power, which permits more cores to be built into a single
processor.
AMD introduced its first dual-core AMD64 processor in 2005; it was manufactured using a 90 nm
process. The AMD Opteron processor is essentially divided into two parts: execution and
communications, with a system request interface and crossbar switch linking these two parts (see
Figure 2). The crossbar switch architecture enabled AMD Opteron processors to transition easily
from single-core to dual-core processors without fundamental design changes.
Each execution core includes a 64-KB/64-KB data/instruction L1 cache and a 1-MB L2 cache. The
system request interface manages and prioritizes the processor requests to the crossbar switch. The
crossbar switch connects both processor cores directly to communications: I/O (through
HyperTransport links) and the integrated memory controller. The memory controller and the
HyperTransport links remain the same as in a single core system.
Figure 2. The Socket F (1207) and Socket AM2 designs support dual-core AMD Revision F processors.
The primary difference between the processors designed for single-, dual-, or multi-processor
systems is in the way the processor uses the HyperTransport link(s). In the 1000 series AMD
Opteron processors, the single HyperTransport link can only connect to I/O in a non-coherent link.
This means that the 1000 series processors are limited to single-processor systems. In the 2000
series, one of three HyperTransport links can connect to one other AMD Opteron processor in a
coherent link. The other links can connect to I/O (non-coherent links); thus, 2000 series AMD
Opteron processors can be used in dual-processor systems. With the 8000 series AMD Opteron
processors, all three HyperTransport links can connect to other AMD Opteron processors or to I/O.
For more information about multi-core processors, see the AMD whitepaper titled “Multi-Core
Processors—The Next Evolution in Computing.”
6
Dual-core Revision F processors
The dual-core Revision F (Rev F) was introduced in 2006. A number of features from previous
revisions remained unchanged with the introduction of the Rev F processor:
• 64-KB/64-KB data/instruction L1 cache per core
• 1-MB L2 cache per core
• 1-GHz HyperTransport
Revision F also included key improvements:
• DDR2 memory support
• Hardware assisted virtualization (AMD-V™)
• Power management (PowerNow!) improvements
• Quad-core upgradeability
DDR2 operates at 1.8 V compared to 2.5 V for DDR, reducing power requirements by up to
30 percent.
AMD Virtualization™ hardware assistance directly supports virtualization with industry-standard
servers, which reduces complexity and improves performance.
PowerNow! Technology with Optimized Power Management reduces power requirements and heat
generation by reducing the processor’s clock speed and voltage during periods when the CPU is
not fully utilized. Up to five power states are supported. Power consumption at idle is reduced by
up to 75 percent.
Quad-core upgradeability means that Revision F sockets are pin-compatible with and will support
quad-core processors within the same power and thermal envelopes.7
6
AMD whitepaper “Multi-Core Processors—The Next Evolution in Computing” is accessible at
http://multicore.amd.com/Resources/33211A_Multi-Core_WP_en.pdf.
7 HP ProLiant DL145 G3 servers do not support quad-core Revision F processor upgradeability.
Quad-Core AMD Opteron processors
AMD introduced the quad-core AMD Opteron (see Figure 3) in September of 2007. It included
several innovations:
• A new core microarchitecture – K8L (true quad-core on a single die)
• Extensions to AMD64 instruction set – bit manipulation and SSE, SSE2, SSE3, and SSE4a
• 128-bit FPU for improved floating point and graphics performance
• AMD Smart Fetch Technology
• Support for DDR2 memory
• Dedicated 64-KB L1 cache and 512-KB L2 cache for each core
• 2-MB to 6-MB L3 cache shared among all cores
• 65 nm silicon process technology
• Enhanced AMD PowerNow! with Independent Dynamic Core Technology and Dual Dynamic
Power Management
• AMD-V™ with Rapid Virtualization Indexing
Figure 3. The Socket F (1207) design supports Quad-Core AMD Opteron™ processors.
AMD Smart Fetch Technology
Smart Fetch Technology allows cores to enter a "halt" state during idle processing times, causing
them to draw less power. Before entering the halt state, data from the L1 and L2 caches are
transferred to the shared L3 cache so that the contents of the idle cores can be retrieved.
Enhanced AMD PowerNow! Technology
Native quad-core technology enables enhancements to AMD PowerNow! Technology across all
four cores. Two power management enhancements, Independent Dynamic Core Technology and
Dual Dynamic Power Management™, provide optimum performance-per-watt and power savings.
Independent Dynamic Core Technology
AMD’s Independent Dynamic Core Technology allows each core to independently adjust its
frequency to reduce power use based on application requirements (Figure 4). This enables more
precise power management, which can reduce the total cost of ownership (TCO) of a data center.
Figure 4. Independently controlled cores reduce power use. The voltage is locked to the core with the highest
P-state.
Dual Dynamic Power Management
Dual Dynamic Power Management provides separate (split) power planes for the cores and
memory controller. This can reduce idle power consumption and allow individual processors to be
managed in multi-socket systems, thereby creating power-saving opportunities without
compromising performance.
Rapid Virtualization Indexing
Rapid Virtualization Indexing is an innovation in AMD-V technology that reduces the overhead
associated with software virtualization. With software virtualization, processor overhead increases
as each guest OS and application vies for the host machine’s physical resources; this results in
decreased performance. Also, memory latency increases as the virtual machine monitor, or
hypervisor, dynamically translates the memory addresses sent to and received from the memory
controller. The hypervisor does this so that each guest application does not realize that it is being
virtualized. The translation from virtual machine memory address to host machine physical address
is achieved by using “shadow page tables” (Figure 5).
Rapid Virtualization Indexing is the AMD implementation of nested page tables technology, which
allows virtual machines to manage memory more directly. Rapid Virtualization Indexing eliminates
the time the hypervisor spends managing shadow pages in software, and accelerates this task with
much faster hardware-based page management. This hardware-based management reduces
hypervisor overhead and improves the speed of the guest OS.
Figure 5. Hardware-based management using nested page tables reduces hypervisor overhead, compared to
software-based management of shadow page tables, thus improving the speed of the guest OS.
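The difference between the two approaches can be pictured with a toy model. In the Python sketch below, which is a conceptual illustration and not a description of the actual hardware page-table format, a nested lookup walks the guest table and then the host table, while a shadow table is a precomputed combination that software must rebuild whenever the guest changes a mapping. The page size, table contents, and function names are all assumptions made for this example.

```python
PAGE = 4096  # assumed page size for the toy model


def translate(table: dict, addr: int) -> int:
    """Map an address through a single-level page table (page number -> page number)."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset


guest_table = {0: 5, 1: 7}    # guest virtual page  -> guest physical page
host_table = {5: 40, 7: 12}   # guest physical page -> host physical page


def nested_translate(addr: int) -> int:
    """Nested paging: both tables are walked at translation time."""
    return translate(host_table, translate(guest_table, addr))


def build_shadow(guest: dict, host: dict) -> dict:
    """Shadow paging: software precomputes guest virtual -> host physical."""
    return {vpage: host[gpage] for vpage, gpage in guest.items()}


shadow = build_shadow(guest_table, host_table)
addr = PAGE + 4  # guest virtual page 1, offset 4
print(nested_translate(addr), translate(shadow, addr))

# A guest remapping takes effect immediately in the nested walk, but the
# shadow table is stale until software rebuilds it.
guest_table[1] = 5
print(nested_translate(addr), translate(shadow, addr))
```

The stale result in the last line stands in for the bookkeeping work that the hypervisor must do in software when shadow page tables are used.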
Average CPU Power metric
Because of rising power and cooling costs in data centers, organizations are adopting a new
paradigm that focuses on maximizing system energy efficiency down to the component level. This is
especially true for the processor, which represents a significant percentage of power use and heat
generation. AMD’s introduction of power management enhancements such as Dual Dynamic Power
Management and Independent Dynamic Core Technology helps to reduce processor power use.
However, if data center planners do not know the actual power required by the processor, they
must use the maximum power ratings listed in the engineering specifications.
To accurately measure processor power consumption, its power use must be isolated from the
power use of other components on the motherboard. To accomplish this, AMD developed specially
instrumented motherboards with voltage regulators that deliver power to individual processor
power rails. This special instrumentation allows AMD to measure processor power use of all
processor rails during standard test workloads such as floating point, integer, Web, and
transaction processing.
From these test results, AMD developed an average CPU power (ACP) metric to more accurately
estimate the power consumption of AMD Opteron processors during peak workloads (Table 3). The
ACP metric allows data centers to more accurately forecast their power requirements and reap the
benefits of lower power and cooling costs.
Table 3. Thermal Design Power versus Average CPU Power
Thermal Design Power (watts) | Average CPU Power (watts)
137 | 105
115 | 75
75 | 55
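To see what the difference between the two ratings means for planning, the values in Table 3 can be multiplied across a number of sockets. The Python sketch below is a simple illustration; the function name and the 100-socket example are assumptions, and the totals cover processors only, not the rest of the server.

```python
# Table 3: Thermal Design Power (watts) -> Average CPU Power (watts).
TDP_TO_ACP = {137: 105, 115: 75, 75: 55}


def processor_budget(tdp_watts: int, sockets: int) -> dict:
    """Compare a processor power budget planned on TDP with one planned on ACP."""
    acp_watts = TDP_TO_ACP[tdp_watts]
    return {"tdp_total": tdp_watts * sockets,
            "acp_total": acp_watts * sockets,
            "difference": (tdp_watts - acp_watts) * sockets}


# Hypothetical example: 100 sockets populated with 115 W TDP processors.
print(processor_budget(115, 100))
```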
Independent and combined memory channel modes
The Quad-Core AMD Opteron processor includes two DRAM controllers that support DDR2 DIMMs.
Each DRAM controller controls one 64-bit DDR DIMM channel that connects to a series of DIMMs.
The DRAM controllers can be configured to behave as a single channel (called ganged, or
combined, mode) or as two channels (called unganged, or independent, mode). Configuring the
DRAM controllers in unganged mode creates two 64-bit logical DIMMs, each equivalent to one
64-bit physical DIMM. Configuring the DRAM controllers in ganged mode creates one 128-bit logical
DIMM. Each physical DIMM of a 128-bit logical DIMM must be identical (same size and same
timing parameters).
The configuration requirements for the DRAM controllers and DIMMs are as follows:
• Both DRAM controllers must be programmed to the same frequency. All DIMMs must operate at
the same memory clock frequency, regardless of the channel on which they are connected.
• The DRAM controllers do not support mixing unbuffered and registered DIMMs on the same
channel or between channels.
• The DRAM controllers do not support mixing ECC and non-ECC DIMMs on the same channel
or between channels.
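The requirements above lend themselves to a simple configuration check. The Python sketch below is hypothetical: the Dimm fields, the timing label, and the assumption that the DIMM in slot i of one channel pairs with slot i of the other in ganged mode are introduced for illustration and do not describe an actual BIOS or AMD interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    size_gb: int
    clock_mhz: int
    registered: bool
    ecc: bool
    timings: str = "5-5-5"  # hypothetical timing label


def check_config(channel_a: list, channel_b: list, ganged: bool) -> list:
    """Return a list of rule violations; an empty list means the layout passes."""
    problems = []
    dimms = channel_a + channel_b
    if len({d.clock_mhz for d in dimms}) > 1:
        problems.append("all DIMMs must run at the same memory clock frequency")
    if len({d.registered for d in dimms}) > 1:
        problems.append("unbuffered and registered DIMMs cannot be mixed")
    if len({d.ecc for d in dimms}) > 1:
        problems.append("ECC and non-ECC DIMMs cannot be mixed")
    if ganged:
        # Assumed pairing: slot i of channel A with slot i of channel B.
        same_count = len(channel_a) == len(channel_b)
        matched = all((a.size_gb, a.timings) == (b.size_gb, b.timings)
                      for a, b in zip(channel_a, channel_b))
        if not (same_count and matched):
            problems.append("ganged mode needs identical DIMMs in each 128-bit pair")
    return problems


small = Dimm(size_gb=2, clock_mhz=400, registered=True, ecc=True)
large = Dimm(size_gb=4, clock_mhz=400, registered=True, ecc=True)
print(check_config([small], [small], ganged=True))
print(check_config([small], [large], ganged=True))
```

The same mismatched pair passes in unganged mode, because each 64-bit channel is then treated as its own logical DIMM.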
Six-Core AMD Opteron processors
AMD introduced the six-core Opteron processor, formerly code-named "Istanbul," in June 2009.
According to AMD, the six-core Opteron processor (Figure 6) operates within the same power and
thermal envelope as the Quad-Core Opteron processor. However, it provides a 20% to 50%
performance increase, specifically for virtualization, database, and high-performance computing
applications. The six-core processor includes several innovations:
• A dedicated 64-KB L1 cache and 512-KB L2 cache for each core
• A 6-MB L3 cache shared among all cores
• 45 nm silicon process technology
• Support for DDR2 memory
• HyperTransport™ 3 technology
• HyperTransport (HT) Assist technology
• AMD Smart Fetch Technology
• Enhanced AMD PowerNow! with Independent Dynamic Core Technology and Dual Dynamic
Power Management
• AMD-V™ with Rapid Virtualization Indexing
Figure 6. The Six-Core AMD Opteron processor operates in the same power and thermal envelope as the
Quad-Core Opteron processor while improving performance by up to 50%.
HT Assist
HT Assist helps increase performance of six-core AMD Opteron processor-based systems with four
or eight sockets. It is designed to maintain data correctness (coherence) between the processors
and minimize inter-processor communication traffic on the HyperTransport links.
In a multi-socket system, each processor has to ensure that it is working with the latest copy of the
data, or cache line, to maintain coherence. Before a processor can execute a transaction, it probes
the caches of the other processors by broadcasting coherence probes and requests data from system
memory only if there is a cache miss. All of these latency-sensitive messages (probe requests, probe
responses, data requests, and data responses) are transmitted over the HyperTransport links. For
example, one cache line coherency check in a 4-socket system can generate 10 or more messages
over the four HyperTransport links between the processors. In a 4- or 8-socket system with six-core
AMD Opteron processors (a total of 24 or 48 processor cores), this traffic can severely load the
HyperTransport links.
HT Assist uses 1 MB of each processor's 6-MB L3 cache as a directory cache to track all cache lines
stored in the multi-socket system. This allows a multi-core processor to probe its own L3 cache when
checking a cache line, called a Probe Filter Lookup, instead of broadcasting numerous cache
probes over the HyperTransport links. With HT Assist, a cache line coherency check in the
previously mentioned 4-socket system may only generate two to three messages. The Probe Filter
Lookup also reduces latency for accesses to local DRAM because there is no need to broadcast
probe requests and wait for responses.
The performance benefits of HT Assist in 4- and 8-socket systems outweigh the small decrease in
available L3 data cache. HT Assist does not need to be enabled on 2-socket systems where there is
much less cache probe traffic.
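The idea behind the Probe Filter Lookup can be sketched with a toy message count. The Python model below is a deliberate simplification and an assumption-laden illustration: it counts only probe requests and probe responses, ignores data requests and responses, and is not meant to reproduce the message counts quoted above. The class and function names are hypothetical.

```python
def broadcast_probe_messages(sockets: int) -> int:
    """Without a directory: probe every other socket and collect one response from each."""
    return 2 * (sockets - 1)


class ProbeFilter:
    """Toy directory: records which socket, if any, holds each cache line."""

    def __init__(self):
        self.owner = {}  # cache line address -> socket number

    def record(self, line: int, socket: int) -> None:
        self.owner[line] = socket

    def probe_messages(self, line: int, requester: int) -> int:
        holder = self.owner.get(line)
        if holder is None or holder == requester:
            return 0  # local lookup shows no remote copy, so nothing is sent
        return 2      # one directed probe to the holder plus its response


directory = ProbeFilter()
directory.record(line=0x1000, socket=2)
print(broadcast_probe_messages(4))                       # broadcast in a 4-socket system
print(directory.probe_messages(0x1000, requester=0))     # line held by another socket
print(directory.probe_messages(0x2000, requester=0))     # line not cached anywhere
```

The third case corresponds to the reduced latency for local DRAM accesses: when the directory shows no remote copy, the request can proceed without waiting for any probe responses.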
Future AMD Opteron processors
A new generation of processor socket (called G34) is planned for the first half of 2010. It will
feature DDR3 memory, the AMD RD890 chipset, and an additional HT link. New 8- and 12-core
AMD Opteron processors, codenamed Magny-Cours, are planned for socket G34.
AMD is expected to continue improving the AMD Opteron processor family with faster memory and
HyperTransport speeds. AMD has also announced the Torrenza initiative which will provide an
additional chip socket for a co-processor on the motherboard. This socket will include a
HyperTransport bus connection, and it will support graphics and other more specialized third-party
co-processors.
Software licensing
Customers should be aware of possible changes in software licensing for use of multi-core
processors. At this writing, major OS vendors, such as Microsoft, treat multi-core processors as
performance improvements to a single processor8; they are not making a distinction for licensing
purposes among processors with one, two, four, or more cores. However, customers should check
with their OS and application vendors to determine particular licensing requirements.
In April 2009, AMD announced a ROM feature called AMD Core Select that enables IT managers
to turn off one or more cores to fine tune hardware for specific operating conditions and
workloads, or to address software licensing issues.
Conclusion
HP ProLiant servers continue to offer both AMD Opteron and Intel® Xeon™ processor architectures
to deliver the best possible choice to customers. HP ProLiant servers using the AMD Opteron
processor family have proven their performance in numerous benchmarks and systems. Multi-core
AMD Opteron technology takes advantage of multi-threaded applications and reduces latencies,
providing higher performance within the same power budget.
8 Refer to the Microsoft website http://www.microsoft.com/licensing/highlights/multicore.mspx
For more information
For additional information, refer to the resources listed below.
HyperTransport
Consortium
Multi-Core
Processors—The
Next Evolution in
Computing