
Performance Guidelines for
AMD Athlon™ 64 and
AMD Opteron™ ccNUMA
Multiprocessor Systems
Application Note
Publication # 40555    Revision: 3.00
Issue Date: June 2006
© 2006 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.
Trademarks
AMD, the AMD Arrow logo, AMD Athlon, and AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc.
HyperTransport is a licensed trademark of the HyperTransport Technology Consortium.
Linux is a registered trademark of Linus Torvalds.
Microsoft and Windows are registered trademarks of Microsoft Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Contents

Revision History
Chapter 1  Introduction
  1.1  Related Documents
Chapter 2  Experimental Setup
  2.1  System Used
  2.2  Synthetic Test
  2.3  Reading and Interpreting Test Graphs
    2.3.1  X-Axis Display
    2.3.2  Labels Used
    2.3.3  Y-Axis Display
Chapter 3  Analysis and Recommendations
  3.1  Scheduling Threads
    3.1.1  Multiple Threads-Independent Data
    3.1.2  Multiple Threads-Shared Data
    3.1.3  Scheduling on a Non-Idle System
  3.2  Data Locality Considerations
    3.2.1  Keeping Data Local by Virtue of First Touch
    3.2.2  Data Placement Techniques to Alleviate Unnecessary Data Sharing Between Nodes Due to First Touch
  3.3  Avoid Cache Line Sharing
  3.4  Common Hop Myths Debunked
    3.4.1  Myth: All Equal Hop Cases Take Equal Time
    3.4.2  Myth: Greater Hop Distance Always Means Slower Time
  3.5  Locks
  3.6  Parallelism Exposed by Compilers on AMD ccNUMA Multiprocessor Systems
Chapter 4  Conclusions
Appendix A
  A.1  Description of the Buffer Queues
  A.2  Why Is the Crossfire Case Slower Than the No Crossfire Case on an Idle System?
    A.2.1  What Resources Are Used When a Single Read-Only or Write-Only Thread Accesses Remote Data?
    A.2.2  What Resources Are Used When Two Write-Only Threads Fire at Each Other (Crossfire) on an Idle System?
    A.2.3  What Role Do Buffers Play in the Throughput Observed?
    A.2.4  What Resources Are Used When Write-Only Threads Do Not Fire at Each Other (No Crossfire) on an Idle System?
  A.3  Why Is the No Crossfire Case Slower Than the Crossfire Case on a System under a Very High Background Load (Full Subscription)?
  A.4  Why Is 0 Hop-0 Hop Case Slower Than the 0 Hop-1 Hop Case on an Idle System for Write-Only Threads?
  A.5  Why Is 0 Hop-1 Hop Case Slower Than 0 Hop-0 Hop Case on a System under High Background Load (High Subscription) for Write-Only Threads?
  A.6  Support for a ccNUMA-Aware Scheduler for AMD64 ccNUMA Multiprocessor Systems
  A.7  Tools and APIs for Thread/Process and Memory Placement (Affinity) for AMD64 ccNUMA Multiprocessor Systems
    A.7.1  Support under Linux®
    A.7.2  Support under Solaris
    A.7.3  Support under Microsoft® Windows®
  A.8  Tools and APIs for Node Interleaving in Various OSs for AMD64 ccNUMA Multiprocessor Systems
    A.8.1  Support under Linux
    A.8.2  Support under Solaris
    A.8.3  Support under Microsoft Windows
    A.8.4  Node Interleaving Configuration in the BIOS

List of Figures

Figure 1.  Quartet Topology
Figure 2.  Internal Resources Associated with a Quartet Node
Figure 3.  Write-Only Thread Running on Node 0, Accessing Data from 0, 1 and 2 Hops Away on an Idle System
Figure 4.  Read-Only Thread Running on Node 0, Accessing Data from 0, 1 and 2 Hops Away on an Idle System
Figure 5.  Write-Only Thread Running on Node 0, Accessing Data from 0, 1 and 2 Hops Away on an Idle System
Figure 6.  Crossfire 1 Hop-1 Hop Case vs No Crossfire 1 Hop-1 Hop Case on an Idle System
Figure 7.  Crossfire 1 Hop-1 Hop Case vs No Crossfire 1 Hop-1 Hop Case under a Low Background Load (High Subscription)
Figure 8.  Crossfire 1 Hop-1 Hop Case vs No Crossfire 1 Hop-1 Hop Case under a Very High Background Load (High Subscription)
Figure 9.  Crossfire 1 Hop-1 Hop Case vs No Crossfire 1 Hop-1 Hop Case under a Very High Background Load (Full Subscription)
Figure 10. Both Read-Only Threads Running on Node 0 (Different Cores) on an Idle System
Figure 11. Both Write-Only Threads Running on Node 0 (Different Cores) on an Idle System
Figure 12. Both Write-Only Threads Running on Node 0 (Different Cores) under Low Background Load (High Subscription)
Figure 13. Both Write-Only Threads Running on Node 0 (Different Cores) under Medium Background Load (High Subscription)
Figure 14. Both Write-Only Threads Running on Node 0 (Different Cores) under High Background Load (High Subscription)
Figure 15. Both Write-Only Threads Running on Node 0 (Different Cores) under Very High Background Load (High Subscription)
Figure 16. Internal Resources Associated with a Quartet Node

Revision History

Date         Revision   Description
June 2006    3.00       Initial release.

Chapter 1 Introduction

The AMD Athlon™ 64 and AMD Opteron™ families of single-core and dual-core multiprocessor systems are based on the cache coherent Non-Uniform Memory Access (ccNUMA) architecture. In this architecture, each processor has access to its own low-latency, local memory (through the processor’s on-die local memory controller), as well as to higher-latency remote memory through the on-die memory controllers of the other processors in the multiprocessor environment. At the same time, the ccNUMA architecture is designed to maintain cache coherence across the entire shared memory space. High-performance coherent HyperTransport™ technology links between the processors in the multiprocessor system carry both remote memory accesses and cache coherence traffic.
In traditional symmetric multiprocessing (SMP) systems, the various processors share a single memory controller. This shared memory connection can become a performance bottleneck when all processors access memory at once, and the SMP architecture does not scale well to larger systems with a greater number of processors. The AMD ccNUMA architecture is designed to overcome these inherent SMP bottlenecks; it is a mature architecture designed to extract greater performance from multiprocessor systems.
As developers deploy more demanding workloads on these multiprocessor systems, common performance questions arise: Where should threads or processes be scheduled (thread or process placement)? Where should memory be allocated (memory placement)? The underlying operating system (OS), tuned for AMD Athlon 64 and AMD Opteron multiprocessor ccNUMA systems, makes these performance decisions transparent and easy.
Advanced developers, however, should be aware of the more advanced tools and techniques available for performance tuning. In addition to recommending mechanisms provided by the OS for explicit thread (or process) and memory placement, this application note explores advanced techniques such as node interleaving of memory to boost performance. This document also delves into the characterization of an AMD ccNUMA multiprocessor system, providing advanced developers with an understanding of the fundamentals necessary to enhance the performance of synthetic and real applications and to develop advanced tools.
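As a concrete illustration of what explicit placement can look like, the sketch below uses the Linux libnuma API to run the calling thread on a chosen node and to allocate a buffer from that node's local memory. This is a minimal sketch under the assumption that libnuma is installed; it is not the test harness used in this document, and the node number and buffer size are arbitrary examples. Appendix A.7 surveys the placement tools and APIs available on Linux, Solaris, and Microsoft Windows.

/*
 * Minimal sketch of explicit thread and memory placement on Linux using the
 * libnuma API (assumes libnuma is installed; build with: gcc placement.c -lnuma).
 * The node number and buffer size below are arbitrary illustrations.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "This system does not support the NUMA API\n");
        return 1;
    }

    int node = 0;                          /* run on node 0 and use its local memory */
    size_t size = 64UL * 1024 * 1024;      /* 64-MB buffer, as in the synthetic test */

    /* Restrict the calling thread to the CPUs of the chosen node. */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* Allocate the buffer from the chosen node's memory.  For node interleaving,
     * numa_alloc_interleaved() could be used here instead. */
    char *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0, size);                  /* touch the pages so they are actually committed */
    printf("Thread and %zu-byte buffer placed on node %d\n", size, node);

    numa_free(buf, size);
    return 0;
}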
In general, applications can be memory latency sensitive or memory bandwidth sensitive; both classes are important for performance tuning. In a multiprocessor system, in addition to memory latency and memory bandwidth, other factors influence performance:
•  the latency of remote memory access (hop latency)
•  the latency of maintaining cache coherence (probe latency)
•  the bandwidth of the HyperTransport interconnect links
•  the lengths of various buffer queues in the system
The empirical analysis presented in this document is based upon data provided by running a multithreaded synthetic test. While this test is neither a pure memory latency test nor a pure memory
bandwidth test, it exercises both of these modes of operation. The test serves as a latency-sensitive test case when the test threads perform read-only operations and as a bandwidth-sensitive test when the test threads carry out write-only operations. The discussion below explores the performance results of this test, with an emphasis on the behavior exhibited when the test imposes high bandwidth demands on the low-level resources of the system.
Additionally, the tests are run in undersubscribed, highly subscribed, and fully subscribed modes. In undersubscribed mode, there are significantly fewer threads than the number of processors. In highly subscribed mode, the number of threads approaches the number of processors. In the fully subscribed mode, the number of threads is equal to the number of processors. Testing these conditions provides an understanding of the impact of thread subscription on performance.
Based on the data and analysis gathered from this synthetic test bench, this application note presents recommendations to software developers who are working on applications, compiler tool chains, virtual machines, and operating systems. Finally, the test results should also dispel some common myths, in particular the assumption that workloads that are symmetrical in all respects except for thread and memory placement always yield identical performance results.

1.1 Related Documents

The following web links are referenced in the text and provide valuable resources and background information:
[1] http://www.hotchips.org/archives/hc14/3_Tue/28_AMD_Hammer_MP_HC_v8.pdf
[2] http://www.kernel.org/pub/linux/kernel/people/mbligh/presentations/OLS2004-numa_paper.pdf
[3] http://www.amd64.org/lists/discuss/msg03314.html
[4] http://www.pgroup.com/doc/pgiug.pdf
[5] http://www.novell.com/collateral/4621437/4621437.pdf
[6] http://opensolaris.org/os/community/performance/mpo_overview.pdf
[7] http://www.opensolaris.org/os/community/performance/numa/observability/
[8] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/multiple_processors.asp
[9] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/virtualalloc.asp
[10] http://msdn2.microsoft.com/en-us/library/ms186255(SQL.90).aspx
[11] http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/529588d3-71bc-45ea-a84b-267914674709.mspx
[12] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_heapmm.asp
[13] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp
[14] http://msdn2.microsoft.com/en-us/library/tt15eb9t.aspx
[15] https://www.pathscale.com/docs/UserGuide.pdf
[16] http://docs.sun.com/source/819-3688/parallel.html

Chapter 2 Experimental Setup

This chapter describes the experimental environment in which the performance study was carried out, including the hardware configuration and the software test framework used.

2.1 System Used

All experiments and analysis discussed in this application note were performed on a Quartet system with four 2.2-GHz revision E1 dual-core AMD Opteron™ processors, running the Linux® 2.6.12-rc1-mm1 kernel (a ccNUMA-aware kernel).
While the Quartet is an internal, non-commercial development platform, its processors are connected and routed in the same way as on other supported 4P AMD platforms. We anticipate that these results should hold on other systems connected in a similar manner and that the recommendations will carry forward to current-generation AMD Opteron systems. We also expect the results to hold on other Linux kernels, and even on other operating systems, for reasons explained later.
Each processor had 2 x 1 GB of DDR400 CL2.5 PC3200 (Samsung K4H510838B-TCCC) DRAM. To rule out any interference from the X server and the network, all tests were performed at runlevel 3 with the network disconnected.
At a high level, in a Quartet the four dual-core processors are connected with coherent HyperTransport™ links. Each processor has one bidirectional HyperTransport link dedicated to I/O and two bidirectional coherent HyperTransport links used to connect to two of the other dual-core processors. In a 4-way configuration, this gives each dual-core processor a direct connection to all but one of the other dual-core processors. The throughput of each bidirectional HyperTransport link is 4 GB/s in each direction. Each node has its own on-chip memory controller and is connected to its own memory.
As shown in Figure 1, the processors (also called nodes) are numbered N0, N1, N3, and N2 clockwise from the top left. Each node has two cores, labeled C0 and C1, respectively [1].
[Figure: nodes N0 and N1 across the top and N2 and N3 across the bottom, connected by coherent HyperTransport™ links]

Figure 1. Quartet Topology

The term hop is commonly used to describe access distances on NUMA systems. When a thread accesses memory on the same node as that on which it is running, it is a 0-hop access or local access. If a thread is running on one node but accessing memory that is resident on a different node, the access is a remote access. If the node on which the thread is running and the node on which the memory resides are directly connected to each other, the memory access is a 1-hop access. If they are indirectly connected to each other (i.e., there is no direct coherent HyperTransport link) in the 4P configuration shown in Figure 1, the memory access is a 2-hop access. For example, if a thread running on Node 0 (N0) accesses memory resident on Node 3 (N3), the memory access is a 2-hop access.
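These hop distances are exposed to software by the platform firmware through the ACPI SRAT/SLIT tables, and a ccNUMA-aware operating system makes them visible to applications. As a hedged illustration (assuming a Linux system with libnuma installed, which is not part of the test setup described here), the short sketch below prints the relative node-distance matrix reported by the OS; a value of 10 denotes a local access, and larger values correspond to more hops.

/*
 * Sketch: print the node-distance matrix the OS reports (assumes libnuma;
 * build with: gcc distances.c -lnuma).  numa_distance() returns SLIT-style
 * relative distances in which 10 means local and larger values mean more hops.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "This system does not support the NUMA API\n");
        return 1;
    }

    int max_node = numa_max_node();
    for (int from = 0; from <= max_node; from++) {
        for (int to = 0; to <= max_node; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");                      /* one row per source node */
    }
    return 0;
}

On a topology like the one in Figure 1, directly connected node pairs would typically report a smaller distance than the indirectly connected pairs.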
Figure 2 shows the resources of each node from a lower-level perspective. Each dual-core processor has two cores. The two cores talk to a system request interface (SRI), which in turn talks to a crossbar (XBar). The crossbar is connected to the local memory controller (MCT) on one end and to the various HyperTransport links on the other end. The SRI, XBar, and MCT are collectively called the Northbridge on the node. The MCT is connected to the physical memory (DRAM) for that node.
[Figure: cores C0 and C1 and the node's HyperTransport™ (HT) links, each link rated at 4 GB/s per direction at a 2-GHz data rate]

Figure 2. Internal Resources Associated with a Quartet Node

From the perspective of the MCT, a memory request may come either from a local core or from a core on another node over a coherent HyperTransport link. The former is a local request, while the latter is a remote request. In the former case, the request is routed from the local core to the SRI, then to the XBar, and then to the MCT. In the latter case, the request is routed from the remote core over the coherent HyperTransport link to the XBar and from there to the MCT.
The MCT, the SRI, and the XBar on each node all have internal buffers that are used to queue transaction packets for transmission. For additional details on the Northbridge buffer queues, refer to Section A.1 in the appendix.
From a system perspective, the developer can think of the system as having three key resources that affect throughput: memory bandwidth, HyperTransport bandwidth and buffer queue capacity.

2.2 Synthetic Test

The test used is a simple synthetic workload consisting of two threads, each accessing an array that is not shared with the other thread. The time taken by each thread to access its array is measured.
Each thread performs a series of read-only or write-only accesses to successive elements of its array using a cache-line stride (64 bytes). The test iterates through all permutations of read-read, read-write, write-read, and write-write for the access patterns of the two threads. Each array is sized at 64 MB, significantly larger than the cache size.
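To make the access pattern concrete, the sketch below reconstructs the kind of loop such a test performs; it is an illustrative approximation, not the actual test source. Each thread walks its own private array at a 64-byte (cache-line) stride, either reading or writing one byte per cache line, and reports the time its pass took.

/*
 * Illustrative reconstruction of a cache-line-stride access kernel of the
 * kind used by the synthetic test; this is not the actual test source.
 * Build with: gcc -O2 stride_test.c -lpthread -lrt
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (64UL * 1024 * 1024)   /* 64 MB, much larger than the caches */
#define STRIDE      64                     /* one cache line */

struct work {
    volatile char *array;                  /* private array, not shared with the other thread */
    int write_only;                        /* 1 = write-only pass, 0 = read-only pass */
    double seconds;                        /* time taken by this thread's pass */
};

static void *stride_pass(void *arg)
{
    struct work *w = arg;
    struct timespec t0, t1;
    volatile char sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ARRAY_BYTES; i += STRIDE) {
        if (w->write_only)
            w->array[i] = (char)i;         /* write-only: stresses bandwidth */
        else
            sink += w->array[i];           /* read-only: stresses latency */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    w->seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    (void)sink;
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    struct work w[2];

    for (int t = 0; t < 2; t++) {
        w[t].array = malloc(ARRAY_BYTES);  /* placement could instead use numa_alloc_onnode() */
        w[t].write_only = 1;               /* the write-write permutation is shown here */
        pthread_create(&tid[t], NULL, stride_pass, &w[t]);
    }
    for (int t = 0; t < 2; t++) {
        pthread_join(tid[t], NULL);
        printf("thread %d: %.3f s\n", t, w[t].seconds);
        free((void *)w[t].array);
    }
    return 0;
}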
This synthetic test is neither a pure memory latency test nor a pure memory bandwidth test; rather it places varying throughput and capacity demands on the resources of the system described in the previous section. This provides an understanding of how the system behaves when any of the