IBM Power Systems 775 for AIX and Linux HPC Solution
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX 5L™
AIX®
BladeCenter®
DB2®
developerWorks®
Electronic Service Agent™
Focal Point™
Global Technology Services®
GPFS™
HACMP™
IBM®
LoadLeveler®
Power Systems™
POWER6+™
POWER6®
POWER7®
PowerPC®
POWER®
pSeries®
Redbooks®
Redbooks (logo)®
RS/6000®
System p®
System x®
Tivoli®
The following terms are trademarks of other companies:
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other
countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redbooks® publication contains information about the IBM Power Systems™ 775
Supercomputer solution for AIX® and Linux HPC customers. This publication provides details
about how to plan, configure, maintain, and run HPC workloads in this environment.
This IBM Redbooks document is targeted to current and future users of the IBM Power
Systems 775 Supercomputer (consultants, IT architects, support staff, and IT specialists)
responsible for delivering and implementing IBM Power Systems 775 clustering solutions for
their enterprise high-performance computing (HPC) applications.
The team who wrote this book
This book was produced by a team of specialists from around the world working at the
International Technical Support Organization, Poughkeepsie Center.
Dino Quintero is an IBM Senior Certified IT Specialist with the ITSO in Poughkeepsie, NY.
His areas of knowledge include enterprise continuous availability, enterprise systems
management, system virtualization, technical computing, and clustering solutions. He is
currently an Open Group Distinguished IT Specialist. Dino holds a Master of Computing
Information Systems degree and a Bachelor of Science degree in Computer Science from
Marist College.
Kerry Bosworth is a Software Engineer in pSeries® Cluster System Test for
high-performance computing in Poughkeepsie, New York. Since joining the team four years
ago, she has worked with InfiniBand technology on POWER6® AIX, SLES, and Red Hat
clusters and the new Power 775 system. She has 12 years of experience at IBM, with eight
years in IBM Global Services as an AIX Administrator and Service Delivery Manager.
Puneet Chaudhary is a software test specialist with the General Parallel File System team in
Poughkeepsie, New York.
Rodrigo Garcia da Silva is a Deep Computing Client Technical Architect at the IBM Systems
and Technology Group. He is part of the STG Growth Initiatives Technical Sales Team in
Brazil, specializing in High Performance Computing solutions. He has worked at IBM for the
past five years and has a total of eight years of experience in the IT industry. He holds a B.S.
in Electrical Engineering and his areas of expertise include systems architecture, OS
provisioning, Linux, and open source software. He also has a background in intellectual
property protection, including publications and a filed patent.
ByungUn Ha is an Accredited IT Specialist and Deep Computing Technical Specialist in
Korea. He has over 10 years of experience at IBM and has conducted various HPC projects and
HPC benchmarks in Korea. He has supported the Supercomputing Center at KISTI (Korea
Institute of Science and Technology Information) on-site for nine years. His areas of expertise
include Linux performance and clustering for System x, InfiniBand, AIX Power systems, and the
HPC software stack, including LoadLeveler®, Parallel Environment, ESSL/PESSL, and the
C/Fortran compilers. He is a Red Hat Certified Engineer (RHCE) and has a Master’s degree in
Aerospace Engineering from Seoul National University. He is currently working in the Deep
Computing team, Growth Initiatives, STG in Korea as an HPC Technical Sales Specialist.
Jose Higino is an Infrastructure IT Specialist for AIX/Linux support and services for IBM
Portugal. His areas of knowledge include System x, BladeCenter®, and Power Systems
planning and implementation, management, virtualization, consolidation, and clustering (HPC
and HA) solutions. He is currently the only person responsible for Linux support and services
in IBM Portugal. He completed the Red Hat Certified Technician level in 2007, became a
CiRBA Certified Virtualization Analyst in 2009, and completed certification in the KT Resolve
methodology as an SME in 2011. José holds a Master of Computers and Electronics
Engineering degree from UNL - FCT (Universidade Nova de Lisboa - Faculdade de Ciências
e Tecnologia), in Portugal.
Marc-Eric Kahle is a POWER® Systems Hardware Support specialist at the IBM Global
Technology Services® Central Region Hardware EMEA Back Office in Ehningen, Germany.
He has worked in the RS/6000®, POWER System, and AIX fields since 1993. He has worked
at IBM Germany since 1987. His areas of expertise include POWER Systems hardware and
he is an AIX certified specialist. He has participated in the development of six other IBM
Redbooks publications.
Tsuyoshi Kamenoue is an Advisory IT Specialist in Power Systems Technical Sales in IBM
Japan. He has nine years of experience of working on pSeries, System p®, and Power
Systems products, especially in the HPC area. He holds a Bachelor’s degree in System
Information from the University of Tokyo.
James Pearson has been a Product Engineer for pSeries high-end Enterprise systems and HPC
cluster offerings since 1998. He has participated in the planning, test, installation, and
ongoing maintenance phases of clustered RISC and pSeries servers for numerous
government and commercial customers, beginning with the SP2 and continuing through the
current Power 775 HPC solution.
Mark Perez is a customer support specialist servicing IBM Cluster 1600.
Fernando Pizzano is a Hardware and Software Bring-up Team Lead in the IBM Advanced
Clustering Technology Development Lab, Poughkeepsie, New York. He has over 10 years of
information technology experience, the last five years in HPC Development. His areas of
expertise include AIX, pSeries High Performance Switch, and IBM System p hardware. He
holds an IBM certification in pSeries AIX 5L™ System Support.
Robert Simon is a Senior Software Engineer in STG working in Poughkeepsie, New York. He
has worked with IBM since 1987. He currently is a Team Leader in the Software Technical
Support Group, which supports the High Performance Clustering software (LoadLeveler,
CSM, GPFS™, RSCT, and PPE). He has extensive experience with IBM System p hardware,
AIX, HACMP™, and high-performance clustering software. He has participated in the
development of three other IBM Redbooks publications.
Kai Sun is a Software Engineer in pSeries Cluster System Test for high performance
computing in the IBM China System Technology Laboratory, Beijing. Since joining the team in
2011, he has worked with the IBM Power Systems 775 cluster. He has six years of
experience with embedded systems on Linux and VxWorks platforms. He recently received
an Eminence and Excellence Award from IBM for his work on the Power Systems 775 cluster. He
holds a B.Eng. degree in Communication Engineering from Beijing University of Technology,
China, and an M.Sc. degree in Project Management from the New Jersey Institute of
Technology, US.
Thanks to the following people for their contributions to this project:
Mark Atkins
IBM Boulder
Robert Dandar
Joseph Demczar
Chulho Kim
John Lewars
John Robb
Hanhong Xue
Gary Mincher
Dave Wootton
Paula Trimble
William Lepera
Joan McComb
Bruce Potter
Linda Mellor
Alison White
Richard Rosenthal
Gordon McPheeters
Ray Longi
Alan Benner
Lissa Valleta
John Lemek
Doug Szerdi
David Lerma
IBM Poughkeepsie
Ettore Tiotto
IBM Toronto, Canada
Wei QQ Qu
IBM China
Phil Sanders
IBM Rochester
Richard Conway
David Bennin
International Technical Support Organization, Poughkeepsie Center
Now you can become a published author, too!
Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.
Find out more about the residency program, browse the residency index, and apply online at:
http://www.ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
http://www.ibm.com/redbooks
Send your comments in an email to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Stay connected to IBM Redbooks
Find us on Facebook:
http://www.facebook.com/IBMRedbooks
Follow us on Twitter:
http://twitter.com/ibmredbooks
Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter
Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
Chapter 1. Understanding the IBM Power Systems 775 Cluster
In this book, we describe the new IBM Power Systems 775 Cluster hardware and software.
The chapters provide an overview of the general features of the Power 775 and its hardware
and software components. This chapter helps you gain a basic understanding of this cluster.
Application integration and monitoring of a Power 775 cluster are also described in greater
detail in this IBM Redbooks publication. LoadLeveler, GPFS, xCAT, and more are
documented with examples to provide a better view of the complete cluster solution.
Problem determination is also discussed throughout this publication for different scenarios
that include xCAT configuration issues, Integrated Switch Network Manager (ISNM), Host
Fabric Interface (HFI), GPFS, and LoadLeveler. These scenarios show how to determine the
cause of an error and how to solve it. This knowledge complements the
information in Chapter 5, “Maintenance and serviceability” on page 265.
Some cluster management challenges might need intervention that requires service updates,
xCAT shutdown/startup, node management, and Fail in Place tasks. Documents that are
available are referenced in this book because not everything is shown in this publication.
This chapter includes the following topics:
Overview of the IBM Power System 775 Supercomputer
Advantages and new features of the IBM Power 775
Hardware information
Power, packaging, and cooling
Disk enclosure
Cluster management
Connection scenario between EMS, HMC, and Frame
High Performance Computing software stack
1.1 Overview of the IBM Power System 775 Supercomputer
For many years, IBM has provided High Performance Computing (HPC) solutions that deliver
extreme performance; for example, highly scalable clusters that use AIX and Linux for
demanding workloads, including weather forecasting and climate modeling.
The previous IBM Power 575 POWER6 water-cooled cluster showed impressive density and
performance. With 32 processors, 32 GB to 256 GB of memory in one central electronic
complex (CEC) enclosure or cage, and up to 14 CECs per frame (water-cooled), 448
processors per frame were possible.
The InfiniBand interconnect provided the cluster with powerful communication channels for
the workloads.
The new Power 775 Supercomputer from IBM takes this density to a new height. With 256
3.84 GHz POWER7® processor cores, 2 TB of memory per CEC, and up to 12 CECs per frame, a
total of 3072 cores and 24 TB of memory per frame is possible. The system is highly scalable:
up to 2048 CEC drawers can be clustered together, for a total of 524,288 POWER7 cores to
solve the most challenging problems. A total of 7.86 TF per CEC and 94.4 TF
per rack highlights the capabilities of this high-performance computing solution.
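As a quick cross-check of these peak figures (a back-of-the-envelope calculation based on the per-core rate given in 1.4.1, not an additional IBM specification), the numbers are mutually consistent:

\begin{align*}
\text{Per core: } & 3.84\ \text{GHz} \times 4\ \text{FPU} \times 2\ \text{FLOPS/cycle} = 30.72\ \text{GFLOPS}\\
\text{Per POWER7 chip: } & 8\ \text{cores} \times 30.72\ \text{GFLOPS} = 245.76\ \text{GFLOPS}\\
\text{Per CEC (32 chips): } & 32 \times 245.76\ \text{GFLOPS} \approx 7.86\ \text{TF}\\
\text{Per frame (12 CECs): } & 12 \times 7.86\ \text{TF} \approx 94.4\ \text{TF}
\end{align*}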
The hardware is only as good as the software that runs on it. IBM AIX, IBM Parallel
Environment (PE) Runtime Edition, LoadLeveler, GPFS, and xCAT are a few of the supported
software stacks for the solution. For more information, see 1.9, “High Performance Computing
software stack” on page 62.
1.2 The IBM Power 775 cluster components
The IBM Power 775 can consist of the following components:
Compute subsystem:
– Diskless nodes dedicated to perform computational tasks
– Customized operating system (OS) images
– Applications
Storage subsystem:
– I/O node (diskless)
– OS images for IO nodes
– SAS adapters attached to the Disk Enclosures (DE)
– General Parallel File System (GPFS)
Cluster network subsystem:
– Host Fabric Interface (HFI):
• Busses from processor modules to the switching hub in an octant
• Local links (LL-links) between octants
• Local remote links (LR-links) between drawers in a SuperNode
• Distance links (D-links) between SuperNodes
– Operating system drivers
– IBM User space protocol
– AIX and Linux IP drivers
Octants, SuperNode, and other components are described in the other sections of this book.
Node types
The following node types provide different functions within the cluster. In the
context of the 9125-F2C drawer, a node is an operating system (OS) image that is booted in an LPAR. There
are three general designations for node types on the 9125-F2C. Often these functions are
dedicated to a node, but a node can have multiple roles:
– Compute nodes
Compute nodes run parallel jobs and perform the computational functions. These
nodes are diskless and booted across the HFI network from a Service Node. Most of
the nodes are compute nodes.
– IO nodes
These nodes are attached to either the Disk Enclosure in the physical cluster or
external storage. These nodes serve the file system to the rest of the cluster.
– Utility Nodes
A Utility node offers services to the cluster. These nodes often feature more resources,
such as an external Ethernet, external, or internal storage. The following Utility nodes
are required:
• Service nodes: Runs xCAT to serve the operating system to local diskless nodes
• Login nodes: Provides a centralized login to the cluster
– Optional utility node:
• Tape subsystem server
Important: xCAT stores all system definitions as node objects, including the required EMS
console and the HMC console. However, the consoles are external to the 9125-F2C cluster
and are not referred to as cluster nodes. The HMC and EMS consoles are physically
running on specific, dedicated servers. The HMC runs on a System x® based machine
(7042 or 7310) and the EMS runs on a POWER 750 Server. For more information, see
1.7.1, “Hardware Management Console” on page 53 and 1.7.2, “Executive Management
Server” on page 53.
1.3 Advantages and new features of the IBM Power 775
The IBM Power Systems 775 (9125-F2C) has several new features that make this system
even more reliable, available, and serviceable.
Fully redundant power, cooling, and management; dynamic processor deallocation; memory
chip and lane sparing; and concurrent maintenance are the main reliability, availability,
and serviceability (RAS) features.
The system is water-cooled, which gives 100% heat capture. Some components are cooled
by small fans, but the Rear Door Heat Exchanger captures this heat.
Because most of the nodes are diskless, the service nodes provide the operating
system to the diskless nodes. The HFI network also is used to boot the diskless utility nodes.
The Power 775 Availability Plus (A+) feature allows immediate recovery from failures of
processors, switching hubs, and HFI cables because additional resources are available in the
system. These resources fail in place, and no hardware must be replaced until a specified
threshold is reached. For more information, see 5.4, “Power 775 Availability Plus” on page 297.
The IBM Power 775 cluster solution provides High Performance Computing clients with the
following benefits:
Sustained performance and low energy consumption for climate modeling and forecasting
Massive scalability for cell and organism process analysis in life sciences
Memory capacity for high-resolution simulations in nuclear resource management
Space and energy efficient for risk analytics and real-time trading in financial services
1.4 Hardware information
This section provides detailed information about the hardware components of the IBM Power
775. Within this section, there are links to IBM manuals and external sources for more
information.
1.4.1 POWER7 chip
The IBM Power System 775 implements the POWER7 processor technology. The PowerPC®
Architecture POWER7 processor is designed for use in servers that provide solutions with
large clustered systems, as shown in Figure 1-1 on page 5.
Figure 1-1 POWER7 chip block diagram
IBM POWER7 characteristics
This section provides a description of the following characteristics of the IBM POWER7 chip,
as shown in Figure 1-1:
246 GFLOPs:
– Up to eight cores per chip
– Four Floating Point Units (FPU) per core
– Two FLOPS per cycle (fused multiply-add operation)
– 246 GFLOPs = 8 cores x 3.84 GHz x 4 FPU x 2 FLOPS/cycle
32 KB instruction and 32 KB data caches per core
256 KB L2 cache per core
4 MB L3 cache per core
Eight Channels of SuperNova buffered DIMMs:
– Two memory controllers per chip
– Four memory busses per memory controller (1 B wide Write, 2 B wide Read each)
CMOS 12S SOI 11 level metal
Die size: 567 mm2
Architecture
PowerPC architecture
IEEE New P754 floating point compliant
Big endian, little endian, strong byte ordering support extension
46-bit real addressing, 68-bit virtual addressing
Off-chip bandwidth: 336 GBps (local plus remote interconnect)
Memory capacity: Up to 128 GB per chip
Memory bandwidth: 128 GBps peak per chip
1.9 GHz Frequency
(8) 16 B data bus, 2 address snoop, 21 on/off ramps
Asynchronous interface to chiplets and off-chip interconnect
Differential memory controllers (2)
6.4-GHz Interface to Super Nova (SN)
DDR3 support, max 1067 MHz
Minimum Memory 2 channels, 1 SN/channel
Maximum Memory 8 channels X 1 SN/channel
2 Ports/Super Nova
8 Ranks/Port
X8b and X4b devices supported
PowerBus Off-Chip Interconnect
1.5 to 2.9 Gbps single ended EI-3
2 spare bits/bus
Max 256-way SMP
32-way optimal scaling
Four 8-B Intranode Buses (W, X, Y, or Z)
All buses run at the same bit rate
All capable of running as a single 4B interface; the location of the 4B interface within the
8 B is fixed
Hub chip attaches via W, X, Y or Z
Three 8-B Internode Buses (A, B,C)
The C-bus is multiplexed with GX and operates only as an aggregate data bus (that is, address
and command traffic is not supported)
Buses
Table 1-1 describes the POWER7 busses.
Table 1-1 POWER7 busses
Bus name     Width (speed)                                            Connects                       Function
W, X, Y, Z   8B+8B with 2 extra bits per bus (3 Gbps)                 Intranode processors and hub   Used for address and data
A, B         8B+8B with 2 extra bits per bus (3 Gbps)                 Other nodes within drawer      Data only
C            8B+8B with 2 extra bits per bus (3 Gbps)                 Other nodes within drawer      Data only, multiplexed with GX
Mem1-Mem8    2B Read + 1B Write with 2 extra bits per bus (2.9 GHz)   Processor to memory            -
WXYZABC Busses
The off-chip PowerBus supports up to seven coherent SMP links (WXYZABC) by using
Elastic Interface 3 (EI-3) signaling at up to 3 Gbps. The intranode WXYZ links connect up to
four processor chips to make a 32-way node and connect a hub chip to each processor chip.
The WXYZ links carry coherency traffic and data and are interchangeable as intranode
processor links or hub links. The internode AB links connect up to two nodes per processor
chip. The AB links carry coherency traffic and data and are interchangeable with each other.
The AB links also can be configured as aggregate data-only links. The C link is configured only
as a data-only link.
All seven coherent SMP links (WXYZABC) can be configured as 8 bytes or 4 bytes in width.
The XYZABC Busses include the following features:
Four (WXYZ) 8-B or 4-B EI-3 Intranode Links
Two (AB) 8-B or 4-B EI-3 Internode Links or two (AB) 8-B or 4-B EI-3 data-only Links
One (C) 8-B or 4-B EI-3 data-only Link
PowerBus
The PowerBus is responsible for coherent and non-coherent memory access, IO operations,
interrupt communication, and system controller communication. The PowerBus provides all of
the interfaces, buffering, and sequencing of command and data operations within the storage
subsystem. The POWER7 chip has up to seven PowerBus links that are used to connect to
other POWER7 chips, as shown in Figure 1-2 on page 8.
The PowerBus link is an 8-Byte-wide (or optional 4-Byte-wide), split-transaction, multiplexed,
command and data bus that supports up to 32 POWER7 chips. The bus topology is a
multitier, fully connected topology to reduce latency, increase redundancy, and improve
concurrent maintenance. Reliability is improved with ECC on the external I/Os.
Data transactions are always sent along a unique point-to-point path. A route tag travels with
the data to help routing decisions along the way. Multiple data links are supported between
chips that are used to increase data bandwidth.
Figure 1-2 POWER7 chip layout
Figure 1-3 on page 9 shows the POWER7 core structure.
Figure 1-3 Microprocessor core structural diagram
Reliability, availability, and serviceability features
The microprocessor core includes the following reliability, availability, and serviceability (RAS)
features:
POWER7 core:
– Instruction retry for soft core logic errors
– Alternate processor recovery for hard core errors detected
– Processor limited checkstop for other errors
– Protection key support for AIX
L1 I/D Cache Error Recovery and Handling:
– Instruction retry for soft errors
– Alternate processor recovery for hard errors
– Guarding of core for core and L1/L2 cache errors
L2 Cache:
– ECC on L2 and directory tags
– Line delete for L2 and directory tags (seven lines)
– L2 UE handling includes purge and refetch of unmodified data
– Predictive dynamic guarding of associated cores
L3 Cache:
– ECC on data
– Line delete mechanism for data (seven lines)
– L3 UE handling includes purges and refetch of unmodified data
– Predictive dynamic guarding of associated cores for CEs in L3 not managed by the line
deletion
1.4.2 I/O hub chip
This section provides information about the IBM Power 775 I/O hub chip (or torrent chip), as
shown in Figure 1-4.
Figure 1-4 Hub chip (Torrent)
Host fabric interface
The host fabric interface (HFI) provides a non-coherent interface between a quad-chip
module (QCM), which is composed of four POWER7 chips, and the cluster network.
Figure 1-5 on page 11 shows two instances of HFI in a hub chip. The HFI chips also attach to
the Collective Acceleration Unit (CAU).
Each HFI has one PowerBus command and four PowerBus data interfaces, which feature the
following configuration:
1. The PowerBus directly connects to the processors and memory controllers of four
POWER7 chips via the WXYZ links.
2. The PowerBus also indirectly coherently connects to other POWER7 chips within a
256-way drawer via the LL links. Although fully supported by the HFI hardware, this path
provides reduced performance.
3. Each HFI has four ports to the Integrated Switch Router (ISR). The ISR connects to other
hub chips through the D, LL, and LR links.
4. ISRs and D, LL, and LR links that interconnect the hub chips form the cluster network.
POWER7 chips: The set of four POWER7 chips (QCM), its associated memory, and a
hub chip form the building block for cluster systems. A Power 775 system consists of
multiple building blocks that are connected to one another via the cluster network.
Figure 1-5 HFI attachment scheme
Packet processing
The HFI is the interface between the POWER7 chip quads and the cluster network, and is
responsible for moving data between the PowerBus and the ISR. The data is in various
formats, but packets are processed in the following manner:
Send
– Pulls or receives data from PowerBus-attached devices in a POWER7 chip
– Translates data into network packets
– Injects network packets into the cluster network via the ISR
Receive
– Receives network packets from the cluster network via the ISR
– Translates them into transactions
– Pushes the transactions to PowerBus-attached devices in a POWER7 chip
Packet ordering
– The HFIs and cluster network provide no ordering guarantees among packets. Packets
that are sent from the same source window and node to the same destination window
and node might reach the destination in a different order.
Figure 1-6 shows two HFIs cooperating to move data from devices that are attached to one
PowerBus to devices attached to another PowerBus through the Cluster Network.
Figure 1-6 HFI moving data from one quad to another quad
HFI paths: The path between any two HFIs might be indirect, thus requiring multiple hops
through intermediate ISRs.
1.4.3 Collective acceleration unit
The hub chip provides specialized hardware that is called the Collective Acceleration Unit
(CAU) to accelerate frequently used collective operations.
Collective operations
Collective operations are distributed operations that operate across a tree. Many HPC
applications perform collective operations in which the application makes forward progress
only after every compute node has completed its contribution and the results of the collective
operation are delivered back to every compute node (for example, barrier synchronization
and global sum).
A specialized arithmetic-logic unit (ALU) within the CAU implements reduction,
barrier, and broadcast operations. For reduce operations, the ALU supports the following
operations and data types:
Fixed point: NOP, SUM, MIN, MAX, OR, AND, and XOR (signed and unsigned)
Floating point: MIN, MAX, SUM, and PROD (single and double precision)
There is one CAU in each hub chip, which is one CAU per four POWER7 chips, or one CAU
per 32 C1 cores.
Software organizes the CAUs in the system into collective trees. For a broadcast operation,
the arrival of an input on one link causes its forwarding on all other links. For a reduce
operation, arrivals on all but one link cause the reduction result to be forwarded to the
remaining link.
A link in the CAU tree maps to a path composed of more than one link in the network. The
system supports many trees simultaneously, and each CAU supports 64 independent trees.
The usage of sequence numbers and a retransmission protocol provides reliability and
pipelining. Each tree has only one participating HFI window on any involved node. The order
in which the reduction operation is evaluated is preserved from one run to another, which
benefits programming models that allow programmers to require that collective operations are
executed in a particular order, such as MPI.
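To relate this to the programming model, the following minimal MPI program performs the kind of global sum and barrier synchronization that the CAU is designed to accelerate. It uses only standard MPI calls (MPI_Allreduce and MPI_Barrier) and no Power 775-specific interfaces, so treat it as an illustrative sketch; whether the CAU is used underneath depends on the Parallel Environment configuration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ntasks;
    double local, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Each task contributes one value; the reduction tree returns the
       global sum to every task (the operation a CAU tree can offload). */
    local = (double)rank;
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Barrier synchronization is another collective operation. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d tasks = %.0f\n", ntasks, global_sum);

    MPI_Finalize();
    return 0;
}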
Packet propagation
As shown in Figure 1-7 on page 14, a CAU receives packets from the following sources:
The memory of a remote node, inserted into the cluster network by the HFI of the remote
node
The memory of a local node, inserted into the cluster network by the HFI of the local
node
A remote CAU
Figure 1-7 CAU packets received by CAU
As shown in Figure 1-8 on page 15, a CAU sends packets to the following locations:
The memory of a remote node that is written to memory by the HFI of the remote node.
The memory of a local node that is written to memory by the HFI of the local node.
A remote CAU.
Figure 1-8 CAU packets sent by CAU
1.4.4 Nest memory management unit
The Nest Memory Management Unit (NMMU) in the hub chip enables user-level
code to operate on the address space of processes that execute on other compute nodes.
The NMMU enables user-level code to create a global address space on which the NMMU
performs operations. This facility is called global shared memory.
A process that executes on a compute node registers its address space, thus permitting
interconnect packets to manipulate the registered shared region directly. The NMMU
references a page table that maps effective addresses to real memory. The hub chip also
maintains a cache of the mappings and can map the entire real memory of most installations.
Incoming interconnect packets that reference memory, such as RDMA packets and packets
that perform atomic operations, contain an effective address and information that pinpoints
the context in which to translate the effective address. This feature greatly facilitates
global-address space languages, such as Unified Parallel C (UPC), co-array Fortran, and
X10, by permitting such packets to contain easy-to-use effective addresses.
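The following sketch illustrates the style of one-sided, global-address-space access that this hardware is built to support, expressed with standard MPI one-sided calls (MPI_Win_create and MPI_Put) rather than the Power 775 user-space interface itself; run it with at least two tasks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf = 0.0;      /* memory that this task exposes to the others */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Register (expose) a window of local memory, analogous to
       registering an address region for remote access. */
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        /* Task 0 writes directly into the exposed memory of task 1
           (an RDMA-style put, with no receive call on the target side). */
        double value = 42.0;
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("Task 1 received %.1f\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}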
1.4.5 Integrated switch router
The integrated switch router (ISR) replaces the external switching and routing functions that
are used in prior networks. The ISR is designed to dramatically reduce cost and improve
performance in bandwidth and latency.
A direct-graph network topology connects up to 65,536 POWER7 eight-core processor chips
with a two-level routing hierarchy of L and D busses.
Each hub chip ISR connects to four POWER7 chips via the HFI controller and the W busses.
The Torrent hub chip and its four POWER7 chips are called an octant. Each ISR octant is
directly connected to seven other octants on a drawer via the wide on-planar L-Local busses
and to 24 other octants in three more drawers via the optical L-Remote busses.
A Supernode is the fully interconnected collection of 32 octants in four drawers. Up to 512
Supernodes are fully connected via the 16 optical D busses per hub chip. The ISR is
designed to support smaller systems with multiple D busses between Supernodes for higher
bandwidth and performance.
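As a rough consistency check against the maximum configuration quoted in 1.1 (an informal calculation, assuming eight octants per drawer and eight cores per POWER7 chip):

\begin{align*}
\text{Octants per Supernode: } & 4\ \text{drawers} \times 8\ \text{octants} = 32\\
\text{POWER7 chips (maximum): } & 512\ \text{Supernodes} \times 32\ \text{octants} \times 4\ \text{chips} = 65{,}536\\
\text{Cores (maximum): } & 65{,}536\ \text{chips} \times 8\ \text{cores} = 524{,}288
\end{align*}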
The ISR logically contains input and output buffering, a full crossbar switch, hierarchical route
tables, link protocol framers/controllers, interface controllers (HFI and PB data), Network
Management registers and controllers, and extensive RAS logic that includes link replay
buffers.
The Integrated Switch Router supports the following features:
Target cycle time up to 3 GHz
Target switch latency of 15 ns
Target GUPS: ~21 K. ISR assisted GUPs handling at all intermediate hops (not software)
Target switch crossbar bandwidth greater than 1 TB per second input and output:
– 96 Gbps WXYZ-busses (4 @ 24 Gbps) from P7 chips (unidirectional)
– 168 Gbps local L-busses (7 @ 24 Gbps) between octants in a drawer (unidirectional)
– 144 Gbps optical L-busses (24 @ 6 Gbps) to other drawers (unidirectional)
– 160 Gbps D-busses (16 @ 10 Gbps) to other Supernodes (unidirectional)
Two-tiered full-graph network
Virtual Channels for deadlock prevention
Cut-through Wormhole routing
Routing Options:
– Full hardware routing
– Software-controlled indirect routing by using hardware route tables
Multiple indirect routes that are supported for data striping and failover
Multiple direct routes by using LR and D-links supported for less than a full-up system
Maximum supported packet size is 2 KB. Packet size varies from 1 to 16 flits, with each flit
being 128 bytes
Routing Algorithms:
– Round Robin: Direct and Indirect
– Random: Indirect routes only
IP Multicast with central buffer and route table and supports 256 Bytes or 2 KB packets
Global Hardware Counter implementation and support and includes link latency counts
LCRC on L and D busses with link-level retry support for handling transient errors and
includes error thresholds.
ECC on local L and W busses, internal arrays, and busses and includes Fault Isolation
Registers and Control Checker support
Performance Counters and Trace Debug support