IBM Power Systems 775 for AIX and Linux HPC Solution
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX 5L™
AIX®
BladeCenter®
DB2®
developerWorks®
Electronic Service Agent™
Focal Point™
Global Technology Services®
GPFS™
HACMP™
IBM®
LoadLeveler®
Power Systems™
POWER6+™
POWER6®
POWER7®
PowerPC®
POWER®
pSeries®
Redbooks®
Redbooks (logo)®
RS/6000®
System p®
System x®
Tivoli®
The following terms are trademarks of other companies:
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other
countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redbooks® publication contains information about the IBM Power Systems™ 775
Supercomputer solution for AIX® and Linux HPC customers. This publication provides details
about how to plan, configure, maintain, and run HPC workloads in this environment.
This IBM Redbooks document is targeted to current and future users of the IBM Power
Systems 775 Supercomputer (consultants, IT architects, support staff, and IT specialists)
responsible for delivering and implementing IBM Power Systems 775 clustering solutions for
their enterprise high-performance computing (HPC) applications.
The team who wrote this book
This book was produced by a team of specialists from around the world working at the
International Technical Support Organization, Poughkeepsie Center.
Dino Quintero is an IBM Senior Certified IT Specialist with the ITSO in Poughkeepsie, NY.
His areas of knowledge include enterprise continuous availability, enterprise systems
management, system virtualization, technical computing, and clustering solutions. He is
currently an Open Group Distinguished IT Specialist. Dino holds a Master of Computing
Information Systems degree and a Bachelor of Science degree in Computer Science from
Marist College.
Kerry Bosworth is a Software Engineer in pSeries® Cluster System Test for
high-performance computing in Poughkeepsie, New York. Since joining the team four years
ago, she has worked with the InfiniBand technology on POWER6® AIX, SLES, and Red Hat
clusters and the new Power 775 system. She has 12 years of experience at IBM, including eight
years in IBM Global Services as an AIX Administrator and Service Delivery Manager.
Puneet Chaudhary is a software test specialist with the General Parallel File System team in
Poughkeepsie, New York.
Rodrigo Garcia da Silva is a Deep Computing Client Technical Architect at the IBM Systems
and Technology Group. He is part of the STG Growth Initiatives Technical Sales Team in
Brazil, specializing in High Performance Computing solutions. He has worked at IBM for the
past five years and has a total of eight years of experience in the IT industry. He holds a B.S.
in Electrical Engineering and his areas of expertise include systems architecture, OS
provisioning, Linux, and open source software. He also has a background in intellectual
property protection, including publications and a filed patent.
ByungUn Ha is an Accredited IT Specialist and Deep Computing Technical Specialist in
Korea. He has over 10 years of experience at IBM and has conducted various HPC projects and
HPC benchmarks in Korea. He has supported the Supercomputing Center at KISTI (Korea
Institute of Science and Technology Information) on-site for nine years. His areas of expertise
include Linux performance and clustering for System x, InfiniBand, AIX on Power systems, and
the HPC software stack, including LoadLeveler®, Parallel Environment, ESSL/PESSL, and the
C/Fortran compilers. He is a Red Hat Certified Engineer (RHCE) and has a Master’s degree in
Aerospace Engineering from Seoul National University. He is currently working in the Deep
Computing team, Growth Initiatives, STG in Korea as an HPC Technical Sales Specialist.
Jose Higino is an Infrastructure IT Specialist for AIX/Linux support and services for IBM
Portugal. His areas of knowledge include System x, BladeCenter® and Power Systems
planning and implementation, management, virtualization, consolidation, and clustering (HPC
and HA) solutions. He is currently the only person responsible for Linux support and services
in IBM Portugal. He completed the Red Hat Certified Technician level in 2007, became a
CiRBA Certified Virtualization Analyst in 2009, and completed certification in the KT Resolve
methodology as an SME in 2011. José holds a Master of Computers and Electronics
Engineering degree from UNL - FCT (Universidade Nova de Lisboa - Faculdade de Ciências
e Tecnologia), in Portugal.
Marc-Eric Kahle is a POWER® Systems Hardware Support specialist at the IBM Global
Technology Services® Central Region Hardware EMEA Back Office in Ehningen, Germany.
He has worked in the RS/6000®, POWER System, and AIX fields since 1993. He has worked
at IBM Germany since 1987. His areas of expertise include POWER Systems hardware and
he is an AIX certified specialist. He has participated in the development of six other IBM
Redbooks publications.
Tsuyoshi Kamenoue is an Advisory IT Specialist in Power Systems Technical Sales in IBM
Japan. He has nine years of experience of working on pSeries, System p®, and Power
Systems products, especially in the HPC area. He holds a Bachelor’s degree in System
Information from the University of Tokyo.
James Pearson has been a Product Engineer for pSeries high-end Enterprise systems and HPC
cluster offerings since 1998. He has participated in the planning, test, installation, and
on-going maintenance phases of clustered RISC and pSeries servers for numerous
government and commercial customers, beginning with the SP2 and continuing through the
current Power 775 HPC solution.
Mark Perez is a customer support specialist servicing IBM Cluster 1600.
Fernando Pizzano is a Hardware and Software Bring-up Team Lead in the IBM Advanced
Clustering Technology Development Lab, Poughkeepsie, New York. He has over 10 years of
information technology experience, the last five years in HPC Development. His areas of
expertise include AIX, pSeries High Performance Switch, and IBM System p hardware. He
holds an IBM certification in pSeries AIX 5L™ System Support.
Robert Simon is a Senior Software Engineer in STG working in Poughkeepsie, New York. He
has worked with IBM since 1987. He currently is a Team Leader in the Software Technical
Support Group, which supports the High Performance Clustering software (LoadLeveler,
CSM, GPFS™, RSCT, and PPE). He has extensive experience with IBM System p hardware,
AIX, HACMP™, and high-performance clustering software. He has participated in the
development of three other IBM Redbooks publications.
Kai Sun is a Software Engineer in pSeries Cluster System Test for high performance
computing in the IBM China System Technology Laboratory, Beijing. Since joining the team in
2011, he has worked with the IBM Power Systems 775 cluster. He has six years of
experience with embedded systems on Linux and VxWorks platforms. He has recently been given
an Eminence and Excellence Award by IBM for his work on the Power Systems 775 cluster. He
holds a B.Eng. degree in Communication Engineering from Beijing University of Technology,
China. He has an M.Sc. degree in Project Management from the New Jersey Institute of
Technology, US.
Thanks to the following people for their contributions to this project:
Mark Atkins
IBM Boulder
Robert Dandar
Joseph Demczar
Chulho Kim
John Lewars
John Robb
Hanhong Xue
Gary Mincher
Dave Wootton
Paula Trimble
William Lepera
Joan McComb
Bruce Potter
Linda Mellor
Alison White
Richard Rosenthal
Gordon McPheeters
Ray Longi
Alan Benner
Lissa Valleta
John Lemek
Doug Szerdi
David Lerma
IBM Poughkeepsie
Ettore Tiotto
IBM Toronto, Canada
Wei QQ Qu
IBM China
Phil Sanders
IBM Rochester
Richard Conway
David Bennin
International Technical Support Organization, Poughkeepsie Center
Now you can become a published author, too!
Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.
Find out more about the residency program, browse the residency index, and apply online at:
http://www.ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
http://www.ibm.com/redbooks
Send your comments in an email to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Stay connected to IBM Redbooks
Find us on Facebook:
http://www.facebook.com/IBMRedbooks
Follow us on Twitter:
http://twitter.com/ibmredbooks
Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter.
Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
Chapter 1. Understanding the IBM Power Systems 775 Cluster
In this book, we describe the new IBM Power Systems 775 Cluster hardware and software.
The chapters provide an overview of the general features of the Power 775 and its hardware
and software components. This chapter helps you get a basic understanding and concept of
this cluster.
Application integration and monitoring of a Power 775 cluster is also described in greater
detail in this IBM Redbooks publication. LoadLeveler, GPFS, xCAT, and more are
documented with some examples to get a better view on the complete cluster solution.
Problem determination is also discussed throughout this publication for different scenarios
that include xCAT configuration issues, Integrated Switch Network Manager (ISNM), Host
Fabric Interface (HFI), GPFS, and LoadLeveler. These scenarios show the flow of how to
determine the cause of the error and how to solve the error. This knowledge complements the
information in Chapter 5, “Maintenance and serviceability” on page 265.
Some cluster management challenges might need intervention that requires service updates,
xCAT shutdown/startup, node management, and Fail in Place tasks. Documents that are
available are referenced in this book because not everything is shown in this publication.
This chapter includes the following topics:
Overview of the IBM Power System 775 Supercomputer
Advantages and new features of the IBM Power 775
Hardware information
Power, packaging, and cooling
Disk enclosure
Cluster management
Connection scenario between EMS, HMC, and Frame
High Performance Computing software stack
1.1 Overview of the IBM Power System 775 Supercomputer
For many years, IBM has provided High Performance Computing (HPC) solutions that deliver
extreme performance; for example, highly scalable AIX and Linux clusters for demanding
workloads, including weather forecasting and climate modeling.
The previous IBM Power 575 POWER6 water-cooled cluster showed impressive density and
performance. With 32 processors and 32 GB to 256 GB of memory in one central electronic
complex (CEC) enclosure or cage, and up to 14 CECs per water-cooled frame, 448
processors per frame were possible.
The InfiniBand interconnect provided the cluster with powerful communication channels for
the workloads.
The new Power 775 Supercomputer from IBM takes this density to a new height. With 256
3.84 GHz POWER7® processor cores, 2 TB of memory per CEC, and up to 12 CECs per frame, a
total of 3072 processor cores and 24 TB of memory per frame is possible. The system is highly
scalable: clustering 2048 CEC drawers together yields up to 524,288 POWER7 cores to
do the work to solve the most challenging problems. A total of 7.86 TF per CEC and 94.4 TF
per rack highlights the capabilities of this high-performance computing solution.
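As a quick cross-check of these peak figures, the following small C program (an illustration
only; every input value is taken from the numbers quoted in this section) recomputes the
per-chip, per-CEC, per-rack, and maximum system values.

#include <stdio.h>

int main(void)
{
    const double ghz = 3.84;          /* POWER7 clock rate quoted in this section */
    const int cores_per_chip = 8;     /* eight cores per POWER7 chip              */
    const int flops_per_core = 4 * 2; /* four FPUs per core, two FLOPs per cycle  */
    const int chips_per_cec = 32;     /* 256 cores per CEC / 8 cores per chip     */
    const int cecs_per_rack = 12;     /* up to 12 CEC drawers per frame           */
    const int max_drawers = 2048;     /* maximum cluster size in CEC drawers      */

    double gflops_chip = ghz * cores_per_chip * flops_per_core;  /* ~245.8 GFLOPs */
    double tflops_cec  = gflops_chip * chips_per_cec / 1000.0;   /* ~7.86 TF      */
    double tflops_rack = tflops_cec * cecs_per_rack;             /* ~94.4 TF      */
    long   max_cores   = (long)max_drawers * chips_per_cec * cores_per_chip;

    printf("per chip : %.1f GFLOPs\n", gflops_chip);
    printf("per CEC  : %.2f TF\n", tflops_cec);
    printf("per rack : %.1f TF\n", tflops_rack);
    printf("max cores: %ld\n", max_cores);
    return 0;
}

Compiled and run, the program prints approximately 245.8 GFLOPs per chip, 7.86 TF per
CEC, 94.4 TF per rack, and 524,288 cores for a 2048-drawer cluster, matching the figures
above.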
The hardware is only as good as the software that runs on it. IBM AIX, IBM Parallel
Environment (PE) Runtime Edition, LoadLeveler, GPFS, and xCAT are a few of the components
of the supported software stack for the solution. For more information, see 1.9, “High Performance Computing
software stack” on page 62.
1.2 The IBM Power 775 cluster components
The IBM Power 775 can consist of the following components:
Compute subsystem:
– Diskless nodes dedicated to perform computational tasks
– Customized operating system (OS) images
– Applications
Storage subsystem:
– I/O node (diskless)
– OS images for IO nodes
– SAS adapters attached to the Disk Enclosures (DE)
– General Parallel File System (GPFS)
Cluster network subsystem:
– Host Fabric Interface (HFI):
• Busses from processor modules to the switching hub in an octant
• Local links (LL-links) between octants
• Local remote links (LR-links) between drawers in a SuperNode
• Distance links (D-links) between SuperNodes
– Operating system drivers
– IBM User space protocol
– AIX and Linux IP drivers
Octants, SuperNode, and other components are described in the other sections of this book.
Node types
The following node types provide different functions for the cluster. In the
context of the 9125-F2C drawer, a node is an OS image that is booted in an LPAR. There
are three general designations for node types on the 9125-F2C. Often these functions are
dedicated to a node, but a node can have multiple roles:
– Compute nodes
Compute nodes run parallel jobs and perform the computational functions. These
nodes are diskless and booted across the HFI network from a Service Node. Most of
the nodes are compute nodes.
– IO nodes
These nodes are attached to either the Disk Enclosure in the physical cluster or
external storage. These nodes serve the file system to the rest of the cluster.
– Utility Nodes
A Utility node offers services to the cluster. These nodes often feature more resources,
such as external Ethernet adapters and external or internal storage. The following Utility nodes
are required:
• Service nodes: Run xCAT to serve the operating system to local diskless nodes
• Login nodes: Provides a centralized login to the cluster
– Optional utility node:
• Tape subsystem server
Important: xCAT stores all system definitions as node objects, including the required EMS
console and the HMC console. However, the consoles are external to the 9125-F2C cluster
and are not referred to as cluster nodes. The HMC and EMS consoles are physically
running on specific, dedicated servers. The HMC runs on a System x® based machine
(7042 or 7310) and the EMS runs on a POWER 750 Server. For more information, see
1.7.1, “Hardware Management Console” on page 53 and 1.7.2, “Executive Management
Server” on page 53.
1.3 Advantages and new features of the IBM Power 775
The IBM Power Systems 775 (9125-F2C) has several new features that make this system
even more reliable, available, and serviceable.
Fully redundant power, cooling, and management; dynamic processor deallocation; memory
chip and lane sparing; and concurrent maintenance are the main reliability, availability,
and serviceability (RAS) features.
The system is water-cooled, which provides 100% heat capture. Some components are cooled
by small fans, but the Rear Door Heat exchanger captures this heat.
Because most of the nodes are diskless nodes, the service nodes provide the operating
system to the diskless nodes. The HFI network also is used to boot the diskless utility nodes.
The Power 775 Availability Plus (A+) feature allows immediate recovery from processor,
switching hub, and HFI cable failures because additional resources are available in the system.
These resources fail in place and no hardware must be replaced until a specified threshold is
reached. For more information, see 5.4, “Power 775 Availability Plus” on page 297.
The IBM Power 775 cluster solution provides High Performance Computing clients with the
following benefits:
Sustained performance and low energy consumption for climate modeling and forecasting
Massive scalability for cell and organism process analysis in life sciences
Memory capacity for high-resolution simulations in nuclear resource management
Space and energy efficient for risk analytics and real-time trading in financial services
1.4 Hardware information
This section provides detailed information about the hardware components of the IBM Power
775. Within this section, there are links to IBM manuals and external sources for more
information.
1.4.1 POWER7 chip
The IBM Power System 775 implements the POWER7 processor technology. The PowerPC®
Architecture POWER7 processor is designed for use in servers that provide solutions with
large clustered systems, as shown in Figure 1-1 on page 5.
Figure 1-1 POWER7 chip block diagram
IBM POWER7 characteristics
This section provides a description of the following characteristics of the IBM POWER7 chip,
as shown in Figure 1-1:
246 GFLOPs per chip:
– Up to eight cores per chip
– Four Floating Point Units (FPU) per core
– Two FLOPs per cycle (fused multiply-add)
– 246 GFLOPs = 8 cores x 3.84 GHz x 4 FPU x 2
32 KB instruction and 32 KB data caches per core
256 KB L2 cache per core
4 MB L3 cache per core
Eight Channels of SuperNova buffered DIMMs:
– Two memory controllers per chip
– Four memory busses per memory controller (1 B wide Write, 2 B wide Read each)
CMOS 12S SOI 11 level metal
Die size: 567 mm2
Architecture
PowerPC architecture
IEEE New P754 floating point compliant
Big endian, little endian, strong byte ordering support extension
46-bit real addressing, 68-bit virtual addressing
Off-chip bandwidth: 336 GBps:
– Local plus remote interconnect
Memory capacity: Up to 128 GB per chip
Memory bandwidth: 128 GBps peak per chip
1.9 GHz Frequency
(8) 16 B data bus, 2 address snoop, 21 on/off ramps
Asynchronous interface to chiplets and off-chip interconnect
Differential memory controllers (2)
6.4-GHz interface to SuperNOVA (SN)
DDR3 support, maximum 1067 MHz
Minimum Memory 2 channels, 1 SN/channel
Maximum Memory 8 channels X 1 SN/channel
2 Ports/Super Nova
8 Ranks/Port
X8b and X4b devices supported
PowerBus Off-Chip Interconnect
1.5 to 2.9 Gbps single ended EI-3
2 spare bits/bus
Max 256-way SMP
32-way optimal scaling
Four 8-B Intranode Buses (W, X, Y, or Z)
All buses run at the same bit rate
All capable of running as a single 4B interface; the location of the 4B interface within the
8 B is fixed
Hub chip attaches via W, X, Y or Z
Three 8-B Internode Buses (A, B, C)
C-bus is multiplexed with GX and operates only as an aggregate data bus (that is, address
and command traffic is not supported)
Buses
Table 1-1 describes the POWER7 busses.
Table 1-1 POWER7 busses
Bus name     Width (speed)                                           Connects                       Function
W, X, Y, Z   8B+8B with 2 extra bits per bus (3 Gbps)                Intranode processors and hub   Used for address and data
A, B         8B+8B with 2 extra bits per bus (3 Gbps)                Other nodes within drawer      Data only
C            8B+8B with 2 extra bits per bus (3 Gbps)                Other nodes within drawer      Data only, multiplexed with GX
Mem1-Mem8    2B Read + 1B Write with 2 extra bits per bus (2.9 GHz)  Processor to memory
WXYZABC Busses
The off-chip PowerBus supports up to seven coherent SMP links (WXYZABC) by using
Elastic Interface 3 (EI-3) signaling at up to 3 Gbps. The intranode WXYZ links connect up to
four processor chips to make a 32-way SMP and connect a Hub chip to each processor chip.
The WXYZ links carry coherency traffic and data and are interchangeable as intranode
processor links or Hub links. The internode AB links connect up to two nodes per processor
chip. The AB links carry coherency traffic and data and are interchangeable with each other.
The AB links also are configured as aggregate data-only links. The C link is configured only
as a data-only link.
All seven coherent SMP links (WXYZABC) are configured as 8 bytes or 4 bytes in width.
The WXYZABC busses include the following features:
Four (WXYZ) 8-B or 4-B EI-3 Intranode Links
Two (AB) 8-B or 4-B EI-3 Internode Links or two (AB) 8-B or 4-B EI-3 data-only Links
One (C) 8-B or 4-B EI-3 data-only Link
PowerBus
The PowerBus is responsible for coherent and non-coherent memory access, IO operations,
interrupt communication, and system controller communication. The PowerBus provides all of
the interfaces, buffering, and sequencing of command and data operations within the storage
subsystem. The POWER7 chip has up to seven PowerBus links that are used to connect to
other POWER7 chips, as shown in Figure 1-2 on page 8.
The PowerBus link is an 8-Byte-wide (or optional 4-Byte-wide), split-transaction, multiplexed,
command and data bus that supports up to 32 POWER7 chips. The bus topology is a
multitier, fully connected topology to reduce latency, increase redundancy, and improve
concurrent maintenance. Reliability is improved with ECC on the external I/Os.
Data transactions are always sent along a unique point-to-point path. A route tag travels with
the data to help routing decisions along the way. Multiple data links are supported between
chips that are used to increase data bandwidth.
Figure 1-2 POWER7 chip layout
Figure 1-3 on page 9 shows the POWER7 core structure.
Figure 1-3 Microprocessor core structural diagram
Reliability, availability, and serviceability features
The microprocessor core includes the following reliability, availability, and serviceability (RAS)
features:
POWER7 core:
– Instruction retry for soft core logic errors
– Alternate processor recovery for hard core errors detected
– Processor limited checkstop for other errors
– Protection key support for AIX
L1 I/D Cache Error Recovery and Handling:
– Instruction retry for soft errors
– Alternate processor recovery for hard errors
– Guarding of core for core and L1/L2 cache errors
L2 Cache:
– ECC on L2 and directory tags
– Line delete for L2 and directory tags (seven lines)
– L2 UE handling includes purge and refetch of unmodified data
– Predictive dynamic guarding of associated cores
L3 Cache:
– ECC on data
– Line delete mechanism for data (seven lines)
– L3UE handling includes purges and refetch of unmodified data
– Predictive dynamic guarding of associated cores for CEs in L3 not managed by the line
deletion
1.4.2 I/O hub chip
This section provides information about the IBM Power 775 I/O hub chip (or Torrent chip), as
shown in Figure 1-4.
Figure 1-4 Hub chip (Torrent)
Host fabric interface
The host fabric interface (HFI) provides a non-coherent interface between a quad-chip
module (QCM), which is composed of four POWER7 chips, and the cluster network.
Figure 1-5 on page 11 shows two instances of the HFI in a hub chip. The HFIs also attach to
the Collective Acceleration Unit (CAU).
Each HFI has one PowerBus command and four PowerBus data interfaces, which feature the
following configuration:
1. The PowerBus directly connects to the processors and memory controllers of four
POWER7 chips via the WXYZ links.
2. The PowerBus also indirectly coherently connects to other POWER7 chips within a
256-way drawer via the LL links. Although fully supported by the HFI hardware, this path
provides reduced performance.
3. Each HFI has four ports to the Integrated Switch Router (ISR). The ISR connects to other
hub chips through the D, LL, and LR links.
4. ISRs and D, LL, and LR links that interconnect the hub chips form the cluster network.
POWER7 chips: The set of four POWER7 chips (QCM), its associated memory, and a
hub chip form the building block for cluster systems. A Power 775 system consists of
multiple building blocks that are connected to each other via the cluster network.
Figure 1-5 HFI attachment scheme
Packet processing
The HFI is the interface between the POWER7 chip quads and the cluster network, and is
responsible for moving data between the PowerBus and the ISR. The data is in various
formats, but packets are processed in the following manner:
Send
– Pulls or receives data from PowerBus-attached devices in a POWER7 chip
– Translates data into network packets
– Injects network packets into the cluster network via the ISR
Receive
– Receives network packets from the cluster network via the ISR
– Translates them into transactions
– Pushes the transactions to PowerBus-attached devices in a POWER7 chip
Packet ordering
– The HFIs and cluster network provide no ordering guarantees among packets. Packets
that are sent from the same source window and node to the same destination window
and node might reach the destination in a different order.
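Because the fabric provides no ordering guarantees, software layered above the HFI (for
example, a user space protocol or message passing library) must tolerate out-of-order
delivery. The following C fragment is a minimal, hypothetical sketch (the structure names
and window size are illustrative, not part of the HFI interface) of the usual technique: the
sender numbers each packet and the receiver delivers packets to the application only when
they are contiguous.

#include <stdint.h>

#define WINDOW 64                   /* hypothetical reorder window size          */

struct packet {
    uint32_t seq;                   /* sequence number assigned by the sender    */
    char     payload[2048];         /* up to the 2 KB maximum packet size        */
};

static struct packet slot[WINDOW];  /* packets that arrived early                */
static int           present[WINDOW];
static uint32_t      next_expected; /* next sequence number to deliver in order  */

/* Called for every packet as it arrives, in whatever order the network delivered
 * it; deliver() hands packets to the application strictly in sequence order. */
void on_arrival(const struct packet *p, void (*deliver)(const struct packet *))
{
    slot[p->seq % WINDOW] = *p;
    present[p->seq % WINDOW] = 1;

    /* Drain every packet that is now contiguous with the delivered stream. */
    while (present[next_expected % WINDOW] &&
           slot[next_expected % WINDOW].seq == next_expected) {
        deliver(&slot[next_expected % WINDOW]);
        present[next_expected % WINDOW] = 0;
        next_expected++;
    }
}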
Figure 1-6 shows two HFIs cooperating to move data from devices that are attached to one
PowerBus to devices attached to another PowerBus through the Cluster Network.
Figure 1-6 HFI moving data from one quad to another quad
HFI paths: The path between any two HFIs might be indirect, thus requiring multiple hops
through intermediate ISRs.
1.4.3 Collective acceleration unit
The hub chip provides specialized hardware that is called the Collective Acceleration Unit
(CAU) to accelerate frequently used collective operations.
Collective operations
Collective operations are distributed operations that operate across a tree. Many HPC
applications perform collective operations in which the application makes forward progress
only after every compute node has completed its contribution and the results of the collective
operation are delivered back to every compute node (for example, barrier synchronization
and global sum).
A specialized arithmetic-logic unit (ALU) within the CAU implements reduction, barrier, and
broadcast operations. For reduce operations, the ALU supports the following
operations and data types:
Fixed point: NOP, SUM, MIN, MAX, OR, AND, and XOR (signed and unsigned)
Floating point: MIN, MAX, SUM, and PROD (single and double precision)
There is one CAU in each hub chip, which is one CAU per four POWER7 chips, or one CAU
per 32 C1 cores.
Software organizes the CAUs in the system into collective trees. For a broadcast operation,
the arrival of an input on one link causes it to be forwarded on all other links. For a reduce
operation, arrivals on all but one link cause the reduction result to be forwarded on the
remaining link.
A link in the CAU tree maps to a path composed of more than one link in the network. The
system supports many trees simultaneously and each CAU supports 64 independent trees.
The usage of sequence numbers and a retransmission protocol enables reliability and
pipelining. Each tree has only one participating HFI window on any involved node. The order
in which the reduction operation is evaluated is preserved from one run to another, which
benefits programming models that allow programmers to require that collective operations are
executed in a particular order, such as MPI.
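The operations that the CAU accelerates are the same collectives that MPI applications
issue. The following minimal C/MPI sketch shows a barrier synchronization and a global
sum; it is standard MPI code rather than a CAU-specific interface, and whether the
collective is offloaded to the CAU depends on the runtime configuration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Barrier synchronization: no rank proceeds until all ranks arrive. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Global sum: every rank contributes one value and receives the total. */
    local = (double)(rank + 1);
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", nranks, global);

    MPI_Finalize();
    return 0;
}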
Packet propagation
As shown in Figure 1-7 on page 14, a CAU receives packets from the following sources:
The memory of a remote node, which is inserted into the cluster network by the HFI of the
remote node
The memory of a local node, which is inserted into the cluster network by the HFI of the
local node
A remote CAU
Figure 1-7 CAU packets received by CAU
As shown in Figure 1-8 on page 15, a CAU sends packets to the following locations:
The memory of a remote node that is written to memory by the HFI of the remote node.
The memory of a local node that is written to memory by the HFI of the local node.
A remote CAU.
Figure 1-8 CAU packets sent by CAU
1.4.4 Nest memory management unit
The Nest Memory Management Unit (NMMU) that is in the hub chip facilitates user-level
code to operate on the address space of processes that execute on other compute nodes.
The NMMU enables user-level code to create a global address space from which the NMMU
performs operations. This facility is called global shared memory.
A process that executes on a compute node registers its address space, thus permitting
interconnect packets to manipulate the registered shared region directly. The NMMU
references a page table that maps effective addresses to real memory. The hub chip also
maintains a cache of the mappings and maps the entire real memory of most installations.
Incoming interconnect packets that reference memory, such as RDMA packets and packets
that perform atomic operations, contain an effective address and information that pinpoints
the context in which to translate the effective address. This feature greatly facilitates
global-address space languages, such as Unified Parallel C (UPC), co-array Fortran, and
X10, by permitting such packets to contain easy-to-use effective addresses.
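As an illustration of the global shared memory style of access that the NMMU supports,
the following C sketch uses standard MPI one-sided (RDMA-style) calls to read memory that
another process registered. It shows only the programming model that benefits from this
facility; it is not the NMMU or HFI interface itself.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double buf = 0.0, remote = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank exposes one double; creating the window registers the memory
       so that other ranks can access it directly (RDMA-style). */
    buf = 100.0 + rank;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Rank 0 reads the value exposed by the last rank without that rank
       issuing a matching receive. */
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Get(&remote, 1, MPI_DOUBLE, nranks - 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("value read from rank %d: %f\n", nranks - 1, remote);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}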
1.4.5 Integrated switch router
The integrated switch router (ISR) replaces the external switching and routing functions that
are used in prior networks. The ISR is designed to dramatically reduce cost and improve
performance in bandwidth and latency.
A direct graph network topology connects up to 65,536 POWER7 eight-core processor chips
with a two-level routing hierarchy of L and D busses.
Each hub chip ISR connects to four POWER7 chips via the HFI controller and the W busses.
The Torrent hub chip and its four POWER7 chips are called an octant. Each ISR octant is
directly connected to seven other octants on a drawer via the wide on-planar L-Local busses
and to 24 other octants in three more drawers via the optical L-Remote busses.
A
Supernode is the fully interconnected collection of 32 octants in four drawers. Up to 512
Supernodes are fully connected via the 16 optical D busses per hub chip. The ISR is
designed to support smaller systems with multiple D busses between Supernodes for higher
bandwidth and performance.
The ISR logically contains input and output buffering, a full crossbar switch, hierarchical route
tables, link protocol framers/controllers, interface controllers (HFI and PB data), Network
Management registers and controllers, and extensive RAS logic that includes link replay
buffers.
The Integrated Switch Router supports the following features:
Target cycle time up to 3 GHz
Target switch latency of 15 ns
Target GUPS: ~21 K. ISR assisted GUPs handling at all intermediate hops (not software)
Target switch crossbar bandwidth greater than 1 TB per second input and output:
– 96 Gbps WXYZ-busses (4 @ 24 Gbps) from P7 chips (unidirectional)
– 168 Gbps local L-busses (7 @ 24 Gbps) between octants in a drawer (unidirectional)
– 144 Gbps optical L-busses (24 @ 6 Gbps) to other drawers (unidirectional)
– 160 Gbps D-busses (16 @ 10 Gbps) to other Supernodes (unidirectional)
Two-tiered full-graph network
Virtual Channels for deadlock prevention
Cut-through Wormhole routing
Routing Options:
– Full hardware routing
– Software-controlled indirect routing by using hardware route tables
Multiple indirect routes that are supported for data striping and failover
Multiple direct routes by using LR and D-links supported for less than a full-up system
Maximum supported packet size is 2 KB. Packet size varies from 1 to 16 flits, with each flit
being 128 bytes
Routing Algorithms:
– Round Robin: Direct and Indirect
– Random: Indirect routes only
IP Multicast with central buffer and route table and supports 256 Bytes or 2 KB packets
Global Hardware Counter implementation and support and includes link latency counts
LCRC on L and D busses with link-level retry support for handling transient errors and
includes error thresholds.
ECC on local L and W busses, internal arrays, and busses and includes Fault Isolation
Registers and Control Checker support
Performance Counters and Trace Debug support
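The direct and indirect routing options above can be pictured with a small conceptual
sketch: a packet between supernodes takes a direct D-link when one exists, or is routed
through an intermediate supernode otherwise. The following C fragment is only an
illustration of that idea under stated assumptions (the d_link table and the function are
hypothetical and are not the ISR route-table logic).

#include <stdlib.h>

#define MAX_SUPERNODES 512

/* d_link[a][b] is nonzero when a healthy D-link connects supernodes a and b.
 * In a full-up system the graph is fully connected; the table is left empty
 * here because this is only a conceptual sketch. */
static int d_link[MAX_SUPERNODES][MAX_SUPERNODES];

/* Pick the next supernode hop for a packet travelling from src to dst: use the
 * direct D-link when one exists, otherwise pick an intermediate supernode
 * starting from a random offset (an indirect, two-hop route). Returns -1 if
 * no route can be found. */
int next_supernode(int src, int dst, int num_supernodes)
{
    int offset, i;

    if (d_link[src][dst])
        return dst;                            /* direct route              */

    offset = rand() % num_supernodes;          /* randomize indirect choice */
    for (i = 0; i < num_supernodes; i++) {
        int mid = (offset + i) % num_supernodes;
        if (mid != src && mid != dst && d_link[src][mid] && d_link[mid][dst])
            return mid;                        /* indirect route via mid    */
    }
    return -1;
}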
1.4.6 SuperNOVA
SuperNOVA is the second member of the fourth generation of the IBM Synchronous Memory
Interface ASIC. It connects host memory controllers to DDR3 memory devices.
SuperNOVA is used in a planar configuration to connect to Industry Standard (I/S) DDR3
RDIMMs. SuperNOVA also resides on a custom, fully buffered memory module that is called
the SuperNOVA DIMM (SND). Fully buffered DIMMs use a logic device, such as SuperNOVA,
to buffer all signals to and from the memory devices.
As shown in Figure 1-9, SuperNOVA provides the following features:
Cascaded memory channel (up to seven SNs deep) that uses 6.4-Gbps, differential-ended
(DE), unidirectional links.
Two DDR3 SDRAM command and address ports.
Two, 8 B DDR3 SDRAM data ports with a ninth byte for ECC and a tenth byte that is used
as a locally selectable spare.
16 ranks of chip selects and CKE controls (eight per CMD port).
Eight ODT (four per CMD port).
Four differential memory clock pairs to support up to four DDR3 registered dual in-line
memory modules (RDIMMs).
Data Flow Modes include the following features:
Expansion memory channel daisy-chain
4:1 or 6:1 configurable data rate ratio between memory channel and SDRAM domain
Figure 1-9 Memory channel
SuperNOVA uses a high speed, differential ended communications memory channel to link a
host memory controller to the main memory storage devices through the SuperNOVA ASIC.
The maximum memory channel transfer rate is 6.4 Gbps.
The SuperNOVA memory channel consists of two DE, unidirectional links. The downstream
link transmits write data and commands away from the host (memory controller) to the
SuperNOVA. The downstream includes 13 active logical signals (lanes), two more spare
lanes, and a bus clock. The upstream (US), link transmits read data and responses from the
SuperNOVA back to the host. The US includes 20 active logical signals, two more spare
lanes, and a bus clock.
Although SuperNOVA supports a cascaded memory channel topology of multiple chips that
use daisy chained memory channel links, Power 775 does not use this capability.
The links that are connected on the host side are called the Primary Up Stream (PUS) and
Primary Down Stream (PDS) links. The links on the cascaded side are called the Secondary
Up Stream (SUS) and Secondary Down Stream (SDS) links.
The SuperNOVA US and downstream links each include two dedicated spare lanes. One of
these lanes is used to repair either a clock or data connection. The other lane is used only to
repair data signal defects. Each segment (host to SuperNOVA or SuperNOVA to SuperNOVA
connection) of a cascaded memory channel is independently deployed of their dedicated
spares per link. This deployment maximizes the ability to survive multiple interconnect hard
failures. The spare lanes are tested and aligned during initialization but are deactivated during
normal runtime operation. The channel frame format, error detection, and protocols are the
same before and after spare lane invocation. Spare lanes are selected by one of the following
means:
The spare lanes are selected during initialization by loading host and SuperNOVA
configuration registers based on previously logged lane failure information.
The spare lanes are selected dynamically by the hardware during runtime operation by an
error recovery operation that performs the link reinitialization and repair procedure. This
procedure is initiated by the host memory controller and supported by the SuperNOVAs in
the memory channel. During the link repair operation, the memory controller holds back
memory access requests. The procedure is designed to take less than 10 ms to prevent
system performance problems, such as timeouts.
The spare lanes are selected by system control software by loading host or SuperNOVA
configuration registers that are based on the results of the memory channel lane
shadowing diagnostic procedure.
1.4.7 Hub module
The Power 775 hub module provides all the connectivity that is needed to form a clustered
system, as shown in Figure 1-10 on page 19.
Figure 1-10 Hub module diagram
The hub features the following primary functions:
Connects the QCM Processor/Memory subsystem to up to two high-performance 16x
PCIe slots and one high-performance 8x PCI Express slot. This configuration provides
general-purpose I/O and networking capability for the server node.
POWER 775 drawer: In a Power 775 drawer (CEC), Octant 0 has three PCIe slots: two
16x and one 8x (the SR-IOV Ethernet adapter is given priority in the 8x slot). Octants 1-7
have two 16x PCIe slots each.
Connects eight Processor QCMs together by using a low-latency, high-bandwidth,
coherent copper fabric (L-Local buses) that includes the following features:
– Enables a single hypervisor to run across 8 QCMs, which enables a single pair of
redundant service processors to manage 8 QCMs
– Directs the I/O slots that are attached to the eight hubs to the compute power of any of
the eight QCMs that provide I/O capability where needed
– Provides a message passing mechanism with high bandwidth and the lowest possible
latency between eight QCMs (8.2 TFLOPs) of compute power
Connects four Power 775 planars via the L-Remote optical connections to create a 33
TFLOP tightly connected compute building block (SuperNode). The bi-sectional exchange
bandwidth between the four boards is 3 TBps, the same bandwidth as 1500 10 Gb
Ethernet links.
Connects up to 512 groups of four planars (SuperNodes) together via the D optical buses
with ~3 TBps of exiting bandwidth per planar.
Optical links
The Hub modules that are on the node board house optical transceivers for up to 24
L-Remote links and 16 D-Links. Each optical transceiver includes a jumper cable that
connects the transceiver to the node tailstock. The transceivers are included to facilitate cost
optimization, depending on the application. The supported options are shown in Table 1-2.
Table 1-2 Supported optical link options
SuperNode type           L-Remote links   D-Links                    Number of combinations
SuperNodes not enabled   0                0-16 in increments of 1    17
Full SuperNodes          24               0-16 in increments of 1    17
Total                                                                34
Some customization options are available on the hub optics module, which allow some optical
transceivers to remain unpopulated on the Torrent module if the wanted topology does not
require all of the transceivers. The number of actual offering options that are deployed
depends on specific large customer bids.
Optics physical package
The optics physical package includes the following features:
Individual transmit (Tx) and Receive (Rx) modules that are packaged in Tx+Rx pairs on
glass ceramic substrate.
Up to 28 Tx+Rx pairs per module.
uLGA (Micro-Land Grid Array) at 0.7424 mm pitch interconnects optical modules to
ceramic substrate.
12-fiber optical fiber ribbon on top of each Tx and each Rx module, which is coupled
through Prizm reflecting and spheric-focusing 12-channel connectors.
Copper saddle over each optical module and optical connector for uLGA actuation and
heat spreading.
Heat spreader with springs and thermal interface materials that provide uLGA actuation
and heat removal separately for each optical module.
South (rear) side of each glass ceramic module carries 12 Tx+Rx optical pairs that
support 24 (6+6) fiber LR-links, and 2 Tx+Rx pairs that support 2 (12+12) fiber D-links.
North (front) side of each glass ceramic module carries 14 Tx+Rx optical module pairs
that support 14 (12+12) fiber D-links
Optics electrical interface
The optics electrical interface includes the following features:
12 differential pairs @ 10 Gb per second (24 signals) for each TX optics module
12 differential pairs @ 10 Gb per second (24 signals) for each RX module
Three wire I2C/TWS (Serial Data & Address, Serial Clock, Interrupt): three signals
Cooling
Cooling includes the following features:
Optics are water-cooled with Hub chip
Cold plate on top of module, which is coupled to optics through heat reader and saddles,
with thermal interface materials at each junction
Recommended temperature range: 20C – 55C at top of optics modules
Optics drive/receive distances
Optics links might be up to 60 meters rack-to-rack (61.5 meters, including inside-drawer
optical fiber ribbons).
Reliability assumed
The following reliability features are assumed:
10 FIT rate per lane.
D-link redundancy. Each (12+12)-fiber D-link runs normally with 10 active lanes and two
spares. Each D-link runs in degraded-Bandwidth mode with as few as eight lanes.
LR-link redundancy: Each (6+6)-fiber LR-link runs normally with six active lanes. Each
LR-link (half of a Tx+Rx pair) runs in degraded-bandwidth mode with as few as four lanes
out of six lanes.
Overall redundancy: As many as four lanes out of each 12 (two lanes of each six) might
fail without disabling any D-links or LR-links.
Expect to allow one failed lane per 12 lanes in manufacturing.
Bit Error Rate: Worst-case, end-of-life BER is 10^-12. Normal expected BER is 10^-18
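To put these numbers in perspective, the following short C calculation (assuming the
10 Gbps lane rate quoted earlier in this section) converts the worst-case and expected bit
error rates into an average time between bit errors on a single lane.

#include <stdio.h>

int main(void)
{
    const double lane_rate = 10e9;    /* bits per second per optical lane */
    const double ber_worst = 1e-12;   /* worst-case, end-of-life BER      */
    const double ber_nominal = 1e-18; /* normal expected BER              */

    /* Mean time between bit errors = 1 / (BER x bit rate). */
    double t_worst = 1.0 / (ber_worst * lane_rate);      /* ~100 seconds  */
    double t_nominal = 1.0 / (ber_nominal * lane_rate);  /* ~3.2 years    */

    printf("worst case : one bit error every %.0f s per lane\n", t_worst);
    printf("nominal    : one bit error every %.1f years per lane\n",
           t_nominal / (365.25 * 24 * 3600));
    return 0;
}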
1.4.8 Memory subsystem
The memory controller layout is shown in Figure 1-11.
Figure 1-11 Memory controller layout
The memory cache sizes are shown in Table 1-3.
Table 1-3 Cache memory sizes
Cache level   Memory size (per core)
L1            32 KB Instruction, 32 KB Data
L2            256 KB
L3            4 MB eDRAM
The memory subsystem features the following characteristics:
Memory capacity: Up to 128 GB/processor
Memory bandwidth: 128 GB/s (peak)/processor
Eight channels of SuperNOVA buffered DIMMs/processor
Two memory controllers per processor:
– Four memory busses per memory controller
– Each bus is 1 B-wide Write, 2 B-wide Read
Memory per drawer
Each drawer features the following minimum and maximum memory ranges:
Minimum:
– 4 DIMMs per QCM x 8 QCM per drawer = 32 DIMMs per drawer
– 32 DIMMs per drawer x 8 GB per DIMM = 256 GB per drawer
Maximum:
– 16 DIMMs per QCM x 8 QCM per drawer = 128 DIMMs per drawer
– 128 DIMMs per drawer x 16 GB per DIMM = 2 TB per drawer
Memory DIMMs
Memory DIMMs include the following features:
Two SuperNOVA chips each with a bus connected directly to the processor
Two ports on the DIMM from each SuperNova
Dual CFAM interfaces from the processor to each DIMM, wired to the primary SuperNOVA
and dual chained to the secondary SuperNOVA on the DIMM
Two VPD SEEPROMs on the DIMM interfaced to the primary SuperNOVA CFAM
80 DRAM sites - 2 x 10 (x8) DRAM ranks per SuperNova Port
Water cooled jacketed design
50 watt max DIMM power
Available in sizes: 8 GB, 16 GB, and 32 GB (RPQ)
For best performance, it is recommended that all 16 DIMM slots are plugged in each node. All
DIMMs driven by a quad-chip module (QCM) must have the same size, speed, and voltage
rating.
1.4.9 Quad chip module
The previous sections provided a brief introduction to the low-level components of the Power
775 system. We now look at the system on a modular level. This section discusses the
quad-chip module or QCM, which contains four POWER7 chips that are connected in a
ceramic module.
The standard Power 775 CEC drawer contains eight QCMs. Each QCM contains four, 8-core
POWER7 processor chips and supports 16 DDR3 SuperNova buffered memory DIMMs.
Figure 1-12 on page 24 shows the POWER7 quad chip module which contains the following
characteristics:
D-Link:
– A = number of D-Link transceivers; A = 16
– B = number of transmitter lanes; B = 10
– C = number of receiver lanes; C = 10
– D = bit rate per lane; D = 10 Gb/s
– E = 8:10 coding; E = 8/10
– BW = A x (B + C) x D x E = 16 x (10 + 10) x 10 x 8/10 = 2560 Gb/s = 320 GB/s (20 GB/s per D-Link)
L-remote Link:
– A = number of L-Link transceivers; A = 12
– B = number of transmitter lanes; B = 10
– C = number of receiver lanes; C = 10
– D = bit rate per lane; D = 10 Gb/s
– E = 8:10 coding; E = 8/10
– BW = A x (B + C) x D x E = 12 x (10 + 10) x 10 x 8/10 = 1920 Gb/s = 240 GB/s (20 GB/s per L-Link)
L-local Link:
– A = number of L-Links; A = 7
– B = number of transmitter bits per bus; B = 64
– C = number of receiver bits per bus; C = 64
– D = bit rate per lane; D = 3 Gb/s
– E = framing efficiency; E = 22/24
– BW = A x (B + C) x D x E = 7 x (64 + 64) x 3 x 22/24 = 2464 Gb/s = 308 GB/s (44 GB/s per L-local link)
Memory (per POWER7 chip): 128 GB/s peak, with 16 GB/s peak per SuperNOVA (read 10.67 GB/s per SuperNOVA, write 5.33 GB/s per SuperNOVA)
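The three link bandwidths quoted above all follow the same formula, BW = A x (B + C) x D x E. The following Python sketch, with values taken from the list above, reproduces the figures:

def link_bw_gbps(a_links, tx_lanes, rx_lanes, rate_gbps, efficiency):
    # Aggregate bandwidth in Gb/s for one hub's links of a given type.
    return a_links * (tx_lanes + rx_lanes) * rate_gbps * efficiency

d_link   = link_bw_gbps(16, 10, 10, 10, 8 / 10)   # 2560 Gb/s = 320 GB/s
l_remote = link_bw_gbps(12, 10, 10, 10, 8 / 10)   # 1920 Gb/s = 240 GB/s
l_local  = link_bw_gbps(7, 64, 64, 3, 22 / 24)    # 2464 Gb/s = 308 GB/s

for name, gbps in (("D-Link", d_link), ("L-remote", l_remote), ("L-local", l_local)):
    print(f"{name}: {gbps:.0f} Gb/s = {gbps / 8:.0f} GB/s")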
Each Power 775 planar consists of eight octants, as shown in Figure 1-14 on page 26. Seven
of the octants are composed of 1x QCM, 1 x HUB, 2 x PCI Express 16x. The other octant
contains 1x QCM, 1x HUB, 2x PCI Express 16x, 1x PCI Express 8x.
Figure 1-14 Octant layout differences
– 1280 DRAM sites (80 DRAM sites per DIMM, 1/2/4 Gb DRAMs over time)
– Two processor buses and four 8B memory buses per DIMM (two buffers)
– Double stacked at the extreme with memory access throttling
For more information, see 1.4.8, “Memory subsystem” on page 22.
I/O: 1 TB/s BW:
– 32 Gbps Generic I/O:
• Two PCIe2 16x line rate capable busses
• Expanded I/O slot count through PCIe2 expansion possible
IBM Proprietary Fabric with On-board copper/Off-board Optical:
– Excellent cost / performance (especially at mid-large scale)
– Basic technology can be adjusted for low or high BW applications
Packaging:
– Water Cooled (>95% at level shown), distributed N+1 Point Of Load Power
– High wire count board with small Vertical Interconnect Accesses (VIAs)
– High pin count LGA module sockets
– Hot Plug PCIe Air Cooled I/O Adapter Design
– Fully Redundant Out-of-band Management Control
1.4.11 Interconnect levels
A functional Power 775 system consists of multiple nodes that are spread across several
racks. This configuration means multiple octants are available in which every octant is
connected to every other octant on the system. The following levels of interconnect are
available on a system:
First level
This level connects the eight octants in a node together via the hub module by using
copper board wiring. This interconnect level is referred to as “L” local (LL). Every octant in
the node is connected to every other octant. For more information, see 1.4.12, “Node” on
page 28.
Second level
This level connects four nodes together to create a Supernode. This interconnection is
possible via the hub module optical links. This interconnection level is referred to as “L”
distant (LD). Every octant in a node must connect to every other octant in the other three
nodes that form the Supernode. Every octant features 24 connections, but the total
number of connections across the four nodes in a Supernode is 384. For more
information, see 1.4.13, “Supernodes” on page 30.
Third level
This level connects every Supernode to every other Supernode in a system. This
interconnection is possible via the hub module optical links. This interconnect level is
referred to as D-link. Each Supernode has up to 512 D-links. It is possible to scale up this
level to 512 Supernodes. Every Supernode has a minimum of one hop D-link to every
other Supernode. For more information, see 1.4.14, “Power 775 system” on page 32.
1.4.12 Node
This section discusses the node level of the Power 775, physically represented by the drawer,
also commonly referred to as the CEC. A node is composed of eight octants and their local
interconnect. Figure 1-15 shows the CEC drawer from the front.
Figure 1-15 CEC drawer front view
Figure 1-16 shows the CEC drawer rear view.
Figure 1-16 CEC drawer rear view
First level interconnect: L Local
L Local (LL) connects the eight octants in the CEC drawer together via the HUB module by
using copper board wiring. Every octant in the node is connected to every other octant, as
shown in Figure 1-17.
Figure 1-17 First level local interconnect (256 cores)
System planar board
This section provides details about the following system planar board characteristics:
Approximately 2U x 85 cm wide x 131 cm deep overall node package in a 30 EIA frame
Eight octants; each octant features one QCM, one hub, and 4 - 16 DIMMs
128 memory slots
17 PCI adapter slots (octant 0 has three PCI slots, octants 1 - 7 each have two PCI slots)
Regulators on the bottom side of planar directly under modules to reduce loss and
decoupling capacitance
Water-cooled stiffener to cool regulators on bottom side of planar and memory DIMMs on
top of board
Connectors on rear of board to optional PCI cards (17x)
Connectors on front of board to redundant 2N DCCAs
Optical fiber D-Link interface cables from hub modules to the left and right of the rear tail stock (128 total = 16 links x 8 hub modules)
Optical fiber L-remote interface cables from hub modules to the center of the rear tail stock (96 total = 24 links x 8 hub modules)
Clock distribution & out-of-band control distribution from DCCA.
Redundant service processor
The redundant service processor features the following characteristics:
The clocking source follows the topology of the service processor.
N+1 redundancy of the service processor and clock source logic use two inputs on each
processor and HUB chip.
Out-of-band signal distribution for memory subsystem (SuperNOVA chips) and PCI
express slots are consolidated to the standby powered pervasive unit on the processor
and HUB chips. PCI express is managed on the Hub and Bridge chips.
1.4.13 Supernodes
This section describes the concept of Supernodes.
Supernode configurations
The following supported Supernode configurations are used in a Power 775 system. The
usage of each type is based on cluster size or application requirements:
Four-drawer Supernode
The four-drawer Supernode is the most common Supernode configuration. This
configuration is formed by four CEC drawers (32 octants) connected via Hub optical links.
Single-drawer Supernode
In this configuration, each CEC drawer in the system is a Supernode.
Second level interconnect: L Remote
This level connects four CEC drawers (32 octants) together to create a Supernode via Hub
module optical links. Every octant in a node connects to every other octant in the other three
nodes in the Supernode. There are 384 connections in this level, as shown in Figure 1-18 on
page 31.
The second level wiring connector count is shown in Figure 1-19 on page 32.
Figure 1-19 Second level wiring connector count
Step 1
Each octant in Node 1 must be connected to the 8 octants in Node 2, the 8 octants in Node 3, and the 8 octants in Node 4. This requires 24 connections from each of the 8 octants in Node 1, so every octant has 24 connections from it, which results in 8 x 24 = 192 connections.
Step 2
Step 1 connected Node 1 to every other octant in the Supernode. Node 2 must now be connected to the remaining nodes (Nodes 3 and 4) in the Supernode. This requires 16 connections from each of the 8 octants in Node 2, which results in 8 x 16 = 128 connections.
Step 3
Steps 1 and 2 connected Node 1 and Node 2 to every other octant in the Supernode. Node 3 must now be connected to Node 4. Every octant in Node 3 needs 8 connections to the 8 octants in Node 4, which results in 8 x 8 = 64 connections. At this point, every octant in the Supernode is connected to every other octant in the Supernode.
Step 4
The total number of connections to build a Supernode is 192 + 128 + 64 = 384. Every octant has 24 connections, but the total number of connections across the four nodes in a Supernode is 384.
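The 384-connection total can also be verified by counting the octant pairs that span different drawers of a four-drawer Supernode; a small Python sketch:

from itertools import combinations

DRAWERS_PER_SUPERNODE = 4
OCTANTS_PER_DRAWER = 8

octants = [(drawer, octant)
           for drawer in range(DRAWERS_PER_SUPERNODE)
           for octant in range(OCTANTS_PER_DRAWER)]

# L-remote connections only join octants that sit in different drawers.
connections = sum(1 for a, b in combinations(octants, 2) if a[0] != b[0])

print(connections)              # 384
print(8 * 24 + 8 * 16 + 8 * 8)  # 384, matching Steps 1 - 3 above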
1.4.14 Power 775 system
This section describes the Power 775 system and provides details about the third level of interconnect.
Third level interconnect: Distance
This level connects every Supernode to every other Supernode in a system by using Hub
module optical links. Each Supernode includes up to 512 D-links, which allows for a system that
contains up to 512 Supernodes. Every Supernode features a minimum of one hop D-link to
every other Supernode, and there are multiple two hop connections, as shown in Figure 1-20
on page 33.
Each HUB contains 16 optical D-Links. The physical node (board) contains eight HUBs;
therefore, a physical node (board) contains 16 x 8 = 128 optical D-Links. A Supernode is
four physical nodes, which results in 16 x 8 x 4 = 512 optical D-Links per Supernode. This
configuration allows up to 2048 connected CEC drawers.
In smaller configurations, in which the system features fewer than 512 Supernodes, more than
one optical D-Link between Supernodes is possible. Multiple connections between Supernodes are used
for redundancy and higher bandwidth solutions.
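A minimal Python sketch of the D-link counts that are described above:

DLINKS_PER_HUB = 16
HUBS_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4

dlinks_per_drawer = DLINKS_PER_HUB * HUBS_PER_DRAWER               # 128
dlinks_per_supernode = dlinks_per_drawer * DRAWERS_PER_SUPERNODE   # 512

MAX_SUPERNODES = 512                                      # one D-link to every other supernode
max_cec_drawers = MAX_SUPERNODES * DRAWERS_PER_SUPERNODE  # 2048 connected CEC drawers

print(dlinks_per_drawer, dlinks_per_supernode, max_cec_drawers)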
Figure 1-20 System third level interconnect
Integrated cluster fabric interconnect
A complete Power 775 system configuration is achieved by configuring server nodes into a
tight cluster by using a fully integrated switch fabric.
The fabric is a multitier, hierarchical implementation that connects eight logical nodes
(octants) together in the physical node (server drawer or CEC) by using copper L-local links.
Four physical nodes are connected with structured optical cabling into a Supernode by using
optical L-remote links. Up to 512 super nodes are connected by using optical D-links.
Figure 1-21 on page 34 shows a logical representation of a Power 775 cluster.
Figure 1-21 Logical view of a Power 775 system
Figure 1-22 shows an example configuration of a 242 TFLOP Power 775 cluster that uses
eight Supernodes and direct graph interconnect. In this configuration, there are 28 D-Link
cable paths to route and 1-64 12-lane 10 Gb D-Link cables per cable path.
Figure 1-22 Direct graph interconnect example
A 4,096 core (131 TF), fully interconnected system is shown in Figure 1-23.
Figure 1-23 Fully interconnected system example
Network topology
Optical D-links connect Supernodes in different connection patterns. Figure 1-24 on page 36
shows an example of 32 D-links between each pair of supernodes. Topology is 32D, a
connection pattern that supports up to 16 supernodes.
Figure 1-24 Supernode connection using 32D topology
Figure 1-25 shows another example in which there is one D-link between supernode pairs,
which supports up to 512 supernodes in a 1D topology.
Figure 1-25 1D network topology
The network topology is specified during the installation. A topology specifier is set up in the
cluster database. In the cluster DB site table, topology=<specifier>. Table 1-4 shows the
supported four-drawer and single-drawer Supernode topologies.
Table 1-4 Supported four-drawer and single-drawer Supernode topologies
Topology       Maximum number of supernodes
256D           3
128D           5
64D            8
32D            16
16D            32
8D             64
4D             128
2D             256
1D             512

Single-drawer Supernode topologies
2D_SDSN        48
8D_SDSN        12
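The table can be treated as a simple lookup from topology specifier to maximum supernode count; the following Python sketch (a lookup illustration only, not an xCAT interface) uses the values from Table 1-4:

FOUR_DRAWER_TOPOLOGIES = {
    "256D": 3, "128D": 5, "64D": 8, "32D": 16, "16D": 32,
    "8D": 64, "4D": 128, "2D": 256, "1D": 512,
}
SINGLE_DRAWER_TOPOLOGIES = {"2D_SDSN": 48, "8D_SDSN": 12}

def max_supernodes(topology: str) -> int:
    # Look up the maximum supernode count for a given topology specifier.
    return {**FOUR_DRAWER_TOPOLOGIES, **SINGLE_DRAWER_TOPOLOGIES}[topology]

print(max_supernodes("32D"))      # 16
print(max_supernodes("2D_SDSN"))  # 48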
ISR network routes
Each ISR includes a set of hardware route tables. The Local Network Management Controller
(LNMC) routing code generates and maintains the routes with help from the Central Network
Manager (CNM), as shown in Figure 1-26 on page 38. These route tables are set up during
system initialization and are dynamically adjusted as links go down or come up during
operation. Packets are injected into the network with a destination identifier and the route
mode. The route information is picked up from the route tables along the route path based
on this information. Packets that are injected into the interconnect by the HFI employ
source route tables with the route partially determined. Per-port route tables are used to route
packets along each hop in the network. Separate route tables are used for intersupernode
and intrasupernode routes.
Routes are classified as direct or indirect. A direct route uses the shortest path between any
two compute nodes in a system. There are multiple direct routes between a set of compute
nodes because a pair of supernodes are connected by more than one D-link.
The network topology features two levels and therefore the longest direct route has three
hops (no more than two L hops and at most one D hop). This configuration is called an L-D-L
route.
The following conditions exist when source and destination hubs are within a drawer:
The route is one L-hop (assuming all of the links are good).
LNMC needs to know only the local link status in this CEC.
Figure 1-26 Routing within a single CEC
The following conditions exist when source and destination hubs lie within a supernode, as
shown in Figure 1-27:
Route is one L-hop (every hub within a supernode is directly connected via an L-remote link to
every other hub in the supernode).
LNMC needs to know only the local link status in this CEC.
Figure 1-27 L-hop
If an L-remote link is faulty, the route requires two hops. However, only the link status local to
the CEC is needed to construct routes, as shown in Figure 1-28.
Figure 1-28 Route representation in the event of a faulty L-remote link
When source and destination hubs lie in different supernodes, as shown in Figure 1-29, the
following conditions exist:
Route possibilities: one D-hop, or L-D (L-D-L routes also are used)
LNMC needs non-local link status to construct L-D routes
Figure 1-29 L-D route example
The ISR also supports indirect routes to provide increased bandwidth and to prevent hot
spots in the interconnect. An indirect route is a route with an intermediate compute node
that resides on a different supernode from the supernodes of the source and destination
compute nodes. An indirect route must employ the shortest path from the source
compute node to the intermediate node, and the shortest path from the intermediate compute
node to the destination compute node. The longest indirect route has five hops at most:
no more than three hops are L hops and at most two hops are D hops. This
configuration often is represented as an L-D-L-D-L route.
The following methods are used to select a route when multiple routes exist:
Software specifies the intermediate supernode, but the hardware determines how to route
to and then route from the intermediate supernode.
The hardware selects among the multiple routes in a round-robin manner for both direct
and indirect routes.
The hub chip provides support for route randomization in which the hardware selects one
route between a source–destination pair. Hardware-directed randomized route selection is
available only for indirect routes.
These routing modes are specified on a per-packet basis.
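The following Python sketch is an illustration only (not the ISR hardware implementation) of the per-packet route-mode choices that are described above: round-robin selection over the available routes, and randomized selection, which the hardware offers only for indirect routes. The route strings and intermediate supernode names are illustrative placeholders.

import itertools
import random

def round_robin(routes):
    # Cycle through the available routes, one route per packet.
    return itertools.cycle(routes)

def randomized(routes, indirect=True):
    # Pick one route at random; hardware randomization applies to indirect routes only.
    if not indirect:
        raise ValueError("randomized selection is available only for indirect routes")
    return random.choice(routes)

direct_routes = ["D", "L-D", "L-D-L"]
picker = round_robin(direct_routes)
print([next(picker) for _ in range(5)])          # D, L-D, L-D-L, D, L-D

indirect_routes = ["L-D-L-D-L via SN7", "L-D-L-D-L via SN9"]
print(randomized(indirect_routes))               # one of the two, chosen at random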
The correct choice between the use of direct-route versus indirect-route modes depends on the
communication pattern that is used by the application. Direct routing is suitable for
communication patterns in which each node must communicate with many other nodes, for
example, by using spectral methods. Communication patterns that involve small numbers of compute
nodes benefit from the extra bandwidth that is offered by the multiple routes with indirect
routing.
1.5 Power, packaging, and cooling
This section provides information about the IBM Power Systems 775 power, packaging, and
cooling features.
1.5.1 Frame
The front view of an IBM Power Systems 775 frame is shown in Figure 1-30.
Figure 1-30 Power 775 frame
The Power 775 frame front view is shown in Figure 1-31 on page 41.
Figure 1-31 Frame front view
The rear view of the Power 775 frame is shown in Figure 1-32 on page 42.
Figure 1-32 Frame rear photo
1.5.2 Bulk Power and Control Assembly
Each Bulk Power and Control Assembly (BPCA) is a modular unit that includes the following
features:
Bulk Power and Control Enclosure (BPCE)
Contains two 125A 3-phase AC couplers, six BPR bays, one BPCH and one BPD bay.
Bulk Power Regulators (BPR), each rated at 27 KW@-360 VDC
One to six BPRs are populated in the BPCE depending on CEC drawers and storage
enclosures in frame, and the type of power cord redundancy that is wanted.
Bulk Power Control and Communications Hub (BPCH)
This unit provides rack-level control of the server nodes, storage enclosures, and water
conditioning units (WCUs), and concentration of the communications interfaces to the
unit-level controllers that are in each server node, storage enclosure, and WCU.
Bulk Power Distribution (BPD)
This unit distributes 360 VDC to server nodes and disk enclosures.
Power Cords
One per BPCE for populations of one to three BPRs/BPCE and two per BPCE for four to
six BPRs/BPCE.
The front and rear views of the BPCA are shown in Figure 1-33.
Figure 1-33 BPCA
The minimum configuration per BPCE is one BPCH, one BPD, one BPR, and one line
cord. There are always two BPCAs above one another in the top of the cabinet.
BPRs are added uniformly to each BPCE depending on the power load in the rack.
A single fully configured BPCE provides 27 KW x 6 = 162 KW of bulk power, which equates
to an aggregate system power cord power of approximately 170 KW. Up to this power level, bulk
power is in a 2N arrangement, where a single BPCE can be removed entirely for concurrent
maintenance while the rack remains fully operational. If the rack bulk power demand
exceeds 162 KW, the bulk power provides an N+1 configuration of up to 27 KW x 9 = 243 KW,
which equates to an aggregate system power cord power of approximately 260 KW. N+1 bulk
power mode means that one of the four power cords can be disconnected and the cabinet continues
to operate normally. BPCE concurrent maintenance is not conducted in N+1 bulk power
mode unless the rack bulk power load is reduced to less than 162 KW by invoking Power
Efficient Mode on the server nodes. This mode reduces the peak power demand at the
expense of reduced performance.
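A short Python sketch of the bulk power arithmetic above:

BPR_KW = 27                       # each Bulk Power Regulator is rated at 27 KW

two_n_limit_kw = BPR_KW * 6       # 162 KW: one fully configured BPCE (2N arrangement)
n_plus_one_limit_kw = BPR_KW * 9  # 243 KW: N+1 arrangement if demand exceeds 162 KW

print(two_n_limit_kw, n_plus_one_limit_kw)   # 162 243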
Cooling
The BPCA is nearly entirely water cooled. Each BPR and the BPD has two quick connects to
enable connection to the supply and return water cooling manifolds in the cabinet. All of the
components that dissipate any significant power in the BPR and BPD are heat sunk to a
water-cooled cold plate in these units.
To ensure that the ambient air temperature internal to the BPR, BPCH, and BPD enclosures is kept
low, two hot-pluggable blowers are installed in the rear of each BPCA in an N+1 speed-controlled
arrangement. These blowers flush the units to keep the internal temperature at
approximately the system inlet air temperature, which is 40 degrees C maximum. A fan can be
replaced concurrently.
Management
The BPCH provides the mechanism to connect the cabinet to the management server via a
1 Gb Ethernet out-of-band network. Each BPCH has two management server-facing 1 Gb
Ethernet ports or buses so that the BPCH connects to a fully redundant network.
1.5.3 Bulk Power Control and Communications Hub
The front view of the bulk power control and communications hub (BPCH) is shown in
Figure 1-34.
Figure 1-34 BPCH front view
The following connections are shown in Figure 1-34:
T2: 10/100 Mb Ethernet from HMC1.
T3: 10/100 Mb Ethernet from HMC2.
T4: EPO.
T5: Cross Power.
T6-T9: RS422 UPIC for WCUs.
T10: RS422 UPIC port for connection of the Fill and Drain Tool.
T19/T36: 1 Gb Ethernet for HMC connectors (T19, T36).
2x – 10/100 Mb Ethernet port to plug in a notebook while the frame is serviced.
2x – 10/100 Mb spare (connectors contain both eNet and ½ Duplex RS422).
T11-T19, T20-T35, T37-T44: 10/100 Mb Ethernet ports or buses to the management
processors in CEC drawers and storage enclosures. Max configuration supports 12 CEC
drawers and one storage enclosure in the frame (connectors contain both eNet and ½
Duplex RS422).
1.5.4 Bulk Power Regulator
This section describes the bulk power regulator (BPR).
Input voltage requirements
The BPR supports DC and AC input power for the Power 775. A single design accommodates
both the AC and DC range with different power cords for various voltage options.
DC requirements
The BPR features the following DC requirements:
The Bulk Power Assembly (BPA) is capable of operating over a range of 300 to 600 VDC
Nominal operating DC points are 375 VDC and 575 VDC
AC requirements
The BPR features the following AC requirements:
Three-phase input and GND (no neutral) at 50 to 60 Hz; the high-voltage input range is 380 to 480 Vac
Acceptable voltage tolerance @ machine power cord: 180 - 259 Vac (low-voltage range) or 333 - 508 Vac (high-voltage range)
Acceptable frequency tolerance @ machine power cord: 47 - 63 Hz
1.5.5 Water conditioning unit
The Power 775 WCU system is shown in Figure 1-35.
Figure 1-35 Power 775 water conditioning unit system
The hose and manifold assemblies and WCUs are shown in Figure 1-36.
Figure 1-36 Hose and manifold assemblies
The components of the WCU are shown in Figure 1-37 on page 47.
The WCU contains a dual float sensor assembly, a pressure relief valve and vacuum breaker, ball valves and quick connects on the system supply and return lines (to and from the electronics), chilled water supply and return connections, a proportional control valve, a flow meter, a check valve that is integrated into the reservoir tank, a reservoir tank, a plate heat exchanger, and a pump/motor assembly.
Figure 1-37 WCU components
The WCU schematics are shown in Figure 1-38.
Figure 1-38 WCU schematics
1.6 Disk enclosure
This section describes the storage disk enclosure for the Power 775 system.
1.6.1 Overview
The Power 775 disk enclosure features the following characteristics:
SAS Expander Chip (SEC):
– # PHYs: 38
– Each PHY capable of SAS SDR or DDR
384 SFF DASD drives:
– 96 carriers with four drives each
– 8 Storage Groups (STOR 1-8) with 48 drives each:
• 12 carriers per STOR
• Two Port Cards per STOR each with 3 SECs
– 32 SAS x4 ports (four lanes each) on 16 Port Cards.
Data rates:
– Serial Attached SCSI (SAS) SDR = 3.0 Gbps per lane (SEC to drive)
– Serial Attached SCSI (SAS) DDR = 6.0 Gbps per lane (SAS adapter in node to SEC)
The drawer supports 10 K/15 K rpm drives in 300 GB or 600 GB sizes.
A Joint Test Action Group (JTAG) interface is provided from the DC converter assemblies
(DCAs) to each SEC for error diagnostics and boundary scan.
Important: STOR is the short name for storage group (it is not an acronym).
The front view of the disk enclosure is shown in Figure 1-39 on page 49.
Figure 1-39 Disk enclosure front view
1.6.2 High-level description
Figure 1-40 on page 50 represents the top view of a disk enclosure and highlights the front
view of a STOR. Each STOR includes 12 carrier cards (six at the top of the drawer and six at
the bottom of the drawer) and two port cards.
Figure 1-40 Storage drawer top view
The disk enclosure is a SAS storage drawer that is specially designed for the IBM Power 775
system. The maximum storage capacity of the drawer is 230.4 TB, distributed over 384 SFF
DASD drives logically organized in eight groups of 48.
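The drive count and the maximum raw capacity follow from the carrier layout; a small Python sketch using the figures above:

CARRIERS = 96
DRIVES_PER_CARRIER = 4
STOR_GROUPS = 8
MAX_DRIVE_GB = 600

drives = CARRIERS * DRIVES_PER_CARRIER            # 384 drives per enclosure
drives_per_stor = drives // STOR_GROUPS           # 48 drives per STOR
max_capacity_tb = drives * MAX_DRIVE_GB / 1000    # 230.4 TB

print(drives, drives_per_stor, max_capacity_tb)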
The disk enclosure features two mid-plane boards that comprise the inner core assembly. The
disk drive carriers, port cards, and power supplies plug into the mid-plane boards. There are
four Air Moving Devices (AMD) in the center of the drawer. Each AMD consists of three
counter-rotating fans.
Each carrier contains connectors for four disk drives. The carrier features a solenoid latch that
is released only through a console command to prevent accidental unseating. The disk
carriers also feature LEDs close to each drive and a gold capacitor circuit so that drives are
identified for replacement after the carrier is removed for service.
Each port card includes four SAS DDR 4x ports (four lanes at 6 Gbps per lane). These incoming
SAS lanes connect to the input SEC, which directs the SAS traffic to the drives. Each drive is
connected to one of the output SECs on the port card with SAS SDR 1x (one lane at 3 Gbps).
There are two port cards per STOR. The first Port card connects to the A ports of all 48 drives
in the STOR. The second Port card connects to the B ports of all 48 drives in the STOR. The
port cards include soft switches for all 48 drives in the STOR (5 V and 12 V soft switches
connect and interrupt and monitor power). The soft switch is controlled by I2C from the SAS
Expander Chip (SEC) on the port card.
A fully cabled drawer includes 36 cables: four UPIC power cables and 32 SAS cables from
SAS adapters in the CEC. During service to replace a power supply, two UPIC cables
manage the current and power control of the entire drawer. During service of a port card, the
second port card in the STOR remains cabled to the CEC so that the STOR remains
operational. A customer minimum configuration is two SAS cables per STOR and four UPIC
power cables per drawer to ensure proper redundancy.
1.6.3 Configuration
A disk enclosure must reside in the same frame as the CEC to which it is cabled. A frame
might contain up to six Disk Enclosures. The disk enclosure front view is shown in
Figure 1-41.
Figure 1-41 Disk Enclosure front view
The disk enclosure internal view is shown in Figure 1-42 on page 52.
Figure 1-42 Disk Enclosure internal view
The disk carrier is shown in Figure 1-43.
Figure 1-43 Disk carrier
The disk enclosure includes the following features:
A disk enclosure is one-quarter, one-half, three-quarters, or fully populated with HDDs. The
disk enclosure is always populated with eight SSDs.
Disk enclosure contains two GPFS recovery groups (RGs). The carriers that hold the
disks of the RGs are distributed throughout all of the STOR domains in the drawer.
A GPFS recovery group consists of four SSDs and one to four declustered arrays (DAs) of
47 disks each.
Each DA contains distributed spare space that is two disks in size.
Every DA in a GPFS system must be the same size.
The granularity of capacity and throughput is an entire DA.
RGs in the GPFS system do not need to be the same size.
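The recovery group figures above account exactly for the 384 drive slots of a fully populated enclosure; a short Python sketch of that accounting (which assumes the four SSDs of each recovery group occupy drive slots alongside the declustered arrays):

RECOVERY_GROUPS = 2
SSDS_PER_RG = 4
MAX_DAS_PER_RG = 4
DISKS_PER_DA = 47
SPARE_DISKS_PER_DA = 2          # distributed spare space in each DA

drives_per_rg = SSDS_PER_RG + MAX_DAS_PER_RG * DISKS_PER_DA   # 4 + 188 = 192
total_drives = RECOVERY_GROUPS * drives_per_rg                # 384, a full enclosure
spare_space_per_rg = MAX_DAS_PER_RG * SPARE_DISKS_PER_DA      # 8 disks of spare space

print(drives_per_rg, total_drives, spare_space_per_rg)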
1.7 Cluster management
The cluster management hardware that supports the Cluster is placed in 42 U, 19-inch racks.
The cluster management requires Hardware Management Consoles (HMCs), redundant
Executive Management Servers (EMS), and the associated Ethernet network switches.
1.7.1 Hardware Management Console
The HMC runs on a single server and is used to help manage the Power 775 servers. The
traditional HMC functions for configuring and controlling the servers are done via xCAT. For
more information, see 1.9.3, “Extreme Cluster Administration Toolkit” on page 72.
The HMC is often used for the following tasks:
During installation
For reporting hardware serviceable events, especially through Electronic Service Agent™
(ESA), which is also commonly known as call-home
By service personnel to perform guided service actions
An HMC is required for every 36 CECs (1152 LPARs), and all Power 775 systems have
redundant HMCs. For every group of 10 HMCs, a spare HMC is in place. For example, if a
cluster requires four HMCs, five HMCs are present. If a cluster requires 16 HMCs, the cluster
has two HMCs that serve as spares.
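The spare-HMC rule quoted above can be expressed as a one-line calculation; a small Python sketch (redundancy of the HMCs themselves is handled separately):

import math

def hmcs_present(required_hmcs: int) -> int:
    # Required HMCs plus one spare HMC for every group of 10.
    return required_hmcs + math.ceil(required_hmcs / 10)

print(hmcs_present(4))    # 5, as in the example above
print(hmcs_present(16))   # 18 (16 required plus two spares)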
1.7.2 Executive Management Server
The EMS is a standard 4U POWER7 entry-level server responsible for cluster management
activities. EMSs often are redundant; however, a simplex configuration is supported in smaller
Power 775 deployments.
At the cluster level, a pair of EMSs provide the following maximum management support:
512 frames
512 supernodes
2560 disk enclosures
The EMS is the central coordinator of the cluster from a system management perspective.
The EMS is connected to, and manages, all cluster components: the frames and CECs,
HFI/ISR interconnect, I/O nodes, service nodes, and compute nodes. The EMS manages
these components through the entire lifecycle, including discovery, configuration, deployment,
monitoring, and updating via private network Ethernet connections. The cluster administrator
uses the EMS as their primary management cluster control point. The service nodes, HMCs,
and Flexible Service Processors (FSPs) are mostly transparent to the system administrator,
and therefore the cluster appears to be a single, flat cluster, despite the hierarchical
management infrastructure that is deployed by using xCAT.
1.7.3 Service node
Systems management throughout the cluster is a hierarchical structure (see Figure 1-44) to
achieve the scaling and performance necessary for a large cluster size. All the compute and
I/O nodes in a building block are initially booted via the HFI and managed by a dedicated
server that is called a
service node (SN) in the utility CECs.
Figure 1-44 EMS hierarchy
Two service nodes (one for redundancy) per 36 CECs/Drawers (1 - 36) are required for all
Power 775 clusters.
The two service nodes must reside in different frames, except under the following conditions:
If there is only one frame, the nodes must reside in different super nodes in the frame.
If there is only one super node in the frame, the nodes must reside in different CECs in the
super node.
If there are only two or three CEC drawers, the nodes must reside in different CEC
drawers.
If there is only one CEC drawer, the two Service nodes must reside in different octants.
The service node provides diskless boot and an interface to the management network. The
service node requires that a PCIe SAS adapter and two 600 GB PCIe form-factor HDDs (in
RAID1 for redundancy) be installed to support diskless boot. The recommended
location is shown in Figure 1-45. The SAS PCIe adapter must reside in PCIe slot 16 and the HDDs in
slots 15 and 14.
The service node also contains a 1 Gb Ethernet PCIe card that is in PCIe slot 17.
Figure 1-45 Service node
1.7.4 Server and management networks
Figure 1-46 on page 56 shows the logical structure of the two Ethernet networks for the
cluster, which are known as the service network and the management network. In Figure 1-46 on
page 56, the black nets designate the service network and the red nets designate the
management network.
The service network is a private, out-of-band network that is dedicated to managing the
Power 775 cluster hardware. This network provides Ethernet-based connectivity between
the FSP of the CEC, the frame control BPA, the EMS, and the associated HMCs. Two
identical network switches (ENET A and ENET B in the figure) are deployed to ensure high
availability of these networks.
The management network is primarily responsible for booting all nodes (the designated
service nodes, compute nodes, and I/O nodes) and monitoring their OS image loads. This
management network connects the dual EMSs running the system management software
with the various Power 775 servers of the cluster. Both the service and management
networks must be considered private and not routed into the public network of the enterprise
for security reasons.
Figure 1-46 Logical structure of service and management networks
1.7.5 Data flow
This section provides a high-level description of the data flow on the cluster service network
and cluster management operations.
After discovery of the hardware components, their definitions are stored in the xCAT database
on the EMS. HMCs, CECs, and frames are discovered via Service Location Protocol (SLP) by
xCAT. The discovery information includes model and serial numbers, IP addresses, and so
on. The Ethernet switch of the service LAN also is queried to determine which switch port is
connected to each component. This discovery is run again when the system is up, if wanted.
The HFI/ISR cabling also is tested by the CNM daemon on the EMS. The disk enclosures and
their disks are discovered by GPFS services on these dedicated nodes when they are booted
up.
1.7.6 LPARs
The hardware is configured and managed via the service LAN, which connects the EMS to
the HMCs, BPAs, and FSPs.
Management is hierarchical with the EMS at the top, followed by the service nodes, then all
the nodes in their building blocks. Management operations from the EMS to the nodes are
also distributed out through the service nodes. Compute nodes are deployed by using a
service node as the diskless image server.
Monitoring information comes from the sources (frames/CECs, nodes, HFI/ISR fabric, and so
on), flows through the service LAN and cluster LAN back to the EMS, and is logged in the
xCAT database.
The minimum hardware requirement for an LPAR is one POWER7 chip with memory attached
to its memory controller. If an LPAR is assigned to one POWER7 chip, that chip must have
memory on at least one of its memory controllers. If an LPAR is assigned to two, three, or four
POWER7 chips, at least one of those chips must have memory attached to it.
A maximum of one LPAR per POWER7 chip is supported. A single LPAR resides on one, two,
three, or four POWER7 chips. This configuration results in an Octant with the capability to
have one, two, three, or four LPARs. An LPAR cannot span two Octants. With this
configuration, the number of LPARs per CEC (eight Octants) ranges from 8 to 32 (4 x 8):
1 - 4 LPARs per Octant and 8 - 32 LPARs per CEC.
The following LPAR assignments are supported in an Octant:
LPAR with all processors and memory that is allocated to that LPAR
LPARs with 75% of processor and memory resources that are allocated to the first LPAR
and 25% to the second
LPARs with 50% of processor and memory resources that are allocated to each LPAR
LPARs with 50% of processor and memory resources that are allocated to the first LPAR
and 25% to each of the remaining two LPARs
LPARs with 25% of processor and memory resources that are allocated to each LPAR
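The supported octant layouts above can be summarized as fractions of the octant's processor and memory resources; a small Python sketch that checks each layout uses the whole octant and stays within 1 - 4 LPARs:

SUPPORTED_OCTANT_SPLITS = [
    (1.00,),                       # one LPAR with all resources
    (0.75, 0.25),                  # 75% / 25%
    (0.50, 0.50),                  # 50% / 50%
    (0.50, 0.25, 0.25),            # 50% / 25% / 25%
    (0.25, 0.25, 0.25, 0.25),      # four equal LPARs
]

for split in SUPPORTED_OCTANT_SPLITS:
    assert abs(sum(split) - 1.0) < 1e-9   # each layout allocates the whole octant
    assert 1 <= len(split) <= 4           # 1 - 4 LPARs per octant

print(f"{len(SUPPORTED_OCTANT_SPLITS)} supported LPAR layouts per octant")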
Recall that for an LPAR to be assigned, a POWER7 chip and memory that is attached to its
memory controller is required. If either one of the two requirements is not met, that POWER7
is skipped and the LPAR is assigned to the next valid POWER7 in the order.
1.7.7 Utility nodes
This section defines the utility node for all Power 775 frame configurations.
A CEC is defined as a Utility CEC (node) when it has the Management server (Service node)
as an LPAR. Each frame configuration is addressed individually. A single Utility LPAR
supports a maximum of 1536 LPARs, one of which is the Utility LPAR itself (one Utility LPAR
and 1535 other LPARs). Recall that an octant contains four POWER7 chips and a single POWER7
chip contains a maximum of one LPAR; therefore, a CEC contains 8 x 4 = 32 POWER7 chips. This
configuration results in up to 32 LPARs per CEC.
This result of 1536 LPARs translates to the following figures:
1536 POWER7 chips
384 Octants (1536 / 4 = 384)
48 CECs (a CEC can contain up to 32 LPARs; therefore, 1536 / 32 = 48)
There are always redundant utility nodes that reside in different frames when possible. If there
is only one frame and multiple SuperNodes, the utility node resides in different SuperNodes. If
there is only one SuperNode, the two utility nodes reside in different CECs. If there is only one
CEC, the two utility LPARs reside in different Octants in the CEC.
The following defined utility CEC is used in the four-frame, three-frame, two-frame, and
single-frame with 4, 8, and 12 CEC configurations. The single frame with 1 - 3 CECs uses a
different utility CEC definition. These utility CEC definitions are defined in their respective
frame definition sections.
The utility LPAR resides in Octant 0. The LPAR is assigned only to a single POWER7.
Figure 1-47 on page 59 shows the eight Octant CEC and the location of the Management
LPAR. The two Octant and the four Octant CEC might be used as a utility CEC and follows
the same rules as the eight Octant CEC.
Figure 1-47 Eight octant utility node definition
1.7.8 GPFS I/O nodes
Figure 1-48 shows the GPFS Network Shared Disk (NSD) node in Octant 0.
Figure 1-48 GPFS NSD node on octant 0
1.8 Connection scenario between EMS, HMC, and Frame
The network interconnect between the different system components (EMS server, HMC,
Frame) is required for managing, running, maintaining, configuring, and monitoring the
cluster. The management rack for a POWER 775 Cluster houses the different components,
such as the EMS servers (IBM POWER 750), HMCs, network switches, I/O drawers for the
EMS data disks, keyboard, and mouse. The different networks that are used in such an
environment are the management network and the service network (as shown in Figure 1-49
on page 61). The customer network is connected to some components, but for the actual
cluster, only the management and service networks are essential. For more information about
the server and management networks, see 1.7.4, “Server and management networks” on
page 55.
Figure 1-49 Typical cabling scenario for the HMC, the EMS, and the frame
In Figure 1-49, you see the different networks and cabling. Each Frame has two Ethernet
ports on the BPCH to connect the Service Network A and B.
The I/O drawers in which the disks are installed for the EMS Servers also are interconnected.
Therefore, the data is secured with RAID6 and the I/O drawers also are software mirrored.
This means that when one EMS server goes down for any reason, the other EMS server
accesses the data. The EMS servers are redundant in this scenario, but there is no
automated high-availability process for recovery of a failed EMS server.
All actions to activate the second EMS server must be performed manually. There also is no
plan to automate this process. A cluster continues running without the EMS servers (in case
both servers failed). No node fails because of a server failure or an HMC error. When multiple
problems arise simultaneously, there might be a greater need for more intervention, but often
this intervention does not occur under normal circumstances.
1.9 High Performance Computing software stack
The stack spans the Power and P7IH hardware platforms; the network adapters (Galaxy2, IB HCAs, and HFI (GSM)); the IB and HFI-ISR networks; the AIX and Linux operating systems, with kernel space components (HYP, device drivers, IF_LS, and IP) below the user space; GPFS; the LoadLeveler scheduler; and xCAT. The communication path includes the HAL (HFI (GSM) and IB), AIX/OFED IB verbs, and LAPI, which provides reliable FIFO, RDMA, striping over multiple links, failover and recovery, pre-emption, user-space statistics, multi-protocol support, multilink/bonding support, and scalability. Jitter mitigation is based on a synchronized global clock.
1.9.1 Integrated Switch Network Manager
The ISNM subsystem package is installed on the executive management server of a
high-performance computing cluster that consists of IBM Power 775 Supercomputers, and it
contains the network management commands. The Local Network Management Controller
runs on the service processor of each drawer and is shipped with the Power 775.
Network management services
As shown in Figure 1-51 on page 66, the ISNM provides the following services:
ISR network configuration and installation:
– Topology validation
– Miswire detection
– Works with cluster configuration as defined in the cluster database
– Hardware global counter configuration
– Phased installation and Optical Link Connectivity Test (OLCT)
ISR network hardware status:
– Monitors for ISR, HFI, link, and optical module events
– Command line queries to display network hardware status
– Performance counter collection
– Some RMC monitor points (for example, HFI Down)
Network maintenance:
– Set up ISR route tables during drawer power-on
– Thresholds on certain link events, might disable a link
– Dynamically update route tables to reroute around problems or add to routes when
CECs power on
– Maintain data to support software route mode choices, makes the data available to the
OS through PHYP
– Monitor global counter health
Report hardware failures:
– Analysis is performed on the EMS
– Most events that are forwarded to TEAL Event DB and Alert DB
– Link events due to CEC power off/power on are consolidated within CNM to reduce
unnecessary strain on analysis
– Events reported via TEAL to Service Focal Point™ on the HMC
Figure 1-51 ISNM operating environment
MCRSA for the ISNM: IBM offers Machine Control Program Remote Support Agreement
(MCRSA) for the ISNM. This agreement includes remote call-in support for the central
network manager and the hardware server components of the ISNM, and for the local
network management controller machine code.
MCRSA enables a single-site or worldwide enterprise customer to maintain machine code
entitlement to remote call-in support for ISNM throughout the life of the MCRSA.
Figure 1-52 ISNM distributed architecture
A high-level representation of the ISNM distributed architecture is shown in Figure 1-52. An
instance of the Local Network Management Controller (LNMC) software runs on each FSP. Each LNMC
generates routes for the eight hubs in the local drawer specific to the supernode, drawer, and
hub.
A Central Network Manager (CNM) runs on the EMS and communicates with the LNMCs.
Link status and reachability information flows between the LNMC instances and CNM.
Network events flow from LNMC to CNM, and then to Toolkit for Event Analysis and Logging
(TEAL).
Local Network Management Controller
The LNMC present on each node features the following primary functions:
Event management:
– Aggregates local hardware events, local routing events, and remote routing events.
Route management:
– Generates routes that are based on configuration data and the current state of links in
the network.
Hardware access:
– Downloads routes.
– Allows the hardware to be examined and manipulated.
Figure 1-53 on page 68 shows a logical representation of these functions.
Figure 1-53 LNMC functional blocks
The LNMC also interacts with the EMS and with the ISR hardware to support the execution of
vital management functions. Figure 1-54 on page 69 provides a high-level visualization of the
interaction between the LNMC components and other external entities.
Figure 1-54 LNMC external interactions
As shown in Figure 1-54, the following external interactions are featured in the LNMC:
1. Network configuration commands
The primary function of this procedure is to uniquely identify the Power 775 server within
the network. This includes the following information:
– Network topology
– Supernode identification
– Drawer identification within the Supernode
– Frame identification (from BPA via FSP)
– Cage identification (from BPA via FSP)
– Expected neighbors table for mis-wire detection
2. Local network hardware events
All network hardware events flow from the ISR into the LNMC's event management, where
they are examined and acted upon. The following are potential actions that are taken by
event management:
– Threshold checking
– Actions upon hardware
– Event aggregation
– Network status update. Involves route management and CNM reporting.
– Reporting to EMS
3. Network event reporting
Event management examines each local network hardware event and, if appropriate,
forwards the event to the EMS for analysis and reports to the service focal point. Event
management also sends the following local routing events that indicate changes in the link
status or route tables within the local drawer that other LNMCs need to react to:
– Link usability masks (LUM): One per hub in the drawer, indicates whether each link on
that hub is available for routing
– PRT1 and PRT2 validity vectors: One each per hub in the drawer; additional data that is
used in making routing decisions
General changes in LNMC or network status are also reported via this interface.
4. Remote network events
After a local routing event (LUM, PRT1, PRT2) is received by CNM, CNM determines
which other LNMCs need the information to make route table updates, and sends the
updates to the LNMCs.
The events are aggregated together by event management and then passed to route
management. Route management generates a set of appropriate route table updates and
potentially some PRT1 and PRT2 events of its own.
Changed routes are downloaded via hardware access. Event management sends out new
PRT1 and PRT2 events, if applicable.
5. Local hardware management
Hardware access provides the following facilities to both LNMC and CNM for managing
the network hardware:
– Reads and writes route tables
– Reads and writes hardware registers
– Disables and enables ports
– Controls optical link connectivity test
– Allows management of multicast
– Allows management of global counter
– Reads and writes performance counters
6. Centralized hardware management
The following functions are managed centrally by CNM with support from LNMC:
– Global counter
– Multicast
– Port Enable/Disable
Central Network Manager
The CNM daemon waits for events and handles each one as separate transactions. There are
software threads within CNM that handle different aspects of the network management tasks.
The service network traffic flows through another daemon called High Performance
Computing Hardware Server.
Figure 1-55 on page 71 shows the relationships between the CNM software components. The
components are described in the following section.
Figure 1-55 CNM software structure
Communication layer
This layer provides a packet library with methods for communicating with LNMC. The layer
manages incoming and outgoing messages between FSPs and CNM component message
queues.
The layer also manages event aggregation and the virtual connections to the Hardware
Server.
Database component
This component maintains the CNM internal network hardware database and updates the
status fields in this in-memory database to support reporting the status to the administrator.
The component also maintains required reachability information for routing.
Routing component
This component builds and maintains the hardware multicast tree. The component also writes
multicast table contents to the ISR and handles the exchange of routing information between
the LNMCs to support route generation and maintenance.
Global counter component
This component sets up and monitors the hardware global counter. The component also
maintains information about the location of the ISR master counter and configured backups.
Recovery component
The recovery component gathers network hardware events and frame-level events. This
component also logs each event in the CNM_ERRLOG and sends most events to TEAL.
The recovery component also performs some event consolidation to avoid flooding the TEAL
with too many messages in the event of a CEC power up or power down.
Performance counter data management
This data management periodically collects ISR and HFI aggregate performance counters
from the hardware and stores the counters in the cluster database. The collection interval and
amount of data to keep are configurable.
Command handler
This handler is a socket listener to the command module. This handler manages ISNM
commands, such as requests for hardware status, configuration for LNMC, and link
diagnostics.
IBM High Performance Computing Hardware Server
In addition to the CNM software components, the HPC Hardware Server (HWS) handles the
connections to the service network. Its primary function is to manage connections to service
processors and provide an API for clients to communicate with the service processors. HWS
assigns every service processor connection a unique handle that is called a
number
hardware.
(vport). This handle is used by clients to send synchronous commands to the
virtual port
In a Power 775 cluster, HPC HWS runs on the EMS, and on each xCAT service node.
1.9.2 DB2
IBM DB2 Workgroup Server Edition 9.7 for High Performance Computing (HPC) V1.1 is a
scalable, relational database that is designed for use in a local area network (LAN)
environment and provides support for both local and remote DB2 clients. DB2 Workgroup
Server Edition is a multi-user version of DB2 packed with features that are designed to reduce
the overall costs of owning a database. DB2 includes data warehouse capabilities, high
availability function, and is administered remotely from a satellite control database.
The IBM Power 775 Supercomputer cluster solution requires a database to store all of the
configuration and monitoring data. DB2 Workgroup Server Edition 9.7 for HPC V1.1 is
licensed for use only on the executive management server (EMS) of the Power 775
high-performance computing cluster.
The EMS serves as a single point of control for cluster management of the Power 775 cluster.
The Power 775 cluster also includes a backup EMS, service nodes, compute nodes, I/O
nodes, and login nodes. DB2 Workgroup Server Edition 9.7 for HPC V1.1 must be installed
on the EMS and backup EMS.
1.9.3 Extreme Cluster Administration Toolkit
Extreme Cloud Administration Toolkit (xCAT) is an open source, scalable distributed
computing management and provisioning tool that provides a unified interface for hardware
control, discovery, and stateful and stateless operating system deployment. This robust toolkit
is used for the deployment and administration of AIX or Linux clusters, as shown in
Figure 1-56 on page 73.
xCAT makes simple clusters easy and complex clusters possible through the following
features:
Remotely controlling hardware functions, such as power, vitals, inventory, event logs, and
alert processing. xCAT indicates which light path LEDs are lit up remotely.
Managing server consoles remotely via serial console, SOL.
Installing an AIX or Linux cluster with utilities for installing many machines in parallel.
Managing an AIX or Linux cluster with tools for management and parallel operation.
Setting up a high-performance computing software stack, including software for batch job
submission, parallel libraries, and other software that is useful on a cluster.
Creating and managing stateless and diskless clusters.
Figure 1-56 xCAT architecture
xCAT supports both Intel and POWER based architectures, which provide operating system
support for AIX, Linux (RedHat, SuSE, and CentOS), and Windows installations. The following
provisioning methods are available:
Local disk
Stateless (via Linux ramdisk support)
iSCSI (Windows and Linux)
xCAT manages a Power 775 cluster by using a hierarchical distribution that is based on
management and service nodes. A single xCAT management node with multiple service
nodes provides boot services to increase scaling (to thousands and up to tens of thousands
of nodes).
The number of nodes and network infrastructure determine the number of Dynamic Host
Configuration Protocol/Trivial File Transfer Protocol/Hypertext Transfer Protocol
(DHCP/TFTP/HTTP) servers that are required for a parallel reboot without
DHCP/TFTP/HTTP timeouts.
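As a back-of-the-envelope aid only, the sketch below estimates how many provisioning servers a parallel reboot might need. The per-server capacity numbers are hypothetical placeholders, not IBM sizing guidance; real sizing must be derived from the timeout behavior observed on the actual network infrastructure.

import math

def servers_needed(node_count, nodes_per_server):
    """Smallest number of DHCP/TFTP/HTTP servers that keeps the per-server load bounded."""
    return math.ceil(node_count / nodes_per_server)

nodes = 2560                                              # hypothetical cluster size
print("DHCP servers:", servers_needed(nodes, 1000))       # assumed 1000 concurrent leases per server
print("TFTP/HTTP servers:", servers_needed(nodes, 250))   # assumed 250 parallel image pulls per server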
The number of DHCP servers does not need to equal the number of TFTP or HTTP servers.
TFTP servers NFS-mount the /tftpboot and image directories read-only from the management node to provide a consistent set of kernel, initrd, and file system images.
xCAT version 2 provides the following enhancements that address the requirements of a
Power 775 cluster:
Improved ACLs and non-root operator support:
– Certificate-authenticated client/server XML protocol for all xCAT commands
Choice of databases:
– Use a database (DB) like SQLite, or an enterprise DB like DB2 or Oracle
– Stores all of the cluster config data, status information, and events
– Information is stored in DB by other applications and customer scripts
– Data change notification is used to drive automatic administrative operations
Improved monitoring:
– Hardware event and Simple Network Management Protocol (SNMP) alert monitoring
– More HPC stack (GPFS, LL, Torque, and so on) setup and monitoring
Improved RMC conditions (a conceptual sketch of duration-based triggering and event batching follows this list):
– Condition triggers only when it is true for a specified duration
– Batch multiple events into a single invocation of the response
– Micro-sensors: ability to extend RMC monitoring efficiently
– Performance monitoring and aggregation that is based on TEAL and RMC
Automating the deployment process:
– Automate creation of LPARs in every CEC
– Automate set up of infrastructure nodes (service nodes and I/O nodes)
– Automate configuration of network adaptors, assign node names/IDs, IP addresses,
and so on
– Automate choosing and pushing the corresponding operating system and other HPC
software images to nodes
– Automate configuration of the operating system and HPC software so that the system
is ready to use
– Automate verification of the nodes to ensure their availability
Boot nodes with a single shared image among all nodes of a similar configuration
(diskless support)
Allow for deploying the cluster in phases (for example, adding a set of new nodes at a time by using the existing cluster)
Scan the connected networks to discover the various hardware components and firmware
information of interest:
– Uses the standard Service Location Protocol (SLP)
– Finds FSPs, BPAs, and hardware control points
Automatically defines the discovered components to the administration software,
assigning IP addresses, and hostnames
Hardware control (for example, powering components on and off) is automatically
configured
ISR and HFI components are initialized and configured
All components are scanned to ensure that firmware levels are consistent and at the
wanted version
Firmware is updated on all down-level components when necessary
Provide software inventory:
– Utilities to query the software levels that are installed in the cluster
– Utilities to choose updates to be applied to the cluster
With diskless nodes, software updates are applied to the OS image on the server (nodes
apply the updates on the next reboot)
HPC software (LoadLeveler, GPFS, PE, ESSL, Parallel ESSL, compiler libraries, and so
on) is installed throughout the cluster by the system management software
HPC software relies on system management to provide configuration information; system management stores this configuration information in the management database
Uses RMC monitoring infrastructure for monitoring and diagnosing the components of
interest
Continuous operation (rolling update):
– Apply upgrades and maintenance to the cluster with minimal impact on running jobs
– Rolling updates are coordinated with CNM and LL to schedule updates (reboots) to a
limited set of nodes at a time, allowing the other nodes to still be running jobs
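The "Improved RMC conditions" item above can be pictured with the following conceptual Python sketch (this is not RSCT code): a condition fires only after it has been true for a configured duration, and triggered events are batched into a single response invocation. The thresholds, sampling rate, and function names are illustrative assumptions.

import time

DURATION_S = 30          # condition must stay true this long before triggering (assumed value)
BATCH_WINDOW_S = 10      # events gathered into one response invocation (assumed value)

def watch(sample_fn, respond_fn):
    """Poll a condition expression and batch its events into response invocations."""
    true_since = None
    pending = []
    last_flush = time.time()
    while True:
        value = sample_fn()
        now = time.time()
        if value:                        # the condition expression is currently true
            if true_since is None:
                true_since = now
            if now - true_since >= DURATION_S:
                pending.append((now, value))
                true_since = None        # re-arm after the trigger
        else:
            true_since = None
        if pending and now - last_flush >= BATCH_WINDOW_S:
            respond_fn(pending)          # one invocation handles the whole batch
            pending = []
            last_flush = now
        time.sleep(1)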
1.9.4 Toolkit for Event Analysis and Logging
The Toolkit for Event Analysis and Logging (TEAL) is a robust framework for low-level system
event analysis and reporting that supports both real-time and historic analysis of events.
TEAL provides a central repository for low-level event logging and analysis that addresses the
new Power 775 requirements.
The analysis of system events is delivered through alerts. A rules-based engine determines which alerts are delivered, and the TEAL configuration controls the manner in which problem notifications are delivered.
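A minimal sketch of the rules-based idea follows, assuming a simple in-memory event stream; the rule format, event fields, and alert text shown here are illustrative and are not the actual TEAL rule syntax.

# Each rule maps a predicate over low-level events to an alert description.
RULES = [
    (lambda e: e["component"] == "HFI" and e["severity"] == "error",
     "HFI link error detected; check ISNM link diagnostics"),
    (lambda e: e["component"] == "GPFS" and "pdisk" in e["message"],
     "GPFS Native RAID pdisk problem; review disk hospital status"),
]

def analyze(events):
    """Return the alerts produced by running every rule over the event stream."""
    alerts = []
    for event in events:
        for predicate, alert_text in RULES:
            if predicate(event):
                alerts.append({"event": event, "alert": alert_text})
    return alerts

sample = [{"component": "HFI", "severity": "error", "message": "CRC errors on D-link"}]
print(analyze(sample))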
Real-time analysis provides a pro-active approach to system management, and the historical
analysis allows for deeper on-site and off-site debugging.
The primary users of TEAL are the system administrator and operator. The output of TEAL is
delivered to an alert database that is monitored by the administrator and operators through a
series of monitoring methods.
TEAL runs on the EMS and commands are issued via the EMS command line. TEAL
supports the monitoring of the following functions:
For more information about TEAL, see Table 1-6 on page 62.
1.9.5 Reliable Scalable Cluster Technology
Reliable Scalable Cluster Technology (RSCT) is a set of software components that provide a
comprehensive clustering environment for AIX, Linux, Solaris, and Windows. RSCT is the
infrastructure that is used by various IBM products to provide clusters with improved system availability, scalability, and ease of use.
RSCT includes the following components:
Resource monitoring and control (RMC) subsystem
This subsystem is the scalable, reliable backbone of RSCT. RMC runs on a single
machine or on each node (operating system image) of a cluster and provides a common
abstraction for the resources of the individual system or the cluster of nodes. You use
RMC for single system monitoring or for monitoring nodes in a cluster. However, in a
cluster, RMC provides global access to subsystems and resources throughout the cluster,
thus providing a single monitoring and management infrastructure for clusters.
RSCT core resource managers
A resource manager is a software layer between a resource (a hardware or software entity
that provides services to some other component) and RMC. A resource manager maps
programmatic abstractions in RMC into the actual calls and commands of a resource.
RSCT cluster security services
This RSCT component provides the security infrastructure that enables RSCT
components to authenticate the identity of other parties.
Topology services subsystem
This RSCT component provides node and network failure detection on some cluster configurations.
Group services subsystem
This RSCT component provides cross-node/process coordination on some cluster configurations.
1.9.6 GPFS
The IBM General Parallel File System (GPFS) is a distributed, high-performance, massively scalable enterprise file system solution that addresses the most challenging demands in high-performance computing.
GPFS provides online storage management, scalable access, and integrated information
lifecycle management tools capable of managing petabytes of data and billions of files.
Virtualizing your file storage space and allowing multiple systems and applications to share common pools of storage gives you the flexibility to transparently administer the infrastructure without disrupting applications. This configuration improves cost and energy
efficiency and reduces management overhead.
The massive namespace support, seamless capacity and performance scaling, proven reliability features, and flexible architecture of GPFS help your company foster innovation by simplifying your environment and streamlining data workflows for increased efficiency.
GPFS plays a key role in the shared storage configuration for Power 775 clusters. Virtually all
large-scale systems are connected to disk over HFI via GPFS Network Shared Disk (NSD) servers, which are referred to as GPFS I/O nodes or storage nodes in Power 775 terminology. The system interconnect offers higher performance than traditional storage fabrics, is far more scalable, and is RDMA capable.
GPFS includes a Native RAID function that is used to manage the disks in the disk
enclosures. In particular, the disk hospital function is queried regularly to ascertain the health of the disk subsystem. This querying is not always necessary because disk problems that require service are reported as HMC serviceable events and to TEAL.
For more information about GPFS, see Table 1-6 on page 62.
GPFS Native RAID
GPFS Native RAID is a software implementation of storage RAID technologies within GPFS.
By using conventional dual-ported disks in a Just-a-Bunch-Of-Disks (JBOD) configuration,
GPFS Native RAID implements sophisticated data placement and error correction algorithms
to deliver high levels of storage reliability, availability, and performance. Standard GPFS file
systems are created from the NSDs defined through GPFS Native RAID.
This section describes the basic concepts, advantages, and motivations behind GPFS Native
RAID: redundancy codes, end-to-end checksums, data declustering, and administrator
configuration, including recovery groups, declustered arrays, virtual disks, and virtual disk
NSDs.
Overview
GPFS Native RAID integrates the functionality of an advanced storage controller into the
GPFS NSD server. Unlike an external storage controller, in which configuration, LUN
definition, and maintenance are beyond the control of GPFS, GPFS Native RAID takes
ownership of a JBOD array to directly match LUN definition, caching, and disk behavior to
GPFS file system requirements.
Sophisticated data placement and error correction algorithms deliver high levels of storage
reliability, availability, serviceability, and performance. GPFS Native RAID provides a variation
of the GPFS NSD called a virtual disk, or VDisk. Standard NSD clients transparently access the VDisk NSDs of a file system by using the conventional NSD protocol.
GPFS Native RAID includes the following features:
Software RAID: GPFS Native RAID runs on standard AIX disks in a dual-ported JBOD
array, which does not require external RAID storage controllers or other custom hardware
RAID acceleration.
Declustering: GPFS Native RAID distributes client data, redundancy information, and spare space uniformly across all disks of a JBOD. This distribution reduces the rebuild (disk failure recovery process) overhead compared to conventional RAID.
Checksum: An end-to-end data integrity check (by using checksums and version
numbers) is maintained between the disk surface and NSD clients. The checksum
algorithm uses version numbers to detect silent data corruption and lost disk writes.
Data redundancy: GPFS Native RAID supports highly reliable two-fault tolerant and
three-fault-tolerant Reed-Solomon-based parity codes and three-way and four-way
replication.
Large cache: A large cache improves read and write performance, particularly for small
I/O operations.
Arbitrarily sized disk arrays: The number of disks is not restricted to a multiple of the RAID
redundancy code width, which allows flexibility in the number of disks in the RAID array.
Multiple redundancy schemes: One disk array supports VDisks with different redundancy
schemes; for example, Reed-Solomon and replication codes.
Disk hospital: A disk hospital asynchronously diagnoses faulty disks and paths, and
requests replacement of disks by using past health records.
Automatic recovery: Seamlessly and automatically recovers from primary server failure.
Disk scrubbing: A disk scrubber automatically detects and repairs latent sector errors in
the background.
Familiar interface: Standard GPFS command syntax is used for all configuration commands, including maintaining and replacing failed disks.
Flexible hardware configuration: Support of JBOD enclosures with multiple disks
physically mounted together on removable carriers.
Configuration and data logging: Internal configuration and small-write data are
automatically logged to solid-state disks for improved performance.
GPFS Native RAID features
This section describes three key features of GPFS Native RAID and how they work: data redundancy using RAID codes, end-to-end checksums, and declustering.
RAID codes
GPFS Native RAID automatically corrects for disk failures and other storage faults by
reconstructing the unreadable data by using the available data redundancy of either a
Reed-Solomon code or N-way replication. GPFS Native RAID uses the reconstructed data to
fulfill client operations, and in the case of disk failure, to rebuild the data onto spare space.
GPFS Native RAID supports two- and three-fault-tolerant Reed-Solomon codes and three-way and four-way replication, which detect and correct up to two or three concurrent faults. The redundancy code layouts that are supported by GPFS Native RAID, called tracks, are shown in Figure 1-57.
Figure 1-57 Redundancy codes that are supported by GPFS Native RAID
GPFS Native RAID supports two- and three-fault tolerant Reed-Solomon codes, which
partition a GPFS block into eight data strips and two or three parity strips. The N-way
replication codes duplicate the GPFS block on N - 1 replica strips.
GPFS Native RAID automatically creates redundancy information, depending on the
configured RAID code. By using a Reed-Solomon code, GPFS Native RAID equally divides a
GPFS block of user data into eight data strips and generates two or three redundant parity
strips. This configuration results in a stripe or track width of 10 or 11 strips and storage
efficiency of 80% or 73% (excluding user configurable spare space for rebuild).
By using N-way replication, a GPFS data block is replicated N - 1 times, implementing 1 + 2
and 1 + 3 redundancy codes, with the strip size equal to the GPFS block size. Thus, for every
block or strip written to the disks, N replicas of that block or strip are also written. This
configuration results in track width of three or four strips and storage efficiency of 33% or
25%.
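The storage-efficiency figures quoted above follow directly from the track widths; the short calculation below reproduces them (user-configurable spare space is excluded, as in the text).

def efficiency(data_strips, redundancy_strips):
    """Fraction of a track that holds user data."""
    return data_strips / (data_strips + redundancy_strips)

for name, d, r in [("8 + 2p Reed-Solomon", 8, 2),
                   ("8 + 3p Reed-Solomon", 8, 3),
                   ("3-way replication (1 + 2)", 1, 2),
                   ("4-way replication (1 + 3)", 1, 3)]:
    print(f"{name}: track width {d + r}, efficiency {efficiency(d, r):.0%}")
# Prints 80%, 73%, 33%, and 25%, matching the figures in the text.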
End-to-end checksum
Most implementations of RAID codes implicitly assume that disks reliably detect and report
faults, hard read errors, and other integrity problems. However, studies show that disks do not report some read faults and occasionally fail to write data, even though they report that the data was written.
These errors are often referred to as
silent errors, phantom-writes, dropped-writes, or
off-track writes. To compensate for these shortcomings, GPFS Native RAID implements an
end-to-end checksum that detects silent data corruption that is caused by disks or other
system components that transport or manipulate the data.
When an NSD client is writing data, a checksum of 8 bytes is calculated and appended to the
data before it is transported over the network to the GPFS Native RAID server. On reception,
GPFS Native RAID calculates and verifies the checksum. GPFS Native RAID stores the data,
a checksum, and version number to disk and logs the version number in its metadata for
future verification during read.
When GPFS Native RAID reads disks to satisfy a client read operation, it compares the disk
checksum against the disk data and the disk checksum version number against what is stored
in its metadata. If the checksums and version numbers match, GPFS Native RAID sends the
data along with a checksum to the NSD client. If the checksum or version numbers are
invalid, GPFS Native RAID reconstructs the data by using parity or replication and returns the
reconstructed data and a newly generated checksum to the client. Thus, both silent disk read
errors and lost or missing disk writes are detected and corrected.
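A minimal sketch of the end-to-end scheme follows, assuming an 8-byte checksum and a simple in-memory "disk"; the hash choice, record layout, and function names are illustrative assumptions, not the actual GPFS Native RAID format.

import hashlib

def checksum8(data: bytes) -> bytes:
    """8-byte checksum appended to the data by the NSD client (illustrative hash choice)."""
    return hashlib.blake2b(data, digest_size=8).digest()

disk = {}          # block id -> (data, checksum, version) as written to the disk surface
metadata = {}      # block id -> version number logged by the server for later verification

def server_write(block_id, data, csum):
    assert checksum8(data) == csum, "corruption detected on the network path"
    version = metadata.get(block_id, 0) + 1
    disk[block_id] = (data, csum, version)
    metadata[block_id] = version

def server_read(block_id):
    data, csum, version = disk[block_id]
    if checksum8(data) != csum or version != metadata[block_id]:
        raise IOError("silent corruption or lost write; rebuild from parity/replication")
    return data, csum

payload = b"GPFS block"
server_write(42, payload, checksum8(payload))
print(server_read(42)[0])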
Declustered RAID
Compared to conventional RAID, GPFS Native RAID implements a sophisticated data and
spare space disk layout scheme that allows for arbitrarily sized disk arrays and reduces the
overhead to clients that are recovering from disk failures. To accomplish this configuration,
GPFS Native RAID uniformly spreads or declusters user data, redundancy information, and
spare space across all the disks of a declustered array. A conventional RAID layout is
compared to an equivalent declustered array in Figure 1-58 on page 80.
Figure 1-58 Conventional RAID versus declustered RAID layouts
Figure 1-58 shows an example of how GPFS Native RAID improves client performance
during rebuild operations by using the throughput of all disks in the declustered array. This is
illustrated by comparing a conventional RAID of three arrays versus a declustered array, both
using seven disks. A conventional 1-fault-tolerant 1 + 1 replicated RAID array is shown with
three arrays of two disks each (data and replica strips) and a spare disk for rebuilding. To
decluster this array, the disks are divided into seven tracks, two strips per array. The strips
from each group are then spread across all seven disk positions, for a total of 21 virtual
tracks. The strips of each disk position for every track are then arbitrarily allocated onto the
disks of the declustered array (in this case, by vertically sliding down and compacting the
strips from above). The spare strips are uniformly inserted, one per disk.
As illustrated in Figure 1-59 on page 81, a declustered array significantly shortens the time
that is required to recover from a disk failure, which lowers the rebuild overhead for client
applications. When a disk fails, erased data is rebuilt by using all of the operational disks in
the declustered array, the bandwidth of which is greater than the fewer disks of a conventional
RAID group. If another disk fault occurs during a rebuild, the number of impacted tracks that
require repair is markedly less than the previous failure and less than the constant rebuild
overhead of a conventional array.
The rebuild impact and client overhead of a declustered array might be three to four times lower than those of a conventional RAID. Because GPFS stripes client data across all the
storage nodes of a cluster, file system performance becomes less dependent upon the speed
of any single rebuilding storage array.
Figure 1-59 Lower rebuild overhead in conventional RAID versus declustered RAID
When a single disk fails in the 1-fault-tolerant 1 + 1 conventional array on the left, the
redundant disk is read and copied onto the spare disk, which requires a throughput of seven
strip I/O operations. When a disk fails in the declustered array, all replica strips of the six
impacted tracks are read from the surviving six disks and then written to six spare strips, for a
throughput of two strip I/O operations. Figure 1-59 compares the resulting disk read and write I/O throughput during the rebuild operations.
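A worked version of the seven-disk example follows, assuming the 1 + 1 replication layout of Figure 1-58; the strip counts below simply restate the comparison in the text as arithmetic.

# Per-disk rebuild I/O load (the bottleneck that client applications feel):

# Conventional 1 + 1 array: the surviving partner disk and the spare carry the whole rebuild.
strips_on_failed_disk = 7                 # one strip per track on the failed disk
per_disk_conventional = strips_on_failed_disk        # 7 strip I/Os on each involved disk

# Declustered array: the six surviving disks share the work.
impacted_tracks = 6                       # tracks that lost a strip on the failed disk
total_strip_ios = 2 * impacted_tracks     # 6 replica reads + 6 spare-strip writes
surviving_disks = 6
per_disk_declustered = total_strip_ios / surviving_disks   # about 2 strip I/Os per disk

print(per_disk_conventional, per_disk_declustered)   # 7 versus 2.0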
Disk configurations
This section describes recovery group and declustered array configurations.
Recovery groups
GPFS Native RAID divides disks into recovery groups in which each disk is physically
connected to two servers: primary and backup. All accesses to any of the disks of a recovery
group are made through the active primary or backup server of the recovery group.
Building on the inherent NSD failover capabilities of GPFS, when a GPFS Native RAID server
stops operating because of a hardware fault, software fault, or normal shutdown, the backup
GPFS Native RAID server seamlessly assumes control of the associated disks of its recovery
groups.
Typically, a JBOD array is divided into two recovery groups that are controlled by different
primary GPFS Native RAID servers. If the primary server of a recovery group fails, control
automatically switches over to its backup server. Within a typical JBOD, the primary server for
a recovery group is the backup server for the other recovery group.
Figure 1-60 illustrates the ring configuration where GPFS Native RAID servers and storage
JBODs alternate around a loop. A particular GPFS Native RAID server is connected to two
adjacent storage JBODs and vice versa. The ratio of GPFS Native RAID servers to storage JBODs is thus one-to-one. The load on servers increases by 50% when a server fails.
Figure 1-60 GPFS Native RAID server and recovery groups in a ring configuration
Declustered arrays
A declustered array is a subset of the physical disks (pdisks) in a recovery group across
which data, redundancy information, and spare space are declustered. The number of disks
in a declustered array is determined by the RAID code-width of the VDisks that are housed in
the declustered array. One or more declustered arrays can exist per recovery group.
Figure 1-61 on page 83 illustrates a storage JBOD with two recovery groups, each with four
declustered arrays.
A declustered array can hold one or more VDisks. Because redundancy codes are associated with VDisks rather than with declustered arrays, a declustered array can simultaneously contain Reed-Solomon and replicated VDisks.
If the storage JBOD supports multiple disks that are physically mounted together on
removable carriers, removal of a carrier temporarily disables access to all of the disks in the
carrier. Thus, pdisks on the same carrier must not be in the same declustered array, as VDisk
redundancy protection is weakened upon carrier removal.
Declustered arrays are normally created when the recovery group is created, but new arrays can be created, or existing arrays grown, by adding pdisks later.
Figure 1-61 Example of declustered arrays and recovery groups in storage JBOD
Virtual and physical disks
A VDisk is a type of NSD that is implemented by GPFS Native RAID across all the pdisks of a
declustered array. Multiple VDisks are defined within a declustered array, typically
Reed-Solomon VDisks for GPFS user data and replicated VDisks for GPFS metadata.
Virtual disks
Whether a VDisk of a particular capacity is created in a declustered array depends on its
redundancy code, the number of pdisks and equivalent spare capacity in the array, and other
small GPFS Native RAID overhead factors. The mmcrvdisk command can automatically configure a VDisk of the largest possible size, given a redundancy code and the configured spare space of the declustered array.
In general, the number of pdisks in a declustered array cannot be less than the widest
redundancy code of a VDisk plus the equivalent spare disk capacity of a declustered array.
For example, a VDisk that uses the 11-strip-wide 8 + 3p Reed-Solomon code requires at least
13 pdisks in a declustered array with the equivalent spare space capacity of two disks. A
VDisk that uses the three-way replication code requires at least five pdisks in a declustered
array with the equivalent spare capacity of two disks.
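The sizing rule in the previous paragraph can be written as a one-line check; the helper below is a hedged aid for planning, not a GPFS utility.

def min_pdisks(code_width, equivalent_spare_disks):
    """Smallest declustered array that can host a VDisk with the given redundancy code width."""
    return code_width + equivalent_spare_disks

print(min_pdisks(11, 2))   # 8 + 3p Reed-Solomon with two spare-disk equivalents -> 13 pdisks
print(min_pdisks(3, 2))    # 3-way replication with two spare-disk equivalents   -> 5 pdisks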
VDisks are partitioned into virtual tracks, which are the functional equivalent of a GPFS block.
All VDisk attributes are fixed at creation and cannot be altered.
Physical disks
A pdisk is used by GPFS Native RAID to store user data and GPFS Native RAID internal
configuration data.
A pdisk is either a conventional rotating magnetic-media disk (HDD) or a solid-state disk
(SSD). All pdisks in a declustered array must have the same capacity.
Pdisks are also assumed to be dual-ported with one or more paths that are connected to the
primary GPFS Native RAID server and one or more paths that are connected to the backup
server. Often there are two redundant paths between a GPFS Native RAID server and
connected JBOD pdisks.
Solid-state disks
GPFS Native RAID assumes several solid-state disks (SSDs) in each recovery group in order
to redundantly log changes to its internal configuration and fast-write data in non-volatile
memory, which is accessible from either the primary or backup GPFS Native RAID servers
after server failure. A typical GPFS Native RAID log VDisk might be configured as three-way
replication over a dedicated declustered array of four SSDs per recovery group.
Disk hospital
The disk hospital is a key feature of GPFS Native RAID that asynchronously diagnoses errors
and faults in the storage subsystem. GPFS Native RAID times out an individual pdisk I/O
operation after approximately 10 seconds, limiting the effect of a faulty pdisk on a client I/O
operation. When a pdisk I/O operation results in a timeout, an I/O error, or a checksum
mismatch, the suspect pdisk is immediately admitted into the disk hospital. When a pdisk is
first admitted, the hospital determines whether the error was caused by the pdisk or by the
paths to it. While the hospital diagnoses the error, GPFS Native RAID, if possible, uses VDisk redundancy codes to reconstruct lost or erased strips for I/O operations that otherwise would have used the suspect pdisk.
Health metrics
The disk hospital maintains internal health assessment metrics for each pdisk: time badness,
which characterizes response times; and data badness, which characterizes media errors
(hard errors) and checksum errors. When a pdisk health metric exceeds the threshold, it is
marked for replacement according to the disk maintenance replacement policy for the
declustered array.
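As a conceptual sketch only (the thresholds, weights, and field names are invented for illustration and are not actual GPFS Native RAID values), the disk hospital bookkeeping described above could look like this:

from dataclasses import dataclass

TIME_BADNESS_LIMIT = 100.0    # illustrative thresholds
DATA_BADNESS_LIMIT = 50.0

@dataclass
class PdiskHealth:
    time_badness: float = 0.0   # grows with slow or timed-out I/O responses
    data_badness: float = 0.0   # grows with media (hard) errors and checksum mismatches
    marked_for_replacement: bool = False

    def record_slow_io(self, seconds_late):
        self.time_badness += seconds_late
        self._check()

    def record_media_error(self, weight=10.0):
        self.data_badness += weight
        self._check()

    def _check(self):
        if self.time_badness > TIME_BADNESS_LIMIT or self.data_badness > DATA_BADNESS_LIMIT:
            self.marked_for_replacement = True   # handled per the array's replacement policy

disk = PdiskHealth()
for _ in range(6):
    disk.record_media_error()
print(disk.marked_for_replacement)   # True once enough data badness accumulates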
The disk hospital logs selected Self-Monitoring, Analysis, and Reporting Technology
(SMART) data, including the number of internal sector remapping events for each pdisk.
Pdisk discovery
GPFS Native RAID discovers all connected pdisks when it starts, and then regularly
schedules a process that rediscovers pdisks that have newly become accessible to the GPFS Native RAID server. This configuration allows pdisks to be physically connected or connection
problems to be repaired without restarting the GPFS Native RAID server.
Disk replacement
The disk hospital tracks disks that require replacement according to the disk replacement
policy of the declustered array. The disk hospital is configured to report the need for
replacement in various ways. The hospital records and reports the FRU number and physical
hardware location of failed disks to help guide service personnel to the correct location with
replacement disks.
When multiple disks are mounted on a removable carrier, each of which is a member of a
different declustered array, disk replacement requires the hospital to temporarily suspend
other disks in the same carrier. To guard against human error, carriers are also not removable until GPFS Native RAID actuates a solenoid-controlled latch. In response to administrative commands, the hospital quiesces the appropriate disks, releases the carrier latch, and turns on identify lights on the carrier next to the disks that require replacement.
After one or more disks are replaced and the carrier is re-inserted, in response to
administrative commands, the hospital verifies that the repair took place. The hospital also
automatically adds any new disks to the declustered array, which causes GPFS Native RAID
to rebalance the tracks and spare space across all the disks of the declustered array. If
service personnel fail to reinsert the carrier within a reasonable period, the hospital declares
the disks on the carrier as missing and starts rebuilding the affected data.
Two Declustered Arrays/Two Recovery Group
Figure 1-62 shows a “Two Declustered Array/Two Recovery Group” configuration of a Disk
Enclosure. This configuration is referred to as 1/4 populated. The configuration features four SSDs (shown in dark blue in Figure 1-62) in the first recovery group and four SSDs (dark yellow in Figure 1-62) in the second recovery group.
Figure 1-62 Two Declustered Array/Two Recovery Group DE configuration
Four Declustered Arrays/Two Recovery Group
Figure 1-63 shows a Four Declustered Array/Two Recovery Group configuration of a disk
enclosure. This configuration is referred to as 1/2 populated. The configuration features four SSDs (shown in dark blue in Figure 1-63) in the first recovery group and four SSDs (dark yellow in Figure 1-63) in the second recovery group.
Figure 1-63 Four Declustered Array/Two Recovery Group DE configuration