

Front cover

IBM Power Systems 775 for AIX and Linux HPC Solution
Unleashes computing power for HPC workloads
Provides architectural solution overview
Contains sample scenarios
Dino Quintero
Kerry Bosworth
Puneet Chaudhary
ByungUn Ha
Jose Higino
Marc-Eric Kahle
Tsuyoshi Kamenoue
James Pearson
Mark Perez
Fernando Pizzano
Robert Simon
Kai Sun
ibm.com/redbooks
International Technical Support Organization
IBM Power Systems 775 for AIX and Linux HPC Solution
October 2012
SG24-8003-00
Note: Before using this information and the product it supports, read the information in “Notices” on page vii.
First Edition (October 2012)
This edition applies to IBM AIX 7.1, xCAT 2.6.6, IBM GPFS 3.4, IBM LoadLeveler, and Parallel Environment Runtime Edition for AIX V1.1.
© Copyright International Business Machines Corporation 2012. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
The team who wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Chapter 1. Understanding the IBM Power Systems 775 Cluster. . . . . . . . . . . . . . . . . . . 1
1.1 Overview of the IBM Power System 775 Supercomputer . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Advantages and new features of the IBM Power 775 . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Hardware information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 POWER7 chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 I/O hub chip. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Collective acceleration unit (CAU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Nest memory management unit (NMMU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.5 Integrated switch router (ISR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.6 SuperNOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.7 Hub module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.8 Memory subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.9 Quad chip module (QCM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.10 Octant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.11 Interconnect levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.12 Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.13 Supernodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.3.14 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.4 Power, packaging and cooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4.1 Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.4.2 Bulk Power and Control Assembly (BPCA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.4.3 Bulk Power Control and Communications Hub (BPCH) . . . . . . . . . . . . . . . . . . . . 43
1.4.4 Bulk Power Regulator (BPR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.4.5 Water Conditioning Unit (WCU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.5 Disk enclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.5.2 High level description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.5.3 Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.6 Cluster management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.6.1 HMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.6.2 EMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.6.3 Service node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.6.4 Server and management networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.6.5 Data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.6.6 LPARs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.6.7 Utility nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.6.8 GPFS I/O nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.7 Typical connection scenario between EMS, HMC, Frame . . . . . . . . . . . . . . . . . . . . . . 58
1.8 Software stack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.8.1 ISNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.8.2 DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.8.3 Extreme Cluster Administration Toolkit (xCAT). . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.8.4 Toolkit for Event Analysis and Logging (TEAL). . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.8.5 Reliable Scalable Cluster Technology (RSCT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.8.6 GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
1.8.7 IBM Parallel Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
1.8.8 LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
1.8.9 ESSL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
1.8.10 Parallel ESSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.8.11 Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.8.12 Parallel Tools Platform (PTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 2. Application integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.1 Power 775 diskless considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
2.1.1 Stateless vs. Statelite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
2.1.2 System access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.2 System capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.3 Application development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.3.1 XL compilers support for POWER7 processors . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.3.2 Advantage for PGAS programming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.3.3 Unified Parallel C (UPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.3.4 ESSL/PESSL optimized for Power 775 clusters . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.4 Parallel Environment optimizations for Power 775 . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.4.1 Considerations for using Host Fabric Interface (HFI) . . . . . . . . . . . . . . . . . . . . . 116
2.4.2 Considerations for data striping with PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.4.3 Confirmation of HFI status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.4.4 Considerations for using Collective Acceleration Unit (CAU) . . . . . . . . . . . . . . . 126
2.4.5 Managing jobs with large numbers of tasks (up to 1024 K) . . . . . . . . . . . . . . . . 129
2.5 IBM Parallel Environment Developer Edition for AIX . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.5.1 Eclipse Parallel Tools Platform (PTP 5.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.5.2 IBM High Performance Computing Toolkit (IBM HPC Toolkit) . . . . . . . . . . . . . . 133
2.6 Running workloads using IBM LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
2.6.1 Submitting jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
2.6.2 Querying and managing jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.6.3 Specific considerations for LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 3. Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.1 Component monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.1.1 LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
3.1.2 General Parallel File System (GPFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.1.3 xCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.1.4 DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.1.5 AIX and Linux systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.1.6 Integrated Switch Network Manager (ISNM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.1.7 Host Fabric Interface (HFI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
3.1.8 Reliable Scalable Cluster Technology (RSCT) . . . . . . . . . . . . . . . . . . . . . . . . . . 203
3.1.9 Compilers environment (PE Runtime Edition, ESSL, Parallel ESSL) . . . . . . . . . 206
3.1.10 Diskless resources (NIM, iSCSI, NFS, TFTP). . . . . . . . . . . . . . . . . . . . . . . . . . 206
3.2 TEAL tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
3.2.1 Configuration (LoadLeveler, GPFS, Service Focal Point, PNSD, ISNM) . . . . . . 211
3.2.2 Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
3.3 Quick health check (full HPC Cluster System) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.3.1 Component analysis location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.3.2 Top to bottom checks direction (software to hardware) . . . . . . . . . . . . . . . . . . . 219
3.3.3 Bottom to top direction (hardware to software) . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.4 EMS Availability+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.4.1 Simplified failover procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5 Component configuration listing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
3.5.1 LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.5.2 General Parallel File System (GPFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.5.3 xCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.5.4 DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.5 AIX and Linux systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.6 Integrated Switch Network Manager (ISNM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.7 Host Fabric Interface (HFI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.5.8 Reliable Scalable Cluster Technology (RSCT) . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.5.9 Compilers environment (PE Runtime Edition, ESSL, Parallel ESSL) . . . . . . . . . 234
3.5.10 Diskless resources (NIM, iSCSI, NFS, TFTP). . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.6 Component monitoring examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
3.6.1 xCAT (power management, hardware discovery and connectivity) . . . . . . . . . . 235
3.6.2 Integrated Switch Network Manager (ISNM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Chapter 4. Problem determination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.1 xCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.1.1 xcatdebug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.1.2 Resolving xCAT configuration issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.1.3 Node does not respond to queries or rpower command. . . . . . . . . . . . . . . . . . . 240
4.1.4 Node fails to install . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
4.1.5 Unable to open a remote console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.1.6 Time out errors during network boot of nodes . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.2 ISNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.2.1 Checking the status and recycling the hardware server and the CNM . . . . . . . . 243
4.2.2 Communication issues between CNM and DB2 . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.2.3 Adding hardware connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
4.2.4 Checking FSP status, resolving configuration or communication issues . . . . . . 248
4.2.5 Verifying CNM to FSP connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
4.2.6 Verify that a multicast tree is present and correct . . . . . . . . . . . . . . . . . . . . . . . . 250
4.2.7 Correcting inconsistent topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
4.3 HFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.3.1 HFI health check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.3.2 HFI tools and link diagnostics (resolving down links and miswires) . . . . . . . . . . 254
4.3.3 SMS ping test fails over HFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
4.3.4 netboot over HFI fails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4.3.5 Other HFI issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Chapter 5. Maintenance and serviceability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
5.1 Managing service updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.1.1 Service packs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.1.2 System firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.1.3 Managing multiple operating system (OS) images . . . . . . . . . . . . . . . . . . . . . . . 259
5.2 Power 775 xCAT startup/shutdown procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5.2.1 Startup procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5.2.2 Shutdown procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.3 Managing cluster nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.3.1 Node types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.3.2 Adding nodes to the cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
5.3.3 Removing nodes from a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
5.4 Power 775 availability plus (A+) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.4.1 Advantages of Availability Plus (A+) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.4.2 Considerations for A+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.4.3 Availability Plus (A+) resources in a Power 775 Cluster . . . . . . . . . . . . . . . . . . . 289
5.4.4 How to identify a A+ resource. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.4.5 Availability Plus definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.4.6 Availability plus components and recovery procedures . . . . . . . . . . . . . . . . . . . 292
5.4.7 Hot, warm, cold Policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
5.4.8 A+ QCM move example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
5.4.9 Availability plus non-Compute node overview. . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Appendix A. Serviceable event analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Analyzing a hardware serviceable event that points to an A+ action . . . . . . . . . . . . . . . . . 306
Appendix B. Command outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
GPFS native RAID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

Notices

This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX 5L™ AIX® BladeCenter® DB2® developerWorks® Electronic Service Agent™ Focal Point™ Global Technology Services® GPFS™
HACMP™ IBM® LoadLeveler® Power Systems™ POWER6+™ POWER6® POWER7® PowerPC® POWER®
pSeries® Redbooks® Redbooks (logo) ® RS/6000® System p® System x® Tivoli®
The following terms are trademarks of other companies:
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

Preface

This IBM® Redbooks® publication contains information about the IBM Power Systems™ 775 Supercomputer solution for AIX® and Linux HPC customers. This publication provides details about how to plan, configure, maintain, and run HPC workloads in this environment.
This IBM Redbooks document is targeted to current and future users of the IBM Power Systems 775 Supercomputer (consultants, IT architects, support staff, and IT specialists) responsible for delivering and implementing IBM Power Systems 775 clustering solutions for their enterprise high-performance computing (HPC) applications.

The team who wrote this book

This book was produced by a team of specialists from around the world working at the International Technical Support Organization, Poughkeepsie Center.
Dino Quintero is an IBM Senior Certified IT Specialist with the ITSO in Poughkeepsie, NY. His areas of knowledge include enterprise continuous availability, enterprise systems management, system virtualization, technical computing, and clustering solutions. He is currently an Open Group Distinguished IT Specialist. Dino holds a Master of Computing Information Systems degree and a Bachelor of Science degree in Computer Science from Marist College.
Kerry Bosworth is a Software Engineer in pSeries® Cluster System Test for high-performance computing in Poughkeepsie, New York. Since joining the team four years ago, she worked with the InfiniBand technology on POWER6® AIX, SLES, and Red Hat clusters and the new Power 775 system. She has 12 years of experience at IBM with eight years in IBM Global Services as an AIX Administrator and Service Delivery Manager.
Puneet Chaudhary is a software test specialist with the General Parallel File System team in Poughkeepsie, New York.
Rodrigo Garcia da Silva is a Deep Computing Client Technical Architect at the IBM Systems and Technology Group. He is part of the STG Growth Initiatives Technical Sales Team in Brazil, specializing in High Performance Computing solutions. He has worked at IBM for the past five years and has a total of eight years of experience in the IT industry. He holds a B.S. in Electrical Engineering and his areas of expertise include systems architecture, OS provisioning, Linux, and open source software. He also has a background in intellectual property protection, including publications and a filed patent.
ByungUn Ha is an Accredited IT Specialist and Deep Computing Technical Specialist in Korea. He has over 10 years of experience at IBM and has conducted various HPC projects and HPC benchmarks in Korea. He has supported the Supercomputing Center at KISTI (Korea Institute of Science and Technology Information) on-site for nine years. His areas of expertise include Linux performance and clustering for System x, InfiniBand, AIX on Power systems, and the HPC software stack, including LoadLeveler®, Parallel Environment, ESSL/PESSL, and the C/Fortran compilers. He is a Red Hat Certified Engineer (RHCE) and has a Master’s degree in Aerospace Engineering from Seoul National University. He is currently working in the Deep Computing team, Growth Initiatives, STG in Korea as an HPC Technical Sales Specialist.
Jose Higino is an Infrastructure IT Specialist for AIX/Linux support and services for IBM Portugal. His areas of knowledge include System X, BladeCenter® and Power Systems planning and implementation, management, virtualization, consolidation, and clustering (HPC and HA) solutions. He is currently the only person responsible for Linux support and services in IBM Portugal. He completed the Red Hat Certified Technician level in 2007, became a CiRBA Certified Virtualization Analyst in 2009, and completed certification in KT Resolve methodology as an SME in 2011. José holds a Master of Computers and Electronics Engineering degree from UNL - FCT (Universidade Nova de Lisboa - Faculdade de Ciências e Technologia), in Portugal.
Marc-Eric Kahle is a POWER® Systems Hardware Support specialist at the IBM Global Technology Services® Central Region Hardware EMEA Back Office in Ehningen, Germany. He has worked in the RS/6000®, POWER System, and AIX fields since 1993. He has worked at IBM Germany since 1987. His areas of expertise include POWER Systems hardware and he is an AIX certified specialist. He has participated in the development of six other IBM Redbooks publications.
Tsuyoshi Kamenoue is an Advisory IT Specialist in Power Systems Technical Sales in IBM Japan. He has nine years of experience working on pSeries, System p®, and Power Systems products, especially in the HPC area. He holds a Bachelor’s degree in System Information from the University of Tokyo.
James Pearson has been a Product Engineer for pSeries high-end Enterprise systems and HPC cluster offerings since 1998. He has participated in the planning, test, installation, and on-going maintenance phases of clustered RISC and pSeries servers for numerous government and commercial customers, beginning with the SP2 and continuing through the current Power 775 HPC solution.
Mark Perez is a customer support specialist servicing IBM Cluster 1600.
Fernando Pizzano is a Hardware and Software Bring-up Team Lead in the IBM Advanced Clustering Technology Development Lab, Poughkeepsie, New York. He has over 10 years of information technology experience, the last five years in HPC Development. His areas of expertise include AIX, pSeries High Performance Switch, and IBM System p hardware. He holds an IBM certification in pSeries AIX 5L™ System Support.
Robert Simon is a Senior Software Engineer in STG working in Poughkeepsie, New York. He has worked with IBM since 1987. He currently is a Team Leader in the Software Technical Support Group, which supports the High Performance Clustering software (LoadLeveler, CSM, GPFS™, RSCT, and PPE). He has extensive experience with IBM System p hardware, AIX, HACMP™, and high-performance clustering software. He has participated in the development of three other IBM Redbooks publications.
Kai Sun is a Software Engineer in pSeries Cluster System Test for high performance computing in the IBM China System Technology Laboratory, Beijing. Since joining the team in 2011, he has worked with the IBM Power Systems 775 cluster. He has six years of experience with embedded systems on Linux and VxWorks platforms. He recently received an Eminence and Excellence Award from IBM for his work on the Power Systems 775 cluster. He holds a B.Eng. degree in Communication Engineering from Beijing University of Technology, China, and an M.Sc. degree in Project Management from the New Jersey Institute of Technology, US.
Thanks to the following people for their contributions to this project:
򐂰 Mark Atkins
IBM Boulder
򐂰 Robert Dandar
򐂰 Joseph Demczar
򐂰 Chulho Kim
򐂰 John Lewars
򐂰 John Robb
򐂰 Hanhong Xue
򐂰 Gary Mincher
򐂰 Dave Wootton
򐂰 Paula Trimble
򐂰 William Lepera
򐂰 Joan McComb
򐂰 Bruce Potter
򐂰 Linda Mellor
򐂰 Alison White
򐂰 Richard Rosenthal
򐂰 Gordon McPheeters
򐂰 Ray Longi
򐂰 Alan Benner
򐂰 Lissa Valleta
򐂰 John Lemek
򐂰 Doug Szerdi
򐂰 David Lerma
IBM Poughkeepsie
򐂰 Ettore Tiotto
IBM Toronto, Canada
򐂰 Wei QQ Qu
IBM China
򐂰 Phil Sanders
IBM Rochester
򐂰 Richard Conway
򐂰 David Bennin
International Technical Support Organization, Poughkeepsie Center

Now you can become a published author, too!

Here’s an opportunity to spotlight your skills, grow your career, and become a published author—all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.
Find out more about the residency program, browse the residency index, and apply online at:
http://www.ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:
򐂰 Use the online Contact us review Redbooks form found at:
http://www.ibm.com/redbooks
򐂰 Send your comments in an email to:
redbooks@us.ibm.com
򐂰 Mail your comments to:
IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks

򐂰 Find us on Facebook:
http://www.facebook.com/IBMRedbooks
򐂰 Follow us on Twitter:
http://twitter.com/ibmredbooks
򐂰 Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
򐂰 Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter:
https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
򐂰 Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
Chapter 1. Understanding the IBM Power Systems 775 Cluster
In this book, we describe the new IBM Power Systems 775 Cluster hardware and software. The chapters provide an overview of the general features of the Power 775 and its hardware and software components. This chapter provides a basic understanding of the concepts behind this cluster.
Application integration and monitoring of a Power 775 cluster are also described in greater detail in this IBM Redbooks publication. LoadLeveler, GPFS, xCAT, and more are documented with examples to give a better view of the complete cluster solution.
Problem determination is also discussed throughout this publication for different scenarios that include xCAT configuration issues, Integrated Switch Network Manager (ISNM), Host Fabric Interface (HFI), GPFS, and LoadLeveler. These scenarios show how to determine the cause of an error and how to solve it. This knowledge complements the information in Chapter 5, “Maintenance and serviceability” on page 265.
Some cluster management challenges might need intervention that requires service updates, xCAT shutdown/startup, node management, and Fail in Place tasks. Other available documents are referenced in this book because not everything is shown in this publication.
This chapter includes the following topics:
򐂰 Overview of the IBM Power System 775 Supercomputer
򐂰 Advantages and new features of the IBM Power 775
򐂰 Hardware information
򐂰 Power, packaging, and cooling
򐂰 Disk enclosure
򐂰 Cluster management
򐂰 Connection scenario between EMS, HMC, and Frame
򐂰 High Performance Computing software stack

1.1 Overview of the IBM Power System 775 Supercomputer

For many years, IBM has provided High Performance Computing (HPC) solutions that deliver extreme performance; for example, highly scalable clusters that use AIX and Linux for demanding workloads, including weather forecasting and climate modeling.
The previous IBM Power 575 POWER6 water-cooled cluster showed impressive density and performance. With 32 processors and 32 GB to 256 GB of memory in one central electronic complex (CEC) enclosure or cage, and up to 14 CECs per water-cooled frame, 448 processors per frame were possible. The InfiniBand interconnect provided the cluster with powerful communication channels for the workloads.
The new Power 775 Supercomputer from IBM takes the density to a new height. With 256 3.84 GHz POWER7® processors and 2 TB of memory per CEC, and up to 12 CECs per frame, a total of 3072 processors and 24 TB of memory per frame is possible. Highly scalable, with the capability to cluster 2048 CEC drawers together, the system scales to 524,288 POWER7 processors to solve the most challenging problems. A total of 7.86 TF per CEC and 94.4 TF per rack highlights the capabilities of this high-performance computing solution.
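The peak-performance figures in the previous paragraph follow from straightforward arithmetic on the per-core numbers that are listed in 1.4.1 (a 3.84 GHz clock, four FPUs per core, and two flops per cycle with fused operations). The following minimal sketch, written in Python purely for illustration, reproduces that arithmetic; the variable names are ours and are not part of any IBM tooling.

# Peak floating-point arithmetic for the Power 775 building blocks.
# Figures taken from the text: 3.84 GHz cores, 4 FPUs per core, 2 flops per
# cycle (fused operation), 8 cores per chip, 256 cores (32 chips) per CEC
# drawer, 12 CEC drawers per frame, and up to 2048 drawers per cluster.

GHZ = 3.84                 # core clock in GHz
FLOPS_PER_CORE = 4 * 2     # 4 FPUs x 2 flops per cycle
CORES_PER_CHIP = 8
CHIPS_PER_CEC = 32         # 256 cores per CEC drawer
CECS_PER_FRAME = 12
MAX_CECS = 2048

chip_gflops = GHZ * FLOPS_PER_CORE * CORES_PER_CHIP      # ~245.8 GFLOPs per chip
cec_tflops = chip_gflops * CHIPS_PER_CEC / 1000.0        # ~7.86 TF per CEC
frame_tflops = cec_tflops * CECS_PER_FRAME               # ~94.4 TF per frame
max_cores = CORES_PER_CHIP * CHIPS_PER_CEC * MAX_CECS    # 524,288 cores maximum

print(f"per chip : {chip_gflops:6.1f} GFLOPs")
print(f"per CEC  : {cec_tflops:6.2f} TF")
print(f"per frame: {frame_tflops:6.1f} TF")
print(f"max cores: {max_cores}")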
The hardware is only as good as the software that runs on it. IBM AIX, IBM Parallel Environment (PE) Runtime Edition, LoadLeveler, GPFS, and xCAT are a few of the supported software stacks for the solution. For more information, see 1.9, “High Performance Computing software stack” on page 62.

1.2 The IBM Power 775 cluster components

The IBM Power 775 can consist of the following components:
򐂰 Compute subsystem:
– Diskless nodes dedicated to perform computational tasks
– Customized operating system (OS) images
– Applications
򐂰 Storage subsystem:
– I/O nodes (diskless)
– OS images for I/O nodes
– SAS adapters attached to the Disk Enclosures (DE)
– General Parallel File System (GPFS)
򐂰 Management subsystem:
– Executive Management Server (EMS)
– Login node
– Utility node
򐂰 Communication subsystem:
– Host Fabric Interface (HFI):
• Busses from processor modules to the switching hub in an octant
• Local links (LL-links) between octants
• Local remote links (LR-links) between drawers in a SuperNode
• Distance links (D-links) between SuperNodes
– Operating system drivers
– IBM User space protocol
– AIX and Linux IP drivers
Octants, SuperNodes, and other components are described in other sections of this book.
򐂰 Node types
The following node types provide different functions to the cluster. In the context of the 9125-F2C drawer, a node is an OS image that is booted in an LPAR. There are three general designations for node types on the 9125-F2C. Often these functions are dedicated to a node, but a node can have multiple roles:
– Compute nodes
Compute nodes run parallel jobs and perform the computational functions. These nodes are diskless and booted across the HFI network from a Service Node. Most of the nodes are compute nodes.
– IO nodes
These nodes are attached to either the Disk Enclosure in the physical cluster or external storage. These nodes serve the file system to the rest of the cluster.
– Utility Nodes
A Utility node offers services to the cluster. These nodes often feature more resources, such as external Ethernet connections and external or internal storage. The following Utility nodes are required:
• Service nodes: Runs xCAT to serve the operating system to local diskless nodes
• Login nodes: Provide a centralized login to the cluster
– Optional utility node:
• Tape subsystem server
Important: xCAT stores all system definitions as node objects, including the required EMS console and the HMC console. However, the consoles are external to the 9125-F2C cluster and are not referred to as cluster nodes. The HMC and EMS consoles are physically running on specific, dedicated servers. The HMC runs on a System x® based machine (7042 or 7310) and the EMS runs on a POWER 750 server. For more information, see 1.7.1, “Hardware Management Console” on page 53 and 1.7.2, “Executive Management Server” on page 53.
For more information, see this website:
http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/topic/p7had/p7had_775x.pdf

1.3 Advantages and new features of the IBM Power 775

The IBM Power Systems 775 (9125-F2C) has several new features that make this system even more reliable, available, and serviceable.
Fully redundant power, cooling and management, dynamic processor de-allocation and memory chip & lane sparing, and concurrent maintenance are the main reliability, availability, and serviceability (RAS) features.
The system is water-cooled, which provides 100% heat capture. Some components are cooled by small fans, but the Rear Door Heat Exchanger captures this heat.
Because most of the nodes are diskless nodes, the service nodes provide the operating system to the diskless nodes. The HFI network is also used to boot the diskless utility nodes.
The Power 775 Availability Plus (A+) feature allows immediate recovery from processor, switching hub, and HFI cable failures because additional resources are available in the system. These resources fail in place, and no hardware needs to be replaced until a specified threshold is reached. For more information, see 5.4, “Power 775 Availability Plus” on page 297.
The IBM Power 775 cluster solution provides High Performance Computing clients with the following benefits:
򐂰 Sustained performance and low energy consumption for climate modeling and forecasting
򐂰 Massive scalability for cell and organism process analysis in life sciences
򐂰 Memory capacity for high-resolution simulations in nuclear resource management
򐂰 Space and energy efficiency for risk analytics and real-time trading in financial services

1.4 Hardware information

This section provides detailed information about the hardware components of the IBM Power 775. Within this section, there are links to IBM manuals and external sources for more information.

1.4.1 POWER7 chip

The IBM Power System 775 implements the POWER7 processor technology. The PowerPC® Architecture POWER7 processor is designed for use in servers that provide solutions with large clustered systems, as shown in Figure 1-1 on page 5.
Figure 1-1 POWER7 chip block diagram
IBM POWER7 characteristics
This section provides a description of the following characteristics of the IBM POWER7 chip, as shown in Figure 1-1:
򐂰 246 GFLOPs:
– Up to eight cores per chip
– Four Floating Point Units (FPU) per core
– Two FLOPS per cycle (fused operation)
– 246 GFLOPs = 8 cores x 3.84 GHz x 4 FPU x 2
򐂰 32 KB instruction and 32 KB data caches per core
򐂰 256 KB L2 cache per core
򐂰 4 MB L3 cache per core
򐂰 Eight channels of SuperNova buffered DIMMs:
– Two memory controllers per chip
– Four memory busses per memory controller (1 B wide write, 2 B wide read each)
򐂰 CMOS 12S SOI, 11 levels of metal
򐂰 Die size: 567 mm²
Architecture
򐂰 PowerPC architecture
򐂰 IEEE new P754 floating point compliant
򐂰 Big endian, little endian, strong byte ordering support extension
򐂰 46-bit real addressing, 68-bit virtual addressing
򐂰 Off-chip bandwidth: 336 GBps (local plus remote interconnect; see the sketch after this list)
򐂰 Memory capacity: up to 128 GB per chip
򐂰 Memory bandwidth: 128 GBps peak per chip
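The 336 GBps off-chip bandwidth figure is consistent with counting all seven 8-byte SMP links (W, X, Y, Z, A, B, and C) in both directions at the 3 Gbps rate listed in Table 1-1. This decomposition is our assumption for illustration; the text does not break the number down.

# Hypothetical decomposition of the 336 GBps off-chip bandwidth figure:
# seven SMP links (W, X, Y, Z, A, B, C), each 8 bytes wide per direction,
# two directions per link, at the 3 Gbps signaling rate from Table 1-1.

LINKS = 7
BYTES_PER_DIRECTION = 8
DIRECTIONS = 2
GBPS_PER_LANE = 3.0

per_link_gbps = BYTES_PER_DIRECTION * DIRECTIONS * GBPS_PER_LANE   # 48 GBps
total_gbps = per_link_gbps * LINKS                                 # 336 GBps
print(f"{per_link_gbps:.0f} GBps per link x {LINKS} links = {total_gbps:.0f} GBps")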
C1 core and cache
򐂰 8 C1 processor cores per chip
򐂰 2 FX, 2 LS, 4 DPFP, 1 BR, 1 CR, 1 VMX, 1 DFP
򐂰 4-way SMT, out-of-order (OoO) execution
򐂰 112x2 GPR and 172x2 VMX/VSX/FPR renames
PowerBus On-Chip Intraconnect
򐂰 1.9 GHz frequency
򐂰 Eight 16 B data buses, 2 address snoops, 21 on/off ramps
򐂰 Asynchronous interface to chiplets and off-chip interconnect
Differential memory controllers (2)
򐂰 6.4 GHz interface to SuperNova (SN)
򐂰 DDR3 support, max 1067 MHz
򐂰 Minimum memory: 2 channels, 1 SN per channel
򐂰 Maximum memory: 8 channels, 1 SN per channel
򐂰 2 ports per SuperNova
򐂰 8 ranks per port
򐂰 x8 and x4 devices supported
PowerBus Off-Chip Interconnect
򐂰 1.5 to 2.9 Gbps single-ended EI-3
򐂰 2 spare bits per bus
򐂰 Maximum 256-way SMP
򐂰 32-way optimal scaling
򐂰 Four 8-B intranode buses (W, X, Y, or Z)
򐂰 All buses run at the same bit rate
򐂰 All buses are capable of running as a single 4-B interface; the location of the 4-B interface within the 8 B is fixed
򐂰 Hub chip attaches via W, X, Y, or Z
򐂰 Three 8-B internode buses (A, B, C)
򐂰 C-bus multiplexed with GX; operates only as an aggregate data bus (for example, address and command traffic is not supported)
Buses
Table 1-1 describes the POWER7 busses.
Table 1-1 POWER7 busses
Bus name     Width (speed)                                            Connects                       Function
W, X, Y, Z   8B+8B with 2 extra bits per bus (3 Gbps)                 Intranode processors and hub   Used for address and data
A, B         8B+8B with 2 extra bits per bus (3 Gbps)                 Other nodes within drawer      Data only
C            8B+8B with 2 extra bits per bus (3 Gbps)                 Other nodes within drawer      Data only, multiplexed with GX
Mem1-Mem8    2B read + 1B write with 2 extra bits per bus (2.9 GHz)   Processor to memory
WXYZABC Busses
The off-chip PowerBus supports up to seven coherent SMP links (WXYZABC) by using Elastic Interface 3 (EI-3) signaling at up to 3 Gbps. The intranode WXYZ links connect up to four processor chips to form a 32-way node and connect a Hub chip to each processor. The WXYZ links carry coherency traffic and data and are interchangeable as intranode processor links or Hub links. The internode AB links connect up to two nodes per processor chip. The AB links carry coherency traffic and data and are interchangeable with each other. The AB links can also be configured as aggregate data-only links. The C link is configured only as a data-only link.
All seven coherent SMP links (WXYZABC) can be configured as 8 bytes or 4 bytes in width.
The WXYZABC busses include the following features:
򐂰 Four (WXYZ) 8-B or 4-B EI-3 intranode links
򐂰 Two (AB) 8-B or 4-B EI-3 internode links or two (AB) 8-B or 4-B EI-3 data-only links
򐂰 One (C) 8-B or 4-B EI-3 data-only link
PowerBus
The PowerBus is responsible for coherent and non-coherent memory access, IO operations, interrupt communication, and system controller communication. The PowerBus provides all of the interfaces, buffering, and sequencing of command and data operations within the storage subsystem. The POWER7 chip has up to seven PowerBus links that are used to connect to other POWER7 chips, as shown in Figure 1-2 on page 8.
The PowerBus link is an 8-Byte-wide (or optional 4-Byte-wide), split-transaction, multiplexed, command and data bus that supports up to 32 POWER7 chips. The bus topology is a multitier, fully connected topology to reduce latency, increase redundancy, and improve concurrent maintenance. Reliability is improved with ECC on the external I/Os.
Data transactions are always sent along a unique point-to-point path. A route tag travels with the data to help routing decisions along the way. Multiple data links are supported between chips that are used to increase data bandwidth.
Figure 1-2 POWER7 chip layout
Figure 1-3 on page 9 shows the POWER7 core structure.
Figure 1-3 Microprocessor core structural diagram
Reliability, availability, and serviceability features
The microprocessor core includes the following reliability, availability, and serviceability (RAS) features:
򐂰 POWER7 core:
– Instruction retry for soft core logic errors
– Alternate processor recovery for detected hard core errors
– Processor limited checkstop for other errors
– Protection key support for AIX
򐂰 L1 I/D cache error recovery and handling:
– Instruction retry for soft errors
– Alternate processor recovery for hard errors
– Guarding of the core for core and L1/L2 cache errors
򐂰 L2 cache:
– ECC on L2 and directory tags
– Line delete for L2 and directory tags (seven lines)
– L2 UE handling includes purge and refetch of unmodified data
– Predictive dynamic guarding of associated cores
򐂰 L3 cache:
– ECC on data
– Line delete mechanism for data (seven lines)
– L3 UE handling includes purge and refetch of unmodified data
– Predictive dynamic guarding of associated cores for CEs in L3 not managed by the line deletion

1.4.2 I/O hub chip

This section provides information about the IBM Power 775 I/O hub chip (or torrent chip), as shown in Figure 1-4.
Figure 1-4 Hub chip (Torrent)
Host fabric interface
The host fabric interface (HFI) provides a non-coherent interface between a quad-chip module (QCM), which is composed of four POWER7 chips, and the cluster network.
Figure 1-5 on page 11 shows two instances of the HFI in a hub chip. The HFIs also attach to the Collective Acceleration Unit (CAU).
Each HFI has one PowerBus command and four PowerBus data interfaces, which feature the following configuration:
1. The PowerBus directly connects to the processors and memory controllers of four POWER7 chips via the WXYZ links.
2. The PowerBus also indirectly coherently connects to other POWER7 chips within a 256-way drawer via the LL links. Although fully supported by the HFI hardware, this path provides reduced performance.
3. Each HFI has four ports to the Integrated Switch Router (ISR). The ISR connects to other hub chips through the D, LL, and LR links.
4. ISRs and D, LL, and LR links that interconnect the hub chips form the cluster network.
POWER7 chips: The set of four POWER7 chips (QCM), its associated memory, and a hub chip form the building block for cluster systems. A Power 775 system consists of multiple building blocks that are connected to each other via the cluster network.
Figure 1-5 HFI attachment scheme
Packet processing
The HFI is the interface between the POWER7 chip quads and the cluster network, and is responsible for moving data between the PowerBus and the ISR. The data is in various formats, but packets are processed in the following manner:
򐂰 Send:
– Pulls or receives data from PowerBus-attached devices in a POWER7 chip
– Translates the data into network packets
– Injects the network packets into the cluster network via the ISR
򐂰 Receive:
– Receives network packets from the cluster network via the ISR
– Translates them into transactions
– Pushes the transactions to PowerBus-attached devices in a POWER7 chip
򐂰 Packet ordering:
– The HFIs and cluster network provide no ordering guarantees among packets. Packets that are sent from the same source window and node to the same destination window and node might reach the destination in a different order.
Figure 1-6 shows two HFIs cooperating to move data from devices that are attached to one PowerBus to devices attached to another PowerBus through the Cluster Network.
Figure 1-6 HFI moving data from one quad to another quad
HFI paths: The path between any two HFIs might be indirect, thus requiring multiple hops through intermediate ISRs.
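The send/receive flow and the lack of ordering guarantees can be pictured with a small software model. The sketch below is purely conceptual: the class and function names are invented for illustration and do not correspond to any HFI or PE programming interface.

import random
from dataclasses import dataclass

# Toy model of the HFI packet flow described above: send translates
# PowerBus-style transactions into network packets, the ISR network delivers
# them with no ordering guarantee, and receive translates them back.
# All names here are invented for illustration; this is not an HFI API.

@dataclass
class Packet:
    src_window: int
    dst_window: int
    seq: int              # sender-side sequence, not honored by the network
    payload: bytes

def hfi_send(transactions, src_window, dst_window):
    """Translate PowerBus-attached device transactions into network packets."""
    return [Packet(src_window, dst_window, i, data)
            for i, data in enumerate(transactions)]

def cluster_network_deliver(packets):
    """Model the ISR network: packets may arrive in a different order."""
    delivered = list(packets)
    random.shuffle(delivered)
    return delivered

def hfi_receive(packets):
    """Translate arriving packets back into transactions for the PowerBus."""
    return [(p.seq, p.payload) for p in packets]

sent = hfi_send([b"a", b"b", b"c"], src_window=1, dst_window=7)
print(hfi_receive(cluster_network_deliver(sent)))   # order may differ from the send order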
1.4.3 Collective acceleration unit
The hub chip provides specialized hardware that is called the Collective Acceleration Unit (CAU) to accelerate frequently used collective operations.
Collective operations
Collective operations are distributed operations that operate across a tree. Many HPC applications perform collective operations in which the application makes forward progress only after every compute node completes its contribution and the result of the collective operation is delivered back to every compute node (for example, barrier synchronization and global sum).
A specialized arithmetic-logic unit (ALU) within the CAU implements the reduction and barrier operations. For reduce operations, the ALU supports the following operations and data types:
򐂰 Fixed point: NOP, SUM, MIN, MAX, OR, AND, and XOR, signed and unsigned
򐂰 Floating point: MIN, MAX, SUM, and PROD, single and double precision
There is one CAU in each hub chip, which is one CAU per four POWER7 chips, or one CAU per 32 POWER7 cores.
Software organizes the CAUs in the system into collective trees. For a broadcast operation, the arrival of an input on one link causes its forwarding on all other links. For a reduce operation, arrivals on all but one link cause the reduction result to be forwarded on the remaining link.
A link in the CAU tree maps to a path that is composed of more than one link in the network. The system supports many trees simultaneously, and each CAU supports 64 independent trees.
The usage of sequence numbers and a retransmission protocol enables reliability and pipelining. Each tree has only one participating HFI window on any involved node. The order in which the reduction operation is evaluated is preserved from one run to another, which benefits programming models, such as MPI, that allow programmers to require that collective operations are evaluated in a particular order.
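To make the idea concrete, the following minimal MPI sketch in C (illustrative only, not taken from this publication) issues the two collectives named above, a barrier and a global sum; on a Power 775, these are the kinds of calls that the CAU tree can accelerate when the MPI library offloads them.

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char *argv[])
   {
       int rank, ntasks;
       double local, global;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

       /* Barrier synchronization: no task proceeds until all tasks arrive. */
       MPI_Barrier(MPI_COMM_WORLD);

       /* Global sum: every task contributes one value and receives the total. */
       local = (double)rank;
       MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

       if (rank == 0)
           printf("Sum over %d tasks: %f\n", ntasks, global);

       MPI_Finalize();
       return 0;
   }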
Packet propagation
As shown in Figure 1-7 on page 14, a CAU receives packets from the following sources:
򐂰 The memory of a remote node, which is inserted into the cluster network by the HFI of the remote node
򐂰 The memory of a local node, which is inserted into the cluster network by the HFI of the local node
򐂰 A remote CAU
Figure 1-7 CAU packets received by CAU
As shown in Figure 1-8 on page 15, a CAU sends packets to the following locations:
򐂰 The memory of a remote node, which is written to memory by the HFI of the remote node
򐂰 The memory of a local node, which is written to memory by the HFI of the local node
򐂰 A remote CAU
Figure 1-8 CAU packets sent by CAU
1.4.4 Nest memory management unit
The Nest Memory Management Unit (NMMU), which is in the hub chip, enables user-level code to operate on the address space of processes that execute on other compute nodes. The NMMU enables user-level code to create a global address space on which the NMMU performs operations. This facility is called global shared memory.
A process that executes on a compute node registers its address space, thus permitting interconnect packets to manipulate the registered shared region directly. The NMMU references a page table that maps effective addresses to real memory. The hub chip also maintains a cache of the mappings and can map the entire real memory of most installations.
Incoming interconnect packets that reference memory, such as RDMA packets and packets that perform atomic operations, contain an effective address and information that pinpoints the context in which to translate the effective address. This feature greatly facilitates global-address space languages, such as Unified Parallel C (UPC), Co-Array Fortran, and X10, by permitting such packets to contain easy-to-use effective addresses.
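To illustrate the kind of one-sided, remote-memory traffic that the NMMU translates, the following minimal MPI one-sided sketch (illustrative C code, not taken from this publication) exposes a local buffer as a window and lets another task write into it directly; such a put can travel as an RDMA packet that carries the target effective address.

   #include <mpi.h>

   int main(int argc, char *argv[])
   {
       int rank;
       double buf[4] = {0.0, 0.0, 0.0, 0.0};
       MPI_Win win;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       /* Register (expose) the local buffer for remote access. */
       MPI_Win_create(buf, sizeof(buf), sizeof(double),
                      MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       MPI_Win_fence(0, win);
       if (rank == 1) {
           /* Task 1 writes one value directly into task 0's buffer. */
           double val = 42.0;
           MPI_Put(&val, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, win);
       }
       MPI_Win_fence(0, win);

       MPI_Win_free(&win);
       MPI_Finalize();
       return 0;
   }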
1.4.5 Integrated switch router
The integrated switch router (ISR) replaces the external switching and routing functions that are used in prior networks. The ISR is designed to dramatically reduce cost and improve performance in bandwidth and latency.
A direct graph network topology connects up to 65,536 POWER7 eight-core processor chips with a two-level routing hierarchy of L and D busses.
Each hub chip ISR connects to four POWER7 chips via the HFI controller and the W busses. The Torrent hub chip and its four POWER7 chips are called an octant. Each ISR octant is directly connected to seven other octants on a drawer via the wide on-planar L-Local busses and to 24 other octants in three more drawers via the optical L-Remote busses.
A Supernode is the fully interconnected collection of 32 octants in four drawers. Up to 512 Supernodes are fully connected via the 16 optical D busses per hub chip. The ISR is designed to support smaller systems with multiple D busses between Supernodes for higher bandwidth and performance.
The ISR logically contains input and output buffering, a full crossbar switch, hierarchical route tables, link protocol framers/controllers, interface controllers (HFI and PB data), Network Management registers and controllers, and extensive RAS logic that includes link replay buffers.
The Integrated Switch Router supports the following features:
򐂰 Target cycle time up to 3 GHz
򐂰 Target switch latency of 15 ns
򐂰 Target GUPS: ~21 K; ISR-assisted GUPS handling at all intermediate hops (not software)
򐂰 Target switch crossbar bandwidth greater than 1 TB per second input and output:
– 96 Gbps WXYZ-busses (4 @ 24 Gbps) from P7 chips (unidirectional)
– 168 Gbps local L-busses (7 @ 24 Gbps) between octants in a drawer (unidirectional)
– 144 Gbps optical L-busses (24 @ 6 Gbps) to other drawers (unidirectional)
– 160 Gbps D-busses (16 @ 10 Gbps) to other Supernodes (unidirectional)
򐂰 Two-tiered full-graph network
򐂰 Virtual channels for deadlock prevention
򐂰 Cut-through wormhole routing
򐂰 Routing options:
– Full hardware routing
– Software-controlled indirect routing by using hardware route tables
򐂰 Multiple indirect routes that are supported for data striping and failover
򐂰 Multiple direct routes by using LR and D-links, supported for less than a full-up system
򐂰 Maximum supported packet size of 2 KB; packet size varies from 1 to 16 flits, each flit being 128 bytes
򐂰 Routing algorithms:
– Round robin: direct and indirect
– Random: indirect routes only
򐂰 IP multicast with central buffer and route table, supporting 256-byte or 2 KB packets
򐂰 Global hardware counter implementation and support, including link latency counts
򐂰 LCRC on L and D busses with link-level retry support for handling transient errors, including error thresholds
򐂰 ECC on local L and W busses, internal arrays, and busses, including Fault Isolation Registers and Control Checker support
򐂰 Performance counters and trace debug support
1.4.6 SuperNOVA
SuperNOVA is the second member of the fourth generation of the IBM Synchronous Memory Interface ASIC. It connects host memory controllers to DDR3 memory devices.
SuperNOVA is used in a planar configuration to connect to Industry Standard (I/S) DDR3 RDIMMs. SuperNOVA also resides on a custom, fully buffered memory module that is called the SuperNOVA DIMM (SND). Fully buffered DIMMs use a logic device, such as SuperNOVA, to buffer all signals to and from the memory devices.
As shown in Figure 1-9, SuperNOVA provides the following features:
򐂰 Cascaded memory channel (up to seven SNs deep) that uses 6.4-Gbps, differential ended (DE), unidirectional links
򐂰 Two DDR3 SDRAM command and address ports
򐂰 Two 8 B DDR3 SDRAM data ports, with a ninth byte for ECC and a tenth byte that is used as a locally selectable spare
򐂰 16 ranks of chip selects and CKE controls (eight per CMD port)
򐂰 Eight ODT (four per CMD port)
򐂰 Four differential memory clock pairs to support up to four DDR3 registered dual in-line memory modules (RDIMMs)
Data Flow Modes include the following features:
򐂰 Expansion memory channel daisy-chain
򐂰 4:1 or 6:1 configurable data rate ratio between the memory channel and the SDRAM domain
Figure 1-9 Memory channel
SuperNOVA uses a high speed, differential ended communications memory channel to link a host memory controller to the main memory storage devices through the SuperNOVA ASIC. The maximum memory channel transfer rate is 6.4 Gbps.
The SuperNOVA memory channel consists of two DE, unidirectional links. The downstream link transmits write data and commands away from the host (memory controller) to the SuperNOVA. The downstream link includes 13 active logical signals (lanes), two spare lanes, and a bus clock. The upstream (US) link transmits read data and responses from the SuperNOVA back to the host. The US link includes 20 active logical signals, two spare lanes, and a bus clock.
Although SuperNOVA supports a cascaded memory channel topology of multiple chips that use daisy chained memory channel links, Power 775 does not use this capability.
The links that are connected on the host side are called the Primary Up Stream (PUS) and Primary Down Stream (PDS) links. The links on the cascaded side are called the Secondary Up Stream (SUS) and Secondary Down Stream (SDS) links.
The SuperNOVA US and downstream links each include two dedicated spare lanes. One of these lanes is used to repair either a clock or data connection. The other lane is used only to repair data signal defects. Each segment (host to SuperNOVA or SuperNOVA to SuperNOVA connection) of a cascaded memory channel independently deploys its dedicated spares per link. This deployment maximizes the ability to survive multiple interconnect hard failures. The spare lanes are tested and aligned during initialization but are deactivated during normal runtime operation. The channel frame format, error detection, and protocols are the same before and after spare lane invocation. Spare lanes are selected by one of the following means:
򐂰 The spare lanes are selected during initialization by loading host and SuperNOVA configuration registers based on previously logged lane failure information.
򐂰 The spare lanes are selected dynamically by the hardware during runtime operation by an error recovery operation that performs the link reinitialization and repair procedure. This procedure is initiated by the host memory controller and supported by the SuperNOVAs in the memory channel. During the link repair operation, the memory controller holds back memory access requests. The procedure is designed to take less than 10 ms to prevent system performance problems, such as timeouts.
򐂰 The spare lanes are selected by system control software by loading host or SuperNOVA configuration registers that are based on the results of the memory channel lane shadowing diagnostic procedure.
1.4.7 Hub module
The Power 775 hub module provides all the connectivity that is needed to form a clustered system, as shown in Figure 1-10 on page 19.
Figure 1-10 Hub module diagram
The hub features the following primary functions:
򐂰 Connects the QCM processor/memory subsystem to up to two high-performance 16x PCIe slots and one high-performance 8x PCIe slot. This configuration provides general-purpose I/O and networking capability for the server node.

POWER 775 drawer: In a Power 775 drawer (CEC), octant 0 has three PCIe slots, of which two are 16x and one is 8x (an SR-IOV Ethernet adapter is given priority in the 8x slot). Octants 1 - 7 each have two 16x PCIe slots.

򐂰 Connects eight processor QCMs together by using a low-latency, high-bandwidth, coherent copper fabric (L-Local buses) that includes the following features:
– Enables a single hypervisor to run across eight QCMs, which enables a single pair of redundant service processors to manage eight QCMs
– Directs the I/O slots that are attached to the eight hubs to the compute power of any of the eight QCMs, which provides I/O capability where needed
– Provides a message passing mechanism with high bandwidth and the lowest possible latency between eight QCMs (8.2 TFLOPs) of compute power
򐂰 Connects four Power 775 planars via the L-Remote optical connections to create a 33 TFLOP tightly connected compute building block (SuperNode). The bi-sectional exchange bandwidth between the four boards is 3 TBps, the same bandwidth as 1500 10 Gb Ethernet links.
򐂰 Connects up to 512 groups of four planars (SuperNodes) together via the D optical buses with ~3 TBps of exiting bandwidth per planar.
Optical links
The Hub modules that are on the node board house optical transceivers for up to 24 L-Remote links and 16 D-Links. Each optical transceiver includes a jumper cable that connects the transceiver to the node tailstock. The transceivers are included to facilitate cost optimization, depending on the application. The supported options are shown in Table 1-2.
Table 1-2 Supported optical link options
SuperNode type           L-Remote links   D-Links                      Number of combinations
SuperNodes not enabled   0                0 - 16 in increments of 1    17
Full SuperNodes          24               0 - 16 in increments of 1    17
Total                                                                  34
Some customization options are available on the hub optics module, which allow some optic transceivers to remain unpopulated on the Torrent module if the wanted topology does not require all of the transceivers. The number of actual offering options that are deployed is dependent on specific large customer bids.
Optics physical package
The optics physical package includes the following features:
򐂰 Individual transmit (Tx) and receive (Rx) modules that are packaged in Tx+Rx pairs on a glass ceramic substrate
򐂰 Up to 28 Tx+Rx pairs per module
򐂰 uLGA (Micro-Land Grid Array) at 0.7424 mm pitch, which interconnects the optical modules to the ceramic substrate
򐂰 12-fiber optical fiber ribbon on top of each Tx and each Rx module, which is coupled through Prizm reflecting and spheric-focusing 12-channel connectors
򐂰 Copper saddle over each optical module and optical connector for uLGA actuation and heat spreading
򐂰 Heat spreader with springs and thermal interface materials that provide uLGA actuation and heat removal separately for each optical module
򐂰 South (rear) side of each glass ceramic module carries 12 Tx+Rx optical pairs that support 24 (6+6) fiber LR-links, and 2 Tx+Rx pairs that support 2 (12+12) fiber D-links
򐂰 North (front) side of each glass ceramic module carries 14 Tx+Rx optical module pairs that support 14 (12+12) fiber D-links
Optics electrical interface
The optics electrical interface includes the following features:
򐂰 12 differential pairs @ 10 Gbps (24 signals) for each Tx optics module
򐂰 12 differential pairs @ 10 Gbps (24 signals) for each Rx optics module
򐂰 Three-wire I2C/TWS (Serial Data & Address, Serial Clock, Interrupt): three signals
Cooling
Cooling includes the following features:
򐂰 Optics are water-cooled with the hub chip
򐂰 Cold plate on top of the module, which is coupled to the optics through heat spreaders and saddles, with thermal interface materials at each junction
򐂰 Recommended temperature range: 20°C - 55°C at the top of the optics modules
Optics drive/receive distances
Optics links might be up to 60 meters rack-to-rack (61.5 meters, including inside-drawer optical fiber ribbons).
Reliability assumed
The following reliability features are assumed:
򐂰 10 FIT rate per lane.
򐂰 D-link redundancy: Each (12+12)-fiber D-link runs normally with 10 active lanes and two spares. Each D-link runs in degraded-bandwidth mode with as few as eight lanes.
򐂰 LR-link redundancy: Each (6+6)-fiber LR-link runs normally with six active lanes. Each LR-link (half of a Tx+Rx pair) runs in degraded-bandwidth mode with as few as four lanes out of six.
򐂰 Overall redundancy: As many as four lanes out of each 12 (two lanes of each six) might fail without disabling any D-links or LR-links.
򐂰 One failed lane per 12 lanes is expected to be allowed in manufacturing.
򐂰 Bit error rate: Worst-case, end-of-life BER is 10^-12. Normal expected BER is 10^-18.
1.4.8 Memory subsystem
The memory controller layout is shown in Figure 1-11.
Figure 1-11 Memory controller layout
The memory cache sizes are shown in Table 1-3.
Table 1-3 Cache memory sizes
Cache level   Memory size (per core)
L1            32 KB instruction, 32 KB data
L2            256 KB
L3            4 MB eDRAM
The memory subsystem features the following characteristics:
򐂰 Memory capacity: Up to 128 GB per processor
򐂰 Memory bandwidth: 128 GB/s (peak) per processor
򐂰 Eight channels of SuperNOVA buffered DIMMs per processor
򐂰 Two memory controllers per processor:
– Four memory busses per memory controller
– Each bus is 1 B-wide write, 2 B-wide read
Memory per drawer
Each drawer features the following minimum and maximum memory ranges (a short calculation sketch follows this list):
򐂰 Minimum:
– 4 DIMMs per QCM x 8 QCMs per drawer = 32 DIMMs per drawer
– 32 DIMMs per drawer x 8 GB per DIMM = 256 GB per drawer
򐂰 Maximum:
– 16 DIMMs per QCM x 8 QCMs per drawer = 128 DIMMs per drawer
– 128 DIMMs per drawer x 16 GB per DIMM = 2 TB per drawer
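For illustration only (not from this publication), the following small C sketch recomputes the drawer totals from the DIMM count per QCM and the DIMM size; the results match the minimum and maximum figures above.

   #include <stdio.h>

   /* Drawer memory = DIMMs per QCM x QCMs per drawer x GB per DIMM. */
   static unsigned drawer_gb(unsigned dimms_per_qcm, unsigned gb_per_dimm)
   {
       const unsigned qcms_per_drawer = 8;
       return dimms_per_qcm * qcms_per_drawer * gb_per_dimm;
   }

   int main(void)
   {
       printf("Minimum: %u GB per drawer\n", drawer_gb(4, 8));    /* 256 GB         */
       printf("Maximum: %u GB per drawer\n", drawer_gb(16, 16));  /* 2048 GB (2 TB) */
       return 0;
   }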
Memory DIMMs
Memory DIMMs include the following features:
򐂰 Two SuperNOVA chips, each with a bus connected directly to the processor
򐂰 Two ports on the DIMM from each SuperNOVA
򐂰 Dual CFAM interfaces from the processor to each DIMM, wired to the primary SuperNOVA and dual chained to the secondary SuperNOVA on the DIMM
򐂰 Two VPD SEEPROMs on the DIMM, interfaced to the primary SuperNOVA CFAM
򐂰 80 DRAM sites: 2 x 10 (x8) DRAM ranks per SuperNOVA port
򐂰 Water-cooled jacketed design
򐂰 50 W maximum DIMM power
򐂰 Available in sizes: 8 GB, 16 GB, and 32 GB (RPQ)
For best performance, it is recommended that all 16 DIMM slots are plugged in each node. All DIMMs driven by a quad-chip module (QCM) must have the same size, speed, and voltage rating.
1.4.9 Quad chip module
The previous sections provided a brief introduction to the low-level components of the Power 775 system. We now look at the system on a modular level. This section discusses the quad-chip module or QCM, which contains four POWER7 chips that are connected in a ceramic module.
The standard Power 775 CEC drawer contains eight QCMs. Each QCM contains four, 8-core POWER7 processor chips and supports 16 DDR3 SuperNova buffered memory DIMMs.
Figure 1-12 on page 24 shows the POWER7 quad-chip module, which features the following characteristics:
򐂰 Four POWER7 chips, 32 cores (4 x 8 = 32)
򐂰 948 GFLOPs per QCM
򐂰 474 GOPS (integer) per QCM
򐂰 Off-chip bandwidth: 336 Gbps (peak), local plus remote interconnect
Figure 1-12 POWER7 quad chip module

1.4.10 Octant
This section discusses the “octant” level of the system.
Figure 1-13 on page 25 shows a 32-way SMP with a two-tier SMP fabric: a four-chip processor QCM plus a hub SCM with onboard optics.
Each octant represents 1/8 of the CEC planar, which contains one QCM, one Hub module, and up to 16 associated memory modules.
Octant 0: Octant 0 controls another PCIe 8x slot that is used for an Ethernet adapter for cluster management.
Figure 1-13 Power 775 octant logic diagram
The figure annotations include the following link bandwidth calculations:
򐂰 D-link: 16 transceivers x (10 + 10) lanes x 10 Gbps per lane x 8/10 coding = 2560 Gbps = 320 GBps (20 GBps per D-link)
򐂰 L-remote link: 12 transceivers x (10 + 10) lanes x 10 Gbps per lane x 8/10 coding = 1920 Gbps = 240 GBps (20 GBps per L-remote link)
򐂰 L-local link: 7 links x (64 + 64) bits per bus x 3 Gbps x 22/24 framing efficiency = 2464 Gbps = 308 GBps (44 GBps per L-local link)
򐂰 Memory: 128 GBps per POWER7 chip (peak) and 16 GBps per SuperNOVA (peak); read 10.67 GBps per SN, write 5.33 GBps per SN
Each Power 775 planar consists of eight octants, as shown in Figure 1-14 on page 26. Seven of the octants are composed of 1x QCM, 1 x HUB, 2 x PCI Express 16x. The other octant contains 1x QCM, 1x HUB, 2x PCI Express 16x, 1x PCI Express 8x.
Figure 1-14 Octant layout differences
Each Power 775 octant includes the following features:
򐂰 Compute power: 1 TF, 32 cores, 128 threads (flat memory design)
򐂰 Memory: 512 GB maximum capacity, ~512 Gbps peak memory BW (1/2 B/FLOP):
– 1280 DRAM sites (80 DRAM sites per DIMM, 1/2/4 Gb DRAMs over time)
– Two processor buses and four 8 B memory buses per DIMM (two buffers)
– Double stacked at the extreme with memory access throttling
For more information, see 1.4.8, “Memory subsystem” on page 22.
򐂰 I/O: 1 TB/s BW:
– 32 Gbps generic I/O:
• Two PCIe2 16x line rate capable busses
• Expanded I/O slot count through PCIe2 expansion possible
– 980 Gbps maximum switch fabric BW (12 optical lanes active)
򐂰 IBM proprietary fabric with on-board copper/off-board optical:
– Excellent cost/performance (especially at mid-large scale)
– Basic technology can be adjusted for low or high BW applications
򐂰 Packaging:
– Water cooled (>95% at the level shown), distributed N+1 point-of-load power
– High wire count board with small vertical interconnect accesses (VIAs)
– High pin count LGA module sockets
– Hot-plug, air-cooled PCIe I/O adapter design
– Fully redundant out-of-band management control
1.4.11 Interconnect levels
A functional Power 775 system consists of multiple nodes that are spread across several racks. This configuration means that multiple octants are available, and every octant is connected to every other octant in the system. The following levels of interconnect are available on a system:
򐂰 First level
This level connects the eight octants in a node together via the hub module by using copper board wiring. This interconnect level is referred to as “L” local (LL). Every octant in the node is connected to every other octant. For more information, see 1.4.12, “Node” on page 28.
򐂰 Second level
This level connects four nodes together to create a Supernode. This interconnection is possible via the hub module optical links. This interconnection level is referred to as “L” distant (LD). Every octant in a node must connect to every other octant in the other three nodes that form the Supernode. Every octant features 24 connections, but the total number of connections across the four nodes in a Supernode is 384. For more information, see 1.4.13, “Supernodes” on page 30.
򐂰 Third level
This level connects every Supernode to every other Supernode in a system. This interconnection is possible via the hub module optical links. This interconnect level is referred to as D-link. Each Supernode has up to 512 D-links. It is possible to scale up this level to 512 Supernodes. Every Supernode has a minimum of one hop D-link to every other Supernode. For more information, see 1.4.14, “Power 775 system” on page 32.
1.4.12 Node
This section discusses the node level of the Power 775, which is physically represented by the drawer, also commonly referred to as the CEC. A node is composed of eight octants and their local interconnect. Figure 1-15 shows the CEC drawer from the front.
Figure 1-15 CEC drawer front view
Figure 1-16 shows the CEC drawer rear view.
Figure 1-16 CEC drawer rear view
First level interconnect: L Local
L Local (LL) connects the eight octants in the CEC drawer together via the HUB module by using copper board wiring. Every octant in the node is connected to every other octant, as shown in Figure 1-17.
Figure 1-17 First level local interconnect (256 cores)
System planar board
This section provides details about the following system planar board characteristics:
򐂰 Approximately 2U x 85 cm wide x 131 cm deep overall node package in a 30 EIA frame
򐂰 Eight octants; each octant features one QCM, one hub, and 4 - 16 DIMMs
򐂰 128 memory slots
򐂰 17 PCI adapter slots (octant 0 has three PCI slots, octants 1 - 7 each have two PCI slots)
򐂰 Regulators on the bottom side of the planar, directly under the modules, to reduce loss and decoupling capacitance
򐂰 Water-cooled stiffener to cool the regulators on the bottom side of the planar and the memory DIMMs on top of the board
򐂰 Connectors on the rear of the board to optional PCI cards (17x)
򐂰 Connectors on the front of the board to redundant 2N DCCAs
򐂰 Optical fiber D-link interface cables from the hub modules to the left and right of the rear tail stock: 128 total = 16 links x 8 hub modules
򐂰 Optical fiber L-remote interface cables from the hub modules to the center of the rear tail stock: 96 total = 24 links x 8 hub modules
򐂰 Clock distribution and out-of-band control distribution from the DCCA
Redundant service processor
The redundant service processor features the following characteristics:
򐂰 The clocking source follows the topology of the service processor.
򐂰 N+1 redundancy of the service processor and clock source logic uses two inputs on each processor and hub chip.
򐂰 Out-of-band signal distribution for the memory subsystem (SuperNOVA chips) and PCI Express slots is consolidated to the standby-powered pervasive unit on the processor and hub chips. PCI Express is managed on the hub and bridge chips.
1.4.13 Supernodes
This section describes the concept of Supernodes.
Supernode configurations
The following supported Supernode configurations are used in a Power 775 system. The usage of each type is based on cluster size or application requirements:
򐂰 Four-drawer Supernode
The four-drawer Supernode is the most common Supernode configuration. This configuration is formed by four CEC drawers (32 octants) connected via Hub optical links.
򐂰 Single-drawer Supernode
In this configuration, each CEC drawer in the system is a Supernode.
Second level interconnect: L Remote
This level connects four CEC drawers (32 octants) together to create a Supernode via Hub module optical links. Every octant in a node connects to every other octant in the other three nodes in the Supernode. There are 384 connections in this level, as shown in Figure 1-18 on page 31.
Figure 1-18 Board 2nd level interconnect (1,024 cores)
The second level wiring connector count is shown in Figure 1-19 on page 32.
Figure 1-19 Second level wiring connector count
The figure illustrates the count in the following steps:
Step 1: Each octant in Node 1 must be connected to the eight octants in Node 2, the eight octants in Node 3, and the eight octants in Node 4. This requires 24 connections from each of the eight octants in Node 1, which results in 8 x 24 = 192 connections.
Step 2: Step 1 connected Node 1 to every other octant in the Supernode. Node 2 must now be connected to every remaining node (Nodes 3 and 4). This requires 16 connections from each of the eight octants in Node 2, which results in 8 x 16 = 128 connections.
Step 3: Steps 1 and 2 connected Node 1 and Node 2 to every other octant in the Supernode. Node 3 must now be connected to Node 4. Every octant in Node 3 needs eight connections to the eight octants in Node 4, which results in 8 x 8 = 64 connections. At this point, every octant in the Supernode is connected to every other octant in the Supernode.
Step 4: The total number of connections to build a Supernode is 192 + 128 + 64 = 384. Note that every octant has 24 connections, but the total number of connections across the four nodes in a given Supernode is 384.
1.4.14 Power 775 system
This section describes the Power 775 system and provides details about the third level of interconnect.
Third level interconnect: Distance
This level connects every Supernode to every other Supernode in a system by using hub module optical links. Each Supernode includes up to 512 D-links, which allows for a system that contains up to 512 Supernodes. Every Supernode features a minimum of one direct (one-hop) D-link to every other Supernode, and there are multiple two-hop connections, as shown in Figure 1-20 on page 33.
Each hub contains 16 optical D-links. The physical node (board) contains eight hubs; therefore, a physical node contains 16 x 8 = 128 optical D-links. A Supernode is four physical nodes, which results in 16 x 8 x 4 = 512 optical D-links per Supernode. This configuration allows up to 2048 connected CEC drawers.
In smaller configurations, in which the system features less than 512 Super Nodes, more than one optical D-Link per node is possible. Multiple connections between Supernodes are used for redundancy and higher bandwidth solutions.
Figure 1-20 System third level interconnect
Integrated cluster fabric interconnect
A complete Power 775 system configuration is achieved by configuring server nodes into a tight cluster by using a fully integrated switch fabric.
The fabric is a multitier, hierarchical implementation that connects eight logical nodes (octants) together in the physical node (server drawer or CEC) by using copper L-local links. Four physical nodes are connected with structured optical cabling into a Supernode by using optical L-remote links. Up to 512 super nodes are connected by using optical D-links. Figure 1-21 on page 34 shows a logical representation of a Power 775 cluster.
Figure 1-21 Logical view of a Power 775 system
Figure 1-22 shows an example configuration of a 242 TFLOP Power 775 cluster that uses eight Supernodes and direct graph interconnect. In this configuration, there are 28 D-Link cable paths to route and 1-64 12-lane 10 Gb D-Link cables per cable path.
Figure 1-22 Direct graph interconnect example
A 4,096 core (131 TF), fully interconnected system is shown in Figure 1-23.
Figure 1-23 Fully interconnected system example
Network topology
Optical D-links connect Supernodes in different connection patterns. Figure 1-24 on page 36 shows an example of 32 D-links between each pair of supernodes. Topology is 32D, a connection pattern that supports up to 16 supernodes.
Figure 1-24 Supernode connection using 32D topology
Figure 1-25 shows another example in which there is one D-link between supernode pairs, which supports up to 512 supernodes in a 1D topology.
Figure 1-25 1D network topology
The network topology is specified during the installation. A topology specifier is set up in the cluster database. In the cluster DB site table, topology=<specifier>. Table 1-4 shows the supported four-drawer and single-drawer Supernode topologies.
Table 1-4 Supported four-drawer and single-drawer Supernode topologies
Topology      Maximum number of supernodes
Four-drawer Supernode topologies:
256D          3
128D          5
64D           8
32D           16
16D           32
8D            64
4D            128
2D            256
1D            512
Single-drawer Supernode topologies:
2D_SDSN       48
8D_SDSN       12
ISR network routes
Each ISR includes a set of hardware route tables. The Local Network Management Controller (LNMC) routing code generates and maintains the routes with help from the Central Network Manager (CNM), as shown in Figure 1-26 on page 38. These route tables are set up during system initialization and are dynamically adjusted as links go down or come up during operation. Packets are injected into the network with a destination identifier and the route mode. The route information is picked up from the route tables along the route path based on this information. Packets that are injected into the interconnect by the HFI employ source route tables with the route partially determined. Per-port route tables are used to route packets along each hop in the network. Separate route tables are used for intersupernode and intrasupernode routes.
Routes are classified as direct or indirect. A direct route uses the shortest path between any two compute nodes in a system. There are multiple direct routes between a set of compute nodes because a pair of supernodes is connected by more than one D-link.
The network topology features two levels and therefore the longest direct route has three hops (no more than two L hops and at most one D hop). This configuration is called an L-D-L route.
The following conditions exist when the source and destination hubs are within a drawer:
򐂰 The route is one L-hop (assuming all of the links are good).
򐂰 LNMC needs to know only the local link status in this CEC.
Figure 1-26 Routing within a single CEC
The following conditions exist when source and destination hubs lie within a supernode, as shown in Figure 1-27:
򐂰 The route is one L-hop (every hub within a supernode is directly connected via an L-remote link to every other hub in the supernode).
򐂰 LNMC needs to know only the local link status in this CEC.
Figure 1-27 L-hop
If an L-remote link is faulty, the route requires two hops. However, only the link status local to the CEC is needed to construct routes, as shown in Figure 1-28.
Figure 1-28 Route representation in event of a faulty Lremote link
When source and destination hubs lie in different supernodes, as shown in Figure 1-29, the following conditions exist:
򐂰 Route possibilities: one D-hop or L-D (L-D-L routes also are used)
򐂰 LNMC needs non-local link status to construct L-D routes
Figure 1-29 L-D route example
The ISR also supports indirect routes to provide increased bandwidth and to prevent hot spots in the interconnect. An indirect route is a route that has an intermediate compute node in the route that is on a different supernode, not the same supernode in which source and compute nodes reside. An indirect route must employ the shortest path from the source compute node to the intermediate node, and the shortest path from the intermediate compute node to the destination compute node. Although the longest indirect route has five hops at most, no more than three hops are L hops and two hops (at most) are D hops. This configuration often is represented as an L-D-L-D-L route.
The following methods are used to select a route when multiple routes exist:
򐂰 Software specifies the intermediate supernode, but the hardware determines how to route to and then route from the intermediate supernode.
򐂰 The hardware selects among the multiple routes in a round-robin manner for both direct and indirect routes.
򐂰 The hub chip provides support for route randomization in which the hardware selects one route between a source-destination pair. Hardware-directed randomized route selection is available only for indirect routes.
These routing modes are specified on a per-packet basis.
The correct choice between the use of direct- versus indirect-route modes depends on the communication pattern that is used by the applications. Direct routing is suitable for communication patterns in which each compute node must communicate with many other nodes, for example, applications that use spectral methods. Communication patterns that involve small numbers of compute nodes benefit from the extra bandwidth that is offered by the multiple routes with indirect routing.
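The following C sketch (illustrative only; it is not the hardware implementation, and the structure names are invented for this example) contrasts the two selection policies that are described above: round-robin selection over a table of candidate routes, and randomized selection, which the hub chip offers for indirect routes only.

   #include <stdio.h>
   #include <stdlib.h>

   /* A set of candidate routes to one destination, as held in a route table. */
   struct route_set {
       int nroutes;   /* number of usable routes to the destination */
       int next;      /* cursor for round-robin selection           */
   };

   /* Round-robin selection: used for both direct and indirect routes. */
   static int pick_round_robin(struct route_set *rs)
   {
       int r = rs->next;
       rs->next = (rs->next + 1) % rs->nroutes;
       return r;
   }

   /* Randomized selection: hardware-directed, indirect routes only. */
   static int pick_random(const struct route_set *rs)
   {
       return rand() % rs->nroutes;
   }

   int main(void)
   {
       struct route_set direct   = { 4, 0 };   /* for example, 4 D-links between a supernode pair   */
       struct route_set indirect = { 8, 0 };   /* for example, 8 candidate intermediate supernodes  */

       for (int i = 0; i < 4; i++)
           printf("direct route %d, indirect route %d\n",
                  pick_round_robin(&direct), pick_random(&indirect));
       return 0;
   }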
1.5 Power, packaging, and cooling
This section provides information about the IBM Power Systems 775 power, packaging, and cooling features.
1.5.1 Frame
The front view of an IBM Power Systems 775 frame is shown in Figure 1-30.
Figure 1-30 Power 775 frame
The Power 775 frame front view is shown in Figure 1-31 on page 41.
Figure 1-31 Frame front view
The rear view of the Power 775 frame is shown in Figure 1-32 on page 42.
Figure 1-32 Frame rear photo
1.5.2 Bulk Power and Control Assembly
Each Bulk Power and Control Assembly (BPCA) is a modular unit that includes the following features:
򐂰 Bulk Power and Control Enclosure (BPCE)
Contains two 125A 3-phase AC couplers, six BPR bays, one BPCH and one BPD bay.
򐂰 Bulk Power Regulators (BPR), each rated at 27 KW@-360 VDC
One to six BPRs are populated in the BPCE depending on CEC drawers and storage enclosures in frame, and the type of power cord redundancy that is wanted.
򐂰 Bulk Power Control and Communications Hub (BPCH)
This unit provides rack-level control and concentration of the communications interfaces to the unit-level controllers that are in each server node, storage enclosure, and water conditioning unit (WCU).
򐂰 Bulk Power Distribution (BPD)
This unit distributes 360 VDC to server nodes and disk enclosures.
򐂰 Power Cords
One power cord per BPCE for populations of one to three BPRs per BPCE, and two per BPCE for populations of four to six BPRs per BPCE.
The front and rear views of the BPCA are shown in Figure 1-33.
Figure 1-33 BPCA
The minimum configuration per BPCE is one BPCH, one BPD, one BPR, and one line cord. There are always two BPCAs, one above the other, in the top of the cabinet.
BPRs are added uniformly to each BPCE depending on the power load in the rack.
A single fully configured BPCE provides 27 KW x 6 = 162 KW of bulk power, which equates to an aggregate system power cord power of approximately 170 KW. Up to this power level, bulk power is in a 2N arrangement, in which a single BPCE can be removed entirely for maintenance while the rack remains fully operational. If the rack bulk power demand exceeds 162 KW, the bulk power provides an N+1 configuration of up to 27 KW x 9 = 243 KW, which equates to an aggregate system power cord power of approximately 260 KW. N+1 bulk power mode means that one of the four power cords can be disconnected and the cabinet continues to operate normally. BPCE concurrent maintenance is not conducted in N+1 bulk power mode unless the rack bulk power load is reduced to less than 162 KW by invoking Power Efficient Mode on the server nodes. This mode reduces the peak power demand at the expense of reduced performance.
Cooling
The BPCA is nearly entirely water cooled. Each BPR and the BPD has two quick connects to enable connection to the supply and return water cooling manifolds in the cabinet. All of the components that dissipate any significant power in the BPR and BPD are heat sunk to a water-cooled cold plate in these units.
To assure that the ambient air temperature internal to the BP, BPCH, and BPD enclosures is kept low, two hot-pluggable blowers are installed in the rear of each BPCA in an N+1 speed-controlled arrangement. These blowers flush the units to keep the internal temperature at approximately the system inlet air temperature, which is 40 degrees C maximum. A fan can be replaced concurrently.
Management
The BPCH provides the mechanism to connect the cabinet to the management server via a 1 Gb Ethernet out-of-band network. Each BPCH has two management server-facing 1 Gb Ethernet ports or buses so that the BPCH connects to a fully redundant network.
1.5.3 Bulk Power Control and Communications Hub
The front view of the bulk power control and communications hub (BPCH) is shown in Figure 1-34.
Figure 1-34 BPCH front view
The following connections are shown in Figure 1-34:
򐂰 T2: 10/100 Mb Ethernet from HMC1
򐂰 T3: 10/100 Mb Ethernet from HMC2
򐂰 T4: EPO
򐂰 T5: Cross Power
򐂰 T6 - T9: RS422 UPIC for WCUs
򐂰 T10: RS422 UPIC port for connection of the Fill and Drain Tool
򐂰 T19/T36: 1 Gb Ethernet HMC connectors
򐂰 2x 10/100 Mb Ethernet ports to plug in a notebook while the frame is serviced
򐂰 2x 10/100 Mb spare (connectors contain both eNet and ½ Duplex RS422)
򐂰 T11 - T19, T20 - T35, T37 - T44: 10/100 Mb Ethernet ports or buses to the management processors in CEC drawers and storage enclosures. The maximum configuration supports 12 CEC drawers and one storage enclosure in the frame (connectors contain both eNet and ½ Duplex RS422).
1.5.4 Bulk Power Regulator
This section describes the bulk power regulator (BPR).
Input voltage requirements
The BPR supports DC and AC input power for the Power 775. A single design accommodates both the AC and DC range with different power cords for various voltage options.
DC requirements
The BPR features the following DC requirements:
򐂰 The Bulk Power Assembly (BPA) is capable of operating over a range of 300 to 600 VDC
򐂰 Nominal operating DC points are 375 VDC and 575 VDC
AC requirements
Table 1-5 shows the AC electrical requirements.
Table 1-5 AC electrical requirements
Requirement                                          Option 1                           Option 2
Input configuration                                  Three phase and GND (no neutral)   Three phase and GND (no neutral)
Rated nominal voltage and frequency                  200 to 240 Vac @ 50 to 60 Hz       380 to 480 Vac @ 50 to 60 Hz
Rated current (amps per phase)                       125 A                              100 A
Acceptable voltage tolerance @ machine power cord    180 - 259 Vac                      333 - 508 Vac
Acceptable frequency tolerance @ machine power cord  47 - 63 Hz                         47 - 63 Hz
1.5.5 Water conditioning unit

The Power 775 WCU system is shown in Figure 1-35.

Figure 1-35 Power 775 water conditioning unit system
The hose and manifold assemblies and WCUs are shown in Figure 1-36.
Figure 1-36 Hose and manifold assemblies
The components of the WCU are shown in Figure 1-37 on page 47.
The WCU components include the reservoir tank, plate heat exchanger, pump and motor assembly, proportional control valve, flow meter, check valve (integrated into the tank), dual float sensor assembly, pressure relief valve and vacuum breaker, ball valves, and quick connects for the system supply and return (to and from the electronics) and the chilled water supply and return.
Figure 1-37 WCU components
The WCU schematics are shown in Figure 1-38.
Figure 1-38 WCU schematics
1.6 Disk enclosure
This section describes the storage disk enclosure for the Power 775 system.
1.6.1 Overview
The Power 775 disk enclosure features the following characteristics:
򐂰 SAS Expander Chips (SECs):
– Number of PHYs: 38
– Each PHY capable of SAS SDR or DDR
򐂰 384 SFF DASD drives:
– 96 carriers with four drives each
– Eight storage groups (STOR 1 - 8) with 48 drives each:
• 12 carriers per STOR
• Two port cards per STOR, each with three SECs
– 32 SAS x4 ports (four lanes each) on 16 port cards
򐂰 Data rates:
– Serial Attached SCSI (SAS) SDR = 3.0 Gbps per lane (SEC to drive)
– Serial Attached SCSI (SAS) DDR = 6.0 Gbps per lane (SAS adapter in node to SEC)
򐂰 The drawer supports 10 K/15 K rpm drives in 300 GB or 600 GB sizes.
򐂰 A Joint Test Action Group (JTAG) interface is provided from the DC converter assemblies (DCAs) to each SEC for error diagnostics and boundary scan.
Important: STOR is the short name for storage group (it is not an acronym).
The front view of the disk enclosure is shown in Figure 1-39 on page 49.
Figure 1-39 Disk enclosure front view
1.6.2 High-level description
Figure 1-40 on page 50 represents the top view of a disk enclosure and highlights the front view of a STOR. Each STOR includes 12 carrier cards (six at the top of the drawer and six at the bottom of the drawer) and two port cards.
Figure 1-40 Storage drawer top view
The disk enclosure is a SAS storage drawer that is specially designed for the IBM Power 775 system. The maximum storage capacity of the drawer is 230.4 TB, distributed over 384 SFF DASD drives logically organized in eight groups of 48.
The disk enclosure features two mid-plane boards that comprise the inner core assembly. The disk drive carriers, port cards, and power supplies plug into the mid-plane boards. There are four Air Moving Devices (AMDs) in the center of the drawer. Each AMD consists of three counter-rotating fans.
Each carrier contains connectors for four disk drives. The carrier features a solenoid latch that is released only through a console command to prevent accidental unseating. The disk carriers also feature LEDs close to each drive and a gold capacitor circuit so that drives are identified for replacement after the carrier is removed for service.
Each port card includes four SAS DDR 4x ports (four lanes at 6 Gbps/lane). These incoming SAS lanes connect to the input SEC, which directs the SAS traffic to the drives. Each drive is connected to one of the output SECs on the port card with SAS SDR 1x (one lane @ 6 Gbps). There are two port cards per STOR. The first Port card connects to the A ports of all 48 drives in the STOR. The second Port card connects to the B ports of all 48 drives in the STOR. The port cards include soft switches for all 48 drives in the STOR (5 V and 12 V soft switches connect and interrupt and monitor power). The soft switch is controlled by I2C from the SAS Expander Chip (SEC) on the port card.
A fully cabled drawer includes 36 cables: four UPIC power cables and 32 SAS cables from SAS adapters in the CEC. During service to replace a power supply, two UPIC cables manage the current and power control of the entire drawer. During service of a port card, the second port card in the STOR remains cabled to the CEC so that the STOR remains operational. A customer minimum configuration is two SAS cables per STOR and four UPIC power cables per drawer to ensure proper redundancy.
1.6.3 Configuration
A disk enclosure must reside in the same frame as the CEC to which it is cabled. A frame might contain up to six Disk Enclosures. The disk enclosure front view is shown in Figure 1-41.
Figure 1-41 Disk Enclosure front view
The disk enclosure internal view is shown in Figure 1-42 on page 52.
Figure 1-42 Disk Enclosure internal view
The disk carrier is shown in Figure 1-43.
Figure 1-43 Disk carrier
The disk enclosure includes the following features (a drive-count cross-check follows this list):
򐂰 A disk enclosure is one quarter, one half, three quarters, or fully populated with HDDs and eight SSDs. The disk enclosure is always populated with eight SSDs.
򐂰 The disk enclosure contains two GPFS recovery groups (RGs). The carriers that hold the disks of the RGs are distributed throughout all of the STOR domains in the drawer.
򐂰 A GPFS recovery group consists of four SSDs and one to four declustered arrays (DAs) of 47 disks each.
򐂰 Each DA contains distributed spare space that is two disks in size.
򐂰 Every DA in a GPFS system must be the same size.
򐂰 The granularity of capacity and throughput is an entire DA.
򐂰 RGs in the GPFS system do not need to be the same size.
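As an illustrative cross-check (not from this publication), the following C sketch totals the drive sites in a fully populated enclosure from the recovery-group structure that is described above and, following the earlier convention of counting all drive sites at 600 GB, arrives at the same 384-drive, 230.4 TB figures.

   #include <stdio.h>

   int main(void)
   {
       const int recovery_groups = 2;    /* GPFS recovery groups per enclosure      */
       const int ssds_per_rg     = 4;    /* SSDs in each recovery group             */
       const int das_per_rg      = 4;    /* declustered arrays when fully populated */
       const int disks_per_da    = 47;   /* disks per declustered array             */
       const double gb_per_drive = 600.0;

       int drives = recovery_groups * (ssds_per_rg + das_per_rg * disks_per_da);
       double tb  = drives * gb_per_drive / 1000.0;

       printf("Drive sites per enclosure: %d\n", drives);  /* 384                         */
       printf("Maximum capacity: %.1f TB\n", tb);          /* 230.4 TB with 600 GB drives */
       return 0;
   }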
1.7 Cluster management
The cluster management hardware that supports the cluster is placed in 42U, 19-inch racks. The cluster management requires Hardware Management Consoles (HMCs), redundant Executive Management Servers (EMSs), and the associated Ethernet network switches.
1.7.1 Hardware Management Console
The HMC runs on a single server and is used to help manage the Power 775 servers. The traditional HMC functions for configuring and controlling the servers are done via xCAT. For more information, see 1.9.3, “Extreme Cluster Administration Toolkit” on page 72.
The HMC is often used for the following tasks:
򐂰 During installation
򐂰 For reporting hardware serviceable events, especially through Electronic Service Agent™ (ESA), which is also commonly known as call-home
򐂰 By service personnel to perform guided service actions
An HMC is required for every 36 CECs (1152 LPARs), and all Power 775 systems have redundant HMCs. For every group of 10 HMCs, a spare HMC is in place. For example, if a cluster requires four HMCs, five HMCs are present. If a cluster requires 16 HMCs, the cluster has two HMCs to serve as spares.
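For illustration only (not from this publication, and it models only the spare rule, not HMC redundancy), the following C sketch applies these sizing rules: one HMC per 36 CEC drawers plus one spare HMC per group of up to 10 required HMCs.

   #include <stdio.h>

   /* Ceiling division for positive integers. */
   static int ceil_div(int a, int b) { return (a + b - 1) / b; }

   int main(void)
   {
       int cec_drawers = 144;                      /* example cluster size (assumed) */
       int required = ceil_div(cec_drawers, 36);   /* one HMC per 36 CECs            */
       int spares   = ceil_div(required, 10);      /* one spare per 10 HMCs          */

       printf("Required HMCs: %d, spares: %d, total: %d\n",
              required, spares, required + spares);
       return 0;
   }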

1.7.2 Executive Management Server

The EMS is a standard 4U POWER7 entry-level server responsible for cluster management activities. EMSs often are redundant; however, a simplex configuration is supported in smaller Power 775 deployments.
At the cluster level, a pair of EMSs provide the following maximum management support:
- 512 frames
- 512 supernodes
- 2560 disk enclosures
The EMS is the central coordinator of the cluster from a system management perspective. The EMS is connected to, and manages, all cluster components: the frames and CECs, the HFI/ISR interconnect, I/O nodes, service nodes, and compute nodes. The EMS manages these components through the entire lifecycle, including discovery, configuration, deployment, monitoring, and updating via private network Ethernet connections. The cluster administrator uses the EMS as the primary cluster management control point. The service nodes, HMCs, and Flexible Service Processors (FSPs) are mostly transparent to the system administrator; therefore, the cluster appears to be a single, flat cluster despite the hierarchical management infrastructure that is deployed by using xCAT.

1.7.3 Service node

Systems management throughout the cluster uses a hierarchical structure (see Figure 1-44) to achieve the scaling and performance necessary for a large cluster size. All the compute and I/O nodes in a building block are initially booted via the HFI and managed by a dedicated server that is called a service node (SN) in the utility CECs.
Figure 1-44 EMS hierarchy
Two service nodes (one for redundancy) per 36 CECs/Drawers (1 - 36) are required for all Power 775 clusters.
The two service nodes must reside in different frames, except under the following conditions:
- If there is only one frame, the nodes must reside in different supernodes in the frame.
- If there is only one supernode in the frame, the nodes must reside in different CECs in the supernode.
- If there are only two or three CEC drawers, the nodes must reside in different CEC drawers.
- If there is only one CEC drawer, the two service nodes must reside in different octants.
The service node provides diskless boot and an interface to the management network. The service node requires that a PCIe SAS adapter and two 600 GB HDDs (PCIe form factor, in RAID 1 for redundancy) be installed to support diskless boot. The recommended locations are shown in Figure 1-45: the SAS PCIe adapter must reside in PCIe slot 16 and the HDDs in slots 15 and 14.
The service node also contains a 1 Gb Ethernet PCIe card that is in PCIe slot 17.
Figure 1-45 Service node

1.7.4 Server and management networks

Figure 1-46 on page 56 shows the logical structure of the two Ethernet networks for the cluster, which are known as the service network and the management network. In Figure 1-46 on page 56, the black nets designate the service network and the red nets designate the management network.
The service network is a private, out-of-band network that is dedicated to managing the Power 775 cluster hardware. This network provides Ethernet-based connectivity between the FSP of each CEC, the frame control BPA, the EMS, and the associated HMCs. Two identical network switches (ENET A and ENET B in the figure) are deployed to ensure high availability of these networks.
The management network is primarily responsible for booting all nodes (the designated service nodes, compute nodes, and I/O nodes) and monitoring their OS image loads. This management network connects the dual EMSs that run the system management software with the various Power 775 servers of the cluster. For security reasons, both the service and management networks must be considered private and not routed into the public network of the enterprise.
Figure 1-46 Logical structure of service and management networks

1.7.5 Data flow

This section provides a high-level description of the data flow on the cluster service network and cluster management operations.
After discovery of the hardware components, their definitions are stored in the xCAT database on the EMS. HMCs, CECs, and frames are discovered via Service Location Protocol (SLP) by xCAT. The discovery information includes model and serial numbers, IP addresses, and so on. The Ethernet switch of the service LAN also is queried to determine which switch port is connected to each component. This discovery is run again when the system is up, if wanted. The HFI/ISR cabling also is tested by the CNM daemon on the EMS. The disk enclosures and their disks are discovered by GPFS services on these dedicated nodes when they are booted up.
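To make the flow concrete, the following Python sketch models the kind of discovery record that is collected for each component (model and serial number, IP address, and the service-LAN switch port it is attached to). This is an illustration only; it does not reflect the actual xCAT database schema, table names, or attribute names, and the sample values are placeholders.

from dataclasses import dataclass

@dataclass
class DiscoveredComponent:
    # Hypothetical record; xCAT stores equivalent attributes in its own tables.
    component_type: str   # for example "HMC", "CEC", or "FRAME"
    model: str
    serial: str
    ip_address: str
    switch_port: str      # service-LAN switch port learned by querying the switch

def group_by_type(components):
    """Group discovered components so they can be reviewed per hardware type."""
    grouped = {}
    for component in components:
        grouped.setdefault(component.component_type, []).append(component)
    return grouped

# Placeholder inventory entries for illustration only.
inventory = [
    DiscoveredComponent("CEC", "MODEL-A", "SER0001", "10.0.0.21", "swA/17"),
    DiscoveredComponent("FRAME", "MODEL-B", "SER0002", "10.0.0.5", "swA/3"),
]
print(sorted(group_by_type(inventory).keys()))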

1.7.6 LPARs

The hardware is configured and managed via the service LAN, which connects the EMS to the HMCs, BPAs, and FSPs.
Management is hierarchical with the EMS at the top, followed by the service nodes, then all the nodes in their building blocks. Management operations from the EMS to the nodes are also distributed out through the service nodes. Compute nodes are deployed by using a service node as the diskless image server.
Monitoring information comes from the sources (frames/CECs, nodes, HFI/ISR fabric, and so on), flows through the service LAN and cluster LAN back to the EMS, and is logged in the xCAT database.
The minimum hardware requirement for an LPAR is one POWER7 chip with memory attached to its memory controller. If an LPAR is assigned to one POWER7 chip, that chip must have memory on either of its memory controllers. If an LPAR is assigned to two, three, or four POWER7 chips, at least one of those POWER7 chips must have memory attached to it.
A maximum of one LPAR per POWER7 chip is supported. A single LPAR resides on one, two, three, or four POWER7 chips. This configuration results in an Octant with the capability to have one, two, three, or four LPARs. An LPAR cannot reside in two Octants. With this configuration, the number of LPARs per CEC (eight Octants) ranges from 8 to 32 (4 x 8). Therefore, there are 1 - 4 LPARs per Octant and 8 - 32 LPARs per CEC.
The following LPAR assignments are supported in an Octant:
- One LPAR with all processor and memory resources that are allocated to that LPAR
- Two LPARs with 75% of processor and memory resources that are allocated to the first LPAR and 25% to the second
- Two LPARs with 50% of processor and memory resources that are allocated to each LPAR
- Three LPARs with 50% of processor and memory resources that are allocated to the first LPAR and 25% to each of the remaining two LPARs
- Four LPARs with 25% of processor and memory resources that are allocated to each LPAR
Recall that for an LPAR to be assigned, a POWER7 chip and memory that is attached to its memory controller is required. If either one of the two requirements is not met, that POWER7 is skipped and the LPAR is assigned to the next valid POWER7 in the order.
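The supported octant layouts can be summarized programmatically. The following Python sketch (illustrative names only) lists the supported processor and memory splits per octant, checks them against the rules above, and derives the resulting LPAR range per CEC.

# Supported LPAR resource splits within one octant (percent of the octant's
# processor and memory resources), as listed above.
SUPPORTED_OCTANT_SPLITS = [
    (100,),              # one LPAR owns the whole octant
    (75, 25),            # two LPARs
    (50, 50),            # two LPARs
    (50, 25, 25),        # three LPARs
    (25, 25, 25, 25),    # four LPARs
]

OCTANTS_PER_CEC = 8      # a CEC drawer has eight octants
CHIPS_PER_OCTANT = 4     # an octant has four POWER7 chips, at most one LPAR each

# Every split uses the whole octant and never exceeds one LPAR per chip.
assert all(sum(split) == 100 for split in SUPPORTED_OCTANT_SPLITS)
assert all(1 <= len(split) <= CHIPS_PER_OCTANT for split in SUPPORTED_OCTANT_SPLITS)

min_lpars_per_cec = OCTANTS_PER_CEC * 1                  # one LPAR per octant
max_lpars_per_cec = OCTANTS_PER_CEC * CHIPS_PER_OCTANT   # four LPARs per octant
print(min_lpars_per_cec, max_lpars_per_cec)              # 8 32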

1.7.7 Utility nodes

This section defines the utility node for all Power 775 frame configurations.
A CEC is defined as a Utility CEC (node) when it has the management server (service node) as an LPAR. Each frame configuration is addressed individually. A single utility LPAR supports a maximum of 1536 LPARs, one of which is the utility LPAR itself (one utility LPAR and 1535 other LPARs). Recall that a node contains four POWER7 chips and a single POWER7 contains a maximum of one LPAR; therefore, a CEC contains 8 x 4 = 32 POWER7 chips. This configuration results in up to 32 LPARs per CEC.
This result of 1536 LPARs translates to the following figures:
- 1536 POWER7 chips
- 384 Octants (1536 / 4 = 384)
- 48 CECs (a CEC can contain up to 32 LPARs; therefore, 1536 / 32 = 48)
There are always redundant utility nodes, and they reside in different frames when possible. If there is only one frame and multiple supernodes, the utility nodes reside in different supernodes. If there is only one supernode, the two utility nodes reside in different CECs. If there is only one CEC, the two utility LPARs reside in different Octants in the CEC.
The following defined utility CEC is used in the four-frame, three-frame, two-frame, and single-frame with 4, 8, and 12 CEC configurations. The single frame with 1 - 3 CECs uses a different utility CEC definition. These utility CEC definitions are defined in their respective frame definition sections.
The utility LPAR resides in Octant 0. The LPAR is assigned only to a single POWER7. Figure 1-47 on page 59 shows the eight-Octant CEC and the location of the management LPAR. The two-Octant and the four-Octant CEC might be used as a utility CEC and follow the same rules as the eight-Octant CEC.
Figure 1-47 Eight octant utility node definition

1.7.8 GPFS I/O nodes

Figure 1-48 shows the GPFS Network Shared Disk (NSD) node in Octant 0.
Figure 1-48 GPFS NSD node on octant 0

1.8 Connection scenario between EMS, HMC, and Frame

The network interconnect between the different system components (EMS server, HMC, and frame) is required for managing, running, maintaining, configuring, and monitoring the cluster. The management rack for a Power 775 cluster houses the different components, such as the EMS servers (IBM POWER 750), HMCs, network switches, I/O drawers for the EMS data disks, keyboard, and mouse. The different networks that are used in such an environment are the management network and the service network (as shown in Figure 1-49 on page 61). The customer network is connected to some components, but for the actual cluster, only the management and service networks are essential. For more information about the server and management networks, see 1.7.4, “Server and management networks” on page 55.
Figure 1-49 Typical cabling scenario for the HMC, the EMS, and the frame
In Figure 1-49, you see the different networks and cabling. Each frame has two Ethernet ports on the BPCH to connect to service networks A and B.
The I/O drawers in which the disks for the EMS servers are installed also are interconnected. The data is secured with RAID 6, and the I/O drawers also are software mirrored. This means that when one EMS server goes down for any reason, the other EMS server accesses the data. The EMS servers are redundant in this scenario, but there is no automated high-availability process for recovery of a failed EMS server.
All actions to activate the second EMS server must be performed manually. There also is no plan to automate this process. A cluster continues running without the EMS servers (if both servers fail). No node fails because of a server failure or an HMC error. When multiple problems arise simultaneously, there might be a greater need for manual intervention, but this situation does not often occur under normal circumstances.

1.9 High Performance Computing software stack

Figure 1-50 shows the IBM Power Systems HPC software stack.
Figure 1-50 POWER HPC software stack
Table 1-6 describes the IBM Power HPC software stack.
Table 1-6 HPC software stack for POWER

Application Development Environment
- HPC Workbench: Integrated Development Environment that is based on Eclipse PTP (open source); PTP (Parallel tools platform); C and Fortran Development Tools
  http://www.eclipse.org/ptp/
  http://www.eclipse.org/photran/
- High Scalable Communications Protocol (programming models support: MPI, LAPI, OpenShmem, UPC): IBM Parallel Environment (MPI, LAPI/PAMI, Debug Tools, OpenShmem). Note: user space support for IB and HFI
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.pe.doc/pebooks.html
- Performance Tuning Tools: IBM HPC Toolkit (part of PE)
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.pe.doc/pebooks.html
- PGAS language support: Unified Parallel C (UPC), X10
  http://upc.lbl.gov/
  http://x10-lang.org/
- Compilers: XL C/C++, XL Fortran, OpenMP
  http://www.ibm.com/software/awdtools/xlcpp/
  http://www.ibm.com/software/awdtools/fortran/
  http://openmp.org/
- Performance Counter: PAPI
  http://icl.cs.utk.edu/papi/
- Debugger: Allinea Parallel Debugger, TotalView, Eclipse debugger, PERCS debugger
  http://www.allinea.com/products/ddt
  http://www.roguewave.com/products/totalview-family.aspx
  http://eclipse.org/ptp/
- GPU support: OpenCL
  http://www.khronos.org/opencl/
- Scientific Math libraries: ESSL, Parallel ESSL
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.essl.doc/esslbooks.html
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.pessl.doc/pesslbooks.html
- HPCS Toolkit
  http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpcst.index.html

Advanced Systems Management
- Development, maintenance, and remote hardware controls: xCAT
  http://xcat.sourceforge.net/
- System monitoring: RSCT (RMC)
  http://www.redbooks.ibm.com/abstracts/sg246615.html
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.related_libraries.doc/related.htm?path=3_6#rsct_link
- Event handling: Toolkit for Event Analysis and Logging (TEAL)
  http://pyteal.sourceforge.net

Workload and Resource Management
- Scheduler and integrated resource manager: IBM LoadLeveler
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.loadl.doc/llbooks.html

Cluster File System
- Advanced scalable file system: GPFS
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfsbooks.html
- Network: NFS

Operating System Support
- Base support: AIX 7.1B, RHEL 6
  http://www.ibm.com/systems/power/software/aix/index.html
  http://www.ibm.com/systems/power/software/linux/
- Key OS enhancements for HPC: AIX, Linux
  http://www.ibm.com/systems/power/software/aix/index.html
  http://www.ibm.com/systems/power/software/linux/

Network Management
- InfiniBand: Vendor supported tools
  http://www.infinibandta.org/
- HFI: Switch management (CNM), route management, failover/recovery, and performance counter collection and analysis; firmware level GFW, LNMC

Cluster Database
- Database: DB2®
  http://www.ibm.com/software/data/db2/

Other Key Features
- Scalability: 16K OS images (special bid)
- OS Jitter: Best practices guide, jitter mitigation that is based on a synchronized global clock, kernel patches
- RAS: Failover, striping with multiple links
- Multilink/bonding support: Supported

1.9.1 Integrated Switch Network Manager

The ISNM subsystem package is installed on the executive management server of a high-performance computing cluster that consists of IBM Power 775 Supercomputers and contains the network management commands. The local network management controller runs on the service processor of each server drawer and is shipped with the Power 775.
Network management services
As shown in Figure 1-51 on page 66, the ISNM provides the following services:
- ISR network configuration and installation:
  – Topology validation
  – Miswire detection
  – Works with the cluster configuration as defined in the cluster database
  – Hardware global counter configuration
  – Phased installation and Optical Link Connectivity Test (OLCT)
- ISR network hardware status:
  – Monitors for ISR, HFI, link, and optical module events
  – Command-line queries to display network hardware status
  – Performance counter collection
  – Some RMC monitor points (for example, HFI Down)
- Network maintenance:
  – Sets up ISR route tables during drawer power-on
  – Applies thresholds on certain link events, which might disable a link
  – Dynamically updates route tables to reroute around problems or to add routes when CECs power on
  – Maintains data to support software route mode choices and makes the data available to the OS through PHYP
  – Monitors global counter health
- Report hardware failures:
  – Analysis runs on the EMS
  – Most events are forwarded to the TEAL Event DB and Alert DB
  – Link events that are due to CEC power off/power on are consolidated within CNM to reduce unnecessary strain on analysis
  – Events are reported via TEAL to Service Focal Point™ on the HMC
Figure 1-51 ISNM operating environment
MCRSA for the ISNM: IBM offers Machine Control Program Remote Support Agreement (MCRSA) for the ISNM. This agreement includes remote call-in support for the central network manager and the hardware server components of the ISNM, and for the local network management controller machine code.
MCRSA enables a single-site or worldwide enterprise customer to maintain machine code entitlement to remote call-in support for ISNM throughout the life of the MCRSA.
Figure 1-52 ISNM distributed architecture
A high-level representation of the ISNM distributed architecture is shown in Figure 1-52. An instance of the Local Network Manager (LNMC) software runs on each FSP. Each LNMC generates routes for the eight hubs in the local drawer specific to the supernode, drawer, and hub.
A Central Network Manager (CNM) runs on the EMS and communicates with the LNMCs. Link status and reachability information flows between the LNMC instances and CNM. Network events flow from LNMC to CNM, and then to Toolkit for Event Analysis and Logging (TEAL).
Local Network Management Controller
The LNMC present on each node features the following primary functions:
- Event management:
  – Aggregates local hardware events, local routing events, and remote routing events.
- Route management:
  – Generates routes that are based on configuration data and the current state of links in the network.
- Hardware access:
  – Downloads routes.
  – Allows the hardware to be examined and manipulated.
Figure 1-53 on page 68 shows a logical representation of these functions.
Figure 1-53 LNMC functional blocks
The LNMC also interacts with the EMS and with the ISR hardware to support the execution of vital management functions. Figure 1-54 on page 69 provides a high-level visualization of the interaction between the LNMC components and other external entities.
Figure 1-54 LNMC external interactions
As shown in Figure 1-54, the following external interactions are featured in the LNMC:
1. Network configuration commands
   The primary function of this procedure is to uniquely identify the Power 775 server within the network. This includes the following information:
   – Network topology
   – Supernode identification
   – Drawer identification within the supernode
   – Frame identification (from BPA via FSP)
   – Cage identification (from BPA via FSP)
   – Expected neighbors table for mis-wire detection
2. Local network hardware events
   All network hardware events flow from the ISR into the LNMC's event management, where they are examined and acted upon. The following list shows potential actions that are taken by event management:
   – Threshold checking
   – Actions upon hardware
   – Event aggregation
   – Network status update, which involves route management and CNM reporting
   – Reporting to EMS
3. Network event reporting
   Event management examines each local network hardware event and, if appropriate, forwards the event to the EMS for analysis and reports to the Service Focal Point. Event management also sends the following local routing events that indicate changes in the link status or route tables within the local drawer that other LNMCs need to react to:
   – Link usability masks (LUM): One per hub in the drawer, indicates whether each link on that hub is available for routing
   – PRT1 and PRT2 validity vectors: One each per hub in the drawer, more data that is used in making routing decisions
   General changes in LNMC or network status are also reported via this interface.
4. Remote network events
   After a local routing event (LUM, PRT1, PRT2) is received by CNM, CNM determines which other LNMCs need the information to make route table updates, and sends the updates to those LNMCs.
   The events are aggregated together by event management and then passed to route management. Route management generates a set of appropriate route table updates and potentially some PRT1 and PRT2 events of its own.
   Changed routes are downloaded via hardware access. Event management sends out new PRT1 and PRT2 events, if applicable.
5. Local hardware management
   Hardware access provides the following facilities to both LNMC and CNM for managing the network hardware:
   – Reads and writes route tables
   – Reads and writes hardware registers
   – Disables and enables ports
   – Controls the optical link connectivity test
   – Allows management of multicast
   – Allows management of the global counter
   – Reads and writes performance counters
6. Centralized hardware management
   The following functions are managed centrally by CNM with support from LNMC:
   – Global counter
   – Multicast
   – Port enable/disable
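The link usability mask that is described above is essentially a per-hub bitmap of link states. The following Python sketch is an illustration of that idea only (the real LNMC data structures and per-hub link counts are not documented here): it marks links as usable or unusable for one hub and reports which links remain available for routing.

class LinkUsabilityMask:
    """Illustrative per-hub bitmap: bit i set means link i is usable for routing."""

    def __init__(self, link_count):
        self.link_count = link_count
        self.mask = (1 << link_count) - 1   # assume all links usable initially

    def set_link(self, link, usable):
        if not 0 <= link < self.link_count:
            raise IndexError("no such link on this hub")
        if usable:
            self.mask |= (1 << link)
        else:
            self.mask &= ~(1 << link)

    def usable_links(self):
        return [i for i in range(self.link_count) if self.mask & (1 << i)]

lum = LinkUsabilityMask(link_count=8)   # hypothetical link count for the example
lum.set_link(3, usable=False)           # a local event marked link 3 unusable
print(lum.usable_links())               # [0, 1, 2, 4, 5, 6, 7]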
Central Network Manager
The CNM daemon waits for events and handles each one as a separate transaction. There are software threads within CNM that handle different aspects of the network management tasks.
The service network traffic flows through another daemon called High Performance Computing Hardware Server.
Figure 1-55 on page 71 shows the relationships between the CNM software components. The components are described in the following section.
Figure 1-55 CNM software structure
Communication layer
This layer provides a packet library with methods for communicating with LNMC. The layer manages incoming and outgoing messages between FSPs and CNM component message queues.
The layer also manages event aggregation and the virtual connections to the Hardware Server.
Database component
This component maintains the CNM internal network hardware database and updates the status fields in this in-memory database to support reporting the status to the administrator. The component also maintains required reachability information for routing.
Routing component
This component builds and maintains the hardware multicast tree. The component also writes multicast table contents to the ISR and handles the exchange of routing information between the LNMCs to support route generation and maintenance.
Global counter component
This component sets up and monitors the hardware global counter. The component also maintains information about the location of the ISR master counter and configured backups.
Recovery component
The recovery component gathers network hardware events and frame-level events. This component also logs each event in the CNM_ERRLOG and sends most events to TEAL.
The recovery component also performs some event consolidation to avoid flooding the TEAL with too many messages in the event of a CEC power up or power down.
Performance counter data management
This data management periodically collects ISR and HFI aggregate performance counters from the hardware and stores the counters in the cluster database. The collection interval and amount of data to keep are configurable.
Command handler
This handler is a socket listener to the command module. This handler manages ISNM commands, such as requests for hardware status, configuration for LNMC, and link diagnostics.
IBM High Performance Computing Hardware Server
In addition to the CNM software components, the HPC Hardware Server (HWS) handles the connections to the service network. Its primary function is to manage connections to service processors and provide an API for clients to communicate with the service processors. HWS assigns every service processor connection a unique handle that is called a virtual port number (vport). This handle is used by clients to send synchronous commands to the hardware.
In a Power 775 cluster, HPC HWS runs on the EMS, and on each xCAT service node.

1.9.2 DB2

IBM DB2 Workgroup Server Edition 9.7 for High Performance Computing (HPC) V1.1 is a scalable, relational database that is designed for use in a local area network (LAN) environment and provides support for both local and remote DB2 clients. DB2 Workgroup Server Edition is a multi-user version of DB2 packed with features that are designed to reduce the overall costs of owning a database. DB2 includes data warehouse capabilities, high availability function, and is administered remotely from a satellite control database.
The IBM Power 775 Supercomputer cluster solution requires a database to store all of the configuration and monitoring data. DB2 Workgroup Server Edition 9.7 for HPC V1.1 is licensed for use only on the executive management server (EMS) of the Power 775 high-performance computing cluster.
The EMS serves as a single point of control for cluster management of the Power 775 cluster. The Power 775 cluster also includes a backup EMS, service nodes, compute nodes, I/O nodes, and login nodes. DB2 Workgroup Server Edition 9.7 for HPC V1.1 must be installed on the EMS and backup EMS.

1.9.3 Extreme Cluster Administration Toolkit

Extreme Cloud Administration Toolkit (xCAT) is an open source, scalable, distributed computing management and provisioning tool that provides a unified interface for hardware control, discovery, and operating system stateful and stateless deployment. This robust toolkit is used for the deployment and administration of AIX or Linux clusters, as shown in Figure 1-56 on page 73.
xCAT makes simple clusters easy and complex clusters possible through the following features:
- Remotely controlling hardware functions, such as power, vitals, inventory, event logs, and alert processing. xCAT indicates remotely which light path LEDs are lit.
- Managing server consoles remotely via serial console and SOL.
- Installing an AIX or Linux cluster with utilities for installing many machines in parallel.
- Managing an AIX or Linux cluster with tools for management and parallel operation.
- Setting up a high-performance computing software stack, including software for batch job submission, parallel libraries, and other software that is useful on a cluster.
- Creating and managing stateless and diskless clusters.
Figure 1-56 xCAT architecture
xCAT supports both Intel-based and POWER-based architectures, and provides operating system support for AIX, Linux (Red Hat, SUSE, and CentOS), and Windows installations. The following provisioning methods are available:
- Local disk
- Stateless (via Linux ramdisk support)
- iSCSI (Windows and Linux)
xCAT manages a Power 775 cluster by using a hierarchical distribution that is based on management and service nodes. A single xCAT management node with multiple service nodes provides boot services to increase scaling (to thousands and up to tens of thousands of nodes).
The number of nodes and network infrastructure determine the number of Dynamic Host Configuration Protocol/Trivial File Transfer Protocol/Hypertext Transfer Protocol (DHCP/TFTP/HTTP) servers that are required for a parallel reboot without DHCP/TFTP/HTTP timeouts.
The number of DHCP servers does not need to equal the number of TFTP or HTTP servers. TFTP servers NFS-mount the /tftpboot and image directories read-only from the management node to provide a consistent set of kernel, initrd, and file system images.
xCAT version 2 provides the following enhancements that address the requirements of a Power 775 cluster:
- Improved ACLs and non-root operator support:
  – Certificate-authenticated client/server XML protocol for all xCAT commands
- Choice of databases:
  – Use a database (DB) like SQLite, or an enterprise DB like DB2 or Oracle
  – Stores all of the cluster configuration data, status information, and events
  – Information is stored in the DB by other applications and customer scripts
  – Data change notification is used to drive automatic administrative operations
- Improved monitoring:
  – Hardware event and Simple Network Management Protocol (SNMP) alert monitoring
  – More HPC stack (GPFS, LL, Torque, and so on) setup and monitoring
- Improved RMC conditions:
  – Condition triggers when it is true for a specified duration
  – Batch multiple events into a single invocation of the response
  – Micro-sensors: ability to extend RMC monitoring efficiently
  – Performance monitoring and aggregation that is based on TEAL and RMC
- Automating the deployment process:
  – Automate creation of LPARs in every CEC
  – Automate setup of infrastructure nodes (service nodes and I/O nodes)
  – Automate configuration of network adapters, and assign node names/IDs, IP addresses, and so on
  – Automate choosing and pushing the corresponding operating system and other HPC software images to nodes
  – Automate configuration of the operating system and HPC software so that the system is ready to use
  – Automate verification of the nodes to ensure their availability
- Boot nodes with a single shared image among all nodes of a similar configuration (diskless support)
- Allow for deploying the cluster in phases (for example, a set of new nodes at a time by using the existing cluster)
- Scan the connected networks to discover the various hardware components and firmware information of interest:
  – Uses the standard SLP protocol
  – Finds FSPs, BPAs, and hardware control points
- Automatically defines the discovered components to the administration software, assigning IP addresses and hostnames
- Hardware control (for example, powering components on and off) is automatically configured
- ISR and HFI components are initialized and configured
- All components are scanned to ensure that firmware levels are consistent and at the wanted version
- Firmware is updated on all down-level components when necessary
- Provide software inventory:
  – Utilities to query the software levels that are installed in the cluster
  – Utilities to choose updates to be applied to the cluster
- With diskless nodes, software updates are applied to the OS image on the server (nodes apply the updates on the next reboot)
- HPC software (LoadLeveler, GPFS, PE, ESSL, Parallel ESSL, compiler libraries, and so on) is installed throughout the cluster by the system management software
- HPC software relies on system management to provide configuration information; system management stores the configuration information in the management database
- Uses the RMC monitoring infrastructure for monitoring and diagnosing the components of interest
- Continuous operation (rolling update):
  – Apply upgrades and maintenance to the cluster with minimal impact on running jobs
  – Rolling updates are coordinated with CNM and LL to schedule updates (reboots) to a limited set of nodes at a time, allowing the other nodes to still be running jobs

1.9.4 Toolkit for Event Analysis and Logging

The Toolkit for Event Analysis and Logging (TEAL) is a robust framework for low-level system event analysis and reporting that supports both real-time and historic analysis of events. TEAL provides a central repository for low-level event logging and analysis that addresses the new Power 775 requirements.
The analysis of system events is delivered through alerts. A rules-based engine is used to determine which alert must be delivered. The TEAL configuration controls the manner in which problem notifications are delivered.
Real-time analysis provides a pro-active approach to system management, and the historical analysis allows for deeper on-site and off-site debugging.
The primary users of TEAL are the system administrator and operator. The output of TEAL is delivered to an alert database that is monitored by the administrator and operators through a series of monitoring methods.
TEAL runs on the EMS and commands are issued via the EMS command line. TEAL supports the monitoring of the following functions:
- ISNM/CNM
- LoadLeveler
- HMCs/Service Focal Points
- PNSD
- GPFS
For more information about TEAL, see Table 1-6 on page 62.

1.9.5 Reliable Scalable Cluster Technology

Reliable Scalable Cluster Technology (RSCT) is a set of software components that provide a comprehensive clustering environment for AIX, Linux, Solaris, and Windows. RSCT is the infrastructure that is used by various IBM products to provide clusters with improved system availability, scalability, and ease of use.
RSCT includes the following components:
- Resource monitoring and control (RMC) subsystem
  This subsystem is the scalable, reliable backbone of RSCT. RMC runs on a single machine or on each node (operating system image) of a cluster and provides a common abstraction for the resources of the individual system or the cluster of nodes. You use RMC for single system monitoring or for monitoring nodes in a cluster. However, in a cluster, RMC provides global access to subsystems and resources throughout the cluster, thus providing a single monitoring and management infrastructure for clusters.
- RSCT core resource managers
  A resource manager is a software layer between a resource (a hardware or software entity that provides services to some other component) and RMC. A resource manager maps programmatic abstractions in RMC into the actual calls and commands of a resource.
- RSCT cluster security services
  This RSCT component provides the security infrastructure that enables RSCT components to authenticate the identity of other parties.
- Topology services subsystem
  This RSCT component provides node and network failure detection on some cluster configurations.
- Group services subsystem
  This RSCT component provides cross-node/process coordination on some cluster configurations.
For more information, see this website:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.related_libraries.doc/related.htm?path=3_6#rsct_link

1.9.6 GPFS

The IBM General Parallel File System (GPFS) is a distributed, high-performance, massively scalable enterprise file system solution that addresses the most challenging demands in high-performance computing.
GPFS provides online storage management, scalable access, and integrated information lifecycle management tools capable of managing petabytes of data and billions of files. Virtualizing your file storage space and allowing multiple systems and applications to share common pools of storage provides you the flexibility to transparently administer the infrastructure without disrupting applications. This configuration improves cost and energy efficiency and reduces management overhead.
The massive namespace support, seamless capacity and performance scaling, proven reliability features, and flexible architecture of GPFS help your company foster innovation by simplifying your environment and streamlining data work flows for increased efficiency.
GPFS plays a key role in the shared storage configuration for Power 775 clusters. Virtually all large-scale systems are connected to disk over HFI via GPFS Network Shared Disk (NSD) servers, which are referred to as GPFS I/O nodes or storage nodes in Power 775 terminology. The system interconnect features higher performance and is far more scalable than traditional storage fabrics, and is RDMA capable.
GPFS includes a Native RAID function that is used to manage the disks in the disk enclosures. In particular, the disk hospital function is queried regularly to ascertain the health of the disk subsystem. This querying is not always necessary because disk problems that require service are reported to the HMC as serviceable events and to TEAL.
For more information about GPFS, see Table 1-6 on page 62.
GPFS Native RAID
GPFS Native RAID is a software implementation of storage RAID technologies within GPFS. By using conventional dual-ported disks in a Just-a-Bunch-Of-Disks (JBOD) configuration, GPFS Native RAID implements sophisticated data placement and error correction algorithms to deliver high levels of storage reliability, availability, and performance. Standard GPFS file systems are created from the NSDs defined through GPFS Native RAID.
This section describes the basic concepts, advantages, and motivations behind GPFS Native RAID: redundancy codes, end-to-end checksums, data declustering, and administrator configuration, including recovery groups, declustered arrays, virtual disks, and virtual disk NSDs.
Overview
GPFS Native RAID integrates the functionality of an advanced storage controller into the GPFS NSD server. Unlike an external storage controller, in which configuration, LUN definition, and maintenance are beyond the control of GPFS, GPFS Native RAID takes ownership of a JBOD array to directly match LUN definition, caching, and disk behavior to GPFS file system requirements.
Sophisticated data placement and error correction algorithms deliver high levels of storage reliability, availability, serviceability, and performance. GPFS Native RAID provides a variation of the GPFS NSD that is called a virtual disk, or VDisk. Standard NSD clients transparently access the VDisk NSDs of a file system by using the conventional NSD protocol.
GPFS Native RAID includes the following features:
- Software RAID: GPFS Native RAID runs on standard AIX disks in a dual-ported JBOD array, which does not require external RAID storage controllers or other custom hardware RAID acceleration.
- Declustering: GPFS Native RAID distributes client data, redundancy information, and spare space uniformly across all disks of a JBOD. This distribution reduces the rebuild (disk failure recovery process) overhead compared to conventional RAID.
- Checksum: An end-to-end data integrity check (by using checksums and version numbers) is maintained between the disk surface and NSD clients. The checksum algorithm uses version numbers to detect silent data corruption and lost disk writes.
- Data redundancy: GPFS Native RAID supports highly reliable two-fault-tolerant and three-fault-tolerant Reed-Solomon-based parity codes and three-way and four-way replication.
- Large cache: A large cache improves read and write performance, particularly for small I/O operations.
- Arbitrarily sized disk arrays: The number of disks is not restricted to a multiple of the RAID redundancy code width, which allows flexibility in the number of disks in the RAID array.
- Multiple redundancy schemes: One disk array supports VDisks with different redundancy schemes; for example, Reed-Solomon and replication codes.
- Disk hospital: A disk hospital asynchronously diagnoses faulty disks and paths, and requests replacement of disks by using past health records.
- Automatic recovery: Seamlessly and automatically recovers from primary server failure.
- Disk scrubbing: A disk scrubber automatically detects and repairs latent sector errors in the background.
- Familiar interface: Standard GPFS command syntax is used for all configuration commands, including maintaining and replacing failed disks.
- Flexible hardware configuration: Support of JBOD enclosures with multiple disks physically mounted together on removable carriers.
- Configuration and data logging: Internal configuration and small-write data are automatically logged to solid-state disks for improved performance.
GPFS Native RAID features
This section describes three key features of GPFS Native RAID and how the functions work: data redundancy that uses RAID codes, end-to-end checksums, and declustering.
RAID codes
GPFS Native RAID automatically corrects for disk failures and other storage faults by reconstructing the unreadable data by using the available data redundancy of either a Reed-Solomon code or N-way replication. GPFS Native RAID uses the reconstructed data to fulfill client operations, and in the case of disk failure, to rebuild the data onto spare space. GPFS Native RAID supports two- and three-fault-tolerant Reed-Solomon codes and three-way and four-way replication, which detect and correct up to two or three concurrent faults. The redundancy code layouts that are supported by GPFS Native RAID, called tracks, are shown in Figure 1-57.
Figure 1-57 Redundancy codes that are supported by GPFS Native RAID
GPFS Native RAID supports two- and three-fault tolerant Reed-Solomon codes, which partition a GPFS block into eight data strips and two or three parity strips. The N-way replication codes duplicate the GPFS block on N - 1 replica strips.
GPFS Native RAID automatically creates redundancy information, depending on the configured RAID code. By using a Reed-Solomon code, GPFS Native RAID equally divides a GPFS block of user data into eight data strips and generates two or three redundant parity strips. This configuration results in a stripe or track width of 10 or 11 strips and storage efficiency of 80% or 73% (excluding user configurable spare space for rebuild).
By using N-way replication, a GPFS data block is replicated N - 1 times, implementing 1 + 2 and 1 + 3 redundancy codes, with the strip size equal to the GPFS block size. Thus, for every block or strip written to the disks, N replicas of that block or strip are also written. This configuration results in track width of three or four strips and storage efficiency of 33% or 25%.
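The storage efficiencies quoted above follow directly from the track layouts. The following Python sketch (illustrative names only) reproduces that arithmetic for the four supported codes (8 + 2p, 8 + 3p, three-way, and four-way replication); spare space, which is configured separately, is excluded as in the text.

from fractions import Fraction

def track_efficiency(data_strips, redundancy_strips):
    """Fraction of a track's strips that carry user data (spare space excluded)."""
    return Fraction(data_strips, data_strips + redundancy_strips)

codes = {
    "8 + 2p Reed-Solomon": (8, 2),        # 10-strip track, tolerates 2 faults
    "8 + 3p Reed-Solomon": (8, 3),        # 11-strip track, tolerates 3 faults
    "3-way replication (1 + 2)": (1, 2),  # 3-strip track
    "4-way replication (1 + 3)": (1, 3),  # 4-strip track
}

for name, (data, redundancy) in codes.items():
    eff = track_efficiency(data, redundancy)
    print(f"{name}: track width {data + redundancy}, efficiency {float(eff):.0%}")
# Prints 80%, 73%, 33%, and 25%, matching the values in the text.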
End-to-end checksum
Most implementations of RAID codes implicitly assume that disks reliably detect and report faults, hard-read errors, and other integrity problems. However, studies show that disks do not report some read faults and occasionally fail to write data, although it was reported that the data was written.
These errors are often referred to as silent errors, phantom writes, dropped writes, or off-track writes. To compensate for these shortcomings, GPFS Native RAID implements an end-to-end checksum that detects silent data corruption that is caused by disks or other system components that transport or manipulate the data.
When an NSD client is writing data, a checksum of 8 bytes is calculated and appended to the data before it is transported over the network to the GPFS Native RAID server. On reception, GPFS Native RAID calculates and verifies the checksum. GPFS Native RAID stores the data, a checksum, and version number to disk and logs the version number in its metadata for future verification during read.
When GPFS Native RAID reads disks to satisfy a client read operation, it compares the disk checksum against the disk data and the disk checksum version number against what is stored in its metadata. If the checksums and version numbers match, GPFS Native RAID sends the data along with a checksum to the NSD client. If the checksum or version numbers are invalid, GPFS Native RAID reconstructs the data by using parity or replication and returns the reconstructed data and a newly generated checksum to the client. Thus, both silent disk read errors and lost or missing disk writes are detected and corrected.
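The following Python sketch illustrates the shape of that write-and-verify flow. It is a simplified illustration only: the 8-byte checksum here is a truncated SHA-256 digest, which is not the checksum algorithm that GPFS Native RAID actually uses, and version handling is reduced to a single counter.

import hashlib

def checksum8(data: bytes) -> bytes:
    # Illustrative 8-byte checksum; GPFS Native RAID's real algorithm differs.
    return hashlib.sha256(data).digest()[:8]

def client_write(data: bytes) -> bytes:
    """NSD client appends an 8-byte checksum before sending data to the server."""
    return data + checksum8(data)

def server_receive_and_store(payload: bytes, version: int) -> dict:
    data, received = payload[:-8], payload[-8:]
    if checksum8(data) != received:
        raise IOError("checksum mismatch on reception")
    # Store data, checksum, and version; log the version in metadata for reads.
    return {"data": data, "checksum": received, "version": version}

def server_read(record: dict, expected_version: int) -> bytes:
    if (record["version"] != expected_version or
            checksum8(record["data"]) != record["checksum"]):
        # In GPFS Native RAID the data would be reconstructed from parity or replicas.
        raise IOError("silent data corruption or lost write detected")
    return record["data"]

stored = server_receive_and_store(client_write(b"GPFS block"), version=1)
print(server_read(stored, expected_version=1))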
Declustered RAID
Compared to conventional RAID, GPFS Native RAID implements a sophisticated data and spare space disk layout scheme that allows for arbitrarily sized disk arrays and reduces the overhead to clients that are recovering from disk failures. To accomplish this configuration, GPFS Native RAID uniformly spreads or declusters user data, redundancy information, and spare space across all the disks of a declustered array. A conventional RAID layout is compared to an equivalent declustered array in Figure 1-58 on page 80.
Figure 1-58 Conventional RAID versus declustered RAID layouts
Figure 1-58 shows an example of how GPFS Native RAID improves client performance during rebuild operations by using the throughput of all disks in the declustered array. This is illustrated by comparing a conventional RAID of three arrays versus a declustered array, both using seven disks. A conventional 1-fault-tolerant 1 + 1 replicated RAID array is shown with three arrays of two disks each (data and replica strips) and a spare disk for rebuilding. To decluster this array, the disks are divided into seven tracks, two strips per array. The strips from each group are then spread across all seven disk positions, for a total of 21 virtual tracks. The strips of each disk position for every track are then arbitrarily allocated onto the disks of the declustered array (in this case, by vertically sliding down and compacting the strips from above). The spare strips are uniformly inserted, one per disk.
As illustrated in Figure 1-59 on page 81, a declustered array significantly shortens the time that is required to recover from a disk failure, which lowers the rebuild overhead for client applications. When a disk fails, erased data is rebuilt by using all of the operational disks in the declustered array, the bandwidth of which is greater than the fewer disks of a conventional RAID group. If another disk fault occurs during a rebuild, the number of impacted tracks that require repair is markedly less than the previous failure and less than the constant rebuild overhead of a conventional array.
The decrease in declustered rebuild impact and client overhead might be a factor of three to four times less than a conventional RAID. Because GPFS stripes client data across all the storage nodes of a cluster, file system performance becomes less dependent upon the speed of any single rebuilding storage array.
Figure 1-59 Lower rebuild overhead in conventional RAID versus declustered RAID
When a single disk fails in the 1-fault-tolerant 1 + 1 conventional array on the left, the redundant disk is read and copied onto the spare disk, which requires a throughput of seven strip I/O operations. When a disk fails in the declustered array, all replica strips of the six impacted tracks are read from the surviving six disks and then written to six spare strips, for a throughput of two strip I/O operations. Figure 1-59 compares disk read and write I/O throughput during the rebuild operations.
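The rebuild comparison in Figure 1-59 can be reproduced with simple arithmetic. The following Python sketch (illustrative only) uses the seven-disk example above: in a conventional 1 + 1 array, every strip of the failed disk is read from its partner and written to the spare disk, while the declustered layout spreads the same work across all six surviving disks.

def conventional_rebuild_load(strips_on_failed_disk):
    """1 + 1 mirror: the partner disk is read and the spare disk is written,
    so each of those two disks performs one strip I/O per lost strip."""
    return strips_on_failed_disk        # strip I/Os on the busiest disk

def declustered_rebuild_load(impacted_tracks, surviving_disks):
    """Declustered 1 + 1: one replica strip is read and one spare strip is
    written per impacted track, spread evenly over the surviving disks."""
    reads_per_disk = impacted_tracks / surviving_disks
    writes_per_disk = impacted_tracks / surviving_disks
    return reads_per_disk + writes_per_disk   # strip I/Os per surviving disk

# Seven-disk example from the text: 7 strips per disk, 6 impacted tracks.
print(conventional_rebuild_load(7))    # 7 strip I/Os on each of two disks
print(declustered_rebuild_load(6, 6))  # 2.0 strip I/Os per surviving disk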
Disk configurations
This section describes recovery group and declustered array configurations.
Recovery groups
GPFS Native RAID divides disks into recovery groups in which each disk is physically connected to two servers: primary and backup. All accesses to any of the disks of a recovery group are made through the active primary or backup server of the recovery group.
Building on the inherent NSD failover capabilities of GPFS, when a GPFS Native RAID server stops operating because of a hardware fault, software fault, or normal shutdown, the backup GPFS Native RAID server seamlessly assumes control of the associated disks of its recovery groups.
Typically, a JBOD array is divided into two recovery groups that are controlled by different primary GPFS Native RAID servers. If the primary server of a recovery group fails, control automatically switches over to its backup server. Within a typical JBOD, the primary server for a recovery group is the backup server for the other recovery group.
Figure 1-60 illustrates the ring configuration where GPFS Native RAID servers and storage JBODs alternate around a loop. A particular GPFS Native RAID server is connected to two adjacent storage JBODs and vice versa. The ratio of GPFS Native RAID server to storage JBODs is thus one-to-one. Load on servers increases by 50% when a server fails.
Figure 1-60 GPFS Native RAID server and recovery groups in a ring configuration
Declustered arrays
A declustered array is a subset of the physical disks (pdisks) in a recovery group across which data, redundancy information, and spare space are declustered. The number of disks in a declustered array is determined by the RAID code-width of the VDisks that are housed in the declustered array. One or more declustered arrays can exist per recovery group. Figure 1-61 on page 83 illustrates a storage JBOD with two recovery groups, each with four declustered arrays.
A declustered array can hold one or more VDisks. Because redundancy codes are associated with VDisks, a declustered array can simultaneously contain Reed-Solomon and replicated VDisks.
If the storage JBOD supports multiple disks that are physically mounted together on removable carriers, removal of a carrier temporarily disables access to all of the disks in the carrier. Thus, pdisks on the same carrier must not be in the same declustered array, as VDisk redundancy protection is weakened upon carrier removal.
Declustered arrays are normally created at recovery group creation time but new arrays are created or existing arrays are grown by adding pdisks later.
Figure 1-61 Example of declustered arrays and recovery groups in storage JBOD
Virtual and physical disks
A VDisk is a type of NSD that is implemented by GPFS Native RAID across all the pdisks of a declustered array. Multiple VDisks are defined within a declustered array, typically Reed-Solomon VDisks for GPFS user data and replicated VDisks for GPFS metadata.
Virtual disks
Whether a VDisk of a particular capacity is created in a declustered array depends on its redundancy code, the number of pdisks and equivalent spare capacity in the array, and other small GPFS Native RAID overhead factors. The mmcrvdisk command automatically configures a VDisk of the largest possible size, given a redundancy code and the configured spare space of the declustered array.
In general, the number of pdisks in a declustered array cannot be less than the widest redundancy code of a VDisk plus the equivalent spare disk capacity of a declustered array. For example, a VDisk that uses the 11-strip-wide 8 + 3p Reed-Solomon code requires at least 13 pdisks in a declustered array with the equivalent spare space capacity of two disks. A VDisk that uses the three-way replication code requires at least five pdisks in a declustered array with the equivalent spare capacity of two disks.
VDisks are partitioned into virtual tracks, which are the functional equivalent of a GPFS block. All VDisk attributes are fixed at creation and cannot be altered.
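The sizing rule above can be expressed as a simple check. The following Python sketch (illustrative only, not the mmcrvdisk implementation) verifies that a declustered array has at least as many pdisks as the widest redundancy code plus the array's equivalent spare disk capacity.

def can_create_vdisk(pdisks_in_da, code_width, spare_disk_equivalent):
    """Return True if the declustered array can hold a VDisk of this code width.

    The number of pdisks must be at least the track width of the widest
    redundancy code plus the equivalent spare disk capacity of the array.
    """
    return pdisks_in_da >= code_width + spare_disk_equivalent

# 8 + 3p Reed-Solomon (11-strip track) with 2 disks of spare space needs >= 13 pdisks.
print(can_create_vdisk(13, 11, 2))   # True
print(can_create_vdisk(12, 11, 2))   # False
# Three-way replication (3-strip track) with 2 disks of spare space needs >= 5 pdisks.
print(can_create_vdisk(5, 3, 2))     # True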
Physical disks
A pdisk is used by GPFS Native RAID to store user data and GPFS Native RAID internal configuration data.
A pdisk is either a conventional rotating magnetic-media disk (HDD) or a solid-state disk (SSD). All pdisks in a declustered array must have the same capacity.
Pdisks are also assumed to be dual-ported with one or more paths that are connected to the primary GPFS Native RAID server and one or more paths that are connected to the backup server. Often there are two redundant paths between a GPFS Native RAID server and connected JBOD pdisks.
Solid-state disks
GPFS Native RAID assumes several solid-state disks (SSDs) in each recovery group in order to redundantly log changes to its internal configuration and fast-write data in non-volatile memory, which is accessible from either the primary or backup GPFS Native RAID servers after server failure. A typical GPFS Native RAID log VDisk might be configured as three-way replication over a dedicated declustered array of four SSDs per recovery group.
Disk hospital
The disk hospital is a key feature of GPFS Native RAID that asynchronously diagnoses errors and faults in the storage subsystem. GPFS Native RAID times out an individual pdisk I/O operation after approximately 10 seconds, limiting the effect of a faulty pdisk on a client I/O operation. When a pdisk I/O operation results in a timeout, an I/O error, or a checksum mismatch, the suspect pdisk is immediately admitted into the disk hospital. When a pdisk is first admitted, the hospital determines whether the error was caused by the pdisk or by the paths to it. While the hospital diagnoses the error, GPFS Native RAID, if possible, uses VDisk redundancy codes to reconstruct lost or erased strips for I/O operations that otherwise would use the suspect pdisk.
Health metrics
The disk hospital maintains internal health assessment metrics for each pdisk: time badness, which characterizes response times; and data badness, which characterizes media errors (hard errors) and checksum errors. When either health metric of a pdisk exceeds its threshold, the disk is marked for replacement according to the disk maintenance replacement policy for the declustered array.
The disk hospital logs selected Self-Monitoring, Analysis, and Reporting Technology (SMART) data, including the number of internal sector remapping events for each pdisk.
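To see which pdisks the hospital has flagged, pdisk states can be queried with the mmlspdisk command; for example (the recovery group and pdisk names are hypothetical, and the output format varies by release):

# List only the pdisks that are not in a healthy (ok) state, across all recovery groups
mmlspdisk all --not-ok

# Examine a single suspect pdisk in a specific recovery group
mmlspdisk 000DE37TOP --pdisk c081d1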
Pdisk discovery
GPFS Native RAID discovers all connected pdisks when it starts, and then regularly schedules a process that rediscovers any pdisk that becomes newly accessible to the GPFS Native RAID server. This design allows pdisks to be physically connected, or connection problems to be repaired, without restarting the GPFS Native RAID server.
Disk replacement
The disk hospital tracks disks that require replacement according to the disk replacement policy of the declustered array. The disk hospital is configured to report the need for replacement in various ways. The hospital records and reports the FRU number and physical hardware location of failed disks to help guide service personnel to the correct location with replacement disks.
When multiple disks are mounted on a removable carrier, each of which is a member of a different declustered array, disk replacement requires the hospital to temporarily suspend the other disks in the same carrier. To guard against human error, carriers also cannot be removed until GPFS Native RAID actuates a solenoid-controlled latch. In response to administrative commands, the hospital quiesces the appropriate disks, releases the carrier latch, and turns on the identify lights on the carrier next to the disks that require replacement.
After one or more disks are replaced and the carrier is re-inserted, in response to administrative commands, the hospital verifies that the repair took place. The hospital also automatically adds any new disks to the declustered array, which causes GPFS Native RAID to rebalance the tracks and spare space across all the disks of the declustered array. If service personnel fail to reinsert the carrier within a reasonable period, the hospital declares the disks on the carrier as missing and starts rebuilding the affected data.
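The administrative commands that drive this sequence are options of the mmchcarrier command. The following outline is a hedged sketch; the recovery group and pdisk names are hypothetical:

# Quiesce the disks on the carrier, release the solenoid latch, and light the identify LEDs
mmchcarrier 000DE37TOP --release --pdisk c081d1

# After physically swapping the failed disk and re-inserting the carrier,
# have the hospital verify the repair and bring the new disk into service
mmchcarrier 000DE37TOP --replace --pdisk c081d1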
Two Declustered Arrays/Two Recovery Group
[Figure 1-62 (diagram): front and rear views of the disk enclosure, showing the STOR1 through STOR8 drawer groups with their port cards and DCAs. The pdisks of declustered array 1 form the Blue recovery group and those of declustered array 2 form the Yellow recovery group; SSD positions are marked S. One I/O node is the Blue RG primary / Yellow RG backup server, and the other is the Yellow RG primary / Blue RG backup server.]
Figure 1-62 shows a Two Declustered Array/Two Recovery Group configuration of a disk enclosure. This configuration is referred to as 1/4 populated. The configuration features four SSDs (shown in dark blue in Figure 1-62) in the first recovery group and four SSDs (shown in dark yellow in Figure 1-62) in the second recovery group.
Figure 1-62 Two Declustered Array/Two Recovery Group DE configuration
Four Declustered Arrays/Two Recovery Group
[Figure 1-63 (diagram): the same enclosure layout (STOR1 through STOR8 drawer groups, port cards, DCAs, and two I/O nodes with reciprocal primary/backup roles), with the pdisks alternating between declustered arrays 1 and 3 in the Blue recovery group and between declustered arrays 2 and 4 in the Yellow recovery group; SSD positions are marked S.]
Figure 1-63 shows a Four Declustered Array/Two Recovery Group configuration of a disk enclosure. This configuration is referred to as 1/2 populated. The configuration features four SSDs (shown in dark blue in Figure 1-63) in the first recovery group and four SSDs (shown in dark yellow in Figure 1-63) in the second recovery group.
Figure 1-63 Four Declustered Array/Two Recovery Group DE configuration