
Front cover

IBM Power Systems 775 for AIX and Linux HPC Solution
Unleashes computing power for HPC workloads
Provides architectural solution overview
Contains sample scenarios
Dino Quintero
Kerry Bosworth
Puneet Chaudhary
ByungUn Ha
Jose Higino
Marc-Eric Kahle
Tsuyoshi Kamenoue
James Pearson
Mark Perez
Fernando Pizzano
Robert Simon
Kai Sun
ibm.com/redbooks
International Technical Support Organization
IBM Power Systems 775 for AIX and Linux HPC Solution
October 2012
SG24-8003-00
Note: Before using this information and the product it supports, read the information in “Notices” on page vii.
First Edition (October 2012)
This edition applies to IBM AIX 7.1, xCAT 2.6.6, IBM GPFS 3.4, IBM LoadLeveler, and Parallel Environment Runtime Edition for AIX V1.1.
© Copyright International Business Machines Corporation 2012. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
The team who wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Chapter 1. Understanding the IBM Power Systems 775 Cluster. . . . . . . . . . . . . . . . . . . 1
1.1 Overview of the IBM Power System 775 Supercomputer . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Advantages and new features of the IBM Power 775 . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Hardware information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 POWER7 chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 I/O hub chip. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Collective acceleration unit (CAU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Nest memory management unit (NMMU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.5 Integrated switch router (ISR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.6 SuperNOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.7 Hub module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.8 Memory subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.9 Quad chip module (QCM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.10 Octant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.11 Interconnect levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.12 Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.13 Supernodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.3.14 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.4 Power, packaging and cooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4.1 Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.4.2 Bulk Power and Control Assembly (BPCA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.4.3 Bulk Power Control and Communications Hub (BPCH) . . . . . . . . . . . . . . . . . . . . 43
1.4.4 Bulk Power Regulator (BPR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.4.5 Water Conditioning Unit (WCU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.5 Disk enclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.5.2 High level description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.5.3 Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.6 Cluster management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.6.1 HMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.6.2 EMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.6.3 Service node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.6.4 Server and management networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.6.5 Data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.6.6 LPARs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.6.7 Utility nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.6.8 GPFS I/O nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.7 Typical connection scenario between EMS, HMC, Frame . . . . . . . . . . . . . . . . . . . . . . 58
1.8 Software stack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.8.1 ISNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.8.2 DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.8.3 Extreme Cluster Administration Toolkit (xCAT). . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.8.4 Toolkit for Event Analysis and Logging (TEAL). . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.8.5 Reliable Scalable Cluster Technology (RSCT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.8.6 GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
1.8.7 IBM Parallel Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
1.8.8 LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
1.8.9 ESSL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
1.8.10 Parallel ESSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.8.11 Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.8.12 Parallel Tools Platform (PTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 2. Application integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.1 Power 775 diskless considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
2.1.1 Stateless vs. Statelite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
2.1.2 System access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.2 System capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.3 Application development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.3.1 XL compilers support for POWER7 processors . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.3.2 Advantage for PGAS programming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.3.3 Unified Parallel C (UPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.3.4 ESSL/PESSL optimized for Power 775 clusters . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.4 Parallel Environment optimizations for Power 775 . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.4.1 Considerations for using Host Fabric Interface (HFI) . . . . . . . . . . . . . . . . . . . . . 116
2.4.2 Considerations for data striping with PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.4.3 Confirmation of HFI status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.4.4 Considerations for using Collective Acceleration Unit (CAU) . . . . . . . . . . . . . . . 126
2.4.5 Managing jobs with large numbers of tasks (up to 1024 K) . . . . . . . . . . . . . . . . 129
2.5 IBM Parallel Environment Developer Edition for AIX . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.5.1 Eclipse Parallel Tools Platform (PTP 5.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.5.2 IBM High Performance Computing Toolkit (IBM HPC Toolkit) . . . . . . . . . . . . . . 133
2.6 Running workloads using IBM LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
2.6.1 Submitting jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
2.6.2 Querying and managing jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.6.3 Specific considerations for LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 3. Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.1 Component monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.1.1 LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
3.1.2 General Parallel File System (GPFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.1.3 xCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.1.4 DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.1.5 AIX and Linux systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.1.6 Integrated Switch Network Manager (ISNM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.1.7 Host Fabric Interface (HFI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
3.1.8 Reliable Scalable Cluster Technology (RSCT) . . . . . . . . . . . . . . . . . . . . . . . . . . 203
3.1.9 Compilers environment (PE Runtime Edition, ESSL, Parallel ESSL) . . . . . . . . . 206
3.1.10 Diskless resources (NIM, iSCSI, NFS, TFTP). . . . . . . . . . . . . . . . . . . . . . . . . . 206
3.2 TEAL tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
3.2.1 Configuration (LoadLeveler, GPFS, Service Focal Point, PNSD, ISNM) . . . . . . 211
3.2.2 Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
3.3 Quick health check (full HPC Cluster System) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.3.1 Component analysis location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.3.2 Top to bottom checks direction (software to hardware) . . . . . . . . . . . . . . . . . . . 219
3.3.3 Bottom to top direction (hardware to software) . . . . . . . . . . . . . . . . . . . . . . . . . . 220
3.4 EMS Availability+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.4.1 Simplified failover procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5 Component configuration listing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
3.5.1 LoadLeveler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.5.2 General Parallel File System (GPFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.5.3 xCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.5.4 DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.5 AIX and Linux systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.6 Integrated Switch Network Manager (ISNM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.7 Host Fabric Interface (HFI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.5.8 Reliable Scalable Cluster Technology (RSCT) . . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.5.9 Compilers environment (PE Runtime Edition, ESSL, Parallel ESSL) . . . . . . . . . 234
3.5.10 Diskless resources (NIM, iSCSI, NFS, TFTP). . . . . . . . . . . . . . . . . . . . . . . . . . 234
3.6 Component monitoring examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
3.6.1 xCAT (power management, hardware discovery and connectivity) . . . . . . . . . . 235
3.6.2 Integrated Switch Network Manager (ISNM). . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Chapter 4. Problem determination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.1 xCAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.1.1 xcatdebug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.1.2 Resolving xCAT configuration issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.1.3 Node does not respond to queries or rpower command. . . . . . . . . . . . . . . . . . . 240
4.1.4 Node fails to install . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
4.1.5 Unable to open a remote console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.1.6 Time out errors during network boot of nodes . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.2 ISNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.2.1 Checking the status and recycling the hardware server and the CNM . . . . . . . . 243
4.2.2 Communication issues between CNM and DB2 . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.2.3 Adding hardware connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
4.2.4 Checking FSP status, resolving configuration or communication issues . . . . . . 248
4.2.5 Verifying CNM to FSP connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
4.2.6 Verify that a multicast tree is present and correct . . . . . . . . . . . . . . . . . . . . . . . . 250
4.2.7 Correcting inconsistent topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
4.3 HFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.3.1 HFI health check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.3.2 HFI tools and link diagnostics (resolving down links and miswires) . . . . . . . . . . 254
4.3.3 SMS ping test fails over HFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
4.3.4 netboot over HFI fails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4.3.5 Other HFI issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Chapter 5. Maintenance and serviceability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
5.1 Managing service updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.1.1 Service packs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.1.2 System firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.1.3 Managing multiple operating system (OS) images . . . . . . . . . . . . . . . . . . . . . . . 259
5.2 Power 775 xCAT startup/shutdown procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5.2.1 Startup procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5.2.2 Shutdown procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.3 Managing cluster nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.3.1 Node types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.3.2 Adding nodes to the cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
5.3.3 Removing nodes from a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
5.4 Power 775 availability plus (A+) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.4.1 Advantages of Availability Plus (A+) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.4.2 Considerations for A+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.4.3 Availability Plus (A+) resources in a Power 775 Cluster . . . . . . . . . . . . . . . . . . . 289
5.4.4 How to identify a A+ resource. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.4.5 Availability Plus definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.4.6 Availability plus components and recovery procedures . . . . . . . . . . . . . . . . . . . 292
5.4.7 Hot, warm, cold Policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
5.4.8 A+ QCM move example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
5.4.9 Availability plus non-Compute node overview. . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Appendix A. Serviceable event analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Analyzing a hardware serviceable event that points to an A+ action . . . . . . . . . . . . . . . . . 306
Appendix B. Command outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
GPFS native RAID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

Notices

This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX 5L™, AIX®, BladeCenter®, DB2®, developerWorks®, Electronic Service Agent™, Focal Point™, Global Technology Services®, GPFS™, HACMP™, IBM®, LoadLeveler®, Power Systems™, POWER6+™, POWER6®, POWER7®, PowerPC®, POWER®, pSeries®, Redbooks®, Redbooks (logo)®, RS/6000®, System p®, System x®, Tivoli®
The following terms are trademarks of other companies:
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

Preface

This IBM® Redbooks® publication contains information about the IBM Power Systems™ 775 Supercomputer solution for AIX® and Linux HPC customers. This publication provides details about how to plan, configure, maintain, and run HPC workloads in this environment.
This IBM Redbooks document is targeted to current and future users of the IBM Power Systems 775 Supercomputer (consultants, IT architects, support staff, and IT specialists) responsible for delivering and implementing IBM Power Systems 775 clustering solutions for their enterprise high-performance computing (HPC) applications.

The team who wrote this book

This book was produced by a team of specialists from around the world working at the International Technical Support Organization, Poughkeepsie Center.
Dino Quintero is an IBM Senior Certified IT Specialist with the ITSO in Poughkeepsie, NY. His areas of knowledge include enterprise continuous availability, enterprise systems management, system virtualization, technical computing, and clustering solutions. He is currently an Open Group Distinguished IT Specialist. Dino holds a Master of Computing Information Systems degree and a Bachelor of Science degree in Computer Science from Marist College.
Kerry Bosworth is a Software Engineer in pSeries® Cluster System Test for high-performance computing in Poughkeepsie, New York. Since joining the team four years ago, she worked with the InfiniBand technology on POWER6® AIX, SLES, and Red Hat clusters and the new Power 775 system. She has 12 years of experience at IBM with eight years in IBM Global Services as an AIX Administrator and Service Delivery Manager.
Puneet Chaudhary is a software test specialist with the General Parallel File System team in Poughkeepsie, New York.
Rodrigo Garcia da Silva is a Deep Computing Client Technical Architect at the IBM Systems and Technology Group. He is part of the STG Growth Initiatives Technical Sales Team in Brazil, specializing in High Performance Computing solutions. He has worked at IBM for the past five years and has a total of eight years of experience in the IT industry. He holds a B.S. in Electrical Engineering and his areas of expertise include systems architecture, OS provisioning, Linux, and open source software. He also has a background in intellectual property protection, including publications and a filed patent.
ByungUn Ha is an Accredited IT Specialist and Deep Computing Technical Specialist in Korea. He has over 10 years of experience at IBM and has conducted various HPC projects and HPC benchmarks in Korea. He has supported the Supercomputing Center at KISTI (Korea Institute of Science and Technology Information) on-site for nine years. His areas of expertise include Linux performance and clustering for System X, InfiniBand, AIX on Power systems, and the HPC software stack, including LoadLeveler®, Parallel Environment, ESSL/PESSL, and the C/Fortran compilers. He is a Red Hat Certified Engineer (RHCE) and has a Master’s degree in Aerospace Engineering from Seoul National University. He is currently working in the Deep Computing team, Growth Initiatives, STG in Korea as an HPC Technical Sales Specialist.
Jose Higino is an Infrastructure IT Specialist for AIX/Linux support and services for IBM Portugal. His areas of knowledge include System X, BladeCenter® and Power Systems planning and implementation, management, virtualization, consolidation, and clustering (HPC and HA) solutions. He is currently the only person responsible for Linux support and services in IBM Portugal. He completed the Red Hat Certified Technician level in 2007, became a CiRBA Certified Virtualization Analyst in 2009, and completed certification in KT Resolve methodology as an SME in 2011. José holds a Master of Computers and Electronics Engineering degree from UNL - FCT (Universidade Nova de Lisboa - Faculdade de Ciências e Technologia), in Portugal.
Marc-Eric Kahle is a POWER® Systems Hardware Support specialist at the IBM Global Technology Services® Central Region Hardware EMEA Back Office in Ehningen, Germany. He has worked in the RS/6000®, POWER System, and AIX fields since 1993. He has worked at IBM Germany since 1987. His areas of expertise include POWER Systems hardware and he is an AIX certified specialist. He has participated in the development of six other IBM Redbooks publications.
Tsuyoshi Kamenoue is an Advisory IT Specialist in Power Systems Technical Sales at IBM Japan. He has nine years of experience working on pSeries, System p®, and Power Systems products, especially in the HPC area. He holds a Bachelor’s degree in System Information from the University of Tokyo.
James Pearson has been a Product Engineer for pSeries high-end Enterprise systems and HPC cluster offerings since 1998. He has participated in the planning, test, installation, and ongoing maintenance phases of clustered RISC and pSeries servers for numerous government and commercial customers, beginning with the SP2 and continuing through the current Power 775 HPC solution.
Mark Perez is a customer support specialist servicing IBM Cluster 1600.
Fernando Pizzano is a Hardware and Software Bring-up Team Lead in the IBM Advanced
Clustering Technology Development Lab, Poughkeepsie, New York. He has over 10 years of information technology experience, the last five years in HPC Development. His areas of expertise include AIX, pSeries High Performance Switch, and IBM System p hardware. He holds an IBM certification in pSeries AIX 5L™ System Support.
Robert Simon is a Senior Software Engineer in STG working in Poughkeepsie, New York. He has worked with IBM since 1987. He currently is a Team Leader in the Software Technical Support Group, which supports the High Performance Clustering software (LoadLeveler, CSM, GPFS™, RSCT, and PPE). He has extensive experience with IBM System p hardware, AIX, HACMP™, and high-performance clustering software. He has participated in the development of three other IBM Redbooks publications.
Kai Sun is a Software Engineer in pSeries Cluster System Test for high performance computing at the IBM China System Technology Laboratory, Beijing. Since joining the team in 2011, he has worked with the IBM Power Systems 775 cluster. He has six years of experience with embedded systems on Linux and VxWorks platforms. He was recently given an Eminence and Excellence Award by IBM for his work on the Power Systems 775 cluster. He holds a B.Eng. degree in Communication Engineering from Beijing University of Technology, China, and an M.Sc. degree in Project Management from the New Jersey Institute of Technology, US.
Thanks to the following people for their contributions to this project:
Mark Atkins
IBM Boulder
Robert Dandar
Joseph Demczar
Chulho Kim
John Lewars
John Robb
Hanhong Xue
Gary Mincher
Dave Wootton
Paula Trimble
William Lepera
Joan McComb
Bruce Potter
Linda Mellor
Alison White
Richard Rosenthal
Gordon McPheeters
Ray Longi
Alan Benner
Lissa Valleta
John Lemek
Doug Szerdi
David Lerma
IBM Poughkeepsie
Ettore Tiotto
IBM Toronto, Canada
Wei QQ Qu
IBM China
Phil Sanders
IBM Rochester
Richard Conway
David Bennin
International Technical Support Organization, Poughkeepsie Center

Now you can become a published author, too!

Here’s an opportunity to spotlight your skills, grow your career, and become a published author—all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.
Find out more about the residency program, browse the residency index, and apply online at:
http://www.ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:
• Use the online Contact us review Redbooks form found at:
  http://www.ibm.com/redbooks
• Send your comments in an email to:
  redbooks@us.ibm.com
• Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks

• Find us on Facebook:
  http://www.facebook.com/IBMRedbooks
• Follow us on Twitter:
  http://twitter.com/ibmredbooks
• Look for us on LinkedIn:
  http://www.linkedin.com/groups?home=&gid=2130806
• Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks weekly newsletter:
  https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
• Stay current on recent Redbooks publications with RSS Feeds:
  http://www.redbooks.ibm.com/rss.html
Chapter 1. Understanding the IBM Power Systems 775 Cluster
In this book, we describe the new IBM Power Systems 775 Cluster hardware and software. The chapters of this book provide an overview of the general features of the Power 775 and its hardware and software components. This chapter helps you gain a basic understanding of the concepts behind this cluster.
Application integration and monitoring of a Power 775 cluster are also described in greater detail in this IBM Redbooks publication. LoadLeveler, GPFS, xCAT, and more are documented with examples that give a better view of the complete cluster solution.
Problem determination is also discussed throughout this publication for scenarios that include xCAT configuration issues, the Integrated Switch Network Manager (ISNM), the Host Fabric Interface (HFI), GPFS, and LoadLeveler. These scenarios show how to determine the cause of an error and how to resolve it. This knowledge complements the information in Chapter 5, “Maintenance and serviceability” on page 265.
Some cluster management challenges require intervention such as service updates, xCAT shutdown and startup, node management, and Fail in Place tasks. Additional documents are referenced throughout this book because not every procedure is shown in this publication.
This chapter includes the following topics:
• Overview of the IBM Power System 775 Supercomputer
• Advantages and new features of the IBM Power 775
• Hardware information
• Power, packaging, and cooling
• Disk enclosure
• Cluster management
• Connection scenario between EMS, HMC, and Frame
• High Performance Computing software stack

1.1 Overview of the IBM Power System 775 Supercomputer

For many years, IBM has provided High Performance Computing (HPC) solutions that deliver extreme performance, for example, highly scalable clusters that use AIX and Linux for demanding workloads such as weather forecasting and climate modeling.
The previous IBM Power 575 POWER6 water-cooled cluster showed impressive density and performance. With 32 processors, 32 GB to 256 GB of memory in one central electronic complex (CEC) enclosure or cage, and up to 14 CECs per water-cooled frame, 448 processors per frame were possible. The InfiniBand interconnect provided the cluster with powerful communication channels for the workloads.
The new Power 775 Supercomputer from IBM takes this density to a new height. With 256 POWER7® processor cores at 3.84 GHz and 2 TB of memory per CEC, and up to 12 CECs per frame, a total of 3072 cores and 24 TB of memory per frame is possible. The system is highly scalable: up to 2048 CEC drawers can be clustered together, which yields 524,288 POWER7 cores to solve the most challenging problems. A total of 7.86 TF per CEC and 94.4 TF per rack highlights the capabilities of this high-performance computing solution.
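The peak figures follow directly from the core count and clock rate: each POWER7 core can retire eight double-precision floating-point operations per cycle (four FPUs, each performing a fused operation), as described in 1.4.1, “POWER7 chip”. The short C program that follows is an illustrative sketch (not part of any IBM deliverable) that reproduces the arithmetic behind the numbers quoted above:

/* Peak-performance arithmetic for the Power 775 figures quoted in this section.
 * Assumes 8 double-precision FLOP per core per cycle (4 FPUs x 2 FLOP for a
 * fused operation), as described in 1.4.1, "POWER7 chip".
 */
#include <stdio.h>

int main(void)
{
    const double ghz            = 3.84;   /* core frequency                 */
    const double flop_per_cycle = 8.0;    /* 4 FPUs x 2 (fused operation)   */
    const int    cores_per_cec  = 256;    /* 32 POWER7 chips x 8 cores      */
    const int    cecs_per_frame = 12;
    const int    max_cecs       = 2048;   /* maximum cluster size           */

    double tf_per_cec   = cores_per_cec * ghz * flop_per_cycle / 1000.0;
    double tf_per_frame = tf_per_cec * cecs_per_frame;

    printf("Peak per CEC drawer : %.2f TFLOPS\n", tf_per_cec);      /* ~7.86  */
    printf("Peak per frame      : %.1f TFLOPS\n", tf_per_frame);    /* ~94.4  */
    printf("Cores at full scale : %d\n", max_cecs * cores_per_cec); /* 524288 */
    return 0;
}

Compiled and run with any C compiler (for example, xlc or gcc), the program prints the per-drawer, per-frame, and full-system figures that are used throughout this chapter.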
The hardware is only as good as the software that runs on it. IBM AIX, IBM Parallel Environment (PE) Runtime Edition, LoadLeveler, GPFS, and xCAT are a few of the components in the supported software stack for the solution. For more information, see 1.9, “High Performance Computing software stack” on page 62.

1.2 The IBM Power 775 cluster components

The IBM Power 775 can consist of the following components:
• Compute subsystem:
  – Diskless nodes dedicated to perform computational tasks
  – Customized operating system (OS) images
  – Applications
• Storage subsystem:
  – I/O nodes (diskless)
  – OS images for I/O nodes
  – SAS adapters attached to the Disk Enclosures (DE)
  – General Parallel File System (GPFS)
• Management subsystem:
  – Executive Management Server (EMS)
  – Login node
  – Utility node
• Communication subsystem:
  – Host Fabric Interface (HFI):
    • Busses from processor modules to the switching hub in an octant
    • Local links (LL-links) between octants
    • Local remote links (LR-links) between drawers in a SuperNode
    • Distance links (D-links) between SuperNodes
  – Operating system drivers
  – IBM User space protocol
  – AIX and Linux IP drivers
Octants, SuperNodes, and other components are described in other sections of this book.
• Node types
The following node types provide different functions to the cluster. In the context of the 9125-F2C drawer, a node is an OS image that is booted in an LPAR. There are three general designations for node types on the 9125-F2C. Often these functions are dedicated to a node, but a node can have multiple roles:
– Compute nodes
Compute nodes run parallel jobs and perform the computational functions. These nodes are diskless and booted across the HFI network from a Service Node. Most of the nodes are compute nodes.
– IO nodes
These nodes are attached to either the Disk Enclosure in the physical cluster or external storage. These nodes serve the file system to the rest of the cluster.
– Utility Nodes
A Utility node offers services to the cluster. These nodes often feature additional resources, such as external Ethernet adapters and external or internal storage. The following Utility nodes are required:
• Service nodes: Runs xCAT to serve the operating system to local diskless nodes
• Login nodes: Provides a centralized login to the cluster
– Optional utility node:
• Tape subsystem server
Important: xCAT stores all system definitions as node objects, including the required EMS console and the HMC console. However, the consoles are external to the 9125-F2C cluster and are not referred to as cluster nodes. The HMC and EMS consoles are physically running on specific, dedicated servers. The HMC runs on a System x® based machine (7042 or 7310) and the EMS runs on a POWER 750 Server. For more information, see
1.7.1, “Hardware Management Console” on page 53 and 1.7.2, “Executive Management Server” on page 53.
For more information, see this website:
http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/topic/p7had/p7had_775x.pdf

1.3 Advantages and new features of the IBM Power 775

The IBM Power Systems 775 (9125-F2C) has several new features that make this system even more reliable, available, and serviceable.
Fully redundant power, cooling and management, dynamic processor de-allocation and memory chip & lane sparing, and concurrent maintenance are the main reliability, availability, and serviceability (RAS) features.
The system is water-cooled, which provides 100% heat capture: some components are cooled by small fans, but the Rear Door Heat Exchanger captures this heat as well.
Because most of the nodes are diskless nodes, the service nodes provide the operating system to the diskless nodes. The HFI network also is used to boot the diskless utility nodes.
The Power 775 Availability Plus (A+) feature allows immediate recovery from processor, switching hub, and HFI cable failures because additional resources are available in the system. Failed resources remain in place, and no hardware must be replaced until a specified threshold is reached. For more information, see 5.4, “Power 775 Availability Plus” on page 297.
The IBM Power 775 cluster solution provides High Performance Computing clients with the following benefits:
• Sustained performance and low energy consumption for climate modeling and forecasting
• Massive scalability for cell and organism process analysis in life sciences
• Memory capacity for high-resolution simulations in nuclear resource management
• Space and energy efficiency for risk analytics and real-time trading in financial services

1.4 Hardware information

This section provides detailed information about the hardware components of the IBM Power
775. Within this section, there are links to IBM manuals and external sources for more information.

1.4.1 POWER7 chip

The IBM Power System 775 implements the POWER7 processor technology. The PowerPC® Architecture POWER7 processor is designed for use in servers that provide solutions with large clustered systems, as shown in Figure 1-1 on page 5.
Figure 1-1 POWER7 chip block diagram (eight cores with their L2 and L3 caches, two memory controllers with buffered DRAM channels, the PowerBus fabric, and the QCM and hub bus connections)
IBM POWER7 characteristics
This section provides a description of the following characteristics of the IBM POWER7 chip, as shown in Figure 1-1:
• 246 GFLOPs per chip:
  – Up to eight cores per chip
  – Four floating point units (FPU) per core
  – Two FLOPs per cycle (fused operation)
  – 246 GFLOPs = 8 cores x 3.84 GHz x 4 FPU x 2
• 32 KB instruction and 32 KB data cache per core
• 256 KB L2 cache per core
• 4 MB L3 cache per core
• Eight channels of SuperNova buffered DIMMs:
  – Two memory controllers per chip
  – Four memory busses per memory controller (1 B wide write, 2 B wide read each)
• CMOS 12S SOI, 11 levels of metal
• Die size: 567 mm²
Architecture
• PowerPC architecture
• IEEE New P754 floating point compliant
• Big endian, little endian, strong byte ordering support extension
• 46-bit real addressing, 68-bit virtual addressing
• Off-chip bandwidth: 336 GBps (local plus remote interconnect)
• Memory capacity: up to 128 GB per chip
• Memory bandwidth: 128 GBps peak per chip
C1 core and cache
• 8 C1 processor cores per chip
• 2 FX, 2 LS, 4 DPFP, 1 BR, 1 CR, 1 VMX, 1 DFP execution units per core
• 4-way SMT, out-of-order (OoO) execution
• 112x2 GPR and 172x2 VMX/VSX/FPR renames
PowerBus On-Chip Intraconnect
• 1.9 GHz frequency
• Eight 16-B data buses, 2 address snoop, 21 on/off ramps
• Asynchronous interface to chiplets and off-chip interconnect
Differential memory controllers (2)
• 6.4 GHz interface to SuperNova (SN) buffer chips
• DDR3 support, max 1067 MHz
• Minimum memory: 2 channels, 1 SN per channel
• Maximum memory: 8 channels x 1 SN per channel
• 2 ports per SuperNova
• 8 ranks per port
• x8b and x4b devices supported
PowerBus Off-Chip Interconnect
• 1.5 to 2.9 Gbps single-ended EI-3
• 2 spare bits per bus
• Maximum 256-way SMP
• 32-way optimal scaling
• Four 8-B intranode buses (W, X, Y, or Z)
• All buses run at the same bit rate
• All buses are capable of running as a single 4-B interface; the location of the 4-B interface within the 8 B is fixed
• Hub chip attaches via W, X, Y, or Z
• Three 8-B internode buses (A, B, C)
• C bus is multiplexed with GX and operates only as an aggregate data bus (for example, address and command traffic is not supported)
Buses
Table 1-1 describes the POWER7 busses.
Table 1-1 POWER7 busses
Bus name     Width (speed)                                           Connects                       Function
W, X, Y, Z   8B+8B with 2 extra bits per bus (3 Gbps)                Intranode processors and hub   Used for address and data
A, B         8B+8B with 2 extra bits per bus (3 Gbps)                Other nodes within drawer      Data only
C            8B+8B with 2 extra bits per bus (3 Gbps)                Other nodes within drawer      Data only, multiplexed with GX
Mem1-Mem8    2B Read + 1B Write with 2 extra bits per bus (2.9 GHz)  Processor to memory
WXYZABC Busses
The off-chip PowerBus supports up to seven coherent SMP links (WXYZABC) that use Elastic Interface 3 (EI-3) signaling at up to 3 Gbps. The intranode WXYZ links connect up to four processor chips to form a 32-way node and connect a hub chip to each processor. The WXYZ links carry coherency traffic and data and are interchangeable as intranode processor links or hub links. The internode AB links connect up to two nodes per processor chip. The AB links carry coherency traffic and data and are interchangeable with each other. The AB links also can be configured as aggregate data-only links. The C link is configured only as a data-only link.
All seven coherent SMP links (WXYZABC) can be configured as 8 bytes or 4 bytes in width.
The WXYZABC busses include the following features:
• Four (WXYZ) 8-B or 4-B EI-3 intranode links
• Two (AB) 8-B or 4-B EI-3 internode links, or two (AB) 8-B or 4-B EI-3 data-only links
• One (C) 8-B or 4-B EI-3 data-only link
PowerBus
The PowerBus is responsible for coherent and non-coherent memory access, IO operations, interrupt communication, and system controller communication. The PowerBus provides all of the interfaces, buffering, and sequencing of command and data operations within the storage subsystem. The POWER7 chip has up to seven PowerBus links that are used to connect to other POWER7 chips, as shown in Figure 1-2 on page 8.
The PowerBus link is an 8-Byte-wide (or optional 4-Byte-wide), split-transaction, multiplexed, command and data bus that supports up to 32 POWER7 chips. The bus topology is a multitier, fully connected topology to reduce latency, increase redundancy, and improve concurrent maintenance. Reliability is improved with ECC on the external I/Os.
Data transactions are always sent along a unique point-to-point path. A route tag travels with the data to help routing decisions along the way. Multiple data links are supported between chips that are used to increase data bandwidth.
Figure 1-2 POWER7 chip layout (cores with L2/L3 caches, memory controllers and memory interfaces, EI-3 PHYs, and the W/X/Y/Z, A/B/C, and GX bus interfaces)
Figure 1-3 on page 9 shows the POWER7 core structure.
Figure 1-3 Microprocessor core structural diagram
Reliability, availability, and serviceability features
The microprocessor core includes the following reliability, availability, and serviceability (RAS) features:
• POWER7 core:
  – Instruction retry for soft core logic errors
  – Alternate processor recovery for detected hard core errors
  – Processor limited checkstop for other errors
  – Protection key support for AIX
• L1 I/D cache error recovery and handling:
  – Instruction retry for soft errors
  – Alternate processor recovery for hard errors
  – Guarding of the core for core and L1/L2 cache errors
• L2 cache:
  – ECC on L2 and directory tags
  – Line delete for L2 and directory tags (seven lines)
  – L2 UE handling includes purge and refetch of unmodified data
  – Predictive dynamic guarding of associated cores
• L3 cache:
  – ECC on data
  – Line delete mechanism for data (seven lines)
  – L3 UE handling includes purge and refetch of unmodified data
  – Predictive dynamic guarding of associated cores for CEs in L3 not managed by the line deletion

1.4.2 I/O hub chip

This section provides information about the IBM Power 775 I/O hub chip (or Torrent chip), as shown in Figure 1-4.
Figure 1-4 Hub chip (Torrent) (EI-3 and differential PHYs, the LL copper links between hubs in a drawer, the LR optical links that interconnect the four drawers of a supernode, the D optical links that interconnect supernodes, and the PCI Express interfaces)
Host fabric interface
The host fabric interface (HFI) provides a non-coherent interface between a quad-chip module (QCM), which is composed of four POWER7 chips, and the cluster network.
Figure 1-5 on page 11 shows two instances of the HFI in a hub chip. The HFIs also attach to the Collective Acceleration Unit (CAU).
Each HFI has one PowerBus command and four PowerBus data interfaces, which feature the following configuration:
1. The PowerBus directly connects to the processors and memory controllers of four POWER7 chips via the WXYZ links.
2. The PowerBus also indirectly coherently connects to other POWER7 chips within a
256-way drawer via the LL links. Although fully supported by the HFI hardware, this path provides reduced performance.
3. Each HFI has four ports to the Integrated Switch Router (ISR). The ISR connects to other hub chips through the D, LL, and LR links.
4. ISRs and D, LL, and LR links that interconnect the hub chips form the cluster network.
POWER7 chips: The set of four POWER7 chips (a QCM), its associated memory, and a hub chip form the building block for cluster systems. A Power 775 system consists of multiple building blocks that are connected to each other via the cluster network.
Figure 1-5 HFI attachment scheme
Packet processing
The HFI is the interface between the POWER7 chip quads and the cluster network, and is responsible for moving data between the PowerBus and the ISR. The data is in various formats, but packets are processed in the following manner:
• Send
  – Pulls or receives data from PowerBus-attached devices in a POWER7 chip
  – Translates the data into network packets
  – Injects the network packets into the cluster network via the ISR
• Receive
  – Receives network packets from the cluster network via the ISR
  – Translates them into transactions
  – Pushes the transactions to PowerBus-attached devices in a POWER7 chip
• Packet ordering
  – The HFIs and cluster network provide no ordering guarantees among packets. Packets that are sent from the same source window and node to the same destination window and node might reach the destination in a different order.
Figure 1-6 shows two HFIs cooperating to move data from devices that are attached to one PowerBus to devices attached to another PowerBus through the Cluster Network.
Figure 1-6 HFI moving data from one quad to another quad
HFI paths: The path between any two HFIs might be indirect, thus requiring multiple hops through intermediate ISRs.

1.4.3 Collective acceleration unit

The hub chip provides specialized hardware that is called the Collective Acceleration Unit (CAU) to accelerate frequently used collective operations.
Collective operations
Collective operations are distributed operations that operate across a tree. Many HPC applications perform collective operations with the application that make forward progress after every compute node that completed its contribution and after the results of the collective operation are delivered back to every compute node (for example, barrier synchronization, and global sum).
A specialized arithmetic-logic unit (ALU) within the CAU implements barrier and reduction operations. For reduce operations, the ALU supports the following operations and data types:
򐂰 Fixed point: NOP, SUM, MIN, MAX, OR, AND, XOR (signed and unsigned)
򐂰 Floating point: MIN, MAX, SUM, PROD (single and double precision)
There is one CAU in each hub chip, which is one CAU per four POWER7 chips, or one CAU per 32 C1 cores.
Software organizes the CAUs in the system into collective trees. For a broadcast operation, the arrival of an input on one link causes its forwarding on all other links. For a reduce operation, arrivals on all but one link cause the reduction result to be forwarded on the remaining link.
A link in a CAU tree can map to a path composed of more than one link in the cluster network. The system supports many trees simultaneously, and each CAU supports 64 independent trees.
The usage of sequence numbers and a retransmission protocol enables reliability and pipelining. Each tree has only one participating HFI window on any involved node. The order in which the reduction operation is evaluated is preserved from one run to another, which benefits programming models that allow programmers to require that collective operations are executed in a particular order, such as MPI.
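The forwarding rules can be summarized in a small conceptual sketch: for a broadcast, an input that arrives on one link of a tree vertex is replicated on all other links; for a reduce, the vertex waits for arrivals on all but one link, combines the inputs, and forwards the result on the remaining link. The C sketch below models only this behavior; the data structures, fan-out, and sequence-number handling are illustrative simplifications, not the CAU hardware design.

#include <stdio.h>

#define MAX_LINKS 8                      /* illustrative tree fan-in/out */

struct cau_node {
    int      nlinks;                     /* links of this tree vertex    */
    int      arrived[MAX_LINKS];         /* reduce inputs seen so far    */
    double   value[MAX_LINKS];
    unsigned seq;                        /* sequence number used by the
                                            retransmission protocol      */
};

/* Broadcast: an input on link 'in' is forwarded on every other link. */
void cau_broadcast(struct cau_node *n, int in, double v)
{
    for (int l = 0; l < n->nlinks; l++)
        if (l != in)
            printf("seq %u: forward %.1f on link %d\n", n->seq, v, l);
    n->seq++;
}

/* Reduce (SUM): when all links except 'out' have arrived, the combined
 * result is forwarded on the remaining link 'out'. */
void cau_reduce_sum(struct cau_node *n, int in, double v, int out)
{
    n->arrived[in] = 1;
    n->value[in]   = v;

    int ready = 1;
    double sum = 0.0;
    for (int l = 0; l < n->nlinks; l++) {
        if (l == out)
            continue;
        if (!n->arrived[l]) { ready = 0; break; }
        sum += n->value[l];
    }
    if (ready) {
        printf("seq %u: forward sum %.1f on link %d\n", n->seq, sum, out);
        for (int l = 0; l < n->nlinks; l++)
            n->arrived[l] = 0;
        n->seq++;
    }
}

int main(void)
{
    struct cau_node n = { .nlinks = 3 };
    cau_broadcast(&n, 0, 42.0);          /* replicate to links 1 and 2   */
    cau_reduce_sum(&n, 0, 1.0, 2);       /* wait: link 1 still missing   */
    cau_reduce_sum(&n, 1, 2.0, 2);       /* ready: forward 3.0 on link 2 */
    return 0;
}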
Packet propagation
As shown in Figure 1-7 on page 14, a CAU receives packets from the following sources:
򐂰 The memory of a remote node, inserted into the cluster network by the HFI of the remote node
򐂰 The memory of a local node, inserted into the cluster network by the HFI of the local node
򐂰 A remote CAU
Figure 1-7 CAU packets received by CAU
As shown in Figure 1-8 on page 15, a CAU sends packets to the following locations:
򐂰 The memory of a remote node that is written to memory by the HFI of the remote node
򐂰 The memory of a local node that is written to memory by the HFI of the local node
򐂰 A remote CAU
Figure 1-8 CAU packets sent by CAU

1.3.4 Nest memory management unit (NMMU)

The Nest Memory Management Unit (NMMU) in the hub chip facilitates user-level code that operates on the address space of processes that execute on other compute nodes. The NMMU enables user-level code to create a global address space on which it performs operations. This facility is called global shared memory.

A process that executes on a compute node registers its address space, thus permitting interconnect packets to manipulate the registered shared region directly. The NMMU references a page table that maps effective addresses to real memory. The hub chip also maintains a cache of the mappings and can map the entire real memory of most installations.

Incoming interconnect packets that reference memory, such as RDMA packets and packets that perform atomic operations, contain an effective address and information that pinpoints the context in which to translate the effective address. This feature greatly facilitates global address space languages, such as Unified Parallel C (UPC), Co-Array Fortran, and X10, by permitting such packets to contain easy-to-use effective addresses.
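Conceptually, an incoming RDMA or atomic-operation packet carries an effective address plus a context identifier, and the NMMU resolves a real address through the page table that is registered for that context. The following C sketch models that lookup with an invented single-level table and a 64 KB page size chosen only for illustration; it does not reflect the actual NMMU page-table format.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 16                      /* illustrative 64 KB pages         */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NPAGES     16                      /* tiny illustrative table          */

/* One registered context: maps effective page numbers to real pages. */
struct context {
    uint64_t real_page[NPAGES];            /* real address of each page        */
};

/* Hypothetical incoming packet fields used for translation. */
struct rdma_pkt {
    uint32_t context_id;                   /* pinpoints the target process     */
    uint64_t effective_addr;               /* address in that process's space  */
};

static struct context ctx_table[4];        /* registered contexts (simplified) */

/* Resolve the packet's effective address to a real address. */
uint64_t nmmu_translate(const struct rdma_pkt *p)
{
    const struct context *c = &ctx_table[p->context_id];
    uint64_t page   = p->effective_addr >> PAGE_SHIFT;
    uint64_t offset = p->effective_addr & (PAGE_SIZE - 1);
    return c->real_page[page % NPAGES] + offset;
}

int main(void)
{
    ctx_table[1].real_page[2] = 0x2000000000ULL;   /* fake mapping for the demo */
    struct rdma_pkt p = { .context_id = 1,
                          .effective_addr = 2 * PAGE_SIZE + 0x100 };
    printf("real address = 0x%llx\n",
           (unsigned long long)nmmu_translate(&p));
    return 0;
}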

1.3.5 Integrated switch router (ISR)

The integrated switch router (ISR) replaces the external switching and routing functions that are used in prior networks. The ISR is designed to dramatically reduce cost and improve performance in bandwidth and latency.
A direct graph network topology connects up to 65,536 POWER7 eight-core processor chips with a two-level routing hierarchy of L and D busses.
Each hub chip ISR connects to four POWER7 chips via the HFI controller and the W busses. The Torrent hub chip and its four POWER7 chips are called an octant. Each ISR octant is directly connected to seven other octants in a drawer via the wide on-planar L-Local busses and to 24 other octants in three more drawers via the optical L-Remote busses.

A Supernode is the fully interconnected collection of 32 octants in four drawers. Up to 512 Supernodes are fully connected via the 16 optical D busses per hub chip. The ISR is designed to support smaller systems with multiple D busses between Supernodes for higher bandwidth and performance.
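These counts are consistent with the 65,536-chip maximum that is cited earlier in this section: four POWER7 chips per octant, eight octants per drawer, four drawers per Supernode, and up to 512 Supernodes. The following small C program is a quick check of that arithmetic:

#include <stdio.h>

int main(void)
{
    const int chips_per_octant   = 4;    /* QCM: four POWER7 chips per hub   */
    const int octants_per_drawer = 8;    /* 1 octant + 7 LL-connected peers  */
    const int drawers_per_sn     = 4;    /* LR links reach 3 more drawers    */
    const int max_supernodes     = 512;  /* fully connected via D busses     */

    int octants_per_sn = octants_per_drawer * drawers_per_sn;        /* 32     */
    long max_chips = (long)max_supernodes * octants_per_sn
                     * chips_per_octant;                             /* 65,536 */

    printf("octants per Supernode: %d\n", octants_per_sn);
    printf("maximum POWER7 chips : %ld\n", max_chips);
    return 0;
}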
The ISR logically contains input and output buffering, a full crossbar switch, hierarchical route tables, link protocol framers/controllers, interface controllers (HFI and PB data), Network Management registers and controllers, and extensive RAS logic that includes link replay buffers.
The Integrated Switch Router supports the following features:
򐂰 Target cycle time up to 3 GHz
򐂰 Target switch latency of 15 ns
򐂰 Target GUPS: ~21 K, with ISR-assisted GUPS handling at all intermediate hops (not software)
򐂰 Target switch crossbar bandwidth greater than 1 TB per second input and output (the per-bus aggregates are reproduced in the sketch after this feature list):
– 96 Gbps WXYZ busses (4 @ 24 Gbps) from POWER7 chips (unidirectional)
– 168 Gbps local L busses (7 @ 24 Gbps) between octants in a drawer (unidirectional)
– 144 Gbps optical L busses (24 @ 6 Gbps) to other drawers (unidirectional)
– 160 Gbps D busses (16 @ 10 Gbps) to other Supernodes (unidirectional)
򐂰 Two-tiered full-graph network
򐂰 Virtual Channels for deadlock prevention
򐂰 Cut-through wormhole routing
򐂰 Routing options:
– Full hardware routing
– Software-controlled indirect routing by using hardware route tables
򐂰 Multiple indirect routes that are supported for data striping and failover
򐂰 Multiple direct routes by using LR and D links, supported for less than a full-up system
򐂰 Maximum supported packet size of 2 KB; packet size varies from 1 to 16 flits, with each flit being 128 bytes
򐂰 Routing algorithms:
– Round Robin: direct and indirect routes
– Random: indirect routes only
򐂰 IP Multicast with central buffer and route table, supporting 256-byte or 2 KB packets
򐂰 Global Hardware Counter implementation and support, including link latency counts
򐂰 LCRC on L and D busses with link-level retry support for handling transient errors, including error thresholds
򐂰 ECC on local L and W busses, internal arrays, and busses, including Fault Isolation Register and Control Checker support
򐂰 Performance Counter and Trace Debug support
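The per-bus bandwidth figures in the crossbar item above follow directly from the link counts and per-link rates. The following small C program reproduces that arithmetic; the bus names are descriptive labels for this sketch only.

#include <stdio.h>

int main(void)
{
    /* link count and per-link rate (Gbps), unidirectional, from the list above */
    struct bus { const char *name; int links; int gbps; } busses[] = {
        { "WXYZ busses to POWER7 chips",   4, 24 },
        { "L-Local busses within drawer",  7, 24 },
        { "L-Remote busses to drawers",   24,  6 },
        { "D busses to other Supernodes", 16, 10 },
    };

    for (int i = 0; i < 4; i++)
        printf("%-30s %2d @ %2d Gbps = %3d Gbps\n",
               busses[i].name, busses[i].links, busses[i].gbps,
               busses[i].links * busses[i].gbps);
    return 0;
}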