IBM Power Systems 775 for AIX and Linux HPC Solution
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX 5L™
AIX®
BladeCenter®
DB2®
developerWorks®
Electronic Service Agent™
Focal Point™
Global Technology Services®
GPFS™
HACMP™
IBM®
LoadLeveler®
Power Systems™
POWER6+™
POWER6®
POWER7®
PowerPC®
POWER®
pSeries®
Redbooks®
Redbooks (logo)®
RS/6000®
System p®
System x®
Tivoli®
The following terms are trademarks of other companies:
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other
countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redbooks® publication contains information about the IBM Power Systems™ 775
Supercomputer solution for AIX® and Linux HPC customers. This publication provides details
about how to plan, configure, maintain, and run HPC workloads in this environment.
This IBM Redbooks document is targeted to current and future users of the IBM Power
Systems 775 Supercomputer (consultants, IT architects, support staff, and IT specialists)
responsible for delivering and implementing IBM Power Systems 775 clustering solutions for
their enterprise high-performance computing (HPC) applications.
The team who wrote this book
This book was produced by a team of specialists from around the world working at the
International Technical Support Organization, Poughkeepsie Center.
Dino Quintero is an IBM Senior Certified IT Specialist with the ITSO in Poughkeepsie, NY.
His areas of knowledge include enterprise continuous availability, enterprise systems
management, system virtualization, technical computing, and clustering solutions. He is
currently an Open Group Distinguished IT Specialist. Dino holds a Master of Computing
Information Systems degree and a Bachelor of Science degree in Computer Science from
Marist College.
Kerry Bosworth is a Software Engineer in pSeries® Cluster System Test for
high-performance computing in Poughkeepsie, New York. Since joining the team four years
ago, she has worked with the InfiniBand technology on POWER6® AIX, SLES, and Red Hat
clusters and the new Power 775 system. She has 12 years of experience at IBM, including eight
years in IBM Global Services as an AIX Administrator and Service Delivery Manager.
Puneet Chaudhary is a software test specialist with the General Parallel File System team in
Poughkeepsie, New York.
Rodrigo Garcia da Silva is a Deep Computing Client Technical Architect at the IBM Systems
and Technology Group. He is part of the STG Growth Initiatives Technical Sales Team in
Brazil, specializing in High Performance Computing solutions. He has worked at IBM for the
past five years and has a total of eight years of experience in the IT industry. He holds a B.S.
in Electrical Engineering and his areas of expertise include systems architecture, OS
provisioning, Linux, and open source software. He also has a background in intellectual
property protection, including publications and a filed patent.
ByungUn Ha is an Accredited IT Specialist and Deep Computing Technical Specialist in
Korea. He has over 10 years of experience at IBM and has conducted various HPC projects and
HPC benchmarks in Korea. He has supported the Supercomputing Center at KISTI (Korea
Institute of Science and Technology Information) on-site for nine years. His areas of expertise
include Linux performance and clustering for System x, InfiniBand, AIX on Power systems, and
the HPC software stack, including LoadLeveler®, Parallel Environment, ESSL/PESSL, and the
C/Fortran compilers. He is a Red Hat Certified Engineer (RHCE) and has a Master’s degree in
Aerospace Engineering from Seoul National University. He is currently working in the Deep
Computing team, Growth Initiatives, STG in Korea as an HPC Technical Sales Specialist.
Jose Higino is an Infrastructure IT Specialist for AIX/Linux support and services for IBM
Portugal. His areas of knowledge include System x, BladeCenter® and Power Systems
planning and implementation, management, virtualization, consolidation, and clustering (HPC
and HA) solutions. He is currently the only person responsible for Linux support and services
in IBM Portugal. He completed the Red Hat Certified Technician level in 2007, became a
CiRBA Certified Virtualization Analyst in 2009, and completed certification in the KT Resolve
methodology as an SME in 2011. José holds a Master of Computers and Electronics
Engineering degree from UNL - FCT (Universidade Nova de Lisboa - Faculdade de Ciências
e Tecnologia), in Portugal.
Marc-Eric Kahle is a POWER® Systems Hardware Support specialist at the IBM Global
Technology Services® Central Region Hardware EMEA Back Office in Ehningen, Germany.
He has worked in the RS/6000®, POWER System, and AIX fields since 1993. He has worked
at IBM Germany since 1987. His areas of expertise include POWER Systems hardware and
he is an AIX certified specialist. He has participated in the development of six other IBM
Redbooks publications.
Tsuyoshi Kamenoue is an Advisory IT Specialist in Power Systems Technical Sales in IBM
Japan. He has nine years of experience of working on pSeries, System p®, and Power
Systems products, especially in the HPC area. He holds a Bachelor’s degree in System
Information from the University of Tokyo.
James Pearson has been a Product Engineer for pSeries high-end Enterprise systems and HPC
cluster offerings since 1998. He has participated in the planning, test, installation, and
on-going maintenance phases of clustered RISC and pSeries servers for numerous
government and commercial customers, beginning with the SP2 and continuing through the
current Power 775 HPC solution.
Mark Perez is a customer support specialist servicing IBM Cluster 1600.
Fernando Pizzano is a Hardware and Software Bring-up Team Lead in the IBM Advanced
Clustering Technology Development Lab, Poughkeepsie, New York. He has over 10 years of
information technology experience, the last five years in HPC Development. His areas of
expertise include AIX, pSeries High Performance Switch, and IBM System p hardware. He
holds an IBM certification in pSeries AIX 5L™ System Support.
Robert Simon is a Senior Software Engineer in STG working in Poughkeepsie, New York. He
has worked with IBM since 1987. He currently is a Team Leader in the Software Technical
Support Group, which supports the High Performance Clustering software (LoadLeveler,
CSM, GPFS™, RSCT, and PPE). He has extensive experience with IBM System p hardware,
AIX, HACMP™, and high-performance clustering software. He has participated in the
development of three other IBM Redbooks publications.
Kai Sun is a Software Engineer in pSeries Cluster System Test for high performance
computing in the IBM China System Technology Laboratory, Beijing. Since joining the team in
2011, he has worked with the IBM Power Systems 775 cluster. He has six years of
experience with embedded systems on Linux and VxWorks platforms. He has recently been given
an Eminence and Excellence Award by IBM for his work on the Power Systems 775 cluster. He
holds a B.Eng. degree in Communication Engineering from Beijing University of Technology,
China. He has an M.Sc. degree in Project Management from the New Jersey Institute of
Technology, US.
Thanks to the following people for their contributions to this project:
Mark Atkins
IBM Boulder
Robert Dandar
Joseph Demczar
Chulho Kim
John Lewars
John Robb
Hanhong Xue
Gary Mincher
Dave Wootton
Paula Trimble
William Lepera
Joan McComb
Bruce Potter
Linda Mellor
Alison White
Richard Rosenthal
Gordon McPheeters
Ray Longi
Alan Benner
Lissa Valleta
John Lemek
Doug Szerdi
David Lerma
IBM Poughkeepsie
Ettore Tiotto
IBM Toronto, Canada
Wei QQ Qu
IBM China
Phil Sanders
IBM Rochester
Richard Conway
David Bennin
International Technical Support Organization, Poughkeepsie Center
Now you can become a published author, too!
Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.
Find out more about the residency program, browse the residency index, and apply online at:
http://www.ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
http://www.ibm.com/redbooks
Send your comments in an email to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Stay connected to IBM Redbooks
Find us on Facebook:
http://www.facebook.com/IBMRedbooks
Follow us on Twitter:
http://twitter.com/ibmredbooks
Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter.
Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
Chapter 1. Understanding the IBM Power Systems 775 Cluster
In this book, we describe the new IBM Power Systems 775 Cluster hardware and software.
The chapters provide an overview of the general features of the Power 775 and its hardware
and software components. This chapter helps you get a basic understanding and concept of
this cluster.
Application integration and monitoring of a Power 775 cluster is also described in greater
detail in this IBM Redbooks publication. LoadLeveler, GPFS, xCAT, and more are
documented with some examples to get a better view on the complete cluster solution.
Problem determination is also discussed throughout this publication for different scenarios
that include xCAT configuration issues, Integrated Switch Network Manager (ISNM), Host
Fabric Interface (HFI), GPFS, and LoadLeveler. These scenarios show the flow of how to
determine the cause of the error and how to solve the error. This knowledge complements the
information in Chapter 5, “Maintenance and serviceability” on page 265.
Some cluster management challenges might need intervention that requires service updates,
xCAT shutdown/startup, node management, and Fail in Place tasks. Documents that are
available are referenced in this book because not everything is shown in this publication.
This chapter includes the following topics:
Overview of the IBM Power System 775 Supercomputer
Advantages and new features of the IBM Power 775
Hardware information
Power, packaging, and cooling
Disk enclosure
Cluster management
Connection scenario between EMS, HMC, and Frame
High Performance Computing software stack
1.1 Overview of the IBM Power System 775 Supercomputer
For many years, IBM has provided High Performance Computing (HPC) solutions that deliver
extreme performance; for example, highly scalable AIX and Linux clusters for demanding
workloads, including weather forecasting and climate modeling.
The previous IBM Power 575 POWER6 water-cooled cluster showed impressive density and
performance. With 32 processors and 32 GB to 256 GB of memory in one central electronic
complex (CEC) enclosure or cage, and up to 14 CECs per water-cooled frame, 448
processors per frame were possible.
The InfiniBand interconnect provided the cluster with powerful communication channels for
the workloads.
The new Power 775 Supercomputer from IBM takes this density to a new height. With 256
3.84 GHz POWER7® processor cores, 2 TB of memory per CEC, and up to 12 CECs per frame, a
total of 3072 processor cores and 24 TB of memory per frame is possible. The system is highly
scalable: clustering 2048 CEC drawers together yields up to 524,288 POWER7 cores to
do the work to solve the most challenging problems. A total of 7.86 TF per CEC and 94.4 TF
per rack highlights the capabilities of this high-performance computing solution.
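As a quick cross-check of these peak figures, the following small C program (an illustration
only; every input value is taken from the numbers quoted in this section) recomputes the
per-chip, per-CEC, per-rack, and maximum system values.

#include <stdio.h>

int main(void)
{
    const double ghz = 3.84;          /* POWER7 clock rate quoted in this section */
    const int cores_per_chip = 8;     /* eight cores per POWER7 chip              */
    const int flops_per_core = 4 * 2; /* four FPUs per core, two FLOPs per cycle  */
    const int chips_per_cec = 32;     /* 256 cores per CEC / 8 cores per chip     */
    const int cecs_per_rack = 12;     /* up to 12 CEC drawers per frame           */
    const int max_drawers = 2048;     /* maximum cluster size in CEC drawers      */

    double gflops_chip = ghz * cores_per_chip * flops_per_core;  /* ~245.8 GFLOPs */
    double tflops_cec  = gflops_chip * chips_per_cec / 1000.0;   /* ~7.86 TF      */
    double tflops_rack = tflops_cec * cecs_per_rack;             /* ~94.4 TF      */
    long   max_cores   = (long)max_drawers * chips_per_cec * cores_per_chip;

    printf("per chip : %.1f GFLOPs\n", gflops_chip);
    printf("per CEC  : %.2f TF\n", tflops_cec);
    printf("per rack : %.1f TF\n", tflops_rack);
    printf("max cores: %ld\n", max_cores);
    return 0;
}

Compiled and run, the program prints approximately 245.8 GFLOPs per chip, 7.86 TF per
CEC, 94.4 TF per rack, and 524,288 cores for a 2048-drawer cluster, matching the figures
above.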
The hardware is only as good as the software that runs on it. IBM AIX, IBM Parallel
Environment (PE) Runtime Edition, LoadLeveler, GPFS, and xCAT are a few of the components
of the supported software stack for the solution. For more information, see 1.9, “High Performance Computing
software stack” on page 62.
1.2 The IBM Power 775 cluster components
The IBM Power 775 can consist of the following components:
Compute subsystem:
– Diskless nodes dedicated to perform computational tasks
– Customized operating system (OS) images
– Applications
Storage subsystem:
– I/O node (diskless)
– OS images for IO nodes
– SAS adapters attached to the Disk Enclosures (DE)
– General Parallel File System (GPFS)
Cluster network subsystem:
– Host Fabric Interface (HFI):
• Busses from processor modules to the switching hub in an octant
• Local links (LL-links) between octants
• Local remote links (LR-links) between drawers in a SuperNode
• Distance links (D-links) between SuperNodes
– Operating system drivers
– IBM User space protocol
– AIX and Linux IP drivers
Octants, SuperNode, and other components are described in the other sections of this book.
Node types
The following node types provide different functions for the cluster. In the
context of the 9125-F2C drawer, a node is an OS image that is booted in an LPAR. There
are three general designations for node types on the 9125-F2C. Often these functions are
dedicated to a node, but a node can have multiple roles:
– Compute nodes
Compute nodes run parallel jobs and perform the computational functions. These
nodes are diskless and booted across the HFI network from a Service Node. Most of
the nodes are compute nodes.
– IO nodes
These nodes are attached to either the Disk Enclosure in the physical cluster or
external storage. These nodes serve the file system to the rest of the cluster.
– Utility Nodes
A Utility node offers services to the cluster. These nodes often feature more resources,
such as external Ethernet adapters and external or internal storage. The following Utility nodes
are required:
• Service nodes: Run xCAT to serve the operating system to local diskless nodes
• Login nodes: Provides a centralized login to the cluster
– Optional utility node:
• Tape subsystem server
Important: xCAT stores all system definitions as node objects, including the required EMS
console and the HMC console. However, the consoles are external to the 9125-F2C cluster
and are not referred to as cluster nodes. The HMC and EMS consoles are physically
running on specific, dedicated servers. The HMC runs on a System x® based machine
(7042 or 7310) and the EMS runs on a POWER 750 Server. For more information, see
1.7.1, “Hardware Management Console” on page 53 and 1.7.2, “Executive Management
Server” on page 53.
1.3 Advantages and new features of the IBM Power 775
The IBM Power Systems 775 (9125-F2C) has several new features that make this system
even more reliable, available, and serviceable.
Fully redundant power, cooling, and management; dynamic processor deallocation; memory
chip and lane sparing; and concurrent maintenance are the main reliability, availability,
and serviceability (RAS) features.
The system is water-cooled, which provides 100% heat capture. Some components are cooled
by small fans, but the Rear Door Heat exchanger captures this heat.
Because most of the nodes are diskless nodes, the service nodes provide the operating
system to the diskless nodes. The HFI network also is used to boot the diskless utility nodes.
The Power 775 Availability Plus (A+) feature allows immediate recovery from processor,
switching hub, and HFI cable failures because additional resources are available in the system.
These resources fail in place and no hardware must be replaced until a specified threshold is
reached. For more information, see 5.4, “Power 775 Availability Plus” on page 297.
The IBM Power 775 cluster solution provides High Performance Computing clients with the
following benefits:
Sustained performance and low energy consumption for climate modeling and forecasting
Massive scalability for cell and organism process analysis in life sciences
Memory capacity for high-resolution simulations in nuclear resource management
Space and energy efficient for risk analytics and real-time trading in financial services
1.4 Hardware information
This section provides detailed information about the hardware components of the IBM Power
775. Within this section, there are links to IBM manuals and external sources for more
information.
1.4.1 POWER7 chip
The IBM Power System 775 implements the POWER7 processor technology. The PowerPC®
Architecture POWER7 processor is designed for use in servers that provide solutions with
large clustered systems, as shown in Figure 1-1 on page 5.
Figure 1-1 POWER7 chip block diagram
IBM POWER7 characteristics
This section provides a description of the following characteristics of the IBM POWER7 chip,
as shown in Figure 1-1:
246 GFLOPs per chip:
– Up to eight cores per chip
– Four Floating Point Units (FPU) per core
– Two FLOPs per cycle (fused multiply-add)
– 246 GFLOPs = 8 cores x 3.84 GHz x 4 FPU x 2
32 KB instruction and 32 KB data caches per core
256 KB L2 cache per core
4 MB L3 cache per core
Eight Channels of SuperNova buffered DIMMs:
– Two memory controllers per chip
– Four memory busses per memory controller (1 B wide Write, 2 B wide Read each)
CMOS 12S SOI 11 level metal
Die size: 567 mm2
Architecture
PowerPC architecture
IEEE New P754 floating point compliant
Big endian, little endian, strong byte ordering support extension
46-bit real addressing, 68-bit virtual addressing
Off-chip bandwidth: 336 GBps:
– Local plus remote interconnect
Memory capacity: Up to 128 GB per chip
Memory bandwidth: 128 GBps peak per chip
1.9 GHz Frequency
(8) 16 B data bus, 2 address snoop, 21 on/off ramps
Asynchronous interface to chiplets and off-chip interconnect
Differential memory controllers (2)
6.4-GHz interface to SuperNOVA (SN)
DDR3 support, maximum 1067 MHz
Minimum Memory 2 channels, 1 SN/channel
Maximum Memory 8 channels X 1 SN/channel
2 Ports/Super Nova
8 Ranks/Port
X8b and X4b devices supported
PowerBus Off-Chip Interconnect
1.5 to 2.9 Gbps single ended EI-3
2 spare bits/bus
Max 256-way SMP
32-way optimal scaling
Four 8-B Intranode Buses (W, X, Y, or Z)
All buses run at the same bit rate
All capable of running as a single 4B interface; the location of the 4B interface within the
8 B is fixed
Hub chip attaches via W, X, Y or Z
Three 8-B Internode Buses (A, B, C)
C-bus is multiplexed with GX and operates only as an aggregate data bus (that is, address
and command traffic is not supported)
Buses
Table 1-1 describes the POWER7 busses.
Table 1-1 POWER7 busses
Bus name     Width (speed)                                           Connects                       Function
W, X, Y, Z   8B+8B with 2 extra bits per bus (3 Gbps)                Intranode processors and hub   Used for address and data
A, B         8B+8B with 2 extra bits per bus (3 Gbps)                Other nodes within drawer      Data only
C            8B+8B with 2 extra bits per bus (3 Gbps)                Other nodes within drawer      Data only, multiplexed with GX
Mem1-Mem8    2B Read + 1B Write with 2 extra bits per bus (2.9 GHz)  Processor to memory
WXYZABC Busses
The off-chip PowerBus supports up to seven coherent SMP links (WXYZABC) by using
Elastic Interface 3 (EI-3) signaling at up to 3 Gbps. The intranode WXYZ links connect up to
four processor chips to make a 32-way SMP and connect a Hub chip to each processor chip.
The WXYZ links carry coherency traffic and data and are interchangeable as intranode
processor links or Hub links. The internode AB links connect up to two nodes per processor
chip. The AB links carry coherency traffic and data and are interchangeable with each other.
The AB links also are configured as aggregate data-only links. The C link is configured only
as a data-only link.
All seven coherent SMP links (WXYZABC) are configured as 8 bytes or 4 bytes in width.
The WXYZABC busses include the following features:
Four (WXYZ) 8-B or 4-B EI-3 Intranode Links
Two (AB) 8-B or 4-B EI-3 Internode Links or two (AB) 8-B or 4-B EI-3 data-only Links
One (C) 8-B or 4-B EI-3 data-only Link
PowerBus
The PowerBus is responsible for coherent and non-coherent memory access, IO operations,
interrupt communication, and system controller communication. The PowerBus provides all of
the interfaces, buffering, and sequencing of command and data operations within the storage
subsystem. The POWER7 chip has up to seven PowerBus links that are used to connect to
other POWER7 chips, as shown in Figure 1-2 on page 8.
The PowerBus link is an 8-Byte-wide (or optional 4-Byte-wide), split-transaction, multiplexed,
command and data bus that supports up to 32 POWER7 chips. The bus topology is a
multitier, fully connected topology to reduce latency, increase redundancy, and improve
concurrent maintenance. Reliability is improved with ECC on the external I/Os.
Data transactions are always sent along a unique point-to-point path. A route tag travels with
the data to help routing decisions along the way. Multiple data links are supported between
chips that are used to increase data bandwidth.
Figure 1-2 POWER7 chip layout
Figure 1-3 on page 9 shows the POWER7 core structure.
Figure 1-3 Microprocessor core structural diagram
Reliability, availability, and serviceability features
The microprocessor core includes the following reliability, availability, and serviceability (RAS)
features:
POWER7 core:
– Instruction retry for soft core logic errors
– Alternate processor recovery for hard core errors detected
– Processor limited checkstop for other errors
– Protection key support for AIX
L1 I/D Cache Error Recovery and Handling:
– Instruction retry for soft errors
– Alternate processor recovery for hard errors
– Guarding of core for core and L1/L2 cache errors
L2 Cache:
– ECC on L2 and directory tags
– Line delete for L2 and directory tags (seven lines)
– L2 UE handling includes purge and refetch of unmodified data
– Predictive dynamic guarding of associated cores
L3 Cache:
– ECC on data
– Line delete mechanism for data (seven lines)
– L3UE handling includes purges and refetch of unmodified data
– Predictive dynamic guarding of associated cores for CEs in L3 not managed by the line
deletion
1.4.2 I/O hub chip
This section provides information about the IBM Power 775 I/O hub chip (or Torrent chip), as
shown in Figure 1-4.
Figure 1-4 Hub chip (Torrent)
Host fabric interface
The host fabric interface (HFI) provides a non-coherent interface between a quad-chip
module (QCM), which is composed of four POWER7 chips, and the cluster network.
Figure 1-5 on page 11 shows two instances of the HFI in a hub chip. The HFIs also attach to
the Collective Acceleration Unit (CAU).
Each HFI has one PowerBus command and four PowerBus data interfaces, which feature the
following configuration:
1. The PowerBus directly connects to the processors and memory controllers of four
POWER7 chips via the WXYZ links.
2. The PowerBus also indirectly coherently connects to other POWER7 chips within a
256-way drawer via the LL links. Although fully supported by the HFI hardware, this path
provides reduced performance.
3. Each HFI has four ports to the Integrated Switch Router (ISR). The ISR connects to other
hub chips through the D, LL, and LR links.
4. ISRs and D, LL, and LR links that interconnect the hub chips form the cluster network.
POWER7 chips: The set of four POWER7 chips (QCM), its associated memory, and a
hub chip form the building block for cluster systems. A Power 775 system consists of
multiple building blocks that are connected to each other via the cluster network.
Figure 1-5 HFI attachment scheme
Packet processing
The HFI is the interface between the POWER7 chip quads and the cluster network, and is
responsible for moving data between the PowerBus and the ISR. The data is in various
formats, but packets are processed in the following manner:
Send
– Pulls or receives data from PowerBus-attached devices in a POWER7 chip
– Translates data into network packets
– Injects network packets into the cluster network via the ISR
Receive
– Receives network packets from the cluster network via the ISR
– Translates them into transactions
– Pushes the transactions to PowerBus-attached devices in a POWER7 chip
Packet ordering
– The HFIs and cluster network provide no ordering guarantees among packets. Packets
that are sent from the same source window and node to the same destination window
and node might reach the destination in a different order.
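Because the fabric provides no ordering guarantees, software layered above the HFI (for
example, a user space protocol or message passing library) must tolerate out-of-order
delivery. The following C fragment is a minimal, hypothetical sketch (the structure names
and window size are illustrative, not part of the HFI interface) of the usual technique: the
sender numbers each packet and the receiver delivers packets to the application only when
they are contiguous.

#include <stdint.h>

#define WINDOW 64                   /* hypothetical reorder window size          */

struct packet {
    uint32_t seq;                   /* sequence number assigned by the sender    */
    char     payload[2048];         /* up to the 2 KB maximum packet size        */
};

static struct packet slot[WINDOW];  /* packets that arrived early                */
static int           present[WINDOW];
static uint32_t      next_expected; /* next sequence number to deliver in order  */

/* Called for every packet as it arrives, in whatever order the network delivered
 * it; deliver() hands packets to the application strictly in sequence order. */
void on_arrival(const struct packet *p, void (*deliver)(const struct packet *))
{
    slot[p->seq % WINDOW] = *p;
    present[p->seq % WINDOW] = 1;

    /* Drain every packet that is now contiguous with the delivered stream. */
    while (present[next_expected % WINDOW] &&
           slot[next_expected % WINDOW].seq == next_expected) {
        deliver(&slot[next_expected % WINDOW]);
        present[next_expected % WINDOW] = 0;
        next_expected++;
    }
}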
Figure 1-6 shows two HFIs cooperating to move data from devices that are attached to one
PowerBus to devices attached to another PowerBus through the Cluster Network.
Figure 1-6 HFI moving data from one quad to another quad
HFI paths: The path between any two HFIs might be indirect, thus requiring multiple hops
through intermediate ISRs.
1.4.3 Collective acceleration unit
The hub chip provides specialized hardware that is called the Collective Acceleration Unit
(CAU) to accelerate frequently used collective operations.
Collective operations
Collective operations are distributed operations that operate across a tree. Many HPC
applications perform collective operations in which the application makes forward progress
only after every compute node has completed its contribution and the results of the collective
operation are delivered back to every compute node (for example, barrier synchronization
and global sum).
A specialized arithmetic-logic unit (ALU) within the CAU implements reduction, barrier, and
broadcast operations. For reduce operations, the ALU supports the following
operations and data types:
Fixed point: NOP, SUM, MIN, MAX, OR, AND, and XOR (signed and unsigned)
Floating point: MIN, MAX, SUM, and PROD (single and double precision)
There is one CAU in each hub chip, which is one CAU per four POWER7 chips, or one CAU
per 32 C1 cores.
Software organizes the CAUs in the system into collective trees. For a broadcast operation,
the arrival of an input on one link causes it to be forwarded on all other links. For a reduce
operation, arrivals on all but one link cause the reduction result to be forwarded on the
remaining link.
A link in the CAU tree maps to a path composed of more than one link in the network. The
system supports many trees simultaneously and each CAU supports 64 independent trees.
The usage of sequence numbers and a retransmission protocol enables reliability and
pipelining. Each tree has only one participating HFI window on any involved node. The order
in which the reduction operation is evaluated is preserved from one run to another, which
benefits programming models that allow programmers to require that collective operations are
executed in a particular order, such as MPI.
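The operations that the CAU accelerates are the same collectives that MPI applications
issue. The following minimal C/MPI sketch shows a barrier synchronization and a global
sum; it is standard MPI code rather than a CAU-specific interface, and whether the
collective is offloaded to the CAU depends on the runtime configuration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Barrier synchronization: no rank proceeds until all ranks arrive. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Global sum: every rank contributes one value and receives the total. */
    local = (double)(rank + 1);
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", nranks, global);

    MPI_Finalize();
    return 0;
}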
Packet propagation
As shown in Figure 1-7 on page 14, a CAU receives packets from the following sources:
The memory of a remote node, which is inserted into the cluster network by the HFI of the
remote node
The memory of a local node, which is inserted into the cluster network by the HFI of the
local node
A remote CAU
Figure 1-7 CAU packets received by CAU
As shown in Figure 1-8 on page 15, a CAU sends packets to the following locations:
The memory of a remote node that is written to memory by the HFI of the remote node.
The memory of a local node that is written to memory by the HFI of the local node.
A remote CAU.
Figure 1-8 CAU packets sent by CAU
1.4.4 Nest memory management unit
The Nest Memory Management Unit (NMMU) that is in the hub chip facilitates user-level
code to operate on the address space of processes that execute on other compute nodes.
The NMMU enables user-level code to create a global address space from which the NMMU
performs operations. This facility is called global shared memory.
A process that executes on a compute node registers its address space, thus permitting
interconnect packets to manipulate the registered shared region directly. The NMMU
references a page table that maps effective addresses to real memory. The hub chip also
maintains a cache of the mappings and maps the entire real memory of most installations.
Incoming interconnect packets that reference memory, such as RDMA packets and packets
that perform atomic operations, contain an effective address and information that pinpoints
the context in which to translate the effective address. This feature greatly facilitates
global-address space languages, such as Unified Parallel C (UPC), co-array Fortran, and
X10, by permitting such packets to contain easy-to-use effective addresses.
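As an illustration of the global shared memory style of access that the NMMU supports,
the following C sketch uses standard MPI one-sided (RDMA-style) calls to read memory that
another process registered. It shows only the programming model that benefits from this
facility; it is not the NMMU or HFI interface itself.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double buf = 0.0, remote = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank exposes one double; creating the window registers the memory
       so that other ranks can access it directly (RDMA-style). */
    buf = 100.0 + rank;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Rank 0 reads the value exposed by the last rank without that rank
       issuing a matching receive. */
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Get(&remote, 1, MPI_DOUBLE, nranks - 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("value read from rank %d: %f\n", nranks - 1, remote);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}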
1.4.5 Integrated switch router
The integrated switch router (ISR) replaces the external switching and routing functions that
are used in prior networks. The ISR is designed to dramatically reduce cost and improve
performance in bandwidth and latency.
A direct graph network topology connects up to 65,536 POWER7 eight-core processor chips
with a two-level routing hierarchy of L and D busses.
Each hub chip ISR connects to four POWER7 chips via the HFI controller and the W busses.
The Torrent hub chip and its four POWER7 chips are called an octant. Each ISR octant is
directly connected to seven other octants on a drawer via the wide on-planar L-Local busses
and to 24 other octants in three more drawers via the optical L-Remote busses.
A
Supernode is the fully interconnected collection of 32 octants in four drawers. Up to 512
Supernodes are fully connected via the 16 optical D busses per hub chip. The ISR is
designed to support smaller systems with multiple D busses between Supernodes for higher
bandwidth and performance.
The ISR logically contains input and output buffering, a full crossbar switch, hierarchical route
tables, link protocol framers/controllers, interface controllers (HFI and PB data), Network
Management registers and controllers, and extensive RAS logic that includes link replay
buffers.
The Integrated Switch Router supports the following features:
Target cycle time up to 3 GHz
Target switch latency of 15 ns
Target GUPS: ~21 K. ISR assisted GUPs handling at all intermediate hops (not software)
Target switch crossbar bandwidth greater than 1 TB per second input and output:
– 96 Gbps WXYZ-busses (4 @ 24 Gbps) from P7 chips (unidirectional)
– 168 Gbps local L-busses (7 @ 24 Gbps) between octants in a drawer (unidirectional)
– 144 Gbps optical L-busses (24 @ 6 Gbps) to other drawers (unidirectional)
– 160 Gbps D-busses (16 @ 10 Gbps) to other Supernodes (unidirectional)
Two-tiered full-graph network
Virtual Channels for deadlock prevention
Cut-through Wormhole routing
Routing Options:
– Full hardware routing
– Software-controlled indirect routing by using hardware route tables
Multiple indirect routes that are supported for data striping and failover
Multiple direct routes by using LR and D-links supported for less than a full-up system
Maximum supported packet size is 2 KB. Packet size varies from 1 to 16 flits, with each flit
being 128 bytes
Routing Algorithms:
– Round Robin: Direct and Indirect
– Random: Indirect routes only
IP Multicast with central buffer and route table and supports 256 Bytes or 2 KB packets
Global Hardware Counter implementation and support and includes link latency counts
LCRC on L and D busses with link-level retry support for handling transient errors and
includes error thresholds.
ECC on local L and W busses, internal arrays, and busses and includes Fault Isolation
Registers and Control Checker support
Performance Counters and Trace Debug support
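The direct and indirect routing options above can be pictured with a small conceptual
sketch: a packet between supernodes takes a direct D-link when one exists, or is routed
through an intermediate supernode otherwise. The following C fragment is only an
illustration of that idea under stated assumptions (the d_link table and the function are
hypothetical and are not the ISR route-table logic).

#include <stdlib.h>

#define MAX_SUPERNODES 512

/* d_link[a][b] is nonzero when a healthy D-link connects supernodes a and b.
 * In a full-up system the graph is fully connected; the table is left empty
 * here because this is only a conceptual sketch. */
static int d_link[MAX_SUPERNODES][MAX_SUPERNODES];

/* Pick the next supernode hop for a packet travelling from src to dst: use the
 * direct D-link when one exists, otherwise pick an intermediate supernode
 * starting from a random offset (an indirect, two-hop route). Returns -1 if
 * no route can be found. */
int next_supernode(int src, int dst, int num_supernodes)
{
    int offset, i;

    if (d_link[src][dst])
        return dst;                            /* direct route              */

    offset = rand() % num_supernodes;          /* randomize indirect choice */
    for (i = 0; i < num_supernodes; i++) {
        int mid = (offset + i) % num_supernodes;
        if (mid != src && mid != dst && d_link[src][mid] && d_link[mid][dst])
            return mid;                        /* indirect route via mid    */
    }
    return -1;
}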
1.4.6 SuperNOVA
SuperNOVA is the second member of the fourth generation of the IBM Synchronous Memory
Interface ASIC. It connects host memory controllers to DDR3 memory devices.
SuperNOVA is used in a planar configuration to connect to Industry Standard (I/S) DDR3
RDIMMs. SuperNOVA also resides on a custom, fully buffered memory module that is called
the SuperNOVA DIMM (SND). Fully buffered DIMMs use a logic device, such as SuperNOVA,
to buffer all signals to and from the memory devices.
As shown in Figure 1-9, SuperNOVA provides the following features:
Cascaded memory channel (up to seven SNs deep) that uses 6.4-Gbps, differential-ended
(DE), unidirectional links.
Two DDR3 SDRAM command and address ports.
Two, 8 B DDR3 SDRAM data ports with a ninth byte for ECC and a tenth byte that is used
as a locally selectable spare.
16 ranks of chip selects and CKE controls (eight per CMD port).
Eight ODT (four per CMD port).
Four differential memory clock pairs to support up to four DDR3 registered dual in-line
memory modules (RDIMMs).
Data Flow Modes include the following features:
Expansion memory channel daisy-chain
4:1 or 6:1 configurable data rate ratio between memory channel and SDRAM domain
Figure 1-9 Memory channel
SuperNOVA uses a high speed, differential ended communications memory channel to link a
host memory controller to the main memory storage devices through the SuperNOVA ASIC.
The maximum memory channel transfer rate is 6.4 Gbps.
The SuperNOVA memory channel consists of two DE, unidirectional links. The downstream
link transmits write data and commands away from the host (memory controller) to the
SuperNOVA. The downstream includes 13 active logical signals (lanes), two more spare
lanes, and a bus clock. The upstream (US), link transmits read data and responses from the
SuperNOVA back to the host. The US includes 20 active logical signals, two more spare
lanes, and a bus clock.
Although SuperNOVA supports a cascaded memory channel topology of multiple chips that
use daisy chained memory channel links, Power 775 does not use this capability.
The links that are connected on the host side are called the Primary Up Stream (PUS) and
Primary Down Stream (PDS) links. The links on the cascaded side are called the Secondary
Up Stream (SUS) and Secondary Down Stream (SDS) links.
The SuperNOVA US and downstream links each include two dedicated spare lanes. One of
these lanes is used to repair either a clock or data connection. The other lane is used only to
repair data signal defects. Each segment (host to SuperNOVA or SuperNOVA to SuperNOVA
connection) of a cascaded memory channel is independently deployed of their dedicated
spares per link. This deployment maximizes the ability to survive multiple interconnect hard
failures. The spare lanes are tested and aligned during initialization but are deactivated during
normal runtime operation. The channel frame format, error detection, and protocols are the
same before and after spare lane invocation. Spare lanes are selected by one of the following
means:
The spare lanes are selected during initialization by loading host and SuperNOVA
configuration registers based on previously logged lane failure information.
The spare lanes are selected dynamically by the hardware during runtime operation by an
error recovery operation that performs the link reinitialization and repair procedure. This
procedure is initiated by the host memory controller and supported by the SuperNOVAs in
the memory channel. During the link repair operation, the memory controller holds back
memory access requests. The procedure is designed to take less than 10 ms to prevent
system performance problems, such as timeouts.
The spare lanes are selected by system control software by loading host or SuperNOVA
configuration registers that are based on the results of the memory channel lane
shadowing diagnostic procedure.
1.4.7 Hub module
The Power 775 hub module provides all the connectivity that is needed to form a clustered
system, as shown in Figure 1-10 on page 19.
Figure 1-10 Hub module diagram
The hub features the following primary functions:
Connects the QCM Processor/Memory subsystem to up to two high-performance 16x
PCIe slots and one high-performance 8x PCI Express slot. This configuration provides
general-purpose I/O and networking capability for the server node.
POWER 775 drawer: In a Power 775 drawer (CEC), Octant 0 has three PCIe slots: two
16x and one 8x (the SR-IOV Ethernet adapter is given priority in the 8x slot). Octants 1-7
have two 16x PCIe slots each.
Connects eight Processor QCMs together by using a low-latency, high-bandwidth,
coherent copper fabric (L-Local buses) that includes the following features:
– Enables a single hypervisor to run across 8 QCMs, which enables a single pair of
redundant service processors to manage 8 QCMs
– Directs the I/O slots that are attached to the eight hubs to the compute power of any of
the eight QCMs that provide I/O capability where needed
– Provides a message passing mechanism with high bandwidth and the lowest possible
latency between eight QCMs (8.2 TFLOPs) of compute power
Connects four Power 775 planars via the L-Remote optical connections to create a 33
TFLOP tightly connected compute building block (SuperNode). The bi-sectional exchange
bandwidth between the four boards is 3 TBps, the same bandwidth as 1500 10 Gb
Ethernet links.
Connects up to 512 groups of four planars (SuperNodes) together via the D optical buses
with ~3 TBps of exiting bandwidth per planar.
Optical links
The Hub modules that are on the node board house optical transceivers for up to 24
L-Remote links and 16 D-Links. Each optical transceiver includes a jumper cable that
connects the transceiver to the node tailstock. The transceivers are included to facilitate cost
optimization, depending on the application. The supported options are shown in Table 1-2.
Table 1-2 Supported optical link options
SuperNode type           L-Remote links   D-Links                    Number of combinations
SuperNodes not enabled   0                0-16 in increments of 1    17
Full SuperNodes          24               0-16 in increments of 1    17
Total                                                                34
Some customization options are available on the hub optics module, which allow some optical
transceivers to remain unpopulated on the Torrent module if the wanted topology does not
require all of the transceivers. The number of actual offering options that are deployed
depends on specific large customer bids.
Optics physical package
The optics physical package includes the following features:
Individual transmit (Tx) and Receive (Rx) modules that are packaged in Tx+Rx pairs on
glass ceramic substrate.
Up to 28 Tx+Rx pairs per module.
uLGA (Micro-Land Grid Array) at 0.7424 mm pitch interconnects optical modules to
ceramic substrate.
12-fiber optical fiber ribbon on top of each Tx and each Rx module, which is coupled
through Prizm reflecting and spheric-focusing 12-channel connectors.
Copper saddle over each optical module and optical connector for uLGA actuation and
heat spreading.
Heat spreader with springs and thermal interface materials that provide uLGA actuation
and heat removal separately for each optical module.
South (rear) side of each glass ceramic module carries 12 Tx+Rx optical pairs that
support 24 (6+6) fiber LR-links, and 2 Tx+Rx pairs that support 2 (12+12) fiber D-links.
North (front) side of each glass ceramic module carries 14 Tx+Rx optical module pairs
that support 14 (12+12) fiber D-links
Optics electrical interface
The optics electrical interface includes the following features:
12 differential pairs @ 10 Gb per second (24 signals) for each TX optics module
12 differential pairs @ 10 Gb per second (24 signals) for each RX module
Three wire I2C/TWS (Serial Data & Address, Serial Clock, Interrupt): three signals
Cooling
Cooling includes the following features:
Optics are water-cooled with Hub chip
Cold plate on top of module, which is coupled to optics through heat reader and saddles,
with thermal interface materials at each junction
Recommended temperature range: 20C – 55C at top of optics modules
Optics drive/receive distances
Optics links might be up to 60 meters rack-to-rack (61.5 meters, including inside-drawer
optical fiber ribbons).
Reliability assumed
The following reliability features are assumed:
10 FIT rate per lane.
D-link redundancy. Each (12+12)-fiber D-link runs normally with 10 active lanes and two
spares. Each D-link runs in degraded-Bandwidth mode with as few as eight lanes.
LR-link redundancy: Each (6+6)-fiber LR-link runs normally with six active lanes. Each
LR-link (half of a Tx+Rx pair) runs in degraded-bandwidth mode with as few as four lanes
out of six lanes.
Overall redundancy: As many as four lanes out of each 12 (two lanes of each six) might
fail without disabling any D-links or LR-links.
Expect to allow one failed lane per 12 lanes in manufacturing.
Bit Error Rate: Worst-case, end-of-life BER is 10^-12. Normal expected BER is 10^-18
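To put these numbers in perspective, the following short C calculation (assuming the
10 Gbps lane rate quoted earlier in this section) converts the worst-case and expected bit
error rates into an average time between bit errors on a single lane.

#include <stdio.h>

int main(void)
{
    const double lane_rate = 10e9;    /* bits per second per optical lane */
    const double ber_worst = 1e-12;   /* worst-case, end-of-life BER      */
    const double ber_nominal = 1e-18; /* normal expected BER              */

    /* Mean time between bit errors = 1 / (BER x bit rate). */
    double t_worst = 1.0 / (ber_worst * lane_rate);      /* ~100 seconds  */
    double t_nominal = 1.0 / (ber_nominal * lane_rate);  /* ~3.2 years    */

    printf("worst case : one bit error every %.0f s per lane\n", t_worst);
    printf("nominal    : one bit error every %.1f years per lane\n",
           t_nominal / (365.25 * 24 * 3600));
    return 0;
}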
1.4.8 Memory subsystem
The memory controller layout is shown in Figure 1-11.
Figure 1-11 Memory controller layout
The memory cache sizes are shown in Table 1-3.
Table 1-3 Cache memory sizes
Cache level   Memory size (per core)
L1            32 KB Instruction, 32 KB Data
L2            256 KB
L3            4 MB eDRAM
The memory subsystem features the following characteristics:
Memory capacity: Up to 128 GB/processor
Memory bandwidth: 128 GB/s (peak)/processor
Eight channels of SuperNOVA buffered DIMMs/processor
Two memory controllers per processor:
– Four memory busses per memory controller
– Each bus is 1 B-wide Write, 2 B-wide Read
Memory per drawer
Each drawer features the following minimum and maximum memory ranges:
Minimum:
– 4 DIMMs per QCM x 8 QCM per drawer = 32 DIMMs per drawer
– 32 DIMMs per drawer x 8 GB per DIMM = 256 GB per drawer
Maximum:
– 16 DIMMs per QCM x 8 QCM per drawer = 128 DIMMs per drawer
– 128 DIMMs per drawer x 16 GB per DIMM = 2 TB per drawer
Memory DIMMs
Memory DIMMs include the following features:
Two SuperNOVA chips each with a bus connected directly to the processor
Two ports on the DIMM from each SuperNova
Dual CFAM interfaces from the processor to each DIMM, wired to the primary SuperNOVA
and dual chained to the secondary SuperNOVA on the DIMM
Two VPD SEEPROMs on the DIMM interfaced to the primary SuperNOVA CFAM
80 DRAM sites - 2 x 10 (x8) DRAM ranks per SuperNova Port
Water cooled jacketed design
50 watt max DIMM power
Available in sizes: 8 GB, 16 GB, and 32 GB (RPQ)
For best performance, it is recommended that all 16 DIMM slots are plugged in each node. All
DIMMs driven by a quad-chip module (QCM) must have the same size, speed, and voltage
rating.
1.4.9 Quad chip module
The previous sections provided a brief introduction to the low-level components of the Power
775 system. We now look at the system on a modular level. This section discusses the
quad-chip module or QCM, which contains four POWER7 chips that are connected in a
ceramic module.
The standard Power 775 CEC drawer contains eight QCMs. Each QCM contains four, 8-core
POWER7 processor chips and supports 16 DDR3 SuperNova buffered memory DIMMs.
Figure 1-12 on page 24 shows the POWER7 quad chip module which contains the following
characteristics:
D-Link:
– A = number of D-Link transceivers; A = 16
– B = number of transmitter lanes; B = 10
– C = number of receiver lanes; C = 10
– D = bit rate per lane; D = 10 Gb/s
– E = 8:10 coding; E = 8/10
– BW = A x (B + C) x D x E = 16 x (10 + 10) x 10 x 8/10 = 2560 Gb/s = 320 GB/s (20 GB/s per D-Link)
L-remote Link:
– A = number of L-Link transceivers; A = 12
– B = number of transmitter lanes; B = 10
– C = number of receiver lanes; C = 10
– D = bit rate per lane; D = 10 Gb/s
– E = 8:10 coding; E = 8/10
– BW = A x (B + C) x D x E = 12 x (10 + 10) x 10 x 8/10 = 1920 Gb/s = 240 GB/s (20 GB/s per L-Link)
L-local Link:
– A = number of L-Links; A = 7
– B = number of transmitter bits per bus; B = 64
– C = number of receiver bits per bus; C = 64
– D = bit rate per lane; D = 3 Gb/s
– E = framing efficiency; E = 22/24
– BW = A x (B + C) x D x E = 7 x (64 + 64) x 3 x 22/24 = 2464 Gb/s = 308 GB/s (44 GB/s per L-local link)
Memory (per POWER7 chip): 128 GB/s peak, with 16 GB/s peak per SuperNOVA (read 10.67 GB/s per SuperNOVA, write 5.33 GB/s per SuperNOVA)
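The three link bandwidths quoted above all follow the same formula, BW = A x (B + C) x D x E. The following Python sketch, with values taken from the list above, reproduces the figures:

def link_bw_gbps(a_links, tx_lanes, rx_lanes, rate_gbps, efficiency):
    # Aggregate bandwidth in Gb/s for one hub's links of a given type.
    return a_links * (tx_lanes + rx_lanes) * rate_gbps * efficiency

d_link   = link_bw_gbps(16, 10, 10, 10, 8 / 10)   # 2560 Gb/s = 320 GB/s
l_remote = link_bw_gbps(12, 10, 10, 10, 8 / 10)   # 1920 Gb/s = 240 GB/s
l_local  = link_bw_gbps(7, 64, 64, 3, 22 / 24)    # 2464 Gb/s = 308 GB/s

for name, gbps in (("D-Link", d_link), ("L-remote", l_remote), ("L-local", l_local)):
    print(f"{name}: {gbps:.0f} Gb/s = {gbps / 8:.0f} GB/s")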
Each Power 775 planar consists of eight octants, as shown in Figure 1-14 on page 26. Seven
of the octants are composed of 1x QCM, 1 x HUB, 2 x PCI Express 16x. The other octant
contains 1x QCM, 1x HUB, 2x PCI Express 16x, 1x PCI Express 8x.
Figure 1-14 Octant layout differences
– 1280 DRAM sites (80 DRAM sites per DIMM, 1/2/4 Gb DRAMs over time)
– Two processor buses and four 8B memory buses per DIMM (two buffers)
– Double stacked at the extreme with memory access throttling
For more information, see 1.4.8, “Memory subsystem” on page 22.
I/O: 1 TB/s BW:
– 32 Gbps Generic I/O:
• Two PCIe2 16x line rate capable busses
• Expanded I/O slot count through PCIe2 expansion possible
IBM Proprietary Fabric with On-board copper/Off-board Optical:
– Excellent cost / performance (especially at mid-large scale)
– Basic technology can be adjusted for low or high BW applications
Packaging:
– Water Cooled (>95% at level shown), distributed N+1 Point Of Load Power
– High wire count board with small Vertical Interconnect Accesses (VIAs)
– High pin count LGA module sockets
– Hot Plug PCIe Air Cooled I/O Adapter Design
– Fully Redundant Out-of-band Management Control
1.4.11 Interconnect levels
A functional Power 775 system consists of multiple nodes that are spread across several
racks. This configuration means multiple octants are available in which every octant is
connected to every other octant on the system. The following levels of interconnect are
available on a system:
First level
This level connects the eight octants in a node together via the hub module by using
copper board wiring. This interconnect level is referred to as “L” local (LL). Every octant in
the node is connected to every other octant. For more information, see 1.4.12, “Node” on
page 28.
Second level
This level connects four nodes together to create a Supernode. This interconnection is
possible via the hub module optical links. This interconnection level is referred to as “L”
distant (LD). Every octant in a node must connect to every other octant in the other three
nodes that form the Supernode. Every octant features 24 connections, but the total
number of connections across the four nodes in a Supernode is 384. For more
information, see 1.4.13, “Supernodes” on page 30.
Third level
This level connects every Supernode to every other Supernode in a system. This
interconnection is possible via the hub module optical links. This interconnect level is
referred to as D-link. Each Supernode has up to 512 D-links. It is possible to scale up this
level to 512 Supernodes. Every Supernode has a minimum of one hop D-link to every
other Supernode. For more information, see 1.4.14, “Power 775 system” on page 32.
1.4.12 Node
This section discusses the node level of the Power 775, physically represented by the drawer,
also commonly referred to as the CEC. A node is composed of eight octants and their local
interconnect. Figure 1-15 shows the CEC drawer from the front.
Figure 1-15 CEC drawer front view
Figure 1-16 shows the CEC drawer rear view.
Figure 1-16 CEC drawer rear view
First level interconnect: L Local
L Local (LL) connects the eight octants in the CEC drawer together via the HUB module by
using copper board wiring. Every octant in the node is connected to every other octant, as
shown in Figure 1-17.
Figure 1-17 First level local interconnect (256 cores)
System planar board
This section provides details about the following system planar board characteristics:
Approximately 2U x 85 cm wide x 131 cm deep overall node package in a 30 EIA frame
Eight octants; each octant features one QCM, one hub, and 4 - 16 DIMMs
128 memory slots
17 PCI adapter slots (octant 0 has three PCI slots, octants 1 - 7 each have two PCI slots)
Regulators on the bottom side of planar directly under modules to reduce loss and
decoupling capacitance
Water-cooled stiffener to cool regulators on bottom side of planar and memory DIMMs on
top of board
Connectors on rear of board to optional PCI cards (17x)
Connectors on front of board to redundant 2N DCCAs
Optical fiber D-Link interface cables from hub modules to the left and right of the rear tail stock (128 total = 16 links x 8 hub modules)
Optical fiber L-remote interface cables from hub modules to the center of the rear tail stock (96 total = 24 links x 8 hub modules)
Clock distribution & out-of-band control distribution from DCCA.
Redundant service processor
The redundant service processor features the following characteristics:
The clocking source follows the topology of the service processor.
N+1 redundancy of the service processor and clock source logic use two inputs on each
processor and HUB chip.
Out-of-band signal distribution for memory subsystem (SuperNOVA chips) and PCI
express slots are consolidated to the standby powered pervasive unit on the processor
and HUB chips. PCI express is managed on the Hub and Bridge chips.
1.4.13 Supernodes
This section describes the concept of Supernodes.
Supernode configurations
The following supported Supernode configurations are used in a Power 775 system. The
usage of each type is based on cluster size or application requirements:
Four-drawer Supernode
The four-drawer Supernode is the most common Supernode configuration. This
configuration is formed by four CEC drawers (32 octants) connected via Hub optical links.
Single-drawer Supernode
In this configuration, each CEC drawer in the system is a Supernode.
Second level interconnect: L Remote
This level connects four CEC drawers (32 octants) together to create a Supernode via Hub
module optical links. Every octant in a node connects to every other octant in the other three
nodes in the Supernode. There are 384 connections in this level, as shown in Figure 1-18 on
page 31.
The second level wiring connector count is shown in Figure 1-19 on page 32.
Figure 1-19 Second level wiring connector count
Step 1
Each octant in Node 1 must be connected to the 8 octants in Node 2, the 8 octants in Node 3, and the 8 octants in Node 4. This requires 24 connections from each of the 8 octants in Node 1, so every octant has 24 connections from it, which results in 8 x 24 = 192 connections.
Step 2
Step 1 connected Node 1 to every other octant in the Supernode. Node 2 must now be connected to the remaining nodes (Nodes 3 and 4) in the Supernode. This requires 16 connections from each of the 8 octants in Node 2, which results in 8 x 16 = 128 connections.
Step 3
Steps 1 and 2 connected Node 1 and Node 2 to every other octant in the Supernode. Node 3 must now be connected to Node 4. Every octant in Node 3 needs 8 connections to the 8 octants in Node 4, which results in 8 x 8 = 64 connections. At this point, every octant in the Supernode is connected to every other octant in the Supernode.
Step 4
The total number of connections to build a Supernode is 192 + 128 + 64 = 384. Every octant has 24 connections, but the total number of connections across the four nodes in a Supernode is 384.
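The 384-connection total can also be verified by counting the octant pairs that span different drawers of a four-drawer Supernode; a small Python sketch:

from itertools import combinations

DRAWERS_PER_SUPERNODE = 4
OCTANTS_PER_DRAWER = 8

octants = [(drawer, octant)
           for drawer in range(DRAWERS_PER_SUPERNODE)
           for octant in range(OCTANTS_PER_DRAWER)]

# L-remote connections only join octants that sit in different drawers.
connections = sum(1 for a, b in combinations(octants, 2) if a[0] != b[0])

print(connections)              # 384
print(8 * 24 + 8 * 16 + 8 * 8)  # 384, matching Steps 1 - 3 above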
1.4.14 Power 775 system
This section describes the Power 775 system and provides details about the third level of interconnect.
Third level interconnect: Distance
This level connects every Supernode to every other Supernode in a system by using Hub
module optical links. Each Supernode includes up to 512 D-links, which allows for a system that
contains up to 512 Supernodes. Every Supernode features a minimum of one hop D-link to
every other Supernode, and there are multiple two hop connections, as shown in Figure 1-20
on page 33.
Each HUB contains 16 optical D-Links. The physical node (board) contains eight HUBs;
therefore, a physical node (board) contains 16 x 8 = 128 optical D-Links. A Supernode is
four physical nodes, which results in 16 x 8 x 4 = 512 optical D-Links per Supernode. This
configuration allows up to 2048 connected CEC drawers.
In smaller configurations, in which the system features fewer than 512 Supernodes, more than
one optical D-Link between Supernodes is possible. Multiple connections between Supernodes are used
for redundancy and higher bandwidth solutions.
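A minimal Python sketch of the D-link counts that are described above:

DLINKS_PER_HUB = 16
HUBS_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4

dlinks_per_drawer = DLINKS_PER_HUB * HUBS_PER_DRAWER               # 128
dlinks_per_supernode = dlinks_per_drawer * DRAWERS_PER_SUPERNODE   # 512

MAX_SUPERNODES = 512                                      # one D-link to every other supernode
max_cec_drawers = MAX_SUPERNODES * DRAWERS_PER_SUPERNODE  # 2048 connected CEC drawers

print(dlinks_per_drawer, dlinks_per_supernode, max_cec_drawers)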
Figure 1-20 System third level interconnect
Integrated cluster fabric interconnect
A complete Power 775 system configuration is achieved by configuring server nodes into a
tight cluster by using a fully integrated switch fabric.
The fabric is a multitier, hierarchical implementation that connects eight logical nodes
(octants) together in the physical node (server drawer or CEC) by using copper L-local links.
Four physical nodes are connected with structured optical cabling into a Supernode by using
optical L-remote links. Up to 512 super nodes are connected by using optical D-links.
Figure 1-21 on page 34 shows a logical representation of a Power 775 cluster.
Figure 1-21 Logical view of a Power 775 system
Figure 1-22 shows an example configuration of a 242 TFLOP Power 775 cluster that uses
eight Supernodes and direct graph interconnect. In this configuration, there are 28 D-Link
cable paths to route and 1-64 12-lane 10 Gb D-Link cables per cable path.
Figure 1-22 Direct graph interconnect example
A 4,096 core (131 TF), fully interconnected system is shown in Figure 1-23.
Figure 1-23 Fully interconnected system example
Network topology
Optical D-links connect Supernodes in different connection patterns. Figure 1-24 on page 36
shows an example of 32 D-links between each pair of supernodes. Topology is 32D, a
connection pattern that supports up to 16 supernodes.
Figure 1-24 Supernode connection using 32D topology
Figure 1-25 shows another example in which there is one D-link between supernode pairs,
which supports up to 512 supernodes in a 1D topology.
Figure 1-25 1D network topology
The network topology is specified during the installation. A topology specifier is set up in the
cluster database. In the cluster DB site table, topology=<specifier>. Table 1-4 shows the
supported four-drawer and single-drawer Supernode topologies.
Table 1-4 Supported four-drawer and single-drawer Supernode topologies
Topology       Maximum number of supernodes
256D           3
128D           5
64D            8
32D            16
16D            32
8D             64
4D             128
2D             256
1D             512

Single-drawer Supernode topologies
2D_SDSN        48
8D_SDSN        12
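The table can be treated as a simple lookup from topology specifier to maximum supernode count; the following Python sketch (a lookup illustration only, not an xCAT interface) uses the values from Table 1-4:

FOUR_DRAWER_TOPOLOGIES = {
    "256D": 3, "128D": 5, "64D": 8, "32D": 16, "16D": 32,
    "8D": 64, "4D": 128, "2D": 256, "1D": 512,
}
SINGLE_DRAWER_TOPOLOGIES = {"2D_SDSN": 48, "8D_SDSN": 12}

def max_supernodes(topology: str) -> int:
    # Look up the maximum supernode count for a given topology specifier.
    return {**FOUR_DRAWER_TOPOLOGIES, **SINGLE_DRAWER_TOPOLOGIES}[topology]

print(max_supernodes("32D"))      # 16
print(max_supernodes("2D_SDSN"))  # 48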
ISR network routes
Each ISR includes a set of hardware route tables. The Local Network Management Controller
(LNMC) routing code generates and maintains the routes with help from the Central Network
Manager (CNM), as shown in Figure 1-26 on page 38. These route tables are set up during
system initialization and are dynamically adjusted as links go down or come up during
operation. Packets are injected into the network with a destination identifier and the route
mode. The route information is picked up from the route tables along the route path based
on this information. Packets that are injected into the interconnect by the HFI employ
source route tables with the route partially determined. Per-port route tables are used to route
packets along each hop in the network. Separate route tables are used for intersupernode
and intrasupernode routes.
Routes are classified as direct or indirect. A direct route uses the shortest path between any
two compute nodes in a system. There are multiple direct routes between a set of compute
nodes because a pair of supernodes are connected by more than one D-link.
The network topology features two levels and therefore the longest direct route has three
hops (no more than two L hops and at most one D hop). This configuration is called an L-D-L
route.
The following conditions exist when source and destination hubs are within a drawer:
The route is one L-hop (assuming all of the links are good).
LNMC needs to know only the local link status in this CEC.
Figure 1-26 Routing within a single CEC
The following conditions exist when source and destination hubs lie within a supernode, as
shown in Figure 1-27:
Route is one L-hop (every hub within a supernode is directly connected via an L-remote link to
every other hub in the supernode).
LNMC needs to know only the local link status in this CEC.
Figure 1-27 L-hop
If an L-remote link is faulty, the route requires two hops. However, only the link status local to
the CEC is needed to construct routes, as shown in Figure 1-28.
Figure 1-28 Route representation in the event of a faulty L-remote link
When source and destination hubs lie in different supernodes, as shown in Figure 1-29, the
following conditions exist:
Route possibilities: one D-hop, or L-D (L-D-L routes also are used)
LNMC needs non-local link status to construct L-D routes
Figure 1-29 L-D route example
The ISR also supports indirect routes to provide increased bandwidth and to prevent hot
spots in the interconnect. An indirect route is a route with an intermediate compute node
that resides on a different supernode from the supernodes of the source and destination
compute nodes. An indirect route must employ the shortest path from the source
compute node to the intermediate node, and the shortest path from the intermediate compute
node to the destination compute node. The longest indirect route has five hops at most:
no more than three hops are L hops and at most two hops are D hops. This
configuration often is represented as an L-D-L-D-L route.
The following methods are used to select a route when multiple routes exist:
Software specifies the intermediate supernode, but the hardware determines how to route
to and then route from the intermediate supernode.
The hardware selects among the multiple routes in a round-robin manner for both direct
and indirect routes.
The hub chip provides support for route randomization in which the hardware selects one
route between a source–destination pair. Hardware-directed randomized route selection is
available only for indirect routes.
These routing modes are specified on a per-packet basis.
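The following Python sketch is an illustration only (not the ISR hardware implementation) of the per-packet route-mode choices that are described above: round-robin selection over the available routes, and randomized selection, which the hardware offers only for indirect routes. The route strings and intermediate supernode names are illustrative placeholders.

import itertools
import random

def round_robin(routes):
    # Cycle through the available routes, one route per packet.
    return itertools.cycle(routes)

def randomized(routes, indirect=True):
    # Pick one route at random; hardware randomization applies to indirect routes only.
    if not indirect:
        raise ValueError("randomized selection is available only for indirect routes")
    return random.choice(routes)

direct_routes = ["D", "L-D", "L-D-L"]
picker = round_robin(direct_routes)
print([next(picker) for _ in range(5)])          # D, L-D, L-D-L, D, L-D

indirect_routes = ["L-D-L-D-L via SN7", "L-D-L-D-L via SN9"]
print(randomized(indirect_routes))               # one of the two, chosen at random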
The correct choice between the use of direct-route versus indirect-route modes depends on the
communication pattern that is used by the application. Direct routing is suitable for
communication patterns in which each node must communicate with many other nodes, for
example, by using spectral methods. Communication patterns that involve small numbers of compute
nodes benefit from the extra bandwidth that is offered by the multiple routes with indirect
routing.
1.5 Power, packaging, and cooling
This section provides information about the IBM Power Systems 775 power, packaging, and
cooling features.
1.5.1 Frame
The front view of an IBM Power Systems 775 frame is shown in Figure 1-30.
Figure 1-30 Power 775 frame
The Power 775 frame front view is shown in Figure 1-31 on page 41.
Figure 1-31 Frame front view
The rear view of the Power 775 frame is shown in Figure 1-32 on page 42.
Figure 1-32 Frame rear photo
1.5.2 Bulk Power and Control Assembly
Each Bulk Power and Control Assembly (BPCA) is a modular unit that includes the following
features:
Bulk Power and Control Enclosure (BPCE)
Contains two 125A 3-phase AC couplers, six BPR bays, one BPCH and one BPD bay.
Bulk Power Regulators (BPR), each rated at 27 KW@-360 VDC
One to six BPRs are populated in the BPCE depending on CEC drawers and storage
enclosures in frame, and the type of power cord redundancy that is wanted.
Bulk Power Control and Communications Hub (BPCH)
This unit provides rack-level control of the server nodes, storage enclosures, and water
conditioning units (WCUs), and concentration of the communications interfaces to the
unit-level controllers that are in each server node, storage enclosure, and WCU.
Bulk Power Distribution (BPD)
This unit distributes 360 VDC to server nodes and disk enclosures.
Power Cords
One per BPCE for populations of one to three BPRs/BPCE and two per BPCE for four to
six BPRs/BPCE.
The front and rear views of the BPCA are shown in Figure 1-33.
Figure 1-33 BPCA
The minimum configuration per BPCE is one BPCH, one BPD, one BPR, and one line
cord. There are always two BPCAs above one another in the top of the cabinet.
BPRs are added uniformly to each BPCE depending on the power load in the rack.
A single fully configured BPCE provides 27 KW x 6 = 162 KW of bulk power, which equates
to an aggregate system power cord power of approximately 170 KW. Up to this power level, bulk
power is in a 2N arrangement, where a single BPCE can be removed entirely for concurrent
maintenance while the rack remains fully operational. If the rack bulk power demand
exceeds 162 KW, the bulk power provides an N+1 configuration of up to 27 KW x 9 = 243 KW,
which equates to an aggregate system power cord power of approximately 260 KW. N+1 bulk
power mode means that one of the four power cords can be disconnected and the cabinet continues
to operate normally. BPCE concurrent maintenance is not conducted in N+1 bulk power
mode unless the rack bulk power load is reduced to less than 162 KW by invoking Power
Efficient Mode on the server nodes. This mode reduces the peak power demand at the
expense of reduced performance.
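A short Python sketch of the bulk power arithmetic above:

BPR_KW = 27                       # each Bulk Power Regulator is rated at 27 KW

two_n_limit_kw = BPR_KW * 6       # 162 KW: one fully configured BPCE (2N arrangement)
n_plus_one_limit_kw = BPR_KW * 9  # 243 KW: N+1 arrangement if demand exceeds 162 KW

print(two_n_limit_kw, n_plus_one_limit_kw)   # 162 243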
Cooling
The BPCA is nearly entirely water cooled. Each BPR and the BPD has two quick connects to
enable connection to the supply and return water cooling manifolds in the cabinet. All of the
components that dissipate any significant power in the BPR and BPD are heat sunk to a
water-cooled cold plate in these units.
To ensure that the ambient air temperature internal to the BPR, BPCH, and BPD enclosures is kept
low, two hot-pluggable blowers are installed in the rear of each BPCA in an N+1 speed-controlled
arrangement. These blowers flush the units to keep the internal temperature at
approximately the system inlet air temperature, which is 40 degrees C maximum. A fan can be
replaced concurrently.
Management
The BPCH provides the mechanism to connect the cabinet to the management server via a
1 Gb Ethernet out-of-band network. Each BPCH has two management server-facing 1 Gb
Ethernet ports or buses so that the BPCH connects to a fully redundant network.
1.5.3 Bulk Power Control and Communications Hub
The front view of the bulk power control and communications hub (BPCH) is shown in
Figure 1-34.
Figure 1-34 BPCH front view
The following connections are shown in Figure 1-34:
T2: 10/100 Mb Ethernet from HMC1.
T3: 10/100 Mb Ethernet from HMC2.
T4: EPO.
T5: Cross Power.
T6-T9: RS422 UPIC for WCUs.
T10: RS422 UPIC port for connection of the Fill and Drain Tool.
T19/T36: 1 Gb Ethernet for HMC connectors (T19, T36).
2x – 10/100 Mb Ethernet port to plug in a notebook while the frame is serviced.
2x – 10/100 Mb spare (connectors contain both eNet and ½ Duplex RS422).
T11-T19, T20-T35, T37-T44: 10/100 Mb Ethernet ports or buses to the management
processors in CEC drawers and storage enclosures. Max configuration supports 12 CEC
drawers and one storage enclosure in the frame (connectors contain both eNet and ½
Duplex RS422).
1.5.4 Bulk Power Regulator
This section describes the bulk power regulator (BPR).
Input voltage requirements
The BPR supports DC and AC input power for the Power 775. A single design accommodates
both the AC and DC range with different power cords for various voltage options.
DC requirements
The BPR features the following DC requirements:
The Bulk Power Assembly (BPA) is capable of operating over a range of 300 to 600 VDC
Nominal operating DC points are 375 VDC and 575 VDC
AC requirements
The BPR features the following AC requirements:
Three-phase input and GND (no neutral) at 50 to 60 Hz; the high-voltage input range is 380 to 480 Vac
Acceptable voltage tolerance @ machine power cord: 180 - 259 Vac (low-voltage range) or 333 - 508 Vac (high-voltage range)
Acceptable frequency tolerance @ machine power cord: 47 - 63 Hz
1.5.5 Water conditioning unit
The Power 775 WCU system is shown in Figure 1-35.
Figure 1-35 Power 775 water conditioning unit system
The hose and manifold assemblies and WCUs are shown in Figure 1-36.
Figure 1-36 Hose and manifold assemblies
The components of the WCU are shown in Figure 1-37 on page 47.
The WCU contains a dual float sensor assembly, a pressure relief valve and vacuum breaker, ball valves and quick connects on the system supply and return lines (to and from the electronics), chilled water supply and return connections, a proportional control valve, a flow meter, a check valve that is integrated into the reservoir tank, a reservoir tank, a plate heat exchanger, and a pump/motor assembly.
Figure 1-37 WCU components
The WCU schematics are shown in Figure 1-38.
Figure 1-38 WCU schematics
1.6 Disk enclosure
This section describes the storage disk enclosure for the Power 775 system.
1.6.1 Overview
The Power 775 disk enclosure features the following characteristics:
SAS Expander Chip (SEC):
– # PHYs: 38
– Each PHY capable of SAS SDR or DDR
384 SFF DASD drives:
– 96 carriers with four drives each
– 8 Storage Groups (STOR 1-8) with 48 drives each:
• 12 carriers per STOR
• Two Port Cards per STOR each with 3 SECs
– 32 SAS x4 ports (four lanes each) on 16 Port Cards.
Data rates:
– Serial Attached SCSI (SAS) SDR = 3.0 Gbps per lane (SEC to drive)
– Serial Attached SCSI (SAS) DDR = 6.0 Gbps per lane (SAS adapter in node to SEC)
The drawer supports 10 K/15 K rpm drives in 300 GB or 600 GB sizes.
A Joint Test Action Group (JTAG) interface is provided from the DC converter assemblies
(DCAs) to each SEC for error diagnostics and boundary scan.
Important: STOR is the short name for storage group (it is not an acronym).
The front view of the disk enclosure is shown in Figure 1-39 on page 49.
Figure 1-39 Disk enclosure front view
1.6.2 High-level description
Figure 1-40 on page 50 represents the top view of a disk enclosure and highlights the front
view of a STOR. Each STOR includes 12 carrier cards (six at the top of the drawer and six at
the bottom of the drawer) and two port cards.
Figure 1-40 Storage drawer top view
The disk enclosure is a SAS storage drawer that is specially designed for the IBM Power 775
system. The maximum storage capacity of the drawer is 230.4 TB, distributed over 384 SFF
DASD drives logically organized in eight groups of 48.
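The drive count and the maximum raw capacity follow from the carrier layout; a small Python sketch using the figures above:

CARRIERS = 96
DRIVES_PER_CARRIER = 4
STOR_GROUPS = 8
MAX_DRIVE_GB = 600

drives = CARRIERS * DRIVES_PER_CARRIER            # 384 drives per enclosure
drives_per_stor = drives // STOR_GROUPS           # 48 drives per STOR
max_capacity_tb = drives * MAX_DRIVE_GB / 1000    # 230.4 TB

print(drives, drives_per_stor, max_capacity_tb)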
The disk enclosure features two mid-plane boards that comprise the inner core assembly. The
disk drive carriers, port cards, and power supplies plug into the mid-plane boards. There are
four Air Moving Devices (AMD) in the center of the drawer. Each AMD consists of three
counter-rotating fans.
Each carrier contains connectors for four disk drives. The carrier features a solenoid latch that
is released only through a console command to prevent accidental unseating. The disk
carriers also feature LEDs close to each drive and a gold capacitor circuit so that drives are
identified for replacement after the carrier is removed for service.
Each port card includes four SAS DDR 4x ports (four lanes at 6 Gbps per lane). These incoming
SAS lanes connect to the input SEC, which directs the SAS traffic to the drives. Each drive is
connected to one of the output SECs on the port card with SAS SDR 1x (one lane at 3 Gbps).
There are two port cards per STOR. The first Port card connects to the A ports of all 48 drives
in the STOR. The second Port card connects to the B ports of all 48 drives in the STOR. The
port cards include soft switches for all 48 drives in the STOR (5 V and 12 V soft switches
connect and interrupt and monitor power). The soft switch is controlled by I2C from the SAS
Expander Chip (SEC) on the port card.
A fully cabled drawer includes 36 cables: four UPIC power cables and 32 SAS cables from
SAS adapters in the CEC. During service to replace a power supply, two UPIC cables
manage the current and power control of the entire drawer. During service of a port card, the
second port card in the STOR remains cabled to the CEC so that the STOR remains
operational. A customer minimum configuration is two SAS cables per STOR and four UPIC
power cables per drawer to ensure proper redundancy.
1.6.3 Configuration
A disk enclosure must reside in the same frame as the CEC to which it is cabled. A frame
might contain up to six Disk Enclosures. The disk enclosure front view is shown in
Figure 1-41.
Figure 1-41 Disk Enclosure front view
The disk enclosure internal view is shown in Figure 1-42 on page 52.
Figure 1-42 Disk Enclosure internal view
The disk carrier is shown in Figure 1-43.
Figure 1-43 Disk carrier
The disk enclosure includes the following features:
A disk enclosure is one-quarter, one-half, three-quarters, or fully populated with HDDs. The
disk enclosure is always populated with eight SSDs.
Disk enclosure contains two GPFS recovery groups (RGs). The carriers that hold the
disks of the RGs are distributed throughout all of the STOR domains in the drawer.
A GPFS recovery group consists of four SSDs and one to four declustered arrays (DAs) of
47 disks each.
Each DA contains distributed spare space that is two disks in size.
Every DA in a GPFS system must be the same size.
The granularity of capacity and throughput is an entire DA.
RGs in the GPFS system do not need to be the same size.
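The recovery group figures above account exactly for the 384 drive slots of a fully populated enclosure; a short Python sketch of that accounting (which assumes the four SSDs of each recovery group occupy drive slots alongside the declustered arrays):

RECOVERY_GROUPS = 2
SSDS_PER_RG = 4
MAX_DAS_PER_RG = 4
DISKS_PER_DA = 47
SPARE_DISKS_PER_DA = 2          # distributed spare space in each DA

drives_per_rg = SSDS_PER_RG + MAX_DAS_PER_RG * DISKS_PER_DA   # 4 + 188 = 192
total_drives = RECOVERY_GROUPS * drives_per_rg                # 384, a full enclosure
spare_space_per_rg = MAX_DAS_PER_RG * SPARE_DISKS_PER_DA      # 8 disks of spare space

print(drives_per_rg, total_drives, spare_space_per_rg)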
1.7 Cluster management
The cluster management hardware that supports the Cluster is placed in 42 U, 19-inch racks.
The cluster management requires Hardware Management Consoles (HMCs), redundant
Executive Management Servers (EMS), and the associated Ethernet network switches.
1.7.1 Hardware Management Console
The HMC runs on a single server and is used to help manage the Power 775 servers. The
traditional HMC functions for configuring and controlling the servers are done via xCAT. For
more information, see 1.9.3, “Extreme Cluster Administration Toolkit” on page 72.
The HMC is often used for the following tasks:
During installation
For reporting hardware serviceable events, especially through Electronic Service Agent™
(ESA), which is also commonly known as call-home
By service personnel to perform guided service actions
An HMC is required for every 36 CECs (1152 LPARs), and all Power 775 systems have
redundant HMCs. For every group of 10 HMCs, a spare HMC is in place. For example, if a
cluster requires four HMCs, five HMCs are present. If a cluster requires 16 HMCs, the cluster
has two HMCs that serve as spares.
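The spare-HMC rule quoted above can be expressed as a one-line calculation; a small Python sketch (redundancy of the HMCs themselves is handled separately):

import math

def hmcs_present(required_hmcs: int) -> int:
    # Required HMCs plus one spare HMC for every group of 10.
    return required_hmcs + math.ceil(required_hmcs / 10)

print(hmcs_present(4))    # 5, as in the example above
print(hmcs_present(16))   # 18 (16 required plus two spares)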
1.7.2 Executive Management Server
The EMS is a standard 4U POWER7 entry-level server responsible for cluster management
activities. EMSs often are redundant; however, a simplex configuration is supported in smaller
Power 775 deployments.
At the cluster level, a pair of EMSs provide the following maximum management support:
512 frames
512 supernodes
2560 disk enclosures
The EMS is the central coordinator of the cluster from a system management perspective.
The EMS is connected to, and manages, all cluster components: the frames and CECs,
HFI/ISR interconnect, I/O nodes, service nodes, and compute nodes. The EMS manages
these components through the entire lifecycle, including discovery, configuration, deployment,
monitoring, and updating via private network Ethernet connections. The cluster administrator
uses the EMS as their primary management cluster control point. The service nodes, HMCs,
and Flexible Service Processors (FSPs) are mostly transparent to the system administrator,
and therefore the cluster appears to be a single, flat cluster, despite the hierarchical
management infrastructure that is deployed by using xCAT.
1.7.3 Service node
Systems management throughout the cluster is a hierarchical structure (see Figure 1-44) to
achieve the scaling and performance necessary for a large cluster size. All the compute and
I/O nodes in a building block are initially booted via the HFI and managed by a dedicated
server that is called a
service node (SN) in the utility CECs.
Figure 1-44 EMS hierarchy
Two service nodes (one for redundancy) per 36 CECs/Drawers (1 - 36) are required for all
Power 775 clusters.
The two service nodes must reside in different frames, except under the following conditions:
If there is only one frame, the nodes must reside in different super nodes in the frame.
If there is only one super node in the frame, the nodes must reside in different CECs in the
super node.
If there are only two or three CEC drawers, the nodes must reside in different CEC
drawers.
If there is only one CEC drawer, the two Service nodes must reside in different octants.
The service node provides diskless boot and an interface to the management network. The
service node requires that a PCIe SAS adapter and two 600 GB PCIe form-factor HDDs (in
RAID1 for redundancy) be installed to support diskless boot. The recommended
location is shown in Figure 1-45. The SAS PCIe adapter must reside in PCIe slot 16 and the HDDs in
slots 15 and 14.
The service node also contains a 1 Gb Ethernet PCIe card that is in PCIe slot 17.
Figure 1-45 Service node
1.7.4 Server and management networks
Figure 1-46 on page 56 shows the logical structure of the two Ethernet networks for the
cluster, which are known as the service network and the management network. In Figure 1-46 on
page 56, the black nets designate the service network and the red nets designate the
management network.
The service network is a private, out-of-band network that is dedicated to managing the
Power 775 cluster hardware. This network provides Ethernet-based connectivity between
the FSP of the CEC, the frame control BPA, the EMS, and the associated HMCs. Two
identical network switches (ENET A and ENET B in the figure) are deployed to ensure high
availability of these networks.
The management network is primarily responsible for booting all nodes (the designated
service nodes, compute nodes, and I/O nodes) and monitoring their OS image loads. This
management network connects the dual EMSs running the system management software
with the various Power 775 servers of the cluster. Both the service and management
networks must be considered private and not routed into the public network of the enterprise
for security reasons.
Figure 1-46 Logical structure of service and management networks
1.7.5 Data flow
This section provides a high-level description of the data flow on the cluster service network
and cluster management operations.
After discovery of the hardware components, their definitions are stored in the xCAT database
on the EMS. HMCs, CECs, and frames are discovered via Service Location Protocol (SLP) by
xCAT. The discovery information includes model and serial numbers, IP addresses, and so
on. The Ethernet switch of the service LAN also is queried to determine which switch port is
connected to each component. This discovery is run again when the system is up, if wanted.
The HFI/ISR cabling also is tested by the CNM daemon on the EMS. The disk enclosures and
their disks are discovered by GPFS services on these dedicated nodes when they are booted
up.
1.7.6 LPARs
The hardware is configured and managed via the service LAN, which connects the EMS to
the HMCs, BPAs, and FSPs.
Management is hierarchical with the EMS at the top, followed by the service nodes, then all
the nodes in their building blocks. Management operations from the EMS to the nodes are
also distributed out through the service nodes. Compute nodes are deployed by using a
service node as the diskless image server.
Monitoring information comes from the sources (frames/CECs, nodes, HFI/ISR fabric, and so
on), flows through the service LAN and cluster LAN back to the EMS, and is logged in the
xCAT database.
The minimum hardware requirement for an LPAR is one POWER7 chip with memory attached
to its memory controller. If an LPAR is assigned to one POWER7 chip, that chip must have
memory on at least one of its memory controllers. If an LPAR is assigned to two, three, or four
POWER7 chips, at least one of those chips must have memory attached to it.
A maximum of one LPAR per POWER7 chip is supported. A single LPAR resides on one, two,
three, or four POWER7 chips. This configuration results in an Octant with the capability to
have one, two, three, or four LPARs. An LPAR cannot span two Octants. With this
configuration, the number of LPARs per CEC (eight Octants) ranges from 8 to 32 (4 x 8):
1 - 4 LPARs per Octant and 8 - 32 LPARs per CEC.
The following LPAR assignments are supported in an Octant:
LPAR with all processors and memory that is allocated to that LPAR
LPARs with 75% of processor and memory resources that are allocated to the first LPAR
and 25% to the second
LPARs with 50% of processor and memory resources that are allocated to each LPAR
LPARs with 50% of processor and memory resources that are allocated to the first LPAR
and 25% to each of the remaining two LPARs
LPARs with 25% of processor and memory resources that are allocated to each LPAR
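The supported octant layouts above can be summarized as fractions of the octant's processor and memory resources; a small Python sketch that checks each layout uses the whole octant and stays within 1 - 4 LPARs:

SUPPORTED_OCTANT_SPLITS = [
    (1.00,),                       # one LPAR with all resources
    (0.75, 0.25),                  # 75% / 25%
    (0.50, 0.50),                  # 50% / 50%
    (0.50, 0.25, 0.25),            # 50% / 25% / 25%
    (0.25, 0.25, 0.25, 0.25),      # four equal LPARs
]

for split in SUPPORTED_OCTANT_SPLITS:
    assert abs(sum(split) - 1.0) < 1e-9   # each layout allocates the whole octant
    assert 1 <= len(split) <= 4           # 1 - 4 LPARs per octant

print(f"{len(SUPPORTED_OCTANT_SPLITS)} supported LPAR layouts per octant")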
Recall that for an LPAR to be assigned, a POWER7 chip and memory that is attached to its
memory controller is required. If either one of the two requirements is not met, that POWER7
is skipped and the LPAR is assigned to the next valid POWER7 in the order.
1.7.7 Utility nodes
This section defines the utility node for all Power 775 frame configurations.
A CEC is defined as a Utility CEC (node) when it has the Management server (Service node)
as an LPAR. Each frame configuration is addressed individually. A single Utility LPAR
supports a maximum of 1536 LPARs, one of which is the Utility LPAR itself (one Utility LPAR
and 1535 other LPARs). Recall that an octant contains four POWER7 chips and a single POWER7
chip contains a maximum of one LPAR; therefore, a CEC contains 8 x 4 = 32 POWER7 chips. This
configuration results in up to 32 LPARs per CEC.
This result of 1536 LPARs translates to the following figures:
1536 POWER7 chips
384 Octants (1536 / 4 = 384)
48 CECs (a CEC can contain up to 32 LPARs; therefore, 1536 / 32 = 48)
There are always redundant utility nodes that reside in different frames when possible. If there
is only one frame and multiple SuperNodes, the utility node resides in different SuperNodes. If
there is only one SuperNode, the two utility nodes reside in different CECs. If there is only one
CEC, the two utility LPARs reside in different Octants in the CEC.
The following defined utility CEC is used in the four-frame, three-frame, two-frame, and
single-frame with 4, 8, and 12 CEC configurations. The single frame with 1 - 3 CECs uses a
different utility CEC definition. These utility CEC definitions are defined in their respective
frame definition sections.
The utility LPAR resides in Octant 0. The LPAR is assigned only to a single POWER7.
Figure 1-47 on page 59 shows the eight Octant CEC and the location of the Management
LPAR. The two Octant and the four Octant CEC might be used as a utility CEC and follows
the same rules as the eight Octant CEC.
Figure 1-47 Eight octant utility node definition
1.7.8 GPFS I/O nodes
Figure 1-48 shows the GPFS Network Shared Disk (NSD) node in Octant 0.
Figure 1-48 GPFS NSD node on octant 0
1.8 Connection scenario between EMS, HMC, and Frame
The network interconnect between the different system components (EMS server, HMC,
Frame) is required for managing, running, maintaining, configuring, and monitoring the
cluster. The management rack for a POWER 775 Cluster houses the different components,
such as the EMS servers (IBM POWER 750), HMCs, network switches, I/O drawers for the
EMS data disks, keyboard, and mouse. The different networks that are used in such an
environment are the management network and the service network (as shown in Figure 1-49
on page 61). The customer network is connected to some components, but for the actual
cluster, only the management and service networks are essential. For more information about
the server and management networks, see 1.7.4, “Server and management networks” on
page 55.
Figure 1-49 Typical cabling scenario for the HMC, the EMS, and the frame
In Figure 1-49, you see the different networks and cabling. Each Frame has two Ethernet
ports on the BPCH to connect the Service Network A and B.
The I/O drawers in which the disks are installed for the EMS Servers also are interconnected.
Therefore, the data is secured with RAID6 and the I/O drawers also are software mirrored.
This means that when one EMS server goes down for any reason, the other EMS server
accesses the data. The EMS servers are redundant in this scenario, but there is no
automated high-availability process for recovery of a failed EMS server.
All actions to activate the second EMS server must be performed manually. There also is no
plan to automate this process. A cluster continues running without the EMS servers (in case
both servers failed). No node fails because of a server failure or an HMC error. When multiple
problems arise simultaneously, there might be a greater need for more intervention, but often
this intervention does not occur under normal circumstances.
1.9 High Performance Computing software stack
The stack spans the Power and P7IH hardware platforms; the network adapters (Galaxy2, IB HCAs, and HFI (GSM)); the IB and HFI-ISR networks; the AIX and Linux operating systems, with kernel space components (HYP, device drivers, IF_LS, and IP) below the user space; GPFS; the LoadLeveler scheduler; and xCAT. The communication path includes the HAL (HFI (GSM) and IB), AIX/OFED IB verbs, and LAPI, which provides reliable FIFO, RDMA, striping over multiple links, failover and recovery, pre-emption, user-space statistics, multi-protocol support, multilink/bonding support, and scalability. Jitter mitigation is based on a synchronized global clock.
1.9.1 Integrated Switch Network Manager
The ISNM subsystem package is installed on the executive management server of a
high-performance computing cluster that consists of IBM Power 775 Supercomputers, and it
contains the network management commands. The Local Network Management Controller
runs on the service processor of each drawer and is shipped with the Power 775.
Network management services
As shown in Figure 1-51 on page 66, the ISNM provides the following services:
ISR network configuration and installation:
– Topology validation
– Miswire detection
– Works with cluster configuration as defined in the cluster database
– Hardware global counter configuration
– Phased installation and Optical Link Connectivity Test (OLCT)
ISR network hardware status:
– Monitors for ISR, HFI, link, and optical module events
– Command line queries to display network hardware status
– Performance counter collection
– Some RMC monitor points (for example, HFI Down)
Network maintenance:
– Set up ISR route tables during drawer power-on
– Thresholds on certain link events, might disable a link
– Dynamically update route tables to reroute around problems or add to routes when
CECs power on
– Maintain data to support software route mode choices, makes the data available to the
OS through PHYP
– Monitor global counter health
Report hardware failures:
– Analysis is performed on the EMS
– Most events that are forwarded to TEAL Event DB and Alert DB
– Link events due to CEC power off/power on are consolidated within CNM to reduce
unnecessary strain on analysis
– Events reported via TEAL to Service Focal Point™ on the HMC
Figure 1-51 ISNM operating environment
MCRSA for the ISNM: IBM offers Machine Control Program Remote Support Agreement
(MCRSA) for the ISNM. This agreement includes remote call-in support for the central
network manager and the hardware server components of the ISNM, and for the local
network management controller machine code.
MCRSA enables a single-site or worldwide enterprise customer to maintain machine code
entitlement to remote call-in support for ISNM throughout the life of the MCRSA.
Figure 1-52 ISNM distributed architecture
A high-level representation of the ISNM distributed architecture is shown in Figure 1-52. An
instance of the Local Network Management Controller (LNMC) software runs on each FSP. Each LNMC
generates routes for the eight hubs in the local drawer specific to the supernode, drawer, and
hub.
A Central Network Manager (CNM) runs on the EMS and communicates with the LNMCs.
Link status and reachability information flows between the LNMC instances and CNM.
Network events flow from LNMC to CNM, and then to Toolkit for Event Analysis and Logging
(TEAL).
Local Network Management Controller
The LNMC present on each node features the following primary functions:
Event management:
– Aggregates local hardware events, local routing events, and remote routing events.
Route management:
– Generates routes that are based on configuration data and the current state of links in
the network.
Hardware access:
– Downloads routes.
– Allows the hardware to be examined and manipulated.
Figure 1-53 on page 68 shows a logical representation of these functions.
Figure 1-53 LNMC functional blocks
The LNMC also interacts with the EMS and with the ISR hardware to support the execution of
vital management functions. Figure 1-54 on page 69 provides a high-level visualization of the
interaction between the LNMC components and other external entities.
Figure 1-54 LNMC external interactions
As shown in Figure 1-54, the following external interactions are featured in the LNMC:
1. Network configuration commands
The primary function of this procedure is to uniquely identify the Power 775 server within
the network. This includes the following information:
– Network topology
– Supernode identification
– Drawer identification within the Supernode
– Frame identification (from BPA via FSP)
– Cage identification (from BPA via FSP)
– Expected neighbors table for mis-wire detection
2. Local network hardware events
All network hardware events flow from the ISR into the LNMC's event management, where
they are examined and acted upon. The following are potential actions that are taken by
event management:
– Threshold checking
– Actions upon hardware
– Event aggregation
– Network status update. Involves route management and CNM reporting.
– Reporting to EMS
3. Network event reporting
Event management examines each local network hardware event and, if appropriate,
forwards the event to the EMS for analysis and reports to the service focal point. Event
management also sends the following local routing events that indicate changes in the link
status or route tables within the local drawer that other LNMCs need to react to:
– Link usability masks (LUM): One per hub in the drawer, indicates whether each link on
that hub is available for routing
– PRT1 and PRT2 validity vectors: One each per hub in the drawer; additional data that is
used in making routing decisions
General changes in LNMC or network status are also reported via this interface.
4. Remote network events
After a local routing event (LUM, PRT1, PRT2) is received by CNM, CNM determines
which other LNMCs need the information to make route table updates, and sends the
updates to the LNMCs.
The events are aggregated together by event management and then passed to route
management. Route management generates a set of appropriate route table updates and
potentially some PRT1 and PRT2 events of its own.
Changed routes are downloaded via hardware access. Event management sends out new
PRT1 and PRT2 events, if applicable.
5. Local hardware management
Hardware access provides the following facilities to both LNMC and CNM for managing
the network hardware:
– Reads and writes route tables
– Reads and writes hardware registers
– Disables and enables ports
– Controls optical link connectivity test
– Allows management of multicast
– Allows management of global counter
– Reads and writes performance counters
6. Centralized hardware management
The following functions are managed centrally by CNM with support from LNMC:
– Global counter
– Multicast
– Port Enable/Disable
Central Network Manager
The CNM daemon waits for events and handles each one as separate transactions. There are
software threads within CNM that handle different aspects of the network management tasks.
The service network traffic flows through another daemon called High Performance
Computing Hardware Server.
Figure 1-55 on page 71 shows the relationships between the CNM software components. The
components are described in the following section.
Figure 1-55 CNM software structure
Communication layer
This layer provides a packet library with methods for communicating with LNMC. The layer
manages incoming and outgoing messages between FSPs and CNM component message
queues.
The layer also manages event aggregation and the virtual connections to the Hardware
Server.
Database component
This component maintains the CNM internal network hardware database and updates the
status fields in this in-memory database to support reporting the status to the administrator.
The component also maintains required reachability information for routing.
Routing component
This component builds and maintains the hardware multicast tree. The component also writes
multicast table contents to the ISR and handles the exchange of routing information between
the LNMCs to support route generation and maintenance.
Global counter component
This component sets up and monitors the hardware global counter. The component also
maintains information about the location of the ISR master counter and configured backups.
Recovery component
The recovery component gathers network hardware events and frame-level events. This
component also logs each event in the CNM_ERRLOG and sends most events to TEAL.
The recovery component also performs some event consolidation to avoid flooding the TEAL
with too many messages in the event of a CEC power up or power down.
Performance counter data management
This data management periodically collects ISR and HFI aggregate performance counters
from the hardware and stores the counters in the cluster database. The collection interval and
amount of data to keep are configurable.
Command handler
This handler is a socket listener to the command module. This handler manages ISNM
commands, such as requests for hardware status, configuration for LNMC, and link
diagnostics.
IBM High Performance Computing Hardware Server
In addition to the CNM software components, the HPC Hardware Server (HWS) handles the
connections to the service network. Its primary function is to manage connections to service
processors and provide an API for clients to communicate with the service processors. HWS
assigns every service processor connection a unique handle that is called a
number
hardware.
(vport). This handle is used by clients to send synchronous commands to the
virtual port
In a Power 775 cluster, HPC HWS runs on the EMS, and on each xCAT service node.
1.9.2 DB2
IBM DB2 Workgroup Server Edition 9.7 for High Performance Computing (HPC) V1.1 is a
scalable, relational database that is designed for use in a local area network (LAN)
environment and provides support for both local and remote DB2 clients. DB2 Workgroup
Server Edition is a multi-user version of DB2 packed with features that are designed to reduce
the overall costs of owning a database. DB2 includes data warehouse capabilities, high
availability function, and is administered remotely from a satellite control database.
The IBM Power 775 Supercomputer cluster solution requires a database to store all of the
configuration and monitoring data. DB2 Workgroup Server Edition 9.7 for HPC V1.1 is
licensed for use only on the executive management server (EMS) of the Power 775
high-performance computing cluster.
The EMS serves as a single point of control for cluster management of the Power 775 cluster.
The Power 775 cluster also includes a backup EMS, service nodes, compute nodes, I/O
nodes, and login nodes. DB2 Workgroup Server Edition 9.7 for HPC V1.1 must be installed
on the EMS and backup EMS.
1.9.3 Extreme Cluster Administration Toolkit
Extreme Cloud Administration Toolkit (xCAT) is an open source, scalable distributed
computing management and provisioning tool that provides a unified interface for hardware
control, discovery, and stateful and stateless operating system deployment. This robust toolkit
is used for the deployment and administration of AIX or Linux clusters, as shown in
Figure 1-56 on page 73.
xCAT makes simple clusters easy and complex clusters possible through the following
features:
Remotely controlling hardware functions, such as power, vitals, inventory, event logs, and
alert processing. xCAT indicates which light path LEDs are lit up remotely.
Managing server consoles remotely via serial console, SOL.
Installing an AIX or Linux cluster with utilities for installing many machines in parallel.
Managing an AIX or Linux cluster with tools for management and parallel operation.
Setting up a high-performance computing software stack, including software for batch job
submission, parallel libraries, and other software that is useful on a cluster.
Creating and managing stateless and diskless clusters.
Figure 1-56 xCAT architecture
xCAT supports both Intel and POWER based architectures, which provide operating system
support for AIX, Linux (RedHat, SuSE, and CentOS), and Windows installations. The following
provisioning methods are available:
Local disk
Stateless (via Linux ramdisk support)
iSCSI (Windows and Linux)
xCAT manages a Power 775 cluster by using a hierarchical distribution that is based on
management and service nodes. A single xCAT management node with multiple service
nodes provides boot services to increase scaling (to thousands and up to tens of thousands
of nodes).
The number of nodes and network infrastructure determine the number of Dynamic Host
Configuration Protocol/Trivial File Transfer Protocol/Hypertext Transfer Protocol
(DHCP/TFTP/HTTP) servers that are required for a parallel reboot without
DHCP/TFTP/HTTP timeouts.
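As a back-of-the-envelope aid only, the sketch below estimates how many provisioning servers a parallel reboot might need. The per-server capacity numbers are hypothetical placeholders, not IBM sizing guidance; real sizing must be derived from the timeout behavior observed on the actual network infrastructure.

import math

def servers_needed(node_count, nodes_per_server):
    """Smallest number of DHCP/TFTP/HTTP servers that keeps the per-server load bounded."""
    return math.ceil(node_count / nodes_per_server)

nodes = 2560                                              # hypothetical cluster size
print("DHCP servers:", servers_needed(nodes, 1000))       # assumed 1000 concurrent leases per server
print("TFTP/HTTP servers:", servers_needed(nodes, 250))   # assumed 250 parallel image pulls per server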
The number of DHCP servers does not need to equal the number of TFTP or HTTP servers.
TFTP servers NFS-mount the /tftpboot and image directories read-only from the management node to provide a consistent set of kernel, initrd, and file system images.
xCAT version 2 provides the following enhancements that address the requirements of a
Power 775 cluster:
Improved ACLs and non-root operator support:
– Certificate-authenticated client/server XML protocol for all xCAT commands
Choice of databases:
– Use a database (DB) like SQLite, or an enterprise DB like DB2 or Oracle
– Stores all of the cluster config data, status information, and events
– Information is stored in DB by other applications and customer scripts
– Data change notification is used to drive automatic administrative operations
Improved monitoring:
– Hardware event and Simple Network Management Protocol (SNMP) alert monitoring
– More HPC stack (GPFS, LL, Torque, and so on) setup and monitoring
Improved RMC conditions (a conceptual sketch of duration-based triggering and event batching follows this list):
– Condition triggers only when it is true for a specified duration
– Batch multiple events into a single invocation of the response
– Micro-sensors: ability to extend RMC monitoring efficiently
– Performance monitoring and aggregation that is based on TEAL and RMC
Automating the deployment process:
– Automate creation of LPARs in every CEC
– Automate set up of infrastructure nodes (service nodes and I/O nodes)
– Automate configuration of network adaptors, assign node names/IDs, IP addresses,
and so on
– Automate choosing and pushing the corresponding operating system and other HPC
software images to nodes
– Automate configuration of the operating system and HPC software so that the system
is ready to use
– Automate verification of the nodes to ensure their availability
Boot nodes with a single shared image among all nodes of a similar configuration
(diskless support)
Allow for deploying the cluster in phases (for example, adding a set of new nodes at a time by using the existing cluster)
Scan the connected networks to discover the various hardware components and firmware
information of interest:
– Uses the standard Service Location Protocol (SLP)
– Finds FSPs, BPAs, and hardware control points
Automatically defines the discovered components to the administration software,
assigning IP addresses, and hostnames
Hardware control (for example, powering components on and off) is automatically
configured
ISR and HFI components are initialized and configured
All components are scanned to ensure that firmware levels are consistent and at the
wanted version
Firmware is updated on all down-level components when necessary
Provide software inventory:
– Utilities to query the software levels that are installed in the cluster
– Utilities to choose updates to be applied to the cluster
With diskless nodes, software updates are applied to the OS image on the server (nodes
apply the updates on the next reboot)
HPC software (LoadLeveler, GPFS, PE, ESSL, Parallel ESSL, compiler libraries, and so
on) is installed throughout the cluster by the system management software
HPC software relies on system management to provide configuration information; system management stores this configuration information in the management database
Uses RMC monitoring infrastructure for monitoring and diagnosing the components of
interest
Continuous operation (rolling update):
– Apply upgrades and maintenance to the cluster with minimal impact on running jobs
– Rolling updates are coordinated with CNM and LL to schedule updates (reboots) to a
limited set of nodes at a time, allowing the other nodes to still be running jobs
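The "Improved RMC conditions" item above can be pictured with the following conceptual Python sketch (this is not RSCT code): a condition fires only after it has been true for a configured duration, and triggered events are batched into a single response invocation. The thresholds, sampling rate, and function names are illustrative assumptions.

import time

DURATION_S = 30          # condition must stay true this long before triggering (assumed value)
BATCH_WINDOW_S = 10      # events gathered into one response invocation (assumed value)

def watch(sample_fn, respond_fn):
    """Poll a condition expression and batch its events into response invocations."""
    true_since = None
    pending = []
    last_flush = time.time()
    while True:
        value = sample_fn()
        now = time.time()
        if value:                        # the condition expression is currently true
            if true_since is None:
                true_since = now
            if now - true_since >= DURATION_S:
                pending.append((now, value))
                true_since = None        # re-arm after the trigger
        else:
            true_since = None
        if pending and now - last_flush >= BATCH_WINDOW_S:
            respond_fn(pending)          # one invocation handles the whole batch
            pending = []
            last_flush = now
        time.sleep(1)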
1.9.4 Toolkit for Event Analysis and Logging
The Toolkit for Event Analysis and Logging (TEAL) is a robust framework for low-level system
event analysis and reporting that supports both real-time and historic analysis of events.
TEAL provides a central repository for low-level event logging and analysis that addresses the
new Power 775 requirements.
The analysis of system events is delivered through alerts. A rules-based engine determines which alerts are delivered, and the TEAL configuration controls the manner in which problem notifications are delivered.
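A minimal sketch of the rules-based idea follows, assuming a simple in-memory event stream; the rule format, event fields, and alert text shown here are illustrative and are not the actual TEAL rule syntax.

# Each rule maps a predicate over low-level events to an alert description.
RULES = [
    (lambda e: e["component"] == "HFI" and e["severity"] == "error",
     "HFI link error detected; check ISNM link diagnostics"),
    (lambda e: e["component"] == "GPFS" and "pdisk" in e["message"],
     "GPFS Native RAID pdisk problem; review disk hospital status"),
]

def analyze(events):
    """Return the alerts produced by running every rule over the event stream."""
    alerts = []
    for event in events:
        for predicate, alert_text in RULES:
            if predicate(event):
                alerts.append({"event": event, "alert": alert_text})
    return alerts

sample = [{"component": "HFI", "severity": "error", "message": "CRC errors on D-link"}]
print(analyze(sample))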
Real-time analysis provides a pro-active approach to system management, and the historical
analysis allows for deeper on-site and off-site debugging.
The primary users of TEAL are the system administrator and operator. The output of TEAL is
delivered to an alert database that is monitored by the administrator and operators through a
series of monitoring methods.
TEAL runs on the EMS and commands are issued via the EMS command line. TEAL
supports the monitoring of the following functions:
For more information about TEAL, see Table 1-6 on page 62.
1.9.5 Reliable Scalable Cluster Technology
Reliable Scalable Cluster Technology (RSCT) is a set of software components that provide a
comprehensive clustering environment for AIX, Linux, Solaris, and Windows. RSCT is the
infrastructure that is used by various IBM products to provide clusters with improved system availability, scalability, and ease of use.
RSCT includes the following components:
Resource monitoring and control (RMC) subsystem
This subsystem is the scalable, reliable backbone of RSCT. RMC runs on a single
machine or on each node (operating system image) of a cluster and provides a common
abstraction for the resources of the individual system or the cluster of nodes. You use
RMC for single system monitoring or for monitoring nodes in a cluster. However, in a
cluster, RMC provides global access to subsystems and resources throughout the cluster,
thus providing a single monitoring and management infrastructure for clusters.
RSCT core resource managers
A resource manager is a software layer between a resource (a hardware or software entity
that provides services to some other component) and RMC. A resource manager maps
programmatic abstractions in RMC into the actual calls and commands of a resource.
RSCT cluster security services
This RSCT component provides the security infrastructure that enables RSCT
components to authenticate the identity of other parties.
Topology services subsystem
This RSCT component provides node and network failure detection on some cluster configurations.
Group services subsystem
This RSCT component provides cross-node/process coordination on some cluster configurations.
1.9.6 GPFS
The IBM General Parallel File System (GPFS) is a distributed, high-performance, massively scalable enterprise file system solution that addresses the most challenging demands in high-performance computing.
GPFS provides online storage management, scalable access, and integrated information
lifecycle management tools capable of managing petabytes of data and billions of files.
Virtualizing your file storage space and allowing multiple systems and applications to share common pools of storage gives you the flexibility to transparently administer the infrastructure without disrupting applications. This configuration improves cost and energy
efficiency and reduces management overhead.
The massive namespace support, seamless capacity and performance scaling, proven reliability features, and flexible architecture of GPFS help your company foster innovation by simplifying your environment and streamlining data workflows for increased efficiency.
GPFS plays a key role in the shared storage configuration for Power 775 clusters. Virtually all
large-scale systems are connected to disk over HFI via GPFS Network Shared Disk (NSD) servers, which are referred to as GPFS I/O nodes or storage nodes in Power 775 terminology. The system interconnect offers higher performance than traditional storage fabrics, is far more scalable, and is RDMA capable.
GPFS includes a Native RAID function that is used to manage the disks in the disk
enclosures. In particular, the disk hospital function is queried regularly to ascertain the health of the disk subsystem. This querying is not always necessary because disk problems that require service are reported as HMC serviceable events and to TEAL.
For more information about GPFS, see Table 1-6 on page 62.
GPFS Native RAID
GPFS Native RAID is a software implementation of storage RAID technologies within GPFS.
By using conventional dual-ported disks in a Just-a-Bunch-Of-Disks (JBOD) configuration,
GPFS Native RAID implements sophisticated data placement and error correction algorithms
to deliver high levels of storage reliability, availability, and performance. Standard GPFS file
systems are created from the NSDs defined through GPFS Native RAID.
This section describes the basic concepts, advantages, and motivations behind GPFS Native
RAID: redundancy codes, end-to-end checksums, data declustering, and administrator
configuration, including recovery groups, declustered arrays, virtual disks, and virtual disk
NSDs.
Overview
GPFS Native RAID integrates the functionality of an advanced storage controller into the
GPFS NSD server. Unlike an external storage controller, in which configuration, LUN
definition, and maintenance are beyond the control of GPFS, GPFS Native RAID takes
ownership of a JBOD array to directly match LUN definition, caching, and disk behavior to
GPFS file system requirements.
Sophisticated data placement and error correction algorithms deliver high levels of storage
reliability, availability, serviceability, and performance. GPFS Native RAID provides a variation
of the GPFS NSD called a virtual disk, or VDisk. Standard NSD clients transparently access the VDisk NSDs of a file system by using the conventional NSD protocol.
GPFS Native RAID includes the following features:
Software RAID: GPFS Native RAID runs on standard AIX disks in a dual-ported JBOD
array, which does not require external RAID storage controllers or other custom hardware
RAID acceleration.
Declustering: GPFS Native RAID distributes client data, redundancy information, and spare space uniformly across all disks of a JBOD. This distribution reduces the rebuild (disk failure recovery process) overhead compared to conventional RAID.
Checksum: An end-to-end data integrity check (by using checksums and version
numbers) is maintained between the disk surface and NSD clients. The checksum
algorithm uses version numbers to detect silent data corruption and lost disk writes.
Data redundancy: GPFS Native RAID supports highly reliable two-fault tolerant and
three-fault-tolerant Reed-Solomon-based parity codes and three-way and four-way
replication.
Large cache: A large cache improves read and write performance, particularly for small
I/O operations.
Arbitrarily sized disk arrays: The number of disks is not restricted to a multiple of the RAID
redundancy code width, which allows flexibility in the number of disks in the RAID array.
Multiple redundancy schemes: One disk array supports VDisks with different redundancy
schemes; for example, Reed-Solomon and replication codes.
Disk hospital: A disk hospital asynchronously diagnoses faulty disks and paths, and
requests replacement of disks by using past health records.
Automatic recovery: Seamlessly and automatically recovers from primary server failure.
Disk scrubbing: A disk scrubber automatically detects and repairs latent sector errors in
the background.
Familiar interface: Standard GPFS command syntax is used for all configuration commands, including maintaining and replacing failed disks.
Flexible hardware configuration: Support of JBOD enclosures with multiple disks
physically mounted together on removable carriers.
Configuration and data logging: Internal configuration and small-write data are
automatically logged to solid-state disks for improved performance.
GPFS Native RAID features
This section describes three key features of GPFS Native RAID and how they work: data redundancy using RAID codes, end-to-end checksums, and declustering.
RAID codes
GPFS Native RAID automatically corrects for disk failures and other storage faults by
reconstructing the unreadable data by using the available data redundancy of either a
Reed-Solomon code or N-way replication. GPFS Native RAID uses the reconstructed data to
fulfill client operations, and in the case of disk failure, to rebuild the data onto spare space.
GPFS Native RAID supports two- and three-fault-tolerant Reed-Solomon codes and three-way and four-way replication, which detect and correct up to two or three concurrent faults. The redundancy code layouts that are supported by GPFS Native RAID, called tracks, are shown in Figure 1-57.
Figure 1-57 Redundancy codes that are supported by GPFS Native RAID
GPFS Native RAID supports two- and three-fault tolerant Reed-Solomon codes, which
partition a GPFS block into eight data strips and two or three parity strips. The N-way
replication codes duplicate the GPFS block on N - 1 replica strips.
GPFS Native RAID automatically creates redundancy information, depending on the
configured RAID code. By using a Reed-Solomon code, GPFS Native RAID equally divides a
GPFS block of user data into eight data strips and generates two or three redundant parity
strips. This configuration results in a stripe or track width of 10 or 11 strips and storage
efficiency of 80% or 73% (excluding user configurable spare space for rebuild).
By using N-way replication, a GPFS data block is replicated N - 1 times, implementing 1 + 2
and 1 + 3 redundancy codes, with the strip size equal to the GPFS block size. Thus, for every
block or strip written to the disks, N replicas of that block or strip are also written. This
configuration results in track width of three or four strips and storage efficiency of 33% or
25%.
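The storage-efficiency figures quoted above follow directly from the track widths; the short calculation below reproduces them (user-configurable spare space is excluded, as in the text).

def efficiency(data_strips, redundancy_strips):
    """Fraction of a track that holds user data."""
    return data_strips / (data_strips + redundancy_strips)

for name, d, r in [("8 + 2p Reed-Solomon", 8, 2),
                   ("8 + 3p Reed-Solomon", 8, 3),
                   ("3-way replication (1 + 2)", 1, 2),
                   ("4-way replication (1 + 3)", 1, 3)]:
    print(f"{name}: track width {d + r}, efficiency {efficiency(d, r):.0%}")
# Prints 80%, 73%, 33%, and 25%, matching the figures in the text.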
End-to-end checksum
Most implementations of RAID codes implicitly assume that disks reliably detect and report
faults, hard read errors, and other integrity problems. However, studies show that disks do not report some read faults and occasionally fail to write data, even though they report that the data was written.
These errors are often referred to as
silent errors, phantom-writes, dropped-writes, or
off-track writes. To compensate for these shortcomings, GPFS Native RAID implements an
end-to-end checksum that detects silent data corruption that is caused by disks or other
system components that transport or manipulate the data.
When an NSD client is writing data, a checksum of 8 bytes is calculated and appended to the
data before it is transported over the network to the GPFS Native RAID server. On reception,
GPFS Native RAID calculates and verifies the checksum. GPFS Native RAID stores the data,
a checksum, and version number to disk and logs the version number in its metadata for
future verification during read.
When GPFS Native RAID reads disks to satisfy a client read operation, it compares the disk
checksum against the disk data and the disk checksum version number against what is stored
in its metadata. If the checksums and version numbers match, GPFS Native RAID sends the
data along with a checksum to the NSD client. If the checksum or version numbers are
invalid, GPFS Native RAID reconstructs the data by using parity or replication and returns the
reconstructed data and a newly generated checksum to the client. Thus, both silent disk read
errors and lost or missing disk writes are detected and corrected.
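A minimal sketch of the end-to-end scheme follows, assuming an 8-byte checksum and a simple in-memory "disk"; the hash choice, record layout, and function names are illustrative assumptions, not the actual GPFS Native RAID format.

import hashlib

def checksum8(data: bytes) -> bytes:
    """8-byte checksum appended to the data by the NSD client (illustrative hash choice)."""
    return hashlib.blake2b(data, digest_size=8).digest()

disk = {}          # block id -> (data, checksum, version) as written to the disk surface
metadata = {}      # block id -> version number logged by the server for later verification

def server_write(block_id, data, csum):
    assert checksum8(data) == csum, "corruption detected on the network path"
    version = metadata.get(block_id, 0) + 1
    disk[block_id] = (data, csum, version)
    metadata[block_id] = version

def server_read(block_id):
    data, csum, version = disk[block_id]
    if checksum8(data) != csum or version != metadata[block_id]:
        raise IOError("silent corruption or lost write; rebuild from parity/replication")
    return data, csum

payload = b"GPFS block"
server_write(42, payload, checksum8(payload))
print(server_read(42)[0])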
Declustered RAID
Compared to conventional RAID, GPFS Native RAID implements a sophisticated data and
spare space disk layout scheme that allows for arbitrarily sized disk arrays and reduces the
overhead to clients that are recovering from disk failures. To accomplish this configuration,
GPFS Native RAID uniformly spreads or declusters user data, redundancy information, and
spare space across all the disks of a declustered array. A conventional RAID layout is
compared to an equivalent declustered array in Figure 1-58 on page 80.
Figure 1-58 Conventional RAID versus declustered RAID layouts
Figure 1-58 shows an example of how GPFS Native RAID improves client performance
during rebuild operations by using the throughput of all disks in the declustered array. This is
illustrated by comparing a conventional RAID of three arrays versus a declustered array, both
using seven disks. A conventional 1-fault-tolerant 1 + 1 replicated RAID array is shown with
three arrays of two disks each (data and replica strips) and a spare disk for rebuilding. To
decluster this array, the disks are divided into seven tracks, two strips per array. The strips
from each group are then spread across all seven disk positions, for a total of 21 virtual
tracks. The strips of each disk position for every track are then arbitrarily allocated onto the
disks of the declustered array (in this case, by vertically sliding down and compacting the
strips from above). The spare strips are uniformly inserted, one per disk.
As illustrated in Figure 1-59 on page 81, a declustered array significantly shortens the time
that is required to recover from a disk failure, which lowers the rebuild overhead for client
applications. When a disk fails, erased data is rebuilt by using all of the operational disks in
the declustered array, the bandwidth of which is greater than the fewer disks of a conventional
RAID group. If another disk fault occurs during a rebuild, the number of impacted tracks that
require repair is markedly less than the previous failure and less than the constant rebuild
overhead of a conventional array.
The rebuild impact and client overhead of a declustered array might be three to four times lower than those of a conventional RAID. Because GPFS stripes client data across all the
storage nodes of a cluster, file system performance becomes less dependent upon the speed
of any single rebuilding storage array.
Figure 1-59 Lower rebuild overhead in conventional RAID versus declustered RAID
When a single disk fails in the 1-fault-tolerant 1 + 1 conventional array on the left, the
redundant disk is read and copied onto the spare disk, which requires a throughput of seven
strip I/O operations. When a disk fails in the declustered array, all replica strips of the six
impacted tracks are read from the surviving six disks and then written to six spare strips, for a
throughput of two strip I/O operations. Figure 1-59 compares the resulting disk read and write I/O throughput during the rebuild operations.
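A worked version of the seven-disk example follows, assuming the 1 + 1 replication layout of Figure 1-58; the strip counts below simply restate the comparison in the text as arithmetic.

# Per-disk rebuild I/O load (the bottleneck that client applications feel):

# Conventional 1 + 1 array: the surviving partner disk and the spare carry the whole rebuild.
strips_on_failed_disk = 7                 # one strip per track on the failed disk
per_disk_conventional = strips_on_failed_disk        # 7 strip I/Os on each involved disk

# Declustered array: the six surviving disks share the work.
impacted_tracks = 6                       # tracks that lost a strip on the failed disk
total_strip_ios = 2 * impacted_tracks     # 6 replica reads + 6 spare-strip writes
surviving_disks = 6
per_disk_declustered = total_strip_ios / surviving_disks   # about 2 strip I/Os per disk

print(per_disk_conventional, per_disk_declustered)   # 7 versus 2.0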
Disk configurations
This section describes recovery group and declustered array configurations.
Recovery groups
GPFS Native RAID divides disks into recovery groups in which each disk is physically
connected to two servers: primary and backup. All accesses to any of the disks of a recovery
group are made through the active primary or backup server of the recovery group.
Building on the inherent NSD failover capabilities of GPFS, when a GPFS Native RAID server
stops operating because of a hardware fault, software fault, or normal shutdown, the backup
GPFS Native RAID server seamlessly assumes control of the associated disks of its recovery
groups.
Typically, a JBOD array is divided into two recovery groups that are controlled by different
primary GPFS Native RAID servers. If the primary server of a recovery group fails, control
automatically switches over to its backup server. Within a typical JBOD, the primary server for
a recovery group is the backup server for the other recovery group.
Figure 1-60 illustrates the ring configuration where GPFS Native RAID servers and storage
JBODs alternate around a loop. A particular GPFS Native RAID server is connected to two
adjacent storage JBODs and vice versa. The ratio of GPFS Native RAID servers to storage JBODs is thus one-to-one. The load on servers increases by 50% when a server fails.
Figure 1-60 GPFS Native RAID server and recovery groups in a ring configuration
Declustered arrays
A declustered array is a subset of the physical disks (pdisks) in a recovery group across
which data, redundancy information, and spare space are declustered. The number of disks
in a declustered array is determined by the RAID code-width of the VDisks that are housed in
the declustered array. One or more declustered arrays can exist per recovery group.
Figure 1-61 on page 83 illustrates a storage JBOD with two recovery groups, each with four
declustered arrays.
A declustered array can hold one or more VDisks. Because redundancy codes are associated with VDisks rather than with declustered arrays, a declustered array can simultaneously contain Reed-Solomon and replicated VDisks.
If the storage JBOD supports multiple disks that are physically mounted together on
removable carriers, removal of a carrier temporarily disables access to all of the disks in the
carrier. Thus, pdisks on the same carrier must not be in the same declustered array, as VDisk
redundancy protection is weakened upon carrier removal.
Declustered arrays are normally created when the recovery group is created, but new arrays can be created, or existing arrays grown, by adding pdisks later.
Figure 1-61 Example of declustered arrays and recovery groups in storage JBOD
Virtual and physical disks
A VDisk is a type of NSD that is implemented by GPFS Native RAID across all the pdisks of a
declustered array. Multiple VDisks are defined within a declustered array, typically
Reed-Solomon VDisks for GPFS user data and replicated VDisks for GPFS metadata.
Virtual disks
Whether a VDisk of a particular capacity is created in a declustered array depends on its
redundancy code, the number of pdisks and equivalent spare capacity in the array, and other
small GPFS Native RAID overhead factors. The mmcrvdisk command can automatically configure a VDisk of the largest possible size, given a redundancy code and the configured spare space of the declustered array.
In general, the number of pdisks in a declustered array cannot be less than the widest
redundancy code of a VDisk plus the equivalent spare disk capacity of a declustered array.
For example, a VDisk that uses the 11-strip-wide 8 + 3p Reed-Solomon code requires at least
13 pdisks in a declustered array with the equivalent spare space capacity of two disks. A
VDisk that uses the three-way replication code requires at least five pdisks in a declustered
array with the equivalent spare capacity of two disks.
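The sizing rule in the previous paragraph can be written as a one-line check; the helper below is a hedged aid for planning, not a GPFS utility.

def min_pdisks(code_width, equivalent_spare_disks):
    """Smallest declustered array that can host a VDisk with the given redundancy code width."""
    return code_width + equivalent_spare_disks

print(min_pdisks(11, 2))   # 8 + 3p Reed-Solomon with two spare-disk equivalents -> 13 pdisks
print(min_pdisks(3, 2))    # 3-way replication with two spare-disk equivalents   -> 5 pdisks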
VDisks are partitioned into virtual tracks, which are the functional equivalent of a GPFS block.
All VDisk attributes are fixed at creation and cannot be altered.
Physical disks
A pdisk is used by GPFS Native RAID to store user data and GPFS Native RAID internal
configuration data.
A pdisk is either a conventional rotating magnetic-media disk (HDD) or a solid-state disk
(SSD). All pdisks in a declustered array must have the same capacity.
Pdisks are also assumed to be dual-ported with one or more paths that are connected to the
primary GPFS Native RAID server and one or more paths that are connected to the backup
server. Often there are two redundant paths between a GPFS Native RAID server and
connected JBOD pdisks.
Solid-state disks
GPFS Native RAID assumes several solid-state disks (SSDs) in each recovery group in order
to redundantly log changes to its internal configuration and fast-write data in non-volatile
memory, which is accessible from either the primary or backup GPFS Native RAID servers
after server failure. A typical GPFS Native RAID log VDisk might be configured as three-way
replication over a dedicated declustered array of four SSDs per recovery group.
Disk hospital
The disk hospital is a key feature of GPFS Native RAID that asynchronously diagnoses errors
and faults in the storage subsystem. GPFS Native RAID times out an individual pdisk I/O
operation after approximately 10 seconds, limiting the effect of a faulty pdisk on a client I/O
operation. When a pdisk I/O operation results in a timeout, an I/O error, or a checksum
mismatch, the suspect pdisk is immediately admitted into the disk hospital. When a pdisk is
first admitted, the hospital determines whether the error was caused by the pdisk or by the
paths to it. While the hospital diagnoses the error, GPFS Native RAID, if possible, uses VDisk redundancy codes to reconstruct lost or erased strips for I/O operations that otherwise would have used the suspect pdisk.
Health metrics
The disk hospital maintains internal health assessment metrics for each pdisk: time badness,
which characterizes response times; and data badness, which characterizes media errors
(hard errors) and checksum errors. When a pdisk health metric exceeds the threshold, it is
marked for replacement according to the disk maintenance replacement policy for the
declustered array.
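As a conceptual sketch only (the thresholds, weights, and field names are invented for illustration and are not actual GPFS Native RAID values), the disk hospital bookkeeping described above could look like this:

from dataclasses import dataclass

TIME_BADNESS_LIMIT = 100.0    # illustrative thresholds
DATA_BADNESS_LIMIT = 50.0

@dataclass
class PdiskHealth:
    time_badness: float = 0.0   # grows with slow or timed-out I/O responses
    data_badness: float = 0.0   # grows with media (hard) errors and checksum mismatches
    marked_for_replacement: bool = False

    def record_slow_io(self, seconds_late):
        self.time_badness += seconds_late
        self._check()

    def record_media_error(self, weight=10.0):
        self.data_badness += weight
        self._check()

    def _check(self):
        if self.time_badness > TIME_BADNESS_LIMIT or self.data_badness > DATA_BADNESS_LIMIT:
            self.marked_for_replacement = True   # handled per the array's replacement policy

disk = PdiskHealth()
for _ in range(6):
    disk.record_media_error()
print(disk.marked_for_replacement)   # True once enough data badness accumulates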
The disk hospital logs selected Self-Monitoring, Analysis, and Reporting Technology
(SMART) data, including the number of internal sector remapping events for each pdisk.
Pdisk discovery
GPFS Native RAID discovers all connected pdisks when it starts, and then regularly
schedules a process that rediscovers pdisks that have newly become accessible to the GPFS Native RAID server. This configuration allows pdisks to be physically connected or connection
problems to be repaired without restarting the GPFS Native RAID server.
Disk replacement
The disk hospital tracks disks that require replacement according to the disk replacement
policy of the declustered array. The disk hospital is configured to report the need for
replacement in various ways. The hospital records and reports the FRU number and physical
hardware location of failed disks to help guide service personnel to the correct location with
replacement disks.
When multiple disks are mounted on a removable carrier, each of which is a member of a
different declustered array, disk replacement requires the hospital to temporarily suspend
other disks in the same carrier. To guard against human error, carriers are also not removable until GPFS Native RAID actuates a solenoid-controlled latch. In response to administrative commands, the hospital quiesces the appropriate disks, releases the carrier latch, and turns on identify lights on the carrier next to the disks that require replacement.
After one or more disks are replaced and the carrier is re-inserted, in response to
administrative commands, the hospital verifies that the repair took place. The hospital also
automatically adds any new disks to the declustered array, which causes GPFS Native RAID
to rebalance the tracks and spare space across all the disks of the declustered array. If
service personnel fail to reinsert the carrier within a reasonable period, the hospital declares
the disks on the carrier as missing and starts rebuilding the affected data.
Two Declustered Arrays/Two Recovery Group
Figure 1-62 shows a “Two Declustered Array/Two Recovery Group” configuration of a Disk
Enclosure. This configuration is referred to as 1/4 populated. The configuration features four SSDs (shown in dark blue in Figure 1-62) in the first recovery group and four SSDs (dark yellow in Figure 1-62) in the second recovery group.
Figure 1-62 Two Declustered Array/Two Recovery Group DE configuration
Four Declustered Arrays/Two Recovery Group
Figure 1-63 shows a Four Declustered Array/Two Recovery Group configuration of a disk
enclosure. This configuration is referred to as 1/2 populated. The configuration features four SSDs (shown in dark blue in Figure 1-63) in the first recovery group and four SSDs (dark yellow in Figure 1-63) in the second recovery group.
Figure 1-63 Four Declustered Array/Two Recovery Group DE configuration