Intel MPCMM0001 User Manual

Intel® NetStructure™ MPCMM0001 Chassis Management Module

Software Technical Product Specification

April 2005

Order Number: 273888-007

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Intel® NetStructure™ MPCMM0001 Chassis Management Module may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This Software Technical Product Specification as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.

Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling 1-800-548-4725 or by visiting Intel's website at http://www.intel.com.

AnyPoint, AppChoice, BoardWatch, BunnyPeople, CablePort, Celeron, Chips, CT Media, Dialogic, DM3, EtherExpress, ETOX, FlashFile, i386, i486, i960, iCOMP, InstantIP, Intel, Intel Centrino, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel Create & Share, Intel GigaBlade, Intel InBusiness, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel Play, Intel Play logo, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel TeamStation, Intel Xeon, Intel XScale, IPLink, Itanium, MCS, MMX, MMX logo, Optimizer logo, OverDrive, Paragon, PC Dads, PC Parents, PDCharm, Pentium, Pentium II Xeon, Pentium III Xeon, Performance at Your Command, RemoteExpress, SmartDie, Solutions960, Sound Mark, StorageExpress, The Computer Inside., The Journey Inside, TokenExpress, VoiceBrick, VTune, and Xircom are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others. Copyright © 2005, Intel Corporation. All rights reserved.

MPCMM0001 Chassis Management Module Software Technical Product Specification

Contents

1 Introduction
  1.1 Overview
  1.2 Terms Used in this Document

2 Software Specifications
  2.1 Red Hat* Embedded Debug and Bootstrap (Redboot)
  2.2 Operating System
  2.3 Command Line Interface (CLI)
  2.4 SNMP/UDP
  2.5 Remote Procedural Call (RPC) Interface
  2.6 RMCP
  2.7 Ethernet Interfaces
  2.8 Sensor Event Logs (SEL)
    2.8.1 CMM SEL Architecture
    2.8.2 Retrieving a SEL
    2.8.3 Clearing the SEL
    2.8.4 Retrieving the Raw SEL
  2.9 Blade OverTemp Shutdown Script

3 Redundancy, Synchronization, and Failover
  3.1 Overview
  3.2 Synchronization
  3.3 Heterogeneous Synchronization
    3.3.1 SDR/SIF Synchronization
    3.3.2 User Scripts Synchronization and Configuration
    3.3.3 Synchronization Requirements
  3.4 Initial Data Synchronization
    3.4.1 Initial Data Sync Failure
  3.5 Datasync Status Sensor
    3.5.1 Sensor bitmap
    3.5.2 Event IDs
    3.5.3 Querying the Datasync Status
    3.5.4 SEL Event
    3.5.5 SNMP Trap
    3.5.6 System Health
  3.6 CMM Failover
    3.6.1 Scenarios That Prevent Failover
    3.6.2 Scenarios That Failover to a Healthier Standby CMM
    3.6.3 Manual Failover
    3.6.4 Scenarios That Force a Failover
  3.7 CMM Ready Event

4 Built-In Self Test (BIST)
  4.1 BIST Test Flow
  4.2 Boot-BIST
  4.3 Early-BIST
  4.4 Mid-BIST
  4.5 Late-BIST
  4.6 QuickBoot Feature
    4.6.1 Configuring QuickBoot
  4.7 Event Log Area and Event Management
  4.8 OS Flash Corruption Detection and Recovery Design
    4.8.1 Monitoring the Static Images
    4.8.2 Monitoring the Dynamic Images
    4.8.3 CMM Failover
  4.9 BIST Test Descriptions
    4.9.1 Flash Checksum Test
    4.9.2 Base Memory Test
    4.9.3 Extended Memory Tests
    4.9.4 FPGA Version Check
    4.9.5 DS1307 RTC (Real-Time Clock) Test
    4.9.6 NIC Presence/Local PCI Bus Test
    4.9.7 OS Image Checksum Test
    4.9.8 CRC32 Checksum
    4.9.9 IPMB Bus Busy/Not Ready Test

5 Re-enumeration
  5.1 Overview
  5.2 Re-enumeration on Failover
  5.3 Re-enumeration of M5 FRU
  5.4 Resolution of EKeys
  5.5 Events Regeneration

6 Process Monitoring and Integrity
  6.1 Overview
    6.1.1 Process Existence Monitoring
    6.1.2 Thread Watchdog Monitoring
    6.1.3 Process Integrity Monitoring
  6.2 Processes Monitored
  6.3 Process Monitoring Targets
  6.4 Process Monitoring Dataitems
    6.4.1 Examples
  6.5 SNMP MIB Commands
  6.6 Process Monitoring CMM Events
  6.7 Failure Scenarios and Eventing
    6.7.1 No Action Recovery
    6.7.2 Successful Restart Recovery
    6.7.3 Successful Failover/Restart Recovery
    6.7.4 Successful Failover/Reboot Recovery
    6.7.5 Failed Failover/Reboot Recovery, Non-Critical
    6.7.6 Failed Failover/Reboot Recovery, Critical
    6.7.7 Excessive Restarts, Escalate No Action
    6.7.8 Excessive Restarts, Successful Escalate Failover/Reboot
    6.7.9 Excessive Restarts, Failed Escalate Failover/Reboot, Non-Critical
    6.7.10 Excessive Restarts, Failed Escalate Failover/Reboot, Critical
    6.7.11 Process Administrative Action
    6.7.12 Excessive Failover/Reboots, Administrative Action
  6.8 Process Integrity Executable (PIE)
  6.9 Configuring pms.ini
    6.9.1 Global Data
    6.9.2 Process Specific Data
    6.9.3 Process Definition Section of pms.ini
  6.10 Process Integrity Executable (PIE) Specific Data Config
    6.10.1 PIE Section Name
    6.10.2 Process Integrity Executable
    6.10.3 Unique ID
    6.10.4 Administrative State
    6.10.5 Process Integrity Interval
    6.10.6 Chassis Applicability
    6.10.7 PmsPieSnmp Command Line
    6.10.8 SNMP PIE Section of pms.ini
  6.11 WP/BPM PIE
    6.11.1 WP/BPM Section of pms.ini

7 Power and Hot Swap Management
  7.1 Hot Swap States
  7.2 FRU Insertion
  7.3 Graceful FRU Extraction
  7.4 Surprise FRU Extraction/IPMI Failure
  7.5 Forced Power State Changes
  7.6 Power Management on the Standby CMM
  7.7 Power Feed Targets
  7.8 Pinging IPMI Controllers

8 The Command Line Interface (CLI)
  8.1 CLI Overview
  8.2 Connecting to the CLI
    8.2.1 Connecting through a Serial Port Console
  8.3 Initial Setup - Logging in for the First Time
    8.3.1 Setting IP Address Properties
    8.3.2 Setting a Hostname
    8.3.3 Setting the Amount of Time for Auto-Logout
    8.3.4 Setting the Date and Time
    8.3.5 Telnet into the CMM
    8.3.6 Connect Through SSH (Secure Shell)
    8.3.7 FTP into the CMM
    8.3.8 Rebooting the CMM
  8.4 CLI Command Line Syntax and Arguments
    8.4.1 Cmmget and Cmmset Syntax
    8.4.2 Help Parameter: -h
    8.4.3 Location Parameter: -l
    8.4.4 Target Parameter: -t
    8.4.5 Dataitem Parameter: -d
    8.4.6 Value Parameter: -v
    8.4.7 Sample CLI Operations
  8.5 Generating a System Status Report

9 Resetting the Password
  9.1 Resetting the Password in a Dual CMM System
  9.2 Resetting the Password in a Single CMM System

10 Sensor Types
  10.1 CMM Sensor Types
  10.2 Threshold-Based Sensors
    10.2.1 Threshold-Based Sensor Events
  10.3 CMM Voltage/Temp Sensor Thresholds
  10.4 Discrete Sensors
    10.4.1 Discrete Sensor Events

11 Health Events
  11.1 Syntax of Health Event Strings
    11.1.1 Healthevents Query Event Syntax
    11.1.2 SEL Event Syntax
    11.1.3 SEL Sensor Types
    11.1.4 SNMP Trap Event Syntax
  11.2 Sensor Targets
  11.3 Healthevents Queries
    11.3.1 HealthEvents Queries for Individual Sensors
    11.3.2 HealthEvents Queries for All Sensors on a Location
    11.3.3 No Active Events
    11.3.4 Not Present or Non-IPMI Locations
  11.4 List of Possible Health Event Strings
    11.4.1 All Locations
    11.4.2 CMM Location
    11.4.3 Chassis Location
  11.5 IPMI Error Completion Codes
    11.5.1 Configuring IPMI Error Completion Codes
    11.5.2 IPMI/IMB Error Message Format

12 Front Panel LEDs
  12.1 LED Types and States
    12.1.1 Alarm LEDs
    12.1.2 Health LED
    12.1.3 Hot Swap LED
    12.1.4 User Definable LEDs
  12.2 Retrieving a Location’s LED properties
  12.3 Retrieving Color Properties of LEDs
  12.4 Retrieving the State of LEDs
  12.5 Setting the State of the User LEDs
  12.6 LED Boot Sequence

13 Node Power Control
  13.1 Node Operational State Management
  13.2 Obtaining the Power State of a Board
  13.3 Controlling the Power State of a Board
    13.3.1 Powering Off a Board
    13.3.2 Powering On a Board
    13.3.3 Resetting a Board

14 Electronic Keying Manager
  14.1 Point-to-Point EKeying
  14.2 Bused EKeying
  14.3 EKeying CLI Commands

15 CDMs and FRU Information
  15.1 Chassis Data Module
  15.2 FRU/CDM Election Process
  15.3 FRU Information
  15.4 FRU Query Syntax

16 Fan Control and Monitoring
  16.1 Automatic Fan Control
  16.2 Querying Fan Tray Sensors - FantrayN location
  16.3 Fantray Cooling Levels
  16.4 CMM Cooling Manager Temperature Status
  16.5 CMM Cooling Table
    16.5.1 Setting Values in the Cooling Table
  16.6 Control Modes for Fan Trays
    16.6.1 CMM Control Mode
    16.6.2 Fantray Control Mode
    16.6.3 Emergency Shutdown Control Mode
    16.6.4 User Initiated Mode Change
    16.6.5 Automatic Mode Change
  16.7 Getting Temperature Statuses
  16.8 Fantray Properties
  16.9 Retrieving the Current Cooling Level
  16.10 Fantray Insertion
  16.11 Default Cooling Values
    16.11.1 Vendor Defaults
    16.11.2 Structure of /etc/cmm/fantray.cfg
    16.11.3 Code Defaults
    16.11.4 Restoring Defaults
  16.12 Firmware Upgrade/Downgrade
  16.13 Chassis vs. Fantray
  16.14 Legacy Method of Querying/Setting Fan Speed

17 SNMP ........ 140
17.1 CMM MIB ........ 141
17.2 MIB Design ........ 141
17.2.1 MIB Tree ........ 141
17.2.2 CMM MIB Objects ........ 142
17.3 SNMP Agent ........ 158
17.3.1 Configuring the SNMP Agent Port ........ 158
17.3.2 Configuring the Agent to Respond to SNMP v3 Requests ........ 158
17.3.3 Configuring the Agent Back to SNMP v1 ........ 159
17.3.4 Setting up an SNMP v1 MIB Browser ........ 159
17.3.5 Setting up an SNMP v3 MIB Browser ........ 159
17.3.6 Changing the SNMP MD5 and DES Passwords ........ 159


17.4 SNMP Trap Utility ........ 160
17.4.1 Configuring the SNMP Trap Port ........ 160
17.4.2 Configuring the CMM to Send SNMP v3 Traps ........ 160
17.4.3 Configuring the CMM to Send SNMP v1 Traps ........ 160
17.5 Configuring and Enabling SNMP Trap Addresses ........ 160
17.5.1 Configuring an SNMP Trap Address ........ 161
17.5.2 Enabling and Disabling SNMP Traps ........ 161
17.5.3 Alerts Using SNMP v3 ........ 161
17.5.4 Alert Using UDP Alert ........ 161
17.6 SNMP Security ........ 162
17.6.1 SNMP v1 Security ........ 162
17.6.2 SNMP v3 Security - Authentication Protocol and Privacy Protocol ........ 162
17.7 SNMP Trap Descriptions ........ 162
17.8 Snmpd.conf File ........ 163

18 CMM Scripting ........ 164
18.1 CLI Scripting ........ 164
18.1.1 Script Synchronization ........ 164
18.2 Event Scripting ........ 164
18.2.1 Listing Scripts Associated With Events ........ 165
18.2.2 Removing Scripts From an Associated Event ........ 165
18.3 Setting Scripts for Specific Individual Events ........ 165
18.3.1 Event Codes ........ 165
18.3.2 Setting Event Action Scripts ........ 166
18.4 Running CMM Event Scripts on CMM State Transitions (Active/Standby/Ready/Not Ready) ........ 166
18.4.1 Sensor Data Bits ........ 166
18.4.2 Retrieving the Value of the Data Sensor Bits ........ 167
18.4.3 CMMReadyTimeout Value ........ 168
18.4.4 CMM State Transition Model ........ 168
18.5 FRU Control Script ........ 169
18.5.1 Command line arguments ........ 170
18.5.2 Sample frucontrol file ........ 170

19 Remote Procedure Calls (RPC) ........ 174
19.1 Setting Up the RPC Interface ........ 174
19.2 Using the RPC Interface ........ 174
19.2.1 GetAuthCapability() ........ 175
19.2.2 ChassisManagementApi() ........ 175
19.2.3 ChassisManagementApi() Threshold Response Format ........ 181
19.2.4 ChassisManagementApi() String Response Format ........ 181
19.2.5 ChassisManagementApi() Integer Response Format ........ 185
19.2.6 FRU String Response Format ........ 186
19.3 RPC Sample Code ........ 187
19.4 RPC Usage Examples ........ 187

20 RMCP ........ 190
20.1 RMCP References ........ 190
20.2 RMCP Modes ........ 190
20.3 RMCP User Privilege Levels ........ 191
20.4 RMCP Discovery ........ 191


20.5 RMCP Session Activation ........ 191
20.6 RMCP Port Numbers ........ 192
20.7 IPMB Slave Addresses ........ 193
20.8 CMM RMCP Configuration ........ 193
20.9 IPMI Commands Supported by CMM RMCP ........ 194
20.10 Configuring IPMI Command Privileges ........ 196
20.10.1 Sample cmdPrivillege.ini file ........ 197
20.11 Completion Codes for the RMCP Messages ........ 197

21 Command and Error Logging ........ 199
21.1 Command Logging ........ 199
21.2 Error Logging ........ 199
21.2.1 Error.log File ........ 199
21.2.2 Debug.log File ........ 199
21.3 Cmmdump Utility ........ 200

22 Application Hosting ........ 201
22.1 System Details ........ 201
22.2 Startup and Shutdown Scripts ........ 201
22.3 System Resources Available to User Applications ........ 201
22.3.1 File System Storage Constraints ........ 201
22.3.2 RAM Constraints ........ 202
22.3.3 Interrupt Constraints ........ 203
22.4 RAM Disk Directory Structure ........ 203

23 Updating CMM Software ........ 204
23.1 Key Features of the Firmware Update Process ........ 204
23.2 Update Process Architecture ........ 204
23.3 Critical Software Update Files and Directories ........ 205
23.4 Update Package ........ 205
23.4.1 Update Package File Validation ........ 206
23.4.2 Update Firmware Package Version ........ 207
23.4.3 Component Versioning ........ 207
23.5 saveList and Data Preservation ........ 207
23.6 Update Mode ........ 208
23.7 Update_Metadata File ........ 209
23.8 Firmware Update Synchronization/Failover Support ........ 209
23.9 Automatic/Manual Failover Configuration ........ 209
23.9.1 Setting Failover Configuration Flag ........ 210
23.9.2 Retrieving the Failover Configuration Flag ........ 210
23.10 Single CMM System ........ 210
23.11 Redundant CMM Systems ........ 210
23.12 CLI Software Update Procedure ........ 210
23.13 Hooks for User Scripts ........ 211
23.13.1 Update Mode User Scripts ........ 211
23.13.2 Data Restore User Scripts ........ 212
23.13.3 Example Task - Replace /home/scripts/myScript ........ 212
23.14 Update Process ........ 213
23.15 Update Process Status and Logging ........ 215
23.16 Update Process Sensor and SEL Events ........ 215
23.17 Redboot* Update Process ........ 215


23.17.1 Required Setup ........ 215
23.17.2 Update Procedure ........ 215

24 Updating Shelf Components ........ 217
25 IPMI Pass-Through ........ 218
25.1 Overview ........ 218
25.2 Command Syntax and Interface ........ 218
25.2.1 Command Request String Format ........ 218
25.2.2 Response String ........ 219
25.2.3 Usage Examples ........ 219
25.3 SNMP ........ 219
25.3.1 Usage Example ........ 219

26 FRU Update Utility ........ 221
26.1 Overview ........ 221
26.2 FRU Update Architecture ........ 221
26.3 FRU Update Process ........ 222
26.4 FRU Recovery Process ........ 222
26.5 FRU Verification ........ 223
26.6 FRU Display ........ 223
26.7 Setting the Library Path And Invoking the Utility ........ 223
26.8 FRU Update Command Line Interface ........ 223
26.9 Using the Location Switch ........ 224
26.10 Updating the FRU ........ 225
26.11 Getting the Inventory ........ 225
26.12 Viewing the Contents of the FRU ........ 225
26.13 Getting the Contents of the FRU ........ 225
26.14 Dumping the Contents of the FRU ........ 225

27 FRU Update Configuration File ........ 227
27.1 Configuration File Format ........ 227
27.2 File Format ........ 227
27.3 String Constraints ........ 227
27.4 Numeric Constraints ........ 228
27.5 Tags ........ 228
27.6 Control Commands ........ 228
27.6.1 IFSET ........ 228
27.6.2 ELSE ........ 229
27.6.3 ENDIF ........ 229
27.6.4 SET ........ 229
27.6.5 CLEAR ........ 230
27.6.6 CFGNAME ........ 230
27.6.7 ERRORLEVEL ........ 230
27.7 Probing Commands ........ 230
27.7.1 PROBE ........ 230
27.7.2 SYSTEM ........ 231
27.7.3 FRUVER ........ 231
27.7.4 BMCVER ........ 232
27.7.5 FOUND ........ 232
27.8 Update Commands ........ 233


27.8.1 FRUNAME ........ 233
27.8.2 FRUADDRESS ........ 234
27.8.3 FRUAREA ........ 234
27.8.4 MULTIREC ........ 235
27.8.5 FRUFIELD ........ 236
27.8.6 Input of Data ........ 240
27.9 Display Commands ........ 240
27.9.1 DISPLAY ........ 241
27.9.2 CONFIGURATION ........ 241
27.9.3 Input Commands ........ 241
27.9.4 MENU ........ 241
27.9.5 MENUTITLE ........ 242
27.9.6 MENUPROMPT ........ 242
27.9.7 PROMPT ........ 242
27.9.8 YES ........ 243
27.9.9 NO ........ 243
27.10 Command Quick Reference ........ 243
27.11 Example Configuration File ........ 246
27.11.1 Chassis Update Version 0 ........ 246
27.11.2 Chassis Update Version 1 ........ 249

28 Unrecognized Sensor Types ........ 253
28.1 System Events Overview ........ 253
28.2 System Events - SNMP Trap Support ........ 254
28.2.1 SNMP Trap Header Format ........ 254
28.2.2 SNMP Trap ATCA Trap Text Translation Format ........ 254
28.3 SNMP Trap Raw Format ........ 255
28.3.1 SNMP Trap Control ........ 256
28.3.2 System Events - SEL Support ........ 256
28.3.3 Configuring SEL Format ........ 257

29 Warranty Information ........ 259
29.1 Intel® NetStructure™ Compute Boards and Platform Products Limited Warranty ........ 259
29.2 Returning a Defective Product (RMA) ........ 259
29.3 For the Americas ........ 260
29.3.1 For Europe, Middle East, and Africa (EMEA) ........ 260
29.3.2 For Asia and Pacific (APAC) ........ 260
30 Customer Support ........ 262
30.1 Customer Support ........ 262
30.2 Technical Support and Return for Service Assistance ........ 262
30.3 Sales Assistance ........ 262

31 Certifications ........ 263
32 Agency Information ........ 264
32.1 North America (FCC Class A) ........ 264
32.2 Canada - Industry Canada (ICES-003 Class A) (English and French-translated below) ........ 264
32.3 Safety Instructions (English and French-translated below) ........ 265
32.3.1 English ........ 265
32.3.2 French ........ 265


32.4 Taiwan Class A Warning Statement ........ 266
32.5 Japan VCCI Class A ........ 266
32.6 Korean Class A ........ 266
32.7 Australia, New Zealand ........ 266
33 Safety Warnings ........ 267
33.1 Mesures de Sécurité ........ 268
33.2 Sicherheitshinweise ........ 270
33.3 Norme di Sicurezza ........ 272
33.4 Instrucciones de Seguridad ........ 274
33.5 Chinese Safety Warning ........ 276

Figures

1 BIST Flow Chart ........ 32
2 Timing of BIST Stages ........ 34
3 High Level SNMP/MIB Layout ........ 140
4 CMM Custom MIB Tree ........ 142
5 CMM Status State Diagram ........ 169
6 SNMPTrapFormat = 1 ........ 255
7 SNMPTrapFormat = 2 ........ 255
8 SNMPTrapFormat = 3 ........ 255

Tables

1 Glossary ........ 16
2 CMM Synchronization ........ 22
3 CMM Status Event Strings (CMM Status) ........ 30
4 BIST Implementation ........ 32
5 Processes Monitored ........ 42
6 No Action Recovery ........ 46
7 Successful Restart Recovery ........ 46
8 Successful Failover/Restart Recovery ........ 47
9 Successful Failover/Reboot Recovery ........ 48
10 Failed Failover/Reboot Recovery, Non-Critical ........ 49
11 Failed Failover/Reboot Recovery, Critical ........ 50
12 Existence Fault, Excessive Restarts, Escalate No Action ........ 50
13 Excessive Restarts, Successful Escalate Failover/Reboot ........ 51
14 Excessive Restarts, Failed Escalate Failover/Reboot, Non-Critical ........ 52
15 Excessive Restarts, Failed Escalate Failover/Reboot, Critical ........ 53
16 Administrative Action ........ 53
17 Excessive Failover/Reboots, Administrative Action ........ 54
18 Time to Delay and Number of Attempts ........ 70
19 SETIP Interface Assignments when BOOTPROTO="static" ........ 74
20 SETIP Interface Assignments when BOOTPROTO="dhcp" ........ 75
21 Location (-l) Keywords ........ 77
22 CMM Targets ........ 79

23

Dataitem Keywords for All Locations..........................................................................................

80

12

MPCMM0001 Chassis Management Module Software Technical Product Specification

 

Contents

24

Dataitem Keywords for All Locations Except System .................................................................

80

25

Dataitem Keywords for All Locations Except Chassis and System ............................................

81

26

Dataitem Keywords for Chassis Location ...................................................................................

85

27

Dataitem Keywords for Cmm Location .......................................................................................

86

28

Dataitem Keywords for System Location....................................................................................

92

29

Dataitem Keywords for FantrayN Location .................................................................................

93

30

Dataitem Keywords Used with the Target Parameter.................................................................

94

31

CMM Voltage and Temp Sensor Thresholds............................................................................

102

32

CMM SEL Sensor Information ..................................................................................................

105

33

Sensor Targets .........................................................................................................................

106

34

Threshold-Based Sensors: Voltage, Temp, Current, Fan.........................................................

109

35

Hot Swap Sensor: Filter Tray HS, FRU Hot Swap....................................................................

110

36

IPMB Link State Sensor: IPMB-0 Snsr [1-16] ...........................................................................

110

37

System Firmware Progress Event Strings (System Firmware Progress) .................................

111

38

Watchdog 2 Sensor Event Strings............................................................................................

113

39

CMM Redundancy ....................................................................................................................

115

40

CMM Trap Connectivity (CMM [1-2] Trap Conn) ......................................................................

115

41

CMM Failover ...........................................................................................................................

115

42

CMM Synchronization...............................................................................................................

116

43

BIST Event Strings ...................................................................................................................

117

44

Chassis Data Module (CDM [1,2]) ............................................................................................

118

45

Datasync Status........................................................................................................................

118

46

CMM Status Event Strings (CMM Status) ................................................................................

118

47

Process Monitoring Service Fault Event Strings (PMS Fault) ..................................................

119

48

Process Monitoring Service Info Event Strings (PMS Info) ......................................................

120

49

Chassis Events .........................................................................................................................

120

50

IPMI Error Completion Codes and Enumerations.....................................................................

121

51

System Health LED States .......................................................................................................

123

52

CMM Health LED States...........................................................................................................

124

53

CMM Hot Swap LED States .....................................................................................................

124

54

Ledstate Functions and Function Options ................................................................................

125

55

LED Event Sequence ...............................................................................................................

126

56

Dataitems Used With FRU Target (-t) to Obtain FRU Information............................................

131

57

CMM Cooling Table ..................................................................................................................

133

58

MIB II Objects - System Group .................................................................................................

141

59

MIB II - Interface Group ............................................................................................................

141

60

System Location (1.3.6.1.4.1.343.2.14.2.10.1).........................................................................

143

61

Shelf Location (Equivalent to Chassis) (1.3.6.1.4.1.343.2.14.2.10.2).......................................

144

62

ShelfTable/shelfEntry (1.3.6.1.4.1.343.2.14.2.10.2.50.1) .........................................................

144

63

Cmm Location (1.3.6.1.4.1.343.2.14.2.10.3) ............................................................................

146

64

CmmTable/cmmEntry (1.3.6.1.4.1.343.2.14.2.10.3.51.1).........................................................

149

65

CmmFruTable/cmmFruEntry (1.3.6.1.4.1.343.2.14.2.10.3.52.1)..............................................

151

66

CmmFruTargetTable (1.3.6.1.4.1.343.2.14.2.10.3.53.1) ..........................................................

151

67

CmmPmsTable/cmmPmsEntry (1.3.6.1.4.1.343.2.14.2.10.3.54.1) ..........................................

151

68

Blade# Location (1.3.6.1.4.1.343.2.14.2.10.4.[1-16]) ...............................................................

152

69

Blade#TargetTable/blade#TargetEntry (1.3.6.1.4.1.343.2.14.2.10.4.[1-16].51.1) ....................

153

70

Blade#FruTable/blade#FruEntry (1.3.6.1.4.1.343.2.14.2.10.4.[1-16].52.1) ..............................

154

71

Blade#FruTargetTable/blade#FruTargetEntry (1.3.6.1.4.1.343.2.14.2.10.4.[1-16].53.1) .........

155

72

[FanTray/pem]Table/[fanTray/pem]Entry (1.3.6.1.4.1.343.2.14.2.10.[5/6].51.1) ......................

155

73

[FanTray/pem]TargetTable/[fanTray/pem]TargetEntry (1.3.6.1.4.1.343.2.14.2.10.[5/6].52.1)..

156

MPCMM0001 Chassis Management Module Software Technical Product Specification

13

Contents

 

74

[FanTray/pem]FruTable/[fanTray/pem]FruEntry (1.3.6.1.4.1.343.2.14.2.10.[5/6].53.1) ...........

157

75

[FanTray/pem]FruTargetTable/[fanTray/pem]FruTargetEntry

 

 

(1.3.6.1.4.1.343.2.14.2.10.[5/6].54.1) .......................................................................................

158

76

SNMP v3 Security Fields For Traps .........................................................................................

162

77

SNMP v3 Security Fields For Queries......................................................................................

162

78

CMM State Transition Events and Event IDs ...........................................................................

166

79

CMM Status Sensor Data Bits..................................................................................................

167

80

Error and Return Codes for the RPC Interface.........................................................................

177

81

Threshold Response Formats ..................................................................................................

181

82

String Response Formats.........................................................................................................

181

83

Integer Response Formats .......................................................................................................

185

84

FRU Data Items String Response Format................................................................................

186

85

RPC Usage Examples..............................................................................................................

187

86

RMCP Modes ...........................................................................................................................

190

87

RMCP Session Timers .............................................................................................................

192

88

RMCP Slave Addresses ...........................................................................................................

193

89

IPMI Commands Supported by CMM RMCP ...........................................................................

194

90

RMCP Message Completion Codes.........................................................................................

198

91

Flash #1....................................................................................................................................

202

92

Flash #2....................................................................................................................................

202

93

Flash #3....................................................................................................................................

202

94

Flash #4....................................................................................................................................

202

95

List of Critical Software Update Files and Directories ..............................................................

205

96

Contents of the Update Package..............................................................................................

206

97

SaveList Items and Their Priorities...........................................................................................

208

98

CMM Update Directions ...........................................................................................................

209

99

Platform FRU Accessibility of the FRU Update Utility ..............................................................

221

100

FruUpdate Utility Command Line Options ................................................................................

224

101

Probe Command Parameters...................................................................................................

231

102

FRU Area String Specifications ................................................................................................

235

103

Multi-Record Selection Parameters..........................................................................................

236

104

FRU Field First String Specifications........................................................................................

237

105

FRU Field Maximum Allowed Lengths .....................................................................................

237

106

FRU Field Second String Specification ....................................................................................

238

107

Type Code Specification...........................................................................................................

239

108

Command Quick Reference .....................................................................................................

243

109

Probe Arguments Quick Reference..........................................................................................

246

110

Results of Variable Settings .....................................................................................................

256

111

Example CLI Commands..........................................................................................................

277

Revision History

Date | Revision | Description
April 2005 | 007 | Firmware version 5.2
August 2004 | 006 | Firmware version 5.1.0.757
April 2004 | 005 | Version 5.1 TPS. Added Re-Enumeration Section. Added Process Monitoring Section.
January 2004 | 004.1 | Version 4.1 TPS

1 Introduction

1.1 Overview

The Intel® NetStructure™ MPCMM0001 Chassis Management Module is a 4U, single-slot CMM intended for use with AdvancedTCA* PICMG* 3.0 platforms. This document details the software features and specifications of the CMM. For information on hardware features for the CMM refer to the Intel® NetStructure™ MPCMM0001 Hardware Technical Product Specification. Links to specifications and other material can be found in Appendix B, “Data Sheet Reference.”

The CMM plugs into a dedicated slot in compatible systems. It provides centralized management and alarming for up to 16 node and/or fabric slots as well as for system power supplies, fans and power entry modules. The CMM may be paired with a backup for redundant use in high-availability applications.

The CMM is a special purpose single board computer (SBC) with its own CPU, memory, PCI bus, operating system, and peripherals. The CMM monitors and configures IPMI-based components in the chassis. When thresholds (such as temperature and voltage) are crossed or a failure occurs, the CMM captures these events, stores them in an event log, sends SNMP traps, and drives the Telco alarm relays and alarm LEDs. The CMM can query FRU information (such as serial number, model number, manufacture date, etc.), detect presence of components (such as fan tray, CPU board, etc.), perform health monitoring of each component, control the power-up sequencing of each device, and control power to each slot via IPMI.

Assumptions: This document assumes some basic Linux* knowledge and the ability to use Linux text editors such as vi.

1.2 Terms Used in this Document

Table 1. Glossary

Acronym | Description
BIST | Built-In Self Test
CDM | Chassis Data Module
CLI | Command Line Interface
CMM | Chassis Management Module
DHCP | Dynamic Host Configuration Protocol
FFS | Flash File System
FIS | Flash Image System
FPGA | Field-Programmable Gate Array
FRU | Field Replaceable Unit
HS | Hot Swap
IPMB | Intelligent Platform Management Bus
IPMI | Intelligent Platform Management Interface
LED | Light Emitting Diode
MIB | Management Information Base
MIB II | RFC 1213 - A standard Management Information Base for Network Management
PEM | Power Entry Module
PICMG | PCI Industrial Computer Manufacturers’ Group
RMCP | Remote Management Control Protocol
RPC | Remote Procedure Call
SBC | Single Board Computer
SDR | Sensor Data Record
SEL | System Event Log
ShMC | Shelf Management Controller
SNMP | Simple Network Management Protocol
SSH | Secure Shell
TFTP | Trivial File Transfer Protocol
UDP | User Datagram Protocol
WDT | Watchdog Timer

2 Software Specifications

2.1 Red Hat* Embedded Debug and Bootstrap (Redboot)

Upon initial power on, the CMM enters into the Redboot firmware to bootstrap the embedded environment. Upon execution, Redboot acts as a TFTP server and checks for a TFTP connection to a client. If a TFTP connection exists, Redboot will accept a firmware update that is pushed down from the client, check the firmware update for data integrity, and then write the update to the flash.

Note: Firmware updates using the Redboot TFTP method are supported for backwards compatibility. However, updating from within the OS using the CLI is the preferred method of updating CMM firmware. For information on the firmware update process refer to Section 23, “Updating CMM Software” on page 204.

Under normal circumstances, Redboot runs the standard diagnostics, sets up memory, decompresses the OS kernel, and boots into that kernel.

2.2 Operating System

The CMM runs a customized version of embedded BlueCat* Linux* 4.0 on an Intel® 80321 processor with Intel® XScale® technology. Development support for BlueCat Linux is available on the web at http://www.lynuxworks.com.

2.3 Command Line Interface (CLI)

The Command Line Interface (CLI) connects to and communicates with the intelligent management devices of the chassis, boards, and the CMM itself. The CLI is an IPMI-based library of commands that can be accessed directly or through a higher-level management application. Administrators can access the CLI through Telnet, SSH, or the CMM’s serial port. Using the CLI, users can view the current state of the system, including current sensor values, threshold settings, recent events, and overall chassis health. They can also view and modify shelf and CMM configurations, set fan speeds, and perform actions on a FRU. The CLI is covered in Section 8, “The Command Line Interface (CLI)” on page 71.

2.4 SNMP/UDP

The chassis management module supports both queries and traps using SNMP (Simple Network Management Protocol) v1 or v3. The SNMP version can be configured through the CLI. The default is SNMP v1. A MIB for the entire platform is included with the CMM. The CMM can send SNMP traps to up to five trap receivers.

Along with SNMP traps, the CMM sends UDP (User Datagram Protocol) alerts to port 10000. The content of these UDP alerts is the same as the SNMP traps. SNMP is covered in Section 17, “SNMP” on page 140.
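A management station can pick up these UDP alerts with an ordinary datagram socket. The following Python sketch is illustrative only. It assumes the alerts arrive on port 10000 of the receiving host, and it treats each payload as opaque bytes because the alert encoding is not described in this section. The function names are hypothetical and are not part of the CMM software.

```python
import socket

CMM_ALERT_PORT = 10000  # UDP port named in this section


def open_alert_socket(host="0.0.0.0", port=CMM_ALERT_PORT):
    """Bind a UDP socket on which CMM alerts can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_alert(sock, max_size=4096, timeout=None):
    """Wait for one datagram and return (payload_bytes, sender_ip).

    ASSUMPTION: the payload layout is not specified here, so it is
    returned raw rather than decoded.
    """
    sock.settimeout(timeout)
    payload, (sender_ip, _sender_port) = sock.recvfrom(max_size)
    return payload, sender_ip
```

A receiver would typically call open_alert_socket() once and then call read_alert() in a loop, logging each payload together with the sender address.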

2.5 Remote Procedural Call (RPC) Interface

In addition to the console command-line interface, the CMM can be administered by custom remote applications via remote procedure calls (RPC). RPC is covered in Section 19, “Remote Procedure Calls (RPC)” on page 174.

2.6 RMCP

RMCP (Remote Management Control Protocol) is a protocol that defines a method to send IPMI packets over LAN. The RMCP server on the CMM can decode RMCP packets and forward the IPMI messages to the appropriate channels, including SBC blades, PEMs, and FanTrays, or to a local destination within the CMM. When a responding IPMI message from an SBC blade, PEM, or FanTray is destined for an RMCP client, the RMCP server formats this IPMI message into an RMCP message and sends it through the designated LAN interface back to the originator. RMCP is covered in Section 20, “RMCP” on page 190.

2.7 Ethernet Interfaces

The CMM contains two Ethernet ports. The software can direct each of these ports to the front panel, the backplane, or the rear transition module (RTM). Information on configuring the Ethernet interfaces is covered in Section 8.3.1, “Setting IP Address Properties” on page 72.

2.8 Sensor Event Logs (SEL)

The AdvancedTCA CMM implements system event logs according to Section 3.5 of the PICMG 3.0 Specification. The SEL contained on the CMM is fully IPMI compliant.

2.8.1 CMM SEL Architecture

The MPCMM0001 uses a single flat SEL file stored locally in the /etc/cmm directory. The SEL maintains a list of all the sensor events in the shelf. Each of the managed devices may keep its own SEL records in local SELs, but the master copy for the shelf is maintained by the CMM.

The SEL is limited to 65536 bytes. To keep the SEL from filling up, which can cause loss of error logging, the CMM checks the SEL every 15 minutes. If the size of the cmm_sel file is greater than 40000 bytes, the SEL is archived in gzip format and saved in /home/log/SEL. The saved logs are named cmm_sel.0.gz, cmm_sel.1.gz, and so on, up to a maximum of 16 logs, after which they roll over.
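The archive policy above can be sketched in a few lines of Python. This is a model of the documented thresholds and file naming, not the CMM's actual implementation. In particular, it assumes that rolling over means reusing index 0 after index 15, which the text does not state explicitly. All names are hypothetical.

```python
SEL_MAX_BYTES = 65536            # documented hard limit on the SEL
ARCHIVE_THRESHOLD_BYTES = 40000  # archive once cmm_sel is larger than this
MAX_ARCHIVES = 16                # cmm_sel.0.gz through cmm_sel.15.gz
ARCHIVE_DIR = "/home/log/SEL"    # documented archive directory


def needs_archive(sel_size_bytes):
    """True when the periodic (15 minute) check would archive the SEL."""
    return sel_size_bytes > ARCHIVE_THRESHOLD_BYTES


def archive_path(archives_written):
    """Path of the next archive, given how many were written so far.

    ASSUMPTION: rollover wraps the index back to 0 after 15.
    """
    index = archives_written % MAX_ARCHIVES
    return f"{ARCHIVE_DIR}/cmm_sel.{index}.gz"
```

The gap between the 40000 byte threshold and the 65536 byte limit leaves headroom for events that arrive between two checks.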

Note: Archived files should NEVER be decompressed on the CMM as the resulting prolonged flash file writing could disrupt normal CMM operation and behavior. Using FTP, transfer the files to a different system before decompressing the archive using utilities such as gzip.
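Once an archive has been transferred to another system, it can be decompressed there with any gzip-capable tool. As an illustration, the Python sketch below decompresses one archive on a workstation. The function name and file paths are hypothetical, and the code is meant to run on the receiving system, never on the CMM.

```python
import gzip
import shutil


def decompress_sel_archive(archive_path, output_path):
    """Decompress one archived SEL file (for example cmm_sel.0.gz).

    Run this on the system the archive was copied to, not on the CMM.
    """
    with gzip.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data in chunks
    return output_path
```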

2.8.2 Retrieving a SEL

To retrieve a SEL from the CMM, issue the following command:

cmmget [-l location] -d sel

Where location is one of {cmm, blade[1-14], fantray1, PEM[1-2]}. Even though the CMM uses a single flat SEL for system events, the ‘cmmget’ command will filter the SEL and only return events associated with the provided location. Also, some individual FRUs (e.g., blades) may keep their own local SELs.
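A management script may want to validate the location keyword before invoking the command. The Python sketch below builds the argument list for the command shown above, using only the -l and -d options and the location keywords listed in this section. The helper name is hypothetical. Treating the keywords as case-sensitive exactly as printed is an assumption.

```python
import re

# Location keywords from this section: cmm, blade[1-14], fantray1, PEM[1-2]
# ASSUMPTION: keywords are matched exactly as printed, including case.
_LOCATION_RE = re.compile(r"cmm|blade(?:[1-9]|1[0-4])|fantray1|PEM[12]")


def build_sel_command(location=None):
    """Return the argv list for 'cmmget [-l location] -d sel'."""
    args = ["cmmget"]
    if location is not None:
        if not _LOCATION_RE.fullmatch(location):
            raise ValueError(f"unsupported SEL location: {location!r}")
        args += ["-l", location]
    args += ["-d", "sel"]
    return args
```

The resulting list can be handed to a process-launching facility on a system where cmmget is available.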

2.8.3 Clearing the SEL

The following command will clear the SEL on both the active and the standby CMM:

cmmset -d clearsel -v clear

Note: Since the CMM uses a single flat SEL for system events, this command clears the entire shelf SEL, not just a filtered subset.

2.8.4 Retrieving the Raw SEL

To retrieve the SEL in its raw format from a location, issue the following command:

cmmget -l [location] -d rawsel

2.9 Blade OverTemp Shutdown Script

The CMM software includes predefined script settings specifically for the MPCBL0001 board, which will automatically shut down a board when the “baseboard temp” sensor on that board crosses the upper critical threshold. This is done to prevent a runaway thermal event on the board from occurring. If this functionality is needed when using boards other than the MPCBL0001, the user will need to associate the name of the thermal sensor and the threshold with the board shutdown script:

cmmset -l bladeN -d majoraction -t [temp sensor name] -v overtempbladepoweroff [Blade Number]

Please refer to Section 18, “CMM Scripting” on page 164 for more information on associating a script with an event.

When using the CMM with boards other than the MPCBL0001, as long as there is no sensor name titled "baseboard temp" associated with the particular board being used, then there is no issue leaving these settings intact. If needed, to deactivate these settings for each physical slot, use the command:

cmmset -l bladeN -d majoraction -t "baseboard temp" -v none

where bladeN is the blade, corresponding to the physical slot number, on which to remove the automatic shutdown setting (blade[1-16]). Please refer to Section 18, “CMM Scripting” on page 164 for more information on removing script actions.
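Because the deactivation command must be issued once per physical slot, it can be convenient to generate the commands programmatically. The Python sketch below produces one argument list per slot, using the exact options shown above. The helper name is hypothetical, and the sketch only builds the commands without running them.

```python
def overtemp_disable_commands(slots=range(1, 17), sensor="baseboard temp"):
    """Build 'cmmset -l bladeN -d majoraction -t <sensor> -v none' per slot.

    Slots are physical slot numbers in the documented range 1 to 16.
    """
    commands = []
    for slot in slots:
        if not 1 <= slot <= 16:
            raise ValueError(f"slot out of range: {slot}")
        commands.append([
            "cmmset", "-l", f"blade{slot}",
            "-d", "majoraction",
            "-t", sensor,  # passed as one argument, so no shell quoting needed
            "-v", "none",
        ])
    return commands
```

Passing the sensor name as a single list element avoids the quoting that the interactive form needs for the space in "baseboard temp".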

3 Redundancy, Synchronization, and Failover

3.1 Overview

The CMM supports redundant operation with automatic failover in a chassis using redundant CMM slots. In systems where two CMMs are present, one acts as the active shelf manager and the other as standby. Both CMMs monitor each other, and either can trigger a failover if necessary.

Data from the active CMM is synchronized to the standby CMM whenever any changes occur. Data on the standby CMM is overwritten. A full synchronization between active and standby CMMs occurs on initial power up, or any insertion of a new CMM.

The active CMM is responsible for shelf FRU information management when CMMs are in redundant mode.

3.2 Synchronization

To ensure critical files on the standby CMM match the data on the active CMM, the active CMM synchronizes its data with the standby CMM, overwriting any existing data on the standby CMM.

An exception to this is the password reset procedure, detailed in Section 9, “Resetting the Password” on page 99. When the password reset switch is activated on the standby CMM, the password will be synchronized to the active CMM.

The CMMs will initially fully synchronize data from the active to the standby CMM just after booting. An insertion of a new CMM will also cause a full synchronization from the active to the newly inserted standby. Date and time are synched every hour. Partial synchronization will also occur any time files are modified or touched via the Linux* “touch” command with the exception of all *.sif and *.bin files in the /etc/cmm directory.

The *.sif (SIF) and *.bin (SDR) files under /etc/cmm are synchronized only once, when the CMMs establish communication. A 'touch' on those files at any later time does not trigger a sync operation. Updates to these files always happen as part of a software update, never in isolation.

Note: During synchronization, the health event LEDs on the standby CMM may blink on and off as the health events that were logged in the SEL are synchronized.

Below is a list of items that are synchronized between CMMs. During a full synchronization, all of these files and data are synchronized. A change to any of these files results in that file being synched. The active CMM overwrites these files on the standby CMM.

There are two priority levels of files that get synchronized. To manage the chassis normally, the priority 1 files must be synchronized after power up or after installation of a brand new CMM into the chassis. A standby CMM must have the priority 1 files synchronized before a successful failover can occur. When a brand new CMM boots for the first time as a standby, if a CMM failover is forced before all priority 1 data items are synchronized to the standby CMM, the standby CMM can still become the active CMM but may not be able to properly manage the FRUs in the chassis.

Table 2. CMM Synchronization

File(s) or Data | Description | Path | Priority
date and time | Date and time | IPMB | 1
IP Address Settings | CMM eth1, eth1:1, and eth0 IP address settings to allow CMMs to discover the other’s IP information. | IPMB | 1
/etc/cmm.cfg | CMM’s main configuration file | Ethernet | 1
/etc/cmm/cmm_sel | System SEL | Ethernet | 1
/etc/cmm/sensors.ini | Sensor Set Values | Ethernet | 1
Ekey Controller Structures | Ekey Controller Structures | Ethernet | 1
Bused EKey Token info | Bused EKey Token info | Ethernet | 1
IPMB User States | IPMB User States | Ethernet | 1
Fan States | Fan States | Ethernet | 1
Cooling State | Cooling State Information | Ethernet | 1
User LED States | User LED States | Ethernet | 1
SDR structures and SIPI Controller Info | SDR structures and SIPI Controller Info | Ethernet | 1
PHM FRU state, Power Usage and Power Info | PHM FRU state, Power Usage and Power Info | Ethernet | 1
FIM FRU Cache (Local and Temp) | FIM FRU Cache (Local and Temp) | Ethernet | 1
SEL Time | SEL Time | IPMB | 1
SEL Events | Individual SEL Events | IPMB | 1
/etc/cmm/fantray.cfg | Fantray settings needed by cooling manager | Ethernet | 1
/etc/cmm.ini | Provides configuration values like the bus mapping | Ethernet | 2
/etc/passwd | Password file | Ethernet | 2
/etc/shadow | Password file | Ethernet | 2
/etc/cmdPrivillege.ini | Provides privilege related configuration values for RMCP | Ethernet | 2
/etc/cmm/*.bin | All SDR Files | Ethernet | 2
/etc/cmm/*.sif | All SIF Files | Ethernet | 2
/etc/var/snmpd.conf | SNMP configuration files | Ethernet | 2
/etc/snmpd.conf | SNMP configuration files | Ethernet | 2
/home/scripts | Entire user scripts area | Ethernet | 2
Prompt file | Prompt file | Ethernet | 2
/etc/actionscripts.cfg | Event action settings | Ethernet | 2
Issues files | Issues files | Ethernet | 2
/usr/local/cmm/temp/pmssync.ini | Recovery action and escalation action for all the monitored processes except monitor process | Ethernet | 2
/usr/local/cmm/temp/pmsshadowsync.ini | Recovery action and escalation action for monitor process | Ethernet | 2

Note: The /.rhosts file is used for synchronization and should NEVER be modified.

3.3 Heterogeneous Synchronization

Beginning with firmware version 5.2, the CMM can synchronize data between differing CMM versions. The firmware decouples synchronization from firmware versioning, allowing seamless synchronization between all CMM versions. A form of internal data versioning maintained by the CMM helps achieve this.

Note: SDR/SIF and user scripts differ slightly in synchronization architecture as described below.

3.3.1 SDR/SIF Synchronization

Sensor Data Records (SDRs) and Sensor Information Files (SIFs) will be synchronized only between CMMs having the same version for this data item (even if the CMM firmware versions differ).

3.3.2 User Scripts Synchronization and Configuration

By default, user scripts are synchronized only between CMMs with the same firmware version. Users can control user-script synchronization across differing CMM versions by modifying the "SyncUserScripts" configuration flag in the CMM configuration file (cmm.cfg under /etc). The flag can be modified using the cmmget/cmmset commands and can be read or set through any of the CMM interfaces (i.e., CLI, SNMP, and RPC).

The value of this flag determines whether user scripts are synchronized only when the CMM firmware versions differ. Between CMMs with the same firmware version, the user scripts directory continues to be synchronized and the flag is ignored.

3.3.2.1 Setting User Scripts Sync Configuration Flag

To set the value of the Scripts Synchronization configuration flag, the following CMM command is used:

cmmset -l cmm -d syncuserscripts -v [equal/upgrade/downgrade/always]

Where:

equal: Synchronizes user scripts only when the CMM versions are the same. This is the default value.

upgrade: Synchronizes user scripts only when the other CMM has a newer firmware version.

downgrade: Synchronizes user scripts only when the other CMM has an older firmware version.

always: Synchronizes user scripts irrespective of version differences.
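The effect of the four flag values can be illustrated with a short sketch. This is not CMM code: the function name and the representation of firmware versions as (major, minor) tuples are assumptions made for illustration only.

```python
from enum import Enum


class SyncUserScripts(Enum):
    EQUAL = "equal"
    UPGRADE = "upgrade"
    DOWNGRADE = "downgrade"
    ALWAYS = "always"


def should_sync_user_scripts(flag, local_version, other_version):
    """Return True if user scripts would be synchronized.

    Versions are hypothetical (major, minor) tuples; the real CMM
    version encoding is not described in this section.
    """
    if local_version == other_version:
        # Same firmware version: the flag is ignored, scripts always sync.
        return True
    if flag is SyncUserScripts.ALWAYS:
        return True
    if flag is SyncUserScripts.UPGRADE:
        return other_version > local_version
    if flag is SyncUserScripts.DOWNGRADE:
        return other_version < local_version
    return False  # EQUAL with differing versions
```

With the default value (equal), scripts are synchronized only when both versions match; the other values matter only when the versions differ.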

3.3.2.2 Retrieving User Scripts Sync Configuration Flag

To retrieve the value of the Scripts Synchronization configuration flag, the following CMM command is used:

cmmget -l cmm -d syncuserscripts

The value returned will be one of: Equal, Upgrade, Downgrade, Always, or Error on failure.

3.3.3 Synchronization Requirements

For synchronization to occur:

The CMMs must be able to communicate with each other over their dedicated IPMB. The CMMs use a heartbeat via their dedicated IPMB to determine if they can communicate with each other over IPMB.

An Ethernet connection must exist between the two CMMs. The CMMs must be able to ping each other via Ethernet for synchronization to be successful. The connection can run through the Ethernet switches in the chassis (which requires both switches to be present in the chassis), through an external Ethernet switch connected to the front ports of the CMM pair, or through a crossover cable connecting the two front ports of the CMM pair. If synchronization fails on eth1, it is attempted on eth0. If the CMMs cannot successfully ping each other via either eth0 or eth1, synchronization between the CMMs cannot occur.

A failure of any priority 1 synchronization will result in a health event being logged in the CMM SEL and will inhibit a failover from occurring.

3.4 Initial Data Synchronization

A standby CMM must have the priority 1 files synchronized before a successful failover can occur. A standby CMM can still become active if priority 1 synchronization has not completed, but it may not be able to properly manage all the FRUs in the chassis.

The CMM implements the “Datasync Status” sensor to report the state of synchronization and whether synchronization has completed successfully.

3.4.1 Initial Data Sync Failure

If the CMM encounters any failure during data synchronization, it marks the data synchronization as failed, logs a SEL event, and sends an SNMP trap. Duplicate failures are not reported multiple times. As soon as the CMM is out of the failure condition, it resets the data synchronization failure state.

The CMM will continue trying to synchronize as long as there are two CMMs present in the chassis and they are able to communicate via their cross-connected IPMB.


3.5 Datasync Status Sensor

A sensor named “Datasync Status” makes the Datasync state information available to the user. This sensor tracks the status of the Datasync module and makes its status available through the various CMM interfaces. It is used to query the data synchronization states and to log a SEL event when initial synchronization completes. It is a discrete OEM sensor with status bits representing the state of different parts of the Datasync module.

Note: The Datasync Status sensor can only be queried through the active CMM.

3.5.1 Sensor bitmap

When the Datasync module first starts in a dual-CMM system, and whenever the CMM changes between Active and Standby, the status bits are all cleared to 0x0000.

Bit 0 (Running) is set when the Datasync module is active.

Bit 1 (P1Done) is set when the priority 1 data syncs are done, and cleared when priority 1 data needs to be synced.

Bit 2 (P2Done) is set when the priority 2 data syncs are done, and cleared when priority 2 data needs to be synced.

Bit 3 (InitSyncDone) is set when both priority 1 and priority 2 data syncs are done, and stays set (latches) until the CMM changes between Active and Standby, or loses contact with the partner CMM.

Bit 4 (SyncError) is set if an error was detected, and cleared when no data items have errors.
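As an illustration of the bitmap, the following sketch decodes a sensor value into its named status bits. The function and dictionary names are hypothetical; only the bit positions and bit names come from the list above.

```python
# Bit masks taken from the sensor bitmap description.
DATASYNC_BITS = {
    "Running": 0x01,       # bit 0
    "P1Done": 0x02,        # bit 1
    "P2Done": 0x04,        # bit 2
    "InitSyncDone": 0x08,  # bit 3
    "SyncError": 0x10,     # bit 4
}


def decode_datasync_status(value):
    """Map each status bit name to True (set) or False (cleared)."""
    return {name: bool(value & mask) for name, mask in DATASYNC_BITS.items()}


if __name__ == "__main__":
    for name, is_set in decode_datasync_status(0x000D).items():
        print(f"{name}: {is_set}")
```

For example, a value of 0x000d decodes to Running, P2Done, and InitSyncDone set, with P1Done and SyncError cleared.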

3.5.2 Event IDs

The “Datasync Status” sensor uses event IDs 0x420 to 0x42f. The following new event ID is used to log events for this sensor. Event IDs can be used to associate scripts with the respective events.

Event | Event ID

Initial Data Synchronization complete | 0x420 (1056)

3.5.3 Querying the Datasync Status

The status of the Datasync Status sensor can be queried using the following CLI command:

cmmget -l cmm -t "Datasync Status" -d current

Output of the command is as follows:

Initial State:

The current value is 0x0001

Initial Data Synchronization is not complete.

There is Priority 1 data to sync.


There is Priority 2 data to sync.

No Data Synchronization problems known.

Initial Data Synch Incomplete, Pri 1 Data Synced, Pri 2 Data Not Synched

The current value is 0x0003

Initial Data Synchronization is not complete.

Priority 1 data is synced.

There is Priority 2 data to sync.

No Data Synchronization problems known.

Initial Data Sync is complete, Priority 1 and Priority 2 are also synced

The current value is 0x000f

Initial Data Synchronization is complete.

Priority 1 data is synced.

Priority 2 data is synced.

No Data Synchronization problems known.

Initial Data Sync failure

The current value is 0x0013

Initial Data Synchronization is not complete.

Priority 1 data is synced.

There is Priority 2 data to sync.

Data Synchronization has encountered a problem in synchronizing data.

Initial Data Sync is complete and Priority 1 data is changed

The current value is 0x000d

Initial Data Synchronization is complete.

There is Priority 1 data to sync.

Priority 2 data is synced.

No Data Synchronization problems known

Data Sync failure of Priority 1 Data occurs after Initial Data Sync and there is a Data Sync Problem

The current value is 0x001d

Initial Data Synchronization is complete.

There is Priority 1 data to sync


Priority 2 data is synced.

Data Synchronization has encountered a problem in synchronizing data.

Data Sync becomes normal after Data Sync failure

The current value is 0x000f

Initial Data Synchronization is complete.

Priority 1 data is synced.

Priority 2 data is synced.

No Data Synchronization problems known

Single CMM

The current value is 0x0000

Datasync disabled - there is no partner CMM present.

3.5.4 SEL Event

The Datasync Status sensor generates the following two SEL events:

When the active CMM is or becomes the only CMM, or the active CMM loses communication with the standby CMM, the following event will be logged:

[Day] [Month] [Date] [Time] [Year] CMM[n]: CMM Datasync Status Initial Data Synchronization is complete. Deasserted

 

 

The following event will be logged in the SEL when initial data synchronization is complete:

[Day] [Month] [Date] [Time] [Year] CMM[n]: CMM Datasync Status Initial Data Synchronization is complete. Asserted

 

Where

n: The number of the CMM generating the event.

3.5.5 SNMP Trap

The Datasync Status sensor generates the following two SNMP traps:

When the active CMM is or becomes the only CMM, or the active CMM loses communication with the standby CMM, the following SNMP trap will be generated.

[Month] [Date] [Time] [hostname] snmptrapd[xxxxx]: [IP Address]: Enterprise Specific Trap (25) Uptime: [Time], SNMPv2SMI::enterprises.343.2.14.1.5 = STRING: "Time : [Day] [Month] [Date] [Time] [Year], Location : [location] , Chassis Serial # : [xxxxxxxx], Board : CMM[x] , Sensor : Datasync Status , Event : Initial Data Synchronization complete: Deasserted "


When initial data synchronization is complete, the following SNMP trap is generated:

[Month] [Date] [Time] [hostname] snmptrapd[xxxxx]: [IP Address]: Enterprise Specific Trap (25) Uptime: [Time], SNMPv2SMI::enterprises.343.2.14.1.5 = STRING: "Time : [Day] [Month] [Date] [Time] [Year], Location : [location] , Chassis Serial # : [xxxxxxxx], Board : CMM[x] , Sensor : CMM[x]:Datasync Status , Event : Initial Data Synchronization is complete. Asserted "

3.5.6 System Health

The “Datasync Status” sensor does not contribute to the system health. However, sync failures are captured by the “File Sync Failure” sensor, which does contribute to the system health.

3.6 CMM Failover

Once information is synchronized between the redundant CMMs, the active CMM will constantly monitor its own health as well as the health of the standby CMM. In the event of one of the scenarios listed in the sections that follow, the active CMM will automatically failover to the standby CMM so that no management functionality is lost at any time.

3.6.1 Scenarios That Prevent Failover

The following are reasons a failover can NOT occur:

The active CMM can NOT communicate with the standby CMM via their IPMB bus.

Not all priority 1 data has been completely synchronized between the CMMs.

To determine the active CMM at any time, use the CLI command:

cmmget -l cmm -d redundancy

This command will output a list stating if both CMMs are present, which one is the active CMM, and which CMM you are logged in to. CMM1 is the CMM on the left when looking from the front of the chassis, and CMM2 is on the right.

3.6.2 Scenarios That Failover to a Healthier Standby CMM

The scenarios listed below can only cause a failover if the standby CMM is in a healthier state than the active CMM. The health of a CMM is determined by computing a CMM health score, which is equal to the sum of the weights of whichever of the conditions listed below are currently active. A CMM health score is determined for each CMM whenever any of these conditions occurs on the active CMM. Each condition has a default weight of 1, so all conditions have equal importance in causing a failover.

To determine if a failover is necessary when one of these conditions occurs, the active CMM computes its CMM health score, and requests the health score of the standby CMM. If the score of the standby CMM is LESS than the score of the active CMM, a failover will occur. If a failover does not occur, the CMM SEL will contain an entry indicating the reason failover did not occur.

1. SNMPTrapAddress1 ping failure:

The active CMM will failover to the standby CMM if the active CMM cannot ping its first SNMP trap address (SNMPTrapAddress1) over any of the available Ethernet ports, but the standby CMM can. The trap address is set using the command:

cmmset -l cmm -d snmptrapaddress1 -v [ip address]

Only a ping failure of the first SNMP trap address (SNMPTrapAddress1) can cause a failover. SNMPTrapAddress2 through SNMPTrapAddress5 do not perform this ping test.

Note: The frequency of the ping to the first trap address can vary from one second to approximately 20 seconds.

2. Critical events on the active CMM:

The active CMM has critical events for any of the CMM sensors (not critical chassis or blade events) and the standby CMM does not. If both CMMs have critical CMM events, then the number of major and minor CMM events is examined to decide if a failover should occur. The number of major events is compared, and if they are equal, the number of minor events is used.
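The health score comparison described in this section can be sketched as follows. The condition names and function names are hypothetical, and the sketch does not model the tie-breaking on major and minor event counts; it only shows the sum-of-weights score and the rule that a failover occurs when the standby score is lower.

```python
def health_score(active_conditions, weights=None):
    """Sum the weights of the conditions currently active on a CMM."""
    weights = weights or {}
    # Each condition defaults to a weight of 1, as described above.
    return sum(weights.get(condition, 1) for condition in active_conditions)


def should_failover(active_cmm_conditions, standby_cmm_conditions, weights=None):
    """Fail over only when the standby CMM scores lower (is healthier)."""
    return (health_score(standby_cmm_conditions, weights)
            < health_score(active_cmm_conditions, weights))


if __name__ == "__main__":
    # Hypothetical condition names used only for this example.
    print(should_failover({"trap_ping_failure"}, set()))
```

Because the comparison is strict, two CMMs with equal scores do not trigger an automatic failover in this sketch.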

3.6.3 Manual Failover

The following command can be issued to the active CMM to manually cause a failover to the standby CMM:

cmmset -l cmm -d failover -v [1/any]

Where:

1: Will failover only to a CMM with the same or newer version of firmware.

any: Will failover to any version of firmware.

A manual failover can only be initiated on the active CMM. A failover will only occur if the standby CMM is at least as healthy as the active CMM. Once the command executes, the former standby CMM immediately becomes the active CMM.

If the failover could not occur, the CLI will indicate the reason why the failover could not occur, and a SEL event will be recorded.

In addition, opening the ejector latch on the active CMM will initiate a failover, but only if the standby is at least as healthy as the active.

3.6.4 Scenarios That Force a Failover

The following scenarios cause a failover as long as the standby CMM is operational, even when it is less healthy than the active:

The active CMM is pulled out of the chassis.

The active CMM’s healthy signal is de-asserted.

A “reboot” command issued to the active CMM.

The front panel alarm quiet switch button on the active CMM is pushed for more than five seconds. If the button continues to be pressed for more than 10 seconds, the CMM does not reset.


3.7 CMM Ready Event

The CMM Ready Event is a notification mechanism that informs the user when all CMM modules are fully up and running. The CMM is ready to process any request after receiving this event.

The CMM uses the "CMM Status" sensor when generating the CMM Not Ready event. Please refer to Table 46, “CMM Status Event Strings (CMM Status)” on page 118 for CMM status event strings.

Table 3. CMM Status Event Strings (CMM Status)

Event String | Event Code | Event Severity

“CMM is not ready.” | 1024 | Minor

“CMM is ready.” | 1025 | OK

“CMM is Active” | 1026 | OK

“CMM is Standby” | 1027 | OK

“CMM ready timed out” | 1028 | Minor

A CMM Not Ready Assertion SEL event is generated on a CMM when it transitions from standby mode to active mode during a failover or on the active CMM on power up. The event is only generated on the newly active CMM. The “CMM is Ready” event is generated after all CMM modules (board wrapper processes) are up and running and the SNMP daemon is active.

4 Built-In Self Test (BIST)

The CMM provides for a Built-In Self Test (BIST). The test is run automatically after power up. This test detects flash corruption as well as other critical hardware failures.

Results of the BIST are displayed on the console through the serial port during boot time. Results of BIST are also available through the CLI if the OS successfully boots. If the BIST detects a fatal error, the CMM is not allowed to function as an active CMM.

4.1 BIST Test Flow

The following state diagram shows the order of the tests RedBoot runs following a power-up or front-panel reset. In every state before the CMM becomes active, if there is an error, RedBoot logs the error event into the EEPROM, routes the error message to the serial port, and continues booting. If execution hangs before the OS loads due to the nature of the error, the CMM hangs. If the OS boots successfully, it alerts users to any errors that occurred during boot.


Figure 1. BIST Flow Chart

[Flow chart not reproduced. Stages shown include: Power Up/Reset; RedBoot (RB) image and backup RB image checksum; FPGA image and backup FPGA image checksum; load FPGA image or load backup FPGA image; memory test; FPGA, DS1307, and NIC tests; BlueCat image checksum; IPMB bus test; BlueCat loaded (active CMM).]

The BIST has been broken down into stages consisting of groups of tests that run at certain times throughout the boot process. The following table shows the different BIST stages and the tests associated with each stage:

Table 4. BIST Implementation

Boot-BIST | Early-BIST | Mid-BIST | Late-BIST

RedBoot image checksum | Strobe WDT to extend timeout period | Extended memory test | BlueCat image checksum

FPGA image checksum | | FPGA version check | IPMB bus test

Base memory test | | DS1307 RTC test |

 | | Local PCI bus/NIC presence test |


4.2 Boot-BIST

The code in Boot-BIST is executed at a very early stage of the RedBoot bootstrap, just before FPGA programming and memory module initialization. Boot-BIST verifies the checksums of the RedBoot image and the FPGA image. A checksum error is detected if the calculated checksum does not match the checksum stored in the FIS directory.

Boot-BIST also performs a Base Memory Test on the first 1 MByte of memory. Whenever there is an error, BIST informs the user by displaying a warning message on the console terminal and logs the event to the event-log area.

4.3 Early-BIST

The early BIST stage extends the reset timeout period on the watchdog timer (MAX6374) by strobing GPIO7 on FPGA1. This prevents any possible hardware reset during the BIST process. The watchdog timer is enabled after the ADM1026 GPIO initialization and disabled once it reaches the RedBoot console. The OS enables the watchdog timer again and starts the strobing thread at the kernel level.

4.4 Mid-BIST

This stage of BIST performs the Extended Memory Test to scan for and diagnose possible bit errors in memory. It scans from 1 MByte to 128 MByte. It does not test the memory below 1 MByte because a portion of RedBoot has already been loaded there.

The memory test includes the walking ones test, the 32-bit address test, and the 32-bit inverse address test. Furthermore, voltage and temperature are verified to lie within the ranges the hardware can tolerate. The FPGA firmware version is checked, and an alert is raised if an older version of an FPGA image is detected. The system date and time are also read from the real-time clock and displayed on the console terminal. NIC presence is checked here as well, though the NIC self-test happens later when the driver is loaded.

4.5 Late-BIST

Late-BIST disables the watchdog timer once RedBoot is fully loaded. It then verifies the checksum of the OS image with a stored checksum at the top of flash memory, before proceeding with the boot script execution.

The following diagram shows when during the boot cycle the various stages of BIST are performed.


Figure 2. Timing of BIST Stages

HAL initialization (processor, cache, serial port)

Boot-BIST

FPGA programming

Memory parameters initialization

Early-BIST

Module initialization (flash, zlib, ide)

Mid-BIST

Module initialization (ethernet interface)

Late-BIST

Display copyright banner, and execute boot script

Done

4.6 QuickBoot Feature

When enabled, this feature skips all the diagnostic tests in mid-BIST and late-BIST. However, the Flash Test and Base Memory Test in boot-BIST still execute, even with this feature enabled. QuickBoot is enabled by default.

When the QuickBoot feature is disabled, the user can individually enable or disable the Extended Memory Test (in mid-BIST) and the OS Image Checksum Test (in late-BIST).

4.6.1 Configuring QuickBoot

RedBoot> fconfig

...

Enable QuickBoot during BIST: false

Execute extended memory test: true

OS image checksum at boot: true

...

Update RedBoot non-volatile configuration - are you sure (y/n)? y

The default value of 'Enable QuickBoot during BIST' is true. When 'Enable QuickBoot during BIST' is set to false, two additional options are displayed in the configuration menu: 'Execute extended memory test' and 'OS image checksum at boot'. The user can selectively enable one or both tests while QuickBoot is disabled. Neither option is shown in the configuration menu if QuickBoot is enabled. These options go into effect during the next boot.

4.7 Event Log Area and Event Management

Errors detected by the BIST are stored in an event log. The event-log area is designed to have up to 269 entries. Each entry is 14 bytes. The event-log area is located in EEPROM on the CMM. The BIST can place entries into the event log until it becomes full. Once full, any new entries will be lost. The BIST event log is cleared by the OS once the OS logs any BIST errors into the SEL.

At OS start-up, the CMM reads the contents of BIST results in the reserved event log area and stores the errors as entries in the CMM SEL. This allows the CMM application to take the appropriate action based upon the SEL events as a result of RedBoot BIST tests. If there is not enough space to log the events in the CMM SEL, no results are logged to the CMM SEL.

The BIST event log is erased only after the event log is stored into the CMM SEL. Event strings for BIST events are listed in Section 11, “Health Events” on page 104.
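The behavior of the event-log area can be modeled with a small sketch. This is an illustration under stated assumptions, not the RedBoot or CMM implementation: the class and method names are hypothetical, and the SEL is represented by a plain list with a caller-supplied count of free slots. Only the capacity (269 entries of 14 bytes, 3766 bytes in total), the drop-when-full rule, and the erase-only-after-storing rule come from the text.

```python
MAX_ENTRIES = 269   # capacity of the event-log area
ENTRY_BYTES = 14    # size of each entry


class BistEventLog:
    """Fixed-capacity log: entries arriving after it is full are lost."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        if len(entry) != ENTRY_BYTES:
            raise ValueError("each entry is 14 bytes")
        if len(self.entries) >= MAX_ENTRIES:
            return False  # log full: the new entry is dropped
        self.entries.append(bytes(entry))
        return True

    def flush_to_sel(self, sel, sel_free_slots):
        """Copy all entries to the SEL, then erase the log (all or nothing)."""
        if len(self.entries) > sel_free_slots:
            return False  # not enough SEL space: nothing logged, nothing erased
        sel.extend(self.entries)
        self.entries.clear()
        return True
```

The all-or-nothing flush mirrors the statement that no results are logged when the SEL lacks space, and that the log is erased only after it has been stored.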

4.8 OS Flash Corruption Detection and Recovery Design

The OS is responsible for the flash content integrity at runtime. Flash monitoring under the OS environment can be divided into two parts: Monitoring static images and monitoring dynamic images.

Static images refer to the RedBoot image, FPGA image and BlueCat image in flash. These images should not change throughout the lifetime of the CMM unless they are purposely updated or corrupted. The checksum for these files is written into flash when the images are uploaded.

Dynamic image refers to the OS Flash File System (JFFS2). This image dynamically changes throughout the runtime of the OS.

4.8.1 Monitoring the Static Images

A static test is run every 24 hours during CMM operation. The static test reads each static image (RedBoot, FPGA, BlueCat), calculates the image checksum, and compares it with the checksum in the RedBoot configuration area (FIS). If the checksum test fails, the error is logged to the CMM SEL.


4.8.2 Monitoring the Dynamic Images

For monitoring the dynamic images, the CMM leverages the corruption detection ability from the JFFS(2) flash file system. At OS start-up, the CMM executes an initialization script to mount the JFFS(2) flash partitions (/etc and /home). If a flash corruption is detected, an event is logged to the CMM SEL.

During normal OS operation, flash corruption during file access can also be detected by the JFFS(2) and/or the flash driver. If a flash corruption is detected, an event is logged to the CMM SEL.

4.8.3 CMM Failover

If during normal OS operation a critical error occurs on the active CMM, such as a flash corruption, the standby CMM is checked to see if it is in a healthier state. If the standby CMM is in a healthier state, then a failover will occur. See Section 3, “Redundancy, Synchronization, and Failover” on page 21.

4.9 BIST Test Descriptions

4.9.1 Flash Checksum Test

This test verifies that the RedBoot image and FPGA image are not corrupted. It calculates the CRC32 checksum of the RedBoot image, then compares it with the image checksum stored in the FIS directory. If the two do not match, BIST switches to the backup image. If a checksum mismatch is found for the FPGA image, BIST loads the backup image to program the FPGA device.

4.9.2 Base Memory Test

This test writes the data pattern 55AA55AA into every 4 bytes of the memory below 1 MByte. Its objective is to verify the wire connectivity of the address and data pins between the memory module and the processor. The test first writes the data pattern into the complete first 1 MByte, then verifies the written data pattern by reading it back from the memory module. If the data pattern mismatches, the test logs the error event into the event-log area and routes the error message to the serial port.
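The write-then-verify procedure can be sketched against a simulated memory. This is a minimal illustration, not the RedBoot code: `memory` is a hypothetical stand-in for physical RAM (any mutable sequence of 32-bit words), and the function name is an assumption.

```python
PATTERN = 0x55AA55AA
WORD_BYTES = 4


def base_memory_test(memory, size_bytes=1 << 20):
    """Write PATTERN to every 32-bit word, then read back and verify.

    Returns the byte offsets of the words that did not read back correctly.
    """
    words = size_bytes // WORD_BYTES
    for i in range(words):  # first pass: fill the whole region
        memory[i] = PATTERN
    # second pass: verify every word after all writes have completed
    return [i * WORD_BYTES for i in range(words) if memory[i] != PATTERN]
```

An empty result means every word held the pattern; a real test would log each failing offset instead of returning a list.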

4.9.3 Extended Memory Tests

Walking Ones Test

This test is targeted to verify the data bus wiring by testing the bus one bit at a time. The data bus passes the test if each data bit can be set to 0 and 1 independently of the other data bits.
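A walking ones test can be sketched as follows. The `write` and `read` callables are hypothetical accessors for a single memory location, introduced here for illustration; the sketch checks that each data bit can hold a lone 1 while all other bits are 0.

```python
def walking_ones_test(write, read, address=0, width=32):
    """Return the data bit positions that failed to hold a lone 1."""
    failed = []
    for bit in range(width):
        pattern = 1 << bit          # exactly one bit set
        write(address, pattern)
        if read(address) != pattern:
            failed.append(bit)      # this data line is stuck or shorted
    return failed
```

Only one address needs to be exercised, because the test targets the data bus wiring rather than the memory cells.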

32-Bit Address Test

This test is targeted to verify the address bus wiring. The smallest set of addresses that will cover all possible combinations is the set of “power-of-two” addresses. These addresses are analogous to the set of data values used in the walking ones test. The corresponding memory locations are 0001h, 0002h, 0004h, 0008h, 0010h, 0020h, and so on. In addition, address 0000h must also be tested. To confirm that no two memory locations overlap, an initial data value is first written at each power-of-two offset within the device. Then a new value, an inverted copy of the initial value, is written to the first test offset. It is then verified that the initial data value is still stored at every other power-of-two offset. If a location is found, other than the one just written, that contains the new data value, there is a problem with the current address bit. If no overlapping is found, the procedure is repeated for each of the remaining offsets.
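The power-of-two address test can be sketched on a simulated memory. This is an illustration only: `mem` is a hypothetical mutable sequence of 2**addr_bits words, and the pattern values are assumptions (the text specifies only that the second value is the inverse of the first). The check flags any other test offset that no longer holds the initial value, which includes the case where it holds the new value.

```python
def address_bus_test(mem, addr_bits):
    """Check power-of-two offsets for overlap.

    Returns the first offset found to alias another, or None if none do.
    """
    pattern, antipattern = 0xAAAAAAAA, 0x55555555
    offsets = [0] + [1 << bit for bit in range(addr_bits)]
    for offset in offsets:           # write the initial value everywhere
        mem[offset] = pattern
    for test in offsets:
        mem[test] = antipattern      # write the inverted value at one offset
        for other in offsets:
            if other != test and mem[other] != pattern:
                return other         # another location changed: lines overlap
        mem[test] = pattern          # restore before testing the next offset
    return None
```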

32-Bit Inverse Address Test

This test behaves similarly to the memory test described above, except the addresses are tested in the inverse direction. This test helps to identify a broader scope of possible addressing errors inherent in the memory modules.

4.9.4 FPGA Version Check

This test verifies that the correct FPGA image is programmed into both FPGA chips. It displays the FPGA version on both FPGAs. Both versions should be the same. If the programmed version is older than expected, an event is logged to the SEL.

4.9.5 DS1307 RTC (Real-Time Clock) Test

This test is targeted to verify the functionality of DS1307 RTC chip. This test displays the date/time settings from the RTC and validates the readings. If any readings are found to be non-BCD format, an event is logged to the SEL. This test also captures current time, sleeps a while, and compares the previously captured time and new time. If they differ, it means the RTC is working. If not, an event is logged to the SEL.
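The two checks described above, BCD validation and the ticking check, can be sketched as follows. The function names and the `read_seconds` and `sleep` callables are hypothetical; the sketch does not access real RTC registers.

```python
def is_bcd(byte):
    """True if both nibbles of the byte are decimal digits (0 to 9)."""
    return (byte & 0x0F) <= 9 and (byte >> 4) <= 9


def bcd_to_int(byte):
    """Convert one packed-BCD byte (for example 0x59) to an integer (59)."""
    return (byte >> 4) * 10 + (byte & 0x0F)


def rtc_is_ticking(read_seconds, sleep, delay=2.0):
    """Capture the time, wait, and report whether the reading changed."""
    before = read_seconds()
    sleep(delay)
    return read_seconds() != before
```

On real hardware, `sleep` would be a delay routine and `read_seconds` would read the RTC; here both are passed in so the logic can be exercised without a device.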

4.9.6 NIC Presence/Local PCI Bus Test

This test generates the PCI bus transaction by scanning the PCI buses available on the board. This test detects the two Ethernet devices and verifies each device has the valid Vendor ID and Device ID in the PCI configuration space. NIC internal self-test is not performed here, as the self-test is executed when loading the Ethernet driver.

4.9.7 OS Image Checksum Test

This test verifies that the OS image stored in flash is not corrupted. It calculates the CRC32 checksum of the OS image, and then compares it with the image checksum stored in the FIS directory. If the two do not match, BIST logs an error event to the SEL.

4.9.8 CRC32 Checksum

CRC32 is the 32-bit version of the Cyclic Redundancy Check technique, which is designed to ensure the validity and integrity of the bits within the data. It first generates the diffusion table, which consists of 256 double-word entries; each entry is known as a unique diffusion code. The checksum calculation starts by fetching the first byte in the data buffer and exclusive-ORing it with the temporary checksum value. The resulting value is AND-ed with 0xFF to restrict it to an index from 0 to 255 (decimal). That index is used to fetch a diffusion code from the table. Next, the fetched diffusion code is exclusive-ORed with the most significant 24 bits of the temporary checksum value (effectively shifting the checksum value right by 8 bits). The resulting value is the new temporary checksum value. The calculation is repeated until the last byte in the data buffer has been processed. The final temporary checksum value becomes the final checksum value.
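The table-driven calculation can be sketched as follows. The polynomial, initial value, and final inversion are assumptions: this section does not state them, so the sketch uses the common reflected CRC-32 parameters (the same ones used by zlib). The CMM firmware may use different parameters.

```python
import zlib

POLY = 0xEDB88320  # assumed: reflected CRC-32 polynomial, as used by zlib


def make_table():
    """Build the 256-entry table of 32-bit diffusion codes."""
    table = []
    for index in range(256):
        code = index
        for _ in range(8):
            code = (code >> 1) ^ POLY if code & 1 else code >> 1
        table.append(code)
    return table


TABLE = make_table()


def crc32(data):
    crc = 0xFFFFFFFF                      # assumed initial value
    for byte in data:
        index = (crc ^ byte) & 0xFF       # restrict to a table index 0..255
        crc = TABLE[index] ^ (crc >> 8)   # XOR with the upper 24 bits
    return crc ^ 0xFFFFFFFF               # assumed final inversion


if __name__ == "__main__":
    sample = b"123456789"
    print(hex(crc32(sample)), crc32(sample) == zlib.crc32(sample))
```

Comparing against zlib.crc32 is a convenient way to confirm the table and loop are correct for the assumed parameters.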


4.9.9 IPMB Bus Busy/Not Ready Test

The objective of the test is to identify any potential FPGA lockup before loading the BlueCat. When the FPGA is detected to be locked up, an event indicating which bus actually failed is logged into the Event log.

5 Re-enumeration

5.1 Overview

The Chassis Management Module can re-enumerate devices in the chassis in the event that the chassis loses and then regains CMM management. This allows the CMM to query information on all devices in the chassis on startup if no active CMM in that chassis already holds that information and can provide it through regular synchronization. This is achieved without having to restart the individual blades already present in the chassis.

Re-enumeration provides a way to recover from situations such as double failures, where both CMMs have failed or been accidentally removed from the chassis. Before identifying the contents of the chassis, the CMM first determines whether it should perform this function. The Standby CMM does not re-enumerate its information and relies on the information synchronized from the Active CMM in case a failover occurs. After startup, the Active CMM determines what Entities are present. Then the CMM queries each of these Entities to get state and other information needed to properly manage the Entity as well as the entire chassis. The CMM stays in M2 state until re-enumeration is complete.

The CMM re-enumeration process obtains the following information for each FRU in the chassis:

Presence

M-State

Power Usage

Sensor Data Records

Health Events

Board EKey Usage

Bused EKey Usage

5.2 Re-enumeration on Failover

In case of a forced failover, the newly Active CMM performs re-enumeration if the following conditions are satisfied:

Re-enumeration has not completed on the Active CMM.

Active CMM has not yet synchronized the re-enumerated data over to the Standby CMM.

If the newly Active CMM has to re-enumerate, it switches to the M2 state before starting re-enumeration. The Blue LED uses long blinks to provide visual indication of the state of the CMM. It is recommended that the Entities in the chassis not be activated or deactivated while re-enumeration is in progress.


5.3 Re-enumeration of M5 FRU

If, during re-enumeration, the CMM discovers that a FRU is requesting deactivation (state M5) and no frucontrol script is present (refer to Section 18.5, “FRU Control Script” on page 169), it denies the request and instructs the FRU to return to the Active (M4) state. Otherwise, the CMM executes the frucontrol script and lets it handle the deactivation of the FRU.

5.4 Resolution of EKeys

During re-enumeration, the CMM determines the status of the EKeys of the Boards present in the chassis. If there are interfaces that can be enabled with respect to the other end-point, the CMM completes the EKeying process as described in Section 14.1. If EKeys are enabled to a slot but the CMM was unable to discover a Board in that slot, it assumes that the Board in that slot is in M7 (Communication Lost) state.
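The M7 inference in the last sentence amounts to a set difference. The sketch below assumes slots are identified by integers and that the two input collections have already been gathered; the function name and both parameters are hypothetical.

```python
from typing import Iterable, Set


def infer_m7_slots(slots_with_enabled_ekeys: Iterable[int],
                   discovered_slots: Iterable[int]) -> Set[int]:
    """Return slots that have EKeys enabled toward them but no discovered Board.

    Boards in these slots are assumed to be in M7 (Communication Lost).
    """
    return set(slots_with_enabled_ekeys) - set(discovered_slots)
```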

5.5 Events Regeneration

The Re-enumeration agent sends out the "Set Event Receiver" command to all the Entities in the chassis. On receiving the command, the Entities re-arm event generation for all their internal sensors. This will cause them to transmit the event messages that they have based on the current event conditions. These events will be logged in the SEL.

Note: The regeneration of events may cause events to be logged into the SEL twice. This could result in configured eventaction scripts running twice.

While identifying the chassis content, once the CMM determines that an Entity is a fantray, it automatically sets the fan speeds to the critical level. The speeds are not brought back to the normal level until the CMM has determined that there are no thermal events in the chassis.


6 Process Monitoring and Integrity

6.1 Overview

The Chassis Management Module monitors the general health of processes running on the CMM and can take recovery actions upon detection of failed processes. This is handled by the Process Monitoring Service (PMS).

Upon detecting unhealthy processes, the PMS will take a configurable recovery action. Examples of recovery actions include restarting the process, failing over to the standby CMM, etc.

The PMS itself is also monitored to ensure that it is operating correctly. The PMS is monitored in both a single CMM configuration and a redundant CMM configuration. When faults are detected in the PMS, corrective actions are taken.

The PMS also provides dynamic configuration and status information through the CLI, RPC, and SNMP interfaces. For example, users can administratively lock/disable monitoring of a process while the PMS is running to suit their particular needs. The PMS also provides static configuration to allow customers the ability to tune the static system parameters for the given platform. Examples of these parameters may include monitoring interval, retries, and ramp-up times.

6.1.1 Process Existence Monitoring

Process existence monitoring utilizes the operating system's process table to determine the existence of the process. When the CMM software is started, the PMS initializes and determines the set of processes to monitor for process existence. The PMS periodically queries the operating system for the existence of that set of processes. When a monitored process is found not to exist, the PMS will generate a SEL entry and take a recovery action.
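As a rough illustration of the existence check, the sketch below compares a set of monitored command lines against a snapshot of the process table. It is not the PMS implementation: the function name is hypothetical, and how the snapshot is obtained (for example, from /proc on Linux) is left out.

```python
from typing import Iterable, List


def find_missing_processes(monitored: Iterable[str],
                           process_table: Iterable[str]) -> List[str]:
    """Return monitored command lines that do not appear in the process table.

    `process_table` is a snapshot of the command lines of running processes.
    """
    running = set(process_table)
    return [cmd for cmd in monitored if cmd not in running]


# Example: two monitored daemons, one of which is not running.
monitored = ["/bin/crond", "/sbin/syslogd"]
snapshot = ["/sbin/syslogd", "./cli_svr"]
missing = find_missing_processes(monitored, snapshot)
# For each entry in `missing`, the PMS would generate a SEL entry and
# take the configured recovery action.
```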

Process existence monitoring can be utilized on all permanent processes (processes which exist for the life of the CMM software as a whole). It is particularly useful when monitoring processes that were not specifically developed for running on the CMM. Applications that are provided by the operating system vendor are examples of these types of processes. For the Linux* operating system, processes like syslogd and crond would be good examples.

6.1.2 Thread Watchdog Monitoring

Thread watchdog monitoring requires that the process being monitored notify the PMS of its continued operation. These notifications allow the PMS to monitor the process both for existence and for conditions in which the process locks up. Each thread that requires monitoring within a process using the thread watchdog registers with the PMS. The PMS loops through its list of registered threads and determines whether they are operating. When any thread is determined to be unresponsive (i.e., not notifying the PMS of its continued operation), the PMS generates a SEL entry and takes a recovery action.

Thread watchdog monitoring can be used on all processes that are instrumented with the PMS thread watchdog API. It provides more functionality than process existence monitoring and can be used in conjunction with process integrity monitoring to provide a comprehensive solution. Thread watchdog monitoring is relatively lightweight and can be done every second, although the process being monitored may dictate a (much) lower frequency depending on how often it is capable of feeding the watchdog.
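The register/feed/check cycle of a thread watchdog can be sketched as follows. This does not reproduce the PMS thread watchdog API, which is not shown in this section; the class, its method names, and the `timeout_s` parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class ThreadWatchdog:
    """Tracks the last heartbeat of each registered thread (hypothetical API)."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def register(self, thread_id: str) -> None:
        """Start monitoring a thread; registration counts as a first heartbeat."""
        self._last_seen[thread_id] = self._clock()

    def feed(self, thread_id: str) -> None:
        """Called by a monitored thread to report its continued operation."""
        if thread_id not in self._last_seen:
            raise KeyError(f"thread {thread_id!r} is not registered")
        self._last_seen[thread_id] = self._clock()

    def unresponsive(self) -> List[str]:
        """Return the threads whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(t for t, seen in self._last_seen.items()
                      if now - seen > self._timeout_s)
```

The clock is injectable so that the check can be exercised without waiting; a monitor loop would call `unresponsive()` once per monitoring interval and act on any thread it returns.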

6.1.3 Process Integrity Monitoring

A Process Integrity Executable (PIE) is responsible for determining the health of a process or set of processes. When a PIE finds an unhealthy process, it notifies the PMS of the errant process so that the PMS can take the appropriate action. An example of a PIE is one that monitors the Simple Network Management Protocol (SNMP) process. The PIE could use SNMP get operations to query the SNMP process. If the SNMP process cannot respond to the queries with the appropriate information, the process is considered unhealthy and the PIE notifies the PMS.

Process integrity monitoring may be used in conjunction with existence monitoring to provide a comprehensive solution.
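The probe-then-notify pattern of a PIE can be sketched generically. The function below is illustrative only: `probe` stands in for whatever query the PIE performs (such as an SNMP get), and `notify_pms` stands in for the PIE-to-PMS notification, whose real interface is not described here.

```python
from typing import Callable


def run_integrity_check(process_name: str,
                        probe: Callable[[], bool],
                        notify_pms: Callable[[str], None]) -> bool:
    """Run one integrity probe and report an unhealthy process.

    `probe` returns True when the process answered correctly. A probe that
    raises an exception is treated as a failed check.
    """
    try:
        healthy = bool(probe())
    except Exception:
        healthy = False
    if not healthy:
        notify_pms(process_name)
    return healthy
```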

6.2 Processes Monitored

Below is a list of processes that are monitored for Process Existence on the CMM by the Process Monitoring Service.

Table 5. Processes Monitored

| Process Monitored | Process Command Line / Process Name | Target Name | Monitoring Level |
|---|---|---|---|
| CMM Wrapper Process | ./WrapperProcess 23 | PmsProc23 | Existence and Integrity |
| CMM Wrapper Process | ./WrapperProcess 255 | PmsProc50 | Existence and Integrity |
| SNMP Daemon | /usr/sbin/snmpd -c /etc/snmpd.conf | PmsProc51 | Existence and Integrity |
| CLI Server | ./cli_svr | PmsProc52 | Existence |
| Cron Daemon | /bin/crond | PmsProc100 | Existence |
| Inet Daemon | xinetd -stayalive -reuse | PmsProc101 | Existence |
| Syslog Daemon | /sbin/syslogd | PmsProc102 | Existence |
| CMM Command Handler | ./cmd_hand | PmsProc53 | Existence |
| CMM Blade Process Manager | ./BPM | PmsProc54 | Existence and Integrity |
| CMM Wrapper Process [0-39] | ./WrapperProcess[#] (0-39) | PmsProc[#] (60-99) | Integrity |
| Pms Monitor | ./PmsMonitor | PmsProc3 | Existence and TWL |
| Pms Shadow | ./PmsMonitor shadow | PmsProc2 | Existence and TWL |

6.3 Process Monitoring Targets

The following targets are provided for the Process Monitoring Service under the cmm location:


PmsGlobal: Target for PMS global data

PmsProc[#]: Target for each process monitored

PmsPie[#]: Target for each PMS PIE

Use the following CLI command to view the targets for the processes being monitored.

cmmget -l cmm -d listtargets

The particular processes being monitored will be listed (e.g., PmsProc23, PmsProc100). To view the name of the process being monitored use the following example command:

cmmget -l cmm -t PmsProc34 -d ProcessName

Table 5, “Processes Monitored” contains the list of processes monitored and the command lines and the target names. The ProcessName dataitem will return the Process Command Line.

6.4 Process Monitoring Dataitems

The following dataitems are used to retrieve information on and configure the Process Monitoring Service (used with PmsGlobal or PmsProc[#] targets on the cmm location).

AdminState

RecoveryAction

EscalationAction

ProcessName

OpState

More information on the usage and descriptions of these dataitems can be found in Section 8, “The Command Line Interface (CLI)” on page 71.

6.4.1 Examples

The following example will set the global PMS AdminState to locked:

cmmset -l cmm -t PmsGlobal -d AdminState -v 2

The following example will get the recovery action assigned to a monitored process:

cmmget -l cmm -t PmsProc34 -d RecoveryAction

The following example will get the admin state of a PIE:

cmmget -l cmm -t PmsPie176 -d AdminState
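When these commands are issued from a script, it can help to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of the CMM software; it uses only the `-l`, `-t`, `-d`, and `-v` flags that appear in the examples above and does not execute anything.

```python
from typing import List, Optional


def cmm_command(tool: str, dataitem: str, target: Optional[str] = None,
                value: Optional[str] = None, location: str = "cmm") -> List[str]:
    """Build an argument list for cmmget or cmmset from the flags shown above."""
    if tool not in ("cmmget", "cmmset"):
        raise ValueError("tool must be 'cmmget' or 'cmmset'")
    argv = [tool, "-l", location]
    if target is not None:
        argv += ["-t", target]
    argv += ["-d", dataitem]
    if value is not None:
        argv += ["-v", value]
    return argv


# Rebuild two of the example commands from this section as strings.
set_lock = " ".join(cmm_command("cmmset", "AdminState", target="PmsGlobal", value="2"))
get_action = " ".join(cmm_command("cmmget", "RecoveryAction", target="PmsProc34"))
print(set_lock)
print(get_action)
```

Returning a list rather than a single string keeps each argument separate, which avoids quoting problems if the list is later passed to a process-launching function.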


6.5 SNMP MIB Commands

SNMP commands for Process Monitoring are implemented in the CMM MIB. The list of new commands can be found in the CMM's MIB file or in Section 17, “SNMP” on page 140.

6.6 Process Monitoring CMM Events

The “Process Monitoring Service” sensor types are used to assert and de-assert process status information such as process presence not detected, process recovery failure, or recovery action taken. See Section 11.4, “List of Possible Health Event Strings” on page 108 for event strings, codes, and severities for Process Monitoring.

Event severities are configurable by the user and are unique to the process being monitored.

The processes that are monitored and their default severities are listed below. Severities are configured (while PMS is not running) by changing the ProcessSeverity field in the configuration file (pms.ini). Values for severity: 1 = minor, 2 = major, 3 = critical.

| Process | Default severity setting |
|---|---|
| ./WrapperProcess 23 | ProcessSeverity = 2 |
| ./WrapperProcess 255 | ProcessSeverity = 2 |
| /usr/sbin/snmpd -c /etc/snmpd.conf | ProcessSeverity = 2 |
| ./cli_svr | ProcessSeverity = 2 |
| /bin/crond | ProcessSeverity = 2 |
| xinetd -stayalive -reuse | ProcessSeverity = 2 |
| /sbin/syslogd | ProcessSeverity = 1 |
| ./PmsMonitor | ProcessSeverity = 2 |
| ./PmsMonitor shadow | ProcessSeverity = 2 |
| ./WrapperProcess0 through ./WrapperProcess39 | ProcessSeverity = 2 |
| ./cmd_hand | ProcessSeverity = 3 |
| ./BPM | ProcessSeverity = 3 |
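For tooling that reads or reports these settings, the three severity codes can be given names. The sketch below covers only the numeric mapping stated in this section (1 = minor, 2 = major, 3 = critical); it makes no assumption about the layout of pms.ini, and the enum and function names are hypothetical.

```python
from enum import IntEnum


class ProcessSeverity(IntEnum):
    """Severity codes as listed for the ProcessSeverity field."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def parse_severity(raw: str) -> ProcessSeverity:
    """Convert a ProcessSeverity value such as "2" to a named severity.

    Raises ValueError for anything other than 1, 2, or 3.
    """
    return ProcessSeverity(int(raw.strip()))
```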

Note: The recovery action and escalation action should not be set to "no action" for the xinetd process. This process is involved in data synchronization between the CMMs.

Note: When a user tries, through the CLI API, to change the recovery action for cmd_hand or BPM to a value that is not allowed, the following error string is displayed:

"Recovery action not allowed for this target."

6.7 Failure Scenarios and Eventing

This section describes the process fault scenarios that are detected and handled by the PMS. It also describes the eventing that is associated with the detection and recovery mechanisms. Each scenario contains a brief textual description and a table that further describes the scenario.

In the table, the Description column outlines the current action. The Event Type String defines the text for the event that is written to the SEL. The text in this field describes the portion of the event containing event-specific string (the remainder of the event text is standard for all events).

However, for PMS the target name (sensor name) will be PmsProc<#> instead of the name of the sensor (where # is the unique identifier of the given process).

The UID indicates the unique identifier for the process causing the event. An ID of 1 indicates the monitoring service itself (global) and an ID of # indicates an application process.

The Assert column indicates if the event is asserted or de-asserted. For items that are just written to the SEL for informational purposes, the assertion state is not applicable. However, it is required by the interface and therefore it will be set to de-assert.

The Severity column will define the severity of the event. A severity of Configure indicates that the severity is configurable. The configurable severities are available in the Configuration Database. The remaining columns (SNMP traps, health events, LEDs, and telecommunication alarms) define what indicator will be triggered by the event.

6.7.1 No Action Recovery

In this scenario PMS detects a process fault. The PMS is configured to take no action and therefore disables monitoring of the process.


Table 6. No Action Recovery

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "no action". | "Take no action specified for recovery" | # | N/A | Configure |
| No attempt will be made to recover the process. The PMS will stop monitoring the process. See Section 6.7.11, “Process Administrative Action” on page 53, for information about how to re-enable monitoring and de-assert the event. | "Process existence fault; monitoring disabled" or "Thread watchdog fault; monitoring disabled" or "Process integrity fault; monitoring disabled" | # | Assert | Configure |

6.7.2 Successful Restart Recovery

In this scenario PMS detects a process fault. The configured recovery action is: restart the process. The PMS is able to successfully recover the process by restarting it.

Table 7. Successful Restart Recovery

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "process restart". | "Attempting process restart recovery action" | # | N/A | Configure |
| PMS was successfully able to restart the process. | "Recovery successful" | # | De-assert | OK |


6.7.3 Successful Failover/Restart Recovery

In this scenario PMS detects a process fault. The configured recovery action is: failover to the standby CMM and then restart the failed process. The PMS is able to successfully recover the process by restarting it.

Table 8. Successful Failover/Restart Recovery

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "failover and restart". | "Attempting process failover & restart recovery action" | # | N/A | Configure |
| PMS executes a failover. Note: this step is skipped when running on the standby CMM. | The existing code generates the events for failover. They are separate from process monitoring events and are not described here. | - | N/A | N/A |
| PMS was successfully able to restart the process. Note: PMS will execute this step even if the failover is unsuccessful (standby not available, unhealthy, etc.). | "Recovery successful" | # | De-assert | OK |


6.7.4 Successful Failover/Reboot Recovery

In this scenario, PMS detects a process fault. The configured recovery action is: failover to the standby CMM and upon successfully executing the failover, reboot the now standby CMM. The recovery actions are successful.

Table 9. Successful Failover/Reboot Recovery

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "failover & reboot". | "Attempting failover & reboot recovery action" | # | N/A | Configure |
| PMS executes a failover. Note: this step is skipped when running on the standby CMM. | The existing code generates the events for failover. They are separate from process monitoring events and are not described here. | - | N/A | N/A |
| PMS is running on the standby CMM (the failover was successful, or PMS was already running on the standby). PMS recovers the CMM by rebooting. Upon initialization of PMS after the reboot, the monitor will de-assert the event. | "Monitoring initialized" | # | De-assert | OK |

6.7.5 Failed Failover/Reboot Recovery, Non-Critical

In this scenario, PMS is running on the active CMM and detects a monitored process fault. The severity of the process is configured to a value that is not critical. The configured recovery action is: failover to the standby CMM and upon successfully executing the failover, reboot the now standby CMM. The failover recovery action is unsuccessful (standby is not available, etc.). The process being monitored is not of a critical severity and therefore the reboot of the CMM will not be performed.


Table 10. Failed Failover/Reboot Recovery, Non-Critical

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "failover & reboot". | "Attempting failover & reboot recovery action" | # | N/A | Configure |
| PMS executes a failover. | The existing code generates the events for failover. They are separate from process monitoring events and are not described here. | - | N/A | N/A |
| PMS detects that it is still running on the active CMM. The process is not critical and therefore the reboot operation will not be performed. | "Failover & reboot recovery failure" | # | N/A | Configure |
| No attempt will be made to recover the process. The PMS will stop monitoring the process. See Section 6.7.11, “Process Administrative Action” on page 53, for information about how to re-enable monitoring and de-assert the event. | "Process existence fault; monitoring disabled" or "Thread watchdog fault; monitoring disabled" or "Process integrity fault; monitoring disabled" | # | Assert | Configure |

6.7.6 Failed Failover/Reboot Recovery, Critical

In this scenario, PMS is running on the active CMM and detects a monitored process fault. The severity of the process is configured to be critical. The configured recovery action is: failover to the standby CMM and upon successfully executing the failover, reboot the now standby CMM. The failover recovery action is unsuccessful (standby is not available, etc.). The process being monitored is of a critical severity and therefore the reboot of the CMM will be performed.
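The difference between the critical and non-critical cases of a failed failover can be summarized as a small decision function. This is a sketch of the behavior described in Sections 6.7.4 through 6.7.6 for a PMS that starts on the active CMM; the function name and the returned labels are hypothetical.

```python
def failover_reboot_outcome(failover_succeeded: bool,
                            severity_is_critical: bool) -> str:
    """Return the follow-up step for the "failover & reboot" recovery action.

    Covers the case where PMS is running on the active CMM when the fault
    is detected.
    """
    if failover_succeeded:
        # This CMM is now the standby, so it is rebooted.
        return "reboot"
    if severity_is_critical:
        # Failover failed, but the process is critical: reboot anyway.
        return "reboot"
    # Failover failed and the process is not critical: no reboot is
    # performed and monitoring of the process is disabled.
    return "disable-monitoring"
```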


Table 11. Failed Failover/Reboot Recovery, Critical

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "failover & reboot". | "Attempting failover & reboot recovery action" | # | N/A | Configure |
| PMS executes a failover. | The existing code generates the events for failover. They are separate from process monitoring events and are not described here. | - | N/A | N/A |
| PMS detects that it is still running on the active CMM. The process is critical and therefore the reboot operation is performed. Upon initialization of PMS after the reboot, the monitor will de-assert the event. | "Monitoring initialized" | # | De-assert | OK |

6.7.7 Excessive Restarts, Escalate No Action

In this scenario PMS detects a process fault. The configured recovery action is: restart the process. However, the PMS also detects that the process has exceeded the threshold for excessive process restarts. Therefore, the PMS will execute the escalation action. The escalation action is configured for no action.
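One common way to detect excessive restarts is to count restarts inside a sliding time window. The sketch below uses that approach purely as an illustration; this section does not say how the PMS defines its threshold, so the `max_restarts` and `window_s` parameters, the class name, and the returned labels are all assumptions.

```python
from collections import deque
from typing import Deque


class RestartTracker:
    """Counts recent restarts to choose between recovery and escalation."""

    def __init__(self, max_restarts: int, window_s: float):
        self._max_restarts = max_restarts
        self._window_s = window_s
        self._restarts: Deque[float] = deque()

    def next_action(self, now: float) -> str:
        """Return "restart" or "escalate" for a fault detected at time `now`."""
        # Forget restarts that fall outside the sliding window.
        while self._restarts and now - self._restarts[0] > self._window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self._max_restarts:
            # Threshold reached: hand over to the escalation action.
            return "escalate"
        self._restarts.append(now)
        return "restart"
```

With `max_restarts=2` and `window_s=60.0`, faults at 0 s and 10 s are answered with restarts, a fault at 20 s is escalated, and a fault at 100 s is answered with a restart again because the earlier restarts have aged out of the window.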

Table 12. Existence Fault, Excessive Restarts, Escalate No Action (Sheet 1 of 2)

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "process restart". | "Attempting process restart recovery action" | # | N/A | Configure |
