NEC Express5800/A1040b, Express5800/A2010b, Express5800/A2020b, Express5800/A2040b User Manual

Express5800/A2040b, A2020b, A2010b, A1040b
Machine Check Monitoring Service
User's Guide (Release 1.5)
NEC Corporation
© 2015 NEC Corporation
855-900937
Notes on Using This Manual
No part of this manual may be reproduced in any form without the prior written
permission of NEC Corporation.
The contents of this manual may be revised without prior notice.
The contents of this manual shall not be copied or altered without the prior written
permission of NEC Corporation.
Trademarks
Linux is a trademark or registered trademark of Linus Torvalds in Japan and other
countries.
Red Hat® and Red Hat Enterprise Linux are trademarks or registered trademarks of
Red Hat, Inc. in the United States and other countries.
Intel and its logo are registered trademarks of Intel Corporation in the United States and
other countries.
Emulex and LightPulse are registered trademarks of Emulex Corporation.
Broadcom, NetXtreme, Ethernet@Wirespeed, LiveLink, and Smart Load Balancing are
trademarks of Broadcom Corporation and/or its associated company in the United States and other countries.
All other product, brand, or trade names used in this publication are the trademarks or
registered trademarks of their respective trademark owners.
Related Documents
Express5800/A2040b, A2020b, A2010b, A1040b User's Guide
Capacity Optimization (COPT) User's Guide
Contents
1. Introduction
1.1 Overview
1.2 Operating Environment
1.3 Terminology
1.4 Access Limitation
2. Features of Machine Check Monitoring Service
2.1 Features of Machine Check Monitoring Service
2.2 System Configuration of Machine Check Monitoring Service
2.3 Functional Drawing of Machine Check Monitoring Service
2.4 Features of Machine Check Monitoring Service
3. Installation and Configuration
3.1 Installation
3.1.1 Installing acpi_call
3.1.2 Installing capmonitor
3.1.3 Installing mcemonitor
3.2 Upgrade
3.2.1 Upgrading acpi_call
3.2.2 Upgrading capmonitor
3.2.3 Upgrading mcemonitor
3.3 Configuration
3.3.1 capmonitor configuration file
3.3.2 mcemonitor configuration file
3.3.3 Disabling CMCI
3.3.4 Disabling kdump restart on udev triggered by logical processor offline
3.3.5 Script file to be executed after Core Offline
3.3.6 Disabling EDAC
3.4 Uninstallation
3.4.1 Uninstalling acpi_call
3.4.2 Uninstalling capmonitor
3.4.3 Uninstalling mcemonitor
4. Log
4.1 Logging Destination
4.2 Output Format
5. Command Reference
5.1 Show CPU / Memory Status
6. Messages
6.1 On-screen Message
6.1.1 On-screen messages output from mcemonitor
6.1.2 On-screen messages output from capmonitor
6.1.3 On-screen messages output from acpi_call
6.1.4 Other on-screen messages
6.2 syslog Messages
6.3 Operation Log Messages
6.3.1 Operation log messages output from mcemonitor
6.3.2 Operation log messages output from capmonitor
7. Restrictions and Precautions
7.1 Manual Onlining CPU being Core Offlined
7.2 cpuspeed Error Message Output at OS Shutdown
1. Introduction
1.1 Overview
Machine Check Monitoring Service helps identify faulty hardware components by sending logs of correctable errors that occur on the CPU and memory of a Linux server to the firmware in the server.
If the number of correctable errors exceeds a threshold value, Machine Check Monitoring Service performs Core Offline (offlining of a CPU) or Page Offline (offlining of a memory page) to prevent the system from going down due to an uncorrectable error. If the OS supports the Core Online feature and the system has a spare CPU, Machine Check Monitoring Service automatically adds the spare CPU (Core Online) after Core Offline completes. The Offline and Online operations are performed in cooperation with the kernel on the Linux server.
Machine Check Monitoring Service consists of firmware and software on the Linux server. The software includes mcemonitor (Machine Check Monitoring Service) and capmonitor (Capacity Monitoring Service).
Note
Refer to "Capacity Optimization (COPT) User's Guide" for details of the Core Online feature.
Core Offline, Core Online, and Page Offline are not supported on Express5800/A1040b.
1.2 Operating Environment
Machine Check Monitoring Service requires operating environment as shown below:
Table 1-1 Operating Environment
Hardware: Express5800/A1040b, Express5800/A2010b, Express5800/A2020b, Express5800/A2040b
OS: Red Hat Enterprise Linux 6.6
1.3 Terminology
Terms used in Machine Check Monitoring Service are as shown below:
Table 1-2 Terminology
Term
Description
mcemonitor
Software that provides enhanced RAS features. mcemonitor receives logs from the MCE mechanism of the Linux kernel, analyzes them, and monitors fault occurrence in cooperation with the system. mcemonitor also instructs the kernel to perform Core Offline and Page Offline.
capmonitor
Software that controls Core Offline for a failed core, and the Core Online that the COPT feature provides. Refer to "Capacity Optimization (COPT) User's Guide" for details of the COPT feature.
acpi_call
Driver used to access ACPI.
ACPI
Advanced Configuration and Power Interface. An open industry specification for power management and hardware configuration.
MCE
Machine Check Exception. A hardware error detected by the CPU.
CMC
Corrected Machine Check. A correctable error detected by the CPU.
CPU socket
A single Intel Xeon processor. One CPU socket can have several cores. With Express5800/A2040b, up to 4 CPU sockets can be installed in the server.
CPU core
The part of a CPU that performs arithmetic and other processing. One or more cores can exist in a CPU socket.
Physical CPU socket number
The physical mounting position of a CPU socket in the server. A number from No. 1 to No. 4 is assigned to each CPU socket.
Logical processor
The processor on which the OS actually executes tasks and threads. When the Hyper-Threading feature is enabled, two logical processors exist in one CPU core. When the Hyper-Threading feature is disabled, only one logical processor exists in one CPU core.
1.4 Access Limitation
Only the privileged user (root account) can use mcemonitor.
2. Features of Machine Check Monitoring Service
This section describes features and characteristics of Machine Check Monitoring Service.
2.1 Features of Machine Check Monitoring Service
A server used in a mission-critical domain must be able to identify a failing component, degrade it online, and replace it online before the system goes down.
When Machine Check Monitoring Service detects a correctable failure in the CPU or memory of a Linux server, it sends a log to the firmware in the server to identify the failed component. When the correctable error count exceeds the threshold value, Machine Check Monitoring Service degrades the CPU or memory page online (Core Offline, Page Offline). In addition, if the server uses an OS that supports the Core Online feature and is equipped with a spare CPU, Machine Check Monitoring Service automatically adds the spare CPU (Core Online) after Core Offline, which prevents performance deterioration.
Note
Refer to "Capacity Optimization (COPT) User's Guide" for details of the Core Online feature.
Express5800/A1040b does not support Core Offline, Core Online, and Page Offline.
2.2 System Configuration of Machine Check Monitoring Service
The system configuration of Machine Check Monitoring Service is shown below.
Figure 2-1 System Configuration of Machine Check Monitoring Service
(Diagram not reproduced. Labeled components: Server, OS, MC Scope, mcemonitor, capmonitor, acpi_call, Firmware.)
2.3 Functional Drawing of Machine Check Monitoring Service
Functional drawing of Machine Check Monitoring Service and its associated components are shown below.
Figure 2-2 Functional drawing
(Diagram not reproduced. Labeled components: capmonitor (log), syslog, Firmware, mcemonitor (log), mcemonitor, Hardware (CPU, Memory), Fault, kernel, capmonitor, acpi_call.)
2.4 Features of Machine Check Monitoring Service
Process flow of Machine Check Monitoring Service is shown below.
Table 2-1 Process flow of Machine Check Monitoring Service
Feature: Monitoring CPU failure
Process flow:
When mcemonitor detects a CPU failure, it sends CPU fault information to the firmware.
When the firmware receives the CPU fault information, it determines the failed component.
The firmware manages the failure occurrence count, and when the count exceeds the threshold value, the firmware instructs mcemonitor to perform Core Offline.
When mcemonitor receives the Core Offline instruction from the firmware, it issues a CPU Offline instruction to the kernel. If Hyper-Threading mode is OFF, the one logical CPU in the CPU core is made offline. If Hyper-Threading mode is ON, the two logical CPUs in the CPU core are made offline.
When CPU Offline succeeds, the relevant CPU is disabled for the OS and software, so the number of available CPUs is reduced.
Note: Express5800/A1040b does not support the Core Offline feature.
mcemonitor notifies the firmware of the result of CPU Offline.
When CPU Offline succeeds and the server has a spare CPU, the spare CPU is added automatically (Core Online feature).
Note: For details of Core Online, refer to "Capacity Optimization (COPT) User's Guide".
CPU fault information and the result of CPU Offline can be confirmed with the mcemonitor command. See "5.1 Show CPU / Memory Status" for details of the mcemonitor command.
Feature: Monitoring memory failure
Process flow:
If the correctable memory error count on a memory page exceeds the threshold value, the firmware instructs mcemonitor to perform Memory Page Offline.
When mcemonitor receives the Memory Page Offline instruction from the firmware, it sends a Memory Page Offline instruction to the kernel. Memory Page Offline is performed in units of 4 KB.
When Memory Page Offline succeeds, the relevant memory page is disabled for the OS and software, so the available memory capacity is reduced.
Note: Express5800/A1040b does not support the Page Offline feature.
mcemonitor notifies the firmware of the result of Memory Page Offline.
The result of Memory Page Offline can be confirmed with the mcemonitor command. See "5.1 Show CPU / Memory Status" for details of the mcemonitor command.
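Because a successful Core Offline reduces the number of CPUs available to the OS, one way to observe its effect is to count the online logical processors before and after. The following is a minimal sketch, not part of Machine Check Monitoring Service. It assumes the standard Linux sysfs file /sys/devices/system/cpu/online, which lists online logical processors in a form such as 0-6,8-21,23. The function name count_cpu_list is hypothetical.

```shell
#!/bin/sh
# count_cpu_list: count the logical processors named in a kernel CPU list
# such as "0-6,8-21,23". Hypothetical helper, not part of mcemonitor.
count_cpu_list() {
    list=$1
    total=0
    old_ifs=$IFS
    IFS=,
    # Each comma-separated item is either a single number or a range "N-M".
    for item in $list; do
        case $item in
            *-*) first=${item%-*}
                 last=${item#*-}
                 total=$((total + last - first + 1)) ;;
            ?*)  total=$((total + 1)) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# On a live system, print the current number of online logical processors.
if [ -r /sys/devices/system/cpu/online ]; then
    count_cpu_list "$(cat /sys/devices/system/cpu/online)"
fi
```

Running this before and after a Core Offline should show the count dropping by one with Hyper-Threading OFF, or by two with Hyper-Threading ON, as long as no spare CPU has been added by Core Online in the meantime.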
3. Installation and Configuration
This section describes how to install, configure, and start Machine Check Monitoring Service.
3.1 Installation
Machine Check Monitoring Service is provided as RPM packages. Install them with the rpm command as shown below. Install the packages in this order: acpi_call, capmonitor, mcemonitor.
3.1.1 Installing acpi_call
1. Log in to the target machine as the root user.
2. Download the most recent version of the RPM package from the following website.
http://www.58support.nec.co.jp/global/download/index.html
3. Install acpi_call RPM package of Machine Check Monitoring Service using rpm command.
# rpm -ivh mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64.rpm
Preparing... ########################################## [100%]
1:mcl-acpi_call ########################################## [100%]
Starting acpi_call driver[ OK ]
4. Confirm that the acpi_call RPM package of Machine Check Monitoring Service is installed correctly. The following is displayed when the installation completes successfully.
# rpm -qa | grep acpicall
mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
5. Check that the acpi_call driver has started normally. If the following three acpi_call modules are displayed, the driver has started normally.
# lsmod | grep acpi
acpi_clpcall 6897 0
acpi_capcall 6897 0
acpi_mcecall 6897 0
6. If one of the following messages is displayed, the installation may not have completed. Apply the corresponding solution and repeat from Step 3.
Error message: package mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64 is already installed
Solution: Uninstall acpi_call, and install it again.
Error message: error: unpacking of archive failed on file: cpio: write failed - No space left on device
Solution: Disk space is insufficient. Increase free space, and install it again.
7. Configure /etc/sysconfig/kdump.
Creation of the initrd file for kdump may fail if an external module that is unnecessary for dump collection is incorporated. To prevent this, add MKDUMPRD_ARGS="--allow-missing".
Sample configuration of /etc/sysconfig/kdump:
MKDUMPRD_ARGS="--allow-missing"
With this configuration, the following warning may appear when the kdump service is started. The message indicates that the external module was not incorporated, and it is not a problem.
WARNING: No module xxx found for kernel 2.6.32-504.23.4.el6.x86_64, continuing anyway
(xxx represents the external module name)
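The kdump setting from Step 7 can be checked with a short script. The following is a sketch only. The function name has_allow_missing is hypothetical, and the check assumes the setting is written on a single uncommented line, as in the sample configuration.

```shell
#!/bin/sh
# has_allow_missing: succeed if the given kdump sysconfig file contains an
# uncommented MKDUMPRD_ARGS line that includes --allow-missing.
# Hypothetical helper for illustration only.
has_allow_missing() {
    grep -Eq '^[[:space:]]*MKDUMPRD_ARGS=.*--allow-missing' "$1"
}

# Check the file named in this guide, if it exists on this machine.
conf=/etc/sysconfig/kdump
if [ -r "$conf" ]; then
    if has_allow_missing "$conf"; then
        echo "OK: --allow-missing is set in $conf"
    else
        echo "MISSING: add MKDUMPRD_ARGS=\"--allow-missing\" to $conf"
    fi
fi
```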
3.1.2 Installing capmonitor
1. Log in to the target machine as the root user.
2. Download the most recent version of the RPM package from the following website.
http://www.58support.nec.co.jp/global/download/index.html
3. Install capmonitor RPM package of Machine Check Monitoring Service using rpm command.
# rpm -ivh mcl-capmonitor-2.4-2.12.el6.x86_64.rpm
Preparing... ######################################## [100%]
1:mcl-capmonitor ######################################## [100%]
Starting capmonitor daemon[ OK ]
Note: acpi_call must be installed before installing capmonitor.
If capmonitor is installed while acpi_call has not been installed, the following message is output and installation of capmonitor fails.
# rpm -ivh mcl-capmonitor-2.4-2.12.el6.x86_64.rpm
error: Failed dependencies:
mcl-acpicall is needed by mcl-capmonitor-2.4-2.12.el6.x86_64
4. Confirm that capmonitor RPM package of Machine Check Monitoring Service is installed correctly. The following is displayed when installation completes successfully.
# rpm -qa | grep capmonitor
mcl-capmonitor-2.4-2.12.el6.x86_64
5. Check if capmonitor is started normally. If the following is displayed, capmonitor is started normally.
# ps aux | grep monitor
root 6044 0.0 0.0 4068 324 ? Ss 06:18 0:00 /opt/nec/capmonitor/capmonitor
6. If one of the following messages is displayed, the installation may not have completed. Apply the corresponding solution and repeat from Step 3.
Error message: package mcl-capmonitor-2.4-2.12.el6.x86_64 is already installed
Solution: Uninstall capmonitor, and install it again.
Error message: error: unpacking of archive failed on file: cpio: write failed - No space left on device
Solution: Disk space is insufficient. Increase free space, and install it again.
3.1.3 Installing mcemonitor
1. Log in to the target machine as the root user.
2. Download the most recent version of the RPM package from the following website.
http://www.58support.nec.co.jp/global/download/index.html
3. Install mcemonitor RPM package of Machine Check Monitoring Service using rpm command.
# rpm -ivh mcl-mcemonitor1-2.4-2.02.el6.x86_64.rpm
Preparing... ######################################### [100%]
1:mcl-mcemonitor1 ######################################### [100%]
Starting mcemonitor daemon[ OK ]
Note: acpi_call must be installed before installing mcemonitor.
If mcemonitor is installed while acpi_call has not been installed, the following message is output and installation of mcemonitor fails.
# rpm -ivh mcl-mcemonitor1-2.4-2.02.el6.x86_64.rpm
error: Failed dependencies:
mcl-acpicall is needed by mcl-mcemonitor1-2.4-2.02.el6.x86_64
4. Confirm that mcemonitor RPM package of Machine Check Monitoring Service is installed correctly. The following is displayed when installation completes successfully.
# rpm -qa | grep mcemonitor
mcl-mcemonitor1-2.4-2.02.el6.x86_64
5. Check if mcemonitor is started normally. If the following is displayed, mcemonitor is started normally.
# ps aux | grep monitor
root 6078 0.0 0.0 4076 328 ? Ss 06:19 0:00 /opt/nec/mcemonitor/mcemonitor
6. If one of the following messages is displayed, the installation may not have completed. Apply the corresponding solution and repeat from Step 3.
Error message: package mcl-mcemonitor1-2.4-2.02.el6.x86_64 is already installed
Solution: Uninstall mcemonitor, and install it again.
Error message: error: unpacking of archive failed on file: cpio: write failed - No space left on device
Solution: Disk space is insufficient. Increase free space, and install it again.
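Once all three packages are installed, the individual rpm -qa checks in 3.1.1 to 3.1.3 can be combined into one pass. The following is a sketch under the assumption that the package names begin with mcl-acpicall, mcl-capmonitor, and mcl-mcemonitor1, as in the examples in this guide. The function name missing_packages is hypothetical.

```shell
#!/bin/sh
# missing_packages: read a package list (one name per line, as printed by
# "rpm -qa") on stdin and print each required package name that is absent.
# Prints nothing when all three packages are present.
missing_packages() {
    installed=$(cat)
    for name in mcl-acpicall mcl-capmonitor mcl-mcemonitor1; do
        if ! printf '%s\n' "$installed" | grep -q "^$name-"; then
            echo "$name"
        fi
    done
}

# On a machine with rpm available, report any package that is not installed.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | missing_packages
fi
```

The names are printed in the installation order given in 3.1, so the first name printed is the next package to install.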
3.2 Upgrade
Use rpm command to upgrade Machine Check Monitoring Service from old to new version.
3.2.1 Upgrading acpi_call
1. Log in to the target machine as the root user.
2. Confirm that the installed version of the acpi_call RPM package of Machine Check Monitoring Service is older than the version you are going to upgrade to.
# rpm -qa | grep acpicall
mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
3. Copy RPM to desired directory in target machine. The most recent version of RPM is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
4. Upgrade acpi_call RPM package of Machine Check Monitoring Service using rpm command.
# rpm -Uvh mcl-acpicall-2.4-3.02.2.6.32.504.23.4.el6.x86_64.rpm
Preparing... ########################################### [100%]
1:mcl-acpi_call ########################################### [100%]
Starting acpi_call driver[ OK ]
5. Confirm that acpi_call RPM package of Machine Check Monitoring Service is upgraded correctly. The following is displayed when upgrade completes successfully.
# rpm -qa | grep acpicall
mcl-acpicall-2.4-3.02.2.6.32.504.23.4.el6.x86_64
6. Check if acpi_call driver is started normally. If the following 3 acpi_call are displayed, acpi_call driver is started normally.
# lsmod | grep acpi
acpi_clpcall 6897 0
acpi_capcall 6897 0
acpi_mcecall 6897 0
3.2.2 Upgrading capmonitor
1. Log in to the target machine as the root user.
2. Confirm that the installed version of the capmonitor RPM package of Machine Check Monitoring Service is older than the version you are going to upgrade to.
# rpm -qa | grep capmonitor
mcl-capmonitor-2.4-2.12.el6.x86_64
3. Copy RPM to desired directory in target machine. The most recent version of RPM is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
4. Upgrade capmonitor RPM package of Machine Check Monitoring Service using rpm command.
# rpm -Uvh mcl-capmonitor-2.4-2.13.el6.x86_64.rpm
Preparing... ######################################### [100%]
4048 /opt/nec/capmonitor/capmonitor
Stopping capmonitor[ OK ]
1:mcl-capmonitor ######################################### [100%]
Starting capmonitor daemon[ OK ]
If capmonitor.conf was changed, the following message will be displayed. The message can be safely ignored because your configuration of the capmonitor.conf is preserved. capmonitor.conf.rpmnew is the default capmonitor.conf file.
warning: /opt/nec/capmonitor/conf/capmonitor.conf created as /opt/nec/capmonitor/conf/capmonitor.conf.rpmnew
5. Confirm that capmonitor RPM package of Machine Check Monitoring Service is upgraded correctly. The following is displayed when upgrade completes successfully.
# rpm -qa | grep capmonitor
mcl-capmonitor-2.4-2.13.el6.x86_64
6. Check if capmonitor is started normally. If the following is displayed, capmonitor is started normally.
# ps aux | grep monitor
root 4141 0.0 0.0 4068 352 ? Ss 13:54 0:00 /opt/nec/capmonitor/capmonitor
3.2.3 Upgrading mcemonitor
1. Log in to the target machine as the root user.
2. Confirm that the installed version of the mcemonitor RPM package of Machine Check Monitoring Service is older than the version you are going to upgrade to.
# rpm -qa | grep mcemonitor
mcl-mcemonitor1-2.4-2.02.el6.x86_64
3. Copy RPM to desired directory in target machine. The most recent version of RPM is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
4. Upgrade mcemonitor RPM package of Machine Check Monitoring Service using rpm command.
# rpm -Uvh mcl-mcemonitor1-2.4-2.03.el6.x86_64.rpm
Preparing... ######################################### [100%]
4083 /opt/nec/mcemonitor/mcemonitor
Stopping mcemonitor[ OK ]
1: mcl-mcemonitor1 ######################################### [100%]
Starting mcemonitor daemon[ OK ]
If mcemonitor.conf was changed, the following message will be displayed. The message can be safely ignored because your configuration of the mcemonitor.conf is preserved. mcemonitor.conf.rpmnew is the default mcemonitor.conf file.
warning: /opt/nec/mcemonitor/conf/mcemonitor.conf created as /opt/nec/mcemonitor/conf/mcemonitor.conf.rpmnew
5. Confirm that mcemonitor RPM package of Machine Check Monitoring Service is upgraded correctly. The following is displayed when upgrade completes successfully.
# rpm -qa | grep mcemonitor
mcl-mcemonitor1-2.4-2.03.el6.x86_64
6. Check if mcemonitor is started normally. If the following is displayed, mcemonitor is started normally.
# ps aux | grep monitor
root 4189 0.0 0.0 4076 364 ? Ss 13:56 0:00 /opt/nec/mcemonitor/mcemonitor
3.3 Configuration
Machine Check Monitoring Service provides the following two configuration files. You can change behavior of Machine Check Monitoring Service by modifying these configuration files. This section describes available parameters and how to specify them.
/opt/nec/capmonitor/conf/capmonitor.conf
/opt/nec/mcemonitor/conf/mcemonitor.conf
3.3.1 capmonitor configuration file
capmonitor configuration file /opt/nec/capmonitor/conf/capmonitor.conf is used for configuration related to CPU Core Online.
Note
For details of capmonitor configuration file, refer to "Capacity Optimization (COPT) User's Guide".
3.3.2 mcemonitor configuration file
mcemonitor configuration file /opt/nec/mcemonitor/conf/mcemonitor.conf is used for configuration related to CPU Core Offline and Memory Page Offline. Modify this file according to description below.
mcemonitor.conf
# vi /opt/nec/mcemonitor/conf/mcemonitor.conf
#
# Config file for mcemonitor
#
# specify the internal action in mcemonitor to a cpu error
# off no action
# account only account errors
# soft try to offline CPU
core-ce-action = soft
# specify the internal action in mcemonitor to a page error
# off no action
# soft try to soft-offline page without killing any processes
memory-ce-action = soft
Table 3-1 mcemonitor configuration file (core-ce-action)
Setting in mcemonitor.conf
Description
core-ce-action = soft
Collects logs and performs CPU Core Offline if the CPU error count exceeds the threshold value. (Default)
core-ce-action = account
Collects logs but does not perform CPU Core Offline even if the CPU error count exceeds the threshold value.
core-ce-action = off
Neither collects logs nor performs CPU Core Offline.
Table 3-2 mcemonitor configuration file (memory-ce-action)
Setting in mcemonitor.conf
Description
memory-ce-action = soft
Collects logs and performs Memory Page Offline if the memory error count exceeds the threshold value. (Default) The process running on the relevant memory is transferred to other memory.
memory-ce-action = off
Neither collects logs nor performs Memory Page Offline.
Reboot the system after modifying the configuration file.
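To confirm which values are currently set before rebooting, the configuration file can be read with a small helper. The following is a sketch that assumes the "key = value" layout shown in the mcemonitor.conf listing above, with # starting a comment. The function name conf_value is hypothetical, and printing the last assignment when a key appears more than once is an assumption of this sketch, not documented mcemonitor behavior.

```shell
#!/bin/sh
# conf_value: print the value assigned to KEY in a "key = value" style file.
# Comment lines (starting with "#") are ignored. If KEY is assigned more
# than once, the last assignment is printed (an assumption of this sketch).
conf_value() {
    key=$1
    file=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$file" | tail -n 1
}

# Show both settings from the file named in this guide, if it exists.
conf=/opt/nec/mcemonitor/conf/mcemonitor.conf
if [ -r "$conf" ]; then
    echo "core-ce-action = $(conf_value core-ce-action "$conf")"
    echo "memory-ce-action = $(conf_value memory-ce-action "$conf")"
fi
```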
3.3.3 Disabling CMCI
With the RHEL 6.6 kernel 2.6.32-504.23.4.el6.x86_64, it has been reported that frequent CMCIs (Corrected Machine Check Interrupts), which notify the operating system of detected correctable errors, may cause a system panic.
To change the error detection mode from "interrupt mode" to "polling mode", add "mce=no_cmci" to the kernel line in /boot/efi/EFI/redhat/grub.conf.
Reboot the system after modifying the configuration file.
title Red Hat Enterprise Linux Server (2.6.32-504.23.4.el6.x86_64)
root (hd0,0)
kernel /vmlinuz-2.6.32-504.23.4.el6.x86_64 ro root=/dev/mapper/VolGroup00-LogVol00 rd_LVM_LV=VolGroup00/LogVol00 rd_NO_LUKS nomodeset rd_NO_MD rhgb quiet crashkernel=256M KEYBOARDTYPE=pc KEYTABLE=jp106 LANG=ja_JP.UTF-8 rd_NO_DM mce=no_cmci
initrd /initramfs-2.6.32-504.23.4.el6.x86_64.img
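After the reboot, one way to confirm that the running kernel was started with the parameter is to inspect /proc/cmdline, the standard Linux file that holds the kernel command line. The following is a sketch only, and the function name has_no_cmci is hypothetical.

```shell
#!/bin/sh
# has_no_cmci: succeed if the given kernel command line string contains
# mce=no_cmci as a separate, space-delimited parameter.
has_no_cmci() {
    case " $1 " in
        *" mce=no_cmci "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    if has_no_cmci "$(cat /proc/cmdline)"; then
        echo "mce=no_cmci is present on the kernel command line"
    else
        echo "mce=no_cmci is NOT present on the kernel command line"
    fi
fi
```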
3.3.4 Disabling kdump restart on udev triggered by logical processor offline
To disable the rule, add # at the beginning of the following line in the /etc/udev/rules.d/98-kexec.rules file. The line reads as follows after the change.
#SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="/etc/init.d/kdump restart"
Reload the udev rules after modifying the file.
# udevadm control --reload-rules
Note
With this rule disabled, kdump is instead restarted by a script that capmonitor executes upon completion of Core Offline. Place that script as described in "3.3.5 Script file to be executed after Core Offline".
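The edit to the rules file can also be made non-interactively. The following is a sketch that assumes the rule appears in the file exactly as written in this section, including spacing. The function name disable_cpu_offline_rule is hypothetical. It keeps a backup copy with a .bak suffix and leaves lines that are already commented out unchanged.

```shell
#!/bin/sh
# disable_cpu_offline_rule: comment out the udev rule that restarts kdump on
# CPU offline. FILE is rewritten from a backup copy saved as FILE.bak.
# Lines that do not match the rule exactly are left unchanged.
disable_cpu_offline_rule() {
    file=$1
    cp -p "$file" "$file.bak" &&
    sed '/^SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="\/etc\/init.d\/kdump restart"/s/^/#/' \
        "$file.bak" > "$file"
}

# Usage (as root), followed by the reload command from this section:
#   disable_cpu_offline_rule /etc/udev/rules.d/98-kexec.rules
#   udevadm control --reload-rules
```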
3.3.5 Script file to be executed after Core Offline
capmonitor executes all script files stored in the directory /opt/nec/capmonitor/script/cpu/offline.d upon completion of Core Offline.
If several logical processors are made offline by a single Core Offline, the script files are executed only once, after the last processor is offlined.
To restart kdump in place of the udev rule that was disabled in 3.3.4, place the script /opt/nec/capmonitor/script/03kdump.sh under the directory /opt/nec/capmonitor/script/cpu/offline.d.
If you use software that requires a reboot after Core Offline (that is, after the number of logical processors is reduced), create a script file containing the necessary processes and store it under the directory /opt/nec/capmonitor/script/cpu/offline.d.
Table 3-3 Script under /opt/nec/capmonitor/script/cpu/offline.d
Script file name: 03kdump.sh
Description: Script that restarts the kdump daemon as needed so that a crash dump can be collected after Core Offline.
How to install script file: Copy from /opt/nec/capmonitor/script/ to /opt/nec/capmonitor/script/cpu/offline.d.
Script file name: XX~.sh
XX: Specify the execution order with a 2-digit decimal number. (Scripts are executed starting from the lowest number.)
~: Arbitrary character string
Description: User script. If you use software that requires a reboot after Core Offline, create a script file containing the necessary processes.
How to install script file: Create a script and store it under /opt/nec/capmonitor/script/cpu/offline.d.
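Before relying on the execution order, it can help to preview which files in the directory follow the XX~.sh naming rule and how they sort. The following is a sketch only. It lists names in the ascending order produced by shell pathname expansion, which is assumed here to correspond to the two-digit ordering described in Table 3-3. The function name list_offline_scripts is hypothetical, and nothing is executed.

```shell
#!/bin/sh
# list_offline_scripts: print the names of regular files in DIR whose names
# consist of two decimal digits, an arbitrary string, and ".sh", in the
# ascending order given by shell pathname expansion.
list_offline_scripts() {
    dir=$1
    for path in "$dir"/[0-9][0-9]*.sh; do
        if [ -f "$path" ]; then
            basename "$path"
        fi
    done
}

# Preview the directory named in this guide, if it exists on this machine.
dir=/opt/nec/capmonitor/script/cpu/offline.d
if [ -d "$dir" ]; then
    list_offline_scripts "$dir"
fi
```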
3.3.6 Disabling EDAC
If EDAC is running on the system, Machine Check Monitoring Service does not run correctly. Disable EDAC by creating the file /etc/modprobe.d/disable_edac.conf with the following contents:
install *_edac /bin/true
install edac_* /bin/true
After saving the file, reboot the system. After the reboot, confirm that EDAC is disabled with the following command. If the command displays nothing, no EDAC module is loaded.
# lsmod | grep edac
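The same check can be scripted so that it reports the names of any EDAC modules that are still loaded. The following is a sketch only. The function name edac_modules is hypothetical, and it matches module names against the same two patterns used in disable_edac.conf.

```shell
#!/bin/sh
# edac_modules: read "lsmod" output on stdin and print the names of loaded
# modules whose name starts with "edac_" or ends with "_edac".
# Prints nothing if no such module is loaded.
edac_modules() {
    awk '$1 ~ /^edac_/ || $1 ~ /_edac$/ { print $1 }'
}

# Report the state of the running system, if lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    loaded=$(lsmod 2>/dev/null | edac_modules)
    if [ -n "$loaded" ]; then
        echo "EDAC modules still loaded:"
        echo "$loaded"
    else
        echo "No EDAC modules loaded"
    fi
fi
```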
3.4 Uninstallation
Use the rpm command to uninstall Machine Check Monitoring Service. Uninstall the packages in this order: mcemonitor (3.4.3), capmonitor (3.4.2), acpi_call (3.4.1).
3.4.1 Uninstalling acpi_call
1. Log in to the target machine as the root user.
2. Uninstall acpi_call RPM package of Machine Check Monitoring Service using rpm command.
# rpm -e mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
Note: mcemonitor and capmonitor must be uninstalled before uninstalling acpi_call.
If acpi_call is uninstalled while mcemonitor and capmonitor have not been uninstalled, the following message is output and uninstallation of acpi_call fails.
# rpm -e mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
error: Failed dependencies:
mcl-acpicall is needed by mcl-capmonitor-2.4-2.12.el6.x86_64
mcl-acpicall is needed by mcl-mcemonitor1-2.4-2.02.el6.x86_64
3. Confirm that the acpi_call RPM package of Machine Check Monitoring Service is uninstalled correctly. Uninstallation has completed successfully if "acpicall" is not displayed by the following command.
# rpm -qa | grep acpicall
4. Check that the acpi_call driver is unloaded. If none of the three acpi_call modules is displayed, the driver is uninstalled correctly.
# lsmod | grep acpi
3.4.2 Uninstalling capmonitor
1. Log in to the target machine as the root user.
2. Uninstall capmonitor RPM package of Machine Check Monitoring Service using rpm command.
# rpm -e mcl-capmonitor-2.4-2.12.el6.x86_64
3834 /opt/nec/capmonitor/capmonitor
Stopping capmonitor[ OK ]
3. Confirm that the capmonitor RPM package of Machine Check Monitoring Service was uninstalled correctly. The uninstallation succeeded if "capmonitor" is not displayed by the following command.
# rpm -qa | grep capmonitor
4. Check that capmonitor has stopped. capmonitor has stopped correctly if "capmonitor" is not displayed by the following command.
# ps aux | grep monitor
3.4.3 Uninstalling mcemonitor
1. Log in to the target machine as the root user.
2. Uninstall the mcemonitor RPM package of Machine Check Monitoring Service using the rpm command.
# rpm -e mcl-mcemonitor1-2.4-2.02.el6.x86_64
3871 /opt/nec/mcemonitor/mcemonitor
Stopping mcemonitor[ OK ]
Starting mcelog daemon
3. Confirm that the mcemonitor RPM package of Machine Check Monitoring Service was uninstalled correctly. The uninstallation succeeded if "mcemonitor" is not displayed by the following command.
# rpm -qa | grep mcemonitor
4. Check that mcemonitor has stopped. mcemonitor has stopped correctly if "mcemonitor" is not displayed by the following command.
# ps aux | grep monitor
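The dependency order described in the note in 3.4.1 (mcemonitor and capmonitor first, acpi_call last) can be sketched as a dry run that only prints the commands. This is an illustration under stated assumptions: the package names are taken from the examples above with the version strings dropped, which assumes that only one version of each package is installed. The function name print_uninstall_commands is hypothetical.

```shell
#!/bin/sh
# print_uninstall_commands (hypothetical helper): prints, without running
# them, the rpm commands in an order that satisfies the dependency of
# mcemonitor and capmonitor on acpi_call.
# Assumption: bare package names (no version) identify the installed packages.
print_uninstall_commands() {
    for pkg in mcl-mcemonitor1 mcl-capmonitor mcl-acpicall; do
        echo "rpm -e $pkg"
    done
}
```

Review the printed commands before running them as the root user, and substitute the full package names reported by `rpm -qa` if more than one version is installed.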
4. Log
4.1 Logging Destination
Machine Check Monitoring Service writes logs to the following destinations:
/var/opt/nec/mcemonitor (Fault monitoring log)
/var/opt/nec/capmonitor (Core Offline log, including logs related to Core Online of COPT)
4.2 Output Format
The following is an example of the logs that Machine Check Monitoring Service outputs.
(/var/opt/nec/mcemonitor)
Tue Feb 19 21:03:39 2013 : CPU 7
Tue Feb 19 21:03:39 2013 : BANK 0
Tue Feb 19 21:03:39 2013 : TSC 0
Tue Feb 19 21:03:39 2013 : RIP 00:0
Tue Feb 19 21:03:39 2013 : MISC 0
Tue Feb 19 21:03:39 2013 : ADDR 0
Tue Feb 19 21:03:39 2013 : STATUS 0x9000000000000000
Tue Feb 19 21:03:39 2013 : MCGSTATUS 0
Tue Feb 19 21:03:39 2013 : CPUID Vendor Intel Family 6 Model 62
Tue Feb 19 21:03:39 2013 : TIME 1361275419 Tue Feb 19 21:03:39 2013
Tue Feb 19 21:03:39 2013 : SOCKETID 0
Tue Feb 19 21:03:39 2013 : APICID 14
Tue Feb 19 21:03:39 2013 : MCGCAP 0x5000c20
Tue Feb 19 21:03:39 2013 :
Tue Feb 19 21:03:39 2013 : Offlining CPU 7 due to corrected error threshold.
Tue Feb 19 21:03:39 2013 : Offlining CPU 22 due to corrected error threshold.
Tue Feb 19 21:03:39 2013 : Offlining CPU 7 succeeded.
Tue Feb 19 21:03:39 2013 : Offlining CPU 22 succeeded.
(/var/opt/nec/capmonitor)
Tue Feb 19 21:03:39 2013 : CPU 7 is now offline.
Tue Feb 19 21:03:39 2013 : CPU 24 is now online.
Table 4-1 Log output by Machine Check Monitoring Service
No.
Field
Description
①
Logged date and time
Shows the date and time of the OS.
②
Message body
Shows the body of the log message. See "6.3 Operation Log Messages" for details of the log message.
③
Core Offline log
Shows the Core Offline log output from capmonitor (for example, "CPU 7 is now offline."). See "6.3 Operation Log Messages" for details of the log message.
④
Core Online log
Shows the Core Online log output from capmonitor (for example, "CPU 24 is now online."). See "6.3 Operation Log Messages" for details of the log message.
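Because every log line consists of a timestamp, the separator " : ", and a message body, the log can be filtered with standard tools. The following is a minimal sketch, assuming that the line format matches the example in 4.2 exactly. The function name offlined_cpus is hypothetical and is not part of Machine Check Monitoring Service.

```shell
#!/bin/sh
# offlined_cpus (hypothetical helper): reads a fault monitoring log on
# standard input, splits each line at " : " into timestamp and message body,
# and prints the CPU number from every "Offlining CPU <n> succeeded." message.
offlined_cpus() {
    awk -F' : ' '$2 ~ /^Offlining CPU [0-9]+ succeeded\.$/ {
        split($2, words, " ")   # words[3] holds the CPU number
        print words[3]
    }'
}
```

For example, `offlined_cpus < /var/opt/nec/mcemonitor` lists the logical CPUs that were offlined successfully according to the log.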
5. Command Reference
5.1 Show CPU / Memory Status
You can view CPU fault information and the offline state of CPUs and memory pages by using the mcemonitor command.
The following shows command options.
Name
mcemonitor – Outputs state of CPU / Memory page to standard output.
Syntax
mcemonitor [ --version ]
mcemonitor [ --client | --client=core | --client=page ]
Description
The mcemonitor command displays CPU fault information and the offline state of CPUs and memory pages.
Option
--version Shows version information of mcemonitor command.
--client Shows CPU fault information and offline state of CPU / Memory page.
--client=core Shows CPU fault information and offline state of CPU.
--client=page Shows offline state of Memory page.
Return value
0: Normal End
1: Abnormal End
Display format
# /opt/nec/mcemonitor/mcemonitor --client
Per page status corrected error over threshold:
100000: offline-failed
10000000: offline
20000000: offline
:
:
Per page status uncorrected error:
1abc40000
1abc90000
:
:
CPU errors
CPU1/core2
corrected errors:
1 total
uncorrected errors:
0 total
CPU4/core1
corrected errors:
10 total
uncorrected errors:
0 total
CPU4/core2
corrected errors:
10 total
uncorrected errors:
0 total
:
:
CPU1/uncore
corrected errors:
1 total
uncorrected errors:
0 total
:
:
Per CPU status corrected error over threshold:
CPU4/core1:
/sys/devices/system/cpu5 offline-failed
/sys/devices/system/cpu15 offline
CPU4/core2:
/sys/devices/system/cpu6 offline
/sys/devices/system/cpu16 online
Table 5-1 mcemonitor command
Item name
Description
Per page status corrected error over threshold:
Shows result of Memory Page Offline.
100000: offline-failed
Indicates that offlining failed for the memory page at address 0x100000.
10000000: offline
Indicates that the memory page at address 0x10000000 was offlined.
Per page status uncorrected error:
Shows the memory pages on which an uncorrected error occurred.
CPU errors
Shows CPU fault information.
CPUx/corey
Shows fault information of a CPU core. CPUx indicates the physical CPU socket number (x). corey indicates the CPU core number (y).
corrected errors:
Shows the number of correctable errors that occurred.
x total
Indicates that errors occurred x times.
uncorrected errors:
Shows the number of uncorrectable errors that occurred.
CPU1/uncore
Shows fault information of the CPU uncore.
Per CPU status corrected error over threshold:
Shows result of CPU Offline.
/sys/devices/system/cpu5 offline-failed
Indicates that offlining logical processor 5 failed.
/sys/devices/system/cpu15 offline
Indicates that offlining logical processor 15 succeeded.
/sys/devices/system/cpu16 online
Indicates that logical processor 16, which had been offlined, was brought back online by a user.
Note: This CPU was offlined because of a fault. Do not bring it online.
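Entries marked offline-failed in the display format above are the ones most likely to need attention. The following is a minimal sketch that extracts them, assuming that the output of mcemonitor --client matches the display format shown in this section. The function name offline_failed_entries is hypothetical and is not part of Machine Check Monitoring Service.

```shell
#!/bin/sh
# offline_failed_entries (hypothetical helper): reads "mcemonitor --client"
# output on standard input and prints the memory page address or CPU sysfs
# path of every entry whose last field is "offline-failed".
offline_failed_entries() {
    awk '$NF == "offline-failed" {
        sub(/:$/, "", $1)   # page entries end in ":", CPU entries do not
        print $1
    }'
}
```

Run it as `/opt/nec/mcemonitor/mcemonitor --client | offline_failed_entries`. An empty result means that no entry in the output is marked offline-failed.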
6. Messages
6.1 On-screen Message
6.1.1 On-screen messages output from mcemonitor
The following table shows the on-screen messages (related to fault monitoring) that mcemonitor outputs.
Table 6-1 On-screen messages output from mcemonitor
No.
Message
Meaning
Action
1
Cannot open logfile /var/opt/nec/mcemonitor mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
Failed to open log file, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
2
cannot open socket mcemonitor will continue to be run safely. Please retry operation.
Failed to communicate with mcemonitor. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
3
cannot connect server mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
Failed to communicate with mcemonitor. After mcemonitor is restarted by cron, run the command again.
After cron restarts mcemonitor automatically, run /opt/nec/mcemonitor/mcemonitor --client again.
4
failed to write socket mcemonitor will continue to be run safely. Please retry operation.
Failed to communicate with mcemonitor. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
5
failed to read mcemonitor will continue to be run safely. Please retry operation.
Failed to communicate with mcemonitor. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
6
Cannot reopen logfile /var/opt/nec/mcemonitor mcemonitor found a continuable error. mcemonitor will continue to be run safely.
Failed to reopen the log file, but mcemonitor continues operation.
mcemonitor continues operation. No action is needed.
7
Usage: ./mcemonitor --client : display core & page status
./mcemonitor --client=core : display core status
./mcemonitor --client=page : display page status
Shows usage of mcemonitor.
8
mcemonitor Version x.x
Shows mcemonitor version.
9
out of memory mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
After cron restarts mcemonitor automatically, run /opt/nec/mcemonitor/mcemonitor --client again.
10
Did not receive credentials over client unix socket mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
11
rejected client access from pid:xx uid:yy gid:zz mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
12
error while reading from client mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
13
accept failed on client socket mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
14
Cannot enable credentials passing on client socket mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
15
mcemonitor too busy mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
16
popen error
Failed to read the command line.
mcemonitorcmd cannot be run from the command line. Do not run mcemonitorcmd from the command line.
17
/proc/xxx/cmdline read error
Failed to read the command line.
18
Can't execute this command only
Cannot execute this command.
19
Invalid Arguments
An invalid argument was specified.
20
insmod: can't read '/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko': No such file or directory
Failed to load acpi_mcecall.ko because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
21
/dev/mcelog was not found.
mcemonitor failed to start because /dev/mcelog was not found.
Restart the OS.
22
/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
mcemonitor failed to start because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
Reinstall mcemonitor.
23
/opt/nec/mcemonitor/mcemonitor was not found.
mcemonitor failed to start because /opt/nec/mcemonitor/mcemonitor was not found.
Reinstall mcemonitor.
24
/opt/nec/mcemonitor/mcemonitorcmd was not found.
mcemonitor failed to start because /opt/nec/mcemonitor/mcemonitorcmd was not found.
Reinstall mcemonitor.
25
/var/opt/nec was not found.
mcemonitor failed to start because /var/opt/nec was not found.
Reinstall mcemonitor.
26
Unknown mcemonitor mode xx. Valid daemon
mcemonitor is not in daemon mode.
Specify daemon for MCEMONITOR_MODE in /etc/rc.d/init.d/mcemonitord.
27
mcemonitor already running.
mcemonitor is already running.
28
/etc/rc.d/init.d/mcemonitord start already running.
mcemonitor is already starting.
29
mcemonitor already stopped.
mcemonitor is already stopped.
30
/etc/rc.d/init.d/mcemonitord stop already running.
mcemonitor is already stopping.
31
Usage: mcemonitord {start|stop|try-restart|restart|status|force-reload|reload}
Shows usage of /etc/rc.d/init.d/mcemonitord.
32
Starting mcemonitor daemon [Result]
Starting mcemonitor.
If the result is [Fail], mcemonitor failed to start. Check the mcemonitor log, and restart mcemonitor.
33
Stopping mcemonitor [Result]
Stopping mcemonitor.
If the result is [Fail], mcemonitor failed to stop. Check the mcemonitor log, and stop mcemonitor again.
34
acpi_mcecall was not loaded
Failed to load mcemonitor because acpi_mcecall was not loaded.
Reinstall acpi_call and restart mcemonitor.
6.1.2 On-screen messages output from capmonitor
The following table shows the on-screen messages (related to Core Offline, including messages from the Core Online feature of COPT) that capmonitor outputs.
Table 6-2 On-screen messages output from capmonitor
No.
Message
Meaning
Action
1
Cannot open logfile /var/opt/nec/capmonitor capmonitor exited due to a system error. capmonitor will be restarted by cron.
Failed to open log file, and capmonitor exited. capmonitor is restarted by cron.
capmonitor is restarted automatically by cron.
2
cannot open socket capmonitor will continue to be run safely. Please retry operation.
Failed to communicate with capmonitor. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
3
cannot connect server capmonitor exited due to a system error. capmonitor will be restarted by cron.
Failed to communicate with capmonitor. After capmonitor is restarted by cron, run the command again.
After cron restarts capmonitor automatically, run /opt/nec/capmonitor/capmonitor --client=addtime again.
4
failed to write socket capmonitor will continue to be run safely. Please retry operation.
Failed to communicate with capmonitor. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
5
failed to read capmonitor will continue to be run safely. Please retry operation.
Failed to communicate with capmonitor. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
6
Cannot reopen logfile /var/opt/nec/capmonitor capmonitor found a continuable error. capmonitor will continue to be run safely.
Failed to reopen the log file, but capmonitor continues operation.
capmonitor continues operation. No action is needed.
7
Usage: ./capmonitor --client=addtime : display cpu core hot-add processing time
Shows usage of capmonitor.
8
capmonitor Version x.x
Shows capmonitor version.
9
out of memory capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor exited. capmonitor is restarted by cron.
After cron restarts capmonitor automatically, run /opt/nec/capmonitor/capmonitor --client=addtime again.
10
Did not receive credentials over client unix socket capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
11
rejected client access from pid:xx uid:yy gid:zz capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
12
error while reading from client capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
13
accept failed on client socket capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
14
Cannot enable credentials passing on client socket capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
15
capmonitor too busy capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
16
popen error
Failed to read the command line.
capmonitorcmd cannot be run from the command line. Do not run capmonitorcmd from the command line.
17
/proc/xxx/cmdline read error
Failed to read the command line.
18
Can't execute this command only
Cannot execute this command.
19
Invalid Arguments
An invalid argument was specified.
20
insmod: can't read '/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko': No such file or directory
Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
Reinstall capmonitor.
21
/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
capmonitor failed to start because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
Reinstall capmonitor.
22
/opt/nec/capmonitor/capmonitor was not found.
capmonitor failed to start because /opt/nec/capmonitor/capmonitor was not found.
Reinstall capmonitor.
23
/opt/nec/capmonitor/capmonitorcmd was not found.
capmonitor failed to start because /opt/nec/capmonitor/capmonitorcmd was not found.
Reinstall capmonitor.
24
/var/opt/nec was not found.
capmonitor failed to start because /var/opt/nec was not found.
Reinstall capmonitor.
25
Unknown capmonitor mode xx. Valid daemon
Unknown mode. Only daemon mode is valid.
Specify daemon for CAPMONITOR_MODE of /etc/rc.d/init.d/capmonitord.
26
capmonitor already running.
capmonitor is already running.
27
/etc/rc.d/init.d/capmonitord start already running.
capmonitor is already starting.
28
capmonitor already stopped.
capmonitor is already stopped.
29
/etc/rc.d/init.d/capmonitord stop already running.
capmonitor is already stopping.
30
Usage: capmonitord {start|stop|try-restart|restart|status|force-reload|reload}
Shows usage of /etc/rc.d/init.d/capmonitord.
31
Starting capmonitor daemon [Result]
Starting capmonitor.
If the result is [Fail], capmonitor failed to start. Check the capmonitor log, and restart capmonitor.
32
Stopping capmonitor [Result]
Stopping capmonitor.
If the result is [Fail], capmonitor failed to stop. Check the capmonitor log, and stop capmonitor again.
33
acpi_capcall was not loaded
Failed to load capmonitor because acpi_capcall was not loaded.
Reinstall acpi_call and restart capmonitor.
6.1.3 On-screen messages output from acpi_call
The following table shows the on-screen messages that acpi_call outputs.
Table 6-3 On-screen messages output from acpi_call
No.
Message
Meaning
Action
1
insmod: can't read '/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko': No such file or directory
Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
Reinstall acpi_call.
2
insmod: can't read '/opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko': No such file or directory
Failed to load acpi_clpcall.ko because /opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko was not found.
Reinstall acpi_call.
3
insmod: can't read '/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko': No such file or directory
Failed to load acpi_mcecall.ko because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
Reinstall acpi_call.
4
/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
Reinstall acpi_call.
5
/opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko was not found.
Failed to load acpi_clpcall.ko because /opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko was not found.
Reinstall acpi_call.
6
/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
Failed to load acpi_mcecall.ko because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
Reinstall acpi_call.
7
Usage: acpicalld {start|stop|restart}
Shows usage of /etc/rc.d/init.d/acpicalld.
8
Starting acpi_call driver
Loading acpi_call driver.
9
skip to load acpi_call for $KERNEL
Skipped loading acpi_call because of a kernel version mismatch.
Reinstall the acpi_call package that corresponds to the kernel version.
10
Stopping acpi_call driver [Result]
Stopping acpi_call driver.
6.1.4 Other on-screen messages
The following table shows other on-screen messages related to Machine Check Monitoring Service.
Table 6-4 Other on-screen messages
No.
Message
Meaning
Action
1
Disabling ondemand cpu frequency scaling: /etc/rc0.d/K99cpuspeed: line 288: /sys/devices/system/cpu/cpuxx/cpufreq/scaling_governor: No such file or directory
The cpuspeed stop processing was not executed for CPU xx because CPU xx is offline.
It is not a problem that the cpuspeed stop processing was not executed for an offlined CPU. No action is needed.
6.2 syslog Messages
The following table shows messages output to syslog.
Table 6-5 syslog messages
No.
Message
Meaning
Action
1
The number of active cores exceeded the number of core license.
The number of active cores exceeded the number of core licenses.
Offline CPUs until the number of active cores no longer exceeds the number of core licenses.
2
cpuspeed: Disabling ondemand cpu frequency scaling governor
cpuspeed is stopped. This message is output when a CPU is onlined.
3
cpuspeed: Enabling ondemand cpu frequency scaling governor
cpuspeed is started. This message is output when a CPU is onlined.
4
kdump: kexec: unloaded kdump kernel kdump: stopped
kdump is stopped. This message is output when a CPU is onlined or offlined.
5
kexec: loaded kdump kernel kdump: started up
kdump is started. This message is output when a CPU is onlined or offlined.
6
[Hardware Error]: Machine check events logged
MCE / CMC occurred.
7
CPU x is now offline
CPU x is offlined.
8
kernel: soft_offline: <Page number>: <Message>
Memory page xx was offlined in Soft Mode.
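The fault-related entries of Table 6-5 (Nos. 1, 6, 7, and 8) can be picked out of a syslog file with a simple filter. The following is a minimal sketch, assuming that the message text appears in syslog exactly as listed in the table. The function name mc_syslog_events and the syslog file location used in the usage note are assumptions and are not defined by this guide.

```shell
#!/bin/sh
# mc_syslog_events (hypothetical helper): reads syslog text on standard input
# and prints the lines that contain one of the fault-related messages listed
# in Table 6-5 (license excess, machine check, CPU offline, soft_offline).
mc_syslog_events() {
    awk '/exceeded the number of core license/ ||
         /Machine check events logged/ ||
         /CPU [0-9]+ is now offline/ ||
         /soft_offline:/'
}
```

For example, `mc_syslog_events < /var/log/messages`, where /var/log/messages is assumed to be the syslog destination on the target system.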
6.3 Operation Log Messages
6.3.1 Operation log messages output from mcemonitor
The following table shows the operation log messages (related to fault monitoring) that mcemonitor outputs.
Table 6-6 Operation log messages output from mcemonitor
No.
Message
Meaning
Action
Operation Log
1
Error: 1003 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
2
Error: 1007 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
3
Error: 1008 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
4
Error: 1010 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
5
Error: 1015 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
6
Error: 1016 <error type> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
7
Error: 1017 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
8
Error: 1018 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
9
Error: 1019 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
10
Error: 1025 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
11
Error: 1026 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
12
Warning: 1032 mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
13
Error: 1033 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
14
Error: 1034 mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
15
Error: 1035 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
16
Error: 1036 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
17
Error: 1038 mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
18
Error: 1039 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
19
Error: 1040 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
20
Error: 1041 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
21
Error: 1042 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
22
Error: 1043 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
23
Warning: 1045 mcemonitor found a continuable error. mcemonitor will continue to be run safely. mcemonitor will run with default value.
An error occurred in a system-related function, but mcemonitor continues operation with the default values of mcemonitor.conf.
No action is needed because mcemonitor continues operation with the default values of mcemonitor.conf.
24
Warning: 1046 memory-ce-action value is unspecified in mcemonitor.conf. mcemonitor will run with default value.
Failed to read memory-ce-action in mcemonitor.conf. mcemonitor will run with the default value of memory-ce-action.
mcemonitor will run with the default value of memory-ce-action. Review the value of memory-ce-action in mcemonitor.conf.
25
Warning: 1046 core-ce-action value is unspecified in mcemonitor.conf. mcemonitor will run with default value.
Failed to read core-ce-action in mcemonitor.conf. mcemonitor will run with the default value of core-ce-action.
mcemonitor will run with the default value of core-ce-action. Review the value of core-ce-action in mcemonitor.conf.
26
Warning: 1046 memory-ce-action and core-ce-action values are unspecified in mcemonitor.conf. mcemonitor will run with default value.
Failed to read memory-ce-action and core-ce-action in mcemonitor.conf. mcemonitor will run with the default values of memory-ce-action and core-ce-action.
mcemonitor will run with the default values of memory-ce-action and core-ce-action. Review the values of memory-ce-action and core-ce-action in mcemonitor.conf.
27
Error: 5001 <error type> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
28
Error: 5006 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
29
Warning: 5007 mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
30
Error: 5008 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
31
Error: 5009 <error type> <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
32
Error: 5010 <error type> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
33
Error: 5011 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
34
Error: 5012 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
35
Error: 5013 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
36
Error: 5015 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
37
Error: 5016 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operation.
mcemonitor continues running. No action is needed.
38
Error: 5017 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
39
Error: 5018 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
40
Error: 5024 mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
41
Error: 5025 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
42
Error: 5026 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
43
Error: 5036 <error cause> mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function, but mcemonitor continues operating. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
44
Error: 6001 <error type> <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
45
Error: 6002 <error type> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
46
Error: 6003 <error type> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
47
Error: 6004 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
48
Error: 6005 <error type> <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
49
Error: 6006 <error type> <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
50
Error: 6007 <error type> <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
51
Error: 7001 mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
52
Cannot open /dev/mcelog. <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
53
MCE_GET_RECORD_LEN <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
54
MCE_GET_LOG_LEN <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
55
no data in mce record mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
56
mcelog read <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
57
warning: xxxx bytes ignored in each record consider an update mcemonitor can not analyze mcelog due to the inconsistency of log format. mcemonitor needs to be updated.
The mce structure in the Linux kernel may have changed. Update mcemonitor.
Install the latest version of mcemonitor.
58
Cannot open pidfile <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
59
CPU 56 BANK 128 TSC 0x5e6974d4256 RIP 00:0 MISC 0 ADDR 0 STATUS 0x40000000883c0c00 MCGSTATUS 0 CPUID Vendor Intel Family 6 Model 62 TIME 1366542972 Sun Apr 21 20:16:12 2013 SOCKETID 0 APICID 23 MCGCAP 0x5000c20
An MCE/CMC occurred.
60
error while recieving from kernel mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
61
cannot open NETLINK socket <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
62
cannot bind to NETLINK socket <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
63
cannot set FD_CLOEXEC flag on fd <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
64
poll table overflow mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
65
cannot set FD_CLOEXEC flag on fd mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
66
poll table overflow mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
67
poll error <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
68
Offlining CPU xx due to corrected error threshold
CPU xx is offlined because corrected errors exceeded the threshold.
69
Not offlining CPU 0 because of kernel running on CPU 0.
CPU 0 cannot be offlined because the kernel is running on it.
70
Offlining CPU xx succeeded
Offlining of CPU xx succeeded.
71
Offlining CPU xx failed
Offlining of CPU xx failed because the CPU was in use.
72
Kernel does not support page offline interface
The kernel does not support Memory Page Offline.
73
Corrected memory errors on page xx exceeded threshold
Correctable memory errors on page xx exceeded the threshold value.
74
Offlining page xx due to corrected error threshold
Memory errors on address xx exceeded the threshold value. The memory page is offlined.
75
Not offlining page xx because this offlining page has already succeeded.
The memory page is not offlined because memory address xx has already been offlined.
76
Not offlining page xx because this offlining page has already failed.
The memory page is not offlined because offlining of memory address xx has already failed.
77
Soft offlining of page xx succeeded.
Memory Page Offline was executed on address xx in Soft Mode.
78
Soft offlining of page xx failed.
The memory page could not be offlined because the page was in use.
Check syslog to confirm the attributes of the page.
79
mcemonitor already running
mcemonitor is already running.
80
cannot open listening socket <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
81
cannot bind to client unix socket <error cause> /var/run/mcemonitor-client mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
82
Error: 6008 mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
83
Error: 1047 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
84
Error: 1048 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
85
Error: 1049 <error cause> mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor is running safely.
mcemonitor is running safely. No action is needed.
86
Error: 1050 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
87
Error: 1051 <error cause> mcemonitor exited due to a system error.
An error occurred in a system-related function during the stop phase, and mcemonitor exited.
Restart mcemonitor, and then stop it.
88
Error: 1052 <error cause> mcemonitor exited due to a system error.
An error occurred in a system-related function during the stop phase, and mcemonitor exited.
Restart mcemonitor, and then stop it.
89
Error: 1053 <error cause> mcemonitor exited due to a system error.
An error occurred in a system-related function during the stop phase, and mcemonitor exited.
Restart mcemonitor, and then stop it.
90
Error: 5037 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
91
Error: 5038 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
92
Error: 5039 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
93
Error: 6009 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
94
Error: 6010 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
cron restarts mcemonitor automatically.
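The MCE/CMC record shown as message No. 59 packs several values into one line. When such a record has to be quoted in a support request, it can help to split it into named fields first. The following Python sketch is only an illustration: the function names parse_mce_record and to_number are hypothetical, and treating the record as whitespace-separated "KEY value" pairs is an assumption based solely on the sample in the table, not a documented mcemonitor format.

```python
import re

# Keys as they appear in the sample record (message No. 59). Treating the
# record as whitespace-separated "KEY value" pairs is an assumption.
KEYS = ("CPU", "BANK", "TSC", "RIP", "MISC", "ADDR", "STATUS",
        "MCGSTATUS", "TIME", "SOCKETID", "APICID", "MCGCAP")


def to_number(value):
    """Return an int for decimal or 0x-prefixed text, else the text itself."""
    try:
        return int(value, 0)
    except ValueError:
        return value


def parse_mce_record(text):
    """Extract the known "KEY value" pairs from one logged MCE/CMC record."""
    fields = {}
    for key in KEYS:
        # \b keeps "STATUS" from matching inside "MCGSTATUS" and
        # "CPU" from matching "CPUID".
        match = re.search(r"\b" + key + r"\s+(\S+)", text)
        if match:
            fields[key] = to_number(match.group(1))
    return fields


# Sample text copied from message No. 59.
SAMPLE = (
    "CPU 56 BANK 128 TSC 0x5e6974d4256 RIP 00:0 MISC 0 ADDR 0 "
    "STATUS 0x40000000883c0c00 MCGSTATUS 0 CPUID Vendor Intel Family 6 "
    "Model 62 TIME 1366542972 Sun Apr 21 20:16:12 2013 SOCKETID 0 "
    "APICID 23 MCGCAP 0x5000c20"
)

record = parse_mce_record(SAMPLE)
print(record["CPU"], record["BANK"], hex(record["STATUS"]))
```

Values that are not plain numbers, such as the RIP field, are kept as text so that nothing from the original record is lost.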
6.3.2 Operation log messages output from capmonitor
The following table shows the operation log messages (Core Offline log, including logs related to Core Online of COPT) that capmonitor outputs.
Table 6-7 Operation log messages output from capmonitor
No.
Message
Meaning
Solution
Operation log
1
Error: 1102 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
2
Error: 1103 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
3
Error: 1104 <error type> capmonitor exited due to capmonitor.conf error. cpu-hotadd-online-timeout value needs to be less 1200 seconds. Please correct capmonitor.conf and restart capmonitor.
capmonitor exited due to an error in capmonitor.conf. The cpu-hotadd-online-timeout value must be less than 1200 seconds. Correct capmonitor.conf and restart capmonitor.
Change the cpu-hotadd-online-timeout value in capmonitor.conf to less than 1200 seconds, and restart capmonitor.
4
Error: 1104 <error type> capmonitor exited due to capmonitor.conf error. cpu-hotadd-timeout value needs to be less cpu-hotadd-online-timeout. Please correct capmonitor.conf and restart capmonitor.
capmonitor exited due to an error in capmonitor.conf. The cpu-hotadd-timeout value must be less than the cpu-hotadd-online-timeout value. Correct capmonitor.conf and restart capmonitor.
Change the cpu-hotadd-timeout value in capmonitor.conf to less than the cpu-hotadd-online-timeout value, and restart capmonitor.
5
Error: 1104 <error type> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
6
Error: 1105 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
7
Error: 1106 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
8
Error: 1107 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
9
Error: 1108 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
10
Error: 1109 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
11
Error: 1110 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
12
Warning: 1111 <error type> cpu-hotadd-timeout value is unspecified in capmonitor.conf. capmonitor will run with default value.
Failed to read cpu-hotadd-timeout from capmonitor.conf. capmonitor runs with the default value of cpu-hotadd-timeout.
capmonitor runs with the default value of cpu-hotadd-timeout. Review the cpu-hotadd-timeout setting in capmonitor.conf.
13
Warning: 1111 <error type> cpu-hotadd-online-timeout value is unspecified in capmonitor.conf. capmonitor will run with default value.
Failed to read cpu-hotadd-online-timeout from capmonitor.conf. capmonitor runs with the default value of cpu-hotadd-online-timeout.
capmonitor runs with the default value of cpu-hotadd-online-timeout. Review the cpu-hotadd-online-timeout setting in capmonitor.conf.
14
Warning: 1111 <error type> cpu-hotadd-timeout and cpu-hotadd-online-timeout values are unspecified in capmonitor.conf. capmonitor will run with default value.
Failed to read cpu-hotadd-timeout and cpu-hotadd-online-timeout from capmonitor.conf. capmonitor runs with the default values of cpu-hotadd-timeout and cpu-hotadd-online-timeout.
capmonitor runs with the default values of cpu-hotadd-timeout and cpu-hotadd-online-timeout. Review the cpu-hotadd-timeout and cpu-hotadd-online-timeout settings in capmonitor.conf.
15
Error: 5101 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
16
Error: 5102 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
17
Error: 5103 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
18
Error: 5104 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
19
Error: 5104 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
20
Error: 5105 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
21
Error: 5106 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
22
Error: 5107 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
23
Error: 5108 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
24
Error: 5109 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
25
Error: 5110 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
26
Error: 5111 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
27
Error: 5112 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
28
Error: 5113 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
29
Info: 5114 <CPU number> The number of active cores exceeded the number of core license.
The number of active cores exceeded the number of core licenses.
Offline the CPU.
30
Error: 5115 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
31
Error: 5116 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
32
Error: 5117 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
33
Error: 5118 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
34
Error: 5119 <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
35
Error: 5120 <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
36
Info: 5121 <CPU number> The number of active cores exceeded the number of core license.
The number of active cores exceeded the number of core licenses.
Offline the CPU.
37
Error: 5122 capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
38
Error: 6100 <error cause> capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function, but capmonitor continues operating. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
39
Error: 7101 capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
40
Cannot open pidfile <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
41
Hot Add CPU xx succeeded.
Hot Add of spare CPU xx succeeded.
42
Hot Add CPU xx failed.
Hot Add of spare CPU xx failed.
43
Hot Add CPU xx timeouted.
Hot Add of spare CPU xx timed out.
44
CPU xx is now online
Spare CPU xx has been added.
45
error while recieving from kernel capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
46
cannot open NETLINK socket <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
47
cannot bind to NETLINK socket <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
48
cannot set FD_CLOEXEC flag on fd <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
49
poll table overflow capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
50
poll error <error cause> capmonitor found a continuable error. capmonitor will continue to be run safely.
An error occurred in a system-related function, but capmonitor continues operating.
capmonitor continues running. No action is needed.
51
cannot open listening socket <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
52
cannot bind to client unix socket <error cause> /var/run/capmonitor-client capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor terminated. capmonitor is restarted by cron.
cron restarts capmonitor automatically.
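Messages No. 3 and No. 4 in Table 6-7 state two constraints on the timeout settings in capmonitor.conf: cpu-hotadd-online-timeout must be less than 1200 seconds, and cpu-hotadd-timeout must be less than cpu-hotadd-online-timeout. The following Python sketch checks both rules before capmonitor is restarted. It is an illustration only: check_timeouts is a hypothetical helper, not part of capmonitor, and because the file syntax of capmonitor.conf is not shown in this section, the two values are passed in as plain integers rather than read from the file.

```python
# Upper limit for cpu-hotadd-online-timeout, taken from message No. 3.
MAX_ONLINE_TIMEOUT_SECONDS = 1200


def check_timeouts(hotadd_timeout, online_timeout):
    """Return a list of rule violations; an empty list means both rules hold.

    hotadd_timeout: intended cpu-hotadd-timeout value in seconds
    online_timeout: intended cpu-hotadd-online-timeout value in seconds
    """
    problems = []
    if online_timeout >= MAX_ONLINE_TIMEOUT_SECONDS:
        problems.append(
            "cpu-hotadd-online-timeout must be less than 1200 seconds")
    if hotadd_timeout >= online_timeout:
        problems.append(
            "cpu-hotadd-timeout must be less than cpu-hotadd-online-timeout")
    return problems


# Example values chosen for illustration; they are not documented defaults.
for message in check_timeouts(hotadd_timeout=700, online_timeout=600):
    print(message)
```

Returning every violation at once, instead of stopping at the first, lets both settings be corrected in a single edit of capmonitor.conf.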
7. Restrictions and Precautions
7.1 Manual Onlining CPU being Core Offlined
Do not manually online a CPU that has been core offlined. When correctable errors exceed the threshold value, Machine Check Monitoring Service offlines the failed core, and the OS can no longer use the core-offlined CPU. Although you can online the core from the OS (*), the offlined CPU is failing, so do not bring it back online.
* (Example) # echo 1 > /sys/devices/system/cpu/cpuX/online
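To see which CPUs are currently offline without changing their state, the kernel's summary file can be read instead of writing to the per-CPU online file. The Python sketch below assumes the standard Linux sysfs file /sys/devices/system/cpu/offline, which holds a range list such as "2,4-6". The function names are hypothetical, and the sketch cannot tell whether a CPU was offlined by Machine Check Monitoring Service or by some other means.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a sysfs CPU range list such as "2,4-6" into [2, 4, 5, 6]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means that no CPU is listed
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def offline_cpus(path="/sys/devices/system/cpu/offline"):
    """Return the CPU numbers listed as offline, or [] if the file is absent."""
    sysfs_file = Path(path)
    if not sysfs_file.exists():
        return []
    return parse_cpu_list(sysfs_file.read_text())


print("Offline CPUs:", offline_cpus())
```

The sketch only reads from sysfs, so it is consistent with the restriction above: it never attempts to bring an offlined CPU back online.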
7.2 cpuspeed Error Message Output at OS Shutdown
A cpuspeed error message may be displayed when the OS is shut down. This does not affect system operation.
When correctable errors exceed the threshold value, the failed CPU core is offlined. If the OS is shut down after CPU Core Offline, the cpuspeed message is output.
The message indicates that the cpuspeed daemon failed to perform its end processing on the core-offlined CPU. Skipping cpuspeed end processing on an offlined CPU does not affect system operation, so you can ignore this message.
See 6.1.4 Other on-screen messages for details of the error message.
NEC Corporation
7-1 Shiba 5-Chome, Minato-Ku
Tokyo 108-8001, Japan
TEL (03) 3454-1111 (Main phone number)