
Express5800/A2040b, A2020b, A2010b, A1040b
Machine Check Monitoring Service
User's Guide
(Release 1.5)
May 2015
NEC Corporation
© 2015 NEC Corporation
855-900937

Notes on Using This Manual
No part of this manual may be reproduced in any form without the prior written
permission of NEC Corporation.
The contents of this manual may be revised without prior notice.
The contents of this manual shall not be copied or altered without the prior written
permission of NEC Corporation.
Trademarks
Linux is a trademark or registered trademark of Linus Torvalds in Japan and other
countries.
Red Hat® and Red Hat Enterprise Linux are trademarks or registered trademarks of
Red Hat, Inc. in the United States and other countries.
Intel and its logo are registered trademarks of Intel Corporation in the United States and
other countries.
Emulex and LightPulse are registered trademarks of Emulex Corporation.
Broadcom, NetXtreme, Ethernet@Wirespeed, LiveLink, and Smart Load Balancing are
trademarks of Broadcom Corporation and/or its associated companies in the United
States and other countries.
All other product, brand, or trade names used in this publication are the trademarks or
registered trademarks of their respective trademark owners.
Related Documents
Express5800/A2040b, A2020b, A2010b, A1040b User's Guide
Capacity Optimization (COPT) User's Guide

Contents
1. Introduction
1.1 Overview
1.2 Operating Environment
1.3 Terminology
1.4 Access Limitation
2. Features of Machine Check Monitoring Service
2.1 Features of Machine Check Monitoring Service
2.2 System Configuration of Machine Check Monitoring Service
2.3 Functional Drawing of Machine Check Monitoring Service
2.4 Features of Machine Check Monitoring Service
3. Installation and Configuration
3.1 Installation
3.1.1 Installing acpi_call
3.1.2 Installing capmonitor
3.1.3 Installing mcemonitor
3.2 Upgrade
3.2.1 Upgrading acpi_call
3.2.2 Upgrading capmonitor
3.2.3 Upgrading mcemonitor
3.3 Configuration
3.3.1 capmonitor configuration file
3.3.2 mcemonitor configuration file
3.3.3 Disabling CMCI
3.3.4 Disabling kdump restart on udev triggered by logical processor offline
3.3.5 Script file to be executed after Core Offline
3.3.6 Disabling EDAC
3.4 Uninstallation
3.4.1 Uninstalling acpi_call
3.4.2 Uninstalling capmonitor
3.4.3 Uninstalling mcemonitor
4. Log
4.1 Logging Destination
4.2 Output Format
5. Command Reference
5.1 Show CPU / Memory Status
6. Messages
6.1 On-screen Message
6.1.1 On-screen messages output from mcemonitor
6.1.2 On-screen messages output from capmonitor
6.1.3 On-screen messages output from acpi_call
6.1.4 Other on-screen messages
6.2 syslog Messages
6.3 Operation Log Messages
6.3.1 Operation log messages output from mcemonitor
6.3.2 Operation log messages output from capmonitor
7. Restrictions and Precautions
7.1 Manual Onlining CPU being Core Offlined
7.2 cpuspeed Error Message Output at OS Shutdown

1. Introduction
1.1 Overview
Machine Check Monitoring Service identifies faulty hardware components by sending logs of
correctable errors that occur on the CPU and memory of a Linux server to the server firmware.
If the number of correctable errors exceeds a threshold value, Machine Check Monitoring Service
performs Core Offline (offlining a CPU core) or Page Offline (offlining a memory page) to prevent a
system down caused by an uncorrectable error. If the OS supports the Core Online feature and the
system has a spare CPU, Machine Check Monitoring Service automatically adds the spare CPU (Core
Online) after Core Offline completes. The Offline and Online operations are performed in cooperation
with the kernel on the Linux server.
Machine Check Monitoring Service consists of firmware and software on the Linux server. The software
includes mcemonitor (Machine Check Monitoring Service) and capmonitor (Capacity Monitoring
Service).
Refer to "Capacity Optimization (COPT) User's Guide" for details of Core
Online feature.
Core Offline, Core Online, and Page Offline are not supported on
Express5800/A1040b.
1.2 Operating Environment
Machine Check Monitoring Service requires the following operating environment:
Table 1-1 Operating Environment
Hardware    Express5800/A1040b
            Express5800/A2010b
            Express5800/A2020b
            Express5800/A2040b
OS          Red Hat Enterprise Linux 6.6

1.3 Terminology
The terms used for Machine Check Monitoring Service are described below.
Table 1-2 Terminology
mcemonitor
    Software that provides enhanced RAS (reliability, availability, serviceability) features.
    mcemonitor receives logs from the mce mechanism of the Linux kernel, analyzes them, and
    monitors fault occurrence in cooperation with the system. mcemonitor instructs the kernel to
    perform Core Offline and Page Offline.
capmonitor
    Software that controls Core Offline of a failed core and the Core Online provided by the COPT
    feature.
    Refer to "Capacity Optimization (COPT) User's Guide" for details of the COPT feature.
acpi_call
    Driver used to access ACPI.
ACPI
    Advanced Configuration and Power Interface. An open industry specification for power
    management and hardware configuration.
MCE
    Machine Check Exception. A hardware error detected by the CPU.
CMC
    Corrected Machine Check. A correctable error detected by the CPU.
CPU socket
    A single Intel Xeon processor. One CPU socket can have several cores. With
    Express5800/A2040b, up to four CPU sockets can be installed in the server.
CPU core
    The part of a CPU that performs arithmetic and other processing. One or more cores exist in
    a CPU socket.
Physical CPU socket number
    The physical mounting position of a CPU socket in the server. Each CPU socket is assigned a
    number from No. 1 to No. 4.
Logical processor
    The processor on which the OS actually executes tasks and threads. When the Hyper-Threading
    feature is enabled, two logical processors exist in one CPU core. When the Hyper-Threading
    feature is disabled, only one logical processor exists in one CPU core.
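For reference, Linux reports which logical processors share a CPU core in sysfs. The Python sketch below (Python 3.9 or later) assumes the standard kernel file /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, which is part of the kernel and not of Machine Check Monitoring Service, and shows how its list format expands into logical processor numbers. The values "7,22" and "7" are illustrative samples, not output captured from an Express5800 server.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as "7,22" or "0-1,8" into CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            # A range such as "0-1" covers both end points.
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def core_siblings(cpu: int) -> list[int]:
    """Return the logical processors that share a CPU core with `cpu` (Linux sysfs)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())


if __name__ == "__main__":
    print(parse_cpu_list("7,22"))  # sample with Hyper-Threading enabled: [7, 22]
    print(parse_cpu_list("7"))     # sample with Hyper-Threading disabled: [7]
```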
1.4 Access Limitation
Only the privileged user (root account) can use mcemonitor.

2. Features of Machine Check Monitoring Service
This section describes features and characteristics of Machine Check Monitoring Service.
2.1 Features of Machine Check Monitoring Service
Servers used in mission-critical domains must be able to identify a failing component, degrade it
online, and replace it online before the system goes down.
When Machine Check Monitoring Service detects a correctable error in the CPU or memory of a Linux
server, it sends a log to the server firmware so that the failed component can be identified. When the
number of correctable errors exceeds the threshold value, Machine Check Monitoring Service degrades
the CPU core or memory page online (Core Offline, Page Offline). In addition, if the server runs an OS
that supports the Core Online feature and a spare CPU is installed in the server, Machine Check
Monitoring Service automatically adds the spare CPU (Core Online) after Core Offline. This prevents
performance degradation.
Refer to "Capacity Optimization (COPT) User's Guide" for details of Core
Online feature.
Express5800/A1040b does not support Core Offline, Core Online, and Page
Offline.
2.2 System Configuration of Machine Check Monitoring Service
The system configuration of Machine Check Monitoring Service is shown below.
Figure 2-1 System Configuration of Machine Check Monitoring Service

2.3 Functional Drawing of Machine Check Monitoring Service
Functional drawing of Machine Check Monitoring Service and its associated components are shown
below.
Figure 2-2 Functional drawing

2.4 Features of Machine Check Monitoring Service
The process flow of Machine Check Monitoring Service is shown below.
Table 2-1 Process flow of Machine Check Monitoring Service
Monitoring CPU failure
When mcemonitor detects a CPU failure, it sends CPU fault information to the firmware.
When the firmware receives the CPU fault information, it determines the failed component.
The firmware manages the failure occurrence count, and when the count exceeds the threshold
value, the firmware instructs mcemonitor to perform Core Offline.
When mcemonitor receives the Core Offline instruction from the firmware, it issues a CPU Offline
instruction to the kernel.
If Hyper Threading Mode is set to OFF, one logical CPU in the CPU core is made offline. If Hyper
Threading Mode is set to ON, two logical CPUs in the CPU core are made offline.
When CPU Offline succeeds, the relevant CPU is disabled for the OS and software, and the number
of available CPUs is reduced accordingly.
Note: Express5800/A1040b does not support the Core Offline feature.
mcemonitor notifies the firmware of the result of CPU Offline.
When CPU Offline succeeds and the server has a spare CPU, the spare CPU is added automatically
(Core Online feature).
Note: For details of Core Online, refer to "Capacity Optimization (COPT) User's Guide".
CPU fault information and the result of CPU Offline can be confirmed with the mcemonitor
command.
See "5.1 Show CPU / Memory Status" for details of the mcemonitor command.
Monitoring memory failure
If the number of correctable memory errors on a certain memory page exceeds the threshold value,
the firmware instructs mcemonitor to perform Memory Page Offline.
When mcemonitor receives the Memory Page Offline instruction from the firmware, it sends a Memory
Page Offline instruction to the kernel.
Memory Page Offline is performed in units of 4 KB.
When Memory Page Offline succeeds, the relevant memory page is disabled for the OS and software,
and the available memory capacity is reduced accordingly.
Note: Express5800/A1040b does not support the Page Offline feature.
mcemonitor notifies the firmware of the result of Memory Page Offline.
The result of Memory Page Offline can be confirmed with the mcemonitor command.
See "5.1 Show CPU / Memory Status" for details of the mcemonitor command.
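Because Memory Page Offline works in 4-KB units, a single error address identifies one whole page. A minimal Python sketch of that arithmetic, assuming a 4096-byte page: clearing the low 12 bits of a physical address gives the start of the page that would be offlined. The address is a made-up example, and the sketch does not describe how mcemonitor or the kernel is actually implemented.

```python
PAGE_SIZE = 4096  # 4 KB, the unit of Memory Page Offline


def page_base(phys_addr: int, page_size: int = PAGE_SIZE) -> int:
    """Return the start address of the page that contains phys_addr."""
    # page_size is a power of two, so masking off the low bits rounds down.
    return phys_addr & ~(page_size - 1)


def page_range(phys_addr: int, page_size: int = PAGE_SIZE) -> tuple[int, int]:
    """Return the first and last byte addresses of the page containing phys_addr."""
    base = page_base(phys_addr, page_size)
    return base, base + page_size - 1


if __name__ == "__main__":
    start, end = page_range(0x12345678)  # hypothetical error address
    print(hex(start), hex(end))  # 0x12345000 0x12345fff
```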

3. Installation and Configuration
This section describes how to install, configure, and start Machine Check Monitoring Service.
3.1 Installation
Machine Check Monitoring Service is provided as RPM packages. Install the packages with the rpm
command in the following order: acpi_call, capmonitor, and mcemonitor.
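The order follows the package dependencies that rpm reports in the "Failed dependencies" messages shown in 3.1.2, 3.1.3, and 3.4.1: capmonitor and mcemonitor both require acpi_call. As an illustration, the Python sketch below (Python 3.9 or later, standard graphlib module) derives an installation order from that dependency map and reverses it for uninstallation. The map is transcribed from those messages; the dependencies do not fix the order of capmonitor relative to mcemonitor, so the sketch breaks that tie alphabetically, which happens to match the order used in this guide.

```python
from graphlib import TopologicalSorter

# Package -> packages it requires (from the rpm "Failed dependencies" messages).
REQUIRES = {
    "mcl-acpicall": set(),
    "mcl-capmonitor": {"mcl-acpicall"},
    "mcl-mcemonitor1": {"mcl-acpicall"},
}


def install_order(requires: dict[str, set[str]]) -> list[str]:
    """Dependencies first; independent packages are ordered alphabetically."""
    sorter = TopologicalSorter(requires)
    sorter.prepare()
    order: list[str] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        order.extend(ready)
        sorter.done(*ready)
    return order


def uninstall_order(requires: dict[str, set[str]]) -> list[str]:
    """Reverse of the install order: dependents are removed before what they need."""
    return list(reversed(install_order(requires)))


if __name__ == "__main__":
    print(install_order(REQUIRES))    # ['mcl-acpicall', 'mcl-capmonitor', 'mcl-mcemonitor1']
    print(uninstall_order(REQUIRES))  # ['mcl-mcemonitor1', 'mcl-capmonitor', 'mcl-acpicall']
```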
3.1.1 Installing acpi_call
1. Log in to the target machine as the root user.
2. The most recent version of the RPM package is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
3. Install the acpi_call RPM package of Machine Check Monitoring Service using the rpm command.
# rpm -ivh mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64.rpm
Preparing... ########################################## [100%]
1:mcl-acpi_call ########################################## [100%]
Starting acpi_call driver[ OK ]
4. Confirm that the acpi_call RPM package of Machine Check Monitoring Service is installed
correctly. The following is displayed when installation completes successfully.
# rpm -qa | grep acpicall
mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
5. Check that the acpi_call driver has started normally. If the following three acpi_call modules are
displayed, the acpi_call driver has started normally.
# lsmod | grep acpi
acpi_clpcall 6897 0
acpi_capcall 6897 0
acpi_mcecall 6897 0
6. If one of the following messages is displayed, installation of the package may not have completed.
Take the action described in "Solution", and then repeat from Step 3.
Message: package mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64 is already installed
Solution: Uninstall acpi_call, and then install it again.
Message: error: unpacking of archive failed
on file: cpio: write failed - No space left on device
Solution: Disk space is insufficient. Increase the free space, and then install the package again.

7. Configure /etc/sysconfig/kdump.
Creating the initrd file for kdump may fail if an external module that is not needed for dump
collection is incorporated. To prevent this, add MKDUMPRD_ARGS="--allow-missing".
Sample configuration of /etc/sysconfig/kdump:
MKDUMPRD_ARGS="--allow-missing"
With this configuration, the following warning may appear when the kdump service starts. This
message only indicates that the external module was not incorporated; it is not a problem.
WARNING: No module xxx found for kernel 2.6.32-504.23.4.el6.x86_64, continuing anyway
(xxx represents the external module name.)
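Step 7 can also be applied with a script. A minimal Python sketch, assuming /etc/sysconfig/kdump uses shell-style KEY="value" lines: it adds --allow-missing to an existing MKDUMPRD_ARGS line, or appends the line if none exists, and leaves commented-out lines alone. The function works on text so that you can review the result first; back up the real file before writing the result back to it.

```python
import re

# Matches an active (uncommented) MKDUMPRD_ARGS="..." line.
ARGS_LINE = re.compile(r'^MKDUMPRD_ARGS="(.*)"\s*$')


def ensure_allow_missing(kdump_text: str) -> str:
    """Return kdump sysconfig text whose MKDUMPRD_ARGS includes --allow-missing."""
    lines = kdump_text.splitlines()
    for i, line in enumerate(lines):
        match = ARGS_LINE.match(line)
        if match:
            args = match.group(1).split()
            if "--allow-missing" not in args:
                args.append("--allow-missing")
            lines[i] = 'MKDUMPRD_ARGS="{}"'.format(" ".join(args))
            return "\n".join(lines) + "\n"
    # No active MKDUMPRD_ARGS line: append one.
    lines.append('MKDUMPRD_ARGS="--allow-missing"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(ensure_allow_missing('KDUMP_COMMANDLINE=""\n'), end="")
```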

3.1.2 Installing capmonitor
1. Log in to the target machine as the root user.
2. The most recent version of the RPM package is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
3. Install the capmonitor RPM package of Machine Check Monitoring Service using the rpm command.
# rpm -ivh mcl-capmonitor-2.4-2.12.el6.x86_64.rpm
Preparing... ######################################## [100%]
1:mcl-capmonitor ######################################## [100%]
Starting capmonitor daemon[ OK ]
Note: acpi_call must be installed before capmonitor.
If you try to install capmonitor before acpi_call has been installed, the following message
is output and the installation of capmonitor fails.
# rpm -ivh mcl-capmonitor-2.4-2.12.el6.x86_64.rpm
error: Failed dependencies:
mcl-acpicall is needed by mcl-capmonitor-2.4-2.12.el6.x86_64
4. Confirm that the capmonitor RPM package of Machine Check Monitoring Service is installed
correctly. The following is displayed when installation completes successfully.
# rpm -qa | grep capmonitor
mcl-capmonitor-2.4-2.12.el6.x86_64
5. Check that capmonitor has started normally. If the following is displayed, capmonitor has
started normally.
# ps aux | grep monitor
root 6044 0.0 0.0 4068 324 ? Ss 06:18 0:00 /opt/nec/capmonitor/capmonitor
6. If one of the following messages is displayed, installation of the package may not have completed.
Take the action described in "Solution", and then repeat from Step 3.
Message: package mcl-capmonitor-2.4-2.12.el6.x86_64 is already installed
Solution: Uninstall capmonitor, and then install it again.
Message: error: unpacking of archive failed
on file: cpio: write failed - No space left on device
Solution: Disk space is insufficient. Increase the free space, and then install the package again.

3.1.3 Installing mcemonitor
1. Log in to the target machine as the root user.
2. The most recent version of the RPM package is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
3. Install the mcemonitor RPM package of Machine Check Monitoring Service using the rpm command.
# rpm -ivh mcl-mcemonitor1-2.4-2.02.el6.x86_64.rpm
Preparing... ######################################### [100%]
1:mcl-mcemonitor1 ######################################### [100%]
Starting mcemonitor daemon[ OK ]
Note: acpi_call must be installed before mcemonitor.
If you try to install mcemonitor before acpi_call has been installed, the following message
is output and the installation of mcemonitor fails.
# rpm -ivh mcl-mcemonitor1-2.4-2.02.el6.x86_64.rpm
error: Failed dependencies:
mcl-acpicall is needed by mcl-mcemonitor1-2.4-2.02.el6.x86_64
4. Confirm that the mcemonitor RPM package of Machine Check Monitoring Service is installed
correctly. The following is displayed when installation completes successfully.
# rpm -qa | grep mcemonitor
mcl-mcemonitor1-2.4-2.02.el6.x86_64
5. Check that mcemonitor has started normally. If the following is displayed, mcemonitor has
started normally.
# ps aux | grep monitor
root 6078 0.0 0.0 4076 328 ? Ss 06:19 0:00 /opt/nec/mcemonitor/mcemonitor
6. If one of the following messages is displayed, installation of the package may not have completed.
Take the action described in "Solution", and then repeat from Step 3.
Message: package mcl-mcemonitor1-2.4-2.02.el6.x86_64 is already installed
Solution: Uninstall mcemonitor, and then install it again.
Message: error: unpacking of archive failed
on file: cpio: write failed - No space left on device
Solution: Disk space is insufficient. Increase the free space, and then install the package again.
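The "ps aux" checks in 3.1.2 and 3.1.3 can be automated. The Python sketch below parses "ps aux" output and reports which of the two daemons are running. It matches the executable path in the COMMAND column (the paths are taken from the example output above), so the "grep monitor" process itself is not counted. The sample output is illustrative; in practice you would pass in the output of "ps aux" captured on the target machine.

```python
DAEMONS = [
    "/opt/nec/capmonitor/capmonitor",
    "/opt/nec/mcemonitor/mcemonitor",
]


def running_daemons(ps_output: str, paths: list[str]) -> set[str]:
    """Return the subset of `paths` that appear as the executable in ps aux output."""
    found: set[str] = set()
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; split at most 10 times so COMMAND stays whole.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        if executable in paths:
            found.add(executable)
    return found


if __name__ == "__main__":
    sample = (
        "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
        "root 6044 0.0 0.0 4068 324 ? Ss 06:18 0:00 /opt/nec/capmonitor/capmonitor\n"
        "root 6078 0.0 0.0 4076 328 ? Ss 06:19 0:00 /opt/nec/mcemonitor/mcemonitor\n"
        "root 6090 0.0 0.0 103252 832 pts/0 S+ 06:20 0:00 grep monitor\n"
    )
    missing = set(DAEMONS) - running_daemons(sample, DAEMONS)
    print("all running" if not missing else f"not running: {sorted(missing)}")
```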

3.2 Upgrade
Use the rpm command to upgrade Machine Check Monitoring Service from an older version to a newer version.
3.2.1 Upgrading acpi_call
1. Log in to the target machine as the root user.
2. Confirm that the currently installed version of the acpi_call RPM package of Machine Check
Monitoring Service is older than the version of the acpi_call RPM package you are going to install.
# rpm -qa | grep acpicall
mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
3. Copy the new RPM package to a directory of your choice on the target machine.
The most recent version of the RPM package is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
4. Upgrade the acpi_call RPM package of Machine Check Monitoring Service using the rpm command.
# rpm -Uvh mcl-acpicall-2.4-3.02.2.6.32.504.23.4.el6.x86_64.rpm
Preparing... ########################################### [100%]
1:mcl-acpi_call ########################################### [100%]
Starting acpi_call driver[ OK ]
5. Confirm that the acpi_call RPM package of Machine Check Monitoring Service is upgraded
correctly. The following is displayed when the upgrade completes successfully.
# rpm -qa | grep acpicall
mcl-acpicall-2.4-3.02.2.6.32.504.23.4.el6.x86_64
6. Check that the acpi_call driver has started normally. If the following three acpi_call modules are
displayed, the acpi_call driver has started normally.
# lsmod | grep acpi
acpi_clpcall 6897 0
acpi_capcall 6897 0
acpi_mcecall 6897 0
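The module check can be scripted. A Python sketch that parses "lsmod" output and reports which of the three acpi_call modules shown above are missing. The module names are taken from the example output; the sketch only parses text, so pass it the output of "lsmod" captured on the target machine.

```python
EXPECTED_MODULES = {"acpi_clpcall", "acpi_capcall", "acpi_mcecall"}


def loaded_modules(lsmod_output: str) -> set[str]:
    """Return module names from lsmod output (first column, header line skipped)."""
    names: set[str] = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] != "Module":
            names.add(fields[0])
    return names


def missing_acpi_call_modules(lsmod_output: str) -> set[str]:
    """Return the expected acpi_call modules that are not loaded."""
    return EXPECTED_MODULES - loaded_modules(lsmod_output)


if __name__ == "__main__":
    sample = (
        "Module                  Size  Used by\n"
        "acpi_clpcall            6897  0\n"
        "acpi_capcall            6897  0\n"
        "acpi_mcecall            6897  0\n"
    )
    print(missing_acpi_call_modules(sample) or "acpi_call driver loaded")
```

After uninstallation (3.4.1, Step 4), the same function would be expected to report all three modules as missing.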

3.2.2 Upgrading capmonitor
1. Log in to the target machine as the root user.
2. Confirm that the currently installed version of the capmonitor RPM package of Machine Check
Monitoring Service is older than the version of the capmonitor RPM package you are going to install.
# rpm -qa | grep capmonitor
mcl-capmonitor-2.4-2.12.el6.x86_64
3. Copy the new RPM package to a directory of your choice on the target machine.
The most recent version of the RPM package is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
4. Upgrade the capmonitor RPM package of Machine Check Monitoring Service using the rpm
command.
# rpm -Uvh mcl-capmonitor-2.4-2.13.el6.x86_64.rpm
Preparing... ######################################### [100%]
4048 /opt/nec/capmonitor/capmonitor
Stopping capmonitor[ OK ]
1:mcl-capmonitor ######################################### [100%]
Starting capmonitor daemon[ OK ]
If capmonitor.conf has been modified, the following message is displayed. You can safely ignore it
because your capmonitor.conf settings are preserved. capmonitor.conf.rpmnew is the default
capmonitor.conf file.
warning: /opt/nec/capmonitor/conf/capmonitor.conf created as
/opt/nec/capmonitor/conf/capmonitor.conf.rpmnew
5. Confirm that the capmonitor RPM package of Machine Check Monitoring Service is upgraded
correctly. The following is displayed when the upgrade completes successfully.
# rpm -qa | grep capmonitor
mcl-capmonitor-2.4-2.13.el6.x86_64
6. Check that capmonitor has started normally. If the following is displayed, capmonitor has
started normally.
# ps aux | grep monitor
root 4141 0.0 0.0 4068 352 ? Ss 13:54 0:00 /opt/nec/capmonitor/capmonitor
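Step 2 of each upgrade procedure asks you to confirm that the installed package is older than the new one. A simplified Python sketch of that comparison, assuming package names of the form name-version-release.arch as in the examples above. It compares the version and then the release segment by segment, treating runs of digits as numbers; this approximates rpm's own version comparison but ignores epochs and special characters such as "~", so rpm itself remains the authoritative check.

```python
import re


def split_nvr(package: str) -> tuple[str, str, str]:
    """Split 'name-version-release[.arch][.rpm]' into (name, version, release)."""
    package = re.sub(r"\.rpm$", "", package)
    package = re.sub(r"\.(x86_64|noarch|i686)$", "", package)
    name, version, release = package.rsplit("-", 2)
    return name, version, release


def _segments(text: str) -> list:
    # Runs of digits become ints, runs of letters stay strings; separators are dropped.
    return [int(s) if s.isdigit() else s for s in re.findall(r"\d+|[A-Za-z]+", text)]


def compare(a: str, b: str) -> int:
    """Return -1, 0, or 1 (simplified rpm-style comparison, no epoch or '~' handling)."""
    sa, sb = _segments(a), _segments(b)
    for x, y in zip(sa, sb):
        if x == y:
            continue
        if isinstance(x, int) and isinstance(y, int):
            return -1 if x < y else 1
        if isinstance(x, int) or isinstance(y, int):
            # A numeric segment is treated as newer than an alphabetic one.
            return 1 if isinstance(x, int) else -1
        return -1 if x < y else 1
    # All shared segments are equal: the string with more segments is newer.
    return (len(sa) > len(sb)) - (len(sa) < len(sb))


def is_upgrade(installed: str, candidate: str) -> bool:
    """True if candidate is the same package with a newer version-release."""
    n1, v1, r1 = split_nvr(installed)
    n2, v2, r2 = split_nvr(candidate)
    if n1 != n2:
        raise ValueError(f"different packages: {n1} and {n2}")
    return (compare(v1, v2) or compare(r1, r2)) < 0


if __name__ == "__main__":
    print(is_upgrade("mcl-capmonitor-2.4-2.12.el6.x86_64",
                     "mcl-capmonitor-2.4-2.13.el6.x86_64.rpm"))  # True
```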

3.2.3 Upgrading mcemonitor
1. Log in to the target machine as the root user.
2. Confirm that the currently installed version of the mcemonitor RPM package of Machine Check
Monitoring Service is older than the version of the mcemonitor RPM package you are going to install.
# rpm -qa | grep mcemonitor
mcl-mcemonitor1-2.4-2.02.el6.x86_64
3. Copy the new RPM package to a directory of your choice on the target machine.
The most recent version of the RPM package is available for download from the following website.
http://www.58support.nec.co.jp/global/download/index.html
4. Upgrade the mcemonitor RPM package of Machine Check Monitoring Service using the rpm
command.
# rpm -Uvh mcl-mcemonitor1-2.4-2.03.el6.x86_64.rpm
Preparing... ######################################### [100%]
4083 /opt/nec/mcemonitor/mcemonitor
Stopping mcemonitor[ OK ]
1: mcl-mcemonitor1 ######################################### [100%]
Starting mcemonitor daemon[ OK ]
If mcemonitor.conf has been modified, the following message is displayed. You can safely ignore it
because your mcemonitor.conf settings are preserved. mcemonitor.conf.rpmnew is the default
mcemonitor.conf file.
warning: /opt/nec/mcemonitor/conf/mcemonitor.conf created as
/opt/nec/mcemonitor/conf/mcemonitor.conf.rpmnew
5. Confirm that the mcemonitor RPM package of Machine Check Monitoring Service is upgraded
correctly. The following is displayed when the upgrade completes successfully.
# rpm -qa | grep mcemonitor
mcl-mcemonitor1-2.4-2.03.el6.x86_64
6. Check that mcemonitor has started normally. If the following is displayed, mcemonitor has
started normally.
# ps aux | grep monitor
root 4189 0.0 0.0 4076 364 ? Ss 13:56 0:00 /opt/nec/mcemonitor/mcemonitor
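After an upgrade that creates a .rpmnew file, you may want to see how your preserved configuration differs from the new default. A short Python sketch using the standard difflib module; the configuration paths are those shown in the warning messages above, and the sketch only reads the files.

```python
import difflib
from pathlib import Path


def config_diff(current: str, default: str, label: str = "conf") -> str:
    """Unified diff from the new default (.rpmnew) to the configuration in use."""
    return "".join(
        difflib.unified_diff(
            default.splitlines(keepends=True),
            current.splitlines(keepends=True),
            fromfile=f"{label}.rpmnew (default)",
            tofile=f"{label} (in use)",
        )
    )


def diff_files(conf_path: str) -> str:
    """Compare e.g. /opt/nec/mcemonitor/conf/mcemonitor.conf with its .rpmnew file."""
    conf = Path(conf_path)
    rpmnew = Path(conf_path + ".rpmnew")
    return config_diff(conf.read_text(), rpmnew.read_text(), conf.name)


if __name__ == "__main__":
    default = "core-ce-action = soft\nmemory-ce-action = soft\n"
    in_use = "core-ce-action = account\nmemory-ce-action = soft\n"
    print(config_diff(in_use, default, "mcemonitor.conf"))
```

An empty result means the configuration in use is identical to the new default.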

3.3 Configuration
Machine Check Monitoring Service provides the following two configuration files. You can change the
behavior of Machine Check Monitoring Service by modifying them. This section describes the available
parameters and how to specify them.
/opt/nec/capmonitor/conf/capmonitor.conf
/opt/nec/mcemonitor/conf/mcemonitor.conf
3.3.1 capmonitor configuration file
capmonitor configuration file /opt/nec/capmonitor/conf/capmonitor.conf is used for configuration related
to CPU Core Online.
For details of capmonitor configuration file, refer to "Capacity Optimization
(COPT) User's Guide".
3.3.2 mcemonitor configuration file
The mcemonitor configuration file /opt/nec/mcemonitor/conf/mcemonitor.conf is used for settings
related to CPU Core Offline and Memory Page Offline. Modify this file as described below.
・mcemonitor.conf
# vi /opt/nec/mcemonitor/conf/mcemonitor.conf
#
# Config file for mcemonitor
#
# specify the internal action in mcemonitor to a cpu error
# off no action
# account only account errors
# soft try to offline CPU
core-ce-action = soft
# specify the internal action in mcemonitor to a page error
# off no action
# soft try to soft-offline page without killing any processes
memory-ce-action = soft

Table 3-1 mcemonitor configuration file (core-ce-action)
Setting in mcemonitor.conf    Description
core-ce-action = soft
    Collects logs and performs CPU Core Offline if the CPU error count exceeds the threshold
    value. (Default)
core-ce-action = account
    Collects logs but does not perform CPU Core Offline even if the CPU error count exceeds the
    threshold value.
core-ce-action = off
    Neither collects logs nor performs CPU Core Offline.
Table 3-2 mcemonitor configuration file (memory-ce-action)
Setting in mcemonitor.conf    Description
memory-ce-action = soft
    Collects logs and performs Memory Page Offline if the memory error count exceeds the
    threshold value. (Default)
    The process running on the relevant memory is moved to other memory.
memory-ce-action = off
    Neither collects logs nor performs Memory Page Offline.
The system must be rebooted after the configuration file is modified.
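Before rebooting, you can check the edited file for typos. The Python sketch below assumes the simple "key = value" format with "#" comments shown above, and accepts only the values listed in the file's own comments (off, account, and soft for core-ce-action; off and soft for memory-ce-action). It is not mcemonitor's own parser, and this guide does not state how mcemonitor handles other values.

```python
ALLOWED = {
    # Values listed in the comments of mcemonitor.conf
    "core-ce-action": {"off", "account", "soft"},
    "memory-ce-action": {"off", "soft"},
}


def parse_conf(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, ignoring blank lines and '#' comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def validate(text: str) -> list[str]:
    """Return a list of problems; an empty list means both parameters look valid."""
    settings = parse_conf(text)
    problems: list[str] = []
    for key, allowed in ALLOWED.items():
        if key not in settings:
            problems.append(f"{key} is not set")
        elif settings[key] not in allowed:
            problems.append(f"{key} = {settings[key]}: expected one of {sorted(allowed)}")
    return problems


if __name__ == "__main__":
    sample = "# Config file for mcemonitor\ncore-ce-action = soft\nmemory-ce-action = account\n"
    for problem in validate(sample):
        print(problem)
```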
3.3.3 Disabling CMCI
In the RHEL 6.6 kernel 2.6.32-504.23.4.el6.x86_64, it has been reported that frequent occurrence of
CMCI (Corrected Machine Check Interrupt), which notifies the operating system of detected
correctable errors, may cause a system panic.
To change the error detection mode from "interrupt mode" to "polling mode", add "mce=no_cmci" to
the kernel line in "/boot/efi/EFI/redhat/grub.conf", as in the example below.
The system must be rebooted after the configuration file is modified.
title Red Hat Enterprise Linux Server (2.6.32-504.23.4.el6.x86_64)
root (hd0,0)
kernel /vmlinuz-2.6.32-504.23.4.el6.x86_64 ro root=/dev/mapper/VolGroup00-LogVol00 rd_LVM_LV=VolGroup00/LogVol00 rd_NO_LUKS nomodeset rd_NO_MD rhgb quiet crashkernel=256M KEYBOARDTYPE=pc KEYTABLE=jp106 LANG=ja_JP.UTF-8 rd_NO_DM mce=no_cmci
initrd /initramfs-2.6.32-504.23.4.el6.x86_64.img
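A Python sketch that adds mce=no_cmci to each kernel line of a legacy GRUB configuration and leaves lines that already contain it unchanged. It assumes each kernel entry is a single physical line, as in an actual grub.conf file, and works on text so that you can review the result before writing it to /boot/efi/EFI/redhat/grub.conf. Back up the file first.

```python
import re

PARAM = "mce=no_cmci"


def add_no_cmci(grub_text: str) -> str:
    """Append mce=no_cmci to each 'kernel' line that does not already have it."""
    out = []
    for line in grub_text.splitlines():
        # Kernel lines look like "kernel /vmlinuz-... ro root=..." (leading whitespace allowed).
        if re.match(r"^\s*kernel\s", line) and PARAM not in line.split():
            line = line.rstrip() + " " + PARAM
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "title Red Hat Enterprise Linux Server (2.6.32-504.23.4.el6.x86_64)\n"
        "\troot (hd0,0)\n"
        "\tkernel /vmlinuz-2.6.32-504.23.4.el6.x86_64 ro rhgb quiet\n"
        "\tinitrd /initramfs-2.6.32-504.23.4.el6.x86_64.img\n"
    )
    print(add_no_cmci(sample), end="")
```

Running the function twice gives the same result, so it is safe to apply to a file that was already edited.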
3.3.4 Disabling kdump restart on udev triggered by logical processor offline
Disable the following rule by adding # at the beginning of its line in the
/etc/udev/rules.d/98-kexec.rules file.
#SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="/etc/init.d/kdump restart"
Reload the udev rules after modifying the configuration file.
# udevadm control --reload-rules
kdump is instead restarted by a script that capmonitor executes upon completion of Core
Offline. Place the script file to be executed after Core Offline as described in
"3.3.5 Script file to be executed after Core Offline".

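A Python sketch of the edit described in 3.3.4: it prefixes "#" to the uncommented rule that restarts kdump when a CPU goes offline and leaves all other lines unchanged. Matching on SUBSYSTEM=="cpu", ACTION=="offline", and "kdump restart" is an assumption based on the rule shown in 3.3.4; compare with the actual /etc/udev/rules.d/98-kexec.rules before applying the result, then reload the rules with udevadm as shown there.

```python
def disable_kdump_offline_rule(rules_text: str) -> str:
    """Prefix '#' to the uncommented rule that restarts kdump on CPU offline."""
    out = []
    for line in rules_text.splitlines():
        stripped = line.lstrip()
        is_target = (
            not stripped.startswith("#")
            and 'SUBSYSTEM=="cpu"' in stripped
            and 'ACTION=="offline"' in stripped
            and "kdump restart" in stripped
        )
        out.append("#" + line if is_target else line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = 'SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="/etc/init.d/kdump restart"\n'
    print(disable_kdump_offline_rule(sample), end="")
```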
3.3.5 Script file to be executed after Core Offline
capmonitor executes all script files stored in the directory /opt/nec/capmonitor/script/cpu/offline.d
upon completion of Core Offline.
If several logical processors are made offline by a single Core Offline, the script files are executed
only once, after the last processor is offlined.
To restart kdump in place of the udev rule disabled in 3.3.4, place the script
/opt/nec/capmonitor/script/03kdump.sh under the directory /opt/nec/capmonitor/script/cpu/offline.d.
If you use software that requires a restart after Core Offline (because the number of logical
processors is reduced), create a script file containing the necessary processing and store it under the
directory /opt/nec/capmonitor/script/cpu/offline.d.
Table 3-3 Scripts under /opt/nec/capmonitor/script/cpu/offline.d
03kdump.sh
    Description: Script that restarts the kdump daemon as needed so that a crash dump can be
    collected after Core Offline.
    How to install: Copy it from /opt/nec/capmonitor/script/ to
    /opt/nec/capmonitor/script/cpu/offline.d.
XX~.sh (user script)
    XX: Two-digit decimal number that specifies the execution order (scripts are executed
    starting from the lowest number).
    ~: Arbitrary character string
    Description: If you use software that requires a restart after Core Offline, create a script
    file containing the necessary processing.
    How to install: Create the script and store it under
    /opt/nec/capmonitor/script/cpu/offline.d.
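Table 3-3 names user scripts XX~.sh, where the two-digit number XX sets the execution order, starting from the lowest number. A Python sketch that lists script names in that order and flags names that do not follow the convention. It models only the naming rule; how capmonitor itself enumerates the directory, or orders two scripts with the same number, is not described in this guide, so the tie-break by full name is an assumption.

```python
import re
from pathlib import Path

NAME_RULE = re.compile(r"^(\d{2}).*\.sh$")  # XX~.sh: two digits, any text, .sh


def execution_order(names: list[str]) -> tuple[list[str], list[str]]:
    """Split names into (valid scripts in ascending XX order, names that break the rule)."""
    valid: list[str] = []
    invalid: list[str] = []
    for name in names:
        (valid if NAME_RULE.match(name) else invalid).append(name)
    # Sort by the two-digit prefix; ties fall back to the full name (assumption).
    valid.sort(key=lambda n: (int(n[:2]), n))
    return valid, invalid


def scan(directory: str = "/opt/nec/capmonitor/script/cpu/offline.d"):
    """Apply execution_order() to the regular files in the offline.d directory."""
    return execution_order([p.name for p in Path(directory).iterdir() if p.is_file()])


if __name__ == "__main__":
    print(execution_order(["10restart_app.sh", "03kdump.sh", "notes.txt"]))
```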
3.3.6 Disabling EDAC
If EDAC is running on the system, Machine Check Monitoring Service does not run correctly. Disable
EDAC by creating the file /etc/modprobe.d/disable_edac.conf with the following contents:
install *_edac /bin/true
install edac_* /bin/true
After saving the file, reboot the system. After the reboot, confirm that EDAC has been disabled as
shown below. If nothing is displayed, EDAC is disabled.
# lsmod | grep edac
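The two install lines use the wildcards *_edac and edac_*. Assuming these are matched like shell-style patterns, the Python sketch below (standard fnmatch module) shows which module names they cover, and checks captured "lsmod" output for EDAC modules that are still loaded after the reboot. The module names in the examples are illustrative.

```python
from fnmatch import fnmatchcase

PATTERNS = ["*_edac", "edac_*"]  # from /etc/modprobe.d/disable_edac.conf


def blocked(module: str) -> bool:
    """True if the module name matches one of the install patterns."""
    return any(fnmatchcase(module, pattern) for pattern in PATTERNS)


def loaded_edac_modules(lsmod_output: str) -> list[str]:
    """Return loaded modules that match the patterns (an empty list means EDAC is disabled)."""
    names = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] != "Module":
            names.append(fields[0])
    return [name for name in names if blocked(name)]


if __name__ == "__main__":
    print(blocked("sb_edac"), blocked("edac_core"), blocked("acpi_mcecall"))
```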

3.4 Uninstallation
Use the rpm command to uninstall Machine Check Monitoring Service.
Uninstall the packages in the following order: mcemonitor, capmonitor, and acpi_call.
3.4.1 Uninstalling acpi_call
1. Log in to the target machine as the root user.
2. Uninstall the acpi_call RPM package of Machine Check Monitoring Service using the rpm command.
# rpm -e mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
Note: mcemonitor and capmonitor must be uninstalled before acpi_call.
If you try to uninstall acpi_call while mcemonitor and capmonitor are still installed,
the following message is output and the uninstallation of acpi_call fails.
# rpm -e mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64
error: Failed dependencies:
mcl-acpicall is needed by mcl-capmonitor-2.4-2.12.el6.x86_64
mcl-acpicall is needed by mcl-mcemonitor1-2.4-2.02.el6.x86_64
3. Confirm that the acpi_call RPM package of Machine Check Monitoring Service has been uninstalled
correctly. Uninstallation has completed successfully if nothing is displayed, as shown below.
# rpm -qa | grep acpicall
4. Check that the acpi_call driver has been removed correctly. If none of the three acpi_call modules
(acpi_clpcall, acpi_capcall, acpi_mcecall) is displayed, the acpi_call driver has been removed correctly.
# lsmod | grep acpi

3.4.2 Uninstalling capmonitor
1. Log in to the target machine as the root user.
2. Uninstall the capmonitor RPM package of Machine Check Monitoring Service using the rpm
command.
# rpm -e mcl-capmonitor-2.4-2.12.el6.x86_64
3834 /opt/nec/capmonitor/capmonitor
Stopping capmonitor[ OK ]
3. Confirm that the capmonitor RPM package of Machine Check Monitoring Service has been uninstalled
correctly. Uninstallation has completed successfully if "capmonitor" is not displayed, as shown
below.
# rpm -qa | grep capmonitor
4. Check that capmonitor has stopped. If "capmonitor" is not displayed, capmonitor has stopped
correctly.
# ps aux | grep monitor
3.4.3 Uninstalling mcemonitor
1. Log in to the target machine as the root user.
2. Uninstall the mcemonitor RPM package of Machine Check Monitoring Service using the rpm
command.
# rpm -e mcl-mcemonitor1-2.4-2.02.el6.x86_64
3871 /opt/nec/mcemonitor/mcemonitor
Stopping mcemonitor[ OK ]
Starting mcelog daemon
3. Confirm that the mcemonitor RPM package of Machine Check Monitoring Service has been uninstalled
correctly. Uninstallation has completed successfully if "mcemonitor" is not displayed, as shown
below.
# rpm -qa | grep mcemonitor
4. Check that mcemonitor has stopped. If "mcemonitor" is not displayed, mcemonitor has stopped
correctly.
# ps aux | grep monitor

4. Log
4.1 Logging Destination
Machine Check Monitoring Service outputs logs to the following destinations:
/var/opt/nec/mcemonitor (fault monitoring log)
/var/opt/nec/capmonitor (Core Offline log, including logs related to the Core Online feature of COPT)
4.2 Output Format
The following is an example of the logs that Machine Check Monitoring Service outputs.
(/var/opt/nec/mcemonitor)
Tue Feb 19 21:03:39 2013 : CPU 7
Tue Feb 19 21:03:39 2013 : BANK 0
Tue Feb 19 21:03:39 2013 : TSC 0
Tue Feb 19 21:03:39 2013 : RIP 00:0
Tue Feb 19 21:03:39 2013 : MISC 0
Tue Feb 19 21:03:39 2013 : ADDR 0
Tue Feb 19 21:03:39 2013 : STATUS 0x9000000000000000
Tue Feb 19 21:03:39 2013 : MCGSTATUS 0
Tue Feb 19 21:03:39 2013 : CPUID Vendor Intel Family 6 Model 62
Tue Feb 19 21:03:39 2013 : TIME 1361275419 Tue Feb 19 21:03:39 2013
Tue Feb 19 21:03:39 2013 : SOCKETID 0
Tue Feb 19 21:03:39 2013 : APICID 14
Tue Feb 19 21:03:39 2013 : MCGCAP 0x5000c20
Tue Feb 19 21:03:39 2013 :
Tue Feb 19 21:03:39 2013 : Offlining CPU 7 due to corrected error threshold.
Tue Feb 19 21:03:39 2013 : Offlining CPU 22 due to corrected error threshold.
Tue Feb 19 21:03:39 2013 : Offlining CPU 7 succeeded.
Tue Feb 19 21:03:39 2013 : Offlining CPU 22 succeeded.
(/var/opt/nec/capmonitor)
Tue Feb 19 21:03:39 2013 : CPU 7 is now offline.
Tue Feb 19 21:03:39 2013 : CPU 24 is now online.
Table 4-1 Machine Check Monitoring Service log output
Date and time (for example, "Tue Feb 19 21:03:39 2013")
    Shows the date and time of the OS.
Message body (for example, "Offlining CPU 7 succeeded.")
    Shows the body of the log message. See "6.3 Operation Log Messages" for details of log messages.
CPU x is now offline.
    Shows a Core Offline log entry output from capmonitor. See "6.3 Operation Log Messages" for details of log messages.
CPU x is now online.
    Shows a Core Online log entry output from capmonitor. See "6.3 Operation Log Messages" for details of log messages.
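Each line of the fault monitoring log has the form "<date and time> : <message>". As a sketch, the hypothetical function below extracts the result of each "Offlining CPU <n> ..." message from /var/opt/nec/mcemonitor. Only the "succeeded" result appears in the example above; the sketch assumes other results use the same four-word form.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): print "<cpu> <result>" for
# each "Offlining CPU <cpu> <result>." message in the fault monitoring log.
offline_results() {
    awk '{
        # The first " : " separates the timestamp from the message body.
        i = index($0, " : ")
        if (i == 0) next
        n = split(substr($0, i + 3), w, " ")
        # Keep four-word messages only; this skips the "... due to ..." lines.
        if (n == 4 && w[1] == "Offlining" && w[2] == "CPU") {
            sub(/\.$/, "", w[4])
            print w[3], w[4]
        }
    }'
}
```

For example, `offline_results < /var/opt/nec/mcemonitor` applied to the log above prints `7 succeeded` and `22 succeeded`.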

5. Command Reference
5.1 Show CPU / Memory Status
You can use the mcemonitor command to view CPU fault information and the offline state of CPUs and memory pages.
The command options are as follows.
Name
mcemonitor – Outputs the state of CPUs and memory pages to standard output.
Syntax
mcemonitor [ --version ]
mcemonitor [ --client | --client=core | --client=page ]
Description
The mcemonitor command displays CPU fault information and the offline state of CPUs and memory pages.
Option
--version
Shows the version information of the mcemonitor command.
--client
Shows CPU fault information and the offline state of CPUs and memory pages.
--client=core
Shows CPU fault information and the offline state of CPUs.
--client=page
Shows the offline state of memory pages.
Return value
0: Normal End
1: Abnormal End
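Scripts that call mcemonitor can act on its return value. Several messages in Chapter 6 advise running the command again, so a wrapper might retry while the return value is 1. The following hypothetical sketch retries any command a fixed number of times; the retry count and delay are example values, not product recommendations.

```shell
#!/bin/sh
# Hypothetical helper: run a command, retrying while it returns non-zero.
# Usage: retry_command <attempts> <delay-seconds> <command> [args...]
retry_command() {
    attempts=$1
    delay=$2
    shift 2
    try=1
    while :; do
        # Return immediately on success (return value 0, "Normal End").
        "$@" && return 0
        status=$?
        if [ "$try" -ge "$attempts" ]; then
            return "$status"   # give up and pass on the last return value
        fi
        try=$((try + 1))
        sleep "$delay"
    done
}
```

For example: `retry_command 3 5 /opt/nec/mcemonitor/mcemonitor --client=page`.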

Display format
# /opt/nec/mcemonitor/mcemonitor --client
Per page status corrected error over threshold:
100000: offline-failed
10000000: offline
20000000: offline
:
:
Per page status uncorrected error:
1abc40000
1abc90000
:
:
CPU errors
CPU1/core2
corrected errors:
1 total
uncorrected errors:
0 total
CPU4/core1
corrected errors:
10 total
uncorrected errors:
0 total
CPU4/core2
corrected errors:
10 total
uncorrected errors:
0 total
:
:
CPU1/uncore
corrected errors:
1 total
uncorrected errors:
0 total
:
:
Per CPU status corrected error over threshold:
CPU4/core1:
/sys/devices/system/cpu5 offline-failed
/sys/devices/system/cpu15 offline
CPU4/core2:
/sys/devices/system/cpu6 offline
/sys/devices/system/cpu16 online

Table 5-1 mcemonitor command output
Per page status corrected error over threshold:
    Shows the result of Memory Page Offline.
100000: offline-failed
    Indicates that offlining failed for the memory page at address 0x100000.
10000000: offline
    Indicates that the memory page at address 0x10000000 was offlined.
Per page status uncorrected error:
    Shows the memory pages on which an uncorrected error occurred.
CPU errors
    Shows CPU fault information.
CPUx/corey
    Shows fault information of a CPU core. x indicates the physical CPU socket number, and y indicates the CPU core number.
corrected errors:
    Shows the number of correctable errors that occurred.
x total
    Indicates that errors occurred x times.
uncorrected errors:
    Shows the number of uncorrectable errors that occurred.
CPUx/uncore
    Shows fault information of the CPU uncore.
Per CPU status corrected error over threshold:
    Shows the result of CPU Offline.
/sys/devices/system/cpu5 offline-failed
    Indicates that offlining logical processor 5 failed.
/sys/devices/system/cpu15 offline
    Indicates that logical processor 15 was offlined successfully.
/sys/devices/system/cpu16 online
    Indicates that the user returned offlined logical processor 16 online.
    Note: This CPU was taken offline because of a fault. Do not bring it back online.
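To list only the entries for which offlining failed, the `--client` output can be filtered by section. This hypothetical sketch assumes the section headings and the "<address>: <state>" and "<sysfs path> <state>" line layouts shown in the display format example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): print "page <address>" and
# "cpu <sysfs path>" for every entry marked offline-failed in the output of
# mcemonitor --client. Section headings are matched as in the example above.
offline_failed_entries() {
    awk '
        /Per page status corrected error over threshold:/ { sect = "page"; next }
        /Per page status uncorrected error:/              { sect = ""; next }
        /CPU errors/                                      { sect = ""; next }
        /Per CPU status corrected error over threshold:/  { sect = "cpu"; next }
        # Page lines look like "100000: offline-failed".
        sect == "page" && $2 == "offline-failed" { sub(/:$/, "", $1); print "page", $1 }
        # CPU lines look like "/sys/devices/system/cpu5 offline-failed".
        sect == "cpu" && $2 == "offline-failed" { print "cpu", $1 }
    '
}
```

For example: `/opt/nec/mcemonitor/mcemonitor --client | offline_failed_entries`.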

6. Messages
6.1 On-screen Message
6.1.1 On-screen messages output from mcemonitor
The following table shows the on-screen messages (related to fault monitoring) that mcemonitor outputs.
Table 6-1 On-screen messages output from mcemonitor
Cannot open logfile /var/opt/nec/mcemonitor
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
Failed to open the log file, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
cannot open socket
mcemonitor will continue to be run safely. Please retry operation.
Failed to communicate with mcemonitor. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
cannot connect server
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
Failed to communicate with mcemonitor. After mcemonitor is restarted by cron, run the command again.
After cron automatically restarts mcemonitor, run /opt/nec/mcemonitor/mcemonitor --client again.
failed to write socket
mcemonitor will continue to be run safely. Please retry operation.
Failed to communicate with mcemonitor. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
failed to read
mcemonitor will continue to be run safely. Please retry operation.
Failed to communicate with mcemonitor. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
Cannot reopen logfile /var/opt/nec/mcemonitor
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
Failed to reopen the log file, but mcemonitor continues operating.
mcemonitor continues operating. No action is needed.
Usage:
./mcemonitor --client :
display core & page status
./mcemonitor --client=core :
display core status
./mcemonitor --client=page :
display page status
Shows the usage of mcemonitor.
Shows the mcemonitor version.
out of memory
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
After cron automatically restarts mcemonitor, run /opt/nec/mcemonitor/mcemonitor --client again.
Did not receive credentials over client unix socket
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
rejected client access from pid:xx uid:yy gid:zz
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
error while reading from client
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
accept failed on client socket
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
Cannot enable credentials passing on client socket
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
mcemonitor too busy
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
Failed to read command line.
mcemonitorcmd cannot be run from the command line. Do not run mcemonitorcmd from the command line.
/proc/xxx/cmdline read error
Failed to read the command line.
Can't execute this command only
Cannot execute this command. An invalid argument was specified.
insmod: can't read '/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko': No such file or directory
Failed to load acpi_mcecall.ko because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
/dev/mcelog was not found.
/dev/mcelog was not found, so mcemonitor failed to start.
/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found, so mcemonitor failed to start.
/opt/nec/mcemonitor/mcemonitor was not found.
/opt/nec/mcemonitor/mcemonitor was not found, so mcemonitor failed to start.
/opt/nec/mcemonitor/mcemonitorcmd was not found.
/opt/nec/mcemonitor/mcemonitorcmd was not found, so mcemonitor failed to start.
/var/opt/nec was not found.
/var/opt/nec was not found, so mcemonitor failed to start.
Unknown mcemonitor mode xx. Valid daemon
mcemonitor is not in daemon mode.
Set MCEMONITOR_MODE in /etc/rc.d/init.d/mcemonitord to daemon.
mcemonitor already running.
mcemonitor is already running.
/etc/rc.d/init.d/mcemonitord start already running.
mcemonitor is already starting.
mcemonitor already stopped.
mcemonitor is already stopped.
/etc/rc.d/init.d/mcemonitord stop already running.
mcemonitor is already stopping.
Usage: mcemonitord {start|stop|try-restart|restart|status|force-reload|reload}
Shows the usage of /etc/rc.d/init.d/mcemonitord.
Starting mcemonitor daemon [Result]
If the result is [Fail], mcemonitor failed to start.
Check the mcemonitor log, and then restart mcemonitor.
Stopping mcemonitor [Result]
If the result is [Fail], mcemonitor failed to stop.
Check the mcemonitor log, and then stop mcemonitor again.
acpi_mcecall was not loaded
mcemonitor failed to start because acpi_mcecall was not loaded.
Reinstall acpi_call and restart mcemonitor.
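The "Unknown mcemonitor mode xx." message (and "Unknown capmonitor mode xx." in Table 6-2) indicates that MCEMONITOR_MODE (or CAPMONITOR_MODE) in the init script is not set to daemon. As a hedged sketch, assuming the init script sets the variable with a plain shell assignment such as `MCEMONITOR_MODE=daemon` (the exact layout of the script is not documented here), the hypothetical functions below read the assigned value and test whether it is daemon.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the product). Assumes a plain assignment
# such as MCEMONITOR_MODE=daemon, optionally quoted and optionally followed
# by a comment; other forms (for example "export VAR=...") are not matched.
init_var_value() {
    # $1: variable name, $2: file. Prints the last assigned value, if any.
    sed -n "s/^[[:space:]]*$1=[\"']\{0,1\}\([A-Za-z0-9_]*\).*/\1/p" "$2" | tail -n 1
}

# mode_is_daemon VAR FILE: succeed when VAR is assigned the value daemon.
mode_is_daemon() {
    [ "$(init_var_value "$1" "$2")" = "daemon" ]
}
```

For example: `mode_is_daemon MCEMONITOR_MODE /etc/rc.d/init.d/mcemonitord || echo "set MCEMONITOR_MODE to daemon"`.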

6.1.2 On-screen messages output from capmonitor
The following table shows the on-screen messages (related to Core Offline, including the Core Online feature of COPT) that capmonitor outputs.
Table 6-2 On-screen messages output from capmonitor
Cannot open logfile /var/opt/nec/capmonitor
capmonitor exited due to a system error. capmonitor will be restarted by cron.
Failed to open the log file, and capmonitor exited. capmonitor is restarted by cron.
capmonitor is restarted automatically by cron.
cannot open socket
capmonitor will continue to be run safely. Please retry operation.
Failed to communicate with capmonitor. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
cannot connect server
capmonitor exited due to a system error. capmonitor will be restarted by cron.
Failed to communicate with capmonitor. After capmonitor is restarted by cron, run the command again.
After cron automatically restarts capmonitor, run /opt/nec/capmonitor/capmonitor --client=addtime again.
failed to write socket
capmonitor will continue to be run safely. Please retry operation.
Failed to communicate with capmonitor. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
failed to read
capmonitor will continue to be run safely. Please retry operation.
Failed to communicate with capmonitor. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
Cannot reopen logfile /var/opt/nec/capmonitor
capmonitor found a continuable error. capmonitor will continue to be run safely.
Failed to reopen the log file, but capmonitor continues operating.
capmonitor continues operating. No action is needed.
Usage:
./capmonitor --client=addtime :
display cpu core hot-add
processing time
Shows the usage of capmonitor.
Shows the capmonitor version.
out of memory
capmonitor exited due to a system error. capmonitor will be restarted by cron.
An error occurred in a system-related function, and capmonitor exited. capmonitor is restarted by cron.
After cron automatically restarts capmonitor, run /opt/nec/capmonitor/capmonitor --client=addtime again.
Did not receive credentials over client unix socket
capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
rejected client access from pid:xx uid:yy gid:zz
capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
error while reading from client
capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
accept failed on client socket
capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
Cannot enable credentials passing on client socket
capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
capmonitor too busy
capmonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function. Run the command again.
Run /opt/nec/capmonitor/capmonitor --client=addtime again.
Failed to read command line.
capmonitorcmd cannot be run from the command line. Do not run capmonitorcmd from the command line.
/proc/xxx/cmdline read error
Failed to read the command line.
Can't execute this command only
Cannot execute this command. An invalid argument was specified.
insmod: can't read '/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko': No such file or directory
Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found, so capmonitor failed to start.
/opt/nec/capmonitor/capmonitor was not found.
/opt/nec/capmonitor/capmonitor was not found, so capmonitor failed to start.
/opt/nec/capmonitor/capmonitorcmd was not found.
/opt/nec/capmonitor/capmonitorcmd was not found, so capmonitor failed to start.
/var/opt/nec was not found.
/var/opt/nec was not found, so capmonitor failed to start.
Unknown capmonitor mode xx. Valid daemon
Unknown mode. Only daemon mode is valid.
Set CAPMONITOR_MODE in /etc/rc.d/init.d/capmonitord to daemon.
capmonitor already running.
capmonitor is already running.
/etc/rc.d/init.d/capmonitord start already running.
capmonitor is already starting.
capmonitor already stopped.
capmonitor is already stopped.
/etc/rc.d/init.d/capmonitord stop already running.
capmonitor is already stopping.
Usage: capmonitord {start|stop|try-restart|restart|status|force-reload|reload}
Shows the usage of /etc/rc.d/init.d/capmonitord.
Starting capmonitor daemon [Result]
If the result is [Fail], capmonitor failed to start.
Check the capmonitor log, and then restart capmonitor.
Stopping capmonitor [Result]
If the result is [Fail], capmonitor failed to stop.
Check the capmonitor log, and then stop capmonitor again.
acpi_capcall was not loaded
capmonitor failed to start because acpi_capcall was not loaded.
Reinstall acpi_call and restart capmonitor.

6.1.3 On-screen messages output from acpi_call
The following table shows the on-screen messages that acpi_call outputs.
Table 6-3 On-screen messages output from acpi_call
insmod: can't read '/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko': No such file or directory
Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
insmod: can't read '/opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko': No such file or directory
Failed to load acpi_clpcall.ko because /opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko was not found.
insmod: can't read '/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko': No such file or directory
Failed to load acpi_mcecall.ko because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
/opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko was not found.
Failed to load acpi_clpcall.ko because /opt/nec/acpicall/proc/acpi/clpcall/acpi_clpcall.ko was not found.
/opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
Failed to load acpi_mcecall.ko because /opt/nec/acpicall/proc/acpi/mcecall/acpi_mcecall.ko was not found.
Usage: acpicalld {start|stop|restart}
Shows the usage of /etc/rc.d/init.d/acpicalld.
Starting acpi_call driver
Loading the acpi_call driver.
skip to load acpi_call for $KERNEL
Skipped loading acpi_call because of a kernel version mismatch.
Reinstall the acpi_call package that corresponds to the kernel version.
Stopping acpi_call driver [Result]
Stopping the acpi_call driver.
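The "skip to load acpi_call for $KERNEL" message means the installed acpi_call package does not match the running kernel. In the package name used in this guide, mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64, the kernel release 2.6.32-504.23.4.el6.x86_64 appears to be embedded at the end with the hyphen replaced by a dot. The hypothetical check below relies on that naming assumption, which is inferred from a single example and may not hold for every release.

```shell
#!/bin/sh
# Hypothetical check (not part of the product): does the acpi_call package
# name end with the given kernel release? Assumes the naming seen in this
# guide, where 2.6.32-504.23.4.el6.x86_64 appears as 2.6.32.504.23.4.el6.x86_64.
acpicall_matches_kernel() {
    # $1: package name (for example from rpm -q), $2: kernel release (uname -r)
    k=$(printf '%s\n' "$2" | sed 's/-/./g')
    case $1 in
        *."$k") return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `acpicall_matches_kernel "$(rpm -q mcl-acpicall)" "$(uname -r)" || echo "acpi_call does not match the running kernel"`.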

6.1.4 Other on-screen messages
The following table shows other on-screen messages related to Machine Check Monitoring Service.
Table 6-4 Other on-screen messages
Disabling ondemand cpu frequency scaling:
/etc/rc0.d/K99cpuspeed: line 288: /sys/devices/system/cpu/cpuxx/cpufreq/scaling_governor: No such file or directory
The cpuspeed stop processing was not executed for CPU xx because CPU xx is offline.
Skipping the cpuspeed stop processing for an offline CPU is not a problem. No action is needed.
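The cpuspeed message above refers to CPUs that are offline. To see which logical CPUs are currently offline, you can read the standard Linux CPU hotplug attribute /sys/devices/system/cpu/cpuN/online, which contains 0 for an offline CPU. The hypothetical function below lists them; the root directory is a parameter so that the function can also be tried on a copy of the tree.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): print the name of every
# logical CPU whose "online" attribute is 0.
offline_cpus() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/online; do
        # cpu0 often has no "online" file; an unmatched glob is skipped too.
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "0" ]; then
            d=${f%/online}
            echo "${d##*/}"
        fi
    done
}
```

For example, `offline_cpus` with no argument checks the running system.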
6.2 syslog Messages
The following table shows the messages output to syslog.
Table 6-5 syslog messages
The number of active cores exceeded the number of core license.
The number of active cores exceeded the number of core licenses.
Take CPUs offline so that the number of active cores no longer exceeds the number of core licenses.
cpuspeed: Disabling ondemand cpu frequency scaling governor
cpuspeed is stopped.
* This message is output when a CPU is onlined.
cpuspeed: Enabling ondemand cpu frequency scaling governor
cpuspeed is started.
* This message is output when a CPU is onlined.
kdump: kexec: unloaded kdump kernel
kdump: stopped
kdump is stopped.
* This message is output when a CPU is onlined or offlined.
kexec: loaded kdump kernel
kdump: started up
kdump is started.
* This message is output when a CPU is onlined or offlined.
[Hardware Error]: Machine check events logged
kernel: soft_offline: <Page number>: <Message>
Memory page xx was offlined in Soft Mode.
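To find some of the entries from Table 6-5 in the system log, a fixed-string filter is sufficient. The sketch below assumes syslog is written to /var/log/messages (the Red Hat Enterprise Linux 6 default) and matches only a subset of the table's messages; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter (not part of the product): print syslog lines that match
# some of the messages listed in Table 6-5. -F treats each pattern literally.
mcs_syslog_filter() {
    grep -F \
        -e 'Machine check events logged' \
        -e 'soft_offline:' \
        -e 'The number of active cores exceeded the number of core license'
}
```

For example: `mcs_syslog_filter < /var/log/messages`.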

6.3 Operation Log Messages
6.3.1 Operation log messages output from mcemonitor
The following table shows the operation log messages (related to fault monitoring) that mcemonitor outputs.
Table 6-6 Operation log messages output from mcemonitor
Error: 1003 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1007 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1008 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1010 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1015 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1016 <error type>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1017 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1018 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1019 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1025 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 1026 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Warning: 1032
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 1033 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1034
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 1035 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1036 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1038
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1039 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1040 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1041 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1042 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 1043 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Warning: 1045
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
mcemonitor will run with default value.
An error occurred in a system-related function, and mcemonitor continues operating with the default values of mcemonitor.conf.
No action is needed because mcemonitor continues operating with the default values of mcemonitor.conf.
Warning: 1046
memory-ce-action value is unspecified in mcemonitor.conf. mcemonitor will run with default value.
Failed to read memory-ce-action of mcemonitor.conf. mcemonitor will run with the default value of memory-ce-action.
mcemonitor will run with the default value of memory-ce-action. Review the value of memory-ce-action in mcemonitor.conf.
Warning: 1046
core-ce-action value is unspecified in mcemonitor.conf. mcemonitor will run with default value.
Failed to read core-ce-action of mcemonitor.conf. mcemonitor will run with the default value of core-ce-action.
mcemonitor will run with the default value of core-ce-action. Review the value of core-ce-action in mcemonitor.conf.
Warning: 1046
memory-ce-action and core-ce-action values are unspecified in mcemonitor.conf. mcemonitor will run with default value.
Failed to read memory-ce-action and core-ce-action of mcemonitor.conf. mcemonitor will run with the default values of memory-ce-action and core-ce-action.
mcemonitor will run with the default values of memory-ce-action and core-ce-action. Review the values of memory-ce-action and core-ce-action in mcemonitor.conf.
Error: 5001 <error type>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5006 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Warning: 5007
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5008 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5009 <error type> <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5010 <error type>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5011 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5012 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5013 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5015 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5016 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5017 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5018 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5024
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5025 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 5026 <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 5036 <error cause>
mcemonitor will continue to be run safely. Please retry operation.
An error occurred in a system-related function, but mcemonitor continues operating. Run the command again.
Run /opt/nec/mcemonitor/mcemonitor --client again.
Error: 6001 <error type> <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 6002 <error type>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 6003 <error type>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 6004 <error cause>
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Error: 6005 <error type> <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 6006 <error type> <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 6007 <error type> <error cause>
mcemonitor found a continuable error. mcemonitor will continue to be run safely.
An error occurred in a system-related function, but mcemonitor continues operating.
mcemonitor continues running. No action is needed.
Error: 7001
mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
An error occurred in a system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
mcemonitor is restarted automatically by cron.
Cannot open /dev/mcelog. <error
cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred on
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
Restart mcemonitor
automatically by cron.
MCE_GET_RECORD_LEN <error
cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
MCE_GET_LOG_LEN <error
cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
no data in mce record
mcemonitor found a continuable
error. mcemonitor will continue to
be run safely.
An error occurred in a
system-related function, but
mcemonitor continues operating.
mcemonitor continues
running. No action is needed.
mcelog read <error cause>
mcemonitor found a continuable
error. mcemonitor will continue to
be run safely.
An error occurred in a
system-related function, but
mcemonitor continues operating.
mcemonitor continues
running. No action is needed.
warning: xxxx bytes ignored in
each record
consider an update
mcemonitor can not analyze
mcelog due to the inconsistency of
log format. mcemonitor needs to
be updated.
The mce structure in the Linux
kernel may have changed.
Update mcemonitor.
Install the latest version of
mcemonitor.
Cannot open pidfile <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
CPU 56
BANK 128
TSC 0x5e6974d4256
RIP 00:0
MISC 0
ADDR 0
STATUS 0x40000000883c0c00
MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 62
TIME 1366542972 Sun Apr 21 20:16:12 2013
SOCKETID 0
APICID 23
MCGCAP 0x5000c20
error while recieving from kernel
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
cannot open NETLINK socket
<error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.

cannot bind to NETLINK socket
<error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
cannot set FD_CLOEXEC flag on
fd <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
poll table overflow
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
cannot set FD_CLOEXEC flag on
fd
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
poll table overflow
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
poll error <error cause>
mcemonitor found a continuable
error. mcemonitor will continue to
be run safely.
An error occurred in a
system-related function, but
mcemonitor continues operating.
mcemonitor continues
running. No action is needed.
Offlining CPU xx due to corrected
error threshold
CPU xx is offlined because its
corrected errors exceeded the threshold.
Not offlining CPU 0 because of
kernel running on CPU 0.
CPU 0 cannot be offlined because
the kernel is running on it.
Offlining CPU xx succeeded
Offlining CPU xx succeeded.
Offlining CPU xx failed because
the CPU was in use.
Kernel does not support page
offline interface
The kernel does not support
Memory Page Offline.
Corrected memory errors on page
xx exceeded threshold
Correctable memory errors on
page xx exceeded the threshold
value.
Offlining page xx due to corrected
error threshold
Memory errors at address xx
exceeded the threshold value.
The memory page is offlined.
Not offlining page xx because this
offlining page has already
succeeded.
The memory page is not offlined
because memory address xx has
already been offlined.
Not offlining page xx because this
offlining page has already failed.
The memory page is not offlined
because offlining of memory
address xx has already failed.
Soft offlining of page xx
succeeded.
Memory Page Offline was
executed on address xx in Soft
Mode.
Soft offlining of page xx failed.
Could not offline the memory
page because the page was in
use.
Could not offline the memory
page because the page was
in use. Check syslog to confirm
the attributes of the page.
mcemonitor already running
mcemonitor is already running.

cannot open listening socket
<error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
cannot bind to client unix socket
<error cause>
/var/run/mcemonitor-client
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 6008
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 1047 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 1048 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 1049 <error cause>
mcemonitor found a continuable
error. mcemonitor will continue to
be run safely.
An error occurred in a
system-related function, but
mcemonitor is running safely.
mcemonitor is running safely.
No action is needed.
Error: 1050 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 1051 <error cause>
mcemonitor exited due to a system
error.
An error occurred in a
system-related function during
the stop phase, and mcemonitor exited.
Restart mcemonitor, then
stop mcemonitor.
Error: 1052 <error cause>
mcemonitor exited due to a system
error.
An error occurred in a
system-related function during
the stop phase, and mcemonitor exited.
Restart mcemonitor, then
stop mcemonitor.
Error: 1053 <error cause>
mcemonitor exited due to a system
error.
An error occurred in a
system-related function during
the stop phase, and mcemonitor exited.
Restart mcemonitor, then
stop mcemonitor.
Error: 5037 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 5038 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 5039 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 6009 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
Error: 6010 <error cause>
mcemonitor exited due to a system
error. mcemonitor will be restarted
by cron.
An error occurred in a
system-related function, and
mcemonitor exited. mcemonitor
is restarted by cron.
cron restarts mcemonitor
automatically.
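The message texts in these tables follow a few fixed phrasings, so collected log lines can be sorted by the kind of response they call for. The following Python sketch is illustrative only and is not part of Machine Check Monitoring Service: it assumes each message appears on a single line of text, and the category names it returns (`restarted_by_cron`, `manual_action`, `continuable`, `other`) are hypothetical labels.

```python
import re
from typing import Optional, Tuple

# Phrases copied from the message texts in the tables of this section.
# The category names returned below are hypothetical labels.
RESTARTED = re.compile(r"\b(mcemonitor|capmonitor) will be restarted by cron")
EXITED = re.compile(r"\b(mcemonitor|capmonitor) exited due to")
CONTINUABLE = re.compile(r"\b(mcemonitor|capmonitor) found a continuable error")
CODE = re.compile(r"\b(?:Error|Warning|Info): (\d{4})\b")


def triage(line: str) -> Tuple[str, Optional[str]]:
    """Return (category, message number or None) for one log line."""
    match = CODE.search(line)
    number = match.group(1) if match else None
    if RESTARTED.search(line):
        # The daemon exited, and cron restarts it automatically.
        return "restarted_by_cron", number
    if EXITED.search(line):
        # Exited without a cron restart (for example, a stop-phase error or a
        # capmonitor.conf error): follow the Action column for that message.
        return "manual_action", number
    if CONTINUABLE.search(line):
        # The daemon keeps running; the tables say no action is needed.
        return "continuable", number
    return "other", number
```

For example, `triage("Error: 1051 EIO mcemonitor exited due to a system error.")` falls into `manual_action`, matching the "Restart mcemonitor, then stop mcemonitor." action for that message. The check for "will be restarted by cron" runs first because those messages also contain "exited due to".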

6.3.2 Operation log messages output from capmonitor
The following table shows the operation log messages that capmonitor outputs (Core Offline logs,
including logs related to COPT Core Online).
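Info: 5114 and Info: 5121 in the table below report that the number of active cores exceeded the number of core licenses. To compare the two yourself, you can count online cores from the standard Linux sysfs files. The Python sketch below is an assumption, not how capmonitor counts: it treats each distinct (`physical_package_id`, `core_id`) pair of an online CPU as one core, so Hyper-Threading siblings are not counted twice, and the licensed core count is a value you supply.

```python
from pathlib import Path
from typing import Dict, Set, Tuple

SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpulist(text: str) -> Set[int]:
    """Parse the kernel cpulist format, e.g. '0-3,8,10-11'."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def count_cores(online: Set[int], topology: Dict[int, Tuple[int, int]]) -> int:
    """Count distinct (package, core) pairs among the online logical CPUs."""
    return len({topology[cpu] for cpu in online})


def read_topology(cpus: Set[int]) -> Dict[int, Tuple[int, int]]:
    """Read the package ID and core ID of each CPU from sysfs (Linux only)."""
    topology = {}
    for cpu in cpus:
        base = SYSFS_CPU / f"cpu{cpu}" / "topology"
        package = int((base / "physical_package_id").read_text())
        core = int((base / "core_id").read_text())
        topology[cpu] = (package, core)
    return topology


def active_cores_exceed(licensed_cores: int) -> bool:
    """True if more cores are online than the licensed count you pass in."""
    online = parse_cpulist((SYSFS_CPU / "online").read_text())
    return count_cores(online, read_topology(online)) > licensed_cores
```

The pair of IDs is used because `core_id` is only unique within one processor package.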
Table 6-7 Operation log messages output from capmonitor
Error: 1102 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1103 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1104 <error type>
capmonitor exited due to
capmonitor.conf error.
cpu-hotadd-online-timeout value
needs to be less 1200 seconds.
Please correct capmonitor.conf
and restart capmonitor.
capmonitor exited due to
capmonitor.conf error.
cpu-hotadd-online-timeout value
needs to be less than 1200
seconds. Please correct
capmonitor.conf and restart
capmonitor.
Change
cpu-hotadd-online-timeout
value of capmonitor.conf to
less than 1200 seconds, and
restart capmonitor.
Error: 1104 <error type>
capmonitor exited due to
capmonitor.conf error.
cpu-hotadd-timeout value needs to
be less cpu-hotadd-online-timeout.
Please correct capmonitor.conf
and restart capmonitor.
capmonitor exited due to
capmonitor.conf error.
cpu-hotadd-timeout value needs
to be less than
cpu-hotadd-online-timeout.
Please correct capmonitor.conf
and restart capmonitor.
Change cpu-hotadd-timeout
value of capmonitor.conf to
less than
cpu-hotadd-online-timeout
value, and restart
capmonitor.
Error: 1104 <error type>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1105 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1106 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1107 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1108 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1109 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 1110 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.

Warning: 1111 <error type>
cpu-hotadd-timeout value is
unspecified in capmonitor.conf.
capmonitor will run with default
value.
Failed to read
cpu-hotadd-timeout from
capmonitor.conf.
capmonitor will run with default
value of cpu-hotadd-timeout.
capmonitor will run with
default value of
cpu-hotadd-timeout.
Review the setting value of
cpu-hotadd-timeout in
capmonitor.conf.
Warning: 1111 <error type>
cpu-hotadd-online-timeout value is
unspecified in capmonitor.conf.
capmonitor will run with default
value.
Failed to read
cpu-hotadd-online-timeout from
capmonitor.conf.
capmonitor will run with default
value of
cpu-hotadd-online-timeout.
capmonitor will run with
default value of
cpu-hotadd-online-timeout.
Review the setting value of
cpu-hotadd-online-timeout in
capmonitor.conf.
Warning: 1111 <error type>
cpu-hotadd-timeout and
cpu-hotadd-online-timeout values
are unspecified in capmonitor.conf.
capmonitor will run with default
value.
Failed to read
cpu-hotadd-timeout and
cpu-hotadd-online-timeout from
capmonitor.conf.
capmonitor will run with default
values of cpu-hotadd-timeout
and cpu-hotadd-online-timeout.
capmonitor will run with
default values of
cpu-hotadd-timeout and
cpu-hotadd-online-timeout.
Review the setting values of
cpu-hotadd-timeout and
cpu-hotadd-online-timeout in
capmonitor.conf.
Error: 5101 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Error: 5102 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Error: 5103 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5104 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5104 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5105 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5106 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5107 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5108 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Error: 5109 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.

Error: 5110 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5111 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Error: 5112 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5113 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Info: 5114 <CPU number>
The number of active cores
exceeded the number of core
license.
The number of active cores
exceeded the number of core
licenses.
Error: 5115 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Error: 5116 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5117 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5118 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Error: 5119 <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 5120 <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
Info: 5121 <CPU number>
The number of active cores
exceeded the number of core
license.
The number of active cores
exceeded the number of core
licenses.
Error: 5122
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Error: 6100 <error cause>
capmonitor will continue to be run
safely. Please retry operation.
An error occurred in a
system-related function, but
capmonitor continues operating.
Run the command again.
Run
/opt/nec/capmonitor/capmonitor
--client=addtime again.
Error: 7101
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.

Cannot open pidfile <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
Hot Add CPU xx succeeded.
Hot Add of spare CPU xx
succeeded.
Hot Add of spare CPU xx failed.
Hot Add CPU xx timeouted.
Hot Add of spare CPU xx
timed out.
error while recieving from kernel
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
cannot open NETLINK socket
<error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
cannot bind to NETLINK socket
<error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
cannot set FD_CLOEXEC flag on
fd <error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
poll table overflow
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
poll error <error cause>
capmonitor found a continuable
error. capmonitor will continue to
be run safely.
An error occurred in a
system-related function, but
capmonitor continues operating.
capmonitor continues
running. No action is needed.
cannot open listening socket
<error cause>
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
cannot bind to client unix socket
<error cause>
/var/run/capmonitor-client
capmonitor exited due to a system
error. capmonitor will be restarted
by cron.
An error occurred in a
system-related function, and
capmonitor terminated.
capmonitor is restarted by cron.
cron restarts capmonitor
automatically.
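Error: 1104 and Warning: 1111 in Table 6-7 both concern the two timeout settings in capmonitor.conf. The Python sketch below checks the rules stated in those rows before you restart capmonitor: cpu-hotadd-online-timeout must be less than 1200 seconds, cpu-hotadd-timeout must be less than cpu-hotadd-online-timeout, and an unspecified value falls back to a default. The file syntax is an assumption, since this section does not describe it: the hypothetical `parse_conf` accepts one `key = value` pair per line with `#` comments. Adjust it if the actual format differs.

```python
from typing import Dict, List, Optional

HOTADD_KEY = "cpu-hotadd-timeout"
ONLINE_KEY = "cpu-hotadd-online-timeout"
ONLINE_LIMIT = 1200  # seconds; exclusive upper bound from Error: 1104


def parse_conf(text: str) -> Dict[str, str]:
    """Assumed syntax: 'key = value' per line; '#' starts a comment."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            conf[key.strip()] = value.strip()
    return conf


def check_timeouts(conf: Dict[str, str]) -> List[str]:
    """Return the problems found; an empty list means all checks passed.

    int() raises ValueError if a value is not a whole number of seconds.
    """
    problems: List[str] = []
    values: Dict[str, Optional[int]] = {}
    for key in (HOTADD_KEY, ONLINE_KEY):
        if conf.get(key):
            values[key] = int(conf[key])
        else:
            # Warning: 1111 - capmonitor runs with the default for this key.
            values[key] = None
            problems.append(f"{key} is unspecified; the default value will be used")
    online, hotadd = values[ONLINE_KEY], values[HOTADD_KEY]
    if online is not None and online >= ONLINE_LIMIT:
        problems.append(f"{ONLINE_KEY} must be less than {ONLINE_LIMIT} seconds")
    # The default values are not given here, so the ordering rule is only
    # checked when both keys are set explicitly.
    if online is not None and hotadd is not None and hotadd >= online:
        problems.append(f"{HOTADD_KEY} must be less than {ONLINE_KEY}")
    return problems
```

For example, a file that sets cpu-hotadd-timeout to 900 and cpu-hotadd-online-timeout to 600 yields the single problem "cpu-hotadd-timeout must be less than cpu-hotadd-online-timeout", the condition reported by the second Error: 1104 row.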

7. Restrictions and Precautions
7.1 Manually Onlining a Core-Offlined CPU
Do not manually online a CPU that has been core offlined.
When the number of correctable errors exceeds the threshold value, Machine Check Monitoring Service
offlines the failed core.
The OS cannot use a core-offlined CPU.
Although you can online the core from the OS (*), the offlined CPU is failing. Do not bring the failing
CPU online.
* (Example) # echo 1 > /sys/devices/system/cpu/cpuX/online
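Before onlining a CPU by hand, you can check collected logs for the "Offlining CPU xx succeeded" message listed among the mcemonitor operation log messages. The Python sketch below is a hypothetical helper, not part of Machine Check Monitoring Service. It assumes you have already gathered the log lines (where they are stored is outside this section), and it cannot tell whether a CPU has since been replaced.

```python
import re
from typing import Iterable, Set

# Message text from the mcemonitor operation log table; "xx" is the CPU number.
OFFLINED = re.compile(r"\bOfflining CPU (\d+) succeeded")


def core_offlined_cpus(log_lines: Iterable[str]) -> Set[int]:
    """Collect CPU numbers that Core Offline reported as successfully offlined."""
    cpus: Set[int] = set()
    for line in log_lines:
        match = OFFLINED.search(line)
        if match:
            cpus.add(int(match.group(1)))
    return cpus


def safe_to_online(cpu: int, log_lines: Iterable[str]) -> bool:
    """False if the CPU was core offlined; such a CPU must not be onlined."""
    return cpu not in core_offlined_cpus(log_lines)
```

The match is case-sensitive, so "Not offlining CPU 0 because of kernel running on CPU 0." is not counted as an offlined CPU.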
7.2 cpuspeed Error Message Output at OS Shutdown
An error message from cpuspeed may be displayed when the OS is shut down. This message does not
affect system operation.
When the number of correctable errors exceeds the threshold value, the failed CPU core is offlined. If the
OS is shut down after CPU Core Offline, cpuspeed outputs this message.
The message indicates that the cpuspeed daemon failed to execute its end processing on the core-offlined
CPU. Skipping end processing on an offlined CPU does not affect system operation, so you can ignore
this message.
See 6.1.4 Other on-screen messages for details of the error message.

© NEC Corporation 2015
No part of this manual may be reproduced in any form without the prior written permission of NEC Corporation.
Express5800/A2040b,A2020b,A2010b,A1040b
Machine Check Monitoring Service
User’s Guide
(Release 1.5)
NEC Corporation
7-1 Shiba 5-Chome, Minato-Ku
Tokyo 108-8001, Japan
TEL (03) 3454-1111 (Main phone number)