Information furnished in this manual is believed to be accurate and reliable. However, QLogic Corporation assumes no
responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use.
QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described
in this document for any of these products are for illustrative purposes only. QLogic Corporation makes no representation
nor warranty that such applications are suitable for the specified use without further testing or modification. QLogic
Corporation assumes no responsibility for any errors that may appear in this document.
No part of this document may be copied nor reproduced by any means, nor translated nor transmitted to any magnetic
medium without the express written consent of QLogic Corporation. In accordance with the terms of their valid PathScale
agreements, customers are permitted to make electronic and paper copies of this document for their own exclusive use.
Linux is a registered trademark of Linus Torvalds.
QLA, QLogic, SANsurfer, the QLogic logo, PathScale, the PathScale logo, and InfiniPath are registered trademarks
of QLogic Corporation.
Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc.
SuSE is a registered trademark of SuSE Linux AG.
All other brand and product names are trademarks or registered trademarks of their respective owners.
Document Revision History
Rev. 1.0, 8/20/2005
Rev. 1.1, 11/15/05
Rev. 1.2, 02/15/06
Rev. 1.3 Beta 1, 4/15/06
Rev. 1.3, 6/15/06
Rev. 2.0 Beta, 9/25/06,
QLogic Rev IB6054601 A
Rev. 2.0 Beta 2, 10/15/06,
QLogic Rev IB6054601 B
Rev. 2.0, 11/30/06,
QLogic Rev IB6054601 C
Rev. 2.0, 3/23/07,
QLogic Rev IB6054601 D
Rev. D Changes (with affected document sections in parentheses):
Added metadata to PDF document only (PDF metadata)

Rev. C Changes (with affected document sections in parentheses):
Updated Preface and Overview by combining them into a single section, now called Introduction; same as the introduction in the Install Guide (1)
Added SLES 9 as a new supported distribution (1.7)
Revised info about MTRR mapping in BIOS; some BIOS versions don't have it, or call it something else (C.1, C.2.1, C.2.2, C.2.3)
Corrected usage of ipath_core, replacing it with ib_ipath (2)
Added more options to the mpirun man page description (3.5.10)
Added new section on Environment for Multiple Versions of InfiniPath or MPI (3.5.8.1)
Added info on support for multiple MPIs (3.6)
Added info about using MPI over uDAPL; the modules rdma_cm and rdma_ucm need to be loaded
Added section: Error messages generated by mpirun. This explains more about the types of errors found in the sub-sections. Also added error messages related to failed connections between nodes
Added mpirun error message about stray processes to the error message section
Added driver and link error messages reported by MPI programs (C.8.12.3)
Added section about errors occurring when different runtime/compile time MPI versions are used; 2.0 mpirun is incompatible with 1.3 libraries (C.8.1)
Added glossary entry for MTRR (E)
Added new index entries for MPI error messages format; corrected index formatting
Section 1
Introduction
This chapter describes the objectives, intended audience, and organization of the
InfiniPath User Guide.
The InfiniPath User Guide is intended to give the end users of an InfiniPath cluster
what they need to know to use it. In this case, end users are understood to include
both the cluster administrator and the MPI application programmers, who have
different but overlapping interests in the details of the technology.
For specific instructions about installing the InfiniPath QLE7140 PCI Express™
adapter, the QMI7140 adapter, or the QHT7140/QHT7040 HTX™ adapters, and
the initial installation of the InfiniPath software, see the InfiniPath Install Guide.
1.1
Who Should Read this Guide
This guide is intended both for readers responsible for administration of an InfiniPath
cluster network and for readers wanting to use that cluster.
This guide assumes that all readers are familiar with cluster computing, that the
cluster administrator reader is familiar with Linux administration, and that the
application programmer reader is familiar with MPI.
1.2
How this Guide is Organized
The InfiniPath User Guide is organized into these sections:
■ Section 1 “Introduction”. This section.
■ Section 2 “InfiniPath Cluster Administration” describes the lower levels of the
supplied InfiniPath software. This would be of interest mainly to an InfiniPath
cluster administrator.
■ Section 3 “Using InfiniPath MPI” helps the MPI programmer make best use of
the InfiniPath MPI implementation.
■ Appendix A “Benchmark Programs”
■ Appendix B “Integration with a Batch Queuing System”
■ Appendix C “Troubleshooting”. The Troubleshooting section provides
information for troubleshooting installation, cluster administration, and MPI.
■ Appendix D “Recommended Reading”
■ Appendix E Glossary of technical terms
■ Index
In addition, the InfiniPath Install Guide contains information on InfiniPath hardware
and software installation.
1.3
Overview
The material in this documentation pertains to an InfiniPath cluster. This is defined
as a collection of nodes, each attached to an InfiniBand™-based fabric through the
InfiniPath Interconnect. The nodes are Linux-based computers, each having up to
eight processors.
The InfiniPath interconnect is InfiniBand 4X, with a raw data rate of 10 Gb/s (data
rate of 8 Gb/s).
InfiniPath utilizes standard, off-the-shelf InfiniBand 4X switches and cabling.
InfiniPath OpenFabrics software is interoperable with other vendors’ InfiniBand
HCAs running compatible OpenFabrics releases. There are two options for Subnet
Management in your cluster:
■ Use the Subnet Manager on one or more managed switches supplied with your
InfiniBand switches.
■ Use the OpenSM component of OpenFabrics.
1.4
Switches
The InfiniPath interconnect is designed to work with all InfiniBand-compliant
switches. Use of OpenSM as a subnet manager is now supported. OpenSM is part
of the OpenFabrics component of this release.
1.5
Interoperability
InfiniPath participates in the standard InfiniBand Subnet Management protocols for
configuration and monitoring. InfiniPath OpenFabrics (including IPoIB) is
interoperable with other vendors’ InfiniBand HCAs running compatible OpenFabrics
releases. The InfiniPath MPI and Ethernet emulation stacks (ipath_ether) are not
interoperable with other InfiniBand Host Channel Adapters (HCA) and Target
Channel Adapters (TCA). Instead, InfiniPath uses an InfiniBand-compliant
vendor-specific protocol that is highly optimized for MPI and TCP between
InfiniPath-equipped hosts.
NOTE:OpenFabrics was known as OpenIB until March 2006. All relevant
references to OpenIB in this documentation have been updated to reflect
this change. See the OpenFabrics website at http://www.openfabrics.org
for more information on the OpenFabrics Alliance.
1.6
What’s New in this Release
QLogic Corp. acquired PathScale in April 2006. In this 2.0 release, product names
and internal program and output message names now refer to QLogic rather than
PathScale.
The new QLogic and former PathScale adapter model numbers are shown in the
table below.
Table 1-1. PathScale-QLogic Adapter Model Numbers

Former PathScale Model Number | New QLogic Model Number | Description
HT-400 | IBA6110 | Single Port 10GBS InfiniBand to HTX ASIC ROHS
PE-800 | IBA6120 | Single Port 10GBS InfiniBand to x8 PCI Express ASIC ROHS
HT-460 | QHT7040 | Single Port 10GBS InfiniBand to HTX Adapter
HT-465 | QHT7140 | Single Port 10GBS InfiniBand to HTX Adapter
PE-880 | QLE7140 | Single Port 10GBS InfiniBand to x8 PCI Express Adapter
PE-850 | QMI7140 | Single Port 10GBS InfiniBand IBM Blade Center Adapter
This version of InfiniPath provides support for all QLogic’s HCAs, including:
■ InfiniPath QLE7140, which is supported on systems with PCIe x8 or x16 slots
■ InfiniPath QMI7140, which runs on Power PC systems, particularly on the IBM®
BladeCenter H processor blades
■ InfiniPath QHT7040 and QHT7140, which leverage HTX™. The InfiniPath
QHT7040 and QHT7140 are exclusively for motherboards that support
HTX cards. The QHT7140 has a smaller form factor than the QHT7040, but is
otherwise the same. Unless otherwise stated, QHT7140 will refer to both the
QHT7040 and QHT7140 in this documentation.
Expanded MPI scalability enhancements for PCI Express have been added. The
QHT7040 and QHT7140 can support 2 processes per context for a total of 16. The
QLE7140 and QMI7140 also support 2 processes per context, for a total of 8.
Support for multiple versions of MPI has been added. You can use a different version
of MPI and achieve the high-bandwidth and low-latency performance that is
standard with InfiniPath MPI.
Also included is expanded operating system support, and support for the latest
OpenFabrics software stack.
Multiple InfiniPath cards per node are supported. A single software installation works
for all the cards.
Additional up-to-date information can be found on the QLogic web site:
http://www.qlogic.com
1.7
Supported Distributions and Kernels
The InfiniPath interconnect runs on AMD Opteron, Intel EM64T, and IBM Power
(Blade Center H) systems running Linux. The currently supported distributions and
associated Linux kernel versions for InfiniPath and OpenFabrics are listed in the
following table. The kernels are the ones that shipped with the distributions, unless
otherwise noted.
Table 1-2. InfiniPath/OpenFabrics Supported Distributions and Kernels

Distribution | InfiniPath/OpenFabrics supported kernels
Fedora Core 3 (FC3) | 2.6.12 (x86_64)
Fedora Core 4 (FC4) | 2.6.16, 2.6.17 (x86_64)
Red Hat Enterprise Linux 4 (RHEL4) | 2.6.9-22, 2.6.9-34, 2.6.9-42 (U2/U3/U4) (x86_64)
CentOS 4.2-4.4 (Rocks 4.2-4.4) | 2.6.9 (x86_64)
SUSE Linux 9.3 (SUSE 9.3) | 2.6.11 (x86_64)
SUSE Linux Enterprise Server (SLES 9) | 2.6.5 (x86_64)
SUSE Linux Enterprise Server (SLES 10) | 2.6.16 (x86_64 and ppc64)

NOTE: IBM Power systems run only with the SLES 10 distribution.

The SUSE 10 release series is no longer supported as of this InfiniPath 2.0 release.
Fedora Core 4 kernels prior to 2.6.16 are also no longer supported.
1.8
Software Components
The software provided with the InfiniPath Interconnect product consists of:
■ InfiniPath driver (including OpenFabrics)
■ InfiniPath ethernet emulation
■ InfiniPath libraries
■ InfiniPath utilities, configuration, and support tools
■ InfiniPath MPI
■ InfiniPath MPI benchmarks
■ OpenFabrics protocols, including Subnet Management Agent
■ OpenFabrics libraries and utilities
OpenFabrics kernel module support is now built and installed as part of the InfiniPath
RPM install. The InfiniPath release 2.0 runs on the same code base as OpenFabrics
Enterprise Distribution (OFED) version 1.1. It also includes the OpenFabrics
1.1-based library and utility RPMs. InfiniBand protocols are interoperable between
InfiniPath 2.0 and OFED 1.1.
This release provides support for the following protocols:
■ IPoIB (TCP/IP networking)
■ SDP (Sockets Direct Protocol)
■ OpenSM
■ UD (Unreliable Datagram)
■ RC (Reliable Connection)
■ UC (Unreliable Connection)
■ SRQ (Shared Receive Queue)
■ uDAPL (user Direct Access Provider Library)
This release includes a technology preview of:
■ SRP (SCSI RDMA Protocol)
Future releases will provide support for:
■ iSER (iSCSI Extensions for RDMA)
No support is provided for RD.
NOTE:32 bit OpenFabrics programs using the verb interfaces are not supported
in this InfiniPath release, but will be supported in a future release.
1.9
Conventions Used in this Document
This Guide uses these typographical conventions:
Table 1-3. Typographical Conventions
Convention | Meaning
command | Fixed-space font is used for literal items such as commands, functions, programs, files and pathnames, and program output.
variable | Italic fixed-space font is used for variable names in programs and command lines.
concept | Italic font is used for emphasis, concepts.
user input | Bold fixed-space font is used for literal items in commands or constructs that you type in.
$ | Indicates a command line prompt.
# | Indicates a command line prompt as root when using bash or sh.
[ ] | Brackets enclose optional elements of a command or program construct.
... | Ellipses indicate that a preceding element can be repeated.
> | Right caret identifies the cascading path of menu commands used in a procedure.
2.0 | The current version number of the software is included in the RPM names and within this documentation.
NOTE: | Indicates important information.
1.10
Documentation and Technical Support
The InfiniPath product documentation includes:
■ The InfiniPath Install Guide
■ The InfiniPath User Guide
■ Release Notes
■ Quick Start Guide
■ Readme file
The Troubleshooting Appendix for installation, InfiniPath and OpenFabrics
administration, and MPI issues is located in the InfiniPath User Guide.
Visit the QLogic support Web site for documentation and the latest software updates.
http://www.qlogic.com
Section 2
InfiniPath Cluster Administration
This chapter describes what the cluster administrator needs to know about the
InfiniPath software and system administration.
2.1
Introduction
The InfiniPath driver ib_ipath, the layered Ethernet driver ipath_ether, OpenSM,
and other modules, together with the protocol and MPI support libraries, are the
components of the InfiniPath software providing the foundation that supports the
MPI implementation.
Figure 2-1, below, shows these relationships.
[Figure 2-1. InfiniPath software layering, from top to bottom: MPI Application, InfiniPath Channel (ADI Layer), InfiniPath Protocol Library, InfiniPath Hardware]
2.2
Installed Layout
The InfiniPath software is supplied as a set of RPM files, described in detail in the
InfiniPath Install Guide. This section describes the directory structure that the
installation leaves on each node’s file system.
The InfiniPath shared libraries are installed in:
/usr/lib for 32-bit applications
/usr/lib64 for 64-bit applications
MPI programming examples and source for several MPI benchmarks are in:
/usr/share/mpich/examples
InfiniPath utility programs, as well as MPI utilities and benchmarks are installed in:
/usr/bin
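As a quick sanity check that these pieces are present on a node, you can list the
installed InfiniPath RPMs and a few of the paths named above (a sketch; exact
package names vary by release):
$ rpm -qa | grep -i infinipath
$ ls /usr/share/mpich/examples
$ ls /usr/bin/mpirun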
The InfiniPath kernel modules are installed in the standard module locations in:
/lib/modules (version dependent)
They are compiled and installed when the infinipath-kernel RPM is installed.
They must be rebuilt and re-installed when the kernel is upgraded. This can be done
by running the script:
2.3
Memory Footprint
The following is a preliminary guideline for estimating the memory footprint of the
InfiniPath adapter on Linux x86_64 systems. Memory consumption is linear based
on system configuration. OpenFabrics support is under development and has not
been fully characterized. This table summarizes the guidelines.
Table 2-1. Memory Footprint of the InfiniPath Adapter on Linux x86_64 Systems

Adapter component | Required/optional | Memory Footprint | Comment
InfiniPath Driver | Required | 9 MB | Includes accelerated IP support. Includes tables space to support up to 1000 node systems. Clusters larger than 1000 nodes can also be configured.
MPI | Optional | 71 MB per process with default parameters, + 32 MB per node when multiple processes communicate via shared memory, + 264 Bytes per MPI node on the subnet | The parameters (sendbufs, recvbufs and size of the shared memory region) are tunable if a reduced memory footprint is desired.
OpenFabrics | Optional | 1~6 MB, + ~500 bytes per QP, + TBD bytes per MR, + ~500 bytes per EE Context, + OpenFabrics stack from openfabrics.org (size not included in these guidelines) | This has not been fully characterized as of this writing.
Here is an example for a 1024 processor system:
■ 1024 cores over 256 nodes (each node has 2 sockets with dual-core processors)
■ 1 adapter per node
■ Each core runs an MPI process, with the 4 processes per node communicating
via shared memory.
■ Each core uses OpenFabrics to connect with storage and file system targets
using 50 QPs and 50 EECs per core.
This breaks down to a memory footprint of 331 MB per node, as follows:

Table 2-2. Memory Footprint, 331 MB per Node

Component | Footprint (in MB) | Breakdown
Driver | 9 | Per node
MPI | 316 | 4*71 MB (MPI per process) + 32 MB (shared memory per node)
OpenFabrics | 6 | 6 MB + 200 KB per node
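As a rough cross-check of these numbers, using the per-component figures from
Table 2-1 (9 MB for the driver, 71 MB per MPI process, 32 MB of shared memory
per node, about 6 MB for OpenFabrics, plus roughly 200 KB for the node's 200 QPs
and 200 EE contexts at ~500 bytes each), the arithmetic can be verified from the
shell:
$ echo "9 + 4*71 + 32 + 6 + 0.2" | bc
331.2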
2.4
Configuration and Startup
2.4.1
BIOS Settings
A properly configured BIOS is required. The BIOS settings, which are stored in
non-volatile memory, contain certain parameters characterizing the system. These
parameters may include date and time, configuration settings, and information about
the installed hardware.
There are currently two issues concerning BIOS settings that you need to be aware
of:
■ ACPI needs to be enabled
■ MTRR mapping needs to be set to “Discrete”
MTRR (Memory Type Range Registers) is used by the InfiniPath driver to enable
write combining to the InfiniPath on-chip transmit buffers. This improves write
bandwidth to the InfiniPath chip by writing multiple words in a single bus transaction
(typically 64). This applies only to x86_64 systems.
However, some BIOSes don’t have the MTRR mapping option. It may be referred
to in a different way, dependent upon chipset, vendor, BIOS, or other factors. For
example, it is sometimes referred to as "32 bit memory hole", which should be
enabled.
If there is no setting for MTRR mapping or 32 bit memory hole, please contact your
system or motherboard vendor and inquire as to how write combining may be
enabled.
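If you want to confirm that write combining is actually in effect once the driver is
loaded, the kernel's MTRR table can be inspected directly; an entry marked
write-combining covering the adapter's transmit buffers is what the driver arranges
(a quick check only, not a replacement for the BIOS setting):
$ cat /proc/mtrr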
ACPI and MTRR mapping issues are discussed in greater detail in the
Troubleshooting section of the InfiniPath User Guide.
NOTE:BIOS settings on IBM Blade Center H (Power) systems do not need
adjustment.
You can check and adjust these BIOS settings using the BIOS Setup Utility. For
specific instructions on how to do this, follow the hardware documentation that came
with your system.
2.4.2
InfiniPath Driver Startup
The ib_ipath module provides low level InfiniPath hardware support. It does
hardware initialization, handles infinipath-specific memory management, and
provides services to other InfiniPath and OpenFabrics modules. It provides the
management functions for InfiniPath MPI programs, the ipath_ether ethernet
emulation, and general OpenFabrics protocols such as IPoIB, and SDP. It also
contains a Subnet Management Agent.
The InfiniPath driver software is generally started at system startup under control
of these scripts:
/etc/init.d/infinipath
/etc/sysconfig/infinipath
These scripts are configured by the installation. Debug messages are printed with
the function name preceding the message.
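To confirm that the driver came up at boot, a minimal check is to look for the module
and for its messages in the kernel log (the grep pattern is just illustrative):
$ lsmod | grep ib_ipath
$ dmesg | grep -i ipath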
The cluster administrator does not normally need to be concerned with the
configuration parameters. Assuming that all the InfiniPath and OpenFabrics
software has been installed, the default settings upon startup will be:
■ InfiniPath ib_ipath is enabled
■ InfiniPath ipath_ether is not running until configured
■ OpenFabrics IPoIB is not running until configured
■ OpenSM is enabled on startup. Disable it on all nodes except where it will be
used as subnet manager.
2.4.3
InfiniPath Driver Software Configuration
The ib_ipath driver has several configuration variables which provide for setting
reserved buffers for the software, defining events to create trace records, and setting
debug level. See the ib_ipath man page for details.
2.4.4
InfiniPath Driver Filesystem
The InfiniPath driver supplies a filesystem for exporting certain binary statistics to
user applications. By default, this filesystem is mounted in the /ipathfs directory
when the infinipath script is invoked with the "start" option (e.g. at system startup)
and unmounted when the infinipath script is invoked with the "stop" option (e.g. at
system shutdown).
The layout of the filesystem is as follows:
atomic_stats
00/
01/
...
The atomic_stats file contains general driver statistics. There is one numbered
directory per InfiniPath device on the system. Each numbered directory contains
the following files of per-device statistics:
atomic_counters
node_info
port_info
The atomic_counters file contains counters for the device: examples would be
interrupts received, bytes and packets in and out, and so on. The node_info file
contains information such as the device's GUID. The port_info file contains
information for each port on the device. An example would be the port LID.
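These files can be read with ordinary shell tools; for example (a sketch, assuming
a single adapter in directory 00):
# cat /ipathfs/atomic_stats
# cat /ipathfs/00/atomic_counters
# cat /ipathfs/00/port_info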
2.4.5
Subnet Management Agent
Each node in an InfiniPath cluster runs a Subnet Management Agent (SMA), which
carries out two-way communication with the Subnet Manager (SM) running on one
or more managed switches. The Subnet Manager is responsible for network
initialization (topology discovery), configuration, and maintenance. The Subnet
Manager also assigns and manages InfiniBand multicast groups, such as the group
used for broadcast purposes by the ipath_ether driver. The primary functions of
the SMA are to keep the SM informed whether a node is alive and to get the node's
assigned identifier (LID) from the SM.
2.4.6
Layered Ethernet Driver
The layered Ethernet component ipath_ether provides almost complete Ethernet
software functionality over the InfiniPath fabric. At startup this is bound to some
Ethernet device ethx. All Ethernet functions are available through this device in a
transparent way, except that Ethernet multicasting is not supported. Broadcasting
is supported. You can use all the usual command line and GUI-based configuration
tools on this Ethernet. Configuration of ipath_ether is optional.
These instructions are for enabling TCP-IP networking over the InfiniPath link. To
enable IPoIB networking, see section 2.4.7.1.
You must create a network device configuration file for the layered Ethernet device
on the InfiniPath adapter. This configuration file will resemble the configuration files
for the other Ethernet devices on the nodes. Typically on servers there are two
Ethernet devices present, numbered as 0 (eth0) and 1 (eth1). This example
assumes we create a third device, eth2.
NOTE:When multiple InfiniPath chips are present, the configuration for eth3,
eth4, and so on follow the same format as for adding eth2 in the examples
below.
Two slightly different procedures are given below for the ipath configuration; one
for Fedora and one for SUSE, SLES9, or SLES 10.
Many of the entries that are used in the configuration directions below are explained
in the file sysconfig.txt. To familiarize yourself with these, please see:
/usr/share/doc/initscripts-*/sysconfig.txt
2.4.6.1
ipath_ether Configuration on Fedora and RHEL4
These configuration steps will cause the ipath_ether network interfaces to be
automatically configured when you next reboot the system. These instructions are
for the Fedora Core 3, Fedora Core 4 and Red Hat Enterprise Linux 4 distributions.
Typically on servers there are two Ethernet devices present, numbered as 0 (eth0)
and 1 (eth1). This example assumes we create a third device, eth2.
NOTE: When multiple InfiniPath chips are present, the configuration for eth3,
eth4, and so on follow the same format as for adding eth2 in the examples below.
1. Check for the number of Ethernet drivers you currently have by either one of
the two following commands :
$ ifconfig -a
$ ls /sys/class/net
As mentioned above we assume that two Ethernet devices (numbered 0 and
1) are already present.
2. Edit the file /etc/modprobe.conf (as root) by adding the following line:
alias eth2 ipath_ether
3. Create or edit the following file (as root).
/etc/sysconfig/network-scripts/ifcfg-eth2
If you are using DHCP (dynamic host configuration protocol), add the following
lines to ifcfg-eth2:
# QLogic Interconnect Ethernet
DEVICE=eth2
ONBOOT=yes
BOOTPROTO=dhcp
If you are using static IP addresses, use the following lines instead, substituting
your own IP address for the sample one given here. The normal matching
netmask is shown.
# QLogic Interconnect Ethernet
DEVICE=eth2
BOOTPROTO=static
ONBOOT=YES
IPADDR=192.168.5.101 #Substitute your IP address here
NETMASK="255.255.255.0" #Normal matching netmask
TYPE=Ethernet
This will cause the ipath_ether Ethernet driver to be loaded and configured during
system startup. To check your configuration, and make the ipath_ether Ethernet
driver available immediately, use the command (as root):
# /sbin/ifup eth2
4. Check whether the Ethernet driver has been loaded with:
$ lsmod | grep ipath_ether
5. Verify that the driver is up with:
$ ifconfig -a
2.4.6.2
ipath_ether Configuration on SUSE 9.3, SLES 9, and SLES 10
These configuration steps will cause the ipath_ether network interfaces to be
automatically configured when you next reboot the system. These instructions are
for the SUSE 9.3, SLES 9 and SLES 10 distributions.
Typically on servers there are two Ethernet devices present, numbered as 0 (eth0)
and 1 (eth1). This example assumes we create a third device, eth2.
NOTE: When multiple InfiniPath chips are present, the configuration for eth3,
eth4, and so on follow the same format as for adding eth2 in the examples below.
Similarly, in step 2, add one to the unit number, so replace .../00/guid with
/01/guid for the second InfiniPath interface, and so on.
Step 3 is applicable only to SLES 10; it is required because SLES 10 uses a newer
version of the udev subsystem.
NOTE: The MAC address (media access control address) is a unique identifier
attached to most forms of networking equipment. Step 2 below determines the
MAC address to use, and will be referred to as $MAC in the subsequent steps.
$MAC must be replaced in each case with the string printed in step 2.
The following steps must all be executed as the root user.
# sed 's/^\(..:..:..\):..:../\1/' \
/sys/bus/pci/drivers/ib_ipath/00/guid
NOTE: Care should be taken when cutting and pasting commands such as the
above from PDF documents, as quotes are special characters and may not be
translated correctly.
The output should appear similar to this (6 hex digit pairs, separated by colons):
Note that removing the middle two 00:00 octets from the GUID in the above
output will form the MAC address.
If either step 1 or step 2 fails in some fashion, the problem must be found and
corrected before continuing. Verify that the RPMs are installed correctly, and
that infinipath has correctly been started. If problems continue, run
ipathbug-helper and report the results to your reseller or InfiniPath support
organization.
3. Skip to Step 4 if you are using SUSE 9.3 or SLES 9. This step is only done on
SLES 10 systems. Edit the file:
/etc/udev/rules.d/30-net_persistent_names.rules
If this file does not exist, skip to Step 4.
Check each of the lines starting with SUBSYSTEM=, to find the highest numbered
interface. (For standard motherboards, the highest numbered interface will
typically be 1.)
Add a new line at the end of the file, incrementing the interface number by one.
In this example, it becomes eth2. The new line will look like this:
Make sure that you substitute your own IP address for the sample IPADDR
shown here. The BROADCAST, NETMASK, and NETWORK lines need to
match for your network.
6. To verify that the configuration files are correct, you will normally now be able
to run the commands:
# ifup eth2
# ifconfig eth2
Note that it may be necessary to reboot the system before the configuration
changes will work.
2.4.7
OpenFabrics Configuration and Startup
In the prior InfiniPath 1.3 release the InfiniPath (ipath_core) and OpenFabrics
(ib_ipath) modules were separate. In this release there is now one module,
ib_ipath, which provides both low level InfiniPath support and management
functions for OpenFabrics protocols. The startup script for ib_ipath is installed
automatically as part of the software installation, and normally does not need to be
changed.
However, the IPoIB network interface and OpenSM components of OpenFabrics
can be configured to be on or off. IPoIB is off by default; OpenSM is on by default.
IPoIB and OpenSM configuration is explained in greater detail in the following
sections.
NOTE: The following instructions work for FC4, SUSE 9.3, SLES 9, and SLES 10.
2.4.7.1
Configuring the IPoIB Network Interface
Instructions are given here to manually configure your OpenFabrics IPoIB network
interface. This example assumes that you are using sh or bash as your shell, and
that all required InfiniPath and OpenFabrics RPMs are installed, and your startup
scripts have been run, either manually or at system boot.
For this example, we assume that your IPoIB network is 10.1.17.0 (one of the
networks reserved for private use, and thus not routable on the internet), with a /8
host portion, and therefore requires that the netmask be specified.
This example assumes that no hosts files exist, and that the host being configured
has the IP address 10.1.17.3, and that DHCP is not being used.
NOTE:We supply instructions only for this static IP address case. Configuration
methods for using DHCP will be supplied in a later release.
Type the following commands (as root):
# ifconfig ib0 10.1.17.3 netmask 0xffffff00
IB6054601-00 D2-11
2 – InfiniPath Cluster Administration
Configuration and Startup
To verify the configuration, type:
# ifconfig ib0
The output from this command should be similar to this:
ib0 Link encap:InfiniBand HWaddr
00:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.1.17.3 Bcast:10.1.17.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Next, type:
# ping -c 2 -b 10.1.17.255
The output of the ping command should be similar to that below, with a line for
each host already configured and connected:
WARNING: pinging broadcast address
PING 10.1.17.255 (10.1.17.255) 517(84) bytes of data.
174 bytes from 10.1.17.3: icmp_seq=0 ttl=174 time=0.022 ms
64 bytes from 10.1.17.1: icmp_seq=0 ttl=64 time=0.070 ms (DUP!)
64 bytes from 10.1.17.7: icmp_seq=0 ttl=64 time=0.073 ms (DUP!)
The IPoIB network interface is now configured.
NOTE: The configuration must be repeated each time the system is rebooted.
2.4.8
OpenSM
OpenSM is an optional component of the OpenFabrics project that provides a
subnet manager for InfiniBand networks. This package can be installed on all
machines, but only needs to be enabled on the machine in your cluster that is going
to act as a subnet manager. You do not need to use OpenSM if any of your InfiniBand
switches provide a subnet manager.
After installing the opensm package, OpenSM is configured to be on at the next
machine reboot. It only needs to be enabled on the node which acts as the subnet
manager, so use the chkconfig command (as root) to disable it on the other nodes:
# chkconfig opensmd off
The command to enable it on reboot is:
# chkconfig opensmd on
You can start opensmd without rebooting your machine as follows:
# /etc/init.d/opensmd start
and you can stop it again like this:
# /etc/init.d/opensmd stop
If you wish to pass any arguments to the OpenSM program, modify the file:
/etc/init.d/opensmd
and add the arguments to the "OPTIONS" variable. Here is an example:
# Use the UPDN algorithm instead of the Min Hop algorithm.
OPTIONS="-u"
2.5
SRP
SRP stands for SCSI RDMA Protocol. It was originally intended to allow the SCSI
protocol to run over InfiniBand for SAN usage. SRP interfaces directly to the Linux
file system through the SRP Upper Layer Protocol. SRP storage can be treated as
just another device.
In this release SRP is provided as a technology preview. Add ib_srp to the module
list in /etc/sysconfig/infinipath to have it automatically loaded.
NOTE: SRP does not yet work with IBM Power Systems. This will be fixed in a
future release.
2.6
Further Information on Configuring and Loading Drivers
See the modprobe(8), modprobe.conf(5), and lsmod(8) man pages for more
information. Also see the file
/usr/share/doc/initscripts-*/sysconfig.txt
for more general information on configuration files. Section 2.7, below, may also be
useful.
2.7
Starting and Stopping the InfiniPath Software
The InfiniPath driver software runs as a system service, normally started at system
startup. Normally you will not need to restart the software, but you may wish to do
so after installing a new InfiniPath release, or after changing driver options, or if
doing manual testing.
The following commands can be used to check or configure state. These methods
will not reboot the system.
To check the configuration state, use the command:
$ chkconfig --list infinipath
To enable the driver, use the command (as root):
# chkconfig infinipath on 2345
IB6054601-00 D2-13
2 – InfiniPath Cluster Administration
Starting and Stopping the InfiniPath Software
To disable the driver on the next system boot, use the command (as root):
# chkconfig infinipath off
NOTE:This does not stop and unload the driver, if it is already loaded.
You can start, stop, or restart (as root) the InfiniPath support with:
# /etc/init.d/infinipath [start | stop | restart]
This method will not reboot the system. The following set of commands shows how
this script can be used. Please take note of the following:
■ You should omit the commands to start/stop opensmd if you are not running it
on that node.
■ You should omit the ifdown and ifup step if you are not using ipath_ether
on that node.
The sequence of commands to restart infinipath is given below. Note that this
next example assumes that ipath_ether is configured as eth2.
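A minimal sketch of such a sequence, using the commands shown earlier in this
section and assuming opensmd is running on this node and ipath_ether is
configured as eth2 (omit those lines as noted above if they do not apply), is:
# /etc/init.d/opensmd stop
# ifdown eth2
# /etc/init.d/infinipath restart
# ifup eth2
# /etc/init.d/opensmd start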
Another useful program is ibstatus. Sample usage and output is as follows:
$ ibstatus
Infiniband device ’ipath0’ port 1 status:
default gid: fe80:0000:0000:0000:0011:7500:0005:602f
base lid: 0x35
sm lid: 0x2
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)
For more information on these programs, see appendix C.9.9 and appendix C.9.5.
2.9
Configuring ssh and sshd Using shosts.equiv
Running MPI programs on an InfiniPath cluster depends, by default, on secure shell
ssh to launch node programs on the nodes. Jobs must be able to start up without
the need for interactive password entry on every node. Here we see how the cluster
administrator can lift this burden from the user through the use of the shosts.equiv
mechanism. This method is recommended, provided that your cluster is behind a
firewall and accessible only to trusted users.
Later, in section 3.5.1, we show how an individual user can accomplish this end
through the use of ssh-agent.
This next example assumes the following:
■ Both the cluster nodes and the front end system are running the openssh
package as distributed in current Linux systems.
■ All cluster users have accounts with the same account name on the front end
and on each node, either by using NIS or some other means of distributing the
password file.
■ The front end is called ip-fe.
■ Root or superuser access is required on ip-fe and on each node in order to
configure ssh.
■ ssh, including the host's key, has already been configured on the system ip-fe.
See the sshd and ssh-keygen man pages for more information.
The example proceeds as follows:
1. On the system
ip-fe, the front end node, change /etc/ssh/ssh_config to
allow host-based authentication. Specifically, this file must contain the following
four lines, set to ‘yes’. If they are already present but commented out with an
initial #, remove the #.
5. After creating or editing these three files in steps 2, 3 and 4, sshd must be
restarted on each system. If you are already logged in via ssh (or any other
user is logged in via ssh), their sessions or programs will be terminated, so do
this only on idle nodes. Tell sshd to use the new configuration files by typing
(as root):
# killall -HUP sshd
NOTE: This will terminate all ssh sessions into that system. Run from the
console, or have a way to log into the console in case of any problem.
At this point, any user should be able to login to the ip-fe front end system, and
then use ssh to login to any InfiniPath node without being prompted for a password
or pass phrase.
2.9.1
Process Limitation with ssh
MPI jobs that use more than 8 processes per node may encounter an SSH throttling
mechanism that limits the number of concurrent per-node connections to 10. If you
need to use more processes, you or your system administrator should increase the
value of 'MaxStartups' in your sshd configuration. See appendix C.8.8 for an
example of an error message associated with this limitation.
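For example, the administrator might add a line such as the following to
/etc/ssh/sshd_config on the compute nodes and then signal sshd to reread its
configuration (the value 64 is only illustrative; see the sshd_config man page for
the exact semantics of MaxStartups):
MaxStartups 64
# killall -HUP sshd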
2.10
Performance and Management Tips
The following section gives some suggestions for improving performance and
simplifying management of the cluster.
2.10.1
Remove Unneeded Services
An important step that the cluster administrator can take to enhance application
performance is to minimize the set of system services running on the compute
nodes. Since these are presumed to be specialized computing appliances, they
do not need many of the service daemons normally running on a general Linux
computer.
Following are several groups constituting a minimal necessary set of services.
These are all services controlled by chkconfig. To see the list of services that are
enabled, use the command:
$ /sbin/chkconfig --list | grep -w on
Basic network services:
network
ntpd
syslog
xinetd
sshd
For system housekeeping:
anacron
atd
crond
If you are using NFS or yp passwords:
rpcidmapd
ypbind
portmap
nfs
nfslock
autofs
To watch for disk problems:
smartd
readahead
The service comprising the InfiniPath driver and SMA:
infinipath
Other services may be required by your batch queuing system or user community.
2.10.2
Disable Powersaving Features
If you are running benchmarks or large numbers of short jobs, it is beneficial to
disable the powersaving features of the Opteron. The reason is that these features
may be slow to respond to changes in system load.
For RHEL4, FC3, and FC4, run this command as root:
# /sbin/chkconfig --level 12345 cpuspeed off
For SUSE 9.3 and 10.0 run this command as root:
# /sbin/chkconfig --level 12345 powersaved off
After running either of these commands, the system will need to be rebooted for
these changes to take effect.
2.10.3
Balanced Processor Power
Higher processor speed is good. However, adding more processors is good only if
processor speed is balanced. Adding processors with different speeds can result
in load imbalance.
2.10.4
SDP Module Parameters for Best Performance
To get the best performance from SDP, especially for bandwidth tests, edit one of
these files:
/etc/modprobe.conf (on Fedora and RHEL)
/etc/modprobe.conf.local (on SUSE and SLES)
This should be a single line in the file. This sets both the debug level and the zero
copy threshold.
2.10.5
CPU Affinity
InfiniPath will attempt to run each node program with CPU affinity set to a separate
logical processor, up to the number of available logical processors. If CPU affinity
is already set (with sched_setaffinity(), or with the taskset utility), then
InfiniPath will not change the setting.
The taskset utility can be used with mpirun to specify the mapping of MPI
processes to logical processors. This is useful, for example, to make best use of
available memory bandwidth or cache locality when running on dual-core SMP
cluster nodes.
In the following example we use the NAS Parallel Benchmark's MG (multi-grid)
benchmark and the -c option to taskset.
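The command lines below are a sketch of this usage; the executable name mg.B.4
and the mpihosts file are illustrative assumptions:
$ mpirun -np 4 -m mpihosts taskset -c 0,2 ./mg.B.4
$ mpirun -np 4 -m mpihosts taskset -c 1,3 ./mg.B.4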
The first command forces the programs to run on CPUs (or cores) 0 and 2. The
second forces the programs to run on CPUs 1 and 3. Please see the man page for
taskset for more information on usage.
2.10.6
Hyper-Threading
If using Intel processors that support Hyper-Threading, it is recommended that
HyperThreading is turned off in the BIOS. This will provide more consistent
performance. You can check and adjust this setting using the BIOS Setup Utility.
For specific instructions on how to do this, follow the hardware documentation that
came with your system.
2.10.7
Homogeneous Nodes
To minimize management problems, the compute nodes of the cluster should have
very similar hardware configurations and identical software installations. A
mismatch between the InfiniPath software versions may also cause problems. Old
and new libraries should not be run within the same job. It may also be useful to
distinguish between the InfiniPath-specific drivers and those that are associated
with kernel.org, OpenFabrics, or are distribution-built. The most useful tools are:
NOTE:Run these tools to gather information before reporting problems and
requesting support.
ipathbug-helper
The InfiniPath software includes a shell script ipathbug-helper, which can gather
status and history information for use in analyzing InfiniPath problems. This tool is
also useful for verifying homogeneity. It is best to run it with root privilege, since
some of the queries require it. There is also a verbose option, which greatly
increases the amount of gathered information. Simply run it on several nodes and
examine the output for differences.
Note that ipath_control will report whether the installed adapter is the QHT7040,
QHT7140, or the QLE7140. It will also report whether the driver is InfiniPath-specific
or not with the output associated with $Id.
rpm
To check the contents of an RPM, commands of these types may be useful:
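For instance, using the infinipath-kernel package mentioned in section 2.2 (a
sketch; substitute the RPM you are interested in):
$ rpm -qi infinipath-kernel
$ rpm -ql infinipath-kernel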
mpirun can give information on whether the program is being run against a QLogic
or non-QLogic driver. Sample commands and results are given below.
QLogic-built:
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1
active chips)
asus-01:0.ipath_userinit: Driver is QLogic-built
Non-QLogic built:
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1
active chips)
asus-01:0.ipath_userinit: Driver is not QLogic-built
ident
ident strings are available in ib_ipath.ko. Running ident (as root) will yield
information similar to the following. For QLogic RPMs, it will look like:
NOTE:strings is part of binutils (a development RPM), and may not be
available on all machines.
ipath_checkout
ipath_checkout is a bash script used to verify that the installation is correct, and
that all the nodes are functioning. It is run on a front end node and requires a hosts
file:
$ ipath_checkout [options] hostsfile
More complete information on ipath_checkout is given below in section 2.11 and
in section C.9.8.
2.11
Customer Acceptance Utility
ipath_checkout is a bash script used to verify that the installation is correct and
that all the nodes of the network are functioning and mutually connected by the
InfiniPath fabric. It is to be run on a front end node, and requires specification of a
hosts file:
$ ipath_checkout [options] hostsfile
where hostsfile designates a file listing the hostnames of the nodes of the cluster,
one hostname per line. The format of hostsfile is as follows:
hostname1
hostname2
...
ipath_checkout performs the following seven tests on the cluster:
1. ping all nodes to verify all are reachable from the frontend.
2. ssh to each node to verify correct configuration of ssh.
3. Gather and analyze system configuration from nodes.
4. Gather and analyze RPMs installed on nodes.
5. Verify InfiniPath hardware and software status and configuration.
6. Verify ability to mpirun jobs on nodes.
7. Run bandwidth and latency test on every pair of nodes and analyze results.
The possible options to ipath_checkout are:
-h, --help
Displays help messages giving defined usage.
-v, --verbose
-vv, --vverbose
-vvv, --vvverbose
These specify three successively higher levels of detail in reporting results of tests.
So, there are four levels of detail in all, including the case where none of these
options are given.
-c, --continue
When not specified, the test terminates when any test fails. When specified, the
tests continue after a failure, with failing nodes excluded from subsequent tests.
--workdir=DIR
Use DIR to hold intermediate files created while running tests. DIR must not already
exist.
-k, --keep
Keep intermediate files that were created while performing tests and compiling
reports. Results will be saved in a directory created by mktemp and named
infinipath_XXXXXX or in the directory name given to --workdir.
--skip=LIST
Skip the tests in LIST (e.g. --skip=2,4,5,7 will skip tests 2, 4, 5, and 7)
-d, --debug
Turn on -x and -v flags in bash.
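A typical invocation that continues past failing nodes and reports extra detail might
look like this (a sketch; the hosts file name and work directory are examples):
$ ipath_checkout -c -v --workdir=/tmp/ipath_checkout.1 mpihosts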
In most cases of failure, the script suggests recommended actions. Please see the
Section 3
Using InfiniPath MPI
This chapter provides information on using InfiniPath MPI. Examples are provided
for compiling and running MPI programs.
3.1
InfiniPath MPI
QLogic’s implementation of the MPI standard is derived from the MPICH reference
implementation Version 1.2.6. The InfiniPath MPI libraries have been highly tuned
for the InfiniPath Interconnect, and will not run over other interconnects.
InfiniPath MPI is an implementation of the original MPI 1.2 standard. The MPI-2
standard provides several enhancements of the original standard. Of the MPI-2
features, InfiniPath MPI includes only the MPI-IO features implemented in ROMIO
version 1.2.6 and the generalized MPI_Alltoallw communication exchange.
In this Version 2.0 release, the InfiniPath MPI implementation supports hybrid
MPI/OpenMP, and other multi-threaded programs, as long as only one thread uses
MPI. For more information, see section 3.10.
3.2
Other MPI Implementations
As of this release, other MPI implementations can now be run over InfiniPath. The
currently supported implementations are HP-MPI, OpenMPI and Scali. For more
information see section 3.6.
3.3
Getting Started with MPI
In this section you will learn how to compile and run some simple example programs
that are included in the InfiniPath software product. Compiling and running these
examples lets you verify that InfiniPath MPI and its components have been properly
installed on your cluster. See appendix C.8 if you have problems compiling or
running these examples.
These examples assume that:
■ Your cluster administrator has properly installed InfiniPath MPI and the
PathScale compilers.
■ Your cluster’s policy allows you to use the mpirun script directly, without having
to submit the job to a batch queuing system.
■ You or your administrator has properly set up your ssh keys and associated files
on your cluster. See section 3.5.1 and section 2.9 for details on ssh
administration.
To begin, copy the examples to your working directory:
$ cp /usr/share/mpich/examples/basic/* .
Next, create an MPI hosts file in the same working directory. It contains the host
names of the nodes in your cluster on which you want to run the examples, with
one host name per line. Name this file mpihosts. The contents can be in the
following format:
hostname1
hostname2
...
There is more information on the mpihosts file in section 3.5.6.
3.3.1
An Example C Program
InfiniPath MPI uses some shell scripts to find the appropriate include files and
libraries for each supported language. Use the script mpicc to compile an MPI
program in C and the script mpirun to execute it.
The supplied example program cpi.c computes an approximation to pi. First,
compile it to an executable named cpi.
$ mpicc -o cpi cpi.c
mpicc, by default, runs the PathScale pathcc or gcc compiler, and is used for
both compiling and linking, exactly as you'd use the pathcc command.
NOTE: On ppc64 systems, gcc is the default compiler. For information on using
other compilers, see section 3.5.3.
Then, run it with several different specifications for the number of processes:
$ mpirun -np 2 -m mpihosts ./cpi
Process 0 on hostname1
Process 1 on hostname2
pi is approximately 3.1416009869231241,
Error is 0.0000083333333309
wall clock time = 0.000149
Here ./cpi designates the executable of the example program in the working
directory. The -np parameter to mpirun defines the number of processes to be
used in the parallel computation. Now try it with four processes:
$ mpirun -np 4 -m mpihosts ./cpi
Process 3 on hostname1
Process 0 on hostname2
Process 2 on hostname2
Process 1 on hostname1
pi is approximately 3.1416009869231249,
Error is 0.0000083333333318
wall clock time = 0.000603
If you run the program several times with the same value of the -np parameter, you
may get the output lines in different orders. This is because they are issued by
independent asynchronous processes, so their order is non-deterministic.
The number of processes can be greater than the number of nodes. In this
four-process example, the mpihosts file listed only two hosts, hostname1 and
hostname2. Generally, mpirun will try to distribute the specified number of
processes evenly among the nodes listed in the mpihosts file, but if the number of
processes exceeds the number of nodes listed in the mpihosts file, then some
nodes will be assigned more than one instance of the program.
Up to a limit, the number of processes can even exceed the total number of
processors on the specified set of nodes, although it is usually detrimental to
performance to have more than one node program per processor. This limit is eight
processes per node with the QHT7140, and four processes per node with the
QLE7140. See section 3.5.9 for further discussion.
Details on alternate means of specifying the mpihosts file are given in
section 3.5.6. Further information on the mpirun options is in section 3.5.5,
section 3.5.9 and section 3.5.10.
3.3.2
Examples Using Other Languages
This section gives more examples, one for Fortran77, one for Fortran90, and one
for C++. Fortran95 usage will be similar to that for Fortran90.
fpi3.f is a Fortran77 program that computes pi in a way similar to cpi.c. Compile
and link it with:
$ mpif77 -o fpi3 fpi3.f
and run it with:
$ mpirun -np 2 -m mpihosts ./fpi3
pi3f90.f90 in the same directory is a Fortran90 program that does essentially the
same computation. Compile and link it with:
$ mpif90 -o pi3f90 pi3f90.f90
and run it with:
$ mpirun -np 2 -m mpihosts ./pi3f90
The C++ program hello++.cc is a parallel processing version of the traditional
“Hello, World” program. Notice that this version makes use of the external C
bindings of the MPI functions if the C++ bindings are not present.
Compile it:
$ mpicxx -o hello hello++.cc
and run it:
$ mpirun -np 10 -m mpihosts ./hello
Hello World! I am 9 of 10
Hello World! I am 2 of 10
Hello World! I am 4 of 10
Hello World! I am 1 of 10
Hello World! I am 7 of 10
Hello World! I am 6 of 10
Hello World! I am 3 of 10
Hello World! I am 0 of 10
Hello World! I am 5 of 10
Hello World! I am 8 of 10
Each of the scripts invokes the PathScale compiler for the respective language and
the linker. See section 3.5.3 for an example of how to use another compiler. The
use of mpirun is the same for programs in all languages.
3.4
Configuring MPI Programs for InfiniPath MPI
When configuring an MPI program (generating header files and/or Makefiles) for
InfiniPath MPI, you will usually need to specify the wrapper scripts mpicc, mpif90,
etc. rather than the underlying compilers directly.
Typically this is done with commands similar to these (this assumes you are using
sh or bash as your shell):
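A sketch of what this might look like for a configure-based package (the exports
shown are illustrative; use the variables your build actually consults):
$ export CC=mpicc
$ export CXX=mpicxx
$ export F77=mpif77
$ export F90=mpif90
$ ./configure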
In some cases, the configuration process may specify the linker. It is recommended
that the linker be specified as mpicc, mpif90, etc. in these cases. That will
automatically include the correct flags and libraries, rather than trying to configure
to pass the flags and libraries explicitly. For example:
LD=mpicc
LD=mpif90
These scripts pass appropriate options to the various compiler passes to include
header files, required libraries, etc. While the same effect can be achieved by
passing the arguments explicitly as flags, the required arguments may vary from
release to release, so it's good practice to use the provided scripts.
3.5
InfiniPath MPI Details
This section gives more details on the use of InfiniPath MPI. We assume the reader
has some familiarity with standard MPI. See the references in appendix D.1. This
implementation does include the man pages from the MPICH implementation for
the numerous MPI functions.
3.5.1
Configuring for ssh Using ssh-agent
The command mpirun can be run on the front end or on any other node. In InfiniPath
MPI, this uses the secure shell command ssh to start instances of the given MPI
program on the remote compute nodes. To use ssh, the user must have generated
RSA or DSA keys, public and private. The public keys must be distributed to all the
compute nodes so that connections to the remote machines can be established
without supplying a password. Each user can accomplish this through use of the
ssh-agent. ssh-agent is a daemon that caches decrypted private keys. You use
ssh-add to add your private keys to ssh-agent's cache. When ssh establishes a
new connection, it communicates with ssh-agent in order to acquire these keys,
rather than prompting you for a passphrase.
The process is shown in the following steps:
1. Create a key pair. Use the default file name, and be sure to enter a passphrase.
$ ssh-keygen -t rsa
2. Enter a passphrase for your key pair when prompted. Note that the key agent
does not survive X11 logout or system reboot:
$ ssh-add
3. This tells ssh that your key pair should let you in:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Edit ~/.ssh/config so that it reads like this:
Host *
ForwardAgent yes
ForwardX11 yes
CheckHostIP no
StrictHostKeyChecking no
This forwards the key agent requests back to your desktop. When you log into
a front end node, you can ssh to compute nodes without passwords.
4. Start ssh-agent by adding the following line to your ~/.bash_profile (or
equivalent in another shell):
eval `ssh-agent`
Use back-quotes rather than normal single-quotes. Programs started in your
login shell will then be able to locate ssh-agent and query it for keys.
5. Finally, test by logging into the front end node, and from the front end node to
a compute node as follows:
$ ssh frontend_node_name
$ ssh compute_node_name
For more information, see the man pages for ssh(1), ssh-keygen(1),
ssh-add(1), and ssh-agent(1).
Alternatively, the cluster administrator can accomplish this for all users through the
shosts.equiv mechanism, as described in section 2.9.
3.5.2
Compiling and Linking
These scripts invoke the compiler and linker for programs in each of the respective
languages, and take care of referring to the correct include files and libraries in each
case.
mpicc
mpicxx
mpif77
mpif90
mpif95
On x86_64, by default these call the PathScale compiler and linker. To use other
compilers, see section 3.5.3.
NOTE:The 2.x PathScale compilers aren’t currently supported on systems that
use the GNU 4.x compiler and environment. This includes FC4, FC5 and
SLES10. For suggestions on how to work around this issue, see
section 3.5.4. The 3.0 compiler release will support the GNU 4.x compiler
environment.
These scripts all provide the following command line options:
-help
Provides help.
-show
Lists each of the compiling and linking commands that would be called without
actually calling them.
-echo
Gets verbose output of all the commands in the script.
-compile_info
Shows how to compile a program.
-link_info
Shows how to link a program.
Further, each of these scripts allows a command line option for specifying the use
of a different compiler/linker as an alternative to the PathScale Compiler Suite.
These are described in the next section.
Most other command line options are passed on to the invoked compiler and linker.
The PathScale compiler and the usual alternatives all admit numerous command
line options. See the PathScale compiler documentation and the man pages for
pathcc and pathf90 for complete information on its options. See the corresponding
documentation for any other compiler/linker you may call for its options.
3.5.3
To Use Another Compiler
In addition to the PathScale Compiler Suite, InfiniPath MPI supports a number of
other compilers. These include PGI 5.2 and 6.0, Intel 9.0, the GNU gcc 3.3.x, 3.4.x,
and 4.0.x compiler suites and gfortran. The IBM XL family of compilers is also
supported on ppc64 (Power) systems.
NOTE: The 2.x PathScale compilers aren’t currently supported on systems that
have the GNU 4.x compilers and compiler environment (header files and
libraries). This includes Fedora Core 4, Fedora Core 5, SUSE 10, and
SLES 10. To run on those distributions, you can compile your application
on a system that does support the PathScale compiler. Then you can run
the executable on one of the systems that uses the GNU 4.x compiler
and environment. For more information on setting up for
cross-compilation, see section 3.5.4. The GNU 4.x compiler environment
will be supported by the PathScale Compiler Suite 3.0 release.
NOTE: In addition, gfortran is not currently supported on Fedora Core 3, as it
has dependencies on the GNU 4.x suite.
The following example shows how to use gcc for compiling and linking MPI
programs in C:
$ mpicc -cc=gcc .......
To use gcc for compiling and linking C++ programs use:
$ mpicxx -CC=g++ .......
To use gcc for compiling and linking Fortran77 programs use:
$ mpif77 -fc=g77 .......
In each case, ..... stands for the remaining options to the script, the options to
the compiler in question, and the names of the files it is to operate upon.
The next example follows the same pattern, except that it uses the PGI compiler
(pgcc) for compiling and linking in C:
$ mpicc -cc=pgcc .....
To use PGI for Fortran90/Fortran95 programs, use:
$ mpif90 -f90=pgf90 .....
$ mpif95 -f95=pgf95 .....
This example uses the Intel C compiler (icc):
$ mpicc -cc=icc .....
To use the Intel compiler for Fortran90/Fortran95 programs, use:
$ mpif90 -f90=ifort .....
$ mpif95 -f95=ifort .....
Usage for other compilers will be similar to the examples above, substituting the
options following -cc, -CC, -f77, -f90, or -f95. Consult the documentation for
specific compilers for more details.
Also, use mpif77, mpif90, or mpif95 for linking, otherwise you may have problems
with .true. having the wrong value. If you are not using the provided scripts for
linking, you should link a sample program using the -show option as a test, to see
what libraries to add to your link line. Some examples follow.
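For instance, to see the compile and link commands and libraries that would be
used, without actually building anything (the file names here are hypothetical):
$ mpicc -show -o mpi_hello mpi_hello.c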
3.5.3.1
Compiler and Linker Variables
If you use environment variables (e.g., $MPICH_CC) to select which compiler
mpicc, et al. should use, the scripts will also set the matching linker variable (e.g.
$MPICH_CLINKER), if not already set. If both the environment variable and the
command line option are used (e.g., -cc=gcc), the command line option is used.
If both the compiler and linker variables are set, and they do not match for the
compiler you are using, it is likely that the MPI program will fail to link, or if it links,
it may not execute correctly. For a sample error message, please see section C.8.3
in the Troubleshooting chapter.
3.5.4
Cross-compilation Issues
The 2.x PathScale compilers aren’t currently supported on systems that use the
GNU 4.x compilers and compiler environment (header files and libraries). This
includes Fedora Core 4, Fedora Core 5 and SLES 10. The GNU 4.x environment
will be supported in the PathScale Compiler Suite 3.0 release.
The current workaround for this is to compile on a supported and compatible
distribution, then run the executable on one of the systems that uses the GNU 4.x
compilers and environment.
■ To run on FC4 or FC5, install FC3 or RHEL4/CentOS on your build machine.
Compile your application on this machine.
■ To run on SLES 10, install SUSE 9.3 on your build machine. Compile your
application on this machine.
■ Alternatively, gcc can be used as the default compiler. Set mpicc -cc=gcc as
described in section 3.5.3 "To Use Another Compiler".
Next, on the machines in your cluster on which the job will run, install compatibility
libraries. These libraries include C++ and Fortran compatibility shared libraries and
libgcc.
For an FC4 or FC5 system, you would need:
■ pathscale-compilers-libs (for FC3)
■ compat-gcc-32
■ compat-gcc-32-g77
■ compat-libstdc++-33
On a SLES 10 system, you would need:
■ compat-libstdc++ (for FC3)
■ compat-libstdc++5 (for SLES 10)
Depending upon the application, you may need to use the -Wl,-Bstatic option to
use the static versions of some libraries.
3.5.5
Running MPI Programs
The script mpirun lets you start your parallel MPI program on a set of nodes in a
cluster. It starts, monitors, and terminates the node programs. mpirun uses ssh
(secure shell) to log in to individual cluster machines and prints any messages that
the node program prints on stdout or stderr on the terminal from which mpirun
is invoked. It is therefore usually desirable to either configure all cluster nodes to
use shosts.equiv (see section 2.9), or for users to use ssh-agent (see
section 3.5.1) in order to allow MPI programs to be run without requiring that a
password be entered each time.
In the mpirun command line, program-name will generally be the pathname to the
executable MPI program. If the MPI program resides in the current directory and
the current directory is not in your search path, then program-name must begin
with ‘./’, such as:
./program-name
Unless you want to run only one instance of the program, you need to use the -np
option, as in:
$ mpirun -np n [other options] program-name
This spawns n instances of program-name. We usually call these instances node
programs.
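As a concrete illustration (the program name is hypothetical), the following starts
four node programs using the mpihosts file in the current directory:
$ mpirun -np 4 -m mpihosts ./my_mpi_program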
Each node program is started as a process on one node. While it is certainly possible
for a node program to fork child processes, the children must not themselves call
MPI functions.
mpirun monitors the parallel MPI job, terminating when all the node programs in
that job exit normally, or if any of them terminates abnormally. Killing the mpirun
program kills all the processes in the job. Use Ctrl-C to do this.
3.5.6
The mpihosts File
As noted in section 3.3 you have created an mpihosts file (also called a machines
file, node file, or hosts file) in your current working directory. This file names the
nodes on which the node programs may run. The mpihosts file contains lines of
the form:
hostname[:p]
The optional part :p specifies the number of node programs that can be spawned
on that node. When not specified, the default value is 1. The two supported formats
for the mpihosts file are a plain list of hostnames (one per line), or a list of
hostname:process_count entries.
In the first format, if the -np count is greater than the number of lines in the machine
file, the hostnames will be repeated (in order) as many times as necessary for the
requested number of node programs.
In the second format process_count can be different for each host, and is normally
the number of available processors on the node. Up to process_count node
programs will be started on that host before using the next entry in the mpihosts
file. If the full mpihosts file is processed, and there are still more processes
requested, processing starts again at the start of the file.
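For example, a hypothetical mpihosts file for four dual-processor nodes, using
the second format, might contain:
node01:2
node02:2
node03:2
node04:2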
You have several alternative ways of specifying the mpihosts file.
1. First, as noted in section 3.3.1, you can use the command line option -m:
$ mpirun -np n -m mpihosts [other options] program-name
In this case, if the named file cannot be opened, the MPI job fails.
2. If the -m option is omitted, mpirun checks the environment variable MPIHOSTS
for the name of the MPI hosts file. If this variable is defined and the file it names
cannot be opened, then the MPI job fails.
3. In the absence of both the -m option and the MPIHOSTS environment variable,
mpirun uses the file ./mpihosts, if it exists.
4. If none of these three methods of specifying the hosts file are used, mpirun
looks for the file ~/.mpihosts.
If you are working in the context of a batch queuing system, it may provide you with
a job submission script that generates an appropriate mpihosts file.
3.5.7
Console I/O in MPI Programs
mpirun sends any output printed to stdout or stderr by any node program to the
terminal. This output is line-buffered, so the lines output from the various node
programs will be non-deterministically interleaved on the terminal. Using the -l
option to mpirun will label each line with the rank of the node program that produced
it.
Node programs do not normally use interactive input on stdin, and by default,
stdin is bound to /dev/null. However, for applications that require standard input
redirection, InfiniPath MPI supports two mechanisms to redirect stdin:
1. If mpirun is run from the same node as MPI rank 0, all input piped to the mpirun
command will be redirected to rank 0.
2. If mpirun is not run from the same node as MPI rank 0 or if the input must be
redirected to all or specific MPI processes, the -stdin option can be used to
redirect a file as standard input to all nodes or to a particular node as specified
by the -stdin-target option.
3.5.8
Environment for Node Programs
The environment variables existing on the front end node on which you run mpirun
are not propagated to the other nodes. You can set the paths, such as
LD_LIBRARY_PATH, and other environment variables for the node programs
through the use of the -rcfile option of mpirun:
$ mpirun -np n -m mpihosts -rcfile mpirunrc program
In the absence of this option, mpirun checks to see if a file called
$HOME/.mpirunrc exists in the user's home directory. In either case, the file is
sourced by the shell on each node at time of startup of the node program.
The .mpirunrc should not contain any interactive commands. It may contain
commands that output on stdout or stderr.
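A minimal sketch of such a file for bash or sh users, with hypothetical directory
and variable names, might look like this (remember that it must not contain
interactive commands):
export LD_LIBRARY_PATH=/opt/mylibs/lib64:$LD_LIBRARY_PATH
export MY_APP_DATADIR=/scratch/mydata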
When you do not specify an mpirunrc file, either through the option or the default
~/.mpirunrc, the environment on each node is whatever it would be for the user’s
login via ssh, unless you are using MPD. (See section 3.8.)
There is a global options file that can be used for mpirun arguments. The default
location of this file is:
/opt/infinipath/etc/mpirun.defaults
You can use an alternate file by setting the environment variable
$PSC_MPIRUN_DEFAULTS_PATH. See the mpirun man page for more
information.
3.5.8.1
Environment for Multiple Versions of InfiniPath or MPI
The variable INFINIPATH_ROOT sets a root prefix for all Infinipath-related paths.
It is used by mpirun to try to find the mpirun-ipath-ssh executable, and it is
also used to set up LD_LIBRARY_PATH for new programs. This allows multiple
versions of the InfiniPath software releases to be installed on some or all nodes, as
well as having InfiniPath MPI and other version(s) of MPI installed at the same time.
It may be set in the environment, in mpirun.defaults, or in an rcfile (such
as .mpirunrc, .bashrc or .cshrc) that will be invoked on remote nodes.
If you have used the --prefix argument with the rpm command to change the
root prefix for the InfiniPath installation, then set INFINIPATH_ROOT to the same
value.
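For example, if the RPMs had been installed with a hypothetical prefix of
/opt/ipath-2.0, the matching setting in an rcfile such as .mpirunrc (for bash or
sh users) would be:
export INFINIPATH_ROOT=/opt/ipath-2.0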
If INFINIPATH_ROOT is not set, the normal PATH is used unless mpirun is invoked
with a full pathname.
NOTE: mpirun-ssh was renamed mpirun-ipath-ssh so as to avoid name
collisions with other MPI implementations.
3.5.9
Multiprocessor Nodes
Another command line option, -ppn, instructs mpirun to assign a fixed number p
of node programs to each node, as it distributes the n instances among the nodes:
$ mpirun -np n -m mpihosts -ppn p program-name
This option overrides the :p specifications, if any, in the lines of the MPI hosts file.
As a general rule, mpirun tries to distribute the n node programs among the nodes
without exceeding on any node the maximum number of instances specified by
the :p option. The value of the :p option is specified by either the -ppn command
line option or in the mpihosts file.
NOTE: When the -np value is larger than the number of nodes in the mpihosts
file times the -ppn value, mpirun will cycle back through the hosts file,
assigning additional node programs per host.
Normally, the number of node programs should be no larger than the number of
processors on the node, at least not for compute-bound problems. In the current
implementation of the InfiniPath interconnect, no node can run more than eight node
programs.
For improved performance, InfiniPath MPI uses shared memory to pass messages
between node programs running on the same host.
3.5.10
mpirun Options
Here is a list summarizing the most commonly used options to mpirun. See the
man page for a more complete listing.
-np np
Number of processes to spawn.
-ppn processes-per-node
Create up to specified number of processes per node.
-machinefile filename, -m filename
Machines (mpihosts) file, the list of hosts to be used for this job.
Default: $MPIHOSTS, then ./mpihosts, then ~/.mpihosts
-M
Print a formatted list of MPI-level stats of interest for the MPI programmer.
-verbose
Print diagnostic messages from mpirun itself. Can be useful in troubleshooting
Default: Off
-version, -v
Print MPI version. Default: Off
-help, -h
Print mpirun help message. Default: Off
-rcfile node-shell-script
Startup script for setting environment on nodes.
Default: $HOME/.mpirunrc
-in-xterm
Run each process in an xterm window. Default: Off
-display X-server
X Display for xterm. Default: None
-debug
Run each process under debugger in an xterm window. Uses gdb by default.
Default: Off
Set -q 0 when using -debug.
-debug-no-pause
Like debug, except doesn't pause at beginning. Uses gdb by default.
Default: Off
-debugger gdb|pathdb|strace
Which debugger to use. Default: gdb
-psc-debug-level mask
Controls the verbosity of MPI and InfiniPath debug messages for node programs.
A synonym is -d mask.
Default: 1
-nonmpi
Run a non-MPI program. Required if the node program makes no MPI calls. Default:
Off
-quiescence-timeout seconds
Wait time in seconds for quiescence (absence of MPI communication) on the nodes.
Useful for detecting deadlocks. 0 disables quiescence detection.
Default: 900
-disable-mpi-progress-check
This option disables MPI communication progress check, without disabling the ping
reply check. Default: Off.
-l
Label each line of output on stdout and stderr with the rank of the MPI process
which produces the output.
-labelstyle string
Specify the label that is prefixed to error messages and statistics. Process rank is
the default prefix.
-stdin filename
Filename that should be fed as stdin to the node program. Default: /dev/null
-stdin-target 0..np-1 | -1
Process rank that should receive the file specified with the -stdin option. -1 means
all ranks. Default: -1
-wdir path-to-working_dir
Sets the working directory for the node program.
Default: current working directory
-print-stats
Causes each node program to print various MPI statistics to stderr on job
termination. Can be useful for troubleshooting. Default: off. For details, see
appendix C.8.13.
-statsfile file-prefix
Specifies alternate file to receive the output from the -print-stats option.
Default: stderr
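Several of these options are commonly combined. For example (the program name
is hypothetical), the following labels each line of output with its rank and prints MPI
statistics on termination:
$ mpirun -np 8 -m mpihosts -l -print-stats ./my_mpi_program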
3.6
Using Other MPI Implementations
Support for multiple MPI implementations has been added. You can use a different
version of MPI and achieve the high-bandwidth and low-latency performance that
is standard with InfiniPath MPI.
The currently supported implementations are HP-MPI, OpenMPI and Scali.
These MPI implementations will run on multiple interconnects, and have their own
mechanisms for selecting which one you will run on. Please see the documentation
provided with the version of MPI that you wish to use.
If you have downloaded and installed another MPI implementation, you will need
to set your PATH up to pick up the version of MPI you wish to use.
You will also need to set LD_LIBRARY_PATH, both in your local environment and
in an rcfile (such as .mpirunrc, .bashrc or .cshrc) that will be invoked on
remote nodes. See section 3.5.8 and section 3.5.3.1 for information on setting up
your environment and section C.8.6 for information on setting your run-time library
path. See also section C.8.7 for information on run time errors that may occur if
there are MPI version mismatches.
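As an illustration, for an alternate MPI installed under a hypothetical prefix of
/opt/othermpi, the settings for bash or sh users might look like this, both in your
local environment and in the rcfile used on the remote nodes:
export PATH=/opt/othermpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/othermpi/lib64:$LD_LIBRARY_PATH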
3.7
MPI Over uDAPL
Some MPI implementations can be run over uDAPL. uDAPL is the user mode
version of the Direct Access Provider Library (DAPL). Examples of such MPI
implementations are Intel MPI and one option on OpenMPI.
If you are running such an MPI implementation, the rdma_cm and rdma_ucm
modules will need to be loaded. To test these modules, use these commands (as
root):
# modprobe rdma_cm
# modprobe rdma_ucm
To ensure that the modules are loaded whenever the driver is loaded, add rdma_cm
and rdma_ucm to the OPENFABRICS_MODULES assignment in
/etc/sysconfig/infinipath.
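A sketch of that assignment is shown below; the placeholder stands for whatever
module names are already listed in your file and should be preserved:
OPENFABRICS_MODULES="<existing modules> rdma_cm rdma_ucm"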
3.8
MPD
MPD is an alternative to mpirun for launching MPI jobs. It is described briefly in the
following sections.
3.8.1
MPD Description
The Multi-Purpose Daemon (MPD) was developed by Argonne National Laboratory
(ANL), as part of the MPICH-2 system. While the ANL MPD had certain advantages
over the use of their mpirun (faster launching, better cleanup after crashes, better
tolerance of node failures), the InfiniPath mpirun offers the same advantages.
3.8.2
Using MPD
The disadvantage of MPD is reduced security, since it does not use ssh to launch
node programs. It is also a little more complex to use than mpirun because it
requires starting a ring of MPD daemons on the nodes. Therefore, most users should
use the normal mpirun mechanism for starting jobs as described in the previous
sections of this chapter. However, for users who wish to use MPD, it is included in
the InfiniPath software.
To start an MPD environment, use the mpdboot program. You must provide mpdboot
with a file listing the machines on which to run the mpd daemon. The format of this
file is the same as for the mpihosts file in the mpirun command.
Here is an example of how to run mpdboot:
$ mpdboot -f hostsfile
After mpdboot has started the MPD daemons, it will print a status message and
drop you into a new shell.
To leave the MPD environment, exit from this shell. This will terminate the daemons.
To run an MPI program from within the MPD environment, use the mpirun command.
You do not need to provide a mpihosts file or a count of CPUs; by default, mpirun
will use all nodes and CPUs available within the MPD environment.
To check the status of the MPD daemons, use the mpdping command.
NOTE: To use MPD, the software package mpi-frontend-2.0*.rpm must be
installed on all nodes. See the InfiniPath Install Guide for more details on
software installation.
3.9
File I/O in MPI
File I/O in MPI is discussed briefly in the following two sections.
3.9.1
Linux File I/O in MPI Programs
MPI node programs are Linux programs, which can do file I/O to local or remote
files in the usual ways through APIs of the language in use. Remote files are
accessed via some network file system, typically NFS. Parallel programs usually
need to have some data in files to be shared by all of the processes of an MPI job.
Node programs may also use non-shared, node-specific files, such as for scratch
storage for intermediate results or for a node’s share of a distributed database.
There are different styles of handling file I/O of shared data in parallel programming.
You may have one process, typically on the front end node or on a file server, which
is the only process to touch the shared files, and which passes data to and from
the other processes via MPI messages. On the other hand, the shared data files
could be accessed directly by each node program. In this case, the shared files
would be available through some network file support, such as NFS. Also, in this
case, the application programmer would be responsible for ensuring file
consistency, either through proper use of file locking mechanisms offered by the
OS and the programming language, such as fcntl in C, or by the use of MPI
synchronization operations.
3.9.2
MPI-IO with ROMIO
MPI-IO is the part of the MPI-2 standard supporting collective and parallel file I/O.
One of the advantages in using MPI-IO is that it can take care of managing file locks
in case of file data shared among nodes.
InfiniPath MPI includes ROMIO version 1.2.6, a high-performance, portable
implementation of MPI-IO from Argonne National Laboratory. ROMIO includes
everything defined in the MPI-2 I/O chapter of the MPI-2 standard except support
for file interoperability and user-defined error handlers for files. Of the MPI-2
features, InfiniPath MPI includes only the MPI-IO features implemented in ROMIO
version 1.2.6 and the generalized MPI_Alltoallw communication exchange. See the
ROMIO documentation at http://www.mcs.anl.gov/romio for details.
3.10
InfiniPath MPI and Hybrid MPI/OpenMP Applications
InfiniPath MPI supports hybrid MPI/OpenMP applications provided that MPI routines
are only called by the master OpenMP thread. This is called the funneled thread
model. Instead of MPI_Init/MPI_INIT (for C/C++ and Fortran respectively), the
program can call MPI_Init_thread/MPI_INIT_THREAD to determine the level of
thread support, and the value MPI_THREAD_FUNNELED will be returned.
To use this feature the application should be compiled with both OpenMP and MPI
code enabled. To do this, use the -mp flag on the mpicc compile line.
As mentioned above, MPI routines must only be called by the master OpenMP
thread. The hybrid executable is executed as usual using mpirun, but typically only
one MPI process is run per node and the OpenMP library will create additional
threads to utilize all CPUs on that node. If there are sufficient CPUs on a node, it
may be desirable to run multiple MPI processes and multiple OpenMP threads per
node.
The number of OpenMP threads is typically controlled by the OMP_NUM_THREADS
environment variable in the .mpirunrc file. This may be used to adjust the split
between MPI processes and OpenMP threads. Usually the number of MPI processes
(per node) times the number of OpenMP threads will be set to match the number
of CPUs per node. An example case would be a node with 4 CPUs, running 1 MPI
process and 4 OpenMP threads. In this case, OMP_NUM_THREADS is set to 4.
OMP_NUM_THREADS is on a per-node basis.
See section 3.5.8 for information on setting environment variables.
The MPI_THREAD_SERIALIZED and MPI_THREAD_MULTIPLE models are not
yet supported.
NOTE: If there are more threads than CPUs, then both MPI and OpenMP
performance can be significantly degraded due to over-subscription of
the CPUs.
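For the 4-CPU node case above, a sketch of the build and launch steps (the source
file name is hypothetical) might be:
$ mpicc -mp -o hybrid_app hybrid_app.c
$ mpirun -np 2 -m mpihosts -ppn 1 ./hybrid_app
with OMP_NUM_THREADS=4 exported in the .mpirunrc used on the nodes, so that
each of the two node programs creates four OpenMP threads.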
3.11
Debugging MPI Programs
Debugging parallel programs is substantially more difficult than debugging serial
programs. Thoroughly debugging the serial parts of your code before parallelizing
is good programming practice.
3.11.1
MPI Errors
Almost all MPI routines (except MPI_Wtime and MPI_Wtick) return an error code,
either as the function return value in C functions or as the last argument in a Fortran
subroutine call. Before the value is returned, the current MPI error handler is called.
By default, this error handler aborts the MPI job. Therefore you can get information
about MPI exceptions in your code by providing your own handler for
MPI_ERRORS_RETURN. See the man page for MPI_Errhandler_set for details.
NOTE: MPI does not guarantee that an MPI program can continue past an error.
See the standard MPI documentation referenced in appendix D for details on the
MPI error codes.
3.11.2
Using Debuggers
The InfiniPath software supports the use of multiple debuggers, including pathdb,
gdb, and the system call tracing utility strace. These debuggers let you set
breakpoints in a running program, and examine and set its variables.
Symbolic debugging is easier than machine language debugging. To enable
symbolic debugging you must have compiled with the -g option to mpicc so that
the compiler will have included symbol tables in the compiled object code.
To run your MPI program with a debugger use the -debug or -debug-no-pause
and -debugger options to mpirun. See the man pages to pathdb, gdb, and strace
for details. When you run under a debugger, you get an xterm window on the front
end machine for each node process. Thus, you can control the different node
processes as desired.
To use strace with your MPI program, the syntax would be:
$ mpirun -np n -m mpihosts strace program-name
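Similarly, to start each node program under gdb in its own xterm window, with the
recommended -q 0 setting (the program name is hypothetical):
$ mpirun -np 2 -m mpihosts -debug -q 0 ./my_mpi_program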
The following features of InfiniPath MPI especially facilitate debugging:
■ Stack backtraces are provided for programs that crash.
■ -debug and -debug-no-pause options are provided for mpirun that can make
each node program start with debugging enabled. The
-debug option allows you
to set breakpoints, and start running programs individually. The
-debug-no-pause option allows postmortem inspection. Note that you should
set
-q 0 when using -debug.
■ Communication between mpirun and node programs can be printed by
specifying the mpirun -verbose option.
■ MPI implementation debug messages can be printed by specifying the mpirun
-psc-debug-level option. Note that this can substantially impact the
performance of the node program.
■ Support is provided for progress timeout specifications, deadlock detection, and
generating information about where a program is stuck.
■ Several misconfigurations (such as mixed use of 32-bit/64-bit executables) are
detected by the runtime.
■ A formatted list containing information useful for high-level MPI application
profiling is provided by using the -print-stats option with mpirun. Statistics
include minimum, maximum and median values for message transmission
protocols as well as more detailed information for expected and unexpected
message reception. See appendix C.8.13 for more information and a sample
output listing.
3.12
InfiniPath MPI Limitations
The current version of InfiniPath MPI has the following limitations:
■ By default, at most eight node programs per node with the QHT7140 are allowed,
and at most four node programs per node with the QLE7140. The error message
when this limit is exceeded is:
No ports available on /dev/ipath
NOTE: If port sharing is enabled, this limit is raised to 16 and 8 respectively. To
enable port sharing, set PSM_SHAREDPORTS=1 in your environment.
■ There are no C++ bindings to MPI -- use the extern C MPI function calls.
■ In MPI-IO file I/O calls in the Fortran binding, offset or displacement arguments
are limited to 32 bits. Thus, for example, the second argument of MPI_File_seek
must lie between -2^31 and 2^31-1, and the argument to MPI_File_read_at must
lie between 0 and 2^32-1.
Appendix A
Benchmark Programs
Several MPI performance measurement programs are installed from the
mpi-benchmark RPM. This Appendix describes these useful benchmarks and how
to run them. These programs are based on code from the group of Dr. Dhabaleswar
K. Panda at the Network-Based Computing Laboratory at the Ohio State University.
For more information, see:
http://nowlab.cis.ohio-state.edu/
These programs allow you to measure the MPI latency and bandwidth between two
or more nodes in your cluster. Both the executables, and the source for those
executables, are shipped. The executables are shipped in the mpi-benchmark
RPM, and installed under /usr/bin. The source is shipped in the mpi-devel RPM
and installed under /usr/share/mpich/examples/performance.
The examples given below are intended only to show the syntax for invoking these
programs and the meaning of the output. They are NOT representations of actual
InfiniPath performance characteristics.
A.1
Benchmark 1: Measuring MPI Latency Between Two Nodes
In the MPI community, latency for a message of given size is defined to be the time
difference between a node program’s calling MPI_Send and the time that the
corresponding MPI_Recv in the receiving node program returns. By latency alone,
without a qualifying message size, we mean the latency for a message of size zero.
This latency represents the minimum overhead for sending messages, due both to
software overhead and to delays in the electronics of the fabric. To simplify the
timing measurement, latencies are usually measured with a ping-pong method,
timing a round-trip and dividing by two.
The program osu_latency, from Ohio State University, measures the latency for a
range of message sizes from 0 to 4 megabytes. It uses a ping-pong method, in
which the rank 0 process initiates a series of sends and the rank 1 process echoes
them back, using the blocking MPI send and receive calls for all operations. Half
the time interval observed by the rank 0 process for each such exchange is a
measure of the latency for messages of that size, as defined above. The program
uses a loop, executing many such exchanges for each message size, in order to
get an average. It defers the timing until the message has been sent and received
a number of times, in order to be sure that all the caches in the pipeline have been
filled.
This benchmark always involves just two node programs. You can run it with the
command:
$ mpirun -np 2 -ppn 1 -m mpihosts osu_latency
The -ppn 1 option is needed to be certain that the two communicating processes
are on different nodes. Otherwise, in the case of multiprocessor nodes, mpirun
might assign the two processes to the same node, and so the result would not be
indicative of the latency of the InfiniPath fabric, but rather of the shared memory
transport mechanism. The output of the program consists of two columns: the first
gives the message size in bytes, the second gives the average
(one-way) latency in microseconds. Again, this example is given to show the syntax
of the command and the format of the output, and is not meant to represent actual
values that might be obtained on any particular InfiniPath installation.
A.2
Benchmark 2: Measuring MPI Bandwidth Between Two Nodes
The osu_bw benchmark is meant to measure the maximum rate at which you can
pump data between two nodes. It also uses a ping-pong mechanism, similar to the
osu_latency code, except in this case, the originator of the messages pumps a
number of them (64 in the installed version) in succession using the non-blocking
MPI_Isend function, while the receiving node consumes them as quickly as it can
using the non-blocking MPI_Irecv, and then returns a zero-length acknowledgement
when all of the set has been received.
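Presumably it is invoked in the same way as the latency benchmark, with -ppn 1
again ensuring that the two processes land on different nodes; a sketch of the
command would be:
$ mpirun -np 2 -ppn 1 -m mpihosts osu_bw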
Note that the increase in measured bandwidth with message size results from the
fact that latency’s contribution to the measured time interval becomes relatively
smaller.
A.3
Benchmark 3: Messaging Rate Microbenchmarks
mpi_multibw is the microbenchmark used to highlight QLogic’s messaging rate
results. This benchmark is a modified form of the OSU NOWlab’s osu_bw
benchmark (as shown in the example above). It has been enhanced with the
following additional functionality:
■ Messaging rate reported as well as bandwidth
■ N/2 dynamically calculated at end of run
■ Allows user to run multiple processes per node and see aggregate bandwidth
and messaging rates
The benchmark has been updated with code to dynamically determine which
processes are on which host. This is an example showing the type of output you
will see when you run mpi_multibw:
$ mpirun -np 8 ./mpi_multibw
This will run on four processes per node. Typical output might look like:
Searching for N/2 bandwidth. Maximum Bandwidth of 954.547225
MB/s...
Found N/2 bandwidth of 476.993060 MB/s at size 94 bytes
This microbenchmark is available and can be downloaded from the QLogic website:
http://www.qlogic.com
A.4
Benchmark 4: Measuring MPI Latency in Host Rings
The program mpi_latency can be used to measure latency in a ring of hosts. Its
syntax is a bit different from Benchmark 1 in that it takes command line arguments
that let you specify the message size and the number of messages over which to
average the results. So, for example, if you have a hosts file listing four or more
nodes, you can run it over a ring of four hosts with a zero-length message. The
program reports the average time per hop; a result of 1.76 microseconds per hop,
for instance, means that it took that long on average to send a zero-length message
from the first host, to the second, to the third, to the fourth, and then get replies
back in the other direction.
Appendix B
Integration with a Batch Queuing System
Most cluster systems use some kind of batch queuing system as an orderly way to
provide users with access to the resources they need to meet their job’s performance
requirements. One of the tasks of the cluster administrator is to provide means for
users to submit MPI jobs through such batch queuing systems. This can take the
form of a script, which your users can invoke much as they would invoke mpirun,
to submit their MPI jobs. A sample script is presented in this section.
B.1
A Batch Queuing Script
We give an example of some of the functions that such a script might perform,
in the context of the Simple Linux Utility Resource Manager (SLURM) developed
at Lawrence Livermore National Laboratory. These functions assume the use of the
bash shell. We will call this script batch_mpirun. It is provided here:
#! /bin/sh
# Very simple example batch script for InfiniPath MPI, using slurm
# (http://www.llnl.gov/linux/slurm/)
# Invoked as:
# batch_mpirun #cpus mpi_program_name mpi_program_args ...
#
np=$1 mpi_prog="$2" # assume arguments to script are correct
shift 2 # program args are now $@
eval `srun --allocate --ntasks=$np --no-shell`
mpihosts_file=`mktemp -p /tmp mpihosts_file.XXXXXX`
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \
  | awk '{printf "%s:%s\n", $2, $1}' > $mpihosts_file
# Run the MPI job, then release the SLURM allocation and clean up.
# (Sketch of the remaining steps; the exact original commands may differ.)
mpirun -np $np -m $mpihosts_file $mpi_prog "$@"
exit_code=$?
scancel ${SLURM_JOBID}
rm -f $mpihosts_file
exit $exit_code
In the following sections, setup and the various functions of the script are discussed
in further detail.
B.1.1
Allocating Resources
When the mpirun command starts, it requires specification of the number of node
programs it must spawn (via the -np option) and specification of an mpihosts file
listing the nodes on which the node programs may be run. (See section 3.5.8 for
more information.) Normally, since performance is usually important, a user might
require that his node program be the only application running on each node CPU.
In a typical batch environment, the MPI user would still specify the number of node
programs, but would depend on the batch system to allocate specific nodes when
the required number of CPUs becomes available. Thus, batch_mpirun would take
at least an argument specifying the number of node programs and an argument
specifying the MPI program to be instantiated. For example,
$ batch_mpirun -np n my_mpi_program
After parsing the command line arguments, the next step of batch_mpirun would
be to request an allocation of n processors from the batch system. In SLURM, this
would use the command
eval `srun --allocate --ntasks=$np --no-shell`
Make sure to use back-quotes rather than normal single-quotes. $np is the shell
variable that your script has set from the parsing of its command line options. The
--no-shell option to srun prevents SLURM from starting a subshell. The srun
command is run with eval in order to set the SLURM_JOBID shell variable from the
output of the srun command.
With these specified arguments, the SLURM function srun blocks until there are
$np processors available to commit to the caller. When the requested resources
are available, this command opens a new shell and allocates the requested number
of processors to it.
B.1.2
Generating the mpihosts File
Once the batch system has allocated the required resources, your script must
generate a mpihosts file, which contains a list of nodes that will be used. To do this,
it must find out which nodes the batch system has allocated, and how many
processes we can start on each node. This is the part of the script batch_mpirun
that performs these tasks:
The first command creates a temporary hosts file with a random name, and assigns
the name to the variable mpihosts_file.
The next instance of the SLURM srun command runs hostname -s once per
process slot that SLURM has allocated to us. If SLURM has allocated two slots on
one node, we thus get the output of hostname -s twice for that node.
The sort | uniq -c component tells us the number of times each unique line was
printed. The awk command converts the result into the mpihosts file format used
by mpirun. Each line consists of a node name, a colon, and the number of processes
to start on that node.
NOTE: This is one of two formats that the file may use. See section 3.5.6 for more
information.
B.1.3
Simple Process Management
At this point, your script has enough information to be able to run an MPI program.
All that remains is to start the program when the batch system tells us that we can
do so, and notify the batch system when the job completes. This is done in the final
part of the script, using the mpihosts file it has generated.
The InfiniPath software will normally ensure clean termination of all MPI programs
when a job ends, but in some rare circumstances an MPI process will remain alive,
and potentially interfere with future MPI jobs. To avoid this problem, the usual
solution is to run a script before and after each batch job which kills all unwanted
processes. QLogic does not provide such a script, but it is useful to know how to
find out which processes on a node are using the InfiniPath interconnect. The easiest
way to do this is through use of the fuser command, which is normally installed in
/sbin. Run as root:
# /sbin/fuser -v /dev/ipath
/dev/ipath: 22648m 22651m
In this example, processes 22648 and 22651 are using the InfiniPath interconnect.
It is also possible to use this command (as root):
# lsof /dev/ipath
This gets a list of processes using InfiniPath. Additionally, to get all processes,
including stats programs, ipath_sma, diags, and others, run the program in this
way:
# /sbin/fuser -v /dev/ipath*
lsof can also take the same form:
# lsof /dev/ipath*
The following command will terminate all processes using the InfiniPath
interconnect:
# /sbin/fuser -k /dev/ipath
For more information, see the man pages for fuser(1) and lsof(8).
NOTE: Run these commands as root to ensure that all processes are reported.
B.2
Lock Enough Memory on Nodes When Using SLURM
This is identical to information provided in appendix C.8.11. It is repeated here for
your convenience.
InfiniPath MPI requires the ability to lock (pin) memory during data transfers on each
compute node. This is normally done via /etc/initscript, which is created or
modified during the installation of the infinipath RPM (setting a limit of 64MB,
with the command "ulimit -l 65536").
Some batch systems, such as SLURM, propagate the user’s environment from the
node where you start the job to all the other nodes. For these batch systems, you
may need to make the same change on the node from which you start your batch
jobs.
If this file is not present or the node has not been rebooted after the infinipath
RPM has been installed, a failure message indicating that memory could not be
locked will be generated when an MPI job is started.
You can check the ulimit -l on all the nodes by running ipath_checkout. A
warning will be given if ulimit -l is less than 4096.
There are two possible solutions to this. If infinipath is not installed on the node
where you start the job, set this value in the following way. You must be root to set it:
# ulimit -l 65536
Or, if you have installed infinipath on the node, reboot it to ensure that
/etc/initscript is run.
Appendix C
Troubleshooting
This Appendix describes some of the existing provisions for diagnosing and fixing
problems, organized by topic.
This section lists conditions you may encounter while installing the InfiniPath
QLE7140 or QHT7140 adapter, and offers suggestions for working around them.
C.1.1
Mechanical and Electrical Considerations
The LEDs function as link and data indicators once the InfiniPath hardware and
software has been installed, the driver has been loaded, and the fabric is being
actively managed by a Subnet Manager. The following table shows the possible
states of the LEDs. The green LED will normally illuminate first. The normal state
is Green On, Amber On.
Table C-1. LED Link and Data Indicators
LED     Color   Status
Power   Green   ON: Signal detected. Ready to talk to an SM to bring
                link fully up.
                OFF: Switch not powered up. Software not installed or
                started. Loss of signal. Check cabling.
Link    Amber   ON: Link configured. Properly connected and ready to
                receive data and link packets.
                OFF: SM may be missing. Link may not be configured.
                Check the connection.
If a node repeatedly and spontaneously reboots when attempting to load the
InfiniPath driver, it may be a symptom that its InfiniPath interconnect board is not
well seated in the HTX or PCIe slot.
C.1.2
Some HTX Motherboards May Need 2 or More CPUs in Use
Some HTX motherboards may require that 2 or more of the CPUs be in use for the
HTX InfiniPath card to be recognized. This is most evident in four-socket
motherboards.
C.2
BIOS Settings
This section covers issues related to improper BIOS settings. The two most
important settings are:
■ ACPI needs to be enabled
■ MTRR mapping needs to be set to “Discrete”
If ACPI has been disabled, it may result in initialization problems, as described in
appendix C.4.4.
An improper setting for MTRR mapping can result in reduced performance. See
appendix C.2.1, appendix C.2.2, and appendix C.2.3 for details.
NOTE: BIOS settings on IBM Blade Center H (Power) systems do not need
adjustment.
C.2.1
MTRR Mapping and Write Combining
MTRR (Memory Type Range Registers) is used by the InfiniPath driver to enable
write combining to the InfiniPath on-chip transmit buffers. This improves write
bandwidth to the InfiniPath chip by writing multiple words in a single bus transaction
(typically 64). This applies only to x86_64 systems. To see if write combining is
working correctly
and to check your bandwidth use this command:
$ ipath_pkt_test -B
When configured correctly, PCIe InfiniPath will normally report in the range of
1150-1500 MB/s, while HTX InfiniPath cards will normally report in the range of
2300-2650 MB/s.
However, some BIOSes don’t have the MTRR mapping option. It may be referred
to in a different way, dependent upon chipset, vendor, BIOS, or other factors. For
example, it is sometimes referred to as "32 bit memory hole", which should be
enabled.
If there is no setting for MTRR mapping or 32 bit memory hole, please contact your
system or motherboard vendor and inquire as to how write combining may be
enabled.
C.2.2
Incorrect MTRR Mapping
In some cases, the InfiniPath driver may be unable to configure the CPU Write
Combining attributes for the QLogic InfiniPath IBA6110. This would normally be
seen for a new system, or after the system’s BIOS has been upgraded or
reconfigured.
If this error occurs, the InfiniPath interconnect will operate, but in a degraded
performance mode. Typically the latency will increase to several microseconds, and
the bandwidth may decrease to as little as 200 MBytes/sec.
A message similar to this will be printed on the console, and normally to the system
log (typically in /var/log/messages):
infinipath: mtrr_add(feb00000,0x100000,WC,0) failed (-22)
infinipath: probe of 0000:04:01.0 failed with error -22
If you see this error message, you should edit the BIOS setting for MTRR Mapping.
The setting should look like this:
MTRR Mapping [Discrete]
You can check and adjust the BIOS settings using the BIOS Setup Utility. Check
the hardware documentation that came with your system for more information on
how to do this. Section C.2.3, below, documents a related issue.
C.2.3
Incorrect MTRR Mapping Causes Unexpected Low Bandwidth
This same MTRR Mapping setting as described in the previous section can also
cause unexpected low bandwidth if it is set incorrectly.
The setting should look like this:
MTRR Mapping [Discrete]
The MTRR Mapping needs to be set to Discrete if there is 4GB or more memory in
the system; it affects where the PCI, PCIe, and HyperTransport i/o addresses
(BARs) are mapped. If there is 4GB or more memory in the system, and this is not
set to Discrete, you will get very low bandwidth (under 250 MB/sec) on anything
that would normally run near full bandwidth. The exact symptoms can vary with
BIOS, amount of memory, etc., but typically there will be no errors or warnings.
To check your bandwidth try:
$ ipath_pkt_test -B
When configured correctly, PCIe InfiniPath will normally report in the range of
1150-1500 MB/s, while HTX InfiniPath cards will normally report in the range of
2300-2650 MB/s. ipath_checkout can also be used to check bandwidth.
You can check and adjust the BIOS settings using the BIOS Setup Utility. Check
the hardware documentation that came with your system for more information on
how to do this.
C.2.4
Change Setting for Mapping Memory
In some cases, on systems with 4GB or more memory on Opteron systems with
InfiniPath HTX cards (QHT7040 or QHT7140), and the Red Hat Enterprise Linux 4
release with 2.6.9 Linux kernels, MPI jobs may fail to initialize or may terminate
early. This can be worked around by changing the setting for mapping memory
around the PCI configuration space ("SoftWare Memory Hole") to "Disabled" in the
Chipset, Northbridge screen in the BIOS. This will result in a small loss in usable
memory.
C.2.5
Issue with SuperMicro H8DCE-HTe and QHT7040
The InfiniPath card may not be recognized on startup when using the SuperMicro
H8DCE-HT-e and the QHT7040 adapter. To fix this problem, the OS selector option
in the BIOS should be set for Linux. The option will look like this:
OS Installation [Linux]
C.3
Software Installation Issues
This section covers issues related to software installation.
C.3.1
OpenFabrics Dependencies
You need to install sysfsutils for your distribution before installing the
OpenFabrics RPMs, as there are dependencies. If sysfsutils has not been
installed, you might see error messages like this:
error: Failed dependencies:
libsysfs.so.1()(64bit) is needed by
libipathverbs-2.0-1_100.77_fc3_psc.x86_64
libsysfs.so.1()(64bit) is needed by
libibverbs-utils-2.0-1_100.77_fc3_psc.x86_64
/usr/include/sysfs/libsysfs.h is needed by
libibverbs-devel-2.0-1_100.77_fc3_psc.x86_64
Check your distribution’s documentation for information about sysfsutils.
C.3.2
Install Warning with RHEL4U2
You may see a warning similar to this when installing InfiniPath and OpenFabrics
modules on RHEL4U2.
infinipath-2.0-7277.1538_fc3_psc
Building and installing InfiniPath and OpenIB modules for
2.6.9-22.ELsmp kernel
Building modules, stage 2.
Warning: could not find versions for .tmp_versions/ib_mthca.mod
This warning may be safely ignored.
C.3.3
mpirun Installation Requires 32-bit Support
On a 64-bit system, 32-bit glibc must be installed before installing the
mpi-frontend-* RPM. mpirun, which is part of the mpi-frontend-* RPM,
requires 32-bit support.
If 32-bit glibc is not installed on a 64-bit system, you will now see an error like this
when installing mpi-frontend:
# rpm -Uv ~/tmp/mpi-frontend-2.0-2250.735_fc3_psc.i386.rpm
error: Failed dependencies:
/lib/libc.so.6 is needed by mpi-frontend-2.0 2250.735_fc3_psc.i386
In older distributions, such as RHEL4, the 32-bit glibc will be contained in the
libgcc RPM. The RPM will be named similarly to:
libgcc-3.4.3-9.EL4.i386.rpm
In newer distributions, glibc is an RPM name. The 32-bit glibc will be named
similarly to:
glibc-2.3.4-2.i686.rpm
or
glibc-2.3.4-2.i386.rpm
Check your distribution for the exact RPM name.
C.3.4
Installing Newer Drivers from Other Distributions
The driver source now resides in infinipath-kernel. This means that newer
drivers can be installed as they become available. Those who wish to install newer
drivers, for example, from OFED (Open Fabrics Enterprise Distribution), should be
able to do so. However, some extra steps need to be taken in order to install properly.
1. Install all InfiniPath RPMs, including infinipath-kernel. The RPM
infinipath-kernel installs into:
/lib/modules/$(uname -r)/updates
This should not affect any other installed InfiniPath or OpenFabrics drivers.
2. Reload the InfiniPath and OpenFabrics modules to verify that the installation
works by using this command (as root):
# /etc/init.d/infinipath restart
3. Run ipath_checkout or other OpenFabrics test program to verify that the
InfiniPath card(s) work properly.
4. Unload the InfiniPath and OpenFabrics modules with the command:
# /etc/init.d/infinipath stop
5. Remove the InfiniPath kernel components with the command:
$ rpm -e infinipath-kernel --nodeps
The option --nodeps is required because the other InfiniPath RPMs depend
on infinipath-kernel.
6. Verify that no InfiniPath or OpenFabrics modules are present in the
/lib/modules/$(uname -r)/updates directory.
7. If not yet installed, install the InfiniPath and OpenFabrics modules from your
alternate set of RPMs.
8. Reload all modules by using this command (as root):
# /etc/init.d/infinipath start
An alternate mechanism can be used, if provided as part of your alternate
installation.
9. Run an OpenFabrics test program, such as ibstatus, to verify that your
InfiniPath card(s) work correctly.
C.3.5
Installing for Your Distribution
You may be using a kernel which is compatible with one of the supported
distributions, but which may not be picked up during infinipath-kernel
installation. It may also happen when using make-install.sh to manually
recompile the drivers.
In this case, you can set your distribution with the $IPATH_DISTRO override. Run
this command before installation, or before running make-install.sh. We use
the RHEL4 Update 4 distribution as an example in this command for bash or sh
users:
$ export IPATH_DISTRO=rhel4_U4
The distribution arguments that are currently understood are listed below. They are
found in the file build-guards.sh.
These are used for RHEL, CentOS(Rocks), and Scientific Linux.
rhel4_U2
rhel4_U3
rhel4_U4
These are used for SLES, SUSE, and Fedora:
sles9
sles10
suse9.3
fc3
fc4
make-install.sh and build-guards.sh are both found in this directory:
/usr/src/infinipath/drivers
C.4
Kernel and Initialization Issues
Issues that may prevent the system from coming up properly are described.
C.4.1
Kernel Needs CONFIG_PCI_MSI=y
If the InfiniPath driver is being compiled on a machine without CONFIG_PCI_MSI=y
configured, you will get a compilation error similar to this:
ib_ipath/ipath_driver.c:46:2: #error "InfiniPath driver can only
be used with kernels with CONFIG_PCI_MSI=y"
make[3]: *** [ib_ipath/ipath_driver.o]
Error 1
Some kernels, such as some versions of FC4 (2.6.16), have CONFIG_PCI_MSI=n
as the default. This default may also be introduced with updates to other Linux
distributions or local configuration changes. This needs to be changed to
CONFIG_PCI_MSI=y in order for the InfiniPath driver to function.
The suggested remedy is to install one of the supported Linux kernels (see
section 1.7), or download a patched kernel from the QLogic website.
Pre-built kernels and patches for these distributions are available for download on
the website. Please go to:
Q
http://www.qlogic.com
Follow the links to the download page.
NOTE: As of this writing, kernels later than 2.6.16-1.2108_FC4smp on FC4 no
longer have this problem.
C.4.2
pci_msi_quirk
A change was made in the kernel.org 2.6.12 kernel that can cause an InfiniPath
driver runtime error with the QLE7140. This change is found in most linux
distributions with 2.6.12 - 2.6.16 kernels, including Fedora Core 3, Fedora Core 4,
and SUSE Linux 10.0. Affected systems are those that contain the AMD8131 PCI
bridge. Such systems may experience a problem with MSI (Message Signaled
Interrupt) that impairs the operation of the InfiniPath QLE7140 adapter. The
InfiniPath driver will not be able to configure the InfiniBand link to the Active state.
If messages similar to those below are displayed on the console during boot, or are
in
/var/log/messages, then you probably have the problem:
PCI: MSI quirk detected. pci_msi_quirk set.
path_core 0000:03:00.0: pci_enable_msi failed: -22, interrupts may
not work
Pre-built kernels and patches for these distributions are available for download on
the website. Please go to:
http://www.qlogic.com
Follow the links to the downloads page.
NOTE: This problem has been fixed in the 2.6.17 kernel.org kernel.
C.4.3
Driver Load Fails Due to Unsupported Kernel
If you try to load the InfiniPath driver on a kernel that InfiniPath software does not
support, the load fails. Error messages similar to this appear:
modprobe: error inserting
’/lib/modules/2.6.3-1.1659-smp/kernel/drivers/infiniband/hw/ipath/
ib_ipath.ko’: -1 Invalid module format
To correct this, install one of the appropriate supported Linux kernel versions as
listed in section 2.3.3, then reload the driver.
C.4.4
InfiniPath Interrupts Not Working
The InfiniPath driver will not be able to configure the InfiniPath link to a usable state
unless interrupts are working. Check for this with the command:
$ grep ib_ipath /proc/interrupts
If there is no output at all, driver initialization has failed. For further information on
driver problems, see appendix C.4.1, appendix C.4.3, or appendix C.4.6.
However, if the output appears similar to one of these lines, then interrupts are not
being delivered to the driver:
66: 0 0 PCI-MSI ib_ipath
185: 0 0 IO-APIC-level ib_ipath
NOTE: The output you see may vary depending on board type, distribution, or
update level.
A zero count in all CPU columns means that no interrupts have been delivered to
the processor.
Possible causes are:
■ Booting the linux kernel with ACPI (Advanced Configuration and Power Interface)
disabled on the boot command line, or in the BIOS configuration
■ Other infinipath initialization failures
To check if the kernel was booted with the "noacpi" or "pci=noacpi" options, use
this command:
$ grep -i acpi /proc/cmdline
If output is displayed, fix your kernel boot command line so that ACPI is enabled.
This can be set in various ways, depending on your distribution. If no output is
displayed, check to be sure that ACPI is enabled in your BIOS settings.
To track down other initialization failures, see appendix C.4.6.
The program ipath_checkout can also help flag these kinds of problems. See
appendix C.9.8 for more information.
C.4.5
OpenFabrics Load Errors If ib_ipath Driver Load Fails
When the ib_ipath driver fails to load for any reason, all of the OpenFabrics
drivers/modules loaded by /etc/init.d/infinipath fail with "Unknown symbol" errors:
ib_mad: Unknown symbol ib_unregister_client
ib_mad: Unknown symbol ib_query_ah
.
ib_sa: Unknown symbol ib_unregister_client
ib_sa: Unknown symbol ib_unpack
.
ib_ipath: Unknown symbol ib_modify_qp_is_ok
ib_ipath: Unknown symbol ib_unregister_device
.
ipath_ether: Unknown symbol ipath_layer_get_mac
ipath_ether: Unknown symbol ipath_layer_get_lid
.
NOTE:Not all the error messages are shown here.
C.4.6
InfiniPath ib_ipath Initialization Failure
There may be cases where ib_ipath was not properly initialized. Symptoms of this
may show up in error messages from an MPI job or another program. Here is a
sample command and error message:
$ mpirun -np 2 -m ~/tmp/mbu13 osu_latency
<nodename>:The link is down
MPIRUN: Node program unexpectedly quit. Exiting.
First, check to be sure that the InfiniPath driver is loaded:
$ lsmod | grep ib_ipath
If no output is displayed, the driver did not load for some reason. Try the commands
(as root):
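For example (an illustrative sequence; adjust the module and log checks to your installation):
# modprobe -v ib_ipath
# lsmod | grep ib_ipath
# dmesg | grep -i ipath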
This will indicate whether the driver has loaded. Printing out messages using dmesg may help to locate any problems with ib_ipath.
If the driver loaded, but MPI or other programs are not working, check to see if problems were detected during the driver and InfiniPath hardware initialization with the command:
$ dmesg | grep -i ipath
This may generate more than one screen of output. Also, check the link status with the commands:
$ cat /sys/bus/pci/driver/ib_ipath/0?/status_str
These commands are normally executed by the ipathbug-helper script, but running them separately may help locate the problem.
Refer also to appendix C.9.16 and appendix C.9.8.
C.4.7
MPI Job Failures Due to Initialization Problems
If one or more nodes do not have the interconnect in a usable state, messages
similar to the following will occur when the MPI program is started:
userinit: userinit ioctl failed: Network is down [1]: device init
failed
userinit: userinit ioctl failed: Fatal Error in keypriv.c(520):
device init failed
This could indicate that a cable is not connected, the switch is down, SM is not
running, or a hardware error has occurred.
C.5
OpenFabrics Issues
This section covers items related to OpenFabrics, including OpenSM.
C.5.1
Stop OpenSM Before Stopping/Restarting InfiniPath
OpenSM must be stopped before stopping or restarting InfiniPath. If not, error
messages such as the following will occur:
# /etc/init.d/infinipath stop
Unloading infiniband modules: sdp cm umad uverbs ipoib sa ipath mad core
FATAL: Module ib_umad is in use.
Unloading infinipath modules FATAL: Module ib_ipath is in use.
[FAILED]
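A minimal sketch of the recommended order, assuming OpenSM was started through an opensmd init script (the name of this script may differ on your distribution):
# /etc/init.d/opensmd stop
# /etc/init.d/infinipath stop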
C.5.2
Load and Configure IPoIB Before Loading SDP
SDP will generate "Connection Refused" errors if it is loaded before IPoIB has been
loaded and configured. Loading and configuring IPoIB first should solve the
problem.
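A sketch of the order, assuming the standard OpenFabrics module names and an ib0 interface (the interface name and IP address shown are examples only):
# modprobe ib_ipoib
# ifconfig ib0 192.168.0.10 netmask 255.255.255.0
# modprobe ib_sdp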
C.5.3
Set $IBPATH for OpenFabrics Scripts
The environment variable $IBPATH should be set to /usr/bin. If this has not been
set, or if you have it set to a location other than the installed location, you may see
error messages similar to this when running some OpenFabrics scripts:
/usr/bin/ibhosts: line 30: /usr/local/bin/ibnetdiscover: No such
file or directory
For the OpenFabrics commands supplied with this InfiniPath release, you should set the variable (if it has not been set already) to /usr/bin, as follows:
$ export IBPATH=/usr/bin
C.6
System Administration Troubleshooting
The following section gives details on locating problems related to system administration.
C.6.1
Broken Intermediate Link
Sometimes message traffic passes through the fabric while other traffic appears to
be blocked. In this case, MPI jobs fail to run.
In large cluster configurations, switches may be attached to other switches in order
to supply the necessary inter-node connectivity. Problems with these inter-switch
(or intermediate) links are sometimes more difficult to diagnose than failure of the
final link between a switch and a node. The failure of an intermediate link may allow
some traffic to pass through the fabric while other traffic is blocked or degraded.
If you encounter such behavior in a multi-layer fabric, check that all switch cable
connections are correct. Statistics for managed switches are available on a per-port
basis, and may help with debugging. See your switch vendor for more information.
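If the OpenFabrics diagnostic utilities are installed, tracing the path between two endpoints can also help isolate a faulty intermediate link; for example (the LID values here are placeholders):
$ ibtracert 2 9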
C.7
Performance Issues
Performance issues that are currently being addressed are covered in this section.
C.7.1
MVAPICH Performance Issues
MVAPICH over OpenFabrics over InfiniPath performance tuning has not yet been
done. Improved performance will be delivered in future releases.
C.8
InfiniPath MPI Troubleshooting
Problems specific to compiling and running MPI programs are detailed below.
C.8.1
Mixed Releases of MPI RPMs
Make sure that all of the MPI RPMs are from the same release. When using mpirun,
an error message will occur if different components of the MPI RPMs are from
different releases. This is a sample message in the case where mpirun from
release 1.3 is being used with a 2.0 library:
$ mpirun -np 2 -m ~/tmp/x2 osu_latency
MPI_runscript-xqa-14.0: ssh -x> Cannot detect InfiniPath
interconnect.
MPI_runscript-xqa-14.0: ssh -x> Seek help on loading InfiniPath
interconnect driver.
MPI_runscript-xqa-15.1: ssh -x> Cannot detect InfiniPath
interconnect.
MPI_runscript-xqa-15.1: ssh -x> Seek help on loading InfiniPath
interconnect driver.
MPIRUN: Node program(s) exited during connection setup
$ mpirun -v
MPIRUN:Infinipath Release2.0 : Built on Wed Nov 19 17:28:58 PDT
2006 by mee
The following is the error that occurs when mpirun from the 2.0 release is being
used with the 1.3 libraries:
MPIRUN: mpirun from the 2.0 software distribution requires all
node processes to be running 2.0 software. At least node
<nodename> uses non-2.0 MPI libraries
C.8.2
Cross-compilation Issues
The 2.x PathScale compilers aren’t currently supported on systems that use the GNU 4.x compilers and compiler environment (header files and libraries). This includes Fedora Core 4, Fedora Core 5, and SLES 10. The GNU 4.x environment will be supported in the PathScale Compiler Suite 3.0 release.
The current workaround for this is to compile on a supported and compatible distribution, then run the executable on one of the systems that uses the GNU 4.x compilers and environment.
■ To run on FC4 or FC5, install FC3 or RHEL4/CentOS on your build machine.
Compile your application on this machine.
■ To run on SLES 10, install SUSE 9.3 on your build machine. Compile your
application on this machine.
■ Alternatively, gcc can be used as the default compiler. Set mpicc -cc=gcc as
described in section 3.5.3 "To Use Another Compiler".
Next, on the machines in your cluster on which the job will run, install compatibility
libraries. These libraries include C++ and Fortran compatibility shared libraries and
libgcc.
For an FC4 or FC5 system, you would need:
■ pathscale-compilers-libs (for FC3)
■ compat-gcc-32
■ compat-gcc-32-g77
■ compat-libstdc++-33
On a SLES 10 system, you would need:
■ compat-libstdc++ (for FC3)
■ compat-libstdc++5 (for SLES 10)
Depending upon the application, you may need to use the -Wl,-Bstatic option to use the static versions of some libraries.
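For example, to link one particular library statically while leaving the remaining libraries dynamic (a sketch only; -lfoo stands in for whichever compatibility library you need):
$ mpicc myprogram.c -o myprogram -Wl,-Bstatic -lfoo -Wl,-Bdynamic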
C.8.3
Compiler/Linker Mismatch
This is a typical error message when the compiler and linker do not match in C and C++ programs:
$ export MPICH_CC=gcc
$ mpicc mpiworld.c
/usr/bin/ld: cannot find -lmpichabiglue_gcc3
collect2: ld returned 1 exit status
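To see which underlying compiler and link line the wrapper generates, MPICH-derived wrappers usually accept a -show option (verify that your mpicc supports it before relying on this):
$ mpicc -show mpiworld.c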
C.8.4
Compiler Can’t Find Include, Module or Library Files
RPMs can be installed in any location by using the --prefix option. This can introduce errors when compiling if the compiler cannot find the include files (and the module files for Fortran90 and Fortran95) from mpi-devel*, and the libraries from mpi-libs*, in the new locations. Compiler errors similar to this can occur:
$ mpicc myprogram.c
/usr/bin/ld: cannot find -lmpich
collect2: ld returned 1 exit status
NOTE:As noted in section 3.5.2 of the InfiniPath Install Guide, all development files now reside in specific *-Devel subdirectories.
On development nodes, programs must be compiled with the appropriate options so that the include files and the libraries can be found in the new locations. In addition, when running programs on compute nodes, you need to insure that the run-time library path is the same as the path that was used to compile the program.
The examples below show what compiler options to use for include files and libraries on the development nodes, and how to specify this new library path on the compute nodes for the runtime linker. The affected RPMs are:
mpi-devel* (on the development nodes)
mpi-libs* (on the development or compute nodes)
For the examples below, we assume that the new locations are:
/path/to/devel (for mpi-devel*)
/path/to/libs (for mpi-libs*)
There are three ways to specify the run-time library path so that when the programs are run, the appropriate libraries are found in the new location:
■ Use the -Wl,-rpath, option when compiling on the development node.
■ Update the /etc/ld.so.conf file on the compute nodes to include the path.
■ Export the path in the .mpirunrc file.
These methods are explained in more detail below.
1. An additional linker option, -Wl,-rpath, supplies the run-time library path when compiling on the development node. The compiler options now look like this:
$ mpicc myprogram.c -I/path/to/devel/include -L/path/to/libs/lib -Wl,-rpath,/path/to/libs/lib
The above compiler command insures that the program will run using this path on any machine.
2. For the second option, we change the file /etc/ld.so.conf on the compute nodes rather than using the -Wl,-rpath, option when compiling on the development node. We assume that the mpi-libs* RPM is installed on the compute nodes with the same --prefix /path/to/libs option as on the development nodes. Then, on the compute nodes, add the following lines to /etc/ld.so.conf:
/path/to/libs/lib
/path/to/libs/lib64
Then, to make sure that the changes are picked up, run (as root):
# /sbin/ldconfig
The libraries can now be found by the runtime linker on the compute nodes. This method has the advantage that it will work for all InfiniPath programs, without having to remember to change the compile/link lines.
3. Instead of either of the two above mechanisms, you can also put this line in the ~/.mpirunrc file:
export LD_LIBRARY_PATH=/path/to/libs/{lib,lib64}
See Section 3.5.8 in the chapter “Using InfiniPath MPI” for more information on using the -rcfile option to mpirun.
The choice between these options is left up to the cluster administrator and the MPI developer. See the documentation for your compiler for more information on the compiler options.
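As a quick sanity check on a compute node (a generic technique, not specific to InfiniPath), ldd shows which file each shared library dependency actually resolves to at run time:
$ ldd ./myprogram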
C.8.7
Run Time Errors With Different MPI Implementations
It is now possible to run different implementations of MPI, such as HP-MPI, over
InfiniPath. Many of these implementations share command (such as mpirun) and
library names, so it is important to distinguish which MPI version is in use. This is
done primarily through careful programming practices.
Examples are given below.
In the following command, the HP-MPI version of mpirun is invoked by the full
pathname. However, the program mpi_nxnlatbw was compiled with the QLogic
version of mpicc. The mismatch will produce errors similar to this:
bbb-02: Not running from mpirun?.
MPI Application rank 1 exited before MPI_Init() with status 1
bbb-03: Not running from mpirun?.
MPI Application rank 2 exited before MPI_Init() with status 1
bbb-01: Not running from mpirun?.
bbb-04: Not running from mpirun?.
MPI Application rank 3 exited before MPI_Init() with status 1
MPI Application rank 0 exited before MPI_Init() with status 1
In the case below, mpi_nxnlatbw.c is compiled with the HP-MPI version of
mpicc, and given the name of hpmpi-mpi_nxnlatbw, so that it is easy to see
which version was used. However, it is run with the QLogic mpirun, which will
produce errors similar to this:
$ mpirun -m ~/host-bbb -np 4 ./hpmpi-mpi_nxnlatbw
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:
libmpio.so.1: cannot open shared object file: No such file or
directory
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:
libmpio.so.1: cannot open shared object file: No such file or
directory
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:
libmpio.so.1: cannot open shared object file: No such file or
directory
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:
libmpio.so.1: cannot open shared object file: No such file or
directory
MPIRUN: Node program(s) exited during connection setup
The following two commands will both work properly:
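(Illustrative forms only; the HP-MPI installation path and its options are placeholders.) The point is that each mpirun is paired with a binary built by the matching implementation’s mpicc:
$ /opt/hpmpi/bin/mpirun [hp-mpi options] ./hpmpi-mpi_nxnlatbw
$ mpirun -m ~/host-bbb -np 4 ./mpi_nxnlatbw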
Check all rcfiles and /opt/infinipath/etc/mpirun.defaults to make sure that the paths for binaries and libraries ($PATH and $LD_LIBRARY_PATH) are consistent.
When compiling, use descriptive names for the object files.
See section C.8.4, section C.8.5, and section C.8.6 for additional information.
C.8.8
Process Limitation with ssh
MPI jobs that use more than 8 processes per node may encounter an ssh throttling mechanism that limits the number of concurrent per-node connections to 10. If you have this problem, you will see a message similar to this when using mpirun:
$ mpirun -m tmp -np 11 ~/mpi/mpiworld/mpiworld
ssh_exchange_identification: Connection closed by remote host
MPIRUN: Node program(s) exited during connection setup
If you encounter a message like this, you or your system administrator should
increase the value of ’MaxStartups’ in your sshd configurations.
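For example, you might raise the limit in /etc/ssh/sshd_config on each node and then restart sshd (the value and restart command shown are examples; adjust for your distribution):
MaxStartups 64
# /etc/init.d/sshd restart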
C.8.9
Using MPI.mod Files
MPI.mod (or mpi.mod) are the Fortran90/Fortran95 MPI module files. These contain the Fortran90/Fortran95 interface to the platform-specific MPI library. The module file is invoked by ‘USE MPI’ or ‘use mpi’ in your application. If the application has an argument list that doesn’t match what mpi.mod expects, errors such as this can occur:
pathf95-389 pathf90: ERROR BORDERS, File = communicate.F, Line = 407, Column = 18
No specific match can be found for the generic subprogram call "MPI_RECV".
If it is necessary to use a non-standard argument list, it is advisable to create your
own MPI module file, and compile the application with it, rather than the standard
MPI module file that is shipped in the mpi-devel-* RPM.
The default search path for the module file is:
/usr/include
To include your own MPI.mod rather than the standard version, use -I/your/search/directory, which will cause /your/search/directory to be checked before /usr/include. For example:
$ mpif90 -I/your/search/directory myprogram.f90
Usage for Fortran95 will be similar to the example for Fortran90.
C.8.10
Extending MPI Modules
MPI implementations provide certain procedures which accept an argument having
any data type, any precision, and any rank, but it isn’t practical for an MPI module
to enumerate every possible combination of type, kind, and rank. Therefore the
strict type checking required by Fortran 90 may generate errors.
For example, if the MPI module tells the compiler that "mpi_bcast" can operate on
an integer but does not also say that it can operate on a character string, you may
see a message similar to the following one:
pathf95: ERROR INPUT, File = input.F, Line = 32, Column = 14
No specific match can be found for the generic subprogram call
"MPI_BCAST".
If you know that an argument can in fact accept a data type which the MPI module
doesn’t explicitly allow, you can extend the interface for yourself. For example, here’s
a program which illustrates how to extend the interface for "mpi_bcast" so that it
accepts a character type as its first argument, without losing the ability to accept an
integer type as well:
module additional_bcast
implicit none
! Generic interface that adds a character version of "mpi_bcast"
interface mpi_bcast
module procedure additional_mpi_bcast_for_character
end interface
contains
subroutine additional_mpi_bcast_for_character(buffer, count, datatype, root, comm, ierror)
implicit none
character*(*) buffer
integer count, datatype, root, comm, ierror
! Call the Fortran 77 style implicit interface to "mpi_bcast"
external mpi_bcast
call mpi_bcast(buffer, count, datatype, root, comm, ierror)
end subroutine additional_mpi_bcast_for_character
end module additional_bcast

program myprogram
use mpi
use additional_bcast
implicit none
character*4 c
integer master, ierr, i
! Explicit integer version obtained from module "mpi"
call mpi_bcast(i, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr)
! Explicit character version obtained from module "additional_bcast"
call mpi_bcast(c, 4, MPI_CHARACTER, master, MPI_COMM_WORLD, ierr)
end program myprogram
This is equally applicable if the module "mpi" provides only a lower-rank interface
and you want to add a higher-rank interface. An example would be where the module
explicitly provides for 1-D and 2-D integer arrays but you need to pass a 3-D integer
array.
However, some care must be taken. One should only do this if:
■ The module "mpi" provides an explicit Fortran 90 style interface for "mpi_bcast."
If the module "mpi" does not, the program will use an implicit Fortran 77 style
interface, which does not perform any type checking. Adding an interface will
cause type-checking error messages where there previously were none.
■ The underlying function really does accept any data type. It is appropriate for the
first argument of "mpi_bcast" because the function operates on the underlying
bits, without attempting to interpret them as integer or character data.
C.8.11
Lock Enough Memory on Nodes When Using a Batch Queuing System
InfiniPath MPI requires the ability to lock (pin) memory during data transfers on each compute node. This is normally done via /etc/initscript, which is created or modified during the installation of the infinipath RPM (setting a limit of 64MB, with the command "ulimit -l 65536").
Some batch systems, such as SLURM, propagate the user’s environment from the node where you start the job to all the other nodes. For these batch systems, you may need to make the same change on the node from which you start your batch jobs.
If this file is not present, or the node has not been rebooted after the infinipath RPM has been installed, a failure message will be generated when a job is run.
You can check the ulimit -l on all the nodes by running ipath_checkout. A warning will be given if ulimit -l is less than 4096.
There are two possible solutions to this. If InfiniPath is not installed on the node where you start the job, set this value in the following way (as root):
# ulimit -l 65536
Or, if you have installed InfiniPath on the node, reboot it to insure that /etc/initscript is run.
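A minimal sketch of what the installed /etc/initscript typically contains (an assumption for illustration; your installed file may differ, and initscript must end by exec'ing the command that init passes in as "$4"):
# /etc/initscript (sketch)
ulimit -l 65536
eval exec "$4"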
C.8.12
Error Messages Generated by mpirun
In the sections below, types of mpirun error messages are described. They fall into
these categories:
■ Messages from the InfiniPath Library
■ MPI messages
■ Messages relating to the InfiniPath driver and InfiniBand links
Messages generated by mpirun follow a general format:
program_name: message
function_name: message
Messages may also have different prefixes, such as ipath_ or psm_, which indicate in which part of the software the errors are occurring.
C.8.12.1
Messages from the InfiniPath Library
These messages may appear in the mpirun output.
The first set are error messages, which indicate internal problems and should be
reported to Support.
Trying to cancel invalid timer (EOC)
sender rank rank is out of range (notification)
sender rank rank is out of range (ack)
Reached TIMER_TYPE_EOC while processing timers
Found unknown timer type type
unknown frame type type
recv done: available_tids now n, but max is m (freed p)
cancel recv available_tids now n, but max is m (freed %p)
[n] Src lid error: sender: x, exp send: y
Frame receive from unknown sender. exp. sender = x, came from y
Failed to allocate memory for eager buffer addresses: str
The following error messages probably indicate a hardware or connectivity problem:
Failed to get IB Unit LID for any unit
Failed to get our IB LID
Failed to get number of Infinipath units
In these cases you can try to reboot, then call Support.
The following indicate a mismatch between the InfiniPath interconnect hardware in
use and the version for which the software was compiled:
Number of buffer avail registers is wrong; have n, expected m
build mismatch, tidmap has n bits, ts_map m
These indicate a mismatch between the InfiniPath software and hardware versions.
Consult Support after verifying that current drivers and libraries are installed.
The following are all informative messages about driver initialization problems. They
are not necessarily fatal themselves, but sometimes indicate problems that interfere
with the application. In the actual printed output all of them are prefixed with the
name of the function that produced them.
Failed to get LID for unit u: str
Failed to get number of units: str
GETPORT ioctl failed: str
can't allocate memory for ipath_ctrl_typ: type
can't stat infinipath device to determine type: type
file descriptor is not for a real device, failing
get info ioctl failed: str
ipath_get_num_units called before init
ipath_get_unit_lid called before init
mmap64 of egr bufs from h failed: str
mmap64 of pio buffers at %llx failed: str
mmap64 of pioavail registers (%llx) failed: str
mmap64 of rcvhdr q failed: str
mmap64 of user registers at %llx failed: str
userinit allocation of rcvtail memory failed: str
userinit ioctl failed: str
Failed to set close on exec for device: str
NOTE:These messages should never occur. Please inform Support if they do.
The following message indicates that a node program may not be processing
incoming packets, perhaps due to a very high system load:
eager array full after overflow, flushing (head h, tail t)
The following indicates an invalid InfiniPath link protocol version:
InfiniPath version ERROR: Expected version v, found w (memkey h)
The following error messages should rarely occur and indicate internal software
problems:
ExpSend opcode h tid=j, rhf_error k: str
Asked to set timeout w/delay l, gives time in past (t2 < t1)
Error in sending packet: str
Fatal error in sending packet, exiting: str
Fatal error in sending packet: str
Here the str can give additional clues to the reason for the failure.
The following probably indicates a node failure or malfunctioning link in the fabric:
Couldn’t connect to NODENAME, rank RANK#. Time elapsed HH:MM:SS.
Still trying
NODENAME is the node (host) name, RANK# is the MPI rank, and HH:MM:SS are
the hours, minutes, and seconds since we started trying to connect.
If you get messages similar to the following, it may mean that you are trying to
receive to an invalid (unallocated) memory address, perhaps due to a logic error in
the program, usually related to malloc/free:
ipath_update_tid_err: Failed TID update for rendevous, allocation
problem
kernel: infinipath: get_user_pages (0x41 pages starting at
0x2aaaaeb50000
kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages:
errno 12
TID is short for Token ID, and is part of the InfiniPath hardware. This error indicates
a failure of the program, not the hardware or driver.
C.8.12.2
MPI Messages
Some MPI error messages are issued from the parts of the code inherited from the
MPICH implementation. See the MPICH documentation for descriptions of these.
This section presents the error messages specific to the InfiniPath MPI
implementation.
These messages appear in the mpirun output. Most are followed by an abort, and
possibly a backtrace. Each is preceded by the name of the function in which the
exception occurred.
A fatal protocol error occurred while trying to send an InfiniPath packet.
On Node n, process p seems to have forked.
The new process id is q. Forking is illegal under
InfiniPath. Exiting.
An MPI process has forked and its child process has attempted to make MPI calls.
This is not allowed.
processlabel Fatal Error in filename line_no: error_string
This is always followed by an abort. The processlabel usually takes the form of
host name followed by process rank.
At time of writing, the possible error_strings are:
Illegal label format character.
Recv Error.
Memory allocation failed.
Error creating shared memory object.
Error setting size of shared memory object.
Error mapping shared memory.
Error opening shared memory object.
Error attaching to shared memory.
invalid remaining buffers !!
Node table has inconsistent length!
Timeout waiting for nodetab!
The following indicates an unknown host:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
MPIRUN: Cannot obtain IP address of <nodename>: Unknown host
<nodename>
There is no route to a valid host:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
ssh: connect to host <nodename> port 22: No route to host
MPIRUN: Some node programs ended prematurely without connecting to
mpirun.
MPIRUN: No connection received from 1 node process on node
<nodename>
There is no route to any host:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
ssh: connect to host <nodename> port 22: No route to host
ssh: connect to host <nodename> port 22: No route to host
MPIRUN: All node programs ended prematurely without connecting to
mpirun.
Node jobs have started, but one host couldn’t connect back to mpirun:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
9139.psc_skt_connect: Error connecting to socket: No route to host
<nodename> Cannot connect to mpirun within 60 seconds.
MPIRUN: Some node programs ended prematurely without connecting to
mpirun.
MPIRUN: No connection received from 1 node process on node
<nodename>
Node jobs have started, both hosts couldn’t connect back to mpirun:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100
9158.psc_skt_connect: Error connecting to socket: No route to host
<nodename> Cannot connect to mpirun within 60 seconds.
6083.psc_skt_connect: Error connecting to socket: No route to host
<nodename> Cannot connect to mpirun within 60 seconds.
MPIRUN: All node programs ended prematurely without connecting to
mpirun.
One of the node programs quit unexpectedly:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 1000000 1000000
MPIRUN: <nodename> node program unexpectedly quit: Exiting.
The quiescence detected message is printed when an MPI job does not seem to be making progress. The default timeout is 900 seconds. After this length of time, all the node processes will be terminated. This timeout can be extended or disabled with the -quiescence-timeout option in mpirun.
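For example, to raise the timeout to 30 minutes, assuming the value is given in seconds as with the 900-second default (the program and host file here are placeholders):
$ mpirun -np 2 -m ~/tmp/q -quiescence-timeout 1800 mpi_latency 100 100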