This paper is intended to help you tune and debug the performance of the IBM
pSeries® High Performance Switch (HPS) on IBM Cluster 1600 systems. It is not intended to be
a comprehensive guide, but rather to help in initial tuning and debugging of performance issues.
Additional detailed information on the materials presented here can be found in sources noted in
the text and listed in section 7.0.
This paper assumes an understanding of MPI and AIX 5L™, and that you are familiar with and
have access to the Hardware Management Console (HMC) for pSeries systems.
This paper is divided into four sections. The first deals with HPS-specific tunables for tuning the
HPS subsystems. The second section deals with tuning AIX 5L and its components for optimal
performance of the HPS system. The third section deals with tuning various system daemons in
both AIX 5L and cluster environments to prevent impact on high-performance parallel
applications. The final section deals with debugging performance problems on the HPS.
Before debugging a performance problem in the HPS, review the HPS and AIX 5L tuning as well
as daemon controls. Many problems are specifically related to these subsystems. If a
performance problem persists after you follow the instructions in the debugging section, call IBM
service for additional tools and help.
We want to thank the following people in the IBM Poughkeepsie development organization for
their help in writing this paper:
Robert Blackmore
George Chochia
Frank Johnston
Bernard King-Smith
John Lewars
Steve Martin
Fernando Pizzano
Bill Tuel
Richard Treumann
2.0 Tunables and settings for switch software
To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads
and for IP-based workloads. This section reviews the shell variables that are most often used for
performance tuning. For a complete list of tunables and their usage, see the documentation listed
in section 7 of this paper.
2.1 MPI tunables for Parallel Environment
The following sections list the most common MPI tunables for applications that use the HPS.
Along with each tunable is a description of the variable, what it is used for, and how to set it.
The MP_EAGER_LIMIT variable tells the MPI transport protocol to use the "eager" mode for
messages less than or equal to the specified size. Under the "eager" mode, the sender sends the
message without knowing if the matching receive has actually been posted by the destination
task. For messages larger than the EAGER_LIMIT, a rendezvous must be used to confirm that
the matching receive has been posted.
The sending task does not have to wait for an okay from the receiver before sending the data, so
the effective start-up cost for a small message is lower in “eager” mode. As a result, any
messages that are smaller than the EAGER_LIMIT are typically faster, especially if the
corresponding receive has already been posted. If the receive has not been posted, the transport
incurs an extra copy cost on the target, because data is staged through the early-arrival buffers.
However, the overall time to send a small message might still be less in "eager" mode.
Well-designed MPI applications often try to post each MPI_RECV before the message is expected, but
because tasks of a parallel job are not in lock step, most applications have occasional early
arrivals.
The maximum message size for the “eager” protocol is currently 65536 bytes, although the
default value is lower. An application for which a significant fraction of the MPI messages are
less than 65536 bytes might see a performance benefit from setting MP_EAGER_LIMIT. If
MP_EAGER_LIMIT is increased above the default value, it might also be necessary to increase
MP_BUFFER_MEM, which determines the amount of memory available for early-arrival
buffers. Higher “eager” limits or larger task counts either demand more buffer memory or reduce
the number of unlimited “eager” messages that can be outstanding, and therefore can also impact
performance.
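As a minimal sketch of the settings discussed above, the following shell fragment caps a requested eager limit at the 65536-byte maximum and exports it, together with a buffer-memory setting, before a job is launched. The helper name clamp_eager_limit, the 40960-byte request, and the MP_BUFFER_MEM value are illustrative assumptions rather than recommendations; check the Parallel Environment documentation for the values and formats your release accepts.

```shell
# Maximum message size for the "eager" protocol, as stated in the text.
EAGER_MAX=65536

# clamp_eager_limit BYTES: print BYTES, capped at EAGER_MAX (hypothetical helper).
clamp_eager_limit() {
    if [ "$1" -gt "$EAGER_MAX" ]; then
        echo "$EAGER_MAX"
    else
        echo "$1"
    fi
}

# Assumed example: most messages in the application are 40 KB or smaller.
MP_EAGER_LIMIT=$(clamp_eager_limit 40960)
export MP_EAGER_LIMIT

# Assumed value, in bytes: early-arrival buffer memory raised with the eager limit.
MP_BUFFER_MEM=67108864
export MP_BUFFER_MEM

echo "MP_EAGER_LIMIT=$MP_EAGER_LIMIT MP_BUFFER_MEM=$MP_BUFFER_MEM"
```

Exporting the variables in the shell that starts the job is enough for them to reach the MPI tasks in a typical setup; the clamp only guards against requesting more than the protocol allows.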
The MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL variables control how
often the protocol code checks whether data that was previously sent is assumed to be lost and
needs to be retransmitted. When the values are larger, this checking is done less often. There are
two different environment variables because the check can be done by an MPI/LAPI service
thread, and from within the MPI/LAPI polling code that is invoked when the application makes
blocking MPI calls.
MP_POLLING_INTERVAL specifies the number of microseconds an MPI/LAPI service thread
should wait (sleep) before it checks whether any data previously sent by the MPI task needs to be
retransmitted. MP_RETRANSMIT_INTERVAL specifies the number of passes through the
internal MPI/LAPI polling routine between checks of whether any data needs to be
resent. When the switch fabric, adapters, and nodes are operating properly, data that is sent
arrives intact, and the receiver sends the source task an acknowledgment for the data. If the
sending task does not receive such an acknowledgment within a reasonable amount of time
(determined by the variable MP_RETRANSMIT_INTERVAL), it assumes the data has been lost
and tries to resend it.
Sometimes when many MPI tasks share the switch adapters, switch fabric, or both, the time it
takes to send a message and receive an acknowledgment is longer than the library expects. In this
case, data might be retransmitted unnecessarily. Increasing the values of
MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL reduces the likelihood of
unnecessary retransmission but increases the time a job is delayed when a packet is actually
lost.
You can improve application performance by allowing a task that is sending a message shorter
than the “eager” limit to return the send buffer to the application before the message has reached
its destination, rather than forcing the sending task to wait until the data has actually reached the
receiving task and the acknowledgement has been returned. To allow immediate return of the
send buffer to the application, LAPI attempts to make a copy of the data in case it must be
retransmitted later (unlikely but not impossible). LAPI copies the data into a retransmit buffer
(REXMIT_BUF) if one is available. The MP_REXMIT_BUF_SIZE and
MP_REXMIT_BUF_CNT environment variables control the size and number of the retransmit
buffers allocated by each task.
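Because every task allocates its own retransmit buffers, it can help to estimate the memory cost before raising either variable. The following sketch multiplies size by count; the helper name rexmit_bytes and both values are hypothetical, and it assumes MP_REXMIT_BUF_SIZE is expressed in bytes.

```shell
# rexmit_bytes SIZE COUNT: memory one task sets aside for retransmit buffers
# (hypothetical helper; assumes SIZE is in bytes).
rexmit_bytes() {
    echo $(( $1 * $2 ))
}

# Assumed example values, not recommendations.
MP_REXMIT_BUF_SIZE=16384
MP_REXMIT_BUF_CNT=128
export MP_REXMIT_BUF_SIZE MP_REXMIT_BUF_CNT

echo "retransmit buffers per task: $(rexmit_bytes "$MP_REXMIT_BUF_SIZE" "$MP_REXMIT_BUF_CNT") bytes"
```

Multiply the per-task figure by the number of tasks on a node to see the total memory the node gives up to retransmit buffers.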
The POWER4™ and POWER4+™ models of the pSeries 690 have more than one multi-chip
module (MCM). An MCM contains eight CPUs and frequently has two local memory cards. On
these systems, application performance can improve when each CPU and the memory it accesses
are on the same MCM.
Setting the AIX MEMORY_AFFINITY environment variable to MCM tells the operating system
to attempt to allocate the memory from within the MCM containing the processor that made the
request. If memory is available on the MCM containing the CPU, the request is usually granted.
If memory is not available on the local MCM, but is available on a remote MCM, the memory is
taken from the remote MCM. (Lack of local memory does not cause the job to fail.)
Setting MP_TASK_AFFINITY to SNI tells the Parallel Operating Environment (POE) to bind each
task to the MCM containing the HPS adapter it will use, so that the adapter, CPU, and memory
used by any task are all local to the same MCM. To prevent multiple tasks from sharing the same
CPU, do not set MP_TASK_AFFINITY to SNI if more than four tasks share any HPS adapter.
If more than four tasks share any HPS adapter, set MP_TASK_AFFINITY to MCM, which allows
each MPI task to use CPUs and memory from the same MCM, even if the adapter is on a remote
MCM. If MP_TASK_AFFINITY is set to either MCM or SNI, MEMORY_AFFINITY should be
set to MCM.
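The affinity rule above can be sketched as a small shell helper that picks SNI or MCM from the number of tasks sharing an adapter and sets MEMORY_AFFINITY to match. The function name choose_task_affinity and the task count are illustrative assumptions.

```shell
# choose_task_affinity TASKS_PER_ADAPTER: print the MP_TASK_AFFINITY value
# suggested by the text (hypothetical helper): SNI for up to four tasks per
# HPS adapter, MCM when more than four tasks share an adapter.
choose_task_affinity() {
    if [ "$1" -gt 4 ]; then
        echo MCM
    else
        echo SNI
    fi
}

# Assumed example: eight tasks share each HPS adapter.
TASKS_PER_ADAPTER=8

MP_TASK_AFFINITY=$(choose_task_affinity "$TASKS_PER_ADAPTER")
# Either task-affinity setting should be paired with MEMORY_AFFINITY=MCM.
MEMORY_AFFINITY=MCM
export MP_TASK_AFFINITY MEMORY_AFFINITY

echo "MP_TASK_AFFINITY=$MP_TASK_AFFINITY MEMORY_AFFINITY=$MEMORY_AFFINITY"
```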
The MP_CSS_INTERRUPT variable allows you to control interrupts triggered by packet arrivals.
Setting this variable to no implies that the application should run in polling mode. This setting is
appropriate for applications that have mostly synchronous communication. Even applications that
make heavy use of MPI_ISEND/MPI_IRECV should be considered synchronous unless there is
significant computation between the ISEND/IRECV postings and the MPI_WAITALL. The
default value for MP_CSS_INTERRUPT is no.
For applications with an asynchronous communication pattern (one that uses non-blocking MPI
calls), it might be more appropriate to set this variable to yes. Setting MP_CSS_INTERRUPT to
yes can cause your application to be interrupted when new packets arrive, which could be
helpful if a receiving MPI task is likely to be in the middle of a long numerical computation at the
time when data from a remote blocking send arrives.
2.2 MPI-IO
The most effective use of MPI-IO is when an application takes advantage of file views and
collective operations to read or write a file in which data for each task is dispersed across the file.
To simplify, we focus on read, but write is similar.
An example is reading a matrix with application-wide scope from a single file, with each task
needing a different fragment of that matrix. To bring in the fragment needed for each task,
several disjoint chunks must be read. If every task were to do a POSIX read() of each chunk, the
GPFS file system would handle it correctly. However, because each read() is independent, there is little
chance to apply an effective strategy.
When the same set of reads is done with collective MPI-IO, every task specifies all the chunks it
needs to one MPI-IO call. Because the call is collective, the requirements of all the tasks are
known at one time. As a result, MPI can use a broad strategy for doing the I/O.
When MPI-IO is used but each call to read or write a file is local or specifies only a single chunk
of data, there is much less chance for MPI-IO to do anything more than a simple POSIX read()
would do. Also, when the file is organized by task rather than globally, there is less MPI-IO can
do to help. This is the case when each task's fragment of the matrix is stored contiguously in the
file rather than having the matrix organized as a whole.
Sometimes MPI-IO is used in an application as if it were basic POSIX read/write, either because
there is no need for more complex read/write patterns or because the application was previously
hand-optimized to use POSIX read/write. In such cases, it is often better to use the
IBM_largeblock_io hint on MPI_FILE_OPEN. By default, the PE/MPI implementation of
MPI-IO tries to take advantage of the information the MPI-IO interface can provide to do file I/O more
efficiently. If the MPI-IO calls do not use MPI datatypes and file views or collective I/O, there
might not be enough information to do any optimization. The hint shuts off the attempt to
optimize and makes MPI-IO calls behave much like the POSIX I/O calls that GPFS already
handles well.
2.3 chgsni command
The chgsni command is used to tune the HPS drivers by changing a list of settings. The basic
syntax for chgsni is:
chgsni -l <HPS device name> -a <variable>=<new value>
Multiple variables can be set in a single command.
The key variables to set for TCP/IP are spoolsize and rpoolsize. To change the send IP pool for
HPS, change the spoolsize parameter. To change the receive IP pool, change the rpoolsize
parameter.
The IP buffer pools are allocated in partitions of up to 16 MB each. Each increase in the buffer
that crosses a 16 MB boundary allocates an additional partition. If you are running a pSeries 655
system with two HPS links, allocate two partitions (32 MB) of buffer space. If you are running a
p690+ system with eight HPS links, set the buffer size to 128 MB. If you are running in an LPAR
and have a different number of links, scale the buffer size accordingly.
IP buffer settings are global across all HPS links in an AIX 5L partition. This means you only
need to change the setting on one interface. All other interfaces get the new setting. In other
words, if you run the chgsni command against sn0, the new setting takes effect under sn1, sn2,
and so on, up to the number of links in a node or partition. The following command sets the IP
buffer pools for either a p655 with two HPS links or a p690 LPAR:
chgsni -l sni0 -a spoolsize=33554432 -a rpoolsize=33554432
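The sizing guidance above (two links give 32 MB, eight links give 128 MB) amounts to one 16 MB partition per HPS link. The following sketch computes the pool size for a given link count and prints the matching chgsni command without running it. The helper names pool_bytes and chgsni_pool_cmd are hypothetical, and the one-partition-per-link rule is inferred from the two examples in the text.

```shell
# Size of one IP buffer pool partition (16 MB), as described in the text.
PARTITION_BYTES=$((16 * 1024 * 1024))

# pool_bytes LINKS: pool size in bytes, assuming one partition per HPS link.
pool_bytes() {
    echo $(( $1 * PARTITION_BYTES ))
}

# chgsni_pool_cmd DEVICE LINKS: print (do not run) the chgsni command that
# sets both the send and receive pools for the given number of links.
chgsni_pool_cmd() {
    size=$(pool_bytes "$2")
    echo "chgsni -l $1 -a spoolsize=$size -a rpoolsize=$size"
}

chgsni_pool_cmd sni0 2   # p655 with two HPS links
chgsni_pool_cmd sni0 8   # p690+ with eight HPS links
```

Printing the command first makes it easy to review the value before applying it on a live partition.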
To see the values for the current chgsni settings, use the lsattr command. The following example
shows the settings on the HPS sni0 link.
> lsattr -E -l sni0
base_laddr      0x3fff9c00000      base address              False
driver_debug    0x0                Device driver trace level True
int_level       1040               interrupt level           False
ip_kthread      0x1                IP kthread flag           True
ip_trc_lvl      0x00001111         IP trace level            True
num_windows     16                 Number of windows         False
perf_level      0x00000000         Device driver perf level  True
rdma_xlat_limit 0x8000000000000000 RDMA translation limit    True
rfifosize       0x1000000          receive fifo size         False
rpoolsize       0x02000000         IP receive pool size      True
spoolsize       0x02000000         IP send pool size         True
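The pool sizes in this listing are reported in hexadecimal, which makes them awkward to compare with the decimal values passed to chgsni. As a rough sketch, the following shell function reads lsattr-style output on standard input and prints the two pool sizes in megabytes. The function name pool_sizes_mb is hypothetical, and the sample lines are copied from the listing above so that the fragment runs without an HPS adapter.

```shell
# pool_sizes_mb: read "lsattr -E -l <device>" output on stdin and print the
# IP pool sizes in MB (hypothetical helper). The shell's arithmetic expansion
# converts the hexadecimal value.
pool_sizes_mb() {
    while read -r name value rest; do
        case "$name" in
            rpoolsize|spoolsize)
                echo "$name $(( $value / 1048576 )) MB"
                ;;
        esac
    done
}

# Sample input taken from the listing; on a live system the equivalent would
# be:  lsattr -E -l sni0 | pool_sizes_mb
pool_sizes_mb <<'EOF'
rfifosize 0x1000000 receive fifo size False
rpoolsize 0x02000000 IP receive pool size True
spoolsize 0x02000000 IP send pool size True
EOF
```

With the sample input, both pools come out at 32 MB, which matches the 33554432-byte value used in the chgsni example.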
3.0 Tunables and settings for AIX 5L
Several settings in AIX 5L impact the performance of the HPS. These include the IP and
memory subsystems. The following sections provide a brief overview of the most commonly
used tunables. For more information about these subjects, see the AIX 5L tuning manuals listed
in section 7.0.
3.1 IP tunables
When defining subnets for HPS links, it is easier to debug performance problems if there is only
one HPS interface for each IP subnet. When running with multiple interfaces for each subnet,
applications do not typically control which interface is used to send or receive packets. This can
make connectivity problems more difficult to debug. For example, the RSCT cthats subsystem
that polls interfaces to assert connectivity might have problems identifying which interfaces are
down when multiple interfaces are on the same IP subnet.
The IP subsystem has several variables that impact IP performance over HPS. The following
table contains recommended initial settings used for TCP/IP. For more information about these
variables, see the AIX 5L manuals listed in section 7.0.
AIX 5L defines all virtual memory pages allocated for most file systems as permanent storage
pages. Files mapped from the GPFS file cache are an exception. A subset of permanent storage
pages are further defined as client pages (such as NFS and JFS2 mapped files). All permanent
storage pages can be referred to as the file cache. The size of the file cache tends to grow unless
an increase in computational page allocations (for example, application data stored in memory)
causes the operating system to run low on available virtual memory frames, or the files being
memory mapped become unavailable (for example, a file system becomes unmounted).
The overhead in maintaining the file cache can impact the performance of large parallel
applications. Much of the overhead is associated with the sync() system call (by default, run
every minute from the syncd daemon). The sync() system call scans all of the pages in the file
cache to determine if any pages have been modified since the last sync(), and therefore need to be
written to disk. This type of delay affects larger parallel applications more severely, and those
with frequent synchronizing collective calls (such as MPI_ALLTOALL or MPI_BARRIER) are
affected the most. A synchronizing operation like MPI_ALLTOALL can be completed only after
the slowest task involved has reached it. Unless an effort is made to synchronize the sync
daemons across a cluster, the sync() system call runs at different times across all of the LPARs.
Unless the time between synchronizing operations for the application is large compared to the
time required for a sync(), the random delays from sync() operations on many LPARs can slow
the entire application. To address this problem, tune the file cache to reduce the amount of work
each sync() must do.
To determine if the system is impacted by an increasing file cache, run the vmstat -v command
and check the numperm and numclient percentages. Here is an example:
vmstat -v
[. . .]
0.6 numperm percentage
10737 file pages
0.0 compressed percentage
0 compressed pages
0.0 numclient percentage
[. . .]
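To watch these percentages over time, the check can be scripted. The following sketch extracts a percentage from vmstat -v style output and compares it with a limit. The function names file_cache_pct and exceeds, and the 20 percent limit, are illustrative assumptions; the sample text mirrors the output shown above so that the fragment runs anywhere.

```shell
# file_cache_pct LABEL: read "vmstat -v" output on stdin and print the value
# from the "<value> LABEL percentage" line (hypothetical helper).
file_cache_pct() {
    awk -v label="$1" '$2 == label && $3 == "percentage" { print $1 }'
}

# exceeds PCT LIMIT: succeed when PCT is greater than LIMIT.
# awk does the comparison because the percentages are decimal fractions.
exceeds() {
    awk -v p="$1" -v l="$2" 'BEGIN { exit !(p > l) }'
}

# Sample input; on a live system use:  vmstat -v | file_cache_pct numperm
SAMPLE='  0.6 numperm percentage
10737 file pages
  0.0 numclient percentage'

LIMIT=20   # assumed alert threshold, in percent
numperm=$(printf '%s\n' "$SAMPLE" | file_cache_pct numperm)

if exceeds "$numperm" "$LIMIT"; then
    echo "numperm ${numperm}% is above ${LIMIT}%"
else
    echo "numperm ${numperm}% is within ${LIMIT}%"
fi
```

Running the same check from cron on each LPAR is one simple way to see whether the file cache grows between job runs.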
If the system tends to move towards a high numperm level, here are a couple of approaches to
address performance concerns:
• Use vmo tunables to tune page replacement. By decreasing the maxperm percentage
and maxclient percentage, you can try to force page replacement to steal permanent
and client pages before computational pages. Read the vmo man page before changing
these tunables, and test any vmo changes incrementally. Always consult IBM service
before changing the vmo tunables strict_maxperm and strict_maxclient.
• If most of the permanent file pages allocated are listed as being client pages, these might
be NFS pages. If NFS accesses are driving the file cache up, consider periodically
unmounting the NFS file systems (for example, use automount to mount file systems as
they are required).
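As a cautious sketch of the first approach, the following helper only prints a candidate vmo command so that it can be reviewed before anyone runs it. The function name vmo_cache_cmd and the example percentages are hypothetical. The `-o tunable=value` form, the `maxperm%` and `maxclient%` tunable names, and the check that maxclient% does not exceed maxperm% reflect our understanding of vmo and should be confirmed against the man page for your AIX 5L level.

```shell
# vmo_cache_cmd MAXPERM MAXCLIENT: print (do not run) a vmo command that
# lowers the file cache limits (hypothetical helper). Rejects a maxclient%
# larger than maxperm%, an assumed constraint to verify in the vmo man page.
vmo_cache_cmd() {
    maxperm=$1
    maxclient=$2
    if [ "$maxclient" -gt "$maxperm" ]; then
        echo "maxclient% ($maxclient) must not exceed maxperm% ($maxperm)" >&2
        return 1
    fi
    echo "vmo -o maxperm%=$maxperm -o maxclient%=$maxclient"
}

# Placeholder percentages for illustration only; lower the limits in small
# steps and measure after each change.
vmo_cache_cmd 30 30
```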
3.3 svmon and vmstat commands
The svmon and vmstat commands are very helpful in analyzing problems with virtual memory.
To find an optimal problem size, it helps to understand how much memory is available for an
HPS application before paging starts. In addition, if an application uses large pages, it must know
how much of that resource is available. Because processes compete for the memory, and memory
allocation changes over time, you need to understand the process requirements. The following
sections introduce how to use the svmon and vmstat commands for debugging. For more
information, see the AIX 5L performance and tuning guide and the related man pages.
