
IBM eServer pSeries
High Performance Switch
Tuning and Debug Guide
April 2005
IBM Systems and Technology Group
Cluster Performance Department
Poughkeepsie, NY
Contents
1.0 Introduction.....................................................................................................4
2.0 Tunables and settings for switch software...................................................... 5
2.1 MPI tunables for Parallel Environment........................................................5
2.1.1 MP_EAGER_LIMIT..............................................................................5
2.1.2 MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL......... 5
2.1.3 MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT......................6
2.1.4 MEMORY_AFFINITY...........................................................................6
2.1.5 MP_TASK_AFFINITY........................................................................... 7
2.1.6 MP_CSS_INTERRUPT........................................................................ 7
2.2 MPI-IO ........................................................................................................ 7
2.3 chgsni command.........................................................................................8
3.0 Tunables and settings for AIX 5L ................................................................... 9
3.1 IP tunables.................................................................................................. 9
3.2 File cache ................................................................................................... 9
3.3 svmon and vmstat commands..................................................................10
3.3.1 svmon.................................................................................................11
3.3.2 vmstat.................................................................................................12
3.4 Large page sizing......................................................................................13
3.5 Large pages and IP support......................................................................15
3.6 Memory affinity for a single LPAR............................................................. 15
3.7 Amount of memory available....................................................................15
3.8 Debug settings in the AIX 5L kernel.......................................................... 16
4.0 Daemon configuration..................................................................................16
4.1 RSCT daemons........................................................................................16
4.2 LoadLeveler daemons..............................................................................17
4.2.1 Reducing the number of daemons running ........................................17
4.2.2 Reducing daemon communication and placing daemons on a switch.......17
4.2.3 Reducing logging................................................................................17
4.3 Settings for AIX 5L threads....................................................................... 18
4.4 AIX 5L mail, spool, and sync daemons..................................................... 18
4.5 Placement of POE managers and LoadLeveler scheduler.......................18
5.0 Debug settings and data collection tools......................................................19
5.1 lsattr tuning...............................................................................................19
5.1.1 driver_debug setting...........................................................................19
5.1.2 ip_trc_lvl setting..................................................................................19
5.2 CPUs and frequency................................................................................. 19
5.3 Affinity LPARs...........................................................................................20
5.4 Small Real Mode Address Region on HMC GUI....................................... 20
5.5 Deconfigured L3 cache............................................................................. 20
5.6 Service focal point.....................................................................................20
5.7 errpt command.......................................................................................... 21
5.8 HMC error logging.....................................................................................21
5.9 Multiple versions of MPI libraries..............................................................21
5.10 MP_PRINTENV......................................................................................22
5.11 MP_STATISTICS.................................................................................... 23
5.12 Dropped switch packets.......................................................................... 24
5.12.1 Packets dropped because of a software problem on an endpoint....24
5.12.2 Packets dropped in the ML0 interface.............................................. 26
5.12.3 Packets dropped because of a hardware problem on an endpoint...27
5.12.4 Packets dropped in the switch hardware.......................................... 28
5.13 MP_INFOLEVEL.....................................................................................28
5.14 LAPI_DEBUG_COMM_TIMEOUT.......................................................... 29
5.15 LAPI_DEBUG_PERF..............................................................................29
5.16 AIX 5L trace for daemon activity.............................................................30
6.0 Conclusions and summary........................................................................... 30
7.0 Additional reading.........................................................................................30
7.1 HPS documentation..................................................................................30
7.2 MPI documentation................................................................................... 31
7.3 AIX 5L performance guides...................................................................... 31
7.4 IBM Redbooks.......................................................................................... 31
7.5 POWER4.................................................................................................. 31
1.0 Introduction
This paper is intended to help you tune and debug the performance of the IBM pSeries® High Performance Switch (HPS) on IBM Cluster 1600 systems. It is not intended to be a comprehensive guide, but rather to help in initial tuning and debugging of performance issues. Additional detailed information on the materials presented here can be found in sources noted in the text and listed in section 7.0.
This paper assumes an understanding of MPI and AIX 5L™, and that you are familiar with and have access to the Hardware Management Console (HMC) for pSeries systems.
This paper is divided into four sections. The first deals with HPS-specific tunables for tuning the HPS subsystems. The second section deals with tuning AIX 5L and its components for optimal performance of the HPS system. The third section deals with tuning various system daemons in both AIX 5L and cluster environments to prevent impact on high-performance parallel applications. The final section deals with debugging performance problems on the HPS.
Before debugging a performance problem in the HPS, review the HPS and AIX 5L tuning as well as daemon controls. Many problems are specifically related to these subsystems. If a performance problem persists after you follow the instructions in the debugging section, call IBM service for additional tools and help.
We want to thank the following people in the IBM Poughkeepsie development organization for their help in writing this paper:
Robert Blackmore, George Chochia, Frank Johnston, Bernard King-Smith, John Lewars, Steve Martin, Fernando Pizzano, Bill Tuel, and Richard Treumann.
2.0 Tunables and settings for switch software
To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads and for IP-based workloads. This section reviews the shell variables that are most often used for performance tuning. For a complete list of tunables and their usage, see the documentation listed in section 7.0 of this paper.
2.1 MPI tunables for Parallel Environment
The following sections list the most common MPI tunables for applications that use the HPS. Along with each tunable is a description of the variable, what it is used for, and how to set it appropriately.
2.1.1 MP_EAGER_LIMIT
The MP_EAGER_LIMIT variable tells the MPI transport protocol to use the “eager” mode for messages less than or equal to the specified size. Under the “eager” mode, the sender sends the message without knowing if the matching receive has actually been posted by the destination task. For messages larger than the EAGER_LIMIT, a rendezvous must be used to confirm that the matching receive has been posted.
The sending task does not have to wait for an okay from the receiver before sending the data, so the effective start-up cost for a small message is lower in “eager” mode. As a result, any messages that are smaller than the EAGER_LIMIT are typically faster, especially if the corresponding receive has already been posted. If the receive has not been posted, the transport incurs an extra copy cost on the target, because data is staged through the early-arrival buffers. However, the overall time to send a small message might still be less in “eager” mode. Well-designed MPI applications often try to post each MPI_RECV before the message is expected, but because tasks of a parallel job are not in lock step, most applications have occasional early arrivals.
The maximum message size for the “eager” protocol is currently 65536 bytes, although the default value is lower. An application for which a significant fraction of the MPI messages are less than 65536 bytes might see a performance benefit from setting MP_EAGER_LIMIT. If MP_EAGER_LIMIT is increased above the default value, it might also be necessary to increase MP_BUFFER_MEM, which determines the amount of memory available for early-arrival buffers. Higher “eager” limits or larger task counts either demand more buffer memory or reduce the number of “eager” messages that can be outstanding, and therefore can also impact performance.
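A minimal sketch of how these tunables might appear in a POE job script follows; the executable name, task count, and MP_BUFFER_MEM value are illustrative assumptions rather than recommended settings:

export MP_EAGER_LIMIT=65536      # use the 65536-byte maximum eager size
export MP_BUFFER_MEM=67108864    # 64 MB of early-arrival buffer space (illustrative)
poe ./my_app -procs 64           # my_app and the task count are placeholders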
2.1.2 MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL
The MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL variables control how often the protocol code checks whether data that was previously sent is assumed to be lost and needs to be retransmitted. When the values are larger, this checking is done less often. There are two different environment variables because the check can be done by an MPI/LAPI service
thread, and from within the MPI/LAPI polling code that is invoked when the application makes blocking MPI calls.
MP_POLLING_INTERVAL specifies the number of microseconds an MPI/LAPI service thread should wait (sleep) before it checks whether any data previously sent by the MPI task needs to be retransmitted. MP_RETRANSMIT_INTERVAL specifies the number of passes through the internal MPI/LAPI polling routine between checks for whether any data needs to be resent. When the switch fabric, adapters, and nodes are operating properly, data that is sent arrives intact, and the receiver sends the source task an acknowledgment for the data. If the sending task does not receive such an acknowledgment within a reasonable amount of time (determined by the variable MP_RETRANSMIT_INTERVAL), it assumes the data has been lost and tries to resend it.
Sometimes when many MPI tasks share the switch adapters, switch fabric, or both, the time it takes to send a message and receive an acknowledgment is longer than the library expects. In this case, data might be retransmitted unnecessarily. Increasing the values of MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL decreases the likelihood of unnecessary retransmission but increases the time a job is delayed when a packet is actually dropped.
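For example, on a heavily shared adapter or fabric you might lengthen both intervals; the values below are illustrative assumptions only, and the defaults and valid ranges are documented in the Parallel Environment manuals:

export MP_POLLING_INTERVAL=800000     # microseconds the service thread sleeps between checks
export MP_RETRANSMIT_INTERVAL=20000   # polling passes between retransmit checks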
2.1.3 MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT
You can improve application performance by allowing a task that is sending a message shorter than the “eager” limit to return the send buffer to the application before the message has reached its destination, rather than forcing the sending task to wait until the data has actually reached the receiving task and the acknowledgement has been returned. To allow immediate return of the send buffer to the application, LAPI attempts to make a copy of the data in case it must be retransmitted later (unlikely but not impossible). LAPI copies the data into a retransmit buffer (REXMIT_BUF) if one is available. The MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT environment variables control the size and number of the retransmit buffers allocated by each task.
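A hedged example of enlarging the retransmit buffers follows; both values are assumptions chosen for illustration and should be checked against the Parallel Environment documentation:

export MP_REXMIT_BUF_SIZE=65536    # bytes in each retransmit buffer
export MP_REXMIT_BUF_CNT=256       # retransmit buffers allocated per task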
2.1.4 MEMORY_AFFINITY
The POWER4™ and POWER4+™ models of the pSeries 690 have more than one multi-chip module (MCM). An MCM contains eight CPUs and frequently has two local memory cards. On these systems, application performance can improve when each CPU and the memory it accesses are on the same MCM.
Setting the AIX MEMORY_AFFINITY environment variable to MCM tells the operating system to attempt to allocate the memory from within the MCM containing the processor that made the request. If memory is available on the MCM containing the CPU, the request is usually granted. If memory is not available on the local MCM, but is available on a remote MCM, the memory is taken from the remote MCM. (Lack of local memory does not cause the job to fail.)
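For example, in the environment from which the parallel job is launched:

export MEMORY_AFFINITY=MCM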
2.1.5 MP_TASK_AFFINITY
Setting MP_TASK_AFFINITY to SNI tells the Parallel Operating Environment (POE) to bind each task to the MCM containing the HPS adapter it will use, so that the adapter, CPU, and memory used by any task are all local to the same MCM. To prevent multiple tasks from sharing the same CPU, do not set MP_TASK_AFFINITY to SNI if more than four tasks share any HPS adapter. If more than four tasks share any HPS adapter, set MP_TASK_AFFINITY to MCM, which allows each MPI task to use CPUs and memory from the same MCM, even if the adapter is on a remote MCM. If MP_TASK_AFFINITY is set to either MCM or SNI, MEMORY_AFFINITY should be set to MCM.
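A short sketch combining the two affinity settings for the case in which no more than four tasks share an HPS adapter (substitute MCM for SNI otherwise):

export MP_TASK_AFFINITY=SNI
export MEMORY_AFFINITY=MCM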
2.1.6 MP_CSS_INTERRUPT
The MP_CSS_INTERRUPT variable allows you to control interrupts triggered by packet arrivals. Setting this variable to no implies that the application should run in polling mode. This setting is appropriate for applications that have mostly synchronous communication. Even applications that make heavy use of MPI_ISEND/MPI_IRECV should be considered synchronous unless there is significant computation between the ISEND/IRECV postings and the MPI_WAITALL. The default value for MP_CSS_INTERRUPT is no.
For applications with an asynchronous communication pattern (one that uses non-blocking MPI calls), it might be more appropriate to set this variable to yes. Setting MP_CSS_INTERRUPT to yes can cause your application to be interrupted when new packets arrive, which could be helpful if a receiving MPI task is likely to be in the middle of a long numerical computation at the time when data from a blocking send on a remote task arrives.
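For example, to run an application with an asynchronous communication pattern in interrupt mode:

export MP_CSS_INTERRUPT=yes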
2.2 MPI-IO
The most effective use of MPI-IO is when an application takes advantage of file views and collective operations to read or write a file in which data for each task is dispersed across the file. To simplify, we focus on read, but write is similar.
An example is reading a matrix with application-wide scope from a single file, with each task needing a different fragment of that matrix. To bring in the fragment needed for each task, several disjoint chunks must be read. If every task were to do a POSIX read() of each chunk, the GPFS file system would handle it correctly. However, because each read() is independent, there is little chance to apply an effective strategy.
When the same set of reads is done with collective MPI-IO, every task specifies all the chunks it needs to one MPI-IO call. Because the call is collective, the requirements of all the tasks are known at one time. As a result, MPI can use a broad strategy for doing the I/O.
When MPI-IO is used but each call to read or write a file is local or specifies only a single chunk of data, there is much less chance for MPI-IO to do anything more than a simple POSIX read() would do. Also, when the file is organized by task rather than globally, there is less that MPI-IO can do to help. This is the case when each task's fragment of the matrix is stored contiguously in the file rather than having the matrix organized as a whole.
Sometimes MPI-IO is used in an application as if it were basic POSIX read/write, either because there is no need for more complex read/write patterns or because the application was previously hand-optimized to use POSIX read/write. In such cases, it is often better to use the IBM_largeblock_io hint on MPI_FILE_OPEN. By default, the PE/MPI implementation of MPI-IO tries to take advantage of the information the MPI-IO interface can provide to do file I/O more efficiently. If the MPI-IO calls do not use MPI datatypes and file views or collective I/O, there might not be enough information to do any optimization. The hint shuts off the attempt to optimize and makes MPI-IO calls behave much like the POSIX I/O calls that GPFS already handles well.
2.3 chgsni command
The chgsni command is used to tune the HPS drivers by changing a list of settings. The basic syntax for chgsni is:
chgsni -l <HPS device name> -a <variable>=<new value>
Multiple variables can be set in a single command. The key variables to set for TCP/IP are spoolsize and rpoolsize. To change the send IP pool for the HPS, change the spoolsize parameter. To change the receive IP pool, change the rpoolsize parameter.
The IP buffer pools are allocated in partitions of up to 16 MB each. Each increase in the buffer that crosses a 16 MB boundary allocates an additional partition. If you are running a pSeries 655 system with two HPS links, allocate two partitions (32 MB) of buffer space. If you are running a p690+ system with eight HPS links, set the buffer size to 128 MB. If you are running in an LPAR and have a different number of links, scale the buffer size accordingly.
IP buffer settings are global across all HPS links in an AIX 5L partition. This means you only need to change the setting on one interface. All other interfaces get the new setting. In other words, if you run the chgsni command against sn0, the new setting takes effect under sn1, sn2, and so on, up to the number of links in a node or partition. The following command sets the IP buffer pools for either a p655 with two HPS links or a p690 LPAR:
chgsni -l sni0 -a spoolsize=33554432 -a rpoolsize=33554432
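As a further illustration, a p690+ LPAR with eight HPS links would need eight 16 MB partitions (128 MB, or 134217728 bytes) for each pool:

chgsni -l sni0 -a spoolsize=134217728 -a rpoolsize=134217728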
To see the values for the current chgsni settings, use the lsattr command. The following example shows the settings on the HPS sni0 link.
> lsattr -E -l sni0
base_laddr      0x3fff9c00000       base address               False
driver_debug    0x0                 Device driver trace level  True
int_level       1040                interrupt level            False
ip_kthread      0x1                 IP kthread flag            True
ip_trc_lvl      0x00001111          IP trace level             True
num_windows     16                  Number of windows          False
perf_level      0x00000000          Device driver perf level   True
rdma_xlat_limit 0x8000000000000000  RDMA translation limit     True
rfifosize       0x1000000           receive fifo size          False
rpoolsize       0x02000000          IP receive pool size       True
spoolsize       0x02000000          IP send pool size          True
3.0 Tunables and settings for AIX 5L
Several settings in AIX 5L impact the performance of the HPS. These include the IP and memory subsystems. The following sections provide a brief overview of the most commonly used tunables. For more information about these subjects, see the AIX 5L tuning manuals listed in section 7.0.
3.1 IP tunables
When defining subnets for HPS links, it is easier to debug performance problems if there is only one HPS interface for each IP subnet. When running with multiple interfaces for each subnet, applications do not typically control which interface is used to send or receive packets. This can make connectivity problems more difficult to debug. For example, the RSCT cthats subsystem that polls interfaces to assert connectivity might have problems identifying which interfaces are down when multiple interfaces are on the same IP subnet.
The IP subsystem has several variables that impact IP performance over HPS. The following table contains recommended initial settings used for TCP/IP. For more information about these variables, see the AIX 5L manuals listed in section 7.0.
Parameter        Setting
sb_max           1310720
tcp_sendspace    655360
tcp_recvspace    655360
rfc1323          1
tcp_mssdflt      1448
ipforwarding     1
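These parameters can be set with the AIX 5L no command. The following sketch assumes AIX 5L V5.2 or later, where the -p flag makes the change persistent across reboots:

no -p -o sb_max=1310720
no -p -o tcp_sendspace=655360
no -p -o tcp_recvspace=655360
no -p -o rfc1323=1
no -p -o tcp_mssdflt=1448
no -p -o ipforwarding=1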
3.2 File cache
AIX 5L defines all virtual memory pages allocated for most file systems as permanent storage pages. Files mapped from the GPFS file cache are an exception. A subset of permanent storage pages are further defined as client pages (such as NFS and JFS2 mapped files). All permanent storage pages can be referred to as the file cache. The size of the file cache tends to grow unless an increase in computational page allocations (for example, application data stored in memory) causes the operating system to run low on available virtual memory frames, or the files being memory mapped become unavailable (for example, a file system becomes unmounted).
The overhead in maintaining the file cache can impact the performance of large parallel applications. Much of the overhead is associated with the sync() system call (by default, run every minute from the syncd daemon). The sync() system call scans all of the pages in the file cache to determine if any pages have been modified since the last sync(), and therefore need to be written to disk. This type of delay affects larger parallel applications more severely, and those with frequent synchronizing collective calls (such as MPI_ALLTOALL or MPI_BARRIER) are affected the most. A synchronizing operation like MPI_ALLTOALL can be completed only after the slowest task involved has reached it. Unless an effort is made to synchronize the sync daemons across a cluster, the sync() system call runs at different times across all of the LPARs. Unless the time between synchronizing operations for the application is large compared to the time required for a sync(), the random delays from sync() operations on many LPARs can slow the entire application. To address this problem, tune the file cache to reduce the amount of work each sync() must do.
To determine if the system is impacted by an increasing file cache, run the vmstat -v command and check the numperm and numclient percentages. Here is an example:
vmstat -v
[. . .]
      0.6 numperm percentage
    10737 file pages
      0.0 compressed percentage
        0 compressed pages
      0.0 numclient percentage
[. . .]
If the system tends to move towards a high numperm level, here are a couple of approaches to address performance concerns:

• Use vmo tunables to tune page replacement. By decreasing the maxperm percentage and maxclient percentage, you can try to force page replacement to steal permanent and client pages before computational pages (see the sketch after this list). Read the vmo man page before changing these tunables, and test any vmo changes incrementally. Always consult IBM service before changing the vmo tunables strict_maxperm and strict_maxclient.

• If most of the permanent file pages allocated are listed as being client pages, these might be NFS pages. If NFS accesses are driving the file cache up, consider periodically unmounting the NFS file systems (for example, use automount to mount file systems as they are required).
3.3 svmon and vmstat commands
The svmon and vmstat commands are very helpful in analyzing problems with virtual memory. To find an optimal problem size, it helps to understand how much memory is available for an HPS application before paging starts. In addition, if an application uses large pages, you need to know how much of that resource is available. Because processes compete for memory, and memory allocation changes over time, you need to understand the process requirements. The following sections introduce how to use the svmon and vmstat commands for debugging. For more information, see the AIX 5L performance and tuning guide and the related man pages.