Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
DS8000™
IBM®
POWER™
Redbooks®
ServeRAID™
System i™
System p™
System x™
System z™
System Storage™
TotalStorage®
The following terms are trademarks of other companies:
Java, JDBC, Solaris, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United
States, other countries, or both.
Excel, Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United
States, other countries, or both.
Intel, Itanium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
Linux® is an open source operating system developed by people all over the world. The
source code is freely available and can be used under the GNU General Public License. The
operating system is made available to users in the form of distributions from companies such
as Red Hat and Novell. Some desktop Linux distributions can be downloaded at no charge
from the Web, but the server versions typically must be purchased.
Over the past few years, Linux has made its way into the data centers of many corporations
all over the globe. The Linux operating system has become accepted by both the scientific
and enterprise user populations. Today, Linux is one of the most versatile operating systems: you can
find it on embedded devices such as firewalls and cell phones, as well as on mainframes.
Naturally, performance of the Linux operating system has become a hot topic for both
scientific and enterprise users. However, calculating a global weather forecast and hosting a
database impose different requirements on the operating system. Linux has to accommodate
all possible usage scenarios with optimal performance. The consequence of this challenge is that
most Linux distributions ship with general-purpose tuning parameters intended to accommodate all users.
IBM® has embraced Linux, and it is recognized as an operating system suitable for
enterprise-level applications running on IBM systems. Most enterprise applications are now
available on Linux, including file and print servers, database servers, Web servers, and
collaboration and mail servers.
With use of Linux in an enterprise-class server comes the need to monitor performance and,
when necessary, tune the server to remove bottlenecks that affect users. This IBM Redpaper
describes the methods you can use to tune Linux, tools that you can use to monitor and
analyze server performance, and key tuning parameters for specific server applications. The
purpose of this Redpaper is to help you understand, analyze, and tune the Linux operating system so
that it yields superior performance for any type of application you plan to run on these systems.
The tuning parameters, benchmark results, and monitoring tools discussed here were tested in our
environment on Red Hat and Novell SUSE Linux kernel 2.6 systems running on IBM System x servers
and IBM System z servers. However, the information in this Redpaper should be helpful for all Linux
hardware platforms.
How this Redpaper is structured
To help readers new to Linux or performance tuning get a fast start on the topic, we have
structured this Redpaper in the following way:
Understanding the Linux operating system
This chapter introduces the factors that influence systems performance and the way the
Linux operating system manages system resources. The reader is introduced to several
important performance metrics that are needed to quantify system performance.
Monitoring Linux performance
The second chapter introduces the various utilities that are available for Linux to measure
and analyze systems performance.
Analyzing performance bottlenecks
This chapter introduces the process of identifying and analyzing bottlenecks in the system.
Tuning the operating system
With basic knowledge of the way the operating system works and skills in a variety of
performance measurement utilities, the reader is ready to go to work and explore the
various performance tweaks available in the Linux operating system.
The team that wrote this Redpaper
This Redpaper was produced by a team of specialists from around the world working at the
International Technical Support Organization, Raleigh Center.
The team: Byron, Eduardo, Takechika
Eduardo Ciliendo is an Advisory IT Specialist working as a performance specialist on
IBM Mainframe Systems in IBM Switzerland. He has more than 10 years of experience in
computer science. Eddy studied Computer and Business Sciences at the University of
Zurich and holds a post-diploma in Japanology. Eddy is a member of the zChampion team
and holds several IT certifications including the RHCE title. As a Systems Engineer for
IBM System z™, he works on capacity planning and systems performance for z/OS® and
Linux for System z. Eddy has published several papers on systems performance and Linux.
Takechika Kunimasa is an Associate IT Architect in IBM Global Services in Japan. He studied
Electrical and Electronics Engineering at Chiba University. He has more than 10 years of
experience in the IT industry. He worked as a network engineer for five years and has been
working in Linux technical support since then. His areas of expertise include Linux on System x™,
System p™ and System z, high availability systems, networking, and infrastructure architecture
design. He is a Cisco Certified Network Professional and a Red Hat Certified Engineer.
Byron Braswell is a Networking Professional at the International Technical Support
Organization, Raleigh Center. He received a B.S. degree in Physics and an M.S. degree in
Computer Sciences from Texas A&M University. He writes extensively in the areas of
networking, application integration middleware, and personal computer software. Before
joining the ITSO, Byron worked in IBM Learning Services Development in networking
education development.
Thanks to the following people for their contributions to this project:
Margaret Ticknor
Carolyn Briscoe
International Technical Support Organization, Raleigh Center
Roy Costa
Michael B Schwartz
Frieder Hamm
International Technical Support Organization, Poughkeepsie Center
Christian Ehrhardt
Martin Kammerer
IBM Böblingen, Germany
Erwan Auffret
IBM France
Become a published author
Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with
specific products or solutions, while getting hands-on experience with leading-edge
technologies. You will have the opportunity to team with IBM technical professionals,
Business Partners, and Clients.
Your efforts will help increase product acceptance and customer satisfaction. As a bonus,
you'll develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this
Redpaper or other Redbooks® in one of the following ways:
Use the online Contact us review redbook form found at:
ibm.com/redbooks
Send your comments in an e-mail to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Understanding the Linux operating system
We begin this Redpaper with a quick overview of how the Linux operating system manages its
tasks and interacts with its hardware resources. Performance tuning is a difficult task that
requires in-depth understanding of the hardware, operating system, and application.
If performance tuning were simple, the parameters we are about to explore would be
hard-coded into the firmware or the operating system and you would not be reading these
lines. However, as shown in the following figure, server performance is affected by multiple
factors.
(Figure layers, from top to bottom: Applications, Libraries, Kernel, Drivers, Firmware, Hardware.)
Figure 1-1 Schematic interaction of different performance components
We can tune the I/O subsystem for weeks in vain if the disk subsystem for a 20,000-user
database server consists of a single IDE drive. Often a new driver or an update to the
application will yield impressive performance gains. Even as we discuss specific details,
never forget the complete picture of systems performance. Understanding the way an
operating system manages the system resources aids us in understanding what subsystems
we need to tune, given a specific application scenario.
The following sections provide a short introduction to the architecture of the Linux operating
system. A complete analysis of the Linux kernel is beyond the scope of this Redpaper. The
interested reader is pointed to the kernel documentation for a complete reference to the Linux
kernel. Once you have an overall picture of the Linux kernel, you can explore the details
more easily.
Note: This Redpaper focuses on the performance of the Linux operating system.
In this chapter we cover:
1.1, “Linux process management” on page 3
1.2, “Linux memory architecture” on page 11
1.3, “Linux file systems” on page 15
1.4, “Disk I/O subsystem” on page 19
1.5, “Network subsystem” on page 26
1.6, “Understanding Linux performance metrics” on page 34
1.1 Linux process management
Process management is one of the most important roles of any operating system. Effective
process management enables an application to operate steadily and effectively.
The Linux process management implementation is similar to the UNIX® implementation. It includes
process scheduling, interrupt handling, signaling, process prioritization, process switching,
process state, process memory, and so on.
In this section, we discuss the fundamentals of the Linux process management
implementation. Understanding how the Linux kernel deals with processes helps you
understand the effect process management has on system performance.
1.1.1 What is a process?
A process is an instance of execution that runs on a processor. The process uses any
resources that the Linux kernel can provide to complete its task.
All processes running on the Linux operating system are managed by the task_struct structure,
which is also called the process descriptor. A process descriptor contains all the information
necessary for a single process to run, such as the process identification, the attributes of the
process, and the resources which construct the process. If you know the structure of the process,
you can understand what is important for process execution and performance. Figure 1-2 shows the
outline of the structures related to process information.
(The task_struct structure includes, among others: state (process state), thread_info (process
information and kernel stack), run_list and array (for process scheduling), mm (process address
space), pid (process ID), group_info (group management), user (user management), fs (working
directory), files (file descriptors), signal (signal information), and sighand (signal handlers).
These fields point to related structures such as thread_info, runqueue, mm_struct, group_info,
user_struct, fs_struct, files_struct, signal_struct, and sighand_struct.)
Figure 1-2 task_struct structure
1.1.2 Lifecycle of a process
Every process has its own lifecycle such as creation, execution, termination and removal.
These phases will be repeated literally millions of times as long as the system is up and
running. Therefore, the process lifecycle is a very important topic from the performance
perspective.
Figure 1-3 shows the typical lifecycle of processes.
(A parent process issues fork() to create a child process; the child process calls exec() and
later exit(), becomes a zombie process, and is removed after the parent process calls wait().)
Figure 1-3 Lifecycle of typical processes
When a process creates a new process, the creating process (parent process) issues a fork()
system call. When a fork() system call is issued, it gets a process descriptor for the newly
created process (child process) and sets a new process ID. It then copies the values of the
parent process's process descriptor to the child's. At this time the entire address space of the
parent process is not copied; both processes share the same address space.
The exec() system call copies the new program to the address space of the child process.
Because both processes share the same address space, writing new program data causes a
page fault exception. At this point, the kernel assigns the new physical page to the child
process.
This deferred operation is called Copy On Write. The child process usually executes its
own program rather than running the same program as its parent. Deferring the copy is a
reasonable choice to avoid unnecessary overhead, because copying an entire address space
is a slow and inefficient operation that uses a lot of processor time and resources.
When program execution has completed, the child process terminates with an exit() system
call. The exit() system call releases most of the data structures of the process and notifies
the parent process of the termination by sending a signal. At this time, the process is
called a zombie process (refer to “Zombie processes” on page 8).
The child process will not be completely removed until the parent process learns of the
termination of its child process through the wait() system call. As soon as the parent process is
notified of the child process termination, it removes all the data structures of the child process
and releases the process descriptor.
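The following minimal C sketch (not from the original text) illustrates this lifecycle using the fork(), exec(), exit(), and wait() system calls; the program passed to execl() (/bin/true) is only an example.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                 /* parent process issues fork() */
    if (pid == 0) {
        /* child process: replace the shared address space with a new program */
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);                     /* reached only if execl() fails */
    } else if (pid > 0) {
        int status;
        waitpid(pid, &status, 0);       /* reap the child; until then it is a zombie */
        printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    } else {
        perror("fork");
        return EXIT_FAILURE;
    }
    return 0;
}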
1.1.3 Thread
A thread is an execution unit that is generated in a single process and runs in parallel with
other threads in the same process. Threads can share the same resources, such as memory,
address space, open files, and so on, and they can access the same set of application data.
A thread is also called a Light Weight Process (LWP). Because they share resources, threads
should take care not to change their shared resources at the same time. The implementation
of mutual exclusion, locking, serialization, and so on is the user application's responsibility.
From the performance perspective, thread creation is less expensive than process creation
because a thread does not need to copy resources on creation. On the other hand, processes
and threads have similar characteristics in terms of scheduling: the kernel deals with both of
them in a similar manner.
(Process creation copies the parent's resources for the new process, whereas thread creation
shares the resources of the existing process among its threads.)
Figure 1-4 Process and thread
In current Linux implementations, threads are supported by the POSIX (Portable Operating
System Interface for UNIX) compliant thread library (pthread). Several thread
implementations are available in the Linux operating system. The following are the most widely used:
LinuxThreads
LinuxThreads has been the default thread implementation since Linux kernel 2.0. LinuxThreads
is not fully compliant with the POSIX standard. NPTL is taking the place of LinuxThreads, and
LinuxThreads will not be supported in future releases of Enterprise Linux distributions.
Native POSIX Thread Library (NPTL)
The NPTL was originally developed by Red Hat. NPTL is more compliant with POSIX
standards. Taking advantage of enhancements in kernel 2.6 such as the new clone()
system call, signal handling implementation etc., it has better performance and scalability
than LinuxThreads.
There are some incompatibilities with LinuxThreads. An application that depends
on LinuxThreads may not work with the NPTL implementation.
Next Generation POSIX Thread (NGPT)
NGPT is an IBM-developed POSIX thread library. It is currently in maintenance mode,
and no further development is planned.
Using the LD_ASSUME_KERNEL environment variable, you can choose which threads library the
application should use.
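As a hedged illustration (not part of the original text) of threads sharing one address space and of mutual exclusion being the application's responsibility, the following sketch uses the pthread library; the counter and iteration count are arbitrary. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared by all threads in the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* serialization is the application's job */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);  /* threads share the process address space */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);       /* 200000 when access is properly serialized */
    return 0;
}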
1.1.4 Process priority and nice level
Process priority is a number that determines the order in which the process is handled by the
CPU, and is determined by dynamic priority and static priority. A process which has a higher
process priority has a higher chance of getting permission to run on a processor.
The kernel dynamically adjusts the dynamic priority up and down as needed, using a heuristic
algorithm based on process behaviors and characteristics. A user process can change the
static priority indirectly through the use of the nice level of the process. A process which has
a higher static priority will have a longer time slice (how long the process can run on a processor).
Linux supports nice levels from 19 (lowest priority) to -20 (highest priority). The default value
is 0. To change the nice level of a program to a negative number (which makes it higher
priority), it is necessary to log on or su to root.
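As a small hedged sketch (not from the original text), a process can also read and adjust its own nice level programmatically with getpriority() and setpriority(); the value 10 below is arbitrary.

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

int main(void)
{
    /* Raise the nice level of the calling process to 10 (lower priority). */
    if (setpriority(PRIO_PROCESS, 0, 10) != 0)
        perror("setpriority");

    /* Read the current nice level back. */
    int level = getpriority(PRIO_PROCESS, 0);
    printf("current nice level: %d\n", level);

    /* Setting a value below 0 (for example -5) requires root privileges. */
    return 0;
}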
1.1.5 Context switching
During process execution, information about the running process is stored in registers on the
processor and in its cache. The set of data that is loaded into the registers for the executing
process is called the context. To switch processes, the context of the running process is
stored and the context of the next running process is restored to the registers. The process
descriptor and an area called the kernel mode stack are used to store the context. This switching
process is called context switching. Having too much context switching is undesirable because
the processor has to flush its registers and cache every time to make room for the new process,
and this may cause performance problems.
Figure 1-5 illustrates how context switching works.
(The CPU's stack pointer, EIP, and other registers are saved into the context area of the
suspended process A (task_struct and kernel mode stack) and restored from that of the resumed
process B.)
Figure 1-5 Context switching
1.1.6 Interrupt handling
Interrupt handling is one of the highest priority tasks. Interrupts are usually generated by I/O
devices such as a network interface card, keyboard, disk controller, serial adapter, and so on.
The interrupt handler notifies the Linux kernel of an event (such as keyboard input, Ethernet
frame arrival, and so on). It tells the kernel to interrupt process execution and perform
interrupt handling as quickly as possible because some devices require quick
responsiveness. This is critical for system stability. When an interrupt signal arrives at the
kernel, the kernel must switch from the currently executing process to a new one to handle the
interrupt. This means interrupts cause context switching, and therefore a significant number
of interrupts may cause performance degradation.
In Linux implementations, there are two types of interrupt. A hard interrupt is generated for
devices which require responsiveness (disk I/O interrupt, network adapter interrupt, keyboard
interrupt, mouse interrupt). A soft interrupt is used for tasks whose processing can be
deferred (TCP/IP operation, SCSI protocol operation, and so on). You can see information related to
hard interrupts at /proc/interrupts.
In a multi-processor environment, interrupts are handled by each processor. Binding
interrupts to a single physical processor may improve system performance. For further
details, refer to 4.4.2, “CPU affinity for interrupt handling”.
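As a hedged sketch of interrupt binding (the IRQ number 19 is purely hypothetical and root privileges are required), the affinity of a hard interrupt can be set by writing a CPU bitmask to /proc/irq/<irq>/smp_affinity:

#include <stdio.h>

int main(void)
{
    /* Bind IRQ 19 (hypothetical) to CPU 0 by writing the CPU bitmask 0x1. */
    FILE *f = fopen("/proc/irq/19/smp_affinity", "w");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    fputs("1\n", f);
    fclose(f);
    return 0;
}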
1.1.7 Process state
Every process has its own state to show what is currently happening in the process. Process
state changes during process execution. Some of the possible states are as follows:
TASK_RUNNING
In this state, a process is running on a CPU or waiting to run in the queue (run queue).
TASK_STOPPED
A process suspended by certain signals (for example, SIGINT or SIGSTOP) is in this state. The process
is waiting to be resumed by a signal such as SIGCONT.
TASK_INTERRUPTIBLE
In this state, the process is suspended and waits for a certain condition to be satisfied. If a
process is in TASK_INTERRUPTIBLE state and it receives a signal to stop, the process
state is changed and operation will be interrupted. A typical example of a
TASK_INTERRUPTIBLE process is a process waiting for keyboard interrupt.
TASK_UNINTERRUPTIBLE
Similar to TASK_INTERRUPTIBLE. While a process in TASK_INTERRUPTIBLE state can
be interrupted, sending a signal does nothing to the process in
TASK_UNINTERRUPTIBLE state. A typical example of TASK_UNINTERRUPTIBLE
process is a process waiting for disk I/O operation.
TASK_ZOMBIE
After a process exits with the exit() system call, its parent should be notified of the
termination. In the TASK_ZOMBIE state, a process is waiting for its parent to be notified so
that the parent can release all of the process's data structures.
(fork() creates a process in TASK_RUNNING (ready); scheduling moves it onto a processor and
preemption moves it back to the ready queue; from TASK_RUNNING it can move to TASK_STOPPED,
TASK_INTERRUPTIBLE, or TASK_UNINTERRUPTIBLE and back, and exit() moves it to TASK_ZOMBIE.)
Figure 1-6 Process state
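These states appear as the one-letter state field (R, S, D, T, Z) reported by ps and under /proc. The following minimal sketch (an illustration, not from the original text) reads that field for the current process from /proc/self/stat, whose first three fields are the PID, the command name, and the state letter.

#include <stdio.h>

int main(void)
{
    int pid;
    char comm[64], state;

    FILE *f = fopen("/proc/self/stat", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    /* Format: pid (comm) state ... */
    fscanf(f, "%d %63s %c", &pid, comm, &state);
    fclose(f);

    printf("pid=%d comm=%s state=%c\n", pid, comm, state);
    return 0;
}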
Zombie processes
When a process has already terminated, having received a signal to do so, it normally takes
some time to finish all of its tasks (such as closing open files) before ending itself. In that
normally very short time frame, the process is a zombie.
After the process has completed all of these shutdown tasks, it reports to the parent process
that it is about to terminate. Sometimes, a zombie process is unable to terminate itself, in
which case it shows a status of Z (zombie).
It is not possible to kill such a process with the kill command, because it is already
considered “dead.” If you cannot get rid of a zombie, you can kill the parent process and then
the zombie disappears as well. However, if the parent process is the init process, you should
not kill it. The init process is a very important process, and therefore a reboot may be needed
to get rid of the zombie process.
1.1.8 Process memory segments
A process uses its own memory area to perform its work. The work varies depending on the
situation and on how the process is used. A process can have different workload characteristics
and different data size requirements, and it has to handle data of varying sizes. To satisfy this
requirement, the Linux kernel uses a dynamic memory allocation mechanism for each process. The
process memory allocation structure is shown in Figure 1-7.
(Process address space, starting at 0x0000: text segment (executable instructions, read-only),
data segment (initialized data), BSS (zero-initialized data), heap (dynamic memory allocated by
malloc()), and stack (local variables, function parameters, return addresses, and so on).)
Figure 1-7 Process address space
The process memory area consists of the following segments:
Text segment
The area where executable code is stored.
Data segment
The data segment consists of the following three areas:
– Data: The area where initialized data such as static variables is stored.
– BSS: The area where zero-initialized data is stored. The data is initialized to zero.
– Heap: The area where malloc() allocates dynamic memory on demand.
The heap grows toward higher addresses.
Stack segment
The area where local variables, function parameters, and the return address of a function
are stored. The stack grows toward lower addresses.
The memory allocation of a user process address space can be displayed with the pmap
command. You can display the total size of the segment with the ps command. Refer to
2.3.10, “pmap” on page 52 and 2.3.4, “ps and pstree” on page 44.
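The following small C sketch (not part of the original text; the symbol names are arbitrary) prints an address from each segment, and its output can be compared with the pmap output for the same process.

#include <stdio.h>
#include <stdlib.h>

int initialized = 42;          /* data segment */
int uninitialized;             /* BSS segment  */

int main(void)
{
    int local = 0;                        /* stack segment */
    int *dynamic = malloc(sizeof(int));   /* heap segment  */

    printf("text  (code)   : %p\n", (void *)main);
    printf("data  (init)   : %p\n", (void *)&initialized);
    printf("bss   (zeroed) : %p\n", (void *)&uninitialized);
    printf("heap  (malloc) : %p\n", (void *)dynamic);
    printf("stack (local)  : %p\n", (void *)&local);

    free(dynamic);
    return 0;
}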
1.1.9 Linux CPU scheduler
The basic functionality of any computer is, quite simply, to compute. To be able to compute,
there must be a means to manage the computing resources, or processors, and the
computing tasks, also known as threads or processes. Thanks to the great work of Ingo
Molnar, Linux features a kernel using an O(1) algorithm, as opposed to the O(n) algorithm used
by the former CPU scheduler. The term O(1) refers to a static algorithm, meaning
that the time taken to choose a process for placing into execution is constant, regardless of
the number of processes.
The new scheduler scales very well, regardless of process count or processor count, and
imposes a low overhead on the system. The algorithm uses two process priority arrays:
active
expired
As processes are allocated a timeslice by the scheduler, based on their priority and prior
blocking rate, they are placed in a list of processes for their priority in the active array. When
they expire their timeslice, they are allocated a new timeslice and placed on the expired array.
When all processes in the active array have expired their timeslice, the two arrays are
switched, restarting the algorithm. For general interactive processes (as opposed to real-time
processes) this results in high-priority processes, which typically have long timeslices, getting
more compute time than low-priority processes, but not to the point where they can starve the
low-priority processes completely. The advantage of such an algorithm is the vastly improved
scalability of the Linux kernel for enterprise workloads that often include vast amounts of
threads or processes and also a significant number of processors. The new O(1) CPU
scheduler was designed for kernel 2.6 but backported to the 2.4 kernel family. Figure 1-8
illustrates how the Linux CPU scheduler works.
(Each CPU's runqueue holds two priority arrays, array[0] (active) and array[1] (expired), each
with priority slots 0 through 139; processes (P) are queued per priority in the active array and
moved to the expired array when their timeslice expires.)
Figure 1-8 Linux kernel 2.6 O(1) scheduler
Another significant advantage of the new scheduler is the support for Non-Uniform Memory
Architecture (NUMA) and symmetric multithreading processors, such as Intel®
Hyper-Threading technology.
The improved NUMA support ensures that load balancing will not occur across NUMA nodes
unless a node becomes overburdened. This mechanism ensures that traffic over the comparatively
slow scalability links in a NUMA system is minimized. Although processors within a scheduler
domain group are load balanced with every scheduler tick, load balancing across scheduler
domains occurs only if a node is overloaded and asks for load balancing.
(The figure shows a two-node xSeries 445 (8 CPUs) as the parent scheduler domain, one CEC
(4 CPUs) as a child scheduler domain, one Xeon MP (Hyper-Threading) as a scheduler domain group,
and logical CPUs; load balancing between children happens only if a child is overburdened, while
load balancing within a domain happens via scheduler_tick() and time slices.)
Figure 1-9 Architecture of the O(1) CPU scheduler on an 8-way NUMA based system with
Hyper-Threading enabled
1.2 Linux memory architecture
To execute a process, the Linux kernel allocates a portion of the memory area to the
requesting process. The process uses the memory area as workspace and performs the
required work. It is similar to you having your own desk allocated and then using the desktop
to scatter papers, documents, and memos to perform your work. The difference is that the
kernel has to allocate space in a more dynamic manner. The number of running processes
sometimes reaches tens of thousands, and the amount of memory is usually limited. Therefore,
the Linux kernel must handle memory efficiently. In this section, we describe the Linux
memory architecture, the address layout, and how Linux manages memory space efficiently.
1.2.1 Physical and virtual memory
Today we are faced with the choice of 32-bit systems and 64-bit systems. One of the most
important differences for enterprise-class clients is the possibility of virtual memory
addressing above 4 GB. From a performance point of view, it is therefore interesting to
understand how the Linux kernel maps physical memory into virtual memory on both 32-bit
and 64-bit systems.
As you can see in Figure 1-10 on page 12, there are obvious differences in the way the Linux
kernel has to address memory in 32-bit and 64-bit systems. Exploring the physical-to-virtual
mapping in detail is beyond the scope of this paper, so we highlight some specifics in the
Linux memory architecture.
On 32-bit architectures such as the IA-32, the Linux kernel can directly address only the first
gigabyte of physical memory (896 MB when considering the reserved range). Memory above
the so-called ZONE_NORMAL must be mapped into the lower 1 GB. This mapping is
completely transparent to applications, but allocating a memory page in ZONE_HIGHMEM
causes a small performance degradation.
On the other hand, with 64-bit architectures such as x86-64 (also x64), ZONE_NORMAL
extends all the way to 64GB or to 128 GB in the case of IA-64 systems. As you can see, the
overhead of mapping memory pages from ZONE_HIGHMEM into ZONE_NORMAL can be
eliminated by using a 64-bit architecture.
(32-bit architecture: ZONE_DMA from 0 to 16 MB, ZONE_NORMAL from 16 MB to 896 MB, a 128 MB
“Reserved” range for kernel data structures up to 1 GB, and ZONE_HIGHMEM from 1 GB to 64 GB,
whose pages must be mapped into ZONE_NORMAL. 64-bit architecture: ZONE_DMA up to 1 GB and
ZONE_NORMAL from 1 GB to 64 GB.)
Figure 1-10 Linux kernel memory layout for 32-bit and 64-bit systems
Virtual memory addressing layout
Figure 1-11 shows the Linux virtual addressing layout for 32-bit and 64-bit architecture.
On 32-bit architectures, the maximum address space that a single process can access is 4 GB.
This is a restriction derived from 32-bit virtual addressing. In a standard implementation, the
virtual address space is divided into a 3 GB user space and a 1 GB kernel space. There are
some variants, such as the 4G/4G addressing layout.
On the other hand, on 64-bit architectures such as x86_64 and IA-64, no such restriction exists.
Each process can use the vast address space.
(32-bit architecture, 3G/1G kernel split: user space from 0 GB to 3 GB, kernel space from 3 GB
to 4 GB. 64-bit architecture (x86_64): user space starting at 0 GB and a kernel space of 512 GB
or more.)
Figure 1-11 Virtual memory addressing layout for 32-bit and 64-bit architecture
1.2.2 Virtual memory manager
The physical memory architecture of an operating system is usually hidden from the application
and the user because operating systems map any memory into virtual memory. If we want to
understand the tuning possibilities within the Linux operating system, we have to understand
how Linux handles virtual memory. As explained in 1.2.1, “Physical and virtual memory” on
page 11, applications do not allocate physical memory; instead they request a memory map of a
certain size from the Linux kernel and in exchange receive a map in virtual memory. As you can
see in Figure 1-12 on page 13, virtual memory does not necessarily have to be mapped into
physical memory. If your application allocates a large amount of memory, some of it might be
mapped to the swap file on the disk subsystem.
Another enlightening fact that can be taken from Figure 1-12 on page 13 is that applications
usually do not write directly to the disk subsystem, but into cache or buffers. The pdflush
kernel threads then flush data from the cache/buffers out to disk when there is time to do so
(or, of course, if a file size exceeds the buffer cache). Refer to “Flushing dirty buffer” on
page 22.
(User space processes such as sh, httpd, and mozilla call the standard C library (glibc), which
enters the kernel; kernel subsystems include the slab allocator, the kswapd and bdflush threads,
the VM subsystem with the MMU and the zoned buddy allocator, and the disk driver, which map onto
physical memory and disk.)
Figure 1-12 The Linux virtual memory manager
Closely connected to the way the Linux kernel handles writes to the physical disk subsystem
is the way the Linux kernel manages disk cache. While other operating systems allocate only
a certain portion of memory as disk cache, Linux handles the memory resource far more
efficiently. The default configuration of the virtual memory manager allocates all available free
memory space as disk cache. Hence it is not unusual to see productive Linux systems that
boast gigabytes of memory but only have 20 MB of that memory free.
In the same context, Linux also handles swap space very efficiently. The fact that swap space
is being used does not mean a memory bottleneck but rather proves how efficiently Linux
handles system resources. See “Page frame reclaiming” on page 14 for more detail.
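To see this lazy mapping in action, the following hedged C sketch (not part of the original text) requests a large anonymous memory map; the kernel hands back virtual memory immediately and assigns physical pages only when they are touched (the 256 MB size is arbitrary, and RSS growth can be watched with top or ps).

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 256UL * 1024 * 1024;     /* 256 MB of virtual memory */

    /* The kernel returns a virtual mapping; no physical pages are assigned yet. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touching each page forces the kernel to back it with physical memory. */
    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;

    munmap(buf, len);
    return 0;
}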
Page frame allocation
A page is a group of contiguous linear addresses in physical memory (a page frame) or virtual
memory. The Linux kernel handles memory in units of pages. A page is usually 4 KB in
size. When a process requests a certain number of pages, if pages are available the
Linux kernel can allocate them to the process immediately. Otherwise pages have to be taken
from some other process or page cache. The kernel knows how many memory pages are
available and where they are located.
Buddy system
The Linux kernel maintains its free pages by using a mechanism called the buddy system. The
buddy system maintains free pages and tries to allocate pages for page allocation requests. It
tries to keep the memory area contiguous. If small pages were scattered without consideration,
memory would become fragmented and it would be more difficult to allocate a large contiguous
range of pages, which could lead to inefficient memory use and a performance decline.
Figure 1-13 illustrates how the buddy system allocates pages.
(The buddy system splits and coalesces chunks of free pages: a request for 2 pages is served by
splitting an 8-page chunk into 2-page and 4-page buddies, and when the 2 pages are released they
are merged back with their buddies into an 8-page chunk.)
Figure 1-13 Buddy System
When an attempt to allocate pages fails, page reclaiming is activated. Refer to
“Page frame reclaiming” on page 14.
You can find information on the buddy system through /proc/buddyinfo. For detail, please
refer to “Memory used in a zone” on page 47.
Page frame reclaiming
If pages are not available when a process requests to map a certain number of pages, the
Linux kernel tries to get pages for the new request by releasing certain pages (which were used
before but are not used anymore yet are still marked as active pages) based on certain
principles, and allocating that memory to the new process. This process is called page
reclaiming. The kswapd kernel thread and the try_to_free_page() kernel function are responsible
for page reclaiming.
While kswapd is usually sleeping in the task interruptible state, it is called by the buddy system
when free pages in a zone fall short of a certain threshold. It then tries to find candidate
pages to be taken out of the active pages, based on the Least Recently Used (LRU) principle.
This is relatively simple: the pages least recently used should be released first. The active
list and the inactive list are used to maintain the candidate pages. kswapd scans part of the
active list, checks how recently the pages were used, and puts the pages that were not used
recently into the inactive list. You can take a look at how much memory is considered active and
inactive using the vmstat -a command. For detail refer to 2.3.2, “vmstat”.
kswapd also follows another principle. Pages are used mainly for two purposes: page cache
and process address space. The page cache consists of pages mapped to a file on disk. The
pages belonging to a process address space are used for heap and stack (called anonymous
memory because it is not mapped to any files and has no name); refer to 1.1.8, “Process
memory segments” on page 8. When kswapd reclaims pages, it would rather shrink the page
cache than page out (or swap out) the pages owned by processes.
Note: The phrases “page out” and “swap out” are sometimes confusing. “Page out” means
taking some pages (a part of the entire address space) into swap space, while “swap out” means
taking the entire address space into swap space. They are sometimes used interchangeably.
The proper balance between reclaiming page cache and reclaiming process address space
depends on the usage scenario and has certain effects on performance. You can take
some control of this behavior by using /proc/sys/vm/swappiness. Please refer to 4.5.1,
“Setting kernel swap and pdflush behavior” on page 110 for tuning details.
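As a hedged sketch (equivalent to reading and writing the file from a shell), the swappiness value can be inspected and changed through /proc/sys/vm/swappiness; the value 60 written below is only an example, and writing requires root privileges.

#include <stdio.h>

int main(void)
{
    int current = 0;

    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (f != NULL) {
        fscanf(f, "%d", &current);
        fclose(f);
        printf("current swappiness: %d\n", current);
    }

    f = fopen("/proc/sys/vm/swappiness", "w");
    if (f != NULL) {
        fprintf(f, "%d\n", 60);   /* example value only */
        fclose(f);
    } else {
        perror("fopen for write");
    }
    return 0;
}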
Swap
As we stated before, when page reclaiming occurs, the candidate pages in the inactive list
which belong to the process address space may be paged out. Having swap used is not in itself
a problematic situation. While in other operating systems swap is nothing more than a guarantee
in case of over-allocation of main memory, Linux utilizes swap space far more efficiently. As
you can see in Figure 1-12, virtual memory is composed of both physical memory and the
disk subsystem or the swap partition. If the virtual memory manager in Linux realizes that a
memory page has been allocated but not used for a significant amount of time, it moves this
memory page to swap space.
Often you will see daemons such as getty that will be launched when the system starts up but
will hardly ever be used. It appears that it would be more efficient to free the expensive main
memory of such a page and move the memory page to swap. This is exactly how Linux
handles swap, so there is no need to be alarmed if you find the swap partition filled to 50%.
The fact that swap space is being used does not mean a memory bottleneck but rather proves
how efficiently Linux handles system resources.
1.3 Linux file systems
One of the great advantages of Linux as an open source operating system is that it offers
users a variety of supported file systems. Modern Linux kernels can support nearly every file
system ever used by a computer system, from basic FAT support to high performance file
systems such as the journaling file system JFS. However, because Ext2, Ext3 and ReiserFS
are native Linux file systems and are supported by most Linux distributions (ReiserFS is
commercially supported only on Novell SUSE Linux), we will focus on their characteristics
and give only an overview of the other frequently used Linux file systems.
For more information on file systems and the disk subsystem, see 4.6, “Tuning the disk
subsystem” on page 113.
1.3.1 Virtual file system
The Virtual File System (VFS) is an abstraction interface layer that resides between the user
process and the various types of Linux file system implementations. VFS provides common
object models (such as i-node, file object, page cache, and directory entry) and methods to access
file system objects. It hides the differences of each file system implementation from user
processes. Thanks to VFS, user processes do not need to know which file system to use, or
which system call should be issued for each file system. Figure 1-14 illustrates the concept of
VFS.
(A user process (for example, cp) issues system calls such as open(), read(), and write(); the
VFS translates them for each file system implementation, such as ext2, ext3, ReiserFS, XFS, JFS,
VFAT, NFS, AFS, and proc.)
Figure 1-14 VFS concept
1.3.2 Journaling
In a non-journaling file system, when a write is performed to a file system, the Linux kernel
makes changes to the file system metadata first and then writes the actual user data. This
operation sometimes causes a higher chance of losing data integrity. If the system suddenly
crashes for some reason while the write operation to the file system metadata is in process, the
file system consistency may be broken. fsck fixes the inconsistency by checking all the
metadata and recovering the consistency at the time of the next reboot. But when the system has
a large volume, it takes a very long time to complete. The system is not operational during this
process.
A journaling file system solves this problem by writing the data to be changed to an area called
the journal area before writing the data to the actual file system. The journal area can be
placed either in the file system itself or outside the file system. The data written to the
journal area is called the journal log. It includes the changes to file system metadata and the
actual file data, if supported.
Because journaling writes journal logs before writing actual user data to the file system, it can
cause a performance overhead compared to a non-journaling file system. How much performance
overhead is sacrificed to maintain higher data consistency depends on how much information
is written to disk before writing the user data. We will discuss this topic in 1.3.4, “Ext3” on
page 18.
(1. Write the journal logs to the journal area. 2. Make the changes to the actual file system.
3. Delete the journal logs.)
Figure 1-15 Journaling concept
1.3.3 Ext2
The extended 2 file system is the predecessor of the extended 3 file system. A fast, simple file
system, it features no journaling capabilities, unlike most other current file systems.
Figure 1-16 shows the Ext2 file system data structure. The file system starts with a boot sector
and is followed by block groups. Splitting the entire file system into several small block groups
contributes to a performance gain because the i-node table and the data blocks which hold user
data can reside closer together on the disk platter, so seek time is reduced. A block group
consists of:
Super block: Information on the file system is stored here. An exact copy of the
super block is placed at the top of every block group.
Block group descriptor: Information on the block group is stored here.
Data block bitmaps: Used for free data block management.
i-node bitmaps: Used for free i-node management.
i-node tables: i-node tables are stored here. Every file has a corresponding i-node,
which holds metadata of the file such as file mode, uid, gid,
atime, ctime, mtime, dtime, and pointers to the data blocks.
Data blocks: Where actual user data is stored.
(An Ext2 file system consists of a boot sector followed by BLOCK GROUP 0 through BLOCK GROUP N;
each block group contains a super block, block group descriptors, data-block bitmaps, i-node
bitmaps, an i-node table, and data blocks.)
Figure 1-16 Ext2 file system data structure
To find the data blocks which make up a file, the kernel searches the i-node of the file first.
When a request to open /var/log/messages comes from a process, the kernel parses the file path
and searches the directory entry of / (the root directory), which has the information about the
files and directories under it. Next the kernel finds the i-node of /var and looks at the
directory entry of /var, which also has the information about the files and directories under it.
The kernel gets down to the target file in the same manner until it finds the i-node of the file.
The Linux kernel uses file object caches, such as the directory entry cache and the i-node cache,
to accelerate finding the corresponding i-node.
Now that the Linux kernel knows the i-node of the file, it tries to reach the actual user data
blocks. As we described, the i-node has pointers to the data blocks. By referring to them, the
kernel can get to the data blocks. For large files, Ext2 implements direct and indirect
references to data blocks. Figure 1-17 illustrates how it works.
(The Ext2 disk i-node holds i_size, i_blocks, and the i_blocks[0] through i_blocks[14] pointers:
the first twelve are direct pointers to data blocks, while i_blocks[12], i_blocks[13], and
i_blocks[14] point to indirect, double indirect, and trebly indirect blocks that in turn lead to
data blocks.)
Figure 1-17 Ext2 file system direct / indirect reference to data block
The file system structure and file access operations differ by file system, and these differences
give each file system its own characteristics.
1.3.4 Ext3
The current Enterprise Linux distributions support the extended 3 file system. This is an
updated version of the widely used extended 2 file system. Though the fundamental
structures are quite similar to those of the Ext2 file system, the major difference is the support
of a journaling capability. Highlights of this file system include:
Availability: Ext3 always writes data to the disks in a consistent way, so in case of an
unclean shutdown (unexpected power failure or system crash), the server does not have
to spend time checking the consistency of the data, thereby reducing system recovery
from hours to seconds.
Data integrity: By specifying the journaling mode data=journal on the mount command, all
data, both file data and metadata, is journaled.
Speed: By specifying the journaling mode data=writeback, you can decide on speed
versus integrity to meet the needs of your business requirements. This will be notable in
environments where there are heavy synchronous writes.
Flexibility: Upgrading from existing Ext2 file systems is simple and no reformatting is
necessary. By executing the tune2fs command and modifying the /etc/fstab file, you can
easily update an Ext2 to an Ext3 file system. Also note that Ext3 file systems can be
mounted as Ext2 with journaling disabled. Products from many third-party vendors have
18Linux Performance and Tuning Guidelines
Draft Document for Review May 4, 2007 11:35 am4285ch01.fm
the capability of manipulating Ext3 file systems. For example, PartitionMagic can handle
the modification of Ext3 partitions.
Modes of journaling
Ext3 supports three types of journaling modes:
journal
This journaling option provides the highest form of data consistency by causing both file
data and metadata to be journaled. It also has the highest performance overhead.
ordered
In this mode only metadata is journaled. However, file data is guaranteed to be written to disk first.
This is the default setting.
writeback
This journaling option provides the fastest access to the data at the expense of data
consistency. The data is guaranteed to be consistent as the metadata is still being logged.
However, no special handling of actual file data is done and this may lead to old data
appearing in files after a system crash.
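As an illustration of how these modes are selected (the device name /dev/sdb1 and mount point /data are hypothetical), the journaling mode is passed as a mount option on the mount command or in /etc/fstab:

mount -o data=ordered /dev/sdb1 /data
/dev/sdb1   /data   ext3   data=writeback   1 2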
1.3.5 ReiserFS
ReiserFS is a fast journaling file system with optimized disk-space utilization and quick crash
recovery. ReiserFS has been developed to a great extent with the help of Novell. ReiserFS is
commercially supported only on Novell SUSE Linux.
1.3.6 Journal File System
The Journal File System (JFS) is a full 64-bit file system that can support very large files and
partitions. JFS was developed by IBM originally for AIX® and is now available under the
general public license (GPL). JFS is an ideal file system for very large partitions and file sizes
that are typically encountered in high performance computing (HPC) or database
environments. If you would like to learn more about JFS, refer to:
http://jfs.sourceforge.net
Note: In Novell SUSE Linux Enterprise Server 10, JFS is no longer supported as a new file
system.
1.3.7 XFS
The eXtended File System (XFS) is a high-performance journaling file system developed by
Silicon Graphics Incorporated originally for its IRIX family of systems. It features
characteristics similar to JFS from IBM by also supporting very large file and partition sizes.
Therefore usage scenarios are very similar to JFS.
1.4 Disk I/O subsystem
Before a processor can decode and execute instructions, data should be retrieved all the way
from the sectors on a disk platter to the processor cache and its registers, and the results of
the execution may be written back to the disk.
We will take a look at the Linux disk I/O subsystem to get a better understanding of the
components that have a large effect on system performance.
1.4.1 I/O subsystem architecture
Figure 1-18 on page 20 shows the basic concept of the I/O subsystem architecture.
(A user process issues write() against a file; the VFS / file system layer updates the page cache
(block buffers), pdflush flushes dirty pages, the block layer builds bio structures and passes
them through the I/O scheduler and the I/O request queue, and the device driver writes to the
disk device.)
Figure 1-18 I/O subsystem architecture
For a quick understanding of overall I/O subsystem operations, we will use the example of
writing data to a disk. The following sequence outlines the fundamental operations that occur
when a disk-write operation is performed, assuming that the file data is on sectors on the disk
platters, has already been read, and is in the page cache.
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
3. A pdflush kernel thread takes care of flushing the page cache to disk.
4. The file system layer puts each block buffer together into a bio struct (refer to 1.4.3,
“Block layer” on page 23) and submits a write request to the block device layer.
5. The block device layer gets the requests from the upper layers, performs an I/O elevator
operation, and puts the requests into the I/O request queue.
6. A device driver such as the SCSI driver or another device-specific driver takes care of the
write operation.
7. The disk device firmware performs the hardware operations such as seeking the head, rotation,
and data transfer to the sector on the platter.
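The following hedged C sketch (not from the original text) shows the start of this chain from user space: write() normally completes once the data is in the page cache, and fsync() forces the dirty pages of the file down through the block layer and the device driver instead of waiting for pdflush (the file name /tmp/io-demo.dat is arbitrary).

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/io-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char *buf = "hello, disk\n";
    /* write() normally returns as soon as the data is in the page cache. */
    if (write(fd, buf, strlen(buf)) < 0)
        perror("write");

    /* fsync() pushes the dirty pages for this file to the disk immediately. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}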
1.4.2 Cache
In the past 20 years, the performance improvement of processors has outpaced that of
the other components in a computer system, such as processor cache, bus, RAM, and disk.
Slower access to memory and disk restricts overall system performance, so system
performance does not benefit from processor speed improvements alone. The cache mechanism
resolves this problem by caching frequently used data in faster memory, which reduces the
chances of having to access slower memory. Current computer systems use this technique in
almost all I/O components, such as the hard disk drive cache, the disk controller cache, the
file system cache, caches handled by each application, and so on.
Memory hierarchy
Figure 1-19 shows the concept of memory hierarchy. Because the difference in access speed
between the CPU registers and disk is large, the CPU would spend much time waiting for data
from slow disk devices, which significantly reduces the advantage of a fast CPU. The memory
hierarchy reduces this mismatch by placing the L1 cache, the L2 cache, RAM, and some other
caches between the CPU and the disk. It reduces the chances that a process needs to access
slower memory and disk. The memory closer to the processor is faster and smaller.
This technique also takes advantage of the locality of reference principle: the higher the cache
hit rate on faster memory, the faster the access to data.
Figure 1-19 Memory hierarchy (CPU register: very fast; cache: fast; RAM: slow; disk: very slow)
Locality of reference
As we stated previously in “Memory hierarchy” above, achieving a high cache hit rate is the
key to performance improvement. To achieve a high cache hit rate, a technique called
“locality of reference” is used. This technique is based on the following principles:
The data most recently used has a high probability of being used in the near future (temporal
locality).
Data that resides close to data which has been used has a high probability of being used
(spatial locality).
Figure 1-20 on page 22 illustrates this principle.
Figure 1-20 Locality of reference (temporal and spatial locality)
The Linux implementation makes use of this principle in many components, such as the page
cache, file object caches (i-node cache, directory entry cache, and so on), and the read-ahead buffer.
Flushing dirty buffer
When a process reads data from disk, the data is copied into memory. The process and
other processes can retrieve the same data from the copy of the data cached in memory.
When a process tries to change the data, the process changes the data in memory first. At
this time, the data on disk and the data in memory are not identical, and the data in memory is
referred to as a dirty buffer. The dirty buffer should be synchronized to the data on disk as
soon as possible, or the data in memory may be lost if a sudden crash occurs.
The synchronization process for a dirty buffer is called flush. In the Linux kernel 2.6
implementation, the pdflush kernel thread is responsible for flushing data to the disk. The flush
occurs on a regular basis (kupdate) and when the proportion of dirty buffers in memory
exceeds a certain threshold (bdflush). The threshold is configurable in the
/proc/sys/vm/dirty_background_ratio file. For more information, refer to 4.5.1, “Setting
kernel swap and pdflush behavior” on page 110.
• A process reads data from disk.
The data in memory and the data on disk are identical at this time.
• The process writes new data.
Only the data in memory has been changed; the data on disk and the data in memory are no longer identical.
• Flushing writes the data in memory to the disk.
The data on disk is now identical to the data in memory.
Figure 1-21 Flushing dirty buffers
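As a minimal sketch (the values shown are placeholders, not recommendations; see 4.5.1 before changing anything on a production system), the dirty_background_ratio threshold mentioned above can be inspected and changed through the proc interface or sysctl:

   cat /proc/sys/vm/dirty_background_ratio           (show the current threshold in percent)
   echo 5 > /proc/sys/vm/dirty_background_ratio      (temporary change; placeholder value)
   sysctl -w vm.dirty_background_ratio=5             (equivalent change using sysctl)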
1.4.3 Block layer
The block layer handles all the activity related to block device operation (refer to Figure 1-18
on page 20). The key data structure in the block layer is the bio structure. The bio structure is
an interface between the file system layer and the block layer.
When a write is performed, the file system layer tries to write to the page cache, which is made
up of block buffers. It makes up a bio structure by putting the contiguous blocks together, then
sends the bio to the block layer (refer to Figure 1-18 on page 20).
The block layer handles the bio request and links these requests into a queue called the I/O
request queue. This linking operation is called an I/O elevator. In Linux kernel 2.6
implementations, four types of I/O elevator algorithms are available. These are described
below.
Block sizes
The block size, the smallest amount of data that can be read or written to a drive, can have a
direct impact on a server’s performance. As a guideline, if your server is handling many small
files, then a smaller block size will be more efficient. If your server is dedicated to handling
large files, a larger block size may improve performance. Block sizes cannot be changed on
the fly on existing file systems, and only a reformat will modify the current block size.
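As a hedged illustration (the device name is a placeholder), the block size of an existing ext2/ext3 file system can be checked with dumpe2fs, and a different block size can only be chosen when the file system is (re)created, which destroys existing data:

   dumpe2fs -h /dev/sda1 | grep -F "Block size"      (show the current block size)
   mke2fs -j -b 4096 /dev/sda1                       (reformat with a 4 KB block size)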
I/O elevator
Apart from a vast amount of other features, the Linux kernel 2.6 employs a new I/O elevator
model. While the Linux kernel 2.4 used a single, general-purpose I/O elevator, kernel 2.6
offers the choice of four elevators. Because the Linux operating system can be used for a
wide range of tasks, both I/O devices and workload characteristics change significantly. A
laptop computer quite likely has different I/O requirements from a 10,000-user database
system. To accommodate this, four I/O elevators are available.
Anticipatory
The anticipatory I/O elevator was created based on the assumption of a block device with
only one physical seek head (for example a single SATA drive). The anticipatory elevator
uses the deadline mechanism described in more detail below plus an anticipation
heuristic. As the name suggests, the anticipatory I/O elevator “anticipates” I/O and
attempts to write it in single, bigger streams to the disk instead of multiple very small
random disk accesses. The anticipation heuristic may cause latency for write I/O. It is
clearly tuned for high throughput on general purpose systems such as the average
personal computer. Up to kernel release 2.6.18, the anticipatory elevator was the standard
I/O scheduler. However, most Enterprise Linux distributions default to the CFQ elevator.
Complete Fair Queuing (CFQ)
The CFQ elevator implements a QoS (Quality of Service) policy for processes by
maintaining per-process I/O queues. The CFQ elevator is well suited for large multiuser
systems with a vast amount of competing processes. It aggressively attempts to avoid
starvation of processes and features low latency. Starting with kernel release 2.6.18 the
improved CFQ elevator is the default I/O scheduler.
Depending on the system setup and the workload characteristics, the CFQ scheduler can
slow down a single main application, for example a massive database, because of its
fairness-oriented algorithms. The default configuration handles fairness based on process
groups which compete against each other. For example, a single database, and also all
writes through the page cache (all pdflush instances are in one pgroup), are each considered
a single application by CFQ and may compete against many background processes. It can
be useful to experiment with I/O scheduler subconfigurations and/or the deadline
scheduler in such cases.
Deadline
The deadline elevator is a cyclic elevator (round robin) with a deadline algorithm that
provides a near real-time behavior of the I/O subsystem. The deadline elevator offers
excellent request latency while maintaining good disk throughput. The implementation of
the deadline algorithm ensures that starvation of a process cannot occur.
NOOP
NOOP stands for No Operation, and the name explains most of its functionality. The
NOOP elevator is simple and lean. It is a simple FIFO queue that performs no data
ordering but simple merging of adjacent requests, so it adds very low processor overhead
to disk I/O. The NOOP elevator assumes that a block device either features its own
elevator algorithm such as TCQ for SCSI, or that the block device has no seek latency
such as a flash card.
Note: With the Linux kernel release 2.6.18 the I/O elevators are selectable on a
per disk subsystem basis and no longer have to be set at a system-wide level.
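For example (a minimal sketch; /dev/sda is a placeholder device), the elevator of an individual disk can be displayed and changed at run time through sysfs:

   cat /sys/block/sda/queue/scheduler
   noop anticipatory deadline [cfq]                  (the elevator in brackets is currently active)
   echo deadline > /sys/block/sda/queue/scheduler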
1.4.4 I/O device driver
The Linux kernel takes control of devices using a device driver. The device driver is usually a
separate kernel module and is provided for each device (or group of devices) to make the
device available for the Linux operating system. Once the device driver is loaded, it runs as a
part of the Linux kernel and takes full control of the device. Here we describe SCSI device
drivers.
SCSI
The Small Computer System Interface (SCSI) is the most commonly used I/O device
technology, especially in the enterprise server environment. In Linux kernel implementations,
SCSI devices are controlled by device driver modules. They consist of the following types of
modules.
Upper level drivers: sd_mod, sr_mod, st, sg
Provide functionality to support several types of SCSI devices, such as SCSI CD-ROM,
SCSI tape, and so on.
Middle level driver: scsi_mod
Implements the SCSI protocol and common SCSI functionality.
Low level drivers
Provide lower level access to each device. A low level driver is basically specific to a
hardware device and is provided for each device; for example, ips for the IBM ServeRAID™
controller, qla2300 for the QLogic HBA, mptscsih for the LSI Logic SCSI controller, and so on.
Pseudo driver: ide-scsi
Used for IDE-SCSI emulation.
Figure 1-22 Structure of SCSI drivers
If there is specific functionality implemented for a device, it should be implemented in the device
firmware and the low level device driver. The supported functionality depends on which
hardware you use and which version of the device driver you use. The device itself should also
support the desired functionality. Specific functions are usually tuned by device driver
parameters. You can try some performance tuning in /etc/modules.conf. Refer to the device
and device driver documentation for possible tuning hints and tips.
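As a sketch only (the module name and parameter are generic placeholders, not recommendations), driver parameters are typically listed with modinfo and then set with an options line in /etc/modules.conf (or /etc/modprobe.conf on newer distributions):

   modinfo -p <module name>                          (list the parameters the driver module accepts)
   options <module name> <parameter>=<value>         (line added to /etc/modules.conf)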
1.4.5 RAID and Storage system
The selection and configuration of the storage system and RAID types are also important factors
in terms of system performance. However, we leave the details of this topic out of the scope of this
Redpaper, although we note that Linux also supports software RAID. We include some tuning
considerations in 4.6.1, “Hardware considerations before installing Linux” on page 114.
For additional, in-depth coverage of the available IBM storage solutions, see:
Tuning IBM System x Servers for Performance, SG24-5287
IBM System Storage Solutions Handbook, SG24-5250
Introduction to Storage Area Networks, SG24-5470
1.5 Network subsystem
The network subsystem is another important subsystem from the performance perspective.
Networking operations interact with many components other than Linux itself, such as
switches, routers, gateways, PC clients, and so on. Although these components may be out of
the control of Linux, they have a large influence on overall performance. Keep in mind that you
have to work closely with the people working on the network system.
Here we mainly focus on how Linux handles networking operations.
1.5.1 Networking implementation
The TCP/IP protocol has a layered structure similar to the OSI layer model. The Linux kernel
networking implementation employs a similar approach. Figure 1-23 illustrates the layered
Linux TCP/IP stack and gives a quick overview of TCP/IP communication.
Figure 1-23 Network layered structure and quick overview of networking operation
Linux uses a socket interface for TCP/IP networking operations, as many UNIX systems do.
The socket provides an interface for user applications. We take a quick look at the sequence
of fundamental operations that occur during network data transfer.
1. When an application sends data to its peer host, the application creates the data.
2. The application opens the socket and writes the data through the socket interface.
3. The socket buffer is used to deal with the transferred data. The socket buffer has a
reference to the data, and the data goes down through the layers.
4. In each layer, appropriate operations such as parsing the headers, adding and modifying
the headers, checksums, routing operations, fragmentation, and so on are performed. When the
socket buffer goes down through the layers, the data itself is not copied between the
layers. Because copying actual data between different layers is not effective, the kernel
avoids unnecessary overhead by just changing the reference in the socket buffer and
passing it to the next layer.
5. Finally, the data goes out on the wire from the network interface card.
6. The Ethernet frame arrives at the network interface of the peer host.
7. The frame is moved into the network interface card buffer if the MAC address matches the
MAC address of the interface card.
8. The network interface card eventually moves the packet into a socket buffer and issues a
hard interrupt at the CPU.
9. The CPU then processes the packet and moves it up through the layers until it arrives at
(for example) a TCP port of an application such as Apache.
Socket buffer
As we stated before, the kernel uses buffers to send and receive data. Figure 1-24 shows the
configurable buffers which can be used for networking. They can be tuned through files in
/proc/sys/net. Changing them can sometimes have an effect on network performance. We cover
the details in 4.7.4, “Increasing network buffers” on page 127.
Figure 1-24 Socket buffer memory allocation (tcp_mem, tcp_rmem, tcp_wmem, rmem_max, wmem_max)
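As a minimal sketch (the value shown is a placeholder, not a recommendation; see 4.7.4 before changing anything), the buffer limits shown in Figure 1-24 live under /proc/sys/net and can be inspected or changed with sysctl:

   sysctl net.core.rmem_max net.core.wmem_max        (maximum socket receive/send buffer sizes)
   sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem        (min/default/max TCP receive/send buffers)
   sysctl -w net.core.rmem_max=262144                (example change only)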
Network API (NAPI)
The network subsystem has undergone some changes with the introduction of the new
network API (NAPI). The standard implementation of the network stack in Linux focuses more
on reliability and low latency than on low overhead and high throughput. While these
characteristics are favorable when creating a firewall, most enterprise applications such as
file and print or databases will perform more slowly than a similar installation under
Windows®.
In the traditional approach of handling network packets, as depicted by the blue arrows in
Figure 1-25, the network interface card eventually moves the packet into a network buffer of
the operating system's kernel and issues a hard interrupt at the CPU, as we stated before.
This is only a simplified view of the process of handling network packets, but it illustrates one
of the shortcomings of this very approach. As you have realized, every time an Ethernet
frame with a matching MAC address arrives at the interface, there will be a hard interrupt.
Whenever a CPU has to handle a hard interrupt, it has to stop processing whatever it was
working on and handle the interrupt, causing a context switch and the associated flush of the
processor cache. While one might think that this is not a problem if only a few packets arrive
at the interface, Gigabit Ethernet and modern applications can create thousands of packets
per second, causing a vast number of interrupts and context switches to occur.
Because of this, NAPI was introduced to counter the overhead associated with processing
network traffic. For the first packet, NAPI works just like the traditional implementation as it
issues an interrupt for the first packet. But after the first packet, the interface goes into a
polling mode: As long as there are packets in the DMA ring buffer of the network interface, no
new interrupts will be caused, effectively reducing context switching and the associated
overhead. Should the last packet be processed and the ring buffer be emptied, then the
interface card will again fall back into the interrupt mode we explored earlier. NAPI also has
the advantage of improved multiprocessor scalability by creating soft interrupts that can be
handled by multiple processors. While NAPI would be a vast improvement for most enterprise
class multiprocessor systems, it requires NAPI-enabled drivers. There is significant room for
tuning, as we will explore in the tuning section of this Redpaper.
Netfilter
Linux has an advanced firewall capability as a part of the kernel. This capability is provided by
the Netfilter modules. You can manipulate and configure Netfilter using the iptables utility.
Generally speaking, Netfilter provides the following functions:
Packet filtering: If a packet matches a rule, Netfilter accepts or drops the packet or takes
another appropriate action based on the defined rules.
Address translation: If a packet matches a rule, Netfilter alters the packet to meet the
address translation requirements.
Matching filters can be defined with the following properties:
Network interface
IP address, IP address range, subnet
Protocol
ICMP type
Port
TCP flag
State (refer to “Connection tracking” on page 30)
Figure 1-26 gives an overview of how packets traverse the Netfilter chains, which are the lists of
defined rules applied at each point in sequence.
Figure 1-26 Netfilter packet flow (PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING chains)
Netfilter takes an appropriate action if a packet matches a rule. The action is called a target.
Some of the possible targets are:
ACCEPT    Accept the packet and let it through.
DROP      Silently discard the packet.
REJECT    Discard the packet and send back a reply packet, such as an ICMP port unreachable message.
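As a brief, hedged illustration (the port and subnet are placeholders), rules that combine a matching filter with one of these targets are added with the iptables utility:

   iptables -A INPUT -p tcp --dport 22 -s 192.168.1.0/24 -j ACCEPT   (accept SSH from one subnet)
   iptables -A INPUT -p tcp --dport 22 -j DROP                       (silently drop other SSH traffic)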
Connection tracking
To achieve more sophisticated firewall capability, Netfilter employs a connection tracking
mechanism which keeps track of the state of all network traffic. Using the TCP connection
state (refer to “Connection establishment” on page 30) and other network properties (such as
IP address, port, protocol, sequence number, ack number, ICMP type, and so on), Netfilter
classifies each packet into the following four states:
NEW           packet attempting to establish a new connection
ESTABLISHED   packet that goes through an established connection
RELATED       packet which is related to previous packets
INVALID       packet whose state is unknown due to a malformed or invalid packet
In addition, Netfilter can use separate modules to perform more detailed connection tracking
by analyzing protocol specific properties and operations. For example, there are connection
tracking modules for FTP, NetBIOS, TFTP, IRC, and so on.
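For example (a sketch), these states can be used directly in filter rules through the iptables state match:

   iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
   iptables -A INPUT -m state --state INVALID -j DROP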
1.5.2 TCP/IP
TCP/IP has been the default network protocol for many years. The Linux TCP/IP implementation is
fairly compliant with its standards. For better performance tuning, you should be familiar with
basic TCP/IP networking.
For additional detail refer to the following documentation:
TCP/IP Tutorial and Technical Overview, SG24-3376.
Connection establishment
Before application data is transferred, the connection must be established between the client
and server. The connection establishment process is called the TCP/IP three-way handshake.
Figure 1-27 on page 31 outlines the basic connection establishment and termination process.
1. A client sends a SYN packet (a packet with the SYN flag set) to its peer server to request a
connection.
2. The server receives the packet and sends back a SYN+ACK packet.
3. The client then sends an ACK packet to its peer to complete connection establishment.
Once the connection is established, the application data can be transferred through the
connection. When all data has been transferred, the connection closing process starts.
1. The client sends a FIN packet to the server to start the connection termination process.
2. The server sends the acknowledgement of the FIN back and then sends the FIN packet to
the client if it has no data to send to the client.
3. Then the client sends an ACK packet to the server to complete connection termination.
Figure 1-27 TCP 3-way handshake
The state of a connection changes during the session. Figure 1-28 on page 32 shows the
TCP/IP connection state diagram.
Figure 1-28 TCP connection state diagram
You can see the connection state of each TCP/IP session using the netstat command. For more
detail, see 2.3.11, “netstat” on page 53.
Traffic control
The TCP/IP implementation has a mechanism that ensures efficient data transfer and guarantees
packet delivery even in times of poor network transmission quality and congestion.
TCP/IP transfer window
The principle of transfer windows is an important aspect of the TCP/IP implementation in the
Linux operating system in regard to performance. Very simplified, the TCP transfer window is
the maximum amount of data a given host can send or receive before requiring an
acknowledgement from the other side of the connection. The window size is offered from the
receiving host to the sending host by the window size field in the TCP header. Using the
transfer window, the host can send packets more effectively because the sending host does not
have to wait for an acknowledgement for each packet it sends. This enables the network to be
utilized better. Delayed acknowledgement also improves efficiency. TCP windows start small
and increase slowly with every successful acknowledgement from the other side of the
connection. To optimize window size, see 4.7.4, “Increasing network buffers” on page 127.
Figure 1-29 Sliding window and delayed ack
As an option, high-speed networks may use a technique called window scaling to increase
the maximum transfer window size even more. We will analyze the effects of these
implementations in more detail in “Tuning TCP options” on page 132.
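As a quick, hedged check (see “Tuning TCP options” on page 132 before changing anything), window scaling is controlled by a tunable under /proc/sys/net:

   sysctl net.ipv4.tcp_window_scaling                (1 means window scaling is enabled)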
Retransmission
During connection establishment and termination and data transfer, many timeouts and data
retransmissions can occur for various reasons (a faulty network interface, a slow router,
network congestion, a buggy network implementation, and so on). TCP/IP handles this situation
by queuing packets and trying to send packets several times.
You can change some of this kernel behavior by configuring parameters. You may want to
increase the number of attempts for the TCP SYN connection establishment packet on a
network with a high rate of packet loss. You can also change some of the timeout thresholds
through files under /proc/sys/net. For more information, see “Tuning TCP behavior” on page 131.
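For example (a sketch; the value is a placeholder and should be validated against “Tuning TCP behavior” on page 131), the SYN retry count is exposed as one of these /proc/sys/net tunables:

   sysctl net.ipv4.tcp_syn_retries                   (current number of SYN retransmission attempts)
   sysctl -w net.ipv4.tcp_syn_retries=8              (example change only)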
1.5.3 Offload
If the network adapter on your system supports hardware offload functionality, the kernel can
offload part of its task to the adapter, which can reduce CPU utilization.
Checksum offload
The IP/TCP/UDP checksum is calculated to make sure that the packet was correctly
transferred, by comparing the value of the checksum field in the protocol headers with the
values calculated from the packet data. With checksum offload, the adapter performs this
calculation instead of the kernel.
TCP segmentation offload (TSO)
When data larger than the supported maximum transmission unit (MTU) is sent to the
network adapter, the data must be divided into MTU-sized packets. With TSO, the adapter
takes care of that on behalf of the kernel.
For more advanced network features, refer to the redbook Tuning IBM System x Servers for
Performance, SG24-5287, section 10.3, “Advanced network features.”
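As a hedged example (eth0 is a placeholder interface; support depends on the adapter and driver), the offload settings can be displayed and toggled with the ethtool utility:

   ethtool -k eth0                                   (show checksum and segmentation offload settings)
   ethtool -K eth0 tso on                            (enable TCP segmentation offload, if supported)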
1.5.4 Bonding module
The Linux kernel provides network interface aggregation capability by using a bonding driver.
This is a device independent bonding driver, while there are device specific drivers as well.
The bonding driver supports the IEEE 802.3ad link aggregation specification, as well as some
original load balancing and fault tolerant implementations. It provides a higher level of availability
and a performance improvement. Please refer to the kernel documentation in
Documentation/networking/bonding.txt for details.
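As a minimal sketch (the interface names, bond name, mode, and address are placeholders; distributions differ in where this configuration is kept persistently), a bonded interface is typically created by loading the bonding driver with a mode and enslaving the physical interfaces:

   modprobe bonding mode=802.3ad miimon=100          (load the driver; mode and link-monitor interval are examples)
   ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
   ifenslave bond0 eth0 eth1                         (attach the physical interfaces to the bond)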
1.6 Understanding Linux performance metrics
Before we can look at the various tuning parameters and performance measurement utilities
in the Linux operating system, it makes sense to discuss various available metrics and their
meaning in regard to system performance. Because this is an open source operating system,
a significant number of performance measurement tools are available. The tool you ultimately
choose will depend upon your personal liking and the amount of data and detail you require.
Even though numerous tools are available, all performance measurement utilities measure
the same metrics, so understanding the metrics enables you to use whatever utility you come
across. Therefore, we cover only the most important metrics, understanding that many more
detailed values are available that might be useful for detailed analysis beyond the scope of
this paper.
1.6.1 Processor metrics
CPU utilization
This is probably the most straightforward metric. It describes the overall utilization per
processor. On IBM System x architectures, if the CPU utilization exceeds 80% for a
sustained period of time, a processor bottleneck is likely.
User time
Depicts the CPU percentage spent on user processes, including nice time. High values in
user time are generally desirable because, in this case, the system performs actual work.
System time
Depicts the CPU percentage spent on kernel operations including IRQ and softirq time.
High and sustained system time values can point you to bottlenecks in the network and
driver stack. A system should generally spend as little time as possible in kernel time.
Waiting
Total amount of CPU time spent waiting for an I/O operation to occur. Like the blocked
value, a system should not spend too much time waiting for I/O operations; otherwise you
should investigate the performance of the respective I/O subsystem.
Idle time
Depicts the CPU percentage the system was idle waiting for tasks.
Nice time
Depicts the CPU percentage spent running user processes that have been re-niced, that is,
that run with a changed execution priority.
Load average
The load average is not a percentage, but the rolling average of the sum of the following:
– the number of processes in the queue waiting to be processed
– the number of processes waiting for an uninterruptable task to be completed
That is, it is the average of the sum of the processes in the TASK_RUNNING and TASK_UNINTERRUPTIBLE
states. If processes that request CPU time are blocked (which means that the CPU has
no time to process them), the load average will increase. On the other hand, if each
process gets immediate access to CPU time and there are no CPU cycles lost, the load
will decrease.
Runnable processes
This value depicts the processes that are ready to be executed. This value should not
exceed 10 times the amount of physical processors for a sustained period of time;
otherwise a processor bottleneck is likely.
Blocked
Processes that cannot execute as they are waiting for an I/O operation to finish. Blocked
processes can point you toward an I/O bottleneck.
Context switch
Amount of switches between threads that occur on the system. High numbers of context
switches in connection with a large number of interrupts can signal driver or application
issues. Context switches generally are not desirable because the CPU cache is flushed
with each one, but some context switching is necessary. Refer to 1.1.5, “Context
switching” on page 6.
Interrupts
The interrupt value contains hard interrupts and soft interrupts; hard interrupts have more
of an adverse effect on system performance. High interrupt values are an indication of a
software bottleneck, either in the kernel or a driver. Remember that the interrupt value
includes the interrupts caused by the CPU clock. Refer to 1.1.6, “Interrupt handling” on
page 6
1.6.2 Memory metrics
Free memory
Compared to most other operating systems, the free memory value in Linux should not be
a cause for concern. As explained in 1.2.2, “Virtual memory manager” on page 13, the
Linux kernel allocates most unused memory as file system cache, so subtract the amount
of buffers and cache from the used memory to determine (effectively) free memory.
Swap usage
This value depicts the amount of swap space used. As described in 1.2.2, “Virtual memory
manager” on page 13, swap usage only tells you that Linux manages memory really
efficiently. Swap In/Out is a reliable means of identifying a memory bottleneck. Values
above 200 to 300 pages per second for a sustained period of time express a likely memory
bottleneck.
Buffer and cache
Cache allocated as file system and block device cache.
Slabs
Depicts the kernel usage of memory. Note that kernel pages cannot be paged out to disk.
Active versus inactive memory
Provides you with information about the active use of the system memory. Inactive
memory is a likely candidate to be swapped out to disk by the kswapd daemon. Refer to
“Page frame reclaiming” on page 14.
1.6.3 Network interface metrics
Packets received and sent
This metric informs you of the quantity of packets received and sent by a given network
interface.
Bytes received and sent
This value depicts the number of bytes received and sent by a given network interface.
Collisions per second
This value provides an indication of the number of collisions that occur on the network the
respective interface is connected to. Sustained collision values often indicate a
bottleneck in the network infrastructure, not the server. On most properly configured
networks, collisions are very rare unless the network infrastructure consists of hubs.
Packets dropped
This is a count of packets that have been dropped by the kernel, either due to a firewall
configuration or due to a lack in network buffers.
Overruns
Overruns represent the number of times that the network interface ran out of buffer space.
This metric should be used in conjunction with the packets dropped value to identify a
possible bottleneck in network buffers or the network queue length.
Errors
The number of frames marked as faulty. This is often caused by a network mismatch or a
partially broken network cable. Partially broken network cables can be a significant
performance issue for copper-based Gigabit networks.
1.6.4 Block device metrics
Iowait
Time the CPU spends waiting for an I/O operation to occur. High and sustained values
most likely indicate an I/O bottleneck.
Average queue length
Amount of outstanding I/O requests. In general, a disk queue of 2 to 3 is optimal; higher
values might point toward a disk I/O bottleneck.
Average wait
A measurement of the average time in ms it takes for an I/O request to be serviced. The
wait time consists of the actual I/O operation and the time it waited in the I/O queue.
Transfers per second
Depicts how many I/O operations per second are performed (reads and writes). The
transfers per second metric in conjunction with the kBytes per second value helps you to
identify the average transfer size of the system. The average transfer size generally should
match the stripe size used by your disk subsystem.
Blocks read/write per second
This metric depicts the reads and writes per second expressed in blocks of 1024 bytes as
of kernel 2.6. Earlier kernels may report different block sizes, from 512 bytes to 4 KB.
Kilobytes per second read/write
Reads and writes from/to the block device in kilobytes represent the amount of actual data
transferred to and from the block device.
Chapter 2. Monitoring and benchmark tools
The open and flexible nature of the Linux operating system has led to a significant number of
performance monitoring tools. Some of them are Linux versions of well-known UNIX utilities,
and others were specifically designed for Linux. The fundamental support for most Linux
performance monitoring tools lies in the virtual proc file system. To measure performance, we
also have to use appropriate benchmark tools.
In this chapter we outline a selection of Linux performance monitoring tools, discuss
useful commands, and introduce some useful benchmark tools. It is up to the
reader to select the utilities that best achieve the performance monitoring task.
Most of the monitoring tools we discuss ship with Enterprise Linux distributions.
2.1 Introduction
The Enterprise Linux distributions ship with many monitoring tools. Some of them cover
many metrics in a single tool and give well-formatted output for easy understanding of
system activities. Others are specific to certain performance metrics (for example, disk I/O) and
give detailed information.
Being familiar with these tools will help enhance your understanding of what is going on in the
system and help you find the possible causes of a performance problem.
2.2 Overview of tool function
Table 2-1 lists the function of the monitoring tools covered in this chapter.
Table 2-1 Linux performance monitoring tools
Tool                     Most useful tool function
top                      Process activity
vmstat                   System activity, hardware and system information
KDE system guard         Real-time systems reporting and graphing
Gnome System Monitor     Real-time systems reporting and graphing

Table 2-2 lists the function of the benchmark tools covered in this chapter.
Table 2-2 Benchmark tools
Tool                     Most useful tool function
lmbench                  Microbenchmark for operating system functions
iozone                   File system benchmark
netperf                  Network performance benchmark
2.3 Monitoring tools
In this section, we discuss the monitoring tools. Most of them come with Enterprise Linux
distributions. Becoming familiar with these tools gives you a better understanding of system
behavior and helps with performance tuning.
2.3.1 top
The top command shows actual process activity. By default, it displays the most
CPU-intensive tasks running on the server and updates the list every five seconds. You can
sort the processes by PID (numerically), age (newest first), time (cumulative time), resident
memory usage, and time (time the process has occupied the CPU since startup).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13737 root 17 0 1760 896 1540 R 0.7 0.2 0:00.05 top
238 root 5 -10 0 0 0 S 0.3 0.0 0:01.56 reiserfs/0
1 root 16 0 588 240 444 S 0.0 0.0 0:05.70 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
5 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
6 root 5 -10 0 0 0 S 0.0 0.0 0:00.02 events/0
7 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 events/1
8 root 5 -10 0 0 0 S 0.0 0.0 0:00.09 kblockd/0
9 root 5 -10 0 0 0 S 0.0 0.0 0:00.01 kblockd/1
10 root 15 0 0 0 0 S 0.0 0.0 0:00.00 kirqd
13 root 5 -10 0 0 0 S 0.0 0.0 0:00.02 khelper/0
14 root 16 0 0 0 0 S 0.0 0.0 0:00.45 pdflush
16 root 15 0 0 0 0 S 0.0 0.0 0:00.61 kswapd0
17 root 13 -10 0 0 0 S 0.0 0.0 0:00.00 aio/0
18 root 13 -10 0 0 0 S 0.0 0.0 0:00.00 aio/1
You can further modify the processes using renice to give a new priority to each process. If a
process hangs or occupies too much CPU, you can kill the process (kill command).
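For example (a hedged sketch; <pid> is a placeholder for a PID taken from the listing), a process can be re-prioritized or terminated from another shell:

   renice 10 -p <pid>                                (change the nice level of the process to 10)
   kill <pid>                                        (terminate the process; kill -9 forces it)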
The columns in the output are:
PID           Process identification.
USER          Name of the user who owns (and perhaps started) the process.
PRI           Priority of the process. (See 1.1.4, “Process priority and nice level” on page 5 for details.)
NI            Niceness level (that is, whether the process tries to be nice by adjusting the priority by the number given; see below for details).
SIZE          Amount of memory (code+data+stack) used by the process in kilobytes.
RSS           Amount of physical RAM used, in kilobytes.
SHARE         Amount of memory shared with other processes, in kilobytes.
STAT          State of the process: S=sleeping, R=running, T=stopped or traced, D=uninterruptible sleep, Z=zombie. The process state is discussed further in 1.1.7, “Process state”.
%CPU          Share of the CPU usage (since the last screen update).
%MEM          Share of physical memory.
TIME          Total CPU time used by the process (since it was started).
COMMAND       Command line used to start the task (including parameters).
The top utility supports several useful hot keys, including:
t Displays summary information off and on.
m Displays memory information off and on.
A Sorts the display by top consumers of various system resources. Useful for
quick identification of performance-hungry tasks on a system.
f             Enters an interactive configuration screen for top. Helpful for setting up top for a specific task.
o             Enables you to interactively select the ordering within top.
r             Issues the renice command.
k             Issues the kill command.

2.3.2 vmstat
vmstat provides information about processes, memory, paging, block I/O, traps, and CPU
activity. The vmstat command displays either average data or actual samples. The sampling
mode is enabled by providing vmstat with a sampling frequency and a sampling duration.
Attention: In sampling mode consider the possibility of spikes between the actual data
collection points. Using a shorter sampling interval may help to catch such hidden spikes.
Note: The first data line of the vmstat report shows averages since the last reboot, so it
should be eliminated.
The columns in the output are as follows:
Process (procs)
  r: The number of processes waiting for run time.
  b: The number of processes in uninterruptable sleep.
Memory
  swpd: The amount of virtual memory used (KB).
  free: The amount of idle memory (KB).
  buff: The amount of memory used as buffers (KB).
  cache: The amount of memory used as cache (KB).
Swap
  si: Amount of memory swapped in from disk (KBps).
  so: Amount of memory swapped out to disk (KBps).
IO
  bi: Blocks received from a block device (blocks/s).
  bo: Blocks sent to a block device (blocks/s).
System
  in: The number of interrupts per second, including the clock.
  cs: The number of context switches per second.
CPU (% of total CPU time)
  us: Time spent running non-kernel code (user time, including nice time).
  sy: Time spent running kernel code (system time).
  id: Time spent idle. Prior to Linux 2.5.41, this included I/O-wait time.
  wa: Time spent waiting for IO. Prior to Linux 2.5.41, this appeared as zero.
The vmstat command supports a vast number of command line parameters that are fully
documented in the man pages for vmstat. Some of the more useful flags include:
-m            displays the memory utilization of the kernel (slabs).
-a            provides information about active and inactive memory pages.
-n            displays only one header line, useful if running vmstat in sampling mode and
              piping the output to a file. (For example, root#vmstat –n 2 10 generates vmstat
              10 times with a sampling rate of two seconds.)
When used with the –p {partition} flag, vmstat also provides I/O statistics.

2.3.3 uptime
The uptime command can be used to see how long the server has been running and how
many users are logged on, as well as for a quick overview of the average load of the server
(Refer to 1.6.1, “Processor metrics” on page 34). The system load average is displayed for
the past 1-minute, 5-minute, and 15-minute intervals.
The optimal value of the load is 1, which means that each process has immediate access to
the CPU and there are no CPU cycles lost. The typical load can vary from system to system:
For a uniprocessor workstation, 1 or 2 might be acceptable, whereas you will probably see
values of 8 to 10 on multiprocessor servers.
You can use uptime to pinpoint a problem with your server or the network. For example, if a
network application is running poorly, run uptime and you will see whether the system load is
high. If not, the problem is more likely to be related to your network than to your server.
Tip: You can use w instead of uptime. w also provides information about who is currently
logged on to the machine and what the user is doing.
Example 2-3 Sample output of uptime
1:57am up 4 days 17:05, 2 users, load average: 0.00, 0.00, 0.00
2.3.4 ps and pstree
The ps and pstree commands are some of the most basic commands when it comes to
system analysis. ps accepts three different styles of command options: UNIX style, BSD style,
and GNU style. Here we use UNIX style options.
The ps command provides a list of existing processes. The top command shows process
information as well, but ps provides more detailed information. The number of processes
listed depends on the options used. A simple ps -A command lists all processes with their
respective process ID (PID) that can be crucial for further investigation. A PID number is
necessary to use tools such as pmap or renice.
On systems running Java™ applications, the output of a ps -A command might easily fill up
the display to the point where it is difficult to get a complete picture of all running processes.
In this case, the pstree command might come in handy as it displays the running processes
in a tree structure and consolidates spawned subprocesses (for example, Java threads). The
pstree command can be very helpful to identify originating processes. Another ps
variant, pgrep, might be useful as well.
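For example (a sketch; the process name is a placeholder), pgrep looks up processes by name and prints their PIDs:

   pgrep -l java                                     (list the PIDs and names of all processes matching "java")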
2.3.5 free
The command /bin/free displays information about the total amounts of free and used
memory (including swap) on the system. It also includes information about the buffers and
cache used by the kernel.
When using free, remember the Linux memory architecture and the way the virtual memory
manager works. The amount of free memory in itself is of limited use, and the pure utilization
statistics of swap are no indication for a memory bottleneck.
Figure 2-1 on page 47 depicts basic idea of what free command output shows.
#free -m
             total       used       free     shared    buffers     cached
Mem:          4092       3270        826          0         36       1482
-/+ buffers/cache:       1748       2344
Swap:         4096          0       4096

Mem:               used = Used + Buffer + Cache, free = Free
-/+ buffers/cache: used = Used, free = Free + Buffer + Cache
Figure 2-1 free command output
Useful parameters for the free command include:
-b, -k, -m, -g   display values in bytes, kilobytes, megabytes, and gigabytes.
-l               distinguishes between low and high memory (refer to 1.2, “Linux memory architecture” on page 11).
-c <count>       displays the free output <count> number of times.
Memory used in a zone
Using the -l option, you can see how much memory is used in each memory zone.
Example 2-8 and Example 2-9 show example free -l output for a 32-bit and a 64-bit
system. Notice that the 64-bit system no longer uses high memory.
Example 2-8 Example output from the free command on 32 bit version kernel
Swap: 2031608 332 2031276
You can also determine how many chunks of memory are available in each zone using the
/proc/buddyinfo file. Each column of numbers shows the number of chunks of that order
which are available. In Example 2-10, there are 5 chunks of 2^2*PAGE_SIZE available in
ZONE_DMA, and 16 chunks of 2^4*PAGE_SIZE available in ZONE_DMA32. Remember how
the buddy system allocates pages (refer to “Buddy system” on page 14). This information shows
you how fragmented memory is and gives you an idea of how many pages you can safely
allocate.
Example 2-10 Buddy system information for 64 bit system
2.3.6 iostat
The iostat command shows average CPU times since the system was started (similar to
uptime). It also creates a report of the activities of the disk subsystem of the server in two
parts: CPU utilization and device (disk) utilization. To use iostat to perform detailed I/O
bottleneck and performance tuning, see 3.4.1, “Finding disk bottlenecks” on page 84. The
iostat utility is part of the sysstat package.
The CPU utilization report has these sections:
%user      Shows the percentage of CPU utilization that was taken up while executing at the user level (applications).
%nice      Shows the percentage of CPU utilization that was taken up while executing at the user level with a nice priority. (Priority and nice levels are described in 2.3.7, “nice, renice” on page 67.)
%sys       Shows the percentage of CPU utilization that was taken up while executing at the system level (kernel).
%idle      Shows the percentage of time the CPU was idle.
The device utilization report has these sections:
Device     The name of the block device.
tps        The number of transfers per second (I/O requests per second) to the device.
           Multiple single I/O requests can be combined in a transfer request, because
           a transfer request can have different sizes.
Blk_read/s, Blk_wrtn/s
           Blocks read and written per second indicate data read from or written to the
           device per second. Blocks may also have different sizes. Typical sizes are
           1024, 2048, and 4096 bytes, depending on the partition size. For example,
           the block size of /dev/sda1 can be found with:
           dumpe2fs -h /dev/sda1 |grep -F "Block size"
           This produces output similar to:
           dumpe2fs 1.34 (25-Jul-2003)
           Block size: 1024
Blk_read, Blk_wrtn
           Indicates the total number of blocks read and written since the boot.
The iostat command can take many options. From the performance perspective, the most
useful is the -x option, which displays extended statistics. The following is sample output.
Example 2-12 iostat -x extended statistics display
rrqm/s, wrqm/s   The number of read/write requests merged per second that were issued to
                 the device. Multiple single I/O requests can be merged in a transfer request,
                 because a transfer request can have different sizes.
r/s, w/s         The number of read/write requests that were issued to the device per second.
rsec/s, wsec/s   The number of sectors read/written from the device per second.
rkB/s, wkB/s     The number of kilobytes read/written from the device per second.
avgrq-sz         The average size of the requests that were issued to the device. This value is
                 displayed in sectors.
avgqu-sz         The average queue length of the requests that were issued to the device.
await            The average time (in milliseconds) for I/O requests issued to the device to be
                 served. This includes the time the requests spent waiting in the queue and the
                 time spent servicing them.
svctm            The average service time (in milliseconds) for I/O requests that were issued
                 to the device.
%util            Percentage of CPU time during which I/O requests were issued to the device
                 (bandwidth utilization for the device). Device saturation occurs when this
                 value is close to 100%.
It may be very useful to calculate the average I/O size in order to tailor a disk subsystem
towards the access pattern. The following example is the output of using iostat with the -d
and -x flag in order to display only information about the disk subsystem of interest:
Example 2-13 Using iostat -x -d to analyze the average I/O size
The iostat output in Example 2-13 shows that the device dasdc had to write 12300.99 kB of
data per second, as displayed under the kB_wrtn/s heading. This amount of data was
sent to the disk subsystem in 2502.97 I/Os, as shown under w/s in the example above.
The average I/O size, or average request size, is displayed under avgrq-sz and is 9.83 blocks
of 512 bytes in our example. For asynchronous writes the average I/O size is usually some odd
number. However, most applications perform read and write I/O in multiples of 4 KB (for
instance 4 KB, 8 KB, 16 KB, 32 KB, and so on). In the example above, the application was issuing
nothing but random write requests of 4 KB, yet iostat shows an average request size of
4.915 KB. The difference is caused by the Linux file system: even though we were
performing random writes, it found some I/Os that could be merged together for more efficient
flushing out to the disk subsystem.
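Using the values above, the average I/O size can be derived directly from the iostat columns:

   average I/O size = kB_wrtn/s ÷ w/s = 12300.99 ÷ 2502.97 ≈ 4.91 KB
   equivalently, avgrq-sz × 512 bytes = 9.83 × 512 ≈ 5033 bytes ≈ 4.9 KB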
Note: When using the default async mode for file systems, only the average request size
displayed in iostat is correct. Even though applications perform write requests at distinct
sizes, the I/O layer of Linux will most likely merge and hence alter the average I/O size.
2.3.7 sar
The sar command is used to collect, report, and save system activity information. The sar
command consists of three applications: sar, which displays the data, and sa1 and sa2, which
are used for collecting and storing the data. The sar tool features a wide range of options so
be sure to check the man page for it. The sar utility is part of the sysstat package.
With sa1 and sa2, the system can be configured to get information and log it for later analysis.
Tip: We suggest that you have sar running on most if not all of your systems. In case of a
performance problem, you will have very detailed information at hand at very small
overhead and no additional cost.
To accomplish this, add the lines to /etc/crontab (Example 2-14). Keep in mind that a default
cron job running sar daily is set up automatically after installing sar on your system.
Example 2-14 Example of starting automatic log reporting with cron
# 8am-7pm activity reports every 10 minutes during weekdays.
*/10 8-18 * * 1-5 /usr/lib/sa/sa1 600 6 &
# 7pm-8am activity reports every hour during weekdays.
0 19-7 * * 1-5 /usr/lib/sa/sa1 &
# Activity reports every hour on Saturday and Sunday.
0 * * * 0,6 /usr/lib/sa/sa1 &
# Daily summary prepared at 19:05
5 19 * * * /usr/lib/sa/sa2 -A &
The raw data for the sar tool is stored under /var/log/sa/ where the various files represent the
days of the respective month. To examine your results, select the weekday of the month and
the requested performance data. For example, to display the network counters from the 21st,
use the command sar -n DEV -f sa21 and pipe it to less as in Example 2-15.
Example 2-15 Displaying system statistics with sar
[root@linux sa]# sar -n DEV -f sa21 | less
Linux 2.6.9-5.ELsmp (linux.itso.ral.ibm.com) 04/21/2005
12:00:01 AM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
12:10:01 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:10:01 AM eth0 1.80 0.00 247.89 0.00 0.00 0.00 0.00
12:10:01 AM eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
You can also use sar to run near-real-time reporting from the command line (Example 2-16).
Example 2-16 Ad hoc CPU monitoring
[root@x232 root]# sar -u 3 10
Linux 2.4.21-9.0.3.EL (x232) 05/22/2004
02:10:40 PM CPU %user %nice %system %idle
02:10:43 PM all 0.00 0.00 0.00 100.00
02:10:46 PM all 0.33 0.00 0.00 99.67
02:10:49 PM all 0.00 0.00 0.00 100.00
02:10:52 PM all 7.14 0.00 18.57 74.29
02:10:55 PM all 71.43 0.00 28.57 0.00
02:10:58 PM all 0.00 0.00 100.00 0.00
02:11:01 PM all 0.00 0.00 0.00 0.00
02:11:04 PM all 0.00 0.00 100.00 0.00
02:11:07 PM all 50.00 0.00 50.00 0.00
02:11:10 PM all 0.00 0.00 100.00 0.00
Average: all 1.62 0.00 3.33 95.06
From the collected data, you see a detailed overview of CPU utilization (%user, %nice,
%system, %idle), memory paging, network I/O and transfer statistics, process creation
activity, activity for block devices, and interrupts/second over time.

2.3.8 mpstat
The mpstat command is used to report the activities of each of the available CPUs on a
multiprocessor server. Global average activities among all CPUs are also reported. The
mpstat utility is part of the sysstat package.
The mpstat utility enables you to display overall CPU statistics per system or per processor.
mpstat also enables the creation of statistics when used in sampling mode analogous to the
vmstat command with a sampling frequency and a sampling count. Example 2-17 shows a
sample output created with mpstat -P ALL to display average CPU utilization per processor.
Example 2-17 Output of mpstat command on multiprocessor system
[root@linux ~]# mpstat -P ALL
Linux 2.6.9-5.ELsmp (linux.itso.ral.ibm.com) 04/22/2005
For the complete syntax of the mpstat command, issue:
mpstat -?
2.3.9 numastat
With Non-Uniform Memory Architecture (NUMA) systems such as the IBM System x 3950,
NUMA architectures have become mainstream in enterprise data centers. However, NUMA
systems introduce new challenges to the performance tuning process: Topics such as
memory locality were of no interest until NUMA systems arrived. Luckily, Enterprise Linux
distributions provide a tool for monitoring the behavior of NUMA architectures. The numastat
command provides information about the ratio of local versus remote memory usage and the
overall memory configuration of all nodes. Failed allocations of local memory as displayed in
the numa_miss column and allocations of remote memory (slower memory) as displayed in
the numa_foreign column should be investigated. Excessive allocation of remote memory will
increase system latency and most likely decrease overall performance. Binding processes to
a node with the memory map in the local RAM will most likely improve performance.
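As a hedged sketch (the node number and command are placeholders; the numactl utility must be installed), such binding can be done as follows:

   numactl --cpunodebind=0 --membind=0 <command>     (run <command> on node 0 and allocate only node 0 memory)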
Example 2-19 Sample output of the numastat command
2.3.10 pmap
The pmap command reports the amount of memory that one or more processes are using. You
can use this tool to determine which processes on the server are being allocated memory and
whether this amount of memory is a cause of memory bottlenecks. For detailed information,
use pmap -d option.
pmap -d <pid>
Example 2-20 Process memory information the init process is using
Some of the most important information is at the bottom of the display. The line shows:
mapped:           total amount of memory mapped to files used in the process
writable/private: the amount of private address space this process is taking
shared:           the amount of address space this process is sharing with others
You can also take a look at the address spaces where the information is stored. You can find
an interesting difference when you issue the pmap command on 32-bit and 64-bit systems. For
the complete syntax of the pmap command, issue:
pmap -?

2.3.11 netstat
netstat is one of the most popular tools. If you work on the network, you should be familiar
with this tool. It displays a lot of network related information such as socket usage, routing,
interface, protocol, network statistics, and so on. Here are some of the basic options:
-a        Show all socket information
-r        Show routing information
-i        Show network interface statistics
-s        Show network protocol statistics
There are many other useful options; refer to the man page for details. The following example
displays sample output of socket information.
Example 2-21 Showing socket information with netstat
Proto             The protocol (tcp, udp, raw) used by the socket.
Recv-Q            The count of bytes not copied by the user program connected to this
                  socket.
Send-Q            The count of bytes not acknowledged by the remote host.
Local Address     Address and port number of the local end of the socket. Unless the
                  --numeric (-n) option is specified, the socket address is resolved to its
                  canonical host name (FQDN), and the port number is translated into the
                  corresponding service name.
Foreign Address   Address and port number of the remote end of the socket.
State             The state of the socket. Since there are no states in raw mode and
                  usually no states used in UDP, this column may be left blank. For possible
                  states, see Figure 1-28, “TCP connection state diagram” on page 32 and
                  the man page.
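As a small, hedged example of working with this output, the following one-liner (netstat
combined with standard text utilities) summarizes how many TCP sockets are currently in
each state:
netstat -an | grep ^tcp | awk '{print $6}' | sort | uniq -c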
2.3.12 iptraf
iptraf monitors TCP/IP traffic in real time and generates real-time reports. It shows TCP/IP
traffic statistics by session, by interface, and by protocol. The iptraf utility is provided by the
iptraf package.
iptraf provides reports such as the following:
IP traffic monitor: Network traffic statistics by TCP connection
General interface statistics: IP traffic statistics by network interface
Detailed interface statistics: Network traffic statistics by protocol
Statistical breakdowns: Network traffic statistics by TCP/UDP port and by packet size
LAN station monitor: Network traffic statistics by Layer2 address
Following are a few of the reports iptraf generates.
Figure 2-2 iptraf output of TCP/IP statistics by protocol
Figure 2-3 iptraf output of TCP/IP traffic statistics by packet size
2.3.13 tcpdump / ethereal
tcpdump and ethereal are used to capture and analyze network traffic. Both tools use the
libpcap library to capture packets. They monitor all the traffic on a network adapter in
promiscuous mode and capture all the frames the adapter has received. To capture all the
packets, these commands should be executed with super user privileges so that the interface
can be put into promiscuous mode.
You can use these tools to dig into network-related problems such as TCP/IP
retransmissions, window size scaling, name resolution problems, and network
misconfiguration. Just keep in mind that these tools can monitor only the frames the network
adapter has received, not the entire network traffic.
tcpdump
tcpdump is a simple but robust utility. It also has basic protocol analyzing capability, allowing
you to get a rough picture of what is happening on the network. tcpdump supports many options
and flexible expressions for filtering the frames to be captured (capture filter). We take a look
at some of them below.
Options:
-i <interface>    Network interface
-e                Print the link-level header
-s <snaplen>      Capture <snaplen> bytes from each packet
-n                Avoid DNS lookup
-w <file>         Write to file
-r <file>         Read from file
-v, -vv, -vvv     Verbose output
Expressions for the capture filter:
Keywords:
host, dst, src, port, src port, dst port, tcp, udp, icmp, net, dst net, src net, etc.
Primitives may be combined using:
Negation (`!' or `not')
Concatenation (`&&' or `and')
Alternation (`||' or `or')
Example of some useful expressions:
DNS query packets
tcpdump -i eth0 'udp port 53'
FTP control and FTP data session to 192.168.1.10
tcpdump -i eth0 'dst 192.168.1.10 and (port ftp or ftp-data)'
HTTP session to 192.168.2.253
tcpdump -ni eth0 'dst 192.168.2.253 and tcp and port 80'
SSH session to subnet 192.168.2.0/24
tcpdump -ni eth0 'dst net 192.168.2.0/24 and tcp and port 22'
Packets whose source and destination are not both in subnet 192.168.1.0/24, with the TCP
SYN or TCP FIN flag set (TCP establishment or termination)
tcpdump 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0 and not src and dst net
192.168.1.0/24'
ethereal has quite similar functionality to tcpdump but is more sophisticated and has
advanced protocol analyzing and reporting capability. It has a GUI as well as a command line
interface; both are part of the ethereal package. Like tcpdump, it can use capture filters, and it
also supports display filters, which can be used to narrow down the frames of interest. Here
are some examples of useful display filter expressions.
IP
ip.version == 6 and ip.len > 1450
ip.addr == 129.111.0.0/16
ip.dst eq www.example.com and ip.src == 192.168.1.1
not ip.addr eq 192.168.4.1
TCP/UDP
tcp.port eq 22
tcp.port == 80 and ip.src == 192.168.2.1
tcp.dstport == 80 and (tcp.flags.syn == 1 or tcp.flags.fin == 1)
tcp.srcport == 80 and (tcp.flags.syn == 1 and tcp.flags.ack == 1)
tcp.dstport == 80 and tcp.flags == 0x12
tcp.options.mss_val == 1460 and tcp.option.sack == 1
Figure 2-4 ethereal GUI
2.3.14 nmon
nmon, short for Nigel's Monitor, is a popular tool for monitoring Linux system performance,
developed by Nigel Griffiths. Since nmon incorporates the performance information for
several subsystems, it can be used as a single source for performance monitoring. The
metrics reported by nmon include processor utilization, memory utilization, run queue
information, disk I/O statistics, network I/O statistics, paging activity, and process metrics.
In order to run nmon, simply start the tool and select the subsystems of interest by typing their
one-key commands. For example, to get CPU, memory, and disk statistics, start nmon and
type c m d.
A very nice feature of nmon is the ability to save performance statistics for later analysis in
a comma-separated values (CSV) file. The CSV output of nmon can be imported into a
spreadsheet application to produce graphical reports. To do so, nmon should be started with
the -f flag (see nmon -h for details). For example, running nmon for an hour and capturing a
data snapshot every 30 seconds is achieved using the command in Example 2-23 on page 59.
Example 2-23 Using nmon to record performance data
# nmon -f -s 30 -c 120
The output of the above command will be stored in a text file in the current directory named
<hostname>_date_time.nmon.
2.3.15 strace
The strace command intercepts and records the system calls that are called by a process, as
well as the signals that are received by a process. This is a useful diagnostic, instructional,
and debugging tool. System administrators find it valuable for solving problems with
programs.
To trace a process, specify the process ID (PID) to be monitored:
strace -p <pid>
Example 2-24 shows an example of the output of strace.
Example 2-24 Output of strace monitoring httpd process
Attention: While the strace command is running against a process, the performance of
the traced process is drastically reduced. Run it only for the time needed for data collection.
Here’s another interesting usage. This command reports how much time has been consumed
in the kernel by each system call to execute a command.
strace -c <command>
Example 2-25 Output of strace counting for system time
[root@lnxsu4 ~]# strace -c find /etc -name httpd.conf
/etc/httpd/conf/httpd.conf
Process 3563 detached
% time seconds usecs/call calls errors syscall
For the complete syntax of the strace command, issue:
strace -?
2.3.16 Proc file system
The proc file system is not a real file system, but nevertheless is extremely useful. It is not
intended to store data; rather, it provides an interface to the running kernel. The proc file
system enables an administrator to monitor and change the kernel on the fly. Figure 2-5
depicts a sample proc file system. Most Linux tools for performance measurement rely on the
information provided by /proc.
/
  proc/
    1/
    2546/
    bus/
      pci/
      usb/
    driver/
    fs/
      nfs/
    ide/
    irq/
    net/
    scsi/
    self/
    sys/
      abi/
      debug/
      dev/
      fs/
        binfmt_misc/
        mfs/
        quota/
      kernel/
        random/
      net/
        802/
        core/
        ethernet/
Figure 2-5 A sample /proc file system
Looking at the proc file system, we can distinguish several subdirectories that serve various
purposes, but because most of the information in the proc directory is not easily readable to
the human eye, you are encouraged to use tools such as vmstat to display the various
statistics in a more readable manner. Keep in mind that the layout and information contained
within the proc file system varies across different system architectures.
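As a brief, hedged illustration of reading and changing values through /proc (the tunable
below is only an example; the options under /proc/sys are covered in 4.3, “Changing kernel
parameters” on page 104):
cat /proc/sys/net/ipv4/ip_forward          # read the current value (0 = disabled)
echo 1 > /proc/sys/net/ipv4/ip_forward     # enable IP forwarding on the fly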
Files in the /proc directory
The various files in the root directory of proc refer to several pertinent system statistics. Here
you can find the information that Linux tools such as vmstat take as the source of their
output.
Numbers 1 to X
The various subdirectories represented by numbers refer to the running processes or their
respective process ID (PID). The directory structure always starts with PID 1, which refers
to the init process, and goes up to the number of PIDs running on the respective system.
Each numbered subdirectory stores statistics related to the process. One example of such
data is the virtual memory mapped by the process.
acpi
ACPI refers to the advanced configuration and power interface supported by most modern
desktop and laptop systems. Because ACPI is mainly a PC technology, it is often disabled
on server systems. For more information about ACPI refer to:
http://www.acpi.info
bus
This subdirectory contains information about the bus subsystems such as the PCI bus or
the USB interface of the respective system.
irq
The irq subdirectory contains information about the interrupts in a system. Each
subdirectory in this directory refers to an interrupt and possibly to an attached device such
as a network interface card. In the irq subdirectory, you can change the CPU affinity of a
given interrupt (a feature we cover later in this book).
net
The net subdirectory contains a significant number of raw statistics regarding your network
interfaces, such as received multicast packets or the routes per interface.
scsi
This subdirectory contains information about the SCSI subsystem of the respective
system, such as attached devices or driver revision. The subdirectory ips refers to the IBM
ServeRAID controllers found on most IBM System x servers.
sys
In the sys subdirectory you find the tunable kernel parameters such as the behavior of the
virtual memory manager or the network stack. We cover the various options and tunable
values in /proc/sys in 4.3, “Changing kernel parameters” on page 104.
tty
The tty subdirectory contains information about the respective virtual terminals of the
systems and to what physical devices they are attached.
2.3.17 KDE System Guard
KDE System Guard (KSysguard) is the KDE task manager and performance monitor. It
features a client/server architecture that enables monitoring of local and remote hosts.
Figure 2-6 Default KDE System Guard window
The graphical front end (Figure 2-6) uses sensors to retrieve the information it displays. A
sensor can return simple values or more complex information such as tables. For each type of
information, one or more displays are provided. Displays are organized in worksheets that
can be saved and loaded independent of each other.
The KSysguard main window consists of a menu bar, an optional tool bar and status bar, the
sensor browser, and the work space. When first started, you see the default setup: your local
machine listed as localhost in the sensor browser and two tabs in the work space area.
Each sensor monitors a certain system value. All of the displayed sensors can be dragged
and dropped into the work space. There are three options:
You can delete and replace sensors in the actual work space.
You can edit worksheet properties and increase the number of rows and columns.
You can create a new worksheet and drop new sensors meeting your needs.
Work space
The work space in Figure 2-7 shows two tabs:
System Load, the default view when first starting up KSysguard
Process Table
Figure 2-7 KDE System Guard sensor browser
System Load
The System Load worksheet shows four sensor windows: CPU Load, Load Average (1 Min),
Physical Memory, and Swap Memory. Multiple sensors can be displayed in one window. To
see which sensors are being monitored in a window, mouse over the graph and descriptive
text will appear. You can also right-click the graph and click Properties, then click the
Sensors tab (Figure 2-8). This also shows a key of what each color represents on the graph.
Figure 2-8 Sensor Information, Physical Memory Signal Plotter
Process Table
Clicking the Process Table tab displays information about all running processes on the
server (Figure 2-9). The table, by default, is sorted by System CPU utilization, but this can be
changed by clicking another one of the headings.
Figure 2-9 Process Table view
Configuring a work sheet
For your environment or the particular area that you wish to monitor, you might have to use
different sensors for monitoring. The best way to do this is to create a custom work sheet. In
this section, we guide you through the steps that are required to create the work sheet shown
in Figure 2-12 on page 67:
1. Create a blank worksheet by clicking File → New to open the window in Figure 2-10.
Figure 2-10 Properties for new worksheet
2. Enter a title and a number of rows and columns; this gives you the maximum number of
monitor windows, which in our case will be four. When the information is complete, click
OK to create the blank worksheet, as shown in Figure 2-11 on page 66.
Note: The fastest update interval that can be defined is two seconds.
Figure 2-11 Empty worksheet
3. Fill in the sensor boxes by dragging the sensors on the left side of the window to the
desired box on the right. The types of display are:
– Signal Plotter: This displays samples of one or more sensors over time. If several
sensors are displayed, the values are layered in different colors. If the display is large
enough, a grid will be displayed to show the range of the plotted samples.
By default, the automatic range mode is active, so the minimum and maximum values
will be set automatically. If you want fixed minimum and maximum values, you can
deactivate the automatic range mode and set the values in the Scales tab from the
Properties dialog window (which you access by right-clicking the graph).
– Multimeter: This displays the sensor values as a digital meter. In the Properties dialog,
you can specify a lower and upper limit. If the range is exceeded, the display is colored
in the alarm color.
– BarGraph: This displays the sensor value as dancing bars. In the Properties dialog,
you can specify the minimum and maximum values of the range and a lower and upper
limit. If the range is exceeded, the display is colored in the alarm color.
– Sensor Logger: This does not display any values, but logs them in a file with additional
date and time information.
For each sensor, you have to define a target log file, the time interval the sensor will be
logged, and whether alarms are enabled.
4. Click File → Save to save the changes to the worksheet.
Note: When you save a work sheet, it will be saved in the user’s home directory, which may
prevent other administrators from using your custom worksheets.
Figure 2-12 Example worksheet
Find more information about KDE System Guard at:
http://docs.kde.org/
2.3.18 Gnome System Monitor
Although not as powerful as the KDE System Guard, the Gnome desktop environment
features a graphical performance analysis tool. The Gnome System Monitor can display
performance-relevant system resources as graphs for visualizing possible peaks and
bottlenecks. Note that all statistics are generated in real time. Long-term performance
analysis should be carried out with different tools.
2.3.19 Capacity Manager
Capacity Manager, an add-on to the IBM Director system management suite for IBM
Systems, is available in the ServerPlus Pack for IBM System x systems. Capacity Manager
offers the possibility of long-term performance measurements across multiple systems and
platforms. Apart from performance measurement, Capacity Manager enables capacity
planning, offering you an estimate of future required system capacity needs. With Capacity
Manager, you can export reports to HTML, XML, and GIF files that can be stored
automatically on an intranet Web server. IBM Director can be used on different operating
system platforms, which makes it much easier to collect and analyze data in a heterogeneous
environment. Capacity Manager is discussed in detail in the redbook Tuning IBM System x Servers for Performance, SG24-5287.
To use Capacity Manager, you first must install the respective RPM package on the systems
that will use its advanced features. After installing the RPM, select Capacity Manager → Monitor Activator in the IBM Director Console.
Figure 2-13 The task list in the IBM Director Console
Drag and drop the icon for Monitor Activator over a single system or a group of systems that
have the Capacity Manager package installed. A window opens (Figure 2-14) in which you
can select the various subsystems to be monitored over time. Capacity Manager for Linux
does not yet support the full-feature set of available performance counters. System statistics
are limited to a basic subset of performance parameters.
Figure 2-14 Activating performance monitors on multiple systems
The Monitor Activator window shows the respective systems with their current status on the
right side and the different available performance monitors at the left side. To add a new
monitor, select the monitor and click On. The changes take effect shortly after the Monitor
Activator window is closed. After this step, IBM Director starts collecting the requested
performance metrics and stores them in a temporary location on the different systems.
To create a report of the collected data, select Capacity Manager → Report Generator (see
Figure 2-13) and drag it over a single system or a group of systems for which you would like to
see performance statistics. IBM Director asks whether the report should be generated right
away or scheduled for later execution (Figure 2-15).
Figure 2-15 Scheduling reports
In a production environment, it is a good idea to have Capacity Manager generate reports on
a regular basis. Our experience is that weekly reports that are performed in off-hours over the
weekend can be very valuable. An immediate execution or scheduled execution report is
generated according to your choice. As soon as the report has completed, it is stored on the
central IBM Director management server, where it can be viewed using the Report Viewer
task. Figure 2-16 shows sample output from a monthly Capacity Manager report.
Figure 2-16 A sample Capacity Manager report
The Report Viewer window enables you to select the different performance counters that
were collected and correlate this data to a single system or to a selection of systems.
Data acquired by Capacity Manager can be exported to an HTML or XML file to be displayed
on an intranet Web server or for future analysis.
2.4 Benchmark tools
In this section, we pick up some of the major benchmark tools. To measure performance, it is
wise to use good benchmark tools. There are a lot of good tools available, and they differ in
the capabilities they provide.
A benchmark is nothing more than a model for a specific workload that may or may not be
close to the workload that will finally run on a system. If a system boasts a good Linpack
score, it might still not be the ideal file server. You should always remember that a benchmark
cannot simulate the sometimes unpredictable reactions of an end user. A benchmark also will
not tell you how a file server behaves once the backup starts up while users are accessing
their data. Generally, the following rules should be observed when performing a benchmark
on any system:
Use a benchmark for server workloads: Server systems boast very distinct characteristics
that make them very different from a typical desktop PC, even though the IBM System x
platform shares many of the technologies available for desktop computers. Server
benchmarks spawn multiple threads in order to utilize the SMP capabilities of the system
and in order to simulate a true multi user environment. While a PC might start one web
browser faster than a high-end server, the server will start a thousand web browsers faster
than a PC.
Simulate the expected workload: All benchmarks have different configuration options that
should be used to tailor the benchmark towards the workload that the system should be
running in the future. Great CPU performance will be of little use if the application in the
end has to rely on low disk latency.
Isolate benchmark systems: If a system is to be tested with a benchmark, it is paramount
to isolate it from any other load as well as possible. Even an open session running the
top command can significantly impact the results of the benchmark.
Average results: Even if you try to isolate the benchmark system as well as possible, there
may always be unknown factors that impact system performance just at the time of your
benchmark. It is good practice to run any benchmark at least three times and average the
results to make sure that a one-time event does not skew your entire analysis.
In the following sections, we have selected some tools based on these criteria:
Works on Linux: Linux is the target of the benchmark.
Works on all hardware platforms: Since IBM offers three distinct hardware platforms
(assuming that the hardware technology of IBM System p and IBM System i™ are both
based on the IBM POWER™ architecture), it is important to select a benchmark that can be
used on all architectures without a big porting effort.
Open source: Linux runs on several platforms, so a binary file may not be available for your
architecture if the source code is not available.
Well-documented: You have to know the tool well when you perform benchmarking. The
documentation helps you become familiar with the tool, and it also helps you evaluate
whether the tool suits your needs by letting you look at its concept and design before you
decide to use it.
Actively maintained: An old, abandoned tool may not follow recent specifications and
technology. It may produce wrong results and lead to misunderstanding.
Widely used: You can find a lot of information about widely used tools more easily.
Easy to use: This is always a good thing.
Reporting capability: Having a reporting capability greatly reduces the performance
analysis work.
2.4.1 LMbench
LMbench is a suite of microbenchmarks that can be used to analyze different operating
system settings, such as an SELinux-enabled system versus a non-SELinux system. The
benchmarks included in LMbench measure various operating system routines such as
context switching, local communications, memory bandwidth, and file operations. Using
LMbench is pretty straightforward as there are only three important commands to know:
make results: The first time LMbench is run, it prompts for some details of the system
configuration and which tests it should perform.
make rerun: After the initial configuration and a first benchmark run, the make rerun
command simply repeats the benchmark using the configuration supplied during the make results run.
make see: Finally after a minimum of three runs the results can be viewed using the make
see command. The results will be displayed and can be copied to a spreadsheet
application for further analysis or graphical representation of the data.
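As a brief, hedged sketch of a typical session (run from the top-level LMbench source
directory; the exact layout may differ between versions):
make results     # interactive configuration, then the first benchmark run
make rerun       # repeat the run twice more so that at least three results exist
make rerun
make see         # summarize the results of all runs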
The LMbench benchmark can be found at http://sourceforge.net/projects/lmbench/
2.4.2 IOzone
IOzone is a file system benchmark that can be utilized to simulate a wide variety of different
disk access patterns. Since the configuration possibilities of IOzone are very detailed it is
possible to simulate a targeted workload profile very precisely. In essence IOzone writes one
or multiple files of variable size using variable block sizes.
While IOzone offers a very comfortable automatic benchmarking mode it is usually more
efficient to define the workload characteristic such as file size, I/O size and access pattern. If
a file system has to be evaluated for a database workload it would be sensible to cause
IOzone to create a random access pattern to a rather large file at large block sizes instead of
streaming a large file with a small block size. Some of the most important options for IOzone
are:
-b <output.xls>   Tells IOzone to store the results in a Microsoft® Excel® compatible
                  spreadsheet.
-C                Displays output for each child process (can be used to check whether all
                  children really run simultaneously).
-f <filename>     Can be used to tell IOzone where to write the data.
-i <number of test>
                  Specifies which tests are to be run. You always have to specify -i 0 in
                  order to write the test file for the first time. Useful tests are -i 1 for
                  streaming reads, -i 2 for random read and random write access, and
                  -i 8 for a workload with mixed random access.
-h                Displays the onscreen help.
-r                Tells IOzone what record or I/O size should be used for the tests. The
                  record size should be as close as possible to the record size that will be
                  used by the targeted workload.
-k <number of async I/Os>
                  Uses the async I/O feature of the 2.6 kernel that is often used by
                  databases such as IBM DB2®.
-m                If the targeted application uses multiple internal buffers, this behavior
                  can be simulated using the -m flag.
-s <size in KB>   Specifies the file size for the benchmark. For asynchronous file
                  systems (the default mount option for most file systems), IOzone should
                  be used with a file size of at least twice the system's memory in order
                  to really measure disk performance. The size can also be specified in
                  MB or GB by appending m or g, respectively, directly after the file size.
-+u               An experimental switch that can be used to measure processor
                  utilization during the test.
Note: Any benchmark using files that fit into the system's memory and that are stored on
asynchronous file systems will measure the memory throughput rather than the disk
subsystem performance. Hence, you should either mount the file system of interest with the
sync option or use a file size roughly twice the size of the system's memory.
Using IOzone to measure the random read performance of a given disk subsystem mounted
at /perf for a file of 10 GB size at 32KB I/O size (these characteristics could model a simple
database) would look as follows:
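A plausible invocation, sketched from the options described above (the target file name and
the output spreadsheet name are placeholders), might look like this:
iozone -i 0 -i 2 -r 32k -s 10g -f /perf/iozone.tmp -b iozone_results.xls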
Finally, the obtained result can be imported into your spreadsheet application of choice and
then transformed into graphs. Using a graphical output of the data might make it easier to
analyze a large amount of data and to identify trends. A sample output of the example above
(refer to Example 2-26) might look like the graphic displayed in Figure 2-17.
Figure 2-17 A graph produced from the sample results of Example 2-26, plotting throughput in kB/sec
for the Writer, Re-writer, Random Read, and Random Write reports (10 GB file access at 32 KB I/O size)
If IOzone is used with file sizes that either fit into the system’s memory or cache it can also be
used to gain some data about cache and memory throughput. It should however be noted that
due to the file system overheads IOzone will report only 70-80% of a system’s bandwidth.
The IOzone benchmark can be found at http://www.iozone.org/
2.4.3 netperf
netperf is a performance benchmark tool that focuses especially on TCP/IP networking
performance. It also supports UNIX domain socket and SCTP benchmarking.
netperf is designed based on a client/server model. netserver runs on the target system and
netperf runs on the client. netperf controls netserver and passes configuration data to
netserver, generates network traffic, and gets the result from netserver through a control
connection that is separate from the actual benchmark traffic connection. During the
benchmark, no communication occurs on the control connection, so it does not affect the
result. The netperf benchmark tool also has reporting capability, including a CPU utilization
report. The current stable version is 2.4.3 at the time of writing.
netperf can generate several types of traffic. Basically, these fall into two categories: bulk
data transfer traffic and request/response type traffic. One thing you should keep in mind is
that netperf uses only one socket at a time. The next version of netperf (netperf4) will fully
support benchmarking of concurrent sessions. At this time, we can perform multiple-session
benchmarking as described below.
Bulk data transfer
Bulk data transfer is the most commonly measured factor in network benchmarking. It is
measured by the amount of data transferred in one second and simulates large file
transfers such as multimedia streaming or FTP data transfers.
Request/response type
This simulates request/response type traffic, which is measured by the number of
transactions exchanged in one second. Request/response traffic is typical of online
transaction applications such as web servers, database servers, mail servers, file servers
serving small or medium files, and directory servers. In a real environment, session
establishment and termination are performed as well as data exchange. To simulate this,
the TCP_CRR test type was introduced.
Concurrent sessions
netperf does not have real support for benchmarking concurrent multiple sessions in the
current stable version, but we can perform some benchmarking by simply starting multiple
instances of netperf, as follows:
for i in `seq 1 10`; do netperf -t TCP_CRR -H target.example.com -i 10 -P 0 & done
We take a brief look at some useful and interesting options.
Global options:
-A                Change send and receive buffer alignment on the remote system
-b                Burst of packets in stream tests
-H <remotehost>   Remote host
-t <testname>     Test traffic type:
   TCP_STREAM     Bulk data transfer benchmark
   TCP_MAERTS     Similar to TCP_STREAM except the direction of the stream is opposite
   TCP_SENDFILE   Similar to TCP_STREAM except it uses sendfile() instead of send(),
                  which causes a zero-copy operation
   UDP_STREAM     Same as TCP_STREAM except UDP is used
   TCP_RR         Request/response type traffic benchmark
   TCP_CC         TCP connect/close benchmark; no request and response packets are
                  exchanged
   TCP_CRR        Performs connect/request/response/close operations, much like an
                  HTTP 1.0/1.1 session with HTTP keepalive disabled
   UDP_RR         Same as TCP_RR except UDP is used
-l <testlen>      Test length of the benchmark. If a positive value is set, netperf performs
                  the benchmark for testlen seconds. If negative, it performs until testlen
                  bytes of data are exchanged for bulk data transfer benchmarks, or until
                  testlen transactions are completed for request/response types.
-c                Local CPU utilization report
-C                Remote CPU utilization report
Note: The CPU utilization report may not be accurate on some platforms. Make sure that it
is accurate before you perform benchmarking.
-I <conflevel>,<interval>
                  This option is used to maintain confidence in the result. The
                  confidence level should be 99 or 95 (percent), and the interval (percent)
                  can be set as well. To keep the result within a certain confidence level,
                  netperf repeats the same benchmark several times. For example,
                  -I 99,5 means that the result is within a 5% interval (+/- 2.5%) of the real
                  result 99 times out of 100.
-i <max>,<min>    Number of maximum and minimum test iterations. This option limits
                  the number of iterations. -i 10,3 means netperf performs the same
                  benchmark at least 3 times and at most 10 times. If the iteration count
                  exceeds the maximum value, the result may not be within the confidence
                  level specified with the -I option, and a warning is displayed in the
                  result.
-s <bytes>, -S <bytes>
                  Changes the send and receive buffer sizes on the local and remote
                  system, respectively. This affects the advertised and effective window
                  size.
Options for TCP_STREAM, TCP_MAERTS, TCP_SENDFILE, UDP_STREAM:
-m <bytes>, -M <bytes>
                  Specifies the size of the buffer passed to the send() and recv() function
                  calls, respectively, and thereby controls the size sent and received per
                  call.
Options for TCP_RR, TCP_CC, TCP_CRR, UDP_RR:
-r <bytes>, -R <bytes>
                  Specifies the request and response sizes, respectively. For example,
                  -r 128,8129 means that netperf sends 128-byte requests to netserver
                  and netserver sends 8129-byte responses back to netperf.
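As a minimal, hedged sketch of a basic run (the host name is a placeholder and netserver
must already be running on the target), a 60-second bulk data transfer test with local and
remote CPU utilization reporting could look like this:
netperf -H target.example.com -t TCP_STREAM -l 60 -c -C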
The following is an example output of netperf for TCP_CRR type benchmark.
Example 2-27 An example result of TCP_CRR benchmark
Testing with the following command line:
/usr/local/bin/netperf -l 60 -H plnxsu4 -t TCP_CRR -c 100 -C 100 -i ,3 -I 95,5 -v
1 -- -r 64,1 -s 0 -S 512
TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to plnxsu4
(10.0.0.4) port 0 AF_INET
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr
When you perform benchmarking, it’s wise to use the sample test scripts which come with
netperf. By changing some variables in the scripts, you can perform your benchmarking as
you like. The scripts are in the doc/examples/ directory of the netperf package.
For more details, refer to http://www.netperf.org/
2.4.4 Other useful tools
Following are some other useful benchmark tools. Keep in mind that you have to know the
characteristics of the benchmark tool and choose the tools that fit your needs.
Table 2-3 Additional benchmarking tools
Tool                  Most useful tool function
bonnie                Disk I/O and file system benchmark
                      http://www.textuality.com/bonnie/
bonnie++              Disk I/O and file system benchmark
                      http://www.coker.com.au/bonnie++/
NetBench              File server benchmark. It runs on Windows.
dbench                File system benchmark. Commonly used as a file server benchmark.
                      http://freshmeat.net/projects/dbench/
iometer               Disk I/O and network benchmark
                      http://www.iometer.org/
ttcp                  Simple network benchmark
nttcp                 Simple network benchmark
iperf                 Network benchmark
                      http://dast.nlanr.net/projects/Iperf/
ab (Apache Bench)     Simple web server benchmark. It comes with the Apache HTTP server.
                      http://httpd.apache.org/
WebStone              Web server benchmark
                      http://www.mindcraft.com/webstone/
Apache JMeter         Used mainly for web server performance benchmarking. It also
                      supports other protocols such as SMTP, LDAP, and JDBC™, and it has
                      good reporting capability.
                      http://jakarta.apache.org/jmeter/
fsstone, smtpstone    Mail server benchmarks. They come with Postfix.
                      http://www.postfix.org/
nhfsstone             Network File System benchmark. Comes with the nfs-utils package.
DirectoryMark         LDAP benchmark
                      http://www.mindcraft.com/directorymark/
Chapter 3. Analyzing performance bottlenecks
This chapter is useful for finding a performance problem that may already be affecting one of
your servers. We outline a series of steps to lead you to a concrete solution that you can
implement to restore the server to an acceptable performance level.
The topics that are covered in this chapter are:
3.1, “Identifying bottlenecks” on page 78
3.2, “CPU bottlenecks” on page 81
3.3, “Memory bottlenecks” on page 82
3.4, “Disk bottlenecks” on page 84
3.5, “Network bottlenecks” on page 87
3.1 Identifying bottlenecks
The following steps are used as our quick tuning strategy:
1. Know your system.
2. Back up the system.
3. Monitor and analyze the system’s performance.
4. Narrow down the bottleneck and find its cause.
5. Fix the bottleneck cause by trying only one single change at a time.
6. Go back to step 3 until you are satisfied with the performance of the system.
Tip: You should document each step, especially the changes you make and their effect on
performance.
3.1.1 Gathering information
Most likely, the only first-hand information you will have access to will be statements such as
“There is a problem with the server.” It is crucial to use probing questions to clarify and
document the problem. Here is a list of questions you should ask to help you get a better
picture of the system.
Can you give me a complete description of the server in question?
– Model
– Age
– Configuration
– Peripheral equipment
– Operating system version and update level
Can you tell me exactly what the problem is?
– What are the symptoms?
– Describe any error messages.
Some people will have problems answering this question, but any extra information the
customer can give you might enable you to find the problem. For example, the customer
might say “It is really slow when I copy large files to the server.” This might indicate a
network problem or a disk subsystem problem.
Who is experiencing the problem?
Is one person, one particular group of people, or the entire organization experiencing the
problem? This helps determine whether the problem exists in one particular part of the
network, whether it is application-dependent, and so on. If only one user experiences the
problem, then the problem might be with the user’s PC (or their imagination).
The perception clients have of the server is usually a key factor. From this point of view,
performance problems may not be directly related to the server: the network path between
the server and the clients can easily be the cause of the problem. This path includes
network devices as well as services provided by other servers, such as domain
controllers.
Can the problem be reproduced?
All reproducible problems can be solved. If you have sufficient knowledge of the system,
you should be able to narrow the problem to its root and decide which actions should be
taken.
The fact that the problem can be reproduced enables you to see and understand it better.
Document the sequence of actions that are necessary to reproduce the problem:
– What are the steps to reproduce the problem?
Knowing the steps may help you reproduce the same problem on a different machine
under the same conditions. If this works, it gives you the opportunity to use a machine
in a test environment and removes the chance of crashing the production server.
– Is it an intermittent problem?
If the problem is intermittent, the first thing to do is to gather information and find a path
to move the problem in the reproducible category. The goal here is to have a scenario
to make the problem happen on command.
– Does it occur at certain times of the day or certain days of the week?
This might help you determine what is causing the problem. It may occur when
everyone arrives for work or returns from lunch. Look for ways to change the timing
(that is, make it happen less or more often); if there are ways to do so, the problem
becomes a reproducible one.
– Is it unusual?
If the problem falls into the non-reproducible category, you may conclude that it is the
result of extraordinary conditions and classify it as fixed. In real life, there is a high
probability that it will happen again.
A good procedure to troubleshoot a hard-to-reproduce problem is to perform general
maintenance on the server: reboot, or bring the machine up to date on drivers and
patches.
When did the problem start? Was it gradual or did it occur very quickly?
If the performance issue appeared gradually, then it is likely to be a sizing issue; if it
appeared overnight, then the problem could be caused by a change made to the server or
peripherals.
Have any changes been made to the server (minor or major) or are there any changes in
the way clients are using the server?
Did the customer alter something on the server or peripherals to cause the problem? Is
there a log of all network changes available?
Demands could change based on business changes, which could affect demands on
servers and network systems.
Are there any other servers or hardware components involved?
Are any logs available?
What is the priority of the problem? When does it have to be fixed?
– Does it have to be fixed in the next few minutes, or in days? You may have some time to
fix it; or it may already be time to operate in panic mode.
– How massive is the problem?
– What is the related cost of that problem?
3.1.2 Analyzing the server’s performance
Important: Before taking any troubleshooting actions, back up all data and the
configuration information to prevent a partial or complete loss.
At this point, you should begin monitoring the server. The simplest way is to run monitoring
tools from the server that is being analyzed. (See Chapter 2, “Monitoring and benchmark
tools” on page 39, for information.)
A performance log of the server should be created during its peak time of operation (for
example, 9:00 a.m. to 5:00 p.m.); it will depend on what services are being provided and on
who is using these services. When creating the log, if available, the following objects should
be included:
Processor
System
Server work queues
Memory
Page file
Physical disk
Redirector
Network interface
Before you begin, remember that a methodical approach to performance tuning is important.
Our recommended process, which you can use for your server performance tuning process,
is as follows:
1. Understand the factors affecting server performance.
2. Measure the current performance to create a performance baseline to compare with your
future measurements and to identify system bottlenecks.
3. Use the monitoring tools to identify a performance bottleneck. By following the instructions
in the next sections, you should be able to narrow down the bottleneck to the subsystem
level.
4. Work with the component that is causing the bottleneck by performing some actions to
improve server performance in response to demands.
Note: It is important to understand that the greatest gains are obtained by upgrading a
component that has a bottleneck when the other components in the server have ample
“power” left to sustain an elevated level of performance.
5. Measure the new performance. This helps you compare performance before and after the
tuning steps.
When attempting to fix a performance problem, remember the following:
Applications should be compiled with an appropriate optimization level to reduce the path
length.
Take measurements before you upgrade or modify anything so that you can tell whether
the change had any effect. (That is, take baseline measurements.)
Examine the options that involve reconfiguring existing hardware, not just those that
involve adding new hardware.
3.2 CPU bottlenecks
For servers whose primary role is that of an application or database server, the CPU is a
critical resource and can often be a source of performance bottlenecks. It is important to note
that high CPU utilization does not always mean that a CPU is busy doing work; it may, in fact,
be waiting on another subsystem. When performing proper analysis, it is very important that
you look at the system as a whole and at all subsystems because there may be a cascade
effect within the subsystems.
Note: There is a common misconception that the CPU is the most important part of the
server. This is not always the case, and servers are often overconfigured with CPU and
underconfigured with disks, memory, and network subsystems. Only specific applications
that are truly CPU-intensive can take advantage of today’s high-end processors.
3.2.1 Finding CPU bottlenecks
Determining bottlenecks with the CPU can be accomplished in several ways. As discussed in
Chapter 2, “Monitoring and benchmark tools” on page 39, Linux has a variety of tools to help
determine this; the question is: which tools to use?
One such tool is uptime. By analyzing the output from uptime, we can get a rough idea of
what has been happening in the system for the past 15 minutes. For a more detailed
explanation of this tool, see 2.3.3, “uptime” on page 43.
Example 3-1 uptime output from a CPU strapped system
Using KDE System Guard and the CPU sensors lets you view the current CPU workload.
Tip: Be careful not to add to CPU problems by running too many tools at one time. You
may find that using a lot of different monitoring tools at one time may be contributing to the
high CPU load.
Using top, you can see both CPU utilization and what processes are the biggest contributors
to the problem (Example 2-1 on page 41). If you have set up sar, you are collecting a lot of
information, some of which is CPU utilization, over a period of time. Analyzing this information
can be difficult, so use isag, which can use sar output to plot a graph. Otherwise, you may
wish to parse the information through a script and use a spreadsheet to plot it to see any
trends in CPU utilization. You can also use sar from the command line by issuing sar -u or sar -U processornumber. To gain a broader perspective of the system and current utilization
of more than just the CPU subsystem, a good tool is vmstat (2.3.2, “vmstat” on page 42).
3.2.2 SMP
SMP-based systems can present their own set of interesting problems that can be difficult to
detect. In an SMP environment, there is the concept of CPU affinity, which implies that you
bind a process to a CPU.
The main reason this is useful is CPU cache optimization, which is achieved by keeping the
same process on one CPU rather than moving between processors. When a process moves
between CPUs, the cache of the new CPU must be flushed. Therefore, a process that moves
between processors causes many cache flushes to occur, which means that an individual
process will take longer to finish. This scenario is very hard to detect because, when
monitoring it, the CPU load will appear to be very balanced and not necessarily peaking on
any CPU. Affinity is also useful in NUMA-based systems such as the IBM System x 3950,
where it is important to keep memory, cache, and CPU access local to one another.
3.2.3 Performance tuning options
The first step is to ensure that the system performance problem is being caused by the CPU
and not one of the other subsystems. If the processor is the server bottleneck, then a number
of actions can be taken to improve performance. These include:
Ensure that no unnecessary programs are running in the background by using ps -ef. If
you find such programs, stop them and use cron to schedule them to run at off-peak
hours.
Identify non-critical, CPU-intensive processes by using top and modify their priority using
renice.
In an SMP-based machine, try using taskset to bind processes to CPUs to make sure that
processes are not hopping between processors, causing cache flushes (see the sketch
after this list).
Based on the running application, it may be better to scale up (bigger CPUs) than scale
out (more CPUs). This depends on whether your application was designed to effectively
take advantage of more processors. For example, a single-threaded application would
scale better with a faster CPU and not with more CPUs.
General options include making sure you are using the latest drivers and firmware, as this
may affect the load they have on the CPU.
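The following is a minimal, hedged sketch of the renice and taskset usage mentioned above;
the PID and CPU numbers are placeholders:
renice +10 -p 1234       # lower the priority of a non-critical, CPU-intensive process
taskset -pc 0,1 1234     # bind the same process to CPUs 0 and 1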
3.3 Memory bottlenecks
On a Linux system, many programs run at the same time; these programs support multiple
users and some processes are more used than others. Some of these programs use a
portion of memory while the rest are “sleeping.” When an application accesses cache, the
performance increases because an in-memory access retrieves data, thereby eliminating the
need to access slower disks.
The OS uses an algorithm to control which programs will use physical memory and which are
paged out. This is transparent to user programs. Page space is a file created by the OS on a
disk partition to store user programs that are not currently in use. Typically, page sizes are
4 KB or 8 KB. In Linux, the page size is defined by using the variable EXEC_PAGESIZE in the
include/asm-<architecture>/param.h kernel header file. The process used to page a process
out to disk is called pageout.
3.3.1 Finding memory bottlenecks
Start your analysis by listing the applications that are running on the server. Determine how
much physical memory and swap each application needs to run. Figure 3-1 on page 83
shows KDE System Guard monitoring memory usage.
Figure 3-1 KDE System Guard memory monitoring
The indicators in Table 3-1 can also help you define a problem with memory.
Table 3-1 Indicators for memory analysis
Memory available
   This indicates how much physical memory is available for use. If, after you start your
   application, this value has decreased significantly, you may have a memory leak. Check the
   application that is causing it and make the necessary adjustments. Use free -l -t -o for
   additional information.
Page faults
   There are two types of page faults: soft page faults, when the page is found in memory, and
   hard page faults, when the page is not found in memory and must be fetched from disk.
   Accessing the disk will slow your application considerably. The sar -B command can provide
   useful information for analyzing page faults, specifically the pgpgin/s and pgpgout/s columns.
File system cache
   This is the common memory space used by the file system cache. Use the free -l -t -o
   command for additional information.
Private memory for process
   This represents the memory used by each process running on the server. You can use the
   pmap command to see how much memory is allocated to a specific process.
Paging and swapping indicators
In Linux, as with all UNIX-based operating systems, there are differences between paging
and swapping. Paging moves individual pages to swap space on the disk; swapping is a
bigger operation that moves the entire address space of a process to swap space in one
operation.
Swapping can have one of two causes:
A process enters sleep mode. This usually happens because the process depends on
interactive action, as editors, shells, and data entry applications spend most of their time
waiting for user input. During this time, they are inactive.
A process behaves poorly. Paging can be a serious performance problem when the
amount of free memory pages falls below the minimum amount specified, because the
paging mechanism is not able to handle the requests for physical memory pages and the
swap mechanism is called to free more pages. This significantly increases I/O to disk and
will quickly degrade a server’s performance.
If your server is always paging to disk (a high page-out rate), consider adding more memory.
However, for systems with a low page-out rate, it may not affect performance.
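As a quick, hedged check of paging activity, watch the si (swap in) and so (swap out) columns
of vmstat over several samples; sustained non-zero values indicate that the system is actively
paging:
vmstat 2 10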
3.3.2 Performance tuning options
It you believe there is a memory bottleneck, consider performing one or more of these
actions:
Tune the swap space using bigpages, hugetlb, shared memory.
Increase or decrease the size of pages.
Improve the handling of active and inactive memory.
Adjust the page-out rate (see the example after this list).
Limit the resources used for each user on the server.
Stop the services that are not needed, as discussed in “Daemons” on page 97.
Add memory.
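One hedged example of influencing the page-out behavior on 2.6 kernels is the vm.swappiness
tunable, which controls how aggressively the kernel swaps out anonymous memory in favor of
the page cache. The value shown is only an illustration; test it before using it in production:
sysctl -w vm.swappiness=10
echo 10 > /proc/sys/vm/swappiness     # equivalent via the proc file system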
3.4 Disk bottlenecks
The disk subsystem is often the most important aspect of server performance and is usually
the most common bottleneck. However, problems can be hidden by other factors, such as
lack of memory. Applications are considered to be I/O-bound when CPU cycles are wasted
simply waiting for I/O tasks to finish.
The most common disk bottleneck is having too few disks. Most disk configurations are based
on capacity requirements, not performance. The least expensive solution is to purchase the
smallest number of the largest-capacity disks possible. However, this places more user data
on each disk, causing greater I/O rates to the physical disk and allowing disk bottlenecks to
occur.
The second most common problem is having too many logical disks on the same array. This
increases seek time and significantly lowers performance.
The disk subsystem is discussed in 4.6, “Tuning the disk subsystem” on page 113.
3.4.1 Finding disk bottlenecks
A server exhibiting the following symptoms may be suffering from a disk bottleneck (or a
hidden memory problem):
Slow disks will result in:
– Memory buffers filling with write data (or waiting for read data), which will delay all
requests because free memory buffers are unavailable for write requests (or the
response is waiting for read data in the disk queue)
– Insufficient memory, as in the case of not enough memory buffers for network requests,
will cause synchronous disk I/O
Disk utilization, controller utilization, or both will typically be very high.
Most LAN transfers will happen only after disk I/O has completed, causing very long
response times and low network utilization.
Disk I/O can take a relatively long time and disk queues will become full, so the CPUs will
be idle or have low utilization because they wait long periods of time before processing the
next request.
The disk subsystem is perhaps the most challenging subsystem to properly configure.
Besides looking at raw disk interface speed and disk capacity, it is key to also understand the
workload: Is disk access random or sequential? Is there large I/O or small I/O? Answering
these questions provides the necessary information to make sure the disk subsystem is
adequately tuned.
Disk manufacturers tend to showcase the upper limits of their drive technology’s throughput.
However, taking the time to understand the throughput of your workload will help you
understand what true expectations to have of your underlying disk subsystem.
Table 3-2 Exercise showing true throughput for 8 KB I/Os for different drive speeds

Disk speed     Latency   Seek time   Total random       I/Os per second   Throughput given
                                     access time (a)    per disk (b)      8 KB I/O
15 000 RPM     2.0 ms    3.8 ms      6.8 ms             147               1.15 MBps
10 000 RPM     3.0 ms    4.9 ms      8.9 ms             112               900 KBps
7 200 RPM      4.2 ms    9.0 ms      13.2 ms            75                600 KBps

a. Assuming that the handling of the command + data transfer < 1 ms, total random
access time = latency + seek time + 1 ms.
b. Calculated as 1 / total random access time.
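To make the arithmetic concrete, take the 15 000 RPM row as a worked example: total random
access time = 2.0 ms latency + 3.8 ms seek + 1 ms command handling = 6.8 ms, so one disk
sustains roughly 1 / 0.0068 s ≈ 147 I/Os per second, which at 8 KB per I/O is about
147 x 8 KB ≈ 1.15 MBps.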
Random read/write workloads usually require several disks to scale. The bus bandwidths of
SCSI or Fibre Channel are of lesser concern. Larger databases with random access
workload will benefit from having more disks. Larger SMP servers will scale better with more
disks. Given the I/O profile of 70% reads and 30% writes of the average commercial
workload, a RAID-10 implementation will perform 50% to 60% better than a RAID-5.
Sequential workloads tend to stress the bus bandwidth of disk subsystems. Pay special
attention to the number of SCSI buses and Fibre Channel controllers when maximum
throughput is desired. Given the same number of drives in an array, RAID-10, RAID-0, and
RAID-5 all have similar streaming read and write throughput.
There are two ways to approach disk bottleneck analysis: real-time monitoring and tracing.
Real-time monitoring must be done while the problem is occurring. This may not be
practical in cases where system workload is dynamic and the problem is not repeatable.
However, if the problem is repeatable, this method is flexible because of the ability to add
objects and counters as the problem becomes well understood.
Tracing is the collecting of performance data over time to diagnose a problem. This is a
good way to perform remote performance analysis. Some of the drawbacks include the
potential for having to analyze large files when performance problems are not repeatable,
and the potential for not having all key objects and parameters in the trace and having to
wait for the next time the problem occurs for the additional data.
vmstat command
One way to track disk usage on a Linux system is by using the vmstat tool. The columns of
interest in vmstat with respect to I/O are the bi and bo fields. These fields monitor the
movement of blocks in and out of the disk subsystem. Having a baseline is key to being able
to identify any changes over time.
Performance problems can be encountered when too many files are opened, being read and
written to, then closed repeatedly. This could become apparent as seek times (the time it
takes to move to the exact track where the data is stored) start to increase. Using the iostat
tool, you can monitor the I/O device loading in real time. Different options enable you to drill
down even farther to gather the necessary data.
Example 3-3 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows
average wait times (await) of about 2.7 seconds and service times (svctm) of 270 ms.
Example 3-3 Sample of an I/O bottleneck as shown with iostat 2 -x /dev/sdb1