
Red Hat Enterprise Linux 4: Introduction to System Administration
Copyright © 2005 Red Hat, Inc.
rhel-isa(EN)-4-Print-RHI (2004-08-25T17:11) Copyright © 2005 by Red Hat, Inc. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, V1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/). Distribution of substantively modified versions of this document is prohibited without the explicit permission of the copyright holder. Distribution of the work or derivative of the work in any standard (paper) book form for commercial purposes is prohibited unless prior permission is obtained from the copyright holder. Red Hat and the Red Hat "Shadow Man" logo are registered trademarks of Red Hat, Inc. in the United States and other
countries. All other trademarks referenced herein are the property of their respective owners. The GPG fingerprint of the security@redhat.com key is: CA 20 86 86 2B D6 9D FC 65 F6 EC C4 21 91 80 CD DB 42 A6 0E
Table of Contents
Introduction
  1. Architecture-specific Information
  2. Document Conventions
  3. Activate Your Subscription
    3.1. Provide a Red Hat Login
    3.2. Provide Your Subscription Number
    3.3. Connect Your System
  4. More to Come
    4.1. Send in Your Feedback
1. The Philosophy of System Administration
  1.1. Automate Everything
  1.2. Document Everything
  1.3. Communicate as Much as Possible
    1.3.1. Tell Your Users What You Are Going to Do
    1.3.2. Tell Your Users What You Are Doing
    1.3.3. Tell Your Users What You Have Done
  1.4. Know Your Resources
  1.5. Know Your Users
  1.6. Know Your Business
  1.7. Security Cannot be an Afterthought
    1.7.1. The Risks of Social Engineering
  1.8. Plan Ahead
  1.9. Expect the Unexpected
  1.10. Red Hat Enterprise Linux-Specific Information
    1.10.1. Automation
    1.10.2. Documentation and Communication
    1.10.3. Security
  1.11. Additional Resources
    1.11.1. Installed Documentation
    1.11.2. Useful Websites
    1.11.3. Related Books
2. Resource Monitoring
  2.1. Basic Concepts
  2.2. System Performance Monitoring
  2.3. Monitoring System Capacity
  2.4. What to Monitor?
    2.4.1. Monitoring CPU Power
    2.4.2. Monitoring Bandwidth
    2.4.3. Monitoring Memory
    2.4.4. Monitoring Storage
  2.5. Red Hat Enterprise Linux-Specific Information
    2.5.1. free
    2.5.2. top
    2.5.3. vmstat
    2.5.4. The Sysstat Suite of Resource Monitoring Tools
    2.5.5. OProfile
  2.6. Additional Resources
    2.6.1. Installed Documentation
    2.6.2. Useful Websites
    2.6.3. Related Books
3. Bandwidth and Processing Power
  3.1. Bandwidth
    3.1.1. Buses
    3.1.2. Datapaths
    3.1.3. Potential Bandwidth-Related Problems
    3.1.4. Potential Bandwidth-Related Solutions
    3.1.5. In Summary...
  3.2. Processing Power
    3.2.1. Facts About Processing Power
    3.2.2. Consumers of Processing Power
    3.2.3. Improving a CPU Shortage
  3.3. Red Hat Enterprise Linux-Specific Information
    3.3.1. Monitoring Bandwidth on Red Hat Enterprise Linux
    3.3.2. Monitoring CPU Utilization on Red Hat Enterprise Linux
  3.4. Additional Resources
    3.4.1. Installed Documentation
    3.4.2. Useful Websites
    3.4.3. Related Books
4. Physical and Virtual Memory
  4.1. Storage Access Patterns
  4.2. The Storage Spectrum
    4.2.1. CPU Registers
    4.2.2. Cache Memory
    4.2.3. Main Memory — RAM
    4.2.4. Hard Drives
    4.2.5. Off-Line Backup Storage
  4.3. Basic Virtual Memory Concepts
    4.3.1. Virtual Memory in Simple Terms
    4.3.2. Backing Store — the Central Tenet of Virtual Memory
  4.4. Virtual Memory: The Details
    4.4.1. Page Faults
    4.4.2. The Working Set
    4.4.3. Swapping
  4.5. Virtual Memory Performance Implications
    4.5.1. Worst Case Performance Scenario
    4.5.2. Best Case Performance Scenario
  4.6. Red Hat Enterprise Linux-Specific Information
  4.7. Additional Resources
    4.7.1. Installed Documentation
    4.7.2. Useful Websites
    4.7.3. Related Books
5. Managing Storage
  5.1. An Overview of Storage Hardware
    5.1.1. Disk Platters
    5.1.2. Data Reading/Writing Device
    5.1.3. Access Arms
  5.2. Storage Addressing Concepts
    5.2.1. Geometry-Based Addressing
    5.2.2. Block-Based Addressing
  5.3. Mass Storage Device Interfaces
    5.3.1. Historical Background
    5.3.2. Present-Day Industry-Standard Interfaces
  5.4. Hard Drive Performance Characteristics
    5.4.1. Mechanical/Electrical Limitations
    5.4.2. I/O Loads and Performance
  5.5. Making the Storage Usable
    5.5.1. Partitions/Slices
    5.5.2. File Systems
    5.5.3. Directory Structure
    5.5.4. Enabling Storage Access
  5.6. Advanced Storage Technologies
    5.6.1. Network-Accessible Storage
    5.6.2. RAID-Based Storage
    5.6.3. Logical Volume Management
  5.7. Storage Management Day-to-Day
    5.7.1. Monitoring Free Space
    5.7.2. Disk Quota Issues
    5.7.3. File-Related Issues
    5.7.4. Adding/Removing Storage
  5.8. A Word About Backups...
  5.9. Red Hat Enterprise Linux-Specific Information
    5.9.1. Device Naming Conventions
    5.9.2. File System Basics
    5.9.3. Mounting File Systems
    5.9.4. Network-Accessible Storage Under Red Hat Enterprise Linux
    5.9.5. Mounting File Systems Automatically with /etc/fstab
    5.9.6. Adding/Removing Storage
    5.9.7. Implementing Disk Quotas
    5.9.8. Creating RAID Arrays
    5.9.9. Day to Day Management of RAID Arrays
    5.9.10. Logical Volume Management
  5.10. Additional Resources
    5.10.1. Installed Documentation
    5.10.2. Useful Websites
    5.10.3. Related Books
6. Managing User Accounts and Resource Access
  6.1. Managing User Accounts
    6.1.1. The Username
    6.1.2. Passwords
    6.1.3. Access Control Information
    6.1.4. Managing Accounts and Resource Access Day-to-Day
  6.2. Managing User Resources
    6.2.1. Who Can Access Shared Data
    6.2.2. Where Users Access Shared Data
    6.2.3. What Barriers Are in Place To Prevent Abuse of Resources
  6.3. Red Hat Enterprise Linux-Specific Information
    6.3.1. User Accounts, Groups, and Permissions
    6.3.2. Files Controlling User Accounts and Groups
    6.3.3. User Account and Group Applications
  6.4. Additional Resources
    6.4.1. Installed Documentation
    6.4.2. Useful Websites
    6.4.3. Related Books
7. Printers and Printing
  7.1. Types of Printers
    7.1.1. Printing Considerations
  7.2. Impact Printers
    7.2.1. Dot-Matrix Printers
    7.2.2. Daisy-Wheel Printers
    7.2.3. Line Printers
    7.2.4. Impact Printer Consumables
  7.3. Inkjet Printers
    7.3.1. Inkjet Consumables
  7.4. Laser Printers
    7.4.1. Color Laser Printers
    7.4.2. Laser Printer Consumables
  7.5. Other Printer Types
  7.6. Printer Languages and Technologies
  7.7. Networked Versus Local Printers
  7.8. Red Hat Enterprise Linux-Specific Information
  7.9. Additional Resources
    7.9.1. Installed Documentation
    7.9.2. Useful Websites
    7.9.3. Related Books
8. Planning for Disaster
  8.1. Types of Disasters
    8.1.1. Hardware Failures
    8.1.2. Software Failures
    8.1.3. Environmental Failures
    8.1.4. Human Errors
  8.2. Backups
    8.2.1. Different Data: Different Backup Needs
    8.2.2. Backup Software: Buy Versus Build
    8.2.3. Types of Backups
    8.2.4. Backup Media
    8.2.5. Storage of Backups
    8.2.6. Restoration Issues
  8.3. Disaster Recovery
    8.3.1. Creating, Testing, and Implementing a Disaster Recovery Plan
    8.3.2. Backup Sites: Cold, Warm, and Hot
    8.3.3. Hardware and Software Availability
    8.3.4. Availability of Backups
    8.3.5. Network Connectivity to the Backup Site
    8.3.6. Backup Site Staffing
    8.3.7. Moving Back Toward Normalcy
  8.4. Red Hat Enterprise Linux-Specific Information
    8.4.1. Software Support
    8.4.2. Backup Technologies
  8.5. Additional Resources
    8.5.1. Installed Documentation
    8.5.2. Useful Websites
    8.5.3. Related Books
Index
Colophon
Introduction
Welcome to the Red Hat Enterprise Linux Introduction to System Administration.
The Red Hat Enterprise Linux Introduction to System Administration contains introductory information for new Red Hat Enterprise Linux system administrators. It does not teach you how to perform a particular task under Red Hat Enterprise Linux; rather, it provides you with the background knowledge that more experienced system administrators have learned over time.
This guide assumes you have a limited amount of experience as a Linux user, but no Linux system administration experience. If you are completely new to Linux in general (and Red Hat Enterprise Linux in particular), you should start by purchasing an introductory book on Linux.
Each chapter in the Red Hat Enterprise Linux Introduction to System Administration has the following structure:
Generic overview material — This section discusses the topic of the chapter without going into details about a specific operating system, technology, or methodology.
Red Hat Enterprise Linux-specific material — This section addresses aspects of the topic related to Linux in general and Red Hat Enterprise Linux in particular.
Additional resources for further study — This section includes pointers to other Red Hat Enterprise Linux manuals, helpful websites, and books containing information applicable to the topic.
This consistent structure allows readers to approach the Red Hat Enterprise Linux Introduction to System Administration in whatever way they choose. For example, an experienced system administrator with little Red Hat Enterprise Linux experience could skim only the sections that specifically focus on Red Hat Enterprise Linux, while a new system administrator could start by reading only the generic overview sections, using the Red Hat Enterprise Linux-specific sections as an introduction to more in-depth resources.
While on the subject of more in-depth resources, the Red Hat Enterprise Linux System Administration Guide is an excellent resource for performing specific tasks in a Red Hat Enterprise Linux environment. Administrators requiring more in-depth, factual information should refer to the Red Hat Enterprise Linux Reference Guide.
HTML, PDF, and RPM versions of the manuals are available on the Red Hat Enterprise Linux Documentation CD and online at http://www.redhat.com/docs/.
Note
Although this manual reflects the most current information possible, read the Red Hat Enterprise Linux Release Notes for information that may not have been available prior to our documentation being finalized. They can be found on the Red Hat Enterprise Linux CD #1 and online at http://www.redhat.com/docs/.
1. Architecture-specific Information
Unless otherwise noted, all information contained in this manual applies only to the x86 processor and processors featuring the Intel® Extended Memory 64 Technology (Intel® EM64T) and AMD64 technologies. For architecture-specific information, refer to the Red Hat Enterprise Linux Installation Guide for your respective architecture.
2. Document Conventions
When you read this manual, certain words are represented in different fonts, typefaces, sizes, and weights. This highlighting is systematic; different words are represented in the same style to indicate their inclusion in a specific category. The types of words that are represented this way include the following:
command
Linux commands (and other operating system commands, when used) are represented this way. This style should indicate to you that you can type the word or phrase on the command line and press [Enter] to invoke a command. Sometimes a command contains words that would be displayed in a different style on their own (such as file names). In these cases, they are considered to be part of the command, so the entire phrase is displayed as a command. For example:
Use the cat testfile command to view the contents of a file, named testfile, in the current working directory.
file name
File names, directory names, paths, and RPM package names are represented this way. This style should indicate that a particular file or directory exists by that name on your system. Examples:
The .bashrc file in your home directory contains bash shell definitions and aliases for your own use.
The /etc/fstab file contains information about different system devices and file systems.
Install the webalizer RPM if you want to use a Web server log file analysis program.
application
This style indicates that the program is an end-user application (as opposed to system software). For example:
Use Mozilla to browse the Web.
[key]
A key on the keyboard is shown in this style. For example:
To use [Tab] completion, type in a character and then press the [Tab] key. Your terminal displays the list of files in the directory that start with that letter.
[key]-[combination]
A combination of keystrokes is represented in this way. For example:
The [Ctrl]-[Alt]-[Backspace] key combination exits your graphical session and returns you to the graphical login screen or the console.
text found on a GUI interface
A title, word, or phrase found on a GUI interface screen or window is shown in this style. Text shown in this style is being used to identify a particular GUI screen or an element on a GUI screen (such as text associated with a checkbox or field). Example:
Select the Require Password checkbox if you would like your screensaver to require a password before stopping.
top level of a menu on a GUI screen or window
A word in this style indicates that the word is the top level of a pulldown menu. If you click on the word on the GUI screen, the rest of the menu should appear. For example:
Under File on a GNOME terminal, the New Tab option allows you to open multiple shell prompts in the same window.
If you need to type in a sequence of commands from a GUI menu, they are shown like the following example:
Go to Main Menu Button (on the Panel) => Programming => Emacs to start the Emacs text editor.
button on a GUI screen or window
This style indicates that the text can be found on a clickable button on a GUI screen. For example:
Click on the Back button to return to the webpage you last viewed.
computer output
Text in this style indicates text displayed to a shell prompt such as error messages and responses to commands. For example:
The ls command displays the contents of a directory. For example:
Desktop    about.html    logs    paulwesterberg.png
Mail       backupfiles   mail    reports
The output returned in response to the command (in this case, the contents of the directory) is shown in this style.
prompt
A prompt, which is a computer’s way of signifying that it is ready for you to input something, is shown in this style. Examples:
$
#
[stephen@maturin stephen]$
leopard login:
user input
Text that the user has to type, either on the command line, or into a text box on a GUI screen, is displayed in this style. In the following example, text is displayed in this style:
To boot your system into the text based installation program, you must type in the text command at the boot: prompt.
replaceable
Text used for examples, which is meant to be replaced with data provided by the user, is displayed in this style. In the following example, <version-number> is displayed in this style:
The directory for the kernel source is /usr/src/<version-number>/, where <version-number> is the version of the kernel installed on this system.
Additionally, we use several different strategies to draw your attention to certain pieces of information. In order of how critical the information is to your system, these items are marked as a note, tip, important, caution, or warning. For example:
Note
Remember that Linux is case sensitive. In other words, a rose is not a ROSE is not a rOsE.
Tip
The directory /usr/share/doc/ contains additional documentation for packages installed on your system.
Important
If you modify the DHCP configuration file, the changes do not take effect until you restart the DHCP daemon.
Caution
Do not perform routine tasks as root — use a regular user account unless you need to use the root account for system administration tasks.
Warning
Be careful to remove only the necessary Red Hat Enterprise Linux partitions. Removing other parti­tions could result in data loss or a corrupted system environment.
3. Activate Your Subscription
Before you can access service and software maintenance information, and the support documentation included in your subscription, you must activate your subscription by registering with Red Hat. Registration includes these simple steps:
Provide a Red Hat login
Provide a subscription number
Connect your system
The first time you boot your installation of Red Hat Enterprise Linux, you are prompted to register with Red Hat using the Setup Agent. If you follow the prompts during the Setup Agent, you can complete the registration steps and activate your subscription.
If you cannot complete registration during the Setup Agent (which requires network access), you can alternatively complete the Red Hat registration process online at http://www.redhat.com/register/.
3.1. Provide a Red Hat Login
If you do not have an existing Red Hat login, you can create one when prompted during the Setup Agent or online at:
https://www.redhat.com/apps/activate/newlogin.html
A Red Hat login enables your access to:
Software updates, errata and maintenance via Red Hat Network
Red Hat technical support resources, documentation, and Knowledgebase
If you have forgotten your Red Hat login, you can search for your Red Hat login online at:
https://rhn.redhat.com/help/forgot_password.pxt
3.2. Provide Your Subscription Number
Your subscription number is located in the package that came with your order. If your package did not include a subscription number, your subscription was activated for you and you can skip this step.
You can provide your subscription number when prompted during the Setup Agent or by visiting http://www.redhat.com/register/.
3.3. Connect Your System
The Red Hat Network Registration Client helps you connect your system so that you can begin to get updates and perform systems management. There are three ways to connect:
1. During the Setup Agent — Check the Send hardware information and Send system package list options when prompted.
2. After the Setup Agent has been completed — From the Main Menu, go to System Tools, then select Red Hat Network.
3. After the Setup Agent has been completed — Enter the following command from the command line as the root user:
/usr/bin/up2date --register
4. More to Come
The Red Hat Enterprise Linux Introduction to System Administration is part of Red Hat’s growing commitment to provide useful and timely support to Red Hat Enterprise Linux users. As new releases of Red Hat Enterprise Linux are made available, we make every effort to include both new and improved documentation for you.
4.1. Send in Your Feedback
If you spot a typo in the Red Hat Enterprise Linux Introduction to System Administration, or if you have thought of a way to make this manual better, we would love to hear from you. Please submit a report in Bugzilla (http://bugzilla.redhat.com/bugzilla) against the component rhel-isa.
Be sure to mention the manual’s identifier:
rhel-isa(EN)-4-Print-RHI (2004-08-25T17:11)
If you mention this manual’s identifier, we will know exactly which version of the guide you have.
If you have a suggestion for improving the documentation, try to be as specific as possible. If you have found an error, please include the section number and some of the surrounding text so we can find it easily.
Chapter 1.
The Philosophy of System Administration
Although the specifics of being a system administrator may change from platform to platform, there are underlying themes that do not. These themes make up the philosophy of system administration.
The themes are:
Automate everything
Document everything
Communicate as much as possible
Know your resources
Know your users
Know your business
Security cannot be an afterthought
Plan ahead
Expect the unexpected
The following sections explore each theme in more detail.
1.1. Automate Everything
Most system administrators are outnumbered — either by their users, their systems, or both. In many cases, automation is the only way to keep up. In general, anything done more than once should be examined as a possible candidate for automation.
Here are some commonly automated tasks:
Free disk space checking and reporting
Backups
System performance data collection
User account maintenance (creation, deletion, etc.)
Business-specific functions (pushing new data to a Web server, running monthly/quarterly/yearly reports, etc.)
This list is by no means complete; the functions automated by system administrators are only limited by an administrator’s willingness to write the necessary scripts. In this case, being lazy (and making the computer do more of the mundane work) is actually a good thing.
Automation also gives users the extra benefit of greater predictability and consistency of service.
Tip
Keep in mind that if you have a task that should be automated, it is likely that you are not the first system administrator to have that need. Here is where the benefits of open source software really shine — you may be able to leverage someone else’s work to automate the manual procedure that is currently eating up your time. So always make sure you search the Web before writing anything more complex than a small Perl script.
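As a concrete example, consider the first task in the list above: free disk space checking and reporting. A minimal sketch might look like the following; the script name, the 90% threshold, and the mail recipient are assumptions to adjust for your site.

#!/bin/bash
# check-diskspace.sh -- mail a warning for any file system over a usage threshold.
# Hypothetical example; the threshold and recipient are site-specific assumptions.
THRESHOLD=90
REPORT=$(df -P | awk -v limit="$THRESHOLD" 'NR > 1 {
    used = $5
    sub(/%/, "", used)
    if (used + 0 >= limit)
        printf "%s is %s%% full (mounted on %s)\n", $1, used, $6
}')
if [ -n "$REPORT" ]; then
    echo "$REPORT" | mail -s "Disk space warning on $(hostname)" root
fi

Run the script by hand to verify its output, then schedule it to run unattended (Section 1.10.1 introduces the cron and at commands used for such scheduling).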
1.2. Document Everything
If given the choice between installing a brand-new server and writing a procedural document on performing system backups, the average system administrator would install the new server every time. While this is not at all unusual, you must document what you do. Many system administrators put off doing the necessary documentation for a variety of reasons:
"I will get around to it later."
Unfortunately, this is usually not true. Even if a system administrator is not kidding themselves, the nature of the job is such that everyday tasks are usually too chaotic to "do it later." Even worse, the longer it is put off, the more that is forgotten, leading to a much less detailed (and therefore, less useful) document.
"Why write it up? I will remember it."
Unless you are one of those rare individuals with a photographic memory, no, you will not remember it. Or worse, you will remember only half of it, not realizing that you are missing the whole story. This leads to wasted time either trying to relearn what you had forgotten or fixing what you had broken due to your incomplete understanding of the situation.
"If I keep it in my head, they will not fire me — I will have job security!"
While this may work for a while, invariably it leads to less — not more — job security. Think for a moment about what may happen during an emergency. You may not be available; your documentation may save the day by letting someone else resolve the problem in your absence. And never forget that emergencies tend to be times when upper management pays close attention. In such cases, it is better to have your documentation be part of the solution than it is for your absence to be part of the problem.
In addition, if you are part of a small but growing organization, eventually there will be a need for another system administrator. How can this person learn to back you up if everything is in your head? Worse yet, not documenting may make you so indispensable that you might not be able to advance your career. You could end up working for the very person that was hired to assist you.
Hopefully you are now sold on the benefits of system documentation. That brings us to the next question: What should you document? Here is a partial list:
Policies
Policies are written to formalize and clarify the relationship you have with your user community. They make it clear to your users how their requests for resources and/or assistance are handled. The nature, style, and method of disseminating policies to your community varies from organization to organization.
Procedures
Procedures are any step-by-step sequence of actions that must be taken to accomplish a certain task. Procedures to be documented can include backup procedures, user account management procedures, problem reporting procedures, and so on. Like automation, if a procedure is followed more than once, it is a good idea to document it.
Changes
A large part of a system administrator’s career revolves around making changes — configuring systems for maximum performance, tweaking scripts, modifying configuration files, and so on.
All of these changes should be documented in some fashion. Otherwise, you could find yourself being completely confused about a change you made several months earlier.
Some organizations use more complex methods for keeping track of changes, but in many cases a simple revision history at the start of the file being changed is all that is necessary. At a minimum, each entry in the revision history should contain:
The name or initials of the person making the change
The date the change was made
The reason the change was made
This results in concise, yet useful entries:
ECB, 12-June-2002 — Updated entry for new Accounting printer (to support the replacement printer’s ability to print duplex)
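Kept as comments at the top of the file being changed, such a revision history might look like the following; the file name and the older entry are hypothetical:

# Revision history for /etc/printcap
#
# ECB, 12-June-2002 -- Updated entry for new Accounting printer
#                      (to support the replacement printer's ability to print duplex)
# JDR, 03-April-2002 -- Added entry for the Accounting printer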
1.3. Communicate as Much as Possible
When it comes to your users, you can never communicate too much. Be aware that small system changes you might think are practically unnoticeable could very well completely confuse the administrative assistant in Human Resources.
The method by which you communicate with your users can vary according to your organization. Some organizations use email; others, an internal website. Still others may rely on Usenet news or IRC. A sheet of paper tacked to a bulletin board in the breakroom may even suffice at some places. In any case, use whatever method(s) that work well at your organization.
In general, it is best to follow this paraphrased approach used in writing newspaper stories:
1. Tell your users what you are going to do
2. Tell your users what you are doing
3. Tell your users what you have done
The following sections look at these steps in more depth.
1.3.1. Tell Your Users What You Are Going to Do
Make sure you give your users sufficient warning before you do anything. The actual amount of warning necessary varies according to the type of change (upgrading an operating system demands more lead time than changing the default color of the system login screen), as well as the nature of your user community (more technically adept users may be able to handle changes more readily than users with minimal technical skills).
At a minimum, you should describe:
The nature of the change
When it will take place
Why it is happening
Approximately how long it should take
The impact (if any) that the users can expect due to the change
Contact information should they have any questions or concerns
Here is a hypothetical situation. The Finance department has been experiencing problems with their database server being very slow at times. You are going to bring the server down, upgrade the CPU module to a faster model, and reboot. Once this is done, you will move the database itself to faster, RAID-based storage. Here is one possible announcement for this situation:
System Downtime Scheduled for Friday Night
Starting this Friday at 6pm (midnight for our associates in Berlin), all financial applications will be unavailable for a period of approximately four hours.
During this time, changes to both the hardware and software on the Finance database server will be performed. These changes should greatly reduce the time required to run the Accounts Payable and Accounts Receivable applications, and the weekly Balance Sheet report.
Other than the change in runtime, most people should notice no other change. However, those of you that have written your own SQL queries should be aware that the layout of some indices will change. This is documented on the company intranet website, on the Finance page.
Should you have any questions, comments, or concerns, please contact System Administration at extension 4321.
A few points are worth noting:
Effectively communicate the start and duration of any downtime that might be involved in the change.
Make sure you give the time of the change in such a way that it is useful to all users, no matter where they may be located.
Use terms that your users understand. The people impacted by this work do not care that the new CPU module is a 2GHz unit with twice as much L2 cache, or that the database is being placed on a RAID 5 logical volume.
1.3.2. Tell Your Users What You Are Doing
This step is primarily a last-minute warning of the impending change; as such, it should be a brief repeat of the first message, though with the impending nature of the change made more apparent ("The system upgrade will take place TONIGHT."). This is also a good place to publicly answer any questions you may have received as a result of the first message.
Continuing our hypothetical example, here is one possible last-minute warning:
System Downtime Scheduled for Tonight
Reminder: The system downtime announced this past Monday will take place as scheduled tonight at 6pm (midnight for the Berlin office). You can find the original announcement on the company intranet website, on the System Administration page.
Several people have asked whether they should stop working early tonight to make sure their work is backed up prior to the downtime. This will not be necessary, as the work being done tonight will not impact any work done on your personal workstations.
Remember, those of you that have written your own SQL queries should be aware that the layout of some indices will change. This is documented on the company intranet website, on the Finance page.
Your users have been alerted; now you are ready to actually do the work.
1.3.3. Tell Your Users What You Have Done
After you have finished making the changes, you must tell your users what you have done. Again, this should be a summary of the previous messages (invariably someone will not have read them). Be sure to send this message out as soon as the work is done, before you leave for home; once you have left the office, it is much too easy to forget, leaving your users in the dark as to whether they can use the system or not.
However, there is one important addition you must make. It is vital that you give your users the current status. Did the upgrade not go as smoothly as planned? Was the new storage server only able to serve the systems in Engineering, and not in Finance? These types of issues must be addressed here.
Of course, if the current status differs from what you communicated previously, you should make this point clear and describe what will be done (if anything) to arrive at the final solution.
In our hypothetical situation, the downtime had some problems. The new CPU module did not work; a call to the system’s manufacturer revealed that a special version of the module is required for in-the-field upgrades. On the plus side, the migration of the database to the RAID volume went well (even though it took a bit longer than planned due to the problems with the CPU module).
Here is one possible announcement:
System Downtime Complete
The system downtime scheduled for Friday night (refer to the System Administration page on the company intranet website) has been completed. Unfortunately, hardware issues prevented one of the tasks from being completed. Due to this, the remaining tasks took longer than the originally-scheduled four hours. Instead, all systems were back in production by midnight (6am Saturday for the Berlin office).
Because of the remaining hardware issues, performance of the AP, AR, and the Balance Sheet report will be slightly improved, but not to the extent originally planned. A second downtime will be announced and scheduled as soon as the issues that prevented completion of the task have been resolved.
Please note that the downtime did change some database indices; people that have written their own SQL queries should consult the Finance page on the company intranet website.
Please contact System Administration at extension 4321 with any questions.
With this kind of information, your users will have sufficient background knowledge to continue their work, and to understand how the changes impact them.
1.4. Know Your Resources
System administration is mostly a matter of balancing available resources against the people and programs that use those resources. Therefore, your career as a system administrator will be a short and stress-filled one unless you fully understand the resources you have at your disposal.
Some of the resources are ones that seem pretty obvious:
System resources, such as available processing power, memory, and disk space
Network bandwidth
Available money in the IT budget
But some may not be so obvious:
The services of operations personnel, other system administrators, or even an administrative assistant
Time (often of critical importance when the time involves things such as the amount of time during which system backups may take place)
Knowledge (whether it is stored in books, system documentation, or the brain of a person that has worked at the company for the past twenty years)
It is important to note that it is highly valuable to take a complete inventory of those resources available to you and to keep it current — a lack of "situational awareness" when it comes to available resources can often be worse than no awareness at all.
1.5. Know Your Users
Although some people bristle at the term "users" (perhaps due to some system administrators’ use of the term in a derogatory manner), it is used here with no such connotation implied. Users are those people that use the systems and resources for which you are responsible — no more, and no less. As such, they are central to your ability to successfully administer your systems; without understanding your users, how can you understand the system resources they require?
For example, consider a bank teller. A bank teller uses a strictly-defined set of applications and requires little in the way of system resources. A software engineer, on the other hand, may use many different applications and always welcomes more system resources (for faster build times). Two entirely different users with two entirely different needs.
Make sure you learn as much about your users as you can.
1.6. Know Your Business
Whether you work for a large, multinational corporation or a small community college, you must still understand the nature of the business environment in which you work. This can be boiled down to one question:
What is the purpose of the systems you administer?
The key point here is to understand your systems’ purpose in a more global sense:
Applications that must be run within certain time frames, such as at the end of a month, quarter, or year
The times during which system maintenance may be done
New technologies that could be used to resolve long-standing business problems
By taking into account your organization’s business, you will find that your day-to-day decisions will be better for your users, and for you.
1.7. Security Cannot be an Afterthought
No matter what you might think about the environment in which your systems are running, you cannot take security for granted. Even standalone systems not connected to the Internet may be at risk (although obviously the risks will be different from a system that has connections to the outside world).
Therefore, it is extremely important to consider the security implications of everything you do. The following list illustrates the different kinds of issues you should consider:
The nature of possible threats to each of the systems under your care
The location, type, and value of the data on those systems
The type and frequency of authorized access to the systems
While you are thinking about security, do not make the mistake of assuming that possible intruders will only attack your systems from outside of your company. Many times the perpetrator is someone within the company. So the next time you walk around the office, look at the people around you and ask yourself this question:
What would happen if that person were to attempt to subvert our security?
Note
This does not mean that you should treat your coworkers as if they are criminals. It just means that you should look at the type of work that each person performs and determine what types of security breaches a person in that position could perpetrate, if they were so inclined.
1.7.1. The Risks of Social Engineering
While most system administrators’ first reaction when they think about security is to concentrate on the technological aspects, it is important to maintain perspective. Quite often, security breaches do not have their origins in technology, but in human nature.
People interested in breaching security often use human nature to entirely bypass technological access controls. This is known as social engineering. Here is an example:
The second shift operator receives an outside phone call. The caller claims to be your organization’s CFO (the CFO’s name and background information was obtained from your organization’s website, on the "Management Team" page).
The caller claims to be calling from some place halfway around the world (maybe this part of the story is a complete fabrication, or perhaps your organization’s website has a recent press release that makes mention of the CFO attending a tradeshow).
The caller tells a tale of woe; his laptop was stolen at the airport, and he is with an important customer and needs access to the corporate intranet to check on the customer’s account status. Would the operator be so kind as to give him the necessary access information?
Do you know what your operator would do? Unless your operator has guidance (in the form of policies and procedures), you very likely do not know for sure.
Like traffic lights, the goal of policies and procedures is to provide unambiguous guidance as to what is and is not appropriate behavior. However, just as with traffic lights, policies and procedures only work if everyone follows them. And there is the crux of the problem — it is unlikely that everyone will adhere to your policies and procedures. In fact, depending on the nature of your organization, it is possible that you do not even have sufficient authority to define policies, much less enforce them. What then?
Unfortunately, there are no easy answers. User education can help; do everything you can to help make your user community aware of security and social engineering. Give lunchtime presentations about security. Post pointers to security-related news articles on your organization’s mailing lists. Make yourself available as a sounding board for users’ questions about things that do not seem quite right.
In short, get the message out to your users any way you can.
1.8. Plan Ahead
System administrators that took all this advice to heart and did their best to follow it would be fantastic system administrators — for a day. Eventually, the environment will change, and one day our fantastic administrator would be caught flat-footed. The reason? Our fantastic administrator failed to plan ahead.
Certainly no one can predict the future with 100% accuracy. However, with a bit of awareness it is easy to read the signs of many changes:
An offhand mention of a new project gearing up during that boring weekly staff meeting is a sure sign that you will likely need to support new users in the near future
Talk of an impending acquisition means that you may end up being responsible for new (and possibly incompatible) systems in one or more remote locations
Being able to read these signs (and to respond effectively to them) makes life easier for you and your users.
1.9. Expect the Unexpected
While the phrase "expect the unexpected" is trite, it reflects an underlying truth that all system administrators must understand:
There will be times when you are caught off-guard.
After becoming comfortable with this uncomfortable fact of life, what can a concerned system administrator do? The answer lies in flexibility; by performing your job in such a way as to give you (and your users) the most options possible. Take, for example, the issue of disk space. Given that never having sufficient disk space seems to be as much a physical law as the law of gravity, it is reasonable to assume that at some point you will be confronted with a desperate need for additional disk space right now.
What would a system administrator who expects the unexpected do in this case? Perhaps it is possible to keep a few disk drives sitting on the shelf as spares in case of hardware problems [2]. A spare of this type could be quickly deployed [3] on a temporary basis to address the short-term need for disk space, giving time to more permanently resolve the issue (by following the standard procedure for procuring additional disk drives, for example).
By trying to anticipate problems before they occur, you will be in a position to respond more quickly and effectively than if you let yourself be surprised.
1.10. Red Hat Enterprise Linux-Specific Information
This section describes information related to the philosophy of system administration that is specific to Red Hat Enterprise Linux.
[2] And of course, a system administrator that expects the unexpected would naturally use RAID (or related technologies) to lessen the impact of a critical disk drive failing during production.
[3] Again, system administrators that think ahead configure their systems to make it as easy as possible to quickly add a new disk drive to the system.
1.10.1. Automation
Automation of frequently-performed tasks under Red Hat Enterprise Linux requires knowledge of several different types of technologies. First are the commands that control the timing of command or script execution. The cron and at commands are most commonly used in these roles.
Incorporating an easy-to-understand yet powerfully flexible time specification system, cron can schedule the execution of commands or scripts for recurring intervals ranging in length from minutes to months. The crontab command is used to manipulate the files controlling the cron daemon that actually schedules each cron job for execution.
The at command (and the closely-related command batch) are more appropriate for scheduling the execution of one-time scripts or commands. These commands implement a rudimentary batch subsystem consisting of multiple queues with varying scheduling priorities. The priorities are known as niceness levels (due to the name of the command — nice). Both at and batch are perfect for tasks that must start at a given time but are not time-critical in terms of finishing.
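For illustration, here is what such entries might look like; the script path is hypothetical. The first line is a user crontab entry (edited with crontab -e) that runs a backup script at 02:30 every day, while the second runs the same script once at 23:00 tonight:

# minute hour day-of-month month day-of-week command
30 2 * * * /usr/local/bin/nightly-backup.sh

echo "/usr/local/bin/nightly-backup.sh" | at 23:00

The crontab entry recurs until it is removed; the at job runs once and is then discarded.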
Next are the various scripting languages. These are the "programming languages" that the average system administrator uses to automate manual operations. There are many scripting languages (and each system administrator tends to have a personal favorite), but the following are currently the most common:
The bash command shell
The perl scripting language
The python scripting language
Over and above the obvious differences between these languages, the biggest difference is in the way in which these languages interact with other utility programs on a Red Hat Enterprise Linux system. Scripts written with the bash shell tend to make more extensive use of the many small utility programs (for example, to perform character string manipulation), while perl scripts perform more of these types of operations using features built into the language itself. A script written using python can fully exploit the language’s object-oriented capabilities, making complex scripts more easily extensible.
This means that, in order to truly master shell scripting, you must be familiar with the many utility programs (such as grep and sed) that are part of Red Hat Enterprise Linux. Learning perl (and python), on the other hand, tends to be a more "self-contained" process. However, many perl language constructs are based on the syntax of various traditional UNIX utility programs, and as such are familiar to those Red Hat Enterprise Linux system administrators with shell scripting experience.
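As a small illustration of this difference (assuming the goal is simply to list the user names defined in /etc/passwd), the shell version leans on the external cut utility, while the perl version does the field splitting in the language itself:

cut -d: -f1 /etc/passwd | sort
perl -F: -lane 'print $F[0]' /etc/passwd | sort

Both commands produce the same sorted list of user names.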
1.10.2. Documentation and Communication
In the areas of documentation and communication, there is little that is specific to Red Hat Enterprise Linux. Since documentation and communication can consist of anything from adding comments to a text-based configuration file to updating a webpage or sending an email, a system administrator using Red Hat Enterprise Linux must have access to text editors, HTML editors, and mail clients.
Here is a small sample of the many text editors available under Red Hat Enterprise Linux:
The gedit text editor
The Emacs text editor
The Vim text editor
The gedit text editor is a strictly graphical application (in other words, it requires an active X Window System environment), while Vim and Emacs are primarily text-based in nature.
The subject of the best text editor has sparked debate for nearly as long as computers have existed and will continue to do so. Therefore, the best approach is to try each editor for yourself, and use what works best for you.
For HTML editors, system administrators can use the Composer function of the Mozilla Web browser. Of course, some system administrators prefer to hand-code their HTML, making a regular text editor a perfectly acceptable tool as well.
As far as email is concerned, Red Hat Enterprise Linux includes the Evolution graphical email client, the Mozilla email client (which is also graphical), and mutt, which is text-based. As with text editors, the choice of an email client tends to be a personal one; therefore, the best approach is to try each client for yourself, and use what works best for you.
1.10.3. Security
As stated earlier in this chapter, security cannot be an afterthought, and security under Red Hat Enterprise Linux is more than skin-deep. Authentication and access controls are deeply integrated into the operating system and are based on designs gleaned from long experience in the UNIX community.
For authentication, Red Hat Enterprise Linux uses PAM — Pluggable Authentication Modules. PAM makes it possible to fine-tune user authentication via the configuration of shared libraries that all PAM-aware applications use, all without requiring any changes to the applications themselves.
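As a simplified sketch (not a copy of any file shipped with Red Hat Enterprise Linux), a service's file under /etc/pam.d/ stacks modules in this general form:

# /etc/pam.d/<service> -- illustrative only
auth      required    pam_securetty.so
auth      required    pam_unix.so
account   required    pam_unix.so
password  required    pam_unix.so
session   required    pam_unix.so

Changing how a PAM-aware application authenticates users is then a matter of editing its configuration file, not the application itself.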
Access control under Red Hat Enterprise Linux uses traditional UNIX-style permissions (read, write, execute) against user, group, and "everyone else" classifications. Like UNIX, Red Hat Enterprise Linux also makes use of setuid and setgid bits to temporarily confer expanded access rights to processes running a particular program, based on the ownership of the program file. Of course, this makes it critical that any program to be run with setuid or setgid privileges be carefully audited to ensure that no exploitable vulnerabilities exist.
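One common way to perform such an audit (the exact options can be adjusted to taste) is to search the system for files with the setuid or setgid bit set:

find / -type f -perm -4000 -ls    # list all setuid files
find / -type f -perm -2000 -ls    # list all setgid files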
Red Hat Enterprise Linux also includes support for access control lists. An access control list (ACL) is a construct that allows extremely fine-grained control over what users or groups may access a file or directory. For example, a file’s permissions may restrict all access by anyone other than the file’s owner, yet the file’s ACL can be configured to allow only user bob to write and group finance to read the file.
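A sketch of how the ACL in that example might be applied (the file name is hypothetical, and the file system must be mounted with ACL support):

chmod 600 report.txt               # restrict everyone but the owner
setfacl -m u:bob:w report.txt      # ...then allow user bob to write
setfacl -m g:finance:r report.txt  # ...and group finance to read
getfacl report.txt                 # review the resulting ACL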
Another aspect of security is being able to keep track of system activity. Red Hat Enterprise Linux makes extensive use of logging, both at a kernel and an application level. Logging is controlled by the system logging daemon syslogd, which can log system information locally (normally to files in the /var/log/ directory) or to a remote system (which acts as a dedicated log server for multiple computers).
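A hedged sketch of what the relevant /etc/syslog.conf lines might look like (the log server name is a placeholder):

# log informational messages locally, and copy authentication
# messages to a dedicated log server as well
*.info;authpriv.none        /var/log/messages
authpriv.*                  /var/log/secure
authpriv.*                  @logserver.example.com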
Intrusion detection systems (IDS) are powerful tools for any Red Hat Enterprise Linux system administrator. An IDS makes it possible for system administrators to determine whether unauthorized changes were made to one or more systems. The overall design of the operating system itself includes IDS-like functionality.
Because Red Hat Enterprise Linux is installed using the RPM Package Manager (RPM), it is possible to use RPM to verify whether any changes have been made to the packages comprising the operating system. However, because RPM is primarily a package management tool, its abilities as an IDS are somewhat limited. Even so, it can be a good first step toward monitoring a Red Hat Enterprise Linux system for unauthorized modifications.
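For example, rpm prints a line for each file whose size, checksum, permissions, or other attributes have changed since its package was installed:

rpm -Va             # verify all installed packages
rpm -V coreutils    # verify a single package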
1.11. Additional Resources
This section includes various resources that can be used to learn more about the philosophy of system administration and the Red Hat Enterprise Linux-specific subject matter discussed in this chapter.
1.11.1. Installed Documentation
The following resources are installed in the course of a typical Red Hat Enterprise Linux installation and can help you learn more about the subject matter discussed in this chapter.
crontab(1) and crontab(5) man pages — Learn how to schedule commands and scripts for automatic execution at regular intervals.
at(1) man page — Learn how to schedule commands and scripts for execution at a later time.
bash(1) man page — Learn more about the default shell and shell script writing.
perl(1) man page — Review pointers to the many man pages that make up perl's online documentation.
python(1) man page — Learn more about options, files, and environment variables controlling the Python interpreter.
gedit(1) man page and Help menu entry — Learn how to edit text files with this graphical text editor.
emacs(1) man page — Learn more about this highly-flexible text editor, including how to run its online tutorial.
vim(1) man page — Learn how to use this powerful text editor.
Mozilla Help Contents menu entry — Learn how to edit HTML files, read mail, and browse the Web.
evolution(1) man page and Help menu entry — Learn how to manage your email with this graphical email client.
mutt(1) man page and files in /usr/share/doc/mutt-<version> — Learn how to manage your email with this text-based email client.
pam(8) man page and files in /usr/share/doc/pam-<version> — Learn how authentication takes place under Red Hat Enterprise Linux.
1.11.2. Useful Websites
http://www.kernel.org/pub/linux/libs/pam/ — The Linux-PAM project homepage.
http://www.usenix.org/ — The USENIX homepage. A professional organization dedicated to bringing together computer professionals of all types and fostering improved communication and innovation.
http://www.sage.org/ — The System Administrators Guild homepage. A USENIX special technical group that is a good resource for all system administrators responsible for Linux (or Linux-like) operating systems.
http://www.python.org/ — The Python Language Website. An excellent site for learning more about Python.
http://www.perl.org/ — The Perl Mongers Website. A good place to start learning about Perl and connecting with the Perl community.
http://www.rpm.org/ — The RPM Package Manager homepage. The most comprehensive website for learning about RPM.
1.11.3. Related Books
Most books on system administration do little to cover the philosophy behind the job. However, the following books do have sections that give a bit more depth to the issues that were discussed here:
The Red Hat Enterprise Linux Reference Guide; Red Hat, Inc. — Provides an overview of locations of key system files, user and group settings, and PAM configuration.
The Red Hat Enterprise Linux Security Guide; Red Hat, Inc. — Contains a comprehensive discussion of many security-related issues for Red Hat Enterprise Linux system administrators.
The Red Hat Enterprise Linux System Administration Guide; Red Hat, Inc. — Includes chapters on managing users and groups, automating tasks, and managing log files.
Linux Administration Handbook by Evi Nemeth, Garth Snyder, and Trent R. Hein; Prentice Hall — Provides a good section on the policies and politics side of system administration, including several "what-if" discussions concerning ethics.
Linux System Administration: A User's Guide by Marcel Gagne; Addison Wesley Professional — Contains a good chapter on automating various tasks.
Solaris System Management by John Philcox; New Riders Publishing — Although not specifically written for Red Hat Enterprise Linux (or even Linux in general), and using the term "system manager" instead of "system administrator," this book provides a 70-page overview of the many roles that system administrators play in a typical organization.
Chapter 2.
Resource Monitoring
As stated earlier, a great deal of system administration revolves around resources and their efficient use. By balancing various resources against the people and programs that use those resources, you waste less money and make your users as happy as possible. However, this leaves two questions:
What are resources?
And:
How is it possible to know what resources are being used (and to what extent)?
The purpose of this chapter is to enable you to answer these questions by helping you to learn more about resources and how they can be monitored.
2.1. Basic Concepts
Before you can monitor resources, you first have to know what resources there are to monitor. All systems have the following resources available:
CPU power
Bandwidth
Memory
Storage
These resources are covered in more depth in the following chapters. However, for the time being all you need to keep in mind is that these resources have a direct impact on system performance, and therefore, on your users’ productivity and happiness.
At its simplest, resource monitoring is nothing more than obtaining information concerning the utilization of one or more system resources.
However, it is rarely this simple. First, one must take into account the resources to be monitored. Then it is necessary to examine each system to be monitored, paying particular attention to each system’s situation.
The systems you monitor fall into one of two categories:
The system is currently experiencing performance problems at least part of the time, and you would like to improve its performance.
The system is currently running well, and you would like it to stay that way.
The first category means you should monitor resources from a system performance perspective, while the second category means you should monitor system resources from a capacity planning perspective.
Because each perspective has its own unique requirements, the following sections explore each category in more depth.
2.2. System Performance Monitoring
As stated above, system performance monitoring is normally done in response to a performance problem. Either the system is running too slowly, or programs (and sometimes even the entire system) fail to run at all. In either case, performance monitoring is normally done as the first and last steps of a three-step process:
1. Monitoring to identify the nature and scope of the resource shortages that are causing the performance problems
2. The data produced from monitoring is analyzed and a course of action (normally performance tuning and/or the procurement of additional hardware) is taken to resolve the problem
3. Monitoring to ensure that the performance problem has been resolved
Because of this, performance monitoring tends to be relatively short-lived in duration and more detailed in scope.
Note
System performance monitoring is often an iterative process, with these steps being repeated several times to arrive at the best possible system performance. The primary reason for this is that system resources and their utilization tend to be highly interrelated, meaning that often the elimination of one resource bottleneck uncovers another one.
2.3. Monitoring System Capacity
Monitoring system capacity is done as part of an ongoing capacity planning program. Capacity planning uses long-term resource monitoring to determine rates of change in the utilization of system resources. Once these rates of change are known, it becomes possible to conduct more accurate long-term planning regarding the procurement of additional resources.
Monitoring done for capacity planning purposes is different from performance monitoring in two ways:
The monitoring is done on a more-or-less continuous basis
The monitoring is usually not as detailed
The reason for these differences stems from the goals of a capacity planning program. Capacity planning requires a "big picture" viewpoint; short-term or anomalous resource usage is of little concern. Instead, data is collected over a period of time, making it possible to categorize resource utilization in terms of changes in workload. In more narrowly-defined environments (where only one application is run, for example), it is possible to model the application's impact on system resources. This can be done with sufficient accuracy to make it possible to determine, for example, the impact of five more customer service representatives running the customer service application during the busiest time of the day.
2.4. What to Monitor?
As stated earlier, the resources present in every system are CPU power, bandwidth, memory, and storage. At first glance, it would seem that monitoring would need only consist of examining these four different things.
Unfortunately, it is not that simple. For example, consider a disk drive. What things might you want to know about its performance?
How much free space is available?
How many I/O operations on average does it perform each second?
How long on average does it take each I/O operation to be completed?
How many of those I/O operations are reads? How many are writes?
What is the average amount of data read/written with each I/O?
There are more ways of studying disk drive performance; these points have only scratched the surface. The main concept to keep in mind is that there are many different types of data for each resource.
The following sections explore the types of utilization information that would be helpful for each of the major resource types.
2.4.1. Monitoring CPU Power
In its most basic form, monitoring CPU power can be no more difficult than determining if CPU utilization ever reaches 100%. If CPU utilization stays below 100%, no matter what the system is doing, there is additional processing power available for more work.
However, it is a rare system that does not reach 100% CPU utilization at least some of the time. At that point it is important to examine more detailed CPU utilization data. By doing so, it becomes possible to start determining where the majority of your processing power is being consumed. Here are some of the more popular CPU utilization statistics:
User Versus System
The percentage of time spent performing user-level processing versus system-level processing can point out whether a system's load is primarily due to running applications or due to operating system overhead. High user-level percentages tend to be good (assuming users are not experiencing unsatisfactory performance), while high system-level percentages tend to point toward problems that will require further investigation.
Context Switches
A context switch happens when the CPU stops running one process and starts running another. Because each context switch requires the operating system to take control of the CPU, excessive context switches and high levels of system-level CPU consumption tend to go together.
Interrupts
As the name implies, interrupts are situations where the processing being performed by the CPU is abruptly changed. Interrupts generally occur due to hardware activity (such as an I/O device completing an I/O operation) or due to software (such as software interrupts that control application processing). Because interrupts must be serviced at a system level, high interrupt rates lead to higher system-level CPU consumption.
Runnable Processes
A process may be in different states. For example, it may be:
Waiting for an I/O operation to complete
Waiting for the memory management subsystem to handle a page fault
In these cases, the process has no need for the CPU.
However, eventually the process state changes, and the process becomes runnable. As the name implies, a runnable process is one that is capable of getting work done as soon as it is scheduled to receive CPU time. However, if more than one process is runnable at any given time, all but one [1] of the runnable processes must wait for their turn at the CPU. By monitoring the number of runnable processes, it is possible to determine how CPU-bound your system is.
Other performance metrics that reflect an impact on CPU utilization tend to include different services the operating system provides to processes. They may include statistics on memory management, I/O processing, and so on. These statistics also reveal that, when system performance is monitored, there are no boundaries between the different statistics. In other words, CPU utilization statistics may end up pointing to a problem in the I/O subsystem, or memory utilization statistics may reveal an application design flaw.
Therefore, when monitoring system performance, it is not possible to examine any one statistic in complete isolation; only by examining the overall picture is it possible to extract meaningful information from any performance statistics you gather.
2.4.2. Monitoring Bandwidth
Monitoring bandwidth is more difficult than monitoring the other resources described here. This is because performance statistics tend to be device-based, while most of the places where bandwidth is important tend to be the buses that connect devices. In those instances where more than one device shares a common bus, you might see reasonable statistics for each device, but the aggregate load those devices place on the bus would be much greater.
Another challenge to monitoring bandwidth is that there can be circumstances where statistics for the devices themselves may not be available. This is particularly true for system expansion buses and datapaths [2]. However, even though 100% accurate bandwidth-related statistics may not always be available, there is often enough information to make some level of analysis possible, particularly when related statistics are taken into account.
Some of the more common bandwidth-related statistics are:
Bytes received/sent
Network interface statistics provide an indication of the bandwidth utilization of one of the more visible buses — the network.
Interface counts and rates
These network-related statistics can give indications of excessive collisions, transmit and receive errors, and more. Through the use of these statistics (particularly if the statistics are available for more than one system on your network), it is possible to perform a modicum of network troubleshooting even before the more common network diagnostic tools are used; a brief sketch of commands that expose these counters appears after this list.
Transfers per Second
Normally collected for block I/O devices, such as disk and high-performance tape drives, this statistic is a good way of determining whether a particular device’s bandwidth limit is being reached. Due to their electromechanical nature, disk and tape drives can only perform so many I/O operations every second; their performance degrades rapidly as this limit is reached.
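As promised above, a brief sketch of commands that expose these counters (the interface name eth0 is an example):

netstat -i       # per-interface packet, error, and drop counts
ifconfig eth0    # RX/TX statistics, errors, and collisions for one interface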
[1] Assuming a single-processor computer system.
[2] More information on buses, datapaths, and bandwidth is available in Chapter 3, Bandwidth and Processing Power.
2.4.3. Monitoring Memory
If there is one area where a wealth of performance statistics can be found, it is in the area of monitoring memory utilization. Due to the inherent complexity of today's demand-paged virtual memory operating systems, memory utilization statistics are many and varied. It is here that the majority of a system administrator's work with resource management takes place.
The following statistics represent a cursory overview of commonly-found memory management statistics:
Page Ins/Page Outs
These statistics make it possible to gauge the flow of pages from system memory to attached mass storage devices (usually disk drives). High rates for both of these statistics can mean that the system is short of physical memory and is thrashing, or spending more system resources on moving pages into and out of memory than on actually running applications.
Active/Inactive Pages
These statistics show how heavily memory-resident pages are used. A lack of inactive pages can point toward a shortage of physical memory.
Free, Shared, Buffered, and Cached Pages
These statistics provide additional detail over the more simplistic active/inactive page statistics. By using these statistics, it is possible to determine the overall mix of memory utilization.
Swap Ins/Swap Outs
These statistics show the system’s overall swapping behavior. Excessive rates here can point to physical memory shortages.
Successfully monitoring memory utilization requires a good understanding of how demand-paged virtual memory operating systems work. While such a subject alone could take up an entire book, the basic concepts are discussed in Chapter 4 Physical and Virtual Memory. That chapter, along with time spent actually monitoring a system, gives you the necessary building blocks to learn more about this subject.
2.4.4. Monitoring Storage
Monitoring storage normally takes place at two different levels:
Monitoring for sufficient disk space
Monitoring for storage-related performance problems
The reason for this is that it is possible to have dire problems in one area and no problems whatsoever in the other. For example, it is possible to cause a disk drive to run out of disk space without once causing any kind of performance-related problems. Likewise, it is possible to have a disk drive that has 99% free space, yet is being pushed past its limits in terms of performance.
However, it is more likely that the average system experiences varying degrees of resource shortages in both areas. Because of this, it is also likely that — to some extent — problems in one area impact the other. Most often this type of interaction takes the form of poorer and poorer I/O performance as a disk drive nears 0% free space, although in cases of extreme I/O loads it might be possible to slow I/O throughput to such a level that applications no longer run properly.
In any case, the following statistics are useful for monitoring storage:
Free Space
Free space is probably the one resource all system administrators watch closely; it would be a rare administrator that never checks on free space (or has some automated way of doing so). A minimal sketch of doing so appears after this list.
File System-Related Statistics
These statistics (such as number of files/directories, average file size, etc.) provide additional detail over a single free space percentage. As such, these statistics make it possible for system administrators to configure the system to give the best performance, as the I/O load imposed by a file system full of many small files is not the same as that imposed by a file system filled with a single massive file.
Transfers per Second
This statistic is a good way of determining whether a particular device’s bandwidth limitations are being reached.
Reads/Writes per Second
A slightly more detailed breakdown of transfers per second, these statistics allow the system administrator to more fully understand the nature of the I/O loads a storage device is experiencing. This can be critical, as some storage technologies have widely different performance characteristics for read versus write operations.
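As promised above, a minimal sketch of checking free space and file-count detail with df (the options shown are common ones, not an exhaustive list):

df -h    # free space per mounted file system, in human-readable units
df -i    # inode utilization, useful when many small files are involved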
2.5. Red Hat Enterprise Linux-Specific Information
Red Hat Enterprise Linux comes with a variety of resource monitoring tools. While there are more than those listed here, these tools are representative in terms of functionality. The tools are:
free
top (and GNOME System Monitor, a more graphically oriented version of top)
vmstat
The Sysstat suite of resource monitoring tools
The OProfile system-wide profiler
Let us examine each one in more detail.
2.5.1. free
The free command displays system memory utilization. Here is an example of its output:
             total       used       free     shared    buffers     cached
Mem:        255508     240268      15240          0       7592      86188
-/+ buffers/cache:     146488     109020
Swap:       530136      26268     503868
The Mem: row displays physical memory utilization, while the Swap: row displays the utilization of the system swap space, and the -/+ buffers/cache: row displays the amount of physical memory currently devoted to system buffers.
Since free by default only displays memory utilization information once, it is only useful for very short-term monitoring, or quickly determining if a memory-related problem is currently in progress. Although free has the ability to repetitively display memory utilization figures via its -s option, the output scrolls, making it difficult to easily detect changes in memory utilization.
Tip
A better solution than using free -s would be to run free using the watch command. For example, to display memory utilization every two seconds (the default display interval for watch), use this command:
watch free
The watch command issues the free command every two seconds, updating by clearing the screen and writing the new output to the same screen location. This makes it much easier to determine how memory utilization changes over time, since watch creates a single updated view with no scrolling. You can control the delay between updates by using the -n option, and can cause any changes between updates to be highlighted by using the -d option, as in the following command:
watch -n 1 -d free
For more information, refer to the watch man page.
The watch command runs until interrupted with [Ctrl]-[C]. The watch command is something to keep in mind; it can come in handy in many situations.
2.5.2. top
While free displays only memory-related information, the top command does a little bit of everything. CPU utilization, process statistics, memory utilization — top monitors it all. In addition, unlike the free command, top's default behavior is to run continuously; there is no need to use the watch command. Here is a sample display:
14:06:32  up 4 days, 21:20,  4 users,  load average: 0.00, 0.00, 0.00
77 processes: 76 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   19.6%    0.0%    0.0%   0.0%     0.0%    0.0%  180.2%
           cpu00    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu01   19.6%    0.0%    0.0%   0.0%     0.0%    0.0%   80.3%
Mem:  1028548k av,  716604k used,  311944k free,       0k shrd,  131056k buff
                    324996k actv,  108692k in_d,   13988k in_c
Swap: 1020116k av,    5276k used, 1014840k free                  382228k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
17578 root      15   0 13456  13M  9020 S    18.5  1.3  26:35   1 rhn-applet-gu
19154 root      20   0  1176 1176   892 R     0.9  0.1   0:00   1 top
    1 root      15   0   168  160   108 S     0.0  0.0   0:09   0 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    6 root      35  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
    9 root      15   0     0    0     0 SW    0.0  0.0   0:07   1 bdflush
    7 root      15   0     0    0     0 SW    0.0  0.0   1:19   0 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:14   1 kscand
   10 root      15   0     0    0     0 SW    0.0  0.0   0:03   1 kupdated
   11 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
The display is divided into two sections. The top section contains information related to overall system status — uptime, load average, process counts, CPU status, and utilization statistics for both memory and swap space. The lower section displays process-level statistics. It is possible to change what is displayed while top is running. For example, top by default displays both idle and non-idle processes. To display only non-idle processes, press [i]; a second press returns to the default display mode.
Warning
Although top appears to be a simple display-only program, this is not the case. top uses single-character commands to perform various operations; if you are logged in as root, it is possible to change the priority of (and even kill) any process on your system. Therefore, until you have reviewed top's help screen (type [?] to display it), it is safest to only type [q] (which exits top).
2.5.2.1. The GNOME System Monitor — A Graphical top
If you are more comfortable with graphical user interfaces, the GNOME System Monitor may be more to your liking. Like top, the GNOME System Monitor displays information related to overall system status, process counts, memory and swap utilization, and process-level statistics.
However, the GNOME System Monitor goes a step further by also including graphical representations of CPU, memory, and swap utilization, along with a tabular disk space utilization listing. An example of the GNOME System Monitor's Process Listing display appears in Figure 2-1.
Figure 2-1. The GNOME System Monitor Process Listing Display
Additional information can be displayed for a specific process by first clicking on the desired process and then clicking on the More Info button.
To display the CPU, memory, and disk usage statistics, click on the System Monitor tab.
2.5.3. vmstat
For a more concise understanding of system performance, try vmstat. With vmstat, it is possible to get an overview of process, memory, swap, I/O, system, and CPU activity in one line of numbers:
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0   5276 315000 130744 380184    1    1     2    24   14    50  1  1 47  0
The first line divides the fields into six categories: process, memory, swap, I/O, system, and CPU-related statistics. The second line further identifies the contents of each field, making it easy to quickly scan data for specific statistics.
The process-related fields are:
r — The number of runnable processes waiting for access to the CPU
b — The number of processes in an uninterruptible sleep state
The memory-related fields are:
swpd — The amount of virtual memory used
free — The amount of free memory
buff — The amount of memory used for buffers
cache — The amount of memory used as page cache
The swap-related fields are:
si — The amount of memory swapped in from disk
so — The amount of memory swapped out to disk
The I/O-related fields are:
bi — Blocks sent to a block device
bo — Blocks received from a block device
The system-related fields are:
in — The number of interrupts per second
cs — The number of context switches per second
The CPU-related fields are:
us — The percentage of the time the CPU ran user-level code
sy — The percentage of the time the CPU ran system-level code
id — The percentage of the time the CPU was idle
wa — The percentage of the time the CPU was waiting for I/O to complete
When vmstat is run without any options, only one line is displayed. This line contains averages, calculated from the time the system was last booted.
However, most system administrators do not rely on the data in this line, as the time over which it was collected varies. Instead, most administrators take advantage of vmstat’s ability to repetitively display resource utilization data at set intervals. For example, the command vmstat 1 displays one new line of utilization data every second, while the command vmstat 1 10 displays one new line per second, but only for the next ten seconds.
In the hands of an experienced administrator, vmstat can be used to quickly determine resource utilization and performance issues. But to gain more insight into those issues, a different kind of tool is required — a tool capable of more in-depth data collection and analysis.
2.5.4. The Sysstat Suite of Resource Monitoring Tools
While the previous tools may be helpful for gaining more insight into system performance over very short time frames, they are of little use beyond providing a snapshot of system resource utilization. In addition, there are aspects of system performance that cannot be easily monitored using such simplistic tools.
Therefore, a more sophisticated tool is necessary. Sysstat is such a tool.
Sysstat contains the following tools related to collecting I/O and CPU statistics:
iostat
Displays an overview of CPU utilization, along with I/O statistics for one or more disk drives.
mpstat
Displays more in-depth CPU statistics.
Sysstat also contains tools that collect system resource utilization data and create daily reports based on that data. These tools are:
sadc
Known as the system activity data collector, sadc collects system resource utilization information and writes it to a file.
sar
sar produces reports from the files created by sadc; these reports can be generated interactively or written to a file for more intensive analysis.
The following sections explore each of these tools in more detail.
2.5.4.1. The iostat command
The iostat command at its most basic provides an overview of CPU and disk I/O statistics:
Linux 2.4.20-1.1931.2.231.2.10.ent (pigdog.example.com) 07/11/2003
avg-cpu:  %user   %nice    %sys   %idle
           6.11    2.56    2.15   89.18

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0    1.68        15.69        22.42   31175836   44543290
Below the first line (which contains the system’s kernel version and hostname, along with the current date), iostat displays an overview of the system’s average CPU utilization since the last reboot. The CPU utilization report includes the following percentages:
Percentage of time spent in user mode (running applications, etc.)
Percentage of time spent in user mode for processes that have altered their scheduling priority using nice(2)
Percentage of time spent in kernel mode
Percentage of time spent idle
Below the CPU utilization report is the device utilization report. This report contains one line for each active disk device on the system and includes the following information:
The device specification, displayed as dev<major-number>-<sequence-number>, where <major-number> is the device's major number [3], and <sequence-number> is a sequence number starting at zero.
The number of transfers (or I/O operations) per second.
The number of 512-byte blocks read per second.
The number of 512-byte blocks written per second.
The total number of 512-byte blocks read.
The total number of 512-byte blocks written.
This is just a sample of the information that can be obtained using iostat. For more information, refer to the iostat(1) man page.
2.5.4.2. The mpstat command
The mpstat command at first appears no different from the CPU utilization report produced by iostat:
Linux 2.4.20-1.1931.2.231.2.10.ent (pigdog.example.com) 07/11/2003
07:09:26 PM  CPU   %user   %nice %system   %idle    intr/s
07:09:26 PM  all    6.40    5.84    3.29   84.47    542.47
With the exception of an additional column showing the interrupts per second being handled by the CPU, there is no real difference. However, the situation changes if mpstat’s -P ALL option is used:
Linux 2.4.20-1.1931.2.231.2.10.ent (pigdog.example.com) 07/11/2003
07:13:03 PM  CPU   %user   %nice %system   %idle    intr/s
07:13:03 PM  all    6.40    5.84    3.29   84.47    542.47
07:13:03 PM    0    6.36    5.80    3.29   84.54    542.47
07:13:03 PM    1    6.43    5.87    3.29   84.40    542.47
On multiprocessor systems, mpstat allows the utilization for each CPU to be displayed individually, making it possible to determine how effectively each CPU is being used.
2.5.4.3. The sadc command
As stated earlier, the sadc command collects system utilization data and writes it to a file for later analysis. By default, the data is written to files in the /var/log/sa/ directory. The files are named sa<dd>, where <dd> is the current day's two-digit day of the month.
sadc is normally run by the sa1 script. This script is periodically invoked by cron via the file sysstat, which is located in /etc/cron.d/. The sa1 script invokes sadc for a single one-second measuring interval. By default, cron runs sa1 every 10 minutes, adding the data collected during each interval to the current /var/log/sa/sa<dd> file.
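As a sketch, the entries in /etc/cron.d/sysstat typically look something like the following (the exact paths can vary between releases):

# run sa1 (which invokes sadc) every ten minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# produce the daily report with sa2 at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A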
[3] Device major numbers can be found by using ls -l to display the desired device file in /dev/. The major number appears after the device's group specification.
2.5.4.4. The sar command
The sar command produces system utilization reports based on the data collected by sadc. As configured in Red Hat Enterprise Linux, sar is automatically run to process the files collected by sadc. The report files are written to /var/log/sa/ and are named sar<dd>, where <dd> is the previous day's two-digit day of the month.
sar is normally run by the sa2 script. This script is periodically invoked by cron via the file sysstat, which is located in /etc/cron.d/. By default, cron runs sa2 once a day at 23:53, allowing it to produce a report for the entire day's data.
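sar can also be run by hand, either against a saved data file or sampling the system live; for example (the file name assumes data collected on the eleventh of the month):

sar -u -f /var/log/sa/sa11    # CPU utilization report from a saved data file
sar -u 1 5                    # live CPU utilization, one sample per second, five samples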
2.5.4.4.1. Reading sar Reports
The format of a sar report produced by the default Red Hat Enterprise Linux configuration consists of multiple sections, with each section containing a specific type of data, ordered by the time of day that the data was collected. Since sadc is configured to perform a one-second measurement interval every ten minutes, the default sar reports contain data in ten-minute increments, from 00:00 to 23:50 [4].
Each section of the report starts with a heading describing the data contained in the section. The heading is repeated at regular intervals throughout the section, making it easier to interpret the data while paging through the report. Each section ends with a line containing the average of the data reported in that section.
Here is a sample section of a sar report, with the data from 00:30 through 23:40 removed to save space:
00:00:01          CPU     %user     %nice   %system     %idle
00:10:00          all      6.39      1.96      0.66     90.98
00:20:01          all      1.61      3.16      1.09     94.14
...
23:50:01          all     44.07      0.02      0.77     55.14
Average:          all      5.80      4.99      2.87     86.34
In this section, CPU utilization information is displayed. This is very similar to the data displayed by iostat.
Other sections may have more than one line's worth of data per time, as shown by this section generated from CPU utilization data collected on a dual-processor system:
00:00:01          CPU     %user     %nice   %system     %idle
00:10:00            0      4.19      1.75      0.70     93.37
00:10:00            1      8.59      2.18      0.63     88.60
00:20:01            0      1.87      3.21      1.14     93.78
00:20:01            1      1.35      3.12      1.04     94.49
...
23:50:01            0     42.84      0.03      0.80     56.33
23:50:01            1     45.29      0.01      0.74     53.95
Average:            0      6.00      5.01      2.74     86.25
Average:            1      5.61      4.97      2.99     86.43
There are a total of seventeen different sections present in reports generated by the default Red Hat Enterprise Linux sar configuration; some are explored in upcoming chapters. For more information about the data contained in each section, refer to the sar(1) man page.
[4] Due to changing system loads, the actual time at which the data was collected may vary by a second or two.
2.5.5. OProfile
The OProfile system-wide profiler is a low-overhead monitoring tool. OProfile makes use of the processor's performance monitoring hardware [5] to determine the nature of performance-related problems.
Performance monitoring hardware is part of the processor itself. It takes the form of a special counter, incremented each time a certain event (such as the processor not being idle or the requested data not being in cache) occurs. Some processors have more than one such counter and allow the selection of different event types for each counter.
The counters can be loaded with an initial value and produce an interrupt whenever the counter over­flows. By loading a counter with different initial values, it is possible to vary the rate at which in­terrupts are produced. In this way it is possible to control the sample rate and, therefore, the level of detail obtained from the data being collected.
At one extreme, setting the counter so that it generates an overflow interrupt with every event provides extremely detailed performance data (but with massive overhead). At the other extreme, setting the counter so that it generates as few interrupts as possible provides only the most general overview of system performance (with practically no overhead). The secret to effective monitoring is the selection of a sample rate sufficiently high to capture the required data, but not so high as to overload the system with performance monitoring overhead.
Warning
You can configure OProfile so that it produces sufficient overhead to render the system unusable. Therefore, you must exercise care when selecting counter values. For this reason, the opcontrol command supports the --list-events option, which displays the event types available for the currently-installed processor, along with suggested minimum counter values for each.
It is important to keep the tradeoff between sample rate and overhead in mind when using OProfile.
2.5.5.1. OProfile Components
OProfile consists of the following components:
Data collection software
Data analysis software
Administrative interface software
The data collection software consists of the oprofile.o kernel module, and the oprofiled daemon.
The data analysis software includes the following programs:
op_time
Displays the number and relative percentages of samples taken for each executable file
oprofpp
Displays the number and relative percentage of samples taken, broken down by function or by individual instruction, or in gprof-style output
[5] OProfile can also use a fallback mechanism (known as TIMER_INT) for those system architectures that lack performance monitoring hardware.
op_to_source
Displays annotated source code and/or assembly listings
op_visualise
Graphically displays collected data
These programs make it possible to display the collected data in a variety of ways.
The administrative interface software controls all aspects of data collection, from specifying which events are to be monitored to starting and stopping the collection itself. This is done using the opcontrol command.
2.5.5.2. A Sample OProfile Session
This section shows an OProfile monitoring and data analysis session from initial configuration to final data analysis. It is only an introductory overview; for more detailed information, consult the Red Hat Enterprise Linux System Administration Guide.
Use opcontrol to configure the type of data to be collected with the following command:
opcontrol \
    --vmlinux=/boot/vmlinux-`uname -r` \
    --ctr0-event=CPU_CLK_UNHALTED \
    --ctr0-count=6000
The options used here direct opcontrol to:
Direct OProfile to a copy of the currently running kernel (--vmlinux=/boot/vmlinux-`uname -r`)
Specify that the processor's counter 0 is to be used and that the event to be monitored is the time when the CPU is executing instructions (--ctr0-event=CPU_CLK_UNHALTED)
Specify that OProfile is to collect samples every 6000th time the specified event occurs (--ctr0-count=6000)
Next, check that the oprofile kernel module is loaded by using the lsmod command:
Module                  Size  Used by    Not tainted
oprofile               75616       1
...
Confirm that the OProfile file system (located in /dev/oprofile/) is mounted with the ls /dev/oprofile/ command:
0  buffer            buffer_watershed  cpu_type  enable       stats
1  buffer_size       cpu_buffer_size   dump      kernel_only
(The exact number of files varies according to processor type.)
At this point, the /root/.oprofile/daemonrc file contains the settings required by the data collection software:
CTR_EVENT[0]=CPU_CLK_UNHALTED
CTR_COUNT[0]=6000
CTR_KERNEL[0]=1
CTR_USER[0]=1
CTR_UM[0]=0
CTR_EVENT_VAL[0]=121
CTR_EVENT[1]=
CTR_COUNT[1]=
CTR_KERNEL[1]=1
CTR_USER[1]=1
CTR_UM[1]=0
CTR_EVENT_VAL[1]=
one_enabled=1
SEPARATE_LIB_SAMPLES=0
SEPARATE_KERNEL_SAMPLES=0
VMLINUX=/boot/vmlinux-2.4.21-1.1931.2.349.2.2.entsmp
Next, use opcontrol to actually start data collection with the opcontrol --start command:
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
Verify that the oprofiled daemon is running with the command ps x | grep -i oprofiled:
32019 ?        S      0:00 /usr/bin/oprofiled --separate-lib-samples=0 ...
32021 pts/0    S      0:00 grep -i oprofiled
(The actual oprofiled command line displayed by ps is much longer; however, it has been truncated here for formatting purposes.)
The system is now being monitored, with the data collected for all executables present on the system. The data is stored in the /var/lib/oprofile/samples/ directory. The files in this directory follow a somewhat unusual naming convention. Here is an example:
}usr}bin}less#0
The naming convention uses the absolute path of each file containing executable code, with the slash (/) characters replaced by right curly brackets (}), and ending with a pound sign (#) followed by a number (in this case, 0). Therefore, the file used in this example represents data collected while /usr/bin/less was running.
Once data has been collected, use one of the analysis tools to display it. One nice feature of OProfile is that it is not necessary to stop data collection before performing a data analysis. However, you must wait for at least one set of samples to be written to disk, or use the opcontrol --dump command to force the samples to disk.
In the following example, op_time is used to display (in reverse order — from highest number of samples to lowest) the samples that have been collected:
3321080  48.8021  0.0000  /boot/vmlinux-2.4.21-1.1931.2.349.2.2.entsmp
 761776  11.1940  0.0000  /usr/bin/oprofiled
 368933   5.4213  0.0000  /lib/tls/libc-2.3.2.so
 293570   4.3139  0.0000  /usr/lib/libgobject-2.0.so.0.200.2
 205231   3.0158  0.0000  /usr/lib/libgdk-x11-2.0.so.0.200.2
 167575   2.4625  0.0000  /usr/lib/libglib-2.0.so.0.200.2
 123095   1.8088  0.0000  /lib/libcrypto.so.0.9.7a
 105677   1.5529  0.0000  /usr/X11R6/bin/XFree86
...
Using less is a good idea when producing a report interactively, as the reports can be hundreds of lines long. The example given here has been truncated for that reason.
The format for this particular report is that one line is produced for each executable file for which samples were taken. Each line follows this format:
<sample-count> <sample-percent> <unused-field> <executable-name>
Where:
<sample-count> represents the number of samples collected
<sample-percent> represents the percentage of all samples collected for this specific executable
<unused-field> is a field that is not used
<executable-name> represents the name of the file containing executable code for which samples were collected.
This report (produced on a mostly-idle system) shows that nearly half of all samples were taken while the CPU was running code within the kernel itself. Next in line was the OProfile data collection daemon, followed by a variety of libraries and the X Window System server, XFree86. It is worth noting that for the system running this sample session, the counter value of 6000 used represents the minimum value recommended by opcontrol --list-events. This means that — at least for this particular system — OProfile overhead at its highest consumes roughly 11% of the CPU.
2.6. Additional Resources
This section includes various resources that can be used to learn more about resource monitoring and the Red Hat Enterprise Linux-specific subject matter discussed in this chapter.
2.6.1. Installed Documentation
The following resources are installed in the course of a typical Red Hat Enterprise Linux installation.
free(1) man page — Learn how to display free and used memory statistics.
top(1) man page — Learn how to display CPU utilization and process-level statistics.
watch(1) man page — Learn how to periodically execute a user-specified program, displaying fullscreen output.
GNOME System Monitor Help menu entry — Learn how to graphically display process, CPU, memory, and disk space utilization statistics.
vmstat(8) man page — Learn how to display a concise overview of process, memory, swap, I/O, system, and CPU utilization.
iostat(1) man page — Learn how to display CPU and I/O statistics.
mpstat(1) man page — Learn how to display individual CPU statistics on multiprocessor systems.
sadc(8) man page — Learn how to collect system utilization data.
sa1(8) man page — Learn about a script that runs sadc periodically.
sar(1) man page — Learn how to produce system resource utilization reports.
sa2(8) man page — Learn how to produce daily system resource utilization report files.
nice(1) man page — Learn how to change process scheduling priority.
oprofile(1) man page — Learn how to profile system performance.
op_visualise(1) man page — Learn how to graphically display OProfile data.
2.6.2. Useful Websites
http://people.redhat.com/alikins/system_tuning.html — System Tuning Info for Linux Servers. A stream-of-consciousness approach to performance tuning and resource monitoring for servers.
http://www.linuxjournal.com/article.php?sid=2396 — Performance Monitoring Tools for Linux. This Linux Journal page is geared more toward the administrator interested in writing a customized performance graphing solution. Written several years ago, some of the details may no longer apply, but the overall concept and execution are sound.
http://oprofile.sourceforge.net/ — OProfile project website. Includes valuable OProfile resources, including pointers to mailing lists and the #oprofile IRC channel.
2.6.3. Related Books
The following books discuss various issues related to resource monitoring and are good resources for Red Hat Enterprise Linux system administrators:
The Red Hat Enterprise Linux System Administration Guide; Red Hat, Inc. — Includes information on many of the resource monitoring tools described here, including OProfile.
Linux Performance Tuning and Capacity Planning by Jason R. Fink and Matthew D. Sherer; Sams — Provides more in-depth overviews of the resource monitoring tools presented here and includes others that might be appropriate for more specific resource monitoring needs.
Red Hat Linux Security and Optimization by Mohammed J. Kabir; Red Hat Press — Approximately the first 150 pages of this book discuss performance-related issues. This includes chapters dedicated to performance issues specific to network, Web, email, and file servers.
Linux Administration Handbook by Evi Nemeth, Garth Snyder, and Trent R. Hein; Prentice Hall — Provides a short chapter similar in scope to this book, but includes an interesting section on diagnosing a system that has suddenly slowed down.
Linux System Administration: A User's Guide by Marcel Gagne; Addison Wesley Professional — Contains a small chapter on performance monitoring and tuning.
Chapter 3.
Bandwidth and Processing Power
Of the two resources discussed in this chapter, one (bandwidth) is often hard for the new system administrator to understand, while the other (processing power) is usually a much easier concept to grasp.
Additionally, it may seem that these two resources are not that closely related — why group them together?
The reason for addressing both resources together is that they are based on the hardware that ties directly into a computer's ability to move and process data. As such, the two are closely interrelated.
3.1. Bandwidth
At its most basic, bandwidth is the capacity for data transfer — in other words, how much data can be moved from one point to another in a given amount of time. Having point-to-point data communication implies two things:
• A set of electrical conductors used to make low-level communication possible
• A protocol to facilitate the efficient and reliable communication of data
There are two types of system components that meet these requirements:
• Buses
• Datapaths
The following sections explore each in more detail.
3.1.1. Buses
As stated above, buses enable point-to-point communication and use some sort of protocol to ensure that all communication takes place in a controlled manner. However, buses have other distinguishing features:
• Standardized electrical characteristics (such as the number of conductors, voltage levels, signaling speeds, etc.)
• Standardized mechanical characteristics (such as the type of connector, card size, physical layout, etc.)
• Standardized protocol
The word "standardized" is important because buses are the primary way in which different system components are connected together.
In many cases, buses allow the interconnection of hardware made by multiple manufacturers; without standardization, this would not be possible. However, even in situations where a bus is proprietary to one manufacturer, standardization is important because it allows that manufacturer to more easily implement different components by using a common interface — the bus itself.
3.1.1.1. Examples of Buses
No matter where in a computer system you look, there are buses. Here are a few of the more common ones:
• Mass storage buses (ATA and SCSI)
• Networks¹ (Ethernet and Token Ring)
• Memory buses (PC133 and Rambus®)
• Expansion buses (PCI, ISA, USB)
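As a brief, hedged illustration: on a Linux system you can enumerate the devices attached to one of these buses directly. The sketch below assumes the pciutils package is installed (it is part of a typical installation):

# List every device attached to the PCI expansion bus; each line shows
# the device's bus:slot.function address followed by a description.
/sbin/lspci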
3.1.2. Datapaths
Datapaths can be harder to identify but, like buses, they are everywhere. Also like buses, datapaths enable point-to-point communication. However, unlike buses, datapaths:
• Use a simpler protocol (if any)
• Have little (if any) mechanical standardization
The reason for these differences is that datapaths are normally internal to some system component and are not used to facilitate the ad-hoc interconnection of different components. As such, datapaths are highly optimized for a particular situation, preferring speed and low cost over the slower, more expensive flexibility of a general-purpose design.
3.1.2.1. Examples of Datapaths
Here are some typical datapaths:
• CPU to on-chip cache datapath
• Graphics processor to video memory datapath
3.1.3. Potential Bandwidth-Related Problems
There are two ways in which bandwidth-related problems may occur (for either buses or datapaths):
1. The bus or datapath may represent a shared resource. In this situation, high levels of contention for the bus reduce the effective bandwidth available for all devices on the bus.
A SCSI bus with several highly-active disk drives would be a good example of this. The highly-active drives saturate the SCSI bus, leaving little bandwidth available for any other device on the same bus. The end result is that all I/O to any of the devices on this bus is slow, even for devices that are not themselves overly active.
2. The bus or datapath may be a dedicated resource with a fixed number of devices attached to it.
In this case, the electrical characteristics of the bus (and to some extent the nature of the protocol being used) limit the available bandwidth. This is usually more the case with datapaths than with buses. This is one reason why graphics adapters tend to perform more slowly when operating at higher resolutions and/or color depths — for every screen refresh, there is more data that must be passed along the datapath connecting video memory and the graphics processor.
1. Instead of an intra-system bus, networks can be thought of as an inter-system bus.
3.1.4. Potential Bandwidth-Related Solutions
Fortunately, bandwidth-related problems can be addressed. In fact, there are several approaches you can take:
• Spread the load
• Reduce the load
• Increase the capacity
The following sections explore each approach in more detail.
3.1.4.1. Spread the Load
The first approach is to more evenly distribute the bus activity. In other words, if one bus is overloaded and another is idle, perhaps the situation would be improved by moving some of the load to the idle bus.
As a system administrator, this is the first approach you should consider, as often there are additional buses already present in your system. For example, most PCs include at least two ATA channels (which is just another name for a bus). If you have two ATA disk drives and two ATA channels, why should both drives be on the same channel?
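As a hedged sketch of how you might check this on a Linux system: the IDE device names themselves encode the channel assignment — hda and hdb share the first channel, while hdc and hdd share the second. Assuming a 2.4-era IDE driver and that the boot messages are still in the kernel ring buffer, you can see which name each drive received:

# hda/hdb live on the first ATA channel, hdc/hdd on the second; two
# drives named hda and hdb are therefore sharing a single channel.
dmesg | grep '^hd[a-d]:'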
Even if your system configuration does not include additional buses, spreading the load might still be a reasonable approach. The hardware required to do so would likely cost less than replacing an existing bus with higher-capacity hardware.
3.1.4.2. Reduce the Load
At first glance, reducing the load and spreading the load appear to be two sides of the same coin. After all, when one spreads the load, it acts to reduce the load (at least on the overloaded bus), correct?
While this viewpoint is correct, it is not the same as reducing the load globally. The key here is to determine if there is some aspect of the system load that is causing this particular bus to be overloaded. For example, is a network heavily loaded due to activities that are unnecessary? Perhaps a small temporary file is the recipient of heavy read/write I/O. If that temporary file resides on a networked file server, a great deal of network traffic could be eliminated by working with the file locally.
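If the application honors the common TMPDIR convention, redirecting its scratch files to a local disk is a one-line change. This is only a sketch; the program name is hypothetical, and not every application respects TMPDIR:

# Keep heavy temporary-file I/O off the NFS-mounted directory by
# pointing it at local disk (report_job is a hypothetical program).
TMPDIR=/tmp
export TMPDIR
report_job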
3.1.4.3. Increase the Capacity
The obvious solution to insufficient bandwidth is to increase it somehow. However, this is usually an expensive proposition. Consider, for example, a SCSI controller and its overloaded bus. To increase its bandwidth, the SCSI controller (and likely all devices attached to it) would need to be replaced with faster hardware. If the SCSI controller is a separate card, this would be a relatively straightforward process, but if the SCSI controller is part of the system’s motherboard, it becomes much more difficult to justify the economics of such a change.
3.1.5. In Summary...
All system administrators should be aware of bandwidth, and how system configuration and usage impacts available bandwidth. Unfortunately, it is not always apparent what is a bandwidth-related problem and what is not. Sometimes, the problem is not the bus itself, but one of the components attached to the bus.
For example, consider a SCSI adapter that is connected to a PCI bus. If there are performance problems with SCSI disk I/O, it might be the result of a poorly-performing SCSI adapter, even though the SCSI and PCI buses themselves are nowhere near their bandwidth capabilities.
3.2. Processing Power
Often known as CPU power, CPU cycles, and various other names, processing power is the ability of a computer to manipulate data. Processing power varies with the architecture (and clock speed) of the CPU — usually CPUs with higher clock speeds and those supporting larger word sizes have more processing power than slower CPUs supporting smaller word sizes.
3.2.1. Facts About Processing Power
Here are the two main facts about processing power that you should keep in mind:
• Processing power is fixed
• Processing power cannot be stored
Processing power is fixed, in that the CPU can only go so fast. For example, if you need to add two numbers together (an operation that takes only one machine instruction on most architectures), a particular CPU can do it at one speed, and one speed only. With few exceptions, it is not even possible to slow the rate at which a CPU processes instructions, much less increase it.
Processing power is also fixed in another way: it is finite. That is, there are limits to the types of CPUs that can be plugged into any given computer. Some systems are capable of supporting a wide range of CPUs of differing speeds, while others may not be upgradeable at all².
Processing power cannot be stored for later use. In other words, if a CPU can process 100 million instructions in one second, one second of idle time equals 100 million instructions worth of processing that have been wasted.
If we take these facts and examine them from a slightly different perspective, a CPU "produces" a stream of executed instructions at a fixed rate. And if the CPU "produces" executed instructions, that means that something else must "consume" them. The next section defines these consumers.
3.2.2. Consumers of Processing Power
There are two main consumers of processing power:
• Applications
• The operating system itself
3.2.2.1. Applications
The most obvious consumers of processing power are the applications and programs you want the computer to run for you. From a spreadsheet to a database, applications are the reason you have a computer.
2. This situation leads to what is humorously termed a forklift upgrade, which means a complete replacement of a computer.
A single-CPU system can only do one thing at any given time. Therefore, if your application is running, everything else on the system is not. And the opposite is, of course, true — if something other than your application is running, then your application is doing nothing.
But how is it that many different applications can seemingly run at once under a modern operating system? The answer is that these are multitasking operating systems. In other words, they create the illusion that many different things are going on simultaneously when in fact that is not possible. The trick is to give each process a fraction of a second’s worth of time running on the CPU before giving the CPU to another process for the next fraction of a second. If these context switches happen frequently enough, the illusion of multiple applications running simultaneously is achieved.
Of course, applications do other things besides manipulate data using the CPU. They may wait for user input as well as perform I/O to devices such as disk drives and graphics displays. When these events take place, the application no longer needs the CPU. At these times, the CPU can be used for other processes running other applications without slowing the waiting application at all.
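You can see this effect with the time command. In the hedged example below, a command that spends much of its life waiting on disk I/O shows a "real" (elapsed) time larger than the CPU time it actually consumed ("user" plus "sys"); the difference is time during which the CPU was free for other work (the search string and directory are illustrative):

# Elapsed time vs. CPU time for an I/O-heavy command; the gap between
# "real" and "user" + "sys" is time spent waiting, not computing.
time grep -r somestring /usr/share/doc > /dev/null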
In addition, the CPU can be used by another consumer of processing power: the operating system itself.
3.2.2.2. The Operating System
It is difficult to determine how much processing power is consumed by the operating system. The reason for this is that operating systems use a mixture of process-level and system-level code to perform their work. While, for example, it is easy to use a process monitor to determine what the process running a daemon or service is doing, it is not so easy to determine how much processing power is being consumed by system-level I/O-related processing (which is normally done within the context of the process requesting the I/O).
In general, it is possible to divide this kind of operating system overhead into two types:
• Operating system housekeeping
• Process-related activities
Operating system housekeeping includes activities such as process scheduling and memory management, while process-related activities include any processes that support the operating system itself, such as processes handling system-wide event logging or I/O cache flushing.
3.2.3. Improving a CPU Shortage
When there is insufficient processing power available for the work needing to be done, you have two options:
• Reducing the load
• Increasing the capacity
3.2.3.1. Reducing the Load
Reducing the CPU load is something that can be done with no expenditure of money. The trick is to identify those aspects of the system load under your control that can be cut back. There are three areas to focus on:
• Reducing operating system overhead
• Reducing application overhead
• Eliminating applications entirely
3.2.3.1.1. Reducing Operating System Overhead
To reduce operating system overhead, you must examine your current system load and determine what aspects of it result in inordinate amounts of overhead. These areas could include:
• Reducing the need for frequent process scheduling
• Reducing the amount of I/O performed
Do not expect miracles; in a reasonably well-configured system, you are unlikely to notice much of a performance increase by trying to reduce operating system overhead. This is because a reasonably well-configured system, by definition, already incurs a minimal amount of overhead. However, if your system is running with too little RAM, for instance, you may be able to reduce overhead by alleviating the RAM shortage.
3.2.3.1.2. Reducing Application Overhead
Reducing application overhead means making sure the application has everything it needs to run well. Some applications exhibit wildly different behaviors under different environments — an application may become highly compute-bound while processing certain types of data, but not others, for example.
The point to keep in mind here is that you must understand the applications running on your system if you are to enable them to run as efficiently as possible. Often this entails working with your users, and/or your organization’s developers, to help uncover ways in which the applications can be made to run more efficiently.
3.2.3.1.3. Eliminating Applications Entirely
Depending on your organization, this approach might not be available to you, as it often is not a system administrator’s responsibility to dictate which applications will and will not be run. However, if you can identify any applications that are known "CPU hogs", you might be able to influence the powers-that-be to retire them.
Doing this will likely involve more than just yourself. The affected users should certainly be a part of this process; in many cases they may have the knowledge and the political power to make the necessary changes to the application lineup.
Tip
Keep in mind that an application may not need to be eliminated from every system in your organization. You might be able to move a particularly CPU-hungry application from an overloaded system to another system that is nearly idle.
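If retiring or moving such an application is not possible, a smaller, hedged measure is to lower its scheduling priority with nice, so it only consumes CPU cycles no other process wants. The program name below is hypothetical:

# Start a known CPU hog at the lowest priority (nice level 19) so
# that interactive work on the system is not starved.
nice -n 19 ./report_generator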
3.2.3.2. Increasing the Capacity
Of course, if it is not possible to reduce the demand for processing power, you must find ways of increasing the processing power that is available. To do so costs money, but it can be done.
3.2.3.2.1. Upgrading the CPU
The most straightforward approach is to determine if your system’s CPU can be upgraded. The first step is to determine if the current CPU can be removed. Some systems (primarily laptops) have CPUs that are soldered in place, making an upgrade impossible. The rest, however, have socketed CPUs, making upgrades possible — at least in theory.
Next, you must do some research to determine if a faster CPU exists for your system configuration. For example, if you currently have a 1GHz CPU, and a 2GHz unit of the same type exists, an upgrade might be possible.
Finally, you must determine the maximum clock speed supported by your system. To continue the example above, even if a 2GHz CPU of the proper type exists, a simple CPU swap is not an option if your system only supports processors running at 1GHz or below.
Should you find that you cannot install a faster CPU in your system, your options may be limited to changing motherboards or even the more expensive forklift upgrade mentioned earlier.
However, some system configurations make a slightly different approach possible. Instead of replacing the current CPU, why not just add another one?
3.2.3.2.2. Is Symmetric Multiprocessing Right for You?
Symmetric multiprocessing (also known as SMP) makes it possible for a computer system to have more than one CPU sharing all system resources. This means that, unlike a uniprocessor system, an SMP system may actually have more than one process running at the same time.
At first glance, this seems like any system administrator’s dream. First and foremost, SMP makes it possible to increase a system’s CPU power even if CPUs with faster clock speeds are not available — just by adding another CPU. However, this flexibility comes with some caveats.
The first caveat is that not all systems are capable of SMP operation. Your system must have a motherboard designed to support multiple processors. If it does not, a motherboard upgrade (at the least) would be required.
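A quick way to see what you already have: on a running Linux system, counting the processor entries the kernel reports tells you how many CPUs are currently in use.

# Count the processors the kernel has detected; a value greater than
# one means the system is already running an SMP kernel on SMP hardware.
grep -c '^processor' /proc/cpuinfo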
The second caveat is that SMP increases system overhead. This makes sense if you stop to think about it; with more CPUs to schedule work for, the operating system requires more CPU cycles for overhead. Another aspect to this is that with multiple CPUs, there can be more contention for system resources. Because of these factors, upgrading a dual-processor system to a quad-processor unit does not result in a 100% increase in available CPU power. In fact, depending on the actual hardware, the workload, and the processor architecture, it is possible to reach a point where the addition of another processor could actually reduce system performance.
Another point to keep in mind is that SMP does not help workloads consisting of one monolithic application with a single stream of execution. In other words, if a large compute-bound simulation program runs as one process and without threads, it will not run any faster on an SMP system than on a single-processor machine. In fact, it may even run somewhat slower, due to the increased overhead SMP brings. For these reasons, many system administrators feel that when it comes to CPU power, single stream processing power is the way to go. It provides the most CPU power with the fewest restrictions on its use.
While this discussion seems to indicate that SMP is never a good idea, there are circumstances in which it makes sense. For example, environments running multiple highly compute-bound applications are good candidates for SMP. The reason for this is that applications that do nothing but compute for long periods of time keep contention between active processes (and therefore, the operating system overhead) to a minimum, while the processes themselves keep every CPU busy.
One other thing to keep in mind about SMP is that the performance of an SMP system tends to degrade more gracefully as the system load increases. This does make SMP systems popular in server and multi-user environments, as the ever-changing process mix can impact the system-wide load less on a multi-processor machine.
3.3. Red Hat Enterprise Linux-Specific Information
Monitoring bandwidth and CPU utilization under Red Hat Enterprise Linux entails using the tools discussed in Chapter 2 Resource Monitoring; therefore, if you have not yet read that chapter, you should do so before continuing.
3.3.1. Monitoring Bandwidth on Red Hat Enterprise Linux
As stated in Section 2.4.2 Monitoring Bandwidth, it is difficult to directly monitor bandwidth utilization. However, by examining device-level statistics, it is possible to roughly gauge whether insufficient bandwidth is an issue on your system.
By using vmstat, it is possible to determine if overall device activity is excessive by examining the bi and bo fields; in addition, taking note of the si and so fields gives you a bit more insight into how much disk activity is due to swap-related I/O:
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us  sy  id
 1  0  0      0 248088 158636 480804    0    0     2     6  120   120  10   3  87
In this example, the bi field shows two blocks/second read from block devices (primarily disk drives), while the bo field shows six blocks/second written to block devices. We can determine that none of this activity was due to swapping, as the si and so fields both show a swap-related I/O rate of zero kilobytes/second.
By using iostat, it is possible to gain a bit more insight into disk-related activity:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
avg-cpu: %user %nice %sys %idle
5.34 4.60 2.83 87.24
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev8-0            1.10         6.21        25.08     961342    3881610
dev8-1            0.00         0.00         0.00         16          0
This output shows us that the device with major number 8 (which is /dev/sda, the first SCSI disk) averaged slightly more than one I/O operation per second (the tps field). Most of the I/O activity for this device was writes (the Blk_wrtn field), with slightly more than 25 blocks written each second (the Blk_wrtn/s field).
If more detail is required, use iostat’s -x option:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
avg-cpu: %user %nice %sys %idle
5.37 4.54 2.81 87.27
Device:     rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s   rkB/s   wkB/s  avgrq-sz
/dev/sda     13.57    2.86  0.36  0.77   32.20   29.05   16.10   14.53     54.52
/dev/sda1     0.17    0.00  0.00  0.00    0.34    0.00    0.17    0.00    133.40
/dev/sda2     0.00    0.00  0.00  0.00    0.00    0.00    0.00    0.00     11.56
/dev/sda3     0.31    2.11  0.29  0.62    4.74   21.80    2.37   10.90     29.42
/dev/sda4     0.09    0.75  0.04  0.15    1.06    7.24    0.53    3.62     43.01
Over and above the longer lines containing more fields, the first thing to keep in mind is that this iostat output is now displaying statistics on a per-partition level. By using df to associate mount points with device names, it is possible to use this report to determine if, for example, the partition containing /home/ is experiencing an excessive workload.
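As a brief sketch of that technique:

# Map a mount point back to its device name; the device and partition
# reported here can then be matched against the iostat output above.
df /home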
Actually, each line output from iostat -x is longer and contains more information than this; here is the remainder of each line (with the device column added for easier reading):
Device:    avgqu-sz   await  svctm  %util
/dev/sda       0.24   20.86   3.80   0.43
/dev/sda1      0.00  141.18 122.73   0.03
/dev/sda2      0.00    6.00   6.00   0.00
/dev/sda3      0.12   12.84   2.68   0.24
/dev/sda4      0.11   57.47   8.94   0.17
In this example, it is interesting to note that /dev/sda2 is the system swap partition; it is obvious from the many fields reading 0.00 for this partition that swapping is not a problem on this system.
Another interesting point to note is /dev/sda1. The statistics for this partition are unusual; the overall activity seems low, but why are the average I/O request size (the avgrq-sz field), average wait time (the await field), and the average service time (the svctm field) so much larger than the other partitions? The answer is that this partition contains the /boot/ directory, which is where the kernel and initial ramdisk are stored. When the system boots, the read I/Os (notice that only the rsec/s and rkB/s fields are non-zero; no writing is done here on a regular basis) used during the boot process are for large numbers of blocks, resulting in the relatively long wait and service times iostat displays.
It is possible to use sar for a longer-term overview of I/O statistics; for example, sar -b displays a general I/O report:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
12:00:00 AM       tps      rtps      wtps   bread/s   bwrtn/s
12:10:00 AM      0.51      0.01      0.50      0.25     14.32
12:20:01 AM      0.48      0.00      0.48      0.00     13.32
...
06:00:02 PM      1.24      0.00      1.24      0.01     36.23
Average:         1.11      0.31      0.80     68.14     34.79
Here, like iostat’s initial display, the statistics are grouped for all block devices.
Another I/O-related report is produced using sar -d:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
12:00:00 AM       DEV       tps    sect/s
12:10:00 AM    dev8-0      0.51     14.57
12:10:00 AM    dev8-1      0.00      0.00
12:20:01 AM    dev8-0      0.48     13.32
12:20:01 AM    dev8-1      0.00      0.00
...
06:00:02 PM    dev8-0      1.24     36.25
06:00:02 PM    dev8-1      0.00      0.00
Average:       dev8-0      1.11    102.93
Average:       dev8-1      0.00      0.00
This report provides per-device information, but with little detail.
While there are no explicit statistics showing bandwidth utilization for a given bus or datapath, we can at least determine what the devices are doing and use their activity to indirectly determine the bus loading.
3.3.2. Monitoring CPU Utilization on Red Hat Enterprise Linux
Unlike bandwidth, monitoring CPU utilization is much more straightforward. From a single percentage of CPU utilization in GNOME System Monitor, to the more in-depth statistics reported by sar, it is possible to accurately determine how much CPU power is being consumed and by what.
Moving beyond GNOME System Monitor, top is the first resource monitoring tool discussed in Chapter 2 Resource Monitoring to provide a more in-depth representation of CPU utilization. Here is a top report from a dual-processor workstation:
 9:44pm  up 2 days, 2 min,  1 user,  load average: 0.14, 0.12, 0.09
90 processes: 82 sleeping, 1 running, 7 zombie, 0 stopped
CPU0 states:  0.4% user,  1.1% system,  0.0% nice, 97.4% idle
CPU1 states:  0.5% user,  1.3% system,  0.0% nice, 97.1% idle
Mem:  1288720K av, 1056260K used,  232460K free,       0K shrd,  145644K buff
Swap:  522104K av,       0K used,  522104K free                  469764K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
30997 ed        16   0  1100 1100   840 R     1.7  0.0   0:00 top
 1120 root       5 -10  249M 174M 71508 S <   0.9 13.8 254:59 X
 1260 ed        15   0 54408  53M  6864 S     0.7  4.2  12:09 gnome-terminal
  888 root      15   0  2428 2428  1796 S     0.1  0.1   0:06 sendmail
 1264 ed        15   0 16336  15M  9480 S     0.1  1.2   1:58 rhn-applet-gui
    1 root      15   0   476  476   424 S     0.0  0.0   0:05 init
    2 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU0
    3 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:01 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00 ksoftirqd_CPU0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00 ksoftirqd_CPU1
    7 root      15   0     0    0     0 SW    0.0  0.0   0:05 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:00 bdflush
    9 root      15   0     0    0     0 SW    0.0  0.0   0:01 kupdated
   10 root      25   0     0    0     0 SW    0.0  0.0   0:00 mdrecoveryd
The first CPU-related information is present on the very first line: the load average. The load average is a number corresponding to the average number of runnable processes on the system. The load average is often listed as three sets of numbers (as top does), representing the load average for the past 1, 5, and 15 minutes; the low values here indicate that the system in this example was not very busy.
The next line, although not strictly related to CPU utilization, has an indirect relationship, in that it shows the number of runnable processes (here, only one; remember this number, as it means something special in this example). The number of runnable processes is a good indicator of how CPU-bound a system might be.
Next are two lines displaying the current utilization for each of the two CPUs in the system. The utilization statistics show whether the CPU cycles were expended for user-level or system-level processing; also included is a statistic showing how much CPU time was expended by processes with altered scheduling priorities. Finally, there is an idle time statistic.
Moving down into the process-related section of the display, we find that the process using the most CPU power is top itself; in other words, the one runnable process on this otherwise-idle system was top taking a "picture" of itself.
Tip
It is important to remember that the very act of running a system monitor affects the resource utilization statistics you receive. All software-based monitors do this to some extent.
To gain more detailed knowledge regarding CPU utilization, we must change tools. If we examine output from vmstat, we obtain a slightly different understanding of our example system:
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us  sy  id
 1  0  0      0 233276 146636 469808    0    0     7     7   14    27  10   3  87
 0  0  0      0 233276 146636 469808    0    0     0     0  523   138   3   0  96
 0  0  0      0 233276 146636 469808    0    0     0     0  557   385   2   1  97
 0  0  0      0 233276 146636 469808    0    0     0     0  544   343   2   0  97
 0  0  0      0 233276 146636 469808    0    0     0     0  517    89   2   0  98
 0  0  0      0 233276 146636 469808    0    0     0    32  518   102   2   0  98
 0  0  0      0 233276 146636 469808    0    0     0     0  516    91   2   1  98
 0  0  0      0 233276 146636 469808    0    0     0     0  516    72   2   0  98
 0  0  0      0 233276 146636 469808    0    0     0     0  516    88   2   0  97
 0  0  0      0 233276 146636 469808    0    0     0     0  516    81   2   0  97
Here we have used the command vmstat 1 10 to sample the system once a second, ten times. At first, the CPU-related statistics (the us, sy, and id fields) seem similar to what top displayed, and maybe even appear a bit less detailed. However, unlike top, we can also gain a bit of insight into how the CPU is being used.
If we examine the system fields, we notice that the CPU is handling about 500 interrupts per second on average and is switching between processes anywhere from 80 to nearly 400 times a second. If you think this seems like a lot of activity, think again, because the user-level processing (the us field) is only averaging 2%, while system-level processing (the sy field) is usually under 1%. Again, this is an idle system.
Reviewing the tools Sysstat offers, we find that iostat and mpstat provide little additional information over what we have already experienced with top and vmstat. However, sar produces a number of reports that can come in handy when monitoring CPU utilization.
The first report is obtained by the command sar -q, which displays the run queue length, total number of processes, and the load averages for the past one and five minutes. Here is a sample:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5
12:10:00 AM         3       122      0.07      0.28
12:20:01 AM         5       123      0.00      0.03
...
09:50:00 AM         5       124      0.67      0.65
Average:            4       123      0.26      0.26
In this example, the system is always busy (given that more than one process is runnable at any given time), but is not overly loaded (because this particular system has more than one processor).
The next CPU-related sar report is produced by the command sar -u:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM       CPU     %user     %nice   %system     %idle
12:10:00 AM       all      3.69     20.10      1.06     75.15
12:20:01 AM       all      1.73      0.22      0.80     97.25
...
10:00:00 AM       all     35.17      0.83      1.06     62.93
Average:          all      7.47      4.85      3.87     83.81
The statistics contained in this report are no different from those produced by many of the other tools. The biggest benefit here is that sar makes the data available on an ongoing basis and is therefore more useful for obtaining long-term averages, or for the production of CPU utilization graphs.
On multiprocessor systems, the sar -U command can produce statistics for an individual processor or for all processors. Here is an example of output from sar -U ALL:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM       CPU     %user     %nice   %system     %idle
12:10:00 AM         0      3.46     21.47      1.09     73.98
12:10:00 AM         1      3.91     18.73      1.03     76.33
12:20:01 AM         0      1.63      0.25      0.78     97.34
12:20:01 AM         1      1.82      0.20      0.81     97.17
...
10:00:00 AM         0     39.12      0.75      1.04     59.09
10:00:00 AM         1     31.22      0.92      1.09     66.77
Average:            0      7.61      4.91      3.86     83.61
Average:            1      7.33      4.78      3.88     84.02
The sar -w command reports on the number of context switches per second, making it possible to gain additional insight into where CPU cycles are being spent:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM   cswch/s
12:10:00 AM    537.97
12:20:01 AM    339.43
...
10:10:00 AM    319.42
Average:      1158.25
It is also possible to produce two different sar reports on interrupt activity. The first (produced using the sar -I SUM command) displays a single "interrupts per second" statistic:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM      INTR    intr/s
12:10:00 AM       sum    539.15
12:20:01 AM       sum    539.49
...
10:40:01 AM       sum    539.10
Average:          sum    541.00
By using the command sar -I PROC, it is possible to break down interrupt activity by processor (on multiprocessor systems) and by interrupt level (from 0 to 15):
Linux 2.4.21-1.1931.2.349.2.2.entsmp (pigdog.example.com) 07/21/2003
12:00:00 AM  CPU  i000/s  i001/s  i002/s  i008/s  i009/s  i011/s  i012/s
12:10:01 AM    0  512.01    0.00    0.00    0.00    3.44    0.00    0.00

12:10:01 AM  CPU  i000/s  i001/s  i002/s  i008/s  i009/s  i011/s  i012/s
12:20:01 AM    0  512.00    0.00    0.00    0.00    3.73    0.00    0.00
...
10:30:01 AM  CPU  i000/s  i001/s  i002/s  i003/s  i008/s  i009/s  i010/s
10:40:02 AM    0  512.00    1.67    0.00    0.00    0.00   15.08    0.00
Average:       0  512.00    0.42    0.00     N/A    0.00    6.03     N/A
This report (which has been truncated horizontally to fit on the page) includes one column for each interrupt level (for example, the i002/s field illustrating the rate for interrupt level 2). If this were a multiprocessor system, there would be one line per sample period for each CPU.
Another important point to note about this report is that sar adds or removes specific interrupt fields if no data is collected for that field. The example report above shows this: the end of the report includes interrupt levels (3 and 10) that were not present at the start of the sampling period.
Note
There are two other interrupt-related sar reports — sar -I ALL and sar -I XALL. However, the default configuration for the sadc data collection utility does not collect the information necessary for these reports. This can be changed by editing the file /etc/cron.d/sysstat, and changing this line:
*/10 * * * * root /usr/lib/sa/sa1 1 1
to this:
*/10 * * * * root /usr/lib/sa/sa1 -I 1 1
Keep in mind this change does cause additional information to be collected by sadc, and results in larger data file sizes. Therefore, make sure your system configuration can support the additional space consumption.
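A hedged way to keep an eye on that consumption, assuming the default sysstat data directory:

# The sadc data files live here by default (one file per day of the
# month); compare their sizes before and after enabling -I collection.
ls -lh /var/log/sa/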
3.4. Additional Resources
This section includes various resources that can be used to learn more about the Red Hat Enterprise Linux-specific subject matter discussed in this chapter.
3.4.1. Installed Documentation
The following resources are installed in the course of a typical Red Hat Enterprise Linux installation and can help you learn more about the subject matter discussed in this chapter.
• vmstat(8) man page — Learn how to display a concise overview of process, memory, swap, I/O, system, and CPU utilization.
• iostat(1) man page — Learn how to display CPU and I/O statistics.
• sar(1) man page — Learn how to produce system resource utilization reports.
• sadc(8) man page — Learn how to collect system utilization data.
• sa1(8) man page — Learn about a script that runs sadc periodically.
• top(1) man page — Learn how to display CPU utilization and process-level statistics.
3.4.2. Useful Websites
• http://people.redhat.com/alikins/system_tuning.html — System Tuning Info for Linux Servers. A stream-of-consciousness approach to performance tuning and resource monitoring for servers.
• http://www.linuxjournal.com/article.php?sid=2396 — Performance Monitoring Tools for Linux. This Linux Journal page is geared more toward the administrator interested in writing a customized performance graphing solution. Written several years ago, some of the details may no longer apply, but the overall concept and execution are sound.
3.4.3. Related Books
The following books discuss various issues related to resource monitoring, and are good resources for Red Hat Enterprise Linux system administrators:
• The Red Hat Enterprise Linux System Administration Guide; Red Hat, Inc. — Includes a chapter on many of the resource monitoring tools described here.
• Linux Performance Tuning and Capacity Planning by Jason R. Fink and Matthew D. Sherer; Sams — Provides more in-depth overviews of the resource monitoring tools presented here, and includes others that might be appropriate for more specific resource monitoring needs.
• Linux Administration Handbook by Evi Nemeth, Garth Snyder, and Trent R. Hein; Prentice Hall — Provides a short chapter similar in scope to this book, but includes an interesting section on diagnosing a system that has suddenly slowed down.
• Linux System Administration: A User’s Guide by Marcel Gagne; Addison Wesley Professional — Contains a small chapter on performance monitoring and tuning.
Chapter 4.
Physical and Virtual Memory
All present-day, general-purpose computers are of the type known as stored program computers. As the name implies, stored program computers load instructions (the building blocks of programs) into some type of internal storage, where they subsequently execute those instructions.
Stored program computers also use the same storage for data. This is in contrast to computers that use their hardware configuration to control their operation (such as older plugboard-based computers).
The place where programs were stored on the first stored program computers went by a variety of names and used a variety of different technologies, from spots on a cathode ray tube, to pressure pulses in columns of mercury. Fortunately, present-day computers use technologies with greater storage capacity and much smaller size than ever before.
4.1. Storage Access Patterns
One thing to keep in mind throughout this chapter is that computers tend to access storage in certain ways. In fact, most storage access tends to exhibit one (or both) of the following attributes:
• Access tends to be sequential
• Access tends to be localized
Sequential access means that, if address N is accessed by the CPU, it is highly likely that address N +1 will be accessed next. This makes sense, as most programs consist of large sections of instructions that execute — in order — one after the other.
Localized access means that, if address X is accessed, it is likely that other addresses surrounding X will also be accessed in the future.
These attributes are crucial, because they allow smaller, faster storage to effectively buffer larger, slower storage. This is the basis for implementing virtual memory. But before we can discuss virtual memory, we must examine the various storage technologies currently in use.
4.2. The Storage Spectrum
Present-day computers actually use a variety of storage technologies. Each technology is geared toward a specific function, with speeds and capacities to match.
These technologies are:
• CPU registers
• Cache memory
• RAM
• Hard drives
• Off-line backup storage (tape, optical disk, etc.)
In terms of capabilities and cost, these technologies form a spectrum. For example, CPU registers are:
• Very fast (access times of a few nanoseconds)
• Low capacity (usually less than 200 bytes)
• Very limited expansion capabilities (a change in CPU architecture would be required)
• Expensive (more than one dollar/byte)
However, at the other end of the spectrum, off-line backup storage is:
• Very slow (access times may be measured in days, if the backup media must be shipped long distances)
• Very high capacity (tens to hundreds of gigabytes)
• Essentially unlimited expansion capabilities (limited only by the floorspace needed to house the backup media)
• Very inexpensive (fractional cents/byte)
By using different technologies with different capabilities, it is possible to fine-tune system design for maximum performance at the lowest possible cost. The following sections explore each technology in the storage spectrum.
4.2.1. CPU Registers
Every present-day CPU design includes registers for a variety of purposes, from storing the address of the currently-executed instruction to more general-purpose data storage and manipulation. CPU registers run at the same speed as the rest of the CPU; otherwise, they would be a serious bottleneck to overall system performance. The reason for this is that nearly all operations performed by the CPU involve the registers in one way or another.
The number of CPU registers (and their uses) is strictly dependent on the architectural design of the CPU itself. There is no way to change the number of CPU registers, short of migrating to a CPU with a different architecture. For these reasons, the number of CPU registers can be considered a constant, as they are changeable only with great pain and expense.
4.2.2. Cache Memory
The purpose of cache memory is to act as a buffer between the very limited, very high-speed CPU registers and the relatively slower and much larger main system memory — usually referred to as RAM¹. Cache memory has an operating speed similar to the CPU itself so, when the CPU accesses data in cache, the CPU is not kept waiting for the data.
Cache memory is configured such that, whenever data is to be read from RAM, the system hardware first checks to determine if the desired data is in cache. If the data is in cache, it is quickly retrieved, and used by the CPU. However, if the data is not in cache, the data is read from RAM and, while being transferred to the CPU, is also placed in cache (in case it is needed again later). From the perspective of the CPU, all this is done transparently, so that the only difference between accessing data in cache and accessing data in RAM is the amount of time it takes for the data to be returned.
In terms of storage capacity, cache is much smaller than RAM. Therefore, not every byte in RAM can have its own unique location in cache. As such, it is necessary to split cache up into sections that can be used to cache different areas of RAM, and to have a mechanism that allows each area of cache to cache different areas of RAM at different times. Even with the difference in size between cache and RAM, given the sequential and localized nature of storage access, a small amount of cache can effectively speed access to a large amount of RAM.
1. While "RAM" is an acronym for "Random Access Memory," and a term that could easily apply to any storage technology allowing the non-sequential access of stored data, when system administrators talk about RAM they invariably mean main system memory.
When writing data from the CPU, things get a bit more complicated. There are two different approaches that can be used. In both cases, the data is first written to cache. However, since the purpose of cache is to function as a very fast copy of the contents of selected portions of RAM, any time a piece of data changes its value, that new value must be written to both cache memory and RAM. Otherwise, the data in cache and the data in RAM would no longer match.
The two approaches differ in how this is done. One approach, known as write-through caching, immediately writes the modified data to RAM. Write-back caching, however, delays the writing of modified data back to RAM. The reason for doing this is to reduce the number of times a frequently-modified piece of data must be written back to RAM.
Write-through cache is a bit simpler to implement; for this reason it is most common. Write-back cache is a bit trickier to implement; in addition to storing the actual data, it is necessary to maintain some sort of mechanism capable of flagging the cached data as clean (the data in cache is the same as the data in RAM), or dirty (the data in cache has been modified, meaning that the data in RAM is no longer current). It is also necessary to implement a way of periodically flushing dirty cache entries back to RAM.
4.2.2.1. Cache Levels
Cache subsystems in present-day computer designs may be multi-level; that is, there might be more than one set of cache between the CPU and main memory. The cache levels are often numbered, with lower numbers being closer to the CPU. Many systems have two cache levels:
• L1 cache is often located directly on the CPU chip itself and runs at the same speed as the CPU
• L2 cache is often part of the CPU module, runs at CPU speeds (or nearly so), and is usually a bit larger and slower than L1 cache
Some systems (normally high-performance servers) also have L3 cache, which is usually part of the system motherboard. As might be expected, L3 cache would be larger (and most likely slower) than L2 cache.
In either case, the goal of all cache subsystems — whether single- or multi-level — is to reduce the average access time to the RAM.
4.2.3. Main Memory — RAM
RAM makes up the bulk of electronic storage on present-day computers. It is used as storage for both data and programs while those data and programs are in use. The speed of RAM in most systems today lies between the speed of cache memory and that of hard drives, and is much closer to the former than the latter.
The basic operation of RAM is actually quite straightforward. At the lowest level, there are the RAM chips — integrated circuits that do the actual "remembering." These chips have four types of connections to the outside world:
• Power connections (to operate the circuitry within the chip)
• Data connections (to enable the transfer of data into or out of the chip)
• Read/Write connections (to control whether data is to be stored into or retrieved from the chip)
• Address connections (to determine where in the chip the data should be read/written)
Here are the steps required to store data in RAM:
1. The data to be stored is presented to the data connections.
2. The address at which the data is to be stored is presented to the address connections.
3. The read/write connection is set to write mode.
Retrieving data is just as straightforward:
1. The address of the desired data is presented to the address connections.
2. The read/write connection is set to read mode.
3. The desired data is read from the data connections.
While these steps seem simple, they take place at very high speeds, with the time spent on each step measured in nanoseconds.
Nearly all RAM chips created today are sold as modules. Each module consists of a number of individual RAM chips attached to a small circuit board. The mechanical and electrical layout of the module adheres to various industry standards, making it possible to purchase memory from a variety of vendors.
Note
The main benefit to a system using industry-standard RAM modules is that it tends to keep the cost of RAM low, due to the ability to purchase the modules from more than just the system manufacturer.
Although most computers use industry-standard RAM modules, there are exceptions. Most notable are laptops (and even here some standardization is starting to take hold) and high-end servers. However, even in these instances, it is likely that third-party RAM modules are available, assuming the system is relatively popular and is not a completely new design.
4.2.4. Hard Drives
All the technologies discussed so far are volatile in nature. In other words, data contained in volatile storage is lost when the power is turned off.
Hard drives, on the other hand, are non-volatile — the data they contain remains there, even after the power is removed. Because of this, hard drives occupy a special place in the storage spectrum. Their non-volatile nature makes them ideal for storing programs and data for longer-term use. Another unique aspect to hard drives is that, unlike RAM and cache memory, it is not possible to execute programs directly when they are stored on hard drives; instead, they must first be read into RAM.
Also different from cache and RAM is the speed of data storage and retrieval; hard drives are at least an order of magnitude slower than the all-electronic technologies used for cache and RAM. The difference in speed is due mainly to their electromechanical nature. There are four distinct phases taking place during each data transfer to or from a hard drive. The following list illustrates these phases, along with the time it would take a typical high-performance drive, on average, to complete each:
• Access arm movement (5.5 milliseconds)
• Disk rotation (.1 milliseconds)
• Heads reading/writing data (.00014 milliseconds)
• Data transfer to/from the drive’s electronics (.003 milliseconds)
Of these, only the last phase is not dependent on any mechanical operation.
Note
Although there is much more to learn about hard drives, disk storage technologies are discussed in more depth in Chapter 5 Managing Storage. For the time being, it is only necessary to keep in mind the huge speed difference between RAM and disk-based technologies and that their storage capacity usually exceeds that of RAM by a factor of at least 10, and often by 100 or more.
4.2.5. Off-Line Backup Storage
Off-line backup storage takes a step beyond hard drive storage in terms of capacity (higher) and speed (slower). Here, capacities are effectively limited only by your ability to procure and store the removable media.
The actual technologies used in these devices vary widely. Here are the more popular types:
• Magnetic tape
• Optical disk
Of course, having removable media means that access times become even longer, particularly when the desired data is on media not currently loaded in the storage device. This situation is alleviated somewhat by the use of robotic devices capable of automatically loading and unloading media, but the media storage capacities of such devices are still finite. Even in the best of cases, access times are measured in seconds, which is much longer than the relatively slow multi-millisecond access times typical for a high-performance hard drive.
Now that we have briefly studied the various storage technologies in use today, let us explore basic virtual memory concepts.
4.3. Basic Virtual Memory Concepts
While the technology behind the construction of the various modern-day storage technologies is truly impressive, the average system administrator does not need to be aware of the details. In fact, there is really only one fact that system administrators should always keep in mind:
There is never enough RAM.
While this truism might at first seem humorous, many operating system designers have spent a great deal of time trying to reduce the impact of this very real shortage. They have done so by implementing virtual memory — a way of combining RAM with slower storage to give a system the appearance of having more RAM than is actually installed.
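On a Linux system, a quick look at both halves of that combination is available with the free command; a minimal sketch:

# Installed RAM and configured swap, in megabytes; together these
# roughly bound what the virtual memory subsystem can hand out.
free -m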
4.3.1. Virtual Memory in Simple Terms
Let us start with a hypothetical application. The machine code making up this application is 10000 bytes in size. It also requires another 5000 bytes for data storage and I/O buffers. This means that, to run this application, there must be 15000 bytes of RAM available; even one byte less, and the application would not be able to run.
This 15000 byte requirement is known as the application’s address space. It is the number of unique addresses needed to hold both the application and its data. In the first computers, the amount of available RAM had to be greater than the address space of the largest application to be run; otherwise, the application would fail with an "out of memory" error.
A later approach known as overlaying attempted to alleviate the problem by allowing programmers to dictate which parts of their application needed to be memory-resident at any given time. In this way, code only required once for initialization purposes could be written over (overlayed) with code that would be used later. While overlays did ease memory shortages, it was a very complex and error-prone process. Overlays also failed to address the issue of system-wide memory shortages at runtime. In other words, an overlayed program may require less memory to run than a program that is not overlayed, but if the system still does not have sufficient memory for the overlayed program, the end result is the same — an out of memory error.
With virtual memory, the concept of an application’s address space takes on a different meaning. Rather than concentrating on how much memory an application needs to run, a virtual memory operating system continually attempts to find the answer to the question, "how little memory does an application need to run?"
While it at first appears that our hypothetical application requires the full 15000 bytes to run, think back to our discussion in Section 4.1 Storage Access Patterns — memory access tends to be sequential and localized. Because of this, the amount of memory required to execute the application at any given time is less than 15000 bytes — usually a lot less. Consider the types of memory accesses required to execute a single machine instruction:
• The instruction is read from memory.
• The data required by the instruction is read from memory.
• After the instruction completes, the results of the instruction are written back to memory.
The actual number of bytes necessary for each memory access varies according to the CPU’s architecture, the actual instruction, and the data type. However, even if one instruction required 100 bytes of memory for each type of memory access, the 300 bytes required is still much less than the application’s entire 15000-byte address space. If a way could be found to keep track of an application’s memory requirements as the application runs, it would be possible to keep the application running while using less memory than its address space would otherwise dictate.
But that leaves one question:
If only part of the application is in memory at any given time, where is the rest of it?
4.3.2. Backing Store — the Central Tenet of Virtual Memory
The short answer to this question is that the rest of the application remains on disk. In other words, disk acts as the backing store for RAM; a slower, larger storage medium acting as a "backup" for a much faster, smaller storage medium. This might at first seem to be a very large performance problem in the making — after all, disk drives are so much slower than RAM.
While this is true, it is possible to take advantage of the sequential and localized access behavior of applications and eliminate most of the performance implications of using disk drives as backing store for RAM. This is done by structuring the virtual memory subsystem so that it attempts to ensure that those parts of the application currently needed — or likely to be needed in the near future — are kept in RAM only for as long as they are actually needed.
In many respects this is similar to the relationship between cache and RAM: a small amount of fast storage combined with a large amount of slow storage acts just like a large amount of fast storage.
With this in mind, let us explore the process in more detail.
4.4. Virtual Memory: The Details
First, we must introduce a new concept: virtual address space. Virtual address space is the maximum amount of address space available to an application. The virtual address space varies according to the system’s architecture and operating system. Virtual address space depends on the architecture because it is the architecture that defines how many bits are available for addressing purposes. Virtual address space also depends on the operating system because the manner in which the operating system was implemented may introduce additional limits over and above those imposed by the architecture.
The word "virtual" in virtual address space means this is the total number of uniquely-addressable memory locations available to an application, but not the amount of physical memory either installed in the system, or dedicated to the application at any given time.
In the case of our example application, its virtual address space is 15000 bytes.
To implement virtual memory, it is necessary for the computer system to have special memory management hardware. This hardware is often known as an MMU (Memory Management Unit). Without an MMU, when the CPU accesses RAM, the actual RAM locations never change — memory address 123 is always the same physical location within RAM.
However, with an MMU, memory addresses go through a translation step prior to each memory access. This means that memory address 123 might be directed to physical address 82043 at one time, and physical address 20468 another time. As it turns out, the overhead of individually tracking the virtual to physical translations for billions of bytes of memory would be too great. Instead, the MMU divides RAM into pages — contiguous sections of memory of a set size that are handled by the MMU as single entities.
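The page size is fixed by the architecture and is easy to check; on most x86 Linux systems the command below reports 4096 bytes, though other architectures differ:

# Display the size, in bytes, of the pages the MMU manages.
getconf PAGESIZE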
Keeping track of these pages and their address translations might sound like an unnecessary and confusing additional step. However, it is crucial to implementing virtual memory. For that reason, consider the following point.
Taking our hypothetical application with the 15000 byte virtual address space, assume that the application’s first instruction accesses data stored at address 12374. However, also assume that our computer only has 12288 bytes of physical RAM. What happens when the CPU attempts to access address 12374?
What happens is known as a page fault.
4.4.1. Page Faults
A page fault is the sequence of events occurring when a program attempts to access data (or code) that is in its address space, but is not currently located in the system’s RAM. The operating system must handle page faults by somehow making the accessed data memory resident, allowing the program to continue operation as if the page fault had never occurred.
In the case of our hypothetical application, the CPU first presents the desired address (12374) to the MMU. However, the MMU has no translation for this address. So, it interrupts the CPU and causes software, known as a page fault handler, to be executed. The page fault handler then determines what must be done to resolve this page fault. It can:
Find where the desired page resides on disk and read it in (this is normally the case if the page fault is for a page of code)
Determine that the desired page is already in RAM (but not allocated to the current process) and reconfigure the MMU to point to it
Point to a special page containing only zeros, and allocate a new page for the process only if the process ever attempts to write to the special page (this is called a copy on write page, and is often used for pages containing zero-initialized data)
Get the desired page from somewhere else (which is discussed in more detail later)
While the first three actions are relatively straightforward, the last one is not. For that, we need to cover some additional topics.
4.4.2. The Working Set
The group of physical memory pages currently dedicated to a specific process is known as the working set for that process. The number of pages in the working set can grow and shrink, depending on the overall availability of pages on a system-wide basis.
The working set grows as a process page faults. The working set shrinks as fewer and fewer free pages exist. To keep from running out of memory completely, pages must be removed from processes' working sets and turned into free pages, available for later use. The operating system shrinks processes' working sets by:
Writing modified pages to a dedicated area on a mass storage device (usually known as swapping or paging space)
Marking unmodified pages as being free (there is no need to write these pages out to disk as they have not changed)
To determine appropriate working sets for all processes, the operating system must track usage information for all pages. In this way, the operating system determines which pages are actively being used (and must remain memory resident) and which pages are not (and therefore can be removed from memory). In most cases, some sort of least-recently-used algorithm determines which pages are eligible for removal from process working sets.
4.4.3. Swapping
While swapping (writing modified pages out to the system swap space) is a normal part of a system's operation, it is possible to experience too much swapping. The reason to be wary of excessive swapping is that the following situation can easily occur, over and over again:
Pages from a process are swapped out
The process becomes runnable and attempts to access a swapped page
The page is faulted back into memory (most likely forcing some other processes' pages to be swapped out)
A short time later, the page is swapped out again
If this sequence of events is widespread, it is known as thrashing and is indicative of insufficient RAM for the present workload. Thrashing is extremely detrimental to system performance, as the CPU and I/O loads that can be generated in such a situation quickly outweigh the load imposed by a system’s real work. In extreme cases, the system may actually do no useful work, spending all its resources moving pages to and from memory.
4.5. Virtual Memory Performance Implications
While virtual memory makes it possible for computers to more easily handle larger and more complex applications, as with any powerful tool, it comes at a price. The price in this case is one of performance — a virtual memory operating system has a lot more to do than an operating system incapable of supporting virtual memory. This means that performance is never as good with virtual memory as it is when the same application is 100% memory-resident.
However, this is no reason to throw up one's hands and give up. The benefits of virtual memory are too great for that. And, with a bit of effort, good performance is possible. The key is to examine those system resources impacted by heavy use of the virtual memory subsystem.
4.5.1. Worst Case Performance Scenario
For a moment, take what you have read in this chapter and consider what system resources are used by extremely heavy page fault and swapping activity:
RAM — It stands to reason that available RAM is low (otherwise there would be no need to page fault or swap).
Disk — While disk space might not be impacted, I/O bandwidth (due to heavy paging and swapping) would be.
CPU — The CPU is expending cycles doing the processing required to support memory management and setting up the necessary I/O operations for paging and swapping.
The interrelated nature of these loads makes it easy to understand how resource shortages can lead to severe performance problems.
All it takes is a system with too little RAM, heavy page fault activity, and a system running near its limit in terms of CPU or disk I/O. At this point, the system is thrashing, with poor performance the inevitable result.
4.5.2. Best Case Performance Scenario
At best, the overhead from virtual memory support presents a minimal additional load to a well-configured system:
RAM — Sufficient RAM for all working sets, with enough left over to handle any page faults [2]
Disk — Because of the limited page fault activity, disk I/O bandwidth would be minimally impacted
CPU — The majority of CPU cycles are dedicated to actually running applications, instead of running the operating system's memory management code
From this, the overall point to keep in mind is that the performance impact of virtual memory is minimal when it is used as little as possible. This means the primary determinant of good virtual memory subsystem performance is having enough RAM.
Next in line (but much lower in relative importance) are sufficient disk I/O and CPU capacity. However, keep in mind that these resources only help the system performance degrade more gracefully from heavy faulting and swapping; they do little to help the virtual memory subsystem performance (although they obviously can play a major role in overall system performance).
4.6. Red Hat Enterprise Linux-Specific Information
Due to the inherent complexity of being a demand-paged virtual memory operating system, monitoring memory-related resources under Red Hat Enterprise Linux can be confusing. Therefore, it is best to start with the more straightforward tools, and work from there.
2. A reasonably active system always experiences some level of page fault activity, due to page faults incurred as newly-launched applications are brought into memory.
Using free, it is possible to get a concise (if somewhat simplistic) overview of memory and swap utilization. Here is an example:
total used free shared buffers cached
Mem: 1288720 361448 927272 0 27844 187632
-/+ buffers/cache:     145972    1142748
Swap:                  522104          0     522104
We note that this system has 1.2GB of RAM, of which only about 350MB is actually in use. As expected for a system with this much free RAM, none of the 500MB swap partition is in use.
Contrast that example with this one:
total used free shared buffers cached
Mem: 255088 246604 8484 0 6492 111320
-/+ buffers/cache:     128792     126296
Swap:                  530136     111308     418828
This system has about 256MB of RAM, the majority of which is in use, leaving only about 8MB free. Over 100MB of the 512MB swap partition is in use. Although this system is certainly more limited in terms of memory than the first system, to determine if this memory limitation is causing performance problems we must dig a bit deeper.
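Before moving on, note that free itself can do a bit more than the default display. The following options are standard in the procps version of free shipped with Red Hat Enterprise Linux; consult the free(1) man page to confirm their availability:
free -m      # report in megabytes instead of kilobytes
free -t      # add a line totaling RAM and swap together
free -s 5    # redisplay the statistics every five seconds (Ctrl-C to stop)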
Although more cryptic than free, vmstat has the benefit of displaying more than memory utilization statistics. Here is the output from vmstat 1 10:
procs                     memory      swap          io     system          cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 2  0  0 111304   9728   7036 107204   0   0     6    10   120    24  10   2  89
 2  0  0 111304   9728   7036 107204   0   0     0     0   526  1653  96   4   0
 1  0  0 111304   9616   7036 107204   0   0     0     0   552  2219  94   5   1
 1  0  0 111304   9616   7036 107204   0   0     0     0   624   699  98   2   0
 2  0  0 111304   9616   7052 107204   0   0     0    48   603  1466  95   5   0
 3  0  0 111304   9620   7052 107204   0   0     0     0   768   932  90   4   6
 3  0  0 111304   9440   7076 107360  92   0   244     0   820  1230  85   9   6
 2  0  0 111304   9276   7076 107368   0   0     0     0   832  1060  87   6   7
 3  0  0 111304   9624   7092 107372   0   0    16     0   813  1655  93   5   2
 2  0  2 111304   9624   7108 107372   0   0     0   972  1189  1165  68   9  23
During this 10-second sample, the amount of free memory (the free field) varies somewhat, and there is a bit of swap-related I/O (the si and so fields), but overall this system is running well. It is questionable, however, how much additional workload the system could handle, given the current memory utilization.
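Because sustained swap-in and swap-out activity is the clearest early warning of memory pressure, it can be useful to watch just those two fields over time. Here is a minimal sketch (an illustration only; it assumes GNU awk and the column layout shown above, where si and so are the eighth and ninth fields):
#!/bin/bash
# Report a timestamped warning whenever vmstat shows swap activity.
vmstat 5 | awk '$1 ~ /^[0-9]+$/ && ($8 > 0 || $9 > 0) {
    print strftime("%H:%M:%S"), "swap activity: si=" $8, "so=" $9
}'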
When researching memory-related issues, it is often necessary to determine how the Red Hat Enterprise Linux virtual memory subsystem is making use of system memory. By using sar, it is possible to examine this aspect of system performance in much more detail.
By reviewing the sar -r report, we can examine memory and swap utilization more closely:
Linux 2.4.20-1.1931.2.231.2.10.ent (pigdog.example.com) 07/22/2003
12:00:01 AM kbmemfree kbmemused  %memused kbmemshrd kbbuffers  kbcached
12:10:00 AM    240468   1048252     81.34         0    133724    485772
12:20:00 AM    240508   1048212     81.34         0    134172    485600
...
08:40:00 PM    934132    354588     27.51         0     26080    185364
Average:       324346    964374     74.83         0     96072    467559
The kbmemfree and kbmemused fields show the typical free and used memory statistics, with the percentage of memory used displayed in the %memused field. The kbbuffers and kbcached fields show how many kilobytes of memory are allocated to buffers and the system-wide data cache.
The kbmemshrd field is always zero for systems (such as Red Hat Enterprise Linux) using the 2.4 Linux kernel.
The lines for this report have been truncated to fit on the page. Here is the remainder of each line, with the timestamp added to the left to make reading easier:
12:00:01 AM kbswpfree kbswpused  %swpused
12:10:00 AM    522104         0      0.00
12:20:00 AM    522104         0      0.00
...
08:40:00 PM    522104         0      0.00
Average:       522104         0      0.00
For swap utilization, the kbswpfree and kbswpused fields show the amount of free and used swap space, in kilobytes, with the %swpused field showing the swap space used as a percentage.
To learn more about the swapping activity taking place, use the sar -W report. Here is an example:
Linux 2.4.20-1.1931.2.231.2.10.entsmp (raptor.example.com) 07/22/2003
12:00:01 AM  pswpin/s pswpout/s
12:10:01 AM      0.15      2.56
12:20:00 AM      0.00      0.00
...
03:30:01 PM      0.42      2.56
Average:         0.11      0.37
Here we notice that, on average, only about one third as many pages were brought in from swap (pswpin/s) as were written out to swap (pswpout/s).
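Note that sar is not limited to the current day's data. The sysstat package's data collector stores its history under /var/log/sa/, one file per day of the month, and any report can be replayed from those files with the -f option (the file name below is illustrative; the two-digit suffix is the day of the month):
sar -W -f /var/log/sa/sa22    # swapping statistics recorded on the 22nd
sar -r -f /var/log/sa/sa22    # memory and swap utilization for the same day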
To better understand how pages are being used, refer to the sar -B report:
Linux 2.4.20-1.1931.2.231.2.10.entsmp (raptor.example.com) 07/22/2003
12:00:01 AM  pgpgin/s pgpgout/s  activepg  inadtypg  inaclnpg  inatarpg
12:10:00 AM      0.03      8.61    195393     20654     30352     49279
12:20:00 AM      0.01      7.51    195385     20655     30336     49275
...
08:40:00 PM      0.00      7.79     71236      1371      6760     15873
Average:       201.54    201.54    169367     18999     35146     44702
Here we can determine how many blocks per second are paged in from disk (pgpgin/s) and paged out to disk (pgpgout/s). These statistics serve as a barometer of overall virtual memory activity.
However, more knowledge can be gained by examining the other fields in this report. The Red Hat Enterprise Linux kernel marks all pages as either active or inactive. As the names imply, active pages are currently in use in some manner (as process or buffer pages, for example), while inactive pages are not. This example report shows that the list of active pages (the activepg field) averages approximately 660MB [3].
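The conversion from pages to bytes is simple arithmetic. Using the 4096-byte x86 page size noted in footnote 3, the average of 169367 active pages works out to roughly 660MB:
# 169367 pages x 4096 bytes per page, expressed in megabytes
echo $(( 169367 * 4096 / 1024 / 1024 ))MB    # prints 661MB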
The remainder of the fields in this report concentrate on the inactive list — pages that, for one reason or another, have not recently been used. The inadtypg field shows how many inactive pages are dirty (modified) and may need to be written to disk. The inaclnpg field, on the other hand, shows how many inactive pages are clean (unmodified) and do not need to be written to disk.
3. The page size under Red Hat Enterprise Linux on the x86 system used in this example is 4096 bytes. Systems based on other architectures may have different page sizes.
The inatarpg field represents the desired size of the inactive list. This value is calculated by the Linux kernel and is sized such that the inactive list remains large enough to act as a pool for page replacement purposes.
For additional insight into page status (specifically, how often pages change status), use the sar -R report. Here is a sample report:
Linux 2.4.20-1.1931.2.231.2.10.entsmp (raptor.example.com) 07/22/2003
12:00:01 AM   frmpg/s   shmpg/s   bufpg/s   campg/s
12:10:00 AM     -0.10      0.00      0.12     -0.07
12:20:00 AM      0.02      0.00      0.19     -0.07
...
08:50:01 PM     -3.19      0.00      0.46      0.81
Average:         0.01      0.00     -0.00     -0.00
The statistics in this particular sar report are unique, in that they may be positive, negative, or zero. When positive, the value indicates the rate at which pages of this type are increasing. When negative, the value indicates the rate at which pages of this type are decreasing. A value of zero indicates that pages of this type are neither increasing nor decreasing.
In this example, the last sample shows slightly over three pages per second being allocated from the list of free pages (the frmpg/s field) and nearly one page per second added to the page cache (the campg/s field). The list of pages used as buffers (the bufpg/s field) gained approximately one page every two seconds, while the shared memory page list (the shmpg/s field) neither gained nor lost any pages.
4.7. Additional Resources
This section includes various resources that can be used to learn more about resource monitoring and the Red Hat Enterprise Linux-specific subject matter discussed in this chapter.
4.7.1. Installed Documentation
The following resources are installed in the course of a typical Red Hat Enterprise Linux installation and can help you learn more about the subject matter discussed in this chapter.
free(1) man page — Learn how to display free and used memory statistics.
vmstat(8) man page — Learn how to display a concise overview of process, memory, swap, I/O, system, and CPU utilization.
sar(1) man page — Learn how to produce system resource utilization reports.
sa2(8) man page — Learn how to produce daily system resource utilization report files.
4.7.2. Useful Websites
http://people.redhat.com/alikins/system_tuning.html — System Tuning Info for Linux Servers. A stream-of-consciousness approach to performance tuning and resource monitoring for servers.
http://www.linuxjournal.com/article.php?sid=2396 — Performance Monitoring Tools for Linux. This Linux Journal page is geared more toward the administrator interested in writing a customized performance graphing solution. Written several years ago, some of the details may no longer apply, but the overall concept and execution are sound.
4.7.3. Related Books
The following books discuss various issues related to resource monitoring, and are good resources for Red Hat Enterprise Linux system administrators:
The Red Hat Enterprise Linux System Administration Guide; Red Hat, Inc. — Includes a chapter on many of the resource monitoring tools described here.
Linux Performance Tuning and Capacity Planning by Jason R. Fink and Matthew D. Sherer; Sams — Provides more in-depth overviews of the resource monitoring tools presented here and includes others that might be appropriate for more specific resource monitoring needs.
Red Hat Linux Security and Optimization by Mohammed J. Kabir; Red Hat Press — Approximately the first 150 pages of this book discuss performance-related issues. This includes chapters dedicated to performance issues specific to network, Web, email, and file servers.
Linux Administration Handbook by Evi Nemeth, Garth Snyder, and Trent R. Hein; Prentice Hall — Provides a short chapter similar in scope to this book, but includes an interesting section on diagnosing a system that has suddenly slowed down.
Linux System Administration: A User's Guide by Marcel Gagne; Addison Wesley Professional — Contains a small chapter on performance monitoring and tuning.
Essential System Administration (3rd Edition) by Aeleen Frisch; O'Reilly & Associates — The chapter on managing system resources contains good overall information, with some Linux specifics included.
System Performance Tuning (2nd Edition) by Gian-Paolo D. Musumeci and Mike Loukides; O'Reilly & Associates — Although heavily oriented toward more traditional UNIX implementations, there are many Linux-specific references throughout the book.
Chapter 5. Managing Storage
If there is one thing that takes up the majority of a system administrator’s day, it would have to be storage management. It seems that disks are always running out of free space, becoming overloaded with too much I/O activity, or failing unexpectedly. Therefore, it is vital to have a solid working knowledge of disk storage in order to be a successful system administrator.
5.1. An Overview of Storage Hardware
Before managing storage, it is first necessary to understand the hardware on which data is stored. Unless you have at least some knowledge about mass storage device operation, you may find yourself in a situation where you have a storage-related problem, but you lack the background knowledge necessary to interpret what you are seeing. By gaining some insight into how the underlying hardware operates, you should be able to more easily determine whether your computer’s storage subsystem is operating properly.
The vast majority of all mass-storage devices use some sort of rotating media and support the random access of data on that media. This means that the following components are present in some form within nearly every mass storage device:
Disk platters
Data reading/writing device
Access arms
The following sections explore each of these components in more detail.
5.1.1. Disk Platters
The rotating media used by nearly all mass storage devices are in the form of one or more flat, circularly-shaped platters. The platter may be composed of any number of different materials, such as aluminum, glass, and polycarbonate.
The surface of each platter is treated in such a way as to enable data storage. The exact nature of the treatment depends on the data storage technology to be used. The most common data storage technology is based on the property of magnetism; in these cases the platters are covered with a compound that exhibits good magnetic characteristics.
Another common data storage technology is based on optical principles; in these cases, the platters are covered with materials whose optical properties can be modified, thereby allowing data to be stored optically [1].
No matter what data storage technology is in use, the disk platters are spun, causing their entire surface to sweep past another component — the data reading/writing device.
1. Some optical devices — notably CD-ROM drives — use somewhat different approaches to data storage; these differences are pointed out at the appropriate points within the chapter.
5.1.2. Data Reading/Writing Device
The data reading/writing device is the component that takes the bits and bytes on which a computer system operates and turns them into the magnetic or optical variations necessary to interact with the materials coating the surface of the disk platters.
Sometimes the conditions under which these devices must operate are challenging. For instance, in magnetically-based mass storage the read/write devices (known as heads) must be very close to the surface of the platter. However, if the head and the surface of the disk platter were to touch, the resulting friction would do severe damage to both the head and the platter. Therefore, the surfaces of both the head and the platter are carefully polished, and the head uses air pressure developed by the spinning platters to float over the platter’s surface, "flying" at an altitude less than the thickness of a human hair. This is why magnetic disk drives are sensitive to shock, sudden temperature changes, and any airborne contamination.
The challenges faced by optical heads are somewhat different than those for magnetic heads — here, the head assembly must remain at a relatively constant distance from the surface of the platter. Otherwise, the lenses used to focus on the platter do not produce a sufficiently sharp image.
In either case, the heads use a very small amount of the platter’s surface area for data storage. As the platter spins below the heads, this surface area takes the form of a very thin circular line.
If this was how mass storage devices worked, it would mean that over 99% of the platter’s surface area would be wasted. Additional heads could be mounted over the platter, but to fully utilize the platter’s surface area more than a thousand heads would be necessary. What is required is some method of moving the head over the surface of the platter.
5.1.3. Access Arms
By using a head attached to an arm that is capable of sweeping over the platter’s entire surface, it is possible to fully utilize the platter for data storage. However, the access arm must be capable of two things:
Moving very quickly
Moving very precisely
The access arm must move as quickly as possible, because the time spent moving the head from one position to another is wasted time. That is because no data can be read or written until the access arm stops moving [2].
The access arm must be able to move with great precision because, as stated earlier, the surface area used by the heads is very small. Therefore, to efficiently use the platter's storage capacity, it is necessary to move the heads only enough to ensure that any data written in the new position does not overwrite data written at a previous position. This has the effect of conceptually dividing the platter's surface into a thousand or more concentric "rings" or tracks. Movement of the access arm from one track to another is often referred to as seeking, and the time it takes the access arms to move from one track to another is known as the seek time.
Where there are multiple platters (or one platter with both surfaces used for data storage), the arms for each surface are stacked, allowing the same track on each surface to be accessed simultaneously. If the tracks for each surface could be visualized with the access arm stationary over a given track, they would appear to be stacked one on top of another, making up a cylindrical shape; therefore, the set of tracks accessible at a certain position of the access arms is known as a cylinder.
2. In some optical devices (such as CD-ROM drives), the access arm is continually moving, causing the head assembly to describe a spiral path over the surface of the platter. This is a fundamental difference in how the storage medium is used and reflects the CD-ROM's origins as a medium for music storage, where continuous data retrieval is a more common operation than searching for a specific data point.
5.2. Storage Addressing Concepts
The configuration of disk platters, heads, and access arms makes it possible to position the head over any part of any surface of any platter in the mass storage device. However, this is not sufficient; to use this storage capacity, we must have some method of giving addresses to uniform-sized parts of the available storage.
There is one final aspect to this process. Consider all the tracks in the many cylinders present in a typical mass storage device. Because the tracks have varying diameters, their circumference also varies. Therefore, if storage was addressed only to the track level, each track would have different amounts of data — track #0 (being near the center of the platter) might hold 10,827 bytes, while track #1,258 (near the outside edge of the platter) might hold 15,382 bytes.
The solution is to divide each track into multiple sectors or blocks: consistently-sized segments of storage, often 512 bytes each. The result is that each track contains a set number of sectors [3].
A side effect of this is that every track contains unused space — the space between the sectors. Despite the constant number of sectors in each track, the amount of unused space varies — relatively little unused space in the inner tracks, and a great deal more unused space in the outer tracks. In either case, this unused space is wasted, as data cannot be stored on it.
However, the advantage offsetting this wasted space is that effectively addressing the storage on a mass storage device is now possible. In fact, there are two methods of addressing — geometry-based addressing and block-based addressing.
5.2.1. Geometry-Based Addressing
The term geometry-based addressing refers to the fact that mass storage devices actually store data at a specific physical spot on the storage medium. In the case of the devices being described here, this refers to three specific items that define a specific point on the device’s disk platters:
Cylinder
Head
Sector
The following sections show how a hypothetical address can identify a specific physical location on the storage medium.
5.2.1.1. Cylinder
As stated earlier, the cylinder denotes a specific position of the access arm (and therefore, the read/write heads). By specifying a particular cylinder, we are eliminating all other cylinders, reducing our search to only one track for each surface in the mass storage device.
Cylinder Head Sector
1014 X X
Table 5-1. Storage Addressing
In Table 5-1, the first part of a geometry-based address has been filled in. Two more components to this address — the head and sector — remain undefined.
3. While early mass storage devices used the same number of sectors for every track, later devices divided the range of cylinders into different "zones," with each zone having a different number of sectors per track. The reason for this is to take advantage of the outer cylinders, where there is more unused space between sectors.
5.2.1.2. Head
Although in the strictest sense we are selecting a particular disk platter, because each surface has a read/write head dedicated to it, it is easier to think in terms of interacting with a specific head. In fact, the device's underlying electronics actually select one head and — deselecting the rest — interact only with the selected head for the duration of the I/O operation. All other tracks that make up the current cylinder have now been eliminated.
Cylinder Head Sector
1014 2 X
Table 5-2. Storage Addressing
In Table 5-2, the first two parts of a geometry-based address have been filled in. One final component to this address — the sector — remains undefined.
5.2.1.3. Sector
By specifying a particular sector, we have completed the addressing, and have uniquely identified the desired block of data.
Cylinder Head Sector
1014 2 12
Table 5-3. Storage Addressing
In Table 5-3, the complete geometry-based address has been filled in. This address identifies the location of one specific block out of all the other blocks on this device.
5.2.1.4. Problems with Geometry-Based Addressing
While geometry-based addressing is straightforward, there is an area of ambiguity that can cause problems. The ambiguity is in numbering the cylinders, heads, and sectors.
It is true that each geometry-based address uniquely identifies one specific data block, but that only applies if the numbering scheme for the cylinders, heads, and sectors is not changed. If the numbering scheme changes (such as when the hardware/software interacting with the storage device changes), then the mapping between geometry-based addresses and their corresponding data blocks can change, making it impossible to access the desired data.
Because of this potential for ambiguity, a different approach to addressing was developed. The next section describes it in more detail.
5.2.2. Block-Based Addressing
Block-based addressing is much more straightforward than geometry-based addressing. With block-based addressing, every data block is given a unique number. This number is passed from the computer to the mass storage device, which then internally performs the conversion to the geometry-based address required by the device's control circuitry.
Because the conversion to a geometry-based address is always done by the device itself, it is always consistent, eliminating the problem inherent in giving the device geometry-based addresses.
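Conceptually, the conversion the drive performs is simple arithmetic. The sketch below shows the classic mapping from a cylinder/head/sector address to a linear block number; the geometry figures are hypothetical, and modern drives with zoned recording use more elaborate internal mappings:
#!/bin/bash
# Classic CHS-to-block conversion; sectors are conventionally numbered from 1.
HEADS=16       # hypothetical heads per cylinder
SECTORS=63     # hypothetical sectors per track
chs_to_block() {
    local cyl=$1 head=$2 sector=$3
    echo $(( (cyl * HEADS + head) * SECTORS + (sector - 1) ))
}
chs_to_block 1014 2 12    # the address from Table 5-3; prints 1022249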
5.3. Mass Storage Device Interfaces
Every device used in a computer system must have some means of attaching to that computer system. This attachment point is known as an interface. Mass storage devices are no different — they have interfaces too. It is important to know about interfaces for two main reasons:
There are many different (mostly incompatible) interfaces
Different interfaces have different performance and price characteristics
Unfortunately, there is no single universal device interface and not even a single mass storage device interface. Therefore, system administrators must be aware of the interface(s) supported by their organization's systems. Otherwise, there is a real risk of purchasing the wrong hardware when a system upgrade is planned.
Different interfaces have different performance capabilities, making some interfaces more suitable for certain environments than others. For example, interfaces capable of supporting high-speed devices are more suitable for server environments, while slower interfaces would be sufficient for light desktop usage. Such differences in performance also lead to differences in price, meaning that — as always — you get what you pay for. High-performance computing does not come cheaply.
5.3.1. Historical Background
Over the years there have been many different interfaces created for mass storage devices. Some have fallen by the wayside, and some are still in use today. However, the following list is provided to give an idea of the scope of interface development over the past thirty years and to provide perspective on the interfaces in use today.
FD-400
An interface originally designed for the original 8-inch floppy disk drives in the mid-70s. Used a 44-conductor cable with a circuit board edge connector that supplied both power and data.
SA-400
Another floppy disk drive interface (this time originally developed in the late-70s for the then-new 5.25 inch floppy drive). Used a 34-conductor cable with a standard socket connector. A slightly modified version of this interface is still used today for 5.25 inch floppy and 3.5 inch diskette drives.
IPI
Standing for Intelligent Peripheral Interface, this interface was used on the 8 and 14-inch disk drives deployed on minicomputers of the 1970s.
SMD
A successor to IPI, SMD (short for Storage Module Device) was used on 8 and 14-inch minicomputer hard drives in the 70s and 80s.
ST506/412
A hard drive interface dating from the early 80s. Used in many personal computers of the day, this interface used two cables — one 34-conductor and one 20-conductor.
ESDI
Standing for Enhanced Small Device Interface, this interface was considered a successor to ST506/412 with faster transfer rates and larger supported drive sizes. Dating from the mid-80s, ESDI used the same two-cable connection scheme of its predecessor.
There were also proprietary interfaces from the larger computer vendors of the day (IBM and DEC, primarily). The intent behind the creation of these interfaces was to attempt to protect the extremely lucrative peripherals business for their computers. However, due to their proprietary nature, the devices compatible with these interfaces were more expensive than equivalent non-proprietary devices. Because of this, these interfaces failed to achieve any long-term popularity.
While proprietary interfaces have largely disappeared, and the interfaces described at the start of this section no longer have much (if any) market share, it is important to know about these no-longer-used interfaces, as they prove one point — nothing in the computer industry remains constant for long. Therefore, always be on the lookout for new interface technologies; one day you might find that one of them may prove to be a better match for your needs than the more traditional offerings you currently use.
5.3.2. Present-Day Industry-Standard Interfaces
Unlike the proprietary interfaces mentioned in the previous section, some interfaces were more widely adopted, and turned into industry standards. Two interfaces in particular have made this transition and are at the heart of today’s storage industry:
IDE/ATA
SCSI
5.3.2.1. IDE/ATA
IDE stands for Integrated Drive Electronics. This interface originated in the late 80s, and uses a 40-pin connector.
Note
Actually, the proper name for this interface is the "AT Attachment" interface (or ATA), but the term "IDE" (which actually refers to an ATA-compatible mass storage device) is still used to some extent. However, the remainder of this section uses the interface's proper name — ATA.
ATA implements a bus topology, with each bus supporting two mass storage devices. These two devices are known as the master and the slave. These terms are misleading, as they imply some sort of relationship between the devices; that is not the case. The selection of which device is the master and which is the slave is normally made through the use of jumper blocks on each device.
Note
A more recent innovation is the introduction of cable select capabilities to ATA. This innovation requires the use of a special cable, an ATA controller, and mass storage devices that support cable select (normally through a "cable select" jumper setting). When properly configured, cable select eliminates the need to change jumpers when moving devices; instead, the device's position on the ATA cable denotes whether it is master or slave.
A variation of this interface illustrates the unique ways in which technologies can be mixed and also introduces our next industry-standard interface. ATAPI is a variation of the ATA interface and stands for AT Attachment Packet Interface. Used primarily by CD-ROM drives, ATAPI adheres to the electrical and mechanical aspects of the ATA interface but uses the communication protocol from the next interface discussed — SCSI.
5.3.2.2. SCSI
Formally known as the Small Computer System Interface, SCSI as it is known today originated in the early 80s and was declared a standard in 1986. Like ATA, SCSI makes use of a bus topology. However, there the similarities end.
Using a bus topology means that every device on the bus must be uniquely identified somehow. While ATA supports only two different devices for each bus and gives each one a specific name, SCSI does this by assigning each device on a SCSI bus a unique numeric address or SCSI ID. Each device on a SCSI bus must be configured (usually by jumpers or switches [4]) to respond to its SCSI ID.
Before continuing any further in this discussion, it is important to note that the SCSI standard does not represent a single interface, but a family of interfaces. There are several areas in which SCSI varies:
Bus width
Bus speed
Electrical characteristics
The original SCSI standard described a bus topology in which eight lines in the bus were used for data transfer. This meant that the first SCSI devices could transfer data one byte at a time. In later years, the standard was expanded to permit implementations where sixteen lines could be used, doubling the amount of data that devices could transfer. The original "8-bit" SCSI implementations were then referred to as narrow SCSI, while the newer 16-bit implementations were known as wide SCSI.
Originally, the bus speed for SCSI was set to 5MHz, permitting a 5MB/second transfer rate on the original 8-bit SCSI bus. However, subsequent revisions to the standard doubled that speed to 10MHz, resulting in 10MB/second for narrow SCSI and 20MB/second for wide SCSI. As with the bus width, the changes in bus speed received new names, with the 10MHz bus speed being termed fast. Subsequent enhancements pushed bus speeds to ultra (20MHz), fast-40 (40MHz), and fast-80 [5]. Further increases in transfer rates led to several different versions of the ultra160 bus speed.
By combining these terms, various SCSI configurations can be concisely named. For example, "ultra-wide SCSI" refers to a 16-bit SCSI bus running at 20MHz.
The original SCSI standard used single-ended signaling; this is an electrical configuration where only one conductor is used to pass an electrical signal. Later implementations also permitted the use of differential signaling, where two conductors are used to pass a signal. Differential SCSI (which was later renamed to high voltage differential or HVD SCSI) had the benefit of reduced sensitivity to electrical noise and allowed longer cable lengths, but it never became popular in the mainstream computer market. A later implementation, known as low voltage differential (LVD), has finally broken through to the mainstream and is a requirement for the higher bus speeds.
4. Some storage hardware (usually those that incorporate removable drive "carriers") is designed so that the act of plugging a module into place automatically sets the SCSI ID to an appropriate value.
5. Fast-80 is not technically a change in bus speed; instead the 40MHz bus was retained, but data was clocked at both the rising and falling of each clock pulse, effectively doubling the throughput.
The width of a SCSI bus not only dictates the amount of data that can be transferred with each clock cycle, but it also determines how many devices can be connected to a bus. Regular SCSI supports 8 uniquely-addressed devices, while wide SCSI supports 16. In either case, you must make sure that all devices are set to use a unique SCSI ID. Two devices sharing a single ID cause problems that could lead to data corruption.
One other thing to keep in mind is that every device on the bus uses an ID. This includes the SCSI controller. Quite often system administrators forget this and unwittingly set a device to use the same SCSI ID as the bus’s controller. This also means that, in practice, only 7 (or 15, for wide SCSI) devices may be present on a single bus, as each bus must reserve an ID for the controller.
Tip
Most SCSI implementations include some means of scanning the SCSI bus; this is often used to confirm that all the devices are properly configured. If a bus scan returns the same device for every single SCSI ID, that device has been incorrectly set to the same SCSI ID as the SCSI controller. To resolve the problem, reconfigure the device to use a different (and unique) SCSI ID.
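Under Red Hat Enterprise Linux, one quick way to review the devices the kernel found on each SCSI bus is to read /proc/scsi/scsi; every properly configured device should appear exactly once, with its own ID:
cat /proc/scsi/scsi    # lists each device's host adapter, channel, SCSI ID, and LUN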
Because of SCSI’s bus-oriented architecture, it is necessary to properly terminate both ends of the bus. Termination is accomplished by placing a load of the correct electrical impedance on each conductor comprising the SCSI bus. Termination is an electrical requirement; without it, the various signals present on the bus would be reflected off the ends of the bus, garbling all communication.
Many (but not all) SCSI devices come with internal terminators that can be enabled or disabled using jumpers or switches. External terminators are also available.
One last thing to keep in mind about SCSI — it is not just an interface standard for mass storage devices. Many other devices (such as scanners, printers, and communications devices) use SCSI. Although these are much less common than SCSI mass storage devices, they do exist. However, it is likely that, with the advent of USB and IEEE-1394 (often called Firewire), these interfaces will be used more for these types of devices in the future.
Tip
The USB and IEEE-1394 interfaces are also starting to make inroads in the mass storage arena; however, no native USB or IEEE-1394 mass-storage devices currently exist. Instead, the present-day offerings are based on ATA or SCSI devices with external conversion circuitry.
No matter what interface a mass storage device uses, the inner workings of the device have a bearing on its performance. The following section explores this important subject.
5.4. Hard Drive Performance Characteristics
Hard drive performance characteristics have already been introduced in Section 4.2.4 Hard Drives; this section discusses the matter in more depth. This is important for system administrators to understand, because without at least basic knowledge of how hard drives operate, it is possible to unwittingly make changes to your system configuration that could negatively impact its performance.
The time it takes for a hard drive to respond to and complete an I/O request is dependent on two things:
The hard drive’s mechanical and electrical limitations
The I/O load imposed by the system
The following sections explore these aspects of hard drive performance in more depth.
5.4.1. Mechanical/Electrical Limitations
Because hard drives are electro-mechanical devices, they are subject to various limitations on their speed and performance. Every I/O request requires the various components of the drive to work together to satisfy the request. Because each of these components has different performance characteristics, the overall performance of the hard drive is determined by the sum of the performance of the individual components.
However, the electronic components are at least an order of magnitude faster than the mechanical components. Therefore, it is the mechanical components that have the greatest impact on overall hard drive performance.
Tip
The most effective way to improve hard drive performance is to reduce the drive’s mechanical activity as much as possible.
The average access time of a typical hard drive is roughly 8.5 milliseconds. The following sections break this figure down in more detail, showing how each component impacts the hard drive’s overall performance.
5.4.1.1. Command Processing Time
All hard drives produced today have sophisticated embedded computer systems controlling their operation. These computer systems perform the following tasks:
Interacting with the outside world via the hard drive's interface
Controlling the operation of the rest of the hard drive's components, recovering from any error conditions that might arise
Processing the raw data read from and written to the actual storage media
Even though the microprocessors used in hard drives are relatively powerful, the tasks assigned to them take time to perform. On average, this time is in the range of .003 milliseconds.
5.4.1.2. Heads Reading/Writing Data
The hard drive’s read/write heads only work when the disk platters over which they "fly" are spinning. Because it is the movement of the media under the heads that allows the data to be read or written, the time that it takes for media containing the desired sector to pass completely underneath the head is the sole determinant of the head’s contribution to total access time. This averages .0086 milliseconds for a 10,000 RPM drive with 700 sectors per track.
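That figure is easy to verify with a quick calculation (using bc, since shell arithmetic is integer-only): a 10,000 RPM platter completes one revolution in 6 milliseconds, and each of the 700 sectors occupies 1/700th of that time:
echo "scale=4; (60000 / 10000) / 700" | bc    # prints .0085 milliseconds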
5.4.1.3. Rotational Latency
Because a hard drive’s disk platters are continuously spinning, when the I/O request arrives it is highly unlikely that the platter will be at exactly the right point in its rotation necessary to access the desired sector. Therefore, even if the rest of the drive is ready to access that sector, it is necessary for everything to wait while the platter rotates, bringing the desired sector into position under the read/write head.
This is the reason why higher-performance hard drives typically rotate their disk platters at higher speeds. Today, speeds of 15,000 RPM are reserved for the highest-performing drives, while 5,400 RPM is considered adequate only for entry-level drives. This averages approximately 3 milliseconds for a 10,000 RPM drive.
5.4.1.4. Access Arm Movement
If there is one component of a hard drive that can be considered its Achilles' Heel, it is the access arm. The reason for this is that the access arm must move very quickly and accurately over relatively long distances. In addition, the access arm movement is not continuous — it must rapidly accelerate as it approaches the desired cylinder and then just as rapidly decelerate to avoid overshooting. Therefore, the access arm must be strong (to survive the violent forces caused by the need for quick movement) but also light (so that there is less mass to accelerate/decelerate).
Achieving these conflicting goals is difficult, a fact that is shown by how much time the access arm movement takes relative to the other components. Therefore, the movement of the access arm is the primary determinant of a hard drive's overall performance, averaging 5.5 milliseconds.
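Summing the averages quoted in the preceding sections confirms the roughly 8.5 millisecond overall access time cited at the start of Section 5.4.1:
# command processing + sector passing under head + rotational latency + seek
echo "scale=4; .003 + .0086 + 3 + 5.5" | bc    # prints 8.5116 milliseconds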
5.4.2. I/O Loads and Performance
The other thing that controls hard drive performance is the I/O load to which a hard drive is subjected. Some of the specific aspects of the I/O load are:
The amount of reads versus writes
The number of current readers/writers
The locality of reads/writes
These are discussed in more detail in the following sections.
5.4.2.1. Reads Versus Writes
For the average hard drive using magnetic media for data storage, the number of read I/O operations versus the number of write I/O operations is not of much concern, as reading and writing data take the same amount of time [6]. However, other mass storage technologies take different amounts of time to process reads and writes [7].
6. Actually, this is not entirely true. All hard drives include some amount of on-board cache memory that is used to improve read performance. However, any I/O request to read data must eventually be satisfied by physically reading the data from the storage medium. This means that, while cache may alleviate read I/O performance problems, it can never totally eliminate the time required to physically read the data from the storage medium.
7. Some optical disk drives exhibit this behavior, due to the physical constraints of the technologies used to implement optical data storage.
The impact of this is that devices that take longer to process write I/O operations (for example) are able to handle fewer write I/Os than read I/Os. Looked at another way, a write I/O consumes more of the device’s ability to process I/O requests than does a read I/O.
5.4.2.2. Multiple Readers/Writers
A hard drive that processes I/O requests from multiple sources experiences a different load than a hard drive that services I/O requests from only one source. The main reason for this is that multiple I/O requesters have the potential to bring higher I/O loads to bear on a hard drive than a single I/O requester.
This is because the I/O requester must perform some amount of processing before an I/O can take place. After all, the requester must determine the nature of the I/O request before it can be performed. Because the processing necessary to make this determination takes time, there is an upper limit on the I/O load that any one requester can generate — only a faster CPU can raise it. This limitation becomes more pronounced if the requester requires human input before performing an I/O.
However, with multiple requesters, higher I/O loads may be sustained. As long as sufficient CPU power is available to support the processing necessary to generate the I/O requests, adding more I/O requesters increases the resulting I/O load.
However, there is another aspect to this that also has a bearing on the resulting I/O load. This is discussed in the following section.
5.4.2.3. Locality of Reads/Writes
Although not strictly constrained to a multi-requester environment, this aspect of hard drive performance does tend to show itself more in such an environment. The issue is whether the I/O requests being made of a hard drive are for data that is physically close to other data that is also being requested.
The reason why this is important becomes apparent if the electromechanical nature of the hard drive is kept in mind. The slowest component of any hard drive is the access arm. Therefore, if the data being accessed by the incoming I/O requests requires no movement of the access arm, the hard drive is able to service many more I/O requests than if the data being accessed was spread over the entire drive, requiring extensive access arm movement.
This can be illustrated by looking at hard drive performance specifications. These specifications often include adjacent cylinder seek times (where the access arm is moved a small amount — only to the next cylinder), and full-stroke seek times (where the access arm moves from the very first cylinder to the very last one). For example, here are the seek times for a high-performance hard drive:
Adjacent Cylinder Full-Stroke
0.6 8.2
Table 5-4. Adjacent Cylinder and Full-Stroke Seek Times (in Milliseconds)
5.5. Making the Storage Usable
Once a mass storage device is in place, there is little that it can be used for. True, data can be written to it and read back from it, but without any underlying structure, data access is only possible by using sector addresses (either geometry-based or block-based).
What is needed are methods of making the raw storage a hard drive provides more easily usable. The following sections explore some commonly-used techniques for doing just that.
5.5.1. Partitions/Slices
The first thing that often strikes a system administrator is that the size of a hard drive may be much larger than necessary for the task at hand. As a result, many operating systems have the capability of dividing a hard drive’s space into various partitions or slices.
Because they are separate from each other, partitions can have different amounts of space utilized, and that space in no way impacts the space utilized by other partitions. For example, the partition holding the files comprising the operating system is not affected even if the partition holding the users’ files becomes full. The operating system still has free space for its own use.
Although it is somewhat simplistic, you can think of partitions as being similar to individual disk drives. In fact, some operating systems actually refer to partitions as "drives". However, this viewpoint is not entirely accurate; therefore, it is important that we look at partitions more closely.
5.5.1.1. Partition Attributes
Partitions are defined by the following attributes:
Partition geometry
Partition type
Partition type field
These attributes are explored in more detail in the following sections.
5.5.1.1.1. Geometry
A partition’s geometry refers to its physical placement on a disk drive. The geometry can be specified in terms of starting and ending cylinders, heads, and sectors, although most often partitions start and end on cylinder boundaries. A partition’s size is then defined as the amount of storage between the starting and ending cylinders.
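Under Red Hat Enterprise Linux, a drive's geometry and the cylinder boundaries of its partitions can be reviewed with fdisk (the device name below is hypothetical):
fdisk -l /dev/hda    # prints the drive geometry plus each partition's starting
                     # and ending cylinders, size in blocks, and type code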
5.5.1.1.2. Partition Type
The partition type refers to the partition’s relationship with the other partitions on the disk drive. There are three different partition types:
Primary partitions
Extended partitions
Logical partitions
The following sections describe each partition type.
5.5.1.1.2.1. Primary Partitions
Primary partitions are partitions that take up one of the four primary partition slots in the disk drive’s partition table.
5.5.1.1.2.2. Extended Partitions
Extended partitions were developed in response to the need for more than four partitions per disk drive. An extended partition can itself contain multiple partitions, greatly extending the number of partitions possible on a single drive. The introduction of extended partitions was driven by the ever-increasing capacities of new disk drives.
5.5.1.1.2.3. Logical Partitions
Logical partitions are those partitions contained within an extended partition; in terms of use they are no different than a non-extended primary partition.
5.5.1.1.3. Partition Type Field
Each partition has a type field that contains a code indicating the partition’s anticipated usage. The type field may or may not reflect the computer’s operating system. Instead, it may reflect how data is to be stored within the partition. The following section contains more information on this important point.
5.5.2. File Systems
Even with the proper mass storage device, properly configured, and appropriately partitioned, we would still be unable to store and retrieve information easily — we are missing a way of structuring and organizing that information. What we need is a file system.
The concept of a file system is so fundamental to the use of mass storage devices that the average computer user often does not even make the distinction between the two. However, system administrators cannot afford to ignore file systems and their impact on day-to-day work.
A file system is a method of representing data on a mass storage device. File systems usually include the following features:
File-based data storage
Hierarchical directory (sometimes known as "folder") structure
Tracking of file creation, access, and modification times
Some level of control over the type of access allowed for a specific file
Some concept of file ownership
Accounting of space utilized
Not all file systems possess every one of these features. For example, a file system constructed for a single-user operating system could easily use a more simplified method of access control and could conceivably do away with support for file ownership altogether.
One point to keep in mind is that the file system used can have a large impact on the nature of your daily workload. By ensuring that the file system you use in your organization closely matches your organization’s functional requirements, you can ensure that not only is the file system up to the task, but that it is more easily and efficiently maintainable.
With this in mind, the following sections explore these features in more detail.
5.5.2.1. File-Based Storage
While file systems that use the file metaphor for data storage are so nearly universal as to be considered a given, there are still some aspects that should be considered here.
First is to be aware of any restrictions on file names. For instance, what characters are permitted in a file name? What is the maximum file name length? These questions are important, as they dictate those file names that can be used and those that cannot. Older operating systems with more primitive file systems often allowed only alphanumeric characters (and only uppercase at that), and only traditional 8.3 file names (meaning an eight-character file name, followed by a three-character file extension).
5.5.2.2. Hierarchical Directory Structure
While the file systems used in some very old operating systems did not include the concept of directories, all commonly-used file systems today include this feature. Directories are themselves usually implemented as files, meaning that no special utilities are required to maintain them.
Furthermore, because directories are themselves files, and directories contain files, directories can therefore contain other directories, making a multi-level directory hierarchy possible. This is a powerful concept with which all system administrators should be thoroughly familiar. Using multi-level directory hierarchies can make file management much easier for you and for your users.
5.5.2.3. Tracking of File Creation, Access, Modification Times
Most file systems keep track of the time at which a file was created; some also track modification and access times. Over and above the convenience of being able to determine when a given file was created, accessed, or modified, these dates are vital for the proper operation of incremental backups.
More information on how backups make use of these file system features can be found in Section 8.2 Backups.
5.5.2.4. Access Control
Access control is one area where file systems differ dramatically. Some file systems have no clear-cut access control model, while others are much more sophisticated. In general terms, most modern day file systems combine two components into a cohesive access control methodology:
User identification
Permitted action list
User identification means that the file system (and the underlying operating system) must first be capable of uniquely identifying individual users. This makes it possible to have full accountability with respect to any operations on the file system level. Another often-helpful feature is that of user groups — creating ad-hoc collections of users. Groups are most often used by organizations where users may be members of one or more projects. Another feature that some file systems support is the creation of generic identifiers that can be assigned to one or more users.
Next, the file system must be capable of maintaining lists of actions that are permitted (or not permitted) against each file. The most commonly-tracked actions are:
• Reading the file
• Writing the file
• Executing the file
Various file systems may extend the list to include other actions such as deleting, or even the ability to make changes related to a file’s access control.
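On Linux and other UNIX-like systems, for example, these permitted actions take the form of the familiar read, write, and execute permission bits, which can be viewed and changed with standard commands. A brief sketch (the file and group names are hypothetical):

    ls -l report.txt              # show owner, group, and permission bits
    chgrp accounting report.txt   # change the file's group ownership
    chmod g+rw,o-rwx report.txt   # group may read/write; others get no access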
5.5.2.5. Accounting of Space Utilized
One constant in a system administrator’s life is that there is never enough free space, and even if there is, it will not remain free for long. Therefore, a system administrator should at least be able to easily determine the level of free space available for each file system. In addition, file systems with well-defined user identification capabilities often include the ability to display the amount of space a particular user has consumed.
This feature is vital in large multi-user environments, as it is an unfortunate fact of life that the 80/20 rule often applies to disk space — 20 percent of your users will be responsible for consuming 80 percent of your available disk space. By making it easy to determine which users are in that 20 percent, you can more effectively manage your storage-related assets.
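On a Linux system, one quick way to see which users fall into that 20 percent is to summarize the space consumed by each home directory (a conventional /home layout is assumed here):

    # Per-user disk usage in kilobytes, largest first
    du -s /home/* | sort -rn | head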
Taking this a step further, some file systems include the ability to set per-user limits (often known as disk quotas) on the amount of disk space that can be consumed. The specifics vary from file system to file system, but in general each user can be assigned a specific amount of storage. Beyond that, various file systems differ. Some file systems permit the user to exceed their limit one time only, while others implement a "grace period" during which a second, higher limit is applied.
5.5.3. Directory Structure
Many system administrators give little thought to how the storage they make available to users today is actually going to be used tomorrow. However, a bit of thought spent on this matter before handing over the storage to users can save a great deal of unnecessary effort later on.
The main thing that system administrators can do is to use directories and subdirectories to structure the storage available in an understandable way. There are several benefits to this approach:
• More easily understood
• More flexibility in the future
Enforcing some level of structure on your storage makes it more easily understood. For example, consider a large multi-user system. Instead of placing all user directories in one large directory, it might make sense to use subdirectories that mirror your organization’s structure. In this way, people that work in accounting have their directories under a directory named accounting, people that work in engineering would have their directories under engineering, and so on.
The benefit of such an approach is that it is easier, on a day-to-day basis, to keep track of the storage needs (and usage) for each part of your organization. Obtaining a listing of the files used by everyone in human resources is straightforward. Backing up all the files used by the legal department is easy.
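Creating such a structure costs almost nothing; a minimal sketch, assuming hypothetical department and user names:

    mkdir -p /home/accounting /home/engineering /home/legal
    # Each user's directory then goes under their department:
    mkdir /home/engineering/jsmith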
With the appropriate structure, flexibility is increased. To continue using the previous example, assume for a moment that the engineering department is due to take on several large new projects. Because of this, many new engineers are to be hired in the near future. However, there is currently not enough free storage available to support the expected additions to engineering.
However, since every person in engineering has their files stored under the engineering directory, it would be a straightforward process to:
• Procure the additional storage necessary to support engineering
• Back up everything under the engineering directory
• Restore the backup onto the new storage
• Rename the engineering directory on the original storage to something like engineering-archive (before deleting it entirely after running smoothly with the new configuration for a month)
• Make the necessary changes so that all engineering personnel can access their files on the new storage
Of course, such an approach does have its shortcomings. For example, if people frequently move between departments, you must have a way of being informed of such transfers, and you must modify the directory structure appropriately. Otherwise, the structure no longer reflects reality, which makes more work — not less — for you in the long run.
5.5.4. Enabling Storage Access
Once a mass storage device has been properly partitioned, and a file system written to it, the storage is available for general use.
For some operating systems, this is true; as soon as the operating system detects the new mass storage device, it can be formatted by the system administrator and may be accessed immediately with no additional effort.
Other operating systems require an additional step. This step — often referred to as mounting — directs the operating system as to how the storage may be accessed. Mounting storage normally is done via a special utility program or command, and requires that the mass storage device (and possibly the partition as well) be explicitly identified.
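On Linux, for instance, this is done with the mount command, which is given both the device (and partition) to access and the directory — the mount point — through which the storage is to be accessed; the names below are purely illustrative:

    mount /dev/sdb1 /mnt/newdisk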
5.6. Advanced Storage Technologies
Although everything presented in this chapter so far has dealt only with single hard drives directly attached to a system, there are other, more advanced options that you can explore. The following sections describe some of the more common approaches to expanding your mass storage options.
5.6.1. Network-Accessible Storage
Combining network and mass storage technologies can result in a great deal more flexibility for system administrators. There are two benefits that are possible with this type of configuration:
• Consolidation of storage
• Simplified administration
Storage can be consolidated by deploying high-performance servers with high-speed network connectivity, configured with large amounts of fast storage. Given an appropriate configuration, it is possible to provide storage access at speeds comparable to locally-attached storage. Furthermore, the shared nature of such a configuration often makes it possible to reduce costs, as the expenses associated with providing centralized, shared storage can be less than providing the equivalent storage for each and every client. In addition, free space is consolidated, instead of being spread out (and not widely usable) across many clients.
Centralized storage servers can also make many administrative tasks easier. For instance, monitoring free space is much easier when the storage to be monitored exists on a centralized storage server. Backups can be vastly simplified using a centralized storage server. Network-aware backups for multiple clients are possible, but require more work to configure and maintain.
There are a number of different networked storage technologies available; choosing one can be difficult. Nearly every operating system on the market today includes some means of accessing network-accessible storage, but the different technologies are incompatible with each other. What is the best approach to determining which technology to deploy?
The approach that usually provides the best results is to let the built-in capabilities of the client decide the issue. There are a number of reasons for this:
• Minimal client integration issues
• Minimal work on each client system
• Low per-client cost of entry
Keep in mind that any client-related issues are multiplied by the number of clients in your organization. By using the clients’ built-in capabilities, you have no additional software to install on each client (incurring zero additional cost in software procurement). And you have the best chance for good support and integration with the client operating system.
There is a downside, however: the server environment must be up to the task of providing good support for the network-accessible storage technologies required by the clients. In cases where the server and client operating systems are one and the same, there is normally no issue. Otherwise, it is necessary to invest time and effort in making the server "speak" the clients’ language. However, often this trade-off is more than justified.
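As an example of using a client's built-in capabilities, a Linux client can access storage shared by an NFS server with nothing more than a mount command (the server name and export path here are hypothetical):

    mount -t nfs fileserver:/export/home /mnt/home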
5.6.2. RAID-Based Storage
One skill that a system administrator should cultivate is the ability to look at complex system configurations, and observe the different shortcomings inherent in each configuration. While this might, at first glance, seem to be a rather depressing viewpoint to take, it can be a great way to look beyond the shiny new boxes and visualize some future Saturday night with all production down due to a failure that could easily have been avoided with a bit of forethought.
With this in mind, let us use what we now know about disk-based storage and see if we can determine the ways that disk drives can cause problems. First, consider an outright hardware failure:
• A disk drive with four partitions on it dies completely: what happens to the data on those partitions?
It is immediately unavailable (at least until the failing unit can be replaced, and the data restored from a recent backup).
• A disk drive with a single partition on it is operating at the limits of its design due to massive I/O loads: what happens to applications that require access to the data on that partition?
The applications slow down because the disk drive cannot process reads and writes any faster.
• You have a large data file that is slowly growing in size; soon it will be larger than the largest disk drive available for your system. What happens then?
The disk drive fills up, the data file stops growing, and its associated applications stop running.
Just one of these problems could cripple a data center, yet system administrators must face these kinds of issues every day. What can be done?
Fortunately, there is one technology that can address each one of these issues. The name for that technology is RAID.
5.6.2.1. Basic Concepts
RAID is an acronym standing for Redundant Array of Independent Disks [8]. As the name implies, RAID is a way for multiple disk drives to act as if they were a single disk drive.
[8] When early RAID research began, the acronym stood for Redundant Array of Inexpensive Disks, but over time the "standalone" disks that RAID was intended to supplant became cheaper and cheaper, rendering the price comparison meaningless.
RAID techniques were first developed by researchers at the University of California, Berkeley in the mid-1980s. At the time, there was a large gap in price between the high-performance disk drives used on the large computer installations of the day, and the smaller, slower disk drives used by the still-young personal computer industry. RAID was viewed as a method of having several less expensive disk drives fill in for one higher-priced unit.
More importantly, RAID arrays can be constructed in different ways, resulting in different characteristics depending on the final configuration. Let us look at the different configurations (known as RAID levels) in more detail.
5.6.2.1.1. RAID Levels
The Berkeley researchers originally defined five different RAID levels and numbered them "1" through "5." In time, additional RAID levels were defined by other researchers and members of the storage industry. Not all RAID levels were equally useful; some were of interest only for research purposes, and others could not be economically implemented.
In the end, there were three RAID levels that ended up seeing widespread usage:
• Level 0
• Level 1
• Level 5
The following sections discuss each of these levels in more detail.
5.6.2.1.1.1. RAID 0
The name of the disk configuration known as RAID level 0 is a bit misleading, as this is the only RAID level that employs absolutely no redundancy. However, even though RAID 0 has no advantages from a reliability standpoint, it does have other benefits.
A RAID 0 array consists of two or more disk drives. The available storage capacity on each drive is divided into chunks, which represent some multiple of the drives’ native block size. Data written to the array is written, chunk by chunk, to each drive in the array. The chunks can be thought of as forming stripes across each drive in the array; hence the other term for RAID 0: striping.
For example, with a two-drive array and a 4KB chunk size, writing 12KB of data to the array would result in the data being written in three 4KB chunks to the following drives:
• The first 4KB would be written to the first drive, into the first chunk
• The second 4KB would be written to the second drive, into the first chunk
• The last 4KB would be written to the first drive, into the second chunk
Compared to a single disk drive, the advantages to RAID 0 include:
• Larger total size — RAID 0 arrays can be constructed that are larger than a single disk drive, making it easier to store larger data files
• Better read/write performance — The I/O load on a RAID 0 array is spread evenly among all the drives in the array (assuming all the I/O is not concentrated on a single chunk)
• No wasted space — All available storage on all drives in the array is available for data storage
Compared to a single disk drive, RAID 0 has the following disadvantage:
• Less reliability — Every drive in a RAID 0 array must be operative for the array to be available; a single drive failure in an N-drive RAID 0 array results in the removal of 1/Nth of all the data, rendering the array useless
Tip
If you have trouble keeping the different RAID levels straight, just remember that RAID 0 has zero percent redundancy.
5.6.2.1.1.2. RAID 1
RAID 1 uses two (although some implementations support more) identical disk drives. All data is written to both drives, making them mirror images of each other. That is why RAID 1 is often known as mirroring.
Whenever data is written to a RAID 1 array, two physical writes must take place: one to the first drive, and one to the second drive. Reading data, on the other hand, needs to take place only once, and either drive in the array can be used.
Compared to a single disk drive, a RAID 1 array has the following advantages:
• Improved redundancy — Even if one drive in the array were to fail, the data would still be accessible
• Improved read performance — With both drives operational, reads can be evenly split between them, reducing per-drive I/O loads
When compared to a single disk drive, a RAID 1 array has some disadvantages:
• Maximum array size is limited to the largest single drive available
• Reduced write performance — Because both drives must be kept up-to-date, all write I/Os must be performed by both drives, slowing the overall process of writing data to the array
• Reduced cost efficiency — With one entire drive dedicated to redundancy, the cost of a RAID 1 array is at least double that of a single drive
Tip
If you have trouble keeping the different RAID levels straight, just remember that RAID 1 has one hundred percent redundancy.
5.6.2.1.1.3. RAID 5
RAID 5 attempts to combine the benefits of RAID 0 and RAID 1, while minimizing their respective disadvantages.
Like RAID 0, a RAID 5 array consists of multiple disk drives, each divided into chunks. This allows a RAID 5 array to be larger than any single drive. Like a RAID 1 array, a RAID 5 array uses some disk space in a redundant fashion, improving reliability.
However, the way RAID 5 works is unlike either RAID 0 or 1.
A RAID 5 array must consist of at least three identically-sized disk drives (although more drives may be used). Each drive is divided into chunks and data is written to the chunks in order. However, not every chunk is dedicated to data storage as it is in RAID 0. Instead, in an array with n disk drives in it, every nth chunk is dedicated to parity.
Chunks containing parity make it possible to recover data should one of the drives in the array fail. The parity in chunk x is calculated by mathematically combining the data from each chunk x stored on all the other drives in the array. If the data in a chunk is updated, the corresponding parity chunk must be recalculated and updated as well.
This also means that every time data is written to the array, at least two drives are written to: the drive holding the data, and the drive containing the parity chunk.
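In practice, the parity calculation is a bitwise XOR of the corresponding data chunks, which is what makes reconstruction possible. A minimal sketch using shell arithmetic on two single-byte "chunks":

    d1=$(( 0x5A )); d2=$(( 0x3C ))   # data chunks on two drives
    parity=$(( d1 ^ d2 ))            # parity chunk stored on a third drive
    # If the drive holding d1 fails, its contents can be rebuilt
    # from the surviving data and the parity:
    rebuilt=$(( parity ^ d2 ))
    echo $d1 $rebuilt                # prints: 90 90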
One key point to keep in mind is that the parity chunks are not concentrated on any one drive in the array. Instead, they are spread evenly across all the drives. Even though dedicating a specific drive to contain nothing but parity is possible (in fact, this configuration is known as RAID level 4), the constant updating of parity as data is written to the array would mean that the parity drive could become a performance bottleneck. By spreading the parity information evenly throughout the array, this impact is reduced.
However, it is important to keep in mind the impact of parity on the overall storage capacity of the array. Even though the parity information is spread evenly across all the drives in the array, the amount of available storage is reduced by the size of one drive.
Compared to a single drive, a RAID 5 array has the following advantages:
• Improved redundancy — If one drive in the array fails, the parity information can be used to reconstruct the missing data chunks, all while keeping the array available for use [9]
• Improved read performance — Due to the RAID 0-like way data is divided between drives in the array, read I/O activity is spread evenly between all the drives
• Reasonably good cost efficiency — For a RAID 5 array of n drives, only 1/nth of the total available storage is dedicated to redundancy
Compared to a single drive, a RAID 5 array has the following disadvantage:
• Reduced write performance — Because each write to the array results in at least two writes to the physical drives (one write for the data and one for the parity), write performance is worse than that of a single drive [10]
[9] I/O performance is reduced while operating with one drive unavailable, due to the overhead involved in reconstructing the missing data.
[10] There is also an impact from the parity calculations required for each write. However, depending on the specific RAID 5 implementation (specifically, where in the system the parity calculations are performed), this impact can range from sizable to nearly nonexistent.
5.6.2.1.1.4. Nested RAID Levels
As should be obvious from the discussion of the various RAID levels, each level has specific strengths and weaknesses. It was not long after RAID-based storage began to be deployed that people began to wonder whether different RAID levels could somehow be combined, producing arrays with all of the strengths and none of the weaknesses of the original levels.
For example, what if the disk drives in a RAID 0 array were themselves actually RAID 1 arrays? This would give the advantages of RAID 0’s speed, with the reliability of RAID 1.
This is just the kind of thing that can be done. Here are the most commonly-nested RAID levels:
• RAID 1+0
• RAID 5+0
• RAID 5+1
Because nested RAID is used in more specialized environments, we will not go into greater detail here. However, there are two points to keep in mind when thinking about nested RAID:
• Order matters — The order in which RAID levels are nested can have a large impact on reliability. In other words, RAID 1+0 and RAID 0+1 are not the same.
• Costs can be high — If there is any disadvantage common to all nested RAID implementations, it is one of cost; for example, the smallest possible RAID 5+1 array consists of six disk drives (and even more drives are required for larger arrays).
Now that we have explored the concepts behind RAID, let us see how RAID can be implemented.
5.6.2.1.2. RAID Implementations
It is obvious from the previous sections that RAID requires additional "intelligence" over and above the usual disk I/O processing for individual drives. At the very least, the following tasks must be performed:
• Dividing incoming I/O requests among the individual disks in the array
• For RAID 5, calculating parity and writing it to the appropriate drive in the array
• Monitoring the individual disks in the array and taking the appropriate action should one fail
• Controlling the rebuilding of an individual disk in the array, when that disk has been replaced or repaired
• Providing a means to allow administrators to maintain the array (removing and adding drives, initiating and halting rebuilds, etc.)
There are two major methods that may be used to accomplish these tasks. The next two sections describe them in more detail.
5.6.2.1.2.1. Hardware RAID
A hardware RAID implementation usually takes the form of a specialized disk controller card. The card performs all RAID-related functions and directly controls the individual drives in the arrays attached to it. With the proper driver, the arrays managed by a hardware RAID card appear to the host operating system just as if they were regular disk drives.
Most RAID controller cards work with SCSI drives, although there are some ATA-based RAID controllers as well. In any case, the administrative interface is usually implemented in one of three ways:
• Specialized utility programs that run as applications under the host operating system, presenting a software interface to the controller card
• An on-board interface using a serial port that is accessed using a terminal emulator
• A BIOS-like interface that is only accessible during the system’s power-up testing
Some RAID controllers have more than one type of administrative interface available. For obvious reasons, a software interface provides the most flexibility, as it allows administrative functions while the operating system is running. However, if you are booting an operating system from a RAID controller, an interface that does not require a running operating system is a requirement.
Because there are so many different RAID controller cards on the market, it is impossible to go into further detail here. The best course of action is to read the manufacturer’s documentation for more information.
5.6.2.1.2.2. Software RAID
Software RAID is RAID implemented as kernel- or driver-level software for a particular operating system. As such, it provides more flexibility in terms of hardware support — as long as the hardware is supported by the operating system, RAID arrays can be configured and deployed. This can dramatically reduce the cost of deploying RAID by eliminating the need for expensive, specialized RAID hardware.
Often the CPU power available on the host for software RAID parity calculations greatly exceeds the processing power present on a RAID controller card. Therefore, some software RAID implementations actually have the capability for higher performance than hardware RAID implementations.
However, software RAID does have limitations not present in hardware RAID. The most important one to consider is support for booting from a software RAID array. In most cases, only RAID 1 arrays can be used for booting, as the computer’s BIOS is not RAID-aware. Since a single drive from a RAID 1 array is indistinguishable from a non-RAID boot device, the BIOS can successfully start the boot process; the operating system can then change over to software RAID operation once it has gained control of the system.
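Under Red Hat Enterprise Linux, for example, software RAID arrays are created and managed with the mdadm utility (the device names below are illustrative):

    # Build a two-drive RAID 1 (mirrored) array from two partitions
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # Review the status of all software RAID arrays
    cat /proc/mdstat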
5.6.3. Logical Volume Management
One other advanced storage technology is that of logical volume management (LVM). LVM makes it possible to treat physical mass storage devices as low-level building blocks on which different storage configurations are built. The exact capabilities vary according to the specific implementation, but can include physical storage grouping, logical volume resizing, and data migration.
5.6.3.1. Physical Storage Grouping
Although the name given to this capability may differ, physical storage grouping is the foundation for all LVM implementations. As the name implies, the physical mass storage devices can be grouped together in such a way as to create one or more logical mass storage devices. The logical mass storage devices (or logical volumes) can be larger in capacity than the capacity of any one of the underlying physical mass storage devices.
For example, given two 100GB drives, a 200GB logical volume can be created. However, a 150GB and a 50GB logical volume could also be created. Any combination of logical volumes equal to or less than the total capacity (200GB in this example) is possible. The choices are limited only by your organization’s needs.
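Under the LVM implementation shipped with Red Hat Enterprise Linux, for instance, the grouping described above is done by initializing the drives as physical volumes, collecting them into a volume group, and carving logical volumes out of the group (device and volume names are illustrative):

    pvcreate /dev/sdb1 /dev/sdc1            # prepare the two 100GB drives
    vgcreate storage /dev/sdb1 /dev/sdc1    # pool them into one 200GB group
    lvcreate -L 150G -n projects storage    # carve out a 150GB logical volume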
This makes it possible for a system administrator to treat all storage as being part of a single pool, available for use in any amount. In addition, drives can be added to the pool at a later time, making it a straightforward process to stay ahead of your users’ demand for storage.
5.6.3.2. Logical Volume Resizing
The feature that most system administrators appreciate about LVM is its ability to easily direct storage where it is needed. In a non-LVM system configuration, running out of space means — at best — moving files from the full device to one with available space. Often it can mean actual reconfiguration of your system’s mass storage devices, a task that would have to take place after normal business hours.
However, LVM makes it possible to easily increase the size of a logical volume. Assume for a moment that our 200GB storage pool was used to create a 150GB logical volume, with the remaining 50GB held in reserve. If the 150GB logical volume became full, LVM makes it possible to increase its size (say, by 10GB) without any physical reconfiguration. Depending on the operating system environment, it may be possible to do this dynamically or it might require a short amount of downtime to actually perform the resizing.
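Under Red Hat Enterprise Linux this is done with lvextend, followed by a file-system-specific step to grow the file system into the new space (the volume name is carried over from the earlier sketch; whether the resize can be done while mounted depends on the file system):

    lvextend -L +10G /dev/storage/projects   # grow the logical volume by 10GB
    resize2fs /dev/storage/projects          # grow an ext2/ext3 file system to match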
5.6.3.3. Data Migration
Most seasoned system administrators would be impressed by LVM capabilities so far, but they would also be asking themselves this question:
What happens if one of the drives making up a logical volume starts to fail?
The good news is that most LVM implementations include the ability to migrate data off of a particular physical drive. For this to work, there must be sufficient reserve capacity left to absorb the loss of the failing drive. Once the migration is complete, the failing drive can then be replaced and added back into the available storage pool.
5.6.3.4. With LVM, Why Use RAID?
Given that LVM has some features similar to RAID (the ability to dynamically replace failing drives, for instance), and some features providing capabilities that cannot be matched by most RAID implementations (such as the ability to dynamically add more storage to a central storage pool), many people wonder whether RAID is no longer important.
Nothing could be further from the truth. RAID and LVM are complementary technologies that can be used together (in a manner similar to nested RAID levels), making it possible to get the best of both worlds.
5.7. Storage Management Day-to-Day
System administrators must pay attention to storage in the course of their day-to-day routine. There are various issues that should be kept in mind:
• Monitoring free space
• Disk quota issues
• File-related issues
• Directory-related issues
• Backup-related issues
• Performance-related issues
• Adding/removing storage
The following sections discuss each of these issues in more detail.
5.7.1. Monitoring Free Space
Making sure there is sufficient free space available should be at the top of every system administrator’s daily task list. The reason why regular, frequent free space checking is so important is because free space is so dynamic; there can be more than enough space one moment, and almost none the next.
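On Linux systems, the df command provides this overview, and a short pipeline can flag file systems nearing capacity (the 90 percent threshold is arbitrary):

    df -hP                                              # free space, human-readable
    df -P | awk '0+$5 >= 90 { print $6, $5, "full" }'   # flag nearly-full file systems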
In general, there are three reasons for insufficient free space:
• Excessive usage by a user
• Excessive usage by an application
• Normal growth in usage
These reasons are explored in more detail in the following sections.
5.7.1.1. Excessive Usage by a User
Different people have different levels of neatness. Some people would be horrified to see a speck of dust on a table, while others would not think twice about having a collection of last year’s pizza boxes stacked by the sofa. It is the same with storage:
• Some people are very frugal in their storage usage and never leave any unneeded files hanging around.
• Some people never seem to find the time to get rid of files that are no longer needed.
Many times, when a user is consuming large amounts of storage, it is the second type of person that is found to be responsible.
5.7.1.1.1. Handling a User’s Excessive Usage
This is one area in which a system administrator needs to summon all the diplomacy and social skills they can muster. Quite often discussions over disk space become emotional, as people may feel that enforcement of disk usage restrictions makes their job more difficult (or impossible), that the restrictions are unreasonably small, or that they just do not have the time to clean up their files.
The best system administrators take many factors into account in such a situation. Are the restrictions equitable and reasonable for the type of work being done by this person? Does the person seem to be using their disk space appropriately? Can you help the person reduce their disk usage in some way (by creating a backup CD-ROM of all emails over one year old, for example)? Your job during the conversation is to attempt to discover whether the person has a legitimate need for the storage, while making sure that someone that has no real need for that much storage cleans up their act.
In any case, the thing to do is to keep the conversation on a professional, factual level. Try to address the user’s issues in a polite manner ("I understand you are very busy, but everyone else in your department has the same responsibility to not waste storage, and their average utilization is less than half of yours.") while moving the conversation toward the matter at hand. Be sure to offer assistance if a lack of knowledge/experience seems to be the problem.
Approaching the situation in a sensitive but firm manner is often better than using your authority as system administrator to force a certain outcome. For example, you might find that sometimes a compromise between you and the user is necessary. This compromise can take one of three forms:
• Provide temporary space
• Make archival backups
• Give up
You might find that the user can reduce their usage if they have some amount of temporary space that they can use without restriction. People that take advantage of this often find that it allows them to work without worrying about space until they get to a logical stopping point, at which time they can perform some housekeeping and determine which files in temporary storage are really needed.
Warning
If you offer this arrangement to a user, do not fall into the trap of allowing this temporary space to become permanent space. Make it very clear that the space being offered is temporary, and that no guarantees can be made as to data retention; no backups of any data in temporary space are ever made.
In fact, many administrators often underscore this fact by automatically deleting any files in temporary storage that are older than a certain age (a week, for example).
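A nightly cron job running a command such as the following accomplishes this (the /scratch path and one-week cutoff are examples):

    # Remove files under /scratch not modified in over seven days
    find /scratch -type f -mtime +7 -exec rm -f {} \;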
Other times, the user may have many files that are so obviously old that it is unlikely they will need continued access to them. Make sure you determine that this is, in fact, the case. Sometimes individual users are responsible for maintaining an archive of old data; in these instances, you should make a point of assisting them in that task by providing multiple backups that are treated no differently from your data center’s archival backups.
However, there are times when the data is of dubious value. In these instances you might find it best to offer to make a special backup for them. You then back up the old data, and give the user the backup media, explaining that they are responsible for its safekeeping, and if they ever need access to any of the data, to ask you (or your organization’s operations staff — whatever is appropriate for your organization) to restore it.
There are a few things to keep in mind so that this does not backfire on you. First and foremost is to not include files that are likely to need restoring; do not select files that are too new. Next, make sure that you are able to perform a restoration if one ever is requested. This means that the backup media should be of a type that you are reasonably sure will be used in your data center for the foreseeable future.
Tip
Your choice of backup media should also take into consideration those technologies that can enable the user to handle data restoration themselves. For example, even though backing up several gigabytes onto CD-R media is more work than issuing a single command and spinning it off to a 20GB tape cartridge, consider that the user will then be able to access the data on CD-R whenever they want — without ever involving you.
5.7.1.2. Excessive Usage by an Application
Sometimes an application is responsible for excessive usage. The reasons for this can vary, but can include:
• Enhancements in the application’s functionality require more storage
• An increase in the number of users using the application
• The application fails to clean up after itself, leaving no-longer-needed temporary files on disk
• The application is broken, and the bug is causing it to use more storage than it should
Your task is to determine which of the reasons from this list apply to your situation. Being aware of the status of the applications used in your data center should help you eliminate several of these reasons, as should your awareness of your users’ processing habits; this should narrow down the field substantially. What remains is often a bit of detective work into where the storage has gone.
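The du command is well suited to this kind of detective work; for example, to find the largest directories under an application's data area (the path is hypothetical):

    du -k /var/app-data | sort -rn | head -10   # ten biggest directories, in KB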
At this point you must then take the appropriate steps, be it the addition of storage to support an increasingly-popular application, contacting the application’s developers to discuss its file handling characteristics, or writing scripts to clean up after the application.
5.7.1.3. Normal Growth in Usage
Most organizations experience some level of growth over the long term. Because of this, it is normal to expect storage utilization to increase at a similar pace. In nearly all circumstances, ongoing monitoring can reveal the average rate of storage utilization at your organization; this rate can then be used to determine the time at which additional storage should be procured before your free space actually runs out.
If you are in the position of unexpectedly running out of free space due to normal growth, you have not been doing your job.
However, sometimes large additional demands on your systems’ storage can come up unexpectedly. Your organization may have merged with another, necessitating rapid changes in the IT infrastructure (and therefore, storage). A new high-priority project may have literally sprung up overnight. Changes to an existing application may have resulted in greatly increased storage needs.
No matter what the reason, there are times when you will be taken by surprise. To plan for these instances, try to configure your storage architecture for maximum flexibility. Keeping spare storage on-hand (if possible) can alleviate the impact of such unplanned events.
5.7.2. Disk Quota Issues
Many times the first thing most people think of when they think about disk quotas is using them to force users to keep their directories clean. While there are sites where this may be the case, it also helps to look at the problem of disk space usage from another perspective. What about applications that, for one reason or another, consume too much disk space? It is not unheard of for applications to fail in ways that cause them to consume all available disk space. In these cases, disk quotas can help limit the damage caused by such errant applications, forcing them to stop before no free space is left on the disk.
The hardest part of implementing and managing disk quotas revolves around the limits themselves. What should they be? A simplistic approach would be to divide the disk space by the number of users and/or groups using it, and use the resulting number as the per-user quota. For example, if the system has a 100GB disk drive and 20 users, each user should be given a disk quota of no more than 5GB. That way, each user would be guaranteed 5GB (although the disk would be 100% full at that point).
For those operating systems that support it, temporary quotas could be set somewhat higher — say 7.5GB, with a permanent quota remaining at 5GB. This would have the benefit of allowing users to permanently consume no more than their percentage of the disk, but still permitting some flexibility when a user reaches (and exceeds) their limit. When using disk quotas in this manner, you are actually over-committing the available disk space. The temporary quota is 7.5GB. If all 20 users exceeded their permanent quota at the same time and attempted to approach their temporary quota, that 100GB disk would actually have to be 150GB to allow everyone to reach their temporary quota at the same time.
However, in practice not everyone exceeds their permanent quota at the same time, making some amount of overcommitment a reasonable approach. Of course, the selection of permanent and temporary quotas is up to the system administrator, as each site and user community is different.
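On Red Hat Enterprise Linux, such limits can be set with the setquota command on a quota-enabled file system. Block counts are in 1KB units, so 5GB is roughly 5242880 blocks and 7.5GB roughly 7864320 (the user name is hypothetical; here the soft limit serves as the permanent quota and the hard limit as the temporary one):

    # soft (permanent) 5GB, hard (temporary) 7.5GB, no limits on file counts
    setquota -u jsmith 5242880 7864320 0 0 /home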
5.7.3. File-Related Issues
System administrators often have to deal with file-related issues. The issues include:
• File Access
• File Sharing
5.7.3.1. File Access
Issues relating to file access typically revolve around one scenario — a user is not able to access a file they feel they should be able to access.
Often this is a case of user #1 wanting to give a copy of a file to user #2. In most organizations, the ability for one user to access another user’s files is strictly curtailed, leading to this problem.
There are three approaches that could conceivably be taken:
• User #1 makes the necessary changes to allow user #2 to access the file wherever it currently exists.
• A file exchange area is created for such purposes; user #1 places a copy of the file there, which can then be copied by user #2.
• User #1 uses email to give user #2 a copy of the file.
There is a problem with the first approach — depending on how access is granted, user #2 may have full access to all of user #1’s files. Worse, it might have been done in such a way as to permit all users in your organization access to user #1’s files. Still worse, this change may not be reversed after user #2 no longer requires access, leaving user #1’s files permanently accessible by others. Unfortunately, when users are in charge of this type of situation, security is rarely their highest priority.
The second approach eliminates the problem of making all of user #1’s files accessible to others. However, once the file is in the file exchange area, the file is readable (and depending on the permissions, even writable) by all other users. This approach also raises the possibility of the file exchange area becoming filled with files, as users often forget to clean up after themselves.
The third approach, while seemingly an awkward solution, may actually be the preferable one in most cases. With the advent of industry-standard email attachment protocols and more intelligent email programs, sending all kinds of files via email is a mostly foolproof operation, requiring no system administrator involvement. Of course, there is the chance that a user will attempt to email a 1GB database file to all 150 people in the finance department, so some amount of user education (and possibly limitations on email attachment size) would be prudent. Still, none of these approaches deals with the situation of two or more users needing ongoing access to a single file. In these cases, other methods are required.
5.7.3.2. File Sharing
When multiple users need to share a single copy of a file, allowing access by making changes to file permissions is not the best approach. It is far preferable to formalize the file’s shared status. There are several reasons for this:
• Files shared out of a user’s directory are vulnerable to disappearing unexpectedly when the user either leaves the organization or does nothing more unusual than rearranging their files.
• Maintaining shared access for more than one or two additional users becomes difficult, leading to the longer-term problem of unnecessary work required whenever the sharing users change responsibilities.
Therefore, the preferred approach is to:
• Have the original user relinquish direct ownership of the file
• Create a group that will own the file
• Place the file in a shared directory that is owned by the group
• Make all users needing access to the file part of the group
Of course, this approach works equally well with multiple files as it does with single files, and can be used to implement shared storage for large, complex projects.
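On Red Hat Enterprise Linux, these steps translate into a handful of commands (the group, directory, and user names are illustrative):

    groupadd finance-project                   # create the owning group
    mkdir /shared/finance-project              # create the shared directory
    chgrp finance-project /shared/finance-project
    chmod 2770 /shared/finance-project         # setgid: new files inherit the group
    gpasswd -a jsmith finance-project          # add each user needing access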
5.7.4. Adding/Removing Storage
Because the need for additional disk space is never-ending, a system administrator often needs to add disk space, while sometimes also removing older, smaller drives. This section provides an overview of the basic process of adding and removing storage.
Note
On many operating systems, mass storage devices are named according to their physical connection to the system. Therefore, adding or removing mass storage devices can result in unexpected changes to device names. When adding or removing storage, always make sure you review (and update, if necessary) all device name references used by your operating system.
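On Red Hat Enterprise Linux, one way to insulate yourself from such name changes is to label each ext2/ext3 file system and refer to the label in /etc/fstab (the label and mount point are illustrative):

    e2label /dev/sdb1 /data
    # /etc/fstab can then name the label instead of the device:
    # LABEL=/data   /data   ext3   defaults   1 2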
5.7.4.1. Adding Storage
The process of adding storage to a computer system is relatively straightforward. Here are the basic steps:
1. Installing the hardware
2. Partitioning
3. Formatting the partition(s)
4. Updating system configuration
5. Modifying backup schedule
The following sections look at each step in more detail.
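As a compact preview, on a Linux system steps 3 and 4 might look like the following for a hypothetical second SCSI drive (partitioning with fdisk is interactive and is not shown):

    mke2fs -j /dev/sdb1     # create an ext3 file system on the new partition
    mkdir /data             # create a mount point
    mount /dev/sdb1 /data   # make the storage accessible
    # ...and add a matching line to /etc/fstab so it is mounted at boot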
5.7.4.1.1. Installing the Hardware
Before anything else can be done, the new disk drive has to be in place and accessible. While there are many different hardware configurations possible, the following sections go through the two most common situations — adding an ATA or SCSI disk drive. Even with other configurations, the basic steps outlined here still apply.
Tip
No matter what storage hardware you use, you should always consider the load a new disk drive adds to your computer’s I/O subsystem. In general, you should try to spread the disk I/O load over all available channels/buses. From a performance standpoint, this is far better than putting all disk drives on one channel and leaving another one empty and idle.
5.7.4.1.1.1. Adding ATA Disk Drives
ATA disk drives are mostly used in desktop and lower-end server systems. Nearly all systems in these classes have built-in ATA controllers with multiple ATA channels — normally two or four.
Each channel can support two devices — one master, and one slave. The two devices are connected to the channel with a single cable. Therefore, the first step is to see which channels have available space for an additional disk drive. One of three situations is possible:
• There is a channel with only one disk drive connected to it
• There is a channel with no disk drive connected to it
• There is no space available
The first situation is usually the easiest, as it is very likely that the cable already in place has an unused connector into which the new disk drive can be plugged. However, if the cable in place only has two connectors (one for the channel and one for the already-installed disk drive), then it is necessary to replace the existing cable with a three-connector model.
Before installing the new disk drive, make sure that the two disk drives sharing the channel are appropriately configured (one as master and one as slave).
The second situation is a bit more difficult, if only for the reason that a cable must be procured so that it can connect a disk drive to the channel. The new disk drive may be configured as master or slave (although traditionally the first disk drive on a channel is normally configured as master).
In the third situation, there is no space left for an additional disk drive. You must then make a decision. Do you:
• Acquire an ATA controller card, and install it
• Replace one of the installed disk drives with the newer, larger one
Adding a controller card entails checking hardware compatibility, physical capacity, and software compatibility. Basically, the card must be compatible with your computer’s bus slots, there must be an open slot for it, and it must be supported by your operating system. Replacing an installed disk drive presents a unique problem: what to do with the data on the disk? There are a few possible approaches:
• Write the data to a backup device and restore it after installing the new disk drive
• Use your network to copy the data to another system with sufficient free space, restoring the data after installing the new disk drive
• Use the space physically occupied by a third disk drive by:
1. Temporarily removing the third disk drive
2. Temporarily installing the new disk drive in its place
3. Copying the data to the new disk drive
4. Removing the old disk drive
5. Replacing it with the new disk drive
6. Reinstalling the temporarily-removed third disk drive
• Temporarily install the original disk drive and the new disk drive in another computer, copy the data to the new disk drive, and then install the new disk drive in the original computer
As you can see, sometimes a bit of effort must be expended to get the data (and the new hardware) where it needs to go.
5.7.4.1.1.2. Adding SCSI Disk Drives
SCSI disk drives normally are used in higher-end workstations and server systems. Unlike ATA-based systems, SCSI systems may or may not have built-in SCSI controllers; some do, while others use a separate SCSI controller card.
The capabilities of SCSI controllers (whether built-in or not) also vary widely. A controller may supply a narrow or wide SCSI bus. The bus speed may be normal, fast, ultra, ultra2, or ultra160.
If these terms are unfamiliar to you (they were discussed briefly in Section 5.3.2.2 SCSI), you must determine the capabilities of your hardware configuration and select an appropriate new disk drive. The best resource for this information would be the documentation for your system and/or SCSI adapter.
You must then determine how many SCSI buses are available on your system, and which ones have available space for a new disk drive. The number of devices supported by a SCSI bus varies according to the bus width:
• Narrow (8-bit) SCSI bus — 7 devices (plus controller)
• Wide (16-bit) SCSI bus — 15 devices (plus controller)
The first step is to see which buses have available space for an additional disk drive. One of three situations is possible:
• There is a bus with less than the maximum number of disk drives connected to it
• There is a bus with no disk drives connected to it
• There is no space available on any bus
The first situation is usually the easiest, as it is likely that the cable in place has an unused connector into which the new disk drive can be plugged. However, if the cable in place does not have an unused connector, it is necessary to replace the existing cable with one that has at least one more connector.
The second situation is a bit more difficult, if only for the reason that a cable must be procured so that it can connect a disk drive to the bus.
If there is no space left for an additional disk drive, you must make a decision. Do you:
• Acquire and install a SCSI controller card
• Replace one of the installed disk drives with the new, larger one
Adding a controller card entails checking hardware compatibility, physical capacity, and software compatibility. Basically, the card must be compatible with your computer’s bus slots, there must be an open slot for it, and it must be supported by your operating system.
Replacing an installed disk drive presents a unique problem: what to do with the data on the disk? There are a few possible approaches:
• Write the data to a backup device, and restore it after installing the new disk drive