File System Tuning Guide
StorNext 3.1 File System Tuning Guide, 6-01376-07, Ver. A, Rel. 3.1, October 2007, Made in USA.
Quantum Corporation provides this publication “as is” without warranty of any kind, either express or implied,
including but not limited to the implied warranties of merchantability or fitness for a particular purpose. Quantum
Corporation may revise this publication from time to time without notice.
COPYRIGHT STATEMENT
Copyright 2007 by Quantum Corporation. All rights reserved.
StorNext copyright (c) 1991-2007 Advanced Digital Information Corporation (ADIC), Redmond, WA, USA. All rights
reserved.
Your right to copy this manual is limited by copyright law. Making copies or adaptations without prior written
authorization of Quantum Corporation is prohibited by law and constitutes a punishable violation of the law.
TRADEMARK STATEMENT
Quantum, DLT, DLTtape, the Quantum logo, and the DLTtape logo are all registered trademarks of Quantum
Corporation.
SDLT and Super DLTtape are trademarks of Quantum Corporation.
Other trademarks may be mentioned herein which belong to other companies.
Chapter 1: StorNext File System Tuning
The StorNext File System (SNFS) provides extremely high performance
for widely varying scenarios. Many factors determine the level of
performance you will realize. In particular, the performance
characteristics of the underlying storage system are the most critical
factors. However, other components such as the Metadata Network and
MDC systems also have a significant effect on performance.
Furthermore, file size mix and application I/O characteristics may also
present specific performance requirements, so SNFS provides a wide
variety of tunable settings to achieve optimal performance. It is usually
best to use the default SNFS settings, because these are designed to
provide optimal performance under most scenarios. However, this guide
discusses circumstances in which special settings may offer a
performance benefit.
The Underlying Storage System
The performance characteristics of the underlying storage system are the
most critical factors for file system performance. Typically, RAID storage
systems provide many tuning options for cache settings, RAID level,
segment size, stripe size, and so on.
RAID Cache Configuration
The single most important RAID tuning component is the cache
configuration. This is particularly true for small I/O operations.
Contemporary RAID systems such as the EMC CX series and the various
Engenio systems provide excellent small I/O performance with properly
tuned caching. So, for the best general purpose performance
characteristics, it is crucial to utilize the RAID system caching as fully as
possible.
For example, write-back caching is absolutely essential for metadata
stripe groups to achieve high metadata operations throughput.
However, there are a few drawbacks to consider as well. For example,
read-ahead caching improves sequential read performance but might
reduce random performance. Write-back caching is critical for small write
performance but may limit peak large I/O throughput. Some RAID
systems cannot safely support write-back caching without risk of data
loss, which is not suitable for critical data such as file system metadata.
Consequently, this is an area that requires an understanding of
application I/O requirements. As a general rule, RAID system caching is
critically important for most applications, so it is the first place to focus
tuning attention.
RAID Write-Back Caching
Write-back caching dramatically reduces latency in small write
operations. This is accomplished by returning a successful reply as soon
as data is written into cache, and then deferring the operation of actually
writing the data to the physical disks. This results in a great performance
improvement for small I/O operations.
Many contemporary RAID systems protect against write-back cache data
loss due to power or component failure. This is accomplished through
various techniques including redundancy, battery backup, battery-backed
memory, and controller mirroring. To prevent data corruption, it
is important to ensure that these systems are working properly. It is
particularly catastrophic if file system metadata is corrupted, because
complete file system loss could result. Check with your RAID vendor to
make sure that write-back caching is safe to use.
Minimal I/O latency is critically important for metadata stripe groups to
achieve high metadata operations throughput. This is because metadata
operations involve a very high rate of small writes to the metadata disk,
so disk latency is the critical performance factor. Write-back caching can
be an effective approach to minimizing I/O latency and optimizing
metadata operations throughput. This is easily observed in the hourly
File System Manager (FSM) statistics reports in the cvlog file. For
example, here is a message line from the cvlog file:
PIO HiPriWr SUMMARY SnmsMetaDisk0 sysavg/350 sysmin/333 sysmax/367
This statistics message reports average, minimum, and maximum write
latency (in microseconds) for the reporting period. If the observed
average latency exceeds 500 microseconds, peak metadata operation
throughput will be degraded. For example, create operations may be
around 2000 per second when metadata disk latency is below 500
microseconds. However, if metadata disk latency is around 5
milliseconds, create operations per second may be degraded to 200 or
worse.
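As a rough illustration of the latency check described above, the following Python sketch parses a statistics line of the form shown and compares the average against the 500 microsecond guideline. The field layout is inferred only from the single sample line in this guide, and the function names are hypothetical. The `serialized_ops_per_second` helper is an assumption of this sketch, not an SNFS formula: it treats each operation as one serialized write, which happens to reproduce the figures quoted above (about 2000 per second at 500 microseconds and 200 per second at 5 milliseconds).

```python
import re

LATENCY_LIMIT_US = 500  # guideline quoted in the text above


def parse_pio_summary(line):
    """Return (disk_name, stats) for a 'PIO HiPriWr SUMMARY' line.

    The layout is assumed from the one sample line in this guide:
    PIO HiPriWr SUMMARY <disk> <key>/<microseconds> ...
    """
    fields = line.split()
    if fields[:3] != ["PIO", "HiPriWr", "SUMMARY"] or len(fields) < 4:
        raise ValueError("not a PIO HiPriWr SUMMARY line")
    disk = fields[3]
    stats = {}
    for token in fields[4:]:
        match = re.fullmatch(r"(\w+)/(\d+)", token)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return disk, stats


def latency_ok(stats, limit_us=LATENCY_LIMIT_US):
    """True when the average write latency does not exceed the limit."""
    return stats["sysavg"] <= limit_us


def serialized_ops_per_second(latency_us):
    """Upper bound if every operation waits for one write (an assumption)."""
    return 1_000_000 / latency_us


sample = "PIO HiPriWr SUMMARY SnmsMetaDisk0 sysavg/350 sysmin/333 sysmax/367"
disk, stats = parse_pio_summary(sample)
print(disk, stats, "ok" if latency_ok(stats) else "degraded")
```

Running a check like this against each hourly report makes it easier to spot when metadata disk latency drifts above the guideline.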
Another typical write caching approach is “write-through.” This
approach involves synchronous writes to the physical disk before
returning a successful reply for the I/O operation. The write-through
approach exhibits much worse latency than write-back caching; therefore,
small I/O performance (such as metadata operations) is severely
impacted. It is important to determine which write caching approach is
employed, because the performance observed will differ greatly for small
write I/O operations.
In some cases, large write I/O operations can also benefit from caching.
However, some SNFS customers observe maximum large I/O
throughput by disabling caching. While this may be beneficial for special
large I/O scenarios, it severely degrades small I/O performance;
therefore, it is suboptimal for general-purpose file system performance.
RAID Read-Ahead Caching
RAID read-ahead caching is a very effective way to improve sequential
read performance for both small (buffered) and large (DMA) I/O
operations. When this setting is utilized, the RAID controller pre-fetches
disk blocks for sequential read operations. Therefore, subsequent
application read operations benefit from cache speed throughput, which
is faster than the physical disk throughput.
This is particularly important for concurrent file streams and mixed I/O
streams, because read-ahead significantly reduces disk head movement
that otherwise severely impacts performance.
While read-ahead caching improves sequential read performance, it does
not help highly transactional performance. Furthermore, some SNFS
customers actually observe maximum large sequential read throughput
by disabling caching. While disabling read-ahead is beneficial in these
unusual cases, it severely degrades typical scenarios. Therefore, it is
unsuitable for most environments.
RAID Level, Segment Size, and Stripe Size
Configuration settings such as RAID level, segment size, and stripe size
are very important and cannot be changed after being put into production, so it
is critical to determine appropriate settings during initial configuration.
The best RAID level to use for high I/O throughput is usually RAID5.
The stripe size is determined by the product of the number of data disks in
the RAID group and the segment size. For example, a 4+1 RAID5 group (four
data disks plus one parity disk) with
64K segment size results in a 256K stripe size. The stripe size is a very
critical factor for write performance because I/Os smaller than the stripe
size may incur a read/modify/write penalty. It is best to configure
RAID5 settings with no more than 512K stripe size to avoid the read/
modify/write penalty. The read/modify/write penalty is most
noticeable in the absence of “write-back” caching being performed by the
RAID controller.
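The stripe size arithmetic above can be sketched in a few lines of Python. This is a simplified model, not a description of any particular RAID controller: it assumes a write avoids the read/modify/write penalty only when it starts on a stripe boundary and covers whole stripes. The function names are hypothetical.

```python
def stripe_size_kib(data_disks, segment_kib):
    """Stripe size is the number of data disks times the segment size."""
    return data_disks * segment_kib


def may_incur_rmw(io_offset_kib, io_size_kib, stripe_kib):
    """True when a write does not cover whole, aligned stripes.

    Simplified model: partial-stripe writes force the controller to read
    existing data or parity before it can write the new parity.
    """
    if io_size_kib <= 0:
        raise ValueError("I/O size must be positive")
    return io_offset_kib % stripe_kib != 0 or io_size_kib % stripe_kib != 0


stripe = stripe_size_kib(data_disks=4, segment_kib=64)  # 4+1 RAID5, 64K segments
print(stripe)                          # 256
print(may_incur_rmw(0, 64, stripe))    # True: smaller than one stripe
print(may_incur_rmw(0, 1024, stripe))  # False: four full stripes
```

In this model a smaller stripe size means more I/O sizes qualify as full-stripe writes, which is one way to read the 512K upper bound recommended above.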
The RAID stripe size configuration should typically match the
StripeBreadth configuration setting when multiple LUNs are utilized in a
SNFS stripe group. However, in some cases it might be optimal to configure
the SNFS StripeBreadth as a multiple of the RAID stripe size, such as when
the RAID stripe size is small but the user's I/O sizes are very large.
However, this will be suboptimal for small I/O performance, so it may not
be suitable for general purpose usage.
RAID1 mirroring is the best RAID level for metadata and journal storage
because it is optimal for very small I/O sizes. It is also very
important to allocate entire physical disks for the Metadata and Journal
LUNs in order to avoid bandwidth contention with other I/O traffic.
Metadata and Journal storage requires very high IOPS rates (low latency)
for optimal performance, so contention can severely impact IOPS (and
latency) and thus overall performance. If Journal I/O exceeds 1ms
average latency, you will observe significant performance degradation.
It can be useful to use a tool such as lmdd to help determine the storage
system performance characteristics and choose optimal settings. For
example, varying the stripe size and running lmdd with a range of I/O
sizes might be useful to determine an optimal stripe size multiple to
configure the SNFS StripeBreadth.
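The idea of sweeping a range of I/O sizes can be illustrated with a small Python stand-in. This is not lmdd and does not reproduce its options; it is only a sketch that writes a scratch file with several I/O sizes and reports throughput for each. It writes through the operating system page cache (with a final fsync), so the numbers are indicative at best. The function names and the default transfer size are assumptions.

```python
import os
import tempfile
import time


def write_throughput_mib_s(path, io_size, total_bytes):
    """Write total_bytes to path in io_size chunks and return MiB/s."""
    buf = b"\0" * io_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            written += f.write(buf)
        os.fsync(f.fileno())  # include the flush to storage in the timing
    elapsed = time.perf_counter() - start
    return (written / (1024 * 1024)) / elapsed


def sweep(directory, io_sizes, total_bytes=8 * 1024 * 1024):
    """Return {io_size: MiB/s} for a scratch file created in directory."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    results = {}
    try:
        for size in io_sizes:
            results[size] = write_throughput_mib_s(path, size, total_bytes)
    finally:
        os.remove(path)
    return results


if __name__ == "__main__":
    sizes = [64 * 1024, 256 * 1024, 1024 * 1024]
    for size, rate in sweep(tempfile.gettempdir(), sizes).items():
        print(f"{size // 1024:5d}K  {rate:8.1f} MiB/s")
```

To make the comparison meaningful, point the directory argument at the file system under test and use a transfer size much larger than any cache in the path.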
File Size Mix and Application I/O Characteristics
It is always valuable to understand the file size mix of the target dataset
as well as the application I/O characteristics. This includes the number of
concurrent streams, proportion of read versus write streams, I/O size,
sequential versus random, Network File System (NFS) or Common
Internet File System (CIFS) access, and so on.
For example, if the dataset is dominated by small or large files, various
settings can be optimized for the target size range.
Similarly, it might be beneficial to optimize for particular application I/O
characteristics. For example, to optimize for sequential 1MB I/O size it
would be beneficial to configure a stripe group with four 4+1 RAID5
LUNs with 256K stripe size.
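The reasoning behind the 1MB example can be made concrete with a short Python sketch. It assumes a simple round-robin layout in which one StripeBreadth of data goes to each LUN in turn, with StripeBreadth set equal to the 256K RAID stripe size; the function name is hypothetical and the model ignores any allocation details of the real file system.

```python
KIB = 1024


def lun_segments(io_offset, io_size, stripe_breadth, lun_count):
    """Split one I/O into (lun_index, byte_count) pieces, round robin."""
    pieces = []
    pos = io_offset
    end = io_offset + io_size
    while pos < end:
        chunk_index = pos // stripe_breadth
        chunk_end = (chunk_index + 1) * stripe_breadth
        length = min(end, chunk_end) - pos
        pieces.append((chunk_index % lun_count, length))
        pos += length
    return pieces


# A 1MB sequential I/O across four LUNs with a 256K StripeBreadth:
for lun, length in lun_segments(0, 1024 * KIB, 256 * KIB, 4):
    print(f"LUN {lun}: {length // KIB}K")
```

Under these assumptions each 1MB I/O touches all four LUNs once, and each piece is exactly one full 256K RAID stripe, so the four LUNs work in parallel without partial-stripe writes.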
However, optimizing for random I/O performance can incur a
performance trade-off with sequential I/O.
Furthermore, NFS and CIFS access have special requirements to consider
as described in the Direct Memory Access (DMA) I/O Transfer
section.
Direct Memory Access (DMA) I/O Transfer
To achieve the highest possible large sequential I/O transfer throughput,
SNFS provides DMA-based I/O. To utilize DMA I/O, the application
must issue its reads and writes of sufficient size and alignment. This is
called well-formed I/O. See the mount command settings
auto_dma_read_length and auto_dma_write_length, described in
Mount Command Options on page 17.
Buffer Cache
Reads and writes that aren't well-formed utilize the SNFS buffer cache.
This also includes NFS or CIFS-based traffic because the NFS and CIFS
daemons defeat well-formed I/Os issued by the application.
There are several configuration parameters that affect buffer cache
performance. The most critical is the RAID cache configuration because
buffered I/O is usually smaller than the RAID stripe size, and therefore
incurs a read/modify/write penalty. It might also be possible to match
the RAID stripe size to the buffer cache I/O size. However, it is typically
most important to optimize the RAID cache configuration settings
described earlier in this document.
It is usually best to configure the RAID stripe size no greater than 256K
for optimal small file buffer cache performance.
For more buffer cache configuration settings, see Mount Command
Options on page 17.
NFS / CIFS
It is best to isolate NFS and/or CIFS traffic from the metadata network to
eliminate contention that will impact performance. For optimal
performance it is necessary to use 1000BaseT instead of 100BaseT. On
NFS clients, use the vers=3, rsize=262144 and wsize=262144 mount
options, and use TCP mounts instead of UDP. When possible, it is also
best to utilize TCP Offload capabilities as well as jumbo frames.
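As a small illustration, the NFS client options listed above can be assembled into a mount command in Python. This sketch assumes a Linux client, where the TCP transport is selected with the proto=tcp option; other platforms may spell it differently. The server name, export path, and mount point are placeholders, and the function name is hypothetical.

```python
def nfs_mount_command(server, export, mount_point):
    """Build an argument list for mounting with the recommended options."""
    options = ["vers=3", "rsize=262144", "wsize=262144", "proto=tcp"]
    return [
        "mount", "-t", "nfs",
        "-o", ",".join(options),
        f"{server}:{export}",
        mount_point,
    ]


# Placeholder names; substitute the real NFS server, export, and mount point.
print(" ".join(nfs_mount_command("nfs-server", "/export/snfs", "/mnt/snfs")))
```

Building the command as a list keeps it suitable for passing to a process launcher without shell quoting issues.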
It is best practice to have clients directly attached to the same network
switch as the NFS or CIFS server. Any routing required for NFS or CIFS
traffic incurs additional latency that impacts performance.
It is critical to make sure the speed/duplex settings are correct, because
this severely impacts performance. Most of the time auto-detect is the
correct setting. Some managed switches allow setting speed/duplex (for
example 1000Mb/full), which disables auto-detect and requires the host to
be set exactly the same. However, if the settings do not match between
switch and host, it severely impacts performance. For example, if the
switch is set to auto-detect but the host is set to 1000Mb/full, you will
observe a high error rate along with extremely poor performance. On
Linux, the mii-diag tool can be very useful to investigate and adjust
speed/duplex settings.
If performance requirements cannot be achieved with NFS or CIFS,
consider using a StorNext Distributed LAN client or fibre-channel
attached client.
It can be useful to use a tool such as netperf to help verify network
performance characteristics.
The Metadata Network
As with any client/server protocol, SNFS performance is subject to the
limitations of the underlying network. Therefore, it is recommended that
you use a dedicated Metadata Network to avoid contention with other
network traffic. Either 100BaseT or 1000BaseT is required, but for a
dedicated Metadata Network there is usually no benefit from using
1000BaseT over 100BaseT. Neither TCP offload nor jumbo frames are
required.
It is best practice to have all SNFS clients directly attached to the same
network switch as the MDC systems. Any routing required for metadata
traffic will incur additional latency that impacts performance.
It is critical to ensure that speed/duplex settings are correct, because
incorrect settings will severely impact performance. Most of the time
auto-detect is the correct setting. Some managed switches allow setting
speed/duplex, such as 100Mb/full, which disables auto-detect and requires
the host to be set exactly the same. However, performance is severely
impacted if the settings do not match between switch and host. For example,
if the switch is set to auto-detect but the host is set to 100Mb/full, you
will observe a high error rate and extremely poor performance. On Linux the
mii-diag tool can be very useful to investigate and adjust speed/duplex
settings.
It can be useful to use a tool like netperf to help verify the Metadata
Network performance characteristics. For example, if netperf -t TCP_RR
reports less than 15,000 transactions per second capacity, a performance
penalty may be incurred.
The Metadata Controller System
The CPU power and memory capacity of the MDC System are important
performance factors, as well as the number of file systems hosted per
system. In order to ensure fast response time it is necessary to use
dedicated systems, limit the number of file systems hosted per system
(maximum 8), and have adequate CPU and memory.
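For the netperf -t TCP_RR guideline on the Metadata Network, the following Python sketch shows what the 15,000 transactions per second figure implies as a round-trip time. It takes the transaction rate as a plain number and deliberately does not attempt to parse netperf output. The function names are hypothetical, and the conversion assumes one outstanding request at a time, which is how a request/response test is commonly run.

```python
MIN_TRANSACTIONS_PER_SECOND = 15_000  # guideline quoted in this guide


def mean_round_trip_us(transactions_per_second):
    """Average request/response time, assuming one transaction in flight."""
    return 1_000_000 / transactions_per_second


def metadata_network_ok(transactions_per_second):
    """True when the measured rate meets the guideline."""
    return transactions_per_second >= MIN_TRANSACTIONS_PER_SECOND


for rate in (20_000, 15_000, 8_000):
    status = "ok" if metadata_network_ok(rate) else "penalty likely"
    print(f"{rate:6d}/s  {mean_round_trip_us(rate):6.1f} us  {status}")
```

Under that assumption the guideline corresponds to a mean round trip of roughly 67 microseconds between client and MDC.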