File System Tuning Guide
StorNext 3.5.2 File System Tuning Guide, 6-01376-14, Ver. A, Rel. 3.5.2, February 2010, Made in USA.
Quantum Corporation provides this publication “as is” without warranty of any kind, either express or implied,
including but not limited to the implied warranties of merchantability or fitness for a particular purpose. Quantum
Corporation may revise this publication from time to time without notice.
US Patent No: 5,990,810 applies. Other Patents pending in the US and/or other countries.
StorNext is either a trademark or registered trademark of Quantum Corporation in the US and/or other countries.
Your right to copy this manual is limited by copyright law. Making copies or adaptations without prior written
authorization of Quantum Corporation is prohibited by law and constitutes a punishable violation of the law.
TRADEMARK STATEMENT
Quantum, DLT, DLTtape, the Quantum logo, and the DLTtape logo are all registered trademarks of Quantum
Corporation.
SDLT and Super DLTtape are trademarks of Quantum Corporation.
Other trademarks may be mentioned herein which belong to other companies.
Contents
StorNext File System Tuning
    The Underlying Storage System
The StorNext File System (SNFS) provides extremely high performance
for widely varying scenarios. Many factors determine the level of
performance you will realize. In particular, the performance
characteristics of the underlying storage system are the most critical
factors. However, other components such as the Metadata Network and
MDC systems also have a significant effect on performance.
Furthermore, file size mix and application I/O characteristics may also
present specific performance requirements, so SNFS provides a wide
variety of tunable settings to achieve optimal performance. It is usually
best to use the default SNFS settings, because these are designed to
provide optimal performance under most scenarios. However, this guide
discusses circumstances in which special settings may offer a
performance benefit.
The Underlying Storage System
The performance characteristics of the underlying storage system are the
most critical factors for file system performance. Typically, RAID storage
systems provide many tuning options for cache settings, RAID level,
segment size, stripe size, and so on.
RAID Cache Configuration
The single most important RAID tuning component is the cache
configuration. This is particularly true for small I/O operations.
Contemporary RAID systems such as the EMC CX series and the various
Engenio systems provide excellent small I/O performance with properly
tuned caching. So, for the best general purpose performance
characteristics, it is crucial to utilize the RAID system caching as fully as
possible.
For example, write-back caching is absolutely essential for metadata
stripe groups to achieve high metadata operations throughput.
However, there are a few drawbacks to consider as well. For example,
read-ahead caching improves sequential read performance but might
reduce random performance. Write-back caching is critical for small write
performance but may limit peak large I/O throughput.
Caution: Some RAID systems cannot safely support write-back
caching without risk of data loss, which is not suitable for
critical data such as file system metadata.
Consequently, this is an area that requires an understanding of
application I/O requirements. As a general rule, RAID system caching is
critically important for most applications, so it is the first place to focus
tuning attention.
RAID Write-Back Caching
Write-back caching dramatically reduces latency in small write
operations. This is accomplished by returning a successful reply as soon
as data is written into cache, and then deferring the operation of actually
writing the data to the physical disks. This results in a great performance
improvement for small I/O operations.
Many contemporary RAID systems protect against write-back cache data
loss due to power or component failure. This is accomplished through
various techniques including redundancy, battery backup, battery-backed
memory, and controller mirroring. To prevent data corruption, it
is important to ensure that these systems are working properly. It is
particularly catastrophic if file system metadata is corrupted, because
complete file system loss could result. Check with your RAID vendor to
make sure that write-back caching is safe to use.
Minimal I/O latency is critically important for metadata stripe groups to
achieve high metadata operations throughput. This is because metadata
operations involve a very high rate of small writes to the metadata disk,
so disk latency is the critical performance factor. Write-back caching can
be an effective approach to minimizing I/O latency and optimizing
metadata operations throughput. This is easily observed in the hourly
File System Manager (FSM) statistics reports in the cvlog file. For
example, here is a message line from the cvlog file:

PIO HiPriWr SUMMARY SnmsMetaDisk0 sysavg/350 sysmin/333 sysmax/367
This statistics message reports average, minimum, and maximum write
latency (in microseconds) for the reporting period. If the observed
average latency exceeds 500 microseconds, peak metadata operation
throughput will be degraded. For example, create operations may be
around 2000 per second when metadata disk latency is below 500
microseconds. However, if metadata disk latency is around 5
milliseconds, create operations per second may be degraded to 200 or
worse.
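
For example, on a Linux MDC you can scan for these summary lines with a
command like the following (a sketch; the path assumes a file system
named snfs1 and the typical cvlog location, which may differ on your
system):

    grep "PIO HiPriWr SUMMARY" /usr/cvfs/data/snfs1/log/cvlog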
Another typical write caching approach is “write-through.” This
approach involves synchronous writes to the physical disk before
returning a successful reply for the I/O operation. The write-through
approach exhibits much worse latency than write-back caching; therefore,
small I/O performance (such as metadata operations) is severely
impacted. It is important to determine which write caching approach is
employed, because the performance observed will differ greatly for small
write I/O operations.
In some cases, large write I/O operations can also benefit from caching.
However, some SNFS customers observe maximum large I/O
throughput by disabling caching. While this may be beneficial for special
large I/O scenarios, it severely degrades small I/O performance;
therefore, it is suboptimal for general-purpose file system performance.
RAID Read-Ahead Caching
RAID read-ahead caching is a very effective way to improve sequential
read performance for both small (buffered) and large (DMA) I/O
operations. When this setting is utilized, the RAID controller pre-fetches
disk blocks for sequential read operations. Therefore, subsequent
application read operations benefit from cache speed throughput, which
is faster than the physical disk throughput.
This is particularly important for concurrent file streams and mixed I/O
streams, because read-ahead significantly reduces disk head movement
that otherwise severely impacts performance.
While read-ahead caching improves sequential read performance, it does
not help highly transactional performance. Furthermore, some SNFS
customers actually observe maximum large sequential read throughput
by disabling caching. While disabling read-ahead is beneficial in these
unusual cases, it severely degrades typical scenarios. Therefore, it is
unsuitable for most environments.
RAID Level, Segment Size, and Stripe Size
Configuration settings such as RAID level, segment size, and stripe size
are very important and cannot be changed once the system is put into
production, so it is critical to determine appropriate settings during
initial configuration.
The best RAID level to use for high I/O throughput is usually RAID5.
The stripe size is determined by the product of the number of disks in the
RAID group and the segment size. For example, a 4+1 RAID5 group with
64K segment size results in a 256K stripe size. The stripe size is a very
critical factor for write performance because I/Os smaller than the stripe
size may incur a read/modify/write penalty. It is best to configure
RAID5 settings with no more than a 512K stripe size to avoid the
read/modify/write penalty. The read/modify/write penalty is most
noticeable when the RAID controller is not performing write-back
caching.
The RAID stripe size configuration should typically match the SNFS
StripeBreadth configuration setting when multiple LUNs are utilized in a
stripe group. However, in some cases it might be optimal to configure the
SNFS StripeBreadth as a multiple of the RAID stripe size, such as when
the RAID stripe size is small but the user's I/O sizes are very large.
However, this will be suboptimal for small I/O performance, so it may not
be suitable for general-purpose usage.
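
For example, for a data stripe group built from 4+1 RAID5 LUNs with a
64K segment size (a 256K RAID stripe), a matching configuration might
look like the following sketch (the stripe group name is hypothetical):

[stripeGroup DataFiles]
Status UP
Exclusive No
Read Enabled
Write Enabled
StripeBreadth 256K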
RAID1 mirroring is the best RAID level for metadata and journal storage
because it is most optimal for very small I/O sizes. Quantum
recommends using fibre channel or SAS disks (as opposed to SATA) for
metadata and journal due to the higher IOPS performance and reliability.
It is also very important to allocate entire physical disks for the Metadata
and Journal LUNs in order to avoid bandwidth contention with other I/O
traffic. Metadata and Journal storage requires very high IOPS rates
(low latency) for optimal performance, so contention can severely impact
IOPS (and latency) and thus overall performance. If Journal I/O exceeds
1ms average latency, you will observe significant performance
degradation.
It can be useful to use a tool such as lmdd to help determine the storage
system performance characteristics and choose optimal settings. For
example, varying the stripe size and running lmdd with a range of I/O
sizes might be useful to determine an optimal stripe size multiple to
configure the SNFS StripeBreadth.
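
For example, the following sketch writes and reads a test file at several
I/O sizes so that throughput can be compared across candidate stripe
sizes (the mount point /stornext/snfs1 is hypothetical; lmdd is part of
the lmbench toolkit):

    # Write 1 GB at several I/O sizes and compare the reported throughput
    for bs in 256k 512k 1m 4m; do
        lmdd of=/stornext/snfs1/lmdd.tmp bs=$bs move=1g fsync=1
    done
    # Read the file back to measure sequential read throughput
    lmdd if=/stornext/snfs1/lmdd.tmp of=internal bs=1m move=1g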
Some storage vendors now provide RAID6 capability for improved
reliability over RAID5. This may be particularly valuable for SATA disks,
where bit error rates can lead to disk problems. However, RAID6
typically incurs a performance penalty compared to RAID5, particularly
for writes. Check with your storage vendor for RAID5 versus RAID6
recommendations.
File Size Mix and Application I/O Characteristics
It is always valuable to understand the file size mix of the target dataset
as well as the application I/O characteristics. This includes the number of
concurrent streams, proportion of read versus write streams, I/O size,
sequential versus random, Network File System (NFS) or Common
Internet File System (CIFS) access, and so on.
For example, if the dataset is dominated by small or large files, various
settings can be optimized for the target size range.
Similarly, it might be beneficial to optimize for particular application I/O
characteristics. For example, to optimize for sequential 1MB I/O size it
would be beneficial to configure a stripe group with four 4+1 RAID5
LUNs with 256K stripe size.
However, optimizing for random I/O performance can incur a
performance trade-off with sequential I/O.
Furthermore, NFS and CIFS access have special requirements to consider,
as described in the Direct Memory Access (DMA) I/O Transfer section.
Direct Memory Access (DMA) I/O Transfer

To achieve the highest possible large sequential I/O transfer throughput,
SNFS provides DMA-based I/O. To utilize DMA I/O, the application
must issue its reads and writes of sufficient size and alignment. This is
called well-formed I/O. See the mount command settings
auto_dma_read_length and auto_dma_write_length, described in the
Mount Command Options section.
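
For example, on a Linux client these thresholds could be adjusted at
mount time (a sketch; the file system name, mount point, and values are
hypothetical, and the defaults are usually appropriate):

    mount -t cvfs snfs1 /stornext/snfs1 -o auto_dma_read_length=1048576,auto_dma_write_length=1048576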
Buffer Cache
Reads and writes that aren't well-formed utilize the SNFS buffer cache.
This also includes NFS or CIFS-based traffic because the NFS and CIFS
daemons defeat well-formed I/Os issued by the application.
There are several configuration parameters that affect buffer cache
performance. The most critical is the RAID cache configuration because
buffered I/O is usually smaller than the RAID stripe size, and therefore
incurs a read/modify/write penalty. It might also be possible to match
the RAID stripe size to the buffer cache I/O size. However, it is typically
most important to optimize the RAID cache configuration settings
described earlier in this document.
It is usually best to configure the RAID stripe size no greater than 256K
for optimal small file buffer cache performance.
For more buffer cache configuration settings, see Mount Command
Options on page 19.
NFS / CIFS

It is best to isolate NFS and/or CIFS traffic off of the metadata network to
eliminate contention that will impact performance. For optimal
performance it is necessary to use 1000BaseT instead of 100BaseT. On
NFS clients, use the vers=3, rsize=262144 and wsize=262144 mount
options, and use TCP mounts instead of UDP. When possible, it is also
best to utilize TCP Offload capabilities as well as jumbo frames.
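
For example, a Linux NFS client might mount a share with these options
(a sketch; the server name and export path are hypothetical):

    mount -t nfs -o vers=3,tcp,rsize=262144,wsize=262144 server:/stornext/snfs1 /mnt/snfs1
    # Enable jumbo frames on the client interface if the switch supports them
    ip link set eth0 mtu 9000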
It is best practice to have clients directly attached to the same network
switch as the NFS or CIFS server. Any routing required for NFS or CIFS
traffic incurs additional latency that impacts performance.
It is critical to make sure the speed/duplex settings are correct, because
this severely impacts performance. Most of the time auto-detect is the
correct setting. Some managed switches allow setting speed/duplex (for
example 1000Mb/full), which disables auto-detect and requires the host to
be set exactly the same. However, if the settings do not match between
switch and host, it severely impacts performance. For example, if the
switch is set to auto-detect but the host is set to 1000Mb/full, you will
observe a high error rate along with extremely poor performance. On
Linux, the ethtool tool can be very useful to investigate and adjust
speed/duplex settings.
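
For example (a sketch; the interface name eth0 is hypothetical):

    ethtool eth0                    # report current speed/duplex settings
    ethtool -s eth0 autoneg on      # return the host to auto-negotiation
    # Force a fixed setting only if the switch port is configured identically
    ethtool -s eth0 speed 1000 duplex full autoneg off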
If performance requirements cannot be achieved with NFS or CIFS,
consider using a StorNext Distributed LAN client or fibre-channel
attached client.
It can be useful to use a tool such as netperf to help verify network
performance characteristics.
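
For example, a bulk-transfer test from an NFS client to the server gives a
baseline for network throughput (a sketch; the hostname is hypothetical):

    netperf -H nfsserver -t TCP_STREAM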
SNFS and Virus Checking
Virus-checking software can severely degrade the performance of any
file system, including SNFS. If you have anti-virus software running on a
Windows Server 2003 or Windows XP machine, Quantum recommends
configuring the software so that it does NOT check SNFS.
The Metadata Network
As with any client/server protocol, SNFS performance is subject to the
limitations of the underlying network. Therefore, it is recommended that
you use a dedicated Metadata Network to avoid contention with other
network traffic. Either 100BaseT or 1000BaseT is required, but for a
dedicated Metadata Network there is usually no benefit from using
1000BaseT over 100BaseT. Neither TCP offload nor jumbo frames are
required.
It is best practice to have all SNFS clients directly attached to the same
network switch as the MDC systems. Any routing required for metadata
traffic will incur additional latency that impacts performance.
It is critical to ensure that the speed/duplex settings are correct, as this
will severely impact performance. Most of the time auto-detect is the
correct setting. Some managed switches allow setting speed/duplex, such
as 100Mb/full, which disables auto-detect and requires the host to be set
exactly the same. However, performance is severely impacted if the
settings do not match between switch and host. For example, if the switch
is set to auto-detect but the host is set to 100Mb/full, you will observe a
high error rate and extremely poor performance. On Linux, the ethtool
tool can be very useful to investigate and adjust speed/duplex settings.
It can be useful to use a tool like netperf to help verify the Metadata
Network performance characteristics. For example, if netperf -t TCP_RR
reports less than 15,000 transactions per second capacity, a performance
penalty may be incurred. You can also use the netstat tool to identify TCP
retransmissions impacting performance. The cvadmin “latency-test” tool
is also useful for measuring network latency.
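
For example (a sketch; the MDC hostname mdc1 is hypothetical):

    netperf -H mdc1 -t TCP_RR        # request/response transactions per second
    netstat -s | grep -i retrans     # look for TCP retransmissions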
Note the following configuration requirements for the metadata network:
• In cases where gigabit networking hardware is used and maximum
StorNext performance is required, a separate, dedicated switched
Ethernet LAN is recommended for the StorNext metadata network. If
maximum StorNext performance is not required, shared gigabit
networking is acceptable.
• A separate, dedicated switched Ethernet LAN is mandatory for the
metadata network if 100 Mbit/s or slower networking hardware is
used.
• StorNext does not support file system metadata on the same network
as iSCSI, NFS, CIFS, or VLAN data when 100 Mbit/s or slower
networking hardware is used.
The Metadata Controller System
The CPU power and memory capacity of the MDC System are important
performance factors, as well as the number of file systems hosted per
system. In order to ensure fast response time, it is necessary to use
dedicated systems, limit the number of file systems hosted per system
(maximum 8), and have adequate CPU power and memory.
Some metadata operations such as file creation can be CPU intensive, and
benefit from increased CPU power. The MDC platform is important in
these scenarios because lower clock-speed CPUs such as Sparc degrade
performance.
Other operations can benefit greatly from increased memory, such as
directory traversal. SNFS provides three config file settings that can be
used to realize performance gains from increased memory:
BufferCacheSize, InodeCacheSize, and ThreadPoolSize.
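
For example, these settings appear in the FSM configuration file as
simple keyword/value lines (a sketch; the values shown are illustrative
only, not recommendations):

BufferCacheSize 64M
InodeCacheSize 65536
ThreadPoolSize 128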
However, it is critical that the MDC system have enough physical
memory available to ensure that the FSM process doesn’t get swapped
out. Otherwise, severe performance degradation and system instability
can result.
The operating system on the metadata controller must always be run in
U.S. English.
FSM Configuration File Settings

The following FSM configuration file settings are explained in greater
detail in the cvfs_config man page. For a sample FSM configuration file,
see the Sample FSM Configuration File section.

The examples in the following sections are excerpted from that sample
configuration file.
Stripe Groups

Splitting apart data, metadata, and journal into separate stripe groups is
usually the most important performance tactic. The create, remove, and
allocate (e.g., write) operations are very sensitive to I/O latency of the
journal stripe group. Configuring a separate stripe group for journal
greatly benefits the speed of these operations because disk seek latency is
minimized. However, if create, remove, and allocate performance aren't
critical, it is okay to share a stripe group for both metadata and journal,
but be sure to set the exclusive property on the stripe group so it doesn't
get allocated for data as well. It is recommended that you assign only a
single LUN for each journal or metadata stripe group. Multiple metadata
stripe groups can be utilized to increase metadata I/O throughput
through concurrency. RAID1 mirroring is optimal for metadata and
journal storage. Utilizing the write-back caching feature of the RAID
system (as described previously) is critical to optimizing performance of
the journal and metadata stripe groups.
Example:
[stripeGroup RegularFiles]
Status UP
Exclusive No ##Non-Exclusive stripeGroup for all Files##
Read Enabled
Write Enabled
StripeBreadth 256K
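
By contrast, an exclusive metadata-only stripe group might look like the
following sketch (the stripe group name is hypothetical):

[stripeGroup MetaFiles]
Status UP
MetaData Yes
Journal No
Exclusive Yes ##Exclusive stripeGroup for metadata only##
Read Enabled
Write Enabled
StripeBreadth 256K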