Quantum 6-01376-05 User Manual

File System Tuning Guide
StorNext 3.0
StorNext®
6-01376-05
Document Title, 6-01376-05, Ver. A, Rel. 3.0, March 2007, Made in USA.
Quantum Corporation provides this publication “as is” without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability or fitness for a particular purpose. Quantum Corporation may revise this publication from time to time without notice.
COPYRIGHT STATEMENT
Copyright 2007 by Quantum Corporation. All rights reserved.
StorNext copyright (c) 1991-2007 Advanced Digital Information Corporation (ADIC), Redmond, WA, USA. All rights reserved.
Your right to copy this manual is limited by copyright law. Making copies or adaptations without prior written authorization of Quantum Corporation is prohibited by law and constitutes a punishable violation of the law.
TRADEMARK STATEMENT
Quantum, DLT, DLTtape, the Quantum logo, and the DLTtape logo are all registered trademarks of Quantum Corporation.
SDLT and Super DLTtape are trademarks of Quantum Corporation.
Other trademarks may be mentioned herein which belong to other companies.

Contents

StorNext File System Tuning  1
The Underlying Storage System  1
    RAID Cache Configuration  2
    RAID Write-Back Caching  2
    RAID Read-Ahead Caching  3
    RAID Level, Segment Size, and Stripe Size  4
File Size Mix and Application I/O Characteristics  5
    Direct Memory Access (DMA) I/O Transfer  5
The Metadata Network  7
The Metadata Controller System  7
    FSM Configuration File Settings  8
    Mount Command Options  16
The Distributed LAN (Disk Proxy) Networks  18
    Network Configuration and Topology  20
Distributed LAN Servers  22
Windows Memory Requirements  22
Sample FSM Configuration File  25

StorNext File System Tuning

The StorNext File System (SNFS) provides extremely high performance for widely varying scenarios. Many factors determine the level of performance you will realize. In particular, the performance characteristics of the underlying storage system are the most critical factors. However, other components such as the Metadata Network and the Metadata Controller (MDC) systems also have a significant effect on performance.
Furthermore, file size mix and application I/O characteristics may also present specific performance requirements, so SNFS provides a wide variety of tunable settings to achieve optimal performance. It is usually best to use the default SNFS settings, because these are designed to provide optimal performance under most scenarios. However, this guide discusses circumstances in which special settings may offer a performance benefit.

The Underlying Storage System

The performance characteristics of the underlying storage system are the most critical factors for file system performance. Typically, RAID storage systems provide many tuning options for cache settings, RAID level, segment size, stripe size, and so on.

RAID Cache Configuration

The single most important RAID tuning component is the cache configuration. This is particularly true for small I/O operations. Contemporary RAID systems such as the EMC CX series and the various Engenio systems provide excellent small I/O performance with properly tuned caching. So, for the best general purpose performance characteristics, it is crucial to utilize the RAID system caching as fully as possible.
For example, write-back caching is absolutely essential for metadata stripe groups to achieve high metadata operations throughput.
However, there are a few drawbacks to consider as well. For example, read-ahead caching improves sequential read performance but might reduce random performance. Write-back caching is critical for small write performance but may limit peak large I/O throughput. Some RAID systems cannot safely support write-back caching without risk of data loss, which is not suitable for critical data such as file system metadata.
Consequently, this is an area that requires an understanding of application I/O requirements. As a general rule, RAID system caching is critically important for most applications, so it is the first place to focus tuning attention.

RAID Write-Back Caching

Write-back caching dramatically reduces latency in small write operations. This is accomplished by returning a successful reply as soon as data is written into cache, and then deferring the operation of actually writing the data to the physical disks. This results in a great performance improvement for small I/O operations.
Many contemporary RAID systems protect against write-back cache data loss due to power or component failure. This is accomplished through various techniques including redundancy, battery backup, battery-backed memory, and controller mirroring. To prevent data corruption, it is important to ensure that these systems are working properly. It is particularly catastrophic if file system metadata is corrupted, because complete file system loss could result. Check with your RAID vendor to make sure that write-back caching is safe to use.
Minimal I/O latency is critically important for metadata stripe groups to achieve high metadata operations throughput. This is because metadata operations involve a very high rate of small writes to the metadata disk, so disk latency is the critical performance factor. Write-back caching can be an effective approach to minimizing I/O latency and optimizing metadata operations throughput. This is easily observed in the hourly File System Manager (FSM) statistics reports in the cvlog file. For example, here is a message line from the cvlog file:
PIO HiPriWr SUMMARY SnmsMetaDisk0 sysavg/350 sysmin/333 sysmax/367
This statistics message reports average, minimum, and maximum write latency (in microseconds) for the reporting period. If the observed average latency exceeds 500 microseconds, peak metadata operation throughput will be degraded. For example, create operations may be around 2000 per second when metadata disk latency is below 500 microseconds. However, if metadata disk latency is around 5 milliseconds, create operations per second may be degraded to 200 or worse.
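The check described above can be sketched in a few lines of Python. This is only an illustration, not a StorNext tool: the regular expression assumes the field layout of the single sample line shown above, and the function names are hypothetical. The throughput estimate assumes each operation waits for one metadata write to complete, a simplification that happens to reproduce the figures quoted in this section.

```python
import re

# Field layout assumed from the single sample line in this guide.
SUMMARY_RE = re.compile(
    r"PIO\s+HiPriWr\s+SUMMARY\s+(?P<disk>\S+)\s+"
    r"sysavg/(?P<avg>\d+)\s+sysmin/(?P<min>\d+)\s+sysmax/(?P<max>\d+)"
)

LATENCY_LIMIT_US = 500  # threshold quoted in this guide, in microseconds


def parse_hipriwr_summary(line):
    """Return the write latency figures from a summary line, or None."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {
        "disk": match.group("disk"),
        "avg_us": int(match.group("avg")),
        "min_us": int(match.group("min")),
        "max_us": int(match.group("max")),
    }


def latency_ok(stats, limit_us=LATENCY_LIMIT_US):
    """True when the average latency does not exceed the limit."""
    return stats["avg_us"] <= limit_us


def serialized_ops_per_second(latency_us):
    """Rough ceiling, assuming one synchronous write per operation."""
    return 1_000_000 / latency_us


if __name__ == "__main__":
    sample = "PIO HiPriWr SUMMARY SnmsMetaDisk0 sysavg/350 sysmin/333 sysmax/367"
    stats = parse_hipriwr_summary(sample)
    print(stats["disk"], stats["avg_us"], latency_ok(stats))
    print(serialized_ops_per_second(500), serialized_ops_per_second(5000))
```

Under the stated assumption, 500 microseconds of latency corresponds to about 2000 operations per second and 5 milliseconds to about 200, which matches the create rates given above.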
Another typical write caching approach is a “write-through.” This approach involves synchronous writes to the physical disk before returning a successful reply for the I/O operation. The write-through approach exhibits much worse latency than write-back caching; therefore, small I/O performance (such as metadata operations) is severely impacted. It is important to determine which write caching approach is employed, because the performance observed will differ greatly for small write I/O operations.
In some cases, large write I/O operations can also benefit from caching. However, some SNFS customers observe maximum large I/O throughput by disabling caching. While this may be beneficial for special large I/O scenarios, it severely degrades small I/O performance; therefore, it is suboptimal for general-purpose file system performance.

RAID Read-Ahead Caching

RAID read-ahead caching is a very effective way to improve sequential read performance for both small (buffered) and large (DMA) I/O operations. When this setting is utilized, the RAID controller pre-fetches disk blocks for sequential read operations. Therefore, subsequent application read operations benefit from cache speed throughput, which is faster than the physical disk throughput.
This is particularly important for concurrent file streams and mixed I/O streams, because read-ahead significantly reduces disk head movement that otherwise severely impacts performance.
While read-ahead caching improves sequential read performance, it does not help random performance. Furthermore, some SNFS customers actually observe maximum large sequential read throughput by disabling caching. While disabling read-ahead is beneficial in these unusual cases, it severely degrades typical scenarios. Therefore, it is unsuitable for most environments.

RAID Level, Segment Size, and Stripe Size

Configuration settings such as RAID level, segment size, and stripe size are very important and cannot be changed after the storage is put into production, so it is critical to determine appropriate settings during initial configuration.
The best RAID level to use for high I/O throughput is usually RAID5. The stripe size is the product of the number of data disks in the RAID group and the segment size. For example, a 4+1 RAID5 group (four data disks plus one parity disk) with 64K segment size results in a 256K stripe size. The stripe size is a critical factor for write performance because I/Os smaller than the stripe size may incur a read/modify/write penalty. It is best to configure RAID5 settings with no more than 512K stripe size to avoid the read/modify/write penalty. The read/modify/write penalty is most noticeable when the RAID controller is not performing write-back caching.
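The stripe size arithmetic above can be expressed as a short Python sketch. The function names are hypothetical, and the full-stripe check assumes the write starts on a stripe boundary, which is a simplification of how a real controller decides whether a read/modify/write is needed.

```python
def raid5_stripe_size_kb(data_disks, segment_kb):
    """Stripe size = number of data disks x segment size (parity excluded)."""
    return data_disks * segment_kb


def is_full_stripe_write(io_kb, stripe_kb):
    """True when a stripe-aligned write covers only whole stripes.

    Assumes the write begins on a stripe boundary. Anything smaller than,
    or not a multiple of, the stripe size leaves a partial stripe that may
    incur the read/modify/write penalty.
    """
    return io_kb > 0 and io_kb % stripe_kb == 0


if __name__ == "__main__":
    stripe = raid5_stripe_size_kb(data_disks=4, segment_kb=64)  # 4+1 RAID5
    print(stripe)
    for io_kb in (64, 256, 512):
        print(io_kb, is_full_stripe_write(io_kb, stripe))
```

For the 4+1 group with 64K segments, the sketch reports a 256K stripe, so a 64K write is a partial-stripe write while 256K and 512K writes cover whole stripes.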
The RAID stripe size configuration should typically match the StripeBreadth configuration setting when multiple LUNs are utilized in an SNFS stripe group. However, in some cases it might be optimal to configure the SNFS StripeBreadth as a multiple of the RAID stripe size, such as when the RAID stripe size is small but the user's I/O sizes are very large. However, this will be suboptimal for small I/O performance, so it may not be suitable for general purpose usage.
RAID1 mirroring is the best RAID level for metadata and journal storage because it is optimal for very small I/O sizes. It is also very important to allocate entire physical disks for the Metadata and Journal LUNs in order to avoid bandwidth contention with other I/O traffic. Metadata and Journal storage requires very high IOPS rates (low latency) for optimal performance, so contention can severely impact IOPS (and latency) and thus overall performance. If Journal I/O exceeds 1ms average latency, you will observe significant performance degradation.
It can be useful to use a tool such as lmdd to help determine the storage system performance characteristics and choose optimal settings. For example, varying the stripe size and running lmdd with a range of I/O sizes might be useful to determine an optimal stripe size multiple to configure the SNFS StripeBreadth.

File Size Mix and Application I/O Characteristics

It is always valuable to understand the file size mix of the target dataset as well as the application I/O characteristics. This includes the number of concurrent streams, proportion of read versus write streams, I/O size, sequential versus random, Network File System (NFS) or Common Internet File System (CIFS) access, and so on.
For example, if the dataset is dominated by small or large files, various settings can be optimized for the target size range.
Similarly, it might be beneficial to optimize for particular application I/O characteristics. For example, to optimize for sequential 1MB I/O size it would be beneficial to configure a stripe group with four 4+1 RAID5 LUNs with 256K stripe size.
However, optimizing for random I/O performance can incur a performance trade-off with sequential I/O.
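To see why four 4+1 RAID5 LUNs with 256K stripe size suit sequential 1MB I/O, the sketch below splits one I/O across the LUNs of a stripe group. It is a simplified round-robin model, not the SNFS allocator: it assumes the I/O starts on the first LUN at a stripe boundary and that the stripe breadth equals the RAID stripe size. The function name is hypothetical and sizes are in kilobytes for readability.

```python
def split_io_across_luns(io_kb, stripe_breadth_kb, lun_count):
    """Return the number of kilobytes each LUN receives for one I/O.

    Simplified model: data is dealt out in stripe-breadth chunks,
    round-robin, starting at the first LUN.
    """
    per_lun = [0] * lun_count
    remaining = io_kb
    lun = 0
    while remaining > 0:
        chunk = min(stripe_breadth_kb, remaining)
        per_lun[lun] += chunk
        remaining -= chunk
        lun = (lun + 1) % lun_count
    return per_lun


if __name__ == "__main__":
    # 1MB I/O over four LUNs, each with a 256K stripe.
    print(split_io_across_luns(1024, 256, 4))
    # A 640K I/O leaves one LUN with a partial stripe and one idle.
    print(split_io_across_luns(640, 256, 4))
```

In this model the 1MB I/O gives every LUN exactly one full 256K stripe, so all four LUNs work in parallel and none sees a partial-stripe write.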

Furthermore, NFS and CIFS access have special requirements to consider as described in the Direct Memory Access (DMA) I/O Transfer section.

Direct Memory Access (DMA) I/O Transfer

To achieve the highest possible large sequential I/O transfer throughput, SNFS provides DMA-based I/O. To utilize DMA I/O, the application must issue reads and writes of sufficient size and alignment. This is called well-formed I/O. See the mount command settings auto_dma_read_length and auto_dma_write_length, described in Mount Command Options on page 16.

Buffer Cache

Reads and writes that aren't well-formed utilize the SNFS buffer cache. This also includes NFS or CIFS-based traffic because the NFS and CIFS daemons defeat well-formed I/Os issued by the application.
There are several configuration parameters that affect buffer cache performance. The most critical is the RAID cache configuration because buffered I/O is usually smaller than the RAID stripe size, and therefore incurs a read/modify/write penalty. It might also be possible to match the RAID stripe size to the buffer cache I/O size. However, kernel memory fragmentation can defeat attempts to increase the SNFS buffer cache I/O size (see the cachebufsize setting described in Mount Command Options on page 16). So, it is typically most important to optimize the RAID cache configuration settings described earlier in this document.
It is usually best to configure the RAID stripe size no greater than 256K for optimal small file buffer cache performance.

For more buffer cache configuration settings, see Mount Command Options on page 16.

NFS / CIFS

It is best to keep NFS and/or CIFS traffic off the metadata network to eliminate contention that will impact performance. For optimal performance it is necessary to use 1000BaseT instead of 100BaseT. When possible, it is also best to utilize TCP Offload capabilities as well as jumbo frames.
It is best practice to have clients directly attached to the same network switch as the NFS or CIFS server. Any routing required for NFS or CIFS traffic incurs additional latency that impacts performance.
It is critical to make sure the speed/duplex settings are correct, because incorrect settings severely impact performance. Most of the time auto-detect is the correct setting. Some managed switches allow setting speed/duplex (for example, 1000Mb/full), which disables auto-detect and requires the host to be set exactly the same. However, if the settings do not match between switch and host, performance is severely impacted. For example, if the switch is set to auto-detect but the host is set to 1000Mb/full, you will observe a high error rate along with extremely poor performance. On Linux, the mii-diag tool can be very useful to investigate and adjust speed/duplex settings.
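As a complement to mii-diag, the host's current link speed and duplex can also be read on Linux from the sysfs attributes /sys/class/net/<interface>/speed and /sys/class/net/<interface>/duplex. The Python sketch below is an illustration only: the function names and the interface name eth0 are assumptions, and the values show the host's view of the link, not how the switch port is configured.

```python
from pathlib import Path


def _read_attr(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed in Mb/s or None, duplex string or None) for iface."""
    base = Path(sysfs_root) / iface
    raw_speed = _read_attr(base / "speed")
    duplex = _read_attr(base / "duplex")
    # An unreadable or negative value means the speed is unknown.
    speed = int(raw_speed) if raw_speed and raw_speed.isdigit() else None
    return speed, duplex


def link_matches(iface, expected_speed_mb, expected_duplex="full",
                 sysfs_root="/sys/class/net"):
    """True when the host reports the expected speed and duplex."""
    speed, duplex = read_link_settings(iface, sysfs_root)
    return speed == expected_speed_mb and duplex == expected_duplex


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the interface carrying NFS/CIFS traffic.
    print(read_link_settings("eth0"))
    print(link_matches("eth0", 1000, "full"))
```

A result other than the expected speed with full duplex is a reason to compare the host and switch port settings by hand.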
A higher performance alternative to NFS and CIFS is iSCSI. If performance requirements cannot be achieved with NFS or CIFS, iSCSI should be considered.
It can be useful to use a tool such as netperf to help verify network performance characteristics.

The Metadata Network

As with any client/server protocol, SNFS performance is subject to the limitations of the underlying network. Therefore, it is recommended that you use a dedicated Metadata Network to avoid contention with other network traffic. Either 100BaseT or 1000BaseT is required, but for a dedicated Metadata Network there is usually no benefit from using 1000BaseT over 100BaseT. Neither TCP offload nor jumbo frames are required.
It is best practice to have all SNFS clients directly attached to the same network switch as the MDC systems. Any routing required for metadata traffic will incur additional latency that impacts performance.
It is critical to ensure that speed/duplex settings are correct, because incorrect settings will severely impact performance. Most of the time auto-detect is the correct setting. Some managed switches allow setting speed/duplex, such as 100Mb/full, which disables auto-detect and requires the host to be set exactly the same. However, performance is severely impacted if the settings do not match between switch and host. For example, if the switch is set to auto-detect but the host is set to 100Mb/full, you will observe a high error rate and extremely poor performance. On Linux, the mii-diag tool can be very useful to investigate and adjust speed/duplex settings.
It can be useful to use a tool like netperf to help verify the Metadata Network performance characteristics. For example, if netperf -t TCP_RR reports less than 15,000 transactions per second capacity, a performance penalty may be incurred.

The Metadata Controller System

The CPU and memory power of the MDC system are important performance factors, as well as the number of file systems hosted per system. To ensure fast response time, it is necessary to use dedicated systems, limit the number of file systems hosted per system (maximum 8), and provide adequate CPU and memory.