
PRODUCT PROFILE
Evaluating Enterprise-Class VTLs: The IBM System Storage TS7650G ProtecTIER De-duplication Gateway
Increasingly stringent service level agreements (SLAs) are putting significant pressure on large enterprises to address backup window, recovery point objective (RPO), recovery time objective (RTO), and recovery reliability issues. While the use of disk storage technology offers clear functional advantages for resolving these issues, disk’s high cost has been an impediment to wide-scale deployment in the data protection domain of the enterprise data center. Now that storage capacity optimization (SCO) technologies like single instancing, data de-duplication, and compression are available to reduce the amount of raw storage capacity required to store a given amount of data, the $/GB costs for disk-based secondary storage can be reduced by 10 to 20 times. Virtual tape technology, in which disk-based storage subsystems appear to backup software as tape drives or libraries, is one of the most popular ways to integrate disk into a pre-existing data protection infrastructure because it requires very little change to existing backup and restore processes. While virtual tape libraries (VTLs) are interesting, SCO VTLs that leverage data de-duplication and other related technologies are compelling.
Given high data growth rates, stringent SLAs for data protection, and the need to contain spending, enterprise customers really need to take a look at SCO technologies. Taneja Group predicts that large enterprises will rapidly move to SCO VTLs over the next 1-2 years while the market for non-SCO VTLs (VTLs that do not have integrated SCO technologies) dwindles rapidly. Data growth rates in the 50% - 60% range will be pushing this transition as much as will the clear cost advantages that SCO VTLs offer over non-SCO VTLs. While SCO is a key requirement, performance remains the number one need of the enterprise data protection environment. After all, if the SLA for completing the day’s backup cannot be met, all other criteria are moot. This has significant implications for vendors of SCO VTLs. Their solutions must provide the capacity optimization that the enterprise customer demands, while enabling enterprise-class performance. Vendors that can provide both efficient SCO technology and enterprise class performance offer a very compelling value proposition.
In this Product Profile, we discuss the criteria we recommend be used to compare and contrast enterprise-class SCO VTL solutions from different vendors, and then evaluate how the IBM System Storage TS7650G ProtecTIER De-duplication Gateway performs against these criteria. The TS7650G, IBM’s first offering based on technology from the April 2008 acquisition of Diligent Technologies, supports very high single system throughput, multiple PBs of usable capacity, and optional clustering with support for a global de-duplication repository - all important considerations for enterprise SCO VTL prospects.
The Inevitability of Disk-Based Data Protection
Disk is in widespread use as a part of the data protection infrastructure of many large enterprises. Evolving business and regulatory mandates are imposing stringent SLAs on these organizations, pushing them to address backup window, RPO, RTO, and recovery reliability issues, and disk has a lot to offer in these areas. Technologies such as VTLs have made the integration of disk into existing data protection environments a very operationally viable option.
Cost has historically been the single biggest obstacle to integrating disk into existing data protection infrastructures in a widespread fashion, but the availability of SCO technologies such as single instancing, data de-duplication, and compression has brought the $/GB costs for usable disk capacity down significantly. SCO-based solutions first became available in 2004, and the SCO market hit $237M in revenue in 2007. Over the next five years, we expect revenue in the SCO space to surpass $2.2B, with the largest single market sub-segment being SCO VTLs (source: Taneja Group Next Generation Data Protection Emerging Markets Forecast, September 2008). If you are not using disk for data protection purposes today, and you are feeling some pressure around backup window, RPO, RTO, or recovery reliability, you need to take another look at SCO VTLs. It is our opinion that within 1-2 years, SCO VTLs will be in widespread use throughout the enterprise. With data expected to continue to grow at 50% - 60% a year, the economics of SCO technology are just too compelling to ignore.
A Brief Primer on SCO
Taneja Group has chosen the term SCO to apply to the range of technologies used today to minimize the amount of raw storage capacity required to store a given amount of data. Data de-duplication is a common term used by vendors, but it really describes only one set of algorithms used to capacity optimize storage, and many vendors combine de-duplication with other technologies, such as compression, in a multi-step process to achieve the end result. That said, de-duplication is the primary technology that enables solutions to reach dramatic capacity optimization ratios such as 20:1 or more. Given the focus and attention on de-duplication, as well as the fact that it is at the heart of IBM’s TS7650G, let’s take a closer look.
At their most basic level, data de-duplication technologies break data down into smaller recognizable pieces (i.e., elements) and then look for redundancy. As elements come into the system, they are compared against an index that holds a list of elements already stored in the system. When an incoming element is found to be a copy of an element that is already stored, the new element is eliminated and replaced by a pointer to the reference element. In secondary storage environments like backup, where backed up data may change only 3-5% or less per day, there is a significant amount of redundancy that can be identified and removed (a 5% change rate implies a 95% data redundancy rate!). De-duplication algorithms can operate at the file level (also referred to as single instancing) or at the sub-file level. Sub-file level de-duplication
tends to produce higher data reduction ratios. Looking across vendor offerings in the market today, it is not unreasonable to achieve data reduction ratios of 10:1 to 20:1 or greater over time against secondary storage workloads such as backup data sets.
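To make the element-and-index mechanics concrete, here is a minimal sketch of sub-file de-duplication, assuming fixed-size 4 KB elements and SHA-256 fingerprints as the index key. It illustrates the general technique only, not any particular vendor's implementation, and the block size, class, and names are invented for the example.

```python
# Minimal sketch of sub-file (block-level) de-duplication, assuming fixed-size
# elements and SHA-256 fingerprints; names and sizes are illustrative only.
import hashlib

BLOCK_SIZE = 4096  # assumed element size; real products vary


class DedupStore:
    def __init__(self):
        self.index = {}      # fingerprint -> stored element (the reference copies)
        self.pointers = []   # ordered fingerprints that reconstruct every backup stream

    def ingest(self, data: bytes) -> None:
        """Split the stream into elements; store only elements not seen before."""
        for offset in range(0, len(data), BLOCK_SIZE):
            element = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(element).hexdigest()
            if fingerprint not in self.index:
                self.index[fingerprint] = element   # new element: keep the data
            self.pointers.append(fingerprint)       # duplicates cost only a pointer

    def stored_bytes(self) -> int:
        return sum(len(e) for e in self.index.values())


store = DedupStore()
store.ingest(b"A" * 8192 + b"B" * 4096)   # day 1 backup: 12 KB ingested
store.ingest(b"A" * 8192 + b"C" * 4096)   # day 2 backup: largely redundant
print(store.stored_bytes())               # 12288 bytes stored vs. 24576 ingested
```

Real products differ in how they chunk data (fixed vs. variable size elements), how they fingerprint or compare elements, and how they keep the index fast as the repository grows, but the pattern of index lookup plus pointer substitution is the same.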
To see how de-duplication performs in practice in backup applications, let’s walk through an example. Assume a new data set that has never before been backed up. On day 1, it is backed up to disk and de-duplicated (this may occur during the backup or after the backup, but more on that later). On day 2, the data set is once again backed up to disk, but as de-duplication is applied, it can now look at both backups to find common elements. The data reduction ratio achieved on day 2 is very likely to be higher than that achieved on day 1, particularly if the backed up data has not changed much in the 24-hour period. If we assume that 30 days of backups are retained on disk, then it is very likely that there is a lot of redundant data that can be removed and replaced with pointers. The factors affecting data reduction ratios in backup include the day-to-day change rate of the data, the number of days of retained backups, and the specific SCO technology in use.
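As a rough illustration of how change rate and retention drive the ratio, the back-of-the-envelope calculation below assumes a 1TB data set, a 5% daily change rate, 30 retained daily full backups, and that only changed data adds new unique elements; all of these numbers are invented for the example.

```python
# Back-of-the-envelope estimate of the reduction ratio for retained full backups,
# assuming only the daily change rate contributes new unique data (illustrative).
full_backup_tb = 1.0
daily_change_rate = 0.05
retention_days = 30

logical_tb = full_backup_tb * retention_days   # what the backup software wrote over 30 days
physical_tb = full_backup_tb + full_backup_tb * daily_change_rate * (retention_days - 1)

print(f"logical {logical_tb:.0f} TB, physical {physical_tb:.2f} TB, "
      f"reduction ratio {logical_tb / physical_tb:.1f}:1")
# -> logical 30 TB, physical 2.45 TB, reduction ratio 12.2:1
```

Under these assumptions the ratio lands squarely in the 10:1 to 20:1 range discussed above, and it keeps climbing as more days of backups are retained.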
SCO Approaches and Architectures
SCO can be deployed either at the source (backup client) or at the target (backup target). Performing the capacity optimization work requires CPU cycles, so where it is performed may have a performance impact that needs to be evaluated. Source-based SCO typically leverages resources on the backup client to
perform the work, which may impact backup and/or application performance, but it does minimize the amount of data that has to be sent across a network to complete the backup. Source-based SCO may offer certain advantages in remote office/branch office (ROBO) backup environments, but tends to be targeted at environments where each backup client does not have a lot of data.
Target-based SCO presents a backup target, often through a VTL interface, and leverages resources on an appliance or a storage subsystem to perform the work. Target-based SCO supports much greater throughput than source-based SCO, and tends to be targeted at enterprise environments that must handle large backup volumes per client. Target-based SCO can also leverage a global data de-duplication repository during the capacity optimization process far more efficiently than source-based SCO can. Vendors that support a global repository can often offer higher data reduction ratios than those that do not, since they can perform the redundancy identification and elimination across a much larger number of backup clients.
Capacity optimization can be performed through either an in-line or a post-processing approach. In-line processing performs the capacity optimization work as it writes data to the backup target. Post-processing first writes the data to the backup target, then a separate process picks this data back up and runs it through capacity optimization. The operative metric for an end user, assuming that you want your backups
in capacity optimized form, is the amount of time it takes to both ingest the backup and to perform the capacity optimization, not just the time it takes to ingest the backup.
This dichotomy (in-line vs. post-processing) has some key implications for overall system performance that may not be entirely evident. When an in-line vendor quotes a throughput number, that is the single number necessary to evaluate how long it takes to complete the backup and process the data into capacity optimized form, at which point it is ready for any further processing (e.g. 600MB/sec can process roughly 2.16TB/hour). When a post-processing vendor quotes throughput, that generally refers to how long it takes to ingest the data and does not include the post-processing time necessary to capacity optimize it (e.g. 600MB/sec can ingest 2.16TB/hr, but additional time will be required to perform post-processing). To truly understand if a post-processing approach can meet your backup windows, you need to evaluate the total time required to both ingest the backup and to perform the post-processing. Post-processing vendors may argue that since the post-processing is de-coupled from the backup, it doesn’t matter how long it takes. In some environments that may be true, but if you have an 8 hour window to complete your backups and capacity optimize them before you clone data to tapes, or replicate your backup sets to a remote site for DR purposes, and you cannot complete the backup ingest and the post-processing within that 8 hour window, then the post-processing approach will impact your DR RPO.
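The arithmetic below makes that comparison concrete using the 600MB/sec figure above; the 15TB nightly backup volume and the assumption that the post-processing pass also runs at 600MB/sec are ours, purely for illustration.

```python
# Illustrative time-to-capacity-optimized-backup for in-line vs. post-processing.
# The 15 TB volume and the post-processing rate are assumptions for the example.
backup_tb = 15.0
ingest_mb_per_sec = 600.0
postprocess_mb_per_sec = 600.0   # assumed rate of the separate optimization pass

tb_per_hour = ingest_mb_per_sec * 3600 / 1_000_000   # 600MB/sec ~= 2.16TB/hour

inline_hours = backup_tb / tb_per_hour
postproc_hours = inline_hours + backup_tb / (postprocess_mb_per_sec * 3600 / 1_000_000)

print(f"in-line: {inline_hours:.1f} h  post-processing: {postproc_hours:.1f} h")
# -> in-line: 6.9 h  post-processing: 13.9 h
```

Measured against an 8 hour window, the in-line path finishes with room to spare while the post-processing path, at these assumed rates, does not complete its capacity optimization in time.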
Without a doubt, in-line approaches require less overall physical storage capacity than post-process approaches. For a given environment exhibiting a 10:1 capacity optimization ratio, the system will write 100GB of data for every 1TB it backs up. A post-process method will need to write that 1TB to disk first, then cycle it through post-processing, eventually shrinking the storage required to store that backup to 100GB. Thus, post-processing systems must maintain spare capacity to allow for the initial ingest of data prior to the de-duplication process. Post-processing products clearly require more capacity for a given environment than in-line solutions to allow for this buffer, but the actual amount will vary based on the specific post-processing approach being used.
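A rough sizing sketch of that buffer, assuming the 10:1 ratio and 1TB nightly backup used above and a staging buffer equal to one full night of raw ingest (actual buffers vary by product):

```python
# Rough sizing sketch of the extra staging capacity a post-processing target needs.
# The one-night staging buffer is an assumption for illustration only.
nightly_backup_tb = 1.0
reduction_ratio = 10.0

optimized_tb = nightly_backup_tb / reduction_ratio      # 0.1 TB lands in the repository
inline_needed_tb = optimized_tb                         # in-line writes only the optimized form
postproc_needed_tb = optimized_tb + nightly_backup_tb   # raw 1 TB must be staged first

print(f"in-line: {inline_needed_tb:.1f} TB  post-processing: {postproc_needed_tb:.1f} TB")
```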
Post-processing approaches introduce additional time before a capacity optimized backup is ready for further processing, such as cloning to tape, distributing electronically to a DR site, etc. If additional time and capacity are available, then you may be indifferent between the two approaches, but if they are not, then this is something to consider when evaluating solutions. Note that some post-processing vendors allow the post-processing to be started against a particular backup job before it completes, thereby reducing both the capacity and time requirements that would otherwise be associated with approaches which perform these operations sequentially. In-line approaches, however, will generally complete the overall backup processing (ingestion + capacity optimization) faster than post-processing approaches since they complete their work in a single pass.
Copyright The TANEJA Group, Inc. 2008. All Rights Reserved
87 Elm Street, Suite 900 Hopkinton, MA 01748 Tel: 508-435-5040 Fax: 508-435-1530 www.tanejagroup.com