Evaluating Enterprise-Class VTLs: The IBM System Storage
TS7650G ProtecTIER De-duplication Gateway
September 2008
Increasingly stringent service level agreements (SLAs) are putting significant
pressure on large enterprises to address backup window, recovery point
objective (RPO), recovery time objective (RTO), and recovery reliability issues.
While the use of disk storage technology offers clear functional advantages for
resolving these issues, disk’s high cost has been an impediment to wide-scale
deployment in the data protection domain of the enterprise data center. Now that storage
capacity optimization (SCO) technologies like single instancing, data de-duplication, and
compression are available to reduce the amount of raw storage capacity required to store a
given amount of data, the $/GB costs for disk-based secondary storage can be reduced by 10 to
20 times. Virtual tape libraries (VTLs), disk-based storage subsystems that appear to backup
software as tape drives or libraries, are one of the most popular ways to integrate disk into a
pre-existing data protection infrastructure because they require very little change to existing
backup and restore processes. While VTLs by themselves are interesting, SCO VTLs that
leverage data de-duplication and other related technologies are compelling.
Given high data growth rates, stringent SLAs for data protection, and the need to contain
spending, enterprise customers really need to take a look at SCO technologies. Taneja Group
predicts that large enterprises will rapidly move to SCO VTLs over the next 1-2 years while the
market for non-SCO VTLs (VTLs that do not have integrated SCO technologies) dwindles
rapidly. Data growth rates in the 50%-60% range will push this transition as much as the clear
cost advantages that SCO VTLs offer over non-SCO VTLs. While SCO is a key
requirement, performance remains the number one need of the enterprise data protection
environment. After all, if the SLA for completing the day’s backup cannot be met, all other
criteria are moot. This has significant implications for vendors of SCO VTLs. Their solutions
must provide the capacity optimization that the enterprise customer demands, while enabling
enterprise-class performance. Vendors that can provide both efficient SCO technology and
enterprise-class performance offer a very compelling value proposition.
In this Product Profile, we discuss the criteria we recommend be used to compare and contrast
enterprise-class SCO VTL solutions from different vendors, and then evaluate how the IBM
System Storage TS7650G ProtecTIER De-duplication Gateway performs against these criteria.
The TS7650G, IBM’s first offering based on technology from the April 2008 acquisition of
Diligent Technologies, supports very high single-system throughput, multiple PBs of usable
capacity, and optional clustering with support for a global de-duplication repository, all
important considerations for enterprise SCO VTL prospects.
The Inevitability of Disk-Based Data Protection
Disk is in widespread use as a part of the
data protection infrastructure of many large
enterprises. Evolving business and
regulatory mandates are imposing stringent
SLAs on these organizations, pushing them
to address backup window, RPO, RTO, and
recovery reliability issues, and disk has a lot
to offer in these areas. Technologies such as
VTLs have made the integration of disk into
existing data protection environments a very
operationally viable option.
Cost has historically been the single biggest
obstacle to integrating disk into existing data
protection infrastructures in a widespread
fashion, but the availability of SCO
technologies such as single instancing, data
de-duplication, and compression has
brought the $/GB costs for usable disk
capacity down significantly. SCO-based
solutions first became available in 2004, and
the SCO market hit $237M in revenue in
2007. Over the next five years, we expect
revenue in the SCO space to surpass $2.2B,
with the largest single market sub-segment
being SCO VTLs (source: Taneja Group Next
Generation Data Protection Emerging
Markets Forecast September 2008). If you
are not using disk for data protection
purposes today, and you are feeling some
pressure around backup window, RPO, RTO,
or recovery reliability, you need to take
another look at SCO VTLs. It is our opinion
that within 1-2 years, SCO VTLs will be in
widespread use throughout the enterprise.
With data expected to continue to grow at
50% - 60% a year, the economics of SCO
technology are just too compelling to ignore.
A Brief Primer on SCO
Taneja Group has chosen the term
SCO to apply to the range of technologies
that are used today to minimize the amount
of raw storage capacity required to store a
given amount of data. Data de-duplication is
a common term in use by vendors, but this
term really only describes one set of
algorithms used to capacity optimize storage.
Many vendors use de-duplication along
with other technologies, such as
compression, in a multi-step process to
achieve the end result. That said, de-duplication is the primary technology that
enables solutions to reach dramatic capacity
optimization ratios such as 20:1 or more. Given
the focus and attention on de-duplication, as well as the fact that it is at the heart of
IBM’s TS7650G, let’s take a closer look.
At their most basic level, data de-duplication
technologies break data down into smaller
recognizable pieces (i.e., elements) and then
look for redundancy. As elements come into
the system, they are compared against an
index which holds a list of elements that are
already stored in the system. When an
incoming element is found to be a copy of an
element that is already stored in the system,
the new element is eliminated and replaced
by a pointer to the reference element. In
secondary storage environments like backup
where backed up data may only change 3-5%
or less per day, there is a significant amount
of redundancy that can be identified and
removed (a 5% change rate implies a 95%
data redundancy rate!). De-duplication
algorithms can operate at the file level (this is
also referred to as single instancing) or at the
sub-file level. Sub-file level de-duplication
tends to produce higher data reduction
ratios. Looking across vendor offerings in
the market today, it is not unreasonable to
achieve data reduction ratios of 10:1 to 20:1
or greater over time against secondary
storage such as backup data sets.
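To make the mechanism concrete, below is a minimal, illustrative sketch of one common hash-index approach to sub-file de-duplication. This is not how the TS7650G or any other specific product implements the technique; the fixed 8KB element size and SHA-256 fingerprints are simplifying assumptions.

```python
# Illustrative sketch only: a toy fixed-size-element de-duplication store.
# Real products use more sophisticated chunking, fingerprinting, verification,
# and indexing than this simplified example.
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed 8KB elements


class DedupStore:
    def __init__(self):
        self.index = {}     # fingerprint -> reference element already stored
        self.pointers = []  # the logical backup stream, recorded as fingerprints

    def ingest(self, data: bytes) -> None:
        """Break incoming data into elements and store each unique element once."""
        for offset in range(0, len(data), CHUNK_SIZE):
            element = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(element).hexdigest()
            if fingerprint not in self.index:
                self.index[fingerprint] = element   # new element: keep the reference copy
            self.pointers.append(fingerprint)       # duplicate elements become pointers

    def reduction_ratio(self) -> float:
        """Logical data written by the backup application vs. physical data stored."""
        logical = len(self.pointers) * CHUNK_SIZE
        stored = sum(len(element) for element in self.index.values())
        return logical / stored if stored else 0.0
```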
To illustrate how de-duplication performs
in practice in backup applications, consider
a new data set that has never before been
backed up. On
day 1, it is backed up to disk and de-duplicated (this may occur during the
backup or after the backup, but more on that
later). On day 2, the data set is once again
backed up to disk, but this time the
de-duplication process can compare the new
backup against the data already stored to
find common elements. The data reduction
ratio achieved on day 2 is very likely to be
higher than that achieved on day 1,
particularly if the backed up data has not
changed much in the 24-hour period. If we
assume that 30 days of backups are retained
on disk, then it is very likely that there is a lot
of redundant data that can be removed and
replaced with pointers. The factors affecting
data reduction ratios in backup include the
change rate of data (day to day), the number
of days of retained backups, and the specific
SCO technology in use.
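A rough back-of-the-envelope model shows how these factors combine. The 5% daily change rate, 30-day retention, and full-backup-every-day schedule below are illustrative assumptions, not measured figures:

```python
# Rough, illustrative model: reduction ratio for a set of retained daily full backups,
# assuming only each day's changed data is new to the repository.
full_backup_tb = 1.0   # assumed size of one full backup
daily_change = 0.05    # assumed 5% of the data changes each day
retained_days = 30     # assumed number of daily backups retained on disk

logical_tb = full_backup_tb * retained_days                            # 30 TB written by backups
stored_tb = full_backup_tb * (1 + daily_change * (retained_days - 1))  # ~2.45 TB of unique data
print(f"approximate reduction ratio: {logical_tb / stored_tb:.1f}:1")  # ~12.2:1
```

Lower daily change rates or longer retention periods push the ratio higher, which is why ratios of 20:1 or more are achievable over time.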
SCO Approaches and Architectures
SCO can be deployed either at the source
(backup client) or at the target (backup
target). Performing the capacity
optimization work requires CPU cycles, so
where it is performed may have a
performance impact that needs to be
evaluated. Source-based SCO typically
leverages resources on the backup client to
perform the work, which may impact backup
and/or application performance, but it does
minimize the amount of data that has to be
sent across a network to complete the
backup. Source-based SCO may offer certain
advantages in remote office/branch office
(ROBO) backup environments, but tends to
be targeted at environments where each
backup client does not have a lot of data.
Target-based SCO presents a backup target,
often through a VTL interface, and leverages
resources on an appliance or a storage
subsystem to perform the work. Target-based SCO supports much greater
throughput than source-based, and tends to
be targeted for use in enterprise
environments to handle large backup
volumes per client. Target-based SCO can
also leverage a global data de-duplication
repository during the capacity optimization
process far more efficiently than source-based SCO
can. Vendors that support a global
repository can often offer higher data
reduction ratios than those that do not since
they can perform the redundancy
identification and elimination across a much
larger number of backup clients.
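A toy example illustrates why a global repository helps; the client names and element fingerprints below are invented purely for illustration:

```python
# Illustrative sketch: elements (represented here by small integer fingerprints)
# backed up by three clients. Elements shared across clients are only removed
# as redundant when the de-duplication repository is global.
clients = {
    "client_a": {1, 2, 3, 4, 5},
    "client_b": {3, 4, 5, 6, 7},   # shares elements 3-5 with client_a
    "client_c": {1, 5, 6, 8, 9},
}

per_client_stored = sum(len(elements) for elements in clients.values())  # 15 elements
global_stored = len(set().union(*clients.values()))                      # 9 unique elements

print(per_client_stored, global_stored)
```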
Capacity optimization can be performed
through either an in-line or a post-processing approach. In-line processing
performs the capacity optimization work as it
is writing data to the backup target. Post-processing first writes the data
to the backup target and then, through a
separate process, reads it back and
runs it through the capacity optimization
user, assuming that you want your backups
in capacity optimized form, is the amount of
time it takes to both ingest the backup and to
perform the capacity optimization, not just
the time it takes to ingest the backup.
This dichotomy (in-line vs. post-processing)
has some key implications on overall system
performance that may not be entirely
evident. When an in-line vendor quotes a
throughput number, that is the single
number necessary to evaluate how long it
takes to complete the backup and process the
data into capacity optimized form, at which
point it is ready for any further processing
(e.g. 600MB/sec can process roughly
2.16TB/hour). When a post-processing
vendor quotes throughput, that generally
refers to how long it takes to ingest the data
and does not include the post-processing
time necessary to capacity optimize it (e.g.
600MB/sec can ingest 2.16TB/hr but
additional time will be required to perform
post-processing). To truly understand if a
post-processing approach can meet your
backup windows, you need to evaluate the
total time required to both ingest the backup
and to perform the post-processing. Post-processing vendors may argue that since the
post-processing is de-coupled from the
backup, it doesn’t matter how long it takes.
In some environments that may be true. But
if you have an 8-hour window in which to
complete your backups and capacity optimize
them before you clone data to tape or
replicate your backup sets to a remote site
for DR purposes, and you cannot complete
both the backup ingest and the post-processing
within that window, then the post-processing
approach will impact your DR RPO.
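A simple timing sketch shows why this matters against a fixed window. The 10TB nightly backup, 600MB/sec ingest rate, and 400MB/sec post-processing rate below are assumptions for illustration only, not vendor measurements:

```python
# Illustrative comparison: elapsed time until the backup is in capacity-optimized form.
backup_tb = 10.0           # assumed size of the nightly backup
ingest_mb_s = 600.0        # quoted ingest throughput in both scenarios
post_process_mb_s = 400.0  # assumed throughput of the separate optimization pass
mb_per_tb = 1_000_000

ingest_hours = backup_tb * mb_per_tb / ingest_mb_s / 3600                     # ~4.6 hours
inline_total = ingest_hours                                                    # optimized during ingest
post_total = ingest_hours + backup_tb * mb_per_tb / post_process_mb_s / 3600  # ~11.6 hours

print(f"in-line: {inline_total:.1f} h, post-processing: {post_total:.1f} h")
```

Against an 8-hour window, the in-line system delivers capacity-optimized backups with hours to spare, while the post-processing system finishes well outside the window unless the two phases can overlap.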
Without a doubt, in-line approaches require
less overall physical storage capacity than
post-process approaches. For a given
environment exhibiting a 10:1 capacity
optimization ratio, an in-line system will write
100GB of data for every 1TB it backs up. A
post-process method will need to write that
1TB to disk first, then cycle it through post-processing, eventually shrinking the storage
required to store that backup to 100GB.
Thus, post-processing systems must
maintain spare capacity to allow for the
initial ingest of data prior to the de-duplication process. Post-processing
products clearly require more capacity for a
given environment than in-line solutions to
allow for this buffer, but the actual amount
will vary based on the specific post-processing approach being used.
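Using the 1TB-per-night and 10:1 figures from the text, a quick sketch of the buffer requirement (the assumption that only a single night's ingest must be buffered is a simplification):

```python
# Illustrative sketch of the landing buffer a post-processing system needs.
nightly_backup_gb = 1000.0  # 1TB nightly backup, per the example in the text
ratio = 10.0                # 10:1 capacity optimization ratio

optimized_gb = nightly_backup_gb / ratio  # 100GB retained after optimization
inline_peak_gb = optimized_gb             # in-line never lands the raw 1TB on disk
post_peak_gb = nightly_backup_gb          # post-processing must land the full 1TB first

print(f"extra landing capacity for post-processing: {post_peak_gb - inline_peak_gb:.0f} GB")
```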
Post-processing approaches introduce
additional time before a capacity optimized
backup is ready for further processing, such
as cloning to tape, distributing electronically
to a DR site, etc. If additional time and
capacity are available, then you may be
indifferent between the two approaches, but
if they are not, then this is something to
consider when evaluating solutions. Note
that some post-processing vendors allow the
post-processing to be started against a
particular backup job before it completes,
thereby reducing both the capacity and time
requirements that would otherwise be
associated with approaches which perform
these operations sequentially. In-line
approaches, however, will generally complete
the overall backup processing (ingestion +
capacity optimization) faster than post-processing approaches, since they complete
their work in a single pass.
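The benefit of overlapping ingest and post-processing can be sketched with the same kind of rough arithmetic; the rates and the 30-minute head start below are assumptions for illustration:

```python
# Illustrative sketch: sequential vs. overlapped ingest and post-processing.
ingest_hours = 4.6    # assumed backup ingest time
post_hours = 6.9      # assumed post-processing time for the same job
head_start = 0.5      # assumed: post-processing begins 30 minutes into the backup

sequential_total = ingest_hours + post_hours                   # ~11.5 hours
overlapped_total = max(ingest_hours, head_start + post_hours)  # ~7.4 hours

print(f"sequential: {sequential_total:.1f} h, overlapped: {overlapped_total:.1f} h")
```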