Niraj Tolia†‡, Jan Harkes†, Michael Kozuch‡, M. Satyanarayanan
†Carnegie Mellon University, ‡Intel Research Pittsburgh
Abstract
We describe a technique called lookaside caching that combines the
strengths of distributed file systems and portable storage devices,
while negating their weaknesses. In spite of its simplicity, this technique proves to be powerful and versatile. By unifying distributed
storage and portable storage into a single abstraction, lookaside
caching allows users to treat devices they carry as merely performance and availability assists for distant file servers. Careless use
of portable storage has no catastrophic consequences.
1 Introduction
Floppy disks were the sole means of sharing data
across users and computers in the early days of personal computing. Although they were trivial to use,
considerable discipline and foresight were required of
users to ensure data consistency and availability, and to
avoid data loss: if you did not have the right floppy
at the right place and time, you were in trouble! These
limitations were overcome by the emergence of distributed file systems such as NFS [17], Netware [8],
LanManager [24], and AFS [7]. In such a system, responsibility for data management is delegated to the
distributed file system and its operational staff.
Personal storage has come full circle in the recent
past. There has been explosive growth in the availability of USB- and Firewire-connected storage devices such as flash memory keychains and portable disk
drives. Although very different from floppy disks in
capacity, data transfer rate, form factor, and longevity,
their usage model is no different. In other words, they
are just glorified floppy disks and suffer from the same
limitations mentioned above. Why then are portable
storage devices in such demand today? Is there a way
to use them that avoids the messy mistakes of the past,
where a user was often awash in floppy disks trying to
figure out which one had the latest version of a specific
file? If loss, theft or destruction of a portable storage
device occurs, how can one prevent catastrophic data
loss? Since human attention grows ever more scarce,
can we reduce the data management demands on attention and discipline in the use of portable devices?
We focus on these and related questions in this paper. We describe a technique called lookaside caching
that combines the strengths of distributed file systems and portable storage devices, while negating their
weaknesses. In spite of its simplicity, this technique
proves to be powerful and versatile. By unifying “storage in the cloud” (distributed storage) and “storage in
the hand” (portable storage) into a single abstraction,
lookaside caching allows users to treat devices they
carry as merely performance and availability assists for
distant file servers. Careless use of portable storage has
no catastrophic consequences.
Lookaside caching has very different goals and design philosophy from a PersonalRAID system [18], the
only previous research that we are aware of on usage models for portable storage devices. Our starting
point is the well-entrenched base of distributed file systems in existence today. We assume that these are successful because they offer genuine value to their users.
Hence, our goal is to integrate portable storage devices
into such a system in a manner that is minimally disruptive of its existing usage model. In addition, we
make no changes to the native file system format of
a portable storage device; all we require is that the device be mountable as a local file system at any client of
the distributed file system. In contrast, PersonalRAID
takes a much richer view of the role of portable storage
devices. It views them as first-class citizens rather than
as adjuncts to a distributed file system. It also uses a
customized storage layout on the devices. Our design
and implementation are much simpler, but also more
limited in functionality.
We begin in Section 2 by examining the strengths
and weaknesses of portable storage and distributed file
systems. In Sections 3 and 4, we describe the design
and implementation of lookaside caching. We quantify
the performance benefit of lookaside caching in Section 5, using three different benchmarks. We explore
broader use of lookaside caching in Section 6, and conclude in Section 7 with a summary.
2 Background
To understand the continuing popularity of portable
storage, it is useful to review the strengths and weaknesses of portable storage and distributed file systems.
While there is considerable variation in the designs of
distributed file systems, there is also a substantial degree of commonality across them. Our discussion below focuses on these common themes.
Performance: A portable storage device offers uniform performance at all locations, independent of factors such as network connectivity, initial cache state,
and temporal locality of references. Except for a few
devices such as floppy disks, the access times and bandwidths of portable devices are comparable to those of
local disks. In contrast, the performance of a distributed file system is highly variable. With a warm
client cache and good locality, performance can match
local storage. With a cold cache, poor connectivity and
low locality, performance can be intolerably slow.
Availability: If you have a portable storage device
in hand, you can access its data. Short of device failure, which is very rare, no other common failures prevent data access. In contrast, distributed file systems
are susceptible to network failure, server failure, and a
wide range of operator errors.
Robustness: A portable storage device can easily
be lost, stolen or damaged. Data on the device becomes permanently inaccessible after such an event.
In contrast, data in a distributed file system continues
to be accessible even if a particular client that uses it
is lost, stolen or damaged. For added robustness, the
operational staff of a distributed file system perform
regular backups and typically keep some of the backups off site to allow recovery after catastrophic site
failure. Backups also help recovery from user error:
if a user accidentally deletes a critical file, he can recover a backed-up version of it. In principle, a highly
disciplined user could implement a careful regimen of
backup of portable storage to improve robustness. In
practice, few users are sufficiently disciplined and well-organized to do this. It is much simpler for professional
staff to regularly back up a few file servers, thus benefiting all users.
Sharing/Collaboration: The existence of a common name space simplifies sharing of data and collaboration between the users of a distributed file system.
This is much harder if done by physical transfers of devices. If one is restricted to sharing through physical
devices, a system such as PersonalRAID can be valuable in managing complexity.
Consistency: Without explicit user effort, a distributed file system presents the latest version of a file
when it is accessed. In contrast, a portable device has
to be explicitly kept up to date. When multiple users
can update a file, it is easy to get into situations where
a portable device has stale data without its owner being
aware of this fact.
Capacity: Any portable storage device has finite
capacity. In contrast, the client of a distributed file
system can access virtually unlimited amounts of data
spread across multiple file servers. Since local storage
on the client is merely a cache of server data, its size
only limits working set size rather than total data size.
Security: The privacy and integrity of data on
portable storage devices rely primarily on physical security. A further level of safety can be provided by
encrypting the data on the device, and by requiring a
password to mount it. These can be valuable as a second layer of defense in case physical security fails. Denial of service is impossible if a user has a portable
storage device in hand. In contrast, the security of data
in a distributed file system is based on more fragile assumptions. Denial of service may be possible through
network attacks. Privacy depends on encryption of
network traffic. Fine-grain protection of data through
mechanisms such as access control lists is possible, but
relies on secure authentication using a mechanism such
as Kerberos [19].
Ubiquity: A distributed file system requires operating system support. In addition, it may require environmental support such as Kerberos authentication
and specific firewall configuration. Unless a user is
at a client that meets all of these requirements, he
cannot access his data in a distributed file system.
In contrast, portable storage only depends on widely supported low-level hardware and software interfaces.
If a user sits down at a random machine, he can be
much more confident of accessing data from portable
storage in his possession than from a remote file server.
3 Lookaside Caching
Our goal is to exploit the performance and availability advantages of portable storage to improve these
same attributes in a distributed file system. The resulting design should preserve all other characteristics of
the underlying distributed file system. In particular,
there should be no compromise of robustness, consistency or security. There should also be no added complexity in sharing and collaboration. Finally, the design
should be tolerant of human error: improper use of the
portable storage device (such as using the wrong device or forgetting to copy the latest version of a file to
it) should not hurt correctness.
Lookaside caching is an extension of AFS2-style
whole-file caching [7] that meets the above goals. It is
based on the observation that virtually all distributed
file system protocols provide separate remote procedure calls (RPCs) for access of meta-data and access of
data content. Lookaside caching extends the definition
of meta-data to include a cryptographic hash of data
content. This extension only increases the size of meta-data by a modest amount: just 20 bytes if SHA-1 [11]
is used as the hash. Since hash size does not depend on
file length, it costs very little to obtain and cache hash
information even for many large files. Using POSIX
terminology, caching the results of “ls -lR” of a large
tree is feasible on a small client, even if there is not
enough cache space for the contents of all the files in
the tree. This continues to be true even if one augments
stat information for each file or directory in the tree
with its SHA-1 hash.
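As a rough illustration of how little the hash adds to per-object meta-data, the following Python sketch pairs stat information with a SHA-1 digest. The `Metadata` record and the `describe` and `sha1_of_file` helpers are hypothetical names chosen for this example; they are not part of Coda or of the protocol described here.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    """Hypothetical meta-data record: stat fields plus a content hash."""
    length: int
    mtime: float
    sha1: bytes  # always 20 bytes, regardless of file length


def sha1_of_file(path, chunk_size=1 << 16):
    """Hash a file incrementally so large files need not fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def describe(path):
    """Build the augmented meta-data for one file."""
    st = os.stat(path)
    return Metadata(length=st.st_size, mtime=st.st_mtime,
                    sha1=sha1_of_file(path))
```

Since a SHA-1 digest is 20 bytes, augmenting the meta-data of 100,000 files would cost about 2 MB of additional cache space, whatever the sizes of the files themselves.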
Once a client possesses valid meta-data for an object, it can use the hash to redirect the fetch of data
content. If a mounted portable storage device has a file
with matching length and hash, the client can obtain the
contents of the file from the device rather than from the
file server. Whether it is beneficial to do this depends,
of course, on factors such as file size, network bandwidth, and device transfer rate. The important point is
that possession of the hash gives a degree of freedom
that clients of a distributed file system do not possess
today.
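Whether redirection pays off can be estimated with simple arithmetic. The sketch below is an illustrative model and not a policy taken from this paper: it assumes transfer time is a fixed latency plus size divided by sustained rate, and all function names, parameters and numbers are hypothetical.

```python
def transfer_seconds(size_bytes, bytes_per_second, fixed_latency=0.0):
    """Estimated time to move a file over a link or from a device."""
    return fixed_latency + size_bytes / bytes_per_second


def prefer_device(size_bytes, network_rate, device_rate,
                  network_latency=0.0, device_latency=0.0):
    """Return True if the portable device is estimated to be faster.

    Rates are in bytes per second; latencies are in seconds.
    """
    net = transfer_seconds(size_bytes, network_rate, network_latency)
    dev = transfer_seconds(size_bytes, device_rate, device_latency)
    return dev < net
```

For instance, for a hypothetical 10 MB file, a 125 KB/s network link and a 1 MB/s device, the estimates are 80 s and 10 s respectively, so the device would be preferred. For very small files on a fast network, a device's fixed latency can reverse the outcome.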
Since lookaside caching treats the hash as part of
the meta-data, there is no compromise in consistency.
The underlying cache coherence protocol of the distributed file system determines how closely client state
tracks server state. There is no degradation in the accuracy of this tracking if the hash is used to redirect
access of data content. To ensure no compromise in security, the file server should return a null hash for any
object on which the client only has permission to read
the meta-data.
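The null-hash rule can be sketched as a server-side check. Without it, a client allowed to read only meta-data could compare the hash against known content and so learn something about data it may not read. The all-zero encoding of the null hash and the `hash_for_client` helper are assumptions made for this example; the text does not specify how a null hash is represented.

```python
import hashlib

# Assumed encoding of "no hash available": 20 zero bytes.
NULL_HASH = bytes(20)


def hash_for_client(content, can_read_data):
    """Return the SHA-1 of content only to clients allowed to read it."""
    if not can_read_data:
        return NULL_HASH
    return hashlib.sha1(content).digest()
```

A client that receives the null hash simply has nothing to look up and falls back to the normal fetch path, where the server's usual access checks apply.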
Lookaside caching can be viewed as a degenerate
case of the use of file recipes, as described by Tolia et
al. [22]. In that work, a recipe is an XML description of
file content that enables block-level reassembly of the
file from content-addressable storage. One can view
the hash of a file as the smallest possible recipe for it.
The implementation using recipes is considerably more
complex than our support for lookaside caching. In return for this complexity, synthesis from recipes may
succeed in many situations where lookaside fails.
4 Prototype Implementation
We have implemented lookaside caching in the
Coda file system on Linux. The user-level implementation of the Coda client cache manager and server code
greatly simplified our effort since no kernel changes
were needed. The implementation consists of four
parts: a small change to the client-server protocol; a
quick index check (the “lookaside”) in the code path
for handling a cache miss; a tool for generating lookaside indexes; and a set of user commands to include or
exclude specific lookaside devices.
The protocol change replaces two RPCs,
ViceGetAttr() and ViceValidateAttrs(),
with the extended calls ViceGetAttrPlusSHA()
and ViceValidateAttrsPlusSHA() that have an
extra parameter for the SHA-1 hash of the file.
ViceGetAttr() is used to obtain meta-data for a
file or directory, while ViceValidateAttrs() is
used to revalidate cached meta-data for a collection
of files or directories when connectivity is restored to
a server. Our implementation preserves compatibility
with legacy servers. If a client connects to a server that
has not been upgraded to support lookaside caching, it
falls back to using the original RPCs mentioned above.
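The fallback to legacy servers can be pictured with the following minimal sketch. Only the RPC names come from the text; the Python stubs, the `RpcNotSupported` exception and the shape of the returned attributes are hypothetical. How the real client detects an older server is not described here, so signalling it with an exception is an assumption.

```python
class RpcNotSupported(Exception):
    """Raised by the hypothetical stub when a server lacks an RPC."""


class LegacyServer:
    """Stand-in for a server that predates lookaside caching."""

    def ViceGetAttr(self, fid):
        return {"fid": fid, "length": 5}

    def ViceGetAttrPlusSHA(self, fid):
        raise RpcNotSupported("ViceGetAttrPlusSHA")


class UpgradedServer(LegacyServer):
    """Stand-in for a server that returns a hash with the attributes."""

    def ViceGetAttrPlusSHA(self, fid):
        attrs = self.ViceGetAttr(fid)
        return attrs, bytes.fromhex("aa" * 20)  # placeholder 20-byte hash


def get_attr(server, fid):
    """Try the extended RPC first; fall back to the original one."""
    try:
        return server.ViceGetAttrPlusSHA(fid)
    except RpcNotSupported:
        # No hash is available, so lookaside is skipped for this object.
        return server.ViceGetAttr(fid), None
```

With a legacy server the client obtains ordinary meta-data and no hash, so later cache misses go to the file server exactly as they did before.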
The lookaside occurs just before the execution of
the ViceFetch() RPC to fetch file contents. Before
network communication is attempted, the client consults one or more lookaside indexes to see if a local file
with identical SHA-1 value exists. Trusting in the collision resistance of SHA-1 [10], a copy operation on the
local file can then be a substitute for the RPC. To detect version skew between the local file and its index,
the SHA-1 hash of the local file is re-computed. In case
of a mismatch, the local file substitution is suppressed
and the cache miss is serviced by contacting the file
server. Coda’s consistency model is not compromised,
although a small amount of work is wasted
on the lookaside path.
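The cache-miss path described above can be sketched as follows. This is a simplified illustration, not the Coda code: each index is modeled as a dictionary from SHA-1 digest to local path, `fetch_from_server` is a hypothetical callable standing in for the ViceFetch() RPC, and the function names are invented for the example.

```python
import hashlib
import shutil


def sha1_of_file(path, chunk_size=1 << 16):
    """Hash a file incrementally."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def service_cache_miss(wanted_sha1, indexes, cache_path, fetch_from_server):
    """Fill cache_path from a local file if possible, else from the server.

    indexes: list of dicts mapping SHA-1 digest -> local file path.
    Returns "lookaside" or "server" to show which path was taken.
    """
    if wanted_sha1 is not None:
        for index in indexes:
            candidate = index.get(wanted_sha1)
            if candidate is None:
                continue
            try:
                # Re-hash to detect a file that changed after indexing.
                if sha1_of_file(candidate) == wanted_sha1:
                    shutil.copyfile(candidate, cache_path)
                    return "lookaside"
            except OSError:
                pass  # device removed or file missing: keep looking
    fetch_from_server(cache_path)
    return "server"
```

A stale index entry costs one wasted read of the local file, after which the miss is serviced by the server as usual, so correctness never depends on the index being current.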
The index generation tool walks the file tree rooted