Title: The Hashing Approach to the Internet File System Problem
The Hashing Approach to the Internet File System Problem
By Gabriel Mizrahi
Supervised by Dr. Yosi Ben-Asher
Purpose
In this work we consider the problem of developing an efficient distributed file system over the Internet. It should serve many clients performing concurrent I/O to a virtual shared H.D. created from many local disks geographically dispersed over the Internet. The key feature of the proposed Internet File System (IFS) is that the mapping between files and physical blocks is based on hashing rather than on global data structures.
Topics of Discussion
- IFS requirements.
- Overview of existing File Systems.
- The metadata problem in DFSs.
- Usage of DMM simulations.
- IFS components and APIs.
- Semantics and Cache Consistency.
- VOD models.
- Experimental results and system simulation.
- Conclusions.
IFS Requirements
- It should allow a dynamic set of unknown clients to access files over the Internet concurrently.
- The storage space should simulate a shared H.D. created from many local disks on remote servers.
- It should support extremely fast and fully distributed mapping between files and physical blocks.
- It should support consistent cooperative caching.
- There should be good load balancing in the system.
- Due to the relatively long communication times, each access to a file (read/write) should involve very few servers.
- No central data structures should be used.
IFS vs. DFS configurations
Overview of existing DFSs
- NFS
  - Files are distributed between the servers.
  - Timing-dependent semantics (3 sec. for files, 30 sec. for directories).
  - Completely stateless service.
  - A network service that lets independent workstations share remote files transparently.
- AFS
  - Uses session semantics.
  - Whole-file caching on local disks.
  - I/O is served directly from the cache without involving the servers.
  - Server-initiated approach for cache validation.
  - Designed for large-scale systems.
- xFS
  - A serverless design.
  - Distributes data and metadata across multiple machines, including clients.
  - Token-based cache consistency.
  - Implements cooperative caching.
- Sprite
  - Applies UNIX semantics by disabling caches.
  - Designed for an environment of diskless workstations with huge main memories.
- GFS
  - Designed for shared systems over a SAN.
  - Uses extendible hashing for the metadata implementation.
  - Implements locks on the storage devices to maintain coherence of files and metadata.
Issues to consider for DFSs
- Generally organized according to the client-server model.
- The client side supports caching.
- Support for server replication, to meet scalability, reliability, and load-balancing requirements.
Main differences between DFSs
- Data and metadata distribution.
- The semantics of file sharing.
- The client cache granularity and management.
The Metadata Problem in DFSs
- Search over directory and i-node trees.
- Allow concurrent read/write and delete operations while keeping the metadata consistent.
- Accessing the metadata should not pass through too many servers.
Ways of achieving these goals
- Centralization
- Replication
- Partitioning
The Metadata Solution of IFS
- Do not use search trees at all.
- Base the mapping on hash values, viewing the shared disk as a large hash table partitioned between the servers' disks.
- No need to maintain a global list of free blocks (see the sketch below).
- Servers and clients work independently; the servers never exchange messages among themselves.
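As a rough illustration of why no global metadata is needed, the sketch below maps a (file, block) pair straight to a server by hashing alone. The names and constants are assumptions for exposition, and SHA-256 is only a stand-in for the universal hash family described later.

```python
import hashlib

P = 5   # number of IFS servers (illustrative)

def block_location(file_id: str, block_index: int) -> int:
    """Map a (file, block) pair straight to a server by hashing.

    No directory tree, i-node search, or free-block list is consulted;
    every client computes the same location independently. SHA-256 is a
    stand-in here for the universal hash family described later.
    """
    key = f"{file_id}:{block_index}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % P

print(block_location("movie.mp4", 17))   # identical on every client
```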
The IFS metadata solution scheme
The DMM Model
- Let n be the number of processors, each having a local memory module.
- Let m be the number of data items in a global shared address space.
- The goal is to find a scheme that distributes the shared memory cells over the processors' memory modules such that any set of addresses accessed by the processors is partitioned roughly equally between the memory modules (see the sketch below).
- The result is that the load on the simulated shared memory, and the access time to it, are minimized.
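As a minimal sketch of this property (the parameters and the specific hash family are illustrative assumptions), the following simulation spreads a batch of accesses over n modules with a randomly drawn hash function and compares the heaviest module against the ideal load:

```python
import random
from collections import Counter

n = 8             # number of processors / memory modules
m = 1 << 16       # size of the shared address space
k = 65537         # a prime >= m, used by the hash function

# Draw one function h(x) = ((a*x + b) mod k) mod n at random.
a, b = random.randrange(1, k), random.randrange(k)

def h(x: int) -> int:
    return ((a * x + b) % k) % n

# One access "cycle": a set of addresses the processors want to reach.
requests = [random.randrange(m) for _ in range(64 * n)]
load = Counter(h(x) for x in requests)

print("ideal load per module:", len(requests) // n)
print("max observed load:   ", max(load.values()))
```

With high probability the observed maximum stays close to the ideal, which is the load-balancing behavior the DMM results below quantify.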
Usage of DMM Simulations
- The constraints on IFS (no communication between the servers, and load balancing) resemble those involved in simulating shared memory on a DMM (studied from around 1984).
- We observe that simulating a shared virtual disk over the Internet is similar to the way shared memory is implemented on a DMM.
- To address the above constraints, DMM simulation uses a complex hashing scheme which, translated to our problem of simulating a shared virtual H.D., makes use of:
  - Pseudo-random mapping of logical blocks to servers.
  - Replication of the physical blocks of the virtual H.D.
Random distribution of items to servers improves the loads of two sets of requests.
Previously known results on DMM simulation schemes
- The results on DMM simulations show that, with high probability, the number of accesses to a memory module made in one cycle does not exceed:
  - Mehlhorn and Vishkin (1984): O(log n / log log n).
  - Upfal and Wigderson (1987): O(log n (log log n)^2).
  - Meyer auf der Heide et al. (1993): O(log log n log* n).
The hashing scheme used in IFS
- There is a family H of universal hash functions:
  h_{a,b}(x) = ((a * x + b) mod k) mod p,   a, b ∈ {0, 1, ..., k - 1}
  - k is a prime (k ≥ vhd)
  - vhd is the size of the virtual hard disk
  - p is the number of servers in IFS
- At the beginning we choose three functions h1, h2, h3 from H at random, by choosing their coefficients a and b at random (see the sketch below).
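A minimal sketch of drawing h1, h2, h3 from H; the constants and helper names are illustrative, not from the original system:

```python
import random

K = 1_000_003          # a prime >= vhd (illustrative value)
P = 5                  # number of IFS servers (illustrative)

def random_hash(k: int, p: int):
    """Draw one function h_{a,b}(x) = ((a*x + b) mod k) mod p from H."""
    a, b = random.randrange(1, k), random.randrange(k)
    return lambda x: ((a * x + b) % k) % p

# Choose h1, h2, h3 by drawing their coefficients at random.
h1, h2, h3 = (random_hash(K, P) for _ in range(3))

def servers_of(block: int) -> set[int]:
    """The (up to three) servers responsible for storing this block."""
    return {h1(block), h2(block), h3(block)}

print(servers_of(42))   # e.g. {0, 2, 4}
```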
- Each server IFSi is responsible for storing every block Bx for which h1(Bx) = i, h2(Bx) = i, or h3(Bx) = i. Consequently there can be three copies of each block.
- Fetching a block Bx (for a read or write) into the cache of client CLi requires fetching two copies, Bx1 and Bx2 (out of the three possible), from the respective IFS servers. This is done using the three functions h1, h2, h3 in a random order. Of the two copies we select the one with the highest global time tag, according to some approximate tagging, and store it in the cache (see the sketch below).
- When the cache of a client becomes full, the least recently used block is flushed to the servers, to be stored using the above scheme.
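The following sketch imitates the two-of-three fetch rule under stated assumptions: Server, its tagging, and fetch are illustrative stand-ins, and wall-clock time stands in for the approximate global time tags mentioned above:

```python
import random
import time

class Server:
    """Illustrative server: holds (time_tag, data) per block id."""
    def __init__(self):
        self.store = {}

    def read(self, block):
        return self.store.get(block)

    def write(self, block, data):
        # Wall-clock time stands in for the approximate global tagging.
        self.store[block] = (time.time(), data)

def fetch(block, servers, h1, h2, h3):
    """Fetch two of the (up to) three copies and keep the newer one."""
    ids = [h(block) for h in (h1, h2, h3)]
    random.shuffle(ids)                     # use the functions in random order
    copies = [servers[i].read(block) for i in ids[:2]]
    copies = [c for c in copies if c is not None]
    # Tag rule: keep the copy with the highest time tag.
    return max(copies, key=lambda c: c[0]) if copies else None

# Tiny demo with stand-in hash functions over 5 servers.
servers = [Server() for _ in range(5)]
h1 = lambda b: b % 5
h2 = lambda b: (b + 1) % 5
h3 = lambda b: (b + 2) % 5
for h in (h1, h2, h3):                      # store all three copies
    servers[h(42)].write(42, b"payload")
print(fetch(42, servers, h1, h2, h3))       # (tag, b'payload')
```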
IFS Components and operations
APIs supported in IFS
Each operation must handle the concurrent corner cases listed with it (an illustrative interface sketch follows the list):
- Create: create a file that was already created.
- Delete: delete/create a file that was already deleted; delete a file while some process performs I/O on it.
- Read/Write: R/W while some process is adding blocks to the file.
- Seek.
- Tokens/Locks (Acq, Rel): delete a file while tokens are in use.
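A client-side interface covering the operations above; the names and signatures are assumptions, not the original IFS API:

```python
from typing import Protocol

class IFSClient(Protocol):
    """Illustrative interface; names and signatures are assumptions."""
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def read(self, path: str, offset: int, size: int) -> bytes: ...
    def write(self, path: str, offset: int, data: bytes) -> None: ...
    def seek(self, path: str, offset: int) -> None: ...
    def acquire(self, path: str) -> None: ...   # Acq: take the file's token
    def release(self, path: str) -> None: ...   # Rel: release it, flushing writes
```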
Semantics and Cache Consistency
UNIX semantics is desired for DFSs. However, implementing UNIX semantics requires invalidating caches before every write, which is not practical in the IFS setting. Thus, we choose release consistency instead.
- UNIX semantics: every operation on a file is instantly visible to all processes.
- Release consistency: shared data are made consistent when a critical region is exited (see the sketch below).
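A minimal sketch of the Acq/Rel discipline, assuming a single token per file and an in-memory store (not the original protocol; it only shows when writes become visible to other clients):

```python
import threading

# Assumptions: one token (lock) per file, an in-memory shared store,
# and a private per-client cache. Writes are published only on
# release(), which is exactly the release-consistency guarantee.
shared_store: dict[str, bytes] = {}
token = threading.Lock()                    # stands in for the file's token

class Client:
    def __init__(self):
        self.cache: dict[str, bytes] = {}

    def acquire(self, path: str):
        token.acquire()                     # Acq: enter the critical region
        self.cache[path] = shared_store.get(path, b"")   # refresh the cache

    def write(self, path: str, data: bytes):
        self.cache[path] = data             # visible only to this client

    def release(self, path: str):
        shared_store[path] = self.cache[path]   # publish the dirty block
        token.release()                     # Rel: other clients now see it

c = Client()
c.acquire("/movies/catalog")
c.write("/movies/catalog", b"new contents")
c.release("/movies/catalog")
print(shared_store["/movies/catalog"])      # b'new contents'
```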
IFS Consistency Models
- IFS supports the release consistency cache model, using tokens and an Acq/Rel protocol for synchronization.
- If client-side caching is enabled but tokens are not used for synchronization, cache consistency is not guaranteed.
- If client-side caching is disabled, the use of global tagging and the majority rule keeps the replicated data items relatively consistent.
VOD Models
- Tiger Video Fileserver (Microsoft)
  - Movies are sent in a stream (push) mode.
  - Uses central scheduler control to maintain block delivery.
  - Stripes movies between servers.
  - Uses block-level mirroring with block declustering.
- Fault-Tolerant VoD (Hebrew Univ.)
  - Movies are sent in a stream (push) mode.
  - Allows the clients to send rate-control messages to the servers.
  - Reallocates active clients of servers that crash to servers with replicas of the movie.
- The IBM VideoCharger (IBM)
  - Movies are sent in a stream (push) mode.
  - Uses central scheduler control to maintain block delivery.
  - Uses RAID devices to achieve data availability.
- Parallel Video Server (Chinese Univ. of HK)
  - Pull-based mode for receiving blocks.
  - No need for a central scheduler.
  - Extends RAID technologies to the server level.
- IFS (Haifa Univ.)
  - Pull-based mode for receiving blocks.
  - No need for a central scheduler.
  - Replicates blocks three times.
Experimental Results
- Using more servers increases the number of blocks that each client receives.
- The DMM simulation scheme used in IFS succeeds in distributing the load between the servers.
Average number of received blocks in a time unit for a single client.
Number of received blocks by a single client as a function of time.
Histogram of idle servers per step.
Histogram of max difference (5 servers).
Effect of low transfer rate on viewing quality.
Effect of sufficient transfer rate on viewing quality.
Conclusions
- IFS has been specially designed to work over the Internet.
- It serves clients performing concurrent I/O to a collection of files held on a set of servers that can be geographically dispersed over the Internet.
- Cooperative caches are used to overcome the relatively long delays caused by the large communication distances of the Internet.
- The overall bandwidth improves due to the geographical distribution.
- The hashing scheme guarantees an even distribution of the load between the servers.
- IFS supports direct access to physical blocks without searching global metadata.
- Special care was taken to optimize IFS for VOD.