Title: Scale and Performance in a Distributed File System
1. Scale and Performance in a Distributed File System
- J. H. HOWARD, M. L. KAZAR, S. G. MENEES, D. A. NICHOLS, M. SATYANARAYANAN, R. N. SIDEBOTHAM, and M. J. WEST
- Carnegie Mellon University
- ACM Transactions on Computer Systems, 1988
- Presented by Jongheum Yeon, October 14, 2009
2. Outline
- Motivation
- Andrew File System
- Prototype
- Improvements to the Prototype
- Comparison with Remote Open
- Operability of Improved System
- Conclusion
3. Andrew File System (AFS) History
- 1980: task force at CMU
- August 12, 1981: IBM launches the PC → "Sneaker Net"
- 1982: Carnegie Mellon and IBM form the Information Technology Center (ITC) for personal computing
- 1983: multi-platform network operating systems (NOS), Novell NetWare
- August 1983: 4.2BSD released with TCP/IP
- 1989: Transarc founded → commercial product
- 1994: IBM acquires Transarc
- August 16, 2000: IBM announces OpenAFS
- July 15, 2005: IBM withdraws its marketing effort
- Research is still ongoing (280 references): Coda, GFS, IVY, ...
4. Motivation
- DFS Architectural Challenges
- Fault tolerant
- Highly available
- Recoverable
- Consistent
- Scalable
- Predictable performance
- Secure
- To build a distributed, scalable file system
- The file system should scale to serve a large number of users without too much performance degradation
- Should support a simplified security model
- Should simplify system administration
5. Andrew File System
- Developed at Carnegie Mellon University
- Distributed file system designed with scale as a primary consideration
- Exploits locality of file references
- Presents a homogeneous, location-transparent file name space to all client workstations
- Built on 4.2BSD
- Server: a set of trusted servers, collectively called Vice
- Clients: a user-level process, Venus, runs on each workstation
- Venus hooks file system calls
- Contacts servers only on opens and closes, for whole-file transfer
- Caches files from Vice
- Stores modified copies of files back on the servers
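To make the whole-file model concrete, here is a minimal Python sketch (not AFS code; the class and method names are invented) of a Venus-like cache manager that fetches a whole file on open and ships it back on close.

# Minimal sketch of Venus-style whole-file caching (illustrative names only):
# on open, fetch the entire file from the server if it is not cached;
# on close, store the whole modified file back.

class Server:                      # stands in for a Vice file server
    def __init__(self, files):
        self.files = dict(files)   # pathname -> bytes

    def fetch(self, path):
        return self.files[path]

    def store(self, path, data):
        self.files[path] = data

class Venus:                       # stands in for the client cache manager
    def __init__(self, server):
        self.server = server
        self.cache = {}            # pathname -> bytes

    def open(self, path):
        if path not in self.cache:             # cache miss: whole-file transfer
            self.cache[path] = self.server.fetch(path)
        return bytearray(self.cache[path])     # reads/writes are purely local

    def close(self, path, data):
        self.cache[path] = bytes(data)
        self.server.store(path, bytes(data))   # write back on close

server = Server({"/vice/readme.txt": b"hello"})
venus = Venus(server)
buf = venus.open("/vice/readme.txt")           # first open: fetched from the server
buf += b", world"                              # local read/write
venus.close("/vice/readme.txt", buf)           # changes stored back on close
print(server.files["/vice/readme.txt"])        # b'hello, world'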
6. System Overview
7. Operation Flow
- Open/Read/Write/Close
- Example: Workstation 1 does not have readme.txt in its cache; Workstation 2 does.
8. Prototype
- Each client's Venus was served by a dedicated, persistent process on the server
- A separate, dedicated lock-server process handled locking
- Each server stored a directory hierarchy mirroring the structure of the Vice files
- .admin directories held Vice file status information (e.g., access lists)
- Stub directories represented the location database
- The Vice-Venus interface named files by their full pathname; there was no notion of a low-level name such as an inode
- Before using a cached file, Venus verified its timestamp with the server
- Each open of a file thus resulted in at least one interaction with a server, even if the file was already in the cache and up to date
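A small sketch of why the prototype paid for every open (illustrative only; the check is modeled loosely on the TestAuth call): even a fresh cache entry costs one server interaction to validate its timestamp.

# Sketch of the prototype's per-open validation: Venus caches a file together with a
# version stamp and must ask the server on every open whether that stamp is current.

import itertools

class Server:
    def __init__(self):
        self.meta = {"/vice/readme.txt": 1}    # pathname -> current version stamp
        self.rpcs = itertools.count(1)

    def test_auth(self, path, stamp):          # TestAuth-style validity check
        n = next(self.rpcs)
        return self.meta[path] == stamp, n

class Venus:
    def __init__(self, server):
        self.server = server
        self.cache = {"/vice/readme.txt": (b"hello", 1)}   # data + cached stamp

    def open(self, path):
        data, stamp = self.cache[path]
        valid, rpc_no = self.server.test_auth(path, stamp)  # RPC on *every* open
        print(f"open #{rpc_no}: cache {'valid' if valid else 'stale'}")
        return data

venus = Venus(Server())
for _ in range(3):                  # even repeated opens of an up-to-date file
    venus.open("/vice/readme.txt")  # each cost one server interaction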
9. Qualitative Observations of the Prototype
- stat primitive
- To test for the presence of files
- To obtain status information before opening files
- Each stat call involved a cache validity check
- Increased total running time and the load on servers
- Dedicated server processes
- Virtue of simplicity / robust system
- But: excessive context-switching overhead
- Critical resource limits exceeded
- High virtual-memory paging demands
10. Qualitative Observations of the Prototype (cont'd)
- Remote Procedure Call (RPC)
- Simplified the implementation
- But caused network-related resources in the kernel to be exceeded
- Location database
- Made it difficult to move users' directories between servers
- Miscellaneous
- Vice files could be used without recompilation or relinking
11. Limitations of the Prototype
- Too many stat calls degraded performance
- Addressed by reducing cache-validation lookups
- Server-side overload due to too many processes
- Network resources in the kernel were frequently exhausted
- Since location information was stored in each server, moving files across servers was difficult
- Disk quotas could not be implemented
12. Benchmark of the Prototype
- Benchmark
- Command scripts that operate on a collection of files
- 70 files (source code of an application program)
- About 200 KB in total
- Stand-alone benchmark, run in 5 phases
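For reference, a rough Python outline of the benchmark's shape; the five phase names (MakeDir, Copy, ScanDir, ReadAll, Make) follow the paper, while the generated tree, file sizes, and the stubbed Make phase are stand-ins.

# Sketch of the benchmark's five phases over a synthetic ~70-file, ~200 KB tree.

import os, shutil, tempfile

src = tempfile.mkdtemp()                       # stand-in for the application source tree
for i in range(70):
    with open(os.path.join(src, f"file{i}.c"), "wb") as f:
        f.write(b"x" * 3000)                   # roughly 200 KB in total

dst = tempfile.mkdtemp()
target = os.path.join(dst, "tree")

os.makedirs(target)                            # MakeDir: build the target subtree
for name in os.listdir(src):                   # Copy: copy every file into it
    shutil.copy(os.path.join(src, name), target)
for name in os.listdir(target):                # ScanDir: examine the status of every file
    os.stat(os.path.join(target, name))
for name in os.listdir(target):                # ReadAll: read every byte of every file
    with open(os.path.join(target, name), "rb") as f:
        f.read()
print("Make phase would compile the tree here")  # Make: compile (placeholder)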
13. Benchmark of the Prototype (cont'd)
- Skewed distribution of Vice calls
- TestAuth
- Validate cache entries
- GetFileStat
- Obtain status information about files absent from the cache
14. Benchmark of the Prototype (cont'd)
- Maximum feasible server load: about 5-10 load units
- Load unit: the load placed on a server by a single client workstation running this benchmark
- One load unit ≈ 5 Andrew users
15. Benchmark of the Prototype (cont'd)
- CPU/disk utilization profiling
- Performance bottleneck is the CPU
- Frequent context switches
- Time spent by the servers in traversing full pathnames
16. Improvements to the Prototype
- Cache management
- Previous cache management
- Status cache (in virtual memory) / data cache (on local disk)
- Only opening and closing operations are intercepted
- Modifications to a cached file are reflected back to Vice when the file is closed
- Callback
- The server promises to notify Venus before allowing a modification
- This reduces cache-validation traffic
- Both clients and servers must maintain callback state information (a restriction)
- There is a potential for inconsistency
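A minimal sketch of the callback idea (class and method names invented): the server remembers which clients cache a file and breaks their callbacks when it changes, so an unbroken callback lets a client reuse its cache with no validation traffic.

# Callback-based cache validation: only talk to the server when no callback is held.

class Server:
    def __init__(self):
        self.data = {"/vice/readme.txt": b"v1"}
        self.callbacks = {}                         # path -> set of clients

    def fetch(self, path, client):
        self.callbacks.setdefault(path, set()).add(client)   # promise to notify
        return self.data[path]

    def store(self, path, data, writer):
        self.data[path] = data
        for c in self.callbacks.get(path, set()) - {writer}:
            c.break_callback(path)                  # break everyone else's callback
        self.callbacks[path] = {writer}

class Venus:
    def __init__(self, server):
        self.server, self.cache, self.valid = server, {}, set()

    def break_callback(self, path):
        self.valid.discard(path)

    def open(self, path):
        if path not in self.cache or path not in self.valid:   # only then contact the server
            self.cache[path] = self.server.fetch(path, self)
            self.valid.add(path)
        return self.cache[path]

server = Server()
a, b = Venus(server), Venus(server)
print(a.open("/vice/readme.txt"))            # fetched, callback established
print(a.open("/vice/readme.txt"))            # served from cache, no validation RPC
server.store("/vice/readme.txt", b"v2", b)   # another client updates the file
print(a.open("/vice/readme.txt"))            # callback was broken, so a re-fetches v2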
17. Improvements to the Prototype (cont'd)
- Name resolution
- Previous Name Resolution
- inode: unique, fixed-length name
- pathname: one or more variable-length names
- The namei routine maps a pathname to an inode
- CPU overhead on the servers: each Vice pathname involves an implicit namei operation
- New scheme: fid, a unique, fixed-length identifier
- Venus maps each component of a pathname to a fid
- Three 32-bit fields: volume number, vnode number, uniquifier
- Volume number: identifies a volume on one server
- Vnode number: index into a volume's file storage information array
- Uniquifier: allows reuse of vnode numbers
- Moving files does not invalidate the contents of directories cached on workstations
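A small sketch of the fid layout described above, packing the three 32-bit fields into a fixed-length 96-bit identifier; the concrete values are made up.

# A fid is (volume number, vnode number, uniquifier), each 32 bits: 12 bytes total.

import struct
from collections import namedtuple

Fid = namedtuple("Fid", "volume vnode uniquifier")

def pack_fid(fid):
    return struct.pack(">III", fid.volume, fid.vnode, fid.uniquifier)  # fixed 12 bytes

def unpack_fid(buf):
    return Fid(*struct.unpack(">III", buf))

fid = Fid(volume=7, vnode=42, uniquifier=3)   # 3rd reincarnation of vnode slot 42
wire = pack_fid(fid)
print(len(wire), unpack_fid(wire))            # 12 Fid(volume=7, vnode=42, uniquifier=3)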
18. Improvements to the Prototype (cont'd)
- Communication and server process structure
- Use lightweight processes (LWPs) within a single server process instead of one dedicated process per client
- An LWP is bound to a particular client only for the duration of a single server operation
- Use an RPC mechanism for communication
- Low-level storage representation
- Access files by their inodes rather than by pathnames
- vnodes on the servers
- inodes on the clients
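A rough analogy in Python (ordinary threads standing in for the paper's non-preemptive LWPs): a small pool of workers inside one server process, each bound to a client request only for the duration of that single operation.

# A fixed worker pool serves requests from any number of clients; no worker is
# permanently tied to one client.

from concurrent.futures import ThreadPoolExecutor

def serve(request):                       # one server operation
    client, op = request
    return f"worker handled {op} for {client}"

requests = [("ws1", "Fetch"), ("ws2", "Store"), ("ws1", "TestAuth"), ("ws3", "Fetch")]

with ThreadPoolExecutor(max_workers=2) as pool:   # 2 workers, 3 distinct clients
    for result in pool.map(serve, requests):
        print(result)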
19. Improvements to the Prototype (cont'd)
- Consistency Semantics
- No dirty reads: writes to an open file by a process are private to that workstation
- Commit on close
- Changes become visible to new opens; already-open instances do not see the changes
- Other file operations
- Visible immediately
- No implicit locking
- Applications have to cooperate and manage locking themselves
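A worked toy example of these semantics (pure Python, no file system involved): writes stay private until close, a close makes them visible to new opens, and instances opened earlier keep the old contents.

# Each open takes a snapshot; a close commits the writer's snapshot to the server copy.

server_copy = b"v1"

open_a = bytearray(server_copy)     # process A opens the file (gets a snapshot)
open_b = bytearray(server_copy)     # process B on another workstation opens it too

open_a += b" +A's edit"             # A's writes are private to A's workstation
assert bytes(open_b) == b"v1"       # B still sees the old contents

server_copy = bytes(open_a)         # A closes: changes are committed to the server
assert bytes(open_b) == b"v1"       # B's already-open instance is unchanged
open_c = bytearray(server_copy)     # a new open now sees A's changes
print(bytes(open_c))                # b"v1 +A's edit"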
20. Improved System Overview
- Server
- VICE (Vast Integrated Computing Environment file system)
- Client
- Venus → VIRTUE (Virtue Is Reached Through Unix and Emacs)
21. Improved System Overview (cont'd)
- Case of remote file access
- A program accesses pathname P on a workstation
- The workstation kernel detects that P names a Vice file
- The kernel passes it to Venus
- An LWP in Venus uses the cache to examine each directory component D of P:
- If D is in the cache and has a callback on it, it is used without any network communication
- If D is in the cache but has no callback on it, the appropriate server is contacted, a new copy of D is fetched if it has been updated, and a callback is established on it
- If D is not in the cache, it is fetched from the appropriate server and a callback is established on it
22. Performance of Improved System
- Scalability
- Improved system: 19% slower than a stand-alone workstation
- Prototype: 70% slower
23. Performance of Improved System (cont'd)
24. Performance of Improved System (cont'd)
25. Comparison with Remote Open
- Remote Open
- The data in a file are not fetched en masse
- Instead, the remote site potentially participates in each individual read and write operation
- The file is actually opened on the remote site rather than the local site
- Example: NFS
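A back-of-the-envelope comparison (the block size is chosen arbitrarily for illustration): a remote-open system may contact the server for each block read, whereas whole-file caching pays one fetch at open and one store at close.

# Rough count of server interactions for reading a 200 KB file under each model.

file_size_kb = 200
block_kb = 8                                        # assumed per-operation transfer size

remote_open_ops = -(-file_size_kb // block_kb)      # one interaction per 8 KB block read
whole_file_ops = 2                                  # one Fetch on open + one Store on close

print(f"remote open: ~{remote_open_ops} server interactions to read the file")
print(f"whole-file caching: ~{whole_file_ops} (and 0 on later opens with a valid cache)")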
26. Comparison with Remote Open (cont'd)
27. Comparison with Remote Open (cont'd)
- Advantage of remote-open file system
- Low latency
28. Operability of Improved System
- Volumes
- Volume Movement
- Quotas
- Read-Only Replication
- Backup
29. Conclusion
- AFS works on a local copy of the file; NFS works on the remote file
- A combined approach could achieve the best of both worlds:
- The first access to the file is a remote access
- The file is then downloaded in the background at low priority
- With partial download, the server need not know how much of the file the client has downloaded
- Subsequent operations can work on the local file
- Only the changes are transferred back to the server
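A very rough sketch of the hybrid suggested here (all names and mechanisms are invented for illustration): early reads go to the server, the file is pulled down in the background, later operations work locally, and only changed bytes are shipped back.

# Hybrid client: remote reads before the download finishes, local work afterwards,
# and a sync step that transfers only the bytes that differ from the server copy.

class HybridClient:
    def __init__(self, server_data):
        self.remote = bytearray(server_data)   # stands in for the server's copy
        self.local = None                      # filled in by the background download

    def read(self, off, length):
        if self.local is None:                 # first accesses: remote reads
            return bytes(self.remote[off:off + length])
        return bytes(self.local[off:off + length])

    def background_download(self):             # low-priority whole-file pull
        self.local = bytearray(self.remote)

    def write(self, off, data):                # subsequent operations work locally
        self.local[off:off + len(data)] = data

    def sync(self):                            # transfer only the changes back
        changes = [(i, bytes(self.local[i:i + 1]))
                   for i in range(len(self.local))
                   if self.local[i:i + 1] != self.remote[i:i + 1]]
        for i, b in changes:
            self.remote[i:i + 1] = b
        return len(changes)

c = HybridClient(b"hello world")
print(c.read(0, 5))          # remote read before the download finishes
c.background_download()
c.write(6, b"venus")
print(c.sync(), "changed bytes shipped to the server")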