Title: PVFS: A Parallel File System for Linux Clusters
1. PVFS: A Parallel File System for Linux Clusters
- Philip H. Carns, Walter B. Ligon III, Robert B. Ross, Rajeev Thakur
- Proceedings of the 4th Annual Linux Showcase and Conference, 2000
2. Abstract
- Linux clusters have matured as platforms for high-performance computing, but one area still lacking support has been parallel file systems
- PVFS (Parallel Virtual File System)
  - A high-performance parallel file system for Linux clusters
  - A tool for pursuing further research in parallel I/O and parallel file systems
3. Introduction (1/2)
- Design goals
  - Provide high bandwidth for concurrent read/write operations
  - Support multiple APIs
    - Native PVFS API
    - UNIX/POSIX I/O API
    - Other APIs (such as MPI-IO)
  - Common UNIX commands (ls, cp, rm) must work with PVFS files
4. Introduction (2/2)
- Design goals (cont'd)
  - Applications developed with the UNIX I/O API must be able to access PVFS files without recompiling
  - Robust and scalable
  - Easy to install and use
5. Related Work: Various File Systems
- Commercial parallel file systems
  - PFS (Intel Paragon), PIOFS/GPFS (IBM SP), HFS (HP Exemplar), XFS (SGI)
- Distributed file systems
  - NFS, AFS/Coda, InterMezzo, xFS, GFS
- File systems from research projects in parallel I/O and parallel file systems
  - PIOUS, PPFS, Galley
6. PVFS Design and Implementation
- Client-server system
  - Client: user process
  - Server: multiple I/O daemons (iod) and a manager daemon (mgr)
7. Sample Cluster
- [Diagram: a sample PVFS cluster. The metadata server node (head) stores metadata under /pvfs-meta; I/O nodes n1 and n2 each store file data under a local /pvfs-data directory; the client node c1 runs pvfsd, uses libpvfs and the pvfs.o kernel module, and mounts the file system at /mnt with "mount.pvfs head:/pvfs-meta /mnt".]
8. PVFS Design and Implementation
- PVFS manager and metadata
  - Metadata describes a file: permissions, owner and group membership, and the physical distribution of the file data
    - file locations on the disks, and the locations of those disks in the cluster
  - For simplicity, both file data and metadata are stored on existing local file systems rather than directly on raw devices
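- As a purely illustrative sketch, the per-file metadata can be pictured as a small C structure; the struct name and exact field set below are assumptions, not the actual PVFS source:

    /* Illustrative only: the PVFS manager actually keeps metadata in
     * files on an existing local file system, not in this exact form. */
    #include <sys/types.h>

    struct pvfs_meta_sketch {
        ino_t  inode;   /* unique file identifier */
        uid_t  uid;     /* owner */
        gid_t  gid;     /* group membership */
        mode_t mode;    /* permissions */
        /* physical distribution of the file data across I/O nodes */
        int    base;    /* first I/O node holding data for this file */
        int    pcount;  /* number of I/O nodes the file is striped over */
        int    ssize;   /* stripe size in bytes */
    };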
9. PVFS Design and Implementation
- I/O daemons and data storage
  - When a client opens a PVFS file
    - the PVFS manager returns the locations of the I/O daemons (iods) that hold the file's data
    - the client then establishes connections with those iods directly
  - When a client wishes to access file data
    - the client library sends a descriptor of the file region being accessed to the I/O daemons holding data in that region
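- That descriptor can be pictured roughly as below; the struct and field names are hypothetical (the actual PVFS request format is richer than this), but they convey the idea of sending a description of the byte range rather than routing the data through the manager:

    /* Hypothetical request descriptor a client might send to each iod
     * holding part of the requested region (illustrative only). */
    #include <stdint.h>

    struct io_request_sketch {
        uint64_t file_inode;  /* which PVFS file */
        uint64_t offset;      /* starting byte offset within the file */
        uint64_t length;      /* number of bytes to read or write */
        int      is_write;    /* direction of the transfer */
    };
    /* Each iod works out, from its local stripe layout, which parts of the
     * region it stores and transfers only those bytes to or from the client. */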
10. PVFS Design and Implementation
- Example file metadata (striping parameters)
  - inode: 1092157504
  - base: 2 (first I/O node used for this file)
  - pcount: 3 (number of I/O nodes)
  - ssize: 65536 (stripe size in bytes)
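- With these parameters, stripes are assigned round-robin across the I/O nodes; the helper below is an illustrative sketch of that mapping (not PVFS source code, and it ignores wrap-around past the last I/O node in the cluster):

    /* Map a byte offset to an I/O node under round-robin striping. */
    #include <stdio.h>

    static int iod_for_offset(long long offset, int base, int pcount, int ssize)
    {
        long long stripe = offset / ssize;     /* stripe containing this byte */
        return base + (int)(stripe % pcount);  /* round-robin over pcount iods */
    }

    int main(void)
    {
        /* Metadata from the slide: base 2, pcount 3, ssize 65536. */
        printf("offset 0      -> iod %d\n", iod_for_offset(0,      2, 3, 65536));
        printf("offset 65536  -> iod %d\n", iod_for_offset(65536,  2, 3, 65536));
        printf("offset 200000 -> iod %d\n", iod_for_offset(200000, 2, 3, 65536));
        return 0;
    }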
11PVFS Design Impl
pvfs_open(char pathname, int flag, mode_t
mode) pvfs_open(char pathname, int flag, mode_t
mode struct pvfs_filestat dist)
struct pvfs_filestat int base int pcount
int ssize int soff / not used / int
bsize / not used /
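- A minimal usage sketch of the second form, creating a file striped over 8 I/O nodes with a 64 Kbyte stripe size (the path, striping values, and lack of error handling are illustrative):

    #include <fcntl.h>       /* O_CREAT, O_WRONLY */
    #include <sys/types.h>   /* mode_t */

    /* Declarations from the slide above; normally provided by a PVFS header. */
    struct pvfs_filestat { int base; int pcount; int ssize; int soff; int bsize; };
    int pvfs_open(char *pathname, int flag, mode_t mode, struct pvfs_filestat *dist);

    int create_striped_file(void)
    {
        struct pvfs_filestat dist;
        dist.base   = 0;      /* start at the first I/O node */
        dist.pcount = 8;      /* stripe across 8 I/O nodes */
        dist.ssize  = 65536;  /* 64 Kbyte stripes */
        dist.soff   = 0;      /* not used */
        dist.bsize  = 0;      /* not used */

        /* Create the file with the requested physical distribution. */
        return pvfs_open("/mnt/datafile", O_CREAT | O_WRONLY, 0644, &dist);
    }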
12. PVFS Design and Implementation
- PVFS can be used through multiple APIs
  - Native API
    - pvfs_open(), pvfs_read(), pvfs_write()
  - MPI-IO interface
    - MPI_File_open(), MPI_File_read(), MPI_File_write()
  - UNIX/POSIX API
    - by trapping UNIX I/O calls: open(), read(), write()
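- As an example of the MPI-IO route, the sketch below has each process read 64 Kbytes from its own offset in a file on a PVFS mount; the path and buffer size are illustrative, and whether the path needs a ROMIO-specific prefix depends on how MPI-IO was built:

    /* Minimal MPI-IO read of a file stored on PVFS (illustrative). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Status status;
        char buf[65536];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Collectively open the file on the PVFS mount point. */
        MPI_File_open(MPI_COMM_WORLD, "/mnt/datafile",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        /* Each process reads its own 64 Kbyte block of the shared file. */
        MPI_File_seek(fh, (MPI_Offset)rank * sizeof(buf), MPI_SEEK_SET);
        MPI_File_read(fh, buf, (int)sizeof(buf), MPI_BYTE, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }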
13. PVFS Design and Implementation
- [Diagram: (a) standard operation: application -> C library (libc syscall wrappers) -> kernel; (b) with the PVFS library loaded: application -> C library with PVFS syscall wrappers -> PVFS library for PVFS files, kernel for all other files.]
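- The wrappers decide, per call, whether the file descriptor refers to a PVFS file and route the operation accordingly. A simplified sketch of a read() wrapper is below; the helper pvfs_fd_tracked() is hypothetical, and pvfs_read() is assumed to mirror read()'s signature:

    #include <sys/syscall.h>   /* SYS_read */
    #include <unistd.h>        /* syscall(), ssize_t */

    /* Hypothetical helper: is this descriptor tracked as a PVFS file? */
    extern int pvfs_fd_tracked(int fd);
    /* Native PVFS call from the earlier slide; signature assumed. */
    extern ssize_t pvfs_read(int fd, void *buf, size_t count);

    /* The PVFS library's replacement for the libc read() wrapper. */
    ssize_t read(int fd, void *buf, size_t count)
    {
        if (pvfs_fd_tracked(fd))
            return pvfs_read(fd, buf, count);     /* PVFS file: handle in user space */
        return syscall(SYS_read, fd, buf, count); /* other file: go to the kernel */
    }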
14. Performance Results
- Environment (Chiba City cluster)
  - 256 nodes
  - 500 MHz Pentium III, 512 Mbytes RAM
  - 9 Gbyte Quantum SCSI disk
  - 100 Mbit EtherExpress Pro fast ethernet and 64-bit Myrinet card
  - Linux 2.2.15pre4
- 60 nodes were used in the experiments
  - some as compute nodes, some as I/O nodes
15. Performance Results
- Disk bandwidth (bonnie benchmark)
  - Write: 22 Mbytes/sec
  - Read: 15 Mbytes/sec
- Network bandwidth (ttcp test)
  - Fast ethernet: 10.2 Mbytes/sec
  - Myrinet: 37.7 Mbytes/sec
16. Performance Results
- Read performance (fast ethernet)
  - peak of 224 Mbytes/sec with 24 I/O nodes and 24 compute nodes
- [Plot: aggregate bandwidth (Mbytes/sec) vs. number of compute nodes, one curve each for 4, 8, 16, 24, and 32 I/O nodes.]
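- As a rough sanity check, 24 compute nodes at the measured 10.2 Mbytes/sec per fast-ethernet link gives about 245 Mbytes/sec of aggregate network bandwidth, so the 224 Mbytes/sec peak is roughly 90 percent of the network limit.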
17. Performance Results
- Write performance (fast ethernet)
18. Performance Results
- Read performance (Myrinet)
  - peak of 687 Mbytes/sec with 32 I/O nodes and 28 compute nodes
19. Performance Results
- Write performance (Myrinet)
20. Performance Results
- Read/write performance (Myrinet, 32 I/O nodes)
  - native PVFS API vs. MPI-IO
- [Plot: aggregate bandwidth vs. number of compute nodes, with curves for native read, MPI read, native write, and MPI write.]