Title: On evaluating GPFS
1 On evaluating GPFS
Research work that has been done at HLRS
by Alejandro Calderon
2 On evaluating GPFS
- Short description
- Metadata evaluation
  - fdtree
- Bandwidth evaluation
  - Bonnie
  - Iozone
  - IODD
  - IOP
3 GPFS description
http://www.ncsa.uiuc.edu/UserInfo/Data/filesystems/index.html
- General Parallel File System (GPFS) is a parallel file system package developed by IBM.
- History
  - Originally developed for IBM's AIX operating system, then ported to Linux systems.
- Features
  - Appears to work just like a traditional UNIX file system from the user application level.
  - Provides additional functionality and enhanced performance when accessed via parallel interfaces such as MPI-IO.
  - High performance is obtained by striping data across multiple nodes and disks.
  - Striping is performed automatically at the block level; therefore, all files larger than the designated block size will be striped.
  - Can be deployed in NSD or SAN configurations.
  - Clusters hosting a GPFS file system can allow other clusters at different geographical locations to mount that file system.
4 GPFS (Simple NSD Configuration)
5 GPFS evaluation (metadata)
- fdtree
  - Used for testing the metadata performance of a file system (a sketch of this kind of workload follows this list)
  - Creates several directories and files across several levels
- Used on
  - Computers
    - noco-xyz
  - Storage systems
    - Local, GPFS
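fdtree itself is a shell script; the following is only a minimal C sketch of the kind of workload it generates (nested directories, each holding a few small files). The depth, fan-out, file count and the /gpfs path are illustrative assumptions, not the benchmark's actual parameters.

/*
 * Minimal sketch of an fdtree-like metadata workload (not the real fdtree
 * script): create a tree of directories, each containing a few small files.
 * Depth, fan-out, file count and the target path are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define DIRS_PER_LEVEL  5      /* like -d */
#define FILES_PER_DIR   3      /* like -f */
#define LEVELS          3      /* like -l */
#define FILE_BYTES      4096   /* small files stress metadata, not bandwidth */

static void make_tree(const char *base, int level)
{
    char path[4096], buf[FILE_BYTES];

    memset(buf, 'x', sizeof(buf));

    /* create the small files of this directory */
    for (int f = 0; f < FILES_PER_DIR; f++) {
        snprintf(path, sizeof(path), "%s/file_%d", base, f);
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, buf, sizeof(buf));
            close(fd);
        }
    }

    if (level == 0)
        return;

    /* recurse into the subdirectories of this level */
    for (int d = 0; d < DIRS_PER_LEVEL; d++) {
        snprintf(path, sizeof(path), "%s/dir_%d", base, d);
        if (mkdir(path, 0755) == 0)
            make_tree(path, level - 1);
    }
}

int main(void)
{
    const char *base = "/gpfs/scratch/fdtree_like";   /* placeholder path */

    mkdir(base, 0755);
    make_tree(base, LEVELS);
    return 0;
}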
6 fdtree: local, NFS, GPFS
7 fdtree on GPFS (Scenario 1)
ssh x,... fdtree.bash -f 3 -d 5 -o /gpfs...
[Diagram: each node x runs processes P1 .. Pm, each working in its own subtree]
- Scenario 1
  - several nodes,
  - several processes per node,
  - different subtrees,
  - many small files
8 fdtree on GPFS (Scenario 1)
9 fdtree on GPFS (Scenario 2)
ssh x,... fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
[Diagram: each node x runs one process (P1 .. Px), all working in the same subtree]
- Scenario 2
  - several nodes,
  - one process per node,
  - same subtree,
  - many small files
10 fdtree on GPFS (Scenario 2)
11 Metadata cache on GPFS client
- Working in a GPFS directory with 894 entries
- ls -als needs to fetch each file's attributes from the GPFS metadata server
- Within a couple of seconds, the contents of the cache seem to disappear (a small timing sketch follows the transcript below)
hpc13782 noco186.nec 304 time ls -als | wc -l
894
real    0m0.466s
user    0m0.010s
sys     0m0.052s

hpc13782 noco186.nec 305 time ls -als | wc -l
894
real    0m0.222s
user    0m0.011s
sys     0m0.064s

hpc13782 noco186.nec 306 time ls -als | wc -l
894
real    0m0.033s
user    0m0.009s
sys     0m0.025s

hpc13782 noco186.nec 307 time ls -als | wc -l
894
real    0m0.034s
user    0m0.010s
sys     0m0.024s
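The cache effect can also be observed programmatically. The sketch below is not part of the original evaluation; it simply times repeated stat() passes over a directory, which is roughly the work `ls -als` does per entry, with a short pause between passes. The directory path is a placeholder.

/*
 * Sketch (not from the original study): time several stat() passes over a
 * directory to observe the effect of the GPFS client metadata cache.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static double stat_pass(const char *dir)
{
    struct timespec t0, t1;
    struct dirent *de;
    struct stat st;
    char path[4096];
    DIR *dp = opendir(dir);

    if (!dp)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((de = readdir(dp)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        stat(path, &st);                  /* forces an attribute lookup */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    closedir(dp);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    const char *dir = "/gpfs/scratch/manyfiles";      /* placeholder */

    for (int i = 0; i < 4; i++) {
        printf("pass %d: %.3f s\n", i, stat_pass(dir));
        sleep(2);    /* wait a couple of seconds between passes */
    }
    return 0;
}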
12 fdtree results
- Main conclusions
  - Contention at directory level
  - If two or more processes from a parallel application need to write data, make sure each one uses a different subdirectory of the GPFS workspace (see the sketch below)
  - Better results than NFS (but lower than the local file system)
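A minimal MPI sketch of the subdirectory recommendation (the workspace path and file names are hypothetical): each rank creates and uses its own subdirectory, so concurrent file creations never touch the same parent directory.

/*
 * Sketch of the recommendation above (hypothetical paths): each MPI rank
 * works inside its own subdirectory of the GPFS workspace, so concurrent
 * file creations never hit the same parent directory.
 */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    char dir[256], file[512];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one private subdirectory per rank under the shared workspace */
    snprintf(dir, sizeof(dir), "/gpfs/workspace/rank_%d", rank);
    mkdir(dir, 0755);

    /* all of this rank's files go below its own subdirectory */
    snprintf(file, sizeof(file), "%s/output.dat", dir);
    int fd = open(file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
        const char msg[] = "per-rank data\n";
        write(fd, msg, sizeof(msg) - 1);
        close(fd);
    }

    MPI_Finalize();
    return 0;
}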
13 GPFS performance (bandwidth)
- Bonnie
  - Read and write a 2 GB file
  - Write, rewrite and read
- Used on
  - Computers
    - Cacau1
    - Noco075
  - Storage systems
    - GPFS
14 Bonnie on GPFS: write, re-write
GPFS over NFS
15 Bonnie on GPFS: read
GPFS over NFS
16 GPFS performance (bandwidth)
- Iozone
  - Write and read with several file sizes and access sizes (see the sketch after this list)
  - Write and read bandwidth
- Used on
  - Computers
    - Noco075
  - Storage systems
    - GPFS
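Iozone's command line is not reproduced in the slides; the sketch below only illustrates the idea behind the test: write a file of fixed size with several access (record) sizes and report the bandwidth obtained for each. The file size, record sizes and path are placeholder values.

/*
 * Sketch of the Iozone-style measurement (not Iozone itself): write a file
 * of fixed size with several record sizes and report the write bandwidth
 * for each.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (256L * 1024 * 1024)   /* 256 MB per run (placeholder) */

static double write_run(const char *path, size_t record)
{
    struct timespec t0, t1;
    char *buf = malloc(record);
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (!buf || fd < 0)
        return -1.0;
    memset(buf, 'a', record);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long written = 0; written < FILE_SIZE; written += record)
        write(fd, buf, record);
    fsync(fd);                            /* include flush time in the run */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (FILE_SIZE / (1024.0 * 1024.0)) / s;       /* MB/s */
}

int main(void)
{
    size_t records[] = { 64 * 1024, 128 * 1024, 256 * 1024, 1024 * 1024 };

    for (unsigned i = 0; i < sizeof(records) / sizeof(records[0]); i++)
        printf("record %6zu KB: %.1f MB/s\n", records[i] / 1024,
               write_run("/gpfs/scratch/iozone_like.dat", records[i]));
    return 0;
}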
17 Iozone on GPFS: write
18 Iozone on GPFS: read
19 GPFS evaluation (bandwidth)
- IODD
  - Evaluation of disk performance using several nodes (disk and networking)
  - A dd-like command that can be run from MPI
- Used on
  - 2 and 4 nodes; 4, 8, 16, and 32 processes (1, 2, 3, and 4 per node) that write a file of 1, 2, 4, 8, 16, and 32 GB
  - Using both the POSIX interface and the MPI-IO interface (a sketch of both interfaces follows the next slide)
20 How IODD works
[Diagram: on each node x, processes P1, P2, ..., Pm each write a sequence of blocks a, b, .., n]
- node x: 2, 4 nodes
- process m: 4, 8, 16, and 32 processes (1, 2, 3, 4 per node)
- file size n: 1, 2, 4, 8, 16 and 32 GB
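IODD itself is not publicly documented in these slides, so the following is only a guess at the access pattern described above: each MPI process writes its own sequence of blocks, once through the POSIX interface (pwrite) and once through MPI-IO (MPI_File_write_at), so the two APIs can be compared. Block size, block count, file layout and paths are assumptions.

/*
 * Sketch of a dd-like parallel write in the spirit of IODD (the real tool is
 * not shown in the slides): each MPI process writes its own blocks, once
 * through POSIX and once through MPI-IO.
 */
#include <fcntl.h>
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK   (1024 * 1024)   /* 1 MB per write */
#define NBLOCKS 64              /* 64 MB per process (placeholder) */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(BLOCK);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a' + rank % 26, BLOCK);

    /* POSIX variant: pwrite() at this rank's offsets in the shared file */
    int fd = open("/gpfs/scratch/iodd_posix.dat", O_CREAT | O_WRONLY, 0644);
    for (int b = 0; b < NBLOCKS; b++) {
        off_t off = ((off_t)rank * NBLOCKS + b) * BLOCK;
        pwrite(fd, buf, BLOCK, off);
    }
    close(fd);

    /* MPI-IO variant: the same pattern through MPI_File_write_at() */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/iodd_mpiio.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    for (int b = 0; b < NBLOCKS; b++) {
        MPI_Offset off = ((MPI_Offset)rank * NBLOCKS + b) * BLOCK;
        MPI_File_write_at(fh, off, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, this reproduces the "m processes per node, blocks a .. n per process" layout of the diagram, under the stated assumptions.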
21 IODD on 2 nodes (MPI-IO)
22 IODD on 4 nodes (MPI-IO)
23 Differences when using different APIs
GPFS (2 nodes, MPI-IO)
GPFS (2 nodes, POSIX)
24 IODD on 2 GB (MPI-IO, directory)
25 IODD on 2 GB (MPI-IO, ? directory)
26 IODD results
- Main conclusions
  - The bandwidth decreases with the number of processes per node
  - Beware of multithreaded applications with medium-to-high I/O bandwidth requirements per thread
  - It is very important to use MPI-IO, because this API lets users get more bandwidth
  - The bandwidth also decreases with more than 4 nodes
  - With large files, metadata management does not seem to be the main bottleneck
27 GPFS evaluation (bandwidth)
- IOP
  - Measures the bandwidth obtained by writing and reading in parallel from several processes
  - The file size is divided by the number of processes, so each process works on an independent part of the file
- Used on
  - GPFS through MPI-IO (ROMIO on Open MPI)
  - Two nodes writing a 2 GB file in parallel
    - On independent files (non-shared)
    - On the same file (shared); both patterns are sketched after the next slide
28 How IOP works
[Diagram: file per process (non-shared): each process P1 .. Pm writes its own file of blocks a, b, .., x;
segmented access (shared): each process writes its blocks a, b, .., x into its own contiguous segment of a single file of size n]
- 2 nodes
- m: 2 processes (1 per node)
- n: 2 GB file size
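A minimal MPI-IO sketch of the two access patterns above (file names and sizes are assumptions, not IOP's actual parameters): in the non-shared case each process opens its own file on MPI_COMM_SELF; in the shared, segmented case all processes open one file and each writes its own contiguous segment.

/*
 * Sketch of the two IOP access patterns (the real IOP tool is not shown in
 * the slides): file-per-process (non-shared) versus segmented access to a
 * single shared file.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SEGMENT (64L * 1024 * 1024)   /* bytes written by each process */
#define CHUNK   (1024 * 1024)         /* size of each individual write */

static void write_chunks(MPI_File fh, MPI_Offset base, char *buf)
{
    for (MPI_Offset off = 0; off < SEGMENT; off += CHUNK)
        MPI_File_write_at(fh, base + off, buf, CHUNK, MPI_BYTE,
                          MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int rank;
    char name[256], *buf = malloc(CHUNK);
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a' + rank % 26, CHUNK);

    /* Non-shared: each process writes its own independent file. */
    snprintf(name, sizeof(name), "/gpfs/scratch/iop_rank%d.dat", rank);
    MPI_File_open(MPI_COMM_SELF, name,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    write_chunks(fh, 0, buf);
    MPI_File_close(&fh);

    /* Shared, segmented: one file, each process owns a contiguous segment. */
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/iop_shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    write_chunks(fh, (MPI_Offset)rank * SEGMENT, buf);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}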
29 IOP: differences when using shared/non-shared
30 IOP: differences when using shared/non-shared
31 GPFS writing in non-shared files vs. GPFS writing in a shared file
32 GPFS writing in a shared file: the 128 KB magic number
33 IOP results
- Main conclusions
  - If several processes write to the same file, even in independent areas, the performance decreases
  - With several independent files, results are similar across tests; with a shared file they are more irregular
  - A magic number of 128 KB appears: it seems that at that point the internal algorithm changes and the bandwidth increases