Title: Gfarm Grid File System for Distributed and Parallel Data Computing
1. Gfarm Grid File System for Distributed and Parallel Data Computing
APAN Workshop on Exploring eScience, Aug 26, 2005, Taipei, Taiwan
- Osamu Tatebe
- o.tatebe@aist.go.jp
- Grid Technology Research Center, AIST
2. Background: Petascale Data-Intensive Computing
- High Energy Physics
- CERN LHC, KEK-B Belle
- MB/collision, 100 collisions/sec
- PB/year
- 2000 physicists, 35 countries
[Figures: detector for the LHCb experiment; detector for the ALICE experiment]
- Astronomical Data Analysis
- Data analysis of the whole archive
- TB to PB/year/telescope
- Subaru telescope
- 10 GB/night, 3 TB/year
3. Petascale Data-Intensive Computing: Requirements
- Peta/Exabyte-scale files, millions of millions of files
- Scalable computational power
- > 1 TFLOPS, hopefully > 10 TFLOPS
- Scalable parallel I/O throughput
- > 100 GB/s, hopefully > 1 TB/s within a system and between systems
- Efficient global sharing with group-oriented authentication and access control
- Fault tolerance / dynamic re-configuration
- Resource management and scheduling
- System monitoring and administration
- Global computing environment
4. Goal and Features of Grid Datafarm
- Goal
- Dependable data sharing among multiple organizations
- High-speed data access, high-performance data computing
- Grid Datafarm
- Gfarm Grid File System: a global, dependable virtual file system
- Federates scratch disks in PCs
- Parallel distributed data computing
- Associates the Computational Grid with the Data Grid
- Features
- Secured, based on the Grid Security Infrastructure
- Scalable with data size and usage scenarios
- Location-transparent data access
- Automatic and transparent replica selection for fault tolerance
- High-performance data access and computing by accessing multiple dispersed storages in parallel (file-affinity scheduling)
5. Gfarm File System (1)
- Virtual file system that federates local disks of cluster nodes or Grid nodes
- Enables transparent access to file data dispersed across a Grid through a global namespace
- Supports fault tolerance and avoids access concentration by automatic and transparent replica selection
- Can be shared among all cluster nodes and clients
[Figure: global namespace mapped onto the Gfarm file system; file replica creation]
6. Gfarm File System (2)
- A file can be shared among all nodes and clients
- Physically, it may be replicated and stored on any file system node
- Applications can access it regardless of its location
- In a cluster environment, a shared secret key is used for authentication
[Figure: client PCs and a note PC see /gfarm; Gfarm file system metadata maps files A, B, and C to replicas stored on multiple file system nodes]
7. Grid-Wide Configuration
- Grid-wide file system built by integrating local disks at several sites
- GSI authentication
- Can be shared among all cluster nodes and clients
- GridFTP and Samba servers at each site
[Figure: Gfarm Grid file system mounted as /gfarm on nodes in Japan, Singapore, and the US]
8. Features of the Gfarm File System
- A file can be stored on any file system (compute) node (distributed file system)
- A file can be replicated and stored on different nodes (tolerant of faults and of access concentration)
- When a file replica exists on the compute node, it can be accessed without remote-access overhead (high-performance, scalable I/O); see the access sketch below
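As a concrete illustration of this location transparency, the short C sketch below reads a file through its global Gfarm path using nothing but ordinary POSIX calls. The path /gfarm/example/data.txt is a made-up example, and the program assumes /gfarm is reachable through the syscall hooking library or a GfarmFS-FUSE mount described on later slides; which replica actually serves the read (ideally a local one) is decided by Gfarm, not by the application.

    /* Hypothetical example: location-transparent read of a Gfarm file.
     * Assumes /gfarm is provided by libgfs_hook.so (LD_PRELOAD) or a
     * GfarmFS-FUSE mount; the file name below is made up for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192];
        ssize_t n;
        long long total = 0;

        /* The application only names the file in the global namespace;
         * replica selection (local disk if available) is done by Gfarm. */
        int fd = open("/gfarm/example/data.txt", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }
        while ((n = read(fd, buf, sizeof buf)) > 0)
            total += n;
        close(fd);
        printf("read %lld bytes\n", total);
        return EXIT_SUCCESS;
    }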
9. More Scalable I/O Performance
[Figure: users' view (a single shared network file system) vs. the physical execution view in Gfarm (file-affinity scheduling): Job A, submitted by User A and accessing File A, is executed on the cluster/Grid node that stores File A, and Job B likewise on the node that stores File B; in the Gfarm file system the file system nodes are also the compute nodes]
- Do not separate storage and CPU (no SAN necessary)
- Move and execute the program instead of moving large-scale data
- Scalable file I/O by exploiting local I/O (a scheduling sketch follows below)
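The following is a minimal sketch, in C, of the file-affinity idea described above: given a purely hypothetical replica table mapping files to the nodes that store them, the scheduler runs a job on a node that already holds the job's input file, and only falls back to an arbitrary node when no replica location is known. All names here (replica_table, pick_node, dispatch) are illustrative and are not part of the Gfarm API.

    /* Illustrative sketch of file-affinity scheduling; the data structures
     * and function names are hypothetical, not the Gfarm implementation. */
    #include <stdio.h>
    #include <string.h>

    struct replica {
        const char *file;   /* global file name */
        const char *node;   /* node that stores a replica of it */
    };

    /* Hypothetical replica catalog (in Gfarm this is kept by the metadata server). */
    static const struct replica replica_table[] = {
        { "/gfarm/input/fileA", "node03" },
        { "/gfarm/input/fileB", "node07" },
    };

    /* Pick a node holding a replica of the job's input file, if any. */
    static const char *pick_node(const char *file)
    {
        for (size_t i = 0; i < sizeof replica_table / sizeof replica_table[0]; i++)
            if (strcmp(replica_table[i].file, file) == 0)
                return replica_table[i].node;
        return "node00";    /* fallback: any compute node */
    }

    /* Stand-in for remote job dispatch: just report the placement decision. */
    static void dispatch(const char *job, const char *file)
    {
        printf("%s (reads %s) -> run on %s, I/O stays local\n",
               job, file, pick_node(file));
    }

    int main(void)
    {
        dispatch("jobA", "/gfarm/input/fileA");
        dispatch("jobB", "/gfarm/input/fileB");
        return 0;
    }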
10. Gfarm™ Data Grid Middleware
- Open-source development
- Gfarm™ version 1.1.1 released on May 17, 2005 (http://datafarm.apgrid.org/)
- Read-write mode support, more support for existing binary applications, metadata cache server
- A shared file system in a cluster or a grid
- Accessible from legacy applications without any modification
- Standard protocol support via scp, GridFTP server, Samba server, . . .
- Existing applications can access the Gfarm file system without any modification, using LD_PRELOAD of the syscall hooking library or GfarmFS-FUSE
[Figure: an application uses the Gfarm client library to reach the metadata server (gfmd, slapd) and the compute and file system nodes, each running gfsd]
11. Gfarm™ Data Grid Middleware (2)
- libgfarm: Gfarm client library
- Gfarm API
- gfmd, slapd: metadata server
- Namespace, replica catalog, host information, process information
- gfsd: I/O server
- Remote file access
[Figure: the application calls the Gfarm client library, which obtains file and host information from the metadata server (gfmd, slapd) and performs remote file access to the gfsd daemons on the compute and file system nodes]
12. Access from Legacy Applications
- libgfs_hook.so: system call hooking library
- Emulates mounting the Gfarm file system at /gfarm by hooking open(2), read(2), write(2), . . .
- When an access falls under /gfarm, it calls the appropriate Gfarm API
- Otherwise, it calls the ordinary system call
- No re-linking necessary; just specify LD_PRELOAD (a generic sketch of this technique follows below)
- Linux, FreeBSD, NetBSD, . . .
- Higher portability than developing a kernel module
- Mounting the Gfarm file system
- GfarmFS-FUSE enables mounting the Gfarm file system using the FUSE mechanism in Linux
- Released on Jul 12, 2005
- A kernel module would need to be developed for other OSs
- Volunteers needed
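For readers unfamiliar with the LD_PRELOAD technique, the sketch below shows, in generic form, how an open(2) interposer of this kind works: paths under /gfarm are recognized (here only a message is printed where libgfs_hook.so would call the Gfarm API), while everything is ultimately forwarded to the real system call obtained via dlsym(RTLD_NEXT). This is a simplified illustration of the mechanism, not the actual source of libgfs_hook.so.

    /* Simplified illustration of an LD_PRELOAD open(2) interposer.
     * Not the actual libgfs_hook.so source; the Gfarm call is replaced
     * by a message for brevity.
     * Build (Linux):  gcc -shared -fPIC -o myhook.so myhook.c -ldl
     * Use:            LD_PRELOAD=./myhook.so ./legacy_application        */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>

    int open(const char *path, int flags, ...)
    {
        static int (*real_open)(const char *, int, ...) = NULL;
        mode_t mode = 0;
        va_list ap;

        if (flags & O_CREAT) {              /* mode is only present with O_CREAT */
            va_start(ap, flags);
            mode = (mode_t) va_arg(ap, int);
            va_end(ap);
        }
        if (real_open == NULL)              /* look up the next (real) open(2) */
            real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

        if (strncmp(path, "/gfarm/", 7) == 0) {
            /* The real hook would translate the path and call the corresponding
             * Gfarm API here; this simplified sketch only reports the event
             * and then falls through to the ordinary system call. */
            fprintf(stderr, "would redirect %s to Gfarm\n", path);
        }
        return real_open(path, flags, mode);
    }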
13. Gfarm Applications and Performance Results
- http://datafarm.apgrid.org/
14. Scientific Application (1)
- ATLAS Data Production
- Distribution kit (binary)
- Atlfast: fast simulation
- Input data stored in the Gfarm file system, not NFS
- G4sim: full simulation
- (Collaboration with ICEPP, KEK)
- Belle Monte-Carlo/Data Production
- Online data processing
- Distributed data processing
- Real-time histogram display
- 10 M events generated in a few days using a 50-node PC cluster
- (Collaboration with KEK, U-Tokyo)
15. Scientific Application (2)
- Astronomical Object Survey
- Data analysis on the whole archive
- 652 GBytes of data observed by the SUBARU telescope
- Large configuration data from Lattice QCD
- Three sets of hundreds of gluon field configurations on a 24^3 x 48 4-D space-time lattice (3 sets x 364.5 MB x 800 = 854.3 GB)
- Generated by the CP-PACS parallel computer at the Center for Computational Physics, Univ. of Tsukuba (300 Gflops x years of CPU time)
16. Performance Result of Parallel grep
- 25-GByte text file
- Xeon 2.8 GHz/512 KB cache, 2 GB memory
- NFS: 340 sec (sequential grep)
- Gfarm: 15 sec (16 file system nodes, 16 parallel processes)
- 22.6x superlinear speed-up (a sketch of the per-fragment worker follows below)
[Figure: compute nodes accessing NFS vs. the Gfarm file system; the Gfarm file system consists of the local disks of the compute nodes]
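The superlinear speed-up comes from each of the 16 processes scanning only its own locally stored portion of the 25-GByte file. The C sketch below is a hypothetical per-process worker for such a run, not the actual benchmark code: it is started on the node holding one fragment (the name /gfarm/text/fragment.03 is made up) and counts the lines containing the search pattern using plain POSIX-style I/O.

    /* Hypothetical per-process worker for a parallel grep over Gfarm.
     * Each process is scheduled on the node storing its fragment, so the
     * scan below reads from the local disk. File name is illustrative. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const char *pattern  = (argc > 1) ? argv[1] : "error";
        const char *fragment = (argc > 2) ? argv[2] : "/gfarm/text/fragment.03";
        char line[4096];
        long matches = 0;

        FILE *fp = fopen(fragment, "r");    /* local replica reached via /gfarm */
        if (fp == NULL) {
            perror(fragment);
            return 1;
        }
        while (fgets(line, sizeof line, fp) != NULL)
            if (strstr(line, pattern) != NULL)
                matches++;
        fclose(fp);
        printf("%s: %ld matching lines in %s\n", pattern, matches, fragment);
        return 0;
    }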
17. GridFTP Data Transfer Performance
[Figure: multiple GridFTP clients transferring to ftpd servers; local disk vs. Gfarm (12 nodes)]
Two GridFTP servers can provide almost the peak performance (1 Gbps)
18. Gaussian 03 on Gfarm
- Ab initio quantum chemistry package
- Install once and run everywhere
- No modification required to access Gfarm
- Test 415 (I/O-intensive test input)
- 1 h 54 min 33 sec (NFS)
- 1 h 0 min 51 sec (Gfarm)
- Parallel analysis of all 666 test inputs using 47 nodes
- Write error! (NFS)
- Due to heavy I/O load
- 17 h 31 min 02 sec (Gfarm)
- Quite good scalability of I/O performance
- Elapsed time can be reduced by re-ordering the test inputs
[Figure: compute nodes accessing NFS vs. Gfarm; Gfarm consists of the local disks of the compute nodes]
19. Bioinformatics on Gfarm
- iGAP (Integrative Genome Annotation Pipeline)
- A suite of bioinformatics software for protein structural and functional annotation
- More than 140 complete or partial proteomes analyzed
- iGAP on Gfarm
- Install once and run everywhere using Gfarm's high-performance file replication and transfer
- No modifications required to use distributed compute and storage resources
Burkholderia mallei (bacteria)
Gfarm makes it possible to use iGAP to analyze the complete proteome (available 9/28/04) of the bacterium Burkholderia mallei, a known biothreat agent, on distributed resources. This is a collaboration under PRAGMA, and the data is available through http://eol.sdsc.edu.
Participating sites: SDSC/UCSD (US), BII (Singapore), Osaka Univ., AIST (Japan), Konkuk Univ., Kookmin Univ., KISTI (Korea)
20. iGAP Pipeline (Protein Sequences to Data Warehouse)
[Figure: flowchart from protein sequences, with structure info (SCOP, PDB) and sequence info (NR, PFAM), to a data warehouse]
- Step 1: Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG)
- Step 2: Structural assignment of domains by WU-BLAST
- Step 3: Structural assignment of domains by PSI-BLAST profiles on FOLDLIB
- Step 4: Structural assignment of domains by 123D on FOLDLIB
- Step 5: Functional assignment by PFAM, NR assignments
- Step 6: Domain location prediction by sequence
- Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches of PDB vs. SCOP; 90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps <30, ends <30)
- Results stored in the data warehouse
21. Cluster Configuration of Worldwide iGAP/Gfarm Data Analysis
22. Preliminary Performance Result
- Multiple-cluster data analysis
- NFS: 4-node cluster A, 30.07 min
- Gfarm: 4-node cluster A + 4-node cluster B, 17.39 min
24. Development Status and Future Plan
- Gfarm Grid file system
- Global virtual file system
- A dependable network shared file system in a cluster or a grid
- High-performance data computing support
- Associates the Computational Grid with the Data Grid
- Gfarm Grid software
- Version 1.1.1 released on May 17, 2005 (http://datafarm.apgrid.org/)
- Version 1.2 available real soon now
- Existing programs can access the Gfarm file system using the syscall hooking library or GfarmFS-FUSE
- Distributed analysis shows scalable I/O performance
- iGAP/Gfarm bioinformatics package
- Gaussian 03 ab initio quantum chemistry package
- Standardization effort with the GGF Grid File System WG (GFS-WG)
https://datafarm.apgrid.org/