Modeling and Acceleration of File-IO Dominated Parallel Workloads

1
Modeling and Acceleration of File-IO Dominated
Parallel Workloads
Presented at Analogic Corporation July 11th, 2005
  • Yijian Wang
  • David Kaeli
  • Department of Electrical and Computer Engineering
  • Northeastern University
  • yiwang@ece.neu.edu

2
Important File-based I/O Workloads
  • Many subsurface sensing and imaging workloads
    involve file-based I/O
  • Cellular biology: in-vitro fertilization with
    NU biologists
  • Medical imaging: cancer therapy with MGH
  • Underwater mapping: multi-sensor fusion with
    Woods Hole Oceanographic Institution
  • Ground-penetrating radar: toxic waste tracking
    with Idaho National Labs

3
The Impact of Profile-guided Parallelization on
SSI Applications
  • Reduced the runtime of a single-body Steepest
    Descent Fast Multipole Method (SDFMM) application
    by 74% on a 32-node Beowulf cluster
  • Hot-path parallelization
  • Data restructuring
  • Reduced the runtime of a Monte Carlo scattered
    light simulation by 98% on a 16-node Silicon
    Graphics Origin 2000
  • Matlab-to-C compilation
  • Hot-path parallelization
  • Obtained superlinear speedup of the Ellipsoid
    Algorithm run on a 16-node IBM SP2
  • Matlab-to-C compilation
  • Hot-path parallelization

4
Limits of Parallelization
  • For compute-bound workloads, Beowulf clusters can
    be used effectively to overcome computational
    barriers
  • Middleware (e.g., MPI and MPI-IO) can
    significantly reduce the programming effort on
    parallel systems
  • Multiple clusters can be combined using Grid
    middleware (the Globus Toolkit)
  • For file-based I/O-bound workloads, Beowulf
    clusters and Grid systems are presently
    ill-suited to exploiting the potential parallelism
    present in these workloads

5
Outline
  • Introduction
  • Characterization of Parallel I/O Access Patterns
  • Profile-Guided I/O Partitioning
  • Parallel I/O Modeling and Simulation
  • Work in Progress

6
Introduction
  • The I/O bottleneck: the growing gap between the
    speed of processors, networks and underlying I/O
    devices
  • Many imaging and scientific applications access
    disks very frequently
  • I/O-intensive applications
  • Out-of-core applications: large datasets that
    cannot fit into main memory
  • File-IO-intensive applications
  • Database applications: randomly access small data
    chunks
  • Multimedia servers: sequentially access large data
    chunks
  • Parallel scientific applications (our target
    applications)

7
Parallel Scientific Applications
  • Application classes
  • Sub-surface sensing and imaging
  • Medical image processing
  • Seismic processing
  • Fluid dynamics
  • Weather forecasting and simulation
  • High energy physics
  • Bio-informatics image processing
  • Aerospace applications
  • Application characteristics
  • Access patterns: a large number of non-contiguous
    data chunks
  • Multiple processes read/write simultaneously
  • Data sharing among multiple processes

8
Cluster Storage
  • General purpose shared file storage
  • Files (e.g., source code, executables, scripts)
    need to be accessible and available to all nodes
  • Stored on a centralized storage system (RAID,
    high capacity, high throughput)
  • A parallel file system provides concurrent access
  • I/O requests are forwarded to the I/O node, which
    completes them and sends the results back to the
    compute nodes over a message-passing network

  • Local disk
  • Hosts the OS
  • Virtual memory and swap space
  • Temporary files
(Diagram: compute nodes with local disks connected over Ethernet to the
shared file space)
9
I/O Models
10
I/O Models
(Diagram: disk/network contention under each I/O model)
11
Outline
  • Introduction
  • Characterization of Parallel I/O Access Patterns
  • Profile-Guided I/O Partitioning
  • Parallel I/O Modeling and Simulation
  • Work in Progress

12
Parallel I/O Access Patterns (Spatial)
Stride: the distance between two contiguous accesses
for every process
  • Simple Strided (diagram: processes 0-3 access
    interleaved chunks at a fixed stride)
  • Multiple Level Strided (diagram: strided accesses
    nested at more than one stride level)
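As a minimal sketch (not from the slides, with illustrative parameters), a simple strided pattern can be expressed as a rank-to-offset mapping:

    # Minimal sketch (illustrative parameters): file offsets touched by one
    # process under a simple strided pattern, with fixed-size chunks laid
    # out round-robin across processes.
    def strided_offsets(rank, num_procs, chunk_size, num_accesses):
        # Access i of process `rank` touches global chunk (i * num_procs + rank).
        return [(i * num_procs + rank) * chunk_size for i in range(num_accesses)]

    # Example: 4 processes, 2 KB chunks -> rank 1 touches offsets 2048, 10240, 18432
    print(strided_offsets(rank=1, num_procs=4, chunk_size=2048, num_accesses=3))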
13
Parallel I/O Access Patterns (Spatial)
  • Varied Extent (diagram: strided accesses whose
    chunk extents vary)
  • Segmented (diagram: processes 0 through N each
    access one contiguous segment between the start
    and end of the file)
14
Parallel I/O Access Patterns (Spatial)
  • Tiled Access (diagram: a 4x4 grid of tiles 0-15,
    one tile per process)
  • Overlapped Tiled Access (diagram: the same tiling
    with neighboring tiles overlapping)
15
Parallel I/O Access Patterns (Temporal)
  • Read once, computation, write once
  • Read, computation, write
  • Burst read, computation, burst write, computation
16
Outline
  • Introduction
  • Characterization of Parallel I/O Access Patterns
  • Profile-Guided I/O Partitioning
  • Parallel I/O Modeling and Simulation
  • Work in Progress

17
I/O Partitioning
  • I/O is parallelized at both the application level
    (using MPI and MPI-IO) and the disk level (using
    file partitioning); a minimal MPI-IO sketch
    follows this list
  • Final goal: integrate these levels into a
    system-wide approach
  • Scalability
  • Ideally, every process will only access files on
    its local disk (though this is typically not
    possible due to data sharing)
  • How do we recognize the access patterns?
  • A profile-guided approach
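Below is a minimal mpi4py sketch of the application-level half of this approach, in which each process writes its own partition of a shared file at a rank-dependent offset; the file name and chunk size are illustrative, not taken from the slides:

    # Minimal mpi4py sketch: each process writes its own partition of a
    # shared file at a rank-based offset (file name and chunk size are
    # illustrative).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    chunk = np.full(2040, rank, dtype=np.uint8)          # this rank's data chunk
    fh = MPI.File.Open(comm, "partitioned.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Write_at_all(rank * chunk.nbytes, chunk)          # collective write at rank offset
    fh.Close()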

18
Profile Generation
  • Run the instrumented application
  • Capture I/O execution profiles
  • Apply our partitioning algorithm
  • Rerun the tuned application
19
I/O traces and partitioning
  • For every process and every contiguous file
    access, we capture the following I/O profile
    information (see the record sketch after this list):
  • Process ID
  • File ID
  • Address
  • Chunk size
  • I/O operation (read/write)
  • Timestamp
  • Generate a partition for every process
  • Optimal partitioning is NP-complete, so we
    developed a greedy algorithm
  • We have found we can use partial profiles to
    guide partitioning
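A minimal sketch of the per-access record listed above; the recording helper is hypothetical and only illustrates the captured fields:

    # Minimal sketch of the captured I/O profile record.  The recording
    # helper is hypothetical, not the authors' instrumentation.
    import time
    from dataclasses import dataclass

    @dataclass
    class IORecord:
        process_id: int     # MPI rank issuing the access
        file_id: int        # identifier of the accessed file
        address: int        # starting byte offset of the contiguous access
        chunk_size: int     # length of the access in bytes
        operation: str      # "read" or "write"
        timestamp: float    # time at which the access was issued

    trace = []

    def record_access(rank, file_id, offset, size, op):
        # Append one profile record for a contiguous file access.
        trace.append(IORecord(rank, file_id, offset, size, op, time.time()))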

20
Greedy File Partitioning Algorithm
for each I/O process, create a partition
for each contiguous data chunk
    total up the number of read/write accesses on a
    process-ID basis
    if the chunk is accessed by only one process ID
        assign the chunk to the associated partition
    else if the chunk is read (but never written) by
    multiple processes
        duplicate the chunk in all partitions where it
        is read
    else if the chunk is written by one process but
    later read by multiple processes
        assign the chunk to all partitions where it is
        read, and broadcast the updates on writes
    else
        assign the chunk to a shared partition
for each partition
    sort the chunks based on the earliest timestamp
    of each chunk
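Below is a compact Python rendering of these greedy rules, assuming the profile record fields from the previous slide; the data structures and the "shared" partition label are assumptions rather than the authors' implementation:

    # Compact sketch of the greedy partitioning rules (assumed data
    # structures; "shared" stands in for the shared partition).
    from collections import defaultdict

    def build_partitions(trace):
        readers, writers = defaultdict(set), defaultdict(set)
        first_ts = {}
        for r in trace:                                    # r is an IORecord
            chunk = (r.file_id, r.address, r.chunk_size)   # one contiguous chunk
            (readers if r.operation == "read" else writers)[chunk].add(r.process_id)
            first_ts[chunk] = min(first_ts.get(chunk, r.timestamp), r.timestamp)

        partitions = defaultdict(list)
        for chunk in first_ts:
            procs = readers[chunk] | writers[chunk]
            if len(procs) == 1:                            # touched by a single process
                partitions[procs.pop()].append(chunk)
            elif not writers[chunk]:                       # read-only by several processes:
                for p in readers[chunk]:                   # replicate into each reader's partition
                    partitions[p].append(chunk)
            elif len(writers[chunk]) == 1:                 # one writer, many readers: place with
                for p in readers[chunk]:                   # the readers; the writer broadcasts
                    partitions[p].append(chunk)            # its updates on every write
            else:
                partitions["shared"].append(chunk)         # everything else stays shared

        for chunks in partitions.values():                 # order chunks by earliest access time
            chunks.sort(key=first_ts.__getitem__)
        return partitions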
21
Parallel I/O Workloads
  • NAS Parallel Benchmarks (NPB2.4)/BT
  • Computational fluid dynamics
  • Generates a file (1.6 GB) dynamically and then
    reads it back
  • Writes/reads sequentially in chunk sizes of 2040
    Bytes
  • SPEChpc96/seismic
  • Seismic processing
  • Generates a file (1.5 GB) dynamically and then
    reads it back
  • Writes sequential chunks of 96 KB and reads
    sequential chunks of 2 KB
  • Tile-IO
  • Parallel Benchmarking Consortium
  • Tile access to a two-dimensional matrix (1 GB)
    with overlap
  • Writes/reads sequential chunks of 32 KB, with 2KB
    of overlap
  • Perf
  • Parallel I/O test program within MPICH
  • Writes a 1 MB chunk at a location determined by
    rank, no overlap
  • Mandelbrot
  • An image processing application that includes
    visualization
  • Chunk size is dependent on the number of
    processes
  • Jacobi

22
The Joulian Cluster
(Diagram: Pentium II 350 MHz compute nodes with local PCI-IDE disks and
RAID nodes, connected by a 10/100 Mb Ethernet switch)
23
Write/Read Bandwidth for NPB2.4/BT
24
Write/Read Bandwidth for SPEChpc96/seis
25
Write/Read Bandwidth for Tile-IO
26
Write/Read Bandwidth for Perf
27
Write/Read Bandwidth for Mandelbrot
28
Write/Read Bandwidth for Jacobi
29
Write/Read Bandwidth for FFT
30
Overall Execution Time
31
Profile training sensitivity analysis
  • We have found that I/O access patterns are
    independent of file-based data values
  • When we increase the problem size or reduce the
    number of processes, either
  • the number of I/Os increases, but the access
    patterns and chunk size remain the same
    (SPEChpc96, Mandelbrot), or
  • the number of I/Os and the access patterns remain
    the same, but the chunk size increases (NPB,
    Tile-IO, Perf)
  • Re-profiling can be avoided

32
Outline
  • Introduction
  • Characterization of Parallel I/O Access Patterns
  • Profile-Guided I/O Partitioning
  • Parallel I/O Modeling and Simulation
  • Work in Progress

33
Parallel I/O Simulation
  • Explore a larger I/O design space
  • Studying new disk devices and technologies
  • Efficient implementation of storage architectures
    can significantly improve system performance
  • An accurate simulation environment for users to
    test and evaluate different storage architectures
    and applications

34
Storage Architecture
  • Direct Attached Storage (DAS)
  • Storage device is directly attached to the
    computer
  • Network Attached Storage (NAS)
  • Storage subsystem is attached to a network of
    servers and file requests are passed through a
    parallel file system to the centralized storage
    device

(Diagram: DAS with a disk attached directly to each server; NAS with
servers accessing a centralized storage device over the LAN/WAN)
35
Storage Architecture
  • Storage Area Network (SAN)
  • A dedicated network to provide an any-to-any
    connection between processors and disks
  • To offload I/O traffic from backbone network

(Diagram: servers on a LAN/WAN with a dedicated SAN connecting them to
the disks)
36
Execution-driven Parallel I/O Simulation
  • Use DiskSim as the underlying disk drive
    simulator
  • DiskSim 3.0, Carnegie Mellon University
  • Direct execution to model CPU and network
    communication
  • We execute the real parallel I/O accesses and, at
    the same time, calculate the simulated I/O
    response time (a minimal sketch follows)
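A minimal sketch of the execution-driven idea, with a purely hypothetical simulate_request() standing in for the real DiskSim interface: the access is actually performed, while its cost is charged at the simulated response time:

    # Minimal sketch of execution-driven I/O simulation.  simulate_request()
    # is a purely hypothetical stand-in for the real DiskSim interface.
    simulated_io_time = 0.0                     # accumulated simulated response time

    def simulate_request(offset, size, is_read):
        # Hypothetical hook: would hand the request to the disk simulator
        # and return its simulated response time in seconds.
        return 0.005                            # placeholder value

    def timed_read(fh, offset, size):
        global simulated_io_time
        fh.seek(offset)
        data = fh.read(size)                    # really perform the access ...
        simulated_io_time += simulate_request(offset, size, True)  # ... charge simulated time
        return data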

37
Simulation Framework - NAS
(Diagram: per-node local I/O traces flow over the LAN/WAN to the network
file system and RAID controller, where filesystem metadata translates
logical file access addresses into I/O requests for DiskSim)
38
Simulation Framework - SAN direct
  • A variant of SAN where disks are distributed
    across the network and each server is directly
    connected to a single device
  • File partitioning
  • Utilize I/O profiling and data partitioning
    heuristics to distribute portions of files to
    disks close to the processing nodes

(Diagram: each node's file system feeds its local I/O traces to its own
DiskSim instance, with nodes connected over the LAN/WAN)
39
Experimental Hardware Specifics
  • DAS configuration
  • A standalone PC, Western Digital WD800BB (IDE),
    80GB, 7200RPM
  • Beowulf cluster (base configuration)
  • Fast Ethernet, 100 Mbits/sec
  • Network-attached RAID - Morstor TF200 with 6-9GB
    Seagate SCSI disks, 7200rpm, RAID-5
  • Locally attached IDE disks - IBM UltraATA-350840,
    5400rpm
  • Fibre Channel disks
  • Seagate Cheetah X15 ST-336752FC, 15000rpm

40
Validation - Microbenchmarks on DAS
41
Validation - Overall Execution Time of NPB2.4/BT
(NAS)
42
Validation - Overall Execution Time of NPB2.4/BT
(SAN)
43
I/O Throughput of NPB2.4/BT - base configuration
44
I/O Throughput of NPB2.4/BT - Fibre Channel disk
45
I/O Throughput of SPEC/seis - Fibre Channel disk
46
I/O Throughput of SPEC/seis - Fibre Channel disk,
SAN all-to-all (all nodes have a direct connection
to each disk)
47
Simulation of Disk Interfaces and Interconnections
  • Study the overall system performance as a
    function of underlying storage architectures
  • Interconnections: NAS-RAID and SAN-direct
  • Disk interfaces

48
Overall Execution Time of NPB2.4/BT
49
Overall Execution Time of SPEChpc/seis
50
Overall Execution Time of Perf
51
Overall Execution Time of Mandelbrot
52
Overall Execution Time of Jacobi
53
Outline
  • Introduction
  • Characterization of Parallel I/O Access Patterns
  • Profile-Guided I/O Partitioning
  • Parallel I/O Modeling and Simulation
  • Work in Progress

54
Work in Progress - Grid I/O
  • The Grid
  • Connects geographically distributed computing
    resources
  • To provide high-performance computing power to
    users across the nation and the world
  • A centralized storage architecture will limit the
    overall system performance
  • A peer-to-peer Grid storage system to achieve
    high-performance, scalable I/O

(Diagram: two clusters, one with 31 sub-nodes and one with 8, whose head
nodes and a RAID are connected over the Internet via 1 Gbit/s links)
55
Work in Progress - Workload Balancing
  • Load balancing for heterogeneous storage systems
  • The Grid is inherently a heterogeneous system
  • Clusters usually have heterogeneous storage
    devices
  • The file-system aging problem introduces
    heterogeneity into the system

56
Publications
  • Profile-Guided File Partitioning on Beowulf
    Clusters, accepted by the Journal of Cluster
    Computing
  • Load Balancing of Grid-based Peer-to-Peer
    Parallel I/O, the 2005 IEEE International
    Conference on Cluster Computing
  • Execution-driven Simulation of Network Storage
    Architectures, 12th IEEE International Symposium
    on Modeling, Analysis, and Simulation of Computer
    and Telecommunication Systems, October 2004
  • Profile-Guided I/O Partitioning, 17th ACM
    International Conference on Supercomputing, June
    2003
  • Source Level Transformation to Apply Data
    Partitioning, International Workshop on Storage
    Network Architecture and Parallel I/Os, September
    2003
  • Profile-Guided Data Partitioning to Achieve High
    Performance I/O, 1st Boston Area Computer
    Architecture Conference, January 2003

57
Thank You! Questions?
http://www.ece.neu.edu/students/yiwang/HPSA.htm