Investigation of Leading HPC I/O Performance Using a Scientific Application Derived Benchmark
1
Investigation of Leading HPC I/O Performance
Using a Scientific Application Derived Benchmark
  • Julian Borrill, Leonid Oliker, John Shalf,
    Hongzhang Shan
  • Computational Research Division/
  • National Energy Research Scientific Computing
    Center (NERSC)
  • Lawrence Berkeley National Laboratory

2
Overview
  • Motivation
  • Demands for computational resources are growing at a rapid rate
  • Racing toward very-high-concurrency petaflop computing
  • Explosion of sensor and simulation data makes I/O a critical component
  • Overview
  • Present MADbench2: a lightweight, portable, parameterized I/O benchmark
  • Derived directly from a CMB analysis package
  • Allows study under realistic I/O demands and access patterns
  • Discovered optimizations can be fed back into the scientific code
  • Tunable code allows I/O exploration of new and future systems
  • Examine I/O performance across 7 leading HEC systems
  • Lustre (XT3, IA-64 cluster), GPFS (Power5, AMD cluster), BG/L (GPFS and PVFS2), CXFS (SGI Altix)
  • Key distinction from other I/O benchmarks: the workload is application-derived, so results can be interpreted in the context of real application requirements

3
Cosmic Microwave Background
  • After the Big Bang, the expansion of space cools the Universe until it falls below the ionization temperature of hydrogen, at which point free electrons combine with protons
  • With nothing to scatter off, the photons then free-stream; the CMB is therefore a snapshot of the Universe at the moment it first becomes electrically neutral, about 400,000 years after the Big Bang
  • Tiny anisotropies in the CMB radiation are sensitive probes of cosmology

Cosmic - primordial photons filling all space.
Microwave - red-shifted by the continued expansion of the Universe from 3000 K at last scattering to 3 K today.
Background - coming from behind all astrophysical sources.
4
CMB Science
  • The CMB is a unique probe of the very early
    Universe
  • Tiny fluctuations in its temperature (1 part in 100,000) and polarization (1 part in 100,000,000) encode the fundamental parameters of cosmology, including the geometry, composition (mass-energy content), and ionization history of the Universe
  • Combined with complementary supernova measurements tracing the dynamical history of the Universe, we have an entirely new concordance cosmology: 70% dark energy, 25% dark matter, 5% ordinary matter
  • Nobel prizes: 1978 (Penzias & Wilson) for detection of the CMB; 2006 (Mather & Smoot) for detection of CMB fluctuations

5
CMB Data Analysis
  • CMB analysis progressively moves
  • from the time domain: precise high-resolution measurements of the microwave sky - O(10^12)
  • to the pixel domain: a pixelized sky map - O(10^8)
  • and finally to the multipole domain: the angular power spectrum (the most compact sufficient statistic for the CMB) - O(10^4)
  • calculating the compressed data and their reduced error bars (data correlations for error/uncertainty analysis) at each step
  • Problem exacerbated by an explosion in dataset
    sizes as cosmologists try to improve accuracy
  • HEC has therefore become an essential part of CMB
    data analysis

6
MADbench2 Overview
  • Lightweight version of the MADCAP maximum
    likelihood CMB angular power spectrum estimation
    code
  • Unlike most I/O benchmarks, MADbench2 is derived directly from an important application
  • Benchmark retains operational complexity and
    integrated system requirements of the full
    science code
  • Eliminated special-case features, preliminary
    data checking, etc.
  • Out-of-core calculation due to the large size of the pixel-pixel correlation matrices
  • Holds at most three matrices in memory at any one time
  • MADbench2 is used for
  • Procuring supercomputers and filesystems
  • Benchmarking and optimizing performance of
    realistic scientific applications
  • Comparing various computer system architectures

7
Computational Structure
  • Derive spectra from sky maps by
  • Compute, Write (loop): recursively build a sequence of Legendre-polynomial-based CMB signal pixel-pixel correlation component matrices
  • Compute/Communicate: form and invert the CMB signal + noise correlation matrix
  • Read, Compute, Write (loop): read each CMB component signal matrix, multiply it by the inverse CMB data correlation matrix, and write the resulting matrix to disk
  • Read, Compute/Communicate (loop): in turn read each pair of these result matrices and calculate the trace of their product
  • Recast as a benchmarking tool: all scientific detail is removed, and a tunable busy-work component measures the balance between the computational method and I/O (a minimal sketch of this phase structure follows)
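The phase structure above can be summarized in a short sketch. This is a minimal, hypothetical illustration rather than MADbench2 source: the serial execution, POSIX I/O calls, file names, and helper functions (build_component, busy_work) are assumptions made for clarity; the real code distributes the matrices with ScaLAPACK/MPI and also supports MPI-IO.

```c
/* Minimal sketch of the MADbench2 phase structure (serial, POSIX I/O).
 * Function names, file names, and the serial layout are illustrative
 * assumptions; the matrix-inversion and trace phases are omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NPIX 1000            /* matrix dimension (pixels per side)  */
#define NBIN 8               /* number of component matrices (bins) */

/* Stand-in for building one Legendre-based signal component matrix. */
static void build_component(double *m, int bin) {
    long n = (long)NPIX * NPIX;
    for (long i = 0; i < n; i++)
        m[i] = cos((bin + 1) * (double)i / (double)n);
}

/* Tunable busy-work: roughly NPIX^(2*alpha) flops, standing in for the
 * dense matrix algebra whose cost the BWEXP parameter controls. */
static double busy_work(const double *m, double alpha) {
    long n = (long)NPIX * NPIX;
    long flops = (long)pow((double)NPIX, 2.0 * alpha);
    double s = 0.0;
    for (long i = 0; i < flops; i++)
        s += m[i % n] * 1e-9;
    return s;
}

int main(void) {
    long n = (long)NPIX * NPIX;
    double *m = malloc(n * sizeof *m);
    char name[64];

    /* Phase: Compute, Write (loop) - build each component matrix and
     * write it out-of-core. */
    for (int b = 0; b < NBIN; b++) {
        build_component(m, b);
        snprintf(name, sizeof name, "component_%02d.dat", b);
        FILE *f = fopen(name, "wb");
        if (!f) { perror(name); return 1; }
        fwrite(m, sizeof *m, n, f);
        fclose(f);
    }

    /* Phase: Read, Compute, Write (loop) - read each matrix back, do
     * busy-work (standing in for the multiply by the inverse data
     * correlation matrix), and rewrite it. */
    for (int b = 0; b < NBIN; b++) {
        snprintf(name, sizeof name, "component_%02d.dat", b);
        FILE *f = fopen(name, "rb");
        if (!f) { perror(name); return 1; }
        fread(m, sizeof *m, n, f);
        fclose(f);
        double s = busy_work(m, 1.0);
        f = fopen(name, "wb");
        if (!f) { perror(name); return 1; }
        fwrite(m, sizeof *m, n, f);
        fclose(f);
        printf("bin %d busy-work checksum %g\n", b, s);
    }
    free(m);
    return 0;
}
```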

8
MADbench2 Parameters
Environment variables:
  IOMETHOD - POSIX or MPI-IO data transfers
  IOMODE - synchronous or asynchronous I/O
  FILETYPE - unique (1 file per process) or shared (1 file for all processes); a sketch of the two layouts follows the table below
  BWEXP - the busy-work exponent α
Command-line arguments:
  NPIX - number of pixels (matrix size)
  NBIN - number of bins (matrix count)
  SBLOCKSIZE - ScaLAPACK block size
  FBLOCKSIZE - file block size
  MODRW - I/O concurrency control (only 1 in MODRW processes does I/O simultaneously)
CPUs   BG/L CPUs   NPIX      NBIN   Mem (GB)   Disk (GB)
---    16          12,500    8      6          9
16     64          25,000    8      23         37
64     256         50,000    8      93         149
256    ---         100,000   8      373        596
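As a concrete illustration of the FILETYPE parameter, the sketch below contrasts the two layouts using plain POSIX calls: one file per process versus a single shared file written at rank-dependent offsets. It is a hedged, hypothetical example (the file names, the 1 MB chunk size, and the use of the MPI rank are assumptions), not MADbench2 code.

```c
/* Sketch of FILETYPE=unique vs. FILETYPE=shared with POSIX I/O.
 * File names and sizes are illustrative; compile with mpicc. */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t chunk = 1 << 20;                 /* 1 MB per process */
    char *buf = malloc(chunk);
    memset(buf, rank & 0xff, chunk);

    const char *filetype = getenv("FILETYPE");    /* "unique" or "shared" */
    if (filetype && strcmp(filetype, "shared") == 0) {
        /* Shared: every rank writes to its own offset in one file. */
        int fd = open("data_shared.dat", O_CREAT | O_WRONLY, 0644);
        pwrite(fd, buf, chunk, (off_t)rank * chunk);
        close(fd);
    } else {
        /* Unique: one file per rank. */
        char name[64];
        snprintf(name, sizeof name, "data_%05d.dat", rank);
        int fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        write(fd, buf, chunk);
        close(fd);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```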
9
Parallel Filesystem Overview
Machine    | Parallel FS | Proc Arch | Compute-to-I/O Interconnect | Max Node BW to I/O (GB/s) | Measured Node BW (GB/s) | I/O Servers : Clients | Max Disk BW (GB/s) | Total Disk (TB)
Jaguar     | Lustre      | AMD       | SeaStar-1                   | 6.4 | 1.2 | 1:105 | 22.5 | 100
Thunder    | Lustre      | IA64      | Quadrics                    | 0.9 | 0.4 | 1:64  | 6.4  | 185
Bassi      | GPFS        | Power5    | Federation                  | 8.0 | 6.1 | 1:16  | 6.4  | 100
Jacquard   | GPFS        | AMD       | InfiniBand                  | 2.0 | 1.2 | 1:22  | 6.4  | 30
SDSC BG/L  | GPFS        | PPC       | GigE                        | 0.2 | 0.2 | 1:8   | 8    | 220
ANL BG/L   | PVFS2       | PPC       | GigE                        | 0.2 | 0.2 | 1:32  | 1.3  | 7
Columbia   | CXFS        | IA64      | FC4                         | 1.6 | N/A | N/A   | 1.6  | 600
(Architecture diagrams: Lustre / GPFS / PVFS2; CXFS)
10
Jaguar Performance
  • Highest synchronous unique-file read/write performance of all evaluated platforms
  • Small concurrencies are insufficient to saturate I/O
  • SeaStar max throughput is 1.1 GB/s
  • The system is near its theoretical I/O peak at P=256
  • Reading is slower than writing due to buffering
  • Unlike unique files, shared-file performance is uniformly poor
  • Default I/O traffic only uses 8 of the 96 OSTs
  • The OST restriction gives consistent performance, but limits a single job's access to the full throughput
  • Striping across all 96 OSTs (lstripe) gives comparable performance between unique and shared files
  • 96 OSTs is not the default because it
  • increases the risk of job failure
  • exposes jobs to more I/O interference
  • reduces the performance of unique-file access

(Plots: default vs. with striping)
Lustre: 5,200 dual-AMD-node XT3 @ ORNL; SeaStar-1 via HyperTransport in a 3D torus; Catamount compute PEs, Linux service PEs; 48 OSS, 1 MDS, 96 OSTs; 22.5 GB/s peak I/O
11
Thunder Performance
  • Second-highest overall unique-file I/O performance
  • Peak and sustained rates are a fraction of Jaguar's
  • I/O trends are very similar to the Lustre-based Jaguar system
  • Writes outperform reads (buffering)
  • Shared files are significantly slower than unique files
  • Unlike Jaguar, attempts to stripe did not improve performance
  • The difference is likely due to older hardware and software
  • Future work will examine performance on an updated software environment

Lustre: 1,024 quad-Itanium2 nodes @ LLNL; Quadrics Elan4 fat-tree, GigE, Linux; 16 OSS, 2 MDS, 32 OSTs; 6.4 GB/s peak
12
Bassi Jacquard Performance
  • Unlike Lustre, Bassi and Jacquard attain similar shared- and unique-file performance
  • Unique-file I/O is significantly slower than on Jaguar
  • Bassi and Jacquard attain high shared-file performance with no special optimization
  • Bassi quickly saturates I/O due to its high-bandwidth node-to-I/O interconnect
  • Higher read performance could be a result of GPFS prefetching
  • Jacquard continues to scale at P=256, indicating that its GPFS servers have not been saturated
  • Bassi outperforms Jacquard due to superior node-to-I/O bandwidth (8 vs 2 GB/s)

(Plots: Bassi, Jacquard)
Bassi: GPFS; 122 8-way Power5 nodes, AIX, Federation fat-tree; 6 VSD, 16 FC links; 6.4 GB/s peak; @ LBNL
Jacquard: GPFS; 320 dual-AMD nodes, Linux, InfiniBand fat-tree (4X leaves, 12X spine); peak 4.2 GB/s (IP over IB); @ LBNL
13
SDSC BG/L Performance
  • The BG/Ls have lower performance, but they are the smallest systems in our study (1,024 nodes)
  • The original SDSC configuration showed rather poor I/O performance and scaling
  • The upgrade (WAN) is comparable with Jacquard and continues to scale at P=256
  • The WAN system has many more spindles and NSDs, and thus higher available bandwidth
  • Like the other GPFS systems, unique and shared files show similar I/O rates with no tuning required

GPFS: 1,024 dual-PPC nodes @ SDSC; global tree network; CNK (compute), Linux (service); 1:8 I/O servers to compute nodes, forwarding via GigE; original: 12 NSD, 2 MDS; upgrade: 50 NSD, 6 MDS
14
ANL BG/L Performance
  • Low I/O throughput across configurations
  • Drop-off in read performance beyond P=64
  • Attempts to tune I/O performance did not succeed
  • RAID chunk size, striping
  • Future work will continue exploring optimizations
  • Normalized the compute-to-I/O-server ratio (8:1 vs 32:1) with SDSC by using 4x the ANL processors with 3 of 4 idle
  • This improved ANL by 2.6x, but it remains 4.7x slower than SDSC
  • The ratio of I/O nodes is only one of many factors

PVFS2: 1,024 dual-PPC nodes @ ANL; global tree network; CNK (compute), Linux (service); 1:32 I/O servers to compute nodes (vs 1:8 at SDSC); peak I/O BW 1.3 GB/s
15
Columbia Performance
  • Default I/O rates are the lowest of the evaluated systems
  • Read/shared performance peaks at P=16: the I/O interface of the Altix CC-NUMA node is shared across the node
  • Higher P does not increase the bandwidth potential
  • With increasing concurrency
  • higher lock overhead (buffer-cache access)
  • more contention for the I/O subsystem
  • potentially reduced coherence of I/O requests
  • DirectIO bypasses the block-buffer cache and presents I/O requests directly to the disk subsystem from memory (a minimal sketch of the alignment requirements follows this list)
  • Prevents block-buffer cache reuse
  • Complicates I/O: each transaction must be block-aligned on disk
  • Has restrictions on memory alignment
  • Forces programming with disk-block-sized I/O as opposed to arbitrary-size POSIX I/O
  • Results show DirectIO significantly improves I/O
  • Saturation occurs at low P (good for low-P jobs)
  • The Columbia CC-NUMA also offers the option of using idle processors for I/O buffering for high-priority jobs
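A minimal sketch of the block-alignment constraint is shown below. It uses Linux O_DIRECT and posix_memalign as a stand-in for the direct-I/O path described above; the 4096-byte block size, file name, and transfer size are assumptions, not values from the Columbia configuration.

```c
/* Sketch: direct I/O requires block-aligned buffers, offsets, and sizes.
 * Assumes Linux O_DIRECT; the 4096-byte alignment and path are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    const size_t block = 4096;                /* assumed filesystem block size */
    const size_t nblocks = 256;               /* 1 MB transfer */
    void *buf;

    /* The buffer must be aligned to the block size for O_DIRECT. */
    if (posix_memalign(&buf, block, block * nblocks) != 0) return 1;
    memset(buf, 0, block * nblocks);

    int fd = open("direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* The transfer size and file offset must also be block multiples. */
    ssize_t n = pwrite(fd, buf, block * nblocks, 0);
    if (n < 0) perror("pwrite");
    else printf("wrote %zd bytes, bypassing the buffer cache\n", n);

    close(fd);
    free(buf);
    return 0;
}
```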

(Plots: default vs. DirectIO)
CXFS: 20 Altix 3700 nodes, 512-way IA64 @ NASA; 10,240 processors, Linux, NUMAlink3; clients attach to FC without an intervening storage server; 3 MDS via GigE; max 4 FC4, peak 1.6 GB/s
16
Comparative Performance
  • Jaguar attains the highest unique-file rates; the GPFS systems (Bassi, Jacquard, SDSC BG/L) deliver similar shared- and unique-file performance without tuning; Lustre needs striping to bring shared-file rates up to unique-file rates; ANL BG/L and default Columbia show the lowest throughput, with Columbia improving substantially under DirectIO

17
Asynchronous Performance
  • Most examined systems saturate at only P=256 - a concern for ultra-scale computing
  • It is possible to hide I/O behind simultaneous calculation in MADbench2 via MPI-2 (a sketch of the overlap follows this list)
  • Only 2 of the 7 systems (Bassi and Columbia) support fully asynchronous I/O
  • We define a busy-work exponent α that corresponds to O(N^α) flops
  • Bassi and Columbia improve I/O by almost 8x for high α (peak improvement)
  • Bassi now shows 2x the performance of Jaguar
  • As expected, small α reproduces synchronous behavior
  • The critical value for the transition is α between 1.3 and 1.4, i.e. algorithms of more than O(N^2.6)
  • Only BLAS3 computations can effectively hide I/O. If the balance between computational and I/O rates continues to decline, the effective α will increase. However, we are quickly approaching the practical limit of BLAS3 complexity!
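The overlap idea in its simplest form is sketched below: post a non-blocking MPI-IO write, perform O(N^α) busy-work, then wait for the write to complete. This is a hedged illustration using the standard MPI-2 call MPI_File_iwrite_at, not the MADbench2 implementation; the file name, buffer size, and α value are assumptions.

```c
/* Sketch: hide a matrix write behind busy-work using non-blocking MPI-IO.
 * Compile with mpicc; names, sizes, and alpha are illustrative assumptions. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                         /* doubles per rank */
    double *buf = malloc(n * sizeof *buf);
    for (int i = 0; i < n; i++) buf[i] = (double)i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "async.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Post the write asynchronously at a rank-dependent offset ... */
    MPI_Request req;
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_iwrite_at(fh, offset, buf, n, MPI_DOUBLE, &req);

    /* ... and do O(N^alpha) busy-work while the I/O (ideally) proceeds.
     * The buffer is not modified until the write has completed. */
    double alpha = 1.4, s = 0.0;
    long flops = (long)pow((double)n, alpha);
    for (long i = 0; i < flops; i++) s += 1e-12 * (double)(i % 97);

    MPI_Wait(&req, MPI_STATUS_IGNORE);             /* the I/O must finish here */
    if (rank == 0) printf("busy-work checksum %g\n", s);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```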

18
Conclusions
  • I/O is a critical component due to the exponential growth of sensor and simulated data
  • Presented one of the most extensive I/O analyses of parallel filesystems
  • Introduced MADbench2, derived directly from CMB analysis
  • Lightweight, portable, generalized to varying computations (α)
  • POSIX vs MPI-IO, shared vs unique, synchronous vs asynchronous
  • Concurrent accesses work properly with the modern POSIX API (same as MPI-IO)
  • It is possible to achieve similar behavior between shared and unique file access!
  • Default on all systems except Lustre, which required a trivial modification
  • Varying concurrency can saturate the underlying disk subsystem
  • Columbia saturates at P=16, while SDSC BG/L did not saturate even at P=256
  • Asynchronous I/O offers tremendous potential, but is supported by few systems
  • Defined the amount of computation per unit of I/O data via α
  • Showed that the computational intensity required to hide I/O is close to BLAS3
  • Future work: continue evaluating the latest HEC systems, explore the effect of inter-processor communication on I/O behavior, and conduct an analysis of I/O variability

19
Acknowledgments
  • We gratefully thank the following individuals for
    their kind assistance
  • Tina Butler, Jay Srinivasan (LBNL)
  • Richard Hedges (LLNL)
  • Nick Wright (SDSC)
  • Susan Coughlan, Robert Latham, Rob Ross, Andrew
    Cherry (ANL)
  • Robert Hood, Ken Taylor, Rupak Biswas (NASA-Ames)