1
Scientific Computations on Modern Parallel
Vector Systems
  • Leonid Oliker
  • Staff Computer Scientist
  • Future Technologies Group
  • Computational Research Division, Lawrence Berkeley National Laboratory
  • loliker@lbl.gov
  • http://crd.lbl.gov/oliker/paperlinks.html

2
Overview
  • Superscalar cache-based architectures dominate US
    HPC market
  • Leading architectures are commodity-based SMPs due to generality and perceived cost effectiveness
  • Growing gap between peak and sustained performance is well known in scientific computing
  • Modern parallel vector systems may bridge this gap for many important applications
  • In April 2002, the Earth Simulator (ES) became operational; peak ES performance > that of all DOE and DOD systems combined; demonstrated high sustained performance on demanding scientific apps
  • Currently conducting evaluation study of DOE applications on modern parallel vector architectures; final year of a three-year project
  • 09/2003 MOU between NERSC and ES was completed; visited ES center December 8th-17th, 2003; first international team to conduct a performance evaluation study at ES

3
Vector Paradigm
  • High memory bandwidth
  • Allows systems to effectively feed ALUs (high
    byte to flop ratio)
  • Flexible memory addressing modes
  • Supports fine grained strided and irregular data
    access
  • Vector Registers
  • Hide memory latency via deep pipelining of memory
    load/stores
  • Vector ISA
  • Single instruction specifies large number of
    identical operations
  • Vector architectures allow for
  • Reduced control complexity
  • Efficiently utilize large number of computational
    resources
  • Potential for automatic discovery of parallelism
  • However, only effective if sufficient regularity is discoverable in program structure
  • Suffers greatly even if a small % of code is non-vectorizable (Amdahl's Law) - see the sketch below
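A minimal Fortran sketch (illustrative only, not taken from any of the studied codes) of what "sufficient regularity" means in practice: the first loop has no loop-carried dependence and can be issued as vector operations, while the second carries a recurrence on a(i-1) and falls back to scalar execution.

    subroutine vec_example(a, b, c, n)
      integer, intent(in) :: n
      real(8), intent(inout) :: a(n)
      real(8), intent(in) :: b(n), c(n)
      integer :: i

      ! Vectorizable: every iteration is independent
      do i = 1, n
         a(i) = b(i) + 2.0d0 * c(i)
      end do

      ! Not vectorizable as written: recurrence on a(i-1)
      do i = 2, n
         a(i) = a(i-1) + b(i)
      end do
    end subroutine vec_example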

4
ES Processor Overview
  • 8 Gflops per CPU
  • 8 CPU per SMP
  • 8 way replicated vector pipe
  • 72 vec registers, 256 64-bit words
  • Divide Unit
  • 32 GB/s pipe to FPLRAM
  • 4-way superscalar out-of-order unit @ 1 Gflop
  • 64KB instruction and data caches
  • Earth Simulator 640 nodes
  • ES uses newly developed FPLRAM (Full Pipelined RAM); SX-6 uses DDR-SDRAM 128/256Mb
  • ES uses IN: 12.3 GB/s bi-directional between any 2 nodes, 640 nodes; SX-6 uses IXS: 8 GB/s bi-directional between any 2 nodes, max 128 nodes

5
Earth Simulator Overview
  • Machine type: 640 nodes, each node an 8-way SMP of vector processors (5120 total procs)
  • Machine peak: 40 TF/s (proc peak 8 GF/s)
  • OS: extended version of Super-UX, a 64-bit Unix OS based on System V R3
  • Connection structure: single-stage crossbar network (1500 miles of cable, 83,000 copper cables); 7.9 TB/s aggregate switching capacity; 12.3 GB/s bi-directional between any two nodes
  • Global Barrier Counter within the interconnect allows global barrier synch in < 3.5 usec
  • Storage: 480 TB disk, 1.5 PB tape
  • Compilers: Fortran 90, HPF, ANSI C, C++
  • Batch: similar to NQS, PBS
  • Parallelization: vectorization at the processor level; OpenMP, Pthreads, MPI, HPF

6
Earth Simulator Cost
Approx. costs: Development $400M; Building $70M; Maintenance $50M/year; Electricity $8M/year
7
ES Programming Environment
  • Only benchmarking size runs were submitted (no
    production runs)
  • ES not connected to Internet
  • Interactive, S cluster, L cluster (2 nodes, 14
    nodes, 624 nodes)
  • No global file system
  • Few numerical libraries
  • Using >10 nodes requires a minimum vectorization ratio of 95% and parallelization efficiency of 50%
  • Examples of required parallelization ratio (as defined by Amdahl's Law): 16 nodes 99.21%, 64 nodes 99.80%, 256 nodes 99.95% (see the derivation after this list)
  • Lack of external access and usage hurdles
    inhibits scientific productivity
  • All codes were ported/vectorized on single node
    SX6 at ARSC (Oliker et al, SC2003)
  • Multi-node vector runs first attempted at ES
    center
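A short derivation of the quoted thresholds, under the assumption that the 50% requirement is parallel efficiency over all processors (8 per node, so 16/64/256 nodes correspond to P = 128/512/2048). With parallel fraction f, Amdahl's Law gives speedup S(P) = 1/((1-f) + f/P), so

    \[
    E(P) = \frac{S(P)}{P} = \frac{1}{P(1-f) + f} \;\ge\; \frac{1}{2}
    \quad\Longrightarrow\quad
    f \;\ge\; \frac{P-2}{P-1}
    \]

For P = 128 this gives f >= 126/127 = 99.21%; for P = 512, f >= 510/511 = 99.80%; for P = 2048, f >= 2046/2047 = 99.95%, matching the ratios above.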

8
Cray X1 Overview
  • SSP: 3.2 GF computational core; VL 64, dual vector pipes (800 MHz); 2-way scalar at 0.4 GF (400 MHz)
  • MSP: 12.8 GF, combines 4 SSPs; shares 2MB data cache (unique)
  • Node: 4 MSPs w/ flat shared mem
  • Interconnect: modified 2D torus; fewer links than a full crossbar, but smaller bisection bandwidth
  • Globally addressable: procs can directly read/write to global mem
  • Parallelization: vectorization (SSP), multistreaming (MSP), shared mem (OpenMP, Pthreads), inter-node (MPI-2, CAF, UPC)
(Figure: X1 SSP, MSP, and node block diagram)
9
Altix3000 Overview
  • Itanium2 @ 1.5 GHz (peak 6 GF/s); 128 FP registers, 32K L1, 256K L2, 6MB L3
  • Cannot store FP values in L1
  • EPIC bundles instructions
  • Bundles processed in-order, instructions within a bundle processed in parallel
  • Consists of C-bricks: 4 Itanium2, memory, 2 controller ASICs called SHUB
  • Uses high-bandwidth, low-latency NUMAlink3 interconnect (fat-tree)
  • Implements ccNUMA protocol in hardware
  • A cache miss causes data to be communicated/replicated via the SHUB
  • Uses 64-bit Linux with single system image (256 processors, a few reserved for OS services)
  • Scalability to large numbers of processors ?

10
Architectural Comparison
Node Type | Where | CPU/Node | Clock (MHz) | Peak (Gflop/s) | Mem BW (GB/s) | Peak (byte/flop) | Netwk BW (GB/s/P) | Bisect BW (byte/flop) | MPI Latency (usec) | Network Topology
Power3 | NERSC | 16 | 375 | 1.5 | 1.0 | 0.47 | 0.13 | 0.087 | 16.3 | Fat-tree
Power4 | ORNL | 32 | 1300 | 5.2 | 2.3 | 0.44 | 0.13 | 0.025 | 7.0 | Fat-tree
Altix | ORNL | 2 | 1500 | 6.0 | 6.4 | 1.1 | 0.40 | 0.067 | 2.8 | Fat-tree
ES | ESC | 8 | 500 | 8.0 | 32.0 | 4.0 | 1.5 | 0.19 | 5.6 | Crossbar
X1 | ORNL | 4 | 800 | 12.8 | 34.1 | 2.7 | 6.3 | 0.088 | 7.3 | 2D-torus
  • Custom vector architectures have
  • High memory bandwidth relative to peak
  • Superior interconnect latency, point to point,
    and bisection bandwidth
  • Overall, ES appears to be the most balanced architecture, while Altix shows the best architectural balance among the superscalar architectures
  • A key balance point for vector systems is the scalar:vector ratio

11
Memory Performance
Triad mem test: A(i) = B(i) + s*C(i); NO machine-specific optimizations (sketched after the bullets below)
  • For strided access, SX6 achieves 10X, 100X, 1000X
    improvement over X1, Pwr4, Pwr3
  • For gather/scatter, SX6/X1 show similar
    performance, exceed scalar at higher data sizes
  • All machines' performance can be improved via architecture-specific optimizations
  • Example: on X1, using the non-cacheable and unroll(4) pragmas improves strided BW by 20X
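A minimal sketch of the triad kernel as described above (the timing harness and exact probe parameters are not shown); stride = 1 gives the unit-stride case, larger values mimic the strided-access variant.

    subroutine triad(a, b, c, s, n, stride)
      integer, intent(in) :: n, stride
      real(8), intent(inout) :: a(n)
      real(8), intent(in) :: b(n), c(n), s
      integer :: i

      ! Triad: two loads, one store, two flops per visited element
      do i = 1, n, stride
         a(i) = b(i) + s * c(i)
      end do
    end subroutine triad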

12
Analysis usingArchitectural Probe
  • Developed an Architectural Probe that allows stressing the balance points of processor design (PMEO-04)
  • Tunable parameters mimic the behavior of important scientific kernels

What % of memory accesses can be random before performance decreases by half? How much computational intensity (CI) is required to hide the penalty of all-random access?
Gather/scatter is expensive on commodity cache-based systems: Power4 can tolerate only 1.6% random access (1 in 64); Itanium2 is much less sensitive at 25% (1 in 4). A huge amount of computation may be required to hide the overhead of irregular data access: Itanium2 requires a CI of about 9 flops/word, while Power4 requires a CI of almost 75!
Interested in developing application driven
architectural probes for evaluation of emerging
petascale systems
13
Sample Kernel Performance
NPB FT Class B
N-body (Barnes-Hut)
FFT: computationally intensive with data-parallel operations; well suited for vectorization; 17X, 4X faster than Power3/4; fixed-cost interprocessor communication hurts scalability. N-body: requires irregular, unstructured data access, control flow, and communication; poorly suited for vectorization; 2X and 5X slower than Power3/4. Vector architectures are not general-purpose systems.
  • Interested in exploring advanced algorithmic
    optimizations on emerging systems -
    preliminary work described in CUG04

14
Applications studied
  • Applications chosen with potential to run at
    ultrascale
  • CACTUS - Astrophysics; 100,000 lines; grid based
  • Solves Einstein's equations of general relativity
  • PARATEC - Material Science; 50,000 lines; Fourier space/grid
  • Density Functional Theory electronic structures code
  • LBMHD - Plasma Physics; 1,500 lines; grid based
  • Lattice Boltzmann approach for magneto-hydrodynamics
  • GTC - Magnetic Fusion; 5,000 lines; particle based
  • Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
  • MADCAP - Cosmology; 5,000 lines; dense linear algebra
  • Extracts key data from the Cosmic Microwave Background Radiation

15
Astrophysics CACTUS
  • Numerical solution of Einstein's equations from the theory of general relativity
  • Among the most complex in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
  • CACTUS evolves these equations to simulate high
    gravitational fluxes, such as collision of two
    black holes

Visualization of grazing collision of two black
holes
Communication at boundaries; expect high parallel efficiency
  • Evolves PDEs on a regular grid using finite differences (see the toy stencil after this list)
  • Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along the time dimension
  • Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about the Universe
  • What are gravitational waves? Ripples in
    spacetime curvature, caused by matter motion,
    causing distances to change
  • Developed at Max Planck Institute, vectorized by
    John Shalf
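Illustrative only (CACTUS evolves the far more complex coupled ADM equations, not this toy problem): a generic finite-difference update on a regular 3D block, the pattern behind "evolves PDEs on a regular grid", where only ghost zones at block faces require inter-processor communication.

    subroutine stencil_update(unew, u, n1, n2, n3, dt)
      integer, intent(in) :: n1, n2, n3
      real(8), intent(in) :: u(0:n1+1, 0:n2+1, 0:n3+1), dt
      real(8), intent(out) :: unew(n1, n2, n3)
      integer :: i, j, k

      ! 7-point stencil; the inner i loop vectorizes, and the halo layers
      ! (indices 0 and n+1) hold boundary data exchanged with neighbors
      do k = 1, n3
         do j = 1, n2
            do i = 1, n1
               unew(i,j,k) = u(i,j,k) + dt * &
                    ( u(i-1,j,k) + u(i+1,j,k) + u(i,j-1,k) + u(i,j+1,k) &
                    + u(i,j,k-1) + u(i,j,k+1) - 6.0d0 * u(i,j,k) )
            end do
         end do
      end do
    end subroutine stencil_update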

16
CACTUS Performance
Problem Size | P | Power3 (Gflops/P, %peak) | Power4 (Gflops/P, %peak) | Altix (Gflops/P, %peak) | ES (Gflops/P, %peak) | X1 (Gflops/P, %peak)
80x80x80 per processor | 16 | 0.31, 21% | 0.58, 11% | 0.89, 15% | 1.5, 18% | 0.54, 4%
80x80x80 per processor | 64 | 0.22, 14% | 0.50, 10% | 0.70, 12% | 1.4, 17% | 0.43, 3%
80x80x80 per processor | 256 | 0.22, 14% | 0.48, 9% | --- | 1.4, 17% | 0.41, 3%
250x80x80 per processor | 16 | 0.10, 6% | 0.56, 11% | 0.51, 9% | 2.8, 35% | 0.81, 6%
250x80x80 per processor | 64 | 0.08, 6% | --- | 0.42, 7% | 2.7, 34% | 0.72, 6%
250x80x80 per processor | 256 | 0.07, 5% | --- | --- | 2.7, 34% | 0.68, 5%
  • ES achieves fastest performance to date: 45X faster than Power3!
  • Vector performance related to x-dim (vector
    length)
  • Scalar performance better on smaller problem size
    (cache effects)
  • Excellent scaling on ES using fixed data size per
    proc (weak scaling)
  • X1 surprisingly poor (4X slower than ES) - low scalar:vector ratio
  • Note: boundary radiation vectorized on X1 but not on ES, giving the X1 an advantage
  • Unvectorized boundary required 15-20% of runtime on ES (30% on X1)
  • < 5% for the scalar version; unvectorized code can quickly dominate cost

17
Material Science PARATEC
  • PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
  • Density Functional Theory to calculate the structure and electronic properties of new materials
  • DFT calculations are one of the largest consumers of supercomputer cycles in the world

Induced current and charge density in crystallized glycine
  • Uses all-band CG approach to obtain wavefunction
    of electrons
  • 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
  • Part of the calculation is in real space, the rest in Fourier space
  • Uses specialized 3D FFT to transform wavefunction
  • Computationally intensive - generally obtains
    high percentage of peak
  • Developed w/ Louie and Cohen's groups (UCB, LBNL), A. Canning

18
PARATECWavefunction Transpose
(Figure: wavefunction transpose, panels (a)-(f))
  • Transpose from Fourier to real space
  • 3D FFT done via 3 sets of 1D FFTs and 2
    transposes
  • Most communication is in the global transpose (b) to (c); little communication (d) to (e)
  • Many FFTs are done at the same time to avoid latency issues
  • Only non-zero elements communicated/calculated
  • Much faster than vendor 3D-FFT

19
PARATEC Performance
Data Size | P | Power3 (Gflops/P, %peak) | Power4 (Gflops/P, %peak) | Altix (Gflops/P, %peak) | ES (Gflops/P, %peak) | X1 (Gflops/P, %peak)
432 Atom | 32 | 0.95, 63% | 2.0, 39% | 3.7, 62% | 4.7, 60% | 3.0, 24%
432 Atom | 64 | 0.85, 57% | 1.7, 33% | 3.2, 54% | 4.7, 59% | 2.6, 20%
432 Atom | 128 | 0.74, 49% | 1.5, 29% | --- | 4.7, 59% | 1.9, 15%
432 Atom | 256 | 0.57, 38% | 1.1, 21% | --- | 4.2, 52% | ---
432 Atom | 512 | 0.41, 28% | --- | --- | 3.4, 42% | ---
686 Atom | 128 | --- | --- | --- | 4.9, 62% | 3.0, 24%
686 Atom | 256 | --- | --- | --- | 4.6, 57% | 1.3, 10%
  • ES achieves fastest performance to date! Over
    2Tflop/s on 1024 procs
  • X1 is 3.5X slower than ES (although its peak is 50% higher)
  • Non-vectorizable code can be much more expensive on X1 (32:1 vs 8:1 scalar:vector ratio)
  • Lower bisection bandwidth to computation ratio
  • Limited scalability due to increasing cost of
    global transpose and reduced vector length
  • Plan to run larger problem size next ES visit
  • Scalar architectures generally perform well due
    to high computational intensity
  • Power3 8X slower than ES
  • Power4 4X slower - Federation has increased speed
    2X compared with Colony
  • Altix 1.5X slower - high memory and interconnect
    bandwidth, low latency switch

20
PARATEC Scaling ES vs. Power3
  • ES can run the same system about 10 times
    faster than the IBM SP (on any number of
    processors)
  • Main advantage of ES for these types of codes
    is the fast communication network
  • Fast processors require less fine-grain
    parallelism in code to get same performance as
    RISC machines
  • Vector arch allow opportunity to simulate systems
    not possible on scalar platforms

21
Plasma Physics LBMHD
  • LBMHD uses a Lattice Boltzmann method to model
    magneto-hydrodynamics (MHD)
  • Performs 2D simulation of high temperature plasma
  • Evolves from initial conditions, decaying to form current sheets
  • 2D spatial grid is coupled to octagonal streaming
    lattice
  • Block distributed over 2D proc grid

Current density decays of two cross-shaped
structures
  • Main computational components
  • Collision: requires coefficients for the local gridpoint only, no communication
  • Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
  • Interpolation step required between spatial and
    stream lattices
  • Developed by George Vahala's group at the College of William and Mary, ported by Jonathan Carter

22
LBMHD Porting Details
(left) octagonal streaming lattice coupled with
square spatial grid
(right) example of diagonal streaming vector
updating three spatial cells
  • Collision routine rewritten
  • For ES, loop ordering switched so the gridpoint loop (1000s of iterations) is inner rather than the velocity or magnetic-field loops (~10 iterations) - see the sketch after this list
  • X1 compiler made this transformation
    automatically multistreaming outer loop and
    vectorizing (via strip mining) inner loop
  • Temporary arrays padded to reduce bank conflicts
  • Stream routine performs well
  • Array shift operations, block copies, 3rd-degree
    polynomial eval
  • Boundary value exchange
  • MPI_Isend, MPI_Irecv pairs
  • Further work plan to use ES "global memory" to
    remove message copies
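A hedged sketch of the loop-interchange idea described above (array and variable names are hypothetical, not taken from the LBMHD source; the relaxation update is representative only).

    subroutine collision_reordered(f, feq, fnew, omega, nvel, ngrid)
      integer, intent(in) :: nvel, ngrid
      real(8), intent(in) :: f(nvel,ngrid), feq(nvel,ngrid), omega
      real(8), intent(out) :: fnew(nvel,ngrid)
      integer :: i, v

      ! Original ordering (commented out): short velocity loop (~10
      ! iterations) innermost, which limits the vector length.
      ! do i = 1, ngrid
      !    do v = 1, nvel
      !       fnew(v,i) = f(v,i) + omega * (feq(v,i) - f(v,i))
      !    end do
      ! end do

      ! Reordered for ES: long gridpoint loop (1000s of iterations)
      ! innermost, so the compiler vectorizes over gridpoints.
      do v = 1, nvel
         do i = 1, ngrid
            fnew(v,i) = f(v,i) + omega * (feq(v,i) - f(v,i))
         end do
      end do
    end subroutine collision_reordered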

23
LBMHD Performance
Data Size | P | Power3 (Gflops/P, %peak) | Power4 (Gflops/P, %peak) | Altix (Gflops/P, %peak) | ES (Gflops/P, %peak) | X1 (Gflops/P, %peak)
4096x4096 | 16 | 0.11, 7% | 0.28, 5% | 0.60, 10% | 4.6, 58% | 4.3, 34%
4096x4096 | 64 | 0.14, 9% | 0.30, 6% | 0.62, 10% | 4.3, 54% | 4.4, 34%
4096x4096 | 256 | 0.14, 9% | 0.28, 5% | --- | 3.2, 40% | ---
8192x8192 | 64 | 0.11, 7% | 0.27, 5% | 0.65, 11% | 4.6, 58% | 4.5, 35%
8192x8192 | 256 | 0.12, 8% | 0.28, 5% | --- | 4.3, 53% | 2.7, 21%
8192x8192 | 1024 | 0.11, 7% | --- | --- | 3.3, 41% | ---
  • ES achieves highest performance to date: over 3.3 Tflop/s for P=1024
  • X1 comparable in absolute speed up to P=64 (at a lower % of peak)
  • But performs 1.5X slower at P=256 (decreased scalability)
  • CAF improved X1 to slightly exceed ES (up to 4.70 Gflops/P)
  • ES is 44X, 16X, and 7X faster than Power3,
    Power4, and Altix
  • Low CI and high memory requirement (30GB) hurt
    scalar performance
  • Altix best scalar due to high memory bandwidth,
    fast interconnect

24
LBMHD on X1 MPI vs CAF
Data Size | P | X1-MPI (Gflops/P, %peak) | X1-CAF (Gflops/P, %peak)
4096^2 | 16 | 4.32, 34% | 4.55, 36%
4096^2 | 64 | 4.35, 34% | 4.26, 33%
8192^2 | 64 | 4.48, 35% | 4.70, 37%
8192^2 | 256 | 2.70, 21% | 2.91, 23%
  • X1 well-suited for one-sided parallel languages
    (globally addressable mem)
  • MPI hinders this feature and requires scalar tag
    matching
  • CAF allows much simpler coding of boundary
    exchange (array subscripting)
  • feq(ista-1, jsta:jend, 1) = feq(iend, jsta:jend, 1)[iprev, myrankj]
  • MPI requires non-contiguous data copies into a buffer, unpacked at the destination (sketched after this list)
  • Since communication is about 10% of LBMHD, only slight improvements
  • However, for P=64 on 4096^2, performance degrades. Tradeoffs:
  • CAF reduced total message volume 3X (eliminates user and system buffer copies)
  • But CAF used more numerous and smaller-sized messages
  • Interested in research of CAF and UPC performance
    and optimization
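A hedged sketch of what the MPI version of this boundary exchange looks like (routine and variable names are hypothetical, not LBMHD's): the non-contiguous boundary column must be packed into a buffer, sent with an Isend/Irecv pair, and unpacked into the ghost cells, which is exactly the copy traffic the CAF one-liner above avoids.

    subroutine exchange_mpi(feq, ista, iend, jsta, jend, prev_rank, next_rank, comm)
      use mpi
      implicit none
      integer, intent(in) :: ista, iend, jsta, jend, prev_rank, next_rank, comm
      real(8), intent(inout) :: feq(ista-1:iend+1, jsta:jend, 1)
      real(8) :: sendbuf(jend-jsta+1), recvbuf(jend-jsta+1)
      integer :: req(2), stat(MPI_STATUS_SIZE,2), ierr

      sendbuf = feq(iend, jsta:jend, 1)          ! pack the boundary column
      call MPI_Irecv(recvbuf, size(recvbuf), MPI_REAL8, prev_rank, 0, comm, req(1), ierr)
      call MPI_Isend(sendbuf, size(sendbuf), MPI_REAL8, next_rank, 0, comm, req(2), ierr)
      call MPI_Waitall(2, req, stat, ierr)
      feq(ista-1, jsta:jend, 1) = recvbuf        ! unpack into the ghost cells
    end subroutine exchange_mpi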

25
LBMHD Performance
8192 x 8192 Grid 256 processors
8192 x 8192 Grid 64 processors
  • Preliminary time breakdown shown relative to each
    architecture
  • Cray X1 has the highest fraction of time spent in communication (P=64); the CAF version reduced this
  • ES shows best memory bandwidth performance
    (stream)
  • Communication increases at higher scalability (as
    expected)

26
Magnetic Fusion GTC
  • Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
  • Goal of magnetic fusion is a burning plasma power plant producing cleaner energy
  • GTC solves the 3D gyroaveraged gyrokinetic system w/ a particle-in-cell (PIC) approach
  • PIC scales as N instead of N^2: particles interact w/ the electromagnetic field on a grid
  • Allows solving equation of particle motion with
    ODEs (instead of nonlinear PDEs)
  • Main computational tasks:
  • Scatter: deposit particle charge to nearest grid points
  • Solve: Poisson eqn to get potential at each point
  • Gather: calc force based on neighbors' potential
  • Move: particles by solving eqn of motion
  • Shift: particles moved outside local domain

3D visualization of electrostatic potential in
magnetic fusion device
Developed at Princeton Plasma Physics Laboratory,
vectorized by Stephane Ethier
27
GTC Scatter operation
  • Particle charge deposited amongst nearest grid
    points. The particles can be anywhere inside the
    domain
  • Several particles can contribute to same grid
    points, resulting in memory conflicts
    (dependencies) that prevent vectorization
  • Since particles are randomly localized - scatter
    also hinders cache reuse
  • Solution: VLEN copies of the charge deposition array w/ a reduction after the main loop (see the sketch below)
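A hedged sketch of this workaround (names and the single-point deposit are illustrative, not GTC's actual multi-point gyro-averaged deposit): each vector "lane" accumulates into its own private copy of the charge array, removing the memory dependence inside the vectorized loop; the copies are summed afterwards.

    subroutine scatter_charge(charge, grid_index, weight, npart, ngrid, vlen)
      integer, intent(in) :: npart, ngrid, vlen
      integer, intent(in) :: grid_index(npart)
      real(8), intent(in) :: weight(npart)
      real(8), intent(out) :: charge(ngrid)
      real(8) :: charge_tmp(ngrid, vlen)
      integer :: ip, lane, ig

      charge_tmp = 0.0d0

      ! Deposit: consecutive particles write to different lanes, so no two
      ! elements of one vector operation update the same address
      do ip = 1, npart
         lane = mod(ip-1, vlen) + 1
         ig = grid_index(ip)
         charge_tmp(ig, lane) = charge_tmp(ig, lane) + weight(ip)
      end do

      ! Reduction after the main loop
      charge = sum(charge_tmp, dim=2)
    end subroutine scatter_charge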

28
GTC Porting Details
  • Large vector memory footprint required to eliminate dependencies: P=64 uses 42 GB on ES compared w/ 5 GB on Power3
  • Relatively small memory per processor (ES 2GB, X1 4GB) limits problem size of runs
  • GTC has a second level of parallelism via OpenMP (hybrid programming). However, on ES/X1 the memory footprint increased an additional 8X, to about 320GB
  • Non-vectorized Shift routine accounted for 54% of runtime on X1, 11% on ES
  • Due to high penalty of serialized sections on X1
    when multistreaming
  • The shift routine vectorized on X1, but NOT on ES
    - X1 has advantage
  • Limited time at ES prevented vectorization of
    shift routine
  • Now shift accounts for only 4% of X1 runtime

29
GTC Performance
Number Particles | P | Power3 (Gflops/P, %peak) | Power4 (Gflops/P, %peak) | Altix (Gflops/P, %peak) | ES (Gflops/P, %peak) | X1 (Gflops/P, %peak)
10/cell (20M) | 32 | 0.13, 9% | 0.29, 5% | 0.29, 5% | 0.96, 12% | 1.00, 8%
10/cell (20M) | 64 | 0.13, 9% | 0.32, 5% | 0.26, 4% | 0.84, 10% | 0.80, 6%
100/cell (200M) | 32 | 0.13, 9% | 0.29, 5% | 0.33, 6% | 1.34, 17% | 1.50, 12%
100/cell (200M) | 64 | 0.13, 9% | 0.29, 5% | 0.31, 5% | 1.25, 16% | 1.36, 11%
100/cell (200M) | 1024 | 0.06, 4% | --- | --- | --- | ---
  • Vectors achieve fastest per-processor performance
    of any tested architecture!
  • P=64 on X1 is 35% faster than P=1024 on Power3!
  • X1 is 9% faster than ES (but has an additional code section vectorized)
  • Advantage of ES for PIC codes may reside in higher statistical resolution simulations; the greater speed allows more particles per grid cell
  • Larger tests could not be performed at ES due to parallelization/vectorization efficiency hurdles
  • Low Altix performance is under investigation (random number generation)

30
GTC Performance
  • With increasing processors, and fixed problem
    size, the vector length decreases
  • Limited scaling due to decreased vector
    efficiency rather than communications overhead.
  • MPI communication by itself has near perfect
    scaling.

31
Cosmology MADCAP
  • Microwave Anisotropy Dataset Computational
    Analysis Package
  • Optimal general algorithm for extracting key
    cosmological data from Cosmic Microwave
    Background Radiation (CMB)
  • Anisotropies in the CMB contain the early history of the Universe

Temperature anisotropies in CMB measured by Boomerang
  • Calculates maximum likelihood two-point angular
    correlation function
  • Recasts problem as dense linear algebra (ScaLAPACK); steps include mat-mat, matrix-inv, mat-vec, Cholesky decomp, data redistribution
  • Porting: ScaLAPACK plus a rewrite of the Legendre polynomial recursion, such that large batches are computed in the inner loop (see the sketch after this list)
  • Developed at NERSC by Julian Borrill
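A hedged sketch of the batched-recursion idea (not MADCAP's actual routine; names are hypothetical): the standard Legendre recursion P_l(x) = ((2l-1) x P_{l-1}(x) - (l-1) P_{l-2}(x)) / l is evaluated for a large batch of angles at once, so the long batch loop sits innermost and vectorizes.

    subroutine legendre_batch(p, x, nbatch, lmax)
      integer, intent(in) :: nbatch, lmax   ! assumes lmax >= 1
      real(8), intent(in) :: x(nbatch)      ! cos(theta) for a batch of pixel pairs
      real(8), intent(out) :: p(nbatch, 0:lmax)
      integer :: l, i

      p(:,0) = 1.0d0
      p(:,1) = x
      do l = 2, lmax                        ! short recursion loop outer
         do i = 1, nbatch                   ! long, vectorizable loop inner
            p(i,l) = ((2*l-1) * x(i) * p(i,l-1) - (l-1) * p(i,l-2)) / dble(l)
         end do
      end do
    end subroutine legendre_batch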

32
MADCAP Performance
P | Power3 (Gflops/P, %peak) | Power4 (Gflops/P, %peak) | ES (Gflops/P, %peak) | X1 (Gflops/P, %peak)
16 | 0.62, 41% | 1.5, 29% | 4.1, 32% | 2.2, 27%
64 | 0.54, 36% | 0.81, 16% | 1.9, 23% | 2.0, 16%
  • Only partially ported due to the code's requirement of a global file system
  • All systems sustain a relatively low % of peak considering MADCAP's BLAS3 ops
  • Complex tradeoffs: architectural paradigm, interconnect technology, and I/O filesystem
  • Detailed analysis presented at HiPC 2004
  • Further work is required to reduce I/O, remove
    system calls, and remove global file system
    requirements
  • Plan to implement new MADCAP version for next ES
    visit

33
Overview
Code | Pwr3 %peak (P=64) | Pwr4 %peak (P=64) | Altix %peak (P=64) | ES %peak (P=64) | X1 %peak (P=64) | ES speedup vs. Pwr3 (max P avail) | vs. Pwr4 | vs. Altix | vs. X1
LBMHD | 7% | 5% | 11% | 58% | 37% | 30.6 | 15.3 | 7.2 | 1.5
CACTUS | 6% | 11% | 7% | 34% | 6% | 45.0 | 5.1 | 6.4 | 4.0
GTC | 9% | 6% | 5% | 16% | 11% | 9.4 | 4.3 | 4.1 | 0.9
PARATEC | 57% | 33% | 54% | 58% | 20% | 8.2 | 3.9 | 1.4 | 3.9
MADCAP | 61% | 40% | --- | 53% | 19% | 3.4 | 2.3 | --- | 0.9
  • Tremendous potential of vector architectures: 4 codes running faster than ever before
  • Vector systems allow resolution not possible with scalar arch (regardless of procs)
  • ES shows much higher sustained and often higher
    raw performance compared with X1
  • Limited X1 specific optimization - optimal
    programming approach still unclear (CAF, etc)
  • Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
  • Vectors potentially at odds w/ emerging
    techniques (sparse, irregular, multi-physics)
  • GTC: example of code at odds with data-parallelism
  • Much more difficult to evaluate codes poorly
    suited for vectorization
  • Return to ES in October - evaluate new codes and
    higher scalability studies
  • Potential opportunity of large-scale scientific
    runs (not just benchmarking)

34
Future directions
  • Leverage evaluation suite, (unclassified)
    application expertise, emerging arch research
  • Develop application driven architectural probes
    for evaluation of emerging petascale systems
  • Research the enhancement of commodity scalar
    processors with vector features for increased
    scientific productivity (including investigation
    into VIVA2 with IBM)
  • Software-controlled scratchpad, programmable
    prefetch/preload
  • Investigate algorithmic optimizations for leading vector systems and examine an architecture-independent algorithmic analysis
  • Fundamental resource requirements of key algorithms (FPU, locality, bandwidth, latency-tolerance)
  • Explore new application areas on leading parallel
    systems
  • Evaluate codes traditionally at odds with vector
    architectures
  • Study the potential of implicit parallel
    programming languages UPC and CAF
  • Especially codes difficult to express via MPI
    (portability requirement tradeoffs)
  • Evaluate soon-to-be-released supercomputing
    systems and identify classes of applications best
    suited for their unique architectural balance
  • Cray X1e, XD1 Red-Storm, IBM Power5,
    Bluegene/, Hitachi SR11000, NEC SX8,

35
Publications
  • L. Oliker, A. Canning, J. Carter, J. Shalf, and S. Ethier. Scientific Computations on Modern Parallel Vector Systems, Supercomputing 2004, to appear. (Nominated for Best Paper award)
  • J. Carter, J. Borrill, and L. Oliker. Performance Characteristics of a Cosmology Package on Leading HPC Architectures, International Conference on High Performance Computing, HiPC 2004, to appear. (Nominated for Best Paper award)
  • L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, J. Djomehri, Performance Evaluation of the SX-6 Vector Architecture, Journal of Concurrency and Computation, 2004, to appear.
  • L. Oliker and Rupak Biswas, Performance
    Modeling and Evaluation of Ultra-Scale Systems,
    Minisymposium organized at SIAM Conference on
    Parallel Processing for Scientific Computing
    SIAMPP 2004.
  • L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, J. Djomehri, A Performance Evaluation of the Cray X1 for Scientific Applications, International Meeting on High Performance Computing for Computational Science, VECPAR 2004.
  • H. Shan, E. Strohmaier, L. Oliker, Optimizing
    Performance of Superscalar Codes For a Single
    Cray X1 MSP Processor, 46th Cray User Group
    Conference, CUG 2004.
  • G. Griem, L. Oliker, J. Shalf, K. Yelick,
    Identifying Performance Bottlenecks on Modern
    Microarchitectures using an Adaptable Probe,
    Performance Modeling, Evaluation, Optimization of
    Parallel Distributed Systems PMEO 2004
  • L. Oliker, J. Carter, J. Shalf, D. Skinner, S.
    Ethier, R. Biswas, J. Djomehri, R. Van der
    Wijngaart. Evaluation of Cache-based Superscalar
    and Cacheless Vector Architectures for Scientific
    Computations, Supercomputing 2003.