1
Suitability of Alternative Architectures for
Scientific Computing in 5-10 Years
  • LDRD 2002 Strategic-Computational Review
  • July 31, 2001

PIs: Xiaoye Li, Bob Lucas, Lenny Oliker,
Katherine Yelick. Others: Brian Gaeke, Parry
Husbands, Hyun Jin Kim, Hyun Jin Moon
2
Outline
  • Project Goals
  • FY01 progress report
  • Benchmark kernels definition
  • Performance on IRAM, comparisons with
    conventional machines
  • Management plan
  • Future funding opportunities

3
Motivation and Goal
  • NERSC-3 (now) and NERSC-4 (in 2-3 years) consist
    of large clusters of commodity SMPs. What about
    5-10 years from now?
  • Future architecture technologies
  • PIM (e.g. IRAM, DIVA, Blue Gene)
  • SIMD/Vector/Stream (e.g. IRAM, Imagine,
    Playstation)
  • Low power, narrow data types (e.g., MMX, IRAM,
    Imagine)
  • Feasibility of building large-scale systems
  • What will the commodity building blocks (nodes
    and networks) be?
  • Driven by NERSC and DOE scientific applications
    codes.
  • Where do the needs diverge from big market
    applications?
  • Influence future architectures

4
Computational Kernels and Applications
  • Kernels
  • Designed to stress memory systems
  • Some taken from the Data Intensive Systems
    Stressmarks
  • Unit and constant stride memory
  • Transitive-closure
  • FFT
  • Dense, sparse linear algebra (BLAS 1 and 2)
  • Indirect addressing
  • Pointer-jumping, Neighborhood (Image), sparse CG
  • NSA Giga-Updates Per Second (GUPS)
  • Frequent branching as well as irregular memory
    access
  • Unstructured mesh adaptation
  • Examples of NERSC/DOE applications that may
    benefit
  • Omega3P, accelerator design (SLAC AMR and sparse
    linear algebra)
  • Paratec, material science package (LBNL FFT and
    dense linear algebra)
  • Camille, 3D atmospheric circulation model
    (preconditioned CG)
  • HyperClaw, simulate gas dynamics in AMR framework
    (LBNL)
  • NWChem, quantum chemistry (PNNL global arrays
    and linear algebra)
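As a concrete example, the GUPS stressmark listed above is essentially a random read-modify-write loop. A minimal C sketch follows; the table size, update count, and LCG constants are illustrative stand-ins, not the benchmark's official parameters:

```c
#include <stdint.h>

#define TABLE_SIZE (1u << 16)   /* illustrative; the real benchmark uses far larger tables */
#define N_UPDATES  (1u << 20)

uint64_t table[TABLE_SIZE];

/* GUPS-style kernel: XOR-update pseudo-random table locations. The work is
 * almost entirely indexed loads and stores, so it stresses address
 * generation and memory latency rather than arithmetic. */
void gups_updates(void) {
    uint64_t ran = 1;
    for (uint32_t i = 0; i < N_UPDATES; i++) {
        /* simple LCG standing in for the benchmark's random stream */
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        table[ran % TABLE_SIZE] ^= ran;   /* gather, XOR, scatter */
    }
}
```

On a vector machine the whole loop turns into indexed (gather/scatter) memory operations, which is exactly where the address-generation units become the bottleneck.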

5
VIRAM Overview (UCB)
  • MIPS core (200 MHz)
  • Single-issue, 8 KB instruction and data caches
  • Vector unit (200 MHz)
  • 32 64b elements per register
  • 256b datapaths, (16b, 32b, 64b ops)
  • 4 address generation units
  • Main memory system
  • 12 MB of on-chip DRAM in 8 banks
  • 12.8 GBytes/s peak bandwidth
  • Typical power consumption 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 GOPS (64/32/16-bit) w/o multiply-add
  • 1.6 Gflops (single-precision)
  • Same process technology as Blue Gene
  • But a single chip, designed for multimedia

6
Status of IRAM Benchmarking Infrastructure
  • Improved the VIRAM simulator.
  • Refining the performance model for
    double-precision FP performance.
  • Making the backend modular to allow for other
    microarchitectures.
  • Packaging the benchmark codes.
  • Build and test scripts plus input data (small and
    large data sets).
  • Added documentation.
  • Prepare for final chip benchmarking
  • Tape-out scheduled by UCB for 9/01.

7
Media Benchmarks
  • FFT uses in-register permutations, generalized
    reduction
  • All others written in C with the Cray
    vectorizing compiler

8
Power Advantage of PIM/Vectors
  • 100x100 matrix vector multiplication (column
    layout)
  • Results from the LAPACK manual (vendor optimized
    assembly)
  • VIRAM performance improves with larger matrices!
  • VIRAM power includes on-chip main memory!

9
Benchmarks for Scientific Problems
  • Transitive-closure (small/large data sets)
  • Pointer-jumping (small/large working sets)
  • Computing a histogram
  • Used for image processing of a 16-bit greyscale
    image, 1536 x 1536
  • 2 algorithms: 64-element sorting kernel,
    privatization
  • Needed for sorting
  • Neighborhood image processing (small/large
    images)
  • NSA Giga-Updates Per Second (GUPS, 16-bit and
    64-bit)
  • Sparse matrix-vector product
  • Order 10000, 177820 nonzeros
  • 2D unstructured mesh adaptation
  • Initial grid 4802 triangles, final grid 24010
    triangles
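The privatization variant of the histogram mentioned above can be sketched in plain C. The number of private copies below stands in for the machine's vector lanes and is an assumption for illustration:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define BINS   65536   /* one bin per 16-bit greyscale value */
#define COPIES 4       /* hypothetical: one private histogram per vector lane */

/* Privatized histogram: each lane accumulates into its own copy, so
 * concurrent increments to the same bin never collide; the copies are
 * reduced into the final histogram at the end. */
void histogram_privatized(const uint16_t *pix, size_t n, uint32_t *hist) {
    static uint32_t priv[COPIES][BINS];
    memset(priv, 0, sizeof priv);
    for (size_t i = 0; i < n; i++)
        priv[i % COPIES][pix[i]]++;        /* lane i % COPIES owns this update */
    for (size_t b = 0; b < BINS; b++) {
        uint32_t sum = 0;
        for (int c = 0; c < COPIES; c++)
            sum += priv[c][b];             /* merge the private copies */
        hist[b] = sum;
    }
}
```

The sorting-based alternative instead orders pixel values so that equal bins become contiguous runs, trading the extra copies for a sort.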

10
Benchmark Performance on IRAM Simulator
  • IRAM (200 MHz, 2 W) versus Mobile Pentium III
    (500 MHz, 4 W)

11
Conclusions and VIRAM Future Directions
  • VIRAM outperforms the Pentium III on scientific
    problems
  • With lower power and clock rate than the Mobile
    Pentium
  • Vectorization techniques developed for the Cray
    PVPs are applicable.
  • PIM technology provides low power, low cost
    memory system.
  • Similar combination used in Sony Playstation.
  • Small ISA changes can have large impact
  • Limited in-register permutations sped up 1K FFT
    by 5x.
  • Memory system can still be a bottleneck
  • Indexed/variable stride costly, due to address
    generation.
  • Future work
  • Ongoing investigations into impact of lanes,
    subbanks
  • Technical paper in preparation; expected
    completion 09/01
  • Run benchmark on real VIRAM chips
  • Examine multiprocessor VIRAM configurations

12
Project Goals for FY02 and Beyond
  • Use established data-intensive scientific
    benchmarks with other emerging architectures
  • IMAGINE (Stanford Univ.)
  • Designed for graphics and image/signal processing
  • Peak 20 GFLOPS (32-bit FP)
  • Key features: vector processing, VLIW, a
    streaming memory system. (Not a PIM-based
    design.)
  • Preliminary discussions with Bill Dally.
  • DIVA (DARPA-sponsored USC/ISI)
  • Based on PIM smart memory design, but for
    multiprocessors
  • Move computation to data
  • Designed for irregular data structures and
    dynamic databases.
  • Discussions with Mary Hall about benchmark
    comparisons

13
Management Plan
  • Roles of different groups and PIs
  • Senior researchers working on particular class of
    benchmarks
  • Parry: sorting and histograms
  • Sherry: sparse matrices
  • Lenny: unstructured mesh adaptation
  • Brian: simulation
  • Jin and Hyun: specific benchmarks
  • Plan to hire additional postdoc for next year
    (focus on Imagine)
  • Undergrad model used for targeted benchmark
    efforts
  • Plan for using computational resources at NERSC
  • Few resources used, except for comparisons

14
Future Funding Prospects
  • FY2003 and beyond
  • DARPA-initiated DIS program
  • Related projects are continuing under Polymorphic
    Computing
  • New BAA coming in High Productivity Systems
  • Interest from other DOE labs (LANL) in general
    problem
  • General model
  • Most architectural research projects need
    benchmarking
  • Work has higher quality if done by people who
    understand apps.
  • Expertise for hardware projects is different:
    system-level design, circuit design, etc.
  • Interest from both the IRAM and Imagine groups
    shows the level of demand

15
Long Term Impact
  • Potential impact on Computer Science
  • Promote research of new architectures and
    micro-architectures
  • Understand future architectures
  • Preparation for procurements
  • Provide visibility of NERSC in core CS research
    areas
  • Correlate applications: DOE vs. large-market
    problems
  • Influence future machines through research
    collaborations

16
The End
17
Integer Benchmarks
  • Strided access important, e.g., RGB
  • Narrow types limited by address generation
  • Outer loop vectorization and unrolling used
  • helps avoid short vectors
  • spilling can be a problem
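The outer-loop vectorization bullet above can be sketched in plain C. The function name and per-channel gains are made up for illustration: with only 3 channels, vectorizing the channel loop would give vectors of length 3, so the loops are interchanged to put the long pixel loop innermost:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of outer-loop vectorization for an RGB-style kernel. The short
 * channel loop is hoisted outside so the long pixel loop (full vector
 * length) is what the vectorizing compiler sees innermost; the price is
 * the stride-3 memory access visible below. */
void scale_channels(const uint8_t *in, uint8_t *out, size_t npix) {
    static const uint8_t gain[3] = {2, 1, 3};     /* hypothetical per-channel gains */
    for (int c = 0; c < 3; c++)                   /* short loop moved outside */
        for (size_t i = 0; i < npix; i++)         /* long loop vectorizes */
            out[3 * i + c] = (uint8_t)(in[3 * i + c] * gain[c]); /* stride-3 access */
}
```

Unrolling across channels inside the pixel loop is the other option the slide alludes to, at the cost of more vector registers and possible spilling.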

18
Status of benchmarking software release
  • Optimized: vector histogram code, GUPS inner
    loop
  • Unoptimized: pointer jumping w/update, vector
    histogram code generator, GUPS C codes,
    conjugate gradient (matrix), neighborhood,
    pointer jumping, transitive, field
  • GUPS documentation
  • Standard random number generator
  • Test cases (small and large working sets)
  • Build and test scripts (Makefiles, timing,
    analysis, ...)
  • Future work
  • Write more documentation, add better test cases
    as we find them
  • Incorporate media benchmarks, AMR code, library
    of frequently-used compiler flags/pragmas

19
Status of benchmarking work
  • Two performance models: a simulator (vsim-p)
    and a trace analyzer (vsimII)
  • Recent work on vsim-p
  • Refining the performance model for
    double-precision FP performance.
  • Recent work on vsimII
  • Making the backend modular
  • Goal: model different architectures with the
    same ISA.
  • Fixing bugs in the memory model of the VIRAM-1
    backend.
  • Better comments in code for better
    maintainability.
  • Completing a new backend for a new decoupled
    cluster architecture.

20
Comparison with Mobile Pentium
  • GUPS: VIRAM gets 6x more GUPS
  • Transitive, Pointer, Update: VIRAM 30-50x
    faster than P-III
  • Execution time for VIRAM rises much more slowly
    with data size than for P-III
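For context, the pointer-jumping kernel compared above looks serial but vectorizes across list nodes: each pass replaces every node's successor with its successor's successor, halving chain lengths. A minimal C sketch of list ranking by pointer jumping (the array-of-successors representation is an assumption for illustration):

```c
#include <string.h>

/* One pointer-jumping pass, double-buffered so every read sees the
 * previous pass's values. dist accumulates distance-to-tail. */
static void jump_pass(const int *next, const int *dist,
                      int *next2, int *dist2, int n) {
    for (int i = 0; i < n; i++) {
        dist2[i] = dist[i] + dist[next[i]];   /* gather on next[] */
        next2[i] = next[next[i]];             /* the "jump" */
    }
}

/* List ranking: distance of every node to the tail of its list.
 * Convention: the tail points to itself with dist 0, so it is a fixed
 * point; all other nodes start with dist 1. */
void list_rank(int *next, int *dist, int n) {
    int next2[n], dist2[n];                   /* C99 VLAs; fine for a sketch */
    for (int len = 1; len < n; len *= 2) {    /* O(log n) passes suffice */
        jump_pass(next, dist, next2, dist2, n);
        memcpy(next, next2, n * sizeof *next);
        memcpy(dist, dist2, n * sizeof *dist);
    }
}
```

Every iteration of the inner loops is an indexed load, which is why this kernel, like GUPS, is bounded by address generation rather than arithmetic.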
21
Sparse CG
  • Solve Ax = b; sparse matrix-vector
    multiplication dominates.
  • Traditional CRS format requires
  • Indexed load/store for X/Y vectors
  • Variable vector length, usually short
  • Other formats for better vectorization
  • CRS with narrow band (e.g., RCM ordering)
  • Smaller strides for X vector
  • Segmented sum (modified from the old code
    developed for the Cray PVP)
  • Long vector length, of same size
  • Unit stride
  • ELL format: make all rows the same length by
    padding with zeros
  • Long vector length, of same size
  • Extra flops
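For reference, the baseline CRS kernel the bullets above describe can be sketched in plain C; the indexed load from x is the indirect access that forces costly address generation on the vector unit, and the per-row vector length is whatever the row's nonzero count happens to be:

```c
/* y = A*x with A in compressed row storage (CRS): val holds nonzeros row
 * by row, colind their column indices, rowptr the start of each row. */
void spmv_crs(int n, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[colind[j]];   /* gather from x; short, variable vectors */
        y[i] = sum;
    }
}
```

The banded-CRS, segmented-sum, and ELL variants all restructure this loop nest to replace the short, variable-length inner loop with long unit-stride vectors.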

22
SMVM Performance
  • DIS matrix: N = 10000, M = 177820 (~17 nonzeros
    per row)
  • IRAM results (MFLOPS)
  • Mobile PIII (500 MHz): CRS 35 MFLOPS

23
2D Unstructured Mesh Adaptation
  • Powerful tool for efficiently solving
    computational problems with evolving physical
    features (shocks, vortices, shear layers, crack
    propagation)
  • Complicated logic and data structures
  • Difficult to achieve high efficiency
  • Irregular data access patterns (pointer chasing)
  • Many conditionals / integer intensive
  • Adaptation is a tool for making the numerical
    solution cost-effective
  • Three types of element subdivision

24
Vectorization Strategy and Performance Results
  • Color elements based on vertices (not edges)
  • Guarantees no conflicts during vector operations
  • Vectorize across each subdivision type (1:2,
    1:3, 1:4), one color at a time
  • Difficult: many conditionals, low flops,
    irregular data access, dependencies
  • Initial grid 4802 triangles, Final grid 24010
    triangles
  • Preliminary results demonstrate VIRAM 4.5x faster
    than Mobile Pentium III 500
  • Higher code complexity (requires graph coloring
    and reordering)
