Analysis and Performance Results of a Molecular Modeling Application on Merrimac

1
Analysis and Performance Results of a Molecular
Modeling Application on Merrimac
  • Mattan Erez, Jung Ho Ahn, Ankit Garg, William J.
    Dally, Eric Darve (Stanford Univ.)
  • Presented by Jiahua He

2
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

3
Parallel Architectures
  • Flynn's taxonomy
  • SISD (sequential machine), SIMD, MIMD, MISD (no
    commercial system)
  • SIMD
  • Processor-array machine
  • Single processor vector machine
  • MIMD
  • PVP, SMP, DSM, MPP, Cluster

4
Processor-Array Machine
  • Control processor issues instructions
  • All processors in the processor array execute the
    instructions in lock-step
  • Distributed memory
  • Need permutation if data not aligned

5
Vector Machine
  • A processor can do element-wise operations on
    entire vectors with a single instruction
  • Dominated the high-performance computing market
    for about 15 years
  • Overtaken by MPPs in the 90s
  • Re-emerged in recent years (Earth Simulator and
    Cray X1)

6
MPP and Cluster
  • Distributed memory
  • Each processor/node has its own private memory
  • Nodes may be SMPs
  • MIMD
  • Nodes execute different instructions
    asynchronously
  • Nodes communicate and synchronize by
    interconnection network

7
Earth Simulator
  • Vector machine re-emerges
  • Rmax of 36 TFLOPS > the combined Rmax of the rest
    of the Top 10
  • Vector machines focused on powerful processors
  • MPP or Cluster focused on large-scale
    clustering
  • Trend: merge the two approaches

8
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

9
Modern VLSI Technology
  • Arithmetic is cheap
  • 100s of GFLOPS/chip today
  • TFLOPS in 2010
  • Bandwidth is expensive
  • General purpose processor architectures have not
    adapted to this change

10
Stream Processor
  • One control unit and 100s of FPUs
  • 90nm process: a 64-bit FPU costs ~0.5mm² and ~50pJ
    per operation
  • Deep register hierarchy with high local bandwidth
  • Match bandwidth demands and tech. limits
  • Stream: a sequence of data objects
  • Expose large amounts of data parallelism
  • Keep 100s of FPUs per processor busy
  • Hide long latencies of memory operations

11
Stream Processor (cont'd)
  • Expose multiple levels of locality
  • Short term producer-consumer locality (LRF)
  • Long term producer-consumer locality (SRF)
  • Cannot be exploited by caches: no reuse, no
    spatial locality
  • Scalable
  • 128 GFLOPS processor
  • 16-node, 2 TFLOPS single-board workstation
  • 16,384-node, 2 PFLOPS supercomputer in 16
    cabinets

12
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

13
Merrimac Processor
  • Scalar core (1)
  • Runs control code and issues stream
    instructions
  • Arithmetic clusters (16)
  • 64-bit multiply-accumulate (MADD) FPUs (4)
  • Execute the same VLIW instruction
  • Local register file (LRF) per FPU (192 words)
  • Short term producer-consumer locality in a kernel
  • Stream register file (SRF) per cluster (8K words)
  • Long term producer-consumer locality across
    kernels
  • Staging area for memory data transfer to hide
    latencies

14
Architecture of Merrimac
15
Stream Programming Model
  • Cast the computation as a collection of streams
    passing through a series of computational
    kernels.
  • Data parallelism
  • Across stream elements
  • Task parallelism
  • Across kernels
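  • A minimal Python sketch of this model (kernel and stream names are
    illustrative, not Merrimac's actual programming interface): each
    kernel is mapped over the elements of its input stream (data
    parallelism), and kernels are chained into a pipeline (task
    parallelism).

    # Hypothetical kernels; each runs once per stream element.
    def scale(element):
        return element * 2.0

    def shift(element):
        return element + 1.0

    def run_pipeline(input_stream):
        mid_stream = [scale(e) for e in input_stream]   # data parallelism: per element
        return [shift(e) for e in mid_stream]           # task parallelism: kernel chain

    print(run_pipeline([1.0, 2.0, 3.0]))   # -> [3.0, 5.0, 7.0]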

16
Memory System
  • A stream memory instruction transfers an entire
    stream
  • Address generator (2)
  • 8 single-word addresses every cycle
  • Strided or gather/scatter access patterns
  • Cache (128K words, 64GB/s)
  • Directly interface with external DRAM and network
  • External DRAM (2GB, 38.4GB/s)
  • Single-word remote memory access
  • Scatter-add operation
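  • The scatter-add above can be pictured with a short NumPy sketch (a
    software analogue of the hardware operation, not Merrimac code):
    values destined for the same address accumulate instead of
    overwriting each other.

    import numpy as np

    # Software analogue of scatter-add: repeated indices accumulate.
    forces   = np.zeros(4)
    indices  = np.array([0, 2, 2, 3])            # target addresses (index 2 repeats)
    partials = np.array([1.0, 0.5, 0.5, 2.0])    # values to add
    np.add.at(forces, indices, partials)         # forces -> [1.0, 0.0, 1.0, 2.0]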

17
Interconnection Network (Fat Tree)
18
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

19
Molecular Dynamics
  • Explore kinetic and thermodynamic properties of
    molecular system by simulating atomic models
  • water ↔ water molecule interactions
  • protein ↔ water molecule interactions
  • GROMACS: the fastest MD code available
  • Cut-off distance approximation
  • Neighbor list (neighbors within rc)
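  • A minimal sketch of the cut-off approximation and neighbor list
    (a naive O(N²) construction for clarity; GROMACS itself uses much
    faster cell-based methods).

    import numpy as np

    def build_neighbor_list(positions, rc):
        """For each molecule i, list the j > i that lie within the cut-off rc."""
        n = len(positions)
        neighbors = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if np.linalg.norm(positions[i] - positions[j]) < rc:
                    neighbors[i].append(j)
        return neighbors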

20
StreamMD
  • Single kernel: non-bonded interactions between all
    atom pairs of a molecule and one of its neighbors
  • Pseudo code (a runnable sketch follows below)
  • c_positions = gather(positions, i_central)
  • n_positions = gather(positions, i_neighbor)
  • partial_forces = compute_force(c_positions,
    n_positions)
  • forces = scatter_add(partial_forces, i_forces)
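  • A runnable rendering of the pseudo code above: NumPy fancy indexing
    stands in for gather, np.add.at for scatter_add, and compute_force
    is a placeholder pair force, not GROMACS's actual water force field.

    import numpy as np

    def compute_force(c_pos, n_pos):
        # Placeholder pair interaction: inverse-square force on the central molecule.
        d = c_pos - n_pos
        r2 = np.sum(d * d, axis=1, keepdims=True)
        return d / (r2 * np.sqrt(r2))

    def streammd_step(positions, i_central, i_neighbor, i_forces):
        c_positions = positions[i_central]            # gather(positions, i_central)
        n_positions = positions[i_neighbor]           # gather(positions, i_neighbor)
        partial_forces = compute_force(c_positions, n_positions)
        forces = np.zeros_like(positions)
        np.add.at(forces, i_forces, partial_forces)   # scatter_add(partial_forces, i_forces)
        return forces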

21
Latency Tolerance
  • Pipeline the requests
  • To amortize long initial latency
  • By issuing memory ops on long streams
  • Hide memory ops with computations
  • Concurrently executing memory ops and kernel
    computations
  • Strip-mining
  • Large data set → smaller strips
  • Outer loop (done manually; sketched below)
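  • A sketch of strip-mining (the manual outer loop above): the data set
    is walked in SRF-sized strips, so on Merrimac the memory transfer
    for the next strip can overlap the kernel running on the current one.

    def strip_mined(kernel, data, strip_size):
        """Manual outer loop over SRF-sized strips of a large data set."""
        results = []
        for start in range(0, len(data), strip_size):
            strip = data[start:start + strip_size]   # memory op: stage one strip
            results.extend(kernel(strip))            # kernel: overlaps the next strip's load
        return results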

22
Parallelism
  • 4 variants to exploit parallelism
  • Also implemented on Pentium 4 for comparison

23
Expanded Variant
  • Simplest version
  • Fully expand the interaction list
  • For each cluster per iteration
  • Read 2 interacting molecules
  • Produce 2 partial forces

24
Fixed Variant
  • Fixed-length neighbor list of length L
  • For each cluster
  • Read a central molecule once every L iterations
  • Read a neighbor molecule each iteration
  • Partial forces of central molecule are reduced in
    cluster
  • Repeat central molecule
    in i_central
  • Add dummy_neighbor
    in i_neighbor if needed
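  • One way the index streams for this variant could be built (an
    illustrative sketch, not the paper's code): pad each neighbor list
    to a multiple of L with dummy_neighbor and emit the central index
    once per block of L neighbors.

    def build_fixed_lists(neighbors, L, dummy_neighbor):
        """Fixed variant: neighbor lists padded to multiples of L."""
        i_central, i_neighbor = [], []
        for center, nbrs in enumerate(neighbors):
            padded = nbrs + [dummy_neighbor] * (-len(nbrs) % L)    # pad up to a multiple of L
            i_neighbor.extend(padded)
            i_central.extend([center] * (len(padded) // L))        # central read once every L iterations
        return i_central, i_neighbor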

25
Variable Variant
  • Variable-length neighbor list
  • Process inputs and produce outputs at a different
    rate for each cluster
  • Merrimac's inter-cluster communication
  • Conditional streams mechanism
  • Indexable SRF
  • Instructions to read new central position and
    write partial forces are issued on every
    iteration but with a condition
  • Slight overhead of unexecuted instructions

26
Duplicated Variant
  • Fixed-length neighbor list
  • Duplicate all interaction calculations
  • Reduce complete force for central molecule within
    cluster
  • No partial force for neighbor molecule is written
    out
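  • A sketch of the duplicated variant's data flow (illustrative names;
    pair_force is a placeholder): each central molecule recomputes every
    interaction with its full neighbor list, so its complete force is
    reduced locally and nothing is scattered back for the neighbors.

    import numpy as np

    def duplicated_forces(positions, full_neighbors, pair_force):
        """Every pair is computed twice, once from each molecule's side."""
        forces = np.zeros_like(positions)
        for i, nbrs in enumerate(full_neighbors):
            for j in nbrs:
                forces[i] += pair_force(positions[i], positions[j])
        return forces   # no partial forces written out for neighbor molecules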

27
Locality
  • Only short-term producer-consumer locality within
    a single kernel
  • Computing partial forces
  • Internal reduction of forces within a cluster
  • Computation/bandwidth trade-off
  • Extra computation for interactions with dummy
    molecules (fixed variant)
  • Extreme case: the duplicated variant
  • Need more sophisticated schemes (discussed later)

28
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

29
Experiment Setup
  • Single-node experiments
  • 900 water-molecule system
  • Cycle-accurate simulator of Merrimac
  • 4 variants of StreamMD
  • Pentium 4 version
  • Latest version of GROMACS
  • Fully hand optimized
  • Single precision SSE

30
Latency Tolerance
  • Snippet of the execution of duplicated variant
  • Left column
  • Kernel computations
  • Right column
  • Memory operations
  • Perfect overlap of memory and computation

31
Locality
  • Arithmetic intensities
  • fixed and variable depend on data set
  • Small difference → the compiler efficiently
    utilizes the register hierarchy
  • Reference percentages
  • Nearly all to LRFs
  • Small difference → the SRF is used just as a
    staging area for memory

32
Performance
  • variable outperforms expanded by 84%, fixed
    by 26%, duplicated by 119%, and Pentium 4 by
    a factor of 13.2
  • 38.8 GFLOPS is 50% of the GFLOPS of the optimal
    solution

33
Automatic Optimizations
  • Communication scheduling
  • SRF decouples memory from computation
  • Loop unrolling and software pipelining
  • Improve execution rate by 83%
  • Stream scheduling
  • SRF is software managed
  • Capture long-term producer-consumer locality by
    intelligent eviction

34
Computation/bandwidth Trade-off
  • Blocking technique
  • Group molecules into cubic clusters of size r³
  • Pave the cut-off sphere (radius rc) with cubic
    clusters
  • Memory bandwidth requirement scales as O(r⁻³)
  • Extra computation for pairs between rc and
    rc + 2√3·r
  • Minimum occurs at about 3 molecules per cluster
    (1.43)
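  • A sketch of the grouping step described above (the cell-index scheme
    and names are assumptions): molecules fall into cubic cells of side
    r, so a whole cell, rather than a single molecule, becomes the unit
    fetched from memory.

    import numpy as np
    from collections import defaultdict

    def group_into_cells(positions, r):
        """Blocking: map each molecule to the cubic cell of side r containing it."""
        cells = defaultdict(list)
        for i, pos in enumerate(positions):
            cells[tuple(np.floor(pos / r).astype(int))].append(i)
        return cells   # {cell index -> molecule ids}; bandwidth per pair drops as ~r^-3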

35
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

36
Conclusions
  • Reviewed the architecture and organization of
    Merrimac
  • Presented the application StreamMD, implemented 4
    variants, and evaluated their performance
  • Compared Merrimac's suitability for molecular
    dynamics applications against a conventional
    Pentium 4 processor

37
Special Applications?
  • Merrimac is tuned for scientific applications
  • Programming model
  • A collection of streams pass through a series of
    computational kernels
  • Need large data level parallelism to utilize the
    FPUs
  • Task parallelism can only be exploited across
    nodes because of the SIMD execution within a node

38
Easy to Program?
  • Effective automatic compilation
  • Communication scheduling and stream scheduling
    (shown earlier)
  • Highly optimized code for conventional processors
    is often written in assembly
  • Performance of the different StreamMD variants
    varies only by a factor of 2 (shown earlier)

39
Compare with Supercomputer?
  • Comparing only against a Pentium 4 seems
    unconvincing
  • MDGRAPE-3 of Protein Explorer can achieve 165
    GFLOPS out of 200 GFLOPS (peak)
  • But it is a special-purpose design
  • How about vector machines?
  • Lack of standard benchmarks

40
Thanks! And questions?