Analysis and Performance Results of a Molecular Modeling Application on Merrimac

1
Analysis and Performance Results of a Molecular
Modeling Application on Merrimac
  • Mattan Erez, Jung Ho Ahn, Ankit Garg, William J.
    Dally, Eric Darve (Stanford Univ.)
  • Presented by Jiahua He

2
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

3
Parallel Architectures
  • Flynn's taxonomy
  • SISD (sequential machine), SIMD, MIMD, MISD (no
    commercial system)
  • SIMD
  • Processor-array machine
  • Single processor vector machine
  • MIMD
  • PVP, SMP, DSM, MPP, Cluster

4
Processor-Array Machine
  • Control processor issues instructions
  • All processors in the processor array execute the
    instructions in lock-step
  • Distributed memory
  • Need permutation if data not aligned

5
Vector Machine
  • A processor can do element-wise operations on
    entire vectors with a single instruction
  • Dominated the high-performance computing market
    for about 15 years
  • Overtaken by MPPs in the 90s
  • Re-emerged in recent years (Earth Simulator and
    Cray X1)

6
MPP and Cluster
  • Distributed memory
  • Each processor/node has its own private memory
  • Nodes may be SMPs
  • MIMD
  • Nodes execute different instructions
    asynchronously
  • Nodes communicate and synchronize by
    interconnection network

7
Earth Simulator
  • Vector machine re-emerges
  • Rmax of 36 TFLOPS > the combined Rmax of the rest
    of the Top 10
  • Vector machines focused on powerful processors
  • MPP or Cluster focused on large-scale
    clustering
  • Trend: merge the two approaches

8
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

9
Modern VLSI Technology
  • Arithmetic is cheap
  • 100s of GFLOPS/chip today
  • TFLOPS in 2010
  • Bandwidth is expensive
  • General purpose processor architectures have not
    adapted to this change

10
Stream Processor
  • One control unit and 100s of FPUs
  • 90nm process: a 64-bit FPU costs ~0.5mm² and ~50pJ
    per operation
  • Deep register hierarchy with high local bandwidth
  • Match bandwidth demands and tech. limits
  • Stream: a sequence of data objects
  • Expose large amounts of data parallelism
  • Keep 100s of FPUs per processor busy
  • Hide long latencies of memory operations

11
Stream Processor (cont'd)
  • Expose multiple levels of locality
  • Short term producer-consumer locality (LRF)
  • Long term producer-consumer locality (SRF)
  • Cannot be exploited by caches: no reuse, no
    spatial locality
  • Scalable
  • 128 GFLOPS processor
  • 16-node, 2 TFLOPS single-board workstation
  • 16,384-node, 2 PFLOPS supercomputer in 16
    cabinets

12
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

13
Merrimac Processor
  • Scalar core (1)
  • Runs control code and issues stream
    instructions
  • Arithmetic clusters (16)
  • 64-bit multiply-accumulate (MADD) FPUs (4)
  • Execute the same VLIW instruction
  • Local register file (LRF) per FPU (192 words)
  • Short term producer-consumer locality in a kernel
  • Stream register file (SRF) per cluster (8K words)
  • Long term producer-consumer locality across
    kernels
  • Staging area for memory data transfer to hide
    latencies

14
Architecture of Merrimac
15
Stream Programming Model
  • Cast the computation as a collection of streams
    passing through a series of computational
    kernels.
  • Data parallelism
  • Across stream elements
  • Task parallelism
  • Across kernels
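  • A minimal Python sketch of this model (kernel and stream names are
    illustrative, not Merrimac's actual programming interface): each
    kernel is mapped over the elements of its input stream (data
    parallelism), and kernels are chained into a pipeline (task
    parallelism).

    # Hypothetical kernels; each runs once per stream element.
    def scale(element):
        return element * 2.0

    def shift(element):
        return element + 1.0

    def run_pipeline(input_stream):
        mid_stream = [scale(e) for e in input_stream]   # data parallelism: per element
        return [shift(e) for e in mid_stream]           # task parallelism: kernel chain

    print(run_pipeline([1.0, 2.0, 3.0]))   # -> [3.0, 5.0, 7.0]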

16
Memory System
  • A stream memory instruction transfers an entire
    stream
  • Address generator (2)
  • 8 single-word addresses every cycle
  • Strided or gather/scatter access patterns
  • Cache (128K words, 64GB/s)
  • Directly interface with external DRAM and network
  • External DRAM (2GB, 38.4GB/s)
  • Single-word remote memory access
  • Scatter-add operation
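  • The scatter-add above can be pictured with a short NumPy sketch (a
    software analogue of the hardware operation, not Merrimac code):
    values destined for the same address accumulate instead of
    overwriting each other.

    import numpy as np

    # Software analogue of scatter-add: repeated indices accumulate.
    forces   = np.zeros(4)
    indices  = np.array([0, 2, 2, 3])            # target addresses (index 2 repeats)
    partials = np.array([1.0, 0.5, 0.5, 2.0])    # values to add
    np.add.at(forces, indices, partials)         # forces -> [1.0, 0.0, 1.0, 2.0]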

17
Interconnection Network (Fat Tree)
18
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

19
Molecular Dynamics
  • Explore kinetic and thermodynamic properties of
    molecular system by simulating atomic models
  • water ↔ water molecule interactions
  • protein ↔ water molecule interactions
  • GROMACS: the fastest MD code available
  • Cut-off distance approximation
  • Neighbor list (neighbors within rc)
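  • A minimal sketch of the cut-off approximation and neighbor list
    (a naive O(N²) construction for clarity; GROMACS itself uses much
    faster cell-based methods).

    import numpy as np

    def build_neighbor_list(positions, rc):
        """For each molecule i, list the j > i that lie within the cut-off rc."""
        n = len(positions)
        neighbors = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if np.linalg.norm(positions[i] - positions[j]) < rc:
                    neighbors[i].append(j)
        return neighbors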

20
StreamMD
  • Single kernel: non-bonded interactions between all
    atom pairs of a molecule and one of its neighbors
  • Pseudo code (a runnable sketch follows below)
  • c_positions = gather(positions, i_central)
  • n_positions = gather(positions, i_neighbor)
  • partial_forces = compute_force(c_positions,
    n_positions)
  • forces = scatter_add(partial_forces, i_forces)
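  • A runnable rendering of the pseudo code above: NumPy fancy indexing
    stands in for gather, np.add.at for scatter_add, and compute_force
    is a placeholder pair force, not GROMACS's actual water force field.

    import numpy as np

    def compute_force(c_pos, n_pos):
        # Placeholder pair interaction: inverse-square force on the central molecule.
        d = c_pos - n_pos
        r2 = np.sum(d * d, axis=1, keepdims=True)
        return d / (r2 * np.sqrt(r2))

    def streammd_step(positions, i_central, i_neighbor, i_forces):
        c_positions = positions[i_central]            # gather(positions, i_central)
        n_positions = positions[i_neighbor]           # gather(positions, i_neighbor)
        partial_forces = compute_force(c_positions, n_positions)
        forces = np.zeros_like(positions)
        np.add.at(forces, i_forces, partial_forces)   # scatter_add(partial_forces, i_forces)
        return forces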

21
Latency Tolerance
  • Pipeline the requests
  • To amortize long initial latency
  • By issuing memory ops on long streams
  • Hide memory ops with computations
  • Concurrently executing memory ops and kernel
    computations
  • Strip-mining
  • Large data set → smaller strips
  • Outer loop (done manually; sketched below)
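  • A sketch of strip-mining (the manual outer loop above): the data set
    is walked in SRF-sized strips, so on Merrimac the memory transfer
    for the next strip can overlap the kernel running on the current one.

    def strip_mined(kernel, data, strip_size):
        """Manual outer loop over SRF-sized strips of a large data set."""
        results = []
        for start in range(0, len(data), strip_size):
            strip = data[start:start + strip_size]   # memory op: stage one strip
            results.extend(kernel(strip))            # kernel: overlaps the next strip's load
        return results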

22
Parallelism
  • 4 variants to exploit parallelism
  • Also implemented on Pentium 4 for comparison

23
Expanded Variant
  • Simplest version
  • Fully expand the interaction list
  • For each cluster per iteration
  • Read 2 interacting molecules
  • Produce 2 partial forces

24
Fixed Variant
  • Fixed-length neighbor list of length L
  • For each cluster
  • Read a central molecule once every L iterations
  • Read a neighbor molecule each iteration
  • Partial forces of central molecule are reduced in
    cluster
  • Repeat central molecule
    in i_central
  • Add dummy_neighbor
    in i_neighbor if needed
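  • One way the index streams for this variant could be built (an
    illustrative sketch, not the paper's code): pad each neighbor list
    to a multiple of L with dummy_neighbor and emit the central index
    once per block of L neighbors.

    def build_fixed_lists(neighbors, L, dummy_neighbor):
        """Fixed variant: neighbor lists padded to multiples of L."""
        i_central, i_neighbor = [], []
        for center, nbrs in enumerate(neighbors):
            padded = nbrs + [dummy_neighbor] * (-len(nbrs) % L)    # pad up to a multiple of L
            i_neighbor.extend(padded)
            i_central.extend([center] * (len(padded) // L))        # central read once every L iterations
        return i_central, i_neighbor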

25
Variable Variant
  • Variable-length neighbor list
  • Process inputs and produce outputs at a different
    rate for each cluster
  • Merrimac's inter-cluster communication
  • Conditional streams mechanism
  • Indexable SRF
  • Instructions to read new central position and
    write partial forces are issued on every
    iteration but with a condition
  • Slight overhead of unexecuted instructions

26
Duplicated Variant
  • Fixed-length neighbor list
  • Duplicate all interaction calculations
  • Reduce complete force for central molecule within
    cluster
  • No partial force for neighbor molecule is written
    out
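  • A sketch of the duplicated variant's data flow (illustrative names;
    pair_force is a placeholder): each central molecule recomputes every
    interaction with its full neighbor list, so its complete force is
    reduced locally and nothing is scattered back for the neighbors.

    import numpy as np

    def duplicated_forces(positions, full_neighbors, pair_force):
        """Every pair is computed twice, once from each molecule's side."""
        forces = np.zeros_like(positions)
        for i, nbrs in enumerate(full_neighbors):
            for j in nbrs:
                forces[i] += pair_force(positions[i], positions[j])
        return forces   # no partial forces written out for neighbor molecules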

27
Locality
  • Only short-term producer-consumer locality within
    a single kernel
  • Computing partial forces
  • Internal reduction of forces within a cluster
  • Computation/bandwidth trade-off
  • Extra computation for interactions with dummy
    molecules (fixed variant)
  • Extreme case: the duplicated variant
  • Need more sophisticated schemes (discussed later)

28
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

29
Experiment Setup
  • Single-node experiments
  • 900 water-molecule system
  • Cycle-accurate simulator of Merrimac
  • 4 variants of StreamMD
  • Pentium 4 version
  • Latest version of GROMACS
  • Fully hand optimized
  • Single precision SSE

30
Latency Tolerance
  • Snippet of the execution of duplicated variant
  • Left column
  • Kernel computations
  • Right column
  • Memory operations
  • Perfect overlap of memory and computation

31
Locality
  • Arithmetic intensities
  • fixed and variable depend on data set
  • Small difference → the compiler efficiently
    utilizes the register hierarchy
  • Reference percentages
  • Nearly all to LRFs
  • Small difference → the SRF is used just as a
    staging area for memory

32
Performance
  • variable outperforms expanded by 84%, fixed
    by 26%, duplicated by 119%, and Pentium 4 by
    a factor of 13.2
  • 38.8 GFLOPS is 50% of the GFLOPS of the optimal
    solution

33
Automatic Optimizations
  • Communication scheduling
  • SRF decouples memory from computation
  • Loop unrolling and software pipelining
  • Improve execution rate by 83%
  • Stream scheduling
  • SRF is software managed
  • Capture long-term producer-consumer locality by
    intelligent eviction

34
Computation/bandwidth Trade-off
  • Blocking technique
  • Group molecules into cubic clusters of size r³
  • Pave the cut-off sphere (radius rc) with cubic
    clusters
  • Memory bandwidth requirement scales as O(r⁻³)
  • Extra computation for pairs between rc and
    rc + 2√3·r
  • Minimum occurs at about 3 molecules per cluster
    (1.43)
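  • A sketch of the grouping step described above (the cell-index scheme
    and names are assumptions): molecules fall into cubic cells of side
    r, so a whole cell, rather than a single molecule, becomes the unit
    fetched from memory.

    import numpy as np
    from collections import defaultdict

    def group_into_cells(positions, r):
        """Blocking: map each molecule to the cubic cell of side r containing it."""
        cells = defaultdict(list)
        for i, pos in enumerate(positions):
            cells[tuple(np.floor(pos / r).astype(int))].append(i)
        return cells   # {cell index -> molecule ids}; bandwidth per pair drops as ~r^-3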

35
Content
  • Background
  • Motivation
  • Merrimac Architecture
  • Application StreamMD
  • Performance Evaluation
  • Conclusions and Discussions

36
Conclusions
  • Reviewed the architecture and organization of
    Merrimac
  • Presented the application StreamMD, implemented 4
    variants, and evaluated their performance
  • Compared Merrimac's suitability for molecular
    dynamics applications against a conventional
    Pentium 4 processor

37
Special Applications?
  • Merrimac is tuned for scientific applications
  • Programming model
  • A collection of streams pass through a series of
    computational kernels
  • Need large data level parallelism to utilize the
    FPUs
  • Task parallelism can only be exploited across
    nodes because of the SIMD execution within a node

38
Easy to Program?
  • Effective automatic compilation
  • Communication scheduling and stream scheduling
    (shown earlier)
  • Highly optimized code for conventional processors
    is often written in assembly
  • Performance of the different StreamMD variants
    varies only by a factor of 2 (shown earlier)

39
Compare with Supercomputer?
  • Comparing only against a Pentium 4 seems
    unconvincing
  • MDGRAPE-3 of Protein Explorer can achieve 165
    GFLOPS out of 200 GFLOPS (peak)
  • But it is a special-purpose design
  • How about vector machines?
  • Lack of standard benchmarks

40
Thanks! And questions?