Title: Enabling Rapid Development of Parallel Tree-Search Applications
1. Enabling Rapid Development of Parallel Tree-Search Applications
- Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University
2. How to turn astrophysics simulation output into scientific knowledge
Using 300 processors (circa 1995)
Step 1: Run simulation
3. How to turn astrophysics simulation output into scientific knowledge
Using 1000 processors (circa 2000)
Step 1: Run simulation
Step 2: Analyze simulation on server (in serial)
4. How to turn astrophysics simulation output into scientific knowledge
Using 4000 processors (circa 2006)
Step 1: Run simulation
Step 2: Analyze simulation on ??? (unhappy scientist)
5. Exploring the Universe Can Be (Computationally) Expensive
- The size of simulations is no longer limited by computational power
- It is limited by the parallelizability of data analysis tools
- This situation will only get worse in the future.
6. How to turn astrophysics simulation output into scientific knowledge
Using 1,000,000 cores? (circa 2012)
Step 1: Run simulation
Step 2: Analyze simulation on ???
By 2012, we will have machines with many hundreds of thousands of cores!
7. The Challenge of Data Analysis in a Multiprocessor Universe
- Parallel programs are difficult to write!
- Steep learning curve for parallel programming
- Parallel programs are expensive to write!
- Lengthy development time
- The parallel world is dominated by simulations
- Code is often reused for many years by many people
- Therefore, you can afford to invest lots of time writing the code.
- Example: GASOLINE (a cosmology N-body code) required 10 FTE-years of development
8. The Challenge of Data Analysis in a Multiprocessor Universe
- Data analysis does not work this way
- Rapidly changing scientific inquiries
- Less code reuse
- Simulation groups do not even write their analysis code in parallel!
- The data-mining paradigm mandates rapid software development!
9. How to turn observational data into scientific knowledge
Observe at Telescope (circa 1990)
Step 1: Collect data
10. How to turn observational data into scientific knowledge
Use Sky Survey Data (circa 2005)
Sloan Digital Sky Survey (500,000 galaxies)
Step 1: Collect data
Step 2: Analyze data on ???
11. How to turn observational data into scientific knowledge
Use Sky Survey Data (circa 2012)
Large Synoptic Survey Telescope (2,000,000 galaxies)
3-point correlation function: several petaflop-weeks of computation
12. Tightly-Coupled Parallelism (what this talk is about)
- Data and computational domains overlap
- Computational elements must communicate with one another
- Examples:
- Group finding
- N-point correlation functions (see the pair-counting sketch after this list)
- New object classification
- Density estimation
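To make the "tightly coupled" point concrete, here is a minimal 2-point pair counter; it is not from the talk, just an illustration. Every pair of points within a separation bin contributes, so when the points are spread across processors, pairs that straddle a processor boundary force interprocessor communication.

/* Minimal, illustrative 2-point pair counter (not from the talk):
 * counts pairs with separation between rmin and rmax. On a distributed
 * dataset, pairs straddling processor boundaries require communication,
 * which is what makes the problem tightly coupled. */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } Point;

long count_pairs(const Point *p, size_t n, double rmin, double rmax)
{
    long pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dx = p[i].x - p[j].x;
            double dy = p[i].y - p[j].y;
            double dz = p[i].z - p[j].z;
            double r = sqrt(dx*dx + dy*dy + dz*dz);
            if (r >= rmin && r <= rmax)
                pairs++;
        }
    }
    return pairs;
}

In practice a kd-tree replaces the O(N^2) double loop by pruning whole cells whose bounding boxes cannot fall in the separation bin.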
13. The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
- Build a library that is
- Sophisticated enough to take care of all of the nasty parallel bits for you.
- Flexible enough to be used for your own particular astrophysics data analysis application.
- Scalable: scales well to thousands of processors.
14. The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
- Astrophysics uses dynamic, irregular data structures
- Astronomy deals with point-like data in an N-dimensional parameter space
- The most efficient methods for this kind of data use space-partitioning trees.
- The most common data structure is a kd-tree (a minimal node sketch follows this list).
- Build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements
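For readers unfamiliar with kd-trees, the sketch below shows a minimal node layout over point-like data. It is illustrative only and not N tropy's actual tree representation: each node owns a contiguous range of particles, stores its bounding box, and splits that range along one dimension.

/* Minimal kd-tree node sketch for point-like data (illustrative only;
 * not N tropy's actual tree layout). Leaves keep a particle range;
 * interior nodes keep the split plane and two children. */
#include <stddef.h>

#define NDIM 3

typedef struct KDNode {
    double bnd_min[NDIM];     /* lower corner of this cell's bounding box */
    double bnd_max[NDIM];     /* upper corner of this cell's bounding box */
    int    split_dim;         /* dimension along which this node splits   */
    double split_val;         /* coordinate of the splitting plane        */
    size_t first, count;      /* range of particles in a global array     */
    struct KDNode *left;      /* child below split_val (NULL in a leaf)   */
    struct KDNode *right;     /* child above split_val (NULL in a leaf)   */
} KDNode;

Tree walks (gravity, pair counting, group finding) recurse into children only when a node's bounding box can contribute, which is where the efficiency comes from.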
15. Challenges for scalable parallel application development
- Things that make parallel programs difficult to write:
- Work orchestration
- Data management
- Things that inhibit scalability:
- Granularity (synchronization, consistency)
- Load balancing
- Data locality
Structured data, memory consistency
16. Overview of existing paradigms: DSM
- There are many existing distributed shared-memory (DSM) tools.
- Compilers:
- UPC
- Co-Array Fortran
- Titanium
- ZPL
- Linda
- Libraries
- Global Arrays
- TreadMarks
- IVY
- JIAJIA
- Strings
- Mirage
- Munin
- Quarks
- CVM
17. Overview of existing paradigms: DSM
- The Good: These are quite simple to use.
- The Good: They can manage data locality pretty well.
- The Bad: Existing DSM approaches tend not to scale very well because of fine granularity.
- The Ugly: Almost none support structured data (like trees).
18. Overview of existing paradigms: DSM
- There are some DSM approaches that do lend themselves to structured data
- e.g. Linda (tuple space)
- The Good: Almost universally flexible
- The Bad: These tend to scale even worse than simple unstructured DSM approaches.
- Granularity is too fine
19. Challenges for scalable parallel application development
- Things that make parallel programs difficult to write:
- Work orchestration
- Data management
- Things that inhibit scalability
- Granularity
- Load balancing
- Data locality
DSM
20. Overview of existing paradigms: RMI (Remote Method Invocation)
rmi_broadcast(, (myFunction))
[Diagram: a master thread works through a computational agenda and issues rmi_broadcast() through the RMI layer; the RMI layer on each of Proc. 0-3 then invokes myFunction() locally. myFunction() is coarsely grained.]
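The sketch below shows the same broadcast pattern using plain MPI, which is not N tropy or any particular RMI library; the function id and names are invented for illustration. The master (rank 0) broadcasts which function to run, and every rank invokes that coarse-grained function on its local data.

/* Illustrative sketch of the RMI broadcast pattern in plain MPI
 * (not the API of N tropy or of any RMI library named in the talk). */
#include <mpi.h>
#include <stdio.h>

enum { FUNC_NONE = 0, FUNC_MYFUNCTION = 1 };

static void myFunction(int rank)
{
    /* Coarse-grained work: each invocation does a substantial chunk of
     * computation, so the broadcast overhead is negligible. */
    printf("rank %d: running myFunction()\n", rank);
}

int main(int argc, char **argv)
{
    int rank, func_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The master decides what to run; every rank receives the same id. */
    func_id = (rank == 0) ? FUNC_MYFUNCTION : FUNC_NONE;
    MPI_Bcast(&func_id, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (func_id == FUNC_MYFUNCTION)
        myFunction(rank);

    MPI_Finalize();
    return 0;
}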
21. RMI Performance Features
- Coarse granularity
- Thread virtualization
- Queue many instances of myFunction() on each physical thread.
- The RMI infrastructure can migrate these instances to achieve load balancing.
22. Overview of existing paradigms: RMI
- RMI can be language-based
- Java
- CHARM
- Or library-based
- RPC
- ARMI
23. Challenges for scalable parallel application development
- Things that make parallel programs difficult to write:
- Work orchestration
- Data management
- Things that inhibit scalability
- Granularity
- Load balancing
- Data locality
RMI
24. N tropy: A Library for Rapid Development of kd-tree Applications
- No existing paradigm gives us everything we need.
- Can we combine existing paradigms beneath a simple, yet flexible API?
25. N tropy: A Library for Rapid Development of kd-tree Applications
- Use RMI for orchestration
- Use DSM for data management
- Implementation of both is targeted towards astrophysics
26. A Simple N tropy Example: N-body Gravity Calculation
- Cosmological N-body simulation
- 100,000,000 particles
- 1 TB of RAM
[Image: simulation volume, 100 million light years across]
27. A Simple N tropy Example: N-body Gravity Calculation
ntropy_Dynamic(, (myGravityFunc))
[Diagram: the master thread hands a computational agenda (particles P1, P2, ..., Pn on which to calculate the gravitational force) to the N tropy master RMI layer; the N tropy thread RMI layer on each of Proc. 0-3 then executes myGravityFunc().]
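The slide only shows the call ntropy_Dynamic(, (myGravityFunc)), not its signature, so the sketch below uses an invented dispatcher, dispatch_dynamic(), purely to illustrate the pattern: the application registers a per-work-element callback (the force calculation for one particle) and the library decides which processor runs each element.

/* Hypothetical sketch of the callback pattern behind ntropy_Dynamic()
 * (dispatch_dynamic() and its signature are invented for illustration;
 * this is not the N tropy API). */
#include <stddef.h>

typedef void (*work_func)(size_t element_id, void *user_data);

/* Stand-in for the library's dynamic dispatcher: here it simply runs
 * every element locally and in order. A real dispatcher would farm
 * elements out to remote threads via RMI. */
static void dispatch_dynamic(size_t n_elements, work_func f, void *user_data)
{
    for (size_t i = 0; i < n_elements; i++)
        f(i, user_data);
}

static void myGravityFunc(size_t particle_id, void *user_data)
{
    /* Compute the gravitational force on one particle; see the
     * direct-summation sketch under the next slide for the arithmetic. */
    (void)particle_id;
    (void)user_data;
}

/* Usage (illustrative): dispatch_dynamic(n_particles, myGravityFunc, particles); */
static void example_usage(void) { dispatch_dynamic(0, myGravityFunc, NULL); (void)example_usage; }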
28. A Simple N tropy Example: N-body Gravity Calculation
- Cosmological N-body simulation
- 100,000,000 particles
- 1 TB of RAM
- To resolve the gravitational force on any single particle requires the entire dataset
[Image: simulation volume, 100 million light years across]
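The reason the entire dataset is needed is that every particle contributes to the force on every other particle. A brute-force sketch of that sum is below; the softening length eps is my addition to avoid a divide-by-zero, and a production code would approximate distant groups of particles with tree cells rather than loop over all N.

/* Direct-summation gravity on one particle (illustrative; a real tree
 * code replaces most of this loop with node-level approximations). */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z, mass; } Particle;

void force_on(const Particle *p, size_t n, size_t i, double G, double eps,
              double *fx, double *fy, double *fz)
{
    *fx = *fy = *fz = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j == i) continue;                  /* skip self-interaction */
        double dx = p[j].x - p[i].x;
        double dy = p[j].y - p[i].y;
        double dz = p[j].z - p[i].z;
        double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        double s = G * p[i].mass * p[j].mass * inv_r3;
        *fx += s * dx;                          /* force points toward j */
        *fy += s * dy;
        *fz += s * dz;
    }
}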
29. A Simple N tropy Example: N-body Gravity Calculation
[Diagram: on each of Proc. 0-3, myGravityFunc() runs on top of an N tropy thread RMI layer, which in turn sits on an N tropy DSM layer; the RMI layers move work between processors while the DSM layers move data.]
30. N tropy Performance Features
- DSM allows performance features to be provided under the hood
- Interprocessor data caching for both reads and writes (a toy sketch follows this list)
- < 1 in 100,000 off-PE requests actually result in communication.
- Updates through the DSM interface must be commutative
- Relaxed memory model allows multiple writers with no overhead
- Consistency enforced through global synchronization
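The toy sketch below shows the idea behind the interprocessor read cache; the names, direct-mapped layout, and fetch_remote_node() stub are invented for illustration and are not N tropy's DSM implementation. A request for a remote tree node first checks a local cache, and only a miss triggers real communication with the owning processing element, which is how the vast majority of off-PE requests are absorbed.

/* Toy interprocessor read cache (illustrative only). */
#include <stddef.h>

#define CACHE_LINES 4096

typedef struct { double bnd_min[3], bnd_max[3]; double mass; } TreeNode;

typedef struct {
    long     key[CACHE_LINES];   /* global node id, or -1 if the slot is empty */
    TreeNode line[CACHE_LINES];  /* locally cached copy of the remote node     */
} NodeCache;

/* Stand-in for the real off-PE request (e.g. a message to the owner PE). */
static void fetch_remote_node(long global_id, TreeNode *out)
{
    (void)global_id;
    (void)out;
}

const TreeNode *cache_lookup(NodeCache *c, long global_id)
{
    size_t slot = (size_t)(global_id % CACHE_LINES);
    if (c->key[slot] != global_id) {              /* miss: communicate once */
        fetch_remote_node(global_id, &c->line[slot]);
        c->key[slot] = global_id;
    }
    return &c->line[slot];                        /* hit: no communication  */
}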
31. N tropy Performance Features
- RMI allows further performance features
- Thread virtualization
- Divide the workload into many more pieces than physical threads
- Dynamic load balancing is achieved by migrating work elements as computation progresses (a minimal self-scheduling sketch follows).
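The MPI sketch below shows one simple way to get the same effect as thread virtualization; it is a generic self-scheduling master/worker loop, not N tropy's work-migration mechanism. The workload is cut into many more chunks than there are ranks, and whichever worker finishes first asks for the next chunk, so fast workers naturally absorb more of the load.

/* Minimal self-scheduling sketch of thread virtualization in MPI
 * (illustrative only; chunk count and tags are arbitrary). */
#include <mpi.h>

#define N_CHUNKS 1024            /* many more pieces than physical threads */
#define TAG_REQ  1
#define TAG_WORK 2

static void process_chunk(int chunk) { (void)chunk; /* real work goes here */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master: hand out chunks */
        int next = 0, done = 0, msg;
        MPI_Status st;
        while (done < size - 1) {
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                     MPI_COMM_WORLD, &st);
            int chunk = (next < N_CHUNKS) ? next++ : -1;   /* -1 means stop */
            if (chunk < 0) done++;
            MPI_Send(&chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD);
        }
    } else {                                       /* workers: ask, work, repeat */
        int chunk, req = 0;
        for (;;) {
            MPI_Send(&req, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(&chunk, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (chunk < 0) break;
            process_chunk(chunk);
        }
    }
    MPI_Finalize();
    return 0;
}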
32. N tropy Performance
[Performance plot: 10 million particles, spatial 3-point correlation function, 3→4 Mpc; curves compare interprocessor data cache with load balancing, interprocessor data cache without load balancing, and no interprocessor data cache with no load balancing.]
33. Why does the data cache make such a huge difference?
[Diagram: myGravityFunc() running on Proc. 0.]
34. N tropy: Meaningful Benchmarks
- The purpose of this library is to minimize development time!
- Development time:
- Parallel N-point correlation function calculator: 2 years → 3 months
- Parallel Friends-of-Friends group finder: 8 months → 3 weeks
35. Conclusions
- Most approaches to parallel application development rely on providing a single paradigm in the most general possible manner
- Many scientific problems tend not to map well onto single paradigms
- Providing an ultra-general single paradigm inhibits scalability
36. Conclusions
- Scientists often borrow from several paradigms and implement them in a restricted and targeted manner.
- Almost all current HPC programs are written in MPI (paradigm-less)
- MPI is a lowest common denominator upon which any paradigm can be imposed.
37. Conclusions
- N tropy provides:
- Remote Method Invocation (RMI)
- Distributed Shared Memory (DSM)
- The implementation of these paradigms is lean and mean
- Targeted specifically for the problem domain
- This approach successfully enables astrophysics data analysis
- Substantially reduces application development time
- Scales to thousands of processors
- More information: go to Wikipedia and search for "Ntropy"