Title: Enabling Rapid Development of Parallel Tree-Search Applications
1. Enabling Rapid Development of Parallel Tree-Search Applications
- Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University
2. How to turn astrophysics simulation output into scientific knowledge
Using 300 processors (circa 1995)
Step 1: Run simulation
3. How to turn astrophysics simulation output into scientific knowledge
Using 1000 processors (circa 2000)
Step 1: Run simulation
Step 2: Analyze simulation on server (in serial)
4. How to turn astrophysics simulation output into scientific knowledge
Using 4000 processors (circa 2006)
Step 1: Run simulation
Step 2: Analyze simulation on ??? (unhappy scientist)
5. Exploring the Universe Can Be (Computationally) Expensive
- The size of simulations is no longer limited by computational power
- It is limited by the parallelizability of data analysis tools
- This situation will only get worse in the future.
6. How to turn astrophysics simulation output into scientific knowledge
Using 1,000,000 cores? (circa 2012)
Step 1: Run simulation
Step 2: Analyze simulation on ???
By 2012, we will have machines with many hundreds of thousands of cores!
7. The Challenge of Data Analysis in a Multiprocessor Universe
- Parallel programs are difficult to write!
- Steep learning curve for parallel programming
- Parallel programs are expensive to write!
- Lengthy development time
- The parallel world is dominated by simulations
- Code is often reused for many years by many people
- Therefore, you can afford to invest lots of time writing the code.
- Example: GASOLINE (a cosmology N-body code) required 10 FTE-years of development
8. The Challenge of Data Analysis in a Multiprocessor Universe
- Data analysis does not work this way
- Rapidly changing scientific inquiries
- Less code reuse
- Simulation groups do not even write their analysis code in parallel!
- The data-mining paradigm mandates rapid software development!
9. How to turn observational data into scientific knowledge
Observe at Telescope (circa 1990)
Step 1: Collect data
10. How to turn observational data into scientific knowledge
Use Sky Survey Data (circa 2005)
Sloan Digital Sky Survey (500,000 galaxies)
Step 1: Collect data
Step 2: Analyze data on ???
11. How to turn observational data into scientific knowledge
Use Sky Survey Data (circa 2012)
Large Synoptic Survey Telescope (2,000,000 galaxies)
3-point correlation function: several petaflop-weeks of computation
12. Tightly-Coupled Parallelism (what this talk is about)
- Data and computational domains overlap
- Computational elements must communicate with one another
- Examples:
- Group finding
- N-point correlation functions (see the pair-counting sketch after this list)
- New object classification
- Density estimation
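To make the "tightly coupled" point concrete, here is a minimal 2-point pair counter; it is not from the talk, just an illustration. Every pair of points within a separation bin contributes, so when the points are spread across processors, pairs that straddle a processor boundary force interprocessor communication.

/* Minimal, illustrative 2-point pair counter (not from the talk):
 * counts pairs with separation between rmin and rmax. On a distributed
 * dataset, pairs straddling processor boundaries require communication,
 * which is what makes the problem tightly coupled. */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } Point;

long count_pairs(const Point *p, size_t n, double rmin, double rmax)
{
    long pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dx = p[i].x - p[j].x;
            double dy = p[i].y - p[j].y;
            double dz = p[i].z - p[j].z;
            double r = sqrt(dx*dx + dy*dy + dz*dz);
            if (r >= rmin && r <= rmax)
                pairs++;
        }
    }
    return pairs;
}

In practice a kd-tree replaces the O(N^2) double loop by pruning whole cells whose bounding boxes cannot fall in the separation bin.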
13. The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
- Build a library that is
- Sophisticated enough to take care of all of the nasty parallel bits for you.
- Flexible enough to be used for your own particular astrophysics data analysis application.
- Scalable: scales well to thousands of processors.
14. The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
- Astrophysics uses dynamic, irregular data structures
- Astronomy deals with point-like data in an N-dimensional parameter space
- The most efficient methods for this kind of data use space-partitioning trees.
- The most common data structure is a kd-tree (a minimal node sketch follows this list).
- Build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements
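For readers unfamiliar with kd-trees, the sketch below shows a minimal node layout over point-like data. It is illustrative only and not N tropy's actual tree representation: each node owns a contiguous range of particles, stores its bounding box, and splits that range along one dimension.

/* Minimal kd-tree node sketch for point-like data (illustrative only;
 * not N tropy's actual tree layout). Leaves keep a particle range;
 * interior nodes keep the split plane and two children. */
#include <stddef.h>

#define NDIM 3

typedef struct KDNode {
    double bnd_min[NDIM];     /* lower corner of this cell's bounding box */
    double bnd_max[NDIM];     /* upper corner of this cell's bounding box */
    int    split_dim;         /* dimension along which this node splits   */
    double split_val;         /* coordinate of the splitting plane        */
    size_t first, count;      /* range of particles in a global array     */
    struct KDNode *left;      /* child below split_val (NULL in a leaf)   */
    struct KDNode *right;     /* child above split_val (NULL in a leaf)   */
} KDNode;

Tree walks (gravity, pair counting, group finding) recurse into children only when a node's bounding box can contribute, which is where the efficiency comes from.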
15. Challenges for scalable parallel application development
- Things that make parallel programs difficult to write:
- Work orchestration
- Data management
- Things that inhibit scalability:
- Granularity (synchronization, consistency)
- Load balancing
- Data locality
Structured data, memory consistency
16. Overview of existing paradigms: DSM
- There are many existing distributed shared-memory (DSM) tools.
- Compilers:
- UPC
- Co-Array Fortran
- Titanium
- ZPL
- Linda
- Libraries
- Global Arrays
- TreadMarks
- IVY
- JIAJIA
- Strings
- Mirage
- Munin
- Quarks
- CVM
17. Overview of existing paradigms: DSM
- The Good: These are quite simple to use.
- The Good: They can manage data locality pretty well.
- The Bad: Existing DSM approaches tend not to scale very well because of fine granularity.
- The Ugly: Almost none support structured data (like trees).
18. Overview of existing paradigms: DSM
- There are some DSM approaches that do lend themselves to structured data
- e.g. Linda (tuple space)
- The Good: Almost universally flexible
- The Bad: These tend to scale even worse than simple unstructured DSM approaches.
- Granularity is too fine
19. Challenges for scalable parallel application development
- Things that make parallel programs difficult to write:
- Work orchestration
- Data management
- Things that inhibit scalability
- Granularity
- Load balancing
- Data locality
DSM
20. Overview of existing paradigms: RMI (Remote Method Invocation)
rmi_broadcast(, (myFunction))
[Diagram: a master thread works through a computational agenda and issues rmi_broadcast() through the RMI layer; the RMI layer on each of Proc. 0-3 then invokes myFunction() locally. myFunction() is coarsely grained.]
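The sketch below shows the same broadcast pattern using plain MPI, which is not N tropy or any particular RMI library; the function id and names are invented for illustration. The master (rank 0) broadcasts which function to run, and every rank invokes that coarse-grained function on its local data.

/* Illustrative sketch of the RMI broadcast pattern in plain MPI
 * (not the API of N tropy or of any RMI library named in the talk). */
#include <mpi.h>
#include <stdio.h>

enum { FUNC_NONE = 0, FUNC_MYFUNCTION = 1 };

static void myFunction(int rank)
{
    /* Coarse-grained work: each invocation does a substantial chunk of
     * computation, so the broadcast overhead is negligible. */
    printf("rank %d: running myFunction()\n", rank);
}

int main(int argc, char **argv)
{
    int rank, func_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The master decides what to run; every rank receives the same id. */
    func_id = (rank == 0) ? FUNC_MYFUNCTION : FUNC_NONE;
    MPI_Bcast(&func_id, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (func_id == FUNC_MYFUNCTION)
        myFunction(rank);

    MPI_Finalize();
    return 0;
}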
21. RMI Performance Features
- Coarse granularity
- Thread virtualization
- Queue many instances of myFunction() on each physical thread.
- The RMI infrastructure can migrate these instances to achieve load balancing.
22. Overview of existing paradigms: RMI
- RMI can be language-based
- Java
- CHARM
- Or library-based
- RPC
- ARMI
23. Challenges for scalable parallel application development
- Things that make parallel programs difficult to write:
- Work orchestration
- Data management
- Things that inhibit scalability
- Granularity
- Load balancing
- Data locality
RMI
24. N tropy: A Library for Rapid Development of kd-tree Applications
- No existing paradigm gives us everything we need.
- Can we combine existing paradigms beneath a simple, yet flexible API?
25. N tropy: A Library for Rapid Development of kd-tree Applications
- Use RMI for orchestration
- Use DSM for data management
- Implementation of both is targeted towards astrophysics
26. A Simple N tropy Example: N-body Gravity Calculation
- Cosmological N-body simulation
- 100,000,000 particles
- 1 TB of RAM
[Image: simulation volume, 100 million light years across]
27. A Simple N tropy Example: N-body Gravity Calculation
ntropy_Dynamic(, (myGravityFunc))
[Diagram: the master thread hands a computational agenda (particles P1, P2, ..., Pn on which to calculate the gravitational force) to the N tropy master RMI layer; the N tropy thread RMI layer on each of Proc. 0-3 then executes myGravityFunc().]
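The slide only shows the call ntropy_Dynamic(, (myGravityFunc)), not its signature, so the sketch below uses an invented dispatcher, dispatch_dynamic(), purely to illustrate the pattern: the application registers a per-work-element callback (the force calculation for one particle) and the library decides which processor runs each element.

/* Hypothetical sketch of the callback pattern behind ntropy_Dynamic()
 * (dispatch_dynamic() and its signature are invented for illustration;
 * this is not the N tropy API). */
#include <stddef.h>

typedef void (*work_func)(size_t element_id, void *user_data);

/* Stand-in for the library's dynamic dispatcher: here it simply runs
 * every element locally and in order. A real dispatcher would farm
 * elements out to remote threads via RMI. */
static void dispatch_dynamic(size_t n_elements, work_func f, void *user_data)
{
    for (size_t i = 0; i < n_elements; i++)
        f(i, user_data);
}

static void myGravityFunc(size_t particle_id, void *user_data)
{
    /* Compute the gravitational force on one particle; see the
     * direct-summation sketch under the next slide for the arithmetic. */
    (void)particle_id;
    (void)user_data;
}

/* Usage (illustrative): dispatch_dynamic(n_particles, myGravityFunc, particles); */
static void example_usage(void) { dispatch_dynamic(0, myGravityFunc, NULL); (void)example_usage; }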
28. A Simple N tropy Example: N-body Gravity Calculation
- Cosmological N-body simulation
- 100,000,000 particles
- 1 TB of RAM
- To resolve the gravitational force on any single particle requires the entire dataset
[Image: simulation volume, 100 million light years across]
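The reason the entire dataset is needed is that every particle contributes to the force on every other particle. A brute-force sketch of that sum is below; the softening length eps is my addition to avoid a divide-by-zero, and a production code would approximate distant groups of particles with tree cells rather than loop over all N.

/* Direct-summation gravity on one particle (illustrative; a real tree
 * code replaces most of this loop with node-level approximations). */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z, mass; } Particle;

void force_on(const Particle *p, size_t n, size_t i, double G, double eps,
              double *fx, double *fy, double *fz)
{
    *fx = *fy = *fz = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j == i) continue;                  /* skip self-interaction */
        double dx = p[j].x - p[i].x;
        double dy = p[j].y - p[i].y;
        double dz = p[j].z - p[i].z;
        double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        double s = G * p[i].mass * p[j].mass * inv_r3;
        *fx += s * dx;                          /* force points toward j */
        *fy += s * dy;
        *fz += s * dz;
    }
}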
29. A Simple N tropy Example: N-body Gravity Calculation
[Diagram: on each of Proc. 0-3, myGravityFunc() runs on top of an N tropy thread RMI layer, which in turn sits on an N tropy DSM layer; the RMI layers move work between processors while the DSM layers move data.]
30. N tropy Performance Features
- DSM allows performance features to be provided under the hood
- Interprocessor data caching for both reads and writes (a toy sketch follows this list)
- < 1 in 100,000 off-PE requests actually result in communication.
- Updates through the DSM interface must be commutative
- Relaxed memory model allows multiple writers with no overhead
- Consistency enforced through global synchronization
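The toy sketch below shows the idea behind the interprocessor read cache; the names, direct-mapped layout, and fetch_remote_node() stub are invented for illustration and are not N tropy's DSM implementation. A request for a remote tree node first checks a local cache, and only a miss triggers real communication with the owning processing element, which is how the vast majority of off-PE requests are absorbed.

/* Toy interprocessor read cache (illustrative only). */
#include <stddef.h>

#define CACHE_LINES 4096

typedef struct { double bnd_min[3], bnd_max[3]; double mass; } TreeNode;

typedef struct {
    long     key[CACHE_LINES];   /* global node id, or -1 if the slot is empty */
    TreeNode line[CACHE_LINES];  /* locally cached copy of the remote node     */
} NodeCache;

/* Stand-in for the real off-PE request (e.g. a message to the owner PE). */
static void fetch_remote_node(long global_id, TreeNode *out)
{
    (void)global_id;
    (void)out;
}

const TreeNode *cache_lookup(NodeCache *c, long global_id)
{
    size_t slot = (size_t)(global_id % CACHE_LINES);
    if (c->key[slot] != global_id) {              /* miss: communicate once */
        fetch_remote_node(global_id, &c->line[slot]);
        c->key[slot] = global_id;
    }
    return &c->line[slot];                        /* hit: no communication  */
}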
31. N tropy Performance Features
- RMI allows further performance features
- Thread virtualization
- Divide the workload into many more pieces than physical threads
- Dynamic load balancing is achieved by migrating work elements as computation progresses (a minimal self-scheduling sketch follows).
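The MPI sketch below shows one simple way to get the same effect as thread virtualization; it is a generic self-scheduling master/worker loop, not N tropy's work-migration mechanism. The workload is cut into many more chunks than there are ranks, and whichever worker finishes first asks for the next chunk, so fast workers naturally absorb more of the load.

/* Minimal self-scheduling sketch of thread virtualization in MPI
 * (illustrative only; chunk count and tags are arbitrary). */
#include <mpi.h>

#define N_CHUNKS 1024            /* many more pieces than physical threads */
#define TAG_REQ  1
#define TAG_WORK 2

static void process_chunk(int chunk) { (void)chunk; /* real work goes here */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master: hand out chunks */
        int next = 0, done = 0, msg;
        MPI_Status st;
        while (done < size - 1) {
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                     MPI_COMM_WORLD, &st);
            int chunk = (next < N_CHUNKS) ? next++ : -1;   /* -1 means stop */
            if (chunk < 0) done++;
            MPI_Send(&chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD);
        }
    } else {                                       /* workers: ask, work, repeat */
        int chunk, req = 0;
        for (;;) {
            MPI_Send(&req, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(&chunk, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (chunk < 0) break;
            process_chunk(chunk);
        }
    }
    MPI_Finalize();
    return 0;
}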
32. N tropy Performance
[Performance plot: 10 million particles, spatial 3-point correlation function, 3→4 Mpc; curves compare interprocessor data cache with load balancing, interprocessor data cache without load balancing, and no interprocessor data cache with no load balancing.]
33. Why does the data cache make such a huge difference?
[Diagram: myGravityFunc() running on Proc. 0.]
34. N tropy: Meaningful Benchmarks
- The purpose of this library is to minimize development time!
- Development time:
- Parallel N-point correlation function calculator: 2 years → 3 months
- Parallel Friends-of-Friends group finder: 8 months → 3 weeks
35. Conclusions
- Most approaches to parallel application development rely on providing a single paradigm in the most general possible manner
- Many scientific problems tend not to map well onto single paradigms
- Providing an ultra-general single paradigm inhibits scalability
36. Conclusions
- Scientists often borrow from several paradigms and implement them in a restricted and targeted manner.
- Almost all current HPC programs are written in MPI (paradigm-less)
- MPI is a lowest common denominator upon which any paradigm can be imposed.
37. Conclusions
- N tropy provides:
- Remote Method Invocation (RMI)
- Distributed Shared Memory (DSM)
- The implementation of these paradigms is lean and mean
- Targeted specifically for the problem domain
- This approach successfully enables astrophysics data analysis
- Substantially reduces application development time
- Scales to thousands of processors
- More information: go to Wikipedia and search for "Ntropy"