Transcript and Presenter's Notes

Title: Enabling Rapid Development of Parallel Tree-Search Applications


1
Enabling Rapid Development of Parallel
Tree-Search Applications
  • Harnessing the Power of Massively Parallel
    Platforms for Astrophysical Data Analysis

Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University
2
How to turn astrophysics simulation output into
scientific knowledge
Using 300 processors (circa 1995)
Step 1: Run simulation
3
How to turn astrophysics simulation output into
scientific knowledge
Using 1000 processors (circa 2000)
Step 1: Run simulation
Step 2: Analyze simulation on server (in serial)
4
How to turn astrophysics simulation output into
scientific knowledge
Using 4000 processors (circa 2006)
Step 1: Run simulation
Step 2: Analyze simulation on ??? (unhappy scientist)
5
Exploring the Universe can be (Computationally)
Expensive
  • The size of simulations is no longer limited by
    computational power
  • It is limited by the parallelizability of data
    analysis tools
  • This situation will only get worse in the future.

6
How to turn astrophysics simulation output into
scientific knowledge
Using 1,000,000 cores? (circa 2012)
Step 1: Run simulation
Step 2: Analyze simulation on ???
By 2012, we will have machines with many hundreds of
thousands of cores!
7
The Challenge of Data Analysis in a
Multiprocessor Universe
  • Parallel programs are difficult to write!
  • Steep learning curve for parallel programming
  • Parallel programs are expensive to write!
  • Lengthy development time
  • Parallel world is dominated by simulations
  • Code is often reused for many years by many
    people
  • Therefore, you can afford to invest lots of time
    writing the code.
  • Example: GASOLINE (a cosmology N-body code)
  • Required 10 FTE-years of development

8
The Challenge of Data Analysis in a
Multiprocessor Universe
  • Data Analysis does not work this way
  • Rapidly changing scientific inquiries
  • Less code reuse
  • Simulation groups do not even write their
    analysis code in parallel!
  • Data Mining paradigm mandates rapid software
    development!

9
How to turn observational data into scientific
knowledge
Observe at Telescope (circa 1990)
Step 1: Collect data
10
How to turn observational data into scientific
knowledge
Use Sky Survey Data (circa 2005)
Sloan Digital Sky Survey (500,000 galaxies)
Step 1: Collect data
Step 2: Analyze data on ???
11
How to turn observational data into scientific
knowledge
Use Sky Survey Data (circa 2012)
3-point correlation function: several petaflop-weeks
of computation
Large Synoptic Survey Telescope (2,000,000
galaxies)
12
Tightly-Coupled Parallelism (what this talk is about)
  • Data and computational domains overlap
  • Computational elements must communicate with one
    another
  • Examples
  • Group finding
  • N-Point correlation functions
  • New object classification
  • Density estimation

13
The Challenge of Astrophysics Data Analysis in a
Multiprocessor Universe
  • Build a library that is
  • Sophisticated enough to take care of all of the
    nasty parallel bits for you.
  • Flexible enough to be used for your own
    particular astrophysics data analysis
    application.
  • Scalable: scales well to thousands of processors.

14
The Challenge of Astrophysics Data Analysis in a
Multiprocessor Universe
  • Astrophysics uses dynamic, irregular data
    structures
  • Astronomy deals with point-like data in an
    N-dimensional parameter space
  • The most efficient methods on this kind of data use
    space-partitioning trees.
  • The most common data structure is a kd-tree (see
    the sketch below).
  • Build a targeted library for distributed-memory
    kd-trees that is scalable to thousands of
    processing elements

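As a rough illustration (not N tropy's actual data layout), a kd-tree node over point data might look like the C sketch below; a distributed-memory version would typically replace the child pointers with (owner processor, node index) pairs so that cells can live on remote processing elements.

/* Illustrative kd-tree node over 3-D point data (hypothetical layout,
 * not N tropy's actual structures). */
#define NDIM 3

typedef struct KDNode {
    double bnd_min[NDIM];         /* lower corner of the cell's bounding box */
    double bnd_max[NDIM];         /* upper corner of the cell's bounding box */
    int    split_dim;             /* dimension the cell is split along       */
    double split_val;             /* coordinate of the splitting plane       */
    int    first, last;           /* range of particle indices in this cell  */
    struct KDNode *left, *right;  /* children; NULL for a leaf bucket        */
} KDNode;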
15
Challenges for scalable parallel application
development
  • Things that make parallel programs difficult to
    write
  • Work orchestration
  • Data management
  • Things that inhibit scalability
  • Granularity (synchronization, consistency)
  • Load balancing
  • Data locality

Structured data
Memory consistency
16
Overview of existing paradigms: DSM
  • There are many existing distributed shared-memory
    (DSM) tools.
  • Compilers
  • UPC
  • Co-Array Fortran
  • Titanium
  • ZPL
  • Linda
  • Libraries
  • Global Arrays
  • TreadMarks
  • IVY
  • JIAJIA
  • Strings
  • Mirage
  • Munin
  • Quarks
  • CVM

17
Overview of existing paradigms: DSM
  • The Good: These are quite simple to use.
  • The Good: Can manage data locality pretty well.
  • The Bad: Existing DSM approaches tend not to
    scale very well because of fine granularity.
  • The Ugly: Almost none support structured data
    (like trees).

18
Overview of existing paradigms: DSM
  • There are some DSM approaches that do lend
    themselves to structured data
  • e.g. Linda (tuple-space)
  • The Good: Almost universally flexible
  • The Bad: These tend to scale even worse than
    simple unstructured DSM approaches.
  • Granularity is too fine

19
Challenges for scalable parallel application
development
  • Things that make parallel programs difficult to
    write
  • Work orchestration
  • Data management
  • Things that inhibit scalability
  • Granularity
  • Load balancing
  • Data locality

DSM
20
Overview of existing paradigms: RMI
Remote Method Invocation
rmi_broadcast(..., (myFunction))
[Diagram: the master thread works through the computational agenda and, via the RMI layer, invokes myFunction() on Proc. 0 through Proc. 3, each of which runs its own copy behind a local RMI layer]
myFunction() is coarsely grained (a sketch of this dispatch follows)
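A minimal, self-contained C sketch of the idea behind such a dispatch (illustrative only, not N tropy's RMI layer): the master names a function, and each processing element looks it up in a local registry and invokes its own copy on a coarse-grained work element.

/* Sketch of RMI-style dispatch: a broadcast carries only a function id and
 * an argument; each PE resolves the id through a local registry and calls
 * its own copy of the function.  (Illustrative, not N tropy's API.) */
#include <stdio.h>

typedef void (*rmi_func)(int work_element);

static void myFunction(int work_element)
{
    /* a coarse-grained unit of work, e.g. an entire tree walk */
    printf("myFunction() on work element %d\n", work_element);
}

/* functions registered with the RMI layer, indexed by id */
static rmi_func registry[] = { myFunction };

/* what a PE does when an rmi_broadcast message arrives */
static void rmi_dispatch(int func_id, int work_element)
{
    registry[func_id](work_element);
}

int main(void)
{
    rmi_dispatch(0, 42);   /* stand-in for receiving the broadcast */
    return 0;
}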
21
RMI Performance Features
  • Coarse granularity
  • Thread virtualization
  • Queue many instances of myFunction() on each
    physical thread.
  • RMI infrastructure can migrate these instances to
    achieve load balancing.

22
Overview of existing paradigms: RMI
  • RMI can be language based
  • Java
  • CHARM
  • Or library based
  • RPC
  • ARMI

23
Challenges for scalable parallel application
development
  • Things that make parallel programs difficult to
    write
  • Work orchestration
  • Data management
  • Things that inhibit scalability
  • Granularity
  • Load balancing
  • Data locality

RMI
24
N tropy: A Library for Rapid Development of
kd-tree Applications
  • No existing paradigm gives us everything we need.
  • Can we combine existing paradigms beneath a
    simple, yet flexible API?

25
N tropy: A Library for Rapid Development of
kd-tree Applications
  • Use RMI for orchestration
  • Use DSM for data management
  • Implementation of both is targeted towards
    astrophysics

26
A Simple N tropy Example: N-body Gravity
Calculation
  • Cosmological N-body simulation
  • 100,000,000 particles
  • 1 TB of RAM

100 million light years
27
A Simple N tropy Example: N-body Gravity
Calculation
ntropy_Dynamic(..., (myGravityFunc))
[Diagram: the computational agenda (particles P1, P2, ..., Pn on which to calculate the gravitational force) is handed to the master thread; through the N tropy master RMI layer, myGravityFunc() is invoked on Proc. 0 through Proc. 3, each behind its own N tropy thread RMI layer. An application-side sketch follows.]
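From the application's side, this pattern amounts to writing one serial-looking work function and handing it to the library. The sketch below is hypothetical (the real ntropy_Dynamic argument list is elided on the slide and left elided here); main() is only a serial stand-in for what the library would do with each agenda entry.

/* Hypothetical application-side sketch: the scientist writes a serial-looking
 * work function; the library calls it once per work element of the agenda,
 * on whichever thread the element is assigned to. */
static void myGravityFunc(int first_particle, int last_particle)
{
    /* walk the distributed kd-tree and accumulate the gravitational force
     * on particles [first_particle, last_particle); remote tree cells arrive
     * transparently through the DSM layer (next slides) */
    (void)first_particle;
    (void)last_particle;
}

/* on the master thread, conceptually:
 *     ntropy_Dynamic(..., (myGravityFunc));
 * where "..." stands for the arguments elided on the slide */
int main(void)
{
    myGravityFunc(0, 1000);   /* serial stand-in for one agenda entry */
    return 0;
}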
28
A Simple N tropy Example: N-body Gravity
Calculation
  • Cosmological N-body simulation
  • 100,000,000 particles
  • 1 TB of RAM

Resolving the gravitational force on any single
particle requires the entire dataset
29
A Simple N tropy Example: N-body Gravity
Calculation
[Diagram: on each of Proc. 0 through Proc. 3, myGravityFunc() runs on top of the N tropy thread RMI layer, which handles work, and the N tropy DSM layer, which handles data]
30
N tropy Performance Features
  • DSM allows performance features to be provided
    under the hood
  • Interprocessor data caching for both reads and
    writes
  • < 1 in 100,000 off-PE requests actually result in
    communication.
  • Updates through the DSM interface must be commutative
  • Relaxed memory model allows multiple writers with
    no overhead (see the sketch below)
  • Consistency enforced through global
    synchronization

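To make the multiple-writer point concrete, here is a minimal C sketch (not N tropy code) of a commutative update: each thread only ever adds contributions, so writes can be buffered locally in any order and merged at the next global synchronization.

/* Minimal sketch (not N tropy code) of why commutative updates permit
 * multiple writers under a relaxed memory model: additions can be buffered
 * locally in any order and merged at the next global synchronization. */
#include <stdio.h>

#define N_CELLS 8

static double local_copy[N_CELLS];   /* this PE's cached view of shared data */
static double pending[N_CELLS];      /* writes not yet made globally visible */

/* commutative write: the order of contributions never changes the final sum */
static void dsm_accumulate(int cell, double contribution)
{
    local_copy[cell] += contribution;  /* visible locally right away          */
    pending[cell]    += contribution;  /* shipped to the owner at sync time   */
}

/* at a global synchronization, every PE's pending buffer is reduced into
 * the owning PE's copy (shown here for a single PE only) */
static void dsm_sync(double *global)
{
    for (int i = 0; i < N_CELLS; i++) {
        global[i] += pending[i];
        pending[i] = 0.0;
    }
}

int main(void)
{
    double global[N_CELLS] = {0};
    dsm_accumulate(3, 1.5);
    dsm_accumulate(3, 2.5);
    dsm_sync(global);
    printf("cell 3 = %g\n", global[3]);   /* 4, regardless of update order */
    return 0;
}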
31
N tropy Performance Features
  • RMI allows further performance features
  • Thread virtualization
  • Divide workload into many more pieces than
    physical threads
  • Dynamic load balancing is achieved by migrating
    work elements as computation progresses (sketched below).

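As a rough stand-in for this mechanism (not N tropy's actual implementation), the MPI sketch below shows the classic way dynamic load balancing falls out once the workload is split into many more pieces than physical threads: a master hands the next work element to whichever worker finishes first.

/* Rough stand-in (not N tropy's implementation): dynamic load balancing by
 * splitting the workload into many more pieces than processes and letting
 * idle workers pull the next piece from a master. */
#include <mpi.h>
#include <stdio.h>

#define N_WORK   64    /* many more work elements than physical threads */
#define TAG_WORK  1
#define TAG_DONE  2

static void work_element(int id)
{
    /* one coarse-grained piece of the computation */
    printf("work element %d\n", id);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master: owns the agenda */
        int next = 0, outstanding = 0, done;
        MPI_Status st;
        for (int p = 1; p < size && next < N_WORK; p++) {
            MPI_Send(&next, 1, MPI_INT, p, TAG_WORK, MPI_COMM_WORLD);
            next++; outstanding++;
        }
        while (outstanding > 0) {    /* refill whoever finishes first */
            MPI_Recv(&done, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            outstanding--;
            if (next < N_WORK) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++; outstanding++;
            }
        }
        int stop = -1;               /* agenda exhausted: release the workers */
        for (int p = 1; p < size; p++)
            MPI_Send(&stop, 1, MPI_INT, p, TAG_WORK, MPI_COMM_WORLD);
    } else {                         /* worker: loop until told to stop */
        int id;
        for (;;) {
            MPI_Recv(&id, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (id < 0) break;
            work_element(id);
            MPI_Send(&id, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

This pull model already smooths out unevenly sized work elements; migrating already-queued instances between threads, as described on the slide, extends the same idea.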
32
N tropy Performance
[Scaling plot: spatial 3-point correlation function on 10 million particles at scales of 3 to 4 Mpc, comparing three configurations: interprocessor data cache with load balancing, interprocessor data cache without load balancing, and no interprocessor data cache without load balancing]
33
Why does the data cache make such a huge
difference?
[Diagram: myGravityFunc() on Proc. 0 walking the distributed tree]
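A plausible reason, shown as an illustrative C sketch rather than N tropy's code: tree walks touch the same remote cells over and over, so once a cell is cached locally only the very first request for it has to cross the network, consistent with the < 1 in 100,000 figure two slides back.

/* Illustrative node cache for a distributed tree walk (not N tropy's code):
 * only the first request for an off-processor cell causes communication. */
#include <stdio.h>

#define CACHE_SLOTS 4096

typedef struct { long key; double node[8]; int valid; } CacheLine;
static CacheLine cache[CACHE_SLOTS];
static long remote_fetches = 0, total_requests = 0;

/* placeholder for the expensive part: asking another PE for a tree node */
static void fetch_remote(long node_id, double *out)
{
    remote_fetches++;
    (void)node_id; (void)out;
}

static double *get_node(long node_id)
{
    CacheLine *line = &cache[node_id % CACHE_SLOTS];
    total_requests++;
    if (!line->valid || line->key != node_id) {   /* miss: one real message */
        fetch_remote(node_id, line->node);
        line->key = node_id;
        line->valid = 1;
    }
    return line->node;                            /* hit: no communication */
}

int main(void)
{
    /* a walk that revisits the same few remote cells many times */
    for (int walk = 0; walk < 100000; walk++)
        for (long id = 0; id < 16; id++)
            get_node(id);
    printf("%ld requests, %ld actual fetches\n", total_requests, remote_fetches);
    return 0;
}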
34
N tropy: Meaningful Benchmarks
  • The purpose of this library is to minimize
    development time!
  • Development time for
  • Parallel N-point correlation function calculator
  • 2 years -> 3 months
  • Parallel Friends-of-Friends group finder
  • 8 months -> 3 weeks

35
Conclusions
  • Most approaches for parallel application
    development rely on providing a single paradigm
    in the most general possible manner
  • Many scientific problems tend not to map well
    onto single paradigms
  • Providing an ultra-general single paradigm
    inhibits scalability

36
Conclusions
  • Scientists often borrow from several paradigms
    and implement them in a restricted and targeted
    manner.
  • Almost all current HPC programs are written in
    MPI (paradigm-less)
  • MPI is a lowest common denominator upon which
    any paradigm can be imposed.

37
Conclusions
  • N tropy provides
  • Remote Method Invocation (RMI)
  • Distributed Shared-Memory (DSM)
  • Implementation of these paradigms is lean and
    mean
  • Targeted specifically for problem domain
  • This approach successfully enables astrophysics
    data analysis
  • Substantially reduces application development
    time
  • Scales to thousands of processors
  • More Information
  • Go to Wikipedia and search for Ntropy