1
Parallel Application Case Studies
  • Examine Ocean, Barnes-Hut, Ray Tracing, and Data
    Mining
  • Assume cache-coherent shared address space
  • Five parts for each application
  • Sequential algorithms and data structures
  • Partitioning
  • Orchestration
  • Mapping
  • Components of execution time on SGI Origin2000

2
Application Overview
  • Simulating Ocean Currents
  • Regular structure, scientific computing
  • Simulating the Evolution of Galaxies
  • Irregular structure, scientific computing
  • Rendering Scenes by Ray Tracing
  • Irregular structure, computer graphics
  • Data Mining
  • Irregular structure, information processing

3
Case Study 1: Ocean
  • Simulate eddy currents in an ocean basin
  • Problem domain: oceanography
  • Representative of problems in CFD, finite
    differencing, regular grid problems
  • Algorithms used: SOR, nearest-neighbour sweeps
    (see the solver sketch below)
  • General approach: discretize space and time in
    the simulation. Model the ocean basin as a grid of
    points. All important physical properties like
    pressure, velocity etc. have values at each grid
    point
  • Approximations: use several 2-D grids at several
    horizontal cross-sections of the ocean basin.
    Assume the basin is rectangular and grid points
    are equally spaced.
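The heart of the solver phase mentioned above is a nearest-neighbour SOR sweep over a 2-D grid. Below is a minimal sequential sketch of a red-black SOR solver with a five-point stencil; the grid size N, over-relaxation factor OMEGA, tolerance TOL and boundary condition are illustrative values, not taken from the Ocean application, which operates on many such grids per time-step.

#include <stdio.h>
#include <math.h>

#define N 130            /* grid dimension including boundary rows/columns (illustrative) */
#define OMEGA 1.15       /* over-relaxation factor (illustrative) */
#define TOL 1e-3         /* convergence tolerance (illustrative) */

static double grid[N][N];

/* One red-black SOR half-sweep: every interior point of the given colour is
 * replaced by a weighted average of its four nearest neighbours.  Returns
 * the largest change, which the caller uses as a convergence test. */
static double sor_sweep(int colour)
{
    double maxdiff = 0.0;
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            if ((i + j) % 2 != colour)
                continue;
            double newval = (1.0 - OMEGA) * grid[i][j] +
                            OMEGA * 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                            grid[i][j - 1] + grid[i][j + 1]);
            double diff = fabs(newval - grid[i][j]);
            if (diff > maxdiff)
                maxdiff = diff;
            grid[i][j] = newval;
        }
    }
    return maxdiff;
}

int main(void)
{
    for (int j = 0; j < N; j++)          /* toy boundary condition: top edge held at 1.0 */
        grid[0][j] = 1.0;

    double diff;
    int iter = 0;
    do {
        diff = sor_sweep(0);             /* red points   */
        double d = sor_sweep(1);         /* black points */
        if (d > diff) diff = d;
        iter++;
    } while (diff > TOL);

    printf("converged after %d iterations\n", iter);
    return 0;
}

In a parallel version each process would own a block of the grid, and each sweep would be bracketed by communication of border rows/columns and a reduction of the local maxdiff values.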

4
Simulating Ocean Currents
(a) Cross sections
(b) Spatial discretization of a cross section
  • Model as two-dimensional grids
  • Discretize in space and time
  • finer spatial and temporal resolution => greater
    accuracy
  • Many different computations per time step
  • set up and solve equations
  • Concurrency across and within grid computations

5
Ocean
  • Computations in a Time-step

6
Partitioning
  • Exploit data parallelism
  • Function parallelism only to reduce
    synchronization
  • Static partitioning within a grid computation
  • Block versus strip (see the sketch below)
  • inherent communication versus spatial locality in
    communication
  • Load imbalance due to border elements and number
    of boundaries
  • Solver has greater overheads than other
    computations
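To make the block-versus-strip trade-off concrete, the sketch below computes the sub-block of an n x n grid owned by each process under a square block decomposition. The function name my_block and its parameters are illustrative, assuming the process count p is a perfect square; with blocks each process touches roughly 4n/sqrt(p) border elements per sweep, versus roughly 2n with a strip decomposition.

#include <stdio.h>
#include <math.h>

/* Block decomposition of an n x n grid over p processes arranged as a
 * sqrt(p) x sqrt(p) grid of blocks (p assumed to be a perfect square). */
static void my_block(int pid, int p, int n,
                     int *row0, int *row1, int *col0, int *col1)
{
    int q = (int)sqrt((double)p);
    int per = n / q;
    int br = pid / q, bc = pid % q;          /* block coordinates of this process */

    *row0 = br * per;
    *row1 = (br == q - 1) ? n : *row0 + per; /* last row of blocks takes the remainder */
    *col0 = bc * per;
    *col1 = (bc == q - 1) ? n : *col0 + per;
}

int main(void)
{
    for (int pid = 0; pid < 4; pid++) {
        int r0, r1, c0, c1;
        my_block(pid, 4, 1024, &r0, &r1, &c0, &c1);
        printf("process %d owns rows [%d,%d) cols [%d,%d)\n", pid, r0, r1, c0, c1);
    }
    return 0;
}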

7
Orchestration and Mapping
  • Spatial locality: similar to equation solver
  • Except lots of grids, so cache conflicts across
    grids
  • Complex working set hierarchy
  • A few points for near-neighbor reuse, three
    subrows, partition of one grid, partitions of
    multiple grids
  • First three or four most important
  • Large working sets, but data distribution easy
  • Synchronization
  • Barriers between phases and solver sweeps
  • Locks for global variables
  • Lots of work between synchronization events
  • Mapping: easy mapping to 2-d array topology or
    richer

8
Execution Time Breakdown
  • 1030 x 1030 grids with block partitioning on
    32-processor Origin2000
  • 4-d grids much better than 2-d, despite the very
    large caches on the machine
  • data distribution is much more crucial on
    machines with smaller caches
  • Major bottleneck in this configuration is time
    waiting at barriers
  • imbalance in memory stall times as well

9
Case Study 2: Barnes-Hut
  • Simulate the evolution of galaxies
  • Problem domain: astrophysics
  • Representative of hierarchical n-body problems
  • Algorithm used: Barnes-Hut
  • General approach: discretize space and time in
    the simulation. Represent 3D space as an oct-tree
    where each non-leaf node has up to 8 children and
    each leaf node contains one star. For each body,
    traverse the tree starting at the root until a
    node is reached that represents a cell in space
    far enough from the body that its subtree can be
    approximated by a single body (see the traversal
    sketch below).
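The traversal described above can be sketched as follows. This is a minimal illustration rather than the actual Barnes-Hut code: each cell stores the total mass and centre of mass of its subtree, a subtree is approximated by a point mass when the ratio of cell size to distance falls below the accuracy parameter THETA, and tree construction is omitted; the constants and the tiny hand-built tree in main are made up for illustration.

#include <stdio.h>
#include <math.h>

#define G 6.674e-11
#define THETA 0.5        /* opening criterion: cell is "far enough" if size/dist < THETA */

/* A node is either a leaf holding one body or an internal cell whose mass
 * and centre of mass summarise everything below it. */
typedef struct Node {
    double mass;
    double x, y, z;          /* body position or cell centre of mass */
    double size;             /* side length of the cell (0 for a leaf) */
    int nchild;
    struct Node *child[8];
} Node;

/* Accumulate the force on body b by walking the tree from node n.  If a
 * cell is far enough away, its subtree is approximated by a single point
 * mass at its centre of mass; otherwise the cell is opened. */
static void add_force(const Node *n, const Node *b, double f[3])
{
    if (n == b || n->mass == 0.0)
        return;

    double dx = n->x - b->x, dy = n->y - b->y, dz = n->z - b->z;
    double dist = sqrt(dx * dx + dy * dy + dz * dz) + 1e-12;   /* avoid divide by zero */

    if (n->nchild == 0 || n->size / dist < THETA) {
        double f_over_r = G * n->mass * b->mass / (dist * dist * dist);
        f[0] += f_over_r * dx;
        f[1] += f_over_r * dy;
        f[2] += f_over_r * dz;
    } else {
        for (int i = 0; i < n->nchild; i++)
            add_force(n->child[i], b, f);
    }
}

int main(void)
{
    /* Two leaf bodies grouped under one cell, plus a distant body b. */
    Node a    = { 1e24, 0.0, 0.0, 0.0, 0.0, 0, {0} };
    Node c    = { 1e24, 1e7, 0.0, 0.0, 0.0, 0, {0} };
    Node root = { 2e24, 5e6, 0.0, 0.0, 1e7, 2, { &a, &c } };
    Node b    = { 1e3,  1e9, 0.0, 0.0, 0.0, 0, {0} };

    double f[3] = { 0.0, 0.0, 0.0 };
    add_force(&root, &b, f);
    printf("force on b: (%g, %g, %g) N\n", f[0], f[1], f[2]);
    return 0;
}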

10
Simulating Galaxy Evolution
  • Simulate the interactions of many stars evolving
    over time
  • Computing forces is expensive
  • O(n²) brute-force approach
  • Hierarchical methods take advantage of the force
    law: F = G m1 m2 / r²
  • Many time-steps, plenty of concurrency across
    stars within one

11
Barnes-Hut
  • Locality goal
  • Particles close together in space should be on the
    same processor
  • Difficulties: nonuniform, dynamically changing
    body distribution

12
Application Structure
  • Main data structures: arrays of bodies, of cells,
    and of pointers to them
  • Each body/cell has several fields: mass,
    position, pointers to others
  • pointers are assigned to processes

13
Partitioning
  • Decomposition: bodies in most phases, cells in
    computing moments
  • Challenges for assignment
  • Nonuniform body distribution => work and
    communication nonuniform
  • Cannot assign by inspection
  • Distribution changes dynamically across
    time-steps
  • Cannot assign statically
  • Information needs fall off with distance from
    body
  • Partitions should be spatially contiguous for
    locality
  • Different phases have different work
    distributions across bodies
  • No single assignment ideal for all
  • Focus on force calculation phase
  • Communication needs naturally fine-grained and
    irregular

14
Load Balancing
  • Equal particles ≠ equal work.
  • Solution: assign costs to particles based on the
    work they do
  • Work unknown and changes with time-steps
  • Insight: the system evolves slowly
  • Solution: count work per particle, and use it as
    the cost for the next time-step.
  • Powerful technique for evolving physical systems

15
A Partitioning Approach: ORB
  • Orthogonal Recursive Bisection
  • Recursively bisect space into subspaces with
    equal work (see the sketch below)
  • Work is associated with bodies, as before
  • Continue until one partition per processor
  • High overhead for large no. of processors
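A minimal sketch of ORB over an array of bodies follows; the Body fields, the global comparator axis, and the example costs are illustrative choices, not the application's data structures. The cut point is chosen so that each side of the recursion receives work proportional to the number of processes assigned to it, and the bisection axis alternates at each level.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double pos[3]; double work; int owner; } Body;

static int cmp_axis;                       /* axis used by the qsort comparator */
static int by_axis(const void *a, const void *b)
{
    double pa = ((const Body *)a)->pos[cmp_axis];
    double pb = ((const Body *)b)->pos[cmp_axis];
    return (pa > pb) - (pa < pb);
}

/* Orthogonal recursive bisection: sort bodies along one axis, cut where the
 * cumulative work matches the share of processes on the left, and recurse
 * on each half with the next axis, until one partition per process. */
static void orb(Body *b, int n, int first_proc, int nprocs, int axis)
{
    if (nprocs == 1) {
        for (int i = 0; i < n; i++)
            b[i].owner = first_proc;
        return;
    }
    cmp_axis = axis;
    qsort(b, n, sizeof(Body), by_axis);

    int left_procs = nprocs / 2;
    double total = 0.0;
    for (int i = 0; i < n; i++) total += b[i].work;
    double target = total * left_procs / nprocs, seen = 0.0;

    int split = 0;
    while (split < n - 1 && seen + b[split].work <= target)
        seen += b[split++].work;

    orb(b, split, first_proc, left_procs, (axis + 1) % 3);
    orb(b + split, n - split, first_proc + left_procs,
        nprocs - left_procs, (axis + 1) % 3);
}

int main(void)
{
    Body bodies[8];
    for (int i = 0; i < 8; i++) {
        bodies[i].pos[0] = (double)i;
        bodies[i].pos[1] = (double)(i % 3);
        bodies[i].pos[2] = 0.0;
        bodies[i].work   = 1.0 + (i % 2);   /* cost counted in the previous time-step */
    }
    orb(bodies, 8, 0, 4, 0);
    for (int i = 0; i < 8; i++)
        printf("body at x=%.0f -> process %d\n", bodies[i].pos[0], bodies[i].owner);
    return 0;
}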

16
Another Approach: Costzones
  • Insight: the tree already contains an encoding of
    spatial locality.
  • Costzones is low-overhead and very easy to
    program (see the sketch below)
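A minimal sketch of the costzones idea, assuming the bodies appear in the order produced by a traversal of the tree (so a contiguous range of the array is spatially contiguous) and that each body carries the cost measured in the previous time-step; the per-body costs here are made up for illustration.

#include <stdio.h>

#define NPROC 4
#define NBODY 10

int main(void)
{
    /* Per-body costs, in tree-traversal order (illustrative values). */
    double cost[NBODY] = { 3, 1, 2, 5, 1, 1, 4, 2, 2, 3 };

    double total = 0.0;
    for (int i = 0; i < NBODY; i++) total += cost[i];

    /* Each process gets a contiguous zone holding about 1/NPROC of the total
     * cost: a body's owner is the zone its running cost sum falls into. */
    double seen = 0.0;
    for (int i = 0; i < NBODY; i++) {
        int owner = (int)(seen * NPROC / total);
        if (owner >= NPROC) owner = NPROC - 1;
        printf("body %d (cost %.0f) -> process %d\n", i, cost[i], owner);
        seen += cost[i];
    }
    return 0;
}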

17
Performance Comparison
  • Speedups on simulated multiprocessor (16K
    particles)
  • Extra work in ORB partitioning is key difference

18
Orchestration and Mapping
  • Spatial locality: very different from Ocean, as
    are other aspects
  • Data distribution is much more difficult than in
    Ocean
  • Redistribution across time-steps
  • Logical granularity (body/cell) much smaller than
    a page
  • Partitions contiguous in physical space do not
    imply contiguity in the arrays
  • But, good temporal locality, and most misses
    logically non-local anyway
  • Long cache blocks help within body/cell record,
    not entire partition
  • Temporal locality and working sets
  • Important working set scales as (1/θ²) log n
  • Slow growth rate, and fits in second-level
    caches, unlike Ocean
  • Synchronization
  • Barriers between phases
  • No synch within force calculation: data written is
    different from data read
  • Locks in tree-building, pt. to pt. event synch in
    center of mass phase
  • Mapping: ORB maps well to a hypercube, costzones
    to a linear array

19
Execution Time Breakdown
  • 512K bodies on 32-processor Origin2000
  • Static assignment of bodies (quite randomized in
    space) versus costzones
  • Problem with the static case is
    communication/locality, not load balance!

20
Case Study 3: Raytrace
  • Render a 3D scene on a 2D image plane.
  • Problem domain: computer graphics
  • General approach: shoot rays through the pixels
    in an image plane into a 3D scene and trace the
    rays as they bounce around (reflect, refract,
    etc.), computing the color and opacity for the
    corresponding pixels.

21
Rendering Scenes by Ray Tracing
  • Shoot rays into scene through pixels in image
    plane
  • Follow their paths
  • they bounce around as they strike objects
  • they generate new rays: a ray tree per input ray
  • Result is color and opacity for that pixel
  • Parallelism across rays

22
Raytrace
  • Rays shot through pixels in image are called
    primary rays
  • Reflect and refract when they hit objects
  • Recursive process generates a ray tree per primary
    ray (see the sketch below)
  • Hierarchical spatial data structure keeps track
    of primitives in scene
  • Nodes are space cells, leaves have linked list of
    primitives
  • Tradeoffs between execution time and image quality
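The recursive structure that produces a ray tree per primary ray can be sketched as below; this shows only the control flow. hit(), bounce() and the shading constants are stand-ins for real intersection tests against the hierarchical spatial structure and for real material coefficients, and the recursion is cut off at an illustrative maximum depth.

#include <stdio.h>

#define MAX_DEPTH 3                       /* illustrative recursion limit */

typedef struct { double o[3], d[3]; } Ray;

/* Stubs: a real tracer would intersect the ray with the scene's spatial
 * data structure and compute reflected/refracted directions. */
static int hit(const Ray *r)                   { (void)r; return 1; }
static Ray bounce(const Ray *r, int refracted) { Ray s = *r; (void)refracted; return s; }

/* Each ray's value depends on two recursively traced rays, so one primary
 * ray induces a tree of secondary rays. */
static double trace(const Ray *r, int depth)
{
    if (depth > MAX_DEPTH || !hit(r))
        return 0.1;                            /* background / ambient term */

    double local = 0.5;                        /* stand-in for direct shading */
    Ray refl = bounce(r, 0), refr = bounce(r, 1);
    return local + 0.3 * trace(&refl, depth + 1) + 0.2 * trace(&refr, depth + 1);
}

int main(void)
{
    Ray primary = { { 0, 0, 0 }, { 0, 0, 1 } };
    printf("pixel value: %f\n", trace(&primary, 0));
    return 0;
}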

23
Partitioning
  • Scene-oriented approach
  • Partition scene cells, process rays while they
    are in an assigned cell
  • Ray-oriented approach
  • Partition primary rays (pixels), access scene
    data as needed
  • Simpler; used here
  • Need dynamic assignment: use contiguous blocks to
    exploit spatial coherence among neighboring rays,
    plus tiles for task stealing (see the sketch below)

(Figure: a tile is the unit of decomposition and
stealing; a block is the unit of assignment. A 2-D
interleaved (scatter) assignment of tiles could be
used instead.)
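A minimal pthreads sketch of this assignment scheme follows (compile with -pthread). Each process owns one contiguous block of tiles guarded by its own lock; a process that exhausts its block steals tiles from other blocks. The image size, tile size, process count and round-robin stealing order are illustrative choices, and trace_tile is an empty stand-in for the actual tracing of a tile.

#include <stdio.h>
#include <pthread.h>

#define IMG   512        /* image is IMG x IMG pixels (illustrative)     */
#define TILE  16         /* a tile is TILE x TILE pixels: the steal unit */
#define NPROC 4

static int next_tile[NPROC];           /* next unclaimed tile in each block    */
static int block_hi[NPROC];            /* one past the last tile of each block */
static pthread_mutex_t qlock[NPROC];

static void trace_tile(int tile)
{
    int per_row = IMG / TILE;
    int tx = (tile % per_row) * TILE, ty = (tile / per_row) * TILE;
    /* ... shoot primary rays for pixels [tx,tx+TILE) x [ty,ty+TILE) ... */
    (void)tx; (void)ty;
}

/* Claim one tile: first from our own block, then by stealing from others. */
static int get_tile(int pid)
{
    for (int k = 0; k < NPROC; k++) {
        int q = (pid + k) % NPROC;               /* k == 0 is our own block */
        pthread_mutex_lock(&qlock[q]);
        int t = -1;
        if (next_tile[q] < block_hi[q])
            t = next_tile[q]++;
        pthread_mutex_unlock(&qlock[q]);
        if (t >= 0)
            return t;
    }
    return -1;                                   /* every tile has been claimed */
}

static void *worker(void *arg)
{
    int pid = (int)(long)arg, t, done = 0;
    while ((t = get_tile(pid)) >= 0) {
        trace_tile(t);
        done++;
    }
    printf("process %d rendered %d tiles\n", pid, done);
    return NULL;
}

int main(void)
{
    int ntiles = (IMG / TILE) * (IMG / TILE);
    pthread_t th[NPROC];

    for (int p = 0; p < NPROC; p++) {
        next_tile[p] = p * ntiles / NPROC;       /* contiguous block of tiles */
        block_hi[p]  = (p + 1) * ntiles / NPROC;
        pthread_mutex_init(&qlock[p], NULL);
    }
    for (int p = 0; p < NPROC; p++)
        pthread_create(&th[p], NULL, worker, (void *)(long)p);
    for (int p = 0; p < NPROC; p++)              /* the "one barrier at the end" */
        pthread_join(th[p], NULL);
    return 0;
}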
24
Orchestration and Mapping
  • Spatial locality
  • Proper data distribution for ray-oriented
    approach very difficult
  • Dynamically changing, unpredictable access,
    fine-grained access
  • Better spatial locality on image data than on
    scene data
  • Strip partition would do better, but less spatial
    coherence in scene access
  • Temporal locality
  • Working sets much larger and more diffuse than
    Barnes-Hut
  • But still a lot of reuse in modern second-level
    caches
  • SAS program does not replicate in main memory
  • Synchronization
  • One barrier at end, locks on task queues
  • Mapping: natural to 2-d mesh for the image, but
    likely not important

25
Execution Time Breakdown
  • Task stealing clearly very important for load
    balance

26
Case Study 4: Data Mining
  • Identify trends or associations in data gathered
    by a business.
  • Problem domain: database systems.
  • General approach: examine the database and
    determine which sets of k items are found to
    occur together in more than a certain threshold
    of the transactions. Determine association rules
    given these itemsets and their frequency.

27
Basics
  • Itemset: a set of items that occur together in a
    transaction.
  • Large itemset: an itemset that is found to occur
    in more than a threshold fraction of
    transactions.
  • Goal: discover the large itemsets of size k and
    their frequencies of occurrence in the database.
  • Algorithm: determine the large itemsets of size
    one.
  • Given the large itemsets of size n-1, construct a
    candidate list of itemsets of size n.
  • Verify the frequencies of the itemsets in the
    candidate list to discover the large itemsets of
    size n.
  • Continue until size k (see the sketch below)
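A minimal sequential sketch of this level-wise algorithm, representing items A..E as bits so that a transaction or an itemset is a small bit mask; the toy database and the absolute support threshold MIN_SUP are made up for illustration, and candidates of size k are generated here simply by unioning pairs of large (k-1)-itemsets rather than by the prefix-based scheme described later.

#include <stdio.h>

#define NITEMS   5      /* items A..E are bits 0..4 */
#define MAX_SETS 256
#define MIN_SUP  2      /* absolute support threshold (illustrative) */

/* Toy transaction database, one bit mask per transaction (illustrative). */
static unsigned db[] = { 0x13, 0x0B, 0x0F, 0x1A, 0x07, 0x0E };
#define NDB ((int)(sizeof db / sizeof db[0]))

static int popcount(unsigned x) { int c = 0; while (x) { c += x & 1u; x >>= 1; } return c; }

/* Number of transactions that contain every item of the given itemset. */
static int support(unsigned iset)
{
    int c = 0;
    for (int t = 0; t < NDB; t++)
        if ((db[t] & iset) == iset) c++;
    return c;
}

int main(void)
{
    unsigned large[MAX_SETS], cand[MAX_SETS];
    int nlarge = 0;

    /* Large itemsets of size one. */
    for (int i = 0; i < NITEMS; i++)
        if (support(1u << i) >= MIN_SUP)
            large[nlarge++] = 1u << i;

    for (int k = 2; nlarge > 0 && k <= NITEMS; k++) {
        /* Candidates of size k: unions of two large (k-1)-itemsets with k bits. */
        int ncand = 0;
        for (int i = 0; i < nlarge; i++)
            for (int j = i + 1; j < nlarge; j++) {
                unsigned u = large[i] | large[j];
                if (popcount(u) != k) continue;
                int dup = 0;
                for (int m = 0; m < ncand; m++) if (cand[m] == u) dup = 1;
                if (!dup && ncand < MAX_SETS) cand[ncand++] = u;
            }
        /* Verify candidate frequencies against the database. */
        nlarge = 0;
        for (int m = 0; m < ncand; m++)
            if (support(cand[m]) >= MIN_SUP) {
                large[nlarge++] = cand[m];
                printf("large itemset of size %d: ", k);
                for (int b = 0; b < NITEMS; b++)
                    if (cand[m] >> b & 1) putchar('A' + b);
                printf(" (support %d)\n", support(cand[m]));
            }
    }
    return 0;
}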

28
Where's the Parallelism?
  • Examining large itemsets of size n-1 to
    determine candidate sets of size n (1 < n < k)
  • Counting the number of transactions in the
    database that contain each of the candidate
    itemsets.
  • What's the bottleneck?
  • Disk access: the database typically is much larger
    than memory => data has to be accessed from disk
  • As the number of processes increases, the disk can
    become more of a bottleneck.
  • The goal actually is to minimize disk access
    during execution.

29
Example
  • Items in database: A, B, C, D, E
  • Items within a transaction are lexicographically
    sorted, e.g. T1 = A, B, E
  • Items within itemsets are also lexicographically
    sorted.
  • Let the large itemsets of size two be AB, AC, AD,
    BC, BD, CD, DE
  • Candidate list of size three: ABC, ABD, ACD,
    BCD
  • Now search through all transactions to calculate
    frequencies and find the large itemsets of size
    three.

30
Optimizations
  • Store transactions by itemset, e.g. AD: T1,
    T5, T7
  • Exploit equivalence classes Itemsets of size n-1
    that have common (n-2)-sized prefixes form
    equivalence classes.
  • Itemsets of size n can only be formed from the
    equivalence classes of size n-1.
  • Equivalence classes are disjoint.

31
Parallelization (Phase 1)
  • Computing the 1-equivalence classes and large
    itemsets of size 2 is done by partitioning the
    database among processes
  • Each process computes a local frequency value
    for each itemset of size 2.
  • Compute global frequency values from the local
    per-process values (see the reduction sketch
    below).
  • Use global frequency values to compute
    equivalence classes.
  • Transform the database format from
    per-transaction to per-itemset format.
  • Each process does a partial transformation for
    its partition of the database
  • Communicate transformations to other processes
    to form a complete transformation.
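A minimal sequential sketch of the local-count-plus-reduction step for size-2 itemsets, reusing the bit-mask transaction encoding of the Apriori sketch above. The outer loop over processes simulates what each process would compute on its own contiguous partition of the database, and the support threshold of 2 is illustrative.

#include <stdio.h>

#define NPROC  2
#define NITEMS 5        /* items A..E are bits 0..4 */

/* Toy transaction database, one bit mask per transaction (illustrative). */
static unsigned db[] = { 0x13, 0x0B, 0x0F, 0x1A, 0x07, 0x0E };
#define NDB ((int)(sizeof db / sizeof db[0]))

int main(void)
{
    int local[NPROC][NITEMS][NITEMS] = {{{0}}};
    int global[NITEMS][NITEMS] = {{0}};

    /* Each "process" scans only its contiguous share of the transactions
     * and counts how often every pair of items occurs together. */
    for (int p = 0; p < NPROC; p++) {
        int lo = p * NDB / NPROC, hi = (p + 1) * NDB / NPROC;
        for (int t = lo; t < hi; t++)
            for (int i = 0; i < NITEMS; i++)
                for (int j = i + 1; j < NITEMS; j++)
                    if ((db[t] >> i & 1) && (db[t] >> j & 1))
                        local[p][i][j]++;
    }

    /* Reduction: the global frequency of each pair is the sum of the local
     * counts; pairs above the threshold are the large itemsets of size 2. */
    for (int i = 0; i < NITEMS; i++)
        for (int j = i + 1; j < NITEMS; j++) {
            for (int p = 0; p < NPROC; p++)
                global[i][j] += local[p][i][j];
            if (global[i][j] >= 2)
                printf("large pair (%c,%c), frequency %d\n",
                       'A' + i, 'A' + j, global[i][j]);
        }
    return 0;
}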

32
Parallelization (Phase 2)
  • Partitioning: divide the 1-equivalence classes
    among processes.
  • Disk accesses: use the local disk as far as
    possible to store subsequently generated
    equivalence classes.
  • Load balancing: static distribution (maybe with
    some heuristics) with possible task stealing if
    required.
  • Communication and synchronization overhead: none
    unless task stealing is required.
  • Remote disk accesses: none (unless task stealing
    is required).
  • Spatial locality: ensured by lexicographic
    ordering.
  • Temporal locality: ensured by processing one
    equivalence class at a time.