1
Lecture 10 Patterns for Parallel Programming III
  • John Cavazos
  • Dept of Computer Information Sciences
  • University of Delaware
  • www.cis.udel.edu/cavazos/cisc879

2
Lecture 10 Overview
  • Cell B.E. Clarification
  • Design Patterns for Parallel Programs
  • Finding Concurrency
  • Algorithmic Structure
  • Organize by Tasks
  • Organize by Data
  • Supporting Structures

3
LS-LS DMA transfer (PPU)
int main() {
  pthread_t pts[N];
  spe_context_ptr_t spe[N];
  struct thread_args t_args[N];
  int i;
  spe_program_handle_t *program;
  program = spe_image_open("../spu/hello");
  /* Create a context and a pthread for each SPE */
  for (i = 0; i < N; i++) {
    spe[i] = spe_context_create(0, NULL);
    spe_program_load(spe[i], program);
    t_args[i].spe = spe[i];
    t_args[i].spuid = i;
    pthread_create(&pts[i], NULL, my_spe_thread, &t_args[i]);
  }
  /* Effective address of SPE 1's local store area */
  void *ls = spe_ls_area_get(spe[1]);
  unsigned int mbox_data = (unsigned int)ls;
  printf("mbox_data %x\n", mbox_data);
  int rc;
  /* Send SPE 1's LS address to SPE 0, then wait for SPE 0
     to signal that its DMA has completed */
  rc = spe_in_mbox_write(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
  rc = spe_out_intr_mbox_read(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
  /* Release every SPE so it prints its (possibly overwritten) timestamp */
  for (i = 0; i < N; i++)
    rc = spe_in_mbox_write(spe[i], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
  for (i = 0; i < N; i++)
    pthread_join(pts[i], NULL);
  spe_image_close(program);
  for (i = 0; i < N; i++)
    spe_context_destroy(spe[i]);
  return 0;
}
6
LS-LS DMA transfer (SPU)
int main() {
  /* Record and print this SPU's start time */
  gettimeofday(&tv, NULL);
  printf("spu %lld t.tv_usec %ld\n", spuid, tv.tv_usec);
  if (spuid == 0) {
    unsigned int ea;
    unsigned int tag = 0;
    unsigned int mask = 1;
    /* Receive the base address of SPE 1's local store from the PPU */
    ea = spu_read_in_mbox();
    printf("ea %p\n", (void *)ea);
    /* DMA this SPU's tv into the same offset within SPE 1's local store */
    mfc_put(&tv, ea + (unsigned int)&tv, sizeof(tv), tag, 1, 0);
    mfc_write_tag_mask(mask);
    mfc_read_tag_status_all();  /* wait for the DMA to complete */
    spu_write_out_intr_mbox(0); /* signal the PPU */
  }
  spu_read_in_mbox();           /* wait for the PPU's go-ahead */
  printf("spu %lld tv.tv_usec %ld\n", spuid, tv.tv_usec);
  return 0;
}
7
LS-LS Output
-bash-3.2$ ./a.out
spu 0 t.tv_usec 875360
spu 1 t.tv_usec 876446
spu 2 t.tv_usec 877443
spu 3 t.tv_usec 878459
mbox_data f7764000
ea 0xf7764000
spu 0 tv.tv_usec 875360
spu 1 tv.tv_usec 875360
spu 2 tv.tv_usec 877443
spu 3 tv.tv_usec 878459
(Note that SPE 1 now reports SPE 0's timestamp, 875360, because SPE 0
DMA'd its tv into SPE 1's local store.)
8
Organize by Data
  • Operations on core data structure
  • Geometric Decomposition
  • Recursive Data

9
Geometric Decomposition
  • Arrays and other linear structures
  • Divide into contiguous substructures
  • Example: matrix multiply (see the sketch below)
  • A data-centric algorithm on a linear data structure
    (array) implies geometric decomposition
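
As a concrete illustration (not on the original slide), here is a minimal
pthreads sketch of geometric decomposition: each worker is assigned a
contiguous band of rows of C in the matrix multiply C = A x B. The worker
count NWORKERS, the band struct, and the driver in main are illustrative
choices.

  /* Sketch only: block-row matrix multiply, one contiguous band per worker */
  #include <pthread.h>
  #define N 512
  #define NWORKERS 4                /* illustrative worker count */
  double A[N][N], B[N][N], C[N][N];

  struct band { int row_lo, row_hi; };

  void *multiply_band(void *arg) {
    struct band *b = (struct band *)arg;
    for (int i = b->row_lo; i < b->row_hi; i++)
      for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
          sum += A[i][k] * B[k][j];
        C[i][j] = sum;
      }
    return NULL;
  }

  int main(void) {
    pthread_t t[NWORKERS];
    struct band bands[NWORKERS];
    for (int w = 0; w < NWORKERS; w++) {
      bands[w].row_lo = w * N / NWORKERS;       /* contiguous substructure */
      bands[w].row_hi = (w + 1) * N / NWORKERS;
      pthread_create(&t[w], NULL, multiply_band, &bands[w]);
    }
    for (int w = 0; w < NWORKERS; w++)
      pthread_join(t[w], NULL);
    return 0;
  }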

10
Recursive Data
  • Lists, trees, and graphs
  • Structures where you would use divide-and-conquer
  • It may seem that you can only move sequentially
    through the data structure
  • But there are ways to expose concurrency

11
Recursive Data Example
  • Find the Root: given a forest of directed trees,
    find the root of each node
  • Parallel approach: for each node, replace its
    successor with its successor's successor
    (pointer jumping; see the sketch below)
  • Repeat until no changes
  • O(log n) steps vs. O(n) sequentially

Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
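
A minimal sketch of the find-the-root step, assuming nodes are array
indices and succ[i] holds node i's successor (with roots pointing to
themselves); the inner loop stands in for the parallel "for each node"
step, and the number of passes is where the O(log n) bound comes from.

  /* Sketch only: find the root of every node by pointer jumping.      */
  /* succ[i] is node i's successor; a root satisfies succ[i] == i.     */
  #include <string.h>
  #define NODES 8                   /* illustrative forest size */

  void find_roots(int succ[NODES]) {
    int next[NODES];
    int changed = 1;
    while (changed) {               /* O(log n) passes in total */
      changed = 0;
      for (int i = 0; i < NODES; i++) {   /* parallel step in spirit */
        next[i] = succ[succ[i]];          /* jump to successor's successor */
        if (next[i] != succ[i])
          changed = 1;
      }
      memcpy(succ, next, sizeof(next));
    }
  }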
12
Organize by Flow of Data
(Decision tree: Organize By Flow of Data splits into Regular, leading to
the Pipeline pattern, and Irregular, leading to Event-Based Coordination.)
13
Organize by Flow of Data
  • Computation can be viewed as a flow of data going
    through a sequence of stages
  • Pipeline: one-way, predictable communication
  • Event-based coordination: unrestricted,
    unpredictable communication

14
Pipeline Performance
  • Concurrency limited by pipeline depth
  • Balance computation and communication
    (architecture dependent)
  • Stages should be equally computationally
    intensive
  • The slowest stage creates a bottleneck
  • Combine lightly loaded stages or decompose
    heavily loaded stages
  • Time to fill and drain the pipe should be small
    (see the timing sketch below)
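
A back-of-the-envelope timing model (my illustration, not from the slide):
with k stages and the slowest stage taking t_max per item, n items take
roughly (k - 1 + n) * t_max. The (k - 1) term is the fill/drain cost, and
the factor t_max shows why the slowest stage is the bottleneck.

  /* Sketch only: idealized pipeline timing model */
  #include <stdio.h>

  /* n items through k stages, slowest stage time t_max:
     total ~= (k - 1 + n) * t_max; (k - 1) is the fill/drain cost */
  double pipeline_time(int k, long n, double t_max) {
    return (k - 1 + n) * t_max;
  }

  int main(void) {
    /* 4 stages, slowest takes 2 ms: fill/drain is negligible for large n */
    printf("n=10:    %.1f ms\n", pipeline_time(4, 10, 2.0));    /* 26.0 */
    printf("n=10000: %.1f ms\n", pipeline_time(4, 10000, 2.0)); /* 20006.0 */
    return 0;
  }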

15
Supporting Structures
  • Single Program Multiple Data (SPMD)
  • Loop Parallelism
  • Master/Worker
  • Fork/Join

16
SPMD Pattern
  • Create a single program that runs on each
    processor (see the MPI-style sketch below)
  • Initialize
  • Obtain a unique identifier
  • Run the same program on each processor
  • The identifier and input data can differentiate
    behavior
  • Distribute data (if any)
  • Finalize

Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
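
A minimal SPMD sketch, here using MPI as one concrete programming model
(the slide does not name one): every rank runs the same program, and the
rank obtained from MPI_Comm_rank differentiates behavior and slices the
data.

  /* Sketch only: SPMD sum of 0..99 across ranks, assuming an MPI install */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                  /* initialize */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* obtain a unique identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The identifier differentiates behavior: each rank sums its slice */
    long local = 0, total = 0;
    for (int i = rank; i < 100; i += size)
      local += i;

    /* Correctly combine results */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
      printf("total = %ld\n", total);        /* 4950 */

    MPI_Finalize();                          /* finalize */
    return 0;
  }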
17
SPMD Challenges
  • Split data correctly
  • Correctly combine results
  • Achieve even work distribution
  • If programs require dynamic load balancing,
    another pattern may be more suitable (Job Queue)

Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
18
Loop Parallelism Pattern
  • Many programs are expressed as iterative constructs
  • Programming models like OpenMP provide pragmas that
    automatically assign loop iterations to processors
    (see the sketch below)

Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
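
A minimal OpenMP sketch of the pattern: a single pragma asks the runtime
to assign the loop's iterations to processors. Array sizes and contents
are illustrative.

  /* Sketch only: OpenMP loop parallelism (compile with -fopenmp) */
  #include <stdio.h>
  #define N 1000000

  int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++)
      b[i] = i;

    /* Iterations are divided among the available processors */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      a[i] = 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
  }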
19
Master/Worker Pattern
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
20
Master/Worker Pattern
  • Relevant where tasks have no dependencies
  • Embarrassingly parallel
  • The hard part is determining when the entire
    problem is complete (see the sketch below)

Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
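
A minimal master/worker sketch (illustrative, not from the slide): workers
repeatedly claim the next task index from a shared counter, and the master
answers the "is the whole problem complete?" question by joining all
workers once the counter runs out.

  /* Sketch only: master hands out independent tasks via a shared counter */
  #include <pthread.h>
  #include <stdio.h>
  #define NTASKS 100
  #define NWORKERS 4

  static int next_task = 0;         /* the "queue": next unclaimed task */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static double results[NTASKS];

  void *worker(void *arg) {
    (void)arg;
    for (;;) {
      pthread_mutex_lock(&lock);
      int t = next_task++;          /* claim a task */
      pthread_mutex_unlock(&lock);
      if (t >= NTASKS)
        return NULL;                /* no tasks left */
      results[t] = t * t;           /* independent work, no dependencies */
    }
  }

  int main(void) {
    pthread_t w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
      pthread_create(&w[i], NULL, worker, NULL);
    /* Completion detection: all workers exited, so all tasks are done */
    for (int i = 0; i < NWORKERS; i++)
      pthread_join(w[i], NULL);
    printf("results[99] = %f\n", results[99]);
    return 0;
  }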
21
Fork/Join Pattern
  • The parent creates new tasks (fork), then waits
    until they complete (join)
  • Tasks are created dynamically
  • Tasks can create more tasks
  • Tasks are managed according to their relationships
    (see the sketch below)

Slide Source: Dr. Rabbah, IBM, MIT Course 6.189
IAP 2007
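
A minimal fork/join sketch (illustrative): each call forks a child task
for half of the range, recurses on the other half, then joins the child,
so tasks dynamically create more tasks as the recursion deepens.

  /* Sketch only: fork/join recursive array sum with pthreads */
  #include <pthread.h>
  #include <stdio.h>

  struct range { const long *a; long lo, hi, sum; };

  void *sum_range(void *arg) {
    struct range *r = (struct range *)arg;
    if (r->hi - r->lo <= 1000) {    /* small enough: sum serially */
      r->sum = 0;
      for (long i = r->lo; i < r->hi; i++)
        r->sum += r->a[i];
      return NULL;
    }
    long mid = (r->lo + r->hi) / 2;
    struct range left  = { r->a, r->lo, mid, 0 };
    struct range right = { r->a, mid, r->hi, 0 };
    pthread_t child;
    pthread_create(&child, NULL, sum_range, &left);  /* fork */
    sum_range(&right);              /* parent keeps half the work */
    pthread_join(child, NULL);      /* join */
    r->sum = left.sum + right.sum;
    return NULL;
  }

  int main(void) {
    static long a[100000];
    for (long i = 0; i < 100000; i++)
      a[i] = 1;
    struct range all = { a, 0, 100000, 0 };
    sum_range(&all);
    printf("sum = %ld\n", all.sum); /* 100000 */
    return 0;
  }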