Title: Lecture 10 - Patterns for Parallel Programming III
1 Lecture 10: Patterns for Parallel Programming III
- John Cavazos
- Dept. of Computer & Information Sciences
- University of Delaware
- www.cis.udel.edu/cavazos/cisc879
2 Lecture 10 Overview
- Cell B.E. Clarification
- Design Patterns for Parallel Programs
- Finding Concurrency
- Algorithmic Structure
- Organize by Tasks
- Organize by Data
- Supporting Structures
3 LS-LS DMA transfer (PPU)
int main() {
    /* N, struct thread_args, and my_spe_thread are defined earlier in the example */
    pthread_t pts[N];
    spe_context_ptr_t spe[N];
    struct thread_args t_args[N];
    int i;
    spe_program_handle_t *program;
    program = spe_image_open("../spu/hello");
    for (i = 0; i < N; i++) {
        spe[i] = spe_context_create(0, NULL);
        spe_program_load(spe[i], program);
        t_args[i].spe = spe[i];
        t_args[i].spuid = i;
        pthread_create(&pts[i], NULL, my_spe_thread, &t_args[i]);
    }
    void *ls = spe_ls_area_get(spe[1]);
    unsigned int mbox_data = (unsigned int)ls;
    printf("mbox_data %x\n", mbox_data);
    int rc;
    rc = spe_in_mbox_write(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    rc = spe_out_intr_mbox_read(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    for (i = 0; i < N; i++)
        rc = spe_in_mbox_write(spe[i], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    for (i = 0; i < N; i++)
        pthread_join(pts[i], NULL);
    spe_image_close(program);
    for (i = 0; i < N; i++)
        spe_context_destroy(spe[i]);
    return 0;
}
6 LS-LS DMA transfer (SPU)
int main(unsigned long long spuid, unsigned long long argp, unsigned long long envp) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("spu %lld t.tv_usec %ld\n", spuid, tv.tv_usec);
    if (spuid == 0) {
        unsigned int ea;
        unsigned int tag = 0;
        unsigned int mask = 1;
        ea = spu_read_in_mbox();
        printf("ea %p\n", (void *)ea);
        mfc_put(&tv, ea + (unsigned int)&tv, sizeof(tv), tag, 1, 0);
        mfc_write_tag_mask(mask);
        mfc_read_tag_status_all();
        spu_write_out_intr_mbox(0);
    }
    spu_read_in_mbox();
    printf("spu %lld tv.tv_usec %ld\n", spuid, tv.tv_usec);
    return 0;
}
7 LS-LS Output
-bash-3.2$ ./a.out
spu 0 t.tv_usec 875360
spu 1 t.tv_usec 876446
spu 2 t.tv_usec 877443
spu 3 t.tv_usec 878459
mbox_data f7764000
ea 0xf7764000
spu 0 tv.tv_usec 875360
spu 1 tv.tv_usec 875360
spu 2 tv.tv_usec 877443
spu 3 tv.tv_usec 878459
8 Organize by Data
- Operations on core data structure
- Geometric Decomposition
- Recursive Data
9 Geometric Decomposition
- Arrays and other linear structures
- Divide into contiguous substructures
- Example: matrix multiply
- A data-centric algorithm on a linear data structure (array) implies geometric decomposition
10 Recursive Data
- Lists, trees, and graphs
- Structures where you would use divide-and-conquer
- May seem that you can only move sequentially through the data structure
- But there are ways to expose concurrency
11 Recursive Data Example
- Find the Root: given a forest of directed trees, find the root of each node
- Parallel approach: for each node, replace its successor with its successor's successor
- Repeat until no changes
- O(log n) vs. O(n)
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
12 Organize by Flow of Data
Organize By Flow of Data:
- Regular -> Pipeline
- Irregular -> Event-Based Coordination
13 Organize by Flow of Data
- Computation can be viewed as a flow of data going through a sequence of stages
- Pipeline: one-way, predictable communication
- Event-Based Coordination: unrestricted, unpredictable communication
14 Pipeline performance
- Concurrency limited by pipeline depth
- Balance computation and communication (architecture dependent)
- Stages should be equally computationally intensive
- Slowest stage creates a bottleneck
- Combine lightly loaded stages or decompose heavily loaded stages
- Time to fill and drain the pipe should be small
15 Supporting Structures
- Single Program Multiple Data (SPMD)
- Loop Parallelism
- Master/Worker
- Fork/Join
16 SPMD Pattern
- Create a single program that runs on each processor
- Initialize
- Obtain a unique identifier
- Run the same program on each processor
- Identifier and input data can differentiate behavior
- Distribute data (if any)
- Finalize
17 SPMD Challenges
- Split the data correctly
- Correctly combine the results
- Achieve an even work distribution
- If the program requires dynamic load balancing, another pattern (e.g., Job Queue) may be more suitable
18 Loop Parallelism Pattern
- Many programs are expressed as iterative constructs (loops)
- Programming models like OpenMP provide pragmas to automatically assign loop iterations to processors
19 Master/Worker Pattern
(Figure: master/worker diagram)
20 Master/Worker Pattern
- Relevant where tasks have no dependencies
- Embarrassingly parallel
- The challenge is determining when the entire problem is complete
21 Fork/Join Pattern
- Parent creates new tasks (fork), then waits until they complete (join)
- Tasks are created dynamically
- Tasks can create more tasks
- Tasks are managed according to their relationships