Title: Experiencing Cluster Computing
1Experiencing Cluster Computing
2Introduction to Parallelism
3Outline
- Why Parallelism
- Types of Parallelism
- Drawbacks
- Concepts
- Starting Parallelization
- Simple Example
4Why Parallelism
5Why Parallelism Passively
- Suppose you are already using the most efficient algorithm with an optimal implementation, and the program still takes too long or does not even fit onto your machine.
- Parallelization is then the last resort.
6Why Parallelism Proactively
- Faster
- Finish the work earlier
- Same work in shorter time
- Do more work
- More work in the same time
- Most importantly, you want to predict the result
before the event occurs
7Examples
- Many scientific and engineering problems require enormous computational power. A few fields to mention:
- Quantum chemistry, statistical mechanics, and relativistic physics
- Cosmology and astrophysics
- Computational fluid dynamics and turbulence
- Material design and superconductivity
- Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
- Medicine, and modeling of human organs and bones
- Global weather and environmental modeling
- Machine vision
8Parallelism
- The upper bound for the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.
- The upper bound for the computing power available can be dramatically increased by integrating a set of processors together.
- Synchronization and exchange of partial results among the processors are therefore unavoidable.
9Computer Architecture
4 categories:
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data
10Computer Architecture
11Multiprocessing / Clustering
Parallel Computer Architecture:
- Shared Memory: Symmetric multiprocessors (SMP)
- Distributed Memory: Cluster
12Types of Parallelism
13Parallel Programming Paradigm
- Multithreading
- OpenMP
- Shared memory only
- Message Passing
- MPI (Message Passing Interface)
- PVM (Parallel Virtual Machine)
- Shared memory or distributed memory
14Threads
- In computer programming, a thread is placeholder information associated with a single use of a program that can handle multiple concurrent users.
- From the program's point of view, a thread is the information needed to serve one individual user or one particular service request.
- If multiple users are using the program, or concurrent requests from other programs occur, a thread is created and maintained for each of them.
- The thread allows a program to know which user is being served as the program is alternately re-entered on behalf of different users (see the sketch below).
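A minimal sketch of this "one thread per request" idea using POSIX threads. It is not taken from the original slides; the request identifiers and the serve_request handler are hypothetical names used only for illustration.

/* One thread is created per concurrent request; each thread carries
 * the per-request state it needs (here just an id). */
#include <pthread.h>
#include <stdio.h>

void *serve_request(void *arg)
{
    int request_id = *(int *)arg;            /* state kept per thread */
    printf("serving request %d\n", request_id);
    return NULL;
}

int main(void)
{
    pthread_t threads[3];
    int ids[3] = {0, 1, 2};

    /* a thread is created and maintained for each concurrent request */
    for (int i = 0; i < 3; i++)
        pthread_create(&threads[i], NULL, serve_request, &ids[i]);

    for (int i = 0; i < 3; i++)
        pthread_join(threads[i], NULL);
    return 0;
}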
15Threads
- Programmer's view
- Single CPU
- Single block of memory
- Several threads of action
- Parallelization
- Done by the compiler
Fork-Join Model
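A minimal sketch of the fork-join model using OpenMP (an illustration assuming an OpenMP-capable compiler, e.g. gcc -fopenmp; it is not code from the original slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("sequential part: 1 thread\n");    /* master thread only */

#pragma omp parallel                           /* fork: a team of threads */
    {
        printf("parallel part: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                          /* join: back to 1 thread */

    printf("sequential part again: 1 thread\n");
    return 0;
}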
16Shared Memory
- Programmer's view
- Several CPUs
- Single block of memory
- Several threads of action
- Parallelization
- Done by the compiler
- Example
- OpenMP
17Multithreaded Parallelization
[Figure: single-threaded vs. multi-threaded execution of processes P1, P2, P3 over time; in the multi-threaded case the threads exchange data via shared memory]
18Distributed Memory
- Programmer's view
- Several CPUs
- Several blocks of memory
- Several threads of action
- Parallelization
- Done by hand
- Example
- MPI
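A minimal MPI sketch (not from the original slides) of the programmer's view above: every process owns its own memory and its own rank, and any data exchange has to be coded by hand, here with MPI_Bcast.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, np, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (rank == 0)
        value = 42;                   /* only process 0 has the data at first */

    /* explicit data exchange: broadcast from process 0 to all others */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d of %d sees value %d\n", rank, np, value);

    MPI_Finalize();
    return 0;
}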
19Drawbacks
20Drawbacks of Parallelism
- Traps
- Deadlocks
- Process Synchronization
- Programming Effort
- Few tools support automated parallelization and debugging
- Task Distribution (Load balancing)
21Deadlock
- The earliest computer operating systems ran only one program at a time.
- All of the resources of the system were available to this one program.
- Later, operating systems ran multiple programs at once, interleaving them.
- Programs were required to specify in advance what resources they needed, so that they could avoid conflicts with other programs running at the same time.
- Eventually some operating systems offered dynamic allocation of resources: programs could request further allocations of resources after they had begun running. This led to the problem of deadlock.
22Deadlock
- Parallel tasks require resources to accomplish their work. If the resources are not available, the work cannot be finished. Each resource can be locked (controlled) by exactly one task at any given point in time.
- Consider the following situation:
- Two tasks need the same two resources.
- Each task manages to gain control over just one resource, but not the other.
- Neither task releases the resource that it already holds.
- This is called a deadlock, and the program will not terminate (a code sketch follows below).
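A sketch of exactly this two-task / two-resource situation, using POSIX mutexes as the "resources" (the names are illustrative only, not from the original slides); running it will typically hang.

#include <pthread.h>
#include <unistd.h>

pthread_mutex_t resource_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t resource_b = PTHREAD_MUTEX_INITIALIZER;

void *task1(void *arg)
{
    pthread_mutex_lock(&resource_a);   /* task 1 holds A ...            */
    sleep(1);
    pthread_mutex_lock(&resource_b);   /* ... and waits for B           */
    pthread_mutex_unlock(&resource_b);
    pthread_mutex_unlock(&resource_a);
    return NULL;
}

void *task2(void *arg)
{
    pthread_mutex_lock(&resource_b);   /* task 2 holds B ...            */
    sleep(1);
    pthread_mutex_lock(&resource_a);   /* ... and waits for A: deadlock */
    pthread_mutex_unlock(&resource_a);
    pthread_mutex_unlock(&resource_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Making both tasks acquire the resources in the same global order (always A before B) would remove the deadlock.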
23Deadlock
[Figure: two tasks, each holding one resource and waiting for the other]
24Dining Philosophers
- Each philosopher either thinks or eats.
- In order to eat, he requires two forks.
- Each philosopher tries to pick up the right fork first.
- If successful, he waits for the left fork to become available.
- ⇒ Deadlock
25Dining Philosophers Demo
- Problem
- http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/deadlock/Diners.htm
- Solution
- http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/deadlock/FixedDiners.htm
26Concepts
27Speedup
- Given a fixed problem size.
- TS = sequential wall-clock execution time (in seconds)
- TN = parallel wall-clock execution time using N processors (in seconds)
- Speedup S(N) = TS / TN
- Ideally, speedup = N ⇒ linear speedup
28Speedup
- Absolute Speedup
- Sequential time on 1 processor / parallel time on N processors
- Relative Speedup
- Parallel time on 1 processor / parallel time on N processors
- They differ because the parallel code run on 1 processor still carries unnecessary MPI overhead
- It may therefore be slower than the sequential code on 1 processor
29Parallel Efficiency
- Efficiency is a measure of processor utilization in a parallel program, relative to the serial program.
- Parallel Efficiency = speedup per processor: E(N) = S(N) / N
- Ideally, E(N) = 1 (see the sketch below).
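A small sketch (not from the original slides) that plugs made-up timings into these definitions:

#include <stdio.h>

int main(void)
{
    double t_serial   = 100.0;  /* T_S: serial run time in seconds (made up) */
    double t_parallel = 30.0;   /* T_N: parallel run time on N processors    */
    int    n = 4;               /* N: number of processors                   */

    double speedup    = t_serial / t_parallel;   /* S(N) = T_S / T_N */
    double efficiency = speedup / n;             /* E(N) = S(N) / N  */

    printf("speedup    S(%d) = %.2f\n", n, speedup);
    printf("efficiency E(%d) = %.2f\n", n, efficiency);
    return 0;
}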
30Amdahl's Law
- Amdahl's Law states that the potential program speedup is defined by the fraction of code (f) that can be parallelized.
- If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, f = 1 and the speedup is infinite (in theory).
31Amdahl's Law
Introducing the number of processors performing the parallel fraction of the work, the relationship can be modeled by the equation
Speedup = 1 / (S + P/N)
where P = parallel fraction, S = serial fraction (P + S = 1), and N = number of processors.
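A minimal sketch (not part of the original slides) that evaluates this formula for a growing processor count; the 95% parallel fraction is chosen to match the example two slides below.

#include <stdio.h>

int main(void)
{
    double p = 0.95;             /* parallel fraction of the code */
    double s = 1.0 - p;          /* serial fraction               */

    /* speedup = 1 / (S + P/N) for N = 1, 2, 4, ..., 1024 */
    for (int n = 1; n <= 1024; n *= 2) {
        double speedup = 1.0 / (s + p / n);
        printf("N = %4d  speedup = %6.2f\n", n, speedup);
    }
    return 0;
}

The printed speedup saturates near 1/S = 20, which is the point of the next slide.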
32Amdahl's Law
- When N → ∞, Speedup → 1/S
- Interpretation
- No matter how many processors are used, the upper bound for the speedup is determined by the sequential section.
33Amdahl's Law Example
- If the sequential section of a program amounts to 5% of the run time, then S = 0.05 and hence the maximum speedup is 1/S = 20.
34Behind Amdahl's Law
- How much faster can a given problem be solved?
- Which problem size can be solved on a parallel
machine in the same time as on a sequential one?
(Scalability)
35Starting Parallelization
36Parallelization Option 1
- Starting from an existing, sequential program
- Easy on shared memory architectures (OpenMP)
- Potentially adequate for a small number of processes (moderate speed-up)
- Does not scale to a large number of processes
- Restricted to trivially parallel problems on distributed memory machines
37Parallelization Option 2
- Starting from scratch
- Not popular, but often inevitable
- Needs new program design
- Increased complexity (data distribution)
- Widely applicable
- Often the best choice for large scale problems
38Goals for Parallelization
- Avoid or reduce
- synchronization
- communication
- Try to maximize computationally intensive sections.
39Simple Example
40Summation
- Given an N-dimensional vector of type integer.
- // Initialization //
- for (int i = 0; i < len; i++)
-   vec[i] = i * i;
- // Sum Calculation //
- for (int i = 0; i < len; i++)
-   sum += vec[i];
41Parallel Algorithm
- Divide the vector into parts
- Each CPU initializes its own part
- Use a global reduction to calculate the sum of the vector
42OpenMP
- Compiler directives (#pragma omp) are inserted to tell the compiler to perform parallelization.
- The compiler is then responsible for automatically parallelizing certain types of loops.
- #pragma omp parallel for
- for (int i = 0; i < len; i++)
-   vec[i] = i * i;
- #pragma omp parallel for reduction(+: sum)
- for (int i = 0; i < len; i++)
-   sum += vec[i];
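For reference, a complete, compilable version of this OpenMP summation (a sketch: the vector length is an arbitrary choice, and an OpenMP-enabled compiler such as gcc -fopenmp is assumed):

#include <omp.h>
#include <stdio.h>

#define LEN 1000

int main(void)
{
    int vec[LEN];
    int sum = 0;

    /* initialization: loop iterations are divided among the threads */
#pragma omp parallel for
    for (int i = 0; i < LEN; i++)
        vec[i] = i * i;

    /* sum calculation: each thread keeps a private partial sum,
       and the partial sums are combined by the reduction clause */
#pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < LEN; i++)
        sum += vec[i];

    printf("sum = %d\n", sum);
    return 0;
}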
43MPI
[Figure: the vector vec is distributed across the processes]
- // in each process, do the initialization
- for (int i = rank; i < len; i += np)
-   vec[i] = i * i;
- // calculate the local sum
- for (int i = rank; i < len; i += np)
-   localsum += vec[i];
- // perform global reduction
- MPI_Reduce(&localsum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
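For reference, a complete, compilable version of this MPI summation (a sketch, assuming a cyclic distribution of the vector indices over the processes and the usual mpicc / mpirun toolchain):

#include <mpi.h>
#include <stdio.h>

#define LEN 1000

int main(int argc, char *argv[])
{
    int vec[LEN];
    int localsum = 0, sum = 0;
    int rank, np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* in each process, initialize only the locally owned elements */
    for (int i = rank; i < LEN; i += np)
        vec[i] = i * i;

    /* calculate the local partial sum over the same elements */
    for (int i = rank; i < LEN; i += np)
        localsum += vec[i];

    /* perform the global reduction onto process 0 */
    MPI_Reduce(&localsum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}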
44END