1
High-Performance Grid Computing and Research
Networking
Introduction to High Performance Computing
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/sadjadi/Teaching/
sadjadi at cs dot fiu dot edu
2
Acknowledgements
  • The content of many of the slides in these lecture
    notes has been adapted from online resources
    prepared previously by the people listed below.
    Many thanks!
  • Henri Casanova
  • Principles of High Performance Computing
  • http://navet.ics.hawaii.edu/casanova
  • henric@hawaii.edu
  • Ligang He
  • http://www.dcs.warwick.ac.uk/liganghe
  • Email: liganghe@dcs.warwick.ac.uk
  • Kai Wang
  • Department of Computer Science
  • University of South Dakota
  • http://www.usd.edu/Kai.Wang
  • Kyril Faenov
  • Director of High Performance Computing
  • Windows Server Group
  • Andrew Tanenbaum

3
Agenda
  • HPC Introduction
  • HPC Applications
  • HPC Goals
  • Concurrency
  • History

4
High Performance Computing
  • Difficult to define - it's a moving target.
  • In the 1980s:
  • a supercomputer performed 100 Mega FLOPS
  • FLOPS = FLoating point Operations Per Second
  • Today:
  • a 2 GHz desktop/laptop performs a few Giga FLOPS
  • a supercomputer performs tens of Tera FLOPS
    (Top500)
  • High Performance Computing: loosely, on the order
    of 1000 times more powerful than the latest
    desktops

5
Units of Measure in HPC
  • High Performance Computing (HPC) units are:
  • Flop: floating point operation
  • Flop/s: floating point operations per second
  • Bytes: size of data (a double-precision floating
    point number is 8 bytes)
  • Typical sizes are millions, billions, trillions
    (a small example follows below)
  • Mega: Mflop/s = 10^6 flop/sec, Mbyte = 10^6 bytes
  • (also 2^20 = 1,048,576)
  • Giga: Gflop/s = 10^9 flop/sec, Gbyte = 10^9 bytes
  • (also 2^30 = 1,073,741,824)
  • Tera: Tflop/s = 10^12 flop/sec, Tbyte = 10^12 bytes
  • (also 2^40 = 1,099,511,627,776)
  • Peta: Pflop/s = 10^15 flop/sec, Pbyte = 10^15 bytes
  • (also 2^50 = 1,125,899,906,842,624)
  • Exa: Eflop/s = 10^18 flop/sec, Ebyte = 10^18 bytes
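
A minimal Python sketch (not from the slides) of working with these units; the machine speed and job size below are made-up values:

```python
# Minimal sketch: decimal HPC prefixes and a made-up machine/job size.
PREFIX = {"Mega": 1e6, "Giga": 1e9, "Tera": 1e12, "Peta": 1e15, "Exa": 1e18}

machine_rate = 10 * PREFIX["Tera"]   # assumed: a 10 Tflop/s supercomputer
job_size     = 5  * PREFIX["Peta"]   # assumed: a job needing 5 Pflop in total

print(f"run time: {job_size / machine_rate:.0f} seconds")    # 500 seconds

# Decimal vs binary storage prefixes: 1 GB = 10**9 bytes, 1 GiB = 2**30 bytes.
print(f"1 GiB / 1 GB = {2**30 / 1e9:.3f}")                   # about 1.074
```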

6
Metric Units
  • The principal metric prefixes.

7
High Performance Computing
  • HPC
  • The term high performance computing (HPC) refers
    to the use of (parallel) supercomputers and
    computer clusters, that is, computing systems
    comprised of multiple (usually mass-produced)
    processors linked together in a single system
    with commercially available interconnects.
  • Wikipedia
  • This is in contrast to mainframe computers,
    which are generally monolithic in nature.
  • Wikipedia

8
High Performance Computing
  • HPC
  • The more current and evolving definition of HPC
    refers to High Productivity Computing, and
    reflects the purpose and use model of the myriad
    of existing and evolving architectures, and the
    supporting ecosystem of software, middleware,
    storage, networking and tools behind the next
    generation of applications.
  • Wikipedia
  • Parallel Computing
  • Computing on parallel computers
  • Supercomputing
  • Computing on Top500-class machines

9
High Performance Computing
  • The definition that we use in this course:
  • How do we make computers compute bigger
    problems faster?
  • Three main issues
  • Hardware: How do we build faster computers?
  • Software: How do we write faster programs?
  • Hardware and Software: How do they interact?
  • Many perspectives
  • architecture
  • systems
  • programming
  • modeling and analysis
  • simulation
  • algorithms and complexity

10
High Performance Computing
  • HPC Related Technologies
  • HPC is an all-encompassing term for related
    technologies that continually push computing
    boundaries.
  • Computer architecture
  • CPU, memory, VLSI
  • Compilers
  • Identify inefficient implementations
  • Make use of the characteristics of the computer
    architecture
  • Choose suitable compiler for a certain
    architecture
  • Algorithms (for parallel and distributed systems)
  • How to program on parallel and distributed
    systems
  • Middleware
  • From Grid computing technology
  • Application -> middleware -> operating system
  • Resource discovery and sharing

11
High Performance Computing
  • The key technique for making computers compute
    bigger problems faster is to use multiple
    computers at once
  • Later in this lecture, we will learn why!
  • This is called parallelism
  • "It takes 1000 hours for this program to run on
    one computer!"
  • "Well, if I use 100 computers, maybe it will take
    only 10 hours?!"
  • "This computer can only handle a dataset that's
    2GB!"
  • "So maybe if I use 100 computers I can deal with a
    200GB dataset?!"
  • We will spend enough time to learn and experience
    different flavors of parallel computing
  • shared-memory parallelism
  • distributed-memory parallelism
  • hybrid parallelism

12
Agenda
  • HPC Introduction
  • HPC Applications
  • HPC Goals
  • Concurrency
  • History

13
Words of Wisdom
  • "Four or five computers should be enough for the
    entire world until the year 2000."
  • T.J. Watson, Chairman of IBM, 1945.
  • "640KB of memory ought to be enough for
    anybody."
  • Bill Gates, Chairman of Microsoft, 1981.
  • You may laugh at their vision today, but...
  • Lesson learned: Don't be too visionary and try to
    make things work! :)
  • We now know this was not quite true!
  • Games
  • Digital video/images
  • Databases
  • Operating systems
  • But the first people to really need more
    computing oomph were scientists
  • And they go way back

14
Evolution of Science
  • Traditional scientific and engineering approach:
  • Do theory or paper design
  • Perform experiments or build a system
  • Limitations
  • Too difficult -- build large wind tunnels
  • Too expensive -- build a throw-away airplane
  • Too slow -- wait for climate or galactic
    evolution
  • Too dangerous -- weapons, drug design, climate
    experiments
  • Solution
  • Use high performance computer systems to simulate
    the phenomenon

15
Scientific Computing
  • Use of computers to solve/compute scientific
    models
  • For instance, many natural phenomena can be well
    approximated by differential equations
  • Classic Example: Heat Transfer
  • Consider a 1-D material between 2 heat sources

[Figure: a 1-D bar along coordinate x, with one end held at temperature T_H and the other at T_L]
16
Scientific Computing
  • Use of computers to solve/compute scientific
    models
  • For instance, many natural phenomena can be well
    approximated by partial differential equations
    (PDEs)
  • Problem: compute f(x,t)

[Figure: the same 1-D bar, with its ends held at T_H and T_L]
f(x,t) = temperature at location x at time t, 0 < x < X
17
Heat Transfer
  • The laws of physics say that
    ∂f/∂t = α ∂²f/∂x²  (the 1-D heat equation)
  • where α depends on the material
  • where f(0,t) = H, f(X,t) = L, and f(x,0) are all
    fixed
  • Called the boundary conditions
  • Question: How do we solve this PDE?
  • It does not have an analytical solution
  • Therefore it must be solved numerically (i.e.,
    via approximation)

18
Heat Transfer
  • One well-known method to solve the heat equation
    is called finite differences
  • Approach
  • Discretize the domain: decide that the values of
    f(x,t) will only be known for some finite (but
    large) number of values of x and t
  • The discretized domain is called a mesh
  • All x values are separated by Δx
  • All t values are separated by Δt
  • Then, one replaces partial derivatives by
    algebraic differences
  • In the limit, when Δx and Δt go to zero, we get
    close to the real solution

19
Heat Transfer
  • There are many different approximations of the
    partial derivatives, based on Taylor series
    expansions, etc.
  • For instance, denoting f(x,t) as (discrete) f_{i,m},
    we can write the Forward Time, Centered Space
    (FTCS) heat transfer equation as
    f_{i,m+1} = f_{i,m} + (α Δt / Δx²) (f_{i+1,m} − 2 f_{i,m} + f_{i−1,m})
  • The various discretizations of the heat transfer
    equation have advantages and drawbacks in terms
    of
  • complexity
  • numerical stability
  • (if you're into it, there are countless papers
    and textbooks)
  • We have transformed a difficult PDE into a simple
    algebraic recurrence!
  • Easy to compute in an iterative fashion
  • Given all the values at time m, one can compute
    all the values at time m+1 (see the sketch below)
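
A minimal Python sketch of this FTCS iteration; the grid size, α, and the boundary temperatures below are made-up values, not from the slides:

```python
import numpy as np

# Minimal FTCS sketch for the 1-D heat equation df/dt = alpha * d^2f/dx^2.
# Grid size, alpha, and the boundary temperatures H and L are made-up values.
alpha, H, L = 1.0, 100.0, 0.0
nx, nt = 50, 2000
dx = 1.0 / (nx - 1)
dt = 0.4 * dx**2 / alpha            # small enough for numerical stability (<= 0.5*dx^2/alpha)

f = np.zeros(nx)                    # f(x, 0) = 0 in the interior
f[0], f[-1] = H, L                  # boundary conditions: f(0,t) = H, f(X,t) = L

for m in range(nt):                 # given all values at time m, compute time m+1
    f[1:-1] += alpha * dt / dx**2 * (f[2:] - 2 * f[1:-1] + f[:-2])

print(f.round(1))                   # relaxes toward the linear profile from H to L
```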

20
Heat Transfer
  • Summary
  • But they all use some matrix or volume of numbers
    (in the 2-D and 3-D cases) and iteratively do
    additions, multiplications and divisions, for
    many iterations
  • Therefore, we can replace difficult calculus by
    simple computations on multi-dimensional arrays
    of numbers
  • Challenges
  • These matrices may be really big, for better
    resolution and larger domains -> Large Data
  • The number of additions and multiplications can
    be overwhelming -> Heavy Computation
  • Hence
  • the early and constant need of scientists for
    bigger memories and faster CPUs

21
HPC Applications
  • Science
  • Global climate modeling
  • Astrophysical modeling
  • Biology: genomics, protein folding, drug design
  • Computational Chemistry
  • Computational Material Sciences and Nanosciences
  • Engineering
  • Crash simulation
  • Semiconductor design
  • Earthquake and structural modeling
  • Computational fluid dynamics (airplane design)
  • Combustion (engine design)
  • Business
  • Financial and economic modeling
  • Transaction processing, web services and search
    engines
  • Defense
  • Nuclear weapons -- test by simulation
  • Cryptography

22
Example: Computational Fluid Dynamics (CFD)
Replacing NASA's wind tunnels with computers
23
Example Global Climate
  • The problem is to compute
  • f(latitude, longitude, elevation, time) ->
  • temperature, pressure,
    humidity, wind velocity
  • Approach
  • Discretize the domain, e.g., a measurement point
    every 10 km
  • Devise an algorithm to predict the weather at time
    t+1 given time t
  • Uses
  • Predict El Niño
  • Set air emissions standards

24
Global Climate Requirements
  • One piece is modeling the fluid flow in the
    atmosphere
  • Solve the Navier-Stokes equations
  • Roughly 100 flops per grid point with a 1-minute
    timestep
  • Computational requirements (a back-of-the-envelope
    check follows below)
  • To match real time, need 5×10^11 flops in 60
    seconds = 8 Gflop/s
  • Weather prediction (7 days in 24 hours) -> 56
    Gflop/s
  • Climate prediction (50 years in 30 days) -> 4.8
    Tflop/s
  • To use in policy negotiations (50 years in 12
    hours) -> 288 Tflop/s
  • Let's make it even worse!
  • To double the grid resolution, computation is > 8x
  • State-of-the-art models require integration of
    atmosphere, ocean, sea-ice, and land models, plus
    possibly carbon cycle, geochemistry and more
  • Current models are coarser than this!
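
A quick Python sketch (not from the slides) reproducing these estimates. It uses the slide's figure of 5×10^11 flops per minute of simulated time; the results come out slightly above the slide's numbers only because the slide rounds the real-time rate down to 8 Gflop/s:

```python
# Back-of-the-envelope check of the slide's climate-modeling requirements.
FLOPS_PER_SIM_MINUTE = 5e11   # from the slide: ~100 flops/grid point, 1-minute timestep

def required_rate(simulated_seconds, wall_clock_seconds):
    """Sustained flop/s needed to simulate the given span within the given wall-clock time."""
    return FLOPS_PER_SIM_MINUTE * (simulated_seconds / 60) / wall_clock_seconds

day, year = 86400, 365 * 86400
print(f"real time:           {required_rate(60, 60) / 1e9:6.1f} Gflop/s")              # ~8.3
print(f"7 days in 24 hours:  {required_rate(7 * day, day) / 1e9:6.1f} Gflop/s")         # ~58
print(f"50 years in 30 days: {required_rate(50 * year, 30 * day) / 1e12:5.1f} Tflop/s") # ~5.1
print(f"50 years in 12 hrs:  {required_rate(50 * year, 12 * 3600) / 1e12:5.1f} Tflop/s")# ~304
```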

25
High-Resolution Climate Modeling on NERSC-3, P.
Duffy et al., LLNL
26
1000-year climate
  • Demonstration of the Community Climate System
    Model (CCSM2)
  • A 1000-year simulation shows long-term, stable
    representation of the earth's climate.
  • 760,000 processor hours used
  • Temperature change shown

Warren Washington and Jerry Meehl, National Center
for Atmospheric Research; Bert Semtner, Naval
Postgraduate School; John Weatherly, U.S. Army Cold
Regions Research and Engineering Laboratory; et al.
http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf
27
Agenda
  • HPC Introduction
  • HPC Applications
  • HPC Goals
  • Concurrency
  • History

28
Goals of HPC
  • Minimize turn-around time
  • to complete specific application problems (strong
    scaling)
  • Maximize the problem size
  • that can be solved in a given amount of time
    (weak scaling)
  • Identify the compromise between
  • performance and cost.
  • Note: Most supercomputers are obsolete
  • in terms of performance before the end of their
    physical life.

29
Maximizing Performance
  • How is performance maximized?
  • (1) Reduce the time per instruction (cycle time),
    i.e., increase the clock rate.
  • (2) Increase the number of instructions executed
    per cycle, e.g., via pipelining.
  • (3) Allow multiple processors to work on different
    parts of the same program at the same time:
    parallel execution.
  • When performance is gained from (1) and (2):
  • There is a limit to how fast processors can
    operate.
  • Speed of light and electricity.
  • Heat dissipation.
  • Power consumption.
  • Instruction processing cannot be divided into
    infinitely many pipeline stages.
  • When performance improvements come from (3):
  • Overhead of communication.

30
A 10 TFlop/s CPU?
  • Question: Could we build a single CPU that
    delivers 10,000 billion floating point operations
    per second (10 TFlop/s), and operates over 10,000
    billion bytes (10 TByte)?
  • Representative of what many scientists need
    today.
  • The clock rate would have to be 10,000 GHz
  • Assume that data travels at the speed of light
  • Assume that the computer is an ideal sphere

31
A 10 TFlop/s CPU?
  • Assume that the machine issues one instruction
    per cycle
  • therefore the clock rate must be 10,000 GHz =
    10^13 Hz
  • Data must travel some distance from the memory to
    the CPU
  • Assume that each instruction needs at least one
    8-byte word of memory
  • Assume that data travels at the speed of light,
    c = 3×10^8 m/s
  • Then the distance between the memory and the CPU
    must be r < c / 10^13 ≈ 3×10^-6 m
  • Then we must fit 10^13 bytes of memory in
    4/3 π r³ ≈ 3.7×10^-17 m³
  • Therefore, each word of memory must occupy about
    3.7×10^-30 m³
  • This is 3.7 cubic Angstroms (Å³)
  • or the volume of a very small molecule that
    consists of only a few atoms
  • Current memory densities are about 10 GB/cm³,
  • or about a factor of 10^20 from what would be
    needed!
  • Conclusion: it's not going to happen until some
    sci-fi breakthrough happens

32
Agenda
  • HPC Introduction
  • HPC Applications
  • HPC Goals
  • Concurrency
  • History

33
Concurrency
  • Since we cannot conceivably build a single CPU to
    solve relevant scientific problems, we resort to
    concurrency
  • execution of multiple tasks at the same time
  • Concurrency is everywhere in computers
  • Load a word from memory while adding two
    registers
  • Adding two pairs of registers at the same time
  • Receiving data from the network while writing to
    disk
  • Dual-proc systems
  • Clusters of workstations
  • SETI@home
  • Some concurrency is true
  • meaning that things really happen at the same
    time
  • Some concurrency is just the illusion
  • of simultaneous execution, with rapid switching
    among activities

34
Concurrent, parallel, distributed?
  • Concurrency is typically the more general term
  • A program is said to be concurrent if it contains
    more than one execution context
  • e.g., more than one thread/process
  • Typically the word parallel implies some notion
    of high performance / scientific application
    running on a single hardware platform
  • The word distributed typically refers to
    applications that run on multiple computers that
    may not be in the same room
  • These terms are conflated and misused all the
    time; in different research communities they mean
    different things.
  • We'll see that the distinctions are disappearing
    anyway

35
Two Types of HPC
  • Parallel Computing
  • Breaking the problem to be computed into parts
    that can be run simultaneously on different
    processors
  • Distributed Computing
  • Parts of the work to be computed are computed in
    different places
  • Note: does not necessarily imply simultaneous
    processing
  • An example: the client/server (C/S) model
  • Solves loosely-coupled problems
  • (not much communication)

36
Parallel Computing
  • Architectures of Parallel Computing
  • SMP (Symmetric Multi-Processing)
  • Multiple CPUs, single memory, shared I/O
  • All resources in an SMP machine are equally
    available to each CPU
  • Does not scale well to a large number of
    processors (typically fewer than 8)
  • NUMA (Non-Uniform Memory Access)
  • Multiple CPUs
  • Each CPU has fast access to its local area of the
    memory, but slower access to other areas
  • Scales well to a large number of processors
  • Complicated memory access pattern
  • MPP (Massively Parallel Processing)
  • Cluster

37
Reasons for Concurrency
  • Concurrency arises for at least 4 reasons
  • To increase performance or memory capacity
  • To allow users and computers to collaborate
  • To capture the logical structure of a problem
  • To cope with independent physical devices

38
Reason 1
  • To increase performance

39
Reason 1 (cont.)
  • To increase memory capacity
  • Example
  • A 3-D weather simulation over Kaneohe Bay (1-meter
    resolution)
  • Say we consider a volume 2km x 2km x 1km over the
    bay
  • Each zone is characterized by, say, temperature,
    wind direction, wind velocity, air pressure, and
    air moisture, for a total of (1+3+1+1+1) x 8 = 56
    bytes
  • Therefore we need about 208 GB of memory to hold
    the data (the arithmetic is sketched below)
  • Option 1: Buy a machine with > 208 GB of RAM
  • A 96GB server from Sun: about 1 million dollars!
  • They have a 288GB configuration (contact them for
    a price)
  • There is a 3TB shared-memory SGI machine at NCSA
  • Option 2: Couple individual machines together
  • Buy 52 4-GB PowerEdge servers from Dell for $2.5K
    each
  • Slap some network on them and you've got enough
    memory
  • total cost about $200K
  • But it's not as simple as that!
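
A quick Python check of the memory estimate above; the per-zone field breakdown follows the slide:

```python
# Memory estimate for the Kaneohe Bay example: 2 km x 2 km x 1 km at 1-m resolution.
nx, ny, nz = 2000, 2000, 1000
zones = nx * ny * nz                        # 4 billion zones
bytes_per_zone = (1 + 3 + 1 + 1 + 1) * 8    # temperature, wind direction (3 components),
                                            # wind velocity, pressure, moisture: 7 doubles
total_bytes = zones * bytes_per_zone

print(f"total memory: {total_bytes / 2**30:.1f} GiB")            # ~208.6 GiB
print(f"4-GB machines needed: {total_bytes / (4 * 2**30):.0f}")  # ~52
```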

40
Reason 1 (cont.)
The Laser Interferometer Gravitational-Wave
Observatory (LIGO): tiny distortions of space and
time caused when very large masses, such as stars,
move suddenly. 1 TB/day (1024 GB/day), year-long
experiments.
The Compact Muon Solenoid at CERN, designed to
study proton-proton collisions with high-quality
measurements (12,000 tons): 10 GB/sec!!! Many
PB/year (1024 TB/year)
41
Reason 2
  • To allow users and computers to collaborate
  • Example
  • Assume that we want to allow users to do on-line
    purchases
  • We need Web browsers, Web servers, Database
    servers
  • All these are processes
  • They all communicate with multiple processes
    simultaneously; they are all multithreaded,
    running on multiple machines; some of them are
    multi-processor servers
  • It's just a big concurrent system, and it is
    critical that it be fast and correct!

42
Reason 3
  • To capture the logical structure of a problem
  • Example
  • Let's assume that we want to write a program that
    simulates the interactions between a robot and
    living entities
  • We can implement the robot as its own thread
  • The code is just the code of the robot
  • We can implement each entity as its own thread
  • The code is the simulation of the entity's
    behavior
  • Now we let them loose at the beginning
  • They may meet, interact, etc.
  • All of this happens without a central notion of
    control, although it may all be running on a
    single CPU
  • Concurrency just fits the problem (a small sketch
    follows below)
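
A minimal Python sketch of this idea (the entity names and behavior are made up): each simulated entity is a thread whose code is simply its own behavior loop.

```python
import threading, time, random

class Entity(threading.Thread):
    """Each entity runs as its own thread: its code is just its behavior loop."""
    def __init__(self, name, steps=3):
        super().__init__()
        self.name, self.steps = name, steps

    def run(self):
        for _ in range(self.steps):
            time.sleep(random.uniform(0.01, 0.05))  # the entity lives at its own pace
            print(f"{self.name} acts")

entities = [Entity("robot"), Entity("cat"), Entity("person")]
for e in entities:
    e.start()
for e in entities:
    e.join()   # the interleaving emerges without any central notion of control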

43
Reason 4
  • To cope with independent physical devices
  • Example
  • Let's assume that we want to write a program that
    receives data from the network, processes it, and
    writes the output to disk
  • We can read from the network and write to disk at
    the same time, almost for free
  • We can compute on the data while we receive from
    the network, almost for free
  • We can compute on the data while we write the
    previously computed data to disk, almost for
    free
  • We are better off writing this program as three
    concurrent threads (even if on a single CPU)
  • Each thread uses one independent device of the
    computer (a small sketch follows below)
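
A minimal Python sketch of the three-thread structure; the "network" and "disk" are simulated here, and real I/O calls would take their place:

```python
import threading, queue

raw, results = queue.Queue(), queue.Queue()

def receive():                        # stands in for the network-reading thread
    for i in range(5):
        raw.put(i)
    raw.put(None)                     # end-of-stream marker

def process():                        # computes while the other threads do I/O
    while (item := raw.get()) is not None:
        results.put(item * item)      # stand-in for the real computation
    results.put(None)

def write():                          # stands in for the disk-writing thread
    while (item := results.get()) is not None:
        print("wrote", item)

threads = [threading.Thread(target=f) for f in (receive, process, write)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```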

44
Agenda
  • HPC Introduction
  • HPC Applications
  • HPC Goals
  • Concurrency
  • History

45
A brief history of concurrency
  • The first machines were used in single-user mode
  • The user would declare: "I am going to use the
    machine from 2 PM till 4 PM"
  • Then the user would go into the special machine
    room and sit there for 2 hours
  • The user punches in the cards, which were prepared
    in advance
  • The user tries to run the program
  • The user tries to debug the program
  • etc., etc.
  • Extreme lack of productivity
  • During the user's thinking time, the
    multi-million-dollar machine practically does
    nothing!

46
A brief history of concurrency
  • Batch Processing!
  • Instead of reserving the machine for a period of
    time to do all the activities (including
    debugging), the user just submits requests to a
    queue
  • The queue serves requests in order (possibly with
    priorities)
  • When a program fails and stops, another program
    is scheduled to use the machine immediately
  • Great! But what about the CPU idle time during
    I/O?

47
A brief history of concurrency
48
A brief history of concurrency
  • Multi-programming!
  • Multiple programs reside in memory at once
  • Required interrupts and memory protection
  • Interrupts are used to switch programs between
    devices and CPUs
  • Concurrency issues in the O/S
  • race conditions, deadlocks, critical sections
  • semaphores, monitors, etc.
  • beginnings of the theory of concurrent systems (1960s)
  • Increase in memory size
  • Development of virtual memory

49
A brief history of concurrency
  • Multiprogramming system
  • three jobs in memory

50
A brief history of concurrency
  • Time-sharing!
  • For fast, interactive response, one needs fast
    context switching
  • Makes it possible to have the illusion that one
    is alone on a (perhaps slower) machine
  • Already common by 1970
  • Led to concurrency in user applications!
  • The user's application is logically two
    concurrent tasks
  • The user can now implement it as two concurrent
    tasks!

51
A brief history of concurrency
  • Technology advances!
  • Multiple CPUs on a motherboard
  • faster buses, shared-memory, cache coherency
  • Networked computers
  • distributed memory
  • Clusters, ..., Internet
  • Concurrency across CPUs
  • Also: concurrency within the CPU, at the hardware
    level
  • Beyond CPU and I/O devices
  • Multiple units (e.g., ALUs)
  • Vector processors
  • Pipelining

52
History, Another Perspective
  • 1960s: Scalar processors
  • Process one data item at a time
  • 1970s: Vector processors
  • Can process an array of data items in one go
  • Architecture
  • Overhead
  • Late 1980s: Massively Parallel Processing (MPP)
  • Up to thousands of processors, each with its own
    memory and OS
  • Break down a problem into parts
  • Late 1990s: Clusters
  • Not a new term itself, but renewed interest
  • Connecting stand-alone computers with a
    high-speed network
  • Late 1990s: Grids
  • Tackle collaboration
  • Draw an analogy with the power grid

53
Issues with concurrency
  • Concurrency appears at all levels of current
    systems
  • hardware
  • O/S
  • Application
  • Many fields of computer science study concurrency
    issues
  • Three main issues
  • Performance
  • Correctness
  • Programmability

54
Many connected areas
  • Computer architecture
  • Networking
  • Operating Systems
  • Scientific Computing
  • Theory of Distributed Systems
  • Theory of Algorithms and Complexity
  • Scheduling
  • Internetworking
  • Programming Languages
  • Distributed Systems
  • High Performance Computing