Title: Tuesday, September 04, 2006
1. Tuesday, September 04, 2006
- I hear and I forget,
- I see and I remember,
- I do and I understand.
- (Chinese Proverb)
2. Today
- Course Overview.
- Why Parallel Computing?
- Evolution of Parallel Systems.
3. CS 524 High Performance Computing
- Course URL: http://suraj.lums.edu.pk/cs524a06
- Folder on indus: \\indus\Common\cs524a06
- Website (check regularly): course announcements, office hours, slides, resources, policies
- Course Outline
4.
- Several programming exercises will be given throughout the course. Assignments will include popular programming models for shared memory and message passing, such as OpenMP and MPI (see the sketch below).
- The development environment will be C/C++ on UNIX.
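As a taste of the shared-memory model named above, here is a minimal OpenMP sketch in C. It is an illustration only; the file name in the comment and what each thread prints are my choices, not course requirements.

```c
#include <omp.h>
#include <stdio.h>

/* Minimal OpenMP example: each thread in the team announces itself.
   Compile with something like: gcc -fopenmp hello.c */
int main(void)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* this thread's id */
        int n  = omp_get_num_threads();  /* size of the team */
        printf("Hello from thread %d of %d\n", id, n);
    }
    return 0;
}
```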
5. Pre-requisites
- Computer Organization and Assembly Language (CS 223)
- Data Structures and Algorithms (CS 213)
- Senior-level standing.
- Operating Systems?
7. Hunger For More Power!
8. Hunger For More Power!
- Endless quest for more and more computing power.
- However much computing power there is, it is never enough.
9. Why this need for greater computational power?
- Science, engineering, business, entertainment, etc. all provide the impetus.
- Scientists observe, theorize, and test through experimentation.
- Engineers design, test prototypes, and build.
10. HPC offers a new way to do science
- Computation is used to approximate physical systems.
- Advantages include:
- Playing with simulation parameters to study emergent trends.
- Replaying a particular simulation event.
- Studying systems where no exact theories exist.
11. Why Turn to Simulation?
- When the problem is too...
- Complex
- Large
- Expensive
- Dangerous
12. Why this need for greater computational power?
- Less expensive to carry out computer simulations.
- Able to simulate phenomena that cannot be studied by experimentation, e.g. the evolution of the universe.
13. Why this need for greater computational power?
- Problems such as
- Weather prediction.
- Aeronautics (airflow analysis, structural mechanics, engine efficiency, etc.).
- Simulating the world economy.
- Pharmaceuticals (molecular modeling).
- Understanding drug-receptor interactions in the brain.
- Automotive crash simulation.
- are all computationally intensive.
- The more knowledge we acquire, the more complex our questions become.
14. Why this need for greater computational power?
- In 1995, the first full-length computer-animated motion picture, Toy Story, was produced on a parallel system composed of hundreds of Sun workstations.
- Decreased cost.
- Decreased time (several months on several hundred processors).
15. Why this need for greater computational power?
- Commercial computing has also come to rely on parallel architectures.
- Computer system speed and capacity determine the scale of business that can be supported.
- OLTP (online transaction processing) benchmarks represent the relation between performance and scale of business.
- They rate the performance of a system in terms of its throughput in transactions per minute.
16. Why this need for greater computational power?
- Vendors supplying database hardware or software offer multiprocessor systems that provide performance substantially greater than uniprocessor products.
17.
- One solution in the past: make the clock run faster.
- Advances in VLSI technology allowed clock rates to increase and larger numbers of components to fit on a chip.
- However, there are limits:
- Electrical signals cannot propagate faster than the speed of light: 30 cm/nsec in vacuum and about 20 cm/nsec in copper wire or optical fiber.
18.
- Electrical signals cannot propagate faster than the speed of light: 30 cm/nsec in vacuum and about 20 cm/nsec in copper wire or optical fiber.
- A 10-GHz clock allows a signal path length of 2 cm in total.
- A 100-GHz clock allows 2 mm.
- A 1-THz (1000-GHz) computer would have to be smaller than 100 microns for a signal to travel from one end to the other and back within a single clock cycle.
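To make the arithmetic explicit: the distance a signal can cover in one cycle is the propagation speed divided by the clock frequency. At 20 cm/nsec in copper,

\[
d = \frac{v}{f}, \qquad
d_{10\,\mathrm{GHz}} = \frac{20\,\mathrm{cm/ns}}{10\,\mathrm{ns}^{-1}} = 2\,\mathrm{cm}, \qquad
d_{100\,\mathrm{GHz}} = 2\,\mathrm{mm}.
\]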
19.
- Another fundamental problem: heat dissipation.
- The faster a computer runs, the more heat it generates.
- In high-end Pentium systems, the CPU cooling system is bigger than the CPU itself.
20. Evolution of Parallel Architecture
- A new dimension added to the design space: the number of processors.
- Driven by demand for performance at acceptable cost.
21. Evolution of Parallel Architecture
- Advances in hardware capability enable new application functionality, which places greater demands on the architecture.
- This cycle drives the ongoing design, engineering, and manufacturing effort.
22. Evolution of Parallel Architecture
- Microprocessor performance has been improving at a rate of about 50% per year.
- A parallel machine of a hundred processors can thus be viewed as giving applications the computing power that a single processor will offer in about 10 years' time.
- 1000 processors: a 20-year horizon.
- The advantages of using small, inexpensive, mass-produced processors as building blocks for computer systems are clear.
23. Technology trends
- With technological advances, transistors, gates, etc. have been getting smaller and faster.
- More can fit in the same area.
- Processors are getting faster by making more effective use of an ever larger volume of computing resources.
- Possibilities:
- Place more of the computer system on the chip, including memory and I/O (a building block for parallel architectures: system-on-a-chip).
- Or place multiple processors on the chip (parallel architecture in the single-chip regime).
24. Microprocessor Design Trends
- Technology determines what is possible.
- Architecture translates the potential of technology into performance.
- Parallelism is fundamental to conventional computer architecture.
- Current architectural trends are leading to multiprocessor designs.
25. Bit-level Parallelism
- From 1970 to 1986: advances in bit-level parallelism.
- 4-bit, 8-bit, 16-bit, and so on.
- Doubling the width of the data path reduces the number of cycles required to perform an operation on wide operands (see the sketch below).
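A hedged illustration in C of why data-path width matters: on a 32-bit data path, a 64-bit addition must be synthesized from two 32-bit adds plus carry handling, while a 64-bit data path does it in a single operation. The function name is invented for this sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Emulate a 64-bit add using only 32-bit operations, the way a
   32-bit data path must: two adds plus carry propagation.      */
static uint64_t add64_via_32(uint32_t a_lo, uint32_t a_hi,
                             uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo    = a_lo + b_lo;            /* first 32-bit add  */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  /* unsigned overflow */
    uint32_t hi    = a_hi + b_hi + carry;    /* second 32-bit add */
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* (2^32 - 1) + 1 = 2^32: exercises the carry path. */
    uint64_t r = add64_via_32(0xFFFFFFFFu, 0u, 1u, 0u);
    printf("%llu\n", (unsigned long long)r);  /* prints 4294967296 */
    return 0;
}
```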
26. Instruction-level Parallelism
- Mid-1980s to mid-1990s.
- Performing portions of several machine instructions concurrently.
- Pipelining (itself a kind of parallelism).
- Fetching multiple instructions at a time and issuing them in parallel to distinct functional units (superscalar); see the sketch below.
27. Instruction-level Parallelism
- However:
- Instruction-level parallelism is worthwhile only if the processor can be supplied with instructions and data fast enough.
- The gap between processor cycle time and memory cycle time has grown wider.
- To satisfy increasing bandwidth requirements, larger and larger caches are placed on chip with the processor.
- Cache misses and control transfers impose limits.
28.
- In the mid-1970s, the introduction of vector processors marked the beginning of modern supercomputing.
- They perform operations on sequences of data elements rather than on individual scalar data (see the sketch below).
- They offered an advantage of at least one order of magnitude over conventional systems of that time.
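As a rough illustration, the classic saxpy loop below is the kind of computation a vector processor executes with a handful of vector instructions rather than one scalar operation per element. The function and variable names are mine, not from the slides.

```c
#include <stddef.h>
#include <stdio.h>

/* saxpy: y[i] = a * x[i] + y[i], the canonical vectorizable loop.
   On a vector machine (or modern SIMD units) whole chunks of x
   and y are processed per instruction, one element per lane.    */
static void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    saxpy(4, 2.0f, x, y);
    for (int i = 0; i < 4; i++)
        printf("%g ", y[i]);   /* prints: 2 4 6 8 */
    printf("\n");
    return 0;
}
```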
29.
- In the late 1980s, a new generation of systems came on the market: microprocessor-based supercomputers that initially provided about 100 processors, increasing to roughly 1000 by 1990.
- These aggregations of processors are known as massively parallel processors (MPPs).
30.
- Factors behind the emergence of MPPs:
- Increase in the performance of standard microprocessors.
- Cost advantage of using off-the-shelf microprocessors instead of custom processors.
- Government programs fostering scalable parallel computing using distributed memory.
31.
- MPPs claimed to equal or surpass the performance of vector multiprocessors.
- Top500:
- Lists the sites that have the 500 most powerful installed computer systems.
- LINPACK benchmark:
- The most widely used metric of performance on numerical applications.
- A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.
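For context, a standard fact about the benchmark (not stated on the slide): the Top500 LINPACK run solves a dense linear system Ax = b of order n by LU factorization, and the reported rate divides the nominal operation count by the wall-clock time:

\[
\mathrm{ops}(n) \approx \tfrac{2}{3}n^{3} + 2n^{2},
\qquad
R = \frac{\mathrm{ops}(n)}{t_{\mathrm{solve}}} \ \text{flop/s}.
\]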
32.
- Top500 (updated twice a year since June 1993).
- The first Top500 list already contained 156 MPP and SIMD systems (around one third).
33. Some memory-related issues
- The time to access memory has not kept pace with CPU clock speeds.
- SRAM:
- Each bit is stored in a latch made up of transistors.
- Faster than DRAM, but less dense and requires more power.
- DRAM:
- Each bit of memory is stored as a charge on a capacitor.
- A 1-GHz CPU will execute 60 instructions in the time a typical 60-ns DRAM takes to return a single byte.
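The 60-instruction figure is just the DRAM latency expressed in CPU cycles, assuming roughly one instruction per cycle:

\[
60\,\mathrm{ns} \times 1\,\mathrm{GHz} = 60\,\mathrm{ns} \times 10^{9}\,\mathrm{s^{-1}} = 60 \ \text{cycles}.
\]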
34. Some memory-related issues
- Hierarchy
- Cache memories
- Temporal locality
- Cache lines (64, 128, 256 bytes)
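A common way to see cache lines and locality at work, sketched in C (the matrix size is an arbitrary choice): both loops compute the same sum, but the row-order loop walks memory contiguously and uses every element of each fetched line, while the column-order loop jumps a whole row per access and touches a new line almost every time, so on most machines it runs noticeably slower.

```c
#include <stdio.h>

#define N 1024
static double m[N][N];          /* C stores this row-major */

int main(void)
{
    double sum = 0.0;

    /* Row order: consecutive accesses fall in the same cache
       line, so each line fetched from memory is fully used.  */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Column order: consecutive accesses are N * 8 bytes
       apart, touching a different cache line nearly every
       iteration and wasting most of each fetched line.       */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%f\n", sum);
    return 0;
}
```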
35. Parallel Architectures: Memory Parallelism
- One way to increase performance is to replicate computers.
- The major choice is between shared memory and distributed memory.
36. Memory Parallelism
- In the mid-1980s, when the 32-bit microprocessor was first introduced, computers containing multiple microprocessors sharing a common memory became prevalent.
- In most of these designs, all processors plug into a common bus.
- However, only a small number of processors can be supported by a bus.
37. UMA bus-based SMP architecture
- If the bus is busy when a CPU wants to read or write memory, the CPU waits for the bus to become idle.
- Bus contention is manageable only for a small number of processors.
- Beyond that, the system is limited by the bandwidth of the bus, and most of the CPUs are idle most of the time.
38. UMA bus-based SMP architecture
- One way to alleviate this problem is to add a cache to each CPU.
- If most reads can be satisfied from the cache, there is less bus traffic and the system can support more CPUs.
- A single bus limits UMA multiprocessors to about 16-32 CPUs.
39. SMP
- SMP (symmetric multiprocessor):
- A shared-memory multiprocessor where the cost of accessing a memory location is the same for all processors; see the sketch below.
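From the programmer's side, the defining property of an SMP is a single shared address space. In the OpenMP sketch below (my illustration, not from the slides), every iteration may run on a different processor, yet all of them read and write the same array directly:

```c
#include <stdio.h>

#define N 8

int main(void)
{
    int data[N];    /* one array, visible to every thread */

    /* Iterations are spread across the processors, but each
       one addresses the same shared memory (at uniform cost
       on a UMA machine).                                    */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = i * i;

    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}
```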