Title: Tuesday, September 04, 2006
1. Tuesday, September 04, 2006
- I hear and I forget,
- I see and I remember,
- I do and I understand.
- (Chinese Proverb)
2. Today
- Course Overview.
- Why Parallel Computing?
- Evolution of Parallel Systems.
3. CS 524 High Performance Computing
- Course URL: http://suraj.lums.edu.pk/cs524a06
- Folder on indus: \\indus\Common\cs524a06
- Website (check regularly): course announcements, office hours, slides, resources, policies
- Course Outline
4.
- Several programming exercises will be given throughout the course. Assignments will include popular programming models for shared memory and message passing, such as OpenMP and MPI (see the sketch below).
- The development environment will be C/C++ on UNIX.
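As a taste of the shared-memory model named above, here is a minimal OpenMP sketch in C. It is an illustration only; the file name in the comment and what each thread prints are my choices, not course requirements.

```c
#include <omp.h>
#include <stdio.h>

/* Minimal OpenMP example: each thread in the team announces itself.
   Compile with something like: gcc -fopenmp hello.c */
int main(void)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* this thread's id */
        int n  = omp_get_num_threads();  /* size of the team */
        printf("Hello from thread %d of %d\n", id, n);
    }
    return 0;
}
```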
5. Pre-requisites
- Computer Organization and Assembly Language (CS 223)
- Data Structures and Algorithms (CS 213)
- Senior-level standing.
- Operating Systems?
7. Hunger For More Power!
8. Hunger For More Power!
- Endless quest for more and more computing power.
- However much computing power there is, it is never enough.
9. Why this need for greater computational power?
- Science, engineering, business, entertainment, etc. all provide the impetus.
- Scientists observe, theorize, and test through experimentation.
- Engineers design, test prototypes, and build.
10. HPC offers a new way to do science
- Computation is used to approximate physical systems.
- Advantages include:
- Playing with simulation parameters to study emergent trends.
- Replaying a particular simulation event.
- Studying systems where no exact theories exist.
11. Why Turn to Simulation?
- When the problem is too...
- Complex
- Large
- Expensive
- Dangerous
12. Why this need for greater computational power?
- Less expensive to carry out computer simulations.
- Able to simulate phenomena that cannot be studied by experimentation, e.g. the evolution of the universe.
13. Why this need for greater computational power?
- Problems such as
- Weather prediction.
- Aeronautics (airflow analysis, structural mechanics, engine efficiency, etc.).
- Simulating the world economy.
- Pharmaceuticals (molecular modeling).
- Understanding drug-receptor interactions in the brain.
- Automotive crash simulation.
- are all computationally intensive.
- The more knowledge we acquire, the more complex our questions become.
14. Why this need for greater computational power?
- In 1995, the first full-length computer-animated motion picture, Toy Story, was produced on a parallel system composed of hundreds of Sun workstations.
- Decreased cost.
- Decreased time (several months on several hundred processors).
15. Why this need for greater computational power?
- Commercial computing has also come to rely on parallel architectures.
- Computer system speed and capacity determine the scale of business that can be supported.
- OLTP (online transaction processing) benchmarks represent the relation between performance and scale of business.
- They rate the performance of a system in terms of its throughput in transactions per minute.
16. Why this need for greater computational power?
- Vendors supplying database hardware or software offer multiprocessor systems that provide performance substantially greater than uniprocessor products.
17.
- One solution in the past: make the clock run faster.
- Advances in VLSI technology allowed clock rates to increase and larger numbers of components to fit on a chip.
- However, there are limits:
- Electrical signals cannot propagate faster than the speed of light: 30 cm/nsec in vacuum and about 20 cm/nsec in copper wire or optical fiber.
18.
- Electrical signals cannot propagate faster than the speed of light: 30 cm/nsec in vacuum and about 20 cm/nsec in copper wire or optical fiber.
- A 10-GHz clock allows a signal path length of 2 cm in total.
- A 100-GHz clock allows 2 mm.
- A 1-THz (1000-GHz) computer would have to be smaller than 100 microns for a signal to travel from one end to the other and back within a single clock cycle.
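To make the arithmetic explicit: the distance a signal can cover in one cycle is the propagation speed divided by the clock frequency. At 20 cm/nsec in copper,

\[
d = \frac{v}{f}, \qquad
d_{10\,\mathrm{GHz}} = \frac{20\,\mathrm{cm/ns}}{10\,\mathrm{ns}^{-1}} = 2\,\mathrm{cm}, \qquad
d_{100\,\mathrm{GHz}} = 2\,\mathrm{mm}.
\]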
19.
- Another fundamental problem: heat dissipation.
- The faster a computer runs, the more heat it generates.
- In high-end Pentium systems, the CPU cooling system is bigger than the CPU itself.
20. Evolution of Parallel Architecture
- A new dimension added to the design space: the number of processors.
- Driven by demand for performance at acceptable cost.
21. Evolution of Parallel Architecture
- Advances in hardware capability enable new application functionality, which places greater demands on the architecture.
- This cycle drives the ongoing design, engineering, and manufacturing effort.
22. Evolution of Parallel Architecture
- Microprocessor performance has been improving at a rate of about 50% per year.
- A parallel machine of a hundred processors can thus be viewed as giving applications the computing power that a single processor will offer in about 10 years' time.
- 1000 processors: a 20-year horizon.
- The advantages of using small, inexpensive, mass-produced processors as building blocks for computer systems are clear.
23. Technology trends
- With technological advances, transistors, gates, etc. have been getting smaller and faster.
- More can fit in the same area.
- Processors are getting faster by making more effective use of an ever larger volume of computing resources.
- Possibilities:
- Place more of the computer system on the chip, including memory and I/O (a building block for parallel architectures: system-on-a-chip).
- Or place multiple processors on the chip (parallel architecture in the single-chip regime).
24. Microprocessor Design Trends
- Technology determines what is possible.
- Architecture translates the potential of technology into performance.
- Parallelism is fundamental to conventional computer architecture.
- Current architectural trends are leading to multiprocessor designs.
25. Bit-level Parallelism
- From 1970 to 1986: advances in bit-level parallelism.
- 4-bit, 8-bit, 16-bit, and so on.
- Doubling the width of the data path reduces the number of cycles required to perform an operation on wide operands (see the sketch below).
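A hedged illustration in C of why data-path width matters: on a 32-bit data path, a 64-bit addition must be synthesized from two 32-bit adds plus carry handling, while a 64-bit data path does it in a single operation. The function name is invented for this sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Emulate a 64-bit add using only 32-bit operations, the way a
   32-bit data path must: two adds plus carry propagation.      */
static uint64_t add64_via_32(uint32_t a_lo, uint32_t a_hi,
                             uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo    = a_lo + b_lo;            /* first 32-bit add  */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  /* unsigned overflow */
    uint32_t hi    = a_hi + b_hi + carry;    /* second 32-bit add */
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* (2^32 - 1) + 1 = 2^32: exercises the carry path. */
    uint64_t r = add64_via_32(0xFFFFFFFFu, 0u, 1u, 0u);
    printf("%llu\n", (unsigned long long)r);  /* prints 4294967296 */
    return 0;
}
```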
26. Instruction-level Parallelism
- Mid-1980s to mid-1990s.
- Performing portions of several machine instructions concurrently.
- Pipelining (itself a kind of parallelism).
- Fetching multiple instructions at a time and issuing them in parallel to distinct functional units (superscalar); see the sketch below.
27. Instruction-level Parallelism
- However:
- Instruction-level parallelism is worthwhile only if the processor can be supplied with instructions and data fast enough.
- The gap between processor cycle time and memory cycle time has grown wider.
- To satisfy increasing bandwidth requirements, larger and larger caches are placed on chip with the processor.
- Cache misses and control transfers impose limits.
28.
- In the mid-1970s, the introduction of vector processors marked the beginning of modern supercomputing.
- They perform operations on sequences of data elements rather than on individual scalar data (see the sketch below).
- They offered an advantage of at least one order of magnitude over conventional systems of that time.
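As a rough illustration, the classic saxpy loop below is the kind of computation a vector processor executes with a handful of vector instructions rather than one scalar operation per element. The function and variable names are mine, not from the slides.

```c
#include <stddef.h>
#include <stdio.h>

/* saxpy: y[i] = a * x[i] + y[i], the canonical vectorizable loop.
   On a vector machine (or modern SIMD units) whole chunks of x
   and y are processed per instruction, one element per lane.    */
static void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    saxpy(4, 2.0f, x, y);
    for (int i = 0; i < 4; i++)
        printf("%g ", y[i]);   /* prints: 2 4 6 8 */
    printf("\n");
    return 0;
}
```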
29.
- In the late 1980s, a new generation of systems came on the market: microprocessor-based supercomputers that initially provided about 100 processors, increasing to roughly 1000 by 1990.
- These aggregations of processors are known as massively parallel processors (MPPs).
30.
- Factors behind the emergence of MPPs:
- Increase in the performance of standard microprocessors.
- Cost advantage of using off-the-shelf microprocessors instead of custom processors.
- Government programs fostering scalable parallel computing using distributed memory.
31.
- MPPs claimed to equal or surpass the performance of vector multiprocessors.
- Top500:
- Lists the sites that have the 500 most powerful installed computer systems.
- LINPACK benchmark:
- The most widely used metric of performance on numerical applications.
- A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.
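For context, a standard fact about the benchmark (not stated on the slide): the Top500 LINPACK run solves a dense linear system Ax = b of order n by LU factorization, and the reported rate divides the nominal operation count by the wall-clock time:

\[
\mathrm{ops}(n) \approx \tfrac{2}{3}n^{3} + 2n^{2},
\qquad
R = \frac{\mathrm{ops}(n)}{t_{\mathrm{solve}}} \ \text{flop/s}.
\]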
32.
- Top500 (updated twice a year since June 1993).
- The first Top500 list already contained 156 MPP and SIMD systems (around one third).
33. Some memory-related issues
- The time to access memory has not kept pace with CPU clock speeds.
- SRAM:
- Each bit is stored in a latch made up of transistors.
- Faster than DRAM, but less dense and requires more power.
- DRAM:
- Each bit of memory is stored as a charge on a capacitor.
- A 1-GHz CPU will execute 60 instructions in the time a typical 60-ns DRAM takes to return a single byte.
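The 60-instruction figure is just the DRAM latency expressed in CPU cycles, assuming roughly one instruction per cycle:

\[
60\,\mathrm{ns} \times 1\,\mathrm{GHz} = 60\,\mathrm{ns} \times 10^{9}\,\mathrm{s^{-1}} = 60 \ \text{cycles}.
\]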
34. Some memory-related issues
- Hierarchy
- Cache memories
- Temporal locality
- Cache lines (64, 128, 256 bytes)
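A common way to see cache lines and locality at work, sketched in C (the matrix size is an arbitrary choice): both loops compute the same sum, but the row-order loop walks memory contiguously and uses every element of each fetched line, while the column-order loop jumps a whole row per access and touches a new line almost every time, so on most machines it runs noticeably slower.

```c
#include <stdio.h>

#define N 1024
static double m[N][N];          /* C stores this row-major */

int main(void)
{
    double sum = 0.0;

    /* Row order: consecutive accesses fall in the same cache
       line, so each line fetched from memory is fully used.  */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Column order: consecutive accesses are N * 8 bytes
       apart, touching a different cache line nearly every
       iteration and wasting most of each fetched line.       */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("%f\n", sum);
    return 0;
}
```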
35. Parallel Architectures: Memory Parallelism
- One way to increase performance is to replicate computers.
- The major choice is between shared memory and distributed memory.
36. Memory Parallelism
- In the mid-1980s, when the 32-bit microprocessor was first introduced, computers containing multiple microprocessors sharing a common memory became prevalent.
- In most of these designs, all processors plug into a common bus.
- However, only a small number of processors can be supported by a bus.
37. UMA bus-based SMP architecture
- If the bus is busy when a CPU wants to read or write memory, the CPU waits for the bus to become idle.
- Bus contention is manageable only for a small number of processors.
- Beyond that, the system is limited by the bandwidth of the bus, and most of the CPUs are idle most of the time.
38. UMA bus-based SMP architecture
- One way to alleviate this problem is to add a cache to each CPU.
- If most reads can be satisfied from the cache, there is less bus traffic and the system can support more CPUs.
- A single bus limits UMA multiprocessors to about 16-32 CPUs.
39. SMP
- SMP (symmetric multiprocessor):
- A shared-memory multiprocessor where the cost of accessing a memory location is the same for all processors; see the sketch below.
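From the programmer's side, the defining property of an SMP is a single shared address space. In the OpenMP sketch below (my illustration, not from the slides), every iteration may run on a different processor, yet all of them read and write the same array directly:

```c
#include <stdio.h>

#define N 8

int main(void)
{
    int data[N];    /* one array, visible to every thread */

    /* Iterations are spread across the processors, but each
       one addresses the same shared memory (at uniform cost
       on a UMA machine).                                    */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = i * i;

    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}
```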