Title: Slides Prepared from the CI-Tutor Courses at NCSA
1Parallel Computing Explained: Parallel Computing Overview
- Slides Prepared from the CI-Tutor Courses at NCSA
- http://ci-tutor.ncsa.uiuc.edu/
- By
- S. Masoud Sadjadi
- School of Computing and Information Sciences
- Florida International University
- March 2009
2Agenda
- 1 Parallel Computing Overview
- 2 How to Parallelize a Code
- 3 Porting Issues
- 4 Scalar Tuning
- 5 Parallel Code Tuning
- 6 Timing and Profiling
- 7 Cache Tuning
- 8 Parallel Performance Analysis
- 9 About the IBM Regatta P690
3Agenda
- 1 Parallel Computing Overview
- 1.1 Introduction to Parallel Computing
- 1.1.1 Parallelism in our Daily Lives
- 1.1.2 Parallelism in Computer Programs
- 1.1.3 Parallelism in Computers
- 1.1.4 Performance Measures
- 1.1.5 More Parallelism Issues
- 1.2 Comparison of Parallel Computers
- 1.3 Summary
4Parallel Computing Overview
- Who should read this chapter?
- New Users: to learn concepts and terminology.
- Intermediate Users: for review or reference.
- Management Staff: to understand the basic
concepts even if you don't plan to do any
programming.
- Note: Advanced users may opt to skip this
chapter.
5Introduction to Parallel Computing
- High performance parallel computers
- can solve large problems much faster than a
desktop computer
- fast CPUs, large memory, high speed
interconnects, and high speed input/output
- able to speed up computations
- by making the sequential components run faster
- by doing more operations in parallel
- High performance parallel computers are in demand
- need for tremendous computational capabilities in
science, engineering, and business.
- require gigabytes/terabytes of memory and
gigaflops/teraflops of performance
- scientists are striving for petascale performance
6Introduction to Parallel Computing
- HPPC are used in a wide variety of disciplines.
- Meteorologists: prediction of tornadoes and
thunderstorms
- Computational biologists: analysis of DNA sequences
- Pharmaceutical companies: design of new drugs
- Oil companies: seismic exploration
- Wall Street: analysis of financial markets
- NASA: aerospace vehicle design
- Entertainment industry: special effects in movies
and commercials
- These complex scientific and business
applications all need to perform computations on
large datasets or large equations.
7Parallelism in our Daily Lives
- There are two types of processes that occur in
computers and in our daily lives
- Sequential processes
- occur in a strict order
- it is not possible to do the next step until the
current one is completed.
- Examples
- The passage of time: the sun rises and the sun
sets.
- Writing a term paper: pick the topic, research,
and write the paper.
- Parallel processes
- many events happen simultaneously
- Examples
- Plant growth in the springtime
- An orchestra
8Agenda
- 1 Parallel Computing Overview
- 1.1 Introduction to Parallel Computing
- 1.1.1 Parallelism in our Daily Lives
- 1.1.2 Parallelism in Computer Programs
- 1.1.2.1 Data Parallelism
- 1.1.2.2 Task Parallelism
- 1.1.3 Parallelism in Computers
- 1.1.4 Performance Measures
- 1.1.5 More Parallelism Issues
- 1.2 Comparison of Parallel Computers
- 1.3 Summary
9Parallelism in Computer Programs
- Conventional wisdom
- Computer programs are sequential in nature
- Only a small subset of them lend themselves to
parallelism.
- Algorithm: the "sequence of steps" necessary to
do a computation.
- For the first 30 years of computer use, programs
were run sequentially.
- The 1980s saw great successes with parallel
computers.
- Dr. Geoffrey Fox published a book entitled
Parallel Computing Works!
- many scientific accomplishments resulting from
parallel computing
- Computer programs are parallel in nature
- Only a small subset of them need to be run
sequentially
10Parallel Computing
- What a computer does when it carries out more
than one computation at a time using more than
one processor.
- By using many processors at once, we can speed up
the execution.
- If one processor can perform the arithmetic in
time t,
- then ideally p processors can perform the
arithmetic in time t/p.
- What if I use 100 processors? What if I use 1000
processors?
- Almost every program has some form of
parallelism.
- You need to determine whether your data or your
program can be partitioned into independent
pieces that can be run simultaneously.
- Decomposition is the name given to this
partitioning process.
- Types of parallelism
- data parallelism
- task parallelism
11Data Parallelism
- The same code segment runs concurrently on each
processor, but each processor is assigned its own
part of the data to work on. - Do loops (in Fortran) define the parallelism.
- The iterations must be independent of each other.
- Data parallelism is called "fine grain
parallelism" because the computational work is
spread into many small subtasks. - Example
- Dense linear algebra, such as matrix
multiplication, is a perfect candidate for data
parallelism.
12An example of data parallelism
- The original serial code:

  DO K = 1, N
    DO J = 1, N
      DO I = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
    END DO
  END DO

- The same code with an OpenMP parallel directive:

!$OMP PARALLEL DO
  DO K = 1, N
    DO J = 1, N
      DO I = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
    END DO
  END DO
!$OMP END PARALLEL DO
13Quick Intro to OpenMP
- OpenMP is a portable standard for parallel
directives covering both data and task
parallelism. - More information about OpenMP is available on the
OpenMP website. - We will have a lecture on Introduction to OpenMP
later. - With OpenMP, the loop that is performed in
parallel is the loop that immediately follows the
Parallel Do directive.
- In our sample code, it's the K loop:
  DO K = 1, N
14OpenMP Loop Parallelism
- Iteration-Processor Assignments
- The code segment running on each processor:

  DO J = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO

- The table below assumes N = 20 and 4 processors.

  Processor   Iterations of K   Data Elements
  proc0       K = 1:5           A(I, 1:5),   B(1:5, J)
  proc1       K = 6:10          A(I, 6:10),  B(6:10, J)
  proc2       K = 11:15         A(I, 11:15), B(11:15, J)
  proc3       K = 16:20         A(I, 16:20), B(16:20, J)
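- A minimal sketch of how this assignment can be observed at run time (the 20-iteration loop, the PRINT statements, and the 4-thread run are illustrative assumptions, not part of the tutorial); with the static schedule most OpenMP implementations use by default, each thread receives a contiguous block of iterations, as in the table above:

  PROGRAM iteration_map
  USE omp_lib                     ! provides omp_get_thread_num()
  IMPLICIT NONE
  INTEGER :: K
!$OMP PARALLEL DO
  DO K = 1, 20
     ! Each thread reports which iterations it was assigned.
     PRINT *, 'iteration', K, 'ran on thread', omp_get_thread_num()
  END DO
!$OMP END PARALLEL DO
  END PROGRAM iteration_map

- Compiled with OpenMP enabled and run with OMP_NUM_THREADS=4, this typically reports iterations 1-5 on thread 0, 6-10 on thread 1, and so on.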
15OpenMP Style of Parallelism
- can be done incrementally as follows
- Parallelize the most computationally intensive
loop. - Compute performance of the code.
- If performance is not satisfactory, parallelize
another loop. - Repeat steps 2 and 3 as many times as needed.
- The ability to perform incremental parallelism is
considered a positive feature of data
parallelism. - It is contrasted with the MPI (Message Passing
Interface) style of parallelism, which is an "all
or nothing" approach.
16Task Parallelism
- Task parallelism may be thought of as the
opposite of data parallelism. - Instead of the same operations being performed on
different parts of the data, each process
performs different operations. - You can use task parallelism when your program
can be split into independent pieces, often
subroutines, that can be assigned to different
processors and run concurrently. - Task parallelism is called "coarse grain"
parallelism because the computational work is
spread into just a few subtasks. - More code is run in parallel because the
parallelism is implemented at a higher level than
in data parallelism. - Task parallelism is often easier to implement and
has less overhead than data parallelism.
17Task Parallelism
- The abstract code shown in the diagram is
decomposed into 4 independent code segments that
are labeled A, B, C, and D. The right hand side
of the diagram illustrates the 4 code segments
running concurrently.
18Task Parallelism
- The serial program:

  program main
    code segment labeled A
    code segment labeled B
    code segment labeled C
    code segment labeled D
  end

- The parallel program with OpenMP section directives:

  program main
!$OMP PARALLEL
!$OMP SECTIONS
    code segment labeled A
!$OMP SECTION
    code segment labeled B
!$OMP SECTION
    code segment labeled C
!$OMP SECTION
    code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
  end
19OpenMP Task Parallelism
- With OpenMP, the code that follows each
SECTION(S) directive is allocated to a different
processor. In our sample parallel code, the
allocation of code segments to processors is as
follows.
Processor Code
proc0 code segment labeled A
proc1 code segment labeled B
proc2 code segment labeled C
proc3 code segment labeled D
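- A minimal runnable sketch of the same sections pattern, where the four labeled code segments are stand-ins (the arrays and the trivial sums are invented for illustration):

  PROGRAM sections_demo
  IMPLICIT NONE
  REAL :: A(100), B(100), C(100), D(100)
  REAL :: SA, SB, SC, SD
  A = 1.0; B = 2.0; C = 3.0; D = 4.0
!$OMP PARALLEL
!$OMP SECTIONS
  SA = SUM(A)        ! plays the role of "code segment labeled A"
!$OMP SECTION
  SB = SUM(B)        ! "code segment labeled B"
!$OMP SECTION
  SC = SUM(C)        ! "code segment labeled C"
!$OMP SECTION
  SD = SUM(D)        ! "code segment labeled D"
!$OMP END SECTIONS
!$OMP END PARALLEL
  PRINT *, SA, SB, SC, SD
  END PROGRAM sections_demo

- Run with 4 or more threads, each section can execute on its own processor, matching the allocation shown in the table above.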
20Parallelism in Computers
- How parallelism is exploited and enhanced within
the operating system and hardware components of a
parallel computer - operating system
- arithmetic
- memory
- disk
21Operating System Parallelism
- All of the commonly used parallel computers run a
version of the Unix operating system. In the
table below each OS listed is in fact Unix, but
the name of the Unix OS varies with each vendor. - For more information about Unix, a collection of
Unix documents is available.
Parallel Computer OS
SGI Origin2000 IRIX
HP V-Class HP-UX
Cray T3E Unicos
IBM SP AIX
Workstation Clusters Linux
22Two Unix Parallelism Features
- background processing facility
- With the Unix background processing facility you
can run the executable a.out in the background
and simultaneously view the man page for the
etime function in the foreground. There are two
Unix commands that accomplish this:
- a.out > results &
- man etime
- cron feature
- With the Unix cron feature you can submit a job
that will run at a later time.
23Arithmetic Parallelism
- Multiple execution units
- facilitate arithmetic parallelism.
- The arithmetic operations of add, subtract,
multiply, and divide (+, -, *, /) are each done in a
separate execution unit. This allows several
execution units to be used simultaneously,
because the execution units operate
independently.
- Fused multiply and add
- is another parallel arithmetic feature.
- Parallel computers are able to overlap multiply
and add. This arithmetic is named Multiply-Add
(MADD) on SGI computers, and Fused Multiply Add
(FMA) on HP computers. In either case, the two
arithmetic operations are overlapped and can
complete in hardware in one computer cycle (a small
sketch of the source-level pattern follows this list).
- Superscalar arithmetic
- is the ability to issue several arithmetic
operations per computer cycle. - It makes use of the multiple, independent
execution units. On superscalar computers there
are multiple slots per cycle that can be filled
with work. This gives rise to the name n-way
superscalar, where n is the number of slots per
cycle. The SGI Origin2000 is called a 4-way
superscalar computer.
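- A minimal sketch of the source-level pattern that this hardware accelerates (the array names and values are illustrative); each iteration performs one multiply and one add on the same result, which an optimizing compiler can typically map to a single MADD/FMA instruction on hardware that supports it:

  PROGRAM madd_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL :: A(N), X(N), Y(N)
  INTEGER :: I
  A = 2.0; X = 3.0; Y = 1.0
  ! One multiply and one add per iteration; MADD/FMA hardware can
  ! complete the pair in a single cycle.
  DO I = 1, N
     Y(I) = A(I)*X(I) + Y(I)
  END DO
  PRINT *, 'Y(1) =', Y(1)
  END PROGRAM madd_demo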
24Memory Parallelism
- memory interleaving
- memory is divided into multiple banks, and
consecutive data elements are interleaved among
them. For example if your computer has 2 memory
banks, then data elements with even memory
addresses would fall into one bank, and data
elements with odd memory addresses into the
other. - multiple memory ports
- Port means a bi-directional memory pathway. When
the data elements that are interleaved across the
memory banks are needed, the multiple memory
ports allow them to be accessed and fetched in
parallel, which increases the memory bandwidth
(MB/s or GB/s). - multiple levels of the memory hierarchy
- There is global memory that any processor can
access. There is memory that is local to a
partition of the processors. Finally there is
memory that is local to a single processor, that
is, the cache memory and the memory elements held
in registers. - Cache memory
- Cache is a small memory that has fast access
compared with the larger main memory and serves
to keep the faster processor filled with data.
25Memory Parallelism
26Disk Parallelism
- RAID (Redundant Array of Inexpensive Disks)
- RAID disks are on most parallel computers.
- The advantage of a RAID disk system is that it
provides a measure of fault tolerance. - If one of the disks goes down, it can be swapped
out, and the RAID disk system remains
operational. - Disk Striping
- When a data set is written to disk, it is striped
across the RAID disk system. That is, it is
broken into pieces that are written
simultaneously to the different disks in the RAID
disk system. When the same data set is read back
in, the pieces are read in parallel, and the full
data set is reassembled in memory.
27Agenda
- 1 Parallel Computing Overview
- 1.1 Introduction to Parallel Computing
- 1.1.1 Parallelism in our Daily Lives
- 1.1.2 Parallelism in Computer Programs
- 1.1.3 Parallelism in Computers
- 1.1.4 Performance Measures
- 1.1.5 More Parallelism Issues
- 1.2 Comparison of Parallel Computers
- 1.3 Summary
28Performance Measures
- Peak Performance
- is the top speed at which the computer can
operate. - It is a theoretical upper limit on the computer's
performance. - Sustained Performance
- is the highest consistently achieved speed.
- It is a more realistic measure of computer
performance. - Cost Performance
- is used to determine if the computer is cost
effective. - MHz
- is a measure of the processor speed.
- The processor speed is commonly measured in
millions of cycles per second, where a computer
cycle is defined as the shortest time in which
some work can be done. - MIPS
- is a measure of how quickly the computer can
issue instructions. - Millions of instructions per second is
abbreviated as MIPS, where the instructions are
computer instructions such as memory reads and
writes, logical operations, floating point
operations, integer operations, and branch
instructions.
29Performance Measures
- Mflops (Millions of floating point operations per
second) - measures how quickly a computer can perform
floating-point operations such as add, subtract,
multiply, and divide. - Speedup
- measures the benefit of parallelism.
- It shows how your program scales as you compute
with more processors, compared to the performance
on one processor. - Ideal speedup happens when the performance gain
is linearly proportional to the number of
processors used. - Benchmarks
- are used to rate the performance of parallel
computers and parallel programs. - A well known benchmark that is used to compare
parallel computers is the Linpack benchmark. - Based on the Linpack results, a list is produced
of the Top 500 Supercomputer Sites. This list is
maintained by the University of Tennessee and the
University of Mannheim.
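- A rough worked example with illustrative numbers: a 250 MHz processor that can complete one fused multiply-add (two floating point operations) every cycle has a theoretical peak of 500 Mflops, and a 4-way superscalar design at that clock rate can issue up to 1000 MIPS.
- Similarly, if a program takes 100 seconds on one processor and 30 seconds on four processors, its speedup is 100/30, or about 3.3, compared with an ideal speedup of 4.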
30More Parallelism Issues
- Load balancing
- is the technique of evenly dividing the workload
among the processors. - For data parallelism it involves how iterations
of loops are allocated to processors. - Load balancing is important because the total
time for the program to complete is the time
spent by the longest executing thread. - The problem size
- must be large and must be able to grow as you
compute with more processors. - In order to get the performance you expect from a
parallel computer you need to run a large
application with large data sizes, otherwise the
overhead of passing information between
processors will dominate the calculation time. - Good software tools
- are essential for users of high performance
parallel computers. - These tools include
- parallel compilers
- parallel debuggers
- performance analysis tools
- parallel math software
- The availability of a broad set of application
software is also important.
31More Parallelism Issues
- The high performance computing market is risky
and chaotic. Many supercomputer vendors are no
longer in business, making the portability of
your application very important. - A workstation farm
- is defined as a fast network connecting
heterogeneous workstations. - The individual workstations serve as desktop
systems for their owners. - When they are idle, large problems can take
advantage of the unused cycles in the whole
system. - An application of this concept is the SETI
project. You can participate in searching for
extraterrestrial intelligence with your home PC.
More information about this project is available
at the SETI Institute. - Condor
- is software that provides resource management
services for applications that run on
heterogeneous collections of workstations. - Miron Livny at the University of Wisconsin at
Madison is the director of the Condor project,
and has coined the phrase high throughput
computing to describe this process of harnessing
idle workstation cycles. More information is
available at the Condor Home Page.
32Agenda
- 1 Parallel Computing Overview
- 1.1 Introduction to Parallel Computing
- 1.2 Comparison of Parallel Computers
- 1.2.1 Processors
- 1.2.2 Memory Organization
- 1.2.3 Flow of Control
- 1.2.4 Interconnection Networks
- 1.2.4.1 Bus Network
- 1.2.4.2 Cross-Bar Switch Network
- 1.2.4.3 Hypercube Network
- 1.2.4.4 Tree Network
- 1.2.4.5 Interconnection Networks Self-test
- 1.2.5 Summary of Parallel Computer
Characteristics - 1.3 Summary
33Comparison of Parallel Computers
- Now you can explore the hardware components of
parallel computers - kinds of processors
- types of memory organization
- flow of control
- interconnection networks
- You will see what is common to these parallel
computers, and what makes each one of them
unique.
34Kinds of Processors
- There are three types of parallel computers
- computers with a small number of powerful
processors - Typically have tens of processors.
- The cooling of these computers often requires
very sophisticated and expensive equipment,
making these computers very expensive for
computing centers. - They are general-purpose computers that perform
especially well on applications that have large
vector lengths. - The examples of this type of computer are the
Cray SV1 and the Fujitsu VPP5000.
35Kinds of Processors
- There are three types of parallel computers
- computers with a large number of less powerful
processors
- Such computers, called Massively Parallel
Processors (MPPs), typically have thousands of
processors.
- The processors are usually proprietary and
air-cooled.
- Because of the large number of processors, the
distance between the furthest processors can be
quite large, requiring a sophisticated internal
network that allows distant processors to
communicate with each other quickly.
- These computers are suitable for applications
with a high degree of concurrency. - The MPP type of computer was popular in the
1980s. - Examples of this type of computer were the
Thinking Machines CM-2 computer, and the
computers made by the MassPar company.
36Kinds of Processors
- There are three types of parallel computers
- computers that are medium scale in between the
two extremes - Typically have hundreds of processors.
- The processor chips are usually not proprietary;
rather, they are commodity processors like the
Pentium III.
- These are general-purpose computers that perform
well on a wide range of applications. - The most common example of this class is the
Linux Cluster.
37Trends and Examples
- Processor trends
- The processors on today's commonly used parallel
computers
Decade Processor Type Computer Example
1970s Pipelined, Proprietary Cray-1
1980s Massively Parallel, Proprietary Thinking Machines CM2
1990s Superscalar, RISC, Commodity SGI Origin2000
2000s CISC, Commodity Workstation Clusters
Computer Processor
SGI Origin2000 MIPS RISC R12000
HP V-Class HP PA 8200
Cray T3E Compaq Alpha
IBM SP IBM Power3
Workstation Clusters Intel Pentium III, Intel Itanium
38Memory Organization
- The following paragraphs describe the three types
of memory organization found on parallel
computers - distributed memory
- shared memory
- distributed shared memory
39Distributed Memory
- In distributed memory computers, the total memory
is partitioned into memory that is private to
each processor. - There is a Non-Uniform Memory Access time (NUMA),
which is proportional to the distance between the
two communicating processors.
- On NUMA computers, data is accessed the quickest
from a private memory, while data from the most
distant processor takes the longest to access. - Some examples are the Cray T3E, the IBM SP, and
workstation clusters.
40Distributed Memory
- When programming distributed memory computers,
the code and the data should be structured such
that the bulk of a processor's data accesses are
to its own private (local) memory.
- This is called having good data locality.
- Today's distributed memory computers use message
passing such as MPI to communicate between
processors as shown in the following example
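- A minimal sketch of this message-passing style, assuming a two-process MPI job and an illustrative 100-element array (the names here are not from the tutorial):

  PROGRAM send_recv
  USE mpi
  IMPLICIT NONE
  INTEGER :: rank, ierr, status(MPI_STATUS_SIZE)
  REAL :: data(100)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  IF (rank == 0) THEN
     data = 1.0
     ! Process 0 holds the array in its own private memory and sends a copy.
     CALL MPI_SEND(data, 100, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
  ELSE IF (rank == 1) THEN
     ! Process 1 receives the message into its own private memory.
     CALL MPI_RECV(data, 100, MPI_REAL, 0, 99, MPI_COMM_WORLD, status, ierr)
  END IF
  CALL MPI_FINALIZE(ierr)
  END PROGRAM send_recv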
41Distributed Memory
- One advantage of distributed memory computers is
that they are easy to scale. As the demand for
resources grows, computer centers can easily add
more memory and processors. - This is often called the LEGO block approach.
- The drawback is that programming of distributed
memory computers can be quite complicated.
42Shared Memory
- In shared memory computers, all processors have
access to a single pool of centralized memory
with a uniform address space. - Any processor can address any memory location at
the same speed so there is Uniform Memory Access
time (UMA). - Processors communicate with each other through
the shared memory.
- The advantages and disadvantages of shared memory
machines are roughly the opposite of distributed
memory computers. - They are easier to program because they resemble
the programming of single processor machines - But they don't scale like their distributed
memory counterparts
43Distributed Shared Memory
- In Distributed Shared Memory (DSM) computers, a
cluster or partition of processors has access to
a common shared memory. - It accesses the memory of a different processor
cluster in a NUMA fashion. - Memory is physically distributed but logically
shared. - Attention to data locality again is important.
- Distributed shared memory computers combine the
best features of both distributed memory
computers and shared memory computers. - That is, DSM computers have both the scalability
of distributed memory computers and the ease of
programming of shared memory computers. - Some examples of DSM computers are the SGI
Origin2000 and the HP V-Class computers.
44Trends and Examples
- Memory organization trends
- The memory organization of today's commonly used
parallel computers
Decade Memory Organization Example
1970s Shared Memory Cray-1
1980s Distributed Memory Thinking Machines CM-2
1990s Distributed Shared Memory SGI Origin2000
2000s Distributed Memory Workstation Clusters
Computer Memory Organization
SGI Origin2000 DSM
HP V-Class DSM
Cray T3E Distributed
IBM SP Distributed
Workstation Clusters Distributed
45Flow of Control
- When you look at the control of flow you will see
three types of parallel computers - Single Instruction Multiple Data (SIMD)
- Multiple Instruction Multiple Data (MIMD)
- Single Program Multiple Data (SPMD)
46Flynn's Taxonomy
- Flynn's Taxonomy, devised in 1972 by Michael
Flynn of Stanford University, describes computers
by how streams of instructions interact with
streams of data. - There can be single or multiple instruction
streams, and there can be single or multiple data
streams. This gives rise to 4 types of computers
as shown in the diagram below
- Flynn's taxonomy names the 4 computer types SISD,
MISD, SIMD and MIMD. - Of these 4, only SIMD and MIMD are applicable to
parallel computers. - Another computer type, SPMD, is a special case of
MIMD.
47SIMD Computers
- SIMD stands for Single Instruction Multiple Data.
- Each processor follows the same set of
instructions. - With different data elements being allocated to
each processor. - SIMD computers have distributed memory with
typically thousands of simple processors, and the
processors run in lock step. - SIMD computers, popular in the 1980s, are useful
for fine grain data parallel applications, such
as neural networks.
- Some examples of SIMD computers were the Thinking
Machines CM-2 computer and the computers from the
MassPar company. - The processors are commanded by the global
controller that sends instructions to the
processors. - It says add, and they all add.
- It says shift to the right, and they all shift to
the right. - The processors are like obedient soldiers,
marching in unison.
48MIMD Computers
- MIMD stands for Multiple Instruction Multiple
Data. - There are multiple instruction streams with
separate code segments distributed among the
processors. - MIMD is actually a superset of SIMD, so that the
processors can run the same instruction stream or
different instruction streams.
- In addition, there are multiple data streams:
different data elements are allocated to each
processor.
- MIMD computers can have either distributed memory
or shared memory.
- While the processors on SIMD computers run in
lock step, the processors on MIMD computers run
independently of each other. - MIMD computers can be used for either data
parallel or task parallel applications. - Some examples of MIMD computers are the SGI
Origin2000 computer and the HP V-Class computer.
49SPMD Computers
- SPMD stands for Single Program Multiple Data.
- SPMD is a special case of MIMD.
- SPMD execution happens when a MIMD computer is
programmed to have the same set of instructions
per processor. - With SPMD computers, while the processors are
running the same code segment, each processor can
run that code segment asynchronously. - Unlike SIMD, the synchronous execution of
instructions is relaxed.
- An example is the execution of an if statement on
an SPMD computer (a short sketch follows this list).
- Because each processor computes with its own
partition of the data elements, it may evaluate the
condition of the if statement differently from
another processor.
- One processor may take a certain branch of the if
statement, and another processor may take a
different branch of the same if statement.
- Hence, even though each processor has the same set
of instructions, those instructions may be evaluated
in a different order from one processor to the next.
- The analogies we used for describing SIMD computers
can be modified for MIMD computers.
- Instead of the SIMD obedient soldiers, all marching
in unison, in the MIMD world the processors march to
the beat of their own drummer.
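- A minimal sketch of this branching behavior, assuming an MPI-style SPMD job in which each process derives an illustrative data value from its own rank:

  PROGRAM spmd_branch
  USE mpi
  IMPLICIT NONE
  INTEGER :: rank, ierr
  REAL :: x
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  ! Every process runs this same code, but each holds different data,
  ! so the IF condition can evaluate differently on each process.
  x = REAL(rank)
  IF (x > 1.0) THEN
     PRINT *, 'process', rank, 'took the first branch'
  ELSE
     PRINT *, 'process', rank, 'took the second branch'
  END IF
  CALL MPI_FINALIZE(ierr)
  END PROGRAM spmd_branch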
50Summary of SIMD versus MIMD
                 SIMD                      MIMD
Memory           distributed memory        distributed memory or shared memory
Code Segment     same per processor        same or different
Processors Run   in lock step              asynchronously
Data Elements    different per processor   different per processor
Applications     data parallel             data parallel or task parallel
51Trends and Examples
- Flow of control trends
- The flow of control on today's commonly used parallel computers
Decade Flow of Control Computer Example
1980's SIMD Thinking Machines CM-2
1990's MIMD SGI Origin2000
2000's MIMD Workstation Clusters
Computer Flow of Control
SGI Origin2000 MIMD
HP V-Class MIMD
Cray T3E MIMD
IBM SP MIMD
Workstation Clusters MIMD
52Agenda
- 1 Parallel Computing Overview
- 1.1 Introduction to Parallel Computing
- 1.2 Comparison of Parallel Computers
- 1.2.1 Processors
- 1.2.2 Memory Organization
- 1.2.3 Flow of Control
- 1.2.4 Interconnection Networks
- 1.2.4.1 Bus Network
- 1.2.4.2 Cross-Bar Switch Network
- 1.2.4.3 Hypercube Network
- 1.2.4.4 Tree Network
- 1.2.4.5 Interconnection Networks Self-test
- 1.2.5 Summary of Parallel Computer
Characteristics - 1.3 Summary
53Interconnection Networks
- What exactly is the interconnection network?
- The interconnection network is made up of the
wires and cables that define how the multiple
processors of a parallel computer are connected
to each other and to the memory units. - The time required to transfer data is dependent
upon the specific type of the interconnection
network. - This transfer time is called the communication
time. - What network characteristics are important?
- Diameter: the maximum distance that data must
travel for 2 processors to communicate.
- Bandwidth: the amount of data that can be sent
through a network connection.
- Latency: the delay on a network while a data
packet is being stored and forwarded.
- Types of Interconnection Networks
- The network topologies (geometric arrangements of
the computer network connections) are
- Bus
- Cross-bar Switch
- Hypercube
- Tree
54Interconnection Networks
- The aspects of network issues are
- Cost
- Scalability
- Reliability
- Suitable Applications
- Data Rate
- Diameter
- Degree
- General Network Characteristics
- Some networks can be compared in terms of their
degree and diameter.
- Degree: how many communicating wires are coming
out of each processor.
- A large degree is a benefit because it provides
multiple paths.
- Diameter: the distance between the two
processors that are farthest apart.
55Bus Network
- Bus topology is the original coaxial cable-based
Local Area Network (LAN) topology in which the
medium forms a single bus to which all stations
are attached.
- The positive aspects
- It is a mature technology that is well known
and reliable.
- The cost is very low.
- It is simple to construct.
- The negative aspects
- Limited data transmission rate.
- Not scalable in terms of performance.
- Example: SGI Power Challenge.
- It only scaled to 18 processors.
56Cross-Bar Switch Network
- A cross-bar switch is a network that works
through a switching mechanism to access shared
memory.
- It scales better than the bus network, but it
costs significantly more.
- The telephone system uses this type of network.
An example of a computer with this type of
network is the HP V-Class.
- Here is a diagram of a cross-bar switch network
which shows the processors talking through the
switchboxes to store or retrieve data in memory. - There are multiple paths for a processor to
communicate with a certain memory. - The switches determine the optimal route to take.
57Hypercube Network
- In a hypercube network, the processors are
connected as if they were corners of a
multidimensional cube. Each node in an N
dimensional cube is directly connected to N other
nodes.
- The fact that the number of directly connected,
"nearest neighbor", nodes increases with the
total size of the network is also highly
desirable for a parallel computer. - The degree of a hypercube network is log n and
the diameter is log n, where n is the number of
processors. - Examples of computers with this type of network
are the CM-2, NCUBE-2, and the Intel iPSC860.
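- For example, with n = 16 processors the hypercube is 4-dimensional: each processor is wired directly to log2(16) = 4 neighbors, and a message needs at most 4 hops to reach any other processor.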
58Tree Network
- The processors are the bottom nodes of the tree.
For a processor to retrieve data, it must go up
in the network and then go back down. - This is useful for decision making applications
that can be mapped as trees.
- The degree of a tree network is 1. The diameter
of the network is 2 log(n+1) - 2, where n is the
number of processors.
- The Thinking Machines CM-5 is an example of a
parallel computer with this type of network.
- Tree networks are very suitable for database
applications because they allow multiple searches
through the database at a time.
59Interconnection Networks
- Torus Network: a mesh with wrap-around
connections in both the x and y directions.
- Multistage Network: a network with more than one
networking unit.
- Fully Connected Network: a network where every
processor is connected to every other processor.
- Hypercube Network: processors are connected as if
they were corners of a multidimensional cube.
- Mesh Network: a network where each interior
processor is connected to its four nearest
neighbors.
60Interconnection Networks
- Bus Based Network: coaxial cable based LAN
topology in which the medium forms a single bus
to which all stations are attached.
- Cross-bar Switch Network: a network that works
through a switching mechanism to access shared
memory.
- Tree Network: the processors are the bottom nodes
of the tree.
- Ring Network: each processor is connected to two
others and the line of connections forms a circle.
61Summary of Parallel Computer Characteristics
- How many processors does the computer have?
- 10s?
- 100s?
- 1000s?
- How powerful are the processors?
- what's the MHz rate
- what's the MIPS rate
- What's the instruction set architecture?
- RISC
- CISC
62Summary of Parallel Computer Characteristics
- How much memory is available?
- total memory
- memory per processor
- What kind of memory?
- distributed memory
- shared memory
- distributed shared memory
- What type of flow of control?
- SIMD
- MIMD
- SPMD
63Summary of Parallel Computer Characteristics
- What is the interconnection network?
- Bus
- Crossbar
- Hypercube
- Tree
- Torus
- Multistage
- Fully Connected
- Mesh
- Ring
- Hybrid
64Design decisions made by some of the major
parallel computer vendors
Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet Tree
65Summary
- This completes our introduction to parallel
computing. - You have learned about parallelism in computer
programs, and also about parallelism in the
hardware components of parallel computers. - In addition, you have learned about the commonly
used parallel computers, and how these computers
compare to each other. - There are many good texts which provide an
introductory treatment of parallel computing.
Here are two useful references:
- Highly Parallel Computing, Second Edition. George
S. Almasi and Allan Gottlieb. Benjamin/Cummings
Publishers, 1994.
- Parallel Computing: Theory and Practice. Michael
J. Quinn. McGraw-Hill, Inc., 1994.