Title: Computer Architectures ... High Performance Computing I
1 Computer Architectures ... High Performance Computing I
- Fall 2001
- MAE609 / MTH667
- Abani Patra
2 Microprocessor
- Basic Architecture
- CISC vs. RISC
- Superscalar
- EPIC
3 Performance
- Measures
- Floating Point Operations Per Second (FLOPS)
- 1 MFLOPS: workstations
- 1 GFLOPS: readily available HPC
- 1 TFLOPS: the best now!!
- 1 PFLOPS: by 2010??
4 Performance
- T_theor: theoretical peak performance, obtained by multiplying the clock rate by the number of CPUs and the number of FPUs per CPU
- T_real: real performance on some specific operation, e.g. vector add and multiply
- T_sustained: sustained performance on an application, e.g. CFD
- T_sustained << T_real << T_theor
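- For example (hypothetical numbers): a machine with 4 CPUs at 500 MHz and 2 FPUs per CPU has T_theor = 500 x 4 x 2 = 4000 MFLOPS = 4 GFLOPS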
5 Performance
- Performance degrades if the CPU has to wait for data to operate on
- Fast CPU => need adequately fast memory
- Rule of thumb:
- Memory in MB ~ T_theor in MFLOPS
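- e.g. (illustrative numbers): a node with T_theor = 512 MFLOPS would, by this rule, want roughly 512 MB of memory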
6 Making a Supercomputer Faster
- Reduce Cycle time
- Pipelining
- Instruction Pipelines
- Vector Pipelines
- Internal Parallelism
- Superscalar
- EPIC
- External Parallelism
7 Making a Supercomputer Faster
- Reduce cycle time
- increase clock rate
- limited by semiconductor manufacturing!
- current generation 1-2 GHz (immediate future 10 GHz)
- Pipelining
- fine subdivision of an operation into sub-operations, leading to shorter cycle time but larger start-up time
8 Pipelining
- 4-stage instruction pipeline: (1) Fetch Instruction, (2) Fetch Data, (3) Execute, (4) Store

        cycle:   1   2   3   4   5   6
      stage 1:   A   B   C
      stage 2:       A   B   C
      stage 3:           A   B   C
      stage 4:               A   B   C

- 4 cycles needed by each instruction
- one result per cycle after the pipe is full -- startup time
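- A quick check of the startup cost (assuming one instruction enters per cycle once the pipe is running): n instructions take 4n cycles unpipelined vs. 4 + (n - 1) cycles pipelined; for n = 100 that is 400 vs. 103 cycles, close to the ideal 4x speedup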
9 Pipelining
- Almost all current computers use some pipelining, e.g. IBM RS6000
- Speedup of instruction pipelining cannot always be achieved!!
- Next instruction may not be known till execution -- e.g. branch
- Data for execution may not be available
10 Vector Pipelines
- Effective for operations like

      do 10 i = 1, 1000
   10 c(i) = a(i) + b(i)

- same instruction executed 1000 times with different data
- using a vector pipe the whole loop is one vector instruction
- Cray XMP, YMP, T90 ...
11 Vector Pipelining
- For some operations like
      a(i) = b(i) + c(i)*d(i)
- the results of the multiply are chained into the addition pipeline
- Disadvantages
- startup time of the vector
- code has to be vectorized: loops have to be blocked into vector lengths
12 Internal Parallelism
- Use multiple functional units per processor
- Cray T90 has 2-track vector units; NEC SX4, Fujitsu VPP300 -- 8-track vector units
- superscalar, e.g. IBM RS6000 POWER2 uses 2 arithmetic units
- EPIC
- Need to provide data to multiple functional units => fast memory access
- Limiting factor is memory-processor bandwidth
13 External Parallelism
- Use multiple processors
- Shared Memory (SMP: Symmetric Multi-Processors)
- many processors accessing the same memory
- limited by memory-processor bandwidth
- SUN Ultra2, SGI Octane, SGI Onyx, Compaq ...
[Diagram: processors accessing shared memory banks]
14 External Parallelism
- Distributed memory
- many processors, each with local memory and some type of high-speed interconnect
[Diagram: CPUs with local memories connected by an interconnection network]
E.g. IBM SPx, Cray T3E, networks of workstations, Beowulf clusters of Pentium PCs
15 External Parallelism
- SMP Clusters
- nodes with multiple processors that share local memory; nodes connected by an interconnect
- best of both?
16 Classification of Computers
- Hardware
- SISD (Single Instruction Single Data)
- SIMD (Single Instruction Multiple Data)
- MIMD (Multiple Instruction Multiple Data)
- Programming Model
- SPMD (Single Program Multiple Data)
- MPMD (Multiple Program Multiple Data)
17 Hardware Classification
- SISD (Single Instruction Single Data)
- classical scalar/vector computer -- one instruction, one datum
- superscalar -- instructions may run in parallel
- SIMD (Single Instruction Multiple Data)
- vector computers
- Data Parallel -- Connection Machine etc., extinct now
18 Hardware Classification
- MIMD (Multiple Instruction Multiple Data)
- the usual parallel computer
- each processor executes its own instructions on different data streams
- needs synchronization to get meaningful results
19 Programming Model
- SPMD (Single Program Multiple Data)
- a single program is run on all processors with different data
- each processor knows its ID -- thus constructs like

      if (procid .eq. n) then
         ...
      else
         ...
      endif

  can be used for program control
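A minimal runnable sketch of this (using MPI, covered on a later slide; the master/worker split here is illustrative):

      program spmd
      include 'mpif.h'
      integer ierr, myid, nprocs
      call MPI_INIT(ierr)
c     every process runs this same program; its rank is its ID
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
c        master branch
         print *, 'master of ', nprocs, ' processes'
      else
c        worker branch
         print *, 'worker ', myid
      endif
      call MPI_FINALIZE(ierr)
      end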
20 Programming Model
- MPMD (Multiple Program Multiple Data)
- Different programs run on different processors
- usually a master-slave model is used
21 Topologies/Interconnects
22 Prototype Supercomputers and Bottlenecks
23 Types of Processors/Computers used in HPC
- Prototype processors
- Vector Processors
- Superscalar Processors
- Prototype Parallel Computers
- Shared Memory
- Without Cache
- With Cache (SMP)
- Distributed Memory
24 Vector Processors
25 Vector Processors
- Components
- Vector registers
- ADD/Logic and MULTIPLY pipelines
- Load/Store pipelines
- Scalar registers and pipelines
26 Vector Registers
- Finite length of vector registers: 32/64/128 etc.
- Strip mining to operate on longer vectors (see the sketch below)
- Codes often manually restructured into vector-length loops
- Sawtooth performance curve -- maxima at multiples of the vector length
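A minimal strip-mining sketch (the vector length of 64 and the array size are illustrative):

      program strip
      integer n, nvec, i, is
      parameter (n = 1000, nvec = 64)
      real a(n), b(n), c(n)
      do i = 1, n
         a(i) = 1.0
         b(i) = 2.0
      end do
c     the outer loop walks the arrays in register-sized strips;
c     each inner loop maps onto one vector instruction
      do is = 1, n, nvec
         do i = is, min(is + nvec - 1, n)
            c(i) = a(i) + b(i)
         end do
      end do
      end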
27 Vector Processors
- Memory-processor bandwidth
- performance depends completely on keeping the vector registers supplied with operands from memory
- Size of main memory and extended memory
- bandwidth of main memory is much higher, but main memory is more expensive
- size determines the size of problem that can be run
- scalar registers/scalar processors for scalar instructions
- I/O through a special processor
- the T90 can produce data at 14400 MB/s while a disk delivers 20 MB/s; thus a single word can take 720 cycles on the Cray T90!!
28 Superscalar Processor
- Workstations and nodes of parallel supercomputers
29 Superscalar Processor
- main components are
- multiple ALUs and FPUs
- data and instruction caches
- superscalar since the ALUs and FPUs can operate in parallel, producing more than one result per cycle
- e.g. IBM POWER2 -- 2 FPUs/ALUs, each operating in parallel, producing up to 4 results per cycle if operands are in registers
30 Superscalar Processor
- RISC architecture operating at very high clock speeds (>1 GHz now -- more in a year)
- Processor works only on data in registers, which comes only from and goes only to the data cache. If data is not in cache -- a cache miss -- the processor sits idle while another cache line (4-16 words) is fetched from memory!!
31 Superscalar Processor
- Large off-chip Level 2 caches help with data availability. L1 cache data is accessed in 1-2 cycles, L2 cache in 3-4 cycles, and memory can take 8 times that!
- Efficiency is directly related to reuse of data in cache
- Remedies
- blocked algorithms,
- contiguous storage,
- avoiding strides and random/non-deterministic access
32 Superscalar Processor
- Remedies
- Blocked algorithms, splitting a long loop into cache-sized blocks (see the fuller sketch below):

      do i = 1, 1000            do j = 1, 20
         a(i) = ...        =>      do i = (j-1)*50 + 1, j*50
                                      a(i) = ...

- contiguous storage,
- avoid strides and random/non-deterministic access, e.g. indirect addressing
      a(ix(i)) = ...
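A fuller cache-blocking sketch (a standard blocked matrix multiply; the size n = 512 and block size nb = 64 are illustrative):

      program blkmm
      integer n, nb, i, j, k, jj, kk
      parameter (n = 512, nb = 64)
      real a(n,n), b(n,n), c(n,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = 1.0
            b(i,j) = 1.0
            c(i,j) = 0.0
         end do
      end do
c     the jj/kk loops tile the computation so that an nb x nb
c     block of b stays in cache and is reused by the inner loops
      do jj = 1, n, nb
         do kk = 1, n, nb
            do j = jj, min(jj + nb - 1, n)
               do k = kk, min(kk + nb - 1, n)
                  do i = 1, n
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
                  end do
               end do
            end do
         end do
      end do
      end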
33 Superscalar Processors
- Memory bandwidth is critical to performance
- Many engineering applications are difficult to optimize for cache efficiency
- Application efficiency is tied to memory bandwidth
- Size of memory determines the size of problem that can be solved
- DMA (direct memory access) channels take memory-access duties for external requests (I/O, remote processor requests) away from the CPU
34 Shared Memory Parallel Computer
- Memory in banks is accessed equally through a switch (crossbar) by the processors (usually vector)
- Processors run p independent tasks with possibly shared data
- Usually compilers and preprocessors can extract the fine-grained parallelism available
E.g. Cray T90
35 Shared Memory Parallel ...
- Memory contention and bandwidth limit the number of processors that may be connected
- Memory contention can be reduced by increasing the number of banks and reducing the bank busy time (bbt)
- This type of parallel computer is closest in programming model to the general-purpose single-processor computer
36 Symmetric Multiprocessors (SMP)
- Processors are usually superscalar -- SUN Ultra, MIPS R10000 -- with large caches
- A bus/crossbar is used to connect to the memory modules
- For a bus -- only 1 processor can access memory at a time
[Diagram: processors on a bus/crossbar to memory modules M1, M2, M3]
E.g. Sun Ultra Enterprise 10000, SGI Power Challenge
37 Symmetric Multi-Processors
- With an interconnect -- there will be memory contention
- Data flows from memory to cache to processors
- Cache coherence
- If a piece of data is changed in one cache, then all other caches that contain that data must update their value. Hardware and software must take care of this.
38 Symmetric Multi-Processors
- Performance depends dramatically on the reuse of data in cache
- Fetching data from the larger memory, with potential memory contention, can be expensive!
- Caches and cache lines are also bigger
- The large L2 cache really plays the role of local fast memory, with the memory banks more like extended memory accessed in blocks
39 Distributed Memory Parallel Computer
- Processors are superscalar RISC with only LOCAL memory
- Each processor can only work on data in its local memory
- Communication is required for access to remote memory
E.g. IBM SP, Intel Paragon, SGI Origin2000
40 Distributed Memory Parallel Computer
- Problems need to be broken up into independent tasks with independent memory -- this naturally matches a data-based decomposition of the problem using an owner-computes rule
- Parallelization is mostly at a high granularity level, controlled by the user -- difficult for compilers/automatic parallelization tools
- Computers are scalable to very large numbers of processors
41 Distributed Memory Parallel Computer
- NUMA (non-uniform memory access) based classification
- Intel's ASCI Red (the 1st teraflop machine) had multiple Pentium Pros per node sharing a bus
- The HP Exemplar has a bus at the node
42 Distributed Memory Parallel Computer
- Semi-autonomous memory: processors can access remote memory using memory control units (MCUs)
- CRAY T3E and SGI Origin 2000
[Diagram: nodes with MCUs on a communication network]
43 Distributed Memory Parallel Computer
- Memory and processors are equally distributed over the network
- The Tera MTA is the only example
- Latency and data transfer from memory are at the speed of the network!
[Diagram: memories (M) and processors (P) attached directly to the communication network]
44 Accessing Distributed Memory
- Message Passing
- User transfers all data using explicit send/receive instructions (see the sketch below)
- synchronous message passing can be slow
- Programming with a NEW programming model!
- User must optimize communication
- asynchronous/one-sided get and put are faster but need more care in programming
- Codes used to be machine-specific -- Intel NX etc. -- until standardized as PVM (Parallel Virtual Machine) and subsequently MPI (Message Passing Interface)
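A minimal send/receive sketch in MPI (the buffer size, contents and message tag are illustrative):

      program sndrcv
      include 'mpif.h'
      integer ierr, myid, i, status(MPI_STATUS_SIZE)
      real buf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      if (myid .eq. 0) then
         do i = 1, 100
            buf(i) = real(i)
         end do
c        process 0 explicitly sends 100 reals to process 1 (tag 99)
         call MPI_SEND(buf, 100, MPI_REAL, 1, 99,
     &                 MPI_COMM_WORLD, ierr)
      else if (myid .eq. 1) then
c        process 1 blocks until the matching message arrives
         call MPI_RECV(buf, 100, MPI_REAL, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end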
45 Accessing Distributed Memory
- Global distributed memory
- Physically distributed and globally addressable -- Cray T3E / SGI Origin 2000
- User formally accesses remote memory as if it were local -- the operating system/compilers translate such accesses into fetches/stores over the communication network
- High Performance Fortran (HPF) -- a software realization of distributed memory -- arrays etc. can be distributed when declared, using compiler directives (see the sketch below). The compiler translates remote memory accesses into the appropriate calls (message passing / OS calls, as supported by the hardware)
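A small HPF sketch (the array sizes, 4-processor layout and BLOCK distribution are illustrative):

      program hpfex
      real a(1000), b(1000)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK) ONTO p
!HPF$ ALIGN b(i) WITH a(i)
      b = 1.0
c     each processor updates the block of a it owns; any remote
c     elements needed are fetched by compiler-generated communication
      a = b + 1.0
      end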
46 Processor interconnects/topologies
- Buses
- Lower cost -- but only one pair of devices (processors/memories etc.) can communicate at a time, e.g. Ethernet used to link workstation networks
- Switches
- Like the telephone network -- can sustain many simultaneous communications; higher cost!
- The critical measure is bisection bandwidth -- how much data can be passed between the two halves of the machine
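- e.g. for a 2D mesh of p processors, cutting the machine in half severs only about sqrt(p) links, so bisection bandwidth grows as sqrt(p); a crossbar's grows linearly in p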
47 Processor interconnects/topologies
48 Processor interconnects/topologies
49 Processor interconnects/topologies
- Workstation network on Ethernet
- Very high latency -- processors must participate in communication
50 Processor interconnects/topologies
- 1D and 2D meshes and rings/tori
51 Processor interconnects/topologies
- 3D meshes and rings/tori
52 Processor interconnects/topologies
- d-dimensional hypercubes
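- A d-dimensional hypercube has 2^d nodes, each with d links, and a diameter of d hops -- e.g. d = 10 gives 1024 nodes at most 10 hops apart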
53 Processor Scheduling
- Space sharing
- Processor banks of 4/8/16 etc. assigned to users for specific times
- Time sharing on processor partitions
- Livermore gang scheduling
54 IBM RS/6000 SP
- Distributed Memory Parallel Computer
- An assembly of workstations using the HPS (a crossbar-type switch)
- Comes with a choice of processors -- POWER2 (variants), POWER3, and clusters of PowerPC (also used by Apple G3, G4 etc.)
55 POWER2 Processor
- Different versions -- with different frequencies, cache sizes and bandwidths
56 POWER2 ARCHITECTURE
57 POWER2
- Dual fixed-point/floating-point units -- a multiply/add in each
- Max. 4 floating-point results per cycle
- The ICU (with a 32 KB instruction cache) can execute a branch and a condition per cycle
- Per cycle, 8 instructions may be issued and executed -- truly SUPERSCALAR!
58 Wide 77 MHz Node Performance
- Theoretical peak performance
- 2 x 77 = 154 MFLOPS for a dyad
- 4 x 77 = 308 MFLOPS for a triad
- Cache effects dominate performance
- 256 KB cache and a 256-bit path to cache and from cache to memory -- 2 words (8 bytes each) may be fetched and 2 words stored per cycle
59 Expected Performance
- For a dyad a(i) = b(i)*c(i) or a(i) = b(i)+c(i) -- needs 2 loads and 1 store, i.e. 6 memory references to feed the 2 FPUs -- only 4 are available
- (2 x 77) x (4/6) = 102.7 MFLOPS
- For a linked triad
- a(i) = b(i) + s*c(i) (2 loads, 1 store)
- (4 x 77) x (4/6) = 205.3 MFLOPS
- For a vector triad
- a(i) = b(i) + c(i)*d(i) (3 loads, 1 store)
- (4 x 77) x (4/8) = 154 MFLOPS
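- The scale factors follow from the memory path of the previous slide: at 2 results per cycle these kernels need 6 (dyad, linked triad) or 8 (vector triad) memory references per cycle, but the cache path supplies only 4 (2 loads + 2 stores), so peak is scaled by 4/6 or 4/8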
60 Cache Hit/Miss
- The performance numbers above assume that data is available in cache
- If data is not in cache it must be fetched in cache lines of 256 bytes each from memory, at a much slower pace
62 TERM PAPER
- Based on the analysis of the POWER2 processor and IBM SP presented here, prepare a similar analysis (including estimates of performance) for the new POWER4 chip in the IBM SP, or for a cluster of Pentium 4s.