Title: Computer Architectures ... High Performance Computing I
1 Computer Architectures ... High Performance Computing I
- Fall 2001
- MAE609 / MTH667
- Abani Patra
2 Microprocessor
- Basic Architecture
- CISC vs. RISC
- Superscalar
- EPIC
3 Performance
- Measures
- Floating Point Operations Per Second (FLOPS)
- 1 MFLOPS: workstations
- 1 GFLOPS: readily available HPC
- 1 TFLOPS: the best now!!
- 1 PFLOPS: by 2010??
4 Performance
- T_theor: theoretical peak performance, obtained by multiplying the clock rate by the number of CPUs and the number of FPUs per CPU
- T_real: real performance on some specific operation, e.g. vector add and multiply
- T_sustained: sustained performance on an application, e.g. CFD
- T_sustained << T_real << T_theor
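- For example (hypothetical numbers): a machine with 4 CPUs at 500 MHz and 2 FPUs per CPU has T_theor = 500 x 4 x 2 = 4000 MFLOPS = 4 GFLOPS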
5 Performance
- Performance degrades if the CPU has to wait for data to operate on
- Fast CPU => need adequately fast memory
- Rule of thumb:
- Memory in MB ~ T_theor in MFLOPS
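- e.g. (illustrative numbers): a node with T_theor = 512 MFLOPS would, by this rule, want roughly 512 MB of memory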
6 Making a Supercomputer Faster
- Reduce Cycle time
- Pipelining
- Instruction Pipelines
- Vector Pipelines
- Internal Parallelism
- Superscalar
- EPIC
- External Parallelism
7 Making a Supercomputer Faster
- Reduce cycle time
- increase clock rate
- limited by semiconductor manufacturing!
- current generation 1-2 GHz (immediate future 10 GHz)
- Pipelining
- fine subdivision of an operation into sub-operations, leading to shorter cycle time but larger start-up time
8 Pipelining
- 4-stage instruction pipeline: (1) Fetch Instruction, (2) Fetch Data, (3) Execute, (4) Store

        cycle:   1   2   3   4   5   6
      stage 1:   A   B   C
      stage 2:       A   B   C
      stage 3:           A   B   C
      stage 4:               A   B   C

- 4 cycles needed by each instruction
- one result per cycle after the pipe is full -- startup time
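- A quick check of the startup cost (assuming one instruction enters per cycle once the pipe is running): n instructions take 4n cycles unpipelined vs. 4 + (n - 1) cycles pipelined; for n = 100 that is 400 vs. 103 cycles, close to the ideal 4x speedup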
9 Pipelining
- Almost all current computers use some pipelining, e.g. IBM RS6000
- Speedup of instruction pipelining cannot always be achieved!!
- Next instruction may not be known till execution -- e.g. branch
- Data for execution may not be available
10 Vector Pipelines
- Effective for operations like

      do 10 i = 1, 1000
   10 c(i) = a(i) + b(i)

- same instruction executed 1000 times with different data
- using a vector pipe the whole loop is one vector instruction
- Cray XMP, YMP, T90 ...
11 Vector Pipelining
- For some operations like
      a(i) = b(i) + c(i)*d(i)
- the results of the multiply are chained into the addition pipeline
- Disadvantages
- startup time of the vector
- code has to be vectorized: loops have to be blocked into vector lengths
12 Internal Parallelism
- Use multiple functional units per processor
- Cray T90 has 2-track vector units; NEC SX4, Fujitsu VPP300 -- 8-track vector units
- superscalar, e.g. IBM RS6000 POWER2 uses 2 arithmetic units
- EPIC
- Need to provide data to multiple functional units => fast memory access
- Limiting factor is memory-processor bandwidth
13 External Parallelism
- Use multiple processors
- Shared Memory (SMP: Symmetric Multi-Processors)
- many processors accessing the same memory
- limited by memory-processor bandwidth
- SUN Ultra2, SGI Octane, SGI Onyx, Compaq ...
[Diagram: processors accessing shared memory banks]
14 External Parallelism
- Distributed memory
- many processors, each with local memory and some type of high-speed interconnect
[Diagram: CPUs with local memories connected by an interconnection network]
E.g. IBM SPx, Cray T3E, networks of workstations, Beowulf clusters of Pentium PCs
15 External Parallelism
- SMP Clusters
- nodes with multiple processors that share local memory; nodes connected by an interconnect
- best of both?
16 Classification of Computers
- Hardware
- SISD (Single Instruction Single Data)
- SIMD (Single Instruction Multiple Data)
- MIMD (Multiple Instruction Multiple Data)
- Programming Model
- SPMD (Single Program Multiple Data)
- MPMD (Multiple Program Multiple Data)
17 Hardware Classification
- SISD (Single Instruction Single Data)
- classical scalar/vector computer -- one instruction, one datum
- superscalar -- instructions may run in parallel
- SIMD (Single Instruction Multiple Data)
- vector computers
- Data Parallel -- Connection Machine etc., extinct now
18 Hardware Classification
- MIMD (Multiple Instruction Multiple Data)
- the usual parallel computer
- each processor executes its own instructions on different data streams
- needs synchronization to get meaningful results
19 Programming Model
- SPMD (Single Program Multiple Data)
- a single program is run on all processors with different data
- each processor knows its ID -- thus constructs like

      if (procid .eq. n) then
         ...
      else
         ...
      endif

  can be used for program control
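A minimal runnable sketch of this (using MPI, covered on a later slide; the master/worker split here is illustrative):

      program spmd
      include 'mpif.h'
      integer ierr, myid, nprocs
      call MPI_INIT(ierr)
c     every process runs this same program; its rank is its ID
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
c        master branch
         print *, 'master of ', nprocs, ' processes'
      else
c        worker branch
         print *, 'worker ', myid
      endif
      call MPI_FINALIZE(ierr)
      end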
20 Programming Model
- MPMD (Multiple Program Multiple Data)
- Different programs run on different processors
- usually a master-slave model is used
21 Topologies/Interconnects
22 Prototype Supercomputers and Bottlenecks
23 Types of Processors/Computers used in HPC
- Prototype processors
- Vector Processors
- Superscalar Processors
- Prototype Parallel Computers
- Shared Memory
- Without Cache
- With Cache (SMP)
- Distributed Memory
24 Vector Processors
25 Vector Processors
- Components
- Vector registers
- ADD/Logic and MULTIPLY pipelines
- Load/Store pipelines
- Scalar registers and pipelines
26 Vector Registers
- Finite length of vector registers: 32/64/128 etc.
- Strip mining to operate on longer vectors (see the sketch below)
- Codes often manually restructured into vector-length loops
- Sawtooth performance curve -- maxima at multiples of the vector length
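A minimal strip-mining sketch (the vector length of 64 and the array size are illustrative):

      program strip
      integer n, nvec, i, is
      parameter (n = 1000, nvec = 64)
      real a(n), b(n), c(n)
      do i = 1, n
         a(i) = 1.0
         b(i) = 2.0
      end do
c     the outer loop walks the arrays in register-sized strips;
c     each inner loop maps onto one vector instruction
      do is = 1, n, nvec
         do i = is, min(is + nvec - 1, n)
            c(i) = a(i) + b(i)
         end do
      end do
      end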
27 Vector Processors
- Memory-processor bandwidth
- performance depends completely on keeping the vector registers supplied with operands from memory
- Size of main memory and extended memory
- bandwidth of main memory is much higher, but main memory is more expensive
- size determines the size of problem that can be run
- scalar registers/scalar processors for scalar instructions
- I/O through a special processor
- the T90 can produce data at 14400 MB/s while a disk delivers 20 MB/s; thus a single word can take 720 cycles on the Cray T90!!
28 Superscalar Processor
- Workstations and nodes of parallel supercomputers
29 Superscalar Processor
- main components are
- multiple ALUs and FPUs
- data and instruction caches
- superscalar since the ALUs and FPUs can operate in parallel, producing more than one result per cycle
- e.g. IBM POWER2 -- 2 FPUs/ALUs, each operating in parallel, producing up to 4 results per cycle if operands are in registers
30 Superscalar Processor
- RISC architecture operating at very high clock speeds (>1 GHz now -- more in a year)
- Processor works only on data in registers, which comes only from and goes only to the data cache. If data is not in cache -- a cache miss -- the processor sits idle while another cache line (4-16 words) is fetched from memory!!
31 Superscalar Processor
- Large off-chip Level 2 caches help with data availability. L1 cache data is accessed in 1-2 cycles, L2 cache in 3-4 cycles, and memory can take 8 times that!
- Efficiency is directly related to reuse of data in cache
- Remedies
- blocked algorithms,
- contiguous storage,
- avoiding strides and random/non-deterministic access
32 Superscalar Processor
- Remedies
- Blocked algorithms, splitting a long loop into cache-sized blocks (see the fuller sketch below):

      do i = 1, 1000            do j = 1, 20
         a(i) = ...        =>      do i = (j-1)*50 + 1, j*50
                                      a(i) = ...

- contiguous storage,
- avoid strides and random/non-deterministic access, e.g. indirect addressing
      a(ix(i)) = ...
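A fuller cache-blocking sketch (a standard blocked matrix multiply; the size n = 512 and block size nb = 64 are illustrative):

      program blkmm
      integer n, nb, i, j, k, jj, kk
      parameter (n = 512, nb = 64)
      real a(n,n), b(n,n), c(n,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = 1.0
            b(i,j) = 1.0
            c(i,j) = 0.0
         end do
      end do
c     the jj/kk loops tile the computation so that an nb x nb
c     block of b stays in cache and is reused by the inner loops
      do jj = 1, n, nb
         do kk = 1, n, nb
            do j = jj, min(jj + nb - 1, n)
               do k = kk, min(kk + nb - 1, n)
                  do i = 1, n
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
                  end do
               end do
            end do
         end do
      end do
      end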
33 Superscalar Processors
- Memory bandwidth is critical to performance
- Many engineering applications are difficult to optimize for cache efficiency
- Application efficiency is tied to memory bandwidth
- Size of memory determines the size of problem that can be solved
- DMA (direct memory access) channels take memory-access duties for external requests (I/O, remote processor requests) away from the CPU
34 Shared Memory Parallel Computer
- Memory in banks is accessed equally through a switch (crossbar) by the processors (usually vector)
- Processors run p independent tasks with possibly shared data
- Usually compilers and preprocessors can extract the fine-grained parallelism available
E.g. Cray T90
35 Shared Memory Parallel ...
- Memory contention and bandwidth limit the number of processors that may be connected
- Memory contention can be reduced by increasing the number of banks and reducing the bank busy time (bbt)
- This type of parallel computer is closest in programming model to the general-purpose single-processor computer
36 Symmetric Multiprocessors (SMP)
- Processors are usually superscalar -- SUN Ultra, MIPS R10000 -- with large caches
- A bus/crossbar is used to connect to the memory modules
- For a bus -- only 1 processor can access memory at a time
[Diagram: processors on a bus/crossbar to memory modules M1, M2, M3]
E.g. Sun Ultra Enterprise 10000, SGI Power Challenge
37 Symmetric Multi-Processors
- With an interconnect -- there will be memory contention
- Data flows from memory to cache to processors
- Cache coherence
- If a piece of data is changed in one cache, then all other caches that contain that data must update their value. Hardware and software must take care of this.
38 Symmetric Multi-Processors
- Performance depends dramatically on the reuse of data in cache
- Fetching data from the larger memory, with potential memory contention, can be expensive!
- Caches and cache lines are also bigger
- The large L2 cache really plays the role of local fast memory, with the memory banks more like extended memory accessed in blocks
39 Distributed Memory Parallel Computer
- Processors are superscalar RISC with only LOCAL memory
- Each processor can only work on data in its local memory
- Communication is required for access to remote memory
E.g. IBM SP, Intel Paragon, SGI Origin2000
40 Distributed Memory Parallel Computer
- Problems need to be broken up into independent tasks with independent memory -- this naturally matches a data-based decomposition of the problem using an owner-computes rule
- Parallelization is mostly at a high granularity level, controlled by the user -- difficult for compilers/automatic parallelization tools
- Computers are scalable to very large numbers of processors
41 Distributed Memory Parallel Computer
- NUMA (non-uniform memory access) based classification
- Intel's ASCI Red (the 1st teraflop machine) had multiple Pentium Pros per node sharing a bus
- The HP Exemplar has a bus at the node
42 Distributed Memory Parallel Computer
- Semi-autonomous memory: processors can access remote memory using memory control units (MCUs)
- CRAY T3E and SGI Origin 2000
[Diagram: nodes with MCUs on a communication network]
43 Distributed Memory Parallel Computer
- Memory and processors are equally distributed over the network
- The Tera MTA is the only example
- Latency and data transfer from memory are at the speed of the network!
[Diagram: memories (M) and processors (P) attached directly to the communication network]
44 Accessing Distributed Memory
- Message Passing
- User transfers all data using explicit send/receive instructions (see the sketch below)
- synchronous message passing can be slow
- Programming with a NEW programming model!
- User must optimize communication
- asynchronous/one-sided get and put are faster but need more care in programming
- Codes used to be machine-specific -- Intel NX etc. -- until standardized as PVM (Parallel Virtual Machine) and subsequently MPI (Message Passing Interface)
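A minimal send/receive sketch in MPI (the buffer size, contents and message tag are illustrative):

      program sndrcv
      include 'mpif.h'
      integer ierr, myid, i, status(MPI_STATUS_SIZE)
      real buf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      if (myid .eq. 0) then
         do i = 1, 100
            buf(i) = real(i)
         end do
c        process 0 explicitly sends 100 reals to process 1 (tag 99)
         call MPI_SEND(buf, 100, MPI_REAL, 1, 99,
     &                 MPI_COMM_WORLD, ierr)
      else if (myid .eq. 1) then
c        process 1 blocks until the matching message arrives
         call MPI_RECV(buf, 100, MPI_REAL, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end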
45 Accessing Distributed Memory
- Global distributed memory
- Physically distributed and globally addressable -- Cray T3E / SGI Origin 2000
- User formally accesses remote memory as if it were local -- the operating system/compilers translate such accesses into fetches/stores over the communication network
- High Performance Fortran (HPF) -- a software realization of distributed memory -- arrays etc. can be distributed when declared, using compiler directives (see the sketch below). The compiler translates remote memory accesses into the appropriate calls (message passing / OS calls, as supported by the hardware)
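A small HPF sketch (the array sizes, 4-processor layout and BLOCK distribution are illustrative):

      program hpfex
      real a(1000), b(1000)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK) ONTO p
!HPF$ ALIGN b(i) WITH a(i)
      b = 1.0
c     each processor updates the block of a it owns; any remote
c     elements needed are fetched by compiler-generated communication
      a = b + 1.0
      end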
46 Processor interconnects/topologies
- Buses
- Lower cost -- but only one pair of devices (processors/memories etc.) can communicate at a time, e.g. Ethernet used to link workstation networks
- Switches
- Like the telephone network -- can sustain many simultaneous communications; higher cost!
- The critical measure is bisection bandwidth -- how much data can be passed between the two halves of the machine
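- e.g. for a 2D mesh of p processors, cutting the machine in half severs only about sqrt(p) links, so bisection bandwidth grows as sqrt(p); a crossbar's grows linearly in p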
47 Processor interconnects/topologies
48 Processor interconnects/topologies
49 Processor interconnects/topologies
- Workstation network on Ethernet
- Very high latency -- processors must participate in communication
50 Processor interconnects/topologies
- 1D and 2D meshes and rings/tori
51 Processor interconnects/topologies
- 3D meshes and rings/tori
52 Processor interconnects/topologies
- d-dimensional hypercubes
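- A d-dimensional hypercube has 2^d nodes, each with d links, and a diameter of d hops -- e.g. d = 10 gives 1024 nodes at most 10 hops apart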
53 Processor Scheduling
- Space sharing
- Processor banks of 4/8/16 etc. assigned to users for specific times
- Time sharing on processor partitions
- Livermore gang scheduling
54 IBM RS/6000 SP
- Distributed Memory Parallel Computer
- An assembly of workstations using the HPS (a crossbar-type switch)
- Comes with a choice of processors -- POWER2 (variants), POWER3, and clusters of PowerPC (also used by Apple G3, G4 etc.)
55 POWER2 Processor
- Different versions -- with different frequencies, cache sizes and bandwidths
56 POWER2 ARCHITECTURE
57 POWER2
- Dual fixed-point/floating-point units -- a multiply/add in each
- Max. 4 floating-point results per cycle
- The ICU (with a 32 KB instruction cache) can execute a branch and a condition per cycle
- Per cycle, 8 instructions may be issued and executed -- truly SUPERSCALAR!
58 Wide 77 MHz Node Performance
- Theoretical peak performance
- 2 x 77 = 154 MFLOPS for a dyad
- 4 x 77 = 308 MFLOPS for a triad
- Cache effects dominate performance
- 256 KB cache and a 256-bit path to cache and from cache to memory -- 2 words (8 bytes each) may be fetched and 2 words stored per cycle
59 Expected Performance
- For a dyad a(i) = b(i)*c(i) or a(i) = b(i)+c(i) -- needs 2 loads and 1 store, i.e. 6 memory references to feed the 2 FPUs -- only 4 are available
- (2 x 77) x (4/6) = 102.7 MFLOPS
- For a linked triad
- a(i) = b(i) + s*c(i) (2 loads, 1 store)
- (4 x 77) x (4/6) = 205.3 MFLOPS
- For a vector triad
- a(i) = b(i) + c(i)*d(i) (3 loads, 1 store)
- (4 x 77) x (4/8) = 154 MFLOPS
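- The scale factors follow from the memory path of the previous slide: at 2 results per cycle these kernels need 6 (dyad, linked triad) or 8 (vector triad) memory references per cycle, but the cache path supplies only 4 (2 loads + 2 stores), so peak is scaled by 4/6 or 4/8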
60 Cache Hit/Miss
- The performance numbers above assume that data is available in cache
- If data is not in cache it must be fetched in cache lines of 256 bytes each from memory, at a much slower pace
62 TERM PAPER
- Based on the analysis of the POWER2 processor and IBM SP presented here, prepare a similar analysis (including estimates of performance) for the new POWER4 chip in the IBM SP, or for a cluster of Pentium 4s.