Title: Computer architecture II
1. Computer architecture II
2. Recap
- Importance of parallelism
- Architecture classification
- Flynn (SISD, SIMD, MISD, MIMD)
- Memory access (SM, MP)
- Clusters
- Grids
- Top500
3. Today's plan
- Parallel architecture convergence (Culler's classification)
  - Shared Memory (single address space)
  - Message Passing
  - Data Parallel (SIMD)
  - Dataflow
  - Systolic
4. Convergence of Architectural Models
- Culler's classification of parallel architectures
  - Shared Address Space
  - Message Passing
  - Data Parallel
  - Others
    - Dataflow
    - Systolic Arrays
- Examine programming model, motivation, intended applications, and contributions to convergence
5. Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Figure: application software and system software built separately on top of divergent architectures (Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory)]
- Uncertainty of direction paralyzed parallel software development!
6. NEW VIEW: Convergence of parallel architectures
[Figure: Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory all mapping onto a generic architecture]
7. Parallel computer
- Last class' definition
  - A parallel computer is a collection of processing elements that cooperate to solve large problems fast
- Extend the sequential computer architecture with a communication architecture
- Computer architecture has 2 important aspects
  - Abstractions (hardware/software, user/system)
  - Implementation of these abstractions
- Communication architecture as well
  - Abstractions (communication and synchronization operations)
  - Implementations of these abstractions
- Programming model
  - Abstractions
  - Implementations of these abstractions
8. Modern Layered Framework
Layers of architectural abstraction
9. Programming Model
- What the programmer uses in coding applications
- Specifies communication and synchronization
- Examples
  - Multiprogramming: no communication or synchronization at program level
  - Shared address space: like a bulletin board
  - Message passing: like letters or phone calls, explicit point to point
  - Data parallel: global simultaneous actions on data
    - Implemented with shared address space or message passing
10. Modern Layered Framework
Layers of architectural abstraction
11. Communication Abstraction
- Programming model is built on the communication abstraction
- Possibilities
  - Supported directly by hardware
  - OS (sockets)
  - User software
  - Combination of OS and hardware (e.g. page fault handled by an OS handler)
- Earlier
  - Communication abstraction oriented toward the programming model
- Today
  - Compilers and software play important roles as bridges (MPI/OpenMP)
12. Shared Address Space (SAS) Architectures
- Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
- Convenient
  - Location transparency
  - Similar programming model to time-sharing on uniprocessors
    - Except processes run on different processors
- Naturally provided on a wide range of platforms
  - History dates at least to precursors of mainframes in the early 60s
  - Wide range of scale: few to hundreds of processors
- Popularly known as shared memory machines or model
  - Memory may be physically distributed among processors
  - UMA
  - NUMA
13. SAS-UMA (Uniform Memory Access)
- Any processor can directly reference any memory location
- Theoretically the same access time for all accesses
[Figure: UMA organization, processors P1..Pn and memory modules M1..Mk connected through a shared interconnect]
14. SAS-NUMA (Non-Uniform Memory Access)
- Any processor can directly reference any memory location (including the memory of remote processors)
- NI (network interface) integrated into the memory system
- IMPORTANT DIFFERENCE! In message passing machines P1 cannot access M2 (NI integrated into the I/O system)
- Different access times for local and remote memory
[Figure: NUMA organization, processing elements PE1..PEn, each pairing a processor Pi with a local memory Mi, connected by an interconnect]
15. SAS Memory Model
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared
- Writes to a shared address are visible to other threads (in other processes too)
- Natural extension of the uniprocessor model
- Communication: reads/writes to memory (see the sketch below this list)
- Synchronization: special atomic operations (we come back to this later)
- ONE OS uses shared memory to coordinate processes
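A minimal sketch of this model (illustrative only, not from the slides): two threads share an address space; the data value is communicated by ordinary loads and stores, while the "ready" flag uses atomic operations for synchronization, matching the two bullets above.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_data;          /* communicated by an ordinary store/load */
static atomic_int ready = 0;     /* synchronization via an atomic flag     */

static void *producer(void *arg) {
    (void)arg;
    shared_data = 42;                                    /* communication: a store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                /* spin until the flag is set */
    printf("consumer read %d\n", shared_data);           /* communication: a load  */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```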
16. Communication Hardware
- Natural extension of the uniprocessor
- Already have processor, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
- Memory capacity increases by adding modules
- I/O by adding controllers
- Add processors for processing!
17. History
- Mainframe approach
  - Motivated by multiprogramming
  - Extends the crossbar used for memory and I/O bandwidth
  - Bandwidth scales with the number of processors
  - Originally processor cost was high
    - Later, the crossbar became the cost: use a multistage network
  - Multistage
    - Reduces the incremental cost
    - Increases latency
- Minicomputer approach
  - Almost all microprocessor systems have a bus
  - Used heavily for parallel computing
  - Called symmetric multiprocessor (SMP)
  - Latency larger than for a uniprocessor
  - Bus is the bandwidth bottleneck
    - Caching is key; creates the coherence problem
  - Low incremental cost
[Figure: memory modules (M), processors (P), and I/O controllers connected by a crossbar (mainframe) or a shared bus (minicomputer)]
18. SAS UMA Example: Intel Pentium Pro Quad
- All coherence and multiprocessing in the processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth
19. SAS UMA Example: Sun UltraSPARC-based Enterprise
- 16 cards of either type: processors and memory, or I/O
- All memory accessed over the bus, so symmetric
- Higher bandwidth, higher latency bus
20. NUMA
[Figure: NUMA with caches, processing elements PE1..PEn, each with a processor Pi, cache Ci, and local memory Mi, connected by an interconnect]
21. SAS-NUMA Example: Cray T3E
- Scales up to 1024 processors, 480 MB/s links
- Memory controller generates communication requests for non-local references (no caching of remote data, which the SGI Origin has)
- No hardware mechanism for coherence (the SGI Origin has one)
22. Message Passing Architectures
- High-level block diagram similar to distributed-memory SAS
- NIC integrated into the I/O system, need not be integrated into the memory system
- Like clusters, but tighter integration
- Easier to build than scalable SAS
[Figure: message-passing organization, processing elements PE1..PEn, each with a processor Pi and private memory Mi, connected by an interconnect through network interfaces]
23. Message Passing Architectures
- Communication
  - Via explicit I/O operations
  - In the SAS case: through memory accesses
- Programming model
  - Directly access only the private address space (local memory)
  - Communication via explicit messages (send/receive)
- Farther from hardware operations
  - Library (MPI)
  - OS intervention (e.g. page fault in page-based DSM)
24. Message-Passing Abstraction
- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and a buffer to receive into
- Optional tag on send and matching rule on receive (see the MPI sketch below)
- Many overheads: copying, buffer management, protection
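A minimal MPI sketch of this abstraction (illustrative, not from the slides): rank 0 sends a buffer to rank 1; the receive names the sending process and a tag, which acts as the matching rule mentioned above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 99;
        /* send: buffer, count, type, destination process, tag */
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* recv: buffer, count, type, source process, matching tag */
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. `mpirun -np 2 ./a.out`.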
25. Evolution of Message-Passing Machines
- Early machines
  - Store and forward
  - FIFO on each link
  - Hardware close to the programming model
    - Synchronous operations
  - Only neighboring nodes could be named!
- Replaced by DMA, enabling non-blocking operations
  - Buffered by the system at the destination until recv
- Diminishing role of topology
  - Topology less important
  - All nodes can be named
  - Pipelined wormhole routing (asynchronous MP)
  - Cost is in the node-network interface
  - Simplifies programming (earlier you had to map your program onto the topology)
26. Example: IBM SP-2
- Made out of essentially complete RS6000 workstations
- Network interface integrated into the I/O bus (bandwidth limited by the I/O bus)
- 8x8 crossbar switch
27. Example: Intel Paragon
28. SAS/MP Architectural Convergence
- SAS machines
  - SOFTWARE: MP send/recv supported via buffers
  - HARDWARE: at a lower level, even hardware SAS passes hardware messages
- MP machines
  - SOFTWARE: a SAS global address space constructed on top of MP (software DSM)
    - Page-based (or finer-grained) shared virtual memory
  - HARDWARE: tighter NI integration even for MP (low latency, high bandwidth)
    - Due to the emergence of fast system area networks (SAN)
    - Traditionally the NI is integrated into the memory system for SAS NUMA systems
- Clusters of SMP workstations
29. Data Parallel Systems (SIMD)
- Architectural model
  - SIMD: array of many simple, cheap processors with little memory each
  - Processors don't sequence through instructions themselves
  - Attached to a control processor that issues the instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - We'll see the Ocean Current simulation
  - Centralizes the high cost of instruction fetch/sequencing
30. Data Parallel Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control performs sequential or parallel steps
  - Conceptually, a processor is associated with each data element
  - After a phase of computation all the processors synchronize
31. Application of Data Parallelism
- Each PE contains an employee record with his/her salary
- Each PE has a condition flag: execute the instruction or not
- Example: work in parallel on several employee records (see the sketch after this list)
  - if salary > 100K then
    - salary = salary * 1.05
  - else
    - salary = salary * 1.10
- Logically, the whole operation is a single step
  - Some processors are enabled for the arithmetic operation, others disabled
- Other examples
  - Differential equations, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Last machines
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - MasPar MP-1 and MP-2
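A sketch of the salary example above written as a data-parallel loop in C with OpenMP (the SIMD hardware is gone, but the programming model survives). The per-element if/else plays the role of the per-PE condition flag; the data values are made up for illustration.

```c
#include <stdio.h>

#define N 8

int main(void) {
    double salary[N] = {80e3, 120e3, 95e3, 150e3, 60e3, 110e3, 70e3, 130e3};

    /* logically one step: every element is updated "simultaneously" */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        if (salary[i] > 100e3)
            salary[i] *= 1.05;   /* PEs with the flag set: 5% raise    */
        else
            salary[i] *= 1.10;   /* PEs with the flag clear: 10% raise */
    }

    for (int i = 0; i < N; i++)
        printf("%.0f\n", salary[i]);
    return 0;
}
```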
32. Data parallel machines: evolution
- The architecture has disappeared today, but the programming model is still popular
- Popular when the cost savings of a centralized sequencer were high
  - 60s, when a CPU was a cabinet
- Replaced by vector machines in the mid-70s
  - More flexible memory layout and easier to manage
  - No need to map the problem onto the infrastructure
- Revived in the mid-80s when a 32-bit datapath fit on a chip
  - Modern microprocessors more attractive today
- Other reasons for demise
  - Simple, regular applications have good locality, can do well anyway
  - Loss of applicability due to hardwiring data parallelism
  - MIMD machines are as effective for data parallelism, and more general
33. Convergence
- Programming model
  - Still exists, separated from the hardware
  - Converges to SPMD (single program, multiple data); see the sketch below
  - Map the local data structure onto the hardware machine model
  - HPF, OpenMP
  - Needs fast global synchronization
  - Global address space, implemented with either SAS or MP
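An SPMD sketch, assuming MPI as the underlying layer (illustrative only): every process runs the same program, each rank derives its slice of the work from its rank id, and a collective operation provides the global synchronization mentioned above.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process works on its own contiguous block of indices */
    int chunk = (N + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    double local = 0.0, total = 0.0;
    for (int i = lo; i < hi; i++)
        local += (double)i;              /* local phase of the computation */

    /* global synchronization plus combine: all ranks get the full sum */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum 0..%d = %.0f\n", N - 1, total);
    MPI_Finalize();
    return 0;
}
```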
34. Dataflow Architectures
- Represent the computation (program) as a graph of essential dependences
- Logical processor at each node, activated by the availability of operands
- Messages (tokens) carrying the tag of the next instruction are sent to the next processor
- Tag is compared with others in the matching store; a match fires execution (see the toy sketch below)
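A toy C sketch of the firing rule (an illustration only, not any real dataflow machine): each node fires when both of its operand tokens have arrived, and its result becomes a token for the next node. The graph for (a+b)*(c-d) is walked by repeatedly firing whichever node is ready.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    char   op;            /* '+', '-' or '*'                          */
    double operand[2];    /* operand slots (the "matching store")     */
    bool   present[2];    /* has the token arrived?                   */
    int    dest_node;     /* where the result token goes (-1: output) */
    int    dest_slot;
} Node;

static void send_token(Node *g, int node, int slot, double v) {
    g[node].operand[slot] = v;
    g[node].present[slot] = true;
}

int main(void) {
    /* node 0: a+b, node 1: c-d, node 2: (a+b)*(c-d) */
    Node g[3] = {
        {'+', {0, 0}, {false, false}, 2, 0},
        {'-', {0, 0}, {false, false}, 2, 1},
        {'*', {0, 0}, {false, false}, -1, 0},
    };

    /* initial tokens: a=3, b=4, c=10, d=6 */
    send_token(g, 0, 0, 3);  send_token(g, 0, 1, 4);
    send_token(g, 1, 0, 10); send_token(g, 1, 1, 6);

    bool fired = true;
    while (fired) {                        /* keep firing until nothing is ready */
        fired = false;
        for (int i = 0; i < 3; i++) {
            if (!(g[i].present[0] && g[i].present[1]))
                continue;                  /* operands not matched yet */
            double r = g[i].op == '+' ? g[i].operand[0] + g[i].operand[1]
                     : g[i].op == '-' ? g[i].operand[0] - g[i].operand[1]
                     :                  g[i].operand[0] * g[i].operand[1];
            g[i].present[0] = g[i].present[1] = false;
            if (g[i].dest_node < 0)
                printf("result token: %.1f\n", r);   /* (3+4)*(10-6) = 28 */
            else
                send_token(g, g[i].dest_node, g[i].dest_slot, r);
            fired = true;
        }
    }
    return 0;
}
```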
35. Data-flow architectures
- Key characteristics
  - Ability to name operations anywhere in the machine
  - Support for synchronization of independent operations
  - Dynamic scheduling at the machine level
- The architectures died out
- Problems
  - Operations have locality across them; useful to group them together
  - Handling complex data structures like arrays
  - Complexity of the matching store and memory units
  - Exposes too much parallelism
    - Too fine-grained
    - Hurts locality
36. Data-flow architectures: convergence
- Converged to use conventional processors and memory
  - Support for a large, dynamic set of threads to map onto processors
  - Typically a shared address space as well
  - Separation of the programming model from the hardware (like data-parallel)
- Lasting contributions
  - Integration of communication with thread (handler) generation
  - Tightly integrated communication and fine-grained synchronization
  - Data-flow remains a useful concept for software (compilers etc.)
37. Systolic Architectures
- Replace a single processor with an array of regular processing elements
  - Orchestrate the data flow for high throughput with fewer memory accesses
- Different from pipelining
  - Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory
- Different from SIMD: each PE may do something different
- Initial motivation: VLSI enables inexpensive special-purpose chips
  - Represent algorithms directly by chips connected in a regular pattern
38. Systolic Arrays (cont'd.)
Example: systolic array for 1-D convolution (software sketch after the list below)
- Practical realizations (e.g. iWarp, CMU-Intel) use quite general processors
  - Enable a variety of algorithms on the same hardware
- Dedicated interconnect channels
  - Data transferred directly from register to register across a channel
- Specialized, and same problems as SIMD
  - General-purpose systems work well for the same algorithms (locality etc.)
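A software sketch of the 1-D convolution named above (an illustration of the idea, not the iWarp design): each PE holds one weight, input samples march past the array, and each PE adds its weight times the sample it currently sees to the partial sum passing through. The weights and inputs are made up for illustration.

```c
#include <stdio.h>

#define K 3          /* number of weights, one per PE */
#define N 8          /* number of input samples       */

int main(void) {
    double w[K] = {0.25, 0.5, 0.25};
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[N - K + 1] = {0};

    /* each output y[i] is produced as its partial sum flows past PE0..PE(K-1);
     * the inner loop is one PE doing its multiply-accumulate per beat */
    for (int i = 0; i < N - K + 1; i++) {
        double partial = 0.0;              /* partial sum entering the array */
        for (int pe = 0; pe < K; pe++)
            partial += w[pe] * x[i + pe];  /* PE 'pe' sees sample x[i+pe]    */
        y[i] = partial;                    /* result leaves the last PE      */
    }

    for (int i = 0; i < N - K + 1; i++)
        printf("y[%d] = %.2f\n", i, y[i]);
    return 0;
}
```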
39. Recap: Generic Multiprocessor Architecture