Title: Computer architecture II
1. Computer architecture II
2. Recap
- Importance of parallelism
- Architecture classification
- Flynn (SISD, SIMD, MISD, MIMD)
- Memory access (SM, MP)
- Clusters
- Grids
- Top500
3. Today's plan
- Parallel architecture convergence (Culler's classification)
  - Shared Memory (single address space)
  - Message Passing
  - Data Parallel (SIMD)
  - Dataflow
  - Systolic
4. Convergence of Architectural Models
- Culler's classification of parallel architectures
  - Shared Address Space
  - Message Passing
  - Data Parallel
  - Others
    - Dataflow
    - Systolic Arrays
- Examine programming model, motivation, intended applications, and contributions to convergence
5. Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Figure: application software and system software built separately on top of divergent architectures (Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory)]
- Uncertainty of direction paralyzed parallel software development!
6. NEW VIEW: Convergence of parallel architectures
[Figure: Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory all mapping onto a generic architecture]
7. Parallel computer
- Last class' definition
  - A parallel computer is a collection of processing elements that cooperate to solve large problems fast
- Extend the sequential computer architecture with a communication architecture
- Computer architecture has 2 important aspects
  - Abstractions (hardware/software, user/system)
  - Implementation of these abstractions
- Communication architecture as well
  - Abstractions (communication and synchronization operations)
  - Implementations of these abstractions
- Programming model
  - Abstractions
  - Implementations of these abstractions
8. Modern Layered Framework
Layers of architectural abstraction
9. Programming Model
- What the programmer uses in coding applications
- Specifies communication and synchronization
- Examples
  - Multiprogramming: no communication or synchronization at program level
  - Shared address space: like a bulletin board
  - Message passing: like letters or phone calls, explicit point to point
  - Data parallel: global simultaneous actions on data
    - Implemented with shared address space or message passing
10. Modern Layered Framework
Layers of architectural abstraction
11. Communication Abstraction
- Programming model is built on the communication abstraction
- Possibilities
  - Supported directly by hardware
  - OS (sockets)
  - User software
  - Combination of OS and hardware (e.g. page fault handled by an OS handler)
- Earlier
  - Communication abstraction oriented toward the programming model
- Today
  - Compilers and software play important roles as bridges (MPI/OpenMP)
12. Shared Address Space (SAS) Architectures
- Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
- Convenient
  - Location transparency
  - Similar programming model to time-sharing on uniprocessors
    - Except processes run on different processors
- Naturally provided on a wide range of platforms
  - History dates at least to precursors of mainframes in the early 60s
  - Wide range of scale: few to hundreds of processors
- Popularly known as shared memory machines or model
  - Memory may be physically distributed among processors
  - UMA
  - NUMA
13. SAS-UMA (Uniform Memory Access)
- Any processor can directly reference any memory location
- Theoretically the same access time for all accesses
[Figure: UMA organization, processors P1..Pn and memory modules M1..Mk connected through a shared interconnect]
14. SAS-NUMA (Non-Uniform Memory Access)
- Any processor can directly reference any memory location (including the memory of remote processors)
- NI (network interface) integrated into the memory system
- IMPORTANT DIFFERENCE! In message passing machines P1 cannot access M2 (NI integrated into the I/O system)
- Different access times for local and remote memory
[Figure: NUMA organization, processing elements PE1..PEn, each pairing a processor Pi with a local memory Mi, connected by an interconnect]
15. SAS Memory Model
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared
- Writes to a shared address are visible to other threads (in other processes too)
- Natural extension of the uniprocessor model
- Communication: reads/writes to memory (see the sketch below this list)
- Synchronization: special atomic operations (we come back to this later)
- ONE OS uses shared memory to coordinate processes
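A minimal sketch of this model (illustrative only, not from the slides): two threads share an address space; the data value is communicated by ordinary loads and stores, while the "ready" flag uses atomic operations for synchronization, matching the two bullets above.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_data;          /* communicated by an ordinary store/load */
static atomic_int ready = 0;     /* synchronization via an atomic flag     */

static void *producer(void *arg) {
    (void)arg;
    shared_data = 42;                                    /* communication: a store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                /* spin until the flag is set */
    printf("consumer read %d\n", shared_data);           /* communication: a load  */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```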
16. Communication Hardware
- Natural extension of the uniprocessor
- Already have processor, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
- Memory capacity increases by adding modules
- I/O by adding controllers
- Add processors for processing!
17. History
- Mainframe approach
  - Motivated by multiprogramming
  - Extends the crossbar used for memory and I/O bandwidth
  - Bandwidth scales with the number of processors
  - Originally processor cost was high
    - Later, the crossbar became the cost: use a multistage network
  - Multistage
    - Reduces the incremental cost
    - Increases latency
- Minicomputer approach
  - Almost all microprocessor systems have a bus
  - Used heavily for parallel computing
  - Called symmetric multiprocessor (SMP)
  - Latency larger than for a uniprocessor
  - Bus is the bandwidth bottleneck
    - Caching is key; creates the coherence problem
  - Low incremental cost
[Figure: memory modules (M), processors (P), and I/O controllers connected by a crossbar (mainframe) or a shared bus (minicomputer)]
18. SAS UMA Example: Intel Pentium Pro Quad
- All coherence and multiprocessing in the processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth
19. SAS UMA Example: Sun UltraSPARC-based Enterprise
- 16 cards of either type: processors and memory, or I/O
- All memory accessed over the bus, so symmetric
- Higher bandwidth, higher latency bus
20. NUMA
[Figure: NUMA with caches, processing elements PE1..PEn, each with a processor Pi, cache Ci, and local memory Mi, connected by an interconnect]
21. SAS-NUMA Example: Cray T3E
- Scales up to 1024 processors, 480 MB/s links
- Memory controller generates communication requests for non-local references (no caching of remote data, which the SGI Origin has)
- No hardware mechanism for coherence (the SGI Origin has one)
22. Message Passing Architectures
- High-level block diagram similar to distributed-memory SAS
- NIC integrated into the I/O system, need not be integrated into the memory system
- Like clusters, but tighter integration
- Easier to build than scalable SAS
[Figure: message-passing organization, processing elements PE1..PEn, each with a processor Pi and private memory Mi, connected by an interconnect through network interfaces]
23. Message Passing Architectures
- Communication
  - Via explicit I/O operations
  - In the SAS case: through memory accesses
- Programming model
  - Directly access only the private address space (local memory)
  - Communication via explicit messages (send/receive)
- Farther from hardware operations
  - Library (MPI)
  - OS intervention (e.g. page fault in page-based DSM)
24. Message-Passing Abstraction
- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and a buffer to receive into
- Optional tag on send and matching rule on receive (see the MPI sketch below)
- Many overheads: copying, buffer management, protection
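A minimal MPI sketch of this abstraction (illustrative, not from the slides): rank 0 sends a buffer to rank 1; the receive names the sending process and a tag, which acts as the matching rule mentioned above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 99;
        /* send: buffer, count, type, destination process, tag */
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* recv: buffer, count, type, source process, matching tag */
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. `mpirun -np 2 ./a.out`.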
25. Evolution of Message-Passing Machines
- Early machines
  - Store and forward
  - FIFO on each link
  - Hardware close to the programming model
    - Synchronous operations
  - Only neighboring nodes could be named!
- Replaced by DMA, enabling non-blocking operations
  - Buffered by the system at the destination until recv
- Diminishing role of topology
  - Topology less important
  - All nodes can be named
  - Pipelined wormhole routing (asynchronous MP)
  - Cost is in the node-network interface
  - Simplifies programming (earlier you had to map your program onto the topology)
26. Example: IBM SP-2
- Made out of essentially complete RS6000 workstations
- Network interface integrated into the I/O bus (bandwidth limited by the I/O bus)
- 8x8 crossbar switch
27. Example: Intel Paragon
28. SAS/MP Architectural Convergence
- SAS machines
  - SOFTWARE: MP send/recv supported via buffers
  - HARDWARE: at a lower level, even hardware SAS passes hardware messages
- MP machines
  - SOFTWARE: a SAS global address space constructed on top of MP (software DSM)
    - Page-based (or finer-grained) shared virtual memory
  - HARDWARE: tighter NI integration even for MP (low latency, high bandwidth)
    - Due to the emergence of fast system area networks (SAN)
    - Traditionally the NI is integrated into the memory system for SAS NUMA systems
- Clusters of SMP workstations
29. Data Parallel Systems (SIMD)
- Architectural model
  - SIMD: array of many simple, cheap processors with little memory each
  - Processors don't sequence through instructions themselves
  - Attached to a control processor that issues the instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - We'll see the Ocean Current simulation
  - Centralizes the high cost of instruction fetch/sequencing
30. Data Parallel Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control performs sequential or parallel steps
  - Conceptually, a processor is associated with each data element
  - After a phase of computation all the processors synchronize
31. Application of Data Parallelism
- Each PE contains an employee record with his/her salary
- Each PE has a condition flag: execute the instruction or not
- Example: work in parallel on several employee records (see the sketch after this list)
  - if salary > 100K then
    - salary = salary * 1.05
  - else
    - salary = salary * 1.10
- Logically, the whole operation is a single step
  - Some processors are enabled for the arithmetic operation, others disabled
- Other examples
  - Differential equations, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Last machines
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - MasPar MP-1 and MP-2
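A sketch of the salary example above written as a data-parallel loop in C with OpenMP (the SIMD hardware is gone, but the programming model survives). The per-element if/else plays the role of the per-PE condition flag; the data values are made up for illustration.

```c
#include <stdio.h>

#define N 8

int main(void) {
    double salary[N] = {80e3, 120e3, 95e3, 150e3, 60e3, 110e3, 70e3, 130e3};

    /* logically one step: every element is updated "simultaneously" */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        if (salary[i] > 100e3)
            salary[i] *= 1.05;   /* PEs with the flag set: 5% raise    */
        else
            salary[i] *= 1.10;   /* PEs with the flag clear: 10% raise */
    }

    for (int i = 0; i < N; i++)
        printf("%.0f\n", salary[i]);
    return 0;
}
```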
32. Data parallel machines: evolution
- The architecture has disappeared today, but the programming model is still popular
- Popular when the cost savings of a centralized sequencer were high
  - 60s, when a CPU was a cabinet
- Replaced by vector machines in the mid-70s
  - More flexible memory layout and easier to manage
  - No need to map the problem onto the infrastructure
- Revived in the mid-80s when a 32-bit datapath fit on a chip
  - Modern microprocessors more attractive today
- Other reasons for demise
  - Simple, regular applications have good locality, can do well anyway
  - Loss of applicability due to hardwiring data parallelism
  - MIMD machines are as effective for data parallelism, and more general
33. Convergence
- Programming model
  - Still exists, separated from the hardware
  - Converges to SPMD (single program, multiple data); see the sketch below
  - Map the local data structure onto the hardware machine model
  - HPF, OpenMP
  - Needs fast global synchronization
  - Global address space, implemented with either SAS or MP
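An SPMD sketch, assuming MPI as the underlying layer (illustrative only): every process runs the same program, each rank derives its slice of the work from its rank id, and a collective operation provides the global synchronization mentioned above.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process works on its own contiguous block of indices */
    int chunk = (N + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    double local = 0.0, total = 0.0;
    for (int i = lo; i < hi; i++)
        local += (double)i;              /* local phase of the computation */

    /* global synchronization plus combine: all ranks get the full sum */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum 0..%d = %.0f\n", N - 1, total);
    MPI_Finalize();
    return 0;
}
```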
34. Dataflow Architectures
- Represent the computation (program) as a graph of essential dependences
- Logical processor at each node, activated by the availability of operands
- Messages (tokens) carrying the tag of the next instruction are sent to the next processor
- Tag is compared with others in the matching store; a match fires execution (see the toy sketch below)
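A toy C sketch of the firing rule (an illustration only, not any real dataflow machine): each node fires when both of its operand tokens have arrived, and its result becomes a token for the next node. The graph for (a+b)*(c-d) is walked by repeatedly firing whichever node is ready.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    char   op;            /* '+', '-' or '*'                          */
    double operand[2];    /* operand slots (the "matching store")     */
    bool   present[2];    /* has the token arrived?                   */
    int    dest_node;     /* where the result token goes (-1: output) */
    int    dest_slot;
} Node;

static void send_token(Node *g, int node, int slot, double v) {
    g[node].operand[slot] = v;
    g[node].present[slot] = true;
}

int main(void) {
    /* node 0: a+b, node 1: c-d, node 2: (a+b)*(c-d) */
    Node g[3] = {
        {'+', {0, 0}, {false, false}, 2, 0},
        {'-', {0, 0}, {false, false}, 2, 1},
        {'*', {0, 0}, {false, false}, -1, 0},
    };

    /* initial tokens: a=3, b=4, c=10, d=6 */
    send_token(g, 0, 0, 3);  send_token(g, 0, 1, 4);
    send_token(g, 1, 0, 10); send_token(g, 1, 1, 6);

    bool fired = true;
    while (fired) {                        /* keep firing until nothing is ready */
        fired = false;
        for (int i = 0; i < 3; i++) {
            if (!(g[i].present[0] && g[i].present[1]))
                continue;                  /* operands not matched yet */
            double r = g[i].op == '+' ? g[i].operand[0] + g[i].operand[1]
                     : g[i].op == '-' ? g[i].operand[0] - g[i].operand[1]
                     :                  g[i].operand[0] * g[i].operand[1];
            g[i].present[0] = g[i].present[1] = false;
            if (g[i].dest_node < 0)
                printf("result token: %.1f\n", r);   /* (3+4)*(10-6) = 28 */
            else
                send_token(g, g[i].dest_node, g[i].dest_slot, r);
            fired = true;
        }
    }
    return 0;
}
```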
35. Data-flow architectures
- Key characteristics
  - Ability to name operations anywhere in the machine
  - Support for synchronization of independent operations
  - Dynamic scheduling at the machine level
- The architectures died out
- Problems
  - Operations have locality across them; useful to group them together
  - Handling complex data structures like arrays
  - Complexity of the matching store and memory units
  - Exposes too much parallelism
    - Too fine-grained
    - Hurts locality
36. Data-flow architectures: convergence
- Converged to use conventional processors and memory
  - Support for a large, dynamic set of threads to map onto processors
  - Typically a shared address space as well
  - Separation of the programming model from the hardware (like data-parallel)
- Lasting contributions
  - Integration of communication with thread (handler) generation
  - Tightly integrated communication and fine-grained synchronization
  - Data-flow remains a useful concept for software (compilers etc.)
37. Systolic Architectures
- Replace a single processor with an array of regular processing elements
  - Orchestrate the data flow for high throughput with fewer memory accesses
- Different from pipelining
  - Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory
- Different from SIMD: each PE may do something different
- Initial motivation: VLSI enables inexpensive special-purpose chips
  - Represent algorithms directly by chips connected in a regular pattern
38. Systolic Arrays (cont'd.)
Example: systolic array for 1-D convolution (software sketch after the list below)
- Practical realizations (e.g. iWarp, CMU-Intel) use quite general processors
  - Enable a variety of algorithms on the same hardware
- Dedicated interconnect channels
  - Data transferred directly from register to register across a channel
- Specialized, and same problems as SIMD
  - General-purpose systems work well for the same algorithms (locality etc.)
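A software sketch of the 1-D convolution named above (an illustration of the idea, not the iWarp design): each PE holds one weight, input samples march past the array, and each PE adds its weight times the sample it currently sees to the partial sum passing through. The weights and inputs are made up for illustration.

```c
#include <stdio.h>

#define K 3          /* number of weights, one per PE */
#define N 8          /* number of input samples       */

int main(void) {
    double w[K] = {0.25, 0.5, 0.25};
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[N - K + 1] = {0};

    /* each output y[i] is produced as its partial sum flows past PE0..PE(K-1);
     * the inner loop is one PE doing its multiply-accumulate per beat */
    for (int i = 0; i < N - K + 1; i++) {
        double partial = 0.0;              /* partial sum entering the array */
        for (int pe = 0; pe < K; pe++)
            partial += w[pe] * x[i + pe];  /* PE 'pe' sees sample x[i+pe]    */
        y[i] = partial;                    /* result leaves the last PE      */
    }

    for (int i = 0; i < N - K + 1; i++)
        printf("y[%d] = %.2f\n", i, y[i]);
    return 0;
}
```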
39. Recap: Generic Multiprocessor Architecture