Title: Today
1 Today's class
- Parallel Computer Architectures
2 Reasons for Parallelism
- Physical constraints, such as the speed of light and quantum mechanical effects, are being reached
- To increase the speed of computation, provide parallel computation rather than faster CPUs
3 Levels for Parallelism
- CPU pipelining and superscalar architecture
- Very long instruction words with implicit parallelism
- CPUs with special features to handle multiple threads of control at once
- Multiple CPUs on the same chip
- Extra CPU boards with additional processing capacity
- Replicate entire CPUs: multiprocessors and multicomputers
4 Parallel Computer Architectures
5 On-Chip Parallelism
6 Instruction-Level Parallelism
- At the lowest level, achieve parallelism by issuing multiple instructions per clock cycle
- Two ways to do this
- Superscalar processors
- VLIW (Very Long Instruction Word) processors
7 VLIW Processors
8 The TriMedia VLIW CPU
- Designed by Philips (inventor of the CD)
- Used as an embedded processor in CD, DVD, and MP3 players, digital cameras, etc.
- Each instruction holds up to 5 operations
9 The TriMedia VLIW CPU Characteristics
- Byte-oriented memory
- 32-bit words
- 128 general-purpose 32-bit registers
- R0 is always 0, R1 is always 1
- Four special purpose registers
- PC, PSW, two registers for interrupt handling
- 64-bit register that counts the number of CPU cycles since the CPU was last reset
- Takes roughly 2000 years to wrap around at 300 MHz (checked in the sketch below)
- 64 KB instruction cache, 16 KB data cache
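A quick arithmetic check of the wrap-around claim (a plain C sketch, not TriMedia code): 2^64 counter values at 300 MHz last about 1950 years, consistent with the roughly 2000 years stated above.

    #include <stdio.h>

    int main(void) {
        double cycles  = 18446744073709551616.0;   /* 2^64 counter values */
        double hz      = 300e6;                    /* 300 MHz clock       */
        double seconds = cycles / hz;
        double years   = seconds / (365.25 * 24.0 * 3600.0);
        printf("wraps after about %.0f years\n", years);  /* ~1950 years */
        return 0;
    }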
10 The TriMedia VLIW CPU Functional Units
11 On-Chip Multithreading
- When a memory reference misses the level 1 and level 2 caches, there is a long wait until the requested word is loaded into the cache
- This stalls the pipeline
- On-chip multithreading deals with this by allowing the CPU to manage multiple threads of control at the same time
12 Fine-Grained and Coarse-Grained Multithreading
13 Multithreading with a Dual-Issue Superscalar CPU
14 Single-Chip Multiprocessors
- Provide a larger performance gain than multithreading
- Contain two or more CPUs
15 Homogeneous Multiprocessors on a Chip
16 Heterogeneous Multiprocessors on a Chip
- The logical structure of a simple DVD player contains a heterogeneous multiprocessor with multiple cores for different functions
17 Coprocessors
18 Coprocessors
- Increase the speed of the computer by adding a second, specialized processor
19 Introduction to Networking
20 Network Processors
- Programmable devices that can handle incoming and outgoing network packets at wire speed
21 Media Processors
- Handle high-resolution photographic images and audio/video streams
- Ordinary CPUs are not good at the massive computations needed to process the large amounts of data in these applications
22 The Nexperia Media Processor
23 Shared-Memory Multiprocessors
24 Multiprocessor
- A parallel computer in which all the CPUs share a common memory
- Hard to build (because of the shared memory)
- Easy to program
25 Multicomputer
- Each CPU has its own private memory, accessible only to itself and not to any other CPU
- Also known as a distributed-memory system
- Easy to build
- Hard to program (the contrast with shared memory is sketched below)
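To make the easy-to-program / hard-to-program contrast concrete, here is a minimal shared-memory sketch using POSIX threads (illustrative, not from the slides): the threads communicate simply by reading and writing the same variable with ordinary loads and stores, plus a mutex for safety. On a multicomputer the same exchange would require explicit messages, as discussed later under communication software.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state: on a multiprocessor every CPU reaches this directly
       with ordinary LOAD and STORE instructions. */
    long counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;                 /* direct access to shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* prints 200000 */
        return 0;
    }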
26 Taxonomy of Parallel Computers
27 Taxonomy of Parallel Computers
28 Sequential Memory Consistency
- In the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order
29 Processor Memory Consistency
- Writes by any CPU are seen by all CPUs in the order they were issued
- For every memory word, all CPUs see all writes to it in the same order
- Does not guarantee that every CPU sees the same ordering of writes coming from different CPUs
30 Weak Memory Consistency
- Does not guarantee that writes from a single CPU are seen in order by other CPUs
- Ordering is enforced only at explicit synchronization points (illustrated in the sketch below)
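As a software-level illustration of why the consistency model matters (a minimal C sketch; the two-thread scenario and variable names are illustrative, not from the slides): under sequential consistency a reader that observes flag == 1 would also be guaranteed to see data == 42, while on hardware with a weaker model the program must mark the synchronization point explicitly, here with a release/acquire pair.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data = 0;                 /* ordinary shared variable */
    atomic_int flag = 0;          /* synchronization flag     */

    void *writer(void *arg) {
        data = 42;                /* plain store              */
        /* release store: earlier writes must become visible no later
           than the flag itself (an explicit synchronization point)   */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    void *reader(void *arg) {
        /* acquire load: once the flag is seen as 1, the writer's
           earlier store (data = 42) is guaranteed to be visible      */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;
        printf("data = %d\n", data);   /* prints 42 */
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }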
31 UMA Symmetric Multiprocessor Architectures
- Simplest multiprocessors are based on a single bus
32 Cache Coherence
- Suppose CPU 1 and CPU 2 each have a copy of the same data in their respective caches
- Now suppose CPU 1 modifies the data in its cache and immediately thereafter CPU 2 reads its copy
- CPU 2 will get stale data
- This problem is known as the cache coherence problem
- Solutions to this problem have the cache controller eavesdropping on the bus
33 Snooping Caches
- The write-through cache coherence protocol: note that all writes go to memory
- The empty boxes indicate that no action is taken
34 The MESI Cache Coherence Protocol
35 The MESI Cache Coherence Protocol
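The MESI figures referenced by these two slides are not reproduced here. As a simplified reminder (a sketch of the general idea, not the slides' figure), each cache line is in one of four states, Modified, Exclusive, Shared, or Invalid, and moves between them on local accesses and on snooped bus traffic.

    /* Simplified sketch of MESI states and a few representative
       transitions; not a complete or cycle-accurate protocol. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    /* Local CPU writes a line: it must end up owning it exclusively. */
    mesi_state on_local_write(mesi_state s) {
        switch (s) {
        case INVALID:   /* read-for-ownership on the bus, then write      */
        case SHARED:    /* broadcast an invalidate to other caches first  */
        case EXCLUSIVE: /* no bus traffic needed, the line is private     */
        case MODIFIED:  return MODIFIED;
        }
        return MODIFIED;
    }

    /* Another CPU's read is snooped for a line we hold. */
    mesi_state on_snooped_read(mesi_state s) {
        switch (s) {
        case MODIFIED:  /* supply the dirty data (or write it back) first */
        case EXCLUSIVE: return SHARED;   /* the line is now shared        */
        case SHARED:    return SHARED;
        case INVALID:   return INVALID;
        }
        return s;
    }

    /* Another CPU's write (or an invalidate) is snooped for a line we hold. */
    mesi_state on_snooped_write(mesi_state s) {
        if (s == MODIFIED) {
            /* the dirty data must first be written back or supplied */
        }
        return INVALID;   /* our copy becomes stale in every case */
    }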
36 Crossbar Switches
- Use of a single bus limits the size of a UMA multiprocessor to 16 or 32 CPUs
- The simplest circuit for connecting n CPUs to k memories is the crossbar switch
37 Multistage Switching Networks
- Larger UMA multiprocessors are based on the humble 2x2 switch shown here
- Messages arriving on either input line (A or B) can be switched to either output line (X or Y)
- Messages contain four parts (see the sketch after this list)
- Module: which memory module to use
- Address: specifies the address within the module
- Opcode: specifies the operation, such as READ or WRITE
- Value: optional field, may contain an operand
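A rough C rendering of that four-part message (the field bit-widths are assumptions for illustration; the slides do not specify them):

    #include <stdint.h>

    /* One memory request flowing through the switching network.
       Field names follow the slide; the widths are assumed. */
    typedef struct {
        uint32_t module;   /* which memory module to use         */
        uint32_t address;  /* address within that module         */
        uint8_t  opcode;   /* operation, e.g. READ or WRITE      */
        uint32_t value;    /* optional operand (used on a WRITE) */
    } switch_message;

    enum { OP_READ = 0, OP_WRITE = 1 };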
38 Multistage Switching Networks
39 NUMA Multiprocessors
- Non-Uniform Memory Access
- Not all memory modules have the same access time
- Three key characteristics
- A single address space visible to all CPUs
- Access to remote memory is done using LOAD and STORE instructions
- Access to remote memory is slower than access to local memory
- NC-NUMA (no cache present)
- CC-NUMA (coherent caches present)
40 NUMA Multiprocessors
- A NUMA machine based on two levels of buses
41 Directory-Based Multiprocessor
- Most popular approach for building cache-coherent NUMA multiprocessors
- Maintains a database telling where each cache line is and what its status is
- The database is queried on every instruction that references memory, so it must be kept in extremely fast special-purpose hardware
42 Directory-Based Multiprocessor
- Below is an example 256-node multiprocessor
- Each node has one CPU and 16 MB RAM
- Total memory is 2^32 bytes, divided into 2^26 cache lines of 64 bytes each
43 Directory-Based Multiprocessor
- Suppose CPU 20 issues a LOAD instruction for memory at physical address 0x24000108
- This translates to node 36, line 4, offset 8 (decoded in the sketch after this list)
- A request is made over the interconnection network; the directory for node 36 shows line 4 is not cached, so it is fetched from local RAM, sent back to node 20, and the directory entry for line 4 is updated to show it cached at node 20
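A small C sketch of that address split, assuming, as the example implies, 8 bits of node number, 18 bits of cache-line number, and 6 bits of byte offset within a 64-byte line:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t addr   = 0x24000108;             /* address from the example            */
        uint32_t offset = addr & 0x3F;            /* low 6 bits: byte within 64-byte line */
        uint32_t line   = (addr >> 6) & 0x3FFFF;  /* next 18 bits: line within the node   */
        uint32_t node   = addr >> 24;             /* top 8 bits: which of the 256 nodes   */
        printf("node %u, line %u, offset %u\n", node, line, offset);  /* node 36, line 4, offset 8 */
        return 0;
    }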
44 Directory-Based Multiprocessor
- Now suppose we want to load memory referenced by node 36's cache line 2
- From the directory entry we see it's cached at node 82
- At this point the hardware updates directory entry 2 to show the line is now at node 20 and then sends a message to node 82 telling it to pass the line to node 20 and invalidate its cached copy
45 Message-Passing Multicomputers
46 Multicomputers
- Each CPU has its own private memory, not directly accessible to any other CPU
- Programs interact by passing messages, since they cannot get at each other's memory via LOAD and STORE
47 Topology
- (a) A star
- (b) A complete interconnect
- (c) A tree
- (d) A ring
48 Topology
- (e) A grid
- (f) A double torus
- (g) A cube
- (h) A 4D hypercube
49 Massively Parallel Processors
- Use standard CPUs, such as the Intel Pentium, as their processors
- Very high-performance proprietary interconnection network
- Enormous I/O capacity
- Fault tolerant
- Do not want a program that runs for many hours aborted because one CPU crashed
50 BlueGene/L Custom Processor Chip
51 BlueGene/L System
52 Cluster Computing
- Consists of hundreds or thousands of computers connected by a commercially available network board
- Centralized cluster: a cluster of workstations mounted in a big rack in a single room
- Decentralized cluster: workstations spread around a building or campus, connected by a LAN
53 Google
- Built the world's largest off-the-shelf cluster
- It bought cheap, modest-performance PCs
- Lots of them!
- A typical Google PC has a 2 GHz Pentium, 512 MB of RAM, and an 80 GB disk
54 A Typical Google Cluster
55 Communication Software
- Special software is required for interprocess communication and synchronization
- Most message-passing systems provide two primitives: SEND and RECEIVE
56 Three Main Semantics
- Synchronous message passing
- If the sender has executed SEND and the receiver has not yet executed RECEIVE, the sender is blocked until the receiver executes RECEIVE
- Buffered message passing
- If a message is sent before the receiver is ready, the message is buffered somewhere until the receiver takes it out
- Nonblocking message passing
- The sender is allowed to continue immediately after executing SEND
- However, it may not reuse the message buffer, as the message may not have been sent yet (the three semantics are sketched with MPI below)
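As a concrete illustration (not from the slides), the MPI library exposes send variants that map roughly onto these three semantics: MPI_Ssend is synchronous, MPI_Bsend (with a user-attached buffer) is buffered, and MPI_Isend is nonblocking, with MPI_Wait marking when the send buffer may safely be reused. A minimal two-rank sketch, assuming an MPI installation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int msg = 42;
        if (rank == 0) {
            /* Synchronous: returns only once the receiver has started its RECEIVE */
            MPI_Ssend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* Nonblocking: returns immediately, but msg must not be reused
               until MPI_Wait reports that the send has completed */
            MPI_Request req;
            MPI_Isend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            int recv1, recv2;
            MPI_Recv(&recv1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&recv2, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %d and %d\n", recv1, recv2);
        }

        MPI_Finalize();
        return 0;
    }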