Title: Today
1 Today's class
- Parallel Computer Architectures
2 Reasons for Parallelism
- Physical constraints, such as the speed of light and quantum mechanical effects, are being reached
- To increase the speed of computation, provide parallel computation rather than faster CPUs
3 Levels for Parallelism
- CPU pipelining and superscalar architecture
- Very long instruction words with implicit parallelism
- CPUs with special features to handle multiple threads of control at once
- Multiple CPUs on the same chip
- Extra CPU boards with additional processing capacity
- Replicate entire CPUs: multiprocessors and multicomputers
4 Parallel Computer Architectures
5 On-Chip Parallelism
6 Instruction-Level Parallelism
- At the lowest level, achieve parallelism by issuing multiple instructions per clock cycle
- Two ways to do this
- Superscalar processors
- VLIW (Very Long Instruction Word) processors
7 VLIW Processors
8 The TriMedia VLIW CPU
- Designed by Philips (inventor of the CD)
- Used as an embedded processor in CD, DVD, and MP3 players, digital cameras, etc.
- Each instruction holds up to 5 operations
9 The TriMedia VLIW CPU Characteristics
- Byte-oriented memory
- 32-bit words
- 128 general-purpose 32-bit registers
- R0 is always 0, R1 is always 1
- Four special purpose registers
- PC, PSW, two registers for interrupt handling
- 64-bit register that counts the number of CPU cycles since the CPU was last reset
- Takes roughly 2000 years to wrap around at 300 MHz (checked in the sketch below)
- 64 KB instruction cache, 16 KB data cache
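A quick arithmetic check of the wrap-around claim (a plain C sketch, not TriMedia code): 2^64 counter values at 300 MHz last about 1950 years, consistent with the roughly 2000 years stated above.

    #include <stdio.h>

    int main(void) {
        double cycles  = 18446744073709551616.0;   /* 2^64 counter values */
        double hz      = 300e6;                    /* 300 MHz clock       */
        double seconds = cycles / hz;
        double years   = seconds / (365.25 * 24.0 * 3600.0);
        printf("wraps after about %.0f years\n", years);  /* ~1950 years */
        return 0;
    }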
10 The TriMedia VLIW CPU Functional Units
11 On-Chip Multithreading
- When a memory reference misses the level 1 and level 2 caches, there is a long wait until the requested word is loaded into the cache
- This stalls the pipeline
- On-chip multithreading deals with this by allowing the CPU to manage multiple threads of control at the same time
12 Fine-Grained and Coarse-Grained Multithreading
13 Multithreading with a Dual-Issue Superscalar CPU
14 Single-Chip Multiprocessors
- Provide a larger performance gain than multithreading
- Contain two or more CPUs
15 Homogeneous Multiprocessors on a Chip
16 Heterogeneous Multiprocessors on a Chip
- The logical structure of a simple DVD player contains a heterogeneous multiprocessor with multiple cores for different functions
17 Coprocessors
18 Coprocessors
- Increase the speed of the computer by adding a second, specialized processor
19 Introduction to Networking
20 Network Processors
- Programmable devices that can handle incoming and outgoing network packets at wire speed
21 Media Processors
- Handle high-resolution photographic images and audio/video streams
- Ordinary CPUs are not good at the massive computations needed to process the large amounts of data in these applications
22 The Nexperia Media Processor
23 Shared-Memory Multiprocessors
24 Multiprocessor
- A parallel computer in which all the CPUs share a common memory
- Hard to build (because of the shared memory)
- Easy to program
25 Multicomputer
- Each CPU has its own private memory, accessible only to itself and not to any other CPU
- Also known as a distributed-memory system
- Easy to build
- Hard to program (the contrast with shared memory is sketched below)
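To make the easy-to-program / hard-to-program contrast concrete, here is a minimal shared-memory sketch using POSIX threads (illustrative, not from the slides): the threads communicate simply by reading and writing the same variable with ordinary loads and stores, plus a mutex for safety. On a multicomputer the same exchange would require explicit messages, as discussed later under communication software.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state: on a multiprocessor every CPU reaches this directly
       with ordinary LOAD and STORE instructions. */
    long counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;                 /* direct access to shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* prints 200000 */
        return 0;
    }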
26 Taxonomy of Parallel Computers
27 Taxonomy of Parallel Computers
28 Sequential Memory Consistency
- In the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order
29 Processor Memory Consistency
- Writes by any CPU are seen by all CPUs in the order they were issued
- For every memory word, all CPUs see all writes to it in the same order
- Does not guarantee that every CPU sees the same ordering of writes coming from different CPUs
30 Weak Memory Consistency
- Does not guarantee that writes from a single CPU are seen in order by other CPUs
- Ordering is enforced only at explicit synchronization points (illustrated in the sketch below)
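As a software-level illustration of why the consistency model matters (a minimal C sketch; the two-thread scenario and variable names are illustrative, not from the slides): under sequential consistency a reader that observes flag == 1 would also be guaranteed to see data == 42, while on hardware with a weaker model the program must mark the synchronization point explicitly, here with a release/acquire pair.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data = 0;                 /* ordinary shared variable */
    atomic_int flag = 0;          /* synchronization flag     */

    void *writer(void *arg) {
        data = 42;                /* plain store              */
        /* release store: earlier writes must become visible no later
           than the flag itself (an explicit synchronization point)   */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    void *reader(void *arg) {
        /* acquire load: once the flag is seen as 1, the writer's
           earlier store (data = 42) is guaranteed to be visible      */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;
        printf("data = %d\n", data);   /* prints 42 */
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }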
31 UMA Symmetric Multiprocessor Architectures
- Simplest multiprocessors are based on a single bus
32 Cache Coherence
- Suppose CPU 1 and CPU 2 each have a copy of the same data in their respective caches
- Now suppose CPU 1 modifies the data in its cache and immediately thereafter CPU 2 reads its copy
- CPU 2 will get stale data
- This problem is known as the cache coherence problem
- Solutions to this problem have the cache controller eavesdropping on the bus
33 Snooping Caches
- The write-through cache coherence protocol: note that all writes go to memory
- The empty boxes indicate that no action is taken
34 The MESI Cache Coherence Protocol
35 The MESI Cache Coherence Protocol
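The MESI figures referenced by these two slides are not reproduced here. As a simplified reminder (a sketch of the general idea, not the slides' figure), each cache line is in one of four states, Modified, Exclusive, Shared, or Invalid, and moves between them on local accesses and on snooped bus traffic.

    /* Simplified sketch of MESI states and a few representative
       transitions; not a complete or cycle-accurate protocol. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    /* Local CPU writes a line: it must end up owning it exclusively. */
    mesi_state on_local_write(mesi_state s) {
        switch (s) {
        case INVALID:   /* read-for-ownership on the bus, then write      */
        case SHARED:    /* broadcast an invalidate to other caches first  */
        case EXCLUSIVE: /* no bus traffic needed, the line is private     */
        case MODIFIED:  return MODIFIED;
        }
        return MODIFIED;
    }

    /* Another CPU's read is snooped for a line we hold. */
    mesi_state on_snooped_read(mesi_state s) {
        switch (s) {
        case MODIFIED:  /* supply the dirty data (or write it back) first */
        case EXCLUSIVE: return SHARED;   /* the line is now shared        */
        case SHARED:    return SHARED;
        case INVALID:   return INVALID;
        }
        return s;
    }

    /* Another CPU's write (or an invalidate) is snooped for a line we hold. */
    mesi_state on_snooped_write(mesi_state s) {
        if (s == MODIFIED) {
            /* the dirty data must first be written back or supplied */
        }
        return INVALID;   /* our copy becomes stale in every case */
    }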
36 Crossbar Switches
- Use of a single bus limits the size of a UMA multiprocessor to 16 or 32 CPUs
- The simplest circuit for connecting n CPUs to k memories is the crossbar switch
37 Multistage Switching Networks
- Larger UMA multiprocessors are based on the humble 2x2 switch shown here
- Messages arriving on either input line (A or B) can be switched to either output line (X or Y)
- Messages contain four parts (see the sketch after this list)
- Module: which memory module to use
- Address: specifies the address within the module
- Opcode: specifies the operation, such as READ or WRITE
- Value: optional field, may contain an operand
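A rough C rendering of that four-part message (the field bit-widths are assumptions for illustration; the slides do not specify them):

    #include <stdint.h>

    /* One memory request flowing through the switching network.
       Field names follow the slide; the widths are assumed. */
    typedef struct {
        uint32_t module;   /* which memory module to use         */
        uint32_t address;  /* address within that module         */
        uint8_t  opcode;   /* operation, e.g. READ or WRITE      */
        uint32_t value;    /* optional operand (used on a WRITE) */
    } switch_message;

    enum { OP_READ = 0, OP_WRITE = 1 };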
38 Multistage Switching Networks
39 NUMA Multiprocessors
- Non-Uniform Memory Access
- Not all memory modules have the same access time
- Three key characteristics
- A single address space visible to all CPUs
- Access to remote memory is done using LOAD and STORE instructions
- Access to remote memory is slower than access to local memory
- NC-NUMA (no cache present)
- CC-NUMA (coherent caches present)
40 NUMA Multiprocessors
- A NUMA machine based on two levels of buses
41 Directory-Based Multiprocessor
- Most popular approach for building cache-coherent NUMA multiprocessors
- Maintains a database telling where each cache line is and what its status is
- The database is queried on every instruction that references memory, so it must be kept in extremely fast special-purpose hardware
42 Directory-Based Multiprocessor
- Below is an example 256-node multiprocessor
- Each node has one CPU and 16 MB RAM
- Total memory is 2^32 bytes, divided into 2^26 cache lines of 64 bytes each
43 Directory-Based Multiprocessor
- Suppose CPU 20 issues a LOAD instruction for memory at physical address 0x24000108
- This translates to node 36, line 4, offset 8 (decoded in the sketch after this list)
- A request is made over the interconnection network; the directory for node 36 shows line 4 is not cached, so it is fetched from local RAM, sent back to node 20, and the directory entry for line 4 is updated to show it cached at node 20
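A small C sketch of that address split, assuming, as the example implies, 8 bits of node number, 18 bits of cache-line number, and 6 bits of byte offset within a 64-byte line:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t addr   = 0x24000108;             /* address from the example            */
        uint32_t offset = addr & 0x3F;            /* low 6 bits: byte within 64-byte line */
        uint32_t line   = (addr >> 6) & 0x3FFFF;  /* next 18 bits: line within the node   */
        uint32_t node   = addr >> 24;             /* top 8 bits: which of the 256 nodes   */
        printf("node %u, line %u, offset %u\n", node, line, offset);  /* node 36, line 4, offset 8 */
        return 0;
    }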
44 Directory-Based Multiprocessor
- Now suppose we want to load memory referenced by node 36's cache line 2
- From the directory entry we see it's cached at node 82
- At this point the hardware updates directory entry 2 to show the line is now at node 20 and then sends a message to node 82 telling it to pass the line to node 20 and invalidate its cached copy
45 Message-Passing Multicomputers
46 Multicomputers
- Each CPU has its own private memory, not directly accessible to any other CPU
- Programs interact by passing messages, since they cannot get at each other's memory via LOAD and STORE
47 Topology
- (a) A star
- (b) A complete interconnect
- (c) A tree
- (d) A ring
48 Topology
- (e) A grid
- (f) A double torus
- (g) A cube
- (h) A 4D hypercube
49 Massively Parallel Processors
- Use standard CPUs, such as the Intel Pentium, as their processors
- Very high-performance proprietary interconnection network
- Enormous I/O capacity
- Fault tolerant
- Do not want a program that runs for many hours aborted because one CPU crashed
50 BlueGene/L Custom Processor Chip
51 BlueGene/L System
52 Cluster Computing
- Consists of hundreds or thousands of computers connected by a commercially available network board
- Centralized cluster: a cluster of workstations mounted in a big rack in a single room
- Decentralized cluster: workstations spread around a building or campus, connected by a LAN
53 Google
- Built the world's largest off-the-shelf cluster
- It bought cheap, modest-performance PCs
- Lots of them!
- A typical Google PC has a 2 GHz Pentium, 512 MB of RAM, and an 80 GB disk
54 A Typical Google Cluster
55 Communication Software
- Special software is required for interprocess communication and synchronization
- Most message-passing systems provide two primitives: SEND and RECEIVE
56 Three Main Semantics
- Synchronous message passing
- If the sender has executed SEND and the receiver has not yet executed RECEIVE, the sender is blocked until the receiver executes RECEIVE
- Buffered message passing
- If a message is sent before the receiver is ready, the message is buffered somewhere until the receiver takes it out
- Nonblocking message passing
- The sender is allowed to continue immediately after executing SEND
- However, it may not reuse the message buffer, as the message may not have been sent yet (the three semantics are sketched with MPI below)
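As a concrete illustration (not from the slides), the MPI library exposes send variants that map roughly onto these three semantics: MPI_Ssend is synchronous, MPI_Bsend (with a user-attached buffer) is buffered, and MPI_Isend is nonblocking, with MPI_Wait marking when the send buffer may safely be reused. A minimal two-rank sketch, assuming an MPI installation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int msg = 42;
        if (rank == 0) {
            /* Synchronous: returns only once the receiver has started its RECEIVE */
            MPI_Ssend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* Nonblocking: returns immediately, but msg must not be reused
               until MPI_Wait reports that the send has completed */
            MPI_Request req;
            MPI_Isend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            int recv1, recv2;
            MPI_Recv(&recv1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&recv2, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %d and %d\n", recv1, recv2);
        }

        MPI_Finalize();
        return 0;
    }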