Transcript and Presenter's Notes

Title: Today


1
Today's class
  • Parallel Computer Architectures

2
Reasons for Parallelism
  • Physical constraints, such as the speed of light
    and quantum mechanical effects, are being reached
  • To increase computation speed further, provide
    parallel computation rather than faster CPUs

3
Levels for Parallelism
  • CPU pipelining and superscalar architecture
  • Very long instruction words with implicit
    parallelism
  • CPUs with special features to handle multiple
    threads of control at once
  • Multiple CPUs on the same chip
  • Extra CPU boards with additional processing
    capacity
  • Replicate entire CPUs: multiprocessors and
    multicomputers

4
Parallel Computer Architectures
5
On-Chip Parallelism
6
Instruction-Level Parallelism
  • At the lowest level, achieve parallelism by
    issuing multiple instructions per clock cycle
  • Two ways to do this
  • Superscalar processors
  • VLIW (Very Long Instruction Word) processors

7
VLIW Processors
8
The TriMedia VLIW CPU
  • Designed by Philips (co-inventor of the CD)
  • Used as an embedded processor in CD, DVD, MP3
    players, digital cameras, etc.
  • Each instruction holds up to 5 operations

9
The TriMedia VLIW CPU Characteristics
  • Byte-oriented memory
  • 32-bit words
  • 128 general-purpose 32-bit registers
  • R0 is always 0, R1 is always 1
  • Four special purpose registers
  • PC, PSW, two registers for interrupt handling
  • 64-bit register that counts the number of CPU
    cycles since the CPU was last reset
  • Takes 2000 years to wrap around at 300 MHz
  • 64 KB instruction cache, 16 KB data cache
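The slide's claim about the 64-bit cycle counter can be checked with a little arithmetic; at an assumed 300 MHz clock, the counter wraps after roughly two millennia:

```python
# Rough check of the slide's claim: a 64-bit cycle counter
# at 300 MHz takes about 2000 years to wrap around.
SECONDS_PER_YEAR = 365 * 24 * 3600   # ignoring leap years
cycles = 2 ** 64                     # counter range
hz = 300_000_000                     # 300 MHz clock
years = cycles / hz / SECONDS_PER_YEAR
print(round(years))                  # ~1950, close to the slide's 2000
```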

10
The TriMedia VLIW CPU Functional Units
11
On-Chip Multithreading
  • When a memory reference misses the level 1 and
    level 2 caches, there is a long wait until the
    requested word is loaded into the cache
  • This stalls the pipeline
  • On-chip multithreading deals with this by
    allowing the CPU to manage multiple threads of
    control at the same time

12
Fine-Grained and Coarse-Grained Multithreading
13
Multithreading with a Dual-Issue Superscalar CPU
14
Single-Chip Multiprocessors
  • Provide a larger performance gain than
    multithreading
  • Contain two or more CPUs

15
Homogeneous Multiprocessors on a Chip
16
Heterogeneous Multiprocessors on a Chip
  • The logical structure of a simple DVD player
    contains a heterogeneous multiprocessor
    containing multiple cores for different functions.

17
Coprocessors
18
Coprocessors
  • Increase speed of computer by adding a second,
    specialized processor

19
Introduction to Networking
20
Network Processors
  • Programmable devices that can handle incoming and
    outgoing network packets at wire speed

21
Media Processors
  • Handle high-resolution photographic images and
    audio video streams
  • Ordinary CPUs are not good at the massive
    computations needed to process the large amounts
    of data in these applications

22
The Nexperia Media Processor
23
Shared-Memory Multiprocessors
24
Multiprocessor
  • A parallel computer in which all the CPUs share a
    common memory
  • Hard to build (because of the shared memory)
  • Easy to program

25
Multicomputer
  • Each CPU has its own private memory, accessible
    only to itself and not to any other CPU
  • Also known as a distributed memory system
  • Easy to build
  • Hard to program

26
Taxonomy of Parallel Computers
27
Taxonomy of Parallel Computers
28
Sequential Memory Consistency
  • In the presence of multiple read and write
    requests, some interleaving of all the requests
    is chosen by the hardware (nondeterministically),
    but all CPUs see the same order
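Sequential consistency lets the hardware pick any interleaving that preserves each CPU's own program order, provided every CPU observes the same global order. A small sketch (the request strings and CPU names are made up for illustration) enumerates the legal orders for two hypothetical CPUs:

```python
# Enumerate all global orders of two CPUs' requests that keep
# each CPU's requests in program order. Under sequential
# consistency the hardware picks ONE of these, and all CPUs
# see that same order.
def interleavings(a, b):
    """All merges of a and b preserving each list's internal order."""
    if not a:
        return [b]
    if not b:
        return [a]
    return [[a[0]] + r for r in interleavings(a[1:], b)] + \
           [[b[0]] + r for r in interleavings(a, b[1:])]

cpu_a = ["A: write x=1", "A: read y"]   # hypothetical request streams
cpu_b = ["B: write y=2"]

orders = interleavings(cpu_a, cpu_b)
print(len(orders))   # 3 legal global orders
```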

29
Processor Memory Consistency
  • Writes by any CPU are seen by all CPUs in the
    order they were issued
  • For every memory word, all CPUs see all writes to
    it in the same order
  • Does not guarantee that every CPU sees the same
    ordering

30
Weak Memory Consistency
  • Does not guarantee that writes from a single CPU
    are seen in order by other CPUs

31
UMA Symmetric Multiprocessor Architectures
  • Simplest multiprocessors are based on a single bus

32
Cache Coherence
  • Suppose CPU 1 and CPU 2 each have a copy of the
    same data in their respective caches
  • Now suppose CPU 1 modifies the data in its cache
    and immediately thereafter CPU 2 reads its copy
  • CPU 2 will get stale data
  • This problem is known as the cache coherence
    problem
  • Solutions to this problem have the cache
    controller eavesdropping on the bus

33
Snooping Caches
  • The write-through cache coherence protocol; note
    that all writes go to memory
  • The empty boxes indicate that no action is taken
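The write-through idea can be sketched in a few lines: every write goes straight to memory, and other caches that snoop the write on the bus invalidate their copy, so a later read fetches the fresh value. This is a minimal illustrative model, not a cycle-accurate protocol:

```python
# Minimal sketch of write-through snooping caches.
class SnoopingCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                 # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, others):
        self.lines[addr] = value
        self.memory[addr] = value       # write-through to memory
        for c in others:                # snooped on the bus:
            c.lines.pop(addr, None)     # invalidate stale copies

memory = {0x10: 7}
c1, c2 = SnoopingCache(memory), SnoopingCache(memory)
c2.read(0x10)                           # c2 caches the old value 7
c1.write(0x10, 99, others=[c2])         # c2's copy is invalidated
print(c2.read(0x10))                    # 99, not the stale 7
```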

34
The MESI Cache Coherence Protocol
35
The MESI Cache Coherence Protocol
36
Crossbar Switches
  • Use of a single bus limits the size of a UMA
    multiprocessor to 16 or 32 CPUs
  • The simplest circuit for connecting n CPUs to k
    memories is the crossbar switch
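A crossbar connecting n CPUs to k memories needs n×k crosspoints, which is why it scales poorly, but it is nonblocking: a connection can always be made as long as both endpoints are free. A toy sketch (class and method names are illustrative):

```python
# Toy crossbar switch: n*k crosspoints, nonblocking as long as
# the requested CPU and memory are both free.
class Crossbar:
    def __init__(self, n_cpus, n_mems):
        self.closed = set()                 # closed (cpu, mem) crosspoints
        self.crosspoints = n_cpus * n_mems  # hardware cost grows as n*k

    def connect(self, cpu, mem):
        # Fails only if the CPU or the memory is already in use.
        if any(c == cpu or m == mem for c, m in self.closed):
            return False
        self.closed.add((cpu, mem))
        return True

xbar = Crossbar(4, 4)
print(xbar.crosspoints)     # 16 crosspoints for 4 CPUs x 4 memories
print(xbar.connect(0, 2))   # True
print(xbar.connect(1, 2))   # False: memory 2 is busy
print(xbar.connect(1, 3))   # True
```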

37
Multistage Switching Networks
  • Larger UMA multiprocessors are based on the
    humble 2x2 switch shown here
  • Messages arriving on either input line (A or B)
    can be switched to either output line (X or Y)
  • Messages contain four parts:
  • Module: which memory module to use
  • Address: the address within that module
  • Opcode: the operation, such as READ or WRITE
  • Value: optional field that may contain an operand
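The four message fields can be written down as a simple record; the field widths and example values here are illustrative assumptions, only the four field names come from the slide:

```python
from dataclasses import dataclass
from typing import Optional

# The four parts of a switching-network message from the slide.
@dataclass
class SwitchMessage:
    module: int             # which memory module to use
    address: int            # address within that module
    opcode: str             # operation, e.g. "READ" or "WRITE"
    value: Optional[int]    # optional operand, e.g. the word to write

msg = SwitchMessage(module=3, address=0x1F4, opcode="WRITE", value=42)
print(msg.opcode, msg.module)   # WRITE 3
```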

38
Multistage Switching Networks
39
NUMA Multiprocessors
  • Non Uniform Memory Access
  • All memory modules do not have the same access
    time
  • Three key characteristics
  • A single address space visible to all CPUs
  • Access to remote memory is done using LOAD and
    STORE instructions
  • Access to remote memory is slower than access to
    local memory
  • NC-NUMA (no cache present)
  • CC-NUMA (coherent caches present)

40
NUMA Multiprocessors
  • A NUMA machine based on two levels of buses

41
Directory-Based Multiprocessor
  • Most popular approach for building Cache Coherent
    NUMA multiprocessors
  • Maintains a database telling where each cache
    line is and what its status is
  • Database is queried on every instruction that
    references memory, so must be kept in extremely
    fast special purpose hardware

42
Directory-Based Multiprocessor
  • Below is an example 256-node multiprocessor
  • Each node has one CPU and 16 MB RAM
  • Total memory is 2^32 bytes, divided into 2^26
    cache lines of 64 bytes each
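The slide's arithmetic checks out: 256 nodes of 16 MB each give 2^32 bytes, and dividing by 64-byte lines gives 2^26 cache lines system-wide:

```python
# Verify the example system's memory arithmetic.
nodes = 256              # 2^8 nodes
per_node = 16 * 2**20    # 16 MB of RAM per node
line_size = 64           # bytes per cache line

total = nodes * per_node
print(total == 2**32)                # True: 4 GB total
print(total // line_size == 2**26)   # True: 2^26 cache lines
```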

43
Directory-Based Multiprocessor
  • Suppose CPU 20 issues a LOAD instruction for
    memory at physical address 0x24000108
  • This translates to node 36, line 4, offset 8
  • A request is made over the interconnection
    network; the directory for node 36 shows line 4
    is not cached, so it is fetched from node 36's
    local RAM, sent back to node 20, and the
    directory entry for line 4 is updated to show it
    is cached at node 20
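The address split follows from the system's sizes: the top 8 bits select one of 256 nodes, the next 18 bits select a cache line within that node's 16 MB (2^18 lines of 64 bytes), and the low 6 bits are the byte offset within the line:

```python
# Decode the slide's 32-bit physical address into node, line, offset.
addr = 0x24000108
node = addr >> 24                    # top 8 bits: which node
line = (addr >> 6) & (2**18 - 1)     # next 18 bits: line within node
offset = addr & 0x3F                 # low 6 bits: byte within line
print(node, line, offset)            # 36 4 8
```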

44
Directory-Based Multiprocessor
  • Now suppose we want to load memory referenced by
    node 36's cache line 2
  • From the directory entry we see it is cached at
    node 82
  • At this point the hardware updates directory
    entry 2 to show the line is now at node 20 and
    then sends a message to node 82 telling it to
    pass the line to node 20 and invalidate its copy

45
Message-Passing Multicomputers
46
Multicomputers
  • Each CPU has its own private memory, not directly
    accessible to any other CPU
  • Programs interact by passing messages, since they
    cannot reach each other's memory via LOAD and
    STORE

47
Topology
  • (a) A star
  • (b) A complete interconnect
  • (c) A tree
  • (d) A ring

48
Topology
  • (e) A grid
  • (f) A double torus
  • (g) A cube
  • (h) A 4D hypercube

49
Massively Parallel Processors
  • Use standard CPUs, such as the Intel Pentium, as
    their processors
  • Very high performance proprietary interconnection
    network
  • Enormous I/O capacity
  • Fault tolerant
  • Do not want a program that runs for many hours
    aborted because one CPU crashed

50
BlueGene/L Custom Processor Chip
51
BlueGene/L System
52
Cluster Computing
  • Consists of hundreds or thousands of computers
    connected by a commercially available network
    board
  • Centralized cluster: workstations mounted in a
    big rack in a single room
  • Decentralized cluster: workstations spread
    around a building or campus, connected by a LAN

53
Google
  • Built the world's largest off-the-shelf cluster
  • It bought cheap, modest performance PCs
  • Lots of them!
  • A typical Google PC has a 2 GHz Pentium, 512 MB
    RAM, 80 GB disk

54
A Typical Google Cluster
55
Communication Software
  • Special software required for interprocess
    communication and synchronization
  • Most message-passing systems provide two
    primitives: SEND and RECEIVE

56
Three Main Semantics
  • Synchronous message passing
  • If the sender has executed SEND and the receiver
    has not yet executed RECEIVE, the sender is
    blocked until the receiver executes RECEIVE
  • Buffered message passing
  • If a message is sent before the receiver is
    ready, the message is buffered somewhere until
    the receiver takes it out
  • Nonblocking message passing
  • Sender is allowed to continue immediately after
    executing SEND
  • However, it may not reuse the message buffer as
    the message may not have been sent yet
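The buffered semantics can be sketched with a standard-library queue: SEND deposits the message in a buffer and returns, and RECEIVE later takes it out. (Synchronous passing would instead block the sender until a matching RECEIVE, a rendezvous; nonblocking SEND would return even before the message buffer is safe to reuse.)

```python
import queue

# Buffered message passing: the queue is the "somewhere" the
# message is buffered until the receiver takes it out.
mailbox = queue.Queue()

def send(msg):
    mailbox.put(msg)        # returns immediately: sender not blocked

def receive():
    return mailbox.get()    # blocks until a message is available

send("hello")               # receiver has not run yet: no blocking
send("world")
print(receive(), receive()) # hello world
```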