Platform Design
1. Platform Design
MPSoC
- TU/e 5kk70
- Henk Corporaal
- Bart Mesman
2. Overview
- What is a platform, and why platform-based design?
- Why parallel platforms?
- A first classification of parallel systems
- Design choices for parallel systems
- Shared memory systems
- Memory coherency, consistency, synchronization, mutual exclusion
- Message passing systems
- Further decisions
3. Design: Product requirements?
- Short Time-to-Market
- Reuse / Standards
- Short design time
- Flexible solution
- Reduces design time
- Extends product lifetime: remote inspect and debug
- Scalability
- High performance and Low power
- Memory bottleneck, Wiring bottleneck
- Low cost
- High quality, reliability, dependability
- RTOS and libs
- Good programming environment
4. Solution?
- Platforms
- Programmable
- One or more processor cores
- Reconfigurable
- Scalable and flexible
- Memory hierarchy
- Exploit locality
- Separate local and global wiring
- HW and SW IP reuse
- Standardization (on SW and HW-interfaces)
- Raising design abstraction level
- Reliable
- Cheaper
- Advanced Design Flow for Platforms
5. What is a platform?
- A platform is a generic, but domain-specific, information processing (sub-)system
Generic means that it is flexible, containing programmable component(s). Platforms are meant to quickly realize your next system (in a certain domain). Single chip?
6. Platform example: TI OMAP
Up to 192 Mbyte off-chip memory
7. Platform and platform design
[Diagram, layer stack: Applications on top of the Platform, on top of Enabling technologies; system design technology (SDT) maps applications onto the platform, platform design technology (PDT) builds the platform, and together they form the design technology]
8. Why parallel processing?
- Performance drive
- Diminishing returns for exploiting ILP and OLP
- Multiple processors fit easily on a chip
- Cost effective (just connect existing processors or processor cores)
- Low power: parallelism may allow lowering Vdd
- However
- Parallel programming is hard
9. Low power through parallelism
- Sequential Processor:
- Switching capacitance C
- Frequency f
- Voltage V
- P ∝ f·C·V²
- Parallel Processor (two times the number of units):
- Switching capacitance 2C
- Frequency f/2
- Voltage V' < V
- P ∝ (f/2)·2C·V'² = f·C·V'² < f·C·V²
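To see the size of the saving, assume (an illustrative number, not from the slide) that the halved clock allows the supply to drop to V' = 0.7V:

\[
P_{\mathrm{par}} \propto \frac{f}{2}\,(2C)\,(0.7V)^2 = 0.49\, f\,C\,V^2 \approx \tfrac{1}{2}\,P_{\mathrm{seq}}
\]

i.e. about half the power at the same aggregate throughput.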
10. Power efficiency: compare 2 examples
- Intel Pentium 4 (Northwood) in 0.13 micron technology:
- 3.0 GHz
- 20 pipeline stages
- Aggressive buffering to boost clock frequency
- 13 nJ / instruction
- Philips TriMedia Lite in 0.13 micron technology:
- 250 MHz
- 8 pipeline stages
- Relaxed buffering, focus on instruction parallelism
- 0.2 nJ / instruction
- TriMedia does 65x better than the Pentium (13 / 0.2 = 65)
11. Parallel Architecture
- A parallel architecture extends traditional computer architecture with a communication network
- abstractions (HW/SW interface)
- organizational structure to realize the abstraction efficiently
[Diagram: processing nodes connected by a communication network]
12. Platform characteristics
- System level
- Processor level
- Communication network
- Memory system
- Tooling
13. System level characteristics
- Homogeneous vs. Heterogeneous
- Granularity of processing elements
- Type of supported parallelism: TLP, DLP
- Runtime mapping support?
14. Homogeneous or Heterogeneous
- Homogeneous:
- replication effect
- memory dominated anyway
- solve realization issues once and for all
- less flexible
- Typically:
- data-level parallelism
- shared memory
- dynamic task mapping
15. Example: Philips Wasabi
- Homogeneous multiprocessor for media applications
- First 65 nm silicon expected 1st half of 2006
- Two-level communication hierarchy
- Top: scalable message-passing network plus tiles
- Tile: shared memory plus processors, accelerators
- Fully cache coherent to support data parallelism
16. Homogeneous or Heterogeneous
- Heterogeneous
- better fit to application domain
- smaller increments
- Typically
- task level parallelism
- message passing
- static task mapping
17. Example: Viper2
- Heterogeneous
- Platform based
- > 60 different cores
- Task parallelism
- Sync with interrupts
- Streaming communication
- Semi-static application graph
- 50 M transistors
- 120 nm technology
- Powerful, efficient
18. Homogeneous or Heterogeneous
- Middle-of-the-road approach:
- Flexible tiles
- Fixed tile structure at top level
19. Types of parallelism
[Diagram: levels of parallelism from TLP (program/thread level) via kernel level down to ILP; both the TLP and ILP extremes can be exploited heterogeneously]
20. Processor level characteristics
- Processor consists of:
- Instruction engine (Control Processor, Ifetch unit)
- Processing element (PE): Register file, Function unit(s), L1 DMem
- Single PE vs. Multiple PEs (as in SIMD)
- Single FU/PE vs. Multiple FUs/PE (as in VLIW)
- Granularity of PEs, FUs
- Specialized vs. Generic
- Interruptable, pre-emption support
- Multithreading support (fast context switches)
- Clustering of PEs / clustering of FUs
- Type of inter-PE and inter-FU communication network
- Others: MMU, virtual memory, ...
21. Generic or Specialized? Intrinsic computational efficiency
22. (Pipelined) processor organization
[Diagram: instruction fetch and control driving a processing engine (PE) built from function units (FU)]
23. (Linear) SIMD Architecture
[Diagram: Control Processor with IMem broadcasting instructions to PE1 ... PEn]
- To be added:
- inter-PE communication
- communication from PEs to the Control Processor
- Input and Output
24. Communication network
- Bus (N-N) vs. NoC with point-to-point connections
- Topology, router degree
- Routing:
- path, path control, collision resolution, network support, deadlock handling, livelock handling
- virtual layer support
- flow control and buffering
- error handling
- Inter-chip network support
- Guarantees:
- TDMA
- GT vs. BE traffic
- etc., etc.
25. Comm. Network Performance metrics
- Network Bandwidth
- Need high bandwidth in communication
- How does it scale with the number of nodes?
- Communication Latency
- Affects performance, since the processor may have to wait
- Affects ease of programming, since it requires more thought to overlap communication and computation
- Latency Hiding
- How can a mechanism help hide latency?
- Examples (see the sketch below):
- overlap message send with computation,
- prefetch data,
- switch to other tasks
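As an illustration of the first example, overlapping a send with computation might look like this using MPI's non-blocking primitives (MPI is our choice of example API, not something the slides prescribe; do_local_work is a placeholder):

    #include <mpi.h>

    extern void do_local_work(void);   /* stands in for any local computation */

    void send_and_compute(double *buf, int count, int dest) {
        MPI_Request req;
        /* start the send, but do not wait for it */
        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
        do_local_work();                    /* overlap: compute while data is in flight */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* block only when completion matters */
    }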
26. Network Topology
- Topology determines:
- Degree: number of links from a node
- Diameter: max number of links crossed between nodes
- Average distance: number of links to a random destination
- Bisection: minimum number of links that separate the network into two halves
- Bisection bandwidth: link bandwidth x bisection
27. Common Topologies

Type       Degree   Diameter        Ave Dist       Bisection
1D mesh    2        N-1             N/3            1
2D mesh    4        2(N^(1/2)-1)    2N^(1/2)/3     N^(1/2)
3D mesh    6        3(N^(1/3)-1)    3N^(1/3)/3     N^(2/3)
nD mesh    2n       n(N^(1/n)-1)    nN^(1/n)/3     N^((n-1)/n)
Ring       2        N/2             N/4            2
2D torus   4        N^(1/2)         N^(1/2)/2      2N^(1/2)
Hypercube  log2 N   n = log2 N      n/2            N/2
2D tree    3        2 log2 N        ~2 log2 N      1
Crossbar   N-1      1               1              N^2/2

(N = number of nodes, n = dimension)
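A small sketch that plugs N = 64 into two rows of this table (helper names are ours); it reproduces the 2D-mesh and hypercube numbers used on the next slide:

    #include <math.h>
    #include <stdio.h>

    /* Evaluate the table's formulas for an N-node 2D mesh and hypercube. */
    int main(void) {
        int N = 64;
        double s = sqrt((double)N);              /* side of the 2D mesh */
        int    n = (int)round(log2((double)N));  /* hypercube dimension */

        printf("2D mesh:   degree 4, diameter %g, bisection %g\n",
               2 * (s - 1), s);                  /* 14 and 8 for N = 64 */
        printf("hypercube: degree %d, diameter %d, bisection %d\n",
               n, n, N / 2);                     /* 6, 6, and 32 for N = 64 */
        return 0;
    }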
28. Topology examples
[Diagrams: hypercube, grid/mesh, torus]
Assume 64 nodes:

Criteria               Bus   Ring   Mesh   2D torus   6-cube   Fully connected
Bisection bandwidth    1     2      8      16         32       1024
Ports/switch           --    3      5      5          7        64
Total links            1     128    176    192        256      2080
29. Butterfly or Omega Network
- All paths have equal length
- Unique path from any input to any output
- Try to avoid conflicts!
[Diagram: 8 x 8 butterfly switch]
30. Multistage Fat Tree
- A multistage fat tree (CM-5) avoids congestion at the root node
- Randomly assign packets to different paths on the way up to spread the load
- Increase degree near the root to decrease congestion
31. What did architects design in the '90s? Old (off-chip) MP Networks

Name       Number    Topology   Bits   Clock     Link BW   Bis. BW   Year
nCube/ten  1-1024    10-cube    1      10 MHz    1.2       640       1987
iPSC/2     16-128    7-cube     1      16 MHz    2         345       1988
MP-1216    32-512    2D grid    1      25 MHz    3         1,300     1989
Delta      540       2D grid    16     40 MHz    40        640       1991
CM-5       32-2048   fat tree   4      40 MHz    20        10,240    1991
CS-2       32-1024   fat tree   8      70 MHz    50        50,000    1992
Paragon    4-1024    2D grid    16     100 MHz   200       6,400     1992
T3D        16-1024   3D torus   16     150 MHz   300       19,200    1993

(Link BW and bisection BW in MBytes/s)

No standard topology! However, for on-chip, mesh and torus are in favor!
32. Memory hierarchy
- Number of memory levels: 1, 2, 3, 4
- HW vs. SW controlled level 1
- Cache or scratchpad memory for L1
- Central vs. distributed memory
- Shared vs. distributed memory address space
- Intelligent DMA support: Communication Assist
- For shared memory:
- coherency
- consistency
- synchronization
33. Intermezzo: What's the problem with memory?
[Figure (Patterson): processor vs. DRAM performance, 1980-2000, log scale; µProc improves ~55%/year ("Moore's Law"), DRAM only ~7%/year, so the processor-memory gap keeps growing]
Memories can also be big power consumers!
34. Multiple levels of memory
[Diagram, architecture concept: a hierarchy of tiles from level 0 up to level N; at each level, CPUs, accelerators, and reconfigurable HW blocks share a communication network with memory and I/O]
35. Communication models: Shared Memory
[Diagram: processes P1 and P2 both read and write a shared memory]
- Coherence problem
- Memory consistency issue
- Synchronization problem
36. Communication models: Shared memory
- Shared address space
- Communication primitives:
- load, store, atomic swap
- Two varieties:
- Physically shared => Symmetric Multi-Processors (SMP)
- usually combined with local caching
- Physically distributed => Distributed Shared Memory (DSM)
37. SMP: Symmetric Multi-Processor
- Memory: centralized with uniform access time (UMA), bus interconnect, and I/O
- Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Diagram: processors with caches on a shared bus to main memory and the I/O system]
38. DSM: Distributed Shared Memory
- Non-uniform access time (NUMA) and scalable interconnect (distributed memory)
[Diagram: processor + memory + I/O nodes connected by an interconnection network]
39. Shared Address Model Summary
- Each processor can name every physical location in the machine
- Each process can name all data it shares with other processes
- Data transfer via load and store
- Data size: byte, word, ..., or cache blocks
- Memory hierarchy model applies:
- communication moves data to the local processor cache
40. Communication models: Message Passing
- Communication primitives
- e.g., send, receive library calls
- Note that MP can be built on top of SM, and vice versa
41. Message Passing Model
- Explicit message send and receive operations
- Send specifies the local buffer and the receiving process on the remote computer
- Receive specifies the sending process on the remote computer and the local buffer to place the data in
- Typically blocking communication, but may use DMA
Message structure: Header | Data | Trailer
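A plausible layout for such a message, purely as an illustration (the field names and sizes are our assumptions, not from the slides):

    #include <stdint.h>

    /* Illustrative message layout: header (routing + bookkeeping),
       payload, trailer (integrity check). Field choices are ours. */
    typedef struct {
        uint16_t dest;       /* receiving node/process */
        uint16_t src;        /* sending node/process */
        uint16_t length;     /* payload length in bytes */
        uint16_t tag;        /* lets the receiver match sends */
    } msg_header_t;

    typedef struct {
        msg_header_t header;
        uint8_t      data[256];  /* payload (size is arbitrary here) */
        uint32_t     checksum;   /* trailer: error detection */
    } message_t;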
42. Message passing communication
[Diagram: four processor + cache + memory nodes connected by an interconnection network; no shared memory]
43. Communication Models: Comparison
- Shared-Memory:
- Compatibility with well-understood (language) mechanisms
- Ease of programming for complex or dynamic communication patterns
- Shared-memory applications: sharing of large data structures
- Efficient for small items
- Supports hardware caching
- Message Passing:
- Simpler hardware
- Explicit communication
- Improved synchronization
44. Challenges of parallel processing
- Q1: can we get linear speedup?
- Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)?
- Q2: how important is communication latency?
- Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with a base CPI of 0.5. What's the communication impact? (Both questions are worked out below.)
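Worked out with Amdahl's law and a simple CPI model (standard reasoning; only the numbers given above are used):

\[
\textbf{Q1:}\quad \frac{1}{(1-f)+f/100} = 80 \;\Rightarrow\; f=\frac{1-1/80}{1-1/100}\approx 0.9975
\]

so at most about 0.25% of the computation may be sequential: linear speedup needs an almost perfectly parallel program.

\[
\textbf{Q2:}\quad \mathrm{CPI} = 0.5 + 0.002\times 100 = 0.7
\]

so even 0.2% remote accesses make the machine 0.7/0.5 = 1.4x slower.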
45. Three fundamental issues for shared memory multiprocessors
- Coherence: do I see the most recent data?
- Consistency: when do I see a written value?
- e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
- Synchronization: how to synchronize processes?
- how to protect access to shared data?
46. Coherence problem in a single-CPU system
[Diagram: the CPU cache holds copies a' = 100 and b' = 200 of memory locations a = 100 and b = 200; I/O reads and writes memory directly, so cache and memory can disagree]
47. Coherence problem in a multi-processor system
[Diagram: CPU-1's cache holds a' = 550 and b' = 200; CPU-2's cache holds a'' = 100 and b'' = 200; memory holds a = 100 and b = 200. CPU-1 has updated a, but CPU-2 and memory still see the old value 100]
48. What Does Coherency Mean?
- Informally:
- Any read must return the most recent write
- Too strict and too difficult to implement
- Better:
- Any write must eventually be seen by a read
- All writes are seen in proper order (serialization)
49. Two rules to ensure coherency
- If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
- Writes to a single location are serialized: seen in one order
- Latest write will be seen
- Otherwise one could see writes in an illogical order (an older value after a newer one)
50. Potential HW Coherency Solutions
- Snooping Solution (Snoopy Bus):
- Send all requests for data to all processors (or their local caches)
- Processors snoop to see if they have a copy and respond accordingly
- Requires broadcast, since caching information is at the processors
- Works well with a bus (natural broadcast medium)
- Dominates for small-scale machines (most of the market)
- Directory-Based Schemes:
- Keep track of what is being shared in one centralized place
- Distributed memory => distributed directory for scalability (avoids bottlenecks)
- Send point-to-point requests to processors via the network
- Scales better than snooping
- Actually existed BEFORE snooping-based schemes
51. Example: Snooping protocol
- 3 states for each cache line:
- invalid, shared, modified (exclusive)
- FSM per cache, receives requests from both the processor and the bus
[Diagram: processors with caches on a shared bus to main memory and the I/O system]
52. Cache coherence protocol
- Write-invalidate protocol for a write-back cache
- Showing state transitions for each block in the cache (see the sketch below)
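The slide's FSM is a diagram; a minimal C sketch of the same three-state write-invalidate machine (the state and event names are the usual textbook ones, ours rather than the slide's):

    /* Minimal sketch of an MSI write-invalidate snooping protocol. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    typedef enum {
        CPU_READ, CPU_WRITE,   /* requests from the local processor */
        BUS_READ, BUS_WRITE    /* requests snooped on the bus */
    } event_t;

    line_state_t next_state(line_state_t s, event_t e) {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  return SHARED;    /* read miss: fetch block */
            if (e == CPU_WRITE) return MODIFIED;  /* write miss: fetch + invalidate others */
            return INVALID;
        case SHARED:
            if (e == CPU_WRITE) return MODIFIED;  /* send invalidate on the bus */
            if (e == BUS_WRITE) return INVALID;   /* another cache writes: drop copy */
            return SHARED;
        case MODIFIED:
            if (e == BUS_READ)  return SHARED;    /* write block back, share it */
            if (e == BUS_WRITE) return INVALID;   /* write back, then invalidate */
            return MODIFIED;
        }
        return s;
    }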
53. Synchronization problem
- Computer system of a bank has a credit process (P_c) and a debit process (P_d); an interleaving that goes wrong is shown below

/* Process P_c */
shared int balance;
private int amount;

balance += amount;

lw  t0, balance
lw  t1, amount
add t0, t0, t1
sw  t0, balance

/* Process P_d */
shared int balance;
private int amount;

balance -= amount;

lw  t2, balance
lw  t3, amount
sub t2, t2, t3
sw  t2, balance
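One interleaving that loses an update; assume balance = 100 initially and both amounts are 50 (the values are illustrative):

    P_c: lw  t0, balance     ; t0 = 100
    P_d: lw  t2, balance     ; t2 = 100  (reads the same old value)
    P_c: lw  t1, amount      ; t1 = 50
    P_c: add t0, t0, t1      ; t0 = 150
    P_c: sw  t0, balance     ; balance = 150
    P_d: lw  t3, amount      ; t3 = 50
    P_d: sub t2, t2, t3      ; t2 = 50
    P_d: sw  t2, balance     ; balance = 50 -- the credit is lost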
54. Critical Section Problem
- n processes, all competing to use some shared data
- Each process has a code segment, called the critical section, in which the shared data is accessed
- Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section
- Structure of a process:

while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}
55. Attempt 1: Strict Alternation

Process P0:
    shared int turn;
    while (TRUE) {
        while (turn != 0) ;   /* busy wait */
        critical_section();
        turn = 1;
        remainder_section();
    }

Process P1:
    shared int turn;
    while (TRUE) {
        while (turn != 1) ;   /* busy wait */
        critical_section();
        turn = 0;
        remainder_section();
    }

- Two problems:
- Satisfies mutual exclusion, but not progress (works only when both processes strictly alternate)
- Busy waiting
56. Attempt 2: Warning Flags

Process P0:
    shared int flag[2];
    while (TRUE) {
        flag[0] = TRUE;
        while (flag[1]) ;     /* busy wait */
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }

Process P1:
    shared int flag[2];
    while (TRUE) {
        flag[1] = TRUE;
        while (flag[0]) ;     /* busy wait */
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }

- Satisfies mutual exclusion:
- P0 in critical section: flag[0] && !flag[1]
- P1 in critical section: !flag[0] && flag[1]
- However, contains a deadlock (both flags may be set to TRUE!)
57. Software solution: Peterson's Algorithm
(combining warning flags and alternation)

Process P0:
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[0] = TRUE;
        turn = 0;
        while (turn == 0 && flag[1]) ;   /* busy wait */
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }

Process P1:
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[1] = TRUE;
        turn = 1;
        while (turn == 1 && flag[0]) ;   /* busy wait */
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }

The software solution is slow!
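For reference, a C11 rendering of the same algorithm (names are ours). Note that on modern weakly-ordered hardware the plain-variable version above is unsafe, so this sketch uses sequentially consistent atomics:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool flag[2];
    static atomic_int  turn;

    void lock(int self) {                 /* self is 0 or 1 */
        int other = 1 - self;
        atomic_store(&flag[self], true);  /* announce intent */
        atomic_store(&turn, self);        /* last writer of turn yields */
        while (atomic_load(&turn) == self &&
               atomic_load(&flag[other]))
            ;                             /* busy wait */
    }

    void unlock(int self) {
        atomic_store(&flag[self], false);
    }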
58. Issues for Synchronization
- Hardware support:
- an un-interruptable instruction to fetch-and-update memory (atomic operation)
- user-level synchronization operation(s) using this primitive
- For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and the latency of synchronization
59. Uninterruptable Instructions to Fetch and Update Memory
- Atomic exchange: interchange a value in a register for a value in memory
- 0 => synchronization variable is free
- 1 => synchronization variable is locked and unavailable
- Test-and-set: tests a value and sets it if the value passes the test (also compare-and-swap)
- Fetch-and-increment: returns the value of a memory location and atomically increments it
- 0 => synchronization variable is free
60. User-Level Synchronization Operations
- Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

        LI    R2, 1         ; load immediate
lockit: EXCH  R2, 0(R1)     ; atomic exchange
        BNEZ  R2, lockit    ; already locked?

- What about an MP with cache coherency?
- Want to spin on a cached copy to avoid full memory latency
- Likely to get cache hits for such variables
- Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
- Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

try:    LI    R2, 1         ; load immediate
lockit: LW    R3, 0(R1)     ; load var
        BNEZ  R3, lockit    ; not free => spin
        EXCH  R2, 0(R1)     ; atomic exchange
        BNEZ  R2, try       ; already locked?
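The same test-and-test&set idea in C11 atomics (a sketch with our own names, not from the slides):

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        for (;;) {
            /* spin on a (cached) read first: no bus traffic while locked */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* lock looks free: now try the atomic exchange */
            if (atomic_exchange(&l->locked, 1) == 0)
                return;   /* got it */
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store(&l->locked, 0);
    }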
61. Fetch and Update (cont'd)
- Hard to have read & write in one instruction: use two instead
- Load Linked (or load locked) + Store Conditional
- Load linked returns the initial value
- Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
- Example: atomic swap with LL & SC:

try:  OR    R3, R4, R0     ; R4 -> R3
      LL    R2, 0(R1)      ; load linked
      SC    R3, 0(R1)      ; store conditional
      BEQZ  R3, try        ; branch if store fails (R3 = 0)
      MOV   R4, R2         ; put loaded value in R4

- Example: fetch & increment with LL & SC:

try:  LL    R2, 0(R1)      ; load linked
      ADDUI R3, R2, 1      ; increment
      SC    R3, 0(R1)      ; store conditional
      BEQZ  R3, try        ; branch if store fails (R3 = 0)
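On machines without LL/SC, a compare-and-swap loop plays the same role; a C11 sketch (function name is ours):

    #include <stdatomic.h>

    /* Fetch-and-increment built from compare-and-swap; a failing
       compare-exchange corresponds to a failing SC, so we retry. */
    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;   /* on failure, old is reloaded with the current value */
        return old;
    }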
62. Another MP Issue: Memory Consistency
- What is consistency? When must a processor see a new memory value?
- Example:

    P1: A = 0;              P2: B = 0;
        .....                   .....
        A = 1;                  B = 1;
    L1: if (B == 0) ...     L2: if (A == 0) ...

- It seems impossible for both if-statements L1 & L2 to be true?
- What if write invalidate is delayed & the processor continues?
- Memory consistency models: what are the rules for such cases?
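A C11 sketch of the example (names ours): with relaxed atomics, both loads may return 0, which is exactly the "impossible" outcome; making all four operations memory_order_seq_cst rules it out:

    #include <stdatomic.h>
    #include <threads.h>

    atomic_int A, B;
    int r1, r2;

    int t1(void *arg) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&B, memory_order_relaxed);  /* may be 0 */
        return 0;
    }

    int t2(void *arg) {
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&A, memory_order_relaxed);  /* may be 0 */
        return 0;
    }

    int main(void) {
        thrd_t a, b;
        thrd_create(&a, t1, 0);
        thrd_create(&b, t2, 0);
        thrd_join(a, 0);
        thrd_join(b, 0);
        /* r1 == 0 && r2 == 0 is allowed under relaxed ordering */
    }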
63. Tooling, OS, and Mapping
- Which mapping steps are performed in HW?
- Pre-emption support
- Programming model:
- streaming or vector support
- (like KernelC and StreamingC for Imagine, StreamIt for RAW)
- Process communication: Shared memory vs. Message passing
- Process synchronization
64. A few platform examples
65. Massively Parallel Processors Targeting Digital Signal Processing Applications
66. Field Programmable Object Array: MathStar
67. PACT XPP-III Processor array
68. RAW processor from MIT
69. RAW Switch Detail
Raw exposes wire delay at the ISA level. This allows the compiler to explicitly manage the static network: routes are compiled into the static router, and messages arrive in known order.
Latency: 2 hops. Throughput: 1 word/cycle per direction, per network.
70. Philips AETHEREAL
The router provides both guaranteed throughput (GT) and best effort (BE) services to communicate with IPs. The combination of GT and BE leads to efficient use of bandwidth and a simple programming model.
[Diagram: IP blocks attached through network interfaces to the router network]