Platform Design
1. Platform Design
MPSoC
- TU/e 5kk70
- Henk Corporaal
- Bart Mesman
2. Overview
- What is a platform, and why platform-based design?
- Why parallel platforms?
- A first classification of parallel systems
- Design choices for parallel systems
- Shared memory systems
- Memory coherency, consistency, synchronization, mutual exclusion
- Message passing systems
- Further decisions
3. Design: Product requirements?
- Short Time-to-Market
- Reuse / Standards
- Short design time
- Flexible solution
- Reduces design time
- Extends product lifetime: remote inspect and debug
- Scalability
- High performance and Low power
- Memory bottleneck, Wiring bottleneck
- Low cost
- High quality, reliability, dependability
- RTOS and libs
- Good programming environment
4. Solution?
- Platforms
- Programmable
- One or more processor cores
- Reconfigurable
- Scalable and flexible
- Memory hierarchy
- Exploit locality
- Separate local and global wiring
- HW and SW IP reuse
- Standardization (on SW and HW-interfaces)
- Raising design abstraction level
- Reliable
- Cheaper
- Advanced Design Flow for Platforms
5. What is a platform?
- A platform is a generic, but domain-specific, information processing (sub-)system
Generic means that it is flexible, containing programmable component(s). Platforms are meant to quickly realize your next system (in a certain domain). Single chip?
6. Platform example: TI OMAP
Up to 192 Mbyte off-chip memory
7. Platform and platform design
[Diagram, layer stack: Applications on top of the Platform, on top of Enabling technologies; system design technology (SDT) maps applications onto the platform, platform design technology (PDT) builds the platform, and together they form the design technology]
8. Why parallel processing?
- Performance drive
- Diminishing returns for exploiting ILP and OLP
- Multiple processors fit easily on a chip
- Cost effective (just connect existing processors or processor cores)
- Low power: parallelism may allow lowering Vdd
- However
- Parallel programming is hard
9. Low power through parallelism
- Sequential Processor:
- Switching capacitance C
- Frequency f
- Voltage V
- P ∝ f·C·V²
- Parallel Processor (two times the number of units):
- Switching capacitance 2C
- Frequency f/2
- Voltage V' < V
- P ∝ (f/2)·2C·V'² = f·C·V'² < f·C·V²
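To see the size of the saving, assume (an illustrative number, not from the slide) that the halved clock allows the supply to drop to V' = 0.7V:

\[
P_{\mathrm{par}} \propto \frac{f}{2}\,(2C)\,(0.7V)^2 = 0.49\, f\,C\,V^2 \approx \tfrac{1}{2}\,P_{\mathrm{seq}}
\]

i.e. about half the power at the same aggregate throughput.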
10. Power efficiency: compare 2 examples
- Intel Pentium 4 (Northwood) in 0.13 micron technology:
- 3.0 GHz
- 20 pipeline stages
- Aggressive buffering to boost clock frequency
- 13 nJ / instruction
- Philips TriMedia Lite in 0.13 micron technology:
- 250 MHz
- 8 pipeline stages
- Relaxed buffering, focus on instruction parallelism
- 0.2 nJ / instruction
- TriMedia does 65x better than the Pentium (13 / 0.2 = 65)
11. Parallel Architecture
- A parallel architecture extends traditional computer architecture with a communication network
- abstractions (HW/SW interface)
- organizational structure to realize the abstraction efficiently
[Diagram: processing nodes connected by a communication network]
12. Platform characteristics
- System level
- Processor level
- Communication network
- Memory system
- Tooling
13. System level characteristics
- Homogeneous vs. Heterogeneous
- Granularity of processing elements
- Type of supported parallelism: TLP, DLP
- Runtime mapping support?
14. Homogeneous or Heterogeneous
- Homogeneous:
- replication effect
- memory dominated anyway
- solve realization issues once and for all
- less flexible
- Typically:
- data-level parallelism
- shared memory
- dynamic task mapping
15. Example: Philips Wasabi
- Homogeneous multiprocessor for media applications
- First 65 nm silicon expected 1st half of 2006
- Two-level communication hierarchy
- Top: scalable message-passing network plus tiles
- Tile: shared memory plus processors, accelerators
- Fully cache coherent to support data parallelism
16. Homogeneous or Heterogeneous
- Heterogeneous
- better fit to application domain
- smaller increments
- Typically
- task level parallelism
- message passing
- static task mapping
17. Example: Viper2
- Heterogeneous
- Platform based
- > 60 different cores
- Task parallelism
- Sync with interrupts
- Streaming communication
- Semi-static application graph
- 50 M transistors
- 120 nm technology
- Powerful, efficient
18. Homogeneous or Heterogeneous
- Middle-of-the-road approach:
- Flexible tiles
- Fixed tile structure at top level
19. Types of parallelism
[Diagram: levels of parallelism from TLP (program/thread level) via kernel level down to ILP; both the TLP and ILP extremes can be exploited heterogeneously]
20. Processor level characteristics
- Processor consists of:
- Instruction engine (Control Processor, Ifetch unit)
- Processing element (PE): Register file, Function unit(s), L1 DMem
- Single PE vs. Multiple PEs (as in SIMD)
- Single FU/PE vs. Multiple FUs/PE (as in VLIW)
- Granularity of PEs, FUs
- Specialized vs. Generic
- Interruptable, pre-emption support
- Multithreading support (fast context switches)
- Clustering of PEs / clustering of FUs
- Type of inter-PE and inter-FU communication network
- Others: MMU, virtual memory, ...
21. Generic or Specialized? Intrinsic computational efficiency
22. (Pipelined) processor organization
[Diagram: instruction fetch and control driving a processing engine (PE) built from function units (FU)]
23. (Linear) SIMD Architecture
[Diagram: Control Processor with IMem broadcasting instructions to PE1 ... PEn]
- To be added:
- inter-PE communication
- communication from PEs to the Control Processor
- Input and Output
24. Communication network
- Bus (N-N) vs. NoC with point-to-point connections
- Topology, router degree
- Routing:
- path, path control, collision resolution, network support, deadlock handling, livelock handling
- virtual layer support
- flow control and buffering
- error handling
- Inter-chip network support
- Guarantees:
- TDMA
- GT vs. BE traffic
- etc., etc.
25. Comm. Network Performance metrics
- Network Bandwidth
- Need high bandwidth in communication
- How does it scale with the number of nodes?
- Communication Latency
- Affects performance, since the processor may have to wait
- Affects ease of programming, since it requires more thought to overlap communication and computation
- Latency Hiding
- How can a mechanism help hide latency?
- Examples (see the sketch below):
- overlap message send with computation,
- prefetch data,
- switch to other tasks
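As an illustration of the first example, overlapping a send with computation might look like this using MPI's non-blocking primitives (MPI is our choice of example API, not something the slides prescribe; do_local_work is a placeholder):

    #include <mpi.h>

    extern void do_local_work(void);   /* stands in for any local computation */

    void send_and_compute(double *buf, int count, int dest) {
        MPI_Request req;
        /* start the send, but do not wait for it */
        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
        do_local_work();                    /* overlap: compute while data is in flight */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* block only when completion matters */
    }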
26. Network Topology
- Topology determines:
- Degree: number of links from a node
- Diameter: max number of links crossed between nodes
- Average distance: number of links to a random destination
- Bisection: minimum number of links that separate the network into two halves
- Bisection bandwidth: link bandwidth x bisection
27. Common Topologies

Type       Degree   Diameter        Ave Dist       Bisection
1D mesh    2        N-1             N/3            1
2D mesh    4        2(N^(1/2)-1)    2N^(1/2)/3     N^(1/2)
3D mesh    6        3(N^(1/3)-1)    3N^(1/3)/3     N^(2/3)
nD mesh    2n       n(N^(1/n)-1)    nN^(1/n)/3     N^((n-1)/n)
Ring       2        N/2             N/4            2
2D torus   4        N^(1/2)         N^(1/2)/2      2N^(1/2)
Hypercube  log2 N   n = log2 N      n/2            N/2
2D tree    3        2 log2 N        ~2 log2 N      1
Crossbar   N-1      1               1              N^2/2

(N = number of nodes, n = dimension)
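A small sketch that plugs N = 64 into two rows of this table (helper names are ours); it reproduces the 2D-mesh and hypercube numbers used on the next slide:

    #include <math.h>
    #include <stdio.h>

    /* Evaluate the table's formulas for an N-node 2D mesh and hypercube. */
    int main(void) {
        int N = 64;
        double s = sqrt((double)N);              /* side of the 2D mesh */
        int    n = (int)round(log2((double)N));  /* hypercube dimension */

        printf("2D mesh:   degree 4, diameter %g, bisection %g\n",
               2 * (s - 1), s);                  /* 14 and 8 for N = 64 */
        printf("hypercube: degree %d, diameter %d, bisection %d\n",
               n, n, N / 2);                     /* 6, 6, and 32 for N = 64 */
        return 0;
    }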
28. Topology examples
[Diagrams: hypercube, grid/mesh, torus]
Assume 64 nodes:

Criteria               Bus   Ring   Mesh   2D torus   6-cube   Fully connected
Bisection bandwidth    1     2      8      16         32       1024
Ports/switch           --    3      5      5          7        64
Total links            1     128    176    192        256      2080
29. Butterfly or Omega Network
- All paths have equal length
- Unique path from any input to any output
- Try to avoid conflicts!
[Diagram: 8 x 8 butterfly switch]
30. Multistage Fat Tree
- A multistage fat tree (CM-5) avoids congestion at the root node
- Randomly assign packets to different paths on the way up to spread the load
- Increase degree near the root to decrease congestion
31. What did architects design in the '90s? Old (off-chip) MP Networks

Name       Number    Topology   Bits   Clock     Link BW   Bis. BW   Year
nCube/ten  1-1024    10-cube    1      10 MHz    1.2       640       1987
iPSC/2     16-128    7-cube     1      16 MHz    2         345       1988
MP-1216    32-512    2D grid    1      25 MHz    3         1,300     1989
Delta      540       2D grid    16     40 MHz    40        640       1991
CM-5       32-2048   fat tree   4      40 MHz    20        10,240    1991
CS-2       32-1024   fat tree   8      70 MHz    50        50,000    1992
Paragon    4-1024    2D grid    16     100 MHz   200       6,400     1992
T3D        16-1024   3D torus   16     150 MHz   300       19,200    1993

(Link BW and bisection BW in MBytes/s)

No standard topology! However, for on-chip, mesh and torus are in favor!
32. Memory hierarchy
- Number of memory levels: 1, 2, 3, 4
- HW vs. SW controlled level 1
- Cache or scratchpad memory for L1
- Central vs. distributed memory
- Shared vs. distributed memory address space
- Intelligent DMA support: Communication Assist
- For shared memory:
- coherency
- consistency
- synchronization
33. Intermezzo: What's the problem with memory?
[Figure (Patterson): processor vs. DRAM performance, 1980-2000, log scale; µProc improves ~55%/year ("Moore's Law"), DRAM only ~7%/year, so the processor-memory gap keeps growing]
Memories can also be big power consumers!
34. Multiple levels of memory
[Diagram, architecture concept: a hierarchy of tiles from level 0 up to level N; at each level, CPUs, accelerators, and reconfigurable HW blocks share a communication network with memory and I/O]
35. Communication models: Shared Memory
[Diagram: processes P1 and P2 both read and write a shared memory]
- Coherence problem
- Memory consistency issue
- Synchronization problem
36. Communication models: Shared memory
- Shared address space
- Communication primitives:
- load, store, atomic swap
- Two varieties:
- Physically shared => Symmetric Multi-Processors (SMP)
- usually combined with local caching
- Physically distributed => Distributed Shared Memory (DSM)
37. SMP: Symmetric Multi-Processor
- Memory: centralized with uniform access time (UMA), bus interconnect, and I/O
- Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Diagram: processors with caches on a shared bus to main memory and the I/O system]
38. DSM: Distributed Shared Memory
- Non-uniform access time (NUMA) and scalable interconnect (distributed memory)
[Diagram: processor + memory + I/O nodes connected by an interconnection network]
39. Shared Address Model Summary
- Each processor can name every physical location in the machine
- Each process can name all data it shares with other processes
- Data transfer via load and store
- Data size: byte, word, ..., or cache blocks
- Memory hierarchy model applies:
- communication moves data to the local processor cache
40. Communication models: Message Passing
- Communication primitives
- e.g., send, receive library calls
- Note that MP can be built on top of SM, and vice versa
41. Message Passing Model
- Explicit message send and receive operations
- Send specifies the local buffer and the receiving process on the remote computer
- Receive specifies the sending process on the remote computer and the local buffer to place the data in
- Typically blocking communication, but may use DMA
Message structure: Header | Data | Trailer
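A plausible layout for such a message, purely as an illustration (the field names and sizes are our assumptions, not from the slides):

    #include <stdint.h>

    /* Illustrative message layout: header (routing + bookkeeping),
       payload, trailer (integrity check). Field choices are ours. */
    typedef struct {
        uint16_t dest;       /* receiving node/process */
        uint16_t src;        /* sending node/process */
        uint16_t length;     /* payload length in bytes */
        uint16_t tag;        /* lets the receiver match sends */
    } msg_header_t;

    typedef struct {
        msg_header_t header;
        uint8_t      data[256];  /* payload (size is arbitrary here) */
        uint32_t     checksum;   /* trailer: error detection */
    } message_t;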
42. Message passing communication
[Diagram: four processor + cache + memory nodes connected by an interconnection network; no shared memory]
43. Communication Models: Comparison
- Shared-Memory:
- Compatibility with well-understood (language) mechanisms
- Ease of programming for complex or dynamic communication patterns
- Shared-memory applications: sharing of large data structures
- Efficient for small items
- Supports hardware caching
- Message Passing:
- Simpler hardware
- Explicit communication
- Improved synchronization
44. Challenges of parallel processing
- Q1: can we get linear speedup?
- Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)?
- Q2: how important is communication latency?
- Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with a base CPI of 0.5. What's the communication impact? (Both questions are worked out below.)
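Worked out with Amdahl's law and a simple CPI model (standard reasoning; only the numbers given above are used):

\[
\textbf{Q1:}\quad \frac{1}{(1-f)+f/100} = 80 \;\Rightarrow\; f=\frac{1-1/80}{1-1/100}\approx 0.9975
\]

so at most about 0.25% of the computation may be sequential: linear speedup needs an almost perfectly parallel program.

\[
\textbf{Q2:}\quad \mathrm{CPI} = 0.5 + 0.002\times 100 = 0.7
\]

so even 0.2% remote accesses make the machine 0.7/0.5 = 1.4x slower.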
45. Three fundamental issues for shared memory multiprocessors
- Coherence: do I see the most recent data?
- Consistency: when do I see a written value?
- e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
- Synchronization: how to synchronize processes?
- how to protect access to shared data?
46. Coherence problem in a single-CPU system
[Diagram: the CPU cache holds copies a' = 100 and b' = 200 of memory locations a = 100 and b = 200; I/O reads and writes memory directly, so cache and memory can disagree]
47. Coherence problem in a multi-processor system
[Diagram: CPU-1's cache holds a' = 550 and b' = 200; CPU-2's cache holds a'' = 100 and b'' = 200; memory holds a = 100 and b = 200. CPU-1 has updated a, but CPU-2 and memory still see the old value 100]
48. What Does Coherency Mean?
- Informally:
- Any read must return the most recent write
- Too strict and too difficult to implement
- Better:
- Any write must eventually be seen by a read
- All writes are seen in proper order (serialization)
49. Two rules to ensure coherency
- If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
- Writes to a single location are serialized: seen in one order
- Latest write will be seen
- Otherwise one could see writes in an illogical order (an older value after a newer one)
50. Potential HW Coherency Solutions
- Snooping Solution (Snoopy Bus):
- Send all requests for data to all processors (or their local caches)
- Processors snoop to see if they have a copy and respond accordingly
- Requires broadcast, since caching information is at the processors
- Works well with a bus (natural broadcast medium)
- Dominates for small-scale machines (most of the market)
- Directory-Based Schemes:
- Keep track of what is being shared in one centralized place
- Distributed memory => distributed directory for scalability (avoids bottlenecks)
- Send point-to-point requests to processors via the network
- Scales better than snooping
- Actually existed BEFORE snooping-based schemes
51. Example: Snooping protocol
- 3 states for each cache line:
- invalid, shared, modified (exclusive)
- FSM per cache, receives requests from both the processor and the bus
[Diagram: processors with caches on a shared bus to main memory and the I/O system]
52. Cache coherence protocol
- Write-invalidate protocol for a write-back cache
- Showing state transitions for each block in the cache (see the sketch below)
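The slide's FSM is a diagram; a minimal C sketch of the same three-state write-invalidate machine (the state and event names are the usual textbook ones, ours rather than the slide's):

    /* Minimal sketch of an MSI write-invalidate snooping protocol. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    typedef enum {
        CPU_READ, CPU_WRITE,   /* requests from the local processor */
        BUS_READ, BUS_WRITE    /* requests snooped on the bus */
    } event_t;

    line_state_t next_state(line_state_t s, event_t e) {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  return SHARED;    /* read miss: fetch block */
            if (e == CPU_WRITE) return MODIFIED;  /* write miss: fetch + invalidate others */
            return INVALID;
        case SHARED:
            if (e == CPU_WRITE) return MODIFIED;  /* send invalidate on the bus */
            if (e == BUS_WRITE) return INVALID;   /* another cache writes: drop copy */
            return SHARED;
        case MODIFIED:
            if (e == BUS_READ)  return SHARED;    /* write block back, share it */
            if (e == BUS_WRITE) return INVALID;   /* write back, then invalidate */
            return MODIFIED;
        }
        return s;
    }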
53. Synchronization problem
- Computer system of a bank has a credit process (P_c) and a debit process (P_d); an interleaving that goes wrong is shown below

/* Process P_c */
shared int balance;
private int amount;

balance += amount;

lw  t0, balance
lw  t1, amount
add t0, t0, t1
sw  t0, balance

/* Process P_d */
shared int balance;
private int amount;

balance -= amount;

lw  t2, balance
lw  t3, amount
sub t2, t2, t3
sw  t2, balance
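One interleaving that loses an update; assume balance = 100 initially and both amounts are 50 (the values are illustrative):

    P_c: lw  t0, balance     ; t0 = 100
    P_d: lw  t2, balance     ; t2 = 100  (reads the same old value)
    P_c: lw  t1, amount      ; t1 = 50
    P_c: add t0, t0, t1      ; t0 = 150
    P_c: sw  t0, balance     ; balance = 150
    P_d: lw  t3, amount      ; t3 = 50
    P_d: sub t2, t2, t3      ; t2 = 50
    P_d: sw  t2, balance     ; balance = 50 -- the credit is lost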
54. Critical Section Problem
- n processes, all competing to use some shared data
- Each process has a code segment, called the critical section, in which the shared data is accessed
- Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section
- Structure of a process:

while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}
55. Attempt 1: Strict Alternation

Process P0:
    shared int turn;
    while (TRUE) {
        while (turn != 0) ;   /* busy wait */
        critical_section();
        turn = 1;
        remainder_section();
    }

Process P1:
    shared int turn;
    while (TRUE) {
        while (turn != 1) ;   /* busy wait */
        critical_section();
        turn = 0;
        remainder_section();
    }

- Two problems:
- Satisfies mutual exclusion, but not progress (works only when both processes strictly alternate)
- Busy waiting
56. Attempt 2: Warning Flags

Process P0:
    shared int flag[2];
    while (TRUE) {
        flag[0] = TRUE;
        while (flag[1]) ;     /* busy wait */
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }

Process P1:
    shared int flag[2];
    while (TRUE) {
        flag[1] = TRUE;
        while (flag[0]) ;     /* busy wait */
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }

- Satisfies mutual exclusion:
- P0 in critical section: flag[0] && !flag[1]
- P1 in critical section: !flag[0] && flag[1]
- However, contains a deadlock (both flags may be set to TRUE!)
57. Software solution: Peterson's Algorithm
(combining warning flags and alternation)

Process P0:
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[0] = TRUE;
        turn = 0;
        while (turn == 0 && flag[1]) ;   /* busy wait */
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }

Process P1:
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[1] = TRUE;
        turn = 1;
        while (turn == 1 && flag[0]) ;   /* busy wait */
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }

The software solution is slow!
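For reference, a C11 rendering of the same algorithm (names are ours). Note that on modern weakly-ordered hardware the plain-variable version above is unsafe, so this sketch uses sequentially consistent atomics:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool flag[2];
    static atomic_int  turn;

    void lock(int self) {                 /* self is 0 or 1 */
        int other = 1 - self;
        atomic_store(&flag[self], true);  /* announce intent */
        atomic_store(&turn, self);        /* last writer of turn yields */
        while (atomic_load(&turn) == self &&
               atomic_load(&flag[other]))
            ;                             /* busy wait */
    }

    void unlock(int self) {
        atomic_store(&flag[self], false);
    }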
58. Issues for Synchronization
- Hardware support:
- an un-interruptable instruction to fetch-and-update memory (atomic operation)
- user-level synchronization operation(s) using this primitive
- For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and the latency of synchronization
59. Uninterruptable Instructions to Fetch and Update Memory
- Atomic exchange: interchange a value in a register for a value in memory
- 0 => synchronization variable is free
- 1 => synchronization variable is locked and unavailable
- Test-and-set: tests a value and sets it if the value passes the test (also compare-and-swap)
- Fetch-and-increment: returns the value of a memory location and atomically increments it
- 0 => synchronization variable is free
60. User-Level Synchronization Operations
- Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

        LI    R2, 1         ; load immediate
lockit: EXCH  R2, 0(R1)     ; atomic exchange
        BNEZ  R2, lockit    ; already locked?

- What about an MP with cache coherency?
- Want to spin on a cached copy to avoid full memory latency
- Likely to get cache hits for such variables
- Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
- Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

try:    LI    R2, 1         ; load immediate
lockit: LW    R3, 0(R1)     ; load var
        BNEZ  R3, lockit    ; not free => spin
        EXCH  R2, 0(R1)     ; atomic exchange
        BNEZ  R2, try       ; already locked?
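The same test-and-test&set idea in C11 atomics (a sketch with our own names, not from the slides):

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        for (;;) {
            /* spin on a (cached) read first: no bus traffic while locked */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* lock looks free: now try the atomic exchange */
            if (atomic_exchange(&l->locked, 1) == 0)
                return;   /* got it */
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store(&l->locked, 0);
    }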
61. Fetch and Update (cont'd)
- Hard to have read & write in one instruction: use two instead
- Load Linked (or load locked) + Store Conditional
- Load linked returns the initial value
- Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
- Example: atomic swap with LL & SC:

try:  OR    R3, R4, R0     ; R4 -> R3
      LL    R2, 0(R1)      ; load linked
      SC    R3, 0(R1)      ; store conditional
      BEQZ  R3, try        ; branch if store fails (R3 = 0)
      MOV   R4, R2         ; put loaded value in R4

- Example: fetch & increment with LL & SC:

try:  LL    R2, 0(R1)      ; load linked
      ADDUI R3, R2, 1      ; increment
      SC    R3, 0(R1)      ; store conditional
      BEQZ  R3, try        ; branch if store fails (R3 = 0)
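On machines without LL/SC, a compare-and-swap loop plays the same role; a C11 sketch (function name is ours):

    #include <stdatomic.h>

    /* Fetch-and-increment built from compare-and-swap; a failing
       compare-exchange corresponds to a failing SC, so we retry. */
    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;   /* on failure, old is reloaded with the current value */
        return old;
    }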
62. Another MP Issue: Memory Consistency
- What is consistency? When must a processor see a new memory value?
- Example:

    P1: A = 0;              P2: B = 0;
        .....                   .....
        A = 1;                  B = 1;
    L1: if (B == 0) ...     L2: if (A == 0) ...

- It seems impossible for both if-statements L1 & L2 to be true?
- What if write invalidate is delayed & the processor continues?
- Memory consistency models: what are the rules for such cases?
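A C11 sketch of the example (names ours): with relaxed atomics, both loads may return 0, which is exactly the "impossible" outcome; making all four operations memory_order_seq_cst rules it out:

    #include <stdatomic.h>
    #include <threads.h>

    atomic_int A, B;
    int r1, r2;

    int t1(void *arg) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&B, memory_order_relaxed);  /* may be 0 */
        return 0;
    }

    int t2(void *arg) {
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&A, memory_order_relaxed);  /* may be 0 */
        return 0;
    }

    int main(void) {
        thrd_t a, b;
        thrd_create(&a, t1, 0);
        thrd_create(&b, t2, 0);
        thrd_join(a, 0);
        thrd_join(b, 0);
        /* r1 == 0 && r2 == 0 is allowed under relaxed ordering */
    }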
63. Tooling, OS, and Mapping
- Which mapping steps are performed in HW?
- Pre-emption support
- Programming model:
- streaming or vector support
- (like KernelC and StreamingC for Imagine, StreamIt for RAW)
- Process communication: Shared memory vs. Message passing
- Process synchronization
64. A few platform examples
65. Massively Parallel Processors Targeting Digital Signal Processing Applications
66. Field Programmable Object Array: MathStar
67. PACT XPP-III Processor array
68. RAW processor from MIT
69. RAW Switch Detail
Raw exposes wire delay at the ISA level. This allows the compiler to explicitly manage the static network: routes are compiled into the static router, and messages arrive in known order.
Latency: 2 hops. Throughput: 1 word/cycle per direction, per network.
70. Philips AETHEREAL
The router provides both guaranteed throughput (GT) and best effort (BE) services to communicate with IPs. The combination of GT and BE leads to efficient use of bandwidth and a simple programming model.
[Diagram: IP blocks attached through network interfaces to the router network]