Platform Design - PowerPoint PPT Presentation

About This Presentation
Title:

Platform Design

Description:

MPSoC TU/e 5kk70 Henk Corporaal Bart Mesman Overview What is a platform, and why platform based design? Why parallel platforms? A first classification of parallel ... – PowerPoint PPT presentation

Slides: 71
Provided by: abc61

Transcript and Presenter's Notes

Title: Platform Design


1
Platform Design
MPSoC
  • TU/e 5kk70
  • Henk Corporaal
  • Bart Mesman

2
Overview
  • What is a platform, and why platform based
    design?
  • Why parallel platforms?
  • A first classification of parallel systems
  • Design choices for parallel systems
  • Shared memory systems
  • Memory Coherency, Consistency, Synchronization,
    Mutual exclusion
  • Message passing systems
  • Further decisions

3
Design Product requirements?
  • Short Time-to-Market
  • Reuse / Standards
  • Short design time
  • Flexible solution
  • Reduces design time
  • Extends product lifetime (remote inspection and
    debugging)
  • Scalability
  • High performance and Low power
  • Memory bottleneck, Wiring bottleneck
  • Low cost
  • High quality, reliability, dependability
  • RTOS and libs
  • Good programming environment

4
Solution ?
  • Platforms
  • Programmable
  • One or more processor cores
  • Reconfigurable
  • Scalable and flexible
  • Memory hierarchy
  • Exploit locality
  • Separate local and global wiring
  • HW and SW IP reuse
  • Standardization (on SW and HW-interfaces)
  • Raising design abstraction level
  • Reliable
  • Cheaper
  • Advanced Design Flow for Platforms

5
What is a platform?
  • Definition
  • A platform is a generic, but domain-specific,
    information processing (sub-)system

Generic means that it is flexible, containing
programmable component(s). Platforms are meant to
quickly realize your next system (in a certain
domain). Single chip?
6
Platform example: TI OMAP
Up to 192 MByte off-chip memory
7
Platform and platform design
Applications
SDT: system design technology
Design technology
Platform
PDT: platform design technology
Enabling technologies
8
Why parallel processing
  • Performance drive
  • Diminishing returns for exploiting ILP and OLP
  • Multiple processors fit easily on a chip
  • Cost effective (just connect existing processors
    or processor cores)
  • Low power: parallelism may allow lowering Vdd
  • However
  • Parallel programming is hard

9
Low power through parallelism
  • Sequential Processor
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • $P \propto f C V^2$
  • Parallel Processor (two times the number of
    units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage $V' < V$
  • $P \propto \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2$
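
The scaling argument on this slide can be checked numerically. The following is a minimal sketch, assuming dynamic power is proportional to frequency times switched capacitance times the square of the supply voltage; the function name and the voltage ratio of 0.7 are illustrative assumptions, not values from the slides.

```python
def relative_power(freq_scale, cap_scale, volt_scale):
    """Dynamic power relative to the sequential baseline, using P ~ f * C * V^2."""
    return freq_scale * cap_scale * volt_scale ** 2

# Sequential processor: baseline frequency, capacitance and voltage.
sequential = relative_power(1.0, 1.0, 1.0)

# Parallel processor: half the frequency, twice the capacitance,
# and a lower supply voltage (0.7 of the original is an assumed value).
parallel = relative_power(0.5, 2.0, 0.7)

print(sequential, round(parallel, 2))
```

Under these assumptions the frequency and capacitance factors cancel, so the saving comes entirely from the squared voltage ratio.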

10
Power efficiency: comparing two examples
  • Intel Pentium-4 (Northwood) in 0.13 micron
    technology
  • 3.0 GHz
  • 20 pipeline stages
  • Aggressive buffering to boost clock frequency
  • 13 nJ / instruction
  • Philips Trimedia Lite in 0.13 micron technology
  • 250 MHz
  • 8 pipeline stages
  • Relaxed buffering, focus on instruction
    parallelism
  • 0.2 nJ / instruction
  • Trimedia does 65x better than the Pentium (13 / 0.2 = 65)

11
Parallel Architecture
  • Parallel Architecture extends traditional
    computer architecture with a communication
    network
  • abstractions (HW/SW interface)
  • organizational structure to realize abstraction
    efficiently

Communication Network
Processing node
Processing node
Processing node
Processing node
Processing node
12
Platform characteristics
  • System level
  • Processor level
  • Communication network
  • Memory system
  • Tooling

13
System level characteristics
  • Homogeneous vs. heterogeneous
  • Granularity of processing elements
  • Type of supported parallelism: TLP, DLP
  • Runtime mapping support?

14
Homogeneous or Heterogeneous
  • Homogeneous
  • replication effect
  • memory dominated anyway
  • solve realization issues once and for all
  • less flexible
  • Typically
  • data level parallelism
  • shared memory
  • dynamic task mapping

15
Example: Philips Wasabi
  • Homogeneous multiprocessor for media applications
  • First 65 nm silicon expected 1st half 2006
  • Two-level communication hierarchy
  • Top: scalable message-passing network plus tiles
  • Tile: shared memory plus processors, accelerators
  • Fully cache coherent to support data parallelism

16
Homogeneous or Heterogeneous
  • Heterogeneous
  • better fit to application domain
  • smaller increments
  • Typically
  • task level parallelism
  • message passing
  • static task mapping

17
Example: Viper2
  • Heterogeneous
  • Platform based
  • >60 different cores
  • Task parallelism
  • Sync with interrupts
  • Streaming communication
  • Semi-static application graph
  • 50 M transistors
  • 120nm technology
  • Powerful, efficient

18
Homogeneous or Heterogeneous
  • Middle of the road approach
  • Flexible tiles
  • Fixed tile structure at top level

19
Types of parallelism
TLP Heterogeneous
Program/Thread level
Kernel level
ILP Heterogeneous
20
Processor level characteristics
  • Processor consists of
  • Instruction engine (Control Processor, Ifetch
    unit)
  • Processing element (PE): Register file, Function
    unit(s), L1 DMem
  • Single PE vs. multiple PEs (as in SIMD)
  • Single FU/PE vs. multiple FUs/PE (as in VLIW)
  • Granularity of PEs, FUs
  • Specialized vs. generic
  • Interruptible, pre-emption support
  • Multithreading support (fast context switches)
  • Clustering of PEs; clustering of FUs
  • Type of inter-PE and inter-FU communication
    network
  • Others: MMU, virtual memory, ...

21
Generic or Specialized? Intrinsic computational
efficiency
22
(pipelined) processor organization
PE: processing engine
Instruction fetch - Control
FU
23
(Linear) SIMD Architecture
Control Processor
IMem
PE1
PEn
  • To be added
  • inter PE communication
  • communication from PEs to Control Processor
  • Input and Output

24
Communication network
  • Bus (N-N) vs. NoC with point-to-point connections
  • Topology, Router degree
  • Routing
  • path, path control, collision resolution,
    network support, deadlock handling, livelock
    handling
  • virtual layer support
  • flow control and buffering
  • error handling
  • Inter-chip network support
  • Guarantees
  • TDMA
  • GT vs. BE traffic
  • etc, etc.

25
Comm. Network Performance metrics
  • Network Bandwidth
  • Need high bandwidth in communication
  • How does it scale with number of nodes?
  • Communication Latency
  • Affects performance, since processor may have to
    wait
  • Affects ease of programming, since it requires
    more thought to overlap communication and
    computation
  • Latency Hiding
  • How can a mechanism help hide latency?
  • Examples
  • overlap message send with computation,
  • prefetch data,
  • switch to other tasks

26
Network Topology
  • Topology determines
  • Degree: number of links from a node
  • Diameter: max number of links crossed between
    nodes
  • Average distance: number of links to random
    destination
  • Bisection: minimum number of links that separate
    the network into two halves
  • Bisection bandwidth: link bandwidth x bisection
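
The diameter definition above can be checked by brute force on small networks. This is a minimal sketch that builds a few 64-node topologies and runs a breadth-first search from every node; the helper names (`diameter`, `ring`, `grid`, `hypercube`) are hypothetical and chosen only for this illustration.

```python
from collections import deque
from itertools import product

def diameter(nodes, neighbours):
    """Longest shortest path, found by a breadth-first search from every node."""
    worst = 0
    for src in nodes:
        dist = {src: 0}
        todo = deque([src])
        while todo:
            u = todo.popleft()
            for v in neighbours(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    todo.append(v)
        worst = max(worst, max(dist.values()))
    return worst

def ring(n):
    return range(n), lambda u: [(u - 1) % n, (u + 1) % n]

def grid(k, wrap):
    """k x k mesh (wrap=False) or torus (wrap=True)."""
    def neighbours(u):
        out = []
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            x, y = u[0] + dx, u[1] + dy
            if wrap:
                out.append((x % k, y % k))
            elif 0 <= x < k and 0 <= y < k:
                out.append((x, y))
        return out
    return list(product(range(k), repeat=2)), neighbours

def hypercube(n):
    # Neighbours differ in exactly one address bit.
    return range(2 ** n), lambda u: [u ^ (1 << b) for b in range(n)]

diameters = {
    "ring": diameter(*ring(64)),
    "2D mesh": diameter(*grid(8, wrap=False)),
    "2D torus": diameter(*grid(8, wrap=True)),
    "6-cube": diameter(*hypercube(6)),
}
print(diameters)
```

For N = 64 the results agree with the usual closed forms: N/2 for the ring, 2(sqrt(N) - 1) for the 2D mesh, sqrt(N) for the 2D torus and log2 N for the hypercube.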

27
Common Topologies
Type        Degree    Diameter          Ave Dist        Bisection
1D mesh     2         N - 1             N / 3           1
2D mesh     4         2(N^(1/2) - 1)    2N^(1/2) / 3    N^(1/2)
3D mesh     6         3(N^(1/3) - 1)    3N^(1/3) / 3    N^(2/3)
nD mesh     2n        n(N^(1/n) - 1)    nN^(1/n) / 3    N^((n-1)/n)
Ring        2         N / 2             N / 4           2
2D torus    4         N^(1/2)           N^(1/2) / 2     2N^(1/2)
Hypercube   log2 N    n = log2 N        n / 2           N / 2
2D Tree     3         2 log2 N          2 log2 N        1
Crossbar    N - 1     1                 1               N^2 / 2
N: number of nodes, n: dimension
28
Topology examples
Hypercube
Grid/Mesh
Torus
Assume 64 nodes
Criteria                           Bus   Ring   Mesh   2D torus   6-cube   Fully connected
Performance: bisection bandwidth   1     2      8      16         32       1024
Cost: ports/switch                 -     3      5      5          7        64
Cost: total links                  1     128    176    192        256      2080
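
The 64-node cost figures can be reproduced with a few closed-form counts. This is a sketch under one stated assumption: each switch has one extra port and one extra link for its attached processing node, which is a reading that happens to match the numbers on the slide rather than something the slide states. The function name `cost_table` is hypothetical.

```python
import math

def cost_table(n=64):
    """Bisection, ports per switch and total links for n nodes (n a power of 4)."""
    k = math.isqrt(n)            # side length of the 2D mesh / torus
    d = n.bit_length() - 1       # hypercube dimension, log2(n)
    # name: (switch-to-switch ports, switch-to-switch links, bisection links)
    networks = {
        "ring": (2, n, 2),
        "2D mesh": (4, 2 * k * (k - 1), k),
        "2D torus": (4, 2 * n, 2 * k),
        "hypercube": (d, n * d // 2, n // 2),
        "fully connected": (n - 1, n * (n - 1) // 2, (n // 2) ** 2),
    }
    return {
        name: {
            "bisection": bisection,
            "ports": ports + 1,      # assumed: one extra port towards the local node
            "links": links + n,      # assumed: one extra link per node-to-switch hop
        }
        for name, (ports, links, bisection) in networks.items()
    }

for name, row in cost_table().items():
    print(name, row)
```

The fully connected network wins on bisection bandwidth by a wide margin, but its link count grows quadratically, which is the trade-off the table is meant to show.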
29
Butterfly or Omega Network
  • All paths equal length
  • Unique path from any input to any output
  • Try to avoid conflicts !!

8 x 8 butterfly switch
30
Multistage Fat Tree
  • A multistage fat tree (CM-5) avoids congestion at
    the root node
  • Randomly assign packets to different paths on way
    up to spread the load
  • Increase degree near root, decrease congestion

31
What did architects design in the 1990s? Old
(off-chip) MP networks
Name        Number     Topology   Bits   Clock     Link (MBytes/s)   Bis. BW (MBytes/s)   Year
nCube/ten   1-1024     10-cube    1      10 MHz    1.2               640                  1987
iPSC/2      16-128     7-cube     1      16 MHz    2                 345                  1988
MP-1216     32-512     2D grid    1      25 MHz    3                 1,300                1989
Delta       540        2D grid    16     40 MHz    40                640                  1991
CM-5        32-2048    fat tree   4      40 MHz    20                10,240               1991
CS-2        32-1024    fat tree   8      70 MHz    50                50,000               1992
Paragon     4-1024     2D grid    16     100 MHz   200               6,400                1992
T3D         16-1024    3D Torus   16     150 MHz   300               19,200               1993

No standard topology! However, for on-chip networks,
mesh and torus are favored.
32
Memory hierarchy
  • Number of memory levels: 1, 2, 3, 4
  • HW- vs. SW-controlled level 1
  • Cache or scratchpad memory L1
  • Central vs. distributed memory
  • Shared vs. distributed memory address space
  • Intelligent DMA support: communication assist
  • For shared memory
  • coherency
  • consistency
  • synchronization

33
Intermezzo: What's the problem with memory?
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU (µProc) performance grows 55%/year (Moore's Law); DRAM performance grows 7%/year. Source label: Patterson.]
Memories can also be big power consumers!
34
Multiple levels of memory
[Figure: architecture concept. Levels 0 through N each have a communication network; the networks connect CPUs, accelerators, reconfigurable HW blocks, memories, and I/O.]
35
Communication models: Shared Memory
[Figure: processes P1 and P2 each read and write a common shared memory.]
  • Coherence problem
  • Memory consistency issue
  • Synchronization problem

36
Communication models: Shared memory
  • Shared address space
  • Communication primitives
  • load, store, atomic swap
  • Two varieties
  • Physically shared => Symmetric Multi-Processors
    (SMP)
  • usually combined with local caching
  • Physically distributed => Distributed Shared
    Memory (DSM)

37
SMP: Symmetric Multi-Processor
  • Memory centralized with uniform access time
    (UMA) and bus interconnect, I/O
  • Examples: Sun Enterprise 6000, SGI Challenge,
    Intel

Main memory
I/O System
38
DSM: Distributed Shared Memory
  • Nonuniform access time (NUMA) and scalable
    interconnect (distributed memory)

Interconnection Network
Main memory
I/O System
39
Shared Address Model Summary
  • Each processor can name every physical location
    in the machine
  • Each process can name all data it shares with
    other processes
  • Data transfer via load and store
  • Data size: byte, word, ... or cache blocks
  • Memory hierarchy model applies
  • communication moves data to local proc. cache

40
Communication models: Message Passing
  • Communication primitives
  • e.g., send, receive library calls
  • Note that MP can be built on top of SM and vice
    versa

41
Message Passing Model
  • Explicit message send and receive operations
  • Send specifies local buffer + receiving process
    on remote computer
  • Receive specifies sending process on remote
    computer + local buffer to place data
  • Typically blocking communication, but may use DMA

Message structure
Header
Data
Trailer
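
Explicit send and receive can be illustrated with two threads and a queue. This is only a sketch: the queue stands in for the interconnection network, the `Message` layout mirrors the header/data/trailer structure named on the slide, and the checksum used as trailer content is an assumption made for the example.

```python
import queue
import threading
from collections import namedtuple

# Hypothetical message layout mirroring the slide: header, data, trailer.
Message = namedtuple("Message", "header data trailer")

def checksum(data):
    return sum(data) % 256          # assumed trailer content, for illustration only

def send(channel, src, dst, data):
    """Copy the local buffer into a message and hand it to the network."""
    channel.put(Message(header=(src, dst), data=list(data), trailer=checksum(data)))

def receive(channel):
    """Block until a message arrives, verify the trailer, return sender and data."""
    msg = channel.get()
    if msg.trailer != checksum(msg.data):
        raise ValueError("corrupted message")
    return msg.header[0], msg.data

def demo():
    channel = queue.Queue()         # stands in for the interconnection network
    result = []

    def consumer():
        src, data = receive(channel)
        result.append((src, sum(data)))

    worker = threading.Thread(target=consumer)
    worker.start()
    send(channel, src="P1", dst="P2", data=[1, 2, 3])
    worker.join()
    return result

print(demo())
```

The receiver blocks in `receive` until data is available, so the message itself provides the synchronization between the two processes.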
42
Message passing communication
[Figure: four nodes, each with a processor, a cache and a local memory, connected by an interconnection network.]
43
Communication Models: Comparison
  • Shared-Memory
  • Compatibility with well-understood (language)
    mechanisms
  • Ease of programming for complex or dynamic
    communications patterns
  • Shared-memory applications: sharing of large data
    structures
  • Efficient for small items
  • Supports hardware caching
  • Message Passing
  • Simpler hardware
  • Explicit communication
  • Improved synchronization

44
Challenges of parallel processing
  • Q1: can we get linear speedup?
  • Suppose we want speedup 80 with 100 processors.
    What fraction of the original computation can be
    sequential (i.e. non-parallel)?
  • Q2: how important is communication latency?
  • Suppose 0.2 of all accesses are remote, and
    require 100 cycles on a processor with base CPI
    0.5. What's the communication impact?
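
Both questions can be worked out with a few lines of arithmetic. The sketch below uses Amdahl's law for Q1 and a simple additive stall model for Q2. It reads the remote fraction as 0.2% (0.002), on the assumption that a percent sign was lost from the slide text; taken literally as 0.2, the same formula gives a CPI of 20.5. The function names are hypothetical.

```python
def max_sequential_fraction(speedup, processors):
    """Solve speedup = 1 / (s + (1 - s) / p) for the sequential fraction s."""
    return (1 / speedup - 1 / processors) / (1 - 1 / processors)

def effective_cpi(base_cpi, remote_fraction, remote_cycles):
    """Base CPI plus the average stall cycles contributed by remote accesses."""
    return base_cpi + remote_fraction * remote_cycles

# Q1: speedup 80 on 100 processors.
s = max_sequential_fraction(80, 100)
print(f"sequential fraction: {s:.4%}")

# Q2: assumed 0.2% remote accesses, 100 cycles each, base CPI 0.5.
cpi = effective_cpi(0.5, 0.002, 100)
print(f"effective CPI: {cpi:.2f}, slowdown: {cpi / 0.5:.2f}x")
```

Under these assumptions only about a quarter of a percent of the work may be sequential, and a very small share of remote accesses already makes the processor noticeably slower.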

45
Three fundamental issues for shared memory
multiprocessors
  • Coherence: do I see the most recent data?
  • Consistency: when do I see a written
    value?
  • e.g. do different processors see writes at the
    same time (w.r.t. other memory accesses)?
  • Synchronization: how to synchronize processes?
  • how to protect access to shared data?

46
Coherence problem, in single CPU system
[Figure: one CPU whose cache holds copies a' = 100 and b' = 200; memory holds a = 100 and b = 200; an I/O device is also attached to memory.]
47
Coherence problem, in Multi-Proc system
[Figure: CPU-1's cache holds a' = 550 and b' = 200; CPU-2's cache holds a'' = 100 and b'' = 200; memory holds a = 100 and b = 200.]
48
What Does Coherency Mean?
  • Informally
  • Any read must return the most recent write
  • Too strict and too difficult to implement
  • Better
  • Any write must eventually be seen by a read
  • All writes are seen in proper order
    (serialization)

49
Two rules to ensure coherency
  • If P writes x and P1 reads it, P's write will be
    seen by P1 if the read and write are sufficiently
    far apart
  • Writes to a single location are serialized: seen
    in one order
  • Latest write will be seen
  • Otherwise could see writes in illogical order
    (could see older value after a newer value)

50
Potential HW Coherency Solutions
  • Snooping Solution (Snoopy Bus)
  • Send all requests for data to all processors (or
    local caches)
  • Processors snoop to see if they have a copy and
    respond accordingly
  • Requires broadcast, since caching information is
    at processors
  • Works well with bus (natural broadcast medium)
  • Dominates for small scale machines (most of the
    market)
  • Directory-Based Schemes
  • Keep track of what is being shared in one
    centralized place
  • Distributed memory => distributed directory for
    scalability (avoids bottlenecks)
  • Send point-to-point requests to processors via
    network
  • Scales better than Snooping
  • Actually existed BEFORE Snooping-based schemes

51
Example: Snooping protocol
  • 3 states for each cache line
  • invalid, shared, modified (exclusive)
  • FSM per cache, receives requests from both
    processor and bus

Main memory
I/O System
52
Cache coherence protocol
  • Write invalidate protocol for write-back cache
  • Showing state transitions for each block in
    the cache
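
A write-invalidate protocol for write-back caches can be sketched as a small simulation. The code below is an illustrative three-state model (modified, shared, invalid) with one word per cache line; the class and method names are hypothetical, and it is not the exact state diagram from the slide.

```python
class Bus:
    """Shared bus with main memory; every cache snoops on it."""
    def __init__(self, memory):
        self.memory = dict(memory)
        self.caches = []

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}               # addr -> [state, value]; absent means invalid
        bus.caches.append(self)

    def state(self, addr):
        return self.lines[addr][0] if addr in self.lines else "I"

    def snoop(self, addr, is_write):
        """React to another cache's bus request for addr."""
        line = self.lines.get(addr)
        if line is None:
            return
        if line[0] == "M":
            self.bus.memory[addr] = line[1]   # write back the dirty copy
        if is_write:
            del self.lines[addr]              # invalidate our copy
        else:
            line[0] = "S"                     # downgrade to shared

    def _broadcast(self, addr, is_write):
        for other in self.bus.caches:
            if other is not self:
                other.snoop(addr, is_write)

    def read(self, addr):
        if addr not in self.lines:            # read miss goes on the bus
            self._broadcast(addr, is_write=False)
            self.lines[addr] = ["S", self.bus.memory[addr]]
        return self.lines[addr][1]

    def write(self, addr, value):
        if self.state(addr) != "M":           # need exclusive ownership first
            self._broadcast(addr, is_write=True)
        self.lines[addr] = ["M", value]       # memory is updated only on write-back

bus = Bus({"a": 100, "b": 200})
cpu1, cpu2 = Cache(bus), Cache(bus)
cpu1.read("a"); cpu2.read("a")                # both caches hold a shared copy
cpu1.write("a", 550)                          # invalidates the copy in cpu2
states_after_write = (cpu1.state("a"), cpu2.state("a"))
value_seen_by_cpu2 = cpu2.read("a")           # miss: cpu1 writes back, cpu2 reloads
print(states_after_write, value_seen_by_cpu2)
```

In this model the second processor can no longer read the stale value 100 after the first has written 550, because its copy is invalidated by the write.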

53
Synchronization problem
  • Computer system of bank has credit process (P_c)
    and debit process (P_d)

/* Process P_c */                /* Process P_d */
shared int balance;              shared int balance;
private int amount;              private int amount;
balance += amount;               balance -= amount;

lw  t0,balance                   lw  t2,balance
lw  t1,amount                    lw  t3,amount
add t0,t0,t1                     sub t2,t2,t3
sw  t0,balance                   sw  t2,balance
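
Why the two instruction sequences need synchronization can be shown by enumerating every possible interleaving. The sketch below models each process as the four-step load, load, add/sub, store sequence; the starting balance and the two amounts are assumed values chosen for the example.

```python
from itertools import combinations

def run(schedule, balance, credit, debit):
    """Execute one interleaving; schedule lists which process runs each step."""
    mem = balance
    reg = [0, 0]                 # t0 for P_c, t2 for P_d
    delta = [credit, -debit]
    pc = [0, 0]
    for p in schedule:
        step = pc[p]
        pc[p] += 1
        if step == 0:
            reg[p] = mem         # lw  balance
        elif step == 2:
            reg[p] += delta[p]   # add / sub (step 1 loads the private amount)
        elif step == 3:
            mem = reg[p]         # sw  balance
    return mem

def final_balances(balance=100, credit=50, debit=30):
    results = set()
    for slots in combinations(range(8), 4):   # the steps at which P_c runs
        schedule = [0 if k in slots else 1 for k in range(8)]
        results.add(run(schedule, balance, credit, debit))
    return results

print(sorted(final_balances()))
```

Only one of the three possible outcomes is correct; in the other two, one process overwrites the update of the other because both loaded the old balance first.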
54
Critical Section Problem
  • n processes all competing to use some shared data
  • Each process has code segment, called critical
    section, in which shared data is accessed.
  • Problem: ensure that when one process is
    executing in its critical section, no other
    process is allowed to execute in its critical
    section
  • Structure of process

while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}
55
Attempt 1: Strict Alternation
Process P0
    shared int turn;
    while (TRUE) {
        while (turn != 0);
        critical_section();
        turn = 1;
        remainder_section();
    }
Process P1
    shared int turn;
    while (TRUE) {
        while (turn != 1);
        critical_section();
        turn = 0;
        remainder_section();
    }
  • Two problems
  • Satisfies mutual exclusion, but not
    progress (works only when both processes strictly
    alternate)
  • Busy waiting

56
Attempt 2: Warning Flags
Process P0
    shared int flag[2];
    while (TRUE) {
        flag[0] = TRUE;
        while (flag[1]);
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }
Process P1
    shared int flag[2];
    while (TRUE) {
        flag[1] = TRUE;
        while (flag[0]);
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }
  • Satisfies mutual exclusion
  • P0 in critical section: flag[0] && !flag[1]
  • P1 in critical section: !flag[0] && flag[1]
  • However, contains a deadlock (both flags may be
    set to TRUE !!)

57
Software solution: Peterson's Algorithm
(combining warning flags and alternation)
Process P0
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[0] = TRUE;
        turn = 0;
        while (turn == 0 && flag[1]);
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }
Process P1
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[1] = TRUE;
        turn = 1;
        while (turn == 1 && flag[0]);
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }
Software solution is slow !
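
The claims made about the three software attempts can be checked by exhaustively exploring every interleaving of two processes. The sketch below models each statement as one atomic step and assumes sequentially consistent memory; the function names and the program-counter encoding are hypothetical. It reports whether both processes can ever be in the critical section together, and whether a state exists in which both spin forever.

```python
def alternation(pc, i, flags, turn):
    if pc == 0:                                   # while (turn != i);
        return (1 if turn == i else 0), flags[i], turn
    return 0, flags[i], 1 - i                     # leave critical section: turn = other

def warning_flags(pc, i, flags, turn):
    if pc == 0:
        return 1, True, turn                      # flag[i] = TRUE
    if pc == 1:                                   # while (flag[other]);
        return (1 if flags[1 - i] else 2), True, turn
    return 0, False, turn                         # leave critical section: flag[i] = FALSE

def peterson(pc, i, flags, turn):
    if pc == 0:
        return 1, True, turn                      # flag[i] = TRUE
    if pc == 1:
        return 2, True, i                         # turn = i
    if pc == 2:                                   # while (turn == i && flag[other]);
        return (2 if turn == i and flags[1 - i] else 3), True, turn
    return 0, False, turn                         # leave critical section: flag[i] = FALSE

def check(algorithm, cs_pc):
    """Return (mutual exclusion holds, a stuck state is reachable)."""
    start = ((0, 0), (False, False), 0)           # (program counters, flags, turn)
    seen, todo = {start}, [start]
    mutex_ok, can_get_stuck = True, False
    while todo:
        state = todo.pop()
        pcs, flags, turn = state
        if pcs == (cs_pc, cs_pc):
            mutex_ok = False                      # both in the critical section
        successors = set()
        for i in (0, 1):
            pc, flag, new_turn = algorithm(pcs[i], i, flags, turn)
            new_pcs = (pc, pcs[1]) if i == 0 else (pcs[0], pc)
            new_flags = (flag, flags[1]) if i == 0 else (flags[0], flag)
            successors.add((new_pcs, new_flags, new_turn))
        if successors == {state}:
            can_get_stuck = True                  # neither process can make a move
        for nxt in successors - seen:
            seen.add(nxt)
            todo.append(nxt)
    return mutex_ok, can_get_stuck

for name, algorithm, cs_pc in (("strict alternation", alternation, 1),
                               ("warning flags", warning_flags, 2),
                               ("Peterson", peterson, 3)):
    print(name, check(algorithm, cs_pc))
```

The check does not cover the progress problem of strict alternation, which only shows up when one process stays in its remainder section. On real hardware with a relaxed memory model, Peterson's algorithm additionally needs memory barriers, which this model does not capture.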
58
Issues for Synchronization
  • Hardware support
  • Uninterruptible instruction to fetch-and-update
    memory (atomic operation)
  • User level synchronization operation(s) using
    this primitive
  • For large scale MPs, synchronization can be a
    bottleneck: techniques to reduce contention and
    latency of synchronization

59
Uninterruptible Instructions to Fetch and Update
Memory
  • Atomic exchange: interchange a value in a
    register for a value in memory
  • 0 => synchronization variable is free
  • 1 => synchronization variable is locked and
    unavailable
  • Test-and-set: tests a value and sets it if the
    value passes the test (also Compare-and-swap)
  • Fetch-and-increment: it returns the value of a
    memory location and atomically increments it
  • 0 => synchronization variable is free

60
User Level Synchronization Operation
  • Spin locks: processor continuously tries to
    acquire, spinning around a loop trying to get the
    lock
            LI   R2,1         ; load immediate
    lockit: EXCH R2,0(R1)     ; atomic exchange
            BNEZ R2,lockit    ; already locked?
  • What about MP with cache coherency?
  • Want to spin on cache copy to avoid full memory
    latency
  • Likely to get cache hits for such variables
  • Problem: exchange includes a write, which
    invalidates all other copies; this generates
    considerable bus traffic
  • Solution: start by simply repeatedly reading the
    variable; when it changes, then try exchange
    (test and test&set)
    try:    LI   R2,1         ; load immediate
    lockit: LW   R3,0(R1)     ; load var
            BNEZ R3,lockit    ; not free => spin
            EXCH R2,0(R1)     ; atomic exchange
            BNEZ R2,try       ; already locked?

61
Fetch and Update (cont'd)
  • Hard to have read & write in 1 instruction: use 2
    instead
  • Load Linked (or load locked) + Store Conditional
  • Load linked returns the initial value
  • Store conditional returns 1 if it succeeds (no
    other store to same memory location since
    preceding load) and 0 otherwise
  • Example: doing atomic swap with LL & SC
    try: OR   R3,R4,R0     ; R3 = R4
         LL   R2,0(R1)     ; load linked
         SC   R3,0(R1)     ; store conditional
         BEQZ R3,try       ; branch if store fails (R3 = 0)
         MOV  R4,R2        ; put load value in R4
  • Example: doing fetch & increment with LL & SC
    try: LL    R2,0(R1)    ; load linked
         ADDUI R3,R2,1     ; increment
         SC    R3,0(R1)    ; store conditional
         BEQZ  R3,try      ; branch if store fails (R3 = 0)
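
The retry behaviour of load linked and store conditional can be illustrated with a small software model. This is a sketch only: the `LLSCMemory` class, its per-CPU link table and the `between` hook that lets another processor interfere on the first attempt are all hypothetical constructs for the example, not a description of any real implementation.

```python
class LLSCMemory:
    """Toy memory that tracks, per CPU, the address of its last load linked."""
    def __init__(self):
        self.cells = {}
        self.links = {}                       # cpu id -> linked address

    def load_linked(self, cpu, addr):
        self.links[cpu] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value
        # Any store to addr breaks every link held on that address.
        self.links = {c: a for c, a in self.links.items() if a != addr}

    def store_conditional(self, cpu, addr, value):
        if self.links.get(cpu) != addr:
            return 0                          # link was broken: store fails
        self.store(addr, value)
        return 1

def fetch_and_increment(mem, cpu, addr, between=None):
    """Retry LL / SC until the store succeeds; return (old value, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_linked(cpu, addr)
        if between is not None and attempts == 1:
            between()                         # another CPU runs between LL and SC
        if mem.store_conditional(cpu, addr, old + 1):
            return old, attempts

mem = LLSCMemory()
mem.store(0x10, 5)
undisturbed = fetch_and_increment(mem, cpu=0, addr=0x10)
disturbed = fetch_and_increment(mem, cpu=0, addr=0x10,
                                between=lambda: mem.store(0x10, 40))
print(undisturbed, disturbed, mem.cells[0x10])
```

In the disturbed case the first store conditional fails because another store hit the same location, so the loop reloads the new value and increments that instead.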

62
Another MP Issue: Memory Consistency
  • What is consistency? When must a processor see a
    new memory value?
  • Example
    P1: A = 0;              P2: B = 0;
        .....                   .....
        A = 1;                  B = 1;
    L1: if (B == 0) ...     L2: if (A == 0) ...
  • Seems impossible for both if-statements L1 & L2
    to be true?
  • What if write invalidate is delayed & processor
    continues?
  • Memory consistency models: what are the rules
    for such cases?
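
The example can be explored by enumerating the orders in which the two writes and two reads may take effect. The sketch below is a deliberately simplified model: with `allow_delayed_writes=False` each process's write must take effect before its own read (sequential consistency); with `True` a write may become visible after the following read, which stands in for a delayed invalidate. The event names are hypothetical.

```python
from itertools import permutations

def outcomes(allow_delayed_writes):
    """Set of (value P1 reads for B, value P2 reads for A) over all allowed orders."""
    events = ["W_A", "R_B", "W_B", "R_A"]     # P1: W_A then R_B; P2: W_B then R_A
    results = set()
    for order in permutations(events):
        if not allow_delayed_writes:
            # Program order: each write is visible before the same process reads.
            if order.index("W_A") > order.index("R_B"):
                continue
            if order.index("W_B") > order.index("R_A"):
                continue
        mem = {"A": 0, "B": 0}
        seen = {}
        for event in order:
            kind, var = event.split("_")
            if kind == "W":
                mem[var] = 1
            else:
                seen[var] = mem[var]
        results.add((seen["B"], seen["A"]))
    return results

print(sorted(outcomes(allow_delayed_writes=False)))
print(sorted(outcomes(allow_delayed_writes=True)))
```

Under sequential consistency the pair (0, 0), where both if-statements are true, never appears; once writes may be delayed past later reads it becomes possible, which is why the memory consistency model has to spell out the rules.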

63
Tooling, OS, and Mapping
  • Which mapping steps are performed in HW?
  • Pre-emption support
  • Programming model
  • streaming or vector support
  • (like KernelC and StreamingC for Imagine,
    StreamIT for RAW)
  • Process communication: shared memory vs. message
    passing
  • Process Synchronization

64
A few platform examples
65
Massively Parallel Processors Targeting Digital
Signal Processing Applications
66
Field Programmable Object Array: MathStar
67
PACT XPP-III Processor array
68
RAW processor from MIT
69
RAW Switch Detail
Raw exposes wire delay at the ISA level. This
allows the compiler to explicitly manage the static
network, where routes are compiled into the static
router and messages arrive in a known order.
Latency: 2 hops. Throughput: 1 word/cycle per
direction, per network.
70
Philips AETHEREAL
Router provides both guaranteed throughput (GT)
and best effort (BE) services to communicate with
IPs. Combination of GT and BE leads to
efficient use of bandwidth and simple programming
model.
[Figure: a router network connecting several IP blocks, each attached through a network interface.]