Multi Threaded Architectures - Sima, Fountain and Kacsuk, Chapter 16

1
Multi Threaded Architectures
Sima, Fountain and Kacsuk, Chapter 16
  • CSE462

2
Memory and Synchronization Latency
  • Scalability of a system is limited by its ability to
    handle memory latency and algorithmic
    synchronization delays
  • Overall solution is well known
  • Do something else whilst waiting
  • Remote memory accesses
  • Much slower than local
  • Varying delay depending on
  • Network traffic
  • Memory traffic

3
Processor Utilization
  • Utilization
  • U = P / T
  • P = time spent processing
  • T = total time
  • U = P / (P + I + S)
  • I = time spent waiting on other tasks
  • S = time spent switching tasks
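A minimal sketch of the two utilization formulas above (Python, with made-up timing numbers chosen only for illustration):

```python
def utilization(p, i=0.0, s=0.0):
    """Processor utilization: time spent processing divided by total time.

    p -- time spent processing
    i -- time spent waiting on other tasks
    s -- time spent switching tasks
    """
    return p / (p + i + s)

# Hypothetical numbers: 60 cycles of useful work, 30 waiting, 10 switching.
print(utilization(60))          # 1.0 -- no waiting, no switch overhead
print(utilization(60, 30, 10))  # 0.6 -- latency and switch cost cut utilization
```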

4
Basic ideas - Multithreading
  • Fine grain: task switch every cycle
  • Coarse grain: task switch every n cycles (see the
    sketch after the diagram below)

[Diagram: fine-grain vs coarse-grain multithreading; with coarse grain each thread runs until it blocks, and a task switch overhead is paid at every switch]
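A toy Python sketch contrasting the two policies (thread names, run lengths and overheads are invented for illustration): fine grain issues from a different thread every cycle, coarse grain runs one thread until it blocks and then pays a task switch overhead.

```python
from collections import deque

def fine_grain(threads, cycles):
    """Issue one instruction from a different ready thread every cycle."""
    queue = deque(threads)
    trace = []
    for _ in range(cycles):
        trace.append(queue[0])
        queue.rotate(-1)          # a different thread issues next cycle
    return trace

def coarse_grain(threads, run_length, switch_overhead, cycles):
    """Run one thread until it blocks (after run_length cycles here), then
    pay switch_overhead idle cycles of task-switch cost before the next."""
    queue = deque(threads)
    trace = []
    while len(trace) < cycles:
        trace.extend([queue[0]] * run_length)        # execute until blocked
        trace.extend(["switch"] * switch_overhead)   # task switch overhead
        queue.rotate(-1)
    return trace[:cycles]

print(fine_grain(["T1", "T2", "T3"], 9))
print(coarse_grain(["T1", "T2", "T3"], run_length=4, switch_overhead=2, cycles=18))
```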
5
Design Space
6
Classification of multi-threaded architectures
7
Computational Models
8
Sequential control flow (von Neumann)
  • Flow of control and data separated
  • Executed sequentially (or at least with sequential
    semantics; see Chapter 7)
  • Control flow changed with JUMP/GOTO/CALL
    instructions
  • Data stored in rewritable memory
  • Flow of data does not affect execution order
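As a concrete illustration of the running example used on the following slides (R = (A - B) × (B + 1)), a sequential control-flow version executes its statements strictly in program order; the variable names m1, m2 follow the diagram on the next slide, the input values are arbitrary:

```python
# Sequential control flow: each statement runs only after the previous one,
# regardless of when its input data actually became available.
A, B = 10, 4
m1 = A - B        # L1
m2 = B + 1        # L2
R = m1 * m2       # L3
print(R)          # 30
```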

9
Sequential Control Flow Model
[Diagram: sequential control-flow model for R = (A - B) × (B + 1); L1 computes m1 = A - B, L2 computes m2 = B + 1, L3 computes R = m1 × m2, linked by control-flow arcs]
10
Dataflow
  • Control is tied to data
  • An instruction fires when its data is available
  • Otherwise it is suspended
  • Order of instructions in the program has no effect on
    execution order
  • Cf. von Neumann
  • No shared rewritable memory
  • Write-once semantics
  • Code is stored as a dataflow graph
  • Data is transported as tokens
  • Parallelism occurs if multiple instructions can
    fire at the same time (see the firing-rule sketch below)
  • Needs a parallel processor
  • Nodes are self-scheduling
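A minimal Python sketch of the firing rule just described (the node and token classes are invented for illustration, not any particular machine's mechanism): a node executes only when tokens are present on all of its inputs, so the order in which tokens arrive does not matter.

```python
class Node:
    """A dataflow node fires as soon as all of its operand tokens have arrived."""
    def __init__(self, name, op, arity, consumers):
        self.name, self.op, self.arity = name, op, arity
        self.consumers = consumers      # list of (node, operand slot) pairs
        self.tokens = {}

    def receive(self, slot, value, ready):
        self.tokens[slot] = value
        if len(self.tokens) == self.arity:   # firing rule: all operands present
            ready.append(self)               # self-scheduling: no program counter

    def fire(self, ready):
        result = self.op(*(self.tokens[i] for i in range(self.arity)))
        print(f"{self.name} fires -> {result}")
        for node, slot in self.consumers:
            node.receive(slot, result, ready)

# Graph for R = (A - B) * (B + 1); tokens may arrive in any order.
out = Node("R",   lambda x: x,        1, [])
mul = Node("mul", lambda x, y: x * y, 2, [(out, 0)])
sub = Node("sub", lambda x, y: x - y, 2, [(mul, 0)])
add = Node("add", lambda x, y: x + y, 2, [(mul, 1)])

ready = []
add.receive(0, 4, ready); add.receive(1, 1, ready)    # B, 1
sub.receive(0, 10, ready); sub.receive(1, 4, ready)   # A, B
while ready:
    ready.pop(0).fire(ready)
```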

11
Dataflow arbitrary execution order
[Diagram: dataflow graph for R = (A - B) × (B + 1); one possible execution order]
12
Dataflow arbitrary execution order
[Diagram: the same dataflow graph executed in a different order, giving the same result]
13
Dataflow Parallel Execution
[Diagram: parallel execution; the subtract and add nodes fire simultaneously, then the multiply node fires]
14
Implementation
  • The dataflow model requires a very different execution
    engine
  • Data must be stored in a special matching store
  • Instructions must be triggered when both operands
    are available
  • Parallel operations must be scheduled onto
    processors dynamically
  • We don't know a priori when they will be available
  • Instruction operands are pointers consisting of
  • The target instruction
  • The operand number (see the matching-store sketch below)
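A rough sketch of a matching store, assuming tokens carry a destination of the form (instruction address, operand number); the instruction address is used as the match key and the instruction is enabled once its partner token arrives. The names and the dictionary representation are purely illustrative.

```python
# Hypothetical matching store: tokens are matched on their destination
# instruction; when both operands are present the instruction is enabled.
matching_store = {}   # instruction address -> {operand number: value}
enabled_queue = []    # instructions ready to be dispatched to a function unit

def deliver_token(instr_addr, operand_no, value, arity=2):
    slots = matching_store.setdefault(instr_addr, {})
    slots[operand_no] = value
    if len(slots) == arity:                      # match found: enable instruction
        enabled_queue.append((instr_addr, matching_store.pop(instr_addr)))

deliver_token("L1", 0, 10)   # first operand waits in the matching store
deliver_token("L1", 1, 4)    # partner arrives: L1 is enabled
print(enabled_queue)         # [('L1', {0: 10, 1: 4})]
```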

15
Dataflow model of execution
[Diagram: dataflow model of execution for R = (A - B) × (B + 1); each result token is sent to a destination of the form instruction/operand-number (e.g. L2/2, L3/1, L4/1, L4/2, L6/1)]
16
Parallel Control flow
  • Sometimes called macro dataflow
  • Data flows between blocks of sequential code
  • Has the advantages of both dataflow and von Neumann
  • Context switch overhead is reduced
  • The compiler can schedule instructions statically
  • Don't need a fast matching store
  • Requires additional control instructions
  • Fork/Join (see the sketch below)
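A small Python sketch of the FORK/JOIN idea on the same example, using OS threads as a stand-in for the hardware threads of a macro-dataflow machine (the block names are invented here):

```python
import threading

# R = (A - B) * (B + 1): the two sub-expressions are independent blocks of
# sequential code, so they can be FORKed and JOINed around the final multiply.
A, B = 10, 4
results = {}

def block_sub():            # sequential block 1
    results["m1"] = A - B

def block_add():            # sequential block 2
    results["m2"] = B + 1

t = threading.Thread(target=block_add)
t.start()                   # FORK: run block_add concurrently
block_sub()                 # meanwhile execute block_sub here
t.join()                    # JOIN: wait for the forked block to finish

R = results["m1"] * results["m2"]
print(R)                    # 30
```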

17
Macro Dataflow (Hybrid Control/Dataflow)
[Diagram: macro dataflow (hybrid control/dataflow) graph for R = (A - B) × (B + 1); L1 FORKs a second thread at L4 (m2 = B + 1) while the first thread computes m1 = A - B at L2 and GOTOs L5; a JOIN 2 at L5 synchronizes the two threads before L6 forms R = m1 × m2]
18
Issues for Hybrid dataflow
  • Blocks of sequential instructions need to be
    large enough to absorb the overheads of context
    switching
  • Data memory is the same as in an MIMD machine
  • Can be partitioned or shared
  • Synchronization instructions are required
  • Semaphores, test-and-set (see the sketch below)
  • Control tokens are required to synchronize threads.
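As one concrete example of the synchronization primitives mentioned above, a spin lock built on test-and-set can be sketched in Python. On real hardware test-and-set is a single atomic instruction; here its atomicity is only emulated with a Python lock so that the sketch is runnable.

```python
import threading

class TestAndSetLock:
    """Busy-wait spin lock built on a test-and-set primitive."""
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()

    def _test_and_set(self):
        with self._atomic:              # stands in for hardware atomicity
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self._test_and_set():     # spin until the flag was previously clear
            pass

    def release(self):
        self._flag = False

lock = TestAndSetLock()
counter = 0

def worker():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                    # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                          # 40000
```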

19
Some examples
20
Denelcor HEP
  • Designed to tolerate latency in memory
  • Fine grain interleaving of threads
  • Processor pipeline contains 8 stages
  • Each time step a new thread enters the pipeline
  • Threads are taken from the Process Status Word
    (PSW) queue
  • After a thread is taken from the PSW queue, its
    instruction and operands are fetched
  • When the instruction has executed, the thread is
    placed back on the PSW queue
  • Threads are interleaved at the instruction level
    (see the sketch below).
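A rough Python model of this issue policy (not the actual HEP hardware, and assuming a simple round-robin PSW queue): every cycle the thread at the head of the queue issues one instruction into the pipeline and is then re-queued, so consecutive pipeline stages hold instructions from different threads.

```python
from collections import deque

def hep_issue(psw_queue, cycles):
    """Each cycle the next ready thread from the PSW queue issues one
    instruction into the pipeline and is then placed back on the queue."""
    queue = deque(psw_queue)
    trace = []
    for cycle in range(cycles):
        thread = queue.popleft()
        trace.append((cycle, thread))   # one instruction from this thread
        queue.append(thread)            # re-queued after issuing
    return trace

print(hep_issue(["T0", "T1", "T2", "T3"], 8))
```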

21
Denelcor HEP
  • Memory latency tolerance is handled by the Scheduler
    Function Unit (SFU)
  • Memory words are tagged as full or empty
  • Attempting to read an empty word suspends the current
    thread
  • The current PSW entry is then moved to the SFU
  • When the data is written, the entry is taken from the
    SFU and placed back on the PSW queue.

22
Synchronization on the HEP
  • All registers have Full/Empty/Reserved bit
  • Reading an empty register causes the thread to be
    placed back on the PSW queue without updating its
    program counter
  • Thread synchronization is busy-wait
  • But other threads can run
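A minimal sketch of the full/empty-bit mechanism described on the previous two slides, assuming each memory word carries a presence tag; the class and queue names are invented for illustration, not HEP's actual data structures.

```python
# Hypothetical full/empty-tagged memory word. Reading an empty word suspends
# the reading thread (parked in an SFU-style wait list); the write that fills
# the word moves any suspended threads back onto the PSW (ready) queue.
class TaggedWord:
    def __init__(self):
        self.full = False
        self.value = None
        self.waiters = []                 # threads suspended on this word (SFU)

    def read(self, thread):
        if not self.full:
            self.waiters.append(thread)   # suspend; PC is not advanced
            return None
        return self.value

    def write(self, value, psw_queue):
        self.value, self.full = value, True
        psw_queue.extend(self.waiters)    # wake suspended threads
        self.waiters.clear()

psw = []
w = TaggedWord()
print(w.read("T1"))        # None: T1 suspends, waiting on the empty word
w.write(42, psw)           # producer writes: word becomes full, T1 re-queued
print(psw, w.read("T1"))   # ['T1'] 42
```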

23
HEP Architecture
[Diagram: HEP processor architecture; PSW queue, matching unit, program memory, increment control, operand handlers 1 and 2, operand fetch, registers, SFU, function units 1..N, connection to/from data memory]
24
HEP configuration
  • Up to 16 processors
  • Up to 128 data memories
  • Connected by high speed switch
  • Limitations
  • Threads can have only 1 outstanding memory
    request
  • Thread synchronization puts bubbles in the
    pipeline
  • A maximum of 64 threads causes problems for
    software
  • Need to throttle loops
  • If parallelism is lower than 8, full utilisation is
    not possible.

25
MIT Alewife Processor
  • 512 Processors in 2-dim mesh
  • Sparcle Processor
  • Physically distributed memory
  • Logically shared memory
  • Hardware supported cache coherence
  • Hardware supported user level message passing
  • Multi-threading

26
Threading in Alewife
  • Coarse-grained multithreading
  • The pipeline works on a single thread as long as no remote
    memory access or synchronization is required
  • Can exploit register optimization in the pipeline
  • Integration of multi-threading with hardware
    supported cache coherence

27
The Sparcle Processor
  • Extension of SUN Sparc architecture
  • Tolerant of memory latency
  • Fine grained synchronisation
  • Efficient user level message passing

28
Fast context switching
  • The SPARC has 8 overlapping register windows
  • Used in Sparcle in pairs to represent 4
    independent, non-overlapping contexts
  • 3 for user threads
  • 1 for traps and message handlers
  • Each context contains 32 general purpose
    registers and
  • PSR (Processor State Register)
  • PC (Program Counter)
  • nPC (next Program Counter)
  • Thread states
  • Active
  • Loaded
  • State is stored in registers; can become active quickly
  • Ready
  • Not suspended and not loaded
  • Suspended
  • Thread switching
  • Is fast if one thread is active and the other is loaded
    (see the sketch below)
  • Need to flush the pipeline (cf. the HEP)
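A schematic Python sketch of the arrangement described above (the context layout and thread-state names follow the slide; everything else is invented): four hardware contexts each hold 32 registers plus PSR/PC/nPC, and switching between two loaded threads only moves the context pointer.

```python
# Sketch of Sparcle-style fast context switching: four register-window
# contexts, each with 32 registers plus PSR/PC/nPC. Switching between two
# LOADED threads just changes the context pointer (plus a pipeline flush);
# a thread that is only READY must first be loaded from memory (slow path).
class Context:
    def __init__(self):
        self.regs = [0] * 32
        self.psr = self.pc = self.npc = 0
        self.thread = None                  # which thread occupies this context

class Sparcle:
    def __init__(self):
        self.contexts = [Context() for _ in range(4)]   # 3 user + 1 trap/message
        self.cp = 0                                     # context pointer (active)

    def switch_to(self, thread):
        for i, ctx in enumerate(self.contexts):
            if ctx.thread == thread:        # thread is LOADED: fast switch
                self.cp = i                 # flush pipeline, move the CP
                return "fast"
        # thread is only READY: a context must be spilled and refilled (slow)
        victim = self.contexts[(self.cp + 1) % 4]
        victim.thread = thread
        return "slow"

cpu = Sparcle()
cpu.contexts[1].thread = "T1"
print(cpu.switch_to("T1"))   # 'fast' -- loaded thread, context pointer moves
print(cpu.switch_to("T9"))   # 'slow' -- ready thread, context must be loaded
```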

29
Sparcle Architecture
[Diagram: Sparcle register file organized as four contexts (0R0-0R31, 1R0-1R31, 2R0-2R31, 3R0-3R31), each with its own PSR, PC and nPC; the CP (context pointer) selects the active thread]
30
MIT Alewife and Sparcle
[Diagram: an Alewife node: Sparcle processor, 64-kbyte cache, main memory, FPU and CMMU connected to an NR. NR = network router, CMMU = communication and memory management unit, FPU = floating point unit]
31
From here on, the figures are drawn by Tim
32
[Figure 16.10: thread states in Sparcle. Process state in memory holds global register frames (G0-G7), a ready queue, a suspended queue and PC/PSR frames for unloaded threads; loaded threads occupy one of the four on-chip register frames (0R0-0R31 ... 3R0-3R31), each with its own PSR, PC and nPC; the CP points at the active thread]
33
[Figure 16.11: structure of a typical static dataflow PE: instruction queue, fetch unit, activity store, function units 1..N, and an update unit connected to/from other PEs]
34
[Figure 16.12: structure of a typical tagged-token dataflow PE: token queue, matching unit with matching store, fetch unit with instruction/data memory, function units 1..N, and an update unit sending tokens to other PEs]
35
[Figure 16.13: organization of the I-structure storage: data storage locations k, k+1, ..., k+4, each with presence bits (A = Absent, P = Present, W = Waiting)]
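A rough Python sketch of the I-structure semantics implied by the figure (the class and method names are invented): each element carries a presence state (Absent, Present, Waiting); a read of an Absent element is deferred and the state becomes Waiting, and the single write delivers the value to all deferred readers.

```python
# Hypothetical I-structure element with presence bits: Absent, Present, Waiting.
class IStructureSlot:
    def __init__(self):
        self.state = "A"            # A = Absent, P = Present, W = Waiting
        self.value = None
        self.deferred = []          # deferred read requests (continuations)

    def read(self, continuation):
        if self.state == "P":
            continuation(self.value)
        else:                       # Absent or already Waiting: defer the read
            self.state = "W"
            self.deferred.append(continuation)

    def write(self, value):
        assert self.state != "P", "I-structures are write-once"
        self.value, self.state = value, "P"
        for cont in self.deferred:  # satisfy all deferred readers
            cont(value)
        self.deferred.clear()

slot = IStructureSlot()
slot.read(lambda v: print("reader 1 got", v))   # deferred: slot is Absent
slot.write(7)                                    # fills the slot, wakes reader 1
slot.read(lambda v: print("reader 2 got", v))   # immediate: slot is Present
```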
36
[Figure 16.14 (a), (b): coding in explicit token-store architectures: input tokens <12, <FP, IP>> and <35, <FP, IP>> cause the subtract node to fire, producing tokens with value 23 for the destinations <FP, IP+1> and <FP, IP+2>]
37
[Figure 16.14 (c): explicit token-store coding: instruction memory indexed by IP and frame memory indexed by FP, with operand slots at FP+2, FP+3, FP+4 each guarded by a presence bit; the instruction fires when its operands are present]
38
[Figure 16.15: structure of a typical explicit token-store dataflow PE: fetch units, effective address computation, presence bits, frame memory with frame store operations, function units 1..N, and form-token units connected to/from other PEs]
39
[Figure 16.16: the scale of von Neumann/dataflow architectures, from pure dataflow through macro dataflow, decoupled hybrid dataflow and RISC-like hybrid to pure von Neumann]
40
[Figure 16.17: structure of a typical macro dataflow PE: token queue, matching unit, fetch unit, instruction/frame memory, an internal control pipeline (program counter-based sequential execution) with its function unit, and a form-token unit connected to/from other PEs]
41
[Figure 16.18: organization of a PE in the MIT hybrid machine: enabled continuation queue (token queue), PC and FBR registers, instruction fetch from instruction memory, decode unit, operand fetch from frame memory, registers, and an execution unit connected to/from global memory]
42
[Figure 16.19: comparison of (a) SQ and (b) SCB macro nodes: the same instructions l1-l5 with inputs a, b, c are partitioned into SQ1/SQ2 in one scheme and SCB1/SCB2 in the other]
43
[Figure 16.20: structure of the USC Decoupled Architecture: each cluster contains a cluster graph memory, GC and DFGE units with AQ and RQ queues, and CE and CC units; clusters connect to the network through a graph virtual space and a computation virtual space]
44
[Figure 16.21: structure of a node in the SAM: main memory, APU, ASU, SEU and LEU units with fire/done signalling, connected to/from the network]
45
[Figure 16.22: structure of the P-RISC processing element: token queue, instruction fetch from local memory, operand fetch and operand store against frame memory, an internal control pipeline (a conventional RISC processor) with function unit and load/store, a start unit, and messages to/from other PEs' memory]
46
[Figure 16.23: transformation of dataflow graphs into control flow graphs: (a) a dataflow graph; (b) the equivalent control flow graph with explicit fork L1 and join operations]
47
[Figure 16.24: structure of the *T node: network interface with message formatter and message queues, a remote memory request coprocessor, a continuation queue of <IP, FP> pairs, and local memory]