Microthreaded model and DRISC processors - Managing concurrency dynamically
1
Microthreaded model and DRISC processors -
Managing concurrency dynamically
  • A seminar given to IFIP 10.3 on 9/5/2007
  • Chris Jesshope
  • Professor of Computer Systems Engineering
  • University of Amsterdam
  • jesshope@science.uva.nl

2
Background - 10 years of research
  • This work started in 1996 as a latency-tolerant
    processor architecture called DRISC designed
    for executing data-parallel languages on
    multiprocessors
  • It has evolved over 10 years into a self-similar
    concurrency model called SVP - or Microthreading
    with implementations at the ISA and system level

A Bolychevsky, C R Jesshope and V B Muchnick
(1996) Dynamic scheduling in RISC architectures,
IEE Trans. E, Computers and Digital Techniques,
143, pp 309-317.
C R Jesshope (2006) Microthreading - a model for
distributed instruction-level concurrency,
Parallel Processing Letters, 16(2), pp 209-228.
C R Jesshope (2007) A model for the design and
programming of multicores, submitted to Advances
in Parallel Computing, L. Grandinetti (Ed.), IOS
Press, Amsterdam,
http://staff.science.uva.nl/jesshope/papers/Multicores.pdf
3
Current and proposed projects
  • The NWO Microgrids project is evaluating
    homogeneous reconfigurable multi-cores based on
    microthreaded microprocessors
  • 4 years from 01/09/05
  • SVP has been adopted in the EU AETHER project as
    a model for self-adaptive computation based on
    FPGAs
  • 3 years from 01/01/06
  • The APPLE-CORE FP7 proposal will target C and SAC
    languages to SVP and will implement prototypes of
    microthreaded microprocessors (we hope)

4
UvA's multi-core mission
  • Managing 10² to 10⁵ processors per chip
  • Operands from large distributed register files
  • Processors tolerant to significant latency
  • hundreds of processor cycles
  • On-chip COMA distributed shared memory
  • Support for a range of architectural paradigms
  • homogeneous / heterogeneous / FPGA / SIMD
  • To do all of this we need a programming model
    supporting concurrency as a core concept

5
Programming models
  • Sequential programming has advantages
  • sequential programs are deterministic and safely
    composable - i.e. using the well-understood
    concept of hierarchy (calling functions)
  • source code is universally compatible and can be
    compiled to any sequential ISA without
    modification
  • binary-code compatibility is important in
    commodity processors - although this is not
    scalable in current processors
  • Our aim is to gain the same benefits from a
    concurrent programming model for multi-cores

6
Microthread or SVP model
  • Blocking threads with
  • data-driven instruction execution

7
Concurrency trees - hierarchical composition
  • Concurrent composition - build programs
    concurrently
  • nodes represent threads - leaf nodes perform
    computation
  • branching at a node represents concurrent
    subordinate threads

Program A
Program B
Program AB
8
Blocking threads
A
What does this mean?

B0
Bn
  • Threads at different levels run concurrently
  • A creates Bi for all i in some set
  • dependencies defined between threads
  • A continues until a sync
  • The identifiable events are
  • when A creates B
  • when A writes a value used by B etc.
  • when Bi completes for all i
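The create/sync pattern above can be sketched in Python. This is only an analogy using OS threads (the model's threads are lightweight microthreads and sync is a hardware event); the function names and the squaring example are our own:

```python
import threading

def create(n, body):
    """A creates a family B_0..B_{n-1}; A continues and later
    waits on the returned sync handle (all B_i complete)."""
    family = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in family:
        t.start()
    return lambda: [t.join() for t in family]  # sync event

results = [0] * 4
sync = create(4, lambda i: results.__setitem__(i, i * i))
# ... A continues until the sync ...
sync()
print(results)  # [0, 1, 4, 9]
```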


9
Terminology and concepts
  • Family of threads
  • All threads at one level
  • Unit of work
  • a sub-tree, i.e. all of a thread's subordinate
    threads
  • may be considered as a job or a task
  • Place
  • where a unit of work executes - one or more
    processors, FPGA cells, etc.

10
Safe composition
  • A family of threads is created dynamically as an
    ordered set defined on an index sequence
  • each thread in the family has access to a unique
    value in the index sequence - its index in the
    family
  • Restrictions are placed on the communication
    between threads - these are blocking reads
  • the creating thread may write to the first thread
    in index sequence and
  • any created thread may write to the thread whose
    index is next in sequence to its own

Communication in a family is acyclic and deadlock
cannot be induced by composition - i.e. one
thread creating a subordinate family of threads
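The restricted, acyclic communication pattern can be illustrated in Python. This is a sketch with events standing in for blocking register reads; the prefix-sum example is our own, not from the slides:

```python
import threading

def family_prefix_sum(values):
    """Thread i blocks on a value from thread i-1 and writes one to
    thread i+1, so communication is acyclic by construction."""
    n = len(values)
    shared = [None] * (n + 1)
    full = [threading.Event() for _ in range(n + 1)]
    shared[0] = 0          # the creating thread writes to the first thread
    full[0].set()
    out = [0] * n
    def body(i):
        full[i].wait()                 # blocking read from predecessor
        out[i] = shared[i] + values[i]
        shared[i + 1] = out[i]         # write to the next thread in sequence
        full[i + 1].set()
    ts = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return out

print(family_prefix_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```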
11
Thread distribution
  • A create operation distributes a parameterised
    family of threads to processing resources -
    deterministically
  • the number of threads/processors is defined at
    runtime
  • Processors may be one or more homogeneous
    processors, a dedicated unit, configured FPGA
    cells, etc.
  • Communication deadlock is avoided but resource
    deadlock can occur
  • processors have a finite set of registers for
    synchronising contexts
  • this can be statically analysed for some
    programs
  • but not for unbounded recursion of creates -
    solved by delegating a unit of work to a new place

12
Registers as synchronisers
  • Efficient implementations of microthreads
    synchronise in shared registers (as i-structures)
  • avoids a memory round-trip latency in
    synchronising
  • single-cycle synchronisation is possible
  • Families of threads communicate and synchronise
    on shared memory
  • a family's output to memory is not defined until
    the family completes (the synchronising event)
  • i.e. a bulk synchronisation or barrier

focus on direct implementations of the model at
the level of ISA instructions
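Register-based synchronisation can be sketched as an i-structure cell in Python. This is our own illustration (the class name and the use of an event object are assumptions, not the hardware mechanism):

```python
import threading

class IStructure:
    """Single-assignment synchronising cell: a read of an empty
    cell blocks (the reader suspends) until a value is written."""
    def __init__(self):
        self._full = threading.Event()
        self._value = None
    def write(self, value):
        assert not self._full.is_set(), "i-structure written twice"
        self._value = value
        self._full.set()     # wakes any suspended readers
    def read(self):
        self._full.wait()    # suspend until full
        return self._value

r = IStructure()
threading.Timer(0.01, r.write, args=(42,)).start()  # asynchronous write
print(r.read())  # blocks briefly, then prints 42
```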
13
Putting it all together
Family of threads - indexed (e.g. i = 0, 2, 4, 6)
and dynamically defined

Squeeze is a preemption or retraction of a
concurrent unit of work
14
System issues
  • Threads are dynamic, share memory, and can be
    executed anywhere
  • Shared and/or distributed memory implementations
    are possible
  • A place can be on the same chip or on another
  • Deterministic distribution of families can be
    used to optimise data locality

15
Implementation of SVP in conventional processors
  • Dynamic RISC - DRISC

16
DRISC processor
  • Can apply microthreading to any base ISA
  • just add concurrency control instructions
  • provide control structures for threads and
    families
  • provide a large synchronising register file
  • Have backward compatibility to the base ISA
  • old binaries run as threads under the model
  • New binaries are schedule invariant
  • use from 1 to Nmax processors

17
Synchronous vs Asynchronous register updates
  • An instruction in a microthreaded pipeline
    updates registers either
  • synchronously - when the register is set at the
    writeback stage of the pipeline
  • asynchronously - when the register is set to
    empty at the writeback stage and some activity
    concurrent to the pipeline's operation will write
    a value to the register file asynchronously
  • Some instructions do one or the other depending
    on machine state, e.g. load word depends on L1
    cache hit

18
Regular ISA concurrency control
  • Add just five new instructions
  • cre - creates a family of microthreads - this is
    asynchronous and may set more than one register
  • the events are when the family is identified and
    completes
  • a Thread Control Block (TCB) in memory contains
    parameters
  • brk - terminates the family of the executing
    thread
  • a return value is read from the first register
    specifier
  • kill / sqze - terminate / preempt a family
    specified by a family id
  • the family identifier is read from the second
    register specifier

19
DRISC pipeline
  • Note the potential for power efficiency
  • If a thread is inactive its TIB line is turned
    off
  • If the queue is empty the processor turns off
  • The queue length measures local load and can be
    used to adjust the local clock rates

Diagram: DRISC pipeline, comprising a thread
instruction buffer (TIB), a queue of active
threads, synchronising memory, fixed-delay
operations and variable-delay operations (e.g.
loads)
  2. Instructions are issued from the head of the
     active queue and read synchronising memory
  3. If data is available it is sent for
     processing, otherwise the thread suspends on
     the empty register
  4. Suspended threads are rescheduled when data is
     written and re-execute the blocked instruction
20
Processor control structures required
  • A large synchronising register file (RF)
  • also a register-file map for register allocation
  • A thread table (TT) to store a thread's state
  • PC, RF base addresses, queue link field, etc.
  • A thread instruction buffer (TIB)
  • an active thread is associated with a line in the
    TIB
  • A family table (FT) to store family information
  • Thread and family identifiers are indices into TT
    and FT respectively - i.e. they are direct access
    structures

Does not require branch predictors, large data
caches or complex issue logic
21
Synchronising memory
  • Registers provide the synchronising memory in a
    microthreaded pipeline
  • The state of a register is stored with its data
    and ports adapt according to that state
  • In state T-cont the register contains a TT
    address
  • In state RR-cont the register contains a remote
    RF address

Diagram: register state machine (registers are
initialised empty)
  • empty + local read (no data) -> T-cont
  • empty + remote read (no data) -> RR-cont
  • T-cont + data write -> full (reschedules the
    suspended thread)
  • RR-cont + data write -> full (completes the
    remote read)
  • empty + data write -> full
  • data writes are asynchronous pipeline operations
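The four register states can be modelled as a small table-driven state machine. A sketch only; the event names and side-effect comments are ours:

```python
# (state, event) -> new state; side effects noted in comments
TRANSITIONS = {
    ("empty",   "local_read"):  "T-cont",   # register now holds a TT address
    ("empty",   "remote_read"): "RR-cont",  # register now holds a remote RF address
    ("empty",   "data_write"):  "full",
    ("T-cont",  "data_write"):  "full",     # reschedules the suspended thread
    ("RR-cont", "data_write"):  "full",     # completes the remote read
}

def step(state, event):
    """Advance one register through one synchronisation event."""
    return TRANSITIONS[(state, event)]

print(step(step("empty", "local_read"), "data_write"))  # full
```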
22
Memory references
  • To provide latency tolerance, loads and stores
    are decoupled from the pipeline's operation
  • n.b. the datapath cache may be very small, e.g.
    1 KByte
  • The ISA's load instruction is
  • synchronous on L1 D-cache hit
  • asynchronous on L1 D-cache miss
  • In the latter case the target register is written
    empty by the pipeline and overwritten
    asynchronously by the memory subsystem when it
    provides data

23
Register-to-register operations
  • Single-cycle operations are synchronous and
    scheduled every clock cycle using bypassing
  • Multi-cycle operations can be either synchronous
    or asynchronous
  • Variable-cycle operations are scheduled
    asynchronously (e.g. shared FPU)
  • the writeback sets the register empty and any
    dependent instruction is blocked

24
Sharing registers between threads
  • Each thread has an identified context in the
    register file (31 registers - R31 is hardwired
    to zero in the Alpha ISA)
  • registers are shared between threads' contexts to
    support the distributed-shared register file -
    sharing is restricted
  • on adjacent processors sharing is performed by
    local communication
  • Have sub-classes of variables managed in the
    context
  • global - to all threads in a family
  • local - to one thread only
  • shared/dependent - written by one thread, read
    by its neighbour

25
Diagram: register-file windows for the creating
thread and threads 1..n - each context holds
global scalars (read-only to the family,
read/write to the creator), its own locals, and
shareds written locally and read by the
neighbouring thread
26
Create
  • Create performs the following actions
    autonomously
  • writes TCB address to the create buffer at
    execute stage
  • sets two targets (e.g. Ra and Ra1) to empty at
    WB stage
  • when the family is allocated an FT slot it
    (optionally) writes the fid to Ra1 using the
    asynchronous port
  • the family may now be killed or squeezed
  • when the family completes it (optionally) writes
    the return value to the target specified in the
    TCB using the asynchronous port
  • finally, when the family's memory writes have
    completed, it writes the return code to Ra using
    the asynchronous port and cleans itself up -
    i.e. releasing the FT slot
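The create protocol above can be sketched in Python. The Cell class, the fid value, and running the family sequentially are all simplifying assumptions of ours; only the register protocol (empty both targets, fill them asynchronously) follows the slide:

```python
import threading

class Cell:
    """Empty/full register: reads block until the value arrives."""
    def __init__(self):
        self._full = threading.Event()
        self._value = None
    def write(self, v):
        self._value = v
        self._full.set()
    def read(self):
        self._full.wait()
        return self._value

def cre(body, n):
    """Sets Ra (return code) and Ra+1 (fid) empty, then fills them
    asynchronously as the family is allocated and completes."""
    ra, ra1 = Cell(), Cell()          # both set empty at writeback
    def family():
        ra1.write(1)                  # fid, once an FT slot is allocated
        for i in range(n):            # the family's threads (sequential here)
            body(i)
        ra.write(0)                   # return code, after memory writes complete
    threading.Thread(target=family).start()
    return ra, ra1

out = []
ra, ra1 = cre(out.append, 3)
ra.read()            # block until the family completes
print(out)  # [0, 1, 2]
```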

27
Squeeze and Kill
  • kill and squeeze are asynchronous and very
    powerful!
  • To provide security, a pseudo-random number is
    generated by the processor and kept in the FT
    and as a part of the fid
  • They require these to match in order to enable
    the operations
  • kill and squeeze traverse down through the create
    tree from the node the signal was sent to
  • for squeeze this is to a user-defined level
  • The concurrency tree is captured implicitly by a
    parent in the FT
  • i.e. families are located in related FTs that
    have the same fid as a parent; these children
    then propagate the signal in turn

28
Thread state
  • Threads are held in an indexed table
  • the table index is the thread's reference and is
    used to build queues on that table
  • Thread state in the TT is encoded by the queue a
    thread is currently in
  • empty - not allocated
  • active - head/tail in family table
  • suspended - degenerate queue (head = tail)
    stored in the register the thread is suspended on
  • waiting - head/tail in I-cache line
  • n.b. no thread will execute unless its
    instructions are in the cache

29
Thread state transition
Diagram: thread state transitions
  • active -> active: executes, context switches,
    reads data successfully
  • active -> suspended: executes, context switches,
    reads data unsuccessfully
  • suspended -> active: data written, PC hits the
    I-cache
  • suspended -> waiting: data written, PC misses
    the I-cache
  • waiting -> active: cache line returns
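These transitions can be checked with a table-driven sketch (event names are ours):

```python
# (state, event) -> new state, following the slide's transition diagram
THREAD_FSM = {
    ("active",    "read_ok"):           "active",
    ("active",    "read_empty"):        "suspended",
    ("suspended", "write_icache_hit"):  "active",
    ("suspended", "write_icache_miss"): "waiting",
    ("waiting",   "line_returns"):      "active",
}

state = "active"
for event in ["read_empty", "write_icache_miss", "line_returns"]:
    state = THREAD_FSM[(state, event)]
print(state)  # active
```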
30
Microgrids
  • of microthreaded microprocessors

31
Family distribution to clusters
Source code: for i = 1, n
Binary code: create i = 1, n
Hardware: a deterministic global schedule over
i = 1, n
Microthreads are scheduled to pipelines dynamically
and instructions executed according to dataflow
constraints
Diagram: per-processor thread queues (e.g. i1, i4,
i7, i10 on P0) feed schedulers and pipelines
P0-P3, connected by a register-sharing ring network
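The deterministic global schedule can be sketched as a modular distribution. This assumes round-robin assignment of indices to processors, as the thread-queue diagram suggests:

```python
def distribute(n, p):
    """Assign thread index i (1..n) deterministically to processor
    (i - 1) % p, yielding one thread queue per processor."""
    queues = [[] for _ in range(p)]
    for i in range(1, n + 1):
        queues[(i - 1) % p].append(i)
    return queues

print(distribute(12, 3))  # [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```

Because the schedule is a pure function of (i, p), the same binary yields the same placement on any run, which is what lets the compiler reason about data locality.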
32
SEP - dynamic processor allocation
  • The microgrid concept defines a pool of bare
    processors allocated dynamically by the SEP to
    threads at any level in the concurrency tree in
    order to delegate units of work
  • a cluster of processors is configured as a ring
    and is known as a place, identified by the
    address of the root processor
  • microthreaded binary code can be executed
    anywhere and on any number of processors

33
Delegation across CMP
Diagram: delegation across the CMP - µT processors
grouped by the SEP into clusters (1-5), all
sharing a coherent shared memory; units of work
are delegated between clusters
34
Example Chip Architecture
Diagram: example chip architecture
  • Level 0 tile: four pipes (Pipe 0-3) sharing FPU
    pipes
  • Level 1 tile: level-0 tiles with data-diffusion
    memory and configuration switches
  • Coherency network (64 bytes wide, ring / ring of
    rings)
  • Register-sharing network (8 bytes wide, ring)
  • Delegation network (1 bit wide, grid)
35
The big picture - where are we?
Paradigms: Sequential / Data parallel / Streaming
Key: exist today / in development / to be developed
36
Discussion
  • Microthreading provides a unified model of
    concurrency on a scale from CMPs to grids
  • The model is composed concurrently with
    restrictions to allow safe composition
  • It reflects the problems of future silicon
    implementations
  • We have developed a language µTC that captures
    this concurrency

37
Conclusions
  • Microthreaded processors are both computationally
    and power efficient
  • code is schedule invariant and dynamically
    distributed
  • instructions are dynamically interleaved
  • Control structures are distributed and scalable
  • Small compared to an FPU
  • Can manage code fragments (threads) as small as a
    few instructions
  • context switch - signal - reschedule a thread on
    every clock cycle