Lecture 4 Multithreaded Processors continued and an Example - PowerPoint PPT Presentation

Provided by: juny8
1
Lecture 4 Multithreaded Processors (continued)
and an Example
  • Instructor: Jun Yang

2
Implicitly Multithreaded Processors
  • Explicit multithreading (covered in the last class):
    the programmer or compiler creates multiple threads.
  • Implicit multithreading: threads are spawned
    dynamically (with the help of the compiler), and the
    number of active threads varies.
  • Child threads are relatively short (tens of
    instructions) and often need to communicate large
    amounts of state with other threads to resolve
    data and control dependences.
  • How are threads created?
  • How are register data dependences resolved?
  • How are memory data dependences resolved?

3
Thread Creation
  • Control independence
  • Future threads

[Figure: blocks labeled A, B, and C]
4
Thread Creation
  • Out-of-order thread creation, e.g. DMT
  • Creation of C and H is out of program order
  • The reorder buffer has to support out-of-order
    insertion of an arbitrary number of instructions
    into the middle of a set of already active
    instructions.

5
Thread Creation
  • Disjoint Eager Execution (DEE)
  • Eager execution: execute both paths following a
    branch.
  • The number of paths explodes as multiple branches
    are traversed.
  • DEE prunes the decision tree according to the
    branch prediction rate and spawns threads along
    the paths with the highest cumulative probability.
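
The pruning rule above can be sketched in a few lines. This is only an illustration, assuming every branch is predicted correctly with the same probability p (0.75 here); the function name `dee_order` and the 'P'/'N' path encoding are hypothetical and not part of the lecture.

```python
import heapq

def dee_order(p, n_paths):
    """Return the first n_paths tree edges in the order they would be spawned.

    Each result is (cumulative_probability, path), where path is a string of
    'P' (predicted direction) and 'N' (non-predicted direction) choices.
    """
    # heapq is a min-heap, so probabilities are negated to pop the largest first.
    heap = [(-p, "P"), (-(1.0 - p), "N")]
    heapq.heapify(heap)
    order = []
    while heap and len(order) < n_paths:
        neg_prob, path = heapq.heappop(heap)
        prob = -neg_prob
        order.append((round(prob, 2), path))
        # The chosen path runs into another branch, adding two candidate edges.
        heapq.heappush(heap, (-(prob * p), path + "P"))
        heapq.heappush(heap, (-(prob * (1.0 - p)), path + "N"))
    return order

for rank, (prob, path) in enumerate(dee_order(0.75, 5), start=1):
    print(rank, prob, path)
```

Under these assumptions the first four resources go down the predicted path, and the fifth goes to the non-predicted side of the first branch (0.25), because that edge is more likely than a fifth consecutive correct prediction (about 0.24).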

[Figure: DEE decision tree. Paths are numbered 1 to 5; edges carry cumulative probabilities 0.75, 0.25, 0.56, 0.19, 0.42, 0.14, 0.32, 0.24]
6
Where Are They Running
  • Partition execution resources: each thread runs
    on its own partition, and each partition can be
    less aggressive.
  • Much like an on-chip multiprocessor.
  • E.g. thread-level speculation (TLS) creates a
    thread for each iteration of a loop. Each
    iteration runs on a core.
  • Rely on an SMT-like processor that interleaves
    implicit threads (instead of explicit threads).
  • E.g. DMT spawns threads at procedure calls and
    backward loop branches. Threads share the
    existing resources.
  • Multiple processing elements structured as a
    circular queue. Each element executes one thread.
    The tail of the queue is the current thread
    (non-speculative); the others execute future
    threads (which can be speculative).
  • E.g. Multiscalar allows creation of a thread at
    arbitrary points in the program's control flow.

7
Resolving Register Data Dependences
  • Intrathread dependences are handled with standard
    techniques.
  • Interthread dependences are hard.
  • A new future thread might need to read a register
    at its beginning, but the producer may be at the
    end of the prior thread. The producer instruction
    might not even have been fetched yet.
  • Disallow register data dependences. Communicate
    all shared operands through memory with loads and
    stores, e.g. TLS.
  • The compiler specifies the dependences explicitly.
    It embeds a write mask in the future thread that
    tells which registers have pending writes, e.g.
    Multiscalar.
  • Speculatively execute as if the operands are
    ready. Recover if wrong, e.g. DMT.
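
The write-mask option can be illustrated with a toy model. This is a rough sketch, not Multiscalar's actual mechanism: the class `FutureThreadRegs` and its methods are hypothetical names, and a stall is modeled simply as `can_read` returning False.

```python
class FutureThreadRegs:
    """Register state of a future thread guarded by a compiler-provided write mask."""

    def __init__(self, inherited, write_mask):
        self.values = dict(inherited)    # register values copied at spawn time
        self.pending = set(write_mask)   # registers an earlier thread will still write

    def can_read(self, reg):
        # A register with a pending write holds a stale value, so the read must wait.
        return reg not in self.pending

    def forward(self, reg, value):
        # Called when the earlier thread produces its final write to reg.
        self.values[reg] = value
        self.pending.discard(reg)

    def read(self, reg):
        if not self.can_read(reg):
            raise RuntimeError(f"{reg} still has a pending write; the thread must stall")
        return self.values[reg]


regs = FutureThreadRegs({"r1": 10, "r2": 20}, write_mask={"r1"})
print(regs.can_read("r2"), regs.can_read("r1"))
regs.forward("r1", 99)
print(regs.read("r1"))
```

Registers outside the mask can be read immediately from the inherited state; only the masked ones wait for the earlier thread to forward a value.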

8
Resolving Memory Data Dependences
  • Intrathread memory dependences are resolved with
    standard techniques.
  • WAR and WAW interthread dependences: buffer
    writes from future threads and commit them
    when the threads retire.
  • RAW interthread memory dependence resolution
    (complex!)
  • A load in a later thread searches for an aliasing
    store in the earlier threads' store queues
    (bypass). A store in an earlier thread searches
    for an aliasing load (already executed) in the
    later threads' load queues (a violation). E.g.
    DMT, DEE.
  • Centralized address resolution buffer (ARB). It
    holds one entry for each load in a future thread;
    any aliasing store from an older thread flags a
    violation (the future thread is squashed and
    restarted). Each load is also checked against
    all unretired stores in the ARB. E.g. Multiscalar.
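
The load/store queue search for RAW dependences can be sketched as follows. This is a deliberately simplified model: the names `SpecThread`, `load`, and `store` are hypothetical, addresses are compared exactly, and a later load is flagged even if it was actually satisfied by a store in an intermediate thread, which a real design would have to distinguish.

```python
class SpecThread:
    def __init__(self, order):
        self.order = order        # position in program order (smaller = earlier)
        self.store_queue = {}     # address -> latest buffered (uncommitted) value
        self.load_queue = set()   # addresses this thread has already loaded
        self.squashed = False

def load(threads, t, addr, memory):
    """Record the load, then bypass from the nearest aliasing store not later than t."""
    t.load_queue.add(addr)
    candidates = [o for o in threads if o.order <= t.order]
    for other in sorted(candidates, key=lambda o: o.order, reverse=True):
        if addr in other.store_queue:
            return other.store_queue[addr]
    return memory.get(addr, 0)

def store(threads, t, addr, value):
    """Buffer the store and squash later threads that already loaded this address."""
    t.store_queue[addr] = value
    for other in threads:
        if other.order > t.order and addr in other.load_queue:
            other.squashed = True   # RAW violation: the later load saw a stale value


memory = {0x10: 1}
t0, t1 = SpecThread(0), SpecThread(1)
threads = [t0, t1]
print(load(threads, t1, 0x10, memory))   # the later thread loads too early
store(threads, t0, 0x10, 5)              # the earlier store arrives afterwards
print(t1.squashed)
```

If the store had executed first, the later load would have found it in the earlier thread's store queue and picked up the value through the bypass path without a squash.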

9
Concluding Remarks
  • IMT exists only in research proposals up to now.
  • Implementation details are difficult.

10
Case Study
  • The Intel IXP 12XX Network Processor Family
  • IXP 1200 contains:
  • A StrongARM processor core.
  • 6 microengines (MEs), each of which is a
    programmable 32-bit RISC processor.
  • Memory controllers and bus interface units.
  • No on-chip caches (except for the StrongARM core).
  • Each ME supports up to 4 threads running in a
    coarse-grained fashion.

14
Pipeline Stages
15
Branch Decisions
  • Branches are classified into 3 classes.
  • Class 1 branches are resolved in stage P1; the
    instruction in P0 may be aborted.
  • Class 2 branches are resolved in either P2 or P1,
    depending on the condition set by the instruction
    in P3; instructions in P0 and P1 may be aborted.
  • Class 3 branches are resolved in P3;
    instructions in P0, P1, and P2 may be aborted.
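
The worst-case cost of each branch class follows directly from the stage in which it resolves. The sketch below assumes stages are numbered from P0 upward with one instruction per stage; `RESOLVE_STAGES` and `max_aborted` are hypothetical names used only for illustration.

```python
# Stage(s) in which each branch class can be resolved, as listed above.
# Class 2 may resolve in P1 or P2, so both possibilities are recorded.
RESOLVE_STAGES = {1: (1,), 2: (1, 2), 3: (3,)}

def max_aborted(branch_class):
    """Worst-case number of younger instructions aborted for a branch class."""
    # Every stage before the resolving stage holds an instruction fetched after
    # the branch, so a branch resolved in stage Pk can abort up to k instructions.
    return max(RESOLVE_STAGES[branch_class])

for cls in (1, 2, 3):
    print(f"class {cls}: up to {max_aborted(cls)} instruction(s) aborted")
```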

16
Context Switch
  • Specified explicitly by an instruction.
  • Overhead is at most 1 cycle: a context switch
    instruction is a class 1 branch instruction, which
    is resolved in P1.
  • If a delayed branch is used, this 1-cycle loss
    can be removed.
  • Needs to update event signals.
  • The context event arbiter wakes up a thread
    (in round-robin fashion).
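
Round-robin wake-up among the four contexts of a microengine can be sketched as below. This is a minimal model, assuming a thread is "ready" once the event signals it waits on have arrived; the function `next_context` and its arguments are hypothetical and do not describe the actual arbiter hardware.

```python
def next_context(current, ready, n_contexts=4):
    """Pick the next ready context after `current` in round-robin order.

    `ready[i]` is True when context i has received the events it was waiting for.
    Returns None when no context is ready, i.e. the microengine idles.
    """
    for step in range(1, n_contexts + 1):
        candidate = (current + step) % n_contexts
        if ready[candidate]:
            return candidate
    return None

print(next_context(0, [True, False, True, False]))
print(next_context(2, [True, False, True, False]))
```

The search starts just after the context that gave up the pipeline, so the current context is reconsidered only when none of the others is ready.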