1
Multiscalar Processors
  • Presented by Matthew Misler
  • Gurindar S. Sohi, Scott E. Breach, and
    T. N. Vijaykumar
  • University of Wisconsin-Madison
  • ISCA '95

2
Scalar Processors
Instruction Queue
Execution Unit
addu  $20, $20, 16
ld    $23, SYMVAL-16($20)
move  $17, $21
beq   $17, $0, SKIPINNER
ld    $8, LELE($17)
3
SuperScalar Processors
Instruction Queue
Execution Unit
addu  $20, $20, 16
ld    $23, SYMVAL-16($20)
move  $17, $21
beq   $17, $0, SKIPINNER
ld    $8, LELE($17)
4
Fetch-Execute
  • The paradigm has been around for about 60 years
  • Superscalar processors execute instructions
    out of order
  • Sometimes the re-ordering is done in hardware
  • Sometimes in software
  • Sometimes both
  • Only a partial ordering is enforced

5
Control Flow Graphs
  • Segments are split on control dependencies
    (conditional branches)

6
Sequential Walk
  • Walk through the CFG, extracting enough
    parallelism along the way
  • Use speculative execution and branch prediction
    to raise the level of parallelism
  • Sequential semantics must be preserved
  • Instructions can still execute out of order, but
    must commit in order

7
Multiscalars and Tasks
  • CFG broken down into tasks
  • Multiscalars step through at the task level
  • No inspection of instructions within a task
  • Each Task is assigned to one processing unit
  • Multiple tasks can execute in parallel
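The task-level stepping described above can be sketched in a few lines. This is a hypothetical toy, not the paper's hardware: predicted tasks are handed out round-robin to a ring of four processing units.

```python
# Hypothetical sketch: a sequencer walks the CFG one task at a time and
# assigns each task to the next processing unit in a unidirectional ring.
NUM_UNITS = 4

def assign_tasks(task_sequence):
    """Return (unit, task) pairs, stepping the ring round-robin."""
    return [(i % NUM_UNITS, task) for i, task in enumerate(task_sequence)]

# Tasks predicted along one walk of the CFG (letters as in the examples).
print(assign_tasks(["A", "B", "C", "B", "B"]))
# [(0, 'A'), (1, 'B'), (2, 'C'), (3, 'B'), (0, 'B')]
```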

8
Multiscalar Microarchitecture
  • Sequencer
  • Queue of processing units
  • Unidirectional ring
  • Each has an instruction cache, processing
    element, register file
  • Interconnect
  • Data Bank
  • Each has address resolution buffer, data cache

9
Multiscalar Microarchitecture
10
Outline
  • Multiscalar Microarchitecture
  • Tasks
  • Multiscalars in-depth
  • Distribution of cycles
  • Comparison to other paradigms
  • Performance
  • Conclusion

12
Tasks
  • The sequencer distributes a task to a processing
    unit
  • The unit fetches and executes the task until
    completion
  • The instruction window is bounded by
  • The first instruction in the earliest executing
    task
  • The last instruction in the latest executing task
  • So? Instruction windows can be huge

13
Tasks Example
(Figure: a CFG with tasks A-E and one unit's
 task-level walk through it: A B C B B C D)
14
Tasks Example
(Figure: the same CFG with tasks A-E, walked in
 parallel by three processing units, producing the
 traces ABCBBCD, ABBCD, and ABCBCDE)
15
Tasks
  • Hold true to sequential semantics inside each
    task
  • Enforce sequential order on the tasks overall
  • The circular queue takes care of this part
  • In the previous example
  • The head of the queue does ABCBBCD
  • The middle unit does ABBCD
  • The tail of the queue does ABCBCDE
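The circular queue's ordering role can be sketched as follows (an assumed simplification, not the hardware): units may finish out of order, but only the task at the head of the queue is allowed to retire.

```python
from collections import deque

# Illustrative sketch: tasks finish out of order, but only the task at
# the head of the circular queue may commit, so sequential order holds.
ring = deque([0, 1, 2, 3])     # task ids in sequential (assignment) order
finish_order = [2, 0, 1, 3]    # the order in which units actually finish
done, committed = set(), []
for tid in finish_order:
    done.add(tid)
    while ring and ring[0] in done:     # retire only from the head
        committed.append(ring.popleft())
print(committed)               # [0, 1, 2, 3]: sequential order preserved
```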

16
Tasks
  • Registers
  • Create mask
  • May produce values for a future task
  • Forward values down the ring
  • Accum mask
  • Union of the create masks of active tasks
  • Memory
  • If it's a known producer-consumer relationship,
    then synchronize on the loads and stores

17
Tasks
  • Memory (cont'd)
  • Unknown producer-consumer relationship
  • Conservative approach: wait
  • Aggressive approach: speculate
  • The conservative approach means sequential
    operation
  • The aggressive approach requires dynamic
    checking, squashing, and recovery
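The aggressive approach's dynamic check can be sketched like this. The `MemCheck` class and its methods are hypothetical stand-ins for the ARB's role: record speculative loads, and flag a violation when a sequentially earlier store arrives for an address a later task has already loaded.

```python
# Hypothetical sketch of the aggressive (speculative) approach: loads run
# ahead, and an ARB-like table detects when an earlier task's store
# arrives after a later task has already loaded the same address.
class MemCheck:
    def __init__(self):
        self.speculative_loads = {}    # addr -> latest task that loaded it

    def load(self, task, addr):
        prev = self.speculative_loads.get(addr, -1)
        self.speculative_loads[addr] = max(prev, task)

    def store(self, task, addr):
        """Return the task to squash from, or None if no violation."""
        loader = self.speculative_loads.get(addr, -1)
        if loader > task:              # a later task loaded stale data
            return loader              # squash it and its successors
        return None

m = MemCheck()
m.load(2, 0x40)                        # task 2 speculatively loads 0x40
print(m.store(1, 0x40))                # earlier task 1 stores: squash 2
print(m.store(3, 0x40))                # later store: None (no violation)
```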

18
Outline
  • Multiscalar basics
  • Tasks
  • Multiscalars in-depth
  • Distribution of cycles
  • Comparison to other paradigms
  • Performance
  • Conclusion

19
Multiscalar Programs
  • Code for the tasks
  • Small changes to the existing ISA
  • add specification of tasks
  • no major overhaul
  • Structure of the CFG and tasks
  • Communication between tasks

20
Control Flow Graph Structure
  • Successors
  • Task descriptor
  • Producing and consuming values
  • Forward register information on the last update
  • The compiler can mark instructions as "operate
    and forward"
  • Stopping conditions
  • Special condition, evaluate conditions, complete
  • All of these can be viewed as tag bits
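A task descriptor holding the items above might be encoded as follows. The field names here are illustrative assumptions; only the contents (successors, create mask, stopping conditions as tag bits) come from the slide.

```python
from dataclasses import dataclass

# Hypothetical encoding of a task descriptor; field names are
# illustrative, contents follow the slide.
@dataclass
class TaskDescriptor:
    first_pc: int          # address of the task's first instruction
    successors: list       # possible next tasks (CFG targets)
    create_mask: int       # bitmask of registers this task may write
    stop_bits: int         # tag bits encoding the stopping conditions

t = TaskDescriptor(first_pc=0x1000,
                   successors=[0x1040, 0x1080],
                   create_mask=(1 << 17) | (1 << 20),
                   stop_bits=0b01)
print(hex(t.create_mask))  # 0x120000
```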

21
Multiscalar Hardware
  • Walks through the CFG
  • Assigns tasks to processing units
  • Executes tasks in sequential order
  • The sequencer fetches the task descriptors
  • Using the address of the first instruction
  • Specifying the create masks
  • Constructing the accum mask
  • Using the task descriptor to predict the
    successor

22
Multiscalar Hardware
  • Data banks
  • Updates to the cache are not speculative
  • Use of the Address Resolution Buffer (ARB)
  • Detects violations of dependencies
  • Initiates corrective actions
  • If it runs out of space, squash tasks
  • The head of the queue is non-speculative, so it
    doesn't use the ARB
  • Can stall rather than squash

23
Multiscalar Hardware
  • Remember the earlier architectural picture?

24
Multiscalar Hardware
  • It's not the only possible architecture
  • A possible design with shared functional units
  • A possible design with the ARB and data cache on
    the same side as the processing units
  • Scaling the interconnect is non-trivial
  • Glossed over in the paper

25
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

26
Distribution of Cycles
  • Wasted cycles come in three kinds
  • Non-useful computation (work that gets squashed)
  • No computation (the unit is waiting)
  • No assigned task (the unit remains idle)

27
Distribution of Cycles
  • Non-useful computation cycles
  • Determine useless computation early
  • Validate predictions early
  • Check if the next task is predicted correctly
  • E.g., test for loop exit at the start of the loop
  • Tasks violating sequentiality are squashed
  • To avoid this, try to synchronize memory
    communication with register communication
  • Could delay the load for a number of cycles
  • Can use signal-wait synchronization

28
Distribution of Cycles
  • Contrast with "no assigned task"
  • No computation cycles arise from
  • Dependencies within the same task
  • Dependencies between tasks (earlier/later)
  • Load balancing

29
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

30
Comparison to Other Paradigms
  • Branch prediction
  • The sequencer only needs to predict branches
    across tasks
  • Wide instruction window
  • A superscalar checks every window entry to see
    which is ready for issue; in a multiscalar,
    relatively few instructions are inspected

31
Comparison to Other Paradigms
  • Issue logic
  • Superscalar processors have O(n²) issue logic
  • Multiscalar logic is distributed
  • Each processing unit issues instructions
    independently
  • Loads and stores
  • Normally sequence numbers are needed to manage
    the buffers
  • In a multiscalar, the loads and stores are
    independent

32
Comparison to Other Paradigms
  • Superscalar processors need to discover the CFG
    as they decode branches
  • A multiscalar only requires the compiler to split
    the code into tasks
  • Multiprocessors require all dependences to be
    known or conservatively provided for
  • Only code the compiler can prove independent can
    be executed in parallel

33
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

34
Performance
  • Simulated
  • 5-stage pipeline
  • Functional unit latency

35
Performance
  • Memory
  • Non-blocking loads and stores
  • 10-cycle latency for the first 4 words
  • 1 cycle for each additional 4 words
  • Instruction cache: 1 cycle for 4 words
  • 10+3 cycles for a miss
  • Data cache: 1 word per cycle per multiscalar unit
  • 10+3 cycles (including bus contention) for a miss
  • 1024-entry cache of task descriptors

36
Performance
  • 12.2% increase in instruction count on average

37
Performance In-Order
38
Performance Out-of-Order
39
Performance Summary
  • Most of the benchmarks achieve speedup
  • E.g., an average of 1.924 on a 1-way, in-order,
    4-unit multiscalar
  • Worst case: 0.86 speedup (a slowdown)
  • Many squashes from misprediction and memory
    ordering in gcc and xlisp
  • Leads to almost sequential execution
  • Keep in mind the 12.2% increase in instruction
    count
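Putting the two numbers together (a back-of-the-envelope check, not a figure from the paper): a 1.924 cycle-count speedup achieved while executing 12.2% more instructions implies the effective instructions-per-cycle rose by roughly 1.924 × 1.122, about 2.16x.

```python
# Back-of-the-envelope: cycle speedup vs. instruction-count overhead.
speedup = 1.924        # average speedup, 1-way in-order 4-unit multiscalar
ic_overhead = 1.122    # 12.2% more dynamic instructions than scalar code
ipc_gain = speedup * ic_overhead   # effective throughput improvement
print(round(ipc_gain, 2))          # ~2.16
```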

40
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

41
Conclusion
  • Divide the CFG into tasks
  • Assign tasks to processing units
  • Walk the CFG in task-size steps
  • Shows performance gains