1
Multiscalar Processors
  • Presented by Matthew Misler
  • Gurindar S. Sohi, Scott E. Breach, and
    T. N. Vijaykumar
  • University of Wisconsin-Madison
  • ISCA '95

2
Scalar Processors
Instruction Queue
Execution Unit
addu  $20, $20, 16
ld    $23, SYMVAL-16($20)
move  $17, $21
beq   $17, $0, SKIPINNER
ld    $8, LELE($17)
3
SuperScalar Processors
Instruction Queue
Execution Unit
addu  $20, $20, 16
ld    $23, SYMVAL-16($20)
move  $17, $21
beq   $17, $0, SKIPINNER
ld    $8, LELE($17)
4
Fetch-Execute
  • The paradigm has been around for about 60 years
  • Superscalar processors execute instructions
    out of order
  • Sometimes the re-ordering is done in hardware
  • Sometimes in software
  • Sometimes both
  • Only a partial ordering is enforced

5
Control Flow Graphs
  • Segments are split on control dependencies
    (conditional branches)

6
Sequential Walk
  • Walk through the CFG, extracting enough
    parallelism along the way
  • Use speculative execution and branch prediction
    to raise the level of parallelism
  • Sequential semantics must be preserved
  • Instructions can still execute out of order, but
    must commit in order

7
Multiscalars and Tasks
  • CFG broken down into tasks
  • Multiscalars step through at the task level
  • No inspection of instructions within a task
  • Each Task is assigned to one processing unit
  • Multiple tasks can execute in parallel
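The task-level stepping described above can be sketched in a few lines. This is a hypothetical toy, not the paper's hardware: predicted tasks are handed out round-robin to a ring of four processing units.

```python
# Hypothetical sketch: a sequencer walks the CFG one task at a time and
# assigns each task to the next processing unit in a unidirectional ring.
NUM_UNITS = 4

def assign_tasks(task_sequence):
    """Return (unit, task) pairs, stepping the ring round-robin."""
    return [(i % NUM_UNITS, task) for i, task in enumerate(task_sequence)]

# Tasks predicted along one walk of the CFG (letters as in the examples).
print(assign_tasks(["A", "B", "C", "B", "B"]))
# [(0, 'A'), (1, 'B'), (2, 'C'), (3, 'B'), (0, 'B')]
```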

8
Multiscalar Microarchitecture
  • Sequencer
  • Queue of processing units
  • Unidirectional ring
  • Each has an instruction cache, processing
    element, register file
  • Interconnect
  • Data Bank
  • Each has address resolution buffer, data cache

9
Multiscalar Microarchitecture
10
Outline
  • Multiscalar Microarchitecture
  • Tasks
  • Multiscalars in-depth
  • Distribution of cycles
  • Comparison to other paradigms
  • Performance
  • Conclusion

12
Tasks
  • The sequencer distributes a task to a processing
    unit
  • The unit fetches and executes the task until
    completion
  • The instruction window is bounded by
  • The first instruction in the earliest executing
    task
  • The last instruction in the latest executing task
  • So? Instruction windows can be huge

13
Tasks Example
(Figure: a CFG with tasks A-E and one unit's
 task-level walk through it: A B C B B C D)
14
Tasks Example
(Figure: the same CFG with tasks A-E, walked in
 parallel by three processing units, producing the
 traces ABCBBCD, ABBCD, and ABCBCDE)
15
Tasks
  • Hold true to sequential semantics inside each
    task
  • Enforce sequential order on the tasks overall
  • The circular queue takes care of this part
  • In the previous example
  • The head of the queue does ABCBBCD
  • The middle unit does ABBCD
  • The tail of the queue does ABCBCDE
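The circular queue's ordering role can be sketched as follows (an assumed simplification, not the hardware): units may finish out of order, but only the task at the head of the queue is allowed to retire.

```python
from collections import deque

# Illustrative sketch: tasks finish out of order, but only the task at
# the head of the circular queue may commit, so sequential order holds.
ring = deque([0, 1, 2, 3])     # task ids in sequential (assignment) order
finish_order = [2, 0, 1, 3]    # the order in which units actually finish
done, committed = set(), []
for tid in finish_order:
    done.add(tid)
    while ring and ring[0] in done:     # retire only from the head
        committed.append(ring.popleft())
print(committed)               # [0, 1, 2, 3]: sequential order preserved
```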

16
Tasks
  • Registers
  • Create mask
  • May produce values for a future task
  • Forward values down the ring
  • Accum mask
  • Union of the create masks of active tasks
  • Memory
  • If it's a known producer-consumer relationship,
    then synchronize on the loads and stores

17
Tasks
  • Memory (cont'd)
  • Unknown producer-consumer relationship
  • Conservative approach: wait
  • Aggressive approach: speculate
  • The conservative approach means sequential
    operation
  • The aggressive approach requires dynamic
    checking, squashing, and recovery
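The aggressive approach's dynamic check can be sketched like this. The `MemCheck` class and its methods are hypothetical stand-ins for the ARB's role: record speculative loads, and flag a violation when a sequentially earlier store arrives for an address a later task has already loaded.

```python
# Hypothetical sketch of the aggressive (speculative) approach: loads run
# ahead, and an ARB-like table detects when an earlier task's store
# arrives after a later task has already loaded the same address.
class MemCheck:
    def __init__(self):
        self.speculative_loads = {}    # addr -> latest task that loaded it

    def load(self, task, addr):
        prev = self.speculative_loads.get(addr, -1)
        self.speculative_loads[addr] = max(prev, task)

    def store(self, task, addr):
        """Return the task to squash from, or None if no violation."""
        loader = self.speculative_loads.get(addr, -1)
        if loader > task:              # a later task loaded stale data
            return loader              # squash it and its successors
        return None

m = MemCheck()
m.load(2, 0x40)                        # task 2 speculatively loads 0x40
print(m.store(1, 0x40))                # earlier task 1 stores: squash 2
print(m.store(3, 0x40))                # later store: None (no violation)
```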

18
Outline
  • Multiscalar basics
  • Tasks
  • Multiscalars in-depth
  • Distribution of cycles
  • Comparison to other paradigms
  • Performance
  • Conclusion

19
Multiscalar Programs
  • Code for the tasks
  • Small changes to the existing ISA
  • add specification of tasks
  • no major overhaul
  • Structure of the CFG and tasks
  • Communication between tasks

20
Control Flow Graph Structure
  • Successors
  • Task descriptor
  • Producing and consuming values
  • Forward register information on the last update
  • The compiler can mark instructions as "operate
    and forward"
  • Stopping conditions
  • Special condition, evaluate conditions, complete
  • All of these can be viewed as tag bits
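A task descriptor holding the items above might be encoded as follows. The field names here are illustrative assumptions; only the contents (successors, create mask, stopping conditions as tag bits) come from the slide.

```python
from dataclasses import dataclass

# Hypothetical encoding of a task descriptor; field names are
# illustrative, contents follow the slide.
@dataclass
class TaskDescriptor:
    first_pc: int          # address of the task's first instruction
    successors: list       # possible next tasks (CFG targets)
    create_mask: int       # bitmask of registers this task may write
    stop_bits: int         # tag bits encoding the stopping conditions

t = TaskDescriptor(first_pc=0x1000,
                   successors=[0x1040, 0x1080],
                   create_mask=(1 << 17) | (1 << 20),
                   stop_bits=0b01)
print(hex(t.create_mask))  # 0x120000
```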

21
Multiscalar Hardware
  • Walks through the CFG
  • Assigns tasks to processing units
  • Executes tasks in sequential order
  • The sequencer fetches the task descriptors
  • Using the address of the first instruction
  • Specifying the create masks
  • Constructing the accum mask
  • Using the task descriptor to predict the
    successor

22
Multiscalar Hardware
  • Data banks
  • Updates to the cache are not speculative
  • Use of the Address Resolution Buffer (ARB)
  • Detects violations of dependencies
  • Initiates corrective actions
  • If it runs out of space, squash tasks
  • The head of the queue is non-speculative, so it
    doesn't use the ARB
  • Can stall rather than squash

23
Multiscalar Hardware
  • Remember the earlier architectural picture?

24
Multiscalar Hardware
  • It's not the only possible architecture
  • A possible design with shared functional units
  • A possible design with the ARB and data cache on
    the same side as the processing units
  • Scaling the interconnect is non-trivial
  • Glossed over in the paper

25
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

26
Distribution of Cycles
  • Wasted cycles come in three kinds
  • Non-useful computation (work that gets squashed)
  • No computation (the unit is waiting)
  • No assigned task (the unit remains idle)

27
Distribution of Cycles
  • Non-useful computation cycles
  • Determine useless computation early
  • Validate predictions early
  • Check if the next task is predicted correctly
  • E.g., test for loop exit at the start of the loop
  • Tasks violating sequentiality are squashed
  • To avoid this, try to synchronize memory
    communication with register communication
  • Could delay the load for a number of cycles
  • Can use signal-wait synchronization

28
Distribution of Cycles
  • Contrast with "no assigned task"
  • No computation cycles arise from
  • Dependencies within the same task
  • Dependencies between tasks (earlier/later)
  • Load balancing

29
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

30
Comparison to Other Paradigms
  • Branch prediction
  • The sequencer only needs to predict branches
    across tasks
  • Wide instruction window
  • A superscalar checks every window entry to see
    which is ready for issue; in a multiscalar,
    relatively few instructions are inspected

31
Comparison to Other Paradigms
  • Issue logic
  • Superscalar processors have O(n²) issue logic
  • Multiscalar logic is distributed
  • Each processing unit issues instructions
    independently
  • Loads and stores
  • Normally sequence numbers are needed to manage
    the buffers
  • In a multiscalar, the loads and stores are
    independent

32
Comparison to Other Paradigms
  • Superscalar processors need to discover the CFG
    as they decode branches
  • A multiscalar only requires the compiler to split
    the code into tasks
  • Multiprocessors require all dependences to be
    known or conservatively provided for
  • Only code the compiler can prove independent can
    be executed in parallel

33
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

34
Performance
  • Simulated
  • 5-stage pipeline
  • Functional unit latency

35
Performance
  • Memory
  • Non-blocking loads and stores
  • 10-cycle latency for the first 4 words
  • 1 cycle for each additional 4 words
  • Instruction cache: 1 cycle for 4 words
  • 10+3 cycles for a miss
  • Data cache: 1 word per cycle per multiscalar unit
  • 10+3 cycles (including bus contention) for a miss
  • 1024-entry cache of task descriptors

36
Performance
  • 12.2% increase in instruction count on average

37
Performance In-Order
38
Performance Out-of-Order
39
Performance Summary
  • Most of the benchmarks achieve speedup
  • E.g., an average of 1.924 on a 1-way, in-order,
    4-unit multiscalar
  • Worst case: 0.86 speedup (a slowdown)
  • Many squashes from misprediction and memory
    ordering in gcc and xlisp
  • Leads to almost sequential execution
  • Keep in mind the 12.2% increase in instruction
    count
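Putting the two numbers together (a back-of-the-envelope check, not a figure from the paper): a 1.924 cycle-count speedup achieved while executing 12.2% more instructions implies the effective instructions-per-cycle rose by roughly 1.924 × 1.122, about 2.16x.

```python
# Back-of-the-envelope: cycle speedup vs. instruction-count overhead.
speedup = 1.924        # average speedup, 1-way in-order 4-unit multiscalar
ic_overhead = 1.122    # 12.2% more dynamic instructions than scalar code
ipc_gain = speedup * ic_overhead   # effective throughput improvement
print(round(ipc_gain, 2))          # ~2.16
```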

40
Outline
  • Multiscalar Basics
  • Tasks
  • Multiscalars In-Depth
  • Distribution of Cycles
  • Comparison to Other Paradigms
  • Performance
  • Conclusion

41
Conclusion
  • Divide the CFG into tasks
  • Assign tasks to processing units
  • Walk the CFG in task-size steps
  • Shows performance gains