Mescal Architecture Project: Forwarding

1
Mescal Architecture Project: Forwarding and Precise
Exceptions

EE244 Fall 2000
Sam Williams

2
Outline

Motivation
Problem statement
Prior work
Investigative approach
Results
Summary
Conclusions
Future Work

3
Motivation: Forwarding

Pipelining can improve performance by up to a
factor of the pipeline depth.
However, in a statically scheduled processor,
latency is the pipeline depth, thus instructions
must be scheduled at compile time, and in the
worst case, nops must be inserted.
Forwarding eliminates this by allowing
instructions to proceed as soon as their operand
values are ready.
So performance is improved, at the cost of
additional area, power and cycle time.

4
Motivation: Precise Exceptions

For a precise exception, we want all previous
instructions to have committed, and no later
instructions to have committed.
This allows program flow to restart after an
exception, e.g. after a TLB refill exception or
an external interrupt.
Furthermore, it aids debugging by allowing
precise stops.

5
Motivation: Design Methodology

Chip complexity is growing exponentially
Shorter design cycles in order to reach a target
market
Deep submicron (DSM) effects
On-chip heterogeneity
Can the first two be dealt with under the MESCAL
environment?

6
Problem statement

Can forwarding be easily implemented on a
component basis in the MESCAL framework?
Can this information be exported to the compiler
to allow better scheduling?
What about dynamic hazard detection and variable
latency functional units?
Can precise exceptions be implemented?

7
Prior Work

The forwarding part is a generalization of
forwarding and reservation stations (aka
Tomasulo's algorithm) with virtual register
renaming.
Precise exceptions are handled (implicitly)
through a deep pipeline, or through a reorder
buffer.

8
Architectural Components

For min/max cycle time, it's easy to distinguish
combinational logic from registers.
Then we just remove the registers and examine
each block of combinational logic.

For pipeline latencies, we must determine when
values are produced, and when they are used.
Removing all the functional units, and replacing
them with their known latencies, allows for
construction of a weighted DCG.
In this case, to capture the difference between
registers used within a functional unit and
those used as pipeline registers or reservation
stations, it is useful to move to a higher
abstraction: architectural components (e.g.
pipeline components, functional units,
memories, etc.)
Additionally, it is important to determine when
instructions are issued, and when they commit.
Furthermore, busses used for forwarding (backward
arcs) must be distinguished from carrying
(forward) arcs, which are used in a deep pipeline.
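The backward/forward distinction can be sketched in a few lines of Python. This is only an illustration: the function name `classify_arcs`, the station names, and the depths are hypothetical, and it assumes an arc counts as carrying exactly when it moves deeper into the pipeline.

```python
def classify_arcs(depth, connections):
    """Label each (src, dst) connection between architectural components.

    Assumption for this sketch: an arc that moves deeper into the
    pipeline is a carrying arc; any other arc (same stage or backward)
    is a forwarding arc.
    """
    graph = {name: [] for name in depth}
    for src, dst in connections:
        kind = "carrying" if depth[dst] > depth[src] else "forwarding"
        graph[src].append((dst, kind))
    return graph


# Hypothetical three-station fragment of a pipeline.
depth = {"RS_R": 1, "RS_E": 2, "RS_M": 3}
connections = [
    ("RS_R", "RS_E"),  # pipeline register path
    ("RS_E", "RS_M"),  # pipeline register path
    ("RS_E", "RS_E"),  # ALU result bypassed back to the ALU inputs
    ("RS_M", "RS_E"),  # memory result bypassed back to the ALU inputs
]
graph = classify_arcs(depth, connections)
```

The resulting adjacency lists are the kind of labelled graph a shortest-path latency analysis could walk.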

9
Investigative Approach

Implement the reservation station component, and
a tag array (physical to virtual mapping) for a
deep pipeline
Combine these components to create a framework
for precise interrupts
Define the algorithm that can be used with these
to extract forwarding information, which can be
passed to the compiler.

10
Forwarding and Hazards: Reservation Stations

Configurable reservation station with
- N operand value/tag pairs
- M output value/tag/valid trios
- K forwarding paths in the form of CDBs, sent to the N renamed (operand) registers
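A minimal behavioural sketch of such a station, in Python, may make the operand/tag/CDB relationship concrete. It is not the component implemented for this project: the class and method names are hypothetical, only the operand side is modelled, and each CDB is assumed to carry a (valid, tag, value) trio.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Operand:
    tag: Optional[int] = None    # tag of the producer being waited on
    value: Optional[int] = None  # captured value, None while pending


class ReservationStation:
    def __init__(self, n_operands: int):
        self.operands = [Operand() for _ in range(n_operands)]

    def issue(self, sources: List[Tuple[str, int]]) -> None:
        """Each source is ('value', v) if known, or ('tag', t) if pending."""
        for op, (kind, x) in zip(self.operands, sources):
            op.tag, op.value = (None, x) if kind == "value" else (x, None)

    def snoop(self, cdbs: List[Tuple[bool, int, int]]) -> None:
        """Capture values from the K forwarding paths whose tag matches."""
        for valid, tag, value in cdbs:
            if not valid:
                continue
            for op in self.operands:
                if op.value is None and op.tag == tag:
                    op.value = value

    def all_ready(self) -> bool:
        return all(op.value is not None for op in self.operands)


rs = ReservationStation(n_operands=2)
rs.issue([("value", 5), ("tag", 7)])          # second operand still pending
rs.snoop([(False, 7, 1), (True, 7, 42)])      # K = 2 buses, one is invalid
```

After the snoop the second operand holds the forwarded value, so the instruction in the station can proceed.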

11
Forwarding and Hazards: Renamed Register Table

Configurable table with
- N source operand addresses, e.g. the address to the RF
- M destination operand address/tag pairs, e.g. the address to the RF and the tag (renamed register)
- K forwarding paths in the form of CDBs, which are translated to RF addresses, data, and enables
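The translation from CDB traffic to register file writes can be sketched as below. Everything here is an assumption for illustration: the class name, the method names, the (valid, tag, data) bus format, and the policy of only enabling the write for the newest in-flight writer of a register.

```python
class RenamedRegisterTable:
    def __init__(self, num_regs):
        self.latest_tag = [None] * num_regs  # newest in-flight tag per RF address
        self.addr_of = {}                    # in-flight tag -> RF address

    def lookup_sources(self, addrs):
        """Return the pending tag for each source address (None = value is in the RF)."""
        return [self.latest_tag[a] for a in addrs]

    def allocate(self, dests):
        """Record (RF address, tag) pairs for newly issued destinations."""
        for addr, tag in dests:
            self.latest_tag[addr] = tag
            self.addr_of[tag] = addr

    def translate(self, cdbs):
        """Turn (valid, tag, data) buses into (RF address, data, enable) writes."""
        writes = []
        for valid, tag, data in cdbs:
            if not valid or tag not in self.addr_of:
                continue
            addr = self.addr_of.pop(tag)
            enable = self.latest_tag[addr] == tag  # assumed: drop stale writes
            if enable:
                self.latest_tag[addr] = None
            writes.append((addr, data, enable))
        return writes


table = RenamedRegisterTable(num_regs=4)
table.allocate([(2, 10)])   # register 2 renamed to tag 10
table.allocate([(2, 11)])   # a later writer of register 2 gets tag 11
stale = table.translate([(True, 10, 5)])
fresh = table.translate([(True, 11, 6), (False, 99, 0)])
```

Here the first result arrives with its enable low because a newer writer exists, and the second one is written to the register file.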

12
Director/Conductor Interaction with forwarding
components

Since some realistic elements will have variable
latency, scheduling cannot be fully determined
at compile time.
Additionally, these components can be used to
construct a dynamically scheduled processor.
Reservation stations must interpret stall /
occupied signals, and signal the
director/conductor.
Interrupts, which can't be predicted statically,
and exceptions can interrupt control flow.

13
Precise Exceptions

In an in-order pipelined machine this can be
easily realized, since results are only
committed at the end of the pipeline: determine
there whether the instruction caused an
exception, update the PC, and flush the pipeline.
For out-of-order (OO) machines, this is more
complicated. The standard technique is to
implement a reorder buffer and take the exception
when the instruction commits.
Interrupts are simply taken when an instruction
commits.
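The reorder buffer technique can be illustrated with a small Python model. This is a sketch under assumptions, not the design from these slides: the class name, the dictionary entry layout, and the `commit` return value are all hypothetical.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction on the left

    def issue(self, pc):
        entry = {"pc": pc, "done": False, "exception": None}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Commit finished instructions in program order.

        Returns (committed_pcs, exception_pc). An exception is only taken
        when the faulting instruction reaches the head, at which point all
        younger instructions are flushed.
        """
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["exception"] is not None:
                self.entries.clear()  # nothing younger may commit
                return committed, head["pc"]
            committed.append(head["pc"])
        return committed, None


rob = ReorderBuffer()
a, b, c = rob.issue(0), rob.issue(4), rob.issue(8)
c["done"] = True                       # youngest finishes first
b["done"], b["exception"] = True, "tlb_refill"
early = rob.commit()                   # head (pc 0) not finished yet
a["done"] = True
late = rob.commit()
```

In this run nothing commits until the oldest instruction finishes; then pc 0 commits, the exception at pc 4 is taken, and pc 8 is discarded even though it had already executed.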

14
Example of MIPS with full forwarding components

For an arithmetic operation the ALU needs both
operands; for LD/SD it needs only one.
The memory stage needs both operands.
This allows an SD to proceed past the E stage
even if the store data isn't ready yet.
RS_D and RS_M have no CDB inputs.
RS_R and RS_E each have two.
For precise exceptions, we should send a tag
containing both the virtual and physical
register.
Immediate operands are placed in RS_D.

15
Example of in-order VLIW with forwarding components

3 sub-instructions per packed instruction
Each sub-instruction is synchronized with the
other sub-instructions in the packed instruction.
1 stage removed, forcing computation of the EA
(effective address) by a previous instruction
Any of the 3 outputs can be forwarded to any of
the 6 inputs in RS_R

16
Example of out-of-order VLIW with forwarding
components

3 sub-instructions per packed instruction
Any of the 3 outputs can be forwarded to either
of the 2 operands of RS_R
Out-of-order execution and commit.
Need multi-line / multi-issue reservation
stations (the version I implemented was single
line, single issue; multi-line is not too
difficult to do, and could also be implemented
with existing components)
Need a reorder buffer for in-order commit and
thereby precise exceptions.

17
Example of out-of-order VLIW with forwarding
components

The reorder buffer can be constructed out of
3-operand, no-output reservation stations.
On issue, the first entry is written with
op/pc/qs
Only the final entry requires all operands. This
allows instructions to proceed to the last stage
without outputs being available. They then stall
in the final entry for outputs. This helps to
reduce latency on exceptions and promote out of
order execution.
All outputs must be forwarded to all other
reservation stations since one of their operands
could have finished execution, but would be
waiting to commit in the reorder buffer.
All execution units must forward data to every
entry in the reorder buffer.
This could be redesigned into a circular queue
reorder buffer by changing which entry
instructions are issued to and committed from.

18
Deep Pipeline: Motivating Example

In some deep pipelines, we will either need to
stall or have statically scheduled code.
Regardless of the forwarding we put into place,
there must be 3 stalls or fills if the output of
a MUL is used by the ADD.
So we need to be able to extract latencies from
the structure and export them to either the
Compiler (for code generation) or the Conductor
(so that it knows when it should stall), e.g.
generating control logic from the data path.
The solution is to build a graph in which each
edge is marked either as forwarding (green) or
carrying (red).
Forwarding paths travel backwards in the
pipeline, thus our shortest path algorithm must
pass through at least one forwarding edge to be
valid. E.g. for ADD used by MEM there is a
straight path of length 3, but it's not valid,
since it's the same instruction at that time.

19
Algorithm: Shortest Path in DCG

stalls = depth(s) + dist(s,d) - depth(d) - 1
Where dist(s,d) is the length of the shortest
valid path in the graph from the source s to the
destination d, and depth(n) is the depth in the
pipeline of reservation station n.
A path is valid if it goes through exactly one
forwarding path.
We can define a forwarding path from a write back
stage to the register file access stage.
It's first called with dist(s,d,0,false)
dist(node n, node d, int cur, int valid)
  if((n==d)&&(valid==1))
    return(cur)
  min = pipeline_depth
  if(valid>1)return(min)
  if(cur>min)return(min)
  foreach e (@n->edges)
    temp = dist(e->node, d, cur+1, valid +
      (e->type==FORWARDING))
    if(temp<min)min=temp
  return(min)
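The recursion on this slide can be rendered in Python roughly as follows. The operators are an interpretation of the slide text (equality tests, an increment of cur, and a running count of forwarding edges), and the example graphs, station names, and depths are hypothetical, chosen to resemble a short in-order pipeline.

```python
FORWARDING = "forwarding"
CARRYING = "carrying"


def dist(graph, n, d, pipeline_depth, cur=0, valid=0):
    """Length of the shortest path n -> d using exactly one forwarding edge.

    Returns pipeline_depth when no such path exists within the bound.
    """
    if n == d and valid == 1:
        return cur
    best = pipeline_depth
    if valid > 1 or cur > best:
        return best  # prune: second forwarding edge, or path too long
    for node, kind in graph[n]:
        temp = dist(graph, node, d, pipeline_depth,
                    cur + 1, valid + (kind == FORWARDING))
        if temp < best:
            best = temp
    return best


def stalls(graph, depth, s, d, pipeline_depth):
    return depth[s] + dist(graph, s, d, pipeline_depth) - depth[d] - 1


# Hypothetical stations: register read, execute, memory, write back.
depth = {"R": 1, "E": 2, "M": 3, "W": 4}
full = {
    "R": [("E", CARRYING)],
    "E": [("M", CARRYING), ("E", FORWARDING)],
    "M": [("W", CARRYING), ("E", FORWARDING)],
    "W": [("R", FORWARDING)],
}
writeback_only = {
    "R": [("E", CARRYING)],
    "E": [("M", CARRYING)],
    "M": [("W", CARRYING)],
    "W": [("R", FORWARDING)],
}

alu_to_alu = stalls(full, depth, "E", "E", 5)
load_to_alu = stalls(full, depth, "M", "E", 5)
no_bypass = stalls(writeback_only, depth, "E", "E", 5)
```

With these assumed graphs, the fully forwarded pipeline needs no stall between dependent ALU operations and one stall after a load, while the graph with only the write back to register read path needs three.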

20
Example

Tables: Dist(s,d) and Depth(s)-Depth(d)-1
(s = horizontal, d = vertical)
stalls(s,d) = (Dist(s,d)) + (Depth(s)-Depth(d)-1)

21
Results

If designs are based on architectural components,
it is easy to determine pipeline structure
The algorithm in its naïve form is slow, but
pipeline depth will typically be less than 10,
and it only has to be run once (the results
become a table lookup)
The design is based on a combination of
Tomasulo's algorithm and virtual register
renaming.
I implemented a configurable (number of operands,
outputs, and forwarding paths only) reservation
station
I also implemented a Tag array which keeps track
of the physical to virtual register mapping
I did not simulate any designs

22
Summary

If designs are based on architectural components,
it is easy to determine pipeline structure, and
thus export it to the compiler
The main architectural component I implemented
was a generic reservation station which can also
be viewed as a pipeline register with forwarding.
By combining these reservation stations, deep
pipelines and reorder buffers can be realized.
I also implemented a tag array, which maps
virtual to physical registers

23
Conclusion and Future Work

Design with architectural components in the
MESCAL framework allows the architect to quickly
explore, simulate, and analyze various
architectures, and their effect on performance
and code density.
This could also be used by a genetic algorithm,
just as branch components are used to generate
accurate branch predictors.
The reservation station needs to be adapted to be
multi-line, feeding multiple functional units, to
allow for true OOE superscalar processor design.
There are currently no tools to extract or
synthesize these components to obtain timing or
area numbers.
A future design methodology might involve writing
code generators which could take the
configuration of each component and generate a
Verilog model, which could be synthesized, or
even go directly to a netlist.

24
Next Step?

Current stdcell libraries include layout,
schematic, SPICE parameters, boolean equivalent,
etc.
A higher level library could include a simulation
model and a Verilog code generator.
Thus the ASIC design cycle would start with an
architectural model, which could be used to
generate Verilog code, and from there go through
a standard ASIC flow.
Can this be used to functionally verify the
correctness of a micro-architecture against an
ISA?

25
References

Tomasulo, R. M. 1967. An efficient algorithm
for exploiting multiple arithmetic units, IBM
Journal of Research and Development 11:1 (January)
Patterson, D. A. and J. L. Hennessy 1996.
Computer Architecture: A Quantitative Approach,
Morgan Kaufmann, San Francisco
