Title: Mescal Architecture Project: Forwarding
1 Mescal Architecture Project: Forwarding and Precise Exceptions
- EE244 Fall 2000
- Sam Williams
2 Outline
- Motivation
- Problem statement
- Prior work
- Investigative approach
- Results
- Summary
- Conclusions
- Future Work
3 Motivation: Forwarding
- Pipelining can improve performance by up to a factor of the pipeline depth.
- However, in a statically scheduled processor, result latency is the pipeline depth, thus instructions must be scheduled at compile time, and in the worst case, nops must be inserted.
- Forwarding eliminates this by allowing instructions to proceed as soon as the operand values are ready.
- So performance is improved, at the cost of additional area, power and cycle time.
4 Motivation: Precise Exceptions
- For a precise exception, we want all previous instructions to have committed, and no later instructions.
- This allows program flow to restart after an exception, e.g. after a TLB refill exception or an external interrupt.
- Furthermore, it aids debugging by allowing precise stops.
5 Motivation: Design Methodology
- Chip complexity is growing exponentially
- Shorter design cycle in order to reach a target market
- DSM effects
- On-chip heterogeneity
- Can the first two be dealt with under the MESCAL
environment?
6 Problem statement
- Can forwarding be easily implemented on a component basis in the MESCAL framework?
- Can this information be exported to the compiler to allow better scheduling?
- What about dynamic hazard detection and variable-latency functional units?
- Can precise exceptions be implemented?
7 Prior Work
- The forwarding part is a generalization of forwarding and reservation stations (aka Tomasulo's algorithm) with virtual register renaming
- Precise exceptions are handled (implicitly) through a deep pipeline, or through a reorder buffer.
8 Architectural Components
- For min/max cycle time, it's easy to distinguish combinational logic from registers
- Then we just remove the registers and examine each block of combinational logic.
- For pipeline latencies, we must determine when values are produced, and when they are used.
- Removing all the functional units, and replacing them with their known latencies, allows for construction of a weighted DCG.
- In this case, to capture the difference between registers used within a functional unit and those used as pipeline registers or reservation stations, it is useful to move to a higher abstraction: architectural components (e.g. pipeline components, functional units, memories, etc.)
- Additionally, it is important to determine when instructions are issued, and when they commit
- Furthermore, busses used for forwarding (backward arcs) must be distinguished from carrying (forward) arcs, which are used in a deep pipeline.
9 Investigative Approach
- Implement the reservation station component, and a tag array (physical to virtual mapping) for a deep pipeline
- Combine these components to create a framework for precise interrupts
- Define the algorithm that can be used with these to extract forwarding information, which can be passed to the compiler.
10 Forwarding and Hazards: Reservation Stations
- Configurable reservation station with
- - N operand value/tag pairs
- - M output value/tag/valid trios
- - K forwarding paths in the form of common data buses (CDBs), sent to the N renamed (operand) registers
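The configuration above can be pictured with a minimal Python sketch. It is not the component implemented for this project: the class and method names (`ReservationStation`, `load`, `snoop`, `ready`) are hypothetical, tags are assumed to be small integers, and the M outputs are left out for brevity. It only shows how N operand value/tag pairs might capture values from K buses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    tag: Optional[int] = None    # renamed register this slot is waiting on
    value: Optional[int] = None  # None until the value has been captured


class ReservationStation:
    """Hypothetical station with N operand slots and K bus inputs."""

    def __init__(self, n_operands: int, n_cdbs: int):
        self.operands = [Operand() for _ in range(n_operands)]
        self.n_cdbs = n_cdbs

    def load(self, slots):
        """Fill the station on issue; each slot is (tag, value), value None if pending."""
        assert len(slots) == len(self.operands)
        self.operands = [Operand(tag, value) for tag, value in slots]

    def snoop(self, cdbs):
        """Capture matching values; each bus is None (idle) or a (tag, value) pair."""
        assert len(cdbs) == self.n_cdbs
        for bus in cdbs:
            if bus is None:
                continue
            tag, value = bus
            for op in self.operands:
                if op.value is None and op.tag == tag:
                    op.value = value

    def ready(self) -> bool:
        """True once every operand value is present."""
        return all(op.value is not None for op in self.operands)


rs = ReservationStation(n_operands=2, n_cdbs=2)
rs.load([(None, 5), (7, None)])   # second operand waits on tag 7
waiting = rs.ready()
rs.snoop([None, (7, 42)])         # tag 7 arrives on the second bus
print(waiting, rs.ready(), [op.value for op in rs.operands])
```

The instruction held in the station can proceed as soon as `ready()` is true, which is the behaviour the forwarding motivation asks for.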
11 Forwarding and Hazards: Renamed Register Table
- Configurable Table with
- - N source operand addresses, e.g. the address to the RF
- - M destination operand address/tag pairs, e.g. the address to the RF, and the tag (renamed register)
- - K forwarding paths in the form of CDBs, which are translated to RF addresses, data, and enables.
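A rough Python sketch of such a table follows. It is an illustration under assumptions, not the implemented tag array: the names (`RenameTable`, `read_source`, `rename_dest`, `translate`) are hypothetical, tags are assumed unique among in-flight producers, and a bus is modelled as either `None` or a `(tag, data)` pair.

```python
class RenameTable:
    """Hypothetical table mapping RF addresses to tags of in-flight producers."""

    def __init__(self, n_regs: int):
        self.pending = [None] * n_regs  # RF address -> tag, or None if RF is current

    def read_source(self, addr, regfile):
        """Return (tag, None) if the register is renamed, else (None, value)."""
        tag = self.pending[addr]
        if tag is None:
            return (None, regfile[addr])
        return (tag, None)

    def rename_dest(self, addr, tag):
        """Record that `tag` will produce the next value of register `addr`."""
        self.pending[addr] = tag

    def translate(self, cdbs):
        """Turn bus broadcasts into (RF address, data, enable) write requests."""
        writes = []
        for bus in cdbs:
            if bus is None:
                continue
            tag, data = bus
            if tag in self.pending:
                addr = self.pending.index(tag)
                self.pending[addr] = None          # RF will hold the latest value
                writes.append((addr, data, True))
            else:
                # A newer producer has renamed this register: suppress the write.
                writes.append((None, data, False))
        return writes


rf = [10, 20, 30, 40]
table = RenameTable(n_regs=4)
table.rename_dest(2, tag=7)
print(table.read_source(2, rf), table.read_source(1, rf))
print(table.translate([(7, 99), None]))
```

The enable is only asserted when the broadcasting tag is still the most recent producer of some register, so an older result cannot overwrite a newer renaming.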
12 Director/Conductor Interaction with forwarding components
- Since some realistic elements will have variable latency, scheduling cannot always be determined at compile time.
- Additionally, these components can be used to construct a dynamically scheduled processor.
- Reservation Stations must interpret stall / occupied signals, and signal the director/conductor
- Interrupts, which can't be predicted statically, and exceptions can interrupt control flow
13 Precise Exceptions
- In an in-order pipelined machine this is easily realized, since results are only committed at the end of the pipeline: determine there whether the instruction caused an exception, update the PC, and flush the pipeline.
- For out-of-order (OO) machines, this is more complicated. The standard technique is to implement a reorder buffer and take the exception when the instruction commits.
- For interrupts, just take them when an instruction commits.
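The reorder-buffer technique can be sketched in a few lines of Python. This is a generic illustration, not the design from this project; the names (`ReorderBuffer`, `issue`, `complete`, `commit`) and the use of a PC as the instruction identifier are assumptions.

```python
from collections import deque


class ReorderBuffer:
    """Hypothetical reorder buffer: out-of-order completion, in-order commit."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def issue(self, pc):
        """Allocate an entry in program order."""
        entry = {"pc": pc, "done": False, "exc": None}
        self.entries.append(entry)
        return entry

    @staticmethod
    def complete(entry, exc=None):
        """Mark an entry finished, optionally recording an exception."""
        entry["done"] = True
        entry["exc"] = exc

    def commit(self):
        """Commit finished instructions in order; stop and flush at an exception."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["exc"] is not None:
                self.entries.clear()  # discard all younger instructions
                return committed, (head["pc"], head["exc"])
            committed.append(head["pc"])
        return committed, None


rob = ReorderBuffer()
a, b, c = rob.issue(0), rob.issue(4), rob.issue(8)
rob.complete(c)                  # youngest finishes first
rob.complete(b, exc="tlb_miss")  # middle instruction faults
print(rob.commit())              # nothing commits: the oldest is not done
rob.complete(a)
print(rob.commit())              # commits pc 0, then reports the fault at pc 4
```

Because the fault is only acted on when the faulting instruction reaches the head, every older instruction has committed and no younger one has, which is the definition of a precise exception used earlier.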
14 Example of MIPS with full forwarding components
- For an arithmetic operation, the ALU needs both operands, and only one for LD/SD.
- The Memory Stage needs both operands
- This allows an SD to proceed past the E stage even if the store data isn't ready yet.
- RS_D, RS_M have no CDB inputs
- RS_R, RS_E each have two
- For precise exceptions, we should send a tag containing both the virtual and physical register.
- Immediate operands are placed in RS_D
15 Example of in-order VLIW with forwarding components
- 3 sub-instructions per packed instruction
- Each sub-instruction is synchronized with the other sub-instructions in the packed instruction.
- 1 stage removed, forcing computation of the EA (effective address) by a previous instruction
- Any of the 3 outputs can be forwarded to any of the 6 inputs in RS_R
16 Example of out-of-order VLIW with forwarding components
- 3 sub-instructions per packed instruction
- Any of the 3 outputs can be forwarded to either of the 2 operands of RS_R
- Out-of-order execution and commit.
- Need multi-line / multi-issue reservation stations (the version I implemented was single-line, single-issue; extending it is not too difficult, and it could also be built from existing components)
- Need a reorder buffer for in-order commit and thereby precise exceptions.
17 Example of out-of-order VLIW with forwarding components
- The reorder buffer can be constructed out of 3-operand, no-output reservation stations.
- On issue, the first entry is written with op/pc/qs
- Only the final entry requires all operands. This allows instructions to proceed to the last stage without outputs being available. They then stall in the final entry for outputs. This helps to reduce latency on exceptions and promote out-of-order execution.
- All outputs must be forwarded to all other reservation stations, since one of their operands could have finished execution but still be waiting to commit in the reorder buffer.
- All execution units must forward data to every entry in the reorder buffer.
- This could be redesigned into a circular queue reorder buffer by changing which entry instructions are issued to and committed from.
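The circular-queue variant mentioned in the last bullet might look like the following Python sketch. It is an assumption-laden illustration rather than a design from this project: `CircularROB`, `issue`, `write_result` and `commit` are hypothetical names, and each entry holds a single result instead of the three operands described above.

```python
class CircularROB:
    """Hypothetical fixed-size reorder buffer using head/tail indices."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.head = 0   # oldest entry, the next one to commit
        self.tail = 0   # next free entry, written on issue
        self.count = 0

    def issue(self, op, pc):
        """Write a new entry at the tail; return its index, or None if full."""
        if self.count == len(self.slots):
            return None  # buffer full: issue must stall
        idx = self.tail
        self.slots[idx] = {"op": op, "pc": pc, "result": None}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def write_result(self, idx, value):
        """An execution unit forwards its result to the entry it was issued to."""
        self.slots[idx]["result"] = value

    def commit(self):
        """Retire the head entry if its result has arrived; otherwise stall."""
        if self.count == 0 or self.slots[self.head]["result"] is None:
            return None
        entry = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return entry


rob = CircularROB(size=2)
mul = rob.issue("mul", pc=0)
add = rob.issue("add", pc=4)
rob.write_result(add, 5)       # the younger instruction finishes first
print(rob.commit())            # None: the head (mul) is still executing
rob.write_result(mul, 6)
print(rob.commit()["pc"], rob.commit()["pc"])
```

Entries never move: only the issue and commit indices advance, which is the change of "which entry instructions are issued to and committed from" that the bullet describes.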
18 Deep Pipeline Motivating Example
- In some deep pipelines, we will either need to stall or have statically scheduled code
- Regardless of the forwarding we put into place, there must be 3 stalls or fills if the output of a MUL is used by the ADD
- So we need to be able to extract latencies from the structure and export them to either the Compiler (for code generation) or the Conductor (so that it knows when it should stall), i.e. generating control logic from the data path.
- The solution is to build a graph in which each edge is marked as either forwarding (green) or carrying (red)
- Forwarding paths travel backwards in the pipeline, thus our shortest path algorithm must pass through at least one forwarding edge to be valid, e.g. ADD used by MEM: there is a straight path of length 3, but it's not valid, since it's the same instruction at that time.
19 Algorithm: Shortest Path in DCG
- stalls = depth(s) + dist(s,d) - depth(d) - 1
- Where dist(s,d) is the length of the shortest valid path in the graph from the source s to the destination d, and depth(n) is the depth in the pipeline of reservation station n.
- A path is valid if it goes through exactly one forwarding path.
- We can define a forwarding path from a write back stage to the register file access stage
- It is first called with dist(s,d,0,false)

    dist(node n, node d, int cur, int valid){
      // reached the destination through exactly one forwarding edge
      if((n==d)&&(valid==1))
        return(cur);

      min = pipeline_depth;
      if(valid>1)return(min);   // more than one forwarding edge: invalid
      if(cur>min)return(min);   // path already longer than the bound
      foreach e (@n->edges){
        temp=dist(e->node, d, cur+1, valid + (e->type==FORWARDING));
        if(temp<min)min=temp;
      }
      return(min);
    }
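As a runnable illustration of the search and the stall formula, here is a Python sketch. The four-station pipeline (R, E, M, W), its depths and its edges are assumptions made up for this example, as are the names `Node`, `connect`, `build` and the `limit` parameter, which stands in for the pipeline-depth bound and is set loosely here so that a real path is not confused with the cap.

```python
CARRYING, FORWARDING = "carrying", "forwarding"


class Node:
    """A reservation station at a given pipeline depth."""

    def __init__(self, name, depth):
        self.name, self.depth, self.edges = name, depth, []


def connect(src, dst, kind):
    src.edges.append((kind, dst))


def dist(n, d, limit, cur=0, valid=0):
    """Length of the shortest path n -> d using exactly one forwarding edge."""
    if n is d and valid == 1:
        return cur
    best = limit
    if valid > 1 or cur > best:
        return best
    for kind, nxt in n.edges:
        t = dist(nxt, d, limit, cur + 1, valid + (kind == FORWARDING))
        best = min(best, t)
    return best


def stalls(s, d, limit):
    return s.depth + dist(s, d, limit) - d.depth - 1


def build(full_forwarding):
    """Hypothetical pipeline: register read, execute, memory, write back."""
    r, e, m, w = (Node(name, i) for i, name in enumerate("REMW"))
    connect(r, e, CARRYING)
    connect(e, m, CARRYING)
    connect(m, w, CARRYING)
    connect(w, r, FORWARDING)      # write back to register file access
    if full_forwarding:
        connect(e, e, FORWARDING)  # ALU result back to the ALU input
        connect(m, e, FORWARDING)  # memory result back to the ALU input
    return r, e, m, w


LIMIT = 8
_, e, m, _ = build(full_forwarding=True)
_, e2, _, _ = build(full_forwarding=False)
print(stalls(e, e, LIMIT), stalls(m, e, LIMIT), stalls(e2, e2, LIMIT))
```

Under these assumptions the numbers match the familiar cases: no stall for back-to-back ALU operations, one stall for a load followed by a dependent ALU operation, and three stalls when the only path is through the register file.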
20 Example
- Dist(s,d), Depth(s) - Depth(d) - 1
- s = horizontal, d = vertical
- stalls(s,d) = Dist(s,d) + (Depth(s) - Depth(d) - 1)
21 Results
- If designs are based on architectural components, it is easy to determine pipeline structure
- The algorithm in its naïve form is slow, but pipeline depth will typically be less than 10, and it only has to be run once (the results become a table lookup)
- The design is based on a combination of Tomasulo's algorithm and virtual register renaming.
- I implemented a configurable (number of operands, outputs, and forwarding paths only) reservation station
- I also implemented a tag array which keeps track of the physical to virtual register mapping
- I did not simulate any designs
22 Summary
- If designs are based on architectural components, it is easy to determine pipeline structure, and thus export it to the compiler
- The main architectural component I implemented was a generic reservation station, which can also be viewed as a pipeline register with forwarding.
- By combining these reservation stations, deep pipelines and reorder buffers can be realized.
- I also implemented a tag array, which maps virtual to physical registers
23 Conclusion and Future Work
- Design with architectural components in the MESCAL framework allows the architect to quickly explore, simulate, and analyze various architectures, and their effect on performance and code density.
- This could also be used by a genetic algorithm, just as branch components are used to generate accurate branch predictors.
- The reservation station needs to be adapted to be multi-line, feeding multiple functional units, to allow for true out-of-order execution (OOE) superscalar processor design.
- There are currently no tools to extract or synthesize these components to obtain timing or area numbers.
- A future design methodology might involve writing code generators which could take the configuration of each component and generate a Verilog model, which could be synthesized, or even go directly to a netlist.
24 Next Step?
- Current stdcell libraries include layout, schematic, SPICE parameters, boolean equivalent, etc.
- A higher level library could include a simulation model, and a Verilog code generator
- Thus the ASIC design cycle would start with an architectural model, which could be used to generate Verilog code, and from there go through a standard ASIC flow.
- Can this be used to functionally verify the correctness of a micro-architecture against an ISA?
25 References
- Tomasulo, R. M. 1967. An efficient algorithm for exploiting multiple arithmetic units, IBM J. Research and Development 11:1 (January), 25-33
- Patterson, D. A. and J. L. Hennessy 1996. Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco