Title: Power-Performance Evaluation of Instruction Level Reuse in a Superscalar Microprocessor
1. Power-Performance Evaluation of Instruction Level Reuse in a Superscalar Microprocessor
- G. Surendra
- CAD Lab, SERC, IISc
- Research advisor: S.K. Nandy
2. Outline
- Motivation
- Properties of programs
  - Value locality
  - Instruction repetition
- Exploiting value locality and instruction repetition
  - Instruction reuse (IR) (or value reuse or computation reuse)
  - Basics, working, types/granularity of IR
- Baseline results
- Thesis work
  - Using instruction criticality information while exploiting IR
  - Resultbus optimization
  - Flow based IR in packet processing applications
3. Motivation and contributions
- Power as a metric (EDP, actually)
  - Previous studies consider performance only: Sodani 00, Molina 99, Citron 02
  - Power has been briefly considered in Brooks 00, Citron 03
  - Power reduces the degrees of freedom with which IR can be exploited
  - May make IR totally ineffective
    - E.g. value prediction schemes are not energy efficient: Sam 05
- Understanding IR using criticality information
  - Reveals the limitations of IR
  - Indicates what types of instructions one must concentrate on
- RB management
  - Understand how IR affects other processor parameters and pipeline stages
    - Has been considered in Sodani 00
  - Use criticality information to explain the behavior
  - Indicates what type of architecture is required to maximize the effectiveness of IR
- Other techniques that exploit IR to improve energy efficiency
4. Value locality
- Likelihood of a previously seen value occurring repeatedly in the same register/memory location
- Analogous to address locality (reoccurrence of address patterns)
- Instructions operate on a small set of input values and produce a small set of result values (32-bit register => 2^32 different values are not stored)
- 22-75% of instructions produce repetitive result values: Lipasti 96
- > 50% of static instructions generate only one result value: Sazeides 97
- > 90% of static instructions generate fewer than 64 different values
- > 50% of dynamic instructions generate fewer than 64 values
- Causes
  - Register recycling, 90-10 rule of execution
  - External inputs repeat: white spaces in text processing, zeros in sparse matrices
  - Width of immediate operands is small => few values
  - Calls to certain functions (e.g. printf()) are normally a jump to a fixed address
5. Instruction Repetition
- Instruction repetition: execution of a dynamic instruction with the same operand values as a previous instance
- A direct consequence of value locality
- Repeated instructions are essentially performing redundant computation
- Is more of a program property: Sodani 00, Sazeides 03
- The compiler has no knowledge of run-time information and has to make conservative decisions in the presence of
  - pointers, indirect jumps (switch statements), dynamically shared objects
- Types of repetition in result value
  - (i) Local level repetition: inputs are the same as a previous dynamic instance of the same instruction (quasi-invariants: Molina 99)
    - for (int i = 0; i < size; i++)
    - PCi: c[i] = a[i] + b[i]
  - (ii) Global level repetition: inputs are the same as a previous dynamic instance of a different instruction with the same opcode (quasi common sub-expressions: Molina 99)
    - PCj: x = y/z
    - PCk: r = y/p (p = z => x = r)
  - Inputs are different, but the result is the same (e.g. compare instruction)
  - Inputs are the same but the result is different (influence of other intervening instructions matters, e.g. load from address)
6. Techniques that exploit repetition
- Memoization
  - Exploits function-level repetition of arguments
  - Software technique
  - Applicable to functions with no side effects (e.g. fibonacci(n))
- Value prediction (H/W technique, speculative)
  - Value prediction to exceed the data flow limit
  - Predicting the outcome of loads to reduce average memory latency
  - Requires validation and recovery mechanisms
- Instruction Reuse (IR) (H/W technique)
  - H/W implementation of memoization that operates at instruction granularity
  - Non-speculative; no validation required
  - Does not capture as much redundancy in programs as VP
- Types
  - Single instruction reuse vs block reuse techniques
  - Mixed H/W + S/W techniques
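As a software point of comparison, memoization of a side-effect-free function such as fibonacci(n) can be sketched as below; this is a minimal illustration of the technique, not code from the thesis:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # the memo table plays the role of a reuse buffer in software
def fibonacci(n: int) -> int:
    """Side-effect-free, so repeated calls with the same argument can be reused."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Repeated subproblems are looked up in the memo table instead of being recomputed,
# turning the exponential naive recursion into a linear number of real evaluations.
print(fibonacci(30))
```

The same lookup-before-compute structure, moved into hardware and applied per dynamic instruction, is what IR does with the reuse buffer.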
7. Instruction Reuse (IR)
- Reduces the number of instructions executed on FUs
  - Table lookup (Reuse Buffer, RB) vs execution
- Reduces operation latency for multi-cycle instructions
- Reduces cache accesses when loads are reused
- Allows dependent instructions to be woken up early
- Reduces wrong-path activity if branches are reused
- The Sv scheme is practical to implement: Citron 02
- RB
  - Cache-like structure that holds operand values (Sv scheme: Sodani 97)
  - Separate ALU and load RBs that can be accessed within one cycle
  - ALU RB accessed with PC; load RB accessed with EA (L0-cache-like)
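An Sv-style reuse test (operand values stored per entry, PC-indexed, as in Sodani 97) can be sketched as follows; the set count, way count, and FIFO eviction here are illustrative assumptions, not the thesis configuration:

```python
class ReuseBuffer:
    """Minimal Sv-scheme sketch: each entry caches (PC, operands -> result)."""

    def __init__(self, n_sets=256, ways=2):
        self.n_sets = n_sets
        self.ways = ways
        self.sets = [[] for _ in range(n_sets)]  # each way: (pc, op1, op2, result)

    def reuse_test(self, pc, op1, op2):
        """Return the cached result if this PC was seen with the same operand
        values (reuse hit: skip the functional unit), else None (must execute)."""
        for entry in self.sets[pc % self.n_sets]:
            if entry[:3] == (pc, op1, op2):
                return entry[3]
        return None

    def update(self, pc, op1, op2, result):
        """Install an executed instruction's operands and result."""
        s = self.sets[pc % self.n_sets]
        if len(s) == self.ways:
            s.pop(0)  # FIFO eviction stands in for a real replacement policy
        s.append((pc, op1, op2, result))

rb = ReuseBuffer()
assert rb.reuse_test(0x4000, 5, 7) is None  # first dynamic instance misses
rb.update(0x4000, 5, 7, 12)
assert rb.reuse_test(0x4000, 5, 7) == 12    # repeated instance is reused
```

Because the test compares actual operand values, a hit needs no validation, which is the non-speculative property that distinguishes IR from value prediction.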
8. Baseline IR
- Reuse test or RB query
  - In the issue stage for ROD (ready on dispatch) instructions (Sodani 00, Citron 02)
    - The IW holds operand values (data capture structure: values obtained from the RF are sent to the IW)
  - In parallel with execution for long-latency NROD (not ready on dispatch) instructions
  - For ready instructions that are not issued immediately (delayed instructions)
- RB update: instructions that missed in the RB (did not query due to port limitations, or queried and incurred misses)
9. Simulation environment
- SimpleScalar simulator + Wattch power model (2.5 V, 600 MHz)
- Benchmarks
  - SPEC (Int + FP): Alpha and PISA ISAs
  - Media/NPU: PISA ISA
- Metrics: speedup, EDP
- RB power model: cache-like, Brooks 00
- Most results shown are for a 256-entry 2-way set-associative ALU RB (64_2 x 2 for the split RB), a 32-entry fully associative load RB, 8 ports (2R, 1W, 2 x 1R/1W)
- Size of the ALU RB: 8KB (less than the size of modern L1 caches)
- RB updated by committing instructions; squash reuse not exploited
- Trivial computations
  - A computation whose result is 0, 1 or one of the inputs itself: Yi 02 (e.g. a+0, a*0, etc.)
  - Dynamic detection and elimination of trivial instructions is done in all simulations
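The trivial-computation filter (Yi 02 style) can be sketched as a simple operand check before execution; the exact set of patterns detected in the simulations is an assumption here:

```python
def trivial_result(op, a, b):
    """Return the result if (op, a, b) is a trivial computation, else None.
    Trivial (Yi 02 style): the result is 0, 1, or one of the inputs, and is
    decidable from the operands alone, without using a functional unit."""
    if op == "add" and b == 0: return a
    if op == "add" and a == 0: return b
    if op == "sub" and b == 0: return a
    if op == "sub" and a == b: return 0
    if op == "mul" and 0 in (a, b): return 0
    if op == "mul" and b == 1: return a
    if op == "mul" and a == 1: return b
    if op == "div" and b == 1: return a
    if op == "div" and a == 0 and b != 0: return 0
    if op == "div" and a == b and b != 0: return 1
    return None

assert trivial_result("mul", 7, 0) == 0     # a*0 needs no execution
assert trivial_result("add", 7, 0) == 7     # a+0 just forwards the input
assert trivial_result("add", 3, 4) is None  # non-trivial: must execute
```

Unlike the RB, this filter needs no table at all, which is why trivial-computation elimination is left enabled in every simulation as a baseline optimization.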
10. Quantifying the effectiveness of IR
11. ROD/NROD breakup
- Reuse opportunity is determined by the availability of operands early in the pipeline
- Architecture and compiler dependent
- Operand availability shows stability
- The breakup includes EA computation instructions
- Single-cycle-latency instructions benefit from IR only if they are ROD or wait in the IW
12. Delay characteristics of ROD/NROD instructions
- Causes:
  - Bursts of instructions
  - Limited resources (issue ports, cache ports)
  - Oldest-first scheduling policy
  - Program and compiler dependent
- Delays show repetitive behavior and can be predicted
13. Most often repeated instructions
- Methodology: sim-safe simulator
- Dominated by:
  - EA computation
  - Conditional branch
  - Conditional move
  - Compare instructions
  - Load address (lda, ldah) + an arithmetic instruction
- Selective IR: need to consider only the specified opcodes
14. Speedup
- Speedups are modest; the range is similar to that achieved by Citron 02
15. EDP
- Load IR: fully associative RB, invalidation, L0-cache-like
- The D-cache satisfies most accesses
- Identify missing instructions that repeat
- Latency tolerance
16. Parameters affecting IPC, EDP (Plackett-Burman experiment)
- Performance gains with IR depend on the underlying microarchitecture (latency, pipeline stages skipped), the number of instructions reused (limited by RB organization and management), and the criticality of the instructions reused
- The number of RB sets is more important than associativity
- Lower associativity => better energy efficiency
- Load IR does not affect IPC significantly
- Actual values (sensitivity analysis) not shown
17. Understanding IR using instruction criticality information
18. Critical path model (Fields 01)
- DD: in-order dispatch; CD: finite IW; ED: control dependence
- DE: execution follows dispatch (if last-arriving edge => ROD instruction)
- EE: data dependence (if last-arriving edge => NROD instruction)
- EC: commit follows execution; CC: in-order commit
- An instruction can be Execute (E), Fetch/Dispatch (D) or Commit (C) critical, or may be critical due to a combination of the above
- The critical path model of Fields (Fields 01) is used to identify critical instructions
  - Perfect predictor (based on the instruction trace)
  - Criticality predictor (avg prediction accuracy of 88% for SPEC2000)
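The longest-path computation underlying this model can be sketched as below. The tiny two-instruction graph and unit latencies are illustrative assumptions, not a trace from the thesis; the point is that the last-arriving edge into each E-node tells us whether the instruction was ROD (DE edge) or NROD (EE edge):

```python
def critical_path(edges, n_nodes):
    """Longest-path over a DAG of (src, dst, latency) edges, where nodes are
    D/E/C events numbered so that src < dst for every edge. Returns per-node
    arrival times and the predecessor along each node's last-arriving edge."""
    arrival = [0] * n_nodes
    last_arriving = [None] * n_nodes
    for src, dst, lat in sorted(edges, key=lambda e: e[1]):  # process dsts in order
        if arrival[src] + lat >= arrival[dst]:
            arrival[dst] = arrival[src] + lat
            last_arriving[dst] = src
    return arrival, last_arriving

# Nodes 0..5: D1, E1, C1, D2, E2, C2 for two dependent instructions.
edges = [
    (0, 1, 1),  # DE: execution follows dispatch
    (1, 2, 1),  # EC: commit follows execution
    (0, 3, 1),  # DD: in-order dispatch
    (3, 4, 1),  # DE
    (1, 4, 3),  # EE: data dependence on a long-latency producer
    (4, 5, 1),  # EC
    (2, 5, 1),  # CC: in-order commit
]
arrival, last = critical_path(edges, 6)
# E2's last-arriving edge is the EE edge from E1 => instruction 2 waited for
# its operand, i.e. it is NROD, and its E-node lies on the critical path.
assert last[4] == 1
```

Walking `last_arriving` backwards from the final commit node recovers the critical path; tallying which edge type ends at each node on it yields the D/E/C criticality breakdown used in the following slides.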
19. Fraction of E-critical instructions reused with baseline IR
- Most criticality is due to D-nodes
- 80% of E-critical instructions are NROD
- If a ROD instruction is critical, it can be only due to ED (control dependence), DE and EC edges (EE edges (data dependence) will not contribute)
- Most instructions that query the RB in the issue stage are therefore non-critical
- Only 2-3% of E-critical instructions are reused
- A modest fraction of E-critical instructions are trivial computations (probably explains why a 6% speedup is achieved by Yi 02)
- IR reduces execution latency and may be viewed as a technique that directly impacts E-criticality
  - With IR, the processor backend consumes instructions faster than normal
  - Puts pressure on the front-end to fetch instructions quickly
- With dynamic scheduling, D- and C-critical nodes are also affected
20. Fraction of D-critical instructions reused with baseline IR
21. Characterizing instruction repetition for E-critical instructions
- Very few D/E-critical instructions are reused with the base IR policy
  - Poor repetition of data values for critical instructions (characterize this)
  - Poor management of the limited-size RB, due to which few critical instructions access/update the RB
  - A combination of both
- Use a timing simulator (sim-outorder) for evaluation
- Definitions
  - Dynamic instruction repetition
  - Static instruction repetition
  - Unique repeatable instance
22. Characterizing instruction repetition for E-critical instructions
- I12, I13, I21 are dynamic repeated instructions
- I1, I2 are static repeated instructions
- The data tuples associated with I12, I13 and I21 are the unique repeatable instances (<13,14,27>, <15,16,31>, <21,22,43>)
- Methodology
  - Buffer instructions that are marked critical by the tracer
  - Buffer data and other information
23. Repetition of E-critical instructions
- Static instructions repeated => (I1 + I2)/(I1 + I2 + I3) = 2/3
- Dynamic instructions repeated => (I12 + I13 + I21)/(I11 .. I32) = 3/6
- There is significant repetition in E-critical instructions (i.e. instructions that don't repeat, such as I11, I31 and I32, are rare)
- However, certain instructions may contribute more to repeatability than others (e.g. I1 produces 6 of the 8 dynamic repeated instructions and I2 only 2)
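A generic sketch of how such a trace could be classified is shown below. The (PC, operands, result) trace format and the counting convention for "dynamic repeated" (all instances of a tuple that occurs more than once) are assumptions for illustration, not the thesis definitions verbatim:

```python
from collections import Counter, defaultdict

def characterize(trace):
    """trace: list of (pc, src1, src2, result) dynamic instructions.
    Returns the number of static repeated PCs, the number of dynamic
    repeated instructions, and the unique repeatable instances per PC."""
    counts = Counter(trace)  # occurrences of each full data tuple
    dynamic_repeated = sum(c for c in counts.values() if c > 1)
    repeated_pcs = {t[0] for t, c in counts.items() if c > 1}
    unique = defaultdict(set)
    for (pc, s1, s2, r), c in counts.items():
        if c > 1:
            unique[pc].add((s1, s2, r))  # unique repeatable instances of this PC
    return len(repeated_pcs), dynamic_repeated, dict(unique)

# Hypothetical trace: PC 1 repeats two tuples, PC 2 repeats one, PC 3 never repeats.
trace = [(1, 13, 14, 27), (1, 13, 14, 27), (1, 15, 16, 31),
         (1, 15, 16, 31), (2, 21, 22, 43), (2, 21, 22, 43),
         (3, 1, 2, 3)]
stat, dyn, uniq = characterize(trace)
assert (stat, dyn) == (2, 6)  # 2 static repeated PCs, 6 dynamic repeated instructions
```

The per-PC sets in `uniq` are what the RB would need to hold to capture all of a PC's repetition, which motivates the coverage analysis on the next slides.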
24. Static instruction coverage of dynamic repetition
- Art: 70% of the repeated static instructions are responsible for 90% of dynamic repetition (repetition is more common when all instructions are considered)
- Certain instructions tend to contribute more significantly to dynamic repetition than others (indicated by the steep slope)
- These instructions must ideally be allocated entries in the RB (e.g. I1 with data tuple I12 in the example)
- However, the number of different values produced/used by instructions, and collisions (indexing to the same entry), also impact the RB hit rate (especially with a small RB and power as a constraint)
25. Contribution of unique repeatable instances to dynamic repetition
- e.g. instance 1 => I21: 2/8 = 25%
- instance 2 => I12 + I13: (4+2)/8 = 75%
26. RB management using criticality information
- Try to improve performance by allowing only instructions predicted to be critical to access/update the RB
- Scheme 1: IR_Qrod_Umissed
- Scheme 2: IR_Qrodcrit_Umissedcrit
- Scheme 3: IR_Qrod_Qdelay_Umissed
- Scheme 4: IR_Qrodcrit_Qdelaycrit_Umissedcrit
- Performance degrades slightly when criticality info is used
- Note: power consumed by the criticality predictor itself is ignored; the criticality predictor acts only as a filter
- Conclusion: it is better to reuse all types of instructions rather than a few predicted critical instructions, especially with a limited RB size and a large working set
27. Impact of IR on the processor front-end / IR + front-end throttling / IR with RB updates eliminated
28. Improving the energy efficiency of IR: the resultbus optimization
29. The idea
- The resultbus optimization is based on the fact that the result values of reused instructions are already present in the RB, so dependent instructions can receive their operand(s) by reading an RB entry.
- Instead of sending the full 32/64-bit result values over a high-capacitance result bus, we send only a small index (which indicates where in the RB the value concerned may be found) over the bus.
- The optimization is applicable only to result-producing instructions that are reused (e.g. branches are not candidates).
- Power savings are achieved due to lower bit-transition activity over the high-capacitance resultbus, and also because reading the RB entry is not very expensive (small RB, and a read is not a query).
- The idea is similar to bus encoding, but exploits IR to achieve energy savings.
- Why reducing IW power may be effective
  - The IW dissipates as much as 25% of the total microprocessor power: Folegnani 01
  - Forwarding data values from producer to consumer instructions in the IW is power hungry
  - 30% and 44% of the total IW energy is due to forwarding data values over the resultbus in the SPECint95 and SPECfp95 benchmarks: Ponomarev 03
  - NROD instructions are the majority and receive some/all of their data values from the resultbus
30. Resultbus optimization
- Distinguish between the bypass network and the resultbus
  - Normally the result value is sent over the bypass path
  - The resultbus runs through the entire IW: Palacharla 97, Shen 04
- The resultbus optimization is applicable to values transferred over the resultbus (i.e. non-consecutive execution of P and C)
- A dependent instruction that receives an index must access the RB to obtain its value, i.e. it is delayed by one cycle
- Number of bits transferred: 2w vs 32 or 64
- Effectiveness of the resultbus optimization
  - Depends on capacitance/length
  - Depends on layout (length of the buses connecting the RB to the FUs, etc.)
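The bit-activity saving can be sketched by comparing Hamming transitions on the bus when a full result versus a small RB index is driven; the 64-bit bus width, 256-entry RB, and the particular values below are illustrative assumptions:

```python
def transitions(prev, new, width):
    """Bit-transition count on a bus = Hamming distance between successive values."""
    return bin((prev ^ new) & ((1 << width) - 1)).count("1")

BUS_WIDTH, RB_ENTRIES = 64, 256
index_bits = RB_ENTRIES.bit_length() - 1  # log2(256) = 8-bit RB index

prev_bus = 0x0000000000000000
full_result = 0xDEADBEEFCAFEF00D  # value forwarded without the optimization
rb_index = 0x5A                   # index forwarded with the optimization

full_cost = transitions(prev_bus, full_result, BUS_WIDTH)
index_cost = transitions(prev_bus, rb_index, BUS_WIDTH)
# The index occupies only the low wires, so the upper bus lines stay quiet:
assert index_cost <= index_bits < full_cost
```

Since dynamic power on the bus scales with transitions times wire capacitance, bounding the toggled wires to log2(N) per transfer is where the savings come from, at the cost of the one-cycle RB read on the consumer side noted above.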
31. Implementation details
- Changes in the IW
  - 1-bit RBhit
  - RBindex of log2(N) bits
  - 1-bit RBindex_valid
  - 1-bit RBlast
- Other changes
  - A signal/wire Resadd
  - RBlock bit vector (N bits)
- Separating the RB locking mechanism from the RB will reduce complexity, since the RBlock bit vector is likely to be accessed many times
- If separate ALU and load RBs are used, an additional 1-bit RBALU_load field is required to indicate in which RB a hit occurred
32. Unlocking RB entries: issues and solutions
- Assume that tags are broadcast earlier than results (a valid assumption): Palacharla 97, Shen 04
- Scenarios:
  - a) No dependent instruction in the IW (indicated by no tag match)
  - b) Only one dependent instruction in the IW
  - c) Multiple dependent instructions in the IW
  - d) Some dependent instructions in the IW are speculative and are likely to be squashed
  - e) The producer (reused) instruction is of ALU type and the consumer is a load instruction, or vice versa
- Any instruction that attempts to update an RB entry must examine whether it is locked (access RBlock)
- Except for scenario (a), a locked RB entry is unlocked when the dependent consumer instruction C commits
- Commit-stage unlocking is required due to the OoO/speculative execution model
- The last dependent instruction that is slated to be squashed is converted into a NOP (RBlast = 1 + spec => dummy); some IW entries remain occupied, but branch mispredicts are relatively uncommon
33. Issues and solutions
- Power savings are due to
  - Transfer of fewer bits over the high-capacitance resultbus
  - Resource (port) constraints imposed by the RBlock bit vector, which prevents some RB updates
- RBlock is a small structure (e.g. 128 bits for an RB of 128 entries) and not very power hungry
- Additional accesses to the RB occur, but do not contribute significantly to energy consumption
  - e.g. a dependent load instruction accessing the ALU RB to obtain its operand (this occurs even in the base IR scheme, since EA computations are stored in the ALU RB and a load instruction uses this value)
  - Dependent stores accessing the RB to get the operand
- Since low-level issues are not considered due to modeling constraints, the results presented are optimistic
  - Limitations of the SimpleScalar and Wattch power models
- Effectiveness of the resultbus optimization depends on
  - The number of result-producing instructions reused
  - The amount of bit activity reduced (depends on the data values produced by reused instructions)
  - The distance between the reused producer and the consumer, i.e. whether the consumer is in the IW when the producer is reused (depends on IW size and the code generated)
34. (Optimistic) results
- Loads are often reused due to high value locality and the fully associative load RB
- Values loaded from caches normally result in significant bit activity over the resultbus connecting to the cache ports
35. IR in packet processing applications
36. Motivation
- Given an instruction and several RBs, which RB must be queried?
  - Opcode based selection (split RB)
  - Random, RandBest
  - Flow based selection
- A flow gives only partial knowledge of the input data. Can this information be used to select one among many RBs?
- How does sharing the RB among threads affect hit rate in a multithreaded processor?
  - Constructive vs destructive interference
  - Most NPUs are multithreaded
37. Classifying packets: the concept of flows
- A flow is a sequence of packets sent between a source-destination pair following the same route in the network
  - <src IP addr, dst IP addr>
  - <src IP addr, src port, dst IP addr, dst port>
- Classifying packets into flows
  - Must be fast, and can be approximate as long as it is consistent across all packets
  - IP address/port based classification requires extracting these fields, i.e. flow based IR can make use of the flow information only after this operation is done
  - Direction based or port based
  - Any other easily computable method based on where the router is located
- Classification fields: IP addr (src, dst); IP addr + application port; input port; output port
  - Red: fields that are constant for a session
  - Blue: fields that usually remain the same
  - Others: vary for every packet
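A classifier that consistently maps packets to one of several RBs can be sketched as a hash over the session-constant header fields; the field choice, the CRC hash, and the two-RB split below are illustrative assumptions (the thesis also uses a simpler direction-based split):

```python
import zlib

def select_rb(src_ip, dst_ip, src_port, dst_port, n_rbs=2):
    """Map a packet to an RB using a hash of its flow-identifying fields.
    The only hard requirement is consistency: every packet of a flow must
    map to the same RB, so reusable data stays confined to one table."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_rbs  # crc32 is deterministic across runs

# All packets of one flow hit the same RB; distinct flows may share or split.
a = select_rb("10.0.0.1", "10.0.0.9", 5000, 80)
b = select_rb("10.0.0.1", "10.0.0.9", 5000, 80)
c = select_rb("10.0.0.2", "10.0.0.9", 6000, 80)
assert a == b                    # consistency within a flow
assert a in (0, 1) and c in (0, 1)
```

Note that `zlib.crc32` rather than Python's built-in `hash` is used because the latter is salted per process, which would break the cross-packet consistency the scheme depends on.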
38. Simplistic (and hypothetical) high-level example of how flow based IR works
- Pipeline-of-processors NPU model where each PE is multithreaded: Sherwood 03
- Multiprogram homogeneous workload with each thread operating on a different packet
- Instructions operating on packets belonging to different flows must access different RBs
- Likely to be beneficial mainly in header processing applications
39. Flow based IR
- What we are trying to do:
  - Use multiple RBs, each catering to a flow or a set of flows
  - Exploit the high-level concept of flows and make this information visible to instructions, so that the RB query/update is controlled by it
  - Manipulate/classify the way instructions access the RB
  - Aggregate related packets so that their data is consistently confined to a particular RB, i.e. instructions operating on packets with similar information (flow) query the same RB
  - The processor has no idea which packet it is operating on
  - The flow tag is a per-thread resource in a multithreaded processor, since instructions operating on different packets are interleaved in the pipeline
- Why it works:
  - Temporal locality in network traffic makes classification possible
  - Similarity in many packet fields: at least header info and sometimes payload (e.g. layer 4, encapsulated packets)
40. Flow based IR in an SMT processor
- The RB can either be shared among threads or cater to just one thread in multithreaded processors
- Sharing enhances the possibility of constructive interference
  - Provided the RB size is large enough
  - Values put into the RB by one thread may be reused by another thread
- Theoretically, we can control the distance between threads so that the threads are mutually benefited and evictions from the RB are reduced (not done)
- Flow classification: ERNET trace; 800 flows need to be mapped to 2 RBs
  - Non-anonymized, complete data
- Instructions operating on packets must access one of the 2 RBs
- Packet flow direction used: incoming vs outgoing packets
- Comparison: base IR also uses 2 RBs, but without the above consistent mapping
41. Does the aggregation policy retain data similarity within aggregate flows?
- Similarity measurement (at the same offset)
  - Randomly select 1 packet
  - Select the next nearby packet
  - Perform the similarity check
  - Repeat for a number of sample pairs of packets
- Other ways of quantifying similarity: sliding window (suitable for payload)
- Need to measure similarity taking IR into account
- Packets of the same flow aggregate must be roughly similar compared to those selected from different flows
42. Does the aggregation policy retain data similarity within aggregate flows?
43. Does the aggregation policy retain data similarity within aggregate flows?
44. Results: flow based IR
- Increasing the number of threads improves hit rate (except in url)
- Flow based IR results in a hit rate similar to that achieved by the opcode based RB selection policy (especially in header processing applications)
- Rand is obviously the worst
- RandBest is the best
45. Results: flow based IR
- For url, destructive interference dominates
- Packet flow information can be utilized to achieve RB hit rates similar to those obtained with opcode based partitioning of data
- May be employed in application-specific scenarios
- Performance improvement is the same in the flow and opcode based selection policies
46. Impact of IR on the processor front-end / IR + front-end throttling / IR with RB updates eliminated
47. Impact of IR on the processor front-end
- IR leads to an increase in the number of D-critical nodes, making the processor front-end the main bottleneck to performance
- IR transfers a portion of the criticality from the backend to the front-end and commit stage
- Using an aggressive instruction fetch when exploiting IR is likely to result in greater performance gains
48. Using an aggressive front-end (performance impact)
- Two schemes for aggressive fetch:
  - Increase icache block size
  - Fetch past multiple taken branches every cycle
- Avg speedup (SPEC)
  - Without aggressive fetch: 1.027
  - With aggressive fetch: 1.041
- When IR is exploited, it is better to use an aggressive front-end so that the impact of D-critical nodes is reduced
- Aggressive fetch => possibility of wrong-path activity and energy loss
49. Reducing processor work using IR and throttling
- IR reduces the total number of instructions that have to be executed
- Executed instructions
  - Correct path
  - Wrong path (due to speculative execution): these instructions are truly redundant
- The number of incorrectly fetched instructions can account for up to 80% of all instructions: Aragon 03
- Reducing wrong-path activity
  - Pipeline gating: Manne 98; pipeline throttling: Aragon 03
  - Techniques based on instruction traffic (Baniasadi 01), IPC variation (Ghiasi 00) or branch confidence estimators (Grunwald 98)
- IR improves performance and could reduce wrong-path activity
  - If a conditional branch instruction is reused, its outcome is known one cycle early
- Quantify the reduction in extra work and the energy savings achieved with IR and throttling
- A combination of IR and throttling is likely to result in greater EDP savings
- A perfect branch confidence estimator that consumes no extra power is assumed (though unrealistic, using this gives an upper bound on the performance/EDP gains)
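A pipeline-gating rule in the spirit of Manne 98 can be sketched as: stall fetch while too many unresolved low-confidence branches are in flight. The threshold, the toy fetch stream, and the one-event-per-cycle model below are illustrative assumptions:

```python
def should_gate(low_conf_in_flight, threshold=1):
    """Gating rule: stall fetch while more than `threshold` unresolved
    low-confidence branches are in flight (later fetches are likely wrong-path)."""
    return low_conf_in_flight > threshold

# Toy walk over a fetch stream of (is_low_conf_branch, a_branch_resolves_now).
fetched, gated, in_flight = 0, 0, 0
for is_low_conf, resolves in [(True, False), (True, False), (False, True),
                              (False, False), (False, True), (False, False)]:
    if should_gate(in_flight):
        gated += 1              # fetch stalled: no wrong-path fetch energy spent
    else:
        fetched += 1
        if is_low_conf:
            in_flight += 1      # another unresolved low-confidence branch
    if resolves and in_flight:
        in_flight -= 1          # a low-confidence branch resolved this cycle
print(fetched, gated)           # prints 5 1
```

With a perfect (zero-power) confidence estimator, as assumed on this slide, every gated cycle removes only wrong-path work, which is what makes the combined IR + throttling numbers an upper bound.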
50. Energy wastage due to wrong-path activity
- The front end is a significant contributor to wasted energy
- Throttling needs to be applied to these stages (stall the fetch/dispatch stages for 2 and 1 cycles respectively when a low-confidence branch is encountered)
51. IR and throttling: impact on EDP
- Reducing wrong-path activity is necessary in processors with an aggressive front-end and deep pipelines
- A combination of IR and throttling exploits the best features of both
- The number of wrong-path instructions querying the RB is reduced, which leads to some power savings
52. IR without RB updates: the IRPS scheme
- Profile benchmarks and determine the most frequently executed and repeated instructions
- Allocate these in the RB and disallow dynamic RB updates
- Power benefits: RB hardware complexity (e.g. write ports, LRU implementation) reduces
- Choosing data values to be allocated RB entries
  - Profile ROD instructions only, using sim-outorder, since NROD instructions cannot be reused with the base IR scheme
  - Methodology similar to that discussed in characterizing repetition for critical instructions
- Metrics
  - Unique repetition ratio (0 < URR < 1)
    - Smaller => fewer data tuples used/generated => smaller working set => fewer values to be stored in the RB per instruction
  - Dominating tuple (0 < DT < 1)
    - Larger => more often repeated => dominates over other tuples => better if reused
  - URR = uniq/exe; DT = Ti/exe; maxDT = max(Ti/exe)
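These metrics can be computed directly from a per-PC profile of data tuples; the toy profile below (one EA-computation-like instruction dominated by a single tuple) is an illustrative assumption:

```python
from collections import Counter

def profile_metrics(tuples_for_pc):
    """tuples_for_pc: list of data tuples produced by one static instruction.
    URR = unique tuples / executions; DT_i = count(tuple_i) / executions;
    returns (URR, maxDT)."""
    exe = len(tuples_for_pc)
    counts = Counter(tuples_for_pc)
    urr = len(counts) / exe
    max_dt = max(counts.values()) / exe
    return urr, max_dt

# Hypothetical PC executed 300 times, producing the same tuple 299 times.
tuples = [("r1", 8, 0x1008)] * 299 + [("r1", 12, 0x100C)]
urr, max_dt = profile_metrics(tuples)
# Meets the URR < 0.01 and maxDT > 0.98 selection thresholds used by IRPS,
# so this PC's dominating tuple would be a candidate for a static RB entry.
assert urr < 0.01 and max_dt > 0.98
```

A small URR with a large maxDT is exactly the profile that makes a fixed, update-free RB entry worthwhile: one stored tuple covers almost all of the instruction's dynamic repetition.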
53. Instruction selection and placement
- The number of static instructions available for selection is > 10 (depends on input and benchmark)
- Instructions with URR < 0.01 and maxDT > 0.98 are considered as candidates to be allocated entries in the RB
- A greedy algorithm selects the data values/tuples from the above set
- Types of instructions most often allocated entries in the RB:
  - EA computation (dominates)
  - Conditional branches
  - Load address (Alpha ISA: lda, ldah)
- Instructions from certain frequently executed portions of the program are likely to be selected for allocation
- Selected instructions must be placed at appropriate locations in the RB to allow PC based indexing
  - Conflicts are handled by the greedy algorithm
  - Duplicate entries may be stored at different locations in the RB
54. Simulation results
- IRPS with PC based indexing is inefficient
- RB power savings occur due to the reduced number of updates
- Operand based indexing reduces conflicts and may be better suited for accessing the RB
- Unable to reuse instructions from different phases of the program
- Frequently executed instructions may not impact performance
- Dynamic updates are better suited when PC based indexing is used
- RB organization is important
55. Contributions
- Comprehensive analysis of IR when energy is considered as a metric
  - Characterization of instruction delays and their predictability
- Impact of instruction criticality on IR
  - Characterization of repetition for critical instructions
  - RB access/update with predicted critical instructions
  - Impact of the processor front-end on IR
- Resultbus optimization that exploits communication reuse to reduce bit activity over the resultbus
- Impact of the following schemes on RB hit rate
  - Opcode based indexing vs random vs flow based (domain-specific case study)
  - Single vs multiple RBs
  - RB management: the IRPS scheme and its limitations
- Impact of sharing RBs among multiple threads in a multithreaded processor (domain-specific case study)
56. Conclusions and future directions
- IR is beneficial if
  - The number of sets is given more importance than associativity
  - The RB is small (< 256 entries) and direct mapped (or 2-way associative)
  - Load IR is not exploited
  - Multiple RBs catering to specific sets of opcodes are employed; this also reduces decoder power and lookup time
  - Better indexing or eviction mechanisms are used
  - Floating point instructions are exploited
- The performance impact of IR is more dependent on the underlying microarchitecture (pipeline depth, IW size, instruction latency, etc.)
- Power and access time issues may make block-level IR useless (future work)
- Certain data values may be given higher importance and stored in specially designed RBs (e.g. dynamic update + precomputed storage); common values <0, 9> among instructions
- Criticality
  - A large number of critical instructions have to be buffered to capture a given fraction of repetition
  - The degree of criticality and the presence of near-critical paths may also matter
  - It is better to exploit reuse in all possible instructions than in critical instructions only
- The resultbus optimization is a promising technique to improve the energy efficiency of a processor that exploits IR
57. References
- Sodani 97: A. Sodani and G. S. Sohi, "Dynamic Instruction Reuse," ISCA-24, 1997
- Sodani 00: A. Sodani, PhD thesis, University of Wisconsin-Madison
- Molina 99: Molina et al., "Dynamic Removal of Redundant Computations," Proc. ICS, 1999
- Sazeides 97: Sazeides et al., "The Predictability of Data Values," MICRO-30, 1997
- Yi 02: Joshua J. Yi and David J. Lilja, "Improving Processor Performance by Simplifying and Bypassing Trivial Computations," ICCD, 2002
- Lipasti 96: Mikko H. Lipasti et al., "Value Locality and Load Value Prediction," ASPLOS-7, 1996
- Yi 01: Yi et al., "An Analysis of the Potential for Global Level Value Reuse in the SPEC95 and SPEC2000 Benchmarks," technical report, 2001
- Yi 02: Joshua J. Yi et al., "Increasing Instruction-Level Parallelism with Instruction Precomputation," EUROPAR-8, 2002
- Citron 03: D. Citron and D. Feitelson, "Look It Up or Do the Math: An Energy, Area and Timing Analysis of Instruction Reuse and Memoization," PACS, 2003
- Citron 02: Daniel Citron, "Revisiting Instruction Level Reuse," WDDD, 2002
- Ponomarev 03: Dmitry V. Ponomarev et al., "Energy-Efficient Issue Queue Design," IEEE Transactions on VLSI Systems, Vol. 11, No. 5, 2003
- Sam 05: Sam et al., "On the Energy Efficiency of Speculative Hardware," Intl. Conf. on Computing Frontiers, 2005
- Brooks 00: Brooks et al., "Wattch: A Framework for Architectural-Level Power Analysis and Optimization," ISCA-27, 2000
- Fields 01: Fields et al., "Focusing Processor Policies via Critical-Path Prediction," ISCA-28, 2001
- Sazeides 03: Sazeides et al., "Instruction Isomorphism in Program Execution," JILP, 2003
- Aragon 03: Aragon et al., "Power-Aware Control Speculation Through Selective Throttling," HPCA-9, 2003
- Manne 98: Manne et al., "Pipeline Gating: Speculation Control for Energy Reduction," ISCA, 1998
- Baniasadi 01: Baniasadi et al., "Instruction Flow-Based Front-End Throttling for Power-Aware High-Performance Processors," ISLPED, 2001
- Ghiasi 00: Ghiasi et al., "Using IPC Variation in Workloads with Externally Specified Rates for Power Reduction," WCED, 2000
58. Thank you