1
Power-Performance Evaluation of Instruction
Level Reuse in a Superscalar Microprocessor
  • G. Surendra
  • CAD Lab, SERC, IISc
  • Research advisor: S.K. Nandy

2
Outline
  • Motivation
  • Properties of Programs
  • Value locality
  • Instruction repetition
  • Exploiting value locality and instruction
    repetition
  • Instruction reuse (IR) (or value reuse or
    computation reuse)
  • Basics, working, types/granularity of IR
  • Baseline results
  • Thesis work
  • Using instruction criticality information while
    exploiting IR
  • Resultbus optimization
  • Flow based IR in packet processing applications

3
Motivation and contributions
  • Power as a metric (EDP actually)
  • Previous studies consider performance only
    [Sodani 00, Molina 99, Citron 02]
  • Power has been briefly considered in [Brooks 00,
    Citron 03]
  • Reduces the degree of freedom with which IR can
    be exploited
  • May make IR totally ineffective
  • E.g. value prediction schemes are not energy
    efficient [Sam 05]
  • Understanding IR using criticality information
  • Reveals the limitations of IR
  • Indicates what types of instructions one must
    concentrate on
  • RB management
  • Understand how IR affects other processor
    parameters and pipeline stages
  • Has been considered in [Sodani 00]
  • Use criticality information to explain the
    behavior
  • Indicates what type of architecture is required
    to maximize the effectiveness of IR
  • Other techniques that exploit IR to improve
    energy efficiency

4
Value locality
  • Likelihood of a previously seen value occurring
    repeatedly in the same register/memory location
  • Analogous to address locality (reoccurrence of
    address patterns)
  • Instructions operate on a small set of input
    values and produce a small set of result values
    (32-bit register => not all 2^32 different
    values are stored)
  • 22-75% of instructions produce repetitive result
    values [Lipasti 96]
  • > 50% of static instructions generate only one
    result value [Sazeides 97]
  • > 90% of static instructions generate fewer than
    64 different values
  • > 50% of dynamic instructions generate fewer than
    64 values
  • Causes
  • Register recycling, 90-10 rule of execution
  • External inputs repeat: white spaces in text
    processing, zeros in sparse matrices
  • Width of immediate operands is small => few
    values
  • Calls to certain functions (e.g. printf()) are
    normally jumps to a fixed address

5
Instruction Repetition
  • Instruction repetition execution of a dynamic
    instruction with the same operand values as a
    previous instance
  • A direct consequence of value locality
  • Repeated instructions are essentially performing
    redundant computation
  • Is more of a program property Sodani 00,
    Sazeides 03
  • Compiler has no knowledge of run-time information
    and has to make conservative decisions in
    presence of
  • pointers, indirect jumps (switch statements),
    dynamically shared objects
  • Types of repetition in result value
  • (i) Local-level repetition - inputs are same as a
    previous dynamic instance of the same instruction
    (quasi-invariants [Molina 99])
  • for (int i = 0; i < size; i++)
  •   PC_i: c[i] = a[i] + b[i]
  • (ii) Global-level repetition - inputs are same
    as a previous dynamic instance of a different
    instruction with the same opcode (quasi common
    sub-expressions [Molina 99])
  • PC_j: x = y/z
  • PC_k: r = y/p; if p == z => x == r
  • Inputs are different, but result is the same
    (e.g. compare instruction)
  • Inputs are same but result is different
    (influence of other intervening instructions
    matters, e.g. load from an address)

6
Techniques that exploit repetition
  • Memoization
  • Exploit function level repetition of arguments
  • Software technique
  • Applicable for functions with no side effects
    (e.g. Fibonacci (n))
  • Value prediction (H/W technique and speculative)
  • Value prediction to exceed the data flow limit
  • Predicting outcome of loads to reduce average
    memory latency
  • Requires validation and recovery mechanisms
  • Instruction Reuse (IR) (H/W technique)
  • HW implementation of memoization that operates at
    instruction granularity
  • Non-speculative; no validation required.
  • Does not capture as much redundancy in programs
    as VP.
  • Types
  • single instruction reuse vs block reuse
    techniques
  • Mixed H/W S/W techniques
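Memoization as described above can be sketched in C; this is a minimal illustration with an arbitrary table size of our choosing, not a structure from the thesis:

```c
#include <assert.h>
#include <stdint.h>

/* Memoization: cache the results of a side-effect-free function so a
 * repeated call with the same argument becomes a table lookup instead
 * of a recomputation. MEMO_SIZE is an arbitrary choice for this sketch. */
#define MEMO_SIZE 64
static uint64_t memo[MEMO_SIZE];   /* 0 means "not yet computed" */

static uint64_t fib(unsigned n)
{
    if (n < 2)
        return n;
    if (n < MEMO_SIZE && memo[n] != 0)
        return memo[n];            /* reuse a previously computed result */
    uint64_t r = fib(n - 1) + fib(n - 2);
    if (n < MEMO_SIZE)
        memo[n] = r;               /* remember it for the next call */
    return r;
}
```

Without the table the recursion is exponential; with it, each argument is computed once, which is exactly the function-level repetition the slide refers to.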

7
Instruction Reuse (IR)
  • Reduces the number of instructions executed on
    FUs
  • Table lookup (Reuse Buffer RB) vs execution
  • Reduces operation latency for multicycle
    instructions
  • Reduces cache accesses when loads are reused
  • Allows dependent instructions to be woken up
    early
  • Reduces wrong path activity if branches are
    reused
  • Sv scheme is practical to implement [Citron 02]
  • RB
  • Cache-like structure that holds operand values
    (Sv scheme [Sodani 97])
  • Separate ALU and Load RBs that can be accessed
    within one cycle
  • ALU RB accessed with PC; Load RB with EA (L0
    cache like)

8
Baseline IR
  • Reuse test or RB query
  • - In issue stage for ROD instructions [Sodani
    00, Citron 02]
  • (IW holds operand values - data capture
    structure; values read from the RF are sent to
    the IW)
  • - In parallel with execution for long-latency
    NROD instructions
  • - For ready instructions that are not issued
    immediately (delayed instructions)
  • RB update: instructions that missed in the RB
    (did not query due to port limitations, or
    queried and incurred misses)
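The reuse test and RB update above can be sketched as a small C model. This is our own direct-mapped simplification (the thesis baseline is 2-way set associative with separate ports); field names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an Sv-style reuse buffer: each entry holds the operand
 * values and result of a previous dynamic instance of an instruction.
 * A query hits when the tag (PC) and both operand values match. */
#define RB_ENTRIES 256

struct rb_entry {
    uint32_t pc;
    int32_t  op1, op2, result;
    int      valid;
};
static struct rb_entry rb[RB_ENTRIES];

/* Reuse test: returns 1 on a hit and writes the cached result,
 * letting the instruction skip the functional unit. */
int rb_query(uint32_t pc, int32_t op1, int32_t op2, int32_t *result)
{
    struct rb_entry *e = &rb[(pc >> 2) % RB_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;
        return 1;
    }
    return 0;
}

/* RB update: performed for instructions that missed (or never queried). */
void rb_update(uint32_t pc, int32_t op1, int32_t op2, int32_t result)
{
    struct rb_entry *e = &rb[(pc >> 2) % RB_ENTRIES];
    e->pc = pc; e->op1 = op1; e->op2 = op2;
    e->result = result; e->valid = 1;
}
```

A first execution of an instruction misses and updates the RB; a later instance with the same operands hits and reuses the stored result.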

9
Simulation environment
  • SimpleScalar simulator + Wattch power model
    (2.5V, 600MHz)
  • Benchmarks
  • SPEC (Int + FP) - Alpha and PISA ISAs
  • Media/NPU - PISA ISA
  • Metrics: Speedup, EDP
  • RB power model: cache-like [Brooks 00]
  • Most results shown are for a 256-entry 2-way set
    associative ALU RB (64_2 2 for split RB), a
    32-entry FA load RB, 8 ports (2R 1W 2 1R1W)
  • Size of ALU RB: 8KB (less than size of modern L1
    caches)
  • RB updated by committing instructions; squash
    reuse not exploited
  • Trivial computations
  • A computation whose result is 0, 1 or one of the
    inputs itself [Yi 02] (e.g. a+0, a*0, etc.)
  • Dynamic detection and elimination of trivial
    instructions is done in all simulations
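Dynamic detection of trivial computations, as in [Yi 02], can be sketched for a few integer opcodes (the opcode set and function name here are our own illustration):

```c
#include <assert.h>

/* A computation is "trivial" when its result is 0, 1, or one of the
 * inputs, so it needs no functional unit: e.g. a+0, a*0, a*1, a-a. */
enum op { ADD, SUB, MUL, DIV };

/* Returns 1 and fills *result if the operation needs no execution. */
int is_trivial(enum op op, int a, int b, int *result)
{
    switch (op) {
    case ADD: if (a == 0) { *result = b; return 1; }
              if (b == 0) { *result = a; return 1; }
              break;
    case SUB: if (b == 0) { *result = a; return 1; }
              if (a == b) { *result = 0; return 1; }
              break;
    case MUL: if (a == 0 || b == 0) { *result = 0; return 1; }
              if (a == 1) { *result = b; return 1; }
              if (b == 1) { *result = a; return 1; }
              break;
    case DIV: if (b == 1) { *result = a; return 1; }
              if (a == 0 && b != 0) { *result = 0; return 1; }
              break;
    }
    return 0;   /* not trivial: must go to a functional unit */
}
```

In the simulations, such instructions bypass the FUs entirely, which is why trivial-computation elimination is applied before measuring what IR itself contributes.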

10
  • Quantifying the effectiveness of IR

11
ROD/NROD breakup
  • Reuse opportunity is determined by the
    availability of operands early in the pipeline
  • Architecture and compiler dependent
  • Operand availability shows stability
  • Breakup includes EA computation instructions
  • Single-cycle latency instructions benefit from
    IR only if they are ROD or wait in the IW

12
Delay characteristics of ROD/NROD instr
Causes -
  • Bursts of instructions
  • Limited resources
  • Issue ports
  • Cache ports
  • Oldest-first scheduling policy
  • Program and compiler dependent

Delays show repetitive behavior and can be
predicted
13
Most often repeated instructions
  • Methodology
  • sim-safe simulator
  • Dominated by
  • - EA computation
  • - Conditional branch
  • - Conditional move
  • - Compare instructions
  • - Load address (lda, ldah), an arithmetic instr
  • Selective IR
  • Need to consider only the specified opcodes

14
Speedup
Speedups are modest; range similar to that
achieved by [Citron 02]
15
EDP
  • Load IR
  • FA RB
  • Invalidation
  • L0-cache-like
  • Dcache satisfies most accesses
  • - Identify missing instr that repeat
  • - Latency tolerance

16
Parameters affecting IPC, EDP (Plackett-Burman
expt)
  • Performance gains with IR depend on the
    underlying microarchitecture (latency, pipeline
    stages skipped), the number of instructions
    reused (limited by RB organization, management),
    and the criticality of instructions reused
  • Num of RB sets is more important than
    associativity
  • Lower associativity => better energy efficiency
  • Load IR does not affect IPC significantly
  • Actual values (sensitivity analysis) not shown

17
  • Understanding IR using instruction
    criticality information

18
Critical path model [Fields 01]
  • DD: in-order dispatch; CD: finite
    IW; ED: control dependence
  • DE: execution follows dispatch (if last-arriving
    edge => ROD instr)
  • EE: data dependence (if last-arriving edge =>
    NROD instr)
  • EC: commit follows execution; CC:
    in-order commit
  • An instruction can be Execute (E),
    Fetch/Dispatch (D) or Commit (C) critical, or may
    be critical due to a combination of the above
  • The critical path model of Fields [Fields 01] is
    used to identify critical instructions
  • Perfect predictor (based on instruction trace)
  • Criticality predictor (avg prediction accuracy
    of 88% for SPEC2000)

19
Fraction of E-critical instructions reused with
baseline IR
  • Most criticality is due to D-nodes
  • 80% of E-crit instr are NROD
  • If an ROD instr is critical, it can only be due
    to ED (control dependence), DE and EC edges (EE
    edges (data dependence) will not contribute)
  • Most instr that query the RB in the issue stage
    are therefore non-critical
  • Only 2-3% of E-crit instr are reused
  • A modest fraction of E-crit instr are trivial
    computations (probably explains the 6% speedup
    achieved by [Yi 02])
  • IR reduces execution latency and may be viewed
    as a technique that directly impacts
    E-criticality
  • - With IR, the processor backend consumes
    instructions faster than normal
  • - Puts pressure on the front-end to fetch
    instructions quickly
  • With dynamic scheduling, D and C critical nodes
    are also affected

20
Fraction of D-critical instructions reused with
baseline IR

21
Characterizing instruction repetition for
E-critical instructions
  • Very few D/E-critical instructions are reused
    with the base IR policy
  • Poor repetition of data values for critical
    instructions (characterize this)
  • Poor management of the limited sized RB due to
    which few critical instructions access/update the
    RB
  • A combination of both
  • Use a timing simulator (sim-outorder) for
    evaluation
  • Definitions
  • Dynamic instruction repetition
  • Static instruction repetition
  • Unique repeatable instance

22
Characterizing instruction repetition for
E-critical instructions
  • I12,I13,I21 are dynamic repeated instructions
  • I1, I2 are static repeated instructions
  • Data tuples associated with I12, I13 and I21 are
    the unique repeatable instances
    (<13,14,27>, <15,16,31>, <21,22,43>)
  • Methodology
  • Buffer instructions that are marked critical by
    the tracer
  • Buffer data and other information

23
Repetition of E-critical instructions
  • Static instr repeated =>
    (I1+I2)/(I1+I2+I3) = 2/3
  • Dynamic instr repeated =>
    (I12+I13+I21)/(I11..I32) = 3/6
  • There is significant repetition in E-critical
    instructions (i.e. instructions that don't repeat
    - I11, I31 and I32 - are rare)
  • However, certain instructions may contribute more
    to repeatability than others (e.g. I1 produces 6
    of the 8 dynamic repeated instructions and I2
    only 2)
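The dynamic-repetition count used in these slides can be reproduced with a toy tracer. The struct layout and the quadratic history scan are our own simplification; the thesis buffers instructions marked critical by the sim-outorder tracer:

```c
#include <assert.h>
#include <stdint.h>

/* One dynamic instruction instance: static identity (PC) plus its
 * operand tuple. */
struct inst { uint32_t pc; int op1, op2; };

/* A dynamic instance is "repeated" if the same static instruction
 * (same PC) already occurred earlier in the trace with the same
 * operand tuple (local-level repetition). O(n^2) scan for clarity;
 * a real tracer would use a hash table keyed on <pc, op1, op2>. */
int count_repeated(const struct inst *trace, int n)
{
    int repeated = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            if (trace[j].pc  == trace[i].pc  &&
                trace[j].op1 == trace[i].op1 &&
                trace[j].op2 == trace[i].op2) {
                repeated++;    /* this instance is a repeat */
                break;
            }
        }
    }
    return repeated;
}
```

Dividing this count by the trace length gives the dynamic-repetition fraction quoted on the slide (e.g. 3/6 in the I11..I32 example).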

24
Static instruction coverage of dynamic repetition
  • Art: 70% of the repeated static instructions
    are responsible for 90% of dynamic repetition
    (repetition is more common when all instructions
    are considered)
  • Certain instructions tend to contribute more
    significantly to dynamic repetition than others
    (indicated by the steep slope)
  • These instructions must ideally be allocated
    entries in the RB (e.g. I1 with data tuple I12
    in the example)
  • However, the number of different values
    produced/used by instructions and collisions
    (index to same entry) also impact the
    RB hit rate (especially with a small RB and power
    as a constraint)

25
Contribution of unique repeatable instances to
dynamic repetition
e.g. instance 1 => I21: 2/8 = 25%
instance 2 => I12 + I13: (4+2)/8 = 75%
26
RB management using criticality information
  • Try to improve performance by allowing only
    instructions predicted to be critical to
    access/update the RB

Scheme1 - IR_Qrod_Umissed Scheme2 -
IR_Qrodcrit_Umissedcrit Scheme3 -
IR_Qrod_Qdelay_Umissed Scheme4 -
IR_Qrodcrit_Qdelaycrit_Umissedcrit
Performance degrades slightly when criticality
info is used. Note: the power consumed by the
criticality predictor itself is ignored; the
criticality predictor acts only as a
filter. Conclusion: it is better to reuse all types
of instructions rather than a few predicted
critical instructions, especially with a limited RB
size and a large working set
27
Impact of IR on the processor front-end
IR + front-end throttling
IR with RB updates eliminated
28
  • Improving the energy efficiency of IR:
    the resultbus optimization

29
The idea
  • The resultbus optimization is based on the fact
    that result values of reused instructions are
    already present in the RB. So, dependent
    instructions can receive their operand(s) by
    reading a RB entry.
  • Instead of sending the full 32/64 bit result
    values over a high capacitance result bus, we
    send only a small index (which indicates where in
    the RB the value concerned may be found) over the
    bus.
  • The optimization is applicable only for result
    producing instructions that are reused (e.g.
    branches are not candidates)
  • Power savings are achieved due to lower bit
    transition activity over the high capacitance
    resultbus, and also because reading the RB entry
    is not very expensive (small RB, read is not a
    query).
  • The idea is similar to bus encoding, but exploits
    IR to achieve energy savings
  • Why reducing IW power may be effective
  • The IW dissipates as much as 25% of the total
    microprocessor power [Folegnani 01]
  • Forwarding data values from producer to consumer
    instructions in the IW is power hungry
  • 30% and 44% of the total IW energy is due to
    forwarding data values over the resultbus in
    SPECint95 and SPECfp95 benchmarks [Ponomarev 03]
  • NROD instructions are the majority and receive
    some/all of their data values from the resultbus.
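The bit-transition savings behind this optimization can be made concrete with a simple Hamming-distance activity model (our own assumption; real savings depend on bus capacitance and layout, as the later slides note):

```c
#include <assert.h>
#include <stdint.h>

/* Switching activity on a bus, modeled as the number of wires that
 * toggle when the bus goes from value 'prev' to value 'next'
 * (the Hamming distance between the two patterns). */
int transitions(uint64_t prev, uint64_t next)
{
    uint64_t x = prev ^ next;   /* bits that differ = wires that toggle */
    int count = 0;
    while (x) {
        count += (int)(x & 1);
        x >>= 1;
    }
    return count;
}
```

Sending a log2(N)-bit RB index instead of a full 32/64-bit result bounds the toggle count by the index width: for a 128-entry RB the index is 7 bits, so at most 7 wires switch, versus up to 64 for a full result value.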

30
Resultbus optimization
  • Distinguish between bypass network and resultbus
  • Normal result value is sent over the bypass path
  • Resultbus runs through the entire IW
    [Palacharla 97, Shen 04]
  • The resultbus optimization is applicable for
    values transferred over the resultbus (i.e. non
    consecutive execution of producer P and consumer
    C).
  • A dependent instruction that receives an index
    must access the RB to obtain its value, i.e. it
    is delayed by one cycle
  • Number of bits transferred: 2w << 32 or
    2w << 64
  • Effectiveness of resultbus optimization
  • Depends on capacitance/length
  • Depends on layout (length of buses connecting
    the RB to FUs etc.)

31
Implementation details
  • Changes in IW
  • 1 bit RBhit
  • RBindex of log2N bits
  • 1 bit RBindex_valid
  • 1 bit RBlast
  • Other changes
  • A signal/wire Resadd
  • RBlock bit vector (N bits)
  • Separating the RB locking mechanism from the RB
    will reduce complexity since the RBlock bit
    vector is likely to be accessed many times
  • If separate ALU and load RBs are used, an
    additional 1 bit RBALU_load field is required to
    indicate in which RB a hit occurred

32
Unlocking RB entries - Issues and solutions
  • Assume that tags are broadcast earlier than
    results (valid assumption) Palacharla 97, Shen
    04
  • Scenarios -
  • No dependent instruction in the IW indicated by
    no tag match
  • Only one dependent instruction in the IW
  • Multiple (one or more) dependent instructions in
    the IW
  • Some dependent instructions in the IW are
    speculative and are
  • likely to be squashed
  • e) Producer (reused) instruction is of ALU
    type and consumer
  • is a load instruction or vice versa
  • Any instruction that attempts to update a RB
    entry must examine if it is locked (access
    RBlock)
  • Except for scenario(a), a locked RB entry is
    unlocked when the dependent consumer instruction
  • C commits
  • Commit stage unlocking is required due to the
    OoO/speculative execution model

Last dependent instruction that is slated to be
squashed is converted into a NOP RBlast1spec ?
dummy Some IW entries occupied But branch
mispredicts are relatively uncommon
33
Issues and solutions
  • Power savings are due to
  • Transfer of fewer bits over the high capacitance
    resultbus
  • Resource (port) constraints imposed by the RBlock
    bit vector, which prevent some RB updates
  • RBlock is a small structure (e.g. 128 bits for a
    RB of 128 entries) and not very power hungry
  • Additional accesses to the RB occur, but do not
    contribute significantly to energy consumption
  • e.g. a dependent load instruction accessing the
    ALU RB to obtain its operand (this occurs even in
    the base IR scheme since EA computations are
    stored in the ALU RB and a load instruction uses
    this value)
  • Dependent stores accessing the RB to get the
    operand
  • Since low-level issues are not considered due to
    modeling constraints, the results presented are
    optimistic.
  • Limitations of Simplescalar and Wattch power
    models
  • Effectiveness of resultbus optimization depends
    on
  • Number of result producing instructions reused
  • Amount of bit activity reduced (depends on the
    data values produced by reused instr)
  • Distance between reused producer and consumer,
    i.e. whether the consumer is in the IW when the
    producer is reused (depends on IW size, code
    generated)

34
(Optimistic) Results
  • Loads are often reused due to high value
    locality and the fully associative load RB.
  • Values loaded from caches normally result in
    significant bit activity over the resultbus
    connecting to the cache ports

35
IR in packet processing applications
36
Motivation
  • Given an instruction and several RBs, which RB
    must be queried?
  • Opcode based selection (split RB)
  • Random, RandBest
  • Flow based selection
  • A Flow gives only partial knowledge of input
    data. Can this information be used to select one
    among many RBs?
  • How does sharing the RB among threads affect hit
    rate in a multithreaded processor?
  • constructive vs destructive interference
  • Most NPUs are multithreaded

37
Classifying packets Concept of Flows
  • Sequence of packets sent between a
    source-destination pair following the same route
    in the network
  • - <src IP addr, dst IP addr>
  • - <src IP addr, src port, dst IP addr, dst port>
  • Classifying packets into flows
  • Must be fast and can be approximate as long as it
    is consistent across all packets
  • IP address/port based requires extracting these
    fields i.e. flow based IR can make use of the
    flow information only after the above operation
    is done
  • Direction based or port based
  • Any other easily computable method based on where
    the router is located

Classification fields: IP addr (src, dst),
application port, input port, output port
Red - fields that are constant for a
session; Blue - fields that usually remain the
same; Others - vary for every packet
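Consistent flow-to-RB mapping can be sketched by hashing the flow fields above. The struct, hash, and 2-RB split follow the slides' setup; the particular hash function is an arbitrary illustration, not the thesis's classifier (which uses packet direction):

```c
#include <assert.h>
#include <stdint.h>

/* Flow key built from the header fields that stay constant for a
 * session: <src IP, dst IP, src port, dst port>. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

#define NUM_RBS 2   /* the slides map ~800 flows onto 2 RBs */

/* Select a reuse buffer for a packet. The only requirement is
 * consistency: every packet of a given flow must map to the same RB,
 * so instructions operating on similar data query the same RB. */
int select_rb(const struct flow_key *k)
{
    uint32_t h = k->src_ip ^ k->dst_ip
               ^ (((uint32_t)k->src_port << 16) | k->dst_port);
    return (int)(h % NUM_RBS);
}
```

Approximate classification is fine, as the slide says, as long as it is consistent: a wrong but stable mapping still confines a flow's data to one RB.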
38
Simplistic (and hypothetical) High level example
of how flow based IR works
  • Pipeline of processors NPU model where each PE is
    multithreaded Sherwood 03
  • Multiprogram homogeneous workload with each
    thread operating on a different packet
  • Instructions operating on packets belonging to
    different flows must access different RBs
  • Likely to be beneficial mainly in header
    processing applications

39
Flow based IR
  • What we are trying to do -
  • use multiple RBs each catering to a flow or
    set of flows
  • Exploit high level concept of flows and make
    this information visible to instructions so
    that the RB query/update is controlled by this
    information.
  • manipulate/classify the way instructions access
    the RB
  • aggregate related packets so that their data
    is consistently confined to a particular RB i.e.
    instructions operating on packets with similar
    information (flow) query the same RB
  • Processor has no idea which packet it is
    operating on
  • Flow tag is a per thread resource in a
    multithreaded processor since instructions
    operating on different packets are interleaved in
    the pipeline
  • Why it works -
  • temporal locality in network traffic makes
    classification possible
  • similarity in many packet fields - at least
    header info and sometimes payload (e.g. layer 4,
    encapsulated packets)

40
Flow based IR in SMT processor
  • RB can either be shared among threads or can
    cater to just one thread in multithreaded
    processors
  • Sharing enhances the possibility of constructive
    interference
  • Provided RB size is large enough
  • Values put into the RB by one thread may be
    reused by another thread
  • Theoretically, we can control the distance
    between threads so that the threads are mutually
    benefited and evictions from the RB are reduced
    (not done here)
  • Flow classification: ERNET trace, ~800 flows
    need to be mapped to 2 RBs
  • Non-anonymized, complete data
  • Instructions operating on packets must access one
    of 2 RBs
  • Packet flow direction used incoming vs outgoing
    pkts
  • Comparison base IR also uses 2 RBs without the
    above consistent mapping

41
Does the aggregation policy retain data
similarity within aggregate flows?
  • Similarity measurement
  • At same offset
  • Randomly select 1 pkt
  • Select the next near pkt
  • Similarity check
  • Repeat for a number of sample pairs of pkts
  • Other ways of quantifying similarity: sliding
    window (suitable for payload)
  • Need to measure similarity taking IR into
    account
  • Packets of the same flow aggregate must be
    roughly similar compared to those selected from
    different flows

42
Does the aggregation policy retain data
similarity within aggregate flows?
43
Does the aggregation policy retain data
similarity within aggregate flows?
44
Results flow based IR
  • Increasing the num of threads improves hit rate
    (except in url)
  • Flow based IR results in a hit rate similar to
    that achieved by the opcode based RB selection
    policy (especially in header processing
    applications)
  • Rand is obviously the worst
  • RandBest is the best

45
Results flow based IR
  • For url, destructive interference dominates
  • Can utilize packet flow information to achieve
    RB hit rates similar to those obtained with
    opcode based partitioning of data
  • May be employed in application specific scenarios
  • Performance improvement is the same in flow and
    opcode based selection policies

46
Impact of IR on the processor front-end
IR + front-end throttling
IR with RB updates eliminated
47
Impact of IR on the processor front-end
  • IR leads to an increase in the number of
    D-critical nodes, making the processor front-end
    the main bottleneck to performance
  • IR transfers a portion of the criticality from
    the backend to the front-end and commit stage
  • Using an aggressive instruction fetch when
    exploiting IR is likely to result in greater
    performance gains

48
Using an aggressive front-end (Performance impact)
  • 2 schemes for aggressive fetch -
  • - Increase icache blk size
  • - Fetch past multiple taken branches every cycle
  • Avg speedup (SPEC)
  • - without aggressive fetch: 1.027
  • - with aggressive fetch: 1.041
  • When IR is exploited, it is better to use an
    aggressive front-end so that the impact of
    D-critical nodes is reduced
  • Aggressive fetch => possibility of wrong path
    activity and energy loss

49
Reducing processor work using IR and throttling
  • IR reduces the total number of instructions
    that have to be executed
  • Executed instructions
  • Correct path
  • Wrong path (due to speculative execution)
    instructions are truly redundant
  • Num of incorrectly fetched instructions can
    account for up to 80% of all instr [Aragon 03]
  • Reducing wrong path activity
  • Pipeline gating [Manne 98], pipeline throttling
    [Aragon 03]
  • Techniques based on instruction traffic
    [Baniasadi 01], IPC variation [Ghiasi 00] or
    branch confidence estimators [Grunwald 98]
  • IR improves performance and could reduce wrong
    path activity
  • If a conditional branch instruction is reused,
    its outcome is known one cycle early
  • Quantify the reduction in extra work and energy
    savings achieved with IR and throttling
  • A combination of IR and throttling is likely to
    result in greater EDP savings
  • A perfect branch confidence estimator that
    consumes no extra power is assumed (though
    unrealistic, using this gives an upper bound on
    the performance/EDP gains)

50
Energy wastage due to wrong path activity
  • Front end is a significant contributor to wasted
    energy
  • Throttling needs to be applied to these stages
    (stall fetch/dispatch stages for 2 and 1 cycles
    respectively when a low-confidence branch is
    encountered)

51
IR and throttling: Impact on EDP
  • Reducing wrong path activity is necessary in
    processors with aggressive front-ends and deep
    pipelines
  • A combination of IR and throttling exploits the
    best features of both
  • The number of wrong path instructions querying
    the RB is reduced, which leads to some power
    savings

52
IR without RB updates: IRPS scheme
  • Profile benchmarks and determine the most
    frequently executed and repeated instructions
  • Allocate these in the RB and disallow dynamic RB
    updates
  • Power benefits: RB hardware complexity (e.g.
    write ports, LRU implementation) reduces
  • Choosing data values to be allocated RB entries
  • Profile ROD instructions only (using
    sim-outorder), since NROD instructions cannot be
    reused with the base IR scheme
  • Methodology similar to that discussed in
    characterizing repetition for critical
    instructions
  • Metrics
  • Unique repetition ratio (0 < URR < 1)
  • Smaller => fewer data tuples used/generated =>
    smaller working set => fewer values to be stored
    in the RB per instruction
  • Dominating tuple (0 < DT < 1)
  • Larger => more often repeated => dominates over
    other tuples => better if reused
  • URR = uniq/exe; DT_i = T_i/exe;
    maxDT = max_i(T_i/exe)
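The URR and maxDT formulas above can be computed per static instruction from its per-tuple execution counts; a small sketch (the struct and names are ours):

```c
#include <assert.h>

/* Per-instruction reuse metrics from the slide:
 *   URR   = uniq / exe           (unique repetition ratio)
 *   DT_i  = T_i / exe            (share of executions using tuple i)
 *   maxDT = max_i(T_i / exe)     (dominating tuple's share)
 * where T_i is the execution count of unique tuple i, uniq the number
 * of distinct tuples, and exe the total execution count. */
struct reuse_stats { double urr, max_dt; };

struct reuse_stats compute_stats(const int *tuple_counts, int uniq)
{
    int exe = 0, max_t = 0;
    for (int i = 0; i < uniq; i++) {
        exe += tuple_counts[i];
        if (tuple_counts[i] > max_t)
            max_t = tuple_counts[i];
    }
    struct reuse_stats s = { (double)uniq / exe, (double)max_t / exe };
    return s;
}
```

An instruction executed 100 times with tuple counts {98, 1, 1} gives URR = 0.03 and maxDT = 0.98: a small working set with one dominating tuple, exactly the kind of candidate the next slide's URR/maxDT thresholds select for static RB allocation.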

53
Instruction selection and placement
  • Number of static instructions available for
    selection is > 10% (depends on input, benchmark)
  • Instructions with URR < 0.01 and maxDT > 0.98 are
    considered as candidates to be allocated entries
    in the RB
  • A greedy algorithm selects the data values/tuples
    from the above set
  • Types of instructions most often allocated
    entries in the RB are -
  • EA computation (dominates)
  • Conditional branches
  • Load address (Alpha ISA - lda, ldah)
  • Instructions from certain frequently executed
    portions of the program are likely to be selected
    for allocation
  • Selected instructions must be placed at
    appropriate locations in the RB to allow PC based
    indexing
  • Conflicts handled by the greedy algorithm
  • Duplicate entries may be stored at different
    locations in the RB

54
Simulation results
  • IRPS with PC based indexing is inefficient
  • RB power savings occur due to the reduced number
    of updates
  • Operand based indexing reduces conflicts and may
    be better suited for accessing the RB
  • Unable to reuse instructions from different
    phases of the program
  • Frequently executed instr may not impact
    performance
  • Dynamic updates are better suited when PC based
    indexing is used
  • RB organization is important

55
Contributions
  • Comprehensive analysis of IR when energy is
    considered as a metric
  • Characterization of instruction delays and
    their predictability
  • Impact of instruction criticality on IR
  • Characterization of repetition for critical
    instructions
  • RB access/update with predicted critical
    instructions
  • Impact of processor front-end on IR
  • Resultbus optimization that exploits
    communication reuse to reduce bit activity over
    the resultbus
  • Impact of the following schemes on RB hit rate
  • Opcode based indexing vs random vs flow based
    (domain specific case study)
  • Single vs multiple RBs
  • RB management - IRPS scheme limitations
  • Impact of sharing RBs among multiple threads in a
    multithreaded processor (domain specific case
    study)

56
Conclusions and future directions
  • IR is beneficial if
  • The number of sets is given more importance than
    associativity
  • The RB is small (< 256 entries) and direct mapped
    (or 2-way associative)
  • Load IR is not exploited
  • Multiple RBs catering to a specific set of
    opcodes are employed; this also reduces decoder
    power and lookup time
  • Better indexing or eviction mechanisms are used
  • Floating point instructions are exploited
  • The performance impact of IR is more dependent on
    the underlying microarchitecture (pipeline depth,
    IW size, instruction latency etc.)
  • Power and access time issues may make block level
    IR useless (future work)
  • Certain data values may be given higher
    importance and stored in specially designed RBs
    (e.g. dynamic update + precomputed storage);
    common values <0,9> among instr
  • Criticality
  • A large number of critical instructions have to
    be buffered to capture a certain fraction of
    repetition.
  • Degree of criticality and presence of near
    critical paths may also matter.
  • It is better to exploit reuse in all possible
    instructions than only critical instructions.
  • The resultbus optimization is a promising
    technique to improve the energy efficiency of a
    processor that exploits IR.

57
References
  • [Sodani 97] A. Sodani and G. S. Sohi, Dynamic
    Instruction Reuse, ISCA-24, 1997
  • [Sodani 00] A. Sodani, PhD Thesis, University of
    Wisconsin-Madison
  • [Molina 99] Molina et al., Dynamic Removal of
    Redundant Computations, Proc. ICS, 1999
  • [Sazeides 97] Sazeides et al., The
    Predictability of Data Values, MICRO-30, 1997
  • [Yi 02] Joshua J. Yi and David J. Lilja,
    Improving Processor Performance by Simplifying
    and Bypassing Trivial Computations, ICCD, 2002
  • [Lipasti 96] Mikko H. Lipasti et al., Value
    Locality and Load Value Prediction, ASPLOS-7,
    1996
  • [Yi 01] Yi et al., An Analysis of the Potential
    for Global Level Value Reuse in the SPEC95 and
    SPEC2000 Benchmarks, Technical report, 2001
  • [Yi 02] Joshua J. Yi et al., Increasing
    Instruction-Level Parallelism with Instruction
    Precomputation, EUROPAR-8, 2002
  • [Citron 03] D. Citron and D. Feitelson, Look It
    Up or Do the Math: An Energy, Area and Timing
    Analysis of Instruction Reuse and Memoization,
    PACS, 2003
  • [Citron 02] Daniel Citron, Revisiting
    Instruction Level Reuse, WDDD, 2002
  • [Ponomarev 03] Dmitry V. Ponomarev et al.,
    Energy-Efficient Issue Queue Design, IEEE
    Transactions on VLSI Systems, Vol. 11, No. 5,
    2003
  • [Sam 05] Sam et al., On the Energy Efficiency of
    Speculative Hardware, Intl. Conf. on Computing
    Frontiers, 2005
  • [Brooks 00] Brooks et al., Wattch: A Framework
    for Architectural Level Power Analysis and
    Optimization, ISCA-27, 2000
  • [Fields 01] Fields et al., Focusing Processor
    Policies via Critical Path Prediction, ISCA-28,
    2001
  • [Sazeides 03] Sazeides et al., Instruction
    Isomorphism in Program Execution, JILP, 2003
  • [Aragon 03] Aragon et al., Power Aware Control
    Speculation Through Selective Throttling, HPCA-9,
    2003
  • [Manne 98] Manne et al., Pipeline Gating:
    Speculation Control for Energy Reduction, ISCA,
    1998
  • [Baniasadi 01] Baniasadi et al., Instruction Flow
    Based Front End Throttling for Power Aware High
    Performance Processors, ISLPED, 2001
  • [Ghiasi 00] Ghiasi et al., Using IPC Variation in
    Workloads with Externally Specified Rates for
    Power Reduction, WCED, 2001

58
Thank you