1
Power-Performance Evaluation of Instruction
Level Reuse in a Superscalar Microprocessor
  • G. Surendra
  • CAD Lab, SERC, IISc
  • Research advisor: S.K. Nandy

2
Outline
  • Motivation
  • Properties of Programs
  • Value locality
  • Instruction repetition
  • Exploiting value locality and instruction
    repetition
  • Instruction reuse (IR) (or value reuse or
    computation reuse)
  • Basics, working, types/granularity of IR
  • Baseline results
  • Thesis work
  • Using instruction criticality information while
    exploiting IR
  • Resultbus optimization
  • Flow based IR in packet processing applications

3
Motivation and contributions
  • Power as a metric (EDP actually)
  • Previous studies consider performance only
    [Sodani 00, Molina 99, Citron 02]
  • Power has been briefly considered in [Brooks 00,
    Citron 03]
  • Reduces the degree of freedom with which IR can
    be exploited
  • May make IR totally ineffective
  • E.g. value prediction schemes are not energy
    efficient [Sam 05]
  • Understanding IR using criticality information
  • Reveals the limitations of IR
  • Indicates what types of instructions one must
    concentrate on
  • RB management
  • Understand how IR affects other processor
    parameters and pipeline stages
  • Has been considered in [Sodani 00]
  • Use criticality information to explain the
    behavior
  • Indicates what type of architecture is required
    to maximize the effectiveness of IR
  • Other techniques that exploit IR to improve
    energy efficiency

4
Value locality
  • Likelihood of a previously seen value occurring
    repeatedly in the same register/memory location
  • Analogous to address locality (reoccurrence of
    address patterns)
  • Instructions operate on a small set of input
    values and produce a small set of result values
    (32-bit register => not all 2^32 different
    values are stored)
  • 22-75% of instructions produce repetitive result
    values [Lipasti 96]
  • > 50% of static instructions generate only one
    result value [Sazeides 97]
  • > 90% of static instructions generate fewer than
    64 different values
  • > 50% of dynamic instructions generate fewer than
    64 values
  • Causes
  • Register recycling, 90-10 rule of execution
  • External inputs repeat: white spaces in text
    processing, zeros in sparse matrices
  • Width of immediate operands is small => few
    values
  • Calls to certain functions (e.g. printf()) are
    normally jumps to a fixed address

5
Instruction Repetition
  • Instruction repetition execution of a dynamic
    instruction with the same operand values as a
    previous instance
  • A direct consequence of value locality
  • Repeated instructions are essentially performing
    redundant computation
  • Is more of a program property Sodani 00,
    Sazeides 03
  • Compiler has no knowledge of run-time information
    and has to make conservative decisions in
    presence of
  • pointers, indirect jumps (switch statements),
    dynamically shared objects
  • Types of repetition in result value
  • (i) Local-level repetition - inputs are same as a
    previous dynamic instance of the same instruction
    (quasi-invariants [Molina 99])
  • for (int i = 0; i < size; i++)
  •   PC_i: c[i] = a[i] + b[i]
  • (ii) Global-level repetition - inputs are same
    as a previous dynamic instance of a different
    instruction with the same opcode (quasi common
    sub-expressions [Molina 99])
  • PC_j: x = y/z
  • PC_k: r = y/p; if p == z => x == r
  • Inputs are different, but result is the same
    (e.g. compare instruction)
  • Inputs are same but result is different
    (influence of other intervening instructions
    matters, e.g. load from an address)

6
Techniques that exploit repetition
  • Memoization
  • Exploit function level repetition of arguments
  • Software technique
  • Applicable for functions with no side effects
    (e.g. Fibonacci (n))
  • Value prediction (H/W technique and speculative)
  • Value prediction to exceed the data flow limit
  • Predicting outcome of loads to reduce average
    memory latency
  • Requires validation and recovery mechanisms
  • Instruction Reuse (IR) (H/W technique)
  • HW implementation of memoization that operates at
    instruction granularity
  • Non-speculative; no validation required.
  • Does not capture as much redundancy in programs
    as VP.
  • Types
  • single instruction reuse vs block reuse
    techniques
  • Mixed H/W S/W techniques
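Memoization as described above can be sketched in C; this is a minimal illustration with an arbitrary table size of our choosing, not a structure from the thesis:

```c
#include <assert.h>
#include <stdint.h>

/* Memoization: cache the results of a side-effect-free function so a
 * repeated call with the same argument becomes a table lookup instead
 * of a recomputation. MEMO_SIZE is an arbitrary choice for this sketch. */
#define MEMO_SIZE 64
static uint64_t memo[MEMO_SIZE];   /* 0 means "not yet computed" */

static uint64_t fib(unsigned n)
{
    if (n < 2)
        return n;
    if (n < MEMO_SIZE && memo[n] != 0)
        return memo[n];            /* reuse a previously computed result */
    uint64_t r = fib(n - 1) + fib(n - 2);
    if (n < MEMO_SIZE)
        memo[n] = r;               /* remember it for the next call */
    return r;
}
```

Without the table the recursion is exponential; with it, each argument is computed once, which is exactly the function-level repetition the slide refers to.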

7
Instruction Reuse (IR)
  • Reduces the number of instructions executed on
    FUs
  • Table lookup (Reuse Buffer RB) vs execution
  • Reduces operation latency for multicycle
    instructions
  • Reduces cache accesses when loads are reused
  • Allows dependent instructions to be woken up
    early
  • Reduces wrong path activity if branches are
    reused
  • Sv scheme is practical to implement [Citron 02]
  • RB
  • Cache-like structure that holds operand values
    (Sv scheme [Sodani 97])
  • Separate ALU and Load RBs that can be accessed
    within one cycle
  • ALU RB accessed with PC; Load RB with EA (L0
    cache like)

8
Baseline IR
  • Reuse test or RB query
  • - In issue stage for ROD instructions [Sodani
    00, Citron 02]
  • (IW holds operand values - data capture
    structure; values read from the RF are sent to
    the IW)
  • - In parallel with execution for long-latency
    NROD instructions
  • - For ready instructions that are not issued
    immediately (delayed instructions)
  • RB update: instructions that missed in the RB
    (did not query due to port limitations, or
    queried and incurred misses)
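The reuse test and RB update above can be sketched as a small C model. This is our own direct-mapped simplification (the thesis baseline is 2-way set associative with separate ports); field names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an Sv-style reuse buffer: each entry holds the operand
 * values and result of a previous dynamic instance of an instruction.
 * A query hits when the tag (PC) and both operand values match. */
#define RB_ENTRIES 256

struct rb_entry {
    uint32_t pc;
    int32_t  op1, op2, result;
    int      valid;
};
static struct rb_entry rb[RB_ENTRIES];

/* Reuse test: returns 1 on a hit and writes the cached result,
 * letting the instruction skip the functional unit. */
int rb_query(uint32_t pc, int32_t op1, int32_t op2, int32_t *result)
{
    struct rb_entry *e = &rb[(pc >> 2) % RB_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;
        return 1;
    }
    return 0;
}

/* RB update: performed for instructions that missed (or never queried). */
void rb_update(uint32_t pc, int32_t op1, int32_t op2, int32_t result)
{
    struct rb_entry *e = &rb[(pc >> 2) % RB_ENTRIES];
    e->pc = pc; e->op1 = op1; e->op2 = op2;
    e->result = result; e->valid = 1;
}
```

A first execution of an instruction misses and updates the RB; a later instance with the same operands hits and reuses the stored result.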

9
Simulation environment
  • SimpleScalar simulator + Wattch power model
    (2.5V, 600MHz)
  • Benchmarks
  • SPEC (Int + FP) - Alpha and PISA ISAs
  • Media/NPU - PISA ISA
  • Metrics: Speedup, EDP
  • RB power model: cache-like [Brooks 00]
  • Most results shown are for a 256-entry 2-way set
    associative ALU RB (64_2 2 for split RB), a
    32-entry FA load RB, 8 ports (2R 1W 2 1R1W)
  • Size of ALU RB: 8KB (less than size of modern L1
    caches)
  • RB updated by committing instructions; squash
    reuse not exploited
  • Trivial computations
  • A computation whose result is 0, 1 or one of the
    inputs itself [Yi 02] (e.g. a+0, a*0, etc.)
  • Dynamic detection and elimination of trivial
    instructions is done in all simulations
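Dynamic detection of trivial computations, as in [Yi 02], can be sketched for a few integer opcodes (the opcode set and function name here are our own illustration):

```c
#include <assert.h>

/* A computation is "trivial" when its result is 0, 1, or one of the
 * inputs, so it needs no functional unit: e.g. a+0, a*0, a*1, a-a. */
enum op { ADD, SUB, MUL, DIV };

/* Returns 1 and fills *result if the operation needs no execution. */
int is_trivial(enum op op, int a, int b, int *result)
{
    switch (op) {
    case ADD: if (a == 0) { *result = b; return 1; }
              if (b == 0) { *result = a; return 1; }
              break;
    case SUB: if (b == 0) { *result = a; return 1; }
              if (a == b) { *result = 0; return 1; }
              break;
    case MUL: if (a == 0 || b == 0) { *result = 0; return 1; }
              if (a == 1) { *result = b; return 1; }
              if (b == 1) { *result = a; return 1; }
              break;
    case DIV: if (b == 1) { *result = a; return 1; }
              if (a == 0 && b != 0) { *result = 0; return 1; }
              break;
    }
    return 0;   /* not trivial: must go to a functional unit */
}
```

In the simulations, such instructions bypass the FUs entirely, which is why trivial-computation elimination is applied before measuring what IR itself contributes.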

10
  • Quantifying the effectiveness of IR

11
ROD/NROD breakup
  • Reuse opportunity is determined by the
    availability of operands early in the pipeline
  • Architecture and compiler dependent
  • Operand availability shows stability
  • Breakup includes EA computation instructions
  • Single-cycle latency instructions benefit from
    IR only if they are ROD or wait in the IW

12
Delay characteristics of ROD/NROD instr
Causes -
  • Bursts of instructions
  • Limited resources
  • Issue ports
  • Cache ports
  • Oldest-first scheduling policy
  • Program and compiler dependent

Delays show repetitive behavior and can be
predicted
13
Most often repeated instructions
  • Methodology
  • sim-safe simulator
  • Dominated by
  • - EA computation
  • - Conditional branch
  • - Conditional move
  • - Compare instructions
  • - Load address (lda, ldah), an arithmetic instr
  • Selective IR
  • Need to consider only the specified opcodes

14
Speedup
Speedups are modest; range similar to that
achieved by [Citron 02]
15
EDP
  • Load IR
  • FA RB
  • Invalidation
  • L0-cache-like
  • Dcache satisfies most accesses
  • - Identify missing instr that repeat
  • - Latency tolerance

16
Parameters affecting IPC, EDP (Plackett-Burman
expt)
  • Performance gains with IR depend on the
    underlying microarchitecture (latency, pipeline
    stages skipped), the number of instructions
    reused (limited by RB organization, management),
    and the criticality of instructions reused
  • Num of RB sets is more important than
    associativity
  • Lower associativity => better energy efficiency
  • Load IR does not affect IPC significantly
  • Actual values (sensitivity analysis) not shown

17
  • Understanding IR using instruction
    criticality information

18
Critical path model [Fields 01]
  • DD: in-order dispatch; CD: finite
    IW; ED: control dependence
  • DE: execution follows dispatch (if last-arriving
    edge => ROD instr)
  • EE: data dependence (if last-arriving edge =>
    NROD instr)
  • EC: commit follows execution; CC:
    in-order commit
  • An instruction can be Execute (E),
    Fetch/Dispatch (D) or Commit (C) critical, or may
    be critical due to a combination of the above
  • The critical path model of Fields [Fields 01] is
    used to identify critical instructions
  • Perfect predictor (based on instruction trace)
  • Criticality predictor (avg prediction accuracy
    of 88% for SPEC2000)

19
Fraction of E-critical instructions reused with
baseline IR
  • Most criticality is due to D-nodes
  • 80% of E-crit instr are NROD
  • If an ROD instr is critical, it can only be due
    to ED (control dependence), DE and EC edges (EE
    edges (data dependence) will not contribute)
  • Most instr that query the RB in the issue stage
    are therefore non-critical
  • Only 2-3% of E-crit instr are reused
  • A modest fraction of E-crit instr are trivial
    computations (probably explains the 6% speedup
    achieved by [Yi 02])
  • IR reduces execution latency and may be viewed
    as a technique that directly impacts
    E-criticality
  • - With IR, the processor backend consumes
    instructions faster than normal
  • - Puts pressure on the front-end to fetch
    instructions quickly
  • With dynamic scheduling, D and C critical nodes
    are also affected

20
Fraction of D-critical instructions reused with
baseline IR

21
Characterizing instruction repetition for
E-critical instructions
  • Very few D/E-critical instructions are reused
    with the base IR policy
  • Poor repetition of data values for critical
    instructions (characterize this)
  • Poor management of the limited sized RB due to
    which few critical instructions access/update the
    RB
  • A combination of both
  • Use a timing simulator (sim-outorder) for
    evaluation
  • Definitions
  • Dynamic instruction repetition
  • Static instruction repetition
  • Unique repeatable instance

22
Characterizing instruction repetition for
E-critical instructions
  • I12,I13,I21 are dynamic repeated instructions
  • I1, I2 are static repeated instructions
  • Data tuples associated with I12, I13 and I21 are
    the unique repeatable instances
    (<13,14,27>, <15,16,31>, <21,22,43>)
  • Methodology
  • Buffer instructions that are marked critical by
    the tracer
  • Buffer data and other information

23
Repetition of E-critical instructions
  • Static instr repeated =>
    (I1+I2)/(I1+I2+I3) = 2/3
  • Dynamic instr repeated =>
    (I12+I13+I21)/(I11..I32) = 3/6
  • There is significant repetition in E-critical
    instructions (i.e. instructions that don't repeat
    - I11, I31 and I32 - are rare)
  • However, certain instructions may contribute more
    to repeatability than others (e.g. I1 produces 6
    of the 8 dynamic repeated instructions and I2
    only 2)
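The dynamic-repetition count used in these slides can be reproduced with a toy tracer. The struct layout and the quadratic history scan are our own simplification; the thesis buffers instructions marked critical by the sim-outorder tracer:

```c
#include <assert.h>
#include <stdint.h>

/* One dynamic instruction instance: static identity (PC) plus its
 * operand tuple. */
struct inst { uint32_t pc; int op1, op2; };

/* A dynamic instance is "repeated" if the same static instruction
 * (same PC) already occurred earlier in the trace with the same
 * operand tuple (local-level repetition). O(n^2) scan for clarity;
 * a real tracer would use a hash table keyed on <pc, op1, op2>. */
int count_repeated(const struct inst *trace, int n)
{
    int repeated = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            if (trace[j].pc  == trace[i].pc  &&
                trace[j].op1 == trace[i].op1 &&
                trace[j].op2 == trace[i].op2) {
                repeated++;    /* this instance is a repeat */
                break;
            }
        }
    }
    return repeated;
}
```

Dividing this count by the trace length gives the dynamic-repetition fraction quoted on the slide (e.g. 3/6 in the I11..I32 example).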

24
Static instruction coverage of dynamic repetition
  • Art: 70% of the repeated static instructions
    are responsible for 90% of dynamic repetition
    (repetition is more common when all instructions
    are considered)
  • Certain instructions tend to contribute more
    significantly to dynamic repetition than others
    (indicated by the steep slope)
  • These instructions must ideally be allocated
    entries in the RB (e.g. I1 with data tuple I12
    in the example)
  • However, the number of different values
    produced/used by instructions and collisions
    (index to same entry) also impact the
    RB hit rate (especially with a small RB and power
    as a constraint)

25
Contribution of unique repeatable instances to
dynamic repetition
e.g. instance 1 => I21: 2/8 = 25%
instance 2 => I12 + I13: (4+2)/8 = 75%
26
RB management using criticality information
  • Try to improve performance by allowing only
    instructions predicted to be critical to
    access/update the RB

Scheme1 - IR_Qrod_Umissed Scheme2 -
IR_Qrodcrit_Umissedcrit Scheme3 -
IR_Qrod_Qdelay_Umissed Scheme4 -
IR_Qrodcrit_Qdelaycrit_Umissedcrit
Performance degrades slightly when criticality
info is used. Note: the power consumed by the
criticality predictor itself is ignored; the
criticality predictor acts only as a
filter. Conclusion: it is better to reuse all types
of instructions rather than a few predicted
critical instructions, especially with a limited RB
size and a large working set
27
Impact of IR on the processor front-end
IR + front-end throttling
IR with RB updates eliminated
28
  • Improving the energy efficiency of IR:
    the resultbus optimization

29
The idea
  • The resultbus optimization is based on the fact
    that result values of reused instructions are
    already present in the RB. So, dependent
    instructions can receive their operand(s) by
    reading a RB entry.
  • Instead of sending the full 32/64 bit result
    values over a high capacitance result bus, we
    send only a small index (which indicates where in
    the RB the value concerned may be found) over the
    bus.
  • The optimization is applicable only for result
    producing instructions that are reused (e.g.
    branches are not candidates)
  • Power savings are achieved due to lower bit
    transition activity over the high capacitance
    resultbus, and also because reading the RB entry
    is not very expensive (small RB, read is not a
    query).
  • The idea is similar to bus encoding, but exploits
    IR to achieve energy savings
  • Why reducing IW power may be effective
  • The IW dissipates as much as 25% of the total
    microprocessor power [Folegnani 01]
  • Forwarding data values from producer to consumer
    instructions in the IW is power hungry
  • 30% and 44% of the total IW energy is due to
    forwarding data values over the resultbus in
    SPECint95 and SPECfp95 benchmarks [Ponomarev 03]
  • NROD instructions are the majority and receive
    some/all of their data values from the resultbus.
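The bit-transition savings behind this optimization can be made concrete with a simple Hamming-distance activity model (our own assumption; real savings depend on bus capacitance and layout, as the later slides note):

```c
#include <assert.h>
#include <stdint.h>

/* Switching activity on a bus, modeled as the number of wires that
 * toggle when the bus goes from value 'prev' to value 'next'
 * (the Hamming distance between the two patterns). */
int transitions(uint64_t prev, uint64_t next)
{
    uint64_t x = prev ^ next;   /* bits that differ = wires that toggle */
    int count = 0;
    while (x) {
        count += (int)(x & 1);
        x >>= 1;
    }
    return count;
}
```

Sending a log2(N)-bit RB index instead of a full 32/64-bit result bounds the toggle count by the index width: for a 128-entry RB the index is 7 bits, so at most 7 wires switch, versus up to 64 for a full result value.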

30
Resultbus optimization
  • Distinguish between bypass network and resultbus
  • Normal result value is sent over the bypass path
  • Resultbus runs through the entire IW
    [Palacharla 97, Shen 04]
  • The resultbus optimization is applicable for
    values transferred over the resultbus (i.e. non
    consecutive execution of producer P and consumer
    C).
  • A dependent instruction that receives an index
    must access the RB to obtain its value, i.e. it
    is delayed by one cycle
  • Number of bits transferred: 2w << 32 or
    2w << 64
  • Effectiveness of resultbus optimization
  • Depends on capacitance/length
  • Depends on layout (length of buses connecting
    the RB to FUs etc.)

31
Implementation details
  • Changes in IW
  • 1 bit RBhit
  • RBindex of log2N bits
  • 1 bit RBindex_valid
  • 1 bit RBlast
  • Other changes
  • A signal/wire Resadd
  • RBlock bit vector (N bits)
  • Separating the RB locking mechanism from the RB
    will reduce complexity since the RBlock bit
    vector is likely to be accessed many times
  • If separate ALU and load RBs are used, an
    additional 1 bit RBALU_load field is required to
    indicate in which RB a hit occurred

32
Unlocking RB entries - Issues and solutions
  • Assume that tags are broadcast earlier than
    results (valid assumption) Palacharla 97, Shen
    04
  • Scenarios -
  • No dependent instruction in the IW indicated by
    no tag match
  • Only one dependent instruction in the IW
  • Multiple (one or more) dependent instructions in
    the IW
  • Some dependent instructions in the IW are
    speculative and are
  • likely to be squashed
  • e) Producer (reused) instruction is of ALU
    type and consumer
  • is a load instruction or vice versa
  • Any instruction that attempts to update a RB
    entry must examine if it is locked (access
    RBlock)
  • Except for scenario(a), a locked RB entry is
    unlocked when the dependent consumer instruction
  • C commits
  • Commit stage unlocking is required due to the
    OoO/speculative execution model

Last dependent instruction that is slated to be
squashed is converted into a NOP RBlast1spec ?
dummy Some IW entries occupied But branch
mispredicts are relatively uncommon
33
Issues and solutions
  • Power savings are due to
  • Transfer of fewer bits over the high capacitance
    resultbus
  • Resource (port) constraints imposed by the RBlock
    bit vector, which prevent some RB updates
  • RBlock is a small structure (e.g. 128 bits for a
    RB of 128 entries) and not very power hungry
  • Additional accesses to the RB occur, but do not
    contribute significantly to energy consumption
  • e.g. a dependent load instruction accessing the
    ALU RB to obtain its operand (this occurs even in
    the base IR scheme since EA computations are
    stored in the ALU RB and a load instruction uses
    this value)
  • Dependent stores accessing the RB to get the
    operand
  • Since low-level issues are not considered due to
    modeling constraints, the results presented are
    optimistic.
  • Limitations of Simplescalar and Wattch power
    models
  • Effectiveness of resultbus optimization depends
    on
  • Number of result producing instructions reused
  • Amount of bit activity reduced (depends on the
    data values produced by reused instr)
  • Distance between reused producer and consumer,
    i.e. whether the consumer is in the IW when the
    producer is reused (depends on IW size, code
    generated)

34
(Optimistic) Results
  • Loads are often reused due to high value
    locality and the fully associative load RB.
  • Values loaded from caches normally result in
    significant bit activity over the resultbus
    connecting to the cache ports

35
IR in packet processing applications
36
Motivation
  • Given an instruction and several RBs, which RB
    must be queried?
  • Opcode based selection (split RB)
  • Random, RandBest
  • Flow based selection
  • A Flow gives only partial knowledge of input
    data. Can this information be used to select one
    among many RBs?
  • How does sharing the RB among threads affect hit
    rate in a multithreaded processor?
  • constructive vs destructive interference
  • Most NPUs are multithreaded

37
Classifying packets Concept of Flows
  • Sequence of packets sent between a
    source-destination pair following the same route
    in the network
  • - <src IP addr, dst IP addr>
  • - <src IP addr, src port, dst IP addr, dst port>
  • Classifying packets into flows
  • Must be fast and can be approximate as long as it
    is consistent across all packets
  • IP address/port based requires extracting these
    fields i.e. flow based IR can make use of the
    flow information only after the above operation
    is done
  • Direction based or port based
  • Any other easily computable method based on where
    the router is located

Classification fields: IP addr (src, dst),
application port, input port, output port
Red - fields that are constant for a
session; Blue - fields that usually remain the
same; Others - vary for every packet
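Consistent flow-to-RB mapping can be sketched by hashing the flow fields above. The struct, hash, and 2-RB split follow the slides' setup; the particular hash function is an arbitrary illustration, not the thesis's classifier (which uses packet direction):

```c
#include <assert.h>
#include <stdint.h>

/* Flow key built from the header fields that stay constant for a
 * session: <src IP, dst IP, src port, dst port>. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

#define NUM_RBS 2   /* the slides map ~800 flows onto 2 RBs */

/* Select a reuse buffer for a packet. The only requirement is
 * consistency: every packet of a given flow must map to the same RB,
 * so instructions operating on similar data query the same RB. */
int select_rb(const struct flow_key *k)
{
    uint32_t h = k->src_ip ^ k->dst_ip
               ^ (((uint32_t)k->src_port << 16) | k->dst_port);
    return (int)(h % NUM_RBS);
}
```

Approximate classification is fine, as the slide says, as long as it is consistent: a wrong but stable mapping still confines a flow's data to one RB.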
38
Simplistic (and hypothetical) High level example
of how flow based IR works
  • Pipeline of processors NPU model where each PE is
    multithreaded Sherwood 03
  • Multiprogram homogeneous workload with each
    thread operating on a different packet
  • Instructions operating on packets belonging to
    different flows must access different RBs
  • Likely to be beneficial mainly in header
    processing applications

39
Flow based IR
  • What we are trying to do -
  • use multiple RBs each catering to a flow or
    set of flows
  • Exploit high level concept of flows and make
    this information visible to instructions so
    that the RB query/update is controlled by this
    information.
  • manipulate/classify the way instructions access
    the RB
  • aggregate related packets so that their data
    is consistently confined to a particular RB i.e.
    instructions operating on packets with similar
    information (flow) query the same RB
  • Processor has no idea which packet it is
    operating on
  • Flow tag is a per thread resource in a
    multithreaded processor since instructions
    operating on different packets are interleaved in
    the pipeline
  • Why it works -
  • temporal locality in network traffic makes
    classification possible
  • similarity in many packet fields - at least
    header info and sometimes payload (e.g. layer 4,
    encapsulated packets)

40
Flow based IR in SMT processor
  • RB can either be shared among threads or can
    cater to just one thread in multithreaded
    processors
  • Sharing enhances the possibility of constructive
    interference
  • Provided RB size is large enough
  • Values put into the RB by one thread may be
    reused by another thread
  • Theoretically, we can control the distance
    between threads so that the threads are mutually
    benefited and evictions from the RB are reduced
    (not done here)
  • Flow classification: ERNET trace, ~800 flows
    need to be mapped to 2 RBs
  • Non-anonymized, complete data
  • Instructions operating on packets must access one
    of 2 RBs
  • Packet flow direction used incoming vs outgoing
    pkts
  • Comparison base IR also uses 2 RBs without the
    above consistent mapping

41
Does the aggregation policy retain data
similarity within aggregate flows?
  • Similarity measurement
  • At same offset
  • Randomly select 1 pkt
  • Select the next near pkt
  • Similarity check
  • Repeat for a number of sample pairs of pkts
  • Other ways of quantifying similarity: sliding
    window (suitable for payload)
  • Need to measure similarity taking IR into
    account
  • Packets of the same flow aggregate must be
    roughly similar compared to those selected from
    different flows

42
Does the aggregation policy retain data
similarity within aggregate flows?
43
Does the aggregation policy retain data
similarity within aggregate flows?
44
Results flow based IR
  • Increasing the num of threads improves hit rate
    (except in url)
  • Flow based IR results in a hit rate similar to
    that achieved by the opcode based RB selection
    policy (especially in header processing
    applications)
  • Rand is obviously the worst
  • RandBest is the best

45
Results flow based IR
  • For url, destructive interference dominates
  • Can utilize packet flow information to achieve
    RB hit rates similar to those obtained with
    opcode based partitioning of data
  • May be employed in application specific scenarios
  • Performance improvement is the same in flow and
    opcode based selection policies

46
Impact of IR on the processor front-end
IR + front-end throttling
IR with RB updates eliminated
47
Impact of IR on the processor front-end
  • IR leads to an increase in the number of
    D-critical nodes, making the processor front-end
    the main bottleneck to performance
  • IR transfers a portion of the criticality from
    the backend to the front-end and commit stage
  • Using an aggressive instruction fetch when
    exploiting IR is likely to result in greater
    performance gains

48
Using an aggressive front-end (Performance impact)
  • 2 schemes for aggressive fetch -
  • - Increase icache blk size
  • - Fetch past multiple taken branches every cycle
  • Avg speedup (SPEC)
  • - without aggressive fetch: 1.027
  • - with aggressive fetch: 1.041
  • When IR is exploited, it is better to use an
    aggressive front-end so that the impact of
    D-critical nodes is reduced
  • Aggressive fetch => possibility of wrong path
    activity and energy loss

49
Reducing processor work using IR and throttling
  • IR reduces the total number of instructions
    that have to be executed
  • Executed instructions
  • Correct path
  • Wrong path (due to speculative execution)
    instructions are truly redundant
  • Num of incorrectly fetched instructions can
    account for up to 80% of all instr [Aragon 03]
  • Reducing wrong path activity
  • Pipeline gating [Manne 98], pipeline throttling
    [Aragon 03]
  • Techniques based on instruction traffic
    [Baniasadi 01], IPC variation [Ghiasi 00] or
    branch confidence estimators [Grunwald 98]
  • IR improves performance and could reduce wrong
    path activity
  • If a conditional branch instruction is reused,
    its outcome is known one cycle early
  • Quantify the reduction in extra work and energy
    savings achieved with IR and throttling
  • A combination of IR and throttling is likely to
    result in greater EDP savings
  • A perfect branch confidence estimator that
    consumes no extra power is assumed (though
    unrealistic, using this gives an upper bound on
    the performance/EDP gains)

50
Energy wastage due to wrong path activity
  • Front end is a significant contributor to wasted
    energy
  • Throttling needs to be applied to these stages
    (stall fetch/dispatch stages for 2 and 1 cycles
    respectively when a low-confidence branch is
    encountered)

51
IR and throttling: Impact on EDP
  • Reducing wrong path activity is necessary in
    processors with aggressive front-ends and deep
    pipelines
  • A combination of IR and throttling exploits the
    best features of both
  • The number of wrong path instructions querying
    the RB is reduced, which leads to some power
    savings

52
IR without RB updates: IRPS scheme
  • Profile benchmarks and determine the most
    frequently executed and repeated instructions
  • Allocate these in the RB and disallow dynamic RB
    updates
  • Power benefits: RB hardware complexity (e.g.
    write ports, LRU implementation) reduces
  • Choosing data values to be allocated RB entries
  • Profile ROD instructions only (using
    sim-outorder), since NROD instructions cannot be
    reused with the base IR scheme
  • Methodology similar to that discussed in
    characterizing repetition for critical
    instructions
  • Metrics
  • Unique repetition ratio (0 < URR < 1)
  • Smaller => fewer data tuples used/generated =>
    smaller working set => fewer values to be stored
    in the RB per instruction
  • Dominating tuple (0 < DT < 1)
  • Larger => more often repeated => dominates over
    other tuples => better if reused
  • URR = uniq/exe; DT_i = T_i/exe;
    maxDT = max_i(T_i/exe)
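The URR and maxDT formulas above can be computed per static instruction from its per-tuple execution counts; a small sketch (the struct and names are ours):

```c
#include <assert.h>

/* Per-instruction reuse metrics from the slide:
 *   URR   = uniq / exe           (unique repetition ratio)
 *   DT_i  = T_i / exe            (share of executions using tuple i)
 *   maxDT = max_i(T_i / exe)     (dominating tuple's share)
 * where T_i is the execution count of unique tuple i, uniq the number
 * of distinct tuples, and exe the total execution count. */
struct reuse_stats { double urr, max_dt; };

struct reuse_stats compute_stats(const int *tuple_counts, int uniq)
{
    int exe = 0, max_t = 0;
    for (int i = 0; i < uniq; i++) {
        exe += tuple_counts[i];
        if (tuple_counts[i] > max_t)
            max_t = tuple_counts[i];
    }
    struct reuse_stats s = { (double)uniq / exe, (double)max_t / exe };
    return s;
}
```

An instruction executed 100 times with tuple counts {98, 1, 1} gives URR = 0.03 and maxDT = 0.98: a small working set with one dominating tuple, exactly the kind of candidate the next slide's URR/maxDT thresholds select for static RB allocation.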

53
Instruction selection and placement
  • Number of static instructions available for
    selection is > 10% (depends on input, benchmark)
  • Instructions with URR < 0.01 and maxDT > 0.98 are
    considered as candidates to be allocated entries
    in the RB
  • A greedy algorithm selects the data values/tuples
    from the above set
  • Types of instructions most often allocated
    entries in the RB are -
  • EA computation (dominates)
  • Conditional branches
  • Load address (Alpha ISA - lda, ldah)
  • Instructions from certain frequently executed
    portions of the program are likely to be selected
    for allocation
  • Selected instructions must be placed at
    appropriate locations in the RB to allow PC based
    indexing
  • Conflicts handled by the greedy algorithm
  • Duplicate entries may be stored at different
    locations in the RB

54
Simulation results
  • IRPS with PC based indexing is inefficient
  • RB power savings occur due to the reduced number
    of updates
  • Operand based indexing reduces conflicts and may
    be better suited for accessing the RB
  • Unable to reuse instructions from different
    phases of the program
  • Frequently executed instr may not impact
    performance
  • Dynamic updates are better suited when PC based
    indexing is used
  • RB organization is important

55
Contributions
  • Comprehensive analysis of IR when energy is
    considered as a metric
  • Characterization of instruction delays and
    their predictability
  • Impact of instruction criticality on IR
  • Characterization of repetition for critical
    instructions
  • RB access/update with predicted critical
    instructions
  • Impact of processor front-end on IR
  • Resultbus optimization that exploits
    communication reuse to reduce bit activity over
    the resultbus
  • Impact of the following schemes on RB hit rate
  • Opcode based indexing vs random vs flow based
    (domain specific case study)
  • Single vs multiple RBs
  • RB management - IRPS scheme limitations
  • Impact of sharing RBs among multiple threads in a
    multithreaded processor (domain specific case
    study)

56
Conclusions and future directions
  • IR is beneficial if
  • The number of sets is given more importance than
    associativity
  • The RB is small (< 256 entries) and direct mapped
    (or 2-way associative)
  • Load IR is not exploited
  • Multiple RBs catering to a specific set of
    opcodes are employed; this also reduces decoder
    power and lookup time
  • Better indexing or eviction mechanisms are used
  • Floating point instructions are exploited
  • The performance impact of IR is more dependent on
    the underlying microarchitecture (pipeline depth,
    IW size, instruction latency etc.)
  • Power and access time issues may make block level
    IR useless (future work)
  • Certain data values may be given higher
    importance and stored in specially designed RBs
    (e.g. dynamic update + precomputed storage);
    common values <0,9> among instr
  • Criticality
  • A large number of critical instructions have to
    be buffered to capture a certain fraction of
    repetition.
  • Degree of criticality and presence of near
    critical paths may also matter.
  • It is better to exploit reuse in all possible
    instructions than only critical instructions.
  • The resultbus optimization is a promising
    technique to improve the energy efficiency of a
    processor that exploits IR.

57
References
  • [Sodani 97] A. Sodani and G. S. Sohi, Dynamic
    Instruction Reuse, ISCA-24, 1997
  • [Sodani 00] A. Sodani, PhD Thesis, University of
    Wisconsin-Madison
  • [Molina 99] Molina et al., Dynamic Removal of
    Redundant Computations, Proc. ICS, 1999
  • [Sazeides 97] Sazeides et al., The
    Predictability of Data Values, MICRO-30, 1997
  • [Yi 02] Joshua J. Yi and David J. Lilja,
    Improving Processor Performance by Simplifying
    and Bypassing Trivial Computations, ICCD, 2002
  • [Lipasti 96] Mikko H. Lipasti et al., Value
    Locality and Load Value Prediction, ASPLOS-7,
    1996
  • [Yi 01] Yi et al., An Analysis of the Potential
    for Global Level Value Reuse in the SPEC95 and
    SPEC2000 Benchmarks, Technical report, 2001
  • [Yi 02] Joshua J. Yi et al., Increasing
    Instruction-Level Parallelism with Instruction
    Precomputation, EUROPAR-8, 2002
  • [Citron 03] D. Citron and D. Feitelson, Look It
    Up or Do the Math: An Energy, Area and Timing
    Analysis of Instruction Reuse and Memoization,
    PACS, 2003
  • [Citron 02] Daniel Citron, Revisiting
    Instruction Level Reuse, WDDD, 2002
  • [Ponomarev 03] Dmitry V. Ponomarev et al.,
    Energy-Efficient Issue Queue Design, IEEE
    Transactions on VLSI Systems, Vol. 11, No. 5,
    2003
  • [Sam 05] Sam et al., On the Energy Efficiency of
    Speculative Hardware, Intl. Conf. on Computing
    Frontiers, 2005
  • [Brooks 00] Brooks et al., Wattch: A Framework
    for Architectural Level Power Analysis and
    Optimization, ISCA-27, 2000
  • [Fields 01] Fields et al., Focusing Processor
    Policies via Critical Path Prediction, ISCA-28,
    2001
  • [Sazeides 03] Sazeides et al., Instruction
    Isomorphism in Program Execution, JILP, 2003
  • [Aragon 03] Aragon et al., Power Aware Control
    Speculation Through Selective Throttling, HPCA-9,
    2003
  • [Manne 98] Manne et al., Pipeline Gating:
    Speculation Control for Energy Reduction, ISCA,
    1998
  • [Baniasadi 01] Baniasadi et al., Instruction Flow
    Based Front End Throttling for Power Aware High
    Performance Processors, ISLPED, 2001
  • [Ghiasi 00] Ghiasi et al., Using IPC Variation in
    Workloads with Externally Specified Rates for
    Power Reduction, WCED, 2001

58
Thank you