1
Nektarios Paisios.
  • An Overview of the Techniques of Space and Energy
    Reduction using data compression.

2
Introduction
  • Data compression: a technique of data reduction.
  • Space is costly.
  • Study: "The cost is not for the processor but for
    the memory."
  • In the past, memory provided enough space for the
    then-current application footprint,
  • but disk space was too small to hold the data.
  • Compression: an old method of saving disk space.
  • 1994 advertisement: "Up to 50-100% more free disk
    space."

3
Data compression: why?
  • Explosive growth of disk space and a drop in prices.
  • But network links are still slow, and processors
    will soon reach their chip limit according to
    Moore's Law (2010),
  • e.g. 2 GHz 4 years ago, 3 GHz 2 years ago, what
    now?
  • New methods of speeding up need to be invented:
  • by bringing data closer to the processor, by
    providing data faster, by making predictions
    more accurate.
  • Compression can:
  • Store more data in caches: closer to the processor.
  • Store more data in predictors: more accurate
    predictions.
  • But faster?

4
Data compression: why?
  • Clusters of computers built out of commodity
    equipment can work wonders.
  • Lower cost due to commoditisation, but
  • more energy is needed, and more energy means more
    cooling.
  • Compression can:
  • reduce data structures, and so lower energy
    requirements,
  • but maintain equal performance.

5
Data compression: what?
  • Two forms: lossless and lossy.
  • Pictures and music: lossy. Other files: lossless.
  • Both are useful in processors.
  • Data caches: lossless, because of program accuracy
    and integrity.
  • Predictors: lossy, up to an acceptable point. Why?
  • Lossy is faster.
  • Prediction needs to be faster than actual program
    execution.

6
Data compression: how?
  • Commonest method:
  • Finds common patterns.
  • Isolates them.
  • Replaces them with a pointer.
  • Example: "The fat cat sat like that."
  • "at" followed by the space is a common pattern.
  • Three techniques proposed in processors:
  • pattern matching,
  • pattern differentiation,
  • common repeating bit elimination.
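
The "commonest method" above can be sketched in a few lines. This is only an illustration of the idea, not a scheme taken from any of the cited work: the names PATTERN and POINTER and the choice of a one-character pointer are assumptions made for the example.

```python
# Illustrative sketch: replace a common pattern with a one-character pointer.
# PATTERN and POINTER are hypothetical choices made for this example.
PATTERN = "at "      # "at" followed by a space
POINTER = "\x01"     # a code that does not occur in the text


def compress(text: str) -> str:
    """Replace every occurrence of the common pattern with the pointer."""
    assert POINTER not in text, "pointer must not collide with real data"
    return text.replace(PATTERN, POINTER)


def decompress(packed: str) -> str:
    """Expand every pointer back into the pattern it stands for."""
    return packed.replace(POINTER, PATTERN)


sentence = "The fat cat sat like that."
packed = compress(sentence)
print(len(sentence), "characters before,", len(packed), "after")
```

The final "that." is not shortened because its "at" is followed by a full stop rather than a space, which shows how much the result depends on which pattern is chosen.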

7
Three techniques.
  • 1. Pattern matching:
  • Produces a dictionary of common items.
  • But:
  • How to make the dictionary (what to choose)?
  • When to update it?
  • How big is the dictionary (speed)?
  • 2. Pattern differentiation:
  • Finds common changes: increments and decrements.
  • Used when we have a series of data with an expected
    dispersion,
  • e.g. value predictors.
  • Can it be used in other cases?
  • 3. Common repeating bit elimination:
  • Large memory blocks are all zeros.
  • A series of 0s or a series of 1s can be replaced
    with a code.
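
Techniques 2 and 3 can be illustrated with a minimal software sketch. The function names are hypothetical and the code shows only the principle (store changes instead of values, store run lengths instead of repeated bits), not any hardware design from the cited papers.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then only the change from each value to the next."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original series by accumulating the stored changes."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(bits: str):
    """Replace each run of identical bits with a (bit, run length) code."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def run_length_decode(codes):
    """Expand each (bit, run length) code back into the run it describes."""
    return "".join(bit * count for bit, count in codes)
```

A series with a regular stride, such as 100, 104, 108, 112, becomes one base value followed by small, repeated increments, which is why this approach suits value predictors.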

8
Example 1: Compression in caches and memory.
  • From Technical Report 1500, Computer Sciences
    Dept., UW-Madison, April 2004
  • Aims:
  • Increase effective memory size,
  • reduce memory address and data bandwidth,
  • increase effective cache size.
  • Three approaches: dictionary, differential,
    significance.
  • Dictionary: common patterns are stored in a
    separate table and a pointer to them is placed in
    the compressed data.
  • Differential: the common patterns are stored with
    the compressed data, together with a list of
    differences amongst the various data parts.
  • Significance: not all bits are required, and the
    upper ones are usually zero.

9
Dictionary-based compression in main memory.
  • From Technical Report 1500, Computer Sciences
    Dept., UW-Madison, April 2004
  • IBM's memory compression.
  • IBM's MXT technology [26] employs real-time
    main-memory content compression.
  • Effectively doubles memory.
  • Implemented in the Pinnacle chip, a single-chip
    memory controller.
  • Franaszek et al.: CRAM (MXT).
  • Kjelso et al.: X-Match hardware compression
    (4-byte entries).
  • Lempel-Ziv (LZ77) sequential algorithm.
  • Block-Referential Compression with Directory
    Sharing:
  • divides the input data block (1 KB in MXT) into
    sub-blocks,
  • four 256-byte sub-blocks, and cooperatively
    constructs dictionaries while compressing all
    sub-blocks in parallel.

10
Dictionary-based compression in caches.
  • Lee et al. selectively compress L2 cache and
    memory blocks if they can be reduced to half their
    original size.
  • (SCMS) uses the X-RL compression algorithm,
    similar to X-Match.
  • Speed considerations?
  • Parallel decompression.
  • Selective compression: a block is not compressed
    if it is not worth it.
  • Chen et al. divide the cache into sections of
    different compressibility.
  • Use of an LZ algorithm.
11
Dictionary-based compression in caches.
  • Frequent-value-based compression.
  • Yang and Gupta analysed the SPECint95
    benchmarks.
  • Discovered that a small number of distinct
    values occupy a large fraction of memory access
    values.
  • This value locality enabled the design of
    energy-efficient caches and data-compressed caches.
  • How? Each line in the L1 cache can hold either one
    uncompressed line or two lines compressed to at
    least half based on frequent values.
  • Zhang et al.: a value-centric data cache design
    called the frequent value cache (FVC).
  • Added a small direct-mapped cache holding values
    frequently found in the benchmarks.
  • Greatly reduces the cache miss rate.
  • Is this the right approach?

12
Differential-based compression in caches.
  • Benini et al.: uncompressed caches but
    compressed memory.
  • Assumption: data words in the same cache line are
    likely to have some bits in common.
  • Zhang and Gupta added 6 new data compression
    instructions to MIPS.
  • New instructions:
  • compress 32-bit data and addresses into 15 bits,
  • by common prefixes and narrow data transformations.

13
Significance-Based Compression.
  • Most significant bits are shared amongst data and
    instruction data addresses.
  • Addresses: why transfer long addresses with
    repeating patterns?
  • Farrens and Park: "many address references
    transferred between processor and memory have
    redundant information in their high-order (most
    significant) portions".
  • Solution: cache these high-order bits in a group
    of dynamically allocated base registers,
  • only transfer small register indexes rather than
    the high-order address bits between the processor
    and memory.
  • Also, Citron and Rudolph store common high-order
    bits in address and data words in a table,
  • and transfer only an index plus the low-order bits
    between the processor and memory.
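
A minimal sketch of the base-register idea, seen from the sending side. The 16-bit split, the four registers and the oldest-first replacement are all assumptions chosen for illustration; the slides do not give the parameters used by Farrens and Park or by Citron and Rudolph.

```python
LOW_BITS = 16          # hypothetical split: low 16 bits are always sent
NUM_BASE_REGS = 4      # hypothetical number of base registers


class AddressCompressor:
    """Sender side; the receiver would keep an identical register table."""

    def __init__(self):
        self.base_regs = []   # cached high-order parts, in allocation order

    def send(self, address: int):
        high = address >> LOW_BITS
        low = address & ((1 << LOW_BITS) - 1)
        if high in self.base_regs:
            # Hit: transfer a small register index plus the low-order bits.
            return ("index", self.base_regs.index(high), low)
        # Miss: transfer the full address and remember its high-order part.
        if len(self.base_regs) == NUM_BASE_REGS:
            self.base_regs.pop(0)          # evict the oldest entry
        self.base_regs.append(high)
        return ("full", address)
```

With these assumed sizes, a hit sends a 2-bit index and 16 low-order bits instead of 32 address bits.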

14
Significance-Based Compression.
  • Canal et al. compress addresses and
    instructions.
  • Keep only the significant bytes.
  • Maintain two to three extension bits to record
    the significant byte positions.
  • Results: reduces power consumption in the
    pipeline.
  • Kant and Iyer: the most significant bits of an
    address can be predicted with high accuracy,
    whilst data with lower accuracy.
  • Simple solution:
  • Compress individual cache lines on a word-by-word
    basis by storing common word patterns in a
    compressed format.
  • Store each word with an appropriate prefix.

15
Significance-Based Compression.
  • Significant bits of processor structure entries
    are the same or are to be found in a small data
    set.
  • BTB: a 256-entry table can store 99% of the
    higher bits.
  • Data bits: why have multiple instances of them in
    every BTB, cache, etc. entry?
  • Solution: use multiple tables with different
    sizes,
  • use pointers amongst the different table levels.

16
Frequent value caches: how do they work?
  • They work as follows:
  • The cache is divided into two arrays,
  • one holding, say, the 5 lower bits and the other
    the 27 upper bits.
  • If the lower 5 bits belong to a frequent value,
    the remaining 27 bits are not read from the cache;
    they are read instead from a smaller high-speed
    register file containing 2^5 = 32 entries.
  • Otherwise, if the lower 5 bits do not belong to a
    frequent value, the remaining 27 bits are read
    from the second cache array.
  • Thus, the actual value sharing is not done
    between the two cache tables but between 3
    tables:
  • the two cache tables and the smaller fast
    register file.
  • Also, an extra flag bit is used to indicate
    whether a value is frequent. So, although there
    is always an indirection (either between the two
    cache tables, or between the first cache table
    and the special register file), and thus a delay,
    there is no extra pointer, and so the first of the
    two delays could in theory have been avoided.
  • Why don't they do it more simply?
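
The read path described above can be sketched as follows. This is one reading of the description, under stated assumptions: the 5-bit field of a frequent entry is treated as an index into the 32-entry register file, the register file holds the complete frequent value, and the contents of FREQUENT_VALUES are invented for the example.

```python
# Hypothetical register file contents; it may hold up to 2**5 = 32 values.
FREQUENT_VALUES = [0, 1, 0xFFFFFFFF, 4]


class FrequentValueCache:
    def __init__(self):
        self.low_array = {}    # slot -> (is_frequent flag, 5-bit field)
        self.high_array = {}   # slot -> upper 27 bits, non-frequent values only

    def write(self, slot: int, value: int):
        if value in FREQUENT_VALUES:
            # Store only the flag and the index into the register file.
            self.low_array[slot] = (True, FREQUENT_VALUES.index(value))
            self.high_array.pop(slot, None)
        else:
            self.low_array[slot] = (False, value & 0x1F)
            self.high_array[slot] = value >> 5

    def read(self, slot: int) -> int:
        is_frequent, field = self.low_array[slot]
        if is_frequent:
            return FREQUENT_VALUES[field]          # second array is never read
        return (self.high_array[slot] << 5) | field
```

Either branch of read needs a second lookup after the flag is known, which is the indirection the last bullets above question.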

17
Cache compression schemes: a summary.
  • Cache compression schemes:
  • 1. Indirect tags: "The IIC does not associate a
    tag with a specific data block; instead, each tag
    contains a pointer into a data array which
    contains the blocks."
  • 2. FVC: "The Frequent Value Cache (FVC) replaces
    the top N frequently used 32-bit values with
    log(N) bits. When built as a separate structure the
    FVC can increase cache size if an entire cache
    block is made up of frequent values."
  • The probability decreases, though, with larger
    caches, since a larger cache means more uniqueness
    in the data.
  • So it is suitable for small structures: the paper
    mentions only the L1 cache.
  • 3. Dynamic Zero Compression (DZC): if a byte is
    all zero then only one bit is used to signify
    this, saving the other 7 bits.

18
Cache compression schemes: a summary.
  • Cache compression schemes:
  • 4. Separate banks: Kim et al. utilize the
    knowledge that most of the bits of values stored
    in an L1 data cache are merely sign bits.
  • Their scheme compresses the upper portion of a
    word to a single bit if it is all 1s or all 0s.
  • These high-order bits can be stored in a separate
    cache bank and accessed only if needed,
  • or tags can be further modified to indicate
    whether an access to the second cache bank is
    necessary.
  • 5. Alameldeen and Wood: an algorithm called
    frequent pattern compression (FPC).
  • What? An adaptive compression scheme: it sometimes
    compresses and sometimes does not, based on
    whether the penalty of decompression is more or
    less than the potential penalties incurred by
    cache misses. Very elegant!
  • 6. "General compression algorithm. Cache lines
    are compressed in pairs (where the line address
    is the same except for the low-order bit). If
    both lines compress by 50% or more, they are
    stored in a single cache line, freeing a cache
    line in an adjacent set."
  • The paper doesn't specify the compression
    algorithm, though. Also, it does not specify how
    these lines are tagged differently.

19
Cache compression schemes: a summary.
  • G. Hallnor and S. K. Reinhardt, "A Compressed
    Memory Hierarchy using an Indirect Index Cache".
  • Compression through an indirect table of tags.
  • The cache is fully associative and lines are
    referenced through a pointer stored alongside the
    tag.
  • More than one pointer slot is present to allow
    compression.
  • Algorithm used: LZSS.
  • Compression is carried out only if the line can
    be compressed to fit into the architecturally
    specified sector size; otherwise no compression.
  • Attains greater than 50% of the performance gain
    of doubling the cache size, with about one
    tenth the area overhead.
  • Disadvantages:
  • The speed of LZSS is dependent on the number of
    simultaneous compressions.
  • 6 extra bytes per tag for the pointers.
  • Pointers may be unused if no compression is
    possible for that line.
  • As a result, 134 KB are needed for the
    indirection table (tags, pointers, etc.) of a
    1 MB cache. Bad!

20
Cache compression schemes: a summary.
  • N. Kim, T. Austin, T. Mudge, Low-Energy Data
    Cache using Sign Compression and Cache Line
    Bisection
  • How does the sign compression work?
  • "each word fetched during a cache line miss is
    not changed,
  • but the upper half-words are replaced by a zero
    or one when the upper half-words are all zeros or
    all ones respectively."
  • Uses some sign compression bits instead.
  • However:
  • Allows uncompressed words in the line too.
  • Extra bits to indicate uncompressed / compressed
    / sign bits.
  • Innovation:
  • Two tags per cache line instead of one.
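
Sign compression of a single 32-bit word can be sketched like this. The tag names ("zeros", "ones", "raw") stand in for the extra indicator bits mentioned above and are assumptions of this example, not the encoding used by Kim et al.

```python
def compress_word(word: int):
    """Return (tag, stored bits) for one 32-bit word."""
    upper, lower = word >> 16, word & 0xFFFF
    if upper == 0x0000:
        return ("zeros", lower)   # upper half replaced by a single 0
    if upper == 0xFFFF:
        return ("ones", lower)    # upper half replaced by a single 1
    return ("raw", word)          # uncompressed words are allowed too


def decompress_word(tag: str, bits: int) -> int:
    """Rebuild the 32-bit word from its tag and stored bits."""
    if tag == "zeros":
        return bits
    if tag == "ones":
        return 0xFFFF0000 | bits
    return bits
```

Small positive and small negative integers (in two's complement) fall into the first two cases, which is why sign bits dominate the upper halves of many cached words.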

21
Cache compression schemes: a summary.
  • N. Kim, T. Austin, T. Mudge, Low-Energy Data
    Cache using Sign Compression and Cache Line
    Bisection
  • It allows energy savings as only half the line is
    accessed based on where the block in question is
    and given that sign compression is carried out.
  • Energy precharge using a MRU mechanism.
  • Uses empty spaces in a block to store new blocks
    fetched having the same index.
  • Reduces misses.

22
Cache compression schemes: a summary.
  • Adaptive Cache Compression for High-Performance
    Processors
  • Alaa R. Alameldeen and David A. Wood, Computer
    Sciences Department, University of
    Wisconsin-Madison, {alaa, david}@cs.wisc.edu
  • Adaptive simply means that sometimes you compress
    and sometimes not, based on two factors:
  • 1. Decompression latency and
  • 2. Avoided cache misses.
  • If the cost of decompressing is more than the
    time that would be saved by avoiding potential
    misses if compression was used, then compression
    is not performed; otherwise compression is
    carried out.
  • How? A single global saturating counter predicts
    whether the L2 cache should store a line in
    compressed or uncompressed form.
  • The counter is updated by the L2 controller,
  • based on whether "compression could (or did)
    eliminate a (potential) miss or incurs an
    unnecessary decompression overhead."
  • Not a new idea, though: virtual memory.
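
The counter-based decision can be sketched as below. The penalty values, the saturation limit and the method names are hypothetical, chosen only to make the sketch concrete; the slides do not give the actual numbers used in the paper.

```python
class AdaptivePolicy:
    # All constants below are hypothetical, chosen only for illustration.
    MISS_PENALTY = 400          # cycles saved when compression avoids a miss
    DECOMPRESSION_PENALTY = 5   # cycles lost on an unnecessary decompression
    LIMIT = 10_000              # the counter saturates at +/- LIMIT

    def __init__(self):
        self.counter = 0

    def _add(self, amount: int):
        # Clamp the counter so that it saturates instead of growing forever.
        self.counter = max(-self.LIMIT, min(self.LIMIT, self.counter + amount))

    def miss_avoided(self):
        """Compression eliminated (or could have eliminated) a miss."""
        self._add(self.MISS_PENALTY)

    def wasted_decompression(self):
        """A line was decompressed although compression gained nothing."""
        self._add(-self.DECOMPRESSION_PENALTY)

    def should_compress(self) -> bool:
        """Compress new lines only while compression is a net benefit."""
        return self.counter > 0
```

Because one avoided miss outweighs many decompressions under these assumed costs, the policy leans towards compression as soon as it starts to prevent misses.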

23
Cache compression schemes: a summary.
  • L. Villa, M. Zhang, K. Asanovic, Dynamic Zero
    Compression for Cache Energy Reduction
  • "Dynamic Zero Compression reduces the energy
    required for cache accesses by only writing and
    reading a single bit for every zero-valued byte."
  • Invisible to software.
  • Basically, for each byte, if it is all zeros it
    uses only one bit to store it.
  • Disadvantages:
  • A compression scheme for every byte of the cache
    line
  • increases the complexity of the cache
    architecture.
  • Loses the opportunity to compress all ones and
    only deals with zeros.
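
A software model of the per-byte zero indicator, as a rough sketch. The function names are made up, and the code counts accessed bits rather than modelling the circuit: one indicator bit per byte, plus the eight data bits of each non-zero byte.

```python
def dzc_store(line: bytes):
    """Return (zero indicator bits, non-zero bytes) for one cache line."""
    indicators = [b == 0 for b in line]
    payload = bytes(b for b in line if b != 0)
    return indicators, payload


def dzc_load(indicators, payload: bytes) -> bytes:
    """Rebuild the line: zeros come from the indicators, the rest from payload."""
    data = iter(payload)
    return bytes(0 if is_zero else next(data) for is_zero in indicators)


def bits_accessed(indicators) -> int:
    """One indicator bit per byte, plus 8 data bits per non-zero byte."""
    return len(indicators) + 8 * indicators.count(False)
```

For the 4-byte word 00 00 00 07 this model touches 12 bits instead of 32, while a word with no zero bytes costs 36 bits, the price of the extra indicator bits.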

24
Cache compression schemes: a summary.
  • Chuanjun Zhang, Jun Yang and Frank Vahid, Low
    Static-Power Frequent-Value Data Caches
  • "Recently, a frequent value low power data cache
    design was proposed based on the observation that
    a major portion of data cache accesses involves
    frequent values that can be separated and stored
    only once."
  • Basically, it means that if a cache line value is
    "frequent" then you store it only once and you
    keep a pointer to it.
  • Same idea.
  • But: proposes a method to shut off the unused
    bits to conserve energy in the case that a
    pointer is used.
  • They also propose to reduce the latency of
    reading both a frequent value table and the
    ordinary cache.

25
Compression in caches: conclusion.
  • Cache designers might consider using cache
    compression to increase cache capacity and reduce
    off-chip bandwidth.
  • "A key challenge in the design of a compressed
    data store is the management of variable-sized
    data blocks."
  • Generally, in the studies carried out, a lot of
    work has been done.
  • Compression has been examined from a thousand
    angles.

26
Compression in caches: conclusion.
  • Compression has been examined from a thousand
    angles.
  • Most use the idea that 0s and 1s come
    together in great numbers.
  • Some deal with common "frequent" bit patterns.
  • However, I found none that shows a mechanism for
    finding those "frequent" values.
  • They rely on profiling or on hard-coding the
    values, from what I understand.
  • Marios paper?

27
Example 2: compression in predictors.
  • Prediction is important for high parallelism.
  • Branches: 15% of a program.
  • Pentium 4k BTB.
  • Do branch targets exhibit the same pattern
    behaviour as cache lines?
  • Surely targets might not be as compressible as
    cache lines by the removal of leading zero bits,
    but there might be pattern repetition in them.

28
Compression in predictors.
  • Ideal: dynamic allocation of target space
    according to the needs of each instruction.
  • Rehashable BTB:
  • recognises polymorphic branches and stores them
    in a common BTB space.
  • Value predictors:
  • Gabriel H. Loh stores values in separate tables
    based on length.
  • Energy savings up to 25%, space savings up to 75%.
  • However: cannot be used with the BTB.

29
What did we do with the BTB?
  • Mission: minimize the waste of space in the BTB.
  • Data compression to avoid duplicate entries,
    meaning bit sharing.
  • How?
  • Simple: a two-table structure.
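
A two-table structure with bit sharing could look roughly like this. It is a sketch under assumptions, not the simulator used for the results: the split into 8 lower and 24 upper bits, the class name and the unbounded tables (no capacity limit, no replacement) are all choices made for the example.

```python
LOWER_BITS = 8   # hypothetical split: 8 lower bits per entry, upper 24 bits shared


class TwoTableBTB:
    def __init__(self):
        self.upper_table = []   # each distinct upper part is stored once
        self.entries = {}       # branch pc -> (index into upper_table, lower bits)

    def insert(self, pc: int, target: int):
        upper = target >> LOWER_BITS
        lower = target & ((1 << LOWER_BITS) - 1)
        if upper not in self.upper_table:
            self.upper_table.append(upper)
        self.entries[pc] = (self.upper_table.index(upper), lower)

    def lookup(self, pc: int):
        """Return the predicted target, or None when the branch is unknown."""
        if pc not in self.entries:
            return None
        index, lower = self.entries[pc]
        return (self.upper_table[index] << LOWER_BITS) | lower
```

Two branches whose targets differ only in the low bits end up sharing a single upper-table entry, which is the duplicate removal the slide refers to.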

30
Methodology
  • Aim: find all entries/branches that have the same
    or partially the same target.
  • We used: a no-replacements BTB and a BTB with
    multiple tables.

31
Results.
  • Questions
  • 1. What width will each table have?
  • 2. How many entries?
  • 3. How to join them up?

32
Q1: Bit ranges.
  • GCC95 results: BTB performance.

    BTB type                    Correct hits   Percentage
    Normal BTB                  19164012       87.8982
    BTB with no replacements    19892653       91.2402
    Bits 1-16                   21802503       99.9999
    Bits 5-20                   21791021       99.9473
    Bits 9-24                   21706107       99.5578
    Bits 13-28                  21304116       97.714
    Bits 17-32                  20013651       91.7951
    Bits 25-32                  21801499       99.9953
    Bits 1-24                   21706107       99.5578

  • Bits 1-24 together with bits 25-32 perform better
    than bits 1-16 together with bits 17-32.

33
Q2: How much space for each table.

    BTB type             1-32 bit             25-32 bit            1-24 bit
                         hits       %         hits       %         hits       %
    4k normal            19164012   87.8982   19164385   87.8999   19847639   91.0337
    4k no replacement    19892653   91.2402   21801499   99.9953   21706107   99.5578
    2k normal            18571627   85.1811   18571985   85.1827   19248618   88.2862
    2k no replacement    19567018   89.7466   21801090   99.9934   21701108   99.5349
    1k normal            17522962   80.3713   17523248   80.3726   18187048   83.4172
    1k no replacement    18885228   86.6195   21799372   99.9856   21688593   99.4775
    512 normal           16149816   74.0732   16150001   74.074    16786062   76.9914
    512 no replacement   17720797   81.2787   21795542   99.968    21671880   99.4008
    256 normal           14327910   65.7168   14328057   65.7174   14924424   68.4528
    256 no replacement   15905413   72.9522   21770601   99.8536   21582680   98.9917

  • For the BTB without replacements, the critical
    point for the lower 8 bits is at 256 places: they
    are 8 bits after all (2^8 = 256).
  • Upper 24 bits very common!

34
Results.
  • Number of correct hits (and percentage) in the
    normal BTB and in the improved BTB:

    Benchmark    BTB size    Normal BTB             Improved BTB
                             hits        %          hits        %
    GCC95        8k places   19528170    89.5684    19384014    88.9072
    GCC95        4k places   19164012    87.8982    19034779    87.3054
    GCC95        2k places   18571627    85.1811    18456969    84.6552
    MCF2000      8k places   149389263   99.2259    149389263   99.2259
    MCF2000      4k places   149387899   99.225     149387899   99.225
    MCF2000      2k places   147731903   98.125     147731903   98.125
    Vortex2000   8k places   89371437    86.9625    88953923    86.5563
    Vortex2000   4k places   88405444    86.0226    87995598    85.6238
    Vortex2000   2k places   85185389    82.8893    84805232    82.5194

  • The improved BTB needs only about 80% of the
    original size!

35
Costs.

    Entries in    Size of        Size of         Reduction
    1st table     normal BTB     improved BTB    (%)
    8k entries    376832 bits    299520 bits     20.516
    4k entries    192512 bits    156160 bits     18.882
    2k entries    98304 bits     82432 bits      16.145

  • Generally reduces size requirements by about 16%
    to 20%.
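
The reduction column follows directly from the two size columns. The short calculation below recomputes it from the figures in the table; the function name is made up, and the printed values are rounded, so their last digit can differ slightly from the figures on the slide.

```python
# (entries in first table, normal BTB bits, improved BTB bits), from the table
COSTS = [
    ("8k", 376832, 299520),
    ("4k", 192512, 156160),
    ("2k", 98304, 82432),
]


def reduction_percent(normal_bits: int, improved_bits: int) -> float:
    """Share of the normal BTB's bits that the improved BTB saves."""
    return 100.0 * (normal_bits - improved_bits) / normal_bits


for name, normal, improved in COSTS:
    print(f"{name} entries: {reduction_percent(normal, improved):.3f}% smaller")
```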

36
Don't use the page number, but a pointer to it.
  • André Seznec: brilliant proposal.
  • Caches: the relative size of addresses (tags) is
    huge, especially with small blocks.
  • Predictors: accuracy is affected due to large
    addresses (targets and tags).
  • Curious finding: addresses are represented
    multiple times: in cache tags, in instructions,
    in the BTB, in the TLB.
  • Removed by:
  • 1. Store page numbers only once
  • 2. Do not use the page number, but a pointer to it

37
Don't use the page number, but a pointer to it.
  • André Seznec: brilliant proposal.
  • How?
  • Page numbers are stored in a page number cache.
  • This can be the TLB when virtual addresses are
    used, or another buffer if physical.
  • Store 5-bit pointers in place of addresses.
  • Reduced cache tags, reduced predictor tags.
  • Cache: if a page pointer is invalidated (page
    miss), all entries are invalidated.
  • But: why not invalidate the BTB entries as well
    as the TLB ones?
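
The pointer idea can be sketched for a BTB as follows. Everything specific here is an assumption made for illustration: 4 KB pages, round-robin replacement in the page number cache, the class names, and the choice to drop BTB entries whose pointer has been reassigned (the question raised in the last bullet above).

```python
PAGE_BITS = 12        # hypothetical 4 KB pages
POINTER_BITS = 5      # 2**5 = 32 page-number slots


class PageNumberCache:
    def __init__(self):
        self.pages = [None] * (1 << POINTER_BITS)   # pointer -> page number
        self.next_victim = 0                        # simple round-robin replacement

    def pointer_for(self, page: int):
        """Return (pointer, evicted) for a page, allocating a slot if needed."""
        if page in self.pages:
            return self.pages.index(page), False
        pointer = self.next_victim
        evicted = self.pages[pointer] is not None
        self.pages[pointer] = page
        self.next_victim = (pointer + 1) % len(self.pages)
        return pointer, evicted


class PointerBTB:
    def __init__(self, page_cache):
        self.page_cache = page_cache
        self.entries = {}    # branch pc -> (page pointer, offset within page)

    def insert(self, pc, target):
        pointer, evicted = self.page_cache.pointer_for(target >> PAGE_BITS)
        if evicted:
            # The pointer now names another page: drop entries that used it.
            self.entries = {k: v for k, v in self.entries.items() if v[0] != pointer}
        self.entries[pc] = (pointer, target & ((1 << PAGE_BITS) - 1))

    def lookup(self, pc):
        if pc not in self.entries:
            return None
        pointer, offset = self.entries[pc]
        return (self.page_cache.pages[pointer] << PAGE_BITS) | offset
```

Each BTB entry stores 5 + 12 bits under these assumptions, regardless of how wide the full address is, which is why the predictor size becomes independent of the address width.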

38
Don't use the page number, but a pointer to it.
  • How does it compare?
  • André Seznec: comparison with other schemes
    (isolated compression scheme vs. Seznec scheme).
  • Isolated: touches only the targets.
    Seznec: touches both tags and targets.
  • Isolated: predictor size dependent on address
    width. Seznec: predictor size independent of
    address width.
  • Isolated: 8-bit pointers. Seznec: 6-bit pointers.
  • Isolated: second-level table accessed every time.
    Seznec: table with page pointers accessed only
    when getting outside the processor, i.e. to RAM.
  • Isolated: a solution for one specific predictor
    only. Seznec: a BTB, cache and TLB solution.
  • Isolated: not affected by page misses.
    Seznec: affected by misses, though according to
    the paper not much.
  • Isolated: only the specific predictor is changed.
    Seznec: cache, BTB and even the program counter
    have to be modified to the new scheme for it to
    be effective.

39
Conclusion.
  • Compression is a huge field and we have only
    touched the surface.
  • The key to a successful algorithm:
  • 1. Speed, speed, speed!
  • 2. Simple to implement in hardware,
  • 3. Balances space and energy savings with overhead.
  • Based on the above, are
  • decision trees,
  • classification algorithms,
  • etc.,
  • worth it?

40
References
  • Technical Report 1500, Computer Sciences Dept.,
    UW-Madison, April 2004
  • Po-Yung Chang, Eric Hao, Yale N. Patt, Target
    Prediction for Indirect Jumps.
  • André Seznec, Don't use the page number, but a
    pointer to it.
  • A. R. Alameldeen and D. Wood, "Adaptive Cache
    Compression for High-Performance Processors",
    Proc. of the 31st International Symposium on
    Computer Architecture, June 2004, pg. 212-223.
  • G. Hallnor and S. K. Reinhardt, "A Compressed
    Memory Hierarchy using an Indirect Index Cache",
    Technical Report CSE-TR-488-04, 2004.
  • L. Villa, M. Zhang, K. Asanovic, Dynamic Zero
    Compression for Cache Energy Reduction, In the
    proceedings of the 33rd International Symposium
    on Microarchitecture, Dec. 2000.
  • P. R. Wilson, S. F. Kaplan, Y. Smaragdakis, The
    Case for Compressed Caching in Virtual Memory
    Systems, In the proceedings of USENIX 1999.
  • J. Yang, R. Gupta, Energy Efficient Frequent
    Value Data Cache Design, In the proceedings of
    the 35th Annual International Symposium on
    Microarchitecture, 2002, (MICRO-
  • Y. Zhang, J. Yang, R. Gupta, Frequent Value
    Locality and Value-Centric Data Cache Design,
    In the proceedings of the Ninth International
    Conference on Architectural Support for
    Programming Languages and Operating Systems, Nov.
    2000
  • N. Kim, T. Austin, T. Mudge, Low-Energy Data
    Cache using Sign Compression and Cache Line
    Bisection, 2nd Annual Workshop on Memory
    Performance Issues, May 2002.
  • Chuanjun Zhang, Jun Yang and Frank Vahid, Low
    Static-Power Frequent-Value Data Caches

41
References
  • Li T., John L. K. (2001). Rehashable BTB: An
    Adaptive Branch Target Buffer to Improve the
    Target Predictability of Java Code. The
    University of Texas at Austin.
  • Sazeides Y., Smith J. E. (1998). Implementations
    of the Context-Based Value Predictors. University
    of Wisconsin-Madison.
  • Loh G. H. (2003). Width-Partitioned Load Value
    Predictors. Journal of Instruction-Level
    Parallelism. College of Computing, Georgia
    Institute of Technology, Atlanta.
  • Gifford S., Huang C.-W., Yang Z., Yu C.
    (2003). A Comprehensive Front-end Architecture
    for the VeriSimple Alpha Pipeline. University of
    Michigan.
  • Yung R. (1996). Design of the UltraSPARC
    Instruction Fetch Unit. Sun Microsystems.
  • Chang P.-Y., Hao E., Patt Y. N. (1997). Target
    Prediction for Indirect Jumps. Department of
    Electrical Engineering and Computer Science, the
    University of Michigan.
  • Calder B., Grunwald D. (1995). Next Cache Line
    and Set Prediction. Department of Computer
    Science, University of Colorado.
  • McFarling S. (1993). Combining Branch Predictors.
    Western Research Laboratory, California.
  • Hinton G., Sager D., Upton M., Boggs D.,
    Carmean D., Kyker A., Roussel P. (2001). The
    Microarchitecture of the Pentium 4 Processor.
    Intel Technology Journal Q1.
  • Loh G. H., Henry D. S., Krishnamurthy A.
    (2003). Exploiting Bias in the Hysteresis Bit of
    Two-bit Saturating Counters in Branch
    Predictors. Journal of Instruction Level
    Parallelism.
  • Kalla R., Sinharoy B., Tendler J. M. (2004).
    IBM Power5 Chip: A Dual-Core Multithreaded
    Processor. IEEE Computer Society.
  • Arora K., Sharangpani H. (2000). Itanium
    Processor Microarchitecture. IEEE Computer
    Society.
  • Perleberg C. H., Smith A. J. (1993). Branch
    Target Buffer Design and Optimization. IEEE
    Transactions on Computers.

42
Other interesting references
  • Gabriel H. Loh
  • Simulation Differences Between Academia and
    Industry: A Branch Prediction Case Study
  • To appear in the International Symposium on
    Performance Analysis of Software and Systems
    (ISPASS), March 2005, Austin, TX, USA.
  • Gabriel H. Loh
  • The Frankenpredictor: Stitching Together Nasty
    Bits of Other Predictors
  • In the 1st Championship Branch Prediction Contest
    (CBP1), pp. 1-4, Dec 6, 2004, Portland, OR, USA.
    (Held in conjunction with MICRO-37.)
  • Gabriel H. Loh
  • The Frankenpredictor: Satisfying Multiple
    Objectives in a Balanced Branch Predictor Design
  • Invited to appear in the Journal of Instruction
    Level Parallelism (JILP).
  • Gabriel H. Loh, Dana S. Henry
  • Predicting Conditional Branches With Fusion-Based
    Hybrid Predictors
  • In the 11th Conference on Parallel Architectures
    and Compilation Techniques (PACT), pp. 165-176,
    September 22-25, 2002, Charlottesville, VA, USA.