Phase Detection - PowerPoint PPT Presentation

1
Phase Detection
  • Jonathan Winter
  • Casey Smith
  • CS 612
  • 04/05/05

2
Motivation
  • Large-scale phases exist (order of millions of
    instructions)
  • For many programs, if we look at any interesting
    metric (cache misses, IPC, etc.), we see
    repeating behavior
  • Call the regions with similar behavior phases
  • Knowledge of phase-based behavior can be used for
    adaptive optimization
  • Current hardware doesn't exploit phase behaviors
  • For instance:
  • A region of execution may only need a small
    cache: save power/increase performance by
    shrinking it
  • A region of execution may benefit from data
    structure reorganization

3
Basic Methodology
  • Identify phase boundaries
  • Classify phases
  • Determine what optimizations to perform for each
    phase
  • When can each step be performed?
  • Run time, compile time, offline

4
Overview
  • We'll focus on two papers on phase detection:
  • Sherwood, Sair, and Calder, "Phase Tracking and
    Prediction," ISCA 2003
  • Shen, Zhong, and Ding, "Locality Phase
    Prediction," ASPLOS 2004

5
Sherwood et al. 2003
  • Classifies the behavior of a program into phases
    based on code execution
  • Finds strong correlations between code execution
    phases and important performance and energy
    metrics
  • Simulates hardware for real-time detection and
    prediction of phases
  • Demonstrates usefulness through a variety of
    optimization techniques made possible by phase
    detection

6
Definition of a Phase
  • Previously (stemming from Denning 1972), a phase
    was defined as an interval of execution where a
    measured program metric stayed relatively
    constant.
  • Sherwood et al. consider all sections of code
    with similar values for the program metric to be
    part of the same phase even if the intervals are
    spread out over the course of the program's
    execution.

7
Key Program Metrics
  • Instructions per cycle (IPC), energy, branch
    prediction accuracy, data cache misses,
    instruction cache misses, L2 cache misses are all
    vital statistics for optimizing speed and power
    consumption

8
Single Unified Metric
  • Goal: find a single metric that
  • Uniquely distinguishes phases
  • Guides optimization and policy decisions
  • Need some section of code on which to measure
    this metric: pick 10M-instruction intervals
  • Much longer time span than typical architectural
    techniques handle
  • Long enough to capture large-scale behavior
  • Short enough to capture detailed phase behavior
  • Size of an OS timeslice

9
Metric for Classification
  • Based on Basic Blocks
  • A basic block is a section of code with one
    entry point and one exit point
  • Basic Block Vector
  • Count the number of times each basic block is
    executed in the 10M interval
  • Entries in the vector are the product of the
    number of times each basic block is executed and
    the block length (BB1·L1, BB2·L2, BB3·L3, …)
  • This vector is a signature of the phase which
    correlates well with other metrics of interest:
    IPC, cache misses, etc.
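In code, building the vector amounts to accumulating executions × block length per block. A minimal sketch in Python, assuming a hypothetical trace of (block id, block length) pairs, one per basic-block execution in the interval (both names are illustrative):

```python
# Sketch of building a Basic Block Vector (BBV) for one interval.
# Input: a trace of (block_id, block_length) pairs, one entry per
# basic-block execution, covering one ~10M-instruction interval.

def basic_block_vector(trace):
    """Return {block_id: executions * block_length} for one interval."""
    bbv = {}
    for block_id, block_length in trace:
        bbv[block_id] = bbv.get(block_id, 0) + block_length
    return bbv

# Example: block 0 (4 instructions) runs 3 times, block 1 (2 instrs) once.
trace = [(0, 4), (0, 4), (1, 2), (0, 4)]
print(basic_block_vector(trace))  # {0: 12, 1: 2}
```

The entries weight each block by how many instructions it contributes, which is exactly the biasing toward frequently executed code described above.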

10
Advantages of BBVs
  • Independent of architectural measures and thus
    unaffected by optimizations
  • Weighting biases the signatures to more
    frequently executed instructions
  • Creates unique signatures for intervals which
    execute the same code but in different
    proportions

11
Hardware Implementation
  • Don't want to store and examine the whole
    vector: compress it to a 32-entry vector
    (footprint)
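The compression can be sketched by hashing each block id into one of 32 buckets and accumulating the weights; the paper uses a random-projection style hash in hardware, for which Python's built-in `hash` stands in here:

```python
# Sketch of compressing a BBV into a 32-entry footprint. Each basic
# block's weight lands in one of 32 buckets chosen by a hash of its id
# (the hardware uses a random-projection hash; hash() stands in here).

FOOTPRINT_SIZE = 32

def footprint(bbv):
    buckets = [0] * FOOTPRINT_SIZE
    for block_id, weight in bbv.items():
        buckets[hash(block_id) % FOOTPRINT_SIZE] += weight
    return buckets
```

Collisions merge a few blocks into one bucket, but the total instruction weight is preserved, which is what the later Manhattan-distance comparison relies on.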

12
Visualization of the Footprints
Footprints for different intervals of gzip
13
What do we do with our footprint?
  • Store a small sample of representative footprints
    as phase signatures
  • Compare the current footprint to previously
    stored footprints
  • If we have a close enough match, we classify them
    as the same phase
  • If not, we store the new footprint as the
    representative member of a new phase

14
Comparing Footprints
  • To save space, only store the top 6 bits of each
    entry in the 32-vector
  • Counters were saturating 24-bit counters
  • The smallest value that the maximum entry could
    have would occur if all 10M instructions were
    distributed evenly across the 32 entries
  • In this case, keeping the top six bits means
    that a counter value of 10M/32 maps to a value
    of 1
  • Distance between footprints is defined as the
    Manhattan distance: the sum of the absolute
    differences between corresponding entries in the
    two vectors
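Both steps are small enough to sketch directly; this assumes the 24-bit saturating counters and 6-bit truncation described above:

```python
# Sketch: keep the top 6 bits of each saturating 24-bit counter, then
# compare quantized footprints by Manhattan distance.

def quantize(fp, counter_bits=24, kept_bits=6):
    shift = counter_bits - kept_bits        # discard the low 18 bits
    limit = 2**counter_bits - 1             # counters saturate at 24 bits
    return [min(v, limit) >> shift for v in fp]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# 10M instructions spread evenly over 32 entries -> quantized value 1.
print(quantize([10_000_000 // 32]))  # [1]
```

With the threshold from the next slide, two footprints with `manhattan(quantize(a), quantize(b))` below 220 would be classified as the same phase.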

15
Finding a Match
  • If the Manhattan distance is less than a
    threshold, two footprints are classified as being
    in the same phase
  • The threshold is determined by comparing false
    positives/false negatives against an offline
    oracle tool
  • A threshold of 220 was chosen

16
Opportunity
  • These classification methods are oversimplified
  • Opportunity to apply better machine learning
    techniques

17
Within Phase Homogeneity
  • Within a phase, architectural metrics have nearly
    constant values (this is what we were aiming for)

18
Phase Prediction
  • Once we've been through an interval, we can
    identify the phase easily
  • But we want to know what phase we're going to go
    to next
  • We need to know what phase we will be in before
    the interval starts in order to perform useful
    optimizations (such as changing the cache size)

19
Simple Prediction
  • We could just predict that the next phase would
    be the same as the current phase
  • The program tends to change phases more slowly
    than our 10M intervals, so this actually gives
    reasonable accuracy
  • However, we can do better
  • Note: standard hardware prediction techniques
    have not been tried (branch prediction, memory
    disambiguation, etc.)

20
Markov Model Predictor
  • Phase changes depend on the set of previous
    phases and the duration of their execution
  • Phases tend to last many intervals, therefore
    studying recent previous history doesn't provide
    more information than the current state
  • Need to encode how long we've been in the current
    state
  • Predict the length of phase to be the same length
    it was previously
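A minimal sketch of such a run-length-encoding Markov predictor, assuming the state is simply (current phase, intervals spent in it) and the table remembers which phase followed that state last time (class and method names are illustrative):

```python
# Sketch of an RLE Markov phase predictor: the state is the pair
# (current phase, run length so far); the table records which phase
# followed that state the last time it was seen.

class RLEMarkovPredictor:
    def __init__(self):
        self.table = {}       # (phase, run_length) -> phase seen next
        self.current = None
        self.run_length = 0

    def predict(self):
        # Default to "stay in the current phase" when the state is unseen.
        return self.table.get((self.current, self.run_length), self.current)

    def update(self, observed_phase):
        if observed_phase == self.current:
            self.run_length += 1
        else:
            # Record that this (phase, run length) state ended here.
            self.table[(self.current, self.run_length)] = observed_phase
            self.current = observed_phase
            self.run_length = 1
```

After seeing phase A last for three intervals and then switch to B, the predictor will again forecast B the next time it has been in A for three intervals.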

21
Run Length Encoding
22
Opportunity
  • RLE Markov model is overly simple
  • Better prediction techniques exist
  • Make use of the order of previous states rather
    than just the length of the current state

23
Prediction Accuracy
24
Applications
  • Frequent Value Locality
  • Certain data values form bulk of loads
  • Compress to save energy
  • Specialize code segments to common values
  • Dynamic cache size adaptation
  • Shrink cache size to save energy
  • Dynamic processor width adaptation
  • Fetch/Decode/Issue fewer instructions per cycle
    when IPC will be low anyway

25
Frequent Value Locality
26
Cache Size Adaptation
27
Processor Width Adaptation
28
Summary of BBV method
  • Divide program into 10M instruction intervals
  • Characterize each interval by footprint
    approximation to basic block vector
  • Classify intervals as phases based on footprint
  • Predict future phases based on RLE Markov
    predictor
  • Use information about phases to improve frequent
    value locality and optimize cache size and
    processor width for performance/energy

29
Bottom Line
  • Classifying phases based on the frequency of
    executed basic blocks is effective at
    partitioning the program into regions of
    homogeneous architectural behavior
  • Significant energy savings with small performance
    degradation can be achieved by applying phase
    specific optimizations.

30
Shen et al. 2004
  • Defines phases in a totally different way
  • Phases have variable lengths (not 10M intervals)
  • Detects phases by finding likely phase boundaries
  • Uses offline analysis of programs on test inputs
    to predict behavior on other inputs

31
Metric of Interest
  • For optimizing cache size, what we really care
    about is the locality of reference
  • Measure the locality directly, and classify
    phases based on that
  • Independent of optimizations performed: the
    phases recovered are independent of the hardware
    the program runs on

32
Reuse Distance
  • Define the reuse distance as the number of
    distinct data elements (locations in memory)
    touched between two consecutive references to the
    same element.
  • Define the reuse distance at the second reference
  • Example: for the reference string a b c b b a c,
    the distances are – – – 1 0 2 2
  • Also called LRU stack distance
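The definition translates directly into code. A minimal (quadratic-time) sketch; production tools use an LRU stack or tree to make this fast, but the semantics are the same:

```python
# Sketch: reuse distance = number of distinct elements touched between
# two consecutive references to the same element (None on first use).

def reuse_distances(trace):
    last_seen = {}   # element -> index of its previous reference
    out = []
    for i, x in enumerate(trace):
        if x in last_seen:
            # Count distinct elements strictly between the two references.
            out.append(len(set(trace[last_seen[x] + 1 : i])))
        else:
            out.append(None)
        last_seen[x] = i
    return out

print(reuse_distances("abcbbac"))  # [None, None, None, 1, 0, 2, 2]
```

This reproduces the slide's example: the second reference to b skips over one distinct element (c), the third skips none, and the later a and c each skip two.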

33
Overview
  • Simulate a test run and record reuse distance
    throughout the program
  • Use this to separate the program into phases
  • Insert phase markers into binary code
  • Predict when phase changes will occur
  • Use information about phases to adjust cache size
    or other hardware parameters

34
New Definition of Phase
  • Here, a phase is a unit of repeating behavior,
    rather than a unit of nearly uniform behavior
  • A phase change is an abrupt change in the data
    reuse pattern

35
Reuse Trace
36
Why Offline Analysis?
  • Compilers cannot fully analyze data locality in
    programs with indirect referencing or dynamic
    structures
  • Hardware methods like the one presented earlier
    require many severe approximations for real-time
    analysis
  • Solution: take the method offline and analyze
    program behavior on test inputs

37
Phase Detection Process
  1. Record reuse trace
  2. Perform signal processing techniques to extract
    useful information from the trace
  3. Use the extracted information to find good places
    for phase transitions

38
1) Record Reuse Trace
  • Nontrivial programs access data locations so many
    times that an actual full trace would be
    overwhelming
  • Just sample a representative set of memory
    locations/reuse distances
  • Threshold to reduce trace size and remove
    irrelevant data
  • Throw out short reuse distances (below a chosen
    threshold)
  • Throw out references to nearby memory locations

39
2) Signal Processing
  • Use wavelet filtering to find abrupt changes in
    reuse distance for each recorded memory location
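The idea can be illustrated with a one-level Haar transform, the simplest wavelet: its detail coefficients are half-differences of adjacent samples, so they are near zero where the signal is flat and large where it jumps. This is only a sketch of the principle; the paper applies wavelet filtering per memory location and at more than one scale:

```python
# Sketch: one-level (unnormalized) Haar wavelet detail coefficients.
# Large-magnitude coefficients mark abrupt changes in a reuse-distance
# signal; flat stretches produce coefficients near zero.

def haar_details(signal):
    """Half-differences of adjacent sample pairs."""
    return [(signal[i] - signal[i + 1]) / 2
            for i in range(0, len(signal) - 1, 2)]

# Flat, then a jump: only the pair straddling the jump stands out.
sig = [4, 4, 4, 20, 20, 20]
print(haar_details(sig))  # [0.0, -8.0, 0.0]
```

Thresholding the absolute coefficient values then yields candidate positions of abrupt reuse-distance changes.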

40
3) Phase Partitioning
  • Now we have points representing locations of
    abrupt changes in reuse distance for individual
    memory locations
  • Want to divide the list with two things in mind
  • Maximize phase length
  • Minimize repetitions of memory locations within a
    phase (no multiple abrupt changes)
  • Example: abcdeefabdfccabef
  • Partitioned: abcde  efabdfc  cabef
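A simplified greedy stand-in for this partitioning: cut whenever a memory location repeats within the current phase. The paper's algorithm also balances phase length, so it may tolerate a repetition to get longer phases, which is why its partition of the example (abcde, efabdfc, cabef) differs from the strict greedy one below:

```python
# Simplified greedy partitioning sketch: start a new phase whenever a
# memory location repeats within the current one. The paper's method
# additionally maximizes phase length, so it can keep an occasional
# repetition and produce fewer, longer phases.

def greedy_partition(trace):
    phases, current = [], ""
    for loc in trace:
        if loc in current:        # repetition -> cut here
            phases.append(current)
            current = loc
        else:
            current += loc
    if current:
        phases.append(current)
    return phases

print(greedy_partition("abcdeefabdfccabef"))
# ['abcde', 'efabd', 'fc', 'cabef']
```

Merging short neighboring segments (like 'fc' above) back into adjacent ones is one way the length objective could be approximated.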

41
Missing Link
  • So now we have locations of phase transitions.
  • How do we detect which regions are the same
    phase? The paper doesn't say.
  • Missing section in the paper?
  • Assume we can somehow classify the regions into
    phases

42
Phase Markers
  • We know how often a phase occurs and
    approximately where its boundaries are
  • Goal: find markers that tell us when we're
    entering a particular phase
  • For each phase, look for basic blocks that occur
    once near each of its beginning boundaries, and
    only near the beginnings of its boundaries.
  • Use that basic block as a marker to tell when the
    program enters that phase
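The marker condition can be sketched as a simple check over instruction positions: a block qualifies if it executes exactly once near each beginning boundary of the phase and nowhere else. The function name, position representation, and tolerance are all illustrative assumptions, not the paper's exact procedure:

```python
# Sketch of the marker test. Positions are instruction counts; a basic
# block is a marker for a phase if it occurs exactly once near each of
# the phase's beginning boundaries and has no other occurrences.
# The tolerance value is an illustrative assumption.

def is_marker(occurrences, phase_starts, tol=100):
    matched = set()
    for start in phase_starts:
        hits = [p for p in occurrences if abs(p - start) <= tol]
        if len(hits) != 1:        # must fire exactly once per boundary
            return False
        matched.add(hits[0])
    # No occurrences away from the phase beginnings.
    return len(matched) == len(occurrences)
```

Scanning all candidate blocks with this test would select the ones usable as phase-entry markers.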

43
Using Phases
  • Now we know what basic blocks signal phase entry
    points
  • Run the program with new input
  • When we enter a phase for the first time, we
    record how long it lasts and its locality
    properties
  • Assume that these properties will hold for all
    subsequent executions of the same phase

44
Phase Prediction Performance
45
Negative Examples
  • Not all programs have phases of repeating
    behavior that can be identified from test runs

46
Applications
  • Adaptive Cache Resizing
  • Potential performance increase
  • Potential power savings
  • Memory Remapping
  • Reorder data in memory to speed up execution

47
Adaptive Cache Resizing
  • Shrink cache without increasing miss ratio
  • Phases have repeating behavior, not uniform
    behavior
  • Divide phases into 10K intervals
  • The first couple of times we execute a phase,
    experiment with cache sizes on its intervals
  • Apply those cache sizes to subsequent executions
    of the phase
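The explore-then-reuse policy above can be sketched as follows; `miss_ratio` is a hypothetical measurement hook, and the 5% tolerance mirrors the miss-increase budget on the following slides:

```python
# Sketch of per-phase cache-size selection: explore candidate sizes the
# first time a phase is seen, then reuse the chosen size on every
# subsequent execution. miss_ratio(size) is a hypothetical hook that
# measures the miss ratio of one 10K-instruction interval at that size.

best_size = {}   # phase id -> chosen cache size (e.g. in KB)

def cache_size_for(phase, candidate_sizes, miss_ratio):
    if phase not in best_size:
        # Exploration: smallest size whose miss ratio stays within
        # 5% of the largest cache's.
        baseline = miss_ratio(max(candidate_sizes))
        for size in sorted(candidate_sizes):
            if miss_ratio(size) <= baseline * 1.05:
                best_size[phase] = size
                break
    return best_size[phase]
```

Later executions of the same phase pay no exploration cost, which is what makes the repeating-behavior definition of a phase pay off here.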

48
Cache Size Reductions
49
Cache Size Reductions with 5% Miss Increase
50
Memory Remapping
  • Reorder data in memory to speed up execution
  • For example, we might interleave arrays that tend
    to be accessed together.
  • Options
  • Analyze whole program to find array affinities
  • Analyze by phase and reorganize data during
    execution (should take into account the cost of
    remapping, but the authors don't)

51
Memory Remapping
52
Summary of the locality-based method
  • Record a sampled version of the reuse distance
    trace on test input
  • Process the trace
  • Find phase boundaries
  • Find basic block markers for each phase
  • Run the program on new data.
  • When we see a new phase marker, record how long
    it lasts and experiment with optimization
    parameters for 10K intervals
  • Assume subsequent executions of the phase will
    have the same length and locality profile, so we
    can use the determined optimization parameters

53
Bottom Line
  • Many programs have long repeating patterns of
    data reuse separated by abrupt changes
  • These repeating patterns can be detected by
    analyzing the reuse trace
  • Characterizing these patterns can lead to
    significant energy savings and performance
    enhancement through cache resizing and memory
    remapping

54
Overall Conclusions
  • Many programs exhibit large-scale phase behavior
    which can be classified and predicted
  • Characterization of the phases can lead to energy
    savings and performance enhancement through cache
    resizing and other techniques
  • But there is no thorough analysis of just how
    much power is saved
  • Some of this can be done at compile time
    (identifying many phase markers), but interval
    type analysis and phase characterization must be
    done at runtime

55
Opportunities
  • More intelligent classification
  • More sophisticated prediction
  • Account for the cost of changing the cache size
    in the energy/performance analysis
  • Compare results of phase-based adjustments to
    actual optimal adjustments
  • Examine potential for using compilers for
    different parts of the analysis