Transcript and Presenter's Notes

Title: Performance Evaluation


1
Performance Evaluation
  • Carl Staelin

2
Motivation
  • It is often helpful to have some yardstick by
    which to compare systems
  • During development to evaluate different
    algorithms or optimizations
  • During purchasing to compare between product
    offerings

3
What To Measure?
  • Standard information retrieval measures
  • Precision: the fraction of retrieved documents
    that are deemed relevant
  • Recall: the fraction of relevant documents that
    are retrieved
  • Other measures may also be important
  • Execution time, disk space, ...

4
Experiment Design
  • There is a fine art to designing experiments
  • Ensure that you measure the item of interest
  • Ensure that all possible sources of variation
    have been identified and controlled
  • E.g., corpus is static, or at least controlled
  • Ensure that the statistical analysis is valid
  • Proper statistical techniques are applied
    correctly

5
Statistical Analysis
  • Statistics convert data into information
  • It is easy to misuse statistical tools
  • "There are lies, damned lies, and statistics"
  • Understand the assumptions upon which the tools
    are based
  • If you violate the assumptions, your results are
    not reliable
  • These types of mistakes are silent and are easy
    to overlook

6
Summarizing data by a single number
  • Mode: the most likely value
  • Median: the middle value
  • Arithmetic mean: x̄ = (Σxi) / n
  • Geometric mean: (Πxi)^(1/n)
  • Harmonic mean: n / Σ(1/xi)
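
A minimal sketch of these five summaries using Python's standard statistics module (Python 3.8+ for geometric_mean); the sample values are made up for illustration:

    from statistics import mode, median, mean, geometric_mean, harmonic_mean

    # Hypothetical latency sample (seconds); the values are made up.
    x = [0.12, 0.15, 0.15, 0.21, 0.34, 0.50]

    print("mode       =", mode(x))            # most likely value
    print("median     =", median(x))          # middle value
    print("arithmetic =", mean(x))            # sum(xi) / n
    print("geometric  =", geometric_mean(x))  # (prod(xi)) ** (1/n)
    print("harmonic   =", harmonic_mean(x))   # n / sum(1/xi)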

7
Mode
  • When to use the mode
  • When the data is categorical, or
  • When the data is numeric and multi-modal
  • When not to use the mode
  • When the data is numeric and uni-modal

8
Median
  • When to use the median
  • When the data is a numeric observation
  • When the data is skewed or is dominated by
    outliers
  • When the data is uni-modal
  • When not to use the median
  • If the total value is also of interest

9
Arithmetic Mean
  • When to use the mean
  • When the data is a numeric observation
  • When the data is not dominated by outliers
  • When the data is uni-modal, and is not skewed
  • When not to use the mean
  • When the data is categorical, or a ratio
  • When the distribution is badly skewed
  • What else not to do?
  • Do not assume the mean of a product equals the
    product of the means; that holds when the variables
    are independent, but not in general

10
Other Means
  • When to use the geometric mean
  • When the data is a numeric observation
  • When the item of interest is the product of the
    observations
  • E.g., cache hit ratios over several levels of
    cache
  • When to use the harmonic mean
  • Whenever the arithmetic mean can be justified for 1/x
  • For example, if you measure times t and want to
    average a rate m/t (e.g., CPU cycles per second),
    use the harmonic mean

11
Summarizing variability
  • Sample variance: s² = (1/(n-1)) Σ(xi - x̄)²
  • Sample stddev: s
  • Coefficient of variation (C.O.V.): s / x̄
  • Range: max xi - min xi
  • 10-90 percentile range
  • Semi-interquartile range (SIQR): (Q3 - Q1) / 2
  • Mean absolute deviation: (1/n) Σ|xi - x̄|
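
The same variability measures, sketched with the standard statistics module (statistics.quantiles needs Python 3.8+); the data below is invented:

    import statistics as st

    # Hypothetical response-time sample (seconds); the values are made up.
    x = [0.12, 0.15, 0.15, 0.21, 0.34, 0.50, 0.55, 0.61, 0.70, 0.93]

    n, xbar = len(x), st.mean(x)
    s2 = st.variance(x)                        # (1/(n-1)) * sum((xi - xbar)**2)
    s = st.stdev(x)                            # sample standard deviation
    cov = s / xbar                             # coefficient of variation (unit-free)
    rng = max(x) - min(x)                      # range
    q1, _, q3 = st.quantiles(x, n=4)           # quartiles Q1, Q2, Q3
    siqr = (q3 - q1) / 2                       # semi-interquartile range
    mad = sum(abs(xi - xbar) for xi in x) / n  # mean absolute deviation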

12
Sample Variance, Stddev &amp; COV
  • When to use
  • When the data is normally distributed (uni-modal,
    symmetric)
  • Sample variance and stddev are well understood
  • COV has the advantage that it is unit-free
  • E.g., 5 is large and 0.2 is small

13
Ranges
  • Range, percentile range, and semi-interquartile
    range
  • Range is only useful for bounded distributions
  • Percentile range can be used for unbounded
    distributions
  • SIQR is useful when the data is uni-modal and
    skewed, and is generally used in conjunction with
    the median
  • Robust against outliers

14
Confidence Intervals
  • Confidence level
  • The probability that the true mean falls within the
    confidence interval
  • Experimental designs typically start with a desired
    confidence level and use the data to compute the
    resulting confidence interval
  • For normally distributed data
  • Interval: (x̄ - z·s/√n, x̄ + z·s/√n), where z is the
    1-α/2 normal quantile
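
A sketch of this interval in Python, assuming approximately normal data; statistics.NormalDist (Python 3.8+) supplies the z quantile, and the sample values are invented:

    from math import sqrt
    from statistics import NormalDist, mean, stdev

    def normal_ci(sample, confidence=0.95):
        """Confidence interval for the mean of (approximately) normal data."""
        n = len(sample)
        xbar, s = mean(sample), stdev(sample)
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z for 1 - alpha/2
        half = z * s / sqrt(n)
        return xbar - half, xbar + half

    # Hypothetical measurements (made up).
    print(normal_ci([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4]))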

15
Hypothesis Testing
  • Researchers often attempt to answer the question
    "Is A different from B?"
  • Null hypothesis: A is the same as B
  • The analysis tests the null hypothesis at a given
    confidence level
  • If the confidence intervals overlap, the null
    hypothesis cannot be rejected and the original
    hypothesis is not supported

16
Paired Observations
  • Conduct n experiments on each of two systems
  • If experiment i on system A corresponds to
    experiment i on system B, then they are paired
  • Treat each pair as a single experiment, with data
    value di = Ai - Bi
  • If confidence interval contains 0, then A and B
    are not significantly different
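
A sketch of a paired comparison under these assumptions; it reuses the normal-approximation interval from the previous slide (for small n a t quantile would be more appropriate), and the paired scores are made up:

    from math import sqrt
    from statistics import NormalDist, mean, stdev

    def paired_ci(a, b, confidence=0.95):
        """CI for the mean paired difference Ai - Bi;
        if it contains 0, A and B are not significantly different."""
        d = [ai - bi for ai, bi in zip(a, b)]
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        half = z * stdev(d) / sqrt(len(d))
        return mean(d) - half, mean(d) + half

    # Hypothetical paired precision scores for systems A and B (made up).
    a = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77, 0.72, 0.68]
    b = [0.69, 0.66, 0.75, 0.70, 0.70, 0.73, 0.69, 0.67]
    lo, hi = paired_ci(a, b)
    print("significant" if not (lo <= 0 <= hi) else "not significant")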

17
Unpaired Observations
  • Sample means x̄A, x̄B
  • Standard deviations sA, sB
  • Mean difference: x̄A - x̄B
  • Stddev of the mean difference
  • s = √(sA²/nA + sB²/nB)
  • Effective degrees of freedom, v
  • v is given by the Welch-Satterthwaite approximation
    (see the sketch below)
  • Confidence interval for the mean difference
  • (x̄A - x̄B) ± t[1-α/2; v]·s
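
A sketch of the unpaired comparison; the effective degrees of freedom use the standard Welch-Satterthwaite form, and scipy is assumed to be available only for the t quantile:

    from math import sqrt
    from statistics import mean, stdev
    from scipy import stats  # assumed available; used only for the t quantile

    def unpaired_ci(a, b, confidence=0.95):
        """CI for the difference of means of two unpaired samples;
        if it contains 0, A and B are not significantly different."""
        na, nb = len(a), len(b)
        sa, sb = stdev(a), stdev(b)
        diff = mean(a) - mean(b)
        se = sqrt(sa**2 / na + sb**2 / nb)  # stddev of the mean difference
        # effective degrees of freedom (Welch-Satterthwaite approximation)
        v = se**4 / ((sa**2 / na)**2 / (na - 1) + (sb**2 / nb)**2 / (nb - 1))
        t = stats.t.ppf(1 - (1 - confidence) / 2, v)
        return diff - t * se, diff + t * se

    # Hypothetical unpaired throughput samples (made up).
    print(unpaired_ci([10.2, 9.8, 10.5, 10.1, 9.9], [9.6, 9.9, 9.4, 9.8, 9.7, 9.5]))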

18
Determining Sample Size
  • To estimate the mean to within ±r%
  • Confidence interval: x̄ ± z·s/√n
  • where z is the normal variate for the desired
    confidence level
  • n = ((100·z·s) / (r·x̄))²
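
A sketch of the sample-size calculation, assuming a small pilot sample supplies the initial estimates of the mean and standard deviation (the pilot values are made up):

    from math import ceil
    from statistics import NormalDist, mean, stdev

    def required_sample_size(pilot, r_percent, confidence=0.95):
        """Samples needed to estimate the mean to within +/- r percent."""
        xbar, s = mean(pilot), stdev(pilot)
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        return ceil((100 * z * s / (r_percent * xbar)) ** 2)

    # Hypothetical pilot measurements; target: mean within +/- 5%.
    print(required_sample_size([10.2, 9.8, 10.5, 10.1, 9.9], r_percent=5))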

19
Basic Experimental Method
  • Using a corpus of documents manually annotated
    with relevant/irrelevant rankings
  • Run query against corpus
  • Count
  • # of relevant retrieved documents
  • # of retrieved documents
  • # of relevant documents
  • # of documents
  • Compute precision and recall
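
A minimal sketch of the precision/recall computation for one query; the document ids are made up and the function name is illustrative:

    def precision_recall(retrieved, relevant):
        """Precision and recall for a single query, given sets of document ids."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)  # relevant documents that were retrieved
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical query result: 5 documents retrieved, 6 judged relevant.
    print(precision_recall(retrieved=[1, 2, 3, 5, 8], relevant=[2, 3, 4, 8, 9, 10]))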

20
Problems With Method
  • This measures precision/recall for this query on
    this corpus
  • It is NOT an estimate for any query on any corpus
  • Several experiments are required to estimate
    general performance
  • Several queries
  • Several corpuses

21
Practical Problems
  • Practical problem(s) with determining precision
    and recall
  • How do you determine which documents are
    relevant?
  • Each corpus must be annotated for each query
  • Many corpuses are huge
  • Manual annotation is expensive or impractical

22
Design of Experiments
  • To get statistically valid results, need to
    repeat experiment several times
  • 10-fold cross-validation is typical in machine
    learning community
  • Separate training and testing data
  • Do not re-use testing data!
  • Use representatively sized corpus
  • Performance often dramatically affected by corpus
    size
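
A minimal sketch of k-fold splitting in plain Python (the function name and data are illustrative); each example appears in exactly one test fold and is never used to tune the system:

    import random

    def k_fold_splits(items, k=10, seed=0):
        """Split annotated examples into k disjoint folds; each fold serves
        once as held-out test data, the remainder as training data."""
        items = list(items)
        random.Random(seed).shuffle(items)
        folds = [items[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test

    for train, test in k_fold_splits(range(50), k=10):
        pass  # train on `train`, measure precision/recall on `test` only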

23
Ideal Experimental Setup
  • N annotated query/corpus pairs
  • Compute precision/recall for each pair
  • Compute precision/recall statistics
  • Mean
  • Standard deviation
  • Confidence intervals
  • Report results

24
Ideal Research Setup
  • N annotated query/corpus pairs
  • Develop/tune system using this data
  • M annotated query/corpus pairs
  • Distinct from development data!
  • Just before publication
  • Compute results based on unseen data
  • NEVER alter the system to improve its performance
    on this data!

25
TREC
  • TREC (Text REtrieval Conference) is an annual
    conference sponsored by various US government
    agencies
  • http://trec.nist.gov/
  • There are several task categories
  • Filtering, ad-hoc queries, ...
  • TREC prepares annotated datasets for researchers
    to develop and test their systems
  • TREC then runs the systems on unseen data

26
Real Life Intrudes
  • Annotated datasets are usually scarce
  • Squeeze every last drop of utility out of scarce
    datasets
  • Question: how can data be reused without creating
    invalid results?
  • Cross-validation, bootstrapping, ...

27
Over Fitting
  • Over-fitting is when the system has learned to
    recognize the noise as well as the signal
  • Optimizing the system may lead to over-fitting
  • Over-optimistic performance on test data
  • Sub-optimal performance on real data
  • Over-fitting is hard to detect

28
Comparing Systems
  • If anything, comparing systems is more difficult
    than simply estimating general performance
  • It is easy to take valid general performance
    estimates and create invalid comparisons!
  • If you conduct 100 experiments, each measuring
    paired significance at 95% confidence
  • Expect 5% of the conclusions to be incorrect
  • There is a 99.41% chance that at least one result
    is wrong!
  • Instead use α' = 1 - (1 - α)^(1/n), where α' is the
    per-comparison significance level and α is the
    desired overall (family-wise) significance level
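
A quick check of these numbers in Python; the helper name is illustrative:

    def per_comparison_alpha(overall_alpha, n_comparisons):
        """Per-comparison significance level that keeps the family-wise
        error rate at overall_alpha across n independent comparisons."""
        return 1 - (1 - overall_alpha) ** (1 / n_comparisons)

    print(1 - 0.95 ** 100)                  # ~0.9941: chance of >= 1 wrong result
    print(per_comparison_alpha(0.05, 100))  # ~0.000513 per comparison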

29
Some Experimental Designs
  • Some experimental designs from the machine
    learning community
  • Test difference of two proportions
  • Paired-differences t-test on several random
    subsets
  • Paired-differences t-test based on 10-fold
    cross-validation
  • McNemar's test
  • 5x2cv, five iterations of 2-fold cross-validation

30
McNemar's Test
  • For comparing classification algorithms A, B
  • Uses one test set, with n samples
  • χ² distribution; under the null hypothesis each
    disagreement is equally likely to favor A or B
    (P = 0.5)
  • If (|n01 - n10| - 1)² / (n01 + n10) > 3.841459
  • then the difference is significant with 95%
    confidence
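
A sketch of the statistic; n01 (misclassified by A only) and n10 (misclassified by B only) are made-up disagreement counts:

    def mcnemar_statistic(n01, n10):
        """McNemar statistic with continuity correction; compare against the
        chi-squared critical value 3.841459 (1 degree of freedom, 95%)."""
        return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

    stat = mcnemar_statistic(n01=30, n10=12)
    print(stat, "significant" if stat > 3.841459 else "not significant")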

31
Two Proportions Test
  • Based on comparing the error rates of A and B
  • PA = (n00 + n01)/n, PB = (n00 + n10)/n
  • Assumption: the probability of misclassifying an
    example is a binomial random variable
  • Mean n·PA, variance n·PA(1 - PA)
  • For reasonable n, approximate by the normal
    distribution, so PA - PB is normally distributed
  • Assuming PA and PB are independent!
  • z = (PA - PB) / √(2p(1 - p)/n), where p = (PA + PB)/2
  • For 95% confidence, the result is significant when
    |z| > 1.96
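
A sketch of the test under these assumptions, with p taken as the average of the two error rates; the error rates and test-set size are invented:

    from math import sqrt

    def two_proportions_z(err_a, err_b, n):
        """z statistic for the difference of two error rates measured on the
        same n-example test set (independence of the rates is assumed)."""
        p = (err_a + err_b) / 2
        return (err_a - err_b) / sqrt(2 * p * (1 - p) / n)

    z = two_proportions_z(err_a=0.14, err_b=0.11, n=1000)
    print(z, "significant" if abs(z) > 1.96 else "not significant")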

32
Resampled, Paired t-test
  • For comparing classification algorithms A, B
  • Choose N random subsets of size n
  • PiA, PiB are the proportions misclassified on
    subset i
  • Pi = PiA - PiB
  • P̄ = (1/N) Σ Pi
  • t = P̄·√N / √(Σ(Pi - P̄)² / (N - 1))
  • If |t| > 2.04523 (for N = 30 trials), then the
    difference is significant at 95% confidence
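
A sketch of the statistic, assuming p_a[i] and p_b[i] hold the error rates of A and B on random train/test split i (typically 30 splits for the 2.04523 threshold):

    from math import sqrt

    def resampled_paired_t(p_a, p_b):
        """Resampled paired t statistic over N random train/test splits."""
        diffs = [a - b for a, b in zip(p_a, p_b)]
        n = len(diffs)
        pbar = sum(diffs) / n
        var = sum((d - pbar) ** 2 for d in diffs) / (n - 1)
        return pbar * sqrt(n) / sqrt(var)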

33
k-fold Cross-validated Paired t-Test
  • Similar to previous test, except that use
    cross-validation instead of random resampling to
    create test sets
  • The k test sets are independent (non-overlapping)
  • If k is small (e.g. 2), then significant features
    of the dataset can be obscured

34
5x2cv Paired t-Test
  • Perform five 2-fold cross-validation tests
  • Each 2-fold cross-validation replication i gives
  • Pi(1) = PA(1) - PB(1), Pi(2) = PA(2) - PB(2)
  • si² = (Pi(1) - P̄i)² + (Pi(2) - P̄i)², where
    P̄i = (Pi(1) + Pi(2)) / 2
  • t = P1(1) / √((1/5) Σ si²)
  • The result is significant at 95% confidence when
    |t| > 2.02
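
A sketch of the statistic from the five replications; diffs[i] holds the two fold-wise A-B error-rate differences of replication i, and the sample numbers are made up:

    from math import sqrt

    def five_by_two_cv_t(diffs):
        """5x2cv paired t statistic; diffs = [(p_i1, p_i2)] for i = 1..5."""
        variances = []
        for p1, p2 in diffs:
            pbar_i = (p1 + p2) / 2
            variances.append((p1 - pbar_i) ** 2 + (p2 - pbar_i) ** 2)  # si^2
        return diffs[0][0] / sqrt(sum(variances) / 5)  # numerator is P1(1)

    print(five_by_two_cv_t([(0.02, 0.03), (0.04, 0.01), (0.03, 0.02),
                            (0.01, 0.02), (0.05, 0.02)]))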