Transcript and Presenter's Notes

Title: Performance Evaluation


1
Performance Evaluation
  • Carl Staelin

2
Motivation
  • It is often helpful to have some yardstick by
    which to compare systems
  • During development to evaluate different
    algorithms or optimizations
  • During purchasing to compare between product
    offerings

3
What To Measure?
  • Standard information retrieval measures
  • Precision: the fraction of retrieved documents
    that are deemed relevant
  • Recall: the fraction of relevant documents that
    are retrieved
  • Other measures may also be important
  • Execution time, disk space, ...

4
Experiment Design
  • There is a fine art to designing experiments
  • Ensure that you measure the item of interest
  • Ensure that all possible sources of variation
    have been identified and controlled
  • E.g., corpus is static, or at least controlled
  • Ensure that the statistical analysis is valid
  • Proper statistical techniques are applied
    correctly

5
Statistical Analysis
  • Statistics convert data into information
  • It is easy to misuse statistical tools
  • "There are lies, damned lies, and statistics"
  • Understand the assumptions upon which the tools
    are based
  • If you violate the assumptions, your results are
    not reliable
  • These types of mistakes are silent and are easy
    to overlook

6
Summarizing data by a single number
  • Mode: the most likely value
  • Median: the middle value
  • Arithmetic mean: x̄ = (Σxi) / n
  • Geometric mean: (Πxi)^(1/n)
  • Harmonic mean: n / Σ(1/xi)
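
A minimal sketch of these five summaries using Python's standard statistics module (Python 3.8+ for geometric_mean); the sample values are made up for illustration:

    from statistics import mode, median, mean, geometric_mean, harmonic_mean

    # Hypothetical latency sample (seconds); the values are made up.
    x = [0.12, 0.15, 0.15, 0.21, 0.34, 0.50]

    print("mode       =", mode(x))            # most likely value
    print("median     =", median(x))          # middle value
    print("arithmetic =", mean(x))            # sum(xi) / n
    print("geometric  =", geometric_mean(x))  # (prod(xi)) ** (1/n)
    print("harmonic   =", harmonic_mean(x))   # n / sum(1/xi)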

7
Mode
  • When to use the mode
  • When the data is categorical, or
  • When the data is numeric and multi-modal
  • When not to use the mode
  • When the data is numeric and uni-modal

8
Median
  • When to use the median
  • When the data is a numeric observation
  • When the data is skewed or is dominated by
    outliers
  • When the data is uni-modal
  • When not to use the median
  • If the total value is also of interest

9
Arithmetic Mean
  • When to use the mean
  • When the data is a numeric observation
  • When the data is not dominated by outliers
  • When the data is uni-modal, and is not skewed
  • When not to use the mean
  • When the data is categorical, or a ratio
  • When the distribution is badly skewed
  • What else not to do?
  • Do not assume the mean of a product equals the
    product of the means; that holds when the variables
    are independent, but not in general

10
Other Means
  • When to use the geometric mean
  • When the data is a numeric observation
  • When the item of interest is the product of the
    observations
  • E.g., cache hit ratios over several levels of
    cache
  • When to use the harmonic mean
  • Whenever the arithmetic mean can be justified for 1/x
  • For example, if you measure times t and want to
    average a rate m/t (e.g., CPU cycles per second),
    use the harmonic mean

11
Summarizing variability
  • Sample variance: s² = (1/(n-1)) Σ(xi - x̄)²
  • Sample stddev: s
  • Coefficient of variation (C.O.V.): s / x̄
  • Range: max xi - min xi
  • 10-90 percentile range
  • Semi-interquartile range (SIQR): (Q3 - Q1) / 2
  • Mean absolute deviation: (1/n) Σ|xi - x̄|
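
The same variability measures, sketched with the standard statistics module (statistics.quantiles needs Python 3.8+); the data below is invented:

    import statistics as st

    # Hypothetical response-time sample (seconds); the values are made up.
    x = [0.12, 0.15, 0.15, 0.21, 0.34, 0.50, 0.55, 0.61, 0.70, 0.93]

    n, xbar = len(x), st.mean(x)
    s2 = st.variance(x)                        # (1/(n-1)) * sum((xi - xbar)**2)
    s = st.stdev(x)                            # sample standard deviation
    cov = s / xbar                             # coefficient of variation (unit-free)
    rng = max(x) - min(x)                      # range
    q1, _, q3 = st.quantiles(x, n=4)           # quartiles Q1, Q2, Q3
    siqr = (q3 - q1) / 2                       # semi-interquartile range
    mad = sum(abs(xi - xbar) for xi in x) / n  # mean absolute deviation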

12
Sample Variance, Stddev &amp; COV
  • When to use
  • When the data is normally distributed (uni-modal,
    symmetric)
  • Sample variance and stddev are well understood
  • COV has the advantage that it is unit-free
  • E.g., 5 is large and 0.2 is small

13
Ranges
  • Range, percentile range, and semi-interquartile
    range
  • Range is only useful for bounded distributions
  • Percentile range can be used for unbounded
    distributions
  • SIQR is useful when the data is uni-modal and
    skewed, and is generally used in conjunction with
    the median
  • Robust against outliers

14
Confidence Intervals
  • Confidence level
  • The probability that the true mean falls within the
    confidence interval
  • Experimental designs typically start with a desired
    confidence level and use the data to compute the
    resulting confidence interval
  • For normally distributed data
  • Interval: (x̄ - z·s/√n, x̄ + z·s/√n), where z is the
    1-α/2 normal quantile
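
A sketch of this interval in Python, assuming approximately normal data; statistics.NormalDist (Python 3.8+) supplies the z quantile, and the sample values are invented:

    from math import sqrt
    from statistics import NormalDist, mean, stdev

    def normal_ci(sample, confidence=0.95):
        """Confidence interval for the mean of (approximately) normal data."""
        n = len(sample)
        xbar, s = mean(sample), stdev(sample)
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z for 1 - alpha/2
        half = z * s / sqrt(n)
        return xbar - half, xbar + half

    # Hypothetical measurements (made up).
    print(normal_ci([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4]))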

15
Hypothesis Testing
  • Researchers often attempt to answer the question
    "Is A different from B?"
  • Null hypothesis: A is the same as B
  • The analysis tests the null hypothesis at a given
    confidence level
  • If the confidence intervals overlap, the null
    hypothesis cannot be rejected and the original
    hypothesis is not supported

16
Paired Observations
  • Conduct n experiments on each of two systems
  • If experiment i on system A corresponds to
    experiment i on system B, then they are paired
  • Treat each pair as a single experiment, with data
    value di = Ai - Bi
  • If confidence interval contains 0, then A and B
    are not significantly different
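
A sketch of a paired comparison under these assumptions; it reuses the normal-approximation interval from the previous slide (for small n a t quantile would be more appropriate), and the paired scores are made up:

    from math import sqrt
    from statistics import NormalDist, mean, stdev

    def paired_ci(a, b, confidence=0.95):
        """CI for the mean paired difference Ai - Bi;
        if it contains 0, A and B are not significantly different."""
        d = [ai - bi for ai, bi in zip(a, b)]
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        half = z * stdev(d) / sqrt(len(d))
        return mean(d) - half, mean(d) + half

    # Hypothetical paired precision scores for systems A and B (made up).
    a = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77, 0.72, 0.68]
    b = [0.69, 0.66, 0.75, 0.70, 0.70, 0.73, 0.69, 0.67]
    lo, hi = paired_ci(a, b)
    print("significant" if not (lo <= 0 <= hi) else "not significant")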

17
Unpaired Observations
  • Sample means x̄A, x̄B
  • Standard deviations sA, sB
  • Mean difference: x̄A - x̄B
  • Stddev of the mean difference
  • s = √(sA²/nA + sB²/nB)
  • Effective degrees of freedom, v
  • v is given by the Welch-Satterthwaite approximation
    (see the sketch below)
  • Confidence interval for the mean difference
  • (x̄A - x̄B) ± t[1-α/2; v]·s
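
A sketch of the unpaired comparison; the effective degrees of freedom use the standard Welch-Satterthwaite form, and scipy is assumed to be available only for the t quantile:

    from math import sqrt
    from statistics import mean, stdev
    from scipy import stats  # assumed available; used only for the t quantile

    def unpaired_ci(a, b, confidence=0.95):
        """CI for the difference of means of two unpaired samples;
        if it contains 0, A and B are not significantly different."""
        na, nb = len(a), len(b)
        sa, sb = stdev(a), stdev(b)
        diff = mean(a) - mean(b)
        se = sqrt(sa**2 / na + sb**2 / nb)  # stddev of the mean difference
        # effective degrees of freedom (Welch-Satterthwaite approximation)
        v = se**4 / ((sa**2 / na)**2 / (na - 1) + (sb**2 / nb)**2 / (nb - 1))
        t = stats.t.ppf(1 - (1 - confidence) / 2, v)
        return diff - t * se, diff + t * se

    # Hypothetical unpaired throughput samples (made up).
    print(unpaired_ci([10.2, 9.8, 10.5, 10.1, 9.9], [9.6, 9.9, 9.4, 9.8, 9.7, 9.5]))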

18
Determining Sample Size
  • To estimate the mean to within ±r%
  • Confidence interval: x̄ ± z·s/√n
  • where z is the normal variate for the desired
    confidence level
  • n = ((100·z·s) / (r·x̄))²
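
A sketch of the sample-size calculation, assuming a small pilot sample supplies the initial estimates of the mean and standard deviation (the pilot values are made up):

    from math import ceil
    from statistics import NormalDist, mean, stdev

    def required_sample_size(pilot, r_percent, confidence=0.95):
        """Samples needed to estimate the mean to within +/- r percent."""
        xbar, s = mean(pilot), stdev(pilot)
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        return ceil((100 * z * s / (r_percent * xbar)) ** 2)

    # Hypothetical pilot measurements; target: mean within +/- 5%.
    print(required_sample_size([10.2, 9.8, 10.5, 10.1, 9.9], r_percent=5))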

19
Basic Experimental Method
  • Using a corpus of documents manually annotated
    with relevant/irrelevant rankings
  • Run query against corpus
  • Count
  • # of relevant retrieved documents
  • # of retrieved documents
  • # of relevant documents
  • # of documents
  • Compute precision and recall
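
A minimal sketch of the precision/recall computation for one query; the document ids are made up and the function name is illustrative:

    def precision_recall(retrieved, relevant):
        """Precision and recall for a single query, given sets of document ids."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)  # relevant documents that were retrieved
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical query result: 5 documents retrieved, 6 judged relevant.
    print(precision_recall(retrieved=[1, 2, 3, 5, 8], relevant=[2, 3, 4, 8, 9, 10]))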

20
Problems With Method
  • This measures precision/recall for this query on
    this corpus
  • It is NOT an estimate for any query on any corpus
  • Several experiments are required to estimate
    general performance
  • Several queries
  • Several corpuses

21
Practical Problems
  • Practical problem(s) with determining precision
    and recall
  • How do you determine which documents are
    relevant?
  • Each corpus must be annotated for each query
  • Many corpuses are huge
  • Manual annotation is expensive or impractical

22
Design of Experiments
  • To get statistically valid results, need to
    repeat experiment several times
  • 10-fold cross-validation is typical in machine
    learning community
  • Separate training and testing data
  • Do not re-use testing data!
  • Use representatively sized corpus
  • Performance often dramatically affected by corpus
    size
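
A minimal sketch of k-fold splitting in plain Python (the function name and data are illustrative); each example appears in exactly one test fold and is never used to tune the system:

    import random

    def k_fold_splits(items, k=10, seed=0):
        """Split annotated examples into k disjoint folds; each fold serves
        once as held-out test data, the remainder as training data."""
        items = list(items)
        random.Random(seed).shuffle(items)
        folds = [items[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test

    for train, test in k_fold_splits(range(50), k=10):
        pass  # train on `train`, measure precision/recall on `test` only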

23
Ideal Experimental Setup
  • N annotated query/corpus pairs
  • Compute precision/recall for each pair
  • Compute precision/recall statistics
  • Mean
  • Standard deviation
  • Confidence intervals
  • Report results

24
Ideal Research Setup
  • N annotated query/corpus pairs
  • Develop/tune system using this data
  • M annotated query/corpus pairs
  • Distinct from development data!
  • Just before publication
  • Compute results based on unseen data
  • NEVER alter the system to improve its performance
    on this data!

25
TREC
  • TREC (Text REtrieval Conference) is an annual
    conference sponsored by various US government
    agencies
  • http://trec.nist.gov/
  • There are several task categories
  • Filtering, ad-hoc queries, ...
  • TREC prepares annotated datasets for researchers
    to develop and test their systems
  • TREC then runs the systems on unseen data

26
Real Life Intrudes
  • Annotated datasets are usually scarce
  • Squeeze every last drop of utility out of scarce
    datasets
  • Question: how can data be reused without creating
    invalid results?
  • Cross-validation, bootstrapping, ...

27
Over Fitting
  • Over-fitting is when the system has learned to
    recognize the noise as well as the signal
  • Optimizing the system may lead to over-fitting
  • Over-optimistic performance on test data
  • Sub-optimal performance on real data
  • Over-fitting is hard to detect

28
Comparing Systems
  • If anything, comparing systems is more difficult
    than simply estimating general performance
  • It is easy to take valid general performance
    estimates and create invalid comparisons!
  • If you conduct 100 experiments, each measuring
    paired significance at 95% confidence
  • Expect 5% of the conclusions to be incorrect
  • There is a 99.41% chance that at least one result
    is wrong!
  • Instead use α' = 1 - (1 - α)^(1/n), where α' is the
    per-comparison significance level and α is the
    desired overall (family-wise) significance level
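
A quick check of these numbers in Python; the helper name is illustrative:

    def per_comparison_alpha(overall_alpha, n_comparisons):
        """Per-comparison significance level that keeps the family-wise
        error rate at overall_alpha across n independent comparisons."""
        return 1 - (1 - overall_alpha) ** (1 / n_comparisons)

    print(1 - 0.95 ** 100)                  # ~0.9941: chance of >= 1 wrong result
    print(per_comparison_alpha(0.05, 100))  # ~0.000513 per comparison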

29
Some Experimental Designs
  • Some experimental designs from the machine
    learning community
  • Test difference of two proportions
  • Paired-differences t-test on several random
    subsets
  • Paired-differences t-test based on 10-fold
    cross-validation
  • McNemar's test
  • 5x2cv, five iterations of 2-fold cross-validation

30
McNemar's Test
  • For comparing classification algorithms A, B
  • Uses one test set, with n samples
  • χ² distribution; under the null hypothesis each
    disagreement is equally likely to favor A or B
    (P = 0.5)
  • If (|n01 - n10| - 1)² / (n01 + n10) > 3.841459
  • then the difference is significant with 95%
    confidence
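
A sketch of the statistic; n01 (misclassified by A only) and n10 (misclassified by B only) are made-up disagreement counts:

    def mcnemar_statistic(n01, n10):
        """McNemar statistic with continuity correction; compare against the
        chi-squared critical value 3.841459 (1 degree of freedom, 95%)."""
        return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

    stat = mcnemar_statistic(n01=30, n10=12)
    print(stat, "significant" if stat > 3.841459 else "not significant")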

31
Two Proportions Test
  • Based on comparing the error rates of A and B
  • PA = (n00 + n01)/n, PB = (n00 + n10)/n
  • Assumption: the probability of misclassifying an
    example is a binomial random variable
  • Mean n·PA, variance n·PA(1 - PA)
  • For reasonable n, approximate by the normal
    distribution, so PA - PB is normally distributed
  • Assuming PA and PB are independent!
  • z = (PA - PB) / √(2p(1 - p)/n), where p = (PA + PB)/2
  • For 95% confidence, the result is significant when
    |z| > 1.96
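
A sketch of the test under these assumptions, with p taken as the average of the two error rates; the error rates and test-set size are invented:

    from math import sqrt

    def two_proportions_z(err_a, err_b, n):
        """z statistic for the difference of two error rates measured on the
        same n-example test set (independence of the rates is assumed)."""
        p = (err_a + err_b) / 2
        return (err_a - err_b) / sqrt(2 * p * (1 - p) / n)

    z = two_proportions_z(err_a=0.14, err_b=0.11, n=1000)
    print(z, "significant" if abs(z) > 1.96 else "not significant")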

32
Resampled, Paired t-test
  • For comparing classification algorithms A, B
  • Choose N random subsets of size n
  • PiA, PiB are the proportions misclassified on
    subset i
  • Pi = PiA - PiB
  • P̄ = (1/N) Σ Pi
  • t = P̄·√N / √(Σ(Pi - P̄)² / (N - 1))
  • If |t| > 2.04523 (for N = 30 trials), then the
    difference is significant at 95% confidence
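
A sketch of the statistic, assuming p_a[i] and p_b[i] hold the error rates of A and B on random train/test split i (typically 30 splits for the 2.04523 threshold):

    from math import sqrt

    def resampled_paired_t(p_a, p_b):
        """Resampled paired t statistic over N random train/test splits."""
        diffs = [a - b for a, b in zip(p_a, p_b)]
        n = len(diffs)
        pbar = sum(diffs) / n
        var = sum((d - pbar) ** 2 for d in diffs) / (n - 1)
        return pbar * sqrt(n) / sqrt(var)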

33
k-fold Cross-validated Paired t-Test
  • Similar to previous test, except that use
    cross-validation instead of random resampling to
    create test sets
  • The k test sets are independent (non-overlapping)
  • If k is small (e.g. 2), then significant features
    of the dataset can be obscured

34
5x2cv Paired t-Test
  • Perform five 2-fold cross-validation tests
  • Each 2-fold cross-validation replication i gives
  • Pi(1) = PA(1) - PB(1), Pi(2) = PA(2) - PB(2)
  • si² = (Pi(1) - P̄i)² + (Pi(2) - P̄i)², where
    P̄i = (Pi(1) + Pi(2)) / 2
  • t = P1(1) / √((1/5) Σ si²)
  • The result is significant at 95% confidence when
    |t| > 2.02
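
A sketch of the statistic from the five replications; diffs[i] holds the two fold-wise A-B error-rate differences of replication i, and the sample numbers are made up:

    from math import sqrt

    def five_by_two_cv_t(diffs):
        """5x2cv paired t statistic; diffs = [(p_i1, p_i2)] for i = 1..5."""
        variances = []
        for p1, p2 in diffs:
            pbar_i = (p1 + p2) / 2
            variances.append((p1 - pbar_i) ** 2 + (p2 - pbar_i) ** 2)  # si^2
        return diffs[0][0] / sqrt(sum(variances) / 5)  # numerator is P1(1)

    print(five_by_two_cv_t([(0.02, 0.03), (0.04, 0.01), (0.03, 0.02),
                            (0.01, 0.02), (0.05, 0.02)]))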