Title: Empirical Methods for AI
1. Empirical Methods for AI and CS
- Paul Cohen (cohen@cs.umass.edu)
- Ian P. Gent (ipg@dcs.st-and.ac.uk)
- Toby Walsh (tw@cs.york.ac.uk)
2Overview
- Introduction
- What are empirical methods?
- Why use them?
- Case Study
- Eight Basic Lessons
- Experiment design
- Data analysis
- How not to do it
- Supplementary material
3Resources
- Web
- www.cs.york.ac.uk/tw/empirical.html
- www.cs.amherst.edu/dsj/methday.html
- Books
- Empirical Methods for AI, Paul Cohen, MIT Press, 1995
- Journals
- Journal of Experimental Algorithmics, www.jea.acm.org
- Conferences
- Workshop on Empirical Methods in AI (last Saturday, ECAI-02?)
- Workshop on Algorithm Engineering and Experiments, ALENEX 01 (alongside SODA)
4Empirical Methods for CS
5What does empirical mean?
- Relying on observations, data, experiments
- Empirical work should complement theoretical work
- Theories often have holes (e.g., how big is the constant term? Is the current problem a bad one?)
- Theories are suggested by observations
- Theories are tested by observations
- Conversely, theories direct our empirical attention
- In addition (in this tutorial at least), empirical means wanting to understand the behaviour of complex systems
6. Why We Need Empirical Methods: Cohen's 1990 Survey of 150 AAAI Papers
- Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem.
- Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse.
- Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis.
- Theory papers generally had no applications or empirical work to support them; empirical papers were demonstrations, not experiments, and had no underlying theoretical support.
- The essential synergy between theory and empirical work was missing.
7Theory, not Theorems
- Theory-based science need not be all theorems
- otherwise science would be mathematics
- Consider the theory of QED
- based on a model of the behaviour of particles
- predictions accurate to many decimal places (9?)
- most accurate theory in the whole of science?
- success derived from accuracy of predictions
- not the depth or difficulty or beauty of theorems
- QED is an empirical theory!
8Empirical CS/AI
- Computer programs are formal objects
- so let's reason about them entirely formally?
- Two reasons why we can't or won't:
- theorems are hard
- some questions are empirical in nature
- e.g. are Horn clauses adequate to represent the sort of knowledge met in practice?
- e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?
9Empirical CS/AI
- Treat computer programs as natural objects
- like fundamental particles, chemicals, living organisms
- Build (approximate) theories about them
- construct hypotheses
- e.g. greedy hill-climbing is important to GSAT
- test with empirical experiments
- e.g. compare GSAT with other types of hill-climbing
- refine hypotheses and modelling assumptions
- e.g. greediness not important, but hill-climbing is!
10Empirical CS/AI
- Many advantages over other sciences
- Cost
- no need for expensive super-colliders
- Control
- unlike the real world, we often have complete command of the experiment
- Reproducibility
- in theory, computers are entirely deterministic
- Ethics
- no ethics panels needed before you run experiments
11Types of hypothesis
- "My search program is better than yours"
- not very helpful: a beauty competition?
- "Search cost grows exponentially with the number of variables for this kind of problem"
- better, as we can extrapolate to data not yet seen?
- "Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems"
- even better, as we can extrapolate to new situations?
12A typical conference conversation
- "What are you up to these days?"
- "I'm running an experiment to compare the Davis-Putnam algorithm with GSAT."
- "Why?"
- "I want to know which is faster."
- "Why?"
- "Lots of people use each of these algorithms."
- "How will these people use your result?" ...
13Keep in mind the BIG picture
- "What are you up to these days?"
- "I'm running an experiment to compare the Davis-Putnam algorithm with GSAT."
- "Why?"
- "I have this hypothesis that neither will dominate."
- "What use is this?"
- "A portfolio containing both algorithms will be more robust than either algorithm on its own."
14Keep in mind the BIG picture
- ...
- "Why are you doing this?"
- "Because many real problems are intractable in theory but need to be solved in practice."
- "How does your experiment help?"
- "It helps us understand the difference between average-case and worst-case results."
- "So why is this interesting?"
- "Intractability is one of the BIG open questions in CS!"
15Why is empirical CS/AI in vogue?
- Inadequacies of theoretical analysis
- problems often aren't as hard in practice as theory predicts in the worst case
- average-case analysis is very hard (and often based on questionable assumptions)
- Some spectacular successes
- phase transition behaviour
- local search methods
- theory lagging behind algorithm design
16Why is empirical CS/AI in vogue?
- Compute power ever increasing
- even intractable problems coming into range
- easy to perform large (and sometimes meaningful) experiments
- Empirical CS/AI perceived to be easier than theoretical CS/AI
- often a false perception, as experiments are easier to mess up than proofs
17Empirical Methods for CS
- Part II: A Case Study
- Eight Basic Lessons
18Rosenberg study
- "An Empirical Study of Dynamic Scheduling on Rings of Processors"
- Gregory, Gao, Rosenberg and Cohen
- Proc. of 8th IEEE Symp. on Parallel and Distributed Processing, 1996
- Linked to from www.cs.york.ac.uk/tw/empirical.html
19Problem domain
- Scheduling processors on a ring network
- jobs spawned as binary trees
- KOSO
- keep one, send one to my left or right neighbour arbitrarily
- KOSO*
- keep one, send one to my least heavily loaded neighbour
20Theory
- On complete binary trees, KOSO is asymptotically optimal
- So KOSO* can't be any better?
- But the assumptions are unrealistic
- tree not complete
- asymptotic behaviour is not necessarily the same as behaviour in practice!
- Theorem: using KOSO on a ring of p processors, a binary tree of height n is executed within time (2^n - 1)/p plus low-order terms
21Benefits of an empirical study
- More realistic trees
- probabilistic generator that makes shallow trees, bushy near the root but quickly getting scrawny
- similar to trees generated when performing Trapezoid or Simpson's Rule calculations
- binary trees correspond to interval bisection
- Startup costs
- network must be loaded
22. Lesson 1: Evaluation begins with claims. Lesson 2: Demonstration is good, understanding is better
- Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better
- The "because" phrase indicates a hypothesis about why it works. This is a better hypothesis than the beauty-contest demonstration that KOSO* beats KOSO
- Experiment design
- Independent variables: KOSO vs KOSO*, no. of processors, no. of jobs, probability(job will spawn), ...
- Dependent variable: time to complete jobs
23. Criticism 1: This experiment design includes no direct measure of the hypothesized effect
- Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better
- But the experiment design includes no direct measure of load balancing
- Independent variables: KOSO vs KOSO*, no. of processors, no. of jobs, probability(job will spawn), ...
- Dependent variable: time to complete jobs
24. Lesson 3: Exploratory data analysis means looking beneath immediate results for explanations
- T-test on time to complete jobs: t = (2825 - 2935)/587 = -0.19
- KOSO* apparently no faster than KOSO (as theory predicted)
- Why? Look more closely at the data
- Outliers create excessive variance, so the test isn't significant
[Histograms of run time for KOSO and KOSO*]
25. Lesson 4: The task of empirical work is to explain variability
Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise.
Number of processors and number of jobs explain 74% of the variance in run time. The choice of algorithm explains almost none.
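To make Lesson 4 concrete, here is a small sketch (synthetic data, not the Rosenberg measurements) of how the proportion of variance explained by a set of factors can be computed as the R² of a least-squares fit:

```python
# Proportion of variance in run time explained by number of processors and jobs:
# R^2 of a least-squares fit. The data here are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
procs = rng.integers(3, 21, size=200)
jobs = rng.integers(100, 1000, size=200)
runtime = 5.0 * jobs / procs + rng.normal(0, 20, size=200)   # noisy "measurements"

X = np.column_stack([procs, jobs, np.ones_like(procs)])      # linear model with intercept
coef, *_ = np.linalg.lstsq(X, runtime, rcond=None)
residuals = runtime - X @ coef
r_squared = 1 - residuals.var() / runtime.var()
print(r_squared)   # fraction of run-time variance explained by the two factors
```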
26. Lesson 3 (again): Exploratory data analysis means looking beneath immediate results for explanations
- Why does the KOSO/KOSO* choice account for so little of the variance in run time?
- Unless processors starve, there will be no effect of load balancing. In most conditions in this experiment, processors never starved. (This is why we run pilot experiments!)
27. Lesson 5: Of sample variance, effect size, and sample size, control the first before touching the last
- A test statistic is, roughly, an effect size divided by a standard error that grows with sample variance and shrinks with sample size; this intimate relationship holds for all statistics
28. Lesson 5 illustrated: a variance reduction method
- Let N = number of jobs, P = number of processors, T = run time
- Then T = k(N/P), i.e. the run time is k multiples of the theoretical best time, and so k = T/(N/P) (see the code sketch below)
[Histograms of k for KOSO and KOSO*]
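In code the recoding is a one-liner (an illustrative sketch; the numbers are made up):

```python
# Recode raw run time T as k, a multiple of the theoretical best time (N/P).
def recode_k(T, N, P):
    """k = T / (N / P): 1.0 means the run achieved the theoretical best time."""
    return T / (N / P)

print(recode_k(T=600.0, N=1000, P=10))  # k = 6.0 for these illustrative numbers
```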
29. Where are we?
- KOSO* is significantly better than KOSO when the dependent variable is recoded as a percentage of optimal run time
- The difference between KOSO and KOSO* explains very little of the variance in either dependent variable
- Exploratory data analysis tells us that processors aren't starving, so we shouldn't be surprised
- Prediction: the effect of algorithm on run time (or k) increases as the number of jobs increases or the number of processors increases
- This prediction is about interactions between factors
30. Lesson 6: Most interesting science is about interaction effects, not simple main effects
- Data confirm the prediction
- KOSO* is superior on larger rings, where starvation is an issue
- Interaction of independent variables
- choice of algorithm
- number of processors
- Interaction effects are essential to explaining how things work
[Plot: multiples of optimal run time (k) against number of processors (3, 6, 10, 20) for KOSO and KOSO*]
31. Lesson 7: Significant and meaningful are not synonymous. Is a result meaningful?
- KOSO* is significantly better than KOSO, but can you use the result?
- Suppose you wanted to use the knowledge of whether the ring is controlled by KOSO or KOSO* for some prediction.
- Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
- Pr(trial i under KOSO has k > 1.11) = 0.57
- Pr(trial i under KOSO* has k > 1.11) = 0.43
- Predict for trial i whether its k is above or below the median
- If it's a KOSO* trial you'll say "no", with (.43)(150) = 64.5 errors
- If it's a KOSO trial you'll say "yes", with (1 - .57)(160) = 68.8 errors
- If you don't know the algorithm you'll make (.5)(310) = 155 errors
- 155 - (64.5 + 68.8) ≈ 22
- Knowing the algorithm reduces the error rate from .5 to .43. Is this enough???
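The error arithmetic above can be checked in a few lines of Python (the probabilities and trial counts are those on the slide):

```python
# Expected prediction errors with and without knowing the algorithm.
koso_star_trials, koso_trials = 150, 160
p_above_given_koso_star = 0.43      # Pr(k > median | KOSO*)
p_above_given_koso = 0.57           # Pr(k > median | KOSO)

errors_known = (p_above_given_koso_star * koso_star_trials        # say "below" for KOSO*
                + (1 - p_above_given_koso) * koso_trials)         # say "above" for KOSO
errors_unknown = 0.5 * (koso_star_trials + koso_trials)           # guess either way

print(errors_known, errors_unknown, errors_unknown - errors_known)  # 133.3, 155.0, ~22
```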
32. Lesson 8: Keep the big picture in mind
- Why are you studying this?
- Load balancing is important to get good performance out of parallel computers
- Why is this important?
- Parallel computing promises to tackle many of our computational bottlenecks
- How do we know this? It's in the first paragraph of the paper!
33Case study conclusions
- Evaluation begins with claims
- Demonstrations of simple main effects are good, understanding the effects is better
- Exploratory data analysis means using your eyes to find explanatory patterns in data
- The task of empirical work is to explain variability
- Control variability before increasing sample size
- Interaction effects are essential to explanations
- Significant ≠ meaningful
- Keep the big picture in mind
34Empirical Methods for CS
- Part III: Experiment design
35Experimental Life Cycle
- Exploration
- Hypothesis construction
- Experiment
- Data analysis
- Drawing of conclusions
36Checklist for experiment design
- Consider the experimental procedure
- making it explicit helps to identify spurious effects and sampling biases
- Consider a sample data table
- identifies what results need to be collected
- clarifies dependent and independent variables
- shows whether data pertain to hypothesis
- Consider an example of the data analysis
- helps you to avoid collecting too little or too much data
- especially important when looking for interactions
(From Chapter 3, Empirical Methods for Artificial Intelligence, Paul Cohen, MIT Press)
37Guidelines for experiment design
- Consider possible results and their interpretation
- may show that the experiment cannot support/refute the hypotheses under test
- unforeseen outcomes may suggest new hypotheses
- What was the question again?
- easy to get carried away designing an experiment and lose the BIG picture
- Run a pilot experiment to calibrate parameters (e.g., number of processors in the Rosenberg experiment)
38Types of experiment
- Manipulation experiment
- Observation experiment
- Factorial experiment
39Manipulation experiment
- Independent variable, x
- x = identity of parser, size of dictionary, ...
- Dependent variable, y
- y = accuracy, speed, ...
- Hypothesis
- x influences y
- Manipulation experiment
- change x, record y
40Observation experiment
- Predictor, x
- x = volatility of stock prices, ...
- Response variable, y
- y = fund performance, ...
- Hypothesis
- x influences y
- Observation experiment
- classify according to x, compute y
41Factorial experiment
- Several independent variables, xi
- there may be no simple causal links
- data may come that way
- e.g. individuals will have different sexes, ages, ...
- Factorial experiment
- every possible combination of the xi considered
- expensive, as its name suggests!
42Designing factorial experiments
- In general, stick to 2 to 3 independent variables
- Solve the same set of problems in each case
- reduces variance due to differences between problem sets
- If this is not possible, use the same sample sizes
- simplifies statistical analysis
- As usual, the default hypothesis is that no influence exists
- much easier to fail to demonstrate an influence than to demonstrate an influence
43Some problem issues
- Control
- Ceiling and Floor effects
- Sampling Biases
44Control
- A control is an experiment in which the hypothesised variation does not occur
- so the hypothesised effect should not occur either
- BUT remember
- placebos cure a large percentage of patients!
45. Control: a cautionary tale
- Macaque monkeys given a vaccine based on human T-cells infected with SIV (a relative of HIV)
- macaques gained immunity from SIV
- Later, macaques were given uninfected human T-cells
- and the macaques still gained immunity!
- Control experiment not originally done
- and not always obvious (you can't control for all variables)
46. Control: MYCIN case study
- MYCIN was a medical expert system
- recommended therapy for blood/meningitis infections
- How to evaluate its recommendations?
- Shortliffe used
- 10 sample problems, 8 therapy recommenders
- 5 faculty, 1 resident, 1 postdoc, 1 student
- 8 impartial judges gave 1 point per problem
- max score was 80
- MYCIN 65, faculty 40-60, postdoc 60, resident 45, student 30
47. Control: MYCIN case study
- What were the controls?
- Control for the judges' bias for/against computers
- judges did not know who recommended each therapy
- Control for easy problems
- the medical student did badly, so the problems were not easy
- Control for our standard being low
- e.g. random choice should do worse
- Control for the factor of interest
- e.g. the hypothesis in MYCIN that "knowledge is power"
- have groups with different levels of knowledge
48Ceiling and Floor Effects
- Well-designed experiments (with good controls) can still go wrong
- What if all our algorithms do particularly well?
- Or they all do badly?
- We've got little evidence to choose between them
49Ceiling and Floor Effects
- Ceiling effects arise when test problems are insufficiently challenging
- floor effects are the opposite, when problems are too challenging
- A problem in AI because we often repeatedly use the same benchmark sets
- most benchmarks will lose their challenge eventually?
- but how do we detect this effect?
50Machine learning example
- 14 datasets from the UCI corpus of benchmarks
- used as a mainstay of the ML community
- Problem is learning classification rules
- each item is a vector of features and a classification
- measure classification accuracy of method (max 100%)
- Compare C4 with 1R, two competing algorithms
- Rob Holte, Machine Learning, vol. 11, pp. 63-91, 1993
- www.site.uottawa.edu/holte/Publications/simple_rules.ps
51. Floor effects: machine learning example
- DataSet: BC CH GL G2 HD HE ... Mean
- C4: 72 99.2 63.2 74.3 73.6 81.2 ... 85.9
- 1R: 72.5 69.2 56.4 77 78 85.1 ... 83.8
- Is 1R above the floor of performance? How would we tell?
52. Floor effects: machine learning example
- DataSet: BC CH GL G2 HD HE ... Mean
- C4: 72 99.2 63.2 74.3 73.6 81.2 ... 85.9
- 1R: 72.5 69.2 56.4 77 78 85.1 ... 83.8
- Baseline: 70.3 52.2 35.5 53.4 54.5 79.4 ... 59.9
- The baseline rule puts all items in the more popular category. 1R is above the baseline on most datasets. A bit like the prime number joke? 1 is prime, 3 is prime, 5 is prime, so the "baseline rule" is that all odd numbers are prime.
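To make the floor concrete, here is a small Python sketch (with hypothetical labels) of the majority-class baseline used above:

```python
# Illustrative sketch (not from the Holte paper): a majority-class baseline,
# the kind of floor the comparison above uses.
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels for a two-class dataset:
labels = ["benign"] * 70 + ["malignant"] * 30
print(majority_baseline_accuracy(labels))  # 0.7 -- any learner should beat this
```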
53. Ceiling effects: machine learning
- DataSet: BC GL HY LY MU ... Mean
- C4: 72 63.2 99.1 77.5 100.0 ... 85.9
- 1R: 72.5 56.4 97.2 70.7 98.4 ... 83.8
- How do we know that C4 and 1R are not near the ceiling of performance?
- Do the datasets have enough attributes to make perfect classification possible?
- Obviously for MU, but what about the rest?
54. Ceiling effects: machine learning
- DataSet: BC GL HY LY MU ... Mean
- C4: 72 63.2 99.1 77.5 100.0 ... 85.9
- 1R: 72.5 56.4 97.2 70.7 98.4 ... 83.8
- max(C4, 1R): 72.5 63.2 99.1 77.5 100.0 87.4
- max(Buntine): 72.8 60.4 99.1 66.0 98.6 82.0
- C4 achieves only about 2% better accuracy than 1R
- The best of C4/1R achieves 87.4% accuracy
- We have only weak evidence that C4 is better
- Both methods appear to be performing near the ceiling of what is possible, so comparison is hard!
55. Ceiling effects: machine learning
- In fact, 1R uses only one feature (the best one)
- C4 uses on average 6.6 features
- the extra 5.6 features buy only about 2% improvement
- Conclusion?
- Either real-world learning problems are easy (use 1R)
- Or we need more challenging datasets
- We need to be aware of ceiling effects in results
56Sampling bias
- Data collection is biased against certain data
- e.g. a teacher who says "Girls don't answer maths questions"
- observation might suggest
- girls don't answer many questions
- but only because the teacher doesn't ask them many questions
- Experienced AI researchers don't do that, right?
57. Sampling bias: Phoenix case study
- AI system to fight (simulated) forest fires
- Experiments suggest that wind speed is uncorrelated with time to put out the fire
- obviously incorrect, as high winds spread forest fires
58. Sampling bias: Phoenix case study
- Wind speed vs containment time (max 150 hours):
- wind speed 3: 120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103
- wind speed 6: 78, 61, 58, 81, 71, 57, 21, 32, 70
- wind speed 9: 62, 48, 21, 55, 101
- What's the problem?
59. Sampling bias: Phoenix case study
- The cut-off of 150 hours introduces sampling bias
- many high-wind fires get cut off, not many low-wind ones
- On the remaining data there is no positive correlation between wind speed and containment time (r = -0.53)
- In fact, the data show that
- a lot of high-wind fires take > 150 hours to contain
- those that don't are similar to low-wind fires
- You wouldn't do this, right?
- you might if you had automated data analysis
60Sampling biases can be subtle...
- Assume gender (G) is an independent variable and number of siblings (S) is a noise variable.
- If S is truly a noise variable, then under random sampling no dependency should exist between G and S in samples.
- Parents have children until they get at least one boy; they don't feel the same way about girls. In a sample of 1000 girls the number with S = 0 is smaller than in a sample of 1000 boys (see the simulation sketch below).
- The frequency distribution of S is different for the different genders: S and G are not independent.
- Girls do better at math than boys in random samples at all levels of education.
- Is this because of their genes, or because they have more siblings?
- What else might be systematically associated with G that we don't know about?
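A minimal Python simulation of this stopping rule (an illustration, assuming a 50/50 birth ratio and that every family follows the rule):

```python
# Illustrative simulation: families have children until the first boy, then stop.
# A "noise" variable -- number of siblings -- becomes associated with gender.
import random

def simulate_family():
    """Return the list of children ('G'/'B') for one family under the stopping rule."""
    children = []
    while True:
        child = random.choice("GB")
        children.append(child)
        if child == "B":
            return children

boys_siblings, girls_siblings = [], []
for _ in range(100_000):
    kids = simulate_family()
    for kid in kids:
        siblings = len(kids) - 1
        (boys_siblings if kid == "B" else girls_siblings).append(siblings)

print("mean siblings, boys :", sum(boys_siblings) / len(boys_siblings))
print("mean siblings, girls:", sum(girls_siblings) / len(girls_siblings))
# Girls end up with more siblings on average, so S and G are not independent.
```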
61Empirical Methods for CS
62Kinds of data analysis
- Exploratory (EDA): looking for patterns in data
- Statistical inferences from sample data
- Testing hypotheses
- Estimating parameters
- Building mathematical models of datasets
- Machine learning, data mining
- We will introduce hypothesis testing and
computer-intensive methods
63. The logic of hypothesis testing
- Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long-run behavior?) and what is your residual uncertainty?
- You say, "If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn't fair."
- Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (8 heads) has low probability p, reject the assertion, with residual uncertainty related to p.
- Estimate p with a sampling distribution.
64. Probability of a sample result under a null hypothesis
- If the coin were fair (p = .5, the null hypothesis), what is the probability distribution of r, the number of heads obtained in N tosses of a fair coin? Get it analytically or estimate it by simulation (on a computer):
- Loop K times
- r = 0 (r is the number of heads in N tosses)
- Loop N times (simulate the tosses)
- Generate a random 0 <= x < 1.0
- If x < p then increment r (p is the probability of a head)
- Push r onto sampling_distribution
- Print sampling_distribution
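The same simulation in runnable Python (a sketch; the constants N = 10 tosses and K = 10,000 samples are illustrative):

```python
# Monte Carlo estimate of the sampling distribution of r (number of heads)
# under the null hypothesis p = 0.5, and of Pr(r >= 8) for N = 10 tosses.
import random

def sampling_distribution(K=10_000, N=10, p=0.5):
    dist = []
    for _ in range(K):
        r = sum(1 for _ in range(N) if random.random() < p)  # one simulated sample
        dist.append(r)
    return dist

dist = sampling_distribution()
p_value = sum(1 for r in dist if r >= 8) / len(dist)
print(p_value)  # roughly 0.055; the exact binomial tail probability is ~0.0547
```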
65Sampling distributions
This is the estimated sampling distribution of r under the null hypothesis that p = .5. The estimate is constructed by Monte Carlo sampling.
66The logic of hypothesis testing
- Establish a null hypothesis: H0: p = .5, the coin is fair
- Establish a statistic: r, the number of heads in N tosses
- Figure out the sampling distribution of r given H0
- The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8
- If this probability is very low, reject H0, the null hypothesis
- Residual uncertainty is p
67. The only tricky part is getting the sampling distribution
- Sampling distributions can be derived...
- Exactly: e.g., binomial probabilities for coins are given by the formula Pr(r heads in N tosses) = C(N, r) p^r (1 - p)^(N - r)
- Analytically: e.g., the central limit theorem tells us that the sampling distribution of the mean approaches a Normal distribution as samples grow to infinity
- Estimated by Monte Carlo simulation of the null hypothesis process
68. A common statistical test: the Z test for different means
- A sample of N = 25 computer science students has mean IQ m = 135. Are they smarter than average?
- The population mean is 100, with standard deviation 15
- The null hypothesis, H0, is that the CS students are "average", i.e., the mean IQ of the population of CS students is 100
- What is the probability p of drawing the sample if H0 were true? If p is small, then H0 is probably false
- Find the sampling distribution of the mean of a sample of size 25, drawn from a population with mean 100
69. Central Limit Theorem
The sampling distribution of the mean is given by the Central Limit Theorem: the sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity. If the samples are drawn from a population with mean μ and standard deviation σ, then the mean of the sampling distribution is μ and its standard deviation is σ/√N, which shrinks as N increases. These statements hold irrespective of the shape of the original distribution.
70. The sampling distribution for the CS student example
- If a sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis), then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation 15/√25 = 3. The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution. Less than 5% of a normal distribution falls more than two standard deviations away from the mean, so the probability that the students are "average" is roughly zero.
71. The Z test
- The test statistic recodes the sample statistic (the sample mean, 135) in units of the standard deviation of the sampling distribution (mean 100, std 3):
- Z = (sample mean - mean of sampling distribution) / std of sampling distribution = (135 - 100) / 3 = 11.67
- On the recoded scale the sampling distribution has mean 0 and std 1.0, and the test statistic is 11.67.
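A short Python check of this calculation (a sketch assuming scipy is available):

```python
# Z test for the CS-student example: H0 says the population mean is 100 (sd 15).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, sample_mean = 100, 15, 25, 135
z = (sample_mean - mu0) / (sigma / sqrt(n))   # (135 - 100) / 3 = 11.67
p_value = norm.sf(z)                          # one-tailed Pr(Z >= z) under H0
print(z, p_value)                             # 11.67, effectively zero
```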
72. Reject the null hypothesis?
- Commonly we reject H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say < .05.
- A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of the sample statistic given H0.
- We find the probabilities by looking them up in tables, or statistics packages provide them.
- For example, Pr(Z >= 1.65) ≈ .05 and Pr(Z >= 2.33) ≈ .01.
- Pr(Z >= 11) is approximately zero, so reject H0.
73The t test
- Same logic as the Z test, but appropriate when the population standard deviation is unknown, samples are small, etc.
- The sampling distribution is t, not normal, but approaches normal as sample size increases
- The test statistic has a very similar form, but probabilities of the test statistic are obtained by consulting tables of the t distribution, not the normal
74. The t test
- Suppose N = 5 students have mean IQ 135 and sample standard deviation 27
- Estimate the standard deviation of the sampling distribution using the sample standard deviation: 27/√5 ≈ 12.1
- Test statistic: t = (sample mean - mean of sampling distribution) / estimated std = (135 - 100) / 12.1 = 2.89
- On the recoded scale the sampling distribution has mean 0 and std 1.0, and the test statistic is 2.89.
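And the corresponding check for the t test (again a sketch assuming scipy; the p value matches the .022 quoted on the next slide):

```python
# One-sample t test from summary statistics: N = 5, mean 135, sd 27, H0 mean 100.
from math import sqrt
from scipy.stats import t as t_dist

n, sample_mean, sample_sd, mu0 = 5, 135.0, 27.0, 100.0
t_value = (sample_mean - mu0) / (sample_sd / sqrt(n))  # about 2.89
p_value = t_dist.sf(t_value, df=n - 1)                 # one-tailed, about 0.022
print(t_value, p_value)
```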
75Summary of hypothesis testing
- H0 negates what you want to demonstrate; find the probability p of the sample statistic under H0 by comparing the test statistic to the sampling distribution; if the probability is low, reject H0 with residual uncertainty proportional to p.
- Example: we want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p = .022
- Have we proved CS students are smarter? NO!
- We have only shown that a mean of 135 is unlikely if they aren't. We never prove what we want to demonstrate; we only reject H0, with residual uncertainty.
- And failing to reject H0 does not prove H0, either!
76Common tests
- Tests that means are equal
- Tests that samples are uncorrelated or
independent - Tests that slopes of lines are equal
- Tests that predictors in rules have predictive
power - Tests that frequency distributions (how often
events happen) are equal - Tests that classification variables such as
smoking history and heart disease history are
unrelated - ...
- All follow the same basic logic
77Computer-intensive Methods
- Basic idea: construct sampling distributions by simulating on a computer the process of drawing samples.
- Three main methods
- Monte Carlo simulation: when one knows the population parameters
- Bootstrap: when one doesn't
- Randomization: also assumes nothing about the population
- Enormous advantage: works for any statistic and makes no strong parametric assumptions (e.g., normality)
78. Another Monte Carlo example, relevant to machine learning...
- Suppose you want to buy stocks in a mutual fund; for simplicity assume there are just N = 50 funds to choose from and you'll base your decision on the proportion of the J = 30 stocks in each fund that increased in value
- Suppose Pr(a stock increasing in price) = .75
- You are tempted by the best of the funds, F, which reports price increases in 28 of its 30 stocks.
- What is the probability of this performance?
79. Simulate...
- Loop K = 1000 times
- B = 0 (number of stocks that increase in the best of the N funds)
- Loop N = 50 times (N is the number of funds)
- H = 0 (stocks that increase in this fund)
- Loop M = 30 times (M is the number of stocks in this fund)
- Toss a coin with bias p to decide whether this stock increases in value, and if so increment H
- Push H on a list (we get N values of H)
- B = maximum(H) (the number of increasing stocks in the best fund)
- Push B on a list (we get K values of B)
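A runnable Python version of this simulation (a sketch; the constants follow the slide):

```python
# Probability that the best of N = 50 funds reports >= 28 of its M = 30 stocks
# increasing, when each stock rises independently with probability p = 0.75.
import random

K, N, M, p = 1000, 50, 30, 0.75
best_counts = []
for _ in range(K):
    fund_counts = [sum(random.random() < p for _ in range(M)) for _ in range(N)]
    best_counts.append(max(fund_counts))          # best fund in this simulated world

print(sum(b >= 28 for b in best_counts) / K)      # roughly 0.4 (see next slide)
```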
80Surprise!
- The probability that the best of 50 funds reports that 28 of its 30 stocks increased in price is roughly 0.4
- Why? The probability that an arbitrary fund would report this increase is Pr(28 successes | Pr(success) = .75) ≈ .01, but the probability that the best of 50 funds would report this is much higher.
- Machine learning algorithms use critical values based on arbitrary elements when they are actually testing the best element, so they think elements are more unusual than they really are. This is why ML algorithms overfit.
81The Bootstrap
- Monte Carlo estimation of sampling distributions assumes you know the parameters of the population from which samples are drawn.
- What if you don't?
- Use the sample as an estimate of the population.
- Draw samples from the sample!
- With or without replacement?
- Example: the sampling distribution of the mean; check the results against the central limit theorem.
82. Bootstrapping the sampling distribution of the mean
- S is a sample of size N
- Loop K = 1000 times
- Draw a pseudosample S* of size N from S by sampling with replacement
- Calculate the mean of S* and push it on a list L
- L is the bootstrapped sampling distribution of the mean
- This procedure works for any statistic, not just the mean.
Recall we can get the sampling distribution of the mean via the central limit theorem; this example is just for illustration. This distribution is not a null hypothesis distribution and so is not directly used for hypothesis testing, but it can easily be transformed into a null hypothesis distribution (see Cohen, 1995).
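A minimal Python sketch of the bootstrap procedure above, using hypothetical data:

```python
# Bootstrap the sampling distribution of the mean by resampling the sample itself.
import random

sample = [12, 15, 9, 23, 18, 11, 14, 20, 16, 13]      # hypothetical data
K = 1000
boot_means = []
for _ in range(K):
    pseudo = [random.choice(sample) for _ in sample]  # resample with replacement
    boot_means.append(sum(pseudo) / len(pseudo))

boot_means.sort()
print(boot_means[25], boot_means[975])  # rough 95% interval for the mean
```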
83Randomization
- Used to test hypotheses that involve association between elements of two or more groups; very general.
- Example: Paul tosses H H H H, Carole tosses T T T T; is the outcome independent of the tosser?
- Example: 4 women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score independent of gender?
- Basic procedure: calculate a statistic f for your sample; randomize one factor relative to the other and calculate your pseudostatistic f*; compare f to the sampling distribution of f*.
84. Example of randomization
- Four women score 54 66 64 61; six men score 23 28 27 31 51 32. Is score independent of gender?
- f = difference of the means of the women's and men's scores = 61.25 - 32 = 29.25
- Under the null hypothesis of no association between gender and score, the score 54 might equally well have been achieved by a male or a female.
- Toss all scores in a hopper, draw out four at random without replacement, call them "female", call the rest "male", and calculate f*, the difference of the means of the "female" and "male" groups. Repeat to get a distribution of f*. This is an estimate of the sampling distribution of f under H0: no difference between male and female scores.
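A small Python sketch of this randomization test, using the scores above:

```python
# Randomization test: is the observed difference in means (29.25) surprising
# under the null hypothesis that scores are independent of gender?
import random

women = [54, 66, 64, 61]
men = [23, 28, 27, 31, 51, 32]
observed = sum(women) / len(women) - sum(men) / len(men)   # 29.25

scores = women + men
count = 0
K = 10_000
for _ in range(K):
    shuffled = random.sample(scores, len(scores))          # permute the labels
    pseudo_women, pseudo_men = shuffled[:4], shuffled[4:]
    f_star = sum(pseudo_women) / 4 - sum(pseudo_men) / 6
    count += f_star >= observed

print(count / K)   # estimated Pr(f* >= 29.25 under H0); small => reject H0
```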
85Empirical Methods for CS
86Tales from the coal face
- Those ignorant of history are doomed to repeat it
- we have committed many howlers
- We hope to help others avoid similar ones
- and illustrate how easy it is to screw up!
- "How Not to Do It"
- I. Gent, S. A. Grant, E. MacIntyre, P. Prosser, P. Shaw, B. M. Smith, and T. Walsh
- University of Leeds Research Report, May 1997
- Every howler we report was committed by at least one of the above authors!
87How Not to Do It
- Do measure with many instruments
- in exploring hard problems, we used our best algorithms
- and missed very poor performance of less good algorithms
- better algorithms will be bitten by the same effect on larger instances than we considered
- Do measure CPU time
- in exploratory code, CPU time is often misleading
- but it can also be very informative
- e.g. a heuristic needed more search but was faster
88How Not to Do It
- Do vary all relevant factors
- Don't change two things at once
- we ascribed the effects of the heuristic to the algorithm
- having changed heuristic and algorithm at the same time
- we didn't perform a factorial experiment
- but it's not always easy/possible to do the right experiments if there are many factors
89How Not to Do It
- Do Collect All Data Possible ... (within reason)
- one year Santa Claus had to repeat all our experiments
- ECAI/AAAI/IJCAI deadlines fall just after New Year!
- we had collected the number of branches in the search tree
- but performance scaled with backtracks, not branches
- so all experiments had to be rerun
- Don't Kill Your Machines
- we have got into trouble with sysadmins
- over experimental data we never used
- often the vital experiment is small and quick
90How Not to Do It
- Do It All Again (or at least be able to)
- e.g. store the random seeds used in experiments
- we didn't do that and might have lost an important result
- Do Be Paranoid
- identical implementations in C and Scheme gave different results
- Do Use The Same Problems
- reproducibility is a key to science (c.f. cold fusion)
- and it can reduce variance
91Choosing your test data
- We've seen the possible problem of over-fitting
- remember the machine learning benchmarks?
- Two common approaches
- benchmark libraries
- random problems
- Both have potential pitfalls
92Benchmark libraries
- +ve
- can be based on real problems
- lots of structure
- -ve
- library of fixed size
- possible to over-fit algorithms to the library
- problems have fixed size
- so can't measure scaling
93Random problems
- +ve
- problems can have any size
- so we can measure scaling
- can generate any number of problems
- hard to over-fit?
- -ve
- may not be representative of real problems
- lack structure
- easy to generate flawed problems
- e.g. CSP, QSAT, ...
94Flawed random problems
- Constraint satisfaction example
- 40 papers over 5 years by many authors used Models A, B, C, and D
- all four models are flawed (Achlioptas et al. 1997)
- asymptotically, almost all problems are trivial
- this brings into doubt many experimental results
- some experiments at typical sizes were affected
- fortunately not many
- How should we generate problems in future?
95Flawed random problems
- Gent et al. 1998 fix the flaw...
- introduce "flawless" problem generation
- defined in two equivalent ways
- though with no proof that the problems are truly flawless
- An undergraduate student at Strathclyde found a new bug
- the two definitions of "flawless" were not equivalent
- Eventually we settled on a final definition of "flawless"
- and gave a proof of asymptotic non-triviality
- so we think that we just about understand the problem generator now
96Prototyping your algorithm
- Often we need to implement an algorithm
- usually a novel algorithm, or a variant of an existing one
- e.g. a new heuristic in an existing search algorithm
- the novelty of the algorithm should imply extra care
- more often, it encourages lax implementation
- "it's only a preliminary version"
97How Not to Do It
- Don't Trust Yourself
- a bug in the innermost loop was found by chance
- all experiments re-run with an urgent deadline
- curiously, sometimes the bugged version was better!
- Do Preserve Your Code
- or end up fixing the same error twice
- Do use version control!
98How Not to Do It
- Do Make it Fast Enough
- emphasis on "enough"
- it's often not necessary to have optimal code
- in the lifecycle of an experiment, the extra coding time is not won back
- e.g. we have published many papers with inefficient code
- compared to the state of the art
- our first GSAT version was O(N^2), but this really was too slow!
- Do Report Important Implementation Details
- intermediate versions produced good results
99How Not to Do It
- Do Look at the Raw Data
- summaries obscure important aspects of behaviour
- many statistical measures are explicitly designed to minimise the effect of outliers
- sometimes the outliers are vital
- exceptionally hard problems dominate the mean
- we missed them until they hit us on the head
- when experiments crashed overnight
- old data on smaller problems had shown the behaviour clearly
100How Not to Do It
- Do face up to the consequences of your results
- e.g. preprocessing on 450 problems
- it should obviously reduce search
- it reduced search 448 times
- and increased search 2 times
- Forget the algorithm, it's useless?
- Or study the two exceptional cases in detail
- and achieve a new understanding of an important algorithm
101Empirical Methods for CS
102Our objectives
- Outline some of the basic issues
- exploration, experimental design, data analysis,
... - Encourage you to consider some of the pitfalls
- we have fallen into all of them!
- Raise standards
- encouraging debate
- identifying best practice
- Learn from your experiences
- experimenters get better as they get older!
103Summary
- Empirical CS and AI are exacting sciences
- There are many ways to do experiments wrong
- We are experts in doing experiments badly
- As you perform experiments, you'll make many mistakes
- Learn from those mistakes, and ours!
104Empirical Methods for CS
105Some expert advice
- Bernard Moret, U. New Mexico
- "Towards a Discipline of Experimental Algorithmics"
- David Johnson, AT&T Labs
- "A Theoretician's Guide to the Experimental Analysis of Algorithms"
- Both linked to from www.cs.york.ac.uk/tw/empirical.html
106. Bernard Moret's guidelines
- Useful types of empirical results
- accuracy/correctness of theoretical results
- real-world performance
- heuristic quality
- impact of data structures
- ...
107. Bernard Moret's guidelines
- Hallmarks of a good experimental paper
- clearly defined goals
- large scale tests
- both in number and size of instances
- mixture of problems
- real-world, random, standard benchmarks, ...
- statistical analysis of results
- reproducibility
- publicly available instances, code, data files,
...
108. Bernard Moret's guidelines
- Pitfalls for experimental papers
- simpler experiment would have given same result
- result predictable by (back of the envelope)
calculation - bad experimental setup
- e.g. insufficient sample size, no consideration
of scaling, - poor presentation of data
- e.g. lack of statistics, discarding of outliers,
...
109. Bernard Moret's guidelines
- Ideal experimental procedure
- define clear set of objectives
- which questions are you asking?
- design experiments to meet these objectives
- collect data
- do not change experiments until all data is
collected to prevent drift/bias - analyse data
- consider new experiments in light of these
results
110. David Johnson's guidelines
- 3 types of paper describe the implementation of an algorithm
- application paper
- "Here's a good algorithm for this problem"
- sales-pitch paper
- "Here's an interesting new algorithm"
- experimental paper
- "Here's how this algorithm behaves in practice"
- These lessons apply to all 3
111. David Johnson's guidelines
- Perform newsworthy experiments
- standards are higher than for theoretical papers!
- run experiments on real problems
- theoreticians can get away with idealized distributions, but experimentalists have no excuse!
- don't use algorithms that theory can already dismiss
- look for generality and relevance
- don't just report that algorithm A dominates algorithm B; identify why it does!
112. David Johnson's guidelines
- Place work in context
- compare against previous work in the literature
- ideally, obtain their code and test sets
- verify their results, and compare with your new algorithm
- less ideally, re-implement their code
- and report any differences in performance
- least ideally, simply report their old results
- and try to make some ball-park comparisons of machine speeds
113. David Johnson's guidelines
- Use efficient implementations
- somewhat controversial
- an efficient implementation supports claims of practicality
- tells us what is achievable in practice
- we can run more experiments on larger instances
- and can do our research quicker!
- don't have to go overboard on this
- exceptions can also be made
- e.g. when not studying CPU time, when comparing against a previously newsworthy algorithm, when programming time is more valuable than processing time, ...
114. David Johnson's guidelines
- Use testbeds that support general conclusions
- ideally one (or more) random problem class, plus real-world instances
- predict performance on real-world problems based on the random class, and evaluate the quality of the predictions
- use structured random generators
- with parameters to control structure as well as size
- don't just study real-world instances
- it is hard to justify generality unless you have a very broad class of real-world problems!
115. David Johnson's guidelines
- Provide explanations and back them up with experiments
- this adds to the credibility of experimental results
- improves our understanding of algorithms
- leading to better theory and algorithms
- and can weed out bugs in your implementation!
116. David Johnson's guidelines
- Ensure reproducibility
- easily achieved via the Web
- it adds support to a paper if others can (and do) reproduce the results
- requires you to use large samples and a wide range of problems
- otherwise results will not be reproducible!
117. David Johnson's guidelines
- Ensure comparability (and give the full picture)
- make it easy for those who come after to reproduce your results
- provide meaningful summaries
- give sample sizes, report standard deviations, plot graphs, but also report the data in tables in an appendix
- do not hide anomalous results
- report running times even if they are not the main focus
- readers may want to know before studying your results in detail
118. David Johnson's pitfalls
- Failing to report key implementation details
- Extrapolating from tiny samples
- Using irreproducible benchmarks
- Using running time as a stopping criterion
- Ignoring hidden costs (e.g. preprocessing)
- Misusing statistical tools
- Failing to use graphs
119. David Johnson's pitfalls
- Obscuring raw data by using hard-to-read charts
- Comparing apples and oranges
- Drawing conclusions not supported by the data
- Leaving obvious anomalies unnoted/unexplained
- Failing to back up explanations with further
experiments - Ignoring the literature
- the self-referential study!