Title: Empirical Methods for AI
1. Empirical Methods for AI and CS
- Paul Cohen (cohen@cs.umass.edu)
- Ian P. Gent (ipg@dcs.st-and.ac.uk)
- Toby Walsh (tw@cs.york.ac.uk)
2Overview
- Introduction
- What are empirical methods?
- Why use them?
- Case Study
- Eight Basic Lessons
- Experiment design
- Data analysis
- How not to do it
- Supplementary material
3Resources
- Web
- www.cs.york.ac.uk/tw/empirical.html
- www.cs.amherst.edu/dsj/methday.html
- Books
- Empirical Methods for AI, Paul Cohen, MIT Press, 1995
- Journals
- Journal of Experimental Algorithmics, www.jea.acm.org
- Conferences
- Workshop on Empirical Methods in AI (last Saturday, ECAI-02?)
- Workshop on Algorithm Engineering and Experiments, ALENEX 01 (alongside SODA)
4Empirical Methods for CS
5What does empirical mean?
- Relying on observations, data, experiments
- Empirical work should complement theoretical work
- Theories often have holes (e.g., how big is the constant term? Is the current problem a bad one?)
- Theories are suggested by observations
- Theories are tested by observations
- Conversely, theories direct our empirical attention
- In addition (in this tutorial at least), empirical means wanting to understand the behaviour of complex systems
6. Why We Need Empirical Methods: Cohen's 1990 Survey of 150 AAAI Papers
- Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem.
- Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse.
- Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis.
- Theory papers generally had no applications or empirical work to support them; empirical papers were demonstrations, not experiments, and had no underlying theoretical support.
- The essential synergy between theory and empirical work was missing.
7Theory, not Theorems
- Theory-based science need not be all theorems
- otherwise science would be mathematics
- Consider the theory of QED
- based on a model of the behaviour of particles
- predictions accurate to many decimal places (9?)
- most accurate theory in the whole of science?
- success derived from accuracy of predictions
- not the depth or difficulty or beauty of theorems
- QED is an empirical theory!
8Empirical CS/AI
- Computer programs are formal objects
- so let's reason about them entirely formally?
- Two reasons why we can't or won't:
- theorems are hard
- some questions are empirical in nature
- e.g. are Horn clauses adequate to represent the sort of knowledge met in practice?
- e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?
9Empirical CS/AI
- Treat computer programs as natural objects
- like fundamental particles, chemicals, living organisms
- Build (approximate) theories about them
- construct hypotheses
- e.g. greedy hill-climbing is important to GSAT
- test with empirical experiments
- e.g. compare GSAT with other types of hill-climbing
- refine hypotheses and modelling assumptions
- e.g. greediness not important, but hill-climbing is!
10Empirical CS/AI
- Many advantages over other sciences
- Cost
- no need for expensive super-colliders
- Control
- unlike the real world, we often have complete command of the experiment
- Reproducibility
- in theory, computers are entirely deterministic
- Ethics
- no ethics panels needed before you run experiments
11Types of hypothesis
- "My search program is better than yours"
- not very helpful: a beauty competition?
- "Search cost grows exponentially with the number of variables for this kind of problem"
- better, as we can extrapolate to data not yet seen?
- "Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems"
- even better, as we can extrapolate to new situations?
12A typical conference conversation
- "What are you up to these days?"
- "I'm running an experiment to compare the Davis-Putnam algorithm with GSAT."
- "Why?"
- "I want to know which is faster."
- "Why?"
- "Lots of people use each of these algorithms."
- "How will these people use your result?" ...
13Keep in mind the BIG picture
- "What are you up to these days?"
- "I'm running an experiment to compare the Davis-Putnam algorithm with GSAT."
- "Why?"
- "I have this hypothesis that neither will dominate."
- "What use is this?"
- "A portfolio containing both algorithms will be more robust than either algorithm on its own."
14Keep in mind the BIG picture
- ...
- "Why are you doing this?"
- "Because many real problems are intractable in theory but need to be solved in practice."
- "How does your experiment help?"
- "It helps us understand the difference between average-case and worst-case results."
- "So why is this interesting?"
- "Intractability is one of the BIG open questions in CS!"
15Why is empirical CS/AI in vogue?
- Inadequacies of theoretical analysis
- problems often aren't as hard in practice as theory predicts in the worst case
- average-case analysis is very hard (and often based on questionable assumptions)
- Some spectacular successes
- phase transition behaviour
- local search methods
- theory lagging behind algorithm design
16Why is empirical CS/AI in vogue?
- Compute power ever increasing
- even intractable problems coming into range
- easy to perform large (and sometimes meaningful) experiments
- Empirical CS/AI perceived to be easier than theoretical CS/AI
- often a false perception, as experiments are easier to mess up than proofs
17Empirical Methods for CS
- Part II: A Case Study
- Eight Basic Lessons
18Rosenberg study
- "An Empirical Study of Dynamic Scheduling on Rings of Processors"
- Gregory, Gao, Rosenberg and Cohen
- Proc. of 8th IEEE Symp. on Parallel and Distributed Processing, 1996
- Linked to from www.cs.york.ac.uk/tw/empirical.html
19Problem domain
- Scheduling processors on a ring network
- jobs spawned as binary trees
- KOSO
- keep one, send one to my left or right neighbour arbitrarily
- KOSO*
- keep one, send one to my least heavily loaded neighbour
20Theory
- On complete binary trees, KOSO is asymptotically optimal
- So KOSO* can't be any better?
- But the assumptions are unrealistic
- tree not complete
- asymptotic behaviour is not necessarily the same as behaviour in practice!
- Theorem: using KOSO on a ring of p processors, a binary tree of height n is executed within time (2^n - 1)/p plus low-order terms
21Benefits of an empirical study
- More realistic trees
- probabilistic generator that makes shallow trees, bushy near the root but quickly getting scrawny
- similar to trees generated when performing Trapezoid or Simpson's Rule calculations
- binary trees correspond to interval bisection
- Startup costs
- network must be loaded
22. Lesson 1: Evaluation begins with claims. Lesson 2: Demonstration is good, understanding is better
- Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better
- The "because" phrase indicates a hypothesis about why it works. This is a better hypothesis than the beauty-contest demonstration that KOSO* beats KOSO
- Experiment design
- Independent variables: KOSO vs KOSO*, no. of processors, no. of jobs, probability(job will spawn), ...
- Dependent variable: time to complete jobs
23. Criticism 1: This experiment design includes no direct measure of the hypothesized effect
- Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better
- But the experiment design includes no direct measure of load balancing
- Independent variables: KOSO vs KOSO*, no. of processors, no. of jobs, probability(job will spawn), ...
- Dependent variable: time to complete jobs
24. Lesson 3: Exploratory data analysis means looking beneath immediate results for explanations
- T-test on time to complete jobs: t = (2825 - 2935)/587 = -0.19
- KOSO* apparently no faster than KOSO (as theory predicted)
- Why? Look more closely at the data
- Outliers create excessive variance, so the test isn't significant
[Histograms of run time for KOSO and KOSO*]
25. Lesson 4: The task of empirical work is to explain variability
Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise.
Number of processors and number of jobs explain 74% of the variance in run time. The choice of algorithm explains almost none.
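To make Lesson 4 concrete, here is a small sketch (synthetic data, not the Rosenberg measurements) of how the proportion of variance explained by a set of factors can be computed as the R² of a least-squares fit:

```python
# Proportion of variance in run time explained by number of processors and jobs:
# R^2 of a least-squares fit. The data here are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
procs = rng.integers(3, 21, size=200)
jobs = rng.integers(100, 1000, size=200)
runtime = 5.0 * jobs / procs + rng.normal(0, 20, size=200)   # noisy "measurements"

X = np.column_stack([procs, jobs, np.ones_like(procs)])      # linear model with intercept
coef, *_ = np.linalg.lstsq(X, runtime, rcond=None)
residuals = runtime - X @ coef
r_squared = 1 - residuals.var() / runtime.var()
print(r_squared)   # fraction of run-time variance explained by the two factors
```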
26. Lesson 3 (again): Exploratory data analysis means looking beneath immediate results for explanations
- Why does the KOSO/KOSO* choice account for so little of the variance in run time?
- Unless processors starve, there will be no effect of load balancing. In most conditions in this experiment, processors never starved. (This is why we run pilot experiments!)
27. Lesson 5: Of sample variance, effect size, and sample size, control the first before touching the last
- A test statistic is, roughly, an effect size divided by a standard error that grows with sample variance and shrinks with sample size; this intimate relationship holds for all statistics
28. Lesson 5 illustrated: a variance reduction method
- Let N = number of jobs, P = number of processors, T = run time
- Then T = k(N/P), i.e. the run time is k multiples of the theoretical best time, and so k = T/(N/P) (see the code sketch below)
[Histograms of k for KOSO and KOSO*]
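In code the recoding is a one-liner (an illustrative sketch; the numbers are made up):

```python
# Recode raw run time T as k, a multiple of the theoretical best time (N/P).
def recode_k(T, N, P):
    """k = T / (N / P): 1.0 means the run achieved the theoretical best time."""
    return T / (N / P)

print(recode_k(T=600.0, N=1000, P=10))  # k = 6.0 for these illustrative numbers
```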
29. Where are we?
- KOSO* is significantly better than KOSO when the dependent variable is recoded as a percentage of optimal run time
- The difference between KOSO and KOSO* explains very little of the variance in either dependent variable
- Exploratory data analysis tells us that processors aren't starving, so we shouldn't be surprised
- Prediction: the effect of algorithm on run time (or k) increases as the number of jobs increases or the number of processors increases
- This prediction is about interactions between factors
30. Lesson 6: Most interesting science is about interaction effects, not simple main effects
- Data confirm the prediction
- KOSO* is superior on larger rings, where starvation is an issue
- Interaction of independent variables
- choice of algorithm
- number of processors
- Interaction effects are essential to explaining how things work
[Plot: multiples of optimal run time (k) against number of processors (3, 6, 10, 20) for KOSO and KOSO*]
31. Lesson 7: Significant and meaningful are not synonymous. Is a result meaningful?
- KOSO* is significantly better than KOSO, but can you use the result?
- Suppose you wanted to use the knowledge of whether the ring is controlled by KOSO or KOSO* for some prediction.
- Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
- Pr(trial i under KOSO has k > 1.11) = 0.57
- Pr(trial i under KOSO* has k > 1.11) = 0.43
- Predict for trial i whether its k is above or below the median
- If it's a KOSO* trial you'll say "no", with (.43)(150) = 64.5 errors
- If it's a KOSO trial you'll say "yes", with (1 - .57)(160) = 68.8 errors
- If you don't know the algorithm you'll make (.5)(310) = 155 errors
- 155 - (64.5 + 68.8) ≈ 22
- Knowing the algorithm reduces the error rate from .5 to .43. Is this enough???
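The error arithmetic above can be checked in a few lines of Python (the probabilities and trial counts are those on the slide):

```python
# Expected prediction errors with and without knowing the algorithm.
koso_star_trials, koso_trials = 150, 160
p_above_given_koso_star = 0.43      # Pr(k > median | KOSO*)
p_above_given_koso = 0.57           # Pr(k > median | KOSO)

errors_known = (p_above_given_koso_star * koso_star_trials        # say "below" for KOSO*
                + (1 - p_above_given_koso) * koso_trials)         # say "above" for KOSO
errors_unknown = 0.5 * (koso_star_trials + koso_trials)           # guess either way

print(errors_known, errors_unknown, errors_unknown - errors_known)  # 133.3, 155.0, ~22
```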
32. Lesson 8: Keep the big picture in mind
- Why are you studying this?
- Load balancing is important to get good performance out of parallel computers
- Why is this important?
- Parallel computing promises to tackle many of our computational bottlenecks
- How do we know this? It's in the first paragraph of the paper!
33Case study conclusions
- Evaluation begins with claims
- Demonstrations of simple main effects are good, understanding the effects is better
- Exploratory data analysis means using your eyes to find explanatory patterns in data
- The task of empirical work is to explain variability
- Control variability before increasing sample size
- Interaction effects are essential to explanations
- Significant ≠ meaningful
- Keep the big picture in mind
34Empirical Methods for CS
- Part III: Experiment design
35Experimental Life Cycle
- Exploration
- Hypothesis construction
- Experiment
- Data analysis
- Drawing of conclusions
36Checklist for experiment design
- Consider the experimental procedure
- making it explicit helps to identify spurious effects and sampling biases
- Consider a sample data table
- identifies what results need to be collected
- clarifies dependent and independent variables
- shows whether data pertain to hypothesis
- Consider an example of the data analysis
- helps you to avoid collecting too little or too much data
- especially important when looking for interactions
(From Chapter 3, Empirical Methods for Artificial Intelligence, Paul Cohen, MIT Press)
37Guidelines for experiment design
- Consider possible results and their interpretation
- may show that the experiment cannot support/refute the hypotheses under test
- unforeseen outcomes may suggest new hypotheses
- What was the question again?
- easy to get carried away designing an experiment and lose the BIG picture
- Run a pilot experiment to calibrate parameters (e.g., number of processors in the Rosenberg experiment)
38Types of experiment
- Manipulation experiment
- Observation experiment
- Factorial experiment
39Manipulation experiment
- Independent variable, x
- x = identity of parser, size of dictionary, ...
- Dependent variable, y
- y = accuracy, speed, ...
- Hypothesis
- x influences y
- Manipulation experiment
- change x, record y
40Observation experiment
- Predictor, x
- x = volatility of stock prices, ...
- Response variable, y
- y = fund performance, ...
- Hypothesis
- x influences y
- Observation experiment
- classify according to x, compute y
41Factorial experiment
- Several independent variables, xi
- there may be no simple causal links
- data may come that way
- e.g. individuals will have different sexes, ages, ...
- Factorial experiment
- every possible combination of the xi considered
- expensive, as its name suggests!
42Designing factorial experiments
- In general, stick to 2 to 3 independent variables
- Solve the same set of problems in each case
- reduces variance due to differences between problem sets
- If this is not possible, use the same sample sizes
- simplifies statistical analysis
- As usual, the default hypothesis is that no influence exists
- much easier to fail to demonstrate an influence than to demonstrate an influence
43Some problem issues
- Control
- Ceiling and Floor effects
- Sampling Biases
44Control
- A control is an experiment in which the hypothesised variation does not occur
- so the hypothesised effect should not occur either
- BUT remember
- placebos cure a large percentage of patients!
45. Control: a cautionary tale
- Macaque monkeys given a vaccine based on human T-cells infected with SIV (a relative of HIV)
- macaques gained immunity from SIV
- Later, macaques were given uninfected human T-cells
- and the macaques still gained immunity!
- Control experiment not originally done
- and not always obvious (you can't control for all variables)
46. Control: MYCIN case study
- MYCIN was a medical expert system
- recommended therapy for blood/meningitis infections
- How to evaluate its recommendations?
- Shortliffe used
- 10 sample problems, 8 therapy recommenders
- 5 faculty, 1 resident, 1 postdoc, 1 student
- 8 impartial judges gave 1 point per problem
- max score was 80
- MYCIN 65, faculty 40-60, postdoc 60, resident 45, student 30
47. Control: MYCIN case study
- What were the controls?
- Control for the judges' bias for/against computers
- judges did not know who recommended each therapy
- Control for easy problems
- the medical student did badly, so the problems were not easy
- Control for our standard being low
- e.g. random choice should do worse
- Control for the factor of interest
- e.g. the hypothesis in MYCIN that "knowledge is power"
- have groups with different levels of knowledge
48Ceiling and Floor Effects
- Well-designed experiments (with good controls) can still go wrong
- What if all our algorithms do particularly well?
- Or they all do badly?
- We've got little evidence to choose between them
49Ceiling and Floor Effects
- Ceiling effects arise when test problems are insufficiently challenging
- floor effects are the opposite, when problems are too challenging
- A problem in AI because we often repeatedly use the same benchmark sets
- most benchmarks will lose their challenge eventually?
- but how do we detect this effect?
50Machine learning example
- 14 datasets from the UCI corpus of benchmarks
- used as a mainstay of the ML community
- Problem is learning classification rules
- each item is a vector of features and a classification
- measure classification accuracy of method (max 100%)
- Compare C4 with 1R, two competing algorithms
- Rob Holte, Machine Learning, vol. 11, pp. 63-91, 1993
- www.site.uottawa.edu/holte/Publications/simple_rules.ps
51. Floor effects: machine learning example
- DataSet: BC CH GL G2 HD HE ... Mean
- C4: 72 99.2 63.2 74.3 73.6 81.2 ... 85.9
- 1R: 72.5 69.2 56.4 77 78 85.1 ... 83.8
- Is 1R above the floor of performance? How would we tell?
52. Floor effects: machine learning example
- DataSet: BC CH GL G2 HD HE ... Mean
- C4: 72 99.2 63.2 74.3 73.6 81.2 ... 85.9
- 1R: 72.5 69.2 56.4 77 78 85.1 ... 83.8
- Baseline: 70.3 52.2 35.5 53.4 54.5 79.4 ... 59.9
- The baseline rule puts all items in the more popular category. 1R is above the baseline on most datasets. A bit like the prime number joke? 1 is prime, 3 is prime, 5 is prime, so the "baseline rule" is that all odd numbers are prime.
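To make the floor concrete, here is a small Python sketch (with hypothetical labels) of the majority-class baseline used above:

```python
# Illustrative sketch (not from the Holte paper): a majority-class baseline,
# the kind of floor the comparison above uses.
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels for a two-class dataset:
labels = ["benign"] * 70 + ["malignant"] * 30
print(majority_baseline_accuracy(labels))  # 0.7 -- any learner should beat this
```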
53. Ceiling effects: machine learning
- DataSet: BC GL HY LY MU ... Mean
- C4: 72 63.2 99.1 77.5 100.0 ... 85.9
- 1R: 72.5 56.4 97.2 70.7 98.4 ... 83.8
- How do we know that C4 and 1R are not near the ceiling of performance?
- Do the datasets have enough attributes to make perfect classification possible?
- Obviously for MU, but what about the rest?
54. Ceiling effects: machine learning
- DataSet: BC GL HY LY MU ... Mean
- C4: 72 63.2 99.1 77.5 100.0 ... 85.9
- 1R: 72.5 56.4 97.2 70.7 98.4 ... 83.8
- max(C4, 1R): 72.5 63.2 99.1 77.5 100.0 87.4
- max(Buntine): 72.8 60.4 99.1 66.0 98.6 82.0
- C4 achieves only about 2% better accuracy than 1R
- The best of C4/1R achieves 87.4% accuracy
- We have only weak evidence that C4 is better
- Both methods appear to be performing near the ceiling of what is possible, so comparison is hard!
55. Ceiling effects: machine learning
- In fact, 1R uses only one feature (the best one)
- C4 uses on average 6.6 features
- the extra 5.6 features buy only about 2% improvement
- Conclusion?
- Either real-world learning problems are easy (use 1R)
- Or we need more challenging datasets
- We need to be aware of ceiling effects in results
56Sampling bias
- Data collection is biased against certain data
- e.g. a teacher who says "Girls don't answer maths questions"
- observation might suggest
- girls don't answer many questions
- but only because the teacher doesn't ask them many questions
- Experienced AI researchers don't do that, right?
57. Sampling bias: Phoenix case study
- AI system to fight (simulated) forest fires
- Experiments suggest that wind speed is uncorrelated with time to put out the fire
- obviously incorrect, as high winds spread forest fires
58. Sampling bias: Phoenix case study
- Wind speed vs containment time (max 150 hours):
- wind speed 3: 120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103
- wind speed 6: 78, 61, 58, 81, 71, 57, 21, 32, 70
- wind speed 9: 62, 48, 21, 55, 101
- What's the problem?
59. Sampling bias: Phoenix case study
- The cut-off of 150 hours introduces sampling bias
- many high-wind fires get cut off, not many low-wind ones
- On the remaining data there is no positive correlation between wind speed and containment time (r = -0.53)
- In fact, the data show that
- a lot of high-wind fires take > 150 hours to contain
- those that don't are similar to low-wind fires
- You wouldn't do this, right?
- you might if you had automated data analysis
60Sampling biases can be subtle...
- Assume gender (G) is an independent variable and number of siblings (S) is a noise variable.
- If S is truly a noise variable, then under random sampling no dependency should exist between G and S in samples.
- Parents have children until they get at least one boy; they don't feel the same way about girls. In a sample of 1000 girls the number with S = 0 is smaller than in a sample of 1000 boys (see the simulation sketch below).
- The frequency distribution of S is different for the different genders: S and G are not independent.
- Girls do better at math than boys in random samples at all levels of education.
- Is this because of their genes, or because they have more siblings?
- What else might be systematically associated with G that we don't know about?
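A minimal Python simulation of this stopping rule (an illustration, assuming a 50/50 birth ratio and that every family follows the rule):

```python
# Illustrative simulation: families have children until the first boy, then stop.
# A "noise" variable -- number of siblings -- becomes associated with gender.
import random

def simulate_family():
    """Return the list of children ('G'/'B') for one family under the stopping rule."""
    children = []
    while True:
        child = random.choice("GB")
        children.append(child)
        if child == "B":
            return children

boys_siblings, girls_siblings = [], []
for _ in range(100_000):
    kids = simulate_family()
    for kid in kids:
        siblings = len(kids) - 1
        (boys_siblings if kid == "B" else girls_siblings).append(siblings)

print("mean siblings, boys :", sum(boys_siblings) / len(boys_siblings))
print("mean siblings, girls:", sum(girls_siblings) / len(girls_siblings))
# Girls end up with more siblings on average, so S and G are not independent.
```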
61Empirical Methods for CS
62Kinds of data analysis
- Exploratory (EDA): looking for patterns in data
- Statistical inferences from sample data
- Testing hypotheses
- Estimating parameters
- Building mathematical models of datasets
- Machine learning, data mining
- We will introduce hypothesis testing and
computer-intensive methods
63. The logic of hypothesis testing
- Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long-run behavior?) and what is your residual uncertainty?
- You say, "If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn't fair."
- Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (8 heads) has low probability p, reject the assertion, with residual uncertainty related to p.
- Estimate p with a sampling distribution.
64. Probability of a sample result under a null hypothesis
- If the coin were fair (p = .5, the null hypothesis), what is the probability distribution of r, the number of heads obtained in N tosses of a fair coin? Get it analytically or estimate it by simulation (on a computer):
- Loop K times
- r = 0 (r is the number of heads in N tosses)
- Loop N times (simulate the tosses)
- Generate a random 0 <= x < 1.0
- If x < p then increment r (p is the probability of a head)
- Push r onto sampling_distribution
- Print sampling_distribution
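The same simulation in runnable Python (a sketch; the constants N = 10 tosses and K = 10,000 samples are illustrative):

```python
# Monte Carlo estimate of the sampling distribution of r (number of heads)
# under the null hypothesis p = 0.5, and of Pr(r >= 8) for N = 10 tosses.
import random

def sampling_distribution(K=10_000, N=10, p=0.5):
    dist = []
    for _ in range(K):
        r = sum(1 for _ in range(N) if random.random() < p)  # one simulated sample
        dist.append(r)
    return dist

dist = sampling_distribution()
p_value = sum(1 for r in dist if r >= 8) / len(dist)
print(p_value)  # roughly 0.055; the exact binomial tail probability is ~0.0547
```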
65Sampling distributions
This is the estimated sampling distribution of r under the null hypothesis that p = .5. The estimate is constructed by Monte Carlo sampling.
66The logic of hypothesis testing
- Establish a null hypothesis: H0: p = .5, the coin is fair
- Establish a statistic: r, the number of heads in N tosses
- Figure out the sampling distribution of r given H0
- The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8
- If this probability is very low, reject H0, the null hypothesis
- Residual uncertainty is p
67. The only tricky part is getting the sampling distribution
- Sampling distributions can be derived...
- Exactly: e.g., binomial probabilities for coins are given by the formula Pr(r heads in N tosses) = C(N, r) p^r (1 - p)^(N - r)
- Analytically: e.g., the central limit theorem tells us that the sampling distribution of the mean approaches a Normal distribution as samples grow to infinity
- Estimated by Monte Carlo simulation of the null hypothesis process
68. A common statistical test: the Z test for different means
- A sample of N = 25 computer science students has mean IQ m = 135. Are they smarter than average?
- The population mean is 100, with standard deviation 15
- The null hypothesis, H0, is that the CS students are "average", i.e., the mean IQ of the population of CS students is 100
- What is the probability p of drawing the sample if H0 were true? If p is small, then H0 is probably false
- Find the sampling distribution of the mean of a sample of size 25, drawn from a population with mean 100
69. Central Limit Theorem
The sampling distribution of the mean is given by the Central Limit Theorem: the sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity. If the samples are drawn from a population with mean μ and standard deviation σ, then the mean of the sampling distribution is μ and its standard deviation is σ/√N, which shrinks as N increases. These statements hold irrespective of the shape of the original distribution.
70. The sampling distribution for the CS student example
- If a sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis), then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation 15/√25 = 3. The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution. Less than 5% of a normal distribution falls more than two standard deviations away from the mean, so the probability that the students are "average" is roughly zero.
71. The Z test
- The test statistic recodes the sample statistic (the sample mean, 135) in units of the standard deviation of the sampling distribution (mean 100, std 3):
- Z = (sample mean - mean of sampling distribution) / std of sampling distribution = (135 - 100) / 3 = 11.67
- On the recoded scale the sampling distribution has mean 0 and std 1.0, and the test statistic is 11.67.
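A short Python check of this calculation (a sketch assuming scipy is available):

```python
# Z test for the CS-student example: H0 says the population mean is 100 (sd 15).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, sample_mean = 100, 15, 25, 135
z = (sample_mean - mu0) / (sigma / sqrt(n))   # (135 - 100) / 3 = 11.67
p_value = norm.sf(z)                          # one-tailed Pr(Z >= z) under H0
print(z, p_value)                             # 11.67, effectively zero
```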
72. Reject the null hypothesis?
- Commonly we reject H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say < .05.
- A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of the sample statistic given H0.
- We find the probabilities by looking them up in tables, or statistics packages provide them.
- For example, Pr(Z >= 1.65) ≈ .05 and Pr(Z >= 2.33) ≈ .01.
- Pr(Z >= 11) is approximately zero, so reject H0.
73The t test
- Same logic as the Z test, but appropriate when the population standard deviation is unknown, samples are small, etc.
- The sampling distribution is t, not normal, but approaches normal as sample size increases
- The test statistic has a very similar form, but probabilities of the test statistic are obtained by consulting tables of the t distribution, not the normal
74. The t test
- Suppose N = 5 students have mean IQ 135 and sample standard deviation 27
- Estimate the standard deviation of the sampling distribution using the sample standard deviation: 27/√5 ≈ 12.1
- Test statistic: t = (sample mean - mean of sampling distribution) / estimated std = (135 - 100) / 12.1 = 2.89
- On the recoded scale the sampling distribution has mean 0 and std 1.0, and the test statistic is 2.89.
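And the corresponding check for the t test (again a sketch assuming scipy; the p value matches the .022 quoted on the next slide):

```python
# One-sample t test from summary statistics: N = 5, mean 135, sd 27, H0 mean 100.
from math import sqrt
from scipy.stats import t as t_dist

n, sample_mean, sample_sd, mu0 = 5, 135.0, 27.0, 100.0
t_value = (sample_mean - mu0) / (sample_sd / sqrt(n))  # about 2.89
p_value = t_dist.sf(t_value, df=n - 1)                 # one-tailed, about 0.022
print(t_value, p_value)
```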
75Summary of hypothesis testing
- H0 negates what you want to demonstrate; find the probability p of the sample statistic under H0 by comparing the test statistic to the sampling distribution; if the probability is low, reject H0 with residual uncertainty proportional to p.
- Example: we want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p = .022
- Have we proved CS students are smarter? NO!
- We have only shown that a mean of 135 is unlikely if they aren't. We never prove what we want to demonstrate; we only reject H0, with residual uncertainty.
- And failing to reject H0 does not prove H0, either!
76Common tests
- Tests that means are equal
- Tests that samples are uncorrelated or
independent - Tests that slopes of lines are equal
- Tests that predictors in rules have predictive
power - Tests that frequency distributions (how often
events happen) are equal - Tests that classification variables such as
smoking history and heart disease history are
unrelated - ...
- All follow the same basic logic
77Computer-intensive Methods
- Basic idea: construct sampling distributions by simulating on a computer the process of drawing samples.
- Three main methods
- Monte Carlo simulation: when one knows the population parameters
- Bootstrap: when one doesn't
- Randomization: also assumes nothing about the population
- Enormous advantage: works for any statistic and makes no strong parametric assumptions (e.g., normality)
78. Another Monte Carlo example, relevant to machine learning...
- Suppose you want to buy stocks in a mutual fund; for simplicity assume there are just N = 50 funds to choose from and you'll base your decision on the proportion of the J = 30 stocks in each fund that increased in value
- Suppose Pr(a stock increasing in price) = .75
- You are tempted by the best of the funds, F, which reports price increases in 28 of its 30 stocks.
- What is the probability of this performance?
79. Simulate...
- Loop K = 1000 times
- B = 0 (number of stocks that increase in the best of the N funds)
- Loop N = 50 times (N is the number of funds)
- H = 0 (stocks that increase in this fund)
- Loop M = 30 times (M is the number of stocks in this fund)
- Toss a coin with bias p to decide whether this stock increases in value, and if so increment H
- Push H on a list (we get N values of H)
- B = maximum(H) (the number of increasing stocks in the best fund)
- Push B on a list (we get K values of B)
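A runnable Python version of this simulation (a sketch; the constants follow the slide):

```python
# Probability that the best of N = 50 funds reports >= 28 of its M = 30 stocks
# increasing, when each stock rises independently with probability p = 0.75.
import random

K, N, M, p = 1000, 50, 30, 0.75
best_counts = []
for _ in range(K):
    fund_counts = [sum(random.random() < p for _ in range(M)) for _ in range(N)]
    best_counts.append(max(fund_counts))          # best fund in this simulated world

print(sum(b >= 28 for b in best_counts) / K)      # roughly 0.4 (see next slide)
```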
80Surprise!
- The probability that the best of 50 funds reports that 28 of its 30 stocks increased in price is roughly 0.4
- Why? The probability that an arbitrary fund would report this increase is Pr(28 successes | Pr(success) = .75) ≈ .01, but the probability that the best of 50 funds would report this is much higher.
- Machine learning algorithms use critical values based on arbitrary elements when they are actually testing the best element, so they think elements are more unusual than they really are. This is why ML algorithms overfit.
81The Bootstrap
- Monte Carlo estimation of sampling distributions assumes you know the parameters of the population from which samples are drawn.
- What if you don't?
- Use the sample as an estimate of the population.
- Draw samples from the sample!
- With or without replacement?
- Example: the sampling distribution of the mean; check the results against the central limit theorem.
82. Bootstrapping the sampling distribution of the mean
- S is a sample of size N
- Loop K = 1000 times
- Draw a pseudosample S* of size N from S by sampling with replacement
- Calculate the mean of S* and push it on a list L
- L is the bootstrapped sampling distribution of the mean
- This procedure works for any statistic, not just the mean.
Recall we can get the sampling distribution of the mean via the central limit theorem; this example is just for illustration. This distribution is not a null hypothesis distribution and so is not directly used for hypothesis testing, but it can easily be transformed into a null hypothesis distribution (see Cohen, 1995).
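A minimal Python sketch of the bootstrap procedure above, using hypothetical data:

```python
# Bootstrap the sampling distribution of the mean by resampling the sample itself.
import random

sample = [12, 15, 9, 23, 18, 11, 14, 20, 16, 13]      # hypothetical data
K = 1000
boot_means = []
for _ in range(K):
    pseudo = [random.choice(sample) for _ in sample]  # resample with replacement
    boot_means.append(sum(pseudo) / len(pseudo))

boot_means.sort()
print(boot_means[25], boot_means[975])  # rough 95% interval for the mean
```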
83Randomization
- Used to test hypotheses that involve association between elements of two or more groups; very general.
- Example: Paul tosses H H H H, Carole tosses T T T T; is the outcome independent of the tosser?
- Example: 4 women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score independent of gender?
- Basic procedure: calculate a statistic f for your sample; randomize one factor relative to the other and calculate your pseudostatistic f*; compare f to the sampling distribution of f*.
84. Example of randomization
- Four women score 54 66 64 61; six men score 23 28 27 31 51 32. Is score independent of gender?
- f = difference of the means of the women's and men's scores = 61.25 - 32 = 29.25
- Under the null hypothesis of no association between gender and score, the score 54 might equally well have been achieved by a male or a female.
- Toss all scores in a hopper, draw out four at random without replacement, call them "female", call the rest "male", and calculate f*, the difference of the means of the "female" and "male" groups. Repeat to get a distribution of f*. This is an estimate of the sampling distribution of f under H0: no difference between male and female scores.
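A small Python sketch of this randomization test, using the scores above:

```python
# Randomization test: is the observed difference in means (29.25) surprising
# under the null hypothesis that scores are independent of gender?
import random

women = [54, 66, 64, 61]
men = [23, 28, 27, 31, 51, 32]
observed = sum(women) / len(women) - sum(men) / len(men)   # 29.25

scores = women + men
count = 0
K = 10_000
for _ in range(K):
    shuffled = random.sample(scores, len(scores))          # permute the labels
    pseudo_women, pseudo_men = shuffled[:4], shuffled[4:]
    f_star = sum(pseudo_women) / 4 - sum(pseudo_men) / 6
    count += f_star >= observed

print(count / K)   # estimated Pr(f* >= 29.25 under H0); small => reject H0
```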
85Empirical Methods for CS
86Tales from the coal face
- Those ignorant of history are doomed to repeat it
- we have committed many howlers
- We hope to help others avoid similar ones
- and illustrate how easy it is to screw up!
- "How Not to Do It"
- I. Gent, S. A. Grant, E. MacIntyre, P. Prosser, P. Shaw, B. M. Smith, and T. Walsh
- University of Leeds Research Report, May 1997
- Every howler we report was committed by at least one of the above authors!
87How Not to Do It
- Do measure with many instruments
- in exploring hard problems, we used our best algorithms
- and missed very poor performance of less good algorithms
- better algorithms will be bitten by the same effect on larger instances than we considered
- Do measure CPU time
- in exploratory code, CPU time is often misleading
- but it can also be very informative
- e.g. a heuristic needed more search but was faster
88How Not to Do It
- Do vary all relevant factors
- Don't change two things at once
- we ascribed the effects of the heuristic to the algorithm
- having changed heuristic and algorithm at the same time
- we didn't perform a factorial experiment
- but it's not always easy/possible to do the right experiments if there are many factors
89How Not to Do It
- Do Collect All Data Possible ... (within reason)
- one year Santa Claus had to repeat all our experiments
- ECAI/AAAI/IJCAI deadlines fall just after New Year!
- we had collected the number of branches in the search tree
- but performance scaled with backtracks, not branches
- so all experiments had to be rerun
- Don't Kill Your Machines
- we have got into trouble with sysadmins
- over experimental data we never used
- often the vital experiment is small and quick
90How Not to Do It
- Do It All Again (or at least be able to)
- e.g. store the random seeds used in experiments
- we didn't do that and might have lost an important result
- Do Be Paranoid
- identical implementations in C and Scheme gave different results
- Do Use The Same Problems
- reproducibility is a key to science (c.f. cold fusion)
- and it can reduce variance
91Choosing your test data
- We've seen the possible problem of over-fitting
- remember the machine learning benchmarks?
- Two common approaches
- benchmark libraries
- random problems
- Both have potential pitfalls
92Benchmark libraries
- +ve
- can be based on real problems
- lots of structure
- -ve
- library of fixed size
- possible to over-fit algorithms to the library
- problems have fixed size
- so can't measure scaling
93Random problems
- +ve
- problems can have any size
- so we can measure scaling
- can generate any number of problems
- hard to over-fit?
- -ve
- may not be representative of real problems
- lack structure
- easy to generate flawed problems
- e.g. CSP, QSAT, ...
94Flawed random problems
- Constraint satisfaction example
- 40 papers over 5 years by many authors used Models A, B, C, and D
- all four models are flawed (Achlioptas et al. 1997)
- asymptotically, almost all problems are trivial
- this brings into doubt many experimental results
- some experiments at typical sizes were affected
- fortunately not many
- How should we generate problems in future?
95Flawed random problems
- Gent et al. 1998 fix the flaw...
- introduce "flawless" problem generation
- defined in two equivalent ways
- though with no proof that the problems are truly flawless
- An undergraduate student at Strathclyde found a new bug
- the two definitions of "flawless" were not equivalent
- Eventually we settled on a final definition of "flawless"
- and gave a proof of asymptotic non-triviality
- so we think that we just about understand the problem generator now
96Prototyping your algorithm
- Often we need to implement an algorithm
- usually a novel algorithm, or a variant of an existing one
- e.g. a new heuristic in an existing search algorithm
- the novelty of the algorithm should imply extra care
- more often, it encourages lax implementation
- "it's only a preliminary version"
97How Not to Do It
- Don't Trust Yourself
- a bug in the innermost loop was found by chance
- all experiments re-run with an urgent deadline
- curiously, sometimes the bugged version was better!
- Do Preserve Your Code
- or end up fixing the same error twice
- Do use version control!
98How Not to Do It
- Do Make it Fast Enough
- emphasis on "enough"
- it's often not necessary to have optimal code
- in the lifecycle of an experiment, the extra coding time is not won back
- e.g. we have published many papers with inefficient code
- compared to the state of the art
- our first GSAT version was O(N^2), but this really was too slow!
- Do Report Important Implementation Details
- intermediate versions produced good results
99How Not to Do It
- Do Look at the Raw Data
- summaries obscure important aspects of behaviour
- many statistical measures are explicitly designed to minimise the effect of outliers
- sometimes the outliers are vital
- exceptionally hard problems dominate the mean
- we missed them until they hit us on the head
- when experiments crashed overnight
- old data on smaller problems had shown the behaviour clearly
100How Not to Do It
- Do face up to the consequences of your results
- e.g. preprocessing on 450 problems
- it should obviously reduce search
- it reduced search 448 times
- and increased search 2 times
- Forget the algorithm, it's useless?
- Or study the two exceptional cases in detail
- and achieve a new understanding of an important algorithm
101Empirical Methods for CS
102Our objectives
- Outline some of the basic issues
- exploration, experimental design, data analysis,
... - Encourage you to consider some of the pitfalls
- we have fallen into all of them!
- Raise standards
- encouraging debate
- identifying best practice
- Learn from your experiences
- experimenters get better as they get older!
103Summary
- Empirical CS and AI are exacting sciences
- There are many ways to do experiments wrong
- We are experts in doing experiments badly
- As you perform experiments, you'll make many mistakes
- Learn from those mistakes, and ours!
104Empirical Methods for CS
105Some expert advice
- Bernard Moret, U. New Mexico
- "Towards a Discipline of Experimental Algorithmics"
- David Johnson, AT&T Labs
- "A Theoretician's Guide to the Experimental Analysis of Algorithms"
- Both linked to from www.cs.york.ac.uk/tw/empirical.html
106. Bernard Moret's guidelines
- Useful types of empirical results
- accuracy/correctness of theoretical results
- real-world performance
- heuristic quality
- impact of data structures
- ...
107. Bernard Moret's guidelines
- Hallmarks of a good experimental paper
- clearly defined goals
- large scale tests
- both in number and size of instances
- mixture of problems
- real-world, random, standard benchmarks, ...
- statistical analysis of results
- reproducibility
- publicly available instances, code, data files,
...
108. Bernard Moret's guidelines
- Pitfalls for experimental papers
- simpler experiment would have given same result
- result predictable by (back of the envelope)
calculation - bad experimental setup
- e.g. insufficient sample size, no consideration
of scaling, - poor presentation of data
- e.g. lack of statistics, discarding of outliers,
...
109. Bernard Moret's guidelines
- Ideal experimental procedure
- define clear set of objectives
- which questions are you asking?
- design experiments to meet these objectives
- collect data
- do not change experiments until all data is
collected to prevent drift/bias - analyse data
- consider new experiments in light of these
results
110. David Johnson's guidelines
- 3 types of paper describe the implementation of an algorithm
- application paper
- "Here's a good algorithm for this problem"
- sales-pitch paper
- "Here's an interesting new algorithm"
- experimental paper
- "Here's how this algorithm behaves in practice"
- These lessons apply to all 3
111. David Johnson's guidelines
- Perform newsworthy experiments
- standards are higher than for theoretical papers!
- run experiments on real problems
- theoreticians can get away with idealized distributions, but experimentalists have no excuse!
- don't use algorithms that theory can already dismiss
- look for generality and relevance
- don't just report that algorithm A dominates algorithm B; identify why it does!
112. David Johnson's guidelines
- Place work in context
- compare against previous work in the literature
- ideally, obtain their code and test sets
- verify their results, and compare with your new algorithm
- less ideally, re-implement their code
- and report any differences in performance
- least ideally, simply report their old results
- and try to make some ball-park comparisons of machine speeds
113. David Johnson's guidelines
- Use efficient implementations
- somewhat controversial
- an efficient implementation supports claims of practicality
- tells us what is achievable in practice
- we can run more experiments on larger instances
- and can do our research quicker!
- don't have to go overboard on this
- exceptions can also be made
- e.g. when not studying CPU time, when comparing against a previously newsworthy algorithm, when programming time is more valuable than processing time, ...
114. David Johnson's guidelines
- Use testbeds that support general conclusions
- ideally one (or more) random problem class, plus real-world instances
- predict performance on real-world problems based on the random class, and evaluate the quality of the predictions
- use structured random generators
- with parameters to control structure as well as size
- don't just study real-world instances
- it is hard to justify generality unless you have a very broad class of real-world problems!
115. David Johnson's guidelines
- Provide explanations and back them up with experiments
- this adds to the credibility of experimental results
- improves our understanding of algorithms
- leading to better theory and algorithms
- and can weed out bugs in your implementation!
116. David Johnson's guidelines
- Ensure reproducibility
- easily achieved via the Web
- it adds support to a paper if others can (and do) reproduce the results
- requires you to use large samples and a wide range of problems
- otherwise results will not be reproducible!
117. David Johnson's guidelines
- Ensure comparability (and give the full picture)
- make it easy for those who come after to reproduce your results
- provide meaningful summaries
- give sample sizes, report standard deviations, plot graphs, but also report the data in tables in an appendix
- do not hide anomalous results
- report running times even if they are not the main focus
- readers may want to know before studying your results in detail
118. David Johnson's pitfalls
- Failing to report key implementation details
- Extrapolating from tiny samples
- Using irreproducible benchmarks
- Using running time as a stopping criterion
- Ignoring hidden costs (e.g. preprocessing)
- Misusing statistical tools
- Failing to use graphs
119. David Johnson's pitfalls
- Obscuring raw data by using hard-to-read charts
- Comparing apples and oranges
- Drawing conclusions not supported by the data
- Leaving obvious anomalies unnoted/unexplained
- Failing to back up explanations with further
experiments - Ignoring the literature
- the self-referential study!