Empirical Methods for AI
1
Empirical Methods for AI & CS
  • Paul Cohen, Ian P. Gent, Toby Walsh
  • cohen@cs.umass.edu, ipg@dcs.st-and.ac.uk,
    tw@cs.york.ac.uk

2
Overview
  • Introduction
  • What are empirical methods?
  • Why use them?
  • Case Study
  • Eight Basic Lessons
  • Experiment design
  • Data analysis
  • How not to do it
  • Supplementary material

3
Resources
  • Web
  • www.cs.york.ac.uk/tw/empirical.html
  • www.cs.amherst.edu/dsj/methday.html
  • Books
  • Empirical Methods for Artificial Intelligence,
    Paul Cohen, MIT Press, 1995
  • Journals
  • Journal of Experimental Algorithmics,
    www.jea.acm.org
  • Conferences
  • Workshop on Empirical Methods in AI (last
    Saturday, ECAI-02?)
  • Workshop on Algorithm Engineering and
    Experiments, ALENEX 01 (alongside SODA)

4
Empirical Methods for CS
  • Part I Introduction

5
What does empirical mean?
  • Relying on observations, data, experiments
  • Empirical work should complement theoretical work
  • Theories often have holes (e.g., How big is the
    constant term? Is the current problem a bad
    one?)
  • Theories are suggested by observations
  • Theories are tested by observations
  • Conversely, theories direct our empirical
    attention
  • In addition (in this tutorial at least) empirical
    means wanting to understand behavior of complex
    systems

6
Why We Need Empirical Methods: Cohen, 1990 Survey
of 150 AAAI Papers
  • Roughly 60% of the papers gave no evidence that
    the work they described had been tried on more
    than a single example problem.
  • Roughly 80% of the papers made no attempt to
    explain performance, to tell us why it was good
    or bad and under which conditions it might be
    better or worse.
  • Only 16% of the papers offered anything that
    might be interpreted as a question or a
    hypothesis.
  • Theory papers generally had no applications or
    empirical work to support them, empirical papers
    were demonstrations, not experiments, and had no
    underlying theoretical support.
  • The essential synergy between theory and
    empirical work was missing

7
Theory, not Theorems
  • Theory-based science need not be all theorems
  • otherwise science would be mathematics
  • Consider theory of QED
  • based on a model of behaviour of particles
  • predictions accurate to many decimal places (9?)
  • most accurate theory in the whole of science?
  • success derived from accuracy of predictions
  • not the depth or difficulty or beauty of theorems
  • QED is an empirical theory!

8
Empirical CS/AI
  • Computer programs are formal objects
  • so let's reason about them entirely formally?
  • Two reasons why we can't or won't
  • theorems are hard
  • some questions are empirical in nature
  • e.g. are Horn clauses adequate to represent the
    sort of knowledge met in practice?
  • e.g. even though our problem is intractable in
    general, are the instances met in practice easy
    to solve?

9
Empirical CS/AI
  • Treat computer programs as natural objects
  • like fundamental particles, chemicals, living
    organisms
  • Build (approximate) theories about them
  • construct hypotheses
  • e.g. greedy hill-climbing is important to GSAT
  • test with empirical experiments
  • e.g. compare GSAT with other types of
    hill-climbing
  • refine hypotheses and modelling assumptions
  • e.g. greediness not important, but hill-climbing
    is!

10
Empirical CS/AI
  • Many advantages over other sciences
  • Cost
  • no need for expensive super-colliders
  • Control
  • unlike the real world, we often have complete
    command of the experiment
  • Reproducibility
  • in theory, computers are entirely deterministic
  • Ethics
  • no ethics panels needed before you run experiments

11
Types of hypothesis
  • My search program is better than yours
  • not very helpful beauty competition?
  • Search cost grows exponentially with number of
    variables for this kind of problem
  • better as we can extrapolate to data not yet
    seen?
  • Constraint systems are better at handling
    over-constrained systems, but OR systems are
    better at handling under-constrained systems
  • even better as we can extrapolate to new
    situations?

12
A typical conference conversation
  • What are you up to these days?
  • I'm running an experiment to compare the
    Davis-Putnam algorithm with GSAT.
  • Why?
  • I want to know which is faster
  • Why?
  • Lots of people use each of these algorithms
  • How will these people use your result?...

13
Keep in mind the BIG picture
  • What are you up to these days?
  • I'm running an experiment to compare the
    Davis-Putnam algorithm with GSAT.
  • Why?
  • I have this hypothesis that neither will dominate
  • What use is this?
  • A portfolio containing both algorithms will be
    more robust than either algorithm on its own

14
Keep in mind the BIG picture
  • ...
  • Why are you doing this?
  • Because many real problems are intractable in
    theory but need to be solved in practice.
  • How does your experiment help?
  • It helps us understand the difference between
    average and worst case results
  • So why is this interesting?
  • Intractability is one of the BIG open questions
    in CS!

15
Why is empirical CS/AI in vogue?
  • Inadequacies of theoretical analysis
  • problems often aren't as hard in practice as
    theory predicts in the worst-case
  • average-case analysis is very hard (and often
    based on questionable assumptions)
  • Some spectacular successes
  • phase transition behaviour
  • local search methods
  • theory lagging behind algorithm design

16
Why is empirical CS/AI in vogue?
  • Compute power ever increasing
  • even intractable problems coming into range
  • easy to perform large (and sometimes meaningful)
    experiments
  • Empirical CS/AI perceived to be easier than
    theoretical CS/AI
  • often a false perception as experiments easier to
    mess up than proofs

17
Empirical Methods for CS
  • Part II A Case Study
  • Eight Basic Lessons

18
Rosenberg study
  • An Empirical Study of Dynamic Scheduling on
    Rings of Processors
  • Gregory, Gao, Rosenberg & Cohen
  • Proc. of 8th IEEE Symp. on Parallel & Distributed
    Processing, 1996
  • Linked to from

www.cs.york.ac.uk/tw/empirical.html
19
Problem domain
  • Scheduling processors on ring network
  • jobs spawned as binary trees
  • KOSO
  • keep one, send one to my left or right
    arbitrarily
  • KOSO*
  • keep one, send one to my least heavily loaded
    neighbour

20
Theory
  • On complete binary trees, KOSO is asymptotically
    optimal
  • So KOSO* can't be any better?
  • But assumptions unrealistic
  • tree not complete
  • asymptotically not necessarily the same as in
    practice!
  • Thm: Using KOSO on a ring of p processors, a
    binary tree of height n is executed within
    (2^n - 1)/p + low-order terms

21
Benefits of an empirical study
  • More realistic trees
  • probabilistic generator that makes shallow trees,
    which are bushy near root but quickly get
    scrawny
  • similar to trees generated when performing
    Trapezoid or Simpson's Rule calculations
  • binary trees correspond to interval bisection
  • Startup costs
  • network must be loaded

22
Lesson 1: Evaluation begins with claims. Lesson
2: Demonstration is good, understanding is better
  • Hypothesis (or claim): KOSO takes longer than
    KOSO* because KOSO* balances loads better
  • The "because" phrase indicates a hypothesis about
    why it works. This is a better hypothesis than
    the beauty contest demonstration that KOSO*
    beats KOSO
  • Experiment design
  • Independent variables: KOSO vs KOSO*, no. of
    processors, no. of jobs, probability (job will
    spawn), ...
  • Dependent variable: time to complete jobs

23
Criticism 1: This experiment design includes no
direct measure of the hypothesized effect
  • Hypothesis: KOSO takes longer than KOSO* because
    KOSO* balances loads better
  • But the experiment design includes no direct
    measure of load balancing
  • Independent variables: KOSO vs KOSO*, no. of
    processors, no. of jobs, probability (job will
    spawn), ...
  • Dependent variable: time to complete jobs

24
Lesson 3: Exploratory data analysis means
looking beneath immediate results for explanations
  • T-test on time to complete jobs: t =
    (2825 - 2935)/587 = -0.19
  • KOSO* apparently no faster than KOSO (as theory
    predicted)
  • Why? Look more closely at the data
  • Outliers create excessive variance, so the test
    isn't significant

[Histograms of time to complete jobs for KOSO and KOSO*]
25
Lesson 4: The task of empirical work is to
explain variability
Empirical work assumes the variability in a
dependent variable (e.g., run time) is the sum of
causal factors and random noise. Statistical
methods assign parts of this variability to the
factors and the noise.
Number of processors and number of jobs explain
74% of the variance in run time. Algorithm
explains almost none.
26
Lesson 3 (again): Exploratory data analysis
means looking beneath immediate results for
explanations
  • Why does the KOSO/KOSO* choice account for so
    little of the variance in run time?
  • Unless processors starve, there will be no effect
    of load balancing. In most conditions in this
    experiment, processors never starved. (This is
    why we run pilot experiments!)

27
Lesson 5: Of sample variance, effect size, and
sample size, control the first before touching
the last
This intimate relationship holds for all
statistics
28
Lesson 5 illustrated: a variance reduction
method
Let N = num-jobs, P = num-processors, T = run
time. Then T = k (N / P), i.e. k multiples of the
theoretical best time, and k = T / (N / P)
[Histograms of k for KOSO and KOSO* (k = multiples of theoretical best run time)]
29
Where are we?
  • KOSO* is significantly better than KOSO when the
    dependent variable is recoded as percentage of
    optimal run time
  • The difference between KOSO and KOSO* explains
    very little of the variance in either dependent
    variable
  • Exploratory data analysis tells us that
    processors aren't starving, so we shouldn't be
    surprised
  • Prediction The effect of algorithm on run time
    (or k) increases as the number of jobs increases
    or the number of processors increases
  • This prediction is about interactions between
    factors

30
Lesson 6: Most interesting science is about
interaction effects, not simple main effects
  • Data confirm prediction
  • KOSO* is superior on larger rings where
    starvation is an issue
  • Interaction of independent variables
  • choice of algorithm
  • number of processors
  • Interaction effects are essential to explaining
    how things work

[Plot: multiples of optimal run time (1-3) vs number of processors (3, 6, 10, 20) for KOSO and KOSO*]
31
Lesson 7: Significant and meaningful are not
synonymous. Is a result meaningful?
  • KOSO* is significantly better than KOSO, but can
    you use the result?
  • Suppose you wanted to use the knowledge that the
    ring is controlled by KOSO or KOSO* for some
    prediction.
  • Grand median k = 1.11; Pr(trial i has k > 1.11)
    = .5
  • Pr(trial i under KOSO has k > 1.11) = 0.57
  • Pr(trial i under KOSO* has k > 1.11) = 0.43
  • Predict for trial i whether its k is above or
    below the median
  • If it's a KOSO* trial you'll say "no", with (.43
    x 150) = 64.5 errors
  • If it's a KOSO trial you'll say "yes", with ((1 -
    .57) x 160) = 68.8 errors
  • If you don't know, you'll make (.5 x 310) = 155
    errors
  • 155 - (64.5 + 68.8) ≈ 22 fewer errors
  • Knowing the algorithm reduces the error rate from
    .5 to .43. Is this enough???

32
Lesson 8: Keep the big picture in mind
  • Why are you studying this?
  • Load balancing is important to get good
    performance out of parallel computers
  • Why is this important?
  • Parallel computing promises to tackle many of our
    computational bottlenecks
  • How do we know this? It's in the first paragraph
    of the paper!

33
Case study conclusions
  • Evaluation begins with claims
  • Demonstrations of simple main effects are good,
    understanding the effects is better
  • Exploratory data analysis means using your eyes
    to find explanatory patterns in data
  • The task of empirical work is to explain
    variability
  • Control variability before increasing sample size
  • Interaction effects are essential to explanations
  • Significant ≠ meaningful
  • Keep the big picture in mind

34
Empirical Methods for CS
  • Part III Experiment design

35
Experimental Life Cycle
  • Exploration
  • Hypothesis construction
  • Experiment
  • Data analysis
  • Drawing of conclusions

36
Checklist for experiment design
  • Consider the experimental procedure
  • making it explicit helps to identify spurious
    effects and sampling biases
  • Consider a sample data table
  • identifies what results need to be collected
  • clarifies dependent and independent variables
  • shows whether data pertain to hypothesis
  • Consider an example of the data analysis
  • helps you to avoid collecting too little or too
    much data
  • especially important when looking for
    interactions

From Chapter 3, Empirical Methods for
Artificial Intelligence, Paul Cohen, MIT Press
37
Guidelines for experiment design
  • Consider possible results and their
    interpretation
  • may show that experiment cannot support/refute
    hypotheses under test
  • unforeseen outcomes may suggest new hypotheses
  • What was the question again?
  • easy to get carried away designing an experiment
    and lose the BIG picture
  • Run a pilot experiment to calibrate parameters
    (e.g., number of processors in Rosenberg
    experiment)

38
Types of experiment
  • Manipulation experiment
  • Observation experiment
  • Factorial experiment

39
Manipulation experiment
  • Independent variable, x
  • x = identity of parser, size of dictionary, ...
  • Dependent variable, y
  • y = accuracy, speed, ...
  • Hypothesis
  • x influences y
  • Manipulation experiment
  • change x, record y

40
Observation experiment
  • Predictor, x
  • x = volatility of stock prices, ...
  • Response variable, y
  • y = fund performance, ...
  • Hypothesis
  • x influences y
  • Observation experiment
  • classify according to x, compute y

41
Factorial experiment
  • Several independent variables, xi
  • there may be no simple causal links
  • data may come that way
  • e.g. individuals will have different sexes, ages,
    ...
  • Factorial experiment
  • every possible combination of xi considered (see
    the sketch below)
  • expensive as its name suggests!
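A tiny Python sketch of enumerating a full factorial design (an added
illustration; the factor names and levels are made up for this example):

    from itertools import product

    # Hypothetical factors and their levels, purely for illustration
    factors = {
        "algorithm": ["KOSO", "KOSO*"],
        "processors": [3, 6, 10, 20],
        "jobs": [1000, 10000],
    }

    # Every possible combination of factor levels is one experimental condition
    for condition in product(*factors.values()):
        settings = dict(zip(factors.keys(), condition))
        print(settings)  # run the experiment under these settings and record the result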

42
Designing factorial experiments
  • In general, stick to 2 to 3 independent variables
  • Solve same set of problems in each case
  • reduces variance due to differences between
    problem sets
  • If this is not possible, use the same sample sizes
  • simplifies statistical analysis
  • As usual, default hypothesis is that no influence
    exists
  • much easier to fail to demonstrate influence than
    to demonstrate an influence

43
Some problem issues
  • Control
  • Ceiling and Floor effects
  • Sampling Biases

44
Control
  • A control is an experiment in which the
    hypothesised variation does not occur
  • so the hypothesized effect should not occur
    either
  • BUT remember
  • placebos cure a large percentage of patients!

45
Control a cautionary tale
  • Macaque monkeys given vaccine based on human
    T-cells infected with SIV (relative of HIV)
  • macaques gained immunity from SIV
  • Later, macaques given uninfected human T-cells
  • and macaques still gained immunity!
  • Control experiment not originally done
  • and not always obvious (you can't control for all
    variables)

46
Control MYCIN case study
  • MYCIN was a medical expert system
  • recommended therapy for blood/meningitis
    infections
  • How to evaluate its recommendations?
  • Shortliffe used
  • 10 sample problems, 8 therapy recommenders
  • 5 faculty, 1 resident, 1 postdoc, 1 student
  • 8 impartial judges gave 1 point per problem
  • max score was 80
  • MYCIN 65, faculty 40-60, postdoc 60, resident 45,
    student 30

47
Control MYCIN case study
  • What were controls?
  • Control for judges' bias for/against computers
  • judges did not know who recommended each therapy
  • Control for easy problems
  • medical student did badly, so problems not easy
  • Control for our standard being low
  • e.g. random choice should do worse
  • Control for factor of interest
  • e.g. hypothesis in MYCIN that knowledge is
    power
  • have groups with different levels of knowledge

48
Ceiling and Floor Effects
  • Well designed experiments (with good controls)
    can still go wrong
  • What if all our algorithms do particularly well?
  • Or they all do badly?
  • We've got little evidence to choose between them

49
Ceiling and Floor Effects
  • Ceiling effects arise when test problems are
    insufficiently challenging
  • floor effects the opposite, when problems too
    challenging
  • A problem in AI because we often repeatedly use
    the same benchmark sets
  • most benchmarks will lose their challenge
    eventually?
  • but how do we detect this effect?

50
Machine learning example
  • 14 datasets from UCI corpus of benchmarks
  • used as mainstay of ML community
  • Problem is learning classification rules
  • each item is vector of features and a
    classification
  • measure classification accuracy of method (max
    100%)
  • Compare C4 with 1R, two competing algorithms
  • Rob Holte, Machine Learning, vol. 11, pp. 63-91,
    1993
  • www.site.uottawa.edu/holte/Publications/simple_ru
    les.ps

51
Floor effects machine learning example
  • DataSet BC CH GL G2 HD HE Mean
  • C4 72 99.2 63.2 74.3 73.6 81.2 ... 85.9
  • 1R 72.5 69.2 56.4 77 78 85.1 ... 83.8

Is 1R above the floor of performance? How would
we tell?
52
Floor effects machine learning example
  • DataSet BC CH GL G2 HD HE Mean
  • C4 72 99.2 63.2 74.3 73.6 81.2 ... 85.9
  • 1R 72.5 69.2 56.4 77 78 85.1 ... 83.8
  • Baseline 70.3 52.2 35.5 53.4 54.5 79.4
    59.9

Baseline rule puts all items in the more popular
category. 1R is above baseline on most
datasets. A bit like the prime number joke? 1 is
prime. 3 is prime. 5 is prime. So, the baseline
rule is that all odd numbers are prime.
53
Ceiling Effects machine learning
  • DataSet BC GL HY LY MU Mean
  • C4 72 63.2 99.1 77.5 100.0 ... 85.9
  • 1R 72.5 56.4 97.2 70.7 98.4 ... 83.8
  • How do we know that C4 and 1R are not near the
    ceiling of performance?
  • Do the datasets have enough attributes to make
    perfect classification?
  • Obviously for MU, but what about the rest?

54
Ceiling Effects machine learning
  • DataSet BC GL HY LY MU Mean
  • C4 72 63.2 99.1 77.5 100.0 ... 85.9
  • 1R 72.5 56.4 97.2 70.7 98.4 ... 83.8
  • max(C4,1R) 72.5 63.2 99.1 77.5 100.0 87.4
  • max(Buntine) 72.8 60.4 99.1 66.0 98.6 82.0
  • C4 achieves only about 2% better accuracy than 1R
  • The best of C4/1R achieves 87.4% accuracy
  • We have only weak evidence that C4 is better
  • Both methods appear to be performing near the
    ceiling of what is possible, so comparison is hard!

55
Ceiling Effects machine learning
  • In fact 1R only uses one feature (the best one)
  • C4 uses on average 6.6 features
  • 5.6 features buy only about 2% improvement
  • Conclusion?
  • Either real world learning problems are easy (use
    1R)
  • Or we need more challenging datasets
  • We need to be aware of ceiling effects in results

56
Sampling bias
  • Data collection is biased against certain data
  • e.g. a teacher who says "Girls don't answer maths
    questions"
  • observation might suggest
  • girls don't answer many questions
  • but it may be that the teacher doesn't ask them
    many questions
  • Experienced AI researchers don't do that, right?

57
Sampling bias Phoenix case study
  • AI system to fight (simulated) forest fires
  • Experiments suggest that wind speed uncorrelated
    with time to put out fire
  • obviously incorrect as high winds spread forest
    fires

58
Sampling bias Phoenix case study
  • Wind speed vs containment time (max 150 hours)
  • wind speed 3: 120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103
  • wind speed 6: 78, 61, 58, 81, 71, 57, 21, 32, 70
  • wind speed 9: 62, 48, 21, 55, 101
  • What's the problem?

59
Sampling bias Phoenix case study
  • The cut-off of 150 hours introduces sampling bias
  • many high-wind fires get cut off, not many
    low-wind ones
  • On the remaining data, there is no correlation
    between wind speed and time (r = -0.53)
  • In fact, the data show that
  • a lot of high-wind fires take > 150 hours to
    contain
  • those that don't are similar to low-wind fires
  • You wouldn't do this, right?
  • you might if you had automated data analysis.

60
Sampling biases can be subtle...
  • Assume gender (G) is an independent variable and
    number of siblings (S) is a noise variable.
  • If S is truly a noise variable then under random
    sampling, no dependency should exist between G
    and S in samples.
  • Parents have children until they get at least one
    boy. They don't feel the same way about girls.
    In a sample of 1000 girls the number with S = 0
    is smaller than in a sample of 1000 boys.
  • The frequency distribution of S is different for
    different genders. S and G are not independent.
  • Girls do better at math than boys in random
    samples at all levels of education.
  • Is this because of their genes or because they
    have more siblings?
  • What else might be systematically associated with
    G that we don't know about?

61
Empirical Methods for CS
  • Part IV Data analysis

62
Kinds of data analysis
  • Exploratory (EDA) looking for patterns in data
  • Statistical inferences from sample data
  • Testing hypotheses
  • Estimating parameters
  • Building mathematical models of datasets
  • Machine learning, data mining
  • We will introduce hypothesis testing and
    computer-intensive methods

63
The logic of hypothesis testing
  • Example: toss a coin ten times, observe eight
    heads. Is the coin fair (i.e., what is its long
    run behavior?) and what is your residual
    uncertainty?
  • You say, "If the coin were fair, then eight or
    more heads is pretty unlikely, so I think the
    coin isn't fair."
  • Like proof by contradiction: assert the opposite
    (the coin is fair), show that the sample result
    (8 heads) has low probability p, reject the
    assertion, with residual uncertainty related to
    p.
  • Estimate p with a sampling distribution.

64
Probability of a sample result under a null
hypothesis
  • If the coin were fair (p = .5, the null
    hypothesis) what is the probability distribution
    of r, the number of heads, obtained in N tosses
    of a fair coin? Get it analytically or estimate
    it by simulation (on a computer):
  • Loop K times
  •   r = 0                    ; r is num. heads in N tosses
  •   Loop N times             ; simulate the tosses
  •     Generate a random 0 <= x <= 1.0
  •     If x < p, increment r  ; p is the probability of a head
  •   Push r onto sampling_distribution
  • Print sampling_distribution
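A minimal Python version of this simulation (an added illustration, not
part of the original slides; K, N and p are the quantities named above):

    import random

    def sampling_distribution(K=1000, N=10, p=0.5):
        """Estimate the sampling distribution of r, the number of heads in N tosses."""
        dist = []
        for _ in range(K):               # K simulated experiments
            r = 0                        # r is the number of heads in N tosses
            for _ in range(N):           # simulate the tosses
                if random.random() < p:  # p is the probability of a head
                    r += 1
            dist.append(r)
        return dist

    dist = sampling_distribution()
    # probability of a result at least as extreme as the observed 8 heads, under H0
    print(sum(r >= 8 for r in dist) / len(dist))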

65
Sampling distributions
This is the estimated sampling distribution of r
under the null hypothesis that p = .5. The
estimate is constructed by Monte Carlo sampling.
66
The logic of hypothesis testing
  • Establish a null hypothesis: H0: p = .5, the
    coin is fair
  • Establish a statistic: r, the number of heads in
    N tosses
  • Figure out the sampling distribution of r given
    H0
  • The sampling distribution will tell you the
    probability p of a result at least as extreme as
    your sample result, r = 8
  • If this probability is very low, reject H0, the
    null hypothesis
  • Residual uncertainty is p

67
The only tricky part is getting the sampling
distribution
  • Sampling distributions can be derived...
  • Exactly, e.g., binomial probabilities for coins
    are given by the formula Pr(r heads in N tosses)
    = C(N, r) p^r (1 - p)^(N - r)
  • Analytically, e.g., the central limit theorem
    tells us that the sampling distribution of the
    mean approaches a Normal distribution as samples
    grow to infinity
  • Estimated by Monte Carlo simulation of the null
    hypothesis process

68
A common statistical test The Z test for
different means
  • A sample of N = 25 computer science students has
    mean IQ m = 135. Are they smarter than average?
  • Population mean is 100 with standard deviation 15
  • The null hypothesis, H0, is that the CS students
    are "average", i.e., the mean IQ of the
    population of CS students is 100.
  • What is the probability p of drawing the sample
    if H0 were true? If p is small, then H0 is
    probably false.
  • Find the sampling distribution of the mean of a
    sample of size 25, from population with mean 100

69
Central Limit Theorem
The sampling distribution of the mean is given by
the Central Limit Theorem: the sampling
distribution of the mean of samples of size N
approaches a normal (Gaussian) distribution as N
approaches infinity. If the samples are drawn
from a population with mean μ and standard
deviation σ, then the mean of the sampling
distribution is μ and its standard deviation is
σ/√N as N increases. These
statements hold irrespective of the shape of the
original distribution.
70
The sampling distribution for the CS student
example
  • If a sample of N = 25 students were drawn from a
    population with mean 100 and standard deviation
    15 (the null hypothesis) then the sampling
    distribution of the mean would asymptotically be
    normal with mean 100 and standard deviation
    15/√25 = 3

The mean of the CS students falls nearly 12
standard deviations away from the mean of the
sampling distribution. Less than 5% of a normal
distribution falls more than two standard
deviations away from the mean. The probability
that the students are "average" is roughly zero
71
The Z test
[Diagram: the sampling distribution of the mean (mean 100, std 3) with the
sample statistic 135 marked, and the corresponding test statistic on a
standard normal (mean 0, std 1.0): Z = (135 - 100) / 3 = 11.67]
72
Reject the null hypothesis?
  • Commonly we reject H0 when the probability of
    obtaining the sample statistic (e.g., mean = 135)
    given the null hypothesis is low, say < .05.
  • A test statistic value, e.g. Z = 11.67, recodes
    the sample statistic (mean = 135) to make it easy
    to find the probability of the sample statistic
    given H0.
  • We find the probabilities by looking them up in
    tables, or statistics packages provide them.
  • For example, Pr(Z ≥ 1.65) ≈ .05 and Pr(Z ≥ 1.96)
    ≈ .025.
  • Pr(Z ≥ 11) is approximately zero, so reject H0.
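As a small illustration (added here, not from the original slides), the Z
statistic and its one-tailed probability can be computed directly in Python:

    import math

    def z_test(sample_mean, pop_mean, pop_std, n):
        """Z statistic for a sample mean when the population std is known."""
        standard_error = pop_std / math.sqrt(n)  # std of the sampling distribution of the mean
        return (sample_mean - pop_mean) / standard_error

    z = z_test(sample_mean=135, pop_mean=100, pop_std=15, n=25)
    p = 0.5 * math.erfc(z / math.sqrt(2))        # one-tailed Pr(Z >= z) for a standard normal
    print(z, p)                                  # about 11.67, and a probability of roughly zero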

73
The t test
  • Same logic as the Z test, but appropriate when
    population standard deviation is unknown, samples
    are small, etc.
  • Sampling distribution is t, not normal, but
    approaches normal as sample size increases
  • Test statistic has very similar form but
    probabilities of the test statistic are obtained
    by consulting tables of the t distribution, not
    the normal

74
The t test
Suppose N = 5 students have mean IQ 135, sample
std = 27
Estimate the standard deviation of the sampling
distribution using the sample standard deviation:
27 / √5 ≈ 12.1
[Diagram: the sampling distribution of the mean (mean 100, std 12.1) with the
sample statistic 135 marked, and the corresponding test statistic on a t
distribution (mean 0, std 1.0): t = (135 - 100) / 12.1 ≈ 2.89]
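A corresponding Python sketch for the t statistic (illustrative; the values
match the example above, and the one-tailed p-value quoted on the next slide
can be obtained from t tables or, if SciPy is available, from
scipy.stats.t.sf(t, df=n-1)):

    import math

    def t_statistic(sample_mean, pop_mean, sample_std, n):
        """One-sample t statistic; the standard error is estimated from the sample std."""
        standard_error = sample_std / math.sqrt(n)
        return (sample_mean - pop_mean) / standard_error

    t = t_statistic(sample_mean=135, pop_mean=100, sample_std=27, n=5)
    print(t)  # about 2.89, compared against a t distribution with n - 1 = 4 degrees of freedom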
75
Summary of hypothesis testing
  • H0 negates what you want to demonstrate; find the
    probability p of the sample statistic under H0 by
    comparing the test statistic to the sampling
    distribution; if the probability is low, reject H0
    with residual uncertainty proportional to p.
  • Example: we want to demonstrate that CS graduate
    students are smarter than average. H0 is that
    they are average. t = 2.89, p = .022
  • Have we proved CS students are smarter? NO!
  • We have only shown that a mean of 135 is unlikely
    if they aren't. We never prove what we want to
    demonstrate, we only reject H0, with residual
    uncertainty.
  • And failing to reject H0 does not prove H0,
    either!

76
Common tests
  • Tests that means are equal
  • Tests that samples are uncorrelated or
    independent
  • Tests that slopes of lines are equal
  • Tests that predictors in rules have predictive
    power
  • Tests that frequency distributions (how often
    events happen) are equal
  • Tests that classification variables such as
    smoking history and heart disease history are
    unrelated
  • ...
  • All follow the same basic logic

77
Computer-intensive Methods
  • Basic idea: construct sampling distributions by
    simulating on a computer the process of drawing
    samples.
  • Three main methods
  • Monte Carlo simulation: when one knows population
    parameters
  • Bootstrap: when one doesn't
  • Randomization: also assumes nothing about the
    population.
  • Enormous advantage: works for any statistic and
    makes no strong parametric assumptions (e.g.,
    normality)

78
Another Monte Carlo example, relevant to machine
learning...
  • Suppose you want to buy stocks in a mutual fund;
    for simplicity assume there are just N = 50 funds
    to choose from and you'll base your decision on
    the proportion of the J = 30 stocks in each fund
    that increased in value
  • Suppose Pr(a stock increasing in price) = .75
  • You are tempted by the best of the funds, F,
    which reports price increases in 28 of its 30
    stocks.
  • What is the probability of this performance?

79
Simulate...
  • Loop K = 1000 times
  •   B = 0  ; number of stocks that increase in the best of the N funds
  •   Loop N = 50 times  ; N is the number of funds
  •     H = 0  ; stocks that increase in this fund
  •     Loop M = 30 times  ; M is the number of stocks in this fund
  •       Toss a coin with bias p to decide whether this stock
          increases in value, and if so increment H
  •     Push H on a list  ; we get N values of H
  •   B = maximum of the H values  ; the number of increasing
        stocks in the best fund
  •   Push B on a list  ; we get K values of B
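A runnable Python version of this simulation (illustrative; K, N, M and p
are the values used above):

    import random

    def best_fund_distribution(K=1000, N=50, M=30, p=0.75):
        """Sampling distribution of B, the number of rising stocks in the best of N funds."""
        dist = []
        for _ in range(K):
            # H = number of rising stocks in each fund; B = the best fund's count
            B = max(sum(random.random() < p for _ in range(M)) for _ in range(N))
            dist.append(B)
        return dist

    dist = best_fund_distribution()
    # probability that the best of 50 funds reports 28 or more of its 30 stocks rising
    print(sum(B >= 28 for B in dist) / len(dist))  # roughly 0.4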

80
Surprise!
  • The probability that the best of 50 funds reports
    28 of 30 stocks increase in price is roughly 0.4
  • Why? The probability that an arbitrary fund
    would report this increase is Pr(28 successes |
    Pr(success) = .75) ≈ .01, but the probability that
    the best of 50 funds would report this is much
    higher.
  • Machine learning algorithms use critical values
    based on arbitrary elements when they are
    actually testing the best element; they think
    elements are more unusual than they really are.
    This is why ML algorithms overfit.

81
The Bootstrap
  • Monte Carlo estimation of sampling distributions
    assumes you know the parameters of the population
    from which samples are drawn.
  • What if you don't?
  • Use the sample as an estimate of the population.
  • Draw samples from the sample!
  • With or without replacement?
  • Example: the sampling distribution of the mean;
    check the results against the central limit theorem.

82
Bootstrapping the sampling distribution of the
mean
  • S is a sample of size N
  • Loop K = 1000 times
  •   Draw a pseudosample S* of size N from S by
      sampling with replacement
  •   Calculate the mean of S* and push it on a list L
  • L is the bootstrapped sampling distribution of
    the mean
  • This procedure works for any statistic, not just
    the mean.
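A short Python sketch of this procedure (an added illustration; S can be any
list of numbers):

    import random

    def bootstrap_means(S, K=1000):
        """Bootstrapped sampling distribution of the mean of the sample S."""
        N = len(S)
        L = []
        for _ in range(K):
            pseudosample = [random.choice(S) for _ in range(N)]  # resample S with replacement
            L.append(sum(pseudosample) / N)
        return L

    # e.g. using the ten scores from the randomization example below
    means = bootstrap_means([54, 66, 64, 61, 23, 28, 27, 31, 51, 32])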

Recall we can get the sampling distribution of
the mean via the central limit theorem; this
example is just for illustration. This
distribution is not a null hypothesis
distribution and so is not directly used for
hypothesis testing, but can easily be transformed
into a null hypothesis distribution (see Cohen,
1995).
83
Randomization
  • Used to test hypotheses that involve association
    between elements of two or more groups; very
    general.
  • Example: Paul tosses H H H H, Carole tosses T T
    T T; is the outcome independent of the tosser?
  • Example: 4 women score 54 66 64 61, six men score
    23 28 27 31 51 32. Is score independent of
    gender?
  • Basic procedure: calculate a statistic f for
    your sample; randomize one factor relative to the
    other and calculate your pseudostatistic f*.
    Compare f to the sampling distribution for f*.

84
Example of randomization
  • Four women score 54 66 64 61, six men score 23 28
    27 31 51 32. Is score independent of gender?
  • f = difference of means of women's and men's
    scores = 29.25
  • Under the null hypothesis of no association
    between gender and score, the score 54 might
    equally well have been achieved by a male or a
    female.
  • Toss all scores in a hopper, draw out four at
    random and without replacement, call them
    "female", call the rest "male", and calculate f*,
    the difference of means of "female" and "male".
    Repeat to get a distribution of f*. This is an
    estimate of the sampling distribution of f under
    H0: no difference between male and female scores.
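A Python sketch of this randomization test (added for illustration; it uses
the scores above and estimates how often a random relabelling gives a
difference of means at least as large as the observed f = 29.25):

    import random

    women = [54, 66, 64, 61]
    men = [23, 28, 27, 31, 51, 32]

    def mean(xs):
        return sum(xs) / len(xs)

    f = mean(women) - mean(men)  # observed statistic, 29.25

    K = 10000
    count = 0
    scores = women + men
    for _ in range(K):
        shuffled = random.sample(scores, len(scores))          # toss all scores in a hopper
        pseudo_women, pseudo_men = shuffled[:4], shuffled[4:]  # relabel four as "female"
        f_star = mean(pseudo_women) - mean(pseudo_men)
        if f_star >= f:
            count += 1
    print(count / K)  # estimated probability of a difference this large under H0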

85
Empirical Methods for CS
  • Part V How Not To Do It

86
Tales from the coal face
  • Those ignorant of history are doomed to repeat it
  • we have committed many howlers
  • We hope to help others avoid similar ones
  • and illustrate how easy it is to screw up!
  • How Not to Do It
  • I Gent, S A Grant, E. MacIntyre, P Prosser, P
    Shaw,
  • B M Smith, and T Walsh
  • University of Leeds Research Report, May 1997
  • Every howler we report committed by at least one
    of the above authors!

87
How Not to Do It
  • Do measure with many instruments
  • in exploring hard problems, we used our best
    algorithms
  • missed very poor performance of less good
    algorithms
  • better algorithms will be bitten by same effect
    on larger instances than we considered
  • Do measure CPU time
  • in exploratory code, CPU time often misleading
  • but can also be very informative
  • e.g. heuristic needed more search but was faster

88
How Not to Do It
  • Do vary all relevant factors
  • Don't change two things at once
  • ascribed effects of heuristic to the algorithm
  • changed heuristic and algorithm at the same time
  • didn't perform a factorial experiment
  • but it's not always easy/possible to do the
    right experiments if there are many factors

89
How Not to Do It
  • Do Collect All Data Possible ... (within reason)
  • one year Santa Claus had to repeat all our
    experiments
  • ECAI/AAAI/IJCAI deadlines just after new year!
  • we had collected number of branches in search
    tree
  • performance scaled with backtracks, not branches
  • all experiments had to be rerun
  • Don't Kill Your Machines
  • we have got into trouble with sysadmins
  • over experimental data we never used
  • often the vital experiment is small and quick

90
How Not to Do It
  • Do It All Again (or at least be able to)
  • e.g. storing random seeds used in experiments
  • we didn't do that and might have lost an important
    result (see the seed-logging sketch below)
  • Do Be Paranoid
  • identical implementations in C, Scheme gave
    different results
  • Do Use The Same Problems
  • reproducibility is a key to science (c.f. cold
    fusion)
  • can reduce variance
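A minimal sketch of the seed-logging habit in Python (an added illustration,
not from the original report):

    import random
    import time

    seed = int(time.time())        # any seed source will do, as long as it is recorded
    random.seed(seed)
    print("random seed:", seed)    # log the seed alongside the experimental results

    # ... run the randomized experiment ...

    # To reproduce the run later, call random.seed() again with the logged value.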

91
Choosing your test data
  • We've seen the possible problem of over-fitting
  • remember machine learning benchmarks?
  • Two common approaches
  • benchmark libraries
  • random problems
  • Both have potential pitfalls

92
Benchmark libraries
  • +ve
  • can be based on real problems
  • lots of structure
  • -ve
  • library of fixed size
  • possible to over-fit algorithms to library
  • problems have fixed size
  • so can't measure scaling

93
Random problems
  • +ve
  • problems can have any size
  • so can measure scaling
  • can generate any number of problems
  • hard to over-fit?
  • -ve
  • may not be representative of real problems
  • lack structure
  • easy to generate flawed problems
  • CSP, QSAT, ...

94
Flawed random problems
  • Constraint satisfaction example
  • 40 papers over 5 years by many authors used
    Models A, B, C, and D
  • all four models are flawed (Achlioptas et al.
    1997)
  • asymptotically almost all problems are trivial
  • brings into doubt many experimental results
  • some experiments at typical sizes affected
  • fortunately not many
  • How should we generate problems in future?

95
Flawed random problems
  • Gent et al. 1998 fix the flaw ...
  • introduce flawless problem generation
  • defined in two equivalent ways
  • though no proof that problems are truly flawless
  • Undergraduate student at Strathclyde found new
    bug
  • two definitions of flawless not equivalent
  • Eventually settled on final definition of
    flawless
  • gave proof of asymptotic non-triviality
  • so we think that we just about understand the
    problem generator now

96
Prototyping your algorithm
  • Often need to implement an algorithm
  • usually novel algorithm, or variant of existing
    one
  • e.g. new heuristic in existing search algorithm
  • novelty of algorithm should imply extra care
  • more often, encourages lax implementation
  • "it's only a preliminary version"

97
How Not to Do It
  • Don't Trust Yourself
  • bug in innermost loop found by chance
  • all experiments re-run with urgent deadline
  • curiously, sometimes bugged version was better!
  • Do Preserve Your Code
  • Or end up fixing the same error twice
  • Do use version control!

98
How Not to Do It
  • Do Make it Fast Enough
  • emphasis on enough
  • it's often not necessary to have optimal code
  • in lifecycle of experiment, extra coding time not
    won back
  • e.g. we have published many papers with
    inefficient code
  • compared to state of the art
  • first GSAT version was O(N^2), but this really was
    too slow!
  • Do Report Important Implementation Details
  • Intermediate versions produced good results

99
How Not to Do It
  • Do Look at the Raw Data
  • Summaries obscure important aspects of behaviour
  • Many statistical measures explicitly designed to
    minimise effect of outliers
  • Sometimes outliers are vital
  • exceptionally hard problems dominate mean
  • we missed them until they hit us on the head
  • when experiments crashed overnight
  • old data on smaller problems showed clear
    behaviour

100
How Not to Do It
  • Do face up to the consequences of your results
  • e.g. preprocessing on 450 problems
  • should obviously reduce search
  • reduced search 448 times
  • increased search 2 times
  • Forget the algorithm, it's useless?
  • Or study in detail the two exceptional cases
  • and achieve new understanding of an important
    algorithm

101
Empirical Methods for CS
  • Part VI Coda

102
Our objectives
  • Outline some of the basic issues
  • exploration, experimental design, data analysis,
    ...
  • Encourage you to consider some of the pitfalls
  • we have fallen into all of them!
  • Raise standards
  • encouraging debate
  • identifying best practice
  • Learn from your experiences
  • experimenters get better as they get older!

103
Summary
  • Empirical CS and AI are exacting sciences
  • There are many ways to do experiments wrong
  • We are experts in doing experiments badly
  • As you perform experiments, you'll make many
    mistakes
  • Learn from those mistakes, and ours!

104
Empirical Methods for CS
  • Part VII Supplement

105
Some expert advice
  • Bernard Moret, U. New Mexico
  • "Towards a Discipline of Experimental
    Algorithmics"
  • David Johnson, AT&T Labs
  • "A Theoretician's Guide to the Experimental
    Analysis of Algorithms"
  • Both linked to from
    www.cs.york.ac.uk/tw/empirical.html

106
Bernard Moret's guidelines
  • Useful types of empirical results
  • accuracy/correctness of theoretical results
  • real-world performance
  • heuristic quality
  • impact of data structures
  • ...

107
Bernard Moret's guidelines
  • Hallmarks of a good experimental paper
  • clearly defined goals
  • large scale tests
  • both in number and size of instances
  • mixture of problems
  • real-world, random, standard benchmarks, ...
  • statistical analysis of results
  • reproducibility
  • publicly available instances, code, data files,
    ...

108
Bernard Moret's guidelines
  • Pitfalls for experimental papers
  • simpler experiment would have given same result
  • result predictable by (back of the envelope)
    calculation
  • bad experimental setup
  • e.g. insufficient sample size, no consideration
    of scaling,
  • poor presentation of data
  • e.g. lack of statistics, discarding of outliers,
    ...

109
Bernard Moret's guidelines
  • Ideal experimental procedure
  • define clear set of objectives
  • which questions are you asking?
  • design experiments to meet these objectives
  • collect data
  • do not change experiments until all data is
    collected to prevent drift/bias
  • analyse data
  • consider new experiments in light of these
    results

110
David Johnson's guidelines
  • 3 types of paper describe the implementation of
    an algorithm:
  • application paper
  • "Here's a good algorithm for this problem"
  • sales-pitch paper
  • "Here's an interesting new algorithm"
  • experimental paper
  • "Here's how this algorithm behaves in practice"
  • These lessons apply to all 3

111
David Johnson's guidelines
  • Perform newsworthy experiments
  • standards higher than for theoretical papers!
  • run experiments on real problems
  • theoreticians can get away with idealized
    distributions but experimentalists have no
    excuse!
  • don't use algorithms that theory can already
    dismiss
  • look for generality and relevance
  • don't just report that algorithm A dominates
    algorithm B; identify why it does!

112
David Johnson's guidelines
  • Place work in context
  • compare against previous work in literature
  • ideally, obtain their code and test sets
  • verify their results, and compare with your new
    algorithm
  • less ideally, re-implement their code
  • report any differences in performance
  • least ideally, simply report their old results
  • try to make some ball-park comparisons of machine
    speeds

113
David Johnson's guidelines
  • Use efficient implementations
  • somewhat controversial
  • efficient implementation supports claims of
    practicality
  • tells us what is achievable in practice
  • can run more experiments on larger instances
  • can do our research quicker!
  • don't have to go overboard on this
  • exceptions can also be made
  • e.g. not studying CPU time, comparing against a
    previously newsworthy algorithm, programming time
    more valuable than processing time, ...

114
David Johnson's guidelines
  • Use testbeds that support general conclusions
  • ideally one (or more) random class, real world
    instances
  • predict performance on real world problems based
    on random class, evaluate quality of predictions
  • structured random generators
  • parameters to control structure as well as size
  • don't just study real world instances
  • hard to justify generality unless you have a very
    broad class of real world problems!

115
David Johnson's guidelines
  • Provide explanations and back them up with
    experiment
  • adds to credibility of experimental results
  • improves our understanding of algorithms
  • leading to better theory and algorithms
  • can weed out bugs in your implementation!

116
David Johnson's guidelines
  • Ensure reproducibility
  • easily achieved via the Web
  • adds support to a paper if others can (and do)
    reproduce the results
  • requires you to use large samples and wide range
    of problems
  • otherwise results will not be reproducible!

117
David Johnson's guidelines
  • Ensure comparability (and give the full picture)
  • make it easy for those who come after to
    reproduce your results
  • provide meaningful summaries
  • give sample sizes, report standard deviations,
    plot graphs but report data in tables in the
    appendix
  • do not hide anomalous results
  • report running times even if this is not the main
    focus
  • readers may want to know before studying your
    results in detail

118
David Johnson's pitfalls
  • Failing to report key implementation details
  • Extrapolating from tiny samples
  • Using irreproducible benchmarks
  • Using running time as a stopping criterion
  • Ignoring hidden costs (e.g. preprocessing)
  • Misusing statistical tools
  • Failing to use graphs

119
David Johnson's pitfalls
  • Obscuring raw data by using hard-to-read charts
  • Comparing apples and oranges
  • Drawing conclusions not supported by the data
  • Leaving obvious anomalies unnoted/unexplained
  • Failing to back up explanations with further
    experiments
  • Ignoring the literature
  • the self-referential study!