CSI5388: Functional Elements of Statistics for Machine Learning, Part I
Contents of the Lecture
- Part I (This set of lecture notes)
- Definition and Preliminaries
- Hypothesis Testing Parametric Approaches
- Part II (The next set of lecture notes)
- Hypothesis Testing Non-Parametric Approaches
- Power of a Test
- Statistical Tests for Comparing Multiple
Classifiers
Definitions and Preliminaries I
- A Random Variable is a function that assigns a unique numerical value to each possible outcome of a random experiment under fixed conditions.
- If X takes on N values x1, x2, ..., xN, such that each xi ∈ R, then
- The Mean of X is µ = (1/N) Σ xi
- The Variance is σ² = (1/N) Σ (xi − µ)²
- The Standard Deviation is σ = sqrt(σ²)
Definitions and Preliminaries II
- The Sample Variance is s² = (1/(N − 1)) Σ (xi − X̄)²
- The Sample Standard Deviation is s = sqrt(s²)
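These estimators can be sketched in a few lines of Python using only the standard library (the data values below are invented for illustration):

```python
import math

def mean(xs):
    # Mean: the average of the values.
    return sum(xs) / len(xs)

def population_variance(xs):
    # Variance: average squared deviation from the mean (divide by N).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    # Sample variance: divide by N - 1 (Bessel's correction) so the
    # estimate of the population variance is unbiased.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(data))                            # 5.0
print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # standard deviation: 2.0
print(sample_variance(data))                 # 32/7, slightly larger than 4.0
```

Note that the sample variance (divide by N − 1) is always a little larger than the population formula (divide by N); the difference vanishes as N grows.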
Hypothesis Testing
- Generalities
- Sampling Distributions
- Procedure
- One- versus Two-tailed tests
- Parametric approaches
Generalities
- Purpose: If we assume a given sampling distribution, we want to establish whether a sample result is representative of that sampling distribution. This is interesting because it helps us decide whether the results we obtained in an experiment can generalize to future data.
- Approaches to Hypothesis Testing: There are two different approaches to hypothesis testing, parametric and non-parametric.
Sampling Distributions
- Definition: The sampling distribution of a statistic (for example, the mean, the median, or any other summary of a data set) is the distribution of values obtained for that statistic over all possible samples of the same size from a given population.
- Note: Since the populations under study are usually infinite, or at least very large, the true sampling distribution is usually unknown. Therefore, rather than computing it exactly, it will have to be estimated. Nonetheless, we can do so quite well, especially when considering the mean of the data.
Procedure I
- Idea: If we assume a given sampling distribution, we want to establish whether a sample result is representative of that sampling distribution. This is interesting because it helps us decide whether the results we obtained in an experiment can generalize to future data.
- Example: If the sample mean we obtain on a particular data sample is representative of the sampling distribution, then we can conclude that our data sample is representative of the whole population. If not, the values in our sample are unrepresentative. (Perhaps this sample contained data that were particularly easy or particularly difficult to classify.)
Procedure II
- State your research hypothesis.
- Formulate a null hypothesis stating the opposite of your research hypothesis. In particular, the null hypothesis concerns the relationship between the sampling statistic of the base population and the sample result you obtained from your specific set of data.
- Collect your specific data and compute the sample statistic on it.
- Calculate the probability of obtaining the sample result you obtained if the sample emanated from the population that gave you the original sampling statistic.
- If this probability is low, reject the null hypothesis, and state that the sample you considered does not emanate from the population that gave you the original sampling statistic.
One- and Two-Tailed Tests
- If H0 is expressed as an equality, then there are two ways to reject H0: either the statistic computed from your sample at hand is lower than the sampling statistic, or it is higher. If you are only concerned about one of these directions (lower or higher), then you should perform a one-tailed test. If you are simultaneously concerned about both ways in which H0 can be rejected, then you should perform a two-tailed test.
Parametric Approaches to Hypothesis Testing
- The classical approach to hypothesis testing is parametric. This means that, in order to be applied, this approach makes a number of assumptions regarding the distribution of the population and the available sample.
- Non-parametric approaches, discussed later, do not make these strong assumptions, although they make some assumptions as well, as will be discussed there.
Why are Hypothesis Tests often applied to Means?
- Hypothesis tests are often applied to means. The reason is that, unlike for other statistics, the standard deviation of the mean is known and simple to calculate.
- Without a standard deviation, hypothesis testing could not be performed, since the probability that the sample under consideration emanates from the population represented by the original sampling statistic is linked to this standard deviation. Having access to the standard deviation is therefore essential.
Why is the Standard Deviation of the Mean easy to calculate?
- Because of the important Central Limit Theorem, which states that no matter how your original population is distributed, if you use large enough samples, then the sampling distribution of the mean of these samples approaches a normal distribution. If the mean of the original population is µ and its standard deviation is σ, then the mean of the sampling distribution is µ and its standard deviation is σ/sqrt(N), where N is the sample size.
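The theorem is easy to check empirically. The sketch below (all choices, including the exponential population and the sample size of 50, are illustrative assumptions, not from the slides) draws many samples from a skewed population with µ = 1 and σ = 1 and verifies that the means of those samples have mean close to µ and standard deviation close to σ/sqrt(N):

```python
import math
import random

random.seed(0)  # fixed seed so the simulation is reproducible

# Skewed parent population: exponential with mean 1 and standard deviation 1.
N = 50          # size of each sample
M = 20000       # number of samples drawn

sample_means = [
    sum(random.expovariate(1.0) for _ in range(N)) / N
    for _ in range(M)
]

grand_mean = sum(sample_means) / M
sd_of_means = math.sqrt(
    sum((m - grand_mean) ** 2 for m in sample_means) / M
)

# The CLT predicts: mean of the sampling distribution ≈ µ = 1,
# standard deviation of the sampling distribution ≈ σ/sqrt(N) = 1/sqrt(50).
print(grand_mean)   # close to 1.0
print(sd_of_means)  # close to 1/sqrt(50) ≈ 0.141
```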
When is the Sampling Distribution of the Mean Normal?
- The sample size necessary for the sampling distribution of the mean to approach normality depends on the distribution of the parent population.
- If the parent population is normal, then the sampling distribution of the mean is also normal.
- If the parent population is not normal but symmetrical and uni-modal, then the sampling distribution of the mean will be close to normal, even for small sample sizes.
- If the population is very skewed, then sample sizes of at least 30 will be required for the sampling distribution of the mean to be approximately normal.
How are Hypothesis Tests set up? t-tests
- Hypothesis tests are used to find out whether a sample mean comes from a sampling distribution with a specified mean.
- We will consider:
- One-sample t-tests
- µ, σ known
- µ, σ unknown
- Two-sample t-tests
- Two matched samples
- Two independent samples
One-sample t-tests: σ known
- If σ is known, we can use the Central Limit Theorem to obtain the sampling distribution of the population's mean (its mean is µ and its standard deviation is σ/sqrt(N)).
- Let X̄ be the mean of our data sample; we compute
- z = (X̄ − µ)/(σ/sqrt(N)) (1)
- We find the probability that z is as large as the value obtained, using the z-table. We then output this probability if we are solely interested in a one-tailed test, and double it before outputting it if we are interested in a two-tailed test.
- If this output probability is smaller than .05, we reject H0 at the .05 level of significance. Otherwise, we state that we have no evidence to conclude that H0 does not hold.
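This z-test can be sketched with only the standard library: the z-table lookup is replaced by the standard normal CDF, computed from math.erf. The numbers in the example (µ = 100, σ = 15, a sample of 25 with mean 106) are invented for illustration:

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_sample_z_test(sample_mean, mu, sigma, n, two_tailed=True):
    # Equation (1): z = (X-bar - mu) / (sigma / sqrt(N)).
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    # One-tailed probability of a z at least this extreme;
    # doubled for a two-tailed test, as described on the slide.
    p = 1.0 - normal_cdf(abs(z))
    return z, 2.0 * p if two_tailed else p

# Hypothetical example: H0 says the sample comes from a population
# with mu = 100, sigma = 15; our sample of 25 has mean 106.
z, p = one_sample_z_test(106, 100, 15, 25)
print(z, p)  # z = 2.0, two-tailed p ≈ 0.0455, so H0 is rejected at .05
```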
What are the Meaning and Purpose of z?
- All normal distributions can easily be mapped into a single one, the standard normal, using a specific transformation.
- This means that, in our hypothesis tests, we can reuse the same information about the sampling distribution over and over (if we assume that our population is normally distributed), no matter what the mean and variance of our actual population are.
- Any observation can be changed into a standard score, z, with respect to mean 0 and standard deviation 1, as follows: z = (X − µ)/σ
One-sample t-tests: σ unknown
- In most situations, σ, the standard deviation of the population, is unknown. In this case, we replace σ by s, the sample standard deviation, in equation (1), yielding
- t = (X̄ − µ)/(s/sqrt(N)) (2)
- Because s is likely to under-estimate σ and, thus, return a t-value larger than z would have been had σ been known, it is inappropriate to use the distribution of z to accept or reject the null hypothesis.
- Instead, we use Student's t distribution, which corrects for this problem: we compare t to the t-table with N − 1 degrees of freedom. We then proceed as we did for z on the slide about σ known, above.
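A sketch of this test in Python follows. Since the standard library has no t distribution, the computed t statistic is compared against a critical value taken from a t-table (2.262 for 9 degrees of freedom, two-tailed, at the .05 level); the ten "accuracy scores" are invented example data:

```python
import math

def one_sample_t(data, mu):
    n = len(data)
    x_bar = sum(data) / n
    # Sample standard deviation s (divide by N - 1).
    s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    # Equation (2): t = (X-bar - mu) / (s / sqrt(N)),
    # with N - 1 degrees of freedom.
    return (x_bar - mu) / (s / math.sqrt(n)), n - 1

# Hypothetical sample of 10 accuracy scores, tested against H0: mu = 0.80.
scores = [0.83, 0.85, 0.79, 0.88, 0.84, 0.86, 0.82, 0.87, 0.81, 0.85]
t, df = one_sample_t(scores, 0.80)

t_crit = 2.262  # two-tailed critical value for df = 9, alpha = .05 (t-table)
print(t, df)
print(abs(t) > t_crit)  # True here: reject H0 at the .05 level
```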
What are the Meaning and Purpose of t?
- t follows the same principle as z, except that t should be used when the population standard deviation is unknown.
- t, however, represents a family of curves rather than a single curve: the shape of the t distribution changes with the sample size.
- As the sample size grows larger and larger, t looks more and more like a normal distribution.
Assumption of the t-test with σ unknown
- Please note that one assumption is made in the use of the t-test: we assume that the sample was drawn from a normally distributed population.
- This is required because Student's derivation of t was based on the assumption that the mean and variance of the population are independent, an assumption that is true in the case of a normal distribution.
- In practice, however, the assumption about the distribution from which the sample was drawn can be lifted whenever the sample size is sufficiently large to produce a normal sampling distribution of the mean. In general, n = 25 or 30 (the number of cases in a sample) is sufficiently large; often, it can be smaller than that.
Two-sample t-tests: matched samples
- Given two matched samples, we want to test whether the difference in means between them is significant. We do so by looking at the mean of the pairwise differences, D̄, and their standard deviation, s_D, and comparing D̄ to a mean of 0.
- We can then apply the t-test as we did above, in the case where σ was unknown. This time, we have
- t = (D̄ − 0)/(s_D/sqrt(n)) (3)
- We use the t-table as before, with n − 1 degrees of freedom, and the same assumptions about the normality of the distribution.
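Equation (3) amounts to a one-sample t-test on the pairwise differences. A sketch (the two lists of per-fold classifier accuracies are invented example data):

```python
import math

def paired_t(xs, ys):
    # Differences between matched observations.
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    d_bar = sum(d) / n
    # Standard deviation of the differences (divide by n - 1).
    s_d = math.sqrt(sum((v - d_bar) ** 2 for v in d) / (n - 1))
    # Equation (3): t = (D-bar - 0) / (s_D / sqrt(n)),
    # with n - 1 degrees of freedom.
    return d_bar / (s_d / math.sqrt(n)), n - 1

# Hypothetical accuracies of two classifiers on the same 6 folds
# (matched samples: fold i of clf_a is paired with fold i of clf_b).
clf_a = [0.91, 0.89, 0.93, 0.90, 0.88, 0.92]
clf_b = [0.88, 0.87, 0.91, 0.89, 0.86, 0.90]
t, df = paired_t(clf_a, clf_b)
print(t, df)  # compare |t| with the t-table value for df = 5
```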
Two-sample t-tests: independent samples
- This time, we are interested in comparing two populations with different means and variances. The two samples are completely independent.
- We can again apply the t-test, with the same conditions applying, using the formula
- t = (X̄1 − X̄2)/sqrt(s1²/n1 + s2²/n2)
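A sketch of this statistic follows. The slide gives the formula but not the degrees of freedom; with unequal variances these are usually obtained from the Welch–Satterthwaite approximation, which is an assumption here rather than something stated on the slide, so the sketch returns only the t value. The two groups of measurements are invented:

```python
import math

def sample_var(xs):
    # Sample variance (divide by n - 1).
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def independent_t(xs, ys):
    n1, n2 = len(xs), len(ys)
    x1, x2 = sum(xs) / n1, sum(ys) / n2
    v1, v2 = sample_var(xs), sample_var(ys)
    # t = (X1-bar - X2-bar) / sqrt(s1^2/n1 + s2^2/n2)
    return (x1 - x2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical measurements from two independent groups.
group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
group_b = [4.6, 4.8, 4.5, 4.7, 4.4]
t = independent_t(group_a, group_b)
print(t)
```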
Confidence Intervals
- Sample means represent point estimates of the mean parameter. Here, we are interested in interval estimates, which tell us how large or small the true value of µ could be without causing us to reject H0, given that we ran a t-test on the mean of our sample.
- To calculate these intervals, we simply take the equations presented on the previous slides and express them in terms of µ, as a function of t.
- We then substitute for t the two-tailed critical value we are interested in from the t-table. This value can be positive or negative, meaning that we obtain two values for µ: µ_upper and µ_lower. These give us the limits of the confidence interval.
- The confidence interval means that µ has a certain probability (attached to the value of t chosen) of belonging to this interval. The greater the size of the interval, the greater the probability that µ is included; conversely, the smaller the interval, the smaller the probability that it is included.
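Solving equation (2), t = (X̄ − µ)/(s/sqrt(N)), for µ at t = ±t_crit gives µ_lower = X̄ − t_crit·s/sqrt(N) and µ_upper = X̄ + t_crit·s/sqrt(N). A sketch, reusing a hypothetical sample of 10 scores and the t-table critical value 2.262 (df = 9, two-tailed, .05 level):

```python
import math

def confidence_interval(data, t_crit):
    n = len(data)
    x_bar = sum(data) / n
    # Sample standard deviation s (divide by n - 1).
    s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    # Solving t = (X-bar - mu) / (s / sqrt(N)) for mu at +/- t_crit:
    # mu_lower = X-bar - t_crit*s/sqrt(N), mu_upper = X-bar + t_crit*s/sqrt(N).
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of 10 accuracy scores.
scores = [0.83, 0.85, 0.79, 0.88, 0.84, 0.86, 0.82, 0.87, 0.81, 0.85]
lo, hi = confidence_interval(scores, 2.262)  # t-table value for df = 9, .05 level
print(lo, hi)  # the 95% confidence interval [mu_lower, mu_upper]
```

Any hypothesized µ inside [lo, hi] would not be rejected by the corresponding two-tailed t-test at the .05 level; any µ outside it would be.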