Title: Statistics for Linguistics Students
1Statistics for Linguistics Students
- Michaelmas 2004
- Week 3
- Bettina Braun
- www.phon.ox.ac.uk/bettina
2Overview
- Discussion of last assignment
- Z-scores
- Sampling distributions
- Confidence intervals
- Hypothesis testing
- Type I and Type II errors
3General comments
- Please let every file you submit contain your
initials and the week the assignment was given!
- Please put your name somewhere on the page
- Paste figures into the doc-file (or rtf-file) and
only submit the .sav-files
- Name the x- and y-axis of the figures and give
them a title
- Do not work with var0001 (name and label
variables)
- Scale figures so that numbers are readable
4Manipulating figures
- If you want to copy SPSS-figures into your
document, it is sensible to increase the font
sizes (otherwise they'll be too difficult to
read). Also, you might want to change the title
or legend, ...
- Double click on any figure
5Measures of central tendency
- Interval data, roughly normally distributed data
(less appropriate for skewed distributions) →
mean (although mode and median should give the
same results!)
- Interval data, strongly skewed → mode, median
- Categorical data (different versions, ...) → mode
6Sentence lengths
It is very likely that most sentences do not
exceed 20 to 30 words, but there will be a few
sentences that are very long
Sentence length
N.B. It is likely that the distribution of sentence
lengths in Th. Mann is skewed to the left
7Preference for 3 resynthesised versions
- Suppose this were the outcome:
version   subjects
a         20
b         37
c         18
- Coding: a = 1, b = 2, c = 3
- → mean = 1.97
- The mode is more meaningful! If you report a
mean, readers might think there is a normal
distribution
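The contrast between the two summaries can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the counts and the coding are those given on this slide.

```python
from statistics import mean, mode

# Number of subjects preferring each version (from the slide)
counts = {"a": 20, "b": 37, "c": 18}
# Arbitrary numeric coding of the three categories
coding = {"a": 1, "b": 2, "c": 3}

# Expand the counts to one entry per subject
choices = [version for version, n in counts.items() for _ in range(n)]
coded = [coding[version] for version in choices]

coded_mean = mean(coded)    # depends entirely on the arbitrary coding
preferred = mode(choices)   # the most frequently chosen version

print(f"mean of coded values: {coded_mean:.2f}")
print(f"mode: {preferred}")
```

Recoding the versions (say a = 3, b = 1, c = 2) would change the mean but not the mode, which is why the mode is the safer summary for categorical data.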
8Merging datasets
- Year 90 year 00
- 30
- 54
- 45
- 67
- 54
- 60
- 45
-
Year   Results
1990   21.00
1990   64.00
1990   48.00
1990   64.00
2000   58.00
2000   33.00
2000    8.00
2000   55.00
2000   47.00
2000   61.00
This is how to organise observations from the
same person in different years
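One advantage of the layout with one row per observation is that grouping by year becomes straightforward. The following is a rough sketch in plain Python rather than SPSS; the values are the ones listed on this slide, and the names `rows`, `by_year` and `year_means` are illustrative.

```python
from collections import defaultdict
from statistics import mean

# One row per observation: (year, result), values as listed on the slide
rows = [
    (1990, 21.0), (1990, 64.0), (1990, 48.0), (1990, 64.0),
    (2000, 58.0), (2000, 33.0), (2000, 8.0),
    (2000, 55.0), (2000, 47.0), (2000, 61.0),
]

# Collect the results belonging to each year
by_year = defaultdict(list)
for year, result in rows:
    by_year[year].append(result)

# Summarise each year separately
year_means = {year: mean(values) for year, values in by_year.items()}
for year, m in sorted(year_means.items()):
    print(year, round(m, 2))
```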
9Describe this distribution
10Normal distribution (Gaussian distribution)
- Example: IQ scores, mean = 100, sd = 16
Mean = Median = Mode
11z-scores
- z-score: deviation of a given score from the mean
in terms of standard deviations,
z = (score - mean) / sd
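As a small illustration, the definition can be written as a one-line function. This is only a sketch: the function name `z_score` and the score of 132 are made up for the example, while the mean of 100 and sd of 16 are the IQ values used earlier.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance of x from the mean, in units of standard deviations."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (x - mean) / sd

# Hypothetical IQ score of 132 on a scale with mean 100 and sd 16
iq_z = z_score(132, mean=100, sd=16)
print(iq_z)  # 2.0, i.e. two standard deviations above the mean
```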
12How likely is a given event?
- Example: time to utter a particular sentence,
mean x̄ = 3.45 s and sd = 0.84 s
- Questions:
- What proportion of the population of utterance
times will fall below 3 s?
- What proportion would lie between 3 s and 4 s?
- What is the time value below which we will find
1% of the data?
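These three questions can be answered with Python's standard library instead of a z-table. The sketch below assumes that utterance times are normally distributed with the mean and sd given above; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumption: utterance times follow a normal distribution
times = NormalDist(mu=3.45, sigma=0.84)

# Proportion of utterance times below 3 s
p_below_3 = times.cdf(3.0)

# Proportion between 3 s and 4 s
p_between_3_and_4 = times.cdf(4.0) - times.cdf(3.0)

# Time value below which 1% of the data lie
first_percentile = times.inv_cdf(0.01)

print(f"P(time < 3 s)       = {p_below_3:.3f}")
print(f"P(3 s < time < 4 s) = {p_between_3_and_4:.3f}")
print(f"1st percentile      = {first_percentile:.2f} s")
```

Under these assumptions, roughly 30% of utterances fall below 3 s, roughly 45% fall between 3 s and 4 s, and the 1% cut-off lies at about 1.5 s.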
13Sample mean and sd as parameter estimators
- Mean and standard deviation of the population are
unknown
- But we can use the sample mean and sd as
estimators for the parameters of the unknown
population
14Sample mean and sd as estimators
Degrees of freedom: the number of scores that
contain new information; dividing by them (n - 1
rather than n) gives a better estimator for the
parameter
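The effect of dividing by the degrees of freedom can be seen on a small sample. This is a sketch with made-up utterance times; `stdev` and `pstdev` from Python's `statistics` module divide by n - 1 and n respectively.

```python
from statistics import mean, pstdev, stdev

# Hypothetical sample of utterance times in seconds
sample = [3.1, 2.8, 4.0, 3.6, 3.3]

n = len(sample)
m = mean(sample)
sum_of_squares = sum((x - m) ** 2 for x in sample)

sd_n = (sum_of_squares / n) ** 0.5               # divides by n
sd_n_minus_1 = (sum_of_squares / (n - 1)) ** 0.5  # divides by degrees of freedom

print(f"dividing by n:     {sd_n:.3f} (pstdev gives {pstdev(sample):.3f})")
print(f"dividing by n - 1: {sd_n_minus_1:.3f} (stdev gives {stdev(sample):.3f})")
```

The n - 1 version is always slightly larger, which compensates for the fact that a sample tends to underestimate the spread of the population.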
15From sample statistics to population parameters
- We only know the statistics of our sample
- Sample statistics will differ from population
parameters
- Knowledge about the sampling distribution of the
statistic (i.e. how it would behave if a large
number of samples were taken) would tell us how
well the statistic estimates the parameter
(degree of confidence)
16Sampling distribution
- Population (mean = 4.9, sd = 3.1)
- 100 samples with n = 50; 3 examples
Taken from www.fw.umn.edu/FW5601/ALAB/Lab5/LAB4_BA2.HTM
17Sampling distribution
- Relative frequency of 100 means
- mean of the sample means = 4.9
- sd of the sample means = 0.46
- Note:
- Shape of sampling distribution roughly normal
- Mean of sampling distribution is population mean
- sd of the sampling distribution is smaller than
the population sd
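The three observations can be reproduced with a small simulation. This is a sketch under stated assumptions: the population is modelled as a skewed gamma distribution with mean 4.9 and sd 3.1 (the real population's shape is not given), and a sample size of 50 is assumed.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

POP_MEAN, POP_SD = 4.9, 3.1
# Assumption: a gamma population; shape * scale = mean, shape * scale**2 = variance
scale = POP_SD ** 2 / POP_MEAN
shape = POP_MEAN / scale

def draw_sample(n):
    """Draw n independent observations from the assumed population."""
    return [random.gammavariate(shape, scale) for _ in range(n)]

N_SAMPLES, N = 100, 50  # N = 50 is an assumption
sample_means = [mean(draw_sample(N)) for _ in range(N_SAMPLES)]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

print(f"mean of the {N_SAMPLES} sample means: {mean_of_means:.2f}")
print(f"sd of the {N_SAMPLES} sample means:   {sd_of_means:.2f}")
```

The mean of the sample means should land close to 4.9, and their sd should be far smaller than 3.1, even though the assumed population itself is skewed.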
18Central limit theorem
n30
Terminology: the standard deviation of the sampling
distribution of the means is called the standard
error of the mean (SE)
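In practice only one sample is available, so the SE is estimated from it as the sample sd divided by the square root of n. A minimal sketch, with a made-up sample and an illustrative function name:

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Estimated standard error of the mean: sample sd / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

# Hypothetical measurements
sample = [4.1, 5.6, 2.3, 7.8, 4.4, 6.0, 3.9, 5.2]
se = standard_error(sample)

print(f"sd = {stdev(sample):.2f}, SE = {se:.2f}")
```

Because of the division by the square root of n, quadrupling the sample size halves the standard error.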
19Experimental research
- Often, we are interested in whether human
behaviour is dependent on certain factors. E.g.
- Is the speech rate dependent on the dialectal
region?
- Do foreigners and native speakers produce
sentences with the same number of words?
20Dependent and independent variables
- Independent variable
- Variable(s) manipulated by the experimenter
- Experimenter determines the values it will assume
- Independent variables may have a number of
different levels
- Dependent variable
- Measure of behaviour (not manipulated or
controlled by experimenter)
21Examples
- What are the dependent and independent variables
in the following questions?
- Is the speech rate dependent on the dialectal
region of the speakers?
- Do foreigners and native speakers produce
sentences with the same number of words?
- Is the articulatory precision dependent on the
part-of-speech?
- Do different word orders influence the
grammaticality judgements of subjects?
22Null-hypothesis H0
- Generally phrased to negate the possibility of a
relationship between the independent and
dependent variables
- If the null-hypothesis is true, there is no
relationship between dependent and independent
variables
- Alternative hypothesis contradicts
null-hypothesis
23Statistical tests of significance
- Allow us to evaluate the probability that the
observed sample values would occur if the null
hypothesis were true
- If that probability is sufficiently low, the null
hypothesis can be rejected
- In other words: provide evidence for concluding
(with a specified risk of error) whether or not
there are real differences between conditions in
the population
24p-value
- Probability that a value of the statistic at
least as extreme as the one observed would occur
if the null hypothesis were true
- In other words: how unusual is the observed test
statistic compared to what H0 predicts?
- The smaller p, the more unusual the observed data
if H0 were true
- (e.g. p = 0.45: very usual, compared to
p = 0.001)
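To make the idea concrete, here is a sketch of how a p-value is obtained from a test statistic. It assumes, purely for illustration, that the statistic is a z value that follows a standard normal distribution when H0 is true; the function name and the two z values are made up.

```python
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """Probability, under H0, of a z statistic at least as extreme as z
    in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_usual = two_sided_p(0.76)    # a statistic close to what H0 predicts
p_unusual = two_sided_p(3.29)  # a statistic far from what H0 predicts

print(f"z = 0.76 -> p = {p_usual:.2f}")
print(f"z = 3.29 -> p = {p_unusual:.3f}")
```

The first value gives a p of roughly 0.45 and the second a p of roughly 0.001, matching the two cases contrasted above.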
25Type I error
- Type I error
- Rejection of a true null hypothesis
- That is, in reality, there is no relationship
between independent and dependent variable but
you conclude there is
- Probability of type I error is called α
- α is usually determined before you run an
experiment (often set at 5% or 1%)
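What α means in the long run can be illustrated by simulating many experiments in which H0 is true by construction. The sketch below is one possible setup, not a prescribed procedure: it assumes a two-sided z-test with known population sd, and the sample size, number of experiments and names are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(2)  # fixed seed so the sketch is repeatable

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96

def experiment_rejects_h0(n=30, mu0=0.0, sigma=1.0):
    """Draw a sample for which H0 (population mean = mu0) is true
    and report whether a two-sided z-test rejects it."""
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    return abs(z) > Z_CRIT

N_EXPERIMENTS = 2000
false_alarms = sum(experiment_rejects_h0() for _ in range(N_EXPERIMENTS))
type_i_rate = false_alarms / N_EXPERIMENTS

print(f"H0 rejected in {type_i_rate:.1%} of experiments where it was true")
```

The proportion of false rejections should come out close to the chosen α of 5%.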
26Type II error
- Type II error
- Failure to reject a false null hypothesis
- That is, in reality, there is a relation between
the independent variable and the dependent one(s)
but you conclude there is none
- Probability of type II error is called β
- In contrast to α, β cannot be precisely
controlled
27Reducing the Type II error
- β can be reduced by
- Using an α-level of .05 (instead of a more
stringent one)
- Using as many subjects as can be reasonably
obtained
- Selecting the levels of the independent variable
so as to maximise the size of the effect
- Reducing variability (e.g. controlling more
variables)
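The influence of these choices on β can be computed directly for a simple case. The sketch below assumes a two-sided one-sample z-test with known population sd; the function name `beta` and the effect size of 0.5 sd are illustrative assumptions, not values from the course.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def beta(effect_size: float, n: int, alpha: float) -> float:
    """Type II error rate of a two-sided one-sample z-test.

    effect_size is the true difference from H0 in units of the
    population sd, so reducing variability increases it.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)  # expected value of z when H0 is false
    power = (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)
    return 1 - power

# Hypothetical effect of half a standard deviation
for n in (10, 30, 100):
    b05 = beta(0.5, n, alpha=0.05)
    b01 = beta(0.5, n, alpha=0.01)
    print(f"n = {n:3d}: beta = {b05:.3f} at alpha .05, {b01:.3f} at alpha .01")
```

In this setup β shrinks as n grows, as the effect gets larger relative to the variability, and as α is made less stringent.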
28Organise SPSS tables
- Every independent variable and every dependent
variable has its own column
- Independent variables are often found before
dependent variables
- It is wise to compare the distributions of the
conditions before statistical tests of
significance (histograms, boxplots)
- Either select the condition you are interested in
- Or split the output according to the different
levels
- You can also compare boxplots for the different
conditions
29Data exploration
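- Error bars show the 95% confidence interval for
the mean (i.e. the mean and the range that, across
repeated samples, would contain the population
mean 95% of the time; this is not the range in
which 95% of the data fall)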
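The interval behind such an error bar can be approximated by hand as the mean plus or minus about two standard errors. The sketch below uses a normal approximation, which is an assumption that suits larger samples; for small samples, intervals based on the t distribution are somewhat wider. The speech-rate values and the function name are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return m - z * se, m + z * se

# Hypothetical speech rates in syllables per second
rates = [5.1, 4.8, 5.6, 6.0, 4.9, 5.3, 5.8, 5.0, 5.4, 5.2,
         4.7, 5.5, 5.9, 5.1, 5.3, 4.6, 5.7, 5.2, 5.0, 5.4]

low, high = mean_ci(rates)
print(f"mean = {mean(rates):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The interval is much narrower than the spread of the individual values, because it describes the uncertainty about the mean rather than the range of the data.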
30Data exploration
- One independent variable
- Error bar (simple, groups of variables)
- Two independent variables
- Error bar (clustered, groups of variables)