Statistics for Linguistics Students

Transcript and Presenter's Notes

Title: Statistics for Linguistics Students


1
Statistics for Linguistics Students
  • Michaelmas 2004
  • Week 3
  • Bettina Braun
  • www.phon.ox.ac.uk/bettina

2
Overview
  • Discussion of last assignment
  • Z-scores
  • Sampling distributions
  • Confidence intervals
  • Hypothesis testing
  • Type I and Type II errors

3
General comments
  • Please let every file you submit contain your
    initials and the week the assignment was given!
  • Please put your name somewhere on the page
  • Paste figures into the doc-file (or rtf-file) and
    only submit the .sav-files
  • Name the x- and y-axis of the figures and give
    them a title
  • Do not work with var0001 (name and label
    variables)
  • Scale figures so that numbers are readable

4
Manipulating figures
  • If you want to copy SPSS-figures into your
    document, it is sensible to increase the font
    sizes (otherwise they'll be too difficult to
    read). Also, you might want to change the title
    or legend, ...
  • Double click on any figure

5
Measures of central tendency
  • Interval data, roughly normally distributed
    (less appropriate for skewed distributions) →
    mean (although mode and median should give the
    same results!)
  • Interval data, strongly skewed → mode, median
  • Categorical data (e.g. different versions) → mode

6
Sentence lengths
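It is very likely that most sentences do not
exceed 20 to 30 words, but there will be a few
sentences that are very long.
[Figure: distribution of sentence length]
N.B. The distribution of sentence lengths in
Th. Mann is likely to be positively skewed (the
bulk of the values on the left, a long tail of
very long sentences on the right).
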
7
Preference for 3 resynthesised versions
  • Suppose this were the outcome:
      version   subjects
      a         20
      b         37
      c         18
  • Coding: a = 1, b = 2, c = 3
  • → mean = 1.97
  • The mode is more meaningful! If you report a
    mean, readers might think the data are normally
    distributed
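
The arithmetic behind the mean of 1.97 can be checked with a short sketch. It uses only the counts and the coding given on the slide; the variable names are illustrative and not part of the original material.

```python
from statistics import mean, mode

# Counts from the slide: number of subjects preferring each version.
counts = {"a": 20, "b": 37, "c": 18}
# Numeric coding from the slide (an arbitrary choice of labels).
coding = {"a": 1, "b": 2, "c": 3}

# One entry per subject: 20 x "a", 37 x "b", 18 x "c".
responses = [version for version, n in counts.items() for _ in range(n)]

# Mean of the coded values: it changes if the versions are coded differently.
coded_mean = mean(coding[r] for r in responses)
# Mode: the most frequently chosen version, independent of the coding.
modal_version = mode(responses)

print(round(coded_mean, 2), modal_version)
```

Recoding the versions (say a = 3, b = 1, c = 2) would change the mean but leave the mode untouched, which is why the mode is the safer summary for categorical data.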

8
Merging datasets
  • Year 90 year 00
  • 30
  • 54
  • 45
  • 67
  • 54
  • 60
  • 45
  • Year   results
    1990   21,00
    1990   64,00
    1990   48,00
    1990   64,00
    2000   58,00
    2000   33,00
    2000    8,00
    2000   55,00
    2000   47,00
    2000   61,00

This is how to organise observations from the
same person in different years
9
Describe this distribution
10
Normal distribution (Gaussian distribution)
  • Example: IQ scores, mean = 100, sd = 16

In a normal distribution, mean, median and mode
coincide.
11
z-scores
  • z-score: deviation of a given score from the
    mean, expressed in standard deviations
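
As a minimal sketch, the definition can be written as z = (score - mean) / sd. The example below applies it to the IQ scale from the previous slide (mean 100, sd 16); the two scores 132 and 84 are made up for illustration.

```python
def z_score(score, mean, sd):
    """Distance of a score from the mean, in units of standard deviations."""
    return (score - mean) / sd


IQ_MEAN, IQ_SD = 100, 16  # values from the IQ example

z_high = z_score(132, IQ_MEAN, IQ_SD)  # hypothetical score, 2 sd above the mean
z_low = z_score(84, IQ_MEAN, IQ_SD)    # hypothetical score, 1 sd below the mean

print(z_high, z_low)
```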

12
How likely is a given event?
  • Example: time to utter a particular sentence,
    mean x̄ = 3.45 s and sd = 0.84 s
  • Questions
  • What proportion of the population of utterance
    times will fall below 3 s?
  • What proportion would lie between 3 s and 4 s?
  • What is the time value below which we will find
    1% of the data?
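
These questions can be answered with the normal distribution. The sketch below assumes that utterance times are normally distributed with mean 3.45 s and sd 0.84 s, and that the last question asks about the lowest 1% of the data. It uses `statistics.NormalDist` from the Python standard library (3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Assumption: utterance times follow a normal distribution.
utterance_time = NormalDist(mu=3.45, sigma=0.84)

# Proportion of utterance times below 3 s.
p_below_3 = utterance_time.cdf(3.0)

# Proportion between 3 s and 4 s: difference of two cumulative proportions.
p_between_3_and_4 = utterance_time.cdf(4.0) - utterance_time.cdf(3.0)

# Time value below which the lowest 1% of utterance times fall.
t_lowest_1_percent = utterance_time.inv_cdf(0.01)

print(round(p_below_3, 3), round(p_between_3_and_4, 3), round(t_lowest_1_percent, 2))
```

The same answers can be read from a z-table after converting 3 s and 4 s to z-scores.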

13
Sample mean and sd as parameter estimators
  • Mean and standard deviation of the population are
    unknown
  • But we can use the sample mean and sd as
    estimators for the parameters of the unknown
    population

14
Sample mean and sd as estimators
Degrees of freedom: the number of scores that
contain new information (n − 1 for the sd, because
the mean is already fixed) → better estimator for
the parameter
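
A small sketch of the difference between dividing by n and dividing by the degrees of freedom (n − 1). The sample values are hypothetical; `statistics.pstdev` divides by n and `statistics.stdev` divides by n − 1.

```python
from statistics import mean, pstdev, stdev

# Hypothetical sample of 8 scores (not from the slides).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mean = mean(sample)
sd_divide_by_n = pstdev(sample)         # treats the sample as the whole population
sd_divide_by_n_minus_1 = stdev(sample)  # uses n - 1 degrees of freedom

print(sample_mean, sd_divide_by_n, round(sd_divide_by_n_minus_1, 3))
```

The n − 1 version is always slightly larger, which compensates for the fact that a sample tends to underestimate the spread of the population it was drawn from.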
15
From sample statistics to population parameters
  • We only know the statistics of our sample
  • Sample statistics will differ from population
    parameters
  • Knowledge about the sampling distribution of the
    statistic (i.e. how it would behave if a large
    number of samples were taken) tells us how well
    the statistic estimates the parameter (degree of
    confidence)

16
Sampling distribution
  • Population (mean = 4.9, sd = 3.1)
  • 100 samples with n = 50 (3 examples shown)

Taken from
www.fw.umn.edu/FW5601/ALAB/Lab5/LAB4_BA2.HTM
17
Sampling distribution
  • Relative frequency of 100 sample means
  • Mean of the sample means: 4.9
  • sd of the sample means: 0.46
  • Note
  • Shape of sampling distribution roughly normal
  • Mean of sampling distribution is (approximately)
    the population mean
  • sd of the sampling distribution is smaller than
    the population sd
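
The behaviour described above can be reproduced with a small simulation. This is only a sketch: it assumes a normally distributed population (the shape of the original population is not given here) and reads the sample size as n = 50; the seed and variable names are arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so that the run is repeatable

POP_MEAN, POP_SD = 4.9, 3.1       # population parameters from the slide
N_SAMPLES, SAMPLE_SIZE = 100, 50  # assumption: 100 samples of n = 50

# Draw 100 samples and keep only the mean of each one.
sample_means = [
    mean(random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

mean_of_means = mean(sample_means)  # should be close to the population mean
sd_of_means = stdev(sample_means)   # should be much smaller than the population sd

print(round(mean_of_means, 2), round(sd_of_means, 2))
```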

18
Central limit theorem
n of about 30 or more: the sampling distribution
of the mean is approximately normal
Terminology: the standard deviation of the sampling
distribution of the means is called the standard
error of the mean (SE)
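
The standard error can also be computed directly as SE = sd / √n. The sketch below applies this to the population from the sampling-distribution slides (sd 3.1), assuming a sample size of n = 50; the function name is illustrative.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean for samples of size n."""
    return sd / math.sqrt(n)


se = standard_error(3.1, 50)  # assumption: n = 50
print(round(se, 2))
```

The result is close to the sd of 0.46 observed for the 100 sample means, and it shows why larger samples give more precise estimates of the mean.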
19
Experimental research
  • Often, we are interested in whether human
    behaviour depends on certain factors. E.g.
  • Is the speech rate dependent on the dialectal
    region?
  • Do foreigners and native speakers produce
    sentences with the same number of words?

20
Dependent and independent variables
  • Independent variable
  • Variable(s) manipulated by the experimenter
  • experimenter determines the values it will assume
  • Independent variables may have a number of
    different levels
  • Dependent variable
  • Measure of behaviour (not manipulated or
    controlled by experimenter)

21
Examples
  • What are the dependent and independent variables
    in the following questions?
  • Is the speech rate dependent on the dialectal
    region of the speakers?
  • Do foreigners and native speakers produce
    sentences with the same number of words?
  • Is the articulatory precision dependent on the
    part-of-speech?
  • Do different word orders influence the
    grammaticality judgements of subjects?

22
Null-hypothesis H0
  • Generally phrased to negate the possibility of a
    relationship between the independent and
    dependent variables
  • If the null-hypothesis is true, there is no
    relationship between dependent and independent
    variables
  • The alternative hypothesis contradicts the
    null-hypothesis

23
Statistical tests of significance
  • Allow us to evaluate the probability that the
    observed sample values would occur if the null
    hypothesis were true
  • If that probability is sufficiently low, the null
    hypothesis can be rejected
  • In other words, they provide evidence for
    concluding (with a specified risk of error) that
    there are or are no real differences between
    conditions in the population

24
p-value
  • Probability that values of the statistic at least
    as extreme as the one observed would occur if the
    null hypothesis were true
  • In other words: how unusual is the observed test
    statistic compared to what H0 predicts?
  • The smaller p, the more unusual the observed data
    if H0 were true
  • (e.g. p = 0.45 is very usual, compared to
    p = 0.001)
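
One way to make the p-value concrete is an exact permutation test, used here only for illustration (it is not a test named on these slides). The data are hypothetical sentence lengths for two groups of speakers. Under H0 the group labels are interchangeable, so the sketch counts how many of all possible regroupings give a difference in means at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean

# Hypothetical data: words per sentence for two groups of four speakers.
native = [12, 14, 15, 17]
foreign = [8, 9, 10, 11]

pooled = native + foreign
observed = abs(mean(native) - mean(foreign))

extreme = 0  # regroupings at least as extreme as the observed grouping
total = 0    # all possible regroupings into two groups of four
for idx in combinations(range(len(pooled)), len(native)):
    group_a = [pooled[i] for i in idx]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(mean(group_a) - mean(group_b)) >= observed:
        extreme += 1

# Two-sided p-value: share of regroupings as extreme as the data.
p_value = extreme / total
print(extreme, total, round(p_value, 3))
```

Only 2 of the 70 regroupings are as extreme as the observed data, so the p-value is small: such a difference would be unusual if H0 were true.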

25
Type I error
  • Type I error
  • Rejection of a true null hypothesis
  • That is, in reality, there is no relationship
    between independent and dependent variable but
    you conclude there is
  • Probability of type I error is called α
  • α is usually determined before you run an
    experiment (often set at 5% or 1%)

26
Type II error
  • Type II error
  • Failure to reject a false null hypothesis
  • That is, in reality, there is a relation between
    the independent variable and the dependent one(s)
    but you conclude there is none
  • Probability of type II error is called β
  • In contrast to α, β cannot be precisely
    controlled

27
Reducing the Type II error
  • β can be reduced by
  • Using an α-level of .05 (instead of a more
    stringent one)
  • Using as many subjects as can be reasonably
    obtained
  • Selecting the levels of the independent variable
    so as to maximise the size of the effect
  • Reducing variability (e.g. controlling more
    variables)
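
The effect of these choices on the probability of a Type II error can be sketched numerically. The example assumes a two-sided one-sample z-test with known sd, which is a simplification chosen for illustration; the effect size is expressed in sd units, and the function and parameter names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def type_ii_error(effect_size, n, alpha=0.05):
    """Probability of a Type II error for a two-sided one-sample z-test.

    effect_size is the true difference from the H0 value in sd units.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value for alpha
    shift = effect_size * n ** 0.5              # mean of the test statistic under H1
    # Power: probability that the test statistic falls in either rejection region.
    power = STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)
    return 1 - power


baseline = type_ii_error(0.5, 10)                # alpha = .05, 10 subjects
stricter_alpha = type_ii_error(0.5, 10, alpha=0.01)
more_subjects = type_ii_error(0.5, 40)
larger_effect = type_ii_error(0.8, 10)           # bigger effect or less variability

print(round(baseline, 2), round(stricter_alpha, 2),
      round(more_subjects, 2), round(larger_effect, 2))
```

A stricter α raises the Type II error probability, while more subjects or a larger standardised effect lower it. Reducing variability has the same effect as enlarging the effect, because the effect size is measured in sd units.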

28
Organise SPSS tables
  • Every independent variable and every dependent
    variable has its own column
  • Independent variables are often found before
    dependent variables
  • It is wise to compare the distributions of the
    conditions before statistical tests of
    significance (histograms, boxplots)
  • Either select the condition you are interested in
  • Or split the output according to the different
    levels
  • You can also compare boxplots for the different
    conditions

29
Data exploration
  • Error bars show the 95% confidence interval for
    the mean (i.e. the mean and the range that would
    contain the population mean in 95% of repeated
    samples; this is not the range in which 95% of
    the data fall)

30
Data exploration
  • Error bars show the 95% confidence interval for
    the mean (i.e. the mean and the range that would
    contain the population mean in 95% of repeated
    samples; this is not the range in which 95% of
    the data fall)
  • One independent variable
  • Error bar (simple, groups of variables)
  • Two independent variables
  • Error bar (clustered, groups of variables)
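
A minimal sketch of how a 95% confidence interval for the mean can be computed by hand. It uses the normal approximation (mean ± about 1.96 standard errors); statistics packages typically use the t distribution instead, which gives somewhat wider intervals for small samples. The sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample of 8 measurements (not from the slides).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mean = mean(sample)
se = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

# Normal approximation: about 1.96 standard errors on each side of the mean.
z = NormalDist().inv_cdf(0.975)
half_width = z * se
ci_lower, ci_upper = sample_mean - half_width, sample_mean + half_width

print(round(ci_lower, 2), round(ci_upper, 2))
```

The interval describes the uncertainty about the mean, not the spread of the individual data points: here several of the eight values lie outside it.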