STAT131 Numerical Summaries - PowerPoint PPT Presentation

Slides: 45
Provided by: microsofts2

Transcript and Presenter's Notes

1
STAT131 Numerical Summaries
  • W2L2

Anne Porter
2
Describing Data
  • Task
  • Describe to the person sitting next to you the
    number of mobile phone calls that you receive
    each day.

3
Describing Data
  • Describe to the person sitting next to you the
    number of mobile phone calls that you receive
    each day.
  • On average 80 per day
  • At most 5 per day
  • It varies from 10 to 50 per day

4
Two summaries of data
  • The summaries most often used indicate the
  • Centre (often called location), and
  • Spread
  • of sample data.

5
Centre
  • Mean
  • Median
  • Mode
  • Trimmed Mean

6
Centre Mean
Mean of 8, 7, 9, 4 is (8 + 7 + 9 + 4)/4 = 28/4 = 7
In mathematical language: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
7
Centre Sample median
It is often called the middle value.
  • Another measure of the centre of a sample is the
    sample median, m (say). It satisfies
  • the number of sample values ≤ m is equal to the
    number of sample values ≥ m.

To find the median, FIRST arrange the sample
values from smallest to largest.
8
Example Sample median
  • n odd: median of 8, 7, 9
  • Median(7, 8, 9) = 8
  • n even: median of 8, 7, 9, 4
  • Median(4, 7, 8, 9) = (7 + 8)/2 = 7.5
9
Example mean, median
  • Data A: 60, 2, 3, 5    Data B: 6, 2, 3, 5
  • Calculate the mean and median for both data sets

 A    B
60    6
 2    2
 3    3
 5    5
10
Question mean vs median
  • In what sense are the mean and median the same?
  • In what sense are the mean and median different?

They are both measures of the centre.
They may give different numerical values, and for
different data sets one may be better as a
measure than the other, or both may be required.
11
Question mean vs median
  • Data A: 60, 2, 3, 5    Data B: 6, 2, 3, 5
  • Which measure best typifies data set A? Why?
  • Which measure best typifies data set B? Why?

For A, the outlier 60 suggests the median (4), as
the mean (17.5) is dragged up by the outlier.
For B both are the same: the median (4) uses only
the 2 middle points, while the mean (4) uses all the data.
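
The comparison above can be checked with a short sketch. The helper name `sample_median` is an illustrative choice, not something from the lecture; it follows the slide's recipe of sorting first and then taking the middle value (or the average of the two middle values when n is even).

```python
from statistics import mean, median

data_a = [60, 2, 3, 5]
data_b = [6, 2, 3, 5]

def sample_median(values):
    """Sort first, then take the middle value (average the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

for name, data in (("A", data_a), ("B", data_b)):
    print(name, "mean =", mean(data), "median =", sample_median(data))
```

Only the mean of A moves when 60 replaces 6; the median is 4 for both data sets.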
12
Why do we use both the sample mean and sample
median as measures of the centre of a sample?
  • The mean uses all the information in the sample,
    because each value is added in the sum. This
    makes it subject to error if spurious values are
    entered.
  • In general, the median is less affected by wild
    values than the mean. We say it is more robust
    than the mean.
  • However, the median does not use very much
    information from the sample.
  • The context of what the data are used for may
    also determine what is an appropriate measure

13
Centre Mode
  • Most common value in the data set
  • Data 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 9
  • For this data set the mode is

3
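
As a quick check of the mode for this data set, one possible sketch counts how often each value occurs with `collections.Counter` from the Python standard library and picks the most frequent one. The variable names are illustrative.

```python
from collections import Counter

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 9]

counts = Counter(data)                               # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]    # most frequent value and its count

print("mode =", mode_value, "occurring", mode_count, "times")
```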
14
Centre Trimmed mean
  • What might be an example of this?

In Olympic diving, the score is based on the average of the
judges' scores after tossing out the
highest and the lowest scores.
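
A minimal sketch of a trimmed mean, assuming we drop the `k` smallest and `k` largest values before averaging. The function name, the parameter `k`, and the judges' scores are all hypothetical and only illustrate the idea; they are not taken from the lecture or from any official scoring rule.

```python
def trimmed_mean(values, k=1):
    """Average the values left after discarding the k smallest and k largest."""
    ordered = sorted(values)
    if len(ordered) <= 2 * k:
        raise ValueError("not enough values left after trimming")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical judges' scores with one unusually low score.
scores = [7.5, 8.0, 8.0, 8.5, 9.5, 3.0, 8.0]

print(trimmed_mean(scores))             # highest and lowest score tossed out
print(trimmed_mean([60, 2, 3, 5]))      # Data A from the earlier slides, outlier removed
```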
15
Context
  • The choice of measure for the centre may also
    depend upon the context of the problem.
  • See examples from Lab work.

16
Measures of Spread
  • Range = maximum value - minimum value
  • Interquartile range = Q3 - Q1
  • Sums of Squares
  • Variance
  • Standard Deviation

17
Criteria for a good measure of spread
  • Whatever measure of variability (or spread) is used, it
    should not be affected by adding a
    constant to each value so as to change the centre
    (or location)
  • If there is spread in the data it should indicate
    this

18
Example Spread
  • Data B: 6, 2, 3, 5    Data C: 8, 4, 5, 7

How was data set C obtained? By adding 2 to each value in B.
Is the mean the same for both data sets? No, 4 and 6.
Is the spread the same for both data sets? Yes! - explain
19
Spread Range
  • Range = maximum value - minimum value
  • Data B: 6, 2, 3, 5    Data C: 8, 4, 5, 7
  • Range B = 6 - 2 = 4    Range C = 8 - 4 = 4

When is the range going to be a poor measure of
spread? Why?
20
Undesirable features of the range
  • Sensitive to outliers
  • Insensitive to the bulk of the data
  • -as it is based only on two scores
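
A small sketch of both points about the range, using the data sets from the slides: shifting every value by a constant leaves the range unchanged, while a single outlier dominates it. The helper name `sample_range` is an illustrative choice.

```python
def sample_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

data_a = [60, 2, 3, 5]
data_b = [6, 2, 3, 5]
data_c = [x + 2 for x in data_b]   # Data C: every value of B shifted by 2

print(sample_range(data_b), sample_range(data_c))   # unchanged by the shift
print(sample_range(data_a))                         # driven almost entirely by the outlier 60
```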

21
Spread Interquartile range
  • IQR = upper quartile - lower quartile
  • The upper quartile (3rd quartile, Q3) has
    one-quarter of the observations above it, and
    three-quarters below.
  • The lower quartile (1st quartile, Q1) has
    one-quarter of the observations below it, and
    three-quarters above.
  • Hence the IQR gives the spread of the middle 50%
    of the sample.

22
Quartiles
  • Q1 is the value of the (n + 3)/4th observation,
  • and Q3 is the value of the (3n + 1)/4th
    observation.
  • There are other systems!
  • Interpolate if necessary.
  • The interquartile range = Q3 - Q1
  • If we have 17 heights what observation
  • do we need to get the upper and lower quartile?
  • What observation will give the median?
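
The position rule above can be sketched as follows. The helper names are illustrative, the median position (n + 1)/2 is the usual convention rather than something stated on this slide, and linear interpolation between neighbouring observations is one reasonable reading of "interpolate if necessary". Other quartile systems give slightly different answers.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data, interpolating linearly if fractional."""
    lower = int(position)              # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(values):
    """Q1 at position (n + 3)/4 and Q3 at position (3n + 1)/4 of the sorted sample."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 3) / 4)
    q3 = value_at_position(ordered, (3 * n + 1) / 4)
    return q1, q3

n = 17
print("Q1 position:", (n + 3) / 4)          # 5th observation
print("median position:", (n + 1) / 2)      # 9th observation
print("Q3 position:", (3 * n + 1) / 4)      # 13th observation

q1, q3 = quartiles([6, 2, 3, 5])            # Data B needs interpolation
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```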

23
Quartiles
The upper quartile is? 166
The lower quartile is? 147
The interquartile range is? 166 - 147 = 19
24
Interquartile range
  • Is resistant (or robust) to the impact of
    outliers
  • But it does not use all the data in its calculation
  • How else might we measure spread, perhaps spread
    around the mean?

25
Spread Sum of Deviation Scores
In calculating the spread of a sample, we measure
how far each observation is from the sample
mean, i.e. $x_i - \bar{x}$.
Calculate the sum of the deviations for each
sample. Is this a good measure of spread?

Sample A            Sample B
x    x - mean       x    x - mean
6     2             8     2
2    -2             4    -2
3    -1             5    -1
5     1             7     1
Sum   0             Sum   0
26
Looking for another measure
  • We could perhaps find the sum of these
    differences, except that the sum (and average) is
    always zero. (The positive differences cancel
    out the negative ones.)
  • We prevent this cancellation by squaring
    each difference.
27
Spread Sum of Squares
  • The sum of the squared deviations is $\sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample A                      Sample B
x    x - mean   squared       x    x - mean   squared
6     2         4             8     2         4
2    -2         4             4    -2         4
3    -1         1             5    -1         1
5     1         1             7     1         1
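
A short sketch of the two calculations for the samples 6, 2, 3, 5 and 8, 4, 5, 7: the deviations from the mean always sum to zero, while the squared deviations do not cancel. The function names are illustrative.

```python
from statistics import mean

def deviations(values):
    """Distance of each observation from the sample mean."""
    centre = mean(values)
    return [x - centre for x in values]

def sum_of_squares(values):
    """Sum of the squared deviations from the sample mean."""
    return sum(d * d for d in deviations(values))

for sample in ([6, 2, 3, 5], [8, 4, 5, 7]):
    devs = deviations(sample)
    print(sample, "deviations:", devs,
          "sum:", sum(devs), "sum of squares:", sum_of_squares(sample))
```

Both samples have the same deviations, so they have the same sum of squares even though their means differ.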
28
Spread Sum of Squares
  • Is difficult to understand in this context as the
    answer is very big and gets bigger with every
    additional data point. It is used and useful in
    other contexts.
  • What might we do?

Average the SS by dividing by n or...
29
Spread Sample Variance
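  • The obvious thing to calculate would be
    $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • but, for reasons to be explained later, we use
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • i.e., we divide by (n - 1) instead of n. So $s^2 = 10/3$
    for the example samples. This measure still seems too big! So
    what might we do?

The sample variance is symbolised by $s^2$ (also written $S^2$).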
30
Spread Standard Deviation
  • We use the positive square root of the sample
    variance as our measure of spread.
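  • The square root is called the sample standard
    deviation, and is denoted by s, i.e.
    $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$,
    which for the example samples equals $\sqrt{10/3} \approx 1.83$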
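
A minimal sketch of the sample variance and standard deviation for the sample 6, 2, 3, 5, computed by hand with the (n - 1) divisor and compared against `variance` and `stdev` from Python's standard `statistics` module. The function name `sample_variance` is an illustrative choice.

```python
import math
from statistics import mean, variance, stdev

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    centre = mean(values)
    ss = sum((x - centre) ** 2 for x in values)
    return ss / (len(values) - 1)

data_b = [6, 2, 3, 5]

s2 = sample_variance(data_b)   # 10/3
s = math.sqrt(s2)              # positive square root of the variance

print("s^2 =", s2, "library:", variance(data_b))
print("s   =", s, "library:", stdev(data_b))
```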

31
Use of standard deviation
  • The mean and std deviation gives information
    about where most of the distribution of values is
    to be found.
  • For many distributions, the range
  • mean - 2 standard devs to mean + 2 standard
    devs
  • (mean ± 2SD)
  • contains approx 95% of the distribution.
  • (The very least that this spread can contain is
    75% of the distribution.)
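
The 75% floor matches what Chebyshev's inequality gives for k = 2 standard deviations. The slide does not name the inequality, so that attribution is an assumption. For any distribution with mean $\mu$ and finite standard deviation $\sigma$:

```latex
P\bigl(|X - \mu| < k\sigma\bigr) \;\ge\; 1 - \frac{1}{k^2},
\qquad\text{so for } k = 2:\quad 1 - \frac{1}{2^2} = \frac{3}{4} = 75\%.
```

The approximate 95% figure is the more specific value that applies to bell-shaped (roughly normal) distributions.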

32
Measuring the sample spread
  • The standard deviation uses information from all
    the values in the sample, so it is also affected
    by wild values.
  • That is, it is not robust (or resistant).
  • Our choice will also depend to some extent on
    what is chosen for the centre
  • Mean and standard deviation
  • Median and IQR

Conclusion
33
Percentiles
  • Sometimes we want another description of where
    the data may be found
  • The kth percentile is a number that has k percent
    of the scores at or below it and (100 - k) percent above
    it
  • The lower quartile has 25% of scores at or below
    that score

34
Using the calculator
  • Most calculators have an x̄ button and a σn-1
    button to calculate the mean and standard deviation of some
    numbers you type in. (σ is the lower-case Greek
    sigma.)
  • Some also have σn. Don't use this one!
  • To check that you are using the right std
    deviation, the std deviation of -2, 0 and 2
    should be 2 (exactly).
  • Ask in lecture breaks or tutes to see that you
    can use your statistics functions
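
The same check can be run in software. In this sketch, `stdev` from Python's `statistics` module divides by (n - 1) and plays the role of the σn-1 button, while `pstdev` divides by n and plays the role of σn. Mapping the calculator buttons onto these two functions is an assumption made for illustration.

```python
from statistics import stdev, pstdev

check = [-2, 0, 2]

print("divide by n - 1:", stdev(check))    # the one to use for a sample
print("divide by n:    ", pstdev(check))   # smaller, not the one wanted here
```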

35
Using the calculator cont.
  • If your calculator does not have a σn-1 button,
    you will have to calculate the standard deviation
    by hand. Find $s^2$ from
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • and then take the square root of $s^2$ to get s, the
    standard deviation.

36
Box-and-Whiskers plots
  • Often just called box plots, they give a
    pictorial summary of the data for a single
    variable.
  • They use the five-number summary
  • minimum value,
  • Q1,
  • median,
  • Q3,
  • maximum value
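
A possible sketch of the five-number summary, assuming the quartile positions (n + 3)/4 and (3n + 1)/4 with linear interpolation. The function names and the sample are hypothetical; the sample was chosen so that its summary comes out as 3, 6, 10, 12, 16.

```python
from statistics import median

def at_position(ordered, position):
    """Value at a 1-based position in sorted data, interpolating linearly if fractional."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sample."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = at_position(ordered, (n + 3) / 4)
    q3 = at_position(ordered, (3 * n + 1) / 4)
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Hypothetical sample of 9 values.
sample = [3, 5, 6, 8, 10, 11, 12, 14, 16]

print(five_number_summary(sample))
```

These five values are all that a basic box plot needs: the box spans Q1 to Q3, the inner line marks the median, and the whiskers reach the minimum and maximum.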

37
  • Example: If minimum = 3, Q1 = 6,
    median = 10, Q3 = 12, maximum = 16, the box plot
    would look like
  • You must draw a scale for the box plot.

[Figure: box plot drawn against a scale running from 2 to 16]
38
  • In a horizontal box plot, a horizontal axis shows
    the scale. The box's left and right boundaries
    are Q1 and Q3, and an inner line shows the
    median.
  • Whiskers are drawn outwards from the box to the
    minimum and maximum values.
  • Often the sample mean is also shown.

39
  • What values give rise to the box plot below?
  • If minimum = __, Q1 = __,
    median = __, Q3 = __, maximum = __,
  • the box plot would look like
  • You must draw a scale for the box plot.

[Figure: box plot drawn against a scale running from 2 to 16]
40
Box plots vs Stem and Leaf plots
  • Box plots are especially useful for comparing 2
    samples. They show the key points of a sample,
    but not the individual values.
  • Stem and leaf plots show individual values, and
    give a better picture of the shape of the spread,
    but their detail makes them unsuitable for
    comparing more than two groups (side by side or
    back to back).

41
What do you want to see in data?
  • Information
  • Meaning
  • We must turn data into information in order to
    have meaning

42
What can we see in data?
  • Location (centre)
  • Spread
  • Shape
  • Outliers
  • Unusual patterns
  • Gaps, clusters
  • How do batches differ

43
Tools for making meaning from data
  • Ordering data
  • Dot plots, jittered dot plots
  • Stem-and-leaf plots
  • Histograms, Boxplots, Bar charts
  • Pie charts
  • Frequency tables
  • Numerical summaries

44
Selecting the tool depends on
  • The question asked
  • How the variable is measured
  • The structure of the data