Title: STAT131 Numerical Summaries
1STAT131 Numerical Summaries
Anne Porter
2Describing Data
- Task
- Describe to the person sitting next to you the
number of mobile phone calls that you receive
each day.
3Describing Data
- Describe to the person sitting next to you the
number of mobile phone calls that you receive
each day.
- On average 80 per day
- At most 5 per day
- It varies from 10 to 50 per day
4Two summaries of data
- The summaries most often used indicate the
- Centre (often called location), and
- Spread
of sample data.
5Centre
- Mean
- Median
- Mode
- Trimmed Mean
6Centre Mean
Mean of 8, 7, 9, 4 is $(8 + 7 + 9 + 4)/4 = 28/4 = 7$
In mathematical language $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
7Centre Sample median
It is often called the middle value.
- Another measure of the centre of a sample is the
sample median, m (say).
- It satisfies: the number of sample values below m is equal to the
number of sample values above m.
To find the median, FIRST arrange the sample
values from smallest to largest.
8Example Sample median
- N odd: Median of 8, 7, 9
- Median (7, 8, 9) = 8
- N even: Median of 8, 7, 9, 4
- Median (4, 7, 8, 9) = (7 + 8)/2 = 7.5
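The sort-then-take-the-middle rule can be sketched in Python. This is a minimal illustration and not part of the original slides; the function name `sample_median` is an assumption made here.

```python
def sample_median(values):
    """Middle value of a sample; average of the two middle values when n is even."""
    ordered = sorted(values)      # FIRST arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # n odd: a single middle value exists
        return ordered[mid]
    # n even: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

median_odd = sample_median([8, 7, 9])       # middle of 7, 8, 9
median_even = sample_median([8, 7, 9, 4])   # (7 + 8) / 2
```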
9Example mean, median
- Data A: 60, 2, 3, 5    Data B: 6, 2, 3, 5
- Calculate the mean and median for both data sets
A     B
60    6
2     2
3     3
5     5
10Question mean vs median
- In what sense are the mean and median the same?
- In what sense are the mean and median different?
They are both measures of the centre.
They may give different numerical values. For some
data sets one may be a better measure than the
other, or both may be required.
11Question mean vs median
- Data A: 60, 2, 3, 5    Data B: 6, 2, 3, 5
- Which measure best typifies data set A? Why?
- Which measure best typifies data set B? Why?
For A, the outlier 60 suggests the median (4), as
the mean (17.5) is dragged up by the outlier.
For B, both are the same (4). The median uses only the
2 middle points, whereas the mean uses all the data.
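The values quoted for Data A and Data B can be checked with Python's standard `statistics` module. This is a small sketch added for illustration, not part of the original slides; the variable names are assumptions.

```python
from statistics import mean, median

data_a = [60, 2, 3, 5]   # contains the outlier 60
data_b = [6, 2, 3, 5]

mean_a, median_a = mean(data_a), median(data_a)   # mean is pulled up by 60
mean_b, median_b = mean(data_b), median(data_b)   # mean and median agree
```

Replacing 60 with 6 changes the mean a great deal but leaves the median untouched, which is the sense in which the median resists outliers.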
12Why do we use both the sample mean and sample
median as measures of the centre of a sample?
- The mean uses all the information in the sample,
because each value is added in the sum. This
makes it subject to error if spurious values are
entered.
- In general, the median is less affected by wild
values than the mean. We say it is more robust
than the mean.
- However, the median does not use very much
information from the sample.
- The context of what the data are used for may
also determine what is an appropriate measure.
13Centre Mode
- Most common value in the data set
- Data: 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 9
- For this data set the mode is 3
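One way to find the mode is to count how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library on the data set from this slide; it is an illustration only, not part of the original slides.

```python
from collections import Counter

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 9]

counts = Counter(data)                               # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]    # most frequent value and its count
```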
14Centre Trimmed mean
- What might be an example of this?
A diving score at the Olympics is the average of the
judges' scores after the highest and the lowest
scores have been tossed out.
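A trimmed mean of the kind described, with one highest and one lowest score discarded, might be sketched as follows. The function name `trimmed_mean` and the judges' scores are hypothetical, made up purely for illustration.

```python
def trimmed_mean(scores):
    """Average after discarding one highest and one lowest score."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    kept = ordered[1:-1]          # drop the lowest and the highest
    return sum(kept) / len(kept)

judges = [7.5, 8.0, 8.0, 8.5, 6.0, 9.5, 8.0]   # hypothetical scores
diving_result = trimmed_mean(judges)

# Data A from the earlier slide: trimming removes the outlier 60
trimmed_a = trimmed_mean([60, 2, 3, 5])
```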
15Context
- The choice of measure for the centre may also
depend upon the context of the problem.
- See examples from Lab work.
16Measures of Spread
- Range = maximum value - minimum value
- Interquartile range = Q3 - Q1
- Sums of Squares
- Variance
- Standard Deviation
17Criteria for a good measure of spread
- Whatever measure of variability (or spread) is used, it
should not be affected by adding a
constant to each value so as to change the centre
(or location).
- If there is spread in the data, the measure should indicate
this.
18Example Spread
- Data B: 6, 2, 3, 5    Data C: 8, 4, 5, 7
How was Data set C obtained? By adding 2 to each value in B.
Is the mean the same for both data sets? No, the means are 4 and 6.
Is the spread the same for both data sets? Yes! - explain
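The effect of adding a constant can be checked numerically. The sketch below, added for illustration, builds Data C from Data B and compares the centre with two measures of spread (the range and the sample standard deviation from Python's `statistics` module).

```python
from statistics import mean, stdev

data_b = [6, 2, 3, 5]
data_c = [x + 2 for x in data_b]   # add the constant 2 to every value

mean_b, mean_c = mean(data_b), mean(data_c)   # the centre moves by 2
range_b = max(data_b) - min(data_b)           # spread of B
range_c = max(data_c) - min(data_c)           # spread of C
sd_b, sd_c = stdev(data_b), stdev(data_c)     # standard deviations of B and C
```

The centre shifts but both spread measures are unchanged, because every value moves by the same amount.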
19Spread Range
- Range = maximum value - minimum value
- Data B: 6, 2, 3, 5    Data C: 8, 4, 5, 7
- Range B = 6 - 2 = 4    Range C = 8 - 4 = 4
When is the range going to be a poor measure of
spread? Why?
20Undesirable features of the range
- Sensitive to outliers
- Insensitive to the bulk of the data
- as it is based only on two scores
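The sensitivity of the range to a single outlier can be seen by comparing Data A and Data B from the earlier slides, which differ only in one value. A minimal sketch; the function name `value_range` is an assumption.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

range_a = value_range([60, 2, 3, 5])   # one outlier stretches the range
range_b = value_range([6, 2, 3, 5])    # same data without the outlier
```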
21Spread Interquartile range
- IQR = upper quartile - lower quartile
- The upper quartile (3rd quartile, Q3) has
one-quarter of the observations above it, and
three-quarters below.
- The lower quartile (1st quartile, Q1) has
one-quarter of the observations below it, and
three-quarters above.
- Hence the IQR gives the spread of the middle 50%
of the sample.
22Quartiles
- Q1 is the value of the (n + 3)/4th observation,
and Q3 is the value of the (3n + 1)/4th
observation.
- There are other systems!
- Interpolate if necessary.
- The interquartile range = Q3 - Q1
- If we have 17 heights, which observations
do we need to get the upper and lower quartiles?
- Which observation will give the median?
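The position rules on this slide can be sketched in Python. The slide only says to interpolate if necessary, so the linear interpolation used here is an assumption, as are the function names and the (n + 1)/2 position for the median.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data
    (linear interpolation between neighbours is assumed)."""
    lower = int(position)             # whole part of the position
    fraction = position - lower       # how far to move towards the next value
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Q1 at position (n + 3)/4 and Q3 at position (3n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 3) / 4)
    q3 = value_at_position(ordered, (3 * n + 1) / 4)
    return q1, q3

# Positions needed for 17 heights
n = 17
q1_pos, median_pos, q3_pos = (n + 3) / 4, (n + 1) / 2, (3 * n + 1) / 4

# A case that needs interpolation: the sample 8, 7, 9, 4
q1, q3 = quartiles([8, 7, 9, 4])
iqr = q3 - q1
```

For 17 heights the positions come out as whole numbers, so no interpolation is needed in that case.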
23Quartiles
The upper quartile is? The lower quartile is? The
interquartile range is?
Upper quartile = 166, lower quartile = 147,
interquartile range = 166 - 147 = 19
24Interquartile range
- Is resistant (or robust) to the impact of
outliers
- But it does not use all the data in its calculation
- How else might we measure spread, perhaps spread
around the mean?
25Spread Sum of Deviation Scores
In calculating the spread of a sample, we measure
how far each observation is from the sample
mean, i.e. $x_i - \bar{x}$.
Calculate the sum of the deviations for each
sample. Is this a good measure of spread?
        A    deviation    B    deviation
        6        2        8        2
        2       -2        4       -2
        3       -1        5       -1
        5        1        7        1
Sum              0                 0
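The deviation scores for the two samples on this slide can be reproduced with a few lines of Python. This is an illustrative sketch; the function name `deviations` is an assumption.

```python
from statistics import mean

def deviations(values):
    """Deviation of each observation from the sample mean."""
    centre = mean(values)
    return [x - centre for x in values]

sample_a = [6, 2, 3, 5]
sample_b = [8, 4, 5, 7]

dev_a = deviations(sample_a)
dev_b = deviations(sample_b)
sum_a, sum_b = sum(dev_a), sum(dev_b)   # positive and negative deviations cancel
```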
26Looking for another measure
- We could perhaps find the sum of these
differences, except that the sum (and average) is
always zero. (The positive differences cancel
out the negative ones.)
- We prevent this cancellation by squaring
each difference.
27Spread Sum of Squares
- The sum of the squared deviations is $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$
        A    deviation    squared    B    deviation    squared
        6        2           4       8        2           4
        2       -2           4       4       -2           4
        3       -1           1       5       -1           1
        5        1           1       7        1           1
Sum                         10                           10
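Squaring each deviation before adding gives the sum of squares. A minimal Python sketch for the two samples on this slide; the function name `sum_of_squares` is an assumption.

```python
from statistics import mean

def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    centre = mean(values)
    return sum((x - centre) ** 2 for x in values)

ss_a = sum_of_squares([6, 2, 3, 5])   # 4 + 4 + 1 + 1
ss_b = sum_of_squares([8, 4, 5, 7])   # same deviations, so the same total
```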
28Spread Sum of Squares
- Is difficult to understand in this context as the
answer is very big and gets bigger with every
additional data point. It is used and useful in
other contexts.
- What might we do?
Average the SS by dividing by n or...
29Spread Sample Variance
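- The obvious thing to calculate would be $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- but, for reasons to be explained later, we use $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- i.e., we divide by (n - 1) instead of n. The sample variance is symbolised by $s^2$.
- For the example, $s^2 = 10/3$. This measure still seems too big! So
what might we do?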
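The difference between dividing by n and dividing by (n - 1) can be seen with the `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The sketch uses the sample 6, 2, 3, 5, whose sum of squares is 10; it is added for illustration only.

```python
from statistics import pvariance, variance

sample = [6, 2, 3, 5]

divide_by_n = pvariance(sample)          # sum of squares / n       = 10 / 4
divide_by_n_minus_1 = variance(sample)   # sum of squares / (n - 1) = 10 / 3
```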
30Spread Standard Deviation
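- We use the positive square root of the sample
variance as our measure of spread.
- The square root is called the sample standard
deviation, and is denoted by s, i.e. $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
- For the example, $s = \sqrt{10/3} \approx 1.83$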
31Use of standard deviation
- The mean and std deviation give information
about where most of the distribution of values is
to be found.
- For many distributions, the range from
mean - 2 standard devs to mean + 2 standard
devs (mean ± 2SD)
contains approx 95% of the distribution.
- (The very least that this spread can contain is
75% of the distribution.)
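As a rough check of the "mean plus or minus 2 standard deviations" idea, the sketch below counts how many values of a sample fall inside that interval. It reuses the data set from the mode slide purely for illustration; the choice of data is an assumption, and a small sample need not match the 95% figure closely.

```python
from statistics import mean, stdev

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 9]

centre = mean(data)
spread = stdev(data)                              # sample standard deviation (n - 1)
low, high = centre - 2 * spread, centre + 2 * spread

inside = [x for x in data if low <= x <= high]    # values within mean +/- 2 SD
proportion = len(inside) / len(data)
```

Here only the value 9 falls outside the interval, so the proportion inside is 12 out of 13, comfortably above the 75% minimum.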
32Measuring the sample spread
- The standard deviation uses information from all
the values in the sample, so it is also affected
by wild values.
- That is, it is not robust (or resistant).
- Our choice will also depend to some extent on
what is chosen for the centre
- Mean and standard deviation
- Median and IQR
Conclusion
33Percentiles
- Sometimes we want another description of where
the data may be found
- The kth percentile is a number that has k percent
of the scores at or below it and (100 - k) percent above
it
- The lower quartile has 25% of scores at or below
that score
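The "at or below" wording of the percentile definition can be turned into a small check. The function name `percent_at_or_below` is hypothetical, and the sketch only tests whether a given number satisfies the definition on this slide; it does not implement any particular percentile-finding rule.

```python
def percent_at_or_below(scores, value):
    """Percentage of scores that are at or below the given value."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

scores = [8, 7, 9, 4]
pct_at_4 = percent_at_or_below(scores, 4)   # one of four scores
pct_at_7 = percent_at_or_below(scores, 7)   # two of four scores
```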
34Using the calculator
- Most calculators have an $\bar{x}$
button and a $\sigma_{n-1}$ button to
calculate the mean and standard deviation of some
numbers you type in. ($\sigma$ is the lower-case Greek
sigma.)
- Some also have $\sigma_n$. Don't use this one!
- To check that you are using the right std
deviation, the std deviation of -2, 0 and 2
should be 2 (exactly).
- Ask in lecture breaks or tutes to see that you
can use your statistics functions
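The same check can be run in Python, where `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by n. This sketch is an illustration added here, not part of the original slides.

```python
from statistics import pstdev, stdev

check = [-2, 0, 2]

s_n_minus_1 = stdev(check)   # divides by n - 1: sqrt(8 / 2)
s_n = pstdev(check)          # divides by n:     sqrt(8 / 3)
```

Only the n - 1 version returns exactly 2 for this check.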
35Using the calculator cont.
- If your calculator does not have a $\sigma_{n-1}$ button,
you will have to calculate the standard deviation
by hand. Find $s^2$ from $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- and then take the square root of $s^2$ to get s, the
standard deviation.
36Box-and-Whiskers plots
- Often just called box plots, they give a
pictorial summary of the data for a single
variable.
- They use the five-number summary
- minimum value,
- Q1,
- median,
- Q3,
- maximum value
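A five-number summary might be computed as sketched below. It assumes Python 3.8 or later for `statistics.quantiles`; the `method="inclusive"` option is used on the assumption that it matches the (n + 3)/4 and (3n + 1)/4 position rule given earlier. The data set is the one from the mode slide, chosen only for illustration, and the function name is hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 6, 9]
summary = five_number_summary(data)
```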
37- Example: If minimum = 3, Q1 = 6,
median = 10, Q3 = 12, maximum = 16, the box plot
would look like
[Figure: box plot drawn on a scale from 2 to 16]
- You must draw a scale for the box plot.
38- In a horizontal box plot, a horizontal axis shows
the scale. The box's left and right boundaries
are Q1 and Q3, and an inner line shows the
median.
- Whiskers are drawn outwards from the box to the
minimum and maximum values.
- Often the sample mean is also shown.
39- What values give rise to the box plot below?
- If minimum = ___, Q1 = ___,
median = ___, Q3 = ___, maximum = ___,
the box plot would look like
[Figure: box plot drawn on a scale from 2 to 16]
- You must draw a scale for the box plot.
40Box plots vs Stem and Leaf plots
- Box plots are especially useful for comparing 2
samples. They show the key points of a sample,
but not the individual values.
- Stem and leaf plots show individual values, and
give a better picture of the shape of the spread,
but their detail makes them unsuitable for
comparing more than two groups (side by side or
back to back).
41What do you want to see in data?
- Information
- Meaning
- We must turn data into information in order to
have meaning
42What can we see in data?
- Location (centre)
- Spread
- Shape
- Outliers
- Unusual patterns
- Gaps, clusters
- How do batches differ
43Tools for making meaning from data
- Ordering data
- Dot plots, jittered dot plots
- Stem-and-leaf plots
- Histograms, Boxplots, Bar charts
- Pie charts
- Frequency tables
- Numerical summaries
44Selecting the tool depends on
- The question asked
- How the variable is measured
- The structure of the data