Title: Statistics : Describing Data
1Statistics Describing Data
2For weeks 18 and 19
- Summarising data
- Graphic representations
- The shape of distributions
- Measures of central tendency
- Measures of dispersion
- Percentiles, quartiles, interquartile range
3Notice / Caveat
- This stats course is a very quick, short version of parts of a full stats course that would normally take one full year to cover
- However, because we only have 8 weeks, there are many aspects of the topics we study that we will not have time to explain or cover in class
4Notice / Caveat
- There are therefore certain questions of why and/or how which we will not have time to cover or explain.
- However, do not let this stop you from doing extra reading if you want to.
- So, don't let the constraints of the syllabus hold you back if you want to learn more
5Booklist
- Suggested books for you to study from include most of those found in the SOAS library under class mark number 519.4
- Scan through them to find those appropriate to the stats we will do
6Booklist
- But it is up to you to decide which texts you prefer to use in support of the topic of statistics
- Basically, all you need to look for are books with headings the same as, or similar to, those in these and future slides
7The Nature of Statistics
- Some examples to highlight the application of stats
- Someone takes a poll during an election and predicts the outcome of the election based on the data they get
- To get an idea of the economic status of a town, a researcher collects data on the salaries of people in that town and calculates the average salary
8The Nature of Statistics
- Some examples to highlight the application of stats
- Police forces collect crime figures from different parts of the country and produce summaries of the data for different types of crime in different areas
- Production of goods in the thousands by a factory usually results in some of the goods being defective. The company tries to estimate the number of defective goods produced.
9The Nature of Statistics
- The basic material of statistics is data, or observations gathered from experiments
- Data can be
- Numerical: age, weight, number of people
- Non-numerical: colour, who to vote for in an election
10The Nature of Statistics
- Definition
- Descriptive stats
- Statistics is the science of collecting, simplifying and describing/presenting data
- Inferential stats
- It is also the science of making inferences
(drawing conclusions) based on the analysis of
data
11The Nature of Statistics
- When collecting data we cannot collect all the data possible
- We cannot interview all the people who voted
- We cannot collect all the salaries of the people in a town
- etc
- so a judgement has to be made about the larger body of data by studying the information from some of the data
12The Nature of Statistics
- Definition
- The entire collection of all data we are interested in is called a population
- A collection of some of the elements obtained from the population is called a sample of the population
13The Nature of Statistics
- Example
- In studying voter preferences the population is everybody who votes in an election. The votes are the data values of interest.
- The sample is those people who are interviewed after voting
14The Nature of Statistics
- Example
- In studying average salaries of a town the population is everybody in the town. The salaries are the data values of interest.
- The sample is the salaries of those people we interview
15The Nature of Statistics
- In the salaries example above
- the average salary of the population is called a parameter of the population
- the average salary of the sample is called a statistic of the sample
- Definition
- A numerical property of a population is called a parameter. A numerical property of a sample is called a statistic.
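As a small illustration of the parameter/statistic distinction, the sketch below uses made-up salaries for a ten-person "town". The salary figures, the sample size of 4 and the seed are all assumptions chosen only for the example.

```python
import random
import statistics

# Hypothetical salaries for every person in a very small town (the population).
population = [18_000, 21_500, 24_000, 26_000, 27_500,
              30_000, 32_000, 35_000, 41_000, 58_000]

# Parameter: a numerical property of the whole population.
population_mean = statistics.mean(population)

# Sample: the salaries of the few people we actually interview.
rng = random.Random(0)              # fixed seed so the run is repeatable
sample = rng.sample(population, 4)

# Statistic: the same numerical property, computed from the sample only.
sample_mean = statistics.mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The parameter is fixed for a given population, whereas the statistic changes from sample to sample.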
16Sampling
- Statistics uses samples instead of populations
- We therefore need our samples to be representative of the population as a whole.
- We therefore need to be careful about the way we collect sample data
17Sampling
- Random Samples
- These are simply samples chosen at random, without using any criteria or prejudices
- It also means that however many different samples we choose from a population, they will all tend to show characteristics and properties similar to those of the population
18Sampling
- Two specific methods of random sampling are the lottery method and the random number method.
- Lottery method: elements of a population are written on separate tags, placed in a container, and mixed. Tags are then chosen at random
- Random number method: suppose we have 1000 employees numbered 0001 to 1000. Use a table of random numbers to choose 100 employees
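The random number method can be sketched in code, with a pseudo-random number generator standing in for the printed table of random numbers. This is only an illustration: the seed value is an assumption, used so that the run is repeatable.

```python
import random

# Employees numbered 0001 to 1000, as in the example above.
employee_ids = [f"{i:04d}" for i in range(1, 1001)]

# A pseudo-random generator stands in for the table of random numbers.
rng = random.Random(2024)   # hypothetical seed, only for repeatability

# Draw 100 distinct employees; every employee is equally likely to be chosen.
chosen = rng.sample(employee_ids, 100)

print(len(chosen), "employees chosen, for example:", chosen[:5])
```

Because `sample` draws without replacement, no employee can be chosen twice, just as a tag drawn in the lottery method is not put back.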
19Sampling
- Stratified Samples
- If we divide a population into sub-populations this is known as stratifying the population.
- Samples taken from sub-populations are known as stratified samples
20Sampling
- Stratified Samples
- An example of this might be to subdivide the student population into 1st year, 2nd year and 3rd year students.
- Each year is then a stratum of the population, and we may take random samples from each stratum.
21Sampling
- Stratified Samples
- Suppose, for example, that the student population was 40% 1st years, 25% 2nd years and 35% 3rd years.
- To obtain a stratified sample of size 100 we would therefore take 40 1st years, 25 2nd years and 35 3rd years
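A proportional stratified sample can be sketched as below. It assumes a hypothetical population of 1000 students in which the three year groups make up 40%, 25% and 35% of the total; the student labels and the seed are invented for the example.

```python
import random

rng = random.Random(1)  # hypothetical seed, only for repeatability

# Hypothetical student population of 1000, split into strata by year.
strata = {
    "1st year": [f"Y1-{i}" for i in range(400)],   # 40% of students
    "2nd year": [f"Y2-{i}" for i in range(250)],   # 25% of students
    "3rd year": [f"Y3-{i}" for i in range(350)],   # 35% of students
}

sample_size = 100
population_size = sum(len(members) for members in strata.values())

# Take a random sample from each stratum, sized in proportion to the stratum.
stratified_sample = {}
for year, members in strata.items():
    quota = round(sample_size * len(members) / population_size)
    stratified_sample[year] = rng.sample(members, quota)

for year, picked in stratified_sample.items():
    print(year, len(picked))
```

With these proportions the quotas come out as 40, 25 and 35, so the sample of 100 mirrors the make-up of the population.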
22Sampling
- Cluster Samples
- Stratified sampling is practical only when we have a small number of sub-populations / strata.
- However, if there are a large number of strata then it is impractical to sample from every one, so we instead choose a random selection of the strata
- This is cluster sampling
23Sampling
- Cluster Samples
- As an example consider voting districts. A particular county will have many voting districts, too many to sample from all of them.
- So, each voting district is a stratum of the county as a whole and we take a random sample of the strata, i.e. a random selection of voting districts (from which we sample)
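The voting district example can be sketched as a two-stage procedure: first pick districts at random, then sample voters only within the chosen districts. The number of districts, the voters per district and the sample sizes below are all assumptions made up for the illustration.

```python
import random

rng = random.Random(7)  # hypothetical seed, only for repeatability

# Hypothetical county: 50 voting districts, each with 200 registered voters.
districts = {
    f"district-{d:02d}": [f"voter-{d:02d}-{v:03d}" for v in range(200)]
    for d in range(1, 51)
}

# Stage 1: choose a few districts (clusters) at random.
chosen_districts = rng.sample(sorted(districts), 5)

# Stage 2: sample voters only from the chosen districts.
cluster_sample = {
    name: rng.sample(districts[name], 20) for name in chosen_districts
}

print(chosen_districts)
```

Only 5 of the 50 districts need to be visited, which is the practical advantage over sampling from every stratum.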
24Some Problems with Sampling
- Non-response Bias
- Consider the case of the 1936 USA presidential campaign between Franklin Roosevelt and Alf Landon (a true story)
- a survey was conducted of 2.4 million people
- the survey predicted that Alf Landon would win by a landslide
- but in fact Franklin Roosevelt won by a landslide
25Some Problems with Sampling
- Non-response Bias
- Why?
- It was found that a lot of the participants of the survey were people whose names were taken from the telephone book
- in 1936 people who had phones were well off (not poor) and the politics of such people was that of Alf Landon
- but this was not representative of the politics of the population as a whole
26Some Problems with Sampling
- Non-response Bias
- thus the sample was not representative of the population as a whole
- Such a situation is known as bias. In particular, this situation is known as non-response bias, since a segment of the population was not represented in the sample.
27Some Problems with Sampling
- Non-response bias can range from
- people not answering specific questions
- to people refusing to participate at all
- to whole groups of people being excluded from the sample
- E.g. internet surveys suffer major non-response bias. Why?
- E.g. surveys of home owners suffer major non-response bias. Why?
28Some Problems with Sampling
- To prevent non-response bias some companies
- 1) Call back people if they could not be contacted the 1st time
- 2) Or, offer incentives such as "complete this survey and you can choose 1 of the following free products".
- What problems can occur with doing 2)?
29Some Problems with Sampling
- Response Bias
- Response bias relates to participants not giving true, honest answers or forgetting what their answer might have been.
- This can also occur in questionnaires when questions are written in a way to elicit a certain answer
30Some Problems with Sampling
- Response Bias
- E.g. "Given that the congestion charge has produced no net reduction in traffic congestion, would you favour an increase in congestion charges next year?"
- Hence the design of questions in a questionnaire is very important to make sure response bias does not occur or is minimised.
31Some Problems with Sampling
- Lying is more difficult to prevent: "have you ever cheated when filling in your tax return form?"
32Descriptive Statistics
- (to do:
- raw data, frequency distribution
- Histogram, frequency polygon, comparison of latter two. See p81-96 of Saunders white stats book)
33Graphic Representation
34The Shape of Distribution
- See p273 of Chase and Bown for a description of the normal distribution
35Measures of Central Tendencies
- There exist three different types of "averages" of a set of data.
- Such averages are generally called central tendencies because averages tell us the value of the data lying in the middle
- hence averages are called measures of central tendency.
36Measures of Central Tendencies
- The Mean
- This is the usual average as we know it, i.e.
- add up all the data and
- divide by the number of data values
- $\bar{x}, \mu = \frac{\sum x}{n}$
37Measures of Central Tendencies
- Notation
- $\mu$: population mean
- $\bar{x}$: sample mean
- Example
- See lecture
- Note: extreme values can affect the value of the average.
38Measures of Central Tendencies
- The larger any one value in the data, the more it will drag the mean away from the central area of the data
- These extreme values are called outliers.
- Because of the effect of outliers on the average, the mean is said to be not a resistant measure.
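The dragging effect of an outlier on the mean can be seen numerically. The data values below are made up for the illustration; one extra value is added and made progressively more extreme.

```python
import statistics

# Hypothetical data clustered around the mid-twenties.
data = [21, 23, 24, 25, 27]

# Add one extra value and make it more and more extreme.
extra_values = [30, 60, 120, 1000]
means = [statistics.mean(data + [extra]) for extra in extra_values]

for extra, mean in zip(extra_values, means):
    print("extra value:", extra, "-> mean:", mean)
```

The other five values never change, yet the mean climbs from 25 to well above every one of them as the extra value grows.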
39Measures of Central Tendencies
- The Median
- Arrange the set of data in order of increasing magnitude. The median of that set is the data value which lies in the middle.
- The symbol for this median is $\tilde{x}$ (x tilde).
40Measures of Central Tendencies
- The Median
- If we have an odd number of data values then there will be only one value in the middle and that is the median.
- What do we do if we have an even number of values in our data?
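A minimal sketch of the median calculation follows. It uses a common convention that the slide leaves as a question: when the count of values is even there is no single middle value, so the two middle values are averaged. The function name and the data are illustrative only.

```python
def median(values):
    """Return the middle value of the data once it is sorted."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: exactly one value sits in the middle.
        return ordered[middle]
    # Even number of values: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 3, 9, 1, 5]))        # odd count
print(median([7, 3, 9, 1, 5, 11]))    # even count
```

For the first list the sorted data is 1, 3, 5, 7, 9 and the median is 5; for the second the two middle values are 5 and 7, giving 6.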
41Measures of Central Tendencies
- Examples
- See lecture
- We saw that an outlier affected the mean, but that the median calculation was not affected by it.
- Thus the median is a resistant measure of central tendency.
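To see resistance in numbers, the sketch below computes both measures on a made-up data set, with and without a single extreme value. The data are hypothetical and are not the lecture example.

```python
import statistics

data = [21, 23, 24, 25, 27]          # hypothetical data, no outlier
data_with_outlier = data + [120]     # same data plus one extreme value

for label, values in [("without outlier", data),
                      ("with outlier", data_with_outlier)]:
    print(label,
          "mean:", statistics.mean(values),
          "median:", statistics.median(values))
```

The mean jumps from 24 to 40, while the median only moves from 24 to 24.5.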
42Measures of Central Tendencies
- Q: when to use the mean and when to use the median as your measure of central tendency?
- 1) generally (but not always) use the median as your measure if you have outliers.
- 2) otherwise, use the mean.
- but
- 3) always investigate the data to know which one to use.
43Dispersion
- When data is plotted on a graph it can either look compact or spread out.
44Dispersion
- Such spreading out is generally called dispersion
- The question is then: is the data tightly packed around the mean or is it loosely spread out around the mean?
- Looking at the two diagrams above we see that the 1st is more loosely spread out than the 2nd one.
45Dispersion
- Knowing such info can be crucial.
- Example
- A manufacturer produces items of a certain strength.
- You would expect that wherever you used the product it would be of the strength you expected.
46Dispersion
- You don't want there to be a great difference between the strength of the product in Leeds compared to Luton.
- Such unreliability could be very dangerous if the product was a bridge.
- Thus we would want a very low difference in strength, i.e. a very small measure of dispersion in strength between products.
47Measures of Dispersion
- Closely related to dispersion is the position of a piece of data w.r.t. all the other data
- and the way to describe this position is called a measure of dispersion.
- Q: How do we calculate or measure dispersion?
48Measures of Dispersion
- Range R
- This is the simplest way but the least useful.
- R = highest data value - lowest data value
- Example
- See lecture
49Measures of Dispersion
- Range does not measure dispersion well enough.
- In the diagram both sets of data have range 10
50Measures of Dispersion
- but the data is spread out very differently
- range does not measure variability within the data.
- Hence we need another measure
- one which measures variability away from the mean.
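The weakness of the range can be shown with two made-up data sets that share the same highest and lowest values but are spread very differently in between. The data and the function name are assumptions for the illustration, not the sets from the diagram.

```python
def data_range(values):
    """Range R = highest data value - lowest data value."""
    return max(values) - min(values)

# Two hypothetical data sets: same extremes, very different spread in between.
clustered = [0, 5, 5, 5, 5, 5, 10]
spread_out = [0, 1, 3, 5, 7, 9, 10]

print(data_range(clustered), data_range(spread_out))
```

Both ranges are 10, because the range looks only at the two extreme values and ignores everything between them.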
51Measures of Dispersion
- Average deviation
- So, to measure how far the data lies from the mean we can find the average distance of each data value from the mean
- $\text{A.D.} = \frac{1}{N} \sum |x - \mu|$
- where N is the number of values in the population and $\mu$ is the population mean
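The average deviation can be computed directly from its definition: take the absolute distance of each value from the mean, then average those distances. The data set and the function name below are made up for the sketch.

```python
def average_deviation(values):
    """Mean absolute distance of each value from the mean of the data."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical population of 8 values with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(average_deviation(data))
```

Here the absolute distances from the mean are 3, 1, 1, 1, 0, 0, 2 and 4, which total 12, so the average deviation is 12 / 8 = 1.5.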
52Measures of Dispersion
- Examples
- See lecture
- However, the absolute distances used in average deviation are not the most statistically natural way to represent the spread of data.
- The more commonly used measure of spread is the standard deviation
53Measures of Dispersion
- Standard Deviation
- So, we now calculate the standard deviation of the data from the mean
- $\text{S.D.} = \sqrt{\frac{1}{N} \sum (x - \mu)^2}$
- where N is the number of values in the population, and $\mu$ is the population mean
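The population standard deviation can be computed step by step from its definition: square each distance from the mean, average the squares over N, then take the square root. The sketch uses a made-up data set and compares the result with `statistics.pstdev` from the Python standard library, which computes the same population quantity.

```python
import math
import statistics

def population_sd(values):
    """Square root of the mean squared distance from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

# Hypothetical population of 8 values with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(data))
print(statistics.pstdev(data))
```

The squared distances total 32, so the mean squared distance is 32 / 8 = 4 and the standard deviation is 2.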
54Measures of Dispersion
- Intuitively it makes sense to use the average deviation
- but this type of measure is not the most representative of the way naturally occurring data is spread out
- standard deviation is more representative of the spread of data which is normally distributed
55Measures of Dispersion
56Measures of Dispersion
- Standard deviation can be thought of as a typical distance from the mean.
- Populations are generally too big for us to calculate means and S.D.s
- So we need to calculate means and S.D.s of samples taken from populations
57Measures of Dispersion
- Hence we use the sample standard deviation
- $s = \sqrt{\frac{1}{n-1} \sum (x - \bar{x})^2}$
- where n is the sample size and $\bar{x}$ is the sample mean
58Measures of Dispersion
- For sample S.D. we divide by n-1 because samples tend to be small, and dividing by n would tend to underestimate the population S.D.
- Example
- See lecture
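The effect of dividing by n-1 rather than n can be seen on a small made-up sample. The sketch computes the sample standard deviation from its definition and compares it with `statistics.stdev` (which divides by n-1) and `statistics.pstdev` (which divides by n), both from the Python standard library.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: divide the squared distances by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

# Hypothetical sample of 8 values with sample mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print("divide by n-1:", sample_sd(sample))
print("statistics.stdev:", statistics.stdev(sample))
print("divide by n:", statistics.pstdev(sample))
```

Dividing by n-1 always gives a slightly larger value than dividing by n, and the gap shrinks as the sample size grows.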
59Percentiles, quartiles, interquartile range
- Another way of measuring dispersion is by a (percentage distance from the mean?)
- (see all 3 stats books I have)
60Descriptive Statistics