Statistics : Describing Data - PowerPoint PPT Presentation

Provided by: chr1

1
Statistics Describing Data
  • Week 18 to 19

2
For weeks 18 and 19
  • Summarising data
  • Graphic representations
  • The shape of distributions
  • Measures of central tendency
  • Measures of dispersion
  • Percentiles, quartiles, interquartile range

3
Notice / Caveat
  • This stats course is a very quick and short
    version of parts of a full stats course that
    would normally take 1 full year to cover
  • However, because we only have 8 weeks, there are
    many aspects of the topics we study that we will
    not have time to explain or cover in class

4
Notice / Caveat
  • There are therefore certain questions of why
    and/or how which we will not have time to cover
    or explain.
  • However, do not let this stop you from doing
    extra reading if you want to.
  • So, don't let the constraints of the syllabus
    hold you back if you want to learn more

5
Booklist
  • Suggested books for you to study from include
    most of those found in the SOAS library under
    class mark number 519.4
  • Scan through them to find those appropriate to
    the stats we will do

6
Booklist
  • But it is up to you to decide which texts you
    prefer to use in support of the topic of
    statistics
  • Basically all you need to look for are books with
    headings the same as or similar to those in
    these, and future, slides

7
The Nature of Statistics
  • Some examples to highlight the application of
    stats
  • Someone taking a poll during an election and
    predicting the outcome of the election based on
    the data they get
  • To get an idea of the economic status of a town a
    researcher collects data on salaries of people in
    that town and calculates the average salary

8
The Nature of Statistics
  • Some examples to highlight the application of
    stats
  • Police forces collect crime figures from
    different parts of the country and produce
    summaries of the data for different types of
    crime in different areas
  • Production of goods in the thousands by a factory
    usually results in some of the goods being
    defective. The company tries to estimate the
    number of defective goods produced.

9
The Nature of Statistics
  • Basic material of statistics is data or
    observation gathered from experiments
  • Data can be
  • Numerical: age, weight, number of people
  • Non-numerical: colour, who to vote for in an
    election

10
The Nature of Statistics
  • Definition
  • Descriptive stats
  • Statistics is the science of collecting,
    simplifying and describing/presenting data
  • Inferential stats
  • It is also the science of making inferences
    (drawing conclusions) based on the analysis of
    data

11
The Nature of Statistics
  • When collecting data we cannot collect the total
    data possible
  • We cannot interview all the people who voted
  • We cannot collect all the salaries of the people
    in a town
  • etc
  • so a judgement has to be made about the larger
    body of data by studying the info from some of
    the data

12
The Nature of Statistics
  • Definition
  • The entire collection of all data we are
    interested in is called a population
  • A collection of some of the elements obtained
    from the population is called a sample of the
    population

13
The Nature of Statistics
  • Example
  • In studying voter preferences the population is
    everybody who votes in an election. The votes are
    the data values of interest.
  • The sample is those people who are interviewed
    after voting

14
The Nature of Statistics
  • Example
  • In studying average salaries of a town the
    population is everybody in the town. The salaries
    are the data values of interest.
  • The sample is the salaries of those people we
    interview

15
The Nature of Statistics
  • In the salaries example above
  • the average salary of the population is called a
    parameter of the population
  • the average salary of the sample is called a
    statistic of the sample
  • Definition
  • A numerical property of a population is called a
    parameter. A numerical property of a sample is
    called a statistic.

16
Sampling
  • Statistics uses samples instead of whole
    populations
  • We therefore need our samples to be
    representative of the population as a whole.
  • We therefore need to be careful about the way we
    collect sample data

17
Sampling
  • Random Samples
  • These are simply samples chosen at random,
    without using any criteria or prejudices
  • It also means that however many different
    samples we choose from a population, they will
    tend to show characteristics and properties
    similar to those of the population

18
Sampling
  • Two specific methods of random sampling are the
    lottery method and the random number method.
  • Lottery method: elements of a population are
    written on separate tags, placed in a container,
    and mixed. Tags are then chosen at random
  • Random number method: suppose we have 1000
    employees numbered 0001 to 1000. Use a table of
    random numbers to choose 100 employees
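
The random number method above can be sketched in Python, with the standard library's pseudo-random generator standing in for a printed table of random numbers. The function name choose_employees and the seed value are assumptions made for illustration only.

```python
import random

def choose_employees(population_size=1000, sample_size=100, seed=None):
    """Pick sample_size distinct employee numbers from 1..population_size."""
    rng = random.Random(seed)             # seed only makes the run repeatable
    ids = range(1, population_size + 1)   # employees 0001 to 1000
    chosen = rng.sample(ids, sample_size) # sampling without replacement
    # Format as four-digit labels, matching the numbering on the slide
    return [f"{i:04d}" for i in sorted(chosen)]

sample = choose_employees(seed=18)
print(len(sample), sample[:5])
```

Sampling without replacement means no employee can be chosen twice, which is also what drawing tags from a container does in the lottery method.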

19
Sampling
  • Stratified Samples
  • If we divide a population into sub-populations
    this is known as stratifying the population.
  • Samples taken from sub-populations are known as
    stratified samples

20
Sampling
  • Stratified Samples
  • An example of this might be to subdivide the
    student population into 1st year, 2nd year and
    3rd year students.
  • Each year is then a stratum of the population,
    and we may take random samples from each stratum.

21
Sampling
  • Stratified Samples
  • Suppose, for example, that the student
    population was 40% 1st years, 25% 2nd years and
    35% 3rd years.
  • To obtain a stratified sample of size 100 we
    would therefore take 40 1st years, 25 2nd years
    and 35 3rd years
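
A minimal sketch of this kind of proportional stratified sampling is given below. It assumes a hypothetical population of 1000 students split 40% / 25% / 35% across the three years; the student labels and the function name stratified_sample are invented for the example.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Take a random sample from each stratum, in proportion to its size."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Share of the sample given to this stratum (rounded, so in general
        # the pieces may not add up to sample_size exactly)
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical student population: 400, 250 and 350 students per year
students = {
    "1st year": [f"Y1-{i}" for i in range(400)],
    "2nd year": [f"Y2-{i}" for i in range(250)],
    "3rd year": [f"Y3-{i}" for i in range(350)],
}
picked = stratified_sample(students, 100, seed=1)
sizes = {name: len(members) for name, members in picked.items()}
print(sizes)  # {'1st year': 40, '2nd year': 25, '3rd year': 35}
```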

22
Sampling
  • Cluster Samples
  • Stratified sampling is practical only when we
    have a small number of sub-populations / strata.
  • However, if there are a large number of strata
    then it is impractical to sample from all of
    them, so we choose some of the strata at random
    and sample only from those
  • This is cluster sampling

23
Sampling
  • Cluster Samples
  • As an example consider voting districts. A
    particular county will have many voting
    districts, too many to sample from all of them.
  • So, each voting district is a stratum of the
    county as a whole and we take a random sample of
    the strata, i.e. a random selection of voting
    districts (from which we sample)
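
The voting district example can be sketched as a two-step draw: first pick some districts at random, then sample voters within each chosen district. The county below (50 districts of 200 voters each), the labels and the function name cluster_sample are all made up for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster, seed=None):
    """Choose n_clusters clusters at random, then sample within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # step 1: pick districts
    # step 2: sample voters only from the chosen districts
    return {name: rng.sample(clusters[name], per_cluster) for name in chosen}

# Hypothetical county: 50 voting districts, 200 voters in each
districts = {
    f"district-{d:02d}": [f"voter-{d:02d}-{v}" for v in range(200)]
    for d in range(50)
}
result = cluster_sample(districts, n_clusters=5, per_cluster=20, seed=3)
for name, voters in result.items():
    print(name, len(voters))
```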

24
Some Problems with Sampling
  • Non-response Bias
  • Consider the case of the 1936 USA presidential
    campaign between Franklin Roosevelt and Alf
    Landon (a true story)
  • a survey was conducted of 2.4 million people
  • the survey predicted that Alf Landon would win
    by a landslide
  • but in fact Franklin Roosevelt won by a
    landslide

25
Some Problems with Sampling
  • Non-response Bias
  • why ?
  • It was found that a lot of the participants of
    the survey were people whose names were taken
    from the telephone book
  • in 1936 people who had phones were well off
    (not poor) and the politics of such people was
    that of Alf Landon
  • but this was not representative of the politics
    of the population as a whole

26
Some Problems with Sampling
  • Non-response Bias
  • thus the sample was not representative of the
    population as a whole
  • Such a situation is known as bias. In
    particular, this situation is known as
    non-response bias, since a segment of the
    population was not represented in the sample.

27
Some Problems with Sampling
  • Non-response bias can range from
  • People not answering specific questions
  • to people refusing to participate at all
  • to whole groups of people being excluded from
    the sample
  • E.g. internet surveys suffer major non-response
    bias. Why ?
  • E.g. surveys of home owners suffer major
    non-response bias. Why ?

28
Some Problems with Sampling
  • To prevent Non-response bias some companies
  • 1) Call back people if they could not be
    contacted 1st time
  • 2) Or, offer incentives such as complete this
    survey and you can choose 1 of the following free
    products.
  • What problems can occur with doing 2) ?

29
Some Problems with Sampling
  • Response Bias
  • Response bias relates to participants not giving
    true, honest answers or forgetting what their
    answer might have been.
  • This can also occur in questionnaires when
    questions are written in a way to elicit a
    certain answer

30
Some Problems with Sampling
  • Response Bias
  • E.g. Given that the congestion charge has
    produced no net reduction in traffic congestion,
    would you favour an increase in congestion
    charges next year ?
  • Hence the design of questions in a questionnaire
    is very important to make sure response bias does
    not occur or is minimised.

31
Some Problems with Sampling
  • Lying is more difficult to prevent, e.g. have
    you ever cheated when filling in your tax return
    form?

32
Descriptive Statistics
  • (To do: raw data, frequency distribution,
    histogram, frequency polygon, comparison of the
    latter two. See p81-96 of Saunders white stats
    book.)

33
Graphic Representation
  • -----

34
The Shape of Distribution
  • See p273 of Chase and Bown for a description of
    the normal distribution

35
Measures of Central Tendencies
  • There exist three different types of "averages"
    for a set of data.
  • Such averages are generally called central
    tendencies because they tell us the value of the
    data lying in the middle
  • Hence averages are called measures of central
    tendency.

36
Measures of Central Tendencies
  • The Mean
  • This is the usual average as we know it, i.e.
  • add up all the data and
  • divide by the number of data values
  • $\bar{x}, \mu = \frac{\sum x}{n}$
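
As a small worked illustration of "add up all the data and divide by the number of data values", here is a sketch in Python. The salary figures are made up for the example.

```python
def mean(values):
    """Add up all the data and divide by the number of data values."""
    return sum(values) / len(values)

salaries = [18, 20, 22, 24, 26]   # made-up salaries, in thousands
print(mean(salaries))  # 22.0
```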

37
Measures of Central Tendencies
  • Notation
  • $\mu$: population mean
  • $\bar{x}$: sample mean
  • Example
  • See lecture
  • Note extreme values can affect the value of the
    average.

38
Measures of Central Tendencies
  • The larger any one value in the data the more it
    will drag the mean away from the central area of
    the data
  • These extreme values are called outliers.
  • Because of the effect of outliers on the average,
    the mean is said to be not a resistant measure.

39
Measures of Central Tendencies
  • The Median
  • Arrange the set of data in order of increasing
    magnitude. The median of that set is the data
    value which lies in the middle.
  • The symbol for this median is $\tilde{x}$ (x tilde).

40
Measures of Central Tendencies
  • The Median
  • If we have an odd number of data values then
    there will be only one value in the middle and
    that is the median.
  • What do we do if we have an even number of
    values in our data?
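
One way to handle both cases is sketched below. When the number of values is even there are two middle values, and the usual convention is to take the mean of those two. The data values are made up for the example.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)          # arrange in order of increasing magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: average the two values either side of the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 3, 9, 1, 5]))      # 5
print(median([7, 3, 9, 1, 5, 11]))  # 6.0
```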

41
Measures of Central Tendencies
  • Examples
  • See lecture
  • For the mean we saw that there was an outlier,
    but that the median calculation was not affected
    by this.
  • Thus the median is a resistant measure of central
    tendency.
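
The difference in resistance can be seen with a small made-up data set, using Python's standard statistics module. The values below are invented for illustration and are not the lecture example.

```python
import statistics

data = [21, 22, 23, 24, 25]        # made-up values, no outlier
with_outlier = data + [250]        # same data plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```

Here the single outlier drags the mean from 23 to roughly 60.8, while the median only moves from 23 to 23.5.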

42
Measures of Central Tendencies
  • Q: When should you use the mean and when the
    median as your measure of central tendency?
  • 1) generally (but not always) use the median as
    measure if you have outliers.
  • 2) otherwise, use the mean to calculate your
    measure.
  • but
  • 3) always investigate the data to know which one
    to use.

43
Dispersion
  • When data is plotted on a graph it can either
    look compact or spread out.

44
Dispersion
  • Such spreading out is generally called dispersion
  • The question then is whether the data is tightly
    packed around the mean or loosely spread out
    around it.
  • Looking at the two diagrams above we see that the
    1st is more loosely spread out than the 2nd one.

45
Dispersion
  • Knowing such info can be crucial.
  • Example
  • A manufacturer produces items of a certain
    strength.
  • You would expect that wherever you used the
    product it would be of the strength you expected.

46
Dispersion
  • You don't want there to be a great difference
    between the strength of the product in Leeds
    compared to Luton.
  • Such unreliability could be very dangerous if
    the product was a bridge.
  • Thus we would want a very low difference in
    strength, i.e. a very small measure of dispersion
    in strength between products.

47
Measures of Dispersion
  • Closely related to dispersion is the position of
    a piece of data w.r.t. all the other data
  • and the way to describe this position is called
    a measure of dispersion.
  • Q: How do we calculate or measure dispersion?

48
Measures of Dispersion
  • Range R
  • This is the simplest way but the least useful.
  • $R = \text{highest data value} - \text{lowest data value}$
  • Example
  • See lecture
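
The range is simple enough to compute in one line. The sketch below uses made-up data rather than the lecture example.

```python
def data_range(values):
    """Range R: highest data value minus lowest data value."""
    return max(values) - min(values)

print(data_range([4, 9, 2, 12, 7]))  # 10
```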

49
Measures of Dispersion
  • Range does not measure dispersion well enough.
  • In the diagram both sets of data have range 10

50
Measures of Dispersion
  • but the data is spread out very differently
  • Range does not measure variability within the
    data.
  • Hence we need another measure
  • one which measures variability away from the
    mean.

51
Measures of Dispersion
  • Average deviation
  • So, to measure how far the data lies from the
    mean we can find the average distance of each
    data value from the mean
  • $\text{A.D.} = \frac{1}{N} \sum |x - \mu|$
  • where N is the number of values in the
    population and $\mu$ is the population mean
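
A minimal sketch of the average deviation follows, treating a small made-up data set as the whole population. The function name average_deviation is an assumption for the example.

```python
def average_deviation(values):
    """Average absolute distance of each data value from the mean."""
    n = len(values)
    mu = sum(values) / n                        # population mean
    return sum(abs(x - mu) for x in values) / n

# Made-up population: mean is 6, distances are 4, 2, 0, 2, 4
print(average_deviation([2, 4, 6, 8, 10]))  # 2.4
```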

52
Measures of Dispersion
  • Examples
  • See lecture
  • However, calculating absolute distances in
    average deviation does not statistically/
    naturally represent the most appropriate spread
    of data.
  • The more commonly used measure of spread is the
    standard deviation.

53
Measures of Dispersion
  • Standard Deviation
  • So, we now calculate the standard deviation of
    the data about the mean
  • $\text{S.D.} = \sqrt{\frac{1}{N} \sum (x - \mu)^2}$
  • where N is the number of values in the
    population, and $\mu$ is the population mean
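
The population standard deviation can be sketched directly from its definition: square each deviation from the mean, average the squares over N, and take the square root. The data set below is made up and treated as a complete population.

```python
import math

def population_sd(values):
    """Square root of the average squared deviation from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

# Made-up population: mean 5, squared deviations sum to 32, and 32 / 8 = 4
print(population_sd([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

The standard library function statistics.pstdev computes the same quantity and can be used as a cross-check.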

54
Measures of Dispersion
  • Intuitively it makes sense to use the average
    deviation
  • but this type of measure is not the most
    representative of the way naturally occurring
    data is spread out
  • standard deviation is more representative of
    the spread of data which is normally distributed

55
Measures of Dispersion

56
Measures of Dispersion
  • Standard deviation can be thought of as a typical
    distance from the mean.
  • Populations are generally too big for us to
    calculate means and S.D.s
  • So we need to calculate means and S.D.s of
    samples taken from populations

57
Measures of Dispersion
  • Hence we use sample standard deviation
  • $s = \sqrt{\frac{1}{n-1} \sum (x - \bar{x})^2}$
  • where n is the sample size and $\bar{x}$ is the
    sample mean

58
Measures of Dispersion
  • For sample S.D. we divide by n-1 since samples
    tend to be small in size and dividing by n biases
    the result: it tends to underestimate the
    population S.D.
  • Example
  • See lecture
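
The effect of dividing by n-1 rather than n can be seen in a short sketch. The data set is made up and treated here as a sample; statistics.stdev and statistics.pstdev from the standard library are used only as a cross-check.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n                     # sample mean
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

sample = [2, 4, 4, 4, 5, 5, 7, 9]               # made-up sample
s = sample_sd(sample)
divide_by_n = statistics.pstdev(sample)         # same data, dividing by n

print(f"divide by n-1: {s:.3f}")
print(f"divide by n:   {divide_by_n:.3f}")
```

Dividing by n-1 always gives the slightly larger value, and the gap shrinks as the sample size grows.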

59
Percentiles, quartiles, interquartile range
  • Another way of measuring dispersion is by a
    (percentage distance from the mean ?)
  • (see all 3 stats books I have)

60
Descriptive Statistics
  • The End