Describing and Exploring Data - PowerPoint PPT Presentation

Provided by: johnba4
1
Chapter 2
  • Describing and Exploring Data

2
Describing and Exploring Data
  • Once a bunch of data has been collected, the raw
    numbers must be manipulated in some fashion to
    make them more informative.
  • Several options are available including plotting
    the data or calculating descriptive statistics.

3
Plotting Data
  • Often, the first thing one does with a set of raw
    data is to plot frequency distributions.
  • Usually this is done by first creating a table of
    the frequencies broken down by values of the
    relevant variable, then the frequencies in the
    table are plotted in a histogram.
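As a rough sketch of the table-then-histogram procedure, the snippet below counts how often each value occurs and prints a simple text histogram. The `ages` list is made-up illustration data, not the class questionnaire results.

```python
from collections import Counter

# Hypothetical ages (illustration only, not the class data).
ages = [19, 20, 20, 21, 19, 22, 20, 21, 20, 23]

# Frequency table: value -> number of subjects with that value.
freq = Counter(ages)

# Text histogram: one row per value, one asterisk per subject.
for value in sorted(freq):
    print(f"{value:>3} | {'*' * freq[value]}")
```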

4
Table Example: Your age as
estimated by the questionnaire from the first
class.
  • Note: The frequencies in the adjacent table were
    calculated by simply counting the number of
    subjects having the specified value for the age
    variable.

5
Histogram
6
Grouping Data
  • Plotting is easy when the variable of interest
    has a relatively small number of values (like our
    age variable did).
  • However, the values of a variable are sometimes
    more continuous, resulting in uninformative
    frequency plots if done in the above manner.

7
Grouping Data
  • For example, our weight variable ranges from 100
    lb. to 200 lb. If we used the previously
    described technique, we would end up with roughly 100
    bars, most of which would have a frequency less than 2
    or 3 (and many with a frequency of zero).
  • We can get around this problem by grouping our
    values into bins. Try for around 10 bins with
    natural splits.
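One way to sketch the binning step is shown below. The `weights` list and the 10 lb. bin width are assumptions chosen for illustration; they are not the class data.

```python
from collections import Counter

# Hypothetical weights in pounds (illustration only).
weights = [104, 118, 121, 125, 133, 137, 139, 142, 148, 150,
           151, 156, 163, 167, 172, 185, 199]

BIN_WIDTH = 10  # bins are 100-109, 110-119, ..., 190-199

def bin_start(weight, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains `weight`."""
    return (weight // width) * width

# Frequency of each bin, keyed by the bin's lower edge.
binned = Counter(bin_start(w) for w in weights)

for start in range(100, 200, BIN_WIDTH):
    label = f"{start}-{start + BIN_WIDTH - 1}"
    print(f"{label} | {'*' * binned[start]}")
```

With these splits, a weight of exactly 200 lb. would need an eleventh bin or a slightly wider last bin.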

8
Table Example: Binning our weight variable.
9
Histogram
Check out this demo, which shows how the
width of the bin that you select can
affect the look of the data. Here is another
similar demonstration of the effects of bin width.
  • See the section in the text on cumulative frequency
    distributions.

10
Stem-and-Leaf Plots
  • If values of a variable must be grouped prior to
    creating a frequency plot, then the information
    related to the specific values is lost in
    the process (i.e., the resulting graph depicts
    only the frequency values associated with the
    grouped values).
  • However, it is possible to obtain the graphical
    advantage of grouping and still keep all of the
    information if stem-and-leaf plots are used.

11
Stem-and-Leaf Plots
  • These plots are created by splitting each data point
    into the part associated with the group (the stem) and
    the part associated with the individual point (the leaf).
  • For example, the numbers 180, 180, 181, 182, 185,
    186, 187, 187, 189 could be represented as
  • 18 | 001256779
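The split into stems and leaves can be sketched in a few lines. The helper name `stem_and_leaf` is made up for this illustration, and the sketch assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (all but the last digit) and join their leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)   # stem = leading digits, leaf = last digit
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in stems.items()}

weights = [180, 180, 181, 182, 185, 186, 187, 187, 189]
plot = stem_and_leaf(weights)

for stem in sorted(plot):
    print(f"{stem} | {plot[stem]}")   # prints: 18 | 001256779
```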

12
Thus, we could represent our weight data in the
following stem-and-leaf plot
13
Stem-and-leaf plots are especially nice for
comparing distributions
14
Terminology Related to Distributions
  • Often, frequency histograms tend to have a
    roughly symmetrical bell-shape and such
    distributions are called normal or Gaussian.

15
Terminology Related to Distributions
  • Sometimes, the bell shape is not symmetrical.
  • The term positive skew refers to the situation
    where the tail of the distribution is to the
    right, negative skew is when the tail is to the
    left.

16
Example: Pizza Data.
17
Notation: Variables
  • When we describe a set of data corresponding to
    the values of some variable, we will refer to
    that set using an uppercase letter such as X or
    Y.
  • When we want to talk about specific data points
    within that set, we specify those points by
    adding a subscript to the uppercase letter like
    $X_1$.

18
For Example
  • 5, 8, 12, 3, 6, 8, 7
  • $X_1, X_2, X_3, X_4, X_5, X_6, X_7$

19
Summation
  • The Greek letter sigma, which looks like $\Sigma$, means
    add up or sum whatever follows it.
  • Thus, $\sum X_i$ means add up all the $X_i$s.
  • If we use the $X_i$s from the previous example, $\sum X_i = 49$
    (or just $\sum X$).

20
Summation
  • Note that sometimes the $\Sigma$ has numbers above and
    below it. These numbers specify the range over
    which to sum.
  • For example, if we again use the $X_i$s from the
    previous example, but now limit the summation to the first five values:
  • $\sum_{i=1}^{5} X_i = 34$
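Both sums can be checked quickly in code. The limits i = 1 to 5 are an assumption here: the transcript does not show them, but the first five values are the only run of consecutive values that adds up to 34.

```python
X = [5, 8, 12, 3, 6, 8, 7]   # X1 ... X7 from the earlier example

total = sum(X)               # sum over every i
partial = sum(X[0:5])        # sum for i = 1 to 5 (Python counts from 0)

print(total)    # 49
print(partial)  # 34
```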

21
Nasty Example
22
Nasty Example ... continued
  • $\sum X$
  • $\sum Y$
  • $\sum (X - Y)$
  • $\sum X^2$
  • $(\sum X)^2$
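The data for this example did not survive in the transcript, so the sketch below uses the X values from the earlier notation slide and a made-up Y. It is only meant to show how each expression is read, in particular the difference between summing the squares and squaring the sum.

```python
X = [5, 8, 12, 3, 6, 8, 7]   # X values from the notation example
Y = [2, 4, 6, 1, 3, 4, 2]    # hypothetical Y values (not from the slides)

sum_x = sum(X)                                # sum of X
sum_y = sum(Y)                                # sum of Y
sum_diff = sum(x - y for x, y in zip(X, Y))   # sum of (X - Y), pair by pair
sum_x_sq = sum(x ** 2 for x in X)             # square each X, then add
sq_sum_x = sum(X) ** 2                        # add the Xs, then square

print(sum_x, sum_y, sum_diff, sum_x_sq, sq_sum_x)
```

Note that the sum of the differences equals the difference of the sums, whereas the sum of squares and the squared sum are very different numbers.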

23
Your turn
  • $\sum (XY)$
  • $(\sum (X - Y))^2$
  • $\sum (X^2 - Y^2)$

24
Double Subscripts
  • Sometimes things are made more complicated
    because capital letters (e.g., X) are sometimes
    used to refer to entire data sets (as opposed to
    single variables) and multiple subscripts are
    used to specify specific data points.

25
$X_{24} = 3$; $\sum\sum X$ or $\sum\sum X_{ij} = 61$
26
Measures of Central Tendency
  • While distributions provide an overall picture of
    some data set, it is sometimes desirable to
    represent the entire data set using descriptive
    statistics.
  • The first descriptive statistics we will discuss
    are those used to indicate where the centre of
    the distribution lies.

28
The Mode
  • There are, in fact, three different measures of
    central tendency.
  • The first of these is called the mode.
  • The mode is simply the value of the relevant
    variable that occurs most often (i.e., has the
    highest frequency) in the sample.

29
The Mode
  • Note that if you have done a frequency histogram,
    you can often identify the mode simply by finding
    the value with the highest bar.
  • However, that will not work when grouping was
    performed prior to plotting the histogram
    (although you can still use the histogram to
    identify the modal group, just not the modal
    value).

30
Finding the mode
  • Create a non-grouped frequency table as described
    previously, then identify the value with the
    greatest frequency.
  • Example: Class height.
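A minimal sketch of this procedure follows. The `heights` values are invented for illustration and are not the class data; ties are reported as several modes.

```python
from collections import Counter

# Hypothetical heights in inches (illustration only).
heights = [64, 66, 66, 67, 68, 68, 68, 70, 71]

# Non-grouped frequency table: value -> count.
freq = Counter(heights)

# The mode is every value that reaches the greatest frequency.
top = max(freq.values())
modes = sorted(value for value, count in freq.items() if count == top)

print(modes)  # [68]
```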

31
The Median
  • A second measure of central tendency is called
    the median.
  • The median is the point corresponding to the
    score that lies in the middle of the distribution
    (i.e., there are as many data points above the
    median as there are below the median).

32
The Median
  • To find the median, the data points must first be
    sorted into either ascending or descending
    numerical order.
  • The position of the median value can then be
    calculated using the following formula:
    $\text{Median location} = \frac{N + 1}{2}$

33
Examples
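  • 1) If there is an odd number of data points
  • (1, 3, 3, 4, 4, 5, 6, 7, 12)
  • Median location = (9 + 1)/2 = 5. The median is the item in the
    fifth position of the ordered data set, therefore the median is 4.
  • 2) If there is an even number of data points
  • The median location falls halfway between the two middle
    positions, and the median is the average of those two values.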
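The steps can be sketched as a small function. The location rule (N + 1)/2 is assumed here because the formula itself is missing from the transcript, and the `even` data set is hypothetical.

```python
def median(values):
    """Median via the 1-based median location (N + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2
    if n % 2 == 1:
        # Odd N: the location is a whole number, so take that item.
        return ordered[int(location) - 1]
    # Even N: the location falls between two items, so average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd = [1, 3, 3, 4, 4, 5, 6, 7, 12]   # data from the odd-number example
even = [1, 3, 3, 4, 5, 6, 7, 12]     # hypothetical even-number data

print(median(odd))   # 4
print(median(even))  # 4.5
```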

34
The Mean
  • Finally, the most commonly used measure of
    central tendency is called the mean (denoted $\bar{X}$ for
    a sample and $\mu$ for a population).
  • The mean is the same as what most of us call the
    average, and it is calculated in the following
    manner:
  • $\bar{X} = \frac{\sum X}{N}$

35
The Mean
  • For example, given the data set that we used to
    calculate the median (odd number example), the
    corresponding mean would be
    $\bar{X} = \frac{1 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 12}{9} = \frac{45}{9} = 5$
  • Similarly, the mean height of our class,
  • as indicated by our sample, is

36
Mode vs. Median vs. Mean
  • In our height example, the mode and median were
    the same, and the mean was fairly close to the
    mode and median.
  • This was the case because the height distribution
    was fairly symmetrical.
  • However, when the underlying distribution is not
    symmetrical, the three measures of central
    tendency can be quite different.
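To see the three measures drift apart, here is a small sketch using Python's standard `statistics` module. The `symmetric` list is made up; the `skewed` list is the odd-number data set from the median example, whose tail is to the right.

```python
import statistics

def summarize(data):
    """Return (modes, median, mean) for a list of numbers."""
    return (statistics.multimode(data),
            statistics.median(data),
            statistics.mean(data))

symmetric = [2, 3, 4, 4, 4, 5, 6]        # hypothetical, roughly bell-shaped
skewed = [1, 3, 3, 4, 4, 5, 6, 7, 12]    # positively skewed (tail to the right)

print(summarize(symmetric))
print(summarize(skewed))
```

For the symmetric data all three measures equal 4. For the skewed data the modes are 3 and 4, the median is 4, and the mean is pulled up to 5 by the value 12.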

37
  • This raises the issue of which measure is best.
  • Note that if you were calculating these values,
    you would show all your steps (it's good to be
    prof!).

38
Some Visual Demos
Here is a demonstration that allows you to change
a frequency histogram while simultaneously noting
the effects of those changes on the mean versus
the median. As you use the demo, you should
easily be able to think about how these changes
are also affecting the mode, right?
39
Measures of Variability
  • In addition to knowing where the centre of the
    distribution is, it is often helpful to know the
    degree to which individual values cluster around
    the centre.
  • This is known as variability.

40
Range
  • There are various measures of variability, the
    most straightforward being the range of the
    sample:
  • Range = highest value minus lowest value
  • While the range provides a good first pass at
    variability, it is not the best measure because of
    its sensitivity to extreme scores (see text).

41
The Average Deviation
  • Another approach to estimating variability is to
    directly measure the degree to which individual
    data points differ from the mean and then average
    those deviations.
  • That is: $\text{average deviation} = \frac{\sum (X - \bar{X})}{N}$

42
The Average Deviation
  • However, if we try to do this with real data, the
    result will always be zero, because the positive and
    negative deviations from the mean cancel out.
  • Example: (2, 3, 4, 4, 6, 6, 10)

43
The Mean Absolute Deviation (MAD)
  • One way to get around the problem with the
    average deviation is to use the absolute value of
    the differences, instead of the differences
    themselves.
  • The absolute value of some number is just the
    number without any sign.
  • For example, $|-3| = 3$

44
The Mean Absolute Deviation (MAD)
  • Thus, we could re-write and solve our average
    deviation question as follows:
  • $\text{MAD} = \frac{\sum |X - \bar{X}|}{N}$
  • The data set in question has a mean of 5 and a
    mean absolute deviation of 2.
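The contrast between the plain average deviation and the MAD can be sketched as follows. The data values are an assumption: they were chosen because they give exactly the mean of 5 and the MAD of 2 quoted on this slide.

```python
# Assumed data set: chosen so that the mean is 5 and the MAD is 2.
data = [2, 3, 4, 4, 6, 6, 10]

n = len(data)
mean = sum(data) / n

# Plain average deviation: positive and negative deviations cancel.
average_deviation = sum(x - mean for x in data) / n

# Mean absolute deviation: drop the signs before averaging.
mad = sum(abs(x - mean) for x in data) / n

print(mean, average_deviation, mad)
```

The deviations are -3, -2, -1, -1, 1, 1 and 5, which sum to zero; their absolute values sum to 14, and 14/7 = 2.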

45
The Variance
  • Although the MAD is an acceptable measure of
    variability, the most commonly used measure is
    variance (denoted $s^2$ for a sample and $\sigma^2$ for a
    population) and its square root, termed the
    standard deviation (denoted $s$ for a sample and $\sigma$
    for a population).

46
The Variance
  • The computation of variance is also based on the
    basic notion of the average deviation; however,
    instead of getting around the zero problem by
    using absolute deviations (as in MAD), the zero
    problem is eliminated by squaring the
    differences from the mean.
  • Specifically: $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

48
Alternate formula for $s^2$ and $s$
  • The definitional formula of variance just
    presented was $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
  • An equivalent formula that is easier to work with
    when calculating variances by hand is
    $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$
  • In both cases, $s = \sqrt{s^2}$.
  • Although this second formula may
    look more intimidating, a few examples
    will show you that it is actually easier to
    work with (as you'll see in assignment 2).
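As a quick check that the two versions agree, here is a sketch that computes the sample variance both ways, dividing by N - 1 in each case. The function names and the data are made up for illustration.

```python
import math

def variance_definitional(data):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_computational(data):
    """(Sum of X squared minus (sum of X) squared over N), divided by N - 1."""
    n = len(data)
    sum_x = sum(data)
    sum_x_sq = sum(x ** 2 for x in data)
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)

data = [2, 3, 4, 4, 6, 6, 10]   # hypothetical sample

s2_def = variance_definitional(data)
s2_comp = variance_computational(data)
s = math.sqrt(s2_def)           # standard deviation

print(s2_def, s2_comp, s)
```

By hand, the second version needs only two running totals (the sum of X is 35 and the sum of squared X is 217), giving (217 - 175)/6 = 7 without first computing every deviation.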

49
Visualizing Means and Standard Deviations
This demonstration allows you to play with the
mean and standard deviation of a distribution.
Note that changing the mean of the distribution
simply moves the entire distribution to the left
or right without changing its shape. In
contrast, changing the standard deviation alters
the spread of the data but does not affect where
the distribution is centered. Run the demo.
50
Estimating Population Parameters
  • So, the mean ($\bar{X}$) and variance ($s^2$) are the
    descriptive statistics that are most commonly
    used to represent the data points of some sample.
  • The real reason that they are the preferred
    measures of central tendency and variability is
    because of certain properties they have as
    estimators of their corresponding population
    parameters, $\mu$ and $\sigma^2$.

51
Estimating Population Parameters
  • Four properties are considered desirable in a
    population estimator: sufficiency, unbiasedness,
    efficiency, and resistance.
  • Both the mean and the variance are the best
    estimators in their class in terms of the first
    three of these four properties.
  • To understand these properties, you first need to
    understand a concept in statistics called the
    sampling distribution.

52
Sampling Distribution Demo
We will discuss sampling distributions off and on
throughout the course, and I only want to touch
on the notion now. Basically, the idea is this:
in order to examine the properties of a statistic we
often want to take repeated samples from some
population of data and calculate the relevant
statistic on each sample. We can then look at
the distribution of the statistic across these
samples and ask a variety of questions about
it. Check out this demonstration which I hope
makes the concept of sampling distributions more
clear.
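For a very small population, the sampling distribution can be listed exhaustively instead of simulated. The sketch below assumes a made-up population of four values and looks at every possible sample of size 2 drawn with replacement.

```python
from collections import Counter
from itertools import product

population = [2, 4, 6, 8]                 # hypothetical population
mu = sum(population) / len(population)    # population mean

# Every possible sample of size 2, drawn with replacement (16 in total).
samples = list(product(population, repeat=2))

# The statistic of interest, calculated on each sample.
sample_means = [sum(s) / len(s) for s in samples]

# The sampling distribution: how often each sample mean occurs.
sampling_distribution = Counter(sample_means)
for m in sorted(sampling_distribution):
    print(f"{m:>4} | {'*' * sampling_distribution[m]}")

mean_of_means = sum(sample_means) / len(sample_means)
print(mean_of_means, mu)
```

The sample means pile up around 5, and their average equals the population mean.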
53
Properties of a Statistic
  • 1) Sufficiency
  • A sufficient statistic is one that makes use of
    all of the information in the sample to estimate
    its corresponding parameter.

54
Estimating Population Parameters
  • 2) Unbiasedness
  • A statistic is said to be an unbiased estimator
    if its expected value (i.e., the mean of the statistic
    across many repeated samples, such as the mean of a large
    number of sample means) is equal to the population
    parameter it is estimating.
  • Explanation of $N-1$ in the $s^2$ formula.

55
Assessing the Bias of an Estimator
  • Using the procedure, the mean can be shown to be
    an unbiased estimator (see p 47).
  • However, if the more intuitive formula for $s^2$ is
    used: $\frac{\sum (X - \bar{X})^2}{N}$
  • it turns out to underestimate $\sigma^2$

56
Assessing the Bias of an Estimator
  • This bias to underestimate is caused by the act
    of sampling, and it can be shown that this bias
    can be eliminated if $N-1$ is used in the
    denominator instead of $N$.
  • Note that this is only true when calculating $s^2$.
    If you have a measurable population and you want
    to calculate $\sigma^2$, you use $N$ in the denominator,
    not $N-1$.
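The size of the bias can be seen exactly for a tiny population. This sketch assumes a made-up population of four values and averages both versions of the variance over every possible sample of size 2 (drawn with replacement); the helper `variance` is hypothetical and takes the denominator as an argument.

```python
from itertools import product

def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by `denominator`."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / denominator

population = [2, 4, 6, 8]                          # hypothetical population
sigma_sq = variance(population, len(population))   # population variance uses N

n = 2
samples = list(product(population, repeat=n))      # all 16 possible samples

# Expected value of each estimator = its average over all possible samples.
avg_with_n = sum(variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma_sq, avg_with_n, avg_with_n_minus_1)
```

Dividing by N gives an average of 2.5, half the true value of 5, while dividing by N - 1 gives exactly 5.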

57
Degrees of Freedom
  • The mean of 6, 8, 10 is 8.
  • If I allow you to change as many of these numbers
    as you want BUT the mean must stay 8, how many of
    the numbers are you free to vary?

58
Degrees of Freedom
  • The point of this exercise is that when the mean
    is fixed, it removes a degree of freedom from
    your sample -- this is like actually subtracting
    1 from the number of observations in your sample.
  • It is for exactly this reason that we use $N-1$ in
    the denominator when we calculate $s^2$ (i.e., the
    calculation requires that the mean be fixed first
    which effectively removes -- fixes -- one of the
    data points).
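The exercise can be made concrete with a short sketch: once the mean is fixed, any N - 1 values can be chosen freely, but the last one is forced. The function name `last_value` is made up for this illustration.

```python
def last_value(free_values, mean, n):
    """Value the final data point must take so that all n points average to `mean`."""
    return mean * n - sum(free_values)

# Start from 6, 8, 10 (mean 8) and change the first two numbers freely.
forced = last_value([1, 20], mean=8, n=3)

print(forced)                    # 3
print((1 + 20 + forced) / 3)     # 8.0
```

Only two of the three numbers are free to vary, which is N - 1 degrees of freedom.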

59
Estimating Population Parameters
  • 3) Efficiency
  • The efficiency of a statistic is reflected in the
    variance that is observed when one examines the
    values of that statistic (e.g., the means) across a
    number of independently chosen samples.
    The smaller the variance, the more efficient the
    statistic is said to be.
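A rough simulation can illustrate the idea by comparing the mean with the median as estimators of the centre. Everything about the setup is an assumption made for illustration: a normal population with mean 100 and standard deviation 15, samples of size 5, and 5000 samples per statistic.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

def sampling_variance(statistic, n_samples=5000, sample_size=5):
    """Variance of `statistic` across many independent random samples."""
    values = []
    for _ in range(n_samples):
        sample = [rng.gauss(100, 15) for _ in range(sample_size)]
        values.append(statistic(sample))
    return statistics.pvariance(values)

var_of_means = sampling_variance(statistics.mean)
var_of_medians = sampling_variance(statistics.median)

print(var_of_means, var_of_medians)
```

Under these assumptions the sample means vary less from sample to sample than the sample medians do, so the mean is the more efficient of the two.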

60
Estimating Population Parameters
  • 4) Resistance
  • The resistance of an estimator refers to the
    degree to which that estimate is affected by
    extreme values.
  • As mentioned previously, both $\bar{X}$ and $s^2$ are
    highly sensitive to extreme values.

61
Estimating Population Parameters
  • 4) Resistance
  • Despite this, they are still the most commonly
    used estimates of the corresponding population
    parameters, mostly because of their superiority
    over other measures in terms of sufficiency,
    unbiasedness, and efficiency.
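The lack of resistance is easy to demonstrate. In the sketch below, both data sets are hypothetical; the second differs from the first only in its largest value. The median is included as a more resistant point of comparison.

```python
import statistics

def describe(data):
    """Return the mean, sample variance (N - 1), and median of `data`."""
    return {"mean": statistics.mean(data),
            "variance": statistics.variance(data),
            "median": statistics.median(data)}

scores = [2, 3, 4, 4, 6, 6, 10]           # hypothetical sample
with_outlier = [2, 3, 4, 4, 6, 6, 100]    # same sample with one extreme value

before = describe(scores)
after = describe(with_outlier)

print(before)
print(after)
```

Changing a single value moves the mean from 5 to about 17.9 and inflates the variance enormously, while the median stays at 4.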