Chapter 1: Looking at Data Distributions - PowerPoint PPT Presentation

Slides: 61
Provided by: erich2
1
Chapter 1 Looking at Data Distributions
2
1. Introduction
  • Individuals: objects described by a set of data
    (people, animals, or things).
  • Variable: a characteristic of an individual. It
    can take on different values for different
    individuals.
  • Examples: age, height, gender, favorite class,
    speed, etc.

3
Types of Variables
  • Categorical Variable: an individual is placed
    into one of several groups or categories. These
    groups or categories are usually not numerical.

4
Types of Variables
  • Quantitative Variable (numeric): values can be
    added, subtracted, averaged, etc.
  • Discrete: takes on values that are spaced apart.
    That is, between two adjacent values of a
    discrete variable there is no other possible
    value.
  • Continuous: takes on all numbers in a given
    interval. That is, between any two values of a
    continuous variable there is always another
    possible value.

5
Types of Variables
  • Examples
                        Numeric
    Variable            Discrete    Continuous    Categorical
    Length
    Hours Enrolled
    Major
    Zip Code

6
Types of Variables
  • Examples
                        Numeric
    Variable            Discrete    Continuous    Categorical
    Length                          X
    Hours Enrolled      X
    Major                                         X
    Zip Code                                      X

7
Distribution of a Variable
  • The distribution of a variable tells us the
    possible values for the variable and the
    probability that the variable takes these values.

8
2. Describing Categorical Variable Distributions
  • Suppose we poll 50 students on an issue
    (Statistics is interesting!). How can we exhibit
    their responses?
  • Frequency Tables
  • give counts (31 agree), proportions (31/50 =
    0.62 agree), and percents (62% agree)
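
A frequency table like this can be sketched in a few lines of Python. The slide only reports that 31 of the 50 students agree, so the remaining 19 responses are lumped into a hypothetical "other" category purely for illustration.

```python
from collections import Counter

# Hypothetical raw responses: only "31 of 50 agree" comes from the slide;
# the 19 remaining responses are grouped as "other" for illustration.
responses = ["agree"] * 31 + ["other"] * 19

counts = Counter(responses)
n = len(responses)

# One row per category: (count, proportion, percent).
table = {
    category: (count, count / n, 100 * count / n)
    for category, count in counts.items()
}

for category, (count, proportion, percent) in table.items():
    print(f"{category:<6} {count:>3}  {proportion:.2f}  {percent:.0f}%")
```

The same three columns (count, proportion, percent) are what a bar chart or pie chart of the responses would display.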

9
2. Describing Categorical Variable Distributions
  • Suppose we poll 50 students on an issue
    (Statistics is interesting!). How can we exhibit
    their response?
  • Bar Chart
  • can have counts, percents, or proportions on the
    vertical axis

10
2. Describing Categorical Variable Distributions
  • Suppose we poll 50 students on an issue
    (Statistics is interesting!). How can we exhibit
    their response?
  • Pie Chart

11
3. Describing Numeric Variable Distributions
  • To describe a distribution we need 3 items:
  • Shape: modes, symmetric, skewed, outliers
  • Center: mean, median, mode
  • Spread: range, standard deviation, IQR

12
3. Describing Numeric Variable Distributions
  • Shape
  • Modes: major peaks in the distribution
  • Symmetric: the values smaller and larger than
    the midpoint are mirror images of each other
  • Skewed to the right: the right tail is much
    longer than the left tail
  • Skewed to the left: the left tail is much longer
    than the right tail

13
3. Describing Numeric Variable Distributions
  • Center
  • Mean: the arithmetic average. Add up the
    numbers and divide by the number of observations.
    If the n observations are $x_1, x_2, \dots, x_n$,
    their sample mean is
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  • Median: list the data from smallest to
    largest. If there is an odd number of data
    values, the median is the middle one in the list.
    If there is an even number of data values,
    average the middle two in the list.

14
3. Describing Numeric Variable Distributions
  • Spread
  • Range: the difference between the largest and
    smallest values.
  • Standard Deviation: measures spread by looking
    at how far observations are from their mean. The
    formula for the sample standard deviation, where
    $\bar{x}$ is the sample mean, is
    $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
  • Variance: the square of the standard deviation, $s^2$

15
3. Describing Numeric Variable Distributions
  • Spread
  • Interquartile Range (IQR): IQR = Q3 - Q1
  • Distance between the first quartile (Q1) and
    the third quartile (Q3).
  • Q1: 25% of the observations are less than
    Q1 and 75% are greater than Q1,
    or the median of the observations whose
    position in the ordered list is to the left of
    the location of the overall median.
  • Q3: 75% of the observations are less than
    Q3 and 25% are greater than Q3,
    or the median of the observations whose
    position in the ordered list is to the right of
    the location of the overall median.
  • Note: to compute Q1 and Q3 for an odd number of
    observations, the center value is excluded.

16
3. Describing Numeric Variable Distributions
  • Example: Suppose the ages of five students are
    20, 18, 22, 20, 23.

Mean = ?  Median = ?  Q1 = ?  Q3 = ?
Range = ?  Std. dev. = ?  Var. = ?  IQR = ?
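
One way to check the answers is a short Python sketch using the standard `statistics` module. The helper `quartiles` is a hypothetical name introduced here; it follows the rule stated on the IQR slide (medians of the lower and upper halves, with the center value excluded when n is odd), which may differ from the quartile conventions used by other software.

```python
from statistics import mean, median, stdev, variance

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data.

    For an odd number of observations the middle value is excluded.
    Assumes at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

ages = [20, 18, 22, 20, 23]
q1, q3 = quartiles(ages)

summary = {
    "mean": mean(ages),
    "median": median(ages),
    "Q1": q1,
    "Q3": q3,
    "range": max(ages) - min(ages),
    "std dev": stdev(ages),       # sample standard deviation (divides by n - 1)
    "variance": variance(ages),   # sample variance
    "IQR": q3 - q1,
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For these five ages the sketch gives a mean of 20.6, a median of 20, Q1 = 19, Q3 = 22.5, a range of 5, a variance of 3.8, and an IQR of 3.5.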
17
3. Describing Numeric Variable Distributions
  • Examples for Q1 and Q3
  • Even number of observations:
  • The highway mileages of the 18 cars, arranged in
    increasing order, are
  • 13 13 16 19 21 21 23 23 24 26 26 27
    27 27 28 28 30 30
  • Find Q1 and Q3.
  • Odd number of observations:
  • The highway mileages of the 11 trucks, arranged
    in increasing order, are
  • 22 22 23 24 24 25 27 28 28 30 31
  • Find Q1 and Q3.
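
The two cases can be worked through with a small sketch. The function name `quartiles` is an assumption of this example; it applies the rule from the IQR slide, splitting the ordered list at the median and dropping the center value when the number of observations is odd.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2   # size of each half; center value dropped if n is odd
    return median(ordered[:half]), median(ordered[-half:])

cars = [13, 13, 16, 19, 21, 21, 23, 23, 24, 26, 26, 27,
        27, 27, 28, 28, 30, 30]
trucks = [22, 22, 23, 24, 24, 25, 27, 28, 28, 30, 31]

print(quartiles(cars))    # (21, 27)
print(quartiles(trucks))  # (23, 28)
```

With 18 cars each half has nine values, so Q1 and Q3 are the fifth value of each half. With 11 trucks the overall median (25) is set aside and each half has five values.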

18
3. Describing Numeric Variable Distributions
  • Another example in the book shows how much 50
    consecutive shoppers spent in a store. The data
    appear as follows

19
3. Describing Numeric Variable Distributions
  • How can we describe the distribution of these 50
    numbers?

20
3. Describing Numeric Variable Distributions
  • How can we describe the distribution of these 50
    numbers?
  • The 50th percentile is also called the median:
    the middle data value when the data are ordered
    from smallest to largest.
  • The 25th and 75th percentiles are also called Q1
    and Q3, respectively: the middle data value of
    each half.

21
3. Describing Numeric Variable Distributions
  • How can we describe the distribution of these 50
    numbers?
  • Stemplot (discard decimals)
  • 1. Separate each observation into a stem and a
    leaf. Stems may have as many digits as needed,
    but each leaf contains only a single digit.
  • 2. Write the stems in a vertical column.
  • 3. Write each leaf in the row to the right of its
    stem, in increasing order out from the stem.
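
The three steps can be sketched in Python as follows. The shoppers' data are not reproduced in this transcript, so the example reuses the 18 car highway mileages from an earlier slide; the function name `stemplot` and the choice of the tens digit as the stem are assumptions of this sketch, and it expects non-negative values.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows, using the ones digit as leaf and the rest as stem."""
    leaves = defaultdict(list)
    # Sorting first means leaves are appended in increasing order (step 3).
    for value in sorted(int(v) for v in values):   # int() discards decimals
        stem, leaf = divmod(value, 10)             # step 1: split stem and leaf
        leaves[stem].append(leaf)
    rows = []
    # Step 2: stems in a vertical column, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

cars = [13, 13, 16, 19, 21, 21, 23, 23, 24, 26, 26, 27,
        27, 27, 28, 28, 30, 30]

for row in stemplot(cars):
    print(row)
# 1 | 3369
# 2 | 113346677788
# 3 | 00
```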

22
3. Describing Numeric Variable Distributions
  • How can we describe the distribution of these 50
    numbers?
  • Histogram
  • Breaks the range of values of a variable into
    intervals and displays only the count or percent
    of the observations in each interval.
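
As a rough sketch of what a histogram computes, the counts per interval can be tallied by hand. The data are the 18 car highway mileages from an earlier slide (the shoppers' data are not in this transcript), and the class intervals 10-14, 15-19, and so on are an arbitrary choice made for illustration; `histogram_counts` is a hypothetical helper name.

```python
def histogram_counts(values, start, width, bins):
    """Count observations in `bins` equal-width intervals [lo, lo + width)."""
    counts = [0] * bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < bins:          # ignore values outside the covered range
            counts[index] += 1
    return counts

cars = [13, 13, 16, 19, 21, 21, 23, 23, 24, 26, 26, 27,
        27, 27, 28, 28, 30, 30]

# Intervals 10-14, 15-19, 20-24, 25-29, 30-34 (an assumed choice of classes).
counts = histogram_counts(cars, start=10, width=5, bins=5)

for i, count in enumerate(counts):
    lo = 10 + 5 * i
    print(f"{lo}-{lo + 4}: {'*' * count}")
```

Changing the interval width changes the appearance of the histogram, which is why the choice of classes is a judgment call.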

23
3. Describing Numeric Variable Distributions
  • How can we describe the distribution of these 50
    numbers?
  • Box Plot (made up of min., Q1, median, Q3, and
    max.)
  • These five numbers are called the five-number
    summary.

24
3. Describing Numeric Variable Distributions
  • Outliers: observations that are unusually far
    from the bulk of the data.
  • What are some possible explanations for outliers?
  • The data point was recorded wrong.
  • The data point wasn't actually a member of the
    population we were trying to sample.
  • We just happened to get an extreme value in our
    sample.
  • The 1.5 × IQR Criterion for Outliers: designate
    an observation a suspected outlier if it falls
    more than 1.5 × IQR below the first quartile or
    above the third quartile.
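
The 1.5 × IQR criterion can be sketched as below. The function name `suspected_outliers` is hypothetical, and the quartiles are computed as medians of the lower and upper halves (center value excluded for odd n), as on the IQR slide. The data set 3, 3, 5, 6, 7, 80 appears on a later slide of this deck.

```python
from statistics import median

def suspected_outliers(values):
    """Return the observations flagged by the 1.5 x IQR criterion."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # anything below this is suspect
    high_fence = q3 + 1.5 * iqr    # anything above this is suspect
    return [v for v in ordered if v < low_fence or v > high_fence]

data = [3, 3, 5, 6, 7, 80]
print(suspected_outliers(data))  # [80]
```

Here Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are -3 and 13; only 80 falls outside them.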

25
3. Describing Numeric Variable Distributions
  • Now, we examine the appearance of other data
  • This example is bimodal: it has two modes.

26
3. Describing Numeric Variable Distributions
  • Now, we examine the appearance of other data
  • This example has only one mode: it is unimodal.

27
3. Describing Numeric Variable Distributions
  • Now, we examine the appearance of other data
  • This example is called right skewed since the
    distribution has a long right tail.

28
3. Describing Numeric Variable Distributions
  • Now, we examine the appearance of other data
  • This is an example of a boxplot that is skewed
    to the right.

29
3. Describing Numeric Variable Distributions
  • Symmetry versus Skewness

__________
__________
___________
30
3. Describing Numeric Variable Distributions
  • Symmetry versus Skewness

__________
__________
____________
31
3. Describing Numeric Variable Distributions
  • Mean versus Median for Different Distributions
  • Left skewed: mean < median
  • Symmetric: mean = median
  • Right skewed: mean > median
32
3. Describing Numeric Variable Distributions
  • Example
  • Calculate the mean, median, std. dev. and IQR of
    these two data sets:
  • 3, 3, 5, 6, 7
  • 3, 3, 5, 6, 7, 80
  • Conclusion
  • The mean and std. dev. can't resist the influence
    of outliers. That is, the mean and std. dev. are
    not resistant.
  • The median and IQR are better than the mean and
    std. dev. for describing a skewed distribution or
    a distribution with strong outliers. Use the mean
    and std. dev. only for reasonably symmetric
    distributions that are free of outliers.
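
The comparison can be reproduced with a short sketch. The helper `iqr` is a hypothetical name and uses the quartile rule from the IQR slide (medians of the two halves, center value excluded for odd n).

```python
from statistics import mean, median, stdev

def iqr(values):
    """Interquartile range using medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

without_outlier = [3, 3, 5, 6, 7]
with_outlier = [3, 3, 5, 6, 7, 80]

for label, data in [("without 80", without_outlier), ("with 80", with_outlier)]:
    print(f"{label}: mean={mean(data):.2f}  median={median(data)}  "
          f"sd={stdev(data):.2f}  IQR={iqr(data)}")
```

Adding the single value 80 moves the mean from 4.8 to about 17.3 and inflates the standard deviation many times over, while the median only moves from 5 to 5.5 and the IQR from 3.5 to 4.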

33
3. Describing Numeric Variable Distributions
  • Measures of Center and Spread
  • To describe distributions use:
  • Left skewed: median and IQR
  • Symmetric: mean and standard deviation
  • Right skewed: median and IQR
34
3. Describing Numeric Variable Distributions
  • Summary
  • Shape (usually histogram, boxplot, and stemplot)
  • Mode: unimodal, bimodal?
  • Symmetric? Left skewed? Right skewed?
  • Center (descriptives)
  • Spread (descriptives)
  • Outliers (boxplot)

35
4. The Normal Distribution
  • Sometimes the overall pattern of a large number
    of observations is so regular that we can
    describe it by a smooth curve.
  • A density curve is a curve that is always on or
    above the horizontal axis and has area exactly 1
    underneath it.
  • A density curve describes the overall pattern of
    a distribution. The area under the curve and
    above any range of values is the relative
    frequency of all observations that fall in that
    range.

36
4. The Normal Distribution
  • Normal curves are density curves that are
    symmetric, unimodal and bell-shaped. They
    describe normal distributions.
  • A normal curve is specified by giving its mean μ
    and its standard deviation σ.
  • We often write that a variable (call it X) has a
    normal distribution with mean μ and variance σ²
    in the following way:
    $X \sim N(\mu, \sigma^2)$

37
4. The Normal Distribution
  • Here are some examples of normal distributions

[Figure: three normal curves: N(0, 1²) with μ = 0, σ = 1; N(3, 2²) with μ = 3, σ = 2; and N(-2, 0.5²) with μ = -2, σ = 0.5]
38
4. The Normal Distribution
  • Empirical Rule (The 68-95-99.7 Rule): if the
    distribution is normal, then
  • Approximately 68% of the data falls within one
    standard deviation of the mean
  • Approximately 95% of the data falls within two
    standard deviations of the mean
  • Approximately 99.7% of the data falls within
    three standard deviations of the mean
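
The three percentages can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); it works on the standard normal curve, which is enough because the rule is stated in units of standard deviations.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal distribution

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {100 * p:.1f}%")
```

The computed areas are about 68.3%, 95.4%, and 99.7%, so the rule's 68 and 95 are rounded values.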

39
4. The Normal Distribution: Empirical Rule for
N(0, 1²)
40
4. The Normal Distribution
  • If x is an observation from a distribution that
    has mean μ and standard deviation σ, the
    standardized value of x is
    $z = \frac{x - \mu}{\sigma}$

This is known as a z-score. A z-score is
literally how many standard deviations an
observation is from its mean. Z-scores are
measures of relative standing.
  • The standard normal distribution is the normal
    distribution with mean 0 and standard
    deviation 1: Z ~ N(0, 1²).

41
4. The Normal Distribution
  • If a variable X has any normal distribution
    N(μ, σ²), then the standardized variable
    $Z = \frac{X - \mu}{\sigma}$
  • has the standard normal distribution.

42
4. The Normal Distribution
  • For N(0, 1²) we can find approximate probabilities
    associated with different values of Z using the
    Empirical Rule.

43
4. The Normal Distribution
  • We can find the approximate probability that Z is
    to the left of any integer using the Empirical
    Rule.

P(Z < -4.00) ≈ 0%
P(Z < -3.00) ≈ 0.15%
P(Z < -2.00) ≈ 2.5%
P(Z < -1.00) ≈ 16%
P(Z < 0.00) ≈ 50%
P(Z < 1.00) ≈ 84%
P(Z < 2.00) ≈ 97.5%
P(Z < 3.00) ≈ 99.85%
P(Z < 4.00) ≈ 100%
44
4. The Normal Distribution
  • We can find the approximate probability that Z is
    to the right of any integer using symmetry.
  • P(Z > z) = P(Z < -z)
  • For example, P(Z > 3.00) = P(Z < -3.00)
  • We can find the approximate probability that Z is
    between any two integers using the Empirical
    Rule.
  • P(a < Z < b) = P(Z < b) - P(Z < a)
  • Examples
  • P(Z > 3.00) ≈ 0.15%     P(-2.00 < Z < 1.00) ≈ 81.5%
  • P(Z > -1.00) ≈ 84%      P(2.00 < Z < 3.00) ≈ 2.35%
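
The two identities above are simple arithmetic on the Empirical Rule's left-tail values, as this sketch shows. The lookup table `LEFT_TAIL` and the helper names are introduced here for illustration; the percentages are the approximate Empirical Rule values for integer z, not exact normal probabilities.

```python
# Approximate left-tail percentages P(Z < z) from the Empirical Rule.
LEFT_TAIL = {-4: 0.0, -3: 0.15, -2: 2.5, -1: 16.0, 0: 50.0,
             1: 84.0, 2: 97.5, 3: 99.85, 4: 100.0}

def right_tail(z):
    """P(Z > z) = P(Z < -z), by symmetry of the normal curve."""
    return LEFT_TAIL[-z]

def between(a, b):
    """P(a < Z < b) = P(Z < b) - P(Z < a)."""
    return LEFT_TAIL[b] - LEFT_TAIL[a]

print(right_tail(3))              # 0.15
print(right_tail(-1))             # 84.0
print(round(between(-2, 1), 2))   # 81.5
print(round(between(2, 3), 2))    # 2.35
```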

45
4. The Normal Distribution: example
  • P(Z < 1.25) = ?       P(Z > 0.25) = ?
  • A. 0.4840             A. 0.7040
  • B. 0.8944             B. 0.0217
  • C. 0.9989             C. 0.8485
  • D. 0.1736             D. 0.4013
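
These z-values are not integers, so the Empirical Rule does not apply directly and a standard normal table or software is needed. As a sketch, the Python standard library's `statistics.NormalDist` (Python 3.8 or later) can stand in for the table:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

p_left = z.cdf(1.25)         # P(Z < 1.25)
p_right = 1 - z.cdf(0.25)    # P(Z > 0.25), by the complement rule

print(f"P(Z < 1.25) = {p_left:.4f}")   # 0.8944
print(f"P(Z > 0.25) = {p_right:.4f}")  # 0.4013
```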

46
4. The Normal Distribution
  • Now suppose we know X ~ N(μ, σ²) and we want to
    know P(X < x), P(X > x) and P(x1 < X < x2).
  • We can first convert X to Z and then use the
    probabilities from the Empirical Rule.
  • Recall that if X ~ N(μ, σ²), then
    $Z = \frac{X - \mu}{\sigma} \sim N(0, 1^2)$
  • so we have
    $P(X < x) = P\left(Z < \frac{x - \mu}{\sigma}\right)$

47
4. The Normal Distribution: examples
  • Suppose X ~ N(3, 2²). Find the probability
    that X is less than 5.
  • Suppose X ~ N(-1, 5²). Find the probability
    that X is greater than 11.
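
Both problems follow the same pattern: standardize, then look up the probability for Z. A minimal sketch, with `z_score` as a hypothetical helper name and `statistics.NormalDist` used in place of a printed table:

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x is from mu."""
    return (x - mu) / sigma

standard = NormalDist()   # Z ~ N(0, 1^2)

# X ~ N(3, 2^2): P(X < 5)
z1 = z_score(5, mu=3, sigma=2)      # 1.0
p1 = standard.cdf(z1)

# X ~ N(-1, 5^2): P(X > 11)
z2 = z_score(11, mu=-1, sigma=5)    # 2.4
p2 = 1 - standard.cdf(z2)

print(f"z = {z1}, P(X < 5)  = {p1:.4f}")
print(f"z = {z2}, P(X > 11) = {p2:.4f}")
```

The first z-score is exactly 1, so the Empirical Rule's 84% agrees with the computed 0.8413. The second z-score is 2.4, which is not an integer; the Empirical Rule only brackets the answer between 0.15% and 2.5%, and the computed value is about 0.0082.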

48
4. The Normal Distribution
  • We will look at some more difficult examples
  • Suppose X ~ N(2, 3²).
  • Given a value z, find the corresponding x that it
    came from.
  • Say z = 5, x = ?
  • How many standard deviations is x from μ?
  • Say x = 10.
  • Find Pr(X < -4 or X > 8).
  • Find Pr(-4 < X < 8).
  • Find the x such that Pr(X < x) ≈ 0.84.
  • Find the x such that Pr(X > x) ≈ 0.84.
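
These exercises can be checked with a short sketch. It uses `statistics.NormalDist` for exact normal probabilities, so the results are close to, but not identical with, the rounded Empirical Rule answers; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

mu, sigma = 2, 3
X = NormalDist(mu, sigma)            # X ~ N(2, 3^2)

x_from_z = mu + 5 * sigma            # z = 5   ->  x = mu + z * sigma
z_from_x = (10 - mu) / sigma         # x = 10  ->  z = (x - mu) / sigma

p_inside = X.cdf(8) - X.cdf(-4)      # P(-4 < X < 8): within 2 standard deviations
p_outside = 1 - p_inside             # P(X < -4 or X > 8)

x_84_below = X.inv_cdf(0.84)         # P(X < x) = 0.84, near mu + sigma
x_84_above = X.inv_cdf(1 - 0.84)     # P(X > x) = 0.84, near mu - sigma

print(x_from_z, round(z_from_x, 2))
print(round(p_inside, 3), round(p_outside, 3))
print(round(x_84_below, 2), round(x_84_above, 2))
```

Since -4 and 8 are each two standard deviations from the mean, the Empirical Rule gives about 95% inside and 5% outside. Likewise, 84% to the left corresponds to z ≈ 1 (x ≈ 5), and 84% to the right corresponds to z ≈ -1 (x ≈ -1).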

49
Go back to 3. Describing Numeric Variable
Distributions
  • How can we describe the distribution of these 50
    numbers?
  • Normal Quantile Plot (this compares the
    distribution of the sample to the Normal
    distribution)
  • The straight line represents the Normal
    distribution; compare the dots to the line.

50
Go back to 3. Describing Numeric Variable
Distributions
  • Summary
  • Shape (histogram, boxplot, stemplot, and normal
    quantile plot)
  • Mode: unimodal, bimodal?
  • Symmetric? Left skewed? Right skewed?
  • Are the data normally distributed?
  • Center (descriptives)
  • Spread (descriptives)
  • Outliers (boxplot)

51
5. Distribution Properties
  • Shift Changes: adding or subtracting a number
    from each of the values. If c > 0, then

[Figure: the same distribution centered at mean - c, mean, and mean + c]
52
5. Distribution Properties
  • The mean, median, Q1, Q3, maximum, and minimum
    all shift when there is a shift change. The
    shift change, say c, is added to or subtracted
    from each of the statistics accordingly.
  • The measures of spread (standard deviation,
    variance, IQR, and range) do not change when
    there is a shift change.

53
5. Distribution Properties
  • Scale Changes: multiplying or dividing each of
    the values by a number. If c > 1, then

[Figure: distributions centered at mean / c, mean, and mean × c]
54
5. Distribution Properties
  • The mean, median, Q1, Q3, maximum, and minimum
    all change when there is a scale change unless
    they are zero. Each is multiplied or divided by
    the scale change c.
  • The measures of spread (standard deviation,
    variance, IQR, and range) always change when
    there is a scale change. The standard deviation,
    IQR, and range are multiplied or divided by the
    scale change c. The variance is multiplied or
    divided by c².
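
Both kinds of change can be illustrated on a small data set. The five weights below are hypothetical values chosen for this sketch (they are not the football team's data), with a shift of c = 10 and a scale factor of c = 2.

```python
from statistics import mean, median, stdev, variance

data = [170, 200, 240, 280, 350]     # hypothetical weights, for illustration only

shifted = [x + 10 for x in data]     # shift change: add c = 10 to every value
scaled = [x * 2 for x in data]       # scale change: multiply every value by c = 2

def describe(values):
    """Measures of center and spread for one data set."""
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "var": variance(values),
        "range": max(values) - min(values),
    }

for label, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(label, describe(values))
```

In the output, the shift moves the mean and median by 10 and leaves the standard deviation, variance, and range alone. The scale change doubles the mean, median, standard deviation, and range, and multiplies the variance by 4.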

55
5. Distribution Properties
  • Suppose we measure the weight of everyone on a
    football team and obtain the following statistics
    for a team report
  • Mean: 230 lbs.            Median: 240 lbs.
  • Std. Dev.: 50 lbs.        Q1: 200 lbs., Q3: 280 lbs.
  • Variance: 2500 lbs²       IQR: 80 lbs.
  • Min.: 170 lbs.            Range: 180 lbs.
  • Max.: 350 lbs.

56
5. Distribution Properties
  • Now suppose we found out the scale was 10 lbs.
    under so we need to add 10 lbs. to every weight.
    What would happen to each of the following
    statistics?

Original              After Shift Change
Mean = 230 lbs.       Mean = ________
Median = 240 lbs.     Median = ________
s = 50 lbs.           s = ________
Q1 = 200 lbs.         Q1 = ________
Q3 = 280 lbs.         Q3 = ________
57
5. Distribution Properties
  • Now suppose we found out the scale was 10 lbs.
    under so we need to add 10 lbs. to every weight.
    What would happen to each of the following
    statistics?

Original               After Shift Change
Variance = 2500 lbs²   Variance = ________
IQR = 80 lbs.          IQR = ________
Min = 170 lbs.         Min = ________
Max = 350 lbs.         Max = ________
Range = 180 lbs.       Range = ________
58
5. Distribution Properties
  • Further, suppose we found out that we are
    supposed to report the weights and statistics in
    kilograms, not lbs. (Remember, 1 lb ≈ 0.45
    kilograms.) What would happen to each of the
    following statistics?

After Shift Change    After Shift and Scale Change
Mean = 240 lbs.       Mean = ______________
Median = 250 lbs.     Median = ______________
s = 50 lbs.           s = ______________
Q1 = 210 lbs.         Q1 = ______________
Q3 = 290 lbs.         Q3 = ______________
59
5. Distribution Properties
  • Further, suppose we found out that we are
    supposed to report the weights and statistics in
    kilograms, not lbs. (Remember, 1 lb ≈ 0.45
    kilograms.) What would happen to each of the
    following statistics?

After Shift Change     After Shift and Scale Change
Variance = 2500 lbs²   Variance = ______________
IQR = 80 lbs.          IQR = ______________
Min = 180 lbs.         Min = ______________
Max = 360 lbs.         Max = ______________
Range = 180 lbs.       Range = ______________
60
Linear Transformations
  • If you are given a mean, $\bar{x}$ (or $\mu$), and a
    standard deviation, $s$ (or $\sigma$), and want to
    convert your data so you have a new mean,
    $\bar{x}_{new}$ (or $\mu_{new}$), and a new standard
    deviation, $s_{new}$ (or $\sigma_{new}$), all you need
    is to remember what shift and scale changes
    affect.
  • In our linear transformation formula
    $x_{new} = a + b x$
  • a is the shift change
  • b is the scale change
  • Standard deviations are only affected by scale
    changes, but means are affected by both shift and
    scale changes.
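
Finding the shift and scale for a target mean and standard deviation can be sketched as follows. The transcript does not show the transformation formula itself, so this sketch assumes the common form x_new = a + b·x with b > 0; the function name and the target values (mean 100, standard deviation 15) are hypothetical.

```python
from statistics import mean, stdev

def linear_transform_coefficients(mean_old, sd_old, mean_new, sd_new):
    """Find shift a and scale b so that x_new = a + b * x (assumes b > 0)."""
    b = sd_new / sd_old            # only the scale change affects the sd
    a = mean_new - b * mean_old    # the shift supplies the rest of the new mean
    return a, b

# Hypothetical targets: from mean 230 and sd 50 to mean 100 and sd 15.
a, b = linear_transform_coefficients(230, 50, 100, 15)
print(a, b)

# Check on made-up data: transform it to the target mean and sd.
data = [170, 200, 240, 280, 350]
a2, b2 = linear_transform_coefficients(mean(data), stdev(data), 100, 15)
transformed = [a2 + b2 * x for x in data]
print(round(mean(transformed), 6), round(stdev(transformed), 6))
```

The scale b is fixed first because the standard deviation ignores shifts; the shift a is then whatever is needed to land the mean on its target.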