Describing Data - PowerPoint PPT Presentation

About This Presentation
Title:

Describing Data

Description:

Babe Ruth had an average of 43.9 home runs per year. ... This means that Babe Ruth's home run values would typically vary 11.25 home runs ... – PowerPoint PPT presentation

Slides: 36
Provided by: cdaMr
Category:
Tags: babe | data | describing


Transcript and Presenter's Notes

Title: Describing Data


1
Describing Data
  • Numerical Summaries

2
Numerical Summaries
  • Measures of center
      • Mean
      • Median
  • Measures of spread
      • Standard deviation
      • IQR (Inter-Quartile Range)

3
Mean
  • A measure of center, or typical value, is the
    mean, also known as the average or x-bar.
  • For numeric data values x_1, x_2, ..., x_n, the mean
    is defined as
    \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
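
As a minimal sketch of the definition above, the mean can be computed by summing the data values and dividing by n. The function name `mean` and the sample list are illustrative assumptions, not taken from the slides.

```python
def mean(values):
    """Return the arithmetic mean (x-bar): the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean needs at least one data value")
    return sum(values) / n


# Hypothetical data, chosen only to make the arithmetic easy to follow.
sample = [2, 4, 6, 8]
print(mean(sample))  # 20 / 4 = 5.0
```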

4
Examples
  • Babe Ruth had an average of 43.9 home runs per
    year.
  • Roger Maris had an average of 26.1 home runs per
    year.
  • Barry Bonds: average = 658/18 ≈ 36.55 HR per year.

5
Median
  • The median, m, is defined as the middle value
    in an ordered list. To obtain the median:
  • Order the data from smallest to largest.
  • Determine the number of data values, n.
  • If n is odd, the median is located at position
    (n+1)/2 in the ordered list.
  • If n is even, the middle position (n+1)/2 falls
    halfway between two positions; the median is the
    average of the two values on either side of it.
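
The steps above can be sketched in Python as follows. This is one straightforward way to follow the (n+1)/2 rule, not necessarily how any particular statistics package does it; the function name `median` is an assumption, and the two test lists are the ones used in the worked examples in these slides.

```python
def median(values):
    """Return the median using the (n+1)/2 middle-position rule."""
    ordered = sorted(values)          # step 1: order smallest to largest
    n = len(ordered)                  # step 2: count the data values
    if n == 0:
        raise ValueError("the median needs at least one data value")
    if n % 2 == 1:
        # n odd: (n+1)/2 is a whole 1-based position; subtract 1 for 0-based indexing.
        position = (n + 1) // 2
        return ordered[position - 1]
    # n even: (n+1)/2 ends in .5, so average the values on either side of it.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median([11, 8, 6, 10, 7]))  # odd n = 5, third ordered value: 8
print(median([11, 7, 8, 9]))      # even n = 4, average of 8 and 9: 8.5
```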

6
Example
  • Consider a list with an odd number of elements
    like 11, 8, 6, 10, 7 so that n = 5.
  • The ordered list is 6, 7, 8, 10, 11.
  • The middle position is (n+1)/2 = (5+1)/2 = 6/2 = 3,
    the third position.
  • The median is therefore m = 8.

7
Example
  • Consider the list 11, 7, 8, 9 with n = 4 elements,
    an even number.
  • The ordered list is 7, 8, 9, 11.
  • The middle position is (n+1)/2 = (4+1)/2 = 5/2 = 2.5.
  • This means the two middle values are 8 and 9.
  • The average of 8 and 9 is (8+9)/2 = 17/2 = 8.5, so
    the median is m = 8.5.

8
Examples
  • Median for Babe Ruth is m = 46, with n = 15, an odd
    number. Middle is at position (15+1)/2 = 8.
  • Median for Roger Maris is m = 24.5, with n = 10, an
    even number. Middle position is (10+1)/2 = 11/2 = 5.5.
    Median is the average of 23 and 26, m = 24.5. So here
    the median is not a value in the data list.

9
Why Do We Need Both Measures?
  • Motivating example dataset: 12.9, 13.2, 14.4,
    16.5, 17.1.
  • Mean is 74.1/5 = 14.82.
  • Median is m = 14.4; the middle is at position (5+1)/2 = 3.

10
Modify Dataset
  • Add an additional value to the data, 50.2.
  • Mean is now 124.3/6 = 20.72 (old mean = 14.82).
  • Median is now (14.4+16.5)/2 = 15.45 (old median =
    14.4).
  • Notice that both the mean and the median increased
    when the additional large number was added to the
    set. This should make sense.

11
Modify Again
  • Now instead of the value 50.2, suppose we take our
    original list 12.9, 13.2, 14.4, 16.5, 17.1 and
    supply an additional value of
    374,558,222,999,999,999!
  • What happens to the mean?
  • What happens to the median?
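
One way to answer these two questions is to compute both summaries for all three versions of the dataset. The sketch below uses Python's standard `statistics` module; the variable names are illustrative assumptions.

```python
from statistics import mean, median

# Datasets taken from the slides: the original list and its two modifications.
original = [12.9, 13.2, 14.4, 16.5, 17.1]
with_large = original + [50.2]
with_huge = original + [374_558_222_999_999_999]

for label, data in [
    ("original", original),
    ("with 50.2 added", with_large),
    ("with the huge value added", with_huge),
]:
    # Print both measures of center side by side for comparison.
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data):.2f}")
```

The median is 15.45 for both modified lists, because only the position of the added value in the ordered list matters, not its size. The mean moves from 14.82 to about 20.72 and then to roughly 6.2 × 10^16.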

12
Behavior
  • The previous example shows that the mean is VERY
    sensitive to weirdo data values.
  • The median was not affected very much by the
    weird data value, or outlier. The median is a
    couch potato.

13
Example
14
Usage
  • When should you use the mean? Or median?
  • If your data has substantial skewness or outliers
    you should use the median to describe center.
  • If the data is regular, mound shaped with no
    strange features, the mean has a slight
    theoretical advantage.
  • Some statisticians suggest always using the
    median.

15
Spread: Standard Deviation
  • For a data list like x_1, x_2, ..., x_n,
  • the standard deviation S is defined as
    S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

16
Properties of S
  • S ≥ 0.
  • The smallest S can be is S = 0. This means there is
    no variation in the data at all: all values are the
    same.
  • S is very sensitive to weird values and skewness.
  • The formula for S shows that it is almost the
    average deviation around x-bar. This is a good
    way to think about S, it is the average amount
    that data values vary around the average, x-bar.
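
To make the "almost the average deviation" idea concrete, here is a minimal sketch of the sample standard deviation. It assumes the usual n-1 divisor, which is an assumption consistent with the word "almost" above rather than something stated explicitly in the transcript. The function name `sample_std` is hypothetical, and the first dataset is the motivating example from the earlier slides.

```python
import math


def sample_std(values):
    """Sample standard deviation S, dividing by n - 1 (assumed convention)."""
    n = len(values)
    if n < 2:
        raise ValueError("S needs at least two data values")
    x_bar = sum(values) / n
    # Squared deviation of each value from the mean x-bar.
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    # Dividing by n - 1 instead of n is why S is "almost" an average.
    return math.sqrt(sum(squared_deviations) / (n - 1))


data = [12.9, 13.2, 14.4, 16.5, 17.1]
print(round(sample_std(data), 2))   # about 1.9
print(sample_std([5, 5, 5, 5]))     # identical values give S = 0.0
```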

17
Interpretation of S
  • S is almost the average deviation of data
    values around x-bar.
  • S is the typical amount that data values vary
    around x-bar.
  • If S = 3, we would say that data values typically
    vary by 3 units above or below the average x-bar.

18
S Examples
  • Babe Ruth home run data: S = 11.25 home runs. This
    means that Babe Ruth's home run values would
    typically vary 11.25 home runs above or below his
    average.
  • I expect S to be computed by a computer program
    like StatCrunch. I don't expect it to be
    computed by hand on a calculator.

19
Percentiles
  • The p-th percentile is defined to be the value
    such that p percent of the data has values less
    than or equal to that value.
  • For example, the 50th percentile (the median)
    means half the data has values less than or equal
    to this number.
  • I am at about the 90th percentile for height for
    males.

20
IQR
  • The IQR (Inter-Quartile Range) is defined as
    IQR = Q3 - Q1.
  • Q3 is the value at the 75th percentile.
  • Q1 is the value at the 25th percentile.
  • FYI, Q2 is the median, the 50th percentile.
  • These Qs break the dataset into four parts,
    hence the name quartiles.
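
A short sketch of the IQR calculation, using `statistics.quantiles` from the Python standard library (Python 3.8+). Software packages use slightly different conventions for locating quartiles, so the `method="inclusive"` choice here is an assumption and may not match StatCrunch exactly. The helper name `iqr` is hypothetical, and the dataset is the motivating example from the earlier slides.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range, IQR = Q3 - Q1, using the 'inclusive' quartile rule."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [12.9, 13.2, 14.4, 16.5, 17.1]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)            # Q2 is the median of the list
print(round(iqr(data), 2))   # spread of the middle half of the data
```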

21
IQR
  • The IQR is the distance or spread in the middle
    half of the data.
  • IQR is also a measure of spread like S. If IQR
    gets larger, there is more spread in the data.

22
IQR Properties
  • IQR is NOT sensitive to outliers or skewness.
    The values are more stable in cases like these
    than the standard deviation S.
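
The stability claim can be checked on the earlier motivating dataset by adding the value 50.2 and comparing how far each measure of spread moves. This is a sketch only: the quartile rule (`method="inclusive"`) and the n-1 divisor used by `statistics.stdev` are assumptions about convention, and `iqr` is a hypothetical helper.

```python
from statistics import quantiles, stdev


def iqr(values):
    """IQR = Q3 - Q1 using the 'inclusive' quartile rule (assumed convention)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


original = [12.9, 13.2, 14.4, 16.5, 17.1]
with_outlier = original + [50.2]

for label, data in [("original", original), ("with 50.2 added", with_outlier)]:
    # stdev() is the sample standard deviation (divides by n - 1).
    print(f"{label}: S = {stdev(data):.2f}, IQR = {iqr(data):.2f}")
```

With these conventions, S grows from about 1.9 to about 14.5, while the IQR moves only from 3.3 to 3.45.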

23
Spread Measures Usage
  • Use IQR when data has either skewness or outliers
    or both. It provides a very reasonable measure
    of spread. The standard deviation is potentially
    misleading in this situation.
  • The standard deviation, S, is a reasonable
    measure of spread for well-behaved datasets.

24
Boxplot
  • A boxplot is a graphical display of a
    distribution based on only five numbers: min,
    Q1, median, Q3, max.
  • Values Q1, median, and Q3 form the box. Values
    min and max form the whiskers out from the box.
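
The five numbers behind a boxplot can be gathered with a few standard-library calls. This is a sketch under the same assumed quartile convention (`method="inclusive"`), which may differ from what a given plotting tool uses. `five_number_summary` is a hypothetical helper name, and the data is the motivating example from the earlier slides.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot is drawn from."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median(values), q3, max(values))


data = [12.9, 13.2, 14.4, 16.5, 17.1]
low, q1, mid, q3, high = five_number_summary(data)
print("box:", q1, mid, q3)       # Q1, median, Q3 form the box
print("whiskers:", low, high)    # min and max form the whiskers
```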

25
Boxplot Example
  • Fuel mileage for two-seater and mini-compact cars.

26
Example: Skewed Right
27
Boxplot Example
28
Boxplot Example
29
Distributions Activity
  • Match Histograms to descriptions.
  • a) Scores in range 0-100 on an easy statistics
    exam.
  • b) Number of cycles required to achieve pregnancy
    for a sample of women attempting to get pregnant.
    Data are self-reported.
  • c) Heights of a group of college students.
  • d) Numbers of medals won by countries in the 1992
    Winter Olympics (Albertville, France).
  • e) SAT scores for a group of college students.

30
Distributions Activity
31
Distributions Activity
32
Activity Combined Plots
33
Bluegill Lengths Lake Mary, MN
34
Histogram Example
35
Summary
  • Numerical and graphical summaries go together.
  • A numerical summary without first looking at a
    graph can be very misleading.
  • Know the strengths and weaknesses of the
    graphical and numerical summaries you use.