MATH 401 Probability and Statistics

Provided by: drolek

1
MATH 401 Probability and Statistics
  • Spring 2009

2
Statistics Basic Concepts
  • Lecture 8

3
What is Statistics?
  • Statistics deals with the collection,
    presentation, analysis and use of numerical
    data to make decisions, solve problems etc.
    (Montgomery, 2002, p.2)
  • Statistics is concerned with collecting data,
    describing and analyzing them, and possibly
    drawing conclusions from the data. (Ross, 2004,
    p.1)

4
Qualitative Data
  • The colors of 25 Toyota Corollas sold by a dealer
    in Maadi were recorded as follows (W = white, B =
    black, S = silver, R = red).

5
Quantitative Data
  • Data represent the number of requests per minute
    placed to a server, recorded during 30
    consecutive minutes.

6
Numerical Data
  • The following data represent the life-time (in
    months) of 50 electronic color tubes for TV.

7
Types of Data
  • We distinguish between qualitative (blood types,
    colours of cars sold, letter grades in an exam
    etc.) and quantitative (i.e. numerical) data.
  • Throughout this course we are mostly concerned
    with quantitative data.
  • Basically, little can be done mathematically if
    the data are not numerical.

8
Types of statistical studies
  • Depending on why the study is conducted, two
    types of statistical studies are distinguished:
  • Descriptive statistics.
  • Inferential statistics.

9
MATH 401 Final Exam Grades (2006).
  • A total of 297 exam grades (out of 50) are as
    follows.

10
Descriptive Statistics
  • Involves only the collection, as well as
    organization, presentation and summarization of
    data.
  • The point is to describe a certain situation as
    represented by a particular data set.
  • Math 401 exam results.

11
Device Life-time
  • The following data represent the life-time (in
    months) of 50 electronic color tubes for TV.

12
Inferential Statistics
  • Involves drawing conclusions from data.
  • The point is to make inferences about a certain
    situation represented by a particular data set.
  • The main tools for making such inferences are
    provided by probability theory.

13
Populations
  • Statistical studies examine certain
    attributes/characteristics of a set of
    individuals: the population.
  • Populations are normally so large that it is
    logistically impossible to examine all the
    individuals. So ...

14
Populations and Samples
  • One takes a subgroup of a population and examines
    the desired characteristic for the subgroup.
  • Such subgroups are called samples.
  • A major problem is to ensure that a selected
    sample is representative of the population.

15
Inferential Statistics (revisited)
  • So the major task of inferential statistics
    is to draw conclusions about the whole population
    on the basis of analyzing a sample taken from
    this population.
  • We stress, again, that probability theory is
    absolutely crucial for drawing such conclusions.

16
Data Organization.
17
Qualitative Data
  • The colors of 25 Toyota Corollas sold by a dealer
    in Maadi were recorded as follows (W = white, B =
    black, S = silver, R = red).

18
Small-range Numerical Data
  • Data represent the number of requests per minute
    placed to a server, recorded during 30
    consecutive minutes.

19
Large-range Numerical Data
  • The following data represent the life-times (in
    months, rounded to the nearest month) of 50
    electronic color tubes for TV.

20
Frequency Distribution
  • A frequency distribution for a data set is a
    table listing the groups into which the data are
    divided (the classes), with a row for each class, giving the
    number of occurrences in each class (the class
    frequency) and sometimes the class relative
    frequency, i.e. the class frequency divided by
    the total number of values in the data set.

21
Frequency Distribution

22
Relative Frequency
  • The relative frequency of a class is the frequency of
    the class divided by the total number of data values.
  • Relative frequencies are useful for comparing
    distributions of different sizes.
  • In that case, comparing raw frequencies would be misleading.
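
As a minimal sketch of the two definitions above, the following Python snippet tabulates frequencies and relative frequencies. The list `colors` is made-up illustration data (not the dealer's actual records), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter


def frequency_table(values):
    """Return {class: (frequency, relative frequency)} for a list of values."""
    counts = Counter(values)
    total = len(values)
    return {cls: (freq, freq / total) for cls, freq in counts.items()}


# Hypothetical sample (NOT the data from the slides), using the W/B/S/R codes.
colors = ["W", "B", "S", "W", "R", "W", "S", "B", "W", "S"]
table = frequency_table(colors)

for cls in sorted(table):
    freq, rel = table[cls]
    print(f"{cls}: frequency = {freq}, relative frequency = {rel:.2f}")
```

Because relative frequencies always sum to 1, two tables built this way can be compared even when the underlying data sets have different sizes.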

23
Data Organization
  • Data are organized by constructing frequency
    distributions.
  • Two types of frequency distributions:
  • Categorical: for qualitative and small-range
    quantitative data sets.
  • Grouped: for large-range quantitative data sets.

24
Categorical Frequency Distribution
  • The colors of 25 Toyota Corollas sold by a dealer
    in Maadi were recorded as follows. Construct a FD.

25
Categorical Frequency Distribution
26
Categorical FD
27
MATH 401 Final 2006
  • FD with 10 classes. The last class is treated as
    [45, 50].

28
Endpoints Ambiguity
  • We adopt the left-end inclusion convention, i.e.
    we assume that the class 0-5 contains 0 but
    excludes 5, etc.
  • The last class is assumed to include both
    endpoints.
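
The convention can be sketched in Python as follows, assuming the exam setting from the slides (grades from 0 to 50, 10 classes of width 5). The function name `class_index` and its default parameters are illustrative assumptions, not part of the lecture.

```python
def class_index(grade, width=5, n_classes=10):
    """0-based index of the class containing `grade` (left-end inclusion).

    Class i covers [i * width, (i + 1) * width), except the last class,
    which also includes its upper endpoint.
    """
    top = width * n_classes
    if not 0 <= grade <= top:
        raise ValueError("grade lies outside the range covered by the classes")
    # The last class is closed on both ends, so the top value is pulled into it.
    return min(int(grade // width), n_classes - 1)


print(class_index(0))   # class 0-5
print(class_index(5))   # class 5-10, because 5 is excluded from 0-5
print(class_index(50))  # last class, 45-50, which includes both endpoints
```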

29
Guidelines for constructing frequency
distributions
  • Classes must be of equal widths.
  • The width of the n-th class is given by
  • WIDTH = Upper Limit n - Lower Limit n
  • Classes must be mutually exclusive.
  • Classes must be exhaustive.
  • Classes must be continuous.
  • Number of classes should be sufficient for a
    clear description of the data.
  • The book says between 5 and 20.

30
Problem
  • How to group the data if only a sample is
    available?

31
Sample of 50
  • The following data represent the life-times (in
    months, rounded to the nearest month) of 50
    electronic color tubes for TV.

32
Questions to answer
  • How many classes shall we have?
  • What should be the width of each class?
  • What should be the lower limit of the first
    class?

33
Question 1 Determine the number of classes
  • The book by Montgomery suggests that the number
    of classes should be the square root of the
    number of data values.
  • Hence, in the example (50 values, and the square root
    of 50 is about 7.07), the number of classes is 7
    or 8.
  • We shall take 7.

34
Question 2 Class Width
  • Find the Range of the data:
  • Range = Highest value - Lowest value
  • Find the Width:
  • Divide the Range by the number of classes, k.
  • W = Range / k
  • Increase the result of division to the next
    integer.
  • This integer is the width.

35
Reminder The Data Set
  • We determine the highest and the lowest values
    in the data set in order to find the range, and
    then the class width in our FD.

36
Example Class Width
  • Find the Range:
  • Range = 136 - 100 = 36.
  • Divide the range by the number of classes.
  • W = 36 / 7 ≈ 5.14, which is increased to 6 (the width).

37
Question 3 Lower Limit of Class 1
  • We want the lower limit of Class 1 to be a bit
    smaller than the lowest value.
  • Similarly, the upper limit of the last class is
    to be a bit greater than the highest value in the
    set.
  • For all 7 classes to be equally wide (6 units),
    we can select 95, 96, 97, 98 or 99 to be the
    lower limit of class 1.
  • We shall take 97.
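
The three answers can be reproduced with a short Python sketch. It uses only the numbers stated in the slides (50 values, lowest 100, highest 136, lower limit 97); the function name `build_classes` is a hypothetical helper, and using `math.ceil` for "increase to the next integer" is an assumption that matters only when the division is exact.

```python
import math


def build_classes(lowest, highest, n_values, lower_limit):
    """Return (k, width, classes) following the three questions above."""
    # Question 1: number of classes = square root of n, rounded down
    # (the slides choose 7 rather than 8 for n = 50).
    k = math.isqrt(n_values)
    # Question 2: width = range / k, increased to an integer.
    # Assumption: an exact quotient is kept as is by math.ceil.
    width = math.ceil((highest - lowest) / k)
    # Question 3: equally wide classes starting at the chosen lower limit.
    classes = [
        (lower_limit + i * width, lower_limit + (i + 1) * width)
        for i in range(k)
    ]
    # With left-end inclusion every value must fall in some class.
    if not (classes[0][0] <= lowest and highest < classes[-1][1]):
        raise ValueError("the chosen lower limit does not cover the data")
    return k, width, classes


k, width, classes = build_classes(100, 136, 50, 97)
print(k, width)
print(classes)
```

With these inputs the classes run from 97-103 up to 133-139, so both 100 and 136 are covered with a small margin on each side.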

38
Frequency Distribution
  • Sum the frequencies to make sure that nothing
    was forgotten.

39
Frequency Distribution 2
  • A rule from another book suggests that an optimal
    number of classes is 6.
  • An FD can be as follows.

40
Now . . .
  • Presenting data graphically.

41
Categorical FD for the number of requests
42
Graph for Quantitative Data in Categorical FD
  • The simplest graph for numerical data organized
    in a categorical frequency distribution looks
    like the graph of a probability mass function.
  • The y-coordinate of a point represents the
    (relative) frequency of the class.
  • The x-coordinate of a point represents the class.
    (See Plot 1)

43
Presenting Grouped Data.
  • Wide-range data are presented using various types
    of graphs.
  • We'll consider two:
  • Histograms.
  • Ogives.

44
Math 401 - Final Exam
  • The respective FD is presented below.

45
Histograms
  • A histogram displays data using contiguous
    vertical bars.
  • Each bar represents a class.
  • The height of a bar represents the frequency of
    the respective class.
  • Bars extend between class limits.

46
Drawing a histogram
47
Drawing a histogram
48
Relative Frequency Histograms
  • Same principle as for histograms. One simply uses
    relative frequencies instead of ordinary ones
    to determine the height of each vertical bar.
  • Obviously the shape of the graph remains
    unchanged. Only the vertical scale changes.

49
Drawing a histogram
50
Drawing a histogram
51
Ogives
  • An ogive displays data by using lines connecting
    points.
  • The x-coordinate of a point represents the upper
    class limit.
  • The y-coordinate represents the class cumulative
    frequency.
  • Note: an ogive is the graph of a non-decreasing
    function!
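
As a rough illustration of how the points of an ogive are obtained, the sketch below accumulates class frequencies. The upper limits and frequencies are made-up values chosen for illustration, not the lecture's data.

```python
from itertools import accumulate

# Hypothetical grouped FD (NOT from the slides): one entry per class.
upper_limits = [103, 109, 115, 121, 127, 133, 139]
frequencies = [3, 6, 9, 14, 10, 5, 3]

# Cumulative frequency of a class = sum of frequencies up to and including it.
cumulative = list(accumulate(frequencies))

# Each ogive point is (upper class limit, cumulative frequency).
ogive_points = list(zip(upper_limits, cumulative))
print(ogive_points)
```

Since frequencies are never negative, the cumulative values cannot decrease, which is why the ogive is the graph of a non-decreasing function.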

52
MATH 401 Cumulative FD
  • Cumulative FD with 10 classes.

53
Drawing an ogive
54
Drawing an ogive
55
Ogives for Relative Frequencies
  • Same principle as for histograms is adopted.
    Cumulative relative frequencies instead of
    ordinary ones are used to determine the
    y-coordinate of each point.
  • Obviously, the shape of the graph remains
    unchanged. Only the vertical scale changes.

56
Basis for Inferential Statistics
  • We assume that an unknown population is described
    by a random variable.
  • In that case a histogram and an ogive for
    relative frequencies based on a sample give
    the contour of the PDF and Cumulative Probability
    Function, respectively.
  • Other important characteristics of a random
    variable are the expectation and the variance.
  • Summarizing data is essentially an initial
    attempt to estimate these parameters.

57
Data Summarization.
58
Data Summarization
  • Data summarization involves extracting
    information about the general distribution of
    data.
  • This is achieved by measuring certain aspects of
    the data set.
  • We'll consider two aspects:
  • Central tendency.
  • Variation.

59
Measures of Central Tendency
  • We're interested in a value that represents the
    center of the distribution.
  • Vaguely, we are searching for the best
    representative of the distribution.
  • Different ideas about what is the best
    representative result in different definitions.
  • We'll study three definitions.

60
Population Mean
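  • For a population of size N with values $x_1, \dots, x_N$,
    its mean, $\mu$, is given by
    $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
  • Reminder: for a discrete RV X with possible values $x_i$, its
    expectation is given by
    $E[X] = \sum_i x_i \, P(X = x_i)$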

61
Sample Mean
  • For a sample of size n with values $x_1, \dots, x_n$,
    the sample mean is given by
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

62
Finding the Mean of the Number of Requests
63
Computing the Mean
  • The formula takes the form
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} F_i x_i$
  • where $F_i$ is the frequency of the value $x_i$,
    $i = 1, \dots, k$, and $n = F_1 + \dots + F_k$.

64
Computing the mean
  • Note: rounding rule for the mean.
  • The mean is rounded to one more decimal place
    than occurs in the data.
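
The frequency form of the mean (each distinct value times its frequency, summed and divided by the number of values) and the rounding rule can be sketched as below. The dictionary `requests` is made-up illustration data, not the server data from the lecture, and `mean_from_frequencies` is a hypothetical helper name.

```python
def mean_from_frequencies(freq):
    """Mean of a data set given as {value: frequency}."""
    n = sum(freq.values())  # total number of data values
    return sum(value * f for value, f in freq.items()) / n


# Hypothetical requests-per-minute table (NOT the slide data): 30 minutes.
requests = {3: 4, 4: 9, 5: 9, 6: 5, 7: 3}

mean = mean_from_frequencies(requests)
# The data are whole numbers, so the mean is reported with one decimal place.
print(round(mean, 1))
```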

65
2. The Median
  • The median, MD, is the midpoint of the entire
    quantitative data array.
  • To determine the median:
  • Sort the data values.
  • Pick the value in the middle.
  • For n sorted data values $X_1 \le \dots \le X_n$,
  • If n is odd,
  • then $MD = X_{(n+1)/2}$
  • If n is even,
  • then $MD = (X_{n/2} + X_{n/2+1}) / 2$
  • (Note: this need not be a data value.)
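
The procedure translates almost line by line into Python. This is a minimal sketch; the function name `median` is chosen here for illustration, and the indices are shifted by one because Python lists are 0-based.

```python
def median(values):
    """Median of a non-empty list of numbers, following the steps above."""
    xs = sorted(values)          # step 1: sort the data values
    n = len(xs)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]           # odd n: the single middle value
    # even n: average of the two middle values (need not be a data value)
    return (xs[mid - 1] + xs[mid]) / 2


print(median([7, 1, 3]))
print(median([4, 1, 3, 2]))
```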

66
Finding the Median of the Number of Requests
  • The median equals 5.

67
About the Median
  • The median divides the data set into two subsets
    with equally many values in such a way that all
    values in the first subset do not exceed the
    value of the median, while all values in the
    second subset are greater than or equal to the
    median value.
  • If the respective RV is continuous, then the
    median predicts the x-coordinate of the point
    where the graph of the cumulative probability
    function meets the horizontal line y = 0.5.

68
3. The Mode
  • The mode is a value that has the highest
    frequency in a data set.
  • Defined for both qualitative and quantitative
    data.
  • A distribution may have one, more than one, or no
    mode at all.
  • If the respective RV is continuous, then the mode
    predicts where the respective PDF has a peak.

69
Finding the Mode of the Number of Requests
  • The modes are 4 and 5.
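
A data set with several modes can be handled with a frequency count, as in the sketch below. The helper name `modes` is hypothetical, and reporting "no mode" when no value occurs more than once is one common convention, assumed here rather than taken from the slides.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: if every value occurs exactly once, there is no mode.
    if top == 1 and len(values) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([4, 5, 4, 5, 3]))   # two modes
print(modes([1, 2, 3]))         # no mode
print(modes(["W", "B", "W"]))   # works for qualitative data as well
```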

70
Measures of Variation
  • Measures of central tendency locate the center of
    a distribution.
  • They do not indicate how the values are
    distributed around the center.
  • Measures of variation examine the spread, or
    variation, of data values around the center.

71
The Population Variance
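  • The population variance is given by
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
  • Reminder: if X is a discrete RV, then its
    variance is given by
    $\mathrm{Var}(X) = \sum_i (x_i - E[X])^2 \, P(X = x_i)$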

72
Sample Variance
  • For a better estimate (???), the sample variance
    is defined by
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

73
Sample Variance. Shortcut formula.
  • Rearranging the terms in the formula for the
    variance, we arrive at an expression that does not
    involve the mean explicitly:
    $s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n-1)}$

74
The Standard Deviation
  • The standard deviation is the square root of the
    variance.
  • It has the same units as the raw data.

75
Computing the standard deviation
  • Find the sample standard deviation for the
    amount of European auto sales for a sample of 6
    years shown. The data are in millions of dollars.
  • 11.2, 11.9, 12.0, 12.8, 13.4, 14.3

76
Computing the standard deviation
  • Use the shortcut formula.
  • 1. Find the sum of the values

77
Computing the standard deviation
  • 2. Square each value and find the sum

78
Computing the standard deviation
  • 3. Substitute into the formula

79
Computing the standard deviation
  • 4. Compute the square root and round the answer
    to one more decimal place.
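
The four steps can be checked with a few lines of Python, using the six sales figures given above. The shortcut formula is written here in the form (n·Σx² - (Σx)²) / (n(n - 1)), which is algebraically equivalent to the usual sample variance; the variable names are illustrative.

```python
import math

# Sales data from the slide, in millions of dollars.
sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
n = len(sales)

sum_x = sum(sales)                     # step 1: sum of the values
sum_x2 = sum(x * x for x in sales)     # step 2: sum of the squared values

# Step 3: substitute into the shortcut formula for the sample variance.
variance = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

# Step 4: square root, rounded to one more decimal place than the data.
std_dev = math.sqrt(variance)
print(round(std_dev, 2))  # prints 1.13
```

The intermediate values are a sum of 75.6, a sum of squares of 958.94, and a sample variance of 1.276.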

80
Future Plans
  • In practice, of interest are certain
    characteristics of a population, e.g. the mean,
    the standard deviation, other parameters.
  • Due to various limitations, only sample mean,
    sample variance etc. are available.
  • The latter are estimates of the former.
  • Next week we develop some techniques for
  • ESTIMATION OF PARAMETERS.

81
Thank you
82
Food for thought. Mean for grouped data
  • Suppose we are given a grouped frequency
    distribution. Is it possible to find the exact
    value of the mean?
  • If not, think of a way to find an approximate
    value?
  • What do you think the accuracy of such an
    approximation is dependent on?

83
Approximating the Mean
  • The formula takes the form
    $\bar{x} \approx \frac{1}{n} \sum_{j=1}^{k} F_j \, x_{j,m}$
  • where $x_{j,m}$ is the midpoint of class j, and $F_j$ is
    the frequency of class j, for all $j = 1, \dots, k$.
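
A possible sketch of the midpoint approximation is shown below. The class limits and frequencies in `grouped` are made-up illustration values, not the lecture's frequency distribution, and `grouped_mean` is a hypothetical helper name.

```python
def grouped_mean(classes):
    """Approximate mean of grouped data.

    `classes` is a list of (lower limit, upper limit, frequency) tuples;
    every value in a class is replaced by the class midpoint.
    """
    n = sum(freq for _, _, freq in classes)
    weighted = sum((lo + hi) / 2 * freq for lo, hi, freq in classes)
    return weighted / n


# Hypothetical grouped FD (NOT from the slides): 7 classes of width 6.
grouped = [
    (97, 103, 3), (103, 109, 6), (109, 115, 9), (115, 121, 14),
    (121, 127, 10), (127, 133, 5), (133, 139, 3),
]
print(grouped_mean(grouped))
```

The result is only an approximation because the individual values inside each class are unknown; narrower classes generally leave less room for error.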

84
Variance for Grouped Data
  • Is it possible to approximate the value of $s^2$
    for grouped data?

85
Median for grouped data
  • Suppose we are given a grouped frequency
    distribution. Is it possible to estimate the
    value of the median?
  • If so, describe a procedure to get the value.
    (Hint: use the ogive.)
  • Alternatively, describe a procedure to determine
    a class that contains the median.
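
One way to approach the last question is to scan the cumulative frequencies and stop at the first class that reaches half of the data. The sketch below assumes this rule; the data in `grouped` are made up for illustration and `median_class` is a hypothetical helper name.

```python
from itertools import accumulate


def median_class(classes):
    """Return (lower, upper) of the first class whose cumulative
    frequency reaches at least half of the total number of values.

    `classes` is a list of (lower limit, upper limit, frequency) tuples.
    """
    freqs = [freq for _, _, freq in classes]
    n = sum(freqs)
    for (lo, hi, _), cum in zip(classes, accumulate(freqs)):
        if cum >= n / 2:
            return lo, hi
    raise ValueError("empty frequency distribution")


# Hypothetical grouped FD (NOT from the slides).
grouped = [
    (97, 103, 3), (103, 109, 6), (109, 115, 9), (115, 121, 14),
    (121, 127, 10), (127, 133, 5), (133, 139, 3),
]
print(median_class(grouped))
```

If the cumulative frequency equals exactly half of the total at a class boundary, the median sits on that boundary, so the returned class should then be read together with the next one.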

86
Sample Percentiles
  • Sometimes it is important to know below which
    value a certain percentage of data in a data set
    lies.
  • Let p be in [0, 1]. The sample 100p percentile
    is a value such that
  • at least 100p% of the data are less than or equal to it,
  • and at least 100(1-p)% of the data are greater than or
    equal to it.
  • If two values satisfy this condition, then their
    arithmetic average is taken.
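
One common way to turn this definition into a procedure is sketched below, under the assumption that the conditions are read as "at least 100p%" and "at least 100(1-p)%". With n sorted values, if n·p is not a whole number the percentile is the value in position ceil(n·p); if it is a whole number k, the values in positions k and k+1 both qualify and are averaged. The function name `sample_percentile` is hypothetical.

```python
import math


def sample_percentile(values, p):
    """Sample 100p percentile of a non-empty list, for p in [0, 1]."""
    if not values or not 0 <= p <= 1:
        raise ValueError("need non-empty data and p in [0, 1]")
    xs = sorted(values)
    n = len(xs)
    pos = n * p  # note: floating-point p may need a tolerance in practice
    if pos != int(pos):
        # n*p is not an integer: exactly one value satisfies both conditions.
        return xs[math.ceil(pos) - 1]
    k = int(pos)
    if k == 0:
        return xs[0]       # p = 0: the smallest value
    if k == n:
        return xs[-1]      # p = 1: the largest value
    # n*p is an integer: positions k and k+1 both qualify, so average them.
    return (xs[k - 1] + xs[k]) / 2


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(sample_percentile(data, 0.25), sample_percentile(data, 0.5))
```

Statistical software often uses slightly different interpolation rules, so results may differ from library functions for small samples.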

87
Sample Quartiles
  • The sample 25th, 50th and 75th percentiles are
    called the sample 1st, 2nd and 3rd quartiles,
    respectively.
  • As their names suggest, they split a data set into
    4 parts with roughly equal numbers of values.
  • Note: the second quartile is simply the median.

88
Box Plots
  • A box plot for a data set is a straight line
    segment stretching from the smallest to the
    largest value, drawn on a horizontal axis.
  • On the line we impose a box that starts at
    Quartile 1 and ends at Quartile 3.
  • The value of the median (Quartile 2) is indicated
    by a vertical line.
  • The value IQR = Q3 - Q1 is called the
    inter-quartile range of the data.
  • The data values smaller than Q1 - 1.5 IQR or
    larger than Q3 + 1.5 IQR are called outliers and are
    marked by small circles on the horizontal line.
  • The data lying outside the interval
    [Q1 - 3 IQR, Q3 + 3 IQR] are called extreme outliers.

89
Miles to travel to work - sorted
90
Data for Box Plotting
  • Parameter: Value
  • Minimum: 1
  • 1st Quartile: 3.5
  • 2nd Quartile: 6.5
  • 3rd Quartile: 13.5
  • Maximum: 18
  • IQR: 10
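
The outlier limits for this data set can be checked with a short sketch, using the quartiles listed above and the usual 1.5·IQR and 3·IQR rules. The function name `fences` and the dictionary keys are illustrative assumptions.

```python
def fences(q1, q3):
    """Inter-quartile range and the limits beyond which values are flagged."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        # Values outside this interval are outliers.
        "outlier": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        # Values outside this wider interval are extreme outliers.
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


limits = fences(3.5, 13.5)
low, high = limits["outlier"]
minimum, maximum = 1, 18   # from the table above

print(limits)
print("any outliers:", minimum < low or maximum > high)
```

Since the minimum (1) and the maximum (18) both lie between -11.5 and 28.5, this data set has no outliers and the box plot needs no circles.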