Title: MATH 401 Probability and Statistics
1MATH 401 Probability and Statistics
2Statistics Basic Concepts
3What is Statistics?
- Statistics deals with the collection,
presentation, analysis and use of numerical
data to make decisions, solve problems etc.
(Montgomery, 2002, p.2)
- Statistics is concerned with collecting data,
describing and analyzing them, and possibly
drawing conclusions from the data. (Ross, 2004,
p.1)
4Qualitative Data
- The colors of 25 Toyota Corollas sold by a dealer
in Maadi were recorded as follows (W = white,
B = black, S = silver, R = red).
5Quantitative Data
- Data represent the number of requests per minute
placed to a server, recorded during 30
consecutive minutes.
6Numerical Data
- The following data represent the life-time (in
months) of 50 electronic color tubes for TV.
7Types of Data
- We distinguish between qualitative (blood types,
colours of cars sold, letter grades in an exam
etc.) and quantitative (i.e. numerical) data.
- Throughout this course we are mostly concerned
with quantitative data.
- Basically, little can be done mathematically if
the data are not numerical.
8Types of statistical studies
- Depending on why the study is conducted, two
types of statistical studies are distinguished:
- Descriptive statistics.
- Inferential statistics.
9MATH 401 Final Exam Grades (2006).
- A total of 297 exam grades (out of 50) are as
follows.
10Descriptive Statistics
- Involves only the collection, organization,
presentation and summarization of data.
- The point is to describe a certain situation as
represented by a particular data set.
- Example: the MATH 401 exam results.
11Device Life-time
- The following data represent the life-time (in
months) of 50 electronic color tubes for TV.
12Inferential Statistics
- Involves drawing conclusions from data.
- The point is to make inferences about a certain
situation represented by a particular data set.
- The main tools for making such inferences are
provided by probability theory.
13Populations
- Statistical studies examine certain
attributes/characteristics of a set of
individuals: the population.
- Populations are normally so large that it is
logistically impossible to examine all the
individuals. So ...
14Populations and Samples
- One takes a subgroup of a population and examines
the desired characteristic for the subgroup.
- Such subgroups are called samples.
- A major problem is to ensure that a selected
sample is representative of the population.
15Inferential Statistics (revisited)
- So the major task of inferential statistics
is to draw conclusions about the whole population
on the basis of analyzing a sample taken from
this population.
- We stress, again, that probability theory is
absolutely crucial for drawing such conclusions.
16Data Organization.
17Qualitative Data
- The colors of 25 Toyota Corollas sold by a dealer
in Maadi were recorded as follows (W = white,
B = black, S = silver, R = red).
18Small-range Numerical Data
- Data represent the number of requests per minute
placed to a server, recorded during 30
consecutive minutes.
19Large-range Numerical Data
- The following data represent the life-times (in
months, rounded to the nearest month) of 50
electronic color tubes for TV.
20Frequency Distribution
- A frequency distribution for a data set is a
table listing:
- the groups into which the data are divided
(classes), with a row for each class;
- the number of occurrences in each class (the
class frequency);
- sometimes the class relative frequency, i.e.
the class frequency divided by the total number
of values in the data set.
21Frequency Distribution
22Relative Frequency
- The relative frequency of a class is the
frequency of the class divided by the total
number of data values.
- Relative frequencies are useful for comparing
distributions of different sizes.
- In that case, comparing raw frequencies would be
misleading.
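As a minimal sketch, relative frequencies can be computed as below. The colour list is made-up illustration data (not the dealer's records), and the function name `relative_frequencies` is hypothetical.

```python
from collections import Counter

def relative_frequencies(values):
    """Return {class: (frequency, relative frequency)} for a list of values."""
    counts = Counter(values)
    total = len(values)
    return {cls: (freq, freq / total) for cls, freq in counts.items()}

# Made-up sample of car colours (W = white, B = black, S = silver, R = red).
sample = ["W", "B", "W", "S", "R", "W", "S", "B", "W", "S"]
table = relative_frequencies(sample)
for cls, (freq, rel) in sorted(table.items()):
    print(f"{cls}: frequency = {freq}, relative frequency = {rel:.2f}")
```

The relative frequencies always sum to 1, which is what makes two data sets of different sizes comparable.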
23Data Organization
- Data are organized by constructing frequency
distributions.
- Two types of frequency distributions:
- Categorical: for qualitative and small
quantitative data sets.
- Grouped: for large quantitative data sets.
24Categorical Frequency Distribution
- The colors of 25 Toyota Corollas sold by a dealer
in Maadi were recorded as follows. Construct a FD.
25Categorical Frequency Distribution
26Categorical FD
27MATH 401 Final 2006
- FD with 10 classes. The last class is treated as
[45, 50].
28Endpoints Ambiguity
- We adopt the left-end inclusion convention, i.e.
we assume that the class 0-5 contains 0 but
excludes 5, etc.
- The last class is assumed to include both
endpoints.
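The convention can be sketched in code as follows. This is an illustration only: the grades are made up, and `grouped_frequencies` is a hypothetical helper, not something from the course materials.

```python
def grouped_frequencies(values, lower, width, k):
    """Count values in k classes [lower + i*width, lower + (i+1)*width).

    The last class is closed on the right, so it also contains its upper limit.
    """
    upper = lower + k * width
    counts = [0] * k
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"{v} lies outside the classes")
        # Floor division gives the class index; the top endpoint is
        # assigned to the last class.
        i = min(int((v - lower) // width), k - 1)
        counts[i] += 1
    return counts

# Made-up grades out of 50, grouped into 10 classes of width 5.
grades = [0, 4, 5, 12, 27, 30, 44, 45, 49, 50]
counts = grouped_frequencies(grades, lower=0, width=5, k=10)
print(counts)
```

Here the grade 5 falls into the class 5-10, not 0-5, while the grade 50 falls into the last class.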
29Guidelines for constructing frequency
distributions
- Classes must be of equal widths.
- The width of the n-th class is given by
- Width = Upper Limit_n - Lower Limit_n
- Classes must be mutually exclusive.
- Classes must be exhaustive.
- Classes must be continuous.
- The number of classes should be sufficient for a
clear description of the data.
- The book says between 5 and 20.
30Problem
- How to group the data if only a sample is
available?
31Sample of 50
- The following data represent the life-times (in
months, rounded to the nearest month) of 50
electronic color tubes for TV.
32Questions to answer
- How many classes shall we have?
- What should be the width of each class?
- What should be the lower limit of the first
class?
33Question 1 Determine the number of classes
- The book by Montgomery suggests that the number
of classes should be about the square root of the
number of data values.
- Hence, in the example (50 values, and the square
root of 50 is about 7.07), the number of classes
is 7 or 8.
- We shall take 7.
34Question 2 Class Width
- Find the Range of the data
- Range = Highest value - Lowest value
- Find the Width
- Divide the Range by the number of classes, k.
- W = Range / k
- Increase the result of the division to the next
integer.
- This integer is the width.
35Reminder The Data Set
- We determine the highest and the lowest values
in the data set in order to find the range, and
then the class width in our FD.
36Example Class Width
- Find the Range
- Range = 136 - 100 = 36.
- Divide the range by the number of classes.
- W = 36 / 7 = 5.14..., increased to 6 = width.
37Question 3 Lower Limit of Class 1
- We want the lower limit of Class 1 to be a bit
smaller than the lowest value.
- Similarly, the upper limit of the last class is
to be a bit greater than the highest value in the
set.
- For all 7 classes to be equally wide (6 units),
we can select 95, 96, 97, 98 or 99 to be the
lower limit of class 1.
- We shall take 97.
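The three answers can be retraced with a short sketch. It uses only the numbers from the example (50 values, lowest 100, highest 136); the variable names are illustrative, and rounding the square root down to 7 simply mirrors the choice made above.

```python
import math

n, lowest, highest = 50, 100, 136      # from the example

k = math.floor(math.sqrt(n))           # sqrt(50) is about 7.07; we take 7
data_range = highest - lowest          # 136 - 100 = 36
width = math.ceil(data_range / k)      # 36 / 7 is about 5.14, increased to 6

# Lower limits that are below the lowest value while the upper limit of
# the last class stays above the highest value.
candidates = list(range(highest - k * width + 1, lowest))

lower = 97                             # the choice made in the example
limits = [(lower + i * width, lower + (i + 1) * width) for i in range(k)]

print(k, width, candidates)
print(limits)
```

With the lower limit 97 the seven classes run from 97-103 up to 133-139, so both 100 and 136 are covered.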
38Frequency Distribution
- Sum the frequencies to make sure that nothing
was forgotten.
39Frequency Distribution 2
- A rule from another book suggests that an optimal
number of classes is 6.
- An FD can be as follows.
40Now . . .
- Presenting data graphically.
41Categorical FD for the number of requests
42Graph for Quantitative Data in Categorical FD
- The simplest graph for numerical data organized
in a categorical frequency distribution looks
like the graph of a probability mass function.
- The y-coordinate of a point represents the
(relative) frequency of the class.
- The x-coordinate of a point represents the class.
(See Plot 1)
43Presenting Grouped Data.
- Wide-range data are presented using various types
of graphs.
- We'll consider two:
- Histograms.
- Ogives.
44Math 401 - Final Exam
- The respective FD is presented below.
45Histograms
- A histogram displays data using contiguous
(adjoining) vertical bars.
- Each bar represents a class.
- The height of a bar represents the frequency of
the respective class.
- Bars extend between class limits.
46Drawing a histogram
47Drawing a histogram
48Relative Frequency Histograms
- Same principle as for histograms. One simply uses
relative frequencies instead of ordinary ones
to determine the height of each vertical bar.
- Obviously the shape of the graph remains
unchanged. Only the vertical scale changes.
49Drawing a histogram
50Drawing a histogram
51Ogives
- An ogive displays data by using lines connecting
points.
- The x-coordinate of a point represents the upper
class limit.
- The y-coordinate represents the class cumulative
frequency.
- Note: an ogive is the graph of a non-decreasing
function!
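A minimal sketch of how the points of an ogive could be computed is given below. The class limits and frequencies are made up for illustration, and `ogive_points` is a hypothetical helper name.

```python
from itertools import accumulate

def ogive_points(upper_limits, frequencies):
    """Pair each upper class limit with the cumulative frequency up to it."""
    return list(zip(upper_limits, accumulate(frequencies)))

# Made-up grouped distribution: 4 classes of width 5 starting at 0.
upper_limits = [5, 10, 15, 20]
frequencies = [3, 7, 6, 4]
points = ogive_points(upper_limits, frequencies)
print(points)
```

Because frequencies are never negative, the y-coordinates cannot decrease, and the last one equals the total number of data values.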
52MATH 401 Cumulative FD
- Cumulative FD with 10 classes.
53Drawing an ogive
54Drawing an ogive
55Ogives for Relative Frequencies
- The same principle as for histograms is adopted.
Cumulative relative frequencies instead of
ordinary ones are used to determine the
y-coordinate of each point.
- Obviously, the shape of the graph remains
unchanged. Only the vertical scale changes.
56Basis for Inferential Statistics
- We assume that an unknown population is described
by a random variable.
- In that case a histogram and an ogive for
relative frequencies based on a sample give
the contour of the PDF and the cumulative
probability function, respectively.
- Other important characteristics of a random
variable are the expectation and the variance.
- Summarizing data is essentially an initial
attempt to estimate these parameters.
57Data Summarization.
58Data Summarization
- Data summarization involves extracting
information about the general distribution of
data.
- This is achieved by measuring certain aspects of
the data set.
- We'll consider two aspects:
- Central tendency.
- Variation.
59Measures of Central Tendency
- We're interested in a value that represents the
center of the distribution.
- Vaguely, we are searching for the best
representative of the distribution.
- Different ideas about what the best
representative is result in different definitions.
- We'll study three definitions.
60Population Mean
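- For a population of size N with values
$x_1, \dots, x_N$, its mean, $\mu$, is given by
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$.
- Reminder: for a discrete RV X, its
expectation is given by
$E[X] = \sum_i x_i \, P(X = x_i)$.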
61Sample Mean
- For a sample $x_1, \dots, x_n$ of size n, the
sample mean is given by
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
62Finding the Mean of the Number of Requests
63Computing the Mean
- The formula takes the form
- $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} F_i \, x_i$
- where $F_i$ is the frequency of the value $x_i$,
i = 1, ..., k, and $n = F_1 + \dots + F_k$.
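As a sketch of the frequency-weighted formula, the snippet below computes the mean directly from a frequency table. The request counts are made up for illustration (they are not the data from the slides), and `mean_from_frequencies` is a hypothetical name.

```python
def mean_from_frequencies(freq_table):
    """Mean of data given as {value: frequency}."""
    n = sum(freq_table.values())
    return sum(x * f for x, f in freq_table.items()) / n

# Made-up numbers of requests per minute over 30 minutes: value -> frequency.
requests = {0: 3, 1: 8, 2: 10, 3: 6, 4: 3}
mean = mean_from_frequencies(requests)
print(round(mean, 1))
```

Each distinct value is multiplied by how often it occurs, so the result equals the ordinary mean of the 30 raw values.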
64Computing the mean
- Note: rounding rule for the mean.
- The mean is rounded to one more decimal place
than occurs in the data.
65 2. The Median
- The median, MD, is the midpoint of the entire
quantitative data array.
- To determine the median:
- Sort the data values: $X_1 \le X_2 \le \dots \le X_n$.
- Pick the value in the middle.
- For n data values,
- If n is odd,
- then $MD = X_{(n+1)/2}$
- If n is even,
- then $MD = (X_{n/2} + X_{n/2+1}) / 2$
- (Note: this need not be a data value)
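The rule can be sketched as a small function. The name `median` and the test values are illustrative. Python lists are indexed from 0, so the positions are shifted by one relative to the formulas.

```python
def median(values):
    """Median following the odd/even rule (positions are 1-based in the rule)."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                   # the ((n+1)/2)-th sorted value
    return (x[mid - 1] + x[mid]) / 2    # average of the (n/2)-th and (n/2+1)-th

print(median([3, 1, 2]))       # odd n: a data value
print(median([4, 1, 3, 2]))    # even n: need not be a data value
```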
66Finding the Median of the Number of Requests
67About the Median
- The median divides the data set into two subsets
with equally many values in such a way that all
values in the first subset do not exceed the
value of the median, while all values in the
second subset are greater than or equal to the
median value.
- If the respective RV is continuous, then the
median predicts the x-coordinate of the point
where the graph of the cumulative probability
function meets the horizontal line y = 0.5.
68 3. The Mode
- The mode is a value that has the highest
frequency in a data set.
- Defined for both qualitative and quantitative
data.
- A distribution may have one, more than one, or no
mode at all.
- If the respective RV is continuous, then the mode
predicts where the respective PDF has a peak.
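A minimal sketch for finding the mode(s) follows. It assumes one common convention that is not stated in the slides: a data set in which no value repeats is reported as having no mode. The function name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value with the highest frequency, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []   # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))       # one mode
print(modes([1, 1, 2, 2, 3]))    # two modes
print(modes(["W", "B", "W"]))    # works for qualitative data too
```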
69Finding the Mode of the Number of Requests
70Measures of Variation
- Measures of central tendency locate the center of
a distribution.
- They do not indicate how the values are
distributed around the center.
- Measures of variation examine the spread, or
variation, of data values around the center.
71The Population Variance
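- The population variance is given by
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$,
where $\mu$ is the population mean.
- Reminder: if X is a discrete RV, then its
variance is given by
$\mathrm{Var}(X) = \sum_i (x_i - E[X])^2 \, P(X = x_i)$.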
72Sample Variance
- For a better estimate (???), the sample variance
is defined by
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$,
where $\bar{x}$ is the sample mean.
73Sample Variance. Shortcut formula.
- Rearranging the terms in the formula for the
variance we arrive at an expression that does not
involve the mean explicitly:
$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$.
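The rearrangement can be written out as follows. This is a sketch of the standard argument, assuming the sample variance uses the divisor $n-1$ and that $\bar{x} = \frac{1}{n}\sum x_i$:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 .
\end{align*}
```

Dividing by $n-1$ and then multiplying numerator and denominator by $n$ leaves an expression that needs only the sum of the values and the sum of their squares.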
74The Standard Deviation
- The standard deviation is the square root of the
variance.
- It has the same units as the raw data.
75Computing the standard deviation
- Find the sample standard deviation for the
amount of European auto sales for a sample of 6
years shown. The data are in millions of dollars.
- 11.2, 11.9, 12.0, 12.8, 13.4, 14.3
76Computing the standard deviation
- Use the shortcut formula.
- 1. Find the sum of the values
77Computing the standard deviation
- 2. Square each value and find the sum
78Computing the standard deviation
- 3. Substitute into the formula
79Computing the standard deviation
- 4. Compute the square root and round the answer
to one more decimal place than occurs in the data.
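The four steps can be retraced with the short sketch below. It uses the six sales figures from the example and assumes the shortcut form $s^2 = \bigl(n \sum x^2 - (\sum x)^2\bigr) / \bigl(n(n-1)\bigr)$; the variable names are illustrative.

```python
import math

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # millions of dollars
n = len(sales)

total = sum(sales)                                        # step 1: sum of values
total_sq = sum(x * x for x in sales)                      # step 2: sum of squares
variance = (n * total_sq - total ** 2) / (n * (n - 1))    # step 3: shortcut formula
std_dev = math.sqrt(variance)                             # step 4: square root

print(f"sum = {total:.1f}, sum of squares = {total_sq:.2f}")
print(f"s^2 = {variance:.3f}, s = {std_dev:.2f}")
```

The data have one decimal place, so the standard deviation is reported with two, which gives s = 1.13 (from a variance of 1.276).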
80Future Plans
- In practice, certain characteristics of a
population are of interest, e.g. the mean,
the standard deviation, other parameters.
- Due to various limitations, only the sample mean,
sample variance etc. are available.
- The latter are estimates of the former.
- Next week we develop some techniques for
- ESTIMATION OF PARAMETERS.
81Thank you
82Food for thought. Mean for grouped data
- Suppose we are given a grouped frequency
distribution. Is it possible to find the exact
value of the mean?
- If not, can you think of a way to find an
approximate value?
- What do you think the accuracy of such an
approximation depends on?
83Approximating the Mean
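- The formula takes the form
- $\bar{x} \approx \frac{1}{n} \sum_{j=1}^{k} F_j \, x_{j,m}$
- where $x_{j,m}$ is the midpoint of class j, and
$F_j$ is the frequency of class j, for all
j = 1, ..., k, and $n = F_1 + \dots + F_k$.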
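A minimal sketch of the midpoint approximation is shown below. The three classes and their frequencies are made up for illustration, and `grouped_mean` is a hypothetical helper.

```python
def grouped_mean(classes):
    """Approximate the mean from [(lower, upper, frequency), ...] via midpoints."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

# Made-up grouped distribution: three classes of width 10.
classes = [(0, 10, 4), (10, 20, 10), (20, 30, 6)]
approx = grouped_mean(classes)
print(approx)
```

Every value in a class is treated as if it sat at the class midpoint, so the result is only an approximation of the true mean.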
84Variance for Grouped Data
- Is it possible to approximate the value of $s^2$
for grouped data?
85Median for grouped data
- Suppose we are given a grouped frequency
distribution. Is it possible to estimate the
value of the median?
- If so, describe a procedure to get the value.
(Hint: use the ogive.)
- Alternatively, describe a procedure to determine
a class that contains the median.
86Sample Percentiles
- Sometimes it is important to know below which
value a certain percentage of data in a data set
lies.
- Let p be a number between 0 and 1. The sample
100p percentile is a value such that
- at least 100p% of the data are less than or equal
to it,
- and at least 100(1-p)% of the data are greater
than or equal to it.
- If two values satisfy this condition, then their
arithmetic average is taken.
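One way to turn this definition into a procedure is sketched below, assuming 0 < p < 1: if n*p is not a whole number, the percentile is the sorted value in position ceil(n*p); if it is a whole number, two neighbouring values qualify and are averaged. The function name `sample_percentile` and the test data are illustrative.

```python
import math

def sample_percentile(values, p):
    """Sample 100p percentile for 0 < p < 1 (averaging when two values qualify)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    k = n * p
    nearest = round(k)
    if 1 <= nearest <= n - 1 and math.isclose(k, nearest):
        # n*p is a whole number: the k-th and (k+1)-th sorted values both qualify.
        return (x[nearest - 1] + x[nearest]) / 2
    # Otherwise only the ceil(n*p)-th sorted value qualifies.
    return x[math.ceil(k) - 1]

data = list(range(1, 13))              # made-up data: 1, 2, ..., 12
print(sample_percentile(data, 0.25))   # first quartile
print(sample_percentile(data, 0.50))   # median
print(sample_percentile(data, 0.75))   # third quartile
```

Note that statistical software often uses other interpolation rules, so quartiles computed elsewhere may differ slightly.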
87Sample Quartiles
- The sample 25th, 50th and 75th percentiles are
called the sample 1st, 2nd and 3rd Quartiles,
respectively.
- As their names suggest, they split a data set into
4 parts with roughly equal numbers of values.
- Note: the Second Quartile is simply the median.
88Box Plots
- A box plot for a data set is a straight line
segment stretching from the smallest to the
largest value, drawn on a horizontal axis.
- On the line we impose a box that starts at
Quartile 1 and ends at Quartile 3.
- The value of the median (Quartile 2) is indicated
by a vertical line.
- The value IQR = Q3 - Q1 is called the
inter-quartile range of the data.
- The data values smaller than Q1 - 1.5 IQR or
larger than Q3 + 1.5 IQR are called outliers and
marked by small circles on the horizontal line.
- The data lying outside the interval
[Q1 - 3 IQR, Q3 + 3 IQR] are called extreme outliers.
89Miles to travel to work - sorted
90Data for Box Plotting
- Parameter Value
- Minimum 1
- 1st Quartile 3.5
- 2nd Quartile 6.5
- 3rd Quartile 13.5
- Maximum 18
- IQR 10
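The outlier limits for these values can be checked with a short sketch. It uses only the quartiles, minimum and maximum listed above; the helper `classify` is hypothetical.

```python
q1, q3 = 3.5, 13.5             # quartiles from the table
minimum, maximum = 1, 18

iqr = q3 - q1                                  # 10.0
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)       # limits for outliers
outer = (q1 - 3 * iqr, q3 + 3 * iqr)           # limits for extreme outliers

def classify(x):
    """Label a value using the 1.5 IQR and 3 IQR limits."""
    if x < outer[0] or x > outer[1]:
        return "extreme outlier"
    if x < inner[0] or x > inner[1]:
        return "outlier"
    return "regular"

print(inner, outer)
print(classify(minimum), classify(maximum))
```

Since both the minimum (1) and the maximum (18) lie between -11.5 and 28.5, this data set contains no outliers.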