Title: Chapter One Data Collection
1Chapter One Data Collection
- 1.1
- Introduction to the Practice of Statistics
2Statistics is the science of data. A statistic
is a quantity (e.g., the mean) derived from
information collected from a sample of the
population of interest.
4Variables are the characteristics of the
individuals within the population. Variables are
the quantities we are interested in measuring to
find out something about a population.
6Qualitative, or categorical, variables allow for
classification of individuals based on some
attribute or characteristic.
Quantitative variables provide numerical measures
of individuals. Arithmetic operations (such as
computing a mean) can be performed on their values.
9Two Types of Quantitative Variables
A discrete variable is a quantitative variable
that has either a finite number of possible
values or a countable number of possible values.
The term countable means the values result from
counting, such as 0, 1, 2, 3, and so on.
A continuous variable is a quantitative variable
that has an infinite number of possible values
that are not countable. It can be measured to any
desired level of accuracy.
14- The list of observations that a set of variables assumes is called data.
- Qualitative data are observations corresponding to a qualitative variable.
- Quantitative data are observations corresponding to a quantitative (numerical) variable.
- Discrete data are observations corresponding to a discrete variable.
- Continuous data are observations corresponding to a continuous variable.
15Chapter One Data Collection
- 1.2 1.3
- Observational Studies
- Random Sampling
17A population is the entire group of people or
objects that we are interested in studying. A
sample is simply a subcollection of individuals
from a population.
20- Two Types of Studies
- Observational Study
- Designed Experiment
22An observational study does not attempt to
manipulate or apply a treatment to the
individuals in the sample. Observational studies
are useful for determining if there is a relation
(correlation) between two variables in a
population.
28- Sampling Techniques from Finite Populations
- Simple Random Samples
- Stratified Samples
- Systematic Samples
- Cluster Samples
- Convenience Samples
32Simple Random Samples
N = number of individuals in the population.
n = number of individuals selected in the sample.
If every possible sample of size n is equally
likely to be selected, the result is a simple
random sample.
36Steps for Obtaining a Simple Random Sample
1) List all the individuals in the population of interest.
2) Number the individuals from 1 to N.
3) Use a random number table, graphing calculator, or statistical software to randomly generate n numbers, where n is the desired sample size.
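The three steps above can be sketched in Python. This is only an illustration: the function name `simple_random_sample`, the `seed` parameter, and the example roster are assumptions made here, and Python's standard `random` module stands in for the random number table or calculator.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n individuals so that every sample of size n is equally likely."""
    rng = random.Random(seed)
    # Steps 1 and 2: list the individuals and number them 1..N.
    numbered = dict(enumerate(population, start=1))
    # Step 3: randomly generate n distinct numbers between 1 and N.
    chosen = rng.sample(range(1, len(numbered) + 1), n)
    return [numbered[i] for i in chosen]

# Hypothetical population of six students.
students = ["Ana", "Ben", "Cam", "Dee", "Eli", "Fay"]
print(simple_random_sample(students, 3, seed=1))
```

Fixing `seed` only makes the draw repeatable for demonstration; leave it as `None` for an actual sample.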
40Stratified Random Sample
A stratified random sample is obtained by
separating the population into non-overlapping
groups called strata. A simple random sample is
then obtained from each stratum. Each stratum is
relatively homogeneous with respect to a certain
variable.
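A minimal sketch of stratified sampling, assuming each individual can be mapped to its stratum by a function. The names `stratified_sample`, `stratum_of`, and `n_per_stratum`, and the roster data, are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict

def stratified_sample(individuals, stratum_of, n_per_stratum, seed=None):
    """Split individuals into strata, then take a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in individuals:
        strata[stratum_of(person)].append(person)   # non-overlapping groups
    sample = []
    for members in strata.values():
        # Never ask for more individuals than the stratum contains.
        size = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical roster: (name, class year) with class year as the stratum.
roster = [("Ana", "freshman"), ("Ben", "freshman"), ("Cam", "senior"),
          ("Dee", "senior"), ("Eli", "senior")]
picked = stratified_sample(roster, lambda p: p[1], 1, seed=0)
print(picked)
```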
41A systematic sample is obtained by selecting
every kth individual from the population up to
the desired sample size n.
47STEPS IN SYSTEMATIC SAMPLING, POPULATION SIZE KNOWN
Step 1: Determine the population size, N.
Step 2: Determine the sample size desired, n.
Step 3: Compute N/n and round down to the nearest integer. This value is k.
Step 4: Randomly select a number between 1 and k. Call this number p.
Step 5: Select every kth individual, starting at the pth individual. The sample will consist of the following individuals: p, p + k, p + 2k, ...
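The five steps can be followed line by line in a short Python sketch. The function name `systematic_sample` and the `seed` parameter are assumptions for illustration; positions are 1-based, as on the slide.

```python
import random

def systematic_sample(population, n, seed=None):
    """Select every kth individual starting from a random position p."""
    rng = random.Random(seed)
    N = len(population)        # Step 1: population size
    k = N // n                 # Steps 2-3: N/n rounded down
    p = rng.randint(1, k)      # Step 4: random start between 1 and k, inclusive
    # Step 5: positions p, p + k, p + 2k, ... until n individuals are chosen.
    positions = [p + i * k for i in range(n)]
    return [population[pos - 1] for pos in positions]

# Hypothetical population: individuals numbered 1 to 25, sample size 5.
print(systematic_sample(list(range(1, 26)), 5, seed=3))
```

Because p is at most k, the last position p + (n - 1)k never exceeds N, so the sketch stays inside the list.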
48A cluster sample is obtained by selecting all
individuals within a randomly selected collection
or group of individuals.
49A convenience sample is one in which the
individuals in the sample are easily obtained.
Any studies that use this type of sampling
generally have results that are suspect. Results
should be looked upon with extreme skepticism.
50Chapter One Data Collection
- 1.5
- The Design of Experiments
52A designed experiment is a controlled study in
which one or more treatments are applied to
experimental units. The experimenter then
observes the effect of varying these treatments
on a response variable.
56The experimental unit (or subject) is a person, object, or some other well-defined item to which a treatment is applied.
The treatment is a condition applied to the experimental unit.
A response variable is a quantitative or qualitative variable that represents our variable of interest.
A predictor variable is a variable that affects the response variable. Predictor variables include the experimental treatment.
57Chapter Two Organizing and Summarizing Data
- 2.1
- Organizing Qualitative Data
63Suppose there are k categories, numbered 1 to k,
and we collect a sample of size n. Then n1
observations fall in category 1, n2 fall in
category 2, ..., and nk fall in category k.
66The frequency of a category is the number of
times the data fall into that category. The
frequency of category j is nj for j = 1, ..., k.
Note: n = n1 + n2 + ... + nk.
68A frequency distribution lists the number of
occurrences for each category of data. In other
words, the list of frequencies (n1, n2, ..., nk)
is the frequency distribution for the
categorical variable with categories 1 through k.
71The relative frequency is the proportion or
percent of observations within a category and is
found using the formula
relative frequency = frequency / sum of all frequencies.
That is, the relative frequency for category j is
rj = nj / n.
74A relative frequency distribution lists the
relative frequency of each category of data. In
other words, the list of relative frequencies
(r1, r2, ..., rk) is the relative frequency
distribution for the categorical variable with
categories 1 through k.
Note: r1 + r2 + ... + rk = 1.
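As a small illustration of the two distributions, the sketch below tallies a made-up list of categorical observations. The function names and the `colors` data are assumptions for this example; `collections.Counter` from the standard library does the counting.

```python
from collections import Counter

def frequency_distribution(observations):
    """Frequency nj of each category j."""
    return Counter(observations)

def relative_frequency_distribution(observations):
    """Relative frequency rj = nj / n of each category j."""
    counts = Counter(observations)
    n = sum(counts.values())          # n = n1 + n2 + ... + nk
    return {category: count / n for category, count in counts.items()}

# Hypothetical sample of size n = 6 with k = 3 categories.
colors = ["red", "blue", "red", "green", "blue", "red"]
freq = frequency_distribution(colors)
rel = relative_frequency_distribution(colors)
print(freq)
print(rel)
```

The frequencies add up to n and the relative frequencies add up to 1, matching the two notes above.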
78- A bar graph is constructed by
- Labeling each category of data on a horizontal axis.
- Marking the frequency or relative frequency of the category on the vertical axis.
- Drawing a rectangle of equal width for each category, with height equal to the category's frequency or relative frequency.
79Chapter Two Organizing and Summarizing Data
- 2.2
- Organizing Quantitative Data I
84- Summarizing Quantitative Data
- Discrete Data
- Recall that discrete data consist of a finite or countable (0, 1, 2, ...) number of numerical values.
- Continuous Data
- Recall that continuous data are real numbers: they have an infinite number of possible values and can be measured with any degree of accuracy.
87When summarizing quantitative data, we need to
create groups of numbers called classes. We can
then construct frequency distributions and
relative frequency distributions using these
classes. With discrete data we can use each
individual number as its own class.
91A histogram is a graphical representation of the
frequencies in each class. It is constructed by
drawing a rectangle for each class of data.
Frequency histogram: the height is the frequency of the class.
Relative frequency histogram: the height is the relative frequency of the class.
94Continuous data are summarized similarly to
discrete data. However, with continuous data we
need to create classes instead of using
individual numbers as classes. Classes for
continuous data are created by using
non-overlapping intervals of (usually) equal
width.
96A rule of thumb is that we want approximately 5
to 20 classes. For smaller datasets use fewer
classes and for larger datasets use more
classes.
103Steps for Making a Frequency Distribution with Continuous Data
Step 1: Determine the number of classes, C.
Step 2: Calculate the range of the data, R = Largest Value - Smallest Value.
Step 3: Let W = R / C (approximately). W is called the class width.
Step 4: Starting at a value equal to or slightly less than the lowest value in the data, create C classes of width W.
Step 5: Tally the frequencies for each class.
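One way to carry out these steps in Python is sketched below. Several choices here are assumptions, not part of the slides: the function name `class_frequencies`, starting the first class exactly at the smallest value, using W = R / C without rounding, and treating each class as including its lower limit but not its upper limit (the largest value is placed in the last class).

```python
def class_frequencies(data, num_classes):
    """Return [((lower, upper), frequency), ...] for equal-width classes."""
    low, high = min(data), max(data)
    width = (high - low) / num_classes           # Steps 2-3: W = R / C
    counts = [0] * num_classes
    for x in data:                               # Step 5: tally each observation
        # The largest value would land one class too far, so cap the index.
        index = min(int((x - low) / width), num_classes - 1)
        counts[index] += 1
    # Step 4: C classes of width W starting at the lowest value.
    classes = [(low + i * width, low + (i + 1) * width)
               for i in range(num_classes)]
    return list(zip(classes, counts))

# Hypothetical data, C = 5 classes.
table = class_frequencies([0, 1, 2, 3, 4, 5, 6, 7, 8, 10], 5)
for (lower, upper), count in table:
    print(f"{lower:>4.1f} to {upper:>4.1f}: {count}")
```

In practice W is often rounded up to a convenient value; the sketch skips that to keep the arithmetic visible.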
106Histograms for continuous data are made exactly
as for discrete data. We can make a frequency
histogram using the frequency distribution. We
can make a relative frequency histogram using the
relative frequency distribution.
109Stem-and-Leaf Plots
Stem-and-leaf plots are analogous to histograms
but display more numerical detail of the data. A
stem-and-leaf plot is essentially a histogram
turned on its side.
113Construction of a Stem-and-Leaf Plot
Step 1: The stem is the leading digit(s). The leaf is the rightmost digit. (The choice of the stem depends upon the class width desired.)
Step 2: Write the stems in a vertical column in increasing order. Draw a vertical line to the right of the stems.
Step 3: Write each leaf corresponding to the stems to the right of the vertical line. The leaves must be written in ascending order.
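The construction can be sketched in Python for the simple case of non-negative whole numbers, where the stem is everything but the last digit. The function name `stem_and_leaf` and the sample values are assumptions for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):             # sorting keeps leaves in ascending order
        stem, leaf = divmod(v, 10)       # Step 1: leading digit(s) and last digit
        leaves[stem].append(leaf)
    rows = []
    # Step 2: stems in increasing order, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: leaves to the right of the vertical line.
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

for row in stem_and_leaf([12, 15, 21, 27, 27, 40]):
    print(row)
```

Each row still shows the individual digits, so the original values can be read back from the plot.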
116Advantage of Stem-and-Leaf Diagrams over
Histograms
Once a frequency distribution or histogram of
continuous data is created, the raw data is
lost. However, the raw data can be retrieved
from the stem-and-leaf plot.
117Distribution Shapes
121Chapter Two Organizing and Summarizing Data
- 2.3
- Organizing Quantitative Data II
124A cumulative frequency table displays the
aggregate frequency of the category. In other
words, it displays the total number of
observations less than or equal to the category.
A cumulative relative frequency table displays
the aggregate proportion (or percent) of
observations less than or equal to the category.
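Cumulative tables are running totals of an ordinary frequency distribution. The sketch below assumes the class frequencies are already listed in increasing order of class; the function name `cumulative_tables` and the example frequencies are made up for illustration.

```python
from itertools import accumulate

def cumulative_tables(frequencies):
    """Return (cumulative frequencies, cumulative relative frequencies)."""
    cumulative = list(accumulate(frequencies))   # running totals
    n = cumulative[-1]                           # last total is the sample size
    cumulative_relative = [c / n for c in cumulative]
    return cumulative, cumulative_relative

# Hypothetical frequencies for three classes.
cum, cum_rel = cumulative_tables([2, 5, 3])
print(cum)
print(cum_rel)
```

The last cumulative frequency equals n and the last cumulative relative frequency equals 1.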
129Definitions
The lower class limit of a class is the smallest value within the class.
The upper class limit of a class is the largest value within the class.
The class midpoint is found by adding a class's lower class limit and upper class limit and dividing the result by 2. That is,
class midpoint = (lower class limit + upper class limit) / 2.
133Frequency Polygon
Step 1: Mark each class midpoint on a horizontal axis.
Step 2: Plot a point above each class midpoint at a height equal to the frequency of the class.
Step 3: After the points for each class are plotted, draw straight lines between consecutive points.
137Relative Frequency Polygon
Step 1: Mark each class midpoint on a horizontal axis.
Step 2: Plot a point above each class midpoint at a height equal to the relative frequency of the class.
Step 3: Connect the dots.
142Frequency ogive: a graph that represents the cumulative frequency or cumulative relative frequency for the class.
Step 1: Plot the upper class limits on a horizontal axis.
Step 2: Plot the cumulative frequency above each upper class limit.
Step 3: Connect the dots.
145Time Series Plots
If the value of a variable is measured at
different points in time, the data are referred to
as time series data.
A time series plot is obtained by plotting the
time at which a variable is measured on the
horizontal axis and the corresponding value of
the variable on the vertical axis. Lines are
then drawn connecting the points.
146Chapter Three Numerically Summarizing Data
- 3.1
- Measures of Central Tendency
151Some Definitions
- A parameter is a descriptive measure of a
population.
- A statistic is a descriptive measure of a sample.
- A statistic which is used to estimate a
population parameter is called an estimator.
- A statistic is an unbiased estimator of a
parameter if it does not consistently over- or
underestimate the parameter.
157- Measures of Centrality
- A measure of centrality is a measure of the center of the data.
- Center can be defined in different ways.
- (1) Arithmetic mean.
- (2) Median.
- (3) Mode.
161- Arithmetic Mean
- The arithmetic mean of a variable is computed by
- Summing all the values of the variable in the data set.
- Dividing the sum by the number of values.
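165The population arithmetic mean, $\mu$, is computed using
all the individuals in a population. The
population mean is a parameter. We usually do not
know what its value is.
The population mean is denoted by $\mu$. For population values $x_1, x_2, \dots, x_N$,
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}.$$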
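168The sample arithmetic mean, $\bar{x}$, is computed using
sample data. The sample mean is denoted by $\bar{x}$. For sample values $x_1, x_2, \dots, x_n$,
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$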
175- Median
- The median M is computed by
- Arranging the data in order from smallest to largest.
- Choosing the value in the exact middle.
- Half the data are below the median.
- Half the data are above the median.
181- Precise Steps for Calculating the Median
- Arrange the data in ascending order.
- Determine the number of observations, n.
- If n is an odd number, the median M is the value in the middle of the data: the value in position (n + 1) / 2.
- If n is an even number, the median M is the average of the two observations in the middle, i.e., the average of the value in position n / 2 and the value in position (n / 2) + 1.
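The precise steps translate directly into Python. The function name `median` is chosen here for illustration (the standard library's `statistics.median` does the same job); the only care needed is converting the slide's 1-based positions to Python's 0-based indexes.

```python
def median(data):
    """Median computed by the odd/even position rule."""
    ordered = sorted(data)                   # arrange in ascending order
    n = len(ordered)                         # number of observations
    if n % 2 == 1:
        # Odd n: value in position (n + 1) / 2 (1-based), so index (n + 1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values in positions n/2 and (n/2) + 1 (1-based).
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([5, 1, 3]))       # odd number of observations
print(median([4, 1, 3, 2]))    # even number of observations
```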
185Mode
The mode of a variable is the most frequent
observation of the variable that occurs in the
data set. If no observation occurs more often
than the others, we say the data have no mode.
The mode is used most often with categorical data.
188Comparison of Mean and Median
The arithmetic mean is sensitive to extreme (very
large or small) values in the data. The median is
resistant to extreme values.
192- Use Median When
- The data have unusually large or small values relative to the entire set of data.
- The distribution of the data is skewed.
- The median gives a more accurate picture of the center of the data in these situations.
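A quick numerical illustration of why the median is preferred with extreme values. The salary figures are invented for this example; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]      # add one unusually large value

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
```

Adding the single large value pulls the mean from 35 to above 95, while the median moves only from 35 to 36.5.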
197Chapter Three Numerically Summarizing Data
- 3.2
- Measures of Dispersion
201- Measures of Dispersion
- Range
- Variance
- Standard Deviation
204Range
The range, R, of a variable is the difference
between the largest data value and the smallest
data value.
R = Largest Data Value - Smallest Data Value
207Population Variance
The population variance of a variable is the sum
of squared deviations about the population mean
divided by the number of observations in the
population, N. In other words, it is the average
squared deviation about the mean.
216Population Variance
The N population values are $x_1, x_2, \dots, x_N$.
Step 1: Determine the population mean $\mu$.
Step 2: Determine the differences $x_1 - \mu, x_2 - \mu, \dots, x_N - \mu$.
Step 3: Square the differences: $(x_1 - \mu)^2, (x_2 - \mu)^2, \dots, (x_N - \mu)^2$.
Step 4: Take the average: $\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$.
217The population variance is symbolically
represented by $\sigma^2$ (lower case Greek sigma, squared).
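The four steps for the population variance can be written out as a short Python sketch. The function name `population_variance` and the eight data values are assumptions for illustration; the values are treated as an entire population.

```python
def population_variance(values):
    """Average squared deviation about the population mean."""
    N = len(values)
    mu = sum(values) / N                        # Step 1: population mean
    deviations = [x - mu for x in values]       # Step 2: differences x_i - mu
    squared = [d ** 2 for d in deviations]      # Step 3: squared differences
    return sum(squared) / N                     # Step 4: average them

# Hypothetical population of N = 8 values with mean 5.
print(population_variance([2, 4, 4, 4, 5, 5, 7, 9]))
```

Here the squared deviations sum to 32, so the variance is 32 / 8 = 4.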
218Sample Variance
The sample variance $s^2$ is computed by
determining the sum of squared deviations about
the sample mean and then dividing this result by
$n - 1$.
224Sample Variance
The n sample values are $x_1, x_2, \dots, x_n$.
Step 1: Determine the sample mean $\bar{x}$.
Step 2: Determine the differences $x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}$.
Step 3: Square the differences: $(x_1 - \bar{x})^2, (x_2 - \bar{x})^2, \dots, (x_n - \bar{x})^2$.
Step 4: Sum and divide by $n - 1$: $\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$.
225Note: Whenever a statistic consistently
overestimates or underestimates a parameter, it
is called biased. To obtain an unbiased estimate
of the population variance, we divide the sum of
the squared deviations about the mean by n - 1.
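As a rough illustration of the two procedures, the sketch below computes both variances with plain Python. The function names and the data values are invented for this example and are not part of the original slides. The only difference between the two functions is the divisor: N for a population, n - 1 for a sample.

```python
def population_variance(values):
    """Average squared deviation about the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


# Hypothetical data, invented for this illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0
print(sample_variance(data))      # slightly larger, because the divisor is n - 1
```

For the same data the sample formula always gives the larger value, since the same sum of squares is divided by a smaller number.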
228Population Standard Deviation The population
standard deviation is denoted by $\sigma$.
It is obtained by taking the square root of the
population variance, so that $\sigma = \sqrt{\sigma^2}$.
231Sample Standard Deviation The sample standard
deviation is denoted by s
It is obtained by taking the square root of the
sample variance.
239The Empirical Rule If the population is
approximately bell-shaped, then we have the
following rules of thumb:
68% of the data lies within 1 s.d. of the mean, $(\mu - \sigma, \mu + \sigma)$
95% of the data lies within 2 s.d. of the mean, $(\mu - 2\sigma, \mu + 2\sigma)$
99.7% of the data lies within 3 s.d. of the mean, $(\mu - 3\sigma, \mu + 3\sigma)$
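The rules of thumb can be tried out with a short sketch. It assumes a mean and standard deviation are already known, and it also counts what fraction of a small data set actually falls inside each interval. The function names, the numbers 100 and 15, and the data list are hypothetical, and the data are treated as a whole population (hence `pstdev`).

```python
from statistics import mean, pstdev


def empirical_intervals(mu, sigma):
    """Intervals that should hold roughly 68%, 95% and 99.7% of bell-shaped data."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}


def fraction_within(values, k):
    """Observed fraction of values within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    low, high = mu - k * sigma, mu + k * sigma
    return sum(low <= x <= high for x in values) / len(values)


# Hypothetical mean and standard deviation.
print(empirical_intervals(100, 15))

# Hypothetical data; a set this small will only loosely follow the rule.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(fraction_within(data, 1))  # 0.75
```

Comparing the observed fractions with 68%, 95% and 99.7% is one informal way to judge whether a data set is roughly bell-shaped.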
242Chapter 3 Numerically Summarizing Data
- 3.3
- Measures of Central Tendency and Dispersion from
Grouped Data
247- There are several ways of measuring the location of a data point.
- Idea: We want to locate a data point relative to the other data.
- z-scores
- percentiles (median, quartiles)
250The z-score represents the number of standard
deviations that a data value is from the mean.
It is obtained by subtracting the mean from the
data value and dividing this result by the
standard deviation. The z-score is unitless, and the
z-scores of a data set have a mean of 0 and a
standard deviation of 1.
254Population z-score: $z = \frac{x - \mu}{\sigma}$
Sample z-score: $z = \frac{x - \bar{x}}{s}$
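A minimal sketch of the z-score computation follows. The helper name `z_score` and the data are made up for illustration. The data are treated as a full population, so the population mean and standard deviation are used. For a sample, the sample mean and sample standard deviation would take their place.

```python
from statistics import mean, pstdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies from the mean."""
    return (x - center) / spread


# Hypothetical data, treated here as a complete population.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = mean(data)
sigma = pstdev(data)

z_values = [z_score(x, mu, sigma) for x in data]
print(z_values)

# The standardized values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(mean(z_values), pstdev(z_values))
```

Because the units cancel in the division, z-scores from data sets measured in different units can be compared directly.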
257Percentiles The kth percentile, denoted $P_k$, of
a dataset divides the lower k% of the data from
the upper (100 - k)% of the data. The median
divides the lower 50% of the data from the upper
50%.
260Computing the kth Percentile, $P_k$
Step 1 Arrange the n data points in ascending
order.
Step 2 Let $i = \frac{k}{100} \cdot n$.
Step 3 (a) If i is not an integer, round up to
the next highest integer. $P_k$ is the ith value of
the data. (b) If i is an integer, then $P_k$ is the
mean of the ith and (i + 1)st data values.
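The three steps can be sketched in Python as below. The sketch assumes the index in Step 2 is i = (k/100)·n, which is the form consistent with the rounding rule in Step 3. The function name and the data are hypothetical. Integer arithmetic is used to decide whether i is a whole number, so floating-point rounding cannot affect the choice between cases (a) and (b).

```python
def percentile(values, k):
    """kth percentile using the round-up / average rule described above.

    Assumes k is an integer from 1 to 99 and the index is i = (k / 100) * n.
    """
    if not 1 <= k <= 99:
        raise ValueError("k must be an integer between 1 and 99")
    data = sorted(values)                # Step 1: ascending order
    n = len(data)
    whole, remainder = divmod(k * n, 100)  # Step 2: i = whole + remainder / 100
    if remainder != 0:
        # Step 3(a): i is not an integer, so round up and take that value.
        position = whole + 1
        return data[position - 1]        # lists are 0-based, positions are 1-based
    # Step 3(b): i is an integer, so average the ith and (i + 1)st values.
    return (data[whole - 1] + data[whole]) / 2


# Hypothetical data, invented for this illustration.
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(percentile(data, 25))  # 1.5
print(percentile(data, 50))  # 3.5
print(percentile(data, 90))  # 9
```

Other textbooks and software packages use slightly different percentile conventions, so results from a library routine may not match this rule exactly.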
265The most common percentiles are quartiles.
Quartiles divide data sets into fourths or four
equal parts.
Q1 The 1st quartile divides the bottom 25% of the data from the top 75%. (25th percentile.)
Q2 The 2nd quartile divides the bottom 50% of the data from the top 50%. (50th percentile, or median.)
Q3 The 3rd quartile divides the bottom 75% of the data from the top 25%. (75th percentile.)
270Checking for Outliers Using Quartiles
Step 1 Determine the first and third quartiles
of the data.
Step 2 Compute the interquartile range. The
interquartile range or IQR is the difference
between the third and first quartile. That is,
IQR = Q3 - Q1
Step 3 Compute the fences that serve as cut-off
points for outliers.
Lower Fence = Q1 - 1.5(IQR) Upper Fence = Q3 +
1.5(IQR)
Step 4 If a data value is less than the lower
fence or greater than the upper fence,
then it is considered an outlier.
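As a sketch of Steps 2 through 4, the code below takes the quartiles as given inputs and applies the fence rule. The function names, the quartile values 1.5 and 5.5, and the data are all made up for the example.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1                    # Step 2: interquartile range
    lower = q1 - 1.5 * iqr           # Step 3: cut-off points
    upper = q3 + 1.5 * iqr
    return lower, upper


def find_outliers(values, q1, q3):
    """Step 4: values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Hypothetical data and quartiles, supplied by hand for this illustration.
data = [1, 1, 2, 3, 4, 5, 6, 9, 20]
q1, q3 = 1.5, 5.5

print(fences(q1, q3))               # (-4.5, 11.5)
print(find_outliers(data, q1, q3))  # [20]
```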
271Chapter 3 Numerically Summarizing Data
- Section 3.5
- Five-Number Summary and Boxplots
277The Five-Number Summary: MINIMUM, Q1,
Median, Q3, MAXIMUM
278A boxplot is a graphical representation of the
five-number summary.
283Steps for Drawing a Boxplot
Step 1 Draw vertical lines at Q1, M, and Q3.
Enclose these vertical lines in a box.
Step 2 Label the lower and upper fence.
Step 3 Draw a line from Q1 to the smallest data
value that is larger than the lower fence. Draw a
line from Q3 to the largest data value that is
smaller than the upper fence.
Step 4 Any data values less than the lower fence
or greater than the upper fence are outliers and
are marked with an asterisk (*).
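The whisker rule in Step 3 and the outlier marks in Step 4 can be sketched without any plotting library. The helper names are hypothetical, and the fences are passed in as already-computed numbers. The data and fence values are invented for illustration.

```python
def whisker_ends(values, lower_fence, upper_fence):
    """Step 3: ends of the two lines drawn out from the box."""
    inside = [x for x in values if lower_fence < x < upper_fence]
    if not inside:
        raise ValueError("no data values lie strictly between the fences")
    return min(inside), max(inside)


def marked_outliers(values, lower_fence, upper_fence):
    """Step 4: outliers, each tagged with an asterisk as on the plot."""
    return [f"{x}*" for x in values if x < lower_fence or x > upper_fence]


# Hypothetical data and fences, invented for this illustration.
data = [1, 1, 2, 3, 4, 5, 6, 9, 20]
lower_fence, upper_fence = -4.5, 11.5

print(whisker_ends(data, lower_fence, upper_fence))     # (1, 9)
print(marked_outliers(data, lower_fence, upper_fence))  # ['20*']
```

Note that the whiskers stop at actual data values, not at the fences themselves.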
284Symmetric
285Skewed Right
286Skewed Left
287Chapter 4 Describing the Relation Between Two
Variables
- 4.1
- Scatter Diagrams and Correlation
290The response variable is the variable whose value
we want to explain, predict or control. The
predictor variable is the variable which
explains, predicts, or controls the
response. Data for which two variables are
measured for each unit in the sample is called
bivariate data.
294A scatter diagram shows the relationship between
two quantitative variables measured on the same
individual. Each individual in the data set is
represented by a point in the scatter diagram.
The predictor variable is plotted on the
horizontal axis. The response variable is plotted
on the vertical axis.
295Two variables that are linearly related are said
to be positively associated if, as the values of
the predictor variable increase, the values of
the response variable also increase.
296Two variables that are linearly related are said
to be negatively associated if, as the values of
the predictor variable increase, the values of
the response variable decrease.
301The sample correlation coefficient is a measure
of the strength of linear relation between two
quantitative variables. We let r denote the
sample correlation coefficient. r close to 1
indicates strong positive linear relation. r
close to 0 indicates little linear relation. r
close to -1 indicates strong negative linear
relation.
309Suppose we have bivariate data
X      Y
$x_1$  $y_1$
$x_2$  $y_2$
$x_3$  $y_3$
...    ...
$x_n$  $y_n$
n is the number of units sampled
$\bar{x}$ is the sample mean for X, $\bar{y}$ is the sample mean for Y
$s_x$ is the sample s.d. for X, $s_y$ is the sample s.d. for Y
314Steps for Calculating r
Step 1 Calculate the sample mean $\bar{x}$ for variable X and $\bar{y}$ for variable Y.
Step 2 Calculate the sample standard deviation $s_x$ for variable X and $s_y$ for variable Y.
Step 3 Calculate z-scores for all the data:
$\frac{x_1 - \bar{x}}{s_x}, \frac{x_2 - \bar{x}}{s_x}, \ldots, \frac{x_n - \bar{x}}{s_x}$
$\frac{y_1 - \bar{y}}{s_y}, \frac{y_2 - \bar{y}}{s_y}, \ldots, \frac{y_n - \bar{y}}{s_y}$
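318Steps for Calculating r
Step 4 Multiply the respective z-scores for X and Y:
$\frac{x_1 - \bar{x}}{s_x} \cdot \frac{y_1 - \bar{y}}{s_y}, \; \frac{x_2 - \bar{x}}{s_x} \cdot \frac{y_2 - \bar{y}}{s_y}, \; \ldots, \; \frac{x_n - \bar{x}}{s_x} \cdot \frac{y_n - \bar{y}}{s_y}$
Step 5 Add these together and divide by $n - 1$:
$r = \frac{1}{n - 1} \left[ \frac{x_1 - \bar{x}}{s_x} \cdot \frac{y_1 - \bar{y}}{s_y} + \cdots + \frac{x_n - \bar{x}}{s_x} \cdot \frac{y_n - \bar{y}}{s_y} \right]$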
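The five steps for calculating r translate fairly directly into code. The following is a sketch under the assumption that both variables are given as equal-length lists. The function names and the data are hypothetical.

```python
from math import sqrt


def sample_mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divide the sum of squares by n - 1)."""
    xbar = sample_mean(values)
    return sqrt(sum((x - xbar) ** 2 for x in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sample correlation coefficient r via products of z-scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    xbar, ybar = sample_mean(xs), sample_mean(ys)   # Step 1
    sx, sy = sample_sd(xs), sample_sd(ys)           # Step 2
    zx = [(x - xbar) / sx for x in xs]              # Step 3
    zy = [(y - ybar) / sy for y in ys]
    products = [a * b for a, b in zip(zx, zy)]      # Step 4
    return sum(products) / (n - 1)                  # Step 5


# Hypothetical bivariate data: y is exactly 2x, so r should be very close to 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(correlation(xs, ys))

# Reversing y gives a perfectly decreasing relation, so r is very close to -1.
print(correlation(xs, ys[::-1]))
```

The printed values may differ from 1 and -1 in the last decimal places because of floating-point rounding.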
326Chapter 4 Describing the Relation Between Two
Variables
- 4.2
- Least-squares Regression
330Recall that the equation for a line is given
by $Y = mX + b$, where m = slope of the line and
b = intercept of the line.
331Recall that the equation for a line is given