Title: Chapter Overview
1 Chapter Overview
- Statistical Process
- Know your question; plan your study or experiment
- ID population and parameter of interest
- Plan your study (using historical data; can't claim causation)
- Or plan your experiment (control, random selection, random assignment to control, replication)
- EAC
- Evidence (collect data)
- Analysis (graph and calculate statistics)
- Conclusion (only infer to population if SRS)
- Analysis (A of EAC): Present and describe sets of data
- Graph
- Qualitative data (attribute, or categorical, data; used to show relationships/proportions)
- Circle graph
- Bar graph
- Pareto diagram (bar graph with cumulative line graph)
- Quantitative data (uses graph to show dispersion)
- Dot plot
- Stem and leaf
- Frequency chart
- Conclusion (C of EAC): Interpret findings so that we know what the data is telling us about the sampled population
2 Summary 2.1-2.2: Data Presentation
- Categorical (Qualitative)
- Circle graph
- Bar graph
- Bars spaced (x axis is categories not measure)
- Pareto
- Prioritized columns descending
- Plots cumulative
- Exceptions
- Some quantitative data summarized into categories
- Quantitative
- Dot plot
- Vertical or horizontal
- Stem and leaf
- State leaf unit
- Large group subdivide into 2 or 5 subgroups
- Histogram
- Bar graph of frequency distr.
- Bars touch (x axis is a scale)
- Frequency distribution
- Ungrouped
- Grouped (f vs. midpoint)
- Relative ( vs. class boundaries)
- Cumulative
- Cumulative relative (Ogive vs upper limit of
class)
Interpret Patterns
- Overlapping distributions? Separate into back-to-backs (p. 47)
- Normal: symmetrical, mirror image
- Uniform: box-like
- Bimodal: two peaks
- Skewed left: negatively skewed, mean to left of median
- Skewed right: positively skewed, mean to right of median
- J-shaped: no tail on side with highest frequency
3 2.3-2.4 Measures Overview (no rounding until the final answer; round to one more place than the data)
- 2.3 Measures of Center
- Mean
- Of sample
- Of population
- Depth of median: (n + 1)/2
- Median (MD)
- Mode
- Midrange (MR): (hi + low)/2
- Measures for grouped data
- mean
- Weighted mean
- Variance and standard deviation
- 2.4 Measures of Dispersion
- Range: hi - low
- Deviation from mean
- Abs. value of deviation
- Mean abs. deviation
- Sample variance
- Sample standard deviation s
- Square root of variance
- (Used to est. pop. std. dev.; dividing by n underestimates, so n - 1 is used to reduce bias)
- Population standard deviation
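The measures of center and dispersion listed above can be sketched in a few lines. This is only an illustration: the data values are hypothetical, chosen to keep the arithmetic easy, and the variable names are not from the course materials.

```python
import math
import statistics

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
data = [3, 5, 6, 6, 10]
n = len(data)

mean = sum(data) / n                      # sample mean: 30 / 5 = 6.0
depth = (n + 1) / 2                       # depth of the median: position 3 in sorted data
median = statistics.median(data)          # middle value of the sorted data
mode = statistics.mode(data)              # most frequent value
midrange = (max(data) + min(data)) / 2    # (hi + low) / 2
data_range = max(data) - min(data)        # hi - low

deviations = [x - mean for x in data]                    # deviations from the mean (sum to 0)
mean_abs_dev = sum(abs(d) for d in deviations) / n       # mean absolute deviation
sample_var = sum(d ** 2 for d in deviations) / (n - 1)   # divide by n - 1, not n
sample_sd = math.sqrt(sample_var)                        # square root of the variance

print(mean, median, mode, midrange, data_range)
print(mean_abs_dev, sample_var, round(sample_sd, 2))
```

Following the rounding note above, nothing is rounded until the final print.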
SECS
- Shape: unimodal, bimodal, symmetric, skewed (+ or -)
- Exceptions: the 1.5 IQR rule, meaning values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)
- Center (measures of central tendency): mean, median, mode, midrange
- Spread (measures of dispersion): range, interquartile range, variance, standard deviation
- RESISTANT measures: mean or median? Variance or standard deviation?
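A minimal sketch of the 1.5 IQR check for exceptions, assuming quartiles are taken as the medians of the lower and upper halves of the sorted data. That is one common textbook convention; calculators and software may use slightly different quartile rules. The function names and data are hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves of the sorted data.

    Assumption: for an odd count, the middle value is left out of both halves.
    """
    s = sorted(values)
    if len(s) < 2:
        raise ValueError("need at least two values")
    half = len(s) // 2
    lower = s[:half]     # values below the median position
    upper = s[-half:]    # values above the median position
    return statistics.median(lower), statistics.median(upper)

def outlier_fences(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 5, 6, 7, 8, 9, 30]   # hypothetical values
low_fence, high_fence = outlier_fences(data)
outliers = [x for x in data if x < low_fence or x > high_fence]
print(low_fence, high_fence, outliers)
```

Note that the fences are built from the IQR, which is resistant, rather than from the standard deviation.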
4 Remember Chapter 1
- Only experiments can conclude causation
- 3 Requirements for Exp.
- Control
- May have 2 groups being compared (comparative exp.)
- Direct manipulation of independent variable
- Measure the effect on the dependent or response variable in a quantified way
- Randomization
- Statistical equivalence (subjects must be as similar as possible)
- RANDOM SELECTION
- RANDOM ASSIGNMENT to control or treatment group
- Control of extraneous variables
- Replication
- Repeat with large numbers so chance variation can be reduced and the effect of treatment more easily identified
- Process
- Purpose of research
- Plan: define var., pop., sample, methods
- Collect data
- Analyze data
5 Additional Tips From Prep Workbooks
- Outliers: the 1.5 IQR rule means values below Q1 - 1.5(Q3 - Q1) or above Q3 + 1.5(Q3 - Q1)
- Dot plots are ideal for discrete data
- "Can you find ... from hist., cum. rel. freq., etc." means: is it possible, even if it takes multiple steps?
- "Which is best? Hist., stem and leaf" means: which one can it be pulled from the quickest?
- Cut-point, class mark for grouped histogram
- Data set could be 5, 5, 5, 5, where the box plot would be just a mark at 5
- When asked how a change affects measures, model with your own numbers if needed
- Can use 99.7% within ±3 std. dev. (a total width of 6 sigma) to verify a reasonable std. dev.
- Standardized normal curve (meaning the data has been translated to z-scores) always looks exactly the same (height and spread)
- Center is always z = 0; stand. dev. = 1 (always)
- Normal curve is denser at center (greater area under the curve for 1 increment)
- When using actual data (not z-scores): small standard deviation, shorter range, so taller
- Large standard deviation, wider range, so flatter
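A small sketch of why standardized data always has center 0 and standard deviation 1. The two data sets are hypothetical: one with a small spread and one with a large spread. The helper name `z_scores` is an assumption for illustration only.

```python
import statistics

def z_scores(values):
    """Translate raw values to z-scores using the sample mean and sample std. dev."""
    m = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    return [(x - m) / s for x in values]

narrow = [48, 49, 50, 51, 52]   # hypothetical data, small standard deviation
wide = [30, 40, 50, 60, 70]     # hypothetical data, large standard deviation

z_narrow = z_scores(narrow)
z_wide = z_scores(wide)

# After standardizing, both sets have mean 0 and standard deviation 1,
# and here they even produce the same z-scores.
print([round(z, 2) for z in z_narrow])
print([round(z, 2) for z in z_wide])
```

The raw data sets differ in spread, but once translated to z-scores they are indistinguishable, which is the sense in which the standardized curve always looks the same.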
6 Warm Up after chapter 2 Name_______________
- When deciding if a variable is quantitative or qualitative, the most important criterion to consider is _____________________________.
- When describing a distribution, remember to discuss ______________________.
- Describing the ____________ can be very meaningful for a histogram but may not be for a bar graph of _____________ data.
- A study conducted on youth obesity collected data via a pedometer attached to each child. Researchers found some of the more obese children had the highest activity levels. How do you think this could be explained?
- In algebra we use x to represent ____________, which has ____________ value(s). In stats we use X to represent ______________, which has _____________ value(s).
- In algebra x x __ ____________ in stats X X ____ _____________
- Game of Greed. Roll a pair of dice. All stand. You can take the sum or remain standing and add the sum of the next roll. However, when the sum of 2 is rolled the round ends, and you score 0. Play 5 rounds and collect data.
8 Chapter 3: Descriptive Analysis and Presentation of Bivariate Data
- To be able to present bivariate data in tabular
and graphic form
- To become familiar with the ideas of descriptive
presentation
- To gain an understanding of the distinction
between the basic purposes of correlation
analysis and regression analysis
9 Sept. 8, Tues.: p. 140, 3.11, 12, 15, 17, 18, 22
Wed.: Read p. 150; work p. 152, 3.35, 36, 37, and 39
Thurs.: Section 3.3; work p. 165, 56-62
Fri.: Quiz C3; hwk: chapter exercises 3.70, 3.76, and 3.86
10 Vocabulary
- Bivariate data
- Sample statistics
- 3 data type combinations for bivariate data
- Contingency table
- Cross-tabulation
- Scatter plot: 1st step to see if a linear relationship exists between 2 quantitative variables; the predicted value is on the y-axis
- Independent variable
- Dependent variable
- Ordered pair
- Input variable
- Output variable
- Least squares regression
- Line of best fit
- Linear correlation
- Linear regression
- Residuals plot
- Correlation analysis
- Positive correlation
- Negative correlation
11 3.1 Bivariate Data
- Bivariate Data: Consists of the values of two different response variables that are obtained from the same population of interest.
- Three combinations of variable types
- Both variables are qualitative (_________ or __________), arranged on a
- a. ____________ or ____________ table (These statistics may also be displayed in a side-by-side bar graph)
- b. Example: A survey was conducted to investigate the relationship between preferences for television, radio, or newspaper for national news, and gender. The results are given in the table below
- This table may be extended to display the _____ totals (or _______). The total of the marginal totals is the grand total
- Note: Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.
- One variable qualitative (_______) and other quantitative (________)
- Quantitative values viewed as separate samples
- Each set identified by levels of the qualitative variable
- Statistics for comparison: measures of central tendency, measures of variation, 5-number summary
- Each is described using techniques from C2; results displayed side by side for easy comparison (dot plots or box plots)
- Both variables are quantitative (both numerical)
- a. Expressed as _____________ (__,__)
- b. x: _________ variable, __________ variable; y: _________ variable, ___________ variable
- c. Present data pictorially on a _________ diagram
12 Illustration
- These sample statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph
[Side-by-side bar graph; categories: TV, Radio, NP]
Percentages Based on Row (Column) Totals
- The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row's (column's) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals
Note: These statistics may also be displayed in a side-by-side bar graph
13 Example
- Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded. The data is given in the table below
- The part of the country is a qualitative variable
with three levels of response. The electric bill
is a quantitative variable. The electric bills
may be compared with numerical and graphical
techniques.
[Dot plots of June electric bills by region (Northeast, Midwest, West) on a common scale from 24.0 to 64.0]
- The electric bills in the Northeast tend to be
more spread out than those in the Midwest. The
bills in the West tend to be higher than both
those in the Northeast and Midwest.
14 Two Quantitative Variables
Scatter Diagram: The first tool used in determining whether a _______ __________ exists between 2 _______ _____________. Decide which variable is to be _________. This variable will be the __________ variable. A plot of all the ____________ of _________ data on a coordinate axis system. The input variable __ is plotted on the __________ axis, and the output variable __ is plotted on the _________ axis.
Note: Use scales so that the range of the y-values is equal to or slightly less than the range of the x-values. This creates a window that is approximately square.
- Example: In a study involving children's fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below. Construct a scatter diagram for this data.
age: input variable; CMFS: output variable
How to construct a scatter diagram:
1. Find the range of x values and the range of y values.
2. Then choose your increments for the x-axis and y-axis (they're not always the same).
3. Plot the points: ordered pairs (x, y).
4. Label both axes and give a title to the diagram.
Examples:
1. The number of hours studied for an exam versus the grade on the exam
2. The number of years a runner has been training versus his/her time for running a mile
3. The weight of a car versus its gas mileage
4. Age and price of a car
Determine whether the dependent variable in each example will increase or decrease as the independent variable increases. What examples can you come up with on your own?
15 Take a look.
- AGAINST ALL ODDS: Inside Statistics has three videos presenting the concepts of correlation and regression analysis for bivariate data. Program 9, "Correlation," reinforces the concepts behind correlation with several excellent examples. Program 8, "Describing Relationships," and the first 15 minutes of Program 7, "Models for Growth," give additional insight into regression analysis plus excellent examples.
- The Student Suite CD contains three video clips: Bivariate Data, Linear Correlation, and Linear Regression.
- Paper helicopter link: http://courses.ncssm.edu/math/Stat_inst01/intro.htm
16 3.2 Linear Correlation
- Linear Correlation: Measures the strength of a linear relationship between 2 variables
- As x increases, no definite shift in y: no correlation
- As x increases, a definite shift in y: correlation
- __________ correlation: x increases, y increases
- ___________ correlation: x increases, y decreases
- If the ordered pairs follow a straight-line path: __________ correlation
- If the points are patterned close to the line: _________
- If the points are spread out, yet still look linear: _____________
- Perfect positive correlation: all the points lie along a line with positive slope
- Perfect negative correlation: all the points lie along a line with negative slope
- If the points lie along a _________ or _________ line: no correlation
- If the points exhibit some other nonlinear pattern: no linear relationship, no correlation
- Need some way to measure correlation
17 Measures of Correlation
Coefficient of Linear Correlation: r, measures the strength of the linear relationship between two variables
- Notes
- r = +1: perfect positive correlation
- r = -1: perfect negative correlation
Alternate Formula for r
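A sketch of two equivalent ways to compute r, assuming the standard textbook definitions: the definition form r = Σ(x - x̄)(y - ȳ) / ((n - 1)·s_x·s_y) and the sum-of-squares form r = SS(xy) / sqrt(SS(x)·SS(y)). The function names and the data are hypothetical and are not taken from the slide.

```python
import math

def ss(xs, ys):
    """Sum of squares SS(xy) = sum(xy) - sum(x)*sum(y)/n; ss(xs, xs) gives SS(x)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def r_sum_of_squares(xs, ys):
    """r = SS(xy) / sqrt(SS(x) * SS(y))."""
    return ss(xs, ys) / math.sqrt(ss(xs, xs) * ss(ys, ys))

def r_definition(xs, ys):
    """r = sum((x - xbar)(y - ybar)) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    numerator = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return numerator / ((n - 1) * s_x * s_y)

x = [1, 2, 3, 4, 5]   # hypothetical input values
y = [2, 4, 5, 4, 5]   # hypothetical output values

print(round(r_sum_of_squares(x, y), 2))
print(round(r_definition(x, y), 2))
```

Both forms give the same value; the sum-of-squares form only needs the running totals Σx, Σy, Σx², Σy², and Σxy, which is why it is convenient for hand or calculator work.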
18 Example
- Example: The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient.
19 Please Note
- r is usually rounded to the nearest hundredth
- r close to 0: little or no linear correlation
- As the magnitude of r increases, towards -1 or +1, there is an increasingly stronger linear correlation between the two variables
- Method of estimating r based on the scatter diagram. Window should be approximately square. Useful for checking calculations.
- Can you make any predictions about circumstances based on the various outputs? If age vs. price has r = -0.9, then ___
- r only measures the strength of a linear relationship, and a cause-and-effect relationship cannot be concluded
- Before answering any questions concerning data in contingency tables, add all of the rows and columns. Be sure the sum of the row totals = the sum of the column totals = the grand total. Now you are ready to answer all questions easily.
- Explanatory variable: independent, x
- Response variable: dependent, y (what you're predicting)
20 3.2 Linear Correlation: Understanding the Linear Correlation Coefficient
- Estimating r, the linear correlation coefficient
- 1. Draw as small a rectangle as possible that encompasses all of the data on the scatter diagram. (Diagram should cover a "square window": same length and width)
- 2. Measure the width.
- 3. Let k = the number of times the width fits along the length, or in other words, length/width.
- 4. r ≈ ±(1 - 1/k)
- 5. Use +, if the rectangle is slanted positively or upward. Use -, if the rectangle is slanted negatively or downward.
- If there is a strong linear correlation between two variables, then one of the following situations may be true about the relationship between the two variables.
- There is a direct cause-and-effect relationship between the two.
- There is a reverse cause-and-effect relationship.
- Their relationship may be caused by a third variable (called a ______ ________).
- Their relationship may be caused by the interactions of several other variables.
- The apparent relationship may be strictly a coincidence.
- Remember that a strong correlation does not necessarily imply causation.
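The rectangle method for estimating r described above can be sketched as a one-line calculation. The function name `estimate_r` and its parameters are hypothetical; the measurements would come from a ruler laid on a scatter diagram drawn in a square window.

```python
def estimate_r(length, width, slanted_up=True):
    """Rough estimate of r from the smallest rectangle enclosing the data.

    k = length / width, and r is approximately +/-(1 - 1/k).
    Assumes length is the longer side, so k >= 1.
    """
    if width <= 0 or length < width:
        raise ValueError("expected 0 < width <= length")
    k = length / width
    magnitude = 1 - 1 / k
    return magnitude if slanted_up else -magnitude

# Hypothetical measurements: the width fits 4 times along the length.
print(estimate_r(10, 2.5))                     # rectangle slanted upward
print(estimate_r(10, 2.5, slanted_up=False))   # rectangle slanted downward
```

A long thin rectangle (large k) gives an estimate near ±1, while a rectangle that is nearly square (k near 1) gives an estimate near 0. This is only a check on a calculated r, not a replacement for it.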
21 Text Problems
- P. 140: 3.11, 12, 15, 17, 18, 22
- Read p. 150: lurking variables and causation
- P. 152: 3.35, 36, 37
22 Key Homework Problem Solutions for Small Group Discussion and Correction
- Problem 1
- Problem 2
- Problem 3
- Problem 4
- Write your personal reflections about the
statistics you learned from doing the problem to
turn in for your warm up activity.
23 3.3 Linear Regression
- If a linear relationship exists between two variables, that is,
- 1. its scatter diagram suggests a __________ _______________
- 2. its calculated ____ value is not near _________
- Linear regression will calculate an __________ of a ________ based on the data. This line, also known as the ____________ of ___________ _________, will fit through the data with the smallest possible amount of ________ between it and the actual data points. The regression line can be used for generalizing and _________ over the sampled ________ of x.
- STATISTICAL FORM OF A LINEAR REGRESSION LINE: ŷ = b0 + b1x
- where _____________ (format differs from algebra)
- _____________
- _____________
- x ________________
- Regression analysis finds the linear equation that best describes the relationship between 2 variables.
- Least squares criterion: Find the constants b0 and b1 such that the sum Σ(y - ŷ)² is as small as possible
- Some examples of various possible relationships: What would a scatter diagram look like to suggest each relationship?
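A minimal sketch of the least squares criterion, assuming the usual closed-form solution b1 = SS(xy)/SS(x) and b0 = ȳ - b1·x̄. The data are hypothetical and the function names are for illustration only.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1*x that minimizes sum((y - y-hat)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    ss_x = sum((x - xbar) ** 2 for x in xs)
    b1 = ss_xy / ss_x          # slope
    b0 = ybar - b1 * xbar      # y-intercept
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (observed y - predicted y)^2 for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

x = [1, 2, 3, 4, 5]   # hypothetical input values
y = [2, 4, 5, 4, 5]   # hypothetical output values

b0, b1 = least_squares_line(x, y)
best = sum_squared_residuals(x, y, b0, b1)
print(round(b0, 2), round(b1, 2), round(best, 2))

# Any other line gives a larger sum of squared residuals.
print(sum_squared_residuals(x, y, b0 + 0.5, b1) > best)
print(sum_squared_residuals(x, y, b0, b1 + 0.1) > best)
```

Nudging either constant away from the fitted values increases the sum, which is what "as small as possible" means in the criterion.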
24 Illustration
- Observed and predicted values of y
25 Example
- Example: A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals
1) Draw a scatter diagram for this data
2) Find the equation of the line of best fit
26 Line of Best Fit
27 Scatter Diagram
28 Sept. 29, Mon.: C3 work day; p. 174 practice test
Tues.: AP practice problems; Quiz
Wed.: Start test
Thurs.: Finish C3 Test
Fri.: Activity: Design of Experiments/Studies
Oct. 6, Mon.: AP practice problems
29 Please Note
- Keep at least three extra decimal places while doing the calculations to ensure an accurate answer
- When rounding off the calculated values of b0 and b1, always keep at least two significant digits in the final answer
- The slope b1 represents the predicted change in y per unit increase in x
- The y-intercept is the value of y where the line of best fit intersects the y-axis
- One of the main purposes for obtaining a regression equation is for making predictions
- For a given value of x, we can predict a value of ŷ.
- The regression equation should be used to make predictions only about the population from which the sample was drawn
- The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval.
- Use current data. A sample taken in 1987 should not be used to make predictions in 1999.
- Work p. 165: 3.57, 58, 59, 60, and 61
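The caution about predicting only within the sample domain can be sketched as a prediction helper that also reports whether x falls inside the sampled interval. The function name, the coefficients, and the domain limits are all hypothetical.

```python
def predict(b0, b1, x, x_min, x_max):
    """Return (y_hat, inside_domain) for the line y-hat = b0 + b1*x.

    inside_domain is False when x lies outside the sampled interval
    [x_min, x_max], which signals that the prediction is an extrapolation.
    """
    y_hat = b0 + b1 * x
    inside_domain = x_min <= x <= x_max
    return y_hat, inside_domain

# Hypothetical line y-hat = 2.2 + 0.6x fitted on x values from 1 to 5.
for x in (3, 5.5, 20):
    y_hat, ok = predict(2.2, 0.6, x, x_min=1, x_max=5)
    note = "within sample domain" if ok else "extrapolation: use caution"
    print(x, round(y_hat, 2), note)
```

The slope 0.6 here reads as "a predicted increase of 0.6 in y per unit increase in x", matching the interpretation of b1 above.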
30 Chapter Practice
31 Supplemental Material
34 List Serv Comments
- 1) What is plotted on a residuals plot? Is it the x-variable versus the residuals? If so, then in the BVD book, on p. 149, they plot "predicted vs. residuals." Is this another way to do it?
- Either way is fine. Typically in AP Stat it is x-variable v. residuals, though there was an AP Exam question once that did it the other way. Predicted v. residuals is needed when doing multiple regression.
- 2) For TI-84 users, what is the difference between 4: LinReg(ax+b) versus 8: LinReg(a+bx), which is the one we use for stats? Both are found in the STAT CALC menu.
- They're the same, but a+bx is more commonly used for stats.
- 3) Any good, short and sweet ways to explain regression to the mean?
- One thing you can do is take a scatterplot of something like 'Son's Height v. Father's Height' and cut it into several vertical bands. Put a big X at about the mean y-value for each vertical band. The regression line roughly passes through those bands, and the line will be less steep than the line y = x. One way to explain the reason for this is to see that the line y = x is approximately a symmetry line for the elliptical cloud of points. A segment perpendicular to y = x with endpoints on the 'boundary' of the cloud is approximately centered on the line y = x. But we are predicting vertically, not perpendicularly. Vertical segments with endpoints on the boundary of the cloud will be centered not on y = x, but on a line less steep. That's hard to explain without a picture, but maybe it makes some sense? I hope?
- 4) Why can we assume that the least squares regression line must go through the point (x-bar, y-bar)? Why can we assume that the mean value of the x-variable must necessarily correspond to the mean value of the y-variable?
- It's not that we assume this, it just happens to fall out of the algebra when you minimize the sum of squared residuals. David Bee could probably reproduce that algebra easily. It would take me a while. And I'd probably have to look it up anyway!
35 Lisa and Others
- Lisa asks:
- 3) Any good, short and sweet ways to explain regression to the mean?
- 4) Why can we assume that the least-squares regression line must go through the point (x-bar, y-bar)? Why can we assume that the mean value of the x-variable must necessarily correspond to the mean value of the y-variable?
- Re: L3. Here's a fairly short explanation that will probably seem long because it's being written out. We know one way of writing the equation of the regression line is (y - ybar)/sy = r(x - xbar)/sx, or simply z_y = r z_x. Consider 0 < r < 1 and consider an x higher than its mean xbar. Thus, z_x > 0, and so is r z_x, which means the corresponding y value is higher than its mean ybar. But since r < 1, it follows that z_y < z_x, and so the predicted y value is closer to ybar than x is to xbar. So, for an observation with a high x value, we predict y to be high, but, relative to the standard deviations, not as high as x, and so the predictions all appear to regress toward the mean.
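A numeric sketch of the z_y = r z_x argument above. The means, standard deviations, and r are hypothetical values loosely styled on the father/son height example; the function names are for illustration only.

```python
def predicted_z_y(r, z_x):
    """Regression line in standardized form: z_y = r * z_x."""
    return r * z_x

def predicted_y(r, x, xbar, s_x, ybar, s_y):
    """Same line in original units: (y - ybar)/s_y = r * (x - xbar)/s_x."""
    z_x = (x - xbar) / s_x
    return ybar + predicted_z_y(r, z_x) * s_y

# Hypothetical values: fathers and sons both average 68 in. with std. dev. 3 in., r = 0.5.
r = 0.5
father = 74                     # 2 standard deviations above the mean
z_father = (father - 68) / 3
z_son = predicted_z_y(r, z_father)
son = predicted_y(r, father, xbar=68, s_x=3, ybar=68, s_y=3)

print(z_father, z_son, son)
```

The father is 2 standard deviations above average, but the predicted son is only 1 standard deviation above average: still tall, but closer to the mean.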
36
- Re: L4. At the AP Stat level, one justification would be as follows. Prerequisite: the point-slope equation of a line. Consider an equation of the form y = a + bx, where a and b are to be determined by the method of least squares, but the calculations are all automated now. The process involves least squares leading to normal equations, but consider all n x_i values and their corresponding y_i values. Thus, we would have
y_i = a + b x_i
Summing both sides of this equation gives
SUM y_i = an + b SUM x_i
Dividing by n gives
ybar = a + b xbar (1)
Since the equation of the line is
y = a + bx (2)
subtracting (1) from (2) gives
y - ybar = b(x - xbar)
which shows the line passes through the point (x-bar, y-bar). Note: This is just a justification and not a proof. I think someone in the Forum in the past gave a good non-calculus-using proof, but I don't recall it. HTH -- David Bee
PS: Note we didn't determine the value of b. However, if interested, since we have SUM y_i = an + b SUM x_i, we could get the second normal equation by multiplying y_i = a + b x_i through by x_i and summing, giving SUM x_i y_i = a SUM x_i + b SUM (x_i)^2. Since we now have two equations in a and b, they could be solved for a and b, which CALC Choice 8 LinReg(a+bx) in effect does for us.
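The two normal equations above can be solved directly. The sketch below uses Cramer's rule on the 2-by-2 system and then checks that the resulting line passes through (x-bar, y-bar). The data and the function name are hypothetical, and this is only one way to solve the system, not a description of what the calculator does internally.

```python
def solve_normal_equations(xs, ys):
    """Solve for a and b in:
         SUM y  = a*n     + b*SUM x
         SUM xy = a*SUM x + b*SUM x^2
    using Cramer's rule.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    det = n * sum_xx - sum_x * sum_x          # determinant of the coefficient matrix
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

a, b = solve_normal_equations(x, y)
xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

print(round(a, 2), round(b, 2))
# The fitted line evaluated at x-bar returns y-bar (up to rounding error).
print(abs((a + b * xbar) - ybar) < 1e-9)
```

The final check is equation (1) above: once a and b satisfy the first normal equation, ybar = a + b xbar follows by dividing through by n.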
37
- I am writing to see if there is a quicker way to find the standard deviation of the residuals. I know the formula, but the only way I currently see how to do it is to 1) list data, 2) find linreg, 3) place predicted data on a list, 4) find resid on the next list, 5) find the square of the resid list on the next list, 6) find the sum of squared resid, then quickly divide by n-2 and take the square root. Is there another way to do it on a TI-84?
- Thanks in advance for your help.
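Away from the calculator, the steps in the question can be sketched in a few lines, along with an algebraic shortcut: for a least squares line, the sum of squared residuals equals SS(y) - b1·SS(xy), so the residual list never has to be built. The data and function name are hypothetical; this is a sketch of the formula, not of any TI-84 procedure.

```python
import math

def residual_std_dev(xs, ys):
    """Return the standard deviation of the residuals, sqrt(SSE / (n - 2)),
    computed the long way and by the shortcut SSE = SS(y) - b1*SS(xy)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    ss_x = sum((x - xbar) ** 2 for x in xs)
    ss_y = sum((y - ybar) ** 2 for y in ys)
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = ybar - b1 * xbar

    # Long way: predicted values, residuals, squares, sum.
    sse_long = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Shortcut: no residual list needed.
    sse_short = ss_y - b1 * ss_xy

    return math.sqrt(sse_long / (n - 2)), math.sqrt(sse_short / (n - 2))

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

s_long, s_short = residual_std_dev(x, y)
print(round(s_long, 3), round(s_short, 3))
```

Both routes divide by n - 2 because two constants, b0 and b1, were estimated from the data.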
38 Phone Book Activity
- Chapter 3 Descriptive Analysis of Bivariate Data: phone books
- Teams are used for this project. Each team is
given a local phone book. Each team determines a
question of interest for which the answer is a
proportion and then determines a sampling method
for answering it. Possible questions of interest
include the proportion who list a name and no
address, the proportion who use initials only,
the proportion of last names that end in son,
or the proportion of last names for which only
one household has that last name. Different
sampling methods are appropriate for different
questions. Sample size calculation must be part
of the design. Each team reports its results to
the class. Often, the same question is asked by
more than one team, so the issue of variability
in results and the relationship to margin of
error can be discussed.