Title: Ch 2 and 9.1 Relationships Between 2 Variables
1Ch 2 and 9.1 Relationships Between 2 Variables
- More than one variable can be measured on each
individual.
- Examples
- Gender and Height
- Size and Cost
- Eye color and Major
- We want to look at the relationship among these
variables.
- Is there an association between these two
variables?
- Two variables measured on the same individuals
are associated if some values of one variable tend
to occur more often with some values of the second
variable than with other values of that variable.
2Relationships Between 2 Variables
- If we expect one variable to influence another,
we call it the ___________ variable.
- Explains or influences changes in the response
variable
- The variable that is influenced is called the
____________ variable.
- Measures an outcome of a study
- In each of the following examples, identify the
explanatory and response variables
- Gender and blood pressure
- Class attendance and course grade
- Number of beers and BAC
3Relationships Between 2 Variables
- We may be interested in relationships of
different types of variables.
- Categorical and Numeric
- Categorical and Categorical
- Numeric and Numeric
4Relationships between Categorical and Numeric
Variables
- We are interested in comparing the numerical
variable across each of the levels of the
categorical variable.
- Examples
- Compare high speeds for 4 different car brands
- Compare sucrose levels for 5 different types of
fruit
- Compare GPR for 20 different majors
5Relationships between Categorical and Numeric
Variables
- Graphical Comparison
- Example: Sucrose levels of fruits (fictitious
data)
6Relationships between Categorical and Numeric
Variables
- Numerical Comparison
- We could also look at summary statistics for each
group.
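The numerical comparison could be sketched as follows. This is a minimal Python example under stated assumptions: the fruit names and sucrose values are made up for illustration and are not the data shown on the slide.

```python
from statistics import mean, median, stdev

# Hypothetical sucrose measurements per fruit (made-up numbers,
# not the fictitious data plotted on the slide).
sucrose = {
    "apple": [2.1, 2.5, 1.9, 2.3],
    "banana": [6.2, 6.8, 5.9, 6.5],
    "orange": [4.0, 4.4, 3.8, 4.2],
}

# One set of summary statistics per level of the categorical variable.
summary = {
    fruit: {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 divisor)
    }
    for fruit, values in sucrose.items()
}

for fruit, s in summary.items():
    print(f"{fruit:7s} n={s['n']} mean={s['mean']:.2f} "
          f"median={s['median']:.2f} sd={s['sd']:.2f}")
```

Comparing the group means (or medians) side by side plays the same role as comparing the boxes in a side-by-side plot.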
7Ch 9.1 Relationships Between Two Categorical
Variables
- Depending on the situation, one of the variables
is the explanatory variable and the other is the
response variable.
- In this case, we look at the percentages of one
variable for each level of the other variable.
- Examples
- Gender and Soda Preference
- Country of Origin and Marital Status
- Smoking Habits and Socioeconomic Status
8Two-Way Tables
- Two-way tables come about when we are interested
in the relationship between two categorical
variables.
- One of the variables is the _____________.
- The other is the _______________.
- The combination of a row variable and a column
variable is a ______________.
9Two-Way Tables
10Relationships between two categorical variables
- Example: Gender and Highest Degree Obtained
- Joint Distribution: How likely are you to have a
bachelor's degree and be a male? _____________
- Marginal Distribution: What is the least likely
highest degree obtained? _____________
- Conditional Distribution: If you are a female,
how likely are you to have obtained a graduate
degree? ______________
11Relationships between two categorical variables
- Shows the percentages for the joint, marginal,
and conditional distributions.
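The three kinds of distribution can be computed from a two-way table of counts. The sketch below assumes a hypothetical table (the counts are invented for illustration and are not the table from the slides); the names `joint`, `marginal_degree`, and `conditional_given_gender` are likewise just illustrative.

```python
# Hypothetical two-way table: rows = gender, columns = highest degree.
counts = {
    "male":   {"high school": 30, "bachelors": 50, "graduate": 20},
    "female": {"high school": 25, "bachelors": 45, "graduate": 30},
}
degrees = ["high school", "bachelors", "graduate"]
total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {(g, d): c / total for g, row in counts.items() for d, c in row.items()}

# Marginal distribution of degree: each column total divided by the grand total.
marginal_degree = {d: sum(counts[g][d] for g in counts) / total for d in degrees}

# Conditional distribution of degree given gender: each cell divided by its row total.
def conditional_given_gender(gender):
    row = counts[gender]
    row_total = sum(row.values())
    return {d: c / row_total for d, c in row.items()}

print("P(male and bachelors)  =", joint[("male", "bachelors")])
print("Least likely degree    =", min(marginal_degree, key=marginal_degree.get))
print("P(graduate | female)   =", conditional_given_gender("female")["graduate"])
```

The only difference among the three is the denominator: the grand total for joint and marginal percentages, and a single row (or column) total for conditional percentages.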
12Ch 2 Relationships Between 2 Numeric Variables
- Depending on the situation, one of the variables
is the explanatory variable and the other is the
response variable.
- There is not always an explanatory-response
relationship.
- Examples
- Height and Weight
- Income and Age
- SAT scores on math exam and on verbal exam
- Amount of time spent studying for an exam and
exam score
13Relationships between 2 numeric variables
- Scatterplots
- Look for overall pattern and any striking
deviations from that pattern.
- Look for outliers, values falling outside the
overall pattern of the relationship
- You can describe the overall pattern of a
scatterplot by the form, direction, and strength
of the relationship.
- Form: linear or clusters
- Direction
- Two variables are _____________________ when
above-average values of one tend to accompany
above-average values of the other and likewise
below-average values also tend to occur together.
- Two variables are _____________________ when
above-average values of one variable tend to accompany
below-average values of the other variable, and
vice-versa.
- Strength: how close the points lie to a line
14Relationships between 2 numeric variables
- Example
- Response variable (y-axis): MPG
- Explanatory variable (x-axis): Weight
15Relationships between 2 numeric variables
- Relationships between two numeric variables
- Example
- Vehicle Weight
- Horsepower
16Relationships between 2 numeric variables
- ___________ or r measures the direction and
strength of the linear relationship between two
numeric variables
- General Properties
- It must be between -1 and 1, or (-1 ≤ r ≤ 1).
- If r is negative, the relationship is negative.
- If r = -1, there is a perfect negative linear
relationship (extreme case).
- If r is positive, the relationship is positive.
- If r = 1, there is a perfect positive linear
relationship (extreme case).
- If r is 0, there is no linear relationship.
- r measures the strength of the linear
relationship.
- If explanatory and response are switched, r
remains the same.
- r has no units of measurement associated with it
- Scale changes do not affect r
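Several of these properties can be checked numerically. The sketch below assumes hypothetical weight and MPG values (invented for illustration) and a hand-written helper `correlation` that averages the products of standardized values; it is one common way to compute r, not necessarily the formula used in the course.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values over n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: vehicle weight in pounds and fuel economy in MPG.
weight_lb = [2200, 2600, 3000, 3400, 3800, 4200]
mpg = [38, 33, 29, 26, 22, 19]

r = correlation(weight_lb, mpg)

# Switching explanatory and response leaves r unchanged.
r_swapped = correlation(mpg, weight_lb)

# Changing the scale (pounds to kilograms) leaves r unchanged.
weight_kg = [w * 0.4536 for w in weight_lb]
r_rescaled = correlation(weight_kg, mpg)

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, rescaled = {r_rescaled:.4f}")
```

Because each value is standardized before multiplying, the units cancel, which is why r carries no units and ignores scale changes.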
18Relationships between 2 numeric variables
- Examples of extreme cases
r = 1
r = 0
r = -1
19Relationships between 2 numeric variables
- Match each correlation to its scatterplot
r = 0.04
r = 0.43
r = -0.84
r = 0.76
r = 0.21
20Relationships between 2 numeric variables
- It is possible for there to be a strong
relationship between two variables and still have
r = 0.
- EX.
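One standard illustration of this point, offered here as an assumption since the slide's own example did not survive in the transcript, is a perfect quadratic relationship on x values symmetric about zero. The helper `correlation` is a hand-written sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values over n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# y is completely determined by x (y = x^2), yet the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = correlation(xs, ys)
print(f"r = {r:.6f}")  # zero up to floating-point rounding
```

The positive and negative halves of the curve cancel exactly, so r reports no linear relationship even though y is perfectly predictable from x.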
21Relationships between 2 numeric variables
- Important notes
- Association does not imply causation
- Correlation does not imply causation
- Slope is not correlation
- A scale change does not change the correlation.
- Correlation does not measure the strength of a
non-linear relationship
22Regression Line
- A regression line is a straight line that
describes how a response variable y changes as an
explanatory variable x changes.
- A regression line summarizes the relationship
between two variables, but only in a specific
setting when one of the variables helps explain
or predict the other.
- We often use a regression line to predict the
value of y for a given value of x.
- Regression, unlike correlation, requires that we
have an explanatory variable and a response
variable
23Regression Line
- Fitting a line to data means drawing a line that
comes as close as possible to the points.
- Extrapolation: the use of a regression line for
prediction far outside the range of values of the
explanatory variable x that you used to obtain
the line.
- Such predictions are often not accurate.
24Least-Squares Regression Line
- The least-squares regression line of y on x is
the line that makes the sum of squares of the
vertical distances of the data points from the
line as small as possible.
- These vertical distances are called the
residuals, or the error in prediction, because
they measure how far the point is from the line
- residual = y - ŷ, where y is the point and
ŷ is the predicted point.
25Least-Squares Regression Line
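- The equation of the least-squares regression line
of y on x is
- ŷ = b0 + b1 x
- with slope b1 = r (sy / sx) and intercept
b0 = ȳ - b1 x̄, where x̄ and ȳ are the means and
sx and sy the standard deviations of x and y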
26Least-Squares Regression Line
- The expression for slope, b1, says that along the
regression line, a change of one standard
deviation in x corresponds to a change of r
standard deviations in y.
- The slope, b1, is the amount by which y changes
when x increases by one unit.
- The intercept, b0, is the value of y when
x = 0.
- The least-squares regression line ALWAYS passes
through the point (x̄, ȳ).
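The slope and intercept can be computed from r, the two standard deviations, and the two means. The sketch below assumes a small made-up data set and hand-written helpers (`correlation`, `least_squares_line`); it uses the standard formulas b1 = r (sy / sx) and b0 = ȳ - b1 x̄.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values over n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope
    b0 = mean(ys) - b1 * mean(xs)                     # intercept
    return b0, b1

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # observed minus predicted

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print("prediction at x-bar:", b0 + b1 * mean(xs), " y-bar:", mean(ys))
print("sum of residuals:", round(sum(residuals), 10))
```

For this data the fitted line is ŷ = 2.2 + 0.6x; the prediction at x̄ equals ȳ, and the residuals sum to zero, as they do for any least-squares line with an intercept.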
27r^2 in Regression
- The square of the correlation, r^2, is the
fraction of the variation in the values of y that
is explained by the least-squares regression of y
on x.
- Use r^2 as a measure of how successfully the
regression explains the response.
- Interpret r^2 as the percent of variation
explained
- For Simple Linear Regression, r^2 is simply the
square of the correlation coefficient.
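The two readings of r^2 (fraction of variation explained, and the square of r) can be checked against each other. The sketch below assumes a small made-up data set; the variable names (`sst`, `sse`, `r_squared`) are illustrative.

```python
from math import sqrt

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares fit and correlation.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
r = sxy / sqrt(sxx * syy)

# Fraction of variation in y explained by the regression.
sst = syy                                                   # total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) # unexplained variation
r_squared = 1 - sse / sst

print(f"r = {r:.4f}, r^2 from r = {r * r:.4f}, r^2 from variation = {r_squared:.4f}")
```

Here 60% of the variation in y is explained by the line, and squaring r gives the same number.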
28Relationships between 2 numeric variables
- Example
- How much of the variation is explained
by the least squares line of y on x? ______
- What is the correlation coefficient? ______
Horsepower = -10.78 + 0.04 × weight (Equation of
the line.)
__________ y-value or response (horsepower) when
line crosses the y-axis.
_______ increase in response for a unit increase
in explanatory variable.
So if weight increases by one pound, horsepower
increases by 0.04 units (on average).
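The fitted line can be used for prediction as sketched below. The coefficients come from the equation on this slide; the function name `predict_horsepower` and the example weights are assumptions made for illustration.

```python
# Coefficients from the slide's fitted line: Horsepower = -10.78 + 0.04 * weight.
INTERCEPT = -10.78
SLOPE = 0.04  # change in predicted horsepower per additional pound

def predict_horsepower(weight_lb):
    """Predicted horsepower for a vehicle of the given weight in pounds."""
    return INTERCEPT + SLOPE * weight_lb

# Hypothetical vehicle weights, chosen only for illustration.
for weight in (2500, 3000, 3500):
    print(f"weight {weight} lb -> predicted horsepower {predict_horsepower(weight):.2f}")

# A one-pound increase changes the prediction by the slope.
print(round(predict_horsepower(3001) - predict_horsepower(3000), 2))  # 0.04

# Weight 0 is far outside any realistic data range, so this is extrapolation:
# the negative "prediction" is not meaningful.
print(predict_horsepower(0))  # -10.78
```

The last line shows why the intercept often has no practical interpretation: it is the predicted response at x = 0, which may lie well outside the observed data.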
29Relationships between 2 variables
- Lurking Variable: A variable that is not among
the explanatory or response variables in a study
and yet may influence the interpretation of
relationships among those variables.
- Simpson's Paradox: An association or comparison
that holds for all of several groups can reverse
direction when the data are combined to form a
single group. This reversal is called Simpson's
Paradox. This can happen when a lurking variable
is present. Please see Examples 9.9 and 9.10 in
the text.
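The reversal can be reproduced with a small table of counts. The numbers below are illustrative only and are not taken from Examples 9.9 and 9.10 in the text; severity plays the role of the lurking variable, and the treatment and group labels are hypothetical.

```python
# Illustrative counts: (successes, patients) for each treatment within each group.
data = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}
treatments = ("A", "B")

# Success rate of each treatment within each severity group.
within = {
    group: {t: data[group][t][0] / data[group][t][1] for t in treatments}
    for group in data
}

# Success rate of each treatment after combining the groups.
combined = {}
for t in treatments:
    successes = sum(data[group][t][0] for group in data)
    patients = sum(data[group][t][1] for group in data)
    combined[t] = successes / patients

for group, rates in within.items():
    print(f"{group:7s} A={rates['A']:.3f}  B={rates['B']:.3f}")
print(f"combined A={combined['A']:.3f}  B={combined['B']:.3f}")
```

Treatment A has the higher success rate in both groups, yet B looks better once the groups are combined, because A was given mostly to severe cases and B mostly to mild ones.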
30Outliers and Influential Observations in
Regression
- An outlier is an observation that lies outside
the overall pattern of the other observations.
- An observation is influential for a statistical
calculation if removing it would markedly change
the result of the calculation.
- Points that are outliers in the x direction of a
scatterplot are often influential for the
least-squares regression line.
31Outliers and Influential Observations in
Regression
Child 18 is an outlier in the x direction.
Because of its extreme position on the age scale,
this point has a strong influence on the position
of the regression line.
r^2 is also affected by the influential
observation. With Child 18, r^2 = 41%, but without
Child 18, r^2 = 11%. The apparent strength of the
association was largely due to a single
influential observation.
The dashed line was calculated leaving out Child
18. The solid line is with Child 18.
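The effect of a single influential point can be seen by fitting the line twice, with and without it. The data below are made up for illustration and are not the data behind the Child 18 example; the helper `fit` is a hand-written sketch of the least-squares formulas.

```python
def fit(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical data: eight ordinary points with a weak trend...
xs = [9, 10, 11, 12, 13, 14, 15, 16]
ys = [100, 92, 104, 95, 101, 93, 99, 96]

# ...plus one hypothetical outlier in the x direction.
outlier_x, outlier_y = 40, 60

_, slope_without, r2_without = fit(xs, ys)
_, slope_with, r2_with = fit(xs + [outlier_x], ys + [outlier_y])

print(f"without outlier: slope = {slope_without:.3f}, r^2 = {r2_without:.3f}")
print(f"with outlier:    slope = {slope_with:.3f}, r^2 = {r2_with:.3f}")
```

In this made-up data the single far-right point pulls the line toward itself and accounts for nearly all of the apparent association, which is the same pattern described for Child 18.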