Title: Looking at real data: Relationship
1Looking at real data: Relationship
- Lecture Statistics of time series
2Introduction
- A medical study finds that short women are more likely to have heart attacks than women of average height, while tall women have the fewest heart attacks.
- An insurance group reports that heavier cars have fewer accident deaths per 10,000 vehicles registered than do lighter cars.
- These and many other statistical studies look at the relationship between two variables. To understand such a relationship, we must often examine other variables as well. To conclude that shorter women have a higher risk of heart attack, for example, the researchers had to eliminate the effect of other variables such as weight and exercise habits.
- We are also interested in relationships between variables. One of our main themes is that the relationship between two variables can be strongly influenced by other variables that are lurking in the background.
3Introduction
- To study the relationship between two variables, we measure both variables on the same individuals. If we measure both the height and the weight of each of a large group of people, we know which height goes with each weight. These data allow us to study the connection between height and weight. A list of the heights and a separate list of the weights, two sets of single-variable data, do not show the connection between the two variables.
- In fact, taller people also tend to be heavier. And people who smoke more cigarettes per day tend not to live as long as those who smoke fewer. We say that pairs of variables such as height and weight or smoking and life expectancy are associated.
4Association between Variables
- Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.
- Statistical associations are overall tendencies, not ironclad rules. They allow individual exceptions. Although smokers on average die earlier than nonsmokers, some people live to 90 while smoking three packs a day.
5Examining Relationship
- When you examine the relationship between two or more variables, first ask the preliminary questions:
- What individuals do the data describe?
- What variables are present? How are they measured?
- Which variables are quantitative and which are categorical?
6Association between Variables
- A medical study, for example, may record each subject's sex (male, female) and smoking status along with quantitative variables such as weight and blood pressure. We may be interested in possible associations between two quantitative variables (such as a person's weight and blood pressure), between a categorical and a quantitative variable (such as sex and blood pressure), or between two categorical variables (such as sex and smoking status).
- When you examine the relationship between two variables, a new question becomes important:
- Is your purpose simply to explore the nature of the relationship, or do you hope to show that one of the variables can explain variation in the other? That is, are some of the variables response variables and others explanatory variables?
7Response Variable, Explanatory Variable
- A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.
- Example: Alcohol has many effects on the body. One effect is a drop in body temperature. To study this effect, researchers give several different amounts of alcohol to mice, then measure the change in each mouse's body temperature in the 15 minutes after taking the alcohol. Amount of alcohol is the explanatory variable and change in body temperature is the response variable.
- In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. But not all explanatory-response relationships involve direct causation.
8Dependent and Independent Variables
- Some statistical techniques require us to distinguish explanatory from response variables; others make no use of this distinction. You will often see explanatory variables called independent variables and response variables called dependent variables. The idea behind this language is that response variables depend on explanatory variables.
- Most statistical studies examine data on more than one variable. Fortunately, statistical analysis of several-variable data builds on the tools used for examining individual variables.
9Scatterplots
- Relationships between two quantitative variables are best displayed graphically. The most useful graph for this purpose is a scatterplot.
- A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.
- Always plot the explanatory variable (if there is one) on the horizontal axis (the x axis) of a scatterplot.
- The explanatory variable: x
- The response variable: y
10Scatterplot
Country          Alcohol from wine   Heart disease deaths
Australia        2.5                 211
Austria          3.9                 167
Belgium          2.9                 131
Canada           2.4                 191
Denmark          2.9                 220
Finland          0.8                 297
France           9.1                 71
Iceland          0.8                 211
Ireland          0.7                 300
Italy            7.9                 107
Netherlands      1.8                 167
New Zealand      1.9                 266
Norway           0.8                 227
Spain            6.5                 86
Sweden           1.6                 207
Switzerland      5.8                 115
United Kingdom   1.3                 285
United States    1.2                 199
Germany          2.7                 172
11Scatterplot
12Positive Association, Negative Association
- Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.
- Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.
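As a rough illustration of these definitions, the sketch below takes the wine and heart disease figures from the table on the earlier slide and counts how many countries fall on the same side of both averages and how many fall on opposite sides. The function name `count_sides` is made up for this example.

```python
from statistics import mean

# Values copied from the wine table, one pair per country, in table order.
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

def count_sides(xs, ys):
    """Count points on the same side / on opposite sides of the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    same = sum(1 for p in products if p > 0)      # both above or both below average
    opposite = sum(1 for p in products if p < 0)  # one above, the other below
    return same, opposite

same, opposite = count_sides(wine, deaths)
print(same, opposite)
```

Most countries land on opposite sides of the two averages, which is the pattern the definition of negative association describes.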
13More examples of scatterplots
Yield (in bushels per acre) by plants per acre and year (- marks no data)
Plants per acre   1956    1958    1959    1960    Mean
12000             150.1   113.0   118.4   142.6   131.0
16000             166.9   120.7   135.2   149.8   143.2
20000             165.3   130.1   139.6   149.9   146.2
24000             -       134.7   138.4   156.1   143.1
28000             -       -       119.0   150.5   134.8
Mean              160.8   124.6   130.1   149.8
14More examples of scatterplots
15More examples of scatterplots
16Correlation
- We have data on variables x and y for n individuals. Think, for example, of measuring height and weight for n people. Then $x_1$ and $y_1$ are your height and weight, $x_2$ and $y_2$ are my height and weight, and so on. For the i-th individual, height $x_i$ goes with weight $y_i$.
17Correlation
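- The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.
- Suppose that we have data on variables x and y for n individuals. The means and the standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values.
- The correlation between x and y is
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$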
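The calculation can be sketched in a few lines of Python. This is a minimal version, assuming the usual textbook definition of r as the sum of products of standardized values divided by n - 1; the function name `correlation` is chosen for this example, and the data are the wine and heart disease figures from the earlier table.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Average product of standardized values, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Values copied from the wine table, one pair per country, in table order.
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

r = correlation(wine, deaths)
print(round(r, 2))
```

For these data r comes out at about -0.84, a fairly strong negative linear relationship.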
18Correlation
- The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above (below) average in height tend also to be above (below) average in weight. For such people the deviations from the mean in height and in weight have the same sign, so their products in the formula are positive. Using the formula for r, we can see in the same way that the correlation is negative when the association between x and y is negative.
19What you need to know in order to interpret correlation
- Correlation makes no use of the distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating the correlation.
- Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r.
- Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x and y.
- Positive r indicates positive association between the variables, and negative r indicates negative association.
- The correlation is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. Values close to -1 or 1 indicate that the points lie close to a straight line. The extreme values r = -1 and r = 1 occur only when the points in a scatterplot lie exactly along a straight line.
- Correlation measures the strength of only the linear relationship between two variables. It does not describe curved relationships between variables, no matter how strong they are.
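Several of these properties can be checked numerically. The sketch below uses made-up heights and weights for five people (illustrative values only, not from any study) and a helper `correlation` written for this example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up heights (cm) and weights (kg); illustrative only.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 75, 88]

r = correlation(height_cm, weight_kg)
r_swapped = correlation(weight_kg, height_cm)   # swap the roles of x and y
height_in = [h / 2.54 for h in height_cm]       # change the units of x to inches
r_inches = correlation(height_in, weight_kg)

# A perfect curved relationship, y = x**2, on x-values symmetric about 0.
xs = [-2, -1, 0, 1, 2]
r_curve = correlation(xs, [x ** 2 for x in xs])

print(r, r_swapped, r_inches, r_curve)
```

Swapping the variables or changing units leaves r unchanged up to rounding error, while the curved example gives r essentially equal to 0 even though y is completely determined by x.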
20Correlation
21Least-Squares Regression
- Correlation measures the direction and strength of the linear (straight-line) relationship between two quantitative variables. If a scatterplot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatterplot. A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. That is, regression describes a relationship between explanatory and response variables.
22Least-Squares Regression
- A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use the regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.
23Fitting a line to data
- When a scatterplot displays a linear pattern, we can describe the overall pattern by drawing a straight line through the points. Of course, no straight line passes exactly through all of the points. Fitting a line to data means drawing a line that comes as close as possible to the points. The equation of a line fitted to the data gives a compact description of the dependence of the response variable on the explanatory variable. It is a mathematical model for the straight-line relationship.
24Straight Line
- Suppose that y is a response variable and x is an explanatory variable. A straight line relating y to x has an equation of the form
$$y = a + bx$$
- In this equation, b is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x = 0.
25Prediction
- We can use a regression line to predict the response y for a specific value of the explanatory variable x.
- Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable that you used to obtain the line. Such predictions are often not accurate.
26Prediction
Age x (months) Height y (cm)
18 76.1
19 77.0
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
27Prediction
28Equation of the Least-Squares Regression Line
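- We have data on an explanatory variable x and a response variable y for n individuals. The means and standard deviations are $\bar{x}$ and $s_x$ for x, and $\bar{y}$ and $s_y$ for y.
- The correlation between x and y is r.
- The equation of the least-squares regression line of y on x is
$$\hat{y} = a + bx$$
with slope $b = r \, s_y / s_x$ and intercept $a = \bar{y} - b\bar{x}$.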
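As a worked sketch, the code below fits the least-squares line to the age and height table from the Prediction slide, using the standard formulas slope b = r s_y / s_x and intercept a = (mean of y) - b (mean of x). The names `least_squares` and `predict` are made up for this example.

```python
from statistics import mean, stdev

# Age (months) and height (cm) copied from the table on the Prediction slide.
age = list(range(18, 30))
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

a, b = least_squares(age, height)

def predict(x):
    """Predicted height (cm) at age x (months) from the fitted line."""
    return a + b * x

print(round(a, 2), round(b, 3))
print(round(predict(32), 1))    # a little beyond the observed ages
print(round(predict(240), 1))   # extrapolation to age 20 years
```

For these data the fitted line is roughly y-hat = 64.93 + 0.635x. The prediction at 240 months is about 217 cm, which shows why extrapolation is unreliable: the line was fitted only to ages 18 to 29 months, and growth does not stay linear.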
29Correlation and Regression
- Least-squares regression looks at the distances of the data points from the line only in the y direction. So the two variables x and y play different roles in regression.
- Although the correlation ignores the distinction between explanatory and response variables, there is a close connection between correlation and regression. We saw that the slope of the least-squares line involves r. Another connection between correlation and regression is even more important. In fact, the numerical value of r as a measure of the strength of a linear relationship is best interpreted by thinking about regression.
30Correlation and Regression
- The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
- Example: Age and Height
- The straight-line relationship between height and age explains 98.88% (almost all) of the variation in heights.
- The square of the correlation measures how successfully the regression explains the response.
- When r = 1 or r = -1, the points lie exactly on the line and $r^2 = 1$: all of the variation in one variable is accounted for by the linear relationship with the other variable.
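The 98.88% figure can be checked against the age and height table. The sketch below computes the squared correlation directly and also as one minus the share of variation left in the residuals; the two should agree. Variable names are chosen for this example, and the slope is computed with the equivalent form b = S_xy / S_xx.

```python
from statistics import mean

# Age (months) and height (cm) copied from the table on the Prediction slide.
age = list(range(18, 30))
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

x_bar, y_bar = mean(age), mean(height)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, height))
s_xx = sum((x - x_bar) ** 2 for x in age)
s_yy = sum((y - y_bar) ** 2 for y in height)   # total variation in y

b = s_xy / s_xx          # least-squares slope
a = y_bar - b * x_bar    # least-squares intercept

# Variation left over after the regression (sum of squared residuals).
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(age, height))

r_squared = s_xy ** 2 / (s_xx * s_yy)
explained = 1 - ss_resid / s_yy

print(round(r_squared, 4), round(explained, 4))
```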
31Residuals
- A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
$$\text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}$$
- A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
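A minimal sketch of the residual calculation, again using the age and height table: fit the least-squares line, then subtract each predicted height from the observed one. The printed pairs are the points a residual plot would show. Variable names are chosen for this example.

```python
from statistics import mean

# Age (months) and height (cm) copied from the table on the Prediction slide.
age = list(range(18, 30))
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

x_bar, y_bar = mean(age), mean(height)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(age, height))
     / sum((x - x_bar) ** 2 for x in age))   # least-squares slope
a = y_bar - b * x_bar                        # least-squares intercept

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(age, height)]

for x, e in zip(age, residuals):
    print(x, round(e, 2))   # (explanatory value, residual) pairs
```

The residuals from a least-squares line always sum to zero (up to rounding error), so a residual plot is centered on the horizontal line at 0.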
32Residual
33Outliers and Influential Observation in Regression
- An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction of a scatterplot have large regression residuals, but other outliers need not have large residuals.
- An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
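To make the idea of influence concrete, the sketch below uses made-up data (five points close to a line plus one point far out in the x direction; none of these values come from the slides) and compares the least-squares slope with and without the extra point. The helper `fit_line` is written for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Made-up data: five points near a line, plus one outlier in the x direction.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
outlier = (20, 5.0)

a0, b0 = fit_line(xs, ys)                                # without the outlier
a1, b1 = fit_line(xs + [outlier[0]], ys + [outlier[1]])  # with the outlier

# Residuals from the line fitted WITH the outlier included.
outlier_residual = outlier[1] - (a1 + b1 * outlier[0])
other_residuals = [y - (a1 + b1 * x) for x, y in zip(xs, ys)]

print(round(b0, 2), round(b1, 2))
print(round(outlier_residual, 2), [round(e, 2) for e in other_residuals])
```

In this example the single far-out point pulls the slope from about 2 down to nearly 0, yet its own residual is smaller than those of some ordinary points, because the line has been dragged toward it.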
34Outliers and Influential Observation in Regression
35Transforming Relationships