Looking at real data Relationship - PowerPoint PPT Presentation

About This Presentation

Title:

Looking at real data Relationship

Description:

A medical study finds that short women are more likely to have heart attacks ... Least-squares regression looks at the distances of the data points from the line ... – PowerPoint PPT presentation

Slides: 36

Provided by: AgaW3

Transcript and Presenter's Notes

Title: Looking at real data Relationship

1
Looking at real data Relationship

Lecture Statistics of time series

2
Introduction

A medical study finds that short women are more likely to have heart attacks than women of average height, while tall women have the fewest heart attacks. An insurance group reports that heavier cars have fewer accident deaths per 10,000 vehicles registered than do lighter cars. These and many other statistical studies look at the relationship between two variables. To understand such a relationship, we must often examine other variables as well. To conclude that shorter women have a higher risk of heart attack, for example, the researchers had to eliminate the effect of other variables such as weight and exercise habits.
We are also interested in relationships between variables. One of our main themes is that the relationship between two variables can be strongly influenced by other variables that are lurking in the background.

3
Introduction

To study the relationship between two variables, we measure both variables on the same individuals. If we measure both the height and the weight of each of a large group of people, we know which height goes with each weight. These data allow us to study the connection between height and weight. A list of the heights and a separate list of the weights, two sets of single-variable data, do not show the connection between the two variables. In fact, taller people also tend to be heavier. And people who smoke more cigarettes per day tend not to live as long as those who smoke fewer. We say that pairs of variables such as height and weight or smoking and life expectancy are associated.

4
Association between Variables

Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.
Statistical associations are overall tendencies, not ironclad rules. They allow individual exceptions. Although smokers on average die earlier than nonsmokers, some people live to 90 while smoking three packs a day.

5
Examining Relationship

When you examine the relationship between two or more variables, first ask the preliminary questions:
What individuals do the data describe?
What variables are present? How are they measured?
Which variables are quantitative and which are categorical?

6
Association between Variables

A medical study, for example, may record each subject's sex (male, female) and smoking status along with quantitative variables such as weight and blood pressure. We may be interested in possible associations between two quantitative variables (such as a person's weight and blood pressure), between a categorical and a quantitative variable (such as sex and blood pressure), or between two categorical variables (such as sex and smoking status).
When you examine the relationship between two variables, a new question becomes important:
Is your purpose simply to explore the nature of the relationship, or do you hope to show that one of the variables can explain variation in the other? That is, are some of the variables response variables and others explanatory variables?

7
Response Variable, Explanatory Variable

A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.
Example: Alcohol has many effects on the body. One effect is a drop in body temperature. To study this effect, researchers give several different amounts of alcohol to mice, then measure the change in each mouse's body temperature in the 15 minutes after taking the alcohol. Amount of alcohol is the explanatory variable and change in body temperature is the response variable.
In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. But not all explanatory-response relationships involve direct causation.

8
Dependent and Independent Variables

Some statistical techniques require us to distinguish explanatory from response variables; others make no use of this distinction. You will often see explanatory variables called independent variables and response variables called dependent variables. The idea behind this language is that response variables depend on explanatory variables.
Most statistical studies examine data on more than one variable. Fortunately, statistical analysis of several-variable data builds on the tools used for examining individual variables.

9
Scatterplots

Relationships between two quantitative variables are best displayed graphically. The most useful graph for this purpose is a scatterplot.
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.
Always plot the explanatory variable (if there is one) on the horizontal axis (the x axis) of a scatterplot.
The explanatory variable: x
The response variable: y

10
Scatterplot
Country          Alcohol from wine (liters per person per year)   Heart disease deaths (per 100,000 people per year)
Australia        2.5                                              211
Austria          3.9                                              167
Belgium          2.9                                              131
Canada           2.4                                              191
Denmark          2.9                                              220
Finland          0.8                                              297
France           9.1                                              71
Iceland          0.8                                              211
Ireland          0.7                                              300
Italy            7.9                                              107
Netherlands      1.8                                              167
New Zealand      1.9                                              266
Norway           0.8                                              227
Spain            6.5                                              86
Sweden           1.6                                              207
Switzerland      5.8                                              115
United Kingdom   1.3                                              285
United States    1.2                                              199
Germany          2.7                                              172
11
Scatterplot
12
Positive Association, Negative Association

Two variables are positively associated when above-average values of one tend to accompany above-average values of the other and below-average values also tend to occur together.
Two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa.
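
The definition above can be checked on the wine and heart disease table from the earlier Scatterplot slide. The sketch below is only an illustration: the helper name `association_counts` is an assumption, not something from the lecture. It counts how many countries fall on the same side of both averages and how many fall on opposite sides.

```python
from statistics import mean

# Values copied from the wine table, in row order:
# alcohol from wine and heart disease deaths for 19 countries.
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def association_counts(x, y):
    """Count individuals on the same side / opposite sides of the two means."""
    x_bar, y_bar = mean(x), mean(y)
    same = sum(1 for xi, yi in zip(x, y) if (xi - x_bar) * (yi - y_bar) > 0)
    opposite = sum(1 for xi, yi in zip(x, y) if (xi - x_bar) * (yi - y_bar) < 0)
    return same, opposite


same, opposite = association_counts(wine, deaths)
print(f"same side of both means: {same}, opposite sides: {opposite}")
```

If most individuals land on opposite sides, as they do here, the association is negative. The counts are a rough tendency check, not a measure of strength.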

13
More examples of scatterplots
Yield (in bushels per acre) by planting rate and year; a hyphen marks a missing value
Plants per acre   1956    1958    1959    1960    Mean
12000             150.1   113.0   118.4   142.6   131.0
16000             166.9   120.7   135.2   149.8   143.2
20000             165.3   130.1   139.6   149.9   146.2
24000             -       134.7   138.4   156.1   143.1
28000             -       -       119.0   150.5   134.8
Mean              160.8   124.6   130.1   149.8
14
More examples of scatterplots
15
More examples of scatterplots
16
Correlation

We have data on variables x and y for n
individuals. Think, for example, of
measuring height and weight for n people. Then x1
and y1 are your height and
weight, x2 and y2 are my height and weight and so
on. For the i-th individual,
height xi goes with weight yi.

17
Correlation

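The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.
Suppose that we have data on variables x and y for n individuals. The means and the standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values.
The correlation between x and y is
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$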
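
As a minimal sketch, the correlation can be computed directly from standardized values. This assumes the usual definition of r: the sum of products of standardized x and y values, divided by n - 1, with sample standard deviations. The function name `correlation` is chosen here for illustration. The data are the wine and heart disease values from the Scatterplot slide.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation r as the averaged product of standardized values.

    Assumes sample standard deviations (divisor n - 1), which is what
    statistics.stdev computes.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


# Alcohol from wine and heart disease deaths, in table row order.
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]

r = correlation(wine, deaths)
print(f"r = {r:.3f}")  # negative: more wine goes with fewer deaths
```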

18
Correlation

The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above (below) average in height tend to also be above (below) average in weight. Using the formula for r, we can see that the correlation is negative when the association between x and y is negative.

19
What you need to know in order to interpret correlation

Correlation makes no use of the distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating the correlation.
Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r.
Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x and y.
Positive r indicates positive association between the variables, and negative r indicates negative association.
The correlation is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. Values close to -1 or 1 indicate that the points lie close to a straight line. The extreme values r = -1 and r = 1 occur only when the points in a scatterplot lie exactly along a straight line.
Correlation measures the strength of only the linear relationship between two variables. It does not describe curved relationships between variables, no matter how strong they are.
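
Several of these properties can be checked numerically. The sketch below assumes the standard definition of r based on standardized values with divisor n - 1. It uses the age and height data from the Prediction slide, plus a small made-up curved data set (y = x squared) that is purely hypothetical.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation r from standardized values (divisor n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)


# Age in months and height in centimeters from the Prediction slide.
ages = list(range(18, 30))
heights_cm = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
              79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

r = correlation(ages, heights_cm)

# Swapping the roles of x and y does not change r.
r_swapped = correlation(heights_cm, ages)

# Changing units (months to years, centimeters to inches) does not change r.
ages_years = [a / 12 for a in ages]
heights_in = [h / 2.54 for h in heights_cm]
r_rescaled = correlation(ages_years, heights_in)

# Hypothetical curved data: y is completely determined by x,
# yet the linear correlation is zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_curved = correlation(xs, ys)

print(r, r_swapped, r_rescaled, r_curved)
```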

20
Correlation
21
Least-Squares Regression

Correlation measures the direction and strength
of the linear (straight-line)
relationship between two quantitative variables.
If a scatterplot shows a linear
relationship, we would like to summarize this
overall pattern by drawing a line
on the scatterplot. A regression line summarizes
the relationship between two
variables, but only in a specific setting when
one of the variables helps
explain or predict the other. That is, regression
describes a relationship
between explanatory and response variables.

22
Least-Squares Regression

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use the regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.

23
Fitting a line to data

When a scatterplot displays a linear pattern, we can display the overall pattern by drawing a straight line through the points. Of course, no straight line passes exactly through all of the points. Fitting a line to data means drawing a line that comes as close as possible to the points. The equation of a line fitted to the data gives a compact description of the dependence of the response variable on the explanatory variable. It is a mathematical model for the straight-line relationship.

24
Straight Line

Suppose that y is a response variable and x is an explanatory variable. A straight line relating y to x has an equation of the form
$$ y = a + bx $$
In this equation, b is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x = 0.

25
Prediction

We can use a regression line to predict the
response y for a specific value of the
explanatory variable x.
Extrapolation is the use of a regression line for
prediction far outside the range of
values of the explanatory variable that you use
to obtain the line. Such predictions are
often not accurate.

26
Prediction
Age x (in months)   Height y (in centimeters)
18 76.1
19 77.0
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
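
To illustrate prediction and extrapolation with the table above, here is a minimal sketch that fits a line by least squares and evaluates it at two ages. The fitting formula (slope equal to the sum of products of deviations divided by the sum of squared x deviations) is the standard one and is assumed here. The names `fit_line` and `predict_height` are illustrative only.

```python
from statistics import mean

# Age in months and height in centimeters, copied from the table.
ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]


def fit_line(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(ages, heights)


def predict_height(age_months):
    """Predicted height in centimeters from the fitted line."""
    return a + b * age_months


print(f"fitted line: height = {a:.2f} + {b:.3f} * age")
print(f"at 32 months:  {predict_height(32):.1f} cm")   # just past the data
print(f"at 240 months: {predict_height(240):.1f} cm")  # far outside the data
```

The prediction at 240 months (age 20) comes out above two meters. Growth slows long before then, so the line fitted to ages 18 to 29 months says nothing reliable that far out.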
27
Prediction
28
Equation of the Least-Squares Regression Line

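We have data on an explanatory variable x and a response variable y for n individuals. The means and standard deviations are $\bar{x}$ and $s_x$ for x, and $\bar{y}$ and $s_y$ for y.
The correlation between x and y is r.
The equation of the least-squares regression line of y on x is
$$ \hat{y} = a + bx $$
with slope
$$ b = r \frac{s_y}{s_x} $$
and intercept
$$ a = \bar{y} - b\bar{x} $$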
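
A short check, under the assumption that the least-squares slope is b = r * s_y / s_x and the intercept is a = y_bar - b * x_bar (the standard result). It computes the line this way for the age and height data from the Prediction slide and compares the slope with the one obtained by minimizing squared vertical distances directly. The function name `correlation` is illustrative.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation r from standardized values (divisor n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)


ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

x_bar, y_bar = mean(ages), mean(heights)
r = correlation(ages, heights)

# Slope and intercept from r, the standard deviations, and the means.
b = r * stdev(heights) / stdev(ages)
a = y_bar - b * x_bar

# The same slope from the direct least-squares formula, for comparison.
b_direct = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights))
            / sum((x - x_bar) ** 2 for x in ages))

print(f"b from r: {b:.4f}, b direct: {b_direct:.4f}, a: {a:.2f}")
```

Because a = y_bar - b * x_bar, the fitted line always passes through the point (x_bar, y_bar).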

29
Correlation and Regression

Least-squares regression looks at the distances of the data points from the line only in the y direction. So the two variables x and y play different roles in regression.
Although the correlation ignores the distinction between explanatory and response variables, there is a close connection between correlation and regression. We saw that the slope of the least-squares line involves r. Another connection between correlation and regression is even more important. In fact, the numerical value of r as a measure of the strength of a linear relationship is best interpreted by thinking about regression.

30
Correlation and Regression

The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
Example: Age and Height
The straight-line relationship between height and age explains 98.88% (almost all) of the variation in heights.
The square of the correlation measures how successfully the regression explains the response. When r = 1 or r = -1, the points lie exactly on the line and $r^2 = 1$: all of the variation in one variable is accounted for by the linear relationship with the other variable.
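
The age and height figure can be reproduced from the data on the Prediction slide. In this sketch the fraction of variation explained is computed as 1 minus the ratio of squared residuals to squared deviations from the mean height, and then compared with the squared correlation. This assumes the standard least-squares formulas, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean

ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

x_bar, y_bar = mean(ages), mean(heights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights))
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_yy = sum((y - y_bar) ** 2 for y in heights)

# Least-squares line and correlation.
b = s_xy / s_xx
a = y_bar - b * x_bar
r = s_xy / sqrt(s_xx * s_yy)

# Variation left unexplained by the line, relative to the total variation.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, heights))
fraction_explained = 1 - unexplained / s_yy

print(f"r^2 = {r ** 2:.4f}, fraction explained = {fraction_explained:.4f}")
```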

31
Residuals

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = observed y - predicted y = $y - \hat{y}$
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
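
As an illustration, the residuals for the age and height data from the Prediction slide can be listed next to the explanatory variable, which is the information a residual plot displays. The sketch assumes the standard least-squares fit, and the names are illustrative.

```python
from statistics import mean

ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

# Least-squares line height = a + b * age.
x_bar, y_bar = mean(ages), mean(heights)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights))
     / sum((x - x_bar) ** 2 for x in ages))
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(ages, heights)]

for age, res in zip(ages, residuals):
    print(f"age {age}: residual {res:+.2f} cm")

# For a least-squares line the residuals sum to zero up to rounding error.
print(f"sum of residuals: {sum(residuals):.2e}")
```

Residuals scattered around zero with no visible pattern suggest the straight line is a reasonable summary.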

32
Residual
33
Outliers and Influential Observations in Regression

An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction of a scatterplot have large regression residuals, but other outliers need not have large residuals.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
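
The difference between the two kinds of outlier can be seen by adding one extra point to the age and height data from the Prediction slide and refitting. Both extra points below are hypothetical, invented only for this illustration: one is unusual in the y direction (age 24 months, height 90.0 cm) and one in the x direction (age 60 months, height 80.0 cm).

```python
from statistics import mean

ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]


def slope(x, y):
    """Slope of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


b_original = slope(ages, heights)

# Hypothetical outlier in the y direction, near the middle of the ages.
b_with_y_outlier = slope(ages + [24], heights + [90.0])

# Hypothetical outlier in the x direction, far from the other ages.
b_with_x_outlier = slope(ages + [60], heights + [80.0])

print(f"original slope:          {b_original:.3f}")
print(f"with y-direction outlier: {b_with_y_outlier:.3f}")
print(f"with x-direction outlier: {b_with_x_outlier:.3f}")
```

In this sketch the y-direction outlier barely moves the slope, while the single x-direction outlier pulls the line almost flat. Removing it would markedly change the result, which is what makes it influential.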