Looking at real data: Relationships - PowerPoint PPT Presentation

Slides: 36
Provided by: AgaW3
Transcript and Presenter's Notes

1
Looking at real data: Relationships
  • Lecture: Statistics of time series

2
Introduction
  • A medical study finds that short women are more likely to have
    heart attacks than women of average height, while tall women have
    the fewest heart attacks.
  • An insurance group reports that heavier cars have fewer accident
    deaths per 10,000 vehicles registered than do lighter cars.
  • These and many other statistical studies look at the relationship
    between two variables. To understand such a relationship, we must
    often examine other variables as well. To conclude that shorter
    women have a higher risk of heart attack, for example, the
    researchers had to eliminate the effect of other variables such as
    weight and exercise habits.
  • We are also interested in relationships between variables. One of
    our main themes is that the relationship between two variables can
    be strongly influenced by other variables that are lurking in the
    background.

3
Introduction
  • To study the relationship between two variables, we measure both
    variables on the same individuals. If we measure both the height
    and the weight of each of a large group of people, we know which
    height goes with each weight. These data allow us to study the
    connection between height and weight.
  • A list of the heights and a separate list of the weights, two sets
    of single-variable data, do not show the connection between the
    two variables.
  • In fact, taller people also tend to be heavier. And people who
    smoke more cigarettes per day tend not to live as long as those
    who smoke fewer. We say that pairs of variables such as height and
    weight or smoking and life expectancy are associated.

4
Association between Variables
  • Two variables measured on the same individuals are associated if
    some values of one variable tend to occur more often with some
    values of the second variable than with other values of that
    variable.
  • Statistical associations are overall tendencies, not ironclad
    rules. They allow individual exceptions. Although smokers on
    average die earlier than nonsmokers, some people live to 90 while
    smoking three packs a day.

5
Examining Relationship
  • When you examine the relationship between two or more variables,
    first ask the preliminary questions:
  • What individuals do the data describe?
  • What variables are present? How are they measured?
  • Which variables are quantitative and which are categorical?

6
Association between Variables
  • A medical study, for example, may record each subject's sex (male,
    female) and smoking status along with quantitative variables such
    as weight and blood pressure. We may be interested in possible
    associations between two quantitative variables (such as a
    person's weight and blood pressure), between a categorical and a
    quantitative variable (such as sex and blood pressure), or between
    two categorical variables (such as sex and smoking status).
  • When you examine the relationship between two variables, a new
    question becomes important:
  • Is your purpose simply to explore the nature of the relationship,
    or do you hope to show that one of the variables can explain
    variation in the other? That is, are some of the variables
    response variables and others explanatory variables?

7
Response Variable, Explanatory Variable
  • A response variable measures an outcome of a study. An explanatory
    variable explains or causes changes in the response variable.
  • Example: Alcohol has many effects on the body. One effect is a
    drop in body temperature. To study this effect, researchers give
    several different amounts of alcohol to mice, then measure the
    change in each mouse's body temperature in the 15 minutes after
    taking the alcohol. Amount of alcohol is the explanatory variable
    and change in body temperature is the response variable.
  • In many studies, the goal is to show that changes in one or more
    explanatory variables actually cause changes in a response
    variable. But not all explanatory-response relationships involve
    direct causation.

8
Dependent and Independent Variables
  • Some statistical techniques require us to distinguish explanatory
    from response variables; others make no use of this distinction.
    You will often see explanatory variables called independent
    variables and response variables called dependent variables. The
    idea behind this language is that response variables depend on
    explanatory variables.
  • Most statistical studies examine data on more than one variable.
    Fortunately, statistical analysis of several-variable data builds
    on the tools used for examining individual variables.

9
Scatterplots
  • Relationships between two quantitative variables are best
    displayed graphically. The most useful graph for this purpose is a
    scatterplot.
  • A scatterplot shows the relationship between two quantitative
    variables measured on the same individuals. The values of one
    variable appear on the horizontal axis, and the values of the
    other on the vertical axis. Each individual in the data appears as
    the point in the plot fixed by the values of both variables for
    that individual.
  • Always plot the explanatory variable (if there is one) on the
    horizontal axis (the x axis) of a scatterplot.
  • Explanatory variable: x
  • Response variable: y

10
Scatterplot
Country          Alcohol from wine   Heart disease deaths
Australia        2.5                 211
Austria          3.9                 167
Belgium          2.9                 131
Canada           2.4                 191
Denmark          2.9                 220
Finland          0.8                 297
France           9.1                 71
Iceland          0.8                 211
Ireland          0.7                 300
Italy            7.9                 107
Netherlands      1.8                 167
New Zealand      1.9                 266
Norway           0.8                 227
Spain            6.5                 86
Sweden           1.6                 207
Switzerland      5.8                 115
United Kingdom   1.3                 285
United States    1.2                 199
Germany          2.7                 172
11
Scatterplot
12
Positive Association, Negative Association
  • Two variables are positively associated when above-average values
    of one tend to accompany above-average values of the other, and
    below-average values also tend to occur together.
  • Two variables are negatively associated when above-average values
    of one tend to accompany below-average values of the other, and
    vice versa.

13
More examples of scatterplots
Yield (in bushels per acre) by planting rate and year
Plants per acre   1956    1958    1959    1960    Mean
12000             150.1   113.0   118.4   142.6   131.0
16000             166.9   120.7   135.2   149.8   143.2
20000             165.3   130.1   139.6   149.9   146.2
24000             -       134.7   138.4   156.1   143.1
28000             -       -       119.0   150.5   134.8
Mean              160.8   124.6   130.1   149.8
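The row and column means in the yield table can be checked with a short calculation. The sketch below is only an illustration: it assumes that the empty cells belong to the earliest years (no 1956 value for 24000 plants per acre, no 1956 or 1958 value for 28000), and the helper name `mean_of_present` is hypothetical.

```python
# Yields in bushels per acre, copied from the table above.
# None marks a planting rate with no recorded yield in that year
# (the placement of the gaps is an assumption about the flattened table).
years = [1956, 1958, 1959, 1960]
yields = {
    12000: [150.1, 113.0, 118.4, 142.6],
    16000: [166.9, 120.7, 135.2, 149.8],
    20000: [165.3, 130.1, 139.6, 149.9],
    24000: [None, 134.7, 138.4, 156.1],
    28000: [None, None, 119.0, 150.5],
}


def mean_of_present(values):
    """Average only the cells that actually hold a yield."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


# Mean yield for each planting rate (the "Mean" column).
row_means = {rate: mean_of_present(row) for rate, row in yields.items()}

# Mean yield for each year (the "Mean" row).
col_means = {
    year: mean_of_present([row[j] for row in yields.values()])
    for j, year in enumerate(years)
}

for rate, m in row_means.items():
    print(rate, round(m, 2))
for year, m in col_means.items():
    print(year, round(m, 2))
```

Under that assumption the computed means agree with the listed ones to within rounding, and the mean yield rises and then falls as the planting rate increases, so the relationship is curved rather than linear.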
14
More examples of scatterplots
15
More examples of scatterplots
16
Correlation
  • We have data on variables x and y for n individuals. Think, for
    example, of measuring height and weight for n people. Then $x_1$
    and $y_1$ are your height and weight, $x_2$ and $y_2$ are my
    height and weight, and so on. For the i-th individual, height
    $x_i$ goes with weight $y_i$.

17
Correlation
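  • The correlation measures the direction and strength of the linear
    relationship between two quantitative variables. Correlation is
    usually written as r.
  • Suppose that we have data on variables x and y for n individuals.
    The means and the standard deviations of the two variables are
    $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for
    the y-values.
  • The correlation between x and y is
    $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$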
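As a minimal sketch of this calculation, the code below computes r for the wine and heart disease data from the Scatterplot slide as the average product of standardized values, dividing by n - 1. The function name `correlation` is an illustrative choice, and the standard deviations are the sample versions from Python's `statistics.stdev`.

```python
from statistics import mean, stdev

# Alcohol from wine and heart disease deaths, one pair per country,
# in the order the countries are listed on the Scatterplot slide.
wine = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
        1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


r = correlation(wine, deaths)
print(round(r, 3))
```

The result is strongly negative: countries with more alcohol from wine tend to have fewer heart disease deaths, which is the negative association visible in the scatterplot.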

18
Correlation
  • The formula for correlation helps us see that r is positive when
    there is a positive association between the variables. Height and
    weight, for example, have a positive association. People who are
    above (below) average in height tend to also be above (below)
    average in weight. Using the formula for r, we can see in the same
    way that the correlation is negative when the association between
    x and y is negative.

19
What you need to know in order to interpret correlation
  • Correlation makes no use of the distinction between explanatory
    and response variables. It makes no difference which variable you
    call x and which you call y in calculating the correlation.
  • Correlation requires that both variables be quantitative, so that
    it makes sense to do the arithmetic indicated by the formula for
    r.
  • Because r uses the standardized values of the observations, r does
    not change when we change the units of measurement of x and y.
  • Positive r indicates positive association between the variables,
    and negative r indicates negative association.
  • The correlation is always a number between -1 and 1. Values of r
    near 0 indicate a very weak linear relationship. Values close to
    -1 or 1 indicate that the points lie close to a straight line. The
    extreme values r = -1 and r = 1 occur only when the points in a
    scatterplot lie exactly along a straight line.
  • Correlation measures the strength of only the linear relationship
    between two variables. It does not describe curved relationships
    between variables, no matter how strong they are.
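Several of these properties can be checked numerically. The sketch below is an illustration rather than a proof. It uses the age and height data that appear later in this lecture, plus a small hypothetical data set following y = x^2; the function name `correlation` is an illustrative choice.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


# Age (months) and height (cm) from the Prediction table in this lecture.
age = list(range(18, 30))
height_cm = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
             79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

r_xy = correlation(age, height_cm)
r_yx = correlation(height_cm, age)          # swap the roles of x and y
height_in = [h / 2.54 for h in height_cm]   # change units: cm to inches
r_inches = correlation(age, height_in)

# Hypothetical data with a perfect but curved relationship, y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
r_curved = correlation(xs, ys)

print(r_xy, r_yx, r_inches, r_curved)
```

Swapping x and y or converting centimeters to inches leaves r unchanged up to floating-point rounding, while the curved data give r essentially equal to 0 even though y is completely determined by x.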

20
Correlation
21
Least-Squares Regression
  • Correlation measures the direction and strength of the linear
    (straight-line) relationship between two quantitative variables.
    If a scatterplot shows a linear relationship, we would like to
    summarize this overall pattern by drawing a line on the
    scatterplot.
  • A regression line summarizes the relationship between two
    variables, but only in a specific setting: when one of the
    variables helps explain or predict the other. That is, regression
    describes a relationship between explanatory and response
    variables.

22
Least-Squares Regression
  • A regression line is a straight line that describes how a response
    variable y changes as an explanatory variable x changes. We often
    use the regression line to predict the value of y for a given
    value of x. Regression, unlike correlation, requires that we have
    an explanatory variable and a response variable.

23
Fitting a line to data
  • When a scatterplot displays a linear pattern, we can describe the
    overall pattern by drawing a straight line through the points. Of
    course, no straight line passes exactly through all of the points.
    Fitting a line to data means drawing a line that comes as close as
    possible to the points.
  • The equation of a line fitted to the data gives a compact
    description of the dependence of the response variable on the
    explanatory variable. It is a mathematical model for the
    straight-line relationship.

24
Straight Line
  • Suppose that y is a response variable and x is an explanatory
    variable. A straight line relating y to x has an equation of the
    form
    $y = a + bx$
  • In this equation, b is the slope, the amount by which y changes
    when x increases by one unit. The number a is the intercept, the
    value of y when x = 0.

25
Prediction
  • We can use a regression line to predict the response y for a
    specific value of the explanatory variable x.
  • Extrapolation is the use of a regression line for prediction far
    outside the range of values of the explanatory variable that you
    used to obtain the line. Such predictions are often not accurate.

26
Prediction
Age x (months)   Height y (cm)
18               76.1
19               77.0
20               78.1
21               78.2
22               78.8
23               79.7
24               79.9
25               81.1
26               81.2
27               81.8
28               82.8
29               83.5
27
Prediction
28
Equation of the Least-Squares Regression Line
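  • We have data on an explanatory variable x and a response variable
    y for n individuals. The means and standard deviations are
    $\bar{x}$ and $s_x$ for x, and $\bar{y}$ and $s_y$ for y.
  • The correlation between x and y is r.
  • The equation of the least-squares regression line of y on x is
    $\hat{y} = a + bx$
    with slope $b = r \frac{s_y}{s_x}$ and intercept
    $a = \bar{y} - b\bar{x}$.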
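The following sketch applies the standard textbook form of the least-squares line, slope b = r * s_y / s_x and intercept a = y_bar - b * x_bar, to the age and height data from the Prediction table. The function name `predict` and the two ages used for prediction (32 and 240 months) are illustrative choices, not taken from the slides.

```python
from statistics import mean, stdev

# Age (months) and height (cm) from the Prediction table.
age = list(range(18, 30))
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

n = len(age)
x_bar, y_bar = mean(age), mean(height)
s_x, s_y = stdev(age), stdev(height)

# Correlation as the average product of standardized values.
r = sum(
    ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(age, height)
) / (n - 1)

b = r * s_y / s_x        # slope: change in height per month of age
a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)


def predict(x):
    """Predicted height in cm at age x months."""
    return a + b * x


print(f"height = {a:.2f} + {b:.3f} * age")
print(round(predict(32), 1))    # slightly beyond the observed ages
print(round(predict(240), 1))   # extrapolation to 20 years of age
```

The fitted line is roughly height = 64.93 + 0.635 * age. A prediction at 32 months is plausible, but extrapolating to 240 months gives a height of more than 2 meters, which shows why predictions far outside the observed range of x are often not accurate.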

29
Correlation and Regression
  • Least-squares regression looks at the distances of the data points
    from the line only in the y direction. So the two variables x and
    y play different roles in regression.
  • Although the correlation ignores the distinction between
    explanatory and response variables, there is a close connection
    between correlation and regression. We saw that the slope of the
    least-squares line involves r. Another connection between
    correlation and regression is even more important. In fact, the
    numerical value of r as a measure of the strength of a linear
    relationship is best interpreted by thinking about regression.

30
Correlation and Regression
  • The square of the correlation, $r^2$, is the fraction of the
    variation in the values of y that is explained by the
    least-squares regression of y on x.
  • Example: Age and Height. The straight-line relationship between
    height and age explains 98.88% (almost all) of the variation in
    heights.
  • The square of the correlation measures how successfully the
    regression explains the response.
  • When r = 1 or r = -1, the points lie exactly on the line and
    $r^2 = 1$: all of the variation in one variable is accounted for
    by the linear relationship with the other variable.
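The interpretation of the squared correlation can be checked on the age and height data. The sketch below fits the least-squares line, then compares the squared correlation with the fraction of variation explained, computed as 1 minus the ratio of residual variation to total variation. The variable names (`sse`, `sst`, `explained`) are illustrative.

```python
from math import sqrt

# Age (months) and height (cm) from the Prediction table.
age = list(range(18, 30))
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(height) / n

# Sums of squares and cross products of the deviations from the means.
s_xx = sum((x - x_bar) ** 2 for x in age)
s_yy = sum((y - y_bar) ** 2 for y in height)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, height))

b = s_xy / s_xx              # least-squares slope
a = y_bar - b * x_bar        # least-squares intercept
r = s_xy / sqrt(s_xx * s_yy)

# Variation left over after the regression (sum of squared residuals).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(age, height))
sst = s_yy                   # total variation in the heights
explained = 1 - sse / sst

print(round(r ** 2, 4), round(explained, 4))
```

Both numbers come out at about 0.9888, matching the 98.88% quoted for the age and height example.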

31
Residuals
  • A residual is the difference between an observed value of the
    response variable and the value predicted by the regression line.
    That is,
    residual = observed y - predicted y = $y - \hat{y}$
  • A residual plot is a scatterplot of the regression residuals
    against the explanatory variable. Residual plots help us assess
    the fit of a regression line.
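As a small sketch, the residuals for the age and height data can be computed directly from the least-squares fit. Printing them against age stands in for the residual plot; the variable names are illustrative.

```python
# Age (months) and height (cm) from the Prediction table.
age = list(range(18, 30))
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(height) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, height)) / sum(
    (x - x_bar) ** 2 for x in age
)
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(age, height)]

# A text version of the residual plot: residual against age.
for x, res in zip(age, residuals):
    print(x, round(res, 2))

print("sum of residuals:", round(sum(residuals), 6))
```

The residuals of a least-squares line always sum to zero (up to rounding), so a residual plot is centered on zero. What matters for assessing fit is whether the residuals show a systematic pattern, such as a curve, or scatter without structure.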

32
Residual
33
Outliers and Influential Observation in Regression
  • An outlier is an observation that lies outside the overall pattern
    of the other observations. Points that are outliers in the y
    direction of a scatterplot have large regression residuals, but
    other outliers need not have large residuals.
  • An observation is influential for a statistical calculation if
    removing it would markedly change the result of the calculation.
    Points that are outliers in the x direction of a scatterplot are
    often influential for the least-squares regression line.
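The difference between the two kinds of outlier can be illustrated with a small experiment. All numbers below are hypothetical and chosen only for illustration: five points lying close to a line, one added point that is an outlier in x, and one added point that is an outlier in y. The helper name `fit_line` is also an illustrative choice.

```python
def fit_line(points):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Hypothetical data close to a straight line with slope about 2.
base = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
x_outlier = (15, 8.0)   # far from the other points in the x direction
y_outlier = (3, 12.0)   # typical x, unusually large y

a0, b0 = fit_line(base)
a1, b1 = fit_line(base + [x_outlier])
a2, b2 = fit_line(base + [y_outlier])

# Residual of each added point from the line fitted with it included.
resid_x = x_outlier[1] - (a1 + b1 * x_outlier[0])
resid_y = y_outlier[1] - (a2 + b2 * y_outlier[0])

print("slopes:", round(b0, 2), round(b1, 2), round(b2, 2))
print("residuals:", round(resid_x, 2), round(resid_y, 2))
```

In this example the x outlier pulls the slope from about 2 down to about 0.3 yet has a modest residual, because the line is drawn toward it. The y outlier sits at the mean of the x-values, so it leaves the slope unchanged but has a large residual.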

34
Outliers and Influential Observation in Regression
35
Transforming Relationships