ARCH 2126/6126 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
ARCH 2126/6126
  • Session 11 Relating variables to each other

2
Bivariate statistics
  • Considering 2 numerical variables observed on the
    same set of cases simultaneously
  • For example: length & breadth of a set of
    scrapers; height & weight of a set of people;
    spleen size & malaria positivity; mother's
    education level & use of traditional medicines

3
Or sometimes
  • Or some cases may have missing values for one
    variable, which you want to estimate from the
    other, e.g.
  • You want to estimate stature from femur length

4
Association of metric variables can be shown
visually
  • Here is a simple hypothetical example of a
    positive association between variables
  • They increase together
  • Labels, units, caption, number, to be added

5
And similarly
  • Here is a simple hypothetical example of a
    negative association between variables
  • As one increases, the other decreases

6
And again
  • Here is a simple hypothetical example of no
    (significant) association between variables
  • There are other possibilities, but let's consider
    these

7
The approach is similar in many ways to univariate
  • In the population out there, there is a
    measurable association (or lack of it) between
    these two variables: the population parameter
  • The association that we measure in the sample we
    have collected is an estimate of it: the sample
    statistic
  • The H0 is normally that there is no association

8
Last time: correlation as a measure of association
  • Correlation (positive or negative) is a specific
    term with a specific meaning
  • The usual measure is the Pearson or
    product-moment correlation coefficient
  • The usual symbols are ρ (rho) for the population
    parameter and r for the sample statistic
  • This is a quantity we can calculate

9
Summary on calculation of r
  • r = SPxy / √(SSx × SSy), where:
  • SPxy = Σxy − (Σx)(Σy)/n
  • SSx = Σ(x²) − (Σx)²/n
  • SSy = Σ(y²) − (Σy)²/n
  • n = number of cases
  • d.f. = n − 2
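
As a check on the arithmetic, the formulas above can be sketched in a few lines of plain Python (the function name `pearson_r` and the toy data are our own, for illustration only):

```python
import math

def pearson_r(x, y):
    """Pearson's r from the raw-sum formulas: r = SPxy / sqrt(SSx * SSy)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    ss_y = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    return sp_xy / math.sqrt(ss_x * ss_y)

# Perfect straight-line data gives the extreme values
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # -> 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))        # -> -1.0
```

Real data would fall somewhere between these extremes, as the next slide notes.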

10
Some points about r
  • Ranges from −1 through 0 to +1
  • No units
  • The extremes indicate perfect straight-line
    variation, upwards or downwards; they are
    unlikely to be seen in real data
  • Intermediate values indicate how tightly the
    points on a scatterplot cluster around a straight
    line

11
Some further points
  • r² tells us the proportion of the variance in one
    variable that can be explained by straight-line
    dependence on the other
  • Notice the stress on straight-line association: r
    does not automatically measure other forms of
    association (curvilinear, U- or J-shaped,
    bimodal)
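
This caveat is easy to demonstrate numerically; a sketch in plain Python with toy data of our own invention, where y depends exactly on x in both cases, yet r² only detects the straight-line dependence:

```python
import math

def pearson_r(x, y):
    """Pearson's r from the raw-sum formulas of the previous slides."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sp = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n
    ssx = sum(xi * xi for xi in x) - sx * sx / n
    ssy = sum(yi * yi for yi in y) - sy * sy / n
    return sp / math.sqrt(ssx * ssy)

x = [-2, -1, 0, 1, 2]
y_line = [2 * v + 1 for v in x]   # straight-line dependence on x
y_curve = [v * v for v in x]      # U-shaped (quadratic) dependence on x

print(pearson_r(x, y_line) ** 2)   # -> 1.0  (all variance explained)
print(pearson_r(x, y_curve) ** 2)  # -> 0.0  (r misses the U-shaped relation)
```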

12
Let's reconsider the same situation
  • We have, for a sample of n cases, the values for
    each case of the 2 variables: x1, x2, x3, … xn
    and y1, y2, y3, … yn

13
But now let's ask a different question
  • Not: how strong is the association between
    co-varying variables?
  • But: how can we state the mathematical
    relationship between 2 variables? How would you
    predict one, knowing the other?

14
The appropriate approach for that question is
regression
  • Also deals with 2 variables in 1 sample
  • But the two are not treated in same way
  • Mathematically related to correlation
  • Statpacks will often give you both measures
    together for a bivariate data set
  • Comes in a variety of versions, suited to
    different situations

15
The relationship between variable x and variable y
  • x, by convention plotted on the horizontal axis,
    is the independent, predictor, or controlled
    variable
  • y, on the vertical axis, is the dependent or
    response variable
  • This does not always literally mean a hypothesis
    that x causes y
  • Linear regression finds the straight line that
    best fits this relationship

16
Works in, but is not restricted to, experiments
  • There is an interest in explaining the dependent
    (y) variable in terms of the independent (x) one
  • Thus ideally x can be controlled, or at least
    measured without error, by the researcher
  • Distinction depends on purpose of research, not
    nature of variable

17
Simple examples
  • y = x
  • y = x + 2
  • y = x − 5
  • y = x + 10
  • y = 2x
  • y = ½x
  • y = ½x + 10
  • y = 1.5x + 3.5
  • All these equations symbolize straight lines that
    could be graphed, by hand or by computer
  • In each case, they state a relationship whereby,
    if you know x, you can work out y
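
A minimal sketch of this idea in Python (the helper name `line` is our own, for illustration): each equation becomes a function that returns y for any x.

```python
def line(b, a):
    """Return a function computing y = b*x + a."""
    return lambda x: b * x + a

f = line(1.5, 3.5)               # y = 1.5x + 3.5
print([f(x) for x in range(4)])  # -> [3.5, 5.0, 6.5, 8.0]
print(line(2, 0)(5))             # y = 2x evaluated at x = 5 -> 10
```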

18
y = x − 5
y = x
y = x + 2
y = 2x
y = x/2
y = 1.5x + 3.5
19
y = x·0 + 6
y = −0.8x
y = 10 + x/2
y = 3x/4 − 4
y = −2 − x/2
y = 29x/10
20
The general form of the equation is y = bx + a,
where
  • b is the coefficient of x and governs the slope
    of the linear relationship
  • If b is positive, the line rises from left to
    right; if negative, it falls
  • a is the value of y when x = 0, i.e. where the
    line crosses the y axis; it is known as the y
    intercept
  • If a is positive, the line crosses the y axis
    above the origin; if negative, below it

21
This is known as the regression line of y on x
  • When the points on a scatterplot are a cloud
    rather than a row, how do we find the line of
    best fit through it?
  • By statistics rather than by eye
  • We take the values of x as given and try to
    minimize the overall deviations (or residuals) of
    y from the line, so that the sum of squares of
    all the residuals is least
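
The "least sum of squared residuals" criterion can be checked directly; a sketch with made-up data (the formulas for b and a are the ones given on the later slides): any line other than the least-squares line gives a larger residual sum of squares.

```python
def ss_residuals(x, y, b, a):
    """Sum of squared deviations of y about the line y = b*x + a."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))

x = [0, 1, 2, 3]
y = [1.1, 2.9, 5.2, 6.8]         # made-up data, roughly y = 2x + 1
n = len(x)
sp_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(xi * xi for xi in x) - sum(x) ** 2 / n
b = sp_xy / ss_x                 # least-squares slope
a = sum(y) / n - b * sum(x) / n  # least-squares intercept
best = ss_residuals(x, y, b, a)

# Nudging the line in any direction increases the residual sum of squares
print(best < ss_residuals(x, y, b + 0.1, a))  # -> True
print(best < ss_residuals(x, y, b, a - 0.1))  # -> True
```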

22
(No Transcript)
23
So this is sometimes called a least-squares
regression
  • And again there is a not too complex formula for
    it, based on the observations and the means for
    each variable
  • We need eight quantities: the means of x and y;
    n; Σx and Σx²; Σy and Σy²; Σxy

24
Let's do it: all components are already familiar
  • Enter data as two columns
  • Sum each column to get Σx & Σy
  • Also square each value and sum these to get Σx²
    & Σy²
  • Also, for each case, multiply the two values and
    sum the products to get Σxy

25
Calculate sums of squares as before
  • SSx = Σ(x²) − (Σx)²/n
  • SSy = Σ(y²) − (Σy)²/n
  • and the sum of products as before:
  • SPxy = Σxy − (Σx)(Σy)/n

26
Now we are ready for the new and final steps
  • b = SPxy/SSx
  • a = y-bar − (b × x-bar)
  • As a reminder, the basic regression equation is
    y = bx + a
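
Putting the steps of the last three slides together, a sketch in plain Python (the function name `fit_line` and the data are our own, for illustration):

```python
def fit_line(x, y):
    """Least-squares fit: b = SPxy / SSx, a = y-bar - b * x-bar."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi * xi for xi in x) - sum_x ** 2 / n
    b = sp_xy / ss_x
    a = sum_y / n - b * sum_x / n
    return b, a

# Data lying exactly on y = 2x + 1 recovers b = 2, a = 1
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # -> (2.0, 1.0)
```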

27
Now we are in a position to
  • Review this in relation to r2 (calculated before)
    to see how much variation the equation can
    account for
  • Test the null hypothesis that the slope is 0,
    i.e. ask whether it differs significantly from 0
    (by F or t test)
  • Predict values of y from x (with confidence
    intervals)
  • Analyse residuals to check fit of model (scatter
    should show no particular shape)
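
The residual check in the last point can be sketched as follows (made-up data): for a least-squares fit the residuals always sum to zero, and any shape or pattern left in them signals a poor model.

```python
def residuals(x, y, b, a):
    """Deviations of each observed y from the fitted line y = b*x + a."""
    return [yi - (b * xi + a) for xi, yi in zip(x, y)]

x = [0, 1, 2, 3]
y = [1.0, 3.1, 4.9, 7.2]         # made-up data, roughly y = 2x + 1
n = len(x)
sp_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(xi * xi for xi in x) - sum(x) ** 2 / n
b = sp_xy / ss_x
a = sum(y) / n - b * sum(x) / n
res = residuals(x, y, b, a)

print(round(abs(sum(res)), 9))   # -> 0.0 (residuals sum to zero)
```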

28
(No Transcript)
29
Variations on regression
  • There are other methods for defining the line of
    best fit, within the overall simple linear
    regression method
  • There are also more complex regression methods
    which go beyond this one in a) fitting curves
    rather than straight lines and b) having a number
    of independent variables, not just one (i.e.
    multiple regression)