Title: ARCH 2126/6126
1. ARCH 2126/6126
- Session 11: Relating variables to each other
2. Bivariate statistics
- Considering 2 numerical variables observed on the same set of cases simultaneously
- For example:
- length and breadth of a set of scrapers
- height and weight of a set of people
- spleen size and malaria positivity
- mother's education level and use of traditional medicines
3. Or sometimes
- Some cases may have missing values for one variable, which you want to estimate from the other, e.g.:
- You want to estimate stature from femur length
4. Association of metric variables can be shown visually
- Here is a simple hypothetical example of a positive association between variables
- They increase together
- Labels, units, caption, number, to be added
5. And similarly
- Here is a simple hypothetical example of a negative association between variables
- As one increases, the other decreases
6. And again
- Here is a simple hypothetical example of no (significant) association between variables
- There are other possibilities, but let's consider these
7. The approach is similar in many ways to univariate
- In the population out there, there is a measurable association (or lack of it) between these two variables: the population parameter
- The association that we measure in the sample we have collected is an estimate of it: the sample statistic
- The H0 is normally no association
8. Last time: correlation as a measure of association
- Correlation (positive or negative) is a specific term with a specific meaning
- The usual measure is the Pearson or product-moment correlation coefficient
- The usual symbols are ρ (rho) for the population parameter and r for the sample statistic
- This is a quantity we can calculate
9. Summary on calculation of r
- r = SPxy / √(SSx · SSy), where:
- SPxy = Σxy − ((Σx)(Σy)/n)
- SSx = Σ(x²) − ((Σx)²/n)
- SSy = Σ(y²) − ((Σy)²/n)
- n = number of cases
- d.f. = n − 2
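A minimal sketch of this calculation in Python (the slides give no code, and the data values below are invented purely for illustration):

```python
# Sketch: Pearson's r from the component sums defined above.
from math import sqrt

x = [2.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical x values
y = [1.5, 3.0, 4.5, 6.0, 8.5]   # hypothetical y values
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

SPxy = sum_xy - (sum_x * sum_y) / n   # sum of products
SSx = sum_x2 - sum_x ** 2 / n         # sum of squares of x
SSy = sum_y2 - sum_y ** 2 / n         # sum of squares of y

r = SPxy / sqrt(SSx * SSy)
print(f"r = {r:.4f}  (d.f. = {n - 2})")
```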
10. Some points about r
- Ranges from −1 through 0 to +1
- No units
- The extremes indicate perfect straight-line variation, upwards or downwards, unlikely to be seen in real data
- Intermediate values indicate how tightly the points on a scatterplot cluster around a straight line
11. Some further points
- r² tells us the proportion of the variance in one variable that can be explained by straight-line dependence on the other
- Notice the stress on straight-line association: r does not automatically measure other forms of association (curvilinear, U- or J-shaped, bimodal)
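For instance, r = 0.8 gives r² = 0.64, so about 64% of the variance in one variable is accounted for by straight-line dependence on the other; a weaker r = 0.3 accounts for only 9%.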
12. Let's re-consider the same situation
- We have, for a sample of n cases, the values for each case of the 2 variables:
- x1, x2, x3, …, xn and y1, y2, y3, …, yn
13. But now let's ask a different question
- Not: how strong is the association between co-varying variables?
- But: how can we state the mathematical relationship between 2 variables? How would you predict one, knowing the other?
14. The appropriate approach for that question is regression
- Also deals with 2 variables in 1 sample
- But the two are not treated in the same way
- Mathematically related to correlation
- Statpacks will often give you both measures together for a bivariate data set
- Comes in a variety of versions, suited to different situations
15. The relationship between variable x and variable y
- x, by convention plotted on the horizontal axis, is the independent, predictor, or controlled variable
- y, on the vertical axis, is the dependent or response variable
- This does not always literally mean a hypothesis that x causes y
- Linear regression finds the straight line that best fits this relationship
16. Works in, but is not restricted to, experiments
- There is an interest in explaining the dependent (y) variable in terms of the independent (x) one
- Thus ideally x can be controlled, or at least measured without error, by the researcher
- The distinction depends on the purpose of the research, not the nature of the variable
17. Simple examples
- y = x
- y = x + 2
- y = x − 5
- y = x + 10
- y = 2x
- y = ½x
- y = ½x + 10
- y = 1.5x + 3.5
- All these equations symbolize straight lines that could be graphed, by hand or by computer
- In each case, they state a relationship whereby, if you know x, you can work out y
18. [Graph of the example lines: y = x − 5, y = x, y = x + 2, y = 2x, y = x/2, y = 1.5x + 3.5]
19. [Graph of further example lines, including y = −0.8x, y = 3x/4 − 4, and y = −2 − x/2]
20. The general form of the equation is y = bx + a, where
- b is the coefficient of x and governs the slope of the linear relationship
- If b is positive, the line rises from left to right; if negative, it falls
- a is the value of y when x = 0, i.e. where the line crosses the y axis, and is known as the y intercept
- If a is positive, the line crosses the y axis above the origin; if negative, below it
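A minimal sketch (the slope and intercept values are arbitrary, chosen for illustration) of how the equation turns any x into a y:

```python
# Sketch of the general straight-line equation y = bx + a.
def line(x, b=1.5, a=3.5):
    """Return y for a given x on the line y = bx + a."""
    return b * x + a

for x in (0, 2, 10):
    print(f"x = {x:>2} -> y = {line(x)}")
# At x = 0 the output equals a (the y intercept); a positive b
# means y grows with x, so the line rises from left to right.
```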
21. This is known as the regression line of y on x
- When the points on a scatterplot are a cloud rather than a row, how do we find the line of best fit through it?
- By statistics rather than by eye
- We take the values of x as given and try to minimize the overall deviations (or residuals) of y from the line, so that the sum of squares of all the residuals is least
22. [Image slide, no text transcribed]
23. So this is sometimes called a least-squares regression
- And again there is a not-too-complex formula for it, based on the observations and the means for each variable
- We need eight quantities: the means of x and y, n, Σx and Σx², Σy and Σy², and Σxy
24. Let's do it: all components are already familiar
- Enter data as two columns
- Sum each column to get Σx and Σy
- Also square each value and sum these, to get Σx² and Σy²
- Also, for each case, multiply the two values and sum the products, to get Σxy
25. Calculate sums of squares as before
- SSx = Σ(x²) − ((Σx)²/n)
- SSy = Σ(y²) − ((Σy)²/n)
- and the sum of products as before
- SPxy = Σxy − ((Σx)(Σy)/n)
26. Now we are ready for the new and final steps
- b = SPxy / SSx
- a = ȳ − (b · x̄)
- As a reminder, the basic regression equation is y = bx + a
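Continuing the earlier sketch (same invented data), the final steps in Python:

```python
# Sketch: least-squares slope b and intercept a from the formulas above.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.5, 3.0, 4.5, 6.0, 8.5]
n = len(x)

SPxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
SSx = sum(xi * xi for xi in x) - sum(x) ** 2 / n

b = SPxy / SSx                       # slope
a = sum(y) / n - b * (sum(x) / n)    # intercept: a = y-bar - b * x-bar

print(f"regression line: y = {b:.3f}x + {a:.3f}")
```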
27. Now we are in a position to
- Review this in relation to r² (calculated before) to see how much variation the equation can account for
- Test the null hypothesis that the slope is not significantly different from 0 (by an F or t test)
- Predict values of y from x (with confidence intervals)
- Analyse residuals to check the fit of the model (the scatter should show no particular shape)
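One way to carry out several of these steps (a sketch; the slides do not prescribe a particular package) is SciPy's linregress, using the same invented data as before:

```python
# Sketch: slope test, prediction and residuals via scipy.stats.linregress.
from scipy import stats

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.5, 3.0, 4.5, 6.0, 8.5]

res = stats.linregress(x, y)
print(f"b = {res.slope:.3f}, a = {res.intercept:.3f}")
print(f"r^2 = {res.rvalue ** 2:.3f}")   # variation accounted for
print(f"p = {res.pvalue:.4f}")          # t test of H0: slope = 0

# Predict y for each x, then inspect residuals for any pattern
y_hat = [res.slope * xi + res.intercept for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
print("residuals:", [round(e, 3) for e in residuals])
```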
28. [Image slide, no text transcribed]
29. Variations on regression
- There are other methods for defining the line of best fit, within the overall simple linear regression approach
- There are also more complex regression methods which go beyond this one in a) fitting curves rather than straight lines and b) having a number of independent variables, not just one, i.e. multiple regression
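As a brief sketch of point a), one common tool (not named in the slides, and the data here are invented) is numpy.polyfit, which fits a curve by the same least-squares idea:

```python
# Sketch: least-squares fit of a quadratic curve with numpy.polyfit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.8, 10.1, 17.2, 26.9])  # values that curve upwards

coeffs = np.polyfit(x, y, deg=2)   # fit y = c2*x^2 + c1*x + c0
print("quadratic coefficients:", np.round(coeffs, 3))

# deg=1 would recover the simple straight-line regression; several
# independent variables (multiple regression) need e.g. np.linalg.lstsq.
```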