Title: Introduction to Data Analysis.
1. Introduction to Data Analysis
- Multivariate Linear Regression
2. Last week's lecture
- A simple model of how one interval level variable affects another interval level variable.
- A predictive and causal model.
- We have an independent variable (X) that predicts a dependent variable (Y).
- For any value of X we can predict a value of Y.
- A statistical model.
- We can assess how likely it is that there is a real relationship between X and Y in the population, given the relationship in the sample.
- We have a p-value that tells us how likely we would be to see a relationship this strong in the sample if there were no relationship in the population (the null hypothesis).
3. This week's lecture
- There are some problems with this though, so this week we extend the idea of simple linear regression in a number of ways.
- Using more than one independent variable.
- Using categorical independent variables.
- Accounting for interactions between independent variables.
- Assessing whether some models are better than other models.
- Reading.
- Agresti and Finlay, chapters 10-11.
4. Causation (1)
- Before we deal with the first of these problems, I want to talk a bit more about causation.
- Normally in social science we want to be able to say X causes Y.
- Whatever relationships we're interested in, the issue of causality is almost always important.
- We can almost never prove causality however, merely offer strong evidence for it.
5. Causation (2)
- There are really three conditions that we need.
- Association.
- i.e. a statistically significant relationship between the two variables we're interested in.
- Time ordering.
- i.e. cause comes before effect. This can be tricky sometimes for social science if we're not using experiments or fixed variables like race.
- No alternative explanations.
- Is this possible?
6. Causation (3)
- People in the Hebrides were convinced that body lice caused good health. Healthy people always had lots of lice, and sick people had few.
- Should we be discouraging baths and encouraging lice?
- Probably not. If you live(d) in the Hebrides, you're likely to have lice. The only people that don't are ill or dead. Lice can't live on a dead person, and they don't like the heat when someone is ill and feverish.
- Association does not imply causation.
7. The ideal Daily Mail headline
Do booze fuelled yobs increase your mortgage?
8. Alternative explanations (1)
- The relationship could be spurious.
- An increase in the amount of ice cream consumed leads to greater numbers of spouse abuse complaints. Should we ban ice cream?
- Of course not. There is no causal relationship, because both are caused by another variable (hot weather in this case).
- The relationship could work through another variable.
- Being married is associated with greater happiness.
- There is an intervening variable of having someone else to help pay the mortgage however.
- The relationship could be conditional on another variable.
- As the price of Lego goes down, the amount of Lego each person has goes up.
- This is conditional on age though. If you're 60, your amount of Lego will not increase, but if you're 6 it will.
9. Alternative explanations (2)
- The relationship could be spurious.
- The relationship could work through another variable.
- The relationship could be conditional on another variable.
10. Experiments and causality
- We could virtually eliminate these problems if we used experiments.
- Experiments mean that we can change the variable we are interested in and see how people respond.
- Becoming more popular in social science.
- Unfortunately we are normally reliant on observational data.
- Therefore we want to try and control for alternative explanations.
- The best way of doing this is to use multiple regression.
11. Multiple regression
- Multiple regression allows us to include numerous independent variables.
- This means that we can include those variables that we think might be producing spurious relationships.
- e.g. our dependent variable would be number of spouses beaten in a month, and our TWO independent variables could be a) amount of ice cream consumed and b) temperature.
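As a rough illustration of what a statistics package does here (the lecture uses STATA later), a minimal sketch in Python of fitting a regression with two independent variables; the variable names and numbers are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: ice cream consumed, temperature, and complaints.
ice_cream = np.array([2.0, 3.5, 1.0, 4.0, 5.5, 2.5, 6.0, 3.0])
temperature = np.array([18.0, 24.0, 12.0, 26.0, 30.0, 20.0, 32.0, 22.0])
complaints = np.array([3.0, 5.0, 1.0, 6.0, 8.0, 4.0, 9.0, 5.0])

# Design matrix: a column of ones for the intercept, plus one column
# per independent variable.
X = np.column_stack([np.ones_like(ice_cream), ice_cream, temperature])

# Ordinary least squares: minimise the sum of squared deviations between
# observed and predicted values of the dependent variable.
coefs, *_ = np.linalg.lstsq(X, complaints, rcond=None)
intercept, b_ice_cream, b_temperature = coefs
print(intercept, b_ice_cream, b_temperature)
```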
12. Example for the day
- Some actual social science data.
- We are interested in attitudes to abortion, and what predicts them.
- We have a hypothesis that older people are less pro-choice than younger people. This is due to younger people being raised in a more socially liberal environment than their elders.
- Our sample comprises 100 British people.
13. Measuring attitudes
- We measure abortion attitudes using a 10 point scale (this kind of measure is quite common).
- "Please tell me whether you think abortion can always be justified, never be justified or something in between, using this card." The respondent is given a 1-10 response card, where 1 is always justified and 10 is never justified.
- NB this is not strictly interval level data, as we cannot be sure that the distance between 1 and 2 is the same as the difference between 6 and 7.
- These kinds of scales are often treated as interval level in social science however.
14. A scatter-plot
[Scatter-plot of abortion attitude against age, with the fitted linear regression line.]
15. Simple linear regression
- The equation for our linear regression is
- y = 0.46 + 0.10X + e
- Where y is attitude to abortion, X is age, and e is the error term.
Variable Coefficient value Standard error p-value
Age 0.10 0.01 0.00
Intercept 0.46 0.45 0.31
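To make the equation concrete, a quick worked prediction (the age is just an illustrative value): for a 40 year old, the predicted score is 0.46 + 0.10 × 40 = 4.46 on the 1-10 scale.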
16. Analysis
- So there seems to be a statistically and substantively significant relationship between attitudes to abortion and age.
- If James is 10 years older than Tessa, then we predict that he will be more pro-life than her, and will score around 1 point higher on our 1-10 scale.
- Is this a completely accurate way of portraying the relationship though?
17. What about religiosity?
- We might think that irreligious people are more pro-choice than religious people.
- We might also think that religiosity (measured by an interval level measure of church attendance per month) is higher for older people.
- Given this, our relationship between age and attitudes to abortion may be non-existent (or at least weaker than we thought).
18. Some data
- People that go to church 4 times a month or more (let's call these religious people):
- Have a mean score of 6.95 on our abortion scale.
- Have a mean age of 58.
- People that go to church under once a month (let's call these irreligious people):
- Have a mean score of 2.48 on our abortion scale.
- Have a mean age of 26.
- So perhaps the relationship between age and attitudes to abortion is accounted for by this?
19. Another scatter-plot
[Scatter-plot with the linear regression line; labelled groups: religious people (who are old and pro-life) and irreligious people (who are young and pro-choice).]
20. What does this mean?
- We need to include religiosity (number of times attending church per month) as an independent variable in our regression, as well as age.
- We can easily generalise our regression equation in order to do this (the general form is written out below).
- Each ß is a coefficient for a particular independent variable.
- Our ß1 would be the coefficient for age (called X1) and our ß2 would be the coefficient for religion (called X2).
- Similarly to simple linear regression, we are trying to minimise the squared deviations from our predictions.
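Written out, the generalised equation with two independent variables is
y = a + ß1X1 + ß2X2 + e
where a is the intercept and e is the error term; with more predictors we simply keep adding ßX terms.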
21. What do we get?
- We let STATA do the hard work for us, and estimate the values for the three coefficients (the intercept, age and religiosity).
Variable Coefficient value Standard error p-value
Age (b1) 0.03 0.01 0.06
Religiosity (b2) 0.84 0.12 0.00
Intercept (a) 2.07 0.43 0.00
22. Thinking about extra predictors (1)
- So we can make a prediction for any individual with a certain age and religiosity.
- So for a 40 year old that attends church once a month (see the worked prediction below).
- The coefficients for age and religiosity should be interpreted carefully.
- The 0.84 for religiosity means that our model predicts that as people go to church an extra time per month, their abortion attitude score goes up by 0.84 points if age is constant.
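Using the coefficients from the previous slide, the prediction for that 40 year old who attends church once a month is roughly 2.07 + 0.03 × 40 + 0.84 × 1 ≈ 4.1 on the 1-10 scale (the exact figure depends on the unrounded coefficients).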
23. Thinking about extra predictors (2)
- Thus, the best way of thinking about regression with more than one independent variable is to imagine a separate regression line for age at each value of religiosity, and vice versa.
- The effect of age is the slope of these parallel lines, controlling for the effect of religiosity.
24. Graphing predictors
[Plot of parallel regression lines, one for each value of religiosity: X2 = 1, X2 = 2, X2 = 3 and X2 = 4.]
25. Multiple regression summary
- Our example only has two predictors, but we can have any number of independent variables.
- Thus, multiple regression is a really useful extension of simple linear regression.
- Multiple regression is a way of reducing spurious relationships between variables by including the real cause.
- Multiple regression is also a way of testing whether a relationship is actually working through another variable (as it appears to be in our example).
26. Comparing groups (1)
- The independent variables we've been using are all interval level (age, number of times attended church etc.).
- A lot of social science variables that we are interested in are actually categorical though; how do we include these?
- We create dummy variables (i.e. 0/1 variables which can be included in the regression).
27. Comparing groups (2)
- We might also be interested in whether men and women have different attitudes to abortion.
- We would create a dummy variable (called here Xsex), so let's say that men are coded as 0 and women coded as 1.
- If we include this dummy variable in the regression equation then the coefficient will represent the difference between men and women.
- This means we'll be looking at the effect of being a woman compared to being a man.
28. Comparing groups (3)
- The coefficient for the sex dummy variable is 1.16.
- We know that it only has two values, 0 or 1. If the person is a man it will be 0, and if they're a woman it will be 1.
- We add 1.16 to our predicted value of Y when the person is a woman (as 1.16 × Xsex is 1.16 × 1).
- We add zero to our predicted value of Y when the person is a man (as 1.16 × Xsex is 1.16 × 0).
- bsex (i.e. 1.16) is the difference between men and women.
29. What about many groups?
- Let's take a new example. We're interested in the number of deep-fried Mars bars consumed by people in different parts of Britain.
- Our dependent variable is DFMB consumed, and our independent variable is region (measured as England, Wales and Scotland).
- We can use dummy variables again. We define:
- A Scottish dummy variable (Xscot): if you're Scottish you are coded 1, everyone else is 0.
- A Welsh dummy variable (Xwales): if you're Welsh you are coded 1, everyone else is 0.
- We don't define a dummy variable for England, as England is the reference category.
30. Many groups (1)
- For an Englishman, Xscot = 0 and Xwales = 0, so
- Y = a, and the prediction for England is a.
- For a Scotsman, Xscot = 1 and Xwales = 0, so
- Y = a + bscot, and the prediction for Scotland is a + bscot.
- For a Welshman, Xscot = 0 and Xwales = 1, so
- Y = a + bwales, and the prediction for Wales is a + bwales.
- bscot is the difference between Scotland and England.
- bwales is the difference between Wales and England.
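A minimal Python sketch of how dummy variables for many groups might be built and used (the lecture uses STATA; the data here are invented purely for illustration):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: region and deep-fried Mars bars consumed per year.
df = pd.DataFrame({
    "region": ["England", "Scotland", "Wales", "Scotland", "England", "Wales"],
    "dfmb":   [1, 12, 4, 9, 2, 5],
})

# Create 0/1 dummy variables; dropping "England" makes it the reference category.
dummies = pd.get_dummies(df["region"]).drop(columns="England").astype(float)

# Fit the regression: the coefficient on each dummy is that region's
# difference from England, and the intercept is the prediction for England.
X = sm.add_constant(dummies)
model = sm.OLS(df["dfmb"], X).fit()
print(model.params)
```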
31. Many groups (2)
- It doesn't matter which groups you choose to make dummy variables out of, but
- You must leave one category out.
- This is normally known as the reference category and is what we compare (or reference) the other categories to.
- In our example, we were comparing Wales and Scotland to England. We could have set Wales or Scotland as our reference category though.
- We test these variables for statistical significance in the same way as for interval level variables, by seeing how many SEs the coefficient is from zero and calculating the p-value.
32. Exercise
- According to our model predicting attitudes to abortion, would a 60 year old woman that never goes to church be more pro-choice or pro-life than a 20 year old man that goes to church 5 times a month?
33. Exercise answer
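A rough reconstruction of the answer, assuming the age and religiosity coefficients from slide 21 (0.03 and 0.84, intercept 2.07) and the sex coefficient from slide 28 (1.16) all apply together:
- 60 year old woman who never goes to church: predicted score ≈ 2.07 + 0.03 × 60 + 0.84 × 0 + 1.16 × 1 ≈ 5.0.
- 20 year old man who goes to church 5 times a month: predicted score ≈ 2.07 + 0.03 × 20 + 0.84 × 5 + 1.16 × 0 ≈ 6.9.
- On these figures the woman is predicted to be more pro-choice (a lower score) than the man: his religiosity more than outweighs his youth.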
34. Interactions
- There was a third kind of alternative explanation that we haven't looked at yet.
- The relationship could be conditional on another variable (e.g. Lego prices, Lego ownership and age).
- Or, more generally, the relationship between X and Y is dependent on the value of Z.
35. Another example of the day
- We might think that the longer you are married the more you nag your spouse.
- Our dependent variable is the amount of nagging that an individual does, in minutes per day.
- Our independent variable is years of marriage.
- The population of interest is all married people.
- We have a sample of 50 married people.
- First step: let's look at the data.
36. And another scatter-plot
[Scatter-plot of nagging against length of marriage, with the fitted linear regression line.]
37. Simple linear regression
- The equation for our linear regression is
- y = 14.43 + 1.26X + e
- Where y is nagging, X is length of marriage, and e is the error term.
Variable Coefficient value Standard error p-value
Marriage length 1.26 0.32 0.000
Intercept 14.43 4.67 0.003
38. Men and women (1)
- We might think that women tend to nag more than men, and hence for every length of marriage women nag more than men.
- We use multiple regression to test this, and include a dummy variable for sex (man = 0, woman = 1). A +ve coefficient means that women nag more than men, a -ve coefficient means men nag more than women.
Variable Coefficient value Standard error p-value
Marriage length 1.31 0.32 0.00
Female -5.27 4.78 0.276
Intercept 16.41 5.00 0.002
39. And yet another scatter-plot
[Scatter-plot of nagging against length of marriage, with separate regression lines for men and for women.]
40. Men and women (2)
- There does not appear to be a statistically significant difference between men and women.
- Perhaps the difference between men and women in how much they nag differs by length of marriage though?
- This is what we call an interaction effect: for different levels of a variable Z, the effect of X on Y is different.
- Let's examine the data again.
41. Men and women (3)
42. Interaction terms (1)
- It seems we need to include an interaction term.
- We include another variable which is the product of the two other variables (i.e. them multiplied together).
- This variable has a coefficient estimated for it, and this tells us the magnitude of the interaction effect.
- In our case the regression equation is as below.
43. Interaction terms (2)
ŷ = a + b1·Xmarriage + b2·Xsex + b3·(Xmarriage × Xsex)
- ŷ is the predicted amount of nagging.
- a is the mean level of nagging when all Xs are zero.
- b1 is the effect of length of marriage, i.e. the effect of length of marriage for men (since Xsex is 0 for men).
- b2 is the effect of being female.
- b3 is the extra effect of length of marriage if female.
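A minimal Python sketch of building the interaction term as a product and including it in the regression (the data are invented and will not reproduce the lecture's coefficients):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: years married, sex dummy (man = 0, woman = 1),
# and minutes of nagging per day.
df = pd.DataFrame({
    "marriage": [2, 5, 10, 20, 3, 8, 15, 25],
    "female":   [0, 1, 0, 1, 1, 0, 1, 0],
    "nagging":  [30, 25, 32, 75, 22, 31, 60, 29],
})

# The interaction term is simply the product of the two variables.
df["female_x_marriage"] = df["female"] * df["marriage"]

X = sm.add_constant(df[["marriage", "female", "female_x_marriage"]])
results = sm.OLS(df["nagging"], X).fit()
# intercept, marriage slope for men, female shift, extra slope for women
print(results.params)
```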
44. Interaction terms (3)
- For our example, there is a statistically significant interaction effect (i.e. the slopes for men and women are different).
Variable Coefficient value Standard error p-value
Marriage length -0.15 0.41 0.728
Female -36.73 7.84 0.000
Female × marriage length 2.54 0.55 0.000
Intercept 33.06 5.49 0.000
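Reading the table: the slope for men (Xsex = 0) is just -0.15 minutes per year of marriage, while the slope for women is -0.15 + 2.54 = 2.39 minutes per year, with an intercept of 33.06 - 36.73 = -3.67. For example, at 20 years of marriage (an illustrative value) the model predicts roughly 33.06 - 0.15 × 20 ≈ 30 minutes for a man and -3.67 + 2.39 × 20 ≈ 44 minutes for a woman.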
45. Interaction terms (4)
[Plot of the two fitted regression lines: rising steeply for women, roughly flat for men.]
46. Final word on interactions
- More generally we can interact variables of all sorts.
- With our dummy variable × length of marriage interaction, we generate a separate slope for men and women.
- If we were interacting two interval level variables, say age and religiosity, then it is best to think of generating a particular slope for the relationship between age and the dependent variable for each different value of religiosity.
- e.g. we want to say something like "at high levels of religiosity age has a large effect, but at low levels of religiosity age has a small effect".
47. Model fit
- Sometimes we want to know more general properties of the model we have fitted.
- We often want to know how well our model generally fits the data we have.
- We also often want to know whether including an extra variable (or interaction term) makes a big improvement to the model or not.
- We normally use a measure called R2 to measure how well a model fits the data.
48. What is R2?
- R2 measures the proportion of all of the variation in Y (i.e. the sample values) that is explained by all the independent variables that we have.
- Our model is trying to predict where the Y values are, so we want to know how close we are.
- The total sum of squares is the sum of all the squared deviations of each Y from the mean of Y.
- The sum of squared errors is the sum of the squared deviations of each Y from our model prediction of what Y is (i.e. Ŷ).
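Putting these two quantities together gives the usual formula, where TSS is the total sum of squares and SSE is the sum of squared errors:
R2 = (TSS - SSE) / TSS = 1 - SSE / TSS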
49. Properties of R2
- We can work out the properties from the equation.
- It varies between 0 and 1, and the closer it is to 1 the better the independent variables predict Y.
- If our regression perfectly predicts all the data points, then R2 = 1 (if this happens there's probably something wrong).
- Each independent variable we add to a model will either increase R2 or leave it as it was.
- We normally use a statistic called adjusted R2; the principle underlying it is very similar.
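For reference, the usual adjustment penalises the number of predictors: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables.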
50. Quick example
- We could calculate the adjusted R2 for the models of nagging we had earlier.
- Here we can see that including sex does not really improve the model fit, but the addition of the interaction term does.
Model Adjusted R2
Marriage length .226
Marriage length + sex .229
Marriage length + sex + marriage × sex .470
51. When's an increase a real increase?
- We can test whether increases are statistically significant using something called an F-test.
- This is based on a distribution called the F-distribution.
- This test tells us whether we can reject the null hypothesis that the increase in model fit is zero.
- In our example, we cannot reject the H0 that the addition of sex to the model does not increase model fit.
- We can reject the H0 that the addition of sex × marriage length to the model does not increase model fit.
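A minimal Python sketch of the nested-model F-test, computed by hand from the two models' sums of squared errors; statistics packages report this test directly, and the numbers below are illustrative only, not the lecture's actual sums of squares:

```python
from scipy import stats

def nested_f_test(sse_restricted, sse_full, n_obs, k_full, k_added):
    """F-test for whether adding k_added predictors to a model
    significantly improves fit (i.e. reduces the sum of squared errors).

    sse_restricted: SSE of the smaller (restricted) model
    sse_full:       SSE of the larger (full) model
    n_obs:          number of observations
    k_full:         number of predictors in the full model (excluding intercept)
    k_added:        number of predictors added to get from restricted to full
    """
    df_resid = n_obs - k_full - 1
    f_stat = ((sse_restricted - sse_full) / k_added) / (sse_full / df_resid)
    p_value = stats.f.sf(f_stat, k_added, df_resid)
    return f_stat, p_value

# Illustrative numbers: does adding the sex dummy to the
# marriage-length model improve fit?
print(nested_f_test(sse_restricted=4200.0, sse_full=4150.0,
                    n_obs=50, k_full=2, k_added=1))
```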
52. Over-interpreting R2
- R2 can be a useful measure of model performance, but it is not what we are often interested in.
- Many social science models have low R2 values, but this doesn't mean that they are useless. Rather, it just means that there is a lot of variation not explained by our independent variables.
- We still might be interested in whether there is a relationship between X and Y though.
- High R2 values don't automatically make your model a good model.
- I could predict attitudes to having a European army using attitudes to the Euro. The R2 would be high, but it is unclear what the model is showing.
53. Problems with all this
- We've managed to get beyond several problems with simple linear regression, but...
- How do we know when the assumptions (for example linearity) that underlie regression models are met?
- We use plots of the residuals (the differences between the actual observations and our predictions) to try and work out when different assumptions are not met.
- More generally, how do we go about specifying models?
- All to be dealt with next week.