Title: Lecture 8 Relationships between Scale variables: Regression Analysis
1 Lecture 8: Relationships between Scale variables
Regression Analysis
- Graduate School
- Quantitative Research Methods
- Gwilym Pryce
- g.pryce_at_socsci.gla.ac.uk
2 Notices
3 Plan
- 1. Linear & Non-linear Relationships
- 2. Fitting a line using OLS
- 3. Inference in Regression
- 4. Omitted Variables & R2
- 5. Types of Regression Analysis
- 6. Properties of OLS Estimates
- 7. Assumptions of OLS
- 8. Doing Regression in SPSS
4 1. Linear & Non-linear relationships between
variables
- Often of greatest interest in social science is
investigation into relationships between
variables:
- is social class related to political perspective?
- is income related to education?
- is worker alienation related to job monotony?
- We are also interested in the direction of
causation, but this is more difficult to prove
empirically
- our empirical models are usually structured
assuming a particular theory of causation
5 Relationships between scale variables
- The most straightforward way to investigate
evidence for a relationship is to look at scatter
plots
- it is traditional to:
- put the dependent variable (i.e. the effect) on
the vertical axis, or y axis
- put the explanatory variable (i.e. the cause)
on the horizontal axis, or x axis
6 Scatter plot of IQ and Income
7 We would like to find the line of best fit
8 What does the output mean?
9 Sometimes the relationship appears non-linear
10 ... and so a straight line of best fit is not
always very satisfactory
11 Could try a quadratic line of best fit
12 But we can simulate a non-linear relationship by
first transforming one of the variables
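A minimal sketch of the transformation idea, assuming the curve flattens as x grows: take the log of x first, then fit an ordinary straight line to (ln x, y). The data are synthetic and the helper name `fit_line` is hypothetical; neither comes from the lecture's dataset.

```python
import math

def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Synthetic data: y rises with x at a diminishing rate (y = 2 + 3*ln x).
xs = [1, 2, 4, 8, 16, 32]
ys = [2 + 3 * math.log(x) for x in xs]

# Transform x first, then fit a straight line to the transformed variable.
log_xs = [math.log(x) for x in xs]
a, b = fit_line(log_xs, ys)
print(round(a, 3), round(b, 3))
```

The relationship between y and x is curved, but the relationship between y and ln x is a straight line, so a linear fit is enough after the transformation.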
14 ... or a cubic line of best fit (overfitted?)
15 Or could try two linear lines: structural break
16 2. Fitting a line using OLS
- The most popular algorithm for drawing the line
of best fit is one that minimises the sum of
squared deviations from the line to each
observation:
$$\min \sum_{i} (y_i - \hat{y}_i)^2$$
where $y_i$ = observed value of y, and $\hat{y}_i$ = predicted
value of $y_i$, i.e. the value on the line of
best fit corresponding to $x_i$
17 Regression estimates of a, b: or Ordinary Least
Squares (OLS)
- This criterion yields estimates of the slope b
and y-intercept a of the straight line
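For a single explanatory variable, the least-squares criterion has the standard closed-form solution b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and a = ȳ - b·x̄. The sketch below applies it to a small made-up sample and checks that nudging the line away from (a, b) only increases the sum of squared deviations. The function names are hypothetical.

```python
def ols_line(xs, ys):
    """Closed-form OLS estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical distances from each observation to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative sample.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_line(xs, ys)
best = sum_squared_deviations(xs, ys, a, b)

# Any other line does worse on the OLS criterion.
worse_intercept = sum_squared_deviations(xs, ys, a + 0.1, b)
worse_slope = sum_squared_deviations(xs, ys, a, b + 0.1)
print(a, b, best < worse_intercept, best < worse_slope)
```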
18 3. Inference in Regression: Hypothesis tests on
the slope coefficient
- Regressions are usually run on samples, so what
can we say about the population relationship
between x and y?
- Repeated samples would yield a range of values
for estimates of b ~ N(β, sb)
- i.e. b is normally distributed with mean β =
population mean value of b if regression run on
population
- If there is no relationship in the population
between x and y, then β = 0; this is our H0
19 What does the standard error mean?
20 Hypothesis test on b
- (1) H0: β = 0
- (i.e. slope coefficient, if regression run on
population, would = 0)
- H1: β ≠ 0
- (2) α = 0.05 or 0.01 etc.
- (3) Reject H0 iff P < α
- (N.B. Rule of thumb: P < 0.05 if tc ≥ 2, and P <
0.01 if tc ≥ 2.6)
- (4) Calculate P and conclude.
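The test statistic behind these steps can be sketched as follows, assuming the usual formula for the standard error of the slope in a one-variable regression, sb = sqrt[(SSE / (n - 2)) / Σ(xi - x̄)²], with tc = b / sb. The data are made up and the function name is hypothetical.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (b, s_b, t_c) for the regression of y on x with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_b = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return b, s_b, b / s_b

# Made-up illustrative sample.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 2.3, 2.9, 4.1, 5.2, 5.8, 7.1, 8.0]

b, s_b, t_c = slope_t_statistic(xs, ys)

# Rule of thumb from the slide; it is only rough in a sample this small.
reject_at_5_percent = abs(t_c) >= 2
print(round(b, 3), round(s_b, 3), reject_at_5_percent)
```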
21 Example using SPSS output
- (1) H0: no relationship between house price and
floor area.
- H1: there is a relationship
- (2), (3), (4)
- P = 1 - CDF.T(24.469, 554) = 0.000000
- Reject H0
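A rough cross-check of this P value outside SPSS, using only the Python standard library. With 554 degrees of freedom the t distribution is very close to the standard normal, so the sketch swaps in a normal upper-tail probability for 1 - CDF.T; that substitution is an approximation, not what SPSS computes. The significance level and the function name are assumptions for illustration.

```python
import math

def upper_tail_normal(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

t_c = 24.469   # t statistic from the SPSS output in the example
alpha = 0.05   # significance level (an assumed choice)

# Normal approximation to 1 - CDF.T(24.469, 554).
p = upper_tail_normal(t_c)
print(p < alpha)   # the decision rule: reject H0 iff P < alpha
```

The approximate P is far below any conventional significance level, which agrees with the slide's 0.000000 to six decimal places.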
22 4. Omitted Variables & R2: Q/ Is floor area the
only factor? How much of the variation in Price
does it explain?
23 R-square
- R-square tells you how much of the variation in y
is explained by the explanatory variable x
- 0 < R2 < 1 (NB you want R2 to be near
1).
- If more than one explanatory variable, use
Adjusted R2
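A small sketch of how these two quantities are usually computed, assuming the standard definitions R2 = 1 - SSE/SST and Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where k is the number of explanatory variables. The four observations are made up, and the fitted values come from the line ŷ = 2x, which is the OLS line for this particular sample.

```python
def r_squared(ys, fitted):
    """Share of the variation in y explained by the fitted values."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)                # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation
    return 1 - sse / sst

def adjusted_r_squared(r2, n, k):
    """Penalise R2 for the number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up sample; the OLS line for these points is y-hat = 2x.
xs = [1, 2, 3, 4]
ys = [2.5, 3.5, 5.5, 8.5]
fitted = [2 * x for x in xs]

r2 = r_squared(ys, fitted)
adj = adjusted_r_squared(r2, n=len(ys), k=1)
print(round(r2, 4), round(adj, 4))
```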
24 Example: 2 explanatory variables
25 Scatter plot (with floor spikes)
26 3D Surface Plots: Construction, Price &
Unemployment: $Q = -246 + 27P - 0.2P^2 - 73U + 3U^2$
27 Construction Equation in a Slump: $Q = 315 + 4P -
73U + 5U^2$
28 5. Types of regression analysis
- Univariate regression: one explanatory variable
- what we've looked at so far in the above
equations
- Multivariate regression: >1 explanatory variable
- more than one variable on the RHS
- Log-linear regression & log-log regression
- taking logs of variables can deal with certain
types of non-linearities & has useful properties
(e.g. elasticities)
- Categorical dependent variable regression
- dependent variable is dichotomous: observation
has an attribute or not
- e.g. MPPI take-up, unemployed or not etc.
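To illustrate the elasticity property of log-log regression: when both variables are logged, the slope estimates the percentage change in y associated with a 1% change in x. The sketch below uses synthetic constant-elasticity data (y = 3·x^0.5), so the slope should come out at 0.5; the helper name `fit_line` is hypothetical.

```python
import math

def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

# Synthetic data with a constant elasticity of 0.5.
xs = [1, 2, 5, 10, 20, 50]
ys = [3 * x ** 0.5 for x in xs]

# Log-log regression: regress ln(y) on ln(x).
log_a, elasticity = fit_line([math.log(x) for x in xs],
                             [math.log(y) for y in ys])
print(round(elasticity, 3), round(math.exp(log_a), 3))
```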
29 6. Properties of OLS estimators
- OLS estimates of the slope and intercept
parameters have been shown to be BLUE (provided
certain assumptions are met):
- Best
- Linear
- Unbiased
- Estimator
30
- Best in that they have the minimum variance
compared with other linear unbiased estimators
(i.e. given repeated samples, the OLS estimates
for α and β vary less between samples than any
other such estimates for α and β).
- Linear in that a straight line relationship is
assumed.
- Unbiased because, in repeated samples, the mean
of all the estimates achieved will tend towards
the population values for α and β.
- Estimates in that the true values of α and β
cannot be known, and so we are using statistical
techniques to arrive at the best possible
assessment of their values, given the information
available.
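The "unbiased" property can be illustrated with a small simulation: draw many samples from a population whose slope is known, estimate the slope in each, and look at the average of the estimates. Everything here is assumed for illustration (population values, error spread, number of repetitions, the function name).

```python
import random

def ols_slope(xs, ys):
    """OLS slope estimate for the regression of y on x with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx

random.seed(0)                      # fixed seed so the run is repeatable
alpha, beta = 1.0, 2.0              # assumed "population" intercept and slope
xs = list(range(1, 21))             # x values held fixed across samples

slopes = []
for _ in range(2000):               # repeated samples
    ys = [alpha + beta * x + random.gauss(0, 5) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_slope = sum(slopes) / len(slopes)
print(round(mean_slope, 2))         # close to the population slope
```

Individual estimates scatter around 2, but their mean settles very near it, which is what unbiasedness claims.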
31 7. Assumptions of OLS
- For estimation of a and b to be BLUE and for
regression inference to be correct:
- 1. Equation is correctly specified
- Linear in parameters (can still transform
variables)
- Contains all relevant variables
- Contains no irrelevant variables
- Contains no variables with measurement errors
- 2. Error Term has zero mean
- 3. Error Term has constant variance
32
- 4. Error Term is not autocorrelated
- i.e. not correlated with error term from previous
time periods
- 5. Explanatory variables are fixed
- observe normal distribution of y for repeated
fixed values of x
- 6. No linear relationship between RHS
variables
- i.e. no multicollinearity
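One simple screen for assumption 6, offered as a sketch rather than a full diagnostic: compute the correlation between two explanatory variables before putting both on the RHS. A correlation of exactly ±1 means one variable is a linear function of the other; values close to ±1 are a warning sign. The data and variable names below are hypothetical.

```python
import math

def pearson_r(xs, zs):
    """Pearson correlation coefficient between two variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    szz = sum((z - z_bar) ** 2 for z in zs)
    return sxz / math.sqrt(sxx * szz)

# Hypothetical explanatory variables for a house price regression.
floor_area = [50, 65, 80, 95, 120, 150]
bathrooms = [1, 1, 2, 2, 3, 3]

r = pearson_r(floor_area, bathrooms)
print(round(r, 2))   # near 1: the two variables carry overlapping information
```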
33 8. Doing Regression analysis in SPSS
- To run regression analysis in SPSS, click on
Analyse, Regression, Linear
34 Select your dependent (i.e. explained) variable
and independent (i.e. explanatory) variables
35 e.g. Floor area and bathrooms: Floor area = a + b ×
Number of bathrooms + e
37 Confidence Intervals for regression coefficients
- Population slope coefficient CI
- Rule of thumb: 95% CI ≈ b ± 2 × sb, where sb is
the standard error of b
38 e.g. regression of floor area on number of
bathrooms, CI on slope
- b = 64.6 ± 2 × 3.8
- = 64.6 ± 7.6
- 95% CI = (57, 72)
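The arithmetic in this example (slope estimate 64.6, standard error 3.8, multiplier 2) can be reproduced in a couple of lines. The function name is hypothetical, and the multiplier of 2 is the rule-of-thumb stand-in for the exact t critical value.

```python
def rule_of_thumb_ci(b, s_b, multiplier=2):
    """Approximate 95% confidence interval: estimate +/- multiplier * standard error."""
    half_width = multiplier * s_b
    return b - half_width, b + half_width

# Figures from the floor area on bathrooms example.
lower, upper = rule_of_thumb_ci(64.6, 3.8)
print(round(lower), round(upper))   # 57 72
```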
39 Confidence Intervals in SPSS: Analyse,
Regression, Linear, click on Statistics and
select Confidence intervals
40 Our rule of thumb said 95% CI for slope = (57,
72). How does this compare?
41 Past Paper (C2): Relationships (30)
- Suppose you have a theory that suggests that time
watching TV is determined by gregariousness
- the less gregarious, the more time spent watching
TV
- Use a random sample of 60 observations from the
TV watching data to run a statistical test for
this relationship that also controls for the
effects of age and gender.
- Carefully interpret the output from this model
and discuss the statistical robustness of the
results.
42 Reading
- Regression Analysis
- Field, A. chapters on regression.
- Moore and McCabe Chapters on regression.
- Kennedy, P. A Guide to Econometrics
- Bryman, Alan, and Cramer, Duncan (1999)
Quantitative Data Analysis with SPSS for
Windows: A Guide for Social Scientists, Chapters
9 and 10.
- Achen, Christopher H. Interpreting and Using
Regression (London: Sage, 1982).