Title: 161.130 Biometrics Week 3 Lecture slides
1 161.130 Biometrics Week 3 Lecture slides
- Exploring Bivariate Data
- Scatterplots
- Text section 4.1
- CAST section 4.1
- Least Squares, Nonlinear relationships
- Text section 4.3 and 4.4.2
- CAST sections 4.3 and 4.4
- Correlation
- Text section 4.2 and 4.4.1
- CAST section 4.2 and 4.5
- Multivariate Data
- CAST section 4.6
2
- Univariate Data
- Single measurement from each individual
- Cannot associate variation in that measurement
with other characteristics of the individuals
- Bivariate Data
- Two measurements from each individual
- May be able to associate variation in one
measurement with changes in the other measurement
- Examples are ...
- Blood pressure and weight of males in their 50s
- Carbohydrate content and moisture content of corn
- Our aim with such data is to find information
about the relationship between the variables.
3 Three Tools we will use
- Scatterplot, a two-dimensional graph of data
values
- Correlation, a statistic that measures the
strength and direction of a linear relationship
- Regression Equation, an equation that describes
the average relationship between a response and
an explanatory variable
4 Scatterplots
- The relationship between two variables cannot be
determined from examination of the two variables
in isolation.
5 Scatterplots
6 Scatterplots
7 Scatterplots
8 Scatterplots
9 Scatterplots
10 Looking for Patterns in Scatterplots
- What is the average pattern? Does it look like a
straight line or is it curved?
- What is the direction of the pattern?
- How much do individual points vary from the
average pattern?
- Are there any unusual data points?
11 Positive/Negative Association
- Two variables have a positive association when
the values of one variable tend to increase as
the values of the other variable increase.
- Two variables have a negative association when
the values of one variable tend to decrease as
the values of the other variable increase.
12 Example: Height and Handspan
- Data shown are the first 12 observations of a
data set that includes the heights (in inches)
and fully stretched handspans (in centimeters) of
167 college students.
13 Example: Height and Handspan
- Taller people tend to have greater handspan
measurements than shorter people do.
- When two variables tend to increase together, we
say that they have a positive association.
- The height and handspan measurements appear to
have a linear relationship.
14 Example: Driver Age and Maximum Reading
Distance of Highway Signs
- A research firm determined the maximum
distance at which each of 30 randomly selected
drivers could read a newly designed road sign.
- The 30 participants in the study ranged in
age from 18 to 82 years old.
- We want to examine the relationship between
age and the sign legibility (readability)
distance.
15 Example: Driver Age and Maximum Reading
Distance of Highway Signs
- We see a negative association with a linear
pattern.
- We will use a straight-line equation to model
this relationship.
16 Identifying Groups and Outliers
- Use different plotting symbols or colours to
represent different subgroups.
- Look for outliers: points that have an unusual
combination of data values.
17
- In both univariate and bivariate data sets,
outliers or clusters must be very distinct before
we should conclude that they are real, in the
absence of further external information
confirming that the individuals are distinct.
- Particularly in small data sets, outliers,
clusters and other patterns may arise by chance,
without being associated with any real features
in the individuals.
- Be careful not to over-interpret features in a
scatterplot unless they are well defined,
especially if the sample size is small.
18 Describing Linear Patterns using a Regression
Line
When the best equation for describing the
relationship between x and y is a straight line,
the equation is called the regression line.
- Two purposes of a regression line
- to estimate the average value of y at any
specified value of x
- to predict the value of y for an individual,
given that individual's x value
19 Example: Body Mass Index (BMI)
- Body Mass Index is calculated from height and
body mass (weight): BMI = weight (kg) / height (m)^2
20 Example: Height and Handspan
- Regression equation: Handspan = -3.0 + 0.35
Height
- Estimate the average handspan for people 60
inches tall (about 1.52 m):
Average handspan = -3.0 + 0.35(60) = 18 cm.
21 Example: Height and Handspan (cont)
- Regression equation: Handspan (cm) = -3 + 0.35
Height (inches)
Slope = 0.35: Handspan increases by 0.35 cm, on
average, for each increase of 1 inch in height.
Intercept = -3.0
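The arithmetic in this example can be checked with a minimal Python sketch. The function name `predict_handspan_cm` is an illustrative assumption, not part of the course materials; the intercept and slope are the ones quoted on the slide.

```python
def predict_handspan_cm(height_in):
    """Average handspan (cm) from the slide's line: -3.0 + 0.35 * height (inches)."""
    intercept = -3.0  # value of the line at height 0; not meaningful on its own
    slope = 0.35      # cm of handspan per extra inch of height, on average
    return intercept + slope * height_in


# Estimated average handspan for people 60 inches tall (about 18 cm).
print(round(predict_handspan_cm(60), 2))

# The slope is the change in the estimate for a 1 inch increase in height.
print(round(predict_handspan_cm(61) - predict_handspan_cm(60), 2))
```

Note that the intercept of -3.0 cm corresponds to a height of 0 inches, far outside the observed heights, so it should not be read as a real handspan.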
22 Examples for you to work out
Knowing how to set up bivariate data
A table of bivariate data is given. Draw a
scatterplot using these data.
Solutions
23 The Equation for the Regression Line
ŷ = b0 + b1 x
- ŷ is spoken as y-hat, and it is also referred
to either as predicted y or estimated y.
- b0 is the intercept of the straight line. The
intercept is the value of ŷ when x = 0.
- b1 is the slope of the straight line. The slope
tells us how much of an increase (or decrease)
there is for the y variable when the x variable
increases by one unit. The sign of the slope
tells us whether y increases or decreases when x
increases.
- e = random errors (residuals) with mean zero,
so that an observed value is y = b0 + b1 x + e
24 Example: Driver Age and Maximum Legibility
Distance of Highway Signs
- Regression equation: Distance = 577 - 3 Age
- Slope of -3 tells us that, on average, the
legibility distance decreases by 3
feet when age increases by one year.
- Estimate the average distance for 20-year-old
drivers: Average distance = 577 - 3(20) = 517
ft.
- Predict the legibility distance for a
20-year-old driver: Predicted distance = 577 -
3(20) = 517 ft.
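The same equation serves both purposes of a regression line: it gives the estimated average for all drivers of a given age and the predicted value for one driver of that age. A minimal sketch, using the slide's equation Distance = 577 - 3 Age (the function name `legibility_distance_ft` is an illustrative assumption):

```python
def legibility_distance_ft(age_years):
    """Legibility distance (ft) from the slide's line: 577 - 3 * age."""
    return 577 - 3 * age_years


# Estimated average distance for all 20-year-old drivers.
average_for_20 = legibility_distance_ft(20)

# Predicted distance for one individual 20-year-old driver: same number.
predicted_for_20 = legibility_distance_ft(20)

print(average_for_20, predicted_for_20)  # 517 517

# One extra year of age changes the estimate by the slope, -3 ft.
print(legibility_distance_ft(21) - legibility_distance_ft(20))  # -3
```

The two numbers coincide, but the prediction for an individual is less certain than the estimate of the average, because individuals vary around the line.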
25 Extrapolation
- Usually a bad idea to use a regression equation
to predict values far outside the range where the
original data fell.
- No guarantee that the relationship will continue
beyond the range for which we have observed data.
26 Least Squares Line and Formulas
- Least Squares Regression Line minimizes the sum
of squared prediction errors.
- SSE = Sum of squared prediction errors.
- Formulas for Slope and Intercept
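For reference, the standard textbook least squares formulas are written out here in the b0, b1 notation used for the regression line. This is the usual form; the exact layout shown on the original slide is not known, so treat the notation (x-bar, y-bar for the sample means) as an assumption.

```latex
\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 ,
\qquad \hat{y}_i = b_0 + b_1 x_i

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2} ,
\qquad b_0 = \bar{y} - b_1 \bar{x}
```

The intercept formula implies that the least squares line always passes through the point of means (x-bar, y-bar).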
27 Linear model
- Only appropriate when the cloud of crosses in a
scatterplot of the data is regularly spread
around a straight line.
- If the crosses are scattered round a curve, the
relationship is called nonlinear and other models
must be used.
- Outliers should be investigated
- Detecting problems with the model
- Plot residuals against X to look for problems in
the model
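A minimal sketch of the residual check, assuming the standard least squares formulas for slope and intercept. The data values and the function names `least_squares` and `residuals` are hypothetical and chosen only for illustration; the y values are deliberately curved so that the problem shows up in the residuals.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line minimising the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Observed y minus predicted y for every data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data (not from the lecture): y grows roughly like x squared.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 4.2, 8.8, 16.1, 24.9, 36.2]

b0, b1 = least_squares(xs, ys)
res = residuals(xs, ys, b0, b1)

# Listing residuals against x is the text version of a residual plot.
for x, r in zip(xs, res):
    print(x, round(r, 2))
```

For these made-up values the residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than a random scatter around zero, suggests that a straight line is not an adequate model.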
28 Nonlinear Relationships
- If the relationship between Y and X is nonlinear,
a linear model will give poor predictions and
must be avoided.
- Transformation of one or both variables
- often possible to linearise the relationship and
therefore use least squares to fit a linear model
to the transformed variables
- For many data sets, a logarithmic transformation
works, but a more general power transformation is
sometimes needed to linearise the relationship
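A minimal sketch of linearising by transformation, assuming a hypothetical power-law relationship y = 3x^2 (invented for illustration, not lecture data). Taking logs of both variables gives log y = log 3 + 2 log x, which is a straight line, so least squares can be applied to the transformed values. The helper `fit_line` is an assumed name implementing the standard least squares formulas.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical curved data: y = 3 * x^2 exactly.
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]

# Transform both variables, then fit a straight line to the logs.
log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]
b0, b1 = fit_line(log_xs, log_ys)

# Slope on the log-log scale recovers the power (about 2);
# exp(intercept) recovers the multiplier (about 3).
print(round(b1, 3), round(math.exp(b0), 3))
```

With real data the points will not fall exactly on a line after transformation, so the residuals of the transformed fit should still be checked.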
29 Example: The Development of
Musical Preferences
- We want to examine the relationship between
song-specific age (age in the year a particular
song was popular) and musical preference
(positive score => above average, negative score
=> below average).
- The 108 participants in the study ranged in age
from 16 to 86 years old.
30 Example: The Development of
Musical Preferences
- Popular music preferences are acquired in late
adolescence and early adulthood.
- The association is nonlinear.
31 Example: The Development of
Musical Preferences
- Interpretation: If you are 10 now and a song came
out in 1967, your song-specific age is -30: the
song came out 30 years before you were born.
- If you are 85 now and a song comes out, your
song-specific age is 85.
- So the 85 year olds don't like current music, the
young ones don't like old music etc.
32 Measuring Strength and Direction with the
Correlation Coefficient
Correlation Coefficient r indicates the strength
and the direction of a straight-line relationship.
- The strength of the relationship is determined by
the closeness of the points to a straight line.
- The direction is determined by whether one
variable generally increases or generally
decreases when the other variable increases.
33 Interpretation of r and a Formula
- r is always between -1 and +1
- magnitude indicates the strength
- r = -1 or +1 indicates a perfect linear
relationship
- sign indicates the direction
- r = 0 indicates a slope of 0, so a change in x
does not change the predicted value of y
- Formula for Pearson Linear Correlation
Coefficient r
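For reference, one standard way of writing Pearson's correlation coefficient is given here. This is the usual textbook form; the exact version shown on the original slide is not known, and the symbols s_x and s_y (sample standard deviations of x and y) are an assumed notation.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)

b_1 = r \, \frac{s_y}{s_x}
```

The last identity links r to the least squares slope b1: the two always have the same sign, and r = 0 exactly when the fitted slope is 0.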
34 Interpretation of r and its Formula
- Formula for Pearson Linear Correlation
Coefficient r
- PS You will not be required to use this formula,
but in an exam the summary numbers and
formula may be given to you so you can check your
answer.
35 Interpretation of r
36 Interpretation of r (contd)
37 Interpretation of r (contd)
- Pearson's Linear Correlation Coefficient
- The correlation r is a measure of how close Y
and X are to having a straight-line association.
- r = 1 implies a perfect straight-line
relationship with positive slope
- r >= 0.9 implies a very strong straight-line
relationship with positive slope
- r > 0.7 implies a moderate straight-line
relationship with positive slope
- 0.3 < r <= 0.7 implies a weak to moderate
straight-line relationship with positive slope
- 0 < r <= 0.3 implies a very weak straight-line
relationship with positive slope
- r = 0 implies no linear relationship or zero
slope.
- Ditto for r < 0, but implying a relationship with
negative slope
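A minimal sketch of computing r directly, assuming the standard Pearson formula. The function name `pearson_r` and all three data sets are hypothetical, chosen so that the extreme cases are easy to verify by hand.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data sets (not from the lecture).
perfect_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # points on a rising line
perfect_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])      # points on a falling line
curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x^2, a perfect curve

print(round(perfect_up, 3), round(perfect_down, 3), round(curved, 3))
```

The third case gives r = 0 even though y is completely determined by x. This is why r = 0 is described as "no linear relationship" rather than "no relationship".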
38 Interpretation of r (contd)
- Pearson's Linear Correlation Coefficient r
- The correlation r is a measure of how close Y
and X are to having a straight-line association.
In summary
39 Example: Height and Handspan (cont)
- Regression equation: Handspan = -3 + 0.35 Height
- Correlation r = 0.74 => a medium/strong
positive linear relationship.
40 Example: Driver Age and Maximum Legibility
Distance of Highway Signs (cont)
- Regression equation: Distance = 577 - 3 Age
- Correlation r = -0.8 => a strong
negative linear association.
41 Example: Left and Right Handspans
- If you know the span of a person's right hand,
can you accurately predict his/her left handspan?
- Correlation r = 0.95 => a very strong
positive linear relationship.
42 Example: Verbal SAT and GPA
- Grade point averages (GPAs) and verbal SAT
(Standardised college Admission Test) scores for
a sample of 100 university students.
- Correlation r = 0.485 => a moderately strong
positive linear relationship.
43 Example: Age and Hours of TV Viewing
- Relationship between age and hours of daily
television viewing for 1913 survey respondents.
- Correlation r = 0.12 => a weak connection.
- Note: some participants claimed to watch more
than 20 hours/day!
44 Example: Hours of Sleep and Hours of Study
- Relationship between reported hours of sleep in
the previous 24 hours and the reported hours of
study during the same period for a sample of 116
college students.
- Correlation r = -0.36 => a weak to moderate
negative association.
45 Example: Driver Age and Maximum Legibility
Distance of Highway Signs (cont)
- Check the Regression Model by Calculating
Residuals
A residual is the observed value minus the
predicted value, so we can compute the residuals
for all 30 observations. A positive residual means
the observed value is higher than predicted. A
negative residual means the observed value is
lower than predicted.
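A minimal sketch of the residual calculation for this example, using the slide's equation Distance = 577 - 3 Age. The three (age, distance) pairs are hypothetical values invented for illustration; they are not the study's 30 observations. The function name is also an assumption.

```python
def predicted_distance_ft(age_years):
    """Predicted legibility distance (ft) from the slide's line: 577 - 3 * age."""
    return 577 - 3 * age_years


# Hypothetical (age, observed distance in ft) pairs, NOT the real study data.
observations = [(20, 540), (50, 410), (80, 340)]

for age, observed in observations:
    predicted = predicted_distance_ft(age)
    residual = observed - predicted  # observed minus predicted
    side = "higher" if residual > 0 else "lower"
    print(age, predicted, residual, "observed is", side, "than predicted")

residual_list = [obs - predicted_distance_ft(age) for age, obs in observations]
print(residual_list)  # [23, -17, 3]
```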
46 Why the Models may not always make Sense
- Allowing outliers to overly influence the
results
- Combining groups inappropriately
- Using correlation and a straight-line equation to
describe curvilinear data
47 Example: Height and Foot Length (cont)
Three outliers were data entry errors.
- Regression equation: uncorrected data 15.4 +
0.13 height; corrected data -3.2 + 0.42 height
- Correlation: uncorrected data r =
0.28; corrected data r = 0.69
48 Example: Earthquakes in US
San Francisco earthquake of 1906.
- Correlation coefficient
- Inclusive of all data r = 0.73
- Without San Francisco r = 0.96
49 Example: Height and Heavy Feet
- Scatterplot of all data: College student heights
and responses to the question "What is the
fastest you have ever driven a car?"
- Scatterplot by gender: Combining two groups led
to illegitimate correlation
50 Example: Don't Predict without drawing a Plot
Population of US (in millions) for each census
year between 1790 and 1990.
- Correlation r = 0.96
- Regression Line:
population = -2218 + 1.218(Year)
- Poor Prediction for Year 2005:
-2218 + 1.218(2005), about 224
million, which is less than the 1990 population.
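A minimal sketch of the arithmetic behind this warning, assuming the regression line is population = -2218 + 1.218(Year) in millions, which is consistent with the quoted prediction of about 224 million. The function name is an illustrative assumption.

```python
def predicted_population_millions(year):
    """US population (millions) from the straight-line fit: -2218 + 1.218 * year."""
    return -2218 + 1.218 * year


# Extrapolating 15 years past the last census in the data.
pred_2005 = predicted_population_millions(2005)
print(round(pred_2005, 1))  # about 224 million

# Even inside the data range the line misbehaves: for the first census
# year it gives a negative population, a sign the true pattern is curved.
pred_1790 = predicted_population_millions(1790)
print(round(pred_1790, 1))
```

A correlation of 0.96 looks impressive, yet the fitted line is clearly wrong at both ends of the range. Only a plot of the data would reveal the curvature, which is the point of the slide title.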