161.130 Biometrics Week 3 Lecture slides - PowerPoint PPT Presentation

Slides: 50
Provided by: dmbr5

1
161.130 Biometrics Week 3 Lecture slides
  • Exploring Bivariate Data
  • Scatterplots
  • Text section 4.1
  • CAST section 4.1
  • Least Squares and Nonlinear Relationships
  • Text section 4.3 and 4.4.2
  • CAST sections 4.3 and 4.4
  • Correlation
  • Text section 4.2 and 4.4.1
  • CAST section 4.2 and 4.5
  • Multivariate Data
  • CAST section 4.6

2
  • Univariate Data
  • Single measurement from each individual
  • Cannot associate variation in that measurement
    with other characteristics of the individuals
  • Bivariate Data
  • Two measurements from each individual
  • May be able to associate variation in one
    measurement with changes in the other measurement
  • Examples are ...
  • Blood pressure and weight of males in their 50s
  • Carbohydrate content and moisture content of corn
  • Our aim with such data is to find information
    about the relationship between the variables.

3
Three Tools we will use
  • Scatterplot, a two-dimensional graph of data
    values
  • Correlation, a statistic that measures the
    strength and direction of a linear relationship
  • Regression Equation, an equation that describes
    the average relationship between a response and
    explanatory variable

4
Scatterplots
  • The relationship between two variables cannot be
    determined from examination of the two variables
    in isolation.

5
Scatterplots
6
Scatterplots
7
Scatterplots
8
Scatterplots
9
Scatterplots
10
Looking for Patterns in Scatterplots
  • What is the average pattern? Does it look like a
    straight line or is it curved?
  • What is the direction of the pattern?
  • How much do individual points vary from the
    average pattern?
  • Are there any unusual data points?

11
Positive/Negative Association
  • Two variables have a positive association when
    the values of one variable tend to increase as
    the values of the other variable increase.
  • Two variables have a negative association when
    the values of one variable tend to decrease as
    the values of the other variable increase.

12
Example: Height and Handspan
  • Data shown are the first 12 observations of a
    data set that includes the heights (in inches)
    and fully stretched handspans (in centimeters) of
    167 college students.

13
Example: Height and Handspan
  • Taller people tend to have greater handspan
    measurements than shorter people do.
  • When two variables tend to increase together, we
    say that they have a positive association.
  • The height and handspan measurements appear to
    have a linear relationship.

14
Example: Driver Age and Maximum Reading
Distance of Highway Signs
  • A research firm determined the maximum
    distance at which each of 30 randomly selected
    drivers could read a newly designed road sign.
  • The 30 participants in the study ranged in
    age from 18 to 82 years old.
  • We want to examine the relationship between
    age and the sign legibility (readability)
    distance.

15
Example: Driver Age and Maximum Reading
Distance of Highway Signs
  • We see a negative association with a linear
    pattern.
  • We will use a straight-line equation to model
    this relationship.

16
Identifying Groups and Outliers
  • Use different plotting symbols or colours to
    represent different subgroups.
  • Look for outliers: points that have an unusual
    combination of data values.

17
  • In both univariate and bivariate data sets,
    outliers or clusters must be very distinct before
    we should conclude that they are real, in the
    absence of further external information
    confirming that the individuals are distinct.
  • Particularly in small data sets, outliers,
    clusters and other patterns may arise by chance,
    without being associated with any real features
    in the individuals.
  • Be careful not to over-interpret features in a
    scatterplot unless they are well defined,
    especially if the sample size is small.

18
Describing Linear Patterns using a Regression
Line
When the best equation for describing the
relationship between x and y is a straight line,
the equation is called the regression line.
  • Two purposes of a regression line
  • to estimate the average value of y at any
    specified value of x
  • to predict the value of y for an individual,
    given that individual's x value

19
Example: Body Mass Index (BMI)
  • Body Mass Index is calculated from height and
    body mass (weight): BMI = mass (kg) divided by
    height (m) squared
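
As a rough illustration of the BMI calculation, here is a minimal Python sketch using the standard definition (mass in kilograms divided by the square of height in metres). The function name and the example person are made up for illustration and do not come from the slides.

```python
def body_mass_index(mass_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: body mass divided by height squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2


# Hypothetical person (not from the slides): 70 kg and 1.75 m tall.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1))
```

Because height enters as a square, BMI describes a curved rather than a straight-line relationship between weight and height.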

20
Example: Height and Handspan
  • Regression equation: Handspan = -3.0 + 0.35
    Height
  • Estimate the average handspan for people 60
    inches tall (about 1.5 m):

Average handspan = -3.0 + 0.35(60) = 18 cm.
21
Example: Height and Handspan (cont)
  • Regression equation: Handspan (cm) = -3 + 0.35
    Height (inches)

Slope = 0.35: handspan increases by 0.35 cm, on
average, for each increase of 1 inch in height.
Intercept = -3.0
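
The handspan equation can be checked with a few lines of Python. This is only a sketch: the coefficients are the ones quoted on the slides, while the function and constant names are chosen here for illustration.

```python
INTERCEPT_CM = -3.0       # b0 from the slides
SLOPE_CM_PER_INCH = 0.35  # b1 from the slides


def average_handspan_cm(height_in: float) -> float:
    """Estimated average handspan (cm) at a given height (inches)."""
    return INTERCEPT_CM + SLOPE_CM_PER_INCH * height_in


at_60 = average_handspan_cm(60)
at_61 = average_handspan_cm(61)
print(round(at_60, 2))          # average handspan at 60 inches
print(round(at_61 - at_60, 2))  # change for one extra inch = the slope
```

The difference between the two estimates equals the slope, which is what "0.35 cm per inch, on average" means.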
22
Examples for you to work out
Knowing how to set up bivariate data
A table of bivariate data is given. Draw a
scatterplot using these data.
Solutions
23
The Equation for the Regression Line
ŷ = b0 + b1 x
  • ŷ is spoken as "y-hat", and it is also referred
    to either as predicted y or estimated y.
  • b0 is the intercept of the straight line. The
    intercept is the value of y when x = 0.
  • b1 is the slope of the straight line. The slope
    tells us how much of an increase (or decrease)
    there is for the y variable when the x variable
    increases by one unit. The sign of the slope
    tells us whether y increases or decreases when x
    increases.
  • e = random errors (residuals) with mean zero, so
    an individual observation is y = b0 + b1 x + e

24
Example: Driver Age and Maximum Legibility
Distance of Highway Signs
  • Regression equation: Distance = 577 - 3 Age

The slope of -3 tells us that, on average, the
legibility distance decreases by 3 feet when age
increases by one year.
Estimate the average distance for 20-year-old
drivers: Average distance = 577 - 3(20) = 517 ft.
Predict the legibility distance for a 20-year-old
driver: Predicted distance = 577 - 3(20) = 517 ft.
25
Extrapolation
  • Usually a bad idea to use a regression equation
    to predict values far outside the range where the
    original data fell.
  • No guarantee that the relationship will continue
    beyond the range for which we have observed data.
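
To see why extrapolation is risky, the sketch below applies the sign-legibility equation from the earlier example (Distance = 577 - 3 Age, fitted to drivers aged 18 to 82) to an age far outside that range. The helper names are hypothetical and the range check is just one simple way to flag extrapolation.

```python
AGE_MIN, AGE_MAX = 18, 82  # age range of the 30 drivers in the study


def legibility_distance_ft(age):
    """Regression equation quoted on the slides: Distance = 577 - 3 Age."""
    return 577 - 3 * age


def predict_with_range_check(age):
    """Return (prediction, is_extrapolation) for a given age."""
    is_extrapolation = not (AGE_MIN <= age <= AGE_MAX)
    return legibility_distance_ft(age), is_extrapolation


print(predict_with_range_check(20))   # (517, False)
print(predict_with_range_check(200))  # (-23, True): a negative distance
```

The equation happily returns an impossible negative distance once the age is far enough beyond the observed data.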

26
Least Squares Line and Formulas
  • Least Squares Regression Line minimizes the sum
    of squared prediction errors.
  • SSE = sum of squared prediction errors.
  • Formulas for Slope and Intercept
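
The formulas themselves did not survive in this transcript. A common textbook form is b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1 x̄; the slides may have used an equivalent computational form. The Python sketch below applies that form to a small invented data set (not from the lecture) and checks that shifting the line increases SSE.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared prediction errors for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Invented illustration data, not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(xs, ys)
best_sse = sse(xs, ys, b0, b1)
shifted_sse = sse(xs, ys, b0 + 0.1, b1)  # any other line has a larger SSE
print(round(b0, 3), round(b1, 3), round(best_sse, 3), round(shifted_sse, 3))
```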

27
Linear model
  • Only appropriate when the cloud of crosses in a
    scatterplot of the data is regularly spread
    around a straight line.
  • If the crosses are scattered round a curve, the
    relationship is called nonlinear and other models
    must be used.
  • Outliers should be investigated
  • Detecting problems with the model
  • Plot residuals against X to look for problems in
    the model
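
A residual check can be sketched without any plotting library. In the invented example below (y = x², not lecture data) a straight line is fitted to curved data; the residuals come out positive at both ends and negative in the middle, which is the kind of systematic pattern a residual plot against X would reveal.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Invented curved data: y is the square of x.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
# Residual = observed y minus the y predicted by the line.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

For a well-fitting linear model the residuals would scatter around zero with no such run of signs.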

28
Nonlinear Relationships
  • If the relationship between Y and X is nonlinear,
    a linear model will give poor predictions and
    must be avoided.
  • Transformation of one or both variables
  • often possible to linearise the relationship and
    therefore use least squares to fit a linear model
    to the transformed variables
  • For many data sets, a logarithmic transformation
    works, but a more general power transformation is
    sometimes needed to linearise the relationship
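
As a sketch of how a logarithmic transformation can linearise a relationship, the Python example below uses invented data that follow y = 3 × 2^x exactly. Taking logs gives log(y) = log(3) + x log(2), a straight line, so least squares on (x, log y) recovers both constants. Real data would not fit this cleanly.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Invented exponential data: y doubles each time x increases by 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

# Fit a straight line to the transformed response log(y).
log_ys = [math.log(y) for y in ys]
b0, b1 = fit_line(xs, log_ys)

# Back-transform to the original scale: y = exp(b0) * exp(b1) ** x.
start_value = math.exp(b0)
growth_factor = math.exp(b1)
print(round(start_value, 6), round(growth_factor, 6))
```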

29
Example: The Development of
Musical Preferences
  • We want to examine the relationship between
    song-specific age (age in the year a particular
    song was popular) and musical preference
    (positive score => above average, negative score
    => below average).
  • The 108 participants in the study ranged in age
    from 16 to 86 years old.

30
Example: The Development of
Musical Preferences
  • Popular music preferences acquired in late
    adolescence and early adulthood.
  • The association is nonlinear.

31
Example: The Development of
Musical Preferences
  • Interpretation: If you are 10 now and a song came
    out in 1967, you were -30 years old when the song
    came out.
  • If you are 85 now and a song comes out, you are
    85.
  • So the 85-year-olds don't like current music, the
    young ones don't like old music, etc.

32
Measuring Strength and Direction with the
Correlation Coefficient
Correlation Coefficient r indicates the strength
and the direction of a straight-line relationship.
  • The strength of the relationship is determined by
    the closeness of the points to a straight line.
  • The direction is determined by whether one
    variable generally increases or generally
    decreases when the other variable increases.

33
Interpretation of r and a Formula
  • r is always between -1 and +1
  • magnitude indicates the strength
  • r = -1 or +1 indicates a perfect linear
    relationship
  • sign indicates the direction
  • r = 0 indicates a slope of 0, so a change in x
    does not change the predicted value of y
  • Formula for Pearson Linear Correlation
    Coefficient
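
The slide's formula is not preserved in this transcript. One standard way to write the Pearson coefficient is r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² Σ(y - ȳ)²); the lecture may have shown an equivalent form based on summary numbers. The Python sketch below implements that form and checks the boundary cases, using invented data.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sxy / math.sqrt(sxx * syy)


# Invented data, not from the slides.
r_up = pearson_r([1, 2, 3], [2, 4, 6])        # points exactly on a rising line
r_down = pearson_r([1, 2, 3], [6, 4, 2])      # points exactly on a falling line
r_scatter = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r_up, 3), round(r_down, 3), round(r_scatter, 3))
```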

34
Interpretation of r and its Formula
  • Formula for Pearson Linear Correlation
    Coefficient r
  • PS You will not be required to use this formula,
    but in an exam the summary numbers and
    formula may be given to you so you can check your
    answer.

35
Interpretation of r
36
Interpretation of r (contd)
37
Interpretation of r (contd)
  • Pearson's Linear Correlation Coefficient
  • The correlation r is a measure of how close Y
    and X are to having a straight-line association.
  • r = 1 implies a perfect straight-line
    relationship with positive slope
  • r ≥ 0.9 implies a very strong straight-line
    relationship with positive slope
  • r > 0.7 implies a moderate straight-line
    relationship with positive slope
  • 0.3 < r ≤ 0.7 implies a weak to moderate
    straight-line relationship with positive slope
  • 0 < r ≤ 0.3 implies a very weak straight-line
    relationship with positive slope
  • r = 0 implies no linear relationship or zero
    slope.
  • Ditto for r < 0, but implying a relationship with
    negative slope

38
Interpretation of r (contd)
  • Pearson's Linear Correlation Coefficient r
  • The correlation r is a measure of how close Y
    and X are to having a straight-line association.
    In summary

39
Example: Height and Handspan (cont)
  • Regression equation: Handspan = -3 + 0.35 Height
  • Correlation r = 0.74 =>
  • a medium/strong positive linear
    relationship.

40
Example: Driver Age and Maximum Legibility
Distance of Highway Signs (cont)
  • Regression equation: Distance = 577 - 3 Age

Correlation r = -0.8 => a strong
negative linear association.
41
Example: Left and Right Handspans
  • If you know the span of a person's right hand,
    can you accurately predict his/her left handspan?

Correlation r = 0.95 => a very strong
positive linear relationship.
42
Example: Verbal SAT and GPA
  • Grade point averages (GPAs) and verbal SAT
    (Standardised college Admission Test) scores for
    a sample of 100 university students.
  • Correlation r = 0.485 =>
  • a moderately strong positive linear
    relationship.

43
Example: Age and Hours of TV Viewing
  • Relationship between age and hours of daily
    television viewing for 1913 survey respondents.
  • Correlation r = 0.12: a weak connection.
  • Note: some participants claimed to watch more
    than 20 hours/day!

44
Example: Hours of Sleep and Hours of Study
  • Relationship between reported hours of sleep in
    the previous 24 hours and the reported hours of
    study during the same period for a sample of 116
    college students.

Correlation r = -0.36: a weak to moderate
negative association.
45
Example: Driver Age and Maximum Legibility
Distance of Highway Signs (cont)
  • Check the Regression Model by Calculating
    Residuals

We can compute the residuals (observed value minus
predicted value) for all 30 observations. A positive
residual means the observed value is higher than
predicted. A negative residual means the observed
value is lower than predicted.
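
The sign convention for residuals can be illustrated with the Distance = 577 - 3 Age equation. The study's 30 observations are not reproduced in this transcript, so the three (age, distance) pairs below are hypothetical values invented only to show one positive, one negative and one zero residual.

```python
def predicted_distance_ft(age):
    """Regression equation quoted on the slides: Distance = 577 - 3 Age."""
    return 577 - 3 * age


def residual(age, observed_ft):
    """Residual = observed distance minus predicted distance."""
    return observed_ft - predicted_distance_ft(age)


# Hypothetical (age, observed distance in feet) pairs, not study data.
observations = [(20, 530), (50, 410), (80, 337)]

for age, observed in observations:
    res = residual(age, observed)
    if res > 0:
        note = "observed higher than predicted"
    elif res < 0:
        note = "observed lower than predicted"
    else:
        note = "observed equals predicted"
    print(age, res, note)
```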
46
Why the Models may not always make Sense
  • Allowing outliers to overly influence the
    results
  • Combining groups inappropriately
  • Using correlation and a straight-line equation to
    describe curvilinear data

47
Example: Height and Foot Length (cont)
Three outliers were data entry errors.
  • Regression equation: uncorrected data: 15.4 +
    0.13 height; corrected data: -3.2 + 0.42 height
  • Correlation: uncorrected data: r =
    0.28; corrected data: r = 0.69
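
The effect of a data entry error on r can be reproduced in miniature. The sketch below uses six invented height and foot-length pairs (not the lecture's data) and then mistypes one height, 62 entered as 26. The single bad point is enough to pull the correlation down sharply.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Invented illustration data: heights (inches) and foot lengths (cm).
heights = [60, 62, 64, 66, 68, 70]
foot_lengths = [22, 24, 23, 26, 25, 27]

# Same data with one hypothetical typing error: 62 entered as 26.
heights_with_error = [60, 26, 64, 66, 68, 70]

r_corrected = pearson_r(heights, foot_lengths)
r_uncorrected = pearson_r(heights_with_error, foot_lengths)
print(round(r_corrected, 2), round(r_uncorrected, 2))
```

A scatterplot would show the mistyped point immediately, which is why plotting comes before computing r.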

48
Example: Earthquakes in US
San Francisco earthquake of 1906.
  • Correlation coefficient
  • Inclusive of all data: r = 0.73
  • Without San Francisco: r = 0.96

49
Example: Height and Heavy Feet
  • Scatterplot of all data: College student heights
    and responses to the question "What is the
    fastest you have ever driven a car?"

Scatterplot by gender: Combining two groups led
to illegitimate correlation.
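
How combining groups can manufacture a correlation is easy to show with made-up numbers. In the sketch below, height and fastest driving speed are unrelated within each of two invented groups (r = 0 in each), yet pooling them gives a strong positive r because one group is both taller and faster on average. None of these values come from the lecture data.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Invented data: heights (inches) and fastest speed ever driven (mph).
women_heights, women_speeds = [62, 64, 66], [80, 90, 80]
men_heights, men_speeds = [70, 72, 74], [100, 110, 100]

r_women = pearson_r(women_heights, women_speeds)
r_men = pearson_r(men_heights, men_speeds)
r_combined = pearson_r(women_heights + men_heights,
                       women_speeds + men_speeds)
print(round(r_women, 2), round(r_men, 2), round(r_combined, 2))
```

Plotting the groups with different symbols, as suggested earlier in the lecture, would show two separate clouds with no trend inside either one.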
50
Example: Don't Predict without drawing a Plot
Population of US (in millions) for each census
year between 1790 and 1990.
  • Correlation: r = 0.96
  • Regression line: population = -2218 + 1.218(Year)
  • Poor prediction for Year 2005: -2218 + 1.218(2005),
    about 224 million, which is less than the 1990
    population.
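
The arithmetic behind the poor prediction can be checked directly. The sketch below evaluates the quoted regression line at 2005; the 1990 comparison figure of roughly 248.7 million is the approximate census count and is an outside value, not something stated on the slides.

```python
def predicted_population_millions(year):
    """Regression line quoted on the slides: -2218 + 1.218 * Year."""
    return -2218 + 1.218 * year


# Approximate 1990 US census count in millions (outside figure,
# not taken from the slides).
CENSUS_1990_MILLIONS = 248.7

pred_2005 = predicted_population_millions(2005)
print(round(pred_2005, 1))               # about 224 million
print(pred_2005 < CENSUS_1990_MILLIONS)  # True: below the 1990 count
```

A high r of 0.96 does not rescue the prediction: a straight line fitted to curved growth runs below the most recent points, so it underestimates the population at the end of the range.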