GG1011 MODULE D STATISTICS - PowerPoint PPT Presentation

Provided by: uhfa
Transcript and Presenter's Notes

1
GG1011 MODULE D: STATISTICS
  • Session 3: Confidence Intervals and difference
    tests

2
Assumptions of the t-test
  • It's a Parametric Statistic
  • Parametric statistics are the most powerful BUT
  • These statistics make the following key
    assumption
  • 1. Background population from which the data is
    sampled is NORMALLY DISTRIBUTED
  • What evidence would we have for this?

3
Non-parametric tests for difference between two
samples
  • If we cannot be confident that our sample data
    comes from a normally distributed population,
    use a NON-PARAMETRIC TEST.
  • In this case, with TWO samples, it's called the
    Mann-Whitney Test.
  • Non-parametric statistics make NO ASSUMPTIONS
    about the shape of the population distribution.
  • They compare the ranks and Median rather than the
    Mean and standard deviation.

4
Non-Parametric Test of Difference between TWO
samples: The Mann-Whitney Test
  • Has about 95% of the power of the t-test.
  • Can be applied to ordinal and higher data.
  • Usual null/alternative hypotheses can be set
    up, but don't mention means.
  • The statistic orders the combined samples and
    considers the relative ranks of the data in each
    sample. Sample 1 has n1 observations, Sample 2
    has n2 observations.
  • The ranks of each sample are summed to give R1
    and R2.
  • Statistic U is calculated as
    $U = n_1 n_2 + \tfrac{1}{2} n_1 (n_1 + 1) - R_1$
  • Significance of U is looked up on a table
    (Minitab will do this for you!) and a
    significance level (p) is given. p < 0.05 is
    significant.
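
The ranking and summing steps above can be sketched in a few lines of Python. This is a minimal illustration, not Minitab's procedure: the function names and the two small samples are made up for the example, and tied values are given their average rank (a common convention that the slides do not specify).

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2, R1, R2) using U = n1*n2 + 0.5*n(n+1) - R per sample."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))  # rank the combined data
    r1 = sum(ranks[:n1])  # rank sum for sample 1
    r2 = sum(ranks[n1:])  # rank sum for sample 2
    u1 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
    u2 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
    return u1, u2, r1, r2


# Hypothetical data: two small samples with no ties.
sample1 = [12, 15, 11, 18]
sample2 = [22, 25, 17, 30, 28]
u1, u2, r1, r2 = mann_whitney_u(sample1, sample2)
print(u1, u2, r1, r2)
```

For these made-up samples the rank sums are R1 = 11 and R2 = 34, giving U values of 19 and 1. The two U values always add up to n1 × n2, which is a quick check on the arithmetic. Printed tables are normally entered with the smaller of the two.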

5
What happens when we have MORE than two sample
means to test for difference?
  • 1. We could use multiple t-tests, but this
    increases the chance of making a Type I error.
  • 2. Use a test that can compare SEVERAL samples at
    once.
  • 3. This is called the F-test, or ANALYSIS OF
    VARIANCE.
  • 4. The F-test asks: are there any significant
    differences among the samples?
  • 5. It compares the variability of values WITHIN
    groups with the variability of values BETWEEN
    groups.
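
A rough illustration of point 1: if every pair of samples is compared with its own t-test at the 0.05 level, the chance of at least one false positive grows quickly with the number of samples. The sketch below assumes the tests are independent, which is only an approximation for pairwise tests that share samples. The function name is made up for this example.

```python
from math import comb


def familywise_error(num_samples, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one Type I error).

    Simplifying assumption: the pairwise tests are independent.
    """
    num_tests = comb(num_samples, 2)  # every pair of samples gets one t-test
    return num_tests, 1 - (1 - alpha) ** num_tests


for k in (2, 3, 5):
    tests, risk = familywise_error(k)
    print(k, "samples:", tests, "tests, risk of a Type I error about", round(risk, 3))
```

With 5 samples there are 10 pairwise tests, and under this assumption the overall risk of a Type I error is about 0.40 rather than 0.05.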

6
The F-test assumes the variance of all our
samples is similar (Case i), NOT Case ii
7
Analysis of Variance
8
Analysis of Variance
  • The F statistic is a ratio of BETWEEN GROUP
    VARIANCE/WITHIN GROUP VARIANCE.
  • The F-sampling distribution is a series of curves
    like the t-distribution. It varies with sample
    SIZE and HOW MANY SAMPLES are being compared.
  • As for the t-test, Minitab will give you a test
    statistic (F-statistic), and its associated
    significance level, p.
  • These are interpreted in exactly the same way as
    for the t-test.
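
The between-group / within-group ratio can be sketched as follows for a one-way analysis of variance. The function name and the three small groups are made up for illustration. The sketch returns only F and its two degrees of freedom; the significance level p would still come from Minitab or an F table.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)                          # how many samples are compared
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: how far each value is from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)        # between-group variance
    ms_within = ss_within / (n_total - k)    # within-group variance
    return ms_between / ms_within, k - 1, n_total - k


# Hypothetical data: three samples of three values each.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_stat, df_between, df_within = one_way_f(groups)
print(round(f_stat, 3), df_between, df_within)
```

Here the group means (5, 7 and 10) differ by much more than the values vary inside each group, so F is large: 19 on 2 and 6 degrees of freedom.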

9
Association between two variables: How do
variables interact in natural and economic/social
processes?
10
What is Correlation?
  • Correlation is a measure of the degree of
    association between two variables.
  • How does one variable control/affect another?
  • This is the idea of process mechanisms in broad
    terms: finding links between phenomena.
  • Three types of correlation: POSITIVE, NEGATIVE
    AND ZERO.

11
Correlation
  • First steps: Data must be PAIRED, e.g. for each
    observation of variable 1, there must be an
    equivalent observation of variable 2.
  • 1. Decide which of the two variables is the
    CONTROL or INDEPENDENT VARIABLE (X)
  • 2. Decide which of the two variables is therefore
    the CONTROLLED or DEPENDENT VARIABLE (Y)
  • 3. Plot the variables (paired x,y observations)
    on a SCATTERPLOT and look at the degree of
    scatter and any evidence of TREND in the data
    alignment.

12
Correlation: Scatterplot indications of STRENGTH.
Each of the dots represents a sample member,
for which there are TWO measurements, one for the
independent variable, one for the dependent
variable.
13
Correlation: factors to consider from the
scatter graph
14
Correlation: Linear and non-linear. The middle
scatterplot has zero linear correlation but
clearly there IS a non-linear relationship
between the variables.
15
Correlation Coefficient, Parametric
Statistics: Pearson Product Moment Coefficient
16
Correlation: Pearson Correlation Coefficient
17
Correlation
  • Values of the Pearson Correlation Coefficient (r):
  • r = +1.0. Perfect positive correlation. The data
    is aligned perfectly on a straight line with a
    positive gradient. As x increases, y increases.
  • r = -1.0. Perfect negative correlation. The data
    is aligned perfectly on a straight line with a
    negative gradient. As x increases, y decreases.
  • r = 0.0. Zero correlation, no linear association
    between the x and y variables. Random scatter of
    points on scattergraph.
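
The three cases above can be checked numerically. The sketch below uses the standard definition of the Pearson coefficient (covariation of x and y divided by the square root of the product of their separate variations). The slide's own formula did not survive in this transcript, so treat the layout of the calculation as an assumption. All data are made up.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation for paired observations x, y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mean_x) ** 2 for a in x)                       # variation in x
    syy = sum((b - mean_y) ** 2 for b in y)                       # variation in y
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]    # straight line, positive gradient
y_down = [-1, -2, -3, -4, -5]  # straight line, negative gradient
y_curve = [4, 1, 0, 1, 4]  # U-shaped: clearly related to x, but not linearly

print(pearson_r(x, y_up), pearson_r(x, y_down), pearson_r(x, y_curve))
```

The U-shaped example gives r = 0 even though y depends completely on x, which is the non-linear case described on the earlier scatterplot slide.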

18
Significance of the correlation coefficient
  • Is the correlation found in our sample data
    reflecting a REAL association between the
    populations of the variables?
  • Null Hypothesis: The correlation coefficient
    between the two variables is NOT significant.
  • Set the significance level: 0.05
  • T-test statistic:
    $t = r \sqrt{\dfrac{n - 2}{1 - r^2}}$
  • Find the significance level (p) for the t statistic.
  • Is it < 0.05? Yes: reject the Null Hypothesis.
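
A minimal sketch of the t calculation, using a made-up correlation of r = 0.8 from n = 10 paired observations. The function name is hypothetical. The critical value quoted in the comment is the usual two-tailed table value for 8 degrees of freedom.

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic for a sample correlation r from n paired observations."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * sqrt((n - 2) / (1 - r ** 2))


r, n = 0.8, 10             # hypothetical sample values
t = correlation_t(r, n)
df = n - 2                 # degrees of freedom for this test
print(round(t, 3), df)

# Two-tailed critical t at the 0.05 level for df = 8 is 2.306 (from t tables).
significant = abs(t) > 2.306
print(significant)
```

Here t is about 3.77, which exceeds 2.306, so with these made-up numbers the Null Hypothesis would be rejected at the 0.05 level.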

19
Pearson Product Moment Correlation Coefficient
  • Assumptions
  • Background populations from which data are sampled
    are NORMALLY distributed
  • Data is on Interval or Ratio scale
  • Look at the histograms of each sample variable.
  • What if the assumptions cannot be met, e.g.
    skewed data?
  • Transform the data to normalise the distributions, OR
  • Use a NON-PARAMETRIC STATISTIC OF CORRELATION:
  • Spearman's Rank Correlation Coefficient (which
    can also be used for ordinal scale data)
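
A minimal sketch of Spearman's coefficient, using the textbook shortcut formula rs = 1 - 6Σd² / (n(n² - 1)), where d is the difference between the two ranks of each pair. The slides do not give this formula, so it is supplied here as an assumption. The shortcut is only valid when there are NO tied values. Function names and data are made up.

```python
def ranks_no_ties(values):
    """Rank values from 1 (smallest). Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    n = len(x)
    rank_x, rank_y = ranks_no_ties(x), ranks_no_ties(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Strongly skewed y values, but y always rises with x: the ranks agree perfectly.
print(spearman_rs([10, 20, 30, 40, 50], [1, 4, 9, 16, 100]))
# Two neighbouring pairs swapped: the ranks mostly agree.
print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

Because only the ranks are used, the extreme value of 100 in the first example has no more influence than any other largest value would.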

20
Trend lines
  • Variation in the extent of sea-ice in the period
    Sept 1979 to Sept 1988 as measured by satellites.
  • What does the line allow us to do?

21
Examples of trend lines
22
What is a regression line?
  • A regression line is the line of BEST FIT through
    a scatter of x,y data values.
  • If correlation is 1 or -1, ALL the data will lie
    ON the regression line.
  • Form of the regression equation (Y = dependent
    variable, X = independent variable):
  • $Y = A + BX$
  • (A is the y-intercept, where the line cuts the
    y-axis; B is the gradient of the line, the rate
    of change of Y when X changes)

23
Regression
  • Direction of relationship
  • Case 1: Positive gradient (correlation). Gradient
    (b) is +ve.
  • Case 2: Negative gradient (correlation). Gradient
    (b) is -ve.
  • Case 3: No relationship (correlation 0.0). b = 0.0

24
Regression: influence of UNITS of measurement
  • Notice how changing the distance units from
    kilometers to miles changes the GRADIENT of the
    regression line.
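
A short worked derivation of why this happens, assuming distance is the independent variable X and using the approximate conversion 1 mile ≈ 1.609 km. The symbols X_km and X_mi (the same distance in the two units) are introduced here for illustration only.

$$
Y = A + B\,X_{\mathrm{km}}, \qquad X_{\mathrm{km}} = 1.609\,X_{\mathrm{mi}}
\quad\Longrightarrow\quad
Y = A + (1.609\,B)\,X_{\mathrm{mi}}
$$

Under these assumptions the intercept A is unchanged and the gradient is multiplied by about 1.609, while the correlation coefficient stays the same because the scatter of the points has not changed. If distance were instead the dependent variable Y, both A and B would be rescaled.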

25
Fitting a Regression Line
  • a) Sum of perpendicular distances from the line
    is minimised
  • b) Area of triangles minimised
  • c) Sum of distances in the y variable minimised
  • d) Sum of squared distances in the y variable
    minimised (least squares).
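
Option d) is the usual least-squares fit. A minimal sketch, assuming the standard closed-form solution (gradient = co-variation of x and y divided by variation in x, intercept chosen so the line passes through the mean point). The function name and data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                # gradient
    a = mean_y - b * mean_x      # intercept: line passes through (mean_x, mean_y)
    return a, b


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
print(round(a, 3), round(b, 3))
```

For these made-up data the fitted line is approximately Y = 2.2 + 0.6X. No other straight line gives a smaller sum of squared vertical distances for these points.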

26
Errors or residuals from a regression line
  • How Residuals are found.

27
Regression: Errors or residuals
  • Residuals measure the DIFFERENCE between the
    value of Y PREDICTED by the regression line and
    the ACTUAL value of Y.
  • There is Explained variation (due to the line)
    and Unexplained variation due to other factors,
    not included in the regression equation.
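
A small sketch of how residuals are calculated, taking the line Y = 2.2 + 0.6X as given. Both the line and the data are made up for illustration, and the sign convention used is actual minus predicted, so points above the line have positive residuals.

```python
def residuals(x, y, a, b):
    """Actual y minus the y predicted by the line y = a + b*x, for each pair."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical paired observations and a hypothetical regression line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

res = residuals(x, y, a, b)
print([round(r, 2) for r in res])   # positive = point lies above the line
print(round(sum(res), 10))          # residuals from a least-squares line sum to zero
```

The second observation (x = 2, y = 4) lies 0.6 above the line and the first (x = 1, y = 2) lies 0.8 below it.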

28
Regression
  • How well does the regression line represent the
    data?
  • Compare the observed variation of the data with the
    variation predicted by the regression line.
  • R², the coefficient of determination, gives the
    proportion of the observed variation that is
    explained by the regression line.
  • It is usually expressed as a percentage.
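
A minimal sketch of this comparison, assuming a least-squares line: the total variation of Y about its mean splits into an explained part (due to the line) and an unexplained part (the residuals), and R² is the explained share. Function names and data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(x, y):
    """Return (R squared, explained, unexplained, total variation)."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    predicted = [a + b * xi for xi in x]
    total = sum((yi - mean_y) ** 2 for yi in y)                      # observed variation
    explained = sum((pi - mean_y) ** 2 for pi in predicted)          # due to the line
    unexplained = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # residuals
    return explained / total, explained, unexplained, total


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2, explained, unexplained, total = r_squared(x, y)
print(round(100 * r2, 1), "% of the variation in y is explained by the line")
```

For these made-up data the total variation is 6.0, of which 3.6 is explained and 2.4 is unexplained, so R² is 0.6, or 60%.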

29
Assumptions of Regression Analysis
  • The residuals should be normally distributed
    around the regression line.
  • Residuals above the line are Positive.
  • Residuals below the line are Negative.

30
Assumptions of Regression
  • 1. Residuals should be randomly distributed
    around the line.
  • 2. The size of residuals should be random along
    the regression line.

31
What if my data is NON-LINEAR?
  • Transformations
  • These can be used to make the data linear so that
    correlation and linear regression can then be
    carried out.
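
One common transformation is taking logarithms, sketched below. The slides do not say which transformation to use, so this choice is an assumption. The data are made up and follow an exponential curve y = 2e^(0.5x) exactly. Plotting ln(y) against x turns that curve into a straight line that ordinary linear regression can fit.

```python
from math import exp, log


def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical curved data: y grows exponentially with x.
x = [0, 1, 2, 3, 4]
y = [2 * exp(0.5 * xi) for xi in x]

log_y = [log(yi) for yi in y]          # transform: ln(y) = ln(2) + 0.5 * x
intercept, slope = fit_line(x, log_y)  # linear regression on the transformed data
print(round(slope, 3), round(exp(intercept), 3))
```

The fitted gradient recovers the growth rate (about 0.5) and the back-transformed intercept recovers the starting value (about 2). Real data would scatter around the line rather than lie exactly on it.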