GG1011 MODULE D STATISTICS - PowerPoint PPT Presentation

Provided by: uhfa

Transcript and Presenter's Notes

Title: GG1011 MODULE D STATISTICS

1
GG1011 MODULE D STATISTICS

Session 3: Confidence Intervals and Difference
Tests

2
Assumptions of the t-test

It's a parametric statistic.
Parametric statistics are the most powerful, BUT
these statistics make the following key
assumption:
1. The background population from which the data is
sampled is NORMALLY DISTRIBUTED.
What evidence would we have for this?

3
Non-parametric tests for difference between two
samples

If we cannot be confident that our sample data
comes from a normally distributed population,
use a NON-PARAMETRIC TEST.
In this case, with TWO samples, it's called the
Mann-Whitney Test.
Non-parametric statistics make NO ASSUMPTIONS
about population distribution.
They compare the ranks and median rather than the
mean and standard deviation.

4
Non-Parametric Test of Difference between TWO
samples: The Mann-Whitney Test

Has about 95% of the power of the t-test.
Can be applied to ordinal and higher data.
Usual null/alternative hypotheses can be set
up, but don't mention means.
The statistic orders the combined samples and
considers the relative ranks of the data in each
sample. Sample 1 has n1 observations, Sample 2
has n2 observations.
The ranks of each sample are summed to give R1 and R2.
Statistic U is calculated as
$U = n_1 n_2 + 0.5\, n_1 (n_1 + 1) - R_1$
Significance of U is looked up on a table
(Minitab will do this for you!) and a
significance level (p) is given. p < 0.05 is
significant.
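
The U calculation above can be sketched in a few lines of Python. This is a minimal illustration, not Minitab's procedure: the function names and the two small samples are made up, and tied values are given the average of their ranks (an assumption, since the slide does not say how ties are handled).

```python
def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Rank the combined samples, sum the ranks, and apply the slide's U formula."""
    n1, n2 = len(sample1), len(sample2)
    ranks = rank_with_ties(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])  # rank sum of sample 1
    r2 = sum(ranks[n1:])  # rank sum of sample 2
    u1 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
    u2 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
    return u1, u2, r1, r2


# Made-up example data (hypothetical measurements).
sample1 = [3, 5, 8]
sample2 = [4, 9, 10, 12]
u1, u2, r1, r2 = mann_whitney_u(sample1, sample2)
print("R1 =", r1, "R2 =", r2)
print("U1 =", u1, "U2 =", u2)
```

A useful check is that U1 + U2 always equals n1 x n2. Tables of critical values are usually entered with the smaller of the two U values.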

5
What happens when we have MORE than two sample
means to test for difference?

1. We could use multiple t-tests, but this
increases the chance of making a Type I error.
2. Use a test that can compare SEVERAL samples at
once.
3. This is called the F-test, or ANALYSIS OF
VARIANCE.
4. The F-test asks: are there any significant
differences among the samples?
5. It compares the variability of values WITHIN
groups with the variability of values BETWEEN
GROUPS.
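
Why multiple t-tests inflate the Type I error rate can be shown with a rough calculation. The sketch below assumes every pair of samples is tested at the same significance level and, as a simplification, that the tests are independent (pairwise tests that share samples are not strictly independent, so treat the result as indicative only). The function name is hypothetical.

```python
from math import comb


def familywise_error(n_groups, alpha=0.05):
    """Number of pairwise tests and the chance of at least one false positive."""
    n_tests = comb(n_groups, 2)           # every pair of samples gets a t-test
    p_any = 1 - (1 - alpha) ** n_tests    # assumes independent tests
    return n_tests, p_any


n_tests, p_any = familywise_error(4)
print(n_tests, "tests; chance of at least one Type I error:", round(p_any, 3))
```

With four samples there are six pairwise tests, and the chance of at least one spurious "significant" result rises from 5% to roughly 26%.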

6
The F-test assumes the variance of all our
samples is similar (Case i), NOT Case ii.
7
Analysis of Variance
8
Analysis of Variance

The F statistic is a ratio of BETWEEN GROUP
VARIANCE/WITHIN GROUP VARIANCE.
The F-sampling distribution is a series of curves
like the t-distribution. It varies with sample
SIZE and HOW MANY SAMPLES are being compared.
As for the t-test, Minitab will give you a test
statistic (F-statistic) and its associated
significance level, p.
These are interpreted in exactly the same way as
for the t-test.
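
The between-group / within-group ratio can be worked through by hand. The following is a minimal sketch of a one-way analysis of variance on made-up data; the function name is hypothetical and the significance level (p) would still come from Minitab or an F table.

```python
def one_way_f(groups):
    """Return the F statistic and its two degrees of freedom for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean (BETWEEN groups).
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the values around their own group mean (WITHIN groups).
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Three made-up samples of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_f(groups)
print("F =", round(f_stat, 3), "with", df_between, "and", df_within, "degrees of freedom")
```

The two degrees of freedom are what select the particular F curve: one depends on how many samples are compared, the other on the total sample size.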

9
Association between two variables: How do
variables interact in natural and economic/social
processes?
10
What is Correlation?

Correlation is a measure of the degree of
association between two variables.
How does one variable control/affect another?
This is the idea of process mechanisms in broad
terms: finding links between phenomena.
Three types of correlation: POSITIVE, NEGATIVE
AND ZERO.

11
Correlation

First steps: Data must be PAIRED, i.e. for each
observation of variable 1, there must be an
equivalent observation of variable 2.
1. Decide which of the two variables is the
CONTROL or INDEPENDENT VARIABLE (X)
2. Decide which of the two variables is therefore
the CONTROLLED or DEPENDENT VARIABLE (Y)
3. Plot the variables (paired x,y observations)
on a SCATTERPLOT and look at the degree of
scatter and any evidence of TREND in the data
alignment.

12
Correlation: Scatterplot indications of STRENGTH.
Each of the dots represents a sample member,
for which there are TWO measurements, one for the
independent variable, one for the dependent
variable.
13
Correlation: factors to consider from the
scatter graph
14
Correlation: Linear and non-linear. The middle
scatterplot has zero linear correlation but
clearly there IS a non-linear relationship
between the variables.
15
Correlation Coefficient: Parametric
Statistics: Pearson Product Moment Coefficient
16
Correlation: Pearson Correlation coefficient
17
Correlation

Values of the Pearson Correlation Coefficient:
R = +1.0. Perfect positive correlation. The data
is aligned perfectly on a straight line with a
positive gradient. As x increases, y increases.
R = -1.0. Perfect negative correlation. The data
is aligned perfectly on a straight line with a
negative gradient. As x increases, y decreases.
R = 0.0. Zero correlation, no linear association between
the x and y variables. Random scatter of points
on scattergraph.
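
The three cases can be reproduced with a short calculation. This sketch uses the standard product-moment formula (sum of cross-products divided by the square root of the product of the sums of squares); the function name and the data are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation for paired observations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [3, 5, 7, 9, 11])  # y = 2x + 1, straight line upwards
r_negative = pearson_r(x, [-1, -2, -3, -4, -5])  # y = -x, straight line downwards
r_curved = pearson_r(x, [4, 1, 0, 1, 4])  # y = (x - 3)^2, a clear curve
print(r_positive, r_negative, r_curved)
```

The third case gives a coefficient of zero even though y depends entirely on x, which is why the scatterplot should always be inspected as well.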

18
Significance of the correlation coefficient

Is the correlation found in our sample data
reflecting a REAL association between the
populations of the variables?
Null Hypothesis: The correlation coefficient
between the two variables is NOT significant.
Set the significance level: 0.05
T-test statistic: $t = r \sqrt{(n-2)/(1-r^2)}$
Find the significance level (p) for the t statistic.
Is it < 0.05? Yes: reject the Null hypothesis.
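
A worked example of the test statistic, as a sketch. The values r = 0.7 and n = 10 are made up, and instead of computing p the result is compared with the tabulated two-tailed critical value of t for n - 2 = 8 degrees of freedom at the 0.05 level (2.306).

```python
from math import sqrt


def t_for_correlation(r, n):
    """t statistic for testing a correlation coefficient, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r, n = 0.7, 10                 # hypothetical sample result
t_stat = t_for_correlation(r, n)
critical_t = 2.306             # two-tailed, 0.05 level, 8 degrees of freedom (from tables)
significant = abs(t_stat) > critical_t
print("t =", round(t_stat, 3), "significant:", significant)
```

Here t is about 2.77, which exceeds the critical value, so the null hypothesis would be rejected for this made-up sample.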

19
Pearson Product Moment Correlation Coefficient

Assumptions
Background populations from which the data are sampled are
NORMALLY distributed
Data is on Interval or Ratio scale
Look at the histograms of each sample variable.
If the assumptions cannot be met? Skewed data?
Transform the data to normalise the distributions, or
use a NON-PARAMETRIC STATISTIC OF CORRELATION:
Spearman's Rank Correlation Coefficient (which
can also be used for ordinal scale data).
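
For illustration, Spearman's coefficient can be sketched with the usual shortcut formula 1 - 6(sum of d squared) / (n(n squared - 1)), where d is the difference between the two ranks of each pair. This formula assumes there are no tied values; the function names and data below are made up.

```python
def simple_ranks(values):
    """1-based ranks for values with no ties (smallest value gets rank 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    n = len(x)
    rank_x = simple_ranks(x)
    rank_y = simple_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rs_monotonic = spearman_rs([10, 20, 30, 40, 50], [1, 4, 9, 16, 30])
rs_mixed = spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rs_monotonic, rs_mixed)
```

Because only ranks are used, any relationship that rises consistently gives a coefficient of 1, whether or not it is a straight line.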

20
Trend lines

Variation in the extent of sea ice in the period
Sept 1979 - Sept 1988 as measured by satellites.
What does the line allow us to do?

21
Examples of trend lines
22
What is a regression line?

A regression line is the line of BEST FIT through
a scatter of x,y data values.
If correlation is 1 or -1, ALL the data will lie
ON the regression line.
Form of the regression equation, where Y is the dependent
variable and X is the independent variable:
$Y = A + BX$
(A is the y-intercept, where the line cuts the
y-axis; B is the gradient of the line, the rate
of change of Y when X changes)
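
A minimal sketch of finding A and B by least squares, assuming the usual formulas (B is the sum of cross-products divided by the sum of squares of X, and the line passes through the point of means). The function name and data are made up.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # gradient
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
print("Y =", round(a, 2), "+", round(b, 2), "X")
```

For this made-up data the fitted line is Y = 2.2 + 0.6X, so Y rises by 0.6 for every unit increase in X.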

23
Regression

Direction of relationship
Case 1: Positive gradient (correlation). Gradient
(b) is +ve.
Case 2: Negative gradient (correlation). Gradient
(b) is -ve.
Case 3: No relationship (correlation 0.0). b = 0.0

24
Regression: influence of UNITS of measurement

Notice how changing the distance units from
kilometers to miles changes the GRADIENT of the
regression line.
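
The effect of units can be checked numerically. The sketch below fits the same made-up data twice, once with distance in kilometres and once in miles (1 mile = 1.609344 km); the function name and values are hypothetical, and distance is assumed to be the X variable.

```python
KM_PER_MILE = 1.609344


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


distance_km = [1, 2, 3, 4, 5]
response = [2, 4, 5, 4, 5]
distance_miles = [d / KM_PER_MILE for d in distance_km]

a_km, b_km = fit_line(distance_km, response)
a_miles, b_miles = fit_line(distance_miles, response)
print("per km:", round(b_km, 4), "per mile:", round(b_miles, 4))
```

The gradient per mile is 1.609344 times the gradient per kilometre, while the intercept is unchanged. The strength of the relationship is the same in both cases; only the scale of the X axis differs.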

25
Fitting a Regression Line

a) Sum of perpendicular distances from the line
is minimised
b) Area of triangles minimised
c) Sum of distances in y variable minimised
d) Sum of squared distances minimised (least squares).

26
Errors or residuals from a regression line

How Residuals are found.

27
Regression: Errors or residuals

Residuals measure the DIFFERENCE between the
value of Y PREDICTED by the regression line and
the ACTUAL value of Y.
There is Explained variation (due to the line)
and Unexplained variation due to other factors,
not included in the regression equation.
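
A short sketch of how residuals are obtained, using made-up data and a line Y = 2.2 + 0.6X that is assumed to have been fitted already. The sign convention here (actual minus predicted) is an assumption; the slide only says "difference".

```python
def residuals(x, y, a, b):
    """Residual for each point: actual Y minus the Y predicted by a + b * x."""
    predicted = [a + b * xi for xi in x]
    return [yi - pi for yi, pi in zip(y, predicted)]


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # hypothetical fitted intercept and gradient for this data
res = residuals(x, y, a, b)
print([round(r, 2) for r in res])
```

Points above the line have positive residuals and points below have negative ones. For a least-squares line the residuals sum to zero.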

28
Regression

How well does the regression line represent the
data?
Compare the observed variation of the data with the
variation predicted by the regression line.
R², the coefficient of determination, is found to give
this proportion.
It is usually expressed as a percentage.
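
As a sketch, the proportion can be computed as one minus the ratio of unexplained variation to total variation, which holds for a least-squares line with an intercept. The data and predicted values below are made up (the predictions come from an assumed line Y = 2.2 + 0.6X), and the function name is hypothetical.

```python
def r_squared(y, predicted):
    """Proportion of the variation in y that is explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)                      # observed variation
    ss_unexplained = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # residual variation
    return 1 - ss_unexplained / ss_total


y = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]  # from the assumed line Y = 2.2 + 0.6X
r2 = r_squared(y, predicted)
print(f"{100 * r2:.0f}% of the variation is explained by the line")
```

Here the line explains 60% of the variation in Y; the remaining 40% is due to factors not included in the regression equation.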

29
Assumptions of Regression Analysis