Title: GG1011 MODULE D STATISTICS
1 GG1011 MODULE D STATISTICS
- Session 3: Confidence Intervals and difference tests
2 Assumptions of the t-test
- It's a Parametric Statistic
- Parametric statistics are the most powerful BUT
- These statistics make the following key assumption:
- 1. The background population from which the data is sampled is NORMALLY DISTRIBUTED
- What evidence would we have for this?
3 Non-parametric tests for difference between two samples
- If we cannot be confident that our sample data comes from a normally distributed population:
- Use a NON-PARAMETRIC TEST
- In this case, with TWO samples, it's called the Mann-Whitney Test.
- Non-parametric statistics make NO ASSUMPTIONS about population distribution.
- They compare the ranks and Median rather than the Mean and standard deviation.
4 Non-Parametric Test of Difference between TWO samples: The Mann-Whitney Test
- Has about 95% of the power of the t-test.
- Can be applied to ordinal and higher data.
- Usual null/alternative hypotheses can be set up, but don't mention means
- The statistic orders the combined samples and considers the relative ranks of the data in each sample. Sample 1 has $n_1$ observations, Sample 2 has $n_2$ observations.
- The ranks of each sample are summed: $R_1$, $R_2$
- Statistic U is calculated as $U = n_1 n_2 + 0.5\,n_1(n_1 + 1) - R_1$
- Significance of U is looked up on a table (Minitab will do this for you!) and a significance level (p) is given. p < 0.05 is significant.
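A minimal sketch of the ranking and summing steps above, in Python. The function names and the data are made up for illustration, and tied values are assumed to share the mean of their ranks (a common convention that the slides do not state). It only computes U; the significance level would still come from a table or from Minitab.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U, R1, R2) using U = n1*n2 + 0.5*n1*(n1 + 1) - R1."""
    n1, n2 = len(sample1), len(sample2)
    # Rank the COMBINED samples, then split the ranks back by sample.
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
    return u, r1, r2


# Hypothetical measurements, invented for this example only.
sample1 = [3.1, 4.5, 2.8, 5.0]
sample2 = [6.2, 5.9, 7.4, 4.9, 6.8]
u, r1, r2 = mann_whitney_u(sample1, sample2)
print("U =", u, "R1 =", r1, "R2 =", r2)
```

A quick check on any ranking: R1 + R2 should equal N(N + 1)/2, where N = n1 + n2.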
5 What happens when we have MORE than two sample means to test for difference?
- 1. We could use multiple t-tests, but this increases the chance of making a type 1 error.
- 2. Use a test that can compare SEVERAL samples at once.
- 3. This is called the F-test, or ANALYSIS OF VARIANCE.
- 4. The F-test asks: are there any significant differences among the samples?
- 5. It compares the variability of values WITHIN groups with the variability of values BETWEEN GROUPS.
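A rough illustration of point 1, in Python. It assumes every pair of samples is compared with its own t-test at the 0.05 level and, as a simplification, that the tests are independent (pairwise tests on shared samples are not strictly independent, so treat the numbers as indicative only). The function name is made up for this sketch.

```python
from math import comb


def familywise_error(k_samples, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one type 1 error)."""
    n_tests = comb(k_samples, 2)  # every pair of samples gets one t-test
    return n_tests, 1 - (1 - alpha) ** n_tests


for k in (2, 3, 4, 5):
    n_tests, rate = familywise_error(k)
    print(k, "samples:", n_tests, "tests, error chance about", round(rate, 3))
```

Under these assumptions the chance of at least one false positive grows quickly with the number of samples, which is the motivation for a single test across all samples.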
6 The F-test assumes the variance of all our samples is similar (Case i), NOT Case ii
7 Analysis of Variance
8 Analysis of Variance
- The F statistic is a ratio: BETWEEN GROUP VARIANCE / WITHIN GROUP VARIANCE.
- The F-sampling distribution is a series of curves, like the t-distribution. It varies with sample SIZE and HOW MANY SAMPLES are being compared.
- As for the t-test, Minitab will give you a test statistic (F-Statistic), and its associated significance level, p.
- These are interpreted in exactly the same way as for the t-test.
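A minimal sketch of the between/within ratio described above, in Python, assuming the usual one-way layout (k groups, each variance estimate divided by its degrees of freedom). The function name and the three groups are hypothetical. The sketch stops at F; the significance level p would still come from Minitab or a table.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: how far each value sits from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical data: three small samples.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f, df_between, df_within = one_way_f(groups)
print("F =", f, "with", df_between, "and", df_within, "degrees of freedom")
```

A large F means the group means differ by more than the scatter inside the groups would suggest.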
9 Association between two variables: How do variables interact in natural and economic/social processes?
10 What is Correlation?
- Correlation is a measure of the degree of association between two variables.
- How does one variable control/affect another?
- This is the idea of process mechanisms in broad terms: finding links between phenomena.
- Three types of correlation: POSITIVE, NEGATIVE AND ZERO.
11 Correlation
- First steps: Data must be PAIRED, e.g. for each observation of variable 1, there must be an equivalent observation of variable 2.
- 1. Decide which of the two variables is the CONTROL or INDEPENDENT VARIABLE (X)
- 2. Decide which of the two variables is therefore the CONTROLLED or DEPENDENT VARIABLE (Y)
- 3. Plot the variables (paired x,y observations) on a SCATTERPLOT and look at the degree of scatter and any evidence of TREND in the data alignment.
12 Correlation: Scatterplot indications of STRENGTH. Each of the dots represents a sample member, for which there are TWO measurements, one for the independent variable, one for the dependent variable.
13 Correlation: factors to consider from the scatter graph
14 Correlation: Linear and non-linear. The middle scatterplot has zero linear correlation but clearly there IS a non-linear relationship between the variables.
15 Correlation Coefficient: Parametric Statistics: Pearson Product Moment Coefficient
16 Correlation: Pearson Correlation coefficient
17 Correlation
- Values of the Pearson Correlation Coefficient
- R = +1.0. Perfect positive correlation. The data is aligned perfectly on a straight line with a positive gradient. As x increases, y increases.
- R = -1.0. Perfect negative correlation. The data is aligned perfectly on a straight line with a negative gradient. As x increases, y decreases.
- R = 0.0. Zero correlation, no linear association between the x and y variables. Random scatter of points on scattergraph.
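A short Python sketch of the three cases above, using the standard product-moment formula (sum of cross-products divided by the square root of the product of the two sums of squares). The function name and the data are made up for illustration. The third data set is a symmetric curve, chosen to show that a coefficient of zero can still hide a clear non-linear relationship.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
rising = [2 * v + 1 for v in x]    # points on a straight line, positive gradient
falling = [-3 * v + 4 for v in x]  # points on a straight line, negative gradient
curve = [v ** 2 for v in x]        # symmetric U-shape, clearly related to x

print(pearson_r(x, rising))
print(pearson_r(x, falling))
print(pearson_r(x, curve))
```

This is why the scatterplot should be inspected before the coefficient is trusted.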
18 Significance of the correlation coefficient
- Is the correlation found in our sample data reflecting a REAL association between the populations of the variables?
- Null Hypothesis: The correlation coefficient between the two variables is NOT significant.
- Set the significance level: 0.05
- T-test statistic: $t = r \sqrt{(n-2)/(1-r^2)}$
- Find significance level for t statistic.
- Is it < 0.05? Yes: Reject the Null hypothesis
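The t statistic above can be worked through in a few lines of Python. This is a sketch only: the values r = 0.6 and n = 18 are hypothetical, the function name is made up, and the statistic is taken to have n - 2 degrees of freedom. The significance level itself would still be read from a t table or from Minitab.

```python
from math import sqrt


def t_from_r(r, n):
    """Return (t, degrees of freedom) for a sample correlation r from n pairs."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 pairs and -1 < r < 1")
    t = r * sqrt((n - 2) / (1 - r ** 2))
    return t, n - 2


# Hypothetical sample: r = 0.6 from 18 paired observations.
t, df = t_from_r(0.6, 18)
print("t =", round(t, 3), "with", df, "degrees of freedom")
```

The same r gives a larger t when n is larger, so a modest correlation can be significant in a big sample and not in a small one.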
19 Pearson Product Moment Correlation Coefficient
- Assumptions
- Background populations from which the data are sampled are NORMALLY distributed
- Data is on Interval or Ratio scale
- Look at the histograms of each sample variable.
- If the assumptions cannot be met? Skewed data?
- Transform the data to normalise the distributions
- Use a NON-PARAMETRIC STATISTIC OF CORRELATION
- Spearman's Rank Correlation Coefficient (which can also be used for ordinal scale data)
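A minimal Python sketch of Spearman's coefficient, assuming the textbook shortcut formula 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the difference between the two ranks of each pair. That formula only holds when there are no tied values, so this sketch refuses ties rather than handling them. Function names and data are hypothetical.

```python
def simple_ranks(values):
    """Rank from 1 (smallest). This sketch does not handle tied values."""
    if len(set(values)) != len(values):
        raise ValueError("tied values are not handled in this sketch")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    n = len(x)
    rank_x = simple_ranks(x)
    rank_y = simple_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical skewed data: y always rises with x, but not in a straight line.
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 1000]
print(spearman_rs(x, y))

# Hypothetical data with two neighbouring pairs swapped in rank.
print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

Because only the ranks are used, the extreme value 1000 has no more influence than any other largest value would.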
20 Trend lines
- Variation in the extent of sea-ice in the period Sept 1979 - Sept 1988 as measured by satellites.
- What does the line allow us to do?
21 Examples of trend lines
22 What is a regression line?
- A regression line is the line of BEST FIT through a scatter of x,y data values.
- If correlation is 1 or -1, ALL the data will lie ON the regression line.
- Form of Regression Equation is (Y = dependent variable, X = independent variable):
- Y = A + BX
- (A is the y-intercept, where the line cuts the y-axis; B is the gradient of the line, the rate of change of Y when X changes)
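A sketch of how A and B in Y = A + BX could be estimated, in Python. It assumes the usual least-squares criterion (minimising the squared vertical distances), and both the function name and the data are made up for illustration.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    b = sxy / sxx             # gradient: change in y per unit change in x
    a = mean_y - b * mean_x   # intercept: the line passes through (mean_x, mean_y)
    return a, b


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
print("Y =", round(a, 3), "+", round(b, 3), "X")
```

Once A and B are known, a predicted Y for any X is simply A + B times X.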
23 Regression
- Direction of relationship
- Case 1: Positive gradient (correlation). Gradient (b) is +ve.
- Case 2: Negative gradient (correlation). Gradient (b) is -ve.
- Case 3: No relationship (Correlation 0.0). b = 0.0
24 Regression: influence of UNITS of measurement
- Notice how changing the distance units from kilometers to miles changes the GRADIENT of the regression line.
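A small Python sketch of the units effect. The distances and the y values are invented, the `slope` helper is hypothetical, and least squares is assumed for the gradient. The only fixed fact used is that one mile is 1.609344 km.

```python
KM_PER_MILE = 1.609344


def slope(x, y):
    """Least-squares gradient of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    return sxy / sxx


# Hypothetical data: some quantity y measured at five distances.
distance_km = [10, 20, 30, 40, 50]
y = [15, 25, 35, 45, 55]

# The same distances expressed in miles.
distance_miles = [d / KM_PER_MILE for d in distance_km]

b_km = slope(distance_km, y)        # change in y per kilometre
b_miles = slope(distance_miles, y)  # change in y per mile
print(b_km, b_miles, b_miles / b_km)
```

The gradient is a rate per unit of X, so it must always be quoted with its units. The correlation coefficient, by contrast, is unaffected by this kind of rescaling.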
25 Fitting a Regression Line
[Figure: four panels labelled a, b, c and d]
- a) Sum of perpendicular distances from the line is minimised
- b) Area of triangles minimised
- c) Sum of distances in y variable minimised
- d) Least squares of distances minimised.
26 Errors or residuals from a regression line
27 Regression: Errors or residuals
- Residuals measure the DIFFERENCE between the value of Y PREDICTED by the regression line and the ACTUAL value of Y.
- There is Explained variation (due to the line) and Unexplained variation due to other factors, not included in the regression equation.
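A short Python sketch of residuals, taking each residual as actual Y minus predicted Y so that points above the line come out positive. The line Y = 2.2 + 0.6X and the data are hypothetical (the line is the least-squares fit to these made-up points).

```python
# Hypothetical fitted line Y = A + BX and the data it was fitted to.
A, B = 2.2, 0.6
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

predicted = [A + B * u for u in x]
# Residual = actual value minus the value predicted by the line.
residuals = [actual - pred for actual, pred in zip(y, predicted)]

for u, actual, pred, res in zip(x, y, predicted, residuals):
    print(u, actual, round(pred, 2), round(res, 2))

print("sum of residuals:", round(sum(residuals), 10))
```

For a least-squares line the positive and negative residuals cancel, so their sum is zero apart from rounding.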
28 Regression
- How well does the regression line represent the data?
- Compare Observed variation of the data and the predicted variation given by the regression line.
- $R^2$, the coefficient of determination, is found to give this proportion.
- It is usually expressed as a percentage.
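A minimal Python sketch of this proportion, computed as 1 minus (unexplained variation / total variation). The line Y = 2.2 + 0.6X and the data are hypothetical, and the function name is made up for illustration.

```python
def r_squared(x, y, a, b):
    """Proportion of the variation in y explained by the line y = a + b*x."""
    mean_y = sum(y) / len(y)
    # Total variation of the observed y values about their mean.
    ss_total = sum((v - mean_y) ** 2 for v in y)
    # Unexplained variation: squared residuals about the line.
    ss_residual = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
    return 1 - ss_residual / ss_total


# Hypothetical data and its least-squares line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y, 2.2, 0.6)
print("R squared =", round(r2 * 100, 1), "%")
```

For a straight-line fit with one X variable, this value equals the square of the Pearson correlation coefficient.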
29 Assumptions of Regression Analysis
- The residuals should be normally distributed around the regression line.
- Residuals above the line are Positive.
- Residuals below the line are Negative.
30 Assumptions of Regression
- 1. Residuals should be randomly distributed around the line.
- 2. The size of residuals should be random along the regression line.
31 What if my data is NON-LINEAR?
- Transformations
- These can be used to make the data linear so that correlation and linear regression can then be carried out.
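One possible illustration in Python, assuming the data follow exponential growth (an assumption made for this example; other curve shapes need other transformations). With evenly spaced X values, a straight line shows up as equal steps in Y, so the sketch compares the steps before and after taking logs. The data and the `steps` helper are hypothetical.

```python
from math import exp, log


def steps(values):
    """Differences between consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical exponential data: y = 2 * e^(0.5 x) at evenly spaced x.
x = [0, 1, 2, 3, 4]
y = [2 * exp(0.5 * v) for v in x]

# Log transformation of the dependent variable.
log_y = [log(v) for v in y]

print("steps in y:     ", [round(s, 3) for s in steps(y)])
print("steps in log(y):", [round(s, 3) for s in steps(log_y)])
```

The steps in y keep growing, while the steps in log(y) are constant, so log(y) against x is a straight line and linear regression can be applied to the transformed values.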