Title: Introduction to Statistics
1Introduction to Statistics
- Biomedical Sciences Degrees Honours Students
- Derek Scott
- d.scott_at_abdn.ac.uk
2Why use statistics?
- Statistics are used to analyse populations and
predict changes in terms of probability.
- Normally, a representative sample is taken, large
enough to draw reliable conclusions about the
population as a whole.
- Descriptive statistics summarise the data and
describe the population. These values allow you
to see how large and how variable the data are.
- Inferential statistics propose a null hypothesis
and endeavour to disprove it. They tell you how
likely it is that a difference as large as the one
you observed could arise by chance alone.
3- When analysing data, you want to make the
strongest possible conclusion from limited
amounts of data. To do this, you need to overcome
2 problems:
- Important differences can be obscured by
biological variability and experimental error.
This makes it difficult to distinguish real
differences from random variability.
- The human brain excels at finding patterns, even
from random data. Our natural inclination
(especially with our own data) is to conclude
that any differences are real, and to minimise
the contribution of random variability.
Statistical rigour prevents you from making this
mistake.
4Errors
- Bias or systematic error: Data go in a
predictable direction, perhaps due to experimental
design or human error. You can remove these errors
if you identify them.
- Random error: Unpredictable errors. You can't get
rid of these.
- Usually you will quote a measure of error with
your data (e.g. standard deviation, standard
error of the mean).
- EXAMPLE: The mean height of a student in BM4005
is 1.71 ± 0.20 (43) metres.
1.71 = MEAN VALUE
± 0.20 = SD or SEM
(43) = n, the number of samples
metres = Units!!!
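A minimal Python sketch of how a value in the "mean ± SD (n) units" format could be produced. The heights are made up for illustration, and summarise is a hypothetical helper name, not part of any course software.

```python
import statistics

# Hypothetical heights in metres, invented for illustration only.
heights_m = [1.52, 1.60, 1.68, 1.71, 1.75, 1.80, 1.91]


def summarise(values, units):
    """Return a 'mean ± SD (n) units' string for a list of measurements."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return f"{mean:.2f} ± {sd:.2f} ({len(values)}) {units}"


print(summarise(heights_m, "metres"))
```

Whichever measure of error you quote, say whether it is the SD or the SEM, because the two can differ a lot.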
5Independent Sampling 1
- Measure BP in rats, 5 rats per group.
- Measure BP 3 times in each animal.
- You do not have 15 independent measurements,
since triplicate measurements in each animal
will be closer to one another than to those in
other animals.
- You should average values from each rat.
- Now have 5 independent mean values.
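The averaging step above can be sketched in Python as follows. The BP readings and the rat labels are invented for illustration; the point is only that n is the number of animals, not the number of readings.

```python
import statistics

# Hypothetical BP readings (mmHg): 5 rats, 3 readings per rat.
bp_readings = {
    "rat1": [118, 120, 122],
    "rat2": [131, 129, 130],
    "rat3": [125, 126, 124],
    "rat4": [110, 112, 111],
    "rat5": [135, 134, 136],
}

# Collapse each animal's triplicate into one mean value.
rat_means = {rat: statistics.mean(vals) for rat, vals in bp_readings.items()}

# Group statistics are calculated from the 5 per-animal means.
per_rat = list(rat_means.values())
n = len(per_rat)
group_mean = statistics.mean(per_rat)
group_sd = statistics.stdev(per_rat)

print(f"{group_mean:.1f} ± {group_sd:.1f} ({n}) mmHg")
```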
6Independent Sampling - 2
- Perform a biochemical test 3 times, each time in
triplicate.
- Do not have 9 independent values, as an error in
preparing the reagents for 1 experiment could
affect all 3 replicates in that experiment.
- Average the triplicates, and you have 3
independent mean values.
7Independent Sampling - 3
- Doing a human exercise study.
- Recruit 10 people from the inner city, and 10
people from the countryside.
- Have not independently sampled 20 subjects from
one population.
- Data from inner-city subjects may be closer to
each other than to the data from rural subjects.
You have sampled from 2 populations, and need to
account for this in your analysis.
8Gaussian (Normal) Distribution
- Data usually follow a bell-shaped distribution
called the Gaussian distribution. t-tests and
ANOVA tests assume that the population follows an
approximately Gaussian distribution.
- For example, if we measure the height of everyone
in 4th year and plot this, most people would fall
in the middle of the curve, with a few at the
bottom end, and a few at the top end of the
curve.
- For Gaussian-distributed data, we use parametric
tests.
9Gaussian Distribution
Bell-shaped curve
10Outliers
- When analysing data, some values can be very
different from the rest.
- It is tempting to delete such a value from the
analysis.
- Was the value typed in correctly?
- Was there an experimental problem with that
value?
- Is it due to biological diversity?
- What if the answers to these questions are no?
11Outliers
- If outlier is due to chance, keep it in the data
set.
- If it is due to a mistake (e.g. bad pipetting,
voltage spike, apparatus problem) then you must
remove it from the analysis.
- If you want to be absolutely sure whether the
outlier is due to chance or not, there are
specific statistical tests you can do, but
usually these basic checks are enough to decide.
12Mean
- Sample mean will probably not be exactly the
population mean. The sample mean is a better
estimate if you have a bigger sample size with a
low variability.
- You may calculate Confidence Intervals (CIs),
which give the range within which the true
population mean is likely to lie (usually with
95% confidence).
- EXAMPLE: Mean height of a student in BM4005 is
1.71 metres. The 95% confidence limits for this
value are 1.5 and 1.8 metres. These are the upper
and lower limits between which we can be 95%
confident that the true mean height lies. (The
range containing 95% of the individual heights is
a different thing: roughly the mean ± 2 SDs.)
13Confidence Intervals
- Nothing magical about 95%. You could do it for
any value you liked: 99%, 90% etc.
- If you set a value of 99%, then the interval
would be wider, because you need to be 99%
confident that the true mean falls within that
range.
- 95% confidence limits mean you have a reasonable
level of confidence that the true population mean
lies within that range.
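A rough Python sketch of a confidence interval for the mean, showing that a 99% interval is wider than a 95% one. It assumes a normal (z) approximation, which is only reasonable for larger samples; for small samples the interval should be based on the t distribution and will be wider (a statistics package will do this for you). The heights and the function name mean_confidence_interval are made up for illustration.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(values, confidence=0.95):
    """Approximate CI for the mean using the normal (z) approximation.

    Assumption: the sample is large enough for z to stand in for t.
    """
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * sem, mean + z * sem


# Hypothetical heights in metres, invented for illustration only.
heights_m = [1.52, 1.58, 1.60, 1.65, 1.68, 1.70,
             1.71, 1.74, 1.75, 1.80, 1.84, 1.91]

low95, high95 = mean_confidence_interval(heights_m, 0.95)
low99, high99 = mean_confidence_interval(heights_m, 0.99)

print(f"95% CI: {low95:.2f} to {high95:.2f} metres")
print(f"99% CI: {low99:.2f} to {high99:.2f} metres")
```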
14Standard Deviation (SD)
- Quantifies variability
- If data follow Gaussian distribution, then 68% of
values lie within one SD of mean (on either side)
and 95% of values lie within 2 SDs of the mean.
- So, as a rough rule of thumb, if 2 points on a
graph are more than 2 SDs away from each other,
they are likely to be significantly different
(but confirm this with a proper test).
- Expressed in same units as data
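The 68% and 95% figures can be checked with Python's standard library. This is only a sketch; fraction_within is a hypothetical helper name.

```python
from statistics import NormalDist

# A standard Gaussian: mean 0, SD 1.
standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k_sd):
    """Fraction of a Gaussian population lying within k SDs of the mean."""
    return standard_normal.cdf(k_sd) - standard_normal.cdf(-k_sd)


print(f"within 1 SD: {fraction_within(1):.1%}")  # roughly 68%
print(f"within 2 SD: {fraction_within(2):.1%}")  # roughly 95%
```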
15Standard Error of the Mean (SEM)
- Measure of how far sample mean is likely to be
from the true population mean.
- SEM = SD/√n
- Smaller than SD, so used more to give smaller
error bars!
- SD quantifies scatter: how much values vary from
each other. Doesn't really change much even if
you have a bigger sample size.
- SEM quantifies how accurately you know the true
mean of the population. SEM gets smaller as
sample gets larger.
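A small Python sketch of SEM = SD/√n, showing the SD staying about the same while the SEM shrinks as n grows. To keep it repeatable, it uses evenly spaced quantiles of an assumed Gaussian population (mean 1.71 m, SD 0.20 m) as a stand-in for a random sample; these numbers and the helper names are assumptions for illustration.

```python
import math
import statistics
from statistics import NormalDist

# Assumed population for illustration: mean 1.71 m, SD 0.20 m.
population = NormalDist(mu=1.71, sigma=0.20)


def idealised_sample(n):
    """n evenly spaced quantiles of the population (a deterministic
    stand-in for drawing a random sample of size n)."""
    return [population.inv_cdf((i + 0.5) / n) for i in range(n)]


def sd_and_sem(values):
    """Return (SD, SEM) where SEM = SD / sqrt(n)."""
    sd = statistics.stdev(values)
    return sd, sd / math.sqrt(len(values))


results = {n: sd_and_sem(idealised_sample(n)) for n in (10, 100, 1000)}
for n, (sd, sem) in results.items():
    print(f"n = {n:4d}  SD = {sd:.3f}  SEM = {sem:.4f}")
```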
16P Values
17Student's t-test
- Used to compare the means of two groups of data.
- Paired t-test: control expt. and treatment done
on same person, animal or cell etc.
- Unpaired t-test: control done on 1 group of
subjects, with the treatment being done on
another separate group.
- Can be 1- or 2-tailed.
18Iron and zinc evoke electrogenic responses that
are pH-dependent
Figure: responses to IRON (100mM) and ZINC (100mM)
in Krebs pH 6.0 and Krebs pH 7.4.
19Iron- and zinc-evoked transport is
temperature-dependent
Figure: IRON and ZINC panels, each measured at
4 °C and 37 °C.
20Paired or Unpaired?
- Choose paired if the 2 columns of data are
matched, e.g.:
- You measure weight before and after an
intervention in the same subjects.
- You recruit subjects as pairs, matched for
variables such as age, ethnic group, disease
severity. One of the pair gets one treatment, the
other gets an alternative treatment.
- You perform the control experiment in one cell or
piece of tissue, and then apply a drug. You
measure the effect of the drug in the same cell
or tissue.
- Pairing shouldn't be based on the variable you
are comparing. For example, if measuring BP, you
can match subjects based on their age or postcode,
but not on their BPs.
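To see why the choice matters, here is a minimal Python sketch that calculates the t statistic both ways for the same made-up "weight before and after" data. The unpaired version assumes equal variances (the classic Student's t-test). Turning t into a P value needs the t distribution, which is left to your statistics package. The function names and data are hypothetical.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic and degrees of freedom for a paired t-test."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


def unpaired_t(group1, group2):
    """t statistic and df for an unpaired t-test (equal variances assumed)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group2) - statistics.mean(group1)) / se
    return t, n1 + n2 - 2


# Hypothetical body weights (kg) in the same 5 subjects.
before = [70.0, 82.0, 65.0, 90.0, 77.0]
after = [68.0, 80.0, 64.0, 87.0, 75.0]

t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(before, after)

print(f"paired:   t = {t_paired:.2f}, df = {df_paired}")
print(f"unpaired: t = {t_unpaired:.2f}, df = {df_unpaired}")
```

In this example every subject loses a little weight, so the paired test gives a much larger t than the unpaired test, which is swamped by the variation between subjects.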
21Student's t-test
- You will probably always use a 2-tailed t-test.
- 2-tailed test just asks whether there is a
difference between the 2 means.
- 1-tailed test asks specifically whether:
- Mean 1 is bigger than Mean 2 or
- Mean 2 is bigger than Mean 1.
- For a 1-tailed test you must predict which mean
will be bigger before you start. This is not
usually possible.
- Stick to a 2-tailed t-test to be safe!!!
22Analysis of Variance (ANOVA)
- Used to compare means of 3 or more groups.
- Again, can have matched (paired) or unmatched
(unpaired) values.
- You will probably only use 1-way ANOVA.
- EXAMPLE: Your null hypothesis is that the average
BP for 4 men is equal. ANOVA can compare each
subject's BP and say if they are different or
not.
23Features of ANOVA
- ANOVA produces an F value, which is the ratio of
the variation between groups to the variation
within groups. A higher F value means the group
means differ more, relative to the scatter within
each group.
- Dunnett's post test allows you to compare against
1 group e.g. A v B, A v C, A v D. Handy if A is
the control group.
- Tukey's post test allows you to compare all
columns against one another just to check for any
differences between any groups. Good way of
finding significant differences that you may not
have expected.
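A minimal Python sketch of how the F value for a 1-way (unmatched) ANOVA can be calculated by hand, as the between-group mean square divided by the within-group mean square. The BP readings are invented, one_way_anova_f is a hypothetical function name, and the post tests are not covered here.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k = len(groups)
    n_total = len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2
                    for g in groups for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical BP readings (mmHg) for 4 men, taken on separate days.
bp_by_man = [
    [118, 120, 122],
    [128, 130, 132],
    [119, 121, 123],
    [140, 142, 144],
]

f_value, df_b, df_w = one_way_anova_f(bp_by_man)
print(f"F({df_b}, {df_w}) = {f_value:.1f}")
```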
24The effect of non-selective protein kinase
inhibition with staurosporine
Figure: IRON and ZINC panels. Legend: 8-Br cGMP +
Staurosporine; Staurosporine (0.5 mM); 8-Br cGMP
(100 mM); Control.
25Non-Gaussian Distribution
- Use non-parametric tests for these unusual
situations. These rank data from low to high and
analyse the distribution of ranks.
- Less powerful than parametric tests, but useful
when some values are too low or too high to
measure and have to be assigned arbitrary values.
Also used if the outcome is a rank or score with
only a few categories.
- P values are usually higher.
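A Python sketch of the ranking idea. The slides do not name a specific test; the Mann-Whitney U statistic is used here as one common rank-based example for two unpaired groups. The data and function names are made up. Note that the arbitrary value given to an unmeasurably low reading does not change the result, because only its rank is used.

```python
def average_ranks(values):
    """Rank values from low to high (1-based); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney_u(group1, group2):
    """Mann-Whitney U statistic (the smaller of U1 and U2)."""
    ranks = average_ranks(list(group1) + list(group2))
    n1, n2 = len(group1), len(group2)
    rank_sum1 = sum(ranks[:n1])
    u1 = rank_sum1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return min(u1, u2)


# Hypothetical data: the first control value was too low to measure,
# so it is given an arbitrary low value (0 here).
control = [0, 5, 7]
treated = [9, 12, 15]
print(mann_whitney_u(control, treated))
```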
26Skewness
27Correlation
+ve correlation
-ve correlation
Correlation doesn't tell you about the cause of
the effect, it just tells you that there is a
link between value X and value Y. The nearer the
R value is to +1 or -1, the stronger the
correlation.
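A short Python sketch of how the R value (Pearson correlation coefficient) is calculated, using made-up numbers to show a perfect +ve and a perfect -ve correlation. pearson_r is a hypothetical function name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))  # Y rises with X: +ve correlation
print(pearson_r(x, [8, 6, 4, 2]))  # Y falls as X rises: -ve correlation
```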
28Regression
Regression calculates a line of best fit. Often
used to calculate a standard curve which you
could use to estimate value x if you know value
y. Unknowns must fall within your standard
curve's range.
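A minimal Python sketch of a straight-line standard curve: fit the line by least squares, then read an unknown x off the curve from its measured y. The standards, units and function names are invented for illustration, and the range check simply refuses any unknown whose reading lies outside the standards.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def estimate_x(y, slope, intercept, standard_ys):
    """Estimate x from y; refuse to extrapolate beyond the standards."""
    if not min(standard_ys) <= y <= max(standard_ys):
        raise ValueError("unknown falls outside the standard curve's range")
    return (y - intercept) / slope


# Hypothetical standards: concentration (mg/ml) against absorbance.
conc = [0.0, 1.0, 2.0, 3.0, 4.0]
absorbance = [0.02, 0.21, 0.40, 0.61, 0.80]

slope, intercept = fit_line(conc, absorbance)
unknown = estimate_x(0.50, slope, intercept, absorbance)
print(f"estimated concentration: {unknown:.2f} mg/ml")
```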
29Correlation and regression
- A word of caution about doing regression and
finding correlations.
- Just because you can draw a line of best fit
through some points and make quite a good
straight line, it does not necessarily mean there
is a relationship.
- Correlation does not necessarily imply causation!
- For example, the consumption of tropical fruit in
the UK since WW2 has increased, and so has the
birth rate in the UK. If I plotted this on a
graph and did a regression, I would probably get
a nice straight line as both increase together. I
would probably also show there is a good
correlation.
- This does not mean that I can say that eating
tropical fruit improves your fertility!!!
- Use some common sense when interpreting your data!
30Summary
- This is just a basic introduction.
- For extra information, try the Help files on
GraphPad Prism (on the University PCs).
- If you end up doing an Honours project with
certain types of data (e.g. collecting
psychological data, epidemiological studies
etc.), your supervisor should inform you about
any special tests/calculations they use for that
type of data.
- Finally, if you are still unsure, make it clear
to your supervisor that you do not understand why
or what you are doing.