Title: Part time MSc course Epidemiology
1. The following lecture has been approved for University Undergraduate Students.
This lecture may contain information, ideas, concepts and discursive anecdotes that may be thought-provoking and challenging.
It is not intended for the content or delivery to cause offence.
Any issues raised in the lecture may require the viewer to engage in further thought, insight, reflection or critical evaluation.
2. Background to Statistics
- Distributions
- Data collection
- Data presentation
Dr. Craig Jackson
Senior Lecturer in Health Psychology
Trauma Critical Care
Faculty of Health Community Care
UCE Birmingham
craig.jackson_at_uce.ac.uk
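3. The Monty Hall Problem
[Figure: three doors, each marked 33%; once one door has been opened, the two remaining doors are each marked 50%?]
Problem: stick with initial choice or choose another door?
Solution: probability says that you stand a better chance of finding the cash if you SWAP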
4. The Monty Hall Problem
              Door 1   Door 2   Door 3
Never swap    WIN      LOSE     LOSE
Always swap   LOSE     WIN      WIN
Marilyn vos Savant
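The win / lose table above can be checked by simulation. The sketch below is illustrative only: the function names, the number of rounds and the random seed are assumptions, not part of the lecture. It plays the game many times under each strategy and reports the proportion of wins.

```python
import random


def play(swap: bool, rng: random.Random) -> bool:
    """Play one round and return True if the contestant wins the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    first_pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != first_pick and d != prize])
    if swap:
        # Switch to the one remaining closed door.
        final_pick = next(d for d in doors if d != first_pick and d != opened)
    else:
        final_pick = first_pick
    return final_pick == prize


def win_rate(swap: bool, rounds: int = 100_000, seed: int = 1) -> float:
    """Proportion of rounds won under one strategy (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return sum(play(swap, rng) for _ in range(rounds)) / rounds


if __name__ == "__main__":
    print(f"Never swap:  {win_rate(False):.3f}")
    print(f"Always swap: {win_rate(True):.3f}")
```

With enough rounds the "never swap" rate should settle near one in three and the "always swap" rate near two in three, matching the one WIN versus two WINs in the table.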
5. Dispersion
- Range: spread of data
- Mean: arithmetic average
- Median: location
- Mode: frequency
- SD: spread of data about the mean
Range 50-112 mmHg, Mean 82 mmHg, Median 82 mmHg, Mode 82 mmHg, SD 10 mmHg
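As a rough illustration of how these five summaries are computed, the sketch below uses Python's standard `statistics` module. The readings are hypothetical values chosen for the example; they are not the data behind the figures on the slide.

```python
import statistics

# Hypothetical blood pressure readings in mmHg (not the lecture's data).
readings = [70, 76, 82, 82, 82, 88, 94]

summary = {
    "range": (min(readings), max(readings)),  # spread of data
    "mean": statistics.mean(readings),        # arithmetic average
    "median": statistics.median(readings),    # middle value (location)
    "mode": statistics.mode(readings),        # most frequent value
    "sd": statistics.stdev(readings),         # sample SD (n - 1 denominator)
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(name, value)
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` would be the choice if the readings were treated as a whole population.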
6. Types of Data / Variables
Continuous: BP, Height, Weight, Age
Discrete: Children, No. colds in last 12 months, Age last birthday
Ordinal: Grade of condition, Positions (1st 2nd 3rd), Better - Same - Worse, Height groups, Age groups
Nominal: Sex, Hair colour, Blood group, Eye colour
7. Conversion & Re-classification
- Easier to summarise Ordinal / Nominal data
- Cut-off points (who decides this?)
- Allows Continuous variables to be changed into Nominal variables
    BP > 90 mmHg = Hypertensive
    BP < 90 mmHg = Normotensive
- Easier clinical decisions
- Categorisation reduces quality of data
- Statistical tests may be more sensational
- Good for summaries, bad for analyses
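A minimal sketch of the cut-off idea, assuming the 90 mmHg threshold from the slide. The function name and the sample readings are hypothetical. The slide does not say how a reading of exactly 90 is classed, so the code assumes it counts as Normotensive.

```python
def classify_bp(bp_mmhg: float, cutoff: float = 90.0) -> str:
    """Collapse a continuous reading into one of two categories.

    Assumption: a reading exactly equal to the cut-off is Normotensive.
    """
    return "Hypertensive" if bp_mmhg > cutoff else "Normotensive"


# Hypothetical readings in mmHg.
readings = [72.0, 89.0, 91.0, 118.0]
labels = [classify_bp(bp) for bp in readings]

if __name__ == "__main__":
    for bp, label in zip(readings, labels):
        print(bp, label)
```

The example shows the loss of information: 89 and 91 differ by only 2 mmHg yet fall in different groups, while 91 and 118 differ by 27 mmHg and receive the same label.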
8. Histograms and Bar-Charts
Distinction is often lost
Histograms
- The distribution of a continuous variable
- No gaps between the bars
Bar-Chart
- Spaces between the bars
- Distribution of discrete / categorical data
9. Types of statistics / analyses
DESCRIPTIVE STATISTICS: describing a phenomenon
- Frequencies: how many
- Basic measurements: metres, seconds, cm3, IQ
INFERENTIAL STATISTICS: inferences about phenomena
- Hypothesis testing: proving or disproving theories
- Confidence intervals: if sample relates to the larger population
- Correlation: associations between phenomena
- Significance testing: e.g. diet and health
10. Types of Data
QUALITATIVE
- Data expressed by type
- Data that has been described
QUANTITATIVE
- Data classified by numeric value
- Data that has been measured or counted
QUALITATIVE and QUANTITATIVE data are not mutually exclusive. Use of the two data types in research is OK.
11. Categorical Data
NOMINAL DATA
- values that the data may have do not have specific order
- values act as labels with no real meaning
- e.g. hair colour: brown = 1, blond = 2, black = 100
ORDINAL DATA
- values with some kind of ordering
- data that has been measured or counted
- e.g. social class: upper = 1, middle = 2, working = 3
- e.g. glioblastoma tumor grade: 1 2 3 4 5
- e.g. position in a race: 1st 2nd 3rd
12. Quantitative Data
DISCRETE: distinct or separate parts, with no finite detail, e.g. children in family
CONTINUOUS: between any two values there would be a third, e.g. between metres there are centimetres
INTERVAL: equal intervals between values and an arbitrary zero on the scale, e.g. temperature gradient
RATIO: equal intervals between values and an absolute zero, e.g. body mass index
13. Quantitative Data
COUNTS: number of items having a particular shared characteristic
PROPORTIONS: number of items with a particular characteristic divided by the total number in the population
PERCENTAGES: a proportion multiplied by 100; represents parts per hundred
RATIO: alternative to proportions; number with the characteristic divided by the number without
RATES: a variant of the proportion method, expressed as counts per 1000
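The five summaries differ only in the denominator and the scaling, which a short sketch can make concrete. The function name and the example figures (50 people with a characteristic out of 200) are hypothetical.

```python
def summarise(with_trait: int, total: int, per: int = 1000) -> dict:
    """Express one count in the five ways listed above."""
    if not 0 <= with_trait < total:
        # The ratio is undefined when nobody is without the characteristic.
        raise ValueError("need 0 <= with_trait < total")
    without = total - with_trait
    proportion = with_trait / total
    return {
        "count": with_trait,
        "proportion": proportion,         # with / total
        "percentage": proportion * 100,   # parts per hundred
        "ratio": with_trait / without,    # with / without
        "rate": proportion * per,         # counts per 1000 by default
    }


result = summarise(with_trait=50, total=200)

if __name__ == "__main__":
    for name, value in result.items():
        print(name, value)
```

For 50 out of 200 this gives a proportion of 0.25, a percentage of 25, a ratio of one to three and a rate of 250 per 1000.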
14. Terminology - Variables
INDEPENDENT
- Working hours, exposure, worker attitudes, policies
- Chemical exposure in workplace
DEPENDENT
- Symptomatology, productivity, accident rates, attitudes, health
- Performance on neuropsychological test
CONTROLLED
- Working hours, temperatures, exposure, diet, class, income
- Ambient noise and temperature in testing room
15. Levels of Variables
Temperature
16. [Figure: % of population by height, from 5'6" to 6'4"]
17. Quincunx machine (1877)
Balls dropped through a succession of metal pins produce a normal distribution of balls.
The balls do not have a normal distribution here. Why?
18. Normal & Non-normal distributions
The distribution derived from the quincunx is not perfect: it was only made from 18 balls.
19. Normal & Non-normal distributions
Galton's quincunx machine, run with hundreds of balls, gives a more perfectly shaped normal distribution.
Obvious implications for the size of samples of populations used.
The more lead shot runs through the quincunx machine, the smoother the distribution in the long run.
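The effect of the number of balls can be seen in a small simulation. This is a sketch under stated assumptions: the 12 rows of pins, the fixed seed, the 5000-ball run and the text rendering are all choices made for the example, while the 18-ball run mirrors the figure quoted in the lecture. Each pin is assumed to send a ball left or right with equal probability.

```python
import random
from collections import Counter


def quincunx(balls: int, rows: int = 12, seed: int = 0) -> Counter:
    """Drop balls through rows of pins and count how many land in each slot."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(balls):
        # Each pin deflects the ball left (0) or right (1); the slot is the sum.
        slot = sum(rng.randint(0, 1) for _ in range(rows))
        bins[slot] += 1
    return bins


def show(bins: Counter, rows: int = 12, width: int = 40) -> None:
    """Print a crude sideways histogram, scaled to the fullest slot."""
    peak = max(bins.values())
    for slot in range(rows + 1):
        print(f"{slot:2d} {'#' * round(bins[slot] * width / peak)}")


if __name__ == "__main__":
    for n in (18, 5000):
        print(f"{n} balls")
        show(quincunx(n))
```

With 18 balls the histogram is usually lumpy and lopsided; with thousands it should approach the familiar bell shape, which is the point about sample size made above.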
20. Normal & Non-normal distributions
Bigger samples are best (usually)
A SAMPLE OF VISUAL ABILITIES IN THE UK (SIMPLIFIED DATA)
[Figure: frequency against visual ability, from very poor through average to very good]
- Recruiting participants in the RNIB magazine would yield?
- Recruiting participants in an ornithology magazine would yield?
- Recruiting participants in a GP surgery would yield?
21. Presentation of data
Why use tables and graphs?
FIRST PRINCIPLES OF DATA PRESENTATION
- enhance understanding
- clarity
- avoidance of misunderstanding
WHY USE TABLES?
- more accurate than graphs
- more concise than graphs
WHY USE GRAPHS?
- provide good general overview
- allow reader to visualise the concept
22. Presentation of data
Table of means
             Exposed (n = 197)   Controls (n = 178)   T      P
Age (yrs)    45.5 (± 9.4)        48.9 (± 7.3)         2.19   0.07
I.Q          105 (± 10.8)        99 (± 8.7)           1.78   0.12
Speed (ms)   115.1 (± 13.4)      94.7 (± 12.4)        3.76   0.04
23. Presentation of data
Category tables
           Exposed   Controls   Total
Healthy    50        150        200
Unwell     147       28         175
Total      197       178        375
Chi square (test of association) shows: Chi square = 7.2, P = 0.02
24. Graphical displays
Use for Comparing data and Counts of data
Use for Comparing data and to show spread of data
Use for Counts of data
Use to show spread of data
Confidence intervals
25. Bar charts
[Figure: Mean GHQ scores for exposure groups; axes: GHQ score, Job Type]
26. Graphical display components
27. Graphical displays: Some real data
Movie-goers' ratings for National Lampoon's European Vacation (1985)
[Figure: votes by viewer rating]
What does the distribution of votes indicate? What other info is needed?
www.imdb.com
28. Graphical displays: Some real data
Movie-goers' ratings for The Empire Strikes Back (1980)
[Figure: votes by viewer rating]
What does the distribution of votes indicate? What other info is needed?
www.imdb.com
29. Movie data summary
Both tables represent the same data. Does either of them convey the general trend?

Movie-goers' ratings (%)
Rating     10     9      8      7      6      5      4      3      2     1
Lampoon    3.5    3.1    8      13.8   15.9   14.7   14.6   10.3   9.1   7.1
Empire     39.4   20.2   17.5   10.9   4.9    2.8    1.3    0.7    0.7   1.8

Movie-goers' ratings (votes)
Rating     10     9      8      7      6      5      4      3      2     1
Lampoon    31     27     70     121    140    129    128    90     80    62
Empire     6197   3182   2749   1710   766    435    201    109    113   279
www.imdb.com
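The two tables are linked by a simple conversion: each vote count divided by that film's total, multiplied by 100. The sketch below applies it to the counts on the slide, reading the first Empire count as 6197. The variable and function names are illustrative.

```python
RATINGS = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# Vote counts as given on the slide, in the same order as RATINGS.
VOTES = {
    "Lampoon": [31, 27, 70, 121, 140, 129, 128, 90, 80, 62],
    "Empire": [6197, 3182, 2749, 1710, 766, 435, 201, 109, 113, 279],
}


def to_percentages(counts: list) -> list:
    """Convert raw counts to percentages of the total, to one decimal place."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


percentages = {film: to_percentages(counts) for film, counts in VOTES.items()}

if __name__ == "__main__":
    print("Rating ", RATINGS)
    for film, values in percentages.items():
        print(f"{film:7s}", values, "total votes:", sum(VOTES[film]))
```

The totals are worth printing alongside the percentages: Lampoon's figures rest on 878 votes and Empire's on 15741, which is part of the "other info" needed to interpret either distribution.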
30. Movie data summary: Back to back comparison
[Figure: votes by viewer rating for Lampoon and Empire]
What is wrong with this bar chart? How could it be improved?
www.imdb.com
31. Movie data summary: Back to back comparison
[Figure: % of votes by viewer rating for Lampoon and Empire]
Can this be improved?
www.imdb.com
32. Movie data summary: Back to back comparison
[Figure: % of votes by viewer rating for Lampoon and Empire]
Over-complicated and messy
www.imdb.com
33. Clarity vs accuracy
Tables: determined numbers, word processed, less space
Figures: overview at a glance, little processing power, showing trends
34. Importance of Sample Size
- Forgotten in many studies
- Little consideration given
- Appropriate size needed to confirm / refute hypotheses
- Small samples are far too small to detect anything but the grossest difference
    - Non-significant results are reported
    - Type 2 errors occur
- Too large a sample is an unnecessary waste of (clinical) resources
    - Ethical considerations: waste of patient time, inconvenience, discomfort
- Make assessment of optimal sample size before starting investigation
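One common way to make that assessment for a comparison of two group means is the normal-approximation formula n = 2 (z_alpha/2 + z_beta)^2 (SD / difference)^2 per group. The lecture does not name a method, so the sketch below is only one possible approach; the function name, the 5% two-sided significance level, the 80% power and the example figures are assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group to detect a mean difference.

    delta: smallest difference in means worth detecting
    sd:    assumed common standard deviation in both groups
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - Type 2 error rate
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical example: SD of 10 units, differences of 10 and 5 units.
    print(n_per_group(delta=10, sd=10))
    print(n_per_group(delta=5, sd=10))
```

Halving the difference to be detected roughly quadruples the required sample, which is why small samples pick up only the grossest differences.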
35. Quantitative Data Summary
- What data is needed to answer the larger-scale research question?
- Combination of quantitative and qualitative?
- Cleaning, re-scoring, re-scaling, or re-formatting
- Measurement of both IVs and DVs is complex but can be simplified
- Binary measurement makes analysis easier but less meaningful
- Binary data needs clear parameters, e.g. exposed vs controls
36. Quantitative Data Summary
- Continuous & Discrete data can also be converted into Binary data
- Normal distribution of participants / data points desirable
- Means: age, height, weight, BMI, IQ, attitudes
- Frequencies / Classifications: job type, sick vs. healthy, dead vs alive
- Means must be followed by Standard Deviation (SD or )
- Presentation of data must enhance understanding or be redundant
37. Further Reading
- Altman DG. Designing Research. In Altman DG (ed.) Practical Statistics For Medical Research. Chapman and Hall, London 1991: 74-106.
- Bland M. The design of experiments. In Bland M. (ed.) An introduction to medical statistics. Oxford Medical Publications, Oxford 1995: 5-25.
- Daly LE, Bourke GJ. Epidemiological and clinical research methods. In Daly LE, Bourke GJ. (eds.) Interpretation and uses of medical statistics. Blackwell Science Ltd, Oxford 2000: 143-201.
- Gao Smith F, Smith J. (eds.) Key Topics in Clinical Research. BIOS Scientific Publications, Oxford 2002.
- Jackson CA. Planning Health and Safety Research Projects. Croner Health and Safety at Work Special Report 2002; 62: 1-16.
- Jackson CA. Analyzing Statistical Data in Occupational Health Research. Management of Health Risks Special Report, 81. Croner Publications, Surrey, June 2003.