Title: YY Teo
1Statistics for Medical Sciences Division
Summer 07
2Resources
- Lectures
- Lecture Notes
- Ramsey and Schafer (2002) The Statistical Sleuth
(2nd Edition) Thomson Learning - Email teo_at_stats.ox.ac.uk
3Lesson Structure
- 5 lectures practical sessions (1.5hr)
- Overview of statistics, exploratory data analysis
- Z-tests, t-tests, Analysis of Variance (ANOVA),
post-hoc analysis - Categorical analysis, odds ratio
- Linear regression
- Logistic regression
4Overview of Statistics
5Scientific Process
6Analysing a set of data
- Look at the data (initial checks on the data)
- Downloading data, formatting, data collection,
discrepant data, missing data - Visualize the data (exploratory data analysis)
- Descriptive statistics, informative tables,
well-constructed figures - Analyse the data (definitive analysis)
- Formal statistical analysis
- Quantify any interesting results
- Report the findings
7Computers and Statistics
- Excel, SPSS, Minitab, Stata, Mathlab, R, etc
- Advantages
- Speed, accuracy, ease of data manipulation
- Easy to produce plots, cross-tabulation tables,
summary statistics - Disadvantages
- Inappropriate analysis / use of wrong tests
- Data dredging
8Sample Selection
- Simple Random Sample
- Stratified Sample
- Cluster Sampling
- Multistage Sampling
- Multi-phase Sampling
9Types of Variables
- Often, test to use depends on the type of
variable at hand - Two main classes of variables
- Categorical
- Numerical
- Categorical variables further divided into two
sub-classes - Nominal categorical (example gender, ethnic
groups) - Ordinal categorical (example size of a car,
quality of teaching)
10Numerical Variables
- Distinguish between discrete or continuous
numerical variables - Discrete
- Integer values (number of male subjects, number
of episodes of flu outbreaks) - Continuous
- Takes a whole range of values (height, weight)
- Continuous variables treated as discrete (age)
11Exploratory Data Analysis
12EDA
- Tabular EDA
- Univariate tables, cross-tabulation of
categorical variables - Numerical EDA
- Location, spread, skewness, covariance and
correlation - Graphical EDA
- Frequency plots, histograms, boxplots,
scatterplots - The precise form of EDA depends on the data at
hand.
13Tabular EDA
- Useful for summarising categorical data.
For example, the following table shows the
classification of 2,555 DNA samples in a
case-control study of Malaria onset into the
respective phenotypes
14Tabular EDA
- For two categorical variables i.e. the
distribution of genotypes for a sub-sample of the
individuals affected and unaffected by severe
malaria
Question Appears to be more affected individuals
with the C allele (or conversely, more unaffected
individuals have the AA genotype). Does this mean
anything? (see later for test of independence)
15Numerical EDA
- Spread (range, standard deviation, interquartile
range, mode)
16Numerical EDA
- Sample QuartilesQ1 25th quantile (or value of
the 25 ranked data)Q2 50th quantile (also
known as median of data)Q3 75th quantile (or
value of the 75 ranked data) - Interquartile range (IQR) IQR Q3 Q1
- Minimum, Maximum of data
17Numerical EDA
- Numbers can be informative to identify potential
problems with the data - Example Suppose the height for 1,496 individuals
randomly sampled from the population produces the
following summary
IQR Q3 Q1 188 172 16 Range Max Min
201 0 201
18Numerical EDA
- Skewness
-
- b1 is unit-free, where 0 indicates symmetry
about sample mean. Positive values indicate
right-skewness, and negative values indicate
left-skewness.
19Covariance and Correlation
- Two numerical variables height and weight
- Questions
- Are there any relationship between these
variables? - If there is, how do we quantify this
relationship? - Covariance and correlation Measures the degree
of association between two numerical variables.
20Covariance and Correlation
- Covariance is scale-dependent, and correlation
is unit-free. - More intuitive to interpret correlation than
covariance. - Example Covariance for height and weight is 2.4
when assessed using metres and kilograms, but
240,000 when assessed using centimetres and
grams. Correlation is a constant value at 0.83
for both scenario.
- Correlation always bounded between 1 and 1
inclusive.
21Example
22Graphical EDA
- Visual summaries of the data
- Flagging outliers, obvious relationships, check
for distribution
23Boxplots
- Univariate boxplot for 1 numerical variable
Ends of box Q1 and Q3 Length of box IQR White
line Sample median Whiskers 1.5 times IQR Lines
outside whiskers Outliers Circles Extreme
outliers
24Boxplots
- Multivariate boxplots for 1 numerical variable
across different levels of a categorical variable - Graphical comparison
25Scatterplots
- Graphical representation for 2 numerical
variables
26Scatterplots
27Scatterplots