Title: Canadian Bioinformatics Workshops
3Lecture 3: Univariate Analyses, Discrete Data
MBP1010H Dr. Paul C. Boutros
Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others
4Course Overview
- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I: continuous
- Lecture 3: Univariate Analyses II: discrete
- Lecture 4: Multivariate Analyses I: specialized models
- Lecture 5: Multivariate Analyses II: general models
- Lecture 6: Microarray Analysis I: Pre-Processing
- Lecture 7: Microarray Analysis II: Multiple-Testing
- Lecture 8: Data Visualization, Machine-Learning
- Lecture 9: Sequence Analysis Basics
- Final Exam (written)
5How Will You Be Graded?
- 9% Participation: 1% per week
- 56% Assignments: 8 x 7% each
- 35% Final Examination: in-class
- Each individual will get their own, unique assignment
- Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run)
- Final Exam will include multiple-choice and written answers
6Review From Lecture 1
Population vs. sample: all MBP students are the population; MBP students in 1010 are the sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don't we use Excel/spreadsheets?
Input errors, reproducibility, wrong results
7Review From Lecture 2
Continuous data: no gaps on the number-line
What is the central limit theorem?
A random variable that is the sum of many small, independent random variables is approximately normally distributed
Theoretical vs. empirical quantiles
Probability vs. percentage of values less than p
Components of a boxplot?
25% - 1.5 x IQR, 25%, 50%, 75%, 75% + 1.5 x IQR
8Boxplot
Descriptive statistics can be intuitively
summarized in a Boxplot.
> boxplot(x)
[Figure: boxplot annotated with the 75% quantile, median, 25% quantile, IQR, and whiskers at 1.5 x IQR]
Everything more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an "outlier".
IQR = Inter Quantile Range = 75% quantile - 25% quantile
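The boxplot components can be computed by hand. This is a minimal sketch on a small made-up vector x (the values are hypothetical, not course data). Note that boxplot() draws its box from hinges (see ?fivenum), which can differ slightly from quantile() for small samples.

```r
# Hypothetical data: nine ordinary values and one extreme value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

# 25%, 50% and 75% quantiles
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# IQR = 75% quantile - 25% quantile
iqr <- IQR(x)

# Fences at 1.5 x IQR beyond the box
lower_fence <- unname(q["25%"] - 1.5 * iqr)
upper_fence <- unname(q["75%"] + 1.5 * iqr)

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# boxplot.stats() reports the outliers that boxplot() would draw
boxplot.stats(x)$out
```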
9Review From Lecture 2
- How can you interpret a QQ plot?
Compares two samples or a sample and a distribution. Straight line indicates identity.
- What is hypothesis testing?
Confirmatory data-analysis: test a null hypothesis
- What is a p-value?
Evidence against the null: the probability of seeing a value at least as extreme by chance alone if the null hypothesis is true (loosely, the probability of a false positive)
10Review From Lecture 2
- Parametric vs. non-parametric tests
Parametric tests have distributional assumptions
- What is the t-statistic?
Signal:Noise ratio
- Assumptions of the t-test?
Data sampled from a normal distribution; independence of replicates; independence of groups; homoscedasticity
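The signal-to-noise reading of the t-statistic can be checked numerically. This sketch uses two made-up groups (hypothetical values) and the pooled-variance, homoscedastic form of the two-sample test; the variable names are illustrative only.

```r
# Hypothetical measurements for two independent groups
x <- c(5.1, 4.9, 5.6, 5.8, 6.0)
y <- c(4.2, 4.8, 4.4, 5.0, 4.1)
nx <- length(x)
ny <- length(y)

# Signal: difference between the group means
signal <- mean(x) - mean(y)

# Noise: standard error of that difference, using the pooled variance
pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
noise <- sqrt(pooled_var * (1 / nx + 1 / ny))

t_manual <- signal / noise

# Should agree with the homoscedastic t-test
t_builtin <- unname(t.test(x, y, var.equal = TRUE)$statistic)
c(manual = t_manual, builtin = t_builtin)
```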
11Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
- Yes: Equal Variance (F-Test)?
  - Yes: Homoscedastic T-Test
  - No: Heteroscedastic T-Test
- No: Sufficient n for CLT (>30)?
  - Yes: Equal Variance (F-Test)? (then as above)
  - No: Wilcoxon U-Test
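One possible way to encode the flow-chart in R is sketched below. The function name choose_two_sample_test and its arguments are hypothetical, and using shapiro.test() as the normality check is an assumption (the flow-chart does not say how normality is judged). The F-test is var.test().

```r
# Hypothetical helper that follows the two-sample flow-chart
choose_two_sample_test <- function(x, y, alpha = 0.05, clt_n = 30) {
  # Assumption: Shapiro-Wilk stands in for "sampled from a normal population?"
  looks_normal <- shapiro.test(x)$p.value > alpha &&
    shapiro.test(y)$p.value > alpha
  enough_n <- min(length(x), length(y)) > clt_n

  # Not normal and too few observations for the CLT: non-parametric test
  if (!looks_normal && !enough_n) {
    return(list(test = "Wilcoxon U-Test", result = wilcox.test(x, y)))
  }

  # Otherwise pick the t-test variant with an F-test for equal variance
  equal_var <- var.test(x, y)$p.value > alpha
  list(
    test = if (equal_var) "Homoscedastic T-Test" else "Heteroscedastic T-Test",
    result = t.test(x, y, var.equal = equal_var)
  )
}

# Idealized inputs: exact normal quantiles, and a sample with one huge value
normal_like <- qnorm(ppoints(20))
skewed <- c(1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 100)

choose_two_sample_test(normal_like, normal_like + 1)$test
choose_two_sample_test(normal_like, 10 * normal_like)$test
choose_two_sample_test(skewed, skewed + 0.05)$test
```

Automating the choice like this is convenient for exploration, but each pre-test has its own error rate, so the thresholds here are illustrative rather than a recommendation.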
12Topics For This Week
- Correlations
- ceRNAs
- Attendance
- Common discrete univariate analyses
13Power, error rates and decision
Power calculation in R
> power.t.test(n = 5, delta = 1, sd = 2, alternative = "two.sided", type = "one.sample")

     One-sample t test power calculation

              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided

Other tests are available, see ??power.
14Power, error rates and decision
Pr(False Negative) = Pr(Type II error)
Pr(False Positive) = Pr(Type I error)
[Figure: distributions centred at µ0 and µ1]
Let's Try Some Power Analyses in R
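A short sketch of a power analysis, reusing the settings of the power.t.test() call from the previous slide (delta = 1, sd = 2, one-sample, two-sided). The sample sizes in n_values and the 80% power target are arbitrary choices made for illustration.

```r
# Same settings as the power.t.test() call on the previous slide
p5 <- power.t.test(n = 5, delta = 1, sd = 2,
                   alternative = "two.sided", type = "one.sample")$power

# Pr(Type II error) = 1 - power
type2 <- 1 - p5

# Power as a function of sample size (n values chosen arbitrarily)
n_values <- c(5, 10, 20, 50, 100)
power_by_n <- sapply(n_values, function(n) {
  power.t.test(n = n, delta = 1, sd = 2,
               alternative = "two.sided", type = "one.sample")$power
})
data.frame(n = n_values, power = power_by_n)

# Leave n out and supply power instead to solve for the sample size
n_needed <- power.t.test(power = 0.8, delta = 1, sd = 2,
                         alternative = "two.sided", type = "one.sample")$n
ceiling(n_needed)
```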
15Problem
- When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is, e.g., a trend for one variable to increase as the other increases.
- Regression analyses assess the dependence.
- Examples
  - Height vs. weight
  - Gene dosage vs. expression level
  - Survival analysis: probability of death vs. age
16Correlation
When one variable depends on the other, the variables are to some degree correlated. (Note: correlation need not imply causality.) In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance). Pearson's coefficient of correlation ranges from -1 to 1, with 0 indicating no linear correlation.
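To make "a normalized measure of covariance" concrete, the sketch below divides cov() by the product of the two standard deviations and compares the result with cor(). The data are simulated, and the seed and the 0.8/0.2 mixing weights are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated data: y follows x with some added noise
x <- rnorm(50)
y <- 0.8 * x + 0.2 * rnorm(50)

# Covariance depends on the units of x and y
covariance <- cov(x, y)

# Dividing by both standard deviations removes the units
pearson_manual <- covariance / (sd(x) * sd(y))

# Matches the built-in Pearson coefficient
pearson_builtin <- cor(x, y)
c(manual = pearson_manual, builtin = pearson_builtin)
```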
17Pearson's Coefficient of Correlation
How to interpret the correlation coefficient?
Explore varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.9999666
18Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.9661111
19Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.4
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.6652423
20Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.01
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.01232522
21Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50, -1, 1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x * pi)) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.3438495
22Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50, -1, 1)
> r <- 0.9
> # polynomial ...
> y <- (r * x * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] -0.5024503
23Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50, -1, 1)
> r <- 0.9
> # exponential
> y <- (r * exp(5 * x)) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.6334732
24Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50, -1, 1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x * pi)) + ((1 - r) * rnorm(50))
> b <- (r * sin(x * pi)) + ((1 - r) * rnorm(50))
> plot(a, b); cor(a, b)
[1] 0.04531711
25Correlation coefficient
26Other Correlations
- There are many other types of correlations
- Spearman's correlation: rho
- Kendall's correlation: tau
- Spearman is a Pearson on ranked values
- Spearman rho = 1 means a monotonic relationship
- Pearson R = 1 means a linear relationship
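A small illustration of the monotonic vs. linear distinction, using a made-up exponential relationship (the data are hypothetical). The method argument of cor() selects the coefficient.

```r
# A perfectly monotonic but strongly non-linear relationship
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Spearman is a Pearson correlation computed on the ranks
spearman_by_ranks <- cor(rank(x), rank(y), method = "pearson")

c(pearson = pearson, spearman = spearman,
  kendall = kendall, spearman_by_ranks = spearman_by_ranks)
```

Because y always increases with x, the rank-based coefficients reach 1 while Pearson stays below 1.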
27When Do We Use Statistics?
- Ubiquitous in modern biology
- Every class I will show a use of statistics in a
(very, very) recent Nature paper.
January 9, 2014
28Non-Small Cell Lung Cancer 101
Lung Cancer (15% 5-year survival)
- Non-Small Cell (80% of lung cancer)
  - Adenocarcinomas
  - Squamous Cell Carcinomas
  - Large Cell (and others)
- Small Cell
29Non-Small Cell Lung Cancer 102
- Stage I: Local Tumour Only (IA = small tumour, IB = large tumour)
- Stage II: Local Lymph Nodes
- Stage III: Distal Lymph Nodes
- Stage IV: Metastasis
30General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011
31Test Multiple Constructs for Activity
32What Statistical Analysis Did They Do?
- No information given in main text!
- Figure legend says:
  - "Values are technical triplicates, have been performed independently three times, and represent mean +/- standard deviation (s.d.) with propagated error."
- In supplementary they say:
  - "Unless otherwise specified, statistical significance was assessed by the Student's t-test"
- So, what would you do differently?
33Attendance Break
34Let's Go Back to Discrete vs. Continuous
- Definition?
- Let's take a few examples of discrete univariate statistical analyses in biology and write them down here
- RNA abundance
- Colony formation
- Number of mice
- Peptide counts
35Four Main Discrete Univariate Tests
- Hypergeometric test
  - Is a sample randomly selected from a fixed population?
- Proportion test
  - Are two proportions equivalent?
- Fisher's Exact test
  - Are two binary classifications associated?
- (Pearson's) Chi-Squared Test
  - Are paired observations on two variables independent?
36Hypergeometric Test
- Is a sample randomly selected from a fixed population?
- Closer to discrete mathematics than statistics
- Technically: sampling without replacement
- In R: ?phyper
- Classic example: marbles
- Less classic: poker
Population: 5/24 are yellow
Sample: 1/6 sampled are yellow
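The marble example can be worked through with dhyper() and phyper(). This sketch reads the figure labels as 24 marbles of which 5 are yellow, and a draw of 6 of which 1 is yellow; that reading is an assumption.

```r
# Assumed reading of the figure: 24 marbles, 5 yellow; 6 drawn, 1 yellow
total    <- 24
yellow   <- 5
drawn    <- 6
observed <- 1

# Expected number of yellow marbles in a random draw of 6
expected <- drawn * yellow / total

# Probability of drawing exactly 1 yellow marble
p_exact <- dhyper(observed, yellow, total - yellow, drawn)

# Probability of drawing 1 or fewer yellow marbles (lower tail)
p_lower <- phyper(observed, yellow, total - yellow, drawn)

c(expected = expected, p_exact = p_exact, p_lower = p_lower)
```

The argument order for both functions is: observed count, number of "successes" in the population, number of "failures" in the population, sample size.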
37Hypergeometric Test Biological Example
- Classic example in genomics: pathway analysis
  - I do a screen and identify n genes associated with something
  - Are those n genes biased towards a pathway?
  - Well, a pathway contains m genes
  - So is n a random selection of m? Hypergeometric test!
- Similar example: drug screening
  - I test 1000 drugs to see which ones kill a cell-line
  - 100 of these are kinase inhibitors
  - 100 drugs kill my cell-line
  - 30 of these are kinase inhibitors
  - Did I find more kinase inhibitors than expected by chance?
- Let's do the calculation
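One way to do the drug-screening calculation in R, using the numbers from the slide. The variable names are illustrative. Since phyper() returns P(X <= q), the upper tail P(X >= 30) is obtained with q = 29 and lower.tail = FALSE.

```r
n_drugs       <- 1000  # drugs tested
n_kinase      <- 100   # kinase inhibitors among the drugs tested
n_hits        <- 100   # drugs that kill the cell-line
n_kinase_hits <- 30    # hits that are kinase inhibitors

# Expected number of kinase inhibitors among the hits by chance
expected <- n_hits * n_kinase / n_drugs

# A simple effect-size: observed over expected
fold_enrichment <- n_kinase_hits / expected

# P(30 or more kinase inhibitors among 100 randomly chosen drugs)
p_value <- phyper(n_kinase_hits - 1, n_kinase, n_drugs - n_kinase,
                  n_hits, lower.tail = FALSE)

c(expected = expected, fold_enrichment = fold_enrichment, p_value = p_value)
```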
38Hypergeometric Venn Diagram Overlap
Let's pretend X and Y are sets of genes (or drugs, etc.) found in two separate experiments. We want to know: is there more overlap than expected by chance? To do this:
- Total Balls: total number of genes considered (but a gene must be analyzed in both experiments; exclude those studied in only one)
- Black Balls: all genes found in experiment X
- White Balls: all genes not found in experiment X
- Sample: all genes found in experiment Y
Exercise: can you calculate an effect-size?
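The balls-in-an-urn mapping can be sketched with made-up gene lists. The identifiers and set sizes below are hypothetical, and observed/expected overlap is only one possible effect-size.

```r
# Hypothetical gene universe and hit lists (made-up identifiers)
universe <- paste0("gene", 1:2000)  # genes analyzed in both experiments
X <- paste0("gene", 1:150)          # found in experiment X
Y <- paste0("gene", 101:300)        # found in experiment Y

# Keep only genes that were analyzed in both experiments
X <- intersect(X, universe)
Y <- intersect(Y, universe)

total_balls <- length(universe)
black_balls <- length(X)                  # found in X
white_balls <- total_balls - black_balls  # not found in X
sample_size <- length(Y)                  # found in Y
overlap     <- length(intersect(X, Y))

# P(overlap at least this large by chance)
p_value <- phyper(overlap - 1, black_balls, white_balls,
                  sample_size, lower.tail = FALSE)

# One possible effect-size: observed overlap over expected overlap
expected_overlap <- sample_size * black_balls / total_balls
effect_size <- overlap / expected_overlap

c(overlap = overlap, expected = expected_overlap,
  effect_size = effect_size, p_value = p_value)
```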
39Proportion Test
- Are two proportions equivalent?
- Example: is the fraction of people who play hockey in MBP different from the fraction who play hockey in Mathematics?
  - Mathematics: 12/85
  - MBP: 24/135
- In R: prop.test
- Only useful for two-group studies
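The hockey example can be run directly with prop.test(), which takes a vector of counts and a vector of group sizes. The variable names are illustrative.

```r
# Hockey players / group size, from the slide
players <- c(Mathematics = 12, MBP = 24)
group_n <- c(Mathematics = 85, MBP = 135)

res <- prop.test(players, group_n)

# Observed proportions in each group
res$estimate

# P-value for the null hypothesis that the two proportions are equal
res$p.value

# Confidence interval for the difference in proportions
res$conf.int
```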
40Proportion Test Biological Example
- Does the frequency of TP53 mutations differ between prostate cancer patients who will suffer a recurrence and those who will not?
  - 12/150 patients whose tumours recur have mutated TP53
  - 50/921 patients whose tumours do not recur have mutated TP53
- P-value guesses?
- What if it is 100/921?
- What if it is 5/10 vs. 1/10?
- 6/10 vs. 1/10?
- 7/10 vs. 1/10?
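After guessing, the scenarios above can be checked in one pass. This is a sketch with illustrative names; suppressWarnings() is used because R warns that the chi-squared approximation may be poor for the 10-patient scenarios.

```r
# Mutated / total for each pair of groups, taken from the questions above
scenarios <- list(
  "12/150 vs. 50/921"  = list(x = c(12, 50),  n = c(150, 921)),
  "12/150 vs. 100/921" = list(x = c(12, 100), n = c(150, 921)),
  "5/10 vs. 1/10"      = list(x = c(5, 1),    n = c(10, 10)),
  "6/10 vs. 1/10"      = list(x = c(6, 1),    n = c(10, 10)),
  "7/10 vs. 1/10"      = list(x = c(7, 1),    n = c(10, 10))
)

# Two-sided proportion test for each scenario
p_values <- sapply(scenarios, function(s) {
  suppressWarnings(prop.test(s$x, s$n))$p.value
})

# Print the p-values to compare against your guesses
round(p_values, 4)
```

The small-sample scenarios show how much the p-value moves when a single patient changes category.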
41Fisher's Exact Test
- Are two binary categorizations associated?
- Based on a contingency table
  - What are these? Have we seen any before?
- In R: ?fisher.test
- Classic example: drinking tea
Dr. Muriel Bristol claimed to be able to taste whether tea or milk was added first to a cup. Dr. Ronald Fisher didn't believe her.
[2 x 2 contingency table: cell counts 0, 4, 0, 4]
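A sketch of the tea-tasting test with fisher.test(). It assumes eight cups, four of each kind, all classified correctly; that arrangement of the 4s and 0s and the row and column labels are assumptions, not taken from the slide.

```r
# Assumed outcome: all 8 cups classified correctly (4 of each kind)
tea <- matrix(c(4, 0,
                0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(Guess = c("Milk first", "Tea first"),
                              Truth = c("Milk first", "Tea first")))

# One-sided test: is she better than random guessing?
res <- fisher.test(tea, alternative = "greater")
res$p.value

# Only 1 of the choose(8, 4) ways to pick 4 "milk first" cups is all correct
by_hand <- 1 / choose(8, 4)
by_hand
```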
42Fisher's Exact Test Biological Example
- You can use this any time you form a contingency table
  - Any time you make predictions (biomarkers)
  - Any time you compare two binary phenomena
- Examples?
43(Pearson's) Chi-Squared Test
- Are two variables independent?
- There are a lot of different chi-squared tests. Why?
  - Pearson
  - Yates
  - McNemar
  - Portmanteau test
- In R: ?chisq.test
- You can think of it as a multiple-category Fisher's test
- The assumptions break down if <5 values in a cell
44Chi-Squared Test Biological Example
- Comparing sex across different tumour subtypes
                          Female  Male
Adenocarcinoma               192   250
Squamous Cell Carcinoma      202   261
Small Cell Carcinoma           9    15
Neuroendocrine                12    10
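The sex-by-subtype comparison can be run with chisq.test(). This sketch assumes the counts pair with the subtypes as Adenocarcinoma 192/250, Squamous Cell Carcinoma 202/261, Small Cell Carcinoma 9/15 and Neuroendocrine 12/10 (female/male); the matrix name is illustrative.

```r
# Counts per tumour subtype (rows) and sex (columns), as assumed above
sex_by_subtype <- matrix(
  c(192, 250,
    202, 261,
      9,  15,
     12,  10),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Subtype = c("Adenocarcinoma", "Squamous Cell Carcinoma",
                "Small Cell Carcinoma", "Neuroendocrine"),
    Sex = c("Female", "Male")
  )
)

res <- chisq.test(sex_by_subtype)

# Expected counts under independence; inspect these for small cells
res$expected

# Degrees of freedom: (rows - 1) x (columns - 1)
res$parameter

res$p.value
```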