Canadian Bioinformatics Workshops
Provided by: Michael3390

1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Lecture 3: Univariate Analyses, Discrete Data
MBP1010H, Dr. Paul C. Boutros

Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others

4
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I: continuous
  • Lecture 3: Univariate Analyses II: discrete
  • Lecture 4: Multivariate Analyses I: specialized
    models
  • Lecture 5: Multivariate Analyses II: general
    models
  • Lecture 6: Microarray Analysis I: Pre-Processing
  • Lecture 7: Microarray Analysis II:
    Multiple-Testing
  • Lecture 8: Data Visualization & Machine-Learning
  • Lecture 9: Sequence Analysis Basics
  • Final Exam (written)

5
How Will You Be Graded?
  • 9%: Participation (1% per week)
  • 56%: Assignments (8 x 7% each)
  • 35%: Final Examination (in-class)
  • Each individual will get their own, unique
    assignment
  • Assignments will all be in R, and will be graded
    according to computational correctness only (i.e.
    does your R script yield the correct result when
    run)
  • Final Exam will include multiple-choice and
    written answers

6
Review From Lecture 1
  • Population vs. Sample

All MBP Students = Population; MBP Students in
1010 = Sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don't we use Excel/spreadsheets?
Input errors, reproducibility, wrong results
7
Review From Lecture 2
  • Define discrete data

Gaps on the number-line (continuous data have no gaps)
What is the central limit theorem?
A random variable that is the sum of many small, independent
random variables is approximately normally distributed
Theoretical vs. empirical quantiles
Probability vs. percentage of values less than p
Components of a boxplot?
25% - 1.5 x IQR, 25%, 50%, 75%, 75% + 1.5 x IQR
8
Boxplot
Descriptive statistics can be intuitively
summarized in a boxplot.
> boxplot(x)
Figure labels (top to bottom): 1.5 x IQR, 75% quantile, Median,
25% quantile, 1.5 x IQR. The box spans the IQR.
Everything more than 1.5 x IQR above the 75% quantile or below
the 25% quantile is considered an "outlier".
IQR = Inter-Quantile Range = 75% quantile - 25% quantile
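The boxplot fences can also be computed by hand. This is a minimal sketch on simulated data (the sample and the seed are illustrative assumptions, not from the lecture). Note that boxplot() itself uses hinges, which can differ slightly from quantile().

```r
set.seed(1)                       # illustrative seed
x <- c(rnorm(100), 6)             # simulated data plus one extreme value

q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])        # same value as IQR(x)

lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Points beyond the fences are the ones flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
```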
9
Review From Lecture 2
  • How can you interpret a QQ plot?

Compares two samples or a sample and a
distribution. Straight line indicates identity.
What is hypothesis testing?
Confirmatory data-analysis: test a null hypothesis
What is a p-value?
Evidence against the null: probability of a FP,
i.e. the probability of seeing as extreme a value by
chance alone
10
Review From Lecture 2
  • Parametric vs. non-parametric tests

Parametric tests have distributional assumptions
What is the t-statistic?
Signal:Noise ratio
Assumptions of the t-test?
Data sampled from a normal distribution;
independence of replicates; independence of
groups; homoscedasticity
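The signal-to-noise reading of the t-statistic can be checked by hand. This is a sketch on simulated data (group sizes, means and seed are assumptions for illustration), using the equal-variance form of the two-sample test.

```r
set.seed(2)                               # illustrative seed
x <- rnorm(10, mean = 1)                  # simulated group 1
y <- rnorm(10, mean = 0)                  # simulated group 2

signal <- mean(x) - mean(y)               # difference in means
# Pooled standard deviation (equal-variance assumption)
sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
           (length(x) + length(y) - 2))
noise <- sp * sqrt(1 / length(x) + 1 / length(y))   # standard error

t_manual <- signal / noise
t_r <- unname(t.test(x, y, var.equal = TRUE)$statistic)
```

Both t_manual and t_r should agree up to floating-point error.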
11
Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
  • Yes: go to Equal Variance (F-Test)?
  • No: Sufficient n for CLT (>30)?
    Yes: go to Equal Variance (F-Test)?
    No: Wilcoxon U-Test
Equal Variance (F-Test)?
  • Yes: Homoscedastic T-Test
  • No: Heteroscedastic T-Test
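The flow chart can be sketched as a small R function. The helper name, the use of shapiro.test() as the normality check and the alpha = 0.05 cut-off are all assumptions for illustration; the lecture does not say how normality should be assessed.

```r
# Hypothetical helper following the flow chart; shapiro.test() and
# alpha = 0.05 are assumptions, not choices stated in the lecture.
choose_two_sample_test <- function(x, y, alpha = 0.05) {
  looks_normal <- shapiro.test(x)$p.value > alpha &&
                  shapiro.test(y)$p.value > alpha
  enough_n <- min(length(x), length(y)) > 30   # rule of thumb for the CLT
  if (!looks_normal && !enough_n) {
    return(wilcox.test(x, y))                  # Wilcoxon U-test
  }
  equal_var <- var.test(x, y)$p.value > alpha  # F-test for equal variance
  t.test(x, y, var.equal = equal_var)          # homo- or heteroscedastic t-test
}
```

Calling choose_two_sample_test(x, y) returns the usual htest object, so its method field shows which branch was taken.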
12
Topics For This Week
  • Correlations
  • ceRNAs
  • Attendance
  • Common discrete univariate analyses

13
Power, error rates and decision
Power calculation in R
> power.t.test(n = 5, delta = 1, sd = 2, alternative = "two.sided", type = "one.sample")

     One-sample t test power calculation

              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided

Other tests are available; see ??power.
14
Power, error rates and decision
PR(False Negative) = PR(Type II error)
PR(False Positive) = PR(Type I error)
Figure labels: µ0, µ1
Let's Try Some Power Analyses in R
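One way to explore power in R is to vary the sample size while holding the effect fixed. This sketch reuses delta = 1 and sd = 2 from the power.t.test() call in this lecture; the list of sample sizes and the 80% target are illustrative assumptions.

```r
# Power of a one-sample, two-sided t-test for several sample sizes
sample_sizes <- c(5, 10, 20, 50, 100)
power_by_n <- sapply(sample_sizes, function(n) {
  power.t.test(n = n, delta = 1, sd = 2,
               alternative = "two.sided", type = "one.sample")$power
})
names(power_by_n) <- sample_sizes

# Solve for the n needed to reach 80% power (leave n unspecified)
n_needed <- power.t.test(power = 0.8, delta = 1, sd = 2,
                         alternative = "two.sided", type = "one.sample")$n
```

Printing power_by_n shows how quickly power rises with n; n_needed is fractional and would be rounded up in practice.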
15
Problem
  • When we measure more than one variable for
    each member of a population, a scatter plot may
    show us that the values are not completely
    independent: there is, e.g., a trend for one
    variable to increase as the other increases.
  • Regression analyses assess the dependence.
  • Examples
  • Height vs. weight
  • Gene dosage vs. expression level
  • Survival analysis: probability of death vs. age

16
Correlation
When one variable depends on the other, the
variables are to some degree correlated. (Note:
correlation need not imply causality.) In R, the
function cov() measures covariance and cor()
measures the Pearson coefficient of correlation
(a normalized measure of covariance). Pearson's
coefficient of correlation ranges from -1
to 1, with 0 indicating no linear correlation.
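The statement that Pearson's coefficient is a normalized covariance can be verified directly. This is a sketch on simulated data (the slope, sample size and seed are assumptions for illustration).

```r
set.seed(3)                      # illustrative seed
x <- rnorm(50)
y <- 2 * x + rnorm(50)           # simulated, linearly related to x

cov_xy <- cov(x, y)
# Pearson's r is the covariance scaled by both standard deviations
r_manual <- cov_xy / (sd(x) * sd(y))
r_builtin <- cor(x, y)           # method = "pearson" is the default
```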
17
Pearson's Coefficient of Correlation
How to interpret the correlation coefficient
Explore varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.9999666
18
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.9661111
19
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.4
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6652423
20
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.01
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.01232522
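Note that r in these slides is a mixing weight, not the correlation itself. Assuming the construction y = r*x + (1-r)*noise with independent standard-normal x and noise, the expected Pearson correlation is r / sqrt(r^2 + (1-r)^2), which is why r = 0.8 gives a correlation near 0.97. The helper names below are hypothetical.

```r
# Hypothetical helper: simulate one data set and return its Pearson correlation
simulated_cor <- function(r, n = 50) {
  x <- rnorm(n)
  y <- (r * x) + ((1 - r) * rnorm(n))
  cor(x, y)
}

# Expected correlation for independent standard-normal x and noise
expected_cor <- function(r) r / sqrt(r^2 + (1 - r)^2)

set.seed(4)                                  # illustrative seed
r_values <- c(0.99, 0.8, 0.4, 0.01)
comparison <- data.frame(
  r         = r_values,
  simulated = sapply(r_values, simulated_cor),
  expected  = expected_cor(r_values)
)
```

With only 50 points the simulated values scatter around the expected ones; raising n tightens the agreement.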
21
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.3438495
22
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # polynomial ...
> y <- (r * x*x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] -0.5024503
23
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # exponential
> y <- (r * exp(5*x)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6334732
24
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> b <- (r * sin(x*pi)) + ((1-r) * rnorm(50))
> plot(a,b); cor(a,b)
[1] 0.04531711
25
Correlation coefficient
26
Other Correlations
  • There are many other types of correlations
  • Spearman's correlation
  • rho
  • Kendall's correlation
  • Tau
  • Spearman is a Pearson on ranked values
  • Spearman rho = 1 means a monotonic relationship
  • Pearson R = 1 means a linear relationship
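The difference between the coefficients shows up on data that are monotonic but not linear. This is a sketch on simulated data (the exponential curve and the seed are assumptions for illustration), using the method argument of cor().

```r
set.seed(5)                       # illustrative seed
x <- runif(50, -1, 1)
y <- exp(5 * x)                   # monotonic but strongly non-linear

pearson  <- cor(x, y)                        # below 1: not a straight line
spearman <- cor(x, y, method = "spearman")   # 1: perfectly monotonic
kendall  <- cor(x, y, method = "kendall")    # also 1 for a monotonic relation

# Spearman's rho is Pearson's r computed on the ranks
spearman_by_hand <- cor(rank(x), rank(y))
```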

27
When Do We Use Statistics?
  • Ubiquitous in modern biology
  • Every class I will show a use of statistics in a
    (very, very) recent Nature paper.

January 9, 2014
28
Non-Small Cell Lung Cancer 101
Lung Cancer: 15% 5-year survival
  • Non-Small Cell: 80% of lung cancer
    Subtypes: Large Cell (and others), Squamous Cell
    Carcinomas, Adenocarcinomas
  • Small Cell
29
Non-Small Cell Lung Cancer 102
  • Stage I: Local Tumour Only (IA: small tumour; IB: large tumour)
  • Stage II: Local Lymph Nodes
  • Stage III: Distal Lymph Nodes
  • Stage IV: Metastasis
30
General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011
31
Test Multiple Constructs for Activity
32
What Statistical Analysis Did They Do?
  • No information given in main text!
  • Figure legend says:
  • "Values are technical triplicates, have been
    performed independently three times, and
    represent mean +/- standard deviation (s.d.) with
    propagated error."
  • In supplementary they say:
  • "Unless otherwise specified, statistical
    significance was assessed by the Student's
    t-test"
  • So, what would you do differently?

33
Attendance Break
34
Let's Go Back to Discrete vs. Continuous
  • Definition?
  • Let's take a few examples of discrete univariate
    statistical analyses in biology and write them
    down here
  • RNA abundance
  • Colony formation
  • Number of mice
  • Peptide counts

35
Four Main Discrete Univariate Tests
  • Hypergeometric test
  • Is a sample randomly selected from a fixed
    population?
  • Proportion test
  • Are two proportions equivalent?
  • Fisher's Exact test
  • Are two binary classifications associated?
  • (Pearson's) Chi-Squared Test
  • Are paired observations on two variables
    independent?

36
Hypergeometric Test
  • Is a sample randomly selected from a fixed
    population?
  • Closer to discrete mathematics than statistics
  • Technically sampling without replacement
  • In R: ?phyper
  • Classic example: marbles
  • Less classic: poker

5/24 are yellow
1/6 sampled are yellow
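The marble numbers can be plugged into R's hypergeometric functions. This is a sketch assuming the figure means 5 yellow marbles out of 24 in total and 1 yellow among 6 sampled; the variable names are illustrative.

```r
yellow   <- 5         # yellow marbles in the bag
other    <- 24 - 5    # non-yellow marbles
drawn    <- 6         # marbles sampled without replacement
observed <- 1         # yellow marbles in the sample

# Probability of exactly 1 yellow marble in the sample
p_exact <- dhyper(observed, yellow, other, drawn)
# Probability of 1 or fewer yellow marbles (lower tail)
p_at_most <- phyper(observed, yellow, other, drawn)
# Probability of 1 or more yellow marbles (upper tail)
p_at_least <- phyper(observed - 1, yellow, other, drawn, lower.tail = FALSE)
```

Which tail to report depends on the question: depletion of yellow marbles uses the lower tail, enrichment the upper tail.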
37
Hypergeometric Test Biological Example
  • Classic example in genomics: pathway analysis
  • I do a screen and identify n genes associated
    with something
  • Are those n genes biased towards a pathway?
  • Well, a pathway contains m genes
  • So is n a random selection of m? Hypergeometric
    test!
  • Similar example: drug screening
  • I test 1000 drugs to see which ones kill a
    cell-line
  • 100 of these are kinase inhibitors
  • 100 drugs kill my cell-line
  • 30 of these are kinase inhibitors
  • Did I find more kinase inhibitors than expected
    by chance?
  • Let's do the calculation
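One way to do the drug-screening calculation in R is with phyper(). The counts come from the slide; the variable names are illustrative, and the one-sided (enrichment) tail is an assumption about what "more than expected" should mean.

```r
total_drugs    <- 1000
kinase_total   <- 100    # kinase inhibitors in the library
killers        <- 100    # drugs that kill the cell-line
kinase_killers <- 30     # kinase inhibitors among the killers

# Expected number of kinase inhibitors among the killers under random selection
expected <- killers * kinase_total / total_drugs

# P(X >= 30): note the "- 1", because lower.tail = FALSE gives P(X > q)
p_value <- phyper(kinase_killers - 1, kinase_total,
                  total_drugs - kinase_total, killers, lower.tail = FALSE)
```

Here expected is 10, so finding 30 kinase inhibitors is a three-fold enrichment, and p_value is correspondingly very small.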

38
Hypergeometric Venn Diagram Overlap
Let's pretend X and Y are sets of genes (or
drugs, etc.) found in two separate
experiments. We want to know: is there more
overlap than expected by chance? To do this:
Total Balls = total number of genes considered
(but a gene must be analyzed in both experiments;
exclude those studied in only one)
Black Balls = all genes found in experiment X
White Balls = all genes not found in experiment X
Sample = all genes found in experiment Y
Exercise: can you calculate an effect-size?
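The ball-and-urn mapping can be sketched as an R function. The helper name and the gene identifiers are made up for illustration, and fold enrichment (observed overlap divided by expected overlap) is only one possible effect-size.

```r
# Hypothetical helper; gene identifiers below are made up for illustration.
overlap_test <- function(found_x, found_y, universe) {
  found_x <- intersect(found_x, universe)   # keep only genes analyzed in both
  found_y <- intersect(found_y, universe)
  overlap <- length(intersect(found_x, found_y))
  black <- length(found_x)                  # genes found in experiment X
  white <- length(universe) - black         # genes not found in experiment X
  drawn <- length(found_y)                  # sample: genes found in experiment Y
  expected <- drawn * black / length(universe)
  list(
    overlap  = overlap,
    expected = expected,
    fold     = overlap / expected,          # one possible effect-size
    p_value  = phyper(overlap - 1, black, white, drawn, lower.tail = FALSE)
  )
}

universe <- paste0("g", 1:100)
res <- overlap_test(paste0("g", 1:20), paste0("g", 11:30), universe)
```

In this toy case the two sets of 20 genes share 10, against 4 expected by chance.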
39
Proportion Test
  • Are two proportions equivalent?
  • Example: is the fraction of people who play
    hockey in MBP different from the fraction who
    play hockey in Mathematics?
  • Mathematics: 12/85
  • MBP: 24/135
  • In R: prop.test
  • Only useful for two-group studies
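The hockey example can be run with prop.test(), which takes the counts of "successes" and the group sizes. A minimal sketch using the numbers from the slide; the vector names are illustrative.

```r
hockey  <- c(Mathematics = 12, MBP = 24)    # people who play hockey
members <- c(Mathematics = 85, MBP = 135)   # people in each department

res <- prop.test(hockey, members)   # two-sided, with continuity correction
res$estimate                        # the two observed proportions
res$p.value
```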

40
Proportion Test Biological Example
  • Does the frequency of TP53 mutations differ
    between prostate cancer patients who will suffer
    a recurrence and those who will not?
  • 12/150 patients whose tumours recur have mutated
    TP53
  • 50/921 patients whose tumours do not recur have
    mutated TP53
  • P-value guesses?
  • What if it is 100/921?
  • What if it is 5/10 vs. 1/10?
  • 6/10 vs. 1/10?
  • 7/10 vs. 1/10?
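After guessing, the scenarios can be checked in one go. This sketch assumes "100/921" replaces 50/921 in the non-recurrent group; the helper name is hypothetical. For the 10-vs-10 scenarios R warns that the chi-squared approximation may be incorrect, which is a hint that an exact test may be more appropriate there.

```r
# Hypothetical helper: p-value for comparing two proportions a/n1 vs. b/n2
two_prop_p <- function(a, n1, b, n2) {
  # suppressWarnings(): R warns that the chi-squared approximation
  # may be poor for the small 10-vs-10 scenarios
  suppressWarnings(prop.test(c(a, b), c(n1, n2))$p.value)
}

p_values <- c(
  "12/150 vs 50/921"  = two_prop_p(12, 150, 50, 921),
  "12/150 vs 100/921" = two_prop_p(12, 150, 100, 921),
  "5/10 vs 1/10"      = two_prop_p(5, 10, 1, 10),
  "6/10 vs 1/10"      = two_prop_p(6, 10, 1, 10),
  "7/10 vs 1/10"      = two_prop_p(7, 10, 1, 10)
)
```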

41
Fisher's Exact Test
  • Are two binary categorizations associated?
  • Based on a contingency table
  • What are these? Have we seen any before?
  • In R: ?fisher.test
  • Classic example: drinking tea

Dr. Muriel Bristol claimed to be able to taste
whether tea or milk was added first to a cup. Dr.
Ronald Fisher didn't believe her.
0
4
0
4
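The tea-tasting experiment maps directly onto fisher.test(). This sketch assumes, for illustration, 8 cups with 4 of each kind and a perfect set of guesses; the labels are made up.

```r
# Assumed outcome for illustration: 8 cups, 4 of each kind, all guessed correctly
tea <- matrix(c(4, 0,
                0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(Guess = c("Milk first", "Tea first"),
                              Truth = c("Milk first", "Tea first")))

# One-sided test: is she better than chance?
res <- fisher.test(tea, alternative = "greater")
res$p.value   # equals 1 / choose(8, 4), the chance of a perfect guess
```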
42
Fisher's Exact Test Biological Example
  • You can use this any time you form a contingency
    table
  • Any time you make predictions (biomarkers)
  • Any time you compare two binary phenomena
  • Examples?
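As one possible example, a biomarker prediction can be cross-tabulated against the observed outcome with table() and passed to fisher.test(). The 12 patients below are entirely made up for illustration.

```r
# Made-up predictions and outcomes for 12 hypothetical patients
predicted <- c("high", "high", "high", "high", "high", "high",
               "low",  "low",  "low",  "low",  "low",  "low")
observed  <- c("recur", "recur", "recur", "recur", "recur", "none",
               "none",  "none",  "none",  "none",  "recur", "none")

contingency <- table(predicted, observed)   # 2 x 2 contingency table
res <- fisher.test(contingency)
```

The result also carries an odds-ratio estimate (res$estimate), which can serve as an effect-size alongside the p-value.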

43
(Pearson's) Chi-Squared Test
  • Are two variables independent?
  • There are a lot of different chi-squared tests.
    Why?
  • Pearson
  • Yates
  • McNemar
  • Portmanteau test
  • In R: ?chisq.test
  • You can think of it as a multiple-category
    Fisher's test
  • The assumptions break down if <5 values in a cell

44
Chi-Squared Test Biological Example
  • Comparing sex across different tumour subtypes

                           Female   Male
Adenocarcinoma                192    250
Squamous Cell Carcinoma       202    261
Small Cell Carcinoma            9     15
Neuroendocrine                 12     10
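The sex-by-subtype comparison can be run with chisq.test(). This sketch assumes each pair of counts belongs to the subtype listed next to it (Female first, Male second); that pairing is an assumption.

```r
# Counts from the slide; the pairing of counts to subtypes is an assumption.
tumours <- matrix(c(192, 250,
                    202, 261,
                      9,  15,
                     12,  10),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(
                    Subtype = c("Adenocarcinoma", "Squamous Cell Carcinoma",
                                "Small Cell Carcinoma", "Neuroendocrine"),
                    Sex = c("Female", "Male")))

res <- chisq.test(tumours)
res$expected    # check that no expected count falls below 5
res$p.value
```

With four subtypes and two sexes the test has (4 - 1) x (2 - 1) = 3 degrees of freedom.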
45
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I: continuous
  • Lecture 3: Univariate Analyses II: discrete
  • Lecture 4: Multivariate Analyses I: specialized
    models
  • Lecture 5: Multivariate Analyses II: general
    models
  • Lecture 6: Microarray Analysis I: Pre-Processing
  • Lecture 7: Microarray Analysis II:
    Multiple-Testing
  • Lecture 8: Data Visualization & Machine-Learning
  • Lecture 9: Sequence Analysis Basics
  • Final Exam (written)