Title: Pathway Analysis
1Pathway Analysis
2Goals
- Characterize the biological meaning of joint changes in gene expression
- Organize expression (or other) changes into meaningful chunks (themes)
- Identify crucial points in a process where intervention could make a difference
- Why? Biology is redundant! Often sets of genes doing related functions are changed
3Gene Sets
- Gene Ontology
  - Biological Process
  - Molecular Function
  - Cellular Component (location)
- Pathway Databases
  - KEGG
  - BioCarta
  - Broad Institute
4Other Gene Sets
- Transcription factor targets
  - All the genes regulated by particular TFs
- Protein complex components
  - Sets of genes whose protein products function together
  - Ion channel receptors
  - RNA / DNA Polymerase
- Paralogs
  - Families of genes descended (in eukaryotic times) from a common ancestor
5Approaches
- Univariate
  - Derive summary statistics for each gene independently
  - Group statistics of genes by gene group
- Multivariate
  - Analyze covariation of genes in groups across individuals
  - More adaptable to continuous statistics
6Univariate Approaches
- Discrete tests: enrichment for groups in gene lists
  - Select genes differentially expressed at some cutoff
  - For each gene group, cross-tabulate
  - Test for significance (hypergeometric or Fisher test)
- Continuous tests: from gene scores to group scores
  - Compare distribution of scores within each group to random selections
  - GSEA (Gene Set Enrichment Analysis)
  - PAGE (Parametric Analysis of Gene Set Enrichment)
7Multivariate Approaches
- Classical multivariate methods
  - Multi-dimensional Scaling
  - Hotelling's T²
- Informativeness
  - Topological score relative to network
- Prediction by machine learning tool
  - e.g. random forest
8Contingency Table 2 X 2
                     Signif. Genes   NS Genes       Total
Group of Interest    k               n-k            n
Others               K-k             (N-n)-(K-k)    N-n
Total                K               N-K            N
P
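The probability P attached to this table is usually taken as a hypergeometric tail: with the margins fixed, how likely is it to see k or more significant genes in the group of interest? A minimal sketch, assuming a one-sided (over-representation) test and using the table's own symbols k, n, K, N; the function name is illustrative only.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """One-sided P(X >= k): chance that a group of n genes drawn from N genes,
    of which K are significant, contains at least k significant genes."""
    total = comb(N, n)                      # all ways to pick the group
    upper = min(n, K)                       # largest possible overlap
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total

# Example call with made-up counts: 8 of 40 group genes are significant,
# against 500 significant genes among 10000 on the array.
p = enrichment_p_value(k=8, n=40, K=500, N=10000)
```

Exact integer arithmetic keeps the sum accurate, but the number of terms grows with n and K, which is one reason approximations are used for large tables.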
9Categorical Analysis
- Fisher's Exact Test
  - Condition on margins fixed
  - Of all tables with same margins, how many have dependence as or more extreme?
  - Hard to compute when n or k are large
- Approximations
  - Binomial (when k/n is small)
  - Chi-square (when expected values > 5)
  - G² (log-likelihood ratio; compare to χ²)
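The chi-square and G² approximations can be sketched for the same 2 x 2 layout (k, n, K, N as in the contingency table). This is an illustration under stated assumptions: all four margins are non-zero, both statistics are referred to a chi-square distribution with 1 degree of freedom, and the resulting p-values are two-sided. Function names are hypothetical.

```python
from math import erfc, log, sqrt

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))

def approx_tests(k, n, K, N):
    """Return (Pearson chi-square, its p-value, G2, its p-value) for the 2x2 table."""
    observed = [k, n - k, K - k, (N - n) - (K - k)]
    row_totals = [n, n, N - n, N - n]
    col_totals = [K, N - K, K, N - K]
    expected = [r * c / N for r, c in zip(row_totals, col_totals)]
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with a zero count contribute nothing to G2.
    g2 = 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    return pearson, chi2_sf_1df(pearson), g2, chi2_sf_1df(g2)
```

Both statistics rely on large-sample theory, so they are best treated as shortcuts for tables where the exact test is slow, not as replacements for it on small counts.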
10Issues in Assessing Significance
- P-value or FDR?
  - Heuristic only: use FDR
- If a child category is significant, how to assess significance of parent category?
  - Include child category
  - Consider only genes outside child category
- What is the appropriate null distribution?
  - Random sets of genes? Or
  - Random assignments of samples?
11Critiques of Discrete Approach
- No use of information about size of change
  - Continuous procedures usually have twice the power of analogous discrete procedures on discretized continuous data
- No use of covariation: knowing covariation usually improves power of test
13(2003)
14GSEA
- Uses Kolmogorov-Smirnov (K-S) test of
distribution equality to compare t-scores for
selected gene group with all genes
15Update Fixes a Problem
- Sometimes ranks are concentrated in the middle
- Hack: ad hoc weighting by scores emphasizes peaks at the extremes
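A rough sketch of the running-sum statistic behind GSEA may help here. It assumes the weighting takes the form |score|^p, so that p = 0 gives the classic K-S running sum and p = 1 gives score-weighted steps; the function and argument names are illustrative, and the group is assumed to contain at least one gene with a non-zero score and to leave at least one gene outside it.

```python
def enrichment_score(scores, in_group, p=1.0):
    """Signed maximum deviation of a (weighted) K-S running sum.

    scores   : per-gene statistics, e.g. t-scores
    in_group : parallel list of booleans, True if the gene is in the group
    p        : weight exponent; p=0 reproduces the unweighted K-S walk
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hit_total = sum(abs(scores[i]) ** p for i in order if in_group[i])
    n_miss = sum(1 for i in order if not in_group[i])
    running, best = 0.0, 0.0
    for i in order:
        if in_group[i]:
            running += abs(scores[i]) ** p / hit_total   # step up on a hit
        else:
            running -= 1.0 / n_miss                      # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best
```

With p > 0, genes near the middle of the ranking have scores near zero and contribute small steps, so a group clustered there cannot produce a large peak. Significance would still come from a permutation null, which this sketch does not include.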
17Group Z- or T- Scores
- Under the null hypothesis, each gene's z-score (z_i) is distributed N(0,1)
- Hence the sum over genes in a group G is distributed N(0, |G|), if the genes are independent
- Identify which groups have highest scores
- Same issues as discrete
  - Null distribution: permute which indices?
  - Hierarchy
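The group score can be sketched in a few lines. This assumes the normalization Z_G = (sum of z_i over G) / sqrt(|G|), which is N(0,1) under the null only if the gene scores are independent; real expression data are correlated, so the parametric p-value below is a first approximation. Names are illustrative.

```python
from math import erfc, sqrt

def group_z(z_scores, group):
    """Z_G = sum of member z-scores divided by sqrt(group size)."""
    members = [z_scores[g] for g in group]
    return sum(members) / sqrt(len(members))

def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return erfc(abs(z) / sqrt(2.0))

# Example with made-up z-scores for four genes.
z = {"geneA": 1.0, "geneB": 1.0, "geneC": 1.0, "geneD": 1.0}
zg = group_z(z, ["geneA", "geneB", "geneC", "geneD"])
```

Note that positive and negative z-scores cancel in the sum, so a group with both up- and down-regulated members can score near zero.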
18Issues for Pathway Methods
- How to assess significance?
  - Null distribution by permutations
  - Permute genes or samples?
- How to handle activators and inhibitors in the same pathway?
  - Variance Test
  - Other approaches
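One of the two permutation schemes, permuting sample labels, can be sketched as follows. The group statistic used here (mean case-minus-control difference across the group's genes) is an assumption chosen for brevity, not a statistic named on the slide; `expr` is a hypothetical dict mapping each gene to its per-sample values, and `labels` holds 1 for cases and 0 for controls.

```python
import random
from statistics import mean

def group_stat(expr, labels, group):
    """Mean over group genes of (mean in cases - mean in controls)."""
    diffs = []
    for gene in group:
        cases = [x for x, lab in zip(expr[gene], labels) if lab == 1]
        ctrls = [x for x, lab in zip(expr[gene], labels) if lab == 0]
        diffs.append(mean(cases) - mean(ctrls))
    return mean(diffs)

def sample_permutation_p(expr, labels, group, n_perm=1000, seed=0):
    """Two-sided permutation p-value obtained by shuffling sample labels."""
    rng = random.Random(seed)
    observed = abs(group_stat(expr, labels, group))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(group_stat(expr, shuffled, group)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)
```

Shuffling labels keeps each sample's expression profile intact, so gene-gene correlation within the group is preserved in the null. Permuting genes instead would compare the group against random gene sets and would not preserve that correlation.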
19Pathway Analysis of Genotype Data
20The Pathways Proposal
- Complex disease ensues from the malfunction of one or a few specific signaling pathways
- Alternatives
  - Common variants of several genes in the pathway each contribute moderate risk
  - Rare de novo variants confer great risk and persist for generations in LD with typed markers within unidentified subpopulations of the study group
21Approach 1 - Adaptation of GSEA
- Order log-odds ratios or linkage p-values for all SNPs
- Map SNPs to genes, and genes to groups
- Use linkage p-values in place of t-scores in GSEA
- Compare distribution of log-odds ratios for SNPs in group to randomly selected SNPs from the chip
22Possible Association Models
- Each of several genes may have a variant that confers increased RR independent of other genes
- Several genes contribute additively to the malfunction of the pathway
- There are several distinct combinations of gene variants that increase RR, but only modest increases in risk for any single variant
23Approach 2 - Combining p-values
- 1. Compute gene-wise p-value
  - Select most likely variant: best p-value
  - Selected minimum p-value is biased downward
  - Assign gene-wise p-value by permutations (Westfall-Young)
    - Permute samples and compute best p-value for each permutation
    - Compare candidate SNP p-values to this null distribution of best p-values
- 2. Combine p-values by Fisher's method
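Step 2 can be sketched directly. Fisher's method sums -2 ln(p_i) over the m gene-wise p-values and refers the total to a chi-square distribution with 2m degrees of freedom; this assumes the gene-wise p-values are independent and strictly positive, which may not hold for genes in LD. The function name is illustrative.

```python
from math import exp, log

def fisher_combine(p_values):
    """Return (statistic, combined p-value) for Fisher's method."""
    m = len(p_values)
    x = -2.0 * sum(log(p) for p in p_values)
    # For even degrees of freedom 2m the chi-square upper tail has a closed form:
    # exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, m):
        term *= half / j
        total += term
    return x, exp(-half) * total

# Example with made-up gene-wise p-values.
stat, combined = fisher_combine([0.04, 0.20, 0.01, 0.50])
```

The closed form avoids any dependency on a statistics library; with a single p-value it returns that p-value unchanged, which is a quick sanity check.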
24Methods 2
- Additive model
  - Where n_i counts the number of B alleles of a SNP in gene i in the gene set G
- Select subset of most likely SNPs
- Fit by logistic regression (glm() in R)
- Significance by permutations
  - Permute sample outcomes
  - Select genes and fit logistic regression again
  - Assess goodness of fit each time
  - Compare observed goodness of fit to the permutation distribution
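The model equation did not survive in this transcript. One plausible form, given that it is additive in the allele counts and fitted by logistic regression, is written out below. The symbols Y (case status), \beta_0 and \beta_i are assumptions introduced here, and the original may have used a different parameterization (for example, a single shared slope).

```latex
\operatorname{logit} P(Y = 1 \mid n_1, \dots, n_{|G|})
  = \beta_0 + \sum_{i \in G} \beta_i \, n_i ,
\qquad n_i \in \{0, 1, 2\}
```

Under this reading, each copy of allele B in gene i shifts the log-odds of disease by \beta_i, and the contributions of the genes in G add up on the log-odds scale.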
25Multivariate Approaches to Gene Set Analysis
26Key Multivariate Ideas
- PCA (Principal Components Analysis)
- SVD (Singular Value Decomposition)
- MDS (Multi-dimensional Scaling)
- Hotelling's T²
27PCA
PC1 lies along the direction of maximal variation; PC2 lies at right angles to it, along the direction of next-highest variation.
Figure: three correlated variables
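To make the idea concrete, the first principal component can be found as the leading eigenvector of the sample covariance matrix. The sketch below uses plain power iteration to stay dependency-free; it assumes the covariance matrix is non-zero and that the starting vector is not orthogonal to the leading direction. Function names are illustrative.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix of rows (one row per sample, one column per variable)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(cov, n_iter=500):
    """Leading eigenvector of cov and the variance along it, by power iteration."""
    p = len(cov)
    v = [1.0 / sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(p) for b in range(p))
    return v, variance

# Three perfectly correlated variables: all variation lies along (1, 2, -1).
data = [[t, 2 * t, -t] for t in range(5)]
pc1, var1 = first_component(covariance_matrix(data))
```

PC2 could be obtained the same way after removing the PC1 direction from the covariance matrix (deflation); in practice a library SVD routine would normally be used instead.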
28Multi-Dimensional Scaling
- Aim: to represent graphically the most information about relationships among samples with multi-dimensional attributes in 2 (or 3) dimensions
- Algorithm
  - Transform distances into cross-product matrix
  - Initial PCA onto 2 (or 3) axes
  - Deform until better representation
  - Minimize strain measure
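The first algorithm step, turning distances into a cross-product matrix, is commonly done by double centering the squared distances. A minimal sketch, assuming a symmetric matrix of Euclidean distances between samples; the function name is illustrative.

```python
def cross_product_matrix(dist):
    """Double centering: B = -1/2 * J * D^2 * J, with J = I - (1/n) * ones."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]   # equals column means by symmetry
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]

# Three samples at positions 0, 1 and 3 on a line.
dist = [[0, 1, 3],
        [1, 0, 2],
        [3, 2, 0]]
B = cross_product_matrix(dist)
```

For Euclidean distances, B holds the inner products of the mean-centered sample coordinates, so its leading eigenvectors give the initial PCA layout that the later steps then deform.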
29Separating Using MDS
Left: distributions of individual variables. Right: MDS plot (in this case PCA).
30Multivariate Approaches to Selection
- Visualizing differences by MDS
- Hotelling's T-squared
31MDS for Pathways
- BAD pathway
- Normal
- IBC
- Other BC
- Clear separation between groups
- Variation differences
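32Hotelling's T²
- Compute distance between sample means using a (common) metric of covariation
- $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)$
- Where S is the pooled (common) covariance matrix and n_1, n_2 are the group sizes
- Multidimensional analog of the t (actually F) statistic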
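A small sketch of the two-sample statistic in its standard textbook form: the squared distance between the group mean vectors, measured with the inverse of the pooled covariance matrix and scaled by n1*n2/(n1+n2). It assumes the pooled covariance is invertible, which requires more samples than genes; the conversion to F uses the usual factor (n1+n2-p-1) / (p(n1+n2-2)). Helper names are illustrative.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows)
             for b in range(p)] for a in range(p)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def hotelling_t2(group1, group2):
    """Return (T2, F); F has (p, n1 + n2 - p - 1) degrees of freedom."""
    n1, n2, p = len(group1), len(group2), len(group1[0])
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    pooled = [[(s1[a][b] + s2[a][b]) / (n1 + n2 - 2) for b in range(p)]
              for a in range(p)]
    d = [x - y for x, y in zip(m1, m2)]
    t2 = n1 * n2 / (n1 + n2) * sum(di * wi for di, wi in zip(d, solve(pooled, d)))
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f
```

With a single variable (p = 1) the statistic reduces to the square of the pooled two-sample t statistic, which is a convenient check on any implementation.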
33Principles of Kong et al Method
- Normal covariation generally acts to preserve homeostasis
- The transcription of genes that participate in many processes will be changed
- The joint changes in genes will be most distinctive for those genes active in pathways that are working differently
34Critiques of Hotelling's T²
- Not robust to outliers
- Assumes same covariance in each sample
  - S1 = S2? Usually not in disease
- Small samples: unreliable S estimates
  - N < p