Title: Revealing Biological Modules via Graph Summarization
1 Revealing Biological Modules via Graph Summarization
- Saket Navlakha
- Computer Science Department
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park, USA
- Joint work with Michael C. Schatz and Carl Kingsford
2 Introduction
- High-throughput methods are producing large amounts of protein interaction data for many organisms
- Analyze networks to recover complexes or proteins involved in the same biological process
- Common approach
- Partition network into clusters or modules
- Use modules to predict annotations
[Figure: Yeast protein interaction network]
3 Module-assisted predictions
[Figure: Module Finding Algorithm]
e.g. Newman 2003; King et al. 2004; Bader and Hogue 2003; Blatt et al. 2006; van Dongen 1998
4 Our module-finding approach
- Want modules with proteins having similar cellular roles
- Similar interaction partners imply similar cellular roles
- Similar interaction partners also imply redundancy in the graph
- Redundancy implies compressibility!
[Figure: compressed representation]
5 Graph Summarization (GS)
Goal: produce a compressed representation of the input graph G
[Figure: input graph G on nodes a to j and its compressed representation]
Summary (modules): X = {a,b,c}, Y = {d,e,f,g}
Corrections: +(a,h), +(c,i), +(c,j), -(a,d)
6 GS representation
- Summary S highlights dominant trends, easy to visualize
- Supernode: set of nodes with common neighbors
- Identifies bipartite cores, cliques, stars, etc.
- Corrections C: exceptional/unusual edges
- Handles noise
- Minimize total cost $|S| + |C|$
- (based on MDL principle)
[Figure: example graph; cost: 14 edges]
Summary: X = {d,e,f,g}, Y = {a,b,c}
Cost: 5 (1 superedge + 4 corrections)
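As a worked check of the numbers on this slide, the cost of a representation can be read as the number of superedges plus the number of corrections. The symbol $E_S$ for the superedge set is borrowed from the "How to compress?" slide, and $E_G$ for the edge set of G is an assumed name used only here.

```latex
\begin{align*}
\text{cost}(G)    &= |E_G| = 14 \\
\text{cost}(S, C) &= |E_S| + |C| = 1 + 4 = 5
\end{align*}
```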
7 Greedy GS
- Bottom-up, iterative merging
- Start with S = G
- At every step, merge the pair which maximally reduces cost
- If no pair reduces cost, stop
[Figure: example graph on nodes a to h; cost reduction: 11 to 6]
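The merging loop above can be sketched in Python as follows. This is a minimal brute-force sketch, not the authors' implementation: it recomputes the full representation cost for every candidate merge and uses the absolute cost reduction described on this slide. The function names `representation_cost` and `greedy_summarize` and the small example graph are assumptions made for illustration.

```python
from itertools import combinations


def representation_cost(edges, partition):
    """Cheapest cost of encoding `edges` for a fixed partition into supernodes.

    For each pair of supernodes (and each supernode with itself), take the
    cheaper of: listing the actual edges individually, or one superedge plus
    one correction per missing pair.
    """
    edge_set = {frozenset(e) for e in edges}
    total = 0
    for i in range(len(partition)):
        for j in range(i, len(partition)):
            x, y = partition[i], partition[j]
            if i == j:
                pairs = {frozenset(p) for p in combinations(x, 2)}
            else:
                pairs = {frozenset((a, b)) for a in x for b in y}
            actual = len(pairs & edge_set)
            total += min(actual, 1 + len(pairs) - actual)
    return total


def greedy_summarize(edges):
    """Bottom-up merging: start with S = G, stop when no merge reduces cost."""
    nodes = sorted({v for e in edges for v in e})
    partition = [frozenset([v]) for v in nodes]  # every node is its own supernode
    cost = representation_cost(edges, partition)
    while True:
        best = None
        for x, y in combinations(partition, 2):
            merged = [g for g in partition if g not in (x, y)] + [x | y]
            new_cost = representation_cost(edges, merged)
            if new_cost < cost and (best is None or new_cost < best[0]):
                best = (new_cost, merged)
        if best is None:  # no pair reduces the cost
            return partition, cost
        cost, partition = best


# Hypothetical example: a complete bipartite graph between {u1, u2} and {x, y, z}.
edges = [(u, x) for u in ("u1", "u2") for x in ("x", "y", "z")]
partition, cost = greedy_summarize(edges)
print(cost, [sorted(g) for g in partition])
```

On this hypothetical graph the cost drops from 6 edges to a single superedge between the two merged groups.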
8 Application to yeast network
- Network: IntAct, 5,492 proteins, 40,332 interactions
- Complexes: MIPS, 266 leaf level
- Biological Process: GO, 182 terms (Myers et al. 2006)
- Comparisons
- Graph Summarization
- Markov Clustering Algorithm (MCL)
- Molecular Complex Detection Algorithm (MCODE)
- Newman Modularity
9 Module-assisted predictions
- Test quality of modules (supernodes)
- Compute hypergeometric enrichment of modules
- Use modules for annotation prediction (complexes, biological processes)
10 1. Annotation Enrichment
Graph summarization: a larger variety of annotations is enriched in some module (P < 0.001)
(A lower percentage of GS modules are enriched for some annotation, but this is not indicative of predictive performance.)
11 2. Making predictions
- Cross-validation (leave one out)
- Module-assisted annotation transfer schemes
- Majority: transfer if > 50% of annotated proteins have the annotation
- Plurality: transfer the most common annotation(s)
- Hypergeometric: transfer the statistically enriched annotations
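The majority and plurality schemes can be illustrated with a short Python sketch. It assumes each module is given as a list of annotation sets, one per protein, with an empty set for unannotated proteins; the function names and the labels in the example are hypothetical. The hypergeometric scheme needs an enrichment P-value and is left out here.

```python
from collections import Counter


def annotation_counts(module_annotations):
    """Count labels over the annotated proteins of one module."""
    annotated = [labels for labels in module_annotations if labels]
    counts = Counter(label for labels in annotated for label in labels)
    return counts, len(annotated)


def majority_transfer(module_annotations):
    """Labels carried by more than 50% of the annotated proteins."""
    counts, n_annotated = annotation_counts(module_annotations)
    return {label for label, c in counts.items() if c > n_annotated / 2}


def plurality_transfer(module_annotations):
    """The most common label(s); ties are all returned."""
    counts, _ = annotation_counts(module_annotations)
    if not counts:
        return set()
    top = max(counts.values())
    return {label for label, c in counts.items() if c == top}


# Hypothetical module: two proteins with label "A", one with "B", one unannotated.
module = [{"A"}, {"A"}, {"B"}, set()]
print(majority_transfer(module))   # {'A'}
print(plurality_transfer(module))  # {'A'}
```

With a module split evenly between two labels, the majority scheme transfers nothing while the plurality scheme transfers both.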
12 [Figure panels: Complexes; Biological Processes]
All GS predictions are Pareto optimal
13 Conclusion
- Graph Summarization is good for annotation prediction
- Confirmed on complexes and biological processes
- Preferable to MCL, MCODE, NSP
- Advantages
- MDL-based graph compression, no parameters
- Modules group proteins which have similar interaction partners
- Generalizes bipartite cores, cliques, stars; handles noise
- More in paper
- Similar results for high-confidence network
- Also used GS corrections to predict co-complexed pairs using only topology; fared favorably against Defective Cliques (Yu et al. 2006)
- Many unique predictions made by each algorithm
14 Acknowledgements
Nisheeth Shrivastava: Research Scientist, Bell Labs Research, India
Rajeev Rastogi: VP and Head, Yahoo Research, India
Carl Kingsford: Assistant Professor, University of Maryland
PhD student, University of Maryland
15 Additional slides
16 Enrichment of Modules
- Want modules to contain proteins predominantly with the same complex or biological process
- Measure enrichment via hypergeometric P-value
- n: nodes in the network
- m: nodes in M
- f: nodes in network annotated with F
- k: nodes in M annotated with F
- Threshold: 0.001
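A minimal sketch of the enrichment test, using the variables n, m, f and k defined above. The slide text does not preserve the formula itself, so the standard upper-tail form, the probability of seeing at least k annotated nodes when drawing m nodes without replacement, is assumed here. The function name and the numbers in the example are hypothetical, apart from the network size and the 0.001 threshold.

```python
from math import comb


def hypergeom_pvalue(n, m, f, k):
    """P(X >= k) when m nodes are drawn from n, of which f carry annotation F."""
    upper = min(m, f)
    tail = sum(comb(f, i) * comb(n - f, m - i) for i in range(k, upper + 1))
    return tail / comb(n, m)


# Hypothetical module: 10 proteins, 5 of them annotated with a term that
# 20 of the 5,492 proteins in the network carry.
p = hypergeom_pvalue(n=5492, m=10, f=20, k=5)
print(p < 0.001)  # True: the module would count as enriched for this term
```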
17 1. Enrichment of modules
18 1. Module enrichment
- Graph Summarization
- Variety of enriched annotations (P < 0.001)
- Fewer modules enriched, but not indicative of predictive performance
[Figure panels: Complexes; Biological Processes; GS marked in each]
19 2. Making predictions
- Leave-one-out cross-validation
- Module-assisted annotation prediction
20 2. Making predictions
- Leave-one-out cross-validation
- Module-assisted annotation prediction for annotation A in module M
- Majority: transfer annotation A if > 50% of annotated proteins in module M have A
- Plurality: transfer most common A in M
- Hypergeometric: transfer all enriched A in M (P < 0.001)
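One way to read the leave-one-out evaluation is sketched below, using the majority scheme. This is an illustration under assumptions rather than the evaluation code behind the talk: modules are taken as fixed (not recomputed per held-out protein), annotations are a dictionary from protein name to a set of labels, and precision is counted per predicted label. All names and the toy data are hypothetical.

```python
from collections import Counter


def majority_labels(label_sets):
    """Labels carried by more than 50% of the annotated proteins."""
    annotated = [labels for labels in label_sets if labels]
    counts = Counter(label for labels in annotated for label in labels)
    return {label for label, c in counts.items() if c > len(annotated) / 2}


def leave_one_out(modules, annotations):
    """Hide each annotated protein in turn and predict from its module mates."""
    predictions = {}
    for module in modules:
        for protein in module:
            if not annotations.get(protein):
                continue  # no known annotation to compare against
            others = [annotations.get(p, set()) for p in module if p != protein]
            predictions[protein] = majority_labels(others)
    return predictions


def precision(predictions, annotations):
    """Fraction of predicted (protein, label) pairs that are correct."""
    predicted = [(p, label) for p, labels in predictions.items() for label in labels]
    correct = sum(1 for p, label in predicted if label in annotations[p])
    return correct / len(predicted) if predicted else 0.0


# Hypothetical module of four proteins: three share label "A", one has "B".
modules = [["p1", "p2", "p3", "p4"]]
annotations = {"p1": {"A"}, "p2": {"A"}, "p3": {"A"}, "p4": {"B"}}
predictions = leave_one_out(modules, annotations)
print(precision(predictions, annotations))  # 0.75
```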
21 High-Confidence Network
- High-throughput data known to be noisy
- IntAct edges associated with > 1 PubMed identifier
- 2,604 proteins
- 8,341 interactions
- Similar results for both complexes and biological processes
22 Co-complexed Pairs
- Corrections hint at missing or false edges
- -ve corrections: co-complexed pair
- +ve corrections: false edge
[Figure: example graph on nodes a, b, c, d, h, i; cost: 7 edges]
Summary: Y = {a,b,c,d}; nodes h and i
Corrections: +(d,h), +(b,i), -(a,c)
Cost: 4 (1 superedge + 3 corrections)
23 Complex-Pair Prediction
- Compared with defective clique algorithm (DCC)
- Find two cliques of size at least k, overlapping on at least l nodes
- Predict missing edges
24 Prediction Results
- Test set, co-complexed pairs
- Co-complexed: gold standard set of MIPS complexes
- Not co-complexed: proteins in different sub-cellular locations
- Results: high-confidence network
- GS: 66.7% precise, 224 correct predictions
- DCC: 62.5% precise, 39 correct predictions
- GS also filters 3,331 edges, 97% accuracy
- Results: unfiltered network
- Both perform poorly
- Too many false positive edges
25 Comparison of Predictions
- Complexes, Majority
- Implication: useful to combine methods
26 Reconstruction
- Reconstruct G from R
- For all superedges (u,v) in S, insert an edge between every pair of u-nodes and v-nodes
- For all +ve corrections, insert edge (a,b)
- For all -ve corrections, delete edge (a,b)
[Figure: representation R and the reconstructed graph G on nodes a to j]
R: X = {d,e,f,g}, Y = {a,b,c}; nodes h, i, j
C = {+(a,h), +(c,i), +(c,j), -(a,d)}
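The three reconstruction steps above can be sketched in Python using the representation shown on this slide. The data layout (a dictionary of supernodes, a list of superedges, and corrections as `(sign, a, b)` tuples) and the function name are assumptions made for the sketch; the single superedge is taken to run between X and Y.

```python
from itertools import product


def reconstruct(supernodes, superedges, corrections):
    """Rebuild the edge set of G from a summary and its corrections."""
    edges = set()
    # Expand every superedge into all pairs between its two node sets.
    for x, y in superedges:
        for a, b in product(supernodes[x], supernodes[y]):
            if a != b:
                edges.add(frozenset((a, b)))
    # Apply corrections: "+" inserts an edge, "-" deletes one.
    for sign, a, b in corrections:
        if sign == "+":
            edges.add(frozenset((a, b)))
        else:
            edges.discard(frozenset((a, b)))
    return edges


supernodes = {"X": set("defg"), "Y": set("abc")}
superedges = [("X", "Y")]
corrections = [("+", "a", "h"), ("+", "c", "i"), ("+", "c", "j"), ("-", "a", "d")]

G = reconstruct(supernodes, superedges, corrections)
print(len(G))  # 14
```

The 14 recovered edges agree with the "Cost 14 edges" figure quoted for the uncompressed example graph.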
27 How to compress?
- Summary $S(V_S, E_S)$
- Each supernode x represents a set of nodes $A_x$
- Each superedge (x,y) represents all possible edges $\pi_{xy} = A_x \times A_y$
- Corrections C: $\{(a,b)\}$, where a and b are nodes of G
- Compute edge cost
- $E_{xy}$: actual edges of G between $A_x$ and $A_y$
- Cost without (x,y): $|E_{xy}|$
- Cost with (x,y): $1 + |\pi_{xy} - E_{xy}|$
- Choose the minimum
[Figure: example graph; cost: 14 edges]
Summary: X = {d,e,f,g}, Y = {a,b,c}
Cost: 5 (1 superedge + 4 corrections)
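For a fixed grouping into supernodes, the edge-cost rule above decides pair by pair whether a superedge pays off. The sketch below applies it to the example graph, rebuilt here from the summary and corrections shown on the earlier slides. The function name `encode`, the data layout, and the tie-breaking (no superedge when both options cost the same) are assumptions.

```python
from itertools import combinations


def encode(edges, supernodes):
    """Choose, for each supernode pair, the cheaper of the two encodings."""
    edge_set = {frozenset(e) for e in edges}
    names = sorted(supernodes)
    superedges, corrections = [], []
    for i, x in enumerate(names):
        for y in names[i:]:
            if x == y:
                pairs = {frozenset(p) for p in combinations(supernodes[x], 2)}
            else:
                pairs = {frozenset((a, b)) for a in supernodes[x] for b in supernodes[y]}
            actual = pairs & edge_set   # E_xy
            missing = pairs - edge_set  # pi_xy - E_xy
            if 1 + len(missing) < len(actual):
                # Cost with superedge: 1 + one negative correction per missing pair.
                superedges.append((x, y))
                corrections += [("-", *sorted(p)) for p in missing]
            else:
                # Cost without superedge: one positive correction per actual edge.
                corrections += [("+", *sorted(p)) for p in actual]
    return superedges, corrections


X, Y = set("defg"), set("abc")
edges = {(a, d) for a in Y for d in X} - {("a", "d")}
edges |= {("a", "h"), ("c", "i"), ("c", "j")}

supernodes = {"X": X, "Y": Y, "h": {"h"}, "i": {"i"}, "j": {"j"}}
superedges, corrections = encode(edges, supernodes)
print(len(edges), len(superedges) + len(corrections))  # 14 5
```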
28 Greedy Algorithm
- Cost of merging nodes u and v into supernode w
- Cost of a superedge (u,x)
- $c(u,x) = \min\{|\pi_{ux}| - |E_{ux}| + 1,\ |E_{ux}|\}$
- $c_u$ = sum of costs of all its edges
- $s(u,v) = (c_u + c_v - c_w)/(c_u + c_v)$
- Main idea: iterative, bottom-up merging of nodes
- If $s(u,v) > 0$, merging u and v reduces the cost of the representation
- Normalize the cost reduction to remove bias towards high-degree nodes
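29 Greedy Algorithm
- $s(u,v) = (c_u + c_v - c_w)/(c_u + c_v)$
- GREEDY algorithm
- Start with S = G
- At every step, pick the pair with max s(.) value, merge
- If no pair has positive s(.) value, stop
[Figure: successive merges on the example graph with nodes a to h; cost reduction: 11 to 6]
Merges shown:
s(b,c) = 0.5 ($c_b = 2$, $c_c = 2$, $c_{bc} = 2$)
s(g,h) = 3/7 ($c_g = 3$, $c_h = 4$, $c_{gh} = 4$)
s(e,f) = 1/3 ($c_e = 2$, $c_f = 1$, $c_{ef} = 2$)
Corrections: C = {+(h,d)} at an intermediate step; C = {+(h,d), +(a,e)} in the final summary
Final supernodes: a, bc, d, ef, gh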
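Substituting the costs listed for each merge into $s(u,v) = (c_u + c_v - c_w)/(c_u + c_v)$ reproduces the three scores on this slide:

```latex
\begin{align*}
s(b,c) &= \frac{c_b + c_c - c_{bc}}{c_b + c_c} = \frac{2 + 2 - 2}{2 + 2} = \frac{1}{2} \\
s(g,h) &= \frac{c_g + c_h - c_{gh}}{c_g + c_h} = \frac{3 + 4 - 4}{3 + 4} = \frac{3}{7} \\
s(e,f) &= \frac{c_e + c_f - c_{ef}}{c_e + c_f} = \frac{2 + 1 - 2}{2 + 1} = \frac{1}{3}
\end{align*}
```

Since $1/2 > 3/7 > 1/3$, picking the pair with the maximum score at each step would merge (b,c) first, then (g,h), then (e,f), assuming the scores of the remaining pairs are unchanged by the earlier merges.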