Revealing Biological Modules via Graph Summarization

1
Revealing Biological Modules via Graph Summarization

Saket Navlakha
Computer Science Department
Center for Bioinformatics and Computational Biology
University of Maryland, College Park, USA
Joint work with Michael C. Schatz and Carl Kingsford

2
Introduction

High-throughput methods are producing large amounts of protein interaction data for many organisms
Goal: analyze networks and recover complexes or proteins involved in the same biological process
Common approach:
Partition network into clusters or modules
Use modules to predict annotations

Figure: yeast protein interaction network
3
Module-assisted predictions
Module-finding algorithm, e.g. Newman 2003; King et al. 2004; Bader and Hogue 2003; Blatt et al. 2006; van Dongen 1998
4
Our module-finding approach

Want modules whose proteins have similar cellular roles
Similar interaction partners imply similar cellular roles
Similar interaction partners also imply redundancy in the graph
Redundancy implies compressibility!

Figure: compressed representation
5
Graph Summarization (GS)
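Goal: produce a compressed representation of input graph G
Figure: input graph G on nodes a-j and its compressed representation (summary plus corrections)
Summary: supernodes X = {a,b,c}, Y = {d,e,f,g}
Corrections: +(a,h), +(c,i), +(c,j), -(a,d)
Modules: the supernodes X and Y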
6
GS representation
Summary S: highlights dominant trends, easy to visualize
Supernode: set of nodes with common neighbors
Identifies bipartite cores, cliques, stars, etc.
Corrections C: exception/unusual edges
Handles noise
Minimize total cost |S| + |C| (based on the MDL principle)

Figure: input graph G on nodes a-g (cost: 14 edges) and its representation, with summary supernodes X = {d,e,f,g}, Y = {a,b,c} plus corrections (cost: 5 = 1 superedge + 4 corrections)
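
The cost figures on this slide can be checked with a short sketch. It assumes the example graph is the one described on the Graph Summarization (GS) slide: a superedge between {a,b,c} and {d,e,f,g}, extra edges (a,h), (c,i), (c,j), and a missing edge (a,d). The helper names `edge` and `representation_cost` are hypothetical, chosen only for this illustration.

```python
from itertools import product

def edge(u, v):
    """Undirected edge as an order-independent key."""
    return frozenset((u, v))

def representation_cost(edges, superedges):
    """Return (cost, plus, minus) for a summary given as a list of
    (supernode, supernode) superedges over an undirected edge set."""
    implied = {edge(u, v) for A, B in superedges
               for u, v in product(A, B) if u != v}
    plus = edges - implied    # real edges the summary misses: +ve corrections
    minus = implied - edges   # edges the summary wrongly implies: -ve corrections
    return len(superedges) + len(plus) + len(minus), plus, minus

# Example graph rebuilt from the slide's summary and corrections (an assumption).
X, Y = {"d", "e", "f", "g"}, {"a", "b", "c"}
G = {edge(u, v) for u, v in product(X, Y)} - {edge("a", "d")}
G |= {edge("a", "h"), edge("c", "i"), edge("c", "j")}

cost, plus, minus = representation_cost(G, [(X, Y)])
print(len(G), cost)  # 14 5
```

Under these assumptions the plain edge list costs 14, while the summary costs 1 superedge + 3 positive corrections + 1 negative correction = 5, which matches the slide.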
7
Greedy GS
Bottom-up, iterative merging
Start with S = G
At every step, merge the pair which maximally reduces cost
If no pair reduces cost, stop

Figure: example graph on nodes a-h, cost reduction 11 to 6
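
The greedy procedure above can be sketched as a brute-force loop. This is an illustration, not the authors' implementation: the per-pair cost model (pay either one correction per edge, or one superedge plus one negative correction per missing edge) is an assumption not stated on the slides, and every function name here is hypothetical. It re-evaluates the full cost for each candidate merge, so it only suits small graphs.

```python
from itertools import combinations

def edge(u, v):
    return frozenset((u, v))

def pair_cost(A, B, edges):
    """Assumed cost of encoding the edges between supernodes A and B
    (A == B means the edges inside a single supernode)."""
    if A == B:
        pairs = list(combinations(A, 2))
    else:
        pairs = [(u, v) for u in A for v in B]
    actual = sum(1 for u, v in pairs if edge(u, v) in edges)
    # Option 1: list each edge as a +ve correction.
    # Option 2: one superedge plus a -ve correction per missing edge.
    return min(actual, 1 + len(pairs) - actual)

def total_cost(partition, edges):
    within = sum(pair_cost(A, A, edges) for A in partition)
    between = sum(pair_cost(A, B, edges)
                  for A, B in combinations(partition, 2))
    return within + between

def greedy_summarize(nodes, edges):
    partition = [frozenset([n]) for n in nodes]   # start with S = G
    best = total_cost(partition, edges)
    while True:
        choice = None
        for A, B in combinations(partition, 2):
            merged = [P for P in partition if P not in (A, B)] + [A | B]
            cost = total_cost(merged, edges)
            if cost < best:                       # keep the largest reduction
                best, choice = cost, merged
        if choice is None:                        # no pair reduces cost: stop
            return partition, best
        partition = choice
```

Calling `greedy_summarize(sorted(nodes), edges)` returns the final supernodes and their total cost; ties between equally good merges are broken by iteration order, which is another simplification.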
8
Application to yeast network

Network: IntAct, 5,492 proteins, 40,332 interactions
Complexes: MIPS, 266 leaf-level complexes
Biological process: GO, 182 terms (Myers et al. 2006)
Comparisons:
Graph Summarization (GS)
Markov Clustering Algorithm (MCL)
Molecular Complex Detection Algorithm (MCODE)
Newman Modularity
9
Module-assisted predictions

Test quality of modules (supernodes):
Compute hypergeometric enrichment of modules
Use modules for annotation prediction (complexes, biological processes)

10
1. Annotation Enrichment
Graph summarization: a larger variety of annotations is enriched in some module (P < 0.001)
(A lower percentage of GS modules are enriched for some annotation, but this is not indicative of predictive performance.)
11
2. Making predictions

Cross-validation (leave-one-out)
Module-assisted annotation transfer schemes:

Majority: transfer if > 50% of annotated proteins have the annotation
Hypergeometric: transfer the statistically enriched annotations
Plurality: transfer the most common annotation(s)
12
Figure panels: Complexes; Biological Processes
All GS predictions are Pareto optimal
13
Conclusion

Graph Summarization is good for annotation prediction
Confirmed on complexes and biological processes
Preferable to MCL, MCODE, NSP
Advantages:
MDL-based graph compression, no parameters
Modules are proteins which have similar interaction partners
Generalizes bipartite cores, cliques, stars; handles noise
More in paper:
Similar results for high-confidence network
Also used GS corrections to predict co-complexed pairs using only topology; fared favorably against Defective Cliques (Yu et al. 2006)
Many unique predictions made by each algorithm

14
Acknowledgements
Nisheeth Shrivastava: Research Scientist, Bell Labs Research, India
Michael C. Schatz: PhD student, University of Maryland
Rajeev Rastogi: VP and Head, Yahoo Research, India
Carl Kingsford: Assistant Professor, University of Maryland
15
Additional slides
16
Enrichment of Modules

Want modules to contain proteins predominantly from the same complex or biological process
Measure enrichment of annotation F in module M via hypergeometric P-value
n = number of nodes in the network
m = number of nodes in M
f = number of nodes in the network annotated with F
k = number of nodes in M annotated with F
Threshold: P < 0.001
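
A minimal sketch of the enrichment test, using the n, m, f, k defined above. It assumes the standard upper-tail hypergeometric P-value (the probability of seeing at least k annotated proteins in a module of size m); the slide does not show the exact formula, and the function name `enrichment_pvalue` is hypothetical.

```python
from math import comb

def enrichment_pvalue(n, m, f, k):
    """P(at least k of the m module proteins carry annotation F) when
    f of the n network proteins carry F (hypergeometric upper tail):
    sum over i = k..min(m, f) of C(f, i) * C(n - f, m - i) / C(n, m)."""
    total = comb(n, m)
    tail = sum(comb(f, i) * comb(n - f, m - i)
               for i in range(k, min(m, f) + 1))
    return tail / total

# Illustrative numbers only: n is the network size from the slides,
# while m, f and k are made up for this example.
p = enrichment_pvalue(n=5492, m=10, f=20, k=5)
is_enriched = p < 0.001
```

With SciPy available, `scipy.stats.hypergeom.sf(k - 1, n, f, m)` should give the same tail probability; the pure-stdlib version is used here to keep the sketch self-contained.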

17
1. Enrichment of modules
18
1. Module enrichment

Graph Summarization
Variety of enriched annotations (P < 0.001)
Fewer modules enriched, but this is not indicative of predictive performance

Figure panels: Complexes; Biological Processes (GS highlighted in each)
19
2. Making predictions

Leave-one-out cross-validation
Module-assisted annotation prediction

20
2. Making predictions

Leave-one-out cross-validation
Module-assisted annotation prediction for annotation A in module M:
Majority: transfer annotation A if > 50% of annotated proteins in module M have A
Plurality: transfer the most common A in M
Hypergeometric: transfer all enriched A in M (P < 0.001)
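
The Majority and Plurality schemes can be sketched as follows, with the held-out protein excluded from its own module to mimic leave-one-out. The data layout (a list of protein names per module and a dict mapping each protein to a set of annotations) and all function names are assumptions made for this illustration; the Hypergeometric scheme would additionally need an enrichment test and is left out.

```python
from collections import Counter

def annotation_counts(module, annotations, held_out):
    """Count annotations among the annotated module members,
    ignoring the held-out protein."""
    others = [p for p in module if p != held_out and annotations.get(p)]
    counts = Counter(a for p in others for a in annotations[p])
    return counts, len(others)

def majority(module, annotations, held_out):
    """Transfer A if more than 50% of annotated members have A."""
    counts, n_annotated = annotation_counts(module, annotations, held_out)
    return {a for a, c in counts.items() if c > 0.5 * n_annotated}

def plurality(module, annotations, held_out):
    """Transfer the most common annotation(s) in the module."""
    counts, _ = annotation_counts(module, annotations, held_out)
    if not counts:
        return set()
    top = max(counts.values())
    return {a for a, c in counts.items() if c == top}
```

A prediction for the held-out protein counts as correct when a transferred annotation matches one of its known annotations; how the slides score partial matches is not stated.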

21
High-Confidence Network

High-throughput data are known to be noisy
Keep only IntAct edges associated with > 1 PubMed identifier
2,604 proteins
8,341 interactions
Similar results for both complexes and biological processes

22
Co-complexed Pairs
Corrections hint at missing or false edges
-ve corrections: co-complexed pair
+ve corrections: false edge

Figure: input graph on nodes a, b, c, d, h, i (cost: 7 edges) and its representation
Summary: supernode Y = {a,b,c,d}
Corrections: +(d,h), +(b,i), -(a,c)
Cost: 4 (1 superedge + 3 corrections)
23
Complex-Pair Prediction

Compared with defective clique algorithm (DCC):
Find two cliques of size at least k, overlapping on at least l nodes
Predict missing edges

24
Prediction Results

Test set: co-complexed pairs
Co-complexed: gold standard set of MIPS complexes
Not co-complexed: proteins in different sub-cellular locations
Results, high-confidence network:
GS: 66.7% precise, 224 correct predictions
DCC: 62.5% precise, 39 correct predictions
GS also filters 3,331 edges, 97% accuracy
Results, unfiltered network:
Both perform poorly
Too many false positive edges

25
Comparison of Predictions

Figure: comparison for complexes, Majority scheme
Suggests it is useful to combine methods

26
Reconstruction
Figure: representation R with supernode X = {d,e,f,g}

Reconstruct G from R:
For all superedges (u,v) in S, insert an edge between every pair of u-nodes and v-nodes
For all +ve corrections (a,b), insert edge (a,b)
For all -ve corrections (a,b), delete edge (a,b)
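
The three reconstruction steps can be sketched directly. The input layout (a dict of named supernodes, superedges as pairs of supernode names, corrections as node pairs) and the function name `reconstruct` are assumptions for this illustration; a superedge from a supernode to itself is treated as a clique.

```python
def reconstruct(supernodes, superedges, plus, minus):
    """Rebuild the undirected edge set of G from a representation R."""
    edges = set()
    # Step 1: expand every superedge into all node pairs it covers.
    for X, Y in superedges:
        for u in supernodes[X]:
            for v in supernodes[Y]:
                if u != v:
                    edges.add(frozenset((u, v)))
    # Step 2: insert the +ve corrections.
    edges |= {frozenset(e) for e in plus}
    # Step 3: delete the -ve corrections.
    edges -= {frozenset(e) for e in minus}
    return edges

# Example values taken from the Graph Summarization slides.
R_nodes = {"X": {"d", "e", "f", "g"}, "Y": {"a", "b", "c"}}
G = reconstruct(R_nodes, [("X", "Y")],
                plus=[("a", "h"), ("c", "i"), ("c", "j")],
                minus=[("a", "d")])
print(len(G))  # 14
```

Because the corrections record every difference between the summary and the input, the reconstruction is lossless in this sketch.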