Revealing Biological Modules via Graph Summarization

Transcript and Presenter's Notes

1
Revealing Biological Modules via Graph Summarization
  • Saket Navlakha
  • Computer Science Department
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park, USA
  • Joint work with Michael C. Schatz and Carl
    Kingsford

2
Introduction
  • High-throughput methods are producing large amounts of
    protein interaction data for many organisms
  • Goal: analyze these networks to recover complexes or
    proteins involved in the same biological process
  • Common approach:
  • Partition the network into clusters or modules
  • Use the modules to predict annotations

[Figure: yeast protein interaction network]
3
Module-assisted predictions
[Figure: network → Module Finding Algorithm → module-assisted
predictions. Example algorithms: Newman 2003; King et al. 2004;
Bader and Hogue 2003; Blatt et al. 2006; van Dongen 1998]
4
Our module-finding approach
  • Want modules with proteins having similar
    cellular roles
  • Similar interaction partners imply similar
    cellular roles
  • Similar interaction partners also imply
    redundancy in the graph
  • Redundancy implies compressibility!

[Figure: network compressed into a summary representation]
5
Graph Summarization (GS)
Goal: produce a compressed representation of the input
graph G
[Figure: input graph G on nodes a-j. Summary: supernodes X = {a,b,c}
and Y = {d,e,f,g}. Corrections: (a,h), (c,i), (c,j), -(a,d).
The supernodes are the modules.]
6
GS representation
[Figure: original graph G, cost 14 edges]
  • Summary S highlights dominant trends, easy to
    visualize
  • Supernode = set of nodes with common neighbors
  • Identifies bipartite cores, cliques, stars, etc.
  • Corrections C = exceptional/unusual edges
  • Handles noise
  • Minimize total cost |S| + |C|
  • (based on the MDL principle)

[Figure: GS representation. Summary: supernodes X = {d,e,f,g} and
Y = {a,b,c} joined by one superedge, plus corrections.
Cost 5 (1 superedge + 4 corrections)]
7
Greedy GS
[Figure: greedy merging on an example graph, cost reduced from 11 to 6]
  • Bottom-up, iterative merging
  • Start with S = G
  • At every step, merge the pair that maximally reduces
    the cost
  • If no pair reduces the cost, stop

8
Application to yeast network
  • Network: IntAct, 5,492 proteins, 40,332
    interactions
  • Complexes: MIPS, 266 leaf-level complexes
  • Biological processes: GO, 182 terms (Myers et al.
    2006)
  • Comparisons
  • Graph Summarization
  • Markov Clustering Algorithm (MCL)
  • Molecular Complex Detection Algorithm (MCODE)
  • Newman Modularity

9
Module-assisted predictions
  • Test quality of modules (supernodes)
  • Compute hypergeometric enrichment of modules
  • Use modules for annotation prediction (complexes,
    biological processes)

10
1. Annotation Enrichment
Graph summarization → a larger variety of
annotations is enriched in some module (P < 0.001)
(A lower fraction of GS modules is enriched for some
annotation, but this is not indicative of predictive
performance.)
11
2. Making predictions
  • Cross-validation (leave-one-out)
  • Module-assisted annotation transfer schemes

Majority: transfer the annotation if > 50% of the annotated
proteins in the module have it
Plurality: transfer the most common annotation(s)
Hypergeometric: transfer the statistically enriched annotations
12
[Figure: prediction results for complexes and biological processes]
All GS predictions are Pareto optimal
13
Conclusion
  • Graph Summarization is good for annotation
    prediction
  • Confirmed on complexes and biological processes
  • Preferable to MCL, MCODE, NSP
  • Advantages
  • MDL-based graph compression, no parameters
  • Module = proteins that have similar interaction
    partners
  • Generalizes bipartite cores, cliques, stars;
    handles noise
  • More in the paper
  • Similar results for a high-confidence network
  • Also used GS corrections to predict co-complexed
    pairs using only topology; fared favorably
    against Defective Cliques (Yu et al. 2006)
  • Many unique predictions made by each algorithm

14
Acknowledgements
  • Michael C. Schatz, PhD student, University of Maryland
  • Carl Kingsford, Assistant Professor, University of Maryland
  • Nisheeth Shrivastava, Research Scientist, Bell Labs Research, India
  • Rajeev Rastogi, VP and Head, Yahoo Research, India
15
Additional slides
16
Enrichment of Modules
  • Want modules to contain proteins predominantly from
    the same complex or biological process
  • Measure enrichment via the hypergeometric P-value
    (see the sketch below)
  • n = nodes in the network
  • m = nodes in module M
  • f = nodes in the network annotated with F
  • k = nodes in M annotated with F
  • Threshold: 0.001

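A minimal sketch (not from the slides) of this enrichment test in Python,
assuming SciPy is available; variable names follow the slide, and the
example numbers are illustrative only.

    # Hypergeometric enrichment P-value for a module.
    from scipy.stats import hypergeom

    def enrichment_pvalue(n, m, f, k):
        """P(X >= k), X ~ Hypergeometric(population=n, successes=f, draws=m)."""
        # sf(k-1) = P(X >= k): probability of at least k annotated proteins
        # in a module of size m drawn from a network of n proteins,
        # f of which carry annotation F.
        return hypergeom.sf(k - 1, n, f, m)

    # Illustrative call: network of 5,492 proteins, module of 20 proteins,
    # annotation F on 100 proteins network-wide, 8 of them in the module.
    p = enrichment_pvalue(5492, 20, 100, 8)
    enriched = p < 0.001   # threshold used in the talk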
17
1. Enrichment of modules
18
1. Module enrichment
  • Graph Summarization
  • Variety of enriched annotations (P < 0.001)
  • Fewer modules are enriched, but this is not
    indicative of predictive performance

[Figure: enrichment of GS modules for complexes and biological
processes]
19
2. Making predictions
  • Leave-one-out cross-validation
  • Module-assisted annotation prediction

20
2. Making predictions
  • Leave-one-out cross-validation
  • Module-assisted annotation prediction for
    annotation A in module M (see the sketch below)
  • Majority: transfer annotation A if > 50% of the
    annotated proteins in module M have A
  • Plurality: transfer the most common A in M
  • Hypergeometric: transfer all enriched A in M
  • (P < 0.001)

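A minimal sketch (an assumption, not the authors' code) of the Majority and
Plurality transfer schemes; module_annotations maps each annotated protein in
module M to its set of annotations, and both the data structure and the
function names are made up for illustration.

    from collections import Counter

    def majority_transfer(module_annotations):
        """Annotations carried by > 50% of the annotated proteins in the module."""
        counts = Counter(a for anns in module_annotations.values() for a in set(anns))
        annotated = len(module_annotations)
        return {a for a, c in counts.items() if c > annotated / 2}

    def plurality_transfer(module_annotations):
        """The most common annotation(s) in the module (ties kept)."""
        counts = Counter(a for anns in module_annotations.values() for a in set(anns))
        if not counts:
            return set()
        top = max(counts.values())
        return {a for a, c in counts.items() if c == top}

    # Toy usage:
    M = {"p1": {"GO:1"}, "p2": {"GO:1", "GO:2"}, "p3": {"GO:2"}}
    majority_transfer(M)   # {'GO:1', 'GO:2'}: each annotation is on 2 of 3 proteins
    plurality_transfer(M)  # {'GO:1', 'GO:2'}: tied as most common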
21
High-Confidence Network
  • High-throughput data are known to be noisy
  • Kept IntAct edges associated with > 1 PubMed
    identifier
  • 2,604 proteins
  • 8,341 interactions
  • Similar results for both complexes and biological
    processes

22
Co-complexed Pairs
[Figure: original graph, cost 7 edges]
  • Corrections hint at missing or false edges (see the
    sketch below)
  • Negative (-) correction → co-complexed pair
  • Positive (+) correction → false edge

[Figure: GS representation. Summary: supernode Y = {a,b,c,d}, plus
nodes h and i; corrections (d,h), (b,i), -(a,c).
Cost 4 (1 superedge + 3 corrections)]
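A minimal sketch (an assumption, not the authors' code) of reading predictions
off the corrections, with corrections modeled as (sign, u, v) tuples:

    def interpret_corrections(corrections):
        # negative corrections: edges the summary expects but G lacks
        co_complexed = [(u, v) for sign, u, v in corrections if sign == '-']
        # positive corrections: edges in G the summary does not expect
        false_edges = [(u, v) for sign, u, v in corrections if sign == '+']
        return co_complexed, false_edges

    # Example from this slide: corrections +(d,h), +(b,i), -(a,c)
    pairs, spurious = interpret_corrections(
        [('+', 'd', 'h'), ('+', 'b', 'i'), ('-', 'a', 'c')])
    # pairs == [('a', 'c')]                -> predicted co-complexed pair
    # spurious == [('d', 'h'), ('b', 'i')] -> predicted false edges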
23
Complex-Pair Prediction
  • Compared with defective clique algorithm (DCC)
  • Find two cliques of size at least k, overlapping
    on at least l nodes
  • Predict missing edges

24
Prediction Results
  • Test set: co-complexed pairs
  • Co-complexed: gold-standard set of MIPS complexes
  • Not co-complexed: proteins in different
    sub-cellular locations
  • Results, high-confidence network
  • GS: 66.7% precision, 224 correct predictions
  • DCC: 62.5% precision, 39 correct predictions
  • GS also filters 3,331 edges with 97% accuracy
  • Results, unfiltered network
  • Both perform poorly
  • Too many false-positive edges

25
Comparison of Predictions
  • Complexes, Majority scheme
  • → useful to combine methods

26
Reconstruction
  • Reconstruct G from R = (S, C) (see the sketch below)
  • For each superedge (u,v) in S, insert all pairs of
    edges between u's nodes and v's nodes
  • For each positive correction +(a,b), insert the edge (a,b)
  • For each negative correction -(a,b), delete the edge (a,b)

[Figure: representation R with supernodes X = {d,e,f,g} and Y = {a,b,c},
corrections C = (a,h), (c,i), (c,j), -(a,d); expanding R reconstructs the
original graph G on nodes a-j]
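A minimal sketch (assumed data structures, not the authors' code) of this
reconstruction step: expand every superedge into all node pairs, then apply
the corrections.

    from itertools import product

    def reconstruct(supernodes, superedges, corrections):
        """supernodes: dict supernode -> set of original nodes
        superedges: list of (x, y) supernode pairs
        corrections: list of (sign, a, b) tuples with sign '+' or '-'"""
        edges = set()
        for x, y in superedges:
            # a superedge (x, y) stands for every edge between A_x and A_y
            for a, b in product(supernodes[x], supernodes[y]):
                if a != b:
                    edges.add(frozenset((a, b)))
        for sign, a, b in corrections:
            if sign == '+':
                edges.add(frozenset((a, b)))      # missing edge put back
            else:
                edges.discard(frozenset((a, b)))  # spurious pair removed
        return edges

    # Example from the slide (h, i, j would be singleton supernodes in the
    # full summary; here they only appear through the corrections):
    S = {"X": {"d", "e", "f", "g"}, "Y": {"a", "b", "c"}}
    G = reconstruct(S, [("X", "Y")],
                    [('+', 'a', 'h'), ('+', 'c', 'i'), ('+', 'c', 'j'),
                     ('-', 'a', 'd')])
    # len(G) == 14, matching the 14 edges of the uncompressed graph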
27
How to compress?
[Figure: original graph G, cost 14 edges]
  • Summary S = (V_S, E_S)
  • Each supernode x represents a set of nodes A_x
  • Each superedge (x,y) represents all pairs of
    edges π_xy = A_x × A_y
  • Corrections C: edges (a,b), where a and b are nodes of G
  • Compute the edge cost (see the sketch below)
  • E_xy = actual edges of G between A_x and A_y
  • Cost without (x,y): |E_xy|
  • Cost with (x,y): 1 + |π_xy| - |E_xy|
  • Choose the minimum

[Figure: summary with supernodes X = {d,e,f,g} and Y = {a,b,c}, plus
corrections.
Cost 5 (1 superedge + 4 corrections)]
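A minimal sketch (not the authors' implementation) of the per-superedge cost
just described: either keep each actual edge in E_xy as a correction, or pay
1 for the superedge plus one correction per missing pair of π_xy.

    from itertools import product

    def superedge_cost(A_x, A_y, graph_edges):
        """A_x, A_y: disjoint node sets; graph_edges: set of frozenset edges of G."""
        pi_xy = {frozenset((a, b)) for a, b in product(A_x, A_y)}
        E_xy = {e for e in pi_xy if e in graph_edges}
        cost_without = len(E_xy)                   # list the edges individually
        cost_with = 1 + len(pi_xy) - len(E_xy)     # superedge + missing-pair corrections
        return min(cost_without, cost_with)

    # In the slide's example (X = {d,e,f,g}, Y = {a,b,c}, with only the pair
    # (a,d) missing), keeping the superedge costs 1 + 1 = 2 instead of
    # 11 individual edges.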
28
Greedy Algorithm
  • Cost of merging nodes u and v into supernode w
  • Cost of a superedge (u,x):
  • c(u,x) = min(|π_ux| - |E_ux| + 1, |E_ux|)
  • c_u = sum of the costs of all of u's edges
  • s(u,v) = (c_u + c_v - c_w) / (c_u + c_v)
  • Main idea: iterative, bottom-up merging of nodes
  • If s(u,v) > 0, merging u and v reduces the cost
    of the representation
  • Normalizing the cost removes the bias towards
    high-degree nodes

29
Greedy Algorithm
[Figure: greedy merging on the example graph, cost reduced from 11 to 6]
  • s(u,v) = (c_u + c_v - c_w) / (c_u + c_v)
  • GREEDY algorithm (see the sketch below)
  • Start with S = G
  • At every step, pick the pair with the maximum s(·)
    value and merge it
  • If no pair has a positive s(·) value, stop

[Figure: greedy merging steps on the example graph. Merge scores:
s(b,c) = 0.5 (c_b = 2, c_c = 2, c_bc = 2); s(g,h) = 3/7 (c_g = 3,
c_h = 4, c_gh = 4); s(e,f) = 1/3 (c_e = 2, c_f = 1, c_ef = 2).
Corrections after merging: C = (h,d), (a,e)]
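A minimal, brute-force sketch (an assumption, not the authors' code) of the
GREEDY loop from the last two slides: score every pair of supernodes with
s(u,v) and merge the best pair while some score is positive. An efficient
implementation would restrict the candidate pairs, which this sketch does not do.

    from itertools import combinations, product

    def pair_cost(A_u, A_x, edges):
        """c(u,x) = min(|pi_ux| - |E_ux| + 1, |E_ux|)."""
        if A_u is A_x:   # internal pairs of a single supernode
            pairs = {frozenset(p) for p in combinations(A_u, 2)}
        else:
            pairs = {frozenset(p) for p in product(A_u, A_x)}
        e = sum(1 for p in pairs if p in edges)
        return min(len(pairs) - e + 1, e)

    def node_cost(u, supernodes, edges):
        """c_u = sum of the costs of u's (super)edges."""
        return sum(pair_cost(supernodes[u], A_x, edges) for A_x in supernodes.values())

    def greedy_summarize(graph_edges):
        """graph_edges: iterable of (a, b) node pairs; returns the modules."""
        edges = {frozenset(e) for e in graph_edges}
        supernodes = {n: {n} for e in edges for n in e}   # start with S = G
        next_id = 0
        while True:
            best, best_s = None, 0.0
            for u, v in combinations(list(supernodes), 2):
                cu = node_cost(u, supernodes, edges)
                cv = node_cost(v, supernodes, edges)
                if cu + cv == 0:
                    continue
                # hypothetically merge u and v into w and recompute w's cost
                merged = {k: a for k, a in supernodes.items() if k not in (u, v)}
                merged[("w", next_id)] = supernodes[u] | supernodes[v]
                cw = node_cost(("w", next_id), merged, edges)
                s = (cu + cv - cw) / (cu + cv)
                if s > best_s:
                    best, best_s = (u, v), s
            if best is None:            # no pair gives a positive reduction: stop
                return list(supernodes.values())
            u, v = best
            supernodes[("w", next_id)] = supernodes.pop(u) | supernodes.pop(v)
            next_id += 1

    # Toy usage: a bipartite core collapses into two modules.
    modules = greedy_summarize([("a", "d"), ("a", "e"), ("b", "d"),
                                ("b", "e"), ("c", "d"), ("c", "e")])
    # modules == [{'a', 'b', 'c'}, {'d', 'e'}]  (order may vary)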