Title: Protein Complex Detection in Large Protein Interaction Networks
1Protein Complex Detection in Large Protein
Interaction Networks
May 21, 2003
Chris Sander Lab Computational Biology Center
(cBio) Memorial Sloan-Kettering Cancer Center
http://cbio.mskcc.org/
2Yeast Two-Hybrid
233 interactions, 145 proteins
Fields, Drees, Boone, Tong
3Highly Connected 6-Core Las17 Actin Assembly
Complex?
20 proteins
Tong et al. Science 2002;295(5553)
4Experimental Validation of Las17 Complex
[Figure: experimental validation of the Las17 complex. Node labels: Las17, Bbc1, Bzz1, Myo3, Rvs167, Sho1, Ysc84, Yfr024, Ygr136, Ypr154; numeric labels 1-5]
ELISA: Bbc1, Bzz1, Ygr136w, Ypr154w, Yfr024c, Ysc84
CoIP
Colocalized
Tong et al. Science 2002;295(5553)
5So...
- Based on observations, densely interconnected
regions of an interaction network may represent
molecular complexes
- Complexes are another level of annotation above
other guilt-by-association methods
- Methods that find dense network regions can help
us understand biological systems (using only
qualitative connectivity information)
6Nuclear Complexes
7k-core
- A part of a graph where every node is connected
to at least k other nodes in that part
(k = 0, 1, 2, 3, ...)
- The highest k-core is a central, most densely
connected region of a graph
- Therefore, high k-cores may be molecular complexes
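The k-core definition on this slide can be computed by repeatedly peeling off low-degree nodes. What follows is a minimal Python sketch, not the implementation used in the talk. The function names `build_adjacency`, `k_core`, and `highest_k_core`, and the toy edge list, are illustrative assumptions; the graph is assumed to be undirected and free of self-loops.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_core(adj, k):
    """Return the nodes of the k-core: peel nodes until all remaining have degree >= k."""
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in alive if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed via an earlier stack entry
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive


def highest_k_core(adj):
    """Return (k, nodes) for the largest k whose k-core is non-empty."""
    k, core = 0, set(adj)
    while True:
        nxt = k_core(adj, k + 1)
        if not nxt:
            return k, core
        k, core = k + 1, nxt


# Toy network (hypothetical): a 4-clique a,b,c,d with a tail d-e-f.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
toy_adj = build_adjacency(toy_edges)
k, core = highest_k_core(toy_adj)
print(k, sorted(core))  # 3 ['a', 'b', 'c', 'd']
```

The tail nodes e and f drop out at k = 2, which mirrors the slide's point that the highest k-core isolates the most densely connected region.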
8k-core
Pajek - Batagelj, V., Mrvar, A.
9A Better Complex Finder
- k-core method is limited to a single complex in
the middle of a network
SH3 Y2H data
102 Complexes
Signal transduction: 7/19 membrane, 10/19 unknown,
1/19 cytoskeletal
Actin cytoskeleton rearrangement: 6/17 cell polarity,
3/17 unknown role, other
11(No Transcript)
12Molecular Complex Detection
MCODE
- MCODE finds densely connected regions of a
network
- Graph-theoretic clustering algorithm
- Three stages
- Network Weighting
- Complex Detection
- Optional Post-processing
Bader & Hogue - BMC Bioinformatics 2003 Jan
13;4(1):2
13Overview
- MCODE Algorithm
- Evaluation
- Application
14MCODE
- Take a network
- Give each node a score
- A high score means the node is in a dense region
- Find complexes
- Optionally expand/contract complexes
15Input Network
16Find neighbors of Pti1
17Find highest k-core (8-core)
Removes low degree nodes in power-law networks
18Find graph density
19Calculate score for Pti1
20Repeat for entire network
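Slides 16 to 20 can be read as a per-node scoring recipe: take a node's neighborhood, find its highest k-core, measure that core's density, and combine the two. The Python sketch below illustrates one such reading. It assumes the score is k multiplied by the density of the highest k-core, that the node itself is part of its neighborhood, and that density means edges divided by possible edges without self-loops. These choices, the function names, and the toy network are assumptions, not the published MCODE code.

```python
def induced(adj, nodes):
    """Subgraph of adj restricted to the given nodes."""
    nodes = set(nodes)
    return {v: adj[v] & nodes for v in nodes}


def highest_k_core(adj):
    """Return (k, nodes) of the highest non-empty k-core by repeated peeling."""
    k, core = 0, set(adj)
    while True:
        nxt = set(core)
        changed = True
        while changed:
            low = {v for v in nxt if len(adj[v] & nxt) < k + 1}
            changed = bool(low)
            nxt -= low
        if not nxt:
            return k, core
        k, core = k + 1, nxt


def density(adj):
    """Edges present divided by edges possible (no self-loops)."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nb) for nb in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


def node_score(adj, v):
    """Assumed score: k of the neighborhood's highest k-core times that core's density."""
    hood = induced(adj, adj[v] | {v})
    k, core = highest_k_core(hood)
    return k * density(induced(hood, core))


# Toy network (hypothetical): a 4-clique a,b,c,d with a tail d-e-f.
toy_adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
scores = {v: node_score(toy_adj, v) for v in toy_adj}  # "repeat for entire network"
for v in sorted(scores):
    print(v, round(scores[v], 3))
```

Taking the k-core before measuring density is what discounts the many low-degree nodes, as slide 17 notes for power-law networks.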
21Find dense regions
- Pick highest scoring vertex
- Paint outwards until threshold score reached
(threshold is a % of the seed node's score)
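The seed-and-paint step on this slide can be sketched as a breadth-first expansion from the highest scoring unvisited node. In the Python sketch below, the inclusion rule (a neighbor joins if its score is at least `(1 - vwp)` times the seed's score), the parameter names `vwp` and `min_size`, and the precomputed toy scores are all assumptions made for illustration.

```python
from collections import deque


def find_complexes(adj, scores, vwp=0.2, min_size=2):
    """Grow complexes outward from seeds, visiting seeds in order of decreasing score.

    vwp: assumed fraction of the seed score that a neighbor may fall short by.
    min_size: hypothetical filter that drops trivially small results.
    """
    seen = set()
    complexes = []
    for seed in sorted(scores, key=scores.get, reverse=True):
        if seed in seen:
            continue
        cutoff = scores[seed] * (1.0 - vwp)
        members = {seed}
        seen.add(seed)
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                # Paint outwards only through nodes that clear the cutoff.
                if u not in seen and scores[u] >= cutoff:
                    seen.add(u)
                    members.add(u)
                    queue.append(u)
        if len(members) >= min_size:
            complexes.append(members)
    return complexes


# Toy network (hypothetical): a 4-clique a,b,c,d with a tail d-e-f,
# and illustrative node scores supplied by hand.
toy_adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
toy_scores = {"a": 3.0, "b": 3.0, "c": 3.0, "d": 3.0, "e": 0.67, "f": 1.0}
for members in find_complexes(toy_adj, toy_scores):
    print(sorted(members))
```

Because each node is visited once, a node belongs to at most one complex in this sketch, and low-scoring nodes such as e and f end up in no complex at all.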
22Post-process (optional)
- Fluff the boundary by fluff density threshold
- Haircut: 2-core
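Both optional post-processing steps can be sketched in a few lines of Python. The haircut below reduces a complex to its 2-core, as the slide states. The fluff rule is an assumption: a member's neighbors are added when the density of that member's neighborhood exceeds the threshold. Function names and the toy network are illustrative.

```python
def haircut(adj, members):
    """Reduce a complex to its 2-core: drop nodes with fewer than 2 links inside it."""
    members = set(members)
    changed = True
    while changed:
        changed = False
        for v in list(members):
            if len(adj[v] & members) < 2:
                members.discard(v)
                changed = True
    return members


def neighborhood_density(adj, v):
    """Density of v's neighborhood (v included), without self-loops."""
    hood = adj[v] | {v}
    n = len(hood)
    if n < 2:
        return 0.0
    edges = sum(len(adj[u] & hood) for u in hood) / 2
    return edges / (n * (n - 1) / 2)


def fluff(adj, members, threshold):
    """Assumed fluff rule: add the neighbors of members whose neighborhood is dense enough."""
    grown = set(members)
    for v in members:
        if neighborhood_density(adj, v) > threshold:
            grown |= adj[v]
    return grown


# Toy network (hypothetical): a 4-clique a,b,c,d with a tail d-e-f.
toy_adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
print(sorted(haircut(toy_adj, {"a", "b", "c", "d", "e"})))  # ['a', 'b', 'c', 'd']
print(sorted(fluff(toy_adj, {"a", "b", "c"}, 0.9)))         # ['a', 'b', 'c', 'd']
```

Fluff only looks one step beyond the original members, so it widens the boundary without restarting the search.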
23Polyadenylation Factor I Complex
Known: Cft1, Cft2, Fip1, Pap1, Pfs2, Pta1, Ysh1,
Yth1 and Ykl059c
Unknown: Yor179c and Pti1
Ideally → test in wet lab
Continue with rest of network
24Evaluation
- Yeast
- Requires a list of known complexes for
comparison: Gavin et al. (221), MIPS (208)
- Not perfect, but neither is the data
25Modeling Copurification Data
- E.g. Co-immunoprecipitation (CoIP) data
- Population of complexes of unknown topology
- Want to use this data with pairwise interactions
- Must model CoIP as pairwise interactions
Bader & Hogue, Nature Biotech 2002;20(10)
26Spoke and Matrix Models
- Vrp1 (bait), Las17, Rad51, Sla1, Tfp1, Ypt7
[Figure panels: Possible Actual Topology, Spoke, Matrix]
Spoke: simple model. Intuitive, more accurate, but
can misrepresent
Matrix: theoretical max. number of interactions, but many
FPs
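The two models differ only in which pairs a single purification contributes. A minimal Python sketch using the bait and preys listed on this slide follows; the function names `spoke` and `matrix` are illustrative, and interactions are treated as unordered pairs.

```python
from itertools import combinations


def spoke(bait, preys):
    """Spoke model: the bait interacts with each prey, and preys are not linked."""
    return {frozenset((bait, p)) for p in preys if p != bait}


def matrix(bait, preys):
    """Matrix model: every pair of proteins in the purification interacts."""
    proteins = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(proteins, 2)}


bait = "Vrp1"
preys = ["Las17", "Rad51", "Sla1", "Tfp1", "Ypt7"]
print(len(spoke(bait, preys)))   # 5
print(len(matrix(bait, preys)))  # 15
```

For one bait and n preys, spoke yields n pairs while matrix yields (n + 1) * n / 2, which is why the matrix model reaches the theoretical maximum but admits many more false positives.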
27TAP Benchmark
- Only 88/221 coverage
- Better predictions with better experimental
coverage
28Application to Yeast Network
- From a list of 15,143 known yeast interactions among
4,825 proteins, 209 complexes predicted
- 100 random network permutations
- Average of 27.4 complexes (SD = 4.4)
- Random complexes 5x larger
- Did not match any known complexes
- Large annotation spread
- Thus, number, size, functional composition
unlikely to occur by chance
- Not affected by high number of false positives in
high-throughput data sets
29The Yeast 26S Proteasome
16/21 19S regulatory subunit
9/15 20S proteolytic subunit
Basic structure is evident
30Cytoskeleton/Cytokinesis Complex?
31Directed Mode
32Directed Mode - Split
26S proteasome
Lsm mRNA Modification
snRNA associated
Allows fine-tuning without considering entire
network
33Functional Connections Between Complexes
34Advantages
- Compared to other clustering algorithms
- Directed mode
- Complex connectivity mode
- Does not force all data points into clusters
- Makes visualization of large networks more
manageable
35Conclusions
- Initial step in taking advantage of current
purely qualitative connectivity information
- Requires graph layout software (Pajek)
- Future networks need to have more information
about time, space, data quality, stoichiometry
(from e.g. interaction databases) → p-value
weights on edges
- Dynamic, not static
- ftp.mshri.on.ca/pub/BIND/Tools/MCODE
36Future MCODE Directions
- Adaptive vertex scoring function (functional
annotation, gene expression)
- Interactive viewer for directed mode
www.cytoscape.org
37Acknowledgements
Sander Group: Chris Sander, Mike Cary, Ethan
Cerami, Daniel Eisenbud, Anton Enright, Ronald
Jansen, Alex Lash, Boris Reva
Original work: Chris Hogue (Toronto)
Data: Boone lab, Tyers lab, MDSP, SLRI, Fields lab,
Cesareni lab
bicjobs_at_cbio.mskcc.org CB, SE, DBA, SA
38(No Transcript)
39Evaluation
- Require interactions and known complexes
- Interactions from Saccharomyces cerevisiae
- List of known complexes for comparison: Gavin et
al., MIPS
- Predict and compare with parameter optimization
40Evaluation with Gavin Data Set
- Convert 588 raw copurifications to binary
interactions using spoke model
- 3,225 interactions among 1,363 proteins
- Run MCODE with 840 parameter combinations
- Compare with 221 hand-annotated complexes
(somewhat redundant)
- Pick parameters with the largest number of matched
known complexes
Gavin et al. Nature 2002;415(6868)
41Complex Comparison
- Overlap score $\omega = i^2 / (a \cdot b)$
- i = size of intersection set of two complexes
(predicted and known)
- a = size of predicted complex
- b = size of known complex
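The overlap score defined on this slide, the squared intersection size divided by the product of the two complex sizes, is simple to compute. The Python sketch below uses placeholder protein names (P1 to P6) that are purely hypothetical; returning 0.0 for an empty complex is also an assumption.

```python
def overlap_score(predicted, known):
    """Overlap score: i^2 / (a * b), with i = |predicted & known|, a = |predicted|, b = |known|."""
    predicted, known = set(predicted), set(known)
    if not predicted or not known:
        return 0.0  # assumed convention for empty input
    i = len(predicted & known)
    return (i * i) / (len(predicted) * len(known))


predicted = {"P1", "P2", "P3", "P4"}      # hypothetical predicted complex
known = {"P2", "P3", "P4", "P5", "P6"}    # hypothetical benchmark complex
print(overlap_score(predicted, known))    # 0.45
```

The score is 1.0 only when the two complexes are identical and 0.0 when they share no proteins, so a cut-off between the two decides what counts as a match.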
42Matched Known Complexes
43Parameter Optimization
Large range; best parameters: hFfT/0.05/0.05
44Evaluation with MIPS Data Set
- More varied benchmark
- 9,088 interactions among 4,379 proteins:
literature and large-scale (not including HTMS)
- 208 MIPS curated complexes
- Best parameters: hTfT/0.1/0.2
- 166 predicted complexes
- 52 matched 64 MIPS complexes (overlap score $\omega > 0.2$)
Gavin et al. Nature 2002;415(6868)
45Prediction Benchmark Overlap
MIPS complex catalogue incomplete
Data set incomplete
MCODE complex ≠ human definition
46Effect of Data Set Properties
MCODE Predictions vs. MIPS Complexes
High FP rate doesn't affect sensitivity/specificity
47Large-Scale Data Sets
- Large-scale data has many false positives
- Benchmark interaction data set augmented by
large-scale data set interactions that only
connect proteins in the Benchmark set with each
other
- >3,100 interactions added to existing 3,300
- Sensitivity/specificity was not affected
- → High FP rate doesn't affect prediction
48Effect of Data Set Properties
MCODE Predictions vs. Gavin Complexes
Spoke is reasonable
49Prediction Significance
- 100 random network permutations
- Average of 27.4 complexes (SD = 4.4) → optimized
- Random complexes 5x larger than MIPS
- Did not match any known complexes
- Large annotation spread
- Thus, number, size, functional composition
unlikely to occur by chance
50Complex Score
- MCODE score = complex density x size of complex
(D_C x |V|)
- Ranks larger, more dense complexes higher
- Other scoring functions exist; this one was developed
empirically (heuristic)
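The complex score on this slide, density multiplied by size, can be sketched in Python as follows. Density is assumed here to be edges present over edges possible without self-loops; the function name and the toy network are illustrative.

```python
def complex_score(adj, members):
    """Score a complex as density * number of nodes (density without self-loops)."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & members) for v in members) / 2
    dens = edges / (n * (n - 1) / 2)
    return dens * n


# Toy network (hypothetical): a 4-clique a,b,c,d with a tail d-e-f.
toy_adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
candidates = [{"a", "b", "c", "d"}, {"a", "b", "c", "d", "e"}, {"d", "e", "f"}]
ranked = sorted(candidates, key=lambda c: complex_score(toy_adj, c), reverse=True)
for c in ranked:
    print(sorted(c), round(complex_score(toy_adj, c), 2))
```

In the toy ranking the fully connected 4-clique outranks the larger but sparser 5-node set, showing how the product trades size against density.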
51Top 5 Complexes
52Complex Score Accuracy
53Future Directions
- Complexes and network connectivity are just one
set of features that will be useful to compare
between organisms
- Evolution of complexity
- Protein interaction network involved in actin
cytoskeleton regulation, defined by SH3, PDZ
domains
54http://www.caida.org/tools/visualization/walrus/
55Xerox.com high, low traffic (Apr 1997)
Chi & Card, INFOVIS 1999, Xerox PARC
56new, deleted, unchanged
Chi et al. 1997, Xerox PARC
57VWP Parameter Properties