Title: Protein Interaction Networks
1Protein Interaction Networks
- Thanks to Mehmet Koyuturk
2Protein-Protein Interactions
- Physical association between proteins
- Signal transduction, phosphorylation
- Docking, complex formation
- Permanent vs. transient interactions
- Co-location of proteins
- Proteins that work in the same cellular component
- Soluble location: lysosome, mitochondrial stroma
- Membrane location: receptors in plasma membrane, transporters in mitochondrial membrane
- Functional association of proteins
- Proteins involved in the same biomolecular activity
- Enzymes in the same pathway, co-regulated proteins
3Permanent vs Transient Interactions
- Permanent interactions
- Some proteins form a stable protein complex that carries out a structural or functional biomolecular role
- These proteins are protein subunits of the complex and they work together
- ATPase subunits, subunits of nuclear pore
- Transient interactions
- Proteins that come together in certain cellular states to undertake a biomolecular function
- DNA replicative complex, signal transduction
4Signal Transduction
- Phosphorylation
- Protein-kinase interaction
- Enzyme activation
5Why Study Protein Interactions?
- Identification of functional modules and interconnections between these modules
- Functional annotation based on binding partners and interaction patterns
- Identification of evolutionarily conserved pathways
- Identification of drug target proteins to minimize side effects
6Identification of Protein Interactions
- Traditionally, protein interactions are identified by wet-lab experiments based on hypotheses on candidate proteins
- Small scale assays
- Co-immunoprecipitation: immunoprecipitate one protein, see if the other is also precipitated
- Reliable, but can only verify interactions between suspected partners
- High throughput screening
- Throw in thousands of ORFs and see which ones bind to each other
- Yeast two hybrid, tandem affinity purification
- Large scale, but a lot of noise
7Yeast Two Hybrid
- Split the yeast GAL4 gene, which encodes a transcription factor required for activation of GAL genes, into two parts
- Activating domain, binding domain
- The split protein does not work unless the two parts are in physical contact
8Protein Interaction Networks
- Organize all identified interactions in a network, where proteins are represented by nodes and interactions are represented by edges
- TAP identifies a group of proteins that are caught by target protein
- Spoke model (star network) vs. matrix model (clique)
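The two ways of turning one TAP purification into network edges can be sketched as follows. This is a minimal illustration, not a reference implementation; the function names and the bait/prey labels are hypothetical.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Spoke (star) model: connect the bait to every protein it pulled down."""
    return {frozenset((bait, p)) for p in preys if p != bait}

def matrix_edges(bait, preys):
    """Matrix (clique) model: connect every pair of proteins in the purified group."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}

# Hypothetical purification: bait "A" pulls down "B", "C" and "D".
spoke = spoke_edges("A", ["B", "C", "D"])
matrix = matrix_edges("A", ["B", "C", "D"])
print(len(spoke), len(matrix))  # 3 spoke edges vs. 6 clique edges
```

The spoke edges are always a subset of the matrix edges, so the choice is between missing prey-prey interactions (spoke) and adding pairs that may never touch (matrix).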
9Functional Modularity in PPI Networks
- A protein complex
- Dense subgraph
- A signal transduction pathway
- Simple path, parallel paths
- A protein with a common, key, fundamental role (e.g., a kinase)
- Hub node
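The graph patterns listed above can be checked with very little code. The sketch below, on a made-up toy network, flags hub nodes by degree and measures how dense a candidate complex is; the degree cutoff and all node names are assumptions chosen only for illustration.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (protein, protein) edges."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def hubs(adj, min_degree):
    """Proteins with at least min_degree interaction partners."""
    return sorted(n for n, nbrs in adj.items() if len(nbrs) >= min_degree)

def density(adj, nodes):
    """Fraction of possible edges present among the given nodes (1.0 = clique)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = sum(len(adj.get(u, set()) & nodes) for u in nodes) / 2
    return internal / (n * (n - 1) / 2)

# Toy network: a 4-protein complex (clique) and a kinase "K" with 5 substrates.
edges = [("C1", "C2"), ("C1", "C3"), ("C1", "C4"),
         ("C2", "C3"), ("C2", "C4"), ("C3", "C4"),
         ("K", "S1"), ("K", "S2"), ("K", "S3"), ("K", "S4"), ("K", "S5"),
         ("S1", "T1")]
adj = build_adjacency(edges)
hub_nodes = hubs(adj, min_degree=5)
complex_density = density(adj, ["C1", "C2", "C3", "C4"])
path_density = density(adj, ["K", "S1", "T1"])
print(hub_nodes, complex_density, path_density)
```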
10Computational Prediction of PPIs
- Functional association is a higher level conceptualization of interaction
- Proteins that act as enzymes catalyzing reactions in the same metabolic pathway
- Functionally associated proteins are likely to show up in similar contexts
- Co-regulation, co-expression, co-evolution, co-citation
- Functional association between proteins can be computationally identified by looking at different sources of data such as sequences, gene expression, literature
- Can also be extended to capture physical associations, for example, by taking into account evolution at structural level
11Conservation of Gene Neighborhood
- In bacteria, the genome of an organism is organized in such a way that functionally related proteins are coded by neighboring regions
- Operons
- When more than one bacterial species is considered, it is observed that this neighborhood relationship becomes even more relevant
Distribution of neighboring genes in H. influenzae and E. coli into functional classes
12Comparison of Nine Bacterial Genomes
- trpB-trpA is the only gene pair whose proximity is conserved across nine prokaryotic genomes
- These genes encode the two subunits of tryptophan synthase that interact and catalyze a single reaction
13Close Orthologs
- Run of genes
- A set of genes on one strand, such that the gap between adjacent genes is less than a threshold, g (in practice, g ≤ 300 bp)
- Any pair of genes on the same run are said to be close
- Bidirectional best hits
- Genes X1 and X2 from genomes G1 and G2 are BBH if their sequence similarity is significant and there is no Y1 (Y2) in G1 (G2) that is more similar to X2 (X1) than X1 (X2) is
Pair of close bidirectional best hits: Xa, Ya close in G1; Xb, Yb close in G2; Xa, Xb BBH; Ya, Yb BBH
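The definitions of runs, bidirectional best hits and pairs of close BBHs can be put together in a short sketch. It assumes genes are given as (name, strand, start, end) tuples and that similarity scores have already been filtered for significance; the input formats, the function names and the toy coordinates are hypothetical, and ties between equally good hits are broken arbitrarily.

```python
from itertools import combinations

def gene_runs(genes, max_gap=300):
    """Group same-strand genes into runs; a gap of max_gap or more starts a new run."""
    result = []
    for strand in ("+", "-"):
        on_strand = sorted((g for g in genes if g[1] == strand), key=lambda g: g[2])
        current, prev_end = [], None
        for name, _, start, end in on_strand:
            if current and start - prev_end >= max_gap:
                result.append(current)
                current = []
            current.append(name)
            prev_end = end
        if current:
            result.append(current)
    return result

def bidirectional_best_hits(sim):
    """sim maps (gene in G1, gene in G2) -> score; keep pairs that are each other's best hit."""
    best_1, best_2 = {}, {}
    for (a, b), s in sim.items():
        if a not in best_1 or s > best_1[a][1]:
            best_1[a] = (b, s)
        if b not in best_2 or s > best_2[b][1]:
            best_2[b] = (a, s)
    return {(a, b) for a, (b, _) in best_1.items() if best_2[b][0] == a}

def close_bbh_pairs(runs_1, runs_2, bbh):
    """Gene pairs close in G1 whose BBH partners are also close (same run) in G2."""
    ortholog = dict(bbh)
    run_of_2 = {g: i for i, run in enumerate(runs_2) for g in run}
    pairs = set()
    for run in runs_1:
        mapped = [g for g in run if g in ortholog]
        for xa, ya in combinations(mapped, 2):
            xb, yb = ortholog[xa], ortholog[ya]
            if xb in run_of_2 and run_of_2[xb] == run_of_2.get(yb):
                pairs.add((xa, ya))
    return pairs

# Toy genomes (coordinates in bp) and similarity scores, all made up.
g1 = [("a1", "+", 0, 900), ("a2", "+", 1000, 1900), ("a3", "+", 3000, 3900), ("a4", "-", 100, 500)]
g2 = [("b1", "+", 0, 800), ("b2", "+", 900, 1700), ("b3", "+", 1800, 2500)]
sim = {("a1", "b1"): 0.9, ("a1", "b2"): 0.4, ("a2", "b2"): 0.8,
       ("a3", "b3"): 0.7, ("a2", "b3"): 0.5}

runs_1, runs_2 = gene_runs(g1), gene_runs(g2)
bbh = bidirectional_best_hits(sim)
close_pairs = close_bbh_pairs(runs_1, runs_2, bbh)
print(runs_1, bbh, close_pairs)
```

Here a1 and a2 are reported because they share a run in G1 and their best hits b1 and b2 share a run in G2; a3 has an ortholog but sits in a different run.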
14Predicting Interactions
- For each pair of close orthologs (occurring in at least one pair of genomes), calculate a score
- Score should increase with the phylogenetic distance between the two genomes, since closely related organisms are more likely to have similar genes nearby due to chance alone
- Existence of a triplet (P1, P2, P3) should be stronger evidence than the existence of two pairs (P1, P2 and P1, P3)
- Triplet distance can be estimated as the minimum distance between any pair of organisms (in addition to pair score)
15Reconstructing Pathways
Purine Metabolism
- Can identify the association between unknown
proteins and known pathways!
16Projection of Gene Neighborhood
- The composition of operons is evolutionarily variable
- A particular set of functionally related genes does not always comprise an operon
- The application of gene neighborhood based interaction prediction is limited for a single organism
- With multiple organisms, it is possible to statistically strengthen conclusions and project findings on other organisms
- If an operon with functionally related genes exists in several genomes, a functional association can be predicted for other organisms, even if the corresponding genes are scattered
- Variability turns out to be an advantage for prediction
17Gene Neighborhood - Limitations
- It is only directly applicable to bacteria (and archaea), because relevance of gene order does not necessarily extend to eukaryotes
- For closely related species, conserved gene order might just be due to lack of time for genome rearrangements
- We are interested in selective constraints that preserve gene order
- Compared species should be distant enough
- But not too distant, because we need sufficient number of orthologs to be able to derive statistically meaningful results
18Gene Fusion
- Domain fusion events
- Two protein domains that act as independent proteins (components) in one organism may form (part of) a single polypeptide chain (composites) in another organism
- Most proteins that are involved in domain fusion events are known to be subunits of multiprotein complexes (76 in E. coli metabolic network)
19Gene Fusion Based PPI Prediction
- A pair of proteins in the query genome is a candidate interacting pair if
- They show (local) sequence similarity to the same protein (rosetta stone) in reference genome
- They do not show sequence similarity with each other
- Complete genomes!
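The two conditions above translate directly into a filter over similarity hits. The sketch below assumes the local-similarity search has already been run and is summarized as (query protein, reference protein) hits plus a set of query pairs that are similar to each other; these input formats and all protein names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def rosetta_stone_candidates(hits, similar_pairs):
    """Return {(p, q): set of reference proteins} for candidate interacting pairs.

    hits: iterable of (query_protein, reference_protein) local-similarity hits.
    similar_pairs: set of frozensets of query proteins that are similar to each other.
    """
    by_reference = defaultdict(set)
    for query, reference in hits:
        by_reference[reference].add(query)

    candidates = {}
    for reference, queries in by_reference.items():
        for p, q in combinations(sorted(queries), 2):
            # Skip pairs that are similar to each other: they hit the same
            # reference protein because they are homologs, not fusion partners.
            if frozenset((p, q)) not in similar_pairs:
                candidates.setdefault((p, q), set()).add(reference)
    return candidates

# Made-up example: P1, P2 and P3 all hit reference protein R1; P2 and P3 are homologs.
hits = [("P1", "R1"), ("P2", "R1"), ("P3", "R1"), ("P4", "R2")]
similar_pairs = {frozenset(("P2", "P3"))}
candidates = rosetta_stone_candidates(hits, similar_pairs)
print(candidates)
```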
20Predicted Interactions
Known physical interactions
Proteins in the same pathway
21Gene Fusion Based Prediction - Results
- Interactions predicted based on gene fusion events
- Distance on circle shows distance on genome
22Co-evolution of Interacting Proteins
- Selective pressure is likely to act on common function
- Proteins that are interacting are expected to either be conserved together along with their interactions, or not conserved at all
- Hypothesis I: Orthologs of interacting proteins also interact in other species (supported by evidence, but there are subtleties, which we will discuss later)
- Hypothesis II: If two proteins are interacting, then they will show similar conservation patterns
- Phylogenetic profiles
23Phylogenetic Profiles
24Correlation of Phylogenetic Profiles
- Assume we have N genomes, protein X has homologs in x of them, Y has y, and they co-occur in z genomes
- Hamming distance
- Pearson correlation
- Mutual information
- Statistical significance
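For binary presence/absence profiles, the first three measures can be computed from the counts N, x, y and z alone. The sketch below is one way to do so; the function name is hypothetical, and mutual information is reported in bits.

```python
import math

def profile_measures(N, x, y, z):
    """Hamming distance, Pearson correlation and mutual information (bits)
    for two binary phylogenetic profiles summarized by their counts."""
    # Genomes where exactly one of the two proteins has a homolog.
    hamming = x + y - 2 * z

    # Pearson correlation of two 0/1 vectors written in terms of counts.
    denom = math.sqrt(x * (N - x) * y * (N - y))
    pearson = (N * z - x * y) / denom if denom > 0 else 0.0

    # 2x2 contingency table: (X present?, Y present?) -> number of genomes.
    counts = {(1, 1): z, (1, 0): x - z, (0, 1): y - z, (0, 0): N - x - y + z}
    p_x = {1: x / N, 0: 1 - x / N}
    p_y = {1: y / N, 0: 1 - y / N}
    mutual_info = 0.0
    for (a, b), c in counts.items():
        if c > 0:
            p_ab = c / N
            mutual_info += p_ab * math.log2(p_ab / (p_x[a] * p_y[b]))
    return hamming, pearson, mutual_info

# Identical profiles present in half of 10 genomes.
identical = profile_measures(N=10, x=5, y=5, z=5)
# Profiles whose overlap matches what independence would give.
independent = profile_measures(N=16, x=8, y=8, z=4)
print(identical, independent)
```

Identical profiles give a Hamming distance of 0 and a correlation of 1, while profiles that overlap exactly as often as chance predicts give zero correlation and zero mutual information.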
25Phylogenetic Profiles - Limitations
- Many processes may be common across lineages
- Too many false positives
- Database of genomes may be biased
- All organisms are treated equally
- Improvement: Use trees instead of profiles
- Proteins are assumed to be conserved as a whole
- It is domains that interact
- Improvement: Use domain profiles
[Figure: yeast nucleoli and ribosomal proteins; axis label: Organisms]
26Phylogenetic Tree Based Prediction
- Phylogenetic trees of Ntr-family two-component
sensor histidine kinases and their corresponding
regulators
27Mirror Tree Method
- Need to have sufficient number of genomes that
contain homologs of both proteins
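The mirror tree idea is commonly implemented by correlating the pairwise evolutionary distance matrices of the two protein families over the species they share. The sketch below assumes the distances are already available as dictionaries keyed by species pairs; the names and toy values are hypothetical.

```python
import math
from itertools import combinations

def mirror_tree_score(dist_a, dist_b, species):
    """Pearson correlation between two families' inter-species distances.

    dist_a, dist_b: dicts mapping frozenset({s1, s2}) -> evolutionary distance.
    species: species that contain homologs of both proteins.
    """
    pairs = [frozenset(p) for p in combinations(species, 2)]
    a = [dist_a[p] for p in pairs]
    b = [dist_b[p] for p in pairs]
    n = len(pairs)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Toy data: family B evolves twice as fast as family A but with the same tree shape.
species = ["sp1", "sp2", "sp3", "sp4"]
raw = {("sp1", "sp2"): 1.0, ("sp1", "sp3"): 3.0, ("sp1", "sp4"): 4.0,
       ("sp2", "sp3"): 3.0, ("sp2", "sp4"): 4.0, ("sp3", "sp4"): 2.0}
dist_a = {frozenset(k): v for k, v in raw.items()}
dist_b = {k: 2.0 * v for k, v in dist_a.items()}
score = mirror_tree_score(dist_a, dist_b, species)
print(score)
```

With s shared species there are only s(s-1)/2 distance pairs to correlate, which is why a sufficient number of genomes containing both proteins is needed for the score to mean anything.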
28Matrix Method
- Start with families of proteins that are suspected to interact
- Identify specific pairs of proteins that interact by aligning the phylogenetic trees that underlie the two families
- Assumption: Identical number of proteins in each family
29Correlated Mutations
- Co-evolution of interacting proteins can be followed more closely by quantifying the degree of co-variation between pairs of residues from these proteins
- Correlated mutations may correspond to compensatory mutations that stabilize the mutations in one protein with changes in the other
Distribution of distances between amino acid positions on a folded protein
30In silico Two-Hybrid
- The correlation of mutations between two positions (may be on different proteins) can be estimated from pairwise assessment of aligned multiple sequences
- Position pairs with high correlation are potential contact points
- Interaction index
- For a protein pair, compute the aggregate correlation (of mutations) across all positions
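A heavily simplified sketch of this computation follows. It assumes two multiple sequence alignments with one sequence per species in the same order, scores residue similarity as plain identity (1 or 0) rather than with a substitution matrix, and takes the mean inter-protein correlation as the aggregate. These are simplifying assumptions for illustration and not the published definition of the interaction index.

```python
import math
from itertools import combinations

def column_similarity(alignment, col):
    """For one column, residue similarity for every pair of sequences (identity only)."""
    return [1.0 if s[col] == t[col] else 0.0 for s, t in combinations(alignment, 2)]

def pearson(u, v):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u > 0 and sd_v > 0 else 0.0

def inter_protein_correlations(aln_a, aln_b):
    """Correlation of mutation patterns for every (position in A, position in B)."""
    sims_a = [column_similarity(aln_a, i) for i in range(len(aln_a[0]))]
    sims_b = [column_similarity(aln_b, j) for j in range(len(aln_b[0]))]
    return {(i, j): pearson(sa, sb)
            for i, sa in enumerate(sims_a) for j, sb in enumerate(sims_b)}

def interaction_index(correlations):
    """Aggregate (here: mean) correlation across all position pairs."""
    return sum(correlations.values()) / len(correlations)

# Toy alignments for four species; row k of each alignment is the same species.
aln_a = ["AK", "AK", "GK", "GK"]
aln_b = ["LD", "LE", "MD", "ME"]
correlations = inter_protein_correlations(aln_a, aln_b)
index = interaction_index(correlations)
print(correlations, index)
```

Position 0 of the first protein and position 0 of the second change in the same species, so that pair gets correlation 1 and would be the candidate contact point in this toy example.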
31In silico Two-Hybrid
32Performance of I2H
- I2H predicts physical, rather than functional association
- It requires complete genomes sufficient number of homologs
33Co-citation Based PPI Prediction
- Functionally associated proteins are likely to be cited in the same research article
- We can assess the statistical significance of co-citation based on hypergeometric model
- Algorithmic problem: How to recognize/match protein names?
- Train algorithm using annotated abstracts via conditional random fields (CRF)
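The hypergeometric significance of a co-citation count can be sketched as below. It assumes name recognition has already produced, for each protein, the number of articles that mention it; the function and parameter names are hypothetical.

```python
from math import comb

def cocitation_pvalue(total, n_a, n_b, n_both):
    """Probability of seeing at least n_both co-citations by chance.

    total:  number of articles in the corpus
    n_a:    articles mentioning protein A
    n_b:    articles mentioning protein B
    n_both: articles mentioning both
    """
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(total - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / comb(total, n_b)

# Tiny corpus: both proteins appear in 3 of 10 articles, always together.
p_always_together = cocitation_pvalue(total=10, n_a=3, n_b=3, n_both=3)
# Requiring zero or more co-citations is always satisfied.
p_trivial = cocitation_pvalue(total=10, n_a=3, n_b=3, n_both=0)
print(p_always_together, p_trivial)
```

A small p-value means the two names appear together more often than their individual citation frequencies would explain.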
34Performance of Co-citation
- Statistical significance is quite relevant until
it saturates
- The method is robust to choice of parameters for
name recognition
35Integrating PPI Networks
- Interaction data coming from multiple sources
- Different sources refer to different levels of interaction
- Can integration handle noise, making interaction data more reliable?
- Superpose interactions based on their reliability
36Bayesian Integration
- For each prediction method, compute log-likelihood score
- Let P(L|E) be the fraction of interactions predicted by method E for which functional association between the corresponding proteins is known
- Let P(¬L|E) be the fraction of false positives
- Let P(L) and P(¬L) be the corresponding priors
- Assign weights to methods based on their log-likelihood scores
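One common form of this score is LLS(E) = ln( (P(L|E) / P(¬L|E)) / (P(L) / P(¬L)) ), the log of the posterior odds of linkage given the evidence divided by the prior odds. The sketch below computes it from benchmark counts and then combines methods by summing the scores of those that support a pair; the plain sum is an assumption made for illustration (it treats methods as independent), and all counts and method names are made up.

```python
import math

def log_likelihood_score(true_links, false_links, prior_linked):
    """LLS of one method from its benchmarked predictions.

    true_links:   predictions between proteins with known functional association
    false_links:  predictions between proteins known not to be associated
    prior_linked: P(L), prior probability that a random pair is associated
    """
    if true_links <= 0 or false_links <= 0 or not 0 < prior_linked < 1:
        raise ValueError("counts must be positive and the prior strictly between 0 and 1")
    total = true_links + false_links
    posterior_odds = (true_links / total) / (false_links / total)
    prior_odds = prior_linked / (1 - prior_linked)
    return math.log(posterior_odds / prior_odds)

def integrated_score(supporting_methods, scores):
    """Sum of LLS values over the methods that predict a given pair."""
    return sum(scores[m] for m in supporting_methods)

# Hypothetical benchmark counts for two methods, with prior P(L) = 0.1.
scores = {
    "gene_fusion": log_likelihood_score(90, 10, 0.1),
    "co_citation": log_likelihood_score(50, 50, 0.1),
}
pair_score = integrated_score(["gene_fusion", "co_citation"], scores)
print(scores, pair_score)
```

A score above zero means the method enriches for true associations relative to the prior, so more reliable methods automatically get larger weights.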
37Comparison of Prediction Methods
- Integrated network captures functional association better
- Note that the integrated network is trained using available data on functional association
38Classification Based Integration
- Points: Proteins; Space: Expression, Conservation; Labels: Function
- Points: Protein Pairs; Space: Co-expression, Co-evolution, etc.; Labels: Existence of Interaction
39Performance of Domain Co-evolution
40Co-Evolutionary Matrix
41Domain Identification
42Difference between Predicted PPIs