Title: Functional modules discovery from protein interaction data
1 EECS 700 Bioinformatics and Machine Learning
Presentation and Term Paper on Functional Modules Discovery from Protein Interaction Data
Aditya Deekonda
2 INTRODUCTION
- Research in bioinformatics
  - sequence alignment
  - protein structure prediction
  - protein function prediction
  - protein-protein interactions
- Proteins and proteins and proteins: basic building blocks that carry out functions
- Laboratory methods for protein discovery: yeast two-hybrid, tagged bait
- Rate of protein discovery >> rate of protein annotation
- Need to turn this protein information into data
3 Protein classification
- Class
- 3-D structure
- Function
- Processes
- Functional modules (present discussion)
  - High-level classification
  - Help in functional classification
4 Basic Facts and Assumptions
- Proteins work as groups
- They interact with each other
- They form large complexes
- They perform higher-order biological functions
  - DNA replication, transcription, metabolism
- Laboratory methods are costly
- Need for in silico methods
5
- Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery
- Detection of Functional Modules from Protein Interaction Networks
6 Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery
Uses protein complex data to discover functional modules.
Complex data -> Search algorithm -> Analysis using Gene Ontology -> Function analysis -> Verification
7 Terms used
- Protein functional module: a group of proteins => common function
  - Function: an elementary biological function
  - A functional module is smaller and simpler than a protein complex
- Protein complex: again a group of proteins
  - This group can be a group of functional modules
  - Performs higher-order functions
  - Large complexes act as efficient protein machines
  - More complex and larger than a functional module
8 Present methods
- Experimental methods for complex data
  - yeast two-hybrid, tagged bait
- For functional modules
  - densely connected subgraphs from protein networks
  - clustering of proteins
9 Definitions
- Pattern: a set of proteins
- Hyperclique pattern: a type of pattern in which the proteins are highly affiliated
- Highly affiliated: the presence of one protein implies the presence of the others (more ahead)
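10 Concept of Association Rules
Let
- $P = \{p_1, p_2, \ldots, p_n\}$ be the set of proteins
- $C = \{c_1, c_2, \ldots, c_l\}$ be the set of protein complexes, where each $c_i$ is a set of proteins, $c_i \subseteq P$
- $X$ be a pattern (a set of proteins) such that $X \subseteq P$
Support of $X$: $\mathrm{supp}(X)$ = fraction of protein complexes that contain $X$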
11 Sample protein complex data set
- $\mathrm{supp}(\{p_3, p_4\}) = 3/5 = 60\%$
- $\mathrm{supp}(\{p_1, p_2, p_3\}) = 2/5 = 40\%$
12 Association rule
$X \rightarrow Y$: the presence of pattern $X$ implies the presence of pattern $Y$ in the same protein complex, where $X \subseteq P$, $Y \subseteq P$ and $X \cap Y = \emptyset$.
Strength of the affinity: the confidence of $X \rightarrow Y$,
$\mathrm{conf}(X \rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$
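13 Confidence of the association rule $p_3 \rightarrow p_4$
- $\mathrm{supp}(\{p_3, p_4\}) = 3/5 = 60\%$
- $\mathrm{supp}(\{p_3\}) = 4/5 = 80\%$
- $\mathrm{conf}(p_3 \rightarrow p_4) = \mathrm{supp}(\{p_3, p_4\}) / \mathrm{supp}(\{p_3\}) = 60\% / 80\% = 75\%$
Interpretation: what percentage of the complexes that contain p3 contain both p3 and p4? In this case, what percentage of 80% is 60%.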
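The support and confidence figures above can be checked with a few lines of Python. This is only a sketch: the slide's sample table did not survive in the text, so the five complexes below are a hypothetical data set chosen to reproduce the quoted values, and the names `complexes`, `supp` and `conf` are illustrative.

```python
from fractions import Fraction

# Hypothetical five-complex data set (the slide's table is not in the text);
# chosen only so that the supports match the figures quoted on the slides.
complexes = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]

def supp(pattern):
    """Fraction of complexes that contain every protein in `pattern`."""
    pattern = set(pattern)
    hits = sum(1 for c in complexes if pattern <= c)
    return Fraction(hits, len(complexes))

def conf(x, y):
    """Confidence of the rule X -> Y: supp(X union Y) / supp(X)."""
    return supp(set(x) | set(y)) / supp(x)

print(supp({"p3", "p4"}))        # 3/5
print(supp({"p1", "p2", "p3"}))  # 2/5
print(conf({"p3"}, {"p4"}))      # 3/4
```

Exact fractions are used instead of floats so that 60% / 80% comes out as exactly 3/4.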
14 Hyperclique pattern
An association pattern that contains proteins that are highly affiliated with each other.
For a pattern $\{p_1, p_2, p_3\}$, within the same complex:
- presence of p1 => presence of p2, p3
- presence of p2 => presence of p1, p3
- presence of p3 => presence of p1, p2
Coming up: h-confidence
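15 h-confidence: the confidence of a hyperclique pattern
h-confidence of a pattern $X = \{p_1, p_2, p_3, \ldots, p_m\}$:
$\mathrm{hconf}(X) = \min\{\mathrm{conf}(p_1 \rightarrow p_2, p_3, \ldots, p_m),\ \mathrm{conf}(p_2 \rightarrow p_1, p_3, \ldots, p_m),\ \ldots,\ \mathrm{conf}(p_m \rightarrow p_1, p_2, \ldots, p_{m-1})\}$
Example: $X = \{p_2, p_3, p_4\}$
- $\mathrm{supp}(\{p_2\}) = 80\%$, $\mathrm{supp}(\{p_3\}) = 80\%$, $\mathrm{supp}(\{p_4\}) = 60\%$, $\mathrm{supp}(\{p_2, p_3, p_4\}) = 40\%$
- $\mathrm{conf}(p_2 \rightarrow p_3, p_4) = \mathrm{supp}(\{p_2, p_3, p_4\}) / \mathrm{supp}(\{p_2\}) = 50\%$
- $\mathrm{conf}(p_3 \rightarrow p_2, p_4) = \mathrm{supp}(\{p_2, p_3, p_4\}) / \mathrm{supp}(\{p_3\}) = 50\%$
- $\mathrm{conf}(p_4 \rightarrow p_2, p_3) = \mathrm{supp}(\{p_2, p_3, p_4\}) / \mathrm{supp}(\{p_4\}) = 66.7\%$
- $\mathrm{hconf}(X) = \min(50\%, 50\%, 66.7\%) = 50\%$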
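The worked example can be reproduced directly from the definition. The sketch below again assumes a hypothetical five-complex data set, picked so that the single-protein and pattern supports equal the values on the slide; `hconf` takes the minimum confidence over all rules with one protein on the left-hand side.

```python
from fractions import Fraction

# Hypothetical data set; supports match the slide: p2 80%, p3 80%, p4 60%,
# and {p2, p3, p4} 40%.
complexes = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]

def supp(pattern):
    """Fraction of complexes that contain every protein in `pattern`."""
    pattern = set(pattern)
    return Fraction(sum(1 for c in complexes if pattern <= c), len(complexes))

def conf(x, y):
    """Confidence of the rule X -> Y."""
    return supp(set(x) | set(y)) / supp(x)

def hconf(pattern):
    """Minimum confidence over the rules p -> (pattern without p)."""
    pattern = set(pattern)
    return min(conf({p}, pattern - {p}) for p in pattern)

x = {"p2", "p3", "p4"}
print(conf({"p4"}, {"p2", "p3"}))  # 2/3
print(hconf(x))                    # 1/2
```

Because every rule has the same numerator supp(X), the minimum is reached at the protein with the largest individual support, so hconf(X) can also be computed as supp(X) divided by the maximum single-protein support.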
16 Hyperclique pattern
- A pattern $X$ is a hyperclique pattern if $\mathrm{hconf}(X) \ge h_c$, where $h_c$ is the minimum h-confidence threshold
- The pattern $X = \{p_2, p_3, p_4\}$ is a hyperclique pattern at a threshold of 0.5, as its h-confidence is 50% (from the previous slide)
Maximal hyperclique pattern
- A hyperclique pattern is a maximal hyperclique pattern if no superset of it is a hyperclique pattern
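17 We now know
- Protein functional module: a group of proteins
- Protein complex: a group of modules
- Protein pattern: again a group of proteins
- $\mathrm{supp}(X)$: fraction of complexes that contain $X$
- Confidence of the association $p_3 \rightarrow p_4$: $\mathrm{supp}(\{p_3, p_4\}) / \mathrm{supp}(\{p_3\})$
- For $X = \{p_1, p_2, p_3, \ldots, p_m\}$: $\mathrm{hconf}(X) = \min\{\mathrm{conf}(p_1 \rightarrow p_2, p_3, \ldots, p_m),\ \mathrm{conf}(p_2 \rightarrow p_1, p_3, \ldots, p_m),\ \ldots,\ \mathrm{conf}(p_m \rightarrow p_1, p_2, \ldots, p_{m-1})\}$
- Hyperclique pattern: $\mathrm{hconf}(X) \ge h_c$ (user-specified threshold)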
18 Searching the protein complex data for hyperclique patterns
- Like generating a level-wise pattern tree
- Every level of the tree holds patterns with the same number of proteins
- Every pattern has a branch that contains all supersets of the pattern
- Use a breadth-first search algorithm
- If a pattern does not satisfy the given threshold, prune the whole branch
- This is justified by the anti-monotone property: $\mathrm{hconf}(X) \ge \mathrm{hconf}(X')$ for every superset $X' \supseteq X$
- Example: $\mathrm{hconf}(\{p_1, p_2\}) \ge \mathrm{hconf}(\{p_1, p_2, p_3\})$
19 The algorithm: hyperclique miner
Input
- The protein complex data
- The threshold values for support and h-confidence
- The proteins among which to find hyperclique patterns
Output
- List of patterns with support and h-confidence values above the thresholds
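20 A hyperclique pattern => proteins are highly affiliated with each other
Theorem: every pair of proteins in a hyperclique pattern has cosine similarity of at least the h-confidence threshold.
If $X = \{p_1, p_2, \ldots, p_m\}$ and $\mathrm{hconf}(X) \ge h_c$, then for any two proteins $p_l, p_k \in X$:
$\mathrm{cosine}(p_l, p_k) \ge h_c$, where $\mathrm{cosine}(p_l, p_k) = \mathrm{supp}(\{p_l, p_k\}) / \sqrt{\mathrm{supp}(\{p_l\}) \cdot \mathrm{supp}(\{p_k\})}$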
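A quick numerical check of the pairwise bound, as a sketch on a hypothetical five-complex data set (not the slide's own table). For the pattern {p2, p3, p4}, whose h-confidence is 0.5 in this data, every pairwise cosine should be at least 0.5.

```python
import math
from itertools import combinations

# Hypothetical data set used only for illustration.
complexes = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]

def supp(pattern):
    """Fraction of complexes that contain every protein in `pattern`."""
    pattern = set(pattern)
    return sum(1 for c in complexes if pattern <= c) / len(complexes)

def hconf(pattern):
    """supp(X) divided by the largest single-protein support in X."""
    return supp(pattern) / max(supp({p}) for p in pattern)

def cosine(a, b):
    """Cosine similarity of two proteins over the complexes."""
    return supp({a, b}) / math.sqrt(supp({a}) * supp({b}))

x = {"p2", "p3", "p4"}
pairwise = {pair: cosine(*pair) for pair in combinations(sorted(x), 2)}
for pair, value in pairwise.items():
    print(pair, round(value, 3))
```

The bound follows because supp({pl, pk}) is at least supp(X), while the square root of the product of two supports is at most the larger of the two.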
21 How the algorithm works
- For a given set of proteins (p1, p2, p3, p4, ..., pk)
- For pattern sizes 2, 3, ..., k-1
  - Generate candidate hyperclique patterns
  - Prune based on the support threshold
  - Prune based on the hc threshold
  - Generate hypercliques
- Pruning is also done based on the anti-monotone property
  - E.g. if {p1, p2} is not a hyperclique pattern, then we can prune the candidate pattern {p1, p2, p3}
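The level-wise search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published hyperclique miner: the data set is hypothetical, the function names are invented for this example, and candidates are generated Apriori-style by joining surviving patterns of the previous level.

```python
from fractions import Fraction
from itertools import combinations

def supp(pattern, complexes):
    """Fraction of complexes that contain every protein in `pattern`."""
    return Fraction(sum(1 for c in complexes if pattern <= c), len(complexes))

def hconf(pattern, complexes):
    """supp(X) / max single-protein support; 0 if X never occurs."""
    s = supp(pattern, complexes)
    if s == 0:
        return Fraction(0)
    return s / max(supp(frozenset([p]), complexes) for p in pattern)

def mine_hypercliques(complexes, proteins, min_supp, min_hconf):
    """Breadth-first search over pattern sizes 2, 3, ... with pruning."""
    found = {}
    level = [frozenset(pair) for pair in combinations(sorted(proteins), 2)]
    while level:
        kept = set()
        for cand in level:
            s, h = supp(cand, complexes), hconf(cand, complexes)
            if s >= min_supp and h >= min_hconf:
                found[cand] = (s, h)
                kept.add(cand)
        # Anti-monotone pruning: a larger candidate is generated only if
        # every sub-pattern one protein smaller survived this level.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1
                      and all((a | b) - {p} in kept for p in a | b)})
    return found

def maximal(found):
    """Patterns that have no proper superset among the found patterns."""
    return [x for x in found if not any(x < y for y in found)]

# Hypothetical data set used only for illustration.
complexes = [{"p1", "p2", "p3"}, {"p1", "p2", "p3", "p4"},
             {"p2", "p3", "p4"}, {"p3", "p4"}, {"p2"}]
found = mine_hypercliques(complexes, {"p1", "p2", "p3", "p4"},
                          min_supp=0, min_hconf=Fraction(1, 2))
for pattern in maximal(found):
    print(sorted(pattern), found[pattern])
```

On this toy input the two maximal patterns are {p1, p2, p3} and {p2, p3, p4}. The candidate {p1, p2, p4} is never evaluated, because {p1, p4} already failed the h-confidence threshold at the previous level.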
22 Protein complex data set
- Protein complexes of Saccharomyces cerevisiae
- TAP-MS (two purification methods), with bait proteins used to identify physiologically intact protein complexes
- The dataset by Gavin et al. has better accuracy for predicting protein functions
- Using this dataset, we have 1440 proteins within 232 complexes
23 Analysis tools
- The Gene Ontology (GO): used to annotate proteins
  - GO provides a controlled vocabulary
  - consistency in the use of terms
  - classifies all the terms by
    - Molecular function
    - Biological process
    - Cellular component
24 Graphviz is used to produce the graph representation of the annotation
- The functional descriptions come from the Saccharomyces Genome Database
- The 3-D structures come from the Protein Data Bank
- PyMOL is used for visualizing the 3-D structures
25 Setting a support threshold of 0 and an h-confidence threshold of 0.6, we obtain 60 maximal hyperclique patterns
- To analyze and test these patterns, use terms from GO
- For a pattern such as {Cus1, Msl1, Prp3, Prp9, Sme1, Smx2, Smx3, Yhc1}
  - Annotate the proteins with GO terms
  - Analyze the function and process
26 Analysis of protein pattern function
- Graph on the left: function annotation of the pattern
- Graph on the right: process annotation of the pattern
- Significant nodes are shown with the number of proteins and the statistical significance
27 Hypercliques as functional modules
- Hyperclique pattern {Pre2, Pre4, Pre5, Pre6, Pre9, Pre8, Pup3, Scl1}
- Contained in 4 complexes, with consistent function
- 2 categories
- => Hyperclique patterns act as functional modules that participate in protein complexes to perform higher-order functions
28 A 3-D view of the interaction of the proteins of pattern {Pre2, Pre4, Pre5, Pre6, Pre9, Pre8, Pup3, Scl1}
The 3-D structures of the proteins of the hyperclique pattern show physical interactions. This is compelling evidence that hyperclique patterns form a compact structure and perform a common function.
29 Subgraph of the Gene Ontology showing function correspondence for Complex 151
This complex has 3 hyperclique patterns with one function each.
30 Detection of Functional Modules from Protein Interaction Networks
31 Detect protein functional modules from protein interaction networks
- Modules: groups of genes or proteins involved in common elementary functions
- Use of clustering
- Extracts 1046 modules from Saccharomyces cerevisiae
- TribeMCL, an unsupervised graph clustering algorithm
- Use of an additive weighting scheme
- Transformation of the protein interaction network into a line graph
32 Basic steps in the method
1) Protein interaction network (Saccharomyces cerevisiae, DIP)
2) Add weights (confidence)
3) Graph transformation
4) TribeMCL
5) Classification of proteins
6) Cluster analysis
33 Terms used ahead
- Protein interaction network: a graph in which each node is a protein and each edge is an interaction
- Line graph: interactions connected by proteins
34 Protein interaction network
- Saccharomyces cerevisiae from DIP
- It is well studied
- DIP has curated information
- This network has 8046 physical interactions among 4081 proteins
- 6586 are supported by a single experiment and the others by at least two
35 Adding weights
- The most reliable protein interaction assignments are believed to be those supported by more than one method
- Use a simple additive weighting scheme
- Weight attributed to each interaction => degree of confidence
- The confidence level represents the number of experiments that support the interaction
36 All methods are equally weighted
- No distinction between small-scale and large-scale experiments
- Start with a score of 3
- For each further instance of the interaction, add
  - 1 if the method is different
  - 0.25 if the method has already been used
- All scores are normalized as a percentage of the highest score in the data set
We now have a weighted network of proteins connected by interactions, where each weight represents the confidence in that interaction.
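The additive scheme on this slide can be sketched in a few lines. The evidence table and method names below are hypothetical, and the function names are illustrative; only the constants 3, 1 and 0.25 and the percentage normalization come from the slide.

```python
def interaction_score(methods):
    """Score one interaction from the list of methods that reported it.

    The first report gives a base score of 3; each further report adds 1
    if its method is new for this interaction and 0.25 otherwise.
    """
    score = 3.0
    seen = {methods[0]}
    for method in methods[1:]:
        if method in seen:
            score += 0.25
        else:
            score += 1.0
            seen.add(method)
    return score

def normalized_weights(evidence):
    """Express every score as a percentage of the highest score."""
    raw = {pair: interaction_score(m) for pair, m in evidence.items()}
    top = max(raw.values())
    return {pair: 100.0 * s / top for pair, s in raw.items()}

# Hypothetical evidence: interaction -> methods that reported it.
evidence = {
    ("A", "B"): ["two_hybrid"],
    ("A", "C"): ["two_hybrid", "tap_ms"],
    ("B", "C"): ["two_hybrid", "two_hybrid", "tap_ms"],
}
weights = normalized_weights(evidence)
for pair, weight in weights.items():
    print(pair, round(weight, 1))
```

Here the raw scores are 3, 4 and 4.25, so the best-supported interaction gets weight 100 and the others are scaled relative to it.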
37 3) Graph transformation
- From a network of proteins connected by interactions to a network of interactions connected by proteins
- Every binary interaction is condensed into a node that represents a pair of interacting proteins
- An edge represents a protein shared by two interactions
- The score of the edge is the average of the two edge scores in the original graph
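A minimal sketch of the line-graph transformation, assuming the weighted network is held as a dictionary from protein pairs to scores. The toy network and the name `line_graph` are illustrative, and the quadratic pairwise loop is for clarity rather than for a network of thousands of interactions.

```python
from itertools import combinations

def line_graph(weighted_edges):
    """Turn interactions into nodes and shared proteins into edges.

    weighted_edges maps a protein pair (a, b) to its confidence score.
    The result maps a pair of interactions to the average of their scores,
    for every two interactions that have a protein in common.
    """
    result = {}
    for e1, e2 in combinations(sorted(weighted_edges), 2):
        if set(e1) & set(e2):  # the two interactions share a protein
            result[(e1, e2)] = (weighted_edges[e1] + weighted_edges[e2]) / 2
    return result

# Hypothetical weighted network: A-B, A-C and C-D.
network = {("A", "B"): 100.0, ("A", "C"): 80.0, ("C", "D"): 60.0}
transformed = line_graph(network)
for (e1, e2), score in transformed.items():
    print(e1, e2, score)
```

In the toy network A-B and A-C become linked through protein A with score 90, A-C and C-D through protein C with score 70, and A-B and C-D stay unconnected because they share no protein.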
38 Advantages of the line graph
- It does not sacrifice information content
- It takes into account the higher-order local neighborhood
- It is more highly structured than the original graph
- The increase in local structure makes it better suited for clustering algorithms like TribeMCL
39 4) Use the algorithm
- Use the transformed graph as input to TribeMCL, a graph clustering algorithm
- Get the clusters of associated interactions
- Clusters are transformed back to the original form for all further analysis
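The back-transformation step can be illustrated without the clustering itself. The sketch below assumes the clustering output is a list of clusters, each a list of interactions (protein pairs); the input shown is made up for illustration and is not TribeMCL's actual output format.

```python
def clusters_to_proteins(interaction_clusters):
    """Map each cluster of interactions back to the set of proteins in it."""
    return [set().union(*cluster) for cluster in interaction_clusters]

# Hypothetical clusters of line-graph nodes (interactions).
interaction_clusters = [
    [("A", "B"), ("A", "C")],
    [("C", "D"), ("D", "E")],
]
protein_clusters = clusters_to_proteins(interaction_clusters)
print(protein_clusters)
```

Each interaction belongs to one cluster, but a protein such as C that takes part in interactions from two clusters ends up in both protein clusters, which is how clustering the line graph yields overlapping modules.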
40 5) Classification of proteins in detected clusters
- For each protein obtain
  - Manually derived regulatory and metabolic classification (KEGG)
  - Automatic functional classification (GQFC)
  - Cellular localization information (LOC)
- Among the 1046 clusters of Saccharomyces cerevisiae
  - KEGG 20
  - GQFC 45
  - LOC 48
41 Use 3 different schemes so as not to be biased toward a particular classification
Now analyze the homogeneity of each cluster under each classification scheme.
42 6) Cluster analysis
- Validation by assessing the consistency of each cluster
- More consistent => more proteins with the same function, more homogeneity
- Less consistent => more proteins with different functions, less homogeneity
- The measure of this consistency is the redundancy (R), computed for each cluster j, where n = number of classes and ps = frequency of class s in the cluster
43 All R values lie between 0 and 1
- Clusters with high consistency: R close to 1
- Clusters with low consistency: R close to 0
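The formula for R did not survive on the slide. The sketch below assumes the standard information-theoretic redundancy, R = 1 - H / log2(n) with H = -sum over classes of ps * log2(ps), which is consistent with the stated symbols (n classes, class frequencies ps) and with R ranging from 0 to 1; treat it as an assumption rather than the exact expression used in the paper.

```python
import math
from collections import Counter

def redundancy(labels, n_classes):
    """Assumed redundancy of one cluster: 1 - entropy / maximum entropy.

    labels    class label of each protein in the cluster
    n_classes number of classes in the classification scheme (> 1)
    """
    total = len(labels)
    entropy = -sum((count / total) * math.log2(count / total)
                   for count in Counter(labels).values())
    return 1 - entropy / math.log2(n_classes)

# Hypothetical clusters labelled under a scheme with four classes.
print(redundancy(["a", "a", "a", "a"], 4))  # homogeneous cluster
print(redundancy(["a", "a", "a", "b"], 4))  # mostly one class
print(redundancy(["a", "b", "c", "d"], 4))  # evenly mixed cluster
```

Under this definition a cluster whose proteins all share one class scores 1, and a cluster spread evenly over all n classes scores 0.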
44 Measuring the significance of the overall clustering
- For each classification scheme, calculate
  1) the sum of the scores of all clusters
  2) the sum of the scores of a set of 1000 random clusters
- Compare 1 and 2
- The score of each scheme is significantly larger than the score of the random clusters
- This indicates that the clustering is not random and detects biologically relevant modules
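One way to read this comparison is as a randomization test. The sketch below is an interpretation under several assumptions: the per-cluster score is the assumed entropy-based redundancy, a random trial draws clusters of the same sizes from the pool of annotated proteins, and the toy annotation is invented. The slide does not specify how the 1000 random clusters were built.

```python
import math
import random
from collections import Counter

def redundancy(labels, n_classes):
    """Assumed per-cluster score: 1 - entropy / log2(n_classes)."""
    total = len(labels)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in Counter(labels).values())
    return 1 - entropy / math.log2(n_classes)

def clustering_score(clusters, label_of, n_classes):
    """Sum of the redundancy scores of all clusters."""
    return sum(redundancy([label_of[p] for p in cluster], n_classes)
               for cluster in clusters)

def random_scores(clusters, label_of, n_classes, n_trials=1000, seed=0):
    """Scores of random clusterings with the same cluster sizes."""
    rng = random.Random(seed)
    proteins = sorted(label_of)
    scores = []
    for _ in range(n_trials):
        fake = [rng.sample(proteins, len(cluster)) for cluster in clusters]
        scores.append(clustering_score(fake, label_of, n_classes))
    return scores

# Hypothetical annotation: 12 proteins, 3 classes, 3 homogeneous clusters.
label_of = {f"p{i}": "abc"[i // 4] for i in range(12)}
clusters = [[f"p{i}" for i in range(k, k + 4)] for k in (0, 4, 8)]
observed = clustering_score(clusters, label_of, 3)
null = random_scores(clusters, label_of, 3)
p_value = sum(1 for s in null if s >= observed) / len(null)
print(observed, sum(null) / len(null), p_value)
```

A small fraction of random trials reaching the observed score supports the slide's conclusion that the clustering is not random; the fixed seed only makes the illustration repeatable.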
45 The detected clusters
- Contain proteins with consistent functional classifications and cellular localization
- Are enriched in proteins of known functional connectedness
- Include many examples of previously described functional modules
This suggests that the clusters correspond to meaningful functional modules.
46 The reason for the effectiveness of the method seems to be the complementarity of
- graph weighting
- graph transformation
- graph flow simulation clustering algorithm
Example: the scores of the clusters without the graph transformation were drastically reduced.
47 Use of the method for function prediction
- Clusters that contain both known and unknown proteins can be used for function prediction
- Poorly characterized proteins can be placed into their functional context according to their interaction partners
48 Advantages of this method
- Overlapping graph partitioning
- Few interactions are discarded
49 Summary
- Two major methods of protein functional module extraction
- Both methods claim to be the best
- Which works better depends on the dataset, inputs and assumptions
- Hope to better understand proteins
- Assign the correct function to unknown proteins
50 That's enough. Thank you. Questions? (hope not)