Functional modules discovery from protein interaction data - PowerPoint PPT Presentation

About This Presentation

Title:

Functional modules discovery from protein interaction data

Slides: 51

Provided by: people98

Transcript and Presenter's Notes

Title: Functional modules discovery from protein interaction data

1
EECS 700: Bioinformatics and Machine Learning
Presentation and Term Paper on
Functional modules discovery from protein interaction data
Aditya Deekonda
2
INTRODUCTION
Research in bioinformatics
- sequence alignment
- protein structure prediction
- protein function prediction
- protein-protein interactions
Proteins and proteins and proteins
Basic building blocks; they carry out functions.
Laboratory methods for protein discovery
- yeast two-hybrid, tagged bait
Rate of protein discovery >> rate of protein annotation
Need to turn this protein data into usable information.
3
Protein classification
- Class
- 3-D structure
- Function
- Processes
- Functional modules (present discussion)
High-level classification
Helps in functional classification

4

Basic Facts and Assumptions
- Proteins work as groups
- They interact with each other
- They form large complexes
- They perform higher-order biological functions
  (DNA replication, transcription, metabolism)
- Laboratory methods are costly
- Need for in silico methods

- Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery
- Detection of Functional Modules from Protein Interaction Networks

6
Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery
Uses protein complex data to discover functional modules
Complex data -> Search algorithm -> Analysis using Gene Ontology -> Function analysis -> Verification
7

Terms used
- Protein functional module: a group of proteins => common function
  Function: elementary biological function
  Functional module: smaller and simpler than a protein complex
- Protein complex: again a group of proteins
  This group can be a group of functional modules
  Performs higher-order functions
  Large complexes act as efficient protein machines
  More complex and larger than a functional module

Present methods
Experimental methods for complex data
- yeast two-hybrid, tagged bait
For functional modules
- densely connected subgraphs from protein networks
- clustering of proteins

9
Definitions
Pattern: a set of proteins
Hyperclique pattern: a type of pattern in which the proteins are highly affiliated
Highly affiliated: presence of one protein implies presence of the others
...more ahead
10
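Concept of Association Rules
Let $P = \{p_1, p_2, \ldots, p_n\}$ be the set of proteins
Let $C = \{c_1, c_2, \ldots, c_l\}$ be the set of protein complexes
Each complex $c_i$ is a set of proteins, $c_i \subseteq P$
$X$: a pattern (set of proteins) such that $X \subseteq P$
Support of $X$, $\mathrm{supp}(X)$: fraction of protein complexes containing $X$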
11
Sample protein complex data set
$\mathrm{supp}(\{p_3, p_4\}) = 3/5 = 60\%$
$\mathrm{supp}(\{p_1, p_2, p_3\}) = 2/5 = 40\%$
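The sample data set itself did not survive in this transcript. As a minimal sketch, the Python below uses a hypothetical set of five complexes (an assumption, chosen only so that the supports match the two figures on this slide) and a helper named `supp` (also an assumption) to compute support.

```python
from fractions import Fraction

# Hypothetical complexes, NOT the slide's original table: chosen only so that
# supp({p3, p4}) = 3/5 and supp({p1, p2, p3}) = 2/5 as quoted on the slide.
COMPLEXES = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]

def supp(pattern, complexes=COMPLEXES):
    """Fraction of complexes that contain every protein in `pattern`."""
    pattern = set(pattern)
    hits = sum(1 for c in complexes if pattern <= c)
    return Fraction(hits, len(complexes))

print(supp({"p3", "p4"}))        # 3/5
print(supp({"p1", "p2", "p3"}))  # 2/5
```

Exact fractions are used so the results can be compared directly with the 3/5 and 2/5 on the slide.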
12
Association rule $X \rightarrow Y$: presence of pattern $X$ implies presence of pattern $Y$ in the same protein complex
$X \subseteq P$, $Y \subseteq P$, $X \cap Y = \emptyset$
Strength of the affinity: confidence of $X \rightarrow Y$
$\mathrm{conf}(X \rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$
13
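Confidence of the association rule $p_3 \rightarrow p_4$
$\mathrm{conf}(p_3 \rightarrow p_4) = \mathrm{supp}(\{p_3, p_4\}) / \mathrm{supp}(\{p_3\}) = 60\% / 80\% = 75\%$
$\mathrm{supp}(\{p_3, p_4\}) = 3/5 = 60\%$
$\mathrm{supp}(\{p_3\}) = 4/5 = 80\%$
What % of the complexes that have p3 also have p4?
In this case: what % of 80 is 60.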
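The confidence calculation can be sketched in a few lines of Python. The complexes below are hypothetical (the slide's table is not in the transcript) and are chosen only so that p3 occurs in 4 of 5 complexes and p3 with p4 in 3 of 5; the function names `supp` and `conf` are likewise assumptions.

```python
from fractions import Fraction

# Hypothetical complexes (not from the slides), chosen so that
# supp({p3}) = 4/5 and supp({p3, p4}) = 3/5 as in the example.
COMPLEXES = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]

def supp(pattern, complexes):
    """Fraction of complexes that contain every protein in `pattern`."""
    hits = sum(1 for c in complexes if set(pattern) <= c)
    return Fraction(hits, len(complexes))

def conf(x, y, complexes):
    """Confidence of the rule X -> Y: supp(X u Y) / supp(X).

    Assumes X occurs in at least one complex, otherwise supp(X) is zero.
    """
    x, y = set(x), set(y)
    return supp(x | y, complexes) / supp(x, complexes)

print(conf({"p3"}, {"p4"}, COMPLEXES))  # 3/4, i.e. 75%
```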
14
Hyperclique pattern
An association pattern that contains proteins that are highly affiliated with each other.
For a pattern {p1, p2, p3}, if
p1 is present in a complex => presence of p2, p3
p2 => p1, p3
p3 => p1, p2
in the same complex
Coming up: h-confidence
15
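h-confidence: the confidence of a hyperclique pattern
h-confidence of pattern $X$: $\mathrm{hconf}(X)$
$X = \{p_1, p_2, \ldots, p_m\}$
$\mathrm{hconf}(X) = \min\{\mathrm{conf}(p_1 \rightarrow p_2, p_3, \ldots, p_m),\ \mathrm{conf}(p_2 \rightarrow p_1, p_3, \ldots, p_m),\ \ldots,\ \mathrm{conf}(p_m \rightarrow p_1, p_2, \ldots, p_{m-1})\}$
Example: $X = \{p_2, p_3, p_4\}$
$\mathrm{supp}(\{p_2\}) = 80\%$, $\mathrm{supp}(\{p_3\}) = 80\%$, $\mathrm{supp}(\{p_4\}) = 60\%$
$\mathrm{supp}(\{p_2, p_3, p_4\}) = 40\%$
$\mathrm{conf}(p_2 \rightarrow p_3, p_4) = \mathrm{supp}(\{p_2, p_3, p_4\}) / \mathrm{supp}(\{p_2\}) = 50\%$
$\mathrm{conf}(p_3 \rightarrow p_2, p_4) = \mathrm{supp}(\{p_2, p_3, p_4\}) / \mathrm{supp}(\{p_3\}) = 50\%$
$\mathrm{conf}(p_4 \rightarrow p_2, p_3) = \mathrm{supp}(\{p_2, p_3, p_4\}) / \mathrm{supp}(\{p_4\}) = 66.7\%$
$\mathrm{hconf}(X) = \min(50\%, 50\%, 66.7\%) = 50\%$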
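As a rough illustration, the h-confidence of {p2, p3, p4} can be computed as below. The five complexes are hypothetical and were picked only so that the single-protein supports (80%, 80%, 60%) and the pattern support (40%) match this slide; `supp` and `h_conf` are illustrative names, not taken from the original work.

```python
from fractions import Fraction

# Hypothetical complexes (an assumption): supp(p2) = supp(p3) = 4/5,
# supp(p4) = 3/5 and supp({p2, p3, p4}) = 2/5, as on the slide.
COMPLEXES = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]

def supp(pattern, complexes):
    """Fraction of complexes that contain every protein in `pattern`."""
    hits = sum(1 for c in complexes if set(pattern) <= c)
    return Fraction(hits, len(complexes))

def h_conf(pattern, complexes):
    """Minimum over proteins p in X of conf(p -> X minus p).

    Each such confidence equals supp(X) / supp({p}).
    Assumes every protein in the pattern occurs in some complex.
    """
    pattern = set(pattern)
    whole = supp(pattern, complexes)
    return min(whole / supp({p}, complexes) for p in pattern)

print(h_conf({"p2", "p3", "p4"}, COMPLEXES))  # 1/2, i.e. 50%
```

Because every rule in the minimum shares the numerator supp(X), the minimum is always set by the most frequent protein in the pattern.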
16
Hyperclique pattern
Pattern $X$ is a hyperclique pattern if $\mathrm{hconf}(X) \ge h_c$, where $h_c$ is the minimum h-confidence threshold
The pattern $X = \{p_2, p_3, p_4\}$ is a hyperclique pattern at a threshold of 0.5, as its h-confidence is 50% (from the previous slide)
Maximal hyperclique pattern
A hyperclique pattern is a maximal hyperclique pattern if no superset of it is a hyperclique pattern
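The maximality test is a plain subset check. A minimal sketch, assuming the hyperclique patterns have already been found (the input list below is made up for illustration and `maximal_patterns` is a hypothetical helper name):

```python
def maximal_patterns(patterns):
    """Keep only the patterns that are not a proper subset of another one."""
    pats = [frozenset(p) for p in patterns]
    return [p for p in pats if not any(p < q for q in pats)]

# Made-up hyperclique patterns, for illustration only.
hypercliques = [
    {"p2", "p3"},
    {"p3", "p4"},
    {"p2", "p3", "p4"},
    {"p1", "p2"},
]

for pattern in maximal_patterns(hypercliques):
    print(sorted(pattern))
```

Here {p2, p3} and {p3, p4} are dropped because {p2, p3, p4} contains them, while {p1, p2} survives because no listed pattern extends it.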
17

We now know
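- Protein functional module: group of proteins
- Protein complex: group of modules
- Protein pattern: group of proteins again
- $\mathrm{supp}(X)$: fraction of complexes that have $X$
- Confidence of the association $p_3 \rightarrow p_4$: $\mathrm{supp}(\{p_3, p_4\}) / \mathrm{supp}(\{p_3\})$
- For $X = \{p_1, p_2, \ldots, p_m\}$: $\mathrm{hconf}(X) = \min\{\mathrm{conf}(p_1 \rightarrow p_2, p_3, \ldots, p_m),\ \mathrm{conf}(p_2 \rightarrow p_1, p_3, \ldots, p_m),\ \ldots,\ \mathrm{conf}(p_m \rightarrow p_1, p_2, \ldots, p_{m-1})\}$
- Hyperclique pattern: $\mathrm{hconf}(X) \ge h_c$ (user-specified threshold)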

18
Searching the protein complex data for hyperclique patterns
Like generating a level-wise pattern tree
- Every level of the tree has patterns with the same number of proteins
- Every pattern has a branch that contains all supersets of this pattern
Use a breadth-first search algorithm
If a pattern does not satisfy the given threshold, prune the whole branch
This is due to the anti-monotone property: $\mathrm{hconf}(X) \ge \mathrm{hconf}(X')$ for every superset $X' \supseteq X$
Example: $\mathrm{hconf}(\{p_1, p_2\}) \ge \mathrm{hconf}(\{p_1, p_2, p_3\})$
19
The algorithm: Hyperclique miner
Input
- the protein complex data
- the threshold values for support and h-confidence
- the proteins among which to find hyperclique patterns
Output
- list of patterns with support and h-confidence values above the thresholds
20
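A hyperclique pattern => its proteins are highly affiliated with each other.
Theorem: every pair of proteins in a hyperclique pattern has a cosine similarity of at least the h-confidence threshold
$X = \{p_1, p_2, \ldots, p_m\}$ with $\mathrm{hconf}(X) \ge h_c$
For any two proteins $p_l, p_k \in X$: $\mathrm{cosine}(p_l, p_k) \ge h_c$
$\mathrm{cosine}(p_l, p_k) = \mathrm{supp}(\{p_l, p_k\}) / \sqrt{\mathrm{supp}(\{p_l\}) \cdot \mathrm{supp}(\{p_k\})}$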

How the algorithm works
For a given set of proteins (p1, p2, p3, p4, ..., pk)
For pattern sizes 2, 3, ..., k-1
- Generate candidate hyperclique patterns
- Prune based on the support threshold
- Prune based on the hc threshold
- Generate hypercliques
Pruning is also done based on the anti-monotone property
E.g. if {p1, p2} is not a hyperclique, then we can prune the candidate pattern {p1, p2, p3}
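The level-wise search can be sketched as follows. This is an Apriori-style approximation written for illustration, not the authors' Hyperclique miner implementation; the complexes are hypothetical and all function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def supp(pattern, complexes):
    """Fraction of complexes containing every protein in `pattern`."""
    hits = sum(1 for c in complexes if pattern <= c)
    return Fraction(hits, len(complexes))

def h_conf(pattern, complexes):
    """supp(X) divided by the largest single-protein support in X."""
    largest = max(supp(frozenset([p]), complexes) for p in pattern)
    return supp(pattern, complexes) / largest if largest else Fraction(0)

def mine_hypercliques(complexes, proteins, min_supp, min_hconf):
    """Breadth-first, level-wise search; returns {pattern: h-confidence}."""
    found = {}
    level = {frozenset(pair) for pair in combinations(sorted(proteins), 2)}
    size = 2
    while level:
        kept = set()
        for cand in level:
            if supp(cand, complexes) < min_supp:
                continue  # prune on the support threshold
            hc = h_conf(cand, complexes)
            if hc < min_hconf:
                continue  # prune on the h-confidence threshold
            kept.add(cand)
            found[cand] = hc
        # Next level: join surviving patterns and keep a candidate only if
        # every subset one protein smaller survived (anti-monotone pruning).
        size += 1
        level = set()
        for a, b in combinations(kept, 2):
            cand = a | b
            if len(cand) == size and all(cand - {p} in kept for p in cand):
                level.add(cand)
    return found

# Hypothetical complexes, for illustration only.
COMPLEXES = [
    {"p1", "p2", "p3"},
    {"p1", "p2", "p3", "p4"},
    {"p2", "p3", "p4"},
    {"p3", "p4"},
    {"p2"},
]
result = mine_hypercliques(COMPLEXES, {"p1", "p2", "p3", "p4"},
                           Fraction(0), Fraction(1, 2))
for pat, hc in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(pat), hc)
```

In this toy run {p1, p4} fails the h-confidence threshold, so no candidate containing both p1 and p4 is ever scored.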

22
Protein complex data set
Protein complexes of Saccharomyces cerevisiae
TAP-MS (two purification methods), with bait proteins used to identify physiologically intact protein complexes
The dataset by Gavin et al. has better accuracy for predicting protein functions.
Using this dataset, we have 1440 proteins within 232 complexes
23

Analysis tools
The Gene Ontology (GO): used to annotate proteins
GO: controlled vocabulary
- consistency in the use of terms
- classifies all the terms by
  - Molecular function
  - Biological process
  - Cellular component

24
Graphviz: used to produce the graph representation of the annotation
The functional description comes from the Saccharomyces Genome Database
The 3-D structure comes from the Protein Data Bank
PyMOL: for visualizing the 3-D structures
25
Setting a support threshold of 0 and an h-confidence threshold of 0.6
We obtain 60 maximal hyperclique patterns
To analyze and test these patterns, use terms from GO
For a pattern, e.g. {Cus1, Msl1, Prp3, Prp9, Sme1, Smx2, Smx3, Yhc1}
- Annotate the proteins with GO terms
- Analyze the function and process
26
Analysis of protein pattern function
Graph on left: function annotation of the pattern
Graph on right: process annotation of the pattern
Significant nodes are shown with the number of proteins and the statistical significance
27
Hypercliques as functional modules
Hyperclique pattern: {Pre2, Pre4, Pre5, Pre6, Pre9, Pre8, Pup3, Scl1}
Contained in 4 complexes, with consistent function.
2 categories => hyperclique patterns act as functional modules participating in protein complexes to perform higher-order functions
28
A 3-D view of the interaction of the proteins of pattern {Pre2, Pre4, Pre5, Pre6, Pre9, Pre8, Pup3, Scl1}
The 3-D structures of the proteins of the hyperclique pattern show physical interactions.
Compelling evidence that hyperclique patterns form a compact structure and perform a common function.
29
Subgraph of the Gene Ontology for function correspondence for Complex 151
This complex has 3 hyperclique patterns with one function each
30
Detection of Functional Modules from Protein
Interaction Networks
31
Detect protein functional modules from protein interaction networks.
Modules: groups of genes or proteins involved in common elementary functions
Use of clustering
Extracts 1046 modules from Saccharomyces cerevisiae.
TribeMCL, which is an unsupervised graph clustering algorithm.
Use of an additive weighting scheme
Transformation of the protein interaction network into a line graph.
32
Basic steps in the method
1) Protein interaction network (Saccharomyces, DIP)
2) Add weights (confidence)
3) Graph transformation
4) TribeMCL
5) Classification of proteins
6) Cluster analysis
33
Terms used ahead
Protein interaction network: graph where a node is a protein and an edge is an interaction.
Line graph: interactions connected by proteins
34

Protein interaction network
Saccharomyces cerevisiae from DIP
- It is well studied
- DIP has curated information
This network has 8046 physical interactions among 4081 proteins
6586 are supported by a single experiment and the others by at least two.

Adding weights
It is believed that the most reliable protein interaction assignments are those supported by more than one method.
Use a simple additive weighting scheme
Weight attributed to each interaction => degree of confidence
The confidence level represents the number of experiments that support the interaction

36
All methods are equally weighted.
No difference between small-scale and large-scale experiments
Start with a score of 3.
For further instances of the interaction, add
- 1 if the method is different
- 0.25 if the method is already used
All scores are normalized as a percentage of the highest score in the data set.
We now have a weighted network of proteins connected by interactions, where the weight represents the confidence in that interaction.
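A minimal sketch of this additive scoring follows. It reads "method is different" as "a method not yet seen for this interaction", which is an interpretation rather than something the slides spell out; the method labels and function names are hypothetical.

```python
def raw_score(methods):
    """Score one interaction from the ordered list of supporting methods."""
    if not methods:
        raise ValueError("an interaction needs at least one observation")
    score, seen = 3.0, {methods[0]}          # first observation scores 3
    for method in methods[1:]:
        # +1 for a method not seen before, +0.25 for a repeated method
        score += 0.25 if method in seen else 1.0
        seen.add(method)
    return score

def normalized_weights(observations):
    """Express every score as a percentage of the highest score."""
    raw = {pair: raw_score(ms) for pair, ms in observations.items()}
    top = max(raw.values())
    return {pair: 100.0 * s / top for pair, s in raw.items()}

# Hypothetical observations: protein pair -> methods that reported it.
observations = {
    ("A", "B"): ["y2h"],
    ("B", "C"): ["y2h", "tap"],
    ("C", "D"): ["y2h", "y2h", "tap"],
}
weights = normalized_weights(observations)
for pair, weight in weights.items():
    print(pair, round(weight, 1))
```

The raw scores here are 3, 4 and 4.25, so the third interaction defines the 100% mark.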
37
3) Graph transformation
From a network of proteins connected by interactions to a network of interactions connected by proteins
Every binary interaction is condensed into a node that represents a pair of interacting proteins.
An edge represents a protein common to two interactions.
The score for the edge is the average of the two edge scores in the original graph.
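The transformation can be illustrated with a small pure-Python sketch. The input format (a dict from protein pairs to scores) and the name `line_graph` are assumptions made for this example.

```python
from itertools import combinations

def line_graph(weighted_edges):
    """Turn {(protein, protein): score} into {(interaction, interaction): score}.

    Each interaction becomes a node; two interactions are linked when they
    share a protein, and the link is scored by the mean of the two scores.
    """
    nodes = {tuple(sorted(edge)): w for edge, w in weighted_edges.items()}
    links = {}
    for (e1, w1), (e2, w2) in combinations(sorted(nodes.items()), 2):
        if set(e1) & set(e2):          # the two interactions share a protein
            links[(e1, e2)] = (w1 + w2) / 2
    return links

# Hypothetical weighted interaction network: a chain A - B - C - D.
network = {("A", "B"): 100.0, ("B", "C"): 80.0, ("C", "D"): 60.0}
for link, score in line_graph(network).items():
    print(link, score)
```

In this chain, A-B and B-C are linked through B with score 90, B-C and C-D through C with score 70, and A-B and C-D stay unlinked because they share no protein.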
38

Advantages of Line Graph
- It does not sacrifice information content
- It takes into account the higher-order local neighborhood
- More highly structured than the original graph
- Increase in local structure: better suited for clustering algorithms like TribeMCL

39
4) Use the algorithm
Now use the transformed graph as input to TribeMCL, which is a graph clustering algorithm
Get the clusters of associated interactions.
Clusters are transformed back to the original form for all further analysis.
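The back-transformation step is simple to sketch: each cluster of interactions becomes the set of proteins taking part in them. The clusters below are invented for illustration (they are not TribeMCL output) and the function name is an assumption.

```python
def proteins_in_clusters(interaction_clusters):
    """Map each cluster of interactions back to the proteins it involves."""
    return [set().union(*cluster) for cluster in interaction_clusters]

# Hypothetical clusters of interactions (nodes of the line graph).
interaction_clusters = [
    [("A", "B"), ("B", "C")],
    [("C", "D"), ("D", "E")],
]
protein_clusters = proteins_in_clusters(interaction_clusters)
for cluster in protein_clusters:
    print(sorted(cluster))
```

Protein C ends up in both protein clusters even though every interaction belongs to exactly one cluster, which shows how clustering interactions can yield overlapping protein modules.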
40

5) Classification of Proteins in detected
Clusters
For each protein obtain
- Manually derived regulatory and metabolic classification (KEGG)
- Automatic functional classification (GQFC)
- Cellular localization information (LOC)
Among the 1046 clusters of Saccharomyces
KEGG 20
GQFC 45
LOC 48

41
Use 3 different schemes so as not to be biased toward a particular classification
Now, analyze the homogeneity of each cluster under each classification scheme
42
6) Cluster analysis
Validation by assessing the consistency of the cluster.
More consistent => more proteins with the same function, more homogeneity
Less consistent => more proteins with different functions, less homogeneity
The measure of this consistency is given by the redundancy (R) of each cluster j
n: number of classes
Ps: frequency of class s in the cluster
43
All R values lie between 0 and 1
Clusters with high consistency: close to 1
Clusters with low consistency: close to 0
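The formula for R did not survive in this transcript. One entropy-based form that fits the description (n classes, class frequencies Ps, values between 0 and 1) is R = 1 - H / log2(n) with H = -sum(Ps * log2(Ps)); the sketch below uses that form as an explicit assumption, not as the definition from the original work.

```python
import math
from collections import Counter

def redundancy(labels, n_classes):
    """Assumed form: R = 1 - H / log2(n), with H the Shannon entropy of the
    class frequencies inside the cluster and n the number of classes."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    if not labels:
        raise ValueError("cluster is empty")
    total = len(labels)
    entropy = -sum((k / total) * math.log2(k / total)
                   for k in Counter(labels).values())
    return 1.0 - entropy / math.log2(n_classes)

# Hypothetical clusters annotated under a 4-class scheme.
print(redundancy(["a", "a", "a", "a"], 4))            # homogeneous cluster
print(redundancy(["a", "b", "c", "d"], 4))            # evenly mixed cluster
print(round(redundancy(["a", "a", "a", "b"], 4), 3))  # in between
```

Under this assumed form a cluster whose proteins all share one class scores 1, and a cluster spread evenly over all n classes scores 0.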
44

Measuring the significance of overall clustering
For each classification scheme, calculate
1) Sum of the scores of all clusters
2) Sum of the scores of a set of 1000 random clusters
Compare 1 and 2
The score of each scheme is significantly larger than the score of the random clusters.
This indicates that the clustering is not random and detects biologically relevant modules.
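The comparison against random clusters can be sketched as a label-shuffling test. Several things here are assumptions: the entropy-based score (the slide's own formula is missing from the transcript), the randomization (shuffling class labels while keeping cluster sizes), and all names and data.

```python
import math
import random
from collections import Counter

def redundancy(labels, n_classes):
    """Assumed score: 1 - (Shannon entropy of class frequencies) / log2(n)."""
    total = len(labels)
    entropy = -sum((k / total) * math.log2(k / total)
                   for k in Counter(labels).values())
    return 1.0 - entropy / math.log2(n_classes)

def total_score(clusters, n_classes):
    """Sum of the scores of all clusters."""
    return sum(redundancy(c, n_classes) for c in clusters)

def random_total(clusters, n_classes, rng):
    """Score one randomization: shuffle labels, keep the cluster sizes."""
    pool = [label for c in clusters for label in c]
    rng.shuffle(pool)
    total, start = 0.0, 0
    for c in clusters:
        total += redundancy(pool[start:start + len(c)], n_classes)
        start += len(c)
    return total

def compare_with_random(clusters, n_classes, trials=1000, seed=0):
    """Return the observed total and the fraction of random totals >= it."""
    rng = random.Random(seed)
    observed = total_score(clusters, n_classes)
    hits = sum(random_total(clusters, n_classes, rng) >= observed
               for _ in range(trials))
    return observed, hits / trials

# Hypothetical, perfectly homogeneous clusters under a 3-class scheme.
clusters = [["a"] * 5, ["b"] * 5, ["c"] * 5]
observed, fraction = compare_with_random(clusters, n_classes=3)
print(observed, fraction)
```

A small fraction means random groupings rarely score as well as the detected clusters, which is the sense in which the clustering is called non-random.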

The detected clusters
- Contain proteins with consistent functional classifications and cellular localization
- Are enriched in proteins of known functional connectedness
- Include many examples of previously described functional modules
This suggests that the clusters correspond to meaningful functional modules.

46
The reason for the effectiveness of the method seems to be the complementarity of
- graph weighting
- graph transformation
- graph flow simulation clustering algorithm
Example: the scores of the clusters without the graph transformation were drastically reduced.
47
Use of the method for function prediction
Clusters with both known and unknown proteins can be a case for function prediction.
According to their interaction partners, poorly characterized proteins can be placed into their functional context.
48
Advantages of this method
- Overlapping graph partitioning
- Few interactions are discarded
49
Summary
Two major methods of protein functional module extraction
Both methods say they are the best
Depends on dataset, inputs and assumptions
Hope to better understand proteins
Assign the correct function to unknown proteins
50
That's enough. Thank you. Questions? (hope not)
