Title: Large Scale Analysis of Proteomic Data
1Large Scale Analysis of Proteomic Data
- A sample of the research in the Linal group
2Three Sample Projects
- ProtoNet: Automatic Clustering of Protein Sequences
- EVEREST: Automatic Identification and Classification of Protein Domains
- PANDORA: Annotation Based Analysis of Protein Sets
3Protein classification - The Challenges
- We have
- Many new genome sequences
- Almost 250 complete genomes
- Growing very fast
- We want
- Sequence to function inference
- New discoveries (new families, hypothetical proteins, new connections between remote families, ...)
4Clustering by Pairwise Similarity
- Our Approach
- Use sequence similarity data to define (hierarchical) protein families
- Validate our families
- Infer function / 3D structure of unknown proteins from annotation of proteins in family
- Identify families with no annotation as new families (candidates for further research)
5Transitivity of Sequence Similarity
- Basic tool: pairwise sequence similarity
- Limited range
- Transitivity increases range
- Pitfalls
- Amplification of noise
- False transitivity
6Amplification of Noise
- Assume a simple world where proteins are divided into exact families.
- Our sequence similarity tools are noisy.
- Some edges are missed, and some edges that should not be there are added.
- We use transitivity to recover lost edges.
- Careless use of transitivity would create one
large cluster
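
The effect described above can be illustrated with a minimal sketch. The toy data (three families of four proteins, two false edges) and the function name `count_clusters` are hypothetical and chosen only for illustration; "careless transitivity" is modelled here as plain connected components.

```python
def count_clusters(n, edges):
    """Number of clusters when transitivity is applied without restriction
    (connected components of the similarity graph, via union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})


# Hypothetical toy data: families are proteins 0-3, 4-7 and 8-11.
# Some true edges are missing (e.g. 0-3), but transitivity recovers them.
true_edges = [(0, 1), (1, 2), (2, 3),
              (4, 5), (5, 6), (6, 7),
              (8, 9), (9, 10), (10, 11)]
# Two noisy edges that should not be there.
false_edges = [(3, 4), (7, 8)]

print(count_clusters(12, true_edges))                # 3
print(count_clusters(12, true_edges + false_edges))  # 1
```

Two false edges are enough to collapse the three families into one cluster, which is why transitivity has to be applied with some control.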
7ProtoNet Clustering Method
- First, each protein is considered a singleton (a
cluster of its own).
8ProtoNet Clustering Method
- Next, we iteratively merge pairs of clusters
- At each step, we merge the most similar pair of clusters.
9ProtoNet Clustering Method
- The clustering process gradually generates a tree
of clusters
10ProtoNet Clustering Method
- Lastly, we choose a subset of the clusters in the
tree.
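
The merging steps above can be sketched as a generic bottom-up (agglomerative) clustering. This is only an illustration under stated assumptions: the slides do not say how similarity between two clusters is computed, so the average pairwise similarity used here is an assumption, and the names `agglomerate` and `sim` are hypothetical. The final step (choosing a subset of clusters from the tree) is not shown.

```python
from itertools import combinations


def agglomerate(names, sim):
    """Return the merged clusters (inner tree nodes) in the order created.

    sim maps frozenset({a, b}) to a similarity score between proteins
    a and b; pairs that are absent count as 0.
    """
    # Step 1: every protein starts as a singleton cluster.
    clusters = [frozenset([n]) for n in names]
    merges = []

    def avg_sim(c1, c2):
        # Assumed merging rule: average similarity over all cross pairs.
        total = sum(sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    # Step 2: repeatedly merge the most similar pair of clusters.
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        clusters = [c for c in clusters if c not in (c1, c2)]
        merged = c1 | c2
        clusters.append(merged)
        merges.append(merged)  # Step 3: the merges form the tree
    return merges


# Hypothetical similarity scores between four proteins.
sim = {
    frozenset(("A", "B")): 0.9,
    frozenset(("C", "D")): 0.8,
    frozenset(("B", "C")): 0.3,
}
merges = agglomerate(["A", "B", "C", "D"], sim)
for m in merges:
    print(sorted(m))
```

With these scores, A and B merge first, then C and D, and the root joins all four proteins.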
11ProtoNet - Evaluation
- Evaluate vs. an annotation system (e.g. InterPro)
- Cluster correspondence w.r.t. a specific keyword
- TP: proteins with keyword in cluster
- FN: proteins with keyword out of cluster
- FP: annotated proteins without keyword in cluster
- Score = TP/(TP+FP+FN)
- For each keyword, we find the highest scoring cluster
- We want a minimal number of clusters covering all keywords
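
A minimal sketch of the correspondence score, assuming proteins are represented as sets of identifiers. The function names and the toy data are hypothetical; only the TP, FN, FP definitions and the score formula come from the slide.

```python
def correspondence_score(cluster, keyword_proteins, annotated):
    """Score = TP / (TP + FP + FN) for one cluster and one keyword."""
    tp = len(cluster & keyword_proteins)                # with keyword, in cluster
    fn = len(keyword_proteins - cluster)                # with keyword, out of cluster
    fp = len((cluster & annotated) - keyword_proteins)  # annotated, no keyword, in cluster
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def best_cluster(clusters, keyword_proteins, annotated):
    """Name of the highest scoring cluster for a keyword."""
    return max(
        clusters,
        key=lambda name: correspondence_score(
            clusters[name], keyword_proteins, annotated
        ),
    )


# Hypothetical data: p7 carries no annotation, so it is never counted as FP.
annotated = {"p1", "p2", "p3", "p4", "p5", "p6"}
kinase = {"p1", "p2", "p3"}
clusters = {
    "c1": {"p1", "p2"},
    "c2": {"p1", "p2", "p3", "p4", "p7"},
}

print(correspondence_score(clusters["c1"], kinase, annotated))  # TP=2, FN=1, FP=0
print(correspondence_score(clusters["c2"], kinase, annotated))  # TP=3, FN=0, FP=1
print(best_cluster(clusters, kinase, annotated))
```

Restricted to annotated proteins, this score equals the Jaccard index between the cluster and the set of proteins carrying the keyword.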
12ProtoNet - Evaluation
13ProtoNet - Example
14ProtoNet Web - in brief
15Protein card
16Cluster card
A name is added automatically for a cluster (when coherent with its annotations)
17EVEREST Domain Classification
- Proteins are composed of domains
- Let's identify these domains and classify them instead of the whole sequence
C-terminal transcriptional regulatory domain
Response regulator receiver domain
luxR family
Autoinducer binding domain
18EVEREST - process
19EVEREST - evaluation
- Testing for domains -> more FP
- Coverage: proportion of known domain families reconstructed well
- Accuracy: proportion of clusters that are well explained by known families (excluding new)
20PANDORA Analysing Annotations of Protein Sets
- How does one understand a cluster of proteins?
- Look at the annotations of those proteins
- Say you have 100 proteins
- Say I tell you that
- 60 of them are annotated as membrane proteins
- 40 of them are annotated as enzymes
- Is that enough information?
21PANDORA
[Figure: 60 membrane, 40 enzyme; three alternative arrangements (?) of the membrane and enzyme subsets]
22PANDORA
- Given
- set of proteins (e.g. ProtoNet cluster)
- keywords annotating them
- Let each combination of keywords define a subset of the proteins
- Show
- A graph of inclusion and intersection of the protein subsets defined by the keywords
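
The idea can be sketched as follows. This is a simplified illustration, not the PANDORA implementation: it builds one subset per single keyword (rather than per combination of keywords) and labels each pair of subsets by its set relation. The function names and the four-protein annotation table are hypothetical.

```python
from itertools import combinations


def keyword_subsets(annotations):
    """Map each keyword to the set of proteins annotated with it."""
    subsets = {}
    for protein, keywords in annotations.items():
        for kw in keywords:
            subsets.setdefault(kw, set()).add(protein)
    return subsets


def relations(subsets):
    """Label each keyword pair (a, b), in alphabetical order, by set relation."""
    result = {}
    for a, b in combinations(sorted(subsets), 2):
        sa, sb = subsets[a], subsets[b]
        if sa <= sb:            # also covers the case of equal sets
            result[(a, b)] = "subset"
        elif sb <= sa:
            result[(a, b)] = "superset"
        elif sa & sb:
            result[(a, b)] = "overlap"
        else:
            result[(a, b)] = "disjoint"
    return result


# Hypothetical annotations for four proteins.
annotations = {
    "p1": {"membrane", "channel"},
    "p2": {"membrane"},
    "p3": {"membrane", "enzyme"},
    "p4": {"enzyme"},
}
rel = relations(keyword_subsets(annotations))
for pair, kind in rel.items():
    print(pair, kind)
```

Here "channel" is included in "membrane", "enzyme" and "membrane" intersect, and "channel" and "enzyme" are disjoint; these relations are the edges such a graph would display.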
23PANDORA - example
- Take the set of all 576 proteins annotated by GO molecular function as anion channel.
- Analyze their InterPro signatures
24BASIC SET
InterPro
Number of proteins
Sensitivity = TP/(TP+FN); red = FN, white = TP
25InterPro
GABA A receptor
gamma subunit
alpha subunit
beta subunit
26Summary
- ProtoNet and EVEREST
- Automatic classification of large databases of protein sequences (protein domains)
- Find new members of known families
- Interactions between families
- New families
- PANDORA
- Analyze annotations of a set of proteins
- Understand clusters
- Understand experiment results (e.g. expression chips)
- Curate annotations