Large Scale Analysis of Proteomic Data

DIAMONDS Kick-Off Meeting. Gent, 4/Feb/2005.

Provided by: sbcell

1
Large Scale Analysis of Proteomic Data

A sample of the research in the Linal group

2
Three Sample Projects

ProtoNet: Automatic Clustering of Protein Sequences
EVEREST: Automatic Identification and Classification of Protein Domains
PANDORA: Annotation-Based Analysis of Protein Sets

3
Protein classification - The Challenges

We have
Many new genome sequences: almost 250 complete genomes
The number is growing very fast
We want
Sequence-to-function inference
New discoveries (new families, hypothetical proteins, new connections between remote families)

4
Clustering by Pairwise Similarity

Our Approach
Use sequence similarity data to define
(hierarchical) protein families
Validate our families
Infer the function / 3D structure of unknown proteins
from the annotations of other proteins in the family
Identify families with no annotation as new
families (candidates for further research)

5
Transitivity of Sequence Similarity

Basic tool: pairwise sequence similarity
Limited range
Transitivity increases the range
Pitfalls:
Amplification of noise
False transitivity

6
Amplification of Noise

Assume a simple world where proteins are divided
into exact families.
Our sequence similarity tools are noisy:
some edges are missed, and some edges that should not
be there are added.

We use transitivity to recover lost edges.
Careless use of transitivity would create one
large cluster
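
The pitfall above can be illustrated with a minimal sketch. It assumes "careless" transitivity means taking connected components of the similarity graph; the protein names and edges are hypothetical, not data from the talk.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes by naive transitivity: any chain of edges joins a cluster."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Two true families (hypothetical protein names).
family_a = ["a1", "a2", "a3", "a4"]
family_b = ["b1", "b2", "b3", "b4"]
nodes = family_a + family_b

# Noisy similarity data: direct edges such as (a1, a4) are missing,
# but chains of edges recover the full family.
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a4"),
         ("b1", "b2"), ("b2", "b3"), ("b3", "b4")]

clean = connected_components(nodes, edges)
# A single spurious edge between the families is enough to fuse them.
noisy = connected_components(nodes, edges + [("a4", "b1")])

print(len(clean), "clusters without the false edge")
print(len(noisy), "cluster(s) with the false edge")
```

Transitivity recovers the missing within-family links, but one false edge collapses both families into a single cluster, which is why the merging has to be done more carefully.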

7
ProtoNet Clustering Method

First, each protein is considered a singleton (a
cluster of its own).

8
ProtoNet Clustering Method

Next, we iteratively merge pairs of clusters.
At each step we choose to merge the most similar pair of
clusters.

9
ProtoNet Clustering Method

The clustering process gradually generates a tree
of clusters

10
ProtoNet Clustering Method

Lastly, we choose a subset of the clusters in the
tree.
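
The merging steps on the last few slides can be sketched as a bottom-up (agglomerative) procedure. This is only an illustration: the slides do not say how similarity between two clusters is computed, so the average of cross-cluster pairwise scores used here is an assumption, and the proteins and scores are hypothetical. The final step of choosing a subset of clusters from the tree is not shown.

```python
from itertools import combinations

def agglomerate(items, similarity):
    """Bottom-up clustering: repeatedly merge the most similar pair of clusters.

    `similarity` maps frozenset({a, b}) to a score; missing pairs count as 0.
    Returns the list of merges, which together describe the tree.
    """
    clusters = [frozenset([x]) for x in items]  # start from singletons
    merges = []

    def cluster_similarity(c1, c2):
        # Assumption: average of all cross-cluster pairwise scores.
        total = sum(similarity.get(frozenset((a, b)), 0.0)
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_similarity(*pair))
        merged = c1 | c2
        merges.append((c1, c2, merged))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return merges

# Hypothetical pairwise similarity scores.
sim = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p3", "p4")): 0.8,
    frozenset(("p2", "p3")): 0.3,
}
tree = agglomerate(["p1", "p2", "p3", "p4"], sim)
for left, right, merged in tree:
    print(sorted(left), "+", sorted(right), "->", sorted(merged))
```

Each merge adds one internal node, so n proteins yield n - 1 merges and a binary tree with the singletons as leaves.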

11
ProtoNet - Evaluation

Evaluate vs. an annotation system (e.g. InterPro)
Cluster correspondence w.r.t. a specific keyword:
TP: proteins with the keyword, in the cluster
FN: proteins with the keyword, outside the cluster
FP: annotated proteins without the keyword, in the cluster
Score = TP/(TP+FP+FN)
For each keyword, we find the highest-scoring cluster
We want a minimal number of clusters covering all keywords
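
As a small worked example of the score on this slide, the sketch below computes TP/(TP+FP+FN) for each cluster and picks the best cluster for one keyword. The function names, cluster names and protein sets are hypothetical, and the keyword set is assumed to be a subset of the annotated proteins.

```python
def correspondence_score(cluster, keyword_proteins, annotated):
    """TP / (TP + FP + FN) for one cluster against one keyword."""
    tp = len(cluster & keyword_proteins)
    fn = len(keyword_proteins - cluster)
    # Only annotated proteins count as false positives, as on the slide.
    fp = len((cluster & annotated) - keyword_proteins)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def best_cluster(clusters, keyword_proteins, annotated):
    """Return (name, score) of the highest-scoring cluster for a keyword."""
    return max(
        ((name, correspondence_score(members, keyword_proteins, annotated))
         for name, members in clusters.items()),
        key=lambda item: item[1],
    )

clusters = {
    "c1": {"p1", "p2", "p3", "p9"},
    "c2": {"p4", "p5"},
}
annotated = {"p1", "p2", "p3", "p4", "p5"}  # p9 has no annotation
kinase = {"p1", "p2", "p4"}                 # proteins carrying the keyword

name, score = best_cluster(clusters, kinase, annotated)
print(name, score)  # c1 0.5
```

For c1 the counts are TP = 2, FN = 1, FP = 1 (the unannotated p9 is not penalised), giving 2/4; for c2 they are TP = 1, FN = 2, FP = 1, giving 1/4.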

12
ProtoNet - Evaluation

28000 clusters

13
ProtoNet - Example
14
ProtoNet Web - in brief
15
Protein card
16
Cluster card
A name is added automatically to a cluster (when it is
coherent with the cluster's annotations)
17
EVEREST Domain Classification

Proteins are composed of domains
Let's identify these domains and classify them
instead of the whole sequence

C-terminal transcriptional regulatory domain
Response regulator receiver domain
luxR family
Autoinducer binding domain
18
EVEREST - process
19
EVEREST - evaluation

Testing for domains -> more FP

Coverage: proportion of known domain families that are
reconstructed well
Accuracy: proportion of clusters that can be well
explained by known families (excluding new ones)

20
PANDORA Analysing Annotations of Protein Sets

How does one understand a cluster of proteins?
Look at the annotations of those proteins
Say you have 100 proteins
Say I tell you that
60 of them are annotated as membrane proteins
40 of them are annotated as enzymes
Is that enough information?
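
A quick calculation suggests why the two counts alone are not enough: they leave the overlap between the two annotations undetermined. The helper below is a hypothetical illustration that computes the smallest and largest number of proteins that could carry both annotations.

```python
def overlap_bounds(total, count_a, count_b):
    """Smallest and largest possible number of items carrying both annotations."""
    smallest = max(0, count_a + count_b - total)
    largest = min(count_a, count_b)
    return smallest, largest

low, high = overlap_bounds(100, 60, 40)
print(low, high)  # 0 40
```

With 100 proteins, 60 membrane and 40 enzyme, anywhere from none to all 40 of the enzymes could also be membrane proteins, and these are very different pictures of the same set.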

21
PANDORA
[Figure: the 60 membrane proteins and 40 enzymes, with alternative diagrams (marked "?") of how the two subsets might relate]
22
PANDORA

Given:
A set of proteins (e.g. a ProtoNet cluster)
The keywords annotating them
Let each combination of keywords define a subset
of the proteins
Show:
A graph of inclusion and intersection of the
protein subsets defined by the keywords
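
The idea can be sketched in a few lines. This is a simplified illustration, not PANDORA's actual implementation: it builds one subset per single keyword and classifies each pair of subsets, whereas the slide allows any combination of keywords. The proteins and annotations are hypothetical.

```python
from itertools import combinations

def keyword_subsets(annotations):
    """Map each keyword to the set of proteins annotated with it."""
    subsets = {}
    for protein, keywords in annotations.items():
        for kw in keywords:
            subsets.setdefault(kw, set()).add(protein)
    return subsets

def subset_relations(subsets):
    """Classify each keyword pair as equal, inclusion, intersection or disjoint."""
    relations = {}
    for a, b in combinations(sorted(subsets), 2):
        sa, sb = subsets[a], subsets[b]
        if sa == sb:
            rel = "equal"
        elif sa < sb:          # proper subset
            rel = f"{a} inside {b}"
        elif sb < sa:
            rel = f"{b} inside {a}"
        elif sa & sb:
            rel = "intersect"
        else:
            rel = "disjoint"
        relations[(a, b)] = rel
    return relations

annotations = {
    "p1": {"membrane", "channel"},
    "p2": {"membrane", "enzyme"},
    "p3": {"membrane"},
    "p4": {"enzyme"},
}
relations = subset_relations(keyword_subsets(annotations))
for pair, rel in relations.items():
    print(pair, rel)
```

The "inside" relations would become the inclusion edges of the graph, and the "intersect" pairs mark subsets that overlap without one containing the other.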

23
PANDORA - example

Take the set of all 576 proteins annotated by GO
molecular function as anion channel.
Analyze their InterPro signatures.

24
BASIC SET
InterPro
Number of proteins
Sensitivity = TP/(TP+FN) (red: FN, white: TP)
25
InterPro
GABA A receptor
gamma subunit
alpha subunit
beta subunit
26
Summary

ProtoNet and EVEREST
Automatic classification of large databases of
protein sequences (protein domains)
Find new members of known families
Interactions between families
New families
PANDORA
Analyze annotations of a set of proteins
Understand clusters
Understand experiment results (e.g. expression
chips)
Curate annotations