Large Scale Analysis of Proteomic Data

DIAMONDS Kick-Off Meeting. Gent, 4/Feb/2005.

1
Large Scale Analysis of Proteomic Data
  • A sample of the research in the Linal group

2
Three Sample Projects
  • ProtoNet: Automatic Clustering of Protein
    Sequences
  • EVEREST: Automatic Identification and
    Classification of Protein Domains
  • PANDORA: Annotation Based Analysis of Protein
    Sets

3
Protein classification - The Challenges
  • We have:
    • MANY new genome sequences: almost 250 complete
      genomes
    • Growing very fast
  • We want:
    • Sequence to function inference
    • New discoveries (new families, hypothetical
      proteins, new connections between remote families)

4
Clustering by Pairwise Similarity
  • Our approach:
    • Use sequence similarity data to define
      (hierarchical) protein families
    • Validate our families
    • Infer the function / 3D structure of unknown proteins
      from the annotations of other proteins in the family
    • Identify families with no annotation as new
      families (candidates for further research)

5
Transitivity of Sequence Similarity
  • Basic tool: pairwise sequence similarity
    • Limited range
    • Transitivity increases range
  • Pitfalls:
    • Amplification of noise
    • False transitivity

6
Amplification of Noise
  • Assume a simple world where proteins are divided
    into exact families.
  • Our sequence similarity tools are noisy.
  • Some edges are missed, some edges that should not
    be there are added.
  • We use transitivity to recover lost edges.
  • Careless use of transitivity would create one
    large cluster
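
The effect described above can be illustrated with a minimal sketch. It treats careless transitivity as taking connected components of the similarity graph (an assumption made for illustration, not a description of ProtoNet itself). The protein names and edges are hypothetical toy data.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes by naive transitivity: any chain of edges joins two nodes."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: two true families, "a" and "b".
proteins = ["a1", "a2", "a3", "b1", "b2", "b3"]
# The edge (a1, a3) was missed; transitivity recovers it through a2.
detected = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3")]

clean = connected_components(proteins, detected)
# A single spurious edge is enough to fuse both families into one cluster.
noisy = connected_components(proteins, detected + [("a3", "b1")])
```

With the clean edges the two families are recovered despite the missing edge; with one false edge added, everything collapses into a single cluster.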

7
ProtoNet Clustering Method
  • First, each protein is considered a singleton (a
    cluster of its own).

8
ProtoNet Clustering Method
  • Next, we iteratively merge pairs of clusters
  • At each step, we merge the most similar pair of
    clusters.

9
ProtoNet Clustering Method
  • The clustering process gradually generates a tree
    of clusters

10
ProtoNet Clustering Method
  • Lastly, we choose a subset of the clusters in the
    tree.
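
The four steps on slides 7 to 10 can be sketched as a generic bottom-up clustering. The slides do not say how similarity between two clusters is computed or how the final subset is chosen, so the mean over cross pairs and the score threshold below are assumptions, and all names and scores are hypothetical.

```python
from itertools import combinations


def agglomerate(items, similarity):
    """Bottom-up clustering; returns the merges as (left, right, score).

    `similarity` maps frozenset({a, b}) to a score; missing pairs count as 0.
    Cluster-to-cluster similarity is the mean over all cross pairs
    (an assumption; the slides do not state the merging rule).
    """
    clusters = [frozenset([x]) for x in items]  # step 1: singletons
    merges = []

    def score(c1, c2):
        total = sum(similarity.get(frozenset((a, b)), 0.0)
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Step 2: merge the most similar pair of clusters.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: score(*p))
        merges.append((c1, c2, score(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges  # step 3: the ordered merges encode the tree


def select_clusters(merges, min_score):
    """Step 4 (assumed rule): keep clusters formed at or above min_score."""
    return [c1 | c2 for c1, c2, s in merges if s >= min_score]


# Hypothetical pairwise similarity scores.
sim = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p3", "p4")): 0.8,
    frozenset(("p2", "p3")): 0.2,
}
merges = agglomerate(["p1", "p2", "p3", "p4"], sim)
chosen = select_clusters(merges, 0.5)
```

Here `chosen` keeps the two tight clusters and drops the weakly supported root, which is one simple way to avoid the single large cluster that careless transitivity would produce.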

11
ProtoNet - Evaluation
  • Evaluate vs. an annotation system (e.g. InterPro)
  • Cluster correspondence w.r.t. a specific keyword:
    • TP: proteins with the keyword, in the cluster
    • FN: proteins with the keyword, out of the cluster
    • FP: annotated proteins without the keyword, in the
      cluster
    • Score = TP / (TP + FP + FN)
  • For each keyword, we find the highest scoring
    cluster
  • We want a minimal number of clusters covering all
    keywords
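
As a worked illustration of the score, the sketch below computes TP, FP and FN with set operations and picks the highest scoring cluster for one keyword. The cluster IDs, protein names and annotations are hypothetical; only the formula comes from the slide.

```python
def correspondence_score(cluster, with_keyword, annotated):
    """Score = TP / (TP + FP + FN) for one cluster and one keyword."""
    tp = len(cluster & with_keyword)                 # keyword, in cluster
    fn = len(with_keyword - cluster)                 # keyword, out of cluster
    fp = len((cluster & annotated) - with_keyword)   # annotated, no keyword, in cluster
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def best_cluster(clusters, with_keyword, annotated):
    """Return (cluster_id, score) of the highest scoring cluster for a keyword."""
    return max(
        ((cid, correspondence_score(members, with_keyword, annotated))
         for cid, members in clusters.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy data; p5 carries no annotation, so it is never an FP.
clusters = {
    "c1": {"p1", "p2", "p3"},
    "c2": {"p4", "p5"},
    "root": {"p1", "p2", "p3", "p4", "p5"},
}
with_keyword = {"p1", "p2", "p4"}
annotated = {"p1", "p2", "p3", "p4"}

best = best_cluster(clusters, with_keyword, annotated)
```

In this toy case `c1` scores 2/4, `c2` scores 1/3 and `root` scores 3/4, so `root` is the best match for the keyword. Unannotated proteins such as `p5` do not count against a cluster.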

12
ProtoNet - Evaluation
  • 28000 clusters

13
ProtoNet - Example

14
ProtoNet Web - in brief

15
Protein card

16
Cluster card
  • A name is added to a cluster automatically (when it is
    coherent with the cluster's annotations)

17
EVEREST Domain Classification
  • Proteins are composed of domains
  • Let's identify these domains and classify them
    instead of the whole sequence

[Figure labels: C transcriptional terminal regulatory domain; Response
regulator receiver domain; luxR family; Autoinducer binding domain]

18
EVEREST - process
19
EVEREST - evaluation
  • Testing for domains -> more FP
  • Coverage: proportion of known domain families
    reconstructed well
  • Accuracy: proportion of clusters that can be well
    explained by known families (excluding new ones)

20
PANDORA Analysing Annotations of Protein Sets
  • How does one understand a cluster of proteins?
  • Look at the annotations of those proteins
  • Say you have 100 proteins
  • Say I tell you that
  • 60 of them are annotated as membrane proteins
  • 40 of them are annotated as enzymes
  • Is that enough information?

21
PANDORA
[Figure: 60 membrane, 40 enzyme; three candidate diagrams, each marked
"?", showing the membrane and enzyme subsets]

22
PANDORA
  • Given:
    • A set of proteins (e.g. a ProtoNet cluster)
    • Keywords annotating them
  • Let each combination of keywords define a subset
    of the proteins
  • Show:
    • A graph of inclusion and intersection of the
      protein subsets defined by the keywords
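
The idea can be sketched in a few lines. This is a simplified illustration, not PANDORA's actual implementation: it uses single keywords rather than every keyword combination, and it reports pairwise relations instead of drawing a graph. The annotations are hypothetical.

```python
from itertools import combinations


def keyword_subsets(annotations):
    """Invert {protein: keywords} into {keyword: set of proteins}."""
    subsets = {}
    for protein, keywords in annotations.items():
        for kw in keywords:
            subsets.setdefault(kw, set()).add(protein)
    return subsets


def relations(subsets):
    """Label each keyword pair (a, b), a before b alphabetically, as
    'equal', 'includes', 'included in', 'intersects' or 'disjoint'."""
    result = {}
    for a, b in combinations(sorted(subsets), 2):
        sa, sb = subsets[a], subsets[b]
        if sa == sb:
            rel = "equal"
        elif sb < sa:
            rel = "includes"      # every protein with b also has a
        elif sa < sb:
            rel = "included in"   # every protein with a also has b
        elif sa & sb:
            rel = "intersects"
        else:
            rel = "disjoint"
        result[(a, b)] = rel
    return result


# Hypothetical annotations for a small protein set.
annotations = {
    "p1": {"membrane", "channel"},
    "p2": {"membrane"},
    "p3": {"membrane", "enzyme"},
    "p4": {"enzyme"},
}
rels = relations(keyword_subsets(annotations))
```

Knowing only the counts per keyword (3 membrane, 2 enzyme, 1 channel) would not reveal that every channel protein is a membrane protein, or that membrane and enzyme share exactly one protein; the relations make that explicit.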

23
PANDORA - example
  • Take the set of all 576 proteins annotated by GO
    molecular function as anion channel.
  • Analyze their InterPro signatures

24
BASIC SET
InterPro
Number of proteins
Sensitivity = TP / (TP + FN); red: FN, white: TP

25
InterPro
GABA A receptor
gamma subunit
alpha subunit
beta subunit
26
Summary
  • ProtoNet and EVEREST:
    • Automatic classification of large databases of
      protein sequences (protein domains)
    • Find new members of known families
    • Interactions between families
    • New families
  • PANDORA:
    • Analyze annotations of a set of proteins
    • Understand clusters
    • Understand experiment results (e.g. expression
      chips)
    • Curate annotations