DIAMONDS MidTerm - PowerPoint PPT Presentation

About This Presentation
Title:

DIAMONDS MidTerm

Slides: 30
Provided by: noamk

Transcript and Presenter's Notes

1
DIAMONDS Mid-Term
Michal Linial The Hebrew University of Jerusalem
Brussels 27 Sep 2006
2
Classification and knowledge integration
  • Concepts
  • Survey: what we have (D.2)
  • How to reach the experimentalists
  • Quantitative aspects of knowledge integration
  • Advances and contribution
3
In quantitative terms
Concepts: Problem definition
  • Sequence: high quality, large amounts, constant flow, consistent
  • Structure: high quality, low amounts, sparse, partially consistent
  • Function: low/medium quality, large amounts, false positives, not
    consistent
  • Functional homologues often share low identity (similarity)
  • "Function" is an ambiguous term, context dependent
  • 3D structure is a good intermediate, but the information is sparse
  • Inference limitation: <40% sequence identity
Solution (DIAMONDS)
  • Knowledge integration from different sources and
    different methodologies
  • Incorporating NON SEQUENCE based information
    (P2P interaction, gene expression, pathways)
  • Add significant scores and evaluation for any
    functional inference

4
Survey: The advances of classifications
  • Increasing quality
  • Improved functional inference
  • Detection of hidden connections
  • Overcoming the inherent noise in the data
  • A good scheme for data updating and adding new knowledge
The challenge: defining the accuracy
  • Annotation must be based on (accurate) classification
  • Integration
  • Quality and confidence
  • Validation
5
Classification for Protein Families: why?
  • For function prediction and annotation
  • For Evolutionary view - orthologs and paralogs
  • For new protein folds
  • For constructing the protein space scaffold
  • For reducing the redundancy and determining the
    real building blocks
  • For a rich source of evolutionary relationships
    (e.g., a set of proteins in yeast that participate in the cell
    cycle is immediately associated with their homologues in human,
    and vice versa)

6
Avoiding confusion: Motif / Domain / Signature / Profile / Family /
Cluster / Seed / Clan / ...
  • These terms are used interchangeably
  • They are (too) flexible

Our survey (D2.1): 29 different main protein family systems were
compared. Their pros and cons and the sources of inconsistency were
identified.
7
A reminder: Protein Sequence Classifications
  • ProtoMap (http://protomap.cs.cornell.edu, Yona et al., 2000a)
  • ProtoNet (http://www.protonet.cs.huji.ac.il, Sasson et al., 2002)
  • PIR-ALN (http://www-nrbf.georgetown.edu/pirwww/dbinfo/piraln.html,
    Srinivasarao et al., 1999)
  • SYSTERS (http://systers.molgen.mpg.de, Krause et al., 2002)
  • ProClust (http://promoter.mi.uni-koeln.de/proclust, Pipenbacher et
    al., 2002)
  • CluSTr (http://www.ebi.ac.uk/clustr, Kriventseva et al., 2001)
  • Picasso (http://www.ebi.ac.uk/picasso, Heger and Holm, 2001)
  • TribeMCL (http://www.ebi.ac.uk/research/cgg/tribe, Enright et al.,
    2002)
  • PIR SF, Panther (integrated: iProClass, MetaFam)
8
Protein Sequence Domain Classifications
DOMO, ADDA, EVEREST, InterPro, CDD, MetaFam, Pfam, Blocks,
ProSite Profile, SBASE, TigrFam, eMotif, SMART, PRINTS, ProDom
9
Integration: Data Fusion
InterPro: 11,000 entries
Based on UniProt DB
10
DIAMONDS platform associated with alternative
resource of Classification
  • Methods that are based on
  • A. Sequence (motif, proteins)
  • B. Structure
  • C. Function (annotation)
  • D. Evolution

The Goal
  • New annotation, new family, family connections (sub/super)
  • Predicting power (given a new unknown sequence)
  • Focusing on a functional map (e.g., cell cycle related)
11
Challenges to be addressed: globally and in the cell cycle
  • Many families are very easy to detect
  • BLAST search can be used to detect many protein
    families
  • A classical 80-20 situation: 80% of families
    can be identified with 20% of the effort
  • But sensitivity is low, and remote homologues are
    lost
  • Validation tools are missing
  • Solution (DIAMONDS-HUJI)
  • High quality classifications and navigation tools
  • Annotation based integration

12
DIAMONDS platform (2) associated with
alternative resource of Classification
  • Sequence based (motif, proteins) (HUJI)
  • ProtoNet (proteins)
  • EVEREST (including external sources: Pfam, SCOP,
    CATH)
  • Annotation-based scheme (including structure,
    motif, phylogenetic, function) (HUJI)
  • PANDORA

13
ProtoNet 4.5 (August 2006)
  • Includes over 1 million proteins, UniProt based
  • Combining methodologies for best performance (dev)
  • Condensed to only the 27,000 most significant clusters
  • Only 3 are singletons
  • Built-in quality tools
  • ProtoNet annotations for families (ProtoName)
  • Visualization: unique features for experimentalists
  • AN OPTION to add to the system ANY new sequence that was
    experimentally discovered
14
For the experimentalist: Visualization tools
Pfam, Prosite, SMART, PRINTS, BLOCKS, ProDom
15
... and more
Automatically added: a NAME for a cluster (when it is coherent with
the cluster's annotations)
16
Challenge 2: Quality of prediction
  • Compare with a knowledge-based DB (e.g., InterPro)
  • Take a supervised approach
  • For each family, look for the best match in the
    clustering
  • Analyze the correspondence between the cluster
    and the respective InterPro class (or any other
    expert view class)

How much of the Functional knowledge is captured
by any classification system ?
  • We define a matching score that allows
    performance comparison
  • Measures the correspondence between an expert
    class K and a cluster C
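
The slide does not spell out the matching score, so the following is only a minimal sketch under an assumption: it uses a Jaccard-style overlap, TP / (TP + FP + FN), between an expert class K and a cluster C, and picks the best-matching cluster for that class. The function names, protein identifiers, and clusters are hypothetical and chosen for illustration.

```python
def correspondence_score(expert_class, cluster):
    """Jaccard-style overlap between an expert class K and a cluster C.

    Assumption: score = |K & C| / |K | C|, i.e. TP / (TP + FP + FN).
    The slides do not define the score; this is one plausible choice.
    """
    k, c = set(expert_class), set(cluster)
    union = k | c
    if not union:
        return 0.0
    return len(k & c) / len(union)


def best_matching_cluster(expert_class, clusters):
    """Return (cluster_id, score) for the cluster that best matches K."""
    scored = (
        (cluster_id, correspondence_score(expert_class, members))
        for cluster_id, members in clusters.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: one expert class and three candidate clusters.
expert_k = {"p1", "p2", "p3", "p4"}
clusters = {
    "c1": {"p1", "p2"},
    "c2": {"p1", "p2", "p3", "p5"},
    "c3": {"p6"},
}

best_id, best_score = best_matching_cluster(expert_k, clusters)
print(best_id, round(best_score, 2))
```

Repeating this for every expert class and averaging the best scores would give one number per classification system, which is one way to make the performance comparison mentioned above.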

17
InterPro
ProtoNet matching: 83.5% of InterPro
matching: 85% of ENZYME entries
18
Integration of Annotation
Challenge 3
  • High quality annotations reaching the biologists
  • PANDORA concept
  • A web-based tool aimed at biological analysis of
    protein sets.
  • Biological information is shown through
    intersection and inclusion.
  • Goal: provide a biological roadmap of the gene
    or protein set.
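
To make "intersection and inclusion" concrete, here is a small sketch that classifies the relation between every pair of annotation sets. It is an illustration only and not PANDORA's actual implementation; the annotation names, protein identifiers, and the `relate` helper are hypothetical.

```python
from itertools import combinations


def relate(a, b):
    """Describe how protein set a relates to protein set b."""
    if a == b:
        return "equal"
    if a < b:
        return "included in"
    if a > b:
        return "includes"
    if a & b:
        return "intersects"
    return "disjoint"


# Hypothetical annotations, each mapped to the proteins that carry it.
annotations = {
    "kinase": {"p1", "p2", "p3", "p4"},
    "cell cycle": {"p2", "p3", "p5"},
    "cyclin-dependent kinase": {"p2", "p3"},
}

relations = {
    (x, y): relate(annotations[x], annotations[y])
    for x, y in combinations(annotations, 2)
}

for (x, y), relation in relations.items():
    print(f"{x!r} {relation} {y!r}")
```

Inclusion relations give a natural parent/child ordering of annotations, while intersections point to protein subsets that share more than one annotation.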

19
Annotations included
20
BASIC SET
InterPro
Number of proteins
Sensitivity = TP/(TP+FN) (red: FN, white: TP)
21
Challenge 3: Analyzing a test case
A message to the biologists
Input
  1. Provide a set of genes / proteins originating from any Omics
     technology to PANDORA (without any pre-knowledge)
Output
  1. Functional maps
  2. A ranked list of functions scored by their significance and a
     statistical P-value
  3. Detection of mistakes and mis-annotations
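
The slides do not say which statistical test produces the P-value. As a hedged illustration, the sketch below ranks annotations with a one-sided hypergeometric test, a common choice for enrichment of an annotation in a protein set. The function names, the population size, and all data are assumptions made for the example.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` proteins without replacement
    from `population` proteins, `annotated` of which carry the annotation."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def rank_annotations(protein_set, annotation_sets, population_size):
    """Return (annotation, hits, p_value) rows, most significant first."""
    rows = []
    for name, members in annotation_sets.items():
        hits = len(protein_set & members)
        p_value = enrichment_p_value(
            population_size, len(members), len(protein_set), hits
        )
        rows.append((name, hits, p_value))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical input: 4 proteins from an experiment, 20 proteins overall.
protein_set = {"p1", "p2", "p3", "p4"}
annotation_sets = {
    "membrane": {"p4"} | {f"p{i}" for i in range(10, 18)},
    "cell cycle": {"p1", "p2", "p3", "p9"},
}

ranked = rank_annotations(protein_set, annotation_sets, population_size=20)
for name, hits, p_value in ranked:
    print(f"{name}: {hits} hits, p = {p_value:.4f}")
```

With many annotations tested at once, a multiple-testing correction would normally be applied before reading the P-values as significance levels.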
22
Beyond the gene/protein set: the added value of the biologist's
comments
Input: Add your comments and any quantitative properties
Examples
  • Gene expression levels
  • Degradation time
  • Viability and toxicity levels
  • Tumorigenicity score
  • Quality of your RNA in the experiment
A platform in which COMMENTS, EXPERIMENTAL values and personal
unformulated observations are incorporated and analyzed!
23
Example: From the biologist's notebook to PANDORA knowledge
Table columns: Protein, quantitative, binary, multiple binary,
comments
24
Graphical results
25
Current addition: Addressing the one-to-many problem
  • Many genes are alternatively spliced, resulting in a one-to-many
    association
  • Many proteins are multi-domain
26
The Modular Nature of Proteins
27
False Transitivity of Local Alignment
BLAST values: pairwise similarities better than an E-score of 1e-40
If we cluster these proteins, assuming transitivity of local
alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
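
The effect can be reproduced with a toy example. In this sketch a shared domain stands in for a significant pairwise BLAST hit, and single-linkage clustering (implemented with a small union-find) treats those hits as transitive. The three proteins and their domains are hypothetical and do not represent the proteins named on the slide.

```python
from itertools import combinations


def single_linkage(proteins, significant_pairs):
    """Cluster proteins by treating every significant pair as transitive."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in significant_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())


# Hypothetical multi-domain scenario: protein_B carries both domains.
domains = {
    "protein_A": {"domain_X"},
    "protein_B": {"domain_X", "domain_Y"},
    "protein_C": {"domain_Y"},
}

# Assumption: two proteins give a significant local alignment
# exactly when they share at least one domain.
significant_pairs = [
    (a, b) for a, b in combinations(domains, 2) if domains[a] & domains[b]
]

clusters = single_linkage(domains, significant_pairs)
print(clusters)
print(domains["protein_A"] & domains["protein_C"])
```

protein_A and protein_C end up in the same cluster even though they share no domain. The multi-domain protein_B is the only link between them, which is the false transitivity the slide warns about.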
28
On the Web
29
Evaluate any reference domain resources
30
Two that became one
Examples: EVEREST detecting new connections
Pfam (OLD): Taurine catabolism dioxygenase TauD, TfdA family
Pfam (NEW): a composed entry, TauD
31
Expanding Diamonds Platform
  • We provide an automated framework for the
    identification and classification of new protein
    domains. EVEREST covers almost 3 million proteins
    in UniProt and all of PDB
  • Manual inspection of families scoring low w.r.t.
    Pfam suggested that many of those are valid
    families.
  • Incorporating EVEREST families to DIAMONDS
    platform
  • Providing exhaustive coverage of all structural
    knowledge and domains in one resource.

32
Summary and conclusions
  • Enhanced validation is essential
  • Enhanced visualization / interactivity
  • Biologist information being analyzed
  • The limit of automatic methods in classification
    (sequence, domain, structure, function)
  • ProtoNet-PANDORA and EVEREST as part of the
    Platform
Thanks to DIAMONDS and to
  • PANDORA team: Noam Kaplan
  • ProtoNet and EVEREST team: Elon Portugaly, Ori
    Sasson, Menachem Fromer
  • Cell Cycle analysis set: Roy Varshvsky
  • Evaluation, DBA and Web: Michael Dvorkin, Alex
    Savanok