Protein World - PowerPoint PPT Presentation

About This Presentation
Title:

Protein World

Description:

At this moment more than 80 genomes have been sequenced and published, of all ... Domain-based clusterings: Prosite, Pfam, ProDom, Prints, Domo, Blocks ... – PowerPoint PPT presentation

Slides: 21
Provided by: Hul78
Tags: domo | pc | protein | uk | world


Transcript and Presenter's Notes

Title: Protein World


1
Protein World
  • SARA
  • 12-12-2002 Amsterdam
  • Tim Hulsen

2
Genome sequencing
  • Since 1995: sequencing of complete genomes
    (DNA), i.e. determining the order of A/C/G/T
  • ACGTCATCGTAGCTAGCTAGTCGTACGTATG
  • TGCAGTAGCATCGATCGATCAGCATGCATAC
  • At this moment more than 80 genomes have been
    sequenced and published, from all kinds of
    organisms:
  • Animals
  • Plants
  • Fungi
  • Bacteria

3
Genomes → Proteins
  • Transcription and translation of specific
    regions of the genome leads to proteins,
    consisting of twenty types of amino acids
  • ATG ACG CTG AGC TGC GGA CGT TGA → MTLSCGR
    (ATG encodes methionine, M; TGA is a stop codon)
  • Proteins are responsible for all kinds of life
    processes
  • All the proteins that can be produced in an
    organism together are called the proteome
  • Sequence comparisons make the classification of
    proteins possible
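
The translation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that starts at the first base; the names `CODON_TABLE` and `translate` are made up for this example and do not come from the presentation. A literal translation keeps the start methionine (M) encoded by ATG and stops at the TGA stop codon.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    dna = dna.replace(" ", "").upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # '*' marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATG ACG CTG AGC TGC GGA CGT TGA"))  # MTLSCGR
```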

4
Protein families
  • E.g. the GPCR family
  • Sequence comparison helps in predicting the
    function of new proteins

5
Determining protein functions
  • The function of 40-50% of the new proteins is
    unknown
  • Understanding of protein functions and
    relationships is important for:
  • Study of fundamental biological processes
  • Drug design
  • Genetic engineering

6
Sequence comparison
  • Smith-Waterman dynamic programming algorithm
    (1981) calculates similarity/distance between
    two sequences
  • Query:   ---PLIT-LETRESV-
  • Subject: NEQPKVTMLETRQTAD
  • (bold = similar)
  • Results in an SW-score, a measure of how
    similar the two sequences are to each other
  • Disadvantage: the score depends on sequence
    length
  • After the alignments, the proteins are
    clustered (divided into families) according to
    their similarity
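
As a rough illustration of the SW-score, here is a minimal Smith-Waterman sketch in Python. It makes simplifying assumptions that are not from the presentation: a flat match/mismatch score instead of an amino acid substitution matrix, and a linear gap penalty. The function name `sw_score` and its default parameters are hypothetical.

```python
def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(sw_score("LETR", "LETR"))          # 8
print(sw_score("LETRLETR", "LETRLETR"))  # 16
```

The two calls show the length dependence noted on the slide: both pairs are identical sequences, yet the longer pair gets twice the score.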

7
Existing databases
  • Domain-based clusterings: Prosite, Pfam, ProDom,
    Prints, Domo, Blocks
  • Protein-based clusterings: ProtoMap, COGs,
    Systers, PIR, ClusTr
  • Structural classifications: SCOP, CATH, FSSP
  • Why should there be another database?

8
Another method
  • Enhanced Smith-Waterman algorithm: Monte Carlo
    evaluation (Lipman et al., 1984)
  • How likely is it that two sequences are
    similar but not related?
  • One of the two sequences is randomized and the
    score is recalculated (200 times). Randomization leads to
    sequences with the same length and the same
    composition, but different order
  • Method leads to calculation of the Z-value
  • $Z(A,B) = \dfrac{S(A,B) - \mu}{\sigma}$
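
The Monte Carlo evaluation can be sketched as follows. This is an illustration only, not the implementation used in the project: it assumes µ and σ are the mean and standard deviation of the scores obtained against the shuffled sequences, and it reuses a simplified scorer (flat match/mismatch values, linear gap penalty). The names `sw_score` and `z_value`, the fixed seed and the zero-variance fallback are all hypothetical choices.

```python
import random
import statistics


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Simplified Smith-Waterman local alignment score (assumed scoring scheme)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_value(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Estimate Z(A,B) by scoring A against shuffled versions of B."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same length and composition, different order
        scores.append(sw_score(a, "".join(letters)))
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:  # every shuffle scored the same; Z is undefined, return 0.0
        return 0.0
    return (observed - mu) / sigma


seq = "MKTAYIAKQRQISFVKSHFSRQ"
print(z_value(seq, seq))  # a large positive Z: far above the shuffled background
```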

9
Advantages
  • The obtained Z-value is a more reliable measure
    of sequence similarity than the SW-score
  • The SW-score depends on length; the Z-value
    does not
  • Amino acid bias does not affect the Z-value
  • Independent of the database size
  • Easier updating of the database, without a total
    recalculation

10
Disadvantage
  • LOTS of calculation time needed, especially when
    all proteins in all proteomes are compared to
    each other (all-against-all)!
  • → SARA

11
SARA calculation
  • Proteomes of 82 organisms compared
    all-against-all using the Monte Carlo
    algorithm: more than 400,000 proteins!
  • 21,600 CPU days (about 520,000 CPU hours)
  • Equivalent to 21,600 PCs running in parallel
    for 24 hours, or 1 PC running for about 60 years
  • Using the supercomputer TERAS (1024-CPU SGI
    Origin 3800) at SARA: less than two months!
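
The figures on this slide can be checked with a little arithmetic. The sketch below uses only numbers from the slide; the ideal wall-clock estimate assumes perfect scaling over all 1024 CPUs, and the pair count treats 400,000 as a lower bound and counts each pair of proteins once. Both are assumptions made for illustration.

```python
cpu_days = 21_600
n_cpus = 1024          # TERAS, per the slide
n_proteins = 400_000   # "more than 400,000", so a lower bound

cpu_hours = cpu_days * 24                 # 518,400, rounded to 520,000 on the slide
single_pc_years = cpu_days / 365          # about 59.2, rounded to 60 on the slide
ideal_wall_days = cpu_days / n_cpus       # about 21.1 with perfect scaling (assumed)
unordered_pairs = n_proteins * (n_proteins - 1) // 2  # each pair counted once

print(cpu_hours)                  # 518400
print(round(single_pc_years, 1))  # 59.2
print(ideal_wall_days)            # 21.09375
print(unordered_pairs)            # 79999800000
```

With roughly 80 billion protein pairs and 200 randomizations per pair, the total workload explains why a supercomputer was needed.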

12
Parties involved
  • Gene-IT (Paris, France)
  • SARA (Amsterdam, the Netherlands)
  • CMBI (Nijmegen, the Netherlands)
  • Organon (Oss, the Netherlands)
  • EBI (Hinxton, UK)

13
Supporting parties
  • Financed by NCF, foundation in support of
    supercomputing
  • Under the auspices of BioASP, the new Dutch
    knowledge and service center for Bioinformatics

14
Results available through BioASP
  • http://www.bioasp.nl
  • Log in and click on the links "Research" and
    "Protein World"

15
Results available through BioASP
  • Organism selection screen

16
Results available through BioASP
  • Results screen

17
Results available through BioASP
  • Alignment screen

18
Conclusions
  • Currently the most comprehensive and most
    accurate data-set of protein comparisons
  • A start for a maintainable and unique database of
    all proteins currently known
  • A rich data-source for clustering, data-mining
    and orthology determination

19
Orthology determination
  • Orthologs: genes/proteins in different species
    that derive from a common ancestor
  • Orthologs often have the same function
  • Interesting! Information from other species could
    help in annotating a protein

20
Thank you for your attention
  • Any questions?