Provided by: chEmbnetO
Learn more at: http://www.ch.embnet.org
1
An introduction to biological databases
2
Database or databank ?
  • Initially, subtle distinctions were made
    between databases and databanks (in the UK, but
    not in the USA), such as:
  • "database": the management programs used to
    administer "databanks"
  • From now on, the term "database" (db) is
    usually preferred

3
What is a database ?
  • A collection of...
  • structured
  • searchable (index) -> table of contents
  • updated periodically (release) -> new edition
  • cross-referenced (hyperlinks) -> links with
    other db
  • data
  • Includes also associated tools (software)
    necessary for db access, db updating, db
    information insertion, db information deletion.

4
Databases: a simple example
"Introduction To Database" Teacher Database
(ITDTdb) (flat file, 3 entries)
  • Accession number: 1
  • First Name: Amos
  • Last Name: Bairoch
  • Course: DEA oct-nov-dec 2000
  • http://expasy4.expasy.ch/people/amos.html
  • //
  • Accession number: 2
  • First Name: Laurent
  • Last Name: Falquet
  • Course: EMBnet sept 2000; DEA oct-nov-dec 2000
  • //
  • Accession number: 3
  • First Name: Marie-Claude
  • Last Name: Blatter Garin
  • Course: EMBnet sept 2000; DEA oct-nov-dec 2000
  • http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
  • //
  • Easy to manage: all the entries are visible at
    the same time!
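
A flat file of this kind can be read with a few lines of code. The sketch below is only an illustration, assuming that each field is written as "name: value" and that "//" on its own line ends an entry; the function name `parse_flat_file` and the shortened sample are hypothetical.

```python
def parse_flat_file(text):
    """Split a flat file into entries; each entry is a dict of field -> value."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":                 # "//" terminates the current entry
            entries.append(current)
            current = {}
        elif ": " in line:               # "field: value" line
            field, value = line.split(": ", 1)
            current[field] = value
    return entries


# Shortened sample in the assumed "field: value" layout
SAMPLE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
//
"""

entries = parse_flat_file(SAMPLE)
print([entry["Last Name"] for entry in entries])
```

Every entry has to be read to answer any question, which is the price of the simple format.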

5
Databases: a simple example (cont.)
Relational database ("table file")
Easier to manage: the output can be chosen
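
The same three entries can be held in a relational table, where a query selects only the columns and rows of interest. This is a minimal sketch using Python's built-in sqlite3 module; the table and column names are assumptions made for the illustration, not part of the original example.

```python
import sqlite3

# In-memory database; the table layout is hypothetical
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE teacher ("
    "accession INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO teacher VALUES (?, ?, ?)",
    [
        (1, "Amos", "Bairoch"),
        (2, "Laurent", "Falquet"),
        (3, "Marie-Claude", "Blatter Garin"),
    ],
)

# "Choice of the output": ask only for the last names of entries 2 and 3
rows = conn.execute(
    "SELECT last_name FROM teacher WHERE accession > 1 ORDER BY accession"
).fetchall()
print(rows)
```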
6
Why biological databases ?
  • Explosive growth in biological data
  • Data (sequences, 3D structures, 2D gel analysis,
    MS analysis, ...) are no longer published in a
    conventional manner, but directly submitted to
    databases
  • Essential tools for biological research, as
    classical publications used to be!

7
Biological databases
  • Some databases in the field of molecular
    biology:
  • AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
  • ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
  • BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
  • BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
  • CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
  • ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
  • CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
  • Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
  • ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
  • ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
  • GCRDB, GDB, GENATLAS, GenBank, GeneCards,
  • Genline, GenLink, GENOTK, GenProtEC, GIFTS,
  • GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
  • HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
  • HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
  • HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
  • KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, ...

8
Some statistics
  • More than 1000 different databases
  • Generally accessible through the web
  • (useful link: www.expasy.ch/alinks.html)
  • Variable size: <100 Kb to >10 Gb
  • DNA: > 10 Gb
  • Protein: 1 Gb
  • 3D structure: 5 Gb
  • Other: smaller
  • Update frequency: daily to annually

9
Categories of databases for Life Sciences
  • Sequences (DNA, protein) -> Primary db
  • Genomics
  • Protein domain/family -> Secondary db
  • Mutation/polymorphism
  • Proteomics (2D gel, MS)
  • 3D structure -> Structure db
  • Metabolism
  • Bibliography
  • Others

10
Distribution of sequence databases
  • Books, articles: 1968 -> 1985
  • Computer tapes: 1982 -> 1992
  • Floppy disks: 1984 -> 1990
  • CD-ROM: 1989 -> ?
  • FTP: 1989 -> ?
  • On-line services: 1982 -> 1994
  • WWW: 1993 -> ?
  • DVD: 2001 -> ?

11
Sequence databases: some "technical" definitions
  • Data storage management:
  • flat file: text file
  • relational (e.g., Oracle)
  • object oriented (rare in biological field)
  • Format (flat file):
  • fasta
  • GCG
  • NBRF/PIR
  • MSF, ...
  • standardized format?
  • Federated databases: different autonomous,
    redundant, heterogeneous db linked together by
    links/hyperlinks.

12
Ideal minimal content of a "sequence" db
  • Sequences !!
  • Accession number (AC)
  • References
  • Taxonomic data
  • ANNOTATION/CURATION
  • Keywords
  • Cross-references
  • Documentation

13
Sequence database example
SWISS-PROT flat file
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
...
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
CC   -!- DATABASE: NAME=R&D Systems' cytokine source book;
CC       WWW="http://www.rndsystems.com/cyt_cat/epo.html".
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
...
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   PROPEP      190    193       MAY BE REMOVED IN PROCESSED PROTEIN.
FT   DISULFID     34    188
...
taxonomy
reference
annotations
Cross-references
Keywords
14
Sequence database example (cont.)
FT   DISULFID     34    188
FT   DISULFID     56     60
FT   CARBOHYD     51     51       N-LINKED (GLCNAC...).
FT   CARBOHYD     65     65       N-LINKED (GLCNAC...).
FT   CARBOHYD    110    110       N-LINKED (GLCNAC...).
FT   CARBOHYD    153    153
FT   CONFLICT     40     40       E -> Q (IN CAA26095).
FT   CONFLICT     85     85       Q -> QQ (IN REF. 5).
FT   CONFLICT    140    140       G -> R (IN CAA26095).
Chromosomal location: 7q22
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//
sequence
15
Sequence database example
  • a SWISS-PROT entry, in fasta format:
  • >sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR -
    Homo sapiens (Human).
  • MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
  • NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
  • VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
  • AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
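
The fasta format is simple enough to parse by hand: a header line starting with ">" followed by sequence lines. The sketch below is an illustration only; `read_fasta` is a hypothetical helper, and the "sp|accession|name" header layout is assumed from the entry shown above.

```python
def read_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):         # a new record begins
            header = line[1:]
            records[header] = []
        elif header is not None:         # sequence line of the current record
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


FASTA = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""

records = read_fasta(FASTA)
header, sequence = next(iter(records.items()))
accession = header.split("|")[1]         # second "|"-separated field
print(accession, len(sequence))
```

The sequence length of 193 agrees with the "193 AA" stated in the flat-file entry.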

16
Databases 1: nucleotide sequence
  • The main DNA sequence db are:
  • EMBL (Europe) / GenBank (USA) / DDBJ (Japan)
  • There are also specialized databases for the
    different types of RNAs (e.g. tRNA, rRNA, tmRNA,
    uRNA, etc.)
  • 3D structure (DNA and RNA)
  • Others: Aberrant splicing db; Eukaryotic promoter
    db (EPD); RNA editing sites; Multimedia Telomere
    Resource

17
EMBL/GenBank/DDBJ
  • These 3 db contain essentially the same
    information within 2-3 days (with a few
    differences in format and syntax)
  • Serve as archives containing all sequences
    (single genes, ESTs, complete genomes, etc.)
    derived from:
  • Genome projects and sequencing centers
  • Individual scientists
  • Patent offices (e.g. European Patent Office, EPO)
  • Non-confidential data are exchanged daily
  • Currently 8.3 x 10^6 sequences, over 9.7 x 10^9 bp
  • Sequences from > 50000 different species

18
EMBL/GenBank/DDBJ
  • Heterogeneous sequence length: genomes, variants,
    fragments
  • Sequence sizes:
  • max 300000 bp/entry (! genomic sequences,
    overlapping)
  • min 10 bp/entry
  • Archive: nothing goes out -> highly redundant!
  • Full of errors: in sequences, in annotations, in
    CDS attribution
  • No consistency of annotations: most annotations
    are done by the submitters, hence heterogeneity
    in the quality, completeness and updating of the
    information

19
EMBL/GenBank/DDBJ
  • Unexpected information you can find in these db:
  • FT source 1..124
  • FT /db_xref="taxon:4097"
  • FT /organelle="plastid:chloroplast"
  • FT /organism="Nicotiana tabacum"
  • FT /isolate="Cuban cahibo cigar, gift from
    President Fidel Castro"
  • Or:
  • FT source 1..17084
  • FT /chromosome="complete mitochondrial genome"
  • FT /db_xref="taxon:9267"
  • FT /organelle="mitochondrion"
  • FT /organism="Didelphis virginiana"
  • FT /dev_stage="adult"
  • FT /isolate="fresh road killed individual"
  • FT /tissue_type="liver"

20
EMBL entry example
  • ID HSERPG standard; DNA; HUM; 3398 BP.
  • XX
  • AC X02158;
  • XX
  • SV X02158.1
  • XX
  • DT 13-JUN-1985 (Rel. 06, Created)
  • DT 22-JUN-1993 (Rel. 36, Last updated, Version 2)
  • XX
  • DE Human gene for erythropoietin
  • XX
  • KW erythropoietin; glycoprotein hormone; hormone;
    signal peptide.
  • XX
  • OS Homo sapiens (human)
  • OC Eukaryota; Metazoa; Chordata; Craniata;
    Vertebrata; Euteleostomi; Mammalia;
  • OC Eutheria; Primates; Catarrhini; Hominidae;
    Homo.
  • XX
  • RN [1]
  • RP 1-3398

keyword
taxonomy
references
Cross-references
21
EMBL entry (cont.)
  • CC Data kindly reviewed (24-FEB-1986) by K. Jacobs
  • FH Key Location/Qualifiers
  • FH
  • FT source 1..3398
  • FT /db_xref="taxon:9606"
  • FT /organism="Homo sapiens"
  • FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
  • FT CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
  • FT /db_xref="SWISS-PROT:P01588"
  • FT /product="erythropoietin"
  • FT /protein_id="CAA26095.1"
  • FT /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
  • FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
  • FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
  • FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
  • FT mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
  • FT /product="erythropoietin"
  • FT sig_peptide join(615..627,1194..1261)
  • FT exon 397..627
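
The join(...) locations in the feature table list the exon segments that together make up a feature. As a rough check, the sketch below adds up the segment lengths of the CDS given above; `location_length` is a hypothetical helper that only handles plain "start..end" ranges (no complement() operator and no "<" or ">" partial markers).

```python
import re


def location_length(location):
    """Total number of bases covered by 'a..b' or 'join(a..b,c..d,...)'."""
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    # Positions are 1-based and inclusive, hence the + 1
    return sum(int(end) - int(start) + 1 for start, end in spans)


cds = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
n_bases = location_length(cds)
n_codons = n_bases // 3
print(n_bases, n_codons)
```

The five segments add up to 582 bases, i.e. 194 codons: the 193 residues of the translation plus the stop codon.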

annotation
sequence
22
GenBank entry example
  • LOCUS HSERPG 3398 bp DNA PRI 22-JUN-1993
  • DEFINITION Human gene for erythropoietin.
  • ACCESSION X02158
  • VERSION X02158.1 GI:31224
  • KEYWORDS erythropoietin; glycoprotein hormone;
    hormone; signal peptide.
  • SOURCE human.
  • ORGANISM Homo sapiens
  • Eukaryota; Metazoa; Chordata; Vertebrata;
    Mammalia; Eutheria;
  • Primates; Catarrhini; Hominidae; Homo.
  • REFERENCE 1 (bases 1 to 3398)
  • AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R.,
    Neill,S.D., Kaufman,R.J.,
  • Mufson,A., Seehra,J., Jones,S.S., Hewick,R.,
    Fritsch,E.F.,
  • Kawakita,M., Shimizu,T. and Miyake,T.
  • TITLE Isolation and characterization of genomic
    and cDNA clones of human
  • erythropoietin
  • JOURNAL Nature 313 (6005), 806-810 (1985)
  • MEDLINE 85137899
  • COMMENT Data kindly reviewed (24-FEB-1986) by
    K. Jacobs.
  • FEATURES Location/Qualifiers

23
GenBank entry (cont.)

  • TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
  • intron 628..1193
  • /number=1
  • exon 1194..1339
  • /number=2
  • mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760)
  • /product="erythropoietin"
  • intron 1340..1595
  • /number=2
  • exon 1596..1682
  • /number=3
  • intron 1683..2293
  • /number=3
  • exon 2294..2473
  • /number=4
  • intron 2474..2607
  • /number=4
  • exon 2608..3327
  • /note="3' untranslated region"

24
DDBJ entry example
  • LOCUS HSERPG 3398 bp DNA HUM 22-JUN-1993
  • DEFINITION Human gene for erythropoietin.
  • ACCESSION X02158
  • VERSION X02158.1
  • KEYWORDS erythropoietin; glycoprotein hormone;
    hormone; signal peptide.
  • SOURCE human.
  • ORGANISM Homo sapiens
  • Eukaryota; Metazoa; Chordata; Craniata;
    Vertebrata; Mammalia;
  • Eutheria; Primates; Catarrhini; Hominidae; Homo.
  • REFERENCE 1 (bases 1 to 3398)
  • AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R.,
    Neill,S.D., Kaufman,R.J.,
  • Mufson,A., Seehra,J., Jones,S.S., Hewick,R.,
    Fritsch,E.F.,
  • Kawakita,M., Shimizu,T. and Miyake,T.
  • TITLE Isolation and characterization of genomic
    and cDNA clones of human
  • erythropoietin
  • JOURNAL Nature 313, 806-810(1985)
  • MEDLINE 85137899
  • COMMENT Data kindly reviewed (24-FEB-1986) by
    K. Jacobs
  • FEATURES Location/Qualifiers

25
DDBJ (cont.)
  • mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
  • /product="erythropoietin"
  • sig_peptide join(615..627,1194..1261)
  • exon 397..627
  • /number=1
  • intron 628..1193
  • /number=1
  • exon 1194..1339
  • /number=2
  • intron 1340..1595
  • /number=2
  • exon 1596..1682
  • /number=3
  • intron 1683..2293
  • /number=3
  • exon 2294..2473
  • /number=4
  • intron 2474..2607
  • /number=4

26
The tremendous increase in nucleotide sequences
  • EMBL data: first increase in data due to the
    development of PCR

1980: 80 genes fully sequenced!
27
EMBL divisions
  • EMBL has been divided into subdatabases to allow
    easier data management and searches
  • fun, hum, inv, mam, org, phg, pln, pro, rod, syn,
    unc, vrl, vrt
  • est, gss, htg, sts, patent

28
RefSeq: a SWISS-PROT clone?
  • The NCBI Reference Sequence project (RefSeq) will
    provide reference sequence standards for the
    naturally occurring molecules of the central
    dogma, from chromosomes to mRNAs to proteins.
    RefSeq standards provide a foundation for the
    functional annotation of the human genome. They
    provide a stable reference point for mutation
    analysis, gene expression studies, and
    polymorphism discovery.
  • Molecule | Accession format | Genome
  • Complete genome | NC_ | Archaea, Bacteria,
    Organelle, Virus, Viroid
  • Complete chromosome | NC_ | Eukaryote
  • Complete sequence | NC_ | Plasmid
  • Genomic contig | NT_ | Homo sapiens
  • mRNA | NM_ | Homo sapiens, Mus musculus,
    Rattus norvegicus
  • Protein | NP_ | All of the above
29
RefSeq: a SWISS-PROT clone?
  • RefSeq records are created via a process
    consisting of
  • identifying sequences that represent distinct
    genes
  • establishing the correct gene name-to-accession
    number association
  • identifying the full extent of available sequence
    data
  • creating a new RefSeq record with a status of
  • PREDICTED
  • PROVISIONAL
  • REVIEWED
  • Provisional RefSeq records are reviewed by a
    biologist who confirms the initial
    name-to-sequence association, adds information
    including a summary of gene function, and, more
    importantly, corrects, re-annotates, or extends
    the sequence data using data available in other
    GenBank records.

30
Databases 2: genomics
  • Contain information on genes, gene location
    (mapping), gene nomenclature and links to
    sequence databases
  • Exist for most organisms important for life
    science research
  • Examples: MIM, GDB (human), MGD (mouse), FlyBase
    (Drosophila), SGD (yeast), MaizeDB (maize),
    SubtiList (B. subtilis), etc.
  • Format: generally relational (Oracle, Sybase or
    AceDb).

31
MIM
  • OMIM: Online Mendelian Inheritance in Man
  • a catalog of human genes and genetic disorders
  • contains a summary of literature, pictures, and
    reference information. It also contains numerous
    links to articles and sequence information.

32
MIM example
  • 133170 ERYTHROPOIETIN EPO
  • Alternative titles; symbols
  • EP
  • TABLE OF CONTENTS
  • TEXT
  • REFERENCES
  • SEE ALSO
  • CONTRIBUTORS
  • CREATION DATE
  • EDIT HISTORY
  • Database Links
  • Gene Map Locus: 7q21

33
Ensembl
  • Contains all the human genome DNA sequences
    currently available in the public domain.
  • Automated annotation: using different software
    tools, features are identified in the DNA
    sequences:
  • Genes (known or predicted)
  • Single nucleotide polymorphisms (SNPs)
  • Repeats
  • Homologies
  • Created and maintained by the EBI and the Sanger
    Centre (UK)
  • www.ensembl.org

34
Database 3: protein sequence
  • SWISS-PROT: created in 1986 (A. Bairoch)
  • TrEMBL: created in 1996; complement to
    SWISS-PROT; derived from automated EMBL CDS
    translations ("proteomic" version of EMBL)
  • GenPept: derived from automated GenBank CDS
    translations and journal scans ("proteomic"
    version of GenBank)
  • PIR: Protein Information Resource
  • MIPS: Martinsried Institute for Protein Sequences
  • PIR + PATCHX (supplement of unverified protein
    sequences from external sources)

35
Database 3: protein sequence
  • NRL-3D: produced by PIR from PDB (3D structure)
    sequences
  • Many specialized protein databases for specific
    families or groups of proteins.
  • Examples: YPD (yeast proteins), AMSDb
    (antibacterial peptides), GPCRDB (7 TM
    receptors), IMGT (immune system), etc.

36
SWISS-PROT
  • Collaboration between the SIB (CH) and EMBL/EBI
    (UK)
  • Annotated (manually), non-redundant,
    cross-referenced, documented protein sequence
    database.
  • 88 000 sequences from more than 6800 different
    species; 70 000 references (publications);
    550 000 cross-references (databases); 200 Mb of
    annotations.
  • Weekly releases; available from about 50 servers
    across the world, the main source being ExPASy

37
SWISS-PROT example
Never changed
38
SWISS-PROT (cont.)
39
SWISS-PROT (cont.)
40
TrEMBL (Translation of EMBL)
  • Computer-annotated supplement to SWISS-PROT, as
    it is impossible to cope with the flow of data
  • Well-structured SWISS-PROT-like resource
  • Derived from automated EMBL CDS translation
    (maintained at the EBI (UK))
  • TrEMBL is automatically generated and annotated
    using software tools (not comparable with
    SWISS-PROT in terms of quality)
  • TrEMBL contains everything that is not yet in
    SWISS-PROT
  • Yerk!! But there is no choice and these software
    tools are becoming quite good!

41
The simplified story of a Sprot entry
cDNAs, genomes, ...
  • "Automatic"
  • Redundancy check (merge)
  • InterPro (family attribution)
  • Annotation

EMBLnew -> EMBL
CDS
TrEMBLnew -> TrEMBL
  • "Manual"
  • Redundancy (merge, conflicts)
  • Annotation
  • Sprot tools (macros)
  • Sprot documentation
  • Medline
  • Databases (MIM, MGD, ...)
  • Brain storming

SWISS-PROT
Once in Sprot, the entry is no longer in TrEMBL,
but is still in EMBL (archive)
42
SWISS-PROT introduces a new arithmetical concept !
  • How many sequences in SWISS-PROT + TrEMBL?
  • 88 000 + 300 000 = about 240 000
  • SWISS-PROT and TrEMBL (SPTR)
  • a minimum of redundancy
43
TrEMBL divisions
  • TrEMBL = SPTrEMBL + REMTrEMBL
  • SPTrEMBL: TrEMBL entries that will eventually be
    integrated into SWISS-PROT, but that have not yet
    been manually annotated
  • REMTrEMBL: sequences that are not destined to be
    included in SWISS-PROT:
  • Immunoglobulins and T-cell receptors
  • Synthetic sequences
  • Patented sequences
  • Small fragments (< 8 aa)
  • CDS not coding for real proteins
  • TrEMBLnew: updates to the latest release of
    TrEMBL

44
TrEMBL divisions
  • Subdivisions:
  • Archaea: arc
  • Fungus: fun
  • Human: hum
  • Invertebrate: inv
  • Mammals: mam
  • Major Hist. Comp.: mhc
  • Organelles: org
  • Phage: phg
  • Plant: pln
  • Prokaryote: pro
  • Rodent: rod
  • Uncommented: unc
  • Viral: vrl
  • Vertebrate: vrt

45
TrEMBL example
46
GenPept (translation of GenBank)
  • GenPept is a protein database translated from the
    last release of GenBank (+ journal scans)
  • The current release has 484496 entries
  • In contrast to TrEMBL, keeps all protein
    sequences, including small fragments (< 8 aa)
    and immunoglobulins.
  • Redundancy: 20 entries for human EPO

47
GenPept example
  • LOCUS L33410_1 HUMMLCMPL
  • DEFINITION Human c-mpl ligand (ML) mRNA, complete cds
  • erythropoietin homology domain bp 66..522.
  • DATE 07-JAN-1995
  • ACCESSION L33410
  • NID
  • ORGANISM Homo sapiens
  • Eukaryota; Metazoa; Chordata; Craniata;
    Vertebrata; Euteleostomi;
  • Mammalia; Eutheria; Primates; Catarrhini;
    Hominidae; Homo.
  • COMMENT CDS 216..1277
  • /gene="ML"
  • /product="c-mpl ligand"
  • /protein_id="AAA59857.1"
  • /db_xref="GI:506827"
  • WEIGHT 37823
  • LENGTH 353
  • ORIGIN
  • 1 MELTELLLVV MLLLTARLTL SSPAPPACDL RVLSKLLRDS
    HVLHSRLSQC PEVHPLPTPV
  • 61 LLPAVDFSLG EWKTQMEETK AQDILGAVTL LLEGVMAARG
    QLGPTCLSSL LGQLSGQVRL

48
PIR
  • Protein Information Resource, created in 1984
  • Successor of the National Biomedical Research
    Foundation (NBRF) protein sequence database
    developed in 1965 by M. O. Dayhoff: "Atlas of
    Protein Sequence and Structure"
  • Maintained by MIPS (Germany) and JIPID (Japan)
  • Provides some cross-referencing to
    EMBL/GenBank/DDBJ and PDB, GDB, FlyBase, OMIM,
    SGD, and MGD
  • In August 2000: 178050 entries.
  • Redundancy: 3 entries for human EPO

49
PIR example
  • >P1;ZUHU
  • erythropoietin precursor - human
  • C;Species: Homo sapiens (man)
  • C;Date: 27-Nov-1985 #sequence_revision
    27-Nov-1985 #text_change 22-Jun-1999
  • C;Accession: A01855; A24744; A25384; A22210;
    S56178
  • R;Jacobs, K.; Shoemaker, C.; Rudersdorf, R.;
    Neill, S.D.; Kaufman, R.J.; Mufson, A.; Seehra,
    J.; Jones, S.S.; Hewick, R.; Fritsch, E.F.;
    Kawakita, M.; Shimizu, T.; Miyake, T.
  • Nature 313, 806-810, 1985
  • A;Title: Isolation and characterization of
    genomic and cDNA clones of human erythropoietin.
  • A;Reference number: A01855; MUID:85137899
  • A;Accession: A01855
  • A;Molecule type: mRNA; DNA
  • A;Residues: 1-193
  • A;Cross-references: GB:X02157; GB:X02158
  • R;Lin, F.K.; Suggs, S.; Lin, C.H.; Browne, J.K.;
    Smalling, R.; Egrie, J.C.; Chen, K.K.; Fox, G.M.;
    Martin, F.; Stabinsky, Z.; Badrawi, S.M.; Lai,
    P.H.; Goldwasser, E.
  • Proc. Natl. Acad. Sci. U.S.A. 82, 7580-7584, 1985
  • A;Title: Cloning and expression of the human
    erythropoietin gene.
  • A;Reference number: A24744; MUID:86067948
  • A;Accession: A24744
  • A;Molecule type: DNA

50
PIR (cont.)
  • A;Accession: A22210
  • A;Molecule type: protein
  • A;Residues: 28-29,'X',31-33,'L',35-50,'X',52-53,'D',55,'G',57
  • R;Matsumoto, S.; Ikura, K.; Ueda, M.; Sasaki, R.
  • Plant Mol. Biol. 27, 1163-1172, 1995
  • A;Title: Characterization of a human glycoprotein
    (erythropoietin) produced in cultured tobacco
    cells.
  • A;Reference number: S56178; MUID:95284365
  • A;Accession: S56178
  • A;Molecule type: protein
  • A;Residues: 28-33,'X',35-37
  • C;Comment: Erythropoietin is produced by kidney
    or liver of adult mammals and by liver of fetal
    or neonatal mammals.
  • C;Genetics:
  • A;Gene: GDB:EPO
  • A;Cross-references: GDB:119110; OMIM:133170
  • A;Map position: 7q21.3-7q22.1
  • A;Introns: 5/1; 53/3; 82/3; 142/3
  • C;Function:
  • A;Description: the primary inducer of erythrocyte
    formation
  • C;Superfamily: erythropoietin
  • C;Keywords: erythropoiesis; glycoprotein;
    hormone; kidney; liver

51
Composite protein sequence db
Different composite db use different primary
sources and different redundancy criteria in
their amalgamation procedures
Redundancy priority criteria
Also called SWall at EBI SWIR SPTrEMBL
Wormpep
52
Composite protein family
  • The proteins/genes are classified by
    superfamily/family according to Blast/Fasta
    (homology) results
  • General:
  • ProtFam: PIR
  • ProtoMap: SWISS-PROT
  • SYSTERS: SWISS-PROT and PIR (non redundant)
  • ProClass: PIR and PROSITE
  • Species specific:
  • HOVERGEN: vertebrates
  • HOBACGEN: bacteria
  • COG: complete organism genome

53
ProtoMap example
54
ProtoMap (cont.)
55
Database 4: protein domain/family
  • Contains biologically significant "patterns /
    profiles / HMMs" formulated in such a way that,
    with appropriate computational tools, they can
    rapidly and reliably determine to which known
    family of proteins (if any) a new sequence
    belongs
  • -> tools to identify the function of
    uncharacterized proteins translated from genomic
    or cDNA sequences ("functional diagnostic")

56
Protein domain/family
  • Most proteins have a "modular" structure
  • Estimate: 3 domains / protein
  • Domains (conserved sequences or structures) are
    identified by multiple sequence alignments
  • Domains can be defined by different methods:
  • Pattern (regular expression): used for very
    conserved domains
  • Profiles (weighted matrices): two-dimensional
    tables of position-specific match, gap and
    insertion scores, derived from aligned sequence
    families; used for less conserved domains
  • Hidden Markov Models (HMM): probabilistic models;
    another method to generate profiles.

57
Some statistics
  • 15 most common protein domains for H. sapiens
    (incomplete):
  • Immunoglobulin and major histocompatibility
    complex domain
  • Eukaryotic protein kinase
  • Zinc finger, C2H2 type
  • Rhodopsin-like GPCR superfamily
  • Src homology 3 (SH3) domain
  • RNA-binding region RNP-1 (RNA recognition motif)
  • Fibronectin type III domain
  • Pleckstrin homology (PH) domain
  • Homeobox domain
  • Major histocompatibility complex protein, Class
    I
  • EF-hand family
  • EGF-like domain
  • RING finger
  • Cadherin domain
  • PDZ domain (also known as DHR or GLGF)
  • Serine proteases, trypsin family
  • http://www.ebi.ac.uk/proteome/HUMAN/interpro/top15d.html

58
Protein domain/family db
  • Secondary databases are the fruit of analyses of
    the sequences found in the primary db
  • Either manually curated (e.g. PROSITE, Pfam,
    etc.) or automatically generated (e.g. ProDom,
    DOMO)
  • Some depend on the method used to detect if a
    protein belongs to a particular domain/family
    (patterns, profiles, HMM)

59
Protein domain/family db
60
Prosite
  • Created in 1988 (SIB)
  • Contains fully annotated functional domains,
    based on two methods: patterns and profiles
  • Entries are deposited in PROSITE in two distinct
    files:
  • Patterns/profiles with the lists of all matches in
    the parent version of SWISS-PROT
  • Documentation
  • Aug 2000: contains 1064 documentation entries
    that describe 1424 different patterns, rules and
    profiles/matrices.

61
Prosite (pattern) example
ID   EPO_TPO; PATTERN.
AC   PS00817;
DT   OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Erythropoietin / thrombopoeitin signature.
PA   P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=1;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,disulfide; /SITE=11,disulfide;
DR   P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR   P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR   P07321, EPO_MOUSE , T; P49157, EPO_PIG   , T; P29676, EPO_RAT   , T;
DR   P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR   P40226, TPO_MOUSE , T; P49745, TPO_RAT   , T;
DR   P42706, TPO_PIG   , P;
DO   PDOC00644;
//
Diagnostic performance
List of matches
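
A PROSITE pattern is essentially a regular expression written in its own notation. The sketch below translates a subset of that notation (x, [..], {..}, and (n) or (n,m) repeats) into a Python regular expression and scans the first 60 residues of EPO_HUMAN with the EPO_TPO pattern, read here as P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C. The helper `prosite_to_regex` is hypothetical and does not handle the "<" and ">" anchors.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (subset of the syntax) into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                        # x: any residue
        element = element.replace("{", "[^").replace("}", "]")     # {AB}: not A or B
        element = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", element)  # (n), (n,m): repeats
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.")

# First 60 residues of EPO_HUMAN (P01588)
epo_start = "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAENITTGCAEHC"
match = re.search(regex, epo_start)
first, last = match.start() + 1, match.end()   # 1-based, inclusive positions
print(regex, first, last)
```

The match spans residues 29 to 56, so pattern elements 3 and 11 fall on Cys-34 and Cys-56, the residues annotated with DISULFID features in the SWISS-PROT entry.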
62
Prosite (profile) example
  • PROSITE PS50097
  • ID BTB; MATRIX.
  • AC PS50097;
  • DT DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE).
  • DE BTB domain profile.
  • MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67;
  • MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62;
  • MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE';
  • MA /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!';
  • MA /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?';
  • MA /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2;
  • MA /I: B1=0; BI=-105; BD=-105;
  • MA /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;
  • MA /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7;
  • MA /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24;
  • MA /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13;
  • MA /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24;
  • MA /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9;

63
Prosite (profile) example (cont.)
  • MA /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5;
  • MA /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11;
  • MA /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8;
  • MA /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20;
  • MA /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1;
  • MA /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20;
  • MA /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3;
  • MA /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7;
  • MA /I: E1=0; IE=-105; DE=-105;
  • NR /RELEASE=39,87397;
  • NR /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0);
  • NR /FALSE_NEG=0; /PARTIAL=0;
  • CC /TAXO-RANGE=??E?V; /MAX-REPEAT=2;
  • DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T;
  • DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T;
  • DR Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T;
  • DR Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T;
  • DR P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T;
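
To make the profile numbers concrete, the sketch below scores a three-residue segment against the first three match rows of the BTB profile and converts raw scores to normalized scores with the linear function of the /NORMALIZATION line. It assumes that each /M row lists one score per letter of the ALPHABET string, in that order, and that the values read R1 = 0.9751 and R2 = 0.02068202; gap and insertion scores are ignored, so this is not a full profile alignment.

```python
ALPHABET = "ABCDEFGHIKLMNPQRSTVWYZ"

# First three match rows of the profile (SY='C', 'D', 'V'), copied from the entry
MATCH_ROWS = [
    [-6, -10, 28, -14, -9, -15, -20, -14, -19, -15, -17,
     -14, -8, -19, -14, -15, 0, 0, -9, -32, -17, -12],
    [-16, 41, -28, 53, 15, -34, -11, -1, -33, 0, -27,
     -25, 21, -11, 0, -8, 2, -6, -26, -38, -19, 7],
    [2, -23, -8, -28, -24, -1, -24, -25, 16, -20, 7,
     6, -20, -25, -23, -20, -10, -4, 24, -23, -9, -24],
]


def match_score(segment):
    """Sum of position-specific match scores for an ungapped segment."""
    return sum(row[ALPHABET.index(res)] for row, res in zip(MATCH_ROWS, segment))


def normalized_score(raw, r1=0.9751, r2=0.02068202):
    """Linear normalization: N_SCORE = R1 + R2 * raw score."""
    return r1 + r2 * raw


print(match_score("CDV"), match_score("AAA"))
print(round(normalized_score(363), 1), round(normalized_score(267), 1))
```

The consensus residues C, D and V score 28 + 53 + 24 = 105, and the raw cut-offs 363 and 267 normalize to the N_SCORE values 8.5 and 6.5 given in the /CUT_OFF lines, which supports this reading of R1 and R2.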

64
PRINTS
  • Compendium of protein motif fingerprints
  • Most protein families are characterized by
    several conserved motifs
  • Fingerprint: set of motif(s) (simple or
    composite, such as multidomains) = signature of
    family membership
  • True family members exhibit all elements of the
    fingerprint, while subfamily members may possess
    only a part

65
ProDom
  • consists of an automated compilation of
    homologous domain alignment (procedure based on
    PSI-BLAST searches)
  • Updating problem !
  • Last ProDom update: February 7, 2000
  • built from SWISS-PROT 38 + TrEMBL + TrEMBL
    updates - October 22, 1999

66
ProDom example
Your query
67
Protein domain/family: composite databases
  • Example: InterPro
  • Unification of PROSITE, PRINTS, Pfam and ProDom
    into an integrated resource of protein families,
    domains and functional sites
  • Single set of documents linked to the various
    methods
  • Will be used to improve the functional annotation
    of SWISS-PROT (classification of unknown
    proteins)
  • This release contains 3052 entries, representing
    574 domains, 2418 families, 46 repeats and 14
    post-translational modification sites.

68
InterPro example
  • IPR001323
  • Name: Erythropoietin/thrombopoeitin
  • Type: Family
  • Abstract:
  • Erythropoietin, a plasma glycoprotein, is the
    primary physiological mediator of erythropoiesis
    [1]. It is involved in the regulation of the
    level of peripheral erythrocytes by stimulating
    the differentiation of erythroid progenitor
    cells, found in the spleen and bone marrow, into
    mature erythrocytes [2]. It is primarily
    produced in adult kidneys and foetal liver,
    acting by attachment to specific binding sites on
    erythroid progenitor cells, stimulating their
    differentiation [3]. Severe kidney dysfunction
    causes reduction in the plasma levels of
    erythropoietin, resulting in chronic anaemia -
    injection of purified erythropoietin into the
    blood stream can help to relieve this type of
    anaemia. Levels of erythropoietin in plasma
    fluctuate with varying oxygen tension of the
    blood, but androgens and prostaglandins also
    modulate the levels to some extent [3].
    Erythropoietin glycoprotein sequences are well
    conserved, a consequence of which is that the
    hormones are cross-reactive among mammals, i.e.
    that from one species, say human, can stimulate
    erythropoiesis in other species, say mouse or
    rat [4].
  • Thrombopoeitin (TPO), a glycoprotein, is the
    mammalian hormone which functions as a
    megakaryocytic lineage specific growth and
    differentiation factor affecting the
    proliferation and maturation from their committed
    progenitor

69
InterPro example
  • ...
  • Example list: P33708, P33709, P49745
  • view matches for the examples
  • Publications
  • 1. Shoemaker C.B., Mitsock L.D. 849-858 (1986)
  • 2. Takeuchi M., Takasaki S., Miyazaki H., Kato
    T., Hoshi S., Kochibe N., Kobata A. J. Biol.
    Chem. 263 3657-3663 (1988)
  • 3. Lin F.K., Lin C.H., Lai P.H., Browne J.K.,
    Egrie J.C., Smalling R., Fox G.M., Chen K.K.,
    Castro M., Suggs S. Gene 44 201-209 (1986)
  • 4. Nagao M., Suga H., Okano M., Masuda S., Narita
    H., Ikura K., Sasaki R. Nucleotide sequence of
    rat erythropoietin. 1171 99-102 (1992)
  • Children: IPR003013
  • Signatures: PROSITE PS00817 EPO_TPO

70
Databases 5: mutation/polymorphism
  • Contain information on sequence variations that
    may or may not be linked to genetic diseases
  • Mainly human, but see also OMIA - Online
    Mendelian Inheritance in Animals
  • General db:
  • OMIM
  • HGMD - Human Gene Mutation db
  • SVD - Sequence variation db
  • HGBASE - Human Genic Bi-Allelic Sequences db
  • dbSNP - Human single nucleotide polymorphism
    (SNP) db
  • Disease-specific db: most of these databases are
    linked either to a single gene or to a single
    disease
  • p53 mutation db
  • ADB - Albinism db (mutations in human genes
    causing albinism)
  • Asthma and Allergy gene db
  • ...

71
Mutation/polymorphisms: definitions
  • SNPs: single nucleotide polymorphisms
  • c-SNPs: coding single nucleotide polymorphisms
    (single nucleotide polymorphisms within cDNA
    sequences)
  • SAPs: single amino-acid polymorphisms
  • Missense mutation -> SAP
  • Nonsense mutation -> STOP
  • Insertion/deletion of nucleotides -> frameshift
  • Warning: the numbering of a mutation depends on
    the db (aa no. 1 is not necessarily the
    initiator Met!)
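
The three outcomes listed above (missense, nonsense, frameshift) can be told apart by comparing a reference coding sequence with its mutated version. The sketch below is illustrative only: the function names and the toy sequences are assumptions, the standard genetic code is assumed, and residue 1 is taken to be the first codon of the sequence supplied, which, as the warning above notes, is not the convention in every database.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds):
    """Translate a coding DNA sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def classify_mutation(ref_cds, alt_cds):
    """Classify the difference between a reference and a mutated CDS."""
    length_change = len(alt_cds) - len(ref_cds)
    if length_change % 3 != 0:
        return "frameshift"
    if length_change != 0:
        return "in-frame insertion/deletion"
    pairs = zip(translate(ref_cds), translate(alt_cds))
    for position, (ref_aa, alt_aa) in enumerate(pairs, start=1):
        if ref_aa != alt_aa:
            kind = "nonsense" if alt_aa == "*" else "missense"
            return f"{kind} at residue {position} ({ref_aa} -> {alt_aa})"
    return "silent"


reference = "ATGGTTTGG"  # hypothetical CDS fragment: Met-Val-Trp
print(classify_mutation(reference, "ATGGCTTGG"))   # GTT -> GCT
print(classify_mutation(reference, "ATGGTTTGA"))   # TGG -> TGA
print(classify_mutation(reference, "ATGGTTTTGG"))  # one base inserted
```

Only the first differing residue is reported, which is enough for single-nucleotide changes.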

72
Mutation/polymorphisms
  • SNP Consortium: http://snp.cshl.org/
  • Bayer, Roche, IBM, Pfizer, Novartis, Motorola
  • Mission: develop up to 300,000 SNPs distributed
    evenly throughout the human genome and make the
    information related to these SNPs available to
    the public without intellectual property
    restrictions. The project started in April 1999
    and is anticipated to continue until the end of
    2001.
  • dbSNP at NCBI: http://www.ncbi.nlm.nih.gov/SNP/
  • Collaboration between the National Human Genome
    Research Institute and the National Center for
    Biotechnology Information (NCBI)
  • Mission: central repository for both single base
    nucleotide substitutions and short deletion and
    insertion polymorphisms
  • As of Aug 24, 2000, dbSNP has submissions for
    803557 SNPs.
  • Chromosome 21 cSNP db: http://csnp.isb-sib.ch/
  • A joint project between the Division of Medical
    Genetics of the
    University of Geneva Medical School and the SIB
  • Mission: comprehensive cSNP (single nucleotide
    polymorphisms within cDNA sequences) database and
    map of chromosome 21

73
Mutation/polymorphisms
  • Very heterogeneous formats
  • Generally of modest size
  • There are initiatives to standardize and to unify
    these databases (SVD - Sequence Variation
    Database project at EBI; HMutDB)

74
Databases 6: proteomics
  • Contain information obtained by 2D-PAGE: master
    images of the gels and descriptions of identified
    proteins
  • Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE,
    Sub2D, Cyano2DBase, etc.
  • Format: composed of image and text files
  • Most 2D-PAGE databases are federated and use
    SWISS-PROT as a master index
  • There is currently no protein Mass Spectrometry
    (MS) database (not for long)

75
Databases 7: 3D structure
  • Contain the spatial coordinates of macromolecules
    whose 3D structure has been obtained by X-ray or
    NMR studies
  • Proteins represent more than 90% of available
    structures (others are DNA, RNA, sugars, viruses,
    protein/DNA complexes)
  • PDB (Protein Data Bank), SCOP (structural
    classification of proteins, according to the
    secondary structures), BMRB (BioMagResBank: NMR
    results)
  • Future: homology-derived 3D structure db.

76
PDB
  • Protein Data Bank, managed by RCSB
  • Currently there are 13000 structures for about
    4000 different molecules, but far fewer protein
    families!
  • There are also databases that contain data
    derived from PDB. Examples: HSSP
    (homology-derived secondary structure of
    proteins), SWISS-3DIMAGE (images)

77
PDB example
  • HEADER LYASE(OXO-ACID)
    01-OCT-91 12CA 12CA 2
  • COMPND CARBONIC ANHYDRASE /II (CARBONATE
    DEHYDRATASE) (/HCA II) 12CA 3
  • COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121
    REPLACED BY ALA (/V121A) 12CA 4
  • SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT
    PROTEIN 12CA 5
  • AUTHOR S.K.NAIR,D.W.CHRISTIANSON
    12CA 6
  • REVDAT 1 15-OCT-92 12CA 0
    12CA 7
  • JRNL AUTH S.K.NAIR,T.L.CALDERONE,
    D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
  • JRNL TITL ALTERING THE MOUTH OF A
    HYDROPHOBIC POCKET. 12CA 9
  • JRNL TITL 2 STRUCTURE AND KINETICS OF
    HUMAN CARBONIC ANHYDRASE 12CA 10
  • JRNL TITL 3 /II MUTANTS AT RESIDUE
    VAL-121 12CA 11
  • JRNL REF J.BIOL.CHEM.
    V. 266 17320 1991 12CA 12
  • JRNL REFN ASTM JBCHA3 US ISSN 0021-9258
    071 12CA 13
  • REMARK 1
    12CA 14
  • REMARK 2
    12CA 15
  • REMARK 2 RESOLUTION. 2.4 ANGSTROMS.
    12CA 16
  • REMARK 3
    12CA 17
  • REMARK 3 REFINEMENT.
    12CA 18
  • REMARK 3 PROGRAM PROLSQ
    12CA 19

78
PDB (cont.)
  • SHEET 3 S10 PHE 66 PHE 70 -1 O ASN
    67 N LEU 60 12CA 68
  • SHEET 4 S10 TYR 88 TRP 97 -1 O PHE
    93 N VAL 68 12CA 69
  • SHEET 5 S10 ALA 116 ASN 124 -1 O HIS
    119 N HIS 94 12CA 70
  • SHEET 6 S10 LEU 141 VAL 150 -1 O LEU
    144 N LEU 120 12CA 71
  • SHEET 7 S10 VAL 207 LEU 212 1 O ILE
    210 N GLY 145 12CA 72
  • SHEET 8 S10 TYR 191 GLY 196 -1 O TRP
    192 N VAL 211 12CA 73
  • SHEET 9 S10 LYS 257 ALA 258 -1 O LYS
    257 N THR 193 12CA 74
  • SHEET 10 S10 LYS 39 TYR 40 1 O LYS
    39 N ALA 258 12CA 75
  • TURN 1 T1 GLN 28 VAL 31 TYPE VIB
    (CIS-PRO 30) 12CA 76
  • TURN 2 T2 GLY 81 LEU 84 TYPE
    II(PRIME) (GLY 82) 12CA 77
  • TURN 3 T3 ALA 134 GLN 137 TYPE I
    (GLN 136) 12CA 78
  • TURN 4 T4 GLN 137 GLY 140 TYPE I
    (ASP 139) 12CA 79
  • TURN 5 T5 THR 200 LEU 203 TYPE VIA
    (CIS-PRO 202) 12CA 80
  • TURN 6 T6 GLY 233 GLU 236 TYPE II
    (GLY 235) 12CA 81
  • CRYST1 42.700 41.700 73.000 90.00 104.60
    90.00 P 21 2 12CA 82
  • ORIGX1 1.000000 0.000000 0.000000
    0.00000 12CA 83
  • ORIGX2 0.000000 1.000000 0.000000
    0.00000 12CA 84
  • ORIGX3 0.000000 0.000000 1.000000
    0.00000 12CA 85
  • SCALE1 0.023419 0.000000 0.006100
    0.00000 12CA 86
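
As an illustration of how the crystallographic records above can be used, the sketch below reads the six cell parameters from the CRYST1 record and computes the unit cell volume with the general triclinic formula. It is a simplified example, not a PDB parser: the record is split on whitespace because the column layout is lost in this transcript, whereas real PDB files use fixed columns, and the function name is an assumption. The space group field is left unparsed.

```python
import math


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume from cell edges and angles given in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg
    )


# CRYST1 record as it appears on the slide (whitespace-separated here).
cryst1 = "CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2"
a, b, c, alpha, beta, gamma = (float(x) for x in cryst1.split()[1:7])

volume = unit_cell_volume(a, b, c, alpha, beta, gamma)
print(f"unit cell volume: {volume:.0f} cubic angstroms")

# For this entry, 1/a agrees with the first SCALE1 value (0.023419).
print(f"1/a = {1 / a:.6f}")
```

With alpha and gamma both 90 degrees, as here, the general formula reduces to a * b * c * sin(beta).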

79
Databases 8: metabolic
  • Contain information that describes enzymes,
    biochemical reactions and metabolic pathways
  • ENZYME and BRENDA: nomenclature databases that
    store information on enzyme names and reactions
  • Examples of metabolic databases: EcoCyc
    (specialized in Escherichia coli), KEGG, EMP/WIT
  • Usually these databases are tightly coupled with
    query software that allows the user to visualise
    reaction schemes.

80
Databases 9: bibliographic
  • Bibliographic reference databases contain
    citations and abstracts of published life
    science articles
  • Example: Medline
  • Other more specialized databases also exist
    (example: Agricola).

81
Medline
  • MEDLINE covers the fields of medicine, nursing,
    dentistry, veterinary medicine, the health care
    system, and the preclinical sciences
  • Covers more than 4,000 biomedical journals
    published in the United States and 70 other
    countries
  • Contains over 10 million citations from 1966 to
    the present
  • Contains links to biological db and to some
    journals
  • New records are added to PreMEDLINE daily!
  • Many papers that do not deal with humans are not
    in Medline!
  • Before 1970, only the first 10 authors are kept!
  • Not all journals have citations going back to
    1966!

82
Medline/PubMed
  • PubMed is developed by the National Center for
    Biotechnology Information (NCBI)
  • PubMed provides access to bibliographic
    information such as MEDLINE, PreMEDLINE,
    HealthSTAR, and to integrated molecular biology
    databases (composite db)
  • Identifiers: PMID 10923642 (PubMed ID), UI
    20378145 (Medline ID)

83
Databases 10: others
  • There are many databases that cannot be
    classified in the categories listed previously
  • Examples: REBASE (restriction enzymes), TRANSFAC
    (transcription factors), O-GLYCBASE (O-linked
    sugars), protein-protein interactions db (DIR),
    biotechnology patents db, etc.
  • As well as many other resources concerning any
    aspect of macromolecules and molecular biology.

84
Proliferation of databases
  • What is the best db for sequence analysis?
  • Which contains the highest-quality data?
  • Which is the most comprehensive?
  • Which is the most up-to-date?
  • Which is the least redundant?
  • Which is the most thoroughly indexed (allows
    complex queries)?
  • Which Web server responds most quickly?
  • ...?

85
Some important practical remarks
  • Databases contain many errors (automated
    annotation)!
  • Not all db are available on all servers
  • The update frequency is not the same for all
    servers; a db_new is created between releases
    (example: EMBLnew, TrEMBLnew, ...)
  • Some servers automatically add useful
    cross-references to an entry (implicit links) in
    addition to already existing links (explicit
    links)

86
Database retrieval tools
  • Sequence Retrieval System (SRS, Europe): allows
    any flat-file db to be indexed to any other;
    queries can be formulated across a wide range of
    different db types via a single interface,
    without any worry about data structure or query
    languages
  • Entrez (USA): less flexible than SRS but exploits
    the concept of "neighbouring", which allows
    related articles in different db to be linked
    together, whether or not they are
    cross-referenced directly
  • ATLAS: specific for macromolecular sequence db
    (e.g. NRL-3D)
  • ...
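
To make the idea of indexing a flat-file db concrete, here is a minimal sketch that splits a flat file into entries on the "//" terminator and builds a lookup table on a chosen field. It is not how SRS or Entrez are implemented; the function names, the two-letter line codes and the toy accession numbers and identifiers are all hypothetical.

```python
def parse_flat_file(text):
    """Split a flat file into entries (dicts of line code -> values)."""
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":            # entry terminator
            if current:
                entries.append(current)
            current = {}
        elif line:
            key, _, value = line.partition(" ")
            current.setdefault(key, []).append(value.strip())
    if current:                     # file without a final terminator
        entries.append(current)
    return entries


def build_index(entries, field):
    """Map each value of `field` to the entries that carry it."""
    index = {}
    for entry in entries:
        for value in entry.get(field, []):
            index.setdefault(value, []).append(entry)
    return index


# Toy flat file with made-up accession numbers and cross-references.
toy_db = """AC P00001
DR PDB 1ABC
//
AC P00002
DR PDB 1ABC
DR PDB 2XYZ
//
"""

entries = parse_flat_file(toy_db)
by_xref = build_index(entries, "DR")
print([entry["AC"][0] for entry in by_xref["PDB 1ABC"]])
```

Indexing on the cross-reference field is what lets a query jump from an identifier in one db to every entry that points to it in another.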

87
More information about SWISS-PROT
88
The golden goals of SWISS-PROT
  • Annotated / curated
  • Complete
  • Non-redundant
  • Highly cross-referenced
  • Available from a variety of servers and through
    sequence analysis software tools
  • Associated with a wide range of documentation
  • Review: "Protein sequence databases", R. Apweiler
    (2000), Adv. in Protein Chemistry, 54, 31-70

89
SWISS-PROT species
  • 6840 different species
  • 20 species represent about 45% of all sequences
    in the database
  • 5000 species are only represented by one to
    three sequences. In most cases, these are
    sequences which were obtained in the context of a
    phylogenetic study

90
SWISS-PROT cross-references
  • SWISS-PROT was the first database with
    cross-references.
  • Explicitly cross-referenced to 34 databases
  • Cross-ref to DNA (EMBL/GenBank/DDBJ),
    3D-structure (PDB), literature (Medline), genomic
    (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D-gel
    (SWISS-2DPAGE), specialized db (PROSITE,
    TRANSFAC)
  • Implicitly cross-referenced to additional db on
    the WWW (GeneCards, PRODOM, etc.)

91
Annotations
  • Function(s)
  • Post-translational modifications (PTM)
  • Domains
  • Quaternary structure
  • Similarities
  • Diseases, mutagenesis
  • Conflicts, variants
  • Cross-references

92
A Swiss-Prot entry
93
Sprot entry (cont.)
94
Sprot entry (cont.)
95
Sprot entry (cont.)
96
Sprot entry (cont.)
97
Future for human proteins
  • Original estimate: 70000 to 100000 genes
  • Incyte recently announced an estimate of
    140000 genes
  • More recent estimates give about 30000 to
    40000 genes
  • C. elegans and Drosophila have 15000 genes.
    There were two rounds of genome duplication in
    the evolutionary history leading to vertebrates.
    Very roughly, this means that
  • Human genes = 60000 genes - losses + new genes
  • But more than 1 million proteins!
  • (due to PTM, alternative products, variants)
  • http://www.ensembl.org/genesweep.html
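
The 60000 figure follows from doubling a 15000-gene baseline twice. The short sketch below reproduces that arithmetic and shows the net change (new genes minus losses) implied by the more recent estimates of 30000 to 40000 genes. The function name is an assumption, and the calculation is only as rough as the slide's own reasoning.

```python
def duplicated_gene_count(ancestral_genes, duplication_rounds):
    """Gene count after whole-genome duplications, ignoring losses and gains."""
    return ancestral_genes * 2 ** duplication_rounds


# 15000 genes doubled twice: 15000 * 2 * 2
baseline = duplicated_gene_count(15000, 2)
print(baseline)

# Net change (new genes - losses) implied by each recent estimate.
for estimate in (30000, 40000):
    print(estimate, "->", estimate - baseline)
```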

98
Genesweep
http://www.ensembl.org/genesweep.html
99
What after genomes?
  • Proteome projects are an essential tool for the
    understanding of real proteins
  • There will be a flood of characterization data
    (MS, 2D) that will be the equivalent of ESTs at
    the protein level
  • Protein databases are going to be more and more
    important for new biological studies

101
Databases in GCG
  • DNA
  • EMBL, EPD, RepBase, vectordb (NCBI)
  • Protein
  • Swiss-Prot, TrEMBL, PDB
  • Other
  • PROSITE, REBASE

102
How to access databases in GCG?
  • Fetch or typedata ?
  • Stringsearch
  • Name
  • Lookup (based on SRS)
  • Useful to generate list files