Sequence databases and retrieval systems - PowerPoint PPT Presentation

1
  • Sequence databases and retrieval systems
  • Guy Perrière
  • Pôle Bioinformatique Lyonnais
  • Laboratoire de Biométrie et Biologie Évolutive
  • UMR CNRS n° 5558
  • Université Claude Bernard Lyon 1

2
In the beginning
  • First paper compilation in 1965 (Atlas of Protein
    Sequence and Structure).
  • Development of real databanks at the beginning
    of the 1980s
  • Fast access.
  • Make possible analyses that require a lot of
    data
  • Codon usage.
  • Molecular phylogeny.

3
General databanks
  • Nucleotide sequences
  • EMBL/GenBank/DDBJ.
  • Protein sequences
  • Simple translations of coding regions
  • GenPept.
  • TrEMBL.
  • Systems containing specific data
  • SWISS-PROT.
  • PIR.

4
EMBL
  • Created in 1980 at the European Molecular Biology
    Laboratory in Heidelberg.
  • Maintained since 1994 at the European
    Bioinformatics Institute (EBI) in Cambridge.
  • Web server
  • http://www.ebi.ac.uk/embl

5
GenBank
  • Set up in 1979 at the Los Alamos National
    Laboratory (LANL) in Los Alamos.
  • Maintained since 1992 at the National Center for
    Biotechnology Information (NCBI) in Bethesda.
  • Web server
  • http://www.ncbi.nlm.nih.gov/Genbank/index.html

6
DDBJ
  • Started its activities in 1984 at the National
    Institute of Genetics (NIG) in Mishima.
  • Since then, still maintained in this institute by
    the team of Takashi Gojobori.
  • Web server
  • http://www.ddbj.nig.ac.jp

7
Nucleotide sequences
  • Data mainly provided by direct submissions from
    the authors.
  • Submissions are made through the Internet
  • Web forms.
  • Email.
  • The sequences are exchanged between the three
    centers on a daily basis
  • The content of the banks is identical.

8
Data growth
[Figure: growth of EMBL, GenBank, NBRF/PIR and SWISS-PROT over time; y-axis: log(number of residues).]
9
GenBank size (May 2001)
  • 12.7 × 10^9 nucleotides.
  • 11.8 × 10^6 sequences.
  • 690,323 genes (proteins and RNA).
  • 242,616 bibliographic references.
  • 47.8 gigabytes on disc.
  • Growth of 280 in 12 months.
  • 24-36 h to download the whole GenBank files from
    NCBI.

10
Taxonomic sampling
  • There are 72,000 species for which at least one
    sequence is available.
  • Nine species (0.01%) account for 85% of the
    total.
  • 18,000 species are represented by only one
    sequence!

[Figure: the nine most represented species in GenBank.]
11
Distribution format
  • The banks are distributed as a set of text files
    (215 for GenBank).
  • A file contains sequences corresponding to
  • A given taxon (e.g., bacteria, invertebrates,
    mammals).
  • A given class of sequences (EST, HTG, GSS).
  • Inside a file, each sequence is called an entry.

12
Entries structure
  • Information is stored in structured fields.
  • The format differs between EMBL and
    GenBank/DDBJ.
  • The data entered in the three databanks are
    identical.

13
ID, AC, SV and DT fields
Contain identifiers as well as the creation and
last modification dates for the entries.
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
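Reading these fields programmatically is mostly a matter of grouping lines by their two-letter code. The following is a minimal sketch, assuming the usual EMBL flat-file layout (line code in the first two columns, semicolon-separated values); the function name `parse_embl_lines` is illustrative and does not come from any library.

```python
def parse_embl_lines(text):
    """Group EMBL-style lines by their two-letter line code (ID, AC, SV, DT, ...)."""
    fields = {}
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "XX" or not value:
            continue  # XX lines are empty separators between field groups
        fields.setdefault(code, []).append(value)
    return fields


# Sample header; the punctuation is assumed to follow the EMBL flat-file convention.
SAMPLE = """\
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
"""

fields = parse_embl_lines(SAMPLE)
entry_name = fields["ID"][0].split()[0]               # first token of the ID line
accessions = fields["AC"][0].replace(";", " ").split()  # primary accession first
version = fields["SV"][0]
dates = [d.split()[0] for d in fields["DT"]]          # creation date, then last update

print(entry_name, accessions, version, dates)
```

The first accession number on the AC line is the primary one; later ones are secondary accessions kept for traceability.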
14
DE, KW, OS and OC fields
Contain general information on sequences
(definition, keywords, taxonomy).
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
15
RN, RX, RA and RT fields
Contain information related to bibliographic
references.
RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi A., Henner D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
16
FT field
Contains the descriptions of functional regions,
with details given by qualifiers.
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 3 (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 3"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG ...
17
Intron/exon structure
FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...
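A `join(...)` location lists the exon coordinates that must be concatenated to obtain the coding sequence. The sketch below shows one way to read such a location and splice the exons; it assumes 1-based inclusive coordinates on the forward strand and ignores `complement(...)` and partial-end markers. The helper names are illustrative.

```python
import re


def parse_location(location):
    """Return (start, end) pairs, 1-based inclusive, from 'a..b' or 'join(a..b,...)'."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]


def splice(sequence, exons):
    """Concatenate the exon segments of a sequence, in the order given."""
    return "".join(sequence[start - 1:end] for start, end in exons)


# Location taken from the insulin CDS feature shown on this slide.
exons = parse_location("join(242..610,3397..3542,5100..5351)")
exon_lengths = [end - start + 1 for start, end in exons]

# Toy sequence (not real data): exons in upper case, flanks and intron in lower case.
toy_cds = splice("aaCCCttGGaa", parse_location("join(3..5,8..9)"))

print(exons, exon_lengths, toy_cds)
```

Retrieval systems that give direct access to CDS features perform this kind of splicing before handing the coding sequence to the user.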
18
SQ field
Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
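The SQ header announces the length and base composition, so a reader can check the residues against it. Here is a minimal sketch of that consistency check, assuming the usual layout (residues in groups of ten, a position counter at the end of each line, `//` as terminator). The block is a toy 25-base example built for illustration, not a complete entry.

```python
import re
from collections import Counter


def read_residues(block):
    """Join the residue groups of an EMBL-style SQ block, skipping position counters."""
    residues = []
    for line in block.splitlines():
        if line.startswith(("SQ", "//")):
            continue  # header and terminator carry no residues
        residues.extend(tok for tok in line.split() if tok.isalpha())
    return "".join(residues)


def declared_composition(block):
    """Read the per-base counts announced on the SQ header line."""
    header = next(line for line in block.splitlines() if line.startswith("SQ"))
    return {base.lower(): int(n) for n, base in re.findall(r"(\d+) ([ACGT]);", header)}


# Toy block: 25 bases with a header written to match them.
BLOCK = """\
SQ   Sequence 25 BP; 10 A; 7 C; 5 G; 3 T; 0 other;
     gctcatgccg agaatagaca ccaaa                                              25
//
"""

sequence = read_residues(BLOCK)
observed = dict(Counter(sequence))
consistent = observed == declared_composition(BLOCK)

print(len(sequence), observed, consistent)
```

A mismatch between the declared and observed composition is a cheap signal that a file was truncated or corrupted in transfer.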
19
Errors in databanks
  • There are a lot of errors in the nucleotide
    sequence databanks
  • For the annotations, the free submission of
    entries involves
  • Inaccuracies, omissions, and even mistakes.
  • Inconsistencies between some fields.
  • In the sequences themselves
  • Sequencing errors.
  • Compression, gel reading.
  • Cloning vectors inserted.

20
Redundancy
  • Another major problem is redundancy.
  • A lot of entries are partially or entirely
    duplicated
  • 20% of vertebrate sequences in GenBank.
  • Duplicated entries are often different in their
    sequence.

21
Variations in duplicates
  • It is often impossible to decide whether a
    difference between two duplicates is due to
  • Polymorphism.
  • Sequencing error.
  • True gene duplication.
  • And what to do when annotations are different or
    even contradictory?

22
Protein sequences
  • Translation of Coding DNA Sequences (CDS) from
    EMBL/GenBank/DDBJ.
  • Consultation of publications or patents.
  • Small number of direct protein submissions by
    authors.
  • Specific annotations are integrated only in the
    true protein sequence databanks.

23
SWISS-PROT
  • Created by Amos Bairoch in 1986 at the
    Department of Medical Biochemistry in Geneva.
  • Maintained by the Swiss Institute of
    Bioinformatics (SIB) and funded by GeneBio.
  • Web server
  • http://www.expasy.ch/sprot/sprot-top.html

24
SWISS-PROT characteristics
  • Almost no redundancies.
  • Cross-references with 55 other databanks.
  • High-quality annotations
  • Systematic control by a team of annotators.
  • Help from a group of 204 volunteer experts.
  • Embedded in a complete environment devoted to
    proteins.

25
Annotations
  • Protein function.
  • Post-translational modifications.
  • Structural or functional domains.
  • Secondary and quaternary structures.
  • Similarities with other proteins.
  • Conflicts between positions for CDS.

26
Associated databanks
  • TrEMBL, built using only annotated CDS from EMBL.
  • ENZYME, for the international enzyme
    nomenclature.
  • PROSITE, for biologically significant sites,
    patterns and profiles.
  • SWISS-2DPAGE, for two-dimensional polyacrylamide
    gel electrophoresis maps.

27
PIR
  • PIR (The Protein Information Resource) was
    created by M.O. Dayhoff in 1965.
  • Aims
  • To provide exhaustive and non-redundant protein
    data.
  • To give a classification using taxonomic and
    similarity data
  • Grouping of entries into super-families, families
    and sub-families.

28
Data maintenance
  • Three organizations collect and organize the data
    entered into PIR
  • The National Biomedical Research Foundation
    (NBRF) in the United States.
  • The Martinsried Institute for Protein Sequences
    (MIPS) in Germany.
  • The Japan International Protein Sequence
    Information Database (JIPID) in Japan.

29
Results
  • Coverage is no better than that obtained with
    SWISS-PROT + TrEMBL.
  • Still contains a lot of redundancies.
  • Lower quality for the annotations.
  • Low number of cross-references.

30
Specialized databanks
  • Many specialized databanks have been
    developed, devoted to
  • Complete genomes.
  • Families of homologous genes.
  • Non-sequence data.
  • These systems are under the responsibility of
    curators
  • Data quality and homogeneity control.

31
Complete genomes
  • There is a large number of databanks devoted to
    particular organisms.
  • These banks are associated with sequencing or
    mapping projects.
  • For some model organisms there are often
    several competing systems.

32
Examples
33
Gene families databanks
  • Built using automated procedures
  • Similarity search between sets of proteins
    (BLASTP, FASTP, Smith-Waterman).
  • Clustering into homologous families using
    similarity criteria.
  • Include various data
  • Protein (and sometimes nucleotide) sequences.
  • Multiple alignments and trees.
  • Taxonomy.
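The clustering step can be pictured as single-linkage grouping: two proteins end up in the same family as soon as a chain of sufficiently similar pairs connects them. The following is a minimal sketch under that assumption, using a union-find structure; the slides do not say which clustering method each databank actually uses, and the protein names and scores are made up.

```python
def cluster_families(proteins, similarities, threshold):
    """Single-linkage clustering: link every pair whose score reaches the threshold."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())


# Hypothetical pairwise similarity scores (percent), e.g. from a BLASTP run.
PROTEINS = ["P1", "P2", "P3", "P4", "P5"]
SCORES = {("P1", "P2"): 82, ("P2", "P3"): 55, ("P3", "P4"): 20, ("P4", "P5"): 90}

families = cluster_families(PROTEINS, SCORES, threshold=50)
print(families)
```

Single linkage is easy to compute but can chain unrelated proteins together through multi-domain sequences, which is one reason such databanks also rely on alignments and trees.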

34
ProtFam
  • Developed at MIPS.
  • Built with PIR sequences.
  • Includes four levels of classification
  • Superfamilies (based on function and similarity
    criteria).
  • Families (50% similarity).
  • Subfamilies (80% similarity).
  • Entries (95% similarity).
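To make the nested levels concrete, the sketch below maps a pairwise similarity value to the finest level two sequences would share. It assumes the three cutoffs are percentages used as inclusive lower bounds, which the slide does not state explicitly. Superfamilies are left out because they also depend on functional criteria. The function name is illustrative.

```python
# Cutoffs from the slide, read as percent similarity (an assumption), finest level first.
LEVELS = [(95.0, "entry"), (80.0, "subfamily"), (50.0, "family")]


def finest_shared_level(percent_similarity):
    """Return the finest classification level reached by a similarity value, or None."""
    for cutoff, level in LEVELS:
        if percent_similarity >= cutoff:
            return level
    return None  # below 50%: at most the same superfamily, decided by other criteria


levels = [finest_shared_level(s) for s in (97.0, 85.0, 60.0, 30.0)]
print(levels)
```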

35
ProtFam characteristics
  • Lets users view alignments and dendrograms
    for the families.
  • Integrates Pfam domains.
  • Allows users to classify their own protein
    sequences.
  • Web server
  • http://www.mips.biochem.mpg.de

36
ProtoMap
  • Developed at the Department of Biological
    Chemistry of the Hebrew University of
    Jerusalem.
  • Built with SWISS-PROT sequences.
  • Uses three similarity measures for sequences
    (BLASTP, FASTP and Smith-Waterman).

37
ProtoMap characteristics
  • Alignments and trees are displayed with
    Java applets.
  • Users can submit their own sequences for
    classification.
  • No domain data, but low-similarity relationships
    can be visualized.
  • Web server
  • http://www.protomap.cs.huji.ac.il

38
Specialized systems
  • HOVERGEN (Homologous Vertebrate Genes Database)
    for vertebrates
  • Based on GenBank CDS.
  • HOBACGEN (Homologous Bacterial Genes Database)
    for prokaryotes and yeast
  • Based on SWISS-PROT/TrEMBL.
  • HOBACGEN-CG for completely sequenced genomes
  • Based on SWISS-PROT/TrEMBL.

39
Other specialized systems
  • COG (Clusters of Orthologous Groups), also for
    complete genomes
  • Based on GenBank CDS.
  • NuReBase (Nuclear Receptors Database) for
    mammalian nuclear receptors
  • Based on EMBL CDS.
  • RTKdb (Tyrosine Kinase Receptors)
  • Based on EMBL CDS.

40
Are COGs real orthologs?
[Figure: glutamate synthase large subunit in Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae and Synechocystis sp.]
41
HOBACGEN
  • Integrates protein and nucleotide sequences as
    well as multiple alignments and trees.
  • Is based upon a client/server architecture.
  • Client software is distributed as well as the
    server structure (including all sequences).
  • Web server
  • http://pbil.univ-lyon1.fr/databases/hobacgen.html

42
Similarities search
[Figure: similarity search among SWISS-PROT/TrEMBL sequences.]
43
Segments selection
44
Families assembling
45
Alignments and trees
Rooting by mid-point
46
Domain structure
[Figure: ProDom domains for the 6PGD family. Sequences shown: 6PG1_YEAST, 6PGD_CANAL, 6PGD_SOYBN, 6PG2_BACSU, O32911_MYCLR, P95165_MYCTU, 6PGD_CERCA, Q40311_MEDSA, Y770_MYCTU, Y229_SYNY3.]
47
Examples
  • Domains (Blocks, Domo, Pfam, ProDom, SBASE).
  • Sites, patterns and profiles (PRINTS, PROSITE).
  • The InterPro databank gathers all available data
    on domains and patterns with a known biological
    function.

48
Non-sequence data
49
Data retrieval
  • Made mainly through Internet access
  • With client software (e.g., Entrez, HobacFetch).
  • By remote connections to servers providing
    on-line access to the banks (INFOBIOGEN).
  • Using World-Wide Web servers and browsers
    (Netscape, Internet Explorer, Lynx, etc.)

50
Advantages and limitations
  • Users do not have to cope with the usual
    database problems
  • Storing of large amounts of data.
  • Daily updates.
  • Software upgrades.
  • Simplicity of use.
  • Net access is sometimes very slow at peak hours
  • Use your local servers instead of NCBI!!!

51
Retrieval systems
  • Direct access to functional regions described in
    the features (CDS, tRNA, rRNA).
  • Selection of entries using various criteria
  • Sequence names and accession numbers.
  • Bibliographic references.
  • Keywords.
  • Taxonomy.
  • Publication date.
  • Organelle, host.
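Selecting entries on several criteria amounts to applying a conjunction of filters over indexed fields. The sketch below shows the idea on an in-memory list; it is not how Query, SRS or Entrez are implemented, and the `Entry` class, accession labels and dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    accession: str
    species: str
    keywords: frozenset
    published: date


def select(entries, species=None, keyword=None, published_since=None):
    """Return accession numbers of the entries matching every criterion that is set."""
    hits = []
    for e in entries:
        if species is not None and e.species != species:
            continue
        if keyword is not None and keyword not in e.keywords:
            continue
        if published_since is not None and e.published < published_since:
            continue
        hits.append(e.accession)
    return hits


# Toy databank: hypothetical accession labels and dates.
BANK = [
    Entry("HYP001", "Bacillus subtilis", frozenset({"amylase", "signal peptide"}), date(1983, 7, 13)),
    Entry("HYP002", "Bacillus subtilis", frozenset({"ribosomal protein"}), date(1996, 11, 12)),
    Entry("HYP003", "Homo sapiens", frozenset({"insulin"}), date(1990, 1, 1)),
]

by_species_and_keyword = select(BANK, species="Bacillus subtilis", keyword="amylase")
recent = select(BANK, published_since=date(1990, 1, 1))
print(by_species_and_keyword, recent)
```

Real retrieval systems precompute indexes for each field instead of scanning every entry, which is what makes queries over millions of sequences practical.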

52
Query
  • Developed at the Laboratoire de Biométrie et
    Biologie Évolutive by Gouy et al. (1985).
  • Graphical interface distributed along with the
    databases themselves.
  • Web access at Pôle Bioinformatique Lyonnais
    (PBIL)
  • http://pbil.univ-lyon1.fr/search/query.html

53
Characteristics
  • Can query any bank in PIR, SWISS-PROT,
    EMBL, or GenBank/DDBJ format.
  • Keywords and species browsing.
  • Complex queries.
  • Links with sequence analysis programs on the Web
    server (alignment, codon usage).

54
SRS
  • Public version developed at EMBL by Etzold and
    Argos (1993).
  • Presently available on the different Web servers
    belonging to EMBnet
  • INFOBIOGEN (France).
  • EBI (England).
  • DKFZ (Germany).

55
Characteristics
  • Database indexes are built using ODD
    (Object Design and Definition).
  • More than 250 databanks have been indexed and are
    accessible through 35 SRS servers.
  • Allows queries on different banks simultaneously.

56
Databanks interconnection
57
Entrez
  • Developed by Schuler et al. (1996) at NCBI.
  • Can query only databases produced in
    the USA
  • GenBank, GenPept, NR, MMDB, MEDLINE.
  • Access through client software (Unix, Mac or
    Windows) or Web server
  • http://www.ncbi.nlm.nih.gov

58
Characteristics
  • Introduces the concept of neighbours between
    sequences, references and structures.
  • Sequence neighbours are established using
    similarity criteria.
  • No access to multiple alignments.