Sequence databases and retrieval systems


1
  • Sequence databases and retrieval systems
  • Guy Perrière
  • replaced by Manolo Gouy
  • Pôle Bio-Informatique Lyonnais
  • Laboratoire de Biométrie et Biologie Évolutive
  • UMR CNRS n° 5558
  • Université Claude Bernard Lyon 1

2
In the beginning
  • First paper compilation in 1965 (Atlas of Protein
    Sequences).
  • Development of real databanks at the beginning
    of the 1980s
  • Fast access.
  • Make possible analyses that require a lot of
    data
  • Codon usage.
  • Molecular phylogeny.

3
General databanks
  • Nucleotide sequences
  • EMBL/GenBank/DDBJ.
  • Protein sequences
  • Simple translations of coding regions
  • GenPept (from GenBank).
  • TrEMBL (from EMBL).
  • Systems containing additional data
  • SWISS-PROT.
  • PIR.

4
EMBL
  • Created in 1980 at the European Molecular Biology
    Laboratory in Heidelberg.
  • Maintained since 1994 at the European
    Bioinformatics Institute (EBI) near Cambridge.
  • Web server
  • http://www.ebi.ac.uk/embl

5
GenBank
  • Set up in 1979 at the Los Alamos National
    Laboratory in New Mexico, US.
  • Maintained since 1992 at the National Center for
    Biotechnology Information (NCBI) in Bethesda.
  • Web server
  • http://www.ncbi.nlm.nih.gov/Genbank/index.html

6
DDBJ
  • Active since 1984 at the National Institute of
    Genetics (NIG) in Mishima, Japan.
  • Web server
  • http://www.ddbj.nig.ac.jp

7
EMBL / GenBank / DDBJ
  • The International Nucleotide Sequence Database
    Collaboration EMBL / GenBank / DDBJ
  • New sequences are exchanged daily between the
    three centers
  • --> the three banks have identical content.
  • Data mainly provided by direct submissions from
    the authors through Internet
  • Web forms.
  • Email.

8
Data growth
[Figure: data growth; y-axis: log (number of residues)]
9
GenBank/EMBL size (April 2003)
  • 31 × 10^9 nucleotides.
  • 24 × 10^6 sequences.
  • 1.8 million genes (proteins and RNA).
  • 313,000 bibliographic references.
  • 100 gigabytes on disk.
  • Growth of 63% in 12 months.
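To put the growth figure in perspective, the short calculation below converts a yearly growth rate into a doubling time. It reads the figure on this slide as 63% per year and assumes the growth is steadily exponential, which the slides do not state; the function name is illustrative.

```python
import math


def doubling_time_months(annual_growth: float) -> float:
    """Months needed for the databank to double in size.

    Assumes constant exponential growth at the given yearly rate
    (0.63 means +63% per year).
    """
    return 12 * math.log(2) / math.log(1 + annual_growth)


months = doubling_time_months(0.63)
print(f"Doubling time: {months:.1f} months")
```

Under these assumptions the databank doubles roughly every 17 months.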

10
Taxonomic sampling (April 2003)
  • There are 135,560 species for which at least one
    sequence is available.
  • Nine species (0.007%) account for 62% of the
    total.
  • 77,900 species are represented by only one
    sequence!
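The proportions quoted on this slide can be checked from the raw counts. The snippet below is a small arithmetic sketch that uses only the numbers given above; the variable names are illustrative.

```python
species_total = 135_560      # species with at least one sequence
top_species = 9              # the most represented species
single_sequence = 77_900     # species represented by exactly one sequence

# Share of all species, expressed in percent.
top_share = 100 * top_species / species_total
single_share = 100 * single_sequence / species_total

print(f"Top nine species: {top_share:.3f}% of all species")
print(f"Single-sequence species: {single_share:.1f}% of all species")
```

So more than half of the species in the databank are known from a single sequence, while a handful of species hold most of the data.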

The nine most represented species in GenBank/EMBL
11
Distribution format
  • The banks are distributed as a set of text files
    called divisions (292 for EMBL).
  • A division contains sequences related to
  • A taxon (e.g., bacteria, invertebrates, mammals).
  • A class of sequences (EST, HTG, GSS).
  • Within a division, each sequence is called an
    entry.

12
Entry structure
  • Information is introduced in structured fields.
  • The format differs in its form between EMBL and
    GenBank/DDBJ
  • but not in substance.
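Because each line of an EMBL-style entry starts with a two-letter field code, an entry can be split into fields with very little code. The sketch below is a minimal illustration, not a complete parser: the sample entry is typed in by hand from the field values shown on the following slides, and its exact spacing and semicolon separators are assumptions based on the usual EMBL flat-file layout.

```python
from collections import defaultdict

# Hand-typed sample entry; spacing and separators are assumed.
ENTRY = """\
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DE   Bacillus subtilis amylase gene.
XX
OS   Bacillus subtilis
//
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter field code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "//":          # end-of-entry marker
            break
        if code == "XX" or not value:
            continue              # spacer lines carry no data
        fields[code].append(value)
    return dict(fields)


fields = parse_entry(ENTRY)
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
print(fields["ID"][0])
print(accessions)
```

A GenBank/DDBJ entry would need different field names (LOCUS, ACCESSION, and so on) but the same idea applies, since the two formats differ in form and not in substance.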

13
ID, AC, SV and DT fields
Contain identifiers and the creation and last
modification dates for the entries.
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
14
DE, KW, OS and OC fields
Definition, Keywords, Taxonomy.
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
The NCBI maintains a unified taxonomy, largely
based on sequence information.
15
RN, RX, RA and RT fields
Contain bibliographic information.
RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi A., Henner D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
16
FT field
Contains the descriptions of functional regions.
     Key             Location and qualifiers
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 3 (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 3"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG ...
17
Intron/exon structure
FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin" ...
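A join(...) location lists the exons of a coding region as 1-based, inclusive intervals. The sketch below shows one way to read such a location and splice the corresponding segments out of a sequence. It handles only plain forward-strand intervals; other location operators such as complement(...) are out of scope, and the function names are illustrative.

```python
import re


def parse_location(location):
    """Return the (start, end) pairs of a location, 1-based and inclusive."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def splice(sequence, segments):
    """Concatenate the listed segments of a sequence (exons joined in order)."""
    return "".join(sequence[start - 1:end] for start, end in segments)


segments = parse_location("join(242..610,3397..3542,5100..5351)")
lengths = [end - start + 1 for start, end in segments]
print(segments)
print(lengths, sum(lengths))

# Toy sequence, just to show the splicing step.
print(splice("AAACCCGGGTTT", [(1, 3), (7, 9)]))
```

Converting to Python slices means subtracting one from the start only, because feature-table coordinates include the end position while slices exclude it.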
18
SQ field
Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
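Sequence lines mix residues with position counters, so reading them back means keeping the letter groups and dropping the numbers. The sketch below does this for the first two sequence lines of the entry shown on this slide and tallies the base composition of that fragment; the layout of the sample text is an assumption about the flat-file format, and the function name is illustrative.

```python
from collections import Counter

# First two sequence lines of the entry, with their position counters.
SEQUENCE_LINES = """\
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
"""


def read_sequence(block):
    """Join the residue groups of a sequence block, skipping the counters."""
    residues = []
    for line in block.splitlines():
        residues.extend(token for token in line.split() if token.isalpha())
    return "".join(residues)


sequence = read_sequence(SEQUENCE_LINES)
composition = Counter(sequence)
print(len(sequence))
print(dict(composition))
```

On a complete entry, comparing such a tally with the counts announced on the SQ line is a cheap consistency check.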
19
Errors in databanks
  • There are a lot of errors in the nucleotide
    sequence databanks
  • In annotations
  • Inaccuracies, omissions, and even mistakes.
  • Inconsistencies between entries.
  • In the sequences themselves
  • Sequencing errors.
  • Cloning vectors inserted.

20
Redundancy
  • Another major problem is redundancy.
  • A lot of entries are partially or entirely
    duplicated
  • 20% of vertebrate sequences in GenBank.
  • Duplicated entries are often different in their
    sequence.
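The simplest kinds of redundancy, identical sequences and sequences fully contained in a longer one, can be detected mechanically. The sketch below is a naive illustration on toy data; the entry names are hypothetical, the containment scan is quadratic, and it deliberately does not catch duplicates that differ in their sequence, which is exactly the hard case raised above.

```python
from collections import defaultdict


def find_redundancy(entries):
    """entries maps entry name -> sequence.

    Returns (exact, contained): groups of entries with identical sequences,
    and (shorter, longer) pairs where one sequence lies inside another.
    """
    by_sequence = defaultdict(list)
    for name, seq in entries.items():
        by_sequence[seq.lower()].append(name)

    exact = [sorted(names) for names in by_sequence.values() if len(names) > 1]

    contained = []
    unique = list(by_sequence)
    for short in unique:
        for long_ in unique:
            if len(short) < len(long_) and short in long_:
                contained.append((by_sequence[short][0], by_sequence[long_][0]))
    return exact, contained


# Hypothetical toy entries.
toy = {"E1": "ACGTACGTAA", "E2": "acgtacgtaa", "E3": "GTACG", "E4": "TTTTT"}
exact, contained = find_redundancy(toy)
print(exact)
print(contained)
```

Finding near-duplicates would require an alignment-based comparison with a similarity threshold rather than exact string matching.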

21
Variations in duplicates
  • It is often impossible to decide whether a
    difference between two duplicates is due to
  • Polymorphism.
  • Sequencing error.
  • True gene duplication.
  • And what to do when annotations differ or are
    even contradictory?

22
Protein sequence databases
  • Translation of Coding DNA Sequences (CDS) from
    EMBL/GenBank/DDBJ.
  • Consultation of publications or patents.
  • Very small number of direct protein sequence
    submissions by authors.
  • In SWISS-PROT and PIR: additional annotations.
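Translating a CDS is a table lookup over successive codons. The sketch below builds the standard genetic code from its usual compact 64-letter form and translates up to the first stop codon. The standard code is an assumption (some organisms and organelles use variants), and the nucleotide string in the example is made up for illustration rather than taken from a databank entry.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up CDS fragment, for illustration only.
print(translate("ATGTTTGCTAAACGATTCAAAACCTCTTAA"))
```

Databanks such as GenPept and TrEMBL are produced this way, which is why they inherit whatever errors the CDS annotations contain.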

23
SWISS-PROT
  • Created by Amos Bairoch in 1986 at the Department
    of Medical Biochemistry in Geneva.
  • Maintained by the Swiss Institute of
    Bioinformatics (SIB) and funded by GeneBio, and,
    very recently, by NIH.
  • Web server
  • http://www.expasy.ch/sprot/sprot-top.html

24
SWISS-PROT characteristics
  • Almost no redundancy.
  • Cross-references with 60 other databanks.
  • High-quality annotations
  • Systematic control by a team of annotators.
  • Help from a set of more than 200 volunteer experts.
  • Embedded in ExPASy, a WWW proteomics server
    (http://www.expasy.org).

25
Annotations
  • Protein function.
  • Post-translational modifications.
  • Structural or functional domains.
  • Secondary and quaternary structures.
  • Similarities with other proteins.
  • Conflicts between positions for CDS.
  • Disease-related mutations

26
Associated databanks
  • TrEMBL, built using only annotated CDS from the
    EMBL data library.
  • ENZYME, for the international enzyme
    nomenclature.
  • PROSITE, for biologically significant sites,
    patterns and profiles.
  • SWISS-2DPAGE, for two-dimensional polyacrylamide
    gel electrophoresis maps.

27
PIR
  • PIR (The Protein Information Resource) was
    created by Margaret Dayhoff in 1965.
  • Aims
  • To provide exhaustive and non-redundant protein
    sequence data.
  • To give a classification using taxonomic and
    similarity data
  • entries grouped in super-families, families
  • and subfamilies.

28
Data maintenance
  • Three organizations collect and organize the data
    entered in PIR
  • The National Biomedical Research Foundation
    (NBRF) in the United States.
  • The Martinsried Institute for Protein Sequence
    (MIPS) in Germany.
  • The Japan International Protein Sequence
    Information Database (JIPID) in Japan.

29
Results
  • Coverage is no better than that obtained with
    SWISS-PROT + TrEMBL.
  • Still contains redundancy.
  • Less comprehensive annotation.
  • Low number of cross-references.
  • PIR has recently joined forces with EBI and SIB
    to establish UniProt (United Protein
    Databases), the central resource for protein
    sequence and function.

30
Specialized databanks
  • A lot of specialized databanks have been
    developed, which are devoted to
  • Complete genomes.
  • Families of homologous genes.
  • Non-sequence data.
  • These systems are under the responsibility of
    curators
  • Data quality and homogeneity control.

31
Complete genomes
  • There is a large number of databanks devoted to
    specific organisms.
  • These banks are associated with sequencing or
    mapping projects.
  • For some model organisms there are often several
    competing systems.

32
Examples
33
Gene family databanks
  • Built with automated procedures
  • Similarity search between sets of proteins
    (BLASTP, FASTP, Smith-Waterman).
  • Clustering into homologous families using
    similarity criteria.
  • Include various data
  • Protein (and sometimes nucleotide) sequences.
  • Multiple sequence alignments and trees.
  • Taxonomy.
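The clustering step described above can be sketched as grouping proteins that are linked by sufficiently similar pairs. The example below uses single-linkage clustering with a union-find structure; this is an assumed, simplified criterion, since the slides do not say which clustering rule each databank applies, and the protein names and scores are hypothetical.

```python
def cluster_families(pairs, threshold):
    """Group proteins into families by single linkage.

    pairs: iterable of (protein_a, protein_b, similarity) tuples, as would
    come out of an all-against-all similarity search.
    Two proteins end up in the same family if a chain of pairs with
    similarity >= threshold connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)
        if similarity >= threshold and root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in families.values())


# Hypothetical similarity scores between five proteins.
pairs = [
    ("P1", "P2", 0.82),
    ("P2", "P3", 0.55),
    ("P3", "P4", 0.20),
    ("P4", "P5", 0.91),
]
print(cluster_families(pairs, threshold=0.5))
```

Single linkage is easy to compute but can chain unrelated proteins together through shared domains, so real systems usually add constraints such as a minimum alignment coverage.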

34
ProtFam
  • Developed at MIPS.
  • Built with PIR sequences.
  • Includes four levels of classification
  • Superfamilies (based on function and similarity
    criteria).
  • Families (50% similarity).
  • Subfamilies (80% similarity).
  • Entries (95% similarity).

35
ProtFam characteristics
  • Allows visualization of alignments and dendrograms
    for the families.
  • Integrates Pfam domains.
  • Allows users to classify their own protein
    sequences.
  • Web server
  • http://mips.gsf.de

36
ProtoMap
  • Initially developed at the Hebrew University of
    Jerusalem; now hosted at Cornell University.
  • Built with SWISS-PROT + TrEMBL sequences.
  • Combines 3 sequence similarity measures (BLASTP,
    FASTA and Smith-Waterman).

37
ProtoMap characteristics
  • Alignments and trees are visualized with Java
    applets.
  • Users can submit sequences and classify them.
  • Web server
  • http://protomap.cornell.edu/index.html

38
Specialized systems
  • HOVERGEN (Homologous Vertebrate Genes Database)
  • Based on GenBank CDS.
  • HOBACGEN (Homologous Bacterial Genes Database)
    for prokaryotes and yeast
  • Based on SWISS-PROT/TrEMBL.
  • HOBACGEN-CG for completely sequenced genomes
  • Based on SWISS-PROT/TrEMBL.

39
Other specialized systems
  • COG (Clusters of Orthologous Groups), also for
    complete genomes
  • Based on GenBank CDS.
  • NuReBase (Nuclear Receptors Database) for
    mammalian nuclear receptors
  • Based on EMBL CDS.
  • RTKdb (Tyrosine Kinase Receptors)
  • Based on EMBL CDS.

40
Are COGs real orthologs?
[Figure: glutamate synthase large subunit in Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae, and Synechocystis sp.]
41
Beyond protein families
ProtFam, HOVERGEN, HOBACGEN, and COGs gather protein
sequences that are homologous over their whole length.
Patterns, profiles, and domains are covered in
Terry Attwood's lecture.
42
HOBACGEN
  • Integrates protein and nucleotide sequences as
    well as multiple alignments and trees.
  • Is based upon a client/server architecture.
  • Client software is distributed as well as the
    server structure (including all sequences).
  • Web server
  • http://pbil.univ-lyon1.fr/databases/hobacgen.html

43
Similarity search
SWISS-PROT/TrEMBL sequences
44
Segments selection
45
Families assembly
46
Alignments and trees
Rooting by mid-point
47
Domains and Families
Proteins can be made of very different sets of
domains
48
Site, Motif, Domain
  • Simple motifs
  • Patterns (PROSITE).
  • Complex motifs
  • Fingerprints: series of aligned motifs (PRINTS).
  • Ungapped alignments of segments (BLOCKS).
  • Alignments of whole domains
  • Profiles (PROSITE).
  • HMMs (Pfam).
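A pattern in the PROSITE sense is close to a regular expression, so scanning a sequence for one mostly comes down to a change of notation. The sketch below converts the common elements of the PROSITE pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for sequence ends) into a Python regular expression. It is a simplified converter that ignores rarer constructs, and the pattern and test sequence are illustrative only.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    elements = []
    for element in pattern.rstrip(".").split("-"):
        # {ED} means "any residue except E or D".
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "between 2 and 4 repeats".
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        elements.append(element)
    return "".join(elements)


# Illustrative pattern written in PROSITE syntax.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(regex)

# Made-up sequence containing one occurrence of the pattern.
sequence = "MKT" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(regex, sequence)
print(match.start() if match else None)
```

Profiles and HMMs cannot be reduced to regular expressions in this way, because they score every position instead of accepting or rejecting it.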
49
ProDom: defining domain structure
6PG1_YEAST
6PGD_CANAL
6PGD_SOYBN
6PG2_BACSU
O32911_MYCLR
P95165_MYCTU
6PGD_CERCA
Q40311_MEDSA
Y770_MYCTU
Y229_SYNY3
ProDom domains for the 6PGD family
50
InterPro
InterPro unifies PROSITE, PRINTS, Profile,
ProDom, Pfam, SMART, and TIGRFam.
http://www.ebi.ac.uk/interpro
51
An InterPro entry
Accession   IPR001425
Name        Bacterial rhodopsin
Type        Family
Dates       08-OCT-1999 (created)
            28-FEB-2000 (last modified)
Signatures  PROSITE PS00327 BACTERIAL_OPSIN_RET
            PROSITE PS00950 BACTERIAL_OPSIN_1
            PRINTS BACTRLOPSIN
            PFAM PF01036 Bac_rhodopsin
Abstract    The bacterial opsins are retinal-binding proteins that
            provide light-dependent ion transport and sensory
            functions to a family of halophilic bacteria [1, 2].
            They are integral membrane proteins believed to contain
            seven transmembrane (TM) domains, the last of which
            contains the attachment point for retinal (a conserved
            lysine). ...
Examples    Q48315 BACH_HALHP Halorhodopsin
            Q53496 BACR_HALSR Cruxrhodopsin
            P15647 BACH_NATPH
            P96787 BAC3_HALSD Archaearhodopsin
            ...
52
Non-sequence data
53
Sequence Data retrieval
  • Made mainly through Internet access
  • With client software (e.g., Entrez, HobacFetch).
  • By remote connections to servers providing
    on-line access to the banks (INFOBIOGEN).
  • Using World-Wide Web servers and browsers

54
Advantages and limitations
  • Users do not have to cope with the usual
    database problems
  • Storing of large amounts of data.
  • Daily updates.
  • Software upgrades.
  • Simplicity of use.
  • Net access is sometimes very slow at peak hours
  • consider using other servers besides NCBI

55
The ACNUC retrieval system
  • Direct access to functional regions described in
    feature tables (CDS, tRNA, rRNA).
  • Selection of entries using various criteria
  • Sequence names and accession numbers.
  • Bibliographic criteria.
  • Keywords.
  • Taxonomy.
  • Organelle.
  • Developed at Lyon University
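Selecting entries on several criteria at once amounts to filtering annotated records and combining the conditions. The sketch below illustrates the idea on two in-memory records; it does not reproduce ACNUC's actual query language or storage, and the Entry class, the select function, and the second record are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    accession: str
    species: str
    keywords: set = field(default_factory=set)
    lineage: tuple = ()


def select(entries, keyword=None, taxon=None):
    """Return the entries that satisfy every criterion that was given."""
    hits = []
    for entry in entries:
        if keyword is not None:
            if keyword.lower() not in {k.lower() for k in entry.keywords}:
                continue
        if taxon is not None:
            # A taxon matches the species itself or any rank above it.
            if taxon not in entry.lineage + (entry.species,):
                continue
        hits.append(entry)
    return hits


entries = [
    Entry("BSAMYL", "V00101", "Bacillus subtilis",
          {"amylase", "signal peptide"},
          ("Bacteria", "Firmicutes", "Bacillus")),
    # Hypothetical record, for illustration only.
    Entry("EXAMPLE2", "X00000", "Homo sapiens",
          {"insulin", "signal peptide"},
          ("Eukaryota", "Metazoa", "Mammalia")),
]

print([e.name for e in select(entries, keyword="amylase", taxon="Bacteria")])
print([e.name for e in select(entries, keyword="signal peptide")])
```

A real retrieval system answers such queries from prebuilt indexes rather than by scanning every entry, which is what makes taxonomy and keyword browsing fast on a full databank.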

56
ACNUC possible accesses
  • Graphical interface distributed along with the
    databases themselves.
  • http://pbil.univ-lyon1.fr/databases/acnuc.html
  • Web access at Pôle Bio-Informatique Lyonnais
    (PBIL)
  • http://pbil.univ-lyon1.fr/search/query.html

57
ACNUC characteristics
  • Can query any bank in PIR, SWISS-PROT,
    EMBL, or GenBank format.
  • Keywords and species browsing.
  • Complex queries.
  • Links with sequence analysis programs on the Web
    server (alignment, codon usage).

59
The Query form
60
Building queries to the sequence databases
65
Retrieving sequences
Locally save the received sequence data.
66
Browsing the species trees
69
HOVERGEN Families of homologous vertebrate genes
70
Access to family members
Download tree or alignment
72
SRS
  • Public version developed at EMBL by Etzold and
    Argos (1993).
  • Presently available on the different Web servers
    belonging to EMBnet
  • EBI (England).
  • INFOBIOGEN (France).
  • DKFZ (Germany).

73
Characteristics
  • Database index built with the use of ODD (Object
    Design and Definition).
  • More than 250 databanks have been indexed and are
    accessible through 35 SRS servers.
  • Allows queries to operate simultaneously on
    different banks.

74
Databanks interconnection
75
Entrez
  • Developed by Schuler et al. (1996) at NCBI.
  • Allows querying of several US-made databases
  • GenBank, GenPept, NR, MMDB, MEDLINE.
  • Access through client software (Unix, Mac or
    Windows) or Web server
  • http://www.ncbi.nlm.nih.gov

76
Characteristics
  • Introduces the concept of neighbours between
    sequences, references and structures.
  • Sequence neighbours are established using
    similarity criteria.
  • No access to multiple alignments.

77
NAR 2003 database issue
http://nar.oupjournals.org/content/vol31/issue1/