Title: Sequence databases and retrieval systems
1- Sequence databases and retrieval systems
- Guy Perrière
- Pôle Bioinformatique Lyonnais
- Laboratoire de Biométrie et Biologie Évolutive
- UMR CNRS n° 5558
- Université Claude Bernard Lyon 1
2In the beginning
- First paper compilation in 1965 (Atlas of Protein Sequence and Structure).
- Development of real databanks at the beginning of the 80s:
- Fast access.
- Makes possible analyses that require a lot of data:
- Codon usage.
- Molecular phylogeny.
3General databanks
- Nucleotide sequences
- EMBL/GenBank/DDBJ.
- Protein sequences
- Simple translations of coding regions
- GenPept.
- TrEMBL.
- Systems containing specific data
- SWISS-PROT.
- PIR.
4EMBL
- Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.
- Maintained since 1994 at the European Bioinformatics Institute (EBI) in Cambridge.
- Web server:
- http://www.ebi.ac.uk/embl
5GenBank
- Set up in 1979 at the Los Alamos National Laboratory (LANL) in Los Alamos.
- Maintained since 1992 at the National Center for Biotechnology Information (NCBI) in Bethesda.
- Web server:
- http://www.ncbi.nlm.nih.gov/Genbank/index.html
6DDBJ
- Started its activities in 1984 at the National Institute of Genetics (NIG) in Mishima.
- Maintained at this institute ever since by the team of Takashi Gojobori.
- Web server:
- http://www.ddbj.nig.ac.jp
7Nucleotide sequences
- Data mainly provided by direct submissions from the authors.
- Submissions are made through the Internet:
- Web forms.
- Email.
- The sequences are exchanged between the three centers on a daily basis:
- The content of the banks is identical.
8Data growth
[Figure: data growth of EMBL, GenBank, NBRF/PIR and SWISS-PROT; axis: log(number of residues)]
9GenBank size (May 2001)
- 12.7 × 10^9 nucleotides.
- 11.8 × 10^6 sequences.
- 690,323 genes (proteins and RNA).
- 242,616 bibliographic references.
- 47.8 gigabytes on disc.
- Growth of 280% in 12 months.
- 24-36 h to download the whole GenBank files from NCBI.
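As a rough check of these figures, the Python sketch below derives the mean entry length and the sustained network throughput implied by a 24-36 h download. It assumes that 1 gigabyte means 10^9 bytes, which the slide does not specify; the function name is illustrative.

```python
# Figures quoted for GenBank in May 2001.
nucleotides = 12.7e9
sequences = 11.8e6
size_bytes = 47.8e9          # assumption: 1 gigabyte = 10**9 bytes

# Mean length of an entry, in nucleotides.
mean_length = nucleotides / sequences

def throughput_mbit_s(n_bytes, hours):
    """Sustained rate (Mbit/s) needed to move n_bytes in the given time."""
    return n_bytes * 8 / (hours * 3600) / 1e6

fast = throughput_mbit_s(size_bytes, 24)   # download finished in 24 h
slow = throughput_mbit_s(size_bytes, 36)   # download finished in 36 h

print(f"mean entry length: {mean_length:.0f} nt")
print(f"implied throughput: {slow:.1f} to {fast:.1f} Mbit/s")
```

Under that assumption, the average entry is a little over 1,000 nucleotides long and the quoted download time corresponds to a few Mbit/s sustained.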
10Taxonomic sampling
- There are 72,000 species for which at least one sequence is available.
- Nine species (0.01%) correspond to 85% of the total.
- 18,000 species are represented by only one sequence!
[Figure: the nine most represented species in GenBank]
11Distribution format
- The banks are distributed as a set of text files (215 for GenBank).
- A file contains sequences corresponding to:
- A given taxon (e.g., bacteria, invertebrates, mammals).
- A given class of sequences (EST, HTG, GSS).
- Inside a file, each sequence is called an entry.
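In EMBL and GenBank flat files, each entry ends with a line containing only "//". A minimal Python sketch of splitting such a file into entries follows; the two entries and their identifiers are made up for illustration, and `iter_entries` is a hypothetical helper, not part of any distributed tool.

```python
def iter_entries(lines):
    """Yield each entry as a list of lines, without the '//' terminator."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        elif line.strip():            # ignore blank lines
            entry.append(line)
    if entry:                         # tolerate a missing final terminator
        yield entry

# Two shortened, made-up entries (identifiers are illustrative only).
flat_file = """ID   ENTRY1
SQ   Sequence 4 BP;
     acgt
//
ID   ENTRY2
SQ   Sequence 2 BP;
     tt
//
"""
entries = list(iter_entries(flat_file.splitlines()))
print(len(entries), [e[0] for e in entries])
```

Reading entry by entry through a generator avoids loading a whole distribution file in memory, which matters given the file sizes involved.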
12Entries structure
- Information is entered into structured fields.
- The format is different between EMBL and GenBank/DDBJ.
- The data entered in the three databanks are identical.
13ID, AC, SV and DT fields
Contain identifiers as well as the creation and last modification dates of the entries.
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
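Because every line of an EMBL entry starts with a two-letter code, the fields can be collected with very little code. The sketch below is one possible way to do it in Python, not the parser used by any of the retrieval systems; `parse_line_codes` is a hypothetical name, and the sample text is the entry shown on this slide.

```python
from collections import defaultdict

def parse_line_codes(entry_text):
    """Group the lines of an EMBL-style entry by their two-letter code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, _, value = line.partition(" ")
        if len(code) != 2 or code == "XX":   # skip spacers and odd lines
            continue
        fields[code].append(value.strip())
    return dict(fields)

entry = """ID   BSAMYL standard DNA PRO 2680 BP.
XX
AC   V00101 J01547
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)"""

fields = parse_line_codes(entry)
# The SV line combines the accession number and the sequence version.
accession, version = fields["SV"][0].split(".")
print(fields["AC"], accession, version, len(fields["DT"]))
```

Codes that occur several times, such as DT, are kept as lists so that the creation and last-update dates both remain available.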
14DE, KW, OS and OC fields
Contain general information on sequences (definition, keywords, taxonomy).
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
15RN, RX, RA and RT fields
Contain information related to bibliographic references.
RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi, A., Henner, D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
16FT field
Contains the descriptions of functional regions, detailed by qualifiers.
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 3 (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 3"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG ...
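The feature table lends itself to a simple line-oriented parse: a line with a key and a location opens a feature, and the following lines starting with "/" attach qualifiers to it. The Python sketch below assumes qualifiers are written /name="value" and deliberately ignores qualifiers that continue over several lines (such as /translation); `parse_features` is a hypothetical helper.

```python
import re

def parse_features(ft_lines):
    """Return a list of (key, location, qualifiers) from EMBL 'FT' lines."""
    features = []
    for line in ft_lines:
        body = line[2:].strip()              # drop the 'FT' line code
        if body.startswith("/"):             # qualifier of the current feature
            match = re.match(r'/(\w+)="?([^"]*)"?', body)
            if match and features:
                features[-1][2][match.group(1)] = match.group(2)
        else:                                # new feature: key and location
            parts = body.split(None, 1)
            if len(parts) == 2:
                features.append((parts[0], parts[1], {}))
    return features

ft = [
    'FT   promoter        369..374',
    'FT   RBS             414..419',
    'FT   CDS             498..2480',
    'FT                   /gene="amyE"',
    'FT                   /product="alpha-amylase precursor"',
    'FT                   /EC_number="3.2.1.1"',
]
features = parse_features(ft)
cds = [f for f in features if f[0] == "CDS"][0]
start, end = (int(x) for x in cds[1].split(".."))
cds_length = end - start + 1                 # positions are inclusive
print(cds[2]["gene"], cds_length)
```

This kind of parse is what gives retrieval systems direct access to functional regions such as CDS, tRNA or rRNA.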
17Intron/exon structure
FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...
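A join() location lists the exons of a CDS as 1-based, inclusive intervals on the entry sequence. The following Python sketch shows how such a location could be parsed and the exons concatenated; it handles only plain intervals (no complement() or remote references), and the genomic sequence is a placeholder, since the real one is not shown on the slide.

```python
import re

def parse_join(location):
    """Parse 'join(a..b,c..d,...)' or 'a..b' into 1-based inclusive spans."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def splice(sequence, spans):
    """Concatenate the exons; spans are 1-based and inclusive."""
    return "".join(sequence[start - 1:end] for start, end in spans)

spans = parse_join("join(242..610,3397..3542,5100..5351)")
exon_lengths = [end - start + 1 for start, end in spans]

# Placeholder genomic sequence: only its length matters for this demo.
genomic = "n" * 6000
coding = splice(genomic, spans)
print(spans, exon_lengths, len(coding))
```

The conversion from 1-based inclusive coordinates to Python slices (`start - 1:end`) is the step most prone to off-by-one errors.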
18SQ field
Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
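The SQ header declares the sequence length and base composition, which gives a cheap integrity check when reading an entry. A minimal Python sketch, with hypothetical helper names, is given below; it works whether or not the header fields are separated by semicolons.

```python
import re

def parse_sq_header(header):
    """Extract the declared length and base counts from an 'SQ' line."""
    length = int(re.search(r"(\d+)\s+BP", header).group(1))
    counts = {base.lower(): int(n)
              for n, base in re.findall(r"(\d+)\s+([ACGT]|other)\b", header)}
    return length, counts

def read_sequence(lines):
    """Join sequence lines, dropping blanks and the trailing position number."""
    return "".join(re.sub(r"[\s\d]", "", line) for line in lines)

header = "SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;"
length, declared = parse_sq_header(header)
# The declared base counts should add up to the declared length.
print(length, sum(declared.values()))

first_line = ["gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat 60"]
seq = read_sequence(first_line)
print(len(seq))
```

With the complete sequence, the same check can be extended by counting each base and comparing the result with the declared composition.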
19Errors in databanks
- There are a lot of errors in the nucleotide sequence databanks.
- For the annotations, the free submission of entries involves:
- Inaccuracies, omissions, and even mistakes.
- Inconsistencies between some fields.
- In the sequences themselves:
- Sequencing errors.
- Compression, gel reading.
- Cloning vectors inserted.
20Redundancy
- Another major problem is redundancy.
- A lot of entries are partially or entirely duplicated:
- 20% of vertebrate sequences in GenBank.
- Duplicated entries are often different in their sequence.
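To make the notions of entire and partial duplication concrete, here is a deliberately naive Python sketch. It only detects exact identity or exact containment and lists mismatches between equal-length sequences; real duplicates that differ by insertions or deletions would need an alignment. Function names and the toy sequences are illustrative.

```python
def compare_entries(seq_a, seq_b):
    """Classify two sequences as identical, one contained in the other, or distinct."""
    a, b = seq_a.lower(), seq_b.lower()
    if a == b:
        return "identical"
    if a in b or b in a:
        return "contained"
    return "distinct"

def mismatches(seq_a, seq_b):
    """Positions (1-based) where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]

# Toy sequences, not taken from any databank.
print(compare_entries("acgtacgt", "ACGTACGT"))
print(compare_entries("gtac", "acgtacgt"))
print(mismatches("acgtacgt", "acgaacgt"))
```

A list of mismatch positions says where two duplicates differ, but not why they differ.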
21Variations in duplicates
- It is often impossible to decide whether a difference between two duplicates is due to:
- Polymorphism.
- Sequencing error.
- True gene duplication.
- And what to do when annotations are different or even contradictory?
22Protein sequences
- Translation of coding DNA sequences (CDS) from EMBL/GenBank/DDBJ.
- Consultation of publications or patents.
- Small number of direct protein submissions by authors.
- Integration of specific annotations only for the true protein sequence databanks.
23SWISS-PROT
- Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva.
- Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio.
- Web server:
- http://www.expasy.ch/sprot/sprot-top.html
24SWISS-PROT characteristics
- Almost no redundancies.
- Cross-references with 55 other databanks.
- High-quality annotations
- Systematic control by a team of annotators.
- Help from a set of 204 volunteer experts.
- Embedded in a complete environment devoted to
proteins.
25Annotations
- Protein function.
- Post-translational modifications.
- Structural or functional domains.
- Secondary and quaternary structures.
- Similarities with other proteins.
- Conflicts between positions for CDS.
26Associated databanks
- TrEMBL, built using only annotated CDS from EMBL.
- ENZYME, for the international enzyme nomenclature.
- PROSITE, for biologically significant sites, patterns and profiles.
- SWISS-2DPAGE, for two-dimensional polyacrylamide gel electrophoresis maps.
27PIR
- PIR (the Protein Information Resource) was created by M.O. Dayhoff in 1965.
- Aims:
- To provide exhaustive and non-redundant protein data.
- To give a classification using taxonomic and similarity data:
- Grouping of entries into superfamilies, families and subfamilies.
28Data maintenance
- Three organizations collect and organize the data entered in PIR:
- The National Biomedical Research Foundation (NBRF) in the United States.
- The Martinsried Institute for Protein Sequences (MIPS) in Germany.
- The Japan International Protein Sequence Information Database (JIPID) in Japan.
29Results
- Coverage is no more exhaustive than that obtained with SWISS-PROT+TrEMBL.
- Still contains a lot of redundancies.
- Lower quality for the annotations.
- Low number of cross-references.
30Specialized databanks
- A lot of specialized databanks have been developed, which are devoted to:
- Complete genomes.
- Families of homologous genes.
- Non-sequence data.
- These systems are under the responsibility of curators:
- Data quality and homogeneity control.
31Complete genomes
- There is a large number of databanks devoted to particular organisms.
- These banks are associated with sequencing or mapping projects.
- For some model organisms there are often several competing systems.
32Examples
33Gene families databanks
- Built using automated procedures:
- Similarity search between sets of proteins (BLASTP, FASTP, Smith-Waterman).
- Clustering into homologous families using similarity criteria.
- Include various data:
- Protein (and sometimes nucleotide) sequences.
- Multiple alignments and trees.
- Taxonomy.
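The clustering step can be illustrated with single-linkage grouping: two proteins end up in the same family as soon as a chain of sufficiently similar pairs connects them. The Python sketch below uses a union-find structure for this; it is only one possible similarity criterion, not necessarily the one applied by the databanks listed here, and the protein names, scores and threshold are invented.

```python
def cluster_families(pairs, threshold):
    """Group proteins linked by at least one pair scoring >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)    # registers both proteins
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    families = {}
    for protein in parent:
        families.setdefault(find(protein), set()).add(protein)
    return sorted(families.values(), key=lambda fam: sorted(fam))

# Invented pairwise similarity scores (percent), e.g. from a BLASTP run.
pairs = [("P1", "P2", 62.0), ("P2", "P3", 55.0),
         ("P3", "P4", 20.0), ("P4", "P5", 71.0)]
families = cluster_families(pairs, threshold=50.0)
print(families)
```

Single linkage is simple but can chain unrelated proteins together through multi-domain sequences, which is one reason why practical procedures add further criteria such as the length of the matching segments.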
34 ProtFam
- Developed at MIPS.
- Built with PIR sequences.
- Includes four levels of classification:
- Superfamilies (based on function and similarity criteria).
- Families (50% similarity).
- Subfamilies (80% similarity).
- Entries (95% similarity).
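The three numeric levels can be read as nested thresholds. The Python sketch below maps a pairwise similarity to the finest level two sequences would share; treating each threshold as "greater than or equal" is an assumption, and the superfamily level is left out because it also relies on functional criteria. The function name is hypothetical.

```python
# Thresholds quoted on the slide, from most to least stringent.
LEVELS = [(95.0, "entry"), (80.0, "subfamily"), (50.0, "family")]

def finest_shared_level(similarity_percent):
    """Return the most specific level shared at this similarity, or None."""
    for threshold, level in LEVELS:
        if similarity_percent >= threshold:   # assumption: inclusive bound
            return level
    return None

for value in (97.0, 85.0, 60.0, 30.0):
    print(value, finest_shared_level(value))
```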
35ProtFam characteristics
- Allows visualization of alignments and dendrograms for the families.
- Integrates Pfam domains.
- Allows users to classify their own protein sequences.
- Web server:
- http://www.mips.biochem.mpg.de
36ProtoMap
- Developed at the Department of Biological Chemistry of The Hebrew University of Jerusalem.
- Built with SWISS-PROT sequences.
- Uses three similarity measures for sequences
(BLASTP, FASTP and Smith-Waterman).
37ProtoMap characteristics
- Alignments and trees are visualized with Java applets.
- Users can submit their own sequences in order to classify them.
- No domain data, but low-similarity relationships can be visualized.
- Web server:
- http://www.protomap.cs.huji.ac.il
38Specialized systems
- HOVERGEN (Homologous Vertebrate Genes Database) for vertebrates:
- Based on GenBank CDS.
- HOBACGEN (Homologous Bacterial Genes Database) for prokaryotes and yeast:
- Based on SWISS-PROT/TrEMBL.
- HOBACGEN-CG for completely sequenced genomes:
- Based on SWISS-PROT/TrEMBL.
39Other specialized systems
- COG (Clusters of Orthologous Groups), also for complete genomes:
- Based on GenBank CDS.
- NuReBase (Nuclear Receptors Database) for mammalian nuclear receptors:
- Based on EMBL CDS.
- RTKdb (Tyrosine Kinase Receptors):
- Based on EMBL CDS.
40Are COGs real orthologs?
[Figure: glutamate synthase large subunit; species shown: Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae, Synechocystis sp.]
41HOBACGEN
- Integrates protein and nucleic acid sequences as well as multiple alignments and trees.
- Is based upon a client/server architecture.
- Client software is distributed as well as the server structure (including all sequences).
- Web server:
- http://pbil.univ-lyon1.fr/databases/hobacgen.html
42Similarities search
SWISS-PROT/TrEMBL sequences
43Segments selection
44Families assembling
45Alignments and trees
Rooting by mid-point
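Mid-point rooting places the root halfway along the longest leaf-to-leaf path of an unrooted tree. The Python sketch below is a generic illustration of the idea, not the code used to build the trees shown here; the tree is stored as a hypothetical adjacency dictionary, and the leaf names and branch lengths are invented.

```python
def add_branch(tree, u, v, length):
    """Record an undirected branch u-v of the given length."""
    tree.setdefault(u, {})[v] = length
    tree.setdefault(v, {})[u] = length

def farthest_leaf(tree, start):
    """Return (leaf, distance, path) for the leaf farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, prev, dist, path = stack.pop()
        if len(tree[node]) == 1 and dist > best[1]:
            best = (node, dist, path)
        for neighbour, length in tree[node].items():
            if neighbour != prev:
                stack.append((neighbour, node, dist + length, path + [neighbour]))
    return best

def midpoint_root(tree):
    """Return (u, v, offset): the root lies on branch u-v, offset away from u."""
    any_leaf = next(n for n in tree if len(tree[n]) == 1)
    leaf_a, _, _ = farthest_leaf(tree, any_leaf)
    _, diameter, path = farthest_leaf(tree, leaf_a)   # longest leaf-to-leaf path
    half, walked = diameter / 2, 0.0
    for u, v in zip(path, path[1:]):
        length = tree[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length

# Invented unrooted tree: leaves A-D, internal nodes X and Y.
tree = {}
for u, v, length in [("A", "X", 1.0), ("B", "X", 3.0), ("X", "Y", 2.0),
                     ("C", "Y", 1.0), ("D", "Y", 6.0)]:
    add_branch(tree, u, v, length)

root = midpoint_root(tree)
print(root)
```

In this toy tree the longest path runs from B to D (length 11), so the root falls on the branch leading to D, 5.5 units from that leaf. The method assumes roughly equal evolutionary rates in the most divergent lineages.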
46Domain structure
[Figure: ProDom domains for the 6PGD family; sequences shown: 6PG1_YEAST, 6PGD_CANAL, 6PGD_SOYBN, 6PG2_BACSU, O32911_MYCLR, P95165_MYCTU, 6PGD_CERCA, Q40311_MEDSA, Y770_MYCTU, Y229_SYNY3]
47Examples
- Domains (Blocks, Domo, Pfam, ProDom, SBASE).
- Sites, patterns and profiles (PRINTS, PROSITE).
- The InterPro databank gathers all available data on domains and patterns with a known biological function.
48Non-sequence data
49Data retrieval
- Made mainly through Internet access:
- With client software (e.g., Entrez, HobacFetch).
- By remote connections to servers providing on-line access to the banks (INFOBIOGEN).
- Using World-Wide Web servers and browsers (Netscape, Internet Explorer, Lynx, etc.).
50Advantages and limitations
- Users do not have to cope with the usual database problems:
- Storing of large amounts of data.
- Daily updates.
- Software upgrades.
- Simplicity of use.
- Net access is sometimes very slow at peak hours:
- Use your local servers instead of NCBI!
51Retrieval systems
- Direct access to functional regions described in the features (CDS, tRNA, rRNA).
- Selection of entries using various criteria:
- Sequence names and accession numbers.
- Bibliographic references.
- Keywords.
- Taxonomy.
- Publication date.
- Organelle, host.
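Selecting entries on several criteria at once amounts to combining per-field tests. The Python sketch below shows the principle on an in-memory list; it does not reproduce the query language of any of the systems described here. The `select` helper and the second record are hypothetical, and the first record reuses values from the example entry shown earlier.

```python
def select(entries, **criteria):
    """Keep the entries whose fields match every given criterion.

    A criterion value may be a plain value (equality test) or a
    function taking the field value and returning True or False.
    """
    def matches(entry):
        for field, wanted in criteria.items():
            value = entry.get(field)
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True
    return [entry for entry in entries if matches(entry)]

entries = [
    {"name": "BSAMYL", "species": "Bacillus subtilis",
     "keywords": {"amylase", "signal peptide"}, "year": 1983},
    # Made-up record, for illustration only.
    {"name": "EXAMPLE2", "species": "Homo sapiens",
     "keywords": {"insulin"}, "year": 1996},
]

amylases = select(entries, species="Bacillus subtilis",
                  keywords=lambda kw: "amylase" in kw)
recent = select(entries, year=lambda y: y >= 1990)
print([e["name"] for e in amylases], [e["name"] for e in recent])
```

All criteria are combined with a logical AND; real retrieval systems also offer OR and NOT, and rely on precomputed indexes rather than a linear scan.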
52Query
- Developed at the Laboratoire de Biométrie et Biologie Évolutive by Gouy et al. (1985).
- Graphical interface distributed along with the databases themselves.
- Web access at Pôle Bioinformatique Lyonnais (PBIL):
- http://pbil.univ-lyon1.fr/search/query.html
53Characteristics
- Allows querying of any bank in PIR, SWISS-PROT, EMBL, or GenBank/DDBJ formats.
- Keywords and species browsing.
- Complex queries.
- Links with sequence analysis programs on the Web
server (alignment, codon usage).
54SRS
- Public version developed at EMBL by Etzold and Argos (1993).
- Presently available on the different Web servers belonging to EMBnet:
- INFOBIOGEN (France).
- EBI (England).
- DKFZ (Germany).
55Characteristics
- Database index built using ODD (Object Design and Definition).
- More than 250 databanks have been indexed and are accessible through 35 SRS servers.
- Allows queries on different banks simultaneously.
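Querying several banks at once becomes straightforward when all of them feed a common index. The Python sketch below builds a toy inverted index over two banks; it illustrates the general idea only and makes no claim about how SRS or ODD work internally. The function names are hypothetical, and the entry texts are short descriptions loosely based on the example entry shown earlier.

```python
from collections import defaultdict

def build_index(banks):
    """Map each lowercased word to the set of (bank, entry_id) containing it."""
    index = defaultdict(set)
    for bank, entries in banks.items():
        for entry_id, text in entries.items():
            for word in text.lower().split():
                index[word.strip(".,;()")].add((bank, entry_id))
    return index

def query(index, *words):
    """Entries, from any bank, that contain all the given words."""
    hits = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*hits) if hits else set()

# Toy content; the description texts are illustrative.
banks = {
    "EMBL": {"V00101": "Bacillus subtilis amylase gene."},
    "SWISS-PROT": {"P00691": "Alpha-amylase precursor, Bacillus subtilis."},
}
index = build_index(banks)
both = query(index, "Bacillus", "subtilis")
only_embl = query(index, "amylase")
print(sorted(both), sorted(only_embl))
```

Because the index is built once, a query touches only the word lists involved, however many banks have been indexed.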
56Databanks interconnection
57Entrez
- Developed by Schuler et al. (1996) at NCBI.
- Only allows querying of databases that are made in the USA:
- GenBank, GenPept, NR, MMDB, MEDLINE.
- Access through client software (Unix, Mac or Windows) or Web server:
- http://www.ncbi.nlm.nih.gov
58Characteristics
- Introduces the concept of neighbours between sequences, references and structures.
- Sequence neighbours are established using similarity criteria.
- No access to multiple alignments.