1

Sequence databases and retrieval systems
Guy Perrière
replaced by Manolo Gouy
Pôle Bio-Informatique Lyonnais
Laboratoire de Biométrie et Biologie Évolutive
UMR CNRS n° 5558
Université Claude Bernard Lyon 1

2
In the beginning

First paper compilation in 1965 (Atlas of Protein
Sequences).
Development of real databanks at the beginning
of the 80s
Fast access.
Make possible analyses that require a lot of
data
Codon usage.
Molecular phylogeny.

3
General databanks

Nucleotide sequences
EMBL/GenBank/DDBJ.
Protein sequences
Simple translations of coding regions
GenPept (from GenBank).
TrEMBL (from EMBL).
Systems containing additional data
SWISS-PROT.
PIR.

4
EMBL

Created in 1980 at the European Molecular Biology
Laboratory in Heidelberg.
Maintained since 1994 at the European
Bioinformatics Institute (EBI) near Cambridge.
Web server
http://www.ebi.ac.uk/embl

5
GenBank

Set up in 1979 at the Los Alamos National
Laboratory in New Mexico, US.
Maintained since 1992 at the National Center for
Biotechnology Information (NCBI) in Bethesda.
Web server
http://www.ncbi.nlm.nih.gov/Genbank/index.html

6
DDBJ

Active since 1984 at the National Institute of
Genetics (NIG) in Mishima, Japan.
Web server
http://www.ddbj.nig.ac.jp

7
EMBL / GenBank / DDBJ

The International Nucleotide Sequence Database
Collaboration EMBL / GenBank / DDBJ
New sequences are exchanged daily between the
three centers
--> the three banks have identical content.
Data are mainly provided by direct submissions from
the authors over the Internet
Web forms.
Email.

8
Data growth
log (number of residues)
9
GenBank/EMBL size (April 2003)

31 × 10^9 nucleotides.
24 × 10^6 sequences.
1.8 million genes (proteins and RNA).
313,000 bibliographic references.
100 gigabytes on disk.
Growth of 63% in 12 months.
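
The 12-month growth figure can be turned into a doubling time. The short calculation below is only an illustration: it assumes the yearly rate stays constant (pure exponential growth), which the slide does not claim, and the function name is made up for this sketch.

```python
import math

def doubling_time_months(annual_growth: float) -> float:
    """Doubling time in months for a constant yearly growth rate (0.63 = 63%)."""
    years = math.log(2) / math.log(1 + annual_growth)
    return 12 * years

# With the April 2003 figure of 63% per year:
print(f"{doubling_time_months(0.63):.1f} months")
```

Under that assumption the banks double in size roughly every 17 months.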

10
Taxonomic sampling (April 2003)

There are 135,560 species for which at least one
sequence is available.
Nine species (0.007%) correspond to 62% of the
total.
77,900 species are represented by only one
sequence!

The nine most represented species in GenBank/EMBL
11
Distribution format

The banks are distributed as a set of text files
called divisions (292 for EMBL).
A division contains sequences related to
A taxon (e.g., bacteria, invertebrates, mammals).
A class of sequences (EST, HTG, GSS).
Within a division, each sequence is called an
entry.
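
Because a division is a plain text file, splitting it into entries only requires finding the entry terminator, a line containing just "//" in both EMBL and GenBank formats. The sketch below assumes that convention; the function name and the two shortened entries are invented for illustration and are not real database records.

```python
from typing import Iterable, Iterator, List

def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per entry; a '//' line terminates an entry."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a missing final terminator
        yield entry

# Two made-up, heavily shortened entries.
sample = """ID   ENTRY1
SQ   Sequence 4 BP;
     acgt
//
ID   ENTRY2
SQ   Sequence 2 BP;
     gg
//
""".splitlines()

entries = list(split_entries(sample))
print(len(entries), [e[0] for e in entries])
```

Reading a real division this way works line by line, so the whole file never has to be held in memory.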

12
Entry structure

Information is introduced in structured fields.
The format differs in its form between EMBL and
GenBank/DDBJ
but not in substance.

13
ID, AC, SV and DT fields
Contain identifiers and the creation and last
modification dates for the entries.
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
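
Since every line of an EMBL entry starts with a two-letter code, the fields can be collected with a few lines of code. This is a minimal sketch, assuming the usual EMBL layout (code in columns 1-2, data from column 6, items separated by semicolons); the function name is hypothetical.

```python
from collections import defaultdict

def group_by_line_code(entry_text: str) -> dict:
    """Map each two-letter line code to the list of its data strings."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code.strip() and code != "XX":  # XX lines are empty spacers
            fields[code].append(data)
    return dict(fields)

entry = """\
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
"""

fields = group_by_line_code(entry)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
print(accessions)          # primary accession first, then secondary ones
print(len(fields["DT"]))   # one DT line for creation, one for last update
```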
14
DE, KW, OS and OC fields
Definition, Keywords, Taxonomy.
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
The NCBI maintains a unified taxonomy, largely
based on sequence information.
15
RN, RX, RA and RT fields
Contain bibliographic information.
RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi A., Henner D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
16
FT field
Contains the descriptions of functional regions.
     Key             Location and qualifiers
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 3 (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 3"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG ...
17
Intron/exon structure
FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...
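
A join() location lists the exons that must be concatenated to obtain the coding sequence. The sketch below parses such a location and splices a sequence accordingly. It is deliberately limited: it handles only plain "a..b" ranges and join(), not complement(), partial-end markers, or references to other entries, and the function names and toy sequence are made up.

```python
import re

def parse_location(location: str):
    """Return a list of (start, end) pairs from 'a..b' or 'join(a..b,c..d)'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def splice(sequence: str, exons):
    """Concatenate exon segments; coordinates are 1-based and inclusive."""
    return "".join(sequence[start - 1:end] for start, end in exons)

# The location shown on the slide.
exons = parse_location("join(242..610,3397..3542,5100..5351)")
print(exons)
print(sum(end - start + 1 for start, end in exons), "nucleotides in the joined exons")

# Toy example: exons in upper case, everything else in lower case.
toy = "aaCCCttGGa"
print(splice(toy, parse_location("join(3..5,8..9)")))
```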
18
SQ field
Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
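
The SQ header line carries its own consistency check: the per-base counts should add up to the announced length. A minimal sketch of that check follows, together with a helper that strips the blanks and position counters from a sequence line. Both function names are hypothetical.

```python
import re

def parse_sq_header(header: str) -> dict:
    """Extract the length and the per-base counts from an SQ line."""
    length = int(re.search(r"Sequence\s+(\d+)\s+BP", header).group(1))
    counts = {base: int(n)
              for n, base in re.findall(r"(\d+)\s+(A|C|G|T|other)\b", header)}
    return {"length": length, "counts": counts}

def clean_sequence_line(line: str) -> str:
    """Drop blanks and the trailing position number from a sequence line."""
    return re.sub(r"[\s\d]", "", line)

info = parse_sq_header("SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;")
print(info["counts"])
print(sum(info["counts"].values()) == info["length"])  # counts match the length

print(clean_sequence_line("     gctcatgccg agaatagaca        20"))
```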
19
Errors in databanks

There are a lot of errors in the nucleotide
sequence databanks
In annotations
Inaccuracies, omissions, and even mistakes.
Inconsistencies between entries.
In the sequences themselves
Sequencing errors.
Cloning vectors inserted.

20
Redundancy

Another major problem is redundancy.
A lot of entries are partially or entirely
duplicated
20% of vertebrate sequences in GenBank.
Duplicated entries are often different in their
sequence.

21
Variations in duplicates

It is often impossible to decide whether a
difference between two duplicates is due to
Polymorphism.
Sequencing error.
True gene duplication.
And what to do when annotations differ or are
even contradictory?

22
Protein sequence databases

Translation of Coding DNA Sequences (CDS) from
EMBL/GenBank/DDBJ.
Consultation of publications or patents.
Very small number of direct protein sequence
submissions by authors.
In SWISS-PROT and PIR: additional annotations.

23
SWISS-PROT

Created by Amos Bairoch in 1986 at the Department
of Medical Biochemistry in Geneva.
Maintained by the Swiss Institute of
Bioinformatics (SIB) and funded by GeneBio, and,
very recently, by NIH.
Web server
http://www.expasy.ch/sprot/sprot-top.html

24
SWISS-PROT characteristics

Almost no redundancy.
Cross-references with 60 other databanks.
High-quality annotations
Systematic control by a team of annotators.
Help from a set of > 200 volunteer experts.
Embedded in ExPASy, a WWW proteomics server
(http://www.expasy.org).

25
Annotations

Protein function.
Post-translational modifications.
Structural or functional domains.
Secondary and quaternary structures.
Similarities with other proteins.
Conflicts between positions for CDS.
Disease-related mutations

26
Associated databanks

TrEMBL, built using only annotated CDS from the
EMBL data library.
ENZYME, for the international enzyme
nomenclature.
PROSITE, for biologically significant sites,
patterns and profiles.
SWISS-2DPAGE, for two-dimensional polyacrylamide
gel electrophoresis maps.

27
PIR

PIR (The Protein Information Resource) was
created by Margaret Dayhoff in 1965.
Aims
To provide exhaustive and non-redundant protein
sequence data.
To give a classification using taxonomic and
similarity data
entries grouped in super-families, families
and subfamilies.

28
Data maintenance

Three organizations collect and organize the data
introduced in PIR
The National Biomedical Research Foundation
(NBRF) in the United States.
The Martinsried Institute for Protein Sequences
(MIPS) in Germany.
The Japan International Protein Sequence
Information Database (JIPID) in Japan.

29
Results

Coverage is no better than what is
obtained with SWISS-PROT + TrEMBL.
Still contains redundancy.
Less comprehensive annotation.
Low number of cross-references.
PIR has recently joined forces with EBI and SIB
to establish UniProt (United Protein
Databases), the central resource for protein
sequence and function.

30
Specialized databanks

A lot of specialized databanks have been
developed, which are devoted to
Complete genomes.
Families of homologous genes.
Non-sequence data.
These systems are under the responsibility of
curators
Data quality and homogeneity control.

31
Complete genomes

There is a large number of databanks devoted to
specific organisms.
These banks are associated with sequencing or
mapping projects.
For some model organisms there are often several
concurrent systems.

32
Examples
33
Gene family databanks

Built with automated procedures
Similarity search between sets of proteins
(BLASTP, FASTP, Smith-Waterman).
Clustering into homologous families using
similarity criteria.
Include various data
Protein (and sometimes nucleotide) sequences.
Multiple sequence alignments and trees.
Taxonomy.
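
The two automated steps (pairwise similarity search, then clustering) can be sketched as single-linkage clustering over the list of hits. This is an assumption made for illustration: the slides do not say which clustering rule each databank uses, and real systems typically add further criteria such as the fraction of the sequence covered by the match. Protein names and scores are invented.

```python
def cluster_families(proteins, hits, threshold):
    """Single-linkage clustering: link two proteins if similarity >= threshold."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, similarity in hits:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return sorted(families.values(), key=lambda fam: sorted(fam))

# Hypothetical output of an all-against-all similarity search.
proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [("P1", "P2", 0.82), ("P2", "P3", 0.55),
        ("P3", "P4", 0.20), ("P4", "P5", 0.91)]

families = cluster_families(proteins, hits, threshold=0.50)
print(families)
```

Note that P1 and P3 end up in the same family without a direct hit between them; this chaining is the main weakness of single linkage.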

34
ProtFam

Developed at MIPS.
Built with PIR sequences.
Includes four levels of classification
Superfamilies (based on function and similarity
criteria).
Families (50% similarity).
Subfamilies (80% similarity).
Entries (95% similarity).

35
ProtFam characteristics

Allows visualization of alignments and dendrograms
for the families.
Integrates Pfam domains.
Allows users to classify their own protein
sequences.
Web server
http://mips.gsf.de

36
ProtoMap

Initially developed at the Hebrew University of
Jerusalem, now hosted at Cornell University.
Built with SWISS-PROT + TrEMBL sequences.
Combines three sequence similarity measures (BLASTP,
FASTA and Smith-Waterman).

37
ProtoMap characteristics

Alignments and trees are visualized with Java
applets.
Users can submit sequences and classify them.
Web server
http://protomap.cornell.edu/index.html

38
Specialized systems

HOVERGEN (Homologous Vertebrate Genes Database)
Based on GenBank CDS.
HOBACGEN (Homologous Bacterial Genes Database)
for prokaryotes and yeast
Based on SWISS-PROT/TrEMBL.
HOBACGEN-CG for completely sequenced genomes
Based on SWISS-PROT/TrEMBL.

39
Other specialized systems

COG (Clusters of Orthologous Groups), also for
complete genomes
Based on GenBank CDS.
NuReBase (Nuclear Receptors Database) for
mammalian nuclear receptors
Based on EMBL CDS.
RTKdb (Tyrosine Kinase Receptors)
Based on EMBL CDS.

40
Are COGs real orthologs?
Figure (not transcribed): glutamate synthase large subunit in
Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa,
Vibrio cholerae and Synechocystis sp.
41
Beyond protein families
ProtFam, HOVERGEN, HOBACGEN and COGs gather protein
sequences that are homologous over their whole length.
Patterns, profiles and domains are
covered in Terry Attwood's lecture.
42
HOBACGEN

Integrates protein and nucleotide sequences as
well as multiple alignments and trees.
Is based upon a client/server architecture.
Client software is distributed as well as the
server structure (including all sequences).
Web server
http://pbil.univ-lyon1.fr/databases/hobacgen.html

43
Similarity search
SWISS-PROT/TrEMBL sequences
44
Segments selection
45
Families assembly
46
Alignments and trees
Rooting by mid-point
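
Mid-point rooting places the root halfway along the longest path between two leaves. The sketch below illustrates the idea on a small unrooted tree stored as a weighted adjacency dictionary. It is not the HOBACGEN implementation: the data structure, function names and branch lengths are all assumptions made for this example.

```python
def add_edge(tree, u, v, length):
    """Store an undirected branch of the given length."""
    tree.setdefault(u, {})[v] = length
    tree.setdefault(v, {})[u] = length

def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, length in tree[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best

def midpoint(tree):
    """Return (u, v, offset): the root lies on branch u-v, offset away from u."""
    end_a, _, _ = farthest(tree, next(iter(tree)))
    _, diameter, path = farthest(tree, end_a)   # longest leaf-to-leaf path
    half, walked = diameter / 2.0, 0.0
    for u, v in zip(path, path[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]
    raise ValueError("tree has no branches")

# Toy tree: leaves A, B, C, D; internal nodes X, Y.
tree = {}
add_edge(tree, "A", "X", 1.0)
add_edge(tree, "B", "X", 3.0)
add_edge(tree, "X", "Y", 1.0)
add_edge(tree, "C", "Y", 2.0)
add_edge(tree, "D", "Y", 6.0)

root = midpoint(tree)
print(root)
```

In this toy tree the longest path runs from B to D (length 10), so the root falls on the branch leading to D, 5 units from D and 1 unit from Y.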
47
Domains and Families
Proteins can be made of very different sets of
domains
48
Site, Motif, Domain
Simple motifs
Patterns (PROSITE)
Alignments of whole domains
Profiles (PROSITE)
HMM (Pfam)
Fingerprint: series of aligned motifs (PRINTS)
Complex motifs
Ungapped alignment of segments (BLOCKS)
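
A PROSITE pattern is close to a regular expression, which makes the "simple motif" case easy to illustrate. The sketch below translates the core of the PROSITE pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors) into a Python regular expression and tries it on the N-glycosylation site pattern N-{P}-[ST]-{P}. The function name and the test sequences are invented, and rarer syntax such as anchors inside brackets is not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into an equivalent Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count: x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)

glyco = prosite_to_regex("N-{P}-[ST]-{P}")
print(glyco)
print(bool(re.search(glyco, "AANGSAA")))   # made-up sequence containing N-G-S-A
print(bool(re.search(glyco, "AANPSAA")))   # proline in second position: no match
```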
49
ProDom: defining domain structure
6PG1_YEAST
6PGD_CANAL
6PGD_SOYBN
6PG2_BACSU
O32911_MYCLR
P95165_MYCTU
6PGD_CERCA
Q40311_MEDSA
Y770_MYCTU
Y229_SYNY3
ProDom domains for the 6PGD family
50
InterPro
InterPro unifies PROSITE, PRINTS, Profile,
ProDom, Pfam, SMART, and TIGRFam.
http://www.ebi.ac.uk/interpro
51
An InterPro entry
Accession   IPR001425
Name        Bacterial rhodopsin
Type        Family
Dates       08-OCT-1999 (created)
            28-FEB-2000 (last modified)
Signatures  PROSITE PS00327 BACTERIAL_OPSIN_RET
            PROSITE PS00950 BACTERIAL_OPSIN_1
            PRINTS  BACTRLOPSIN
            PFAM    PF01036 Bac_rhodopsin
Abstract    The bacterial opsins are retinal-binding proteins that provide
            light-dependent ion transport and sensory functions to a family
            of halophilic bacteria [1, 2]. They are integral membrane
            proteins believed to contain seven transmembrane (TM) domains,
            the last of which contains the attachment point for retinal
            (a conserved lysine). ...
Examples    Q48315 BACH_HALHP Halorhodopsin
            Q53496 BACR_HALSR Cruxrhodopsin
            P15647 BACH_NATPH
            P96787 BAC3_HALSD Archaearhodopsin
            ...
52
Non-sequence data
53
Sequence Data retrieval

Made mainly through Internet access
With client software (e.g., Entrez, HobacFetch).
By remote connections to servers providing
on-line access to the banks (INFOBIOGEN).
Using World-Wide Web servers and browsers

54
Advantages and limitations

Users do not have to cope with the usual
database problems
Storing of large amounts of data.
Daily updates.
Software upgrades.
Simplicity of use.
Net access is sometimes very slow at peak hours:
consider using other servers besides NCBI.

55
The ACNUC retrieval system

Direct access to functional regions described in
feature tables (CDS, tRNA, rRNA).
Selection of entries using various criteria
Sequence names and accession numbers.
Bibliographic criteria.
Keywords.
Taxonomy.
Organelle.
Developed at Lyon University

56
ACNUC possible accesses

Graphical interface distributed along with the
databases themselves.
http://pbil.univ-lyon1.fr/databases/acnuc.html
Web access at Pôle Bio-Informatique Lyonnais
(PBIL)
http://pbil.univ-lyon1.fr/search/query.html

57
ACNUC characteristics

Can query any bank in PIR, SWISS-PROT,
EMBL, or GenBank formats.
Keywords and species browsing.
Complex queries.
Links with sequence analysis programs on the Web
server (alignment, codon usage).
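
A complex query combines simple selection criteria (species, keyword, and so on) with logical operators. The sketch below shows the principle with Python sets over a tiny index. It does not use ACNUC's own query syntax or data structures: the index, the select() helper and every entry name other than BSAMYL are invented for illustration.

```python
# Hypothetical inverted index: (criterion, value) -> set of entry names.
index = {
    ("species", "bacillus subtilis"): {"BSAMYL", "BSXYZ1"},
    ("keyword", "amylase"): {"BSAMYL", "HSAMY1"},
    ("keyword", "signal peptide"): {"BSAMYL", "HSINS"},
}

def select(criterion: str, value: str) -> set:
    """Return the entries matching one simple criterion (empty set if none)."""
    return index.get((criterion, value.lower()), set())

# AND: amylase entries from Bacillus subtilis.
both = select("species", "Bacillus subtilis") & select("keyword", "amylase")

# AND NOT: amylase entries from any other species.
others = select("keyword", "amylase") - select("species", "Bacillus subtilis")

# OR: entries annotated with either keyword.
either = select("keyword", "amylase") | select("keyword", "signal peptide")

print(both, others, sorted(either))
```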

59
The Query form
60
Building queries to the sequence databases
65
Retrieving sequences
Locally save the received sequence data.
66
Browsing the species trees
69
HOVERGEN: Families of homologous vertebrate genes
70
Access to family members
Download tree or alignment
72
SRS

Public version developed at EMBL by Etzold and
Argos (1993).
Presently available on the different Web servers
belonging to EMBnet
EBI (England).
INFOBIOGEN (France).
DKFZ (Germany).

73
Characteristics

Database index built with the use of ODD (Object
Design and Definition).
More than 250 databanks have been indexed and are
accessible through 35 SRS servers.
Allows queries to operate simultaneously on
different banks.

74
Databanks interconnection
75
Entrez

Developed by Schuler et al. (1996) at NCBI.
Allows querying of several US-made databases:
GenBank, GenPept, NR, MMDB, MEDLINE.
Access through client software (Unix, Mac or
Windows) or Web server
http://www.ncbi.nlm.nih.gov
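
Besides the client software and the Web pages, NCBI exposes Entrez to programs through its E-utilities, which the slides do not cover. As a hedged illustration, the sketch below only builds an EFetch request URL for the accession number used in the entry examples. It does not send the request, and the helper name is made up.

```python
from urllib.parse import urlencode

# Base address of the NCBI EFetch utility.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build (but do not send) an EFetch request for one flat-file record."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

url = efetch_url("V00101")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. NCBI's usage guidelines on request rates apply to automated access.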

76
Characteristics

Introduces the concept of neighbours between
sequences, references and structures.
Sequence neighbours are established using
similarity criteria.
No access to multiple alignments.

77
NAR 2003 database issue
http://nar.oupjournals.org/content/vol31/issue1/
