Essential Bioinformatics and Biocomputing LSM2104: Section I Lecture 3: More biological databases, r - PowerPoint PPT Presentation

Provided by: dbs7

1
Essential Bioinformatics and Biocomputing (LSM2104: Section I)
Lecture 3: More biological databases, retrieval systems and database searching
2
Biological databases: Function and pathway databases - KEGG

The KEGG database (http://www.genome.ad.jp/kegg/kegg2.html) links genetic information with cellular functions. It provides keyword searches and pre-calculated sequence comparison searches.
It consists of several interconnected databases:
PATHWAY: information on metabolic and regulatory networks.
GENES: information on genes and proteins.
LIGAND: information on chemical compounds and reactions involved in cellular processes.
EXPRESSION and BRITE: microarray gene expression data.
SSDB: helps identify protein-coding genes.
It has an integrated database retrieval system, DBGET.

3
KEGG web-page
4
KEGG Human TCA Pathway
5
Biological databases: BIND (Biomolecular Interaction Network Database) http://www.bind.ca/

Stores descriptions of interactions, molecular complexes and pathways.
Provides search tools.
PreBIND locates literature sources:
"Show me a list of all of the papers in PubMed that are about my protein of interest. Then classify all of these papers and tell me which ones are likely to contain interaction information. Finally, identify all of the other proteins mentioned in these papers and indicate whether these proteins might interact with my protein of interest."
Bader GD, et al. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001; 29(1): 242-5.

6
Biological databases: BIND (Biomolecular Interaction Network Database) http://www.bind.ca/

3. Its BLAST tool searches the BIND database for similarity to a query sequence.
BIND is at the forefront of proteomics efforts and is expected to grow as large-scale proteomic data become available.
Bader GD, et al. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001; 29(1): 242-5.

7
BIND website
8
BIND Statistics

Database                         Record Count
Interaction Database             11255
Biomolecular Pathway Database    8
Molecular Complex Database       851
Organisms represented            12
GI Database                      4651
DI Database                      0
Publication Database             428

9
Protein family/domain databases: Sequence alignment - Pfam (http://www.sanger.ac.uk/Software/Pfam/)

Pfam is a collection of multiple protein sequence alignments and statistical models that can be used to classify protein families and domains.
It provides descriptions of protein domains.
Given an established SWISS-PROT sequence, Pfam shows the pre-computed domain structure of the protein.
Given a completely new protein sequence, Pfam computes a domain structure.

10
Pfam Web-site
11
Protein family/domain databases: Sequence patterns - PROSITE (http://ca.expasy.org/prosite/)

PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
It currently contains patterns and profiles specific for more than a thousand protein families or domains.
An example of a pattern (motif):
W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P
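
To make the pattern notation concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. It assumes the standard PROSITE syntax, in which residue sets are written in square brackets (the example above, with brackets, reads W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P). The function name and the test sequence are illustrative and are not part of the lecture or of any PROSITE tool.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (9,11).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."  # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..} lists excluded residues
        # [..] sets and single letters are already valid regex syntax.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


example = "W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P"
regex = prosite_to_regex(example)

# A made-up sequence constructed to satisfy the pattern (not a real protein).
toy_sequence = "W" + "A" * 9 + "VF" + "A" * 6 + "GGF" + "AA" + "P"
found = re.search(regex, toy_sequence) is not None
```

Scanning a new sequence with such an expression is a pattern search in its simplest form; it reports only exact agreement with the pattern and gives no measure of statistical significance.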

12
Protein sequence motif databases - PROSITE

A profile is a matrix derived from multiple sequence alignments.
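
As a rough illustration of "a matrix derived from multiple alignments", the sketch below builds a per-column residue frequency table from a toy alignment. This is a simplification: real PROSITE profiles use weighted, position-specific scores and gap penalties, which are not modelled here. The alignment and the function name are invented for the example.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


# Toy alignment of four made-up sequences; '-' marks a gap.
toy_alignment = ["WKVF", "WRVY", "WKIF", "W-VF"]
profile = column_frequencies(toy_alignment)
```

Each row of the resulting table describes one alignment position, so conserved positions (here the first column, always W) stand out from variable ones.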

13
PROSITE web-site
14
PROSITE web-site
15
Biological data retrieval systems: Entrez (http://www.ncbi.nlm.nih.gov/Database/index.html)

1. Entrez is a retrieval system for searching a number of interconnected databases at the NCBI. It provides access to:
PubMed: the biomedical literature (Medline)
GenBank: nucleotide sequence database
Protein: sequence database
Structure: three-dimensional macromolecular structures
Genome: complete genome assemblies
PopSet: population study data sets
OMIM: Online Mendelian Inheritance in Man
Taxonomy: organisms in GenBank
Books: online books
ProbeSet: gene expression and microarray datasets
3D Domains: domains from Entrez Structure
UniSTS: markers and mapping data
SNP: single nucleotide polymorphisms
CDD: conserved domains
2. Entrez allows users to perform various searches.
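
Entrez searches can also be issued programmatically. The sketch below only builds a search URL for NCBI's E-utilities ESearch service; E-utilities is not mentioned in the lecture, so treat the endpoint, the `db`, `term` and `retmax` parameters, and the example query as assumptions to check against the current NCBI documentation. No network request is made.

```python
from urllib.parse import urlencode

# Assumed ESearch endpoint of the NCBI E-utilities (verify before use).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build an ESearch URL for a keyword query against one Entrez database."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_BASE + "?" + urlencode(params)


# Illustrative query: protein records mentioning interleukin 10 in human.
url = build_esearch_url("protein", "interleukin 10 AND Homo sapiens[Organism]")
```

Restricting a term to a field such as `[Organism]` is one way to narrow a keyword search so that fewer irrelevant records are returned.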

16
Entrez Interface
17
Biological data retrieval systems: SRS (http://srs.ebi.ac.uk/)

SRS is a retrieval system for searching several linked databases at the EBI. Like Entrez, it provides access to various databases and supports keyword, sequence similarity and class searches.

18
SRS interface
19
Biological databases: Database searching

Database searching can be used to answer questions such as:
What is the sequence of human IL-10?
What is the gene coding for human IL-10?
Is the function of human IL-10 known? What is it?
Are there any variants of human IL-10?
Who sequenced this gene?
What are the differences between IL-10 in human and in other species?
Which species are known to have IL-10?
Is the structure of IL-10 known?
What are the structural and functional domains of IL-10?
Are there any motifs in the sequence that explain its properties?
What is the upstream region of IL-10 that contains transcriptional regulation sites?

20
Biological databases: Database searching

For a well-studied molecule such as IL-10, we expect to extract many of the well-known facts. These searches are also useful for characterizing newly identified sequences.
Notes:
Multiple errors can be found in database entries. Some of these errors are introduced when sequences are submitted to databases. Some errors are due to naming conventions (or the lack of them). Some errors are due to poor links between databases.
Users should treat data extracted from databases with care and compare the results with information from other databases, journal articles, and other sources.

21
Biological databases: Keyword searching

Search DNA and protein databases with keywords (10-July-2002)

22
Biological databases: Keyword searching

Notes:
GenPept is the protein translation of GenBank. SPTR is SWISS-PROT plus the protein translation of EMBL sequences.
Different databases contain different, but overlapping, sets of entries. The same sequence may have entries in different databases.
Some databases have non-redundant sections. For example, the UniGene system automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters.
For complete results, it is usually necessary to search multiple databases.
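
When results from several databases are combined, the same sequence typically appears more than once. A minimal sketch of one way to collapse such overlap is to group records by their sequence. The record layout, database labels and accession strings below are hypothetical placeholders, not real entries.

```python
from collections import defaultdict


def merge_by_sequence(records):
    """Group (database, accession, sequence) records that share a sequence."""
    groups = defaultdict(list)
    for database, accession, sequence in records:
        # Normalise case so 'acd' and 'ACD' count as the same sequence.
        groups[sequence.upper()].append((database, accession))
    return dict(groups)


# Hypothetical search results from three databases.
records = [
    ("GenPept", "ACC_0001", "MKTLLVA"),
    ("SPTR", "ACC_0002", "mktllva"),  # same sequence, different entry
    ("PIR", "ACC_0003", "MKTLIVA"),   # differs by one residue
]
merged = merge_by_sequence(records)
```

Exact-match grouping is deliberately strict: entries that differ by a single residue stay separate, which is useful for spotting the kind of discrepancies between databases that need manual checking.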

23
Biological databases: Database coverage

Example: Scorpion KALIOTOXIN 2 (SWISS-PROT: P45696)

24
Biological databases: Database errors

Our scorpion study (Srinivasan et al., 2002) also revealed numerous errors and missing data in the major databases. For one of the entries, the sequence in the journal publication contained an error, while the sequence in the databases was correct.

25
Biological Databases: Sequence Similarity Searching

Proteins that have similar sequences often have similar structures and similar functions. If we have only a protein sequence, we can infer its structural and functional properties by analyzing similar sequences.
Sequence-structure-function relationship: similar sequence, similar structure, similar function.
Why does this relationship hold?
Evolution: involves sequence variation.
Laws of physics and chemistry: define the sequence-structure relationship.
Function, as defined by molecular interactions: structure-based.
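
One simple way to put a number on "similar sequence" is percent identity over an existing alignment. The function below is an illustrative sketch under that assumption; it is not the scoring scheme used by BLAST or other search tools, and the toy sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned, non-gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two made-up aligned fragments that differ at one of six positions.
identity = percent_identity("ACDEFG", "ACDQFG")
```

Percent identity says nothing about whether the similarity is statistically meaningful, which is why search tools report additional scores.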

26
Database Searching: Cautionary Notes

Some database matches, in keyword and sequence similarity searches alike, arise from chance similarities. Distinguishing chance matches from biologically significant matches is one of the most important issues for effective use of biological databases.
We searched GenBank with the sequence similarity tool BLAST (set to find short, nearly exact matches), using the names of last semester's lecturers of this module as query sequences. The search returned two imperfect matches to VLADIMIR and seven perfect matches to TINWEE.
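
A back-of-the-envelope estimate shows why short queries produce chance hits. The sketch below assumes 20 equally frequent amino acids, independent positions, and a hypothetical database of 100 million residues; none of these figures come from the lecture, and the statistics BLAST actually reports are more elaborate.

```python
def expected_exact_matches(word: str, database_residues: int,
                           alphabet_size: int = 20) -> float:
    """Expected number of exact occurrences of `word` in random sequence."""
    # Number of positions at which the word could start.
    positions = max(database_residues - len(word) + 1, 0)
    # Probability that one position matches every letter of the word.
    per_position = (1 / alphabet_size) ** len(word)
    return positions * per_position


ASSUMED_DATABASE_SIZE = 100_000_000  # hypothetical residue count

expected_tinwee = expected_exact_matches("TINWEE", ASSUMED_DATABASE_SIZE)
expected_vladimir = expected_exact_matches("VLADIMIR", ASSUMED_DATABASE_SIZE)
```

Under these assumptions a six-letter word is expected to occur more than once by chance, while an eight-letter word is expected far less than once, which is consistent with perfect matches being found for TINWEE but only imperfect ones for VLADIMIR.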

27
Database Searching: Cautionary Notes
28
Database Searching: Cautionary Notes

If we blindly interpreted these results, we would erroneously conclude that the motif VLADIMIR may have some importance for the structure or function of the strawberry vein banding virus, and that TINWEE has to do with calcium-dependent protein kinase in rice and possibly in Legionella pneumophila.
We can avoid conclusions like this by looking at the similarity scores. This will be covered in more detail later in the course; for now it is important to know that the lower the expected value (E-value), the better the match. Anything close to or greater than 1 should be viewed with suspicion.
However, matches that are not statistically significant can sometimes still have biological significance. If we suspect that this might be the case, further analysis is necessary.
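
In practice this rule of thumb amounts to sorting hits by their expected value and setting the weak ones aside for closer inspection rather than discarding them. The sketch below assumes hits are available as (name, E-value) pairs; the hit names, E-values and the 0.01 cutoff are illustrative assumptions, not values from the lecture.

```python
def split_hits(hits, max_evalue: float = 0.01):
    """Separate hits into likely-significant and suspicious by E-value."""
    ranked = sorted(hits, key=lambda hit: hit[1])  # best (lowest) E-value first
    significant = [hit for hit in ranked if hit[1] <= max_evalue]
    suspicious = [hit for hit in ranked if hit[1] > max_evalue]
    return significant, suspicious


# Hypothetical search results: (hit name, expected value).
hits = [("hit_C", 2.4), ("hit_A", 1e-30), ("hit_B", 0.8), ("hit_D", 3e-5)]
significant, suspicious = split_hits(hits)
```

Keeping the suspicious list, instead of deleting it, leaves room for the further analysis needed when a statistically weak match might still be biologically meaningful.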

29
Database Searching: Cautionary Notes

Examples of chance matches: virtually any string or keyword can show matches to database entries. We are interested only in the real ones.
The same search with GenBank. Fortunately, we have statistical measures that indicate the quality of matches. However, matches with low statistical significance sometimes have real biological significance. More about this will be taught later in the course.

30
Biological databases: Concluding remarks

1. Biological databases are an invaluable resource in support of biological research.
2. We can learn much about a particular molecule by searching databases and using the available analysis tools.
3. A large number of databases are available for that task. Some databases are very general, some are more specialized, and some are very specialized. For best results we often need to access multiple databases.

31
Biological databases: Concluding remarks

4. The major types of databases covered in this course are general nucleotide, general protein, structure, pathway, molecular interaction, protein motif, publication, and specialized databases.
5. Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.
6. The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.

32
Biological databases: Concluding remarks

7. Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.
8. Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.
9. Formulating queries is a serious issue in biological databases. Often the quality of the results depends on the quality of the queries.

33
Biological databases: Concluding remarks

10. Statistical measures indicate the quality of matches. Often statistical and biological significance are related. Sometimes, however, matches of real biological significance have low statistical scores.
11. Access to biological databases is so important that today virtually every molecular biology project starts and ends with querying biological databases.

34
Biological databases: Summary of today's lecture